Scanning the knowledge (Wikipedia analysis)

Protocol

Introduction

This protocol aims to uncover the links between Wikipedia pages related to our topic, in order to reveal connections, clusters and alignments of themes around a specific area of knowledge. We also wanted to examine the behaviour of the authors of the main pages on our topic, highlighting the creation of content, the growth of the articles and the acts of vandalism, and whether there are recognisable patterns among authors and vandals.

First step

We picked the 5 Wikipedia pages that best represent our theme:
Deep Web, Tor (anonymity network), Silk Road, Internet privacy and Anonymous web browsing.
Then we chose, from those five initial seeds, the first-level "see also" links that we considered relevant:
Tor2web, The Hidden Wiki, Anonymous P2P, Crypto-anarchism, Freedom of Information, GlobaLeaks, Internet censorship, Internet privacy, Anonymity, Anonymous blogging, Anonymous web browsing, Information Policy, Internet censorship circumvention, Privacy law, Information privacy law, Data Protection Directive, Privacy laws of the United States, Surveillance, Computer and network surveillance, Mass surveillance, Mass surveillance in the United States, Data privacy, HTTP tunnel, Internet privacy, OpenVPN, Privacy software, Privacy-enhancing technologies, Onion routing, Agorism, Bitcoin protocol, Crypto-anarchism, Operation Web Tryp, The Hidden Wiki, War on Drugs and Sheep Marketplace.
We imported the list of selected URLs into Seealsology and launched the crawl with depth 3, then exported the GEXF file and opened it in Gephi with "graph type" set to "directed". After exporting a CSV file of the edges, we edited and standardized the Ids of duplicate nodes (differing in capitalization, brackets, etc.).
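As an illustration of this clean-up step, here is a minimal Python sketch, assuming the Seealsology export was saved as an edges.csv file with Source and Target columns (the file and column names are assumptions, not part of the original workflow). Bracket variants were merged by hand in the original work; the sketch only handles capitalization and spacing.

```python
import re
import pandas as pd

def normalize(title):
    """Collapse variants of the same page title (capitalization, spacing, underscores)."""
    title = title.strip().lower().replace("_", " ")
    return re.sub(r"\s+", " ", title)

edges = pd.read_csv("edges.csv")                 # hypothetical export of the Seealsology edges
edges["Source"] = edges["Source"].map(normalize)
edges["Target"] = edges["Target"].map(normalize)
edges = edges.drop_duplicates()                  # remove edges duplicated by the merge
edges.to_csv("edges_clean.csv", index=False)
```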
Then we imported the cleaned-up edges table into Gephi (in a new project), forcing the creation of new nodes. After calculating the Average Degree, all nodes with a Degree equal to 1 were eliminated and the remaining ones were resized according to their In-Degree. We spatialized the graph with Force Atlas 2 (LinLog mode, Scaling 0.35, Gravity 0.25) and then applied Prevent Overlap. Finally we had Gephi calculate Modularity and assigned colors to the nodes accordingly.
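The filtering and sizing done in Gephi could be approximated in code; the sketch below uses networkx on the cleaned edges table, with Louvain communities standing in for Gephi's Modularity (an approximation under those assumptions, not the exact Gephi computation; the layout itself was done in Gephi).

```python
import pandas as pd
import networkx as nx

edges = pd.read_csv("edges_clean.csv")                       # cleaned edges table from the previous step
G = nx.from_pandas_edgelist(edges, "Source", "Target",
                            create_using=nx.DiGraph)

# Drop nodes with total degree 1, as with the Gephi Degree filter
G.remove_nodes_from([n for n, d in dict(G.degree()).items() if d <= 1])

# Node size proxy: in-degree of the remaining nodes
size = dict(G.in_degree())

# Gephi's Modularity uses a Louvain-style algorithm; this is only an approximation
communities = nx.community.louvain_communities(G.to_undirected(), seed=1)
```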

Second step

We chose two main pages related to our subject, Deep Web and Tor (anonymity network), and scraped the whole history of both pages using Kimono and Wikipedia2geo. We then visualized the size of the two pages over time since their creation.
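The scrape itself was done with Kimono; purely as an illustration, a comparable extraction of each revision's timestamp, size, author and comment could be sketched with the MediaWiki API.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def revision_history(title):
    """Fetch the full revision history of a page (timestamp, size, user, comment)."""
    params = {"action": "query", "prop": "revisions", "titles": title,
              "rvprop": "timestamp|size|user|comment", "rvlimit": 500,
              "format": "json"}
    revisions = []
    while True:
        data = requests.get(API, params=params).json()
        page = next(iter(data["query"]["pages"].values()))
        revisions.extend(page.get("revisions", []))
        if "continue" not in data:
            break
        params.update(data["continue"])          # follow the API continuation token
    return revisions

history = revision_history("Tor (anonymity network)")
```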
We kept only the changes from the beginning of 2010 onward (the period in which Google Trends shows a significant increase in searches related to the topic) and eliminated all changes bigger than 900 bytes (vandalistic deletions and the reverts restoring them) and smaller than 50 bytes (insignificant changes of a few words).
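A minimal sketch of this filtering, assuming the scraped history has been saved to a CSV file with timestamp and size columns (the file and column names are assumptions):

```python
import pandas as pd

hist = pd.read_csv("tor_history.csv")                      # scraped revision history
hist["timestamp"] = pd.to_datetime(hist["timestamp"])
hist = hist.sort_values("timestamp")
hist["delta"] = hist["size"].diff()                        # change in bytes for each edit

hist = hist[hist["timestamp"] >= "2010-01-01"]             # keep edits from 2010 onward
hist = hist[hist["delta"].abs().between(50, 900)]          # drop big vandalism/reverts and tiny edits
```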

Third step

After calculating the changes to the pages in absolute value and creating a Pivot Table to count the changes and compute the total size of the changes for each author, we selected the top 50 authors (ranked by absolute change, regardless of sign) and eliminated bots from the list. Then, using Kimono, we scraped the history of changes made by these authors to Wikipedia articles back to January 2010. After that we eliminated all changes whose edit summary was Reverted or Undid and removed changes smaller than 50 bytes in absolute value (insignificant changes of a few words). We created an Edges Table connecting the authors (Source) to the pages they modified (Target), with the total bytes modified on each page as the Weight. We also created a Nodes Table assigning to each author the Wikipedia page from which the protocol started ("null" for the page nodes). We then created a network graph of the selected authors of both main pages and the pages they modified. First we eliminated the nodes of the Deep Web and Tor (anonymity network) pages. Then, after calculating the Average Degree, we eliminated the nodes with Degree less than or equal to 1, repeating the filter until the network was completely cleaned up. We resized the page nodes based on their In-Degree. Then, after showing only the author nodes through a Partition filter, we spatialized them with a Circular Layout, locked those nodes, standardized their size and removed the filter. Finally we spatialized the page nodes with Force Atlas 2 (LinLog mode, Scaling 0.35, Gravity 0.25) and applied Prevent Overlap.
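A sketch of how the Edges Table and the iterative Degree filter could be reproduced in code, assuming the scraped author histories are in an author_changes.csv file with author, page and bytes columns (all names are assumptions; the actual work was done in Excel and Gephi):

```python
import pandas as pd
import networkx as nx

changes = pd.read_csv("author_changes.csv")                # one row per edit
edges = (changes.assign(Weight=changes["bytes"].abs())
                .groupby(["author", "page"], as_index=False)["Weight"].sum()
                .rename(columns={"author": "Source", "page": "Target"}))

G = nx.from_pandas_edgelist(edges, "Source", "Target", edge_attr="Weight",
                            create_using=nx.DiGraph)
G.remove_nodes_from(["Deep Web", "Tor (anonymity network)"])  # drop the two main pages

# Repeat the Degree <= 1 filter until the network stops shrinking
while True:
    weak = [n for n, d in dict(G.degree()).items() if d <= 1]
    if not weak:
        break
    G.remove_nodes_from(weak)

page_size = dict(G.in_degree())                            # size page nodes by in-degree
```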

Fourth step

Starting from the output of the Second Step, we created a Pivot Table to calculate the number of changes and the total size of the changes for each author (this time not in absolute value). We selected the authors who had deleted 900 bytes or more (the vandals) and geolocalized every IPv4 address with GeoIP and every IPv6 address with IPv6Locator. Finally we created a Pivot Table to count how many vandals there are in each country and visualized their location and number using CartoDB.
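A minimal sketch of the vandal selection and per-country count, assuming the signed per-author totals and the geolocalized addresses have been saved to CSV files (all file and column names are assumptions; the actual work was done in Excel, GeoIP, IPv6Locator and CartoDB):

```python
import pandas as pd

totals = pd.read_csv("author_totals.csv")                  # author, total_bytes (signed)
vandals = totals[totals["total_bytes"] <= -900]            # net deletion of 900 bytes or more

geo = pd.read_csv("vandal_countries.csv")                  # author, country (from GeoIP / IPv6Locator)
per_country = (vandals.merge(geo, on="author")
                      .groupby("country").size()
                      .reset_index(name="vandals"))
per_country.to_csv("vandals_by_country.csv", index=False)  # table imported into CartoDB
```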

Metadata

Timestamp:
17/11/2014 - 26/11/2014

Data source:
Wikipedia

Tools:
Seealsology, Gephi, Microsoft Excel, Wikipedia2geo, Kimono, GeoIP, IPv6Locator, CartoDB