Wikipedia "See Also" analysis

A huge cloud of topics (about property)

Introduction

This graph shows the connections among the See Also links, followed down to the third level, of five Wikipedia pages chosen to represent the theme. The pages are: “Copyright”, “Legal aspects of copyright infringement”, “Copy protection”, “Recording Industry Association of America” and “File sharing”.

How to read the visualization

The visualization is made up of nodes, which are Wikipedia pages, and the links between them. A node’s size depends on its in-degree, that is, the number of pages that link directly to it (how many pages list it in their See Also section). The Modularity Class function automatically divides the pages into clusters based on the number of links they have in common. In this graph, each cluster can be described with a few keywords:


How it has been done

The graph is based on a dataset of the connections between Wikipedia pages.
After 5 pages related to the theme were chosen, their See Also sections were analysed and a list of the pertinent first-level links was created. From this new list, the second- and third-level See Also pages were then analysed.
In Gephi, the Force Atlas 2 layout was applied, which draws together the pages that share a larger number of links. Pages with only one connection to the rest of the graph were removed. To show the connections more clearly, the pages are divided into 5 clusters (computed with "Modularity Class", resolution: 3.8).
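
The collection and pruning steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the script used for the project: the SEE_ALSO dictionary is hypothetical toy data standing in for real See Also sections, and the function names crawl and drop_single_connection_pages are invented for this example. The pruning is done in a single pass, which is one possible reading of "only one connection".

```python
from collections import deque

# Hypothetical toy data: page title -> titles listed in its See Also section.
SEE_ALSO = {
    "Copyright": ["File sharing", "Copy protection"],
    "File sharing": ["Copyright", "Peer-to-peer"],
    "Copy protection": ["Copyright"],
    "Peer-to-peer": ["File sharing", "BitTorrent"],
    "BitTorrent": ["Magnet URI scheme"],
    "Magnet URI scheme": [],
}


def crawl(seeds, max_depth):
    """Collect directed (source, target) edges, following See Also links
    for max_depth levels starting from the seed pages."""
    edges = set()
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the last level are kept but not expanded
        for target in SEE_ALSO.get(page, []):
            edges.add((page, target))
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return edges


def drop_single_connection_pages(edges):
    """Remove, in one pass, every edge touching a page that has only one connection."""
    degree = {}
    for source, target in edges:
        degree[source] = degree.get(source, 0) + 1
        degree[target] = degree.get(target, 0) + 1
    return {(s, t) for s, t in edges if degree[s] > 1 and degree[t] > 1}


edges = crawl(["Copyright"], max_depth=3)
pruned = drop_single_connection_pages(edges)
```

The resulting edge set could then be written to a CSV file and imported into Gephi, where the layout and the Modularity Class clustering are applied.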

Findings

Reading the graph as a whole, it’s clear that there is a huge cloud of closely connected pages, with some small groups at the edges that are tied to the cloud by only a few links.

Within the denser area, it’s possible to see some clusters that represent sub-themes. The most evident ones lie between the concept of copyright and the laws against file sharing, which is also the center of this research. All the main concepts of the controversy can be found in this area: from the Stop Online Piracy Act (SOPA) to the Digital Millennium Copyright Act, from copyright to file sharing.

Privacy and online surveillance are themes close to each other but not at the center of the "cloud" (they are at the bottom). The empty space between these concepts and the previous cluster shows that the themes are connected, but not closely. For this reason, privacy and surveillance were not included in this analysis.





Looking at node size, it can be seen that some pages are bigger than others: they’re the keywords of the research.
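
Since node size follows in-degree, the biggest pages can also be listed by counting incoming See Also links. The snippet below is a minimal sketch of that count; the edge list is hypothetical toy data, not the project dataset, and the in_degree helper is a name invented for this example.

```python
from collections import Counter

# Hypothetical edge list: (page, page named in its See Also section).
edges = [
    ("File sharing", "Copyright"),
    ("Copy protection", "Copyright"),
    ("Stop Online Piracy Act", "Copyright"),
    ("Copyright", "File sharing"),
    ("Stop Online Piracy Act", "File sharing"),
    ("Copyright", "Copy protection"),
]


def in_degree(edge_list):
    """Count, for each page, how many distinct pages link to it."""
    return Counter(target for _source, target in set(edge_list))


# Pages sorted from the most to the least linked-to.
ranking = in_degree(edges).most_common()
```

With this toy data, "Copyright" comes first because three pages point to it.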




Metadata

Timestamp: 17/11/14

Data source: Wikipedia

Related Protocol

Download data (4MB)