Named entities connections

Protocol

Introduction

This protocol aims to identify the main named entities (the main actors) in the controversy, the relationships among them, and how influential each one is within this network of connections.

Scraping

Searches run on Google.com with the browser set to Incognito Mode.

Scraping of the first 200 web page results from Google, using the 4 queries:

  • “File Sharing” + Effects;
  • “File Sharing” + Consequences;
  • Piracy + Effects;
  • Piracy + Consequences.
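The four queries are the combinations of two topic terms with two aspect terms. A minimal sketch of how they could be generated is given here; the names TOPICS, ASPECTS and build_queries are hypothetical, and the " + " joiner only mirrors the notation of the list above (whether the plus sign was actually typed into Google is not stated in the protocol).

```python
from itertools import product

# Terms taken from the query list; the quotes keep "File Sharing" as an exact phrase.
TOPICS = ['"File Sharing"', "Piracy"]
ASPECTS = ["Effects", "Consequences"]


def build_queries(joiner=" + "):
    """Return every topic/aspect combination as a query string."""
    return [f"{topic}{joiner}{aspect}" for topic, aspect in product(TOPICS, ASPECTS)]


if __name__ == "__main__":
    for query in build_queries():
        print(query)
```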

URL extraction and cleaning, to exclude social network pages and non-text results.
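The cleaning step can be sketched as a simple filter over the extracted URLs. This is only an illustration: the domain list, the file extensions and the removal of duplicates are assumptions, since the protocol does not say which sites or file types were excluded.

```python
from urllib.parse import urlparse

# Hypothetical exclusion lists; the protocol does not name the sites or file types.
SOCIAL_DOMAINS = {"facebook.com", "twitter.com", "youtube.com", "linkedin.com"}
NON_TEXT_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".mp3", ".mp4")


def is_social(url):
    """True if the host is a listed social network or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in SOCIAL_DOMAINS)


def is_non_text(url):
    """True if the URL path ends with a listed media file extension."""
    return urlparse(url).path.lower().endswith(NON_TEXT_EXTENSIONS)


def clean_urls(urls):
    """Keep the first occurrence of each URL that passes both checks, in order."""
    seen = set()
    kept = []
    for url in urls:
        if url in seen or is_social(url) or is_non_text(url):
            continue
        seen.add(url)
        kept.append(url)
    return kept
```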

Text extraction from the results with Zup, and removal of empty or off-topic files.

Results divided into 3 categories: File Sharing Enemies, File Sharing Neutral, File Sharing Friends.

Result Analysis

Upload of the txt files to SVEN to obtain 3 csv files with the TF, TF-IDF and distribution analyses.
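For readers unfamiliar with these three measures, the sketch below computes them with textbook definitions: TF as the total number of occurrences of a word in the corpus, distribution as the number of documents containing it, and TF-IDF as TF multiplied by log(N / distribution). These definitions are an assumption; SVEN may use different formulas or normalizations, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def corpus_stats(documents):
    """Return {word: (tf, tfidf, distribution)} for a list of document strings."""
    n_docs = len(documents)
    tf = Counter()  # total occurrences across the corpus
    df = Counter()  # number of documents containing the word
    for text in documents:
        tokens = tokenize(text)
        tf.update(tokens)
        df.update(set(tokens))
    stats = {}
    for word, count in tf.items():
        idf = math.log(n_docs / df[word])
        stats[word] = (count, count * idf, df[word])
    return stats
```

A word that appears in every document gets a TF-IDF of 0 under this definition, which is why the distribution column is the more useful one for finding widely shared terms.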

Selection of the 500 most widely distributed words for each category (Enemies, Neutral, Friends).
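The selection can be sketched as a ranking of the words by their distribution value. The function name top_words and the alphabetical tie-breaking are assumptions added to make the result deterministic; the protocol does not say how ties were handled.

```python
def top_words(distribution, n=500):
    """Return the n words with the highest distribution, ties broken alphabetically.

    `distribution` maps each word to the number of documents containing it.
    """
    ranked = sorted(distribution.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]
```

The function would be applied once per category, for example to the distribution column of each of the three csv files.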

Manual extraction of named entities from the list: associations, companies, people, blogs, newspapers, institutions.

Categorization of the results into: Actors-File Sharing Enemies, Actors-File Sharing Neutral, Actors-File Sharing Friends.

Qualitative, manual selection of one reference web page for each actor:

  • homepage for associations, companies, institutions;
  • personal page in the university's database for professors;
  • personal page in the company's database for managers;
  • most recent page discussing the theme for blogs and newspapers.

Network and Graph

Crawling of the web pages' links with HYPHE, and analysis of the results to extract their first-degree outbound connections to other websites.

Import of the results into GEPHI; graph filtered to hide off-topic links (social networks) and nodes with a degree below 2.
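The effect of the two filters can be illustrated outside Gephi on a plain edge list. The sketch assumes that the off-topic nodes are removed first, that degree means in-degree plus out-degree on the remaining graph, and that the degree filter is applied in a single pass; the function name and arguments are hypothetical.

```python
from collections import Counter


def filter_graph(edges, excluded_nodes, min_degree=2):
    """Return (nodes, edges) after removing excluded and low-degree nodes.

    `edges` is a list of (source, target) pairs of a directed graph.
    """
    # Step 1: drop every edge that touches an excluded (off-topic) node.
    kept = [(s, t) for s, t in edges
            if s not in excluded_nodes and t not in excluded_nodes]
    # Step 2: degree = in-degree + out-degree on the remaining graph.
    degree = Counter()
    for s, t in kept:
        degree[s] += 1
        degree[t] += 1
    nodes = {n for n, d in degree.items() if d >= min_degree}
    # Step 3: keep only the edges whose two endpoints both survive.
    return nodes, [(s, t) for s, t in kept if s in nodes and t in nodes]
```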

Graph layout with ForceAtlas2 (LinLog mode, scaling 3.5, gravity 0.1, prevent overlap).

Qualitative division of outbound links into Enemies, Neutral, Friends.

Node size scaled according to InDegree.
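InDegree counts how many other sites link to a node, so larger nodes are the actors cited most often. A minimal sketch of the computation follows; the linear mapping and the size range of 10 to 50 are illustrative assumptions, not the values used in Gephi for this graph.

```python
from collections import Counter


def in_degrees(edges):
    """Count incoming links per node; nodes with no incoming link get 0."""
    indeg = Counter()
    for source, target in edges:
        indeg[source] += 0  # make sure the source node is registered
        indeg[target] += 1
    return dict(indeg)


def node_sizes(indeg, min_size=10.0, max_size=50.0):
    """Map in-degrees linearly onto the range [min_size, max_size]."""
    if not indeg:
        return {}
    lo, hi = min(indeg.values()), max(indeg.values())
    span = hi - lo
    return {
        node: min_size + (max_size - min_size) * ((d - lo) / span if span else 0.0)
        for node, d in indeg.items()
    }
```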

Metadata

Timestamp:
4/12/2014 – 15/12/2014

Data source: Google, Hyphe

Tools:
URL Extractor, Zup, Sven, Gephi