GEONAMES FREQUENCY MAP

Protocol

Introduction

This protocol makes it possible to visualize on a map all the geonames found in the analysed web pages.

Protocol

Google.com accessed in Incognito Mode.

Scraping of the first 200 web page results from Google, with the 4 queries:

  • “File Sharing” + Effects;
  • “File Sharing” + Consequences;
  • Piracy + Effects;
  • Piracy + Consequences.

URL extraction and cleaning, to exclude social network pages and non-text results.
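
The exclusion step can be sketched in Python as follows. This is only an illustration and not the tool used in the protocol (which relied on URL Extractor). The domain list, the extension list and the function name `keep_url` are assumptions introduced here, since the protocol does not say which domains or file types were excluded.

```python
from urllib.parse import urlparse

# Hypothetical lists: the protocol does not specify the excluded domains or file types.
SOCIAL_DOMAINS = {"facebook.com", "twitter.com", "youtube.com", "linkedin.com"}
NON_TEXT_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".mp4")


def keep_url(url: str) -> bool:
    """Return True if the URL is neither a social network page nor a non-text file."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Match the domain itself and any of its subdomains (e.g. www.facebook.com).
    if any(host == d or host.endswith("." + d) for d in SOCIAL_DOMAINS):
        return False
    # Reject paths ending in one of the non-text extensions.
    return not parsed.path.lower().endswith(NON_TEXT_EXTENSIONS)


urls = [
    "https://www.facebook.com/some-page",
    "https://example.org/article.html",
    "https://example.org/chart.png",
]
kept = [u for u in urls if keep_url(u)]
print(kept)
```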

Text extraction from the results with Zup, and removal of empty or off-topic files.

IDENTIFICATION OF GEONAMES

Replacement of every newline with a space, using a terminal script.
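
The original terminal script is not included in this protocol. A minimal Python equivalent, assuming the texts are plain strings and using the hypothetical function name `flatten_newlines`, could look like this:

```python
import re


def flatten_newlines(text: str) -> str:
    # Replace each line break (Windows, old Mac or Unix style) with one space.
    return re.sub(r"\r\n|\r|\n", " ", text)


sample = "Piracy in Sweden\nand in the United Kingdom\r\nhas been studied."
print(flatten_newlines(sample))
```

Putting each text on a single line in this way presumably keeps one page per row when the files are imported into Open Refine.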

Extraction of all the entities in the texts using Open Refine.

Import of the result into Microsoft Excel to finalise the dataset:

  • removal of the words that are not geonames;
  • removal of names repeated within the same page;
  • organization of the dataset:
      • counting how many times each name appears in the list (recorded in the “times” column), then deleting the repeated names;
      • using latlong.net to look up the coordinates of every geoname in the list (“lat” and “lng” columns);
      • classifying every geoname into 3 categories: country, city, university (“category” column).
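
The de-duplication and counting steps above were carried out by hand in Excel. The same logic can be sketched in Python as a rough illustration. The page identifiers and names in `rows` are invented sample data, not results from this study; only the “times” column name comes from the protocol.

```python
from collections import Counter

# Hypothetical input: (page identifier, geoname) pairs, after non-geonames are removed.
rows = [
    ("page1", "Sweden"),
    ("page1", "Sweden"),  # repeated in the same page: kept only once
    ("page1", "London"),
    ("page2", "Sweden"),
    ("page3", "London"),
    ("page3", "Harvard University"),
]

# Step 1: drop names repeated within the same page.
unique_per_page = set(rows)

# Step 2: count how many times each name remains in the list (the "times" column).
times = Counter(name for _page, name in unique_per_page)

# Print one row per name, most frequent first, ties in alphabetical order.
for name, count in sorted(times.items(), key=lambda item: (-item[1], item[0])):
    print(f"{name}\t{count}")
```

Because repeats within a page are removed first, the resulting count is the number of pages in which a geoname appears, not its total number of mentions.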

Metadata

Timestamp:
24/11/14 - 03/12/14

Data source:
Google

Tools:
URL Extractor, Zup, Sven, Open Refine, Microsoft Excel, LatLong.net, CartoDB