TF-IDF ANALYSIS

Protocol

Introduction

The graph shows the results of a TF-IDF analysis of the texts extracted from the first 200 web pages returned by Google. It shows the most pertinent concepts, divided by area of interest (economic, ethical and legal) and by position taken on file sharing (friends, neutral and enemies).

SCRAPING

Google.com set in Incognito Mode.

Scraping of the first 200 results from Google, with the 4 queries:

  • “File Sharing” + Effects;
  • “File Sharing” + Consequences;
  • Piracy + Effects;
  • Piracy + Consequences.

URL extraction and text cleaning, to exclude Social Networks and non-text results.
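The exclusion step could be sketched roughly as follows. This is only an illustration: the source does not list which sites or file types were excluded, so SOCIAL_DOMAINS and NON_TEXT_EXTENSIONS are hypothetical examples, and keep_url and filter_urls are illustrative names, not part of URL Extractor.

```python
from urllib.parse import urlparse

# Hypothetical blocklists: the protocol does not name the excluded sites or file types.
SOCIAL_DOMAINS = {"facebook.com", "twitter.com", "youtube.com", "linkedin.com"}
NON_TEXT_EXTENSIONS = (".jpg", ".png", ".gif", ".mp4", ".mp3", ".zip")


def keep_url(url: str) -> bool:
    """Return True if the URL is neither a social network page nor a non-text file."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Match the domain itself and any of its subdomains (e.g. m.twitter.com).
    is_social = any(host == d or host.endswith("." + d) for d in SOCIAL_DOMAINS)
    is_non_text = parsed.path.lower().endswith(NON_TEXT_EXTENSIONS)
    return not is_social and not is_non_text


def filter_urls(urls):
    """Keep only the URLs that pass keep_url, preserving their original order."""
    return [url for url in urls if keep_url(url)]
```

Matching on the host name rather than on the whole URL string avoids discarding an article that merely mentions a social network in its path.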

Text extraction from the results with Zup, and removal of empty or off-topic files.

Results divided by area of interest (economic, ethical and legal) and by position (File Sharing Enemies, File Sharing Neutral, File Sharing Friends), in order to split the texts into 9 corpora.
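As an illustration of how 3 areas and 3 positions give 9 groups, here is a minimal sketch. The label strings, the "area_position" naming scheme and the function names are assumptions made for this example; in the protocol the classification itself was a manual step.

```python
from itertools import product

# Labels taken from the protocol; the short spellings are assumptions.
AREAS = ("economic", "ethical", "legal")
POSITIONS = ("enemies", "neutral", "friends")


def corpus_names():
    """Return the 9 area/position combinations as 'area_position' strings."""
    return [f"{area}_{position}" for area, position in product(AREAS, POSITIONS)]


def group_texts(labelled_texts):
    """Group (area, position, text) triples into the 9 corpora."""
    corpora = {name: [] for name in corpus_names()}
    for area, position, text in labelled_texts:
        corpora[f"{area}_{position}"].append(text)
    return corpora
```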

ANALYSIS OF THE CONCEPTS

Uploading of the txt files to SVEN to obtain 9 csv files with the TF, TF-IDF and distribution analysis.

Reading of the datasets: the most interesting part is the result of the TF-IDF analysis, so that column was cleaned and organised with Open Refine.
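For readers unfamiliar with the measure, a textbook TF-IDF computation can be sketched as below. This is not SVEN's implementation, whose exact weighting is not described here. The sketch assumes that each of the 9 corpora is treated as one document, that TF is the relative frequency of a term, and that IDF is log(N / df). The function name tf_idf and the token lists are illustrative.

```python
import math
from collections import Counter


def tf_idf(corpora):
    """Compute TF-IDF scores.

    corpora: dict mapping corpus name -> list of tokens.
    Returns: dict mapping corpus name -> dict of term -> score.
    """
    n_docs = len(corpora)
    counts = {name: Counter(tokens) for name, tokens in corpora.items()}

    # Document frequency: in how many corpora each term appears.
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())

    scores = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values())
        scores[name] = {
            term: (n / total) * math.log(n_docs / df[term])
            for term, n in term_counts.items()
        }
    return scores
```

Under this formulation a term that appears in every corpus scores 0, which is why TF-IDF highlights the concepts that distinguish one corpus from the others.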

FINALISATION OF THE DATASET

Uploading in Microsoft Excel to finalise the dataset:

  • Selection of the first 200 concepts
  • Manual merging of duplicate concepts or synonyms
  • Extraction of the first 13 concepts in every corpus
  • Organisation of the dataset: calculation of the average TF-IDF for every concept, sorted in descending order
  • Visualisation in Raw (Scatter Plot) to compare how important a concept is in the different corpora
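The selection and averaging steps above were done in Microsoft Excel; the same logic can be sketched in code as follows. The function names are hypothetical, and the sketch assumes that a concept missing from a corpus counts as a score of 0 when averaging, which the protocol does not specify.

```python
def top_concepts(corpus_scores, n=13):
    """Return the n concepts with the highest TF-IDF score in one corpus."""
    ranked = sorted(corpus_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [concept for concept, _ in ranked[:n]]


def rank_by_average(scores):
    """Average each concept's TF-IDF over all corpora and sort descending.

    scores: dict mapping corpus name -> dict of concept -> TF-IDF score.
    Assumption: a concept absent from a corpus contributes 0 to the average.
    """
    concepts = set()
    for corpus_scores in scores.values():
        concepts.update(corpus_scores)
    n_corpora = len(scores)
    averages = {
        concept: sum(s.get(concept, 0.0) for s in scores.values()) / n_corpora
        for concept in concepts
    }
    # Ties are broken alphabetically so the ordering is reproducible.
    return sorted(averages.items(), key=lambda kv: (-kv[1], kv[0]))
```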

Data visualisation.

Metadata

Timestamp:
24/11/14 - 12/12/14

Data source:
Google

Tools:
URL Extractor, Zup, SVEN, Open Refine, Microsoft Excel, Raw