TF-IDF ANALYSIS

The enemies always decide what to talk about

Introduction

The graph shows the results of a TF-IDF analysis of the texts taken from the first 200 web pages returned by Google. The search queries were “file sharing” + effects, “file sharing” + consequences, piracy effects and piracy consequences. TF-IDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Its value increases proportionally to the number of times a word appears in the text, but is offset by how frequently the word appears across the collection, which adjusts for the fact that some words are more frequent in general.
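As a rough illustration of the statistic described above, here is a minimal sketch in Python. It assumes the common variant in which term frequency is the relative count inside a corpus and inverse document frequency is the natural logarithm of (number of corpora / number of corpora containing the term). The exact weighting used by the analysis tool is not stated here, so the function name `tf_idf`, the corpus names and the token lists are all hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpora):
    """Compute TF-IDF for every term in every corpus.

    corpora: dict mapping a corpus name to its list of tokens.
    Returns a dict mapping corpus name -> {term: score}.
    Assumed variant: tf = count / corpus length, idf = ln(N / df).
    """
    n_corpora = len(corpora)

    # Document frequency: in how many corpora each term appears at least once.
    doc_freq = Counter()
    for tokens in corpora.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in corpora.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_corpora / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


# Hypothetical toy data, not taken from the real dataset.
example = {
    "economic_friends": ["sales", "music", "sales", "harm"],
    "legal_enemies": ["copyright", "music", "law", "copyright"],
}
example_scores = tf_idf(example)
```

In this toy input "music" appears in both corpora, so its inverse document frequency is ln(1) = 0 and its score vanishes, while "sales" appears in only one corpus and keeps a positive score. This is the offsetting effect mentioned in the paragraph above.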

How to read the visualization

The dataset is divided into 9 corpora, which represent the areas of interest (economic, ethical and legal) and the respective positions on the theme. Each corpus has a column in the graph, and the size of each dot shows the TF-IDF index of a concept. In this way it is possible to compare the “size” of the same concept across corpora, or to find the most pertinent words in each column.

How it was done

Starting from the list of links obtained from the first 200 web pages returned by Google, a dataset was compiled with information about the area of interest (economic, ethical or legal) and the position taken. Three questions are used to identify the positions:

From this dataset, 9 corpora were created and analysed with SVEN to produce the TF-IDF top list. After cleaning and selecting the top-ranked words, the final dataset was created with the 13 most pertinent concepts. For every word, the TF-IDF index is compared across the corpora in order to assess its presence and importance. To obtain a clear and simple visualisation, the “Scatter Plot” graph (RAW) was chosen, since it permits a direct comparison. The concepts are ordered by the average TF-IDF index of each concept across all the corpora, so the most common concepts are at the top of the list.
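The ordering step can be sketched as follows. This is only an illustration under stated assumptions: the function name `rank_concepts` and the input layout are hypothetical, a concept missing from a corpus is assumed to count as 0 in the average, and ties are broken alphabetically. None of these details is specified in the description above.

```python
from statistics import mean


def rank_concepts(scores, top_n=13):
    """Order concepts by their average TF-IDF index across all corpora.

    scores: dict mapping corpus name -> {concept: TF-IDF index}.
    Assumption: a concept absent from a corpus contributes 0 to its average.
    Returns the top_n concepts, highest average first.
    """
    # Collect every concept that appears in at least one corpus.
    concepts = set()
    for per_corpus in scores.values():
        concepts.update(per_corpus)

    # Average each concept over all corpora, using 0.0 where it is missing.
    averages = {
        concept: mean(per_corpus.get(concept, 0.0) for per_corpus in scores.values())
        for concept in concepts
    }

    # Sort by descending average; ties are broken alphabetically (an assumption).
    ordered = sorted(averages, key=lambda c: (-averages[c], c))
    return ordered[:top_n]


# Hypothetical toy values, not taken from the real dataset.
toy_scores = {
    "economic_friends": {"music": 0.4, "law": 0.1},
    "legal_enemies": {"music": 0.2, "law": 0.3},
    "ethical_friends": {"film": 0.3},
}
ranking = rank_concepts(toy_scores)
```

With these toy values "music" has the highest average and is listed first, which mirrors how the rows of the scatter plot are ordered.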

Findings

First of all, it is clear that the concepts in the economic and legal corpora have a greater TF-IDF index than those in the ethical ones. The reason could be that these are the topics the enemies raise most often, and the friends have to respond to them.

Specifically, there is a strong difference between the economic friends corpus and the legal enemies corpus. This could be because the objective illegality of file sharing is one of the strongest arguments for the enemies, while the unproven economic harm is important for the friends.

In the ethical area there are no strong concepts, maybe because this topic is less discussed, or maybe because the two sides use different arguments.

The most pertinent concepts relate to the music and film sectors, because these are the most openly discussed, both to fight against file sharing and to defend it.

Metadata

Timestamp: 24/11/14 - 12/12/14

Data source: Google
