Words matter

Query composition and results

Introduction

This visualization aims to show the differences and similarities between the most relevant words associated with the Deep Web and with Tor. It displays the composition of the queries (the keyword and the words matched to it) and the terms obtained from the text analysis. Each term is assigned to a category, to show how differently people perceive the Deep Web and Tor.

How to read the visualization

The keyword sits at the center of the circular dendrogram; the buttons on the left switch it between Tor and Deep Web. Around it are the 8 words that were added to the keywords to compose the 16 different queries. Each query branches into its 10 most relevant terms, ranked by TF-IDF (term frequency–inverse document frequency) value. Each term is assigned to a category, explained in the legend to the left of the graph. The legend also shows the number of words for each query.
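The hierarchy described above (keyword at the center, added words around it, ten terms per query, one category per term) can be sketched as a nested structure. This is only an illustration, not the code behind the visualization: the added words, the terms and the category name are placeholders, since the actual lists are not given in this text, and the function names are hypothetical.

```python
from collections import defaultdict

KEYWORDS = ["Tor", "Deep Web"]
# Placeholder added words: the real eight words are not listed in this text.
ADDED_WORDS = [f"word{i}" for i in range(1, 9)]
TERMS_PER_QUERY = 10


def compose_queries(keywords, added_words):
    """Pair every keyword with every added word: 2 x 8 = 16 queries."""
    return [(k, w) for k in keywords for w in added_words]


def build_tree(rows):
    """Nest flat (keyword, added_word, term, category) rows as
    keyword -> added word -> list of (term, category)."""
    tree = defaultdict(lambda: defaultdict(list))
    for keyword, added_word, term, category in rows:
        tree[keyword][added_word].append((term, category))
    return tree


queries = compose_queries(KEYWORDS, ADDED_WORDS)

# Placeholder rows: ten dummy terms per query, all in one dummy category.
rows = [
    (k, w, f"term{i}", "category A")
    for k, w in queries
    for i in range(1, TERMS_PER_QUERY + 1)
]
tree = build_tree(rows)

print(len(queries))  # 16 queries
print(len(rows))     # 160 leaf terms across both keywords
```

Switching the keyword with the buttons corresponds to reading either `tree["Tor"]` or `tree["Deep Web"]`, each holding 8 branches of 10 terms.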

How it has been done

We captured the first 100 Google.com results for each query and selected 25 of them, removing invalid URLs. The text contained in the final results was extracted with dev.zup.densitydesign.org. A text analysis was then performed on each query's corpus of text documents with dev.sven.densitydesign.org. The analysis produced a list of words (n-grams) sorted by TF-IDF (term frequency–inverse document frequency) value. We took the first 10 words for each query and tagged each with a category, creating a spreadsheet with xx columns: the associated keywords, the added queries, the n-grams and the related categories. From this spreadsheet, raw.densitydesign.org produced a circular dendrogram showing the hierarchy of the queries and the clustering of the results.
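To make the ranking step more concrete, here is a minimal sketch of a TF-IDF ranking over per-query corpora. It does not reproduce the tool used for the analysis: the exact weighting that tool applies is not stated here, so this sketch assumes a common variant (relative term frequency times the logarithm of inverse document frequency), treats each query's merged texts as one document, and scores single words rather than longer n-grams. The function names and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(corpora, n=10):
    """Return the n highest-scoring terms for each query.

    corpora maps a query string to the list of texts collected for it.
    Assumption: all texts of a query are merged into one document, so the
    inverse document frequency is computed across queries.
    """
    counts = {
        query: Counter(tok for text in texts for tok in tokenize(text))
        for query, texts in corpora.items()
    }
    n_docs = len(counts)

    # Number of query corpora in which each term appears at least once.
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    ranked = {}
    for query, term_counts in counts.items():
        total = sum(term_counts.values())
        scores = {
            term: (freq / total) * math.log(n_docs / doc_freq[term])
            for term, freq in term_counts.items()
        }
        # Sort by descending score; break ties alphabetically.
        ranked[query] = sorted(scores, key=lambda t: (-scores[t], t))[:n]
    return ranked


# Made-up sample texts, for illustration only.
corpora = {
    "tor relay": [
        "Tor relay nodes route encrypted traffic",
        "a relay forwards traffic",
    ],
    "deep web market": [
        "the deep web hides a market",
        "a hidden market",
    ],
}
ranked = top_terms(corpora, n=3)
print(ranked["tor relay"])
print(ranked["deep web market"])
```

Terms that occur in every query corpus get a score of zero under this weighting, which is why very common words drop out of each top list.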

Findings

Looking at the most used words in the debate, we notice a sharp contrast between two spheres: technical, specific terms linked to the Tor queries, probably used by experts and hackers, and suggestive terms linked to the Deep Web queries, which refer to the dark side of the Deep Web. This may suggest that there are two points of view on the phenomenon: that of computer science and that of public opinion.

Metadata

Timestamp: 17/11/2014 - 10/12/2014

Data source: Google

Related Protocol

Download data (11 KB)