Words matter

Distribution of terms between queries

Introduction

This visualization shows how words are distributed between two queries, Deep Web and Tor. The distribution indicates how words are used in relation to each of the two keywords.

How to read the visualization

On the left, in green, are the words that emerged from the text analysis of the Tor corpus performed with dev.sven.densitydesign.org; on the right, in violet, are those that emerged from the Deep Web corpus. In the middle are the words common to both analyses: terms with a higher Tf_idf value for Tor sit further left, and terms with a higher Tf_idf value for Deep Web sit further right.

How it has been done

We captured the first 100 results for each query on Google.com and selected 25 of them, removing invalid URLs; the text contained in the final results was then extracted with dev.zup.densitydesign.org. A corpus of 200 text documents for each query was composed and analyzed with dev.sven.densitydesign.org to extract the relevance of the words it contains. From the text analysis results we took the top 50 words of each aggregated corpus, ranked by Tf_idf value (term frequency–inverse document frequency), and created a spreadsheet with these words, their Tf_idf values and the query they belong to. We compared the terms' values to obtain a list polarized between the two keywords, Deep Web and Tor, showing which words are most relevant in each aggregated corpus and by how much. Some of these words are present in both lists, with different Tf_idf values. We created a JSON file with these elements and used Bforce.js to visualize the distribution of the words between the two keywords.
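The comparison step can be sketched roughly as follows. This is only an illustration, not the procedure actually used: the function name `polarize`, the `polarity` score, the field names and the sample Tf_idf values are all assumptions made up for the example, and the real spreadsheet and JSON layout may differ.

```python
import json


def polarize(tor_scores, deep_web_scores, top_n=50):
    """Merge two {word: tf-idf} mappings into one list ordered from Tor to Deep Web.

    Hypothetical helper: the polarity formula is an assumption, not the
    formula used for the original visualization.
    """
    def top(scores):
        # Keep only the top_n words ranked by tf-idf value.
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return dict(ranked[:top_n])

    tor, deep = top(tor_scores), top(deep_web_scores)
    records = []
    for word in sorted(set(tor) | set(deep)):
        t, d = tor.get(word, 0.0), deep.get(word, 0.0)
        total = t + d
        # -1.0 = Tor only, +1.0 = Deep Web only, values in between = shared word.
        polarity = (d - t) / total if total else 0.0
        records.append({"word": word, "tor": t, "deep_web": d, "polarity": polarity})
    # Leftmost (most Tor-related) first, rightmost (most Deep Web-related) last.
    records.sort(key=lambda r: r["polarity"])
    return records


# Invented sample values, for illustration only.
tor_sample = {"relay": 0.8, "onion": 0.6, "drug": 0.1}
deep_web_sample = {"search": 0.7, "hidden": 0.5, "drug": 0.3}

records = polarize(tor_sample, deep_web_sample)
print(json.dumps(records, indent=2))
```

In this sketch a word found in only one list ends up at one extreme, while a shared word such as "drug" lands between the two, closer to the keyword for which its Tf_idf value is higher.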

Findings

Once again, the words nearest to the Tor query are more technical and specific than those nearest to the Deep Web query. In the middle, it is interesting to find "drug seller", which probably means that the illegal part of the Deep Web is more popular than the legal one.

Metadata

Timestamp: 01/12/2014 - 20/12/2014

Data source: Google

Related Protocol

Download data (2 KB)