Words matter (Google query text analysis)

Protocol

Introduction

The aim of this research protocol is to surface the words most frequently used about the Deep Web and related topics, in order to understand how people talk about the phenomenon and to highlight the language people use when speaking of Tor and the Deep Web. Which terms are used most frequently in the debate on the web? And what can we learn from them?

First step

We designed and calibrated the queries in order to extract the terms most closely related to the topic. The two main keywords that describe the theme are "Deep Web" and "Tor": the cyber environment and the tool for surfing it. To each of these two keywords we added eight words connected with the controversy, obtaining sixteen queries: Deep Web technology, Deep Web legal, Deep Web risk, Deep Web freedom, Deep Web crime, Deep Web anonymity, Deep Web censorship, Deep Web security, Tor technology, Tor legal, Tor risk, Tor freedom, Tor crime, Tor anonymity, Tor censorship and Tor security. These queries were used to search for materials on Google.com, collecting the first 100 links for each query. We cleaned the list and kept 25 URLs per query. The text of each query's corpus was scraped and analyzed with dev.sven.densitydesign.org to obtain a list of words sorted by TF-IDF value (term frequency–inverse document frequency). We assigned a category (perceptions, items, verbs, actors involved, technology and environments) to the first 10 words of each query, to see differences and similarities among the results.
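The collection and analysis were actually done with Google, zup and dev.sven.densitydesign.org; the sketch below only illustrates the logic of this step. It assumes the cleaned page texts have already been saved as plain-text files in one folder per query (a hypothetical layout, e.g. corpus/deep_web_risk/*.txt) and uses scikit-learn's TfidfVectorizer in place of the original tool. Summing TF-IDF scores over the pages of a query is an assumed way of obtaining one ranking per corpus, not the procedure prescribed by the protocol.

```python
# Sketch of the first step: build the sixteen queries and rank each query's
# corpus by TF-IDF. Folder layout and scoring are assumptions (see above).
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

KEYWORDS = ["Deep Web", "Tor"]
MODIFIERS = ["technology", "legal", "risk", "freedom",
             "crime", "anonymity", "censorship", "security"]

queries = [f"{kw} {mod}" for kw in KEYWORDS for mod in MODIFIERS]  # 16 queries

def top_words(query, n=10, base_dir="corpus"):
    """Return the n terms with the highest TF-IDF in a query's corpus.

    Each of the ~25 cleaned pages is treated as one document, so IDF is
    computed across the pages of that single query.
    """
    folder = Path(base_dir) / query.replace(" ", "_").lower()
    docs = [p.read_text(encoding="utf-8", errors="ignore")
            for p in sorted(folder.glob("*.txt"))]
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(docs)
    scores = matrix.sum(axis=0).A1          # sum TF-IDF over pages: one score per term
    terms = vectorizer.get_feature_names_out()
    return sorted(zip(terms, scores), key=lambda t: t[1], reverse=True)[:n]

for query in queries:
    print(query, top_words(query))
```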

Second step

We created one comprehensive corpus from the results of the Deep Web queries and another from the results of the Tor queries. A text analysis was performed with dev.sven.densitydesign.org on each aggregated corpus to get a list of words sorted by TF-IDF value (term frequency–inverse document frequency). We assigned a category (perceptions, items, verbs, actors involved, technology and environments) to the first 150 words that emerged from the analysis. We then visualized the differences in the terms used between the comprehensive corpora of Tor and Deep Web.
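Continuing the same hypothetical folder layout, a minimal sketch of this aggregation step could look as follows: merge the eight corpora of each keyword and rank the first 150 words of each aggregate by summed TF-IDF. Again, the summing is an assumption about how the per-corpus values were obtained.

```python
# Sketch of the second step: aggregate the eight Deep Web corpora (and,
# separately, the eight Tor corpora) and rank the first 150 words of each.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

MODIFIERS = ["technology", "legal", "risk", "freedom",
             "crime", "anonymity", "censorship", "security"]

def aggregated_top_words(keyword, n=150, base_dir="corpus"):
    """Rank the words of a keyword's aggregated corpus by summed TF-IDF."""
    docs = []
    for mod in MODIFIERS:
        folder = Path(base_dir) / f"{keyword} {mod}".replace(" ", "_").lower()
        docs += [p.read_text(encoding="utf-8", errors="ignore")
                 for p in sorted(folder.glob("*.txt"))]
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(docs)
    scores = matrix.sum(axis=0).A1          # one aggregated score per term
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda t: t[1], reverse=True)
    return dict(ranked[:n])                 # ordered {term: TF-IDF}

deep_web_terms = aggregated_top_words("Deep Web")
tor_terms = aggregated_top_words("Tor")
```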

Third step

From the text analysis results we took the first 50 words of each aggregated corpus by TF-IDF value (term frequency–inverse document frequency). We compared their values to obtain a list of terms polarized between the two keywords, Deep Web and Tor, in order to point out which words are most relevant in each aggregated corpus, and to what extent. Some of these words appear in both lists, with different TF-IDF values. We used Bforce.js to visualize the distribution of the words between the two keywords.
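As an illustration only, the comparison can be sketched as a simple polarization score per term. The normalized difference used below is an assumed scoring choice, since the protocol only states that the TF-IDF values were compared; the actual visualization was produced with Bforce.js.

```python
# Sketch of the third-step comparison. `deep_web_terms` and `tor_terms` are the
# ordered {term: TF-IDF} dictionaries from the previous sketch.
def polarization(deep_web_terms, tor_terms, n=50):
    """Score each term in [-1, 1]: +1 leans 'Deep Web', -1 leans 'Tor'.

    The normalized difference of the two TF-IDF values is an assumption,
    not the formula prescribed by the protocol.
    """
    top_dw = dict(list(deep_web_terms.items())[:n])    # first 50 by TF-IDF
    top_tor = dict(list(tor_terms.items())[:n])
    scores = {}
    for term in set(top_dw) | set(top_tor):            # terms may appear in both lists
        dw, tor = top_dw.get(term, 0.0), top_tor.get(term, 0.0)
        scores[term] = (dw - tor) / (dw + tor)
    return sorted(scores.items(), key=lambda t: t[1])

# Example: print terms from the most Tor-leaning to the most Deep Web-leaning.
for term, score in polarization(deep_web_terms, tor_terms):
    print(f"{term:20s} {score:+.2f}")
```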

Metadata

Time frame:
17/11/2014 - 10/12/2014

Data source: Google

Tools:
Microsoft Excel, zup, Sven, Raw, Bforce.js, pgl.yoyo