Protocol
1. Once the list of 120 movies was set, the IMDb page URL of each film was scraped using Kimono.
2. A dataset was created with the 120 URLs.
3. We appended the string “/keywords?ref_=tt_stry_kw” to every URL, using regular expressions in TextWrangler, to allow automated scraping.
4. Using these 120 new URLs, a second Kimono API was created to obtain a dataset containing all the keywords related to our movies.
5. A pivot table in Excel helped us find the most recurrent keywords and build a list of 5216 unique keywords.
6. We noticed that keywords were often too specific, so we decided to assign them to more general topics. The 40 most recurrent themes were: violence, travel, time, technology, sport, social issues, sex, security forces, religion, relationships, reference, politics, people, other, nationalities, media, love, language, justice, job, immigration, history, health, places, gender issues, food, film, family, emotions, education, economy, death, cultural differences, criminality, car, arts, animals, ages, addiction and sexual abuse.
7. Once this dataset was obtained, we selected the most interesting themes in order to visualize only the most relevant issues. These themes were: violence, job, arts, immigration, family, criminality, cultural differences, sex and security forces.
8. A single-column dataset was created for each movie containing only the tags of the topics we selected.
9. Using Raw, we created 120 treemap visualizations: the “Tag” value is assigned to the “Hierarchy” and “Color” dimensions. The size is given by the number of times the tag is repeated in the movie. Height and width are set to 500 px and padding is set to 1 px.
10. In Illustrator, the 120 treemaps were assembled and colored. The graphs are arranged in boustrophedon order (from the top left corner to the bottom left, with the first row running left to right, the next right to left, and so on down to the bottom), following our ranking criteria (see previous chapter).
11. Each category was assigned a color that tends toward red if the topic has a more negative connotation and toward blue if it has a more positive one, with white in the middle for neutral themes.
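
The data-handling steps above were carried out with Kimono, TextWrangler, Excel, Raw and Illustrator, not with code. For readers who want to reproduce them programmatically, the following is a minimal Python sketch of steps 3, 5, 6 to 9 and 10. Only the URL suffix comes from the protocol. Every function name, the sample URL, the keyword-to-topic mapping and the number of columns in the grid are hypothetical and chosen for illustration. It also assumes that the pivot table counted keyword occurrences across all movies.

```python
from collections import Counter

# Suffix quoted in step 3 of the protocol.
KEYWORDS_SUFFIX = "/keywords?ref_=tt_stry_kw"


def keywords_url(movie_url):
    """Step 3: append the keywords suffix to a movie page URL.

    Assumption: a trailing slash, if present, is dropped first.
    """
    return movie_url.rstrip("/") + KEYWORDS_SUFFIX


def keyword_frequencies(keywords_by_movie):
    """Step 5: count keyword occurrences across all movies.

    keywords_by_movie maps a movie identifier to its list of keywords.
    """
    counts = Counter()
    for keywords in keywords_by_movie.values():
        counts.update(keywords)
    return counts


def topic_tags(keywords, keyword_to_topic, selected_topics):
    """Steps 6 to 8: replace keywords by topics, keep only selected topics.

    Assumption: keywords missing from the mapping fall under "other".
    """
    tags = [keyword_to_topic.get(keyword, "other") for keyword in keywords]
    return [tag for tag in tags if tag in selected_topics]


def boustrophedon_cell(rank, columns):
    """Step 10: grid cell (row, column) for a 0-based rank.

    Even rows run left to right, odd rows right to left.
    """
    row, offset = divmod(rank, columns)
    column = offset if row % 2 == 0 else columns - 1 - offset
    return row, column


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the project dataset.
    mapping = {"gun": "violence", "painting": "arts", "dog": "animals"}
    selected = {"violence", "arts"}
    tags = topic_tags(["gun", "painting", "gun", "dog"], mapping, selected)
    # Step 9: the treemap size of a tag is its number of repetitions.
    print(Counter(tags))
    print(keywords_url("https://www.imdb.com/title/tt0000000/"))
    print([boustrophedon_cell(rank, 4) for rank in range(8)])
```

The per-movie `Counter` of tags plays the role of the single-column dataset loaded into Raw: each tag's count is the area of its rectangle in the treemap.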