Definition of our movie corpus

What films are the most relevant when searching for migration movies?

Introduction


As a first step in our research, we wanted to build a corpus of the movies most relevant to the topic of migration. To do that, we collected and compared several movie lists, on the assumption that a movie appearing in many lists does so because its story is closer to our theme.

We also approached the research from several points of access, considering the channels people might use to choose a movie about migration. This led us to choose general platforms such as Google and Wikipedia, along with some of the most widely used and complete databases dedicated exclusively to movies: IMDB, The Movie DB, and All Movie.

Protocol


1. We chose 6 queries: immigration, immigrant, migration, migrant, emigration, emigrant.

2. Five platforms were selected based on two requirements: they had to allow keyword searches and have a clear ranking criterion. We decided to use: google.com, wikipedia.org, imdb.com, themoviedb.com, allmovie.com.

The queries were adapted to take advantage of the specific characteristics of each platform. Here are the processes we followed on each platform:


Google:

a. When a query includes the word “movies”, Google returns a slider of movie covers ranked by a “most frequently asked” criterion.
b. We used Google.com (English pages only) in an incognito window.
c. Each of the six queries was combined with the word “movies”: “immigration movies”, “immigrant movies”, “emigration movies”, “emigrant movies”, “migration movies”, “migrant movies”.
d. Only two queries returned results: “immigration movies” and “immigrant movies”.
e. The results were scraped manually.
f. This produced two lists, for “immigration movies” and “immigrant movies”, each with 51 results.

Wikipedia:

a. Wikipedia already has a category, “Films about Immigration”, which lists 160 movies.
b. We used stats.grok.se to sort them by popularity, based on page views over the last 90 days.
c. We manually compiled a dataset with the titles, the views, and the rank from the most to the least viewed.
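
The ranking described in the Wikipedia steps amounts to sorting titles by page views. A minimal sketch, assuming the view counts have already been collected by hand: the titles and numbers below are placeholders, not real statistics, and rank_by_views is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# Placeholder page-view counts for the last 90 days (made-up values).
page_views: Dict[str, int] = {
    "Film A": 12000,
    "Film B": 48000,
    "Film C": 3500,
}


def rank_by_views(views: Dict[str, int]) -> List[Tuple[int, str, int]]:
    """Return (rank, title, views) rows, most viewed first.

    Ties on views are broken alphabetically so the order is deterministic.
    """
    ordered = sorted(views.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, title, count)
            for rank, (title, count) in enumerate(ordered, start=1)]


ranking = rank_by_views(page_views)
for rank, title, count in ranking:
    print(rank, title, count)
```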

Allmovie:

a. We used the “Movie themes” filter in the “Advanced search” section. Typing “migr” returns the filter “Immigrant life”, with movies displayed by popularity.
b. The results were scraped with Kimono, yielding 728 movies.

Themdb:

a. The “Discover” section allows browsing through the movies. We used the filters “Year: none” and “Sort by: Popularity” with the keywords: “immigration”, “immigrant”, “migration”, “migrant”, “emigration”, “emigrant”.
b. The query “migrant” did not return any results.
c. Each outcome was scraped with Kimono. The resulting lists were: “immigration” with 63 movies, “immigrant” with 71 movies, “emigration” with 17 movies, “emigrant” with 5 movies and “migration” with 12 movies.

Imdb:

a. We typed “migr” in the search field with “Keywords” selected in the drop-down menu, then selected all lists for these queries: “immigration”, “immigrant”, “migration”, “migrant”, “emigration”, “emigrant”. Results were sorted by popularity.
b. Scraping with Kimono gave the following results: “immigration” list with 867 films, “immigrant” list with 1259 films, “emigration” list with 220 films, “emigrant” list with 111 films, “migration” list with 236 films and “migrant” list with 56 films.


3. All lists were reorganized in Excel with the title, year and position of each movie.

4. A contingency table was created listing all 3140 titles in the rows and the 15 lists in the columns. The value shown at each intersection is the position of the movie in that list.

5. The occurrences of each movie, that is, the number of lists in which it appears, were calculated using the Excel function “Count”.

6. The movies with at least three occurrences formed the basis of our corpus.

7. The plots of all movies were read to verify their connection with our topic. Movies were excluded if they narrate a fictional migration, take place in the future, or treat the topic only marginally. Thirteen movies were left out.

8. The remaining 120 movies were reordered according to their occurrences and their popularity in the Imdb ranking. To establish the popularity ranking among our 120 films, we created an Imdb pro account. All movies were collected in a personal list and then sorted by popularity. A Kimono scraping API produced a dataset with the internal popularity ranking.

9. The movies with the most occurrences were placed at the top and those with the fewest at the bottom. Movies with the same number of occurrences were ordered by Imdb popularity.
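
Steps 4 to 6 and step 9 can be sketched in Python as follows. This is only an illustrative sketch, not the Excel workflow itself: the list names, film titles and popularity ranks are made-up toy data, and the function names (build_contingency, count_occurrences, order_corpus) are hypothetical. The manual plot check of step 7 is not modelled.

```python
from typing import Dict, List, Optional

# Toy ranked lists (made-up data): each list runs from position 1 downwards.
ranked_lists: Dict[str, List[str]] = {
    "google_immigration": ["Film A", "Film B", "Film C"],
    "wikipedia": ["Film B", "Film A", "Film D"],
    "imdb_immigrant": ["Film C", "Film A", "Film B", "Film E"],
    "themdb_migration": ["Film E", "Film C", "Film A"],
}

# Hypothetical popularity ranks (lower number = more popular).
popularity_rank: Dict[str, int] = {
    "Film A": 250, "Film B": 900, "Film C": 120, "Film D": 40, "Film E": 610,
}

Table = Dict[str, Dict[str, Optional[int]]]


def build_contingency(lists: Dict[str, List[str]]) -> Table:
    """Rows are titles, columns are lists, cells hold the 1-based position
    of the title in that list, or None when the title is absent."""
    titles = sorted({title for movies in lists.values() for title in movies})
    table: Table = {}
    for title in titles:
        table[title] = {
            name: (movies.index(title) + 1 if title in movies else None)
            for name, movies in lists.items()
        }
    return table


def count_occurrences(table: Table) -> Dict[str, int]:
    """Count the filled cells in each row, as a COUNT over the row would."""
    return {title: sum(pos is not None for pos in row.values())
            for title, row in table.items()}


def order_corpus(occurrences: Dict[str, int],
                 popularity: Dict[str, int],
                 min_occurrences: int = 3) -> List[str]:
    """Keep titles with enough occurrences; sort by occurrences (descending),
    then by popularity rank (ascending) to break ties."""
    selected = [t for t, n in occurrences.items() if n >= min_occurrences]
    return sorted(selected, key=lambda t: (-occurrences[t], popularity[t]))


table = build_contingency(ranked_lists)
occurrences = count_occurrences(table)
corpus = order_corpus(occurrences, popularity_rank)
print(corpus)
```

With this toy data, only films that appear in at least three lists are kept, and films with the same number of occurrences are ordered by the lower (more popular) rank.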

How to read it


All lists are displayed vertically, grouped by search query and ordered from the shortest to the longest.
Each element represents a movie: the one at the top corresponds to the first movie in the list, and the one at the bottom to the last. If a film appears in at least three lists, its element is highlighted, so the positions of the selected movies within each list are visible.
Under the platform name, a bar shows the percentage of selected movies out of the total movies in that list. Using the drop-down menu, it is possible to see the position of a selected film in each list.
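
The bar and the highlighting described above come down to two small computations per list. A minimal sketch, assuming toy data and hypothetical helper names (selected_share, highlighted_positions); it does not reproduce the visualization code itself.

```python
from typing import List, Set


def selected_share(movies: List[str], selected: Set[str]) -> float:
    """Percentage of the list's movies that belong to the selected corpus."""
    if not movies:
        return 0.0
    hits = sum(1 for title in movies if title in selected)
    return 100.0 * hits / len(movies)


def highlighted_positions(movies: List[str], selected: Set[str]) -> List[int]:
    """1-based positions of the selected movies inside the list."""
    return [pos for pos, title in enumerate(movies, start=1)
            if title in selected]


# Made-up example: one ranked list and a small selected corpus.
example_list = ["Film A", "Film X", "Film B", "Film Y", "Film Z"]
selected_corpus = {"Film A", "Film B", "Film C"}

share = selected_share(example_list, selected_corpus)
positions = highlighted_positions(example_list, selected_corpus)
print(share, positions)
```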

Findings


Out of the 3140 movies collected in these 15 lists, 133 were present in at least 3 lists, and 120 were considered relevant after reading the plots.
There are two main findings. First, even though they are shorter than the others, the lists collected from Google and Wikipedia have a higher percentage of selected movies, which suggests they are more reliable when looking for a movie relevant to the topic.
Second, the lists found in Themdb agree less with the others, in particular with the Google ones.
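
The text does not state how agreement between lists was assessed. One simple way to quantify it, offered here purely as an assumption, is the Jaccard index: the number of titles two lists share divided by the number of titles in either. The sketch below uses made-up lists and hypothetical function names (jaccard, pairwise_agreement).

```python
from itertools import combinations
from typing import Dict, List, Tuple


def jaccard(list_a: List[str], list_b: List[str]) -> float:
    """Shared titles divided by all distinct titles across both lists."""
    set_a, set_b = set(list_a), set(list_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def pairwise_agreement(lists: Dict[str, List[str]]) -> Dict[Tuple[str, str], float]:
    """Jaccard index for every pair of lists, keyed by the sorted name pair."""
    return {(x, y): jaccard(lists[x], lists[y])
            for x, y in combinations(sorted(lists), 2)}


# Made-up lists, not the collected data.
toy_lists = {
    "google": ["Film A", "Film B", "Film C"],
    "wikipedia": ["Film A", "Film B", "Film D"],
    "themdb": ["Film E", "Film F", "Film A"],
}

scores = pairwise_agreement(toy_lists)
for pair, score in scores.items():
    print(pair, round(score, 2))
```

A lower score for a pair means the two lists share fewer titles. The measure ignores the position of a film within each list, so it captures overlap only.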