How can the first research method be automated to make it more reliable?

Description

After our first approach to the North Korean debate, we decided to take a step back and critically consider our research method. Setting a framework that defined devices, types of content and topics allowed us to quickly retrieve qualitative results, but can this method be considered replicable and reliable?

To answer this kind of methodological question, we started to look at our issue in an even broader way: our aim was to avoid any kind of filtering. To do so we used a tool to automate the process, so that we could work on a greater number of results. Crimson Hexagon* allowed us to create and refine our query.

At first, the tool retrieved more than 27,000,000 online conversations about North Korea. Was our answer to the first question on point? Do controversies around North Korea actually revolve around political issues and then, on a deeper and more hidden level, around defectors, the economy and social issues? In order to obtain different categories we went through an algorithm training. By giving Crimson ten examples per topic, selected manually, we were able to recreate our categorization.

Our first attempt, however, was unsuccessful: our initial categorization was too refined and the algorithm gave us a misleading result. This forced us to reconsider our categorization: was it too specific? After a couple of unsuccessful trainings we understood that the more vertical the categories were, the more accurately the algorithm classified. We redefined our categories, creating a macro group with Nuclear, Economy and Politics, as the three topics often overlapped. Finally we obtained a consistent result: war is always the most discussed topic, followed by nuclear, economy and politics. Surprisingly, topics such as social issues and defectors have been trending in recent months, which led us to observe a substantial evolution of the public debate.

Working with Crimson gave us some important insights and led us to reconsider our choice of devices and to include Twitter in our analysis, as the majority of user-generated content is located on this platform.

*The Crimson Hexagon license was provided by MSL Group, which kindly helped us during our research process.

Protocol

We used BrightView, Crimson Hexagon's algorithm. BrightView can be trained to recognize topics inside conversations. The training has to be done by a human and, in order to reduce the noise caused by machine bias, should be consistent with vertical categories: avoiding overlapping topics, avoiding irony-based content and favoring short content are some of the important rules.
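As a rough illustration of these rules, a hand-picked training set could be checked before it is submitted to the tool. The sketch below is not part of BrightView or of our protocol: the function name `check_training_set`, the 50-word threshold for "short" content and the data layout are assumptions made for the example. Irony cannot be detected this way and still has to be judged by the coder.

```python
from collections import defaultdict

MIN_EXAMPLES = 10  # ten examples per topic, as in the training described above
MAX_WORDS = 50     # hypothetical cut-off for "short" content


def check_training_set(examples):
    """Check (text, category) pairs and return a list of warning strings."""
    warnings = []
    by_category = defaultdict(list)
    categories_of = defaultdict(set)
    for text, category in examples:
        by_category[category].append(text)
        categories_of[text.strip().lower()].add(category)

    for category, texts in sorted(by_category.items()):
        # Each category should have enough hand-picked examples.
        if len(texts) < MIN_EXAMPLES:
            warnings.append(f"{category}: only {len(texts)} examples")
        # Short examples are preferred for training.
        long_count = sum(1 for t in texts if len(t.split()) > MAX_WORDS)
        if long_count:
            warnings.append(f"{category}: {long_count} examples over {MAX_WORDS} words")

    # The same text assigned to two categories signals overlapping topics.
    for text, cats in categories_of.items():
        if len(cats) > 1:
            warnings.append(f"overlap: {sorted(cats)} share the example {text[:40]!r}")
    return warnings
```

An empty list means that none of the three checks fired; it says nothing about whether the categories themselves are well chosen.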

After this categorization we looked at the results and started refining our query, in order to get only defector-related content. We started with this query: ("north korea").

This was the second query we ran:

2_ ("north korea") AND (migration OR defectors OR refugees OR defection OR escape OR "labour camp" OR "detention facilities" OR famine OR torture OR regime OR escaping OR killing OR prisoner* OR death OR "human rights" OR border OR abduction)

Here AND is inclusive: it requires both groups to be present. OR accepts any one of the listed keyword variations inside the content. This query was still too general and did not exclude noisy topics that did not interest us.

Then we tried a different form:

3_ ("north korea") AND (migration OR defectors OR refugees OR defection OR escape OR "labour camp" OR "detention facilities" OR famine OR torture OR regime OR escaping OR killing OR prisoner* OR death OR "human rights" OR border OR abduction) AND - (trump OR missile OR nuclear OR "Kim Jong-nam" OR iran)

Here AND - excludes content containing those keywords.

4_ (("north korea") AND (migration OR defectors OR refugees OR defection OR escape OR "labour camp" OR "detention facilities" OR famine OR torture OR regime OR escaping OR killing OR prisoner* OR death OR "human rights" OR border OR abduction)) AND - (trump OR missile OR nuclear OR "Kim Jong-nam" OR iran)

This fourth query was already satisfying, but we still observed that some content was not pertinent to our research.
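To make the Boolean logic of the fourth query easier to follow, it can be approximated in a few lines of Python. This is only a sketch of the logic, not of how Crimson Hexagon evaluates queries internally: the helper names `contains` and `matches_query`, the whole-word matching and the case-insensitive comparison are assumptions.

```python
import re

# Keyword lists copied from the fourth query; everything else is illustrative.
INCLUDE = [
    "migration", "defectors", "refugees", "defection", "escape",
    "labour camp", "detention facilities", "famine", "torture", "regime",
    "escaping", "killing", "prisoner*", "death", "human rights", "border",
    "abduction",
]
EXCLUDE = ["trump", "missile", "nuclear", "kim jong-nam", "iran"]


def contains(text: str, term: str) -> bool:
    """Whole-word, case-insensitive match; a trailing * matches any word ending."""
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def matches_query(text: str) -> bool:
    """("north korea" AND any INCLUDE term) AND - (any EXCLUDE term)."""
    if not contains(text, "north korea"):
        return False
    if not any(contains(text, term) for term in INCLUDE):
        return False
    # AND - : drop the text if any excluded keyword appears.
    return not any(contains(text, term) for term in EXCLUDE)
```

For example, `matches_query("Three defectors crossed the border from North Korea")` is true, while a text that also mentions a missile is rejected by the exclusion group.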

Then we applied a further rule:

5_ ((("north korea") AND (migration OR defectors OR refugees OR defection OR escape OR "labour camp" OR "detention facilities" OR famine OR torture OR regime OR escaping OR killing OR prisoner OR prisoners OR death OR "human rights" OR border OR abduction))~10) AND -(trump OR missile OR nuclear OR "Kim Jong-nam" OR iran OR author:@realdonaldtrump OR author:@potus OR otto OR warmbier OR realdonaldtrump OR @realdonaldtrump)

Here ~10 is a near rule: it excludes all content in which the other keywords are not sufficiently close to the first keyword "north korea".
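The effect of the near rule can be illustrated with a small Python sketch. It assumes that ~10 means "within ten words", measured between the first words of the two expressions; the exact way Crimson Hexagon counts distance is not documented here, and the names `tokenize`, `phrase_positions` and `within_distance` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9@'-]+", text.lower())


def phrase_positions(tokens, phrase):
    """Return the start index of every occurrence of a (multi-word) phrase."""
    words = phrase.lower().split()
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


def within_distance(text, anchor, keywords, max_gap=10):
    """True if any keyword starts within max_gap tokens of the anchor phrase."""
    tokens = tokenize(text)
    anchor_pos = phrase_positions(tokens, anchor)
    for keyword in keywords:
        for k in phrase_positions(tokens, keyword):
            if any(abs(k - a) <= max_gap for a in anchor_pos):
                return True
    return False
```

With this sketch, a text in which "famine" appears right after "North Korea" passes the check, while a long post that mentions the two twenty words apart does not.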

At this point we had an automated protocol that came very close to the hand-coded one.

Data

Timestamp: 11/2016 - 11/2017

Data source: Crimson Hexagon

Download data (40KB)

All datasets used in this phase were downloaded from the Crimson Hexagon view.
