Description

After extracting the keywords and computing the word frequencies, we want to analyze the terms common to the two corpora and find differences in reporting orientation by comparing how often each term appears.

The visualization above shows how much attention each word receives in the two countries. The X axis represents the word frequency on Baidu News, and the Y axis represents the word frequency on Google News. The blue line is X = Y: words on this line have the same frequency in the two corpora, which can be interpreted as the same level of concern on both sides. Words in the upper-left area receive more attention on Google News, and words in the lower-right area receive more attention on Baidu News. In the next part of the visualization, the blue value is the distance from the point to the blue line: the greater the distance, the greater the difference.
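The reading rule above can be sketched in a few lines of Python. This is only an illustration: the function name `read_point` and the sample counts are hypothetical and not part of the original workflow, which used Raw and Excel.

```python
import math

def read_point(baidu_count, google_count):
    """Say which side of the line X = Y a word falls on, and how far from it."""
    # Perpendicular distance from the point (x, y) to the line x - y = 0.
    distance = abs(baidu_count - google_count) / math.sqrt(2)
    if google_count > baidu_count:
        side = "Google News"   # upper-left area of the plot
    elif baidu_count > google_count:
        side = "Baidu News"    # lower-right area of the plot
    else:
        side = "equal"         # exactly on the blue line
    return side, distance

# Made-up counts: 30 occurrences on Baidu News, 70 on Google News.
side, distance = read_point(30, 70)
print(side, f"{distance:.2f}")
```

With these made-up counts the word sits in the upper-left area, so it would be read as receiving more attention on Google News.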

The visualization shows that Chinese and American media have different reporting tendencies on climate change issues, and that they give different levels of attention to the same issues.

Protocol

1. Take the keywords extracted from the two data sources, find the words that appear in both corpora, and rank them by frequency.

2. Choose the top 45 common words and visualize them with Raw (scatter plot). The X axis represents the word frequency on Baidu News, and the Y axis represents the word frequency on Google News.

3. Use Excel to calculate the distance from each point (word) to the line X = Y: ABS(x - y) / (1^2 + (-1)^2)^0.5, which simplifies to |x - y| / √2. A word on the line X = Y appears the same number of times in the two corpora, which can be interpreted as the same level of concern on both sides.

4. Choose the 10 words with the highest distance value for each corpus and the 10 words with the lowest distance value across the two corpora, and place them in a matrix according to their position.
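The four steps above can be approximated in Python as a rough sketch. The original work was done with Raw and Excel, so everything here is illustrative: the function names, the sample counts, and in particular the ranking rule (shared words ranked by their combined frequency) are assumptions, since the protocol does not say how the two frequencies are combined.

```python
import math

def common_word_distances(baidu_freq, google_freq, top_n=45):
    """Steps 1-3: shared words, top_n by frequency, distance to the line X = Y."""
    common = set(baidu_freq) & set(google_freq)
    # Assumption: rank by combined frequency, ties broken alphabetically.
    ranked = sorted(common, key=lambda w: (-(baidu_freq[w] + google_freq[w]), w))
    rows = []
    for word in ranked[:top_n]:
        x, y = baidu_freq[word], google_freq[word]
        rows.append((word, x, y, abs(x - y) / math.sqrt(2)))
    return rows

def select_for_matrix(rows, k=10):
    """Step 4: the k farthest words on each side of the line and the k closest."""
    baidu_side = sorted((r for r in rows if r[1] > r[2]), key=lambda r: -r[3])[:k]
    google_side = sorted((r for r in rows if r[2] > r[1]), key=lambda r: -r[3])[:k]
    closest = sorted(rows, key=lambda r: r[3])[:k]
    return baidu_side, google_side, closest

# Made-up frequency tables, for illustration only.
baidu = {"climate": 120, "carbon": 90, "emission": 60, "policy": 30, "haze": 80}
google = {"climate": 120, "carbon": 40, "emission": 75, "policy": 70, "arctic": 50}

rows = common_word_distances(baidu, google)
baidu_side, google_side, closest = select_for_matrix(rows, k=2)
print([r[0] for r in baidu_side])
print([r[0] for r in google_side])
print([r[0] for r in closest])
```

Words that appear in only one corpus ("haze" and "arctic" in the sample data) are dropped in the first step, so they never reach the scatter plot or the matrix.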

Data

Data source: Baidu, Google