Keyness is one tool within the free utility AntConc, which I introduced in my last post in this series on the first part of the text analysis. It's a simple but powerful tool, and it has the potential to be the most revealing.
Keyness essentially measures how "unusual" a word is in a corpus of text compared to a "reference" corpus. The tool lets you designate any reference text, which it then treats as the norm. When you upload your own corpus, the tool applies a simple mathematical formula to the word frequencies in each corpus; by comparing those frequencies, it quantifies how strange the appearance of a given word is compared to the norm.
The results let us see which topics are discussed in one period and not another. A newspaper's coverage naturally varies in topic over the course of even a week, so the prominent names will almost certainly change over time. Sometimes it is quite difficult to discern whether a term is tied to a specific story being covered, or whether it is being used more often across other contexts.
Our results, however, have shown that proper nouns are not the only words whose frequency changes over time. How other, less distinctive nouns, and even verbs and transitional phrases, shift in frequency can give us real insight into what change really means.
There are two keyness statistics on offer: one based on a chi-squared test, and the other labeled "logarithmic." I couldn't find the math behind the former, so I am hesitant to trust it. The formulas for "logarithmic" keyness can be found here.
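To make the idea concrete, here is a minimal sketch of the computation, assuming the "logarithmic" option corresponds to Dunning's log-likelihood statistic, which is the measure commonly used for keyness; the function and variable names are mine, not AntConc's, and the tokenization is deliberately naive:

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for one word.
    a: word frequency in the study corpus
    b: word frequency in the reference corpus
    c: total tokens in the study corpus
    d: total tokens in the reference corpus
    """
    # Expected frequencies if the word were spread evenly across both corpora
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

# Toy corpora standing in for the "after" and "before" texts
study = Counter("police court protest police police court".split())
reference = Counter("court trade weather court sports minister".split())
c, d = sum(study.values()), sum(reference.values())

for word in study:
    score = log_likelihood(study[word], reference.get(word, 0), c, d)
    print(f"{word}\t{score:.2f}")
```

A word that is frequent in the study corpus but rare in the reference corpus gets a high score; a word with similar relative frequency in both scores near zero.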
You are free to assign whichever corpus you want as the reference. Several studies compare their own corpus to "standardized" ones, such as the Brown Corpus or the British National Corpus. Many of these are expensive to obtain or, like the Brown Corpus, whose journalistic section was built from articles published in 1961, too different to compare to my own. Until something more relevant appears, I'll stick to comparing within my own corpus, which is nevertheless very revealing.
The image below shows the data from comparing the text from before the protests to the text from after, and then vice versa. The third column is keyness, the quantified measure of unusualness. The first thing to note, even before looking at keyness, is the difference between each document's word count before and after the cut. Does this suggest that there is a greater variety of words in the articles after the protests than before?
Some notable results: 'police' appears as unusual in the text after the protests when compared to the text before, and 'court' shows up there as unusual with a surprisingly high frequency as well.
We have to be careful, because the raw frequency of a term matters. If a term has a relatively low frequency compared to the other terms at its level of keyness, it may indicate that only one story was published on that topic. The publication of a single story can certainly be interesting, but 'machete', referring to a series of stories on one attack in Istanbul, shows up with high keyness. It's unusual, but it doesn't shed light on the perspective of the newspaper.
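One way to guard against single-story artifacts like 'machete' is to filter the keyness table by raw frequency before reading it. A quick sketch, using hypothetical results shaped like AntConc's output (word, raw frequency, keyness); both the numbers and the cutoff are illustrative, not real values from my corpus:

```python
# Hypothetical keyness results: (word, raw frequency in study corpus, keyness)
results = [
    ("police", 120, 85.3),
    ("court", 95, 61.0),
    ("machete", 12, 58.7),  # high keyness, but likely driven by one story
]

MIN_FREQ = 20  # arbitrary cutoff; tune to the size of the corpus

# Keep only terms frequent enough to reflect sustained coverage
filtered = [(w, f, k) for (w, f, k) in results if f >= MIN_FREQ]

for word, freq, keyness in filtered:
    print(f"{word}\t{freq}\t{keyness}")
```

This doesn't replace reading the underlying articles, but it flags which high-keyness terms deserve a closer look before being treated as evidence of a broader shift.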
We can run another test, one that revealed something I had overlooked in the results above: comparing one subsection of our corpus to the larger corpus. The results are below.
The most interesting thing here is the surprisingly high keyness of the term 'Turkey' in the subcorpus from before the protests compared to the full corpus. 'Turkey' and 'turkish' do not even appear on this short list of results for the 'after' period. If Turkey's press is famous for its nationalism, then why do these data seem to show that 'turkey' and 'turkish' would be unusual if they were found after the protests? What does this suggest about the topics or tone chosen by the newspaper?
As always, there are myriad questions that can be asked of the results I wrote about in the first, second, third, and fourth parts of this project as well. I'll continue to ask them as we delve deeper.