Digital humanities work allows us to ask new questions, questions that could not previously have been asked.
After settling on the conceptual framework for my project, I picked a newspaper and an event and began collecting and analyzing data. My data are 1065 articles from the domestic section of Today’s Zaman, published between May 18 and July 15. I chose these dates because the window covers roughly two weeks before and two weeks after the height of the Gezi Park protests in Istanbul and across Turkey. The protests became an extreme example of a crackdown on the news media in Turkey.
Another part of the project attempts to show visually the relationship between these media freedom violations, their effects, the variables that influence media freedom, and press freedom indices. Statistics such as corruption measures, journalist killings and arrests, newspaper ad revenue, circulation, and readership were easy to find. The majority of this first stage of research therefore consisted of the text analyses.
I searched through many tools to analyze my corpus (the 1065 articles), but ended up focusing on a few. Many tools, such as ScatterPlot, looked interesting but proved untrustworthy or unrevealing. The first tool below is where the exciting part began.
Voyant offers a series of free web tools into which anyone can easily load text data. TermsRadio, RezoViz, and ScatterPlot are the three Voyant tools I’ll talk about here. They are also tools that I quickly moved away from, and they cannot serve as comprehensive analyses in themselves. This is mainly because Voyant publishes none of the math or code behind its analyses, so I can neither fully trust it nor fully understand what comes out the other side.
Still, they are good starting points. TermsRadio charts the frequency of words in each .txt file. The picture below shows the result when I load my corpus as individual files and apply stop words (words that appear frequently and regularly in English, like ‘the’ and ‘of’, but also numbers and words like ‘however’ and ‘nevertheless’).
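The per-file frequency count that TermsRadio performs can be reproduced in a few lines, which also makes the opaque part of the tool concrete. The sketch below is my own illustration, not Voyant’s actual code: the stop-word list is a small hand-picked stand-in for Voyant’s (unpublished) one, and the sample sentence is invented.

```python
import re
from collections import Counter

# Small illustrative stop-word list; Voyant's real list is much larger
# and also filters numbers, 'however', 'nevertheless', etc.
STOP_WORDS = {"the", "of", "a", "and", "to", "in", "is", "that",
              "however", "nevertheless"}

def term_frequencies(text):
    """Count word frequencies in one article, skipping stop words."""
    words = re.findall(r"[a-z]+", text.lower())  # letters only, so numbers drop out
    return Counter(w for w in words if w not in STOP_WORDS)

# Invented sample text standing in for one .txt article file.
freqs = term_frequencies("Police cleared Taksim Square; the police then withdrew.")
print(freqs["police"])  # 2
```

Run over each .txt file in the corpus, this yields exactly the per-file counts that TermsRadio plots.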
The chart shows the frequency of highlighted words across the corpus. The top bar displays the entire corpus and the highlighted words (red=’police’, green=’turkish’, yellow=’party’), while the larger chart zooms in on specific .txt files. Aaaaaand it looks like it reveals nothing. But what matters in this text analysis is not the ordering of articles within a single day (the hour at which they happened to be published tells us little), but the chronology of articles over time, by day. The next picture displays the corpus aggregated into days.
Now you can see the trends much more clearly. I’ve highlighted in this picture the period just before the protests began (May 28) and a couple of days after. As expected, words like ‘Taksim’ and ‘Gezi’ appear more often in the articles starting May 28. Words related to ‘protest’ increase in frequency just after that. The word ‘police’ fluctuates before May 28, then remains at a high frequency after the protests begin. This link opens the frequency chart in its interactive mode.
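The aggregation step itself is simple to sketch, assuming each article’s date is recoverable from its filename (the date-prefixed naming scheme below is hypothetical, as are the mini-articles): group the files by day, then sum the word counts within each group.

```python
from collections import Counter, defaultdict

# Hypothetical (filename, text) pairs standing in for the article files;
# the date prefix of each filename plays the role of the publication day.
articles = [
    ("2013-05-27_001.txt", "police patrol taksim square"),
    ("2013-05-28_001.txt", "gezi park protest in taksim"),
    ("2013-05-28_002.txt", "police respond to gezi protest"),
]

# One Counter per day: word counts summed over that day's articles.
daily = defaultdict(Counter)
for filename, text in articles:
    day = filename.split("_")[0]          # e.g. "2013-05-28"
    daily[day].update(text.lower().split())

print(daily["2013-05-28"]["gezi"])  # 2
```

Plotting each day’s counts in date order reproduces the day-aggregated view, and smooths out the meaningless within-day ordering.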
TermsRadio, although a simple tool, can be read in a number of ways. Word frequencies can be examined with either a within-term or a between-term tactic. For example, we can ask the within-frequency question: why does ‘police’ spike at certain points and dip at others? Why does ‘police’ remain at such a high frequency even after the height of the protests? Similar between-frequency questions would be: why does ‘police’ spike when words like ‘gezi’ also spike, or when words like ‘party’ dip?
I’ve also highlighted only the terms I thought would be interesting to view. The tool, however, allows us to highlight dozens of words, not all of which are nouns like ‘Turkey’ or ‘minister’. What would a view of ‘according’ or ‘however’ look like across the corpus? And what would that say about the way information was presented in the paper?
The dip towards the very end of the picture above is the result of a Saturday when only two articles were published (always consider sample size). But the yellow dip before that is not. Instead, it raises the question: why does ‘taksim’ dip while ‘gezi’ does not? What can this tell us about the way this newspaper began to refer to the protests?
And it is always important to remember that our questions would more accurately be phrased, “Why does this tool show the frequency of the word ‘police’ spiking or dipping at certain points?” Any single tool is imperfect (a useful thing to keep in mind when cleaning your data for machine-readability), and large-scale analyses of text corpora are already error-prone as it is.