This is the second tool review and text analysis in this series. The first is here.
ScatterPlot irritates me and fascinates me at the same time. It purports to be able to measure how each word corresponds to other words in terms of frequencies. It uses “statistical analyses” to set each text input as a dimension then condense it into an easily visualized 3D or 2D space. I have no idea what that means. I’m not sure anybody is meant to be able to understand what “statistical analyses” are. Below is an example, using the first several days of my corpus.
If you look at the axes of this ‘3D’ graph, there are no labels. Although the tool declares that it plots frequencies of words, the axes run into negative values, and it seems strange that a word could have a frequency of -1. In the upper left, three percentages report how much is explained by each of the three dimensions, which are represented by the two axes and the fill of the blue-labeled terms (I assume).
The kicker about this tool is that it may be graphically displaying how many and which topics are covered in a given period. But we need to know more about it.
The graph clearly has three points. Terms like ‘attack’ and ‘bombings’ appear at one point. ‘Court’ and ‘law’ appear at another, and ‘party’ and ‘parliament’ at the last. The articles surrounding those terms speak to a great extent about those topics.
This graph perplexes me because it seems it could be incredibly revealing, but I find it hard to trust because I cannot confirm what it does. Luckily, there are some possible explanations that would allow me to use this tool.
The axes may be measuring how many standard deviations a term sits from a mean frequency, which would explain how a value can be negative. It is less clear what each axis represents, but if the description is correct, similar articles may be combined into one axis. The third axis (for a 3D graph) is represented in the fill of the highest frequency terms (with stop words applied).
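To see why a standardized scale would produce negative numbers, here is a minimal sketch in Python. It is only an illustration of the standard-deviation idea, not a description of what ScatterPlot actually does; the counts are invented and the `standardize` helper is a hypothetical name of my own.

```python
import numpy as np

# Hypothetical raw counts of one term across five articles (invented data).
counts = np.array([0.0, 2.0, 5.0, 1.0, 12.0])

def standardize(values):
    """Return z-scores: distance from the mean in units of standard deviation."""
    return (values - values.mean()) / values.std()

z_scores = standardize(counts)

# Articles that use the term less often than average get negative scores,
# even though no raw count is ever below zero.
print(z_scores)
```

If the tool rescales frequencies in some way like this, a value near -1 would simply mean "about one standard deviation below average", not a negative count.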
The percentage attached to each dimension may report how much of the variation in term frequencies that dimension accounts for. This is the concept behind principal component analysis (PCA). PCA takes all the variables measured for each case and combines them, with a weight for each variable, into a small number of new variables that capture as much of the case-to-case variation as possible. For example, this tool may be calculating to what extent the frequency of the term ‘court’ moves with the frequency of ‘suspects’. ‘Suspects’ appears between the ‘attacks’ point and the ‘court’ point, which may indicate a high correspondence with both ‘attacks’ and ‘court’.
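A rough sketch of PCA on a toy term-frequency table may make this more concrete. Everything here is an assumption for illustration: the counts are invented, the `pca` function is my own hypothetical helper built on NumPy's singular value decomposition, and I cannot confirm that ScatterPlot computes its coordinates this way.

```python
import numpy as np

terms = ["attack", "bombings", "court", "law", "party", "parliament"]

# Invented counts: each row is an article, each column is a term.
freq = np.array([
    [9, 7, 1, 0, 0, 1],
    [8, 6, 0, 1, 1, 0],
    [1, 0, 8, 7, 0, 1],
    [0, 1, 9, 6, 1, 0],
    [1, 0, 0, 1, 9, 8],
    [0, 1, 1, 0, 8, 7],
], dtype=float)

def pca(matrix, n_components):
    """Return term coordinates and the share of variance per component."""
    # Center each term's column so the components describe variation around the mean.
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2
    explained = variance / variance.sum()
    coords = vt[:n_components].T  # one row of coordinates per term
    return coords, explained[:n_components]

coords, explained = pca(freq, 2)

for term, (x, y) in zip(terms, coords):
    print(f"{term:12s} {x:+.2f} {y:+.2f}")
print("share of variance explained:", np.round(explained, 3))
```

In a table like this one, terms that rise and fall together across articles end up close to each other in the two-dimensional coordinates, and the printed shares play the role of the percentages in the corner of the graph. A term that sits between two groups would be one whose frequency tracks both.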
Unfortunately, when you change one visual setting on the graph, the image displayed shows a very different picture:
Below is an example of the graph created with articles after the height.
As you can see, the points emphasized are shown in a completely different way, and different terms characterize each point. Does this represent a more varied array of topics in the articles? The articles to the far left around the term ‘children’ are highly anomalous within the corpus. They are two articles about alternative learning opportunities for children during the summer. Might this indicate that we still have a relatively small sample of articles for this period?