The Google Ngram Viewer is old news by now, and several people have blogged about it. A few blogs were quick to identify the main issue around its reliability: OCR errors. This famous post about the medial (or long) s is particularly amusing. Nonetheless, I’d like to return to the topic after a discussion I had with my supervisor some time ago.
In brief, the Google Ngram Viewer allows one to plot occurrences of a search query across the vast collection of books that Google scanned and OCR’d for Google Books. To get a better idea of what it does, have a look at this Tumblr that collects interesting Ngrams, or try it out for yourself.
While looking at the Viewer, my supervisor and I searched for a few composers, particularly opera composers. We started by confirming our most obvious assumptions, like Beethoven being utterly dominant (in most corpora, and surely for a lot more than just Fidelio), closely followed by Purcell (in the English corpus) and Mozart. What baffled us for a bit was that most of the composers seemed to experience a big drop around the 1950s and 1960s.
Once at home, I looked up some well-known opera composers from the 17th to the early 20th century in the English corpus and confirmed this trend (N.B. this selection is not at all representative of the full landscape of opera composers, just the first names that came to mind).
Now, what perplexed us was that academic and non-academic literature about composers probably grew after the 1950s rather than diminished. But in fact, the Ngram Viewer normalizes the occurrences by the number of books published every year. This is why on the Y axis we see percentages rather than numbers of occurrences. So why is the drop there?
After some head scratching, we came to the approximate conclusion that, starting in the 1950s, book publishing began targeting a larger and growing audience, resulting in a wider range of topics and genres. Notwithstanding the normalization, composers get submerged (as a percentage of total published words per year) by many other printed words.
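To make the arithmetic concrete, here is a minimal Python sketch of how a name's share can fall while its raw count rises. The function name and every figure in it are invented for illustration; none of them come from Google Books, and the Viewer's actual computation may differ in its details.

```python
def relative_frequency(occurrences, total_words):
    """Return occurrences as a percentage of all words printed in a year."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * occurrences / total_words


# Hypothetical counts, chosen only to illustrate the effect:
# mentions of a composer double, but total printed words grow fivefold.
hypothetical_years = {
    1940: {"occurrences": 1_000, "total_words": 10_000_000},
    1970: {"occurrences": 2_000, "total_words": 50_000_000},
}

shares = {
    year: relative_frequency(**counts)
    for year, counts in hypothetical_years.items()
}

for year, share in shares.items():
    print(f"{year}: {share:.4f}% of all words")
```

With these made-up numbers the composer is mentioned twice as often in 1970 as in 1940, yet the plotted percentage would be lower, because the denominator grew faster than the numerator.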
My knowledge of the history of book publishing is very limited, but after some digging around I stumbled upon A History of British Publishing by John Feather, which seems to confirm this idea. Specifically, the increase in adult literacy drove the success of paperback editions that “made [their] way into the British publishing scene in the 1950s…” and were ubiquitous by the 1970s. Assuming that a similar phenomenon took place in other English-speaking countries around the same years, we can tentatively take this as a plausible explanation for the drop.
It goes without saying that, because of the noise caused by OCR, one should treat any information inferred from the Google Ngram Viewer with some suspicion. However, because of the corpus size, it is a unique tool, and it shows the potential (and the risks) of using numeric “evidence” to formulate and discuss theories about text-based culture.
I also find it somewhat amusing that Google, by means of relevance rather than complexity, finally brought some computational linguistics into Science magazine.
Trivia: apparently, the Ngram Viewer, or some components of it, is written in Python, as I could see from one error message that I got during one of my tests ;)
Why do you think the drop may be there? Leave your thoughts in the comments!
John Lavagnino (2011-11-09 17:31:08): It’s a great problem, and I think the paperback phenomenon may well be the explanation. A possible difficulty, though, is that Google Books is mostly based on the contents of academic libraries, and 1950s paperbacks may not be that well represented in their collections: they aren’t usually academic material, and are often reprints of books libraries might already have. I wonder if there is also or instead a change in the profile of academic publishing in this period, especially because in the USA funding for the hard sciences greatly increased in the postwar era.
Raff Viglianti (2011-11-10 10:16:03): Hello John, thanks for your comment! Yes, that makes sense. It would be interesting to test other humanities topics to see whether there is a general drop that might have been caused by more publications in the hard sciences. Though I wonder how well represented academic publications are in the Google Books corpus? Would a change within academic publications be sufficient to cause the drop?