Yesterday, Science published “Quantitative Analysis of Culture Using Millions of Digitized Books,” an analysis of all the words in about “4% of all books ever printed.” The article (modestly) heralds the arrival of Culturomics, a “new science” which “extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.” See here for more background on the project and here and here for some of the press coverage. I don’t think there is much “rigorous science” in an analysis based on a clearly biased sample: 5 million out of the 15 million books Google has scanned so far, selected for “the quality of their OCR and metadata.” More important, what can you really conclude about culture without the context for the words analyzed? But, following up on its investment in Zynga, Google can now use this database of five million books to provide us with one of the most popular Web entertainments today, a game. This one is called the “Books Ngram Viewer.” You can enter a word or a number of words and trace the frequency of their occurrence in the database over time.
In the interest of the readers of this blog, I put in “data,” “information,” and “knowledge,” and got the following results. Looks like “information” triumphed over “data” and “knowledge” sometime in the 1990s, just when I started to write about it! Did I cause a Culturomics Tipping Point?
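For readers curious about what such a chart is counting, here is a minimal Python sketch. It assumes that each trajectory is simply a word's yearly count divided by the total number of words printed that year. The toy corpus and the `ngram_trajectories` function are made up for illustration; this is not Google's code or data.

```python
from collections import Counter, defaultdict


def ngram_trajectories(corpus, words):
    """Return {word: {year: relative frequency}} for the given words.

    corpus is an iterable of (year, text) pairs, a hypothetical stand-in
    for the scanned books. Tokenization is a naive lowercase whitespace
    split, so punctuation is not handled.
    """
    counts = defaultdict(Counter)  # year -> count of each word
    totals = Counter()             # year -> total number of words
    for year, text in corpus:
        tokens = text.lower().split()
        counts[year].update(tokens)
        totals[year] += len(tokens)
    return {
        word: {
            year: counts[year][word] / totals[year]
            for year in sorted(totals)
            if totals[year]  # skip years with no words at all
        }
        for word in words
    }


# Made-up toy corpus, for illustration only.
toy_corpus = [
    (1950, "knowledge is power and knowledge grows"),
    (1990, "information wants to be free"),
    (1990, "data becomes information"),
]

trajectories = ngram_trajectories(
    toy_corpus, ["data", "information", "knowledge"]
)
for word, series in trajectories.items():
    print(word, series)
```

In this toy corpus, “information” accounts for 2 of the 8 words dated 1990, so its frequency for that year is 0.25. Note that a count like this says nothing about the context in which the word was used.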
Speaking of data, here are some interesting estimates from the article…
On the recent explosion in linguistic creativity or the proliferation of words like “culturomics”: “The addition of ~8500 words/year has increased the size of the [English] lexicon by over 70% during the last 50 years.”
On information overload or filter failure: “52% of the English lexicon – the majority of words used in English books – consists of lexical ‘dark matter’ undocumented in standard references.”
On the increasingly limited range of our memories or what the Internet is doing to our brains: “We are forgetting our past faster with each passing year.”
On Lady Gaga and where she’ll be in 2020: “The most famous people alive today are more famous – in books – than their predecessors. Yet this fame is increasingly short-lived.”
…and an invitation to participate in the new science: “The full dataset, which comprises over two billion culturomic trajectories, is available for download or exploration at http://www.culturomics.org.” Do you want to be a Culturomist?