This might be utterly obvious, but let me posit that one of the most compelling features of the current information avalanche is that (if you will): “big-data solves the problem of big-data.”
The problem is that the amount of information has expanded so much that it has become almost impossible to work with or comprehend in its totality. But new techniques that actually rely on the huge scale make that huge scale manageable and indeed useful. And though it is certainly an overstatement to say it “solves the problem,” I’d argue this is the right way to think about it.
Examples abound, if we look at things the right way. For example, computer translation was hard until the “test data” went from the billions to one trillion words, and then the machines got talking (as Google’s Peter Norvig explains here and Steven Levy tells here). Likewise, Jeff Jonas of IBM recounts a situation years ago when adding more information to a database of people actually shrank the number of records on individuals: he was able to identify and consolidate duplicate entries.
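To make the Jonas anecdote concrete, here is a minimal sketch of how adding a record can reduce the number of distinct people in a database. It is an illustration only, not his actual system: the function name `count_entities`, the `phone` and `email` fields, and the sample records are all invented for the example, and the matching rule (two records are the same person if they share any identifier) is a deliberately crude assumption.

```python
def count_entities(records, keys=("phone", "email")):
    """Count distinct people, assuming records that share any
    identifier in `keys` describe the same person (hypothetical rule)."""
    parent = list(range(len(records)))  # union-find forest, one node per record

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (field, value) -> index of the first record carrying it
    for i, record in enumerate(records):
        for key in keys:
            value = record.get(key)
            if value is None:
                continue
            j = first_seen.setdefault((key, value), i)
            parent[find(i)] = find(j)  # merge the two groups (no-op if i == j)

    return len({find(i) for i in range(len(records))})


# Invented sample data: two records that look like different people.
before = [
    {"name": "Liz Smith", "phone": "555-0101"},
    {"name": "Elizabeth Smith", "email": "liz@example.com"},
]
# One more record carries both identifiers and links the first two.
after = before + [
    {"name": "E. Smith", "phone": "555-0101", "email": "liz@example.com"},
]

print(count_entities(before), count_entities(after))
```

With two records the sketch sees two people; with three it sees one, because the new record supplies the evidence that ties the earlier ones together.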
But the inspiration for these musings is an article in last week’s The Economist, “The science of science: How to use the web to understand the way ideas evolve.” Researchers came up with a clever way to identify and classify texts by grasping meaning from their content, independent of what the authors felt the classification ought to be.
This lets machines parse huge volumes of text, something people can’t do, or can’t do well. Academic authors label the subject areas of their papers, but sometimes use far too many labels as a way to trick people into reading them, or are limited to just five labels, which may be too narrow. Sometimes they are required to use pre-determined labels from library science, which fail to account for emerging areas of scholarship. So for example, Adam Smith never regarded himself as an economist (the term didn’t exist in that context); rather, he was a moral philosopher. This system would place him alongside Malthus, a pastor by trade and demographer by study, who incontestably wrote on economics.
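As a rough illustration of grouping texts by what they say rather than by how their authors label them, the sketch below compares word counts with cosine similarity. This is a stand-in technique chosen for brevity, not the method the researchers in the article used, and the three word lists are invented placeholders, not quotations from Smith, Malthus, or anyone else.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Similarity of two texts based only on the words they contain."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Invented stand-ins for texts carrying different author-supplied labels.
philosopher_text = "wages labour price market trade wealth"
pastor_text = "population food wages labour price scarcity"
sermon_text = "faith grace prayer charity virtue"

# The first two share vocabulary despite their labels; the third does not.
print(cosine_similarity(philosopher_text, pastor_text))
print(cosine_similarity(philosopher_text, sermon_text))
```

The point of the toy is only that content can pull two differently labeled texts together. A real system would need far more text, and that need for volume is the theme here.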
Moreover, the system enables one to see how ideas molt and meld over time, just as Smith and Malthus seemed out of step with their “professions” in their time but were foundational for the new field of economics. And it bears repeating: the technology described in the article works because there is enough data to make inferences about meaning. As the article states:
“Citation indices, which work only where publications refer to their sources explicitly, form a tiny nebula in the digital universe. News articles, blog posts and e-mails often lack a systematic reference list that could be used to make a citation index. Yet they, too, are part of what makes an idea influential.”
This opens up new areas for researchers to amass sources. For instance, the huge area of “gray literature” (as it’s called in library science), which sits slightly outside the mainstream publication world, is now more easily retrievable and citable.
It also indirectly overcomes Google’s inherent shortcoming. Google’s PageRank algorithm, at its most basic level, counts inbound links, akin to academic citations, and presumes that a page with more of them is more relevant. But basing relevance on link structure invites imperfection because ordinary people are themselves imperfect and may not link to the ideal content, thus creating suboptimal search results. The technique described in the article may help remedy this.
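For readers who want to see the link-counting idea in miniature, here is a simplified sketch of PageRank by repeated redistribution of scores. It is a textbook-style toy, not Google’s production system: the function name `pagerank`, the damping value of 0.85, the fixed 50 iterations, and the four-page link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores

    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Invented graph: three pages link to "c", and "c" links back to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

Page "c" comes out on top simply because more pages point to it, whether or not it is the best content. That is the weakness described above: the score reflects what people chose to link to.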
The upshot is that we are generally familiar with the idea that big-data seems to exhibit “inverse scaling features”: the more data you add, the better the system gets (rather than deteriorating, as most systems do under more load). But a step beyond this point is that “big-data solves the problem of big-data.” With so much info around, the only way to tackle it is to use its huge size to sort itself. This idea sounds like a serpent eating its tail, but it may be more than that.