Wednesday, March 27, 2013

Big Data

Probably since forever philosophers and mathematicians have dreamed of mechanizing thought, of removing judgment from thinking. The newest aspirant in this millennial quest is Big Data, and, not surprisingly, there is an eponymous book (excerpted here) with the following provided summary:

This revelatory exploration of big data, which refers to our newfound ability to crunch vast amounts of information, analyze it instantly and draw profound and surprising conclusions from it, discusses how it will change our lives and what we can do to protect ourselves from its hazards.

Big Data (BD) is the new New Thing, the method by which diligence can substitute for thought. The idea actually has a certain charm as it reverberates with our sense of justice. Collecting data is hard work, but it is generally the kind of work for which effort is rewarded. Work hard and you will do well. Put in the hours and the data will pile up.  It’s an activity that rewards virtue.

In this it is entirely unlike coming up with a plausible analysis, aka thinking. This activity is totally unfair.  Lazy people can have excellent ideas. Sloth is no bar to insight and profligacy no guarantee of intellectual stagnation.  Here even the wicked, sloppy, and lazy can prosper. How unfair.

In a just world, virtue would be rewarded. In a just world, hard work would guarantee enlightenment. We don’t live in a just world. Big Data is the unfounded belief that this can be remedied, that the hard work of data gathering can substitute for the caprice of thought. It cannot, and, unfortunately, believing it can is likely to deform scientific practice. To see this, consider the following quote:

The era of big data challenges the way we live and interact with the world. Most strikingly, society will need to shed some of its obsession for causality in exchange for simple correlations: not knowing why but only what. This overturns centuries of established practices and challenges our most basic understanding of how to make decisions and comprehend reality (10).

And that is precisely the problem. Big Data is part of an enterprise aimed at reforming scientific practice. Dump why, aim for what. However, contrary to the prevailing conception, without a model/theory it is not clear what it means to just “look for correlations.” Data do not speak for themselves. So gathering lots of data will not result in eloquent models that understand the whats that matter. Big data sets cannot pull themselves up by the bootstraps (nothing can pull itself up by the bootstraps!), thereby yielding useful models. So, without explicit, thoughtful models that guide the enterprise, we will be saddled with implicit models that obscure (and trivialize) what we are doing (as noted here, without good models it is even difficult to separate good data from bad).

None of this would be worth mentioning were it not for the mesmerizing powers of Big Data. We have seen this before (here, and here for example). Big Data is the modern avatar of classical empiricist methodology. Its appeal is its promise to provide insight without intellectual sweat. This time, however, Empiricism has found a slogan attached to a technology, Google being the all-powerful mantra. Not surprisingly, money-making slogans can be very enticing and Google intellectuals (e.g. Peter Norvig) can gain powerful platforms. And though I am quite sure that, like all other (empiricist) attempts to circumvent thought, this too will ultimately fail, its demise may not come soon enough to prevent serious damage. So when you hear the siren calls of Big Data I suggest the following prophylactic procedure: repeat Kant’s dictum to yourself, viz. data without theory is blind, data without theory is blind, data without theory is blind…and hope it soon goes away.


  1. Hear, hear, but cleverness won't hold its own over Big Data unless the people who think they're smart can find the right places to be clever. I suspect that you're right about the link between Big Data-ism and the currently rising tide of sanctimoniousness and process-worship in the 'Anglosphere'.

  2. I think I am missing something important here, so instead of jumping to conclusions I ask for clarification first. It had been a while since I read my Kant, so I looked it up to confirm that my memory [suggesting Norbert had only quoted half of the dictum] was correct:

    Thoughts without [intensional] content (Inhalt) are empty (leer), intuitions without concepts are blind (blind). It is, therefore, just as necessary to make the mind's concepts sensible — that is, to add an object to them in intuition — as to make our intuitions understandable — that is, to bring them under concepts. These two powers, or capacities, cannot exchange their functions. The understanding can intuit nothing, the senses can think nothing. Only from their unification can cognition arise. (A50-51/B74-76)

    So I imagine Norbert does not want to persuade anyone to abandon the search for data entirely, but he seems to accuse 'empiricists' or 'Google intellectuals like Norvig' of collecting data just for the sake of collecting data in the absence of ANY theory - is this the charge?

  3. How relevant 'Big Data' (i.e. 500 billion words) is to structural linguistics is questionable, but we certainly need to take an interest in 'Middle-Sized Data', the 30-100 million word corpora that language learners between the ages of 3 and 10 are looking at (depending partly on age and SES).

    I, for example, feel rather embarrassed that I can't tell my students how many times they need to see an object NP preceding a PP in a corpus, but not vice versa, before they can conclude with some confidence that there is an NP<PP rule/principle in operation. I've just requested my uni library to order a book called 'Frequency Effects in Language Acquisition', which seems like it might be highly relevant to this kind of query.

    1. Some patterns don't start showing up before the 100 million word mark. For example, tense distribution for really low frequency constructions.

    2. So (noticing this reply after a rather long delay, but it is apropos of topics that are often discussed here), if the relevant facts can be found in the intuitions or performance of 7 year olds, they are presumably projected (in the sense of Peters' discussion of the projection problem) from more 'basic' data found in smaller corpora, whereas if they can only be found reliably in the intuitions or performance of college students, they might be acquired as features of the language with some degree of independence from other features.
