Response to “Eight Problems with Big Data”

April 26, 2012 by
Filed under: Analytics, Business Impacts, Data Analysis 

After reading Jay Stanley’s ACLU article on “Eight Problems with Big Data,” it is worth reflecting on what could be construed as a fear-mongering indictment of the use of big data analytics and the implication that big data analytics and its implementation of data mining algorithms are tantamount to all-out invasion of privacy. What is interesting, though, is the presumption that privacy advocates have been “grappling” with data mining since “not long after 9/11,” yet data mining was already quite a mature discipline by that point in time, as was the general use of customer data for marketing, sales, and other business purposes. Raising an alarm about “big data” and “data mining” today is akin to shutting the barn door decades after the horses have bolted.

However, to be fair, it is worth reviewing some of the author’s points, all of which I quote directly from the article:

  1. “It incentivizes more collection of data and longer retention of it. If any and all data sets might turn out to prove useful for discovering some obscure but valuable correlation, you might as well collect it and hold on to it. In long run, the more useful big data proves to be, the stronger this incentivizing effect will be – but in the short run it almost doesn’t matter; the current buzz over the idea is enough to do the trick.”

Of course, there is the attitude that more collection of data and longer retention might turn out to be of value, yet without a crystal ball to tell you exactly what data is going to be useful to you in the future, you’d have to store all the data. A case in point: one of our (social services) customers expressed regret that they only began to track certain pieces of information long after their systems were initially put in place. The missing data opened up scenarios in which benefits might be provided to people who did not truly qualify; in effect, the gap allowed people to game the system. But because the agency did not know it would need that data when the application was deployed, those information gaps remain.

And even if organizations do feel incentivized to collect and retain data, few of these organizations have ironed out the gaps in their own process maturity enough to be effective at extracting much value. In the foreseeable future, the costs and efforts of capturing, managing, archiving, and ultimately even trying to find the data they need are going to far exceed the value most businesses can derive.

  2. “When you combine someone’s personal information with vast external data sets, you create new facts about that person (such as the fact that they’re pregnant, or are showing early signs of Parkinson’s disease, or are unconsciously drawn toward products that are colored red or purple). And when it comes to such facts, a person a) might not want the data owner to know b) might not want anyone to know c) might not even know themselves. The fact is, humans like to control what other people do and do not know about them – that’s the core of what privacy is, and data mining threatens to violate that principle.”

I feel compelled to push back on this one on a number of fronts. First of all, data analysis does not “create facts” – the discovery that a person exhibits purchasing behaviors that are consistent with pregnancy does not make that person pregnant (at least according to what I learned in biology class). Granted, though, there may be situations in which people will mistake analytical results for facts; however, that can happen with any inference, such as guessing that a celebrity is pregnant based on the exposure of her bump in the tabloids.

On the second front, most interactions (business or otherwise) involve an exchange of information. People who want to control what other people know and do not know about them must be aware of this fact. At the same time, each of us derives some value from the interaction. Instead of suggesting that people need to control information, a better approach might be to suggest that they weigh the value they get from the interaction against the cost of exposing the data.

For example, when you search for information using a search engine, you benefit by getting access to the information you were looking for. On the other hand, the search engine has to capture what you are looking for in order to help you, and the search histories help refine the algorithms and make the results more precise. So you get some information in return for providing some information. It is up to you to decide whether you got a good deal or not.

And again, in some situations big data analysis improves the availability of data that can help inform a decision about what is (or is not) a fact. For example, large-scale analysis of clinical health care information can help a provider reach a more precise (and hopefully more accurate) diagnosis, and it can support effectiveness research that helps identify the treatment choices that have been most successful under similar circumstances.

  3. “Many (perhaps most) people are not aware of how much information is being collected (for example, that stores are tracking their purchases over time), let alone how it is being used (scrutinized for insights into their lives). The fact that Target goes to considerable trouble to hide its knowledge from its customers tells you all you need to know on that front.”

Actually, stores and many other businesses have been collecting information for decades, and much of that data is willingly offered because the information provider expects to get some benefit in return. Supermarket loyalty cards, frequent flyer programs, product registration forms, even that drawing you entered when you filled out an entry and dropped it in the box at the mall – each of these presents some opportunity for value in return for information. And when that data is used for insights that are beneficial, many people don’t have a problem with it. For example, my kids’ English teachers suggest looking up books they liked on Amazon and scanning the other suggested titles to find books that they might enjoy.

In many cases, the data is offered without any expectation – consider those who willingly allowed ratings companies to monitor their television viewing choices, or who regularly respond to form or telephone surveys. The point is that if you willingly offer information without any constraints or limitations, don’t be surprised when that data is used.

  4. “Big data can further tilt the playing field toward big institutions and away from individuals. In economic terms, it accentuates the information asymmetries of big companies over other economic actors and allows for people to be manipulated. If a store can gain insight into just how badly I want to buy something, just how much I can afford to pay for it, just how knowledgeable I am about the marketplace, or the best way to scare me into buying it, it can extract the maximum profit from me.”

All sales processes are driven by the need for the salesperson to influence the decision of the buyer, and therefore you could extrapolate that in every sales situation the buyer is being “manipulated.” The use of big data analytics does not change the core sales process; it merely informs the salesperson. In addition, the suggestion that a buyer is subject to manipulation is actually somewhat insulting to the buyer, who is also getting something out of the transaction, not just forking over the money.

If a company has a lot of insight but provides bad products, poor customer service, or limited warranty protection for its products or services, customers will still go to other places to get the things they want. On the other hand, if a company uses customer insight to provide better products and services, stock the kinds of products the customers want, and engage the customer in a relationship, customers will go there, even without the data mining.

Lastly, companies are usually in the business of generating profits, so it is disingenuous to fault them for wanting to extract the maximum profit. In fact, analyzing customer sensitivity to product pricing could show that lowering certain product prices would increase sales volume, leading to greater profits. In this case they are “extracting the maximum profit,” but not necessarily through increased prices.

  5. “It holds the potential to accentuate power differentials among individuals in society by amplifying existing advantages and disadvantages. Those who are savvy and well educated may get improved treatment from companies and government – while those who are poor, underprivileged, and perhaps already have some strikes against them in life (such as a criminal record) will be easily identified, and treated worse. In that way data mining may increase social stratification.”

It is not clear what big data has to do with this; even without data mining, those who are savvy and well educated may get improved treatment as a result of their savviness and education – they may make themselves better informed, or ask better questions. Those with a criminal record won’t be able to hide that either, since it is certainly legal to perform a criminal background check under defined situations. Again, that has nothing to do with big data and has a lot to do with the Bill of Rights, in retrospect (“habeas corpus” anyone?).

  6. “Data mining can be used for so-called “risk analysis” in ways that treat people unfairly and often capriciously – for example, by insurance companies or banks to approve or deny applications. Credit card companies sometimes lower a customer’s credit limit based on the repayment history of the other customers of stores where a person shops. Such “behavioral scoring” is a form of economic guilt-by-association based on making statistical inferences about a person that go far beyond anything that person can control or be aware of.”

An organization has a fiduciary responsibility to attempt to limit its risk, especially when it comes to offering credit or lending money. The models for risk analysis do (and probably should) take behavioral characteristics into account. For example, parachuting out of airplanes probably impacts your life insurance premium. You don’t need data mining to figure that out, just the standard statistical and probability analysis that actuaries have been doing for a long time.

Identifying the characteristics of a pool of individuals that increase the risk of defaulting on credit payments is a way of not only (1) protecting the corporate self-interest but also (2) protecting the interests of those who don’t default on their payments but are the ones whose interest rates are raised to account for loss and fraud.

Let’s put this concept in the perspective of the recent worldwide financial crisis that was, at its core, driven by unscrupulous practices in lending money to individuals who essentially could not afford to pay it back. One of the root causes? Simplified somewhat, subprime lending to individuals with low documentation or no documentation on their applications, coupled with the obfuscation of the real risks of default as the mortgage pools were continually mashed together, reconstructed, and re-rated by the ratings agencies, together eventually led to the 2008-2009 blowup. Perhaps in this case a little more data might have helped?

  7. “Its use by law enforcement raises even sharper issues – and when our national security agencies start using it to try to spot terrorists, those stakes can get even more serious. We know too little about how our security agencies are using Big Data, but such approaches have been discussed since the days of the Total Information Awareness program and before – and there is strong evidence that it’s being used by the NSA to sift through the vast volumes of communications that agency collects. The threat here is that people will be tagged and suffer adverse consequences without due process, the ability to fight back, or even knowledge that they have been discriminated against. The threat of bad effects is magnified by the fact that data mining is so ineffective at spotting true terrorists.”

I recall reading some papers after 9/11 showing that analysis of data about the hijackers revealed a high degree of relationship among them, implying that had we done the analysis, they could have been stopped. I actually looked back at some writings of author Malcolm Gladwell, from early 2003, which refer to “creeping determinism,” a term he notes was coined three decades earlier by Baruch Fischhoff, to refer to “the sense that grows upon us, in retrospect, that what has happened was actually inevitable.” To some degree, the fear of big data emanates from the thought that government agencies are able to predict who is or who is not a terrorist.

First of all, if this is true, then where were all the analysts on September 10th? The gap in terrorism prediction is probably a byproduct of the sheer volume of data. You have so much data that it is hard to know where to look. In the past, I have suggested that as opposed to the common metaphor of looking for a needle in a haystack, it is akin to looking for a specific needle in a huge field filled with stacks of needles.

However, I will grant the author his point here, but the failure is not with the data but rather with the people and the elected representatives they support, who rushed through legislation such as the PATRIOT Act that effectively abdicated the right to privacy. The widespread support for yielding their liberties as a way to save the country from terrorism is probably what Benjamin Franklin had in mind when he said “Those who would give up essential liberty to purchase a little temporary safety deserve neither liberty nor safety.”

On the other hand, there are situations in which security agencies have used their data analysis to spot risky situations and true terrorists. The problem for people like me and you is that since the catastrophes are prevented, we’ll never know…

  8. “Over time such consequences will lead to chilling effects, as people become more reluctant to engage in any behaviors that will put them under the macroscope (more about that in a future post).”

Actually, perhaps when people realize the degree to which they have willingly compromised their own private information for little return, they should become more reluctant to engage in those behaviors.

In summary, while many of the concerns are valid, they are orthogonal to the questions of using big data and data mining. The risks are not in the bigness of the data, but rather in the responsibilities and accountability of those using the data. Are the companies using data mining ensuring that the source data sets are of measurable quality that is suitable for the analyses? Do the consumers of the results of the data mining believe that the results are “factual” or that they are informative? Are they drawing correct or incorrect conclusions? Do they have the corporate maturity to integrate the analytical results to benefit the consumer communities? Do they publish their data use practices? Do they allow for individuals to opt out?

