The Madness of Big Data

To take a break from pure e-discovery, we wanted to focus on something fun. Now that Selection Sunday is behind us, it is time to get down to the important business of filling out brackets for the NCAA Division I Men’s Basketball Tournament. To make our job easier, Microsoft has kindly analyzed millions of data points from more than a decade’s worth of historical data, such as win/loss ratios and the proximity of campuses to game sites, to predict the winner of each game. (Spoiler alert: Bing predicted that Kentucky would take home the trophy.)

Whether Bing’s predictions, built on millions of data points, prove accurate remains to be seen. Either way, the feat pales in comparison to recent research from MIT, where scientists needed as few as three data points to identify individuals in three months of anonymized credit card transaction data covering 1.1 million people and 10,000 stores.

In January, a team from MIT published a paper in Science magazine showing that with the dates and locations of four purchases, they could identify 90 percent of consumers in the pool. Having price data simplified the task, so that “someone with copies of just three of your recent receipts—or one receipt, one Instagram photo of you having coffee with friends, and one tweet about the phone you just bought—would have a 94 percent chance of extracting your credit card records from those of a million other people.” Even when they made the data less precise, the scientists still identified more than 70 percent of users.
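For readers who want to see why a handful of purchases can single someone out, here is a minimal Python sketch of the general idea. It is not the MIT team's code or method: the data is randomly generated, and the function names (`synthetic_traces`, `unicity`) and all parameters are hypothetical choices made for illustration. Each "user" is a set of (day, shop) pairs with no name attached, and the sketch estimates what share of users match only themselves once an outsider knows k of their purchases.

```python
import random
from collections import defaultdict


def synthetic_traces(n_users=2000, n_shops=200, n_days=90, purchases=20, seed=1):
    """Build made-up 'anonymized' data: user id -> set of (day, shop) points."""
    rng = random.Random(seed)
    return {
        user: {(rng.randrange(n_days), rng.randrange(n_shops))
               for _ in range(purchases)}
        for user in range(n_users)
    }


def unicity(traces, k, seed=0):
    """Estimate the fraction of users uniquely matched by k known points."""
    rng = random.Random(seed)

    # Index each point to the set of users whose records contain it.
    index = defaultdict(set)
    for user, points in traces.items():
        for point in points:
            index[point].add(user)

    unique = 0
    eligible = 0
    for user, points in traces.items():
        if len(points) < k:
            continue  # not enough purchases to draw k known points
        eligible += 1
        known = rng.sample(sorted(points), k)
        # Keep only the users consistent with every known point.
        candidates = set.intersection(*(index[p] for p in known))
        if candidates == {user}:
            unique += 1
    return unique / eligible if eligible else 0.0


if __name__ == "__main__":
    traces = synthetic_traces()
    for k in range(1, 5):
        print(f"k={k}: share uniquely identified = {unicity(traces, k):.2f}")
```

The exact figures depend entirely on the assumed number of users, shops, and days, so they should not be read as a reproduction of the published results. The sketch only illustrates the mechanism: each additional known purchase shrinks the set of matching records, and that set often shrinks to one.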

So, before companies disclose their data or metadata, even if anonymized, they should consider the risks of re-identification. In light of this MIT study, stripping away personally identifiable information could prove insufficient to shield companies from liability for violations of privacy laws. Therefore, organizations should shore up their network security and take an inventory of their data repositories. If their information governance regime permits, they could purge this data; if not, they should consider encrypting this data. This step is especially important for global companies with operations in countries that have strict privacy regimes.

While big data sets can be treasure troves for NCAA prognosticators or companies seeking analytical insights about their customers from anonymized data streams, they can also be rich fodder for cyberthieves looking to steal sensitive information—even from seemingly innocuous data sets that have been scrubbed of obvious identifiers.

Bill Schiefelbein is senior vice president, e-discovery consulting at Conduent. He can be reached at

About the Author