But, second, let’s all accept that our million-fold increases in data have not led to million-fold, or even tenfold, increases in wisdom.
The grand vision of “Big Data”—or whatever term you like for information sets that are so huge we can barely figure out how to store them—is that it will lead to big insights on crucial connections between seemingly disparate phenomena.
Perhaps the movements and food choices of thousands of tagged lobsters in Maine will turn out to predict the wine harvests of Bordeaux. Perhaps victory patterns in Yankees games will be the key to understanding the behavior of gamblers in Macao. Perhaps financial markets will turn out to have some underlying pattern we never gleaned before because we hadn’t viewed market fluctuations in the context of cell biology, rainfall or the Large Hadron Collider.
Whether such hopes for Big Data inspire yearning or fear, they remain unfulfilled.
We’re not the first to note this mismatch between data dreams and realities. When Peter Drucker first wrote of an “information explosion” that was taking place, the year was 1969. “All of us feel—and overeat—very much like the little boy who has been left alone in the candy store,” he wrote. “But what has to be done to make this cornucopia of data redound to information, let alone to knowledge?”
For many of the entities that claim to work with Big Data, the best answer to Drucker’s question might be awkward silence.
If you work for the National Security Agency, you probably have no shortage of resources at your disposal: money, storage capacity, personnel. But if you work for a municipal department, chances are that you’re going to have to figure out how to do clever things on a shoestring.
Few, if any, entities embody this spirit more than New York City’s Department of Transportation.
In 2008, when Cordell Schachter took over as chief technology officer of the DOT, budgets were lean and the iPhone was new. But technology evolved fast. Within three years, the city’s Metropolitan Transportation Authority began, like its counterparts in many other cities, to track every subway train and bus in its fleet, creating a record of every movement each minute of the day. This data feed was made available to the public.
For the city’s DOT, which is separate from the MTA, this was a tremendous gift of resources. “We try to be very opportunistic here and use available information,” says Schachter. “And in the MTA’s case, we could get it for free.”
DOT staffers also noticed that the company NYC Bike Share was tracking the availability of shared bicycles in the city, creating yet another data feed that they could tap at essentially no cost.
Combining these data streams, DOT staffers developed an application called iRideNYC, which allows the user to stand anywhere in New York and see the closest transportation options (bikeshare, bus, train), their times of arrival and departure, and estimated walking time.
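The core of that feed-merging idea fits in a short sketch. Everything here is invented for illustration (toy coordinates, field names, a flat walking-speed estimate); the real app works from the live MTA and bikeshare feeds, but the shape of the computation is the same: pool stops from every mode, then rank them by walking time.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_options(user_lat, user_lon, feeds, limit=3, walk_m_per_min=80):
    """Merge stops from several feeds and rank them by estimated walking time."""
    options = []
    for mode, stops in feeds.items():
        for stop in stops:
            d = haversine_m(user_lat, user_lon, stop["lat"], stop["lon"])
            options.append({
                "mode": mode,
                "name": stop["name"],
                "walk_min": round(d / walk_m_per_min, 1),
            })
    return sorted(options, key=lambda o: o["walk_min"])[:limit]

# Toy stand-ins for the subway and bikeshare data streams.
feeds = {
    "subway": [{"name": "Chambers St", "lat": 40.7134, "lon": -74.0046}],
    "bikeshare": [{"name": "Warren St dock", "lat": 40.7152, "lon": -74.0091}],
}
print(nearest_options(40.7143, -74.0060, feeds))
```

The point of the design is that each new feed is just another entry in the dictionary; adding a ferry or streetcar feed would require no new logic.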
A few rules have guided the department in its work on the iRideNYC application. The first is that it must work on all devices—iOS, Windows, Android, even a desktop computer. The second is that the code that went into iRideNYC should be made publicly available. “We are taxpayer-funded,” Schachter says, “so why should taxpayers pay again if another jurisdiction wants to use this?”
DON’T FORGET TO USE YOUR OWN BRAIN
When you enter the Institute for Systems Biology, just south of Lake Union in Seattle, it feels like you’ve wandered onto the set of The Big Bang Theory. Whiteboards in hallway meeting areas are messy with equations and impromptu computations. Employees work among sundry toys such as hula hoops and Daleks, the robots from Doctor Who. The labs have chairs draped with white coats and machines such as ion-trap mass spectrometers.
Launched in 2000 by three University of Washington scientists, today ISB employs 200 staff members from 45 academic fields.
While precise definitions of the term “systems biology” will differ, what is essential is the modeling of complex networks. This is done in the belief that many of the mysteries of biology will be cleared up by, in effect, stepping back rather than just pushing forward. The migration habits of a single bird will make sense in the larger context of weather, for instance.
“I would say that an interaction between the model and the human brain has to make the prediction.”
— Nitin Baliga, Senior Vice President and Director, Institute for Systems Biology
Since 2000, ISB research has resulted in more than 1,300 papers published in academic journals. Most have titles—like “Genotoxic stress/p53-induced DNAJB9 inhibits the pro-apoptotic function of p53”—that are too abstruse for the layperson to understand.
To do this work, ISB has greatly increased both its data storage capacity and its processing power. But with this have come two big cautionary notes: For starters, it’s crucial not to get “wedded to the technology,” says Nitin Baliga, the institute’s senior vice president and director. And it’s equally crucial not to get wedded to the mountains of information generated by the technology.
“For us, the model is the beginning, and the model essentially organizes complex data in a way that lets human intuition kick in,” Baliga says.
When data sets are massive, Baliga warns, it is particularly easy to be seduced by meaningless correlations. (Harvard Law student Tyler Vigen, who has made a side project of unearthing spurious links, has, for example, found a 99% correlation between U.S. spending on space, science and technology and suicides by hanging, strangling and suffocation.)
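The trap Baliga describes is easy to reproduce: scan enough unrelated series against any target and a “strong” correlation will turn up by chance alone. A short illustration with purely synthetic data (nothing here comes from Vigen’s actual charts):

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(0)
years = 10  # ten annual data points, as in many spurious-correlation charts
target = [random.gauss(0, 1) for _ in range(years)]

# Scan 10,000 unrelated random series against the target; keep the best fit.
best_r = max(
    abs(pearson(target, [random.gauss(0, 1) for _ in range(years)]))
    for _ in range(10_000)
)
print(f"best correlation found among pure noise: {best_r:.2f}")
```

The “winning” series is meaningless by construction, which is exactly why a model’s output needs the gut check Baliga insists on.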
That’s why Baliga and his colleagues are always careful to take the data spit out by their computers and supplement it with other experiments, their own observations and a real-world gut check. In this way, taming Big Data is no different from the way that Drucker described the discipline of innovation: “Because innovation is both conceptual and perceptual, would-be innovators must … go out and look, ask and listen,” he wrote. “Successful innovators use both the left and right sides of their brains.”
Asserts Baliga: “Mathematicians and statisticians might feel that a model has to make a prediction. I would say that an interaction between the model and the human brain has to make the prediction.”
THE RACE IS TO THE SWIFT
It’s one thing to look at a pile of data and try to make sense of it over time. It’s another to look at an abundance of incoming data and make sense of it right away. For many companies, the latter has become a business necessity.
Take Uber, the ride-share company that threatens to put traditional taxis out of business. Every minute of the day, Uber’s computers must process a torrent of incoming information: the location of every logged-in user in the world, the location of every one of its drivers, the destination of every passenger, the status of every ride, the payment of every fare, the rating of every driver. Most of these facts and figures must be processed immediately, so that Uber can—without pause—seamlessly provide more than a million rides per day.
One thing that makes such speed and fluidity possible is a system called Apache Kafka—a tool that offers organizations a way to take in a deluge of data, trillions of messages a day, and process it all right away.
Kafka’s creators have compared it to a central nervous system, the instrument by which the Big Data of our brains gets translated instantaneously into movement by our bodies.
Kafka came into being at LinkedIn, where engineer Jay Kreps and two colleagues, Neha Narkhede and Jun Rao, increasingly found themselves resorting to patchwork and improvisation in their efforts to gain instant control of the information constantly pouring into the professional social network.
LinkedIn already had some Big Data capabilities, and, like many companies, was making use of a software framework called Hadoop. But this only got Kreps and his teammates so far. What they needed was a means by which to process data as it came in, not just in batches after the fact.
To take an analogy, Hadoop could tell you amazing things about a baseball game once it was over—the average speed of every pitch that was thrown, the precise distance of the home runs that were hit, the way that a shift in the infield thwarted what would have been a game-winning double. But in the middle of the fifth inning it couldn’t actually tell you who was winning.
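The batch-versus-stream distinction behind that analogy can be sketched in a few lines of Python (a toy score-keeper, not Hadoop’s or Kafka’s actual APIs):

```python
from collections import defaultdict

def batch_score(events):
    """Batch style: compute totals only once the full event log is available."""
    totals = defaultdict(int)
    for team, runs in events:
        totals[team] += runs
    return dict(totals)

class StreamScore:
    """Stream style: update the score the moment each event arrives."""
    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, team, runs):
        self.totals[team] += runs
        return dict(self.totals)

events = [("Yankees", 1), ("Red Sox", 2), ("Yankees", 3)]
live = StreamScore()
mid_game = [live.on_event(t, r) for t, r in events]
print(mid_game[1])          # the score after the second event, mid-game
print(batch_score(events))  # the final totals, only once the game is over
```

Both compute the same totals in the end; the difference is that the streaming version can answer “who’s winning?” at any point along the way.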
“The transition from periodic once-a-day computation to something that happens all the time is a tectonic shift.”
— Jay Kreps, Co-Founder and CEO, Confluent
So Kreps, Narkhede and Rao came up with Kafka, a platform that allowed LinkedIn to handle vast feeds and respond to users the instant they entered information—whether to change their status or to connect with others. Not long after, in 2011, LinkedIn “open-sourced” Kafka, meaning that anyone could access the code and offer improvements to it or otherwise adapt it to their own business needs.
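At the heart of Kafka is an append-only log that many consumers can read from, each keeping track of its own position, or offset. A simplified in-memory sketch of that abstraction (not Kafka’s real API, which adds topics, partitions and replication on top):

```python
class Log:
    """A minimal append-only log, the core idea behind a Kafka topic."""

    def __init__(self):
        self.records = []

    def append(self, record):
        """Add a record to the end; return its offset in the log."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset):
        """Return all records at and after `offset` (a consumer's position)."""
        return self.records[offset:]

log = Log()
log.append("status: looking for work")
log.append("connected with a colleague")

# Two consumers read the same log independently, each at its own offset.
print(log.read(0))  # a brand-new consumer replays everything
print(log.read(1))  # a caught-up consumer sees only what's new
```

Because the log is never modified in place, any number of systems can consume the same stream at their own pace—the property that let one feed serve LinkedIn’s many downstream applications.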
For the next several years, Kreps and his co-workers continued at LinkedIn, but they also found themselves in high demand as the open-source stewards of Kafka. “We started getting a lot of requests for help from companies trying to adopt it in a large way,” Narkhede recalls.
Among those Kafka enthusiasts were Twitter, Netflix, Square, Spotify, Pinterest and Uber. So in 2014, Kreps and his team decided to make a business of their own, founding Confluent, a company that helps other organizations use Kafka effectively. If Kafka is the internal combustion engine that anyone can acquire (and for free, no less), Confluent is the outfit that helps you build the rest of the car around the engine (for fees that it won’t discuss publicly).
In July, Confluent, now a 25-person team based in Palo Alto, Calif., raised $24 million in a new round of venture funding.
“The transition from periodic once-a-day computation to something that happens all the time is a tectonic shift,” Kreps says.
Indeed, as Confluent grows, it is finding that all sorts of organizations—in retail, telecommunications, healthcare and financial services—are discovering one of the secrets to taming Big Data: Not only do you have to use it smartly, you have to do so in a flash.
What will you do on Monday that’s different?
DO A DATA DIAGNOSIS
Have each member of your team review all of the data that he or she receives on a regular basis—and assess how useful it is in actually helping to make good decisions. If it’s not useful, fix it. Or kill it.
CONSIDER GETTING FLATTER
Traditionally, organizations have had multiple layers whose main function was to coordinate the passing of information back and forth through the enterprise. With more data readily available to everyone, ask yourself: Where are there opportunities to streamline?
TAKE A STEP BACK
“An adequate information system,” Peter Drucker wrote, must lead executives “to ask the right questions, not just feed them the information they expect. That presupposes first that executives know what information they need.” Do you?