MONDAY*

Issue #7 | September–October 2015

TAMING BIG DATA: 3 LESSONS FOR THE NEW INFORMATION AGE

EXECUTIVE BRIEFING

The grand vision of “Big Data” is that it will lead to big insights on crucial connections between seemingly disparate phenomena. So far, though, that promise remains mostly unfulfilled. And it has been that way for quite a long time. When it comes to data, “all of us feel—and overeat—very much like the little boy who has been left alone in the candy store,” Peter Drucker wrote in 1969. “But what has to be done to make this cornucopia of data redound to information, let alone to knowledge?” What follows are three stories of organizations from across the public, nonprofit and private sectors, each of which provides at least a partial answer to Drucker’s question. Their insights are keen: Be opportunistic. Don’t forget to use your own brain. And the race is to the swift.

TAMING BIG DATA

Everyone reading this article has access to limitless quantities of information. So, first, thanks for your time.


But, second, let’s all accept that our million-fold increases in data have not led to million-fold, or even tenfold, increases in wisdom.

The grand vision of “Big Data”—or whatever term you like for information sets that are so huge we can barely figure out how to store them—is that it will lead to big insights on crucial connections between seemingly disparate phenomena.

Perhaps the movements and food choices of thousands of tagged lobsters in Maine will turn out to predict the wine harvests of Bordeaux. Perhaps victory patterns in Yankees games will be the key to understanding the behavior of gamblers in Macao. Perhaps financial markets will turn out to have some underlying pattern we never gleaned before because we hadn’t viewed market fluctuations in the context of cell biology, rainfall or the Large Hadron Collider.

Whether such hopes for Big Data inspire yearning or fear, they remain unfulfilled.

We’re not the first to note this mismatch between data dreams and realities. When Peter Drucker first wrote of an “information explosion” that was taking place, the year was 1969. “All of us feel—and overeat—very much like the little boy who has been left alone in the candy store,” he wrote. “But what has to be done to make this cornucopia of data redound to information, let alone to knowledge?”

For many of the entities that claim to work with Big Data, the best answer to Drucker’s question might be awkward silence.

FRAME WORK

The Drucker Institute’s Phalana Tiller visits with Tim Leberecht, author of The Business Romantic, to talk about how businesses must balance hard data with real passion.

VERY HANDY

The New York City Department of Transportation’s iRideNYC app is built, in part, on data that the agency can obtain for little or no cost.

Organizations like to use the words “Big Data” to convey sophistication and dynamism. When you take a closer look, though, the ability to produce and store Big Data often seems more like the chef’s kitchen that was installed to allow for grand-scale entertaining. We plan to throw that party someday, but for now we’re more likely to be throwing Stouffer’s in the microwave. Or, to use a different analogy, we’ve got a river full of gold, but we’re still just learning how to pan it.

Many of the challenges to using Big Data effectively turn out to be similar to those of cross-pollinating ideas to spur innovation, a phenomenon that we examined in the last issue of MONDAY*. You can put professionally diverse people together in a room, but how do the insights from one field combine felicitously with those of another? And how, given all of the differences in the relevant professional languages (between, say, a sociologist and a physicist), do they even communicate?

Likewise, given the varieties in how information is collected and employed, how do you combine one data set with another and wind up with something useful? Consider a string quartet that gets recorded and released in three formats: on a vinyl record, on a CD and on iTunes. The relevant underlying data is the same—the sounds created by the four musicians—but the coding and the equipment needed to read it is in each case different. How do you mix and match?

Finally, how do you begin to come to grips with incomprehensible vastness? Perhaps placing a tracker on every car in the United States and collecting its location every 30 seconds for a year will, when combined with higher math, lead you to insights about human movement patterns. Or perhaps it will just take up a lot of disk space.

This makes it all the more important to examine organizations leading the way in helping us to make sense of Big Data. We’re going to look at three—one in government, one in the nonprofit world and one in business—that are off to a notable start. Each provides a valuable lesson that should be applicable in any sector.


BE OPPORTUNISTIC

If you work for the National Security Agency, you probably have no shortage of resources at your disposal: money, storage capacity, personnel. But if you work for a municipal department, chances are that you’re going to have to figure out how to do clever things on a shoestring.

Few, if any, entities embody this spirit more than New York City’s Department of Transportation.

In 2008, when Cordell Schachter took over as chief technology officer of the DOT, budgets were lean and the iPhone was new. But technology evolved fast. Within three years, the city’s Metropolitan Transportation Authority began, like its counterparts in many other cities, to track every subway train and bus in its fleet, creating a record of every movement each minute of the day. This data feed was made available to the public.

For the city’s DOT, which is separate from the MTA, this was a tremendous gift of resources. “We try to be very opportunistic here and use available information,” says Schachter. “And in the MTA’s case, we could get it for free.”

DOT staffers also noticed that the company NYC Bike Share was tracking the availability of shared bicycles in the city, creating yet another data feed that they could tap at essentially no cost.

Combining these data streams, DOT staffers developed an application called iRideNYC, which allows the user to stand anywhere in New York and see the closest transportation options (bikeshare, bus, train), their times of arrival and departure, and estimated walking time.
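In spirit, an app like this is a merge-and-rank over heterogeneous feeds. Here is a minimal, hypothetical sketch of that idea (the stop names, coordinates and the 80-meters-per-minute walking pace are all made up for illustration; this is not the DOT's actual code):

```python
from math import radians, sin, cos, asin, sqrt

WALK_SPEED_M_PER_MIN = 80  # rough average walking pace (assumed)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def nearest_options(user_lat, user_lon, feeds, limit=3):
    """Merge every feed into one list of options, ranked by walking time."""
    options = []
    for mode, stops in feeds.items():
        for stop in stops:
            dist = haversine_m(user_lat, user_lon, stop["lat"], stop["lon"])
            options.append({
                "mode": mode,
                "name": stop["name"],
                "walk_min": round(dist / WALK_SPEED_M_PER_MIN, 1),
            })
    return sorted(options, key=lambda o: o["walk_min"])[:limit]

# Three hypothetical feeds, each with its own origin but a shared shape.
feeds = {
    "bus":    [{"name": "M5 @ Broadway",  "lat": 40.7420, "lon": -74.0048}],
    "bike":   [{"name": "W 20 St dock",   "lat": 40.7423, "lon": -74.0040}],
    "subway": [{"name": "14 St (A/C/E)",  "lat": 40.7405, "lon": -74.0021}],
}
print(nearest_options(40.7419, -74.0045, feeds))
```

The hard part in practice is upstream of this function: normalizing each provider's feed into that shared shape before the merge.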

A few rules have guided the department in its work on the iRideNYC application. The first is that it must work on all devices—iOS, Windows, Android, even a desktop computer. The second is that the code that went into iRideNYC should be made publicly available. “We are taxpayer-funded,” Schachter says, “so why should taxpayers pay again if another jurisdiction wants to use this?”

VOX BOX

WHAT IS THE BIGGEST BIG DATA MISTAKE THAT YOU SEE ORGANIZATIONS MAKE?

—Interviews by Ian Gallogly

TOM DAVENPORT

Professor, Babson College; Author, Big Data at Work

LUCY BERNHOLZ

Lead Researcher, Stanford’s Digital Civil Society Lab

TOM HORAN

Dean, Drucker-Ito School of Management; Director, Center for Information Systems and Technology, Claremont Graduate University

The biggest mistake I see companies making with Big Data is not to have a particular business problem or decision in mind. It has become fashionable to believe that you can just sift through the data—as if it were a gold mine—to find out important trends and insights in your business. That is a foolish exercise; it takes a long time and costs a lot of money, and is unlikely to yield anything particularly valuable. It’s much more feasible to identify a problem or application domain from the beginning. It might be trying to learn what factors predict customer attrition, for example, or learning whether social media comments are a good predictor of revenues in the next quarter. How do you know what problem to pick? It turns out that question is not answered very analytically most of the time. It’s based on a deep understanding of what’s important in your business, and an awareness of what data might shed light on those factors. Successful senior managers almost always have that kind of understanding. All they may need to do is to have a conversation with Big Data analysts about whether data might be available to address their concern.

That’s easy: They don’t take consent seriously enough. The nonprofit sector is a voluntary sector—we participate in it voluntarily, whether that’s as donors, volunteers, etc. And in the digital jargon, volunteering is about opting-in; it’s all about consent. The mistake that nonprofits are making is that they’re acting too much like businesses. They’re defaulting to the same practices as companies and governments that have very different reasons for extracting information from data without the consent of the people from whom the data comes, and in the end that’s going to blow up on them. What it does is extinguish any difference between the sectors in terms of data. The nonprofit sector needs to recognize why it is distinct from the market and government, and cherish that and protect it and use it to its advantage. Particularly as people become more concerned about privacy, it’s absolutely a distinguishing value. The nonprofit sector needs to be very overt about consent and say very clearly to people, “This is what we’re using, this is why we’re using it, this is what we’ll do with it, this is what we won’t do with it, this is how we’ll destroy it,” and that will build trust.

One mistake is thinking that just providing Big Data is useful in and of itself. Agencies have a lot of Big Data, particularly federal agencies. But if it’s unusable data, if it’s not tailored to specific needs, then it’s kind of a supply-driven environment, where the agencies think they’re doing a good thing by making the data available, but it may not be helpful. The biggest mistake they make is thinking that Big Data is always better data, instead of focusing on how to get the right kind of Big Data. There’s a lot of great potential in Big Data to uncover important relationships in health, in transportation, in society; but in order to be effective, doing the analysis—and being able to tease out meaningful relationships—is key. One challenge is simply that the phenomenon is so new. The capacity of the public agencies to actually be effective in analyzing their data is definitely an issue right now.


MODELS AND MINDS

Nitin Baliga, director of the Institute for Systems Biology, oversees a team building predictive models based on reams of data. But in the end, he says, it’s all about having “human intuition kick in.”

The third guiding principle is that the application must focus on improving transportation access for those with physical or cognitive impairments, so that factors like pinpointing stairwell location and map brightness come into play.

iRideNYC is only one data-taming effort under way at the DOT. Another has been the streamlining of the permitting process for New Yorkers who need to excavate the street, under which lie countless feet of pipes, cables and tunnels. This requires no small number of permits: more than half a million per year. A decade ago, you had to file physical documents and pick up a paper permit to post at your job site. Today, the process is digital and can be completed on a mobile device.

Getting to this stage has required the creation and storage of a lot of data—on locations, on relevant restrictions, on project scope—that is now more easily searchable.

A major aspect of handling all of this information—and doing it efficiently and cost effectively—is knowing what to keep and how.


DOT receives lots of sensitive electronic data that must be kept secure. This gets stored in multiple city-owned physical sites that cannot be breached wirelessly. Less sensitive data—such as details of traffic speed, street construction, public space, pedestrian counts, ferry service, parking availability, bridge clearance, truck routes and traffic cameras, among other things—is likewise stored at those sites. But it’s also made available in the “cloud,” provided at relatively low cost by companies such as Amazon.

And some data, such as the feeds provided by the MTA, is generally allowed to flow by like a stream, rather than getting retained. This approach makes good sense: Most of us want to know that the D train is running five minutes late right now. Three months down the road, few of us will care that the train pulled into the station a bit behind schedule.
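The flow-by approach amounts to a sliding time window: events are used while fresh and simply discarded once they age out, instead of being retained forever. A toy illustration of the idea (the train lines, delays and 10-minute window are made up; this is not the DOT's system):

```python
from collections import deque

class RecentArrivals:
    """Keep only arrival events from the last `window_s` seconds."""

    def __init__(self, window_s=600):
        self.window_s = window_s
        self.events = deque()  # (timestamp, line, delay_min), oldest first

    def record(self, ts, line, delay_min):
        self.events.append((ts, line, delay_min))
        # Let anything older than the window flow past rather than retaining it.
        while self.events and ts - self.events[0][0] > self.window_s:
            self.events.popleft()

    def current_delay(self, line):
        """Worst delay reported for this line within the window."""
        delays = [d for _, l, d in self.events if l == line]
        return max(delays) if delays else 0

feed = RecentArrivals(window_s=600)
feed.record(0, "D", 5)     # D train five minutes late
feed.record(300, "A", 0)   # A train on time
feed.record(900, "D", 1)   # by now the old 5-minute delay has aged out
print(feed.current_delay("D"))
```

Storage stays bounded by the window size no matter how long the feed runs, which is exactly why the approach is cheap.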

Schachter says his department has only just begun to explore data mining. In the future, he says, researchers may be able to anticipate something like pavement deterioration and calculate whether it makes the most sense to repave it, patch it or just leave it alone.

In the end, though, what makes NYC DOT an outstanding tamer of Big Data is that it looks for modest ways to make the most of the resources it has. With more and more products and services available commercially, “you can pay for a lot of things as you go,” Schachter says, which leaves funds for what he says pays off best: “investing in your team.”


DON’T FORGET TO USE YOUR OWN BRAIN

When you enter the Institute for Systems Biology, just south of Lake Union in Seattle, it feels like you’ve wandered onto the set of The Big Bang Theory. White boards in hallway meeting areas are messy with equations and impromptu computations. Employees work among toys such as hula hoops and Daleks, the robots from Doctor Who. The labs have chairs draped with white coats and sundry machines such as ion-trap mass spectrometers.

Launched in 2000 by three University of Washington scientists, today ISB employs 200 staff members from 45 academic fields.

While precise definitions of the term “systems biology” will differ, what is essential is the modeling of complex networks. This is done in the belief that many of the mysteries of biology will be cleared up by, in effect, stepping back rather than just pushing forward. The migration habits of a single bird will make sense in the larger context of weather, for instance.

I would say that an interaction between the model and the human brain has to make the prediction.

NITIN BALIGA
Senior Vice President and Director, Institute for Systems Biology

WATER COOLER CHATTER

Wal-Mart handles more than 1 million customer transactions each hour.

John Mashey, who was the chief scientist at Silicon Graphics in the 1990s, is credited with coining the term “Big Data.”

The digital universe is doubling in size every two years.

Amazon has more than 1.4 million servers.


A hat-tip to: The Economist; Steve Lohr of The New York Times; Bernard Marr; and Enterprise Tech, respectively.

Nitin Baliga, who as senior vice president and director of ISB oversees most of the day-to-day operations, is one of nine faculty members on staff. His lab builds predictive models that will, it is hoped, be useful in combating disease, generating clean energy and protecting the environment.

Since 2000, ISB research has resulted in more than 1,300 papers published in academic journals. Most have titles—like “Genotoxic stress/p53-induced DNAJB9 inhibits the pro-apoptotic function of p53”—that are too abstruse for the lay person to understand.

To do this work, ISB has greatly increased both its data storage capacity and its processing power. But with this have come two big cautionary notes: For starters, it’s crucial not to get “wedded to the technology,” Baliga says. And it’s equally crucial not to get wedded to the mountains of information generated by the technology.

“For us, the model is the beginning, and the model essentially organizes complex data in a way that lets human intuition kick in,” Baliga says.

When data sets are massive, Baliga warns, it is particularly easy to be seduced by meaningless correlations. (Harvard Law student Tyler Vigen, who has made a side project of unearthing spurious links, has, for example, found a 99% correlation between U.S. spending on space, science and technology and suicides by hanging, strangling and suffocation.)
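The trap is easy to reproduce: any two series that merely drift in the same direction will show a near-perfect Pearson correlation, whether or not they have anything to do with each other. A small illustration with made-up numbers (not Vigen's actual data):

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Two hypothetical yearly series that share nothing but an upward drift.
space_spending = [18.1, 18.6, 19.0, 19.4, 19.8, 20.2]   # made-up, $ billions
hangings       = [5400, 5500, 5650, 5700, 5850, 5950]   # made-up counts

r = pearson_r(space_spending, hangings)
print(round(r, 3))  # close to 1.0, yet the series are causally unrelated
```

With thousands of series to compare, some pairs will line up this well by chance alone, which is why Baliga insists on the follow-up experiments and the gut check.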

That’s why Baliga and his colleagues are always careful to take the data spit out by their computers and supplement it with other experiments, their own observations and a real-world gut check. In this way, taming Big Data is no different from the way that Drucker described the discipline of innovation: “Because innovation is both conceptual and perceptual, would-be innovators must … go out and look, ask and listen,” he wrote. “Successful innovators use both the left and right sides of their brains.”

Asserts Baliga: “Mathematicians and statisticians might feel that a model has to make a prediction. I would say that an interaction between the model and the human brain has to make the prediction.”

THE RACE IS TO THE SWIFT

It’s one thing to look at a pile of data and try to make sense of it over time. It’s another to look at an abundance of incoming data and make sense of it right away. For many companies, the latter has become a business necessity.

Take Uber, the ride-share company that threatens to put traditional taxis out of business. Every minute of the day, Uber’s computers must process a torrent of incoming information: the location of every logged-in user in the world, the location of every one of its drivers, the destination of every passenger, the status of every ride, the payment of every fare, the rating of every driver. Most of these facts and figures must be processed immediately, so that Uber can—without pause—seamlessly provide more than a million rides per day.

One thing that makes such speed and fluidity possible is a system called Apache Kafka—a tool that offers organizations a way to take in a deluge of data, trillions of messages a day, and process it all right away.

Kafka’s creators have compared it to a central nervous system, the instrument by which the Big Data of our brains gets translated instantaneously into movement by our bodies.

Kafka came into being at LinkedIn, where engineer Jay Kreps and two colleagues, Neha Narkhede and Jun Rao, increasingly found themselves resorting to patchwork and improvisation in their efforts to gain instant control of the information constantly pouring into the professional social network.

LinkedIn already had some Big Data capabilities, and, like many companies, was making use of a software framework called Hadoop. But this only got Kreps and his teammates so far. What they needed was a means by which to process data as it came in, not just in batches after the fact.

To take an analogy, Hadoop could tell you amazing things about a baseball game once it was over—the average speed of every pitch that was thrown, the precise distance of the home runs that were hit, the way that a shift in the infield thwarted what would have been a game-winning double. But in the middle of the fifth inning it couldn’t actually tell you who was winning.
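The distinction fits in a few lines: a batch job computes its answer only after all the events are in, while a stream processor keeps a running answer after every event. A toy illustration of the two styles (plain Python, not Kafka or Hadoop code):

```python
# Batch style: the answer is available only after the whole "game" is over.
def batch_total(events):
    return sum(runs for _, runs in events)

# Streaming style: maintain a running answer as each event arrives.
def stream_totals(events):
    score = {}
    for team, runs in events:
        score[team] = score.get(team, 0) + runs
        yield dict(score)  # a snapshot of the standing after every event

events = [("home", 1), ("away", 2), ("home", 3)]

standings = list(stream_totals(events))
print(standings[1])   # mid-game, you already know who is ahead
print(standings[-1])  # the final standing matches what a batch job computes
```

Kafka's contribution was making the streaming side of this work at the scale of trillions of events a day, not the arithmetic itself.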

The transition from periodic once-a-day computation to something that happens all the time is a tectonic shift.

JAY KREPS
Co-Founder and CEO, Confluent

So Kreps, Narkhede and Rao came up with Kafka, a platform that allowed LinkedIn to handle vast feeds, responding to data the moment users entered it—whether to change their status or to connect with others. Not long after, in 2011, LinkedIn “open-sourced” Kafka, meaning that anyone could access the code and offer improvements to it or otherwise adapt it to their own business needs.

For the next several years, Kreps and his co-workers continued at LinkedIn, but they also found themselves in high demand as the open-source stewards of Kafka. “We started getting a lot of requests for help from companies trying to adopt it in a large way,” Narkhede recalls.

Among those Kafka enthusiasts were Twitter, Netflix, Square, Spotify, Pinterest and Uber. So in 2014, Kreps and his team decided to make a business of their own, founding Confluent, a company that helps other organizations use Kafka effectively. If Kafka is the internal combustion engine that anyone can acquire (and for free, no less), Confluent is the outfit that helps you build the rest of the car around the engine (for fees that it won’t discuss publicly).

In July, Confluent, now a 25-person team based in Palo Alto, Calif., raised $24 million in a new round of venture funding.

“The transition from periodic once-a-day computation to something that happens all the time is a tectonic shift,” Kreps says.

Indeed, as Confluent grows, it is finding that all sorts of organizations—in retail, telecommunications, healthcare and financial services—are discovering one of the secrets to taming Big Data: Not only do you have to use it smartly, you have to do so in a flash.*


Monday Mandate*

What will you do on Monday that’s different?

DO A DATA DIAGNOSIS

Have each member of your team review all of the data that he or she receives on a regular basis—and assess how useful it is in actually helping to make good decisions. If it’s not useful, fix it. Or kill it.

CONSIDER GETTING FLATTER

Traditionally, organizations have had multiple layers whose main function was to coordinate the passing of information back and forth through the enterprise. With more data readily available to everyone, ask yourself: Where are there opportunities to streamline?

TAKE A STEP BACK

“An adequate information system,” Peter Drucker wrote, must lead executives “to ask the right questions, not just feed them the information they expect. That presupposes first that executives know what information they need.” Do you?