Big Data is a Big Deal

By Amanda Garris, Ph.D. ’04


Your route to work today. A re-tweet during a presidential debate. The dishwasher running at 7 p.m. Thanks to the convergence of sensors, Global Positioning System (GPS), digital communications, and wireless connectivity, they are all now potential data points. The transformation of behaviors into bytes of data is mirrored by an explosion of information about the natural world, from the identification of an individual’s unique genetic mutations to the chemical identities of tiny particles aloft in the stratosphere. Welcome to the era of big data, hailed as a breakthrough as revolutionary as the microscope. Researchers in CALS are reaping a new kind of harvest from big data, developing tools to fight disease, mitigate climate change, and gain new insights into the dynamics of human communication and the future of commerce. 


Big Business

As an expert in technological change, digital business strategies, and the process of innovation in businesses, Aija Leiponen has been eagerly tracking the rise of big data in commerce. Her previous research has shown that tapping into a breadth of information—from university research and consumers to suppliers and competitors—is a precursor to innovation in companies. Big data promises dramatic changes for both businesses and consumers.

“We are really observing a data explosion,” said Leiponen, an associate professor in the Charles H. Dyson School of Applied Economics and Management. “For instance, big data are created by the internet of things, in which sensors embedded in objects relay information through the internet. It is cheap to connect products with transponders or wirelessly readable tags, so a producer can know where everything is, offering a very detailed view into the supply chain.”

Big data is already changing decades-old models for doing business, even at 300 miles per hour, 27,000 feet in the air. 

“Huge leaps have been made in sensors, such that some airplane engines wirelessly relay performance data in real time on specific airplanes back to the manufacturer,” she said. “It’s part of a changing model for that business: instead of purchasing engines, aircraft makers can pay by the hour for their use.” 

Such “servitization” based on data and enabled by communication networks is spreading to myriad industries. Leiponen predicts that it won’t be long before big data is used in ways that are more obvious to consumers. With ever larger broadband pipes for information and wireless technology linking objects to the network, the impact may crop up in the power grid.

“Smart meters in the home will let you know what appliances are used when, allowing companies to manage demand as the sensor conveys a dynamic picture of energy use,” Leiponen said. “That can then be incorporated into the spot price offered to households.”
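To make that mechanism concrete, here is a minimal Python sketch of how metered demand could feed a dynamic household price. Every name and number is a hypothetical illustration, not an actual utility’s pricing model.

```python
# Hypothetical sketch: turning smart-meter readings into a dynamic
# spot price. All names and numbers are illustrative.

def spot_price(demand_kw: float, base_price: float = 0.12,
               capacity_kw: float = 5.0) -> float:
    """Per-kWh price that rises as metered household demand
    approaches an assumed capacity threshold."""
    utilization = min(demand_kw / capacity_kw, 1.0)
    return base_price * (1.0 + utilization)  # up to 2x base at peak

# Readings a smart meter might relay: (hour of day, kW drawn)
readings = [(7, 0.8), (19, 3.6), (23, 1.1)]  # dishwasher at 7 p.m.
for hour, kw in readings:
    print(f"{hour:02d}:00  {kw:.1f} kW  ->  ${spot_price(kw):.3f}/kWh")
```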

A new development is that data itself is becoming a resource with economic value, in particular data about human behavior, harvested from all of our purchasing and other trackable behaviors. While many companies perceive this information as important, defining the monetary value of different pools of behavioral data—as well as rules for its appropriate use—remains challenging.

“An economy will emerge to realize the value of this data, but what is special about data as a product? It’s not delimited, copyrightable; it’s an intangible good that can be at the same time an intermediate and a final product,” Leiponen said. “We don’t yet have enough data on big data—a huge research opportunity.” 

In this case, regulation is playing catch-up with the economy. Data brokers are already finding new ways to capitalize on consumer behavioral data. Leiponen noted that several are already under review by the Federal Trade Commission, which is seeking to gather information about what is collected, how it is used, and whether consumers have the ability to access and correct their information or to opt out of having their personal information sold.

“Big brother and big data go hand in hand,” Leiponen said. “Consumers sell data on the cheap, in essence donating valuable data to companies.” 

#Empathy 


It’s precisely the opportunity for real-time, real-world observation of human behavior that encouraged assistant professor of communication Drew Margolin to turn to Twitter to study the shape of discourse—who speaks, who speaks to whom, and what do they say? Tweets offer some real advantages over more traditional experimental methods, where surveys and sterile rooms can add up to self-consciousness.

“The main thing that’s different is the ability to get fine-grained behavioral data,” he said. “You don’t have access to people’s thoughts, but you’re seeing so much behavior in the real world.”

This ability to observe communications in a “natural habitat”—perhaps a comfortable chair in the den in front of the television with phone in hand—is enticing to Margolin. One area he has studied is political discourse, and the choices Tweeters face between imitation and confrontation.

Looking at tweeting during the presidential debates, he found that retweeting of elites replaced users’ more typical interactions, netting an overall decrease in the diversity of voices and quality of exchange. 
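One simple way to quantify a “decrease in the diversity of voices” is the Shannon entropy of whose words get amplified. The sketch below, with invented data, illustrates the idea; it is not Margolin’s published method.

```python
# Illustrative only: Shannon entropy of attention across authors as a
# proxy for diversity of voices. Data are invented; this is not the
# study's actual measure.
from collections import Counter
from math import log2

def voice_entropy(authors: list[str]) -> float:
    """Entropy (bits) of the distribution of tweets/retweets across
    original authors; lower values mean fewer voices dominate."""
    counts = Counter(authors)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

ordinary_day = ["ann", "bo", "cal", "dee", "eve", "flo", "gus", "ann"]
debate_night = ["candidate_a"] * 5 + ["candidate_b"] * 2 + ["ann"]
print(round(voice_entropy(ordinary_day), 2))   # 2.75: many voices
print(round(voice_entropy(debate_night), 2))   # 1.30: elites dominate
```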

Another advantage of Twitter data is the ability to track emerging ideas in response to an unexpected event, from a natural disaster to a terrorist attack, again, in real time.

“You can’t simulate things like a terrorist attack. It’s a unique case of unanticipated, unrehearsed human behavior, and memory can be very unreliable,” Margolin said. “For example, being asked whom you would call if you were involved in a terrorist attack is different from whom you actually call.”

It’s led him to see very human inclinations in the sea of data: emotions like fear, sadness and empathy. For example, Twitter data after the Boston Marathon bombings in 2013—180 million total tweets—showed that activity was greater among people who, judging by the geo-tags on their tweets, had a strong connection to the place. Another example is a school shooting in Florida, where people in places that had themselves experienced shootings tweeted their support, with messages like “we are with you … we feel solidarity with you.”

“In real time you can see the raw expression of emotion, before national scale attention floods in and the national media has framed the event,” Margolin said. “Our analysis showed that the immediate expression of fear was also directly related to the subsequent expression of solidarity.”
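Analyses like these can be approximated in a few lines. The sketch below scores tweets for fear and solidarity cues; the vocabularies and tweets are invented and far cruder than the study’s actual emotion measures.

```python
# Toy sketch of cue-word scoring; word lists and tweets are invented,
# not drawn from Margolin's study.
FEAR = {"afraid", "scared", "terrifying", "fear"}
SOLIDARITY = {"solidarity", "with you", "stand with", "together"}

def share_expressing(tweets: list[str], cues: set[str]) -> float:
    """Fraction of tweets containing at least one cue word or phrase."""
    hits = sum(any(c in t.lower() for c in cues) for t in tweets)
    return hits / len(tweets)

first_hour = [
    "so scared right now, this is terrifying",
    "we are with you ... we feel solidarity with you",
    "stay safe everyone, praying",
]
print(share_expressing(first_hour, FEAR))        # 0.333...
print(share_expressing(first_hour, SOLIDARITY))  # 0.333...
```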

For the practical purposes of first responders or government agencies seeking to reduce or quell fear in neighboring populations after a terrorist attack, his findings suggest that fear will spread to those communities with the most similar personal experiences—a related event may bring empathy from a community, but it also means that community reliving its most fearful moments.


Tapping into Topology

The ability to look at a whole network is an advantage that social and natural sciences alike embrace. The networks that Haiyuan Yu, assistant professor of biological statistics and computational biology, studies are not peer to peer; they are protein to protein. Mapping the network created by the interactions among proteins—5,000 in the yeast he studies and 20,000 in humans—is key to understanding, diagnosing and curing diseases, from cancer to muscular dystrophy. 

“Of all the proteins in a cell, none act alone,” Yu said. “The whole cell is connected by ‘six degrees of separation’.”

His approach is to test each protein in the lab for interactions with all the others and then build the network. Mapping the interactions as a network can yield insights not available from looking at individual pairs of proteins, because the connectivity of the network reveals crucial information.

“The topology of the network can tell you a lot about the biology of the system,” Yu said. “Just like airline hubs, there are proteins—a very small fraction—[that] are super-connected, which has implications for the robustness of the network and its resilience in the face of malfunction.”
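The airline-hub analogy maps directly onto code. Below is a minimal sketch, with invented interaction pairs, of how simple degree counting exposes the super-connected minority in such a network.

```python
# Minimal hub detection on an interaction network; the protein names
# and pairs are invented for illustration.
from collections import defaultdict

interactions = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
                ("B", "C"), ("D", "F")]

degree = defaultdict(int)
for p, q in interactions:
    degree[p] += 1
    degree[q] += 1

# Hubs: the small fraction of proteins with unusually many partners.
hubs = {prot for prot, d in degree.items() if d >= 4}
print(hubs)  # {'A'}: the "airline hub" of this toy network
```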

While protein networks can be robust to random genetic errors, the flip side is that they are much more fragile in the face of a targeted attack—like cellular acts of terrorism that target network hubs. Knowing a protein’s position in the network helps Yu understand its potential impact and has great predictive power, for example in the area of drug side effects. 

“About 20 percent of drug candidates fail in early clinical trials due to safety issues caused by side effects,” Yu said. “Understanding and predicting what causes side effects is of paramount importance to human health and the pharmaceutical industry.”

Yu’s lab looked at the protein targets of drugs in the framework of the human protein interactome network, the web of physical interactions among molecules. They found that what mattered was not the total number of a drug’s targets but the number of essential targets—proteins that were hubs of their networks—which determined the occurrence of its side effects. The findings shed light on new factors to be incorporated into the drug development pipeline.
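Continuing the toy network above, scoring a drug then amounts to intersecting its targets with the hub set. The drug-target mapping below is entirely hypothetical.

```python
# Hypothetical continuation of the hub sketch: count each drug's
# essential (hub) targets. The drug-target mapping is invented.
hubs = {"A"}
drug_targets = {"drug_1": {"A", "F"}, "drug_2": {"C", "E", "F"}}

for drug, targets in sorted(drug_targets.items()):
    n_essential = len(targets & hubs)
    print(drug, "essential targets:", n_essential)
# On Yu's finding, drug_1 (one hub target) would be flagged as the
# higher side-effect risk, even though drug_2 hits more proteins.
```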

They have just completed a project mapping the interactome network for fission yeast, which Yu calls a ‘forgotten organism’ even though it shares many pathways with humans for fundamental processes such as how genes are turned off and on. All 5,000 fission yeast proteins were tested for interaction with each other, in three replicates—using brute force in the lab over three years—yielding 75 million prospective pairs to analyze. Yu’s lab was able to identify about 2,300 that interacted. Big data allowed Yu to find those needles in the haystack.
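The scale of the screen checks out with a back-of-envelope calculation, on one plausible reading of the numbers: every ordered bait-prey pair, tested in each of the three replicates.

```python
# Back-of-envelope: every ordered bait-prey pair, three replicates.
proteins = 5_000
replicates = 3
tests = proteins * proteins * replicates
print(f"{tests:,} prospective pairs")  # 75,000,000
```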

Using the big data approach has led them to discover several previously unknown factors involved in gene regulation.

“It really highlights the role of big data in discovery. When you look at the whole system you don’t miss things,” he said. “I feel really lucky to be doing biology right now.”


Parsing Predictions

When climate scientists use big data to study whole systems, it encompasses plant life, soil moisture, biogeochemistry, the oceans and the atmosphere. That’s why, when it comes to the absolute size of big data, the bragging rights in CALS currently rest with the atmospheric scientists. 

“Big data is what atmospheric scientists have been calling ‘data’,” explained Toby Ault, an assistant professor of earth and atmospheric sciences. “Our capacity to generate data can exceed our ability to interpret the results. Big data can be a big mess.”

Ault works with data measured by the petabyte—one million gigabytes—using one of the largest public access research science databases to generate global climate simulations. To understand climate modeling, Ault suggests picturing a really good video game, where the water flowing looks like real water, because the game is built on the physics of motion and energy exchange. 

It also relies on real-time, highly detailed data collected via satellites. To look at global climate models, he uses a grid of collection sites extending like a column from the ground into the upper atmosphere, and at time points as frequent as every 20 minutes, collected over years.

“In essence, the models capture the whole system on the earth to better make predictions about climate change, based on the most optimistic or pessimistic assumptions regarding the adoption of mitigation strategies,” Ault said. “For example, if you want to simulate a particular future period, climate models allow you to compare and contrast what happens with and without an increase in atmospheric carbon dioxide.”
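Real climate models resolve those physics on three-dimensional grids, but the with-versus-without comparison can be illustrated with a textbook zero-dimensional energy-balance model. The sketch below uses standard reference values and is not drawn from Ault’s simulations.

```python
# Toy zero-dimensional energy-balance model, illustrating a
# "with vs. without added CO2" comparison. Textbook constants;
# not from Ault's work.
SIGMA = 5.67e-8   # Stefan-Boltzmann constant, W m^-2 K^-4
SOLAR = 1361.0    # solar constant, W m^-2
ALBEDO = 0.3      # fraction of sunlight reflected

def equilibrium_temp(extra_forcing=0.0, emissivity=0.612):
    """Global mean surface temperature (K) balancing absorbed
    sunlight plus any added forcing against outgoing radiation."""
    absorbed = SOLAR * (1 - ALBEDO) / 4 + extra_forcing
    return (absorbed / (emissivity * SIGMA)) ** 0.25

baseline = equilibrium_temp()
with_co2 = equilibrium_temp(extra_forcing=3.7)  # ~doubled-CO2 forcing
print(f"{baseline:.1f} K -> {with_co2:.1f} K "
      f"(+{with_co2 - baseline:.1f} K before feedbacks)")
```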

His analyses have resulted in some dire odds for the American Southwest and the Great Plains. Due to global warming, the chance of a decade-long drought in the Southwest is at least 50 percent. Furthermore, the chance of a “megadrought,” one that lasts over 30 years, is very high before the end of this century in both areas, unless greenhouse gas emissions are lowered drastically in the next ten years.

Today, Ault is drawn to predictions about time periods from ten days to ten years in the future and mitigating the impact of climate change on farmers’ livelihoods. For example, small fluctuations in the onset of spring can have major impacts on the time to plant corn or the potential for damage to tree fruit due to the danger of early frost on emerging buds and flowers. Ault’s models predict that both early springs and early fall frosts will increase in the next several decades, indicating that farmers will need both weather prediction tools and plant varieties adapted to the new normal. 

“This is really exciting to me—the interplay between natural variation and human activity. It’s very applied,” he said. “Farmers and growers don’t really care about the weather in 2100, but when to plant in a particular spring, that has important social implications.”

Big Data, Small Particles

The work of Sara C. Pryor, professor of earth and atmospheric sciences, focuses on understanding causes of climate variability and change in order to make better projections of future regional and local climates. 

“Aerosol particles are the largest source of uncertainty in the science of climate change and prediction, particularly at the regional scale,” said Pryor, who was recently made a Fellow of the American Association for the Advancement of Science. “They are important because they can both reflect light and cause cloud formation, so they tend to lead to surface cooling and offset some of the warming caused by greenhouse gases,” she said.

The abundance of these aerosols also has some down-to-earth implications for human health. Millions suffer impaired health due to the effects of aerosol exposure. Given that aerosol particles’ impacts on climate and health are strongly influenced by their size—from ten times smaller than a typical virus to the width of a human hair—Pryor’s work tracks processes occurring across scales from micrometers to kilometers.

Pryor’s research aims to quantify aerosol concentrations, size and composition in time and space, model their influence on climate, and determine how and where they are removed from the atmosphere. She uses both ground-based and satellite radiometers, in addition to in situ instruments that measure aerosol concentrations and fluxes every second in dozens of size classes. She also conducts simulations using increasingly detailed and sophisticated numerical models.

All of this leads to huge data volumes. For example, Earth Observing Systems operated by NASA generate over four terabytes of data every day. Pryor’s group has recently completed simulations on a 12-kilometer grid across the entire continental United States, with 32 vertical layers through the atmosphere, and simulating the concentration of more than 200 gases and 32 aerosol particle types and sizes. The resulting output is also many terabytes in size.
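A rough accounting shows how those terabytes accumulate. The grid spacing, layer count and species counts below come from the description of Pryor’s setup; the domain extent, hourly output frequency and 4-byte values are my assumptions.

```python
# Back-of-envelope output volume for one simulated year. Grid spacing,
# layers and species counts are from the text; domain size, output
# frequency and 4-byte (single-precision) values are assumptions.
cells_x = 5_000 // 12        # ~5,000 km east-west at 12-km spacing
cells_y = 3_000 // 12        # ~3,000 km north-south
layers = 32
species = 200 + 32           # gases plus aerosol types and sizes
bytes_per_value = 4
outputs = 24 * 365           # hourly output for a year

total = cells_x * cells_y * layers * species * bytes_per_value * outputs
print(f"~{total / 1e12:.0f} TB per simulated year")  # ~27 TB
```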

“These data volumes represent an unparalleled opportunity both for generating new insights into the function of the atmosphere and its interaction with the earth’s surface, and for evaluating the models we use for predictions,” Pryor said.

One of her projects is focused on examining the dual role of forests in contributing to both the formation and removal of aerosols—and how those roles could change under different climate warming scenarios. Forest canopies release biogenic volatile organic compounds (BVOCs) that can form aerosols. On average, warmer temperatures tend to lead to higher BVOC emissions, potentially increasing regional aerosol concentrations that in turn can reflect away more sunlight and thus suppress greenhouse gas warming locally and regionally. However, her recent research has shown this effect is very dependent on adequate water supply: It ‘switches off’ during drought. This evidence of the complexity of the biosphere’s response to warming is an insight made possible by big data and the technology that feeds it, and Pryor said it’s just the beginning. 

“Availability of ‘big data’ means we can ask different questions, generate new hypotheses, but it also requires that we develop and apply new tools to optimally use complex data streams from different sources that have different characteristics, uncertainties and scales,” she noted. “I feel a real sense of optimism—it’s a great time to be a scientist!”