Big Data
Patricia M. Williams Term Professor in Biology Junhyong Kim navigates the complex computations of single-cell genomics.
While most scientists spend their time trying to accumulate as much data as possible, Junhyong Kim has the opposite problem: too much data to manage. Fortunately for Kim and his co-principal investigator Zack Ives, Professor and Markowitz Faculty Fellow in Computer and Information Science in Penn Engineering, it’s a dilemma they will be able to tackle head-on with one of the first round of grants from the National Institutes of Health (NIH) Common Fund Big Data to Knowledge (BD2K) initiative. The main goal: design software that can handle the intensive data management involved in single-cell genomics and other genome-enabled medicine.
Kim’s own research involves trying to understand the function of individual cells in the human body. “People understand that there are different types of cells—that skin cells are different from brain cells and eye cells and heart cells and so on,” says Kim, Patricia M. Williams Term Professor in Biology; co-director, Penn Program in Single-Cell Biology; and adjunct professor of computer and information science. “But there’s a tendency to view certain kinds of groups of cells like bricks in the wall, when it turns out that every cell is actually quite different.” The ability to analyze individual cells on a minute level is a very recent development. Kim’s close collaborator James Eberwine, Elmer Holmes Bobst Professor of Pharmacology at the Perelman School of Medicine, was one of the pioneers of that technology.
The process involves multiple steps: isolating ribonucleic acid (RNA) from individual cells, amplifying it so that there is enough material to process, and, finally, sequencing it to reveal gene expression. This glimpse at individual cells’ function and behavior presents key translational opportunities, such as a better understanding of degenerative diseases like Alzheimer’s, which could lead to pharmaceuticals equipped to target very specific genes. “If you look at your face and see where your freckles appear, it’s clear that there’s a lot of heterogeneity,” says Kim. “So when we get degenerative diseases, it’s not like your whole brain goes at once. Some cells go and some cells don’t. And that’s due to this cell-to-cell variation we are working to understand.”
Keeping track of the complex experimental processes and data analysis steps involved in genomic technologies is where things get tricky, and where the new BD2K grant, titled “Approximating and Reasoning about Data Provenance,” will make a crucial difference. Kim attributes the massive amounts of data in genomic research to the incredible advances in technology that have brought the cost of sequencing a human genome down from $3 billion to $1,000. But this “truly amazing” advance also created a data deluge whose processing and management is now the key bottleneck in research and translational applications. In order to chart the “provenance” of the data, that is, how the data came about, each step has to be traced meticulously. In his single-cell projects, the trail starts with the neurosurgeons who collect the samples. “When the patient comes in for neurosurgery, whether it’s for epileptic surgery or tumor resection or any other type of procedure, there is a lot of information associated with that patient and why they are there,” says Kim. “Who the patient is, what previous pathologies they had, what kind of medication that they were on, even who the surgeon was, all affect the measurement that we’re eventually going to make when we test that tissue sample.”
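To make that concrete, here is a minimal sketch, in Python, of what a provenance record for one tissue sample might look like, with the clinical context attached up front and each lab step appended in order. Every class, field, and name below is a hypothetical illustration, not the actual data model from Kim and Ives’s project.

```python
# Hypothetical sketch of a provenance record for one tissue sample.
# None of these names come from the actual BD2K project software.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ProcessingStep:
    name: str              # e.g., "RNA isolation", "amplification", "sequencing"
    performed_at: datetime
    parameters: dict       # reagent lots, instrument settings, operator, etc.

@dataclass
class SampleProvenance:
    sample_id: str
    # Clinical context that can affect the eventual measurement.
    procedure: str         # e.g., "epilepsy surgery", "tumor resection"
    surgeon: str
    prior_pathologies: list[str]
    medications: list[str]
    steps: list[ProcessingStep] = field(default_factory=list)

    def record(self, step: ProcessingStep) -> None:
        """Append a lab step, preserving the order in which it happened."""
        self.steps.append(step)
```

The point of such a record is that when a measurement looks odd months later, everything that touched the sample, from the operating room to the sequencer, is still queryable.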
The variables continue in the lab. Once the tissue comes in, it’s separated into different processing streams: Sometimes it’s taken directly for examination, while other specimens are frozen. Eventually, individual cells’ RNA is collected and amplified, a process involving multiple steps, each using different reagents. “There have been cases where we find these very funny results that we don’t understand,” says Kim. “But when we go back through all the steps, we realize the commonality between all these strange results is that we got this particular reagent from a particular manufacturer at this particular time.”
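Kim’s reagent story is, at bottom, a provenance query: given the runs with strange results, find what they all have in common. A toy version of that query, written against the hypothetical SampleProvenance records sketched above, might look like this:

```python
# Toy provenance query: which (parameter, value) pairs appear in every
# anomalous sample's processing history? Illustrative only.
def shared_parameters(anomalous: list[SampleProvenance]) -> set[tuple[str, str]]:
    per_sample = []
    for sample in anomalous:
        pairs = {
            (key, str(value))
            for step in sample.steps
            for key, value in step.parameters.items()
        }
        per_sample.append(pairs)
    # Intersect across samples: what do all the strange results share?
    return set.intersection(*per_sample) if per_sample else set()
```

On a case like the one Kim describes, the intersection might collapse to a single pair, say a reagent lot number, pointing to one manufacturer’s batch received at one particular time.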
The real hard-drive hog, however, is the endless amount of ancillary information associated with genomic sequencing. This is where the NIH grant comes in. Over the next three years, Kim, Ives, and their team will design new algorithms to handle all the complex data involved in the provenance process. Kim and Ives will be working with Susan Davidson, Weiss Professor of Computer and Information Science, and Sampath Kannan, Henry Salvatori Professor and Department Chair in the Department of Computer and Information Science. “We’re talking about hundreds of gigabytes of data for these experiments and it’s just not feasible to keep track of all the steps involved in their processing,” says Kim. “It needs to be an automated process so we can see and track anomalies.” What is it like to work with such a varied team? “I have been interacting with them for over 10 years now, so we share a language,” says Kim. “The key to this kind of interdisciplinary research is to be willing to work on something trivial at the beginning. And then that develops into something deep and interesting.”
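The word “approximating” in the grant’s title hints at why this is hard: at hundreds of gigabytes per experiment, archiving every intermediate file alongside its history is not realistic. One generic, widely used trick for keeping a verifiable trail without keeping the data itself is to log a content hash of each intermediate output; the sketch below illustrates that general idea only, not the algorithms the team will actually develop.

```python
import hashlib

def fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a possibly huge intermediate file in 1 MB chunks, so the
    provenance log can store a 64-character digest instead of the file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

If a step is ever rerun, matching digests confirm the output really was reproduced, without keeping a byte-for-byte copy on disk.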
Any software Kim and his collaborators produce will be open source, so researchers around the world can iterate on and improve the platform down the road. “This project is an example of the kind of interdisciplinary research that really is important for us all to be able to go across school boundaries,” says Kim. “In addition to the arts and sciences, this impacts Penn Medicine and Engineering. And that’s sort of the great thing about the University—being able to do that.”