The Night Sky Is the Limit

Physicist Bhuvnesh Jain on how big data is transforming not only the study of the universe, but much of academia.

May 27, 2021

Spring/Summer 2021

Bhuvnesh Jain was 15 when he happened upon The First Three Minutes: A Modern View of the Origin of the Universe by American theoretical physicist and Nobel Laureate Steven Weinberg. Jain was attending a family wedding in a small town in India—a multi-day affair packed with festivities—but he got absorbed in a stash of popular science books in his older cousin’s room.

" " — Bhuvnesh Jain, Walter H. and Leonore C. Annenberg Professor in the Natural Sciences and Co-Director of the Center for Particle Cosmology

Photo credit: Paul Benson

“Because of the intensity of those three days, when I came back to Jaipur, I told my friends, ‘This is fascinating—maybe I’ll study theoretical physics in college,’” says Jain. “Really just based on some imagined little leaps and the little bit that I understood from Weinberg’s book.”

Jain is now Walter H. and Leonore C. Annenberg Professor in the Natural Sciences and Co-Director of the Center for Particle Cosmology. As a Fellow of the American Physical Society and a world-renowned cosmologist, he is at the forefront of discovery regarding some of the least-understood phenomena in the universe, including dark matter, cosmic acceleration, and dark energy. Jain led the Gravitational Lensing Group of the Dark Energy Survey (DES) and now serves as co-chair of its advisory board. One of the largest and most ambitious cosmic mapping projects in the world, DES collects images of hundreds of millions of galaxies. Beyond DES, Jain is helping to set the research agenda for the next generation of cosmology experiments, including the Vera Rubin Observatory and the space telescopes Euclid and the Roman Space Telescope.

Central to Jain’s research is the utilization of big data—the analysis of extremely large data sets to reveal patterns and associations. But the explosion of big data is not limited to the study of the universe—and it’s not limited to the natural sciences. As an enthusiast in the use of big data, and a leader in astronomy survey projects, Jain is now working alongside colleagues to immerse the next generation of students in data science education. But data gathering and analysis wasn’t always so sophisticated. Jain had to learn the ropes himself before bringing his expertise to an eager, young audience.

Up, Up, and Away

Jain grew up in the city of Jaipur, not far from the desert in northwestern India. He would sleep on the terrace in the summer when he visited his grandparents. “It was pretty powerful seeing the night sky in these desert environments—the totality of it,” he says.

Jain began his undergraduate studies in Delhi, then transferred to Princeton during his junior year. When he arrived in late summer, he was told that the rest of the class had already settled on junior research projects, and that he “had better find one.”

After that semester-long class, I became serious about astrophysics and I ended up doing my senior thesis there. Plus, they had a ping-pong table, available all night, which was a big deal for me.

“I started walking around the department and most of the offices were closed,” says Jain. The first office he walked into happened to be a cosmologist by the name of Jim Peebles. “He won the Nobel Prize in 2019,” Jain offers, as an aside. “He was already very eminent, though I didn’t know that at the time. Otherwise, I might’ve never talked to him. So, I said, ‘Can I sign up with you?’ He seemed surprised, but said yes. After that semester-long class, I became serious about astrophysics and I ended up doing my senior thesis there. Plus, they had a ping-pong table, available all night, which was a big deal for me,” Jain laughs.

Between ping-pong games, Jain was getting his first taste of research in cosmology. His junior-year project focused on seed fluctuations, which are density variations in the early universe that later formed galaxies. For his senior thesis, which focused on the flows of galaxies, he worked with an extremely visually oriented professor, Rich Gott, from whom he took inspiration. “He would teach advanced mathematical topics by bringing colored ribbons to class and drawing spiral patterns and running along them to show what it’s like to fall into a black hole,” says Jain. “He found a way to retain that instinctive, basic appeal of science—the way you would teach middle school kids a topic in popular science.”

Post-college, Jain continued his research on the distribution of galaxies and how to model them. “I wanted to bridge my interest in theoretical physics and astronomy,” Jain says. “I was fortunate during my graduate studies at MIT to work with both a theorist and a more data-driven astronomy professor. It so happened that I did most of my research with the astronomy advisor, and most of my talking with the theorist, which usually amounts to your professor sitting on a couch, and you standing by the blackboard and writing equations. This probably hasn’t changed in at least a hundred years.”

Jain’s theory advisor, Alan Guth, who had invented a famous theory of the inflation of the early universe, had some eccentricities. “His office was unbelievably messy. You were kind of in danger of tripping while writing on the blackboard, because there were piles of books and papers everywhere. He actually made it to the Boston Globe because he got nominated for most untidy office in Boston,” Jain says. These weekly meetings helped ground Jain’s theoretical process.

Meanwhile, with his astronomy advisor, Ed Bertschinger, Jain was analyzing early “supercomputer” simulations. “He was one of the pioneers of numerical astronomy, so conversations with him were often about looking at data and movies on the computer and going back and forth,” Jain says. “We didn’t have laptops, so it would be his big screen, and after I did a calculation, he would say, ‘Yeah, but you can test it with the simulation.’ And most of the time any model that we came up with failed. So, when we did figure something out, it had enduring value.”

After his Ph.D., as a postdoctoral fellow at the Max Planck Institute in Munich, Jain took on a new research area: gravitational lensing. Instead of studying galaxies directly, lensing studies how their images are distorted as light travels through space. The distortions are similar to how a powerful lens magnifies and stretches what you see through it. In cosmology, the nature of dark matter remains a central puzzle, and Jain ran simulations of how dark matter and other features in the universe lens galaxy images—a technique he continues to utilize at Penn.

Light-Years Ahead

The use of mapping data in astrophysics has come a long way in the last two decades. In days past, surveyors would operate a telescope, choose their targets, and then direct the telescope to certain sectors of the night sky and take stills. “What took them many months of painstaking observations is something that we can do in a single night with the technology that exists now,” says Jain. “So, the advances in brute power are just astonishing from one decade to another. But that basic intuitive appeal of optical astronomy is still the same. At the end of the night, you have a series of images of beautiful galaxies in various stages of evolution over cosmic time. And though the process of observing at the telescope is more standardized, it is still a wonderful experience to be on the mountaintop night after night, with the Milky Way overhead, way brighter than we get to see it otherwise.”

The Dark Energy Survey (DES), a collaborative survey project involving universities and institutions around the globe, recently released nearly 700 million images of stars and galaxies, the largest collection to date. Jain’s Gravitational Lensing Group, within DES, uses the images to discern patterns of cosmic structure that provide hints about phenomena like dark matter. Jain works with colleagues Gary Bernstein, Reese W. Flower Professor of Astronomy and Astrophysics, and Mike Jarvis, Staff Scientist, using visible light filters to help distinguish the colors of stars and galaxies, allowing for a more informed study of their distances, and the expansion of the universe. Other survey projects underway hold the promise of even higher fidelity mapping images. The Rubin Observatory’s Large Synoptic Survey Telescope (LSST) will map the entire southern sky in exquisite detail, and both the Roman telescope and Euclid will be launched into space, allowing them to see further than their ground-based counterparts, and with higher fidelity.

The analytical phase that comes after image collection has also changed dramatically over the decades. Faster and faster computing allows for more complex algorithms and machine learning, which gives researchers more decoding power. “There was a bit more guesswork involved before, as we didn’t have as much computing power. You had to choose which model you wanted to simulate,” says Jain. “Now, we have simulations of billions of galaxies. So, the question of what do you want to compare with data is where the judgment and insights come in now.”

The judgment phase, however, is still in the hands of researchers, Jain notes, and there has never been a more exciting time to produce innovative theories. Penn’s Center for Particle Cosmology, which Jain co-directs with Mark Trodden, Fay R. and Eugene L. Langberg Professor of Physics and Department Chair, aims to foster just this kind of research at the interface of theory and data. “There is more space than ever to be creative, though it has a different flavor. It takes a blend of developing a gut feeling for the subject, as before, and querying massive datasets in new ways, often by developing clever algorithms. Interestingly, it’s sometimes the undergraduate students who ask the most creative questions. So, even though we have advanced technology and big data now, it is no less important to see the problem with fresh eyes and ask bold questions.”

What took them many months of painstaking observations is something that we can do in a single night with the technology that exists now ... but that basic intuitive appeal of optical astronomy is still the same.

Having witnessed evolving methods of data collection and analysis for so long in his own career, and having such a keen awareness of its import in the future, Jain felt he had a responsibility to prepare the next generation for the ascendancy of big data, which, he notes, is no longer merely a tool for academics—it’s a necessity. Jain is now front and center in Penn Arts & Sciences’ commitment to data-driven discovery, a priority designed to foster interdisciplinary interactions that generate intellectual pursuits through sponsorship of educational programs in data analysis across diverse disciplines.

Next Gen

In 2019, Jain founded the Undergraduate Summer Data Science Hangout with the help of Masao Sako, Professor of Physics and Astronomy; David Brainard, RRL Professor of Psychology and Associate Dean for the Natural Sciences; Emily Hannum, Professor of Sociology and Associate Dean for the Social Sciences; and Ann Vernon-Grey, Associate Director at the Center for Undergraduate Research & Fellowships (CURF). Designed as an informal gathering for undergraduate students whose summer research involves the quantitative analysis of datasets, including variants of machine learning, the Data Hangout gives students an opportunity to present their work to peers, learn about research being conducted by undergraduates in other departments, listen to faculty talks on data science, and attend tutorials.

“The Hangout started as a solution to a very individual problem,” Jain says. “Students working with me wanted to talk to other students like them. I could see that they really enjoyed coding together, and I realized that within my group, it was not easy in a given summer to provide enough interaction. They were also curious about research in other departments, especially involving data science tools.” It soon became clear to Jain that across groups, students were using similar pieces of code, and tackling similar algorithmic problems, even when applied to very different data.

We wanted to erase these boundaries of physical versus biological sciences and so on, and explore data together.

Another impetus for the creation of the Data Hangout—initially held in the Collaborative Classroom in the Weigle Information Commons at the Van Pelt Library, pre-COVID-19, before transitioning to a remote setting—was a desire across the arts and sciences for social scientists and natural scientists to interact across disciplinary boundaries. Jain says, “We wanted to erase these boundaries of physical versus biological sciences and so on, and explore data together.”

At any given time at the Data Hangout, which boasts a 100-strong body of students, you might find a political science student and a neuroscience student trading coding tips. A lot of these student interactions and queries are triggered by the faculty talks. The talks are supplemented by a set of tutorials taught by graduate students and postdocs.

“We’ve had really great presenters,” Jain says. “For instance, in addressing COVID data, we had a mathematician modeling how the virus is transmitted in university environments, along with a biologist talking about the evolutionary origins of viruses through data mining.” Past faculty talks include such topics as the use of big data in preventing mass shootings, data-driven modeling of human color vision, and leaf fingerprinting. The presentations are often interactive in nature, and extend across schools. For instance, the presentations included a Wharton researcher who used large-scale experience sampling to explore human happiness, and a computer scientist studying how algorithms can include ethical considerations.

Jain says the discourse and relationships forged have led to innovative student research projects. At the encouragement of Jain, Tara DaCunha, C’22, worked with Joshua Plotkin, Walter H. and Leonore C. Annenberg Professor in the Natural Sciences, in the summer of 2020, using a corpus of linguistic data to study the evolution of vowel phonemes in Philadelphian speech, and its relationship to a proposed and calculated metric of “confusability.”

“Professor Plotkin uses computational and mathematical methods in a variety of different research projects,” says DaCunha. “He chatted with us and described a number of the projects he was involved in, including ones on music, insect behavior, and linguistics. As a linguistics minor, I got to explore that interest through research alongside my work in astrophysics and truly experience what data science could do in different contexts.”

DaCunha’s most recent project, a collaboration with Jain, centers on examining galaxy clusters and the populations of galaxies that reside in them. “This has involved using ‘density profiles’ to analyze the distribution of galaxies throughout clusters and locate a feature known as the ‘Splashback Radius,’ which forms the boundary of clusters. It lets us compare the matter distributions in simulations with observed clusters, study how the star-formation in galaxies evolves, and how they move under the gravity of the cluster,” says DaCunha. “It has been an amazing experience to get to work with the professors over the last few years and witness the collaborative nature of astrophysics research.”

Another Data Hangout alum, Sarah Kane, C’23, is doing research on exoplanets, which, she explains, “are planets outside our solar system, which sometimes pass in front of their host star and cause a small dip in the star’s light that we can observe from Earth. By looking for these dips, or ‘transits,’ we can identify new exoplanets. But transits are often small and difficult to find among the many thousands of stars we can see. It’s like looking for a needle in the proverbial haystack,” says Kane. She and her collaborators designed a neural network—a type of machine learning algorithm—to search through the vast quantities of stellar data available publicly from NASA and look for exoplanet transits.

Kane is also involved in Astronify, a NASA project which has the goal of developing software that makes astronomical data more accessible to blind and visually impaired researchers. With Astronify, data that is usually visualized in a graph is “sonified,” or turned into sound.

“As I have been legally blind since birth, I was thrilled to find the Astronify project,” Kane says. “I immediately reached out to ask if there was any way I could contribute and, over the last six months, have done usability tests for the project. Projects like Astronify make all the difference to people like me in being able to accomplish our goals.”

Jain says current big-picture challenges reinforce the need for students to immerse themselves in data comprehension. “Climate change and COVID-19 are two examples of central challenges facing our times that involve a sophisticated understanding of data, maps, and figuring out the relationships between many causal factors.” To this end, Jain posed a question to his class, Data Analysis for the Natural Sciences, Part II: Machine Learning: What factors are responsible for the current decline in COVID rates? How are they related? “The students figured out which publicly available data they could use to address these questions,” says Jain.

Penn Arts & Sciences’ commitment to data-driven discovery extends to other programs and centers. The Introduction to Python for Data Science Summer Boot Camp for graduate students—sponsored by the Office of the Dean, the Center for Particle Cosmology, and MindCORE (see our feature on p. 38)—acts as a primer course for the powerful programming platform, one of the main languages used in modern machine learning and data analysis. The Price Lab for Digital Humanities supports innovative uses of technology in the study of history, art, and culture. And the Linguistic Data Consortium is an open consortium of universities, libraries, corporations, and government research laboratories formed to address the critical data shortage then facing language technology research and development.

Steven J. Fluharty, Penn Arts & Sciences Dean, says nearly every academic activity is now impacted by the availability of large data sets. “In the natural sciences, there are great advances in bioinformatics, cosmology, and astrophysics, while in the social sciences, we have programs like the Penn Opinion Research and Election Studies, also known as PORES, which promotes a data-driven approach to understanding political outcomes,” says Fluharty, also Thomas S. Gates, Jr. Professor of Psychology, Pharmacology, and Neuroscience. “The big change has been in the humanities: Archeology now involves cartographic modeling of sites, and the study of literature incorporates the high-speed computerized reading of text. As we look toward the future of big data, it is clear that cross-disciplinary collaboration will drive innovation.”

Harnessing and playing with data is easy and fun and extremely empowering. Whether you’re reading The New York Times, or it’s part of your career, it makes you a more sophisticated consumer and interrogator of data.

Remote learning hasn’t slowed the Data Hangout down. Lectures are held online, open to any students involved in data research, regardless of discipline. “This is an experiment that I hope will grow to demonstrate the importance of harnessing data, and stimulate our students to become interdisciplinary researchers,” says Jain. “We’d like to develop two directions in the coming summers: to bring research closer to the students and make it easy—and compelling—for them to join a project in a different discipline from their majors, and to have the students teach each other some of these complementary skills. Harnessing and playing with data is easy and fun and extremely empowering. Whether you’re reading The New York Times, or it’s part of your career, it makes you a more sophisticated consumer and interrogator of data.”

Blake Cole

Writer

Jasu Hu

Illustrator