As COVID-19 tore across the globe, the open source community sprang into action. Stuck at home after shelter-in-place orders began in March 2020, contributors were eager to put their time and talent to good use. Open source developers have been collaborating across national and organizational borders for decades, tackling incredibly difficult technical problems in fields like distributed computing and artificial intelligence. The COVID-19 response required exactly that type of coordination.
According to GitHub’s 2020 State of the Octoverse report, there was a huge spike in both the creation of new open source projects and contributions to existing projects amidst the pandemic. Users created 5,646 repositories specifically related to COVID-19.
It wasn’t a one-off occurrence: COVID-19 related projects saw sustained contributions all year.
Many of these projects were created to make sense of the pandemic’s emerging data sets. In 2020, six percent more scientists and 10 percent more data analysts created GitHub accounts, according to the State of the Octoverse report.
The amount of data in the world was exploding even before the pandemic. But the need to organize and visualize data and make it useful to the public and to decision makers in public health and government became more pressing than ever. Open source developers lent researchers a hand to help make sense of COVID-19 data.
For example, GitHub data scientist Hamel Husain created COVID-19 Dashboards to provide a platform for anyone who wants to quickly and easily publish datasets. Anyone can submit a dataset to the site in the form of a Microsoft Word document, Markdown file, or Jupyter Notebook, a popular format for working with both code and data. The data are then automatically rendered into HTML through GitHub Actions and added to the COVID-19 Dashboards site.
“I noticed people were on social media sharing their analysis in various formats,” Husain explains. “It wasn’t easy for non-technical people to consume that data, and it wasn’t easy for researchers to update datasets they had already shared.” The problem is that the tools used by researchers and data scientists don’t always translate well to the media everyone else is accustomed to.
For example, if you want to include a dataset from a Jupyter Notebook in a blog post, you need to copy and paste it into the post and format it to make it presentable. The same goes for visualizations: you might have to take a screenshot and then paste it into the post. To make an update, you’ll need to copy and paste again. It’s easy to make an error, and even easier for data to fall out of date.
“COVID-19 Dashboards is easier because the general public can view the outputs of Jupyter Notebooks without needing to install any special software, they just need a modern web browser,” says Canadian software developer Sophiah Ho, who has published dashboards on COVID-19 growth rates in the Ontario, Canada region.
COVID-19 Dashboards is built on Fastpages, a blogging platform Husain maintains with the artificial intelligence company fast.ai. Fastpages makes it easier to share data and code by letting you do all your work in a Jupyter Notebook without having to switch between multiple files and applications.
COVID-19 Dashboards started when Husain used Fastpages to re-publish some COVID-19 growth rate data a statistician had shared as a Jupyter Notebook. “It was an on-topic example of how you could use Fastpages,” Husain explains. “It really caught on. People wanted to add additional reports and analysis. It just grew from there.”
Almost a year after launching Dashboards, Husain says he no longer has to add much data to the project himself: contributors from around the world now pitch in to add and update it.
Contributions without (organizational) borders
The open source community also helped build custom tools to put data to use. As COVID-19 spread, hospitals were faced with the problem of planning for the inevitable surges in their communities. How much personal protective equipment (PPE) would they need, and when? At what point should they halt non-emergency procedures to free up beds for COVID-19 patients?
To answer these difficult questions, many hospitals and governments around the world turned to an interactive, open source tool called CHIME, short for COVID-19 Hospital Impact Model for Epidemics. CHIME enabled users to plug in a few pieces of information, like a region’s population and the current number of COVID-19 hospitalizations, and generate a forecast of future admissions.
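To make the idea of turning a few inputs into a forecast concrete, here is a minimal sketch in Python. It assumes a simple discrete-time SIR (susceptible, infected, recovered) model; it is not CHIME's actual code. The function name, parameter names, and default values are all hypothetical and chosen only for illustration.

```python
def forecast_admissions(population, current_hospitalized, days,
                        doubling_time=4.0, infectious_days=14.0,
                        hospitalization_rate=0.025):
    """Return a list of projected new hospital admissions per day.

    Illustrative only: every default value here is an assumption,
    not a figure taken from CHIME or from any real outbreak.
    """
    gamma = 1.0 / infectious_days                 # daily recovery rate
    growth = 2.0 ** (1.0 / doubling_time) - 1.0   # daily growth implied by doubling time
    beta = growth + gamma                         # daily transmission rate

    # Back out the number of infected people from current hospitalizations.
    infected = current_hospitalized / hospitalization_rate
    susceptible = population - infected

    admissions = []
    for _ in range(days):
        new_infections = beta * susceptible * infected / population
        new_infections = min(new_infections, susceptible)  # cannot infect more than remain
        new_recoveries = gamma * infected
        susceptible -= new_infections
        infected += new_infections - new_recoveries
        admissions.append(new_infections * hospitalization_rate)
    return admissions


if __name__ == "__main__":
    # Changing a single input, such as the doubling time, shifts the whole forecast.
    for doubling_time in (3.0, 6.0):
        forecast = forecast_admissions(1_000_000, 10, 30, doubling_time=doubling_time)
        print(doubling_time, round(sum(forecast), 1))
```

Running the loop at the bottom with two different doubling times shows how sensitive such a forecast is to a single uncertain parameter, which is why an interface for experimenting with inputs matters.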
CHIME started out as an internal project at Penn Medicine. “We had a data scientist collaborate with an epidemiologist to build a model,” explains Michael Becker, one of the hospital’s data scientists. Once the model was created, however, the team realized that it would be more useful for everyone else in healthcare if it had a graphical interface to change inputs and visualize data. “There was a lot of uncertainty in the model,” Becker says. “We wanted users to be able to experiment and see how changing one parameter would affect the forecast.” But their Predictive Healthcare team was small and didn’t have much experience building consumer-facing interactive apps.
Becker turned to Code for Philly to solve the problem. “I sort of co-opted one of their Slack channels and asked what type of tools someone would use to build an interactive data modeling platform,” Becker says.
Volunteers with Code for Philly flocked to CHIME to help. “Normally, a good project will have 30 people involved,” says Code for Philly co-director Marieke Jackson. “We had 400 people join the Slack for CHIME in the span of two weeks. It was the first weekend of Philadelphia’s shelter-in-place order and there were a slew of technologists worrying about COVID-19. They were eager to find a way to put their skills to use.”
The community chose to build the interactive app using Streamlit, an open source framework for building data science and machine learning apps. A month later, with the help of Code for Philly, the Penn Medicine team had produced a working proof-of-concept. Two weeks later, CHIME was deployed to the public. It soon helped Penn Medicine estimate clinical needs during the early stages of the pandemic and ended up being used by health care facilities and public health agencies around the world, perhaps most famously by the city of Washington, DC and the state of New Jersey.
CHIME suffered capacity issues soon after launch. “That’s when Code for Philly support really ramped up,” Becker says. “They have a bunch of DevOps folks who built, deployed, and hosted a new Kubernetes-based infrastructure for CHIME.”
Apart from Slack, the project was organized primarily through GitHub. “When you’re asking people to volunteer their time, you can’t ask them to go to a whole bunch of separate places,” Jackson says. “You need to have one place for version tracking, issue tracking, and documentation, not three different tools.”
They also relied on GitHub for their CI/CD pipeline. A volunteer named Michelle Lee set up GitHub Actions to automate many project management tasks, which was crucial for the small Predictive Healthcare team. “The open source model and workflow made CHIME possible,” Becker says.
In addition to Streamlit, CHIME depends on packages like pandas, NumPy, and Jupyter. “Without the modern open source data science stack, CHIME couldn’t exist,” says Becker. “Someone asked me why these kinds of tools didn’t exist for previous outbreaks. And my answer to this question was that the technology, specifically open source tools, have come a long way in the past several years.”
Building for the future
Becker says CHIME’s model is now outdated, but its underlying components are still useful. For example, the University of California, Davis School of Veterinary Medicine used Streamlit to visualize COVID-19 hotspots.
“It’s hard for us to attribute contributions specifically due to COVID. Streamlit was released publicly October 1, 2019, so essentially the project has been running the entire pandemic,” says Streamlit head of developer relations Randy Zwitch. “We have seen explosive growth in apps created, GitHub stars, issues, pull requests, community forum traffic and every other growth metric you can imagine since the public launch of the product, and certainly some of that has to be due to the pandemic.”
Likewise, PyMC3, an open source Python package built for Bayesian statistical modeling, long predates the pandemic, but has proven to be a valuable tool for developers and researchers. For example, RT.live, a project created by Instagram founders Kevin Systrom and Mike Krieger to track transmission rates of COVID-19, is built on PyMC3.
Since the pandemic began, PyMC3 has seen a major uptick in both users and contributions. Like Zwitch, PyMC3 maintainer Thomas Wiecki can’t say for certain that COVID-19 is responsible. “But, given that people are spending more time indoors and working remotely, I assume that’s encouraged them to work on or with PyMC3,” he says.
PyMC3, itself built upon previous open source efforts including its predecessor PyMC2 and the deep learning package Theano, had already been used for many purposes, ranging from supply chain optimization at SpaceX to searching for new planets. It will continue to be useful even after the pandemic is over.
Many of the open source tools created or improved to respond to COVID-19 could help us fight future epidemics—or solve entirely different problems. Open source contributors aren’t just building for today’s most urgent needs—they’re preparing us for tomorrow’s.