- 16th April 2018
- Posted by: Manolis
Wes McKinney hates the idea of researchers wasting their time. “Scientists unnecessarily dealing with the drudgery of simple data manipulation tasks makes me feel terrible,” he says.
Perhaps more than any other person, McKinney has helped fix that problem. McKinney is the developer of “Pandas”, one of the main tools used by data analysts working in the popular programming language Python.
Millions of people around the world use Pandas. In October 2017 alone, Stack Overflow, a website for programmers, recorded 5 million visits to questions about Pandas from more than 1 million unique visitors. Data scientists at Google, Facebook, JP Morgan, and virtually every other major company that analyzes data use Pandas. Most people haven’t heard of it, but for many people who do heavy data analysis, a rapidly growing group these days, life wouldn’t be the same without it. (Pandas is open source, so it’s free to use.)
So what does Pandas do that is so valuable? I asked McKinney how he explains it to non-programmer friends. “I tell them that it enables people to analyze and work with data who are not expert computer scientists,” he says. “You still have to write code, but it’s making the code intuitive and accessible. It helps people move beyond just using Excel for data analysis.”
Basically, Pandas makes it so that data analysis tasks that would have taken 50 complex lines of code in the past now only take 5 simple lines, because McKinney already did the heavy lifting.
McKinney, 32, grew up in Akron, Ohio. From an early age, he showed a penchant for math and technology: in high school he was a mathlete and ran a website dedicated to the video game GoldenEye 007. He went on to attend the Massachusetts Institute of Technology, where he studied pure math.
Like many quants, after graduating McKinney headed to New York to work in finance at AQR Capital Management. At the hedge fund he found that the hard finance problems were more about dealing with data than math. The most valuable work involved gathering new sources of data, merging datasets together, and cleaning it all up. As anyone who works in data science knows, quality data is far more important than fancy analysis.
McKinney was frustrated with the tools available to complete these basic data tasks at the time; he was not a fan of Excel or R (another popular programming tool). A colleague suggested McKinney try the language Python. McKinney was smitten. “I fell in love pretty quickly with Python,” McKinney told me. “I loved it for its economy of expressions. You can express complicated ideas in Python with very little code, and it is very easy to read.”
But Python was missing some key features that would make it a good language for data analysis. For example, it was challenging to import CSV files (one of the most common formats for storing datasets). It also didn’t have an intuitive way of dealing with spreadsheet-like datasets with rows and columns, or a simple way to create a new column based on existing columns.
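For a sense of what those tasks look like in present-day Pandas, here is a minimal sketch. The tiny dataset and its column names (`city`, `units`, `price`) are invented for illustration and do not come from McKinney's own work; in practice the CSV would usually be a file on disk rather than a string.

```python
import io

import pandas as pd

# Hypothetical CSV contents, standing in for a real file path.
csv_text = """city,units,price
Akron,3,2.50
Durham,10,1.20
New York,7,4.00
"""

# Import the CSV into a DataFrame: a table with labeled rows and columns.
sales = pd.read_csv(io.StringIO(csv_text))

# Create a new column from existing columns in a single line.
sales["revenue"] = sales["units"] * sales["price"]

print(sales)
```

Each step here is one line, which is roughly the kind of economy the article describes.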
Pandas addressed these problems. David Robinson, a data scientist at Stack Overflow, explained the importance of it in technical terms. “The idea of treating in memory data like you would a SQL table is incredibly powerful,” he says. “By introducing the ‘DataFrame,’ Pandas made it possible to do intuitive analysis and exploration in Python that wasn’t possible in other languages like Java. And is still not possible.”
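Robinson's comparison to SQL can be illustrated with a short sketch: a filtered, grouped aggregation on a DataFrame reads much like a `SELECT ... GROUP BY` query. The `orders` table and its columns are made up for this example, and the SQL in the comment is only an approximate equivalent.

```python
import pandas as pd

# Hypothetical in-memory table of orders.
orders = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cy", "bob"],
    "amount": [20.0, 5.0, 15.0, 40.0, 10.0],
})

# Roughly equivalent to:
#   SELECT customer, SUM(amount) AS total
#   FROM orders
#   WHERE amount >= 10
#   GROUP BY customer
#   ORDER BY total DESC
totals = (
    orders[orders["amount"] >= 10]          # WHERE
    .groupby("customer", as_index=False)["amount"]
    .sum()                                  # GROUP BY + SUM
    .rename(columns={"amount": "total"})    # AS total
    .sort_values("total", ascending=False)  # ORDER BY
    .reset_index(drop=True)
)

print(totals)
```

The data never leaves memory and no database is involved, which is the point Robinson is making.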
Pandas in the wild
McKinney built the basics of Pandas in 2008, and made the project public in 2009. By 2010, some people were independently discovering the tool on the internet or by seeing McKinney speaking about it at data science conferences. That year, McKinney left AQR to pursue a PhD in statistics at Duke, leaving him little time to work on improving Pandas.
Towards the end of his first year at Duke, McKinney and Pandas faced a pivotal moment. He gets a bit sentimental speaking about it. “I felt that Python as a language was facing an existential crisis,” he says. “Python was either going to become relevant as a statistical computing language or it wasn’t, and I felt it had so much potential. I decided to drop out of graduate school to work on Pandas as much as possible. I wanted to make it the cornerstone of the Python [data science] ecosystem, that I thought it needed to be.”
He made the right choice. With McKinney dedicating his time to improving Pandas, and the development of other important Python data science tools, such as the plotting library Matplotlib and the interactive interface IPython, Python would become perhaps the most important programming language in data science.
The chart below shows the rise of Python in terms of traffic on Stack Overflow. In a blog post, Stack Overflow concluded that the rising popularity of Python stems from its importance in the burgeoning field of data science, and that much of this popularity can be ascribed to the influence of Pandas.
McKinney says it became clear to him by the middle of 2012 that Pandas was taking off. He didn’t take much time to bask in its success. The original code was “inelegant,” he says, so he spent years improving the backbone of the tool and adding features. McKinney attributes Pandas’s prominence, in large part, to his willingness to be vulnerable. “With any creative project, but particularly with open source, it can be terrifying because you are opening yourself up to criticism from anybody,” he notes. The key is to welcome that criticism, he stresses.
Today, McKinney works full time on Pandas and other open-source data science projects as a software engineer for the investment fund Two Sigma. Two Sigma has many Pandas users, and McKinney says they hired him to make sure data science tools for Python continue to develop. He thinks more companies should follow Two Sigma’s lead by hiring the developers of the open-source projects they rely on.
Talking with McKinney, it is striking how strongly he feels about improving data science tools, not a topic that usually elicits such passion. I asked him why it was so important to him. “My goal is to empower people to solve problems,” he replies. “When people can analyze data more effectively, it makes them more productive, and helps us make more progress as humans. I want to free people from mundane tasks [and] allow them [to] focus on the problems they are an expert in.”