Why Data Science?

Reflecting on the completion of the first phase of the Flatiron Data Science Bootcamp

I’ve always been fascinated by data science, even if I didn’t call it that or really know the term. “Analytics” was the word I used. Regardless, data science has long been at the forefront of what interests me most. It lives at the intersection of sports, economics, entertainment, and politics. Understanding data science means understanding underlying trends: what moves the world, what’s worth paying attention to, and what’s just noise.

I didn’t realize it at the time, but when I wrote a research paper in my Health Economics class about the long-term health effects of having insurance as a child, I was performing data science, just with Stata instead of Python. Data science in economics leads to some surprising conclusions. This paper, written by my professor at the time, Laura Kawano, studies the economic impact of Hurricane Katrina on New Orleans residents, primarily those who were forced to move. The authors find, surprisingly, that the average New Orleans hurricane victim ends up earning more after 2008 than the average resident of the control cities (cities with demographic and economic makeups similar to New Orleans that were not hit by a hurricane). One somewhat negative explanation is that by moving to a more expensive location, a person may have received an increase in nominal wages but a reduction in real wages because of the higher cost of living.

The other two explanations they find are more interesting. First, they find that transformations after the hurricane actually improved the New Orleans labor market as prevailing wages increased. Second, the researchers found that residents forced to move often chose places to live with better economic opportunities. They write, “Consistent with this hypothesis, we find that the increase in labor income was highest for those who left and never returned to New Orleans.” New Orleans is a city with strong ties among its residents, meaning many never leave for greener economic pastures. This, however, does not mean that these former residents with increased income saw an increase in utility. Seeing everything you’ve ever known destroyed is impossibly hard, and higher wages or not, I’m sure these displaced people will always have a longing for home.

In addition to the ties with economics, I am absolutely enamored with the use of data science in professional sports. The insights derived from algorithms and models in baseball, football, and basketball have been literally game-changing, and the smartest front offices have held an edge. Even the casual sports fan is probably aware of how many 3s the average NBA team takes (not the exact number, just that it’s a lot).

They may be less aware that data science has transformed baseball into a sport where only 3 things happen: walks, strikeouts, and home runs. 2018 was the first year in history in which more than a third of plate appearances ended in one of these outcomes. Starting with the Moneyball revolution, teams realized that players who got on base more often were extremely valuable and led to more team success. Teams also discovered that it was more efficient to just have a player hit a home run than to put the ball in play and risk an out, or to give up an out to score a run. Why bother with singles and bunts when you have guys who can hit the ball 400 feet? To combat this change, teams invested in pitchers who could routinely throw 100 mph. No matter how talented you are, a pitch at that speed is incredibly hard to hit, which has led to the increase in strikeouts.
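As a rough illustration of that "over a third" figure, here is a minimal sketch of how the share of walks, strikeouts, and home runs could be computed. The function name and the sample numbers are my own and purely hypothetical, not real league totals. It divides by plate appearances, since a walk does not count as an official at-bat.

```python
def three_true_outcomes_rate(walks, strikeouts, home_runs, plate_appearances):
    """Share of plate appearances ending in a walk, strikeout, or home run."""
    if plate_appearances <= 0:
        raise ValueError("plate_appearances must be positive")
    return (walks + strikeouts + home_runs) / plate_appearances


# Hypothetical numbers for illustration only (not real MLB data).
rate = three_true_outcomes_rate(
    walks=60, strikeouts=150, home_runs=30, plate_appearances=600
)
print(f"{rate:.1%} of plate appearances ended in one of the three outcomes")
```

Anything above roughly 0.333 would clear the "over a third" bar mentioned above.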

The analytics revolution in football, however, has been the most fascinating change. Football will never be on par with baseball in the analytics department. Baseball is turn-based with few variables, while a play in football involves 22 people all moving at the same time, with a different role for every position. What interests me most, though, is 4th down decision-making. NFL coaches are notoriously risk-averse, often choosing to punt on short 4th downs. Brian Burke, a former government contractor who ran a football analytics site and supplied his 4th down model to the New York Times 4th Down Bot, has 3 rules, all backed up by probability:

On fourth-and-1, go for it any place on the field where that is possible, starting at your 9-yard line.

On fourth-and-2, go for it everywhere beyond your 28-yard line.

On fourth-and-3, go for it almost everywhere beyond your 40.
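The three rules above are simple enough to write down as a lookup. This is only a sketch of the rules of thumb as quoted, not Burke's actual model, which works from win and scoring probabilities. The function name and the field-position convention (yards from your own goal line) are my own assumptions, and the "almost" in the third rule is not modeled.

```python
def should_go_for_it(yards_to_go: int, yards_from_own_goal: int) -> bool:
    """Apply the three quoted rules of thumb for 4th down.

    yards_from_own_goal is a hypothetical convention for this sketch:
    1 is your own 1-yard line, 50 is midfield, 99 is the opponent's 1.
    """
    if not 1 <= yards_from_own_goal <= 99:
        raise ValueError("yards_from_own_goal must be between 1 and 99")
    if yards_to_go == 1:
        return yards_from_own_goal >= 9   # "starting at your 9-yard line"
    if yards_to_go == 2:
        return yards_from_own_goal > 28   # "beyond your 28-yard line"
    if yards_to_go == 3:
        return yards_from_own_goal > 40   # "beyond your 40" (exceptions ignored)
    # The quoted rules do not cover other distances; default to not going.
    return False


# Fourth-and-2 from your own 35-yard line: the rules say go for it.
print(should_go_for_it(yards_to_go=2, yards_from_own_goal=35))
```

A real decision model would also weigh score and time remaining, which is what separates a rule of thumb like this from something like the 4th Down Bot.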

This season, teams are going for it at a record rate on fourth-and-1, but still less often in other situations. For situations where coaches are particularly cowardly, we have one of my favorite sports Twitter accounts: The Surrender Index. This account tweets every time a team punts and scores the decision using a model built with NFL data and Python. The model takes into account the score, field position, expected points, and other factors to determine how cowardly the decision to punt was.

The fact that this is all done in Python and Jupyter Notebook showed me just how many different applications there are for data science and for the sorts of things I’m learning in this bootcamp. As I finish up my Phase 1 project, I’m really encouraged by what I’ve learned so far, and even more excited to build on it going forward.
