Machine Learning and HR

Mar 28, 2021

Predicting Employee Attrition

Phase 3 of the Flatiron Data Science program has been all about classification problems. We’ve learned a number of models, from logistic regression to decision trees to ensemble methods that combine multiple models.

One of the more interesting applications of classification models is predicting employee attrition. HR data, which, depending on the company, might not be considered a key business metric, contains really useful information and can be a real advantage for firms that take time to understand it.

Any profit-maximizing firm would want to reduce employee turnover. Going through a hiring process is expensive, and so is losing institutional knowledge. Longer-tenured employees (a year or more) have spent the necessary time to learn the company’s culture, how different technical processes work, and what actions to take to reduce inefficiencies.

I am experiencing this firsthand. I recently accepted a new job, and I put in my two weeks’ notice just about two weeks ago. My current job had lots of slow days, but also plenty of busy ones. However, I didn’t really come to terms with how ingrained in the company I was (my tenure was 2.5 years) until the last couple of weeks. I’ve created job aids and information dumps on the various things I’ve worked on, and there were more of them than I had realized. A lot of work that was simply habit for me was, in reality, a lot to learn, and will be a lot to learn for the next person in this position. The process to backfill my role will take time, plus the time needed to train that person. The opportunity cost of not having me in that role is not insignificant.

Knowing when employees will leave, and being able to predict which ones are most likely to leave, could position the firm to proactively work on retaining them. This could include promotions and pay raises, training and employee investment, a switch to more flexible remote work, etc. Job satisfaction is one of the most predictive factors of whether someone will leave. Learning how to improve job satisfaction should be a top priority of a firm.

For my Phase 3 project, I used a dataset created by IBM data scientists to train classification models. The goal was to teach these models how to predict employee attrition, with an emphasis on reducing the false negative rate. We want a model that correctly identifies as many of the employees who will leave as possible.

In this context, minimizing the false negative rate is most important to us. A false negative here is an employee the model predicts will stay but who actually leaves, so the smaller the rate, the better our model is doing at catching the employees who will leave the company. In my analysis, I found that the best model was logistic regression. Although it does sacrifice some accuracy and precision, having a high false positive rate is not necessarily damaging to a company. Our model misclassified 39% of employees who didn’t leave the company. This is a large percentage, but in this context, it is likely less costly to flag more employees as likely to leave than actually do than to miss employees who really do leave. The company would likely invest more in retaining employees who weren’t going to leave anyway, but this wouldn’t harm the company. Expenses might rise, but in theory you’ll have a happier firm overall. The rise in productivity would likely offset the investment to retain these employees.
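To make the trade-off concrete, here is a minimal Python sketch of how the false negative rate and false positive rate are computed, and how they move in opposite directions when a classifier flags employees more aggressively. The labels, scores, and thresholds below are made-up toy values, not results from my project, and lowering the decision threshold is just one illustrative way to trade false negatives for false positives, not necessarily how the project's model was tuned.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), where 1 = left the company and 0 = stayed."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def error_rates(y_true, y_pred):
    """Return (false negative rate, false positive rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    # FNR: share of employees who left that the model missed.
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    # FPR: share of employees who stayed that the model flagged as leaving.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fnr, fpr


def predict_at(scores, threshold):
    """Flag an employee as likely to leave when the score meets the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical toy data: true outcomes and predicted attrition probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.35, 0.7, 0.4, 0.3, 0.2, 0.1]

for threshold in (0.5, 0.3):
    fnr, fpr = error_rates(y_true, predict_at(scores, threshold))
    print(f"threshold={threshold}: FNR={fnr:.2f}, FPR={fpr:.2f}")
```

On this toy data, dropping the threshold from 0.5 to 0.3 takes the false negative rate from 0.33 to 0.00 while the false positive rate climbs from 0.20 to 0.60, which is the same kind of exchange described above: fewer missed leavers at the price of more false alarms.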

What does concern me about this data and the models developed is potential unethical use. For example, gender and race should probably be excluded from any model so as to avoid bias from employers. According to the Center for American Progress, Black workers face far higher unemployment rates than white workers, and when they do get jobs, they’re paid systematically less than white employees. In addition, Black workers earn fewer benefits and work in less stable industries than white workers. Data scientists developing machine learning algorithms in this space (and across ML applications) must work to remove biases from their models. Future work on this model should focus on not introducing racial or gender bias, and on ensuring that employers can’t abuse the model as an excuse to fire employees they do not like.

Written by Matthew Schwartz