Machine Learning isn’t Magic: Matching Datasets & Problem Spaces Pt. 1

Markham Lee
4 min read · Sep 11, 2023


Photo by Mikhail Nilov from Pexels

The first step in a predictive analytics AI/Machine Learning (ML) project should be to validate that the available or selected data can actually solve the problem at hand: given the problem space, verify that this data can even answer the question. What often happens instead is: “go solve the problem with this data,” as if that’s the magic of ML: it doesn’t need good data, just vaguely related data. This skipping of steps is the reason a lot of ML projects fail: the team was never working on the things required for them to be successful.

Before we go further, a thought experiment: you’re starting a baked goods company, and the plan is to produce a variety of baked goods that you will sell to restaurants, grocery stores, coffee shops and the like. Supply chain wise, a reasonable approach would be to set a menu and then work backwards from there to determine the ingredients you’ll need to produce the various cakes, pies and other pastries. E.g., if you want to make chocolate cakes, you need to order the ingredients that go into the style of chocolate cake you wish to produce. An unreasonable approach would be to give the cooking staff 60–80% of the needed ingredients for a particular menu item and tell them: “if the baked goods sell well, we can investigate having all the proper ingredients in phase 2.” I think most would agree that what I just described is a terrible approach that is destined to fail, so why, then, do we do this on data projects, especially ML ones where the model is only as good as the data we use to train it?

The first time I encountered this, I thought “well, that client was weird.” The second time it happened I thought: “we got things back on track, the client just didn’t know.” Then the third time, while sifting through the ruins of a project I inherited from another team, I thought to myself: “the same problem across three wildly different orgs? Maybe this is a pattern.” By the fourth and fifth times, well, you get the point. Let’s talk about two of those examples:

The journey of 1,000 steps becomes 2,000 if you start out going in the wrong direction

My team was selected to take over a failed project from another team. The narrative we received was that the prior team had worked for months and failed to produce anything substantial. The thinking behind bringing in my team was that, with our superior engineering and data chops, we would be able to succeed where they had failed. I.e., they had been working on the right things; they just hadn’t done them well enough or fast enough to produce the desired result.

Before I go further, let’s quickly discuss how ML typically works in a predictive analytics scenario: you have historical data that contains the target variable you want to predict (the dependent or y variable), and then you have your independent variables (the x variables, i.e., the features) that relate to the target. E.g., if you want to predict mortgage defaults, default status is your target variable, and your independent variables are things like income, payment history, home equity levels, credit score, changes in credit score, current debt-to-income ratio, etc. The model is built by solving for the weight and bias terms applied to the independent variables that enable it to reliably predict the target variable.
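To make that concrete, here’s a minimal sketch of that setup in Python using scikit-learn. The CSV file, column names, and choice of logistic regression are hypothetical stand-ins for the mortgage example, not a prescription:

```python
# Minimal sketch: target (y) vs. independent variables (X) for a default model.
# "historical_loans.csv" and the column names below are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Historical data where the outcome we want to predict is already recorded
loans = pd.read_csv("historical_loans.csv")

# y: the target (dependent) variable -- did the borrower default?
y = loans["defaulted"]

# X: the independent variables (features) that relate to the target
X = loans[["income", "credit_score", "debt_to_income", "home_equity", "missed_payments"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The model learns the weight and bias terms that map the features to the target
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("Test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```

The key point isn’t the algorithm: without the “defaulted” column (the target) and reasonably complete feature columns, there is nothing for any model to learn from.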

Going back to the example: they were never working with the right dataset(s) to achieve the desired goals: they didn’t have target variables, and the independent variables were incomplete. It’s worth noting that the data scientists knew this; I found notes and comments throughout their code stating that they were doing their best with a “weak dataset,” and clearly describing the additional data they needed and who had it. Their documentation even explicitly laid out what teams and data sources were needed to build the desired models: “the new team will need to acquire data from….” So, what happened here beyond no one listening to the data professionals? It’s simple: the goal the stakeholders were excited about and/or could get funding for was defined as building a model with a specific set of inputs, and no one was willing to listen to the data professionals or shift course. As I stated earlier, the first time I encountered this I thought it was a weird anomaly, but after seeing it several times…

…TL;DR: the team spent months building data pipelines, dashboards and the like around data that the stakeholders felt would result in predictive models, not because it was the right approach but because it was the approach the leaders and/or stakeholders insisted on, thought would work, or were excited about. The stakeholders were then surprised, upset and frustrated at the lack of progress, but the data scientists saw it coming from the start because they were fully aware that they weren’t working on the right things to achieve the desired goals. It sounds wild, but you’d be surprised how often this happens in the data space. Building X from Y often becomes more of a goal than “let’s solve this problem in the most effective way possible.”
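For illustration, sticking with the hypothetical mortgage columns from earlier, here’s a rough sketch of the kind of up-front dataset check that would have surfaced the gap before months of pipeline work. The required columns and coverage threshold are assumptions you’d replace with whatever your problem actually demands:

```python
# Hedged sketch: before building pipelines or models, confirm the dataset
# actually contains the target variable and the required features.
# Column names and the 90% coverage threshold are illustrative assumptions.
import pandas as pd

REQUIRED_TARGET = "defaulted"
REQUIRED_FEATURES = ["income", "credit_score", "debt_to_income", "missed_payments"]

def dataset_can_support_model(df: pd.DataFrame, min_coverage: float = 0.9) -> bool:
    """Return True only if the target and required features exist with adequate coverage."""
    needed = [REQUIRED_TARGET, *REQUIRED_FEATURES]
    missing_cols = [c for c in needed if c not in df.columns]
    if missing_cols:
        print(f"Missing columns entirely: {missing_cols}")
        return False

    # Fraction of non-null values per required column
    coverage = df[needed].notna().mean()
    sparse = coverage[coverage < min_coverage]
    if not sparse.empty:
        print(f"Columns below {min_coverage:.0%} coverage:\n{sparse}")
        return False
    return True

loans = pd.read_csv("historical_loans.csv")
if not dataset_can_support_model(loans):
    print("Stop: acquire the missing data before building pipelines or models.")
```

A check like this takes an afternoon, and it forces the “can this data even solve the problem?” conversation before anyone commits months of work.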

Stay tuned for part two, where I’ll discuss a situation where some of the issues were recognized up front and how the project was able to pivot, and then go over how to structure your projects to avoid these issues.


Markham Lee

Depending on the day, wearing the data engineer or data scientist hat at Nortal: https://nortal.com, LinkedIn: https://www.linkedin.com/in/markhamlee/