Machine Learning isn’t Magic: Matching Datasets & Problem Spaces Pt. 2

Markham Lee
4 min read · Sep 18, 2023


Photo by Leeloo Thefirst from Pexels

Last time we talked about how poor upfront planning, bad assumptions and fixating on leveraging machine learning in a specific way can doom a machine learning initiative. For this installment, let’s talk about how approaching a machine learning initiative as a “science project,” where you focus on the problem and investigate ways to solve it (ML being just one of them), can lead to great outcomes, even if the eventual solution isn’t what was initially planned.

An effective solution > an ML model

This example has a “happier ending,” but it starts out much the same: a client wants a predictive model but has an incomplete dataset. The added wrinkle is that they’re aware of some of the gaps in their dataset(s) and are hoping we can use ML to fill them in. They’re also aware that we’ll need a historical dataset and are willing to wait for us to aggregate that data, i.e., collect data from their day-to-day business operations over the course of several weeks.

I was skeptical we’d be able to fill in those gaps, so while the data engineers were building the dataset, another data scientist and I researched how similar companies had solved this problem. We even dug into operational reports, annual reports and the like to see if other groups within the company were collecting the data we needed. The outcome of all this research was that we identified the types of data and the additional data collection capabilities we’d need to solve the client’s problem.

After presenting our findings to our stakeholders, they did some digging within the company and helped us pull together a meeting across three different teams to discuss how they could better share information and work together. It started off a bit rocky, as these groups didn’t historically work well together, but once everyone got focused, we walked away with agreements to share data, send alerts between groups, provide access to databases, etc. In the end we didn’t deliver the main items in our SOW, namely an ML model and the integration of its outputs into their current systems. Despite that, we had a very happy client who gladly paid our invoice. Why? Because we solved their problem, and that’s what they really wanted; ML was a potential means to that end, not the goal in and of itself.

TL;DR: the client didn’t need an ML model; they needed someone to help them break down silos and facilitate better data sharing across the company. The real deliverable was solving their problem, not building an ML model.

So, the point?

The point is that you can’t take shortcuts if you want to build not just a machine learning model, but one that produces value for your business. This means following some version of the process below:

1) Study the problem space in depth and think through the ways you can use technology to solve said problem. As noted above, you may not need a machine learning model and that’s okay.

2) If you are going to use machine learning, then you need to identify the data that is likely needed to build that model. For example: predicting train arrivals is less a function of speed, location and distance to the next station, and more an exercise in predicting the things that can cause delays and then predicting how long those delays will last.

3) Once you’ve identified the data you need, compare that to the data that’s currently available, identify the gaps and then work to curate and/or build the desired dataset. A good idea is to do some “quick and dirty” data pulls from the various identified data sources so you can build some initial models to validate your approach prior to building out the data collection infrastructure (see the sketch after this list). The goal here is to validate that applying ML to the data can actually solve the problem. That being said, you can often do the validation work and the data infrastructure work in parallel, as centralizing data collection and management tends to provide significant benefits well before any ML models are built.

4) Presuming step three goes well, the work will transition into more of a data engineering project as you start building out the infrastructure to support training, maintaining, monitoring and deploying the machine learning model(s).
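To make step three concrete, here’s a minimal sketch of what a “quick and dirty” validation pass might look like in Python, using the train-delay example from step two. Everything specific here is an assumption for illustration: the quick_pull.csv export, the column names and the choice of a random forest are all hypothetical stand-ins, not details from an actual project.

```python
# A minimal sketch of a "quick and dirty" validation pass (step 3).
# Assumes a hypothetical CSV export from one of the identified sources;
# the file name and column names below are illustrative only.
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("quick_pull.csv")  # hypothetical rough data pull

# Per step two: model the causes of delay, not just speed/position.
features = ["weather_severity", "incident_count", "track_congestion"]  # assumed columns
target = "delay_minutes"  # assumed target

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=42
)

# Naive baseline: always predict the mean delay.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))

# Initial model trained on the rough pull.
model = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)
model_mae = mean_absolute_error(y_test, model.predict(X_test))

print(f"baseline MAE: {baseline_mae:.1f} min | model MAE: {model_mae:.1f} min")
# If the model can't meaningfully beat the naive baseline, that's a
# signal to revisit the data (or the approach) before investing in
# data collection infrastructure.
```

The point of a pass like this isn’t a production model; it’s a cheap, early signal about whether ML applied to the available data can plausibly solve the problem, before you commit to the data engineering work in step four.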

A lot of projects fail because the team never really researches the types of data needed to solve the problem, nor engages in any sort of validation exercise; they just move forward on the assumption that “ML” plus the initially identified dataset will produce a solution. Meanwhile, as any data scientist with more than two months on the job can tell you, unless you’re improving upon existing models, the initial dataset nearly always needs augmentation from additional sources to solve the business’s problem.

Skipping steps and leaving out key ingredients rarely works outside of the world of data science, and it most certainly doesn’t work within it. Slowing down, evaluating datasets, validating assumptions and doing the upfront work to make sure you’re on the right path are critical to your project’s success. If you don’t do the initial study and validation work up front, more often than not you’ll end up doing it anyway when your team has to regroup or pivot after the initial attempt fails (see my first example). I.e., do things right or do them twice. ML isn’t magic; it can’t withstand a lack of research or an unwillingness to pivot.
