How data quality impacts your ML algorithm

By Josephine Clercx • published December 1, 2021 • last updated June 27, 2022

If you’re at all interested in tech, the rise of Machine Learning won’t have escaped your notice. High on the agenda for a wide range of business sectors, it’s becoming required knowledge in the analyst world.

So what’s the secret to building high quality ML algorithms? The data you use to train, test, and ultimately run your model is what will make or break it, and preparing that data well takes some serious work. Here, we’ll walk through selecting the right model for your ML objectives and how to feed, train, and test your creation.

How to choose the right model for your Machine Learning goals

Deciding on the right model for your ML algorithm depends on the business objectives you want it to meet. As it’s a key component of accurate forecasting for any sector, the potential of ML is vast. So it’s vital to home in on exactly what you want to achieve with your algorithm before you begin to build and train it. That way, you’ll ensure it’s usable, accurate, and effective for the business target in question.

You should always trial multiple ML algorithms. This will allow you to compare results as you develop your model, shelving algorithms that don’t produce useful results and enhancing more promising options.

You should also consider how much data you have available to put into your model. Some more complex ML models perform poorly without huge quantities of input data, and they also take up a lot of your time and storage capacity (LSTM models are a prime example).

So more complex isn’t necessarily better. Choosing a simpler, lighter ML approach (such as a linear regression model) can help you stay agile, save time, and progress rapidly. Often, straightforward models with minimal features produce better results.

How to feed your ML algorithms the right data

The success of your ML model hinges on the data you use to power it. These golden rules will keep things on track:

1. You can never have enough input data

The more quality data you give your ML model, the more patterns it can learn from, and the more reliable its outputs tend to become. So feed your algorithms as much quality data as you can.

2. Find data that represents your problem/objective

Inputting relevant data is key to maximizing the power of your ML. Thorough data classification can help, as well as human eyes to decide which data sources are right for the job.

Let’s take the aviation industry as an example. European flight data tends to show a dip on Christmas Day, as fewer people travel. Enabling an ML model to distinguish between days of the year therefore improves the algorithm, as it can attribute this dip directly to December 25th. The airline can then reliably schedule fewer flights, contract fewer crew members, order less in-flight catering, and so on, without compromising on service.

In this vein, we might think that teaching the ML model to recognize weekends and weekdays would produce similar useful results. If more people fly at the weekend, and some months have more weekend days than others, airlines could identify opportunities to maximize efficiency and save costs. Yet our research for an aviation client showed increased weekend flights for just 2% of their data. The outcome? The time and effort to teach their ML to separate out weekends from weekdays wouldn’t be worthwhile.
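As a minimal sketch of the calendar features discussed above, the snippet below turns a date into flags a forecasting model could learn from. The function name `calendar_features` and the feature names are illustrative assumptions, not part of any client project.

```python
from datetime import date

def calendar_features(day: date) -> dict:
    """Turn a date into simple flags a forecasting model can learn from."""
    return {
        "day_of_year": day.timetuple().tm_yday,
        "is_christmas_day": day.month == 12 and day.day == 25,
        # weekday() counts Monday as 0, so 5 and 6 are Saturday and Sunday.
        "is_weekend": day.weekday() >= 5,
    }

features = calendar_features(date(2021, 12, 25))
print(features)
```

Whether a flag such as `is_weekend` earns its place is an empirical question: as in the aviation example, it is worth checking how much of the data the feature actually affects before investing in it.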

3. Check your data thoroughly

Only accurate input data will generate useful forecasts. Check yours for anomalies, inconsistent normalization, missing values, and improper formats, labels, or structure. Eliminate noise too. In our aviation example, it was key to delete any flights that had only flown as stand-ins for regular services.
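The checks above could be sketched roughly as follows. The record layout (`date`, `passengers`, `is_stand_in`) and the sample rows are invented for illustration; a real pipeline would have its own fields and rules.

```python
from datetime import datetime

def check_records(records, required=("date", "passengers")):
    """Return (row_index, problem) pairs for basic data quality issues."""
    problems = []
    for i, row in enumerate(records):
        # Missing pieces: required fields that are absent or empty.
        for field in required:
            if row.get(field) in (None, ""):
                problems.append((i, f"missing {field}"))
        # Improper format: dates must follow YYYY-MM-DD.
        raw_date = row.get("date")
        if raw_date:
            try:
                datetime.strptime(raw_date, "%Y-%m-%d")
            except ValueError:
                problems.append((i, "bad date format"))
        # Anomalies: a passenger count can never be negative.
        passengers = row.get("passengers")
        if isinstance(passengers, (int, float)) and passengers < 0:
            problems.append((i, "negative passengers"))
    return problems

def drop_stand_ins(records):
    """Remove noise: flights that only flew as stand-ins for regular services."""
    return [row for row in records if not row.get("is_stand_in", False)]

# Made-up sample rows, one per kind of problem.
records = [
    {"date": "2021-12-24", "passengers": 180},
    {"date": "24/12/2021", "passengers": 175},
    {"date": "2021-12-25", "passengers": None},
    {"date": "2021-12-26", "passengers": -5},
    {"date": "2021-12-26", "passengers": 90, "is_stand_in": True},
]

problems = check_records(records)
cleaned = drop_stand_ins(records)
print(problems)
```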

4. Use clustering to your advantage

Clustering your data allows you to group items with similar behavior. This streamlines your work by enabling you to create 10 rather than 10,000 forecasts. For example, if a new iPhone comes out, we can cluster data on items sold from older iPhones to predict the new model’s success. We can reliably draw out derived indicators as well, such as production and labor costs.

At Cohelion, we think of clustering as optimizing without a target, as you never know in advance how many clusters you’ll create. It’s about finding patterns as you sift through the data, always taking care to separate out solid trends from mere coincidences.
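To make the idea concrete, here is a small sketch using plain k-means on a single number per item. This is an illustrative choice, not necessarily the method used at Cohelion. Note that k-means needs the number of clusters up front, so in practice you would try several values of `k` and compare the groupings. The sales figures are made up.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster numbers into k groups with plain k-means (Lloyd's algorithm)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers

# Hypothetical average weekly sales for seven products.
weekly_sales = [120, 130, 125, 900, 950, 40, 45]
labels, centers = kmeans_1d(weekly_sales, k=3)
print(labels, centers)
```

With the items grouped, you can forecast once per cluster instead of once per item, and a new product expected to behave like an existing group can borrow that group's center as a first estimate.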

5. Add features wisely

Make sure any features you introduce to your ML enhance it, rather than just overcomplicate it. Testing is the way ahead here: Add features one by one, so you can clearly see their effect on the model and immediately know where to backtrack if needed. On balance, simpler ML models tend to need more features to produce useful results.

Forecasting in the context of COVID-19

The impact of the pandemic produced unusual data in every sector. As a result, many businesses are taking a dual approach to business-critical forecasting. 

One scenario projects a COVID-free future, based on all historical data aside from 2020 and 2021. The other covers a future with ongoing COVID, built on historical data including the COVID years of 2020 and 2021. Depending on the global situation, managers can then base decisions on either forecast, or identify a middle path. 
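The dual approach can be sketched by building the same baseline twice from different slices of history. A per-month average stands in for a real forecasting model here, and the figures are invented for illustration.

```python
from statistics import mean

def monthly_baseline(history, exclude_years=()):
    """Average value per calendar month, optionally skipping some years."""
    by_month = {}
    for (year, month), value in history.items():
        if year not in exclude_years:
            by_month.setdefault(month, []).append(value)
    return {month: mean(values) for month, values in by_month.items()}

# Hypothetical July passenger counts, keyed by (year, month).
history = {
    (2018, 7): 1000,
    (2019, 7): 1100,
    (2020, 7): 300,
    (2021, 7): 600,
}

covid_free = monthly_baseline(history, exclude_years=(2020, 2021))
with_covid = monthly_baseline(history)
# One possible "middle path": the midpoint of the two scenarios.
middle_path = (covid_free[7] + with_covid[7]) / 2
print(covid_free[7], with_covid[7], middle_path)
```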

How to train your ML model

With your model selected and ready to learn, it’s time to train and test it using specific sets of data. Keep these pointers in mind: 

1. Use the same training sets and measures to evaluate your models

To usefully compare the various algorithms you’re trialing, you’ll need to train them with the same data. Once they’re trained, rate them against a consistent test set (examining MAE, RMSE, etc. depending on your ML model’s goal). 
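As a rough sketch of this kind of like-for-like comparison, the snippet below trains two deliberately simple models on the same training set and scores both on the same test set with MAE and RMSE. The two models (a mean baseline and a straight-line trend) and the data are assumptions chosen for illustration.

```python
from math import sqrt

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more than MAE does."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def fit_mean(xs, ys):
    """Baseline model: always predict the training average."""
    avg = sum(ys) / len(ys)
    return lambda x: avg

def fit_line(xs, ys):
    """Straight-line trend fitted by ordinary least squares."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

# Made-up data: every model sees the same training set and the same test set.
train_x, train_y = [1, 2, 3, 4, 5, 6], [12, 14, 15, 18, 21, 22]
test_x, test_y = [7, 8], [24, 27]

scores = {}
for name, fit in {"mean baseline": fit_mean, "linear trend": fit_line}.items():
    model = fit(train_x, train_y)
    predictions = [model(x) for x in test_x]
    scores[name] = {"MAE": mae(test_y, predictions), "RMSE": rmse(test_y, predictions)}

for name, result in scores.items():
    print(name, result)
```

Because both models share the same split and the same measures, any difference in the scores reflects the models themselves rather than the data they happened to see.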

2. Keep an eye out for overfitting/underfitting

If your ML model overfits, it’s fitting the training data too closely to forecast reality accurately. It’s paying too much attention to noise and small errors in your training set and, as a result, it’s producing skewed projections. Underfitting is the opposite: Your model is too simple to capture the patterns in the data you’re giving it.

Most often, you’ll be able to detect overfitting or underfitting by eye. Some ML models will include features that flag these errors, but the best way to avoid them is by feeding your algorithms quality input data in the first place.
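Alongside checking by eye, one common numerical check is to compare error on the training set with error on held-out test data. The sketch below uses a tiny k-nearest-neighbors forecaster purely as an illustration, since its `k` setting makes both failure modes easy to trigger; the model choice and the data are assumptions, not a method from this article.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mean_abs_error(train, data, k):
    """Mean absolute error of the k-NN forecaster on the given data."""
    return sum(abs(y - knn_predict(train, x, k)) for x, y in data) / len(data)

# Made-up data: a rising trend (roughly y = 2x) with noise on top.
train = [(1, 3), (2, 3), (3, 7), (4, 7), (5, 11), (6, 11), (7, 15), (8, 15)]
test = [(1.1, 2.6), (2.9, 5.4), (5.1, 10.6), (6.9, 13.4)]

report = {}
for k in (1, 3, len(train)):
    report[k] = {
        "train_mae": mean_abs_error(train, train, k),
        "test_mae": mean_abs_error(train, test, k),
    }

for k, result in report.items():
    print(k, result)
```

With `k=1` the model memorizes the training set (zero training error) yet does worse on the test data than `k=3`, which is the signature of overfitting. With `k` equal to the full training set it predicts the same average everywhere, so both errors are high, which is the signature of underfitting.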

Powering business-critical decisions with Machine Learning

At Cohelion, we dive deep into the technicalities of ML to deliver optimum results for our clients. Developing ML models is a process of learning, evolution, and nurture that we constantly pursue, working with technical experts who are inspired by ML’s power and potential.
