The machine learning model that we use to make predictions on new data is called the final model.
There can be confusion in applied machine learning about how to train a final model.
This error is seen with beginners to the field, who ask questions such as how to make predictions with the models built during cross-validation, or which of those models to keep and use on new data.
This post will clear up the confusion.
In this post, you will discover how to finalize your machine learning model in order to make predictions on new data.
Let’s get started.
How to Train a Final Machine Learning Model
Photo by Camera Eye Photography, some rights reserved.
A final machine learning model is a model that you use to make predictions on new data.
That is, given new examples of input data, you want to use the model to predict the expected output. This may be a classification (assigning a label) or a regression (predicting a real value).
For example, whether a photo is a picture of a dog or a cat, or the estimated number of sales for tomorrow.
The goal of your machine learning project is to arrive at a final model that performs the best, where “best” is defined by the data you have available, the time you have to spend, and the procedure you discover: the data preparation steps, the algorithm, and its configuration.
In your project, you gather the data, spend the time you have, and discover the data preparation procedures, algorithm to use, and how to configure it.
The final model is the pinnacle of this process, the end you seek in order to start actually making predictions.
Why do we use train and test sets?
Creating a train and test split of your dataset is one method to quickly evaluate the performance of an algorithm on your problem.
The training dataset is used to prepare a model, to train it.
We pretend the test dataset is new data where the output values are withheld from the algorithm. We gather predictions from the trained model on the inputs from the test dataset and compare them to the withheld output values of the test set.
Comparing the predictions and withheld outputs on the test dataset allows us to compute a performance measure for the model on the test dataset. This is an estimate of the skill of the algorithm trained on the problem when making predictions on unseen data.
When we evaluate an algorithm, we are in fact evaluating all steps in the procedure, including how the training data was prepared (e.g. scaling), the choice of algorithm (e.g. kNN), and how the chosen algorithm was configured (e.g. k=3).
The performance measure calculated on the predictions is an estimate of the skill of the whole procedure.
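As a concrete illustration, here is a minimal sketch of that idea using scikit-learn and a synthetic dataset (both my own choices for illustration, not anything from the post itself): the whole procedure, scaling plus a kNN classifier with k=3, is trained on the training set and scored against the withheld test set.

```python
# Evaluate a whole procedure (scaling + kNN with k=3) using a train/test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic data standing in for your historical dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Hold back a test set whose output values are withheld from training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# The "procedure": data preparation + algorithm + configuration.
procedure = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
procedure.fit(X_train, y_train)

# Compare predictions to the withheld outputs to estimate skill on unseen data.
predictions = procedure.predict(X_test)
print("Estimated accuracy: %.3f" % accuracy_score(y_test, predictions))
```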
We generalize the performance measure from “the skill of the procedure on the test set” to “the skill of the procedure on unseen data in general.”
This is quite a leap, and it requires that the test set is large enough and representative of the problem, so that the estimated skill is close to what the procedure will actually achieve on new data.
A lot rides on the estimated skill of the whole procedure on the test set.
In fact, using the train/test method of estimating the skill of the procedure on unseen data often has a high variance (unless we have a heck of a lot of data to split). This means that when it is repeated, it gives different results, often very different results.
The outcome is that we may be quite uncertain about how well the procedure actually performs on unseen data and how one procedure compares to another.
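You can see this variance for yourself by repeating the split with different random seeds; the small scikit-learn sketch below (synthetic data and names of my own choosing) will typically print noticeably different scores from one repeat to the next.

```python
# Repeat the train/test split with different seeds to see the variance in the estimate.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# A deliberately small dataset makes the variance easy to see.
X, y = make_classification(n_samples=200, n_features=20, random_state=1)

for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    # The same procedure is evaluated each time, yet the estimate moves around.
    print("Split %d: accuracy=%.3f" % (seed, score))
```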
Often, time permitting, we prefer to use k-fold cross-validation instead.
Why do we use k-fold cross validation?
Cross-validation is another method for estimating the skill of a procedure on unseen data, much like using a train-test split.
Cross-validation systematically creates and evaluates multiple models on multiple subsets of the dataset.
This, in turn, provides a population of performance measures.
This is also helpful for providing a more nuanced comparison of one procedure to another when you are trying to choose which algorithm and data preparation procedures to use.
Also, this information is invaluable, as you can use the mean and spread to give a confidence interval on the expected performance of a machine learning procedure in practice.
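As a rough sketch of how that looks in practice (again scikit-learn, with an illustrative pipeline of my own choosing), 10-fold cross-validation yields ten scores that can be summarized as a mean and a standard deviation:

```python
# Estimate the skill of the same procedure with 10-fold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

procedure = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

# 10 folds -> 10 models -> a population of 10 performance measures.
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(procedure, X, y, cv=cv, scoring="accuracy")

# The mean and spread give a rough interval for expected performance in practice.
print("Accuracy: %.3f (+/- %.3f)" % (np.mean(scores), np.std(scores)))
```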
Both train-test splits and k-fold cross validation are examples of resampling methods.
The problem with applied machine learning is that we are trying to model the unknown.
On a given predictive modeling problem, the ideal model is one that performs the best when making predictions on new data.
We don’t have new data, so we have to pretend with statistical tricks.
The train-test split and k-fold cross validation are called resampling methods. Resampling methods are statistical procedures for sampling a dataset and estimating an unknown quantity.
In the case of applied machine learning, we are interested in estimating the skill of a machine learning procedure on unseen data. More specifically, the skill of the predictions made by a machine learning procedure.
Once we have the estimated skill, we are finished with the resampling methods.
They have served their purpose and are no longer needed.
You are now ready to finalize your model.
You finalize a model by applying the chosen machine learning procedure on all of your data.
That’s it.
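In code, finalizing might look like the sketch below, assuming the same illustrative scaling-plus-kNN procedure used earlier (the procedure itself is just an example, not a recommendation):

```python
# Finalize: apply the chosen procedure to ALL of the available data.
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

final_model = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

# No train/test split and no cross-validation folds: fit on everything you have.
final_model.fit(X, y)
```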
With the finalized model, you can save it for later or operational use, and load it to make predictions on new data.
What about the cross-validation models or the train-test datasets?
They’ve been discarded. They are no longer needed. They have served their purpose to help you choose a procedure to finalize.
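As one possible sketch of what saving and operational use can look like (joblib is my choice for persistence here; the file name and the new input row are made up for illustration):

```python
# Save the finalized model, load it later, and make a prediction on new data.
import numpy as np
import joblib
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Fit the final model on all available data (same illustrative procedure as before).
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
final_model = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
]).fit(X, y)

# Save for later or operational use.
joblib.dump(final_model, "final_model.joblib")

# ... later, in an application or service: load the model and predict on new input data.
model = joblib.load("final_model.joblib")
new_row = np.random.rand(1, 20)  # one new example with 20 input features
print("Predicted class:", model.predict(new_row)[0])
```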
This section lists some common questions you might have.
Why not keep the model trained on the training dataset, or one of the models trained during cross-validation?
You can if you like.
You may save time and effort by reusing one of the models trained during skill estimation.
This can be a big deal if it takes days, weeks, or months to train a model.
Your model will likely perform better when trained on all of the available data than just the subset used to estimate the performance of the model.
This is why we prefer to train the final model on all available data.
I think this question drives most of the misunderstanding around model finalization.
Put another way: if you train the final model on all of the available data, how do you know how well that model will perform?
You have already answered this question using the resampling procedure.
If well designed, the performance measures you calculate using train-test or k-fold cross validation suitably describe how well the finalized model trained on all available historical data will perform in general.
If you used k-fold cross validation, you will have an estimate of how “wrong” (or conversely, how “right”) the model will be on average, and the expected spread of that wrongness or rightness.
This is why the careful design of your test harness is so absolutely critical in applied machine learning. A more robust test harness will allow you to lean on the estimated performance all the more.
Machine learning algorithms are stochastic, so getting somewhat different performance each time a model is trained on the same data is to be expected.
Resampling methods like repeated train/test or repeated k-fold cross-validation will help to get a handle on how much variance there is in the method.
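For example, a sketch along the following lines uses scikit-learn's RepeatedStratifiedKFold (one of several repeated resampling schemes you could choose) to summarize both the average skill and how much it varies:

```python
# Repeated k-fold cross-validation to get a handle on the variance of the estimate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=1)
model = KNeighborsClassifier(n_neighbors=3)

# 10 folds repeated 3 times -> 30 performance measures.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Mean accuracy: %.3f" % np.mean(scores))
print("Std of accuracy: %.3f" % np.std(scores))
```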
If it is a real concern, you can create multiple final models and take the mean from an ensemble of predictions in order to reduce the variance.
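A rough sketch of that idea, assuming a stochastic algorithm (an MLPRegressor with settings chosen purely for illustration), trains several final models that differ only in their random seed and averages their predictions:

```python
# Reduce variance by averaging the predictions of several final models
# that differ only in their random seed.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=1)

# Train a small ensemble of final models on ALL of the data, varying only the seed.
members = [
    MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=seed).fit(X, y)
    for seed in range(5)
]

# For a new example, take the mean of the ensemble's predictions.
new_row = np.random.rand(1, 10)
prediction = np.mean([m.predict(new_row)[0] for m in members])
print("Ensembled prediction: %.3f" % prediction)
```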
I talk more about this approach in a separate post.
In this post, you discovered how to train a final machine learning model for operational use.
You have overcome obstacles to finalizing your model, such as understanding the purpose of resampling procedures like train-test splits and k-fold cross-validation, and knowing that the final model is trained on all of your available data.