The Importance of Data Splitting
In most machine learning projects, it is a good idea to split your data into three separate groups: a training set, a validation set, and a test set.
To see why, let's pretend that we have a collection of two types of pets:
(Figure: our collection of cats and dogs)
For each pet in our collection, we have two facts: weight and fluffiness.
Our goal is to find and test good ways to guess if a pet is a cat or a dog. We'll use train/validation/test splits to do this!
Train, Test, and Validation Splits
The first step in our guessing game is to randomly divide our pets into three separate groups:
Training Set: The group we show our computer model so it can learn patterns.
Validation Set: The group we use to check how well our model is doing while we try different settings.
Test Set: The group we use to guess how well our model will work in the real world.
The Training Set
The training set is the group we use to teach our model. This is the data our model uses to learn the rules that will help it make guesses later on.
Since our model learns from this, it is very important that the training set looks similar to the real world (the population we are trying to model).
We also need to make sure it is fair (unbiased), because any bias here will carry over into the model's predictions later on.
To give our model as much info to learn from as possible, we usually put the majority (e.g. 60–80%) of our original data into the training set.
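The splitting steps above can be sketched in code. This is a minimal example using scikit-learn's `train_test_split` twice to get a 70/15/15 split; the pet data here is randomly generated just for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented stand-in data: 100 pets, columns are weight and fluffiness.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = rng.integers(0, 2, size=100)  # 0 = dog, 1 = cat

# First carve off the 70% training set...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=42)
# ...then split the remaining 30% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # → 70 15 15
```

Splitting in two stages like this is a common pattern, since `train_test_split` only produces two groups at a time.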
Building Our Model
Our goal is to decide if a pet is a cat or a dog. This is a yes/no question, so we will use a simple but effective tool called: logistic regression.
Given a specific mix of facts (None, Weight, Fluffiness, or both Weight and Fluffiness), the logistic regression tool will draw a line to separate the pets into cats or dogs. Each pet will be labeled based on which side of the line it falls on: one side for dogs, the other for cats.
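To make this concrete, here is a small sketch of fitting a logistic regression on both facts (weight and fluffiness) and using the line it learns to label new pets. The numbers are made up: heavy, not-so-fluffy pets are dogs, and light, fluffy pets are cats.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented toy pets: [weight, fluffiness]
X = np.array([[30, 2], [25, 3], [40, 1], [35, 2],   # dogs
              [4, 8], [5, 9], [3, 7], [6, 8]])      # cats
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])              # 0 = dog, 1 = cat

model = LogisticRegression().fit(X, y)

# The separating line is where w . x + b = 0:
# pets on one side are labeled dogs, the other side cats.
w, b = model.coef_[0], model.intercept_[0]

print(model.predict([[28, 2], [4, 9]]))  # → [0 1] (dog, cat)
```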
Select the features you want to use to see the line drawn by the model.
Drag each animal in the training set to a new spot to see how the line moves!
The Validation Set
With logistic regression, we can build four different models, one for each choice of facts we use: none, just weight, just fluffiness, or both weight and fluffiness.
How should we decide which model is best?
We could check each model's score on the training set, but that score is misleading: if we use the same data for both learning and evaluation, the model can memorize it (overfit) and won't work well on new data.
This is where the validation set comes in: it acts as a fair, held-out benchmark for comparing the different settings and choices trained on our training set.
In our case, we are only looking at one method (logistic regression), but we can view the number of features as a setting that we want to adjust.[*]

[*] For expository purposes, we won't worry about other choices for our logistic regression model, such as specific solvers or regularization penalties.
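The comparison described above can be sketched as a loop over feature subsets, scoring each fitted model on the validation set. All data here is invented; the "none" model would simply be a constant majority-class guess, so we only fit the three models that use at least one feature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: dogs are heavy and not fluffy, cats light and fluffy.
rng = np.random.default_rng(1)
n = 80
weight = np.r_[rng.normal(30, 5, n // 2), rng.normal(5, 2, n // 2)]
fluff = np.r_[rng.normal(2, 1, n // 2), rng.normal(8, 1, n // 2)]
X = np.c_[weight, fluff]
y = np.r_[np.zeros(n // 2), np.ones(n // 2)]  # 0 = dog, 1 = cat

# Shuffle, then hold out the last 20 pets as a validation set.
idx = rng.permutation(n)
train, val = idx[:60], idx[60:]

# One model per feature subset, compared on validation accuracy.
subsets = {"weight": [0], "fluffiness": [1], "both": [0, 1]}
scores = {}
for name, cols in subsets.items():
    m = LogisticRegression().fit(X[train][:, cols], y[train])
    scores[name] = m.score(X[val][:, cols], y[val])

best = max(scores, key=scores.get)
print(best, scores)
```

The key point is that only validation scores (never training or test scores) drive the choice of `best`.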
Select a feature to see how well the model does on the validation set in the table below.
Drag the feature across the line to see how the score changes!
The Testing Set
So, assuming you didn't move the pets around too much, our best model (according to the validation set) is the one that uses both facts.
Once we have used the validation set to pick the best settings that we want to use, the test set is used to see how the model will likely perform in the real world.
Let's say that again: the test set is the final step to grade our model's performance on new data it hasn't seen.
We should never, ever, look at the test set's score before picking our final model.
Picking the best model and settings is the job of the validation set.
Peeking at the test set answers early is a form of cheating (overfitting), and will likely lead to bad results later on. It should only be checked at the very end, after the validation set has helped us find the best model.
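As a final sketch (again with invented data), once the validation set has picked our winner, the test set is touched exactly once to estimate real-world performance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data, same shape as before: [weight, fluffiness] per pet.
rng = np.random.default_rng(2)
n = 100
weight = np.r_[rng.normal(30, 5, n // 2), rng.normal(5, 2, n // 2)]
fluff = np.r_[rng.normal(2, 1, n // 2), rng.normal(8, 1, n // 2)]
X, y = np.c_[weight, fluff], np.r_[np.zeros(n // 2), np.ones(n // 2)]

# Assume train+validation already chose "both features" as the winner;
# the last 15 pets are the untouched test set.
idx = rng.permutation(n)
train, test = idx[:85], idx[85:]

best_model = LogisticRegression().fit(X[train], y[train])
test_acc = best_model.score(X[test], y[test])  # look at this only once!
print(f"estimated real-world accuracy: {test_acc:.1%}")
```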
Summary
You may have noticed that the test score of the "just fluffiness" model was higher than that of the "both features" model, even though the validation set picked the "both features" model as the best. This can happen: the validation score won't always match the test score perfectly, but that's not a bad thing. Remember that the test score is not a number to improve; it is just a way to estimate how well we will do later. It lets us estimate, with confidence, that our model can tell the difference between cats and dogs with 87.5% accuracy.
Key Takeaways
It is a best practice in machine learning to split our data into the following three groups:
Training Set: For teaching the model.
Validation Set: For checking the model fairly.
Test Set: For the final grade of the model.
Following these rules helps ensure that we have a realistic idea of how our model works, and that we (hopefully) built a model that works well on new data.