TL;DR – The train_test_split function splits a single dataset into two subsets for two different purposes: training and testing. The training subset is for building your model. The testing subset is for evaluating the model's performance on data it has not seen.
What Sklearn and Model_selection are
Before discussing train_test_split, you should know about Sklearn (or Scikit-learn). It is a Python machine learning library that offers tools for tasks such as classification, clustering, regression, and model selection.
Model_selection is the Sklearn module for choosing and validating models: you set up a blueprint for analyzing data and then use it to measure how the model performs on new data. Selecting a proper model allows you to generate accurate results when making a prediction.
To do that, you need to train your model by using a specific dataset. Then, you test the model against another dataset.
If you have only one dataset, you'll need to split it first by using the Sklearn train_test_split function.
What is train_test_split?
train_test_split is a function in Sklearn's model_selection module for splitting data arrays into two subsets: one for training data and one for testing data. With this function, you don't need to divide the dataset manually.
By default, Sklearn train_test_split will make random partitions for the two subsets. However, you can also specify a random state for the operation.
Parameters
Sklearn train_test_split has several parameters. A basic example of the syntax looks like this:
train_test_split(X, y, train_size=0.*, test_size=0.*, random_state=*)
- X, y. The first parameters are the arrays you want to split (typically features and labels).
- train_size. This parameter sets the size of the training subset. There are three options: None, which is the default; an int, which gives the exact number of samples; and a float between 0.0 and 1.0, which gives the proportion of the dataset.
- test_size. This parameter specifies the size of the testing subset and complements the training size. If train_size is left at its default, test_size defaults to 0.25.
- random_state. By default, the split is randomized using np.random. Alternatively, you can pass an integer to get the same split every time the code runs.
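A quick sketch of how these parameters behave (the dataset and numbers here are illustrative only):

```python
from sklearn.model_selection import train_test_split

X = list(range(10))
y = [x * 2 for x in X]

# train_size as a float: the fraction of samples that go to training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42)
print(len(X_train), len(X_test))  # 8 2

# train_size as an int: the exact number of training samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=6, random_state=42)
print(len(X_train), len(X_test))  # 6 4

# a fixed random_state makes the split reproducible
a1, b1, _, _ = train_test_split(X, y, random_state=0)
a2, b2, _, _ = train_test_split(X, y, random_state=0)
print(a1 == a2 and b1 == b2)  # True
```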
The use of train_test_split
First, you need to have a dataset to split. You can start by making a list of numbers using range() like this:
X = list(range(15))
print(X)
Then, we add more code to make another list of square values of numbers in X:
y = [x * x for x in X]
print(y)
Now, let's apply the train_test_split function. Here, we set the train size to 65% of the entire dataset. Remember to write it as the fraction 0.65, not 65.
import sklearn.model_selection as model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.65, test_size=0.35, random_state=101)
print("X_train: ", X_train)
print("y_train: ", y_train)
print("X_test: ", X_test)
print("y_test: ", y_test)
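As a sanity check, the two subsets always partition the original data: every sample lands in exactly one of them. A small sketch using the same settings as above:

```python
from sklearn.model_selection import train_test_split

X = list(range(15))
y = [x * x for x in X]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.65, test_size=0.35, random_state=101)

# 65% of 15 samples -> 9 for training, the remaining 6 for testing
print(len(X_train), len(X_test))  # 9 6

# together, the subsets contain every original sample exactly once
print(sorted(X_train + X_test) == X)  # True
```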
You can set only the test_size; the train_size will adjust accordingly. You can also set the random_state to 0, as shown below:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
Note: The Sklearn train_test_split function shuffles the data by default, so it ignores the original sequence of numbers. After a split, the samples appear in a different order.
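If you do need to preserve the original order (for example, with time-ordered data), train_test_split also accepts a shuffle parameter. A minimal sketch with an illustrative dataset:

```python
from sklearn.model_selection import train_test_split

X = list(range(10))
y = list(range(10))

# shuffle=False keeps the original order: the first rows go to training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False)
print(X_train)  # [0, 1, 2, 3, 4, 5, 6]
print(X_test)   # [7, 8, 9]
```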
Why use the Sklearn train_test_split function?
Using the same dataset for both training and testing leaves room for miscalculations, thus increasing the chances of inaccurate predictions.
The train_test_split function allows you to break a dataset up with ease while pursuing an ideal model. Also, keep in mind that your model should neither overfit nor underfit.
Overfitting and underfitting
Overfitting is a situation in which a model shows almost perfect accuracy on its training data. This happens when the model learns an overly complex set of rules. An overfitted model can be inaccurate when handling new data.
Underfitting is when a model fails to fit even the training data because its rules are too simple. You can't rely on an underfitted model to make accurate predictions.
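One way to see both effects in practice is to compare training and testing scores after a split. This sketch uses a decision tree on a made-up noisy quadratic dataset; the data and model choice are illustrative, not from the article:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Illustrative noisy dataset: y = x^2 plus Gaussian noise
rng = np.random.RandomState(0)
X = np.linspace(0, 5, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# An unconstrained tree memorizes the training data, noise included
# (overfitting): perfect on training, worse on the held-out test set
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print("deep tree, train R^2:", deep.score(X_train, y_train))  # 1.0
print("deep tree, test  R^2:", deep.score(X_test, y_test))    # lower

# A one-split tree is too simple for a quadratic relationship
# (underfitting): poor on training and testing alike
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_train, y_train)
print("stump, train R^2:", stump.score(X_train, y_train))
print("stump, test  R^2:", stump.score(X_test, y_test))
```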
Train_test_split: useful tips
- Unless you pass a random_state value, train_test_split will split arrays into random subsets that change on every run.
- The ideal split is said to be 80:20 for training and testing. You may need to adjust it depending on the size of the dataset and the complexity of the model.
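For example, the commonly cited 80:20 split looks like this (the dataset here is illustrative):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [x % 2 for x in X]  # illustrative labels

# 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 80 20
```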