Bayesian Search: Is It the Best Way to Tune Your Hyperparameters?
As you may know, every machine learning model, whether it's a tree-based one or a deep neural network, consists of trainable parameters (weights, biases, etc.) and hyperparameters that cannot be optimized during the training process. After acquiring a dataset, you have to initialize and optimize the weights of every layer, or just fine-tune them (if you use transfer learning). This training process is computationally expensive for each set of hyperparameters (h1, h2, ..., hn), so we have to find a way of performing fewer runs. In this article, we will briefly cover three ways of hyperparameter tuning, along with the final accuracy and the runtime.
For this article, we've chosen a light dataset from the dairy industry. We will forecast the monthly milk production based on past quantities of cottage cheese, ice cream and fat. Because the values vary through time, we are dealing with a time series problem. Therefore, we'll use a tree-based model, xgboost, and try to optimize its hyperparameters.
In [1]:
import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from skopt import BayesSearchCV
from sklearn.model_selection import train_test_split, GridSearchCV,RandomizedSearchCV, \
TimeSeriesSplit
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error, explained_variance_score
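The cell that loads the data is not shown in this export; a minimal sketch, assuming the monthly milk dataset sits in a local CSV file (the file name below is hypothetical):
In [ ]:
dataset = pd.read_csv('monthly_milk_production.csv')  # hypothetical file name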
We will use Month.Count as the time index and drop the last two columns, which represent the squared and the cubed value of the month number.
In [3]:
columns_to_drop = ['monthNumSqred', 'monthNumCubed']
for col in columns_to_drop:
    dataset.pop(col)
year_column = 'Year'
month_column = 'Month'
timestamp = 'Month.Count'
cheese_column = 'Cotagecheese.Prod'
icecream_column = 'Icecream.Prod'
milk_column = 'Milk.Prod'
fat_column = 'N.CA.Fat.Price'
In [4]:
dataset.head()
Out[4]:
Now, one of the most important steps is understanding the data and "making a lot of plots". We can easily notice that the ice cream production has higher values than the other series, so we decided to plot it in a separate figure.
In [5]:
fig, (ax1, ax2) = plt.subplots(2,1, sharey=False)
fig.set_figheight(10)
fig.set_figwidth(15)
ax1.plot(dataset[timestamp], dataset[cheese_column])
ax1.plot(dataset[timestamp], dataset[milk_column])
ax1.plot(dataset[timestamp], dataset[fat_column])
ax1.legend(['cheese', 'milk', 'fat'])
ax2.plot(dataset[timestamp], dataset[icecream_column])
Out[5]:
[<matplotlib.lines.Line2D at 0x170a5bb89c8>]
Here we can see a seasonal pattern in the ice cream graph. Milk is only weakly correlated with the other series at the beginning, but the relationship strengthens over time, so it may still work relatively well on the test side.
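To put a number on that correlation claim, one could compute the pairwise correlations between the series; a quick sketch (the exact values depend on the dataset):
In [ ]:
# Pairwise Pearson correlations between the production/price series and milk
dataset[[cheese_column, icecream_column, fat_column, milk_column]].corr()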
The next step is to encode the month column using a LabelEncoder from sklearn.
In [6]:
encoder = LabelEncoder()
month_column_enc = month_column + '.enc'
dataset[month_column_enc] = encoder.fit_transform(dataset[month_column])
In [7]:
features = [timestamp, year_column, month_column_enc, cheese_column, icecream_column, \
fat_column]
features_scaled = [col + '.scaled' for col in features]
milk_column_scaled = milk_column + '.scaled'
Another sklearn tool will help us split the data into train and test sets. The test set will be 15% of the data and, because this is a time series problem, we will not shuffle the rows.
In [8]:
X_train, X_test, y_train, y_test = train_test_split(dataset[features],
dataset[milk_column],
test_size=0.15,
shuffle=False)
In [9]:
y_train = pd.DataFrame(y_train,columns=[milk_column])
y_test = pd.DataFrame(y_test, columns=[milk_column])
Now we'll scale the other numerical features, manually simulating a MinMaxScaler. Precisely, we'll get the maximum and minimum values from the training set and apply the formula with these parameters on both the training and the test set.
In [10]:
X_minimum = X_train.min()
X_maximum = X_train.max()
y_minimum = y_train.min().values[0]
y_maximum = y_train.max().values[0]
In [11]:
X_train[features_scaled] = (X_train[features] - X_minimum[features]) / \
                           (X_maximum[features] - X_minimum[features])
X_test[features_scaled] = (X_test[features] - X_minimum[features]) / \
                          (X_maximum[features] - X_minimum[features])
y_train[milk_column_scaled] = (y_train[milk_column] - y_minimum) / (y_maximum - y_minimum)
y_test[milk_column_scaled] = (y_test[milk_column] - y_minimum) / (y_maximum - y_minimum)
In [12]:
X_train_arr = np.array(X_train[features_scaled])
y_train_arr = np.array(y_train[milk_column_scaled])
X_test_arr = np.array(X_test[features_scaled])
y_test_arr = np.array(y_test[milk_column_scaled])
The following search methods require k-fold cross-validation. However, regular k-fold does not suit time series problems, because it would mean predicting past behaviour from future events, which interferes with the pattern our model is trying to learn. So we'll use a TimeSeriesSplit, which always trains on earlier observations and validates on the ones that follow (a quick look at the resulting folds is sketched after the next cell).
In [14]:
xg_reg = xgb.XGBRegressor()
ts_split = TimeSeriesSplit(n_splits=2)
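To see what the splitter actually does with our training data, you can print the row ranges of each fold; a minimal sketch, assuming the arrays defined above:
In [ ]:
# Each fold trains on an earlier slice and validates on the slice right after it
for fold, (train_idx, valid_idx) in enumerate(ts_split.split(X_train_arr)):
    print(f'fold {fold}: train rows {train_idx.min()}-{train_idx.max()}, '
          f'validation rows {valid_idx.min()}-{valid_idx.max()}')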
One of the most popular approaches is to list all the possible hyperparameter values in a multidimensional grid and to run the training process for each point. In the figure below, we have two parameters, one with a great impact and one with a lesser one. The coloured graph represents the score of the trained model.
As you may see on the red graph, the grey points are the values we get; but there are a lot of higher points between them that we ignore. A naive solution would be to set more points for every parameter. But, because the number of grid points grows exponentially with the number of parameters, this leads to the Curse of Dimensionality.
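A tiny illustration of that exponential growth (plain arithmetic, not results from this notebook):
In [ ]:
# With k candidate values per hyperparameter and n hyperparameters,
# an exhaustive grid contains k ** n points to train and evaluate.
k = 10
for n in (2, 4, 6, 8):
    print(f'{n} hyperparameters x {k} values each -> {k ** n:,} training runs')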
Below, we set the search space: it consists of lists of fixed values.
In [52]:
search_params = {
'objective': ['reg:squarederror'],
'eval_metric': ['rmse'],
'n_estimators': np.arange(1, 40),
'max_depth': np.arange(1, 15),
'min_child_weight': np.arange(1, 15),
'booster':['gbtree', 'dart'],
'verbosity': [0],
'learning_rate': [0.01],
'random_state': [101]
}
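Before launching the search, it's worth counting how many candidates this grid contains, and therefore how many fits the cross-validation will trigger; a quick sketch:
In [ ]:
# Total grid size = product of the number of values per hyperparameter
n_candidates = int(np.prod([len(values) for values in search_params.values()]))
print(n_candidates, 'candidates,', n_candidates * ts_split.get_n_splits(), 'fits')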
In [53]:
start_time = datetime.now()
grid_search = GridSearchCV(xg_reg, search_params, cv=ts_split,
                           verbose=1, n_jobs=3)
grid_search.fit(X_train_arr, y_train_arr)
time_elapsed = datetime.now() - start_time
time_elapsed
Fitting 2 folds for each of 15288 candidates, totalling 30576 fits
In [54]:
print(-grid_search.best_score_)
grid_search.best_params_
5.93167215160446
{'booster':'gbtree',
'eval_metric': 'rmse',
'learning_rate': 0.01,
'max_depth': 3,
'min_child_weight': 1,
'n_estimators': 39,
'objective': 'reg:squarederror',
'random_state': 101,
'verbosity': 0}
Instead of having lists of possible values, this time we'll use probability distributions (usually uniform). At every step, a random value is drawn for each hyperparameter. This is much faster, but the process has no natural end, so a new parameter has to be specified (n_iter).
In the figure below, we have two parameters of opposite importance and a scenario where the local maximum is reached. But there are many scenarios where it is hardly found, because, for example, with a uniform distribution every point is an equally likely candidate to be the next one. So the local minimum/maximum can be found either in 2 steps or in 2,000, depending on how lucky you are, right? :)
In [55]:
import scipy.stats.distributions as dists
search_params = {
'objective': ['reg:squarederror'],
'eval_metric': ['rmse'],
'n_estimators': dists.randint(1, 25),
'max_depth': dists.randint(1, 15),
'min_child_weight': dists.randint(1, 15),
'booster':['gbtree', 'dart'],
'verbosity': [0],
'learning_rate': [0.01],
'random_state': [101]
}
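To get a feel for what the randomized search will actually try, you can draw a few candidates from these distributions with sklearn's ParameterSampler; a minimal sketch:
In [ ]:
from sklearn.model_selection import ParameterSampler
# Three randomly drawn hyperparameter tuples from the search space above
list(ParameterSampler(search_params, n_iter=3, random_state=101))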
In [56]:
start_time = datetime.now()
random_search = RandomizedSearchCV(xg_reg, search_params, cv=ts_split,
                                   verbose=1, n_jobs=3, n_iter=2400)
random_search.fit(X_train_arr, y_train_arr)
time_elapsed = datetime.now() - start_time
time_elapsed
Fitting 2 folds for each of 2400 candidates, totalling 4800 fits
In [57]:
print(-random_search.best_score_)
random_search.best_params_
6.697564288240948
{'booster': 'dart',
'eval_metric': 'rmse',
'learning_rate': 0.01,
'max_depth': 3,
'min_child_weight': 1,
'n_estimators': 23,
'objective': 'reg:squarederror',
'random_state': 101,
'verbosity': 0}
Bayesian search is a solution to the luck-dependent randomized search mentioned above. It provides a way of adjusting the probability distribution based on previous experiments.
If we take the performance of the model on the given data as a function f(x) = Y, we want to minimize it, but without using gradient-based methods. The idea is to treat f as a stochastic Gaussian process.
That sounds pretty tough, right? :) Well, it's not. It means that the vector Y of all unobserved points is a multivariate normally distributed variable. This gives us a probability P (below formula), where y0 is a value lower than our best loss so far, and the Greek letter Phi is the cumulative distribution function (cdf) of the normal distribution with the given parameters.
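The formula itself is not reproduced in this text; a plausible reconstruction, assuming the Gaussian-process posterior at a candidate point h has mean \mu(h) and standard deviation \sigma(h):
\[
P(y < y_0) = \Phi\!\left(\frac{y_0 - \mu(h)}{\sigma(h)}\right)
\]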
In practice, the implemented algorithm computes the conditional probability, conditioned on the ys observed so far. If you're interested in the formula and the proof of this assumption, you can check here. The key part is that this probability is updated after each step.
It also gives us a way of implementing acquisition functions. These are lighter functions that can be minimized cheaply, and the result is the next tuple of hyperparameters to try. One example is the lower confidence bound. As a consequence of the distribution of Y, every point y is normally distributed.
According to the 3-sigma rule, we know the range in which y will almost certainly fall (below formula). But we're interested in the lower bound; therefore, if you take the lower confidence bound as the acquisition function a and minimize it, you get h, the next hyperparameter tuple to evaluate.
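A sketch of both formulas, with the same posterior mean \mu(h) and standard deviation \sigma(h) as above (the acquisition shown uses the 3-sigma width; the constant used by skopt may differ):
\[
\mu(h) - 3\sigma(h) \le y \le \mu(h) + 3\sigma(h)
\]
\[
a(h) = \mu(h) - 3\sigma(h), \qquad h_{\text{next}} = \arg\min_h a(h)
\]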
In [58]:
from skopt.space import Real,Categorical, Integer
search_params = {
'objective': Categorical(['reg:squarederror']),
'eval_metric': Categorical(['rmse']),
'n_estimators': Integer(1, 25),
'max_depth': Integer(1, 15),
'min_child_weight': Integer(1, 15),
'booster':Categorical(['gbtree', 'dart']),
'verbosity': Categorical([0]),
'learning_rate': Categorical([0.01]),
'random_state':Categorical([101])
}
In [64]:
start_time = datetime.now()
bayesian_search = BayesSearchCV(xg_reg, search_params, cv=ts_split,
                                verbose=0, n_jobs=3, n_iter=100)
bayesian_search.fit(X_train_arr, y_train_arr)
time_elapsed = datetime.now() - start_time
time_elapsed
Out[64]:
datetime.timedelta(seconds=269, microseconds=303622)
In [65]:
print(-bayesian_search.best_score_)
bayesian_search.best_params_
6.593563642023402
OrderedDict([('booster','gbtree'),
('eval_metric', 'rmse'),
('learning_rate', 0.01),
('max_depth', 3),
('min_child_weight', 1),
('n_estimators', 25),
('objective', 'reg:squarederror'),
('random_state', 101),
('verbosity', 0)])
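The notebook imports mean_absolute_error and explained_variance_score, but the final check on the held-out test set is not shown above; a minimal sketch of how the three fitted searchers could be compared (the printed numbers will be whatever your run produces):
In [ ]:
# Compare the tuned models on the untouched test split (scaled target)
for name, search in [('grid', grid_search),
                     ('random', random_search),
                     ('bayesian', bayesian_search)]:
    y_pred = search.best_estimator_.predict(X_test_arr)
    print(name,
          'MAE:', mean_absolute_error(y_test_arr, y_pred),
          'explained variance:', explained_variance_score(y_test_arr, y_pred))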
In these examples, the grid search gave us a better result than the randomized ones. So there is no single right answer on which optimization method is the best; it depends on the context. If your model is light and trains in a few milliseconds (like this one), you may very well try the grid approach. If your model is computationally expensive, like a deep neural network, you should try the latter two, especially the Bayesian search. In the worst case, its strategy may only lead you to the proximity of a local minimum, but it gets there much faster than grid search.
Also, if you are having trouble understanding the meaning of each hyperparameter, the Bayesian search can reveal the importance of each one, or a pattern for combining a few of them, to optimize your final model.
If you want to run this, you need to install some libraries with pip, conda or another tool. I recommend Anaconda: install it and run the commands from the command line of the specific environment. If you have any issues running xgboost or skopt inside a Jupyter Notebook, run the lines in the following cell.
In [ ]:
import sys
!{sys.executable} -m pip install xgboost
!{sys.executable} -m pip install scikit-optimize