Part 1: Hyperparameter Tuning with GridSearchCV, RandomizedSearchCV, and BayesSearchCV
Understanding how to improve the learning process of a machine learning model can blur the line between science and art. Hyperparameters are not updated during the learning process, but they shape model development and need to be tuned by the practitioner. For widely used models, a large body of research is available that allows a practitioner to tune an estimator by hand. However, there are cases where that's not practical, either because the model is new or the practitioner lacks experience with it. This is where an automated process for finding the optimal hyperparameter configuration becomes beneficial. We will go over three common search algorithms used for hyperparameter tuning. For the full code behind this analysis, please refer to the GitHub repo.
Not everything is learned within estimators, so adjusting hyperparameters is essential to maximizing a machine learning model's effectiveness. Manually inputting a set of parameter values, fitting, and scoring can be time-consuming. Luckily, two basic algorithms commonly used to automate the search of the parameter space are available in the Python Scikit-learn package: GridSearchCV[1] and RandomizedSearchCV[2].
The GridSearchCV algorithm performs an exhaustive search over every combination of the parameter values provided, while RandomizedSearchCV samples a fixed number of candidates from the given parameter space. Both class objects take a search space, a model, and a cross-validation configuration. In terms of efficiency, RandomizedSearchCV is usually better than GridSearchCV, though the latter will offer better metric results in some cases.
Another option is to move through the parameter space in a Bayesian way. Bayes' theorem calculates the conditional probability of an event. The formula looks like this:
P(A|B) = P(B|A) * P(A) / P(B)
However, since, in our case, we're not looking for a conditional probability but trying to optimize a specific quantity, we can drop the normalizing value P(B) and work with a proportionality instead:
P(A|B) ∝ P(B|A) * P(A)
Now let’s make the following substitutions:
A = M, representing our model
B = S, representing the sampled data from our parameter space
P(M|S) ∝ P(S|M) * P(M)
The BayesSearchCV[3] module from the Scikit-Optimize package is an excellent substitute for the GridSearchCV estimator and has a configuration similar to both GridSearchCV and RandomizedSearchCV. Bayesian optimization of objective functions is a well-researched topic; for more details, see the readings listed at the end of this article. Now let's run through an example.
The Data
The target data used in this exercise is the Census Income data set taken from the UCI Machine Learning repository[4]. It’s a cleaned version of the 1994 census data donated in 1996 to the UCI ML repo. The main objective is to predict whether an individual’s income exceeds $50K/yr based on features available in the data. Now let’s pull the data from its source.
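A minimal sketch of this step might look like the following; it assumes the standard UCI file location for adult.data (adjust the URL if the file has moved):

```python
import pandas as pd

# Pull the Census Income (Adult) data directly from the UCI repository.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
df = pd.read_csv(url, header=None, skipinitialspace=True)  # the file ships without a header row
print(df.shape)  # expect (32561, 15)
```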
After we import our data, we need to add column labels to it. The labels are also listed on the UCI repo page linked earlier.
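A sketch of the labeling step, using the column names documented on the UCI page:

```python
# Column names as documented for the Adult data set on the UCI page.
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]
df.columns = columns
```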
Exploring the Data
The next step is to explore the data a little and create features we can use in our analysis. There are 32,561 records, each representing an individual, and 15 columns of personal attributes, like a person's education level and marital status. We will use a simple logistic regression model to make our class prediction. In regression models, it is best that the independent variables used to predict the dependent variable have a weak relationship with each other. A strong relationship among independent variables can make the coefficient estimates unstable and difficult to interpret. Let's check for multicollinearity in the data using the user-defined function sketched below.
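The repo contains the author's own helper; as a stand-in, here is a minimal sketch of such a function built on statsmodels (the vif_table name and output format are illustrative):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def vif_table(data: pd.DataFrame) -> pd.DataFrame:
    """Return the variance inflation factor for each numeric column."""
    numeric = add_constant(data.select_dtypes("number"))  # add an intercept term
    vifs = [variance_inflation_factor(numeric.values, i) for i in range(numeric.shape[1])]
    return (pd.DataFrame({"feature": numeric.columns, "VIF": vifs})
              .query("feature != 'const'")
              .sort_values("VIF", ascending=False))

print(vif_table(df))
```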
Our function uses the variance inflation factor (VIF) technique to measure the correlation among variables. The VIF is derived by linearly regressing a predictor against the other predictors, producing a coefficient of determination (R²). The VIF is then simply 1/(1 − R²). The lower bound is 1, and there is no upper bound. If the R² value is 0.8, then the score is 5. A VIF value of 1.5 indicates that the variance of a particular coefficient is 50% larger than one would expect if there were no multicollinearity, that is, no relationship with the other independent variables. There's an ongoing debate on what constitutes too high a VIF score, and answers vary among researchers. Some take the conservative stance that 2.5 and up is high, while others say anything above 10 is cause for concern. Three features, age, education-num, and hours-per-week, show a strong positive correlation. Since our goal in this article is to make predictions (multicollinearity impacts coefficients and p-values, not predictions) and to observe the impact of the three optimization algorithms on selecting hyperparameter values, we will not exclude or change these variables.
Looking at the distribution of records across the age, education-num, and hours-per-week columns, we can make certain inferences about the data.
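A quick, illustrative way to look at those distributions (assuming matplotlib is available):

```python
import matplotlib.pyplot as plt

# Histograms of the three columns discussed above.
df[["age", "education-num", "hours-per-week"]].hist(bins=20, figsize=(12, 4), layout=(1, 3))
plt.tight_layout()
plt.show()
```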
If we were to pick a record from this data set at random, it would likely be an individual who is white, married, between the ages of 30 and 40, a high school graduate, and working a typical 40-hour week. Let's move to the dependent variable, income. This is a binary column that indicates whether or not a person makes over $50K annually.
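A one-liner to confirm the class balance discussed next (the exact percentages depend on the data pull):

```python
# Class balance of the target column; roughly a 2:8 split.
print(df["income"].value_counts(normalize=True))
```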
We are dealing with unbalanced classes in this data set, with a ratio close to 2:8. Unbalanced classes are a common phenomenon in areas such as fraud detection and churn prediction. Machine learning algorithms tend to assume that the classes are evenly distributed, and in the presence of unbalanced classes, the model will be biased towards the majority class. This impacts the prediction quality of our model (a high accuracy score from predicting the majority class but low precision on the minority class). We will touch on how to deal with this later, but for now, let's prepare the data for our model.
Prepare Data
Let's remove unnecessary columns from the data. The education and education-num columns seem to represent the same idea; one is simply in categorical form and the other numeric. We will remove the categorical column and keep the numeric one.
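A minimal sketch of that drop:

```python
# Keep the numeric encoding and drop the redundant categorical column.
df = df.drop(columns=["education"])
```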
Next, encode the non-numerical columns into numeric ones. Encoding, in this case, means creating new columns based on the values within a particular column, using a column_value naming syntax. For example, the sex column has two values, Male and Female, so two new columns are created: 'sex_Female' and 'sex_Male'. The values within these columns are 1 to indicate presence and 0 for absence. Using the Pandas get_dummies method may increase memory usage, but it is a useful tool for this relatively small data set. On larger data sets, it is recommended to use sparse matrices.
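A sketch of the encoding step with get_dummies:

```python
# One-hot encode the remaining categorical columns; numeric columns pass through unchanged.
df = pd.get_dummies(df)
print(df.columns[-2:])  # the income column becomes income_<=50K and income_>50K
```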
After the transformation, we can see that two columns, income_<=50K and income_>50K, can be used to assert whether someone makes over $50K. We will go ahead and drop the income_<=50K column.
Last, we split the data into predictor and response objects, denoted by X and y respectively.
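A sketch of the drop-and-split step; the variable names X and y match the objects used later:

```python
# The two income dummies mirror each other, so keep only income_>50K as the target.
df = df.drop(columns=["income_<=50K"])

X = df.drop(columns=["income_>50K"])   # predictors
y = df["income_>50K"].astype(int)      # response: 1 if income > $50K, else 0
```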
Logistic Regression
I created a class object, ClassifierModel, specifically for this presentation. Please examine it in my GitHub repo (linked in the introduction) for more information on the structure and methods of this class object. Let's implement a basic logistic regression model on our data and print its accuracy score.
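The article uses the ClassifierModel wrapper for this step; as a rough equivalent, a plain scikit-learn baseline might look like the following (the 70/30 split and max_iter value are illustrative choices):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Hold out a test set and fit a plain logistic regression baseline.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"F1 score: {f1_score(y_test, y_pred):.2f}")
```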
The results show an 80% accuracy score, but the F1 score is only 39%. Our F1 score is not 0, which means the model can correctly classify some of the minority class. However, it's still not great; we can do better and get a score above 60%. The Precision-Recall[5] curve can be used to measure an estimator's performance when there is a large skew in the class distribution. We typically want to see the area under the precision-recall curve (average precision, or AP) in the 0.5–1 range. The results below are under 50%, likely due to the unbalanced classes.
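A sketch of computing and plotting the precision-recall curve and its average precision for the baseline above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score, precision_recall_curve

# Average precision (area under the precision-recall curve) for the baseline model.
y_scores = logreg.predict_proba(X_test)[:, 1]
ap = average_precision_score(y_test, y_scores)

precision, recall, _ = precision_recall_curve(y_test, y_scores)
plt.plot(recall, precision, label=f"AP = {ap:.3f}")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```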
The logistic regression Scikit-learn API allows adjustments to the weights assigned to the classes. By default, every class receives a weight of 1, but we can pass the value balanced, which tells the model to automatically adjust the weights to be inversely proportional to the class frequencies: n_samples / (n_classes * np.bincount(y)), using the values of the response object (y). Here n_samples is the total number of records in the data, n_classes is the number of unique classes in the response object, and np.bincount(y) is the number of records for each class. These new weights are then applied to the model's cost function. Unlike linear regression, where the cost function is the mean squared error, in logistic regression this is the log loss function. Now, let's apply it and see the results.
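A minimal sketch of the balanced-weights model, reusing the train/test split from the baseline:

```python
# Same baseline, but with weights inversely proportional to the class frequencies.
balanced_logreg = LogisticRegression(class_weight="balanced", max_iter=1000)
balanced_logreg.fit(X_train, y_train)
y_pred_bal = balanced_logreg.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred_bal):.2f}")
print(f"F1 score: {f1_score(y_test, y_pred_bal):.2f}")
```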
The accuracy score dropped to 66%, but the F1 score looks better at 44% versus the previous 39%, and we can see improvement in the precision-recall curve as well (AP, or AUC-PR, = 52.5%). Let's see what happens when we try to manually assign weights, and whether we can get a better result than the balanced mode.
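One way to hand-tune the weights is to grid-search over candidate minority-class weights; the specific weight range below is purely illustrative, and the author's actual grid may have been larger:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV

# Grid-search over candidate weights for the minority class, scored on F1.
weight_grid = {"class_weight": [{0: 1.0, 1: w} for w in np.linspace(1.0, 10.0, 20)]}

weight_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=weight_grid,
    scoring="f1",
    cv=5,
    n_jobs=-1,
)
weight_search.fit(X_train, y_train)
print(weight_search.best_params_, weight_search.best_score_)
```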
It appears the scores are no better than the balanced mode, and the model takes longer to run: 34 minutes on my laptop. Finally, we'll attempt to tune other hyperparameters using the three optimization algorithms mentioned earlier.
Next, we set up the ranges of parameter values for the search algorithms to traverse.
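A sketch of what those spaces might look like for each search strategy; the specific ranges and solver list are illustrative assumptions, not the article's exact values:

```python
from scipy.stats import loguniform
from skopt.space import Categorical, Real

solvers = ["newton-cg", "lbfgs", "liblinear"]

# GridSearchCV needs an explicit grid of values.
grid_space = {"C": [0.01, 0.1, 1, 10, 100], "solver": solvers}

# RandomizedSearchCV can sample from continuous distributions.
random_space = {"C": loguniform(1e-2, 1e2), "solver": solvers}

# BayesSearchCV uses skopt dimension objects.
bayes_space = {"C": Real(1e-2, 1e2, prior="log-uniform"), "solver": Categorical(solvers)}
```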
Finally, we run the classifiers and print the results of each parameter space search strategy.
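A sketch of running all three searchers with a common scoring metric and cross-validation setup (the n_iter, cv, and scoring values are illustrative):

```python
from skopt import BayesSearchCV
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

base_model = LogisticRegression(class_weight="balanced", max_iter=1000)

searches = {
    "GridSearchCV": GridSearchCV(base_model, grid_space, scoring="f1", cv=5, n_jobs=-1),
    "RandomizedSearchCV": RandomizedSearchCV(
        base_model, random_space, n_iter=25, scoring="f1", cv=5, n_jobs=-1, random_state=42
    ),
    "BayesSearchCV": BayesSearchCV(
        base_model, bayes_space, n_iter=25, scoring="f1", cv=5, n_jobs=-1, random_state=42
    ),
}

for name, search in searches.items():
    search.fit(X_train, y_train)
    print(f"{name}: best F1 = {search.best_score_:.3f}, best params = {search.best_params_}")
```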
All three chose the same type of solver (newton-cg), and two picked similar values of the inverse regularization parameter (C). Now that we have collected our results, let's visualize them in a plot.
Conclusion
The RandomizedSearchCV algorithm performed the best in searching the parameter spaces for the best configuration, followed by BayesSearchCV, with GridSearchCV trailing. The random search algorithm performed well in this example because it exhausted its budget faster than the other algorithms. Note that this is not necessarily the case when applied to other model types. I will expand on this in more detail when using a decision tree classifier in "Part 2." There, I will also cover Bayesian optimization using alternatives to the BayesSearchCV API, such as Bayesian optimization with Gaussian processes or sequential optimization using decision trees.
Resources
API
[1]: GridSearchCV https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
[2]: RandomizedSearchCV https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
[3]: BayesSearchCV https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html
Data Source
[4]: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Readings
[5]: Jesse Davis, and Mark Goadrich. The Relationship Between Precision-Recall and ROC curves [https://www.biostat.wisc.edu/~page/rocpr.pdf]. Madison, WI: University of Wisconsin-Madison, Department of Computer Sciences and Department of Biostatistics and Medical Informatics.
Eric Brochu, Vlad M. Cora, and Nando de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning [https://arxiv.org/pdf/1012.2599.pdf], December 2010.
Tong Yu, and Hong Zhu. Hyper-Parameter Optimization: A Review of Algorithms and Applications [https://arxiv.org/pdf/2003.05689.pdf], March 2020.