Model

Hyperparameters

src.models.hyperparameters_model.kfold_nFeaturesSelector(data_in, features_in, target, random_state, max_features, n_splits=5, shuffle=True)

Use k-fold cross-validated linear regression to estimate the average score per number of features.

Estimates the average adjusted R-squared on k-fold cross-validated samples, grouped by the number of explanatory variables used. Scores are estimated with a linear model and averaged over the cross-validation test folds. Applicable to regression tasks.

Parameters:
data_in : pandas.DataFrame

DataFrame with independent variables to analyze.

features_in : list

List of variables to be chosen from. Must come from data_in DataFrame.

target : pandas.Series

Series with dependent variable. Must be continuous.

random_state : int, default = 123

Random number generator seed used for KFold sampling.

max_features : int, default = 10

Upper limit on the number of features. The algorithm computes models from i = 1 feature up to max_features features.

n_splits : int, default = 5

Cross-validation parameter - splits data_in into n_splits equal parts.

shuffle : bool, default = True

Whether to shuffle the data before splitting into batches. Note that the samples within each split will not be shuffled.

Returns:
table: top 10 scores

Top 10 R-squared scores, ranked by mean test-set score, with the corresponding number of features.

plot: mean scores plot

Line plot of number of features selected versus average train & test sample R-squared scores.

Notes

Required libraries:

import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

from sklearn.model_selection import (KFold, GridSearchCV)

from sklearn.linear_model import LinearRegression

from sklearn.feature_selection import RFE
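
The following is a minimal sketch of the selection loop described above, built only from the libraries listed in these Notes. It illustrates the technique rather than reproducing the actual implementation: the _sketch suffix, the use of plain (rather than adjusted) R-squared from score(), and the groupby aggregation are assumptions.

import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

def kfold_nFeaturesSelector_sketch(data_in, features_in, target,
                                   random_state=123, max_features=10,
                                   n_splits=5, shuffle=True):
    # Score a linear model restricted to the n best features (chosen by RFE)
    # on each cross-validation fold, for n = 1 .. max_features.
    X, y = data_in[features_in], target
    kf = KFold(n_splits=n_splits, shuffle=shuffle,
               random_state=random_state if shuffle else None)
    rows = []
    for n in range(1, max_features + 1):
        for train_idx, test_idx in kf.split(X):
            rfe = RFE(LinearRegression(), n_features_to_select=n)
            rfe.fit(X.iloc[train_idx], y.iloc[train_idx])
            rows.append({
                "n_features": n,
                "train_score": rfe.score(X.iloc[train_idx], y.iloc[train_idx]),
                "test_score": rfe.score(X.iloc[test_idx], y.iloc[test_idx]),
            })
    # Average train/test R-squared per number of features; the top-10 table
    # and the mean-scores plot described under Returns derive from this frame.
    return pd.DataFrame(rows).groupby("n_features").mean()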

src.models.hyperparameters_model.varSelect_fwdBckw(data_in, features_in, target, n_features_space, variable_dictionary)

Select subsets of the n best predictors and save them in a variable dictionary.

Creates a dictionary of variable sets for modelling, using RidgeCV regression as the base estimator. Variables are chosen with three algorithms: select from model, forward selection, and backward selection, each constrained to no more variables than indicated in the n_features_space list.

Parameters:
data_in : pandas.DataFrame

DataFrame with independent variables to analyze.

features_in : list

List of variables to be chosen from. Must come from data_in DataFrame.

target : pandas.Series

Series with dependent variable. Must be continuous.

n_features_space : list

List of limits on the number of selected features. Passing [1, 2, 3] will select three sets of features for each method.

variable_dictionary : dict

Reference to the variable dictionary that will be updated in place with variable_dictionary[key] = [list of selected features].

Returns:
variable_dictionary["SFM_" + str(n)] : list

List of variables stored as a dictionary entry. Chosen by the select-from-model method.

variable_dictionary["FWD_" + str(n)] : list

List of variables stored as a dictionary entry. Chosen by the forward variable selection method.

variable_dictionary["BKWD_" + str(n)] : list

List of variables stored as a dictionary entry. Chosen by the backward variable selection method.

Notes

Required libraries:

  • from sklearn.linear_model import RidgeCV

  • from sklearn.feature_selection import (SequentialFeatureSelector, SelectFromModel)

References

Source materials:

  1. Diabetes use-case <https://scikit-learn.org/stable/auto_examples/feature_selection/plot_select_from_model_diabetes.html>
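
For illustration, a minimal sketch of the selection logic described above, using the two libraries listed in the Notes. The SelectFromModel settings (threshold=-np.inf combined with max_features) and the use of a fresh RidgeCV estimator per selector are assumptions; the real function may configure these differently.

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.feature_selection import SequentialFeatureSelector, SelectFromModel

def varSelect_fwdBckw_sketch(data_in, features_in, target,
                             n_features_space, variable_dictionary):
    X, y = data_in[features_in], target
    for n in n_features_space:
        # 1. Select from model: keep the n features with the largest
        #    absolute RidgeCV coefficients.
        sfm = SelectFromModel(RidgeCV(), max_features=n,
                              threshold=-np.inf).fit(X, y)
        variable_dictionary["SFM_" + str(n)] = list(X.columns[sfm.get_support()])
        # 2./3. Sequential forward and backward selection, limited to n features.
        for key, direction in (("FWD_", "forward"), ("BKWD_", "backward")):
            sfs = SequentialFeatureSelector(RidgeCV(), n_features_to_select=n,
                                            direction=direction).fit(X, y)
            variable_dictionary[key + str(n)] = list(X.columns[sfs.get_support()])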

Train

class src.models.train_model.EstimatorSelectionHelper(models, params)

Bases: object

Iterate through dictionaries of models and return a summary of results.

Class used for iterating through two dictionaries - one with a list of models and one with a list of parameters describing the hyperparameter space. For each entry the class instantiates the model object and trains it on the provided dataset using grid-search cross-validation, estimating the average score per hyperparameter combination. Applicable to regression tasks.

Parameters:
models : dict

Dictionary of model object instances.

params : dict

Dictionary of hyperparameters for each model object from models dict.

Notes

Required libraries:

import pandas as pd

import numpy as np

from sklearn.linear_model import LinearRegression

from sklearn.linear_model import Ridge

from sklearn.linear_model import ElasticNet

from sklearn.linear_model import Lasso

from sklearn.linear_model import TheilSenRegressor

from sklearn.linear_model import RANSACRegressor

from sklearn.linear_model import HuberRegressor

from sklearn.linear_model import SGDRegressor

from sklearn.linear_model import Lars

from sklearn.linear_model import RidgeCV

from sklearn.model_selection import GridSearchCV

References

Source materials:

  1. Blog <https://www.davidsbatista.net/blog/2018/02/23/model_optimization/>
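
A hypothetical usage example follows. The synthetic dataset, the dictionary keys, and the convention that models and params share the same keys are assumptions (borrowed from the pattern in the referenced blog post), not guarantees of this documentation; the methods called here are documented below.

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from src.models.train_model import EstimatorSelectionHelper

# Synthetic regression data purely for illustration.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=123)
X, y = pd.DataFrame(X), pd.Series(y)

# One entry per estimator; keys of the two dictionaries are assumed to match.
models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
}
params = {
    "LinearRegression": {},                      # nothing to tune
    "Ridge": {"alpha": [0.1, 1.0, 10.0]},
    "Lasso": {"alpha": [0.01, 0.1, 1.0]},
}

helper = EstimatorSelectionHelper(models, params)
helper.fit(X, y, cv=3, scoring="r2")
print(helper.score_summary(sort_by="mean_score"))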

Methods

score_summary(self, sort_by='mean_score')

Return summary of scores.

fit(self, X, y, cv=3, n_jobs=3, verbose=1, scoring=None, refit=False)

Fit models from dictionaries to data using GridSearchCV.

__init__(self, models, params)

Constructor method.

fit(X, y, cv=3, n_jobs=3, verbose=1, scoring=None, refit=False)

Fit models from dictionaries on given dataset using cross-validation.

For each model from the models dictionary, combined with its hyperparameter grid from the params dictionary, the method fits the model to the dataset and estimates its parameters.

Parameters:
X : pandas.DataFrame

DataFrame with independent variables to analyze.

y : pandas.Series

Series with dependent variable.

cv : int, default = 3

Cross-validation parameter - splits X dataset into cv independent samples.

n_jobs : int, default = 3

Parallelism parameter - runs n_jobs jobs in parallel.

verbose : int, default = 1

Controls output verbosity - values range from 0 (no messages) to 3 (all messages and computation times).

scoring : str, default = None

Test evaluation strategy.

refit : bool, default = False

Refit an estimator using the best found parameters on the whole dataset.
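
Conceptually, fit runs one exhaustive grid search per model, as in the sketch below. The standalone function form, the grid_searches dictionary, and the return_train_score flag are assumptions for illustration; the class may store its results differently.

from sklearn.model_selection import GridSearchCV

def fit_sketch(models, params, X, y, cv=3, n_jobs=3, verbose=1,
               scoring=None, refit=False):
    # One exhaustive grid search per model, evaluated with cv-fold cross-validation.
    grid_searches = {}
    for key, model in models.items():
        gs = GridSearchCV(model, params[key], cv=cv, n_jobs=n_jobs,
                          verbose=verbose, scoring=scoring, refit=refit,
                          return_train_score=True)
        gs.fit(X, y)
        grid_searches[key] = gs
    return grid_searches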

score_summary(sort_by='mean_score')

Create model fit summary scores.

Creates a DataFrame containing information about each model (estimator) and its cross-validation fit scores on the provided dataset: minimum, maximum, mean, and standard deviation.

Parameters:
sort_by : str, default = 'mean_score'

Column used to sort the resulting DataFrame (descending).

Returns:
df: DataFrame

Pandas DataFrame with cross-validation estimation results for each model.
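
As a closing illustration, a sketch of how such a summary could be assembled from fitted GridSearchCV objects (for example, the dictionary produced by the fit sketch above). The cv_results_ and n_splits_ attributes are standard scikit-learn, but the exact columns and aggregation here are assumptions rather than the documented implementation.

import numpy as np
import pandas as pd

def score_summary_sketch(grid_searches, sort_by="mean_score"):
    # Collapse each GridSearchCV's cv_results_ into one row per parameter
    # combination with min / mean / max / std of the per-split test scores.
    rows = []
    for estimator, gs in grid_searches.items():
        results = gs.cv_results_
        split_scores = np.vstack([results["split%d_test_score" % i]
                                  for i in range(gs.n_splits_)])
        for j, candidate_params in enumerate(results["params"]):
            scores = split_scores[:, j]
            rows.append({"estimator": estimator,
                         "min_score": scores.min(),
                         "mean_score": scores.mean(),
                         "max_score": scores.max(),
                         "std_score": scores.std(),
                         **candidate_params})
    return pd.DataFrame(rows).sort_values(sort_by, ascending=False)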