Model

Hyperparameters

src.models.hyperparameters_model.kfold_nFeaturesSelector(data_in, features_in, target, random_state, max_features, n_splits=5, shuffle=True)

Use k-fold cross-validated linear regression to estimate the average score per number of features.

Estimates the average adjusted R-squared on k-fold cross-validated samples, grouped by the number of explanatory variables used. Scores are estimated with a linear model and averaged over the cross-validation test folds. Applicable to regression tasks.

Parameters:
data_in : pandas.DataFrame

DataFrame with independent variables to analyze.

features_in : list

List of variables to be chosen from. Must come from data_in DataFrame.

target : pandas.Series

Series with dependent variable. Must be continuous.

random_state : int, default = 123

Random number generator seed used for KFold sampling.

max_features : int, default = 10

Upper limit on the number of features. The algorithm computes models from i = 1 feature up to max_features features.

n_splits : int, default = 5

Cross-validation parameter - splits data_in into n_splits equal parts.

shuffle : bool, default = True

Whether to shuffle the data before splitting into batches. Note that the samples within each split will not be shuffled.

Returns:
table: top 10 scores

Top 10 R-squared scores, ranked by mean test-set score, with the corresponding number of features.

plot: mean scores plot

Line plot of number of features selected versus average train & test sample R-squared scores.

Notes

Required libraries:

import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

from sklearn.model_selection import (KFold, GridSearchCV)

from sklearn.linear_model import LinearRegression

from sklearn.feature_selection import RFE
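
The following is a minimal sketch of the selection loop described above, built only from the libraries listed in these Notes. It illustrates the technique rather than reproducing the actual implementation: the _sketch suffix, the use of plain (rather than adjusted) R-squared from score(), and the groupby aggregation are assumptions.

import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

def kfold_nFeaturesSelector_sketch(data_in, features_in, target,
                                   random_state=123, max_features=10,
                                   n_splits=5, shuffle=True):
    # Score a linear model restricted to the n best features (chosen by RFE)
    # on each cross-validation fold, for n = 1 .. max_features.
    X, y = data_in[features_in], target
    kf = KFold(n_splits=n_splits, shuffle=shuffle,
               random_state=random_state if shuffle else None)
    rows = []
    for n in range(1, max_features + 1):
        for train_idx, test_idx in kf.split(X):
            rfe = RFE(LinearRegression(), n_features_to_select=n)
            rfe.fit(X.iloc[train_idx], y.iloc[train_idx])
            rows.append({
                "n_features": n,
                "train_score": rfe.score(X.iloc[train_idx], y.iloc[train_idx]),
                "test_score": rfe.score(X.iloc[test_idx], y.iloc[test_idx]),
            })
    # Average train/test R-squared per number of features; the top-10 table
    # and the mean-scores plot described under Returns derive from this frame.
    return pd.DataFrame(rows).groupby("n_features").mean()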

src.models.hyperparameters_model.varSelect_fwdBckw(data_in, features_in, target, n_features_space, variable_dictionary)

Select subsets of the n best predictors and save them in a variable dictionary.

Creates a dictionary of variable sets for modelling, using RidgeCV regression as the base estimator. Variables are chosen with three algorithms: select from model, forward selection, and backward selection, each constrained to no more variables than indicated in the n_features_space list.

Parameters:
data_in : pandas.DataFrame

DataFrame with independent variables to analyze.

features_in : list

List of variables to be chosen from. Must come from data_in DataFrame.

target : pandas.Series

Series with dependent variable. Must be continuous.

n_features_space : list

List of limits on the number of selected features. Passing [1, 2, 3] will select three sets of features for each method.

variable_dictionary : dict

Reference to the variable dictionary that will be updated in place with variable_dictionary[key] = [list of selected features].

Returns:
variable_dictionary["SFM_" + str(n)] : list

List of variables stored as a dictionary entry. Chosen by the select-from-model method.

variable_dictionary["FWD_" + str(n)] : list

List of variables stored as a dictionary entry. Chosen by the forward variable selection method.

variable_dictionary["BKWD_" + str(n)] : list

List of variables stored as a dictionary entry. Chosen by the backward variable selection method.

Notes

Required libraries:

  • from sklearn.linear_model import RidgeCV

  • from sklearn.feature_selection import (SequentialFeatureSelector, SelectFromModel)

References

Source materials:

  1. Diabetes use-case <https://scikit-learn.org/stable/auto_examples/feature_selection/plot_select_from_model_diabetes.html>
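
For illustration, a minimal sketch of the selection logic described above, using the two libraries listed in the Notes. The SelectFromModel settings (threshold=-np.inf combined with max_features) and the use of a fresh RidgeCV estimator per selector are assumptions; the real function may configure these differently.

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.feature_selection import SequentialFeatureSelector, SelectFromModel

def varSelect_fwdBckw_sketch(data_in, features_in, target,
                             n_features_space, variable_dictionary):
    X, y = data_in[features_in], target
    for n in n_features_space:
        # 1. Select from model: keep the n features with the largest
        #    absolute RidgeCV coefficients.
        sfm = SelectFromModel(RidgeCV(), max_features=n,
                              threshold=-np.inf).fit(X, y)
        variable_dictionary["SFM_" + str(n)] = list(X.columns[sfm.get_support()])
        # 2./3. Sequential forward and backward selection, limited to n features.
        for key, direction in (("FWD_", "forward"), ("BKWD_", "backward")):
            sfs = SequentialFeatureSelector(RidgeCV(), n_features_to_select=n,
                                            direction=direction).fit(X, y)
            variable_dictionary[key + str(n)] = list(X.columns[sfs.get_support()])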

Train

class src.models.train_model.EstimatorSelectionHelper(models, params)

Bases: object

Iterate through dictionaries of models and return a summary of results.

Class used for iterating through two dictionaries - one with a list of models and one with a list of parameters describing the hyperparameter space. For each entry the class instantiates the model object and trains it on the provided dataset using grid-search cross-validation, estimating the average score per hyperparameter combination. Applicable to regression tasks.

Parameters:
models : dict

Dictionary of model object instances.

params : dict

Dictionary of hyperparameters for each model object from models dict.

Notes

Required libraries:

import pandas as pd

import numpy as np

from sklearn.linear_model import LinearRegression

from sklearn.linear_model import Ridge

from sklearn.linear_model import ElasticNet

from sklearn.linear_model import Lasso

from sklearn.linear_model import TheilSenRegressor

from sklearn.linear_model import RANSACRegressor

from sklearn.linear_model import HuberRegressor

from sklearn.linear_model import SGDRegressor

from sklearn.linear_model import Lars

from sklearn.linear_model import RidgeCV

from sklearn.model_selection import GridSearchCV

References

Source materials:

  1. Blog <https://www.davidsbatista.net/blog/2018/02/23/model_optimization/>
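
A hypothetical usage example follows. The synthetic dataset, the dictionary keys, and the convention that models and params share the same keys are assumptions (borrowed from the pattern in the referenced blog post), not guarantees of this documentation; the methods called here are documented below.

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from src.models.train_model import EstimatorSelectionHelper

# Synthetic regression data purely for illustration.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=123)
X, y = pd.DataFrame(X), pd.Series(y)

# One entry per estimator; keys of the two dictionaries are assumed to match.
models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
}
params = {
    "LinearRegression": {},                      # nothing to tune
    "Ridge": {"alpha": [0.1, 1.0, 10.0]},
    "Lasso": {"alpha": [0.01, 0.1, 1.0]},
}

helper = EstimatorSelectionHelper(models, params)
helper.fit(X, y, cv=3, scoring="r2")
print(helper.score_summary(sort_by="mean_score"))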

Methods

score_summary(self, sort_by='mean_score')

Return summary of scores.

fit(self, X, y, cv=3, n_jobs=3, verbose=1, scoring=None, refit=False)

Fit models from dictionaries to data using GridSearchCV.

__init__(self, models, params)

Constructor method.

fit(X, y, cv=3, n_jobs=3, verbose=1, scoring=None, refit=False)

Fit models from dictionaries on given dataset using cross-validation.

For each model from the models dictionary, combined with its hyperparameter grid from the params dictionary, the method fits the model to the dataset and estimates its parameters.

Parameters:
X : pandas.DataFrame

DataFrame with independent variables to analyze.

y : pandas.Series

Series with dependent variable.

cv : int, default = 3

Cross-validation parameter - splits X dataset into cv independent samples.

n_jobs : int, default = 3

Parallelism parameter - runs n_jobs jobs in parallel.

verbose : int, default = 1

Controls output verbosity - values range from 0 (no messages) to 3 (all messages and computation times).

scoring : str, default = None

Test evaluation strategy.

refit : bool, default = False

Refit an estimator using the best found parameters on the whole dataset.
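
Conceptually, fit runs one exhaustive grid search per model, as in the sketch below. The standalone function form, the grid_searches dictionary, and the return_train_score flag are assumptions for illustration; the class may store its results differently.

from sklearn.model_selection import GridSearchCV

def fit_sketch(models, params, X, y, cv=3, n_jobs=3, verbose=1,
               scoring=None, refit=False):
    # One exhaustive grid search per model, evaluated with cv-fold cross-validation.
    grid_searches = {}
    for key, model in models.items():
        gs = GridSearchCV(model, params[key], cv=cv, n_jobs=n_jobs,
                          verbose=verbose, scoring=scoring, refit=refit,
                          return_train_score=True)
        gs.fit(X, y)
        grid_searches[key] = gs
    return grid_searches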

score_summary(sort_by='mean_score')

Create model fit summary scores.

Creates a DataFrame containing information about each model (estimator) and its cross-validation fit scores on the provided dataset: minimum, maximum, mean, and standard deviation.

Parameters:
sort_by : str, default = 'mean_score'

Column used to sort the resulting DataFrame (descending).

Returns:
df: DataFrame

Pandas DataFrame with cross-validation estimation results for each model.
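
As a closing illustration, a sketch of how such a summary could be assembled from fitted GridSearchCV objects (for example, the dictionary produced by the fit sketch above). The cv_results_ and n_splits_ attributes are standard scikit-learn, but the exact columns and aggregation here are assumptions rather than the documented implementation.

import numpy as np
import pandas as pd

def score_summary_sketch(grid_searches, sort_by="mean_score"):
    # Collapse each GridSearchCV's cv_results_ into one row per parameter
    # combination with min / mean / max / std of the per-split test scores.
    rows = []
    for estimator, gs in grid_searches.items():
        results = gs.cv_results_
        split_scores = np.vstack([results["split%d_test_score" % i]
                                  for i in range(gs.n_splits_)])
        for j, candidate_params in enumerate(results["params"]):
            scores = split_scores[:, j]
            rows.append({"estimator": estimator,
                         "min_score": scores.min(),
                         "mean_score": scores.mean(),
                         "max_score": scores.max(),
                         "std_score": scores.std(),
                         **candidate_params})
    return pd.DataFrame(rows).sort_values(sort_by, ascending=False)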