Model
Hyperparameters
- src.models.hyperparameters_model.kfold_nFeaturesSelector(data_in, features_in, target, random_state, max_features, n_splits=5, shuffle=True)
Use k-fold cross-validated linear regression to estimate the average score per number of features.
Estimates the average R-squared score on k-fold cross-validated samples, grouped by the number of explanatory variables used. Applicable to regression tasks. Scores are averaged over the cross-validation test folds and estimated with a linear model.
- Parameters:
- data_in : str
DataFrame with independent variables to analyze.
- features_in : str
List of variables to be chosen from. Must come from data_in DataFrame.
- target : str
Series with dependent variable. Must be continuous.
- random_state : int, default = 123
Random number generator seed. Used for KFold sampling.
- max_features : int, default = 10
Limit of features in iteration. The algorithm computes models from i = 1 feature up to max_features.
- n_splits : int, default = 5
Cross-validation parameter. Splits data_in into n_splits equal parts.
- shuffle : bool, default = True
Whether to shuffle the data before splitting into batches. Note that the samples within each split will not be shuffled.
- Returns:
- table: top10 scores
Top 10 R-squared scores by mean test-set score and corresponding number of features category.
- plot: mean scores plot
Line plot of number of features selected versus average train & test sample R-squared scores.
Notes
Required libraries:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import (KFold, GridSearchCV)
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
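Example
The procedure described above can be sketched as follows. This is a minimal illustration, not the source of kfold_nFeaturesSelector: it assumes the function combines RFE, LinearRegression, KFold and GridSearchCV (the imports listed in the Notes), and the helper name mean_r2_by_n_features, the synthetic data and the output column names are hypothetical. Plotting is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold


def mean_r2_by_n_features(data_in, features_in, target, random_state=123,
                          max_features=10, n_splits=5, shuffle=True):
    """Hypothetical helper: mean train/test R-squared per number of features."""
    # KFold rejects a random_state when shuffle is False, so pass it conditionally.
    folds = KFold(n_splits=n_splits, shuffle=shuffle,
                  random_state=random_state if shuffle else None)
    upper = min(max_features, len(features_in))
    # One grid point per candidate number of features (1 .. upper).
    grid = {"n_features_to_select": list(range(1, upper + 1))}
    search = GridSearchCV(
        estimator=RFE(LinearRegression()),
        param_grid=grid,
        scoring="r2",
        cv=folds,
        return_train_score=True,
    )
    search.fit(data_in[features_in], target)
    results = pd.DataFrame(search.cv_results_)
    table = results[["param_n_features_to_select",
                     "mean_train_score", "mean_test_score"]]
    table = table.rename(columns={"param_n_features_to_select": "n_features"})
    # Top 10 rows ranked by mean test-set score.
    return table.sort_values("mean_test_score", ascending=False).head(10)


# Synthetic data (assumption): only x0 and x1 drive the target.
rng = np.random.default_rng(0)
features = [f"x{i}" for i in range(6)]
data = pd.DataFrame(rng.normal(size=(200, 6)), columns=features)
y = 3 * data["x0"] - 2 * data["x1"] + rng.normal(scale=0.1, size=200)

top10 = mean_r2_by_n_features(data, features, y, max_features=4)
print(top10)
```

Because only two columns carry signal in this synthetic data, the mean test score should plateau from two features onward.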
- src.models.hyperparameters_model.varSelect_fwdBckw(data_in, features_in, target, n_features_space, variable_dictionary)
Select subsets of n best predictors and save in a variable dictionary.
Creates a dictionary of variable sets for modelling using RidgeCV regression. Variables are chosen by three algorithms (select from model, forward selection, and backward selection), each constrained to select no more variables than the limits listed in n_features_space.
- Parameters:
- data_in : str
DataFrame with independent variables to analyze.
- features_in : str
List of variables to be chosen from. Must come from data_in DataFrame.
- target : str
Series with dependent variable. Must be continuous.
- n_features_space : list
List of limits on the number of selected features. Passing [1, 2, 3] selects three sets of features for each method.
- variable_dictionary : str
Reference to the variable dictionary, which is updated with variable_dictionary[key] = [list of selected features].
- Returns:
- variable_dictionary["SFM_"+str(n)] : list
List of variables stored as dictionary entry. Chosen by select from model method.
- variable_dictionary["FWD_"+str(n)] : list
List of variables stored as dictionary entry. Chosen by forward variable selection method.
- variable_dictionary["BKWD_"+str(n)] : list
List of variables stored as dictionary entry. Chosen by backward variable selection method.
Notes
Required libraries:
from sklearn.linear_model import RidgeCV
from sklearn.feature_selection import (SequentialFeatureSelector, SelectFromModel)
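Example
The three selection methods and the dictionary keys described above can be sketched as follows. This is an illustration under assumptions, not the source of varSelect_fwdBckw: the helper name select_feature_sets and the synthetic data are hypothetical, and the use of threshold=-np.inf with max_features (which makes SelectFromModel keep the n largest coefficients) is one possible way to enforce the limit.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel, SequentialFeatureSelector
from sklearn.linear_model import RidgeCV


def select_feature_sets(data_in, features_in, target, n_features_space,
                        variable_dictionary):
    """Hypothetical helper: fill variable_dictionary with SFM_/FWD_/BKWD_ sets."""
    X = data_in[features_in]
    for n in n_features_space:
        # Select from model: keep the n features with the largest |coefficient|.
        sfm = SelectFromModel(RidgeCV(), max_features=n, threshold=-np.inf)
        sfm.fit(X, target)
        variable_dictionary["SFM_" + str(n)] = list(X.columns[sfm.get_support()])
        # Forward and backward sequential selection; n must be smaller
        # than the number of candidate features.
        for prefix, direction in (("FWD_", "forward"), ("BKWD_", "backward")):
            sfs = SequentialFeatureSelector(
                RidgeCV(), n_features_to_select=n, direction=direction
            )
            sfs.fit(X, target)
            variable_dictionary[prefix + str(n)] = list(X.columns[sfs.get_support()])
    return variable_dictionary


# Synthetic data (assumption): only x0 and x1 drive the target.
rng = np.random.default_rng(0)
features = [f"x{i}" for i in range(5)]
data = pd.DataFrame(rng.normal(size=(200, 5)), columns=features)
y = 3 * data["x0"] - 2 * data["x1"] + rng.normal(scale=0.1, size=200)

variable_dictionary = select_feature_sets(data, features, y, [1, 2], {})
print(variable_dictionary)
```

The dictionary is updated in place, so the same object can be passed to repeated calls to accumulate feature sets.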
Train
- class src.models.train_model.EstimatorSelectionHelper(models, params)
Bases:
object
Iterate through dictionaries of models and return a summary of results.
Class used for iterating through two dictionaries: one with models and one with parameters (describing the hyperparameter space). For each model and its parameter grid, the class fits the model to the provided dataset using cross-validated grid search (GridSearchCV) and records the scores. Applicable for regression.
- Parameters:
- models : str
Dictionary of model object instances.
- params : str
Dictionary of hyperparameters for each model object from models dict.
Notes
Required libraries:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Lasso
from sklearn.linear_model import TheilSenRegressor
from sklearn.linear_model import RANSACRegressor
from sklearn.linear_model import HuberRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import Lars
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import GridSearchCV
Methods
score_summary(self, sort_by='mean_score')
Return summary of scores.
fit(self, X, y, cv=3, n_jobs=3, verbose=1, scoring=None, refit=False)
Fit models from dictionaries to data using GridSearchCV.
__init__(self, models, params)
Constructor method.
- fit(X, y, cv=3, n_jobs=3, verbose=1, scoring=None, refit=False)
Fit models from dictionaries on given dataset using cross-validation.
For each model in the models dictionary and its set of hyperparameters in the params dictionary, the method fits the model to the dataset and estimates its parameters.
- Parameters:
- X : str
DataFrame with independent variables to analyze.
- y : str
Series with dependent variable.
- cv : int, default = 3
Cross-validation parameter. Splits the X dataset into cv folds.
- n_jobs : int, default = 3
Parallelism parameter. Runs n_jobs jobs in parallel.
- verbose : int, default = 1
Controls output verbosity, from 0 (no messages) to 3 (all messages and computation times).
- scoring : str, default = None
Strategy used to evaluate performance on the test folds.
- refit : bool, default = False
Refit an estimator using the best found parameters on the whole dataset.
- score_summary(sort_by='mean_score')
Create model fit summary scores.
Creates DataFrame containing information about model (estimator) and cross-validation fit scores on provided dataset: minimum, maximum, mean and standard deviation.
- Parameters:
- sort_by : str, default = 'mean_score'
Variable used to sort resulting dataframe (descending).
- Returns:
- df: DataFrame
pandas DataFrame with cross-validation estimation results for each model.
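Example
The fit and score_summary workflow documented above can be sketched with a small stand-in class. This is an assumption-laden illustration, not the source of EstimatorSelectionHelper: the class name GridSearchSummary, the summary column names other than mean_score, and the synthetic data are hypothetical. The demo call uses n_jobs=1 and verbose=0 to keep it quiet and single-process.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV


class GridSearchSummary:
    """Hypothetical stand-in for the documented helper, not its actual source."""

    def __init__(self, models, params):
        missing = set(models) - set(params)
        if missing:
            raise ValueError(f"Models without a parameter grid: {sorted(missing)}")
        self.models = models
        self.params = params
        self.grid_searches = {}

    def fit(self, X, y, cv=3, n_jobs=3, verbose=1, scoring=None, refit=False):
        # Run one cross-validated grid search per model key.
        for key, model in self.models.items():
            search = GridSearchCV(model, self.params[key], cv=cv, n_jobs=n_jobs,
                                  verbose=verbose, scoring=scoring, refit=refit)
            search.fit(X, y)
            self.grid_searches[key] = search
        return self

    def score_summary(self, sort_by="mean_score"):
        rows = []
        for key, search in self.grid_searches.items():
            results = search.cv_results_
            # Shape (n_folds, n_candidates): per-fold test scores.
            fold_scores = np.vstack([results[f"split{i}_test_score"]
                                     for i in range(search.n_splits_)])
            for j, candidate in enumerate(results["params"]):
                scores = fold_scores[:, j]
                rows.append({"estimator": key,
                             "min_score": scores.min(),
                             "max_score": scores.max(),
                             "mean_score": scores.mean(),
                             "std_score": scores.std(),
                             **candidate})
        summary = pd.DataFrame(rows)
        return summary.sort_values(sort_by, ascending=False).reset_index(drop=True)


# Synthetic regression data (assumption).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(150, 4)), columns=["a", "b", "c", "d"])
y = 2 * X["a"] - X["b"] + rng.normal(scale=0.1, size=150)

models = {"ridge": Ridge(), "lasso": Lasso()}
params = {"ridge": {"alpha": [0.1, 1.0]}, "lasso": {"alpha": [0.01, 0.1]}}

helper = GridSearchSummary(models, params)
helper.fit(X, y, cv=3, n_jobs=1, verbose=0, scoring="r2")
summary = helper.score_summary(sort_by="mean_score")
print(summary)
```

Each row of the summary corresponds to one model and hyperparameter combination, so the number of rows equals the total number of grid points across all models.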