Features - Build features

class src.features.build_features.FeatureCorrection(data_in, numerical_var_list, na_var_list, cutoff_missing=0.25, cutoff_fill=0.05, fill_method='mode', print_details=True)

Bases: object

Transforms a dataset to fix missing-value issues in numerical variables.

Creates a pandas DataFrame with transformed numerical features: drops variables with a high % of missing values, imputes missing values for variables with a low % of missing values, and converts non-numerical features to category type.

Transformations: Dropping columns - if more than X% missing values (default: 25%)

Imputing values - if at most X% missing values (default: 5%), with the provided method (default: mode). For imputed variables, binary variables (variable name + ‘_NA’) are also created, indicating whether a value was imputed for a given record (value = 1). Imputation variables are added to self.na_var_list (list of imputation variables).

Parameters:

data_in (DataFrame): DataFrame to analyze.
numerical_var_list (list): List of numeric variables to be analyzed and transformed.
na_var_list (list): Input list of discrete variable names. Binary var_name + “_NA” variables are appended to that list.
cutoff_missing (float64, default = 0.25): Share of missing values used as the cutoff point for dropping a variable. If the share of missing values > cutoff_missing, the variable is dropped from the DataFrame.
cutoff_fill (float64, default = 0.05): Share of missing values used as the cutoff point for filling missing values with fill_method. If the share of missing values < cutoff_fill, missing values are replaced using fill_method and a new variable var_name + “_NA” is created, which indicates rows with missing values for the original variable.
fill_method (str, default = ‘mode’): Filling method for missing values, when the variable meets the cutoff_fill criterion. Can choose from average, median, mode.
print_details (bool, default = True): Parameter controlling informative output. If set to False, the function will suppress the display of detailed information.

Notes

Required libraries:

import pandas as pd
import numpy as np

Attributes:

data_in (pandas DataFrame): Dataset with features to be analyzed and transformed
na_var_list (list): List of discrete variable names. Binary var_name + “_NA” variables are appended to that list.
numerical_var_list (list): List of numeric variables
data_out (pandas DataFrame): Dataset with transformed features
dropped_cols (list): List of columns to be dropped because of too many missing values

Methods

output(self)	Generates the transformed dataset and returns it as a pandas DataFrame.
__init__(self, data_in, numerical_var_list, na_var_list, cutoff_missing=0.25, cutoff_fill=0.05, fill_method='mode', print_details=True)	Constructor method.
_drop_missing(self)	Marks for dropping the variables whose missing value % is higher than the cutoff_missing parameter value.
_impute_values(self)	If the % of missing values is not higher than cutoff_fill, fills selected variables with the given fill_method (available values: mode, mean, median). Also creates binary _NA variables where 1 indicates that a value was imputed for a particular record.
_convert_categories(self)	Convert non-numeric variables to “category” type.

output()

Generate transformed output.

Function calculates the missing value % for each variable in the dataset. Then performs (in order):

1) Establishing the list of variables to drop, with missing values exceeding the cutoff_missing threshold
2) Imputing values according to the fill_method parameter for variables with missing values not exceeding the threshold
3) Converting non-numeric variables to “category” type
4) Dropping the columns established in point 1)

Returns:

data_out (DataFrame): DataFrame with transformed features.

_drop_missing()

Append variables with many missing values to dropped_cols list.

Function checks for every variable whether the % of missing values exceeds the cutoff_missing threshold. If it does, the variable name is added to the dropped_cols list. Prints steps to the terminal. This can be suppressed with self.print_details = False.
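A minimal sketch of the check described above, assuming the missing share is computed with pandas `isna().mean()`. The helper name `find_cols_to_drop` is hypothetical and is not part of `build_features`; the actual method may be implemented differently.

```python
import numpy as np
import pandas as pd


def find_cols_to_drop(data, var_list, cutoff_missing=0.25):
    """Return names of variables whose share of missing values exceeds the cutoff.

    Hypothetical helper for illustration; not the library's implementation.
    """
    # isna().mean() gives the fraction of missing records per column
    missing_share = data[var_list].isna().mean()
    return missing_share[missing_share > cutoff_missing].index.tolist()


df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],   # 50% missing, above the cutoff
    "b": [1.0, 2.0, 3.0, np.nan],      # 25% missing, not above the cutoff
})
dropped_cols = find_cols_to_drop(df, ["a", "b"])
```

Note that the comparison is strict, so a variable with exactly cutoff_missing missing values is kept in this sketch.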

_impute_values()

Impute calculated values for missing values in features.

Function checks for every variable whether the % of missing values does not exceed the cutoff_fill threshold. If it doesn’t, the function imputes a value from fill_method (mode, mean, median) for records with missing values and creates an _NA variable indicating value imputation for this record (value = 1). Prints steps to the terminal. This can be suppressed with self.print_details = False.
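The imputation step for a single variable can be sketched as follows. This is an illustration under assumptions, not the class's code: the helper name `impute_with_flag` is hypothetical, the cutoff_fill check is left out for brevity, and any fill_method other than mode or median is treated as the mean.

```python
import numpy as np
import pandas as pd


def impute_with_flag(data, var_name, fill_method="mode"):
    """Fill missing values of one variable and add a binary var_name + '_NA' flag.

    Hypothetical helper for illustration; not the library's implementation.
    """
    out = data.copy()
    was_missing = out[var_name].isna()
    if fill_method == "mode":
        fill_value = out[var_name].mode().iloc[0]  # first mode if there are ties
    elif fill_method == "median":
        fill_value = out[var_name].median()
    else:
        fill_value = out[var_name].mean()  # assumption: anything else means "mean"
    out[var_name] = out[var_name].fillna(fill_value)
    out[var_name + "_NA"] = was_missing.astype(int)  # 1 = value was imputed
    return out


df = pd.DataFrame({"x": [2.0, 2.0, np.nan, 5.0]})
imputed = impute_with_flag(df, "x")
```

The flag is computed before filling, so it still marks the originally missing records after imputation.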

_convert_categories()

Convert non-numeric variables to ‘category’ type.

Selects all non-number type variables in self.data_out dataset and converts them to ‘category’ type.
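In pandas, this kind of conversion can be sketched with `select_dtypes` and `astype`. The column names below are made up for the example, and the class itself may select columns differently.

```python
import pandas as pd

df = pd.DataFrame({"num": [1.5, 2.5], "city": ["Oslo", "Rome"]})

# Select every column whose dtype is not numeric
non_numeric = df.select_dtypes(exclude="number").columns

# Convert only those columns; numeric columns keep their dtype
df = df.astype({col: "category" for col in non_numeric})
```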

class src.features.build_features.FeatureBinning(X_df, y, var_list, prebin_method='quantile', cat_coff=0.01, n_bins=10, metric='WoE', print_details=True)

Bases: object

Binning categorical features for continuous target variable.

Creates a dictionary that stores, for each variable from var_list (has to be a subset of X_df DataFrame columns), its categories grouped into similar bins via the continuous optimal binning function from the optbinning library. This results in a dictionary entry per variable containing the optimal binning specification and Weight of Evidence + Information Value calculations. WoE and IV values are recalculated in this class, as the optbinning library currently calculates WoE only for a binary dependent variable. The formula for the continuous calculation is taken from listendata.com (see: Source materials 2.)

Parameters:

X_df (DataFrame): DataFrame with independent variables to analyze.
y (Series): Series with the dependent variable. Must be continuous.
var_list (list): List of variable names to create optimal bins for and calculate WoE/IV/target encoders. Must be a subset of X_df columns.
prebin_method (str, default = “quantile”): Quoting source materials 1.: “The pre-binning method. Supported methods are “cart” for a CART decision tree, “quantile” to generate prebins with approximately same frequency and “uniform” to generate prebins with equal width.”
cat_coff (float, default = 0.01): If the size of a category is less than cat_coff (default 1%) of the total population, it is placed in a separate group. All categories smaller than cat_coff are grouped together.
n_bins (int, default = 10): Maximum number of bins (grouped categories).
metric (str, default = “WoE”): Numeric value calculated as the default value for a variable / bin group. By default calculates Weight of Evidence (WoE); “mean” can be calculated instead.
print_details (bool, default = True): Parameter controlling informative output. If set to False, the function will suppress the display of detailed information.

Notes

Required libraries:

import pandas as pd
from optbinning import ContinuousOptimalBinning

References

Source materials:

1. Binning library <http://gnpalencia.org/optbinning/binning_continuous.html>
2. WOE <https://www.listendata.com/2019/08/WOE-IV-Continuous-Dependent.html>

Attributes:

optbin_dict (dictionary): Dictionary storing variable transformation information
target_sum (float): Sum of dependent variable (y)

Methods

transform(self, data, var_name)	Uses stored class optbin dictionary to transform var_name from data and outputs variable in return.
__init__(self, X_df, y, var_list, prebin_method='quantile', cat_coff=0.01, n_bins=10, metric='WoE', print_details=True)	Constructor method.
_optbin_create(self)	Uses continuous optimal binning library to create bins for each variable from var_list. Calculates statistics for each bin and stores as dictionary.

_optbin_create()

Create optimal binning dictionary.

Uses the optbinning library to create optimal bins for each variable, calculates metrics and stores them for future reference / transformations based on the gathered data.

Returns:

optbin_dict (dictionary): For each variable i:
optbin_dict[i][“optbin”] - metadata about transformation, parameter values
optbin_dict[i][“bin_table”] - table containing information about grouped category variables with below variables:

Bin - list of values representing grouped categories

Count - number of observations in bin

Count % - % of all observations in dataset

Sum - sum of dependent variable by bin

Mean - mean of dependent variable by bin

Min - minimum value of dependent variable by bin

Max - maximum value of dependent variable by bin

Zeros count - number of “0” values of dependent variable by bin

WoE - Weight of Evidence (See: source materials 2).

IV - Information Value specifying prediction power by bin
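The exact WoE/IV formulas used by the class are not shown in this reference. The sketch below assumes a common formulation for a continuous, positive target: WoE = ln(share of the target sum in the bin / share of observations in the bin) and IV = (share of target sum - share of observations) * WoE. The helper name `continuous_woe_table` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def continuous_woe_table(groups, y):
    """Per-bin statistics with WoE/IV for a continuous, positive target.

    Assumed formulas (not confirmed against the library's code):
      WoE = ln(share of target sum in bin / share of observations in bin)
      IV  = (share of target sum - share of observations) * WoE
    """
    table = (
        pd.DataFrame({"Bin": groups, "y": y})
        .groupby("Bin")["y"]
        .agg(Count="count", Sum="sum", Mean="mean", Min="min", Max="max")
    )
    count_share = table["Count"] / table["Count"].sum()
    sum_share = table["Sum"] / table["Sum"].sum()
    table["Count %"] = count_share
    table["WoE"] = np.log(sum_share / count_share)
    table["IV"] = (sum_share - count_share) * table["WoE"]
    return table


groups = pd.Series(["low", "low", "high", "high"])
y = pd.Series([1.0, 1.0, 3.0, 3.0])
bin_table = continuous_woe_table(groups, y)
```

Under these assumed formulas, the WoE of a bin equals the logarithm of the bin mean divided by the overall mean, so bins with an above-average target get a positive WoE.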

transform(data, var_name)

Transform dataset variables using stored transformation dictionary.

Transforms variable var_name from DataFrame data using optbin_dict to return the encoded value of Weight of Evidence (WoE) or the mean dependent value from the corresponding bin (see the _optbin_create function). For every metric value that is not equal to “WoE”, the function will transform the variable using the “Mean” metric.

Parameters:

data (DataFrame): DataFrame with variables to transform.
var_name (str): Variable name (dtype = categorical) from data that will be transformed for the corresponding metric from optbin_dict.

Returns:

transformed (Series, float): Series representing the variable transformed from category to float values.
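The lookup behind such an encoding can be sketched as below. The bin table contents, its values, and the helper name `encode` are hypothetical; only the column names Bin, WoE and Mean and the fallback to “Mean” for any metric other than “WoE” follow the description above.

```python
import pandas as pd

# Hypothetical bin specification: grouped categories and their WoE / Mean values
bin_table = pd.DataFrame({
    "Bin": [["A", "B"], ["C"]],
    "WoE": [-0.2, 0.35],
    "Mean": [10.0, 17.5],
})


def encode(values, bin_table, metric="WoE"):
    """Replace each category with the metric of the bin that contains it."""
    column = "WoE" if metric == "WoE" else "Mean"  # any other value falls back to Mean
    lookup = {
        category: value
        for bins, value in zip(bin_table["Bin"], bin_table[column])
        for category in bins
    }
    return values.map(lookup).astype(float)


encoded = encode(pd.Series(["A", "C", "B"]), bin_table)
```

Categories that are absent from every bin map to NaN in this sketch; how the class treats unseen categories is not stated here.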

src.features.build_features.boxcox_transformation(data, var_name, transformation_dict, print_details=True)

Box-Cox variable transformation.

Transforms variable var_name from DataFrame data using the Box-Cox transformation (a power transform applied to the series after monotonically shifting it by adding 1, to avoid infinite results) in order to provide better predictive characteristics (stabilizes variance, makes the variable distribution look “more like” a normal distribution). Also updates the given dictionary with lambda values for transformed variables, which can be used for future scoring of new data. Returns the transformed variable.

Parameters:

data (DataFrame): DataFrame to analyze.
var_name (str): Variable name (dtype = numerical) from data that will be transformed.
transformation_dict (dict): Dictionary storing lambda parameters of the Box-Cox transformations for the transformed variables.
print_details (bool, default = True): Parameter controlling informative output. If set to False, the function will suppress the display of detailed information.

Returns:

boxcox_var (Series, float): Series representing the variable transformed using the Box-Cox transformation.
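A minimal sketch of the behaviour described above, built on `scipy.stats.boxcox`. The function name `boxcox_sketch`, the sample data, and keying the dictionary by variable name are assumptions for illustration, not the module's code.

```python
import pandas as pd
from scipy import stats


def boxcox_sketch(data, var_name, transformation_dict):
    """Box-Cox transform of data[var_name] + 1, storing the fitted lambda.

    Hypothetical re-implementation for illustration only.
    """
    shifted = (data[var_name] + 1).to_numpy()  # shift so zeros become strictly positive
    transformed, fitted_lambda = stats.boxcox(shifted)
    transformation_dict[var_name] = fitted_lambda  # keep lambda for scoring new data
    return pd.Series(transformed, index=data.index, name=var_name)


lambdas = {}
train = pd.DataFrame({"income": [0.0, 1.0, 3.0, 7.0, 20.0, 54.0]})
train_bc = boxcox_sketch(train, "income", lambdas)

# Scoring new data: reuse the stored lambda instead of fitting again
new = pd.Series([2.0, 10.0])
new_bc = stats.boxcox((new + 1).to_numpy(), lmbda=lambdas["income"])
```

Reusing the stored lambda keeps new data on the same scale as the data the transformation was fitted on. The input still has to be strictly positive after the shift, so values of -1 or below are not handled by this sketch.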

Notes

Required libraries:

from scipy import stats

References

Source materials:

Wikipedia <https://en.wikipedia.org/wiki/Power_transform#Box%E2%80%93Cox_transformation>