Features - Build features
- class src.features.build_features.FeatureCorrection(data_in, numerical_var_list, na_var_list, cutoff_missing=0.25, cutoff_fill=0.05, fill_method='mode', print_details=True)
Bases:
object
Transforms a dataset to fix missing-value issues in numerical variables.
Creates a pandas DataFrame with transformed numerical features: drops variables with a high % of missing values, imputes missing values for variables with a low % of missing values, and converts non-numerical features to category type.
Transformations: Dropping columns - if more than X% missing values (default: 25%)
Imputing values - if at most X% missing values (default: 5%) with the provided method (default: mode). For imputed variables, binary variables (variable name + ‘_NA’) are also created, indicating whether a value was imputed for a given record (value = 1). Imputation variables are added to self.na_var_list (list of imputation variables).
- Parameters:
- data_in : pandas DataFrame
DataFrame to analyze.
- numerical_var_list : list
List of numeric variables to be analyzed and transformed.
- na_var_list : list
Name of input discrete variable list. Binary var_name + “_NA” variables are appended to that list.
- cutoff_missing : float64, default = 0.25
Share of missing values used as the cutoff point for dropping a variable. If the share of missing values > cutoff_missing, the variable is dropped from the DataFrame.
- cutoff_fill : float64, default = 0.05
Share of missing values used as the cutoff point for filling missing values with fill_method. If the share of missing values < cutoff_fill, missing values are replaced using fill_method and a new variable var_name + “_NA” is created, which indicates rows with missing values for the original variable.
- fill_method : str, default = ‘mode’
Filling method for missing values, when variable meets cutoff_fill criteria. Can choose from average, median, mode.
- print_details : bool, default = True
Parameter controlling informative output. If set to False, the function will suppress the display of detailed information.
Notes
Required libraries:
import pandas as pd
import numpy as np
- Attributes:
- data_in : pandas DataFrame
Dataset with features to be analyzed and transformed
- na_var_list : list
Name of discrete variable list. Binary var_name + “_NA” variables are appended to that list.
- numerical_var_list : list
List of numeric variables
- data_out : pandas DataFrame
Dataset with transformed features
- dropped_cols : list
List of columns to be dropped because of too many missing values
Methods
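output(self)
Generates the transformed dataset and returns it as a pandas DataFrame.
__init__(self, data_in, numerical_var_list, na_var_list, cutoff_missing=0.25, cutoff_fill=0.05, fill_method='mode', print_details=True)
Constructor method.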
_drop_missing(self)
Dropping variables with missing value % higher than cutoff_missing parameter value.
_impute_values(self)
If the % of missing values does not exceed cutoff_fill, fills the selected variables using the given fill_method (available values: mode, mean, median). Also creates binary _NA variables where 1 indicates that a value was imputed for the particular record.
_convert_categories(self)
Convert non-numeric variables to “category” type.
- output()
Generate transformed output.
Function calculates the missing value % for each variable in the dataset. Then performs (in order):
1. Establishing the list of variables to drop, with missing values exceeding the cutoff_missing threshold
2. Imputing values according to the fill_method parameter for variables with missing values not exceeding the cutoff_fill threshold
3. Converting non-numeric variables to “category” type
4. Dropping columns calculated from point 1)
- Returns:
- data_outDataFrame
DataFrame with transformed features.
- _drop_missing()
Append variables with many missing values to dropped_cols list.
Function checks for every variable whether the % of missing values exceeds the cutoff_missing threshold. If it does, the variable name is added to the dropped_cols list. Prints steps to the terminal. This can be suppressed with self.print_details = False.
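The check described above can be sketched in plain pandas. This is a minimal illustration, not the class's actual code: the helper name find_columns_to_drop and the sample DataFrame are hypothetical, and the strict "greater than" comparison follows the description of cutoff_missing.

```python
import numpy as np
import pandas as pd


def find_columns_to_drop(data, var_list, cutoff_missing=0.25):
    """Return names of columns whose share of missing values exceeds the cutoff.

    Hypothetical helper, written only to illustrate the documented rule.
    """
    # isna().mean() gives the fraction of missing values per column
    missing_share = data[var_list].isna().mean()
    return missing_share[missing_share > cutoff_missing].index.tolist()


df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],  # 50% missing
    "b": [1.0, 2.0, 3.0, 4.0],        # 0% missing
    "c": [1.0, 2.0, 3.0, np.nan],     # 25% missing, not above the cutoff
})
dropped_cols = find_columns_to_drop(df, ["a", "b", "c"])
```

With the default cutoff only column "a" is selected, because a share exactly equal to the cutoff does not exceed it.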
- _impute_values()
Impute calculated values for missing values in features.
Function checks for every variable whether the % of missing values does not exceed the cutoff_fill threshold. If it doesn’t, for records with missing values it imputes the value from fill_method (mode, mean, median) and creates an _NA variable indicating value imputation for the record (value = 1). Prints steps to the terminal. This can be suppressed with self.print_details = False.
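A minimal sketch of imputation with an indicator column, assuming plain pandas. The helper impute_with_flag is hypothetical and is not the class method. Two details are assumptions made here: a column with no missing values is skipped, and a share exactly equal to cutoff_fill still qualifies.

```python
import numpy as np
import pandas as pd


def impute_with_flag(data, var_name, cutoff_fill=0.05, fill_method="mode"):
    """Impute one column in place and add a var_name + '_NA' indicator.

    Hypothetical helper; returns True when imputation was performed.
    """
    missing = data[var_name].isna()
    share = missing.mean()
    # Assumption: skip columns without gaps and columns above the cutoff
    if share == 0 or share > cutoff_fill:
        return False
    if fill_method == "mode":
        fill_value = data[var_name].mode().iloc[0]
    elif fill_method == "median":
        fill_value = data[var_name].median()
    else:
        fill_value = data[var_name].mean()
    # 1 marks records whose value was imputed
    data[var_name + "_NA"] = missing.astype(int)
    data[var_name] = data[var_name].fillna(fill_value)
    return True


# 40 records, one of them missing (2.5%), most frequent value is 1.0
df = pd.DataFrame({"x": [1.0] * 30 + [2.0] * 9 + [np.nan]})
imputed = impute_with_flag(df, "x")
```

After the call the last record of x holds the mode and x_NA flags that record with 1.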
- _convert_categories()
Convert non-numeric variables to ‘category’ type.
Selects all non-number type variables in self.data_out dataset and converts them to ‘category’ type.
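The conversion can be illustrated with select_dtypes. This is a sketch on a made-up DataFrame, not the method's source; the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"num": [1.5, 2.5, 3.5], "city": ["A", "B", "A"]})

# Pick every column that is not of a numeric dtype
non_numeric = df.select_dtypes(exclude="number").columns

# Convert each selected column to the pandas 'category' dtype
for col in non_numeric:
    df[col] = df[col].astype("category")
```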
- class src.features.build_features.FeatureBinning(X_df, y, var_list, prebin_method='quantile', cat_coff=0.01, n_bins=10, metric='WoE', print_details=True)
Bases:
object
Binning categorical features for continuous target variable.
Creates a dictionary that stores, for each variable in var_list (which has to be a subset of X_df columns), its categories grouped into similar bins by the continuous optimal binning function from the optbinning library. The result is one dictionary entry per variable containing the optimal binning specification and the Weight Of Evidence + Information Value calculations. WoE and IV values are recalculated here because the optbinning library currently calculates WoE only for a binary dependent variable. The formula for the continuous case is taken from listendata.com (see: Source materials 2.)
- Parameters:
- X_df : pandas DataFrame
DataFrame with independent variables to analyze.
- y : pandas Series
Series with dependent variable. Must be continuous.
- var_list : list
List of variable names to create optimal bins for and calculate WoE / IV / target encoders. Must be a subset of X_df columns.
- prebin_method : str, default = “quantile”
Quoting source materials 1.: “The pre-binning method. Supported methods are ‘cart’ for a CART decision tree, ‘quantile’ to generate prebins with approximately same frequency and ‘uniform’ to generate prebins with equal width.”
- cat_coff : float, default = 0.01
If a category's size is less than cat_coff % (default 1%) of the total population, it is placed in a separate group. All categories smaller than cat_coff % are grouped together.
- n_bins : int, default = 10
Maximum number of bins (grouped categories).
- metric : str, default = “WoE”
Numeric value calculated for each variable / bin group. By default Weight Of Evidence (WoE) is calculated; “mean” is also possible.
- print_details : bool, default = True
Parameter controlling informative output. If set to False, the function will suppress the display of detailed information.
Notes
Required libraries:
import pandas as pd
from optbinning import ContinuousOptimalBinning
References
Source materials:
1. Binning library <http://gnpalencia.org/optbinning/binning_continuous.html>
2. WOE <https://www.listendata.com/2019/08/WOE-IV-Continuous-Dependent.html>
- Attributes:
- optbin_dict : dictionary
Dictionary storing variable transformation information
- target_sum : float
Sum of dependent variable (y)
Methods
transform(self, data, var_name)
Uses stored class optbin dictionary to transform var_name from data and outputs variable in return.
__init__(self, X_df, y, var_list, prebin_method="quantile", cat_coff=0.01, n_bins=10, metric="WoE", print_details=True)
Constructor method.
_optbin_create(self)
Uses continuous optimal binning library to create bins for each variable from var_list. Calculates statistics for each bin and stores as dictionary.
- _optbin_create()
Create optimal binning dictionary.
Uses the optbinning library to create optimal bins for each variable, calculates metrics and stores them for future reference / transformations based on the gathered data.
- Returns:
- optbin_dict : Dictionary
optbin_dict[i]["optbin"] - metadata about transformation, parameter values
optbin_dict[i]["bin_table"] - table containing information about grouped category variables with below variables:
Bin - list of values representing grouped categories
Count - number of observations in bin
Count % - % of all observations in dataset
Sum - sum of dependent variable by bin
Mean - mean of dependent variable by bin
Min - minimum value of dependent variable by bin
Max - maximum value of dependent variable by bin
Zeros count - “0” values of dependent variable by bin
WoE - Weight of Evidence (See: source materials 2).
IV - Information Value specifying prediction power by bin
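The WoE and IV columns can be illustrated with a small pandas sketch. The exact formula used by the class is not shown in this documentation, so the definition below is an assumption: WoE is taken as the log of a bin's share of the target sum divided by its share of observations, and IV as the difference of those shares times WoE. The helper name continuous_woe_iv and the sample data are hypothetical, and each category is treated as its own bin.

```python
import numpy as np
import pandas as pd


def continuous_woe_iv(x, y):
    """Per-category WoE and IV for a continuous target (assumed formula)."""
    table = pd.DataFrame({"x": x, "y": y}).groupby("x")["y"].agg(
        Count="count", Sum="sum", Mean="mean"
    )
    # Share of observations and share of the target sum in each bin
    count_share = table["Count"] / table["Count"].sum()
    sum_share = table["Sum"] / table["Sum"].sum()
    # Assumed definition; requires a positive target sum in every bin
    table["WoE"] = np.log(sum_share / count_share)
    table["IV"] = (sum_share - count_share) * table["WoE"]
    return table


bin_table = continuous_woe_iv(["a", "a", "b", "b"], [1.0, 1.0, 3.0, 3.0])
```

In this toy data both bins hold half of the observations, while bin "a" carries a quarter of the target sum and bin "b" three quarters, so "a" gets a negative WoE and "b" a positive one.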
- transform(data, var_name)
Transform dataset variables using stored transformation dictionary.
Transforms variable var_name from DataFrame data using optbin_dict and returns the encoded value of Weight of Evidence (WoE) or the mean dependent value from the corresponding bin (see the _optbin_create function). For any metric value other than “WoE”, the function transforms the variable using the “Mean” metric.
- Parameters:
- data : pandas DataFrame
DataFrame with variables to transform.
- var_name : str
Variable name (dtype = categorical) from data that will be transformed for corresponding metric from optbin_dict.
- Returns:
- transformed : float
Series representing transformed variable from category to float values.
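The lookup described for transform can be sketched as a dictionary mapping built from a bin table. The table below is a hypothetical example in the shape described for bin_table, and encode_with_bins is an illustrative helper, not the class method. How the real method treats categories missing from the table is not documented here; in this sketch they become NaN.

```python
import pandas as pd

# Hypothetical bin table: each row groups one or more categories
bin_table = pd.DataFrame({
    "Bin": [["a", "b"], ["c"]],
    "WoE": [-0.25, 0.40],
    "Mean": [10.0, 20.0],
})


def encode_with_bins(series, bin_table, metric="WoE"):
    """Replace each category with the metric value of the bin it belongs to."""
    # Any metric other than "WoE" falls back to the "Mean" column
    column = "WoE" if metric == "WoE" else "Mean"
    mapping = {
        category: value
        for categories, value in zip(bin_table["Bin"], bin_table[column])
        for category in categories
    }
    return series.map(mapping).astype(float)


values = pd.Series(["a", "c", "b", "z"])
encoded = encode_with_bins(values, bin_table)
encoded_mean = encode_with_bins(values, bin_table, metric="mean")
```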
- src.features.build_features.boxcox_transformation(data, var_name, transformation_dict, print_details=True)
Box-Cox variable transformation.
Transforms variable var_name from DataFrame data using the Box-Cox transformation (a power transform applied to the series after adding 1, a monotonic shift that avoids infinite results) in order to provide better predictive characteristics (stabilizes variance and makes the variable's distribution look “more like” a normal distribution). Also updates the given dictionary with the lambda values for transformed variables, which can be used for future scoring of new data. Returns the transformed variable.
- Parameters:
- data : pandas DataFrame
DataFrame to analyze.
- var_name : str
Variable name (dtype = numerical) from data that will be transformed.
- transformation_dict : dict
Dictionary storing lambda parameters for the Box-Cox transformations of the transformed variables.
- print_details : bool, default = True
Parameter controlling informative output. If set to False, the function will suppress the display of detailed information.
- Returns:
- boxcox_var : float
Series representing transformed variable using Box-Cox transformation.
Notes
Required libraries:
from scipy import stats
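A minimal sketch of the described behaviour, assuming scipy.stats.boxcox is what performs the transform. The helper boxcox_sketch and the sample column are hypothetical, and storing lambda under the variable name is an assumption about how the dictionary is keyed.

```python
import pandas as pd
from scipy import stats


def boxcox_sketch(data, var_name, transformation_dict):
    """Box-Cox transform data[var_name] + 1 and remember the fitted lambda."""
    # Adding 1 keeps zero values valid; inputs to boxcox must be positive
    shifted = data[var_name] + 1
    transformed, fitted_lambda = stats.boxcox(shifted.to_numpy())
    # Assumption: lambda is stored under the variable name
    transformation_dict[var_name] = fitted_lambda
    return pd.Series(transformed, index=data.index, name=var_name)


lambdas = {}
df = pd.DataFrame({"income": [0.0, 1.0, 3.0, 7.0, 15.0, 31.0]})
boxcox_var = boxcox_sketch(df, "income", lambdas)

# Scoring new data later reuses the stored lambda instead of refitting
new_values = pd.Series([2.0, 5.0]) + 1
scored = stats.boxcox(new_values.to_numpy(), lmbda=lambdas["income"])
```

Keeping lambda in the dictionary means new records are transformed with the same parameter that was fitted on the training data.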
References
Source materials: