Utils

Diagnostics

class src.utils.diagnostics.LinearDiagnostics(model_in, data_in, target, scoring_dict, model_list, significance_level=0.05)

Bases: object

Perform linear model diagnostics and statistical tests on a given sample.

For linear regression tasks. Given a significance level as a parameter, the class performs a number of tests and plots several diagrams:

Descriptive statistics of model residuals
Histogram of residuals
Kernel density plot of residuals
QQ plot of residuals
Jarque-Bera test for normality (residuals)
Shapiro-Wilk test for normality (residuals)
Anderson-Darling test for normality (residuals)
D'Agostino and Pearson's normality test (residuals) - stats.normaltest
Durbin-Watson test for autocorrelation (residuals)
Predicted value vs residual scatter plot
Predicted value vs true value scatter plot
Breusch-Pagan test for homoscedasticity (residuals)
Variance Inflation Factor (VIF) table
Correlation matrix of independent variables

Notes

Prerequisites:

model_list - dictionary {'model_name': {'model': class instance, 'variables': model variables set alias}}

scoring_dict - dictionary {'variables': [corresponding list of variables]}
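The two prerequisite dictionaries can be sketched as below. This is one reading of the note above, not a confirmed contract: it assumes that each key of scoring_dict is a variable-set alias and that the 'variables' entry of a model refers to one of those aliases. PlaceholderModel, the alias "base_vars", the model name "ols_base", the column names, and the helper columns_for are all hypothetical.

```python
class PlaceholderModel:
    """Hypothetical stand-in for a fitted regression model instance."""

    def predict(self, rows):
        # Returns a constant prediction; only here to make the sketch runnable.
        return [0.0 for _ in rows]


# Assumed reading: alias of a variable set -> list of column names.
scoring_dict = {"base_vars": ["age", "income"]}

# Model name -> model instance plus the alias of the variable set it uses.
model_list = {
    "ols_base": {"model": PlaceholderModel(), "variables": "base_vars"},
}


def columns_for(model_name):
    """Resolve the column list for a model via its variable-set alias."""
    alias = model_list[model_name]["variables"]
    return scoring_dict[alias]
```

Under this reading, columns_for("ols_base") yields the columns that would be selected from the data before predicting.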

Required libraries:

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.compat import lzip
from IPython.display import display, HTML
import statsmodels.api as sm
import scipy.stats as stats
import statsmodels.stats.stattools as smt
import statsmodels.stats.diagnostic as smd
import matplotlib.pyplot as plt
import seaborn as sn
import pandas as pd
import numpy as np
import pylab

Attributes:

model: str: Name of the model used for diagnostics.
data: str: Data used for model goodness of fit.
target: series: Series of target variable values. Its index must correspond to the index of data.
alpha: float: Probability of rejecting H0 when it is true (Type I error rate). Each statistical test compares its statistic with the critical value found at this level. Equal to the significance_level parameter.
predictions: pandas DataFrame: DataFrame with predicted values of the dependent variable.
results_df: pandas DataFrame: DataFrame with goodness of fit for each observation.

Methods

__init__(self, model_in, data_in, target, scoring_dict, model_list, significance_level=0.05)	Constructor method.
_residual_stats(self)	Print descriptive statistics of residuals.
_res_hist(self)	Display residuals histogram.
_res_kde(self)	Display residuals kernel density plot.
_jb_normal_test(self)	Perform Jarque-Bera test for normality of residuals.
_sw_normal_test(self)	Perform Shapiro-Wilk test for normality of residuals.
_ad_normal_test(self)	Perform Anderson-Darling test for normality of residuals.
_normal_test(self)	Perform D’Agostino normality test for residuals.
_dw_autocorr_test(self)	Perform Durbin-Watson test for autocorrelation of residuals.
_homoskedasticity_plot(self)	Display predicted values vs residuals plot.
_bp_test_homoskedasticity(self)	Perform Breusch-Pagan test for homoskedasticity.
_vif(self)	Display variance inflation factors (VIF).
_corr_matrix(self)	Display Pearson correlation matrix between variables.

_residual_stats()

Calculate descriptive statistics of residuals on the given dataset.

Calculates summary statistics of the residuals (true values - predicted values) of the dependent variable.

_res_hist()

Display residuals histogram.

Displays histogram of residuals (true values - predicted values).

_res_kde()

Display residuals kernel density plot.

Displays kernel density plot of residuals (true values - predicted values).

_res_qq()

Display residuals Q-Q plot.

Displays a quantile-quantile plot of residuals (true values - predicted values).

_jb_normal_test()

Perform Jarque-Bera test for normality of residuals.

References

Source materials:

Wiki <https://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test>
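A minimal sketch of a Jarque-Bera check, assuming the residuals are available as a one-dimensional array. It calls scipy.stats.jarque_bera directly; the helper name jarque_bera_check and the synthetic residuals are illustrative and are not taken from the class itself.

```python
import numpy as np
from scipy import stats


def jarque_bera_check(residuals, alpha=0.05):
    """Return (statistic, p-value, reject_normality) for the residuals."""
    result = stats.jarque_bera(np.asarray(residuals, dtype=float))
    statistic, pvalue = float(result[0]), float(result[1])
    # H0: residuals have the skewness and kurtosis of a normal distribution.
    return statistic, pvalue, bool(pvalue < alpha)


# Synthetic stand-in for model residuals (assumption, for illustration only).
rng = np.random.default_rng(0)
skewed_residuals = rng.exponential(scale=1.0, size=500)
jb_stat, jb_pvalue, jb_reject = jarque_bera_check(skewed_residuals)
```

The statistic is asymptotically chi-squared with two degrees of freedom, so the p-value is less reliable for small samples.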

_sw_normal_test()

Perform Shapiro-Wilk test for normality of residuals.

References

Source materials:

Wiki <https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test>
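A minimal sketch of a Shapiro-Wilk check using scipy.stats.shapiro, again assuming a one-dimensional array of residuals. The helper name shapiro_wilk_check and the synthetic data are illustrative only.

```python
import numpy as np
from scipy import stats


def shapiro_wilk_check(residuals, alpha=0.05):
    """Return (W statistic, p-value, reject_normality) for the residuals."""
    result = stats.shapiro(np.asarray(residuals, dtype=float))
    w_stat, pvalue = float(result[0]), float(result[1])
    # H0: the residuals were drawn from a normal distribution.
    return w_stat, pvalue, bool(pvalue < alpha)


# Synthetic, clearly non-normal stand-in for residuals (assumption).
rng = np.random.default_rng(1)
sw_stat, sw_pvalue, sw_reject = shapiro_wilk_check(rng.exponential(size=500))
```

W close to 1 is consistent with normality; smaller values indicate departure from it.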

_ad_normal_test()

Perform Anderson-Darling test for normality of residuals.

References

Source materials:

Wiki <https://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test>
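Unlike the other normality tests here, scipy.stats.anderson has traditionally reported critical values at fixed significance levels rather than a p-value. The sketch below assumes that behaviour and picks the tabulated level closest to alpha; the helper name anderson_darling_check and the synthetic data are illustrative only.

```python
import numpy as np
from scipy import stats


def anderson_darling_check(residuals, alpha=0.05):
    """Return (statistic, critical value, reject_normality)."""
    result = stats.anderson(np.asarray(residuals, dtype=float), dist="norm")
    # significance_level is given in percent, e.g. 15, 10, 5, 2.5, 1.
    levels = np.asarray(result.significance_level, dtype=float) / 100.0
    idx = int(np.argmin(np.abs(levels - alpha)))  # closest tabulated level
    critical = float(result.critical_values[idx])
    statistic = float(result.statistic)
    # H0 (normality) is rejected when the statistic exceeds the critical value.
    return statistic, critical, bool(statistic > critical)


# Synthetic, clearly non-normal stand-in for residuals (assumption).
rng = np.random.default_rng(2)
ad_stat, ad_critical, ad_reject = anderson_darling_check(rng.exponential(size=500))
```

If alpha is not one of the tabulated levels, the decision is made at the nearest one, which is an approximation.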

_normal_test()

Perform D’Agostino normality test.

References

Source materials:

Wiki <https://en.wikipedia.org/wiki/D%27Agostino%27s_K-squared_test>
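A minimal sketch of the D'Agostino and Pearson test via scipy.stats.normaltest, the function named in the class description. The helper name dagostino_check and the synthetic data are illustrative only.

```python
import numpy as np
from scipy import stats


def dagostino_check(residuals, alpha=0.05):
    """Return (K^2 statistic, p-value, reject_normality)."""
    result = stats.normaltest(np.asarray(residuals, dtype=float))
    k2, pvalue = float(result[0]), float(result[1])
    # H0: the residuals come from a normal distribution.
    return k2, pvalue, bool(pvalue < alpha)


# Synthetic, clearly non-normal stand-in for residuals (assumption).
rng = np.random.default_rng(3)
k2_stat, k2_pvalue, k2_reject = dagostino_check(rng.exponential(size=500))
```

The statistic combines a skewness test and a kurtosis test, so it needs a reasonably large sample to be meaningful.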

_dw_autocorr_test()

Perform Durbin-Watson test for autocorrelation.

References

Source materials:

Wiki <https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic>
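The Durbin-Watson statistic is simple enough to compute by hand, which makes its interpretation concrete. The sketch below implements the textbook formula with NumPy; the function name durbin_watson_stat is illustrative, and the class may instead rely on statsmodels.stats.stattools.durbin_watson.

```python
import numpy as np


def durbin_watson_stat(residuals):
    """Sum of squared successive differences over the sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))


# Perfectly alternating residuals: strong negative autocorrelation.
dw_alternating = durbin_watson_stat([1.0, -1.0, 1.0, -1.0])  # 12 / 4 = 3.0
# Constant residuals: strong positive autocorrelation.
dw_constant = durbin_watson_stat([1.0, 1.0, 1.0, 1.0])  # 0 / 4 = 0.0
```

The statistic lies between 0 and 4. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.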

_homoskedasticity_plot()

Plot predicted values against residual errors.

Visually check whether the residual variance is constant across predicted values (homoscedasticity).

_bp_test_homoskedasticity()

Perform Breusch-Pagan test for homoskedasticity.

References

Source materials:

Wiki <https://en.wikipedia.org/wiki/Breusch%E2%80%93Pagan_test>
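To show what the test measures, the sketch below hand-rolls the studentized (Koenker) form of the Breusch-Pagan statistic: squared residuals are regressed on the explanatory variables, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. This is an illustration under stated assumptions, not the class's own code; the function name breusch_pagan_lm and the synthetic data are hypothetical.

```python
import numpy as np
from scipy import stats


def breusch_pagan_lm(residuals, X):
    """Return (LM statistic, p-value) for H0: constant residual variance."""
    e2 = np.asarray(residuals, dtype=float) ** 2
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    # Auxiliary regression of squared residuals on an intercept plus X.
    Z = np.column_stack([np.ones(len(e2)), X])
    beta, *_ = np.linalg.lstsq(Z, e2, rcond=None)
    fitted = Z @ beta
    r2 = 1.0 - np.sum((e2 - fitted) ** 2) / np.sum((e2 - e2.mean()) ** 2)
    lm = len(e2) * r2
    dof = Z.shape[1] - 1  # number of regressors, excluding the intercept
    return float(lm), float(stats.chi2.sf(lm, dof))


# Synthetic residuals whose spread grows with x (heteroscedastic by design).
rng = np.random.default_rng(4)
x = np.linspace(1.0, 10.0, 400)
hetero_residuals = rng.normal(size=400) * x
bp_lm, bp_pvalue = breusch_pagan_lm(hetero_residuals, x)
bp_reject = bp_pvalue < 0.05
```

A small p-value rejects homoscedasticity, which is the pattern the predicted-value vs residual scatter plot is meant to reveal visually.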

_vif()

Display variance inflation factors (VIF).

Calculate variance inflation factors in dataset to check for multicollinearity.

References

Source materials:

Wiki <https://en.wikipedia.org/wiki/Variance_inflation_factor>
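The VIF of a variable is 1 / (1 - R^2), where R^2 comes from regressing that variable on all the others. The sketch below computes it with NumPy and includes an intercept in each auxiliary regression; that is a choice made for this illustration, and whether the class does the same is not stated. The function name vif_values and the synthetic columns are hypothetical.

```python
import numpy as np


def vif_values(X):
    """Return one VIF per column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on an intercept plus the remaining columns.
        Z = np.column_stack([np.ones(n_rows), others])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        vifs.append(float(1.0 / (1.0 - r2)))
    return vifs


rng = np.random.default_rng(5)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + 0.01 * rng.normal(size=300)  # nearly a copy of x1

vif_independent = vif_values(np.column_stack([x1, x2]))
vif_collinear = vif_values(np.column_stack([x1, x2, x3]))
```

A common rule of thumb treats values above 5 or 10 as a sign of problematic multicollinearity. A perfectly collinear column would make 1 - R^2 zero, so real code should guard against that case.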

_corr_matrix()

Display Pearson correlation matrix between variables.

Calculates and displays the matrix of pairwise Pearson correlations between the variables in the dataset.