Utils

Diagnostics

class src.utils.diagnostics.LinearDiagnostics(model_in, data_in, target, scoring_dict, model_list, significance_level=0.05)

Bases: object

Perform linear model diagnostics and statistical tests on a given sample.

Intended for linear regression tasks. Given a significance level as a parameter, the class performs a number of tests and plots several diagrams:

  • Descriptive statistics of model residuals

  • Histogram of residuals

  • Kernel density plot of residuals

  • QQ plot of residuals

  • Jarque-Bera test for normality (residuals)

  • Shapiro-Wilk test for normality (residuals)

  • Anderson-Darling test for normality (residuals)

  • D'Agostino and Pearson's normality test (residuals) - stats.normaltest

  • Durbin-Watson test for autocorrelation (residuals)

  • Predicted value vs residual scatter plot

  • Predicted value vs true value scatter plot

  • Breusch-Pagan test for homoscedasticity (residuals)

  • Variance Inflation Factor (VIF) table

  • Correlation matrix of independent variables

Notes

Prerequisites:

model_list - dictionary {'model_name': {'model': class instance, 'variables': model variables set alias}}

scoring_dict - dictionary {'variables': [corresponding list of variables]}
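
The shape of the two dictionaries can be sketched as follows. This is only an illustration of one reading of the description above: the names DummyModel, ols_base, base_set, and the column names are hypothetical, and the assumption that the model object exposes predict() is not confirmed by this documentation.

```python
# Hypothetical stand-in for a fitted regression model (assumed to expose predict()).
class DummyModel:
    def predict(self, X):
        return [0.0 for _ in X]

# scoring_dict maps a variable-set alias to the list of columns in that set.
scoring_dict = {"base_set": ["age", "income"]}

# model_list maps a model name to the model instance and its variable-set alias.
model_list = {"ols_base": {"model": DummyModel(), "variables": "base_set"}}

# Resolve the columns used by a given model through its alias.
alias = model_list["ols_base"]["variables"]
columns = scoring_dict[alias]
```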

Required libraries:

  • from statsmodels.stats.outliers_influence import variance_inflation_factor

  • from statsmodels.compat import lzip

  • from IPython.display import display, HTML

  • import statsmodels as sm

  • import scipy.stats as stats

  • import statsmodels.stats.stattools as smt

  • import statsmodels.stats.diagnostic as smd

  • import matplotlib as plt

  • import seaborn as sn

  • import pandas as pd

  • import numpy as np

  • import pylab

Attributes:
model: str

Name of model used for diagnostics.

data: str

Data used to assess model goodness of fit.

target: series

Series of target variable values. Its index must correspond to the index of data.

alpha: float

Significance level, i.e. the probability of rejecting H0 when it is true. Used as a parameter for the statistical tests: the critical value of each test is determined from this level and compared with the test statistic. Equal to the significance_level parameter.

predictions: pandas DataFrame

DataFrame with predicted values of dependent variable.

results_df: pandas DataFrame

DataFrame with goodness of fit for each observation.

Methods

__init__(self, model_in, data_in, target, scoring_dict, model_list, significance_level=0.05)

Constructor method.

_residual_stats(self)

Print descriptive statistics of residuals.

_res_hist(self)

Display residuals histogram.

_res_kde(self)

Display residuals kernel density plot.

_res_qq(self)

Display residuals QQ plot.

_jb_normal_test(self)

Perform Jarque-Bera test for normality of residuals.

_sw_normal_test(self)

Perform Shapiro-Wilk test for normality of residuals.

_ad_normal_test(self)

Perform Anderson-Darling test for normality of residuals.

_normal_test(self)

Perform D’Agostino normality test for residuals.

_dw_autocorr_test(self)

Perform Durbin-Watson test for autocorrelation of residuals.

_homoskedasticity_plot(self)

Display predicted values vs residuals plot.

_bp_test_homoskedasticity(self)

Perform Breusch-Pagan test for homoskedasticity.

_vif(self)

Display variance inflation factors (VIF).

_corr_matrix(self)

Display Pearson correlation matrix between variables.

_residual_stats()

Calculate descriptive statistics of residuals on the given dataset.

Calculates summary statistics of the residuals (true values - predicted values) of the dependent variable.

_res_hist()

Display residuals histogram.

Displays histogram of residuals (true values - predicted values).

_res_kde()

Display residuals kernel density plot.

Displays kernel density plot of residuals (true values - predicted values).

_res_qq()

Display residuals QQ plot.

Displays a quantile-quantile plot of residuals (true values - predicted values).

_jb_normal_test()

Perform Jarque-Bera test for normality of residuals.

References

Source materials:

  1. Wiki <https://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test>
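
A minimal sketch of how such a test can be run with scipy.stats.jarque_bera, assuming synthetic residuals and a significance level of 0.05. It is not necessarily how this method is implemented; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(size=500)  # synthetic residuals, for illustration only

alpha = 0.05  # plays the role of significance_level
jb_stat, jb_pvalue = stats.jarque_bera(residuals)

# H0: residuals are normally distributed; reject when the p-value is below alpha.
reject_normality = jb_pvalue < alpha
```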

_sw_normal_test()

Perform Shapiro-Wilk test for normality of residuals.

References

Source materials:

  1. Wiki <https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test>
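
As an illustration, the same kind of check can be sketched with scipy.stats.shapiro. The residuals are synthetic and the variable names are assumptions, not part of this class.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(size=200)  # synthetic residuals, for illustration only

alpha = 0.05  # plays the role of significance_level
w_stat, sw_pvalue = stats.shapiro(residuals)

# H0: residuals come from a normal distribution.
reject_normality = sw_pvalue < alpha
```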

_ad_normal_test()

Perform Anderson-Darling test for normality of residuals.

References

Source materials:

  1. Wiki <https://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test>
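
Unlike the tests above, the classic Anderson-Darling result from scipy.stats.anderson is a statistic plus a table of critical values, so the decision is made by comparing the statistic with the critical value at the chosen level. The sketch below assumes synthetic residuals and that the 5% level is present in the returned table; it is not necessarily how this method is implemented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
residuals = rng.normal(size=200)  # synthetic residuals, for illustration only

alpha = 0.05  # plays the role of significance_level
result = stats.anderson(residuals, dist="norm")

# significance_level is reported in percent; pick the entry closest to alpha.
levels = np.asarray(result.significance_level)
idx = int(np.argmin(np.abs(levels - alpha * 100)))
critical_value = result.critical_values[idx]

# H0 (normality) is rejected when the statistic exceeds the critical value.
reject_normality = result.statistic > critical_value
```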

_normal_test()

Perform D’Agostino normality test.

References

Source materials:

  1. Wiki <https://en.wikipedia.org/wiki/D%27Agostino%27s_K-squared_test>
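
The class overview names stats.normaltest for this check. A minimal sketch of calling it, with synthetic residuals and illustrative variable names:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
residuals = rng.normal(size=300)  # synthetic residuals, for illustration only

alpha = 0.05  # plays the role of significance_level
k2_stat, k2_pvalue = stats.normaltest(residuals)

# H0: residuals come from a normal distribution (combined skewness/kurtosis test).
reject_normality = k2_pvalue < alpha
```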

_dw_autocorr_test()

Perform Durbin-Watson test for autocorrelation.

References

Source materials:

  1. Wiki <https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic>
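
For reference, the Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by the sum of squared residuals. The sketch below computes it directly with NumPy; the function name durbin_watson_stat is hypothetical, and the class itself may rely on statsmodels instead.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Return sum((e_t - e_{t-1})^2) / sum(e_t^2) for a 1-D residual sequence."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Perfectly alternating residuals show strong negative autocorrelation.
dw_alternating = durbin_watson_stat([1, -1, 1, -1])
# Constant residuals show strong positive autocorrelation.
dw_constant = durbin_watson_stat([1, 1, 1, 1])
```

Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.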

_homoskedasticity_plot()

Plot predicted values against residual errors.

Visually check for possible heteroscedasticity.

_bp_test_homoskedasticity()

Perform Breusch-Pagan test for homoskedasticity.

References

Source materials:

  1. Wiki <https://en.wikipedia.org/wiki/Breusch%E2%80%93Pagan_test>
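
To make the idea concrete, the sketch below computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand: regress the squared residuals on the explanatory variables and use n times the R-squared of that auxiliary regression, compared against a chi-squared distribution. The function name breusch_pagan_lm is hypothetical, and this is not necessarily the variant the class uses.

```python
import numpy as np
from scipy import stats

def breusch_pagan_lm(residuals, exog):
    """Return (LM statistic, p-value) for the studentized Breusch-Pagan test."""
    e2 = np.asarray(residuals, dtype=float) ** 2
    # Auxiliary design matrix: intercept plus the explanatory variables.
    X = np.column_stack([np.ones(len(e2)), np.asarray(exog, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, e2, rcond=None)
    ss_res = np.sum((e2 - X @ beta) ** 2)
    ss_tot = np.sum((e2 - e2.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    lm = len(e2) * r2
    df = X.shape[1] - 1  # number of regressors, excluding the intercept
    return float(lm), float(stats.chi2.sf(lm, df))

# Toy case where the squared residuals are exactly linear in x.
x = np.array([1.0, 2.0, 3.0, 4.0])
lm_stat, lm_pvalue = breusch_pagan_lm(np.sqrt(x), x)
```

A small p-value is evidence against H0 (constant residual variance).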

_vif()

Display variance inflation factors (VIF).

Calculate variance inflation factors in dataset to check for multicollinearity.

References

Source materials:

  1. Wiki <https://en.wikipedia.org/wiki/Variance_inflation_factor>
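
The VIF of a variable is 1 / (1 - R^2), where R^2 comes from regressing that variable on all the others. The sketch below computes it with NumPy. The function name vif_values is hypothetical, and it adds an intercept to every auxiliary regression, which is an assumption: statsmodels' variance_inflation_factor uses the design matrix exactly as given.

```python
import numpy as np

def vif_values(X):
    """Return the VIF of each column of a 2-D array of explanatory variables."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        vifs.append(float(1.0 / (1.0 - r2)))
    return vifs

# Uncorrelated columns give VIF = 1; correlated columns give VIF > 1.
vif_orthogonal = vif_values([[1, 1], [1, -1], [-1, 1], [-1, -1]])
vif_correlated = vif_values([[1, 1], [2, 2], [3, 3], [4, 4], [5, 6]])
```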

_corr_matrix()

Display Pearson correlation matrix between variables.

Calculates and displays the matrix of correlations between variables in the dataset. Correlation metric: Pearson correlation.