Robust Linear Models

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
/usr/lib/python3/dist-packages/numpy/core/getlimits.py:214: RuntimeWarning: overflow encountered in nextafter
  if hasattr(umath, 'nextafter')  # Missing on some platforms?
/ws/builds/jenkins/ws/du3/components/statsmodels/build/statsmodels-0.8.0/.pybuild/cpython3_3.7_statsmodels/build/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

Estimation

Load data:

In [2]:
data = sm.datasets.stackloss.load()
data.exog = sm.add_constant(data.exog)
/ws/builds/jenkins/ws/du3/components/statsmodels/build/statsmodels-0.8.0/.pybuild/cpython3_3.7_statsmodels/build/statsmodels/datasets/utils.py:100: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
  exog = np.column_stack(data[field] for field in exog_name)

Huber's T norm with the (default) median absolute deviation scaling

In [3]:
huber_t = sm.RLM(data.endog, data.exog, M=sm.robust.norms.HuberT())
hub_results = huber_t.fit()
print(hub_results.params)
print(hub_results.bse)
print(hub_results.summary(yname='y',
            xname=['var_%d' % i for i in range(len(hub_results.params))]))
The history saving thread hit an unexpected error (OperationalError('disk I/O error')).History will not be written to the database.
[-41.02649835   0.82938433   0.92606597  -0.12784672]
[9.79189854 0.11100521 0.30293016 0.12864961]
                    Robust linear Model Regression Results                    
==============================================================================
Dep. Variable:                      y   No. Observations:                   21
Model:                            RLM   Df Residuals:                       17
Method:                          IRLS   Df Model:                            3
Norm:                          HuberT                                         
Scale Est.:                       mad                                         
Cov Type:                          H1                                         
Date:                Sat, 04 Jul 2020                                         
Time:                        15:13:20                                         
No. Iterations:                    19                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
var_0        -41.0265      9.792     -4.190      0.000     -60.218     -21.835
var_1          0.8294      0.111      7.472      0.000       0.612       1.047
var_2          0.9261      0.303      3.057      0.002       0.332       1.520
var_3         -0.1278      0.129     -0.994      0.320      -0.380       0.124
==============================================================================

If the model instance has been used for another fit with different fit
parameters, then the fit options might not be the correct ones anymore .

Huber's T norm with 'H2' covariance matrix

In [4]:
hub_results2 = huber_t.fit(cov="H2")
print(hub_results2.params)
print(hub_results2.bse)
[-41.02649835   0.82938433   0.92606597  -0.12784672]
[9.08950419 0.11945975 0.32235497 0.11796313]

Andrew's Wave norm with Huber's Proposal 2 scaling and 'H3' covariance matrix

In [5]:
andrew_mod = sm.RLM(data.endog, data.exog, M=sm.robust.norms.AndrewWave())
andrew_results = andrew_mod.fit(scale_est=sm.robust.scale.HuberScale(), cov="H3")
print('Parameters: ', andrew_results.params)
Parameters:  [-40.8817957    0.79276138   1.04857556  -0.13360865]

See help(sm.RLM.fit) for more options and module sm.robust.scale for scale options

Comparing OLS and RLM

Artificial data with outliers:

In [6]:
nsample = 50
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, (x1-5)**2))
X = sm.add_constant(X)
sig = 0.3   # smaller error variance makes OLS<->RLM contrast bigger
beta = [5, 0.5, -0.0]
y_true2 = np.dot(X, beta)
y2 = y_true2 + sig*1. * np.random.normal(size=nsample)
y2[[39,41,43,45,48]] -= 5   # add some outliers (10% of nsample)

Example 1: quadratic function with linear truth

Note that the quadratic term in OLS regression will capture outlier effects.

In [7]:
res = sm.OLS(y2, X).fit()
print(res.params)
print(res.bse)
print(res.predict())
[ 5.00091867  0.53463384 -0.01451132]
[0.44613222 0.06887679 0.00609453]
[ 4.63813579  4.91316599  5.18336111  5.44872114  5.70924607  5.96493592
  6.21579067  6.46181033  6.7029949   6.93934439  7.17085878  7.39753808
  7.61938229  7.83639141  8.04856544  8.25590438  8.45840823  8.65607699
  8.84891065  9.03690923  9.22007272  9.39840111  9.57189442  9.74055263
  9.90437576 10.06336379 10.21751674 10.36683459 10.51131735 10.65096502
 10.7857776  10.91575509 11.04089749 11.1612048  11.27667702 11.38731415
 11.49311619 11.59408314 11.69021499 11.78151176 11.86797344 11.94960002
 12.02639152 12.09834792 12.16546924 12.22775546 12.28520659 12.33782264
 12.38560359 12.42854945]

Estimate RLM:

In [8]:
resrlm = sm.RLM(y2, X).fit()
print(resrlm.params)
print(resrlm.bse)
[ 4.93735450e+00  5.19419165e-01 -3.83274503e-03]
[0.11632586 0.01795914 0.00158911]

Draw a plot to compare OLS estimates to the robust estimates:

In [9]:
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
ax.plot(x1, y2, 'o',label="data")
ax.plot(x1, y_true2, 'b-', label="True")
prstd, iv_l, iv_u = wls_prediction_std(res)
ax.plot(x1, res.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm.fittedvalues, 'g.-', label="RLM")
ax.legend(loc="best")
Out[9]:
<matplotlib.legend.Legend at 0xfffffc7fe9a8d048>

Example 2: linear function with linear truth

Fit a new OLS model using only the linear term and the constant:

In [10]:
X2 = X[:,[0,1]] 
res2 = sm.OLS(y2, X2).fit()
print(res2.params)
print(res2.bse)
[5.58581353 0.38952068]
[0.39010658 0.03361316]

Estimate RLM:

In [11]:
resrlm2 = sm.RLM(y2, X2).fit()
print(resrlm2.params)
print(resrlm2.bse)
[5.06604012 0.48431758]
[0.09426183 0.00812198]

Draw a plot to compare OLS estimates to the robust estimates:

In [12]:
prstd, iv_l, iv_u = wls_prediction_std(res2)

fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x1, y2, 'o', label="data")
ax.plot(x1, y_true2, 'b-', label="True")
ax.plot(x1, res2.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm2.fittedvalues, 'g.-', label="RLM")
legend = ax.legend(loc="best")