Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
# Design matrix: linear trend, sine term, and shifted quadratic, plus a constant
X = np.column_stack((x1, np.sin(x1), (x1 - 5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
# Observations: true mean plus Gaussian noise with standard deviation sig
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.986
Model:                            OLS   Adj. R-squared:                  0.985
Method:                 Least Squares   F-statistic:                     1059.
Date:                Sat, 04 Jul 2020   Prob (F-statistic):           1.96e-42
Time:                        15:13:34   Log-Likelihood:                 4.7845
No. Observations:                  50   AIC:                            -1.569
Df Residuals:                      46   BIC:                             6.079
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0348      0.078     64.435      0.000       4.878       5.192
x1             0.5083      0.012     42.178      0.000       0.484       0.533
x2             0.5354      0.047     11.301      0.000       0.440       0.631
x3            -0.0214      0.001    -20.219      0.000      -0.024      -0.019
==============================================================================
Omnibus:                        4.540   Durbin-Watson:                   2.197
Prob(Omnibus):                  0.103   Jarque-Bera (JB):                3.432
Skew:                           0.583   Prob(JB):                        0.180
Kurtosis:                       3.538   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
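
The quantities in the summary table are also available programmatically on the results object; a minimal sketch using standard attributes of the statsmodels results API:

print(olsres.params)    # estimated coefficients
print(olsres.bse)       # standard errors of the coefficients
print(olsres.rsquared)  # R-squared of the fit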

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.5000028   5.00372461  5.46540432  5.85586423  6.15645669  6.36212792
  6.48224829  6.5390728   6.56408466  6.59282267  6.65904232  6.78916983
  6.99796041  7.28607404  7.639967    8.03411709  8.43521685  8.80764587
  9.11932316  9.3469779   9.47997284  9.52205193  9.4907252   9.41439192
  9.32767469  9.26573103  9.25847711  9.32567259  9.47367418  9.69439125
  9.96661509 10.25950344 10.53764781 10.76689095 10.91993884 10.98084497
 10.94763035 10.83260862 10.66036299 10.46370725 10.27829456 10.13675979
 10.06335762 10.06997763 10.15419086 10.29964885 10.4787697  10.65726965
 10.79979583 10.87573462]
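
For in-sample points, predicting on the original design matrix is equivalent to the fitted values stored on the results object; a quick sanity check:

print(np.allclose(ypred, olsres.fittedvalues))  # True: predict(X) matches fittedvalues in-sample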

Create a new sample of explanatory variables, Xnew, predict and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.84853789 10.6799805  10.39105802 10.02961666  9.6586389   9.34082315
  9.12323301  9.02577442  9.03632204  9.11368794]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I() to indicate use of the identity transform; i.e., we don't want any expansion magic from using **2 inside the formula.
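
To see the difference, compare the design matrices built with and without I() (a minimal sketch using patsy, the formula engine behind statsmodels' formula API). In the formula language, ** denotes interaction expansion, so a bare x1**2 collapses to just x1:

from patsy import dmatrix

demo = {"x1": np.array([1., 2., 3.])}
print(dmatrix("x1**2", demo))     # formula expansion: only an intercept and x1
print(dmatrix("I(x1**2)", demo))  # identity transform: x1 squared numerically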

In [8]:
res.params
Out[8]:
Intercept           5.034834
x1                  0.508280
np.sin(x1)          0.535380
I((x1 - 5) ** 2)   -0.021393
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.848538
1    10.679981
2    10.391058
3    10.029617
4     9.658639
5     9.340823
6     9.123233
7     9.025774
8     9.036322
9     9.113688
dtype: float64
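
To also quantify the uncertainty of these out-of-sample predictions, the results object offers get_prediction (statsmodels 0.8+), whose summary frame includes standard errors along with confidence and prediction intervals; a minimal sketch:

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))  # mean, mean_se, and 95% confidence/prediction bounds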