User-Defined Data Transformations

Step function changes in the workload or other known changes can often introduce problems into linear regression forecasts. The following table illustrates this type of problem by presenting monthly observations of the average number of jobs per hour submitted by an engineering group. In April, three more engineers were added to the group, resulting in a step function increase of approximately 10 jobs per hour.
micsrm140cd
Step function changes in the workload or other known changes can often introduce problems into linear regression forecasts. The following table illustrates this type of problem by presenting monthly observations of the average number of jobs per hour submitted by an engineering group. In April, three more engineers were added to the group, resulting in a step function increase of approximately 10 jobs per hour.
                        Observation      Jobs              Month       Number         Hour             =======    ===========     ======              JAN98          1             19              FEB98          2             25              MAR98          3             23              APR98          4             34              MAY98          5             37              JUN98          6             40
Figure 7-7 shows a scatter plot of the data. A linear regression model developed for this historical data series has the following parameters:
    n  =      6, the number of historical observations       b  =   14.5, the y intercept       m  =    4.3, the slope of the line        2     r  =   0.92, the coefficient of determination       F  =    5.1, the F value       p  =   .003, the probability that we should reject the                  hypothesis       s  =    2.7, the standard error      e
The following table shows the predicted and residual values for the historical data series:
             Observation      Jobs       Est.      Residual    Month       Number         Hour       Jobs      (Error)   =======    ===========     ======     ======     ========    JAN98          1             19       18.8         0.2    FEB98          2             25       23.2         1.8    MAR98          3             23       27.5         4.5    APR98          4             34       31.8         2.7    MAY98          5             37       36.2        -6.9    JUN98          6             40       40.5       -21.5
Although the r-squared of 0.92 for the model is very good, the results are suspicious due to the step function increase that occurred in the workload. You can modify the historical series to account for the step function increase of 10 jobs per hour in April. There are two ways to approach the problem:
  • Subtracting 10 from the April through June observations.
  • Adding 10 to the January through March observations.
You would probably want to modify the January through March observations as you would expect the future observations to be more like the April through June sales. Therefore, you would transform the historical observations using the following equation:
    x(j) = x(j)+10, for j<=3                         (Eqn 13)       x(j) = x(j),    for j>3
Univariate Model Forecasting allows you to specify SAS statements to accomplish these transformations. The following table shows the result of this transformation:
                       Observation   Transformed              Month       Number         Jobs             =======    ===========     ======              JAN98          1            29              FEB98          2            35              MAR98          3            33              APR98          4            34              MAY98          5            37              JUN98          6            40
Using the transformed observations, a second linear model was developed. The parameters of this model are shown below:
    n  =      6, the number of historical observations       b  =   28.5, the y intercept       m  =   1.77, the slope of the line        2     r  =   0.79, the coefficient of determination       F  =  15.23, the F value       p  =   0.01, the probability that the hypothesis should                  be  rejected       s  =   1.92, the standard error      e
The following table shows the predicted and residual values developed using this model:
                                        Est                Obs    Jobs    Trans.   Jobs    Residual       Month     #     Hour     Jobs     Hour   (error)      =======   ===   ======   ======   =====   ========       JAN98     1      19       29      30.2     -1.2       FEB98     2      25       35      32.0      3.0       MAR98     3      23       33      33.8     -0.8       APR98     4      34       34      35.6     -1.6       MAY98     5      37       37      37.3     -0.3       JUN98     6      40       40      39.1      0.9
Note that the r-squared and F values for the resulting model, while acceptable, are poorer than the values calculated for the original historical data series. However, the statistical model proposed is much more reasonable because it accounts for a significant step that occurred in the data. Always examine each model to determine whether the proposed model is reasonable for the data being modeled.
Figure 7-7. Engineering Jobs Per Hour
                                            ENGINEERING JOBS PER HOUR                |            |       40.0 +                                                                                                 *            |            |            |       37.5 +            |                                                                              *            |            | J     35.0 + O          | B          |                                                           * S          |       32.5 + /          |            | H          | O     30.0 + U          | R          |            |       27.5 +            |            |            |       25.0 +                     *            |            |            |                                        *       22.5 +            |            |            |       20.0 +            |            |  *            |       17.5 +            |            ---+------------------+------------------+------------------+------------------+------------------+--               1                  2                  3                  4                  5                  6                                                    OBSERVATION NUMBER