Deleting Errant Observations

The following table shows average CPU utilizations that are collected for a 3090-200J processor, and illustrates the problems that often occur with historical observations:
              Week     Observation     %  CPU
             Ending      Number         BUSY
             =======    ===========     ======
             31OCT97         1           71.0
             07NOV97         2           72.0
             14NOV97         3           72.2
             21NOV97         4           73.8
             28NOV97         5           62.5
             05DEC97         6           74.0
             12DEC97         7           75.2
             19DEC97         8           75.0
             26DEC97         9           53.7
             02JAN98        10           61.0
             09JAN98        11           76.4
             16JAN98        12           78.0
Figure 7-4 shows a scatter plot of the data. A linear regression model that is developed for this historical CPU utilization data has the following parameters:
    n  =     12, the number of historical observations
    b  =  62.70, the y intercept
    m  =   1.17, the slope of the line
     2
    r  =   0.25, the coefficient of determination
    F  =   0.02, the F value
    p  =   0.90, the probability of an F value this large if
                 there is no linear trend
    s  =   6.68, the standard error
     e
The predicted and residual values for the historical data series are shown in the following table:
      Week     Observation     %  CPU      Est      Residual
     Ending      Number         BUSY      % CPU     (error)
     =======    ===========     ======     =====     ========
     31OCT97         1           71.0       63.9        7.1
     07NOV97         2           72.0       65.1        6.9
     14NOV97         3           72.2       66.2        6.0
     21NOV97         4           73.8       67.4        6.4
     28NOV97         5           62.5       68.6       -6.1
     05DEC97         6           74.0       69.8        4.2
     12DEC97         7           75.2       70.9        4.3
     19DEC97         8           75.0       72.1        2.9
     26DEC97         9           53.7       73.3      -19.6
     02JAN98        10           61.0       74.5      -13.5
     09JAN98        11           76.4       75.6        0.8
     16JAN98        12           78.0       76.8        1.2
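The estimate and residual columns can be approximated with a few lines of Python. This is only a sketch: the names estimate and residuals are illustrative, and it assumes that the residual is the observed value minus the estimated value, which is the convention the first row of the table follows (71.0 - 63.9 = 7.1). Because it uses the rounded parameters reported for the model (b = 62.70, m = 1.17), results can differ from the table in the last digit.

```python
# Illustrative sketch: the names below are hypothetical, not part of any product.
B = 62.70  # y intercept, as reported for the 12-week model
M = 1.17   # slope, as reported for the 12-week model

# Observed % CPU busy for observation numbers 1 through 12.
cpu_busy = [71.0, 72.0, 72.2, 73.8, 62.5, 74.0,
            75.2, 75.0, 53.7, 61.0, 76.4, 78.0]


def estimate(obs_number):
    """Estimated % CPU busy from the straight-line model y = b + m*x."""
    return B + M * obs_number


def residuals(values):
    """Residual (observed minus estimated) for each observation, in order."""
    return [y - estimate(x) for x, y in enumerate(values, start=1)]


res = residuals(cpu_busy)
for x, (y, r) in enumerate(zip(cpu_busy, res), start=1):
    print(f"{x:2d}  {y:5.1f}  {estimate(x):5.1f}  {r:6.1f}")
```

The only negative residuals belong to observations 5, 9, and 10.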
As the model parameters and the residual values above show, the proposed model fits the historical data very poorly. Often, such problems are caused by poorly behaved historical data rather than by the type of model that the analyst selects. In this example, three observations in the historical data (28NOV97, 26DEC97, and 02JAN98) differ significantly from the remaining data points. Investigation reveals that these three weeks are holiday weeks, which leaves two alternatives:
  • Compensating for the errant historical data points. For example, you could attempt to compensate for the missing data by multiplying the holiday-week values by some constant. Unfortunately, any such constant is a guess made by the analyst. Therefore, we do not recommend that you compensate historical data.
  • Deleting the errant historical data points. Although this reduces the number of points available for developing the model, it does not introduce any of the analyst's biases into the modeling process and is statistically defensible, because these weeks really do represent a different category of work for the processor.
Deleting the historical observations for the holiday weeks results in a substantially better model. The parameters of the new model are shown below:
    n  =      9, the number of observations
    b  =  70.78, the y intercept
    m  =   0.57, the slope of the line
     2
    r  =   0.93, the coefficient of determination
    F  =    162, the F value
    p  = 0.0001, the probability of an F value this large if
                 there is no linear trend
    s  =   0.64, the standard error
     e
The following table shows the predicted and residual values for the model that is developed from the historical series with the three holiday weeks deleted. A period (.) marks the values that are missing for the deleted weeks.
      Week     Observation     %  CPU      Est      Residual
     Ending      Number         BUSY      % CPU     (error)
     =======    ===========     ======     =====     ========
     31OCT97         1           71.0       71.4       -0.4
     07NOV97         2           72.0       71.9        0.1
     14NOV97         3           72.2       72.5       -0.3
     21NOV97         4           73.8       73.0        0.8
     28NOV97         5            .         73.6        .
     05DEC97         6           74.0       74.2       -0.2
     12DEC97         7           75.2       74.7        0.5
     19DEC97         8           75.0       75.3       -0.3
     26DEC97         9            .         75.9        .
     02JAN98        10            .         76.4        .
     09JAN98        11           76.4       77.0       -0.6
     16JAN98        12           78.0       77.6        0.4
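The deletion step can be sketched in Python as follows. This is a minimal illustration rather than the procedure the product itself runs: it assumes an ordinary least-squares straight-line fit, and the names fit_line, holiday_weeks, and kept are hypothetical. The holiday weeks are dropped from the fit, but the remaining weeks keep their original observation numbers so that the time axis is unchanged.

```python
# Illustrative sketch: function and variable names are hypothetical.
weeks = ["31OCT97", "07NOV97", "14NOV97", "21NOV97", "28NOV97", "05DEC97",
         "12DEC97", "19DEC97", "26DEC97", "02JAN98", "09JAN98", "16JAN98"]
cpu_busy = [71.0, 72.0, 72.2, 73.8, 62.5, 74.0,
            75.2, 75.0, 53.7, 61.0, 76.4, 78.0]
holiday_weeks = {"28NOV97", "26DEC97", "02JAN98"}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = b + m*x; returns (b, m)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_mean - m * x_mean
    return b, m


# Drop the holiday weeks, keeping the original observation numbers.
kept = [(x, y)
        for x, (week, y) in enumerate(zip(weeks, cpu_busy), start=1)
        if week not in holiday_weeks]
xs = [x for x, _ in kept]
ys = [y for _, y in kept]

b, m = fit_line(xs, ys)
print(f"n = {len(xs)}, b = {b:.2f}, m = {m:.2f}")

# Estimates are still produced for every week, including the deleted ones.
for x, week in enumerate(weeks, start=1):
    print(week, f"{b + m * x:.1f}")
```

The first line printed is n = 9, b = 70.78, m = 0.57. The per-week estimates can differ from the table in the last digit because of rounding.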
The model developed after the three holiday weeks were deleted fits the historical data far better than the original model: the coefficient of determination rises from 0.25 to 0.93, and the standard error falls from 6.68 to 0.64. This example shows the value of deleting errant historical data points.
Note:
The WEEKS timespan is probably the better choice for building models. With the MONTHS timespan, there are often too few observations for deletion to be a practical alternative.
Figure 7-4. Weekly CPU Utilizations
                                               HOLIDAY CPU DATA
           |
           |
        81 +
           |
           |
           |
        78 +                                                                                                   *
           |
           |                                                                                          *
           |
        75 +                                                      *        *
           |                                             *
           |                           *
           |
        72 +         *        *
           |*
 %         |
           |
        69 +
 C         |
 P         |
 U         |
        66 +
           |
 B         |
 U         |
 S      63 +
 Y         |                                    *
           |
           |                                                                                 *
        60 +
           |
           |
           |
        57 +
           |
           |
           |
        54 +                                                                        *
           |
           -+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
            1        2        3        4        5        6        7        8        9       10       11       12
                                                   OBSERVATION NUMBER