Deleting Errant Observations
Example showing average CPU utilizations collected for a 3090-200J processor, and the problems that can occur with historical observations.
The following table shows average CPU utilizations that are collected for a 3090-200J processor, and illustrates the problems that often occur with historical observations:
    Week     Observation   % CPU
    Ending     Number       BUSY
    =======  ===========  ======
    31OCT97        1        71.0
    07NOV97        2        72.0
    14NOV97        3        72.2
    21NOV97        4        73.8
    28NOV97        5        62.5
    05DEC97        6        74.0
    12DEC97        7        75.2
    19DEC97        8        75.0
    26DEC97        9        53.7
    02JAN98       10        61.0
    09JAN98       11        76.4
    16JAN98       12        78.0
Figure 7-4 shows a scatter plot of the data. A linear regression model that is developed for this historical CPU utilization data has the following parameters:
    n   = 12,    the number of historical observations
    b   = 62.70, the y intercept
    m   = 1.17,  the slope of the line
    r^2 = 0.25,  the coefficient of determination
    F   = 0.02,  the F value
    p   = 0.90,  the probability of an F value at least this large if the true slope were zero
    s_e = 6.68,  the standard error
The predicted and residual values for the historical data series are shown in the following table:
    Week     Observation   % CPU    Est    Residual
    Ending     Number       BUSY   % CPU   (error)
    =======  ===========  ======   =====   ========
    31OCT97        1        71.0    63.9      7.1
    07NOV97        2        72.0    65.1      6.9
    14NOV97        3        72.2    66.2      6.0
    21NOV97        4        73.8    67.4      6.4
    28NOV97        5        62.5    68.6     -6.1
    05DEC97        6        74.0    69.8      4.2
    12DEC97        7        75.2    70.9      4.3
    19DEC97        8        75.0    72.1      2.9
    26DEC97        9        53.7    73.3    -19.6
    02JAN98       10        61.0    74.5    -13.5
    09JAN98       11        76.4    75.6      0.8
    16JAN98       12        78.0    76.8      1.2
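The estimate and residual columns follow from the model parameters: each estimate is b + m times the observation number, and each residual is the observed value minus the estimate. The following is a minimal Python sketch of that arithmetic. It uses the rounded parameters published in this section, so its results can differ from the table by about 0.1. The names `estimate` and `residuals` are illustrative and do not come from the original.

```python
# Weekly % CPU BUSY for observations 1..12, copied from the table above.
cpu_busy = [71.0, 72.0, 72.2, 73.8, 62.5, 74.0,
            75.2, 75.0, 53.7, 61.0, 76.4, 78.0]

# Rounded parameters of the 12-point model as published in this section.
B = 62.70  # y intercept
M = 1.17   # slope of the line


def estimate(obs_number, intercept=B, slope=M):
    """Predicted % CPU BUSY for a given observation number."""
    return intercept + slope * obs_number


def residuals(values, intercept=B, slope=M):
    """Observed minus predicted, one entry per observation (numbered from 1)."""
    return [y - estimate(i, intercept, slope)
            for i, y in enumerate(values, start=1)]


if __name__ == "__main__":
    for i, (y, r) in enumerate(zip(cpu_busy, residuals(cpu_busy)), start=1):
        print(f"{i:2d}  {y:5.1f}  {estimate(i):5.1f}  {r:6.1f}")
```

Large negative residuals, such as those for observations 5, 9, and 10, mark the weeks that the line overestimates most.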
As the model parameters and the residual values in the preceding table show, the proposed model fits the historical data very poorly. Often, these problems are introduced by poorly behaved historical data rather than by the type of model that is selected by the analyst. In this example, three observations in the historical data (28NOV97, 26DEC97, and 02JAN98) are significantly different from the remainder of the historical data points. Investigation reveals that these three weeks represent holidays, presenting two alternatives:
- Compensating the historical data points. For example, you could attempt to compensate for the reduced holiday workload by multiplying the affected observations by some constant. Unfortunately, such constants are guesses made by the analyst. Therefore, we do not recommend that you compensate historical data.
- Deleting the errant historical data points. Although this reduces the number of points available for developing the model, it does not introduce any of the analyst's biases into the modeling process and is statistically defensible, because these weeks really do represent a different category of work for the processor.
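Deletion itself is mechanical once the holiday weeks have been identified. The Python sketch below drops the three weeks named above and keeps each remaining point's original observation number, so that the time axis is not compressed. The data layout and the names `HOLIDAY_WEEKS` and `drop_weeks` are illustrative assumptions, not part of the product.

```python
# (week ending, observation number, % CPU BUSY), copied from the first table.
observations = [
    ("31OCT97", 1, 71.0), ("07NOV97", 2, 72.0), ("14NOV97", 3, 72.2),
    ("21NOV97", 4, 73.8), ("28NOV97", 5, 62.5), ("05DEC97", 6, 74.0),
    ("12DEC97", 7, 75.2), ("19DEC97", 8, 75.0), ("26DEC97", 9, 53.7),
    ("02JAN98", 10, 61.0), ("09JAN98", 11, 76.4), ("16JAN98", 12, 78.0),
]

# Weeks that the text identifies as holiday weeks.
HOLIDAY_WEEKS = {"28NOV97", "26DEC97", "02JAN98"}


def drop_weeks(rows, excluded):
    """Return the rows whose week-ending date is not in `excluded`."""
    return [row for row in rows if row[0] not in excluded]


kept = drop_weeks(observations, HOLIDAY_WEEKS)
print(len(kept))                 # number of points left for modeling
print([n for _, n, _ in kept])   # original observation numbers are preserved
```

Filtering on the week-ending date rather than on the utilization value keeps the decision tied to the documented reason for deletion (a holiday) instead of to how far a point sits from the others.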
Deleting the historical observations for the holiday weeks results in a substantially better model. The parameters of the new model are shown below:
    n   = 9,      the number of observations
    b   = 70.78,  the y intercept
    m   = 0.57,   the slope of the line
    r^2 = 0.93,   the coefficient of determination
    F   = 162,    the F value
    p   = 0.0001, the probability of an F value at least this large if the true slope were zero
    s_e = 0.64,   the standard error
The predicted and residual values for the model that is developed from the historical series with the three holiday weeks deleted are shown in the following table.
    Week     Observation   % CPU    Est    Residual
    Ending     Number       BUSY   % CPU   (error)
    =======  ===========  ======   =====   ========
    31OCT97        1        71.0    71.4     -0.4
    07NOV97        2        72.0    71.9      0.1
    14NOV97        3        72.2    72.5     -0.3
    21NOV97        4        73.8    73.0      0.8
    28NOV97        5           .    73.6        .
    05DEC97        6        74.0    74.2     -0.2
    12DEC97        7        75.2    74.7      0.5
    19DEC97        8        75.0    75.3     -0.3
    26DEC97        9           .    75.9        .
    02JAN98       10           .    76.4        .
    09JAN98       11        76.4    77.0     -0.6
    16JAN98       12        78.0    77.6      0.4
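The fit on the nine remaining weeks can be sketched in plain Python as an ordinary least-squares regression of % CPU BUSY on observation number. This assumes ordinary least squares is the method behind the published parameters, because the source does not name the tool or formulas used. The function name `fit_line` is illustrative. The intercept, slope, and F value it produces round to the published figures. The other statistics depend on the exact definitions the original tool used and may not match.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = b + m*x.

    Returns (b, m, r_squared, f_value).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    syy = sum((y - y_mean) ** 2 for y in ys)

    m = sxy / sxx
    b = y_mean - m * x_mean
    ss_regression = m * sxy          # variation explained by the line
    ss_error = syy - ss_regression   # variation left in the residuals
    r_squared = ss_regression / syy
    f_value = ss_regression / (ss_error / (n - 2))
    return b, m, r_squared, f_value


# Observation numbers and % CPU BUSY for the nine non-holiday weeks.
xs = [1, 2, 3, 4, 6, 7, 8, 11, 12]
ys = [71.0, 72.0, 72.2, 73.8, 74.0, 75.2, 75.0, 76.4, 78.0]

b, m, r_squared, f_value = fit_line(xs, ys)
print(f"b = {b:.2f}, m = {m:.2f}, r^2 = {r_squared:.2f}, F = {f_value:.0f}")

# Estimates for the deleted holiday weeks come from the same fitted line.
for obs in (5, 9, 10):
    print(obs, round(b + m * obs, 1))
```

Because the deleted weeks keep their observation numbers, the fitted line still yields an estimate for each of them, which is how the table above can show an Est % CPU value for observations 5, 9, and 10.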
The model developed from the historical data series after the three holiday weeks were deleted is significantly better than the model developed before this deletion. This example shows the value of deleting errant historical data points.
Note:
The WEEKS timespan is probably more attractive for building models as there are often too few monthly observations for deletion to be an attractive alternative if the MONTHS timespan is used.

Figure 7-4. Weekly CPU Utilizations

                                                   HOLIDAY CPU DATA
         |
         |
      81 +
         |
         |
         |
      78 +                                                                                                  *
         |
         |                                                                                         *
         |
      75 +                                                     *        *
         |                                            *
         |                          *
         |
      72 +        *        *
         |*
    %    |
         |
      69 +
    C    |
    P    |
    U    |
      66 +
         |
    B    |
    U    |
    S 63 +
    Y    |                                   *
         |
         |                                                                                *
      60 +
         |
         |
         |
      57 +
         |
         |
         |
      54 +                                                                       *
         |
        -+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
         1        2        3        4        5        6        7        8        9       10       11       12
                                                  OBSERVATION NUMBER