This post was triggered by the email below:
I am a regular reader of your blog and website which is an inspiration to me as a forecasting analyst. I just have a very simple question for you, which I don’t understand as a practitioner. I have looked at 10-20 papers and almost every one has a lag variable in it for forecasting electricity demand. But in practice, if you are forecasting for a portfolio or a region and not the whole grid of a country, lag demand is simply not available until weeks or months later. Is this because academia is focused on the theoretical and not the practical, or is it because it focuses on the big picture, total demand and not by region/portfolio? And is there any way round this? You can always feed forecasts for D+1 as a lag into D+2 going forward, but this doesn’t give you a lag for D+0 and D+1.
This is an excellent and frequently asked question, but I don't have a simple answer.
In practice, if you have lagged load as a variable in the model but don't have its observation for the forecasting period, you have to use the predicted value.
Take day-ahead load forecasting as an example. When forecasting the hour ending 10am tomorrow, we don't yet have the observation for the hour ending 9am. If the model includes the lagged load of the preceding hour, we have to predict the load of the hour ending 9am first. To make that prediction, we need the load of the hour ending 8am, which has to be predicted as well, and so on back to the last hour with an actual observation.
Suppose you are building a multiple linear regression model. Regression models with lagged dependent variables are called dynamic regression models.
To use a dynamic regression model for a period where the observations of the lags are not available, you have to run an iterative process that forecasts those lags first.
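To make the iterative process concrete, here is a minimal sketch in Python. It is an illustration under assumptions of my own choosing, not the model from any of the papers mentioned in this post: the data are synthetic, the model has only one lag and one temperature term, and the function names `fit_dynamic_regression` and `recursive_forecast` are made up for this example.

```python
import numpy as np

def fit_dynamic_regression(load, temp):
    """Least-squares fit of load[t] = b0 + b1 * load[t-1] + b2 * temp[t]."""
    X = np.column_stack([np.ones(len(load) - 1), load[:-1], temp[1:]])
    coef, *_ = np.linalg.lstsq(X, load[1:], rcond=None)
    return coef

def recursive_forecast(coef, last_load, temp_future):
    """Forecast hour by hour, feeding each prediction back in as the next lag."""
    preds = []
    lag = last_load  # the last actual observation we have
    for temp in temp_future:
        lag = coef[0] + coef[1] * lag + coef[2] * temp
        preds.append(lag)
    return np.array(preds)

# Synthetic hourly data (an assumption for illustration only).
rng = np.random.default_rng(0)
hours = np.arange(24 * 60)
temp = 20 + 8 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, hours.size)
load = np.empty(hours.size)
load[0] = 100.0
for t in range(1, hours.size):
    load[t] = 10 + 0.8 * load[t - 1] + 0.5 * temp[t] + rng.normal(0, 1)

# Hold out the last 24 hours, fit on the rest, then forecast the held-out day
# starting from the last observed load before it.
coef = fit_dynamic_regression(load[:-24], temp[:-24])
forecast = recursive_forecast(coef, load[-25], temp[-24:])
print(forecast.round(1))
```

Only the first forecasted hour uses an observed lag. Every later hour is built on a predicted lag, which is why the benefit of the lagged load fades as the horizon grows.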
Now you may want to ask:
Are these dynamic regression models more accurate than the ones without lagged load?
In practice, it depends on how far ahead you are forecasting and how far back the lagged variables go.
If you are using the load of the preceding hour in your model, you should expect some improvement for the next few hours compared with models without lagged load variables. The improvement diminishes as the forecast horizon lengthens. Beyond 10 hours or so, you may not see any improvement.
One way to get around this iterative process is to avoid using the load of the preceding one or two hours. Instead, we can use the load of the same hour of yesterday. By doing so, you can expect some improvement for the next day or two compared with models without lagged load variables. Again, the improvement diminishes as the forecast horizon lengthens. For very short horizons, i.e., one or two hours ahead, models with the load of the same hour of yesterday typically do not outperform models with the load of the preceding hour.
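Here is a minimal sketch of the same-hour-of-yesterday idea, again under my own assumptions: synthetic data, a single 24-hour lag plus temperature, and hypothetical function names `fit_lag24` and `forecast_next_day`. Because every lag needed for the next 24 hours has already been observed, the whole day can be forecasted in one shot with no iteration.

```python
import numpy as np

LAG = 24  # same hour of yesterday, for hourly data

def fit_lag24(load, temp):
    """Least-squares fit of load[t] = b0 + b1 * load[t-24] + b2 * temp[t]."""
    X = np.column_stack([np.ones(len(load) - LAG), load[:-LAG], temp[LAG:]])
    coef, *_ = np.linalg.lstsq(X, load[LAG:], rcond=None)
    return coef

def forecast_next_day(coef, last_day_load, temp_tomorrow):
    """Forecast all 24 hours at once; every lag is an actual observation."""
    X = np.column_stack([np.ones(LAG), last_day_load, temp_tomorrow])
    return X @ coef

# Synthetic hourly data (an assumption for illustration only).
rng = np.random.default_rng(0)
hours = np.arange(24 * 60)
temp = 20 + 8 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, hours.size)
load = 100 + 20 * np.sin(2 * np.pi * (hours - 3) / 24) + 1.5 * temp \
    + rng.normal(0, 2, hours.size)

# Hold out the last day; its lags are the actual loads of the day before.
coef = fit_lag24(load[:-24], temp[:-24])
forecast = forecast_next_day(coef, load[-48:-24], temp[-24:])
print(forecast.round(1))
```

This sketch covers only the next 24 hours. To go further out with the same model, you would either iterate day by day or use a longer lag.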
For long term load forecasting, adding lagged load variables doesn't help much and creates two issues.
One is the interpretability of the model. Because the lagged load variables are highly correlated with the load series itself, most of the load variation is "explained" by the lagged load variables rather than by the other explanatory variables such as weather and calendar variables. In other words, we can hardly answer "what if next year is a hot year" if lagged load variables are in the model.
The other is the inflation of forecast accuracy. Many people plug in actual values of the lagged load when evaluating long term load forecasting performance, which results in a very low error. Be careful: this is not ex post forecasting! Ex post forecasting should not assume perfect knowledge of the dependent variable.
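The size of this inflation is easy to see with a small experiment. The sketch below is a toy setup of my own (synthetic data, one lag, one temperature term), so the numbers mean nothing beyond the comparison. Both evaluations use actual temperatures, as ex post forecasting allows. The only difference is whether the lagged load comes from the actual series or from the model's own earlier predictions.

```python
import numpy as np

# Synthetic hourly data (an assumption for illustration only).
rng = np.random.default_rng(1)
n_train, n_test = 24 * 90, 24 * 30
n = n_train + n_test
hours = np.arange(n)
temp = 20 + 8 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, n)
load = np.empty(n)
load[0] = 220.0
for t in range(1, n):
    load[t] = 5 + 0.95 * load[t - 1] + 0.3 * temp[t] + rng.normal(0, 2)

# Fit load[t] = b0 + b1 * load[t-1] + b2 * temp[t] on the training period.
X = np.column_stack([np.ones(n_train - 1), load[:n_train - 1], temp[1:n_train]])
coef, *_ = np.linalg.lstsq(X, load[1:n_train], rcond=None)

test_idx = np.arange(n_train, n)

# (a) Inflated evaluation: plug in the ACTUAL lagged load over the test period.
pred_actual_lag = coef[0] + coef[1] * load[test_idx - 1] + coef[2] * temp[test_idx]

# (b) Honest evaluation: only the last training observation is known,
#     and every later lag is the model's own prediction.
pred_recursive = np.empty(n_test)
lag = load[n_train - 1]
for i, t in enumerate(test_idx):
    lag = coef[0] + coef[1] * lag + coef[2] * temp[t]
    pred_recursive[i] = lag

mae_actual_lag = np.mean(np.abs(load[test_idx] - pred_actual_lag))
mae_recursive = np.mean(np.abs(load[test_idx] - pred_recursive))
print(f"MAE with actual lags:    {mae_actual_lag:.2f}")
print(f"MAE with predicted lags: {mae_recursive:.2f}")
```

Evaluation (a) is really a series of one-hour-ahead forecasts, so its error says little about how the model would perform a month or a year ahead.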
To keep the answer short, this is what I have been doing: I use lagged load (see this MPCE paper) when the forecast horizon is less than two or three days. Sometimes I include residual forecasting (see the point forecasting portion of this IJF paper). I don't use lagged load for long term load forecasting.
Hope this helps!