Monday, October 12, 2015

Fall 2015 In-class Probabilistic Load Forecasting Competition

Update: The final ranking is available HERE.

The second exam of my Energy Analytics course this semester is a probabilistic load forecasting competition. The competition rules are listed below:
  • The competition will start on 10/22/2015, and end on 11/25/2015. 
  • The historical data will be released on 10/22/2015.
  • The year-ahead hourly probabilistic load forecast is due at 11:45am ET each Wednesday starting from 10/28/2015. 
  • The exam is an individual effort. Each student forms a single-person team. No collaboration is allowed.
  • Students cannot use any data other than what's provided by Dr. Tao Hong and the U.S. federal holidays.
  • The pinball loss function is the error measure in this competition. 
  • The benchmark will be provided by Dr. Tao Hong. A student receives no credit if they neither beat the benchmark nor rank in the top 6 in the class. 
  • No late submission is allowed. 
I would like to open this competition to students and professionals outside my class. If you are interested in joining the competition, please contact me for detailed instructions.
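For readers new to the error measure named in the rules, here is a minimal sketch of the pinball loss in Python. It assumes the forecast consists of the 1st through 99th percentiles for each hour and that the score is the plain average over all hours and percentiles; that averaging convention and the function names are illustrative assumptions, not the official scoring script.

```python
def pinball_loss(actual, quantile_forecast, tau):
    """Pinball loss of one quantile forecast at probability level tau (0 < tau < 1)."""
    if actual >= quantile_forecast:
        # Forecast was too low: penalize by tau per unit of error.
        return tau * (actual - quantile_forecast)
    # Forecast was too high: penalize by (1 - tau) per unit of error.
    return (1.0 - tau) * (quantile_forecast - actual)


def mean_pinball_loss(actuals, forecasts):
    """Average pinball loss over all hours and percentiles (assumed convention).

    actuals:   list of observed loads, one per hour.
    forecasts: list of lists; forecasts[t][i] is the (i + 1)-th percentile for hour t.
    """
    total = 0.0
    count = 0
    for y, percentiles in zip(actuals, forecasts):
        for i, q in enumerate(percentiles):
            tau = (i + 1) / 100.0
            total += pinball_loss(y, q, tau)
            count += 1
    return total / count
```

A lower score is better. The asymmetric penalty means that a high percentile is punished more when the actual load exceeds it than when the load falls below it.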

Recommended readings:


  1. Dear All, hereby a brief stepwise explanation of my method that I used in rounds 2-5:

    1) First of all, I designed an hourly linear regression model in our forecasting application Itron Metrix ND, including variables such as day of week, holidays, bridge days, long term trend, yearly cycle, and 3rd degree polynomial temperature variables, separated for the timescale of hours and days and separated for weekdays and weekends. MAPEs were roughly 1.8% (in sample) and 2.2% (out of sample / 1 year ahead, using the later provided temperature data). Outliers were marked as bad by looking at the scatterplot of actuals versus predicted.

    2) Secondly, I ran a simulation, again in Metrix ND, on the training set, replacing all holiday and trend variables with a zero, creating a nicely stationary dataset.

    3) Thirdly, in MS Excel, I first shifted these simulated data to align for day of the week. Then I took 99 percentiles on every hour of the year, pooling each hour with its neighbors exactly one and two weeks before and after. Every percentile would thus operate on a sample of 5 times the number of years of available load data. Finally, I added to these percentiles the difference between my simulation and prediction run, to adjust for holidays and the long term trend.

    I hope this was useful as an introduction, but please let me know if you have any questions. I hope to hear from the Portuguese teams how they were able to outperform this method.

    Best regards, Geert Scholma
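The percentile step in 3) of the comment above can be sketched roughly as follows. This is only an illustration in Python, not the commenter's Excel workbook: the names `pooled_percentiles` and `adjust_percentiles` are hypothetical, each year is assumed to be a list of hourly simulated loads already aligned for day of week, neighbors that fall outside the year are simply dropped, and the sign of the final adjustment (prediction minus simulation) is an assumption.

```python
from statistics import quantiles

HOURS_PER_WEEK = 168


def pooled_percentiles(aligned_years, hour_index):
    """99 percentiles for one hour of the year, pooled across years and weeks.

    aligned_years: list of hourly series, one per historical year, already
                   shifted so that the same index is the same day of week.
    The sample holds the hour itself plus the same hour one and two weeks
    before and after, i.e. up to 5 values per year.
    """
    sample = []
    for series in aligned_years:
        for week_offset in (-2, -1, 0, 1, 2):
            idx = hour_index + week_offset * HOURS_PER_WEEK
            if 0 <= idx < len(series):  # assumption: drop neighbors outside the year
                sample.append(series[idx])
    # n=100 yields the 99 cut points between the 100 equal-probability groups.
    return quantiles(sample, n=100)


def adjust_percentiles(percentiles, prediction, simulation):
    """Shift percentiles by the holiday/trend effect removed in the simulation run."""
    shift = prediction - simulation  # assumed sign convention
    return [p + shift for p in percentiles]
```

With five years of history, each call to `pooled_percentiles` away from the edges of the year works on 25 values, which matches the "5 times the number of years" sample size described in the comment.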

  2. Dear All,

    the forecast approach that I used for all five rounds is quite robust and scored 3rd in the overall pinball loss.

    The method used is based on high-dimensional time series analysis.
    Importantly, I used only 2 (big) models for the load and temperature data, and not 24x2 small hourly models.

    In detail, I considered a (homoscedastic) two-dimensional time-varying threshold AR model.
    The dependency is designed in such a way that the past temperature and the past load have an impact on the current load, but only the past temperature affects the current temperature.
    Furthermore, I used many possible deterministic regressors, covering the seasonal pattern (hourly, daily, weekly, annual) with seasonal interactions (esp. the daily patterns are changing over the year), and public holiday effects. For the temperature neither weekly nor public holiday effects were allowed.

    For the autoregressive part, I allowed a possible memory of 1200 hours for the load and 360 hours for the temperature. The thresholds for non-linear impacts were chosen manually, 5 thresholds for the load and 7 thresholds for the temperature. Only the most important parameters were allowed to vary over time. The final model has several thousand possible parameters, but the high-dimensional estimation method will select only the relevant parameters.

    The model was estimated by minimising the BIC (Bayesian information criterion) in a lasso regression. Given the estimated model, I performed residual-based bootstrap simulations for the next 8760/8784 values with N=20000 replications. The corresponding 99 quantiles of the N sample paths were used as estimates for the required percentiles.

    For the computational part I used R. The estimation of the model took about 1 hour, the simulation of 20000 sample paths took about 10 hours (using 4 cores).
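To make the residual-based bootstrap idea concrete, here is a much simplified sketch in Python (the commenter worked in R). As a loud assumption, it replaces the lasso-estimated threshold AR model with a plain AR(1) fitted by least squares; only the simulate-then-take-quantiles logic is meant to carry over. The function names are hypothetical.

```python
import random
from statistics import quantiles


def fit_ar1(series):
    """Least-squares fit of y[t] = c + phi * y[t-1] + e[t]; returns (c, phi, residuals)."""
    x = series[:-1]
    y = series[1:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    residuals = [b - (c + phi * a) for a, b in zip(x, y)]
    return c, phi, residuals


def bootstrap_percentiles(series, horizon, n_paths, seed=0):
    """Simulate n_paths future paths by resampling fitted residuals with
    replacement, then return the 99 percentiles for each step of the horizon."""
    rng = random.Random(seed)
    c, phi, residuals = fit_ar1(series)
    paths = []
    for _ in range(n_paths):
        level = series[-1]
        path = []
        for _ in range(horizon):
            # Stand-in AR(1) recursion; the real model is far richer.
            level = c + phi * level + rng.choice(residuals)
            path.append(level)
        paths.append(path)
    # For each step ahead, take the 99 cut points across all simulated paths.
    return [quantiles([p[h] for p in paths], n=100) for h in range(horizon)]
```

In the setting described above, `horizon` would be 8760 or 8784 and `n_paths` 20000, which explains the reported run time of several hours.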

    The methodology is similar to this one as used in

    Similarly to Geert I am curious about the methodology of the Portuguese teams.

    Best regards,

