Tuesday, November 27, 2018

Winning Methods from BFCom2018 Qualifying Match

I invited the BFCom2018 finalists to share their methods used at the qualifying match. Here are the ones I've received so far.

#1. Geert Scholma

Team member: Geert Scholma

Software: Excel, R (dplyr, lubridate, ggplot2, plotly, tidyr, dygraphs, xts, nnls)

Core technique: Multiple Linear Regression.

The model includes the usual variables plus some special ingredients: 5 weekdays; federal holidays; strong bridge days (Monday before / Friday after a holiday); weak bridge days (others); 4th degree polynomials of exponentially weighted moving average temperatures on 3 timescales (roughly 1 day, 1 week, 1 month) with optimized decay factors; a 4th degree polynomial time trend for long-term gradual changes, held at a constant value after the last training date; and an 8th degree polynomial of the day of year for the yearly shape, with a weekend interaction.
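
As a rough illustration of the exponentially weighted moving average temperatures, here is a minimal Python sketch. The recursion form, the names `ewma` and `ewma_features`, and the decay factors in `ALPHAS` are assumptions made for illustration; the decay factors in the actual model were optimized and their values are not given in this post.

```python
def ewma(temps, alpha):
    """Exponentially weighted moving average of an hourly temperature series.

    alpha in (0, 1] is the weight on the newest observation (assumed form):
        s[t] = alpha * temps[t] + (1 - alpha) * s[t - 1]
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for i, temp in enumerate(temps):
        if i == 0:
            smoothed.append(float(temp))  # seed with the first observation
        else:
            smoothed.append(alpha * temp + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical decay factors giving a memory of roughly 1 day, 1 week and
# 1 month on hourly data (alpha ~ 1 / number of hours in the window).
ALPHAS = {"day": 1 / 24, "week": 1 / 168, "month": 1 / 720}


def ewma_features(temps):
    """Return one smoothed temperature series per timescale."""
    return {name: ewma(temps, alpha) for name, alpha in ALPHAS.items()}
```

Each smoothed series could then be expanded into polynomial terms before entering the regression.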

Core methodology: No data cleaning. One weighted weather station was used, with weights based on the non-negative linear regression coefficients of a second modeling step that combined the predictions of all the single-weather-station models from a first step.
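
The second step, combining single-station predictions under a non-negativity constraint, might look roughly like the sketch below. It uses plain projected gradient descent as a simple stand-in for the R `nnls` package listed above. The function name, arguments and iteration count are hypothetical.

```python
def nonnegative_weights(station_preds, actual, n_iter=2000):
    """Fit weights w >= 0 minimizing sum_i (sum_j w[j] * pred[j][i] - actual[i])^2.

    station_preds: one list of predictions per station, all the same length.
    Uses projected gradient descent, a simple stand-in for a real NNLS solver.
    """
    n_st, n_obs = len(station_preds), len(actual)
    # A step of 1 / trace(A^T A) never exceeds 1 / (largest eigenvalue),
    # so the iteration stays stable.
    trace = sum(p * p for preds in station_preds for p in preds)
    if trace == 0:
        return [0.0] * n_st
    step = 1.0 / trace
    w = [0.0] * n_st
    for _ in range(n_iter):
        # Residual of the current combination, computed once per iteration.
        resid = [sum(w[j] * station_preds[j][i] for j in range(n_st)) - actual[i]
                 for i in range(n_obs)]
        for j in range(n_st):
            grad = sum(station_preds[j][i] * resid[i] for i in range(n_obs))
            w[j] = max(0.0, w[j] - step * grad)  # project onto w >= 0
    return w
```

One plausible next step is to normalize the weights and apply them to the station temperatures to form the single weighted station.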

Key reference: (Hong, Wang, & White, 2015).

#2. Redwood Coast Energy Authority

Team member: Allison Campbell, Redwood Coast Energy Authority and UNCC

Software: Python (SKLearn package LinearRegression, and the genetic algorithm package DEAP)

Core technique: Multiple Linear Regression.

I adapted the DEAP One Max Problem to optimize the selection of weather stations. The bulk of my model is built from Tao's vanilla benchmark, with the inclusion of lagged temperature, a weighted moving average of the last day's temperature, transformation of holidays to weekend/days, and exponentially weighted least squares. Before the regression, I log transformed the load. I also created 18 "sister" forecasts by redefining the number of months in a year to be 6 to 24. This model was informed by Tao's doctoral thesis, Hong, Wang, White 2015 (Weather Stn Selection), Wang, Liu, Hong 2016 (Recency Big Data), Nowotarski, Liu, Weron, Hong 2016 (Combining Sisters), Xie, Hong 2018 (24 Solar Terms), and Arlot, Celisse 2010 (CV for model selection).
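
To make the idea of redefining the number of months concrete, here is a small Python sketch. It assumes the year is cut into equal-length pseudo-months and that the 18 sister calendars are every count from 6 to 24 except the usual 12. Both points are my reading of the description, and the names are hypothetical.

```python
def period_of_year(day_of_year, n_periods, days_in_year=365):
    """Map a 1-based day of year to a 1-based pseudo-month index.

    The year is cut into n_periods blocks of (nearly) equal length.
    """
    if not 1 <= day_of_year <= days_in_year:
        raise ValueError("day_of_year out of range")
    return (day_of_year - 1) * n_periods // days_in_year + 1


# Assumed reading of "6 to 24": every count from 6 to 24 except the usual 12,
# which gives 18 alternative calendars.
SISTER_PERIOD_COUNTS = [k for k in range(6, 25) if k != 12]
```

Each sister model would use `period_of_year(day, k)` in place of the month class variable, and the resulting forecasts could then be combined.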

#5. Masoud_BigDEAL

Team member: Masoud Sobhani, UNCC

Software: SAS (proc GLM)

Core technique: Multiple Linear Regression

I work with Dr. Hong in the BigDEAL lab and I am the TA of the “Energy Analytics” course this semester. For the first few assignments of this class, we gave the same dataset to the students so that they could improve the accuracy of their forecasts as they learned different forecasting skills. As in previous classes, Dr. Hong asked me to prepare a benchmark forecast for the class. I built a model during the first lecture and we kept it as the benchmark for all assignments. Later, Dr. Hong decided to run a competition using the same dataset for the qualifying match. My initial benchmark model was still on the leaderboard and fortunately qualified for the next round.

In this model, I did not do any data cleansing; I used the raw data for forecasting. The core technique was based on the Vanilla Benchmark Model with recency (Wang, Liu, & Hong, 2016) and holiday effects (Hong, 2010). This model uses third order polynomials of temperature, calendar variables, and interactions between them. I removed the Trend variable and used 14 lagged temperatures. For the weather station selection, I employed the exact method proposed in (Hong, Wang, & White, 2015).
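
A minimal Python sketch of building lagged temperatures and their third order polynomial terms is shown below. It assumes the lags are hourly and leaves out the calendar interactions. The function names are hypothetical, and this is not the SAS code used in the competition.

```python
def recency_features(temps, n_lags):
    """Build lagged-temperature rows from an hourly temperature list.

    Row t holds [T(t), T(t-1), ..., T(t-n_lags)]. The first n_lags hours are
    skipped because their lags would fall before the start of the data.
    """
    if n_lags < 0:
        raise ValueError("n_lags must be non-negative")
    rows = []
    for t in range(n_lags, len(temps)):
        rows.append([temps[t - lag] for lag in range(n_lags + 1)])
    return rows


def cubic_terms(row):
    """Expand each temperature into its first, second and third powers."""
    return [x ** p for x in row for p in (1, 2, 3)]
```

For example, `[cubic_terms(r) for r in recency_features(temps, 14)]` would give the polynomial terms for the current temperature and 14 lags.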

#7. SaurabhSangamwar_BigDEAL

Team Member: Saurabh Sangamwar, UNCC

Software: SAS (proc GLM)

Core technique: Multiple Linear Regression

  • Weather station selection using proposed approach mentioned in (Hong, Wang, & White, 2015)
  • Used 24 solar terms to classify the data as proposed in (Xie & Hong, 2018)
  • Added recency effect to Tao’s Vanilla Benchmark model as proposed in (Wang, Liu, & Hong, 2016)
  • Used holiday effect (treating a holiday as Sunday and the day after a holiday as Monday), weekend effect, trend variable (increasing serial number), and the daily maximum and minimum temperature with their interactions with month, solar term and hour. When forecasting with solar terms, solar months 5 and 4 were grouped together.
  • Used a 2-year training period, i.e., years 2006 and 2007, to train the model, and forecasted the 2008 load.
  • Used 3-fold cross validation and a stepwise variable selection method to select the number of lagged effects.
  • The best number of lagged effects differed from year to year. Also, in some cases solar terms worked better than Gregorian calendar months as the class variable, and in other cases vice versa. So point forecasts were generated with 11, 12, 13 and 14 lagged effects for both solar terms and the Gregorian calendar. In total, 8 point forecasts were generated, and their average was submitted.
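
The holiday recoding in the list above can be sketched in Python as follows. The sketch applies the rule literally, even when the day after a holiday falls on a weekend. That choice is an assumption, and the function name and holiday set are hypothetical.

```python
from datetime import date, timedelta


def model_weekday(day, holidays):
    """Return the weekday code (Monday=0 ... Sunday=6) used by the model."""
    if day in holidays:
        return 6  # a holiday is treated as a Sunday
    if day - timedelta(days=1) in holidays:
        return 0  # the day after a holiday is treated as a Monday
    return day.weekday()


# Example with Thanksgiving 2007 (a Thursday) as the only holiday.
example_holidays = {date(2007, 11, 22)}
```

With this set, `model_weekday(date(2007, 11, 23), example_holidays)` gives the Monday code even though the date is a Friday.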

#10. YikeLi_BigDEAL

Team member: Yike Li, Accenture and UNCC 

Software: SAS (proc GLM)

Core techniques: Multiple Linear Regression

Core methodology:
  • Weather station selection: A modified version of (Hong, Wang, & White, 2015) that evaluates all possible combinations of the top selected weather stations. The virtual station was selected based on three-fold cross validation.
  • Recency effect: Performed a 2-dimensional forward stepwise analysis. The assumption is that the MAPE results of the d-h combinations on the validation period (d=0~6, h=0~24) form a convex hull. Starting from d=0, gradually add h terms to Tao’s vanilla model until adding more temperature lags no longer yields a better MAPE. Then keep the selected h value and gradually add d terms until adding more past daily averages no longer yields a better MAPE.
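
The 2-dimensional forward stepwise search described above can be sketched in Python as follows. The callable `mape_of(d, h)`, which would fit the model with d daily averages and h hourly lags and return the validation MAPE, is a hypothetical placeholder supplied by the user.

```python
def stepwise_recency_search(mape_of, max_d=6, max_h=24):
    """Greedy search over (d, h); mape_of(d, h) returns the validation MAPE.

    d = number of past daily average temperatures, h = number of hourly lags.
    """
    d, h = 0, 0
    best = mape_of(d, h)
    # Step 1: with d fixed at 0, add hourly lags while MAPE keeps improving.
    while h < max_h:
        candidate = mape_of(d, h + 1)
        if candidate >= best:
            break
        h, best = h + 1, candidate
    # Step 2: keep h, add daily averages while MAPE keeps improving.
    while d < max_d:
        candidate = mape_of(d + 1, h)
        if candidate >= best:
            break
        d, best = d + 1, candidate
    return d, h, best
```

Under the convexity assumption, this needs at most 31 model fits instead of the 175 required to evaluate every d-h combination.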

#13. 4C

Team members:
  • Ilias Dimoulkas, KTH Royal Institute of Technology, Stockholm, Sweden
  • Peyman Mazidi, Loyola Andalucia University, Seville, Spain
  • Lars Herre, KTH Royal Institute of Technology, Stockholm, Sweden
  • Nicholas-Gregory Baltas, Loyola Andalucia University, Seville, Spain

Software: Matlab / Matlab Neural Network Toolbox

Technique: Feed-forward Neural Networks

  • Data cleansing. Missing values at the spring daylight saving hours were filled with the average of the previous and following hours. Duplicate values at the fall daylight saving hours were replaced by their average. No other data cleansing or outlier detection was done.
  • Weather station selection. The technique described in (Hong, Wang, & White, 2015) was used, with the difference that neural networks were used to make the forecasts instead of multiple linear regression.
  • Feature selection. Forward sequential feature selection was used. The initial pool of variables consisted of time variables (year, month, hour, etc.), temperature related variables (temperature, powers, lags, simple moving average) and cross effects between the temperature and the time variables. The pool contained 172 variables in total. The evaluation was also based on neural network forecasts. The final feature set consisted of 31 variables.
  • Forecast. 10 neural networks were trained on the whole data set (years 2005-2007). The forecast for year 2008 was the mean forecast of the 10 neural networks.
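
The data cleansing and forecast averaging steps above can be illustrated with a short Python sketch. It is not the team's Matlab code. The record layout (a sorted list of timestamp and load pairs, with `None` marking a missing load) and the function names are assumptions.

```python
def clean_dst(records):
    """records: list of (timestamp, load) sorted by timestamp; load may be None.

    Duplicate timestamps (fall DST) are averaged; a missing value (spring DST)
    is filled with the mean of the neighboring hours.
    """
    # Pass 1: collapse repeated timestamps into their average.
    merged = []
    for ts, load in records:
        if merged and merged[-1][0] == ts:
            merged[-1][1].append(load)
        else:
            merged.append((ts, [load]))
    series = []
    for ts, loads in merged:
        known = [x for x in loads if x is not None]
        series.append([ts, sum(known) / len(known) if known else None])
    # Pass 2: fill interior gaps from the previous and following hours.
    for i in range(1, len(series) - 1):
        prev_val, next_val = series[i - 1][1], series[i + 1][1]
        if series[i][1] is None and prev_val is not None and next_val is not None:
            series[i][1] = (prev_val + next_val) / 2
    return [(ts, load) for ts, load in series]


def ensemble_mean(forecasts):
    """Average several equally long forecasts hour by hour."""
    return [sum(vals) / len(vals) for vals in zip(*forecasts)]
```

Here `ensemble_mean` would take the 10 network forecasts for 2008 as a list of lists.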

#13. AdG

Team member: Andrés M. Alonso, Universidad Carlos III de Madrid, Spain.

Software: Matlab (Statistics and Machine Learning toolbox)

Technique: support vector regression

In this project, I used SVM regressions to predict hourly loads with explanatory variables such as temperatures, day of the week, month, federal holidays, and a linear trend. As in Hong et al. (2015), I selected meteorological stations using the loads of 2007 as a trial period, keeping the five stations with the best MAPE. In the final model, the five temperature measures were used individually instead of as an aggregate measure. The local or focused approach consists of selecting days in the training sample whose temperature behavior is similar to that of the day to be predicted, so that the regression is trained using only similar days. That is, for 2007 (2008), I performed 365 (366) SVM regressions, each trained on a different sample. For 2007, the focused approach improved on the overall approach that uses all data from the training set.
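
The similar-day selection behind the focused approach might be sketched in Python as below. The Euclidean distance on daily temperature profiles, the parameter `k`, and the function name are assumptions made for illustration; the post does not state the similarity measure actually used.

```python
def similar_days(train_profiles, target_profile, k):
    """Return indices of the k training days closest to the target day.

    Each profile is a list of temperatures (e.g. 24 hourly values); closeness
    is Euclidean distance, which is an assumption made for this sketch.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    ranked = sorted(range(len(train_profiles)),
                    key=lambda i: sq_dist(train_profiles[i], target_profile))
    return ranked[:k]
```

The regression for a given day would then be trained only on the rows belonging to the returned days.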

References used by the finalists:
  • Hong, T. (2010), “Short Term Electric Load Forecasting,” Ph.D. Dissertation, Graduate Program of Operation Research and Dept. of Electrical and Computer Engineering, North Carolina State University.
  • Wang, P., Liu, B. and Hong, T. (2016) "Electric load forecasting with recency effect: a big data approach," International Journal of Forecasting, vol.32, no.3, pp 585-597.
  • Hong, T., Wang, P. and White, L. (2015) "Weather station selection for electric load forecasting," International Journal of Forecasting, vol.31, no.2, pp 286-295.
  • Tashman, L. J. (2000) "Out-of-sample tests of forecasting accuracy: an analysis and review," International Journal of Forecasting, vol.16, no.4, pp 437-450.
  • Arlot, S. and Celisse, A. (2010) "A survey of cross-validation procedures for model selection," Statistics Surveys, vol.4, pp 40-79.
  • Xie, J. and Hong, T. (2018) "Load forecasting using 24 solar terms," Journal of Modern Power Systems and Clean Energy, vol.6, no.2, pp 208-214.
  • Nowotarski, J., Liu, B., Weron, R. and Hong, T. (2016) "Improving short term load forecast accuracy via combining sister forecasts," Energy, vol.98, pp 40-49.

BTW, I also created a new label "winning methods" so that readers of this blog can easily find the winning methods of previous competitions.
