Sunday, December 28, 2014

Forecasting and Data Mining

The main difference between forecasting and data mining is the goal of the task. The goal of forecasting is to make statements about the future, while the goal of data mining is to extract patterns from large datasets. (Fifteen years ago, "data mining" was a buzzword used loosely for any work with data, which is a misuse of the term.) Many techniques can be applied to both forecasting and data mining, such as artificial neural networks, regression analysis, and cluster analysis.

Because the goals are distinct, the model development philosophies of forecasting and data mining differ as well. In forecasting, time is an important factor. Since we would like to select models that are likely to behave well in the future, we usually arrange the training, validation, and test data in chronological order. In data mining, we usually apply k-fold cross-validation to select models, where chronological order may not be necessary.
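
To make the contrast concrete, here is a minimal sketch of the two ways of splitting the data. It assumes the observations are already sorted by time and identified by their position 0..n-1. The function names and the fold construction are my own illustration, not taken from any particular library.

```python
import random
from typing import List, Tuple


def chronological_split(n: int, n_valid: int, n_test: int) -> Tuple[List[int], List[int], List[int]]:
    """Forecasting style: train on the oldest data, validate on the next
    block, and test on the most recent block. Indices are assumed to be
    in time order."""
    if n_valid + n_test >= n:
        raise ValueError("validation and test blocks leave no training data")
    idx = list(range(n))
    train_end = n - n_valid - n_test
    valid_end = n - n_test
    return idx[:train_end], idx[train_end:valid_end], idx[valid_end:]


def k_fold_splits(n: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Data mining style: shuffle the indices and rotate the held-out fold.
    Time order is ignored, so training data may come after validation data."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        valid = sorted(folds[i])
        train = sorted(j for fold in folds[:i] + folds[i + 1:] for j in fold)
        splits.append((train, valid))
    return splits
```

In the chronological split, every validation point comes after every training point. In the k-fold splits, most folds train on observations that are later in time than some of the observations they are validated on.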

GEFCom2012 was hosted by Kaggle, a well-known platform for data mining competitions. When setting up the competition, Kaggle randomly selected 25% of the solution data to calculate the public leaderboard and kept the remaining 75% to calculate the private leaderboard. A team could submit many entries to improve its score on the public leaderboard, and then select one entry for final scoring on the private leaderboard. The Kaggle setup had a major issue that made the competition more like energy "data mining" than energy "forecasting":
When forecasting the load 7 days ahead, we cannot use the 6-days-ahead load data for validation, because in practice that data does not exist yet. A public leaderboard scored on randomly selected points inside the forecast period gives teams that kind of feedback.
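
The following is a rough sketch of a random public/private split, not Kaggle's actual implementation. It assumes a hypothetical forecast period of 168 hourly points (7 days), and all function names are made up for illustration. It shows that the publicly scored points end up scattered through the whole period.

```python
import random
from typing import List, Tuple


def public_private_split(n_points: int, public_frac: float = 0.25,
                         seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly pick a fraction of the forecast period for the public
    leaderboard; the rest is scored privately. Indices are in time order."""
    rng = random.Random(seed)
    n_public = round(n_points * public_frac)
    public = sorted(rng.sample(range(n_points), n_public))
    public_set = set(public)
    private = [i for i in range(n_points) if i not in public_set]
    return public, private


def count_interleaved(public: List[int], private: List[int]) -> int:
    """Count private points that have a publicly scored point later in time."""
    last_public = max(public)
    return sum(1 for i in private if i < last_public)


# Hypothetical 7-day hourly forecast period.
public, private = public_private_split(168)
print(len(public), len(private), count_interleaved(public, private))
```

Because the public points are interleaved with the private ones, repeated submissions let a team tune a model against actual values from the very period it is supposed to be forecasting.
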
GEFCom2014 was hosted by CrowdAnalytix, who worked closely with us to develop a more realistic and cutting-edge forecasting competition platform than Kaggle's. On this new platform, a team is always forecasting a period chronologically after the training data. Although a team can submit multiple entries, we only take the first one as the entry for scoring. To make the competition even more realistic, we release incremental data on a weekly basis, so that the forecast origin rolls forward every week during the competition.
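
A rolling forecast origin can be sketched as follows. The start date, the 7-day horizon, and the function name are hypothetical choices for illustration and are not the actual GEFCom2014 schedule. The only point is that each round's forecast period starts right after the data released so far.

```python
from datetime import date, timedelta
from typing import Dict, List


def rolling_origins(first_origin: date, n_rounds: int,
                    step_days: int = 7, horizon_days: int = 7) -> List[Dict[str, date]]:
    """Build one round per data release. In each round the training data
    ends the day before the forecast origin, and the forecast period
    covers the next `horizon_days` days."""
    rounds = []
    for r in range(n_rounds):
        origin = first_origin + timedelta(days=r * step_days)
        rounds.append({
            "train_until": origin - timedelta(days=1),
            "forecast_from": origin,
            "forecast_to": origin + timedelta(days=horizon_days - 1),
        })
    return rounds


# Hypothetical three-week schedule.
for rnd in rolling_origins(date(2014, 9, 1), 3):
    print(rnd["train_until"], rnd["forecast_from"], rnd["forecast_to"])
```

Each week the origin moves forward by one step, the previous forecast period becomes part of the training data, and no round ever validates on data from after its own origin.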

