Monday, August 7, 2017

Breakthrough or Too Good To Be True: Several Smoke Tests

When sharing my Four Steps to Review an Energy Forecasting Paper, I spent about a third of the blog post elaborating on what "contribution" means. This post is triggered by several review comments on my recent TSG paper on variable selection methods for probabilistic load forecasting. Here I would like to elaborate on what "contribution" means from a different angle.

A little background first. 

In that TSG paper, we compared two variable selection schemes: HeM (Heuristic Method), which sharpens the underlying model to minimize the point forecast error, and HoM (Holistic Method), which uses the quantile score to select the underlying model (a simplified sketch of the two criteria follows below). The key finding is as follows:
HoM costs much more computational power but produces only slightly better quantile scores than HeM.
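To make the distinction concrete, here is a minimal sketch in Python of the two selection criteria, i.e., picking a model by its point forecast error versus by its quantile (pinball) score. The toy data, the polynomial candidate models, and the use of MAPE as the point error measure are my own assumptions for illustration; the actual HeM and HoM procedures in the paper are more involved than this.

import numpy as np

rng = np.random.default_rng(0)

# Toy load series driven by a temperature-like variable.
n = 500
temp = rng.uniform(10, 35, n)
load = 50 + 2.0 * temp + 0.05 * temp**2 + rng.normal(0, 5, n)
train, valid = slice(0, 400), slice(400, 500)
quantiles = np.arange(0.1, 1.0, 0.1)

def pinball(y, yq, q):
    # Pinball (quantile) loss at quantile level q.
    d = y - yq
    return np.mean(np.maximum(q * d, (q - 1) * d))

def fit_candidate(degree):
    # Fit a polynomial-in-temperature model; return point and quantile forecasts.
    coefs = np.polyfit(temp[train], load[train], degree)
    point = np.polyval(coefs, temp[valid])
    resid = load[train] - np.polyval(coefs, temp[train])
    # Residual quantiles stacked around the point forecast give a crude probabilistic forecast.
    qpred = {q: point + np.quantile(resid, q) for q in quantiles}
    return point, qpred

candidates = {f"poly{d}": fit_candidate(d) for d in (1, 2, 3)}

# Point-error criterion (HeM-like): the lowest MAPE wins.
mape = {name: np.mean(np.abs((load[valid] - p) / load[valid]))
        for name, (p, _) in candidates.items()}

# Quantile-score criterion (HoM-like): the lowest average pinball loss wins.
qscore = {name: np.mean([pinball(load[valid], qp[q], q) for q in quantiles])
          for name, (_, qp) in candidates.items()}

print("Point-error pick:   ", min(mape, key=mape.get))
print("Quantile-score pick:", min(qscore, key=qscore.get))

Even in this toy setting, the quantile-score route has to produce and score nine quantile series per candidate instead of a single point series, which hints at why a quantile-score-based selection costs more computation.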
Then some reviewers raised the red flag:
If the new method is not much better than the existing one, why should we accept the paper?
I believe that the question is genuine. Most likely the reviewers, as well as many other load forecasters, have read many papers in the literature that have presented super powerful models or methods that led to super accurate forecasts. After being flooded with those breakthroughs, they would be hesitant to give favorable ratings to a paper that presents a somewhat disappointing conclusion. 

Now let's take one step back:
What if those breakthroughs were just illusions? 
Given that most of those papers proposed complicated algorithms tested on proprietary datasets, it is very difficult to reproduce the work. In other words, we can hardly verify those stories. The reviewers and editors may be rejecting valuable papers that are not bluffing. This time I was lucky - most reviewers were on my side.

When my premature models were beating all the other competitors many years ago, I was truly astonished by the real-world performance of those "state-of-the-art" models. If those breakthroughs in the literature were really tangible, my experience tells me that the industry would be pouring money into asking those authors for the insights. It has been many years since those papers were published; how many of them have been recognized by the industry? (In my IJF review, I did mention a few exemplary papers though.)

We have run the Global Energy Forecasting Competitions three times. How often do you see those authors or their students on the leaderboard? If their methods are truly effective but not recognized by the industry, why not test them through these public competitions? 

Okay, now you know some of those "peer-reviewed" papers may be bluffing. How can we tell whether they are really bluffing? Before telling you my answer, let's see how those papers are produced:
  1. To make sure that the contribution is novel, the authors must propose something new. To ensure that it looks challenging, the proposal must be complicated. The easiest way to create such techniques is to mix the existing ones, such as ANN+PSO+ARIMA, etc.
  2. To make sure that nobody can reproduce the results, the data used in the case study must be proprietary. Since all it takes to have the paper accepted is to get it past the reviewers and editor(s), an unpopular dataset works too, because the reviewers won't bother spending the time to reproduce the work.
  3. To make sure that the results can justify the breakthrough, the forecasts must be close to perfection. The proposed models must beat the existing ones to death. How to accomplish that? Since the authors have complete knowledge of the future dataset, just fine-tune the model so that it outperforms the others in the forecast period. This is called "peeking the future" (see the sketch after this list).
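For contrast, here is a minimal sketch of how to evaluate without peeking the future. It is my own illustration with made-up variable names and a toy one-variable ridge model, not a procedure from any particular paper: every modeling decision, including hyperparameter tuning, is made on data that precedes the holdout period, and the holdout is scored exactly once.

import numpy as np

rng = np.random.default_rng(1)

# Toy chronological series; the last 20% is the holdout "forecast period".
n = 1000
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(scale=2.0, size=n)
split_tune, split_test = int(n * 0.6), int(n * 0.8)
x_train, y_train = x[:split_tune], y[:split_tune]
x_valid, y_valid = x[split_tune:split_test], y[split_tune:split_test]
x_test, y_test = x[split_test:], y[split_test:]

def fit_ridge(xs, ys, lam):
    # One-variable ridge fit; returns the slope.
    return (xs @ ys) / (xs @ xs + lam)

# Tune the hyperparameter on the validation slice only -- never on the holdout.
lams = [0.0, 0.1, 1.0, 10.0, 100.0]
val_mae = {lam: np.mean(np.abs(y_valid - fit_ridge(x_train, y_train, lam) * x_valid))
           for lam in lams}
best_lam = min(val_mae, key=val_mae.get)

# Refit on all pre-holdout data, then score the holdout exactly once.
slope = fit_ridge(x[:split_test], y[:split_test], best_lam)
test_mae = np.mean(np.abs(y_test - slope * x_test))
print(f"chosen lambda={best_lam}, holdout MAE={test_mae:.2f}")

# "Peeking the future" would instead pick lam (or the model) by its holdout error,
# so the reported accuracy would overstate real out-of-sample skill.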
In reality, it is very hard to build models or methods that can dominate the state of the art. A breakthrough rarely comes from an arbitrary "hybrid" of the existing ones. Instead, breakthroughs (or major improvements) come from using new variables that people have not completely understood in the past, borrowing knowledge from other domains, leveraging new computing power, and so forth.

In the world of predictive modeling, there is a well-known theorem called "no free lunch", which states that no one model works the best in all situations. In other words, if one model beats the others in all cases across all measures, it is "too good to be true". We need empirical studies that report what's NOT working well as much as the ones promoting the champions.

It's time for my list of smoke tests. The more check marks a paper gets, the more I consider it too good to be true.
  1. The paper is proposing a mix (or hybrid) of many techniques.
  2. The paper is merely chasing new buzzwords.
  3. The data is proprietary.
  4. The paper is not co-authored with industry people (or not sponsored by the industry). 
  5. The proposed method does not utilize new variables.
  6. The proposed method does not borrow knowledge from other domains.
  7. The proposed method does not leverage new computing resources.
  8. The proposed method dominates its counterparts (other credible methods) in all aspects.
I spend a minimal amount of time reading those papers, because they are the emperor's new clothes to me. Hopefully this list can help the readers save some time too. On the other hand, I don't mean to imply that the authors were intentionally faking their papers. Many of them are genuine people who make these mistakes without knowing so. Hopefully this blog post can help point the authors in the right direction as well.

1 comment:

  1. Nice discussion of this troubling issue! One can always cherry-pick instances where a particular technique, with a particular dataset, under particular conditions, generates forecasts over a particular time period that are particularly good.

    It would be helpful if authors would identify situations where their proposed method does not work very well. It seems likely they would know this from their testing, and it would save the rest of us a lot of time figuring it out for ourselves.

    Paul Goodwin also has a good discussion of this topic in his Foresight article "High on Complexity, Low on Evidence: Are Advanced Forecasting Methods Always As Good As They Seem" (Fall 2011).

