The 12 evaluation weeks of GEFCom2014 just went by. Many contestants are curious to know about their rankings after such a marathon-type forecasting competition. Last weekend, I created a provisional leaderboard based on the scores I have documented on Inside Leaderboard. In this post, I will share this provisional leaderboard together with the rating and ranking methodology.

When developing my load forecasting methodology in graduate school, I had to make many comparisons among candidate models and promote the best one to the next stage of model building. I didn't realize that I was doing a form of "rating and ranking" until I read the book "Who's #1?" (more on it below). Nor did I realize there are so many ways to do rating and ranking, so many applications of it, and a whole field of science behind it. It has been an area of great interest to me ever since.


View Provisional Leaderboard.

Again, this is not the final leaderboard, pending corrections of individual scores (if there were errors) and adjustment of rankings based on the final reports.

**Motivation**

When designing this competition, we chose a simple method, the trimmed mean, to calculate the final score for each team. It is easy for CrowdAnalytix to implement and easy for the public to understand. However, applying a trimmed mean to the quantile scores is not comprehensive enough to evaluate rolling forecasts. A major drawback is that it weights all tasks equally, without considering that some tasks produce more variable scores than others. To fix this, we create a rating relative to the benchmark score. In addition, we would like the scoring method to give preference to the teams that

- beat the benchmark more times;
- made fewer mistakes;
- performed better on the more recent tasks.
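For concreteness, here is a minimal sketch of the trimmed-mean scoring originally specified (the trim fraction and the sample scores are made-up illustrations, not the competition's actual settings):

```python
def trimmed_mean(scores, trim_frac=0.125):
    """Mean of the scores after dropping the lowest and highest trim_frac of them."""
    s = sorted(scores)
    k = int(len(s) * trim_frac)               # entries to drop from each end
    trimmed = s[k:len(s) - k] if k > 0 else s
    return sum(trimmed) / len(trimmed)

# Hypothetical weekly quantile scores for one team; the outliers get trimmed away.
weekly_scores = [7.2, 7.9, 8.1, 8.4, 8.8, 9.1, 9.6, 14.5]
print(trimmed_mean(weekly_scores))            # mean of the middle six scores
```

Its drawback, as noted above, is that every surviving task counts equally, no matter how variable that task's scores are across teams.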

**Methodology**

*Step 0: initialization*

We start with the scores collected throughout the competition, as published in the Inside Leaderboard, highlighting missing entries in blue and erroneous entries in yellow.

*Step 1: ranking permutation*

Create a ranking matrix by ranking the valid entries within each week. For each team, calculate the 2nd largest ranking among its valid entries. We then permute (fill in) the rankings of a team's missing and erroneous entries using this 2nd largest ranking. In addition, the following rules apply:

- For a given week, if a team's 2nd largest ranking is greater than the number of valid entries in that week, the permuted ranking is set to the number of valid entries.
- If all of the valid entries at the team's 2nd largest ranking are above the benchmark, the permuted ranking is set equal to or higher than the benchmark's ranking.
- If some of the valid entries at the team's 2nd largest ranking are below the benchmark, we treat missing and erroneous entries differently. A missing entry takes the 2nd largest ranking as is; in other words, its permuted ranking may fall below the benchmark. An erroneous entry's permuted ranking is set equal to or above the benchmark's ranking.
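My own reading of these permutation rules can be sketched as follows (the function name, the `kind` flag, and the way the benchmark rules collapse into a single cap are my simplifications, not the official implementation):

```python
def permuted_ranking(second_largest, n_valid, benchmark_rank, kind):
    """Impute a ranking for one missing or erroneous entry in a given week.

    second_largest -- the team's 2nd largest ranking among its valid entries
    n_valid        -- number of valid entries in this week
    benchmark_rank -- the benchmark's ranking in this week
    kind           -- "missing" or "erroneous"
    """
    r = min(second_largest, n_valid)   # rule 1: cap at the number of valid entries
    if kind == "erroneous":
        r = min(r, benchmark_rank)     # erroneous entries never rank below the benchmark
    return r                           # a missing entry may still rank below the benchmark

# A missing entry keeps the 2nd largest ranking (capped at the field size);
# an erroneous entry is additionally capped at the benchmark's ranking.
print(permuted_ranking(10, 8, 5, "missing"))    # 8
print(permuted_ranking(10, 8, 5, "erroneous"))  # 5
```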

*Step 2: score permutation*

With the permuted rankings, we can then permute the scores. The permuted score for team *i* in week *j* with permuted ranking *r* is the score of the team ranked *r* in week *j*.

*Step 3: rating and ranking*

We define the rating for each entry as the percentage by which it beats the benchmark. To give preference to the teams that improved their methodologies along the way, we assign linearly increasing weights to the 12 evaluation weeks: the last week is weighted 12 times the first, and the 12 weights sum to 1. This weighting also reduces the impact of missing and erroneous entries during the first few weeks. The rating for a team is the weighted sum of its weekly ratings, which roughly tells how much the team improved over the benchmark. The rankings are then assigned in descending order of the ratings, so the team with the highest rating ranks first.
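The weighting can be written down concretely: with weights proportional to the week index, the 12 weights are j/78 for j = 1, …, 12 (since 1 + 2 + … + 12 = 78), which makes them sum to 1 with the last equal to 12 times the first. The sketch below assumes a lower quantile score is better, so the rating is the percentage reduction relative to the benchmark:

```python
WEEKS = 12
weights = [j / 78 for j in range(1, WEEKS + 1)]   # 78 = 1 + 2 + ... + 12

def weekly_rating(score, benchmark_score):
    """Percentage by which an entry beats the benchmark (lower score is better)."""
    return 100.0 * (benchmark_score - score) / benchmark_score

def team_rating(scores, benchmark_scores):
    """Weighted sum of the 12 weekly ratings."""
    return sum(w * weekly_rating(s, b)
               for w, s, b in zip(weights, scores, benchmark_scores))

# A team that beats the benchmark by 10% every week gets a rating of 10.
print(round(team_rating([9.0] * 12, [10.0] * 12), 6))  # 10.0
```

Because the weights sum to 1, a team's rating stays on the same percentage scale as the weekly ratings, just tilted toward the later weeks.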

**Aside: Prof. Carl Meyer and "Who's #1?"**

One of my favorite professors during my graduate school days was Prof. Carl Meyer. Although I failed to convince him to serve on my dissertation committee, I did manage to learn a whole lot from him, from matrix analysis and linear algebra to clustering. One day in his office, he gave me his new book, "Who's #1? The Science of Rating and Ranking", which essentially changed my view of how rating and ranking methods should be used in forecasting. The figure below shows the cover of the book and Meyer's autograph.

"Who's #1?" by Langville and Meyer

**Final Remarks**

Just as in the world of forecasting, where all forecasts are wrong, the world of rating and ranking follows a similar principle: Arrow's Impossibility Theorem, which tells us that there is no perfect ranking system. This leaves us plenty of room to enhance our ranking systems.

While I have borrowed some ideas from Meyer's book to come up with the rating and ranking method for GEFCom2014, the method I'm using is not perfect. If you think your ranking is higher or lower than expected, that's normal: there are many other ways to rank the teams, and it would probably take a full paper to discuss all the methods to rate and rank forecasts. Although I find myself too busy to write papers and fight with nonsense reviewers, I may still try to put these ideas together; at least I can make it a chapter in my load forecasting book.

