Sunday, February 10, 2013

The Hidden Outliers: How to Find and Handle Them

In a forecasting project, how much time do you spend dealing with your data?
If your answer is "less than 50%", I can think of three reasons: 1) someone cleaned the data for you; 2) you have some magical tool; 3) you didn't do a good job identifying all the problems in your data.
As a consultant specializing in load forecasting, I have had the good fortune of working with various datasets (load, wind, solar, price, economy, weather, demand response, etc.) from many utilities in all sectors of the industry all over the world.
My answer is "80%, at least."
I'm not talking about the ETL (extract, transform, load) work, which usually takes a tiny portion of that 80%. I'm talking about the analytical work, more specifically outlier detection and data cleansing, which takes the majority of it.
What are the outliers?
If the annual peak of a utility is around 4.8GW, and there is a 4800GW load in the load series, we can easily tell it's an outlier. Most often the error is caused by a misplaced decimal point. Obvious outliers like this are easy to find and fix through basic summary statistics and queries. Usually this is done in the first round, even before building a model.
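As a rough illustration of that first-round check, here is a minimal Python sketch. The function name `flag_obvious_outliers`, the threshold `ratio=10.0`, and the sample readings are all made up for this example; a real check would use whatever limits make sense for the utility at hand.

```python
from statistics import median

def flag_obvious_outliers(loads_gw, ratio=10.0):
    """Return the indices of readings that are negative or more than
    `ratio` times the median load. Both rules are illustrative assumptions."""
    typical = median(loads_gw)
    return [i for i, x in enumerate(loads_gw) if x < 0 or x > ratio * typical]

# Hypothetical hourly loads in GW, with one misplaced decimal point.
hourly_load_gw = [3.1, 3.4, 4800.0, 4.2, 4.8, 3.9]
print(flag_obvious_outliers(hourly_load_gw))  # prints [2]
```

The median is used instead of the mean because a single 4800GW reading would drag the mean far away from the typical load.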
Sometimes, there are missing values in the load or weather data. They are easy to find, but nontrivial to fix. A common way to fix them is to use simple methods, such as the average of two adjacent readings, linear interpolation, or cubic splines. This works OK sometimes, but may create trouble later on.
The most interesting outliers are the hidden outliers. They are difficult to detect and hard to fix. Some of them actually originate from the simple fixes applied to missing values; some are correct readings during special events such as storms; some are due to incorrect but "not-too-bad" readings...
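To make the simple fix concrete, here is a small Python sketch of a straight-line fill between the nearest known readings. The function name `fill_linear`, the use of `None` to mark a missing value, and the sample series are assumptions for illustration only.

```python
def fill_linear(series):
    """Fill runs of None with a straight line between the nearest known
    neighbors. Gaps at the very start or end are left as None."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        if gap > 1:
            step = (series[right] - series[left]) / gap
            for k in range(1, gap):
                filled[left + k] = series[left] + step * k
    return filled

# Hypothetical load readings in GW with three missing values.
load = [3.0, None, 4.0, None, None, 5.5]
print(fill_linear(load))  # prints [3.0, 3.5, 4.0, 4.5, 5.0, 5.5]
```

The filled values look perfectly reasonable, which is exactly the risk: if the gap happened to cover a peak or a valley, the straight line hides it, and the filled points may end up as outliers that no summary statistic will flag.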
In this webinar, I'll talk about how to find and handle outliers, with a focus on the hidden ones. Here is a tentative agenda:
- why outliers can kill a model
- what are the outliers
- how simple statistical methods work, and when they don't work
- an advanced approach to identifying outliers
- how to handle outliers: dos and don'ts
Please go to the webinar page to register if you are interested. You are more than welcome to share with me your stories about outlier detection and data cleansing.
