ROC Scores

Detailed information on the Relative Operating Characteristics (ROC) score is available in attachment II.8; much of the detail below is taken from that document. Users are encouraged to read the attachment for more details.

ROC is a means of testing the skill of categorical forecasts. The derivation of ROC is based on contingency tables giving the hit rate and false alarm rate for deterministic or probabilistic forecasts. The events are defined as binary, which means that only two outcomes are possible: an occurrence or a non-occurrence.

When the outcome of the LRF (long-range forecast) system is in two categories, the binary event is simply the occurrence of one of the two. When the outcome is in three (or more) categories, the binary event is defined as the occurrence of one category against the remaining ones, and in those circumstances ROC has to be calculated for each possible category, as the sketch below illustrates.
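As a concrete illustration, the following sketch (in Python; the labels and data are hypothetical, not from the attachment) shows the one-against-the-rest binarization for a three-category (tercile) outcome:

    # One-vs-rest binarization of a three-category (tercile) outcome.
    # Labels and outcomes are illustrative only.
    categories = ["B", "N", "A"]              # below, near, above normal
    outcomes = ["A", "B", "A", "N", "B"]      # hypothetical observed categories

    # For each category, build the binary event "this category occurred".
    binary_events = {
        cat: [1 if obs == cat else 0 for obs in outcomes]
        for cat in categories
    }
    # binary_events["A"] == [1, 0, 1, 0, 0]; ROC is then computed once per category.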

Deterministic Forecasts

Deterministic forecasts are simply binary forecasts, as mentioned above. They state that conditions in the forecast period will be above or below a certain criterion, say above or below the median, or above or below a certain percentile (e.g. a tercile boundary). If forecasts for that period have been made over a number of years (i.e. if we have a number of years of data), we can draw up a contingency table (Table 4 in attachment II.8):

                            Event observed          Event not observed
    Event forecast          O1 (hits)               NO1 (false alarms)
    Event not forecast      O2 (misses)             NO2 (correct rejections)

O1 represents the correct forecasts, or hits: (OF) is 1 when the event occurrence is observed and forecast; 0 otherwise.

NO1 represents the false alarms: (NOF) is 1 when the event occurrence is not observed but was forecast; 0 otherwise.

O2 represents the misses: (ONF) is 1 when the event occurrence is observed but not forecast; 0 otherwise.

NO2 represents the correct rejections: (NONF) is 1 when the event occurrence is not observed and not forecast; 0 otherwise.

These results are summed over all grid points or stations, so that, for example, O1 is the sum of (OF) over all points.

We can then calculate the Hit Rate (HR) and False Alarm Rate (FAR). These are simply proportions that tell us how well the forecast did when the event (say, above-median rainfall) was observed and, likewise, how well the scheme did when the event was not observed (remembering that a “null” forecast is nonetheless still a forecast). The hit rate is therefore defined as:

HR = O1 / ( O1 + O2 )

The range of values for HR goes from 0 to 1, 1 being desirable. An HR of one means that all occurrences of the event were correctly forecast.

The false alarm rate is defined as:

FAR = NO1 / ( NO1 + NO2 )

The range of values for FAR goes from 0 to 1, 0 being desirable.
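As a minimal sketch of the two definitions above (assuming paired binary forecast and observation series; the function and variable names are ours):

    # Hit rate and false alarm rate for binary deterministic forecasts.
    # f[i] = 1 if the event was forecast at point/station i, o[i] = 1 if it occurred.
    def hr_far(f, o):
        O1  = sum(1 for fi, oi in zip(f, o) if oi == 1 and fi == 1)  # hits
        O2  = sum(1 for fi, oi in zip(f, o) if oi == 1 and fi == 0)  # misses
        NO1 = sum(1 for fi, oi in zip(f, o) if oi == 0 and fi == 1)  # false alarms
        NO2 = sum(1 for fi, oi in zip(f, o) if oi == 0 and fi == 0)  # correct rejections
        return O1 / (O1 + O2), NO1 / (NO1 + NO2)

    hr, far = hr_far([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
    # hr = 2/3 (two of the three occurrences were forecast), far = 1/2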

Probabilistic Forecasts

Probabilistic forecasts are those that set a probability on an event occurring (e.g. there is a 65% probability of tercile-three rainfall). As for the deterministic ROC score, the probabilistic ROC score is calculated from the Hit Rate and False Alarm Rate, but here the HR and FAR are calculated for each probability interval. An event is said to be forecast at a point, for a given bin, when the forecast probability of the event falls within that bin's probability range (e.g. a forecast of above-median rainfall with a 43% probability falls in the 40-50% bin). The occurrences for a bin are the number of times a forecast probability fell in that bin and the event subsequently occurred (in the example, above-median rainfall was observed), while the non-occurrences are the number of times a forecast fell in that bin but the event did not occur (in the example, below-median rainfall was observed). We can again use ROC to test the validity of such forecasts, as shown in the table below (Table 5 in attachment II.8):

    Bin        Probability interval        Occurrences          Non-occurrences
    1          P1 to P2                    O1 = Σ Wi(O)         NO1 = Σ Wi(NO)
    ...        ...                         ...                  ...
    n          Pn to Pn+1                  On = Σ Wi(O)         NOn = Σ Wi(NO)
    ...        ...                         ...                  ...
    N          PN to PN+1                  ON = Σ Wi(O)         NON = Σ Wi(NO)

In the above table:

n = index of the nth probability interval (bin), running from 1 to N;
Pn = lower probability limit for bin n;
Pn+1 = upper probability limit for bin n;
N = number of probability intervals (bins).

(O) is 1 when an event corresponding to a forecast in bin n is observed as an occurrence; 0 otherwise. The summation is over all forecasts in bin n, at all grid points or stations.

(NO) is 1 when an event corresponding to a forecast in bin n is not observed; 0 otherwise. The summation is over all forecasts in bin n, at all grid points or stations.

Wi = 1 when verification is done at stations or single grid points; Wi = cos(Θi) at grid point i when verification is done on a grid, where Θi is the latitude at grid point i.
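In code, the bin counts On and NOn can be accumulated as in the sketch below (assuming 10% bins and simple Python lists of probabilities, outcomes and latitudes; the names are ours):

    import math

    N_BINS = 10  # 10% probability intervals, as in frequent practice

    # Accumulate weighted occurrence/non-occurrence counts per probability bin.
    # probs[i] = forecast probability of the event at point i (0..1),
    # occurred[i] = 1 if the event was observed there,
    # lats[i] = latitude in degrees (omit for station verification, so Wi = 1).
    def bin_counts(probs, occurred, lats=None):
        O  = [0.0] * N_BINS   # On: weighted occurrences per bin
        NO = [0.0] * N_BINS   # NOn: weighted non-occurrences per bin
        for i, (p, o) in enumerate(zip(probs, occurred)):
            w = math.cos(math.radians(lats[i])) if lats else 1.0  # Wi
            n = min(int(p * N_BINS), N_BINS - 1)  # bin index; p = 1.0 goes in the top bin
            if o:
                O[n] += w
            else:
                NO[n] += w
        return O, NO
    # A 43% forecast lands in bin index 4, i.e. the 40-50% interval, as in the example above.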

Hit rate and false alarm rate are calculated for each probability threshold Pn. The hit rate for the probability threshold Pn (HRn) is defined as:

HRn = ( On + On+1 + ... + ON ) / ( O1 + O2 + ... + ON )

and the false alarm rate (FARn) is defined as:

FARn = ( NOn + NOn+1 + ... + NON ) / ( NO1 + NO2 + ... + NON )

where n goes from 1 to N; that is, HRn sums the occurrences for all bins at or above the threshold Pn and divides by the total number of occurrences, and FARn does the same for the non-occurrences. The range of values for HRn goes from 0 to 1, 1 being desirable. The range of values for FARn goes from 0 to 1, zero being desirable. Frequent practice is to use probability intervals of 10%.
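Given those bin counts, the cumulative sums in the two formulas translate directly into code (a sketch continuing the hypothetical names above; note the bins are 0-indexed here):

    # HRn and FARn for each probability threshold Pn, cumulative from bin n upward.
    # Assumes at least one occurrence and one non-occurrence in total.
    def roc_points(O, NO):
        total_O, total_NO = sum(O), sum(NO)
        points = []
        for n in range(len(O)):
            hr  = sum(O[n:])  / total_O   # HRn
            far = sum(NO[n:]) / total_NO  # FARn
            points.append((far, hr))
        return points  # the N points of the ROC curve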

The ROC curve and ROC score

HR and FAR are calculated for each probability threshold Pn, giving N points on a graph of HR (vertical axis) against FAR (horizontal axis) to form the ROC curve. This curve must pass through points (0,0) and (1,1). No-skill forecasts are indicated by a diagonal line (where HR=FAR); the further the curve is towards the upper left-hand corner (where HR=1 and FAR=0) the better.

The area under the ROC curve is a commonly used summary statistic representing the skill of the forecast system. The area is standardised against the total area of the figure, such that a perfect forecast system has an area of one and a curve lying along the diagonal (no information) has an area of 0.5. The normalised area has become known as the ROC score.
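One common way to compute this normalised area is the trapezoidal rule applied to the (FAR, HR) points, with the obligatory end points (0,0) and (1,1) appended. A sketch, under the same assumptions as the earlier snippets:

    # ROC score: area under the ROC curve by the trapezoidal rule.
    def roc_score(points):
        pts = sorted(points + [(0.0, 0.0), (1.0, 1.0)])  # sort by FAR
        area = 0.0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between adjacent points
        return area  # 1.0 = perfect, 0.5 = no skill (the diagonal)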

A Simple Example

If we group the forecast probabilities for a particular year into terciles and call them above normal (A), normal (N) and below normal (B), we might see:

    Year        B        N        A
    1999        10%      20%      70%

If we decide that our threshold/cut-off for a ROC assessment of the skill in forecasting above-normal rainfall is 80%, then the 1999 forecast will not be counted as a forecast for an above normal (A) year but as a “not above normal” (B+N) year, and the 80% cut-off contingency table (as in the section on deterministic forecasts) will reflect this. If, however, we relax the threshold to 70%, the forecast is counted as an above normal (A) year in the 70% cut-off contingency table.
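In code, the cut-off decision is just a comparison; a sketch using the hypothetical 1999 forecast above:

    # Classify the 1999 forecast (B = 10%, N = 20%, A = 70%) at two cut-offs.
    prob_above = 0.70
    for cutoff in (0.80, 0.70):
        is_A = prob_above >= cutoff
        print(f"cutoff {cutoff:.0%}: counted as {'A' if is_A else 'not A (B+N)'}")
    # cutoff 80%: counted as not A (B+N)
    # cutoff 70%: counted as A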

If we want to plot a continuous graph of hit rate vs false alarm rate for a probabilistic forecast, we can do this in one of two ways. First, we can simply draw up contingency tables for each probability threshold (i.e. one table each for, say, >90%, >80%, >70%, etc.) and then plot each hit rate against its false alarm rate on the same graph. Alternatively (and possibly more quickly), we can count the correctly forecast events that occurred in each probability interval (say, 90-100%, 80-90%, 70-80%, etc.), as well as the “non-forecast” events (those that did not occur) in the same intervals. The hit rate for a certain probability threshold (say 80%) is then simply the sum of the forecast events at or above that level divided by the total number of forecast events; for our example, HR(80%) = ( F(90-100%) + F(80-90%) ) / Σ F(all bins). Likewise, the false alarm rate is the sum of the “non-forecast” events at or above the threshold divided by the total number of “non-forecast” events.

If at first this second method does not appear intuitively identical to the first, simply remember that, for the hit rate (which equals Hits / (Hits + Misses)), any misses in a given probability interval (say 80-90%) will be counted later in a lower probability interval. The misses are therefore included in the denominator by default, and the two methods give exactly the same answers, as the check below demonstrates.
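This equivalence can be checked numerically: per-threshold contingency tables (method one) give the same (HR, FAR) points as the cumulative bin sums (method two). A sketch, reusing bin_counts, roc_points and N_BINS from the earlier snippets, with made-up data:

    probs    = [0.95, 0.85, 0.85, 0.72, 0.45, 0.30, 0.15]
    occurred = [1,    1,    0,    1,    0,    1,    0]

    O, NO = bin_counts(probs, occurred)            # method two ingredients
    for n, (far2, hr2) in enumerate(roc_points(O, NO)):
        thr = n / N_BINS                           # Pn, the lower limit of bin n
        # Method one: a contingency table for the threshold "probability >= thr".
        hits   = sum(o for p, o in zip(probs, occurred) if p >= thr)
        misses = sum(o for p, o in zip(probs, occurred) if p < thr)
        fas    = sum(1 - o for p, o in zip(probs, occurred) if p >= thr)  # false alarms
        crs    = sum(1 - o for p, o in zip(probs, occurred) if p < thr)   # correct rejections
        hr1, far1 = hits / (hits + misses), fas / (fas + crs)
        assert abs(hr1 - hr2) < 1e-12 and abs(far1 - far2) < 1e-12  # both methods agree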

Calculation of ROC scores is mandatory for all levels.

An example of the ROC curve for Level 1 is shown below:

[Figure: example ROC curve for Level 1]


 

   