In this post I’ll revisit a topic I’ve already discussed multiple times. Part of my motivation is that I now have a synthetic MSU record spanning the late 19th century to the end of the 21st century, which lets me explore more issues. I’d also like to evaluate model performance in some way other than the Santer d-test.
Data Sources
The two obvious choices for observational data sets are RSS and UAH. I had some trouble deciding whether I should include radiosonde data, given its less-than-ideal coverage. I’ve looked at four radiosonde/radiosonde-derived data sets: IUK, Raobcore (v1.4), HadAT2 and HadAT2-derived TLT. Let’s look at maps of all four data sets and see how they compare in terms of their spatial coverage. First up is IUK. The last time step in the data is December 2005, so I will use that date for comparison with the three other series.
At the lowest pressure level, 850 hPa, the continental U.S. and Europe are well represented. Russia has pretty decent coverage. The southern hemisphere is very sparse. As you’ll notice, the data are restricted to land. The few data points apparently in the ocean are from sondes launched from islands. Next you’ll see that Raobcore has coarser resolution and apparently better coverage of the northern hemisphere.
The improved coverage is simply due to the fact that a much larger grid box is used so a lot of areas with missing data in IUK are joined in with boxes that do have data. Here’s the HadAT2 coverage which is noticeably better than IUK, but again, it’s only because a larger grid box is being used.
The HadAT2-derived synthetic TLT anomalies have atrocious coverage.
The sparse coverage arises because Hadley’s calculation of the synthetic TLT brightness temperature requires at least 80% of the data points through the atmospheric profile to be non-missing. The data are sparse enough that this constraint leaves a lot of grid points missing. I found it a bit funny to read this on the HadAT2 MSU page:
… these weighting functions provide investigators with just enough rope to go and hang themselves with.
Very true and nicely put.
If I were to include radiosonde data in my analysis, I’d want to interpolate to get total coverage over land. I would then reprocess the model-simulated temperature and calculate a land-only spatial average for the temperature at each pressure level and then the same for the brightness temperature. That’ll take a while (if I ever decide to do it) so in the interest of getting this post done, I’ll put that project on the back burner.
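For what it’s worth, the land-only averaging step could look something like the sketch below. It is only an illustration: the function name, the `land_mask` input and the cos(latitude) area weighting are my assumptions, and it skips the interpolation step entirely (missing boxes are simply dropped).

```python
import numpy as np

def land_only_mean(field, lats, land_mask):
    """Area-weighted mean of a (lat, lon) field over land grid boxes only.

    field     : 2-D array (n_lat, n_lon), NaN where data are missing
    lats      : 1-D array of grid-box centre latitudes in degrees
    land_mask : 2-D boolean array, True over land (hypothetical input)
    """
    # Grid boxes shrink toward the poles, so weight each row by cos(latitude).
    weights = np.cos(np.deg2rad(lats))[:, None] * np.ones_like(field)
    # Keep only land boxes that actually hold data.
    valid = land_mask & ~np.isnan(field)
    if not valid.any():
        return np.nan
    return np.sum(field[valid] * weights[valid]) / np.sum(weights[valid])

# Tiny made-up grid: 3 latitude rows, 2 longitude columns.
lats = np.array([-60.0, 0.0, 60.0])
field = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
land = np.array([[True, False], [True, True], [False, True]])
result = land_only_mean(field, lats, land)
```

The same function would be applied to the temperature at each pressure level and then to the brightness temperature, for both the sondes and the models.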
Observational and Model Data Eyeball Comparison
Tamino recently posted on model/observation comparison and showed that the spread of the AR4 2-m surface temperature agreed very well with the observational record (at least as far as GISS is concerned.) Below are similar graphs for TLT, TMT and TLS global averages. The model-simulated anomalies are relative to 1979-1998 to match the base period for UAH and RSS.
The multi-model mean for TLT and TMT shows no ENSO event in 1998 because although these events do occur in the models, the way the simulations are initialized guarantees no synchronization with the actual ENSO events in the real world. The only common non-linear features in the TLS data that are reproduced in the models are the two warming events spurred on by the eruptions of El Chichon (1982) and Mt Pinatubo (1991). While it does look good that the observational anomalies fall comfortably within the model uncertainty, albeit on the lower end, how does the model uncertainty change with time? Here’s a plot of the model spread over the course of the 20th-21st centuries.
The uncertainties grow pretty quickly. Below I show the multi-model standard deviation to make the changing uncertainty clearer. The uncertainty is smallest during the base period, as one would expect, because the anomalies are constrained to average to zero there. However, for the TLS data, the uncertainties grow significantly faster than for TLT and TMT.
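A toy example makes the base-period effect easy to see. This is not the AR4 data: the ensemble size and the range of trends are made-up numbers, and each "model" is just a straight line. Once every run is re-expressed as an anomaly from its own 1979-1998 mean, the spread across runs is pinched in the middle of the base period and grows in both directions away from it.

```python
import numpy as np

years = np.arange(1900, 2100)
n_models = 20  # hypothetical ensemble size

# Toy "models": straight lines with different (made-up) warming rates in K/yr.
slopes = np.linspace(0.01, 0.03, n_models)
runs = slopes[:, None] * (years - years[0])[None, :]

# Re-express each run as an anomaly from its own 1979-1998 mean.
base = (years >= 1979) & (years <= 1998)
anoms = runs - runs[:, base].mean(axis=1, keepdims=True)

# Multi-model standard deviation at each time step.
spread = anoms.std(axis=0, ddof=1)
year_of_min_spread = years[np.argmin(spread)]
```

Here `year_of_min_spread` lands in the middle of the base period purely because of the re-baselining, which is why a small spread there says little about how well the models agree.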
If the observational anomalies fall within the model uncertainty it doesn’t necessarily mean that the models are on track. If the uncertainty grows with time, then the models are disagreeing as to what to expect and the fact that they encompass observations loses any real meaning.
Observational and Model Data Statistical Comparison
Instead of comparing model and observational trends to test for systematic differences, I’ll look at the trend in the difference between the two. The two hypotheses to be tested are:
H1: The trend in the difference between any given climate model realization and observational data is zero.
H2: The multi-model ensemble mean difference trend is zero.
For H1, the standard errors are inflated by sqrt( (1 + r1)/(1 - r1) ), where r1 is the lag-1 autocorrelation of the regression residuals, to account for serial correlation. H1 is the standard hypothesis test that accompanies regression to determine the statistical significance of the coefficients.
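Here is a minimal sketch of the H1 calculation for a single model-minus-observation series. The function name is mine, the example series is synthetic, and the p-value uses a normal approximation rather than whatever reference distribution one might prefer, so treat it as an outline of the idea rather than the exact code behind the numbers in this post.

```python
import math
import numpy as np

def ar1_adjusted_trend(diff):
    """OLS trend of a difference series with an AR(1)-inflated standard error.

    diff : 1-D array, model minus observations at equally spaced time steps.
    Returns (slope, adjusted_se, r1, p_value).
    """
    y = np.asarray(diff, dtype=float)
    n = y.size
    t = np.arange(n, dtype=float)

    # Ordinary least squares fit y = a + b*t.
    t_c = t - t.mean()
    slope = np.sum(t_c * (y - y.mean())) / np.sum(t_c ** 2)
    intercept = y.mean() - slope * t.mean()
    resid = y - (intercept + slope * t)

    # Unadjusted standard error of the slope.
    se = math.sqrt(np.sum(resid ** 2) / (n - 2) / np.sum(t_c ** 2))

    # Lag-1 autocorrelation of the residuals.
    r1 = np.sum(resid[:-1] * resid[1:]) / np.sum(resid ** 2)

    # Inflate the SE by sqrt((1 + r1) / (1 - r1)).
    se_adj = se * math.sqrt((1.0 + r1) / (1.0 - r1))

    # Two-sided p-value from a normal approximation (an assumption of this sketch).
    p = math.erfc(abs(slope / se_adj) / math.sqrt(2.0))
    return slope, se_adj, r1, p

# Synthetic example: a known trend plus AR(1) noise (all numbers made up).
rng = np.random.default_rng(42)
n = 360
noise = np.zeros(n)
for i in range(1, n):
    noise[i] = 0.6 * noise[i - 1] + rng.normal(0.0, 0.1)
diff = 0.01 * np.arange(n) + noise
slope, se_adj, r1, p = ar1_adjusted_trend(diff)
```

With positively autocorrelated residuals the inflation factor is greater than one, so the adjusted test rejects less readily than plain OLS would.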
First let’s have a look at the trends with error bars compared with RSS and UAH. The black dashed line is the mean MSU trend. The grey and dark grey bands are the mean MSU trend ± 1 SE and ± 2 SE, respectively. The light grey dashed line is the zero point.
Here are the trends in the difference between the models and RSS and UAH. Again, the light grey dashed line is the zero point.
Hypothesis Test Results and Discussion
| Obs | TLT | TMT | TLS |
|---|---|---|---|
| RSS | 0.565 | 0.652 | 0.609 |
| UAH | 0.652 | 0.783 | 0.739 |
Table 1: Proportion of model realizations for which H1 is rejected, by observational data set and MSU channel.
Looking at models individually, the H1 rejection rates are alarmingly high. As one would expect, they’re higher relative to UAH than RSS given that UAH shows a smaller trend. These results are very similar to my last analysis where rejection rates for TLT relative to RSS/UAH were 0.579/0.649. However, they differ very much for the mid-troposphere where the rates relative to RSS/UAH were 0.246/0.368.
| Obs | TLT | TMT | TLS |
|---|---|---|---|
| RSS | 0.110 | 0.079* | 0.264 |
| UAH | 0.058* | 0.023** | 0.159 |
Table 2: p-value for H2 test. (Significance labels: * significant at 10%, ** significant at 5%, *** significant at 1%)
The p-values for the H2 test show that the models aren’t as dead in the water as many believe. While the p-values for TLT and TMT aren’t as high as their TLS counterparts, there is only one rejection at 5% significance. In the last analysis I referred to, the p-values for the d-test for TLT relative to RSS and UAH were 0.001 and 0.000, respectively, which differ markedly from the current analysis. For the mid-troposphere, relative to RSS/UAH, the p-values were 0.083 and 0.017. These figures are relatively similar to those in the current analysis. That last analysis didn’t include the lower stratosphere so I have nothing to compare to the new numbers.
P.S. I hate WordPress because the page never looks as good as it does in the editor!



































