AR4 Model Hypothesis Tests

This is the culmination of many weeks of hard work. To recap the previous posts on this blog: I found the weighting function used in Santer et al. 2008 to produce synthetic MSU brightness temperatures, applied the function to AR4 climate model data, and found the result to match the data released by Santer. I’ve processed all the data into the same spatial averages used by Santer (global, NH, SH, tropics, etc.). The 23 models used in this analysis are

BCCR BCM 2.0, CCCMA CGCM 3.2 T47/T63, CNRM CM 3, CSIRO MK 3.0/3.5, GFDL CM 2.0/2.1, GISS AOM/EH/ER, IAP FGOALS 1.0g, INGV ECHAM 4, INM CM 3, IPSL CM 4, MIROC 3.2 hires/medres, MPI ECHAM 5, MRI CGCM 2.3.2a, NCAR CCSM 3/PCM 1, UKMO HADCM 3/HADGEM 1

There are a total of 53 runs. Santer didn’t use BCCR BCM 2.0, INGV ECHAM 4 or GISS AOM, nor all of the runs for the other models. The statistical basis of this analysis is Santer et al. 2008. Two hypotheses will be tested: H1) the trend in a particular model run is consistent with the trend in the observational data, and H2) the multi-model mean trend is consistent with the trend in the observational data. For more details on the statistical tests, you can read the paper here. I also used the same method to account for autocorrelation in the residuals.
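For readers who want to follow along, here is a minimal sketch of the per-run (H1) test as I understand it from the paper: fit an OLS trend to each monthly series, deflate the sample size by the lag-1 autocorrelation of the residuals when computing the trend’s standard error, and form the d statistic from the difference of trends. The function names and the normal reference distribution are my assumptions, not code from this analysis.

```python
import numpy as np
from scipy import stats

def trend_with_adjusted_se(y):
    """OLS trend of a monthly series and its standard error, with the
    sample size deflated for lag-1 autocorrelation of the residuals
    (the adjustment described in Santer et al. 2008)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    t = np.arange(n, dtype=float)
    b, a = np.polyfit(t, y, 1)                      # slope, intercept
    resid = y - (a + b * t)
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]   # lag-1 autocorrelation
    n_eff = n * (1.0 - r1) / (1.0 + r1)             # effective sample size
    s2 = np.sum(resid**2) / (n_eff - 2.0)           # adjusted residual variance
    se_b = np.sqrt(s2 / np.sum((t - t.mean())**2))  # adjusted SE of the trend
    return b, se_b

def h1_test(run, obs):
    """d statistic for H1: is a single run's trend consistent with the
    observed trend? Two-tailed p against a standard normal (my reading
    of the paper's test; treat it as an assumption)."""
    b_m, se_m = trend_with_adjusted_se(run)
    b_o, se_o = trend_with_adjusted_se(obs)
    d = (b_m - b_o) / np.hypot(se_m, se_o)
    return d, 2.0 * stats.norm.sf(abs(d))
```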

The statistical tests will be applied to the synthetic MSU channels TLT, TMT and TLS, which correspond to the lower troposphere, the middle troposphere and the lower stratosphere, respectively. First, let’s look at the 1979-1999 period covered in the Santer paper. There are some differences to note.


If you compare this graph to the one in the Santer paper on page 1711, you’ll see they largely match up. You’ll also notice that the number of runs for each model doesn’t match, because when I was processing the data I only chose runs from 20C3M that have a corresponding run in A1B. I calculated the multi-model d* statistic to be 1.27 and 0.58, with 22 degrees of freedom for both and p-values of 0.23 and 0.57, for UAH and RSS, respectively. Hereafter, every time I quote a pair of statistics, like d* or p-values, they refer to UAH first and RSS second. These figures are fairly close to what is reported by Santer (d* = 1.11 and 0.37). Note that d*1 in Santer et al. is simply d* here.
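For the multi-model (H2) test, here is a sketch of how I read d*: the multi-model mean trend minus the observed trend, divided by the combined standard error, with the inter-model spread entering through the standard error of the mean. With 23 models this gives the 22 degrees of freedom quoted above. Treat this as my reading of the paper, not the exact script used here.

```python
import numpy as np
from scipy import stats

def h2_test(model_trends, b_o, se_o):
    """d* for H2: is the multi-model mean trend consistent with the
    observed trend b_o (standard error se_o, adjusted as above)?"""
    m = len(model_trends)                          # 23 models here
    b_bar = np.mean(model_trends)                  # multi-model mean trend
    # inter-model standard error of the mean trend
    se_mm = np.std(model_trends, ddof=1) / np.sqrt(m)
    d_star = (b_bar - b_o) / np.hypot(se_mm, se_o)
    # two-tailed p from Student's t with m - 1 dof (22 for 23 models)
    p = 2.0 * stats.t.sf(abs(d_star), df=m - 1)
    return d_star, p
```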

What happens if we redo the calculations out to the present (data up to August 2009)? Below we see that the model mean trends exceed the observational trends by more than they did in the 1979-1999 period.


The d* statistics are 3.49 and 2.01 with, again, 22 degrees of freedom, and p-values of 0.002 and 0.056, respectively. At 5% significance, H2 is rejected relative to UAH but barely slips by with RSS. For H1, the two CCCMA CGCM models, GISS ER runs 4/5 and MIROC 3.2 hires run 1 are rejected relative to RSS (or 17% of the model runs). The situation is worse with UAH: the CCCMA models are rejected, as well as GISS AOM, GISS EH 1/2, GISS ER, the two MIROC models, MRI, NCAR CCSM 3 except run 3, and UKMO HADGEM 1 (or 55% of the model runs).
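Tallying rejection rates like the ones above is then just a matter of running the per-run test over every series. The containers below are hypothetical stand-ins for the 53 model series and an observational series:

```python
# Hypothetical containers: `runs` maps a run label to its monthly series,
# `obs` is the matching observational series (UAH or RSS).
results = {label: h1_test(series, obs) for label, series in runs.items()}
rejected = sorted(label for label, (d, p) in results.items() if p < 0.05)
print(f"{len(rejected)} of {len(runs)} runs rejected at 5% "
      f"({100.0 * len(rejected) / len(runs):.1f}%)")
```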

If we consider the globe as a whole, the results are fairly similar.

[Figure: TLT_global_Jan-1979_Aug-2009_RSS]

H2 is still rejected in both data sets, with p-values of 0.000 and 0.002. The proportion of H1 rejections is 54.7% and 45.3%. Let’s go back to the tropics and look at the middle troposphere.

[Figure: TMT_tropics_Jan-1979_Aug-2009_RSS]

H2 is again rejected, with p-values of 0.000 and 0.020. The proportion of rejections is 60.4% and 26.4%. The models thus far have taken quite a beating. But what about the stratosphere?

[Figure: TLS_tropics_Jan-1979_Aug-2009_RSS]

We find very different results. H2 fails to reject at p = 0.063 and 0.218. However, H1 still has a high rejection rate of 47.2% and 34.0%.

How different are the test results when we look from the end of the 20C3M simulation up to the present? A world of difference. Starting with the lower troposphere,

[Figure: TLT_tropics_Jan-2000_Aug-2009_RSS]

H2 fails to reject at p = 0.159 and 0.208. The proportion rejecting under H1 is 5.7% for both UAH and RSS. For the middle troposphere, we get p = 0.111 and 0.085 with rejected proportions of 9.4% and 13.2%.

[Figure: TMT_tropics_Jan-2000_Aug-2009_RSS]

For the stratosphere, p = 0.969 and 0.934, and none of the individual models were rejected.

[Figure: TLS_tropics_Jan-2000_Aug-2009_RSS]

So what explains these results? Are the models truly consistent with the observed “trend” over the last nine years? I initially suspected the autocorrelation in the regression residuals, so I took a look at the AR(1) coefficients for the periods 1979-2009 and 2000-2009. Here’s the frequency distribution for both periods.

[Figure: ar1_frequency_distribution]

While the inflation of the standard errors in the second period is less concentrated on the high side, the standard errors in the observational data over Jan 2000 – Aug 2009 are roughly seven times their values over Jan 1979 – Aug 2009, even though the AR(1) coefficients are essentially the same. This failure to reject may simply be weather noise decreasing the power of the test.
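A back-of-the-envelope check on that, assuming roughly comparable residual variance in the two periods: the autocorrelation adjustment inflates the trend’s standard error by about sqrt((1+r1)/(1-r1)), which is unchanged if r1 is the same in both periods, while the white-noise part of the trend SE scales roughly as n^(-3/2) with record length. Shortening the window from 368 to 116 months therefore accounts for most of the factor of seven by itself.

```python
import numpy as np

def ar1_se_inflation(r1):
    """SE inflation implied by n_eff = n * (1 - r1) / (1 + r1): the
    trend SE grows by ~sqrt((1 + r1) / (1 - r1)) over the white-noise
    case, independent of record length."""
    return np.sqrt((1.0 + r1) / (1.0 - r1))

# The white-noise trend SE scales as sigma * sqrt(12 / n**3), so cutting
# the record from 368 months (Jan 1979 - Aug 2009) to 116 months
# (Jan 2000 - Aug 2009) multiplies the SE by roughly:
print((368.0 / 116.0) ** 1.5)   # ~5.7, before any change in sigma or r1
```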

UPDATE (9/26): Chip Knappenberger asked in the comments about the end date of 20C3M. He said he believed the runs ended in 2000; I thought they had ended in 1999. He was right: most of the models end in Dec 2000. I re-ran my script for Jan 2001 – Aug 2009 to see how the results are affected. Here’s the rub: for the lower troposphere, H2 rejects with p-values of 0.011 and 0.012 for UAH and RSS, respectively. The proportion of H1 rejections is 24.5% for both UAH and RSS. Again, the standard error in the observational estimates is still many times larger than it was in the 1979-1999 period. As I said previously, I think the weather noise in the data over this (relatively) small period of time is seriously degrading the test’s ability to perform adequately. Here’s a graph of the 2001-2009 period.

         UAH         RSS
d*      -0.0398     -0.0827
dof     22.3994     22.6161
p        0.9686      0.9348

17 Responses to AR4 Model Hypothesis Tests

  1. lucia says:

    Are you going to write it up for publication? Or be lazy like me? :)

  2. lucia says:

    This failure to reject may simply be weather noise decreasing the power of the test.

    Yes. Type II error is typically high for short trends and decreases as you have more data.

  3. Pingback: The Blackboard » Temperatures of the Tropical Troposphere: Chad brings Santer up to 2008.

  4. Dan Hughes says:

    An excellent piece of work.

  5. pcknappenberger says:


    How did you decide that the 20C3M runs ended in December 1999? I know that Santer et al. said that most of them ended then, but did you find that to be the case?

    I have been led to believe that the 20C3M runs ended in December 2000?

    Any insight into this distinction gained from using the PCMDI databases (rather than the Climate Explorer database that I have relied upon) would be much appreciated!


    • Chad says:

      I’ve been working with the gridded surface and atmospheric temperature data for about a year now, and that’s the impression I’ve built up over time, but I have to admit I did not systematically check the end dates. I’ll look into it now.

    • Chad says:

      You are right. Most of the simulations end in Dec 2000. A few end in 1999. GISS ER ends in 2003.

  6. pcknappenberger says:

    Thanks, Chad. That is information I can use.

    Do you have the surface temps from the same runs?

    I’d be interested in the TLT/surface trend ratios. I’ve seen the amplification factor pegged at about 1.2 for the global average. Does this seem reasonable given what you have?

    Great work!


  7. vg says:

    It looks like climate modelling etc. won’t matter anymore as the basis for AGW doesn’t exist anymore. See CA and WUWT etc…

  8. Pingback: Santer Update – Another Dead Paper « the Air Vent

  9. Pingback: Crazy Quotes From Under the Science Cloak « the Air Vent

  10. Pingback: AR4 Model Hypothesis Tests Results: Now with TAS! « Trees for the Forest

  11. Pingback: The Blunt Skippy « The Whiteboard

  12. Pingback: Christy actualiza la disparidad modelos – realidad «
