This is the culmination of many weeks of hard work. To recap the previous posts on this blog, I found the weighting function used in Santer et al. 2008 to produce synthetic MSU brightness temperature. I applied the function to AR4 climate model data and found it to match the data released by Santer. I’ve processed all the data into the same choice of spatial averages used by Santer (global, nh, sh, tropics, etc). The 23 models used in this analysis are
BCCR BCM 2.0, CCCMA CGCM 3.2 T47/T63, CNRM CM 3, CSIRO MK 3.0/3.5, GFDL CM 2.0/2.1, GISS AOM/EH/ER, IAP FGOALS 1.0g, INGV ECHAM 4, INM CM 3, IPSL CM 4, MIROC 3.2 hires/medres, MPI ECHAM 5, MRI CGCM 2.3.2a, NCAR CCSM 3/PCM 1, UKMO HADCM 3/HADGEM 1
There are a total of 53 runs. Santer didn’t use BCCR BCM 2.0, INGV ECHAM 4, or GISS AOM, nor did he use every run from the other models. The statistical methodology follows Santer et al. 2008. Two hypotheses will be tested: H1) the trend in a particular model run is consistent with the trend in the observational data, and H2) the multi-model mean trend is consistent with the trend in the observational data. For more details on the statistical tests, you can read the paper here. I also used the same method to account for autocorrelation in the residuals.
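In code, the H2 test works roughly as follows. This is a minimal sketch of my reading of the method (function and variable names are mine, not Santer’s): fit an OLS trend, inflate the trend’s standard error using an AR(1)-based effective sample size, and compare the observed trend against the multi-model mean.

```python
import numpy as np

def trend_and_adjusted_se(y):
    """OLS trend of a monthly series and its standard error,
    inflated for AR(1) autocorrelation in the residuals."""
    n = len(y)
    t = np.arange(n, dtype=float)
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    # lag-1 autocorrelation of the regression residuals
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
    # effective sample size: n * (1 - r1) / (1 + r1)
    n_eff = n * (1.0 - r1) / (1.0 + r1)
    # residual variance computed with n_eff instead of n
    s2 = np.sum(resid**2) / (n_eff - 2.0)
    se = np.sqrt(s2 / np.sum((t - t.mean()) ** 2))
    return slope, se

def d_star(obs, model_trends):
    """d* statistic: observed trend vs. multi-model mean trend.
    model_trends holds one trend per model (runs already averaged)."""
    b_obs, se_obs = trend_and_adjusted_se(obs)
    bm = np.asarray(model_trends, dtype=float)
    # inter-model standard error of the mean trend
    s_bm = bm.std(ddof=1) / np.sqrt(len(bm))
    return (bm.mean() - b_obs) / np.sqrt(s_bm**2 + se_obs**2)
```

The resulting d* is then referred to a Student-t distribution with (number of models − 1) degrees of freedom, which is where the 22 degrees of freedom below come from.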
The statistical test will be applied to MSU channels 2, 4 and 6, which correspond to the lower troposphere, the mid-troposphere, and the lower stratosphere, respectively. First, let’s look at the 1979-1999 period covered in the Santer paper. There are some differences to note.
If you compare this graph to the one in the Santer paper on page 1711, you’ll see they largely match up. You’ll also notice that the number of runs for each model doesn’t match because when I was processing the data, I only chose runs from 20C3M that have a corresponding run in A1B. I calculated the multi-model d* statistic to be 1.27 and 0.58 with degrees of freedom of 22 for both and p-values of 0.23 and 0.57, for UAH and RSS, respectively. Hereafter, every time I quote two identical statistics, like d* or p-values, they refer to UAH first and RSS second. These figures are fairly close to what is reported by Santer (d* = 1.11 and 0.37). Note d*1 in Santer et al. is simply d* here.
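For anyone wanting to reproduce the p-values from the d* statistics, the two-sided p-value comes from a Student-t distribution with N − 1 = 22 degrees of freedom (N = 23 models). A sketch, assuming scipy is available:

```python
from scipy.stats import t

def p_value(d_star, n_models=23):
    """Two-sided p-value for d* against a Student-t distribution
    with (n_models - 1) degrees of freedom."""
    return 2.0 * t.sf(abs(d_star), df=n_models - 1)
```

Plugging in d* = 1.27 and 0.58 gives values close to the 0.23 and 0.57 quoted above; small differences are likely just rounding in the intermediate statistics.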
What happens if we redo the calculations out to the present (data up to August 2009)? Below we see that the model-mean trends sit significantly farther above the observational trends than they did in the 1979-1999 period.
The d* statistics are 3.49 and 2.01 with, again, 22 degrees of freedom, and p-values of 0.002 and 0.056, respectively. At 5% significance, H2 is rejected relative to UAH but barely slips by with RSS. For H1, the two CCCMA CGCM models, GISS ER runs 4/5, and MIROC 3.2 hires run 1 are rejected relative to RSS (17% of the model runs). The situation is worse with UAH: the CCCMA models are rejected, as well as GISS AOM, GISS EH 1/2, GISS ER, the two MIROC models, MRI, NCAR CCSM 3 (except run 3), and UKMO HADGEM 1 (55% of the model runs).
If we consider the globe as a whole, the results are fairly similar.
How different are the test results when we look from the end of the 20C3M simulation up to the present? A world of difference. Starting with the lower troposphere,
H2 fails to reject at p = 0.159 and 0.208. The proportion rejecting under H1 is 5.7% for both UAH and RSS. For the middle troposphere, we get p = 0.111 and 0.085 with a rejected proportion of 9.4% and 13.2%.
So what explains these results? Are the models truly consistent with the observed “trend” over the last 9 years? I initially suspected that it was because of the autocorrelation in the regression residuals. I took a look at the AR(1) coefficients for the periods 1979-2009 and 2000-2009. Here’s the frequency distribution for both periods.
While the distribution of AR(1) coefficients in the second period is somewhat less concentrated at the high end, the coefficients are essentially the same in both periods; yet the standard errors of the observational trend estimates over Jan 2000 – Aug 2009 are roughly seven times their values over Jan 1979 – Aug 2009. The failure to reject may simply be weather noise reducing the power of the test.
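A back-of-envelope calculation (my own illustration, not from the paper) shows how much of that inflation comes from record length alone: with the same residual noise level and the same AR(1) coefficient, the adjusted trend SE scales roughly as n^(-3/2). The r1 = 0.6 below is a placeholder, not an estimate from the data.

```python
import numpy as np

def slope_se(n_months, sigma=1.0, r1=0.6):
    """Approximate AR(1)-adjusted SE of an OLS trend over n_months
    of monthly data with residual std sigma and AR(1) coefficient r1."""
    t = np.arange(n_months, dtype=float)
    sxx = np.sum((t - t.mean()) ** 2)
    # effective sample size under AR(1)
    n_eff = n_months * (1.0 - r1) / (1.0 + r1)
    # residual variance estimate using n_eff in place of n
    s2 = sigma**2 * n_months / (n_eff - 2.0)
    return np.sqrt(s2 / sxx)
```

Shortening the record from 368 months (Jan 1979 – Aug 2009) to 116 months (Jan 2000 – Aug 2009) inflates the SE by a factor of roughly 5-6 from record length alone; the rest of the observed ~7x factor presumably reflects differences in the noise itself.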
UPDATE (9/26): Chip Knappenberger asked in the comments about the end date of 20C3M. He believed the runs ended in 2000; I thought they had ended in 1999. He was right: most of the models end in Dec 1999. I re-ran my script for Jan 2001 – Aug 2009 to see how the results are affected. Here’s the rub: for the lower troposphere, H2 rejects with p-values of 0.011 and 0.012 for UAH and RSS, respectively. The proportion of H1 rejections is 24.5% for both UAH and RSS. Again, the standard error in the observational estimates is still many times larger than it was in the 1979-1999 period. As I said previously, I think the weather noise in the data over this (relatively) short period of time is seriously degrading the test’s power. Here’s a graph of the 2001-2009 period.