Better Late Than Never

I’m sure many of you have taken notice of my absence. I’ve been very busy working on other projects whose results you may see published sometime in the (hopefully) near future. For the last week or so, I’ve been writing programs to process the GHCN station data into a gridded product. Though this has already been done by many, I’ll show you my results anyhow. I’ll briefly go over the procedure and then discuss some gridding issues that came up while analyzing the data.

Data and Methods

  1. Load the station inventory (v2.temperature.inv) and data (v2.mean) into memory.
  2. Consolidate each station’s duplicate entries (if applicable) by computing a monthly offset for each station such that the sum of the squared differences between the series, after the offsets are applied, is at a minimum (see Tamino and Roman; a sketch of the objective function follows this list). The offsets align the series, which are then averaged together.
  3. Determine which stations are in each gridbox. I initially used a 5° x 5° grid.
  4. Consolidate all the stations in each grid box using the same methodology as in step 2.
  5. Calculate the climatology at each grid point and remove it to get temperature anomalies. The base period is 1961-1990 and I require a minimum of 20 years of non-missing data to calculate a valid climatological mean.
  6. Calculate spatial averages for the global, NH, SH and the tropics.
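
To spell out the least-squares fit in step 2, here is one common way to write the objective (along the lines of Roman M’s formulation; the notation is only illustrative):

$$\min_{\{\mu_k\},\,\{\tau_t\}} \;\; \sum_{k}\sum_{t} w_{kt}\,\bigl(x_{kt}-\mu_k-\tau_t\bigr)^{2}$$

where $x_{kt}$ is series $k$’s value in month $t$, $\mu_k$ is the offset for series $k$, $\tau_t$ is the common monthly signal, and $w_{kt}$ is 1 where data are present and 0 otherwise. The offset series $x_{kt}-\mu_k$ are then averaged together.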


Normally I would impose a constraint on how much area needs to be represented by non-missing values to calculate a valid spatial average. Unfortunately, when I was writing my program I was more worried about getting it to work properly and neglected to include a land mask to determine how much land is present in the four regions that I calculated averages for. When I calculated the spatial averages, I also made sure to hold on to the summed area represented by the non-missing data to see how it varies with the overall average. First let’s see how my global average compares to the results of Jeff ID/Roman M, Nick Stokes, Zeke and Residual Analysis.
My reconstruction is consistent with the others so that gives me confidence that I didn’t seriously botch the calculations. Let’s look at the global average over the entire time period (1701-2010).
The first 150 years of the series show much larger variability than the rest of the series. Most likely this is because the sampling was sparse enough to allow regional or small-scale variability to dominate the ‘global’ average. The area fraction data was of some concern. This is the sum of surface area accounted for by grid boxes with non-missing data, normalized to the total amount of global land area. Why is it greater than one? I think this is a non-obvious error that everyone who has created a gridded product of GHCN has made. If a grid box contains one or more stations and it is completely occupied by land, then weighting it according to the surface area of that grid box is correct. When there is some ocean present, the weight that is applied is too large. Let’s see how this issue affects the other spatial averages.
The haphazard weighting affects all the spatial averages but is strongest in the tropics. I re-ran the spatial averages using a land mask to properly adjust the area weights and compared the normalized area fraction both ways.
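The fix itself is straightforward. Here is a stripped-down sketch of the idea (not the actual program: the grid layout, array names and missing-value flag are placeholders):

program land_weighted_mean
  implicit none
  integer, parameter :: nlon = 72, nlat = 36            ! 5 x 5 degree grid
  real(8), parameter :: pi  = 3.141592653589793d0
  real(8), parameter :: d2r = pi/180.0d0
  real(8), parameter :: missing = -9999.0d0
  real(8) :: anom(nlon,nlat), landfrac(nlon,nlat)
  real(8) :: lat_s, lat_n, cellarea, w, wsum, asum
  integer :: i, j

  ! ... fill anom with gridded anomalies and landfrac with the fraction
  ! of each cell that is land (taken from a land mask) ...
  anom = missing;  landfrac = 0.0d0

  wsum = 0.0d0;  asum = 0.0d0
  do j = 1, nlat
     lat_s = -90.0d0 + (j-1)*5.0d0                      ! southern edge of band j
     lat_n = lat_s + 5.0d0                              ! northern edge
     do i = 1, nlon
        ! fraction of the sphere covered by this cell
        cellarea = (sin(d2r*lat_n) - sin(d2r*lat_s))*(5.0d0*d2r)/(4.0d0*pi)
        if (anom(i,j) /= missing) then
           w    = cellarea*landfrac(i,j)                ! land-masked weight
           wsum = wsum + w
           asum = asum + w*anom(i,j)
        end if
     end do
  end do
  if (wsum > 0.0d0) print *, 'land-weighted mean anomaly = ', asum/wsum
end program land_weighted_mean

Weighting by cell area alone amounts to setting landfrac to one everywhere, which is exactly the over-weighting described above.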
Now the area fraction doesn’t take on physically unrealistic values. Given that many bloggers are now combining the land and ocean data, this issue shouldn’t be as serious, but we still need to know how much land/ocean is in a grid box to properly combine land and ocean anomalies corresponding to the same grid box. This area fraction bias would certainly be reduced by using a finer grid. To test this, I re-ran my program at 2.5° x 2.5° resolution and compared the area fractions.
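(On the land/ocean point above: for a coastal grid box with land fraction $f$, one natural way to combine the two anomalies is the land-fraction-weighted blend

$$T_{\text{box}} \;=\; f\,T_{\text{land}} + (1-f)\,T_{\text{ocean}},$$

which reduces to the pure land or pure ocean value when $f$ is 1 or 0. That is why the land fraction per grid box is still needed even when land and ocean data are combined.)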
As expected, using a finer grid reduced the area bias and brought it more in line with the correct figures. What difference does this bias make in the spatial averages and their trends?
Over the course of the 20th century up to the present, the bias in the global and NH/SH averages is statistically significant though practically negligible. In the tropics, however, the bias is fairly large. Here’s the same data and trends over the most recent 30-year period.
Over the recent period, the biases are statistically and practically significant, with the most extreme bias occurring in the tropics. The trends are positive, which means that the improperly weighted procedure produces anomalies over this period whose trends are too small. Now let’s see what happens when the spatial averages are calculated on a finer grid.
Over the 20th century up to the present the bias is still of no practical significance.
Over the more recent period the biases are still of some practical significance but not as large. Here are the spatial averages with the correct weighting and their trends.
UPDATE May 23, 2010: The trends below are wrong. The uncertainty is one standard error, not two as I intended.
Here’s the same graph but from the 2.5° x 2.5° data.
In all regions except the tropics, the finer gridding brought down the trends. Why? That’ll have to be the topic of another post.
P.S. Does anyone like the new theme?
Update- May 19, 2010
I’ve uploaded the annual averages and the Fortran code. As per Carrick’s request, I’ve created plots of the weighted mean latitude of all non-missing grid cells. Below are three plots covering three different periods: 1701-2010, 1850-2010 and 1950-2010.

71 Responses to Better Late Than Never

  1. Marcus says:

    “In all regions except the tropics, the finer gridding brought down the trends.”

    Presumably because we have denser station placement in regions with cooler trends (e.g., continental US, coastal zones), and therefore a 5×5 degree grid in an area with dense station placement becomes four 2.5×2.5 grids, whereas in a region with sparse placement (like much of Canada and Siberia) it may become only one or two 2.5×2.5 grids?

    (also not surprising that using a land-mask raises the trends, because coasts warm more slowly than interiors due to GHG forcing… that trend might not be as true if the warming was due to internal variability)

  2. Marcus stole my thunder on explaining the grid size-trend relationship :P

    The main issue is that any area not covered in a grid cell is implicitly assumed to have the global mean temp, so if there is more area in higher lat regions not covered (due to lower station density), it will bias the temp trend cool due to the fact that high lat areas are warming faster.
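
    To put that implicit assumption in symbols (notation just for illustration): if a fraction $c$ of the land is covered and has mean anomaly $T_{\text{cov}}$, while the uncovered remainder really has $T_{\text{uncov}}$, then averaging only over the covered cells differs from the true land average by

    $$\hat{T} - T_{\text{true}} \;=\; (1-c)\,\bigl(T_{\text{cov}} - T_{\text{uncov}}\bigr),$$

    which is exactly what you get by pretending the uncovered area has the covered-area mean. If the uncovered (mostly high-latitude) area is warming faster, the estimated trend comes out too low.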

    Nice job on the analysis Chad. Could you stick a text file up somewhere giving the annual anomalies so I can add it to my summary graph?

  3. Carrick says:

    Chad, would you compute for us the mean latitude of the non-zero grid cells versus time? I’m interested in seeing how that varies for you (I have done this for CRUTEMP).



  4. Interesting. Would be nice to know the “true” mean latitude of all land grid cells for reference, since I assume it’s non-zero (the N Hemisphere having more land area than the S Hemisphere and whatnot).

  5. Here is a pretty version of ye olde spaghetti graph:

    And the three most similar lines (mine, Nick’s, and yours):

    Looks like the land mask has its largest effect on recent temps (or something else is causing your series to trend a bit higher than ours).

    Spreadsheet with land, ocean, and land/ocean temps for all series (when available) is here:

  6. Pingback: The Blackboard » Another land temp reconstruction joins the fray

  7. Carrick says:

    Hi Zeke,

    Land is biased to the northern hemisphere. I get a mean value of 15.5°N.

    The reason this is interesting is because of the polar amplification of land temperature trends that Zeke and I have separately posted figures on.

    Formula for mean latitude bias

    N(theta) is the number of non-zero cells at a given latitude. Since we are looking at mean value in terms of surface area, you need to throw in the cos(theta) factor. If somebody needs help turning this into summations let me know.
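
    One way to write this as summations over latitude bands:

    $$\bar{\theta} \;=\; \frac{\sum_{\theta}\theta\,N(\theta)\cos\theta}{\sum_{\theta}N(\theta)\cos\theta},$$

    with the sums running over the band-centre latitudes.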

    • Chad says:

      I calculated it differently. I calculated the number of grid cells in each latitude band that have non-missing data. I divided those numbers by their sum to normalize to one. I then multiplied the latitude values by those weights and summed. But we got mostly similar results.

      • Carrick says:

        Either way shows the effect. But I presume when you do your global average you are including the cos(theta) correction right?

        If so, it’s slightly more germane to use the cos(theta), especially if you want to use a figure like mine to estimate the error in the temperature trend from a particular area weighting scheme.

        CRU has a big systematic error in their data prior to 1950 because they just average over all non-zero cells (correcting for the cos(theta) effect):

        for ( my $j = 0 ; $j < $Field->{columns} ; $j++ ) {
            if ( $Field->{data}->[$i][$j] == $Field->{missing} ) { next; }
            if ( $Lat > 0 ) {
                $W_tot_NH += $Weight;
                $A_tot_NH += $Field->{data}->[$i][$j] * $Weight;
            }
            else {
                $W_tot_SH += $Weight;
                $A_tot_SH += $Field->{data}->[$i][$j] * $Weight;
            }
        }

        • Carrick says:

          I meant to include this line in the listing:

          my $Weight = cos( $Lat * 3.141592 / 180 );

          It’s one line above the code snippet I showed. The “my” keyword in perl means a variable local to that block or function, for people who don’t grok perl.

        • Chad says:

          When I first started calculating spatial averages, I never used (at least explicitly) the cosine weighting. I saw a reference to it in some paper by Jones but I didn’t know what it meant. What I do is create an array with dimensions nlon x nlat, generated with the following Fortran code.

          DO j = 1, SIZE(lat_bounds,2)
            DO i = 1, SIZE(lon_bounds,2)
              area(i,j) = (sin(d2r_d*lat_bounds(1,j)) - sin(d2r_d*lat_bounds(2,j)))* &
                        & (lon_bounds(2,i) - lon_bounds(1,i))
            END DO
          END DO

          area = d2r_d*ABS(area)/(4.0*pi_d)

          The d2r_d variable is a double precision conversion factor from degrees to radians. I then multiply the array by d2r_d to convert the lon_bounds (it makes no sense to do this nlon*nlat times inside the do loop). The result is divided by 4*pi to normalize to a unit sphere, so it’s really a surface area fraction rather than a surface area. It’s easier to do it this way because you can perform array operations like masking, which makes the programming simpler (fewer do loops in the end).
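
          For reference, the exact surface-area fraction of a cell bounded by longitudes $\lambda_1,\lambda_2$ (in radians) and latitudes $\varphi_1,\varphi_2$ is

          $$\frac{A}{4\pi R^{2}} \;=\; \frac{(\lambda_{2}-\lambda_{1})\,\bigl(\sin\varphi_{2}-\sin\varphi_{1}\bigr)}{4\pi},$$

          which is what the code above computes, with the degree-to-radian conversion applied once outside the loop.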

        • Carrick says:

          Chad, minor point, but there is no real gain in precision in writing the cos(theta) factor in terms of sin(theta2) - sin(theta1).

          It’s exact for the integral when you go from theta1 to theta2… when the integrand is otherwise a constant:

          integral(cos(theta) dtheta, {theta,theta1,theta2}) = sin(theta2) - sin(theta1)

          of course. But the error for dtheta = 5° is only 0.03% even for that case.

          Without going into the math, you’d need a more substantial repertoire of replacement expressions than just sin(theta2) - sin(theta1) to improve on the accuracy of dtheta * f(theta) * cos(theta) as an approximation to

          integral(f(theta) cos(theta) dtheta, {theta,theta1,theta2})
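
          For anyone who wants to check the 0.03% figure: expanding about the band midpoint $\bar{\theta}$ with $\Delta\theta = \theta_2 - \theta_1$,

          $$\sin\theta_{2}-\sin\theta_{1} \;=\; 2\cos\bar{\theta}\,\sin\!\left(\tfrac{\Delta\theta}{2}\right) \;\approx\; \Delta\theta\,\cos\bar{\theta}\left(1-\tfrac{(\Delta\theta)^{2}}{24}\right),$$

          so for $\Delta\theta = 5^{\circ} \approx 0.0873$ rad the relative difference between the exact weight and $\Delta\theta\,\cos\bar{\theta}$ is $(\Delta\theta)^{2}/24 \approx 3\times10^{-4}$, i.e. about 0.03%.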

        • Carrick says:

          It’s exact if f(theta) = constant, which was my point above. (The derivation is a 2nd semester calc result.)

          For f(theta) != constant it is not in general a numerical improvement over dtheta * f(theta) * cos(theta)

          And even when it’s “exact” the error is still completely negligible (0.03% for dtheta = 5° is certainly negligible compared to other errors in climate science).

          This is just a minor point, feel free to ignore it. ;-)

  8. Carrick says:

    This may be interesting too.

    It’s a plot of the fraction of land to total area in swaths of equal latitude, broken out for land above sea level (blue), 1000-m above sea level (green) and so on.

  9. Chad says:

    I don’t have the derivation in front of me, but the equation I used is exact. It’s simply the integral of the differential area element on a sphere evaluated over arbitrary bounds.

  10. Nick Stokes says:

    I tried to download your pack, but it’s in a proprietary .rar format. Any chance of a zip or tar version?

  11. Carrick says:

    Nick, I was able to decode it on the Mac using MacRAR. If Chad is OK with it, I’ll place a tarballed version of his code online for you to download.

    Chad, please let us know if this is OK.


    • Chad says:

      Thanks, Carrick. I’ve uploaded a gzipped tarball that’s the same as the RAR archive except it doesn’t have any Win32 executables.

      • Nick Stokes says:

        Thanks, Chad
        That worked. I haven’t tried to compile yet

      • Carrick says:

        Thanks, Chad.

        I was trying to compile it and got the error:

        gfortran -c ghcn_types.f03 -I/local/include -std=f2003 -Wextra -Wall -pedantic
         USE constants
        Fatal Error: Can't open module file 'constants.mod' for reading at (1): No such file or directory
        make: *** [ghcn_types.o] Error 1

        I assume we’re missing constants.mod, right?



        • Chad says:

          :( My mistake. I’ve updated the archives. That file is in my include directory, so I forgot to include it. I included the source file, not the mod file. I found out the hard way that even slightly different versions of gfortran don’t recognize each other’s mod files.

        • Chad says:

          Oh, and as you’ll see in the makefile, you’ll need netCDF, BLAS, LAPACK and LAPACK95. I assume you’re using Linux, so I guess you already have netCDF and perhaps BLAS/LAPACK.

        • Carrick says:

          No problem. I had noticed the library dependencies and I have to get my libraries up to date in any case.

          I’m a Mac person, so I’ll probably install the libraries the “old-fashioned” way, namely downloading the sources and compiling them for my platform.

  12. Steven Mosher says:

    When you guys are ready to run the data that GISS actually uses…

    download this.

    You’ll need the giss.inv file; if you can’t find it quickly (see Clear Climate Code), just let me know.

  13. Marcus says:

    Hey Carrick – let me know if you get netcdf to successfully install on your Mac.

    My first attempt to “make configure && make check” failed with a “Could not link conftestf.o and conftest.o”

    After I set
    CXXFLAGS = ‘-m64’

    I got what looked like a decent compile, but the make gives me:
    /usr/bin/ranlib: archive member: .libs/libnetcdff.a(netcdf.o) cputype (7) does not match previous archive members cputype (16777223) (all members must match)



  14. Steven Mosher says:

    Also, when you simply “combine” or average duplicate records in GHCN you will be doing something different from what GISS does. Hansen has a special method for handling these records in his step one. For the most part these “duplicate” records can be handled in a variety of ways (I’ve posted the relevant GHCN documents over at Jeff Id’s and I think at Nick’s). The differences in handling these items should only change the answers by a few hundredths of a degree C.

    • Chad says:

      I’m making some mods to my code because that extra field at the end (the lights) is causing a read error. Something for me to keep in mind for the next stable version.

      • Steven Mosher says:

        If you like I can post a file that has 3 different ways of handling the duplicates in GHCN (sketched roughly below):

        1. Simple average of the duplicates
        2. median of the duplicates
        3. Midpoint of the max/min

        real minutia. not as cool as your land mask thing.
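
        For concreteness, here is a rough sketch of the three options for one month’s duplicate readings (the values and names are made up and this isn’t taken from any of the actual codes):

        program combine_dups_demo
          implicit none
          real :: vals(3) = (/ 10.1, 10.4, 9.8 /)      ! hypothetical duplicate readings
          print *, 'simple average:', sum(vals)/size(vals)
          print *, 'midpoint      :', 0.5*(maxval(vals) + minval(vals))
          print *, 'median        :', median(vals)
        contains
          real function median(v)
            real, intent(in) :: v(:)
            real :: s(size(v)), tmp
            integer :: i, j, n
            n = size(v);  s = v
            do i = 1, n-1                              ! tiny sort; fine for a handful of duplicates
               do j = i+1, n
                  if (s(j) < s(i)) then
                     tmp = s(i);  s(i) = s(j);  s(j) = tmp
                  end if
               end do
            end do
            if (mod(n,2) == 1) then
               median = s((n+1)/2)
            else
               median = 0.5*(s(n/2) + s(n/2+1))
            end if
          end function median
        end program combine_dups_demo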

        • Chad says:

          Steven, do you know about an undocumented field in the GHCN inventory? For some reason my program can’t get past the Antarctica station Erin. When it hits the next station it crashes. I looked at the format specifiers I was using and I noticed something. The field ‘grveg’ is of width 16. When you count off those 16 spaces, there’s a letter next to it. For example,
          10160355000 SKIKDA 36.93 6.95 7 18U 107HIxxCO 1x-9WARM DECIDUOUS C

          Count off 16 spaces starting at the beginning of ‘WARM’. What does that C mean? It’s not documented in the program.

  15. Chad says:

    Never mind, Steven. I found what it stands for. It’s brightness. It would be nice if they updated their sample Fortran program. Maybe I should fire off an email.

    • Steven Mosher says:

      ha sorry about that. same thing tripped me up a while back.
      they seriously need to redo that file, now that all our code is wrapped around it

      • Chad says:

        Still running into errors. I’m able to read the station inventory but then I started getting integer overflow errors. That’s what happens when you get stingy with memory (because I only have 520 MB).

      • Chad says:

        Found some bad assumptions in my code. I assumed that every station in the inventory would have data when finding the lower and upper indices for each station after loading the v2 file into memory. It was causing out-of-bounds errors! Why list a station in the inventory if it isn’t even in the data file? I found 51 stations with no data in the v2.mean_comb file. Here they are as country code, WMO, mod:

        137 61641 1
        222 28028 1
        222 28240 0
        222 28630 0
        403 71913 1
        425 72404 2
        425 72406 6
        425 72523 4
        425 72523 6
        425 72597 9
        603 11020 1
        603 11320 1
        604 37851 0
        611 11438 1
        611 11502 1
        611 11518 1
        611 11643 1
        611 11643 2
        611 11659 1
        611 11679 1
        611 11735 1
        613 26231 1
        613 26233 1
        614 2844 1
        614 2874 1
        614 2912 1
        615 7055 1
        615 7090 0
        615 7169 1
        615 7471 1
        615 7491 1
        617 10091 1
        617 10152 1
        617 10156 0
        617 10170 1
        617 10180 1
        617 10181 1
        617 10249 1
        617 10315 1
        617 10396 1
        617 10413 1
        617 10425 1
        617 10453 1
        617 10466 1
        617 10466 2
        617 10476 1
        617 10513 1
        617 10515 1
        617 10555 1
        617 10838 1
        623 16022 1
        623 16095 0
        623 16410 2
        626 26238 1
        626 26425 1
        634 1052 1
        635 12280 1
        635 12520 1
        637 15235 1
        638 26063 1
        638 27453 1
        638 34560 1
        645 2456 1
        645 2627 1
        646 6630 1
        650 33945 1
        650 33966 1
        700 89060 0
        700 89833 900

        Very strange.

        • steven Mosher says:


          Was the data missing from the comb file?

          GISS used to employ a file called “drop strange” which dropped certain stations… but that’s not in the current build. Hmm.

        • steven Mosher says:

          700 89833 900

          425 72597 9

          700 89833 900

          for this station in the Antarctic all GISS had was a WMO number. So in their manual steps they set the MOD to 900 and they add the 700.

          Let’s see what that station data looks like at the source:

          70089833900 D_17 -66.70 139.70 -999 antarc3

          Should be on this page, the antarctic 3 data page:

          It’s not.

          What does that mean? GISS has a quasi-manual process for building the Antarctic data files. They have not noticed that site D_17 is in the inventory but not on the data page.

      • Chad says:

        Just got the bug fixes in order. You learn a lot about yourself as a programmer when you see all the unconscious assumptions you make. I looked at the gridded product and it looks sane. I’ll give it a more thorough treatment tomorrow.

  16. Chad,

    Could you run GHCN v2.mean_adj through your method and compare it to NCDC land temps? It’s pretty damn close with just v2.mean, and I want to see if we can exactly replicate it using v2.mean_adj.

    • Chad says:

      It doesn’t replicate it precisely. Here’s a graph of both NCDC and GHCN land temp anomalies (rel. 1961-1990). The NCDC trends for the entire period and the last 30 years are larger than those of the adjusted data.

  17. Deep Climate says:

    Chad, I’ve got a question about your upcoming paper with McKitrick and McIntyre, as described in McKitrick’s Heartland presentation.

    In Santer et al (IJoC , 2008), Table 1 shows model trends for synthetic TMT (T2) and TLT (LT) of 0.215C/dec. and 0.199C/dec respectively, for the period 1979-1999 (20 Cen).

    But the updated model trends analyzed in MMH 2010 show 0.24C/decade and 0.26C/decade respectively, presumably calculated from the model ensemble through 2009. What accounts for the large difference? That’s a huge rise from adding just 10 years of simulation.


    • Chad says:

      The two papers aren’t using the same suite of models. Santer et al. used 49 runs from 19 models. Our paper used 57 runs from 23 models.

  18. Pingback: D’Aleo at Heartland: An Apple a Day « The Whiteboard

  19. Thanks for the land mask data. I ran it through my model and got pretty much identical results to yours:

    I’ve also figured out a rather elegant way to apply it to land/ocean data by effectively splitting all cells containing both land and ocean stations into two.

    • Chad says:

      I’m guessing that the reason the error terms in your emulation aren’t really small (~0.001 or so) is the difference in the combination methods. So what the error terms show is the difference between the climate anomaly method and the least squares method. Very nice.

      • Yep, what’s left is most likely due to differing methods of station combination (CAM vs. LSM). Interestingly enough, it isn’t particularly large.

        On a related note, I’m trying to track down a 5×5 lat/lon gridded version of the HadSST1/Reynolds series that GISTemp uses. I’m starting to strongly suspect that the reason GISTemp shows a higher trend than other series over the last decade is in large part due to the use of different SST data. Let me know if you have any leads…

        • Chad says:

          Is whatever language you’re using able to read netCDF files? If so it would be as easy as compiling one of the Fortran programs available on GISTEMP’s website to convert the SBBX.SSTHadR2 binary file into a netCDF file with 5° x 5° resolution.

        • Chad says:

          If you are able to read netCDF I have the file ready and waiting.

        • Zeke says:

          Unfortunately I’m using the STATA stats software, which isn’t the best for parsing data sets.

          I need to wrangle it into a format like:

          unique_id, year, jan, feb, mar, …, dec

          With a separate metadata file that has the lat/lon midpoint associated with the grid of each unique_id. Alternatively, the lat/lon could be appended to the format above and I could separate it out.

          Playing with this data makes me wish I had been more attentive in the one comp sci class I took way back in the day :-p

        • Chad says:

          I need to wrangle it into a format like:

          unique_id, year, jan, feb, mar, …, dec

          I’ll see what I can do.

        • Zeke says:

          Thanks. I’m hoping that I can convince someone a bit more programming-savvy than I to undertake a project to provide all the various SST datasets (Reynolds, HadSST1, HadSST2, ERSSTv3, etc.) gridded at different resolutions (1×1, 2×2, and 5×5 when applicable) and available in the same format as the land temp data. Would make combining things much more convenient.

  20. Pingback: The Blackboard » Musing on Reconstructions: Response to Bart at Keith’s

  21. Pingback: The Blackboard » Replication

  22. Doug Proctor says:

    The “global” temperature record from 1701 to the present samples very different regions at different times. How would the graphs look if the same regions were tracked through time? In 1701 the temperature records must represent very little of the world, and most of that near cities (though not centrally heated ones!) subject to a lot of smoke/smog inversions. Can the database be broken up to track the areas as they came on line, and then show what happens when those areas are combined? Apples to apples, you know. Let us see if regional trends exist.

  23. Pingback: Global Warming Contrarians Part 1.1: Amateur Temperature Records « Planet James

  24. Pingback: trb-0.01: GHCN Simple Mean Average « The Whiteboard

  25. Pingback: Calculating global temperature | Watts Up With That?

  26. Pingback: Gridded Global Temperature « the Air Vent

  27. Deep Climate says:


    I downloaded the MMH package and started to look at the model ensembles – congratulations! Very easy to work with.

    First finding: the rise in modeled synthetic temps comes from the additional years, not additional models. Some of those models have humdinger volcano troughs in the 90s, which depresses the 1979-1999 trend relative to 1979-2009!

    I know you’re aware of some discrepancies between various tables and figures in MMH 2010. For example, the “models” trend shows 0.26 +/- 0.030 in the figure and 0.27 +/- 0.024 in the table.

    I’m also having a hard time replicating the individual model slopes (they’re close but not quite the same as published in Table 1). That could be down to differences in the AC-corrected trend in the STATA panel regression (I’m just doing OLS), but if it’s all right with you I’d like to check a couple of numbers, computed directly in the Excel spreadsheet.

    The first thing I did was put in a year column (incremented by 0.0833 each month). Starting in 1850, that gives rows 1549 through 1920 for 1979-2009 (and 1999 ends at line 1800).

    Here are the slopes for the first two model runs, using LINEST () * 120.

    Model/run BCCR 1 CCMA T47 1
    1979-99 0.165 0.431
    1979-09 0.190 0.347

    For what it’s worth, I get 0.262 for average of all runs in 1979-2009, and exactly the same for weighted average of trends in Table 1 (which I’ve added to my spreadsheet).

    There’s also something peculiar I wonder if you know the answer to. I thought the baseline was 1980-1999 for model runs (at least that’s how the IPCC gave its numbers). Yet the baseline appears to be 1979-1999 (a 21-year period). That period is very close to zero average (to 4 decimal places) for all runs.

    Thanks – I just want to make sure I’m using this correctly.

  28. Pingback: A Citizen Science Choice: Obfuscate or Illuminate « The Whiteboard

  29. Pingback: The Climate Change Debate Thread - Page 477
