Benchmarking homogenization algorithms for monthly data

Venema, V., O. Mestre, E. Aguilar, I. Auer, J.A. Guijarro, P. Domonkos, G. Vertacnik, T. Szentimrey, P. Stepanek, P. Zahradnicek, J. Viarre, G. Müller-Westermeier, M. Lakatos, C.N. Williams, M.J. Menne, R. Lindau, D. Rasol, E. Rustemeier, K. Kolokythas, T. Marinova, L. Andresen, F. Acquaotta, S. Fratianni, S. Cheval, M. Klancar, M. Brunetti, Ch. Gruber, M. Prohom Duran, T. Likso, P. Esteban, Th. Brandsma. Benchmarking homogenization algorithms for monthly data. Climate of the Past, 8, pp. 89-115, doi: 10.5194/cp-8-89-2012, 2012.

Abstract. The COST (European Cooperation in Science and Technology) Action ES0601: advances in homogenization methods of climate series: an integrated approach (HOME) has executed a blind intercomparison and validation study for monthly homogenization algorithms. Time series of monthly temperature and precipitation were evaluated because of their importance for climate studies and because they represent two important types of statistics (additive and multiplicative). The algorithms were validated against a realistic benchmark dataset. The benchmark contains real inhomogeneous data as well as simulated data with inserted inhomogeneities. Random independent break-type inhomogeneities with normally distributed breakpoint sizes were added to the simulated datasets. To approximate real world conditions, breaks were introduced that occur simultaneously in multiple station series within a simulated network of station data. The simulated time series also contained outliers, missing data periods and local station trends. Further, a stochastic nonlinear global (network-wide) trend was added.

Participants provided 25 separate homogenized contributions as part of the blind study. After the deadline at which details of the imposed inhomogeneities were revealed, 22 additional solutions were submitted. These homogenized datasets were assessed by a number of performance metrics including (i) the centered root mean square error relative to the true homogeneous value at various averaging scales, (ii) the error in linear trend estimates and (iii) traditional contingency skill scores. The metrics were computed both using the individual station series as well as the network average regional series. The performance of the contributions depends significantly on the error metric considered. Contingency scores by themselves are not very informative. Although relative homogenization algorithms typically improve the homogeneity of temperature data, only the best ones improve precipitation data. Training the users on homogenization software was found to be very important. Moreover, state-of-the-art relative homogenization algorithms developed to work with an inhomogeneous reference are shown to perform best. The study showed that automatic algorithms can perform as well as manual ones.


  1. Example synthesis by A. Cost and F. Home

    We would like to thank the two reviewers for their informative reviews. The assessments seem valid and we have nothing to add. Where there were minor differences in the grading we have decided to simply average.

    We would also like to thank the first author for the additional extensive comments below and would encourage readers to read them.

    Impact on the larger scientific community. [77]
    Contribution to the scientific field of the journal. [90]
    The technical quality of the paper. [80]
    Importance at the time of publishing. [90]
    Importance of the research program. [-]


  1. Example review by E.X. Ample

    The paper describes the blind benchmarking study of the European COST Action HOME. This benchmarking tested a large number of homogenisation methods and found that homogenisation clearly improves the estimates of climate change and variability for temperature station data. The homogenisation of precipitation data is more difficult. Even for these dense European networks only the best methods could achieve modest improvements in the monthly data, which hampers our understanding of changes in the water cycle. At larger time scales homogenisation typically did improve the precipitation station data.

    An important result was that for these European networks homogenisation of temperature was most accurate on a monthly scale, while for precipitation the corrections are best applied on an annual scale. This is probably because the uncertainty of the monthly adjustments is larger due to the smaller signal-to-noise ratio. (For sparser networks the annual scale may well be best also for temperature. The influence of station density (and/or the total number of references) is suggested in the results for the first 25 years, where the networks are still building up; Figure 4.)

    The modest difference in the results between the surrogate data (autocorrelated, non-Gaussian) and the synthetic data (uncorrelated, normally distributed) suggests that correlated data is somewhat more difficult to homogenise; the uncertainty in the trends was 15% smaller for the synthetic data.

    With 5, 9 and 15 stations the networks were relatively small. This was necessary to keep the amount of work for the manual methods limited. Real cases with more stations will have more well-correlated reference stations and may show better results.

    Impact on the larger scientific community. [75]
    The study is also important for climatologists, who can use it to estimate uncertainties in homogenised temperature and precipitation data.

    Contribution to the scientific field of the journal. [90]
    The paper and the COST Action have helped the homogenisation community progress. Both the creation of the benchmark and the outcomes have stimulated important discussions, which have brought the community forward.

    The technical quality of the paper. [90]
    Clearly written, data available. Very good.

    Importance at the time of publishing. [90]
    Still the most comprehensive study.

    Importance of the research program. [-]
    Not relevant. Single paper.

  2. Example review by M. Ock

    The article presents a benchmarking study for homogenisation algorithms based on monthly data. The study is the most advanced we currently have. Notable are the large number of contributions and algorithms that participated, including manual methods. The study was the first blind test and used monthly, not only annual, data. The study generated complete networks, not directly difference time series. The inhomogeneities are highly realistic, including small inhomogeneities and no artificial minimum distances between inhomogeneities. It includes inhomogeneities affecting multiple stations, even if they could have been more realistic, and gradual inhomogeneities. Importantly, not only simulated data but also real observations were homogenised with the same methods, which allowed for a study of the realism of the benchmark data. The results were not just analysed in terms of detection scores, but also in terms of the accuracy of the homogenised data in climatological applications, e.g. the uncertainty of station trends.

    Limitations of the study are that it models small, typical European networks. It is possible that results for larger networks would be better, and Europe has a rather high station density. It was found afterwards that the break variance was two times too large; this is partially compensated by a lower network density due to the selection of a subset of stations to generate the surrogate network. The study unfortunately did not include explicit network-wide trend biases due to inhomogeneities; for the performance of homogenisation algorithms in this respect the reader is referred to Williams et al. (2012). For further limitations see the comment by the first author below.

    The study found that homogenisation improves temperature data, but that for precipitation only the best methods improved the data. Modern methods designed for multiple break points and/or inhomogeneous references performed best. The large improvement of these methods was surprising and it should be studied whether this was due to the relatively large signal-to-noise ratio of the benchmarking data.

    Impact on the larger scientific community. [78]
    The study clearly demonstrated that homogenisation algorithms improve temperature datasets and can be used independently from metadata. It also showed that homogenised data still has considerable uncertainties at the station level, which should be taken into account in any climatological analysis.

    Contribution to the scientific field of the journal. [90]
    Clearly the best benchmarking study to date. It seems as if the paper made a clear impact and stimulated the use of modern homogenisation methods.

    The technical quality of the paper. [80]
    Unprecedented realism of the data and user-centred analysis of the results. The data was published.

    Importance at the time of publishing. [90]
    No change: the study is still the most realistic one. The upcoming ISTI benchmarking study may change this.

    Importance of the research program. [-]
    Not relevant. Single paper.

General comments

  1. I am the first author of the paper and would like to point to six weaknesses of this study we noticed after publication. This self-review was published in similar form on my blog in June 2014.

    Benchmarking homogenization methods
    In our benchmarking paper we generated a dataset that mimicked real temperature or precipitation data. To this data we added non-climatic changes (inhomogeneities). We then asked climatologists to homogenize this data, that is, to remove the inhomogeneities we had inserted. How well the homogenization algorithms perform can be seen by comparing the homogenized data to the original homogeneous data.

    This is straightforward science, but the realism of the dataset was the best to date and because this project was part of a large research program (the COST Action HOME) we had a large number of contributions. Mathematical understanding of the algorithms is also important, but homogenization algorithms are complicated methods and it is also possible to make errors in the implementation, thus such numerical validations are also valuable. Both approaches complement each other.

    The main conclusions were that homogenization improves the homogeneity of temperature data. Precipitation is more difficult and only the best algorithms were able to improve it. We found that modern methods improved the quality of temperature data about twice as much as traditional methods. It is thus important that people switch to one of these modern methods. My impression is that this seems to be happening.

    1. Missing homogenization methods
    An impressive number of methods participated in HOME. Many manual methods were also applied; these are validated less often because it is more work. All the state-of-the-art methods participated, as did most of the widely used methods. However, we forgot to test a two- or multi-phase regression method, which is popular in North America.

    Also not validated is HOMER, the algorithm that was designed afterwards using the best parts of the tested algorithms. We are working on this. Many people have started using HOMER. Its validation should thus be a high priority for the community.

    2. Size breaks (random walk or noise)
    Next to the benchmark data with the inserted inhomogeneities, we also asked people to homogenize some real datasets. This turned out to be very important because it allowed us to validate how realistic the benchmark data is, information we need to make future studies more realistic. In this validation we found that the sizes of the benchmark inhomogeneities were larger than those in the real data. Expressed as the standard deviation of the break size distribution, the benchmark breaks were typically 0.8°C while the real breaks were only 0.6°C.

    This was already reported in the paper, but we now understand why. In the benchmark, the inhomogeneities were implemented by drawing a random number for every homogeneous period and perturbing the original data by this amount. In other words, we added noise to the homogeneous data. However, the homogenizers who asked for breaks with a size of about 0.8°C were thinking of the difference from one homogeneous period to the next. The size of such a break is influenced by two random numbers. Because variances are additive, this means that the jumps implemented as noise were the square root of two (about 1.4) times too large.
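
    The square-root-of-two argument can be checked with a few lines of code. The sketch below (my own illustration, not code from the study) draws an independent perturbation level for each homogeneous period, as in the benchmark's noise implementation, and compares the spread of the levels with the spread of the resulting jumps:

```python
import random
import statistics

random.seed(0)

# Draw a zero-mean perturbation (std 0.8 degC) for every homogeneous period,
# as in the "noise" implementation of inhomogeneities.
perturbations = [random.gauss(0.0, 0.8) for _ in range(100_000)]

# The jump seen at a breakpoint is the difference between two consecutive
# perturbation levels, so its variance is the sum of two variances.
jumps = [perturbations[i + 1] - perturbations[i]
         for i in range(len(perturbations) - 1)]

print(round(statistics.stdev(perturbations), 2))  # about 0.8
print(round(statistics.stdev(jumps), 2))          # about 1.13 = sqrt(2) * 0.8
```

    With independent levels of standard deviation 0.8°C, the jumps come out near 1.13°C, i.e. roughly sqrt(2) times larger, which is the factor discussed above.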

    The validation showed that, except for the size, the idea of implementing the inhomogeneities as noise was a good approximation. The alternative would be to draw a random number and use it to perturb the data relative to the previously perturbed period. In that case you implement the inhomogeneities as a random walk. Nobody thought of reporting it, but it seems that most validation studies have implemented their inhomogeneities as random walks. This makes the influence of the inhomogeneities on the trend much larger. Because of the larger error, it is probably easier to achieve relative improvements, but because the initial errors were larger in absolute terms, the absolute errors after homogenization may well have been too large in previous studies.

    You can see the difference between a noise perturbation and a random walk by comparing the sign (up or down) of the breaks from one break to the next. For example, in case of noise and a large upward jump, the next change is likely to make the perturbation smaller again. In case of a random walk, the size and sign of the previous break are irrelevant: the likelihood of either sign is one half.

    In other words, in case of a random walk there are just as many up-down and down-up pairs as there are up-up and down-down pairs; every combination has a chance of one in four. In case of noise perturbations, up-down and down-up pairs (platform-like break pairs) are more likely than up-up and down-down pairs. The latter is what we found in the real datasets. There is a small deviation that suggests a small random-walk contribution, but that may also be because the inhomogeneities cause a trend bias.
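
    These sign statistics are easy to reproduce numerically. The following sketch (an illustration under simplified Gaussian assumptions, not the study's code) generates breaks under both models and counts the fraction of consecutive break pairs with opposite signs, i.e. the platform-like pairs:

```python
import random

random.seed(1)
n = 200_000

# "Noise" model: each homogeneous period gets an independent perturbation
# level; the break at position i is the difference between two levels.
levels = [random.gauss(0.0, 1.0) for _ in range(n)]
noise_breaks = [levels[i + 1] - levels[i] for i in range(n - 1)]

# "Random walk" model: each break is an independent step added to the
# previous level, so consecutive breaks are uncorrelated.
walk_breaks = [random.gauss(0.0, 1.0) for _ in range(n - 1)]

def flip_fraction(breaks):
    """Fraction of consecutive break pairs with opposite signs (platforms)."""
    flips = sum(1 for a, b in zip(breaks, breaks[1:]) if a * b < 0)
    return flips / (len(breaks) - 1)

print(round(flip_fraction(noise_breaks), 2))  # about 0.67: platforms dominate
print(round(flip_fraction(walk_breaks), 2))   # about 0.50: sign is independent
```

    For independent Gaussian levels the theoretical flip fraction in the noise model is 2/3 (consecutive breaks share one level and are anti-correlated), against exactly 1/2 for the random walk, so the two implementations are clearly distinguishable in sufficiently long records.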

    3. Signal to noise ratio varies regionally
    The HOME benchmark reproduced a typical situation in Europe (the USA is similar). However, the station density in much of the world is lower. Inhomogeneities are detected and corrected by comparing a candidate station to neighbouring ones. When the station density is lower, this difference signal is noisier, which makes homogenization more difficult. Thus one would expect the performance of homogenization methods to be lower in other regions, although the break frequency and break size may also be different there.

    Thus to estimate how large the influence of the remaining inhomogeneities can be on the global mean temperature, we need to study the performance of homogenization algorithms in a wider range of situations. Also for the intercomparison of homogenization methods (the more limited aim of HOME) the signal (break size) to noise ratio is important. Domonkos (2013) showed that the ranking of various algorithms depends on the signal-to-noise ratio. Ralf Lindau and I have just submitted a manuscript that shows that for low signal-to-noise ratios, the multiple-breakpoint method PRODIGE is not much better at detecting breaks than a method that would “detect” random breaks, while it works fine for higher signal-to-noise ratios. Other methods may also be affected, but possibly not to the same degree. More on that later.

    4. Regional trends (absolute homogenization)
    The initially simulated data did not have a trend, thus we explicitly added a trend to all stations to give the data a regional climate change signal. This trend could be either upward or downward, just to check whether homogenization methods might have problems with downward trends, which are not typical of daily operations. They do not.

    Had we inserted a simple linear trend in the HOME benchmark data, the operators of the manual homogenization could theoretically have used this information to improve their performance: if the trend is not linear, there are apparently still inhomogeneities in the data. We wanted to keep the operators blind. Consequently, we inserted a rather complicated and variable nonlinear trend in the dataset.

    As already noted in the paper, this may have handicapped the participating absolute homogenization method. Homogenization methods used in climate are normally relative ones. These methods compare a station to its neighbours; both have the same regional climate signal, which is thus removed and not important. Absolute methods do not use the information from the neighbours; they have to make assumptions about the variability of the real regional climate signal. Absolute methods have problems with gradual inhomogeneities, are less sensitive, and are therefore not used much.

    If absolute methods are participating in future studies, the trend should be modelled more realistically. When benchmarking only automatic homogenization methods (no operator) an easier trend should be no problem.

    5. Length of the series
    The station networks simulated in HOME were all one century long; some of the station series were shorter because we also simulated the build-up of the network during the first 25 years. We recently found that the criterion for the optimal number of break inhomogeneities used by one of the best homogenization methods (PRODIGE) does not have the right dependence on the number of data points (Lindau and Venema, 2013). For climate datasets that are about a century long, the criterion is quite good, but for much longer or shorter datasets there are deviations. This illustrates that the length of the datasets is also important and that it is important for benchmarking that the data availability is the same as in real datasets.

    Another reason why it is important that the benchmark data availability be the same as in the real dataset is that this makes the comparison of the inhomogeneities found in the real data and in the benchmark more straightforward. This comparison is important to make future validation studies more accurate.

    6. Non-climatic trend bias
    The inhomogeneities we inserted in HOME were on average zero. For individual stations this still results in clear non-climatic trend errors because you only average over a small number of inhomogeneities. For the full networks the number of inhomogeneities is larger and the non-climatic trend error is thus very small. It was consequently very hard for the homogenization methods to improve these small errors. It is expected that in real raw datasets there is a larger non-climatic error. Globally the non-climatic trend will be relatively small, but within one network, where the stations experienced similar (technological and organisational) changes, it can be appreciable. Thus we should model such a non-climatic trend bias explicitly in future.
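
    A small simulation can illustrate why zero-mean breaks still leave appreciable trend errors at single stations but almost none in the network mean. This is my own sketch; the break count (five per century), break size (0.8°C) and network size (50 stations) are illustrative assumptions, not the HOME settings:

```python
import random
import statistics

random.seed(2)

YEARS = 100
N_STATIONS = 50
N_NETWORKS = 200  # repeat the experiment to estimate the spreads

def linear_trend(series):
    """Least-squares slope per time step."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

def station_perturbation():
    """Zero-mean inhomogeneities: 5 breaks, independent levels (noise model)."""
    breaks = sorted(random.sample(range(1, YEARS), 5))
    levels = [random.gauss(0.0, 0.8) for _ in range(6)]
    series, seg = [], 0
    for year in range(YEARS):
        if seg < 5 and year >= breaks[seg]:
            seg += 1
        series.append(levels[seg])
    return series

station_errors, network_errors = [], []
for _ in range(N_NETWORKS):
    stations = [station_perturbation() for _ in range(N_STATIONS)]
    station_errors.extend(linear_trend(s) * YEARS for s in stations)
    mean_series = [sum(vals) / N_STATIONS for vals in zip(*stations)]
    network_errors.append(linear_trend(mean_series) * YEARS)

print(round(statistics.stdev(station_errors), 2))  # station trend error, degC/century
print(round(statistics.stdev(network_errors), 2))  # network-mean trend error
```

    Because the stations are independent here, the spread of the network-mean trend error is roughly sqrt(50), about seven times, smaller than the station-level spread, which is why the full-network trend errors in the benchmark were so small.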

    International Surface Temperature Initiative
    The last five problems will be addressed in the International Surface Temperature Initiative (ISTI) benchmark. Whether a two-phase homogenization method will participate is beyond our control. We expect fewer participants than in HOME because for such a huge global dataset the homogenization methods will need to run automatically and unsupervised.

Specific comments

  1. The homogenisation method HOMER was designed based upon what the COST Action HOME learned about homogenisation from the benchmark and our general understanding of homogenisation. It was released at the end of the COST Action and did not participate in the benchmarking.

    As such it is HOME-developed, but not a HOME-recommended method. The HOME recommended methods are ACMANT, MASH, PRODIGE, PHA and iCraddock. If HOMER is operated in the way PRODIGE is, manual pairwise testing, it will likely produce similar quality results. However, several people have noted problems with HOMER operated automatically with the new joint detection option. Using joint detection manually in addition to pairwise testing is likely okay, but this configuration has not been validated yet.

