Monitoring the health-related Sustainable Development Goals: lessons learned and recommendations for improved measurement

This post was originally published in The Lancet

By Samira Asma et al

Background

The UN General Assembly launched the Sustainable Development Goals (SDGs) in September, 2015. The original global SDG framework included 17 goals, 169 targets, and 232 unique indicators. Of these, 12 goals, 33 targets, and 57 indicators have been identified as health-related SDGs (HRSDGs), that is, pertaining to health outcomes, health services, and well-established environmental, occupational, behavioural, and metabolic risks. The scope of health in the SDGs is much broader than in the Millennium Development Goals, spanning from maternal and child health and infectious diseases to non-communicable diseases, injuries, risk factors, and health-system functions. Regular monitoring of the HRSDGs is important for fostering a shared notion of accountability for results, identifying important gaps in resources and rates of progress, and taking into account emerging challenges that can influence the trajectory of progress. Regular monitoring and accountability will be essential to sustain policy focus and funding for the broad and complex HRSDG agenda.

100 countries have published SDG monitoring reports since 2015 and, in 2019, 38 more countries indicated intentions to report. In addition to these government-led efforts, several international groups report on the HRSDGs: WHO in 2016, 2017, 2018, and 2019; the World Bank in 2017 and 2018; the Global Burden of Disease (GBD) collaboration in 2016, 2017, and 2018; the Sustainable Development Solutions Network in 2016, 2017, and 2018; and Our World in Data’s SDG dashboard starting in 2017–18. These reporting efforts on the HRSDGs differ in the number of indicators, countries, and years covered. Where reports overlap for the same indicator, country, and year, correlation coefficients of the estimates vary widely. For example, WHO and GBD’s most recent reports have correlation coefficients varying from 0·94 for under-five mortality (indicator 3.2.1) to 0·43 for road traffic mortality (indicator 3.6.1). Poor correlations for some indicators across reporting efforts highlight inconsistencies that emerge from using different HRSDG definitions, data sources, data processing, and data synthesis approaches.

In this Viewpoint, we examine why HRSDG results can differ so much across these empirical monitoring efforts and make recommendations on moving towards more standardised, universal assessments.
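The correlation coefficients quoted above compare two sets of estimates for the same indicator, country, and year. The sketch below shows one way such a comparison could be computed. It assumes a Pearson correlation (the text does not say which coefficient was used), and all numbers, country labels, and names such as `source_a` and `pearson_r` are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of estimates."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical estimates of one indicator, keyed by (country, year).
source_a = {("A", 2017): 12.0, ("B", 2017): 30.5, ("C", 2017): 8.2, ("D", 2017): 19.0}
source_b = {("A", 2017): 15.1, ("B", 2017): 24.0, ("C", 2017): 9.0, ("E", 2017): 40.0}

# Only country-years reported by both sources can be compared.
shared = sorted(set(source_a) & set(source_b))
r = pearson_r([source_a[k] for k in shared], [source_b[k] for k in shared])
print(f"{len(shared)} overlapping estimates, r = {r:.2f}")
```

Restricting the comparison to overlapping country-years matters: two reports can agree well where they overlap and still cover very different sets of countries.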

Defining HRSDG indicators

Beyond SDG 3—the SDG dedicated to ensuring healthy lives and promoting wellbeing for all—the global indicator framework includes many indicators that directly measure health outcomes (eg, indicator 3.1.1; maternal mortality ratio) or determinants of health (eg, indicator 6.1.1; proportion of population using safely managed drinking water services). Yet, defining what constitutes a HRSDG inevitably involves a judgment about the relative proximal or distal relationship of an indicator to health, as well as its scope within the direct or stewardship functions of the health system in delivering health interventions or reducing risks to health.

Drawing from previous monitoring efforts by WHO and the GBD collaboration, we identified 57 HRSDG indicators that (1) either are or directly relate to health outcomes and risks, health services and interventions, or health system needs and capacities (eg, indicator 3.d.1; international health regulations capacity and health emergency preparedness, indicator 17.19.2c; well certified deaths); and (2) have been established as health priorities by countries via international agendas or agencies (appendix pp 1–4). These criteria lead to a larger set of HRSDGs than had been previously reported by UN agencies and the GBD collaboration, with the latter identifying 52 HRSDGs in its 2018 analysis. Continuously updating the number of indicators included within the HRSDGs (and thus their scope) is far from ideal, but we view this revisiting process as crucial for better aligning HRSDG monitoring with recent global initiatives (eg, WHO’s Global Action Plan on the Sustainable Development Goals, which was initially launched in October, 2018). Further, with its annual indicator refinements and updates via the Inter-Agency and Expert Group on Sustainable Development Goal Indicators (IAEG-SDGs) and UN Statistical Commission, as well as its upcoming comprehensive reviews in 2020 and 2025, the global SDG indicator framework remains a dynamic development agenda.

How each HRSDG is operationalised (that is, how indicators from the global framework are translated to meta-data and then to quantifiable measures at the country level) can contribute to differences in monitoring efforts. For some HRSDGs, particularly those for which data are incomplete or sparse in many countries, the proxy measures selected or modifications implemented could underlie differences in HRSDG reporting, in addition to data input and analyses. Continued collaboration to bring proxy measures closer to indicator definitions and their meta-data is important. Doing so also hinges upon heightened investments in data collection systems as well as testing which proxy indicators are most closely predictive of recent HRSDG patterns and trends.

Data platforms

Monitoring the 57 HRSDG indicators requires at least 12 data systems to be functioning in each country (appendix pp 1–4). The most extensively used systems include nationally representative household surveys (NRHS) with a wide array of modules (including biomarker collection), civil registration and vital statistics (CRVS), and various administrative data systems. Alternatives exist for many preferred data systems, such as collecting verbal autopsy data through household survey or sample registration systems for CRVS. For some more complex indicators, such as deaths attributable to air pollution (indicator 3.9.1), multiple sources are required. For several indicators, preferred measurement methods are not direct measures from collected data and must be transformed or modelled for reporting on specific indicators. For example, for malaria incidence (indicator 3.3.3), due to the often low or variable disease notification rate in many high-burden settings, surveys that measure parasite presence in blood samples serve as the preferred measurement method. Overall, 13 HRSDG indicators should be based on complete CRVS, 20 on household surveys, nine on inputs from NRHS, 12 on administrative data, and three on data from multiple sources. CRVS, administrative records, and household surveys feature prominently as the preferred data systems for monitoring the HRSDGs (appendix p 5). CRVS, administrative systems, and disease registries can produce data on an annual basis, but time lags can affect the collation and reporting of data.

Data processing to address bias and enhance comparability

Meaningful comparisons over time and across or within countries require correcting for known data biases, as well as converting proxy or alternative measures to the reference case definition through a defined analytical approach (otherwise known as cross-walking). For example, ten indicators should use data from CRVS with medical certification of causes of death. But even among functional CRVS, recorded deaths must be corrected for the completeness of vital registration system coverage, as well as for highly variable certification and coding practices. The fraction of International Classification of Diseases (ICD) non-specific codes, or codes that cannot be an underlying cause of death (so-called garbage codes), ranges from 2·3% (in Singapore in 2016) to 83% (in the Maldives in 2005). Without correction, these biases can lead to overestimation or underestimation of HRSDGs and then to incorrect orderings of countries or subnational units.
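The two CRVS corrections described above can be sketched in a few lines. This is a deliberately simplified illustration, not the method of any agency: it assumes garbage-coded deaths are redistributed pro rata across specific causes and that registration completeness is the same for every cause. Redistribution algorithms used in practice are more elaborate. The cause names, counts, and the 80% completeness figure are invented.

```python
def redistribute_garbage(deaths_by_cause, garbage_causes):
    """Reassign garbage-coded deaths to specific causes, pro rata (assumption)."""
    specific = {c: d for c, d in deaths_by_cause.items() if c not in garbage_causes}
    garbage_total = sum(d for c, d in deaths_by_cause.items() if c in garbage_causes)
    specific_total = sum(specific.values())
    if specific_total == 0:
        raise ValueError("no specific causes to redistribute onto")
    # Scale every specific cause so the overall death total is preserved.
    scale = (specific_total + garbage_total) / specific_total
    return {c: d * scale for c, d in specific.items()}


def correct_for_completeness(deaths_by_cause, completeness):
    """Inflate registered deaths by registration completeness (0 < c <= 1)."""
    if not 0 < completeness <= 1:
        raise ValueError("completeness must be in (0, 1]")
    return {c: d / completeness for c, d in deaths_by_cause.items()}


# Hypothetical registered deaths; 20% carry a non-specific (garbage) code.
registered = {"road injury": 300.0, "ischaemic heart disease": 500.0, "ill-defined": 200.0}
redistributed = redistribute_garbage(registered, {"ill-defined"})
corrected = correct_for_completeness(redistributed, completeness=0.8)
for cause, deaths in corrected.items():
    print(f"{cause}: {deaths:.2f}")
```

Skipping either step would understate road injury deaths in this example, which is the kind of bias that can reorder countries on indicator 3.6.1.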

For some HRSDGs, household surveys collect similar data, albeit in different ways. Alternative assays, items, or instruments must be cross-walked to the reference measurement approach to ensure comparability across countries and over time. For malaria (indicator 3.3.3), data from prevalence surveys, or less often case notifications, can be based on thick and thin blood smears or rapid diagnostic tests, all of which feature different diagnostic sensitivities and specificities. Meaningful comparisons require adjusting for differences in test characteristics. Another example involves the duration of recall period used in surveys; for instance, many surveys on gender-based violence ask for lifetime recall, whereas others limit recall to a month or a year. Tuberculosis incidence (indicator 3.3.2) cannot be directly measured, so reported cases must be adjusted upwards by case-detection rates. Several approaches have been proposed to assess case-detection rates, including expert judgment, comparisons to prevalence surveys or triangulation using death rates, or prevalence in surveys. To ensure comparability, approaches used to triangulate data from multiple sources and across countries must be done consistently.
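Two of the adjustments mentioned above have simple closed forms. The sketch below uses the Rogan-Gladen estimator to adjust an observed prevalence for test sensitivity and specificity, and scales notified cases by a case-detection rate. The source does not prescribe these specific formulas, so treat them as one common choice. The sensitivity, specificity, and rate values are hypothetical.

```python
def adjust_prevalence(apparent, sensitivity, specificity):
    """Rogan-Gladen estimate of true prevalence from test-positive prevalence."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("test must perform better than chance")
    adjusted = (apparent + specificity - 1.0) / denom
    # Sampling noise can push the estimate outside [0, 1]; clip it.
    return min(1.0, max(0.0, adjusted))


def incidence_from_notifications(notified_rate, case_detection_rate):
    """Scale notified cases upwards by the share of cases that are detected."""
    if not 0 < case_detection_rate <= 1:
        raise ValueError("case detection rate must be in (0, 1]")
    return notified_rate / case_detection_rate


# Two hypothetical surveys of the same population using different tests.
microscopy = adjust_prevalence(apparent=0.166, sensitivity=0.75, specificity=0.98)
rapid_test = adjust_prevalence(apparent=0.270, sensitivity=0.95, specificity=0.90)
print(f"microscopy-adjusted: {microscopy:.3f}, rapid-test-adjusted: {rapid_test:.3f}")

# Hypothetical notifications per 100 000 with an assumed 64% detection rate.
print(f"estimated incidence: {incidence_from_notifications(80.0, 0.64):.1f}")
```

The two surveys report quite different raw prevalences (16.6% and 27.0%) but agree after adjustment, which is the point of cross-walking to a common reference.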

How data are processed can profoundly affect results, yet very little attention has been given to data processing and HRSDG measurement. For example, the IAEG-SDGs do not provide detailed guidance on data processing, and few scientific papers focus on data processing—a stark contrast to the substantial debates on data synthesis methods. In the absence of more explicit guidance, we would naturally expect heterogeneity in how national statistical offices, ministries of health, UN agencies, the GBD collaboration, and others have implemented data processing, undermining comparability across these efforts. Assessing bias in administrative data due to exclusion or misaligned incentives is an important aspect of data processing.

Data synthesis and imputation

Fundamental differences in opinion exist on how to move available processed data to final HRSDG values for a given country and year. Data synthesis and imputation should, in principle, account for both sampling and non-sampling error of measurements, synthesise multiple data points from different sources for the same indicator and year for a given country, and address the common problem of having no data for a given country and year. Although national authorities, UN agencies, the GBD collaboration, and various research groups use diverse approaches for data synthesis, these approaches can be roughly grouped into four families. In the following sections we discuss advantages and disadvantages of each approach.

Minimalist approach

With the minimalist approach, indicators are reported for years in which processed data are available. When more than one source for the same year is available, one source is selected on the basis of qualitative assessments of the quality of the different measurements. When different sources show large fluctuations over time (eg, stunting in South Africa), results are simply reported as they are. The advantage of this approach is that producers of indicator results can point to a specific data source when questions emerge regarding the validity of results. However, there are two major disadvantages. First, when measurements are periodic (as all survey-based HRSDGs are), only a small subset of countries can be directly compared for a given year. Even for indicators that are based on CRVS or administrative systems, lags in data collation and reporting mean that results might only be available for 3–4 years before the year of data publication (eg, in 2019, the most recent data points could be for the years 2015–16). Second, sampling and non-sampling errors in measurements can lead to substantial, unexplained year-to-year fluctuations in indicator values. These fluctuations undermine the notion of accountability for change, a cornerstone of the HRSDG monitoring framework.

Minimalist approach plus the assumption of no change over time

Another approach starts with processed data and then increases the number of countries that can be compared by assuming indicator values are not changing over time since the most recent measurement. For example, a survey-based measure from 2011, if it is the most recent one, is used for 2018. Assuming no change in trend since the last available measurement is not credible for nearly all HRSDGs, as empirical evidence shows that most indicators have been improving globally. Assuming no change over time introduces considerable bias in reported comparisons; countries with more recent data will look better, all else being equal, than those with older measurements.
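The bias from assuming no change can be made concrete with a toy calculation. The sketch below assumes two countries with an identical true trajectory (an indicator where lower is better, falling 4% per year from a value of 60 in 2010) that differ only in the year of their last survey. All figures are invented.

```python
def true_value(baseline, annual_change, base_year, year):
    """Indicator value under a constant proportional annual change."""
    return baseline * (1.0 + annual_change) ** (year - base_year)


def carried_forward(baseline, annual_change, base_year, last_survey_year):
    """Value reported for any later year when no change is assumed."""
    return true_value(baseline, annual_change, base_year, last_survey_year)


BASELINE, CHANGE, BASE_YEAR, REPORT_YEAR = 60.0, -0.04, 2010, 2018

truth_2018 = true_value(BASELINE, CHANGE, BASE_YEAR, REPORT_YEAR)
reported_a = carried_forward(BASELINE, CHANGE, BASE_YEAR, last_survey_year=2011)
reported_b = carried_forward(BASELINE, CHANGE, BASE_YEAR, last_survey_year=2017)

print(f"true 2018 value (both countries): {truth_2018:.1f}")
print(f"country A, last survey 2011: {reported_a:.1f}")
print(f"country B, last survey 2017: {reported_b:.1f}")
```

Both countries are in the same true position in 2018, yet the country with the older survey is reported as doing worse, purely because of when it was last measured.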

Expert data synthesis

The third approach tries to address the limitations of the minimalist approach by using a panel of experts to review available data and then determine the best estimate for each year. The WHO/UNICEF Joint Reporting Process on vaccine coverage is an example of this approach, wherein an expert group evaluates administrative data and survey data provided by countries and, on the basis of its assessment, provides annual vaccine coverage estimates. An advantage of this approach is that it incorporates additional information that might not be captured by various quantitative sources (eg, vaccine stock-outs). A disadvantage is that the results are not replicable by others; in fact, a different expert panel could easily come to diverging judgments. Lastly, the expert data synthesis approach does not follow the Guidelines on Accurate and Transparent Health Estimates Reporting (GATHER), which require well-documented statistical models for data synthesis to generate uncertainty estimates and allow replication.

Statistical data synthesis

The statistical data synthesis approach uses statistical models to synthesise processed data into a coherent trend, applying the same model to estimate HRSDG indicator values for years with and without data, as well as generate uncertainty estimates for each year. The Inter-agency Group on Child Mortality Estimation is among the organisations applying this approach. Advantages include enabling replicability by others, reporting of uncertainty, and producing estimates for a full time series, including years beyond the most recent processed data point. Some view this approach as potentially disadvantageous: using statistical models to generate estimates after an indicator’s last measurement could misrepresent important trends or reversals due to data sparsity or lags in administrative data, or could give the false impression of having more data than are available. In addition, disagreements around how to best do statistical data synthesis for a given indicator are likely to occur and could lead to dissimilar models and results between analytical groups, even if all other analytic components (namely, data inputs and processing) are standardised. Finally, performance of statistical data synthesis generally improves by analysing data from all countries together; Bayesian statistical models often used in these efforts enhance predictive validity by borrowing strength over space and time.
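As a very reduced illustration of the idea, the sketch below fits a single log-linear trend to sparse observations for one country by weighted least squares and uses the same model for years with and without data. It is an assumption-laden toy: it produces no uncertainty intervals and borrows no strength across countries, both of which the models described above do. The function names and data are hypothetical.

```python
from math import exp, log


def fit_log_linear_trend(years, values, weights=None):
    """Weighted least squares fit of log(value) = intercept + slope * year."""
    if weights is None:
        weights = [1.0] * len(years)
    if len(set(years)) < 2:
        raise ValueError("need observations from at least two distinct years")
    logs = [log(v) for v in values]
    w_sum = sum(weights)
    t_bar = sum(w * t for w, t in zip(weights, years)) / w_sum
    y_bar = sum(w * y for w, y in zip(weights, logs)) / w_sum
    s_tt = sum(w * (t - t_bar) ** 2 for w, t in zip(weights, years))
    s_ty = sum(w * (t - t_bar) * (y - y_bar) for w, t, y in zip(weights, years, logs))
    slope = s_ty / s_tt
    return y_bar - slope * t_bar, slope


def predict(intercept, slope, year):
    """Fitted indicator value for any year, observed or not."""
    return exp(intercept + slope * year)


# Hypothetical survey estimates; weights could reflect each survey's precision.
obs_years = [2000, 2005, 2010]
obs_values = [100.0 * 0.95 ** (t - 2000) for t in obs_years]
intercept, slope = fit_log_linear_trend(obs_years, obs_values, weights=[1.0, 2.0, 2.0])

for year in range(2000, 2019, 3):
    print(year, round(predict(intercept, slope, year), 1))
```

Fitting on the log scale keeps predictions positive and treats change as proportional, which is one plausible modelling choice among many.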

Preferred approaches to data synthesis

Views about preferred approaches to data synthesis are as much philosophical as scientific. Some groups want to remain true to available observations and others would like to report indicator values that are as close as possible to what would have been observed with unbiased data systems and timely reporting. There is a substantial divide across and within organisations on the preferred approach: the World Bank largely follows the minimalist tradition for some indicators but statistical data synthesis for poverty estimation; WHO uses the minimalist approach plus the assumption of no change over time for many tracer indicators in the universal health coverage service coverage measure (eg, prevalence of raised blood pressure), expert synthesis for vaccine coverage, and statistical data synthesis for malaria, HIV, and tuberculosis. GBD uses statistical data synthesis for all HRSDGs, with the exception of 17.19.2a (population census or registry status). There is a similar debate regarding estimates of uncertainty, particularly in terms of what types of uncertainty should be captured (ie, sampling and non-sampling errors), and for what results uncertainty should be reported. These are non-trivial areas for HRSDG measurement, especially since reporting of uncertainty estimates can identify instances in which more serious data gaps remain and inform investments in data collection and systems.

Country ownership

Using data to take action is much more likely when country governments, academia, and civil society fully understand and own every step of the measurement process, including data processing and data synthesis. This notion of country ownership is also enshrined in UN Statistical Commission principles, and is also why WHO has adopted the principle that Ministries of Health should be consulted on all indicator assessments. In an ideal world, every country would have high-quality, publicly available data reported with minimal lags from CRVS, periodic NRHS, and administrative health service data alongside other sources, such as National Health Accounts. Further, in this ideal setting, accepted international standards would exist to guide data processing and data synthesis, thereby enabling national statistical offices, ministries of health, academia, UN agencies, the GBD collaboration, and civil society groups to arrive at exactly the same answer for HRSDG monitoring. In this ideal case, the data would literally speak for themselves.

Unfortunately, our current reality is quite different. Due to myriad factors (sparse data, particularly in lower-income and middle-income countries; no international standards for data processing; and no accepted standards for data synthesis), different groups at both country and international levels can reasonably get quite different results on the HRSDGs. Discordant metrics from various actors committed to monitoring the HRSDGs are strong evidence of this mixture of problems. In some cases, national assessments of indicators differ from WHO or GBD assessments, with origins traceable to the data sources examined, data processing choices, and different approaches to data synthesis. To maintain comparability and foster accountability, data processing and data synthesis must be standardised. Data ownership must lie with data producers, which are often national governments. However, these data should also be made publicly available, as they are usually produced using public resources and are intended for the common good. Openly sharing data ultimately lays the foundation for the best health practices: those that are informed by the best possible science that draws from all available data and represents the best evidence base at a given point in time. This scenario is mutually beneficial to countries that share data (ie, they can put it to the best use) and to experts who aim to continuously improve the measurement of health outcomes and determinants. Promoting heterogeneity in data processing and data synthesis standards in the name of advancing country ownership is not helpful. Real country ownership will happen when national data systems are strengthened for data generation, analysis, and use, as well as when clear standards for each step of HRSDG monitoring are established. At that point, everyone will own the results since anyone examining the data will come to the same conclusion.

Similar to other large-scale multi-country collaborative efforts, the GBD collaboration offers an example wherein transparent analysis and very broad participation encourage national ownership and use. The collaborative scientific model espoused by the GBD is one example of how to establish highly standardised approaches to data processing and synthesis while also fostering broad ownership.

Transparency

Expectations for transparency and replicability from the public, media, civil society, and scientific communities have been greatly enhanced in the past two decades. WHO led the creation of GATHER, a much needed move towards transparency and a powerful way to improve trust across all segments in society in the validity of HRSDG monitoring. Although there has been great progress towards transparency, much more needs to happen. WHO estimates of child mortality, maternal mortality, and tuberculosis are fully compliant with GATHER, as are all outcomes produced by the GBD collaboration. However, many efforts that generate global health estimates are not compliant with GATHER. GATHER call for the public release of all input meta-data, but do not require publication of the actual values of input data. One key reason for this specification is that, often, national governments report data to WHO or other agencies but do not clarify if the data can be publicly shared. Some parts of WHO share data unless contributing governments give written prohibition, while other WHO departments do not share data unless permission is explicitly provided. WHO is moving towards a default opt-out position, such that data would be shared unless written agreements restrict sharing. This position would substantially enhance transparency and build trust in the global public health community. In some cases, governments restrict access and use of data in writing, posing challenges to transparency and replicability. In the future, transparency could be enhanced if the international community collectively agrees to only use data shared in the public domain for HRSDG monitoring.

Inequalities

The global SDG framework clearly emphasises the importance of addressing inequalities within countries across its goals, targets, and indicators. Monitoring HRSDG inequalities requires considerably more disaggregated data, which then necessitates the inclusion of appropriate equity stratifiers in NRHS or spatially disaggregated data from CRVS and administrative systems. However, examinations of HRSDG inequalities across dimensions like socioeconomic status or minority groups (eg, race, ethnicity, and religion) have yet to widely occur, largely because these data are simply not available. An absence of practical and comparable approaches to assessing inequalities could mean that countries do not pay sufficient attention to reducing inequalities. The axis over which inequality varies the most (eg, sex, gender, age, sexual orientation, income, education, race, ethnicity, religion, and migrant status) will differ across countries. The global community is in a virtual impasse on inequality tracking for the SDGs: the choice appears to be to promote local inequality measurement using the most relevant axis of differentiation and lose all comparability and potentially accountability, or to inaccurately reflect the true nature of inequality in each country by using a standardised and comparable measure along a single axis of differentiation. Inequality monitoring with disaggregated data is the first step towards reducing inequities (ie, inequalities deemed as unjust). The only imperfect option available today is to track geographical inequalities. UNICEF has focused on district-level inequalities in its EQUIST work, and an increasing number of HRSDGs have been analysed using model-based geostatistics. The most plausible prospect for adding an inequality dimension to HRSDG monitoring is to evaluate each indicator at the district or more disaggregated level, such as 5 km × 5 km pixels. Nonetheless, efforts to report on geographical inequalities must be clear that this is only one useful dimension of inequality measurement. Simultaneously, there needs to be investment in strengthening the capacity of data to enable reporting by other inequality dimensions.

Reporting

How indicators are reported generates substantial tension. Accountability requires social engagement and political debate. In most societies, media (print, digital, and social) are essential components of accountability cycles. The media want to highlight rankings, poor performance, and, less often, success stories. By contrast, countries often prefer not to discuss unfavourable comparisons publicly. However, the most important thing is that HRSDG indicator reporting is clear and simple, and that the actual indicator values are readily available for countries to learn from others with comparably better performance. A transparent audit trail of every estimate that is published must be provided to build confidence in these numbers.

Recommendations

Our analysis of monitoring of the HRSDGs by national governments, WHO, the World Bank, the GBD collaboration, and others suggests some concrete steps forward to improve the timeliness, reliability, and validity of measurements. Pursuing these recommendations will hopefully lead to a situation in which the correlation of HRSDG indicator values will be much higher across different groups.

Strengthening national data collection capabilities

The international community, led by WHO, should actively seek to provide resources and technical assistance to countries to help them strengthen CRVS. Given the major role for various types of NRHS modules, including biomarkers in HRSDG monitoring, WHO should promote and streamline NRHS capability. WHO is actively considering launching a revised World Health Survey Plus data collection platform that would be tailored to the efficient measurement of the relevant HRSDG indicators. A regular, institutionalised, adequately resourced system of NRHS that includes biomarker collection would go a long way in also enabling better inequality analyses because these surveys would collect geolocated data for the appropriate equity stratifiers. Better measurement and accountability must begin with improved primary data. Building national partnerships between the national statistical offices and the ministries of health will be key in this regard.

Data processing standards

Led by WHO, drawing on the experience of all the groups involved in HRSDG measurement, precise guidance should be developed for the processing of data that covers all key sources of bias, including assessing completeness of important registration, dealing with imprecise or impossible cause of death codes through redistribution algorithms, and cross-walking alternative case definitions, assays, instruments, and measurement methods to a defined reference approach. WHO has a clear leadership role in setting standards for health data; the ICD is a powerful example of this role. The normative role of WHO in setting data processing standards for HRSDG monitoring is extremely important.

Regularly updated best practice guidance on data synthesis

The international community should promulgate and adopt standard approaches to data synthesis reflecting current best practice. However, given that methods for data synthesis are rapidly evolving, principles by which improved methods are evaluated and adopted should also be clearly defined, such as enhanced out-of-sample predictive validity. National statistical offices and the myriad academic groups innovating in this area should be essential partners in this effort. Given widely divergent philosophical views on the right approach to data synthesis, these two extreme positions should be accommodated: the minimalist approach and the statistical data synthesis approach. Results can easily be produced using both approaches so that users can understand exactly what is coming from appropriately processed primary data and what is the result of the statistical data synthesis. This pragmatic compromise, producing both approaches, would then allow different groups to focus on their preferred set of results. As methods in this area evolve, international consensus can be built over a period of time.
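Out-of-sample predictive validity, the evaluation principle named above, can be illustrated with a simple hold-out test: hide each country's most recent data point, predict it from the earlier points, and compare errors across methods. The sketch below compares a no-change rule with a straight-line trend. Both predictors and all data are hypothetical stand-ins, far simpler than the methods used in real monitoring efforts.

```python
from math import sqrt


def carry_forward(train, target_year):
    """Predict the most recent observed value (no change over time)."""
    return train[max(train)]


def linear_trend(train, target_year):
    """Ordinary least squares line through the training points."""
    years = sorted(train)
    n = len(years)
    t_bar = sum(years) / n
    y_bar = sum(train[t] for t in years) / n
    s_tt = sum((t - t_bar) ** 2 for t in years)
    slope = sum((t - t_bar) * (train[t] - y_bar) for t in years) / s_tt
    return y_bar + slope * (target_year - t_bar)


def holdout_rmse(series_by_country, predictor):
    """Hold out each country's latest point, predict it, and return the RMSE."""
    squared_errors = []
    for series in series_by_country.values():
        last_year = max(series)
        train = {t: v for t, v in series.items() if t != last_year}
        squared_errors.append((predictor(train, last_year) - series[last_year]) ** 2)
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical survey series for three countries: {year: indicator value}.
data = {
    "A": {2005: 50.0, 2010: 44.0, 2015: 39.0},
    "B": {2006: 30.0, 2011: 28.5, 2016: 26.0},
    "C": {2004: 80.0, 2009: 70.0, 2014: 62.0},
}

for name, method in [("carry forward", carry_forward), ("linear trend", linear_trend)]:
    print(f"{name}: hold-out RMSE = {holdout_rmse(data, method):.2f}")
```

A shared test of this kind gives groups with different philosophies a common yardstick: whichever method predicts withheld data better on agreed test sets has an evidence-based claim to being current best practice.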

Enhanced transparency guidelines

GATHER should be further strengthened. The scope of GATHER should cover all HRSDGs and inputs, including population numbers to the computation of the HRSDGs. For official HRSDG reporting, all data inputs, including data as they are used before and after processing, should be made publicly available. All code used in any analysis should be posted publicly. Strengthened GATHER, and their uniform application to HRSDG reporting championed by WHO, would have a ripple effect on other health analyses. If the norm for the community shifts to full data transparency and replicability, everyone will benefit. This scenario could become an effective antidote to the unfortunate tendency towards the siloing of data that still exists in many settings.

Full global and national implementation of GATHER

Countries with analytical capacity should apply the standard for data processing and data synthesis to their own data in a fully transparent manner consistent with GATHER. National statistical offices or ministry of health statistics departments would do this entirely on their own or participate in global collaborations that fulfil GATHER, such as the GBD collaboration or interagency processes that estimate child and maternal mortality. Countries with insufficient analytical capacity need support to develop their analytical skills and infrastructure; WHO and the GBD collaboration with partners should make coherent efforts to advance this agenda. In the presence of sparse data, statistical data synthesis methods are more accurate when data from all countries are analysed jointly. A technical challenge remains in how to facilitate joint statistical data synthesis approaches while allowing groups in each country to experiment with and interrogate the methods.

Conclusion

The science of measuring the HRSDGs must be a guiding principle for sound measurement. Good measurement itself is not political—rather, the actions that are based on good measurement are political, as societies must make their own decisions and agendas, informed by the available data, national values, and social priorities. If the necessary data are collected, and the processing and synthesis steps standardised, results should not substantively vary on the basis of who conducts the analysis. These developments will naturally lead to a single, consistent, and transparently documented number for every HRSDG indicator. The scientific community has a crucial role in advancing the procedures and methods to support robust measurement of the HRSDGs.

Contributors

SA, SC, RL, and CJLM prepared the first draft and finalised the manuscript on the basis of comments and reviewer feedback. SS, NY, MR, MFM, EV, AM, RM, and LD provided critical commentary on, and content for, the manuscript over the stages of manuscript finalisation.

Acknowledgments

The authors thank Nancy Fullman and J Everett Mumford for their contributions to this Viewpoint.