Measurement Error in Continuous Endpoints in Randomised Trials
Statistics in Medicine
1 Introduction
In randomised controlled trials, continuous endpoints are often measured with some degree of error. Examples include trial endpoints that are based on self-report (e.g. self-reported physical activity levels [1]), endpoints that are collected as part of routine care (e.g. in pragmatic trials [2]), endpoints that are assessed without blinding the patient or assessor to treatment allocation (e.g. in surgical [3] or dietary [4] interventions), and endpoints assessed with an alternative instrument that substitutes for a gold-standard measurement because of monetary or time constraints or ethical considerations (e.g. a food frequency questionnaire as a substitute for doubly-labelled water to measure energy intake [5]). In these examples, the continuous endpoint measurements contain error in the sense that the recorded endpoints do not unequivocally reflect the endpoint one aims to measure.
Despite calls for attention to the issue of measurement error in endpoints (e.g. [6]), developments and applications of correction methods for error in endpoints are still rare [7]. Specifically, methodology that allows correction of study estimates for the presence of measurement error has so far largely focused on the setting of error in explanatory variables, which may give rise to inferential errors such as regression dilution bias [8, 9, 10, 11, 12, 13]. In addition, the application of measurement error correction methods in the applied medical literature remains unusual [14].
We provide an exploration of problems and solutions for measurement error in continuous trial endpoints. To illustrate these problems and solutions, we consider a published trial that examined the efficacy and tolerability of low-dose iron supplements during pregnancy [15]. To test the effect of the iron supplementation on maternal haemoglobin levels, haemoglobin concentrations were measured at delivery in venous blood.
This paper describes a taxonomy of measurement errors in trial endpoints, evaluates the effect of measurement errors on the analysis of trials, and assesses existing and proposes new methods for evaluating trials whose endpoints contain measurement error. Implementation of the proposed measurement error correction methods (i.e. the existing and novel methods) is supported by a new R library. This paper is structured as follows. In section 2 we revisit the example trial introduced in the previous paragraph. Section 3 presents an exploration of measurement error structures and their impact on inferences of trials. In section 4 measurement error correction methods are proposed. A simulation study investigating the efficacy of the correction methods is presented in section 5. Conclusions and recommendations resulting from this study are provided in section 6.
2 Illustrative example: measurement of haemoglobin levels
Makrides and colleagues [15] tested the efficacy of a 20-mg daily iron supplement (ferrous sulfate) on maternal iron status in pregnant women in a randomised, two-arm, double-blind, placebo-controlled trial. The trial was designed to assess the effect of a low-dose iron supplement of 20 mg/day versus placebo, with the dose chosen to ensure that women in the intervention group would at least meet their Recommended Daily Intake. At delivery, a 5-mL venous blood sample was collected from the women to assess haemoglobin levels as a marker of their iron status; venous blood sampling is often considered the preferred standard for measuring haemoglobin. The results indicate that a low-dose iron supplement during pregnancy is effective in increasing haemoglobin levels measured at delivery.
The endpoints in this trial have been collected by a preferred measurement instrument. In this domain, similar trials have been conducted in which the endpoint was assessed with a lower standard. For instance, in field trials testing the effectiveness of iron supplementation, capillary blood samples instead of venous blood samples are often used to measure haemoglobin levels (e.g. [16]). While easier to measure, capillary haemoglobin levels are less accurate than venous haemoglobin levels [17]. Two more illustrative examples are discussed in Supplementary materials section 1.
2.1 Impact of measurement error
We expand the preceding example to hypothetical structures of error in the measurement of the endpoint by simulation. These structures are only explained intuitively here (explicit definitions are provided in section 3). For this example, we take the observed mean difference in haemoglobin levels between the two groups of the iron supplementation trial as a reference (6.9 g/L higher in the iron-supplemented group), and assume that haemoglobin levels are normally distributed with equal variance in both groups. Fifty thousand simulation samples were taken with 54 patients in each treatment arm, yielding a power of approximately 80% in the absence of measurement error at the usual alpha level (5%). The treatment effect in each simulation sample (the mean difference in haemoglobin levels between the two arms) was estimated by OLS regression.
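As a quick check of the stated power, the following sketch (Python; the paper's own simulations used R, so this re-implementation is purely illustrative) approximates the power of a two-sample z-test under the example trial's values (difference 6.9 g/L, SD 12.6, 54 patients per arm):

```python
from statistics import NormalDist

def two_arm_power(delta, sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sample z-test for a mean difference.

    delta: true mean difference; sd: common outcome SD;
    n_per_arm: number of patients per treatment arm.
    """
    z = NormalDist()
    se = sd * (2.0 / n_per_arm) ** 0.5           # SE of the mean difference
    z_alpha = z.inv_cdf(1 - alpha / 2)
    return 1 - z.cdf(z_alpha - delta / se)       # negligible lower tail ignored

# Example-trial values: 6.9 g/L difference, SD 12.6, 54 patients per arm
power = two_arm_power(6.9, 12.6, 54)             # approximately 0.81
```

The result is close to the 80% reported for the error-free setting.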
2.1.1 Classical measurement error in example trial
In the context of haemoglobin measurement, haemoglobin levels in capillary blood samples may be expected to vary more than haemoglobin levels in venous blood [17], independently of the true haemoglobin level and allocated treatment. Lower power for testing the treatment effect is a well-known consequence of endpoints measured by the lower standard that are unbiased but more variable than endpoints measured by the preferred instrument [13]. This form of measurement error is commonly described as "random measurement error" or "classical measurement error" [10]. To simulate such independent variation, we arbitrarily increased the standard deviation of haemoglobin levels by 75% (from 12.6 to 22.1). The impact of this imposed classical error was an increase in the between-replication variance of the estimated treatment effects of approximately 55% (left plot in panel b, Figure 1). The average estimated effect across simulations (depicted by the dashed line) is approximately equal to the true effect (depicted by the solid line), indicating that the classical measurement error did not introduce bias in the estimated treatment effect (a formal proof is given in section 3.2). The Type-II error rate increased to 38% (grey area) while the Type-I error rate remained at the nominal level of 5% (red area) (right plot in panel b, Figure 1).
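A minimal simulation sketch of this classical error structure (Python, with far fewer replicates than the paper's fifty thousand; only the trial's 6.9 g/L effect, SD 12.6 and 54 patients per arm are taken from the text, the rest is illustrative):

```python
import random
import statistics

random.seed(1)

def simulate_effect(sd, delta=6.9, n=54):
    """One trial replicate: the difference in arm means, which equals the
    OLS slope on a binary treatment indicator."""
    placebo = [random.gauss(0.0, sd) for _ in range(n)]
    active = [random.gauss(delta, sd) for _ in range(n)]
    return statistics.fmean(active) - statistics.fmean(placebo)

reps = 2000
# error-free endpoint (SD 12.6) vs classical error (SD inflated to 22.1)
effects_true = [simulate_effect(12.6) for _ in range(reps)]
effects_noisy = [simulate_effect(22.1) for _ in range(reps)]

bias_noisy = statistics.fmean(effects_noisy) - 6.9   # close to 0: no bias
sd_true = statistics.stdev(effects_true)
sd_noisy = statistics.stdev(effects_noisy)           # clearly inflated
```

The average estimate under error stays near 6.9 while its between-replication spread grows, which is the precision loss described above.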
2.1.2 Systematic measurement error in example trial
It may alternatively be assumed that capillary haemoglobin levels differ systematically from venous haemoglobin levels, for instance when haemoglobin levels in capillary blood are more accurate in patients with low true haemoglobin levels than in patients with high true haemoglobin levels (or vice versa). This is an example of systematic measurement error [7]. To simulate this, we assumed that capillary haemoglobin levels are 1.05 times the venous haemoglobin levels and, as in the previous example, we increased the standard deviation of haemoglobin levels by 75%. The impact of this imposed systematic measurement error structure is that the average treatment effect estimate was biased, increasing from 6.9 to 7.2, and that the between-replication variance of the estimated treatment effect increased by approximately 66% (left plot in panel c, Figure 1). The Type-II error rate increased (to 37%) while the Type-I error rate remained close to the nominal level of 5% (right plot in panel c, Figure 1).
2.1.3 Differential measurement error in example trial
The measurement error structure may also differ between the treatment arms. In an extreme scenario, haemoglobin levels in placebo group patients would be measured in venous blood samples while haemoglobin levels of patients in the active (iron supplemented) arm would be measured in capillary blood samples. To simulate such a scenario, we assume the same systematic error structure as in the previous paragraph, now applying only to the active group, and additionally assume classical measurement error in the placebo group. This scenario classifies as differential measurement error [7]. The impact of this measurement error structure is that the average treatment effect estimate was biased, increasing from 6.9 to 13.3, and that the between-replication variance of the estimated treatment effect increased by approximately 62% (left plot in panel d, Figure 1). The Type-II error rate decreased (to 0.1%) and the Type-I error rate increased (to 48%) (right plot in panel d, Figure 1).
3 Measurement error structures
Consider a two-arm randomised controlled trial that compares the effects of two treatments $X \in \{0, 1\}$, where $X = 0$ may represent a placebo treatment or an active comparator. Let $Y$ denote the true (or preferred) trial endpoint and $Y^*$ an error-prone operationalisation of $Y$. We will assume that both $Y$ and $Y^*$ are measured on a continuous scale. We assume a linear regression model for the endpoint $Y$:

$$Y = \alpha + \beta X + \varepsilon, \quad (1)$$

where $\varepsilon$ is iid normally distributed with mean 0 and variance $\sigma^2$. Under these assumptions and assumptions about the model for $Y^*$ (described below), simple formulas for the bias in the OLS estimator of the treatment effect can be derived. Details of these derivations can be found in the supplementary materials, section 2.
3.1 Classical measurement error
There is classical measurement error in $Y^*$ if $Y^*$ is an unbiased proxy for $Y$ [10]: $Y^* = Y + e$, where the errors $e$ are assumed iid normal with mean 0 and variance $\tau^2$, and independent of $X$, $Y$ and $\varepsilon$ in (1). Using $Y^*$ instead of $Y$ in the linear model yields:

$$Y^* = \alpha^* + \beta^* X + \varepsilon^*, \quad (2)$$

where $\alpha^* = \alpha$, $\beta^* = \beta$ and the residuals $\varepsilon^* = \varepsilon + e$ have mean 0 and variance $\sigma^2 + \tau^2$. This leads to a larger variance in $\hat{\beta}^*$ (the estimator for $\beta^*$) compared to the variance in $\hat{\beta}$ (the estimator for $\beta$). Consequently, classical measurement error will not lead to bias in the effect estimator but will decrease power for a given sample size.
3.2 Heteroscedastic measurement error
In the above we assumed that the variance of the measurement error $e$ is equal in both arms. When this assumption is violated, there is so-called heteroscedastic measurement error. Heteroscedastic error will not lead to bias in the effect estimator, but will invalidate the usual estimator of the variance of $\hat{\beta}^*$ (proof is given in supplementary materials section 2).
3.3 Systematic measurement error
There is systematic measurement error in $Y^*$ if $Y^*$ depends systematically on $Y$: $Y^* = \theta_0 + \theta_1 Y + e$, where the errors $e$ are assumed iid normal with mean 0 and variance $\tau^2$, and independent of $X$, $Y$ and $\varepsilon$ in (1). Throughout, we assume systematic measurement error if $\theta_0 \neq 0$ or $\theta_1 \neq 1$ (and, of course, $\theta_1 \neq 0$ in all cases). Using $Y^*$ with systematic measurement error in the linear model yields the model defined by (2) with $\alpha^* = \theta_0 + \theta_1 \alpha$ and $\beta^* = \theta_1 \beta$, where the residuals have mean 0 and variance $\theta_1^2 \sigma^2 + \tau^2$. Depending on the value of $\theta_1$, the variance of $\hat{\beta}^*$ is larger or smaller than the variance of $\hat{\beta}$. Hence, power to detect the treatment effect will either decrease or increase under systematic measurement error. Type-I error is unaffected since $\beta^* = 0$ if $\beta = 0$ (i.e., tests for null effects are still valid under systematic measurement error) (proof is given in supplementary materials section 2).
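The relation $\beta^* = \theta_1 \beta$ can be checked numerically. In the sketch below (Python; the parameter values $\theta_0 = 2$, $\theta_1 = 1.1$ and the error SD are hypothetical), the naive OLS slope of $Y^*$ on $X$ concentrates around $\theta_1 \beta$ rather than $\beta$:

```python
import random
import statistics

random.seed(2)

def ols_slope(x, y):
    """Slope of a simple OLS regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

theta0, theta1 = 2.0, 1.1            # hypothetical systematic-error parameters
beta = 6.9                           # true treatment effect
n = 20000
x = [i % 2 for i in range(n)]                                 # treatment indicator
y = [beta * xi + random.gauss(0, 12.6) for xi in x]           # true endpoint Y
y_star = [theta0 + theta1 * yi + random.gauss(0, 5.0) for yi in y]  # error-prone Y*

naive = ols_slope(x, y_star)         # concentrates around theta1 * beta = 7.59
```

The intercept shift $\theta_0$ drops out of the slope; only the multiplicative $\theta_1$ biases the effect estimate.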
3.4 Differential measurement error
There is differential measurement error in $Y^*$ if $Y^*$ depends systematically on $Y$ in a way that varies with $X$: $Y^* = \theta_{0X} + \theta_{1X} Y + e_X$, where the errors $e_X$ are assumed iid normal with mean 0 and variance $\tau_X^2$, and independent of $Y$ and $\varepsilon$ in (1), for $X = 0, 1$. Using $Y^*$ with differential measurement error in the linear model yields the model defined in (2) with $\alpha^* = \theta_{00} + \theta_{10}\alpha$ and $\beta^* = \theta_{01} + \theta_{11}(\alpha + \beta) - \theta_{00} - \theta_{10}\alpha$, where the residuals have mean 0 and variance $\theta_{1X}^2 \sigma^2 + \tau_X^2$ for $X = 0, 1$. Since the residual variance is not equal in both arms, the usual estimator of the variance of $\hat{\beta}^*$ is invalid. A heteroscedasticity-consistent estimator of the variance of $\hat{\beta}^*$ is provided by the White estimator [18]. Assuming that the White estimator is used to estimate the variance of $\hat{\beta}^*$, Type-I error is not expected to be at the nominal level and power will decrease or increase under the differential measurement error model (proof is given in supplementary materials section 2).
4 Correction methods for measurement error in a continuous trial endpoint
In this section we describe several approaches to address measurement error in the trial endpoint. Throughout, we assume that $Y^*$ is measured for all randomly allocated patients in the trial. We also assume that $Y$ and $Y^*$ are both measured for a smaller set of different individuals not included in the trial, hereinafter referred to as the external calibration sample. In all but one case, it is assumed that only $Y$ and $Y^*$ are measured in the external calibration sample. In the case that the error in $Y^*$ differs between the two treatment groups, it is assumed that the external calibration sample takes the form of a small pilot study in which both treatments are allocated (i.e., $Y$ and $Y^*$ are both measured after assignment of $X$).
A well-known consequence of classical measurement error in a continuous trial endpoint is that a larger sample size (as compared to the same situation without the measurement error) is needed to compensate for the reduced precision [13]. For example, the new sample size may be calculated by the formula $n^* = n/\rho$, where $\rho$ is the reliability coefficient and $n$ the original sample size for the trial [19]. For solutions for heteroscedastic measurement error, we refer to standard theory of dealing with heteroscedastic errors in regression to find an unbiased estimator for the variance of $\hat{\beta}^*$ (e.g. see [18] for an overview of different heteroscedasticity-consistent covariance matrices).
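A minimal sketch of this sample-size inflation, assuming the reliability coefficient $\rho$ is defined as the variance of the true endpoint divided by the total variance of the error-prone endpoint (the usual definition under classical error; the numerical values are illustrative):

```python
import math

def inflated_sample_size(n, sd_true, sd_error):
    """Sample size needed under classical endpoint error: n* = n / rho,
    with reliability rho = var(Y) / (var(Y) + var(error))."""
    rho = sd_true ** 2 / (sd_true ** 2 + sd_error ** 2)
    return math.ceil(n / rho)

# e.g. original n = 108 (54 per arm), outcome SD 12.6, error SD 9.0
n_star = inflated_sample_size(108, 12.6, 9.0)   # rho ~ 0.66, so n* = 164
```

The required sample size grows in proportion to the share of observed variance that is pure measurement noise.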
Hereinafter we focus on measurement error in $Y^*$ that is either systematic or differential, both of which have been shown to introduce bias in the effect estimator if the measurement error is neglected (section 3). Consistent estimators for the intervention effect are introduced, and various methods for constructing confidence intervals for these estimators are discussed. Section 3 in the supplementary materials provides an explanation of the results stated in this section. For an earlier exploration of the use of an internal calibration set when there is systematic or differential measurement error in endpoints, see [7].
4.1 Systematic measurement error
From section 3.3 it follows that natural estimators for $\alpha$ and $\beta$ are:

$$\hat{\alpha} = \frac{\hat{\alpha}^* - \hat{\theta}_0}{\hat{\theta}_1}, \qquad \hat{\beta} = \frac{\hat{\beta}^*}{\hat{\theta}_1}, \quad (3)$$

where $\hat{\theta}_0$ and $\hat{\theta}_1$ are the estimated error parameters obtained from the calibration data set using standard OLS regression. From equation (3) it becomes apparent that $\hat{\theta}_1$ needs to be assumed bounded away from zero for finite estimates of $\hat{\alpha}$ and $\hat{\beta}$ [8]. The estimators in (3) are consistent; for a proof, see section 3.1 in the supplementary material.
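The estimators in (3) can be sketched as follows (Python; an illustrative re-implementation with hypothetical error parameters, not the mecor code):

```python
import random
import statistics

random.seed(3)

def ols(x, y):
    """Return (intercept, slope) of a simple OLS regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope

# --- external calibration set: both Y and Y* observed ------------------
theta0, theta1 = 2.0, 1.1                     # hypothetical error parameters
cal_y = [random.gauss(125.0, 12.6) for _ in range(20000)]
cal_ystar = [theta0 + theta1 * yi + random.gauss(0, 5.0) for yi in cal_y]
t0_hat, t1_hat = ols(cal_y, cal_ystar)        # regress Y* on Y

# --- trial: only Y* observed -------------------------------------------
alpha, beta = 118.0, 6.9                      # true intercept and effect
x = [i % 2 for i in range(20000)]             # treatment indicator
y = [alpha + beta * xi + random.gauss(0, 12.6) for xi in x]
ystar = [theta0 + theta1 * yi + random.gauss(0, 5.0) for yi in y]
a_star, b_star = ols(x, ystar)                # naive estimates

beta_corr = b_star / t1_hat                   # corrected treatment effect
alpha_corr = (a_star - t0_hat) / t1_hat       # corrected intercept
```

Dividing the naive slope by $\hat{\theta}_1$ undoes the multiplicative bias shown in section 3.3.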
The variance of the estimators defined in (3) can be approximated using the Delta method [20], the Fieller method [20], the Zero-variance method, or the bootstrap. Further details are provided in section 3.1 of the supplementary materials.
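For the ratio estimator $\hat{\beta} = \hat{\beta}^*/\hat{\theta}_1$, with the trial and the external calibration sample independent, a first-order Delta-method approximation takes the standard form for a ratio of independent estimators (the exact expression used in the supplementary materials may differ in detail):

```latex
\operatorname{Var}\bigl(\hat{\beta}\bigr)
  \approx \frac{1}{\hat{\theta}_1^{2}}\operatorname{Var}\bigl(\hat{\beta}^{*}\bigr)
  + \frac{\hat{\beta}^{*2}}{\hat{\theta}_1^{4}}\operatorname{Var}\bigl(\hat{\theta}_1\bigr)
```

The Zero-variance method corresponds to dropping the second term, i.e. treating $\hat{\theta}_1$ as known, which explains the too-narrow intervals it produces in the simulations of section 5.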
4.2 Differential measurement error
From section 3.4 it follows that natural estimators for $\alpha$ and $\beta$ are:

$$\hat{\alpha} = \frac{\hat{\alpha}^* - \hat{\theta}_{00}}{\hat{\theta}_{10}}, \qquad \hat{\beta} = \frac{\hat{\alpha}^* + \hat{\beta}^* - \hat{\theta}_{01}}{\hat{\theta}_{11}} - \frac{\hat{\alpha}^* - \hat{\theta}_{00}}{\hat{\theta}_{10}}, \quad (4)$$

where $\hat{\theta}_{00}$, $\hat{\theta}_{10}$, $\hat{\theta}_{01}$ and $\hat{\theta}_{11}$ are estimated from the external calibration set using standard OLS estimators. Here it is assumed that both $\hat{\theta}_{10}$ and $\hat{\theta}_{11}$ are bounded away from zero (for reasons similar to those mentioned in section 4.1). The estimators in (4) are consistent; for a proof, see section 3.1 in the supplementary material. The variance of the estimators defined in (4) can be approximated using the Delta method [20], the Zero-variance method, or the bootstrap. Further details are provided in section 3.2 of the supplementary materials.
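A sketch of the estimators in (4) (Python; an illustrative re-implementation with hypothetical arm-specific error parameters, not the mecor code). The correction recovers each arm's mean on the $Y$ scale from its own calibration line, then takes the difference:

```python
import random
import statistics

random.seed(4)

def ols(x, y):
    """Return (intercept, slope) of a simple OLS regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

# hypothetical arm-specific error parameters: arm -> (theta0, theta1)
theta = {0: (0.0, 1.0), 1: (2.0, 1.1)}

def measure(y, arm):
    """Error-prone measurement Y* under the differential error model."""
    t0, t1 = theta[arm]
    return t0 + t1 * y + random.gauss(0, 5.0)

# --- calibration pilot: Y and Y* observed after treatment assignment ---
cal_y = {arm: [random.gauss(125.0, 12.6) for _ in range(10000)] for arm in (0, 1)}
cal_ystar = {arm: [measure(yi, arm) for yi in cal_y[arm]] for arm in (0, 1)}
t_hat = {arm: ols(cal_y[arm], cal_ystar[arm]) for arm in (0, 1)}

# --- trial: only Y* observed -------------------------------------------
alpha, beta = 118.0, 6.9
x = [i % 2 for i in range(20000)]
y = [alpha + beta * xi + random.gauss(0, 12.6) for xi in x]
ystar = [measure(yi, xi) for yi, xi in zip(y, x)]
a_star, b_star = ols(x, ystar)

# corrected arm-specific means on the Y scale, then their difference (4)
mean1 = (a_star + b_star - t_hat[1][0]) / t_hat[1][1]
mean0 = (a_star - t_hat[0][0]) / t_hat[0][1]
beta_corr = mean1 - mean0
```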
5 Simulation study
The finite sample performance of the measurement error corrected estimators of the treatment effect was studied by simulation. We focussed on the situation of a two-arm trial in which the continuous surrogate endpoint was measured with systematic or differential measurement error, and in which an external calibration set of varying size was available. The results from example trial 1 (see section 2) are used to motivate our simulation study.
5.1 Data generation
Data were generated for a sample approximately equal in size to example trial 1, with individuals equally divided over the two treatment arms. The true endpoints were generated according to model (1), assuming iid normal errors and using the estimated characteristics found in example trial 1. Surrogate endpoints were generated under the models for systematic measurement error and differential measurement error described in sections 3.3 and 3.4, respectively.
For systematic measurement error in $Y^*$, fixed values were used for $\theta_0$ and $\theta_1$; under the differential measurement error model, fixed arm-specific values were used for $\theta_{00}$, $\theta_{10}$, $\theta_{01}$ and $\theta_{11}$. We considered three scenarios based on the coefficient of determination between $Y$ and $Y^*$: (i) weak, (ii) medium and (iii) strong. For each scenario, the measurement error variance was set to attain the corresponding coefficient of determination, separately for the systematic and differential measurement error models.
For the scenarios with systematic measurement error, a separate calibration set with the characteristics of the placebo arm was generated for each simulated data set. For the differential measurement error scenarios, a calibration data set was generated for each simulated data set, with subjects equally divided over the two treatment groups. The sample size of the external calibration data set was varied, for both the systematic and the differential measurement error scenarios.
5.2 Computation
For each simulated data set the corrected treatment effect estimator, (3) for systematic error and (4) for differential error, was applied. In the systematic measurement error scenarios, confidence intervals for the corrected estimator were constructed using the Zero-variance method, the Delta method, the Fieller method, and the Bootstrap method based on 999 replicates (as defined in section 4.1). In the case of differential measurement error, confidence intervals for the corrected estimator were constructed using the Zero-variance method, the Delta method and the Bootstrap method based on 999 replicates (as defined in section 4.2). The HC3 heteroscedasticity-consistent variance estimator was used to accommodate heteroscedastic error in the differential measurement error scenario [18]. Furthermore, for both the systematic and differential measurement error scenarios the naive analysis was performed (resulting in a naive effect estimate and naive confidence interval), i.e. the 'regular' analysis that would be performed if measurement error were neglected.
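The bootstrap procedure can be sketched as follows (Python; a percentile bootstrap that resamples the trial and the calibration set independently, shown here for the systematic-error correction with illustrative parameters and far fewer replicates than the 999 used in the study):

```python
import random
import statistics

random.seed(5)

def ols(x, y):
    """Return (intercept, slope) of a simple OLS regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def corrected_beta(x, ystar, cal_y, cal_ystar):
    """Systematic-error correction (3): naive slope / estimated theta1."""
    _, b_star = ols(x, ystar)
    _, t1_hat = ols(cal_y, cal_ystar)
    return b_star / t1_hat

def resample(pairs):
    """Nonparametric bootstrap resample (with replacement)."""
    return [random.choice(pairs) for _ in pairs]

# simulated trial and calibration data under systematic error (theta1 = 1.1)
x = [i % 2 for i in range(500)]
y = [118.0 + 6.9 * xi + random.gauss(0, 12.6) for xi in x]
ystar = [2.0 + 1.1 * yi + random.gauss(0, 5.0) for yi in y]
cal_y = [random.gauss(125.0, 12.6) for _ in range(200)]
cal_ystar = [2.0 + 1.1 * yi + random.gauss(0, 5.0) for yi in cal_y]

beta_hat = corrected_beta(x, ystar, cal_y, cal_ystar)   # point estimate

trial_pairs = list(zip(x, ystar))
cal_pairs = list(zip(cal_y, cal_ystar))
boot = []
for _ in range(500):               # fewer replicates here, for speed
    tx, ty = zip(*resample(trial_pairs))
    cy, cys = zip(*resample(cal_pairs))
    boot.append(corrected_beta(tx, ty, cy, cys))
boot.sort()
ci = (boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))])
```

Resampling the calibration set as well propagates the uncertainty in $\hat{\theta}_1$ that the Zero-variance method ignores.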
We studied the performance of the corrected treatment effect estimators in terms of percentage bias [21], empirical standard error (EmpSE) and the square root of the mean squared error (sqrtMSE) [22]. The performance of the methods for constructing the confidence intervals was studied in terms of coverage and power (1 minus the Type-II error) [22].
In our simulations, the Fieller method resulted in undefined confidence intervals in iterations in which $\hat{\theta}_1$ was not statistically significantly different from zero. The percentage of iterations for which the Fieller method failed to construct confidence intervals is reported. If the Fieller method resulted in undefined confidence intervals in more than 5% of cases in a simulation scenario, the coverage and average confidence interval width were not calculated, as this would result in unfair comparisons between the different confidence interval constructing methods. The bootstrap confidence intervals were based on fewer than 999 estimates in case a sample drawn from the external calibration set consisted of identical replicates. These errors occurred more frequently for small calibration sets and low R-squared. All simulations were run in R version 3.4, using the library mecor (version 0.1.0). The results of the simulation are available at doi:10.6084/m9.figshare.7068695 and the code is available at doi:10.6084/m9.figshare.7068773, together with the seed used for the simulation study.
5.3 Results of simulation study
5.3.1 Systematic measurement error
Table 1 shows percentage bias, EmpSE and sqrtMSE of the naive estimator and the corrected estimator under systematic measurement error, for the different calibration sample sizes and R-squared scenarios. Naturally, the percentage bias in the naive estimator is about 5%, as $\beta^* = \theta_1 \beta$. For the corrected estimator, percentage bias, EmpSE and sqrtMSE are reasonably small in the strong and medium R-squared scenarios, provided the calibration set is not too small. In the weak R-squared scenario, the bias of the corrected estimator fluctuates and EmpSE and sqrtMSE are large for all calibration sample sizes. The estimates of the intervention effect using the corrected estimator for every 10th iteration of our simulation are shown in Figure 2, which provides a clear visualisation of the results discussed above: the bigger the sample size of the external calibration set and the higher the R-squared, the better the performance of the corrected estimator. The sampling distribution of $\hat{\theta}_1$ depicted in Figure 3 explains why there is so much variation in the corrected effect estimator for small external calibration sets and low R-squared: in a number of iterations, $\hat{\theta}_1$ was estimated close to zero, inflating the corrected estimator in those iterations and resulting in large bias, EmpSE and MSE. Note that if $\hat{\theta}_1 < 0$, the sign of the corrected estimator changes, explaining why the corrected estimate of the intervention effect is sometimes below zero.
The Fieller method failed to construct confidence intervals in 15, 5, 1 and 0.1% of simulated data sets in the strong R-squared scenario, in 48, 36, 22, 8, 3 and 0.3% in the medium R-squared scenario, and in 74, 71, 64, 53, 43, 26, 15 and 8% in the weak R-squared scenario (each series corresponding to increasing calibration sample sizes). Coverage and average confidence interval width of the Fieller method are therefore not evaluated for the calibration sample sizes in which undefined intervals occurred in more than 5% of data sets; in the weak R-squared scenario this applies to every calibration sample size, so the Fieller method is not evaluated for that scenario.
Table 1 also shows coverage of the true intervention effect by the constructed confidence intervals and the average confidence interval width for the Zero-variance, Delta, Fieller and Bootstrap methods. Coverage of the true treatment effect of 6.9 using Wald confidence intervals for the naive effect estimator was close to 95%, because the bias in the naive estimator is small (i.e. 5%). The Zero-variance method yielded too-narrow confidence intervals in all scenarios, an intuitively clear result as the Zero-variance method neglects the variance of $\hat{\theta}_1$. In the strong R-squared scenario, the Delta, Fieller and Bootstrap methods constructed correct confidence intervals for all but the smallest calibration sets, for which the Delta and Fieller methods constructed too-narrow and the Bootstrap method too-broad confidence intervals. In the medium and weak R-squared scenarios, the Delta and Bootstrap methods likewise constructed correct confidence intervals only for the larger calibration sets, the Delta method otherwise yielding too-narrow and the Bootstrap method too-broad confidence intervals. Coverage of the Fieller method was about the desired 95% level wherever it could be evaluated.
Power of the naive effect estimator was 99.8%, 97.1% and 68.4% in the strong, medium and weak R-squared scenarios, respectively. Power using the Zero-variance, Delta and Bootstrap methods was 100%. For the scenarios in which the Fieller method could be evaluated, power was 99.8% (strong R-squared) and 97.1% (medium R-squared).
5.3.2 Differential measurement error
Table 2 shows percentage bias, EmpSE and sqrtMSE of the naive estimator and the corrected estimator under differential measurement error, for the different calibration sample sizes and R-squared scenarios. The percentage bias in the naive estimator was about 92%. For the corrected estimator, percentage bias, EmpSE and sqrtMSE are reasonably small in the strong and medium R-squared scenarios, provided the calibration set is not too small; in the weak R-squared scenario they are large for all calibration sample sizes. The estimates of the intervention effect using the corrected estimator for every 10th iteration of our simulation are shown in Figure 4, which provides a clear visualisation of the results discussed above: the bigger the sample size of the external calibration set and the higher the R-squared, the better the performance of the corrected estimator.
Table 2 also shows coverage of the true intervention effect by the constructed confidence intervals and the average confidence interval width for the Zero-variance, Delta and Bootstrap methods. Coverage of the true treatment effect of 6.9 using Wald confidence intervals for the naive effect estimator was about 1, 7 and 41% in the strong, medium and weak R-squared scenarios, respectively. In all cases, the Zero-variance method yielded too-narrow confidence intervals; the Delta method yielded too-broad confidence intervals, and the Bootstrap method yielded mostly too-broad confidence intervals, except in a few scenarios with small calibration sets (too narrow). For the larger calibration sets, coverage of the true intervention effect approached 95%.
Power of the naive effect estimator was 100%, 100% and 99.6% in the strong, medium and weak R-squared scenarios, respectively. Power using the Zero-variance, Delta and Bootstrap methods was 100%.
6 Discussion
This paper outlined the ramifications for randomised trial inferences when continuous endpoints are measured with error. Our study showed that when this measurement error is ignored, not only can trial results be hampered by a loss in precision of the treatment effect estimate (i.e. an increase in Type-II error for a given sample size), but trial inferences can also be affected through bias in the treatment effect estimator, and a null-hypothesis significance test for the treatment effect can deviate substantially from the nominal level. In this article we proposed a number of regression calibration-like correction methods to reduce the bias in the treatment effect estimator and obtain confidence intervals with nominal coverage. In our simulation studies, these methods were effective in improving trial inferences when an external calibration data set (containing information about error-prone and error-free measurements) with at least 20 subjects was available.
To anticipate the impact of measurement error on trial inferences, the mechanism and magnitude of the measurement error should be considered. Purely homoscedastic classical measurement error in the endpoint is expected to reduce the precision of treatment effect estimates and increase the Type-II error, proportional to the relative amount of variance that is due to the error. Heteroscedastic classical error and differential error also affect the Type-I error, with inflated Type-I error rates typically expected for differential and heteroscedastic measurement error. Under systematic measurement error, only Type-I errors for testing null effects are expected to be at the nominal level. The treatment effect estimate itself is biased by systematic error and differential error. Heteroscedastic error can be addressed using standard robust standard error estimators (e.g. HC3 [18]); systematic error and differential error in the endpoint can be addressed via regression calibration.
We considered regression calibration-like correction methods that rely on an external calibration set containing information about both error-prone and error-free measurements. We anticipate that such an external calibration set can feasibly be collected as a planned pilot phase of a trial. Our simulation study shows that the effectiveness of correction methods in adjusting the trial results for endpoint measurement error depends on the size of the calibration sample and the strength of the correlation between the error-free and error-prone measurements of the trial endpoint. For a weak relation (R = 0.20) we found the correction methods to be generally ineffective in improving trial inference with reasonably sized calibration sets (i.e., up to size n = 50). For medium (R = 0.50) or strong (R = 0.80) correlations, however, regression calibration showed improvements with small external calibration samples. The proposed calibration correction methods rely on a linear regression framework and can thus easily be extended to incorporate covariables in the trial analysis [23].
The use of measurement error corrections is still rare in applied biomedical studies, despite an abundance of measurement error problems, which are usually reported only as an afterthought to a study [14]. Indeed, to our knowledge, no measurement error correction methods have so far been used in the analysis of biomedical trials to correct for measurement error in the endpoint. This may in part be due to a common misconception that measurement error can only affect trial inference by reducing the precision of the estimate of (and the power to detect) the treatment effect. Our study demonstrates that such an assumption is warranted only when a strictly classical, homoscedastic error structure for the trial endpoint can be assumed. This does not hold, for instance, when measurement errors are more pronounced in the tails of the distribution, or when measurement errors vary between treatment arms.
Instead of using external calibration data sets, internal measurement error correction approaches, in which both the preferred endpoint and the error-contaminated endpoint are measured on a subset of trial participants, may sometimes be more feasible. For internal calibration, Keogh and colleagues [7] recently reviewed method-of-moments and maximum likelihood estimation approaches. There are also other approaches to correct for measurement error that we did not discuss in this paper. For instance, Cole and colleagues suggested a multiple imputation approach based on an internal calibration set [24]. We also focused only on continuous outcomes in this paper; problems and solutions for misclassified outcomes can be found elsewhere [25]. However, to the best of our knowledge, none of these methods have been tested in the setting where trial endpoints are measured with error, and thus they need further study.
In summary, the impact of measurement error in a continuous endpoint on trial inferences can be particularly severe when the measurement error is not strictly random, because Type-I and Type-II error rates and the effect estimates can all be affected. To alleviate the detrimental effects of such non-random measurement error, we proposed measurement error corrected estimators and a variety of methods to construct confidence intervals for them. To facilitate the implementation of these measurement error correction estimators we have developed the R package mecor, available at: www.github.com/LindaNab/mecor.
References
- [1] E Cerin, KL Cain, AL Oyeyemi, N Owen, TL Conway, T Cochrane, D Van Dyck, J Schipperijn, J Mitáš, M Toftager, I Aguinaga-Ontoso, and JF Sallis. Correlates of agreement between accelerometry and self-reported physical activity. Med Sci Sports Exerc, 48(6):1075–1084, 2016.
- [2] MS Lauer and RB D'Agostino. The randomized registry trial: The next disruptive technology in clinical research? N Engl J Med, 369(17):1579–1581, 2013.
- [3] I Boutron, F Tubach, B Giraudeau, and P Ravaud. Blinding was judged more difficult to achieve and maintain in nonpharmacologic than pharmacologic trials. J Clin Epidemiol, 57:543–550, 2004.
- [4] HM Staudacher, PM Irving, MCE Lomer, and K Whelan. The challenges of control groups, placebos and blinding in clinical trials of dietary interventions. Proc Nutr Soc, 76:203–212, 2017.
- [5] S Mahabir, DJ Baer, C Giffen, A Subar, W Campbell, TJ Hartman, B Clevidence, D Albanes, and PR Taylor. Calorie intake misreporting by diet record and food frequency questionnaire compared to doubly labeled water among postmenopausal women. Eur J Clin Nutr, 60:561–565, 2006.
- [6] S Senn and S Julious. Measurement in clinical trials: A neglected issue for statisticians? Stat Med, 28:3189–3209, 2009.
- [7] RH Keogh, RJ Carroll, JA Tooze, SI Kirkpatrick, and LS Freedman. Statistical issues related to dietary intake as the response variable in intervention trials. Stat Med, 35:4493–4508, 2016.
- [8] JP Buonaccorsi. Measurement error: Models, methods, and applications. Chapman & Hall/CRC, Boca Raton, FL, 2010.
- [9] TB Brakenhoff, M van Smeden, FLJ Visseren, and RHH Groenwold. Random measurement error: Why worry? An example of cardiovascular risk factors. PLoS One, 13(2):1–8, 2018.
- [10] RJ Carroll, D Ruppert, LA Stefanski, and CM Crainiceanu. Measurement error in nonlinear models: A modern perspective. Chapman & Hall/CRC, Boca Raton, FL, 2nd edition, 2006.
- [11] WA Fuller. Measurement Error Models. John Wiley & Sons, New York, NY, 1987.
- [12] P Gustafson. Measurement error and misclassification in statistics and epidemiology: Impacts and Bayesian adjustments. Chapman & Hall/CRC, Boca Raton, FL, 2004.
- [13] JA Hutcheon, A Chiolero, and JA Hanley. Random measurement error and regression dilution bias. BMJ, 340:c2289, 2010.
- [14] TB Brakenhoff, M Mitroiu, RH Keogh, KGM Moons, RHH Groenwold, and M van Smeden. Measurement error is often neglected in medical literature: A systematic review. J Clin Epidemiol, 98:89–97, 2018.
- [15] M Makrides, CA Crowther, RA Gibson, RS Gibson, and CM Skeaff. Efficacy and tolerability of low-dose iron supplements during pregnancy: a randomized controlled trial. Am J Clin Nutr, 78(1):145–153, 2003.
- [16] S Zlotkin, P Arthur, KY Antwi, and G Yeung. Randomized, controlled trial of single versus 3-times-daily ferrous sulfate drops for treatment of anemia. Pediatrics, 108(3):613–616, 2001.
- [17] AJ Patel, R Wesley, SF Leitman, and BJ Bryant. Capillary versus venous haemoglobin determination in the assessment of healthy blood donors. Vox Sang, 104(4):317–323, 2013.
- [18] JS Long and LH Ervin. Using heteroscedasticity consistent standard errors in the linear regression model. Am Stat, 54(3):217–224, 2000.
- [19] G Fitzmaurice. Measurement error and reliability. Nutr, 18(1):112–114, 2002.
- [20] JP Buonaccorsi. Measurement errors, linear calibration and inferences for means. Comput Stat Data Anal, 11(3):239–257, 1991.
- [21] A Burton, DG Altman, P Royston, and RL Holder. The design of simulation studies in medical statistics. Stat Med, 25:4279–4292, 2006.
- [22] TP Morris, IR White, and MJ Crowther. Using simulation studies to evaluate statistical methods. https://arxiv.org/pdf/1712.03198v2.pdf, 2018.
- [23] SJ Senn. Covariate imbalance and random allocation in clinical trials. Stat Med, 8:467–475, 1989.
- [24] SR Cole, H Chu, and S Greenland. Multiple-imputation for measurement-error correction. Int J Epidemiol, 35(4):1074–1081, 2006.
- [25] DR Brooks, KD Getz, AT Brennan, AZ Pollack, and MP Fox. The impact of joint misclassification of exposures and outcomes on the results of epidemiologic research. Curr Epidemiol Rep, 5(2):166–174, 2018.
- [26] ET Poehlman, WF Denino, T Beckett, KA Kinaman, IJ Dionne, R Dvorak, and PA Ades. Effects of endurance and resistance training on total daily energy expenditure in young women: A controlled randomized trial. J Clin Endocrinol Metab, 87(3):1004–1009, 2002.
- [27] G Plasqui and KR Westerterp. Physical activity assessment with accelerometers: An evaluation against doubly labeled water. Obesity, 15(10):2371–2379, 2007.
- [28] JWJ Bijlsma, PMJ Welsing, TG Woodworth, LM Middelink, A Pethö-Schramm, C Bernasconi, MEA Borm, CH Wortel, EJ ter Borg, ZN Jahangier, WH van der Laan, GAW Bruyn, P Baudoin, S Wijngaarden, PAJM Vos, R Bos, MJF Starmans, EN Griep, JRM Griep-Wentink, CF Allaart, AHM Heurkens, XM Teitsma, J Tekstra, ACA Marijnissen, FPJ Lafeber, and JWG Jacobs. Early rheumatoid arthritis treated with tocilizumab, methotrexate, or their combination (U-Act-Early): A multicentre, randomised, double-blind, double-dummy, strategy trial. Lancet, 388:343–355, 2017.
- [29] MLL Prevoo, MA van 't Hof, HH Kuper, MA van Leeuwen, LBA van de Putte, and PLCM van Riel. Modified disease activity scores that include twenty-eight-joint counts: development and validation in a prospective longitudinal study of patients with rheumatoid arthritis. Arthritis Rheum, 38(1):44–48, 1995.
- [30] T Pincus, Y Yazici, and T Sokka. Complexities in assessment of rheumatoid arthritis: Absence of a single gold standard measure. Rheum Dis Clin N Am, 35(4):687–697, 2009.
- [31] J Anderson, L Caplan, J Yazdany, ML Robbins, T Neogi, K Michaud, KG Saag, JR O'dell, and S Kazi. Rheumatoid arthritis disease activity measures: American College of Rheumatology recommendations for use in clinical practice. Arthritis Care Res, 64(5):640–647, 2012.
- [32] EH Choy, B Khoshaba, D Cooper, A Macgregor, and DL Scott. Development and validation of a patient-based disease activity score in rheumatoid arthritis that can be used in clinical trials and routine practice. Arthritis Rheum, 59(2):192–199, 2008.
- [33] R Davidson and JG MacKinnon. Econometric Theory and Methods. Oxford University Press, New York, NY, 2004.
- [34] VH Franz. Ratios: A short guide to confidence limits and proper use. https://arxiv.org/pdf/0710.2024v1.pdf, 2007.
- [35] B Efron. Bootstrap methods: Another look at the jackknife. Ann Stat, 7(1):1–26, 1979.
- [36] EC Fieller. The biological standardization of insulin. J Roy Stat Soc Suppl, 7(1):1–64, 1940.
S1 Illustrative examples
We introduce here two additional example trials from the literature and hypothesize that these trials could also have used endpoints measured with error, to illustrate how the use of an endpoint contaminated with error would affect trial inference. We assume that the original endpoints used in our example trials are free of measurement error.
S1.1 Example trial 2: energy expenditure
Poehlman and colleagues [26] studied the effects of endurance and resistance training on total daily energy expenditure in a randomised trial of young sedentary women. Participants were randomised to one of three six-month exercise programmes: endurance training, resistance training or the control arm. Some controversy regarding the effect of exercise training on total energy expenditure (TEE) existed at the time the trial started, partly because of the difficulty of assessing daily energy expenditure [26]. Starting 72 hours after completion of the training programme, TEE of the participants was measured by doubly labelled water during a ten-day period, which is considered the gold standard for measuring energy expenditure in humans [27]. In short, the study found no evidence for an effect of resistance or endurance training (compared to control) on total energy expenditure. Post-trial, measured TEE was higher in the control arm than in the two intervention arms. Table S1 shows the decrease in TEE of the women in the endurance training programme versus the control arm.
S1.2 Example trial 3: rheumatoid arthritis disease activity
The U-Act-Early trial tested the efficacy of a new treatment strategy for rheumatoid arthritis (RA) in patients with newly diagnosed RA [28] in a three-arm trial: tocilizumab plus methotrexate versus tocilizumab only versus methotrexate only, all as initial treatment. For endpoint assessment, this trial used a validated RA disease activity measure (the Disease Activity Score 28, DAS28 [29]), which is commonly used and recommended for measuring endpoints in RA clinical trials [30, 31]. In short, the trial showed that immediate initiation of tocilizumab, with or without methotrexate, is more effective than methotrexate alone in achieving sustained remission in newly diagnosed RA patients. The difference in mean DAS28 score in the tocilizumab plus methotrexate group versus the methotrexate only group after 24 weeks is shown in Table S1. The sample sizes of these groups reported in Table S1 are based on measurements available at 24 weeks of follow-up.
A common alternative approach to measuring energy expenditure (example trial 2) is an accelerometer, which measures body movement via motion sensors to assess energy expenditure (e.g. [27]). Compared to doubly labelled water (example trial 2), the accelerometer is cheaper, but less accurate [27]. Lastly, instead of endpoint assessment by DAS28 (example trial 3), where assessment is done by trained medical staff [29], trials could alternatively use the patient-based RA disease activity score (PDAS), where endpoint assessment is done by the patient [32].
For the example trial in the paper and each of the example trials introduced here, Table S1 shows to what extent the power of a test for treatment effect changes when a hypothetically lower standard of endpoint measurement, introducing classical measurement error, would have been used. The table clearly shows the anticipated decrease in power with increasing error at the same sample size.
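This power loss can be sketched with a normal approximation to the two-sample test: classical error with standard deviation tau inflates the endpoint's standard deviation from sigma to sqrt(sigma^2 + tau^2), shrinking the non-centrality parameter at a fixed sample size. A minimal Python sketch with illustrative numbers (not the values of any of the example trials):

```python
from statistics import NormalDist

def power_two_arm(beta, sd_y, error_sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-arm z-test for effect `beta` when
    classical error with sd `error_sd` is added to the endpoint."""
    sd_star = (sd_y ** 2 + error_sd ** 2) ** 0.5   # sd of error-prone endpoint
    se = sd_star * (2 / n_per_arm) ** 0.5          # se of the mean difference
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lam = beta / se                                # non-centrality parameter
    nd = NormalDist()
    return 1 - nd.cdf(z - lam) + nd.cdf(-z - lam)

no_error = power_two_arm(0.5, 1.0, 0.0, 50)   # about 0.71
with_error = power_two_arm(0.5, 1.0, 1.0, 50) # about 0.42
```

Adding error with the same standard deviation as the outcome roughly halves the non-centrality parameter, and the approximate power drops accordingly.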
S2 Measurement error structures
Consider a two-arm randomized controlled trial that compares the effects of two treatments ($X = 0, 1$), where $X = 0$ may represent a placebo treatment or an active comparator. Let $Y$ denote the true (or preferred) trial endpoint and $Y^*$ an error-prone operationalisation of $Y$. We will assume that both $Y$ and $Y^*$ are measured on a continuous scale. Throughout, we assume that $Y^*$ is measured for all $n$ randomly allocated patients in the trial. We assume that the effect of allocated treatment ($X$) on the preferred endpoint $Y$ is defined by the linear model
$$Y_i = \alpha + \beta X_i + \varepsilon_i, \qquad \text{(S1)}$$
where $\beta$ defines the treatment effect on the endpoint, and $\varepsilon_i$ has expected mean 0 and variance $\sigma^2$. Throughout, we assume that $X$ is fixed. Further, we assume that model (S1) is inestimable from the observed data because the endpoint $Y^*$ instead of $Y$ was measured. We will assume that the relation between $Y^*$ and $Y$ is given by a linear model,
$$Y^*_i = \theta_0 + \theta_1 Y_i + e_i, \qquad \text{(S2)}$$
where $e_i$ is a random variable whose distribution is independent of $X_i$, $Y_i$ and $\varepsilon_i$. The parameters $\theta_0$ and $\theta_1$ define the relation between $Y^*$ and $Y$, where it is assumed that $\theta_1$ does not equal 0. We assume that both parameters $\theta_0$ and $\theta_1$ are estimable only in an external calibration sample comprising $n_c$ individuals not included in the trial.
Simple OLS regression estimators for $\beta$, $\alpha$ and $\sigma^2$ (the variance of the errors) in (S1), obtained by regressing $Y^*$ on $X$, are
$$\hat\beta^* = \bar{Y}^*_1 - \bar{Y}^*_0, \quad \text{(S3)} \qquad \hat\alpha^* = \bar{Y}^*_0, \quad \text{(S4)} \qquad \hat\sigma^{*2} = \frac{1}{n-2}\sum_{i=1}^{n}\hat\delta_i^2 \;\text{ with residuals }\; \hat\delta_i = Y^*_i - \hat\alpha^* - \hat\beta^* X_i, \quad \text{(S6)}$$
respectively, where $\bar{Y}^*_x$ denotes the mean of $Y^*$ in treatment arm $x$. In a two-arm trial, the interest is in making inferences about $\beta$, which cannot be directly estimated because in the trial the endpoint of interest $Y$ was replaced by $Y^*$. In the following we will show: a) that $\hat\beta^*$ may be a poor estimator for $\beta$ (sections 3.1-3.4), and b) how adjustments to $\hat\beta^*$ using information from the calibration model described by (S2) can improve inference about the treatment effect (section 4). As a starting point, the following section defines the relevant and known properties for the special case $Y^* = Y$, followed by the properties under different measurement error structures for $Y^*$ in subsequent sections.
S2.1 No measurement error
Consider the hypothetical case that $Y^*$ is a perfect proxy for $Y$, i.e. $Y^* = Y$. By using that $Y_i = \alpha + \beta X_i + \varepsilon_i$, as defined in (S1), it follows that:
$$Y^*_i = \alpha + \beta X_i + \varepsilon_i.$$
From standard regression theory (e.g. [33]), we know that if the errors $\varepsilon_i$ satisfy the regular Gauss-Markov assumptions [33] and their variance is $\sigma^2$, the OLS estimators $\hat\beta^*$, $\hat\alpha^*$ and $\hat\sigma^{*2}$ (defined by (S3), (S4) and (S6), respectively) are Best Linear Unbiased Estimators (BLUE) for $\beta$, $\alpha$ and $\sigma^2$, respectively.
Moreover, if the $\varepsilon_i$ are independently and identically distributed (iid) normal, the OLS estimators $\hat\beta^*$ and $\hat\alpha^*$ (defined in (S3) and (S4), respectively) are the Maximum Likelihood Estimators (MLE) of $\beta$ and $\alpha$, respectively. Note that the errors satisfy the Gauss-Markov assumptions if we assume that they are iid normally distributed with mean 0 and constant variance $\sigma^2$.
Hypotheses for the treatment effect $\beta$ can be defined by
$$H_0\colon \beta = 0 \quad \text{versus} \quad H_1\colon \beta \neq 0.$$
Under normality of the error terms $\varepsilon_i$, the OLS estimator $\hat\beta^*$ defined in (S3) is the MLE for $\beta$ and is an unbiased estimator for $\beta$; the following is known for the Wald test statistic
$$t = \frac{\hat\beta^*}{\sqrt{\widehat{\operatorname{Var}}(\hat\beta^*)}}, \qquad \text{(S7)}$$
where,
$$\widehat{\operatorname{Var}}(\hat\beta^*) = \frac{\hat\sigma^{*2}}{\sum_{i=1}^{n}(X_i - \bar{X})^2}. \qquad \text{(S8)}$$
Assuming no measurement error in $Y^*$, under $H_0$, $t$ follows a Student's t distribution with $n - 2$ degrees of freedom [33]. Under $H_1$, $t$ follows a non-central Student's t distribution with $n - 2$ degrees of freedom and non-centrality parameter $\beta/\sqrt{\operatorname{Var}(\hat\beta^*)}$.
S2.2 Classical measurement error
There is classical measurement error in $Y^*$ if $Y^*$ is an unbiased proxy for $Y$ [10]:
$$Y^*_i = Y_i + e_i, \qquad \text{(S9)}$$
where E$(e_i) = 0$ and Var$(e_i) = \tau^2$, and $e_i$ is mutually independent of $X_i$, $Y_i$ and $\varepsilon_i$ (in (S1)).
Using that $Y_i = \alpha + \beta X_i + \varepsilon_i$ from (S1), it follows that:
$$Y^*_i = \alpha + \beta X_i + \varepsilon_i + e_i.$$
Given the aforementioned assumptions, the sum of $\varepsilon_i$ and $e_i$, $\delta_i = \varepsilon_i + e_i$, has variance $\sigma^2 + \tau^2$. It follows that if the errors $\delta_i$ satisfy the Gauss-Markov assumptions, $\hat\beta^*$ in (S3) remains a BLUE estimator for $\beta$. Also, $\hat\alpha^*$ in (S4) and $\hat\sigma^{*2}$ in (S6) remain BLUE estimators for $\alpha$ and the variance of $\delta_i$, respectively.
Further, if $\delta_i$ is iid normally distributed with mean 0 and variance $\sigma^2 + \tau^2$, then $\hat\beta^*$ is the MLE for $\beta$ and $\hat\alpha^*$ is the MLE for $\alpha$. Obviously, given that $\tau^2 > 0$, the variance of the OLS regression estimator $\hat\beta^*$ is larger if there is classical measurement error in the outcome than when there is no measurement error. Under the null hypothesis, the Wald test statistic defined in (S7) still follows a Student's t distribution with $n - 2$ degrees of freedom. However, under the alternative hypothesis, the non-centrality parameter of $t$ will be smaller in the presence of classical measurement error.
To summarize, in the presence of only classical measurement error the Type-II error for detecting any given treatment effect increases, the Type-I error is unaffected and the treatment effect estimator remains unbiased (and is the MLE under standard regularity conditions).
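The two consequences named in this summary are easy to verify by simulation: under classical error the mean-difference estimator stays centred on the true effect while its sampling variance grows. A small Python sketch under assumed illustrative values (true effect 0.5, unit outcome sd):

```python
import random

random.seed(1)

def trial_estimates(n_per_arm, beta=0.5, error_sd=0.0, n_sim=2000):
    """Simulate mean-difference estimates of the treatment effect when
    classical error with sd `error_sd` is added to the endpoint."""
    est = []
    for _ in range(n_sim):
        y0 = [random.gauss(0.0, 1) + random.gauss(0, error_sd)
              for _ in range(n_per_arm)]
        y1 = [random.gauss(beta, 1) + random.gauss(0, error_sd)
              for _ in range(n_per_arm)]
        est.append(sum(y1) / n_per_arm - sum(y0) / n_per_arm)
    mean = sum(est) / n_sim
    var = sum((e - mean) ** 2 for e in est) / (n_sim - 1)
    return mean, var

m0, v0 = trial_estimates(50, error_sd=0.0)   # no measurement error
m1, v1 = trial_estimates(50, error_sd=1.0)   # classical error added
# both means stay near the true effect 0.5; the variance roughly doubles
```

With error sd equal to the outcome sd, the sampling variance of the estimator is about twice as large, which is exactly the mechanism behind the loss of power at a fixed sample size.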
S2.2.1 Heteroscedastic classical measurement error
In the preceding we assumed that the Gauss-Markov assumptions were met. Notably, in the case that the variance of the errors $e_i$ in (S9) differs per treatment arm, the errors are no longer homoscedastic (as needed to satisfy the Gauss-Markov assumptions) but heteroscedastic. In the case of this type of heteroscedastic classical measurement error, it can be shown that the variance of $\hat\beta^*$ will be underestimated by the default variance estimator defined in (S8), affecting both Type-I and Type-II errors.
S2.3 Systematic measurement error
There is systematic measurement error in $Y^*$ if $Y^*$ systematically depends on $Y$. Assuming this dependence is linear, the relation between $Y^*$ and $Y$ can be defined as:
$$Y^*_i = \theta_0 + \theta_1 Y_i + e_i, \qquad \text{(S10)}$$
where E$(e_i) = 0$ and Var$(e_i) = \tau^2$. Throughout, we assume systematic measurement error if $\theta_0 \neq 0$ or $\theta_1 \neq 1$ (and of course, $\theta_1 \neq 0$ in all cases). We assume mutual independence between $e_i$ and $X_i$, $Y_i$ and $\varepsilon_i$ ($\varepsilon_i$ in (S1)). Naturally, if $\theta_0 = 0$ and $\theta_1 = 1$ the measurement error is of the classical form.
By using that $Y_i = \alpha + \beta X_i + \varepsilon_i$ from (S1), it follows that:
$$Y^*_i = \theta_0 + \theta_1\alpha + \theta_1\beta X_i + \theta_1\varepsilon_i + e_i.$$
Given the aforementioned assumptions, $\delta_i = \theta_1\varepsilon_i + e_i$ with expected variance $\theta_1^2\sigma^2 + \tau^2$. It follows that under the Gauss-Markov assumptions, $\hat\beta^*$ defined in (S3) is BLUE for $\theta_1\beta$, $\hat\alpha^*$ defined in (S4) is BLUE for $\theta_0 + \theta_1\alpha$ and $\hat\sigma^{*2}$ defined in (S6) is BLUE for the variance of $\delta_i$ (i.e. $\theta_1^2\sigma^2 + \tau^2$). Conversely, $\hat\beta^*$ is no longer BLUE for $\beta$. Note that in this case $\hat\sigma^{*2}$ is BLUE for $\theta_1^2\sigma^2 + \tau^2$, that is, depending on $\theta_1$, smaller or larger than $\sigma^2$ (the variance of the error terms if there is no measurement error).
If we further assume that $\delta_i$ is iid normally distributed, we can conclude that $\hat\beta^*$ is the MLE for $\theta_1\beta$ and $\hat\alpha^*$ is the MLE for $\theta_0 + \theta_1\alpha$. Conversely, $\hat\beta^*$ is no longer the MLE for $\beta$ if there is systematic measurement error in $Y^*$. In the absence of a treatment effect, i.e. if $\beta = 0$, $t$ defined in (S7) still follows a Student's t distribution with $n - 2$ degrees of freedom. In the presence of any given treatment effect, $t$ follows a non-central Student's t distribution with $n - 2$ degrees of freedom and non-centrality parameter $\theta_1\beta/\sqrt{\operatorname{Var}(\hat\beta^*)}$. Depending on the value of $\theta_1$, the non-centrality parameter will be smaller or larger than the non-centrality parameter in the absence of measurement error (see section 3.2).
In summary, if there is systematic measurement error in the endpoints, the Type-I error is unaffected under standard regularity conditions and hence testing the null hypothesis of no effect is still valid [20]. Type-II error, however, is affected (it may increase or decrease) and the treatment effect estimator is biased.
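The bias and its correction can be illustrated by simulation: under the systematic model the naive mean difference targets theta_1 * beta, and dividing by a known theta_1 recovers beta. All parameter values below are illustrative assumptions, not values from the paper:

```python
import random

random.seed(2)

# Systematic error Y* = theta0 + theta1*Y + e: the naive mean difference
# targets theta1*beta; dividing by theta1 recovers beta.
beta, theta0, theta1, n = 0.5, 1.0, 2.0, 100_000
y0 = [random.gauss(0.0, 1) for _ in range(n)]      # control arm endpoints
y1 = [random.gauss(beta, 1) for _ in range(n)]     # treated arm endpoints
ystar0 = [theta0 + theta1 * y + random.gauss(0, 0.5) for y in y0]
ystar1 = [theta0 + theta1 * y + random.gauss(0, 0.5) for y in y1]
naive = sum(ystar1) / n - sum(ystar0) / n          # close to theta1*beta = 1.0
corrected = naive / theta1                         # close to beta = 0.5
```

In practice theta_1 is of course unknown and must itself be estimated from a calibration sample, which is what the correction methods in section S3 account for.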
S2.4 Differential measurement error
There is differential measurement error in $Y^*$ when the measurement error varies with $X$. Assuming a linear model for this variation, formally:
$$Y^*_i = \theta_0 + \theta_1 Y_i + \theta_2 X_i + \theta_3 X_i Y_i + e_i, \qquad \text{(S11)}$$
where E$(e_i) = 0$ and Var$(e_i) = \tau^2$, and $e_i$ is independent of the endpoint of interest $Y_i$, of $X_i$ and of $\varepsilon_i$ in (S1). From the equations it becomes clear that systematic error (equation (S10)) can be seen as a special case of differential error, where $\theta_2 = 0$ and $\theta_3 = 0$.
By using that $Y_i = \alpha + \beta X_i + \varepsilon_i$ from (S1), it follows from equation (S11) that,
$$Y^*_i = \theta_0 + \theta_1\alpha + \left(\theta_1\beta + \theta_2 + \theta_3(\alpha + \beta)\right)X_i + (\theta_1 + \theta_3 X_i)\varepsilon_i + e_i.$$
Let $\delta_i = (\theta_1 + \theta_3 X_i)\varepsilon_i + e_i$, with expected variance $(\theta_1 + \theta_3 X_i)^2\sigma^2 + \tau^2$. Since the error term $\delta_i$ is no longer homoscedastic, the OLS estimators defined in (S3) and (S4) are no longer BLUE. However, the OLS estimator $\hat\beta^*$ in (S3) is consistent (although not efficient) for $\theta_1\beta + \theta_2 + \theta_3(\alpha + \beta)$. The OLS estimator $\hat\alpha^*$ defined in (S4) is consistent (although not efficient) for $\theta_0 + \theta_1\alpha$. Nevertheless, the estimator for the variance of $\hat\beta^*$ defined in (S8) is no longer valid.
By using the residuals $\hat\delta_i$ defined in (S6), a heteroscedasticity-consistent estimator for the variance of $\hat\beta^*$ is:
$$\widehat{\operatorname{Var}}_{HC}(\hat\beta^*) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2\hat\delta_i^2}{\left(\sum_{i=1}^{n}(X_i - \bar{X})^2\right)^2},$$
which is known as the White estimator [18]. From standard regression theory, it is known that using the above defined estimator, $t$ defined in (S7) remains asymptotically valid. Yet, under differential measurement error, the expectation of $\hat\beta^*$ is no longer 0 if $\beta = 0$. Thus, under the null hypothesis, $t$ defined in (S7) follows a non-central Student's t distribution with $n - 2$ degrees of freedom and non-centrality parameter $(\theta_2 + \theta_3\alpha)/\sqrt{\operatorname{Var}(\hat\beta^*)}$. Consequently, the Type-I error changes if there is differential measurement error in $Y^*$ and tests about contrasts under the null hypothesis are invalid [20]. Moreover, under the alternative hypothesis, $t$ follows a non-central Student's t distribution with $n - 2$ degrees of freedom and non-centrality parameter $(\theta_1\beta + \theta_2 + \theta_3(\alpha + \beta))/\sqrt{\operatorname{Var}(\hat\beta^*)}$. Depending on the values of the $\theta$'s and $\beta$, these non-centrality parameters will be smaller or larger than 0 and than the non-centrality parameter in the absence of measurement error, respectively (see section 3.2). Hence, Type-I error and power could increase or decrease if there is differential measurement error in $Y^*$.
To summarize, the Type-I error is not expected to be nominal if there is differential measurement error in $Y^*$ (see also [20]). Also, similar to systematic error in $Y^*$, the Type-II error is affected (it may increase or decrease) and the treatment effect estimator is biased.
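The Type-I error inflation is easy to demonstrate by simulation: a small arm-specific shift in the error is indistinguishable from a treatment effect. The sketch below uses illustrative values (no true effect, an error shift of 0.3 in the treated arm) and a known-variance z-test for simplicity:

```python
import random

random.seed(3)

# Differential error: here Y* = Y + 0.3*X + e with true effect beta = 0.
# The naive mean difference targets 0.3, so the naive test rejects far
# more often than the nominal 5%.
n, n_sim, reject = 100, 1000, 0
for _ in range(n_sim):
    y0 = [random.gauss(0, 1) + random.gauss(0, 0.5) for _ in range(n)]
    y1 = [random.gauss(0, 1) + 0.3 + random.gauss(0, 0.5) for _ in range(n)]
    diff = sum(y1) / n - sum(y0) / n
    se = (2 * 1.25 / n) ** 0.5        # per-observation variance is 1 + 0.25
    if abs(diff / se) > 1.96:
        reject += 1
type1 = reject / n_sim                # far above the nominal 0.05
```

Here the empirical rejection rate under the null lands around 0.4-0.5 rather than 0.05, which is the invalidity of null-hypothesis tests described above.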
S3 Correction methods for measurement error in continuous endpoints
To accommodate measurement error correction, we assume that $Y$ and $Y^*$ are both measured for a smaller set of $n_c$ different individuals not included in the trial, hereinafter referred to as the external calibration sample. In all but one case, it is assumed that only $Y$ and $Y^*$ are measured in the external calibration sample. In the case that the error in $Y^*$ differs between the two treatment groups, it is assumed that the external calibration sample takes the form of a small pilot study in which both treatments are allocated (i.e., $Y$ and $Y^*$ are both measured after assignment of $X$).
S3.1 Systematic measurement error
Using an external calibration set and assuming that the errors $e_i$ in (S10) are iid normal, the MLE of the measurement error parameters in (S10) are:
$$\hat\theta_1 = \frac{\sum_{i=1}^{n_c}\left(Y_i^{(c)} - \bar{Y}^{(c)}\right)\left(Y_i^{*(c)} - \bar{Y}^{*(c)}\right)}{\sum_{i=1}^{n_c}\left(Y_i^{(c)} - \bar{Y}^{(c)}\right)^2}, \qquad \hat\theta_0 = \bar{Y}^{*(c)} - \hat\theta_1\bar{Y}^{(c)}. \qquad \text{(S12)}$$
The superscript (c) is used to indicate that the measurement is obtained in the calibration set. From section 3.4, under systematic measurement error and assuming that $\varepsilon_i$ in (S1) and $e_i$ in (S10) are iid normal and independent, the estimator $\hat\beta^*$ defined in (S3) is the MLE of $\theta_1\beta$ and the estimator $\hat\alpha^*$ defined in (S4) is the MLE of $\theta_0 + \theta_1\alpha$. Natural sample estimators for $\beta$ and $\alpha$ are then
$$\hat\beta = \frac{\hat\beta^*}{\hat\theta_1}, \qquad \hat\alpha = \frac{\hat\alpha^* - \hat\theta_0}{\hat\theta_1}, \qquad \text{(S13)}$$
where $\hat\theta_0$ and $\hat\theta_1$ are the estimated error parameters from the calibration data set. From equation (S13), it becomes apparent that $\hat\theta_1$ needs to be assumed bounded away from zero for finite estimates of $\beta$ and $\alpha$ [8].
The first moments of the estimators $\hat\beta$ and $\hat\alpha$ can be approximated by using multivariate Taylor expansions and assuming that ($\hat\beta^*$, $\hat\alpha^*$, $\hat\theta_0$, $\hat\theta_1$) are normally distributed [8]; the approximations involve $SSQ_{Y^{(c)}}$, the total sum of squares of $Y^{(c)}$ in the calibration set, and the bias terms vanish with growing sample sizes. In conclusion, the estimators $\hat\beta$ and $\hat\alpha$ are consistent. Formal derivations for the presented formulas are provided in the Appendix.
In the following we will focus on specifying confidence limits for the treatment effect estimator defined in (S13). We make use of the fact that this estimator is a ratio, which motivates the use of the Delta method, Fieller method and Zero-variance method [34]. We also present a non-parametric bootstrap method for specifying confidence limits [35].
S3.1.1 Delta method
Assuming that $\hat\beta^*$ and $\hat\theta_1$ are both normally distributed and applying the Delta method, the second moment of $\hat\beta$ can be approximated [20]. Formal derivations of the presented formulas are provided in Appendix A. The Delta method variance of $\hat\beta$ is given by:
$$\operatorname{Var}(\hat\beta) \approx \frac{1}{\theta_1^2}\operatorname{Var}(\hat\beta^*) + \frac{\beta^{*2}}{\theta_1^4}\operatorname{Var}(\hat\theta_1),$$
where $\operatorname{Var}(\hat\theta_1) = \tau^2/SSQ_{Y^{(c)}}$, with $SSQ_{Y^{(c)}}$ the total sum of squares of $Y^{(c)}$. An approximation of the above defined variance, denoted by $\widehat{\operatorname{Var}}(\hat\beta)$, is provided by approximating $\beta^*$, $\theta_1$, $\operatorname{Var}(\hat\beta^*)$ and $\operatorname{Var}(\hat\theta_1)$ respectively by $\hat\beta^*$, $\hat\theta_1$, $\widehat{\operatorname{Var}}(\hat\beta^*)$ and $\widehat{\operatorname{Var}}(\hat\theta_1)$ [20].
An approximate confidence interval for the estimator $\hat\beta$ is then given by
$$\hat\beta \pm t_{n-2,\,1-\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat\beta)}. \qquad \text{(S14)}$$
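As a sketch of the Delta-method interval for a ratio of two (approximately) independent estimates, the helper below takes the trial estimate and the calibration-slope estimate with their estimated variances; the function name and the z-quantile in place of the t-quantile are simplifying assumptions:

```python
def delta_ci(beta_star, var_beta_star, theta1, var_theta1, z=1.96):
    """Delta-method CI for the ratio beta* / theta1, treating the trial
    estimate and the calibration-sample estimate as independent."""
    ratio = beta_star / theta1
    var = var_beta_star / theta1 ** 2 + beta_star ** 2 * var_theta1 / theta1 ** 4
    half = z * var ** 0.5
    return ratio - half, ratio + half

# e.g. naive effect 1.0 (variance 0.04), calibration slope 2.0 (variance 0.01)
lo, hi = delta_ci(1.0, 0.04, 2.0, 0.01)   # interval around 1.0 / 2.0 = 0.5
```

Note that the interval is symmetric around the point estimate by construction, which is one way it differs from the Fieller interval discussed next.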
S3.1.2 Fieller method
A second method to construct confidence intervals for the estimator $\hat\beta$ in (S13), described by Buonaccorsi, is the Fieller method [20, 36]. In the case that $\hat\theta_1$ is significantly different from zero at a significance level of $\alpha$ (that is, $\hat\theta_1^2 > t^2\,\widehat{\operatorname{Var}}(\hat\theta_1)$ with $t = t_{1-\alpha/2}$), the confidence limits for $\beta$ are defined by the Fieller method by:
$$\frac{\hat\beta^*\hat\theta_1 \pm \sqrt{\left(\hat\beta^*\hat\theta_1\right)^2 - \left(\hat\theta_1^2 - t^2\widehat{\operatorname{Var}}(\hat\theta_1)\right)\left(\hat\beta^{*2} - t^2\widehat{\operatorname{Var}}(\hat\beta^*)\right)}}{\hat\theta_1^2 - t^2\widehat{\operatorname{Var}}(\hat\theta_1)}.$$
A formal derivation can be found in Appendix A.
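A minimal sketch of the Fieller interval for a ratio of two independent estimates follows; the function name is hypothetical and a z-quantile is used in place of the t-quantile for simplicity:

```python
def fieller_ci(a, var_a, b, var_b, z=1.96):
    """Fieller interval for the ratio a / b (a, b independent estimates).
    Bounded only when b is significantly nonzero: b**2 > z**2 * var_b."""
    A = b ** 2 - z ** 2 * var_b
    if A <= 0:
        raise ValueError("unbounded interval: b not significantly nonzero")
    B = -2 * a * b
    C = a ** 2 - z ** 2 * var_a
    # roots of the quadratic A*rho**2 + B*rho + C = 0 give the limits
    disc = (B ** 2 - 4 * A * C) ** 0.5
    return (-B - disc) / (2 * A), (-B + disc) / (2 * A)

# same illustrative inputs as for the Delta method: a slightly asymmetric
# interval around 0.5
lo, hi = fieller_ci(1.0, 0.04, 2.0, 0.01)
```

Unlike the Delta interval, the Fieller interval is asymmetric around the point estimate and becomes unbounded when the denominator estimate is not significantly different from zero, matching the condition stated above.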
S3.1.3 Zero-variance method
The zero-variance method adjusts the observed endpoints by
$$\tilde{Y}_i = \frac{Y^*_i - \hat\theta_0}{\hat\theta_1},$$
where $\hat\theta_0$ and $\hat\theta_1$ are derived from (S10). The adjusted endpoints $\tilde{Y}_i$ are regressed on the treatment variable $X$, which yields,
$$\tilde{Y}_i = \hat\alpha_{zv} + \hat\beta_{zv}X_i + \hat\delta_i,$$
with $\hat\beta_{zv}$, $\hat\alpha_{zv}$ and $\hat\delta_i$ obtained as in equations (S3), (S4) and (S6), respectively. Thus, $\hat\beta_{zv}$ equals $\hat\beta$ and $\hat\alpha_{zv}$ equals $\hat\alpha$ defined in (S13).
If the value of $\theta_1$ were known (i.e. $\operatorname{Var}(\hat\theta_1) = 0$), the variance of the estimator $\hat\beta$ would be equal to:
$$\operatorname{Var}(\hat\beta) = \frac{1}{\theta_1^2}\operatorname{Var}(\hat\beta^*).$$
Using the standard OLS regression framework, the variance of $\hat\beta_{zv}$ can be estimated by:
$$\widehat{\operatorname{Var}}(\hat\beta_{zv}) = \frac{\hat\sigma^2_{zv}}{\sum_{i=1}^{n}(X_i - \bar{X})^2}, \qquad \text{(S16)}$$
where $\hat\sigma^2_{zv}$ is the residual variance of the regression of $\tilde{Y}$ on $X$. By replacing $\theta_1$ by $\hat\theta_1$ in the above, the quantity in (S16) is in expectation equal to $\operatorname{Var}(\hat\beta^*)/\hat\theta_1^2$ (defined above). The quantity in (S16) is used in the zero-variance method to construct confidence intervals for $\beta$, by substituting it for $\widehat{\operatorname{Var}}(\hat\beta)$ in equation (S14). In conclusion, this zero-variance approach provides confidence intervals for the treatment effect estimator while assuming there is no variance in $\hat\theta_1$ (giving the zero-variance method its name). Although the zero-variance approach wins in terms of simplicity, it may underestimate the variability of the ratio since the variance of $\hat\theta_1$ is assumed zero.
S3.1.4 Bootstrap
An alternative way of constructing confidence intervals for the corrected treatment effect estimator $\hat\beta$ is the non-parametric bootstrap [35]. We propose the following stepwise procedure:
1. Draw a random sample with replacement of size $n_c$ from the calibration sample to estimate $\hat\theta_1$ defined in (S12).
2. Draw a random sample with replacement of size $n$ from the trial data and calculate the corrected treatment effect estimate $\hat\beta = \hat\beta^*/\hat\theta_1$, where $\hat\beta^*$ is defined in (S3).
3. Repeat steps 1-2 $B$ times, with $B$ large (e.g. 999 times).
4. Approximate confidence intervals are given by the $(\alpha/2 \times 100)$th and $((1 - \alpha/2) \times 100)$th percentiles of the distribution of the $B$ estimates $\hat\beta$.
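The stepwise procedure above can be sketched in Python; the toy data and the helper names (`ols_slope`, `bootstrap_ci`) are illustrative assumptions, with systematic-error parameters theta_0 = 1 and theta_1 = 2:

```python
import random

random.seed(4)

def ols_slope(x, y):
    """Slope of the OLS regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def bootstrap_ci(x_trial, ystar_trial, y_cal, ystar_cal, B=999, alpha=0.05):
    """Percentile bootstrap for the corrected effect beta* / theta1:
    step 1 resamples the calibration set, step 2 resamples the trial,
    steps 3-4 repeat and take percentiles."""
    n_t, n_c = len(x_trial), len(y_cal)
    est = []
    for _ in range(B):
        ic = [random.randrange(n_c) for _ in range(n_c)]        # step 1
        theta1 = ols_slope([y_cal[i] for i in ic],
                           [ystar_cal[i] for i in ic])
        it = [random.randrange(n_t) for _ in range(n_t)]        # step 2
        beta_star = ols_slope([x_trial[i] for i in it],
                              [ystar_trial[i] for i in it])
        est.append(beta_star / theta1)                          # corrected
    est.sort()                                                  # steps 3-4
    return est[int(alpha / 2 * B)], est[int((1 - alpha / 2) * B)]

# toy data with assumed values: true effect 0.5, Y* = 1 + 2*Y + noise
x = [0] * 100 + [1] * 100
y = [random.gauss(0.5 * xi, 1) for xi in x]
ystar = [1 + 2 * yi + random.gauss(0, 0.5) for yi in y]
y_cal = [random.gauss(0, 1) for _ in range(100)]
ystar_cal = [1 + 2 * yc + random.gauss(0, 0.5) for yc in y_cal]
lo, hi = bootstrap_ci(x, ystar, y_cal, ystar_cal, B=500)
```

Resampling the calibration set and the trial separately propagates both sources of uncertainty into the interval, which is what the Delta and Fieller methods approximate analytically.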
S3.2 Differential measurement error
For corrections of endpoints that suffer from differential measurement error we will here assume the existence of a pilot trial, in which both treatments are allocated at random, that serves as an external calibration set to estimate the measurement error model in (S11). For notational convenience we rewrite the linear model in equation (S11) in matrix form as:
$$Y^{*(c)} = Z\theta + e, \qquad \text{(S18)}$$
where $Z$ is the design matrix with rows $(1, Y_i, X_i, X_iY_i)$, $\theta = (\theta_0, \theta_1, \theta_2, \theta_3)^\top$, E$(e) = 0$ and E$(ee^\top) = \Omega$, a positive definite matrix with $\operatorname{Var}(e_i)$ on its diagonal. In the external calibration set, the measurement error parameters can be estimated by,
$$\hat\theta = (Z^\top Z)^{-1}Z^\top Y^{*(c)},$$
with variance,
$$\operatorname{Var}(\hat\theta) = (Z^\top Z)^{-1}Z^\top\Omega Z(Z^\top Z)^{-1}.$$
See [18] for a discussion of different estimators of the above defined variance. From section 2.5 it follows that natural estimators for $\alpha$ and $\beta$ are,
$$\hat\alpha = \frac{\hat\alpha^* - \hat\theta_0}{\hat\theta_1}, \qquad \hat\beta = \frac{\hat\beta^* - \hat\theta_2 - \hat\theta_3\hat\alpha}{\hat\theta_1 + \hat\theta_3}, \qquad \text{(S19)}$$
where $\hat\theta_0$, $\hat\theta_1$, $\hat\theta_2$ and $\hat\theta_3$ are estimated from the external calibration set. Here it is assumed that both $\hat\theta_1$ and $\hat\theta_1 + \hat\theta_3$ are bounded away from zero (for reasons similar to those mentioned in section 3.1).
By multivariate Taylor expansions, the first moments of the estimators $\hat\beta$ and $\hat\alpha$ defined in (S19) can be approximated [20], in the same way as the estimators for systematic measurement error (section 4.1). From this, it is apparent that the estimators $\hat\beta$ and $\hat\alpha$ defined in (S19) are consistent (details are found in the Appendix). In the subsequent sections we review the Delta method and the zero-variance method, and propose a bootstrap, for specifying confidence limits for the estimator of the treatment effect under differential measurement error of the endpoints.
S3.2.1 Delta method
The variance of the estimator $\hat\beta$ defined in (S19) can be approximated by the Delta method [20]; the resulting expression, a function of $\operatorname{Var}(\hat\beta^*)$, $\operatorname{Var}(\hat\alpha^*)$ and $\operatorname{Var}(\hat\theta)$, is derived in Appendix A. An approximate confidence interval for the estimator in (S19) is:
$$\hat\beta \pm t_{1-\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat\beta)}. \qquad \text{(S20)}$$
An approximation $\widehat{\operatorname{Var}}(\hat\beta)$ of the Delta method variance is provided by replacing the unknown quantities ($\alpha^*$, $\beta^*$, the $\theta$'s and their variances) by their sample estimates [20].
S3.2.2 Zero-variance method
The zero-variance method adjusts the observed endpoints by
$$\tilde{Y}_i = \frac{Y^*_i - \hat\theta_0 - \hat\theta_2 X_i}{\hat\theta_1 + \hat\theta_3 X_i},$$
for $i = 1, \dots, n$, with $\hat\theta_0$, $\hat\theta_1$, $\hat\theta_2$ and $\hat\theta_3$ derived from (S18). In the zero-variance method the above defined adjusted values are regressed on the treatment variable $X$, yielding estimators $\hat\alpha_{zv}$ and $\hat\beta_{zv}$, which are, respectively, equal to the estimators $\hat\alpha$ and $\hat\beta$ defined in (S19). The variance of these estimators can be approximated with a heteroscedasticity-consistent covariance estimator (see [18] for an overview). Confidence intervals for $\beta$ are subsequently constructed by using formula (S20). Similar to what is described in section 4.1.3 discussing the zero-variance method for systematic measurement error, this way of constructing confidence intervals neglects the variance of the $\hat\theta$'s from the calibration data set, and will thus often yield confidence intervals that are too narrow.
S3.2.3 Bootstrap
As an alternative, we propose a non-parametric bootstrap procedure to specify confidence limits. This entails the following steps:
1. Draw a random sample with replacement of size $n_c$ from the calibration sample and estimate $\hat\theta$ as defined in (S18).
2. Draw a random sample (with replacement) of size $n$ from the study population and calculate the effect estimates $\hat\alpha$ and $\hat\beta$ as in (S19), where $\hat\beta^*$ and $\hat\alpha^*$ are defined in (S3) and (S4), respectively.
3. Repeat steps 1-2 $B$ times, with $B$ large (e.g. 999 times).
4. Approximate confidence intervals are given by the $(\alpha/2 \times 100)$th and $((1 - \alpha/2) \times 100)$th percentiles of the distribution of the $B$ estimates $\hat\beta$.
Appendix A1 Approximation of bias and variance in corrected estimator
A1.1 Systematic measurement error
Obvious estimators for $\beta$ and $\alpha$ are:
These estimators can be approximated with a second order Taylor expansion by:
Simplifying these terms and subtracting the latter two leads to the following approximations for $\hat\beta$ and $\hat\alpha$:
Since , , and an approximation of the expected value of the estimator is given by:
Congruently, an approximation of the expected value of the estimator is given by:
Only using the first order Taylor expansion of the estimators, approximations of the variance of and are respectively:
A1.1.1 Fieller method
Assume that $\hat\beta^*$ and $\hat\theta_1$ are normally distributed (note that this assumption is approximately satisfied with large study samples ($n$) and large calibration samples ($n_c$)). The sum of two normally distributed variables is normally distributed; hence, $\hat\beta^* - \beta\hat\theta_1$ is normally distributed.
Furthermore, we have
Where,
If we now divide this term by its standard error, we obtain:
We are interested in finding the set of values of $\beta$ for which the corresponding statistic lies within the quantiles of the $t$-distribution with $n - 2$ degrees of freedom (this only holds approximately, see [34] for details). Let us denote these quantiles by $\pm t$; from (A1) we have,
In the case that $\hat\theta_1$ is significantly different from zero at a significance level of $\alpha$ (that is, $\hat\theta_1^2 > t^2\,\widehat{\operatorname{Var}}(\hat\theta_1)$), solving this for $\beta$ results in the following confidence limits:
In the other case, the confidence interval is unbounded; see [34] for more details.
A1.2 Differential measurement error
Obvious estimators for $\beta$ and $\alpha$ are:
These estimators can be approximated with a second order Taylor expansion by:
Congruent to the results for the estimators under systematic measurement error, we can conclude:
Congruently, an approximation of the expected value of the estimator is given by:
And the variance of the estimators is approximated by:
Note that in the case of differential measurement error, we assume that ,
, and .