This is an important finding, as RCTs commonly collect information on a number of different PROMs, as well as clinical information such as clinical assessments, readmissions, and complications. This information should be used in MI models, where appropriate. Using auxiliary variables can also make the MAR assumption, on which all three of these approaches rely, more plausible (38), particularly when some missing data are related to a change in health state.
The MI model including auxiliary variables performed slightly worse with monotone missing data than with intermittently missing data.
This finding emphasizes the importance of continued data collection and of including all collected data in the analysis. IPW performed notably worse than its comparators in terms of both bias and variability around the estimates of the treatment effects, because IPW potentially uses only a small subset of the observed outcome data.
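The contrast between IPW and a complete-case analysis can be made concrete with a small sketch. The data, strata, and weighting-by-stratum scheme below are entirely hypothetical and far simpler than the models in the simulation study; they only illustrate why weighting complete cases by the inverse of their probability of being observed shifts the estimate when missingness differs between strata.

```python
# Minimal sketch of inverse probability weighting (IPW) for a missing
# outcome, using hypothetical data. Each participant has an observed
# baseline stratum; the outcome is missing for some participants.
# Complete cases are weighted by the inverse of the estimated
# probability of being observed within their stratum.

from collections import defaultdict

# (stratum, outcome) pairs; outcome is None when missing.
data = [
    ("good_baseline", 40.0), ("good_baseline", 42.0), ("good_baseline", None),
    ("good_baseline", 44.0),
    ("poor_baseline", 20.0), ("poor_baseline", None), ("poor_baseline", None),
    ("poor_baseline", 22.0),
]

# Estimate P(observed | stratum).
counts = defaultdict(lambda: [0, 0])  # stratum -> [n_observed, n_total]
for stratum, outcome in data:
    counts[stratum][1] += 1
    if outcome is not None:
        counts[stratum][0] += 1

p_obs = {s: n_obs / n_tot for s, (n_obs, n_tot) in counts.items()}

# Weighted mean over complete cases: weight = 1 / P(observed | stratum).
num = sum(y / p_obs[s] for s, y in data if y is not None)
den = sum(1 / p_obs[s] for s, y in data if y is not None)
ipw_mean = num / den

# Unweighted complete-case mean for comparison.
cc = [y for _, y in data if y is not None]
cc_mean = sum(cc) / len(cc)
```

Because the poorer-baseline stratum has more missing outcomes, IPW upweights its few complete cases and pulls the estimate below the naive complete-case mean; the cost is that the estimate rests on only the complete cases, which drives the extra variability noted above.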
Outside the context of a simulation study, models should be tailored to the data available and simplified for smaller sample sizes, as appropriate. The differences in the performance measures for the three approaches are relatively small: they lie within the measurement error of the PROM and do not exceed its minimal important difference, which have been estimated at 4 and 5 points, respectively. However, many trials are powered to detect small differences between treatment arms. For example, the KAT study was powered to detect a 1.

A limited number of missing data scenarios were considered, and the maximum sample sizes were restricted by the number of complete cases in the trial on which this simulation study is based.
However, sample sizes ranging from to almost 1, participants were deemed representative of the vast majority of RCTs. Most of the simulations considered the same missing data pattern, a mixture of intermittent and monotone missingness. Missing data patterns are likely to vary between trials and, to a smaller extent, between PROMs. Other patterns of missingness could have been investigated here. However, monotone and intermittently missing data are commonly observed in RCTs, and we believe that the patterns used, as well as the conclusions drawn, are generalizable to a large proportion of RCTs.
More resources may be spent on ensuring high completion rates for the primary or key secondary outcome measures, eg, through follow-ups by telephone. It is also possible that RCT participants are more inclined to complete shorter questionnaires or those they consider more relevant to themselves. Different follow-up schedules may be used for different PROMs, and those collected more frequently can be used to make inferences about missing data in other questionnaires.
Information on clinical assessments, readmissions, additional treatment, or complications may be less prone to missing data and could be used in imputation models. In short, any available additional postrandomization information should be included in imputation models if deemed appropriate to reduce bias. These specifications were included in this simulation study as they are well established, commonly used by the statistical community, and easily implementable using standard statistical software. Other specifications of these models are possible, but were not considered here.
Other implementations of IPW have been suggested, including a stratification approach to account for different missing data patterns, which may be due to differences in patient characteristics; however, that approach is only thought to be appropriate when the number of missing data patterns is small. As such implementations are complex and not routinely available in standard statistical software, they did not meet the criteria for the methods compared here. Validation in other PROMs could be beneficial. This work did not consider the effect of MNAR mechanisms or misspecifications of the analysis model on the performance of the three approaches.
As misspecification and MNAR can occur in a number of ways, different misspecifications or different MNAR mechanisms may affect the performance of the three approaches for handling missing data in longitudinal data sets very differently. We therefore avoided general statements about the performance of the investigated analysis approaches that may not apply to all MNAR and misspecification scenarios, as such statements could lead to underestimating the bias introduced through missing data.
The effects of MNAR scenarios should be investigated for all analyses on incomplete data in appropriate sensitivity analyses, as recommended in the literature.
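One concrete form of such a sensitivity analysis is delta adjustment (a tipping-point analysis): missing values are first filled in under an MAR-style assumption, then shifted by progressively more pessimistic offsets to see how far the MAR result can be stressed before the conclusion changes. The sketch below uses hypothetical scores and a deliberately crude single-value fill in place of proper multiple imputation.

```python
# Sketch of a delta-adjustment (tipping-point) sensitivity analysis for
# MNAR, using hypothetical PROM scores. Missing values are filled under
# an MAR-style assumption (here, crudely, the observed-arm mean), then
# shifted by increasingly pessimistic offsets (delta) to see when the
# estimated treatment effect would change materially.

def complete_with_delta(observed, n_missing, delta):
    """Fill missing scores with the observed mean plus delta."""
    mar_fill = sum(observed) / len(observed)
    return observed + [mar_fill + delta] * n_missing

treated_obs = [45.0, 47.0, 44.0, 46.0]  # observed scores, treated arm
control_obs = [40.0, 41.0, 39.0, 42.0]  # observed scores, control arm

for delta in (0.0, -2.0, -4.0, -6.0):
    # Apply delta only to the treated arm's imputed values (a typical
    # "treated dropouts fared worse than MAR predicts" scenario).
    treated = complete_with_delta(treated_obs, n_missing=2, delta=delta)
    control = complete_with_delta(control_obs, n_missing=1, delta=0.0)
    effect = sum(treated) / len(treated) - sum(control) / len(control)
    print(f"delta={delta:+.0f}: estimated treatment effect {effect:.2f}")
```

As delta becomes more negative the estimated benefit shrinks; reporting the delta at which the trial's conclusion would tip gives readers a transparent handle on how robust the MAR-based result is.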
However, if auxiliary PROMs have been more completely observed during follow-up than the PROM of primary interest, or other postrandomization data are available, then MI performs better and should be favored over non-imputation-based ML approaches. As both approaches assume an MAR mechanism, additional sensitivity analyses considering MNAR scenarios should be conducted to supplement the primary analysis. The data used for this simulation work were collected as part of the KAT study.
Data requests should be directed to the trial coordinating office, the Health Services Research Unit at the University of Aberdeen. The simulation work was performed in Stata and is available from the corresponding author upon request. We are very grateful to the KAT study group for providing data for this methodological work.
We recognize the contributions of all of the KAT investigators, collaborators, and those who coordinated the KAT study. We acknowledge English language editing by Jennifer A de Beyer. All authors are employed by the University of Oxford. The authors report no other conflicts of interest in this work.

- To impute or not to impute?
- Handling missing data in RCTs: a review of the top medical journals.
- The current practice of handling and reporting missing outcome data in eight widely used PROMs in RCT publications: a review of the current literature. Qual Life Res.
- Direct likelihood analysis versus simple forms of imputation for missing data in randomized clinical trials. Clin Trials.
- Assessing response profiles from incomplete longitudinal clinical trial data under regulatory considerations. J Biopharm Stat.
- A modelling strategy for the analysis of clinical trials with partly missing longitudinal data. Int J Methods Psychiatr Res.
- Practical and statistical issues in missing data for longitudinal patient-reported outcomes. Stat Methods Med Res.
- Patient-reported outcomes to support medical product labeling claims: FDA perspective. Value Health.
- Jenkinson C, Morley D. Patient reported outcomes. Eur J Cardiovasc Nurs.
- Kazi AM, Khalid W. Questionnaire designing and validation. J Pak Med Assoc.
- Meaningful changes for the Oxford hip and knee scores after joint replacement surgery. J Clin Epidemiol.
- The Oxford shoulder score revisited. Arch Orthop Trauma Surg.
- Using patient-reported outcomes in clinical practice: challenges and opportunities.
- A review of RCTs in four medical journals to assess the use of imputation to overcome missing data in quality of life outcomes. Health Econ.
- Ibrahim JG, Molenberghs G. Missing data methods in longitudinal studies: a review. Test (Madr).
- Enders CK. Analyzing longitudinal data with missing values. Rehabil Psychol.
- The analysis of multivariate longitudinal data: a review.
- A guide to handling missing data in cost-effectiveness analysis conducted within randomised controlled trials.
- A review of the handling of missing longitudinal outcome data in clinical trials.
- Assessing and interpreting treatment effects in longitudinal clinical trials with missing data. Biol Psychiatry.
- Lachin JM. Fallacies of last observation carried forward analyses.
- Missing data: a systematic review of how they are reported and handled.
- Principled missing data treatments. Prev Sci.
- Multiple imputation using chained equations: issues and guidance for practice. Stat Med.
- An application of maximum likelihood and generalized estimating equations to the analysis of ordinal data from a longitudinal study with cases missing at random.
- Doidge JC. Responsiveness-informed multiple imputation and inverse probability-weighting in cohort studies with missing data that are non-monotone or not missing at random.
- Review of inverse probability weighting for dealing with missing data.
- Statistical Analysis with Missing Data.
- Recommendations for the primary analysis of continuous endpoints in longitudinal clinical trials. Drug Inf J.
- Multiple Imputation and Its Application.
- Combining multiple imputation and inverse-probability weighting.
- Multiple imputation of discrete and continuous data by fully conditional specification.
- Rubin D. Multiple Imputation for Nonresponse in Surveys.
- Kenward MG, Carpenter J. Multiple imputation: current perspectives.
- Analysing randomised controlled trials with missing data: choice of approach affects conclusions. Contemp Clin Trials.
- Inverse probability weighting.
- A comparison of multiple imputation and doubly robust estimation for analyses with missing data.
- Allowing for missing outcome data and incomplete uptake of randomised interventions, with application to an Internet-based alcohol trial.
- Stata Statistical Software: Release.
- The Knee Arthroplasty Trial (KAT) design features, baseline characteristics, and two-year functional outcomes after alternative approaches to knee replacement. J Bone Joint Surg Am.
- A randomised controlled trial of the clinical effectiveness and cost-effectiveness of different knee prostheses: the Knee Arthroplasty Trial (KAT). Health Technol Assess.
- Questionnaire on the perceptions of patients about total knee replacement. J Bone Joint Surg Br.
- The use of the Oxford hip and knee scores.
- Fully conditional specification in multivariate imputation. J Stat Comput Simul.
- Evaluation of software for multiple imputation of semi-continuous data.
- Multiple imputation to deal with missing EQ-5D-3L data: should we impute individual domains or the actual index?
- Oemar M, Oppe M. Accessed October 1,
- Jenkinson C, Layte R. Development and testing of the UK SF short form health survey. J Health Serv Res Policy.
- A shorter form health survey: can the SF replicate results from the SF in longitudinal studies? J Public Health Med.
- Should multiple imputation be the method of choice for handling missing data in randomized trials? Epub Jan 1.
- Graham JW. Missing data analysis: making it work in the real world. Ann Rev Psychol.
- Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat.
- Including auxiliary item information in longitudinal data analyses improved handling missing questionnaire outcome data.
- Using linked educational attainment data to reduce bias due to missing outcome data in estimates of the association between the duration of breastfeeding and IQ at 15 years. Int J Epidemiol.
- Permutt T. Sensitivity analysis for missing data in regulatory submissions.
- Patterns of treatment effects in subsets of patients in clinical trials.
The number of patients included in a comparative trial, called the sample size of the trial, must be sufficient to detect a difference deemed of clinical relevance. The sample size is calculated so as to guarantee that the difference of interest, if real, will be detected with a given probability, called the statistical power of the trial. In order to calculate a sample size, the trialists need to agree on a set of design parameters. Software is available to calculate sample sizes for different types of endpoints and for different values of the design parameters.
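For a continuous endpoint compared between two equally sized arms, the standard normal-approximation formula gives the flavor of such a calculation. The figures below are illustrative assumptions, not taken from any trial discussed here.

```python
# Illustrative sample-size calculation for comparing two means with a
# two-sided z-test (hypothetical numbers; real trials should use
# dedicated software and the design parameters agreed by the trialists).

from math import ceil
from statistics import NormalDist

def n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sample z-test:
    n = 2 * (sigma * (z_{1-alpha/2} + z_power) / delta)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)

# Detecting a 5-point difference on a scale with SD 10, at 5% two-sided
# significance and 80% power:
print(n_per_arm(delta=5, sigma=10))  # -> 63 per arm
```

Halving the difference to be detected roughly quadruples the required sample size, which is why trials targeting small but clinically relevant differences must be large.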
Many trials in the past have ended up being inconclusive, not showing a statistically significant difference between the treatment groups, because of an insufficient sample size and ensuing low power. In such cases, a meta-analysis of all related trials would be the best way of establishing real, but small, treatment differences. The need for large-scale trials has long been recognized; for instance, the ATAC trial randomised over 9, patients for the treatment of early breast cancer.
Such a large sample size was needed because the goal of the trial was to show that anastrozole was non-inferior to tamoxifen in terms of disease-free survival. This difference may vary greatly depending on the disease and the treatment considered. In order to detect this difference, a sample size of over 1, patients was needed (9). In retrospect, such a huge treatment benefit could have been seen in far fewer than 1, patients, but the phase III trial had been planned conservatively to detect a smaller difference that would still have been of major clinical importance.

When the endpoint of interest is a time to event, such as progression-free survival or overall survival, the benefit of an experimental treatment compared to a control treatment is expressed in terms of a hazard ratio, which is equal to the risk of the event in the treatment group divided by the risk of the event in the control group.
If the risk is the same in both groups, the hazard ratio is equal to 1. If the treatment reduces the risk of the event, the hazard ratio is less than 1. For instance, a hazard ratio equal to 0. The power of a trial to detect the effect of a treatment on a time to event endpoint depends only on the number of events, and not on the number of patients. Hence for a rare event, many more patients will be needed to achieve the same number of events as for a common event, which is one reason why trials in the adjuvant setting have to be large.
Table 1 shows the number of events required to detect given hazard ratios. The number of patients required to observe this number of events depends on the risk of the event, the duration of accrual into the trial, and the duration of follow-up. The trial of imatinib provides an interesting example of a treatment being far more effective than anticipated.
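The dependence of the required number of events on the target hazard ratio, as tabulated in Table 1, can be sketched with Schoenfeld's approximation for a 1:1 randomized, two-sided logrank comparison. The exact figures in Table 1 may rest on a slightly different approximation, so treat these as illustrative.

```python
# Schoenfeld's approximation for the number of events needed to detect
# a given hazard ratio with a 1:1 randomized, two-sided logrank test:
# d = 4 * (z_{1-alpha/2} + z_power)^2 / (ln HR)^2.

from math import ceil, log
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2)

# The closer the hazard ratio is to 1, the more events are needed.
for hr in (0.5, 0.7, 0.8):
    print(hr, required_events(hr))
```

Note that the formula involves only events, not patients: for a rare event, accrual and follow-up must be scaled up until the required number of events is observed, which is why adjuvant trials are so large.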
In such situations, there is an ethical imperative to stop the trial as soon as there is enough evidence that the experimental treatment is efficacious and safe. For this reason, most phase III trials now include interim analyses of efficacy. Typically, the significance levels used for the interim analyses are very small, such that an interim analysis is declared statistically significant only if an extreme treatment effect has already been demonstrated, making the continuation of the trial unnecessary and potentially unethical.
It is appropriate to use extreme levels of significance to stop a trial early, to safeguard against the play of chance, which could cause an apparent but spurious treatment effect at one of the interim analyses. Sometimes phase III trials must be stopped early for the opposite reason, ie, futility, when the accumulating data make it unlikely that a benefit will ever be demonstrated.
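A Haybittle-Peto-style rule is one simple way to implement such extreme interim thresholds; the 0.001 cutoff below is a conventional illustrative choice, not necessarily the boundary used in any trial discussed here (O'Brien-Fleming-type alpha-spending boundaries are a common alternative).

```python
# Sketch of a Haybittle-Peto-style efficacy stopping rule (illustrative
# assumption): interim looks use a very small significance level, so the
# final analysis can be carried out close to the nominal 5% level.

def stop_early(p_values_at_interims, threshold=0.001):
    """Return the first interim look (1-based) at which the trial would
    stop for overwhelming efficacy, or None if it runs to completion."""
    for look, p in enumerate(p_values_at_interims, start=1):
        if p < threshold:
            return look
    return None

print(stop_early([0.04, 0.0005]))  # stops at the second look -> 2
print(stop_early([0.04, 0.01]))    # runs to the final analysis -> None
```

A p-value of 0.04 at an interim look, which would be "significant" at the final analysis, does not stop the trial: only an extreme treatment effect makes continuation unnecessary and potentially unethical.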
Here again, interim analyses must be carefully planned, eg, with oversight by an independent data monitoring committee. The main advantage of an independent committee is to keep investigators blinded to the interim results of the trial, thereby avoiding any bias that knowledge of interim results could create, such as a change in the type of patients entered in the trial.

A prognostic factor is a patient characteristic that modifies his or her prognosis: for instance, patients with tumor nodal involvement tend to fare less well than those without such involvement.
A predictive factor is a patient characteristic that modifies the effect of a treatment: for instance, breast cancer patients without hormone receptors do not benefit from tamoxifen therapy, while patients with hormone receptors do. It is obviously of interest to identify subsets of patients who do not benefit from treatment, or conversely the subset that benefits the most, but the search for subsets is a perilous statistical exercise. With a conventional 5% significance level, a single comparison carries a 5% chance of a false positive claim when there is no true treatment effect; this calculation assumes that just one comparison is performed.
If multiple comparisons are performed, the probability of false positive claims is increased. Thus, if two subsets are looked at, three treatment comparisons are performed: one overall, plus one in each subset. This explains why inappropriate subset claims create enormous confusion in the clinical literature. Simon (18) proposed useful guidelines to assess subset results 25 years ago, but his guidelines remain just as relevant today (Table 2). Another important consideration when interpreting subset analyses is to examine the biological plausibility of the findings. The most convincing examples are molecular alterations such as gene mutations, translocations, amplifications, etc.
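Returning to the multiplicity point above: under the simplifying (and hypothetical) assumption that the comparisons are independent, the inflation of the false positive probability is easy to quantify.

```python
# Family-wise false positive probability when k comparisons are each
# tested at the same significance level, assuming independent tests
# for simplicity: 1 - (1 - alpha)^k. In practice subset tests are
# correlated with the overall test, but the qualitative point stands.

def familywise_error(alpha, k):
    return 1 - (1 - alpha) ** k

# One overall test plus one test in each of two subsets -> k = 3.
print(round(familywise_error(0.05, 1), 3))  # 0.05
print(round(familywise_error(0.05, 3), 3))  # 0.143
```

With an overall test plus one test per subset, the chance of at least one spurious "significant" result already exceeds 14%, which is why unplanned subset claims are so treacherous.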
Even then, however, biology may be incompletely understood and suggest a modulation of the treatment effect that turns out not to be correct. Here again, confirmation of the hypothesis can be obtained in a randomized trial in which either an interaction test or a prospective subset analysis is planned in addition to the overall analysis. The latter approach (prospective subset analysis) was used in the SATURN trial in patients with advanced non-small cell lung cancer. After standard treatment with four cycles of platinum-based chemotherapy, patients who had not yet progressed were randomly allocated to receive erlotinib or placebo until progression or unacceptable toxicity. Progression-free survival after randomization was tested in all patients at a significance level of 0.
In this trial the overall significance level was clearly maintained at 0. However, the trial showed the same treatment effect overall and in the subset, and it became clear after the trial was completed that while overexpression of EGFR did not increase the efficacy of erlotinib, a specific mutation of the EGFR gene did. This example demonstrates that most hypotheses need prospective confirmation, whether suggested by tumor biology or by unexpected statistical evidence from a clinical trial or a patient series.
The ideal scenario is one in which several trials show concordant subset results, in which case a combined analysis of all available evidence may be sufficient to establish the validity of a predictive biomarker. This situation led the US Food and Drug Administration (FDA) to change the labels of the two anti-EGFR monoclonal antibody drugs, panitumumab (Vectibix) and cetuximab (Erbitux), restricting their use to the treatment of patients with K-ras wild-type metastatic colorectal cancer.

All patients who are randomised in a phase III trial should be followed according to the study protocol, even if they are found, after randomisation, to be ineligible or unevaluable for any reason.
The most reliable analysis of a trial is based on the intent-to-treat principle, which consists of considering all randomised patients, regardless of any protocol violations. In particular, patients who take other treatments or refuse any treatment have to be kept in the treatment group they were randomized to. All other forms of analysis may be biased and, as such, are less desirable from a statistical viewpoint. In an intent-to-treat analysis, the number of patients who drop out of the trial prior to reaching the endpoint of primary interest should be kept to an absolute minimum.
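The grouping rule can be made explicit with a toy example (entirely hypothetical data): every patient is analysed in the arm to which they were randomized, whatever treatment was actually received.

```python
# Toy illustration of the intent-to-treat (ITT) principle: patients are
# analysed in the arm they were randomized to, regardless of the
# treatment actually received (hypothetical data).

patients = [
    # (randomized_arm, treatment_received, outcome)
    ("experimental", "experimental", 1),
    ("experimental", "control", 1),       # crossed over; still "experimental"
    ("experimental", "none", 0),          # refused treatment; still counted
    ("control", "control", 0),
    ("control", "control", 1),
    ("control", "experimental", 0),       # crossed over; still "control"
]

def arm_rate(arm):
    """Outcome rate by RANDOMIZED arm, not by treatment received."""
    outcomes = [y for randomized, _, y in patients if randomized == arm]
    return sum(outcomes) / len(outcomes)

itt_effect = arm_rate("experimental") - arm_rate("control")
```

Grouping by treatment received instead (a "per-protocol" analysis) would silently compare self-selected groups and can bias the estimate, which is why the ITT comparison is preferred.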
A phase III trial protocol should be precise and detailed, but it should not attempt to provide exhaustive guidelines for all aspects of patient management, since many of the routine examinations and procedures that would be performed outside of the clinical trial contribute no useful information to the endpoints of the trial. Likewise, in a phase III trial, it is generally undesirable to submit the patients to a more thorough or precise follow-up than what they would receive in routine clinical practice, so long as the endpoints of interest are assessed reliably.
Follow-up should be identical in thoroughness and frequency in the various treatment groups. For instance, seeing experimental arm patients more frequently than control arm patients could bias the assessment of disease-free interval, because recurrences would be detected earlier in the experimental group. Softer endpoints, such as disease recurrence, are more subject to bias than harder endpoints, such as death. For instance, if an untreated control group is compared to a treatment group, there may be pressure to scrutinize the untreated patients much more thoroughly than the treated ones in order to identify and treat disease recurrences as early as possible.
When endpoints are subjective, they should ideally be assessed blindly, ie, by observers unaware of the treatment received. The ideal endpoint for a phase III trial is one that is important to the patient, observed soon after treatment inception, clinically meaningful, statistically sensitive to treatment effects, and measured objectively and without bias. If such an endpoint existed, it could always serve as the primary endpoint of randomised trials (the primary endpoint is the one used to calculate the sample size, and to determine whether the trial shows a significant effect of treatment or not).
Unfortunately, in general, no single endpoint fulfils all these desirable conditions. This is illustrated by the endpoints commonly used in advanced cancer: response to treatment (tumor shrinkage), time to disease progression, and overall survival (Table 3). In general, response to treatment is insufficient per se to establish patient benefit, time to disease progression is hard to measure objectively, and survival is insensitive to true treatment differences. Usually, therefore, all of these endpoints are analysed and the totality of the evidence is taken into account to support claims of treatment benefit.
In some advanced forms of cancer, clinical benefit scales have been developed. Changes on such clinical benefit scales constitute meaningful outcomes to the patients and may be quite sensitive to real treatment effects. As such, they seem useful and often more relevant than general-purpose quality of life questionnaires that do not specifically reflect the effects of treatment. In some situations, biomarkers are also available to follow the disease status, such as prostate-specific antigen (PSA) and circulating tumor cells (CTC) in patients with prostatic cancer.

For a surrogate endpoint to be valid, two conditions should be fulfilled: first, the surrogate endpoint must be predictive of the true endpoint for individual patients, and second, the treatment effect on the surrogate endpoint must be predictive of the treatment effect on the true endpoint for groups of patients. Unfortunately, few endpoints or markers qualify as valid surrogates for the clinical endpoints of interest in advanced disease. For instance, in advanced colorectal cancer, tumor response is highly predictive of longer survival in individual patients, but the effects of treatment on tumor response do not reliably predict the effects of treatment on survival. Hence, even if an experimental treatment induced higher response rates in advanced colorectal cancer, its effect on survival would remain elusive.
Likewise, in prostate cancer, changes in PSA predict the course of the disease and eventually the patient's survival, but the effects of treatment on PSA changes have not been shown to be predictive of the effects of treatment on survival. The discovery of markers that reflect relevant biological mechanisms at the tumor level will undoubtedly make the search for surrogate markers more promising in the future. Until such time, some endpoints measured earlier than death, such as disease-free survival in the adjuvant setting, have been shown to be excellent surrogates for survival.

The choice of an appropriate method of statistical analysis is crucial for any trial, in particular for phase III trials.
This choice is fairly standardized, however, depending on the type of endpoint that is used to assess treatment benefit. Table 4 shows commonly used methods of analysis for normal, binary, or time-to-event endpoints. These methods are available in all standard statistical analysis packages. It is also essential, when reporting the results of a phase III clinical trial, to choose a scale on which the treatment effect is expressed.
We noted above that when the endpoint of interest is a time to event, the treatment effect is usually expressed as a hazard ratio. Other scales are available, however, such as the difference between the median time-to-event values in the two arms, or the difference between the percentages of patients who have had the event at a given time point.
Different scales for measuring the treatment effect have their respective pros and cons. The most commonly used measures of treatment effect are shown in Table 5. Note from Table 5 that the odds reduction is larger than the risk reduction, which in turn is larger than the absolute risk reduction. This is not a feature of the particular figures chosen in Table 5; it is a general feature that holds true for any treatment effect other than zero. This fact should be kept in mind when reading a paper, and more importantly when comparing the results of different papers, since these may be expressed on different scales.
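The ordering of the three measures can be checked with any pair of event risks; the 30% and 20% figures below are hypothetical stand-ins for those in Table 5.

```python
# The same hypothetical trial result expressed on different scales
# (illustrative figures, not those of Table 5). Event risks: 30% in
# the control arm, 20% in the treatment arm.

control_risk, treated_risk = 0.30, 0.20

absolute_risk_reduction = control_risk - treated_risk       # 0.10
relative_risk_reduction = 1 - treated_risk / control_risk   # ~0.333
odds_control = control_risk / (1 - control_risk)
odds_treated = treated_risk / (1 - treated_risk)
odds_reduction = 1 - odds_treated / odds_control            # ~0.417

# odds reduction > relative risk reduction > absolute risk reduction
print(absolute_risk_reduction, relative_risk_reduction, odds_reduction)
```

The same benefit reads as a 10-point absolute reduction, a 33% relative risk reduction, or a 42% odds reduction, which is exactly why the choice of scale can sway how impressive a result appears.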
It has been shown that the same therapeutic benefit may lead to different prescription patterns depending on the scale used to express it, because any benefit seems more impressive when expressed in relative, rather than absolute, terms.

There is currently far too much emphasis on the administrative tasks required to conduct a trial, and far too little on the trial design itself. Yet a poorly designed trial is likely to fail to answer the question it addresses. The present paper covers basic considerations in trial design; other articles in this volume cover more advanced features such as adaptive and biomarker-based trial designs.
While these are increasingly important in personalized medicine, simple randomized trials will continue to serve clinical research well. In addition, each of the statistical principles discussed in this paper may be directly translated without modification to more sophisticated designs.

Tam, tamoxifen.

Table 2: Checklist to assess results from subset analyses.