#### Comparison of fingolimod, dimethyl fumarate and teriflunomide for multiple sclerosis: when methodology does not hold the promise

Platt RW, Karim ME, Debray TPA, Copetti M, Tsivgoulis G, Waubant E, Hartung HP

Comment on Comparison of fingolimod, dimethyl fumarate and teriflunomide for multiple sclerosis

Dear Editor,

We read with interest the article by Kalincik et al. ^{1} comparing fingolimod, dimethyl fumarate and teriflunomide in a cohort of relapsing-remitting multiple sclerosis (MS) patients. The authors investigated several endpoints and performed various sensitivity analyses, and we commend them for reporting technical details in the online supplementary material. We, however, have some concerns about the design, analysis and reporting of the study.

1. In the primary analyses, three separate propensity score models were developed to construct a matched cohort for each of the three pairwise comparisons. Supplementary Table 6 clearly indicates zero or very low frequencies for some covariate categories (e.g., most active previous therapy and magnetic resonance imaging [MRI] T2 lesions). Yet those variables were used as covariates in the propensity score models, unsurprisingly resulting in extremely large point estimates and standard errors (SE; as reported in Supplementary Table 7). For example, teriflunomide was not the most active previous therapy for any patient in the dimethyl fumarate cohort (n=0 in Supplementary Table 6), but that category was nevertheless included in the propensity score model, leading to an unrealistic point estimate of 18.65 with an SE of 434.5 (Supplementary Table 7). Even larger SEs (greater than 1000) appear in the other propensity score models. Propensity scores estimated from these poorly constructed models were then used to create the three matched cohorts that form the basis of the primary analyses. Because of this instability, readers should be skeptical of any inference (the estimated SE of the treatment effect, and consequently the confidence intervals and p-values) drawn from these cohorts. Further, while teriflunomide was the 'most active previous therapy' for no patient in the original (unmatched) dimethyl fumarate cohort (n=0, 0% in Supplementary Table 6), Table 1 reports n=14 (2%) such patients after matching. Naturally, for any category, the matched cohort cannot contain more patients than were present in the unmatched cohort.
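The instability we describe is the familiar (quasi-)separation problem in maximum likelihood logistic regression. A minimal sketch with invented data (not the study's) shows how a covariate category that is absent from one treatment arm drives the coefficient and its SE toward infinity as the fitting iterations proceed:

```python
import math

# Invented illustration of quasi-complete separation, not the study's data.
# x: indicator for a covariate category; y: treatment assignment.
# No patient with x=1 ever appears in the y=0 arm.
data = [(0, 0)] * 40 + [(0, 1)] * 40 + [(1, 1)] * 5

def fit(data, iters):
    """Newton-Raphson logistic regression of y on x (intercept + slope)."""
    b0 = b1 = 0.0
    se1 = float("inf")
    for _ in range(iters):
        # gradient and Fisher information of the log-likelihood
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in data:
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
            w = p * (1 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        se1 = math.sqrt(h00 / det)        # SE of b1 from the inverse information
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    return b1, se1

# Under separation the MLE does not exist: the slope and its SE simply keep
# growing with every additional iteration instead of converging.
print(fit(data, 10))
print(fit(data, 25))
```

Any estimate and SE such a model reports are artefacts of where the optimizer happened to stop, which is why figures like 18.65 (SE 434.5) should not be trusted.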

2. Standardized mean differences (i.e., Cohen's d) below 10% or 20% in absolute value are normally considered acceptable when assessing imbalance in baseline covariates. In this study, however, the standardized mean difference for relapse activity prior to baseline was 26% in the fingolimod vs. teriflunomide matched sample (Table 1). Neither the standardized nor the raw difference in proportions was reported for any of the categorical variables in Table 1, even though some of the percentages in the matched cohorts were substantially different (e.g., relapse rate). Large residual differences in the distribution of the covariates (likely due to the poorly built propensity score models) will contribute bias to the resulting estimates. Furthermore, matching by country, a crucial variable that would allow minimizing outcome assessment bias ^{2}, was reported not in Table 1 but in Supplementary Table 4, which clearly shows that balance on country is far from achieved. Even more importantly, since matching was conducted with a variable ratio for the primary analyses, the standardized differences in Table 1 should be replaced with weighted standardized mean or proportion differences to obtain a correct check of residual baseline imbalance after matching ^{3}.
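The weighted balance diagnostic of Austin ^{3} can be sketched as follows. The covariate values and weights below are invented for illustration: treated patients carry weight 1, and a control matched m:1 carries weight 1/m.

```python
import math

# Invented illustration of a weighted standardized difference after
# variable-ratio matching (Austin 2008); not the study's data.
def weighted_mean_var(x, w):
    wsum = sum(w)
    m = sum(wi * xi for wi, xi in zip(w, x)) / wsum
    v = sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / wsum
    return m, v

def std_diff(x_t, w_t, x_c, w_c):
    """Weighted standardized difference with a pooled-variance denominator."""
    m_t, v_t = weighted_mean_var(x_t, w_t)
    m_c, v_c = weighted_mean_var(x_c, w_c)
    return (m_t - m_c) / math.sqrt((v_t + v_c) / 2)

# Three treated patients (weight 1 each); four controls, three of them
# matched 4:1 (weight 1/4) and one matched 1:1 (weight 1).
x_t = [1.2, 0.8, 1.5];      w_t = [1, 1, 1]
x_c = [1.0, 1.1, 0.7, 2.0]; w_c = [0.25, 0.25, 0.25, 1.0]
d = std_diff(x_t, w_t, x_c, w_c)
print(round(d, 3))  # values below |0.1| are usually taken to indicate balance
```

The unweighted version of the same diagnostic would treat all four controls equally and can therefore paint a misleadingly reassuring (or alarming) picture of balance after variable-ratio matching.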

3. In the primary analysis, missing baseline MRI values were imputed to generate 17 imputed datasets (MRI information was available for only 20-27% of the population, as reported in Supplementary Table 6). In a propensity score analysis, multiple imputation (as opposed to single imputation) substantially complicates the analysis, because the estimates from the 17 imputed datasets must be pooled using Rubin's rules. Both Supplementary Tables 7 and 11 include only one set of estimates for each stage of the analysis (the propensity score models and the primary analysis of the matched cohorts, respectively), making it unclear how, or whether, the results were pooled. If the results were not pooled and a single imputed dataset was used (as Supplementary Tables 7 and 11 suggest), the analysis would fail to account for the uncertainty in the missing values, yielding SEs and p-values that are smaller than they should be.
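Rubin's rules combine the average within-imputation variance with the between-imputation variance, which is precisely the component a single-imputation analysis discards. A minimal sketch with invented treatment-effect estimates and SEs:

```python
import math

# Invented illustration of Rubin's rules for pooling across m imputed
# datasets; the estimates and SEs below are not from the study.
def rubin_pool(estimates, ses):
    m = len(estimates)
    qbar = sum(estimates) / m                       # pooled point estimate
    within = sum(s ** 2 for s in ses) / m           # mean within-imputation variance
    between = sum((q - qbar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between          # total variance
    return qbar, math.sqrt(total)

est = [0.78, 0.82, 0.75, 0.80, 0.79]
se  = [0.05, 0.06, 0.05, 0.05, 0.06]
qbar, pooled_se = rubin_pool(est, se)
# The pooled SE exceeds every single-imputation SE because it also carries
# the between-imputation variance.
print(qbar, pooled_se)
```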

4. We applaud the authors for conducting a series of sensitivity analyses to evaluate the robustness of their findings. However, readers would have more confidence in the findings if the supplementary materials included more detail on how those sensitivity analyses were performed. For example, when 1:1 matching was done, it is not clear whether and how the authors accounted for the matched-pairs design. In particular, the high variability in p-values across analyses with almost identical sample sizes in some matched cohorts (e.g., 'no MRI data included' vs. 'matching on 2-year relapse rate' for fingolimod vs. dimethyl fumarate in Supplementary Table 11) deserves further explanation.

5. As for the propensity score (PS)-adjusted treatment effect analyses, the authors state that individual annualized relapse rates (ARRs) were calculated and used for the primary endpoint analysis. This approach is controversial ^{4}. Furthermore, the use of individual ARRs contradicts the statistical analysis section, in which the authors state that a weighted negative binomial model accounting for matching was used. It is unclear whether the individual ARRs were fed into a negative binomial model and, if they were, the results may be biased. The authors also do not make clear whether the standard errors and p-values properly accounted for both matching and weighting across all assessed endpoints (they include a cluster term in the negative binomial model, which accounts only for matching). Our Table 1 below reports a back-calculation of the standard deviation (SD) of the ARRs, which should correspond to a stable population parameter; in particular, the column 'SD right' is less prone to the rounding effects present in the original ARR confidence interval values. This SD is benchmarked against a recent trial of a new drug for MS ^{5}. For the OPERA I and II studies, consistent and stable SD values are obtained (around 1), while highly inconsistent and underestimated SDs are obtained for the MSBase study, especially where the weighting scheme should have attributed, within each matched group, a weight of 1 to a treatment arm represented by a single patient. Our Table 1 shows that the reported standard errors are incorrect (i.e., generally too small), owing to a wrong weighting scheme or a failure to account for weighting properly; consequently, the significance of the p-values has been dramatically inflated.
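The back-calculation behind our Table 1 can be sketched as follows, using the upper ('right') half-width of a reported 95% confidence interval to recover the SE and hence the SD. The mean ARR, CI and sample size below are invented for illustration, not taken from any of the studies:

```python
import math

# Invented illustration of recovering the SD of the ARR from a reported
# mean, 95% CI and sample size; numbers are not from the studies discussed.
def sd_from_ci(mean, upper, n, z=1.96):
    se = (upper - mean) / z        # SE from the right half-width of the CI
    return se * math.sqrt(n)       # SD = SE * sqrt(n)

# e.g. a reported ARR of 0.20 (95% CI 0.17 to 0.23) in 1000 patients
print(round(sd_from_ci(0.20, 0.23, 1000), 3))
```

Applied to each published cohort, this reconstruction should return roughly the same SD whatever the sample size, because the SD is a population parameter; wildly varying back-calculated SDs are therefore a red flag for misreported or miscomputed SEs.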

6. The large number of patients included renders statistically significant a very small and perhaps not clinically meaningful difference, increasing the risk of overinterpreting the results. From a clinical standpoint, an ARR difference between 0.20 and 0.26, that is, an ARR ratio of 0.80 or, more intuitively, 1 relapse every 5 years vs. 1 relapse every 4 years, is close to negligible overall. This is an effect size that no future trial would likely be powered for, or interested in, detecting, especially as it arises at an ARR level (0.20) quite prone to noise in relapse detection.

To recapitulate, we wish to highlight the need for caution while interpreting the findings of this paper. Real world evidence (RWE) is an important and necessary component of research to assess the effectiveness and safety of various therapies outside the context of randomized clinical trials. However, because RWE is prone to various sources of bias, rigorous and careful analysis, interpretation and reporting are needed to ensure that results are reliable, reproducible and useful to inform clinical decision making.

References

[1] Kalincik T, et al. Comparison of fingolimod, dimethyl fumarate and teriflunomide for multiple sclerosis. J Neurol Neurosurg Psychiatry. 2019: jnnp-2018-319831. doi:10.1136/jnnp-2018-319831.

[2] Bovis F, et al. Expanded disability status scale progression assessment heterogeneity in multiple sclerosis according to geographical areas. Ann Neurol. 2018;84(4):621-625.

[3] Austin PC. Assessing balance in measured baseline covariates when using many-to-one matching on the propensity-score. Pharmacoepidemiol Drug Saf. 2008;17(12):1218-1225.

[4] Suissa S, et al. Statistical treatment of exacerbations in therapeutic trials of chronic obstructive pulmonary disease. Am J Respir Crit Care Med. 2006;173(8):842-846.

[5] Hauser SL, Bar-Or A, Comi G, Giovannoni G, Hartung HP, Hemmer B, Lublin F, Montalban X, Rammohan KW, Selmaj K, Traboulsee A, Wolinsky JS, Arnold DL, Klingelschmitt G, Masterman D, Fontoura P, Belachew S, Chin P, Mairon N, Garren H, Kappos L; OPERA I and OPERA II Clinical Investigators. Ocrelizumab versus Interferon Beta-1a in Relapsing Multiple Sclerosis. N Engl J Med. 2017 Jan 19;376(3):221-234.

Cite as: Platt RW, Karim ME, Debray TPA, Copetti M, Tsivgoulis G, Waubant E, Hartung HP. Comparison of fingolimod, dimethyl fumarate and teriflunomide for multiple sclerosis: when methodology does not hold the promise. *Journal of Neurology, Neurosurgery & Psychiatry* 2019, volume 90, page(s): 458-468.