Abstract
Introduction: Fetal hemoglobin (HbF) plays a critical role in modulating the clinical severity of sickle cell disease (SCD). Several studies have evaluated its impact on clinical complications such as acute chest syndrome (ACS), vaso-occlusive crises (VOC), and stroke. However, cross-sectional studies have reached inconsistent conclusions about the causal association between HbF and clinical complications in patients with SCD, reflecting the limitations of static exposure information in a disease characterized by dynamic progression. In addition, the lack of statistical approaches that can handle both longitudinal exposure patterns and different outcome types hinders accurate estimation of causal effects. To overcome these limitations, we propose a novel trajectory-informed Mendelian randomization (MR) approach that uses functional principal component analysis (FPCA) to model dynamic early-life HbF exposure in pediatric patients with SCD. By leveraging genetic variants as instrumental variables (IVs), MR helps mitigate the confounding and reverse causation present in observational studies. We used likelihood-based estimation to assess causal effects on count outcomes using data from the well-characterized SCD cohort in the Sickle Cell Clinical Research and Intervention Program (SCCRIP).
Methods: We conducted a two-stage MR analysis using individual-level data from children with SCD enrolled in SCCRIP (NCT02098863), a multi-institutional cohort of more than 1,400 patients. We modeled HbF measured at ages 0–2, using FPCA to identify exposure trends across ages. The FPC scores for each component were extracted and used as surrogate exposure variables, with the corresponding eigenfunctions capturing the predominant patterns of variation in HbF trajectories. The IV was a polygenic risk score constructed from well-established HbF-associated variants, derived from existing whole genome sequencing (WGS) data.
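The FPCA step can be illustrated with a minimal sketch. It is not the study's implementation: it assumes every child is measured on the same dense age grid (real HbF measurements are sparse and irregular, which normally calls for a dedicated FPCA routine), it ignores quadrature weights, and it extracts only the leading component by power iteration. The simulated trajectories and all names (`ages`, `curves`, `leading_fpc`) are hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical simulated data: 200 children, HbF (%) on a common age grid (years).
ages = [0.25 * j for j in range(9)]  # 0, 0.25, ..., 2.0
curves = []
for _ in range(200):
    level = random.gauss(0.0, 5.0)  # child-specific shift in HbF level
    curves.append([60.0 * math.exp(-1.2 * t) + level + random.gauss(0.0, 0.5)
                   for t in ages])


def leading_fpc(curves, n_iter=200):
    """Return (mean curve, leading eigenvector, scores, share of variance)."""
    n, m = len(curves), len(curves[0])
    mean = [sum(c[j] for c in curves) / n for j in range(m)]
    centered = [[c[j] - mean[j] for j in range(m)] for c in curves]
    # Sample covariance between every pair of grid points.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    # Power iteration converges to the eigenvector with the largest eigenvalue
    # (provided the starting vector is not orthogonal to it).
    phi = [1.0 / math.sqrt(m)] * m
    for _ in range(n_iter):
        nxt = [sum(cov[a][b] * phi[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(v * v for v in nxt))
        phi = [v / norm for v in nxt]
    eigval = sum(phi[a] * cov[a][b] * phi[b] for a in range(m) for b in range(m))
    share = eigval / sum(cov[j][j] for j in range(m))
    # FPC score: projection of each centered curve onto the eigenvector.
    scores = [sum(r[j] * phi[j] for j in range(m)) for r in centered]
    return mean, phi, scores, share


mean_curve, phi, scores, explained = leading_fpc(curves)
print(f"variance explained by first component: {explained:.3f}")
```

Here `scores` plays the role of the surrogate exposure variable and `phi` the role of the eigenfunction evaluated on the age grid; further components would be obtained by repeating the procedure on the deflated covariance.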
We then employed Poisson or negative binomial regression models to estimate the causal effect of the HbF trajectory scores on five count outcomes between ages 2 and 5: ACS episodes, VOC episodes, emergency department (ED) visits due to ACS/VOC, hospitalizations due to ACS/VOC, and acute care utilization (ACU; ED visits plus hospitalizations). All models were adjusted for sex, hydroxyurea use, chronic transfusion, and genotype. Likelihood-based estimation was employed to improve model robustness and efficiency. The estimated coefficients for each FPC score were combined with the corresponding eigenfunctions to reconstruct the causal effects of HbF on the five clinical outcomes.
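A two-stage analysis with a count outcome can be sketched as follows. This is an illustration under stated assumptions, not the study's estimator: it uses two-stage predictor substitution (the abstract does not specify which two-stage variant was used), a Poisson second stage without covariates, a single FPC score, and simulated data. The function names, the simulated effect size, and the eigenfunction values in `phi` are all hypothetical.

```python
import math
import random

random.seed(1)


def poisson_draw(lam):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, p = math.exp(-lam), 0, random.random()
    while p > limit:
        k += 1
        p *= random.random()
    return k


def ols_fit(x, y):
    """Intercept and slope of a simple least-squares regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def poisson_fit(x, y, n_iter=25):
    """Newton-Raphson maximum likelihood for log E[y] = b0 + b1 * x."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * a) for a in x]
        s0 = sum(yi - m for yi, m in zip(y, mu))
        s1 = sum((yi - m) * a for yi, m, a in zip(y, mu, x))
        h00 = sum(mu)
        h01 = sum(m * a for m, a in zip(mu, x))
        h11 = sum(m * a * a for m, a in zip(mu, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * s0 - h01 * s1) / det
        b1 += (h00 * s1 - h01 * s0) / det
    return b0, b1


# Simulated cohort (hypothetical): instrument G, exposure score X, count outcome Y.
true_beta = -0.4
G, X, Y = [], [], []
for _ in range(2000):
    g = random.gauss(0.0, 1.0)              # polygenic score (instrument)
    v = random.gauss(0.0, 1.0)              # unmeasured driver of the exposure
    u = 0.5 * v + random.gauss(0.0, 0.5)    # confounder correlated with v
    x = g + v                               # exposure (an FPC score)
    G.append(g)
    X.append(x)
    Y.append(poisson_draw(math.exp(0.5 + true_beta * x + u)))

# Stage 1: regress the exposure score on the instrument.
a0, a1 = ols_fit(G, X)
x_hat = [a0 + a1 * g for g in G]

# Stage 2: regress the count outcome on the genetically predicted score.
_, beta_iv = poisson_fit(x_hat, Y)
_, beta_naive = poisson_fit(X, Y)           # confounded comparison
print(f"IV estimate: {beta_iv:.2f}, naive estimate: {beta_naive:.2f}")

# Reconstruction with one component: effect(t) = beta * phi(t).
phi = [0.6, 0.5, 0.4]                       # hypothetical eigenfunction values
effect_curve = [beta_iv * p for p in phi]
```

In this simulation the instrument-based estimate should land near the simulated effect, whereas the naive regression is pulled toward zero by the confounder. With several components, the reconstructed effect at each age would be the sum of each coefficient times its eigenfunction value. A full analysis would also propagate first-stage uncertainty into the standard errors, which this sketch omits.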
Results: Of 1,404 enrolled patients, we excluded those with fewer than two HbF measurements between ages 0 and 2 (n = 328) and those without WGS data (n = 782). The final sample included 294 patients with both genetic data and repeated HbF measurements in early childhood. Included patients were similar to excluded patients, except for a longer duration of hydroxyurea use before age 2.
FPCA revealed three major HbF trajectory components, which together explained more than 98.0% of the variance. Trajectory-informed MR identified robust protective effects of elevated early-life HbF on clinical complications. In negative binomial models, elevated HbF was causally associated with fewer ACS episodes (β = –0.35, p = 0.05), VOCs (β = –0.40, p = 0.03), ED visits (β = –0.42, p = 0.04), hospitalizations (β = –0.32, p = 0.04), and ACU events (β = –0.39, p = 0.007). Poisson models yielded consistent estimates with equal or stronger statistical significance for all outcomes.
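For interpretation, the reported coefficients can be converted to rate ratios. The sketch below assumes the usual log link for negative binomial regression, so that exp(β) is the multiplicative change in the expected event count per unit of the exposure contrast; the abstract states neither the link nor the unit of that contrast, so both are assumptions here.

```python
import math

# Coefficients as reported for the negative binomial models.
betas = {
    "ACS": -0.35,
    "VOC": -0.40,
    "ED visits": -0.42,
    "Hospitalizations": -0.32,
    "ACU": -0.39,
}

# Under a log link (assumed), exp(beta) is a rate ratio.
rate_ratios = {outcome: math.exp(b) for outcome, b in betas.items()}

for outcome, rr in rate_ratios.items():
    print(f"{outcome}: rate ratio {rr:.2f}")
```

Under these assumptions, the coefficients correspond to rate ratios of roughly 0.66 to 0.73, that is, about 27% to 34% fewer events per unit of the exposure contrast.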
In contrast, cross-sectional MR using a single HbF measurement near age 2 showed weaker or non-significant associations (p ≤ 0.05 for only 5 tests), underscoring the benefit of modeling HbF trajectories. Sensitivity analyses confirmed the robustness of the findings across model specifications, genotype strata, and subset analyses.
Conclusions: Our findings provide the first strong and consistent evidence for a causal protective effect of elevated HbF on acute complications in patients with SCD. This strengthens the biological rationale for targeting HbF modifiers in gene therapy for SCD. Trajectory-informed MR offers significant advantages over traditional methods, improving causal inference, temporal resolution, and statistical power. The method is broadly applicable to other genetic epidemiology studies that involve time-varying exposures and count outcomes, offering a robust path forward for causal discovery in longitudinal clinical data.