Key Points
A comprehensive genetic testing program was developed using PacBio LRS for hemophilia diagnosis.
Complex structural variants were successfully detected, improving diagnostic accuracy over traditional methods.
Abstract
Hemophilia is an X-linked bleeding disorder caused by defects in the F8 or F9 genes. Given the wide variety of F8 variants, conventional genetic testing typically requires a combination of multiple methods, and detecting rearrangements in the intron 22 homologous regions (int22hs) remains a challenging task. In this study, we developed a comprehensive hemophilia testing program using the PacBio long-read sequencing (LRS) platform. Experimentally, we established a standard operating procedure for hybridization capture LRS (hc-LRS), which generates reads longer than 5 kilobases. Analytically, we used a suite of bioinformatics tools to identify variants associated with hemophilia, including the detection of int22h-related rearrangements through de novo assembly of homologous haplotypes. Our approach successfully identified pathogenic variants in patients with, and carriers of, hemophilia, encompassing both single-nucleotide variants and structural variations, with full concordance to validated methods. Moreover, the program identified complex int22h rearrangements in several samples, which were previously difficult to detect using traditional techniques. Compared with conventional methods, hc-LRS is more cost-effective, convenient, and capable of detecting various variants in a single test. This approach provides a powerful tool for the genetic diagnosis of hemophilia, particularly in patients with unknown genetic backgrounds or complex variants. In conclusion, our comprehensive testing program represents a significant advancement in the genetic diagnosis of hemophilia.
Introduction
Hemophilia comprises a group of common X-linked recessive disorders. Deficiencies in coagulation factor VIII (FVIII) and FIX are classified as hemophilia A and hemophilia B, respectively. Together with von Willebrand disease, they account for 95% to 97% of all inherited coagulation factor deficiencies.1 Patients with hemophilia typically exhibit a lifelong tendency toward hemorrhaging, which can lead to severe or even life-threatening bleeding during trauma, surgical procedures, or childbirth.2
Numerous genetic variants associated with hemophilia have been identified. The F8 gene, responsible for encoding coagulation FVIII, spans 186 kilobases (kb) and has a wide variety of variant types. According to the European Association for Haemophilia and Allied Disorders database, >3000 variants in F8 have been documented to cause hemophilia A.3,4 Single-nucleotide variants (SNVs) and indels have been identified across all 26 exons, splice regions, and untranslated regions of F8.3-5 In addition, rearrangements between intron 22 homologous region 1 (int22h-1), located within intron 22 of F8, and its homologous sequences, int22h-2 and int22h-3, result in intron 22 inversions (Inv22; Inv22 type I and Inv22 type II), accounting for ∼40% of severe hemophilia A cases.4,6 Other int22h-related variants include deletions, duplications, and complex rearrangements.7-12 Additional types of F8 variants include Inv1 and large structural variants (SVs).4 In hemophilia B, most pathogenic variants are SNVs in F9, although SVs have also been observed.13
The accurate genetic diagnosis of hemophilia remains a significant challenge. Conventional genetic testing for hemophilia A involves a series of steps, starting with long-range polymerase chain reaction (LR-PCR) or inverse shifting-PCR (IS-PCR) to detect int22h-related rearrangements and Inv1,14-16 followed by next-generation sequencing (NGS) to identify SNVs and indels,17 and concluding with additional methodologies to assess for large deletions and duplications, such as multiplexed ligation-dependent probe amplification, quantitative real-time PCR, or array-based comparative genomic hybridization.18 However, this series of testing procedures is costly, labor-intensive, and inefficient, resulting in low rates of genetic testing for hemophilia.
In recent years, gene therapy has emerged as a promising approach for treating hemophilia.19-21 The advent of preimplantation genetic testing (PGT) has led to a paradigm shift in the management of inherited diseases, shifting strategies from treatment toward prevention.22-24 However, the successful implementation of these techniques depends on an accurate diagnosis of genetic defects.
Long-read sequencing (LRS), exemplified by the PacBio platform, has the potential to identify variants with high precision. The technology behind this platform is single-molecule real-time (SMRT) sequencing,25 which is capable of detecting large SVs. High fidelity (HiFi) reads from circular sequencing enable the detection of SNVs with an accuracy of 99.9%,26 comparable with that of NGS.
LRS has typically been associated with high costs. To reduce costs and increase throughput, pre-enrichment of target regions is essential. Although targeted sequencing is well-established in short-read sequencing platforms, the targeted enrichment of long DNA fragments still presents significant challenges. Current methods include hybridization capture, PCR enrichment, and CRISPR-associated protein (Cas)-mediated enrichment. Hybridization capture with nucleic acid probes is a simple and efficient method, but it is limited by the length of the final library (typically <5 kb).27-29 Given that the int22h regions are 9.5 kb in length, conventional bioinformatics tools may produce erroneous results when processing such short fragments of highly homologous DNA. The adaptation of PCR enrichment to LRS has been facilitated by the commercial availability of long-range DNA polymerases.30 However, multiplex PCR must be compatible with each reaction condition. An increase in primer types or product length may introduce additional bias, making reaction conditions more challenging.31 It is generally preferable to divide the same sample into multiple reaction systems.32 Given that PCR primers are fixed to mutation hot spot regions, such as exons, this approach may overlook opportunities to discover new mutation sites.33,34 In addition, the CRISPR-Cas9 system, a prominent tool in genome editing, can be used to obtain targeted DNA fragments that retain epigenetic information.35-37 However, it currently faces limitations in efficiency and multiplexing capabilities.
To address the challenges of genetic testing for hemophilia and the limitations of long-read targeted sequencing, we have developed a comprehensive testing program using the PacBio LRS platform. This program combines hybridization capture LRS (hc-LRS) to generate sequencing reads of >5 kb, and a set of bioinformatics tools for analyzing various variant types, including the identification of int22h-related rearrangements through the de novo assembly of homologous haplotypes (DAHH). The program has been validated through the analysis of 18 samples from 14 families with hemophilia. Furthermore, we have investigated cases involving complex variants, and novel variants were identified.
Methods
Samples
The testing samples consisted of 18 peripheral blood samples collected from patients or their relatives, representing 14 hemophilic family lines (12 with hemophilia A and 2 with hemophilia B). The clinical diagnosis for all patients was confirmed through standardized coagulation activity tests for FVIII or FIX. These samples were archived at The Fourth Affiliated Hospital of Soochow University between 2021 and 2024.
All procedures adhered to the ethical principles outlined in the 1964 Declaration of Helsinki and its subsequent amendments, with approval granted by the institutional review board of The Fourth Affiliated Hospital of Soochow University. Informed consent was obtained from each participant or their legal guardian before inclusion in the study.
Probes for targeted enrichment
The target regions include the entire F8, F9, and VWF genes, 3 homologous regions (int22h-2, int22h-3, and int1h-1), and SRY on the Y chromosome, which is used to identify the sex of the sample. A total of 979 probes, targeting 383 kb of loci of interest and 28 kb of flanking regions, were designed and manufactured by iGeneTech, China. A comprehensive list of all probes is provided in supplemental Methods.
hc-LRS
We established a standard operating procedure for hc-LRS. A detailed description can be found in the supplemental Methods. Briefly, the first step is the extraction of genomic DNA from the blood. Next, Tn5 transposase randomly cleaves double-stranded DNA while inserting primer-binding sequences at both ends, and the resulting DNA fragments are then ready for PCR amplification. The subsequent step involves precapture amplification using long-range DNA polymerases, followed by size selection using beads to remove DNA fragments of <5 kb from the PCR products. Biotinylated oligonucleotide probes are then hybridized to the target DNA and captured by streptavidin magnetic beads. This is followed by a rinse to remove uncaptured DNA. The postcapture amplification system is then constructed to amplify the target DNA fragments. The amplification product is purified and ligated to hairpin adapters, creating the final library, which is ready for sequencing on the PacBio Sequel II sequencing system.
Data analysis process
This study did not develop any unique code or algorithms but used a combination of published bioinformatics tools adapted for analyzing various types of genetic variants. The data analysis process consisted of several steps, including tag removal, sample splitting, alignment with reference sequences, analysis of SNVs/indels and SVs, de novo assembly, and visualization. The PacBio Sequel II sequencing system generates HiFi reads with a base accuracy of Q30, eliminating the need for additional quality control.
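As a brief worked note on the Q30 figure, the standard Phred definition relates a quality score $Q$ to the per-base error probability $p$. The symbols $Q$ and $p$ are introduced here for illustration only and are not notation used elsewhere in this article.

```latex
\begin{align}
Q &= -10 \log_{10} p, & p &= 10^{-Q/10}, \\
Q = 30 \;&\Rightarrow\; p = 10^{-3} = 0.001, & 1 - p &= 0.999 = 99.9\%.
\end{align}
```

Under this definition, a base accuracy of Q30 corresponds to the 99.9% accuracy quoted for HiFi reads.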
Initially, sequence tags were excised from the raw sequencing data using the Demultiplex barcodes module of SMRT Link (version 13.1.0.221970). The sequence data were then divided into sample-specific files in FASTQ format. Next, the split sequence data files were aligned to the reference genome (hg38) using Minimap2 (version 2.17-r974-dirty), and the resulting alignments were indexed.38 SNVs and indels were identified using DeepVariant (version 1.6.0),39 whereas structural variations and copy number variations were analyzed with Pbsv (version 2.8.0).26 Paraphase (version 3.1.0) was used to detect int22h-related recombination through DAHH.40 Finally, Hifiasm (version 0.19.3-r572) was used to identify additional structural variations or genomic recombination events, if necessary.41 The resulting assemblies were visualized using the Integrative Genomics Viewer.
Methods for validation
We performed Sanger sequencing on samples from patients in whom SNVs were detected, and their relatives. For int22h-related rearrangements, LR-PCR was performed following the methodology described by Bagnall et al.15 For deletions and other complex SVs, we used optical genome mapping (OGM; A12 and B2, as described in our previous study)42 or PacBio whole-genome sequencing (A9).
Results
Establishment of hc-LRS method
The comprehensive testing program combines hc-LRS with a suite of bioinformatics tools, aiming to detect various variant types in a single test.
First, we designed probes specifically suited for capturing long DNA fragments, which differ from those used in NGS platforms (Figure 1A). The biotinylated DNA probes are 100 base pairs (bp) in length and have a 0.33× tiling (200-bp gaps between probes). The regions of interest include the entire F8, F9, and VWF genes and 3 homologous regions (int22h-2, int22h-3, and int1h-1). Following the standard operating procedure of this study, it was determined that this probe set achieves complete coverage of all target regions (Figure 1B).
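The tiling arithmetic above can be illustrated with a minimal sketch. It assumes probes are laid out at a fixed step of probe length plus gap; the function names and the 3-kb toy region are hypothetical, and the actual probe coordinates are those listed in the supplemental Methods, not the output of this code.

```python
def tile_probes(region_start, region_end, probe_len=100, gap=200):
    """Return half-open (start, end) intervals for probes placed at a fixed step.

    Assumption: consecutive probes are separated by a constant gap, as
    described in the text (100-bp probes, 200-bp gaps).
    """
    step = probe_len + gap
    probes = []
    pos = region_start
    while pos + probe_len <= region_end:
        probes.append((pos, pos + probe_len))
        pos += step
    return probes


def tiling_density(probe_len=100, gap=200):
    """Fraction of the target covered by probe sequence (the tiling factor)."""
    return probe_len / (probe_len + gap)


# Toy example: a hypothetical 3-kb region.
probes = tile_probes(0, 3000)
print(len(probes), round(tiling_density(), 2))
```

With 100-bp probes and 200-bp gaps, one-third of the target is covered by probe sequence, which matches the 0.33× tiling stated above.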
Figure 1. Probe design and experimental procedure for hc-LRS. (A) Comparison of the probes used for hc-LRS in this study with those used for NGS. (B) Sequencing depth and coverage of the target regions. (C) The experimental process from genomic DNA to the final sequencing library involves the following steps: fragmentation, precapture PCR, hybridization capture, postcapture PCR, and adapter ligation. The figure illustrates the results of capillary electrophoresis for the fragmentation products, precapture PCR products, postcapture PCR products, and the final library, showing a trend of decreasing fragment length throughout the experiment. The red and orange bands indicate the fragments introduced by Tn5 transposase, the dark red/red and yellow/orange bands represent the primers for precapture amplification, the blue bands indicate the targeted regions, and the black bands represent the hairpin adapter. UTR, untranslated regions; WES, whole-exome sequencing; RFU, relative fluorescence unit.
Next, we established the standard operating procedure for hc-LRS (Figure 1C). The overall design of the experimental workflow was inspired by the general approach used in targeted sequencing on NGS platforms, following the steps of fragmentation, preamplification, hybridization capture, and postamplification. However, because LRS requires much longer DNA molecules than NGS, the conditions for fragmentation and amplification needed to be adjusted and are more stringent. Because longer sequencing reads lead to more accurate genome assemblies and a higher possibility of identifying structural variations, it was essential to preserve fragments as long as possible. Therefore, we optimized the conditions for each experimental step, with particular emphasis on the following key stages: DNA fragmentation (supplemental Figure 1), selection of long-range DNA polymerases (supplemental Figure 2), and DNA size selection (supplemental Figure 3). Following the standard operating procedure, we obtained sequencing reads with an average length of ∼5.5 kb and a capture efficiency of ∼25%, whereas the average sequencing depth exceeded 100× in all target regions.
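To show how capture efficiency and sequencing depth relate, the following sketch computes the mean depth expected from a given sequencing yield, and the yield needed for a given depth. It assumes that depth is averaged over the full probe design (383 kb of loci plus 28 kb of flanking regions, 411 kb in total) and that the on-target rate is the fraction of sequenced bases falling in that design; both the function names and this simplification are assumptions made here, not part of the published procedure.

```python
TARGET_BP = 383_000 + 28_000  # loci of interest plus flanking regions (assumed denominator)


def mean_target_depth(total_bases, on_target_rate, target_bp=TARGET_BP):
    """Mean depth over the target, given total sequenced bases and on-target fraction."""
    return total_bases * on_target_rate / target_bp


def required_yield(depth, on_target_rate, target_bp=TARGET_BP):
    """Total sequenced bases needed per sample to reach a mean target depth."""
    return depth * target_bp / on_target_rate


# Toy example: bases needed for 100x depth at a 25% on-target rate.
print(f"{required_yield(100, 0.25):,.0f} bases")
```

Under these assumptions, roughly 164 Mb of HiFi data per sample would suffice for 100× depth at a 25% on-target rate.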
The samples used for testing comprised 18 specimens derived from patients or relatives within 14 families with hemophilia. Table 1 summarizes the results for all samples. Full concordance was observed between the pathogenic SNVs and SVs identified by the comprehensive testing program and those identified by validated methods.
Table 1. Sample information and the results of the comprehensive testing program
Number | Sex/age, y | Phenotype | Average read length, bp | On-target rate, % | Sequencing depth, × | Type of variant | Nucleotide site | Validation |
---|---|---|---|---|---|---|---|---|
A1 | M/34 | Severe HA | 5631 | 26.0 | 128 | Inv22 type I | | LR-PCR |
A2 | M/57 | Severe HA | 5872 | 26.1 | 148 | Inv22 type I | | LR-PCR |
A3 | M/32 | Severe HA | 5707 | 26.0 | 147 | Nonsense | NM_000132.4(F8): c.1804C>T (p.Arg602Ter) | Sanger sequencing |
A4 | M/27 | Severe HA | 5610 | 27.2 | 153 | Inv22 type I | | LR-PCR |
A4-2 | F/51 | Mother of A4 | 5583 | 28.1 | 152 | Inv22 type I carrier | | LR-PCR |
A5 | M/15 | Severe HA | 5735 | 26.7 | 112 | Nonsense | NM_000132.4(F8): c.6496C>T (p.Arg2166Ter) | Sanger sequencing |
A5-2 | F/40 | Mother of A5 | 5576 | 27.6 | 151 | A5 variant not found | | Sanger sequencing |
A6 | M/58 | Severe HA | 5409 | 26.2 | 140 | Inv22 type I | | LR-PCR |
A7 | M/31 | Severe HA | 5457 | 25.9 | 127 | Inv22 type I | | LR-PCR |
A8 | M/34 | Severe HA | 5595 | 25.8 | 136 | Nonsense | NM_000132.4(F8): c.6714G>A (p.Trp2238Ter) | Sanger sequencing |
A9 | M/7 | Severe HA | 5597 | 25.2 | 135 | int22h-related complex rearrangement | | LR-PCR; PacBio WGS |
A10 | M/47 | Severe HA | 5795 | 24.8 | 126 | Inv22 type I | | LR-PCR |
A10-2 | F/23 | Daughter of A10 | 5346 | 28.4 | 151 | Inv22 type I carrier | | LR-PCR |
A11 | M/50 | Severe HA | 5633 | 26.8 | 140 | Missense | NM_000132.4(F8): c.5530C>T (p.Pro1844Ser) | Sanger sequencing |
A12 | F/32 | Mild HA | 5643 | 28.0 | 169 | int22h-related complex rearrangement carrier | | LR-PCR; OGM |
B1 | M/59 | Severe HB | 5719 | 28.3 | 170 | Missense | NM_000133.4(F9): c.205T>G (p.Cys69Gly) | Sanger sequencing |
B2 | M/13 | Severe HB | 5690 | 22.8 | 107 | F9 complete deletion | | OGM |
B2-2 | F/36 | Mother of B2 | 5864 | 24.9 | 166 | F9 complete deletion carrier | | |
F, female; HA, hemophilia A; HB, hemophilia B; M, male; WGS, whole-genome sequencing.
DAHH and analysis of int22h-related rearrangements
Identifying int22h-related rearrangements is of particular significance in genetic testing for hemophilia. A single sequencing read cannot fully cover the 9.5-kb-long int22h regions, whereas de novo assembly offers a feasible solution for distinguishing homologous regions.
Paraphase, a Python tool developed to differentiate the homologous genes SMN1 and SMN2, which are associated with spinal muscular atrophy,40 was used in this study to identify int22h regions (Figure 2). Paraphase first extracted all HiFi reads containing the int22h-1, int22h-2, and int22h-3 regions, then aligned them with the int22h-2 region of the reference genome. The same single-nucleotide polymorphisms were identified and ligated to assemble the HiFi reads into haplotypes. These haplotypes were then assigned to int22h-1, int22h-2, or int22h-3 based on differences in sequences upstream and downstream of the homologous region. Male samples typically assemble 3 sets of haplotypes, whereas female samples assemble 6 (Figure 3A). Int22h-related rearrangements were indicated by the presence of fusion fragments, in which the upstream and downstream sequences originate from different int22h regions. In this study, the fusion fragment formed by the anterior part of int22h-3 and the posterior part of int22h-1 is designated int22h-3/1, whereas the fusion fragment formed by the anterior part of int22h-2 and the posterior part of int22h-1 is designated int22h-2/1. Int22h-2 and int22h-3 have the same downstream sequence.43 The fused fragment formed by the anterior part of int22h-1 and the posterior part of either int22h-2 or int22h-3 is designated int22h-1/2or3.
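The naming convention described above can be summarized in a short sketch. This is not Paraphase code; it only encodes the labeling rule stated in the text, assuming that each assembled haplotype has already been annotated with the origin of its upstream flank ("1", "2", or "3") and its downstream flank ("1" or "2or3", because int22h-2 and int22h-3 share the same downstream sequence). The function name and input encoding are hypothetical.

```python
def name_haplotype(upstream, downstream):
    """Name an int22h haplotype from the origin of its flanking sequences.

    upstream:   "1", "2", or "3" (upstream flank is unique to each copy)
    downstream: "1" or "2or3" (int22h-2 and int22h-3 share a downstream flank)
    """
    if upstream not in {"1", "2", "3"} or downstream not in {"1", "2or3"}:
        raise ValueError("unrecognized flank label")
    if upstream == "1":
        # Normal int22h-1, or a fusion ending in the shared int22h-2/3 flank.
        return "int22h-1" if downstream == "1" else "int22h-1/2or3"
    if downstream == "2or3":
        # Upstream and downstream flanks agree: a normal int22h-2 or int22h-3.
        return f"int22h-{upstream}"
    # Upstream from int22h-2 or int22h-3, downstream from int22h-1: a fusion.
    return f"int22h-{upstream}/1"


for flanks in [("1", "1"), ("2", "2or3"), ("3", "1"), ("1", "2or3")]:
    print(flanks, "->", name_haplotype(*flanks))
```

Any haplotype whose two flanks originate from different copies is reported as a fusion fragment, which is the signal used to indicate a rearrangement.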
Figure 2. Diagram of the analysis of int22h-related rearrangements through DAHH. Wild type and Inv22 type I are shown as examples. int22h-1 is located within intron 22 of F8. int22h-2 and int22h-3 are located within the 2 arms of a stem-loop structure. Inv22 type I represents the inversion of the fragment between int22h-1 and int22h-3. Red bands represent F8, blue bands represent the upstream sequence of int22h-2, green bands represent the upstream sequence of int22h-3, and gray bands represent the downstream sequence of int22h-2 and int22h-3.
Figure 3. Analysis of int22h-related rearrangements. (A) Results of DAHH. Three sets of haplotypes were assembled in wild-type males, and 5 sets of haplotypes were assembled in wild-type females. The fusion fragments int22h-1/2or3 and int22h-3/1 were identified in samples with Inv22 type I (A1 is shown as an example of a male with Inv22 type I, and A4-2 as an example of a female Inv22 type I carrier). A9 and A12 are considered int22h-related complex variants. (B) LR-PCR for int22h-related rearrangements, showing the locations of the 5 primers (H1F, H2F, H3F, H1R, and H2/3R) in the reference genome. (C) Results of LR-PCR. Five primers were used in different combinations to amplify int22h-1, int22h-2, int22h-3, and the fusion fragments int22h-1/2or3, int22h-2/1, and int22h-3/1. Samples A1, A2, A4, A6, A7, and A10 were identified as Inv22 type I; samples A4-2 and A10-2 were identified as Inv22 type I carriers; and samples A9 and A12 are int22h-related complex variants.
Through DAHH, we identified 10 samples with int22h-related rearrangements, some of which are shown in Figure 3A. The fusion fragments int22h-1/2or3 and int22h-3/1 were detected in 8 of these samples. In the male patients (A1, A2, A4, A6, A7, and A10), a normal int22h-2 was also identified, whereas the female carriers (A4-2 and A10-2) additionally contained normal int22h-1, int22h-2, and int22h-3, indicating a normal X allele. This variant type is considered to be Inv22 type I, the most common type of SV leading to severe hemophilia A.4 In addition, A9 and A12 are considered complex variants, which are described later in this article.
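The interpretation rule for male samples can be sketched as follows. This is a simplified illustration rather than the analysis code used in the study: it assumes a hemizygous male sample whose assembled haplotypes have already been named, and it recognizes only the two patterns described in the text (wild type and Inv22 type I), reporting everything else as an unclassified rearrangement. The function and constant names are hypothetical.

```python
from collections import Counter

# Expected haplotype sets for a single X chromosome, as described in the text.
WILD_TYPE = Counter({"int22h-1": 1, "int22h-2": 1, "int22h-3": 1})
INV22_TYPE_I = Counter({"int22h-1/2or3": 1, "int22h-3/1": 1, "int22h-2": 1})


def call_male_int22h(haplotypes):
    """Interpret the named int22h haplotypes assembled from a male sample."""
    observed = Counter(haplotypes)
    if observed == WILD_TYPE:
        return "wild type"
    if observed == INV22_TYPE_I:
        return "Inv22 type I"
    # Anything else (extra copies, other fusions) needs manual review.
    return "other int22h-related rearrangement"


# Pattern reported for the Inv22 type I male patients.
print(call_male_int22h(["int22h-1/2or3", "int22h-3/1", "int22h-2"]))
# Pattern reported for patient A9 (1 int22h-3, 1 int22h-2/1, 2 int22h-1/2or3).
print(call_male_int22h(["int22h-3", "int22h-2/1", "int22h-1/2or3", "int22h-1/2or3"]))
```

The sketch is restricted to males because, in females, haplotypes from both X chromosomes are assembled together and near-identical copies may collapse into a single haplotype, so a simple set comparison would not be reliable.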
We conducted LR-PCR on int22h-1, int22h-2, and int22h-3, as well as on 3 types of fusion fragments, using a combination of 5 primers (H1F, H2F, H3F, H1R, and H2/3R), in accordance with the methodology proposed by Bagnall et al (Figure 3B).15 For the samples identified as carrying int22h-related rearrangements, the LR-PCR results were consistent with those of DAHH (Figure 3C). However, PCR is only capable of characterizing normal and fusion fragments and cannot provide copy number information for each type.
In addition, we evaluated the potential of DAHH for application to other homologous loci by applying it to distinguish the VWF gene from its pseudogene VWFP1. As a result, we successfully resolved 4 sets of haplotypes corresponding to VWF and VWFP1, respectively (supplemental Figure 4).
Detection of SNVs
Different pathogenic SNVs were identified in 5 samples, including 3 nonsense mutations located in exons 12, 23, and 24 of F8; and 2 missense mutations, 1 in exon 16 of F8 and the other in exon 2 of F9 (Table 1). One of the variants in F8 (NM_000132.4:c.6714G>A) is a novel variant resulting in p.Trp2238Ter. It has been documented that an adjacent SNV (NM_000132.4:c.6713G>A) causes translation to terminate prematurely at the same position, leading to hemophilia A. All other SNVs are documented pathogenic variants in the database (https://dbs.eahad.org/FVIII; https://dbs.eahad.org/FIX). Additionally, patient A5, a 15-year-old adolescent boy with hemophilia A, was found to carry a nonsense mutation in F8 (NM_000132.4:c.6496C>T). However, this SNV was not identified in his mother (A5-2), suggesting that it is a de novo variant in A5. For all relevant aforementioned samples, Sanger sequencing was performed, confirming complete consistency with the results of the targeted sequencing.
Detection of other SVs
A large deletion of the F9 gene was identified in a male patient with hemophilia B and his mother (B2 and B2-2; Figure 4A). No reads corresponding to the F9 gene were identified in the sequencing results of patient B2, suggesting a complete deletion of the F9 gene in the probe-covered region. Although reads mapping to F9 were found in B2-2, de novo assembly of the F9 gene yielded only a single haplotype, whereas 2 haplotypes would be expected in a female with a normal genotype. This suggests that B2-2 is a carrier of the F9 deletion. Additionally, in our previous study,42 OGM was performed on samples from patient B2, revealing a 451-kb deletion in the Xq27.1 region that includes the F9 gene (supplemental Figure 5).
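The reasoning used for B2 and B2-2 can be expressed as a small decision rule. The sketch below is illustrative only and assumes that the number of F9 reads and the number of assembled F9 haplotypes are already known for a sample; the function name and return strings are hypothetical. A single haplotype in a female is also what would be seen if both alleles were identical across the region, so the result is a flag for follow-up rather than a definitive call.

```python
EXPECTED_X_HAPLOTYPES = {"M": 1, "F": 2}  # one X chromosome in males, two in females


def f9_copy_state(sex, n_reads, n_haplotypes):
    """Flag a possible F9 deletion from read support and assembled haplotype count."""
    expected = EXPECTED_X_HAPLOTYPES[sex]
    if n_reads == 0:
        return "no F9 reads: consistent with complete deletion"
    if n_haplotypes < expected:
        return "fewer haplotypes than expected: possible deletion carrier"
    return "expected haplotype count"


print(f9_copy_state("M", n_reads=0, n_haplotypes=0))    # pattern seen in B2
print(f9_copy_state("F", n_reads=500, n_haplotypes=1))  # pattern seen in B2-2 (read count is a placeholder)
```

In the study, the deletion in B2 was additionally supported by OGM, which is the kind of orthogonal confirmation such a flag would call for.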
Figure 4. Detection of other SVs. (A) Sequencing results and de novo assembly of the F9 gene in B2 and B2-2. No haplotypes were assembled in B2, and only 1 set of haplotypes was assembled in B2-2, whereas 1 set of haplotypes is present in WT males and 2 sets are present in WT females. (B) Sequencing results and de novo assembly showing the insertions in VWF intron 47 in A2 and A9. These variants do not affect the function of VWF. Chr12, chromosome 12; ChrX, X chromosome; WT, wild-type.
In addition, heterozygous insertions of 140 bp and 144 bp were detected in the VWF gene in individuals A2 and A9, respectively (Figure 4B). Both insertions are located within the CATA repeat region of intron 47 and are not predicted to exert any functional effects on VWF.
Further exploration of complex variants
One set of int22h-3, 1 set of int22h-2/1, and 2 sets of int22h-1/2or3 were identified in a male patient with hemophilia A (patient A9), indicating complex int22h-related rearrangements, including copy number variants (Figure 3A). To further investigate this complex variant, we conducted PacBio whole-genome sequencing on samples from patient A9, which generated HiFi reads with an average length of 12 kb and a depth of 15×. Although the data did not yield a contiguous assembly of the entire region, several major fragments were successfully assembled. No fusion reads outside the homologous regions were identified, suggesting that all rearrangements occurred within the homologous regions. Based on the sequencing results, we constructed a model for the evolution of this variant. It is possible that an inversion between int22h-2 and int22h-3, another inversion between int22h-1 and int22h-2 (distal int22h), and a duplication from int22h-2 to int22h-1/2or3 occurred sequentially (Figure 5A). This inference is consistent with the results of DAHH and LR-PCR.
Figure 5. Complex variants in samples A9 and A12. (A) A model for the evolution of the complex variant in A9. An inversion occurred between int22h-2 and int22h-3, another inversion occurred between int22h-1 and int22h-2 (distal int22h), and a duplication of int22h-2 and int22h-1/2or3 repeats occurred sequentially. (B) The structure of the complex variant in A12. The fragment between int22h-2 and int22h-3 was inverted, and a repeat fragment was inserted into F8. One breakpoint of the insertion fragment is located at hg38 chrX:155,471,846, whereas the other is situated within int22h-2. In addition, a minor duplication of the sequence from hg38 chrX:154,872,422 to int22h-1 occurred, with the insertion site located at the junction of this duplication.
The DAHH results of female patient A12 showed 1 set of int22h-2/1, 1 set of int22h-1, 1 set of int22h-2, and 3 different sets of int22h-3 (Figure 3A). An additional set of fusion fragments was identified, chrX:155,471,846(+)-chrX:154,872,422(+), located outside int22h. The OGM results for the same sample in our previous study indicated an SV in Xq28,42 in which the fragment between int22h-2 and int22h-3 was inverted and repeatedly inserted into int22h-1 (supplemental Figure 5). Because of resolution limitations, OGM cannot resolve the nucleotide positions of the breakpoints. Synthesizing all the information, it can be concluded that 1 of the breakpoints of the insertion is at hg38 chrX:155,471,846, whereas the other is located within int22h-2. In addition, a minor duplication occurs from hg38 chrX:154,872,422 to int22h-1, with the insertion positioned at the junction of this duplication (Figure 5B). This conclusion is generally consistent with the results of DAHH and LR-PCR, with one minor discrepancy. Because of the duplication, there should be 8 sets of int22h, whereas DAHH assembled only 6 sets. This discrepancy, in which the number of assembled haplotypes is lower than the predicted number, was also observed for sample A10-2. This may be because the nucleotide sequences of int22h on the 2 alleles are too similar to be distinguished.
Discussion
Hemophilia is a hereditary bleeding disorder with a long history of research, yet the genetic variations of the disease continue to be elucidated. Among the known causative genes, F8 is particularly challenging because of its large genomic size, complex internal structure, and unique location at the telomeric end of the X chromosome's long arm. These features contribute to the wide variety of variant types observed in hemophilia A and complicate genetic testing. NGS technologies have limitations in detecting SVs, especially those involving highly homologous regions. Conventional genetic testing for hemophilia A typically follows a stepwise approach, initially detecting int22h-related rearrangements and Inv1 using LR-PCR or IS-PCR,14-16 followed by NGS to identify SNVs and indels.17 When PCR and sequencing are inconclusive, or when exon deletions are suspected, multiplexed ligation-dependent probe amplification is used for further analysis.18 In this study, we developed a comprehensive testing program for hemophilia based on the PacBio LRS platform. The program was validated with 18 clinical samples from 14 hemophilia pedigrees, with results entirely concordant with those obtained from validated methods. Leveraging the high single-base accuracy and long-read capability of the PacBio system, our approach enables simultaneous detection of various variant types in a single assay. Experimentally, we established a standard operating procedure for hc-LRS, consistently generating high-quality data, with read lengths exceeding 5 kb. Analytically, the DAHH pipeline enabled detection of int22h-related rearrangements and copy number analysis within homologous regions. Compared with conventional approaches, our method offers notable advantages in cost-effectiveness, workflow simplicity, and the capacity to identify complex variants. A comparative analysis with several alternative methods is summarized in Table 2.
Table 2. Methods of genetic testing for hemophilia
. | SNVs/indels . | Inv 22 . | int22h-related complex variants . | Inv 1 . | Large deletions/duplications . | Distinguish VWD . | Cost per sample, $ . | Comment . |
---|---|---|---|---|---|---|---|---|
NGS | √ | Possible | √ | 100 | ||||
LR-PCR/IS-PCR | √ | √ | √ | 30 | Complex variants may be misdiagnosed | |||
MLPA | √ | 100 | ||||||
OGM | √ | √ | √ | √ | 800 | Demanding high-quality DNA; resolution limited | ||
PacBio WGS | √ | √ | √ | √ | √ | √ | 1350 | Low sequencing depth |
Comprehensive testing program | √ | √ | √ | √ | √ | √ | 100 | Identification of fusion fragments and CNVs in homologous regions |
√, capable; CNV, copy number variation; MLPA, multiplexed ligation-dependent probe amplification; VWD, von Willebrand disease.
LRS enables the exploration of genomic regions previously inaccessible with conventional sequencing techniques. To fully leverage its potential, targeted sequencing is essential. Multiplex LR-PCR has recently been used for targeted LRS in hemophilia A.33,34 However, this method can miss variants outside the regions covered by the primers, and adding more primers may increase amplification bias, especially when amplifying long DNA fragments. In contrast, hybridization-based probe capture is particularly effective for identifying complex rearrangements and novel variants. Benefiting from optimized technical parameters (detailed in the supplemental Methods), our standard operating procedure completes the experimental process within 2 days and provides sequencing results within 5 days, while maintaining cost-efficiency. For the panel used in this study, 10 000 HiFi reads are sufficient to achieve an average target region coverage of ≥100×. Assuming a capture efficiency of 25% and 2 to 4 million HiFi reads from a single PacBio SMRT Cell 8M chip, each sample requires only 1% to 2% of the chip capacity and can be mixed proportionally with any other barcoded HiFi library, such as a barcoded whole-genome sequencing sample. In this study, hc-LRS cost approximately $100 per sample when the sequencing chip was shared with other samples, whereas whole-genome sequencing on the PacBio Sequel II platform costs up to $1350. The newer PacBio Revio platform, with the 25M chip, is expected to further enhance cost-efficiency.
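The chip-capacity estimate can be reproduced with a few lines of arithmetic. The sketch below assumes that the 10 000 HiFi reads refer to on-target reads, so that a 25% capture efficiency implies 40 000 total reads per sample; the function and variable names are illustrative and are not part of the study's workflow.

```python
def chip_fraction(on_target_reads, capture_efficiency, reads_per_chip):
    """Fraction of one sequencing chip consumed by a single captured sample."""
    # Total reads that must be sequenced to obtain the required on-target reads.
    total_reads_needed = on_target_reads / capture_efficiency
    return total_reads_needed / reads_per_chip


ON_TARGET_READS = 10_000    # reads giving >=100x target coverage (from the text)
CAPTURE_EFFICIENCY = 0.25   # capture efficiency assumed in the text

# Lower and upper HiFi read yields quoted for one SMRT Cell 8M chip.
for reads_per_chip in (2_000_000, 4_000_000):
    frac = chip_fraction(ON_TARGET_READS, CAPTURE_EFFICIENCY, reads_per_chip)
    print(f"{reads_per_chip:,} HiFi reads per chip -> {frac:.1%} of capacity")
```

Under these assumptions the two yields give 2.0% and 1.0% of the chip, matching the 1% to 2% range stated above.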
The hc-LRS and DAHH approach offers an opportunity to identify and characterize complex SVs. For patient A12, hc-LRS identified fusion reads and nucleotide-level breakpoints beyond the homologous regions. For patients A9 and A12, DAHH revealed unexplained copy number abnormalities of int22h, indicating the presence of complex int22h-related rearrangements. Integrative analysis combining long-read whole-genome sequencing and OGM revealed the complete structure of these variants. Interestingly, both complex variants involved fragments of int22h-2/1. Sequencing of the human X chromosome has revealed that the orientations of int22h-1 and int22h-2 are identical, whereas int22h-3 is oriented oppositely to the former 2.44 As a result, inversions between int22h-1 and int22h-3 (Inv22 type I) are more likely to occur than inversions between int22h-1 and int22h-2 (Inv22 type II). Theoretically, the formation of Inv22 type II may require an inversion between int22h-2 and int22h-3 (as observed in patient A12), followed by another inversion, which ultimately forms the fusion fragments int22h-2/1 and int22h-1/2. In severe hemophilia A, the number of cases diagnosed with Inv22 type II was only one-fifth that of Inv22 type I.4 Int22h-related duplications or deletions mainly occur between int22h-1 and int22h-2, forming int22h-2/1, and only a few cases have been reported.7-12 The identification of int22h-related rearrangements relies on LR-PCR or IS-PCR, but the results can be misleading. For instance, if only PCR results are considered, A9 could be misclassified as Inv22 type II, and A12 could be misdiagnosed as a deletion between int22h-1 and int22h-2. Similar misclassifications have also been reported in other studies.8 These findings underscore that fusion fragments should not be automatically interpreted as simple inversions or deletions, because they may reflect more intricate genomic rearrangements.
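The orientation argument can be made explicit with a small sketch. It encodes the relative orientations stated above (int22h-1 and int22h-2 identical, int22h-3 opposite) and applies the general rule that recombination between inverted repeats on the same chromosome inverts the intervening segment, whereas recombination between directly oriented repeats produces a deletion or duplication. The "+" and "-" labels and the function name are illustrative assumptions and are not part of DAHH.

```python
# Relative orientations as described in the text; the signs are arbitrary labels.
ORIENTATION = {"int22h-1": "+", "int22h-2": "+", "int22h-3": "-"}


def expected_single_event(copy_a, copy_b):
    """Outcome expected from one recombination event between two int22h copies."""
    if ORIENTATION[copy_a] != ORIENTATION[copy_b]:
        return "inversion"
    return "deletion/duplication"


pairs = [
    ("int22h-1", "int22h-3"),
    ("int22h-1", "int22h-2"),
    ("int22h-2", "int22h-3"),
]
for copy_a, copy_b in pairs:
    print(f"{copy_a} x {copy_b} -> {expected_single_event(copy_a, copy_b)}")
```

Under this simplified rule, a single event between int22h-1 and int22h-2 is expected to give a deletion or duplication rather than an inversion, which is consistent with the suggestion above that Inv22 type II requires a preceding inversion between int22h-2 and int22h-3.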
Beyond F8, the hc-LRS and DAHH method also shows promise for analyzing other loci involving homologous regions. In this study, DAHH effectively distinguished the VWF gene from its pseudogene VWFP1, which shares 97% sequence identity with VWF exons 23 to 34 (supplemental Figure 4).45,46
Moreover, the comprehensive testing program has potential applications in PGT. Currently, pedigree linkage analysis is the mainstay for PGT of int22h-related rearrangements.47,48 An emerging alternative uses LRS to construct parental haplotypes, followed by single-nucleotide polymorphism–based linkage analysis for embryo selection, eliminating the need for a proband. Such applications have already been reported.23,24 To ensure robust haplotype assembly, a minimum of 30× coverage is recommended.26,49 Whereas whole-genome sequencing requires multiple sequencing chips to meet this requirement, hc-LRS can achieve it efficiently. For haplotype assembly at individual loci, targeted and whole-genome LRS yield comparable results. Further studies are needed to assess the suitability of this approach for PGT in hemophilia.
This study has several limitations. Although hc-LRS and DAHH can indicate the presence of complex variants, OGM or long-read whole-genome sequencing remains necessary to resolve their complete structures. Expanding the target region to include sequences between F8, int22h-2, and int22h-3 may enable complete assembly of int22h-related rearrangements. In addition, the sample size was limited, and variants such as Inv1 or other non–int22h-related pathogenic SVs were not collected. Although the probe set includes int1h-2, a 1041-bp homologous region of F8 intron 1 implicated in Inv1,50 and a HiFi read can theoretically span and resolve this region, such rearrangements were not identified in our samples. Furthermore, although 2 heterozygous insertions in VWF were detected, both were nonpathogenic, leaving the method’s utility in identifying heterozygous pathogenic SVs and breakpoints unverified. The comprehensiveness of this approach requires validation in larger cohorts with more diverse variant types.
Acknowledgments
This work was supported by funds from the Priority Academic Program Development of Jiangsu Higher Education Institutions (20KJA320001 [M.J.]), the Suzhou Science and Technology Project (SKY2022012 [M.J.], SZS2023014 [M.J.]), Jiangsu Provincial Research Hospital Project (YJXYY202204 [L.Z.]), and the National Natural Science Foundation of China (82300151 [L.Z.]).
Authorship
Contribution: L.K., B. Liang, and M.J. designed the experiments; B. Liu, R.X., and Y.Y. carried out the molecular genetic studies; B. Liu and M.G. analyzed the data; B. Liu, H.L., S.C., and M.J. wrote the manuscript; and S.M., L.Z., and Z.Y. assisted in collecting clinical samples.
Conflict-of-interest disclosure: The authors declare no competing financial interests.
Correspondence: Miao Jiang, The Fourth Affiliated Hospital of Soochow University, 9 Chongwen St, Suzhou 215021, China; email: jiangmiao@suda.edu.cn; and Bo Liang, Basecare Medical Device Co, Ltd, 77 Jingu St, Suzhou 215028, China; email: boliang880@alumni.sjtu.edu.cn.
References
Author notes
B. Liu, R.X., and S.M. contributed equally to this study.
Data are available on request from the corresponding authors, Miao Jiang (jiangmiao@suda.edu.cn) and Bo Liang (boliang880@alumni.sjtu.edu.cn).
The full-text version of this article contains a data supplement.