Gene Therapy

AAV PacBio Quality Control

Assess the completeness and contamination of PacBio sequenced adeno-associated virus (AAV) constructs by examining alignment coverage across sequences and among specific regions including the promoter and CDS.

Version 2.3.0

Use Cases

  • The user has completed PacBio HiFi sequencing for AAV constructs and wishes to characterize the quality of the sequencing results
  • The user has completed Illumina sequencing and wishes to detect variants in the output data

Summary

This is a quality control workflow that characterizes PacBio adeno-associated virus (AAV) products by examining alignment coverage across sequence regions of interest. The user provides either BAM files from a PacBio sequencer run in AAV mode or the raw PacBio run data folder as a tar.gz archive that includes subreads and XML data. The user may optionally provide Illumina sequencing data for variant detection. For each run analyzed, the user receives a report of the alignment statistics.

Methods

If masking is selected, the vector sequence and the packaging plasmid sequences are used to mask the human genome using MUMmer [1]. If the input data are PacBio uBAMs that were not run in AAV mode, consensus reads are created using circular consensus sequencing (CCS) [2]. Reads are aligned to the reference sequences using Minimap2 [3]. A custom report of alignment statistics is generated using a workflow developed at PacBio. The resulting alignments are filtered for quality to keep primary alignments with mapping quality scores greater than 10. Counts and lengths of alignments to regions of interest are determined from the alignment files using Bedtools [4]. If Illumina data are provided, reads are trimmed using TrimGalore [5] to remove low-quality read ends (quality < 25) and discard reads shorter than 35 bp. Trimmed reads are aligned to a reference genome using Minimap2. Duplicate reads can optionally be marked using Picard MarkDuplicates [6]. BAMs from the same sample generated by multiple runs are merged using Samtools [7]. Replication errors can be detected using MuTect2 [8] and Freebayes [9]. Finally, a report is generated with relevant quality metrics.
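
To make the quality filter concrete, here is a minimal Python sketch that applies the same rule (primary alignments, mapping quality greater than 10) to SAM-format text lines. It is an illustration only and not the workflow's own code. The function names are hypothetical, and whether supplementary alignments are also dropped is an assumption left to the caller through exclude_flags.

```python
# Illustrative sketch only; not the workflow's implementation.
# FLAG bit values come from the SAM format specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def passes_filter(sam_line, min_mapq=10, exclude_flags=UNMAPPED | SECONDARY):
    """Return True if the alignment has none of exclude_flags and MAPQ > min_mapq.

    Assumption: supplementary alignments are kept by default. Pass
    exclude_flags=UNMAPPED | SECONDARY | SUPPLEMENTARY to drop them too.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    if flag & exclude_flags:
        return False
    return mapq > min_mapq


def filter_sam(lines, **kwargs):
    """Yield header lines unchanged plus the alignment lines that pass."""
    for line in lines:
        if line.startswith("@") or passes_filter(line, **kwargs):
            yield line
```

In practice the same selection is usually done directly on BAM files with an alignment toolkit; the sketch only shows which records the stated thresholds keep.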

Tips and Tricks

The genomic regions file must be in the form of a BED file with no header and three mandatory columns: chrom (name of chromosome), chromStart, and chromEnd (the starting and ending positions of the feature in the chromosome). The file also takes 9 additional optional columns, including exon count and size as well as strand. More information can be found here.
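
As a rough pre-upload check, the following Python sketch parses BED text and rejects lines that do not match the layout described above (no header, three mandatory columns, at most 12 columns in total). The names BedRegion and parse_bed are hypothetical, and the platform may apply different or stricter validation.

```python
from typing import List, NamedTuple


class BedRegion(NamedTuple):
    chrom: str
    start: int  # chromStart (0-based)
    end: int    # chromEnd (not included in the feature)
    name: str   # optional 4th column, "" when absent


def parse_bed(text: str) -> List[BedRegion]:
    """Parse header-less BED text, raising ValueError on malformed lines."""
    regions = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if not 3 <= len(fields) <= 12:
            raise ValueError(
                f"line {lineno}: expected 3 to 12 columns, got {len(fields)}"
            )
        chrom, start_text, end_text = fields[:3]
        if not (start_text.isdigit() and end_text.isdigit()):
            # A header row such as "chrom chromStart chromEnd" ends up here.
            raise ValueError(
                f"line {lineno}: chromStart and chromEnd must be integers"
            )
        start, end = int(start_text), int(end_text)
        if end < start:
            raise ValueError(f"line {lineno}: chromEnd is before chromStart")
        name = fields[3] if len(fields) > 3 else ""
        regions.append(BedRegion(chrom, start, end, name))
    return regions
```

For example, parse_bed("vec\t0\t145\titr5\n") returns a single region named itr5 on a sequence called vec; both names are made up for illustration.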

Inputs

Mandatory

High-Throughput Sequence Data
  • BAMs
    • PacBio CCS BAM - Sequence file from a PacBio machine run in AAV mode, or run through recall adapter and CCS to create consensus sequences
    • PacBio Subreads BAM - Files from a PacBio machine without any preprocessing (not recommended)
  • Data Folder
    • PacBio Run Folder, includes subreads, XML, etc
  • FastQ
    • FastQ files generated from CCS BAM files
Reference Data
  • Construct FastA
    • this sequence will be concatenated with its reverse complement if the sample is self-complementary
  • Plasmid Sequences
    • other sequences used in AAV replication, including plasmids
  • Target Bedfile
    • annotated regions of the construct, including ITR, promoter and CDS regions
    • Must have either:
      • multiple ITR regions, e.g. itr5 and itr3
      • a region called “vector” or “transfer” that spans ITR to ITR
  • Host Genome
    • Mask Genome: the genome can optionally be masked in the region of the gene carried by the construct
  • Name for output VCF (used with Illumina analysis)
  • AAV Serotype
Options
  • Barcode Sequence
  • Illumina Reads
    • Optional folder of fastqs for determining sequence variants using Illumina
  • Alignment options for the PacBio Technology
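
Two of the reference data requirements above can be sketched in Python: the region naming rule for the target BED file, and the concatenation of a self-complementary construct with its reverse complement. This is a hedged illustration. The function names are hypothetical, the case-insensitive "itr" prefix match is an assumption, and the order in which the workflow joins the two strands (here: sequence first, reverse complement second, no spacer) is not stated in this document.

```python
# Illustrative sketch only; matching rules and join order are assumptions.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def has_required_regions(region_names):
    """Check the naming rule: two or more ITR regions, or a spanning region.

    Assumption: ITR regions are recognized by a case-insensitive "itr"
    prefix (e.g. itr5, itr3), and the spanning region is named exactly
    "vector" or "transfer".
    """
    names = [name.lower() for name in region_names]
    itr_count = sum(1 for name in names if name.startswith("itr"))
    has_span = any(name in ("vector", "transfer") for name in names)
    return itr_count >= 2 or has_span


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def self_complementary_reference(seq):
    """Join a construct with its reverse complement (assumed order)."""
    return seq + reverse_complement(seq)
```

A BED file with regions itr5, promoter, cds and itr3 satisfies the rule, as does one with a single region named vector; a file with only itr5 and cds does not.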

Parameters

  • Raw Data (check if not using PacBio HiFi BAM files as the main input)
  • AAV Serotype (AAV2 or skip)

Outputs

Reference Genome and Annotation Files
  • genomefa.tar.gz
  • genome.bed
  • pacBioSOP.tar.gz
  • seq.bed
  • pbsopstats/sopregions.bed
Report
  • SampleID.results.html
Alignment Files
  • pbsopbams/SampleID.bam
  • pbsopbams/SampleID.pbsop.bam
  • Filtered for Regional Analysis
    • bams/SampleID.qual.bam
    • bams/SampleID.qual.bam.bai
Raw Stats Files
  • pbsopstats/SampleID.Rdata
  • pbsopstats/SampleID.alignments.tsv
  • pbsopstats/SampleID.nonmatch_stat.csv.gz
  • pbsopstats/SampleID.per_read.csv
  • pbsopstats/SampleID.readsummary.tsv
  • pbsopstats/SampleID.sequence-error.tsv
  • pbsopstats/SampleID.summary.csv
  • pbsopstats/SampleID_AAV_report.pdf
  • stats/SampleID.reads2regions.bed
  • stats/SampleID.regionposcts.txt
  • featurects/SampleID.bedout.txt
  • featurects/SampleID.bedtools.cov.txt
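
When checking a finished run, it can help to expand the per-sample file names above for a given sample. The Python sketch below does this under two assumptions: that "SampleID" in the listing is a placeholder for the sample name, and that the paths are relative to the run's output folder. The helper names are hypothetical, and this document does not state whether every file is produced in every analysis mode.

```python
# Illustrative sketch only; templates are copied from the Outputs listing.
PER_SAMPLE_TEMPLATES = [
    "{sample}.results.html",
    "pbsopbams/{sample}.bam",
    "pbsopbams/{sample}.pbsop.bam",
    "bams/{sample}.qual.bam",
    "bams/{sample}.qual.bam.bai",
    "pbsopstats/{sample}.Rdata",
    "pbsopstats/{sample}.alignments.tsv",
    "pbsopstats/{sample}.nonmatch_stat.csv.gz",
    "pbsopstats/{sample}.per_read.csv",
    "pbsopstats/{sample}.readsummary.tsv",
    "pbsopstats/{sample}.sequence-error.tsv",
    "pbsopstats/{sample}.summary.csv",
    "pbsopstats/{sample}_AAV_report.pdf",
    "stats/{sample}.reads2regions.bed",
    "stats/{sample}.regionposcts.txt",
    "featurects/{sample}.bedout.txt",
    "featurects/{sample}.bedtools.cov.txt",
]


def expected_outputs(sample_id):
    """Return the listed per-sample output paths for one sample."""
    return [template.format(sample=sample_id) for template in PER_SAMPLE_TEMPLATES]


def missing_outputs(sample_id, present_paths):
    """Return listed paths that are absent from present_paths."""
    present = set(present_paths)
    return [path for path in expected_outputs(sample_id) if path not in present]
```

A non-empty result from missing_outputs is a prompt to look at the run, not proof of failure, since some files may depend on the selected analysis mode.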

Runtime Estimates

Average: 2 hours 18 minutes based on 15 test runs

Workflow Walkthrough

  1. Navigate to the AAV PacBio Quality Control workflow on the Form Bio platform. You can use the Gene Therapy or Candidate Validation filters to help you locate the launcher.
  2. Select the version from the dropdown tab in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click “Run Workflow”.
  3. Launcher Tabs
    1. Set up sequence files for analysis.
      • Choose an analysis mode. PacBio SOP is based on this protocol outline; FormBio Enhanced includes additional host contamination analysis and an interactive report.
      • Choose an alignment computational mode. Sentieon is faster than Open Source.
      • Select the type of sequence input.
      • Provide a file containing the barcode sequence for each sample (this table can be created within the workflow itself).
      • Lastly, provide the directory containing the files to be analyzed (you may have to scroll down).
    2. Configure the AAV design.
      • Select the AAV Serotype (the base genome used for Flip Flop analysis).
      • Indicate the format of the Plasmid, Vector or Construct FastA sequence, then upload the sequence and a file describing the genomic regions.
      • A Packaging FastA file should be uploaded by default. Changing this file is optional.
    3. Select the reference genome and genome annotation version.
    4. Finally, give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.

Results Walkthrough

  1. To view results for your AAV PacBio QC workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
  2. Upon selection, results from your workflow run are summarized in the Results tab. This view may vary based on the type of analysis run. HTML output files can be previewed or opened from here.
  3. Under the All Files tab, you can view the final HTML file, which is nested in the output folder. You may view or download these files. These files can also be found in the File Explorer.

Citations

  1. Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Computational Biology 14, e1005944 (2018).
  2. Travers, K. J., Chin, C.-S., Rank, D. R., Eid, J. S. & Turner, S. W. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Research 38, e159–e159 (2010).
  3. Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
  4. Quinlan, A. R. & Hall, I. M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
  5. Krueger, F., James, F., Ewels, P., Afyounian, E. & Schuster-Boeckler, B. FelixKrueger/TrimGalore: V0.6.7 - DOI via Zenodo. (2021) doi:10.5281/ZENODO.5127899.
  6. Thomer, A. K., Twidale, M. B., Guo, J. & Yoder, M. J. Picard Tools. in Conference on Human Factors in Computing Systems - Proceedings (2016).
  7. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
  8. Benjamin, D. et al. Calling Somatic SNVs and Indels with Mutect2. (2019) doi:10.1101/861054.
  9. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. (2012).
