AAV PacBio Quality Control

Assess the completeness and contamination of PacBio-sequenced adeno-associated virus (AAV) constructs by examining alignment coverage across whole sequences and within specific regions, including the promoter and CDS.

Version 2.3.0

Use Cases

The user has completed PacBio HiFi sequencing for AAV constructs and wishes to characterize the quality of the sequencing results.
The user has completed Illumina sequencing and wishes to detect variants in the output data.

Summary

This quality control workflow characterizes PacBio adeno-associated virus (AAV) products by examining alignment coverage across sequence regions of interest. The user provides either BAM files from a PacBio sequencer run in AAV mode or the raw PacBio run data folder as a tar.gz archive that includes subreads and XML data. The user may optionally provide Illumina sequencing data for variant detection. For each run analyzed, the user receives a report of the alignment statistics.

Methods

If masking is selected, the vector sequence and the packaging plasmid sequences are used to mask the human genome using MUMmer [1]. If the input data are PacBio uBAMs that have not been run in AAV mode, consensus reads are created using circular consensus sequencing (CCS) [2]. Reads are aligned to the reference sequences using Minimap2 [3]. A custom report of alignment statistics is generated using a workflow developed at PacBio. The resulting alignments are filtered for quality, keeping only primary alignments with mapping quality scores greater than 10. Counts and lengths of alignments to regions of interest are determined from the alignment files using Bedtools [4].

If Illumina data are provided, reads are trimmed using TrimGalore [5] to remove low-quality read ends (quality < 25) and to discard reads shorter than 35 bp. Trimmed reads are aligned to a reference genome using Minimap2. Duplicate reads can optionally be marked using Picard MarkDuplicates [6]. BAMs from the same sample generated by multiple runs are merged using Samtools [7]. Replication errors can be detected using MuTect2 [8] and Freebayes [9]. Finally, a report is generated with relevant quality metrics.
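The quality filter described above can be illustrated with a minimal Python sketch that works on plain SAM text lines. It is not the workflow's own implementation. The function names `keep_alignment` and `filter_sam` are hypothetical, and the sketch assumes that "primary" means a record that is neither unmapped, secondary, nor supplementary.

```python
# SAM flag bits, as defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def keep_alignment(sam_line, min_mapq=10):
    """Return True for a primary alignment with MAPQ strictly above min_mapq."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: bitwise FLAG
    mapq = int(fields[4])  # column 5: mapping quality
    # Assumption: "primary" excludes unmapped, secondary and supplementary records.
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    return mapq > min_mapq


def filter_sam(lines, min_mapq=10):
    """Yield header lines unchanged plus the alignment lines that pass the filter."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line, min_mapq):
            yield line


# Made-up records for illustration only.
example = [
    "@HD\tVN:1.6",
    "read1\t0\tvector\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",    # primary, MAPQ 60
    "read2\t256\tvector\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",  # secondary
    "read3\t0\tvector\t1\t10\t10M\t*\t0\t0\tACGTACGTAC\t*",    # MAPQ not above 10
    "read4\t2048\tvector\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\t*", # supplementary
]
kept = list(filter_sam(example))
```

On real BAM files the same selection is usually made with an alignment toolkit rather than by parsing text, but the flag and MAPQ logic is the same.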

Tips and Tricks

The genomic regions file must be in the form of a BED file with no header and three mandatory columns: chrom (name of chromosome), chromStart, and chromEnd (the starting and ending positions of the feature in the chromosome). The file also takes 9 additional optional columns, including exon count and size as well as strand. More information can be found here.
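Before uploading a genomic regions file, it can help to check it against the rules above. The following is a minimal Python sketch, not part of the workflow. The function names `parse_bed_line` and `validate_bed` are hypothetical, and the sketch assumes tab-separated fields and the standard BED convention of a 0-based chromStart that does not exceed chromEnd.

```python
def parse_bed_line(line):
    """Parse one BED line into (chrom, start, end, optional_fields)."""
    fields = line.rstrip("\n").split("\t")  # assumption: tab-separated
    # Three mandatory columns plus up to nine optional ones.
    if not 3 <= len(fields) <= 12:
        raise ValueError(f"expected 3 to 12 columns, found {len(fields)}")
    chrom = fields[0]
    # int() raises ValueError on a header row such as "chrom chromStart chromEnd".
    start, end = int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError(f"invalid interval {start}-{end}")
    return chrom, start, end, fields[3:]


def validate_bed(lines):
    """Return parsed records, raising ValueError with the offending line number."""
    records = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            records.append(parse_bed_line(line))
        except ValueError as err:
            raise ValueError(f"line {number}: {err}") from err
    return records


# Made-up regions for illustration only.
records = validate_bed([
    "construct\t0\t145\titr5\n",
    "construct\t145\t800\tpromoter\n",
])
```

A file that starts with a header row fails at line 1, because the start and end columns are not integers.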

Inputs

Mandatory

High-Throughput Sequence Data

BAMs

PacBio CCS BAM - Sequence file from a PacBio machine run in AAV mode, or one run through recall adapter and CCS to create consensus sequences
PacBio Subreads BAM - Files from a PacBio machine without any preprocessing (not recommended)

Data Folder

PacBio Run Folder, which includes subreads, XML, etc.

FastQ

FastQ files generated from CCS BAM files

Reference Data

Construct FastA

This sequence will be concatenated with its reverse complement if the sample is self-complementary
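As an illustration of the self-complementary case, the following minimal Python sketch appends the reverse complement to a construct sequence. The function names are hypothetical, and the order of concatenation (construct first, reverse complement second) is an assumption; the workflow may build its reference differently.

```python
# Complement table for unambiguous bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_construct_reference(construct, self_complementary):
    """Return the construct, followed by its reverse complement when requested."""
    if self_complementary:
        # Assumption: the reverse complement is appended after the construct.
        return construct + reverse_complement(construct)
    return construct


# Short made-up sequence for illustration only.
single_stranded = build_construct_reference("AACG", self_complementary=False)
self_comp = build_construct_reference("AACG", self_complementary=True)
```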

Plasmid Sequences

Other sequences used in AAV replication, including plasmids

Target Bedfile

Annotated regions of the construct, including ITR, promoter and CDS regions
Must have either:

multiple ITR regions, i.e. itr5 and itr3
a region called “vector” or “transfer” that spans ITR to ITR
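The naming requirement above can be checked before launching a run. The following minimal Python sketch makes several assumptions that the documentation does not confirm: region names sit in the fourth BED column, matching is case-insensitive, and ITR regions are recognized by names beginning with "itr". The function names are hypothetical.

```python
def region_names(bed_lines):
    """Collect the fourth-column names from tab-separated BED lines."""
    names = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            names.append(fields[3])
    return names


def has_required_regions(names):
    """True if there are at least two ITR regions or one spanning region."""
    lowered = [name.lower() for name in names]
    # Assumption: ITR regions are named with an "itr" prefix, e.g. itr5, itr3.
    itr_regions = [name for name in lowered if name.startswith("itr")]
    # Assumption: the spanning region is named exactly "vector" or "transfer".
    spanning = [name for name in lowered if name in ("vector", "transfer")]
    return len(itr_regions) >= 2 or len(spanning) >= 1


# Made-up target BED content for illustration only.
target_bed = [
    "construct\t0\t145\titr5\n",
    "construct\t145\t800\tpromoter\n",
    "construct\t800\t2300\tCDS\n",
    "construct\t2300\t2445\titr3\n",
]
ok = has_required_regions(region_names(target_bed))
```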

Host Genome

Mask Genome, the genome can be optionally masked in the region of the gene in the construct

Name for output VCF (used with Illumina analysis)
AAV Serotype

Options

Barcode Sequence
Illumina Reads

Optional folder of FastQ files for determining sequence variants using Illumina

Alignment options for the PacBio Technology

Parameters

Raw Data (check if not using PacBio HiFi BAM files as main input)
AAV Serotype (AAV2 or skip)

Outputs

Reference Genome and Annotation Files

genomefa.tar.gz
genome.bed
pacBioSOP.tar.gz
seq.bed
pbsopstats/sopregions.bed

Report

SampleID.results.html

Alignment Files

pbsopbams/SampleID.bam
pbsopbams/SampleID.pbsop.bam
Filtered for Regional Analysis

bams/SampleID.qual.bam
bams/SampleID.qual.bam.bai

Raw Stats Files

pbsopstats/SampleID.Rdata
pbsopstats/SampleID.alignments.tsv
pbsopstats/SampleID.nonmatch_stat.csv.gz
pbsopstats/SampleID.per_read.csv
pbsopstats/SampleID.readsummary.tsv
pbsopstats/SampleID.sequence-error.tsv
pbsopstats/SampleID.summary.csv
pbsopstats/SampleID_AAV_report.pdf
stats/SampleID.reads2regions.bed
stats/SampleID.regionposcts.txt
featurects/SampleID.bedout.txt
featurects/SampleID.bedtools.cov.txt

Runtime Estimates

Average: 2 hours 18 minutes based on 15 test runs

Workflow Walkthrough

Navigate to the AAV PacBio Quality Control workflow on the Form Bio platform. You can use the Gene Therapy or Candidate Validation filters to help you locate the launcher.

Select the version from the dropdown tab in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click “Run Workflow”.

Launcher Tabs

Set up sequence files for analysis.

Choose an analysis mode. PacBio SOP is based on this protocol outline; FormBio Enhanced includes additional host contamination analysis and an interactive report.
Choose an alignment computational mode. Sentieon is faster than Open Source.
Select the type of sequence input.
Provide a file containing the barcode sequence for each sample (this table can be created within the workflow itself).
Lastly, provide the directory containing the files to be analyzed (may have to scroll down).

Configure the AAV design.

Select the AAV Serotype (the base genome used for Flip Flop analysis).
Indicate the format of the Plasmid, Vector or Construct FastA sequence, then upload the sequence and a file describing the genomic regions.
A Packaging FastA file is supplied by default. Changing this file is optional.

Select the reference genome and genome annotation version.

Finally, give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.

Results Walkthrough

To view results for your AAV PacBio QC workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.

Upon selection, results from your workflow run are summarized in the Results tab. This view may vary based on the type of analysis run. HTML output files can be previewed or opened from here.

Under the All Files tab, you can view the final HTML file, which is nested in the output folder. You may view or download these files. These files can also be found in the File Explorer.

Citations

[1] Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Computational Biology 14, e1005944 (2018).
[2] Travers, K. J., Chin, C.-S., Rank, D. R., Eid, J. S. & Turner, S. W. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Research 38, e159 (2010).
[3] Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
[4] Quinlan, A. R. & Hall, I. M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
[5] Krueger, F., James, F., Ewels, P., Afyounian, E. & Schuster-Boeckler, B. FelixKrueger/TrimGalore: V0.6.7 - DOI via Zenodo. (2021) doi:10.5281/ZENODO.5127899.
[6] Thomer, A. K., Twidale, M. B., Guo, J. & Yoder, M. J. Picard Tools. in Conference on Human Factors in Computing Systems - Proceedings (2016).
[7] Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
[8] Benjamin, D. et al. Calling Somatic SNVs and Indels with Mutect2. (2019) doi:10.1101/861054.
[9] Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. (2012).