- Chromatin Structure Analysis
- Use Cases
- Summary and Methods
- Inputs
- Output
- Workflow Walkthrough
- Results Walkthrough
- Citations
- Built with
- Epigenomics: DNA Methylation Detection
- Use Cases
- Summary and Methods
- Inputs
- Outputs
- Workflow Walkthrough
- Results Walkthrough
- Citations
- Built with
- Nucleotide Sequence Optimization
- Use Cases
- Summary and Methods
- Inputs
- Outputs
- Runtime Estimates
- Workflow Walkthrough
- Results Walkthrough
- Citations
- Built with
- Expression Analysis in RNASeq
- Use Cases
- Summary and Methods
- Inputs
- Outputs
- Workflow Walkthrough
- Results Walkthrough
- Citations
- Built with
- Single-Cell Analysis
- Use Cases
- Summary
- Methods
- Inputs
- Outputs
- Workflow Walkthrough
- Results Walkthrough
- Citations
- Built with
- Genomic Variant Analysis
- Use Cases
- Summary and Methods
- Inputs
- Outputs
- Workflow Walkthrough
- Results Walkthrough
- Citations
- Built with
Chromatin Structure Analysis
Analyze data to determine chromatin accessibility and protein-DNA interactions.
Version 1.0.3
Use Cases
This workflow analyzes sequencing data to determine chromatin accessibility and protein-DNA interactions.
Summary and Methods
ChIP-seq, ATAC-seq, STARR-seq, Cut and Tag, and Cut and Run are all techniques used in molecular biology to analyze protein-DNA interactions or chromatin accessibility. While they share some similarities, they differ in their mechanisms, applications, and data outputs. Click the toggles below to learn more.
- ChIP-seq: A method to identify genome-wide protein-DNA interactions, particularly the binding sites of transcription factors, histones, and other chromatin-associated proteins. It involves cross-linking of proteins to DNA, fragmentation of DNA, immunoprecipitation of the protein of interest, sequencing, and mapping the resulting short reads to the reference genome to determine the binding locations. ChIP-seq provides information on the localization, abundance, and distribution of DNA-bound proteins, and their functional significance.
- ATAC-seq: A method to measure chromatin accessibility, i.e., the regions of DNA that are accessible to enzymatic cleavage, and hence to transcription factors and other DNA-binding proteins. ATAC-seq uses a hyperactive Tn5 transposase enzyme that inserts sequencing adapters into open chromatin regions, which are then PCR amplified, sequenced, and mapped to the reference genome. ATAC-seq provides high-resolution information on nucleosome positioning and regulatory elements, with low input requirements and fast processing time.
- STARR-seq: A high-throughput sequencing technique used to identify and measure the activity of enhancers and other regulatory elements in the genome. Unlike other methods that rely on reporter genes or artificial constructs, STARR-seq uses endogenous genomic DNA to capture and amplify the transcriptional activity of enhancers, promoters, and other cis-regulatory sequences.
- Cut and Tag: A method that combines the benefits of ChIP-seq and ATAC-seq. An antibody against the protein of interest tethers a protein A-Tn5 transposase fusion to the bound sites, where the transposase cleaves the DNA and inserts sequencing adapters in a single step. The tagged fragments are then PCR amplified, sequenced, and mapped to the reference genome. Cut and Tag is faster and more sensitive than ChIP-seq and has lower background noise than ATAC-seq, making it suitable for detecting weak or low-abundance protein-DNA interactions.
- Cut and Run: A method that uses a fusion protein of protein A/G and micrococcal nuclease to target and cleave DNA at the protein-DNA interface. The released DNA fragments are then tagged with sequencing adapters, amplified, sequenced, and mapped to the reference genome. Cut and Run is faster, more sensitive, and requires fewer cells than ChIP-seq, as it eliminates the need for cross-linking and immunoprecipitation steps. It is also less prone to nonspecific binding and DNA damage and can be performed on fresh or fixed cells and tissues.
The Chromatin Structure Analysis workflow can be used to process all of the above experiments. Click the toggles below to learn more about each analysis type within this workflow.
ChIP-Seq
Summary
This workflow is designed to help the user identify protein-DNA interactions from ChIP-Seq data. The user will provide a directory of files for analysis, and a file relating sample IDs to runs. The user will receive as output a summary of genome-wide protein-DNA interactions, including binding sites for transcription factors, histones, and other chromatin-associated proteins.
Methods
Reads are trimmed using TrimGalore to remove low-quality ends (quality < 25) and to discard reads shorter than 35 bp. This workflow can be run with native open-source tools (NOST) or with Parabricks. With NOST, trimmed reads are aligned to a reference genome using BWA-MEM or Minimap2. Duplicate reads can optionally be marked using Picard MarkDuplicates. BAMs from the same sample generated by multiple runs are merged using Samtools. Alignment quality is assessed using FastQC, Samtools, and Bedtools. With Parabricks, trimmed reads are aligned, duplicate reads are marked, and alignment quality is assessed using fq2bam. Quality metrics are summarized with MultiQC.
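The trimming rule described above can be illustrated with a small sketch. This is a simplified, hypothetical re-implementation of 3'-end quality trimming using a running-sum rule, assuming Phred+33 encoded qualities; it is not TrimGalore's own code, and it ignores adapter removal and paired-end handling. The function names `trim_3prime` and `passes_length_filter` are made up for this example.

```python
def trim_3prime(seq, qual, cutoff=25, phred_offset=33):
    """Trim low-quality bases from the 3' end of a read.

    Walk from the 3' end accumulating (quality - cutoff) and cut at the
    position where the running sum is lowest. Stop as soon as the sum
    becomes positive, i.e. once good bases outweigh the bad tail.
    """
    running = 0
    lowest = 0
    cut = len(seq)
    for i in range(len(seq) - 1, -1, -1):
        running += (ord(qual[i]) - phred_offset) - cutoff
        if running > 0:
            break
        if running < lowest:
            lowest = running
            cut = i
    return seq[:cut], qual[:cut]


def passes_length_filter(seq, min_length=35):
    """Keep only reads that are still at least min_length bases long."""
    return len(seq) >= min_length


# A 40 bp read whose last four bases have quality 2 ('#'), the rest 40 ('I').
trimmed_seq, trimmed_qual = trim_3prime("A" * 40, "I" * 36 + "#" * 4)
```

In this example the four low-quality bases are removed and the 36 bp remainder passes the 35 bp length filter.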
ChIP-Seq users can pick a workflow that uses MACS3 or DROMPAplus. In the MACS3 workflow, peaks are detected using MACS3 and alignments are converted to bigWig files. The deepTools computeMatrix tool calculates scores per genome region and prepares an intermediate file that is used with plotHeatmap and plotProfile. Typically, the genome regions are genes, but any other regions defined in a BED file can be used. computeMatrix accepts multiple score files (bigWig format) and multiple region files (BED format). It can also be used to filter and sort regions according to their score. In the DROMPAplus workflow, alignments are converted to bigWig, and peaks are detected using DROMPAplus.
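To give a sense of the kind of matrix that a per-region scoring step produces, here is a minimal pure-Python sketch. It is not the deepTools implementation: the signal is held in a plain dictionary rather than a bigWig file, and the names `region_bin_means` and `sort_regions_by_mean` are hypothetical.

```python
def region_bin_means(signal, regions, bin_size=10):
    """Return one row of binned mean scores per region.

    signal  : dict mapping chromosome -> list of per-base scores
    regions : list of (chrom, start, end), 0-based half-open as in BED
    """
    matrix = []
    for chrom, start, end in regions:
        track = signal[chrom]
        row = []
        for bin_start in range(start, end, bin_size):
            window = track[bin_start:min(bin_start + bin_size, end)]
            row.append(sum(window) / len(window) if window else 0.0)
        matrix.append(row)
    return matrix


def sort_regions_by_mean(regions, matrix):
    """Order regions from highest to lowest mean score, as for a heatmap."""
    def row_mean(row):
        return sum(row) / len(row) if row else 0.0

    order = sorted(range(len(regions)),
                   key=lambda i: row_mean(matrix[i]), reverse=True)
    return [regions[i] for i in order]


signal = {"chr1": [0] * 10 + [2] * 10 + [4] * 10}
regions = [("chr1", 0, 20), ("chr1", 10, 30)]
matrix = region_bin_means(signal, regions)
ranked = sort_regions_by_mean(regions, matrix)
```

Each row of `matrix` corresponds to one region and each column to one bin, which is the shape a heatmap or profile plot consumes.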
ATAC-Seq
Summary
This workflow is designed to help the user measure chromatin accessibility from ATAC-seq data. The user will provide a directory of files for analysis, and a file relating sample IDs to runs. The user will receive as output a summary of chromatin accessibility and nucleosome positioning in DNA.
Methods
Reads are trimmed using TrimGalore to remove low-quality ends (quality < 25) and to discard reads shorter than 35 bp. This workflow can be run with native open-source tools (NOST) or with Parabricks. With NOST, trimmed reads are aligned to a reference genome using BWA-MEM or Minimap2. Duplicate reads can optionally be marked using Picard MarkDuplicates. BAMs from the same sample generated by multiple runs are merged using Samtools. Alignment quality is assessed using FastQC, Samtools, and Bedtools. With Parabricks, trimmed reads are aligned, duplicate reads are marked, and alignment quality is assessed using fq2bam. Quality metrics are summarized with MultiQC.
Peaks are called with MACS3, and alignments are converted to bigWig files. The deepTools computeMatrix tool calculates scores per genome region and prepares an intermediate file that is used with plotHeatmap and plotProfile. Typically, the genome regions are genes, but any other regions defined in a BED file can be used. computeMatrix accepts multiple score files (bigWig format) and multiple region files (BED format). It can also be used to filter and sort regions according to their score.
STARR-Seq
Summary
This workflow is designed to help the user identify the presence and activity of regulatory elements in the genome. The user will provide input sequencing files, and receive as output an analysis of regulatory elements.
Methods
Reads are trimmed using TrimGalore to remove low-quality ends (quality < 25) and to discard reads shorter than 35 bp. This workflow can be run with native open-source tools (NOST) or with Parabricks. With NOST, trimmed reads are aligned to a reference genome using BWA-MEM or Minimap2. Duplicate reads can optionally be marked using Picard MarkDuplicates. BAMs from the same sample generated by multiple runs are merged using Samtools. Alignment quality is assessed using FastQC, Samtools, and Bedtools. With Parabricks, trimmed reads are aligned, duplicate reads are marked, and alignment quality is assessed using fq2bam. Quality metrics are summarized with MultiQC. Peaks are called directly from the alignments using STARRPeaker.
Cut and Tag
Summary
This workflow is designed to help the user analyze protein-DNA interactions from Cut and Tag data. The user will provide as input a directory of files for analysis and will receive as output information on all protein-DNA interactions, including weak or low-abundance interactions.
Methods
Reads are trimmed using TrimGalore to remove low-quality ends (quality < 25) and to discard reads shorter than 35 bp. This workflow can be run with native open-source tools (NOST) or with Parabricks. With NOST, trimmed reads are aligned to a reference genome using BWA-MEM or Minimap2. Duplicate reads can optionally be marked using Picard MarkDuplicates. BAMs from the same sample generated by multiple runs are merged using Samtools. Alignment quality is assessed using FastQC, Samtools, and Bedtools. With Parabricks, trimmed reads are aligned, duplicate reads are marked, and alignment quality is assessed using fq2bam. Quality metrics are summarized with MultiQC.
Peaks are called with MACS3, and alignments are converted to bigWig files. The deepTools computeMatrix tool calculates scores per genome region and prepares an intermediate file that is used with plotHeatmap and plotProfile. Typically, the genome regions are genes, but any other regions defined in a BED file can be used. computeMatrix accepts multiple score files (bigWig format) and multiple region files (BED format). It can also be used to filter and sort regions according to their score.
Cut and Run
Summary
This workflow is designed to help the user determine protein-DNA interactions from Cut and Run data. The user will provide a directory of files for analysis and will receive information on all protein-DNA interactions.
Methods
Reads are trimmed using TrimGalore to remove low-quality ends (quality < 25) and to discard reads shorter than 35 bp. This workflow can be run with native open-source tools (NOST) or with Parabricks. With NOST, trimmed reads are aligned to a reference genome using BWA-MEM or Minimap2. Duplicate reads can optionally be marked using Picard MarkDuplicates. BAMs from the same sample generated by multiple runs are merged using Samtools. Alignment quality is assessed using FastQC, Samtools, and Bedtools. With Parabricks, trimmed reads are aligned, duplicate reads are marked, and alignment quality is assessed using fq2bam. Quality metrics are summarized with MultiQC.
Peaks are called with MACS3, and alignments are converted to bigWig files. The deepTools computeMatrix tool calculates scores per genome region and prepares an intermediate file that is used with plotHeatmap and plotProfile. Typically, the genome regions are genes, but any other regions defined in a BED file can be used. computeMatrix accepts multiple score files (bigWig format) and multiple region files (BED format). It can also be used to filter and sort regions according to their score.
Inputs
- Run Name: This is a unique name for each run of pipelines in your project
- Organism: Reference Genome used for alignment
- Reference Genome Annotation: Annotation that should be used for determining gene and transcript counts.
- Type of Chromatin Sequencing: This workflow can analyze ATAC-Seq, ChIP-Seq, Cut and Run, Cut and Tag, and STARR-Seq data, so the experiment type must be specified as an input
- Input Folder: This is the folder that contains all of the fastq files that will be used in this analysis
- Sample Description File
- This file matches the sequence files to samples; sequence data from multiple runs will be merged if they have the same SampleID
- RunID should be a part of the fastq file names.
- SampleGroup is necessary for statistical analysis; there must be at least 2 samples per group.
- ControlGroup is necessary for statistical analysis; it names the control to compare against your sample. If there is no control, populate this field with the same text as the SampleID. If the row is itself a control, its SampleID and control ID should contain the same text.
- File Format
RunID | SampleID | ControlID |
SRR994739 | SRR994739 | SRR994740 |
SRR994740 | SRR994740 | SRR994740 |
SRR994741 | SRR994741 | SRR994740 |
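The relationships encoded in the table above can be checked before launching a run. The sketch below assumes the sample description file is stored as tab-delimited text with the three column names shown; the actual on-disk format used by the platform may differ, and `read_sample_sheet` and `missing_controls` are hypothetical helper names.

```python
import csv
import io
from collections import defaultdict


def read_sample_sheet(text, delimiter="\t"):
    """Group RunIDs by SampleID and record each sample's control."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    runs_by_sample = defaultdict(list)
    control_of = {}
    for row in reader:
        sample = row["SampleID"].strip()
        runs_by_sample[sample].append(row["RunID"].strip())
        control_of[sample] = row["ControlID"].strip()
    return dict(runs_by_sample), control_of


def missing_controls(runs_by_sample, control_of):
    """Return control IDs that never appear as a SampleID."""
    return sorted({c for c in control_of.values() if c not in runs_by_sample})


# Same rows as the example table, written as tab-delimited text (assumed format).
EXAMPLE = (
    "RunID\tSampleID\tControlID\n"
    "SRR994739\tSRR994739\tSRR994740\n"
    "SRR994740\tSRR994740\tSRR994740\n"
    "SRR994741\tSRR994741\tSRR994740\n"
)

runs_by_sample, control_of = read_sample_sheet(EXAMPLE)
problems = missing_controls(runs_by_sample, control_of)
```

Runs that share a SampleID end up in the same list, mirroring how sequence data from multiple runs are merged per sample.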
Algorithm Parameters
Algorithms
- true
- false
Sequence Data Type
Option | Meaning |
sr | Short single-end reads without splicing (-k21 -w11 --sr --frag=yes -A2 -B8 -O12,32 -E2,1 -r100 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g100 -2K50m --heap-sort=yes --secondary=no). This is the default mode. |
map-ont | Align noisy long reads of ~10% error rate to a reference genome. |
map-hifi | Align PacBio high-fidelity (HiFi) reads to a reference genome (-k19 -w19 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200). |
map-pb | Align older PacBio continuous long (CLR) reads to a reference genome (-Hk19). |
Output
- Merged Sorted BAM Files per Sample
- bams/SampleID.bam
- bams/SampleID.bam.bai
- MultiQC HTML
- multiqc_data/multiqc.log
- multiqc_data/multiqc_data.json
- multiqc_data/multiqc_fastqc.txt
- multiqc_data/multiqc_general_stats.txt
- multiqc_data/multiqc_picard_dups.txt
- multiqc_data/multiqc_samtools_flagstat.txt
- multiqc_data/multiqc_samtools_stats.txt
- multiqc_data/multiqc_sources.txt
- multiqc_report.html
- NAME_peaks.xls is a tabular file that contains information about called peaks. You can open it in Excel and sort or filter using Excel functions. Note: coordinates in XLS format are 1-based, which differs from BED format. When --broad is enabled for broad peak calling, the pileup, p-value, q-value, and fold change in the XLS file are the mean values across the entire peak region, since peak summits are not called in broad peak calling mode. The information includes:
- chromosome name
- start position of peak
- end position of peak
- length of peak region
- absolute peak summit position
- pileup height at peak summit
- -log10(pvalue) for the peak summit (e.g. if pvalue = 1e-10, then this value should be 10)
- fold enrichment for this peak summit against a random Poisson distribution with local lambda
- -log10(qvalue) at peak summit
- NAME_peaks.narrowPeak is a BED6+4 format file that contains the peak locations together with peak summit, p-value, and q-value. The file can be loaded directly to the UCSC genome browser. Remove the beginning track line if you want to analyze it with other tools. Definitions of some specific columns are:
- 5th: integer score for display. This is calculated as int(-10*log10pvalue) or int(-10*log10qvalue), depending on whether the p-value or the q-value is used as the score cutoff. Please note that currently this value might be out of the [0-1000] range defined in the UCSC ENCODE narrowPeak format. You can let the value saturate at 1000 (i.e. p/q-value = 10^-100) by using the following one-line awk command:
awk -v OFS="\t" '{$5=$5>1000?1000:$5} {print}' NAME_peaks.narrowPeak
- 7th: fold-change at peak summit
- 8th: -log10pvalue at peak summit
- 9th: -log10qvalue at peak summit
- 10th: relative summit position to peak start
- NAME_summits.bed is in BED format and contains the peak summit location for every peak. The 5th column in this file is the same as in the narrowPeak file. If you want to find the motifs at the binding sites, this file is recommended. The file can be loaded directly to the UCSC genome browser. Remove the beginning track line if you want to analyze it with other tools.
- NAME_peaks.broadPeak is in BED6+3 format, which is similar to the narrowPeak file except that it lacks the 10th column for annotating peak summits. This file and the gappedPeak file are only available when --broad is enabled. Since peak summits are not called in broad peak calling mode, the values in the 5th and 7th-9th columns are the mean values across all positions in the peak region. Refer to narrowPeak if you want to fix the value issue in the 5th column.
- NAME_peaks.gappedPeak is in BED12+3 format and contains both the broad region and narrow peaks. The 5th column is the score for showing grey levels on the UCSC browser, as in narrowPeak. The 7th column is the start of the first narrow peak in the region, and the 8th column is the end. The 9th column should be the RGB color key; it is kept at 0 here to use the default color, so change it if you want. The 10th column gives the number of blocks, including the starting 1 bp and ending 1 bp of broad regions. The 11th column shows the length of each block and the 12th the start of each block. 13th: fold-change, 14th: -log10pvalue, 15th: -log10qvalue. The file can be loaded directly to the UCSC genome browser. Refer to narrowPeak if you want to fix the value issue in the 5th column.
- NAME_model.r is an R script that you can use to produce a PDF image of the model based on your data. Run it with: ‘$ Rscript NAME_model.r’. A PDF file NAME_model.pdf will then be generated in your current directory. Note that R is required to draw this figure.
- The NAME_treat_pileup.bdg and NAME_control_lambda.bdg files are in bedGraph format, which can be imported to the UCSC genome browser or converted into even smaller bigWig files. NAME_treat_pileup.bdg contains the pileup signals (normalized according to the --scale-to option) from the ChIP/treatment sample. NAME_control_lambda.bdg contains local biases estimated for each genomic location from the control sample, or from the treatment sample when the control sample is absent. The subcommand ‘bdgcmp’ can be used to compare these two files and make a bedGraph file of scores such as p-value, q-value, log-likelihood, and log fold changes.
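The narrowPeak column layout described above can also be read programmatically. The following is a minimal sketch rather than an official parser: it assumes tab-separated fields in the order listed, caps the display score at 1000 in the same way as the awk one-liner, and derives the absolute summit position from the peak start plus the 10th column. All function names are hypothetical.

```python
def parse_narrowpeak_line(line, max_score=1000):
    """Parse one narrowPeak record into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    start = int(fields[1])
    neg_log10_p = float(fields[7])
    return {
        "chrom": fields[0],
        "start": start,                      # 0-based, as in BED
        "end": int(fields[2]),
        "name": fields[3],
        "score": min(int(fields[4]), max_score),
        "fold_change": float(fields[6]),
        "neg_log10_pvalue": neg_log10_p,
        "neg_log10_qvalue": float(fields[8]),
        "summit": start + int(fields[9]),    # absolute 0-based summit
        "pvalue": 10 ** -neg_log10_p,
    }


def read_narrowpeak(lines):
    """Parse all records, skipping blank lines and any leading track line."""
    return [parse_narrowpeak_line(l) for l in lines
            if l.strip() and not l.startswith("track")]


def bed_to_one_based(start, end):
    """Convert a 0-based half-open BED interval to 1-based inclusive."""
    return start + 1, end


peaks = read_narrowpeak([
    "track type=narrowPeak name=example\n",
    "chr1\t100\t200\tpeak_1\t1500\t.\t8.5\t10.0\t7.2\t40\n",
])
```

The `bed_to_one_based` helper reflects the note that XLS coordinates are 1-based while BED-style files are 0-based.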
- Peak List (Bed)
- Visualization (pdf/png)
- Enrichment pvalue data (WIG, bedGraph, bigWig)
Workflow Walkthrough
- Navigate to the Chromatin Structure Analysis workflow. You can use the search tool at the top-right corner or find the launch card using the “Functional Genomics” or “Next-Generation Sequencing” filters.
- Select the version from the dropdown menu in the top right corner.
- Click “Run Workflow” in the top right corner.
- Launcher Tabs
- First, choose the analysis you wish to perform. Currently, ChIP-Seq, ATAC-Seq, STARR-Seq, Cut and Run, and Cut and Tag are supported. Next, select a reference genome. Last, provide the directory of files for analysis and an NGS sample attributes file relating run IDs to sample IDs.
- Advanced Configuration - these choices may look different depending on what analysis you are performing.
- Give your workflow a unique name, and review the input files and parameters set for your workflow. When satisfied, click “Run Workflow” to begin your analysis.
Results Walkthrough
- Locate your workflow from the Activity tab and select it.
- Once inside, select the Files tab. In this tab, select the output folder. Within this folder you can view HTML reports for your analysis.
Citations
- Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, (2017).
- Krueger F, Trimgalore (2021), GitHub repository, https://github.com/FelixKrueger/TrimGalore.
- Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. doi:10.1093/bioinformatics/bty191
- Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv. 1-3 (2013).
- Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
- NVIDIA Parabricks: https://www.nvidia.com/en-us/clara/genomics/
- Nakato R., Sakata T., Methods for ChIP-seq analysis: A practical workflow and advanced applications, Methods, 2020.
- Zhang, Y., Liu, T., Meyer, C.A. et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol 9, R137 (2008). https://doi.org/10.1186/gb-2008-9-9-r137
- Evan D Tarbell, Tao Liu, HMMRATAC: a Hidden Markov ModeleR for ATAC-seq, Nucleic Acids Research, Volume 47, Issue 16, 19 September 2019, Page e91, https://doi.org/10.1093/nar/gkz533
Built with
Epigenomics: DNA Methylation Detection
Determine patterns of methylation in bisulfite sequencing applications and determine methylation events and frequencies for ONT-based assemblies.
Version 0.0.4
Use Cases
- Determine patterns of methylation in Bisulfite-Seq applications
- Calculate methylation events and frequencies for ONT-based assemblies
Summary and Methods
This workflow is designed to explore methylation patterns in DNA using either (i) bisulfite sequencing data or (ii) Oxford Nanopore data. The user will select which workflow they wish to run and provide input files containing sequences of interest. The user will receive as output an analysis of the methylation patterns in the files.
Reduced Representation Bisulfite Sequencing (RRBS)
Summary
Reduced Representation Bisulfite Sequencing (RRBS) is a widely used method for profiling DNA methylation. Bismark performs a series of steps to map RRBS reads to a reference genome, remove duplicate reads, and convert the DNA methylation data into a format that can be used for downstream analysis. Bismark also includes options for quality control, filtering, and normalization of the methylation data. Bismark requires a reference genome to map the sequencing reads.
Methods
This analysis was performed with the help of the Epigenomics: DNA Methylation Detection workflow on the Form Bio platform. Reads are trimmed with Trim Galore [1]. Trimmed reads are aligned to a reference genome using Bismark [2] and Bowtie 2 [3]. Reports and summaries are generated using scripts from Bismark.
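The principle behind methylation calling from bisulfite data can be shown with a toy example: bisulfite treatment converts unmethylated cytosines so that they are read as T, while methylated cytosines are still read as C. The sketch below is not Bismark's implementation. It assumes an ungapped forward-strand alignment of a single read, and the function names are hypothetical.

```python
def cytosine_context(reference, pos):
    """Classify a reference cytosine as CpG, CHG, or CHH (H = A, C, or T).

    Returns None when the sequence ends before the context can be decided.
    """
    downstream = reference[pos + 1:pos + 3]
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CpG"
    if len(downstream) < 2:
        return None
    return "CHG" if downstream[1] == "G" else "CHH"


def call_methylation(reference, read, offset=0):
    """Compare a bisulfite read to the reference at every reference C.

    A C in the read means the cytosine was protected (methylated);
    a T means it was converted (unmethylated). Other bases are skipped.
    """
    calls = []
    for i, base in enumerate(read):
        pos = offset + i
        if pos >= len(reference) or reference[pos] != "C":
            continue
        if base == "C":
            state = "methylated"
        elif base == "T":
            state = "unmethylated"
        else:
            continue
        calls.append((pos, cytosine_context(reference, pos), state))
    return calls


calls = call_methylation("ACGTCAGCTT", "ACGTTAGCTT")
```

Here the cytosine at position 4 was read as T and is therefore called unmethylated, while the other two cytosines are called methylated.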
Oxford Nanopore Sequencing Technology (ONT)
Summary
Oxford Nanopore sequencing technology (ONT) is a newer sequencing technology that offers advantages such as long reads and real-time sequencing. This data type can be used to detect methylation sites in DNA because methylated residues produce a different electrical current signal than unmethylated residues as they pass through the pore. DeepSignal2 uses deep neural networks to detect DNA modifications, including DNA methylation and hydroxymethylation, directly from the raw nanopore sequencing data. It can identify modifications at single-nucleotide resolution and can be used to map modifications across the entire genome.
Methods
This analysis was performed with the help of the Epigenomics: DNA Methylation Detection workflow on the Form Bio platform. Raw Fast5 files are first annotated with the input basecalls and then aligned to the genome using Tombo [4]. Once aligned, methylation events and frequencies are determined using DeepSignal2 [5].
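The step from per-read methylation events to per-site frequencies amounts to counting. The sketch below assumes each event is a simple tuple of chromosome, position, strand, and a methylated flag; this layout is hypothetical and is not the DeepSignal2 file format.

```python
from collections import defaultdict


def methylation_frequency(calls):
    """Aggregate per-read calls into per-site methylation frequencies.

    calls: iterable of (chrom, pos, strand, is_methylated) tuples.
    Returns a dict keyed by (chrom, pos, strand).
    """
    counts = defaultdict(lambda: [0, 0])  # [methylated, total]
    for chrom, pos, strand, is_methylated in calls:
        site = counts[(chrom, pos, strand)]
        if is_methylated:
            site[0] += 1
        site[1] += 1
    return {
        key: {"methylated": m, "coverage": n, "frequency": m / n}
        for key, (m, n) in counts.items()
    }


freqs = methylation_frequency([
    ("chr1", 100, "+", True),
    ("chr1", 100, "+", True),
    ("chr1", 100, "+", False),
    ("chr1", 250, "-", False),
])
```

A site covered by three reads with two methylated calls has a frequency of 2/3; a site with a single unmethylated call has a frequency of 0.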
Inputs
Bismark Methylation Analysis
- Input Directory of FastQ
- Reference Genome
DeepSignal2
- Input Directory of Fast5s
- Assembly FastA File
- Basecall FastQ File
Outputs
Bismark Methylation Analysis
- Bismark Alignment
- Bismark Report
- Bismark Summary
- BedGraph of Methylation
DeepSignal2
- DeepSignal2 Methylation Events
- DeepSignal2 Methylation Frequency
Workflow Walkthrough
- Navigate to the Epigenomics: DNA Methylation Detection launcher card. You can use the search bar at the top right corner or use the Functional Genomics or Next Generation Sequencing tags to find the workflow card.
- Select the version from the dropdown menu. When ready to begin analysis, select “Run Workflow” at the top-right corner.
- This workflow currently supports two analysis functions: Illumina Reduced Representation Bisulfite Sequencing with Bismark, and Oxford Nanopore Technology analysis with DeepSignal2. Select the analysis you wish to run from the dropdown menu. This walkthrough uses Bismark. Also provide the directory containing the files to be analyzed as well as a file relating RunIDs and SampleIDs (this table can be created within the workflow itself).
- Select a reference genome from the dropdown menu.
- Give your workflow run a unique name, and review the inputs and parameters for the workflow. When you’re satisfied, press “Run Workflow” to begin analysis.
Results Walkthrough
- Select your workflow from the Activity tab.
- On the run page, select the Files tab to view all file outputs. You may also view these files in the pipeline-outputs tab of the File Explorer.
Citations
- Krueger F, Trimgalore (2021), GitHub repository, https://github.com/FelixKrueger/TrimGalore.
- Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011 Jun 1;27(11):1571-2. doi: 10.1093/bioinformatics/btr167. Epub 2011 Apr 14. PMID: 21493656; PMCID: PMC3102221.
- Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357–359. https://doi.org/10.1038/nmeth.1923
- Stoiber, M.H. et al. De novo Identification of DNA Modifications Enabled by Genome-Guided Nanopore Signal Processing. bioRxiv (2016). http://biorxiv.org/content/early/2017/04/10/094672
- Jianxin Wang, Feng Luo, Peng Ni (2020), GitHub repository, https://github.com/PengNi/deepsignal2
Built with
Nucleotide Sequence Optimization
Perform various nucleotide optimizations and gather information on a nucleotide sequence of interest. Currently, three functions are supported: Gene Sequence Optimization to optimize codons and folding energies, Predict RNA/DNA 2D/3D Structure to ascertain the structure of a sequence at a given temperature, and Predict Splice Sites to locate potential splice sites in an optimized nucleotide sequence.
Version 1.0.7
Use Cases
- Perform codon optimization on nucleotide sequences and get information on CpG islands within the sequence
- Determine splice sites in a sequence
- Determine how a nucleotide sequence will fold under given circumstances
Summary and Methods
This workflow is designed to help the user perform various optimizations and gather information on a nucleotide sequence of interest. Currently, three functions are supported: Gene Sequence Optimization, to optimize codons and folding energies, Predict RNA/DNA 2D Structure to ascertain the structure of a sequence at a given temperature, and Predict Splice Sites to locate potential splice sites in an optimized nucleotide sequence. Click the toggles below to learn more about each function.
Gene Sequence Optimization
Summary
This workflow is designed to help the user optimize a nucleotide sequence for codon usage and folding energy. The user will provide a FastA file containing the sequence of interest and a BED file containing protein domains. In lieu of a FastA file, the user may paste the sequence into the provided text box. The user will receive as output files detailing optimal folding structures.
Methods
This analysis was performed using the Nucleotide Sequence Optimization workflow on the Form Bio platform. The workflow takes an RNA or DNA FastA file and an accompanying BED file and runs GeneGA and iCodon, which revise the coding sequence to optimize codon usage and RNA structure. It also analyzes the sequence to report information about CpG islands [5, 6, 7].
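As an illustration of the kind of CpG information that can be derived from a sequence, the sketch below computes GC content and the observed/expected CpG ratio, and applies the commonly used Gardiner-Garden and Frommer thresholds (length of at least 200 bp, GC content above 50%, ratio above 0.6). These thresholds and the function names are assumptions made for this example and are not necessarily what the workflow itself uses.

```python
def cpg_stats(seq):
    """Return GC content and the observed/expected CpG ratio of a sequence."""
    seq = seq.upper().replace("U", "T")
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")
    gc_content = (c + g) / n if n else 0.0
    # Observed CpG count divided by the count expected from base composition.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return {"length": n, "gc_content": gc_content, "cpg_obs_exp": obs_exp}


def looks_like_cpg_island(seq, min_length=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply simple whole-sequence CpG island criteria (assumed thresholds)."""
    stats = cpg_stats(seq)
    return (stats["length"] >= min_length
            and stats["gc_content"] > min_gc
            and stats["cpg_obs_exp"] > min_obs_exp)


stats = cpg_stats("CGCGCGCGCG")
```

A real scan would slide a window along the sequence rather than score it as a whole; the single-window version keeps the arithmetic easy to follow.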
Predict RNA/DNA 2D Structure
Summary
This workflow is designed to help the user predict the 2D structure of a DNA or RNA sequence at a given temperature. The user will provide a FastA file containing a sequence of interest. The user will receive a prediction of the 2D structure of the sequence, as well as information on the presence of other prespecified structures.
Methods
This analysis was performed using the Nucleotide Sequence Optimization workflow on the Form Bio platform. This workflow takes an RNA or DNA FastA file and predicts its 2D structure at a given temperature. Structures can also be predicted for portions of the sequence, and extra analysis can be done to determine whether the structure contains prespecified features such as hairpin loops or branches of certain sizes [1, 2, 3, 4].
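Searching a predicted structure for features such as hairpin loops can be illustrated on dot-bracket notation, where matching parentheses mark base pairs and dots mark unpaired bases. The sketch below is a minimal example with hypothetical function names; it is not the workflow's own structure search and it handles only the plain `(`, `)`, and `.` symbols.

```python
def parse_dot_bracket(structure):
    """Return the sorted list of base pairs (i, j) in a dot-bracket string."""
    stack = []
    pairs = []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            pairs.append((stack.pop(), i))
        elif symbol != ".":
            raise ValueError("unsupported symbol: " + symbol)
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return sorted(pairs)


def hairpin_loops(structure, min_size=0):
    """Return (i, j, loop_size) for pairs that enclose only unpaired bases."""
    loops = []
    for i, j in parse_dot_bracket(structure):
        inner = structure[i + 1:j]
        if inner and set(inner) == {"."} and len(inner) >= min_size:
            loops.append((i, j, len(inner)))
    return loops


loops = hairpin_loops("((..))..(....)")
```

Passing `min_size` restricts the result to loops of at least that many unpaired bases, similar in spirit to a minimum structure size parameter.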
Predict Splice Sites
Summary
This workflow is designed to help the user predict splice sites in a nucleotide sequence. The user will provide as input an optimized nucleotide sequence, such as one obtained after running the Gene Sequence Optimization workflow. The user will receive as output a list of potential splice sites in the sequence.
Methods
This analysis was performed using the Nucleotide Sequence Optimization workflow on the Form Bio platform. This workflow uses Spliceator to predict splice sites in a sequence [8].
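For intuition only, the sketch below lists positions of the canonical GT (donor) and AG (acceptor) dinucleotides in a sequence. This naive scan is a swapped-in simplification and is not how Spliceator works: Spliceator scores candidate sites with convolutional neural networks, whereas a plain dinucleotide scan reports many false positives. The function name is hypothetical.

```python
def canonical_splice_candidates(seq):
    """Return 0-based positions of GT (donor) and AG (acceptor) dinucleotides."""
    seq = seq.upper().replace("U", "T")
    donors = []
    acceptors = []
    for i in range(len(seq) - 1):
        dinucleotide = seq[i:i + 2]
        if dinucleotide == "GT":
            donors.append(i)
        elif dinucleotide == "AG":
            acceptors.append(i)
    return {"donor": donors, "acceptor": acceptors}


candidates = canonical_splice_candidates("AAGGTAAGT")
```

Candidates found this way would still need to be scored by a trained model before being treated as likely splice sites.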
Inputs
- Input Type
- FastA file
- Sequence
- Splice Detection
- Optimization
- Folding
- Tool Specific Inputs
- Optimization
- Genome Region File with CDS Sequence in Sequence Defined
- Folding
- Nucleic Acid Base Type
- Max Number of Predicted Structures
- Start
- Stop
- Temperature
- Max Length
- Sequence Spacing
- Structure Search
- Minimum size of structure search
- Bulge Size
- Asymmetry Size
Outputs
Folding Outputs
- Structure ps file
- Structure jpg image
- Structure dot notation folding
- Structure analysis txt file
Optimization Outputs
- Results file containing free energies, top optimized sequences, MSA, and CpG information
Splice Outputs
- Results file containing locations in the sequence with splice sites.
Runtime Estimates
Average = 42 minutes
Workflow Walkthrough
- Navigate to the Nucleotide Sequence Optimization launcher card. You can use the search bar at the top right corner or use the Functional Genomics or Candidate Validation tags to find the workflow card.
- Select the version from the dropdown menu. When ready to begin analysis, select “Run Workflow” at the top-right corner.
- Launcher Tabs
- Choose a tool/algorithm to perform
- “Optimize CDS for Folding and Codon Optimization” for gene sequence optimization
- “Calculate Folding Energies” for prediction of RNA/DNA 2D Structure
- “Detect Splice-Sites in Optimized Sequence” for prediction of splice sites
- Select the type of sequence input (file or text)
- Provide sequence name and raw sequence (.fasta) or BED files corresponding to the sequence.
- In the next tab, provide the type of codon usage input and provide a reference genome for the workflow run.
- Review workflow parameters and name workflow run.
- Click “Run Workflow”.
Results Walkthrough
- To view results, first locate your run in the Activity tab. Once found, select it to view more information. On this page, you can view information about the workflow status, analysis outputs, files input and output, parameters, and Nextflow logs.
- Select Analysis from the tabs to view the results of the workflow run. You may also click View Full Screen Results to open the results in a new tab.
Citations
- M. Zuker. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 31 (13), 3406-3415, 2003.
- A. Waugh, P. Gendron, R. Altman, J. W. Brown, D. Case, D. Gautheret, S. C. Harvey, N. Leontis, J. Westbrook, E. Westhof, M. Zuker & F. Major. RNAML: A standard syntax for exchanging RNA information. RNA 8 (6), 707-717, 2002.
- M. Zuker & A. B. Jacobson. Using Reliability Information to Annotate RNA Secondary Structures. RNA 4, 669-679, 1998.
- D. H. Mathews, J. Sabina, M. Zuker & D. H. Turner. Expanded Sequence Dependence of Thermodynamic Parameters Improves Prediction of RNA Secondary Structure J. Mol. Biol. 288, 911-940, 1999.
- Lorenz, Ronny and Bernhart, Stephan H. and Höner zu Siederdissen, Christian and Tafer, Hakim and Flamm, Christoph and Stadler, Peter F. and Hofacker, Ivo L. ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6:1 26, 2011, doi:10.1186/1748-7188-6-26
- Li Z, Huang H (2022). GeneGA: Design gene based on both mRNA secondary structure and codon usage bias using Genetic algorithm. R package version 1.46.0, http://www.tbi.univie.ac.at/~ivo/RNA/
- Diez, M., Medina-Muñoz, S.G., Castellano, L.A. et al. iCodon customizes gene expression based on the codon composition. Sci Rep 12, 12126 (2022). https://doi.org/10.1038/s41598-022-15526-7
- Scalzitti, N., Kress, A., Orhand, R. et al. Spliceator: multi-species splice site prediction using convolutional neural networks. BMC Bioinformatics 22, 561 (2021). https://doi.org/10.1186/s12859-021-04471-3
Built with
Expression Analysis in RNASeq
This workflow can be used to quantify gene expression, identify splice variants, and perform differential expression analysis.
Version 1.1.1
Use Cases
- Determine differentially expressed genes between two or more groups of samples (treated vs untreated, knock-out vs wildtype, cell type A vs cell type B)
- Determine differentially expressed transcripts between two or more groups of samples
- Compare the gene expression profiles of samples
Summary and Methods
This workflow is designed to help the user thoroughly analyze RNA sequencing data. Currently, two functions are supported: Full Analysis and Recalculate Statistics. Both functions include the option to specify whether the data include Human Cancer Samples. Click the toggles below to learn more about each function.
Full Analysis
Summary
This workflow is designed to help the user determine differential gene abundances and differential expression between two or more groups of samples. The user will provide as input a folder containing all the read files needed for analysis and a sequencing file relating sample IDs to attributes. The user will receive as output differential gene and transcript abundance analysis files and comparison files between the two or more samples.
Methods
This analysis was performed using the Expression Analysis in RNASeq workflow on the Form Bio platform. Reads are trimmed using TrimGalore [1] to remove low-quality (qual < 25) ends of reads and to remove reads < 35bp. Trimmed reads are aligned to a reference genome using STAR [2] (default) or HISAT2 [3]. Duplicate reads can optionally be marked using Picard MarkDuplicates. BAMs from the same sample generated by multiple runs are merged using Samtools [4]. The abundance of transcripts and genes is assessed using featureCounts to generate raw gene counts [5], StringTie to generate FPKM [6] and Salmon to generate raw transcript counts [7]. Sample comparisons and differential gene/transcript expression analysis are performed using edgeR [8], DESeq2 [9] and IsoformSwitchAnalyzeR [10].
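To make the abundance units mentioned above concrete, here is a minimal sketch of the textbook CPM and FPKM formulas applied to raw counts. The gene names, counts, and lengths are hypothetical, and this is not how the workflow computes its values: StringTie derives FPKM from assembled transcript coverage, and the exact normalization behind countTable.logCPM.txt is not described here.

```python
def cpm(counts):
    """Counts per million: count / library_size * 1e6."""
    library_size = sum(counts.values())
    return {gene: c / library_size * 1e6 for gene, c in counts.items()}


def fpkm(counts, lengths_bp):
    """Fragments per kilobase per million: count * 1e9 / (length_bp * library_size)."""
    library_size = sum(counts.values())
    return {
        gene: c * 1e9 / (lengths_bp[gene] * library_size)
        for gene, c in counts.items()
    }


# Hypothetical raw counts and gene lengths for a single sample.
raw_counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
gene_lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

sample_cpm = cpm(raw_counts)
sample_fpkm = fpkm(raw_counts, gene_lengths)

for gene in raw_counts:
    print(f"{gene}\tCPM={sample_cpm[gene]:.1f}\tFPKM={sample_fpkm[gene]:.1f}")
```

In this toy sample, geneB has three times the raw count of geneA but the same FPKM, because it is also three times longer. That length correction is what distinguishes FPKM from CPM.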
Recalculate Statistics
Summary
This workflow is designed to help the user determine differentially expressed genes using abundance counts generated by previous workflow analysis. The user will provide a folder containing output files from previous analyses. The workflow will perform the statistical analysis again with a different composition of samples and output the results of this analysis.
Methods
Human Cancer
Summary
This workflow is designed to help the user determine differential gene abundances and differential expression between two or more groups of human tumor samples. The user will provide as input a folder containing all read files needed for analysis and a sequencing file relating sample IDs to attributes. The user will receive as output differential gene and transcript abundance analysis files and comparison files between the two or more samples.
Methods
This analysis was performed using the Expression Analysis in RNASeq workflow on the Form Bio platform. Reads are trimmed using TrimGalore [1] to remove low-quality (qual < 25) ends of reads and to remove reads < 35bp. Trimmed reads are aligned to a reference genome using STAR [2] (default) or HISAT2 [3]. Duplicate reads can optionally be marked using Picard MarkDuplicates. BAMs from the same sample generated by multiple runs are merged using Samtools [4]. The abundance of transcripts and genes is assessed using featureCounts to generate raw gene counts [5], StringTie to generate FPKM [6] and Salmon to generate raw transcript counts [7]. Sample comparisons and differential gene/transcript expression analysis are performed using edgeR [8], DESeq2 [9] and IsoformSwitchAnalyzeR [10]. Gene fusion events are detected using STAR-Fusion [11]. Exon-skipping events are detected using RegTools [12].
Inputs
- Run Name: This is a unique name for each run of pipelines in your project
- Organism: Reference Genome used for alignment
- Reference Genome Annotation: Annotation that should be used for determining gene and transcript counts.
- Input Folder: This is the folder that contains all of the fastq files that will be used in this analysis
- File Format
- Sample Description File
- This file matches the sequence files to samples; sequence data from multiple runs will be merged if they have the same SampleID
- RunID should be part of the FASTQ file names.
- SampleGroup is necessary for statistical analysis; there must be at least 2 samples per group
- true
- false
- I = inward
- O = outward
- M = matching
- F = Forward
- R = Reverse
- '' = Single End/Unknown
- S = stranded
- U = unstranded
RunID | SampleID | SampleGroup |
SRR994739 | SAMEA9454349 | Treated |
SRR994740 | SAMEA9454349 | Treated |
SRR994741 | SAMEA9454341 | Untreated |
SRR994742 | SAMEA9454348 | Treated |
SRR994743 | SAMEA9454348 | Treated |
SRR994744 | SAMEA9454342 | Untreated |
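The rules for the sample description file can be checked before launching a run. The sketch below assumes a tab-separated file with the three columns shown in the table above; the delimiter and the helper name `summarize_sample_sheet` are assumptions for illustration and are not part of the platform.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the example table, written as tab-separated text (assumed format).
ROWS = [
    ("RunID", "SampleID", "SampleGroup"),
    ("SRR994739", "SAMEA9454349", "Treated"),
    ("SRR994740", "SAMEA9454349", "Treated"),
    ("SRR994741", "SAMEA9454341", "Untreated"),
    ("SRR994742", "SAMEA9454348", "Treated"),
    ("SRR994743", "SAMEA9454348", "Treated"),
    ("SRR994744", "SAMEA9454342", "Untreated"),
]
SAMPLE_SHEET = "\n".join("\t".join(row) for row in ROWS)


def summarize_sample_sheet(text, min_samples_per_group=2):
    """Group runs by SampleID and flag groups with too few samples."""
    runs_by_sample = defaultdict(list)
    group_of_sample = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        sample = row["SampleID"]
        runs_by_sample[sample].append(row["RunID"])
        # A sample whose runs are merged must belong to exactly one group.
        known_group = group_of_sample.setdefault(sample, row["SampleGroup"])
        if known_group != row["SampleGroup"]:
            raise ValueError(f"{sample} is assigned to more than one SampleGroup")

    samples_by_group = defaultdict(set)
    for sample, group in group_of_sample.items():
        samples_by_group[group].add(sample)

    too_small = sorted(
        group
        for group, samples in samples_by_group.items()
        if len(samples) < min_samples_per_group
    )
    groups = {group: sorted(samples) for group, samples in samples_by_group.items()}
    return dict(runs_by_sample), groups, too_small


runs, groups, too_small = summarize_sample_sheet(SAMPLE_SHEET)
print(runs)
print(groups)
print("groups below the minimum:", too_small)
```

Counting distinct SampleIDs rather than rows matters here: the six rows describe only four samples, because runs that share a SampleID are merged.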
Advanced Parameters
Algorithms
Outputs
- bams/SampleID.bam
- bams/SampleID.bam.bai
- featurects/SampleID.salmon.tar.gz
- featurects/SampleID_stringtie
- featurects/SampleID.fpkm.txt
- featurects/SampleID.cts.txt
- featurects/SampleID.cts.txt.summary
- countTable.fpkm.txt
- countTable.logCPM.txt
- countTable.stats.txt
- countTable.txt
- featurects/SampleID.unique.bw
- featurects/SampleID.all.bw
- multiqc_data/multiqc.log
- multiqc_data/multiqc_data.json
- multiqc_data/multiqc_fastqc.txt
- multiqc_data/multiqc_featureCounts.txt
- multiqc_data/multiqc_general_stats.txt
- multiqc_data/multiqc_samtools_flagstat.txt
- multiqc_data/multiqc_samtools_stats.txt
- multiqc_data/multiqc_sources.txt
- SampleID/SampleID.alnstat.txt
- SampleID/SampleID.flagstat.txt
- SampleID/SampleID_fastqc.html
- SampleID/SampleID_fastqc.zip
- multiqc_report.html
- Group1_Group2.edgeR.txt
- Group1_Group2.gene2path.txt
- Group1_Group2.stringDB.txt
- countTable.mds.txt
- countTable.pca.txt
- countTable.pcapercvar.txt
- countTable.sampleDists.txt
- countTable.dexseq.txt
- gene.trxstats.txt
- splicingEnrichment.txt
- splicingIsoformUsage.txt
- splicingResults.html
- splicingSummary.txt
Workflow Walkthrough
- Navigate to the Expression Analysis in RNASeq launcher card. You can use the search bar at the top right corner, or use the Google DeepOmics, Precision Medicine, Functional Genomics, or Next Generation Sequencing tags to find the workflow card.
- Select the version from the dropdown box in the top right corner. When ready to begin analysis, click “Run Workflow”.
- This workflow currently supports two functions: Full Analysis and Recalculate Statistics. Both functions include the option to specify whether the data include Human Cancer Samples. Checking this option includes gene fusion predictions in the analysis to show alterations causing gene fusion events.
- Select a reference genome and annotation version for the workflow run.
- Name the workflow run, then take a minute to review workflow settings and parameters. When you’re satisfied, click “Run Workflow” at the bottom-left corner.
Let’s look at the Recalculate Statistics options, for repeating statistical analysis performed in previous workflow runs. Select this function from the dropdown box. Also provide the directory containing the files to be analyzed as well as a file relating RunIDs, SampleIDs and sample attributes such as SampleGroup (this table can be created within the workflow itself).
Results Walkthrough
- To view results for your Expression Analysis in RNASeq workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
- After selecting your workflow run, click Open Analysis in the upper right-hand corner to open the RNASeq Analysis Portal in the RNASeq Dashboard (opens as a separate tab) to view an interactive summary of your data. You may also navigate to the Files tab to view and download analysis outputs in the output folder. These folders are also available in the File Explorer.
- Use the RNASeq Analysis Portal to view your data analysis. Navigate the tabs across the top or use the links in the Introduction tab.
Citations
- Krueger, F., James, F., Ewels, P., Afyounian, E. & Schuster-Boeckler, B. FelixKrueger/TrimGalore: V0.6.7 - DOI via Zenodo. (2021) doi:10.5281/ZENODO.5127899.
- Dobin, A. et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England) 29, 15–21 (2013).
- Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology 37, 907–915 (2019).
- Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
- Liao, Y., Smyth, G. K. & Shi, W. FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).
- Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biology 20, (2019).
- Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods 14, 417–419 (2017).
- Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) 26, 139–140 (2010).
- Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, 550 (2014).
- Vitting-Seerup, K. & Sandelin, A. IsoformSwitchAnalyzeR: Analysis of changes in genome-wide patterns of alternative splicing and its functional consequences. Bioinformatics 35, 4469–4471 (2019).
- Haas, B. et al. STAR-Fusion: Fast and Accurate Fusion Transcript Detection from RNA-Seq. bioRxiv 120295 (2017) doi:10.1101/120295.
- Feng, Y.-Y. et al. RegTools: Integrated analysis of genomic and transcriptomic data for discovery of splicing variants in cancer. bioRxiv 436634 (2018) doi:10.1101/436634.
Built with
Single-Cell Analysis
Analyze single-cell RNASeq data using 10X assays.
Version 1.0.0
Use Cases
- Determine clustering, differential expression, and annotation of cells in a single-cell experiment
- Create an H5AD file to allow for visualization using CellxGene
Summary
Single-cell RNA sequencing (scRNA-seq) analysis enables researchers to address a wide range of biological questions related to cellular heterogeneity, gene expression dynamics, and cell state transitions. Single-cell RNASeq analysis can be used to (i) identify and classify distinct cell types within a heterogeneous tissue or organism; (ii) infer cellular lineages and developmental trajectories, providing insights into cell fate decisions and differentiation processes; (iii) reveal transitions between different cellular states, such as quiescent to activated states or stem cell to differentiated states; (iv) identify co-expressed gene modules and the inference of regulatory networks within specific cell types or states; (v) elucidate the molecular basis of diseases by identifying dysregulated genes, cell types, or signaling pathways associated with specific conditions; (vi) study immune cell populations and their responses in various contexts, including infection, cancer, and autoimmune diseases; and (vii) analyze gene expression patterns within intact tissue samples by combining scRNA-seq with spatial transcriptomics techniques.
Methods
The workflow is designed to analyze single-cell RNASeq data. Read files are generated by demultiplexing sequence run files with bcl2fastq [1]. Sequence reads are mapped and genes are counted using CellRanger. Expression profiles are used to annotate and cluster cells using Seurat [2], SingleCellTK [3], SingleR [4] and CellDX [4]. SingleCellTK is an R package that integrates several existing tools and workflows for single-cell analysis including (i) data loading and preprocessing such as cell filtering, normalization, and log-transformation; (ii) quality control and outlier detection; (iii) dimensionality reduction and clustering with methods like PCA, t-SNE, and uniform manifold approximation and projection (UMAP) for dimensionality reduction and hierarchical clustering or density-based spatial clustering of applications with noise (DBSCAN) for clustering; and (iv) visualization with scatter plots, heatmaps, and gene expression trajectories. Annotations are only available for Human or Mouse samples using the Human Protein Cell Atlas [5] or a mouse RNASeq dataset [6].
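The preprocessing steps named above (cell filtering, normalization, and log-transformation) can be illustrated with a small sketch. The thresholds, the scale factor of 10,000, the gene names, and the counts are all assumptions chosen for illustration; the workflow's actual parameters and the internals of Seurat or SingleCellTK are not described here.

```python
import math


def filter_cells(counts_by_cell, min_genes=2):
    """Keep cells in which at least `min_genes` genes have a non-zero count."""
    return {
        cell: genes
        for cell, genes in counts_by_cell.items()
        if sum(1 for count in genes.values() if count > 0) >= min_genes
    }


def log_normalize(gene_counts, scale_factor=10_000):
    """Scale counts to a fixed library size, then apply log(1 + x)."""
    total = sum(gene_counts.values())
    if total == 0:
        raise ValueError("cannot normalize a cell with zero total counts")
    return {
        gene: math.log1p(count / total * scale_factor)
        for gene, count in gene_counts.items()
    }


# Hypothetical count matrix: three cells, three genes.
counts = {
    "cell1": {"CD3E": 4, "MS4A1": 0, "LYZ": 6},
    "cell2": {"CD3E": 0, "MS4A1": 0, "LYZ": 1},
    "cell3": {"CD3E": 0, "MS4A1": 5, "LYZ": 5},
}

kept = filter_cells(counts)
normalized = {cell: log_normalize(genes) for cell, genes in kept.items()}
print(sorted(normalized))
```

Here cell2 is dropped because only one gene is detected in it, and the remaining cells become comparable despite different sequencing depths.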
Inputs
- FastQ
- Matrix, Barcodes, Feature Files (CellRanger Output)
- Sequence Run Folder
- Reference Genome
Outputs
- CellxGene h5ad File: annotation format readable by CellxGene
Workflow Walkthrough
- Navigate to the Single-Cell Analysis workflow launcher. You can use the search bar at the top right to find this workflow, or you can use the Functional Genomics or Next-Generation Sequencing filters on the left-hand side.
- Select the version from the dropdown versioning menu. You can view information about the use-cases and workflow analysis here. When ready to begin, click Run Workflow.
- Enter the key provided once your CellRanger license has been verified. For more information, please email support@formbio.com. Select your data input type, either CellRanger output, Illumina, or FastQ files. Then, provide matrix, barcodes, and genes files.
- Select the host genome, used to help determine contaminants
- Give your workflow run a unique name, then review input data and parameters. When ready to submit, click “Run Workflow”.
Results Walkthrough
- To view results of your Single-Cell Seq workflow, find your workflow run on the Activity tab and then select your run for more information.
- In this view, select the Analysis tab to view a summary of your workflow analysis. You may opt to view this workflow in a new tab as well.
- Alternatively, select the Files tab and navigate to the output folder to view or download the HTML results report.
Citations
- Bcl2fastq.
- Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021).
- Gendoo, D. M. A. et al. Genefu: An R/Bioconductor package for computation of gene expression-based signatures in breast cancer. Bioinformatics 32, 1097–1099 (2016).
- Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nature Immunology 20, 163–172 (2019).
- McCarthy, D. J., Campbell, K. R., Lun, A. T. L. & Wills, Q. F. Scater: Pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).
- Wang, Y. et al. singleCellTK: Comprehensive and interactive analysis of single cell RNA-Seq data. (2023).
Built with
Genomic Variant Analysis
Identify single-nucleotide variants (SNVs), indels, and structural variants in diploid genome resequencing projects by comparison to a reference genome.
Version 1.6.1
Use Cases
- Determine variants in DNA samples compared to a reference genome including single nucleotide variants (SNVs), insertions, deletions and structural variants
- Germline Variant Calling
- Variant Calling in Ancient DNA
- Somatic Mutation Detection
- Determine variants in DNA samples compared to a custom reference genome for small or synthetic genomes
- Plasmid
- Virus
- Bacteria
- Synthetic Genome
- Supported sequencing platforms include Illumina, PacBio and Oxford Nanopore (ONT)
Summary and Methods
This workflow is designed to help the user determine variants in DNA samples when compared to a reference genome. Currently, four different input DNA datatypes are supported: Germline (Diploid), Ancient DNA, Small Genomes (Viral/Prokaryotic/Synthetic), and Somatic (Human Cancer). Workflows can be run with Parabricks, Sentieon, or native open-source tools (NOST). Click the toggles below to learn more about each supported datatype.
Germline (Diploid)
Summary
This workflow is designed to help the user determine germline variants in diploid DNA. The user will provide input FastQ files containing the diploid DNA to be analyzed, and will receive as output a summary of germline variants in the DNA compared to the chosen reference genome.
Methods
This analysis was performed using the Germline Variant Analysis workflow on the Form Bio platform. When the datatype is “Germline (Diploid)”, this workflow determines genetic variants including SNVs, insertions and deletions in high-quality NGS data when compared to a reference genome. Reads are trimmed using TrimGalore [1] or FastP [2], to remove low quality (qual < 25) ends of reads and remove reads < 35bp. These default values can be changed by the user. This workflow can be run with native open-source tools (NOST), Sentieon or Parabricks.
With NOST and Sentieon, trimmed reads are aligned to a reference genome using BWA-MEM [3], Minimap2 [4] or Winnowmap [5] depending on data type. Duplicate reads can optionally be marked using Picard MarkDuplicates [6]. BAMs from the same sample generated by multiple runs are merged using Samtools [7]. Alignment quality is assessed using FastQC [8], Samtools [7], Bedtools [9] and Qualimap [10]. Variants can be detected with joint calling using Freebayes [11], Samtools/Bcftools [12], DNAScope and GATK4 [13].
With Parabricks, trimmed reads are aligned, duplicate reads are marked and alignment quality is assessed using fq2bam. Quality metrics are summarized with MultiQC. Variants can be detected with GATK [13] and DeepVariant [14] to produce gVCF files. Genotyping of gVCF files is determined using GLnexus [15]. Variant effects are determined using SNPEff [16].
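As a rough illustration of the two trimming thresholds above, the following sketch trims the 3' end of a read with a running-sum rule (the style of quality trimming documented for Cutadapt, which TrimGalore wraps) and then discards reads shorter than the minimum length. The function names and the Phred+33 encoding are assumptions; whether the workflow's trimmers behave identically is not confirmed here.

```python
def quality_trim_length(qualities, cutoff=25):
    """Return how many leading bases to keep after 3' quality trimming.

    Walk from the 3' end, accumulating (quality - cutoff). The read is cut at
    the position where this running sum is lowest. The walk stops as soon as
    the sum becomes positive.
    """
    running = 0
    lowest = 0
    keep = len(qualities)
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running > 0:
            break
        if running < lowest:
            lowest = running
            keep = i
    return keep


def trim_read(sequence, quality_string, cutoff=25, min_length=35, phred_offset=33):
    """Trim one read; return (sequence, qualities), or None if it is too short."""
    qualities = [ord(ch) - phred_offset for ch in quality_string]
    keep = quality_trim_length(qualities, cutoff)
    if keep < min_length:
        return None  # read is discarded
    return sequence[:keep], quality_string[:keep]


# Hypothetical reads: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
good_read = trim_read("A" * 45, "I" * 40 + "#" * 5)
short_read = trim_read("A" * 40, "I" * 30 + "#" * 10)
print(len(good_read[0]), short_read)
```

The first read loses its five low-quality bases and is kept at 40 bp; the second is trimmed to 30 bp, which falls below the 35 bp minimum, so it is dropped.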
Ancient DNA
Summary
This workflow is designed to help the user determine variants in ancient DNA. The user will provide input FastQ files containing the DNA to be analyzed, and will receive as output a summary of variants in the DNA compared to the chosen reference genome.
Methods
This analysis was performed using the Germline Variant Analysis workflow on the Form Bio platform. When the datatype is “Ancient”, this workflow can be used to determine genetic variants of high quality NGS data in your project compared to a supported reference genome. Reads are trimmed using AdapterRemoval [17], FastP [2], or TrimGalore [1], to remove low quality (qual < 25) ends of reads and remove reads < 35bp. These default values can be changed by the user. Contaminants are detected using Kraken [18] with confidence of 0.8 using Kraken’s precompiled database or a custom database where the human genome has been removed. Unclassified trimmed reads are aligned to a reference genome using BWA MEM [3] or BWA Aln (with seed of 16,500, maximum edit distance of 0.01 and maximum gap opens of 2). BAMs from the same library generated by multiple runs are merged using Samtools [7]. Duplicate reads from the same library can optionally be marked using PaleoMIX [19] or Picard MarkDuplicates [6]. BAMs from the same sample generated by multiple libraries are merged using Samtools [7]. Base recalibration is done using mapDamage2 [20]. Alignment quality is assessed using QualiMap [21], DamageProfiler [22], and MultiQC [23]. Germline variants can be detected using Freebayes [11], Samtools/Bcftools [12] and GATK4 [13]. In order to increase the speed of analysis, the Parabricks (requires GPUs) or Sentieon optimized versions of these algorithms are used for BWA MEM and GATK4. Genotyping of gVCF files is determined using GLnexus [15]. Variant effects are determined using SNPEff [16].
Small Genomes (Viral/Prokaryotic/Synthetic)
Summary
This workflow is designed to help the user determine germline variants in small or synthetic genomes with an option to provide the custom genome sequence. The user will provide input FastQ files containing the DNA to be analyzed, and will receive as output a summary of variants in the DNA compared to the chosen reference genome. Small genomes are assumed to be haploid.
Methods
This analysis was performed using the Germline Variant Analysis workflow on the Form Bio platform. When the data type is “Small Genome”, this workflow can be used to determine genetic variants of high quality NGS data in your project compared to a supported reference genome. Reads are trimmed using TrimGalore [1]. Trimmed reads are aligned to a reference genome using BWA MEM [3]. Duplicate reads are marked using Picard MarkDuplicates [6]. Germline variants can be detected using Samtools/Bcftools [12]. In order to increase the speed of analysis, the Parabricks or Sentieon optimized versions of these algorithms can be used for BWA MEM. Variant effects are determined using SNPEff [16] if the genome is provided by the platform. For SARS-CoV-2, genome sequences of samples are determined using BCFTools [12] and lineage classification is determined using PANGOLIN [24].
Somatic (Human Cancer)
Summary
This workflow is designed to help the user determine variants in somatic DNA. The user will provide input FastQ files containing the DNA to be analyzed, and will receive as output a summary of variants in the DNA compared to the chosen reference genome.
Methods
When the data type is “Somatic”, this workflow can be used to determine genetic variants of tumor NGS data compared to a supported reference genome. When a normal sample is provided, specialized somatic variant calling methods will be applied and allow users to filter germline variants from the resulting VCF files. Reads are trimmed using TrimGalore [1] or FastP [2], to remove low quality (qual < 25) ends of reads and remove reads < 35bp. These default values can be changed by the user. Trimmed reads are aligned to a reference genome using BWA MEM [3], Minimap2 [4] or Winnowmap [5] depending on data type. Duplicate reads are marked using Picard MarkDuplicates [6], by library if library information is provided. BAMs from the same sample generated by multiple runs are merged using Samtools [7]. Alignment quality is assessed using FastQC [8], Samtools [7], Bedtools [9] and Qualimap [10]. Quality reports are produced by MultiQC. Variant effects are determined using SNPEff [16]. Somatic variants can be detected in somatic or tumor-only mode using Strelka2 [25], Freebayes [11], DeepSomatic, MuTect2 [26] and TNScope. In order to increase the speed of analysis, the Parabricks or Sentieon optimized versions of these algorithms are used for BWA MEM and GATK4. When a matched normal sample is present, tumor/normal germline SNP matching is confirmed using NGSCheckMate [27] and microsatellite stability is assessed using MSI-Sensor [28].
Inputs
- Run Name: This is a unique name for each run of pipelines in your project
- Organism: Reference Genome used for alignment
- Reference Genome Annotation: Annotation that should be used for determining gene and transcript counts.
- Input Folder: This is the folder that contains all of the fastq files that will be used in this analysis
- Sample Description File
- This file matches the sequence files to samples; sequence data from multiple runs will be merged if they have the same SampleID
- RunID should be part of the FASTQ file names.
- SampleGroup is necessary for statistical analysis; there must be at least 2 samples per group
- File Format
- Capture Bedfile
- The intervals in capture BED file indicate regions where alignments are expected based on the target capture kit.
- Make sure that no column names are present in the file.
- The fourth column can indicate a region name and is used to identify poorly captured regions.
- true
- false
RunID | SampleID |
SRR994739 | SAMEA9454349 |
SRR994740 | SAMEA9454349 |
SRR994741 | SAMEA9454341 |
SeqName | Start | End | Name |
chr1 | 1787293 | 1787413 | GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_1 |
chr1 | 1787353 | 1787473 | GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_2 |
chr1 | 1789040 | 1789160 | GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_1 |
chr1 | 1789160 | 1789280 | GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_2 |
chr1 | 1790375 | 1790495 | GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_1 |
chr1 | 1790495 | 1790615 | GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_2 |
chr1 | 1793187 | 1793307 | GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_1 |
chr1 | 1793247 | 1793367 | GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_2 |
chr1 | 1804380 | 1804500 | GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_1 |
chr1 | 1804500 | 1804620 | GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_2 |
chr1 | 1806416 | 1806536 | GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_1 |
chr1 | 1806476 | 1806596 | GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_2 |
Advanced Parameters
Algorithms
Sequence Data Type
Option | Meaning |
sr | Short single-end reads without splicing (-k21 -w11 --sr --frag=yes -A2 -B8 -O12,32 -E2,1 -r100 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g100 -2K50m --heap-sort=yes --secondary=no). This is the default mode. |
map-ont | Align noisy long reads of ~10% error rate to a reference genome. |
map-hifi | Align PacBio high-fidelity (HiFi) reads to a reference genome (-k19 -w19 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200). |
map-pb | Align older PacBio continuous long (CLR) reads to a reference genome (-Hk19). |
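For orientation, the option names in the table correspond to Minimap2 presets, which are selected on the command line with `-x`. The sketch below only builds such a command; the file names are hypothetical, and how the workflow actually invokes Minimap2 is an assumption.

```python
import shlex

# Preset names taken from the table above.
MINIMAP2_PRESETS = {"sr", "map-ont", "map-hifi", "map-pb"}


def minimap2_command(preset, reference_fasta, reads_fastq, threads=4):
    """Build a minimap2 argument list that writes SAM output (-a) to stdout."""
    if preset not in MINIMAP2_PRESETS:
        raise ValueError(f"unsupported preset: {preset}")
    return [
        "minimap2",
        "-a",           # output alignments in SAM format
        "-x", preset,   # preset bundle of alignment parameters
        "-t", str(threads),
        reference_fasta,
        reads_fastq,
    ]


# Hypothetical file names, for illustration only.
command = minimap2_command("map-ont", "reference.fa", "reads.fastq.gz")
print(shlex.join(command))
```

Passing the list to `subprocess.run` would execute the alignment; it is only printed here so that the example does not depend on Minimap2 being installed.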
Outputs
- bams/SampleID.bam
- bams/SampleID.bam.bai
- DELLY VCF
- Freebayes
- Mutect2
- Strelka2
- SVABA
- Union VCF
- Filtered VCF
- MAF
- multiqc_data/multiqc.log
- multiqc_data/multiqc_data.json
- multiqc_data/multiqc_fastqc.txt
- multiqc_data/multiqc_general_stats.txt
- multiqc_data/multiqc_picard_dups.txt
- multiqc_data/multiqc_samtools_flagstat.txt
- multiqc_data/multiqc_samtools_stats.txt
- multiqc_data/multiqc_sources.txt
- multiqc_report.html
- profiling/SamplePair_all.txt
- profiling/SamplePair_matched.txt
- profiling/SamplePair.msi.txt
- SampleID/SampleID.alnstats.txt
- SampleID/SampleID.covhist.txt
- SampleID/SampleID.flagstat.txt
- SampleID/SampleID.genomecov.txt
- SampleID/SampleID.libcomplex.txt
- SampleID/SampleID_exoncoverage.txt
- SampleID/SampleID_lowcoverage.txt
- SampleID/SampleID_fastqc.html
- SampleID/SampleID_fastqc.zip
Workflow Walkthrough
- Navigate to the Genomic Variant Analysis workflow launcher on the Form Bio platform. You can locate the workflow using the search bar at the top right corner, or by using the Google DeepOmics, Functional Genomics, Precision Medicine, or Next-Generation Sequencing filters on the left-hand side.
- Select the version from the dropdown versioning menu in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click Run Workflow.
- Select the type of input data to be analyzed. Currently, four types are supported: Germline, Ancient DNA, Somatic, and Viral/Prokaryotic/Synthetic Genome. Also provide the platform that was used to collect the data. Select the type of analysis to be run. Finally, provide the directory containing the files to be analyzed as well as a file relating RunIDs, LibraryIDs and SampleIDs (this table can be created within the workflow itself).
- Select a reference genome to which the input data will be compared. You may also optionally upload a BED file detailing genomic regions of note.
- Tune additional parameters related to your workflow run. These parameters may change depending on your input data.
- Give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.
Results Walkthrough
- To view results for your Genomic Variant Analysis workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
- On the Results tab, you can see a preview summary of the analysis.
- Under the All Files tab, you can view the final HTML file, which is nested in the output folder. You may view or download this file. This file can also be found in the File Explorer.
Citations
- Krueger, F., James, F., Ewels, P., Afyounian, E. & Schuster-Boeckler, B. FelixKrueger/TrimGalore: V0.6.7 - DOI via Zenodo. (2021) doi:10.5281/ZENODO.5127899.
- Chen, S. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2, e107 (2023).
- Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997 [q-bio.GN] (2013).
- Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
- Jain, C., Rhie, A., Hansen, N. F., Koren, S. & Phillippy, A. M. Long-read mapping to repetitive reference sequences using Winnowmap2. Nature Methods 19, 705–710 (2022).
- Thomer, A. K., Twidale, M. B., Guo, J. & Yoder, M. J. Picard Tools. in Conference on Human Factors in Computing Systems - Proceedings (2016).
- Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
- Andrews, S. et al. FastQC. (2012).
- Quinlan, A. R. & Hall, I. M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
- Okonechnikov, K., Conesa, A. & García-Alcalde, F. Qualimap 2: Advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics 32, 292–294 (2016).
- Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. (2012).
- Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
- DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics 43, 491–498 (2011).
- Yun, T. et al. Accurate, scalable cohort variant calls using DeepVariant and GLnexus. (2020) doi:10.1101/2020.02.10.942086.
- Lin, M. F. et al. GLnexus: Joint variant calling for large cohort sequencing. (2018) doi:10.1101/343970.
- Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly 6, 80–92 (2012).
- Schubert, M., Lindgreen, S. & Orlando, L. AdapterRemoval v2: Rapid adapter trimming, identification, and read merging. BMC Research Notes 9, 88 (2016).
- Wood, D. E. & Salzberg, S. L. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biology 15, R46 (2014).
- Schubert, M. et al. Characterization of ancient and modern genomes by SNP detection and phylogenomic and metagenomic analysis using PALEOMIX. Nature Protocols 9, 1056–1082 (2014).
- Jónsson, H., Ginolhac, A., Schubert, M., Johnson, P. L. F. & Orlando, L. mapDamage2.0: Fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics 29, 1682–1684 (2013).
- Neukamm, J., Peltzer, A. & Nieselt, K. DamageProfiler: Fast damage pattern calculation for ancient DNA. Bioinformatics 37, 3652–3653 (2021).
- Ewels, P., Magnusson, M., Lundin, S. & Käller, M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048 (2016).
- Rambaut, A. et al. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nature Microbiology 5, 1403–1407 (2020).
- Kim, S. et al. Strelka2: Fast and accurate calling of germline and somatic variants. Nature Methods 15, 591–594 (2018).
- Benjamin, D. et al. Calling Somatic SNVs and Indels with Mutect2. (2019) doi:10.1101/861054.
- Lee, S. et al. NGSCheckMate: Software for validating sample identity in Next-generation sequencing studies within and across data types. Nucleic Acids Research 45, e103 (2017).
- Jia, P. et al. MSIsensor-pro: Fast, Accurate, and Matched-normal-sample-free Detection of Microsatellite Instability. Genomics, Proteomics and Bioinformatics 18, 65–71 (2020).
- Clement, K. et al. CRISPResso2 provides accurate and rapid genome editing sequence analysis. Nature Biotechnology 37, 224–226 (2019).