Google DeepOmics

These tools incorporate the Google DeepVariant analysis pipeline, a deep learning-based algorithm that uses a deep neural network to call genetic variants from next-generation sequencing data.

Expression Analysis in RNASeq

This workflow can be used to quantify gene expression, detect splice variants, and perform differential expression analysis.

Version 1.1.1

Use Cases

  • Determine differentially expressed genes between two or more groups of samples (treated vs untreated, knock-out vs wildtype, cell type A vs cell type B)
  • Determine differentially expressed transcripts between two or more groups of samples
  • Compare the gene expression profiles of samples

Summary and Methods

This workflow is designed to help the user thoroughly analyze RNA sequencing data. Currently, two functions are supported: Full Analysis and Recalculate Statistics. Both functions include the option to specify whether the data include Human Cancer Samples. Click the toggles below to learn more about each function.

Full Analysis

Summary

This workflow is designed to help the user determine differential gene abundances and differential expression between two or more groups of samples. The user will provide as input a folder containing all the read files needed for analysis and a sequencing file relating sample IDs to attributes. The user will receive as output differential gene and transcript abundance analysis files and comparison files between the two or more samples.

Methods

This analysis was performed using the Expression Analysis in RNASeq workflow on the Form Bio platform. Reads are trimmed using TrimGalore [1] to remove low-quality (qual < 25) ends of reads and to discard reads shorter than 35 bp. Trimmed reads are aligned to a reference genome using STAR [2] (default) or HISAT2 [3]. Duplicate reads can optionally be marked using Picard MarkDuplicates. BAMs from the same sample generated by multiple runs are merged using Samtools [4]. Transcript and gene abundance is assessed using featureCounts to generate raw gene counts [5], StringTie to generate FPKM [6], and Salmon to generate raw transcript counts [7]. Sample comparisons and differential gene/transcript expression analysis are performed using edgeR [8], DESeq2 [9], and IsoformSwitchAnalyzeR [10].
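
As a rough illustration of the trimming thresholds above, the following Python sketch applies a running-sum 3′ quality trim (the approach documented for Cutadapt, which TrimGalore wraps) with a cutoff of 25, then discards reads shorter than 35 bp. The function names and the Phred+33 quality encoding are assumptions made for this example; it is not the workflow's actual implementation, and adapter removal is not shown.

```python
def quality_trim_index(quals, cutoff=25):
    """Return the cut position for 3' quality trimming (keep quals[:cut]).

    Running-sum rule: subtract the cutoff from each quality, accumulate
    from the 3' end, stop once the sum turns positive, and cut where the
    sum was lowest.
    """
    running = 0
    lowest = 0
    cut = len(quals)
    for i in range(len(quals) - 1, -1, -1):
        running += quals[i] - cutoff
        if running > 0:
            break
        if running < lowest:
            lowest = running
            cut = i
    return cut


def trim_read(seq, qual_string, cutoff=25, min_len=35, phred_offset=33):
    """Trim one read; return the trimmed sequence, or None if it is too short."""
    quals = [ord(ch) - phred_offset for ch in qual_string]  # assumes Phred+33
    trimmed = seq[:quality_trim_index(quals, cutoff)]
    return trimmed if len(trimmed) >= min_len else None


# A 40 bp read whose last two bases have Phred 2 ('#'): trimmed to 38 bp and kept.
kept = trim_read("A" * 40, "I" * 38 + "#" * 2)
# A 36 bp read with the same low-quality tail: trimmed to 34 bp and discarded.
dropped = trim_read("A" * 36, "I" * 34 + "#" * 2)
```

Because the length filter runs after trimming, a read that starts above 35 bp can still be discarded once its low-quality tail is removed.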

Recalculate Statistics

Summary

This workflow is designed to help the user determine differentially expressed genes using abundance counts generated by previous workflow analysis. The user will provide a folder containing output files from previous analyses. The workflow will perform the statistical analysis again with a different composition of samples and output the results of this analysis.

Methods

Sample comparisons and differential gene/transcript expression analysis are performed using edgeR [8], DESeq2 [9], and IsoformSwitchAnalyzeR [10].

Human Cancer

Summary

This workflow is designed to help the user determine differential gene abundances and differential expression between two or more groups of human tumor samples. The user will provide as input a folder containing all read files needed for analysis and a sequencing file relating sample IDs to attributes. The user will receive as output differential gene and transcript abundance analysis files and comparison files between the two or more samples.

Methods

This analysis was performed using the Expression Analysis in RNASeq workflow on the Form Bio platform. Reads are trimmed using TrimGalore [1] to remove low-quality (qual < 25) ends of reads and to discard reads shorter than 35 bp. Trimmed reads are aligned to a reference genome using STAR [2] (default) or HISAT2 [3]. Duplicate reads can optionally be marked using Picard MarkDuplicates. BAMs from the same sample generated by multiple runs are merged using Samtools [4]. Transcript and gene abundance is assessed using featureCounts to generate raw gene counts [5], StringTie to generate FPKM [6], and Salmon to generate raw transcript counts [7]. Sample comparisons and differential gene/transcript expression analysis are performed using edgeR [8], DESeq2 [9], and IsoformSwitchAnalyzeR [10]. Gene fusion events are detected using STAR-Fusion [11]. Exon-skipping events are detected using RegTools [12].

Inputs

  • Run Name: This is a unique name for each run of pipelines in your project
  • Organism: Reference Genome used for alignment
  • Reference Genome Annotation: Annotation that should be used for determining gene and transcript counts.
  • Input Folder: This is the folder that contains all of the fastq files that will be used in this analysis
  • File Format
  • Sample Description File
    • This file matches the sequence files to samples; sequence data from multiple runs will be merged if they have the same SampleID.
    • The RunID should be part of the fastq file names.
    • SampleGroup is necessary for statistical analysis; there must be at least 2 samples per group.
      RunID      SampleID      SampleGroup
      SRR994739  SAMEA9454349  Treated
      SRR994740  SAMEA9454349  Treated
      SRR994741  SAMEA9454341  Untreated
      SRR994742  SAMEA9454348  Treated
      SRR994743  SAMEA9454348  Treated
      SRR994744  SAMEA9454342  Untreated
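
The rules for the Sample Description File (runs sharing a SampleID are merged, and each SampleGroup needs at least 2 samples) can be checked before launching a run. The sketch below is a minimal example under stated assumptions: the file is treated as tab-delimited with a header row, and `summarize_sample_sheet` is a hypothetical helper name, not part of the platform.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the example table; the tab delimiter is an assumption.
EXAMPLE_SHEET = "\n".join([
    "RunID\tSampleID\tSampleGroup",
    "SRR994739\tSAMEA9454349\tTreated",
    "SRR994740\tSAMEA9454349\tTreated",
    "SRR994741\tSAMEA9454341\tUntreated",
    "SRR994742\tSAMEA9454348\tTreated",
    "SRR994743\tSAMEA9454348\tTreated",
    "SRR994744\tSAMEA9454342\tUntreated",
])


def summarize_sample_sheet(text, delimiter="\t"):
    """Group runs by sample and samples by group; report groups with < 2 samples."""
    runs_by_sample = defaultdict(list)
    group_of_sample = {}
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        sample, group = row["SampleID"], row["SampleGroup"]
        runs_by_sample[sample].append(row["RunID"])
        # A sample must not appear in two different groups.
        if group_of_sample.setdefault(sample, group) != group:
            raise ValueError(f"{sample} is assigned to more than one SampleGroup")
    samples_by_group = defaultdict(list)
    for sample, group in group_of_sample.items():
        samples_by_group[group].append(sample)
    undersized = sorted(g for g, members in samples_by_group.items() if len(members) < 2)
    return dict(runs_by_sample), dict(samples_by_group), undersized


runs, groups, undersized = summarize_sample_sheet(EXAMPLE_SHEET)
print(runs["SAMEA9454349"])  # the two runs that would be merged for this sample
print(undersized)            # groups that have fewer than 2 samples
```

Note that the group size is counted in samples, not runs: two runs of the same SampleID still count as one sample.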

      Advanced Parameters

      Algorithms

      Mark Duplicates Algorithm
      Trim Reads
      • true
      • false
      Alignment Algorithm
      Orientation
      • I = inward
      • O = outward
      • M = matching
      Read Origin
      • F = Forward
      • R = Reverse
      • '' = Single End/Unknown
      Stranded
      • S = stranded
      • U = unstranded

Outputs

Merged Sorted BAM Files per Sample
  • bams/SampleID.bam
  • bams/SampleID.bam.bai
Salmon Output
  • featurects/SampleID.salmon.tar.gz
StringTie Output
  • featurects/SampleID_stringtie
  • featurects/SampleID.fpkm.txt
Gene/Transcript Abundances
  • featurects/SampleID.cts.txt
  • featurects/SampleID.cts.txt.summary
  • countTable.fpkm.txt
  • countTable.logCPM.txt
  • countTable.stats.txt
  • countTable.txt
BigWig Files
  • featurects/SampleID.unique.bw
  • featurects/SampleID.all.bw
MultiQC HTML
  • multiqc_data/multiqc.log
  • multiqc_data/multiqc_data.json
  • multiqc_data/multiqc_fastqc.txt
  • multiqc_data/multiqc_featureCounts.txt
  • multiqc_data/multiqc_general_stats.txt
  • multiqc_data/multiqc_samtools_flagstat.txt
  • multiqc_data/multiqc_samtools_stats.txt
  • multiqc_data/multiqc_sources.txt
Raw BAM QC Tables
  • SampleID/SampleID.alnstat.txt
  • SampleID/SampleID.flagstat.txt
  • SampleID/SampleID_fastqc.html
  • SampleID/SampleID_fastqc.zip
MultiQC Raw Tables
  • multiqc_report.html
Differential Gene Abundance Analysis
  • Group1_Group2.edgeR.txt
  • Group1_Group2.gene2path.txt
  • Group1_Group2.stringDB.txt
Sample Comparison
  • countTable.mds.txt
  • countTable.pca.txt
  • countTable.pcapercvar.txt
  • countTable.sampleDists.txt
Differential Transcript Abundance Analysis
  • countTable.dexseq.txt
  • gene.trxstats.txt
  • splicingEnrichment.txt
  • splicingIsoformUsage.txt
  • splicingResults.html
  • splicingSummary.txt
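
The FPKM and logCPM tables listed above are normalized views of the raw counts. The sketch below shows the textbook definitions of CPM and FPKM in Python. The function names and the pseudocount of 1 in the log transform are assumptions for illustration; StringTie and edgeR apply their own estimation steps, so values in the workflow outputs may not match this arithmetic exactly.

```python
import math


def cpm(count, library_size):
    """Counts per million: scale a raw count by the total counts in the sample."""
    return count * 1e6 / library_size


def fpkm(count, feature_length_bp, library_size):
    """Fragments per kilobase of feature per million mapped fragments."""
    return count * 1e9 / (feature_length_bp * library_size)


def log2_cpm(count, library_size, pseudocount=1.0):
    """log2 of CPM with a pseudocount so zero counts stay finite (assumed value)."""
    return math.log2(cpm(count, library_size) + pseudocount)


# 500 fragments on a 2,000 bp gene in a library of 10 million fragments.
example_fpkm = fpkm(500, 2000, 10_000_000)
example_cpm = cpm(500, 10_000_000)
```

CPM corrects only for sequencing depth, while FPKM also divides by feature length, so FPKM is the one to use when comparing different genes within a sample.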

Workflow Walkthrough

  1. Navigate to the Expression Analysis in RNASeq launcher card. You can use the search bar at the top right corner, or use the Google DeepOmics, Precision Medicine, Functional Genomics, or Next Generation Sequencing tags to find the workflow card.
  2. Select the version from the dropdown box in the top right corner. When ready to begin analysis, click “Run Workflow”.
  3. This workflow currently supports two functions: Full Analysis and Recalculate Statistics. Both functions include the option to specify whether the data include Human Cancer Samples. Checking this option adds gene fusion predictions to the analysis to identify alterations that cause gene fusion events.
  4. Let’s look at the Recalculate Statistics option, which repeats the statistical analysis performed in previous workflow runs. Select this function from the dropdown box. Also provide the directory containing the files to be analyzed and a file relating RunIDs, SampleIDs, and sample attributes such as SampleGroup (this table can be created within the workflow itself).
  5. Select a reference genome and annotation version for the workflow run.
  6. Name the workflow run, then take a minute to review workflow settings and parameters. When you’re satisfied, click “Run Workflow” at the bottom-left corner.

Results Walkthrough

  1. To view results for your Expression Analysis in RNASeq workflow, first find your workflow run in the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
  2. After selecting your workflow run, click Open Analysis in the upper right-hand corner to open the RNASeq Analysis Portal in the RNASeq Dashboard (opens in a separate tab) and view an interactive summary of your data. You may also navigate to the Files tab to view and download analysis outputs in the output folder. These folders are also available in the File Explorer.
  3. Use the RNASeq Analysis Portal to view your data analysis. Navigate the tabs across the top or use the links in the Introduction tab.

Citations

  1. Krueger, F., James, F., Ewels, P., Afyounian, E. & Schuster-Boeckler, B. FelixKrueger/TrimGalore: V0.6.7 - DOI via Zenodo. (2021) doi:10.5281/ZENODO.5127899.
  2. Dobin, A. et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England) 29, 15–21 (2013).
  3. Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology 37, 907–915 (2019).
  4. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
  5. Liao, Y., Smyth, G. K. & Shi, W. FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).
  6. Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biology 20, (2019).
  7. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods 14, 417–419 (2017).
  8. Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) 26, 139–140 (2010).
  9. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, 550 (2014).
  10. Vitting-Seerup, K. & Sandelin, A. IsoformSwitchAnalyzeR: Analysis of changes in genome-wide patterns of alternative splicing and its functional consequences. Bioinformatics 35, 4469–4471 (2019).
  11. Haas, B. et al. STAR-Fusion: Fast and Accurate Fusion Transcript Detection from RNA-Seq. bioRxiv 120295 (2017) doi:10.1101/120295.
  12. Feng, Y.-Y. et al. RegTools: Integrated analysis of genomic and transcriptomic data for discovery of splicing variants in cancer. bioRxiv 436634 (2018) doi:10.1101/436634.

Genomics Variant Analysis

Identify single-nucleotide variants (SNVs), indels, and structural variants in diploid genome resequencing projects by comparison to a reference genome.

Version 1.6.1

Use Cases

  • Determine variants in DNA samples compared to a reference genome including single nucleotide variants (SNVs), insertions, deletions and structural variants
    • Germline Variant Calling
    • Variant Calling in Ancient DNA
    • Somatic Mutation Detection
  • Determine variants in DNA samples compared to a custom reference genome for small or synthetic genomes
    • Plasmid
    • Virus
    • Bacteria
    • Synthetic Genome
  • Supported sequencing platforms include Illumina, PacBio, and Oxford Nanopore (ONT)

Summary and Methods

This workflow is designed to help the user determine variants in DNA samples when compared to a reference genome. Currently, four input DNA data types are supported: Germline (Diploid), Ancient DNA, Small Genomes (Viral/Prokaryotic/Synthetic), and Somatic (Human Cancer). Workflows can be run with Parabricks, Sentieon, or native open-source tools (NOST). Click the toggles below to learn more about each supported data type.

Germline (Diploid)

Summary

This workflow is designed to help the user determine germline variants in diploid DNA. The user will provide input FastQ files containing the diploid DNA to be analyzed, and will receive as output a summary of germline variants in the DNA compared to the chosen reference genome.

Methods

This analysis was performed using the Germline Variant Analysis workflow on the Form Bio platform. When the datatype is “Germline (Diploid)”, this workflow determines genetic variants, including SNVs, insertions, and deletions, in high-quality NGS data when compared to a reference genome. Reads are trimmed using TrimGalore [1] or fastp [2] to remove low-quality (qual < 25) ends of reads and to discard reads shorter than 35 bp. These default values can be changed by the user. This workflow can be run with native open-source tools (NOST), Sentieon, or Parabricks.

With NOST and Sentieon, trimmed reads are aligned to a reference genome using BWA-MEM [3], Minimap2 [4], or Winnowmap [5], depending on data type. Duplicate reads can optionally be marked using Picard MarkDuplicates [6]. BAMs from the same sample generated by multiple runs are merged using Samtools [7]. Alignment quality is assessed using FastQC [8], Samtools [7], Bedtools [9], and Qualimap [10]. Variants can be detected with joint calling using Freebayes [11], Samtools/Bcftools [12], DNAScope, and GATK4 [13].

With Parabricks, trimmed reads are aligned, duplicate reads are marked, and alignment quality is assessed using fq2bam. Quality metrics are summarized with MultiQC. Variants can be detected with GATK [13] and DeepVariant [14] to produce gVCF files. Genotyping of gVCF files is determined using GLnexus [15]. Variant effects are determined using SnpEff [16].

Ancient DNA

Summary

This workflow is designed to help the user determine variants in ancient DNA. The user will provide input FastQ files containing the DNA to be analyzed, and will receive as output a summary of variants in the DNA compared to the chosen reference genome.

Methods

This analysis was performed using the Germline Variant Analysis workflow on the Form Bio platform. When the datatype is “Ancient”, this workflow can be used to determine genetic variants in high-quality NGS data in your project compared to a supported reference genome. Reads are trimmed using AdapterRemoval [17], fastp [2], or TrimGalore [1] to remove low-quality (qual < 25) ends of reads and to discard reads shorter than 35 bp. These default values can be changed by the user. Contaminants are detected using Kraken [18] with a confidence of 0.8, using Kraken’s precompiled database or a custom database from which the human genome has been removed. Unclassified trimmed reads are aligned to a reference genome using BWA-MEM [3] or BWA Aln (with a seed of 16,500, a maximum edit distance of 0.01, and a maximum of 2 gap opens). BAMs from the same library generated by multiple runs are merged using Samtools [7]. Duplicate reads from the same library can optionally be marked using PALEOMIX [19] or Picard MarkDuplicates [6]. BAMs from the same sample generated by multiple libraries are merged using Samtools [7]. Base recalibration is done using mapDamage2 [20]. Alignment quality is assessed using Qualimap [10], DamageProfiler [21], and MultiQC [22]. Germline variants can be detected using Freebayes [11], Samtools/Bcftools [12], and GATK4 [13]. To speed up analysis, the Parabricks (requires GPUs) or Sentieon optimized versions of these algorithms are used for BWA-MEM and GATK4. Genotyping of gVCF files is determined using GLnexus [15]. Variant effects are determined using SnpEff [16].
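
For readers who want to see how the BWA Aln settings above could map onto a command line, here is a small Python sketch that assembles the arguments. It assumes the "seed of 16,500" refers to the seed length option (`-l`), with `-n` for the maximum edit distance and `-o` for the maximum number of gap opens; the file names and the helper name are placeholders, and the platform's actual invocation is not documented here.

```python
def bwa_aln_command(reference, fastq, seed_length=16500, max_edit=0.01, max_gap_opens=2):
    """Build an argument list for `bwa aln` with the ancient-DNA settings."""
    return [
        "bwa", "aln",
        "-l", str(seed_length),    # seed length (assumed meaning of "seed of 16,500")
        "-n", str(max_edit),       # maximum edit distance, given as a fraction
        "-o", str(max_gap_opens),  # maximum number of gap opens
        reference,
        fastq,
    ]


# Placeholder file names for illustration only.
command = bwa_aln_command("ref.fa", "reads.fq")
print(" ".join(command))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting.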

Small Genomes (Viral/Prokaryotic/Synthetic)

Summary

This workflow is designed to help the user determine germline variants in small or synthetic genomes, with an option to provide a custom genome sequence. The user will provide input FastQ files containing the DNA to be analyzed, and will receive as output a summary of variants in the DNA compared to the chosen reference genome. Small genomes are assumed to be haploid.

Methods

This analysis was performed using the Germline Variant Analysis workflow on the Form Bio platform. When the data type is “Small Genome”, this workflow can be used to determine genetic variants in high-quality NGS data in your project compared to a supported reference genome. Reads are trimmed using TrimGalore [1]. Trimmed reads are aligned to a reference genome using BWA-MEM [3]. Duplicate reads are marked using Picard MarkDuplicates [6]. Germline variants can be detected using Samtools/Bcftools [12]. To speed up analysis, the Parabricks or Sentieon optimized versions of these algorithms can be used for BWA-MEM. Variant effects are determined using SnpEff [16] if the genome is provided by the platform. For SARS-CoV-2, genome sequences of samples are determined using Bcftools [12] and lineage classification is determined using PANGOLIN [23].

Somatic (Human Cancer)

Summary

This workflow is designed to help the user determine variants in somatic DNA. The user will provide input FastQ files containing the DNA to be analyzed, and will receive as output a summary of variants in the DNA compared to the chosen reference genome.

Methods

When the data type is “Somatic”, this workflow can be used to determine genetic variants in tumor NGS data compared to a supported reference genome. When a normal sample is provided, specialized somatic variant calling methods are applied, allowing users to filter germline variants from the resulting VCF files. Reads are trimmed using TrimGalore [1] or fastp [2] to remove low-quality (qual < 25) ends of reads and to discard reads shorter than 35 bp. These default values can be changed by the user. Trimmed reads are aligned to a reference genome using BWA-MEM [3], Minimap2 [4], or Winnowmap [5], depending on data type. Duplicate reads are marked using Picard MarkDuplicates [6], by library if library IDs are provided. BAMs from the same sample generated by multiple runs are merged using Samtools [7]. Alignment quality is assessed using FastQC [8], Samtools [7], Bedtools [9], and Qualimap [10]. Quality reports are produced by MultiQC. Variant effects are determined using SnpEff [16]. Somatic variants can be detected in somatic or tumor-only mode using Strelka2 [24], Freebayes [11], DeepSomatic, MuTect2 [25], and TNScope. To speed up analysis, the Parabricks or Sentieon optimized versions of these algorithms are used for BWA-MEM and GATK4. When a matched normal sample is present, tumor/normal germline SNP matching is confirmed using NGSCheckMate [26] and microsatellite stability is assessed using MSI-Sensor [27].

Inputs

  • Run Name: This is a unique name for each run of pipelines in your project
  • Organism: Reference Genome used for alignment
  • Reference Genome Annotation: Annotation that should be used for determining gene and transcript counts.
  • Input Folder: This is the folder that contains all of the fastq files that will be used in this analysis
  • Sample Description File
    • This file matches the sequence files to samples; sequence data from multiple runs will be merged if they have the same SampleID
    • The RunID should be part of the fastq file names.
    • SampleGroup is necessary for statistical analysis; there must be at least 2 samples per group.
    • File Format
      RunID      SampleID
      SRR994739  SAMEA9454349
      SRR994740  SAMEA9454349
      SRR994741  SAMEA9454341
  • Capture Bedfile
    • The intervals in the capture BED file indicate regions where alignments are expected based on the target capture kit.
    • Make sure the file contains no column names (header row).
    • The fourth column can hold a region name, which is used to identify poorly captured regions.
      SeqName  Start    End      Name
      chr1     1787293  1787413  GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_1
      chr1     1787353  1787473  GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_2
      chr1     1789040  1789160  GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_1
      chr1     1789160  1789280  GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_2
      chr1     1790375  1790495  GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_1
      chr1     1790495  1790615  GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_2
      chr1     1793187  1793307  GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_1
      chr1     1793247  1793367  GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_2
      chr1     1804380  1804500  GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_1
      chr1     1804500  1804620  GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_2
      chr1     1806416  1806536  GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_1
      chr1     1806476  1806596  GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_2
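
Because the capture BED file must not contain a header row, a quick pre-flight check can catch the most common mistake. The Python sketch below is one possible check, assuming a tab-delimited file; `parse_capture_bed` is a hypothetical helper written for this example, not a platform function.

```python
def parse_capture_bed(text):
    """Parse BED text into (chrom, start, end, name) tuples, rejecting header rows."""
    intervals = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {line_no}: expected at least 3 tab-separated columns")
        chrom, start, end = fields[0], fields[1], fields[2]
        if not (start.isdigit() and end.isdigit()):
            # Column names such as "Start" and "End" end up here.
            raise ValueError(f"line {line_no}: non-numeric coordinates (header row?)")
        if int(start) >= int(end):
            raise ValueError(f"line {line_no}: start must be smaller than end")
        name = fields[3] if len(fields) > 3 else None  # optional region name
        intervals.append((chrom, int(start), int(end), name))
    return intervals


bed_text = (
    "chr1\t1787293\t1787413\tGNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_1\n"
    "chr1\t1787353\t1787473\tGNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_2\n"
)
intervals = parse_capture_bed(bed_text)
```

Passing a file that starts with `SeqName`, `Start`, `End`, `Name` raises a `ValueError` on line 1, which is the condition the guidance above warns about.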

      Advanced Parameters

      Algorithms

      Trim Reads
      • true
      • false
      Alignment Algorithm
      Mark Duplicates Algorithm

      Sequence Data Type

      Option    Meaning
      sr        Short single-end reads without splicing (-k21 -w11 --sr --frag=yes -A2 -B8 -O12,32 -E2,1 -r100 -p.5 -N20 -f1000,5000 -n2 -m20 -s40 -g100 -2K50m --heap-sort=yes --secondary=no). This is the default mode.
      map-ont   Align noisy long reads of ~10% error rate to a reference genome.
      map-hifi  Align PacBio high-fidelity (HiFi) reads to a reference genome (-k19 -w19 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200).
      map-pb    Align older PacBio continuous long (CLR) reads to a reference genome (-Hk19).
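
The Sequence Data Type options correspond to Minimap2 presets, which are selected on the command line with `-x`. As a hedged illustration, the Python sketch below validates a preset and builds a Minimap2 argument list that writes SAM output (`-a`). The helper name and file names are placeholders, and the workflow may pass additional options that are not listed here.

```python
# Presets taken from the Sequence Data Type options above.
PRESETS = {
    "sr": "short reads without splicing",
    "map-ont": "noisy long reads (~10% error rate)",
    "map-hifi": "PacBio HiFi reads",
    "map-pb": "older PacBio CLR reads",
}


def minimap2_command(preset, reference, reads):
    """Build a `minimap2` argument list for the chosen preset."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; choose one of {sorted(PRESETS)}")
    return ["minimap2", "-a", "-x", preset, reference, *reads]


# Placeholder file names for illustration only.
hifi_command = minimap2_command("map-hifi", "ref.fa", ["sample.hifi.fq.gz"])
```

Validating the preset up front gives a clear error message instead of a failed alignment job.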

Outputs

Merged Sorted BAM Files per Sample
  • bams/SampleID.bam
  • bams/SampleID.bam.bai
Variants (genomevariants)
  • DELLY VCF
  • Freebayes
  • Mutect2
  • Strelka2
  • SVABA
  • Union VCF
  • Filtered VCF
  • MAF
MultiQC HTML
  • multiqc_data/multiqc.log
  • multiqc_data/multiqc_data.json
  • multiqc_data/multiqc_fastqc.txt
  • multiqc_data/multiqc_general_stats.txt
  • multiqc_data/multiqc_picard_dups.txt
  • multiqc_data/multiqc_samtools_flagstat.txt
  • multiqc_data/multiqc_samtools_stats.txt
  • multiqc_data/multiqc_sources.txt
  • multiqc_report.html
NGS Checkmate
  • profiling/SamplePair_all.txt
  • profiling/SamplePair_matched.txt
MSI
  • profiling/SamplePair.msi.txt
Sample QC
  • SampleID/SampleID.alnstats.txt
  • SampleID/SampleID.covhist.txt
  • SampleID/SampleID.flagstat.txt
  • SampleID/SampleID.genomecov.txt
  • SampleID/SampleID.libcomplex.txt
  • SampleID/SampleID_exoncoverage.txt
  • SampleID/SampleID_lowcoverage.txt
  • SampleID/SampleID_fastqc.html
  • SampleID/SampleID_fastqc.zip

Workflow Walkthrough

  1. Navigate to the Genomics Variant Analysis workflow launcher on the Form Bio platform. You can locate the workflow using the search bar at the top right corner, or by using the Google DeepOmics, Functional Genomics, Precision Medicine, or Next-Generation Sequencing filters on the left-hand side.
  2. Select the version from the dropdown versioning menu in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click “Run Workflow”.
  3. Select the type of input data to be analyzed. Currently, four types are supported: Germline, Ancient DNA, Somatic, and Viral/Prokaryotic/Synthetic Genome. Also provide the platform that was used to collect the data, and select the type of analysis to be run. Finally, provide the directory containing the files to be analyzed and a file relating RunIDs, LibraryIDs, and SampleIDs (this table can be created within the workflow itself).
  4. Select a reference genome to which the input data will be compared. You may also optionally upload a BED file detailing genomic regions of note.
  5. Tune additional parameters related to your workflow run. These parameters may change depending on your input data.
  6. Give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.

Results Walkthrough

  1. To view results for your Genomics Variant Analysis workflow, first find your workflow run in the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
  2. Upon selection, results from your workflow run are summarized in the Results tab. HTML output files can be previewed or opened from here.
  3. Under the All Files tab, you can view the final HTML file, which is nested in the output folder. You may view or download this file. This file can also be found in the File Explorer.

Citations

  1. Krueger, F., James, F., Ewels, P., Afyounian, E. & Schuster-Boeckler, B. FelixKrueger/TrimGalore: V0.6.7 - DOI via Zenodo. (2021) doi:10.5281/ZENODO.5127899.
  2. Chen, S. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2, e107 (2023).
  3. Li, H. [Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM](https://doi.org/arXiv:1303.3997 [q-bio.GN]). arXiv preprint arXiv 00, 3 (2013).
  4. Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
  5. Jain, C., Rhie, A., Hansen, N. F., Koren, S. & Phillippy, A. M. Long-read mapping to repetitive reference sequences using Winnowmap2. Nature Methods 19, 705–710 (2022).
  6. Thomer, A. K., Twidale, M. B., Guo, J. & Yoder, M. J. Picard Tools. in Conference on Human Factors in Computing Systems - Proceedings (2016).
  7. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
  8. Andrews, S. et al. FastQC. (2012).
  9. Quinlan, A. R. & Hall, I. M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
  10. Okonechnikov, K., Conesa, A. & García-Alcalde, F. Qualimap 2: Advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics 32, 292–294 (2016).
  11. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. (2012).
  12. Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
  13. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics 43, 491–498 (2011).
  14. Yun, T. et al. Accurate, scalable cohort variant calls using DeepVariant and GLnexus. (2020) doi:10.1101/2020.02.10.942086.
  15. Lin, M. F. et al. GLnexus: Joint variant calling for large cohort sequencing. (2018) doi:10.1101/343970.
  16. Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly 6, 80–92 (2012).
  17. Schubert, M., Lindgreen, S. & Orlando, L. AdapterRemoval v2: Rapid adapter trimming, identification, and read merging. BMC Research Notes 9, 88 (2016).
  18. Wood, D. E. & Salzberg, S. L. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biology 15, R46 (2014).
  19. Schubert, M. et al. Characterization of ancient and modern genomes by SNP detection and phylogenomic and metagenomic analysis using PALEOMIX. Nature Protocols 9, 1056–1082 (2014).
  20. Jónsson, H., Ginolhac, A., Schubert, M., Johnson, P. L. F. & Orlando, L. mapDamage2.0: Fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics 29, 1682–1684 (2013).
  21. Neukamm, J., Peltzer, A. & Nieselt, K. DamageProfiler: Fast damage pattern calculation for ancient DNA. Bioinformatics 37, 3652–3653 (2021).
  22. Ewels, P., Magnusson, M., Lundin, S. & Käller, M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048 (2016).
  23. Rambaut, A. et al. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nature Microbiology 5, 1403–1407 (2020).
  24. Kim, S. et al. Strelka2: Fast and accurate calling of germline and somatic variants. Nature Methods 15, 591–594 (2018).
  25. Benjamin, D. et al. Calling Somatic SNVs and Indels with Mutect2. (2019) doi:10.1101/861054.
  26. Lee, S. et al. NGSCheckMate: Software for validating sample identity in Next-generation sequencing studies within and across data types. Nucleic Acids Research 45, e103 (2017).
  27. Jia, P. et al. MSIsensor-pro: Fast, Accurate, and Matched-normal-sample-free Detection of Microsatellite Instability. Genomics, Proteomics and Bioinformatics 18, 65–71 (2020).
  28. Clement, K. et al. CRISPResso2 provides accurate and rapid genome editing sequence analysis. Nature Biotechnology 37, 224–226 (2019).

DeepSomatic

This workflow can be used to identify single-nucleotide variants, indels, and structural variants in diploid genome resequencing projects by comparison to a reference genome.

Version 1.7.0

Use Cases

  • Determine variants in DNA samples compared to a reference genome including single nucleotide variants (SNVs), insertions and deletions
    • Somatic Variant Calling
  • Supported sequencing platforms include Illumina, PacBio, and Oxford Nanopore (ONT)

Summary

This workflow is designed to run DeepSomatic with BAM files.

Methods

Variants are detected with joint calling using DeepSomatic to produce VCF files. Variant effects are determined using SnpEff [1].

Inputs

  • Run Name: This is a unique name for each run of pipelines in your project
  • Organism: Reference Genome used for alignment
  • Reference Genome Annotation: Annotation that should be used for determining gene and transcript counts.
  • Input Folder: This is the folder that contains all of the fastq files that will be used in this analysis
  • Sample Description File
    • This file matches the sequence files to samples; sequence data from multiple runs will be merged if they have the same SampleID
    • The RunID should be part of the fastq file names.
    • GroupID denotes samples to be analyzed jointly
    • File Format
      RunID      SampleID      GroupID   SampleType
      SRR994739  SAMEA9454349  Tumor A   Tumor
      SRR994740  SAMEA9454349  Normal A  Normal
      SRR994741  SAMEA9454341  Tumor B   Tumor
      SRR994742  SAMEA9454342  Family B  Normal
  • Capture Bedfile
    • The intervals in capture BED file indicate regions where alignments are expected based on the target capture kit.
    • Make sure the file contains no column names (header row).
    • The fourth column can hold a region name, which is used to identify poorly captured regions.
    • SeqName  Start    End      Name
      chr1     1787293  1787413  GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_1
      chr1     1787353  1787473  GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_2
      chr1     1789040  1789160  GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_1
      chr1     1789160  1789280  GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_2
      chr1     1790375  1790495  GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_1
      chr1     1790495  1790615  GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_2
      chr1     1793187  1793307  GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_1
      chr1     1793247  1793367  GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_2
      chr1     1804380  1804500  GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_1
      chr1     1804500  1804620  GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_2
      chr1     1806416  1806536  GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_1
      chr1     1806476  1806596  GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_2
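
The merging rule for the Sample Description File can be illustrated with a short Python sketch. It is not the platform's implementation. It assumes the file is tab-delimited with a header row, and the function names and the fastq file names in the usage note are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Example sheet copied from the table above (tab-delimited is an assumption).
SHEET = (
    "RunID\tSampleID\tGroupID\tSampleType\n"
    "SRR994739\tSAMEA9454349\tTumor A\tTumor\n"
    "SRR994740\tSAMEA9454349\tNormal A\tNormal\n"
    "SRR994741\tSAMEA9454341\tTumor B\tTumor\n"
    "SRR994742\tSAMEA9454342\tFamily B\tNormal\n"
)


def runs_by_sample(sheet_text):
    """Group RunIDs by SampleID; runs sharing a SampleID would be merged."""
    reader = csv.DictReader(io.StringIO(sheet_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row["SampleID"]].append(row["RunID"])
    return dict(groups)


def fastqs_for_run(run_id, file_names):
    """Return the file names that contain the given RunID, sorted."""
    return sorted(name for name in file_names if run_id in name)
```

For example, `runs_by_sample(SHEET)` places SRR994739 and SRR994740 under SAMEA9454349, and `fastqs_for_run("SRR994739", names)` selects the files in `names` whose names contain that RunID.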

Outputs

  • Variants (genomevariants)
    • DeepVariant VCF
    • MAF

Workflow Walkthrough

  1. Navigate to the DeepSomatic launcher card. You can use the search bar at the top right corner, or use the Google DeepOmics tags to find the workflow card.
  2. Select the version from the dropdown versioning menu in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click “Run Workflow”.
  3. Select the sequencer platform that was used to generate the data: Illumina, PacBio, or Oxford Nanopore. Also provide the directory containing the files to be analyzed as well as a file containing RunIDs, LibraryIDs and SampleIDs (this table can be created within the workflow itself).
  4. On the next tab, provide a BED file containing the genomic regions to be analyzed. You may also optionally upload a BED file detailing genomic regions of note.
  5. Finally, give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.

Results Walkthrough

  1. To view results for your DeepSomatic workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
  2. Upon selection, results from your workflow run are summarized in the Results tab.
  3. Navigate to the All Files tab to view and download analysis outputs in the output folder. These folders are also available in the File Explorer.

DeepTrio

This workflow identifies single-nucleotide variants, insertions, and deletions in probands and their parents by comparison to a reference genome, for resequencing projects in diploid species.

Version 1.7.0

Use Cases

  • Determine variants in DNA samples compared to a reference genome including single nucleotide variants (SNVs), insertions and deletions
    • Germline Variant Calling
  • Supported sequencing platforms include Illumina, PacBio, and Oxford Nanopore (ONT)

Summary

This workflow is designed to run DeepTrio with BAM files from probands and their parents.

Methods

Variants are detected with joint calling using DeepTrio to produce VCF files. Variant effects are determined using SnpEff [1].

Inputs

  • Run Name: This is a unique name for each run of pipelines in your project
  • Organism: Reference Genome used for alignment
  • Reference Genome Annotation: Annotation that should be used for determining gene and transcript counts.
  • Input Folder: This is the folder that contains all of the fastq files that will be used in this analysis
  • Sample Description File
    • This file matches the sequence files to samples; sequence data from multiple runs will be merged if they have the same SampleID
    • The RunID should be part of each fastq file name.
    • GroupID denotes samples to be analyzed jointly
    • File Format
    • RunID      SampleID      GroupID   SampleType
      SRR994739  SAMEA9454349  Family A  Proband
      SRR994740  SAMEA9454349  Family A  Parent
      SRR994741  SAMEA9454341  Family A  Parent
  • Capture Bedfile
    • The intervals in the capture BED file indicate regions where alignments are expected, based on the target capture kit.
    • Make sure the file contains no column names (no header row).
    • The fourth column can hold a region name, which is used to identify poorly captured regions.
    • SeqName  Start    End      Name
      chr1     1787293  1787413  GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_1
      chr1     1787353  1787473  GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_2
      chr1     1789040  1789160  GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_1
      chr1     1789160  1789280  GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_2
      chr1     1790375  1790495  GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_1
      chr1     1790495  1790615  GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_2
      chr1     1793187  1793307  GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_1
      chr1     1793247  1793367  GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_2
      chr1     1804380  1804500  GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_1
      chr1     1804500  1804620  GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_2
      chr1     1806416  1806536  GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_1
      chr1     1806476  1806596  GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_2
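
Before submitting a run, it can help to check that each GroupID in the Sample Description File describes a complete family. The Python sketch below is one way to do that. It assumes that a complete group has exactly one Proband sample and two Parent samples. That rule and the function name are assumptions made for this illustration, not documented platform behavior.

```python
from collections import defaultdict

# Rows copied from the example table above:
# (RunID, SampleID, GroupID, SampleType)
TRIO_ROWS = [
    ("SRR994739", "SAMEA9454349", "Family A", "Proband"),
    ("SRR994740", "SAMEA9454349", "Family A", "Parent"),
    ("SRR994741", "SAMEA9454341", "Family A", "Parent"),
]


def incomplete_trios(rows):
    """Return {GroupID: (proband_count, parent_count)} for groups that do not
    have exactly one Proband and two Parent samples (assumed rule)."""
    members = defaultdict(lambda: defaultdict(set))
    for _run_id, sample_id, group_id, sample_type in rows:
        # Count distinct SampleIDs, since several runs may share one sample.
        members[group_id][sample_type].add(sample_id)
    problems = {}
    for group_id, by_type in members.items():
        n_proband = len(by_type.get("Proband", ()))
        n_parent = len(by_type.get("Parent", ()))
        if n_proband != 1 or n_parent != 2:
            problems[group_id] = (n_proband, n_parent)
    return problems
```

An empty result means every group passed the check. A group with a missing parent is reported with its counts so the sheet can be corrected.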

Outputs

  • Variants (genomevariants)
    • DeepVariant VCF
    • MAF

Workflow Walkthrough

  1. Navigate to the DeepTrio launcher card. You can use the search bar at the top right corner, or use the Google DeepOmics tags to find the workflow card.
  2. Select the version from the dropdown versioning menu in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click “Run Workflow”.
  3. Select the sequencer platform that was used to generate the data: Illumina, PacBio, or Oxford Nanopore. Choose whether or not to split large fastq files into smaller files by checking the “Whole Genome Sequencing Parallelization” box (warning: may be more costly). Also provide the directory containing the files to be analyzed as well as a file containing RunIDs, LibraryIDs and SampleIDs (this table can be created within the workflow itself).
  4. On the next tab, select a reference genome to which the input data will be compared. You may also optionally upload a BED file detailing genomic regions of note.
  5. Finally, give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.

Results Walkthrough

  1. To view results for your DeepTrio workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
  2. Upon selection, results from your workflow run are summarized in the Results tab.
  3. Navigate to the All Files tab to view and download analysis outputs in the output folder. These folders are also available in the File Explorer.

DeepVariant

This workflow runs DeepVariant on BAM files.

Version 1.7.0

Use Cases

  • Determine variants in DNA samples compared to a reference genome including single nucleotide variants (SNVs), insertions, deletions and structural variants
    • Germline Variant Calling
  • Determine variants in DNA samples compared to a custom reference genome for small or synthetic genomes
    • Plasmid
    • Virus
    • Bacteria
    • Synthetic Genome
  • Supported sequencing platforms include Illumina, PacBio, and Oxford Nanopore (ONT)

Summary

This workflow is designed to run DeepVariant with BAM files. Workflows can be run with either Parabricks or native open-source tools (NOST).

Methods

Variants are detected with joint calling using DeepVariant [1] to produce gVCF files. Genotyping of the gVCF files is performed using GLnexus [2]. Variant effects are determined using SnpEff [3].

Inputs

  • Run Name: This is a unique name for each run of pipelines in your project
  • Organism: Reference Genome used for alignment
  • Reference Genome Annotation: Annotation that should be used for determining gene and transcript counts.
  • Input Folder: This is the folder that contains all of the fastq files that will be used in this analysis
  • Sample Description File
    • This file matches the sequence files to samples; sequence data from multiple runs will be merged if they have the same SampleID
    • The RunID should be part of each fastq file name.
    • GroupID denotes samples to be analyzed jointly
    • File Format
    • RunID      SampleID      GroupID
      SRR994739  SAMEA9454349  Family A
      SRR994740  SAMEA9454349  Family A
      SRR994741  SAMEA9454341  Family A
      SRR994742  SAMEA9454342  Family B
  • Capture Bedfile
    • The intervals in the capture BED file indicate regions where alignments are expected, based on the target capture kit.
    • Make sure the file contains no column names (no header row).
    • The fourth column can hold a region name, which is used to identify poorly captured regions.
    • SeqName  Start    End      Name
      chr1     1787293  1787413  GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_1
      chr1     1787353  1787473  GNB1:GNB1_chr1:1718769-1718876:chr1:1718769-1718876_2
      chr1     1789040  1789160  GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_1
      chr1     1789160  1789280  GNB1:GNB1_chr1:1720491-1720708:chr1:1720491-1720708_2
      chr1     1790375  1790495  GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_1
      chr1     1790495  1790615  GNB1:GNB1_chr1:1721833-1722035:chr1:1721833-1722035_2
      chr1     1793187  1793307  GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_1
      chr1     1793247  1793367  GNB1:GNB1_chr1:1724683-1724750:chr1:1724683-1724750_2
      chr1     1804380  1804500  GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_1
      chr1     1804500  1804620  GNB1:GNB1_chr1:1735857-1736020:chr1:1735857-1736020_2
      chr1     1806416  1806536  GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_1
      chr1     1806476  1806596  GNB1:GNB1_chr1:1737913-1737977:chr1:1737913-1737977_2
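
The capture BED requirements above (no header row, optional region name in the fourth column) can be checked before upload. The Python sketch below is a minimal, hypothetical validator. It assumes tab-separated fields, and it adds a check that Start is smaller than End, which the requirements above do not state.

```python
def parse_bed_line(line, line_number=1):
    """Parse one capture BED line into (chrom, start, end, name).

    Raises ValueError for header rows or malformed intervals.
    Tab-separated fields are an assumption of this sketch.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"line {line_number}: expected at least 3 columns")
    chrom, start_text, end_text = fields[0], fields[1], fields[2]
    # A header such as "SeqName Start End Name" fails this numeric check.
    if not (start_text.isdigit() and end_text.isdigit()):
        raise ValueError(f"line {line_number}: Start and End must be integers")
    start, end = int(start_text), int(end_text)
    if start >= end:
        raise ValueError(f"line {line_number}: Start must be less than End")
    # The fourth column, when present, is the region name.
    name = fields[3] if len(fields) > 3 else None
    return chrom, start, end, name
```

Running this over every line of the file and stopping at the first ValueError gives a quick check that no column names were left in.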

Outputs

  • Variants (genomevariants)
    • DeepVariant VCF
    • MAF

Workflow Walkthrough

  1. Navigate to the DeepVariant launcher card. You can use the search bar at the top right corner, or use the Google DeepOmics tags to find the workflow card.
  2. Select the version from the dropdown versioning menu in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click “Run Workflow”.
  3. Select the sequencer platform that was used to generate the data: Illumina, PacBio, or Oxford Nanopore. Choose which algorithm to run based on the input data: Parabricks DeepVariant, DeepVariant for RNAseq, or DeepVariant. Choose whether or not to split large fastq files into smaller files by checking the “Whole Genome Sequencing Parallelization” box (warning: may be more costly).
  4. On the same tab, name the variant result output. Also provide the directory containing the files to be analyzed as well as a file relating SampleIDs to GroupIDs (this table can be created within the workflow itself).
  5. On the next tab, select a reference genome to which the input data will be compared. You may also optionally upload a BED file detailing genomic regions of note.
  6. Finally, give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.

Results Walkthrough

  1. To view results for your DeepVariant workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
  2. Upon selection, results from your workflow run are summarized in the Results tab.
  3. Navigate to the All Files tab to view and download analysis outputs in the output folder. These folders are also available in the File Explorer.

Citations

  1. Yun, T. et al. Accurate, scalable cohort variant calls using DeepVariant and GLnexus. (2020) doi:10.1101/2020.02.10.942086.
  2. Lin, M. F. et al. GLnexus: Joint variant calling for large cohort sequencing. (2018) doi:10.1101/343970.
  3. Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly 6, 80–92 (2012).

Sequencer Raw Data to FastQ

Converts Sequencer Data to FastQ; includes DeepConsensus for PacBio.

Version 0.0.4

Use Cases

  • The user has completed PacBio HiFi sequencing
  • The user has completed Illumina sequencing
  • The user has completed Oxford Nanopore sequencing

Summary

This workflow converts raw data from a sequencer into FastQ format.

Methods

If the input data is PacBio subread uBAMs or the sequencer run folder, consensus contig reads are created using circular consensus sequencing (CCS) [1]. Optionally, DeepConsensus can be used to improve basecalls [2]. If the input data is ONT fast5 data, basecalling is performed using Dorado [3]. If the input data is an Illumina sequencer run folder, bcl2fastq is run to demultiplex the reads using sample barcodes and create fastq files [4].
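
The branching described above can be summarized as a small lookup. The Python sketch below only mirrors the text. The input-type labels (`pacbio_subreads`, `pacbio_run_folder`, `ont_fast5`, `illumina_run_folder`) and the function name are hypothetical and are not platform parameter names.

```python
def conversion_steps(input_type, use_deepconsensus=False):
    """Return the ordered tool names applied to a given input type.

    The input_type labels are hypothetical names chosen for this sketch.
    """
    if input_type in ("pacbio_subreads", "pacbio_run_folder"):
        steps = ["CCS"]
        if use_deepconsensus:
            # DeepConsensus is optional and runs after CCS.
            steps.append("DeepConsensus")
        return steps
    if input_type == "ont_fast5":
        return ["Dorado"]
    if input_type == "illumina_run_folder":
        return ["bcl2fastq"]
    raise ValueError(f"unsupported input type: {input_type!r}")
```

For example, `conversion_steps("pacbio_subreads", use_deepconsensus=True)` lists CCS followed by DeepConsensus, while an Illumina run folder maps to bcl2fastq alone.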

Inputs

Mandatory

High-Throughput Sequence Data

  • Input data folder: PacBio BAMs, ONT fast5 files, or a run folder from PacBio (includes subread XML) or Illumina
  • BAMs
    • PacBio CCS BAM - Sequence file from PacBio Machine run in AAV Mode OR run through recall adapter and CCS to create consensus sequences
    • PacBio Subreads BAM - Files from PacBio Machine without any preprocessing (not recommended)
  • Data Folder
    • PacBio Run Folder, includes subreads, XML, etc
  • FastQ
    • FastQ files generated from CCS BAM files

Optional Inputs

  • Barcode Sequence
    • used to separate multiplexed samples by sequence barcodes or indexes
  • MASseq Adapters
    • used in masseq applications

Outputs

FastQ files per Sample

Workflow Walkthrough

  1. Navigate to the Sequencer Raw Data to FastQ launcher card. You can use the search bar at the top right corner, or use the Google DeepOmics tags to find the workflow card.
  2. Select the version from the dropdown versioning menu in the top right corner. On this page, you can find information about the workflow analysis. When ready to begin, click “Run Workflow”.
  3. Select the sequence input type from the drop-down menu. Also provide the directory containing the files to be analyzed as well as a file containing the barcode sequence for each sample (this table can be created within the workflow itself).
  4. On the next tab, choose the data assay type (or ONT basecalling model) based on the sequence input type that was selected on the first tab. Configure any additional parameters. These parameters will differ for Illumina vs PacBio vs ONT fast5 input types:
    • Illumina parameter interface (image)
    • PacBio parameter interface (image)
    • ONT fast5 parameter interface (image)
  5. Finally, give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.

Results Walkthrough

  1. To view results for your Sequencer Raw Data to FastQ workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
  2. Upon selection, results from your workflow run are summarized in the Results tab.
  3. Navigate to the All Files tab to view and download analysis outputs in the output folder. These folders are also available in the File Explorer.

Citations

  1. Travers, K. J., Chin, C.-S., Rank, D. R., Eid, J. S. & Turner, S. W. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Research 38, e159–e159 (2010).
  2. Baid, G. et al. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nature Biotechnology 41, 232–238 (2023).
  3. Chris Seymour, Joyjit Daw, Mike Vella & Mark Bicknell. Dorado is a high-performance, easy-to-use, open source basecaller for Oxford Nanopore reads.
  4. Bcl2fastq.

Genome Assembly

This workflow can take short read Illumina, long read ONT, or PacBio data along with OMNI-C or Hi-C reads for scaffolding to create genome assemblies. This workflow is enhanced with Google DeepOmics tools such as DeepConsensus and DeepPolisher.

Version 1.0.5

Use Cases

  • Assemble genomes from short and/or long-read sequencing files

Summary and Methods

This workflow has been designed to create draft genome assemblies from short and/or long-read FastQ sequencing files. The workflow is also capable of polishing, purging, filtering, and evaluating these assemblies during their creation. If supplied with Fast5 files, the workflow can perform ONT basecalling before assembly. The user will provide as input the short and/or long-read FastQ files. The user will receive as output a draft genome assembly. Click the toggles below to learn more about how this workflow processes short reads and long/mixed reads.

Short Read

Summary

This workflow is designed to help create draft genome assemblies from short-read sequencing data, such as Illumina or WGS. The user will provide as input a directory of sequencing files as well as optional OMNI-C and/or HiC data for polishing. The user will receive as output a draft genome assembly.

Methods

This analysis was performed using the Genome Assembly workflow on the Form Bio platform. Short reads are first interleaved with bbmap [1] if paired-end and not yet interleaved before combining them into a singular file. Once reads are consolidated, they are assembled with Spades [2], MetaSpades [3], Megahit [4], and/or Skesa [5]. Finally, the created assemblies are evaluated with Quast [6], Busco [7], and/or Merqury [8].
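
To make the interleaving step concrete, the Python sketch below shows what interleaving paired-end reads means: forward and reverse records alternate in a single list, pair by pair. This is an illustration only and not how bbmap is invoked by the workflow. Each record is treated as an opaque string, and the function name is hypothetical.

```python
from itertools import chain


def interleave(forward_records, reverse_records):
    """Alternate forward and reverse records: f1, r1, f2, r2, ...

    Each record stands for one complete FASTQ entry (treated as opaque here).
    """
    if len(forward_records) != len(reverse_records):
        raise ValueError("paired-end inputs must contain the same number of records")
    return list(chain.from_iterable(zip(forward_records, reverse_records)))
```

Keeping mates adjacent in one file lets a downstream tool read each pair together from a single input.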

Long Read or Mixed Read

Summary

This workflow is designed to help the user create draft genome assemblies from long-read sequencing data, such as PacBio or ONT data. The user will provide as input a directory of sequencing files as well as optional OMNI-C and/or HiC data for polishing. The user will receive as output a draft genome assembly.

Methods

This analysis was performed using the Genome Assembly workflow on the Form Bio platform. This workflow first performs basecalling if desired. When supplied with Fast5/Pod5 files, ONT basecalling is performed using either Dorado or Bonito. Alternatively, if the input data is PacBio subread uBAMs or the sequencer run folder, consensus contig reads are created using circular consensus sequencing (CCS) [9]. Optionally, DeepConsensus [10] can be used to improve basecalls. If Bonito is run, modbam2bed can optionally be run to create a methyl BED file. Using the provided input FastQ files or ONT basecalls, a draft assembly is created with Flye [11], Shasta [12], Verkko [13], and/or Hifiasm [14]. Once draft assemblies are made, they are polished with the selected polishers, which run in the following order: racon [15], medaka consensus, pilon [16], juicer [17] + 3ddna [18], TGS Gap Closer [19], DeepPolisher [20], Ragtag Scaffold [21], Ragtag Patch [21], Gapless [22]. After polishing, optional purging can be run with the Purge Dups workflow using minimap2 [23], and small contigs can be filtered out with Seqtk. Between all steps in the workflow, the created assemblies are evaluated with Quast [6], Busco [24], and/or Merqury [8].
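
Because polishers run in a fixed order regardless of the order in which they are selected, the ordering can be expressed as a simple filter over the documented sequence. The Python sketch below illustrates this. The function name and the polisher labels used as identifiers are choices made for this example, not platform parameter values.

```python
# Polishing order as listed in the Methods text above.
POLISHER_ORDER = [
    "racon",
    "medaka consensus",
    "pilon",
    "juicer + 3ddna",
    "TGS Gap Closer",
    "DeepPolisher",
    "Ragtag Scaffold",
    "Ragtag Patch",
    "Gapless",
]


def order_polishers(selected):
    """Return the selected polishers arranged in the documented run order."""
    selected = set(selected)
    unknown = selected - set(POLISHER_ORDER)
    if unknown:
        raise ValueError(f"unknown polishers: {sorted(unknown)}")
    return [name for name in POLISHER_ORDER if name in selected]
```

For example, selecting pilon, DeepPolisher, and racon in any order yields racon first, then pilon, then DeepPolisher.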

Inputs

  • Folder containing Fast5/Pod5 ONT reads (optional)
  • Folder containing PacBio reads (optional)
  • FastQ file of ONT reads (optional)
  • FastQ file of PacBio HiFi reads (optional)
  • FastQ files of forward and reverse Hi-C or OMNI-C reads (optional)
  • Folder of WGS Reads (optional)
  • FastA file of assembly of a related organism, preferably within family level (optional)

Mandatory Inputs - Sequencing data files

Parameters:

  • Type of sequencing data input (PacBio, ONT, etc)
  • Basecalling algorithm
  • Assembly algorithm(s)
  • Polishing algorithm(s)
  • Evaluation algorithm(s)

Outputs

  • ONT FastQ Basecall file (only if basecalling is selected)
  • Draft assembly in FastA format
  • Polished assembly in FastA format
  • Optional assembly evaluation step outputs

Workflow Walkthrough

  1. Navigate to the Genome Assembly launcher card. You can use the search bar at the top right corner, or use the Google DeepOmics or Genomics tags to find the workflow card.
  2. Select the version from the dropdown box in the top right corner. When ready to begin analysis, click “Run Workflow”.
  3. Select which type of sequencing technology was used to collect the input data. PacBio and ONT are long read technologies that can be combined with each other as well as OMNI-C and Hi-C sequencing reads for the polishing steps. Illumina is a short read technology. WGS (whole genome shotgun sequencing) takes in a directory of short read FastQ files for input to create a singular assembly. Determine whether to run basecalling. Finally, provide the sequencing data file(s) to be analyzed (or directory, in the case of WGS). Parameters and other fields will differ depending on input type.
    • Example setup for an ONT assembly input (image)
  4. On the next tab, select desired assembler algorithms. There are some default options already selected.
  5. For long read data, select desired polishing options and post-assembler algorithms on the next tab. There are some default options already selected.
  6. On the next tab, determine which evaluation algorithm to run.
    • When using Busco, you’ll be asked to choose a lineage - if you’re unsure, you can select “auto-lineage” which will attempt to determine the input lineage, but will result in a longer runtime.
    • If running Merqury, you’ll be asked for an expected genome size in bytes.
  7. Finally, give your workflow run a unique name, and review the input data and run parameters. When ready to submit, click “Run Workflow”.

Results Walkthrough

  1. To view results for your Genome Assembly workflow, first find your workflow run from the Activity tab of the platform. You can use the search bar to search for it. Select your workflow run for more information.
  2. Upon selection, results from your workflow run are summarized in the Results tab. HTML output files can be previewed or opened from here.
  3. Under the All Files tab, you can view the final HTML files, which are nested in the output folder. You may view or download these files. They can also be found in the File Explorer.

Citations

  1. Bushnell, B. BBMap: A Fast, Accurate, Splice-Aware Aligner. (2014).
  2. Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A. & Korobeynikov, A. Using SPAdes De Novo Assembler. Current Protocols in Bioinformatics 70, e102 (2020).
  3. Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: A new versatile metagenomic assembler. Genome Research 27, 824–834 (2017).
  4. Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).
  5. Souvorov, A., Agarwala, R. & Lipman, D. J. SKESA: Strategic k-mer extension for scrupulous assemblies. Genome Biology 19, 153 (2018).
  6. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: Quality assessment tool for genome assemblies. Bioinformatics (Oxford, England) 29, 1072–1075 (2013).
  7. Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molecular Biology and Evolution 38, 4647–4654 (2021).
  8. Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: Reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21, 245 (2020).
  9. Travers, K. J., Chin, C.-S., Rank, D. R., Eid, J. S. & Turner, S. W. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Research 38, e159 (2010).
  10. Baid, G. et al. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nature Biotechnology 41, 232–238 (2023).
  11. Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology 37, 540–546 (2019).
  12. Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nature Biotechnology 38, 1044–1053 (2020).
  13. Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nature Biotechnology 41, 1474–1482 (2023).
  14. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18, 170–175 (2021).
  15. Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research 27, 737–746 (2017).
  16. Walker, B. J. et al. Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLOS ONE 9, e112963 (2014).
  17. Durand, N. C. et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Systems 3, 95–98 (2016).
  18. Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science (New York, N.Y.) 356, 92–95 (2017).
  19. Xu, M. et al. TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads. GigaScience 9, giaa094 (2020).
  20. Google/deeppolisher. (2024).
  21. Alonge, M. et al. Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome Biology 23, 258 (2022).
  22. Schmeing, S. & Robinson, M. D. Gapless provides combined scaffolding, gap filling, and assembly correction with long reads. Life Science Alliance 6, e202201471 (2023).
  23. Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
  24. Manni, M., Berkeley, M. R., Seppey, M., Simao, F. A. & Zdobnov, E. M. BUSCO update: Novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. (2021) doi:10.48550/arXiv.2106.11799.