Sequence Alignment

Gene Molecular Evolution

Create a multiple sequence alignment and a phylogenetic tree, and calculate gene conservation.

Version 1.1.2

Use Cases

  • Perform molecular evolutionary analysis.
    • Conduct phylogenetic and conservation analysis of a gene family curated by a user
    • Conduct phylogenetic and conservation analysis of a gene family in publicly available curated ortholog datasets
    • Identify homologies and create an MSA of a gene family identified by a database search
    • Identify genes whose evolutionary rates shift in association with change in a trait

Summary and Methods

Phylogenetic analysis is a scientific method used to study the evolutionary relationships between genes, species, or groups of organisms. It aims to reconstruct the evolutionary history, or phylogeny, by analyzing and comparing genetic characteristics.

One commonly used approach in phylogenetic analysis is Multiple Sequence Alignment (MSA). MSA involves aligning the DNA, RNA, or protein sequences of different organisms to identify regions of similarity and difference. This alignment helps to infer the evolutionary relationships and identify evolutionary changes that have occurred over time. MSA is typically performed using algorithms that optimize the alignment based on sequence similarities, insertions, deletions, and gaps.
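To make the idea of aligned columns concrete, the following minimal Python sketch labels each column of a toy alignment as conserved, variable, or gapped. The function name `classify_columns`, the `-` gap character, and the three toy sequences are illustrative assumptions and are not part of the workflow.

```python
def classify_columns(aligned):
    """Label each alignment column as 'conserved', 'variable', or 'gapped'."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    labels = []
    for column in zip(*aligned):
        if "-" in column:
            labels.append("gapped")      # at least one sequence has a gap here
        elif len(set(column)) == 1:
            labels.append("conserved")   # every sequence carries the same residue
        else:
            labels.append("variable")    # the residues differ between sequences
    return labels


# Toy alignment (made-up data): three sequences, six columns.
toy_alignment = ["ATG-CA", "ATGACA", "ATG-CT"]
print(classify_columns(toy_alignment))
```

Real aligners such as those named later in this document decide where the gaps go; this sketch only inspects an alignment that already exists.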

Once the MSA is obtained, it serves as the basis for constructing a phylogenetic tree. A phylogenetic tree is a graphical representation of the evolutionary relationships between different species or groups. It depicts the branching patterns that connect common ancestors and descendant lineages. In the tree, each branch represents a different species or lineage and the nodes represent hypothetical common ancestors.

Phylogenetic tree construction involves various methods, such as maximum likelihood (ML), maximum parsimony (MP), and Bayesian inference. These methods use the information from the MSA to estimate the most likely evolutionary tree that explains the observed sequence similarities and differences. The resulting tree represents the most probable evolutionary history given the available data.
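As one hedged illustration of how a tree-building criterion scores candidate trees, the sketch below applies Fitch's small-parsimony algorithm to a single alignment column. It only counts the minimum number of changes on a fixed, rooted binary tree. It does not search tree space, and it is not a description of how the tree-building tools named in this document work internally. The nested-tuple tree encoding, the function name `fitch`, and the species names are assumptions made for the example.

```python
def fitch(tree, states):
    """Return (possible_states, min_changes) for a rooted binary tree.

    `tree` is either a leaf name (str) or a 2-tuple of subtrees.
    `states` maps each leaf name to its character in one alignment column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change is needed.
        return shared, left_changes + right_changes
    # The children disagree: keep the union and count one substitution.
    return left_set | right_set, left_changes + right_changes + 1


# One made-up alignment column and two candidate topologies.
column = {"human": "A", "chimp": "A", "mouse": "G", "rat": "G"}
tree_a = (("human", "chimp"), ("mouse", "rat"))
tree_b = (("human", "mouse"), ("chimp", "rat"))

print(fitch(tree_a, column)[1])  # 1 change
print(fitch(tree_b, column)[1])  # 2 changes
```

Under maximum parsimony, `tree_a` would be preferred for this column because it explains the data with fewer changes. Likelihood and Bayesian methods replace this count with a probabilistic model of sequence evolution.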

The internal nodes of a phylogenetic tree can be further classified as either bifurcating (binary) or multifurcating (polytomy). A bifurcating node represents a split into two distinct lineages, such as a speciation event. Polytomies occur when there is insufficient information to resolve the exact branching pattern, representing uncertainty or rapid diversification.
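The distinction between binary splits and polytomies can be checked mechanically once a tree is held in memory. The sketch below assumes a tree encoded as nested Python tuples with leaf names as strings; this encoding and the function name `count_splits` are illustrative assumptions, not a format used by the workflow.

```python
def count_splits(tree):
    """Count bifurcating and multifurcating internal nodes in a nested-tuple tree."""
    counts = {"bifurcating": 0, "multifurcating": 0}

    def visit(node):
        if isinstance(node, str):
            return                            # a leaf has no children to classify
        if len(node) == 2:
            counts["bifurcating"] += 1        # split into exactly two lineages
        elif len(node) > 2:
            counts["multifurcating"] += 1     # polytomy: three or more lineages
        for child in node:
            visit(child)

    visit(tree)
    return counts


# Made-up tree: the root and (A, B) are binary, (C, D, E) is a polytomy.
toy_tree = (("A", "B"), ("C", "D", "E"))
print(count_splits(toy_tree))
```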

Phylogenetic trees can provide insights into various aspects of evolution, including the divergence times between species, the order of speciation events, and the patterns of evolutionary change. They are widely used in fields such as evolutionary biology, systematics, comparative genomics, and ecology to study the relationships and evolutionary history of organisms.

Click the toggles below to learn more about the different starting points of this workflow.

Curated Gene Family

Summary

This workflow is designed to create a phylogenetic tree and multiple sequence alignment from a curated gene family. The user will provide input sequences from a gene family and will receive a report detailing phylogenetic and conservation analysis.

Methods

This workflow takes an input FastA file. If the file contains a single sequence, an MSA can be created from a database search [1, 2]. If the file contains multiple sequences, a multiple sequence alignment can be generated using muscle [3], clustal_omega [4], mafft [5] (einsi, linsi, ginsi, fftnsi, fftns), or dialign_tx [6]. Multiple sequence alignments are generated for single-gene orthologs (one gene per species). Phylogenetic trees can be generated using IQ-TREE [8], IQ-TREE 2 [9], RAxML [10], Phangorn [11], or FastTree [12]. A conservation report is generated by comparing the conservation of the gene by position, comparing sample distances, and using PCA.

This workflow uses Nextflow to orchestrate job distribution [13].
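The per-position conservation and sample-distance comparisons in the report can be pictured with two simple quantities: Shannon entropy per alignment column, and the proportion of differing sites (p-distance) between two aligned sequences. The sketch below assumes these are reasonable stand-ins; the exact metrics the workflow computes are not specified here, and the function names and toy sequences are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column; 0.0 means fully conserved."""
    total = len(column)
    counts = Counter(column)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def p_distance(seq_a, seq_b):
    """Fraction of differing sites, ignoring columns where either sequence has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


# Made-up aligned sequences.
alignment = {"s1": "ATG-CA", "s2": "ATGACT", "s3": "ATG-CA"}
columns = list(zip(*alignment.values()))
print([round(column_entropy(col), 3) for col in columns])
print(p_distance(alignment["s1"], alignment["s2"]))
```

A matrix of such pairwise distances is also a typical input for ordination methods such as PCA.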

Curated Orthologs

Summary

This workflow is designed to help the user conduct phylogenetic and conservation analysis of a gene family by searching publicly available curated ortholog databases. The user will identify a gene of interest and a taxon to search. The user will receive a report detailing phylogenetic and conservation analysis.

Methods

This workflow takes a gene and a taxonomic branch as input and creates an MSA and phylogenetic tree from curated orthologs. Multiple sequence alignments are generated for single-gene orthologs (one gene per species). Phylogenetic trees can be generated using IQ-TREE [8], IQ-TREE 2 [9], RAxML [10], Phangorn [11], or FastTree [12]. A conservation report is generated by comparing the conservation of the gene by position, comparing sample distances, and using PCA.

This workflow uses Nextflow to orchestrate job distribution [13].
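The "one gene per species" restriction can be sketched as a filter over groups of orthologous genes. The example below assumes each group is a list of `(species, gene_id)` pairs and that a group qualifies only if every listed species contributes exactly one gene; the data layout, the function name `single_copy_orthogroups`, and the gene identifiers are all hypothetical.

```python
from collections import Counter


def single_copy_orthogroups(orthogroups, species):
    """Return the groups that contain exactly one gene from every listed species."""
    kept = {}
    for group_id, members in orthogroups.items():
        genes_per_species = Counter(sp for sp, _gene in members)
        # Counter returns 0 for a species that is absent from the group.
        if all(genes_per_species[sp] == 1 for sp in species):
            kept[group_id] = members
    return kept


# Made-up ortholog groups.
orthogroups = {
    "OG1": [("human", "h1"), ("mouse", "m1"), ("zebrafish", "z1")],
    "OG2": [("human", "h2"), ("human", "h3"), ("mouse", "m2"), ("zebrafish", "z2")],
    "OG3": [("human", "h4"), ("mouse", "m3")],
}
print(list(single_copy_orthogroups(orthogroups, ["human", "mouse", "zebrafish"])))
```

Here only `OG1` passes: `OG2` has two human genes and `OG3` lacks a zebrafish gene.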

Run Orthofinder

Summary

This workflow is designed to identify homologies and create an MSA of a gene family as identified by a database search. The user will provide an input directory of files.

Methods

This workflow takes a directory of genome/GFF tar.gz files from several genomes. A multiple sequence alignment can be generated from a multi-sequence FastA file using muscle [3], clustal_omega [4], MAFFT [5] (einsi, linsi, ginsi, fftnsi, fftns), or dialign_tx [6]. The user can also identify orthologous sequences using OrthoFinder [7]. Multiple sequence alignments are generated for single-gene orthologs (one gene per species). Phylogenetic trees can be generated using IQ-TREE [8], IQ-TREE 2 [9], RAxML [10], Phangorn [11], or FastTree [12]. If a phylogenetic tree is provided, the workflow can use phangorn to estimate branch lengths for the sequences while keeping the same topology. A conservation report is generated by comparing the conservation of the gene by position, comparing sample distances, and using PCA.

This workflow uses Nextflow to orchestrate job distribution [13].

Create MSA from Database Sequence Search

Summary

This workflow is designed to help the user create a multiple sequence alignment from a single input sequence. The user will provide a FastA file containing a single sequence of interest. The workflow will perform a database search and the user will receive as output a multiple sequence alignment.

Methods

This workflow takes a FastA file with a single sequence and uses it to create an MSA from a database search [1, 2]. Phylogenetic trees can be generated using IQ-TREE [8], IQ-TREE 2 [9], RAxML [10], Phangorn [11], or FastTree [12]. If a phylogenetic tree is provided, the workflow can use phangorn to estimate branch lengths for the sequences while keeping the same topology. A conservation report is generated by comparing the conservation of the gene by position, comparing sample distances, and using PCA.

This workflow uses Nextflow to orchestrate job distribution [13].
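Whether a FastA file holds one sequence or several determines which starting point applies. A minimal sketch of that check follows; the parser is deliberately simple, and the function names and the returned labels `"database_search"` and `"multiple_sequence_alignment"` are hypothetical, not identifiers used by the platform.

```python
def read_fasta(text):
    """Parse FastA text into a dict mapping each header line to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def starting_point(text):
    """Suggest a starting point from the number of sequences in the file."""
    n_records = len(read_fasta(text))
    if n_records == 0:
        raise ValueError("no FastA records found")
    return "database_search" if n_records == 1 else "multiple_sequence_alignment"


# Made-up FastA content.
single = ">geneX\nATGGCC\nTTAA\n"
multi = ">geneX_human\nATGGCC\n>geneX_mouse\nATGGCA\n"
print(starting_point(single), starting_point(multi))
```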

Inputs

Multiple Sequence FastA of Homologous Genes

Outputs

Alignment FastA

Workflow Walkthrough

  1. Navigate to the Gene Molecular Evolution workflow on the Form Bio platform. You can use the search bar to locate this workflow, or select the Sequence Alignment workflow on the left-hand side.
  2. Select the version from the dropdown menu, and click “Run Workflow” when ready to begin.
  3. Currently, four starting points are supported. An MSA can be created from a curated gene family, a set of orthologs, an OrthoFinder run, or a database search. Upload the associated files or inputs pertaining to your starting point.
  4. Tune the parameters related to the creation of the MSA, including the alignment algorithm, the phylogenetic tree algorithm, and the sensitivity. Additional context for each parameter is provided under “Info” on the right.
  5. Take a moment to review workflow inputs and parameters, and give the workflow run a unique name. When ready to submit, click “Run Workflow”.

Results Walkthrough

  1. To view the results of your Gene Molecular Evolution workflow run, first find and select your workflow run in the Activity tab.
  2. Navigate to the Files tab, and then outputs. Here, all output analysis files are nested.

Citations

  1. Eddy, S. R. Accelerated Profile HMM Searches. PLoS Computational Biology 7, e1002195 (2011).
  2. Steinegger, M. et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 20, 473 (2019).
  3. Edgar, R. C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32, 1792–1797 (2004).
  4. Sievers, F. & Higgins, D. G. Clustal Omega for making accurate alignments of many protein sequences. Protein Science 27, 135–145 (2018).
  5. Katoh, K. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research 30, 3059–3066 (2002).
  6. Morgenstern, B., Frech, K., Dress, A. & Werner, T. DIALIGN: Finding local similarities by multiple sequence alignment. Bioinformatics 14, 290–294 (1998).
  7. Emms, D. M. & Kelly, S. OrthoFinder: Phylogenetic orthology inference for comparative genomics. Genome Biology 20, 238 (2019).
  8. Nguyen, L.-T., Schmidt, H. A., Von Haeseler, A. & Minh, B. Q. IQ-TREE: A Fast and Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies. Molecular Biology and Evolution 32, 268–274 (2015).
  9. Minh, B. Q. et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular Biology and Evolution 37, 1530–1534 (2020).
  10. Stamatakis, A. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
  11. Schliep, K. P. Phangorn: Phylogenetic analysis in R. Bioinformatics 27, 592–593 (2011).
  12. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE 5, e9490 (2010).
  13. Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nature Biotechnology 35, 316–319 (2017).

Sequence Similarity Search

Find gene sequences that are significantly similar to known query sequences.

Version 1.1.1

Use Case

Identify similar sequences in a database, where similarity is determined by the statistical significance of the alignment score.

Summary

This workflow uses BLAST [1, 2], Diamond [3], SSEARCH, FASTA36 [4], or Miniprot [5] for sequence comparison and similarity searches in biological databases.

  1. BLAST (Basic Local Alignment Search Tool): BLAST is a widely used algorithm for comparing biological sequences, such as DNA, RNA, or protein sequences, against a large database. It employs a heuristic approach to find regions of local similarity between sequences. BLAST provides a measure of sequence similarity and identifies regions of conservation, which helps in inferring functional and evolutionary relationships between sequences.
  2. Diamond: Diamond is a sequence alignment tool specifically designed for comparing protein sequences against protein sequence databases. It utilizes a fast and sensitive algorithm based on the concept of seed-and-extend alignment. Diamond is known for its high speed and is often used for large-scale protein sequence analysis.
  3. SSEARCH: SSEARCH is a sequence comparison tool that performs rigorous local sequence alignment using the Smith-Waterman algorithm. Because it computes optimal local alignments rather than relying on a heuristic, it is sensitive in detecting distant homologs. SSEARCH is commonly used in protein sequence analysis and is particularly effective when comparing sequences with low similarity.
  4. FASTA36: FASTA36 is a versatile and widely used program for comparing protein and nucleotide sequences against sequence databases. It employs the FASTA algorithm, which is based on local alignment. FASTA36 is known for its sensitivity in identifying distant homologs and can be used for both database searches and pairwise alignments.
  5. Miniprot: Miniprot is a fast protein-to-genome aligner developed by Heng Li, the developer of minimap2. It outputs alignments in PAF (Pairwise mApping Format) and GTF (Gene Transfer Format).
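The Smith-Waterman algorithm mentioned for SSEARCH can be sketched in a few lines. The version below returns only the best local alignment score and uses a simple match/mismatch scheme with a linear gap penalty; these scoring values are assumptions chosen for illustration, whereas production tools typically use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]              # first column is always 0 for local alignment
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere (the "local" part).
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Made-up sequences that share only the substring "ACG".
print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))  # 6
print(smith_waterman_score("AAAA", "TTTT"))            # 0
```

Because every cell of the matrix is filled, the runtime grows with the product of the two sequence lengths, which is why heuristic tools trade some sensitivity for speed on large databases.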

Methods

This analysis was performed using the Sequence Similarity Search workflow on the Form Bio platform. This workflow takes an input FastA file and performs a sequence similarity search with BLAST [1, 2], Diamond [3], SSEARCH, FASTA36 [4], or Miniprot [5].
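Downstream of a search, hits are commonly filtered by statistical significance. The sketch below assumes the results were written in the 12-column tabular layout that BLAST and Diamond can produce (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). Whether a given run of this workflow emits that layout depends on the chosen output format, and the `1e-5` cutoff and the example rows are made up.

```python
BLAST_TABULAR_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def significant_hits(lines, max_evalue=1e-5):
    """Return (query, subject, evalue, bitscore) tuples at or below the E-value cutoff."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(BLAST_TABULAR_FIELDS, line.rstrip("\n").split("\t")))
        evalue = float(row["evalue"])
        if evalue <= max_evalue:
            hits.append((row["qseqid"], row["sseqid"], evalue, float(row["bitscore"])))
    return sorted(hits, key=lambda hit: hit[2])  # most significant first


# Made-up result rows.
example = [
    "query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t11\t210\t3e-50\t180.0",
    "query1\tsubjB\t35.0\t60\t39\t0\t5\t64\t100\t159\t0.8\t22.3",
]
print(significant_hits(example))
```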

Inputs

  • FastA file
  • Algorithm
  • Database or Genome

Outputs

Sequence Alignment Output

Runtime Estimates

Average = 1 hour 19 minutes

Workflow Walkthrough

  1. Navigate to the Sequence Similarity Search workflow. You can use the search tool at the top right corner, or find it using the Sequence Alignment filter on the left side.
  2. Select the version from the dropdown menu in the top right corner. You can view use cases, a summary, and inputs/outputs on this page. When ready to begin analysis, click “Run Workflow”.
  3. Launcher Tabs
    1. Provide a FastA file containing the query sequence(s). Then, select the type of sequence search (nucleotide to nucleotide, nucleotide to protein, etc.), the search algorithm, and the database to search.
    2. Determine the format of the output file (XML, HTML, etc.), and tune additional search parameters depending on your chosen algorithm.
    3. Review workflow parameters and inputs, and give the workflow run a unique name. When ready to submit the job, click “Run Workflow”.

Results Walkthrough

  1. To view the results of your Sequence Similarity Search workflow run, first find and select your workflow run from the Activity tab.
  2. Navigate to the Files tab, then find workflow outputs under the output folder.

Citations

  1. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of Molecular Biology 215, 403–410 (1990).
  2. Pearson, W. R. & Lipman, D. J. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America 85, 2444–2448 (1988).
  3. Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods 18, 366–368 (2021).
  4. wrpearson/fasta36. GitHub repository (master branch).
  5. Li, H. Protein-to-genome alignment with miniprot. Bioinformatics 39, btad014 (2023).

Genome Coordinate Conversion

Convert the location of a set of genomic features, such as genes, transcription factor binding sites, or promoters, from one genome to another.

Version 1.0.1

Use Cases

  • Create a Genome Coordinate Conversion File between two genomes
  • Map the location of a set of genomic features (e.g. genes, transcription factor binding sites, promoters) from the target genome to the query genome
  • Filter the genomic features that are converted to the query genome for overlap with a second set of genomic features. The locations of the features in this separate set are in the query genome

Summary

This workflow is designed to help the user convert the location of a set of genomic features, such as genes, transcription factor binding sites, or promoters, from one genome to another. The user will provide as input a target genome to convert from and a query genome to convert to. If a Genome Coordinate Conversion File is not provided, one may be generated from the two input FastA files. The user will receive as output the location of the genomic features on the query genome, and a Genome Coordinate Conversion File if indicated.

Methods

This analysis was performed using the Genome Coordinate Conversion workflow on the Form Bio platform. The Genome Coordinate Conversion File is a whole-genome alignment, in chain format, between the target genome and the query genome. Only features that lie in regions of homology between the two genomes are mapped. If the Genome Coordinate Conversion File does not yet exist, this workflow can generate one from a pair of genome FastA files using either LastZ [1] or SegAlign [2] (a GPU-optimized version of LastZ). CrossMap [3] uses this chain file to convert the coordinates of genomic features from the target genome to the query genome. The file of genomic features in the target/reference genome can be in BED, VCF, BAM, or MAF format. CrossMap outputs a BED file with the location of these genomic features in the query genome. If provided with a second BED file of genomic features in the query genome, the workflow filters the converted features for overlap with this second BED file [4].
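The core idea of coordinate conversion through aligned regions can be sketched as follows. This is a deliberately simplified model and not CrossMap's implementation: it assumes ungapped, same-strand aligned blocks given as `(target_chrom, target_start, query_chrom, query_start, size)` with 0-based, half-open coordinates, and it converts a feature only if it lies entirely inside one block. Real chain files also encode gaps and strand, and the function name and block values are hypothetical.

```python
def convert_interval(chrom, start, end, blocks):
    """Map a target-genome interval to the query genome, or return None if unmapped."""
    for t_chrom, t_start, q_chrom, q_start, size in blocks:
        inside = chrom == t_chrom and t_start <= start and end <= t_start + size
        if inside:
            offset = start - t_start          # distance from the start of the block
            new_start = q_start + offset
            return (q_chrom, new_start, new_start + (end - start))
    return None  # the feature is not fully inside a region of homology


# One made-up aligned block: target chr1:100-150 aligns to query chrA:1000-1050.
blocks = [("chr1", 100, "chrA", 1000, 50)]
print(convert_interval("chr1", 110, 120, blocks))  # ('chrA', 1010, 1020)
print(convert_interval("chr1", 140, 160, blocks))  # None
```

The second feature extends past the aligned block, which mirrors the statement that only features in regions of homology are mapped.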

Inputs

  • Genome Coordinate Conversion File (see UCSC to learn about chain format: https://genome.ucsc.edu/goldenPath/help/chain.html) OR FastA of Target Genome and FastA of Query Genome.
  • File 1 of genomic features with coordinates in the Target Genome (optional)
    • Can be in VCF, MAF, BAM, or BED format
  • File 2 of Genomic Features with coordinates in the Query Genome (optional)
    • Must be in BED format
    • The genomic features in File 1 converted to the Query Genome will be filtered for intersection of genomic features in File 2.
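The overlap filter applied with File 2 amounts to an interval intersection test. The sketch below assumes features are `(chrom, start, end)` tuples in BED's 0-based, half-open convention and keeps a converted feature if it overlaps any feature in the second set by at least one base. The function names and coordinates are illustrative, and dedicated tools use indexed data structures rather than this quadratic scan.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def filter_by_overlap(converted, reference):
    """Keep converted features that overlap any feature in the reference set."""
    return [feat for feat in converted if any(overlaps(feat, ref) for ref in reference)]


# Made-up features in query-genome coordinates.
converted = [("chrA", 1010, 1020), ("chrA", 5000, 5100), ("chrB", 1010, 1020)]
file_2 = [("chrA", 1015, 1500)]
print(filter_by_overlap(converted, file_2))  # [('chrA', 1010, 1020)]
```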

Mandatory Inputs:

  • Target genome FastA file
  • Query genome FastA file

Optional Inputs:

  • Genome Coordinate Conversion File: necessary for converting genomic regions; if not provided, it is created from the FastA files

Required Parameters:

  • Analysis type (create a Genome Coordinate Conversion File or convert genomic regions using a Genome Coordinate Conversion File)
  • Query genome type (provided in the project or select one of our supported genomes)
  • Target genome type (provided in the project or select one of our supported genomes)

Optional Parameters:

  • Alignment algorithm (the default is LastZ; the user must specify if they wish to use SegAlign)
  • LastZ - Mode: default OR self (alignments between genomes of the same species) OR divergent (for species that diverged more than 150 Mya, e.g. human vs. platypus)
  • LastZ - Linear gap: medium (for species that diverged less than 100 million years ago) OR loose (for more distantly related species)
  • SegAlign - Related: TRUE if target and query genomes are closely related

Outputs

  • Genome Coordinate Conversion File (in chain format) between the Target Genome and the Query Genome ("liftover.chn")
  • BED file with location of genomic features in the Query Genome that lie in regions of homology in the Target Genome
  • BED file of genomic features in the Query Genome, filtered for overlap with Input File 2.

Workflow Walkthrough

  1. Navigate to the Genome Coordinate Conversion workflow on the Form Bio platform. This workflow can also be found using the search bar at the top right or by selecting either the “Genomics” or “Sequence Alignment” filter.
  2. On the launcher page, you can view use-cases, a brief summary of the analysis, and information on inputs and outputs of the workflow. When ready to begin, click “Run Workflow” in the top right corner.
  3. On the inputs tab, select the type of analysis you wish to perform - either creation of a Genome Coordinate Conversion File, or the conversion of genomic regions using a preexisting file. Then, upload the query and target genomes as FastA files.
  4. Review the workflow parameters and file inputs. Give your workflow run a unique name. When ready to submit the job, click “Run Workflow” at the bottom right corner.

Results Walkthrough

  1. To view the results of your Genome Coordinate Conversion workflow, first find and select your workflow run from the Activity tab.
  2. Navigate to the Files tab. Under output, all workflow files will be listed. You may also view these files in the File Explorer on the left-hand side.

Citations

  1. Harris, R. Improved Pairwise Alignment of Genomic DNA. ProQuest (2007).
  2. Goenka, S. D., Turakhia, Y., Paten, B. & Horowitz, M. SegAlign: A scalable GPU-based whole genome aligner. in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 1–13 (IEEE Press, 2020).
  3. Zhao, H. et al. CrossMap: A versatile tool for coordinate conversion between genome assembliesBioinformatics 30, 1006–1007 (2014).
  4. Quinlan, A. R. & Hall, I. M. BEDTools: A flexible suite of utilities for comparing genomic featuresBioinformatics 26, 841–842 (2010).
