SPAdes Assembler Toolkit

The SPAdes toolkit is a powerful bioinformatics tool for genome assembly from next-generation sequencing data. It is widely used for bacterial genomes and is compatible with Illumina and other sequencing platforms. This guide provides a detailed installation tutorial, usage examples, and tips to maximize efficiency.

SPAdes overview

SPAdes (St. Petersburg genome assembler) is primarily developed for Illumina sequencing data but can also be used with IonTorrent data. Most SPAdes pipelines support a hybrid mode, i.e., they accept long reads (PacBio and Oxford Nanopore) as supplementary data. The package enables the assembly of bacterial isolates, single-cell genomes, metagenomes, and transcriptomes, and it features specialized modules for plasmid and RNA virus recovery. It integrates k-mer-based algorithms for read processing, graph manipulation, and sequence alignment, supporting diverse genomic analyses through a modular pipeline architecture [1].

Supported Sequencing Platforms

  • Illumina (MiSeq, HiSeq, NovaSeq)
  • Ion Torrent
  • 454 Roche
  • PacBio and Oxford Nanopore (long reads, as supplementary data in hybrid mode)

Installation

SPAdes requires a 64-bit Linux or macOS system with Python 3.8 or higher pre-installed. To obtain SPAdes, you can either download prebuilt binaries or download the source code and compile it yourself.
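
The requirements above can be checked before installing. The following is a minimal sketch, assuming a bash shell; the helper names `version_ge` and `check_prereqs` are made up for this illustration and are not part of SPAdes.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not shipped with SPAdes.

# Succeed if dotted version $1 is >= dotted version $2 (major.minor only).
version_ge() {
    local have_major have_minor want_major want_minor rest
    IFS=. read -r have_major have_minor rest <<< "$1"
    IFS=. read -r want_major want_minor rest <<< "$2"
    have_minor=${have_minor:-0}
    want_minor=${want_minor:-0}
    (( have_major > want_major ||
       (have_major == want_major && have_minor >= want_minor) ))
}

# Report whether the system is 64-bit and has Python 3.8 or higher.
check_prereqs() {
    local status=0 bits py_version
    bits=$(getconf LONG_BIT 2>/dev/null || echo unknown)
    if [[ $bits == 64 ]]; then
        echo "ok: 64-bit system"
    else
        echo "problem: expected a 64-bit system, got '$bits'"
        status=1
    fi
    if command -v python3 >/dev/null 2>&1; then
        py_version=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
        if version_ge "$py_version" 3.8; then
            echo "ok: Python $py_version"
        else
            echo "problem: Python $py_version is older than 3.8"
            status=1
        fi
    else
        echo "problem: python3 not found in PATH"
        status=1
    fi
    return "$status"
}

check_prereqs || echo "Some prerequisites are not met."
```

The version comparison only looks at the major and minor numbers, which is enough for a "3.8 or higher" check.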

Supported Operating Systems

  • Ubuntu 20.04 LTS and later
  • CentOS/RHEL 8.x
  • macOS 10.15 (Catalina) and later
  • Windows 10/11 with Windows Subsystem for Linux (WSL2)

Dependencies

  • Python 3.8 or later (Linux systems typically include Python)
  • CMake 3.16 or later (only needed when compiling from source)

Installing SPAdes with conda

    # Create a new conda environment 
    conda create -n spades_env python=3.8 
    # Activate the environment 
    conda activate spades_env 
    # Install SPAdes 
    conda install -c bioconda spades

Downloading SPAdes Linux binaries

    # Download the Linux binaries (replace the URL with the latest release version)
    wget https://github.com/ablab/spades/releases/download/v4.0.0/SPAdes-4.0.0-Linux.tar.gz
    tar -xzf SPAdes-4.0.0-Linux.tar.gz
    cd SPAdes-4.0.0-Linux/bin/
    ./spades.py --help
    # Add the bin directory to PATH (optional); adjust /path/to for your system
    export PATH="$PATH:/path/to/SPAdes-4.0.0-Linux/bin"

You can also compile SPAdes from source (requires g++ 9.0+, cmake 3.16+, zlib and libbz2).

    # Prerequisites (Debian/Ubuntu), including the zlib and libbz2 headers
    sudo apt-get install cmake gcc g++ python3-dev zlib1g-dev libbz2-dev
    # Clone the SPAdes repository
    git clone https://github.com/ablab/spades.git
    cd spades
    # Configure and compile
    mkdir build && cd build
    cmake ..
    make
    ./bin/spades.py --help

Example of Assembling RNA-Seq data

rnaSPAdes is the SPAdes pipeline for transcriptome assembly from eukaryotic and prokaryotic short reads. It supports paired-end and single-end libraries as well as hybrid assemblies with PacBio or Nanopore reads. It has some limitations: the --careful and --cov-cutoff options are not available, there are constraints on the pipeline modes it can be used with, and k-mer sizes are selected automatically by default to prevent chimeric transcripts [1].

SPAdes command line

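```bash
# Paths contain a space ("HP P900"), so they must be quoted.
# Each continuation backslash needs a space before it.
spades.py --rna \
  -1 "/media/kashmir/HP P900/Main/deduplicated_1.fastq" \
  -2 "/media/kashmir/HP P900/Main/deduplicated_2.fastq" \
  -m 60 -t 32 -k 33,55,77,99,127 \
  -o "/media/kashmir/HP P900/Main/spades_output"
```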
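
Because the input paths contain a space, a mistake in quoting silently splits one path into two arguments. One way to guard against this is to build the command as a bash array and print it before running it. This is only a sketch: the variable names `base_dir` and `cmd` are chosen for this example, and the options simply mirror the command above.

```bash
#!/usr/bin/env bash
# Directory from the example above; the space in "HP P900" is why quoting matters.
base_dir="/media/kashmir/HP P900/Main"

# Build the command as an array so that each path stays a single argument.
cmd=(
    spades.py --rna
    -1 "$base_dir/deduplicated_1.fastq"
    -2 "$base_dir/deduplicated_2.fastq"
    -m 60 -t 32
    -k 33,55,77,99,127
    -o "$base_dir/spades_output"
)

# Dry run: print the command with shell-safe quoting instead of executing it.
printf '%q ' "${cmd[@]}"
printf '\n'

# To actually start the assembly, uncomment the next line:
# "${cmd[@]}"
```

The printed line can be pasted back into a terminal as is, since `%q` escapes the spaces in the paths.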

SPAdes params.txt

System information:
  SPAdes version: 3.13.1
  Python version: 3.10.12
  OS: Linux-6.8.0-49-generic-x86_64-with-glibc2.35

Output dir: /media/kashmir/HP P900/Main/spades_output
Mode: ONLY assembling (without read error correction)
Debug mode is turned OFF

Dataset parameters:
  RNA-seq mode
  Reads:
    Library number: 1, library type: paired-end
      orientation: fr
      left reads: ['/media/kashmir/HP P900/Main/deduplicated_1.fastq']
      right reads: ['/media/kashmir/HP P900/Main/deduplicated_2.fastq']
      interlaced reads: not specified
      single reads: not specified
      merged reads: not specified
Assembly parameters:
  k: [33, 55, 77, 99, 127]
  Repeat resolution is enabled
  Mismatch careful mode is turned OFF
  MismatchCorrector will be SKIPPED
  Coverage cutoff is turned OFF
Other parameters:
  Dir for temp files: /media/kashmir/HP P900/Main/spades_output/tmp
  Threads: 32
  Memory limit (in Gb): 60

Look for key files

```bash
output_dir/
├── corrected/          # Error-corrected reads
├── scaffolds.fasta     # Final scaffolds
├── contigs.fasta       # Final contigs
├── assembly_graph.fastg # Assembly graph in FASTG format
├── contigs.paths       # Paths in the assembly graph
├── scaffolds.paths     # Scaffold paths
├── params.txt          # Parameters used
└── spades.log          # Log file

```

> For a genome assembly, contigs.fasta contains the assembled contigs in FASTA format; in RNA-seq mode the main output is transcripts.fasta instead. Further validation with tools like QUAST is recommended to confirm assembly accuracy.
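
A quick way to confirm that a run produced the files you expect is to test for them explicitly. The sketch below assumes bash; `check_outputs` is a hypothetical helper written for this post, and the file names are passed in by the caller because they differ between pipeline modes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which expected files exist and are non-empty.
# Usage: check_outputs OUTPUT_DIR FILE [FILE...]
check_outputs() {
    local out_dir=$1 missing=0 name
    shift
    for name in "$@"; do
        if [[ -s "$out_dir/$name" ]]; then
            echo "ok       $name"
        else
            echo "missing  $name"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if nothing is missing.
    (( missing == 0 ))
}

# Example call (adjust the directory and file list to your run):
# check_outputs "/path/to/spades_output" transcripts.fasta params.txt spades.log
```

The function returns a non-zero status when any file is missing or empty, so it can be used in a script to stop before downstream analysis.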

The end of the log file

```bash
===== Assembling finished. Used k-mer sizes: 33, 55, 77, 99, 127 

* Assembled transcripts are in "/media/kashmir/HP P900/Main/spades_output/transcripts.fasta"
* Paths in the assembly graph corresponding to the transcripts are in "/media/kashmir/HP P900/Main/spades_output/transcripts.paths"
* Hard filtered transcripts are in "/media/kashmir/HP P900/Main/spades_output/hard_filtered_transcripts.fasta"
* Soft filtered transcripts are in "/media/kashmir/HP P900/Main/spades_output/soft_filtered_transcripts.fasta"
* Assembly graph is in "/media/kashmir/HP P900/Main/spades_output/assembly_graph.fastg"
* Assembly graph in GFA format is in "/media/kashmir/HP P900/Main/spades_output/assembly_graph_with_scaffolds.gfa"

======= SPAdes pipeline finished.

SPAdes log can be found here: /media/kashmir/HP P900/Main/spades_output/spades.log

Thank you for using SPAdes!
```

rnaSPAdes output

rnaSPAdes generates multiple output files:

  • `transcripts.fasta`: main output file (recommended for most projects)
  • `hard_filtered_transcripts.fasta`: long, reliable, high-expression transcripts
  • `soft_filtered_transcripts.fasta`: short, low-expression transcripts
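
To get a first impression of any of these FASTA files, it can help to count the sequences and sum their lengths. The following is a rough sketch using awk from bash; `fasta_summary` is a name made up for this example, and it is no substitute for a dedicated assessment tool.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print sequence count, total length and longest sequence.
# Usage: fasta_summary FILE.fasta
fasta_summary() {
    awk '
        # Add the sequence collected so far to the running totals.
        function close_record() {
            if (len > 0) {
                total += len
                if (len > longest) longest = len
            }
            len = 0
        }
        /^>/ { close_record(); count++; next }   # header line starts a new record
        { len += length($0) }                    # sequence line
        END {
            close_record()
            printf "sequences=%d total_bp=%d longest_bp=%d\n", count, total, longest
        }
    ' "$1"
}

# Example call (adjust the path to your output directory):
# fasta_summary "/path/to/spades_output/transcripts.fasta"
```

Comparing these numbers between `transcripts.fasta` and the hard and soft filtered files gives a rough sense of how much each filter removes.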

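Transcript names in the rnaSPAdes FASTA output follow the format >NODE_97_length_6237_cov_11.9819_g8_i2. Here 97 is the node (transcript) number, 6237 is the sequence length in nucleotides, 11.9819 is the k-mer coverage, and g8_i2 denotes gene 8 and isoform 2 within that gene.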
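
The name format can be split into its fields with a regular expression. This is a minimal sketch, assuming bash and assuming that headers match the pattern shown above exactly; `parse_transcript_name` is a hypothetical helper, not a SPAdes utility.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split an rnaSPAdes transcript name into its fields.
parse_transcript_name() {
    local header=${1#>}   # drop the leading ">" if present
    local re='^NODE_([0-9]+)_length_([0-9]+)_cov_([0-9.]+)_g([0-9]+)_i([0-9]+)$'
    if [[ $header =~ $re ]]; then
        printf 'node=%s length=%s cov=%s gene=%s isoform=%s\n' \
            "${BASH_REMATCH[@]:1:5}"
    else
        echo "unrecognized header: $1" >&2
        return 1
    fi
}

parse_transcript_name ">NODE_97_length_6237_cov_11.9819_g8_i2"
# prints: node=97 length=6237 cov=11.9819 gene=8 isoform=2
```

Grouping transcripts by the gene field is a simple way to collect the isoforms that rnaSPAdes assigned to the same gene.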


Feedback and bug reports

Please leave your comments and bug reports at the SPAdes GitHub repository tracker.

References

  1. Prjibelski, Andrey, et al. “Using SPAdes De Novo Assembler.” Current Protocols in Bioinformatics, vol. 70, no. 1, June 2020, https://doi.org/10.1002/cpbi.102

This post is licensed under CC BY-NC 4.0 by the author.