SeqKit Toolkit for FASTA/Q File Manipulation and Analysis
SeqKit Usage Guide
Introduction
SeqKit is a fast and versatile command-line toolkit for processing FASTA and FASTQ files. It is widely used in bioinformatics and covers sequence statistics, searching, filtering, splitting, shuffling, and data extraction [1]. It is optimized for efficiency, with multi-threading and support for large datasets.
Key Benefits:
- Handles large datasets efficiently
- Multi-threading for faster performance
- Native support for FASTA and FASTQ, with some subcommands also accepting other formats (for example BED/GTF regions in seqkit subseq and BAM input in seqkit bam)
- Easy to integrate into bioinformatics pipelines
Installation of SeqKit
SeqKit can be installed on Linux, macOS, and Windows.
Linux and macOS
You can install SeqKit via conda, Homebrew, or a binary downloaded from its GitHub repository.
Using Conda
conda install -c bioconda seqkit
Using Homebrew (macOS)
brew install seqkit
Download Binary
Download the latest release from the SeqKit GitHub repository.
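# Note: the commands below install the Go toolchain, which is only needed for
# "Building from Source"; a prebuilt SeqKit binary from the release page does not require Go.
# Download a Go release (1.17.13 is shown here)
wget https://go.dev/dl/go1.17.13.linux-amd64.tar.gz
# Extract the archive to your home directory
tar -zxf go1.17.13.linux-amd64.tar.gz -C "$HOME/"
# Add the Go binaries to your PATH
echo 'export PATH=$PATH:$HOME/go/bin' >> ~/.bashrc
# Reload the bash configuration
source ~/.bashrc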
Windows
Download the executable from the GitHub release page and add it to your system PATH.
#Copy seqkit.exe to C:\WINDOWS\system32
Building from Source
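# Requires the Go toolchain
git clone https://github.com/shenwei356/seqkit.git
# The main package sits in the seqkit/ subdirectory of the repository
cd seqkit/seqkit
go build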
Verify Installation
To confirm SeqKit is installed correctly, run:
seqkit version
# Example output: seqkit v2.9.0
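In scripts, it can help to confirm that the executable is on the PATH before calling it. The following is a minimal sketch; the helper name `require_tool` is made up for this example and is not part of SeqKit.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the named program is on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "error: '$1' not found on PATH" >&2
    return 1
}

# Print the installed version only when seqkit is actually available.
if require_tool seqkit; then
    seqkit version
fi
```

The same check can guard the start of any pipeline script, so a missing installation fails early with a readable message.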
Key Features
| Feature | Subcommand | Options |
|---|---|---|
| Compute GC content (for filtering) | seqkit fx2tab | -g, --gc (print GC content) |
| Filter by average quality | seqkit seq | -Q, --min-qual (min average quality), -R, --max-qual (max average quality) |
| Compute statistics | seqkit stats | -a (all stats), -T (tabular output) |
| Extract sequences by ID | seqkit grep | -f (file with IDs/patterns), -i (case-insensitive), -p (pattern) |
| k-mer extraction | seqkit sliding | -W (window size, i.e. k), -s (step size) |
| Sort sequences | seqkit sort | -l (by length), -n (by name), -r (reverse order) |
| Filter by length | seqkit seq | -m (min length), -M (max length) |
| View first records | seqkit head | -n (number of records to view) |
| Locate subsequences | seqkit locate | -p (pattern), -r (treat pattern as a regular expression), -i (case-insensitive), -P (search the positive strand only) |
| Convert FASTQ to FASTA | seqkit fq2fa | -o (output file; global flag) |
Basic SeqKit Commands
SeqKit operates using subcommands, similar to git. Basic syntax:
seqkit <subcommand> [flags] <file>
# Display usage for the toolkit or for a single subcommand
seqkit --help
seqkit <subcommand> --help
# Commonly used subcommands include:
# seq, subseq, sliding, stats, sum, faidx, watch, sana,
# scat, fq2fa, fa2fq, fx2tab, tab2fx, convert, translate
Sequence Data Manipulation
SeqKit provides essential tools for sequence manipulation, including viewing, filtering, extracting, and sorting.
Viewing Sequences
To quickly view sequences:
seqkit head -n 5 example.fasta
Filtering Sequences
You can filter sequences based on length, GC content, or specific patterns.
Filtering by Length:
seqkit seq -m 100 -M 500 example.fasta > filtered.fasta
- -m: Minimum length
- -M: Maximum length
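To make the effect of the two bounds concrete, the sketch below reproduces a length filter with plain awk on a tiny made-up FASTA. It assumes both bounds are inclusive and that each sequence sits on a single line; the function name `filter_by_length` is hypothetical, and this is a cross-check rather than how SeqKit is implemented.

```shell
#!/bin/sh
# Hypothetical awk stand-in for a length filter (not SeqKit itself).
# Usage: filter_by_length MIN MAX < input.fasta
# Assumes single-line sequences and inclusive bounds.
filter_by_length() {
    awk -v min="$1" -v max="$2" '
        /^>/ { header = $0; next }          # remember the current header
        length($0) >= min && length($0) <= max {
            print header
            print
        }
    '
}

# Three made-up records of length 3, 8 and 16.
printf '>short\nACG\n>mid\nACGTACGT\n>long\nACGTACGTACGTACGT\n' |
    filter_by_length 5 10
```

With bounds 5 and 10, only the record named `mid` is printed.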
Filtering by GC Content:
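# fx2tab prints name, sequence, quality (empty for FASTA), then GC%;
# filter on the GC column and convert the surviving rows back to FASTA
seqkit fx2tab --gc example.fasta | awk -F'\t' '$4 > 50' | seqkit tab2fx > gc_filtered.fasta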
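The awk filtering step can be tried in isolation. The sketch below runs it on a hand-written table shaped like `seqkit fx2tab --gc` output for FASTA input, assuming the columns are name, sequence, an empty quality field, and GC percentage. The records are made up, and the conversion back to FASTA is done in awk here only so that the example runs without SeqKit.

```shell
#!/bin/sh
# Hand-written stand-in for `seqkit fx2tab --gc` output on FASTA input
# (assumed columns: name, sequence, empty quality, GC%).
tab=$(printf 'seq1\tGGCCGGCCAT\t\t80.00\nseq2\tATATATATGC\t\t20.00\nseq3\tGCGCATGCGC\t\t80.00\n')

# Keep rows whose last column (GC%) exceeds 50, then rebuild FASTA records
# from columns 1 and 2. In a real pipeline `seqkit tab2fx` does this step.
gc_filtered=$(printf '%s\n' "$tab" |
    awk -F'\t' '$NF > 50 { printf ">%s\n%s\n", $1, $2 }')

printf '%s\n' "$gc_filtered"
```

Setting the field separator to a tab matters because sequence headers may contain spaces, which would otherwise shift the column numbers.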
Extracting Sequences
To extract sequences by ID:
seqkit grep -i -f ids.txt example.fasta > extracted.fasta
- -i: Case-insensitive search
- -f: File with sequence IDs (one per line)
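An ID file of this kind can be built from the FASTA headers themselves with standard tools. The following sketch uses made-up records in a scratch directory and runs the extraction only if `seqkit` is installed; all file names are illustrative.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current folder is touched.
workdir=$(mktemp -d)

# Tiny illustrative FASTA; the headers and sequences are made up.
printf '>geneA first gene\nACGT\n>geneB second gene\nGGCC\n>geneC third gene\nTTAA\n' \
    > "$workdir/example.fasta"

# Take each header, drop the leading ">" and everything after the first
# whitespace (the description), and keep one unique ID per line.
grep '^>' "$workdir/example.fasta" |
    sed 's/^>//; s/[[:space:]].*$//' |
    sort -u > "$workdir/all_ids.txt"

# Keep a subset as the list to extract (here: every ID except geneB).
grep -v '^geneB$' "$workdir/all_ids.txt" > "$workdir/ids.txt"

# Run the extraction only when seqkit is installed.
if command -v seqkit >/dev/null 2>&1; then
    seqkit grep -f "$workdir/ids.txt" "$workdir/example.fasta" \
        > "$workdir/extracted.fasta"
fi
```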
Sorting Sequences
To sort sequences by length:
seqkit sort -l example.fasta > sorted.fasta
Sequence Data Transformation
Trimming and Cleaning
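SeqKit is not a dedicated adapter or quality trimmer. For cleaning reads, its seq subcommand can discard reads whose average quality falls below a threshold.
Filtering by Average Quality:
seqkit seq --min-qual 20 example.fastq > quality_filtered.fastq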
Converting Formats
Convert FASTQ to FASTA:
seqkit fq2fa example.fastq > example.fasta
Advanced SeqKit Usage
k-mer Analysis
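SeqKit has no dedicated k-mer counting subcommand, but seqkit sliding can emit every k-mer as a sliding window, and standard tools can then count them. To generate 5-mer frequencies:
seqkit sliding -s 1 -W 5 example.fasta | seqkit seq -s | sort | uniq -c > kmer_counts.txt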
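For small inputs, k-mer counts can be sanity-checked with plain awk. The sketch below is not SeqKit; it counts overlapping k-mers and assumes each sequence sits on a single line (wrapped files can be unwrapped first, for example with `seqkit seq -w 0`). The function name `count_kmers` is hypothetical.

```shell
#!/bin/sh
# Count overlapping k-mers in a FASTA stream with awk (not SeqKit).
# Usage: count_kmers K < input.fasta
# Assumes each sequence sits on a single line.
count_kmers() {
    awk -v k="$1" '
        /^>/ { next }                       # skip header lines
        {
            seq = toupper($0)
            for (i = 1; i + k - 1 <= length(seq); i++)
                counts[substr(seq, i, k)]++
        }
        END { for (kmer in counts) printf "%s\t%d\n", kmer, counts[kmer] }
    ' | sort
}

# Two made-up sequences; counts are pooled across both.
printf '>s1\nATATAT\n>s2\nGATA\n' | count_kmers 3
```

For this input the 3-mer counts are ATA 3, GAT 1, and TAT 2.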
Statistics and Summary
Get summary statistics of sequence files:
seqkit stats input.fasta
# Example output (illustrative values)
file         format  type  num_seqs    sum_len  min_len  avg_len  max_len
input.fasta  FASTA   DNA      1,000  1,000,000      500    1,000    1,500
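On small files, the basic summary figures (record count, total bases, minimum, average, and maximum length) can be cross-checked with plain awk. This is a sketch for verification only, not how SeqKit computes them; the function name `fasta_summary` is hypothetical.

```shell
#!/bin/sh
# Plain-awk cross-check (not SeqKit): prints the number of sequences,
# total length, minimum, average and maximum length, tab-separated.
fasta_summary() {
    awk '
        function flush_record() {
            if (!seen) return
            n++
            total += len
            if (n == 1 || len < min) min = len
            if (len > max) max = len
        }
        /^>/ { flush_record(); seen = 1; len = 0; next }
        { len += length($0) }               # sequences may span several lines
        END {
            flush_record()
            if (n) printf "%d\t%d\t%d\t%.1f\t%d\n", n, total, min, total / n, max
        }
    '
}

# Three made-up records of length 6 (wrapped over two lines), 2 and 8.
printf '>a\nACGT\nAC\n>b\nGG\n>c\nTTTTTTTT\n' | fasta_summary
```

For this input the function reports 3 sequences, 16 bases in total, a minimum of 2, an average of 5.3, and a maximum of 8.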
Subsequence Search
SeqKit does not perform pairwise alignment; seqkit locate instead reports where a subsequence or motif occurs in each sequence:
seqkit locate -p "ATGC" input.fasta
Batch Processing and Scripting
SeqKit can apply a command to multiple files in one call, which is useful when processing large datasets or performing batch operations.
# Run a command on multiple files in the same folder
seqkit stats *.fasta
# Run a command on files in subdirectories (safe for file names containing spaces)
find . -name "*.fasta" -exec seqkit stats {} +
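When each input needs its own report, a per-file loop is an alternative to passing every file in one call. The sketch below builds a scratch directory tree with made-up file names (one containing a space), lists the matching files, and writes one tabular report per input when `seqkit` is installed.

```shell
#!/bin/sh
# Scratch tree with made-up file names, including one containing a space.
root=$(mktemp -d)
mkdir -p "$root/run 1" "$root/run2"
printf '>x\nACGT\n' > "$root/run 1/sample a.fasta"
printf '>y\nGGCC\n' > "$root/run2/sample_b.fasta"
printf 'not a sequence file\n' > "$root/run2/notes.txt"

# Collect matching paths one per line; this copes with spaces in names
# (though not with newlines, which are rare in practice).
find "$root" -type f -name '*.fasta' | sort > "$root/fasta_files.txt"

# Process each file separately, writing one report next to each input.
while IFS= read -r fasta; do
    if command -v seqkit >/dev/null 2>&1; then
        seqkit stats -T "$fasta" > "${fasta%.fasta}.stats.tsv"
    else
        echo "seqkit not found; skipping $fasta" >&2
    fi
done < "$root/fasta_files.txt"
```

Reading the list line by line, instead of expanding it unquoted on the command line, keeps paths with spaces intact.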
Scripting with SeqKit
SeqKit can be integrated into shell scripts to automate common bioinformatics workflows. The following example uses SeqKit in a bash script to convert all FASTQ files in a directory to FASTA.
#!/bin/bash
# Convert every FASTQ file in the current directory to FASTA
for file in *.fastq
do
    seqkit fq2fa "$file" > "${file%.fastq}.fasta"
done
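A slightly more defensive variant of such a loop is sketched below. It skips the loop body when no FASTQ file matches, and the hypothetical helper `fq_to_fa` falls back to awk when `seqkit` is not installed; the fallback assumes strict four-line FASTQ records. The sample reads are made up.

```shell
#!/bin/sh
# Convert one FASTQ file to FASTA. Uses seqkit when available; otherwise
# falls back to awk, which assumes strict four-line FASTQ records.
fq_to_fa() {
    if command -v seqkit >/dev/null 2>&1; then
        seqkit fq2fa "$1"
    else
        awk 'NR % 4 == 1 { print ">" substr($0, 2) } NR % 4 == 2 { print }' "$1"
    fi
}

# Scratch directory with one small made-up FASTQ file.
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' > "$workdir/sample.fastq"

for file in "$workdir"/*.fastq; do
    [ -e "$file" ] || continue          # no matches: the glob stays literal
    fq_to_fa "$file" > "${file%.fastq}.fasta"
done

cat "$workdir/sample.fasta"
```

The `[ -e "$file" ]` guard prevents the loop from creating an output file named after the unexpanded pattern when the directory holds no FASTQ files.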
Practice Datasets
Datasets [1] from the miRBase Sequence Database:
- hairpin.fa.gz
- mature.fa.gz
- miRNA.diff.gz
References
- Wei Shen, Botond Sipos, and Liuyang Zhao. 2024. SeqKit2: A Swiss Army Knife for Sequence and Alignment Processing. iMeta e191. doi:10.1002/imt2.191.
- Wei Shen, Shuai Le, Yan Li, and Fuquan Hu. 2016. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE 11(10): e0163962. doi:10.1371/journal.pone.0163962.