FastUniq - De Novo Duplicates Removal Tool

FastUniq is a command-line tool for de novo deduplication of paired-end sequencing reads in FASTQ format. It offers a lightweight, efficient way to preprocess sequencing data without requiring a reference genome.

Posted Dec 13, 2024

By Beaven Manjengwa


FastUniq: A Fast De Novo Duplicates Removal Tool for Paired Short Reads¹:

- Eliminates PCR-introduced duplicates, which improves the accuracy of genome assembly and variation discovery.
- Processes paired-end reads without requiring a reference genome, which makes it well suited to de novo sequencing.
- Runs from the command line and is straightforward to use for preprocessing sequencing data.
- Ultra-fast: handles 87 million reads in 10 minutes¹.
- Free to download and use.
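
To make the idea of a duplicate pair concrete, here is a minimal shell sketch that counts distinct read pairs in two uncompressed FASTQ files. It assumes that a pair is a duplicate only when both mates have exactly the same sequences as another pair, and that every record spans four lines. It is not FastUniq's algorithm, and FastUniq's own comparison may differ in detail. The function name `count_unique_pairs` is made up for this illustration.

```shell
# Count distinct (R1 sequence, R2 sequence) combinations in two FASTQ files.
# Assumption: four lines per record, reads in the same order in both files.
count_unique_pairs() {
    r1_seqs=$(mktemp)
    r2_seqs=$(mktemp)
    # The sequence is the second line of every four-line record.
    awk 'NR % 4 == 2' "$1" > "$r1_seqs"
    awk 'NR % 4 == 2' "$2" > "$r2_seqs"
    # Join the mates side by side, then keep one copy of each combination.
    paste "$r1_seqs" "$r2_seqs" | sort -u | wc -l | tr -d ' '
    rm -f "$r1_seqs" "$r2_seqs"
}

# Usage (file names are placeholders):
#   count_unique_pairs reads_R1.fastq reads_R2.fastq
```

Comparing this number with the total number of pairs gives a rough idea of how much duplication to expect before running the real tool.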

FastUniq Installation Guide

Install using Source Code

Download the Source Code:
FastUniq can be downloaded from the SourceForge repository or with the wget command. This guide targets Linux (Ubuntu) or WSL on Windows.

Dependencies: make sure the gcc compiler (version 4.0 or above) is installed on your computer.

   wget https://sourceforge.net/projects/fastuniq/files/latest/download -O FastUniq.tar.gz

Extract the Package:
Extract the downloaded archive:

   tar -xvzf FastUniq.tar.gz

Navigate to the Directory:
Move to the extracted FastUniq directory:

   cd FastUniq

Compile the Tool:
Compile the source code using make:

   make

Verify Installation:
After compilation, the fastuniq binary will be created in the directory. Check the installation by running:

   ./fastuniq

Optional - Add to PATH:
To use FastUniq globally, move the binary to a directory in your system’s PATH or add its directory to the PATH:

   sudo mv fastuniq /usr/local/bin/

Or update your PATH:

   export PATH=$PATH:/path/to/FastUniq
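
As a quick sanity check around these steps, the sketch below reports whether a command can be found through PATH. It relies only on the shell's standard `command -v`. The helper name `check_tool` is an assumption made for this example and is not part of FastUniq.

```shell
# Report whether a command is reachable through PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $(command -v "$1")"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# gcc is needed before running make; fastuniq should resolve after the PATH step.
check_tool gcc || echo "gcc is missing: install it before running make"
check_tool fastuniq || echo "fastuniq is not on PATH yet"
```

If `fastuniq` is still reported as missing after updating PATH, open a new terminal or check that the export line points at the directory that actually contains the binary.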

Install using Conda

To install this package, run one of the following:

```bash
conda install bioconda::fastuniq
# OR you can use
conda install bioconda/label/cf201901::fastuniq
```

Parameters of FastUniq

-i: Input File List
- Specifies a list of paired FASTQ sequence files as input. Two adjacent files with reads in the same order are treated as a pair.
- Input: [FILE IN]
- Limit: Maximum of 1000 pairs.
-t: Output Sequence Format
- Defines the format of the output sequence files.
- Options:
  - q: Outputs paired reads in FASTQ format into two separate files (default).
  - f: Outputs paired reads in FASTA format into two separate files.
  - p: Outputs paired reads in FASTA format into a single file (adjacent reads belong to the same pair).
-o: First Output File
- Specifies the name of the first output file.
- Input: [FILE OUT]
-p: Second Output File
- Specifies the name of the second output file.
- Input: [FILE OUT]
- Note: Required only when the output format (-t) is q or f.
-c: Sequence Description Type
- Determines how sequence descriptions are handled in the output files.
- Options:
  - 0: Retains raw descriptions from input files (default).
  - 1: Assigns new serial numbers to descriptions.
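
The relationship between -t, -o and -p can be written down as a small pre-flight check. The sketch below encodes only the rule stated above (formats q and f need two output files, format p needs one). The function name `check_outputs` is hypothetical, and FastUniq performs its own argument validation independently of it.

```shell
# Check an output-format choice against the rules in the parameter list:
# q and f need two output files, p needs only the first one.
check_outputs() {
    format=$1
    first=$2
    second=$3
    case "$format" in
        q|f) [ -n "$first" ] && [ -n "$second" ] ;;
        p)   [ -n "$first" ] ;;
        *)   echo "unknown format: $format" >&2; return 1 ;;
    esac
}

# Example: FASTQ output needs both -o and -p values.
if check_outputs q clean_R1.fastq clean_R2.fastq; then
    echo "options look consistent"
fi
```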

Usage Example

To deduplicate paired-end FASTQ files and write the output in FASTQ format:

# Sample input_file_list.txt
reads_R1.fastq
reads_R2.fastq

# Command
fastuniq -i input_file_list.txt -t q -o clean_R1.fastq -p clean_R2.fastq
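
When there are several libraries, typing the list file by hand gets tedious. The following sketch writes each R1 file followed by its R2 mate, which is the adjacency that -i expects. It assumes a naming scheme of `<sample>_R1.fastq` and `<sample>_R2.fastq`, which is a convention chosen for this example rather than a FastUniq requirement. The function name `make_pair_list` is also made up here.

```shell
# Write an input list for fastuniq -i: each R1 file followed by its R2 mate.
# Assumption: mates are named <sample>_R1.fastq and <sample>_R2.fastq.
make_pair_list() {
    dir=$1
    list=$2
    : > "$list"                       # start with an empty list file
    for r1 in "$dir"/*_R1.fastq; do
        [ -e "$r1" ] || continue      # skip the literal pattern if nothing matched
        r2="${r1%_R1.fastq}_R2.fastq"
        if [ ! -e "$r2" ]; then
            echo "no mate found for $r1" >&2
            return 1
        fi
        printf '%s\n%s\n' "$r1" "$r2" >> "$list"
    done
}

# Usage:
#   make_pair_list . input_file_list.txt
#   fastuniq -i input_file_list.txt -t q -o clean_R1.fastq -p clean_R2.fastq
```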

To deduplicate and write the output to a single FASTA file:

fastuniq -i input_file_list.txt -t p -o output.fasta
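
After a FASTQ-to-FASTQ run, it is useful to see how many pairs were removed. The sketch below compares record counts in an input file and the matching output file. It assumes uncompressed FASTQ with exactly four lines per record, and the function names `count_fastq_reads` and `report_dedup` are made up for this example.

```shell
# Count records in an uncompressed FASTQ file, assuming four lines per record.
count_fastq_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Print how many pairs were removed, given an input R1 file and its output R1 file.
report_dedup() {
    before=$(count_fastq_reads "$1")
    after=$(count_fastq_reads "$2")
    removed=$((before - after))
    if [ "$before" -gt 0 ]; then
        awk -v b="$before" -v r="$removed" \
            'BEGIN { printf "%d of %d pairs removed (%.1f%%)\n", r, b, 100 * r / b }'
    else
        echo "no reads found in $1"
    fi
}

# Usage (file names taken from the FASTQ example):
#   report_dedup reads_R1.fastq clean_R1.fastq
```

Only the R1 files are compared because both files of a pair hold the same number of records.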

Applications and Use Cases

- Genome Assembly: Removing duplicates improves scaffolding and assembly accuracy.
- Variant Calling: Removing PCR duplicates helps avoid over-counting the evidence for SNPs and other variants.
- RNA-Seq Analysis: Reduces noise caused by duplicates for more reliable expression analysis.
- Metagenomics: Handles diverse datasets without a reference genome.

Reference

¹ Xu H, Luo X, Qian J, Pang X, Song J, et al. (2012) FastUniq: A Fast De Novo Duplicates Removal Tool for Paired Short Reads. PLOS ONE 7(12): e52249. https://doi.org/10.1371/journal.pone.0052249


This post is licensed under CC BY-NC 4.0 by the author.