
FastUniq - De Novo Duplicates Removal Tool

FastUniq is a command-line tool for de novo removal of duplicates from paired-end sequencing reads in FASTQ format. It offers a lightweight, efficient way to preprocess sequencing data without requiring a reference genome.

Key features, as described in "FastUniq: A Fast De Novo Duplicates Removal Tool for Paired Short Reads" [1]:

  • Eliminates PCR-introduced duplicates, which can improve the accuracy of genome assembly and variation discovery.
  • Processes paired-end reads without requiring a reference genome, which suits de novo sequencing.
  • Ultra-fast: handles 87 million reads in 10 minutes [1], so preprocessing stays quick.
  • Free to download and use.
  • A command-line tool that is straightforward to use for preprocessing sequencing data.

FastUniq Installation Guide

Install using Source Code

Dependencies: make sure the gcc compiler (version 4.0 or above) is installed on your computer. This guide is for a Linux system (Ubuntu) or WSL on Windows.

  1. Download the Source Code:
    FastUniq can be downloaded from its SourceForge repository in a browser or with wget. The -O option saves the archive under the file name used in the extraction step:

    wget -O FastUniq.tar.gz https://sourceforge.net/projects/fastuniq/files/latest/download

  2. Extract the Package:
    Extract the downloaded archive:

    tar -xvzf FastUniq.tar.gz

  3. Navigate to the Directory:
    Move to the extracted FastUniq directory:

    cd FastUniq

  4. Compile the Tool:
    Compile the source code using make:

    make

  5. Verify Installation:
    After compilation, the fastuniq binary will be created in the directory. Check the installation by running:

    ./fastuniq

  6. Optional - Add to PATH:
    To use FastUniq globally, move the binary to a directory in your system’s PATH or add its directory to the PATH:

    sudo mv fastuniq /usr/local/bin/

    Or update your PATH:

    export PATH=$PATH:/path/to/FastUniq
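
After the steps above, a quick check can confirm that the build tools and the binary are visible to the shell. The sketch below is only an illustration: check_tool is a hypothetical helper name, not part of FastUniq, and it reports nothing more than whether each command is found on the PATH.

```bash
# check_tool is a hypothetical helper (not part of FastUniq):
# it reports whether a command can be found on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found at $(command -v "$1")"
  else
    echo "$1: not found"
  fi
}

# gcc and make are needed to build from source; fastuniq is the result.
for tool in gcc make fastuniq; do
  check_tool "$tool"
done

# Print the installed gcc version so it can be compared with the 4.0 minimum.
if command -v gcc >/dev/null 2>&1; then
  echo "gcc version: $(gcc -dumpversion)"
fi
```

If fastuniq is reported as not found after the PATH step, the export line probably needs to point at the directory that actually contains the binary.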
    

Install using Conda

To install this package run one of the following:

```bash
conda install bioconda::fastuniq
# OR you can use
conda install bioconda/label/cf201901::fastuniq
```

Parameters of FastUniq

  1. -i: Input File List
    • Specifies a text file that lists the paired FASTQ sequence files to use as input, one file name per line. Two adjacent files with reads in the same order are treated as a pair.
    • Input: [FILE IN]
    • Limit: Maximum of 1000 pairs.
  2. -t: Output Sequence Format
    • Defines the format of the output sequence files.
    • Options:
      • q: Outputs paired reads in FASTQ format into two separate files (default).
      • f: Outputs paired reads in FASTA format into two separate files.
      • p: Outputs paired reads in FASTA format into a single file (adjacent reads belong to the same pair).
  3. -o: First Output File
    • Specifies the name of the first output file.
    • Input: [FILE OUT]
  4. -p: Second Output File
    • Specifies the name of the second output file.
    • Input: [FILE OUT]
    • Note: Required only when the output format (-t) is q or f.
  5. -c: Sequence Description Type
    • Determines how sequence descriptions are handled in the output files.
    • Options:
      • 0: Retains raw descriptions from input files (default).
      • 1: Assigns new serial numbers to descriptions.
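
As a small illustration of the -i input described above, the list file can be generated rather than typed by hand. The sketch below assumes a hypothetical naming convention in which the two mates differ only by _R1 and _R2; the helper name make_list and the sample file names are made up for the example.

```bash
# make_list is a hypothetical helper: for each R1 file name given,
# print it followed by its assumed R2 mate (first _R1 replaced with _R2).
make_list() {
  for r1 in "$@"; do
    r2=$(printf '%s\n' "$r1" | sed 's/_R1/_R2/')
    printf '%s\n%s\n' "$r1" "$r2"
  done
}

# Two pairs give four lines, with each pair on adjacent lines.
make_list sampleA_R1.fastq sampleB_R1.fastq > input_file_list.txt
cat input_file_list.txt
```

The resulting file can then be passed to fastuniq with -i input_file_list.txt.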

Usage Example

To deduplicate paired-end FASTQ files and write the output in FASTQ format:

# Sample input_file_list.txt
reads_R1.fastq
reads_R2.fastq

# Command
fastuniq -i input_file_list.txt -t q -o clean_R1.fastq -p clean_R2.fastq
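
To see how many pairs were removed, the read counts before and after the run can be compared. This is a minimal sketch that assumes uncompressed files with standard four-line FASTQ records (no wrapped sequence lines); count_fastq_reads is a hypothetical helper, and the file names are the ones from the example above.

```bash
# count_fastq_reads is a hypothetical helper: number of lines divided by 4,
# which is only valid for four-line FASTQ records.
count_fastq_reads() {
  lines=$(wc -l < "$1")
  echo $((lines / 4))
}

# Report counts for the input and deduplicated files when they exist.
for f in reads_R1.fastq clean_R1.fastq; do
  if [ -f "$f" ]; then
    echo "$f: $(count_fastq_reads "$f") reads"
  fi
done
```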

To deduplicate and write the output to a single FASTA file:

fastuniq -i input_file_list.txt -t p -o output.fasta
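
Because -t p writes both mates into one file with adjacent reads forming a pair, the number of header lines in that file should be even. A hedged sketch of that check follows; count_fasta_records is a hypothetical helper, and output.fasta is the file name from the command above.

```bash
# count_fasta_records is a hypothetical helper: it counts header lines,
# i.e. lines that start with ">". "|| true" keeps a zero count from
# being treated as a failure.
count_fasta_records() {
  grep -c '^>' "$1" || true
}

# An even record count is consistent with every read having its mate.
if [ -f output.fasta ]; then
  n=$(count_fasta_records output.fasta)
  if [ $((n % 2)) -eq 0 ]; then
    echo "output.fasta: $((n / 2)) pairs"
  else
    echo "output.fasta: odd number of records ($n), pairing looks broken"
  fi
fi
```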

Applications and Use Cases

  1. Genome Assembly: Removing duplicates improves scaffolding and assembly accuracy.
  2. Variant Calling: Supports more accurate detection of SNPs and other variations.
  3. RNA-Seq Analysis: Reduces noise caused by duplicates for reliable expression analysis.
  4. Metagenomics: Handles diverse datasets without a reference genome.

Reference

  1. Xu H, Luo X, Qian J, Pang X, Song J, et al. (2012) FastUniq: A Fast De Novo Duplicates Removal Tool for Paired Short Reads. PLOS ONE 7(12): e52249. https://doi.org/10.1371/journal.pone.0052249

This post is licensed under CC BY-NC 4.0 by the author.