FastUniq - De Novo Duplicates Removal Tool
FastUniq is a command-line tool for de novo deduplication of paired-end sequencing reads in FASTQ format. It provides a lightweight, efficient way to preprocess sequencing data without requiring a reference genome.
FastUniq: A Fast De Novo Duplicates Removal Tool for Paired Short Reads [1]:
- Eliminates PCR-introduced duplicates, improving the accuracy of genome assembly and variation discovery.
- Processes paired-end reads without requiring a reference genome, ideal for de novo sequencing.
- A command-line tool that is straightforward to use for preprocessing sequencing data.
- Ultra-fast: Handles 87 million reads in 10 minutes [1], ensuring rapid preprocessing of sequencing data.
- Free to download and use.
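To make "de novo duplicate removal" concrete, the sketch below treats a read pair as a duplicate when both its forward and reverse sequences match those of a pair seen earlier, with no reference genome involved. It is a toy illustration only, not FastUniq's own algorithm: the function name `dedup_pairs_sketch` is hypothetical, and it assumes uncompressed FASTQ with four lines per record and reads in the same order in both files.

```bash
# Toy illustration of pair-level duplicate removal (NOT FastUniq's implementation).
# Assumes 4-line FASTQ records and the same read order in both input files.
dedup_pairs_sketch() {
    r1=$1; r2=$2; out1=$3; out2=$4
    : > "$out1"; : > "$out2"          # make sure both outputs exist, even if empty
    awk -v r2="$r2" -v out1="$out1" -v out2="$out2" '
    {
        a[(NR - 1) % 4] = $0                      # current line of the R1 record
        if ((getline line < r2) <= 0) exit 1      # matching line of the R2 record
        b[(NR - 1) % 4] = line
        if (NR % 4 == 0) {                        # one full record read from each file
            key = a[1] "\t" b[1]                  # sequences of both mates
            if (!(key in seen)) {                 # first time this pair is seen: keep it
                seen[key] = 1
                for (i = 0; i < 4; i++) {
                    print a[i] > out1
                    print b[i] > out2
                }
            }
        }
    }' "$r1"
}
# Usage: dedup_pairs_sketch reads_R1.fastq reads_R2.fastq kept_R1.fastq kept_R2.fastq
```

The sketch keeps every distinct pair key in memory, so it is only meant for small test files; for real data, use FastUniq itself.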
FastUniq Installation Guide
Install using Source Code
- Download the Source Code:
FastUniq can be downloaded from its SourceForge repository in a browser or with the wget command. This guide is written for a Linux system (Ubuntu) or WSL on Windows.
Dependencies: Make sure the gcc compiler (version 4.0 or above) is installed on your computer.
wget -O FastUniq.tar.gz https://sourceforge.net/projects/fastuniq/files/latest/download
- Extract the Package:
Extract the downloaded archive:
tar -xvzf FastUniq.tar.gz
- Navigate to the Directory:
Move to the extracted FastUniq directory:
cd FastUniq
- Compile the Tool:
Compile the source code using make:
make
- Verify Installation:
After compilation, the fastuniq binary will be created in the directory. Check the installation by running:
./fastuniq
- Optional - Add to PATH:
To use FastUniq globally, move the binary to a directory in your system's PATH or add its directory to the PATH:
sudo mv fastuniq /usr/local/bin/
Or update your PATH:
export PATH=$PATH:/path/to/FastUniq
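After changing the PATH, it can help to confirm that the shell actually finds the binary. The helper below is a minimal sketch; the function name `check_on_path` is hypothetical and it relies only on the standard `command -v` shell builtin.

```bash
# Hypothetical helper: reports whether a command can be found on the PATH.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}
# Usage: check_on_path fastuniq
```

Note that `export PATH=...` only affects the current shell session; a new terminal will need the same line again unless it is added to a shell startup file.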
Install using Conda
To install this package, run one of the following:
```bash
conda install bioconda::fastuniq
# OR you can use
conda install bioconda/label/cf201901::fastuniq
```
Parameters of FastUniq
-i: Input File List
- Specifies a list of paired FASTQ sequence files as input. Two adjacent files with reads in the same order are treated as a pair.
- Input: [FILE IN]
- Limit: Maximum of 1000 pairs.
-t: Output Sequence Format
- Defines the format of the output sequence files.
- Options:
  - q: Outputs paired reads in FASTQ format into two separate files (default).
  - f: Outputs paired reads in FASTA format into two separate files.
  - p: Outputs paired reads in FASTA format into a single file (adjacent reads belong to the same pair).
-o: First Output File
- Specifies the name of the first output file.
- Input: [FILE OUT]
-p: Second Output File
- Specifies the name of the second output file.
- Input: [FILE OUT]
- Note: Required only when the output format (-t) is q or f.
-c: Sequence Description Type
- Determines how sequence descriptions are handled in the output files.
- Options:
  - 0: Retains raw descriptions from input files (default).
  - 1: Assigns new serial numbers to descriptions.
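The relationship between -t and the two output options is easy to get wrong, so here is a small sketch that encodes it. The function `fastuniq_cmd` is a hypothetical helper, not part of FastUniq: it only prints the command line it would use, based on the parameter rules listed above, and does not run the tool.

```bash
# Hypothetical helper: prints a fastuniq command line after checking the
# output-file rules described above. It does not run fastuniq itself.
fastuniq_cmd() {
    list=$1; fmt=$2; out1=$3; out2=${4:-}
    case $fmt in
        q|f)
            # Two-file formats need both -o and -p.
            if [ -z "$out2" ]; then
                echo "format '$fmt' needs a second output file (-p)" >&2
                return 1
            fi
            echo "fastuniq -i $list -t $fmt -o $out1 -p $out2"
            ;;
        p)
            # Single-file FASTA: only -o is used.
            echo "fastuniq -i $list -t p -o $out1"
            ;;
        *)
            echo "unknown format '$fmt' (expected q, f, or p)" >&2
            return 1
            ;;
    esac
}
# Usage: fastuniq_cmd input_file_list.txt q clean_R1.fastq clean_R2.fastq
```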
Usage Example
To deduplicate paired-end FASTQ files and write the output in FASTQ format:
# Sample input_file_list.txt
reads_R1.fastq
reads_R2.fastq
# Command
fastuniq -i input_file_list.txt -t q -o clean_R1.fastq -p clean_R2.fastq
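Because FastUniq pairs reads by their order in the two files, a quick sanity check before running it can catch mismatched inputs. The sketch below is one way to do that; `make_input_list` is a hypothetical helper, and it assumes uncompressed FASTQ files with exactly four lines per record.

```bash
# Hypothetical pre-flight helper: checks that both FASTQ files hold the same
# number of reads, then writes the two-line list file passed to -i.
# Assumes uncompressed FASTQ with exactly four lines per record.
make_input_list() {
    r1=$1; r2=$2; list=$3
    n1=$(( $(wc -l < "$r1") / 4 ))
    n2=$(( $(wc -l < "$r2") / 4 ))
    if [ "$n1" -ne "$n2" ]; then
        echo "read counts differ: $r1 has $n1, $r2 has $n2" >&2
        return 1
    fi
    # One file name per line; adjacent lines form a pair.
    printf '%s\n%s\n' "$r1" "$r2" > "$list"
    echo "$n1 read pairs listed in $list"
}
# Usage: make_input_list reads_R1.fastq reads_R2.fastq input_file_list.txt
```

The same line-count arithmetic, applied to the output files, gives a rough estimate of how many pairs were removed as duplicates.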
To deduplicate and write both reads of each pair to a single FASTA file:
fastuniq -i input_file_list.txt -t p -o output.fasta
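Some downstream tools expect the two mates in separate files. The sketch below splits an interleaved FASTA file, where adjacent reads belong to the same pair, back into two files. `split_interleaved_fasta` is a hypothetical helper, and it assumes each record is exactly two lines (a `>` header followed by one sequence line); check the actual output layout before relying on it.

```bash
# Hypothetical helper: splits an interleaved FASTA file (mate 1, mate 2,
# mate 1, ...) into two files. Assumes each record is exactly two lines:
# a '>' header followed by a single sequence line.
split_interleaved_fasta() {
    src=$1; out1=$2; out2=$3
    : > "$out1"; : > "$out2"          # make sure both outputs exist, even if empty
    awk -v out1="$out1" -v out2="$out2" '
    {
        # Lines 1-2 of every group of four go to mate 1, lines 3-4 to mate 2.
        if ((NR - 1) % 4 < 2) { print > out1 } else { print > out2 }
    }' "$src"
}
# Usage: split_interleaved_fasta output.fasta mate1.fasta mate2.fasta
```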
Applications and Use Cases
- Genome Assembly: Removing duplicates improves scaffolding and assembly accuracy.
- Variant Calling: Ensures accurate detection of SNPs and other variations.
- RNA-Seq Analysis: Reduces noise caused by duplicates for reliable expression analysis.
- Metagenomics: Handles diverse datasets without a reference genome.
Reference
1. FastUniq: A Fast De Novo Duplicates Removal Tool for Paired Short Reads.
This post is licensed under CC BY-NC 4.0 by the author.