LFastqC: A lossless non-reference-based FASTQ compressor
Authors:
Sultan Al Yami aff001; Chun-Hsi Huang aff001
Author affiliations:
aff001: Computer Science and Engineering, University of Connecticut, Storrs, Connecticut, United States of America
aff002: Computer Science and Information System, Najran University, Najran, Saudi Arabia
Published in:
PLoS ONE 14(11)
Category:
Research Article
doi:
https://doi.org/10.1371/journal.pone.0224806
Abstract
The cost-effectiveness of next-generation sequencing (NGS) has accelerated genomic research, which now routinely generates large volumes of raw data that often require dedicated infrastructure, such as data centers, for storage and transmission. NGS data are highly redundant and must be compressed efficiently to reduce the cost of storage space and transmission bandwidth. To address these issues, we present LFastqC, a lossless, non-reference-based FASTQ compression algorithm that improves on the LFQC tool. We compare LFastqC with several state-of-the-art compressors, and the results indicate that it achieves better compression ratios on important datasets such as LS454, PacBio, and MinION. Moreover, LFastqC compresses and decompresses faster than LFQC, which was previously the top-performing compression algorithm for the LS454 dataset. LFastqC is freely available at https://github.uconn.edu/sya12005/LFastqC.
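To make the terms "lossless" and "non-reference-based" concrete, the following is a minimal sketch and not the LFastqC algorithm itself. It assumes well-formed four-line FASTQ records whose separator line is a bare `+`, and it uses Python's general-purpose `zlib` as a stand-in for a specialized compressor. The function names (`split_streams`, `compress_fastq`, `decompress_fastq`, `compression_ratio`) are hypothetical and chosen for illustration only. No reference genome is consulted at any point, and decompression reproduces the input exactly.

```python
import zlib


def split_streams(fastq_text):
    """Split FASTQ text into identifier, sequence, and quality streams.

    Assumption for this sketch: four lines per record and a bare "+"
    separator line, so the separator carries no information to store.
    """
    lines = fastq_text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    ids, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or plus != "+":
            raise ValueError("record %d is not in the assumed format" % (i // 4))
        ids.append(header)
        seqs.append(seq)
        quals.append(qual)
    return "\n".join(ids), "\n".join(seqs), "\n".join(quals)


def compress_fastq(fastq_text):
    """Compress each stream independently; no reference genome is used."""
    return [zlib.compress(s.encode("ascii"), 9) for s in split_streams(fastq_text)]


def decompress_fastq(packed):
    """Rebuild the FASTQ text from the three compressed streams."""
    ids, seqs, quals = (zlib.decompress(p).decode("ascii").split("\n") for p in packed)
    records = []
    for header, seq, qual in zip(ids, seqs, quals):
        records.append(header + "\n" + seq + "\n+\n" + qual + "\n")
    return "".join(records)


def compression_ratio(fastq_text, packed):
    """Original size divided by total compressed size (higher is better)."""
    return len(fastq_text.encode("ascii")) / sum(len(p) for p in packed)


if __name__ == "__main__":
    # Tiny made-up example; real gains only appear on large, redundant inputs.
    sample = "@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTACGA\n+\nIIIIHHHH\n"
    packed = compress_fastq(sample)
    assert decompress_fastq(packed) == sample  # lossless round trip
    print("ratio:", compression_ratio(sample, packed))
```

Separating the streams lets each one be modeled on its own statistics, since identifiers, bases, and quality scores have very different alphabets and redundancy. On an input as small as the sample above, the ratio can fall below 1 because of per-stream overhead.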
Keywords:
Algorithms – Arithmetic – Compression – Data management – Next-generation sequencing – Nucleotide sequencing – Spring – Data compression
This article was published in
PLOS One
2019, Issue 11