dc.contributor.author: Hřivňák, Rostislav
dc.contributor.author: Gajdoš, Petr
dc.contributor.author: Snášel, Václav
dc.date.accessioned: 2022-04-06T15:07:40Z
dc.date.available: 2022-04-06T15:07:40Z
dc.date.issued: 2021
dc.identifier.citation: IEEE Access. 2021, vol. 9, p. 161890-161897. [cs]
dc.identifier.issn: 2169-3536
dc.identifier.uri: http://hdl.handle.net/10084/146001
dc.description.abstract: The most common index data structures used by whole genome aligners (WGA) are based on suffix trees (ST), suffix arrays, and FM-indexes. These data structures perform well because WGA works with sequences of letters over small alphabets; for example, the four letters a, c, t, g for DNA alignment. A novel whole genome aligner, which we are developing, will work with distances between the label sites on DNA samples, which are represented as a sequence of positive integers. The size of the alphabet sigma is theoretically unlimited. This has prompted us to investigate whether there is a better structure that would improve search performance on large alphabets compared to the commonly used suffix-based structures. This paper presents the implementation of a highly optimized hybrid index data structure based on a ternary search tree (TST) and hash tables, which performs much better than the ST when working with strings over large alphabets. Single-core parallelism was achieved using the Advanced Vector Extensions (AVX) single instruction, multiple data (SIMD) instruction set. When searching for short k-mers over an alphabet of 25,695 letters, our index search performance was up to 29 times better than the search performance of the reference ST. When the alphabet was compressed to approximately 1,300 letters, our index search performance was still up to 2.6 times better than the ST. The source code is freely available at http://olgen.cz/Resources/Upload/Home/public/software/hds.zip under the MIT license. [cs]
dc.language.iso: en [cs]
dc.publisher: IEEE [cs]
dc.relation.ispartofseries: IEEE Access [cs]
dc.relation.uri: https://doi.org/10.1109/ACCESS.2021.3121749 [cs]
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/ [cs]
dc.subject: engineering management [cs]
dc.subject: engineering in medicine and biology [cs]
dc.subject: bioinformatics [cs]
dc.subject: computational and artificial intelligence [cs]
dc.subject: computer architecture [cs]
dc.subject: data structures [cs]
dc.subject: tree data structures [cs]
dc.subject: parallel processing [cs]
dc.subject: parallel algorithms [cs]
dc.title: Fast hybrid data structure for a large alphabet k-mers indexing for whole genome alignment [cs]
dc.type: article [cs]
dc.identifier.doi: 10.1109/ACCESS.2021.3121749
dc.rights.access: openAccess [cs]
dc.type.version: publishedVersion [cs]
dc.type.status: Peer-reviewed [cs]
dc.description.source: Web of Science [cs]
dc.description.volume: 9 [cs]
dc.description.lastpage: 161897 [cs]
dc.description.firstpage: 161890 [cs]
dc.identifier.wos: 000730465200001
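
The abstract describes an index that combines a hash table with a ternary search tree (TST) for k-mers whose symbols are positive integers. The sketch below is only an illustration of that general idea and is not the authors' implementation: the names `HybridKmerIndex`, `build_index`, and the choice to hash the first symbol and use a TST for the rest are assumptions made for this example, and it omits the AVX/SIMD optimizations entirely.

```python
from typing import Dict, List, Optional, Sequence


class _Node:
    """One TST node: a symbol plus lower / equal / higher children."""
    __slots__ = ("symbol", "lo", "eq", "hi", "positions")

    def __init__(self, symbol: int) -> None:
        self.symbol = symbol
        self.lo: Optional["_Node"] = None
        self.eq: Optional["_Node"] = None
        self.hi: Optional["_Node"] = None
        self.positions: List[int] = []  # filled only at the last symbol of a k-mer


class HybridKmerIndex:
    """Hypothetical hybrid: a hash table on the first symbol, a TST below it."""

    def __init__(self, k: int) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self._roots: Dict[int, _Node] = {}  # first symbol -> node for that symbol

    def add(self, kmer: Sequence[int], position: int) -> None:
        if len(kmer) != self.k:
            raise ValueError("k-mer has the wrong length")
        node = self._roots.get(kmer[0])
        if node is None:
            node = self._roots[kmer[0]] = _Node(kmer[0])
        for symbol in kmer[1:]:
            # Step to the next level, then walk lo/hi links to the matching symbol.
            if node.eq is None:
                node.eq = _Node(symbol)
            node = node.eq
            while node.symbol != symbol:
                if symbol < node.symbol:
                    if node.lo is None:
                        node.lo = _Node(symbol)
                    node = node.lo
                else:
                    if node.hi is None:
                        node.hi = _Node(symbol)
                    node = node.hi
        node.positions.append(position)

    def find(self, kmer: Sequence[int]) -> List[int]:
        """Return the start positions recorded for kmer, or an empty list."""
        if len(kmer) != self.k:
            return []
        node = self._roots.get(kmer[0])
        for symbol in kmer[1:]:
            if node is None:
                return []
            node = node.eq
            while node is not None and node.symbol != symbol:
                node = node.lo if symbol < node.symbol else node.hi
        return list(node.positions) if node is not None else []


def build_index(distances: Sequence[int], k: int) -> HybridKmerIndex:
    """Index every length-k window of a sequence of label-site distances."""
    index = HybridKmerIndex(k)
    for start in range(len(distances) - k + 1):
        index.add(distances[start:start + k], start)
    return index
```

In this sketch, a lookup costs one hash probe plus a few comparisons per remaining symbol, and that cost does not depend directly on the alphabet size. This is one plausible reason such a layout suits very large alphabets.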


Except where otherwise noted, this item's license is described as http://creativecommons.org/licenses/by/4.0/