Fast hybrid data structure for a large alphabet k-mers indexing for whole genome alignment

Hřivňák, Rostislav; Gajdoš, Petr; Snášel, Václav

dc.contributor.author	Hřivňák, Rostislav
dc.contributor.author	Gajdoš, Petr
dc.contributor.author	Snášel, Václav
dc.date.accessioned	2022-04-06T15:07:40Z
dc.date.available	2022-04-06T15:07:40Z
dc.date.issued	2021
dc.identifier.citation	IEEE Access. 2021, vol. 9, p. 161890-161897.	cs
dc.identifier.issn	2169-3536
dc.identifier.uri	http://hdl.handle.net/10084/146001
dc.description.abstract	The most common index data structures used by whole genome aligners (WGA) are based on suffix trees (ST), suffix arrays, and FM-indexes. These data structures perform well because WGA work with sequences of letters over small alphabets; for example, the four letters a, c, t, g for DNA alignment. A novel whole genome aligner, which we are developing, will work with distances between the label sites on DNA samples, which are represented as a sequence of positive integers. The size of the alphabet Σ is theoretically unlimited. This has prompted us to investigate whether there is a better structure that would improve search performance on large alphabets compared to the commonly used suffix-based structures. This paper presents the implementation of a highly optimized hybrid index data structure based on a ternary search tree (TST) and hash tables, which performs much better than the ST when working with strings over large alphabets. Single-core parallelism was achieved using the Advanced Vector Extensions (AVX) single instruction, multiple data (SIMD) instruction set. When searching for short k-mers over an alphabet of 25,695 letters, our index search performance was up to 29 times better than the search performance of the reference ST. When the alphabet was compressed to approximately 1,300 letters, our index search performance was still up to 2.6 times better than the ST. The source code is freely available at http://olgen.cz/Resources/Upload/Home/public/software/hds.zip under the MIT license.	cs
dc.language.iso	en	cs
dc.publisher	IEEE	cs
dc.relation.ispartofseries	IEEE Access	cs
dc.relation.uri	https://doi.org/10.1109/ACCESS.2021.3121749	cs
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	cs
dc.subject	engineering management	cs
dc.subject	engineering in medicine and biology	cs
dc.subject	bioinformatics	cs
dc.subject	computational and artificial intelligence	cs
dc.subject	computer architecture	cs
dc.subject	data structures	cs
dc.subject	tree data structures	cs
dc.subject	parallel processing	cs
dc.subject	parallel algorithms	cs
dc.title	Fast hybrid data structure for a large alphabet k-mers indexing for whole genome alignment	cs
dc.type	article	cs
dc.identifier.doi	10.1109/ACCESS.2021.3121749
dc.rights.access	openAccess	cs
dc.type.version	publishedVersion	cs
dc.type.status	Peer-reviewed	cs
dc.description.source	Web of Science	cs
dc.description.volume	9	cs
dc.description.lastpage	161897	cs
dc.description.firstpage	161890	cs
dc.identifier.wos	000730465200001
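
To illustrate the idea named in the abstract (a ternary search tree combined with hash tables for k-mers over a large integer alphabet), the following is a minimal sketch only. It is not the authors' implementation: the layout (a hash table on the first symbol, then one ternary search tree per first symbol) is an assumption made for illustration, the names `HybridKmerIndex` and `build_index` are hypothetical, and the AVX/SIMD optimizations mentioned in the abstract are not modelled.

```python
from typing import Dict, List, Optional, Sequence


class _Node:
    """One ternary search tree node holding a single integer symbol."""

    __slots__ = ("symbol", "lo", "eq", "hi", "positions")

    def __init__(self, symbol: int) -> None:
        self.symbol = symbol
        self.lo: Optional["_Node"] = None   # siblings with a smaller symbol
        self.eq: Optional["_Node"] = None   # subtree for the next k-mer symbol
        self.hi: Optional["_Node"] = None   # siblings with a larger symbol
        self.positions: List[int] = []      # start offsets of k-mers ending here


class HybridKmerIndex:
    """Assumed layout: hash table on the first symbol, TST for the rest."""

    def __init__(self) -> None:
        self._roots: Dict[int, _Node] = {}

    def insert(self, kmer: Sequence[int], position: int) -> None:
        if not kmer:
            raise ValueError("k-mer must not be empty")
        node = self._roots.get(kmer[0])
        if node is None:
            node = self._roots[kmer[0]] = _Node(kmer[0])
        for symbol in kmer[1:]:
            if node.eq is None:
                node.eq = _Node(symbol)
            node = node.eq
            # Binary-search-tree walk among the candidates for this level.
            while symbol != node.symbol:
                if symbol < node.symbol:
                    if node.lo is None:
                        node.lo = _Node(symbol)
                    node = node.lo
                else:
                    if node.hi is None:
                        node.hi = _Node(symbol)
                    node = node.hi
        node.positions.append(position)

    def search(self, kmer: Sequence[int]) -> List[int]:
        """Return the start offsets of all indexed occurrences of kmer."""
        if not kmer:
            return []
        node = self._roots.get(kmer[0])
        for symbol in kmer[1:]:
            if node is None:
                return []
            node = node.eq
            while node is not None and symbol != node.symbol:
                node = node.lo if symbol < node.symbol else node.hi
        return list(node.positions) if node is not None else []


def build_index(sequence: Sequence[int], k: int) -> HybridKmerIndex:
    """Index every length-k window of a sequence of positive integers."""
    index = HybridKmerIndex()
    for start in range(len(sequence) - k + 1):
        index.insert(sequence[start:start + k], start)
    return index
```

In this sketch each symbol comparison is a plain integer comparison, so the cost per level depends on the number of distinct symbols actually seen at that level rather than on the nominal alphabet size.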

Files in this item

Name: 2169-3536-2021v9p161890.pdf
Size: 1.107Mb
Format: PDF

This item appears in the following Collection(s)

Články z časopisů s impakt faktorem / Articles from Impact Factor Journals [6377]
Articles from journals (since 2008) that had an impact factor at the time the article was published.
OpenAIRE [5085]
Collection intended for harvesting by the OpenAIRE infrastructure; it contains open-access publications and, where applicable, other publications resulting from projects of the European Commission framework programmes (FP7, H2020, Horizon Europe).
Publikační činnost Katedry informatiky / Publications of Department of Computer Science (460) [562]
The collection contains bibliographic records of publications (articles) by academic staff of the Department of Computer Science (460) in journals and in Lecture Notes in Computer Science registered in Web of Science from 2003 to the present.
Publikační činnost VŠB-TUO ve Web of Science / Publications of VŠB-TUO in Web of Science [7798]
The collection contains bibliographic records of articles by academic staff of VŠB-TUO published in journals indexed in Web of Science from 1990 to the present.

Except where otherwise noted, this item's license is described as http://creativecommons.org/licenses/by/4.0/