There is a need for systems and techniques that allow genomic data representing unidentified genomic material to be compared with reference genomic data associated with known genomic material, so that the unknown material can be rapidly identified. There is also a need to achieve this aim in a manner that is faster (hours rather than days), less expensive, and more computationally efficient. The present disclosure aims to meet this need.
In some embodiments, genomic information may be identified by computing systems without access to a database of reference genomic information, instead relying on locally stored probabilistic data structures representing reference genomic information. Query genomic data, such as data taken from a read-set, may be divided into sub-strings, and each of the locally stored probabilistic data structures may be queried with each of the extracted sub-strings, generating probabilistic outputs indicating either that (a) the sub-string is probably included in the set of data represented by the probabilistic data structure or (b) the sub-string is definitely not included in that set of data. Based on the number and/or proportion of sub-strings from a read-set that are indicated as being likely represented by a given probabilistic data structure, a likely identity or classification for the genomic information in the read-set may be determined.
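By way of non-limiting illustration, the querying and classification described above may be sketched in Python as follows. The sketch assumes that each probabilistic data structure is a Bloom filter, which is one structure exhibiting the "probably included" / "definitely not included" behavior; the disclosure does not restrict the structure to this type. The names `BloomFilter`, `kmers`, `build_filter`, and `classify`, the sub-string length `k`, the filter size, the number of hash functions, and the `min_fraction` threshold are all hypothetical and chosen only for the example.

```python
import hashlib
from typing import Dict, Iterable, List, Optional, Tuple


class BloomFilter:
    """Minimal Bloom filter (assumed structure): 'probably present' or 'definitely absent'."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterable[int]:
        # Derive several bit positions from one hash function by varying its salt.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def kmers(sequence: str, k: int) -> List[str]:
    """Divide a sequence into all overlapping sub-strings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_filter(reference: str, k: int) -> BloomFilter:
    """Represent one reference sequence as a locally stored filter."""
    bf = BloomFilter()
    for sub in kmers(reference, k):
        bf.add(sub)
    return bf


def classify(
    reads: List[str],
    filters: Dict[str, BloomFilter],
    k: int,
    min_fraction: float = 0.5,  # hypothetical decision threshold
) -> Tuple[Optional[str], Dict[str, float]]:
    """Query every filter with every sub-string and pick the best-supported reference."""
    subs = [s for read in reads for s in kmers(read, k)]
    if not subs or not filters:
        return None, {}
    # Proportion of sub-strings reported as probably included, per reference.
    scores = {name: sum(s in bf for s in subs) / len(subs) for name, bf in filters.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] >= min_fraction else None), scores


# Toy example with made-up sequences (not real genomic data).
REF_A = "ACGTTGCAAGGCTTACGATCGGATACCGTAGCTAAGGCTA"
REF_B = "TTTTGGGGCCCCAAAATTGGCCAATGCATGCAACCGGTTA"
FILTERS = {"organism_A": build_filter(REF_A, 8), "organism_B": build_filter(REF_B, 8)}
LABEL, SCORES = classify([REF_A[5:30]], FILTERS, k=8)
```

In this sketch no reference database is consulted at query time; only the pre-built filters are held locally. Because a Bloom filter never reports a stored item as absent, a read drawn from a reference scores a proportion of 1.0 against that reference's filter, while unrelated filters contribute only occasional false positives.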