kmer diversity


Estimating the genetic diversity of viruses with high rates of nucleotide substitution, such as HIV-1 and influenza, is difficult with data generated by Next Generation Sequencing (NGS). Several successful types of approaches have been developed since the technology became available.

The gold standard for assessing the genetic diversity within a viral population is Single Genome Amplification (SGA), in which, after a series of dilutions, a single viral strain is presumed to have been isolated. Multiple viral strains are isolated in this way and NGS is performed on them. Genetic diversity is then calculated from these isolates, which are treated as a representative sample of the population (Maldarelli et al, 2013 and Gibbs et al, 2007). SGA is time consuming and expensive, and it often fails to represent the entire viral population within a single host, capturing only the dominant population.

A significant amount of attention has been paid to methods that take advantage of the longer reads produced by some NGS techniques to reconstruct a population of haplotypes with a generative model. Provided that the haplotype distribution is correctly estimated, these haplotypes will be representative of the diversity of the original population.

Other methods, like Tanden (Zukurov et al, 2016), take advantage of the shorter, more accurate, higher-coverage reads produced by other NGS methods. This avoids the difficulties associated with haplotype reconstruction in favor of a frequency analysis of specific sites of the virus genome.

Here we propose a different approach that circumvents the assembly process entirely. We perform a k-mer analysis on the raw reads and then regress population diversity on various summaries of the k-mer counts. This approach is flexible: it can adapt to any sequencing technology, is robust to noise, and is fairly accurate for the naïve type of approach it represents. The key questions we examine here are (a) whether this is a valid and useful technique for estimating virus population diversity, (b) which regressors are the best predictors of population diversity, and (c) what the failure cases of this approach are and how to address them.
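
To make the idea concrete, the following is a minimal sketch of the first half of the pipeline: counting k-mers directly from raw reads and reducing the counts to a few candidate summaries. The function names (count_kmers, summarize_kmers) and the particular summaries (distinct k-mers, singletons, Shannon entropy) are illustrative assumptions, not the regressors evaluated in this work. The resulting values could then serve as inputs to a regression against known population diversity.

```python
from collections import Counter
import math


def count_kmers(reads, k):
    """Count every k-mer in the raw reads; no assembly or alignment is done.

    k-mers containing characters other than A, C, G, T (e.g. N) are skipped.
    This filtering rule is an assumption made for the sketch.
    """
    valid = set("ACGT")
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= valid:
                counts[kmer] += 1
    return counts


def summarize_kmers(counts):
    """Reduce a k-mer count table to a few hypothetical candidate regressors."""
    total = sum(counts.values())
    if total == 0:
        return {"total": 0, "distinct": 0, "singletons": 0, "entropy": 0.0}
    # Shannon entropy (in bits) of the empirical k-mer frequency distribution.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "total": total,                # number of k-mer occurrences
        "distinct": len(counts),       # number of different k-mers observed
        "singletons": sum(1 for c in counts.values() if c == 1),
        "entropy": entropy,
    }


if __name__ == "__main__":
    # Two toy reads that differ at a single position.
    reads = ["ACGTAC", "ACGTTC"]
    print(summarize_kmers(count_kmers(reads, k=3)))
```

In this toy example the single differing position introduces k-mers that appear only once, which raises both the singleton count and the entropy relative to two identical reads. Sequencing errors produce the same kind of signal, so any real use would need to separate the two.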