VIRUS_DB2.0
An Online Crowdsourcing Virus Database for Classification Based on Natural Vector
The natural vector is a fast and accurate tool for classifying genome sequences. This website handles both single-segmented and multi-segmented viruses.
The rapid development of sequencing technologies produces a large number of viral genome sequences. Characterizing genetic sequences and determining viral origins have always been important issues in virology. The study of sequence similarity at the interfamily level is especially crucial for revealing key aspects of evolutionary history.
Commonly used multiple sequence alignment methods are known to fail for diverse systems spanning different families of RNA viruses. In addition, their heavy computational cost makes them impractical for genome classification and evolutionary analysis at scale. Over the past 10 years, alignment-free methods have attracted a lot of attention from researchers. More recently, the genome space method has been shown to be a fast and efficient way to characterize nucleotide sequences.
Unlike k-mer methods, which ignore the positional information of nucleotides, the natural vector approach constructs a one-to-one correspondence between genome sequences and numerical vectors. Along these lines, we construct a viral genome space based on the quantity and global distribution of nucleotides in viral sequences. Each sequence is uniquely represented by a single point in this space, that is, a vector called a Natural Vector (NV). The Euclidean distance between two points represents the biological distance between the corresponding two viruses. This allows us to compare a virus against all available viruses simultaneously at any level (e.g., Baltimore class, family, subfamily, genus, and species) in a fast and efficient manner. Using a higher-dimensional NV doesn’t change the classification or phylogenetic relationships. Our approach to classifying viral genomes is not a partial-sequence-based method; it uses the global sequence information of genomes, and it does not depend on any model assumption. Furthermore, we propose a two-dimensional graphical representation of viruses in the genome space, which is unique and likewise model-free.
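To make the idea of "quantity and global distribution of nucleotides" concrete, here is a minimal Python sketch. It assumes the commonly published 12-dimensional form of the natural vector: for each nucleotide, its count, its mean position, and its normalized second central moment of position. This page does not spell out the formula, so the definition, the function names (natural_vector, euclidean_distance), and the ordering of the components are assumptions made for illustration, not the website's actual implementation.

```python
import math


def natural_vector(seq, alphabet="ACGT"):
    """Return a 12-dimensional vector for a nucleotide sequence.

    Assumed layout (illustrative, not taken from this page):
    [n_A, n_C, n_G, n_T, mu_A, mu_C, mu_G, mu_T, D2_A, D2_C, D2_G, D2_T]
    Characters outside the alphabet are skipped when counting a base,
    but they still contribute to the total sequence length n.
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in alphabet:
        # 1-based positions at which this nucleotide occurs
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # Convention assumed here: absent nucleotides contribute zeros
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        # Second central moment of position, normalized by n_k * n
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def euclidean_distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    # Toy sequences, not real viral genomes
    nv1 = natural_vector("ACGTACGT")
    nv2 = natural_vector("AACCGGTT")
    print(euclidean_distance(nv1, nv2))
```

Because the Euclidean distance is unchanged when the same components are reordered in both vectors, the grouping chosen above (counts, then means, then moments) does not affect any comparison.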
NV can be applied not only to classify viruses [3][4], but also to predict the missing labels of viruses. In our paper [3], we show that the predicted labels are highly consistent with the published literature.
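One simple way to predict a missing label in a vector space is a nearest-neighbor rule: assign the unlabeled virus the label of the closest labeled point under the Euclidean distance. The sketch below illustrates that idea only. This page does not describe the prediction procedure used in [3], so the rule itself, the function name predict_label, and the toy two-dimensional vectors and labels are all hypothetical.

```python
import math


def predict_label(query, labeled):
    """Return the label of the labeled vector closest to `query`.

    `labeled` is a list of (vector, label) pairs. This is a plain
    1-nearest-neighbor rule, assumed here for illustration only.
    """
    if not labeled:
        raise ValueError("need at least one labeled vector")

    def distance(u, v):
        # Euclidean distance between two equal-length vectors
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    _, best_label = min(labeled, key=lambda item: distance(query, item[0]))
    return best_label


# Toy, hypothetical reference points (real natural vectors are longer)
reference = [
    ([2.0, 0.0], "family_A"),
    ([0.0, 3.0], "family_B"),
]
print(predict_label([1.5, 0.5], reference))
```

The same rule can be applied at any taxonomic level by changing which label (for example family or genus) is attached to each reference vector.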