Using dbSNP via SQL queriesdbSNP is a large relational database maintained by the National Center for Biotechnology Information (NCBI) on a Microsoft SQLServer. NCBI provides several public interfaces to dbSNP:
NCBI does NOT, however, provide an interface for submitting SQL commands directly to the SQLServer. Several organizations, such as the Indiana University School of Medicine (IUSM), download the dbSNP data from the NCBI FTP repository, load it into their local MS SQLServers, and make it available to local users for use via SQL. This page describes the basic structure of the dbSNP database for those interested in accessing it through SQL. It provides several example SQL queries along with sample output. A basic understanding of relational databases and SQL is assumed, particularly the structure of a relational table, and the functions of the SELECT and JOIN directives with their WHERE and ON clauses. There are many introductions to SQL available over the Web. Here are a couple from: Beginning with Build 125 dbSNP was split into a set of species specific databases, each with its own set of tables, along with a database containing a set of tables that are shared by the species databases. The examples in this page were done using Human SNP data, most of them with Build 127.
Topics
PolymorphismAs its name suggests, dbSNP is a database containing information about "SNPs", Single Nucleotide Polymorphisms. A SNP "may be thought of" as a single location on the genome of an organism or species where variation is observed. For example, one sequence including base position 5406 of chromosome 1 may contain a "T", while another sequence covering the same region contains a "G" in the same position. This would constitute a SNP with a "T/G" allelic variation at position 5406 of chromosome 1. However, this description hides a fair amount of complexity and much of this document will expand upon this basic model as it describes some of the most important tables within dbSNP.
Alleles and the Univariation tableTo begin with, there may be more than two possible alleles in a specific location. Perhaps position 5406 on chromosome 1 could contain "A", "G", or "T" (which may be displayed as the string "A/G/T"). In addition, not all SNPs are "single nucleotide" changes. There are many cases where a string of nucleotides may be present in some sequences but absent in others. This may constitute a multi-nucleotide "insertion" or "deletion", that is sometimes called a "Deletion Insertion Polymorphism" (DIP), or it may involve replacing one sequence of nucleotides with another, longer sequence, as in a "range deletion", etc. The list of observed polymorphism types recorded in dbSNP is kept in the LocTypeCode table within the SHARED database, and described in detail in this in-depth document from NCBI. The dbSNP Univariation table shows all observed dbSNP variations in a uniform fashion. SQL commands "like":
select * from Univariation order by univar_id, allele_cnt
can be used to extract data similar to what is shown in the following
table from Univariation.
The allele_cnt field provides the number of separate allelic varations listed in the var_str field. The IUPACK code is a code used to represent the var_str value. Among others, the subsnp_class identifies the following types SNP classes:
The field named "top_or_bot_strand" hints at additional complexity involved in the identification of the strand upon which a SNP occurs, the values used to identify a range of positions, and the direction and complementarity used to display alleles and flanking regions. It appears that chromosome base positions are counted relative to the 5' end of of the top strand regardless of the strand, contig or submitted sequence upon which a base is found. So, for example, the 10th base from the start of the top strand of chromosome 1 will be said to be at position 10, whereas the 10th base from the end of the bottom strand will also be at position 10. Some tables use "orientation" to describe how a sequence within a defined region is being displayed (and/or how it occurs in the chromosome). There are two orientations: "plus" and "minus". Left-to-right displays of sequences in both orientations are in 5'-to-3' order, so that a plus orientation and a minus orientation of the same sequence appear in reverse complement. For example, the sequence displayed in "plus" orientation as: sequence: 5' AGCTC 3' position: 1...5will appear in complemented form as: sequence: 5' TCGAG 3' position: 1...5and will appear backwards as: sequence: 3' CTCGA 5' position: 5...1and in "minus" orientation (both backwards and complemeted) as: sequence: 3' GAGCT 5' position: 5...1 Consequently, an actual 3'-to-5' "minus" sequence like: sequence: 3' GAGCT 5' position: 1...5will (sometimes) appear in reverse complement form as a left-to-right, 5'-to-3' sequence like: sequence: 5' AGCTC 3' position: 1...5(A plus orientation also sometimes appears to indicate placement on a "top" strand, whereas a minus orientation indicates placement on a "bottom" strand.)
Submitted SNPs and the SNP, SNPSubSNPLink, and SubSNPSeq* tablesAnother source of complexity is that submitted sequences that cover the same region are clustered together, and represented by the submission with the longest flanking region. This submission is known as the "exemplar submission".
Here is a dbSNP127_human SNP table, produced by the SQL query:
The identifiers for the submissions associated with these SNPs
(SubSNP_id) can be found by using a query like:
Submitted SNP flanking regions could then be viewed by using SQL
queries on the (the 20 million, or so, rows of) SubSNPseq3_p1,
SubSNPseq3_p2, and SubSNPseq3_p3, and likewise for
SubSNPseq5... like:
In addition, it should be possible to examine the submitted
sequences related to specific SNPs by using a query like:
By analogy, the exemplar_subsnp_id will allow us to get the exemplar flanking
sequences from SubSNPSeq5_p1 (...p2, and ...p3) and SubSNPSeq3_p1, et
al., using commands like:
Finding Contig and chromosome info using SNPMapInfo, SNPContigLoc and ContigInfoPossibly the biggest source of complexity within dbSNP is the fact that the location of a SNP is determined by BLASTing an exemplar sequence against a recent Build of the relevant genome, and this process often yields matches to multiple contigs. As a result, the location of a SNP may be taken to be a SET of "putative" mappings to the relevant genome.
To get summary information about the set of mappings, you can query the
SNPMapInfo table with a command like:
This table shows the total number of chromosomes and contigs mapped to for each SNP_ID. Those entries are amalgamated into a single value in the "weight" column. NCBI assigns weight values on the basis of the number and types of mappings within a specific genome assembly ("refernce", "Celera", etc.). A weight of
An Entrez search for SNPS within a chromosome range will find ONLY those SNPs with weight < 3, and the Entrez Neighbors popup will only return neighbors with weight < 3.
To get more detailed information about each mapping, you can consult
the ContigInfo table. Here is a
slightly-edited portion of the (7 thousand entry) dbSNP127_human
Contig table, produced by sending the SQL command:
Note that Contig NT_077402 represents positions 0 through 167279 of chromosome 1 (as denoted by the "1" in the "contig_chr" column), and has a "plus" orientation ("orient" is set to "0") relative to the chromsome. Also note that these submissions are all "reference" submissions with respect to NCBI, but that reference submissions provided by Celera Genomics will show "Celera" in place of "reference" in the appropriate fields.
The TABLES that map "individual" SNPs to Contigs are versions of
names containing the string "SNPContigLoc". For example, the
dbSNP127_human version of SNPContigLoc is "b127_SNPContigLoc_36_2",
which indicates it is part of SNP Build 127, and based on the Version
2 of Build 36 of the human genome. The following command
Among other information, SNPContigLoc shows a SNP allele, its position from the start of the chromosome, its orientation relative to the contig (?), the quality of the SNP's alignment to the chromosome (apparently calculated using a complex formula that takes the distances from a target SNP to each mismatch into consideration). Also shown is the loc_type which takes the following values:
Specific information about the locations of insertions and deletions includes the neighbor positions on the "flanking" and "subject" sequences (lf_ngbr, rf_ngbr, lc_ngbr, and rc_ngbr) that are described in detail in the NCBI document: Types of Alignments. This table shows that SNP 4030245 maps to one contig, 4030243 maps to 3 contigs, and 4030249 maps to 2 contigs. SNPContigLoc does NOT specify the chromosome upon which a SNP resides, even though it DOES specify the position (phys_pos_from) the SNP occupies on whatever chromosome that might be. In this case phys_pos_from is a position count beginning with 0. Phys_pos_from is defined by NCBI as "Start position of SNP on chromosome. Phys_pos_from is obtained by adding asn_from to the contig start position in ContigInfo table. It is always the position from the beginning of the chromosome regardless of the SNP orienation to the contig and regardless of the contig orienation to chromosome." The phrase "position from the beginning of the chromosome" appears to mean "the number of bases from the first base," so that the "base position" is the same as "phys_pos_from + 1".
To determine which chromosome a SNP occurs on, SNPContigLoc can be
joined with ContigInfo on the ctg_id field.
For example, one might find that, and other, information from the two
tables for the 3 SNPs listed above by using a command like:
Note that SNP 4030243 maps to 3 different chromosomes, while 4030249 maps to 2 different chomsomes. Note also that although 4030245 maps to only one chromosome, the Aln_quality of that mapping is lower than the Aln_quality values for two of the 4030249 mappings.
Using SNPContigLocus to find SNPs associated with genic regions
Finally, there are several tables devoted to mapping SNPs to genes
and/or exons. These will not be discussed beyond this one example:
This table shows the affected gene and its base, along with the affected
protein and the amino acid residue. In addition, the function class
(fxn_class) is displayed. Values in this field are itemized in
the SNPfunctioncode table, and include: locus (1), coding (2),
cds-synon (3), cds-nonsynon (4), UTR (5), intron (6), splice-site (7),
etc.
To itemize SNPs within the JAK3 locus use an SQL statement like:
The following query
Additional informationNCBI provides a great deal of information about dbSNP:
The information in this page should help users who consult the NCBI dbSNP documentation. In particular, it should provide a background for appreciating the SQL example presented therein, several of which are simply extensions of examples in this introduction, enhanced by joins that bring in coding tables, like Univariation or SNPClassCode. Please report errors with this document to the author via data @ iu.edu, and feel free to make suggestions for improvements, and/or comments regarding its usefulness.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



