Using dbSNP via SQL queries

dbSNP is a large relational database maintained by the National Center for Biotechnology Information (NCBI) on a Microsoft SQLServer. NCBI provides several public interfaces to dbSNP:

  • a powerful web-based interface (integrated with Entrez) for public use,
  • a set of web-accessible scripts and Web Services known as the Entrez eUtils, and,
  • an FTP repository of the data exported from the MS SQLServer.

NCBI does NOT, however, provide an interface for submitting SQL commands directly to the SQLServer.

Several organizations, such as the Indiana University School of Medicine (IUSM), download the dbSNP data from the NCBI FTP repository, load it into their local MS SQLServers, and make it available to local users for use via SQL.

This page describes the basic structure of the dbSNP database for those interested in accessing it through SQL. It provides several example SQL queries along with sample output.

A basic understanding of relational databases and SQL is assumed, particularly the structure of a relational table, and the functions of the SELECT and JOIN directives with their WHERE and ON clauses. There are many introductions to SQL available over the Web. Here are a couple from:

Beginning with Build 125 dbSNP was split into a set of species specific databases, each with its own set of tables, along with a database containing a set of tables that are shared by the species databases. The examples in this page were done using Human SNP data, most of them with Build 127.

Topics

  • Polymorphism
  • Alleles and the Univariation table
  • Submitted SNPs and the SNP, SNPSubSNPLink, and SubSNPSeq* tables
  • Finding Contig and chromosome info using SNPMapInfo, SNPContigLoc and ContigInfo
  • Using SNPContigLocus to find SNPs associated with gene regions

Polymorphism

As its name suggests, dbSNP is a database containing information about "SNPs", Single Nucleotide Polymorphisms. A SNP "may be thought of" as a single location on the genome of an organism or species where variation is observed. For example, one sequence including base position 5406 of chromosome 1 may contain a "T", while another sequence covering the same region contains a "G" in the same position. This would constitute a SNP with a "T/G" allelic variation at position 5406 of chromosome 1.

However, this description hides a fair amount of complexity and much of this document will expand upon this basic model as it describes some of the most important tables within dbSNP.

Alleles and the Univariation table

To begin with, there may be more than two possible alleles in a specific location. Perhaps position 5406 on chromosome 1 could contain "A", "G", or "T" (which may be displayed as the string "A/G/T"). In addition, not all SNPs are "single nucleotide" changes. There are many cases where a string of nucleotides may be present in some sequences but absent in others. This may constitute a multi-nucleotide "insertion" or "deletion", that is sometimes called a "Deletion Insertion Polymorphism" (DIP), or it may involve replacing one sequence of nucleotides with another, longer sequence, as in a "range deletion", etc. The list of observed polymorphism types recorded in dbSNP is kept in the LocTypeCode table within the SHARED database, and described in detail in this in-depth document from NCBI.

The dbSNP Univariation table shows all observed dbSNP variations in a uniform fashion. SQL commands "like":

    select * from Univariation order by univar_id, allele_cnt 
can be used to extract data similar to what is shown in the following table from Univariation.

univar_id (int) var_str (varchar) allele_cnt (smallint) subsnp_
class
(tinyint)
iupack_
code
(char)
top_or_
bot_strand
(char)
src_code (tinyint) rev_
univar_id
(int)
2 A/G 2 1 R T 1 35
3 -/A 2 2 N ? 1 14
5 A/C 2 1 M T 1 25
6 -/R 2 2 N ? 1 19
146 A/G/T 3 1 D ? 1 128
232 -/TCAG 2 2 N ? 1 359
233 -/ATGC 2 2 N ? 1 216
234 -/TGCA 2 2 N ? 1 234
109643 -/A/G/GT/GTGT/GTGTGTGT 6 7 N ? 4 103582
109804 -/A/ATTG/ATTGTGTG/ATTGTGTGTG/T 6 7 N ? 4 103799
112317 -/A/C/G/T/TACACACA 6 7 N ? 2 114961
115583 (TG)19/20/21/22/23/24/-/GT 8 4 N ? 4 112993
115254 -/A/AAAAAAAAAA/AAAAAAAATA/ AAATATATA/ATATATA/T/TATATATAT 8 7 N ? 4 112637
115202 (TG)17/18/19/20/21/22/A/T 8 4 N ? 4 112581

The allele_cnt field provides the number of separate allelic varations listed in the var_str field. The IUPACK code is a code used to represent the var_str value. Among others, the subsnp_class identifies the following types SNP classes:

  • single base polymorphisms (1),
  • DIPs: deletion/insertion polymorphisms (2),
  • microsatellites (3), and
  • Multiple Nucleotide Polymophisms (MNP) (8),

The field named "top_or_bot_strand" hints at additional complexity involved in the identification of the strand upon which a SNP occurs, the values used to identify a range of positions, and the direction and complementarity used to display alleles and flanking regions.

It appears that chromosome base positions are counted relative to the 5' end of of the top strand regardless of the strand, contig or submitted sequence upon which a base is found. So, for example, the 10th base from the start of the top strand of chromosome 1 will be said to be at position 10, whereas the 10th base from the end of the bottom strand will also be at position 10.

Some tables use "orientation" to describe how a sequence within a defined region is being displayed (and/or how it occurs in the chromosome). There are two orientations: "plus" and "minus". Left-to-right displays of sequences in both orientations are in 5'-to-3' order, so that a plus orientation and a minus orientation of the same sequence appear in reverse complement. For example, the sequence displayed in "plus" orientation as:

   sequence:   5' AGCTC 3'
   position:      1...5
will appear in complemented form as:
   sequence:   5' TCGAG 3'
   position:      1...5
and will appear backwards as:
   sequence:   3' CTCGA 5'
   position:      5...1
and in "minus" orientation (both backwards and complemeted) as:
   sequence:   3' GAGCT 5'
   position:      5...1

Consequently, an actual 3'-to-5' "minus" sequence like:

   sequence:   3' GAGCT 5'
   position:      1...5
will (sometimes) appear in reverse complement form as a left-to-right, 5'-to-3' sequence like:
   sequence:   5' AGCTC 3'
   position:      1...5
(A plus orientation also sometimes appears to indicate placement on a "top" strand, whereas a minus orientation indicates placement on a "bottom" strand.)

Submitted SNPs and the SNP, SNPSubSNPLink, and SubSNPSeq* tables

Another source of complexity is that submitted sequences that cover the same region are clustered together, and represented by the submission with the longest flanking region. This submission is known as the "exemplar submission".

Here is a dbSNP127_human SNP table, produced by the SQL query:

   select * from SNP where snp_id in (4030243, 4030245, 4030249)
that shows the submission identifier (exemplar_subsnp_id) associated with several SNPs.

snp_id (int) avg_
hetero-
zygosity
(real)
het_se (real) create_time (smalldatetime) last_updated_
time
(smalldatetime)
CpG_
code
(tinyint)
tax_id (int) vali-
dation_
status
(tinyint)
exemplar_
subsnp_id
(int)
univar_
id
(int)
cnt_
subsnp
(smallint)
map_
property
(tinyint)
4030249 null null 2002-10-10 09:21:00.0 2004-10-04 14:25:00.0 1 9606 0 5379757 25 5 2
4030245 null null 2002-10-10 09:21:00.0 2004-10-04 14:25:00.0 1 9606 0 5379753 56 1 1
4030243 null null 2002-10-10 09:21:00.0 2004-10-04 14:25:00.0 1 9606 0 5379751 471 1 2

The identifiers for the submissions associated with these SNPs (SubSNP_id) can be found by using a query like:

   select * from SNPSubSNPLink where snp_id in (4030243, 4030245, 4030249)
which will produce a table like:

subsnp_id (int) snp_id (int) substrand_
reversed_flag
(bit)
create_time (smalldatetime) last_updated_time (smalldatetime) build_id (int) comment (varchar)
5379751 4030243 0 2002-10-10 09:21:00.0 2002-10-10 09:21:00.0 108 null
5379753 4030245 0 2002-10-10 09:21:00.0 2002-10-10 09:21:00.0 108 null
5379757 4030249 0 2002-10-10 09:21:00.0 2002-10-10 09:21:00.0 108 null
11212736 4030249 0 2003-07-11 22:49:00.0 2004-04-05 10:28:00.0 121 null
11472404 4030249 0 2003-07-11 22:49:00.0 2004-04-05 10:28:00.0 121 null
15442755 4030249 1 2004-02-27 09:30:00.0 2004-04-05 10:28:00.0 121 null
15765853 4030249 0 2004-02-27 09:30:00.0 2004-04-05 10:28:00.0 121 null

Submitted SNP flanking regions could then be viewed by using SQL queries on the (the 20 million, or so, rows of) SubSNPseq3_p1, SubSNPseq3_p2, and SubSNPseq3_p3, and likewise for SubSNPseq5... like:

select 
   * 
from 
   SubSNPseq3_p1
where
   subsnp_id in ( 5379751, 5379753, 5379757, ... )
which, when using dbSNP128_human produces a table that resembles:

subsnp_id (int) type (tinyint) line_
num
(tinyint)
line (varchar)
5379751 3 0 GGAGCCATAGTGGAGCTTTCCTCTCTAAGCTCACCCAGCTCAAACTGNNACAGGAGA...
5379751 3 2 CNANGCGCTGCTTTTGCCTTGCTNCTGTGACCCCAGGNCAAGCTGCCNTC
5379751 3 3 ACCNTCTCNTGGGNCCAGNTTTNCCNCCNANTTGTACANGTGGTGCTGCA
5379751 3 4 CACCCTGNGNCCCTGGCCCNNNNNCNGNAGGNTGGCTNGGGANNGGNTGG
5379751 3 5 NCTCCTCAAANCAGCNCGNCTNGTCTNCATCNAGTNGCCCNGGTGCTGGG
5379751 3 6 TCNAGGGNANTNCGNACTGAGGNCTCTGAGCTAANCTGGGAAANCACNAN
5379753 3 0 NACAGGAGAATCTTCTTCGACTGCCAAGAGCGGTCCAAGGCAATGGTCAGCCACTGC...
5379753 3 2 TCACCNTCTCNTGGGNCCAGNTTTNCCNCCNANTTGTACANGTGGTGCTG
5379753 3 3 CACACCCTGNGNCCCTGGCCCNNNNNCNGNAGGNTGGCTNGGGANNGGNT
5379753 3 4 GGNCTCCTCAAANCAGCNCGNCTNGTCTNCATCNAGTNGCCCNGGTGCTG
5379753 3 5 GGTCNAGGGNANTNCGNACTGAGGNCTCTGAGCTAANCTGGGAAANCACN
5379753 3 6 ANNGTGNGCCNTTGGNAGGGCTGGGNGANGTGTNCNATGGGGGNTNGGGG
5379757 3 0 GAGGCTACTGATGATGCCTGCTGTGAACGCAGACACTGGTGTGATGCGATGCCTGCG...
5379757 3 2 GTGCCCGGTGCTGGGTCAGGGATCGACTGAGGCTCTGAGCTAACTGGGAA
5379757 3 3 ACACAGTGGCCTTGGAGGGCTGGGGAGTGTCATGGGGGTGGGGACAGGGA
5379757 3 4 GTCACCGGTCGCATGTGACTGAACTCTTCACCCCAGTCTGTGGCTTTCCC
5379757 3 5 GTTGCAGTGAGAGCCACGAGCCAAGGTG

In addition, it should be possible to examine the submitted sequences related to specific SNPs by using a query like:

select 
   * 
from 
   SNPSubSNPLink a
join
   SubSNPseq3_p1 b
on
   a.subsnp_id = b.subsnp_id
   AND
   a.snp_id in (4030243, 4030245, 4030249)

By analogy, the exemplar_subsnp_id will allow us to get the exemplar flanking sequences from SubSNPSeq5_p1 (...p2, and ...p3) and SubSNPSeq3_p1, et al., using commands like:

select 
   * 
from 
   SNP a 
join
   SubSNPseq3_p1 b
on 
   a.exemplar_subsnp_id = b.subsnp_id 
   AND 
   a.snp_id in (4030243, 4030245, 4030249)
The SNP table described above also provides univar_id which will link to the uniform allele table that was described earlier. To display the values for these fields, however, a SNP command is needed that accesses data in 2 tables, each residing in a different dbSNP "catalog", namely dbSNP128_human and dbSNP128_shared. For example, to get the actual 3' flanking sequences we need to use a command like:
select 
   * 
from 
   dbSNP128_human.dbo.SNP a
join
   dbSNP128_shared.dbo.univariation b
on
   a.snp_id in (4030243, 4030245, 4030249)
   AND
   a.univar_id = b.univar_id
which will give us, among other data, the variation string (VAR_STR), the allele count, the subSNP class, and the IUPAC code.

Finding Contig and chromosome info using SNPMapInfo, SNPContigLoc and ContigInfo

Possibly the biggest source of complexity within dbSNP is the fact that the location of a SNP is determined by BLASTing an exemplar sequence against a recent Build of the relevant genome, and this process often yields matches to multiple contigs. As a result, the location of a SNP may be taken to be a SET of "putative" mappings to the relevant genome.

To get summary information about the set of mappings, you can query the SNPMapInfo table with a command like:

select 
   * 
from
   SNPMapInfo
where
   snp_id in (4030243, 4030245, 4030249)
which will produce a table like:

snp_type (varchar) snp_id (int) chr_cnt (int) contig_
cnt
(int)
loc_cnt (int) weight (int) hap_cnt (int) placed_
cnt
(int)
grouped_
cnt
(int)
unplaced_
cnt
(int)
md5 (char) assembly (varchar)
rs 4030243 2 2 2 3 0 2 0 0 cc7bc4139...3c1e reference
rs 4030245 1 1 1 1 0 1 0 0 d35923860...01c4 reference
rs 4030249 2 2 2 3 0 2 0 0 47c1b6875...7935 reference

This table shows the total number of chromosomes and contigs mapped to for each SNP_ID. Those entries are amalgamated into a single value in the "weight" column.

NCBI assigns weight values on the basis of the number and types of mappings within a specific genome assembly ("refernce", "Celera", etc.). A weight of

  • 1 is assigned to any SNP with a unique mapping within a build,
  • 2 to a SNP with multiple mappings all to the same chromosome,
  • 3 to a SNP with fewer than 10 mappings to different chromosomes, etc.

An Entrez search for SNPS within a chromosome range will find ONLY those SNPs with weight < 3, and the Entrez Neighbors popup will only return neighbors with weight < 3.

To get more detailed information about each mapping, you can consult the ContigInfo table. Here is a slightly-edited portion of the (7 thousand entry) dbSNP127_human Contig table, produced by sending the SQL command:

     select top 5 * from b126_ContigInfo_36_1
to the SQLServer hosting dbSNP (and removing time-stamp fields):

ctg_id (int) tax_id (int) contig_acc (varchar) contig_ver (tinyint) contig_name (varchar) contig_chr (varchar) contig_start (int) contig_end (int)
960600001 9606 NT_077402 1 Hs1_77451_36 1 0 167279
960600006 9606 NT_019273 18 Hs1_19429_36 1 103893838 120498678
960600008 9606 NT_086586 1 Hs1_86217_36 1 120548679 120738217
960600009 9606 NT_077389 3 Hs1_77438_36 1 120788218 121186956
960600011 9606 NT_077382 3 Hs1_77431_36 1 13192499 13429748
Continued => orient (tinyint) contig_gi (int) group_term (varchar) group_label (varchar) contig_label (varchar) build_id (int) build_ver (int)
0 29794392 ref_assembly reference reference 36 2
0 88942992 ref_assembly reference reference 36 2
0 51459500 ref_assembly reference reference 36 2
0 51459383 ref_assembly reference reference 36 2
0 51459264 ref_assembly reference reference 36 2

Note that Contig NT_077402 represents positions 0 through 167279 of chromosome 1 (as denoted by the "1" in the "contig_chr" column), and has a "plus" orientation ("orient" is set to "0") relative to the chromsome.

Also note that these submissions are all "reference" submissions with respect to NCBI, but that reference submissions provided by Celera Genomics will show "Celera" in place of "reference" in the appropriate fields.

The TABLES that map "individual" SNPs to Contigs are versions of names containing the string "SNPContigLoc". For example, the dbSNP127_human version of SNPContigLoc is "b127_SNPContigLoc_36_2", which indicates it is part of SNP Build 127, and based on the Version 2 of Build 36 of the human genome. The following command

   select * from b127_SNPContigLoc_36_2 where snp_id in (4030245, 4030243, 4030249)
applied against the (24 million entry) dbSNP127_human table b127_SNPContigLoc_36_2 shows the entries for 3 SNP IDs:

snp_type (varchar) snp_id (int) ctg_id (int) asn_from (int) asn_to (int) lf_ngbr (int) rf_ngbr (int) lc_ngbr (int) rc_ngbr (int) loc_type (tinyint)
rs 4030249 960600001 5308 5308 495 497 5307 5309 2
rs 4030245 960600001 5405 5406 499 502 5405 5406 3
rs 4030243 960600001 5453 5456 499 501 5452 5457 1
rs 4030243 960600194 3965421 3965424 499 501 3965420 3965425 1
rs 4030243 960605600 323095 323098 499 501 323094 323099 1
rs 4030249 960605600 323245 323245 495 497 323244 323246 2
Continued => phys_pos_from (int) orientation (tinyint) allele (varchar) loc_sts_uid (int) aln_quality (float) num_mism (int) num_del (int) num_ins (int)
5308 1 C null 0.999995 1 0 0
5405 1 - null 0.77632 0 21 0
5453 1 CAAA null 0.780573 2 9 3
100333093 0 TTTG null 0.764994 6 9 5
57766532 0 TTTG null 0.727843 15 7 4
57766682 0 G null 0.94823 15 0 3

Among other information, SNPContigLoc shows a SNP allele, its position from the start of the chromosome, its orientation relative to the contig (?), the quality of the SNP's alignment to the chromosome (apparently calculated using a complex formula that takes the distances from a target SNP to each mismatch into consideration). Also shown is the loc_type which takes the following values:

Numeric valueCode Description
1InsOnCtg Insertion on contig
2trueSNP Contig allele is one base long
3DelOnCtg Deletion on the contig
4LongerOnCtg Part of the SNP flanking sequence is replaced with the contig sequence of longer length
5EqualOnCtg Part of the SNP flanking sequence is replaced with the contig sequence of exactly the same length
6ShorterOnCtg Part of the SNP flanking sequence is replaced with the contig sequence of a shorter length.

Specific information about the locations of insertions and deletions includes the neighbor positions on the "flanking" and "subject" sequences (lf_ngbr, rf_ngbr, lc_ngbr, and rc_ngbr) that are described in detail in the NCBI document: Types of Alignments.

This table shows that SNP 4030245 maps to one contig, 4030243 maps to 3 contigs, and 4030249 maps to 2 contigs.

SNPContigLoc does NOT specify the chromosome upon which a SNP resides, even though it DOES specify the position (phys_pos_from) the SNP occupies on whatever chromosome that might be. In this case phys_pos_from is a position count beginning with 0. Phys_pos_from is defined by NCBI as

"Start position of SNP on chromosome. Phys_pos_from is obtained by adding asn_from to the contig start position in ContigInfo table. It is always the position from the beginning of the chromosome regardless of the SNP orienation to the contig and regardless of the contig orienation to chromosome."

The phrase "position from the beginning of the chromosome" appears to mean "the number of bases from the first base," so that the "base position" is the same as "phys_pos_from + 1".

To determine which chromosome a SNP occurs on, SNPContigLoc can be joined with ContigInfo on the ctg_id field. For example, one might find that, and other, information from the two tables for the 3 SNPs listed above by using a command like:

   select 
      snp_type,snp_id, phys_pos_from, orientation, allele,
      aln_quality, num_del, num_ins, contig_acc, contig_chr, contig_label
   from 
      b127_snpcontigloc_36_2 a
   join 
      b127_contiginfo_36_2 b
   on 
      a.ctg_id = b.ctg_id 
      and 
      a.snp_id in (4030243, 4030245, 4030249) 
which will yield a result like:

snp_type (varchar) snp_id (int) phys_pos_from (int) orien-tation (tinyint) allele (varchar) aln_quality (float) num_del (int) num_ins (int) contig_acc (varchar) contig_
chr
(varchar)
contig_
label
(varchar)
rs 4030249 5308 1 C 0.999995 0 0 NT_077402 1 reference
rs 4030245 5405 1 - 0.77632 21 0 NT_077402 1 reference
rs 4030243 5453 1 CAAA 0.780573 9 3 NT_077402 1 reference
rs 4030243 100333093 0 TTTG 0.764994 9 5 NT_035325 15 reference
rs 4030243 57766532 0 TTTG 0.727843 7 4 NT_113974 Y PAR
rs 4030249 57766682 0 G 0.94823 0 3 NT_113974 Y PAR

Note that SNP 4030243 maps to 3 different chromosomes, while 4030249 maps to 2 different chomsomes. Note also that although 4030245 maps to only one chromosome, the Aln_quality of that mapping is lower than the Aln_quality values for two of the 4030249 mappings.

Using SNPContigLocus to find SNPs associated with genic regions

Finally, there are several tables devoted to mapping SNPs to genes and/or exons. These will not be discussed beyond this one example:

    select * from b126_SNPContigLocusId_36_1 where snp_id in (3212724, 3212755)
which retrieves the records for 2 SNPs (among many others) that appear within the coding regions for JAK3. (Note that Build 126 data is being used here.)

snp_id (int) contig_acc (varchar) contig_ver (tinyint) asn_from (int) asn_to (int) locus_id (int) locus_
symbol
(varchar)
mrna_acc (varchar) mrna_ver (tinyint) Continued =>
3212724 NT_011295 10 9216959 9216959 3718 JAK3 NM_000215 2
3212724 NW_927195 1 9072117 9072117 3718 JAK3 NM_000215 2
3212755 NT_011295 10 9210815 9210815 3718 JAK3 NM_000215 2
3212755 NT_011295 10 9210815 9210815 3718 JAK3 NM_000215 2
3212755 NW_927195 1 9065973 9065973 3718 JAK3 NM_000215 2
3212755 NW_927195 1 9065973 9065973 3718 JAK3 NM_000215 2
Continued => protein_acc (varchar) protein_ver (tinyint) fxn_class (tinyint) reading_frame (tinyint) allele (varchar) residue (varchar) aa_position (int) build_id (varchar) ctg_id (int)
NP_000206 2 6 null null null null 36_1 960600305
NP_000206 2 6 null null null null 36_1 960606324
NP_000206 2 3 3 A L 569 36_1 960600305
NP_000206 2 8 3 G L 569 36_1 960600305
NP_000206 2 3 3 A L 569 36_1 960606324
NP_000206 2 8 3 G L 569 36_1 960606324

This table shows the affected gene and its base, along with the affected protein and the amino acid residue. In addition, the function class (fxn_class) is displayed. Values in this field are itemized in the SNPfunctioncode table, and include: locus (1), coding (2), cds-synon (3), cds-nonsynon (4), UTR (5), intron (6), splice-site (7), etc. To itemize SNPs within the JAK3 locus use an SQL statement like:

   select 
     fxn_class, count(*) 
   from 
     b126_SNPContigLocusId_36_1 
   where 
      locus_symbol = 'JAK3' 
   GROUP BY fxn_class 
   ORDER BY fxn_class
along with a similar query without the WHERE clause to get information to build a table like:

Function class (fxn_class) Counts
for JAK3 for complete table
1 9 846989
2   4
3 16 128007
4 10 175649
5 76 1863314
6 239 8538453
7   2664
8 26 299063

The following query

   select 
      a.fxn_class, count(*) 
   from 
      b126_SNPContigLocusId_36_1 a 
   join 
      b126_SNPContigLoc_36_1 b 
   on 
      a.snp_id = b.snp_id 
   join 
      b126_ContigInfo_36_1 c 
   on 
      b.ctg_id = c.ctg_id 
      AND
      c.contig_label = 'reference'
   GROUP BY 
      a.fxn_class 
   ORDER BY
      a.fxn_class
could be used to get the same breakdown for mappings restricted to the NCBI reference genome. Note: This request takes about 5 minutes to complete using Build 128 when relayed through the local IU CLSD service, and about 1 minute via a direct JDBC connection.)

Additional information

NCBI provides a great deal of information about dbSNP:

The information in this page should help users who consult the NCBI dbSNP documentation. In particular, it should provide a background for appreciating the SQL example presented therein, several of which are simply extensions of examples in this introduction, enhanced by joins that bring in coding tables, like Univariation or SNPClassCode.

Please report errors with this document to the author via data @ iu.edu, and feel free to make suggestions for improvements, and/or comments regarding its usefulness.