HaploReg: Documentation

Database construction

The initial database table was populated using genomic coordinates and sequences for 16,151,841 biallelic SNPs and small indels from the pilot release of the 1000 Genomes Project. In some cases, such as novel indels, the Variant Call Format (VCF) file from the pilot release did not have a RefSNP identifier (rsid); for the purpose of creating a unique identifier for this database, these variants were assigned the label of "chromosome:position" in hg18 coordinates. To provide backward compatibility with obsolete rsids, dbSNP release 132 was checked for variants at the same position as 1000 Genomes pilot variants with multiple rsids. In addition, annotations of functional consequences were extracted from dbSNP.

A variety of functional annotations were then intersected with the set of variants using the BEDTools package, including the chromatin state segmentation of Ernst et al., conserved regions by GERP and SiPhy, the narrow peaks called by the ENCODE project on DNAse hypersensitivity experiments, and the SPP narrow peaks called by the ENCODE project on ChIP-seq experiments (By default, ENCODE tracks are switched off; see notes on use of ENCODE data). To obtain gene annotations, RefSeq genes were downloaded from the UCSC Genome Browser and GENCODE version 7 was downloaded from the project website. BEDTools was then used to calculate the proximity of each variant to a gene by either annotation, as well as the orientation (3' or 5') relative to the nearest end of the gene, based on the strand of the gene.

In order to annotate variants by their effect on regulatory motifs, a library of position weight matrices (PWMs) was constructed from literature sources and was scored on genomic sequences as described previously. Briefly, a set of PWMs was collected from TRANSFAC, JASPAR, and protein-binding microarray (PBM) experiments (Berger et al. 2006, Berger et al. 2008, Badis et al. 2009). The reference and alternate alleles for each of the 1000 Genomes pilot SNPs and indels were concatenated with 29 bp of genomic context on each side, using the hg18 sequence obtained from the UCSC Genome Browser. PWMs were then scored for instances that passed either of two thresholds, a stringent threshold of p < 4-8 and a less-stringent threshold of p < 4-7 (see Touzet et al.). Only instances where a motif in the sequence (a) passed the stringent threshold of a PWM in either the reference or the alternate genomic sequence, and (b) overlapped the variable nucleotide(s) (thus changing the PWM score) were considered. Then, the change in log-odds (LOD) score was calculated. In cases where the weaker match was did not pass the less-stringent threshold, an approximate minimum change of LOD score was reported, corresponding to the difference between the score of the stronger match to the score required to pass the less-stringent threshold. In cases where both allelic variants surpassed the less-stringent threshold, the exact difference in score was reported.

GWAS results were obtained from the NHGRI catalog (accessed June 29, 2011.) In cases where multiple studies were annotated as pertaining to the same phenotype, unique independent SNPs were consolidated into a single list.

LD was calculated using the phased genotype information accompanying the 1000 Genomes Project pilot release. VCFTools was used to perform the calculation, using an LD threshold of r² >= 0.80, and a maximum distance between variants of 200 kb. Results from VCFTools were then consolidated such that for every variant in our database, a list of linked variants is accessible for each of the three populations, along with an r2 value. To perform enhancer enrichment analysis on sets of variants, tables of common array designs were obtained from the snpArray track provided by the UCSC Table Browser and lists were constructed of 1000 Genomes SNPs segregating in each of the three pilot populations, as well as all SNPs in the database. Then, a background frequency of coverage was calculated for variants annotated as overlapping a strong enhancer state in each cell type. When a user submits a query list of variants, the coverage of strong enhancers in each cell type is calculated. If the coverage exceeds that of the background set selected by the user, a binomial test is performed, and enrichment is reported if it passes an uncorrected significance threshold of 0.05.

Use of ENCODE data

Publicly-available ENCODE data are cross-referenced for browsing purposes. However, use of these data for analyses are currently restricted. Please see the ENCODE Consortium Data Release Policy.

Options

Build Query. If a file is uploaded, HaploReg expects a plain text file with a refSNP id (e.g., "rs22") on each line. If no file is uploaded, HaploReg then checks if a GWAS is selected from the drop-down menu. In cases where there are multiple GWAS for the same trait, one option will combine unique SNPs; however, these SNPs are not guaranteed to be in independent loci. If neither a file nor GWAS selection is made, HaploReg will look for a comma-delimited list of refSNP ids in the text box (e.g., "rs3,rs22").

LD threshold. Variants can be shown with their position relative to hg18 or hg19.

Genome version. If an r² threshold is specified (see the Set Options tab), results for each variant will be shown in a separate table along with other variants in LD. If r² is set to NA, only queried variants will be shown, together in one table.

1000G pilot population for LD calculation. LD was calculated separately for each population: ASN (composed of CHB, Han Chinese in Beijing, and JPT, Japanese in Tokyo), CEU (Utah residents from the Centre d'Etude du Polymorphisme Humain with Northern and Western European ancestry), and YRI (Yoruba in Ibadan, Nigeria). When interpreting association studies, this option should be set to the population that best matches the ancestry of the subjects.

Mammalian conservation algorithm. HaploReg provides annotations from two algorithms designed to detect constrained sequences across mammalian genomes: GERP and SiPhy-omega. By default, both are displayed.

Show position relative to genes. Gene annotations by GENCODE and RefSeq are both included by default.

Condense lists. Some SNPs are bound by many proteins, or are DNAse hypersensitive in many cell types. By default, when more than three cell types, proteins, or motifs are involved, a summary (e.g., "7 cell types"). Hovering over these condensed cells displays pop-up text with the complete list.

Condense indel oligos. HaploReg always displays both the reference and alternate nucleotide(s). In the case of indels, if the reference or alternate is longer than six nucleotides, the oligonucleotide will be condensed (e.g., "7mer").

Background set for enhancer enrichment analysis. Sets of query variants (but not linked variants) are analyzed for cell-type-specific enhancers, as in Ernst et al. To control for biases in genotyping array design or genotype imputation, a number of background sets are available. For example, if the list of query variants was selected using genotypes from Illumina 1M, "Illumina1M" should be selected. If GWAS was performed followed by imputation to all 1000 Genomes SNPs present in the Yoruba population, "All SNPs in 1KG YRI pilot" should be selected.

Usage examples

Single locus. In a study of tooth development by Pillas et al., an association was reported between the intergenic SNP rs6504340 and dental occlusion. In the first tab, "Build Query", enter rs6504340. In the second tab, "Set Options", set the LD threshold to 1.0. Then click "Submit Query", below the options. Note that there are five variants reported: the original query SNP, an indel (absent from YRI), and three other SNPs. One of these SNPs is a synonymous coding change in HOXB1, and the query SNP and other two linked SNPs are intergenic. One lies in a conserved element by both the GERP and SiPhy algorithms: rs8073963. Click on the name of this SNP, and a detail box will page will open. Note that besides being an enhancer in K562 (reported on the main page), the chromatin state in eight other cell types is reported: polycomb-repressed in H1, HepG2, and Huvec, and heterochromatic in GM12878, HMEC, HSMM, NHEK, and NHLF. The locus is found to be DNAse-sensitive by two different ENCODE production groups, although in different cells. Hits to two literature PWMs for Foxo are detected. Because the LOD(alt)-LOD(ref) is negative, the affinity in the reference sequence is higher. In both cases, the SNP affects a match to the "C" at the end of the PWM (which matches on the minus strand; the plus strand reference allele is G.)

Multiple loci. Ernst et al. report an enrichment among the lupus GWAS SNPs from Han et al. In the first tab, "Build Query", select "Systemic lupus erythematosus (Han et al., 2009), 18 SNPs" from the GWAS drop-down menu. In the second tab, "Set Options", select an LD threshold of 1.0 and set the 1000G population to ASN (the study's subjects were Han Chinese). Now click "Submit Query." Note that the SNPs appear in separate tables with their LD blocks (the second is at the HLA locus.) Also note that the enhancer enrichment analysis reports an enrichment for enhancers in HSMM and GM12878. (This enrichment is based on the query SNPs and is indepenedent of the population or LD threshold.) Now select "Systemic lupus erythematosus (8 studies combined), 62 SNPs" in the first tab and select an LD threshold of "NA" in the second tab, and click "Submit Query." Note that all variants appear in the same table.

Credits

The HaploReg database and web interface were produced by Luke Ward in collaboration with the Computational Biology Group at MIT.

HaploReg is hosted by the Broad Institute.

Contact: lukeward@mit.edu