The initial database table was populated using genomic coordinates and sequences for all SNPs and small indels from dbSNP build 137. Allele frequencies were obtained for those SNPs that were also in 1000 Genomes Phase 1 low-coverage data, and pairwise LD was calculated for these SNPs. In addition, annotations of functional consequences were extracted from dbSNP.
A variety of functional annotations were then intersected with the set of variants using the BEDTools package, including the chromatin state segmentation of Ernst et al. and chromatin state segmentation on the Roadmap reference epigenomes (contact Anshul Kundaje for details), conserved regions by GERP and SiPhy, the narrow peaks called by the ENCODE project on DNAse hypersensitivity experiments, and the SPP narrow peaks called by the ENCODE project on ChIP-seq experiments To obtain gene annotations, RefSeq genes were downloaded from the UCSC Genome Browser and GENCODE version 13 was downloaded from the project website. BEDTools was then used to calculate the proximity of each variant to a gene by either annotation, as well as the orientation (3' or 5') relative to the nearest end of the gene, based on the strand of the gene.
In order to annotate variants by their effect on regulatory motifs, a library of position weight matrices (PWMs) was constructed from literature sources and was scored on genomic sequences as described previously. Briefly, a set of PWMs was collected from TRANSFAC, JASPAR, and protein-binding microarray (PBM) experiments (Berger et al. 2006, Berger et al. 2008, Badis et al. 2009). In addition, PWMs were discovered from ENCODE ChIP-seq experiments (contact Pouya Kheradpour for details.) The reference and alternate alleles for each of the 1000 Genomes pilot SNPs and indels were concatenated with 29 bp of genomic context on each side, using the hg19 sequence obtained from the UCSC Genome Browser. PWMs were then scored for instances that passed a threshold of p < 4-7 (see Touzet et al.). Only instances where a motif in the sequence (a) passed the threshold of a PWM in either the reference or the alternate genomic sequence, and (b) overlapped the variable nucleotide(s) (thus changing the PWM score) were considered. Then, the change in log-odds (LOD) score was calculated.
GWAS results were obtained from the NHGRI catalog (accessed January 10, 2013.) In cases where multiple studies were annotated as pertaining to the same phenotype, unique (but not necessarily genetically unlinked) SNPs were consolidated into a single list. To perform enhancer and DNAse enrichment analysis on sets of variants, tables of common array designs were obtained from the snpArray track provided by the UCSC Table Browser and lists were constructed of 1000 Genomes SNPs segregating in each of the three pilot populations, as well as all SNPs in the database. Then, a background frequency of coverage was calculated for variants annotated as overlapping an enhancer or DNAse state in each cell type. When a user submits a query list of variants, the coverage of strong enhancers in each cell type is calculated. If the coverage exceeds that of the background set selected by the user, a binomial test is performed, and enrichment is reported if it passes an uncorrected significance threshold of 0.05.
NOTE: This is a beta release 1 of chromatin states from the Roadmap Epigenomics Consortium. A newer model (slightly different than the current one) with greater cell-type coverage and higher quality data will be released soon. For the current model, the states are as follows
ST_ID MNEMONIC DESCRIPTION 1 TssP TSS_poised 2 TssF TSS_flanking_more_upstream 3 TssA TSS_active 4 TssWk TSS_weak 5 TssD1 TSS_flanking_downstream 6 TssD2 TSS_flanking_more_downstream 7 Tx Transcription 8 TxWk Transcription_weak 9 TxEnhG1 Transcription Enhancer-like 10 TxEnhG2 Transcription Enhancer-like (short genes) 11 EnhWk1 Enhancer_weak 12 EnhWk2 Enhancer_weak 13 EnhA Enhancer_active 14 Enh Enhancer_active_with_weakK4me1_strong_K27ac 15 EnhP Enhancer_poised 16 ReprPCWk Repressed_polycomb_weak 17 ReprPC Repressed_polycomb 18 K9K27me3 H3K9me3_K27me3 19 ZNF Zinc_finger_genes_H3K36me3_K9me3 20 HetRpts Heterochromatin_at_repeats 21 Het Heterochromatin 22 Quies1 Quiescent 23 Quies2 Quiescent 24 Quies3 Quiescent 25 K9acLow Quiescent (low H3K9ac)Also, you may find useful the chromatin state composition and various feature enrichment plots that would help you interpret the states (it is how we added state descriptions and mnemonics as well) better:
Build Query. If a file is uploaded, HaploReg expects a plain text file with a refSNP id (e.g., "rs22") on each line. If no file is uploaded, HaploReg then checks if a GWAS is selected from the drop-down menu. In cases where there are multiple GWAS for the same trait, one option will combine unique SNPs; however, these SNPs are not guaranteed to be in independent loci. If neither a file nor GWAS selection is made, HaploReg will look for either a region (e.g., "chr1:1000000-1001000") or a comma-delimited list of refSNP ids (e.g., "rs3, rs22") in the text box.
LD threshold. Available for r² > 0.2.
Genome version. If an r² threshold is specified (see the Set Options tab), results for each variant will be shown in a separate table along with other variants in LD. If r² is set to NA, only queried variants will be shown, together in one table.
1000G Phase 1 population for LD calculation. LD was calculated separately for each continent. When interpreting association studies, this option should be set to the population that best matches the ancestry of the subjects.
Mammalian conservation algorithm. HaploReg provides annotations from two algorithms designed to detect constrained sequences across mammalian genomes: GERP and SiPhy-omega. By default, both are displayed.
Show position relative to genes. Gene annotations by GENCODE are both included by default.
Condense lists. Some SNPs are bound by many proteins, or are DNAse hypersensitive in many cell types. By default, when more than three cell types, proteins, or motifs are involved, a summary (e.g., "7 cell types"). Hovering over these condensed cells displays pop-up text with the complete list.
Condense indel oligos. HaploReg always displays both the reference and alternate nucleotide(s). In the case of indels, if the reference or alternate is longer than six nucleotides, the oligonucleotide will be condensed (e.g., "7mer").
Background set for enhancer enrichment analysis. Sets of query variants (but not linked variants) are analyzed for cell-type-specific enhancers, as in Ernst et al. To control for biases in genotyping array design or genotype imputation, a number of background sets are available. For example, if the list of query variants was selected using genotypes from Illumina 1M, "Illumina1M" should be selected. If GWAS was performed followed by imputation to all 1000 Genomes SNPs present in the Yoruba population, "All SNPs in 1KG YRI pilot" should be selected.
Ernst et al. report an enrichment among the lupus GWAS SNPs from Han et al. In the first tab, "Build Query", select "Systemic lupus erythematosus (Han et al., 2009), 18 SNPs" from the GWAS drop-down menu. In the second tab, "Set Options", select an LD threshold of 0.6 and set the 1000G population to ASN (the study's subjects were Han Chinese). Now click "Submit Query." Note that the SNPs appear in separate tables with their LD blocks (the second is at the HLA locus.) Also note that the enhancer enrichment analysis reports a 9-fold enrichment for all enhancers in GM12878 lymphoblasts. (This enrichment is based on the query SNPs and is indepenedent of the population or LD threshold.) Find rs9271055 on the LD block with tag SNP rs9271100; note that it has r² = 0.76 and D' = 0.98 with the GWAS-reported SNP. Also note that, like the lead SNP, it is reported as an eQTL by Montgomery et al. Then click on the rsid rs9271055 to reach its detail page. Note at the bottom of the detail page that, as illustrated in Ernst et al. Figure 5d, the alternate allele of the SNP strengthens a match to a literature Ets-family PWM. Hover over "Ets_known9" to obtain its TRANSFAC accession number. Now, return to the main page and select "Systemic lupus erythematosus (8 studies combined), 62 SNPs" in the first tab and select an LD threshold of "NA" in the second tab, and click "Submit Query." Note that all variants appear in the same table.
The HaploReg database and web interface were produced by Luke Ward in collaboration with the Computational Biology Group at MIT.
HaploReg is hosted by the Broad Institute.
Contact: lukeward@mit.edu