HaploReg v2: Documentation

Database construction

The initial database table was populated using genomic coordinates and sequences for all SNPs and small indels from dbSNP build 137. Allele frequencies were obtained for those SNPs that were also in 1000 Genomes Phase 1 low-coverage data, and pairwise LD was calculated for these SNPs. In addition, annotations of functional consequences were extracted from dbSNP.

A variety of functional annotations were then intersected with the set of variants using the BEDTools package, including the chromatin state segmentation of Ernst et al. and chromatin state segmentation on the Roadmap reference epigenomes (contact Anshul Kundaje for details), conserved regions by GERP and SiPhy, the narrow peaks called by the ENCODE project on DNAse hypersensitivity experiments, and the SPP narrow peaks called by the ENCODE project on ChIP-seq experiments To obtain gene annotations, RefSeq genes were downloaded from the UCSC Genome Browser and GENCODE version 13 was downloaded from the project website. BEDTools was then used to calculate the proximity of each variant to a gene by either annotation, as well as the orientation (3' or 5') relative to the nearest end of the gene, based on the strand of the gene.

In order to annotate variants by their effect on regulatory motifs, a library of position weight matrices (PWMs) was constructed from literature sources and was scored on genomic sequences as described previously. Briefly, a set of PWMs was collected from TRANSFAC, JASPAR, and protein-binding microarray (PBM) experiments (Berger et al. 2006, Berger et al. 2008, Badis et al. 2009). In addition, PWMs were discovered from ENCODE ChIP-seq experiments (contact Pouya Kheradpour for details.) The reference and alternate alleles for each of the 1000 Genomes pilot SNPs and indels were concatenated with 29 bp of genomic context on each side, using the hg19 sequence obtained from the UCSC Genome Browser. PWMs were then scored for instances that passed a threshold of p < 4-7 (see Touzet et al.). Only instances where a motif in the sequence (a) passed the threshold of a PWM in either the reference or the alternate genomic sequence, and (b) overlapped the variable nucleotide(s) (thus changing the PWM score) were considered. Then, the change in log-odds (LOD) score was calculated.

GWAS results were obtained from the NHGRI catalog (accessed January 10, 2013.) In cases where multiple studies were annotated as pertaining to the same phenotype, unique (but not necessarily genetically unlinked) SNPs were consolidated into a single list. To perform enhancer and DNAse enrichment analysis on sets of variants, tables of common array designs were obtained from the snpArray track provided by the UCSC Table Browser and lists were constructed of 1000 Genomes SNPs segregating in each of the three pilot populations, as well as all SNPs in the database. Then, a background frequency of coverage was calculated for variants annotated as overlapping an enhancer or DNAse state in each cell type. When a user submits a query list of variants, the coverage of strong enhancers in each cell type is calculated. If the coverage exceeds that of the background set selected by the user, a binomial test is performed, and enrichment is reported if it passes an uncorrected significance threshold of 0.05.

Roadmap data in HaploReg v2 is provisional and will be finalized in HaploReg v3

NOTE: This is a beta release 1 of chromatin states from the Roadmap Epigenomics Consortium. A newer model (slightly different than the current one) with greater cell-type coverage and higher quality data will be released soon. For the current model, the states are as follows

ST_ID	MNEMONIC	DESCRIPTION	
1	TssP		TSS_poised	
2	TssF		TSS_flanking_more_upstream	
3	TssA		TSS_active	
4	TssWk		TSS_weak	
5	TssD1		TSS_flanking_downstream
6	TssD2		TSS_flanking_more_downstream	
7	Tx		Transcription	
8	TxWk		Transcription_weak	
9	TxEnhG1		Transcription Enhancer-like	
10	TxEnhG2		Transcription Enhancer-like (short genes)	
11	EnhWk1		Enhancer_weak	
12	EnhWk2		Enhancer_weak	
13	EnhA		Enhancer_active
14	Enh		Enhancer_active_with_weakK4me1_strong_K27ac	
15	EnhP		Enhancer_poised	
16	ReprPCWk	Repressed_polycomb_weak
17	ReprPC		Repressed_polycomb	
18	K9K27me3	H3K9me3_K27me3	
19	ZNF		Zinc_finger_genes_H3K36me3_K9me3	
20	HetRpts		Heterochromatin_at_repeats	
21	Het		Heterochromatin	
22	Quies1		Quiescent	
23	Quies2		Quiescent	
24	Quies3		Quiescent
25	K9acLow	Quiescent (low H3K9ac)	
	
Also, you may find useful the chromatin state composition and various feature enrichment plots that would help you interpret the states (it is how we added state descriptions and mnemonics as well) better:

Options

Build Query. If a file is uploaded, HaploReg expects a plain text file with a refSNP id (e.g., "rs22") on each line. If no file is uploaded, HaploReg then checks if a GWAS is selected from the drop-down menu. In cases where there are multiple GWAS for the same trait, one option will combine unique SNPs; however, these SNPs are not guaranteed to be in independent loci. If neither a file nor GWAS selection is made, HaploReg will look for either a region (e.g., "chr1:1000000-1001000") or a comma-delimited list of refSNP ids (e.g., "rs3, rs22") in the text box.

LD threshold. Available for r² > 0.2.

Genome version. If an r² threshold is specified (see the Set Options tab), results for each variant will be shown in a separate table along with other variants in LD. If r² is set to NA, only queried variants will be shown, together in one table.

1000G Phase 1 population for LD calculation. LD was calculated separately for each continent. When interpreting association studies, this option should be set to the population that best matches the ancestry of the subjects.

Mammalian conservation algorithm. HaploReg provides annotations from two algorithms designed to detect constrained sequences across mammalian genomes: GERP and SiPhy-omega. By default, both are displayed.

Show position relative to genes. Gene annotations by GENCODE are both included by default.

Condense lists. Some SNPs are bound by many proteins, or are DNAse hypersensitive in many cell types. By default, when more than three cell types, proteins, or motifs are involved, a summary (e.g., "7 cell types"). Hovering over these condensed cells displays pop-up text with the complete list.

Condense indel oligos. HaploReg always displays both the reference and alternate nucleotide(s). In the case of indels, if the reference or alternate is longer than six nucleotides, the oligonucleotide will be condensed (e.g., "7mer").

Background set for enhancer enrichment analysis. Sets of query variants (but not linked variants) are analyzed for cell-type-specific enhancers, as in Ernst et al. To control for biases in genotyping array design or genotype imputation, a number of background sets are available. For example, if the list of query variants was selected using genotypes from Illumina 1M, "Illumina1M" should be selected. If GWAS was performed followed by imputation to all 1000 Genomes SNPs present in the Yoruba population, "All SNPs in 1KG YRI pilot" should be selected.

Usage example

Ernst et al. report an enrichment among the lupus GWAS SNPs from Han et al. In the first tab, "Build Query", select "Systemic lupus erythematosus (Han et al., 2009), 18 SNPs" from the GWAS drop-down menu. In the second tab, "Set Options", select an LD threshold of 0.6 and set the 1000G population to ASN (the study's subjects were Han Chinese). Now click "Submit Query." Note that the SNPs appear in separate tables with their LD blocks (the second is at the HLA locus.) Also note that the enhancer enrichment analysis reports a 9-fold enrichment for all enhancers in GM12878 lymphoblasts. (This enrichment is based on the query SNPs and is indepenedent of the population or LD threshold.) Find rs9271055 on the LD block with tag SNP rs9271100; note that it has r² = 0.76 and D' = 0.98 with the GWAS-reported SNP. Also note that, like the lead SNP, it is reported as an eQTL by Montgomery et al. Then click on the rsid rs9271055 to reach its detail page. Note at the bottom of the detail page that, as illustrated in Ernst et al. Figure 5d, the alternate allele of the SNP strengthens a match to a literature Ets-family PWM. Hover over "Ets_known9" to obtain its TRANSFAC accession number. Now, return to the main page and select "Systemic lupus erythematosus (8 studies combined), 62 SNPs" in the first tab and select an LD threshold of "NA" in the second tab, and click "Submit Query." Note that all variants appear in the same table.

Credits

The HaploReg database and web interface were produced by Luke Ward in collaboration with the Computational Biology Group at MIT.

HaploReg is hosted by the Broad Institute.

Contact: lukeward@mit.edu