HaploReg v4.1: Documentation

Database construction

The initial database table was populated using genomic coordinates and sequences for all SNPs and small indels from dbSNP build 137. Allele frequencies were obtained for those SNPs that were also in 1000 Genomes Phase 1 low-coverage data, and pairwise LD was calculated for all pairs of SNPs in each continental population (AFR, AMR, ASN, EUR) within 250 kb. In addition, annotations of functional consequences were extracted from dbSNP. Positions on hg38 for these SNPs was determined by looking up the rsids from dbSNP build 137 in dbSNP build 141.

A variety of functional annotations were then intersected with the set of variants using the BEDTools package, including the chromatin mark ChIP-seq tracks (gappedPeak calls), DNase tracks (narrowPeak calls), and chromatin state segmentations (15-state and 25-state) from the Roadmap Epigenomics Project, conserved regions by GERP and SiPhy, and the SPP narrow peaks called by the ENCODE project on ChIP-seq experiments To obtain gene annotations, RefSeq genes were downloaded from the UCSC Genome Browser and GENCODE version 13 was downloaded from the project website. BEDTools was then used to calculate the proximity of each variant to a gene by either annotation, as well as the orientation (3' or 5') relative to the nearest end of the gene, based on the strand of the gene.

In order to annotate variants by their effect on regulatory motifs, a library of position weight matrices (PWMs) was constructed from literature sources and was scored on genomic sequences as described previously. Briefly, a set of PWMs was collected from TRANSFAC, JASPAR, and protein-binding microarray (PBM) experiments (Berger et al. 2006, Berger et al. 2008, Badis et al. 2009). In addition, PWMs were discovered from ENCODE ChIP-seq experiments, resulting in a final library reported by Kheradpour and Kellis (2014) and accessible here. The reference and alternate alleles for each of the 1000 Genomes pilot SNPs and indels were concatenated with 29 bp of genomic context on each side, using the hg19 sequence obtained from the UCSC Genome Browser. PWMs were then scored for instances that passed a threshold of p < 4-7 (see Touzet et al.). Only instances where a motif in the sequence (a) passed the threshold of a PWM in either the reference or the alternate genomic sequence, and (b) overlapped the variable nucleotide(s) (thus changing the PWM score) were considered. Then, the change in log-odds (LOD) score was calculated.

GWAS results were obtained from the NHGRI catalog (accessed October 31, 2015). When combining GWAS results across papers for the same trait, SNPs were pruned if another paper had reported the same SNP or another SNP within 1 MB with a more significant p-value.

The GRASP Build 2.0.0.0 results were obtained and QTLs (non-GWAS) were included.

eQTLs were also manually obtained from the GTEx analysis V6, the GEUVADIS analysis, and 10 other studies. The list of studies and methods for manually extrating from papers is available in the data directory.

To perform enhancer enrichment analysis on sets of variants, a background frequency of enhancer overlap was calculated for each cell type among two background sets of variants: all unique GWAS loci assigned to EUR (curretly from the earlier pruned GWAS set for Haploreg 4.0; in process of updating to 4.1 background), and all 1000 Genomes variants with a frequency above 5% in any population. When a user submits a query list of variants, the coverage of enhancers in each cell type is calculated. A binomial test for enrichment is performed, and the result appears in the list bold if it surpasses an uncorrected significance threshold of 0.05. All tests are shown to allow the user to apply an appropriate multiple-testing correction (e.g. Bonferroni or FDR).

Options

Build Query. If a file is uploaded, HaploReg expects a plain text file with a refSNP id (e.g., "rs22") on each line. If no file is uploaded, HaploReg then checks if a GWAS is selected from the drop-down menu. In cases where there are multiple GWAS for the same trait, one option will combine unique SNPs; however, these SNPs are not guaranteed to be in independent loci. If neither a file nor GWAS selection is made, HaploReg will look for either a region (e.g., "chr1:1000000-1001000") or a comma-delimited list of refSNP ids (e.g., "rs3, rs22") in the text box. To protect the server, intervals greater than 10 megabases or searches of greater than 10000 rsids will return an error; large queries of annotations should instead be done by downloading the complete VCF.

LD threshold. Available for r2 > 0.2. If an r2 threshold is specified (see the Set Options tab), results for each variant will be shown in a separate table along with other variants in LD. If r2 is set to NA, only queried variants will be shown, together in one table.

1000G Phase 1 population for LD calculation. LD was calculated separately for each continent. When interpreting association studies, this option should be set to the population that best matches the ancestry of the subjects.

Source for epigenomes. Four options are available: The core 15-state model, the 25-state model using imputed marks, using peaks from H3K4me1 and H3K4me3 (as enhancers and promoters, respectively), and using peaks from H3K27ac and H3K9ac (as enhancers and promoters, respectively).

Mammalian conservation algorithm. HaploReg provides annotations from two algorithms designed to detect constrained sequences across mammalian genomes: GERP and SiPhy-omega. By default, both are displayed. A colored box indicates that the region is called as constrained (vs. non-constrained) by one of these algorithms. Please see the original publications for details on how discrete elements are called.

Show position relative to genes. Gene annotations by GENCODE are shown included by default.

Condense lists. Some SNPs are bound by many proteins, or are DNAse hypersensitive in many cell types. By default, when more than three cell types, proteins, or motifs are involved, a summary (e.g., "7 cell types"). Hovering over these condensed cells displays pop-up text with the complete list.

Condense indel oligos. HaploReg always displays both the reference and alternate nucleotide(s). In the case of indels, if the reference or alternate is longer than six nucleotides, the oligonucleotide will be condensed (e.g., "7mer").

Usage example

From the GWAS drop-down menu, select "Attention deficit hyperactivity disorder (66 loci from 6 studies in EUR)" and select "Submit". The first result is the most significant association from the entire literature - rs864643. See that the top row is the most distantly linked in the haplotype block, rs561543, which has LD of r2 = 0.81 with the lead variant rs864643. The table reports that rs561543 overlaps with an enhancer in four major tissue types; hover ofer "4 tissues" in that row to see a variety of enhancer tissues, including brain. Try changing to other enhancer definitions and re-submitting the query. Note that there is also a ChIP-seq result reported, HNF4 protein bound by ChIP-seq, as well as 8 eQTL results and an HNF4 motif disruption.

Below the haplotype results are the enrichment results. Note that the strongest enrichment for enhancers (as defined by the 15-state core ChromHMM model) is in the angular gyrus sample from brain.

To drill down to the detail view of a SNP, look at the lead SNP itself, rs864643, which is colored red because it is the lead SNP. Click on the link of the rsid name. Note that in the full table of epigenomic information, there is an anhancer activity cluster in brain, and it is classified as a genic enhancer by the 15-state core model, and as a transcribed 3' enhancer by the 25-state model. See also that H3K4me1, H3K27ac, and H3K9ac are all contributing to the chromatin state assignment at this SNP in these cell types. Black cells on the right hand of this area of the table indicate that DNase experiments from Roadmap are not available.

At the bottom of the detail page for rs864643, see the eQTL result showing that the SNP has been correlated with MOBP expression in two brain tissues. In the motif table, note that the SNP changes the match to the p300 PWM, ATTAYRWCA, with the alternate allele changing a match to the fourth A to a G. Hover over the “p300_disc” ID to see that the PWM was discovered using the Trawler algorithm on a p300 ChIP-Seq experiment in HeLa cells from the ENCODE dataset.