Data Quality: Metrics


Perhaps the most important requirement within the UW ENCODE Project is the production of high-quality data. We have invested considerable effort in developing protocols and standards to ensure the best possible product for downstream analysis. Two key aspects of this effort have been the validation of our called DNaseI hypersensitive sites (DHSs) by the gold-standard chromatin accessibility Southern-blot assay, and the development of a data quality metric applied uniformly to all ChIP-seq and DNase-seq datasets coming off our sequencers.

Data Quality Metrics

Given the large and increasing volume of data produced by our center, we recognized early on the need for an automated method of assessing data quality based on the enrichment of signal in the final collection of reads mapped to the genome for a given experiment. Towards that end, we developed and implemented a data quality metric called SPOT (Signal Portion of Tag) that is applied uniformly to all of our DNase-seq and ChIP-seq datasets. The metric forms a basis for our decision to invest further lanes of sequencing for a given sample, and for flagging datasets of high enough quality to deliver to the consortium.

We have been using SPOT for these purposes for several years. SPOT is intuitively interpretable, and we have shown (manuscript in preparation) that SPOT scores track with several other indicators of data quality. In particular:

  1. SPOT scores correlate positively with true positive rates, where true positives are defined using DNaseI hypersensitivity Southern blots. That is, DNase-seq datasets with high SPOT scores tend to contain a higher proportion of true positives than those with low SPOT scores.
  2. SPOT scores correlate positively with replicate reproducibility. That is, biological replicates with matching and high SPOT scores tend to be more concordant, when measured by simple correlation of signal.
  3. SPOT scores recapitulate signal-to-noise scoring by human observers. In an experiment, users were asked to view multiple tracks from a single datatype in a UCSC mirror browser, where the tracks spanned a range of SPOT scores that were hidden from the users. Each user was asked to rank the tracks in terms of quality based on their visual appearance. In every experiment, the plurality ranking for a given track matched the ranking by SPOT score.

We provide SPOT scores along with each track in our browser. Moreover, SPOT scoring has been performed on all ENCODE DNaseI and ChIP-seq data and the scores have been posted to the ENCODE wiki. SPOT has also been implemented by the NIH Roadmap Epigenomics Consortium as part of their regular data processing pipeline, and scores are reported in all of the data portals for that project for downstream users and data producers alike.

Brief description of SPOT

The SPOT metric is motivated by the following illustration, which shows a 36kb stretch of short-read sequence tags (each represented by a tiny rectangle) mapped to the human genome for four separate DNaseI experiments, the top two in one cell type and the bottom two in another. Total genome-wide sequencing depth is similar in all four cases, but one can observe differences in the apparent signal-to-noise ratio between them, in terms of the degree to which tags are concentrated in peaks versus the background. The metrics discussed below each assign a single number to each genome-wide dataset to gauge the signal enrichment, an assessment that, up to this point, has typically been made by eye. The values for one of these metrics, SPOT, are displayed for each of the four experiments below.

SPOT is based on the hotspot algorithm, a scan statistic that identifies regions of statistically significant tag enrichment based on the binomial distribution. The method assigns significance relative to a local background estimate, thus correcting for local elevation of tag levels due to segmental duplications, copy number events, etc. Regions of significant enrichment are called "hotspots," and can range in size from 10bp to several kb. A description of the method, as well as downloadable software, can be found here.
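The idea of scoring a small window against a local background with the binomial distribution can be illustrated with a short sketch. This is not the hotspot implementation; the function names, the window and background sizes, and the use of an upper-tail p-value are all assumptions made for illustration. It treats each tag in the surrounding background region as landing in the small window with probability equal to the ratio of the two sizes, and asks how surprising the observed window count is.

```python
from math import exp, lgamma, log, log1p


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are computed in log space so that large tag counts do not overflow.
    """
    total = 0.0
    for i in range(max(k, 0), n + 1):
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log1p(-p)
        )
        total += exp(log_pmf)
    return min(total, 1.0)


def window_enrichment_pvalue(tags_in_window, tags_in_background,
                             window_bp, background_bp):
    """Hypothetical helper: p-value for tag enrichment in a small window.

    Under the null, each of the tags in the local background region falls in
    the small window with probability window_bp / background_bp.
    """
    p = window_bp / background_bp
    return binomial_tail(tags_in_window, tags_in_background, p)


# Illustrative numbers only: 20 of the 50 tags in a 50 kb background region
# fall inside a single 250 bp window.
pval = window_enrichment_pvalue(20, 50, window_bp=250, background_bp=50_000)
print(pval)
```

Because the null probability comes from the local background rather than the genome-wide tag density, a region with uniformly elevated tag counts (for example, a duplicated segment) raises both the window count and the background count, and so is not automatically called significant.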

For a given dataset, hotspots are called and the SPOT score is simply calculated as the percentage of all tags in hotspots.
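As a minimal sketch of that calculation, the snippet below counts the tags that fall inside any hotspot and divides by the total number of tags. The input layout is an assumption: tags are given as (chromosome, position) pairs and hotspots as (chromosome, start, end) triples with half-open coordinates, and the function name `spot_score` is hypothetical. The actual pipeline may sample or filter tags before scoring; none of that is modeled here.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def spot_score(tags, hotspots):
    """Return the fraction of tags that fall inside any hotspot.

    tags:     iterable of (chrom, position)
    hotspots: iterable of (chrom, start, end), half-open [start, end)
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in hotspots:
        by_chrom[chrom].append((start, end))

    # Per chromosome, keep sorted start and end lists for binary search.
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = merge_intervals(intervals)
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])

    total = 0
    in_hotspots = 0
    for chrom, pos in tags:
        total += 1
        if chrom not in index:
            continue
        starts, ends = index[chrom]
        i = bisect_right(starts, pos) - 1  # last hotspot starting at or before pos
        if i >= 0 and pos < ends[i]:
            in_hotspots += 1

    if total == 0:
        raise ValueError("no tags supplied")
    return in_hotspots / total


# Toy data: two of the four tags fall inside the single hotspot.
tags = [("chr1", 100), ("chr1", 150), ("chr1", 500), ("chr2", 10)]
hotspots = [("chr1", 90, 160)]
print(spot_score(tags, hotspots))  # 0.5
```

The function returns a fraction between 0 and 1; multiply by 100 to express it as a percentage. Merging the hotspots first ensures that a tag lying in two overlapping hotspots is counted once.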