The goal of ChIP-seq is to determine the locations in the genome associating with a protein factor
- Chromatin ImmunoPrecipitation (ChIP)
- Protein-DNA crosslinking with formaldehyde (for TF)
- Chop the chromatin using sonication (TF) or micrococal nuclease (MNase) digestion (histone)
- Specific factor-targeting antibody
- Immunoprecipitation
- DNA purification
- PCR amplification (~150bp)
- High-throughput sequencing (Illumina: can only sequencing the end of the DNA fragments)
- UV crosslinking (1984) : the protein-DNA interaction can be captured
- Crosslinking + immunoprecipitation (1993) : use antibody to grab the DNA-protein complex
- ChIP-chip (2000) : genomewide microarray method was developed, using a pre-designed way
- Unbiased chromosomal coverage by tiling array (2004)
- ChIP-seq (2007)
Today, ChIP-seq has become the predominant method for profiling chromatin epigenomes.
The analysis aims to achieve the following goals:
- Where in the genome do these sequence reads come from? This is accomplished using sequence alignment after quality control
- What does the enrichment of sequences mean? Accomplished using peak calling
- What can we learn from these data? This requires further downstream analysis and integration
Here is a brief outline of steps that are required to achieve those goals:
- Sequencing quality assessment using fastqc. If the quality scores across bases fail, either re-do the experiment or trim the data.
- ChIP-seq read mapping: map the fastq file containing the sequence information to the genome; alignment of each sequence read: bowtie, BWA (Burrows–Wheeler Algorithm); usually use the reads can map to a unique/best location in the genome.
- Redundancy control: completely identical reads are considered error (for example, induced by PCR)
- Non-redundant rate:The ratio of the number of non-redundant reads to the number of mapped reads
- PBC (PCR Bottleneck Coefficient): The ratio of the number of locations with 1 read mapped to the number of locations with reads mapped
- DNA fragment size estimation:
- peak model (MACS) for TF
- cross-correlation (SICER) for any ChIP-seq (input): calculate the Correlation between two strings with a displacement; Auto-correlation: Cross-correlation with itself
- Retrieve DNA fragments
- Full length retrieval (MACS)
- Partial retrieval (sharpen the signal)
- Point retrieval (SICER)
- Pile up: Signal map generation
Background Control: Input or IgG
- Input chromatin: sonicated/digested chromatin without immunoprecipitation
- IgG: “unspecific” immunoprecipitation
Study Control:
- Control exp sample: ChIP + input
- Treated exp sample: ChIP + input
Goal: Identify regions in the genome enriched for sequence reads: – Compared to genomic background – Compared to input control
MACS: Model-based Analysis for ChIP-Seq Read distribution along the genome
- Poisson distribution (
$\lambda B G$ = total tag / genome size) - Negative binomial distribution (MACS2)
ChIP-seq show local biases in the genome
- Chromatin and sequencing bias
- 200-300 bp control windows have too few tags – But can look further
B-H adjustment to correct for FDR : p-value → q-value
Data Visualization
- bedGraph to bigWig
- macs2 output data
- IGV
Quality Control
- FRiP(FractionofReadsinPeaks)score – 1-10% for TF is normal
- Numberofpeaks
- Number of peaks with high fold-enrichment, e.g, 5, 10, ...
- 2000
- Sequenceconservation
- Fractionofpeakswithinregulatoryregions – 80%
Biological interpretation: ChIP-seq captures a snapshot of binding patterns from a cell population
- TF intrinsic property
- Binding activity
- Cellular heterogeneity