Chromatin Accessibility Analyses
Chromatin is composed of nucleosomes, consisting a histone octamer wrapped by DNA1. The interactions between histone proteins can change DNA structures:
- Closed: Nucleosome tightly packed regions, less accessible to regulatory components, cannot activate gene expression
- Open: Nucleosome-depleted regions, available to interact with regulators (TFs and enhancers), turn on gene expression
Background
ATAC-seq2 stands for Assay for Transposase-Accessible Chromatin with high-throughput sequencing. It allows us to identify the chromatin accessibility within a genome and requires fewer amounts of cells and time. ATAC-seq uses Tn5 transposase to enrich DNA fragments from open chromatin regions, followed by NGS sequencing (Fig.1).
Fig1. Schematic of ATAC-seq library construction
Pipeline
ATACgrpah3 is a simple and effective software for the analysis of ATAC-Seq data. It can be download from GitHub with my modification. The pipeline starts from the aligned reads bam file.
Fig2. Schematic diagram of ATACgraph. The black boxes are the ATAC-seq specific analyses while the yellow boxed are the comparison between ATAC-seq and other NGS methods.
step1: Removing mitochondrial
There are high abundance of mitochondrial reads (usually 20%–80%), therefore,
the first would be remove these mitochondrial or plastid sequences by 00_rmChr
:
ATACgraph 00_rmChr demo.bam demo_rmM.bam chrM
step2: Quality control
Fragment length distribution and Fast Fourier Transform (FFT) can show the periodicity ATAC-seq framgment, as well as be used as a simple QC check. It should has a clear periodicity cut off at about 190-200 bp:
ATACgraph 01_calFragDist demo_rmM.bam demo_rmM_fragment demo_rmM_FFT
Two figures will be generated demo_rmM_fragment.png
and demo_rmM_FFT.png
Fig3. The output figures of periodicity analyses. (A) Fragment length distribution with red box represent the nucleosome free region (NFR). The peaks in the gree box may be the mono-nucleosomes, di-nucleosomes and tri-nucleosomes. (B) The Fast Fourier Transform (FFT) show the clear period of fragment cut off is at about 195 bp.
step3: Peak calling
The updated version incorporates MACS3 (MACS2 in original ATACgraph) for peak calling that identifies nucleosome-occupied regions from full-extended fragments and nucleosome-free regions from integration regions:
ATACgraph 03_callPeak demo_rmM.bam demo_rmM_peakcall demo_gene_body_bed6.bed -p macs3
3 files will be generated including a peak location file demo_rmM_peakcall_peaks.narrowPeak
, a peak intensity bigwig file demo_rmM_peakcall_coverage.bw
, and a genes list of overlapping with peaks locations file demo_rmM_peakcall_peak_gene_list.txt
.
step4: Transform GTF file to BED files
ATACgraph can transform the annotation gtf file into seperated bed file (promoter,gene,exon,intron,utr5,cds,utr3,igr) for the following analyses:
ATACgraph 02_gtftoBed demo_gene.gtf demo -p 2000
step5: Visualisation
peaks and genes
To investigate the chromatin accessibility around genes, ATACgraph uses the files describing the ATAC-seq peak locations and gene annotations. It can generate heatmap and metagene plots of ATAC-seq abundance and fold enrichment analysis of open regions in genomic features
ATACgraph 03_genePlot demo_rmM_peakcall.narrowpeak demo_rmM_peakcall_coverage.bw demo
3 figures will be plotted in pdf files as follow fig4 A-C.
junction
To identify a fragment junction track from paired-end reads which connects the start/end site of a read in the track. The lengths above 200 bp are the long fragments coloured in red; and the short fragments (< 150 bp) are marked to represent possibly NFRs in blue.
ATACgraph 03_junctionBed demo_rmM.bam demo_rmM_junction_output.bed
These outputs can be loaded into Integrative Genomics Viewer (IGV) to improve the recognition of open chromatin as fig4 D.
Fig4. The visualisation of (A) peaks abundance in metagene plot (top) and heatmap (bottom) show that ATAC-seq enriched at the center of the predicted peak locations and ;(B) ATAC-seq abundance of the gene body and flanking regions show the accessible regions are located before the transcription start sites (TSSs) in two-thirds of the genes. (C) Fold enrichment analysis of open regions in genomic features. (D) The junction tracks of ATAC-seq in the genome.
Quality assessment
To make sure ATAC-seq or ChIP-seq quality, a manual 4 was conducted by ENCODE. One of the most popular evaluation score for qc these seqs is Fraction of all mapped Reads that fall into peak regions (FRip) Score. It represents the percentage of reads in the peaks region, while the reads outside the peaks are consider as noise.
FRip socre
FRip can be calculated by the featureCounts
but the .narrowPeak file should first be converted to .saf file format.
awk 'OFS="\t" {print $1"-"$2+1"-"$3, $1, $2+1, $3, "+"}' demo_rmM_peakcall.narrowPeak > demo_rmM_peakcall.SAF
featureCounts -p -a demo_rmM_peakcall.SAF -F SAF -o demo_rmM_peakcall.txt demo_rmM_peakcall.bam
IDR
Normally, the experiments required at least two biological replicates and the peaks should maintain highly consistance between these two. We can use Irreproducibility Discovery Rate (IDR) 5 for the assessment.
idr --samples YJATAC0432_rmM_peakcall_peaks.narrowPeak YJATAC0433_rmM_peakcall_peaks.narrowPeak --input-file-type narrowPeak --rank p.value --output-file YJATAC32-33-idr --plot --log-output-file YJATAC32-33.log