Understanding Ancient DNA Damage Analysis
This guide explains the key concepts and metrics used in the KAPK Explorer for analyzing ancient DNA damage patterns in metagenomic samples from the Kap København Formation.
DNA Damage
Ancient DNA undergoes characteristic chemical modifications over time. The most diagnostic pattern is cytosine deamination, where cytosine (C) bases are converted to uracil, which is then read as thymine (T) during sequencing.
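This process creates predictable substitution patterns:
- C→T substitutions, concentrated at the 5' end of fragments
- G→A substitutions, concentrated at the 3' end of fragments (the complementary pattern on the opposite strand)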
The damage is most pronounced at the ends of DNA fragments, decreasing toward the center. This creates the characteristic "smiley" pattern when plotted. The damage value represents the estimated proportion of damaged reads, typically expressed as a percentage.
Damage (Primary Metric)
The estimated proportion of reads showing authentic ancient DNA damage patterns. Higher values indicate stronger evidence for ancient origin.
The Smiley Plot
The "smiley plot" visualizes nucleotide substitution frequencies across read positions. It shows how damage varies from the 5' end (left) through the center to the 3' end (right).
Reading the Plot
- Red line (5' C→T): Shows C-to-T substitution rate from the 5' end. Authentic ancient DNA shows elevated rates at the fragment end, decaying toward the center.
- Blue line (3' G→A): Shows G-to-A substitution rate from the 3' end. This is the complementary damage pattern on the opposite strand.
- Faded lines: Show observed data points.
- Solid lines: Show the fitted model predictions.
The characteristic "smile" shape (high at both ends, low in the middle) is the hallmark of authentic ancient DNA damage. Modern contamination typically shows flat patterns, with no rise in substitution rate toward the fragment ends.
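The observed points in such a plot are simply per-position substitution frequencies. A minimal sketch, assuming per-position mismatch counts and reference-base counts are already available (the function name and the example counts are hypothetical, not taken from the explorer):

```python
import math

def substitution_frequencies(mismatch_counts, reference_counts):
    """Return the per-position substitution frequency k / N.

    mismatch_counts[i]  -- e.g. number of C->T mismatches at position i+1
    reference_counts[i] -- number of reference C's observed at position i+1
    Positions with no reference bases yield NaN instead of dividing by zero.
    """
    freqs = []
    for k, n in zip(mismatch_counts, reference_counts):
        freqs.append(k / n if n > 0 else math.nan)
    return freqs

# Hypothetical counts for positions 1..5 counted from the 5' end.
ct_counts = [300, 150, 80, 45, 30]
c_totals = [1000, 1000, 1000, 1000, 1000]
freqs = substitution_frequencies(ct_counts, c_totals)
```

With these made-up counts the frequency falls steadily with distance from the 5' end, which is the left half of the "smile"; the same calculation on G→A counts from the 3' end gives the right half.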
Significance
The significance metric represents the number of standard deviations (sigmas) that the estimated damage is away from zero. It indicates statistical confidence that the observed damage is real rather than noise.
Significance (Statistical)
A Z-score measuring how confidently we can distinguish the damage signal from zero.
- > 3: Strong evidence of damage (99.7% confidence)
- 2-3: Moderate evidence (95-99.7% confidence)
- < 2: Weak or no evidence of damage
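The thresholds above can be sketched in a few lines. This assumes the significance is the damage estimate divided by its standard deviation, as the description suggests; the function names and the handling of values exactly equal to 2 or 3 are assumptions for illustration:

```python
def significance(damage, damage_std):
    """Z-score: how many standard deviations the damage estimate is from zero."""
    if damage_std <= 0:
        raise ValueError("damage_std must be positive")
    return damage / damage_std

def evidence_label(z):
    """Map a significance value to the evidence categories listed above."""
    if z > 3:
        return "strong"
    if z >= 2:
        return "moderate"
    return "weak or none"

# Hypothetical estimate: 12% damage with a standard deviation of 3%.
z = significance(0.12, 0.03)
label = evidence_label(z)
```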
Model Parameters
The damage model fits an exponential decay function to the observed substitution patterns. The model estimates four key parameters:
A (Amplitude)
The background-independent damage level. Represents the maximum damage rate at fragment ends, independent of sequencing artifacts.
q (Decay Rate)
Controls how quickly damage decreases from the ends toward the center of fragments. Higher values mean faster decay (damage concentrated at ends).
φ (Phi - Concentration)
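The concentration parameter for the beta-binomial distribution. Controls the overdispersion in the model, that is, how much the observed counts vary beyond what a plain binomial model would predict. For a concentration parameter, larger values correspond to less overdispersion.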
c (Background)
The baseline substitution rate due to sequencing errors or other non-damage sources. Accounts for noise in the data that's not related to ancient DNA damage.
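To make the roles of A, q, and c concrete, here is a minimal sketch of one decay curve consistent with the descriptions above. The exact functional form used by the explorer is not stated in this guide, so the geometric (discrete exponential) decay below is an assumption, as is the function name:

```python
def expected_substitution_rate(position, A, q, c):
    """Expected substitution rate at a 1-based distance from the fragment end.

    Assumed form: A * (1 - q) ** (position - 1) + c
      A -- damage amplitude at the first position, above background
      q -- decay rate (0 < q < 1); larger q means faster decay
      c -- constant background substitution rate
    """
    if position < 1:
        raise ValueError("position is 1-based")
    return A * (1.0 - q) ** (position - 1) + c

# Hypothetical parameters: 30% amplitude, decay rate 0.4, 1% background.
curve = [expected_substitution_rate(x, A=0.30, q=0.4, c=0.01) for x in range(1, 16)]
```

Under this form the curve starts at A + c at the first position and flattens out at c toward the center of the fragment. φ does not appear because it affects the spread of the observed counts around this curve, not the curve itself.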
Coverage Metrics
Coverage metrics from filterBAM help assess the quality and reliability of read alignments to each reference genome.
Mean Coverage
Average sequencing depth across all positions in the reference genome. Calculated as total aligned bases divided by reference length.
Breadth
The proportion of reference bases covered by at least one aligned read. Expressed as a percentage (0-100%).
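Mean coverage and breadth can both be derived from a per-position depth vector. The sketch below is a toy illustration of the two definitions, not filterBAM's implementation; the depth values are made up:

```python
def mean_coverage(depths):
    """Total aligned bases divided by reference length."""
    return sum(depths) / len(depths)

def breadth(depths):
    """Fraction of reference positions covered by at least one read."""
    covered = sum(1 for d in depths if d > 0)
    return covered / len(depths)

# Hypothetical depth at each of 10 reference positions.
depths = [0, 0, 3, 5, 2, 0, 1, 1, 0, 0]
cov = mean_coverage(depths)   # 12 aligned bases over 10 positions
brd = breadth(depths)         # 5 of 10 positions covered
```

Here `breadth` returns a fraction between 0 and 1; multiply by 100 for the percentage form used above.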
Expected Breadth
Theoretical breadth expected given the mean coverage, assuming reads are distributed randomly across the reference.
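As an illustration only: if per-position depth is modeled as Poisson-distributed with mean equal to the mean coverage, the probability that a position is covered at least once gives the expected breadth. This is an assumed textbook form, and the tool may apply an additional empirical scaling, so treat the symbols as hypothetical:

```latex
\[
  B_{\text{exp}} = 1 - e^{-\bar{c}}
\]
```

where \(\bar{c}\) is the mean coverage and \(B_{\text{exp}}\) is the expected breadth as a fraction between 0 and 1.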
Breadth/Expected Ratio (B/E)
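Ratio between observed and expected breadth. Helps identify cases where coverage doesn't match random distribution expectations: a ratio well below 1 means the reads cover less of the reference than randomly placed reads would, which suggests they are piling up in a few regions.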
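A short sketch of the ratio, assuming a simple Poisson model for expected breadth (1 - exp(-mean coverage)). That model and both function names are assumptions for illustration; the exact formula used by filterBAM may differ:

```python
import math

def expected_breadth(mean_cov):
    """Expected covered fraction if reads fall at random (assumed Poisson model)."""
    return 1.0 - math.exp(-mean_cov)

def breadth_expected_ratio(observed_breadth, mean_cov):
    """Observed breadth divided by expected breadth; NaN when nothing is expected."""
    expected = expected_breadth(mean_cov)
    if expected == 0:
        return math.nan
    return observed_breadth / expected

# Hypothetical reference: mean coverage 1.2x, but only half the positions covered.
ratio = breadth_expected_ratio(0.5, 1.2)
```

In this made-up case the ratio is below 1, meaning the reads are more clustered than random placement would produce.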
Coverage Evenness
Measures uniformity of read distribution across the reference genome, as described in Fuentes-Pardo & Ruzzante (2017).
Normalized Gini Coefficient
Measures inequality in coverage distribution. Adapted from economics where it measures wealth inequality. Used alongside entropy to detect uneven coverage patterns.
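For intuition, the Gini coefficient of a set of per-window coverage values can be sketched as below. The normalization shown (rescaling by n / (n - 1) so the maximum is exactly 1) is one common choice and is an assumption here; the tool's own binning and normalization are not described in this guide:

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 = perfectly even."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n

def normalized_gini(values):
    """Rescale so that 'all coverage in one window' gives exactly 1."""
    n = len(values)
    if n < 2:
        return 0.0
    return gini(values) * n / (n - 1)

even = normalized_gini([2, 2, 2, 2])      # coverage spread evenly
skewed = normalized_gini([0, 0, 0, 8])    # all coverage in one window
```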
Normalized Entropy
Quantifies the randomness of coverage distribution across genomic positions. Complements Gini coefficient for coverage quality assessment.
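A minimal sketch of a normalized entropy over per-window coverage, assuming Shannon entropy divided by its maximum possible value, log(n). The windowing and the exact normalization used by the tool are assumptions:

```python
import math

def normalized_entropy(values):
    """Shannon entropy of the coverage distribution, scaled to the range 0..1."""
    total = sum(values)
    n = len(values)
    if n < 2 or total == 0:
        return 0.0
    h = 0.0
    for v in values:
        if v > 0:
            p = v / total          # share of total coverage in this window
            h -= p * math.log(p)
    return h / math.log(n)         # log(n) is the entropy of a uniform spread

even_h = normalized_entropy([2, 2, 2, 2])     # evenly spread coverage
skewed_h = normalized_entropy([0, 0, 0, 8])   # everything in one window
```

Under this definition evenly spread coverage scores near 1 and coverage confined to a single window scores 0, the opposite direction from the Gini coefficient.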
Damage Status
Each reference is classified based on its damage characteristics:
- Damaged: Shows statistically significant ancient DNA damage patterns. These taxa likely represent authentic ancient organisms from the sample.
- Not classified as damaged: Does not show significant damage. May represent modern contamination, organisms with insufficient coverage, or recently deposited material.
The status classification considers both the damage level and its statistical significance. A taxon needs both elevated damage and high confidence to be classified as "damaged."
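The two-condition rule can be sketched as follows. The cutoff values and the "not damaged" label are hypothetical placeholders, since this guide does not state the thresholds or labels the explorer actually uses:

```python
def damage_status(damage, significance, min_damage=0.05, min_significance=2.0):
    """Classify a reference; both thresholds are illustrative assumptions."""
    if damage >= min_damage and significance >= min_significance:
        return "damaged"
    return "not damaged"

# High damage with high confidence passes both conditions.
status_a = damage_status(damage=0.20, significance=5.0)
# High damage with low confidence fails the significance condition.
status_b = damage_status(damage=0.20, significance=1.0)
# High confidence with negligible damage fails the damage condition.
status_c = damage_status(damage=0.01, significance=5.0)
```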
References & Tools
The damage analysis in this explorer uses methods from: