Understanding Ancient DNA Damage Analysis

This guide explains the key concepts and metrics used in the KAPK Explorer for analyzing ancient DNA damage patterns in metagenomic samples from the Kap København Formation.

DNA Damage

Ancient DNA undergoes characteristic chemical modifications over time. The most diagnostic pattern is cytosine deamination, where cytosine (C) bases are converted to uracil, which is then read as thymine (T) during sequencing.

This process creates predictable substitution patterns:

  • 5' end: C → T
  • 3' end: G → A

The damage is most pronounced at the ends of DNA fragments, decreasing toward the center. This creates the characteristic "smiley" pattern when plotted. The damage value represents the estimated proportion of damaged reads, typically expressed as a percentage.

Damage (Primary Metric)

The estimated proportion of reads showing authentic ancient DNA damage patterns. Higher values indicate stronger evidence for ancient origin.

Interpretation: Values above 10-15% typically indicate authentic ancient DNA, though this threshold can vary depending on the organism and preservation conditions.

The Smiley Plot

The "smiley plot" visualizes nucleotide substitution frequencies across read positions. It shows how damage varies from the 5' end (left) through the center to the 3' end (right).

Reading the Plot

  • Red line (5' C→T): Shows C-to-T substitution rate from the 5' end. Authentic ancient DNA shows elevated rates at the fragment end, decaying toward the center.
  • Blue line (3' G→A): Shows G-to-A substitution rate from the 3' end. This is the complementary damage pattern on the opposite strand.
  • Faded lines: Show observed data points.
  • Solid lines: Show the fitted model predictions.

The characteristic "smile" shape—high at both ends, low in the middle—is the hallmark of authentic ancient DNA damage. Modern contamination typically shows flat patterns.
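As a rough illustration of what the observed data points in the plot represent, a per-position substitution frequency can be computed as the number of reads showing the substitution divided by the number of reads with the relevant reference base at that position. The function name and the counts below are hypothetical and are not taken from the explorer itself.

```python
def substitution_frequencies(mismatch_counts, base_counts):
    """Return mismatches / reference bases for each read position.

    Positions with no reference bases observed get a frequency of 0.0.
    """
    freqs = []
    for k, n in zip(mismatch_counts, base_counts):
        freqs.append(k / n if n > 0 else 0.0)
    return freqs


# Hypothetical counts for positions 1..5 counted from the 5' end.
ct_counts = [30, 12, 6, 3, 2]          # reads showing C -> T
c_totals = [100, 100, 100, 100, 100]   # reads with a reference C
five_prime = substitution_frequencies(ct_counts, c_totals)

for position, freq in enumerate(five_prime, start=1):
    print(f"position {position}: C->T frequency = {freq:.2f}")
```

In this made-up example the frequency is highest at position 1 and falls toward the interior of the fragment, which is the left half of the "smile".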

Significance

The significance metric represents the number of standard deviations (sigmas) that the estimated damage is away from zero. It indicates statistical confidence that the observed damage is real rather than noise.

Significance (Statistical)

A Z-score measuring how confidently we can distinguish the damage signal from zero.

Interpretation:
  • > 3: Strong evidence of damage (more than 99.7% confidence, assuming normally distributed errors)
  • 2-3: Moderate evidence (roughly 95-99.7% confidence)
  • < 2: Weak or no evidence of damage
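The relationship between the Z-score and the confidence levels quoted above can be sketched as follows. This assumes the significance is simply the damage estimate divided by its standard deviation and that errors are normally distributed; the function names are hypothetical, and the explorer may compute the value differently.

```python
import math


def significance(damage, damage_std):
    """Z-score: how many standard deviations the damage estimate is from zero."""
    if damage_std <= 0:
        raise ValueError("damage_std must be positive")
    return damage / damage_std


def two_sided_confidence(z):
    """Probability mass of a standard normal within +/- z standard deviations."""
    return math.erf(z / math.sqrt(2))


def evidence_label(z):
    """Map a Z-score onto the three bands listed in the interpretation."""
    if z > 3:
        return "strong"
    if z >= 2:
        return "moderate"
    return "weak"


z = significance(damage=0.12, damage_std=0.03)  # hypothetical values
print(evidence_label(z), round(two_sided_confidence(3), 4))
```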

Model Parameters

The damage model fits an exponential decay function to the observed substitution patterns. The model estimates four key parameters:

A (Amplitude)

The background-independent damage level. Represents the maximum damage rate at fragment ends, independent of sequencing artifacts.

q (Decay Rate)

Controls how quickly damage decreases from the ends toward the center of fragments. Higher values mean faster decay (damage concentrated at ends).

φ (Phi - Concentration)

The concentration parameter for the beta-binomial distribution. Controls the overdispersion in the model: how much the observed counts vary beyond what a simple binomial model would predict.

c (Background)

The baseline substitution rate due to sequencing errors or other non-damage sources. Accounts for noise in the data that's not related to ancient DNA damage.
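To make the roles of the four parameters concrete, here is a minimal sketch of one way such a model could be written. The exact functional form is an assumption: it uses a geometric decay, A * (1 - q)^(x - 1) + c, as one common way to express exponential decay over discrete positions, and it parameterizes the beta-binomial by its mean and concentration φ. The explorer's actual formulation may differ.

```python
def expected_damage(x, A, q, c):
    """Expected substitution frequency at position x (x = 1 is the terminal base).

    Assumed form: amplitude A decaying geometrically at rate q, plus background c.
    """
    return A * (1.0 - q) ** (x - 1) + c


def beta_binomial_variance(n, p, phi):
    """Variance of a beta-binomial count with mean probability p and concentration phi.

    Equals the binomial variance n*p*(1-p) inflated by (phi + n) / (phi + 1),
    so larger phi means less overdispersion under this parameterization.
    """
    return n * p * (1.0 - p) * (phi + n) / (phi + 1.0)


# Hypothetical parameter values, for illustration only.
A, q, c, phi = 0.30, 0.20, 0.01, 50.0
for x in range(1, 6):
    print(x, round(expected_damage(x, A, q, c), 4))
```

With these assumed values the curve starts near A + c at the fragment end and approaches c toward the center; a larger q would make it drop faster.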

Coverage Metrics

Coverage metrics from filterBAM help assess the quality and reliability of read alignments to each reference genome.

Mean Coverage

Average sequencing depth across all positions in the reference genome. Calculated as total aligned bases divided by reference length.

Breadth

The proportion of reference bases covered by at least one aligned read. Expressed as a percentage (0-100%).
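As a small worked example of the two definitions above, the following sketch computes mean coverage and breadth from a list of per-position read depths. The depth list and function names are hypothetical; filterBAM derives these values from alignments, not from a list like this.

```python
def mean_coverage(depths):
    """Total aligned bases divided by reference length."""
    return sum(depths) / len(depths)


def breadth(depths):
    """Fraction of reference positions covered by at least one read (0.0-1.0)."""
    return sum(1 for d in depths if d > 0) / len(depths)


# Hypothetical depth at each of 10 reference positions.
depths = [0, 0, 3, 2, 0, 1, 0, 0, 0, 0]
print(f"mean coverage: {mean_coverage(depths):.2f}")
print(f"breadth: {breadth(depths) * 100:.0f}%")
```

Breadth is returned as a fraction here and multiplied by 100 for display as a percentage.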

Expected Breadth

Theoretical breadth expected given the mean coverage, assuming random read distribution:

Expected Breadth = 1 - e^(-coverage)

Breadth/Expected Ratio (B/E)

Ratio between observed and expected breadth. Helps identify cases where coverage doesn't match random distribution expectations.

Interpretation: Values near 1.0 indicate coverage consistent with random read placement. Low ratios suggest reads are clumped in specific regions rather than spread across the genome.
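The expected breadth and the B/E ratio can be computed directly from the mean coverage and the observed breadth. This is a sketch with hypothetical function names and input values; breadth is given as a fraction between 0 and 1.

```python
import math


def expected_breadth(mean_cov):
    """Breadth expected under random read placement: 1 - e^(-coverage)."""
    return 1.0 - math.exp(-mean_cov)


def breadth_expected_ratio(observed_breadth, mean_cov):
    """Observed breadth divided by expected breadth (NaN if coverage is zero)."""
    expected = expected_breadth(mean_cov)
    return observed_breadth / expected if expected > 0 else float("nan")


# Hypothetical reference: mean coverage 0.6x, 30% of positions covered.
ratio = breadth_expected_ratio(observed_breadth=0.30, mean_cov=0.6)
print(f"expected breadth: {expected_breadth(0.6):.3f}, B/E: {ratio:.3f}")
```

In this example the ratio comes out well below 1.0, which would point to reads stacking in a few regions instead of spreading across the reference.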

Coverage Evenness

Measures uniformity of read distribution across the reference genome, as described in Fuentes-Pardo & Ruzzante (2017).

Interpretation: Higher values indicate more uniform coverage. A value of 0 indicates highly localized coverage (potential artifacts or conserved regions).

Normalized Gini Coefficient

Measures inequality in coverage distribution. Adapted from economics where it measures wealth inequality. Used alongside entropy to detect uneven coverage patterns.

Interpretation: Higher values indicate more uneven/clumped coverage. Useful for identifying cases with high B/E ratio but localized coverage.
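For intuition, a Gini coefficient over per-position depths can be sketched as below. The normalization shown (scaling by n / (n - 1) so that the maximum is exactly 1) is an assumption; the source does not state how filterBAM normalizes the value or whether it bins positions first.

```python
def gini(values):
    """Standard Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def normalized_gini(values):
    """Gini rescaled so that all coverage at a single position gives 1.0."""
    n = len(values)
    if n < 2:
        return 0.0
    return gini(values) * n / (n - 1)


even = [5, 5, 5, 5]       # hypothetical uniform coverage
clumped = [0, 0, 0, 20]   # same total depth, all at one position
print(normalized_gini(even), normalized_gini(clumped))
```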

Normalized Entropy

Quantifies the randomness of coverage distribution across genomic positions. Complements Gini coefficient for coverage quality assessment.

Interpretation: Higher entropy indicates more uniform, random coverage distribution. Low entropy suggests coverage is concentrated in specific regions.
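One way to picture this metric is as the Shannon entropy of the coverage distribution divided by its maximum possible value, log(n). The sketch below assumes that definition and works on raw per-position depths; the exact computation in filterBAM is not given in this guide and may differ.

```python
import math


def normalized_entropy(depths):
    """Shannon entropy of the depth distribution, scaled to the range 0-1."""
    n = len(depths)
    total = sum(depths)
    if n < 2 or total == 0:
        return 0.0
    h = 0.0
    for d in depths:
        if d > 0:
            p = d / total          # share of all aligned bases at this position
            h -= p * math.log(p)
    return h / math.log(n)         # log(n) is the entropy of a uniform spread


even = [5, 5, 5, 5]       # hypothetical uniform coverage
clumped = [0, 0, 0, 20]   # same total depth, all at one position
print(normalized_entropy(even), normalized_entropy(clumped))
```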

Damage Status

Each reference is classified based on its damage characteristics:

Damaged

Shows statistically significant ancient DNA damage patterns. These taxa likely represent authentic ancient organisms from the sample.

Non-Damaged

Does not show significant damage. May represent modern contamination, organisms with insufficient coverage, or recently deposited material.

The status classification considers both the damage level and its statistical significance. A taxon needs both elevated damage and high confidence to be classified as "damaged."
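The two-condition rule can be sketched as a simple function. The cutoff values used here (10% damage, significance of 3) are hypothetical defaults borrowed from the interpretation notes earlier in this guide; the thresholds the explorer actually applies are not stated.

```python
def damage_status(damage, significance, min_damage=0.10, min_significance=3.0):
    """Classify a reference as 'Damaged' only if both conditions hold.

    min_damage and min_significance are illustrative, not the explorer's
    documented cutoffs.
    """
    if damage >= min_damage and significance >= min_significance:
        return "Damaged"
    return "Non-Damaged"


print(damage_status(damage=0.18, significance=5.2))  # both conditions met
print(damage_status(damage=0.18, significance=1.4))  # high damage, low confidence
print(damage_status(damage=0.02, significance=6.0))  # confident, but little damage
```

Requiring both conditions keeps poorly supported high estimates and well-supported but negligible damage out of the "damaged" class.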

References & Tools

The damage analysis in this explorer uses methods from: