Catch-22: On the Fundamental Tradeoff Between Detectability and Robustness in LLM Watermarking

Pratihar, Kuheli; Mukhopadhyay, Debdeep

TL;DR

We develop a unified information-theoretic view of LLM watermarking, show how editing contracts usable watermark evidence, and use the resulting limits to choose a near-Pareto watermarking family for the anticipated edit regime.

Different Types of Watermarking

Biased watermarking example with green-token probability increase

I. Biased watermarking

Probability-modifying token watermarking

Raises the probability of keyed token sets, producing strong verifier evidence and higher keyless visibility.

KGW: Context-dependent green-list watermark that biases logits toward keyed token subsets.

Unigram: Fixed-green-list watermark with a simple count-based detector over a keyed token partition.

Bias-free token watermarking example with keyed sampling and unchanged marginal distribution

II. Bias-free token watermarking

Keyed sampling with lower marginal drift

Preserves token marginals more carefully. Robustness depends on token alignment surviving edits.

DiPMark: Distribution-preserving p-alpha reweighting with retained keyed detector signal.

HCW: Unbiased inverse-CDF style sampling that preserves marginal probabilities.

HeavyWater: Heavy-tailed PRF token preference watermark for low-entropy decoding regimes.

SimplexWater: Simplex-direction PRF watermark evaluated through the token-level adapter.

Kuditipudi: Gumbel-key randomized sampling aligned against a keyed random sequence.

Semantic watermarking example with sentence-level encoding

II*. Semantic watermarking

Sentence- or meaning-level evidence

Uses sentence or semantic features, so paraphrases hurt less than exact token substitutions.

SemStamp: Embedding-space LSH and rejection sampling for sentence-level signals.

PMark: Channel-constrained semantic watermark with limited distortion.

SimMark: Sentence-similarity watermark over keyed embedding-space intervals.

Undetectable watermarking example with distribution-preserving sampling

III. Undetectable watermarking

Distribution-preserving schemes

Preserves the output distribution for keyless observers, but edit robustness is fragile.

CGW: Cryptographic inverse-CDF sampling designed to stay invisible to keyless tests.

IV. Others

Adaptive, structural, and selector-style schemes

These methods are evaluated in the same pipeline, but they do not correspond to one of the four schematic panels above.

GaussMark: Structural/training-time watermark evaluated through checkpoint hooks.

DAWA: Distribution-adaptive watermark that changes strength with the next-token distribution.

Hybrid: Catch-22 selector wrapper that routes across families by expected edit regime.

The Watermarking Catch-22

Interactive example, with downstream edit pressure set to 25%. Axes: robustness increases; keyless detectability increases.

High signal, high visibility: biased probability watermark

A strong sampling bias accumulates evidence quickly, but the same probability shift gives a keyless observer more statistical drift to exploit.

Verifier evidence after edits: 68%

Keyless detectability: High

Strong evidence is useful for verification, but it also moves the text farther from the unwatermarked distribution.

Inference-time watermarks verify AI-generated text by embedding statistical evidence into the sampling process. The same evidence creates tension between two goals: the verifier wants the signal to survive paraphrasing and other edits, while an outsider should not be able to detect that the text is watermarked.

Our framework compares heterogeneous watermark families through a shared quantity: a usable KL information budget. Distribution-preserving schemes keep keyless statistical drift at zero but are brittle under edits. Probability-modifying token- and sentence-level schemes accumulate more evidence, but that evidence also increases detectability.

A Statistical Signal, Not a Vibe Check

A token-level watermark can be explained as a subtle statistical preference. At each generation step, the vocabulary is split into context-dependent green and red tokens. The generator softly nudges probability mass toward green tokens without forcing low-quality words.

Detection recomputes the same green/red split and asks whether the final text contains an unusually high green-token count. Editing and paraphrasing reduce that evidence, which is why the selector must account for the expected edit channel before choosing a watermark family.

Example text: "model explains the answer with care and detail"

Detector view: too many green choices

The keyed verifier sees a statistically surprising surplus of green tokens, while a keyless outsider may also see distributional drift if the watermark modifies probabilities too aggressively.

Green-token share: 5 / 8 green. Evidence accumulates with length.

1. Split vocabulary

2. Softly bias sampling

3. Count evidence

4. Route by edit regime
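
The split, bias, and count steps can be sketched in a few lines of Python. This is a minimal toy under stated assumptions, not the KGW implementation: the base model is uniform over a 200-token vocabulary, the keyed split is a SHA-256 hash of a hypothetical key with the previous and current token ids, and the names `is_green`, `sample_watermarked`, `generate`, and `green_z_score` are illustrative only.

```python
import hashlib
import math
import random

VOCAB_SIZE = 200       # toy vocabulary size (assumption)
GAMMA = 0.5            # target green-list fraction (assumption)
KEY = b"demo-key"      # hypothetical secret key


def is_green(prev_token: int, token: int) -> bool:
    """Keyed, context-dependent split: hash (key, previous token, token)."""
    payload = KEY + prev_token.to_bytes(4, "big") + token.to_bytes(4, "big")
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA


def sample_watermarked(prev_token: int, rng: random.Random, delta: float) -> int:
    """Sample from a uniform toy model after adding `delta` to green logits."""
    weights = [math.exp(delta) if is_green(prev_token, v) else 1.0
               for v in range(VOCAB_SIZE)]
    return rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]


def generate(length: int, rng: random.Random, delta: float) -> list:
    """Generate `length` tokens after a fixed start token; delta=0 disables the bias."""
    tokens = [0]
    for _ in range(length):
        tokens.append(sample_watermarked(tokens[-1], rng, delta))
    return tokens


def green_z_score(tokens: list) -> float:
    """z-score of the green count against a Binomial(n, GAMMA) null."""
    n = len(tokens) - 1
    greens = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    return (greens - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


if __name__ == "__main__":
    rng = random.Random(0)
    print("watermarked z:", green_z_score(generate(200, rng, delta=2.0)))
    print("plain z:      ", green_z_score(generate(200, rng, delta=0.0)))
```

The detector needs only the key and the token sequence, which is why replacing or reordering tokens (and thereby changing the hashed context) erodes the count.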

Mathematical build-up

Theorem 1: sequence evidence accumulates from token evidence

Start with the ordinary sampler and the watermarked sampler over the generated sequence.

$$P^s(y_{1:T}\mid x)=\prod_{t=1}^{T}p_t(y_t),\qquad Q(y_{1:T}\mid x)=\prod_{t=1}^{T}q_t(y_t)$$

At one step, the watermark contributes the KL gap between the two next-token distributions.

$$D_t=\mathrm{KL}(q_t\|p_t)=\sum_{v}q_t(v)\log\frac{q_t(v)}{p_t(v)}$$

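Because generation is token-by-token, the chain rule for KL divergence makes these small information contributions add across the sequence, with each $D_t$ read as an average over prefixes drawn from $Q$.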

$$D_{\mathrm{seq}}=\mathrm{KL}(Q\|P^s)=\sum_{t=1}^{T}D_t$$

By Pinsker's inequality (with KL measured in nats), keyless detectability is then bounded by the accumulated information.

$$\mathrm{TV}(P^s,Q)\le \sqrt{\frac{D_{\mathrm{seq}}}{2}}$$

Plugging in the per-family one-step terms gives the qualitative frontier.

$$\mathrm{TV}_{\mathrm{bias}}=O(|\delta|\sqrt{T}),\qquad \mathrm{TV}_{\mathrm{sem}}=O(\sqrt{T/\ell}),\qquad \mathrm{TV}_{\mathrm{prf}}=0$$

Interpretation. A single small nudge may be invisible, but the summed KL budget $D_{\mathrm{seq}}$ grows with length, and the total-variation bound translates that budget into possible keyless detectability.
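
The additivity and the total-variation bound can be checked numerically on a toy case. The sketch below makes simplifying assumptions that are not from the paper: a four-token vocabulary, a context-free baseline `p`, and a fixed green split with logit bias `delta`, so every step contributes the same $D_t$. The sequence-level KL and TV are computed exactly by enumerating all sequences.

```python
import itertools
import math

p = [0.4, 0.3, 0.2, 0.1]             # toy baseline next-token distribution (assumption)
green = [True, False, True, False]   # toy keyed split (assumption)
delta = 0.5                          # logit bias on green tokens (assumption)


def tilt(p, green, delta):
    """Add `delta` to the logits of green tokens and renormalize."""
    w = [pv * math.exp(delta if g else 0.0) for pv, g in zip(p, green)]
    z = sum(w)
    return [x / z for x in w]


def kl(q, p):
    """KL(q || p) in nats."""
    return sum(qv * math.log(qv / pv) for qv, pv in zip(q, p) if qv > 0)


def sequence_stats(p, q, T):
    """Exact sequence-level KL(Q || P) and TV(P, Q) for T independent steps."""
    kl_seq, tv = 0.0, 0.0
    for seq in itertools.product(range(len(p)), repeat=T):
        P = math.prod(p[v] for v in seq)
        Q = math.prod(q[v] for v in seq)
        kl_seq += Q * math.log(Q / P)
        tv += 0.5 * abs(P - Q)
    return kl_seq, tv


if __name__ == "__main__":
    q = tilt(p, green, delta)
    D = kl(q, p)
    for T in range(1, 6):
        kl_seq, tv = sequence_stats(p, q, T)
        bound = math.sqrt(kl_seq / 2)
        print(f"T={T}  T*D={T * D:.5f}  KL_seq={kl_seq:.5f}  TV={tv:.5f}  bound={bound:.5f}")
```

In this toy setting the enumerated sequence KL matches $T \cdot D_t$, and the exact TV stays below $\sqrt{D_{\mathrm{seq}}/2}$ while growing with $T$.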

This intuition follows the green-token watermark explanation popularized around KGW-style LLM watermarks; see the Arize research reading summary for a reader-friendly treatment.

Information Budget Under Editing

Detectability

KL controls separability

The accumulated divergence between watermarked and unwatermarked text bounds what keyless tests can distinguish, while also governing verifier-side hypothesis testing power.

Robustness

Editing shrinks evidence

For token-level schemes, usable information contracts with the edit rate. For semantic schemes, the relevant contraction depends on the induced semantic flip rate.

Selection

No single family dominates

The best family depends on the expected editing regime, the stealth cap, and the amount of post-edit verification power required at deployment.

Usable information after edits, for token-level schemes under the paper's edit model:

$$D_{\varepsilon} \approx (1-\varepsilon)^2 D_0$$

Mathematical build-up

Theorem 2: editing contracts the information budget

Write the watermark as a small perturbation of the baseline token distribution.

$$q=p+r,\qquad D_0\approx \frac{1}{2}\sum_v\frac{r(v)^2}{p(v)}$$

Model editing as replacing an $\varepsilon$ fraction of tokens by an edit distribution $R$.

$$p_{\varepsilon}=(1-\varepsilon)p+\varepsilon R,\qquad q_{\varepsilon}=(1-\varepsilon)q+\varepsilon R$$

The edit distribution cancels in the difference, leaving a weaker watermark perturbation.

$$q_{\varepsilon}-p_{\varepsilon}=(1-\varepsilon)(q-p)=(1-\varepsilon)r$$

Since local KL is quadratic in the perturbation size, the retained information is squared.

$$D_{\varepsilon}\approx \frac{1}{2}\sum_v\frac{\big((1-\varepsilon)r(v)\big)^2}{p(v)}=(1-\varepsilon)^2D_0$$

Across $T$ tokens, the post-edit token budget is therefore

$$C_{\mathrm{tok}}(\varepsilon)\approx T(1-\varepsilon)^2D_0$$

Verification succeeds only while that remaining budget clears the target error threshold; solving for $\varepsilon$ gives the largest edit rate that still verifies at error level $\beta$.

$$C_{\mathrm{tok}}(\varepsilon)\gtrsim \log_2(1/\beta)\quad\Longrightarrow\quad \varepsilon_{\beta}^{\mathrm{tok}}\approx 1-\sqrt{\frac{\log_2(1/\beta)}{TD_0}}$$
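
As an illustrative plug-in, with values chosen for easy arithmetic rather than taken from the paper, suppose $T=400$ tokens, $D_0=0.1$ bits per token, and $\beta=2^{-10}$:

$$\log_2(1/\beta)=10,\qquad TD_0=40,\qquad \varepsilon_{\beta}^{\mathrm{tok}}\approx 1-\sqrt{\frac{10}{40}}=0.5$$

Under these assumed values, the budget at $\varepsilon=0.5$ is $400\cdot(0.5)^2\cdot 0.1=10$ bits, which sits exactly at the threshold; any heavier editing would drop below it.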

For semantic schemes, replace token survival by the semantic flip rate.

$$C_{\mathrm{sem}}(\varepsilon)\approx T_s\big(1-2\varepsilon_s(\varepsilon)\big)^2D_0^{(\mathrm{sem})}$$

Interpretation. Edits first shrink the watermark perturbation by $1-\varepsilon$; KL then squares that remaining perturbation, producing the $(1-\varepsilon)^2$ contraction.
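
A short numerical check of the contraction is sketched below. It assumes toy values for `p` and `r`, and it takes the edit distribution to be $R=p$ (edited tokens resampled from the baseline). That choice is an assumption of this sketch, not necessarily the paper's edit model; it keeps the mixture baseline at $p$, matching the denominator in the local KL expansion. The helper names are hypothetical.

```python
import math


def kl(q, p):
    """KL(q || p) in nats for two discrete distributions."""
    return sum(qv * math.log(qv / pv) for qv, pv in zip(q, p) if qv > 0)


def mix(dist, edit_dist, eps):
    """Replace an eps fraction of the mass by the edit distribution."""
    return [(1 - eps) * d + eps * r for d, r in zip(dist, edit_dist)]


def contraction_ratio(p, q, edit_dist, eps):
    """Measured D_eps / D_0 for one token position."""
    return kl(mix(q, edit_dist, eps), mix(p, edit_dist, eps)) / kl(q, p)


def semantic_budget(num_sentences, eps_s, d0_sem):
    """C_sem = T_s * (1 - 2 * eps_s)^2 * D_0^(sem), as in the display above."""
    return num_sentences * (1 - 2 * eps_s) ** 2 * d0_sem


# Toy values (assumptions): baseline p and a small zero-sum perturbation r.
p = [0.4, 0.3, 0.2, 0.1]
r = [0.01, -0.01, 0.005, -0.005]
q = [pv + rv for pv, rv in zip(p, r)]

if __name__ == "__main__":
    for eps in (0.0, 0.1, 0.25, 0.5):
        measured = contraction_ratio(p, q, p, eps)  # edit distribution R = p
        print(f"eps={eps:.2f}  measured={measured:.4f}  (1-eps)^2={(1 - eps) ** 2:.4f}")
```

For small perturbations the measured ratio tracks $(1-\varepsilon)^2$ closely; the remaining gap comes from the higher-order terms that the quadratic approximation drops.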

Selector Walkthrough: One Answer, Three Deployment Routes

Imagine an LFQA answer generated for public release. Before embedding the watermark, the system estimates how the answer will be edited downstream and routes it to the family that best matches that edit regime.

Expected route: verbatim archive release

The answer is stored or served nearly unchanged, so the selector can use a distribution-preserving watermark whose evidence depends on preserving alignment.

Token edit rate ε ≈ 0; semantic flip rate ε_s ≈ 0

Selected watermark: CGW (distribution-preserving watermark)

Selector score 1.990, AUC 0.99, keyless z -5.80

Candidate scores by route, listed as method: selector score (AUC, keyless z)

Token-based route (ε ≈ 0.25, ε_s ≈ 0.06)

HCW: 1.137 (0.910, 3.40)

DiPMark: 1.104 (0.90, 3.90)

HeavyWater: 1.072 (0.880, 4.20)

Semantic-based route (ε ≈ 0.42, ε_s ≈ 0.10)

PMark: 1.305 (0.850, 1.20)

SemStamp: 1.220 (0.820, 1.50)

SimMark: 1.170 (0.800, 1.70)

Distribution-preserving route (ε ≈ 0, ε_s ≈ 0)

CGW: 1.990 (0.99, -5.80)

This is the low-edit corner of the diagram: distribution preservation gives the strongest stealth signal when the downstream channel is expected to leave the text nearly intact.
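
The routing step can be illustrated with the numbers listed in this walkthrough. The sketch below is not the paper's selector: it reads the parenthesized pairs as (AUC, keyless z), as the CGW entry suggests, and it uses a hypothetical nearest-regime rule (`nearest_route`) to map an estimated edit rate to a route before picking the highest listed selector score. The route labels and function names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    selector_score: float
    auc: float
    keyless_z: float


# Edit regimes (eps, eps_s) and candidate scores copied from the walkthrough.
# The dictionary keys are hypothetical labels, not identifiers from the code package.
REGIMES = {
    "distribution_preserving": (0.0, 0.0),
    "token": (0.25, 0.06),
    "semantic": (0.42, 0.10),
}
CANDIDATES = {
    "distribution_preserving": [Candidate("CGW", 1.990, 0.99, -5.80)],
    "token": [
        Candidate("HCW", 1.137, 0.910, 3.40),
        Candidate("DiPMark", 1.104, 0.90, 3.90),
        Candidate("HeavyWater", 1.072, 0.880, 4.20),
    ],
    "semantic": [
        Candidate("PMark", 1.305, 0.850, 1.20),
        Candidate("SemStamp", 1.220, 0.820, 1.50),
        Candidate("SimMark", 1.170, 0.800, 1.70),
    ],
}


def nearest_route(eps: float, eps_s: float) -> str:
    """Hypothetical rule: pick the listed regime closest to the estimate."""
    return min(
        REGIMES,
        key=lambda r: (REGIMES[r][0] - eps) ** 2 + (REGIMES[r][1] - eps_s) ** 2,
    )


def pick(route: str) -> Candidate:
    """Return the candidate with the highest listed selector score on a route."""
    return max(CANDIDATES[route], key=lambda c: c.selector_score)


if __name__ == "__main__":
    for eps, eps_s in [(0.0, 0.0), (0.2, 0.05), (0.4, 0.1)]:
        route = nearest_route(eps, eps_s)
        print(eps, eps_s, route, pick(route).name)
```

A real selector would compute the score from the estimated post-edit budget and the stealth cap rather than look it up; the table lookup here only makes the routing logic concrete.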

Empirical Validation

Experiments on Llama-2-7B and Mistral-7B evaluate clean detection and post-edit robustness under moderate Dipper rewriting and stronger summarization-style paraphrasing. The results support the theoretical prediction that watermark evidence weakens as edits remove or flip the signal.

Across evaluated regimes, the hybrid method tracks the strongest available family more closely than any fixed family while preserving a lower-detectability operating point than aggressively biased token-level methods.

2 base LLMs

14 watermark methods

3 main conditions

Figure: Edit-rate robustness frontier for watermarking schemes. Post-edit verification weakens as the edit rate increases, exposing the high-edit boundary.

Reproducibility

The code package evaluates the implemented watermark families on LFQA prompts with Llama-2-7B and Mistral-7B. A local backend is included for environment checks before full GPU reproduction.

python -m catch22.pipeline \
  --config configs/llama2_lfqa.yaml \
  --reproduction-suite \
  --num-samples 2 \
  --local-backend \
  --resume


BibTeX

@inproceedings{catch22watermarking2026,
  title = {Catch-22: On the Fundamental Tradeoff Between Detectability and Robustness in LLM Watermarking},
  author = {Pratihar, Kuheli and Mukhopadhyay, Debdeep},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year = {2026},
  url = {https://icml.cc/virtual/2026/poster/66807}
}