ICML 2026

Catch-22: On the Fundamental Tradeoff Between Detectability and Robustness in LLM Watermarking

Kuheli Pratihar and Debdeep Mukhopadhyay

LLM watermarks face a Catch-22: signals that survive editing tend to become easier for keyless observers to detect, while signals that remain stealthy are often easier to erase.

TL;DR

We develop a unified information-theoretic view of LLM watermarking, show how editing contracts usable watermark evidence, and use the resulting limits to choose a near-Pareto watermarking family for the anticipated edit regime.

Different Types of Watermarking

Biased watermarking example with green-token probability increase

I. Biased watermarking

Probability-modifying token watermarking

Raises the probability of keyed token sets, producing strong verifier evidence and higher keyless visibility.

KGW: Context-dependent green-list watermark that biases logits toward keyed token subsets.
Unigram: Fixed-green-list watermark with a simple count-based detector over a keyed token partition.

Bias-free token watermarking example with keyed sampling and unchanged marginal distribution

II. Bias-free token watermarking

Keyed sampling with lower marginal drift

Preserves token marginals more carefully. Robustness depends on token alignment surviving edits.

DiPMark: Distribution-preserving p-alpha reweighting with retained keyed detector signal.
HCW: Unbiased inverse-CDF style sampling that preserves marginal probabilities.
HeavyWater: Heavy-tailed PRF token preference watermark for low-entropy decoding regimes.
SimplexWater: Simplex-direction PRF watermark evaluated through the token-level adapter.
Kuditipudi: Gumbel-key randomized sampling aligned against a keyed random sequence.

Semantic watermarking example with sentence-level encoding

II*. Semantic watermarking

Sentence- or meaning-level evidence

Uses sentence or semantic features, so paraphrases hurt less than exact token substitutions.

SemStamp: Embedding-space LSH and rejection sampling for sentence-level signals.
PMark: Channel-constrained semantic watermark with limited distortion.
SimMark: Sentence-similarity watermark over keyed embedding-space intervals.

Undetectable watermarking example with distribution-preserving sampling

III. Undetectable watermarking

Distribution-preserving schemes

Preserves the output distribution for keyless observers, but edit robustness is fragile.

CGW: Cryptographic inverse-CDF sampling designed to stay invisible to keyless tests.

IV. Others

Adaptive, structural, and selector-style schemes

These methods are evaluated in the same pipeline, but they do not correspond to any of the four schematic panels above.

GaussMark: Structural/training-time watermark evaluated through checkpoint hooks.
DAWA: Distribution-adaptive watermark that changes strength with the next-token distribution.
Hybrid: Catch-22 selector wrapper that routes across families by expected edit regime.

The Watermarking Catch-22

As robustness increases, keyless detectability increases.

High signal, high visibility

Biased probability watermark

A strong sampling bias accumulates evidence quickly, but the same probability shift gives a keyless observer more statistical drift to exploit.

Verifier evidence after edits: 68%
Keyless detectability: High

Strong evidence is useful for verification, but it also moves the text farther from the unwatermarked distribution.

Inference-time watermarks verify AI-generated text by embedding statistical evidence into the sampling process. The same evidence creates tension between two goals: the verifier wants the signal to survive paraphrasing and other edits, while an outsider should not be able to detect that the text is watermarked.

Our framework compares heterogeneous watermark families through a shared quantity: a usable KL information budget. Distribution-preserving schemes keep keyless statistical drift at zero but are brittle under edits. Probability-modifying token- and sentence-level schemes accumulate more evidence, but that evidence also increases detectability.

A Statistical Signal, Not a Vibe Check

A token-level watermark can be explained as a subtle statistical preference. At each generation step, the vocabulary is split into context-dependent green and red tokens. The generator softly nudges probability mass toward green tokens without forcing low-quality words.
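The soft nudge described above can be sketched as follows. This is a minimal illustration assuming a KGW-style additive logit shift; the helper names `green_set` and `bias_distribution`, the SHA-256 keyed hash, the green fraction `gamma`, and the shift `delta` are all assumptions made for this sketch, not the paper's implementation.

```python
import hashlib
import math


def green_set(prev_token: int, vocab_size: int, key: bytes, gamma: float = 0.5) -> set:
    """Hypothetical keyed split: token v is green when its keyed hash falls below gamma."""
    green = set()
    for v in range(vocab_size):
        msg = key + prev_token.to_bytes(4, "big") + v.to_bytes(4, "big")
        u = int.from_bytes(hashlib.sha256(msg).digest()[:8], "big") / 2**64
        if u < gamma:
            green.add(v)
    return green


def bias_distribution(logits: list, green: set, delta: float) -> list:
    """Add delta to every green logit, then renormalize with a stable softmax."""
    shifted = [x + delta if v in green else x for v, x in enumerate(logits)]
    m = max(shifted)
    exps = [math.exp(x - m) for x in shifted]
    z = sum(exps)
    return [e / z for e in exps]


if __name__ == "__main__":
    logits = [0.0] * 50                      # toy uniform next-token logits
    green = green_set(prev_token=3, vocab_size=50, key=b"demo-key")
    q = bias_distribution(logits, green, delta=2.0)
    print("green mass after bias:", sum(q[v] for v in green))
```

Because the shift is additive in logit space, red tokens keep nonzero probability, which is what lets the sampler avoid forcing low-quality words.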

Detection recomputes the same green/red split and asks whether the final text contains an unusually high green-token count. Editing and paraphrasing reduce that evidence, which is why the selector must account for the expected edit channel before choosing a watermark family.
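The counting step can be sketched as a one-sided z-test on the green-token count. This is an illustrative sketch only: the keyed hash in `is_green`, the green fraction `gamma`, and the function names are assumptions, and the null model treats each scored token as green independently with probability `gamma`.

```python
import hashlib
import math


def is_green(prev_token: int, token: int, key: bytes, gamma: float = 0.5) -> bool:
    """Hypothetical keyed split: recompute whether `token` is green given the previous token."""
    msg = key + prev_token.to_bytes(4, "big") + token.to_bytes(4, "big")
    return int.from_bytes(hashlib.sha256(msg).digest()[:8], "big") / 2**64 < gamma


def green_z_score(tokens: list, key: bytes, gamma: float = 0.5) -> float:
    """z-score of the green count against the no-watermark expectation gamma * n."""
    scored = len(tokens) - 1                 # the first token has no context to hash
    if scored <= 0:
        raise ValueError("need at least two tokens")
    hits = sum(is_green(p, t, key, gamma) for p, t in zip(tokens, tokens[1:]))
    return (hits - gamma * scored) / math.sqrt(gamma * (1 - gamma) * scored)


if __name__ == "__main__":
    print(green_z_score([5, 17, 42, 8, 99, 3, 64, 21, 7], key=b"demo-key"))
```

Replacing a token changes both its own green/red status and the context used to score the next token, so each edit can remove evidence from up to two scored positions under this context-dependent split.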

Example text: "model explains the answer with care and detail"

Detector view

Too many green choices

The keyed verifier sees a statistically surprising surplus of green tokens, while a keyless outsider may also see distributional drift if the watermark modifies probabilities too aggressively.

Green-token share: 5 / 8 green. Evidence accumulates with length.

1. Split vocabulary
2. Softly bias sampling
3. Count evidence
4. Route by edit regime

Mathematical build-up

Theorem 1: sequence evidence accumulates from token evidence

Start with the ordinary sampler and the watermarked sampler over the generated sequence.

$$P^s(y_{1:T}\mid x)=\prod_{t=1}^{T}p_t(y_t),\qquad Q(y_{1:T}\mid x)=\prod_{t=1}^{T}q_t(y_t)$$

At one step, the watermark contributes the KL gap between the two next-token distributions.

$$D_t=\mathrm{KL}(q_t\|p_t)=\sum_{v}q_t(v)\log\frac{q_t(v)}{p_t(v)}$$

Because generation is token-by-token, the chain rule for KL divergence lets these small per-step contributions (taken in expectation over prefixes drawn from \(Q\)) add across the sequence.

$$D_{\mathrm{seq}}=\mathrm{KL}(Q\|P^s)=\sum_{t=1}^{T}D_t$$

Keyless detectability is then bounded by the accumulated information.

$$\mathrm{TV}(P^s,Q)\le \sqrt{\frac{D_{\mathrm{seq}}}{2}}$$

Plugging in the per-family one-step terms gives the qualitative frontier.

$$\mathrm{TV}_{\mathrm{bias}}=O(|\delta|\sqrt{T}),\qquad \mathrm{TV}_{\mathrm{sem}}=O(\sqrt{T/\ell}),\qquad \mathrm{TV}_{\mathrm{prf}}=0$$

Interpretation. A single small nudge may be invisible, but the summed KL budget \(D_{\mathrm{seq}}\) grows with length, and the total-variation bound translates that budget into possible keyless detectability.
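A small numerical sketch of this accumulation, under simplifying assumptions that are not from the paper: every step uses the same four-token baseline `p`, the watermark is an exponential tilt of strength `delta` on a fixed green set, and KL is measured in nats so that the bound \(\sqrt{D_{\mathrm{seq}}/2}\) applies directly.

```python
import math


def kl(q, p):
    """KL(q || p) in nats for two discrete distributions on the same support."""
    return sum(qv * math.log(qv / pv) for qv, pv in zip(q, p) if qv > 0)


def tilt_green(p, green, delta):
    """Multiply green-token probabilities by exp(delta) and renormalize."""
    w = [pv * math.exp(delta) if v in green else pv for v, pv in enumerate(p)]
    z = sum(w)
    return [x / z for x in w]


def pinsker_bound(d_seq):
    """Upper bound on total variation from the sequence KL, capped at 1."""
    return min(1.0, math.sqrt(d_seq / 2))


p = [0.25, 0.25, 0.25, 0.25]          # toy baseline next-token distribution
q = tilt_green(p, green={0, 1}, delta=0.2)
d_t = kl(q, p)                        # one-step information contribution

for T in (10, 100, 1000):
    d_seq = T * d_t                   # identical steps, so the sum is T * d_t
    print(f"T={T:5d}  D_seq={d_seq:.4f}  TV bound={pinsker_bound(d_seq):.4f}")
```

With identical steps the bound grows like \(\sqrt{T}\) until it saturates at 1, which is the qualitative behavior of the biased-family term above.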

This intuition follows the green-token explanation popularized by KGW-style LLM watermarks; see the Arize research reading summary for a reader-friendly treatment.

Information Budget Under Editing

Detectability

KL controls separability

The accumulated divergence between watermarked and unwatermarked text bounds what keyless tests can distinguish, while also governing verifier-side hypothesis testing power.

Robustness

Editing shrinks evidence

For token-level schemes, usable information contracts with the edit rate. For semantic schemes, the relevant contraction depends on the induced semantic flip rate.

Selection

No single family dominates

The best family depends on the expected editing regime, the stealth cap, and the amount of post-edit verification power required at deployment.

Usable information after edits, for token-level schemes under the paper's edit model:
$$D_{\varepsilon} \approx (1-\varepsilon)^2 D_0$$

Mathematical build-up

Theorem 2: editing contracts the information budget

Write the watermark as a small perturbation of the baseline token distribution.

$$q=p+r,\qquad D_0\approx \frac{1}{2}\sum_v\frac{r(v)^2}{p(v)}$$

Model editing as replacing an \(\varepsilon\) fraction of tokens by an edit distribution \(R\).

$$p_{\varepsilon}=(1-\varepsilon)p+\varepsilon R,\qquad q_{\varepsilon}=(1-\varepsilon)q+\varepsilon R$$

The edit distribution cancels in the difference, leaving a weaker watermark perturbation.

$$q_{\varepsilon}-p_{\varepsilon}=(1-\varepsilon)(q-p)=(1-\varepsilon)r$$

Since local KL is quadratic in the perturbation size, the retained information scales with the square of the surviving fraction \(1-\varepsilon\).

$$D_{\varepsilon}\approx \frac{1}{2}\sum_v\frac{\big((1-\varepsilon)r(v)\big)^2}{p(v)}=(1-\varepsilon)^2D_0$$

Across \(T\) tokens, the post-edit token budget is therefore

$$C_{\mathrm{tok}}(\varepsilon)\approx T(1-\varepsilon)^2D_0$$

Verification succeeds only while that remaining budget clears the target error threshold.

$$C_{\mathrm{tok}}(\varepsilon)\gtrsim \log_2(1/\beta)\quad\Longrightarrow\quad \varepsilon_{\beta}^{\mathrm{tok}}\approx 1-\sqrt{\frac{\log_2(1/\beta)}{TD_0}}$$
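As a worked illustration of the threshold above, the following sketch evaluates the post-edit budget and the tolerable edit rate. The function names are hypothetical, and the sketch assumes \(D_0\) is expressed in the same units as the \(\log_2(1/\beta)\) threshold.

```python
import math


def token_budget(T: int, d0: float, eps: float) -> float:
    """Post-edit token budget C_tok(eps) = T * (1 - eps)^2 * D_0."""
    return T * (1 - eps) ** 2 * d0


def critical_edit_rate(T: int, d0: float, beta: float) -> float:
    """Largest edit rate at which the budget still clears log2(1/beta)."""
    ratio = math.log2(1 / beta) / (T * d0)
    if ratio >= 1:
        return 0.0                    # even unedited text does not clear the threshold
    return 1 - math.sqrt(ratio)


if __name__ == "__main__":
    T, d0, beta = 200, 0.1, 2 ** -5   # toy values chosen for illustration
    eps_star = critical_edit_rate(T, d0, beta)
    print("critical edit rate:", eps_star)
    print("budget at that rate:", token_budget(T, d0, eps_star))
```

Doubling the length \(T\) or the per-token budget \(D_0\) shrinks the square-root term by a factor of \(\sqrt{2}\), so the tolerable edit rate rises with length but with diminishing returns.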

For semantic schemes, replace token survival by the semantic flip rate.

$$C_{\mathrm{sem}}(\varepsilon)\approx T_s\big(1-2\varepsilon_s(\varepsilon)\big)^2D_0^{(\mathrm{sem})}$$

Interpretation. Edits first shrink the watermark perturbation by \(1-\varepsilon\); KL then squares that remaining perturbation, producing the \((1-\varepsilon)^2\) contraction.
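The contraction can be checked numerically on a toy example. The sketch below is not the paper's experiment: the baseline `p`, the perturbation `r`, and the choice of edit distribution are illustrative assumptions. It takes \(R=p\) so that the denominator in the local expansion stays \(p(v)\); for other choices of \(R\) the match with \((1-\varepsilon)^2\) would only be approximate.

```python
import math


def kl(q, p):
    """KL(q || p) in nats for two discrete distributions on the same support."""
    return sum(qv * math.log(qv / pv) for qv, pv in zip(q, p) if qv > 0)


def mix(dist, edit, eps):
    """Replace an eps fraction of the mass by the edit distribution."""
    return [(1 - eps) * d + eps * e for d, e in zip(dist, edit)]


p = [0.4, 0.3, 0.2, 0.1]                  # toy baseline token distribution
r = [0.01, -0.01, 0.005, -0.005]          # small perturbation that sums to zero
q = [pv + rv for pv, rv in zip(p, r)]     # watermarked distribution q = p + r
R = p                                     # assumption: edits resample from the baseline

d0 = kl(q, p)
ratios = {}
for eps in (0.1, 0.3, 0.5):
    d_eps = kl(mix(q, R, eps), mix(p, R, eps))
    ratios[eps] = d_eps / d0
    print(f"eps={eps:.1f}  D_eps/D_0={ratios[eps]:.4f}  (1-eps)^2={(1 - eps) ** 2:.4f}")
```

The exact ratio and \((1-\varepsilon)^2\) agree closely here because the perturbation is small relative to every \(p(v)\), which is the regime where the quadratic approximation of KL holds.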

Selector Walkthrough: One Answer, Three Deployment Routes

Imagine an LFQA answer generated for public release. Before embedding the watermark, the system estimates how the answer will be edited downstream and routes it to the family that best matches that edit regime.

Expected route

Verbatim archive release

The answer is stored or served nearly unchanged, so the selector can use a distribution-preserving watermark whose evidence depends on preserving alignment.

token edit rate ε ≈ 0
semantic flip rate εs ≈ 0

Selected watermark

CGW

Distribution-preserving watermark

Selector score: 1.990
AUC: 0.99
Keyless z: -5.80

Candidates by route, listed as selector score (AUC, keyless z):

Token-based

ε ≈ 0.25, εs ≈ 0.06

HCW: 1.137 (0.910, 3.40)
DiPMark: 1.104 (0.90, 3.90)
HeavyWater: 1.072 (0.880, 4.20)

Semantic-based

ε ≈ 0.42, εs ≈ 0.10

PMark: 1.305 (0.850, 1.20)
SemStamp: 1.220 (0.820, 1.50)
SimMark: 1.170 (0.800, 1.70)

Distribution-preserving

ε ≈ 0, εs ≈ 0

CGW: 1.990 (0.99, -5.80)

This is the low-edit corner of the diagram: distribution preservation gives the strongest stealth signal when the downstream channel is expected to leave the text nearly intact.
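The routing step can be sketched with the numbers shown in the walkthrough. The scores, AUCs, keyless z values, and regime estimates are copied from the lists above; the routing rule itself (pick the listed regime nearest to the estimated \((\varepsilon, \varepsilon_s)\), then the highest selector score within it) is a hypothetical stand-in, since this page does not state how the selector score is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float      # selector score from the walkthrough
    auc: float
    keyless_z: float


# Regime centre (eps, eps_s) and candidates for each route, as listed above.
ROUTES = {
    "Token-based": ((0.25, 0.06), [
        Candidate("HCW", 1.137, 0.910, 3.40),
        Candidate("DiPMark", 1.104, 0.90, 3.90),
        Candidate("HeavyWater", 1.072, 0.880, 4.20),
    ]),
    "Semantic-based": ((0.42, 0.10), [
        Candidate("PMark", 1.305, 0.850, 1.20),
        Candidate("SemStamp", 1.220, 0.820, 1.50),
        Candidate("SimMark", 1.170, 0.800, 1.70),
    ]),
    "Distribution-preserving": ((0.0, 0.0), [
        Candidate("CGW", 1.990, 0.99, -5.80),
    ]),
}


def select(eps: float, eps_s: float):
    """Hypothetical rule: nearest listed regime, then highest selector score."""
    def distance(route):
        (c_eps, c_eps_s), _ = ROUTES[route]
        return (c_eps - eps) ** 2 + (c_eps_s - eps_s) ** 2

    route = min(ROUTES, key=distance)
    best = max(ROUTES[route][1], key=lambda c: c.score)
    return route, best


if __name__ == "__main__":
    for eps, eps_s in [(0.0, 0.0), (0.25, 0.06), (0.40, 0.10)]:
        route, best = select(eps, eps_s)
        print(f"eps={eps:.2f}, eps_s={eps_s:.2f} -> {route}: {best.name}")
```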

Empirical Validation

Experiments on Llama-2-7B and Mistral-7B evaluate clean detection and post-edit robustness under moderate Dipper rewriting and stronger summarization-style paraphrasing. The results support the theoretical prediction that watermark evidence weakens as edits remove or flip the signal.

Across evaluated regimes, the hybrid method tracks the strongest available family more closely than any fixed family while preserving a lower-detectability operating point than aggressively biased token-level methods.

2 base LLMs
14 watermark methods
3 main conditions
Edit-rate robustness frontier for watermarking schemes
Post-edit verification weakens as the edit rate increases, exposing the high-edit boundary.

Reproducibility

The code package evaluates the implemented watermark families on LFQA prompts with Llama-2-7B and Mistral-7B. A local backend is included for environment checks before full GPU reproduction.

python -m catch22.pipeline \
  --config configs/llama2_lfqa.yaml \
  --reproduction-suite \
  --num-samples 2 \
  --local-backend \
  --resume

BibTeX

@inproceedings{catch22watermarking2026,
  title = {Catch-22: On the Fundamental Tradeoff Between Detectability and Robustness in LLM Watermarking},
  author = {Pratihar, Kuheli and Mukhopadhyay, Debdeep},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year = {2026},
  url = {https://icml.cc/virtual/2026/poster/66807}
}