KlusAI Technical Report · June 1, 2026

EuroPriv-Bench: A Unified Pan-European De-identification Benchmark with Re-identification Risk Metrics

KlusAI Research

KlusAI

Working paper · preliminary results (n = 1,500 docs/config) · not peer-reviewed

Abstract

Privacy-focused NLP for European languages is served by fragmented resources: the Text Anonymization Benchmark provides privacy–utility metrics but is English- and legal-only; AI4Privacy offers cross-lingual European detection data without re-identification metrics; MAPA covers 24 EU languages and both legal and clinical text but as a detection toolkit, not a comparative leaderboard; and MultiGraSCCo is multilingual but clinical-only and translation-based. RAT-Bench, concurrent and independent work, contributes a hosted re-identification-risk benchmark but is built on U.S. demographics (English/Spanish/Chinese), with no legal text and no GDPR-aligned taxonomy; recent PII models such as GLiNER2-PII ship strong systems with no standardized European evaluation, and a 2025 survey of the field flags exactly this missing standardized multilingual benchmark. We introduce EuroPriv-Bench, to our knowledge the first unified, openly-licensed leaderboard for European cross-lingual de-identification. In one reproducible suite it combines (a) European cross-lingual breadth, (b) both legal and clinical text, (c) one harmonized GDPR-aligned entity taxonomy, and (d) a re-identification-risk metric alongside detection F1. We build on the prior art rather than replacing it, re-using its label schemes through a documented crosswalk. Evaluating four public systems on realistic Romanian documents, we find that detection F1 does not track national-identifier (CNP) protection: the best detector is not the best protector. OpenAI privacy-filter, the weakest detector (F1 0.36), leaks only 1.4% of CNPs (95% CI 0.9–2.3) because it labels 96% of them as account numbers and redacts them regardless of type, whereas the three detectors that type CNPs all leak 26–35%; privacy-filter's Wilson interval lies entirely below, and does not overlap, any of the three.
GLiNER, the most accurate at F1 0.85, leaks 30.2%; tabularisai, despite a lower F1 (0.75), leaks the most, 35.4% (32.6–38.2). Coverage-based redaction and type-accurate detection are different objectives — F1 measures the latter, protection needs the former — so detection F1 is an unsafe proxy for re-identification protection, at least for national identifiers. We release the benchmark, harness, configs, and data so the gap is measurable.

Introduction

Models advertising 94–97% F1 on English PII benchmarks tell us little about how they behave on Dutch clinical notes or Romanian court decisions under a European privacy taxonomy. Yet the public de-identification literature is fragmented along the axes that matter for EU deployment. The Text Anonymization Benchmark (TAB) [1] introduced privacy–utility metrics, but only for English ECHR legal text. AI4Privacy [2] provides cross-lingual European detection data, but scores detection F1 only. MAPA [3] covers all 24 official EU languages across legal and clinical text, but ships as a detection toolkit, not a comparative leaderboard. MultiGraSCCo [4] is multilingual and GDPR-aware but clinical-only and produced by machine translation. Concurrent, independent work — RAT-Bench [11] — contributes a hosted re-identification-risk benchmark, but is built on U.S. demographics (English/Spanish/Chinese), with no legal text and no GDPR-aligned taxonomy. Recent PII models such as GLiNER2-PII [12] ship strong systems with no standardized European evaluation, and a 2025 survey of the field [17] flags exactly this missing standardized multilingual benchmark. And the privacy-filter model lineage — OpenAI’s privacy-filter [6] and OpenMed’s multilingual finetune [7] — ships capable systems for which we found no standardized public privacy-risk evaluation as of June 2026.

EuroPriv-Bench is, to our knowledge, the first unified, openly-licensed leaderboard for European cross-lingual legal and clinical de-identification with a harmonized GDPR-aligned entity taxonomy and a re-identification-risk metric. No single prior artifact unifies (a) European cross-lingual breadth, (b) both legal and clinical text, (c) one harmonized GDPR-aligned taxonomy, and (d) a re-identification-risk metric, in a reproducible, openly-licensed leaderboard. Our claim is explicitly “first unified”, not “first”: we re-use and subsume the prior art (§6). We contribute a harmonized taxonomy with a documented crosswalk to six external schemes (§2); a cleanly-licensed, reproducible benchmark over eight published European languages (ro, en, pl, de, fr, es, it, nl — seven with general-text tracks, scaling toward EU-24), with decode-bearing real-skeleton re-identification tracks in three languages (RO CNP, PL PESEL, IT codice fiscale) and a Romanian legal-domain real-skeleton track, in both synthetic and realistic-document form (all config_status = dev, pending validation; §3); and a national-identifier re-identification-risk metric that exposes a dissociation between detection accuracy and privacy protection (§4–5).

The KP Taxonomy

Every model speaks a different label dialect: OpenAI’s privacy-filter has 8 coarse types, AI4Privacy ~98, HIPAA 18, MAPA a legal/medical set, OpenMed 54, and tabularisai 42. Before scoring, we define one GDPR-aligned KP (KlusAI Privacy) taxonomy and a crosswalk mapping each external scheme’s labels onto it (published in full in the code repository). This is standardization, not invention; the contribution is reconciliation.

The crosswalk is validated to be a function — each native label maps to exactly one KP type — which surfaced real modelling ambiguities (e.g. HIPAA names is claimed by both a person and a care-provider sense; the general type wins, and refinements do not claim the source label). National identifiers (passport, driving-licence, social-insurance, and the Romanian CNP) form a dedicated NATIONAL_ID type rather than collapsing into a generic account bucket, because they carry distinct legal safeguards and, for the CNP, deterministic leakage (§4). Spans use BIOES tagging; combined with the crosswalk this lets each model’s native output be scored in the shared label space — including a head-to-head with privacy-filter’s own BIOES output. Every model is scored only on the entity types a given config’s gold annotates, so a system is never penalized for detecting categories that config does not cover.
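
The "crosswalk is a function" check and the per-config type restriction can be sketched as follows. This is a minimal illustration under stated assumptions, not the published harness: the row layout, the function names, and the example labels (including `PERSON`) are hypothetical; only `NATIONAL_ID` and `ACCOUNT_ID` are type names that appear in this report.

```python
from collections import defaultdict

# Hypothetical crosswalk rows: (source scheme, native label, KP type).
# These rows are illustrative and are NOT the published crosswalk.
CROSSWALK_ROWS = [
    ("hipaa", "names", "PERSON"),
    ("privacy-filter", "account_number", "ACCOUNT_ID"),
    ("openmed", "national_id", "NATIONAL_ID"),
]


def build_crosswalk(rows):
    """Return {(scheme, native_label): kp_type}.

    Raises ValueError if any native label is claimed by more than one
    KP type, i.e. if the mapping is not a function.
    """
    targets = defaultdict(set)
    for scheme, native, kp_type in rows:
        targets[(scheme, native)].add(kp_type)
    conflicts = {key: sorted(v) for key, v in targets.items() if len(v) > 1}
    if conflicts:
        raise ValueError(f"crosswalk is not a function: {conflicts}")
    return {key: next(iter(v)) for key, v in targets.items()}


def restrict_to_gold_types(spans, gold_types):
    """Keep only (start, end, kp_type) spans whose type this config's gold annotates."""
    return [span for span in spans if span[2] in gold_types]
```

A conflicting row such as `("hipaa", "names", "CARE_PROVIDER")` would make `build_crosswalk` raise, which is the kind of ambiguity the validation is described as surfacing.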

Benchmark Construction

Cleanly-licensed sources only. EuroPriv-Bench v0 is built from the CC-BY AI4Privacy open core, KlusAI-authored Romanian document structure, and KlusAI-generated synthetic identifiers, so the entire suite is openly redistributable. Six general-text language configs (en, fr, es, de, it, nl) are curated from AI4Privacy and remapped to the KP taxonomy.

The Romanian track. Romanian is absent from AI4Privacy and is a strong test of locale-specific identifiers (CNP, RO IBAN/CUI, county-coded formats) that English-primary models have never seen. We release two Romanian configs. ro-synthetic-v1 is a development track of template-generated documents. ro-realskeleton-v1 is the contamination-free realistic-context track (still config_status = dev, pending the validation noted in §5). Its documents reproduce the structure of real Romanian official document types and are drawn from two independent template families: an official-correspondence family (clinical / legal / administrative: e.g. a CNAS discharge letter, a services contract, a sworn declaration, an administrative letter) and an academic-registry family (higher-education student records). Both families are populated with procedurally-generated identifiers: valid-checksum CNPs with consistent dates of birth, RO IBAN/CUI/CI, and county addresses. The two families share no skeleton text (5-gram Jaccard 0.0000), so the dissociation can be checked on each independently (§5). The skeletons are original KlusAI-authored documents that imitate the functional layout of these document types (headings, field order, boilerplate) without copying any source text; for genuinely official texts, Law 8/1996 art. 9(b) additionally places them outside copyright. They are released under the suite’s open license. No identifier is derived from a real data subject; all are procedurally generated.
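
The "share no skeleton text" check can be illustrated with a small n-gram overlap function. The report does not say whether its 5-grams are taken over words or characters, or how text is normalized, so the whitespace tokenization and lower-casing below are assumptions, and the function names are hypothetical.

```python
def word_ngrams(text, n=5):
    """Set of word n-grams after lower-casing and whitespace tokenization (assumed)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(text_a, text_b, n=5):
    """Jaccard similarity |A & B| / |A | B| of the two n-gram sets."""
    a, b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    if not a and not b:
        return 0.0  # convention for texts shorter than n tokens
    return len(a & b) / len(a | b)
```

A value of 0.0 means that no run of five consecutive tokens occurs in both skeleton families.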

Polish, Italian, and legal-domain real-skeleton tracks. The same real-skeleton construction extends the decode-bearing channel to two further national-ID schemes — pl-realskeleton-v1 (Polish PESEL) and it-realskeleton-v1 (Italian codice fiscale) — and to a Romanian legal-domain real-skeleton track (legal-realskeleton-v1, structure-only). All are KlusAI-authored skeletons populated with valid-checksum, procedurally-generated identifiers, carry config_status = dev (pending the validation noted in §5), and contain no real data subjects. Together with the Romanian track this gives three decode-bearing real-skeleton re-identification tracks (RO CNP, PL PESEL, IT codice fiscale) across the suite’s eight published languages.

Provenance. Every result row records the harness version, taxonomy version, dataset config and split, model id, and timestamp, so any number traces to an exact configuration. Synthetic training data is kept strictly separate from gold; generation is offset-deterministic (each identifier is spliced into a template slot and its character span recorded by construction, then re-validated).
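
A minimal sketch of offset-deterministic generation, assuming a single named slot per template: it builds a checksum-valid CNP, splices it into the slot, and records the character span by construction. The control-digit rule (weights 279146358279, sum modulo 11, remainder 10 mapped to 1) is the standard public CNP algorithm; the function names, the slot syntax, and the template are hypothetical and are not taken from the harness.

```python
import datetime
import random

CNP_WEIGHTS = (2, 7, 9, 1, 4, 6, 3, 5, 8, 2, 7, 9)


def cnp_check_digit(first12: str) -> int:
    """Control digit: weighted sum of the first 12 digits mod 11 (10 maps to 1)."""
    remainder = sum(int(d) * w for d, w in zip(first12, CNP_WEIGHTS)) % 11
    return 1 if remainder == 10 else remainder


def make_cnp(rng: random.Random, female: bool, birth: datetime.date, county: int) -> str:
    """Synthetic CNP whose sex/century digit and date digits agree with `birth`."""
    if not 1900 <= birth.year <= 2099:
        raise ValueError("this sketch only covers births in 1900-2099")
    base = 1 if birth.year < 2000 else 5        # 1/2 for 1900-1999, 5/6 for 2000-2099
    first = base + (1 if female else 0)
    serial = rng.randint(1, 999)
    first12 = f"{first}{birth:%y%m%d}{county:02d}{serial:03d}"
    return first12 + str(cnp_check_digit(first12))


def splice(template: str, slot: str, value: str):
    """Insert `value` at the first `slot`; return the text and its (start, end) span."""
    start = template.index(slot)
    text = template.replace(slot, value, 1)
    end = start + len(value)
    assert text[start:end] == value  # re-validate the recorded offsets
    return text, (start, end)
```

For example, `splice("CNP: {CNP}.", "{CNP}", make_cnp(random.Random(0), True, datetime.date(1985, 3, 7), 40))` returns the filled text together with the span `(5, 18)`.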

Metrics

Detection. Strict entity-level precision/recall/F1 (exact span and type match), plus a recall-weighted F2, since in de-identification a false negative (PII left in) is far costlier than a false positive (something harmless redacted).
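
As a sketch of strict entity-level scoring, the function below counts a prediction as correct only when document, span boundaries, and type all match, then computes F-beta (beta = 1 for F1, beta = 2 for the recall-weighted F2). The tuple layout and function name are assumptions for illustration.

```python
def strict_prf(gold, pred, beta=1.0):
    """Strict precision, recall and F-beta over (doc_id, start, end, type) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact span AND type match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta
```

With four gold entities and three predictions of which two match exactly, precision is 2/3 and recall 1/2, giving F1 = 4/7 ≈ 0.571 and F2 = 10/19 ≈ 0.526. F2 falls below F1 exactly when recall is lower than precision.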

Re-identification leakage. Our headline metric. The Romanian CNP is not an opaque string: its first digit encodes sex and birth-century, digits 2–7 the date of birth, and digits 8–9 the county of registration. A single un-redacted CNP therefore discloses at least three quasi-identifiers at once.¹ We decode the structure directly and report, over all gold CNPs, the fraction left unflagged (leak_rate) and the total quasi-identifiers thereby exposed. A CNP counts as protected iff the model flags at least one token overlapping its span as PII of any type — it would be redacted regardless of the predicted label; if every token of the CNP is predicted O, it is a leak. This is coverage, not labels — which is exactly the property detection F1 does not measure.

The any-overlap rule is deliberately conservative in what it counts as a leak: a model that flags even one token of a CNP is credited with protection, so leak_rate is a lower bound on real-world leakage, because a redaction pipeline keying on exact spans or types could still expose digits. (privacy-filter, below, flags full CNP spans, so it is unaffected by this caveat.)
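
The decoding and the any-overlap rule can be sketched as below. Two simplifications are assumed: overlap is tested on character spans rather than tokens, and only first digits 1 to 6 are decoded. The function names and input layout are hypothetical, not the harness API.

```python
from datetime import date

# First digit -> century of birth (odd = male, even = female) for codes 1-6.
CENTURY_BY_FIRST_DIGIT = {1: 1900, 2: 1900, 3: 1800, 4: 1800, 5: 2000, 6: 2000}


def decode_cnp(cnp: str) -> dict:
    """Recover the three quasi-identifiers carried by a 13-digit CNP."""
    first = int(cnp[0])
    if first not in CENTURY_BY_FIRST_DIGIT:
        raise ValueError("first digit outside 1-6 is not handled in this sketch")
    year = CENTURY_BY_FIRST_DIGIT[first] + int(cnp[1:3])
    return {
        "sex": "M" if first % 2 == 1 else "F",
        "birth_date": date(year, int(cnp[3:5]), int(cnp[5:7])),
        "county_code": cnp[7:9],
    }


def is_protected(cnp_span, flagged_spans):
    """True if any flagged span overlaps the CNP span, whatever its label."""
    start, end = cnp_span
    return any(fs < end and start < fe for fs, fe in flagged_spans)


def leak_rate(docs):
    """docs: iterable of (gold_cnp_spans, flagged_spans), one pair per document.

    Returns (leak_rate, quasi_ids_leaked), counting three quasi-identifiers
    (sex, date of birth, county) per missed CNP.
    """
    total = leaked = 0
    for cnp_spans, flagged in docs:
        for span in cnp_spans:
            total += 1
            leaked += not is_protected(span, flagged)
    return (leaked / total if total else 0.0), 3 * leaked
```

Because `is_protected` ignores the predicted label, a CNP flagged as an account number counts as protected here even though it is an error under strict F1.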

Baselines and Results

We evaluate four public systems: OpenAI privacy-filter [6], OpenMed privacy-filter-multilingual [7], tabularisai/eu-pii-safeguard [8], and zero-shot GLiNER gliner_multi_pii-v1 [9]. All numbers are entity-F1 at n = 1,500 docs per configuration, taxonomy v0.2.0; the full F1/F2 table is on the live leaderboard.

Config	privacy-filter	OpenMed	tabularisai	GLiNER
English (general)	0.41	0.60	0.51	0.50
French (general)	0.46	0.61	0.59	0.56
German (general)	0.50	0.61	0.63	0.57
Italian (general)	0.45	0.55	0.58	0.54
Spanish (general)	0.47	0.59	0.58	0.55
Dutch (general)	0.47	0.63	0.63	0.57
Romanian (synthetic)	0.58	0.74	0.88	0.81
Romanian (real-skeleton)	0.36	0.58	0.75	0.85

Table 1. Entity-level detection F1 by configuration (n = 1,500 docs/config; taxonomy v0.2.0). OpenMed and tabularisai are effectively level on the general-text average (0.598 vs 0.589; we report the 0.009 gap without a confidence interval and so make no significance claim); the Romanian tracks are led by tabularisai (synthetic) and GLiNER (real-skeleton). The general-text ranking is confounded because it mixes in- and out-of-distribution systems: OpenMed and tabularisai were trained on AI4Privacy (this gold's source), whereas GLiNER and privacy-filter were not. The Romanian real-skeleton track, which no baseline has seen, is the fair comparison.

A model claiming 96–97% F1 on English PII drops to 0.41–0.63 across general European text under a GDPR-aligned taxonomy, with recall the weak point throughout: recall-weighted F2 is lower than F1 in every cell (English, for instance: privacy-filter 0.41→0.35, OpenMed 0.60→0.57, tabularisai 0.51→0.46, GLiNER 0.50→0.45). No system dominates — OpenMed and tabularisai are level on the general-text average, tabularisai and GLiNER lead the Romanian tracks — and the gap between synthetic and realistic Romanian context is stark (tabularisai 0.88→0.75; privacy-filter 0.58→0.36).

The dissociation. On ro-realskeleton-v1 (1,500 documents, 1,123 gold CNPs), detection accuracy does not predict protection: the best detector is not the best protector. The per-model contrast is the evidence, and it is statistically significant in that the Wilson 95% interval on privacy-filter's leak-rate is separated from those of the other three systems (Table 2). With only four systems we do not lean on a correlation coefficient: the rank order happens to run positive (Spearman ρ = +0.80), but over four points that is a descriptive observation, not an estimate, and it is not statistically significant (exact permutation p = 0.33). We therefore read the result as “F1 does not track CNP protection,” not as a monotonic law, and explain why below.
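
The rank correlation and its exact permutation p-value over four systems are small enough to enumerate by hand. The sketch below assumes a two-sided test on |ρ| over all 24 orderings and no tied values; the inputs are the Detection F1 and CNP leak-rate columns of Table 2.

```python
from itertools import permutations


def ranks(values):
    """1-based ranks; assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def spearman(x, y):
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def exact_permutation_p(x, y):
    """Two-sided exact p: share of orderings of y with |rho| >= observed |rho|."""
    observed = abs(spearman(x, y))
    perms = list(permutations(y))
    hits = sum(abs(spearman(x, p)) >= observed - 1e-12 for p in perms)
    return hits / len(perms)


# Order: privacy-filter, OpenMed, GLiNER, tabularisai (values from Table 2).
f1 = [0.36, 0.58, 0.85, 0.75]
leak = [1.4, 26.4, 30.2, 35.4]
rho = spearman(f1, leak)              # 0.8
p = exact_permutation_p(f1, leak)     # 8/24 = 0.333...
```

Only 8 of the 24 orderings reach |ρ| ≥ 0.80, which is why four points cannot support a significance claim at the 5% level.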

Model	Detection F1	CNP leak-rate (95% CI)	Quasi-IDs leaked
OpenAI privacy-filter	0.36	1.4% (0.9–2.3)	48
OpenMed	0.58	26.4% (23.9–29.0)	888
GLiNER	0.85	30.2% (27.6–32.9)	1,017
tabularisai	0.75	35.4% (32.6–38.2)	1,191

Table 2. Detection F1 vs CNP re-identification leakage on ro-realskeleton-v1 (1,123 gold CNPs; Detection F1 is the contamination-free real-skeleton F1 from Table 1, which no baseline was trained on; Wilson 95% confidence intervals on leak-rate). "Quasi-IDs leaked" is a deterministic exposure tally — exactly 3 × missed CNPs, since each un-redacted CNP discloses sex, date of birth, and county — not an inferential estimate.
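
The intervals in Table 2 can be re-derived with the standard Wilson score formula. In this sketch the per-model miss counts are obtained as "Quasi-IDs leaked" divided by 3, per the caption; the function name and z = 1.96 are assumptions.

```python
import math


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


N_CNPS = 1123
# Missed CNPs per model = "Quasi-IDs leaked" / 3 from Table 2.
MISSED = {"privacy-filter": 16, "OpenMed": 296, "GLiNER": 339, "tabularisai": 397}

for model, k in MISSED.items():
    lo, hi = wilson_interval(k, N_CNPS)
    print(f"{model}: {100 * k / N_CNPS:.1f}% ({100 * lo:.1f}-{100 * hi:.1f})")
```

The upper bound for privacy-filter (2.3%) sits far below the lowest lower bound among the other three (23.9% for OpenMed), which is the separation the text relies on.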

Figure 1 plots the same dissociation and adds two systems beyond the four-baseline study of Table 2: KlusAI’s reference de-identifier kp-deid and Presidio, the first external leaderboard submission (both development-track, pending the validation noted below). Both flag every CNP and join privacy-filter at ≈0% leak, off the detection-F1/protection frontier.

Figure 1. Detection–protection dissociation on ro-realskeleton-v1 (n = 1,123 distinct CNP subjects). Horizontal axis: detection entity-F1; vertical axis: CNP re-identification leak-rate (the headline privacy metric), with Wilson 95% confidence-interval error bars. Re-identification protection does not track detection accuracy: GLiNER, the highest-F1 detector (0.85), still leaks 30.2% of CNPs (tabularisai, at F1 0.75, leaks the most overall at 35.4%, and OpenMed leaks 26.4%), while three systems sit at the bottom of the plot at ≈0% leak, all off the detection-F1/protection frontier: the blanket redactor privacy-filter (F1 0.36, leak 1.4%) and the two systems that flag every CNP, kp-deid (F1 0.74, 0% leak) and Presidio (F1 0.47, 0% leak). Scope: development track (config_status = dev), contamination-controlled (clean held-out skeletons no baseline was trained on); the Romanian real-skeleton numbers are pending native-speaker review and inter-annotator-agreement validation and are not yet citable.

The strongest detector on this track, GLiNER (F1 0.85), leaks 30.2% of CNPs; tabularisai leaks the most (35.4%) at high precision; while the weakest detector, privacy-filter (F1 0.36), leaks the least (1.4%). Its low leak-rate is earned, not accidental: of the 1,123 CNPs, privacy-filter flags 1,107, labelling ~96% as account numbers and ~3% as phone numbers² — and a flagged span is redacted regardless of type.

This is the mechanism behind the dissociation, and it is specific. The leak metric rewards coverage (any-overlap redaction) while F1 rewards exact span and type, so a blanket redactor like privacy-filter maximizes protection while scoring worst on typed F1. The effect is carried by that one model: among the three systems that actually type CNPs (OpenMed, GLiNER, tabularisai), leak-rate is flat-to-rising in F1 and all leak 26.4–35.4% — no dissociation among them. The finding is therefore not “better detectors leak more”; it is that coverage-based redaction and type-accurate detection are different objectives, and detection F1 measures only the latter. (GLiNER is zero-shot, so its F1 depends on the label prompt — a confound for any cross-system F1 comparison, and a further reason we rest the claim on the per-model leak-rate intervals rather than on F1 rankings.) The one clean statistical separation is privacy-filter’s: its Wilson 95% interval (0.9–2.3%) does not overlap any other model’s. The three type-accurate detectors are not all mutually separable — OpenMed–GLiNER and GLiNER–tabularisai overlap (though OpenMed and tabularisai do not), so they form a connected chain through GLiNER rather than a clean ordering — so among them we make no graded significance claim. The sharp, significant contrast is the blanket redactor versus everything else.

Item-paired significance. The Wilson intervals above are unpaired; we additionally test the dissociation at the subject level, where each of the 1,123 distinct CNP subjects is scored as protected-or-leaked by both systems and the discordant pairs drive an exact (two-sided binomial) McNemar test. Against the F1 leader GLiNER, our reference de-identifier kp-deid (F1 0.74, 0% CNP leak) protects 339 CNP subjects that GLiNER leaks while GLiNER protects none that kp-deid leaks (b = 339, c = 0; p ≈ 1.8 × 10⁻¹⁰²) — detection-F1 and re-identification protection are statistically dissociated on item-paired data, not just on aggregate intervals. The contrast against the leaking protector privacy-filter is also significant in the same direction (b = 16, c = 0; p ≈ 3.05 × 10⁻⁵). We report the third comparison honestly as a tie: kp-deid and Presidio both flag every CNP, so their per-subject protection is identical (b = 0, c = 0; p = 1, not significant) — the test confirms no detectable difference where the leak-rates are equal (both 0%), which is the correct null result rather than evidence for either system. These tests share the per-subject CNP unit of the leak-rate metric and inherit its scope: ro-realskeleton-v1, development track, pending the native-speaker and inter-annotator-agreement validation noted above.
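
The exact McNemar p-values quoted above follow from a two-sided binomial test on the discordant pairs alone. The sketch below is one standard way to compute it, using exact integer arithmetic so that very small p-values do not underflow early; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant counts b and c.

    Under the null each discordant pair falls either way with probability 1/2,
    so p = 2 * P(X <= min(b, c)) for X ~ Binomial(b + c, 1/2), capped at 1.
    """
    n = b + c
    if n == 0:
        return 1.0                      # no discordant pairs: no evidence either way
    tail = sum(comb(n, i) for i in range(min(b, c) + 1))
    return float(min(Fraction(2 * tail, 2 ** n), Fraction(1)))


p_gliner = mcnemar_exact(339, 0)        # kp-deid vs GLiNER, equals 2**-338
p_filter = mcnemar_exact(16, 0)         # kp-deid vs privacy-filter, equals 2**-15
p_presidio = mcnemar_exact(0, 0)        # kp-deid vs Presidio
```

With c = 0 the p-value reduces to 2 × 0.5^b, so the size of the result is driven entirely by how many CNP subjects one system protects and the other leaks.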

Replication across two independent template families. The dissociation is not an artefact of a single authored skeleton family. We split ro-realskeleton-v1 into two template families that share no skeleton text (5-gram Jaccard 0.0000) — an official-correspondence family (clinical / legal / administrative; 190 distinct CNP subjects) and an academic-registry family (higher-education student records; 250 distinct CNP subjects) — and re-run the difference-of-proportions test (typed-detector leak-rate minus protector leak-rate) per family, with a Newcombe (1998) hybrid-score 95% interval on the difference. The protector kp-deid leaks 0% of CNPs in both families (Wilson 95% upper bound 0.0198 in the official-correspondence family, 0.0151 in the academic-registry family — both within the pre-registered ≤0.02 target). The dissociation holds in both families: every type-accurate detector with a non-trivial leak has a per-family Newcombe interval that excludes 0 — in the official-correspondence family, spaCy (+0.92, 95% CI 0.87–0.95), GLiNER2 (+0.28, 0.22–0.35), GLiNER (+0.28, 0.22–0.35), tabularisai (+0.37, 0.30–0.44) and OpenMed (+0.25, 0.19–0.32); in the academic-registry family, GLiNER (+1.00, 0.98–1.00), privacy-filter (+0.98, 0.96–0.99), GLiNER2 and presidio (+0.90, 0.85–0.93 each), spaCy (+0.89, 0.84–0.92) and OpenMed (+0.67, 0.61–0.72). The contrast survives independently on two skeleton sets that share no template text — strengthening the “detection F1 does not track CNP protection” reading while the track’s config_status = dev scope (pending native-speaker + IAA validation) is unchanged.
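
A sketch of the Newcombe hybrid-score interval for a difference of two independent proportions, built from the two Wilson intervals. The zero-leak upper bounds are checked against the family sizes reported above (190 and 250); the 30-of-100 example counts are hypothetical and are not taken from the report, and the function names are assumptions.

```python
import math


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def newcombe_diff(k1, n1, k2, n2, z=1.96):
    """Difference p1 - p2 with its Newcombe (1998) hybrid-score interval."""
    p1, p2 = k1 / n1, k2 / n2
    l1, u1 = wilson_interval(k1, n1, z)
    l2, u2 = wilson_interval(k2, n2, z)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, max(-1.0, lower), min(1.0, upper)


# Wilson upper bound for a protector that leaks 0 of n CNPs.
upper_official = wilson_interval(0, 190)[1]   # about 0.0198
upper_academic = wilson_interval(0, 250)[1]   # about 0.0151

# Hypothetical detector leaking 30 of 100 CNPs vs a protector leaking 0 of 100.
diff, lower, upper = newcombe_diff(30, 100, 0, 100)
```

The dissociation claim for a family corresponds to `lower > 0`, that is, an interval on the leak-rate difference that excludes zero.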

On the synthetic track leakage is ≤1.9% for all models (OpenMed 1.9%, privacy-filter 0.1%, GLiNER and tabularisai 0%): templated CNPs are trivially caught, which is why a realistic-context gold is necessary to see the effect at all.

Related Work

EuroPriv-Bench is designed to subsume, not compete with, prior resources, re-using their splits and metrics where applicable. We position it against the closest prior and concurrent artifacts along six axes: (a) EU cross-lingual coverage, (b) legal text, (c) clinical text, (d) a harmonized GDPR-aligned entity taxonomy, (e) a re-identification-risk metric, and (f) an open, reproducible leaderboard. Table 3 summarizes coverage: every prior artifact is missing at least two of these axes, and EuroPriv-Bench is the first to fill all six in a single suite.

Artifact	(a) EU x-ling	(b) legal	(c) clinical	(d) GDPR tax.	(e) re-id metric	(f) leaderboard
TAB [1]	✗	✓	✗	✗	✓	✗
AI4Privacy [2]	✓	✗	✗	✗	✗	✗
MAPA [3]	✓	✓	✓	✗	✗	✗
MultiGraSCCo [4]	✓	✗	✓	✗	✗	✗
MEDDOCAN [5]	✗	✗	✓	✗	✗	✗
RAT-Bench [11]	✗	✗	✓	✗	✓	✓
GLiNER2-PII [12]	✓	✗	✗	✗	✗	✗
SPY [13]	✗	✓	✓	✗	✗	✗
MedPriv-Bench [14]	✗	✗	✓	✗	✗	✗
PrivaCI-Bench [15]	✗	✓	✗	✗	✗	✗
PIIBench [16]	✓	✗	✗	✗	✗	✗
EuroPriv-Bench (ours)	✓	✓	✓	✓	✓	✓

Table 3. Coverage of related and concurrent artifacts. Columns: (a) EU cross-lingual, (b) legal-domain de-identification text, (c) clinical-domain de-identification text, (d) harmonized GDPR taxonomy, (e) span/document-level re-identification-risk metric, (f) open reproducible leaderboard. Every prior row is missing ≥2 columns; only EuroPriv-Bench fills all six.

The closest EU-breadth, dual-domain prior art is MAPA [3] (24 EU languages, legal and clinical), but it is a detection toolkit with no re-identification metric and no open leaderboard. The legal-domain re-identification lineage is anchored by TAB [1], which pairs a privacy–utility / re-identification framing with legal text but is English-only. AI4Privacy [2] contributes cross-lingual European detection data without a re-identification metric or a GDPR-aligned taxonomy. MultiGraSCCo [4] is the closest multilingual prior benchmark; it is clinical-only and, per its own description, produced by machine-translating a German corpus into other languages — localized native generation avoids the structurally invalid identifiers (e.g. checksum-invalid national IDs) that translation produces. MEDDOCAN [5] is a clinical de-identification track but Spanish-only. EuroPriv-Bench is best understood as the unification of what MAPA (EU breadth + dual domain) and TAB (legal text + re-identification metric) each establish in isolation, under one harmonized GDPR-aligned taxonomy.

RAT-Bench [11] is concurrent and independent work: a 2026 hosted re-identification-risk leaderboard. It is complementary rather than overlapping — it is built on U.S. demographic statistics over English, Spanish, and Chinese, contains no legal text, and uses no GDPR-aligned taxonomy, so it does not address the European legal/clinical de-identification setting EuroPriv-Bench targets. Recent PII models and corpora are similarly partial: GLiNER2-PII [12] is a strong multilingual (seven-language, 42-type) PII model but ships no benchmark, no re-identification metric, and no legal or clinical coverage; the SPY benchmark [13] does contain legal and clinical de-identification text, but it is English-only synthetic data with no EU cross-lingual breadth, no GDPR-aligned taxonomy, and no re-identification-risk metric; MedPriv-Bench [14] is a clinical-only LLM-QA privacy-utility benchmark; PrivaCI-Bench [15] evaluates contextual integrity and legal compliance rather than span-level de-identification; and PIIBench [16] consolidates ten public PII datasets for detection only. A 2025 survey of text anonymization [17] explicitly flags the absence of a standardized multilingual de-identification benchmark — the gap EuroPriv-Bench is built to close.

Piiranha [10] is included by citation only, as its CC-BY-NC-ND license precludes redistribution or use as a base model.

Limitations

These are preliminary results. (i) The general-text gold is itself synthetic (AI4Privacy); the only realistic-context track is Romanian, and even there the identifiers are synthetic injected into real structure — we measure a synthetic-context vs real-context gap, not a synthetic-to-real-data gap. (ii) The cross-system F1–leakage rank correlation is descriptive over four systems (Spearman ρ = +0.80, not significant, exact permutation p = 0.33) and is largely carried by one blanket-redacting model (privacy-filter); we do not treat it as an effect estimate. The claim rests instead on privacy-filter’s leak-rate, whose Wilson 95% interval is separated from every other model’s (non-overlapping, Table 2), and on the coverage-vs-type mechanism in §5 — the three type-accurate detectors are not all mutually separable (the adjacent pairs overlap through GLiNER; only OpenMed and tabularisai separate), so the protective effect we report is privacy-filter’s blanket coverage, not a graded one across detectors. (iii) OpenMed and tabularisai were trained on AI4Privacy, the source of our general-text gold, so part of their general-text lead reflects in-distribution advantage — the Romanian track, which no baseline has seen, is the cleaner signal. (iv) The re-identification finding is demonstrated on decode-bearing national identifiers (the Romanian CNP, and — in the live leaderboard beyond this four-system study — Polish PESEL and Italian codice fiscale); we make no claim it generalizes to all identifiers or languages without further evidence. The deeper reading — that an aggregate detection F1 can stay high while a model misses the rare, high-stakes tokens that carry the re-identification — is general in principle, but national identifiers are the clearest provable case of it, not the whole of it; extending the measure to quasi-identifier-combination re-identification is in progress, so the broad claim is a hypothesis under test rather than a settled result. 
A second, name-in-context mechanism — the residual distinctiveness of a person name left in its surrounding context — is measured on the same development-track gold (config_status = dev); we report it as residual quasi-identifier distinctiveness, not a re-identification rate, which we reserve for the deterministic national-ID channel. (v) The anonymization/utility and membership-inference tracks are specified but not yet populated; the present metric is re-identification leakage, hence the title.

Reproducibility

All results are produced by the open harness (europriv-bench, v0.2.0) over the public dataset (klusai/europriv-bench, taxonomy v0.2.0); each leaderboard row carries its provenance. Wilson 95% intervals are computed from the published per-model CNP miss counts. GLiNER is zero-shot and its label prompts are part of the configuration (in the code). The CNP-protection rule is the harness definition stated in §4. Re-running europriv run against the published configs reproduces Tables 1–2.

References

[1] Pilán, Lison, Øvrelid, Papadopoulou, Sánchez, Batet. "The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization." Computational Linguistics 48(4), 2022. doi:10.1162/coli_a_00458.
[2] AI4Privacy. "OpenPII Masking" datasets (CC-BY-4.0). Hugging Face, ai4privacy/open-pii-masking-500k-ai4privacy. Accessed June 2026.
[3] Ajausks et al. "The Multilingual Anonymisation Toolkit for Public Administrations (MAPA)." EAMT 2020; CEF Telecom project 2019-EU-IA-0045. Code: github.com/MAPA-Consortium; models on Hugging Face under BSC-LT/. Accessed June 2026.
[4] "MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers." 2026 preprint (German GraSCCo corpus machine-translated into further languages); builds on Modersohn et al., "GraSCCo," 2022. We were unable to resolve a stable DOI/arXiv locator at access time (June 2026).
[5] Marimon et al. "MEDDOCAN: Medical Document Anonymization track (Spanish)." IberLEF, 2019.
[6] OpenAI. "Privacy Filter" (openai/privacy-filter). Hugging Face, 2026. Accessed June 2026.
[7] OpenMed. "privacy-filter-multilingual" (OpenMed/privacy-filter-multilingual), 16 languages / 54 types. Hugging Face, 2026. Accessed June 2026.
[8] tabularisai. "eu-pii-safeguard" (tabularisai/eu-pii-safeguard), XLM-R, 26 EU languages. Hugging Face. Accessed June 2026.
[9] Zaratiana, Tomeh, Holat, Charnois. "GLiNER: Generalist Model for NER" (urchade/gliner_multi_pii-v1). 2023. arXiv:2311.08526.
[10] iiiorg. "Piiranha-v1" (mDeBERTa-v3), license CC-BY-NC-ND-4.0. Hugging Face, 2024.
[11] Krčo, Yao, Meeus, de Montjoye. "RAT-Bench: A Comprehensive Benchmark for Text Anonymization." 2026. arXiv:2602.12806. Concurrent, independent re-identification-risk leaderboard over synthetic text on U.S. demographics (English/Spanish/Chinese); no legal text, no GDPR-aligned taxonomy.
[12] Zaratiana, Lewis, Hurn-Maloney. "GLiNER2-PII: A Multilingual Model for Personally Identifiable Information Extraction." 2026. arXiv:2605.09973. A 0.3B multilingual PII extractor over a 42-type taxonomy; a model, not a leaderboard.
[13] Savkin, Ionov, Konovalov. "SPY: Enhancing Privacy with Synthetic PII Detection Dataset." NAACL 2025 (SRW), pp. 236–246. English-only synthetic PII benchmark with legal and clinical items.
[14] Guan, Zhai, Kwok, Du, Feng, Li, Qin, Hui. "MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering." 2026. arXiv:2603.14265. Clinical-only LLM-QA privacy-utility benchmark.
[15] Li, Hu, Jing, Chen, Hu, Han, Chu, Hu, Song. "PrivaCI-Bench: Evaluating Privacy with Contextual Integrity and Legal Compliance." ACL 2025. arXiv:2502.17041. Contextual-integrity / legal-compliance evaluation, not span-level de-identification.
[16] Jha. "PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection." 2026. arXiv:2604.15776. Consolidates ten public PII datasets for detection only.
[17] Deußer, Sparrenberg, Berger, Hahnbück, Bauckhage, Sifa. "A Survey on Current Trends and Recent Advances in Text Anonymization." IEEE DSAA 2025. arXiv:2508.21587. Flags the absence of a standardized multilingual de-identification benchmark.

¹ We hold the per-CNP quasi-identifier count at three (date of birth, sex, county) as a conservative lower bound; the first digit jointly encodes sex and birth-century, and digits 8–9 encode the county of registration (with reserved codes for Bucharest sectors), not necessarily of residence.
² The per-predicted-label breakdown (96% ACCOUNT_ID, 3% phone) is not a scored leaderboard metric; it comes from the harness’s per-label prediction dump for privacy-filter on this config (regenerated by europriv run --dump-predictions), and is internally consistent with the 1,107/1,123 flagged and 16 missed.