Applied Research · AI Services · Ventures

Privacy-preserving NLP for European languages

The open research hub of KlusAI — benchmarks, datasets, models and papers for PII/PHI detection, anonymization, and re-identification-risk evaluation across European languages (8 published datasets: ro, en, pl, de, fr, es, it, nl — scaling toward EU-24), spanning the legal and clinical domains.

See the live leaderboard → Benchmark on Hugging Face

Flagship benchmark

EuroPriv-Bench

EuroPriv-Bench is the first unified pan-European de-identification benchmark. Unlike prior work that reports only detection-F1 on English, it measures re-identification risk alongside detection on a single GDPR-aligned taxonomy. Eight languages are published — ro, en, pl, de, fr, es, it, nl. Seven have general-text tracks (de, en, es, fr, it, nl, ro), and three carry contamination-controlled, decode-bearing real-skeleton tracks: Romanian (CNP), Polish (PESEL), and Italian (codice fiscale). The roadmap scales toward the full EU-24.
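To make "decode-bearing" concrete, here is a minimal sketch of how a Romanian CNP can be decoded. It is written from the publicly documented CNP layout (sex/century digit, YYMMDD, two-digit county code, three-digit sequence, check digit) and is not taken from the EuroPriv-Bench harness. The function names `check_digit` and `decode_cnp` are hypothetical, and first digits 7 to 9 (resident and foreign citizens) are deliberately left out.

```python
from datetime import date

# First digit -> (sex, century base). Digits 7-9 are omitted in this sketch.
_SEX_CENTURY = {
    "1": ("M", 1900), "2": ("F", 1900),
    "3": ("M", 1800), "4": ("F", 1800),
    "5": ("M", 2000), "6": ("F", 2000),
}
_WEIGHTS = "279146358279"  # control-key weights for the first 12 digits


def check_digit(first12: str) -> int:
    """Weighted sum modulo 11; a remainder of 10 maps to 1."""
    total = sum(int(d) * int(w) for d, w in zip(first12, _WEIGHTS))
    remainder = total % 11
    return 1 if remainder == 10 else remainder


def decode_cnp(cnp: str) -> dict:
    """Return the quasi-identifiers a single un-redacted CNP discloses."""
    if len(cnp) != 13 or not cnp.isdigit():
        raise ValueError("a CNP has exactly 13 digits")
    if cnp[0] not in _SEX_CENTURY:
        raise ValueError("first digit not handled by this sketch")
    if check_digit(cnp[:12]) != int(cnp[12]):
        raise ValueError("check digit mismatch")
    sex, century = _SEX_CENTURY[cnp[0]]
    birth = date(century + int(cnp[1:3]), int(cnp[3:5]), int(cnp[5:7]))
    return {"sex": sex, "birth_date": birth, "county_code": cnp[7:9]}


# Synthetic example (not a real person's identifier): build a valid CNP
# from a made-up 12-digit body, then decode it.
body = "185031512345"
synthetic_cnp = body + str(check_digit(body))
decoded = decode_cnp(synthetic_cnp)
print(synthetic_cnp, decoded)
```

One leaked string yields sex, full date of birth, and county of registration, which is why a missed national ID weighs far more than a missed generic entity.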

The headline finding: detection-F1 does not track re-identification protection. So far this is demonstrated on decode-bearing national identifiers (RO CNP, PL PESEL, IT codice fiscale). The deeper mechanism is plausibly general, but it has not yet been proven beyond these identifiers: an aggregate detection-F1 can stay high while a model misses the rare, high-stakes tokens that actually carry the re-identification risk. National IDs are the clearest, provable case of that mechanism, though not the whole of it: each un-redacted national ID deterministically discloses several quasi-identifiers at once (a Romanian CNP, for instance, decodes to date of birth, sex, and county).

On contamination-free, realistic-structure documents the dissociation holds across three decode-bearing identifiers in three languages (RO CNP, PL PESEL, IT codice fiscale) and across two independent Romanian template families, and it is reproduced by independent third-party submissions on the public board. spaCy, with no structured-ID recognizer, leaks 89.0% of Romanian CNPs at a detection-F1 of just 0.14, while GLiNER, the strongest detector on the track (F1 0.85), still leaks 30.2%. By contrast, KlusAI's reference de-identifier kp-deid is the strongest protector that still detects: 0% CNP leakage at detection-F1 0.74.

These are measured, contamination-controlled signals on development-track gold (config_status = dev), pending native-speaker and inter-annotator-agreement validation. They are a finding, not yet a validated or citable claim. Extending the measure to quasi-identifier-combination re-identification is in progress, so the broad reading remains a hypothesis under test.
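The mechanism can be illustrated with a toy calculation. The sketch below is not the EuroPriv-Bench harness: the metric definitions (exact-span F1, and a leak counted whenever a gold national ID is not fully covered by any predicted span), the label names, and the helper names `f1_exact` and `leak_rate` are all assumptions made for illustration.

```python
def f1_exact(gold, pred):
    """Label-agnostic detection F1 on exact (start, end) span matches."""
    gold_spans = {(s, e) for s, e, _ in gold}
    pred_spans = set(pred)
    tp = len(gold_spans & pred_spans)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_spans)
    recall = tp / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


def leak_rate(gold, pred, label="NATIONAL_ID"):
    """Share of gold spans with `label` not fully covered by a predicted span."""
    ids = [(s, e) for s, e, lab in gold if lab == label]
    if not ids:
        return 0.0
    leaked = sum(
        1 for s, e in ids
        if not any(ps <= s and e <= pe for ps, pe in pred)
    )
    return leaked / len(ids)


# Toy document: nine PERSON mentions and one national ID (character offsets).
gold = [(i * 20, i * 20 + 8, "PERSON") for i in range(9)]
gold.append((200, 213, "NATIONAL_ID"))

# A detector that finds every name but has no structured-ID recognizer.
pred = [(s, e) for s, e, lab in gold if lab != "NATIONAL_ID"]

f1 = f1_exact(gold, pred)
leak = leak_rate(gold, pred)
print(f"detection F1 = {f1:.3f}, national-ID leak rate = {leak:.0%}")
```

In this toy case precision is 1.0 and recall is 0.9, so the aggregate F1 stays near 0.95 even though the only national ID in the document is left in the clear.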

Artifacts

Open from day one

Benchmark

EuroPriv-Bench — versioned, provenance-tracked, openly redistributable. The live leaderboard is open for external submissions.

Hugging Face ↗ Submit a model →

Models

The KlusAI Privacy (kp-*) family — kp-deid-mdeberta-280m is the strongest protector that still detects (0% CNP leakage at detection-F1 0.74).

Hugging Face ↗

Code

The benchmark harness, dataset curation, and model training pipelines.

GitHub ↗

Papers

The EuroPriv-Bench preprint — reproducible numbers that trace back to a commit. In-progress working paper; arXiv pending.

Read the paper →

What we publish here

A citable home for the program

Leaderboards — versioned, provenance-tracked results you can cite.
Release notes & methodology — the technical story behind each artifact.
Paper companions — reproducible numbers tracing back to a commit.

For the company and product, see klusai.com.

Updates

Latest posts

Jun 13, 2026
Every detector leaked half the IDs on real court text — except one
A few weeks ago we argued that a high-F1 detector can still leak every ID it finds — but the evidence was on our own synthetic national-ID tracks. The obvious objection: show it on real data. So we did. On real ECHR court judgments — the peer-reviewed Text Anonymization Benchmark (TAB; Pilán et al. 2022), manually annotated, that no model on our board trained on — the dissociation holds, and it is starker than on synthetic text.
Jun 7, 2026
Making the benchmark honest: a harder external eval, and our first real data
Two changes landed this week that make EuroPriv-Bench harder to game — including by us. We added an external, third-party detection eval that pulls our own synthetic scores apart, and we wired in the first real-data gold config in the suite. Both make our numbers look worse, on purpose. A benchmark you can’t fail isn’t measuring anything.
Jun 7, 2026
A second look at what survives redaction: a quasi-identifier diagnostic
So far we’ve measured leakage one way: did a specific, structured identifier — a national ID that decodes to a birthday, a sex, and a county — slip through? That’s the re-identification-risk channel, and it’s deliberately narrow: we only call something re-identification when an identifier’s structure earns the word.
Jun 3, 2026
The leaderboard is open — and it now carries three external detectors
EuroPriv-Bench’s submission path is open: a no-secrets CI that runs the harness against your model and adds the row, with provenance baked in. To open it honestly we ran it ourselves on three independent, third-party detectors that were never tuned to compete on re-identification risk — Microsoft Presidio, GLiNER2 (Fastino), and spaCy (Explosion). They now sit on the public leaderboard next to our own model.

All posts →
