Blog | KlusAI Research

Notes from the privacy program

Methodology, engineering, and results as we build EuroPriv-Bench and the KlusAI privacy models.

Jun 13, 2026
Every detector leaked half the IDs on real court text — except one
A few weeks ago we argued that a high-F1 detector can still leak every ID it finds, but the evidence came from our own synthetic national-ID tracks. The obvious objection: show it on real data. So we did. On real ECHR court judgments from the Text Anonymization Benchmark (TAB; Pilán et al. 2022), which is peer-reviewed, manually annotated, and was not used to train any model on our board, the dissociation holds, and it is starker than on synthetic text.
Jun 7, 2026
Making the benchmark honest: a harder external eval, and our first real data
Two changes landed this week that make EuroPriv-Bench harder to game — including by us. We added an external, third-party detection eval that pulls our own synthetic scores apart, and we wired in the first real-data gold config in the suite. Both make our numbers look worse, on purpose. A benchmark you can’t fail isn’t measuring anything.
Jun 7, 2026
A second look at what survives redaction: a quasi-identifier diagnostic
So far we’ve measured leakage one way: did a specific, structured identifier — a national ID that decodes to a birthday, a sex, and a county — slip through? That’s the re-identification-risk channel, and it’s deliberately narrow: we only call something re-identification when an identifier’s structure earns the word.
Jun 3, 2026
The leaderboard is open — and it now carries three external detectors
EuroPriv-Bench’s submission path is open: a no-secrets CI that runs the harness against your model and adds the row, with provenance baked in. To open it honestly we ran it ourselves on three independent, third-party detectors that were never tuned to compete on re-identification risk — Microsoft Presidio, GLiNER2 (Fastino), and spaCy (Explosion). They now sit on the public leaderboard next to our own model.
Jun 2, 2026
A high-F1 detector can still leak every ID it finds
Our first de-identification model, klusai/kp-deid-mdeberta-280m, just landed on the public leaderboard as the best protector on the contamination-free Romanian real-skeleton track. It earns that title not by topping detection-F1 — it doesn’t — but on the metric we actually lead with: re-identification risk. And in earning it, it surfaces the program’s headline finding.
Jun 1, 2026
The first open datasets in the EuroPriv-Bench suite
The first open EuroPriv-Bench datasets are now on Hugging Face: general-domain bring-up sets in Romanian, English, and Polish, 50,000 documents each.
May 30, 2026
Introducing EuroPriv-Bench
We’re releasing EuroPriv-Bench — the first unified pan-European de-identification benchmark. It puts privacy NLP for European languages on a single, GDPR-aligned taxonomy and a privacy-utility metric, rather than the fragmented, English-centric, detection-F1-only picture that exists today.
May 27, 2026
The GPU isn't always the answer
We benchmark a lot of models, so harness throughput matters. The intuition — “we have a 60-core M3 Ultra GPU, use it” — turned out to be exactly wrong for this workload, in an instructive way.
May 23, 2026
What a neutral leaderboard must control for
When we ran the first cross-lingual baselines, one model led on most European languages. The easy headline would be “model X wins.” The honest footnote is more interesting — and it’s the kind of thing a benchmark exists to surface.
May 19, 2026
A missed ID number is a birthday, a sex, and a county
Most PII benchmarks report one number: detection F1. It treats every missed entity as one missed token. But for privacy, not all misses are equal — and detection-F1 hides exactly the misses that matter most.
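To make the title concrete, here is a minimal sketch of how much a single missed structured identifier can reveal. The excerpt does not say which ID scheme is meant; the sketch assumes the Romanian CNP layout (sex/century digit, birth date, two-digit county code, serial, check digit) purely as an illustration. The function names `check_digit` and `decode_cnp` are hypothetical and not part of the EuroPriv-Bench harness, and the example number is synthetic.

```python
from datetime import date

# First digit encodes century of birth and sex (Romanian CNP convention).
# Digits 7-9 (residents/foreign citizens) are left out of this sketch.
_CENTURY_SEX = {
    1: (1900, "M"), 2: (1900, "F"),
    3: (1800, "M"), 4: (1800, "F"),
    5: (2000, "M"), 6: (2000, "F"),
}
_WEIGHTS = "279146358279"  # control constant applied to the first 12 digits


def check_digit(first12: str) -> int:
    """Compute the CNP check digit for the first 12 digits."""
    total = sum(int(d) * int(w) for d, w in zip(first12, _WEIGHTS))
    remainder = total % 11
    return 1 if remainder == 10 else remainder


def decode_cnp(cnp: str) -> dict:
    """Recover sex, birth date, and county code from a 13-digit CNP."""
    if len(cnp) != 13 or not cnp.isdigit():
        raise ValueError("a CNP has exactly 13 digits")
    if check_digit(cnp[:12]) != int(cnp[12]):
        raise ValueError("checksum mismatch")
    first = int(cnp[0])
    if first not in _CENTURY_SEX:
        raise ValueError("first digit not handled by this sketch")
    century, sex = _CENTURY_SEX[first]
    birth = date(century + int(cnp[1:3]), int(cnp[3:5]), int(cnp[5:7]))
    return {"sex": sex, "birth_date": birth, "county_code": cnp[7:9]}


if __name__ == "__main__":
    # Synthetic example: build a well-formed number instead of using a real one.
    body = "180010122123"
    synthetic = body + str(check_digit(body))
    print(decode_cnp(synthetic))
```

One leaked token of this kind yields three quasi-identifiers at once, which is why counting it as a single miss understates the privacy cost.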
May 13, 2026
Harmonizing the PII taxonomy Babel
Every PII model speaks a different dialect. OpenAI’s privacy-filter has 8 coarse types; AI4Privacy uses ~98; HIPAA defines 18; the EU’s MAPA project has its own legal-and-medical set; OpenMed expands to 54; tabularisai to 42. If you want to compare these models on the same data — the whole point of a benchmark — you first have to make them agree on what a “name” or an “ID number” even is.
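As a rough illustration of what harmonizing means in practice, the sketch below maps labels from two source taxonomies onto one shared set of types before any comparison. Everything in it is an assumption made for illustration: the scheme names (`scheme_a`, `scheme_b`), their labels, the unified types, and the `harmonize` helper are hypothetical and do not reproduce the actual EuroPriv-Bench mapping or any real model's label set.

```python
# Hypothetical unified types; the real benchmark taxonomy is not shown here.
UNIFIED_TYPES = {"PERSON_NAME", "NATIONAL_ID", "OTHER"}

# Hypothetical (scheme, source label) -> unified type table.
# A coarse scheme may have one label where a fine-grained one has several.
LABEL_MAP = {
    ("scheme_a", "person"): "PERSON_NAME",
    ("scheme_a", "account_number"): "NATIONAL_ID",
    ("scheme_b", "givenname"): "PERSON_NAME",
    ("scheme_b", "surname"): "PERSON_NAME",
    ("scheme_b", "idcard"): "NATIONAL_ID",
}


def harmonize(scheme: str, label: str) -> str:
    """Map a source label to a unified type, falling back to OTHER."""
    key = (scheme.lower(), label.lower())
    return LABEL_MAP.get(key, "OTHER")


def harmonize_spans(scheme: str, spans: list[tuple[int, int, str]]) -> list[tuple[int, int, str]]:
    """Rewrite (start, end, label) spans so every model reports unified types."""
    return [(start, end, harmonize(scheme, label)) for start, end, label in spans]


if __name__ == "__main__":
    predicted = [(0, 4, "GivenName"), (5, 12, "Surname"), (20, 33, "IdCard")]
    print(harmonize_spans("scheme_b", predicted))
```

The explicit fallback to `OTHER` is a design choice in this sketch: unmapped labels stay visible in the output instead of being silently dropped, so gaps in the mapping show up in the scores.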