Applied Research · AI Services · Ventures
Privacy-preserving NLP for European languages
The open research hub of KlusAI — benchmarks, datasets, models and papers for PII/PHI detection, anonymization, and re-identification-risk evaluation across 20 European languages, spanning the legal and clinical domains.
Flagship benchmark
EuroPriv-Bench
EuroPriv-Bench is the first unified pan-European de-identification benchmark. Unlike prior work that reports only detection-F1 on English, it measures the privacy-utility trade-off and re-identification risk on a single GDPR-aligned taxonomy across 20 European languages, spanning legal and clinical text.
The headline finding: models advertising 0.96–0.97 F1 on English PII drop to 0.44–0.61 F1 once you hold them to a unified European taxonomy across languages. That is the gap the benchmark exists to expose.
Artifacts
Open from day one
Papers
Preprints with reproducible numbers that trace back to a commit.
What we publish here
A citable home for the program
- Leaderboards — versioned, provenance-tracked results you can cite.
- Release notes & methodology — the technical story behind each artifact.
- Paper companions — reproducible numbers tracing back to a commit.
For the company and product, see klusai.com.
Updates
Latest posts
- Introducing EuroPriv-Bench
We’re releasing EuroPriv-Bench — the first unified pan-European de-identification benchmark. It puts privacy NLP for European languages on a single, GDPR-aligned taxonomy and a privacy-utility metric, rather than the fragmented, English-centric, detection-F1-only picture that exists today.
- The GPU isn't always the answer
We benchmark a lot of models, so harness throughput matters. The intuition — “we have a 60-core M3 Ultra GPU, use it” — turned out to be exactly wrong for this workload, in an instructive way.
- What a neutral leaderboard must control for
When we ran the first cross-lingual baselines, one model led on most European languages. The easy headline would be “model X wins.” The honest footnote is more interesting — and it’s the kind of thing a benchmark exists to surface.
- A missed ID number is a birthday, a sex, and a county
Most PII benchmarks report one number: detection F1. That metric treats every missed entity as one missed token. But for privacy, not all misses are equal, and detection-F1 hides exactly the misses that matter most.
- Harmonizing the PII taxonomy Babel
Every PII model speaks a different dialect. OpenAI’s privacy-filter has 8 coarse types; AI4Privacy uses ~98; HIPAA defines 18; the EU’s MAPA project has its own legal-and-medical set; OpenMed expands to 54; tabularisai to 42. If you want to compare these models on the same data — the whole point of a benchmark — you first have to make them agree on what a “name” or an “ID number” even is.
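To make the harmonization step concrete, here is a minimal sketch of mapping two differently grained label schemes onto one shared label set. Everything in it is an assumption for illustration: the unified labels, the scheme names `coarse_scheme` and `fine_scheme`, the native label strings, and the `harmonize` helper are hypothetical and do not reproduce the actual EuroPriv-Bench taxonomy or any named model's label set.

```python
from collections import Counter

# Hypothetical unified label set (not the real benchmark taxonomy).
UNIFIED = {"PERSON", "ID_NUMBER", "DATE", "LOCATION", "OTHER"}

# Hypothetical native-label -> unified-label mappings for two schemes,
# one coarse and one fine-grained.
MAPPINGS = {
    "coarse_scheme": {
        "private_person": "PERSON",
        "account_number": "ID_NUMBER",
        "private_date": "DATE",
    },
    "fine_scheme": {
        "FIRSTNAME": "PERSON",
        "LASTNAME": "PERSON",
        "SSN": "ID_NUMBER",
        "PASSPORT": "ID_NUMBER",
        "DOB": "DATE",
        "CITY": "LOCATION",
    },
}


def harmonize(scheme, spans, fallback="OTHER"):
    """Map (start, end, native_label) spans onto the unified label set.

    Labels with no mapping are sent to `fallback` and counted, so gaps
    in the mapping stay visible instead of being silently dropped.
    """
    mapping = MAPPINGS[scheme]
    unmapped = Counter()
    harmonized = []
    for start, end, label in spans:
        unified = mapping.get(label)
        if unified is None:
            unmapped[label] += 1
            unified = fallback
        harmonized.append((start, end, unified))
    return harmonized, unmapped


if __name__ == "__main__":
    fine_spans = [(0, 4, "FIRSTNAME"), (5, 10, "LASTNAME"), (40, 45, "IBAN")]
    spans, gaps = harmonize("fine_scheme", fine_spans)
    print(spans)
    print(dict(gaps))
```

Counting unmapped labels rather than discarding them is a deliberate choice in this sketch: a fine-grained scheme will usually contain types the shared set has no slot for, and those gaps are worth reporting alongside any score computed on the harmonized labels.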