Applied Research · AI Services · Ventures

Privacy-preserving NLP for European languages

The open research hub of KlusAI — benchmarks, datasets, models and papers for PII/PHI detection, anonymization, and re-identification-risk evaluation across 20 European languages, spanning the legal and clinical domains.

Flagship benchmark

EuroPriv-Bench

EuroPriv-Bench is the first unified pan-European de-identification benchmark. Unlike prior work that reports only detection F1 on English, it measures privacy-utility and re-identification risk against a single GDPR-aligned taxonomy across 20 European languages, spanning legal and clinical text.

The headline finding: models advertising 96–97% F1 on English PII drop to 44–61% F1 (0.44–0.61) once you hold them to a unified European taxonomy across languages. That gap is what the benchmark exists to expose.

Artifacts

Open from day one

HF

Benchmark

EuroPriv-Bench — versioned, provenance-tracked, openly redistributable.

Code

The benchmark harness, dataset curation, and model training pipelines.

arXiv

Papers

Preprints with reproducible numbers that trace back to a commit.

Coming soon

What we publish here

A citable home for the program

  • Leaderboards — versioned, provenance-tracked results you can cite.
  • Release notes & methodology — the technical story behind each artifact.
  • Paper companions — reproducible numbers tracing back to a commit.

For the company and product, see klusai.com.

Updates

Latest posts

  • Introducing EuroPriv-Bench

    We’re releasing EuroPriv-Bench — the first unified pan-European de-identification benchmark. It puts privacy NLP for European languages on a single, GDPR-aligned taxonomy and a privacy-utility metric, rather than the fragmented, English-centric, detection-F1-only picture that exists today.

  • The GPU isn't always the answer

    We benchmark a lot of models, so harness throughput matters. The intuition — “we have a 60-core M3 Ultra GPU, use it” — turned out to be exactly wrong for this workload, in an instructive way.

  • What a neutral leaderboard must control for

    When we ran the first cross-lingual baselines, one model led on most European languages. The easy headline would be “model X wins.” The honest footnote is more interesting — and it’s the kind of thing a benchmark exists to surface.

  • A missed ID number is a birthday, a sex, and a county

    Most PII benchmarks report one number: detection F1. It treats every missed entity as one missed token. But for privacy, not all misses are equal — and detection-F1 hides exactly the misses that matter most.

  • Harmonizing the PII taxonomy Babel

    Every PII model speaks a different dialect. OpenAI’s privacy-filter has 8 coarse types; AI4Privacy uses ~98; HIPAA defines 18; the EU’s MAPA project has its own legal-and-medical set; OpenMed expands to 54; tabularisai to 42. If you want to compare these models on the same data — the whole point of a benchmark — you first have to make them agree on what a “name” or an “ID number” even is.
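
As a rough illustration of the harmonization problem in the last post, the sketch below maps two made-up source label sets onto a shared one before scoring. Everything in it is an assumption for illustration: the label names, the `UNIFIED` mapping, and the `harmonize` and `span_f1` helpers are hypothetical and are not the EuroPriv-Bench taxonomy or harness.

```python
# Hypothetical mappings from two source taxonomies to a shared label set.
# None of these label names come from a real model or from EuroPriv-Bench.
UNIFIED = {
    "model_a": {"private_person": "PERSON", "account_number": "ID_NUMBER"},
    "model_b": {"FIRSTNAME": "PERSON", "LASTNAME": "PERSON", "SSN": "ID_NUMBER"},
}


def harmonize(source, spans):
    """Map (start, end, label) spans to the shared labels; drop unmapped labels."""
    mapping = UNIFIED[source]
    return [(start, end, mapping[label]) for start, end, label in spans if label in mapping]


def span_f1(gold, pred):
    """Exact-match span F1 over (start, end, label) triples."""
    gold_set, pred_set = set(gold), set(pred)
    true_pos = len(gold_set & pred_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Gold annotation in the shared taxonomy: one full name, one ID number.
gold = [(0, 10, "PERSON"), (20, 31, "ID_NUMBER")]

# model_a tags the whole name; model_b splits it into first and last name.
pred_a = harmonize("model_a", [(0, 10, "private_person"), (20, 31, "account_number")])
pred_b = harmonize("model_b", [(0, 4, "FIRSTNAME"), (5, 10, "LASTNAME"), (20, 31, "SSN")])

f1_a = span_f1(gold, pred_a)
f1_b = span_f1(gold, pred_b)
```

In this toy case both models find the same personal data, yet `model_b` scores lower under exact span matching because its finer-grained name labels do not line up with the gold span. A label mapping alone does not settle questions of granularity, which is one reason the matching rule has to be fixed alongside the taxonomy.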