KlusAI Research

Introducing EuroPriv-Bench

2026-05-30T00:00:00+00:00

We’re releasing EuroPriv-Bench — the first unified pan-European de-identification benchmark. It puts privacy NLP for European languages on a single, GDPR-aligned taxonomy and a privacy-utility metric, rather than the fragmented, English-centric, detection-F1-only picture that exists today.

The gap it exposes

Models advertising 96–97% F1 on English PII benchmarks tell you little about how they behave on, say, Dutch clinical notes under a European privacy taxonomy. When we hold current baselines to that bar, F1 drops to 0.44–0.61 (44–61%) across en/fr/es/de/it/nl. Recall is the weakest link, and low recall is the privacy-critical failure: every missed entity is PII left in the text.

→ See the live leaderboard

What’s in v0

6 European language configs from cleanly-licensed open sources, remapped to the unified KlusAI privacy taxonomy and fully attributed.
A reproducible harness with provenance baked into every result (model id, dataset config/split, harness & taxonomy version, timestamp).
Baselines for the current public de-identification models.
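
As a rough sketch of what "provenance baked into every result" could look like, the record below carries the fields listed above. The class name ScoreRecord, its field names, and the example values are illustrative assumptions, not the harness's actual schema.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ScoreRecord:
    """One result row plus the provenance needed to trace it (hypothetical schema)."""
    model_id: str
    dataset_config: str
    dataset_split: str
    harness_version: str
    taxonomy_version: str
    timestamp: str  # ISO 8601, UTC
    f1: float


def make_record(model_id, dataset_config, dataset_split,
                harness_version, taxonomy_version, f1):
    # Stamp the record at creation time so every score is dated.
    now = datetime.now(timezone.utc).isoformat()
    return ScoreRecord(model_id, dataset_config, dataset_split,
                       harness_version, taxonomy_version, now, f1)


# Placeholder values for illustration only, not real results.
record = make_record("example-org/example-model", "nl", "test", "0.0.0", "0.0.0", 0.5)
row = asdict(record)
```

Freezing the dataclass is a deliberate choice in this sketch: once a score is written, its provenance cannot be edited in place.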

What’s next

More baselines (multilingual de-id models, GLiNER, Presidio), larger sample sizes, and the legal/clinical splits — plus under-served languages including Romanian. Follow along here or on GitHub.

EuroPriv-Bench is open — see How to submit.

The GPU isn’t always the answer

2026-05-27T00:00:00+00:00

We benchmark a lot of models, so harness throughput matters. The intuition — “we have a 60-core M3 Ultra GPU, use it” — turned out to be exactly wrong for this workload, in an instructive way.

The model under test is openai/privacy-filter: a sparse Mixture-of-Experts with only 50M active parameters, doing short-sequence token classification. We measured inference per example, several ways:

Setup	Seconds per example (lower is better)
MPS (Apple GPU, PyTorch), single	0.81
MLX (Apple GPU), batched	0.085
CPU, batch=32, 28 threads	0.083
CPU, batch=32, 4 threads	0.041

Two surprises:

1. The GPU lost. Neither GPU path beat the CPU: PyTorch-MPS was the slowest setup by far (0.81 s/example), and batched MLX (0.085) only matched the default 28-thread CPU run (0.083) while taking about twice as long as the 4-thread run (0.041). With so few active parameters and short sequences, there isn’t enough work per item to amortize the overhead of getting data onto the GPU, and the MoE’s routing ops fall off the Metal fast path and bounce back to the CPU anyway. GPUs win on big compute (large models, long sequences, training); they don’t on a tiny MoE doing short spans.

2. More threads were slower. PyTorch’s default (28 BLAS threads here) ran at half the speed of 4 threads — the ops are small enough that thread-coordination overhead dominates.

What actually used the machine

The real lever was parallelism at the job level: run many small, thread-capped jobs at once. Seven worker processes × 4 threads saturates the 28 cores, each job at its own fastest point. A full sweep (multiple models × multiple language configs) that crawled before now finishes in a couple of minutes — a ~7× wall-clock win, and identical (deterministic) numbers.
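
A minimal sketch of that layout, using only the standard library. The function names (plan_workers, init_worker, run_job, run_sweep) and the job tuples are hypothetical; the real harness may be organized quite differently. The thread cap is applied through the OMP_NUM_THREADS and MKL_NUM_THREADS environment variables, and with PyTorch you would typically also call torch.set_num_threads inside each worker.

```python
import os
from concurrent.futures import ProcessPoolExecutor

TOTAL_CORES = 28       # core count quoted in the post
THREADS_PER_JOB = 4    # fastest per-job setting in the table above


def plan_workers(total_cores, threads_per_job):
    """How many thread-capped jobs fit on the machine at once."""
    return max(1, total_cores // threads_per_job)


def init_worker(threads):
    # Cap the math-library thread pools before any heavy library is imported
    # in this worker process.
    os.environ["OMP_NUM_THREADS"] = str(threads)
    os.environ["MKL_NUM_THREADS"] = str(threads)


def run_job(job):
    """Hypothetical stand-in for evaluating one (model, language config) pair."""
    model_id, lang = job
    return (model_id, lang, os.environ.get("OMP_NUM_THREADS"))


def run_sweep(jobs, total_cores=TOTAL_CORES, threads_per_job=THREADS_PER_JOB):
    # One process per job slot; each process keeps its own small thread pool.
    workers = plan_workers(total_cores, threads_per_job)
    with ProcessPoolExecutor(max_workers=workers,
                             initializer=init_worker,
                             initargs=(threads_per_job,)) as pool:
        return list(pool.map(run_job, jobs))
```

Calling run_sweep with a list such as [("model-a", "en"), ("model-a", "nl")] from a script's main guard would fan the jobs out across seven processes on a 28-core machine. Because pool.map preserves input order, the results come back in the same order regardless of which worker finishes first.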

The lesson

Profile before you reach for the accelerator. “Use the GPU” and “use all the cores” are heuristics, not laws, and for small models on short inputs both can cost you. The GPU still earns its keep here, just for the other job: training our own models, where the compute is actually large.

What a neutral leaderboard must control for

2026-05-23T00:00:00+00:00

When we ran the first cross-lingual baselines, one model led on most European languages. The easy headline would be “model X wins.” The honest footnote is more interesting — and it’s the kind of thing a benchmark exists to surface.

That model was fine-tuned on AI4Privacy — the same synthetic corpus our v0 gold is derived from. It isn’t cheating; it’s good work. But it means part of its lead is home-field advantage: it has seen data drawn from the same distribution it’s being tested on. A leaderboard that prints the ranking without flagging this is quietly misleading.

Contamination is the default, not the exception

In privacy NLP, a handful of open corpora (AI4Privacy chief among them) are both the common training data and the common evaluation data. So overlap between a model’s training set and your benchmark’s source is the normal case, not a rare one. A credible leaderboard has to assume it and design around it:

Provenance on every row — we record the dataset, config, split, and versions behind each score, so a number can always be traced and a contaminated comparison can be spotted.
Held-out, source-separated gold — training and evaluation kept strictly apart, with synthetic generation kept separate from the gold.
Data nobody has trained on — the real differentiator. Results on a corpus that no baseline has seen are the only ones immune to this effect.
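
The first of these obligations can be sketched as a simple flag attached to each row. This assumes each row records the model's known training sources, which is an assumption about the data available rather than a description of the actual leaderboard code; the model names and scores are placeholders.

```python
def flag_contamination(training_sources, gold_source):
    """Return True if the model's known training data includes the gold's source corpus."""
    normalized = {s.strip().lower() for s in training_sources}
    return gold_source.strip().lower() in normalized


def annotate_rows(rows, gold_source):
    """Attach a 'contaminated' flag to each row without changing its score."""
    return [
        {**row, "contaminated": flag_contamination(row["training_sources"], gold_source)}
        for row in rows
    ]


# Illustrative rows; names and scores are placeholders, not real results.
rows = [
    {"model": "model-x", "f1": 0.6, "training_sources": ["AI4Privacy"]},
    {"model": "model-y", "f1": 0.5, "training_sources": ["some-other-corpus"]},
]
flagged = annotate_rows(rows, gold_source="ai4privacy")
```

The flag annotates rather than filters: the ranking stays visible, but a reader can see which scores carry a home-field advantage.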

That last point is why we’re investing in under-served languages and real-document gold rather than just scaling up the AI4Privacy-derived splits. Romanian, for instance, isn’t in AI4Privacy at all — so a Romanian gold set tests every model on genuinely unseen ground.

The point

“Neutral scorer” isn’t a slogan; it’s a set of obligations: flag contamination, publish provenance, and keep building evaluation data that the field hasn’t already trained on. We’d rather report a smaller, honest gap than a big one that won’t survive scrutiny.

→ Every score on the leaderboard carries its provenance.

A missed ID number is a birthday, a sex, and a county

2026-05-19T00:00:00+00:00

Most PII benchmarks report one number: detection F1. It treats every missed entity as one missed token. But for privacy, not all misses are equal — and detection-F1 hides exactly the misses that matter most.

Take the Romanian CNP (Cod Numeric Personal), the national ID number. It isn’t an opaque string: its 13 digits encode the holder’s data. A single un-redacted CNP deterministically discloses:

date of birth (century + year + month + day),
sex (the leading digit’s parity; the same digit also carries the century), and
county of registration.

So a model that misses one CNP hasn’t dropped “one token” — it has leaked three quasi-identifiers at once, the exact combination that enables re-identification. A model that catches 99% of generic PII but routinely misses national IDs can look great on F1 and still be a privacy liability.
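
To make the disclosure concrete, here is a small decoder for the fields described above. It is a sketch under stated assumptions: it handles only leading digits 1 to 6 (the ones that fix a century), returns the county as its two-digit code rather than a name, and does not verify the check digit. The function name decode_cnp and the sample number are illustrative; the sample is synthetic.

```python
from datetime import date

# Leading digit -> century base for digits 1-6. Digits 7-9 do not pin down
# a century and are deliberately not handled in this sketch.
CENTURY_BY_LEADING_DIGIT = {1: 1900, 2: 1900, 3: 1800, 4: 1800, 5: 2000, 6: 2000}


def decode_cnp(cnp):
    """Decode the quasi-identifiers embedded in a 13-digit CNP string.

    Layout assumed: S YY MM DD CC NNN K (sex/century, birth date, county code,
    serial, check digit). The check digit is not verified here.
    """
    if len(cnp) != 13 or not (cnp.isascii() and cnp.isdigit()):
        raise ValueError("a CNP must be exactly 13 digits")
    leading = int(cnp[0])
    if leading not in CENTURY_BY_LEADING_DIGIT:
        raise ValueError("leading digit outside 1-6 is not handled in this sketch")
    year = CENTURY_BY_LEADING_DIGIT[leading] + int(cnp[1:3])
    birth = date(year, int(cnp[3:5]), int(cnp[5:7]))  # raises on impossible dates
    return {
        "date_of_birth": birth,
        "sex": "M" if leading % 2 == 1 else "F",  # odd = male, even = female
        "county_code": cnp[7:9],
    }


# Synthetic example, not a real person's number.
leaked = decode_cnp("1800101123450")
```

No lookup against any external database is needed: everything in the returned dictionary comes from the string itself.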

Measuring the thing that matters

That’s why our headline metric isn’t detection-F1 — it’s re-identification risk. For national IDs we decode the structure directly and count what a miss actually exposes. We also report a recall-weighted score alongside F1, because in de-identification a false negative (PII left in) is far costlier than a false positive (something harmless redacted).
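
The post does not say which recall-weighted score is used. One common choice is F-beta with beta greater than 1, shown here purely as an illustration; the two models and their precision and recall values are made up.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical models with the same F1 but opposite error profiles.
high_recall = {"precision": 0.5, "recall": 1.0}
high_precision = {"precision": 1.0, "recall": 0.5}

f1_a = f_beta(**high_recall, beta=1.0)
f1_b = f_beta(**high_precision, beta=1.0)
f2_a = f_beta(**high_recall, beta=2.0)
f2_b = f_beta(**high_precision, beta=2.0)
```

With beta set to 1 the two models are indistinguishable; with beta set to 2 the high-recall model scores higher, which matches the intuition that PII left in is the costlier error.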

This also reframes what “good” means. A model that flags a 13-digit string as the wrong type but still redacts it hasn’t leaked anything — redaction cares about coverage, not labels. Detection-F1 penalizes the mislabel; the leakage metric correctly says “no harm done.” Different questions, different metrics — and privacy needs the second one.
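
A label-agnostic leakage count can be sketched in a few lines. This is a simplified stand-in for the idea, not the benchmark's actual metric: it counts gold PII characters that no predicted span covers, and the span format and label names are assumptions.

```python
def leaked_characters(gold_spans, predicted_spans):
    """Count gold PII characters not covered by any predicted span, ignoring labels.

    Spans are (start, end, label) tuples with end exclusive.
    """
    covered = set()
    for start, end, _label in predicted_spans:
        covered.update(range(start, end))
    return sum(
        1
        for start, end, _label in gold_spans
        for pos in range(start, end)
        if pos not in covered
    )


# Gold: a 13-character national ID at offsets 10 to 23.
gold = [(10, 23, "NATIONAL_ID")]
mislabelled = [(10, 23, "PHONE")]  # wrong type, but fully redacted
missed = []                        # nothing redacted

leak_mislabelled = leaked_characters(gold, mislabelled)
leak_missed = leaked_characters(gold, missed)
```

The mislabelled prediction leaks nothing, while the miss exposes all 13 characters, even though a label-strict detection-F1 would count both as errors.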

Why national IDs, and why Europe

National-ID formats are where English-centric models fall down hardest: a CNP, a Spanish DNI, a Polish PESEL each have their own structure and check digits that a model trained mostly on US/UK data has never seen. Getting these right — and measuring the leakage when you don’t — is central to a European privacy benchmark, not a footnote.

We’re building the gold sets to test this properly across languages and domains. The metric is ready; the harder, more honest part — realistic documents — is what we’re working on next.

Harmonizing the PII taxonomy Babel

2026-05-13T00:00:00+00:00

Every PII model speaks a different dialect. OpenAI’s privacy-filter has 8 coarse types; AI4Privacy uses ~98; HIPAA defines 18; the EU’s MAPA project has its own legal-and-medical set; OpenMed expands to 54; tabularisai to 42. If you want to compare these models on the same data — the whole point of a benchmark — you first have to make them agree on what a “name” or an “ID number” even is.

So before any scoring, we built one GDPR-aligned crosswalk: a single harmonized taxonomy with a documented mapping from each external scheme’s labels onto it. It’s deliberately standardization, not invention — the interesting work is reconciliation, not coining new categories.

The rule that makes it sound

A crosswalk is only trustworthy if each native label maps to exactly one harmonized type — i.e. native → harmonized is a function. We enforce that at load time, and it immediately caught real modelling bugs:

HIPAA names was claimed by both PERSON and PROVIDER. But you can’t recover “this name is a clinician, not a patient” from a flat names label — so PROVIDER is a refinement that doesn’t get to claim the source label. names → PERSON.
MAPA ORGANIZATION was claimed by COURT, FACILITY, and ORG_PARTY. Same fix: the general owner wins; the refinements are ours, not the source’s.
medical_record_numbers sat in both the generic account bucket and the clinical MRN type. The clinical-specific type wins.

Each of these would have silently corrupted comparisons — a model credited for the “wrong” type, or double-counted. Failing loudly at build time beats discovering it in a results table.
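
The load-time check can be sketched as an inversion that refuses to proceed when a native label has two owners. The function name build_crosswalk and the dictionary layout are assumptions for illustration; the real crosswalk format is not shown in this post.

```python
def build_crosswalk(harmonized_to_native):
    """Invert {harmonized_type: [native labels]} into {native label: harmonized_type}.

    Raises ValueError if any native label is claimed by more than one
    harmonized type, i.e. if native -> harmonized is not a function.
    """
    native_to_harmonized = {}
    conflicts = {}
    for harmonized, natives in harmonized_to_native.items():
        for native in natives:
            owner = native_to_harmonized.setdefault(native, harmonized)
            if owner != harmonized:
                conflicts.setdefault(native, {owner}).add(harmonized)
    if conflicts:
        details = "; ".join(
            f"{native} claimed by {sorted(owners)}"
            for native, owners in sorted(conflicts.items())
        )
        raise ValueError(f"crosswalk is not a function: {details}")
    return native_to_harmonized


# The first bug described above, reproduced in miniature.
broken = {"PERSON": ["names"], "PROVIDER": ["names"]}
fixed = {"PERSON": ["names"], "PROVIDER": []}
```

Calling build_crosswalk(broken) raises a ValueError naming both claimants, while build_crosswalk(fixed) returns the mapping from names to PERSON. Collecting all conflicts before raising means one build failure reports every double claim at once.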

What we deliberately left out

Schemes like tabularisai’s carry GDPR Article 9 special categories — ethnicity, religion, political opinion, sexual orientation. These are real, high-stakes identifiers, but they need careful design (and they’re not in our gold yet), so for now they map to nothing rather than being force-fit. Better an honest gap than a sloppy mapping.

Why it matters

Without a harmonized taxonomy, “model A scores higher than model B” can just mean “A’s label set happens to line up better with this dataset.” The crosswalk — plus scoring every model only on the types the gold actually annotates — is what lets a leaderboard mean something across models that were never designed to agree.
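
A minimal sketch of that scoring restriction, assuming predictions arrive as (start, end, native_label) spans and the set of gold-annotated types is known up front. The function name, the sample crosswalk, and the labels are hypothetical.

```python
def harmonize_and_filter(predictions, native_to_harmonized, gold_types):
    """Map native labels to harmonized types, then drop anything the gold cannot judge.

    predictions: list of (start, end, native_label) spans.
    Labels missing from the crosswalk map to nothing and are dropped.
    """
    kept = []
    for start, end, native in predictions:
        harmonized = native_to_harmonized.get(native)
        if harmonized is not None and harmonized in gold_types:
            kept.append((start, end, harmonized))
    return kept


crosswalk = {"names": "PERSON", "phone_numbers": "PHONE"}  # hypothetical mapping
gold_types = {"PERSON"}                                    # gold annotates only PERSON
preds = [(0, 5, "names"), (10, 20, "phone_numbers"), (30, 35, "religion")]
scored = harmonize_and_filter(preds, crosswalk, gold_types)
```

Here only the names span survives: the phone prediction is dropped because the gold has no PHONE annotations to judge it against, and the religion label is dropped because it has no mapping.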

→ See it in action on the leaderboard, or read the code on GitHub.