Notes from the privacy program

Methodology, engineering, and results as we build EuroPriv-Bench and the KlusAI privacy models.

  • Introducing EuroPriv-Bench

    We’re releasing EuroPriv-Bench — the first unified pan-European de-identification benchmark. It puts privacy NLP for European languages on a single, GDPR-aligned taxonomy and a privacy-utility metric, rather than the fragmented, English-centric, detection-F1-only picture that exists today.

  • The GPU isn't always the answer

    We benchmark a lot of models, so harness throughput matters. The intuition — “we have a 60-core M3 Ultra GPU, use it” — turned out to be exactly wrong for this workload, in an instructive way.

  • What a neutral leaderboard must control for

    When we ran the first cross-lingual baselines, one model led on most European languages. The easy headline would be “model X wins.” The honest footnote is more interesting — and it’s the kind of thing a benchmark exists to surface.

  • A missed ID number is a birthday, a sex, and a county

    Most PII benchmarks report one number: detection F1. It treats every missed entity as one missed token. But for privacy, not all misses are equal — and detection-F1 hides exactly the misses that matter most.

  • Harmonizing the PII taxonomy Babel

    Every PII model speaks a different dialect. OpenAI’s privacy-filter has 8 coarse types; AI4Privacy uses ~98; HIPAA defines 18; the EU’s MAPA project has its own legal-and-medical set; OpenMed expands to 54; tabularisai to 42. If you want to compare these models on the same data — the whole point of a benchmark — you first have to make them agree on what a “name” or an “ID number” even is.