Research

Papers

Open research from the KlusAI privacy program. Each paper ships a working artifact — a benchmark, dataset, or model — not just a writeup.

KlusAI Technical Report · Detection ≠ Re-identification · A unified dissociation · 3 national-ID schemes (RO/PL/IT) · real legal gold (TAB ECHR) · an openly-released protector

June 2026 · Working paper

Detection Is Not Re-identification: A Unified Dissociation and an Open Protector for European De-identification

KlusAI Research · KlusAI

On real external legal gold (TAB ECHR; Pilán et al. 2022), a CJEU-structure-trained checkpoint attains the lowest direct-identifier leak rate: 0.095 (95% CI 0.065–0.136, n = 264 DIRECT subjects), versus 0.496 for the next-best system (spaCy) and 0.500 for Presidio. The paired-bootstrap improvement is Δ = −0.40 (95% CI −0.477 to −0.322), confirmed across 3 seeds. The checkpoint is openly released as klusai/kp-deid-xlmr-560m-legal. The point is the dissociation, not a leaderboard win: detection F1 does not track re-identification protection. The same dissociation holds across three decode-bearing national-ID schemes (RO CNP, PL PESEL, IT codice fiscale), where a high-F1 detector (GLiNER, F1 0.85) still leaks 30.2% of CNP subjects while an openly-released protector leaks 0% at F1 0.74. We reserve 're-identification' for the deterministic national-ID channel; the TAB axis is direct-identifier protection, which bounds re-identification risk. Synthetic real-skeleton numbers are config_status = dev (a finding, not yet citable); the TAB direct-identifier result is on real external gold and is citable within its stated scope (single board, single architecture, TAB-EN-legal). No SOTA claim.

De-identification · Re-identification risk · Detection-protection dissociation · Romanian CNP · TAB ECHR · Open protector

Dataset ↗ · Code ↗ · Leaderboard · Read paper →
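
The paired-bootstrap interval quoted in the abstract above can be illustrated with a minimal sketch. It assumes each system's result is stored as one binary outcome per subject (1 = a direct identifier leaked), resamples subjects with replacement while keeping the two systems' outcomes paired, and reads off a percentile interval. The function name, the percentile method, and the toy outcomes are illustrative assumptions, not the paper's released evaluation code or data.

```python
import random


def paired_bootstrap_delta(leaks_a, leaks_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for leak_rate(A) - leak_rate(B) on paired subjects."""
    if len(leaks_a) != len(leaks_b):
        raise ValueError("outcomes must be paired per subject")
    n = len(leaks_a)
    # Working on per-subject differences keeps the pairing intact when resampling.
    diffs = [a - b for a, b in zip(leaks_a, leaks_b)]
    point = sum(diffs) / n

    rng = random.Random(seed)
    deltas = []
    for _ in range(n_resamples):
        # Resample subjects (not individual outcomes) with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        deltas.append(sum(sample) / n)
    deltas.sort()

    # Symmetric percentile cut-offs, computed as integer ranks.
    k = round(alpha / 2 * n_resamples)
    return point, deltas[k], deltas[n_resamples - 1 - k]


# Invented toy outcomes for 20 subjects (NOT the paper's data).
system_a = [1] * 3 + [0] * 17   # hypothetical protector: 3 subjects leaked
system_b = [1] * 9 + [0] * 11   # hypothetical baseline: 9 subjects leaked

delta, ci_low, ci_high = paired_bootstrap_delta(system_a, system_b)
print(f"delta = {delta:+.2f}, 95% CI [{ci_low:+.2f}, {ci_high:+.2f}]")
```

A negative delta whose interval excludes zero is the pattern the abstract reports; the actual evaluation may differ in resample count, interval method, and how seeds are combined.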

KlusAI Technical Report · EuroPriv-Bench · Pan-European de-identification benchmark · 8 languages (3 decode-bearing real-skeleton re-id tracks) · re-identification-risk metric

June 2026 · Working paper

EuroPriv-Bench: A Unified Pan-European De-identification Benchmark with Re-identification Risk Metrics

KlusAI Research · KlusAI

Detection F1 doesn't predict privacy on decode-bearing national identifiers: the weakest PII detector leaks the fewest Romanian national IDs (1.4%), while the strongest detectors leak 26–35%. Aggregate F1 can stay high while a model misses the rare, high-stakes tokens that carry the re-identification risk; national IDs are the clearest provable case. EuroPriv-Bench is a unified, openly-licensed pan-European de-identification benchmark that scores re-identification risk, not just detection F1.

De-identification · Re-identification risk · Romanian CNP · GDPR taxonomy · 8 languages

Dataset ↗ · Code ↗ · Leaderboard · Read paper →
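
The dissociation described in the abstract above, where aggregate F1 stays high while rare national-ID tokens are missed, can be reproduced on a toy example. The sketch below assumes spans are represented as (index, label) pairs, that each document describes one subject, and that a subject "leaks" when any gold national-ID span goes undetected. The label name NATIONAL_ID, both function names, and all counts are invented for illustration; they are not EuroPriv-Bench's schema, metric definition, or results.

```python
def span_f1(gold_docs, pred_docs):
    """Micro-averaged exact-match span F1 over all documents."""
    tp = sum(len(g & p) for g, p in zip(gold_docs, pred_docs))
    n_gold = sum(len(g) for g in gold_docs)
    n_pred = sum(len(p) for p in pred_docs)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def id_leak_rate(gold_docs, pred_docs, id_label="NATIONAL_ID"):
    """Fraction of subjects with at least one undetected national-ID span."""
    subjects = leaked = 0
    for gold, pred in zip(gold_docs, pred_docs):
        ids = {span for span in gold if span[1] == id_label}
        if not ids:
            continue  # document carries no national ID, so nothing to leak
        subjects += 1
        if ids - pred:  # any gold ID span missing from the predictions
            leaked += 1
    return leaked / subjects if subjects else 0.0


# Toy corpus: 10 documents, each with 9 common spans and 1 national ID.
gold = [{(i, "OTHER") for i in range(9)} | {(9, "NATIONAL_ID")} for _ in range(10)]

# Hypothetical high-F1 detector: finds every common span, misses the ID in 3 documents.
high_f1 = [
    {(i, "OTHER") for i in range(9)} | ({(9, "NATIONAL_ID")} if d >= 3 else set())
    for d in range(10)
]

# Hypothetical protector: finds every ID but only 5 of 9 common spans.
protector = [{(i, "OTHER") for i in range(5)} | {(9, "NATIONAL_ID")} for _ in range(10)]

f1_high, leak_high = span_f1(gold, high_f1), id_leak_rate(gold, high_f1)
f1_prot, leak_prot = span_f1(gold, protector), id_leak_rate(gold, protector)
print(f"high-F1 detector: F1={f1_high:.3f}, leak rate={leak_high:.2f}")
print(f"protector:        F1={f1_prot:.3f}, leak rate={leak_prot:.2f}")
```

Because national-ID spans are only one in ten of the gold spans here, missing them barely moves micro-F1, yet every miss exposes a subject. That asymmetry is why a separate leak-rate metric carries information F1 does not.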