Privacy Models Program

Research Roadmap

How EuroPriv-Bench becomes the neutral, open scorekeeper of European privacy NLP — across benchmark, models, datasets, and papers.

Living plan · updated 2026-06-07

North star

EuroPriv-Bench aims to become the neutral, open scorekeeper of European privacy NLP: the unified, reproducible, fully redistributable yardstick on which every PII/PHI/de-identification system is ranked, with complete coverage of all 24 EU official languages (plus 4 strategic non-EU European languages, uk/ru/tr/sr, for 28 total) across the legal and clinical domains. This is the unified pan-European completeness claim: EU-24 parity with MAPA on language breadth, won on a contamination-controlled, re-id-aware footing that no F1-only suite has. The 8 EU-completeness languages (the Baltic set et/lv/lt, the ex-Yugoslav EU pair hr/sl plus Slovak sk, Maltese mt, and Irish ga) are exactly the under-served-language wedge the program is built to own: small national languages that mainstream PII tooling neglects.

The program does not try to win as “another multilingual PII model.” It wins by owning the canonical yardstick (the role HELM and lm-eval-harness play for general LLMs) and by reframing the field’s default metric around a published, reproducible finding: detection-F1 does not track re-identification protection. The finding is demonstrated on decode-bearing national identifiers (RO CNP, PL PESEL, IT codice fiscale), where the best detector is not the best protector. The deeper mechanism is general: aggregate detection-F1 can stay high while a model misses the rare, high-stakes tokens that carry the re-identification. National IDs are its clearest, provable case, not the whole of it. Extending the measure to quasi-identifier-combination re-identification is in progress, so the broad claim is a hypothesis under test, not yet settled.

That reframing is the program’s sharpest asset. On the contamination-free Romanian real-skeleton track, a blanket-coverage system (openai/privacy-filter) leaks ~1.4% of Romanian national IDs (CNP) while type-accurate detectors leak 26–35% at much higher detection-F1. No F1-only benchmark can surface that. EuroPriv-Bench is built to measure it, openly and reproducibly, and to extend it carefully — only where an identifier’s structure justifies a re-identification claim.

The position is defensible for a small, agent-augmented team because a leaderboard’s value compounds with submissions and citations, not headcount, and a neutral referee is a seat a vendor-aligned competitor structurally cannot take. The claim discipline throughout is “first unified, not first”: we subsume and re-host TAB / Ai4Privacy / MAPA / MEDDOCAN / MultiGraSCCo through a documented GDPR-aligned crosswalk rather than competing with them, and we ship models only where head-to-head, CI-backed wins exist on under-served languages and domains no baseline was trained on.

How to read this roadmap

This is a living plan organized into three horizons: H1 (~0–3 months, now), H2 (~3–9 months), H3 (~9–24 months). Every deliverable carries a measurable Acceptance line, an Agent tasks sub-list (concrete repo/file-level actions), and a Targets line.

Each axis lives in a specific repo: Benchmark & Leaderboard in europriv-bench (plus the leaderboard UI in klusai-pages-research → research.klusai.com, a private repo); Models in klusai-models; Datasets in klusai-datasets; Papers in klusai-papers. europriv-bench is the single source of truth for the taxonomy and span logic (TAXONOMY_VERSION 0.2.0, bioes_labels(), char_spans_to_bioes, national_id); the dataset and model repos import it and never copy it.

Agents must respect three rules: (1) cleanly-licensed-only sourcing; (2) the MLX-first + DigitalOcean-GPU-burst compute model, in which the M3 Ultra GPU with its large unified memory is the primary engine (mlx-lm lacks encoder archs, but the Mac GPU trains them via PyTorch-MPS or an MLX-native path) and a DO GPU droplet is burst-only, used when the Mac isn’t enough; and (3) every published number ships with provenance (harness version, taxonomy version, dataset config/revision, model_id, timestamp) and a confidence interval.
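The provenance rule above lends itself to a small record type. The following is a minimal sketch, not the harness's actual schema: the class name PublishedNumber, its field names, and every example value except the taxonomy version are hypothetical. Only the list of provenance items (harness version, taxonomy version, dataset config/revision, model_id, timestamp) and the confidence-interval requirement come from this roadmap.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PublishedNumber:
    """One published metric value plus the provenance the roadmap requires.

    Hypothetical type for illustration; not the europriv-bench schema.
    """

    metric: str
    value: float
    ci_low: float
    ci_high: float
    harness_version: str
    taxonomy_version: str
    dataset_config: str
    dataset_revision: str
    model_id: str
    timestamp: str  # ISO-8601, UTC

    def __post_init__(self) -> None:
        # Reject rows with blank provenance fields.
        for name, field_value in asdict(self).items():
            if isinstance(field_value, str) and not field_value.strip():
                raise ValueError(f"missing provenance field: {name}")
        # Reject rows whose interval does not bracket the point value.
        if not (self.ci_low <= self.value <= self.ci_high):
            raise ValueError("confidence interval must bracket the value")


# Placeholder values only; none of these are real results.
example = PublishedNumber(
    metric="example_metric",
    value=0.50,
    ci_low=0.40,
    ci_high=0.60,
    harness_version="0.0.0-example",
    taxonomy_version="0.2.0",
    dataset_config="example-config",
    dataset_revision="example-revision",
    model_id="example-org/example-model",
    timestamp=datetime.now(timezone.utc).isoformat(),
)
```

Validating at construction time means a number without provenance or without an interval cannot be built in the first place, which is one way to enforce the rule mechanically rather than by review.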

Ground-truth note (verified 2026-06-01, visibility updated 2026-06-06): The four open-science program repos (europriv-bench, klusai-models, klusai-datasets, klusai-papers) are public and in sync with origin/main; the HF dataset klusai/europriv-bench is public (8 configs live). klusai-pages-* repos are private (org policy) — sites stay public via GitHub Pages. H1 is re-baselined against actual state. The genuinely-open admin items are narrower (Pages “Enforce HTTPS” UI checkbox; org-level Actions secrets/permissions for submission CI) and are tracked under Benchmark H1.

Status & updates

Live view — last updated 2026-06-07. The full plan is below; this section tracks what has shipped and what is next, and is refreshed as work lands.

✓ Shipped (H1)

Benchmark — the re-identification-risk metric is generalized to national IDs for RO (CNP), PL (PESEL), IT (codice fiscale via the Belfiore code), with ES (DNI/NIF) handled as coverage-only; every leak-rate now carries a harness-emitted Wilson confidence interval; the leaderboard is schema 3 with contamination + citable-status badges; the taxonomy is a versioned conf/taxonomy.yaml under a GOVERNANCE.md stability contract; and an externally-contributable submission CI with a reproduction gate makes the board open to outside models.
Benchmark — first real-data gold config (TAB). TAB (Text Anonymization Benchmark, English-legal text derived from ECHR judgments, MIT-licensed) is integrated as the first REAL-data gold config in the suite (config_status=real-external-gold, 127 documents, 0 misaligned spans after taxonomy projection). Honest finding: real legal English is hard — best model is Presidio at entity-F1 0.589, and our own kp-deid is OOD at 0.199 (RO-trained, doesn’t transfer to real EN legal). The TAB-derived gold config is private (licensing/redistribution review pending) — no public config to load yet; numbers are reproducible internally against the committed harness.
Benchmark — external de-saturating detection eval (Ai4Privacy). Adopted Ai4Privacy open-core (openpii-1m, CC-BY-4.0) as an external, independent, discriminating detection eval. Our own synthetic detection eval had saturated (a control run at entity-F1 1.000); the external eval de-saturates the same models to a 0.41–0.67 entity-F1 spread. The Llama-licensed 500k tier was caught and excluded by the cleanly-licensed-only gate. config_status=dev.
Benchmark — a second, independent re-identification channel (QI / k-anonymity). Activated a within-corpus k-anonymity-violation diagnostic over residual quasi-identifiers, surfacing sample distinctiveness (how unusual a record is within the corpus) that survives redaction. This is explicitly labelled residual distinctiveness, NOT population re-identification — the word re-identification stays reserved for the deterministic national-ID channel. config_status=dev; an internal sensitivity signal, not a citable claim.
Datasets — the reusable LocalePack (RO/EN/PL; checksum-valid identifiers, offset-correct-by-construction gold) and the first three open datasets on Hugging Face: ds-kp-general-{ro,en,pl}-50k (50,000 documents each, CC-BY).
Models — the first kp-deid-mdeberta-280m is published and now featured on the public leaderboard — the first kp-deid model on the board, and the best protector at 0% CNP leak-rate (entity-F1 0.74), trained on-device on the M3 Ultra. Mac-GPU (Metal/MPS) training is enabled — ~7.7× faster than CPU on the M3 Ultra, so full finetunes run on-device. The protection result replicates zero-shot to a second language and identifier (Polish PESEL): the RO-trained model leaks 0% PESEL on PL having never seen Polish, while type-accurate detectors still leak — so detection accuracy and re-identification protection come apart across two languages and two decode-bearing IDs, not one.
Papers — the submission + artifact-evaluation protocol is written, and a fresh prior-art rescan re-confirmed the “first unified” position (citing concurrent work such as RAT-Bench honestly).
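The Wilson interval attached to every leak-rate in the shipped benchmark item above can be sketched in a few lines. This is an illustrative version under stated assumptions, not the harness's metrics helper: the function names are hypothetical, and only the output keys leak_rate_ci_low / leak_rate_ci_high are taken from this roadmap. The zero-leak upper bounds quoted in the updates log (0.0034 for 1,123 subjects, 0.0035 for 1,096, 0.0198 for 190, 0.0151 for 250) are consistent with this formula at the 95% level.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return max(0.0, centre - half), min(1.0, centre + half)


def leak_rate_row(leaked: int, subjects: int) -> dict[str, float]:
    """Leak rate with its Wilson bounds, keyed like the leaderboard fields."""
    low, high = wilson_interval(leaked, subjects)
    return {
        "leak_rate": leaked / subjects,
        "leak_rate_ci_low": low,
        "leak_rate_ci_high": high,
    }


row = leak_rate_row(leaked=0, subjects=1123)
print(round(row["leak_rate_ci_high"], 4))  # 0.0034
```

Unlike the normal-approximation interval, the Wilson interval stays informative at 0 leaks: the upper bound shrinks with the number of subjects instead of collapsing to zero width, which is why a 0% leak-rate can still be compared against a pre-registered ≤0.02 target.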

▶ In progress

Hardening ro-realskeleton-v1 into validated citable gold (documented RO native-speaker + IAA sign-off) so the kp-deid board row can be promoted from dev to citable-validated. The dissociation now holds across two independent template families (the single-authored-family limitation is retired); native-speaker + IAA validation remains the only gate left before the row can be called citable.
Consolidating the detection≠re-id dissociation across decode-bearing national IDs in ~11 languages (RO/PL/IT plus SE/CZ/DK/FI/EE/LT/SI/SK). This breadth is HELD / in-progress, NOT a validated headline: model coverage is still uneven (an in-progress re-score is fixing it) and native-speaker validation is pending. Treat as a dev-stage coverage push, not a settled claim.
PURR@τ population-uniqueness estimator + reference-population module built as groundwork (internal sensitivity-analysis only). Pending a census-calibrated generator before it can support any population-level uniqueness statement — explicitly not a claim today.
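The independence of the two RO template families is reported in the updates log as a 5-gram Jaccard of 0.0000 between the two skeleton sets. A minimal sketch of such a check follows. The tokenisation (lower-casing and whitespace splitting), the function names, and the two example skeleton sentences are assumptions made for illustration; the roadmap states only the statistic and its value.

```python
def ngram_set(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """All word n-grams of one skeleton (assumed tokenisation: lower + split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def family_ngrams(skeletons: list[str], n: int = 5) -> set[tuple[str, ...]]:
    """Union of n-grams over every skeleton in one template family."""
    grams: set[tuple[str, ...]] = set()
    for skeleton in skeletons:
        grams |= ngram_set(skeleton, n)
    return grams


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical English stand-ins for two independently authored families.
family_a = ["the clinic confirms that the patient attended the appointment on the stated date"]
family_b = ["the registry certifies that the student is enrolled in the second year of study"]

overlap = jaccard(family_ngrams(family_a), family_ngrams(family_b))
print(f"{overlap:.4f}")  # 0.0000
```

A value of exactly zero means no 5-word sequence is shared between the two families, which is a stricter independence statement than a low but non-zero overlap.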

→ Up next

The H1 retrospective (re-baselining H2 from real results), an MLX-native encoder training path, then H2: XLM-R-560m / kp-deid breadth across the T1 languages, the anonymization + downstream-utility track, and legal/clinical synthesis.

Updates log

2026-06-07 — We made the benchmark harder to game — including by us. Two additions on the detection side. (1) First real-data gold config: TAB (ECHR English legal, MIT-licensed) lands as the suite’s first non-synthetic gold (config_status=real-external-gold, 127 docs, 0 misaligned spans). Honest result — real legal English is hard: Presidio leads at entity-F1 0.589 and our RO-trained kp-deid is out-of-distribution at 0.199. The TAB-derived config is private (licensing review pending); numbers reproduce internally only. (2) External de-saturating eval: Ai4Privacy open-core (openpii-1m, CC-BY-4.0) adopted as an independent detection eval — our own synthetic control had saturated at entity-F1 1.000, and the external eval pulls the same models apart to a 0.41–0.67 spread; the Llama-licensed 500k tier was caught and excluded by the licensing gate. Both dev/real-external-gold — measured, contamination-controlled signals, not citable or SOTA.
2026-06-07 — A second, independent re-identification channel. Activated a within-corpus k-anonymity-violation diagnostic over residual quasi-identifiers — measuring sample distinctiveness (residual distinctiveness within the corpus) that survives redaction, as a complement to the deterministic national-ID channel. Explicitly NOT population re-identification, and the term re-identification stays reserved for the structured-ID channel. Separately, a PURR@τ population-uniqueness estimator + reference-population module was built as groundwork (internal sensitivity-analysis only, pending a census-calibrated generator). Both dev.
2026-06-07 — Coverage push (held, not a headline): an in-progress re-score is consolidating the detection≠re-id dissociation across decode-bearing national IDs in ~11 languages (RO/PL/IT + SE/CZ/DK/FI/EE/LT/SI/SK). Model coverage is still uneven and native-speaker validation is pending, so this breadth is framed in-progress / dev, not a validated claim. Compute re-scope confirmed: most of this runs on the M3 Ultra Mac Studio with no GPU budget needed.
2026-06-03 — The dissociation extends to a new domain: legal. Beyond the language axis (RO/PL/IT national IDs), a new structure-only legal real-skeleton track (legal-realskeleton-v1, Romanian — modelled on EUR-Lex / ECHR / DSAR document structure only, with no copyrighted source text redistributed; cleanly-licensed, synthetic PII, guarded by a falsifiable source-phrase-absence test) carries the same per-subject CNP re-identification metric. The detection-≠-protection dissociation holds in the legal genre: type-accurate detectors leak 27–97% of CNPs at high detection-F1 (GLiNER 0.80 F1 → 72% leak) while the kp-deid protector leaks far less — and, reported honestly, kp-deid leaks 4.07% here (Wilson upper 0.052, above its ≤0.02 target on the cleaner structured tracks): a genuine real-legal-genre gap, not a near-zero. Every per-detector difference-of-proportions gap CI still excludes 0. Honest scope: config_status=dev, a single authored legal family (a second independent family is required before any citation), one language (RO) — cross-lingual legal breadth is the next step.
2026-06-03 — The leaderboard now leads with the finding, not the table. The public leaderboard was reworked so a non-expert grasps the thesis at a glance: re-identification leakage (the metric that matters) leads, the detection-vs-protection Pareto figure is embedded up top, leak rates render as colour-coded bars, the detection table defaults best-first with memorisation-inflated in_distribution rows greyed, and the governance jargon folds into a “How to read this” panel. Every figure stays traced to the published data and every dev caveat intact (three-reviewer editorial-panel gate; the pass also corrected an earlier line that wrongly called the best-F1 detector the worst leaker).
2026-06-03 — The dissociation extends to a third language and identifier: Italian codice fiscale. A new it-realskeleton-v1 track joins the board (all 8 models scored). Codice fiscale is a richer re-identification surface than CNP/PESEL — it discloses date of birth, sex, and place of birth (via the Belfiore code). On the contamination-free IT track (224 distinct codice-fiscale subjects), kp-deid leaks 0% while type-accurate detectors leak 32–96%; the per-detector difference-of-proportions Newcombe 95% CI excludes 0 for six of the seven (privacy-filter, honestly, does not). The detection-≠-protection result now holds across RO/PL/IT — three identifiers, three languages. Honest scope: config_status=dev, one authored IT template family for now, pending native-speaker validation.
2026-06-03 — The dissociation replicates across two independent RO template families. The RO real-skeleton finding no longer rests on a single authored skeleton family: it now holds across Family A (official correspondence: clinical / legal / administrative; 190 distinct CNP subjects) and Family B (academic registry: higher-education student records; 250 distinct CNP subjects), with a 5-gram Jaccard of 0.0000 between the two skeleton sets. kp-deid-mdeberta-280m leaks 0% CNP in both families (Wilson 95% upper bound 0.0198 in A, 0.0151 in B, both within the pre-registered ≤0.02 target), while type-accurate detectors keep leaking. The per-family difference-of-proportions (typed-detector minus protector) has a Newcombe 95% CI that excludes 0 for spaCy, GLiNER, GLiNER2, OpenMed and tabularisai in Family A, and for spaCy, GLiNER, GLiNER2, OpenMed, presidio and privacy-filter in Family B; four detectors (spaCy, GLiNER, GLiNER2, OpenMed) therefore clear the bar in both families. The single-authored-template-family limitation is retired. Honest scope unchanged: the track stays config_status=dev, and native-speaker + IAA validation is still the gate before it is citable.
2026-06-03 — The board is filling up — three external models now scored. Two more independent third-party systems joined via the no-secrets submission CI: GLiNER2 (Fastino, Apache-2.0) and spaCy (Explosion, MIT) — neither tuned to compete on re-identification risk. The honest result sharpens the thesis: spaCy, with no structured-ID recognizer, leaks ~89% of Romanian CNPs; GLiNER2 leaks ~29%. Detection tooling that never modelled national-ID structure simply doesn’t protect against re-identification — across three independent external systems now, not one. A self-serve call for submissions is open.
2026-06-02 — The leaderboard is open — first third-party model on the board. Microsoft Presidio was submitted through the automated, no-secrets submission CI and scored on the public configs — the first external model to land on EuroPriv-Bench, proving the contributable path works end-to-end. An honest result: the orchestration baseline protects national IDs (0% CNP leak) yet under-detects fine-grained types (0.22–0.44 entity-F1) — re-identification protection and detection accuracy, once again, are not the same thing.
2026-06-02 — The detection-vs-protection dissociation replicates on a second language and identifier. A new Polish real-skeleton track (PESEL) joins the board with five models scored over 1,096 distinct PESEL subjects: kp-deid-mdeberta-280m leaks 0% PESEL (Wilson CI 0.000–0.0035) at entity-F1 0.76 — zero-shot (it is RO-trained and had never seen Polish) — while type-accurate detectors still leak materially (GLiNER ~58%, tabularisai ~31% at high F1). The “train for protection, not just detection” result now holds across RO/CNP and PL/PESEL. Honest scope: this track is config_status=dev (not yet native-speaker/IAA validated), so it is a strong signal rather than a validated headline — native-speaker + IAA validation comes first.
2026-06-01 — First KlusAI model on the public leaderboard: the full-run kp-deid-mdeberta-280m lands on the contamination-free RO real-skeleton track (ro-realskeleton-v1) at entity-F1 0.74 with a 0% CNP leak-rate (0 of 1,123 leaked, i.e. 1123/1123 protected; Wilson CI 0.000–0.0034). That makes it the best protector on the board, reported as an open head-to-head delta on the contamination-free track. Trained on-device on the M3 Ultra; the row carries full provenance (contamination=clean_held_out, config_status=dev).
2026-06-01 — Kickoff wave shipped: the benchmark’s re-id metric + confidence intervals + schema-3 board + open submission CI; the first open ds-kp datasets; the first kp-deid model with on-device Mac-GPU training; and the first progress post. Compute confirmed Mac-first (M3 Ultra), with cloud GPU as burst-only.
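Several entries above report a per-detector difference-of-proportions with a Newcombe 95% CI. The sketch below assumes the hybrid-score variant (often called Newcombe's method 10), which combines the two Wilson intervals; the log does not say which variant the harness uses, so treat that choice, the function names, and the example counts as assumptions. The counts are invented purely to exercise the code and are not leaderboard results.

```python
import math


def wilson_interval(x: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a single proportion x / n."""
    if n <= 0 or not 0 <= x <= n:
        raise ValueError("need n > 0 and 0 <= x <= n")
    p = x / n
    z2 = z * z
    denom = 1.0 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    return max(0.0, centre - half), min(1.0, centre + half)


def newcombe_diff_ci(
    x1: int, n1: int, x2: int, n2: int, z: float = 1.959964
) -> tuple[float, float, float]:
    """Return (p1 - p2, lower, upper) using Newcombe's hybrid-score interval."""
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_interval(x1, n1, z)
    l2, u2 = wilson_interval(x2, n2, z)
    diff = p1 - p2
    lower = diff - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = diff + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return diff, lower, upper


# Hypothetical counts: a typed detector leaking 58 of 190 IDs vs a protector leaking 0 of 190.
diff, lower, upper = newcombe_diff_ci(58, 190, 0, 190)
gap_excludes_zero = lower > 0.0
```

The statement "the CI excludes 0" in the log corresponds to gap_excludes_zero being true: the typed detector's leak rate is higher than the protector's by more than sampling noise would explain at the 95% level.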

At a glance

Axis	H1 (0–3 mo)	H2 (3–9 mo)	H3 (9–24 mo)
Benchmark	Generalize re-id metric (PESEL/PL, codice-fiscale/IT); contamination flag (schema 3); GOVERNANCE + versioned taxonomy YAML; submission CI v1 (public configs, no-secrets sandbox)	Track C (anonymization+utility); legal track; gated-eval behind proven sandbox; per-track UI	Track D (MIA) if unblocked; EU-24 + 4 non-EU (28); subsume TAB/MEDDOCAN/MAPA splits; v1.0 + DOI
Models	Mac-GPU path (mDeBERTa-280m + LoRA on the M3 Ultra via MPS/MLX); KpModelAdapter; SDK `extract_pii`; first kp rows on RO real-skeleton	XLM-R-560m (GPU-burst gated); coverage-aware “protector” objective; T1 langs; kp-anon + SDK full	T2/T3 under-served langs; decision-gated from-scratch; clinical on credentialed gold
Datasets	LocalePack refactor; `synthetic.generate()` stage-A; release_dataset.py + license CI gate; 3 (ro/en/pl) 50k general-domain slugs	Drift metric; stage-B narrative gen; legal + clinical synthesis; real-skeleton for 3 more langs; T2 packs	T3-EU + T4 packs (EU-24 + 4 non-EU = 28); drift reported across the EU-24 (transfer-gap delta with CIs); contributable locale-pack path
Papers	Paper 1 → arXiv + Zenodo DOI; submission protocol doc; Paper 2 skeleton	Paper 1 → venue; Paper 2 → arXiv/dataset venue; Paper 3 draft (gated on models)	Paper 1 + 3 accepted; Paper 4 (utility+MIA) → PETS; Paper 5 (position) as flex buffer

Benchmark & Leaderboard — the neutral scorer

Vision. EuroPriv-Bench becomes the openly-licensed, reproducible suite the European privacy-NLP field gets ranked on: four tracks (detection; re-identification/leakage; anonymization-utility; membership-inference), a versioned GDPR-aligned taxonomy, a contributable public leaderboard with automated eval CI, and contamination-controlled citable gold.

World-class bar. Operational + methodological parity with HELM (multi-metric, reproducible per-scenario runs), lm-eval-harness / Open-LLM-Leaderboard (one-command reproducibility, model-agnostic adapters, automated submission CI), and BIG-bench (versioned tasks, contamination control). On privacy specifically: include and beat Presidio as a pipeline baseline; cite Private AI / Tonic Textual and score where an API is testable; subsume Ai4Privacy, TAB, MAPA, MEDDOCAN, MultiGraSCCo. The open seam none of the privacy prior art has: a working external submission flow with contamination guards.

Verified starting state. Detection (Track A) is wired and strong. The other three tracks raise NotImplementedError, and runner.run_spec hard-raises for all non-DETECTION tasks (“Phase 4”). src/europriv_bench/national_id.py is RO-only (101 lines; CNP decodes sex + county + DOB). The leaderboard is schema 2, with 4 baseline models × 8 leaderboard specs, each mapping 1:1 to one of the 8 published HF dataset configs (6 general-language configs + 2 RO configs).

H1

Generalize cnp_leakage into a careful re-identification-risk metric. Lead with the only listed IDs whose structure actually encodes quasi-identifiers: PESEL (PL) and codice fiscale (IT) (both carry DOB + sex; codice fiscale also encodes place/comune of birth via the Belfiore code, not “region”). Make national_id_leakage a country-dispatched (RO/PL/IT) row-metric that folds in / aliases cnp_leakage so the board carries a single re-id-risk metric family; register it in ROW_REGISTRY. DNI/NIF (ES) is explicitly a coverage-only validator (no quasi-ID decode) — it encodes nothing, so it is never reported as a re-id-risk number, only as detection coverage.
Staged EU-completeness re-id extensions (after RO/PL/IT land). As the country-keyed registry grows, add the next decode-bearing national IDs whose structure encodes quasi-identifiers: EE isikukood (DOB + sex; century via leading digit), LT asmens kodas (DOB + sex; century via leading digit), SI EMŠO (DOB + sex + region), and SK/CZ rodné číslo (DOB + sex). These are reported as re-id-risk numbers exactly like RO CNP / PL PESEL / IT codice fiscale. Sequence: RO/PL/IT first per the existing plan, then EE/LT/SI/SK as the EU-completeness re-id wins.
Coverage-only validators (encode nothing → detection coverage, never a re-id number, like ES DNI/NIF): HR OIB, MT ID card, IE PPS. LV personal code is split: the pre-2017 format encoded DOB (decode that legacy format as re-id-bearing), but codes issued since 2017 are randomized and encode nothing (treat the new format as coverage-only) — the validator must distinguish the two and never emit a re-id number for a post-2017 code. Keep decode-bearing and coverage-only IDs cleanly separated in the registry and on the board.
Contamination flag, machine-readable. Bump leaderboard to schema 3 with a per-(model,config) contamination enum; seed the known OpenMed/tabularisai-trained-on-Ai4Privacy cases; render an in-distribution vs clean-held-out marker on the site.
Governance + versioned taxonomy. Move the full crosswalk out of the taxonomy.py docstring into a loaded conf/taxonomy.yaml (keep TAXONOMY_VERSION in sync); add a contract test against bioes_labels(); write GOVERNANCE.md (immutable config names, version-comparability rules, metric-stability contract, CHANGELOG).

Config-status field on the leaderboard schema. Add a per-(model,config) config_status enum (dev | citable-validated | real-external-gold) to the schema 3 rows and document it in GOVERNANCE.md. A config is citable-validated only when the Datasets H1 native-speaker/IAA sign-off is recorded in the dataset card (see Datasets H1); real-external-gold marks a real, externally-sourced gold config (e.g. TAB) that is not one of our own synthetic skeletons. The site never renders a config as citable without that sign-off.

Submission CI v1 — public configs only, no-secrets sandbox. PR/issue template (HF model id + adapter scheme + model card), a GitHub Actions job that builds the adapter via adapters.BUILDERS, runs europriv run on public configs in a no-secrets sandbox, validates a filled model card, and appends a provenance-stamped row. Add a “reproduce a published number” CI gate with a pinned numeric tolerance: reproduce privacy-filter English entity-F1 = 0.415 ±0.02 (taxonomy 0.2.0, n=1500, eval-label-fair mask) against the committed leaderboard.json row (exact value 0.4149).
Resolve the real admin items: Pages “Enforce HTTPS” checkbox; org-level Actions secrets/permissions for CI. (The “flip repos public” item is dropped — already done.)
First real-data gold + an external de-saturating eval (shipped, dev / real-external-gold). Wire TAB (ECHR English-legal, MIT) as the suite’s first non-synthetic gold config (config_status=real-external-gold, 127 docs, 0 misaligned spans) — establishing that real legal English is hard (Presidio 0.589 entity-F1; OOD kp-deid 0.199) and exposing where the synthetic-trained model doesn’t transfer; the TAB-derived config stays private (licensing review pending). Adopt Ai4Privacy open-core (openpii-1m, CC-BY-4.0) as an external, independent detection eval that de-saturates the previously-ceiling-pinned synthetic control (1.000 → 0.41–0.67 spread); the Llama-licensed 500k tier is excluded by the licensing CI gate. This adds real-external-gold alongside dev / citable-validated as a config_status value.
Second re-identification channel — QI / k-anonymity diagnostic (shipped, dev). A within-corpus k-anonymity-violation diagnostic over residual quasi-identifiers, reporting sample distinctiveness (residual distinctiveness, not population re-identification — the term re-identification stays reserved for the deterministic national-ID channel). Plus a PURR@τ population-uniqueness estimator + reference-population module as groundwork (internal sensitivity-analysis only, pending a census-calibrated generator). Both are dev/internal signals, not citable.
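The within-corpus k-anonymity-violation diagnostic in the last item can be illustrated as follows. This is a sketch of the general technique, not the shipped diagnostic: the function name, the choice of quasi-identifier fields, and the example records are hypothetical. It measures sample distinctiveness only, in line with the labelling above, and says nothing about uniqueness in a real population.

```python
from collections import Counter
from typing import Hashable, Sequence


def k_anonymity_violation_rate(
    records: Sequence[tuple[Hashable, ...]], k: int = 2
) -> float:
    """Share of records whose residual quasi-identifier tuple occurs
    fewer than k times within the same corpus."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not records:
        return 0.0
    class_sizes = Counter(records)  # size of each equivalence class
    violating = sum(1 for record in records if class_sizes[record] < k)
    return violating / len(records)


# Hypothetical residual quasi-identifiers left after redaction:
# (sex, decade of birth, city).
residual = [
    ("F", "1980s", "Cluj"),
    ("F", "1980s", "Cluj"),
    ("M", "1990s", "Iasi"),
]
rate = k_anonymity_violation_rate(residual, k=2)  # one of three records is unique
```

Because the equivalence classes are counted inside the corpus, the result depends on corpus size and composition, which is one reason to keep it as an internal sensitivity signal and not a re-identification claim.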

Acceptance: europriv run produces a non-stub re-identification-risk number for PL and IT with Wilson CIs emitted by the harness (new metrics helper), not hand-computed, committed to leaderboard.json and rendered on research.klusai.com; every row carries a contamination flag and a config_status (dev | citable-validated) field, and the site visibly separates in-distribution from clean held-out; GOVERNANCE.md + versioned taxonomy YAML merged with make check green; one external-style submission (e.g. a Presidio adapter) is scored end-to-end through the no-secrets CI and appears on the live board with a model card.
Agent tasks:
Implement a Wilson score-interval helper in metrics.py and have cnp_leakage / national_id_leakage return leak_rate_ci_low/leak_rate_ci_high, backfilling the CI into existing leaderboard.json rows.
Refactor national_id.py to a country-keyed validator registry, add PESEL/codice-fiscale (quasi-ID decode) + DNI-NIF (coverage-only), make national_id_leakage country-dispatched (RO/PL/IT) and fold in/alias cnp_leakage, and update tests/test_ro.py.
Extend runner.run_spec output + leaderboard.py to schema 3 with contamination + config_status.
Add conf/taxonomy.yaml + contract test.
Add .github/ISSUE_TEMPLATE/PR template + Actions workflow (no-secrets) + the privacy-filter-F1 reproduction gate.
Update the Liquid template in klusai-pages-research to render the flag.
Targets: ≥2 non-RO languages with a real re-id number; 100% of rows flagged for contamination (today 0%); 1 external-style submission through CI.
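The country-keyed validator registry named in the agent tasks, with its split between decode-bearing and coverage-only identifiers, might look roughly like this. It is a sketch under stated assumptions and not the actual national_id.py: every class, function, and registry name here is hypothetical. The PESEL rules used (checksum weights 1-3-7-9, century offset carried in the month field, odd tenth digit for male) and the DNI control-letter table are the publicly documented formats. The ES validator covers only the plain 8-digit DNI form, not the other NIF variants.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional


@dataclass(frozen=True)
class Decoded:
    birth_date: date
    sex: str  # "M" or "F"


@dataclass(frozen=True)
class IdSpec:
    validate: Callable[[str], bool]
    decode: Optional[Callable[[str], Decoded]]  # None means coverage-only

    @property
    def reid_bearing(self) -> bool:
        return self.decode is not None


_PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)
_PESEL_CENTURY = {0: 1900, 20: 2000, 40: 2100, 60: 2200, 80: 1800}
_DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"


def pesel_valid(value: str) -> bool:
    if len(value) != 11 or not (value.isascii() and value.isdigit()):
        return False
    total = sum(w * int(d) for w, d in zip(_PESEL_WEIGHTS, value))
    return (10 - total % 10) % 10 == int(value[10])


def pesel_decode(value: str) -> Decoded:
    if not pesel_valid(value):
        raise ValueError("invalid PESEL")
    yy, mm, dd = int(value[0:2]), int(value[2:4]), int(value[4:6])
    offset = (mm // 20) * 20  # the month field carries the century offset
    birth = date(_PESEL_CENTURY[offset] + yy, mm - offset, dd)
    return Decoded(birth, "M" if int(value[9]) % 2 else "F")


def dni_valid(value: str) -> bool:
    v = value.upper()
    if len(v) != 9 or not (v[:8].isascii() and v[:8].isdigit()):
        return False
    return _DNI_LETTERS[int(v[:8]) % 23] == v[8]


REGISTRY = {
    "PL": IdSpec(validate=pesel_valid, decode=pesel_decode),  # decode-bearing
    "ES": IdSpec(validate=dni_valid, decode=None),            # coverage-only
}


def decode_if_reid_bearing(country: str, value: str) -> Optional[Decoded]:
    """Quasi-identifiers for decode-bearing IDs; None for coverage-only or invalid ones."""
    spec = REGISTRY[country]
    if spec.decode is None or not spec.validate(value):
        return None
    return spec.decode(value)
```

With the checksum-valid illustrative PESEL 44051401359, decode_if_reid_bearing("PL", ...) yields a birth date of 1944-05-14 and sex "M", while the checksum-valid DNI 12345678Z validates but decodes to nothing. Modelling coverage-only IDs as decode=None makes it structurally impossible to emit a re-id number for them.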

H2

Track C — anonymization + downstream utility. Wire Task.ANONYMIZATION in run_spec (remove the hard raise); implement utility_after_redaction (downstream-task accuracy + readability delta) and a redaction-consistency metric; ship a Presidio-redactor baseline + one LLM-based anonymizer.
Build the gated-eval trust boundary as a standalone, preceding deliverable. Public-config scoring stays in the no-secrets sandbox; only after the trust boundary is proven does a trusted runner score against PRIVATE gold (HF token), bursting heavy models to a DigitalOcean GPU droplet and writing back only the aggregate row.
Multi-jurisdiction LEGAL track — distinct from the existing single-jurisdiction RO real-skeleton track (ro-realskeleton-v1 is already domain-tagged legal); H2 adds cross-border EUR-Lex / MultiEURLEX real-skeleton + synthetic PII across ≥3 languages, and activates evaluations/pii-detection-ro-legal.yaml off _planned.
Wire the MEDDOCAN gold config (curated in Datasets H2) into the runner as a citable clinical config. Register the curated open-synthetic MEDDOCAN gold as a scored config so the clinical track is citable (validated, contamination-controlled) in H2, alongside the legal track.
Leaderboard-as-product v2: per-track tabs, per-language/per-domain sortable views, CI badges, per-row “reproduce this row” command.
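The roadmap names a redaction-consistency metric for Track C without defining it. One plausible reading, offered here only as an assumption, is the share of distinct original surface forms that an anonymizer always maps to the same replacement. The function name, the case-folding choice, and the example pairs below are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def redaction_consistency(pairs: Iterable[tuple[str, str]]) -> float:
    """Fraction of distinct original mentions that map to exactly one replacement.

    `pairs` holds (original_surface, replacement) for every redacted span.
    Assumed definition for illustration; not the benchmark's metric.
    """
    replacements: dict[str, set[str]] = defaultdict(set)
    for original, replacement in pairs:
        # Assumption: mentions differing only by case are the same entity.
        replacements[original.casefold()].add(replacement)
    if not replacements:
        return 1.0
    consistent = sum(1 for seen in replacements.values() if len(seen) == 1)
    return consistent / len(replacements)


# Hypothetical anonymizer output: the person is replaced consistently,
# the city receives two different placeholders.
spans = [
    ("Ana Pop", "[PERSON_1]"),
    ("ana pop", "[PERSON_1]"),
    ("Cluj", "[LOC_1]"),
    ("Cluj", "[LOC_2]"),
]
score = redaction_consistency(spans)  # 0.5
```

A consistency score matters for downstream utility because inconsistent placeholders break coreference: a reader or a downstream model can no longer tell that two mentions refer to the same entity.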

Acceptance: detection + anonymization run end-to-end with real metrics (no NotImplementedError on populated tracks), each with ≥1 published row + CIs; a privacy-utility tradeoff figure (re-id risk vs downstream utility) reproducible for ≥2 anonymizer baselines; external models scored against gated gold via the proven sandbox without exposing the gold; ≥3 external submissions live.
Agent tasks:
Implement metrics.utility_after_redaction + redaction-consistency.
Add a Presidio BaseAdapter.anonymize.
Build the trusted-runner Action (HF token, klusai-webos/tools/do-cli burst, aggregate-only writeback) after the sandbox boundary is documented and tested.
Wire the legal YAML into the runner.
Wire the curated MEDDOCAN gold config into the runner as a citable clinical config.
Targets: ≥16 of 24 EU languages with detection baselines; both a legal and a clinical config citable (validated, contamination-controlled).

H3

Track D — membership-inference (LiRA/shadow-model), deferred from H2 because it stacks three unbuilt/unfunded dependencies (a KlusAI model trained on sensitive text, GPU-burst budget, a credentialed sensitive corpus). Lead the “beyond F1” thesis on Track B (re-id) + Track C (utility) until those clear.
Complete all 24 EU official + 4 non-EU languages (28) + both domains across all four tracks; freeze v1.0 with a Zenodo DOI and a contamination-audit report. Per the feasibility staging, the EU-24 long tail is met by detection baselines + the re-id metric (where the national ID decodes quasi-identifiers) + checksum-valid synthetic packs — not deep four-track-by-two-domain coverage for all 24; deep coverage stays T1(+RO), extending to T2 as capacity allows.
Subsume prior art operationally: re-host and score TAB (EN-legal), MEDDOCAN (ES-clinical), a MAPA-derived split inside the harness, each reporting its native metric alongside KP metrics. Claim subsumption only once the split actually runs (re-hosting is unbuilt and partly licensing-gated).

Acceptance: all 24 EU official + 4 non-EU languages (28) + legal & clinical with citable, contamination-controlled gold and baselines across populated tracks (EU-24 long tail via detection + re-id + checksum-valid packs; deep multi-track stays T1(+RO)→T2); v1.0 tagged + DOI; TAB/MEDDOCAN/a MAPA split runnable in-harness reporting both metrics; ≥10 distinct external models scored, incl. ≥1 commercial API where testable.
Targets: ≥1 external group cites EuroPriv-Bench numbers or submits a model; every site number reproducible against a pinned (harness, taxonomy, dataset-rev) triple.

Models — the KlusAI Privacy (kp-*) family

Vision. Ship the open, research-grade, GDPR-aligned model family the leaderboard exists to crown — winning head-to-head where competitors are weak: under-served European languages (Romanian first), legal + clinical domains, and re-identification protection (coverage), not just detection-F1. Everything is continue-finetuning of proven open bases on MLX, with disposable DigitalOcean GPU bursts; every claim is a leaderboard delta with CIs.

World-class bar. Beat tabularisai, OpenMed, GLiNER-PII, openai/privacy-filter at entity/span-F1 on EuroPriv configs no competitor trained on (RO real-skeleton first). Be Pareto-competitive on the F1-vs-leak frontier our own paper exposed. Match Presidio / Private AI / Tonic on SDK ergonomics. Piiranha is eval-baseline-only (CC-BY-NC-ND) — never finetuned on or redistributed.

Verified starting state. scripts/train.py raises NotImplementedError (“Phase 3”); the SDK functions (extract_pii/deidentify/pseudonymize) are stubs; no kp-* weights exist yet.

H1

Default to the MLX-fittable path (mDeBERTa-280m token classifier + LoRA) as the committed H1 deliverable. The XLM-R-560m and MoE continue-finetune are GPU-burst-gated and move to H2 unless the DO budget (open question #6) is confirmed in H1 — see kill/pivot triggers. The MoE was the original primary track in conf/models.yaml; its demotion to GPU-gated H2 is deliberate — conf/models.yaml is the family registry, not the H1 priority order.
MLX-fit precondition + fallback. Validate mlx-lm LoRA token-classification on mdeberta-v3-base in week 1; if MLX stalls (the MoE already runs ~3× slower on MPS, so MLX-fit is unverified for this base), fall back to CPU transformers+peft on the Mac Studio (4-thread) — still no GPU burst, mirroring the GPU kill-trigger discipline.
Implement the device-agnostic backend behind scripts/train.py (mlx-lm + transformers/peft, dispatched on the Backend enum); wire scripts/evaluate.py to register a checkpoint as a EuroPriv adapter and shell out to europriv run.
Add KpModelAdapter to europriv-bench adapters.py and register in BUILDERS. Mechanism: either register a kp identity scheme in europriv_bench.crosswalk (to_kp returns the label unchanged for scheme='kp') and subclass PrivacyFilterAdapter, or implement KpModelAdapter on the GLiNERAdapter pattern calling kp_entities_to_bioes directly (kp models already emit native KP labels) — an identity crosswalk is not automatic, since to_kp returns None for any unregistered scheme.
GLiNER stays an eval baseline only — kp-deid-gliner-ml exists in conf/models.yaml/the Family enum with a working GLiNERAdapter, but no kp GLiNER finetune is planned; it is scored as a competitor baseline, not shipped as a kp model.
Implement SDK extract_pii() against the shipped kp model (CPU-default, 4-thread per the M3-Ultra perf finding); pip-installable klusai-privacy.
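
The identity-crosswalk option described above can be illustrated with a small stand-in registry. This is a minimal sketch, not the europriv_bench.crosswalk implementation: `register_scheme`, the `_SCHEMES` dict, and the `NATIONAL_ID` label are hypothetical names introduced here. Only the behaviour that `to_kp` returns None for an unregistered scheme is taken from the roadmap.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for a crosswalk registry. The real
# europriv_bench.crosswalk internals are not shown in this roadmap.
_SCHEMES: Dict[str, Callable[[str], Optional[str]]] = {}


def register_scheme(name: str, mapper: Callable[[str], Optional[str]]) -> None:
    """Register a label mapper for a named source scheme."""
    _SCHEMES[name] = mapper


def to_kp(label: str, scheme: str) -> Optional[str]:
    """Map a source label to a KP label; None for any unregistered scheme."""
    mapper = _SCHEMES.get(scheme)
    return mapper(label) if mapper is not None else None


# Before registration there is no identity mapping, so the label is dropped.
assert to_kp("NATIONAL_ID", scheme="kp") is None

# Registering an identity scheme lets native KP labels pass through unchanged.
register_scheme("kp", lambda label: label)
assert to_kp("NATIONAL_ID", scheme="kp") == "NATIONAL_ID"
```

The two assertions mirror the contract test named in the acceptance criteria: a KP-labelled span should survive the adapter unchanged only after the scheme is registered.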

Acceptance: kp-deid-mdeberta-280m v1 published (weights + -mlx variant + model card) and scored on RO real-skeleton with bootstrap CIs; the SDK DEFAULT_MODEL is klusai/kp-deid-mdeberta-280m (the H1 default), so pip install klusai-privacy; extract_pii(text) returns KP-typed spans reproducing the leaderboard row against a model that actually exists in H1; make check green; every kp row carries full provenance. A SOTA-style claim is stated only as a CI-backed head-to-head delta on a contamination-free track — framed as “open, head-to-head win on RO real-skeleton,” not a naked “SOTA.” Agent tasks: implement train.py:_run backends + push_to_hub(publish_id()); wire evaluate.py→adapter; add KpModelAdapter to BUILDERS (with the kp identity scheme registered in crosswalk or the GLiNER-pattern path) plus a contract test round-tripping a KP-labelled span through KpModelAdapter unchanged; flip the SDK DEFAULT_MODEL from klusai/kp-deid-moe to klusai/kp-deid-mdeberta-280m in klusai/privacy/sdk/__init__.py (the MoE default returns only if/when kp-deid-moe ships in H2); implement extract_pii() in klusai/privacy/sdk/__init__.py; rename the conf/models.yaml variant kp-deid-mdeberta → kp-deid-mdeberta-280m (matching the kp-deid-xlmr-560m convention) so the registry slug equals the published weight. Targets: ≥1 kp-deid row on RO real-skeleton with non-overlapping CI vs at least one baseline; extract_pii shipped.
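
The "non-overlapping CI vs at least one baseline" target can be checked along the following lines. This is a sketch under stated assumptions: each system is summarized as per-document (tp, fp, fn) counts, the interval is a percentile bootstrap over documents, and the function names are hypothetical rather than part of the europriv-bench harness.

```python
import random
from typing import Sequence, Tuple

# Per-document (true positives, false positives, false negatives).
Counts = Tuple[int, int, int]


def micro_f1(docs: Sequence[Counts]) -> float:
    """Corpus-level entity F1 from summed per-document counts."""
    tp = sum(d[0] for d in docs)
    fp = sum(d[1] for d in docs)
    fn = sum(d[2] for d in docs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(
    docs: Sequence[Counts],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap: resample documents with replacement, recompute F1."""
    if not docs:
        raise ValueError("need at least one document")
    rng = random.Random(seed)
    n = len(docs)
    stats = sorted(
        micro_f1([docs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return stats[lo_idx], stats[hi_idx]


def non_overlapping(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when two (low, high) intervals share no point."""
    return a[1] < b[0] or b[1] < a[0]
```

Non-overlap of two separately bootstrapped intervals is a conservative criterion. A paired bootstrap over the same resampled documents would be tighter, at the cost of requiring both systems to be scored on identical items.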

H2

Protector objective (the headline research bet, not an engineering certainty). Coverage-aware/recall-weighted loss + a numeric-span catch-all head so structured IDs (CNP/IBAN/CUI) are never dropped. Guaranteed floor: the catch-all + a leak-aware SDK decode mode matches privacy-filter’s coverage without its 0.36-F1 collapse, even if the single-model Pareto win slips.
XLM-R-560m + MoE continue-finetunes (GPU-burst); extend kp-deid to T1 languages (de/fr/es/it/nl/pl); first RO-legal model. Note: the live leaderboard has no PL detection baseline yet — a PL baseline must be generated first (depends on H1 PL LocalePack + a PL leaderboard config); en already has a baseline and may substitute as the 6th T1 cell if PL slips.
kp-anon-qwen3-1.7b-lora v1 (checksum-valid locale-coherent surrogates via ro_generators); wire SDK deidentify/pseudonymize. kp-sensitivity-mdeberta document classifier. kp-anon is explicitly gated on Benchmark H2 Track C landing first (metrics.utility_after_redaction + BaseAdapter.anonymize currently raise NotImplementedError, so there is nothing to score against until they ship).

Acceptance: a kp-deid model sits on the leaderboard Pareto frontier on RO real-skeleton (target: CNP leak ≤2% with Wilson CI AND entity-F1 ≥0.80) — stated as a target hypothesis, with the catch-all floor as the committed fallback; kp-deid wins (non-overlapping CI) on ≥3 of 6 T1 configs + RO-legal; kp-anon beats a masking baseline on utility-after-redaction at equal-or-lower re-id risk; klusai-privacy v0.2 ships all three SDK functions. Agent tasks: design + train the recall-weighted loss + catch-all head; LoRA-tune kp-anon with consistent within-doc mapping; implement EuroPriv anonymize/leakage_probe adapter methods for kp-anon. Targets: ≥3 T1 head-to-head wins with significance (the escalate-investment trigger).
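
The "CNP leak ≤2% with Wilson CI" target relies on the Wilson score interval for a binomial proportion. A minimal sketch follows; the function name and the example counts are illustrative assumptions, not program results, and the roadmap does not say whether the 2% bar applies to the point estimate or to the upper bound.

```python
import math
from typing import Tuple


def wilson_interval(leaks: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a leak rate (z = 1.96 gives roughly 95%)."""
    if total <= 0 or not 0 <= leaks <= total:
        raise ValueError("need 0 <= leaks <= total and total > 0")
    p = leaks / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative counts only: 7 leaked identifiers out of 500 gold identifiers.
low, high = wilson_interval(7, 500)
meets_point_target = 7 / 500 <= 0.02
meets_upper_bound_target = high <= 0.02
```

With these illustrative counts the point estimate sits below 2% while the upper bound does not, which is why the acceptance wording should state which of the two readings is intended.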

H3

Scale kp-deid breadth across the under-served wedge — primary: the EU-completeness tier (T3-EU: hr/et/lv/lt/sk/sl/mt/ga) plus the T2 set (pt/el/cs/sv/hu/da/fi/bg); bonus: the non-EU strategic set (uk/ru/tr/sr). The EU-completeness Baltic + ex-Yugoslav-EU + Slovak + Maltese/Irish languages are the defensible breadth wedge mainstream tooling neglects. Deep own-finetunes stay staged/feasible (T1-focused +RO, extending to T2 as capacity allows); for much of T3-EU the EU-24 commitment is met by detection baselines + the re-id metric + checksum-valid packs, with finetunes only where a head-to-head win is reachable.
From-scratch is decision-gated and DROP-by-default: ship only if continue-finetuning measurably plateaus AND a defensible novelty exists (candidate: an MLX-native, structured-identifier-aware encoder). Otherwise formally drop with the plateau evidence recorded.
Clinical models on credentialed gold (if access secured); MIA evaluation of kp-anon on the leakage track; klusai-privacy v1.0 + Presidio-recognizer plugin.

Acceptance: head-to-head CI-backed wins on ≥6 under-served languages (drawn primarily from the EU-completeness tier T3-EU + T2) vs the best open competitor per language; from-scratch model either ships with a measured advantage or is formally dropped with evidence; clinical kp-deid competitive on a credentialed PHI set or the dependency is documented as the blocker. Targets: open, head-to-head wins on ≥6 under-served languages (EU-completeness tier prioritized); Presidio-recognizer plugin shipped.

Datasets — clean, validated, redistributable corpora

Vision. The cleanest, largest, most rigorously validated openly redistributable PII/PHI corpora for European text, built from three sources: legal synthesis, clinical text via curated open-synthetic gold (MEDDOCAN), and credentialed clinical data used eval-only. Gold spans are emitted at generation time (zero annotation cost). The research contribution covers two under-explored areas: a quantified synthetic-context-to-real-context drift metric, and multilingual legal synthesis (thinner in the literature than clinical). H1 is deliberately a general-domain bring-up of the generation/release machinery; legal/clinical synthesis begins in H2.

Canonical coverage set & volume/tier targets (THE single source of truth — every other mention in this roadmap references this table so counts cannot drift). EU total = 2 + 6 + 8 + 8 = 24 (all EU official languages); non-EU bonus = 4; grand total = 28:

Tier	Languages	Target docs / depth per lang
Anchor (H1)	ro · en (EU)	≥50k general; full deep multi-track (RO leads)
T1 (deep: general+legal+clinical)	de · fr · es · it · nl · pl (EU)	≥50k general + ≥20k legal + ≥20k clinical
T2 (general, +legal/clinical where justified)	pt · el · cs · sv · hu · da · fi · bg (EU)	≥20k general; legal/clinical where justified
T3-EU (EU completeness wedge: detection + re-id + checksum pack; deep only where justified)	hr · et · lv · lt · sk · sl · mt · ga (EU)	≥10k general + checksum-valid pack
T4 (non-EU strategic bonus: detection baselines)	uk · ru · tr · sr	≥10k general

Feasibility staging (keeps the small-team constraint credible). The EU-24 commitment is met primarily via (a) detection baselines run through the harness, (b) the re-identification-risk metric where the national ID decodes quasi-identifiers, and (c) checksum-valid synthetic locale packs — not deep own-finetunes, legal/clinical synthesis, or native-speaker-validated citable gold for all 24. The expensive deep work (kp-* finetunes, legal+clinical synthesis, native-speaker-validated gold) stays T1-focused (+RO) and extends to T2 as capacity allows. The long tail (T3-EU + T4) is breadth (coverage), not depth; coverage is staged T1→T2→T3-EU→T4.

A locale PACK (generator + checksum self-test) is distinct from a released SLUG (a volume of validated docs); pack-count milestones carry the slug-volume acceptance above so a “pack” count never stands in for delivered volume.

World-class bar. Match MAPA on EU-24 coverage (all 24 EU official languages) — the unified-pan-European completeness claim — and match Ai4Privacy (open-pii-masking-500k, CC-BY) on scale/docs, while beating both on checksum-VALID locale identifiers (Faker/MT-projected sets emit invalid IDs), offset-correct-by-construction gold spans, a published drift score, and full-provenance HF cards. Cleanly-licensed-only is a hard CI gate (excludes Piiranha CC-BY-NC-ND, Ai4Privacy Llama-bound tiers, LegalNERo NC-ND, MoNERo/MARCELL copyleft).

Verified starting state. Generators are RO-only (ro_generators.py/ro_documents.py/ro_skeletons.py); synthetic.generate() is a stub. make_ro_review.py is a review-prep script — no documented native-speaker sign-off / IAA exists yet, even for RO.

H1

LocalePack abstraction extracted from the RO generators (keep the _fill splice + byte-equality assert + strict char_spans_to_bioes gate unchanged — it is the quality moat). Implement packs for RO (refactor) + EN + PL only to depth in H1 (per the feasibility cut: one new language done deeply beats five done shallow), each with a checksum self-test; DE/FR/ES/IT/NL packs follow as capacity allows.
Fill the synthetic.generate() stub with stage-A (deterministic template/skeleton splice), sharded JSONL via the TinyFabulist pattern.
scripts/release_dataset.py with a 5-gate validator (checksum re-validate, span byte-equality, bioes_labels() contract, train/gold leakage, load round-trip) + --validate-only/--private; extend conf/datasets.yaml into a license registry with a CI gate that hard-fails on any non-allowlisted source.
Uniform dataset-card template emitted by release_dataset.py, carrying: source + license, generator commit + seed, TAXONOMY_VERSION, n_docs, span-byte-equality %, BIOES-validity %, train/gold-overlap = 0, drift score (where available), and validation status — so every ds-kp-* card is uniform and machine-checkable.
Produce documented RO native-speaker + IAA sign-off before Paper 1 hits arXiv: exercise make_ro_review.py, obtain documented RO native-speaker review + inter-annotator agreement, and record the sign-off in the ro-realskeleton-v1 card so the “validated” label can attach in H1.
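
The "gold spans emitted at generation time" idea behind the splice and its equality assert can be sketched as follows. This is an illustration only, not the actual `_fill` implementation: the `Slot` type, the `fill` function, and the labels are hypothetical, and the check compares character slices where the real gate may compare encoded bytes.

```python
from typing import Dict, List, NamedTuple, Tuple, Union


class Slot(NamedTuple):
    """A placeholder in a template, identified by its entity label."""
    label: str


# (start, end, label) in character offsets, end exclusive.
Span = Tuple[int, int, str]


def fill(parts: List[Union[str, Slot]], values: Dict[str, str]) -> Tuple[str, List[Span]]:
    """Splice slot values into a template and record their offsets as gold spans."""
    pieces: List[str] = []
    spans: List[Span] = []
    cursor = 0
    for part in parts:
        if isinstance(part, Slot):
            piece = values[part.label]
            spans.append((cursor, cursor + len(piece), part.label))
        else:
            piece = part
        pieces.append(piece)
        cursor += len(piece)
    text = "".join(pieces)
    # Equality gate: every recorded span must slice back to the injected value.
    for start, end, label in spans:
        assert text[start:end] == values[label], (start, end, label)
    return text, spans


template = ["Subsemnatul ", Slot("PERSON"), ", CNP ", Slot("NATIONAL_ID"), "."]
text, spans = fill(template, {"PERSON": "Ion Popescu", "NATIONAL_ID": "1800101221144"})
```

Because offsets are recorded while the text is built, no post-hoc alignment or annotation pass is needed, which is the property the release gates then re-verify.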

Acceptance: RO/EN/PL packs pass a checksum round-trip at 10k samples with 100% validity (0 structurally-invalid IDs); synthetic.generate() no longer raises and produces ≥50k validated docs/lang for ≥2 langs in one overnight MLX run; every released slug: 100% gold-span byte-equality, clean BIOES projection, label space == bioes_labels(), zero train/gold overlap, loads via datasets.load_dataset; documented RO native-speaker + IAA sign-off recorded in the ro-realskeleton-v1 card. Agent tasks: refactor to klusai/privacy/datasets/data/locales/{base.py,ro.py,en.py,pl.py}; port tinyfabulist-tf3/.../ds_generation/main.py into stage-A; write release_dataset.py + the license-registry CI test; publish ds-kp-general-{ro,en,pl}-50k (private→public); run make_ro_review.py and record the documented RO native-speaker + IAA sign-off in the card. Targets: 3 checksum-valid locale packs; ≥3 ds-kp-* slugs with full provenance cards; documented RO native-speaker/IAA sign-off landed.
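
A checksum round-trip of the kind required above can be sketched for two of the identifiers in scope. The check-digit rules used here are the publicly documented ones for the Romanian CNP (weights 279146358279, modulo 11, remainder 10 maps to 1) and the Polish PESEL (weights 1-3-7-9 repeated, modulo 10). The function names and the self-test harness are assumptions for illustration, and the sketch checks the control digit only, not date or county plausibility.

```python
import random

CNP_WEIGHTS = (2, 7, 9, 1, 4, 6, 3, 5, 8, 2, 7, 9)
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def _all_ascii_digits(value: str, length: int) -> bool:
    return len(value) == length and value.isascii() and value.isdigit()


def cnp_check_digit(first12: str) -> int:
    """Control digit for the first 12 digits of a CNP."""
    remainder = sum(int(d) * w for d, w in zip(first12, CNP_WEIGHTS)) % 11
    return 1 if remainder == 10 else remainder


def is_valid_cnp(cnp: str) -> bool:
    return _all_ascii_digits(cnp, 13) and cnp_check_digit(cnp[:12]) == int(cnp[12])


def pesel_check_digit(first10: str) -> int:
    """Control digit for the first 10 digits of a PESEL."""
    total = sum(int(d) * w for d, w in zip(first10, PESEL_WEIGHTS))
    return (10 - total % 10) % 10


def is_valid_pesel(pesel: str) -> bool:
    return _all_ascii_digits(pesel, 11) and pesel_check_digit(pesel[:10]) == int(pesel[10])


def cnp_round_trip_failures(n: int = 10_000, seed: int = 0) -> int:
    """Generate n checksum-completed bodies and count how many fail validation.

    The bodies are random digits, not realistic CNPs; a real locale pack
    would also emit plausible sex, date, and county fields.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(n):
        body = "".join(rng.choice("0123456789") for _ in range(12))
        if not is_valid_cnp(body + str(cnp_check_digit(body))):
            failures += 1
    return failures
```

A pack's self-test would assert that the failure count is zero at the 10k sample size named in the acceptance criteria.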

H2

Drift-measurement module (drift.py): surface stats + embedding distance (multilingual-E5 centroid + MAUVE) + cross-context-transfer-gap (model trained on synthetic, evaluated on real-skeleton), seeded by the existing RO synthetic-vs-real-skeleton gap.
Stage-B LLM-authored generation (narrative around pre-placed slots; spans recovered by the same splice) — GPU-burst dependency.
Real-skeleton tracks for ≥3 more languages (DE/FR/PL legal via EUR-Lex); T2 locale packs (pt/el/cs/sv/hu/da/fi/bg). EUR-Lex / MultiEURLEX (CC-BY-SA-4.0) is used only to derive document structure/skeletons (not redistributed source text) so outputs stay CC-BY-clean; if substantial source text is retained the slug inherits BY-SA.
Legal + clinical synthesis begins here. Ship scaled synthetic legal slugs (e.g. ds-kp-legal-ro-50k, building on the ro_skeletons.py legal templates) and a clinical real-skeleton deliverable (RO + 2 langs, ≥20k docs/lang, building on the ro_skeletons.py clinical template), each with the H1 5-gate acceptance.
MEDDOCAN is open-synthetic (no DUA needed) — curate it as a gold config in H2; this corrects the earlier “pursue credentialed access” framing.

Acceptance: one comparable drift score per (lang,domain) — comparability contract: same embedding model (multilingual-E5), same centroid/MAUVE config, and a held-out real reference; the literal bar is “transfer-gap delta reported with a 95% bootstrap CI, regardless of sign”, never “improved.” Stage-B’s effect vs stage-A is reported as descriptive distances + the transfer-gap delta with CIs on ≥3 langs — never “we closed the gap” without a held-out real reference and stated margin; 16 packs pass checksum + span gates; real-skeleton for ≥4 langs with documented native-speaker sign-off recorded in the card. Agent tasks: implement drift.py; build the scrub-then-inject real-skeleton pipeline as shared infra; release ds-kp-legal-multi-*. Targets: 16 locale packs; drift reported on legal in ≥2 langs.
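
The centroid component of the drift score can be sketched as a cosine distance between mean embeddings. This is a simplified illustration: it assumes document embeddings have already been computed by the fixed embedding model, it omits MAUVE and the transfer-gap term entirely, and the function names are hypothetical rather than the drift.py API.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty set of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must share one dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; undefined (raises) for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def centroid_drift(synthetic: Sequence[Vector], real: Sequence[Vector]) -> float:
    """Distance between the synthetic centroid and the held-out real centroid."""
    return cosine_distance(centroid(synthetic), centroid(real))
```

The comparability contract matters here: the number is only comparable across (lang, domain) cells if the embedding model and the held-out real reference are fixed in the same way for each cell.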

H3

T3-EU packs (hr/et/lv/lt/sk/sl/mt/ga) completing the EU-24 — detection + re-id baselines + checksum-valid synthetic packs (deep work only where justified) — plus T4 non-EU bonus packs (uk/ru/tr/sr) as detection baselines, bringing the canonical set to 28; credentialed clinical (i2b2/n2c2/MIMIC) under DUA as eval-only gated configs where redistribution is forbidden (MIMIC) — explicitly off the critical path; per-domain slugs; contributable external locale-pack path.
The 1M-doc and 500k aggregates are opportunistic stretch goals, not acceptance criteria (stage-B at that scale may exceed the Mac Studio budget).

Acceptance: all 24 EU official languages have a checksum-valid pack + ≥1 released slug + a detection baseline; the re-id metric is live for every decode-bearing national ID; the 4 non-EU bonus packs are released (28 total); drift reported across the EU-24; ≥1 non-KlusAI locale pack passes all gates. Targets: EU-24 + 4 non-EU bonus packs (28); aggregate validation report published.

Papers — credibility infrastructure

Vision. A tight, compounding sequence of arXiv-first papers, each turning one shipped artifact into peer-reviewed credibility, so that by 2027 a reviewer cannot evaluate a new European de-id system without reporting EuroPriv-Bench numbers.

World-class bar. The artifact+paper discipline of HELM and lm-eval-harness (living, versioned, contributable, DOI’d snapshots), the privacy-utility rigor of TAB (Computational Linguistics), and a venue mix of an ACL-family acceptance plus a PETS/PoPETs or CL-journal anchor for privacy rigor.

H1

Promote Paper 1 (EuroPriv-Bench) from technical report to arXiv, ingesting baselines/leaderboard.json via make results; cut a tagged harness + taxonomy v0.2.0 release and a pinned HF dataset revision, snapshot to Zenodo for a DOI; add CITATION.cff.
Re-run the pre-submission prior-art rescan (MultiGraSCCo/MedPriv-Bench/ASQ-PHI/Azure are fast-moving 2026 artifacts); document a dated comparison table; if any single artifact subsumes the intersection, downgrade the “first unified” claim per the pre-registered pivot.
Author the external submission-protocol doc + PR template (provenance JSON schema; CI validator rejecting stale taxonomy_version). Draft Paper 2 skeleton.
Artifact-evaluation discipline, named per track. For the NLP paper, target the ARR responsible-NLP checklist + ACL reproducibility track; for the PETS-path paper, target ACM Artifact Evaluation (Available + Functional). Pass criterion: a Zenodo DOI + a pinned harness/taxonomy/dataset triple + one-command reproduction = Available + Functional. Note that arXiv-first is ARR-compatible, and no paper is dual-submitted to two venues at once.
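
The CI validator that rejects a stale taxonomy_version in a submission's provenance JSON could look roughly like this. It is a sketch under assumptions: the field names are inferred from the provenance elements listed elsewhere in this roadmap rather than from a published schema, and the pinned version string is a placeholder.

```python
import json
from typing import List

# Assumption: placeholder for the taxonomy version pinned by the harness release.
CURRENT_TAXONOMY_VERSION = "0.2.0"

# Assumption: hypothetical field names; the real provenance schema is not yet authored.
REQUIRED_FIELDS = (
    "harness_version",
    "taxonomy_version",
    "dataset_revision",
    "model_id",
    "timestamp",
)


def validate_provenance(raw: str, current: str = CURRENT_TAXONOMY_VERSION) -> List[str]:
    """Return a list of human-readable errors; an empty list means the record passes."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["provenance must be a JSON object"]
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if not record.get(name)]
    version = record.get("taxonomy_version")
    if version and version != current:
        errors.append(f"stale taxonomy_version: {version} (expected {current})")
    return errors
```

Returning all errors at once, rather than failing on the first, keeps the PR feedback loop short for external submitters.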

Acceptance: arXiv preprint live with a stable ID used as the canonical citation on the HF card, leaderboard page, and research.klusai.com (cite the subdomain, never github.io); a Zenodo DOI resolves to a frozen harness tag + dataset revision; europriv run reproduces Paper 1 Tables 1–2 on a clean checkout in one command (= the Available + Functional pass criterion above); dated rescan documented. Paper 1 ships the RO real-skeleton CNP finding under the “KlusAI-authored real-document skeletons (validation in progress)” label unless the documented RO native-speaker + IAA sign-off (Datasets H1) has landed, in which case it ships under the “validated” label. Agent tasks: reconcile the Paper 1 title across klusai-papers/README.md and papers/europriv-bench/main.tex to the published headline (“…with Re-identification Risk Metrics”) before cutting the Zenodo snapshot, since the DOI record and CITATION.cff freeze the canonical title; wire make results ingestion + schema validation into papers/europriv-bench/main.tex; Zenodo-snapshot script + CITATION.cff; submission-protocol doc + PR template + CI validator. Targets: 1 arXiv preprint + 1 DOI; one-command Table 1–2 reproduction.

H2

Submit Paper 1 to an ACL-family venue against a concrete target-venue + deadline calendar (venue choice re-evaluated each horizon):

Track	Target venue	Window / mechanism
NLP (primary)	EMNLP 2026	ARR submission, next commitment cycle
Privacy	PoPETs	next quarterly PoPETs deadline (4 rolling/yr)
NLP (fallback)	EACL 2027	if the ARR commitment slips

Paper 1’s privacy-venue (PETS/CL) variant and Paper 5 move to flex-buffer, not firm commitments (five papers across disjoint styles overruns a small team).

Before any NLP-venue submission, extend the leak metric to a second structured EU national ID (PESEL/PL or codice-fiscale/IT) so the dissociation finding rests on ≥2 identifiers, not the CNP alone.
Ship Paper 2 (synthetic-drift) to arXiv + a dataset/resources venue, backed by released ds-kp-* slugs and the real-skeleton protocol with documented native-speaker validation. Draft Paper 3 (under-served SOTA) gated on the first kp-deid checkpoints; every claim a leaderboard delta with bootstrap CI + McNemar, reporting both F1 and leak rate per model.

Acceptance: Paper 1 has a venue submission ID citing the Zenodo DOI; Paper 2 on arXiv with loadable ds-kp-* datasets and a drift number on legal in ≥2 langs; every SOTA-style claim in the Paper 3 draft is a CI-backed leaderboard delta with a McNemar p-value and a reported leak rate (McNemar computed on item-paired predictions over a single fixed (lang,domain) config; cross-config deltas use bootstrap CIs only). Targets: Paper 1 submitted; Paper 2 on arXiv; ≥1 ARR responsible-NLP checklist + ACL reproducibility-track application.
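
The item-paired McNemar comparison named above can be computed exactly from the discordant pairs. A minimal sketch, assuming each system's output has been reduced to a per-item correct/incorrect flag over the same fixed config; what counts as an "item" (document or gold entity) is left open here, and the function name is hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(a_correct: Sequence[bool], b_correct: Sequence[bool]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired per-item outcomes.

    Returns (b, c, p) where b counts items only system A got right,
    c counts items only system B got right, and p is the binomial p-value
    under the null that discordant items split evenly.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("predictions must be paired item by item")
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant items: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Only discordant items carry information, which is why the test is restricted to a single fixed config: pairing breaks down as soon as the two systems are scored on different item sets.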

H3

Land Paper 1 + Paper 3 at peer-reviewed venues; populate the utility + MIA tracks and ship Paper 4 to PETS/PoPETs. Paper 4 ships the privacy-utility (Track C) half as committed core; the MIA (Track D) half ships only if Track D is unblocked per Benchmark H3, otherwise Paper 4 ships utility-only and MIA moves to a follow-up. Paper 4 integrates established MIA methodology (PrivLM-Bench/MedPriv-Bench) rather than claiming it as novel. Ship Paper 5 (position/survey) as the low-compute schedule buffer; Paper 5 is intentionally unscaffolded (no repo dir / slug yet) and materializes only if H3 schedule slack appears: it is a flex buffer, not a commitment.

Acceptance: ≥2 KlusAI privacy papers peer-reviewed (≥1 NLP venue + ≥1 privacy venue/CL journal), each with a resolvable DOI + reproducible artifact; the public leaderboard shows ≥1 external submission + a versioned history; EuroPriv-Bench cited by ≥1 external artifact. Targets: 2 accepted papers; populated utility + MIA tracks with published CIs.

Cross-axis dependencies & sequencing

The hard ordering, with the red-team’s corrections applied:

Taxonomy is upstream of everything. europriv-bench (bioes_labels(), char_spans_to_bioes, national_id) is the single source of truth; klusai-datasets and klusai-models import a pinned tag and never copy. Any TAXONOMY_VERSION bump updates test_contract.py in all consuming repos in lockstep.
Datasets before Models before Papers. ds-kp-legal-ro and synthetic legal/clinical slugs must exist before the RO-legal and clinical finetunes; kp-deid checkpoints must exist before Paper 3; the EuroPriv anonymization/leakage adapter methods must be wired before kp-anon/kp-sensitivity can be scored (Paper 4).
Re-baseline H1 off real state, not the stale blocker. “Flip repos/dataset public” is done — drop it. Replace it with the actually-open admin items (Pages HTTPS checkbox; org Actions secrets/permissions for CI).
Sandbox precedes gated eval. The no-secrets public-config CI ships in H1; the trusted-runner gated-gold path (HF token + DO burst) ships in H2 only after the trust boundary is built and proven. Never co-schedule the feature and its security control.
Native-speaker validation gates every “citable validated gold” label, including RO. Sequence: exercise make_ro_review.py → obtain documented RO native-speaker + IAA sign-off → only then label ro-realskeleton-v1 “validated” anywhere public. Until then the site/paper language is “KlusAI-authored real-document skeletons (validation in progress).” This gates Papers 2 and 3 and all T1/T2/T3-EU breadth.
Models GPU dependency is a precondition, not a parallel assumption. XLM-R-560m + MoE finetunes need DO-GPU burst; resolving open-question #6 (budget + per-run cost cap) is an explicit H1 precondition. If unresolved, the committed H1 model deliverable is CPU-only (mDeBERTa-280m + LoRA).
Re-id metric ordering: lead with PESEL/PL + codice-fiscale/IT (PESEL encodes DOB+sex; codice fiscale encodes DOB+sex + place/comune of birth via the Belfiore code); DNI/ES is coverage-only. The mechanism is identifier-specific, not nationally-general.
Paper 2 depends on ≥1 ds-kp-* release + validated RO real-skeleton; Paper 3 depends entirely on kp-deid checkpoints; Paper 4 depends on the H2/H3 benchmark tracks + (for clinical) MEDDOCAN/credentialed access.
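
The claim that PESEL encodes date of birth and sex can be made concrete with a small decoder. This sketch follows the publicly documented PESEL layout (YYMMDD with a century offset added to the month, and sex in the tenth digit); the function and type names are illustrative, and it deliberately skips the control-digit check.

```python
from datetime import date
from typing import NamedTuple

# The month field carries the century: +0 for 1900s, +20 for 2000s,
# +40 for 2100s, +60 for 2200s, +80 for 1800s.
_CENTURY_BY_OFFSET = {0: 1900, 20: 2000, 40: 2100, 60: 2200, 80: 1800}


class PeselInfo(NamedTuple):
    birth_date: date
    sex: str  # "M" or "F"


def decode_pesel(pesel: str) -> PeselInfo:
    """Recover the quasi-identifiers a PESEL discloses (checksum not verified)."""
    if len(pesel) != 11 or not (pesel.isascii() and pesel.isdigit()):
        raise ValueError("PESEL must be exactly 11 ASCII digits")
    yy, month_raw, day = int(pesel[0:2]), int(pesel[2:4]), int(pesel[4:6])
    offset = (month_raw // 20) * 20
    year = _CENTURY_BY_OFFSET[offset] + yy
    # date() raises ValueError for impossible months or days.
    birth = date(year, month_raw - offset, day)
    sex = "M" if int(pesel[9]) % 2 else "F"
    return PeselInfo(birth, sex)
```

A leaked PESEL therefore discloses two quasi-identifiers directly, which is what justifies reporting re-identification risk for it, whereas an identifier that encodes nothing supports only a coverage claim.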

Positioning & KPIs

Moat. Be the neutral, open, fully-redistributable scorekeeper of European privacy NLP — the canonical yardstick competitors get ranked on — anchored by the published, reproducible re-identification-risk finding that no F1-only benchmark can surface. A leaderboard’s value compounds with submissions and citations, not headcount; a neutral referee is a seat a vendor-aligned competitor cannot occupy.

Differentiation (named).

vs openai/privacy-filter, OpenMed, tabularisai: they ship weights with no published eval and optimize English-primary F1; EuroPriv-Bench is the multilingual, re-id-aware evaluation they get ranked on. Ship kp-* models only where head-to-head CI-backed wins exist (under-served langs + multi-jurisdiction legal).
vs Piiranha: cited as a baseline, excluded from the suite by its CC-BY-NC-ND license — the contrast is “the benchmark you can actually redistribute and build on.”
vs Presidio: an orchestration framework, not an evaluation — a baseline integration target and downstream consumer (a Presidio recognizer wrapping kp-* models), not a rival.
vs Private AI / Tonic / John Snow Labs: they win on breadth/SLAs; differentiate on open + reproducible + research-grade + the neutral leaderboard they can be benchmarked on. Never claim “first multilingual anonymization.”
vs TAB / MAPA / MultiGraSCCo / MEDDOCAN: subsume, not compete. Re-host splits/metrics through the documented crosswalk; the differentiation is unification (the intersection none cover at once) + the re-id metric + locale-native (not MT-projected) generation. The survivable claim is “first unified,” not “first.”

KPIs proving arrival.

Leaderboard adoption (the moat metric): distinct models scored, split KlusAI-run vs externally-submitted (today: 4 KlusAI-run baselines, 0 third-party) → target ≥15 total with ≥3 third-party submissions by end-H2.
Citations/endorsement: first external citation in H2; co-citation by a TAB/MAPA/Ai4Privacy-adjacent author as a leading indicator.
Reproducibility/trust: number of independently-published competitor numbers the harness reproduces within tolerance (≥2 by H1) + measurable HF dataset traffic.
Model SOTA wins: (language × domain) cells where a kp-deid model beats the best public baseline at span-F1 with significance — ≥3 under-served-language wins (the escalate trigger) by H2.
Suite completeness: EU-24 official-language coverage with ≥1 live privacy task each by H3 (detection + re-id + checksum-valid pack for the long tail) plus 4 non-EU bonus languages (28 total); deep multi-track (T1 + RO) for ≥7 languages (the deep-tier wins, citable native-validated gold × ≥2 domains × ≥2 tracks beyond detection), staged T1(+RO)→T2.

Credibility-win sequencing. H1: convert shipped artifacts into public credibility — one-command reproduction of Tables 1–2, reproduce two independently-published competitor numbers, arXiv preprint, documented submission flow; lead all messaging with “detection F1 does not track re-identification protection” — scoped to the decode-bearing national identifiers (RO CNP, PL PESEL, IT codice fiscale) where it is proven, with quasi-identifier-combination generalization flagged as in progress, not as a settled universal law. Ship only the MLX mDeBERTa-280m kp-deid baseline on the contamination-free RO real-skeleton; hold XLM-R/MoE and breadth models for H2. H2: turn the referee position into a network effect (first third-party submission, first external citation), harden RO real-skeleton into validated citable gold and replicate in one more language, then finetune kp-deid breadth (XLM-R/MoE, T1 langs) and chase the escalate trigger. H3: complete the suite (utility + MIA), add a clinical track once credentialed data lands, expand toward 10+ languages; arrival signal = external adoption as the default eval + a clinical/legal/EU-funded partner.

Kill / pivot triggers & risks

Kill / pivot triggers (decide explicitly, don’t drift):

No third-party submission by end-H2 → freeze leaderboard-as-product investment; pivot effort to the model track.
kp-deid Pareto win (good detector AND good protector) fails to materialize by mid-H2 → formally fall back to the numeric-span catch-all + leak-aware SDK decode (the guaranteed floor) and drop the single-model Pareto claim from papers.
DO-GPU budget denied (open-question #6 = no) → the CPU-only degraded path (mDeBERTa-280m + LoRA) becomes the committed H1 model deliverable; XLM-R-560m + MoE explicitly deferred. Open-question #6 has an owner (program lead), is decided by end of H1 week 2, with a per-run cost cap of $X; until resolved, the CPU-only path is the committed deliverable.
Track D (MIA) not unblocked by Benchmark H3 → Paper 4 ships utility-only (Track C) and the MIA half moves to a follow-up; do not claim MIA results without the track.
A single artifact subsumes the unified intersection at the pre-submission rescan → pre-registered pivot: downgrade “first unified” and lead with under-served-language SOTA instead.
Native-speaker validation capacity not secured → keep “validated citable gold” claims to RO only; breadth papers wait.

Top risks + mitigations.

Stale ground truth poisoning the plan. The “repos private / push blocked” memory note is stale. Mitigation: memory file corrected before publishing this roadmap; H1 re-baselined; the real admin items (Pages HTTPS, Actions secrets) tracked under Benchmark H1.
Re-id metric doesn’t generalize the way the headline needs. Quasi-ID disclosure is identifier-specific: PESEL encodes DOB+sex, codice fiscale encodes DOB+sex + place/comune of birth (Belfiore code); DNI/NIF encodes nothing. Mitigation: per-identifier classification — report “re-identification risk” only where the encoding justifies it, “coverage” elsewhere; lead with PL/IT.
Contamination invalidating rankings. OpenMed/tabularisai trained on Ai4Privacy (our general-text gold source); the publicly-rendered general-text table currently mixes contaminated and clean systems with no marker. Mitigation: the machine-readable contamination flag (schema 3) is an H1 deliverable, not a nicety; lead all fair comparisons on the contamination-free RO real-skeleton track.
Four-track × 28-language × 2-domain matrix is large for a small team. Mitigation: the long tail (T3-EU + T4) is detection / re-id-baseline breadth only (detection baselines + the re-id metric where the national ID decodes quasi-identifiers + checksum-valid synthetic packs); the deep four-track × 2-domain work stays T1(+RO), extending to T2 as capacity allows. Coverage is staged T1→T2→T3-EU→T4 with a per-tier kill option (stop after any tier if capacity runs out — EU-24 breadth before non-EU bonus). Do one new language deeply in H1 (PL); defer Track D to H3; treat mega-aggregate datasets as stretch goals.
Submission-CI security. Running untrusted adapter code with access to a gold-bearing token is an exfiltration vector. Mitigation: sandbox first (no-secrets public-config scoring in H1); gated-gold scoring only behind a proven trust boundary in H2.
Native-speaker validation asserted but not evidenced (including RO). Mitigation: documented protocol with named/sourced reviewers (or a contractor budget line) as a hard precondition for any “validated gold” label.

Overclaim discipline. No naked “first” or “SOTA”: use “first unified” and “open, head-to-head, CI-backed win on [config].” The detector-and-protector Pareto target is a stated hypothesis with a committed fallback floor, not an engineering guarantee. Subsumption of TAB/MEDDOCAN/MAPA is claimed only once the split runs in-harness. Public wording counts 7 languages (6 general + RO), not “8” (8 is the config count). MEDDOCAN is open-synthetic (no DUA); credentialed clinical (i2b2/MIMIC) stays off the critical path and ships eval-only where redistribution is forbidden.

Status & provenance

This is a living plan, dated 2026-06-01, maintained against verified repo state and revised each horizon. Every deliverable resolves to a measurable acceptance criterion; every published number traces to a pinned (harness version, taxonomy version, dataset config/revision, model_id, timestamp) tuple with a confidence interval. Repos: europriv-bench · klusai-datasets · klusai-models · klusai-papers · site hub klusai-pages-research (private) → research.klusai.com. Dataset hub: klusai/europriv-bench on Hugging Face. Canonical citation: the EuroPriv-Bench arXiv ID + the research.klusai.com handle (never the github.io URL). Compute model: MLX-first on the Mac Studio + disposable DigitalOcean GPU bursts. Sourcing invariant: cleanly-licensed-only; Piiranha eval-baseline-only.