<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://research.klusai.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://research.klusai.com/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-05-30T14:30:34+00:00</updated><id>https://research.klusai.com/feed.xml</id><title type="html">KlusAI Research</title><subtitle>Open privacy-focused models for European languages — benchmarks, datasets, models and papers from the KlusAI privacy program.</subtitle><author><name>KlusAI</name><email>research@klusai.com</email></author><entry><title type="html">Introducing EuroPriv-Bench</title><link href="https://research.klusai.com/benchmark/release/2026/05/30/introducing-europriv-bench.html" rel="alternate" type="text/html" title="Introducing EuroPriv-Bench" /><published>2026-05-30T00:00:00+00:00</published><updated>2026-05-30T00:00:00+00:00</updated><id>https://research.klusai.com/benchmark/release/2026/05/30/introducing-europriv-bench</id><content type="html" xml:base="https://research.klusai.com/benchmark/release/2026/05/30/introducing-europriv-bench.html"><![CDATA[<p>We’re releasing <strong>EuroPriv-Bench</strong> — the first <em>unified</em> pan-European de-identification
benchmark. It puts privacy NLP for European languages on a single, GDPR-aligned taxonomy
and a privacy-utility metric, rather than the fragmented, English-centric, detection-F1-only
picture that exists today.</p>

<h2 id="the-gap-it-exposes">The gap it exposes</h2>

<p>Models advertising <strong>96–97% F1</strong> on English PII benchmarks tell you little about how they
behave on, say, Dutch clinical notes under a European privacy taxonomy. When we hold current
baselines to that bar, F1 drops to <strong>0.44–0.61</strong> (44–61%) across <code class="language-plaintext highlighter-rouge">en/fr/es/de/it/nl</code>, with recall
(the privacy-critical failure mode) the weakest link.</p>

<p>→ <strong><a href="/leaderboard/">See the live leaderboard</a></strong></p>

<h2 id="whats-in-v0">What’s in v0</h2>

<ul>
  <li>6 European language configs from cleanly licensed open sources, remapped to the unified
KlusAI privacy taxonomy and fully attributed.</li>
  <li>A reproducible harness with provenance baked into every result (model id, dataset
config/split, harness &amp; taxonomy version, timestamp).</li>
  <li>Baselines for the current public de-identification models.</li>
</ul>
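
<p>As a rough illustration of what “provenance baked into every result” can look like, here is a minimal sketch of a result row. The class name, field names and placeholder values are assumptions made for this example, not the harness’s actual schema; only the list of recorded items comes from the bullet above.</p>

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    """Everything needed to trace one score back to its inputs (hypothetical schema)."""
    model_id: str
    dataset_config: str    # e.g. a language config such as "nl"
    dataset_split: str
    harness_version: str
    taxonomy_version: str
    timestamp: str         # ISO 8601, UTC


def utc_now():
    """Current time as an ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


def make_result(f1, provenance):
    """Attach the provenance record to a score so the two cannot drift apart."""
    return {"f1": f1, "provenance": asdict(provenance)}


# Placeholder values only: not a real model, version or score.
row = make_result(
    0.5,
    Provenance("example/model", "nl", "test", "0.0.0", "0.0.0", utc_now()),
)
print(json.dumps(row, indent=2))
```

<p>Freezing the record is a design choice for the sketch: once a score is written, nothing downstream can quietly edit where it came from.</p>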

<h2 id="whats-next">What’s next</h2>

<p>More baselines (multilingual de-id models, GLiNER, Presidio), larger sample sizes, and the
legal/clinical splits — plus under-served languages including Romanian. Follow along here or
on <a href="https://github.com/klusai">GitHub</a>.</p>

<p><em>EuroPriv-Bench is open — see <a href="/leaderboard/#how-to-submit">How to submit</a>.</em></p>]]></content><author><name>KlusAI</name><email>research@klusai.com</email></author><category term="benchmark" /><category term="release" /><summary type="html"><![CDATA[We’re releasing EuroPriv-Bench — the first unified pan-European de-identification benchmark. It puts privacy NLP for European languages on a single, GDPR-aligned taxonomy and a privacy-utility metric, rather than the fragmented, English-centric, detection-F1-only picture that exists today.]]></summary></entry><entry><title type="html">The GPU isn’t always the answer</title><link href="https://research.klusai.com/engineering/notes/2026/05/27/the-gpu-isnt-always-the-answer.html" rel="alternate" type="text/html" title="The GPU isn’t always the answer" /><published>2026-05-27T00:00:00+00:00</published><updated>2026-05-27T00:00:00+00:00</updated><id>https://research.klusai.com/engineering/notes/2026/05/27/the-gpu-isnt-always-the-answer</id><content type="html" xml:base="https://research.klusai.com/engineering/notes/2026/05/27/the-gpu-isnt-always-the-answer.html"><![CDATA[<p>We benchmark a lot of models, so harness throughput matters. The intuition — “we have a 60-core
M3 Ultra GPU, use it” — turned out to be exactly wrong for this workload, in an instructive way.</p>

<p>The model under test is <code class="language-plaintext highlighter-rouge">openai/privacy-filter</code>: a sparse Mixture-of-Experts with only <strong>50M
active parameters</strong>, doing short-sequence token classification. We measured inference time per example
in several setups:</p>

<table>
  <thead>
    <tr>
      <th>setup</th>
      <th>seconds per example (lower is better)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>MPS (Apple GPU, PyTorch), single</td>
      <td>0.81</td>
    </tr>
    <tr>
      <td>MLX (Apple GPU), batched</td>
      <td>0.085</td>
    </tr>
    <tr>
      <td>CPU, batch=32, <strong>28 threads</strong></td>
      <td>0.083</td>
    </tr>
    <tr>
      <td>CPU, batch=32, <strong>4 threads</strong></td>
      <td><strong>0.041</strong></td>
    </tr>
  </tbody>
</table>

<p>Two surprises:</p>

<p><strong>1. The GPU lost.</strong> Both the PyTorch-MPS and the MLX paths were <em>slower</em> than CPU. With so few
active parameters and short sequences, there isn’t enough work per item to amortize the overhead of
getting data onto the GPU — and the MoE’s routing ops fall off the Metal fast path and bounce back
to the CPU anyway. GPUs win on <em>big</em> compute (large models, long sequences, training); they lose
on a tiny MoE classifying short sequences.</p>

<p><strong>2. More threads were slower.</strong> PyTorch’s default (28 BLAS threads here) ran at <strong>half the speed</strong>
of 4 threads (0.083 vs 0.041 s/example): the ops are small enough that thread-coordination overhead dominates.</p>

<h2 id="what-actually-used-the-machine">What actually used the machine</h2>

<p>The real lever was <strong>parallelism at the job level</strong>: run many small, thread-capped jobs at once.
Seven worker processes × 4 threads saturate the 28 cores, each job at its own fastest point. A
full sweep (multiple models × multiple language configs) that crawled before now finishes in a
couple of minutes: a ~7× wall-clock win, with identical (deterministic) numbers.</p>
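
<p>The layout above can be sketched in a few lines. This is a minimal illustration under assumptions, not our harness: the function names are made up, each job is assumed to be a separate command line, and the thread cap is applied through environment variables (<code class="language-plaintext highlighter-rouge">OMP_NUM_THREADS</code>, <code class="language-plaintext highlighter-rouge">MKL_NUM_THREADS</code>, <code class="language-plaintext highlighter-rouge">VECLIB_MAXIMUM_THREADS</code>). Whether a given backend honours those variables is itself an assumption; calling <code class="language-plaintext highlighter-rouge">torch.set_num_threads(4)</code> inside the job is the more direct route.</p>

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Environment variables commonly read by CPU math backends (assumed relevant here).
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "VECLIB_MAXIMUM_THREADS")


def worker_count(total_cores, threads_per_job):
    """How many thread-capped jobs fit on the machine at once (at least one)."""
    return max(1, total_cores // threads_per_job)


def capped_env(threads_per_job):
    """Copy of the current environment with the per-job thread cap applied."""
    env = dict(os.environ)
    for name in THREAD_ENV_VARS:
        env[name] = str(threads_per_job)
    return env


def run_sweep(commands, total_cores=28, threads_per_job=4):
    """Run each command as its own process, worker_count(...) at a time.

    Returns the exit codes in the same order as `commands`.
    """
    env = capped_env(threads_per_job)

    def run(command):
        return subprocess.run(command, env=env, capture_output=True, text=True).returncode

    # The pool threads only wait on child processes; the real work happens in the children.
    with ThreadPoolExecutor(max_workers=worker_count(total_cores, threads_per_job)) as pool:
        return list(pool.map(run, commands))
```

<p>With the defaults, <code class="language-plaintext highlighter-rouge">worker_count(28, 4)</code> gives the seven workers described above. Passing a list of per-model, per-language command lines to <code class="language-plaintext highlighter-rouge">run_sweep</code> then keeps every core busy without any single job exceeding its four-thread sweet spot.</p>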

<h2 id="the-lesson">The lesson</h2>

<p>Profile before you reach for the accelerator. “Use the GPU” and “use all the cores” are heuristics,
not laws — and for small models on short inputs, both can cost you. The GPU still earns its keep
here: just for the <em>other</em> job, training our own models, where the compute is actually large.</p>]]></content><author><name>KlusAI</name><email>research@klusai.com</email></author><category term="engineering" /><category term="notes" /><summary type="html"><![CDATA[We benchmark a lot of models, so harness throughput matters. The intuition — “we have a 60-core M3 Ultra GPU, use it” — turned out to be exactly wrong for this workload, in an instructive way.]]></summary></entry><entry><title type="html">What a neutral leaderboard must control for</title><link href="https://research.klusai.com/methodology/benchmark/2026/05/23/what-a-neutral-leaderboard-must-control-for.html" rel="alternate" type="text/html" title="What a neutral leaderboard must control for" /><published>2026-05-23T00:00:00+00:00</published><updated>2026-05-23T00:00:00+00:00</updated><id>https://research.klusai.com/methodology/benchmark/2026/05/23/what-a-neutral-leaderboard-must-control-for</id><content type="html" xml:base="https://research.klusai.com/methodology/benchmark/2026/05/23/what-a-neutral-leaderboard-must-control-for.html"><![CDATA[<p>When we ran the first cross-lingual baselines, one model led on most European languages. The easy
headline would be “model X wins.” The honest footnote is more interesting — and it’s the kind of
thing a benchmark exists to surface.</p>

<p>That model was <strong>fine-tuned on AI4Privacy</strong> — the same synthetic corpus our v0 gold is derived
from. It isn’t cheating; it’s good work. But it means part of its lead is <strong>home-field advantage</strong>:
it has seen data drawn from the same distribution it’s being tested on. A leaderboard that prints
the ranking without flagging this is quietly misleading.</p>

<h2 id="contamination-is-the-default-not-the-exception">Contamination is the default, not the exception</h2>

<p>In privacy NLP, a handful of open corpora (AI4Privacy chief among them) are <em>both</em> the common
training data <em>and</em> the common evaluation data. So overlap between a model’s training set and your
benchmark’s source is the normal case, not a rare one. A credible leaderboard has to assume it and
design around it:</p>

<ul>
  <li><strong>Provenance on every row</strong> — we record the dataset, config, split, and versions behind each
score, so a number can always be traced and a contaminated comparison can be spotted.</li>
  <li><strong>Held-out, source-separated gold</strong> — training and evaluation kept strictly apart, with synthetic
generation kept separate from the gold.</li>
  <li><strong>Data nobody has trained on</strong> — the real differentiator. Results on a corpus that <em>no</em> baseline
has seen are the only ones immune to this effect.</li>
</ul>

<p>That last point is why we’re investing in <strong>under-served languages and real-document gold</strong> rather
than just scaling up the AI4Privacy-derived splits. Romanian, for instance, isn’t in AI4Privacy at
all — so a Romanian gold set tests every model on genuinely unseen ground.</p>
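
<p>Once provenance is recorded, flagging a contaminated comparison is mostly a set intersection. The sketch below assumes each model declares the corpora it was trained on and each gold set declares the corpora it was derived from; the function names, row fields and source identifiers are hypothetical.</p>

```python
def shared_sources(model_training_sources, gold_sources):
    """Corpora that appear both in a model's training data and in the gold's lineage."""
    return sorted(set(model_training_sources).intersection(gold_sources))


def annotate_row(row, model_training_sources, gold_sources):
    """Return a copy of a leaderboard row with an explicit contamination flag."""
    overlap = shared_sources(model_training_sources, gold_sources)
    return {**row, "contaminated": bool(overlap), "shared_sources": overlap}


# Hypothetical rows: the scores are placeholders, not results.
gold = ["ai4privacy"]
flagged = annotate_row({"model": "model-x", "f1": 0.5}, ["ai4privacy", "other-corpus"], gold)
clean = annotate_row({"model": "model-y", "f1": 0.5}, ["other-corpus"], gold)
print(flagged)
print(clean)
```

<p>The flag does not change the score; it only makes sure the ranking is never printed without the footnote.</p>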

<h2 id="the-point">The point</h2>

<p>“Neutral scorer” isn’t a slogan; it’s a set of obligations: flag contamination, publish provenance,
and keep building evaluation data that the field hasn’t already trained on. We’d rather report a
smaller, honest gap than a big one that won’t survive scrutiny.</p>

<p>→ Every score on the <a href="/leaderboard/">leaderboard</a> carries its provenance.</p>]]></content><author><name>KlusAI</name><email>research@klusai.com</email></author><category term="methodology" /><category term="benchmark" /><summary type="html"><![CDATA[When we ran the first cross-lingual baselines, one model led on most European languages. The easy headline would be “model X wins.” The honest footnote is more interesting — and it’s the kind of thing a benchmark exists to surface.]]></summary></entry><entry><title type="html">A missed ID number is a birthday, a sex, and a county</title><link href="https://research.klusai.com/methodology/metrics/2026/05/19/a-missed-id-is-a-birthday-a-sex-and-a-county.html" rel="alternate" type="text/html" title="A missed ID number is a birthday, a sex, and a county" /><published>2026-05-19T00:00:00+00:00</published><updated>2026-05-19T00:00:00+00:00</updated><id>https://research.klusai.com/methodology/metrics/2026/05/19/a-missed-id-is-a-birthday-a-sex-and-a-county</id><content type="html" xml:base="https://research.klusai.com/methodology/metrics/2026/05/19/a-missed-id-is-a-birthday-a-sex-and-a-county.html"><![CDATA[<p>Most PII benchmarks report one number: detection F1. It treats every missed entity as one missed
token. But for privacy, <strong>not all misses are equal</strong> — and detection-F1 hides exactly the misses
that matter most.</p>

<p>Take the Romanian <strong>CNP</strong> (Cod Numeric Personal), the national ID number. It isn’t an opaque
string: its 13 digits <em>encode</em> the holder’s data. A single un-redacted CNP deterministically
discloses:</p>

<ul>
  <li><strong>date of birth</strong> (century + year + month + day),</li>
  <li><strong>sex</strong> (the leading digit’s parity), and</li>
  <li><strong>county</strong> of registration.</li>
</ul>

<p>So a model that misses one CNP hasn’t dropped “one token” — it has leaked <strong>three quasi-identifiers
at once</strong>, the exact combination that enables re-identification. A model that catches 99% of generic
PII but routinely misses national IDs can look great on F1 and still be a privacy liability.</p>
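
<p>To make “deterministically discloses” concrete, here is a minimal sketch of a CNP decoder. It assumes the commonly documented layout (one sex-and-century digit, six date digits, a two-digit county code, a three-digit serial, one check digit) and the usual weighted mod-11 checksum. It only handles leading digits 1–6, where the century is unambiguous, and it returns the raw county code rather than a name. The example number is synthetic.</p>

```python
from datetime import date

# Leading digit -> century of birth (assumed standard layout; 7-9 are not handled here).
CENTURY_BY_LEADING_DIGIT = {1: 1900, 2: 1900, 3: 1800, 4: 1800, 5: 2000, 6: 2000}
CHECK_WEIGHTS = "279146358279"


def cnp_check_digit(first12):
    """Weighted sum of the first 12 digits, mod 11; a remainder of 10 maps to 1."""
    total = sum(int(d) * int(w) for d, w in zip(first12, CHECK_WEIGHTS))
    remainder = total % 11
    return 1 if remainder == 10 else remainder


def decode_cnp(cnp):
    """Return the quasi-identifiers a single un-redacted CNP exposes."""
    if len(cnp) != 13 or not (cnp.isascii() and cnp.isdigit()):
        raise ValueError("a CNP is exactly 13 digits")
    if cnp_check_digit(cnp[:12]) != int(cnp[12]):
        raise ValueError("check digit mismatch")
    leading = int(cnp[0])
    century = CENTURY_BY_LEADING_DIGIT.get(leading)
    if century is None:
        raise ValueError("leading digits 7-9 do not fix the century; not handled in this sketch")
    birth = date(century + int(cnp[1:3]), int(cnp[3:5]), int(cnp[5:7]))
    return {
        "birth_date": birth,
        "sex": "M" if leading % 2 == 1 else "F",  # odd leading digit = male
        "county_code": cnp[7:9],
    }


# Synthetic example: the first 12 digits are made up, the check digit is computed.
body = "196051540123"
synthetic_cnp = body + str(cnp_check_digit(body))
print(decode_cnp(synthetic_cnp))
```

<p>Nothing here needs a lookup service or a model: thirteen characters of leaked text are enough to recover all three fields.</p>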

<h2 id="measuring-the-thing-that-matters">Measuring the thing that matters</h2>

<p>That’s why our headline metric isn’t detection-F1 — it’s <strong>re-identification risk</strong>. For national
IDs we decode the structure directly and count what a miss actually exposes. We also report a
<strong>recall-weighted</strong> score alongside F1, because in de-identification a false negative (PII left in)
is far costlier than a false positive (something harmless redacted).</p>
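
<p>One standard way to weight recall above precision is the F-beta score with beta greater than 1. The sketch below uses beta = 2 purely as an assumption for illustration; it is not a statement of the exact weighting behind our recall-weighted score.</p>

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical models with mirrored precision and recall.
print(f_beta(precision=1.0, recall=0.5))  # misses half the PII
print(f_beta(precision=0.5, recall=1.0))  # over-redacts, but leaks nothing
```

<p>Plain F1 scores the two models identically. With beta = 2, the model that over-redacts scores higher than the one that leaks, which is the ordering a privacy benchmark wants.</p>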

<p>This also reframes what “good” means. A model that flags a 13-digit string as the <em>wrong</em> type but
still redacts it hasn’t leaked anything — redaction cares about coverage, not labels. Detection-F1
penalizes the mislabel; the leakage metric correctly says “no harm done.” Different questions,
different metrics — and privacy needs the second one.</p>
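
<p>The difference between the two questions is easy to show in code. The sketch below is a simple character-coverage proxy for leakage, assumed here for illustration rather than taken from our metric: it ignores labels entirely and asks only which gold PII characters were left unredacted. Spans are <code class="language-plaintext highlighter-rouge">(start, end, label)</code> tuples with an exclusive end, and the label names are hypothetical.</p>

```python
def covered_positions(spans):
    """Character offsets covered by (start, end, label) spans; end is exclusive."""
    positions = set()
    for start, end, _label in spans:
        positions.update(range(start, end))
    return positions


def leak_rate(gold_spans, predicted_spans):
    """Fraction of gold PII characters that no predicted span covers, labels ignored."""
    gold = covered_positions(gold_spans)
    if not gold:
        return 0.0
    leaked = gold - covered_positions(predicted_spans)
    return len(leaked) / len(gold)


gold = [(10, 23, "NATIONAL_ID")]          # a 13-character ID in the text
mislabelled = [(10, 23, "PHONE_NUMBER")]  # wrong type, fully redacted
missed = []                               # nothing predicted

print(leak_rate(gold, mislabelled))
print(leak_rate(gold, missed))
```

<p>The mislabelled prediction covers every gold character, so nothing leaks even though a label-aware F1 would count it as an error. The empty prediction leaves the whole ID exposed.</p>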

<h2 id="why-national-ids-and-why-europe">Why national IDs, and why Europe</h2>

<p>National-ID formats are where English-centric models fall down hardest: a CNP, a Spanish DNI, a
Polish PESEL each have their own structure and check character (a digit for the CNP and PESEL, a letter for the DNI) that a model trained mostly on US/UK
data has never seen. Getting these right, and <em>measuring the leakage when you don’t</em>, is central
to a European privacy benchmark, not a footnote.</p>

<p>We’re building the gold sets to test this properly across languages and domains. The metric is
ready; the harder, more honest part — realistic documents — is what we’re working on next.</p>]]></content><author><name>KlusAI</name><email>research@klusai.com</email></author><category term="methodology" /><category term="metrics" /><summary type="html"><![CDATA[Most PII benchmarks report one number: detection F1. It treats every missed entity as one missed token. But for privacy, not all misses are equal — and detection-F1 hides exactly the misses that matter most.]]></summary></entry><entry><title type="html">Harmonizing the PII taxonomy Babel</title><link href="https://research.klusai.com/methodology/taxonomy/2026/05/13/harmonizing-the-pii-taxonomy-babel.html" rel="alternate" type="text/html" title="Harmonizing the PII taxonomy Babel" /><published>2026-05-13T00:00:00+00:00</published><updated>2026-05-13T00:00:00+00:00</updated><id>https://research.klusai.com/methodology/taxonomy/2026/05/13/harmonizing-the-pii-taxonomy-babel</id><content type="html" xml:base="https://research.klusai.com/methodology/taxonomy/2026/05/13/harmonizing-the-pii-taxonomy-babel.html"><![CDATA[<p>Every PII model speaks a different dialect. OpenAI’s privacy-filter has 8 coarse types;
AI4Privacy uses ~98; HIPAA defines 18; the EU’s MAPA project has its own legal-and-medical
set; OpenMed expands to 54; tabularisai to 42. If you want to compare these models on the same
data — the whole point of a benchmark — you first have to make them agree on what a “name” or an
“ID number” even is.</p>

<p>So before any scoring, we built one <strong>GDPR-aligned crosswalk</strong>: a single harmonized taxonomy with
a documented mapping from each external scheme’s labels onto it. It’s deliberately <em>standardization,
not invention</em> — the interesting work is reconciliation, not coining new categories.</p>

<h2 id="the-rule-that-makes-it-sound">The rule that makes it sound</h2>

<p>A crosswalk is only trustworthy if <strong>each native label maps to exactly one harmonized type</strong>,
i.e. <code class="language-plaintext highlighter-rouge">native → harmonized</code> is a <em>function</em> (labels we deliberately leave unmapped are simply outside its domain). We enforce that at load time, and it immediately
caught real modelling bugs:</p>

<ul>
  <li><strong>HIPAA <code class="language-plaintext highlighter-rouge">names</code></strong> was claimed by both <code class="language-plaintext highlighter-rouge">PERSON</code> and <code class="language-plaintext highlighter-rouge">PROVIDER</code>. But you can’t recover “this name
is a clinician, not a patient” from a flat <code class="language-plaintext highlighter-rouge">names</code> label — so <code class="language-plaintext highlighter-rouge">PROVIDER</code> is a refinement that
doesn’t get to claim the source label. <code class="language-plaintext highlighter-rouge">names → PERSON</code>.</li>
  <li><strong>MAPA <code class="language-plaintext highlighter-rouge">ORGANIZATION</code></strong> was claimed by <code class="language-plaintext highlighter-rouge">COURT</code>, <code class="language-plaintext highlighter-rouge">FACILITY</code>, <em>and</em> <code class="language-plaintext highlighter-rouge">ORG_PARTY</code>. Same fix: the
general owner wins; the refinements are ours, not the source’s.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">medical_record_numbers</code></strong> sat in both the generic account bucket and the clinical <code class="language-plaintext highlighter-rouge">MRN</code> type.
The clinical-specific type wins.</li>
</ul>

<p>Each of these would have silently corrupted comparisons — a model credited for the “wrong” type, or
double-counted. Failing loudly at build time beats discovering it in a results table.</p>
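
<p>The load-time check can be sketched as follows. This is a minimal illustration, assuming the crosswalk is authored as “harmonized type → native labels it claims”; the function name and data layout are hypothetical, and the label names are the ones from the examples above.</p>

```python
def build_crosswalk(claims):
    """Invert {harmonized: [native, ...]} into {native: harmonized}.

    Raises ValueError if any native label is claimed by more than one
    harmonized type, i.e. if native -> harmonized would not be a function.
    """
    native_to_harmonized = {}
    conflicts = {}
    for harmonized, natives in claims.items():
        for native in natives:
            owner = native_to_harmonized.setdefault(native, harmonized)
            if owner != harmonized:
                conflicts.setdefault(native, {owner}).add(harmonized)
    if conflicts:
        details = "; ".join(
            f"{native!r} claimed by {sorted(owners)}" for native, owners in sorted(conflicts.items())
        )
        raise ValueError(f"crosswalk is not a function: {details}")
    return native_to_harmonized


# The HIPAA bug described above: two harmonized types claim the same native label.
try:
    build_crosswalk({"PERSON": ["names"], "PROVIDER": ["names"]})
except ValueError as error:
    print(error)

# After the fix, only the general owner claims the source label.
print(build_crosswalk({"PERSON": ["names"], "PROVIDER": []}))
```

<p>Collecting every conflict before raising, rather than stopping at the first, means one failed build reports all the labels that need a decision.</p>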

<h2 id="what-we-deliberately-left-out">What we deliberately left out</h2>

<p>Schemes like tabularisai’s carry GDPR <strong>Article 9 special categories</strong> — ethnicity, religion,
political opinion, sexual orientation. These are real, high-stakes identifiers, but they need
careful design (and they’re not in our gold yet), so for now they map to <em>nothing</em> rather than
being force-fit. Better an honest gap than a sloppy mapping.</p>

<h2 id="why-it-matters">Why it matters</h2>

<p>Without a harmonized taxonomy, “model A scores higher than model B” can just mean “A’s label set
happens to line up better with this dataset.” The crosswalk — plus scoring every model only on the
types the gold actually annotates — is what lets a leaderboard mean something across models that
were never designed to agree.</p>
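
<p>A minimal sketch of that scoring filter, under assumptions: predictions arrive as <code class="language-plaintext highlighter-rouge">(start, end, native_label)</code> tuples, the crosswalk is a plain dictionary, and labels that are deliberately unmapped are simply absent from it. The function name and the example labels are hypothetical.</p>

```python
def comparable_predictions(predicted, crosswalk, gold_types):
    """Map native labels to harmonized types and keep only types the gold annotates.

    Predictions whose label is unmapped, or maps to a type the gold never
    annotates, are dropped so they are not scored as false positives.
    """
    kept = []
    for start, end, native in predicted:
        harmonized = crosswalk.get(native)  # None for labels left unmapped on purpose
        if harmonized in gold_types:
            kept.append((start, end, harmonized))
    return kept


crosswalk = {"names": "PERSON", "ORGANIZATION": "ORG"}
gold_types = {"PERSON"}  # this gold set annotates people only
predicted = [
    (0, 8, "names"),          # kept: maps to a type the gold annotates
    (20, 30, "ORGANIZATION"),  # dropped: the gold has no ORG annotations
    (40, 48, "religion"),      # dropped: deliberately unmapped
]
print(comparable_predictions(predicted, crosswalk, gold_types))
```

<p>Without the filter, a model with a richer label set would be penalized for finding things the gold never asked about.</p>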

<p>→ See it in action on the <a href="/leaderboard/">leaderboard</a>, or read the code on
<a href="https://github.com/klusai">GitHub</a>.</p>]]></content><author><name>KlusAI</name><email>research@klusai.com</email></author><category term="methodology" /><category term="taxonomy" /><summary type="html"><![CDATA[Every PII model speaks a different dialect. OpenAI’s privacy-filter has 8 coarse types; AI4Privacy uses ~98; HIPAA defines 18; the EU’s MAPA project has its own legal-and-medical set; OpenMed expands to 54; tabularisai to 42. If you want to compare these models on the same data — the whole point of a benchmark — you first have to make them agree on what a “name” or an “ID number” even is.]]></summary></entry></feed>