Forem: Athroniaeth

PIIGhost: a Python library for PII anonymization in LLM agents

Athroniaeth — Mon, 27 Apr 2026 18:18:21 +0000

I've been building agents on top of LangGraph for a while now, and I keep running into the same problem: every message sent to the LLM might contain sensitive data, and depending on the provider you're using, what happens to that data changes completely.

To simplify, there are three families of providers:

Non-EU cloud (OpenAI, Anthropic, Google): the best models, but data leaves the EU, which is problematic on many fronts. I wrote a summary here.
Sovereign EU cloud (Mistral, Aleph Alpha): processing happens in the EU, but a more restricted catalog.
Self-hosted (Ollama, vLLM, open-weight models): you never hand your data to a third party, you control everything, but you have to manage the infrastructure yourself.

I'm currently working on notarial documents, which in practice limits me to Mistral. So I can't take advantage of the best LLMs to do my work. The only clean way to decouple the LLM from the sensitivity of the content is to anonymize upstream.

Why it's harder than it looks

On paper, it's simple. You take a detector (regex for emails, NER model for names), replace what matches with placeholders, and send to the LLM.

In practice, four problems show up almost immediately.

Placeholder consistency. The point of anonymization is to replace "Patrick" with a placeholder like <<PERSON:1>>, which tells the LLM two things: a person has been hidden here, and every occurrence of <<PERSON:1>> refers to the same person. If "Patrick" becomes <<PERSON:1>> at the start of the text and <<PERSON:3>> at the end, the LLM can no longer reason about the fact that it's the same individual.

Variants missed by the detector. The NER detects "Patrick Dupont" at the start of the text but misses "Patrick" alone two sentences later. Or it detects "Patrick" but not "patrick" in lowercase. Or not "Patriick" with a typo.

Overlap between detectors. You chain two NERs to boost recall. On "Patrick", both can claim the same span with different labels (one says PERSON, the other says ORG because it confused it with a company name). Without arbitration, the final replacement hits the same position twice and breaks the text.

Persistence across messages. Once the LLM has seen <<PERSON:1>> in message 1, message 2 needs to use the same placeholder. Without shared memory, "Patrick" becomes <<PERSON:1>> then <<PERSON:7>> depending on the moment, and the LLM loses track.

And that's before we even get to the agent, where tools need to receive the real values (to send an email, for example) while the LLM should only see placeholders. On the front-end side, you also have to deanonymize the placeholders before showing the response to the user, without the LLM ever knowing the mapping.

It's to address all of this that I built PIIGhost, an open-source project that adds a layer of detection, anonymization and deanonymization on top of your detectors (NER, regex, LLM, whatever you want). It also offers a conversational mode and a LangChain middleware that plugs into LangGraph without modifying your existing code.

The rest of the article follows the pipeline order: detection, span arbitration, entity linking, merging, anonymization, then the conversational and agent layers.

Step 1: Detection

Everything starts with detection. A detector takes text and returns a list of Detection objects (text found, label, position, confidence). PIIGhost ships several out of the box:

RegexDetector for structured formats (emails, phone numbers, IBAN).
ExactMatchDetector for fixed words known in advance, useful for tests or business dictionaries.
Gliner2Detector for NER, backed by GLiNER2 by default.
CompositeDetector to combine multiple detectors into one.

The interface is an AnyDetector protocol, so you can plug in your own (an LLM call, another NER model, whatever you want).

Here's an example without an ML model, just to show the mechanics:

from piighost import ExactMatchDetector

detector = ExactMatchDetector([
    ("Patrick", "PERSON"),
    ("Paris", "LOCATION"),
])

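# detect() is a coroutine: call it from an async function
# (or from a notebook / REPL that supports top-level await).
detections = await detector.detect("Patrick lives in Paris.")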
# Detection(text='Patrick', label='PERSON',   position=Span(0, 7),   confidence=1.0)
# Detection(text='Paris',   label='LOCATION', position=Span(17, 22), confidence=1.0)

At this stage, we have a raw list of detections. No anonymization, no duplicate handling, nothing. Just "here's what looks like PII and where it sits".
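To give an idea of what plugging in your own detector could look like, here is a minimal stdlib-only sketch of a regex-based email detector. It does not use PIIGhost: the Detection dataclass, the EmailDetector class and the email pattern are all assumptions made for illustration, and the real AnyDetector protocol may expect different field names or signatures.

```python
import asyncio
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical stand-in for the library's Detection object."""
    text: str
    label: str
    start: int
    end: int
    confidence: float


# Deliberately simple email pattern, good enough for a demo only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class EmailDetector:
    """Toy detector: every regex match becomes one Detection."""

    async def detect(self, text: str) -> list[Detection]:
        return [
            Detection(m.group(), "EMAIL", m.start(), m.end(), 1.0)
            for m in EMAIL_RE.finditer(text)
        ]


detections = asyncio.run(
    EmailDetector().detect("Write to patrick@example.com today.")
)
for d in detections:
    print(d)
```

A regex match is either there or not, so the sketch reports a confidence of 1.0, like the ExactMatchDetector output above.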

Step 2: Span arbitration

First real problem. When you chain multiple detectors on the same text, they can claim the same chunk with different labels. This is typically what happens when you combine two NERs to boost recall. They step on each other and one of them is wrong.

A concrete example. On the following sentence:

"Patrick works at Orange since 2015."

You run two NERs:

NER A (a generalist model) detects "Patrick" → PERSON, span [0:7], confidence 0.95
NER B (a domain model less reliable on first names) detects "Patrick" → ORG, span [0:7], confidence 0.60 (it confused it with a company name)

Both point to exactly the same span [0:7], but with mutually exclusive labels. If we replace both, we hit the same position twice and end up with something broken like <<ORG:1>><<PERSON:1>> works at.... We have to choose.

That's the role of the span resolver. PIIGhost ships two by default:

ConfidenceSpanConflictResolver: keeps the detection with the highest confidence in case of overlap. The reasonable default.
DisabledSpanConflictResolver: does nothing, to use if your detections are already clean or if you want to handle the case yourself.

You can also write your own (prefer the longest span, prefer a specific label, etc.) by implementing the SpanConflictResolver protocol.

from piighost import ConfidenceSpanConflictResolver

resolver = ConfidenceSpanConflictResolver()
clean = resolver.resolve(detections)

# Input detections:
#   - PERSON "Patrick" [0:7] confidence=0.95   (NER A)
#   - ORG    "Patrick" [0:7] confidence=0.60   (NER B)
#
# After resolution, only this remains:
#   - PERSON "Patrick" [0:7] confidence=0.95

At the end of this step, no more overlaps. Each chunk of text is claimed by only one detection.

Overlap isn't necessarily exact. The resolver also handles cases where one span is included in another, or where two spans partially overlap. The principle stays the same: keep the most confident detection.
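The arbitration rule can be sketched in a few lines of plain Python. This is a greedy illustration of "keep the most confident", not the code of ConfidenceSpanConflictResolver: the Detection dataclass, the overlaps helper and resolve_by_confidence are hypothetical names, and the extra "at Orange" detection is made up to show a contained span.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    text: str
    label: str
    start: int
    end: int
    confidence: float


def overlaps(a: Detection, b: Detection) -> bool:
    """True when the half-open ranges [start, end) share a character."""
    return a.start < b.end and b.start < a.end


def resolve_by_confidence(detections: list[Detection]) -> list[Detection]:
    """Visit detections from most to least confident and keep one only
    if it does not overlap anything already kept."""
    kept: list[Detection] = []
    for det in sorted(detections, key=lambda d: d.confidence, reverse=True):
        if not any(overlaps(det, k) for k in kept):
            kept.append(det)
    return sorted(kept, key=lambda d: d.start)


# "Patrick works at Orange since 2015."
detections = [
    Detection("Patrick", "PERSON", 0, 7, 0.95),        # NER A
    Detection("Patrick", "ORG", 0, 7, 0.60),           # NER B, same span
    Detection("Orange", "ORG", 17, 23, 0.90),
    Detection("at Orange", "LOCATION", 14, 23, 0.40),  # contains the span above
]
clean = resolve_by_confidence(detections)
for d in clean:
    print(d.label, d.text, (d.start, d.end))
```

The same overlap test covers identical, nested and partially overlapping spans, which is why a single rule is enough for all three cases.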

Step 3: Entity linking

Second problem. The NER misses occurrences. It finds "Patrick Dupont" in sentence 1 but misses "Patrick" alone in sentence 3. If we stop at raw detection, "Patrick" stays in clear text in the anonymized output. That's exactly what we want to avoid.

The linker fixes this. ExactEntityLinker does two things:

For each detection, it searches for all other occurrences of the same text in the document, using a word-boundary regex (to avoid matching "Patric" inside "Patricia").
It groups every detection that points to the same normalized text into a single Entity object.

Concretely:

Text: "Patrick Dupont lives in Paris. Patrick loves Paris."

Raw NER detections:
  - PERSON   "Patrick Dupont"  (sentence 1)
  - LOCATION "Paris"            (sentence 1)
  # "Patrick" and "Paris" in sentence 2 were missed by the NER

After ExactEntityLinker:
  - Entity(label=PERSON,   detections=["Patrick Dupont", "Patrick"])
  - Entity(label=LOCATION, detections=["Paris", "Paris"])

All occurrences are recovered, grouped by entity. The NER misses things, the linker catches them.

One caveat. The linker does exact string matching. It won't catch "patrick" in lowercase or "Patriick" with a typo. For that, you need a fuzzy linker, which you can write by implementing the EntityLinker protocol.
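Here is a small sketch of the word-boundary search, using only the standard library. It illustrates the idea rather than ExactEntityLinker itself: find_occurrences and link_exact are hypothetical helpers, and entities are reduced to lists of (start, end) positions.

```python
import re
from collections import defaultdict


def find_occurrences(text: str, needle: str) -> list[tuple[int, int]]:
    """All (start, end) positions where `needle` appears as a whole word."""
    pattern = re.compile(rf"\b{re.escape(needle)}\b")
    return [m.span() for m in pattern.finditer(text)]


def link_exact(
    text: str, detections: list[tuple[str, str]]
) -> dict[tuple[str, str], list[tuple[int, int]]]:
    """Group every occurrence of each detected (text, label) pair."""
    entities: dict[tuple[str, str], list[tuple[int, int]]] = defaultdict(list)
    for needle, label in detections:
        for span in find_occurrences(text, needle):
            if span not in entities[(needle, label)]:
                entities[(needle, label)].append(span)
    return dict(entities)


text = "Patrick lives in Paris. Patrick likes Parisian cafes."
# Pretend the NER only reported the first "Patrick" and the first "Paris".
entities = link_exact(text, [("Patrick", "PERSON"), ("Paris", "LOCATION")])
print(entities)
```

The second "Patrick" is recovered, while "Parisian" is left alone because there is no word boundary after "Paris" inside it. Note that \b only behaves this way when the detected text starts and ends with a word character; values such as phone numbers with leading "+" would need a different boundary check.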

Step 4: Entity merging

Third problem, more subtle. Imagine two detectors that see the same person but with different spans:

The NER detects "Patrick Dupont" → entity A, label PERSON
A business dictionary detects "Patrick" alone (because they're in the firm's associates list) → entity B, label PERSON

After the linker, you end up with two distinct entities even though it's clearly the same person. If you anonymize as is, "Patrick Dupont" becomes <<PERSON:1>> and "Patrick" alone becomes <<PERSON:2>>. The LLM thinks these are two different people.

The entity resolver merges these duplicates. Two options:

MergeEntityConflictResolver: uses union-find to merge entities sharing at least one detection (strict matching). The default.
FuzzyEntityConflictResolver: uses Jaro-Winkler distance to merge entities whose canonical text is close (e.g. "Patrick" and "Patriick" with a typo). More tolerant, but higher false-positive risk.

A concrete example:

Before merge:
  - Entity(label=PERSON, detections=["Patrick Dupont"])
  - Entity(label=PERSON, detections=["Patrick"])
  # Both entities share a detection on the string "Patrick"

After MergeEntityConflictResolver:
  - Entity(label=PERSON, detections=["Patrick Dupont", "Patrick"])

At this stage, you have a clean list of entities, each grouping all of its occurrences. No more duplicates, no more overlaps.
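For readers who haven't met union-find, the strict merge can be sketched as follows. This is an illustration under simplifying assumptions, not the code of MergeEntityConflictResolver: entities are modeled as plain sets of detection texts, and merge_entities is a hypothetical helper.

```python
def merge_entities(entities: list[set[str]]) -> list[set[str]]:
    """Merge entities (sets of detection texts) sharing at least one text."""
    parent = list(range(len(entities)))

    def find(i: int) -> int:
        # Follow parent links up to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        parent[find(i)] = find(j)

    # The first entity that owns a text becomes the anchor for later ones.
    first_owner: dict[str, int] = {}
    for idx, texts in enumerate(entities):
        for t in texts:
            if t in first_owner:
                union(idx, first_owner[t])
            else:
                first_owner[t] = idx

    # Collect the texts of every entity under its root.
    groups: dict[int, set[str]] = {}
    for idx, texts in enumerate(entities):
        groups.setdefault(find(idx), set()).update(texts)
    return list(groups.values())


merged = merge_entities([
    {"Patrick Dupont", "Patrick"},  # NER detection plus linked occurrence
    {"Patrick"},                    # business dictionary
    {"Paris"},
])
print(merged)
```

Union-find makes the merge transitive: if A shares a detection with B and B shares one with C, all three end up in the same entity even though A and C have nothing in common directly.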

Step 5: Anonymization

Now we can replace. The Anonymizer generates a unique placeholder per entity via a PlaceholderFactory, then replaces the spans in the text from right to left. Placeholders rarely have the same length as the text they replace, so going right to left keeps the offsets of the spans still to be processed valid.

from piighost import Anonymizer, LabelCounterPlaceholderFactory

anonymizer = Anonymizer(LabelCounterPlaceholderFactory())
result = anonymizer.anonymize(text, entities)

# Patrick Dupont lives in Paris. Patrick loves Paris.
# becomes
# <<PERSON:1>> lives in <<LOCATION:1>>. <<PERSON:1>> loves <<LOCATION:1>>.
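The right-to-left trick is easy to check by hand. The sketch below is a stdlib-only illustration, not the Anonymizer's code: replace_spans is a hypothetical helper that takes (start, end, placeholder) tuples computed on the original text.

```python
def replace_spans(text: str, replacements: list[tuple[int, int, str]]) -> str:
    """Apply (start, end, placeholder) replacements from right to left.

    Editing the rightmost span first means the text to the left is still
    untouched, so the remaining offsets stay valid.
    """
    for start, end, placeholder in sorted(replacements, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


text = "Patrick Dupont lives in Paris. Patrick loves Paris."
spans = [
    (0, 14, "<<PERSON:1>>"),
    (24, 29, "<<LOCATION:1>>"),
    (31, 38, "<<PERSON:1>>"),
    (45, 50, "<<LOCATION:1>>"),
]
anonymized = replace_spans(text, spans)
print(anonymized)
```

Going left to right would also work, but only by tracking a running offset after every replacement; sorting in reverse avoids that bookkeeping.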

Several factories are provided, to choose based on your case:

LabelCounterPlaceholderFactory: <<PERSON:1>>, <<LOCATION:1>>. Readable in logs and traces.
LabelHashPlaceholderFactory: <<PERSON:a3f9>>. Avoids leaking the order in which entities appear from one conversation to another.
FakerCounterPlaceholderFactory: "John Smith", "Springfield". Preserves linguistic flow for the LLM (useful if the model struggles with raw placeholders).
MaskPlaceholderFactory: [REDACTED]. Pure anonymization, irreversible.
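To make the difference between the counter and hash styles concrete, here is a small stdlib-only sketch. CounterFactory and hash_placeholder are hypothetical names, and the 8-character SHA-256 prefix and the optional salt are assumptions of this sketch, not a description of what the library's factories do internally.

```python
import hashlib
from collections import defaultdict


class CounterFactory:
    """<<LABEL:N>>, numbered per label in order of first appearance."""

    def __init__(self) -> None:
        self._next: dict[str, int] = defaultdict(int)
        self._seen: dict[tuple[str, str], str] = {}

    def placeholder(self, text: str, label: str) -> str:
        key = (text, label)
        if key not in self._seen:
            self._next[label] += 1
            self._seen[key] = f"<<{label}:{self._next[label]}>>"
        return self._seen[key]


def hash_placeholder(text: str, label: str, salt: str = "") -> str:
    """<<LABEL:xxxxxxxx>>, derived from a truncated SHA-256 of the text."""
    digest = hashlib.sha256((salt + text).encode("utf-8")).hexdigest()
    return f"<<{label}:{digest[:8]}>>"


factory = CounterFactory()
p1 = factory.placeholder("Patrick", "PERSON")
p2 = factory.placeholder("Bob", "PERSON")
p3 = factory.placeholder("Patrick", "PERSON")   # same entity, same placeholder
l1 = factory.placeholder("Paris", "LOCATION")   # counters are per label
print(p1, p2, p3, l1, hash_placeholder("Patrick", "PERSON"))
```

One caution on the hash variant: names are low-entropy, so an unsalted hash of a name can be recovered by hashing a list of candidate names. If the placeholder itself must not leak anything, a secret per-deployment salt (or a random identifier) is the safer choice.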

The default <<LABEL:N>> format has four useful properties:

it's unique as a token in theory,
the LLM immediately sees what type of PII it's dealing with,
it's not ambiguous in regular text,
it can't be confused with another placeholder (unlike a plain <<PERSON>>, which doesn't distinguish people from one another).

The assembled pipeline

All the steps above chain together into a pipeline:

from piighost.pipeline import AnonymizationPipeline
from piighost import (
    ConfidenceSpanConflictResolver,
    ExactEntityLinker,
    MergeEntityConflictResolver,
    Anonymizer,
    LabelCounterPlaceholderFactory,
)

pipeline = AnonymizationPipeline(
    detector=detector,
    span_resolver=ConfidenceSpanConflictResolver(),
    entity_linker=ExactEntityLinker(),
    entity_resolver=MergeEntityConflictResolver(),
    anonymizer=Anonymizer(LabelCounterPlaceholderFactory()),
)

anonymized, entities = await pipeline.anonymize(
    "Patrick Dupont lives in Paris. Patrick loves Paris."
)
# <<PERSON:1>> lives in <<LOCATION:1>>. <<PERSON:1>> loves <<LOCATION:1>>.

original, _ = await pipeline.deanonymize(anonymized)
# Patrick Dupont lives in Paris. Patrick loves Paris.

The pipeline keeps a cache of the mapping (SHA-256 key on the input text), so once a text has been anonymized, deanonymizing it is essentially a cache lookup.

The conversation problem

All of this works for an isolated message. In a real conversation, it breaks because of three problems.

Counters not shared. Every call to anonymize starts from scratch. The Patrick → <<PERSON:1>> mapping from message 1 is not guaranteed to be reused at message 2.

Detections missed across messages. The NER detects "Patrick" in message 1 but misses it in message 5. Without memory of entities already seen, we can't fill the gap.

Concurrent conversations. If multiple users share the same pipeline instance, their entities mix together. One user's <<PERSON:1>> becomes indistinguishable from another's.

Bug demonstration:

# Message 1
m1, _ = await pipeline.anonymize("Patrick lives in Paris.")
# <<PERSON:1>> lives in <<LOCATION:1>>.

# Message 2, state not shared
m2, _ = await pipeline.anonymize("Bob is happy.")
# <<PERSON:1>> is happy.   ← the counter restarted at 1
# Bob inherits the same placeholder as Patrick → collision:
# the LLM thinks it's the same person.

ThreadAnonymizationPipeline extends the standard pipeline with a ConversationMemory scoped by thread_id. The memory accumulates entities across messages, deduplicated by (text.lower(), label). Each call passes a thread_id, and the cache is prefixed with that identifier so conversations stay isolated.

from piighost.pipeline.thread import ThreadAnonymizationPipeline

pipeline = ThreadAnonymizationPipeline(detector=..., span_resolver=..., ...)

# Conversation A
m1, _ = await pipeline.anonymize("Patrick lives in Paris.", thread_id="user-A")
# <<PERSON:1>> lives in <<LOCATION:1>>.

m2, _ = await pipeline.anonymize("Patrick is happy.", thread_id="user-A")
# <<PERSON:1>> is happy.   ← guaranteed, shared via the thread memory

# Conversation B in parallel, isolated
m3, _ = await pipeline.anonymize("Bob loves Lyon.", thread_id="user-B")
# <<PERSON:1>> loves <<LOCATION:1>>.   ← counter independent from conversation A
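The thread-scoped memory can be pictured with a small stdlib-only sketch. It reuses the name ConversationMemory for readability, but the class below is a hypothetical illustration: the placeholder_for method and the internal dictionaries are assumptions, and only the (text.lower(), label) deduplication key comes from the description above.

```python
from collections import defaultdict


class ConversationMemory:
    """Per-thread placeholder registry, deduplicated by (text.lower(), label)."""

    def __init__(self) -> None:
        # thread_id -> {(lowercased text, label): placeholder}
        self._entities: dict[str, dict[tuple[str, str], str]] = defaultdict(dict)
        # thread_id -> {label: last counter value used}
        self._counters: dict[str, dict[str, int]] = defaultdict(
            lambda: defaultdict(int)
        )

    def placeholder_for(self, thread_id: str, text: str, label: str) -> str:
        key = (text.lower(), label)
        known = self._entities[thread_id]
        if key not in known:
            self._counters[thread_id][label] += 1
            known[key] = f"<<{label}:{self._counters[thread_id][label]}>>"
        return known[key]


memory = ConversationMemory()
a1 = memory.placeholder_for("user-A", "Patrick", "PERSON")
a2 = memory.placeholder_for("user-A", "Bob", "PERSON")
a3 = memory.placeholder_for("user-A", "patrick", "PERSON")  # same key as a1
b1 = memory.placeholder_for("user-B", "Bob", "PERSON")      # separate thread
print(a1, a2, a3, b1)
```

Within a thread the counter only ever moves forward, which fixes the collision from the bug demonstration: Bob gets <<PERSON:2>> in conversation A instead of reusing Patrick's placeholder.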

ThreadAnonymizationPipeline also adds two operations useful for the agent case:

anonymize_with_ent(text, thread_id=...): pure string replacement, without detection. Uses the entities already known to the thread to anonymize a new text. Faster, but doesn't detect new PII.
deanonymize_with_ent(text, thread_id=...): inverse replacement. Useful when the LLM produces text with placeholders we want to restore.

These two operations correctly handle cases where one placeholder is a prefix of another (<<PERSON:1>> vs <<PERSON:10>>) by replacing the longer ones first.
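Longest-first replacement can be sketched with a single regex pass. This is an illustration, not the code behind anonymize_with_ent or deanonymize_with_ent: replace_longest_first is a hypothetical helper, and the example uses real values rather than placeholders because the same concern applies in the anonymizing direction whenever one known value is contained in another.

```python
import re


def replace_longest_first(text: str, mapping: dict[str, str]) -> str:
    """Replace every key of `mapping` found in `text`, longest key first.

    A regex alternation tries its branches in order, so listing the longest
    keys first makes "Anne-Marie Dupont" win over "Anne" at the same position.
    One pass also means inserted replacements are never re-scanned.
    """
    if not mapping:
        return text
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group()], text)


known = {"Anne": "<<PERSON:1>>", "Anne-Marie Dupont": "<<PERSON:2>>"}
original = "Anne-Marie Dupont wrote to Anne."

anonymized = replace_longest_first(original, known)
print(anonymized)

# The inverse mapping restores the text the same way.
inverse = {placeholder: value for value, placeholder in known.items()}
restored = replace_longest_first(anonymized, inverse)
print(restored)
```

With a naive loop that replaced "Anne" first, the output would start with "<<PERSON:1>>-Marie Dupont", leaking most of the name and pointing at the wrong person.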

The agent problem

In a LangGraph agent, the LLM doesn't just process messages. It calls tools, reads their results, and reasons in a loop. Anonymizing properly in this setting requires three interventions at precise moments.

Before the LLM call. All messages have to be anonymized. This is the standard pipeline.anonymize(), applied to each message of the context.

Before and after a tool execution. The LLM calls send_email(to=<<PERSON:1>>). The tool needs the real address, not the placeholder. We deanonymize the arguments via deanonymize_with_ent, execute, then re-anonymize the result before handing it back to the LLM.

Before display to the user. The LLM produces "Done, I sent the email to <<PERSON:1>>". The user wants to see "Patrick", not the placeholder.

PIIAnonymizationMiddleware wires these three hooks into LangGraph:

from langchain.agents import create_agent
from piighost.middleware import PIIAnonymizationMiddleware

middleware = PIIAnonymizationMiddleware(pipeline=pipeline)

agent = create_agent(
    model="mistralai:mistral-large-latest",
    tools=[send_email, get_weather],
    middleware=[middleware],
)

Under the hood, the middleware reads the thread_id from the LangGraph config (get_config()["configurable"]["thread_id"]) and passes it to every pipeline operation. The LLM never sees real values, the tools receive them normally, the user gets the response with their names intact. No agent code to modify.
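The tool-call hook is the least obvious of the three, so here is a framework-free sketch of the idea. It uses neither LangChain nor PIIGhost: wrap_tool, to_real, to_placeholders, the MAPPING dictionary and the fake send_email are all hypothetical, and the real middleware hooks into LangGraph differently.

```python
from typing import Any, Callable

# Hypothetical per-thread mapping, as the conversation memory would hold it.
MAPPING = {"<<EMAIL:1>>": "patrick@example.com"}


def to_real(text: str) -> str:
    """Placeholders -> real values (what the tool needs)."""
    for placeholder, value in MAPPING.items():
        text = text.replace(placeholder, value)
    return text


def to_placeholders(text: str) -> str:
    """Real values -> placeholders (what the LLM is allowed to see)."""
    for placeholder, value in MAPPING.items():
        text = text.replace(value, placeholder)
    return text


def wrap_tool(tool: Callable[..., Any]) -> Callable[..., Any]:
    """Deanonymize string arguments, run the tool, re-anonymize its result."""

    def wrapped(**kwargs: Any) -> Any:
        real_kwargs = {
            k: (to_real(v) if isinstance(v, str) else v)
            for k, v in kwargs.items()
        }
        result = tool(**real_kwargs)
        return to_placeholders(result) if isinstance(result, str) else result

    return wrapped


sent_to: list[str] = []


def send_email(to: str, body: str) -> str:
    sent_to.append(to)  # stand-in for an actual mail call
    return f"Email sent to {to}"


safe_send_email = wrap_tool(send_email)
observation = safe_send_email(to="<<EMAIL:1>>", body="Hello")
print(sent_to, observation)
```

The tool sees the real address, while the string handed back to the LLM only contains the placeholder. The sketch only handles flat string arguments; nested structures would need a recursive walk.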

piighost-chat: the human-in-the-loop demo

To make all of this concrete, I built a chatbot on top of the library. The user sees what is about to be anonymized before the message is sent to the LLM. They can deselect a span flagged by mistake, or select text the detector missed. Once validated, the message goes into the pipeline.

This kind of human-in-the-loop UX is what makes automatic anonymization actually usable in real workflows, where automatic detection often plateaus around 90-95% and the few percent that slip through can be a real problem. The automatic pass does the heavy lifting, the human catches the edge cases.

For instance, you type your message, it goes through the piighost API, and the front end shows what was detected and what is about to be anonymized.

You can remove anonymized entities if there's a false positive.

You can also select text to add new entities to anonymize.

If you ask for information about an anonymized PII, for instance which letter the word starts with, the LLM won't be able to answer.

The library is in its early days. I tried to anticipate as many cases as possible starting from my own needs on notarial documents, but I know that's a particular angle and that many things can be debated: components that aren't generic enough, abstractions that don't pull their weight, use cases I haven't seen.
If you give it a try, your feedback genuinely matters to me:

what felt missing or counter-intuitive,
what feels too complex or pointless and should be removed,
the use cases where it doesn't hold up.

Anything is welcome, whether through a GitHub issue, a PR, or even a direct message. I'd rather cut early on what doesn't belong than accumulate debt.

Thanks for reading.

PIIGhost: a Python library for anonymizing confidential data for LLM agents

Athroniaeth — Sun, 26 Apr 2026 23:38:08 +0000

I've been building agents with LangGraph for a while now, and I keep running into the same problem: every message sent to the LLM may contain sensitive data, and what happens to that data changes completely depending on the provider you use.

To simplify, there are three families of providers:

Non-EU cloud (OpenAI, Anthropic, Google): the best models, but the data leaves the EU, which is problematic in many respects. I wrote a summary here.
Sovereign EU cloud (Mistral, Aleph Alpha): processing in the EU, but a more limited catalog.
Self-hosted (Ollama, vLLM, open-weight models): you never hand your data to a third party and you control everything, but you have to run the infrastructure yourself.

I'm currently working on notarial documents, which in practice limits me to Mistral. So I can't use the best LLMs for my tasks. The only clean way to decouple the LLM from the sensitivity of the content is to anonymize upstream.

Why it's harder than it looks

On paper, it's simple: you take a detector (regex for emails, an NER model for names), replace whatever matches with placeholders, and send the result to the LLM.

In practice, four problems show up almost immediately.

Placeholder consistency. The goal of anonymization is to replace "Patrick" with a placeholder such as <<PERSON:1>>, which tells the LLM two things: a person has been hidden here, and every occurrence of <<PERSON:1>> refers to the same person. If "Patrick" becomes <<PERSON:1>> at the start of the text and <<PERSON:3>> at the end, the LLM can no longer reason about the fact that it's the same individual.

Variants missed by the detector. The NER detects "Patrick Dupont" at the start of the text but misses "Patrick" on its own two sentences later. Or it detects "Patrick" but not "patrick" in lowercase. Or not "Patriick" with a spelling mistake.

Overlap between detectors. You chain two NERs to increase recall. On "Patrick", both can claim the same span with different labels (one says PERSON, the other says ORG because it confused it with a company name). Without arbitration, the final replacement hits the same position twice and breaks the text.

Persistence across messages. Once the LLM has seen <<PERSON:1>> in message 1, message 2 has to use the same placeholder. Without shared memory, "Patrick" becomes <<PERSON:1>> then <<PERSON:7>> depending on the moment, and the LLM loses the thread.

And that's before even talking about the agent, where tools must receive the real values (to send an email, for example) while the LLM must only see placeholders. On the front-end side, the placeholders also have to be deanonymized before the response is shown to the user, without the LLM ever knowing the mapping.

It's to address all of this that I built PIIGhost, an open-source project that adds a layer of detection, anonymization and deanonymization on top of your detectors (NER, regex, LLM, whatever you want). It also offers a conversational mode and a LangChain middleware that integrates into LangGraph without modifying your existing code.

The rest of the article follows the order of the pipeline: detection, span arbitration, entity linking, merging, anonymization, then the conversational and agent layers.

Step 1: Detection

Everything starts with detection. A detector takes text and returns a list of Detection objects (text found, label, position, confidence). PIIGhost ships several out of the box:

RegexDetector for structured formats (emails, phone numbers, IBAN).
ExactMatchDetector for fixed words known in advance, useful for tests or for business dictionaries.
Gliner2Detector for NER, backed by GLiNER2 by default.
CompositeDetector to combine several detectors into one.

The interface is an AnyDetector protocol, so you can plug in your own (an LLM call, another NER model, whatever you want).

Here's an example without an ML model, just to show the mechanics:

from piighost import ExactMatchDetector

detector = ExactMatchDetector([
    ("Patrick", "PERSON"),
    ("Paris", "LOCATION"),
])

detections = await detector.detect("Patrick lives in Paris.")
# Detection(text='Patrick', label='PERSON',   position=Span(0, 7),   confidence=1.0)
# Detection(text='Paris',   label='LOCATION', position=Span(17, 22), confidence=1.0)

At this stage, we have a raw list of detections. No anonymization yet, no duplicate handling, nothing. Just "here's what looks like PII and where it is".

Step 2: Span arbitration

First real problem: when you chain several detectors on the same text, they can claim the same chunk with different labels. This is typically what happens when you combine two NERs to increase recall: they step on each other and one of the two is wrong.

Let's take a concrete example. On the following sentence:

"Patrick works at Orange since 2015."

You run two NERs:

NER A (a generalist model) detects "Patrick" → PERSON, span [0:7], confidence 0.95
NER B (a domain model that is less reliable on first names) detects "Patrick" → ORG, span [0:7], confidence 0.60 (it confused it with a company name)

Both point to exactly the same span [0:7], but with mutually exclusive labels. If we replace both, we hit the same position twice and get something broken like <<ORG:1>><<PERSON:1>> works at.... We have to choose.

That's the role of the span resolver. PIIGhost ships two by default:

ConfidenceSpanConflictResolver: keeps the detection with the highest confidence when spans overlap. This is the reasonable default.
DisabledSpanConflictResolver: does nothing, to be used if your detections are already clean or if you want to handle the case yourself.

You can also write your own (prefer the longest span, prefer a specific label, etc.) by implementing the SpanConflictResolver protocol.

from piighost import ConfidenceSpanConflictResolver

resolver = ConfidenceSpanConflictResolver()
clean = resolver.resolve(detections)

# Input detections:
#   - PERSON "Patrick" [0:7] confidence=0.95   (NER A)
#   - ORG    "Patrick" [0:7] confidence=0.60   (NER B)
#
# After resolution, only this remains:
#   - PERSON "Patrick" [0:7] confidence=0.95

At the end of this step, there are no more overlaps. Each chunk of text is claimed by a single detection.

The overlap isn't necessarily exact. The resolver also handles cases where one span is included in another, or where two spans partially overlap. The principle stays the same: keep the most confident detection.

Step 3: Entity linking

Second problem: the NER misses occurrences. It finds "Patrick Dupont" in sentence 1 but misses "Patrick" on its own in sentence 3. If we stop at raw detection, "Patrick" stays in clear text in the anonymized output. That's exactly what we want to avoid.

The linker fixes this. ExactEntityLinker does two things:

For each detection, it searches for all other occurrences of the same text in the document, with a word-boundary regex (to avoid matching "Patric" inside "Patricia").
It groups all detections that point to the same normalized text into a single Entity object.

Concretely:

Text: "Patrick Dupont lives in Paris. Patrick loves Paris."

Raw NER detections:
  - PERSON   "Patrick Dupont"  (sentence 1)
  - LOCATION "Paris"            (sentence 1)
  # "Patrick" and "Paris" in sentence 2 were missed by the NER

After ExactEntityLinker:
  - Entity(label=PERSON,   detections=["Patrick Dupont", "Patrick"])
  - Entity(label=LOCATION, detections=["Paris", "Paris"])

All occurrences are recovered and grouped by entity. The NER misses things, the linker catches them afterwards.

Note that the linker does exact matching on the string. It won't catch "patrick" in lowercase or "Patriick" with a typo. For that you need a fuzzy linker, which you can write by implementing the EntityLinker protocol.

Step 4: Entity merging

Third problem, more subtle. Imagine two detectors that see the same person but with different spans:

The NER detects "Patrick Dupont" → entity A, label PERSON
A business dictionary detects "Patrick" on its own (because he is in the firm's list of associates) → entity B, label PERSON

After the linker, you end up with two distinct entities even though it's clearly the same person. If you anonymize as is, "Patrick Dupont" becomes <<PERSON:1>> and "Patrick" on its own becomes <<PERSON:2>>. The LLM thinks they are two different people.

The entity resolver merges these duplicates. Two options:

MergeEntityConflictResolver: uses union-find to merge entities that share at least one detection (strict matching). This is the default.
FuzzyEntityConflictResolver: uses Jaro-Winkler distance to merge entities whose canonical text is close (e.g. "Patrick" and "Patriick" with a typo). More tolerant, but with a higher risk of false positives.

A concrete example:

Before merge:
  - Entity(label=PERSON, detections=["Patrick Dupont"])
  - Entity(label=PERSON, detections=["Patrick"])
  # Both entities share a detection on the string "Patrick"

After MergeEntityConflictResolver:
  - Entity(label=PERSON, detections=["Patrick Dupont", "Patrick"])

At this stage, you have a clean list of entities, each grouping all of its occurrences. No more duplicates, no more overlaps.

Step 5: Anonymization

Now we can replace. The Anonymizer generates a unique placeholder per entity via a PlaceholderFactory, then replaces the spans in the text from right to left, so that the offsets of the spans still to be processed stay valid.

from piighost import Anonymizer, LabelCounterPlaceholderFactory

anonymizer = Anonymizer(LabelCounterPlaceholderFactory())
result = anonymizer.anonymize(text, entities)

# Patrick Dupont lives in Paris. Patrick loves Paris.
# becomes
# <<PERSON:1>> lives in <<LOCATION:1>>. <<PERSON:1>> loves <<LOCATION:1>>.

Several factories are provided, to be chosen according to your case:

LabelCounterPlaceholderFactory: <<PERSON:1>>, <<LOCATION:1>>. Readable in logs and traces.
LabelHashPlaceholderFactory: <<PERSON:a3f9>>. Avoids leaking the order in which entities appear from one conversation to another.
FakerCounterPlaceholderFactory: "John Smith", "Springfield". Preserves linguistic flow for the LLM (useful if the model struggles with raw placeholders).
MaskPlaceholderFactory: [REDACTED]. Pure, irreversible anonymization.

The default <<LABEL:N>> format has four useful properties:

it is, in theory, unique as a token,
the LLM immediately sees what type of PII it is dealing with,
it is not ambiguous in regular text,
it can't be confused with another placeholder (unlike a plain <<PERSON>>, which doesn't distinguish people from one another).

The assembled pipeline

All the steps above chain together into a pipeline:

from piighost.pipeline import AnonymizationPipeline
from piighost import (
    ConfidenceSpanConflictResolver,
    ExactEntityLinker,
    MergeEntityConflictResolver,
    Anonymizer,
    LabelCounterPlaceholderFactory,
)

pipeline = AnonymizationPipeline(
    detector=detector,
    span_resolver=ConfidenceSpanConflictResolver(),
    entity_linker=ExactEntityLinker(),
    entity_resolver=MergeEntityConflictResolver(),
    anonymizer=Anonymizer(LabelCounterPlaceholderFactory()),
)

anonymized, entities = await pipeline.anonymize(
    "Patrick Dupont lives in Paris. Patrick loves Paris."
)
# <<PERSON:1>> lives in <<LOCATION:1>>. <<PERSON:1>> loves <<LOCATION:1>>.

original, _ = await pipeline.deanonymize(anonymized)
# Patrick Dupont lives in Paris. Patrick loves Paris.

The pipeline keeps a cache of the mapping (SHA-256 key on the input text), so once a text has been anonymized, deanonymizing it is essentially a cache lookup.

The conversation problem

All of this works for an isolated message. In a real conversation, it breaks because of three problems.

Counters not shared. Every call to anonymize starts from scratch. The Patrick → <<PERSON:1>> mapping from message 1 is not guaranteed to be reused in message 2.

Detections missed across messages. The NER detects "Patrick" in message 1 but misses it in message 5. Without a memory of the entities already seen, there is no way to fill the gap.

Concurrent conversations. If several users share the same pipeline instance, their entities mix together. One user's <<PERSON:1>> becomes indistinguishable from another's.

Bug demonstration:

# Message 1
m1, _ = await pipeline.anonymize("Patrick lives in Paris.")
# <<PERSON:1>> lives in <<LOCATION:1>>.

# Message 2: state not shared
m2, _ = await pipeline.anonymize("Bob is happy.")
# <<PERSON:1>> is happy.   ← the counter restarted at 1
# Bob therefore inherits the same placeholder as Patrick → collision:
# the LLM thinks it's the same person.

ThreadAnonymizationPipeline extends the standard pipeline with a ConversationMemory scoped by thread_id. The memory accumulates entities across messages, deduplicated by (text.lower(), label). Each call passes a thread_id, and the cache is prefixed with that identifier to keep conversations isolated.

from piighost.pipeline.thread import ThreadAnonymizationPipeline

pipeline = ThreadAnonymizationPipeline(detector=..., span_resolver=..., ...)

# Conversation A
m1, _ = await pipeline.anonymize("Patrick lives in Paris.", thread_id="user-A")
# <<PERSON:1>> lives in <<LOCATION:1>>.

m2, _ = await pipeline.anonymize("Patrick is happy.", thread_id="user-A")
# <<PERSON:1>> is happy.   ← guaranteed, shared via the thread memory

# Conversation B in parallel, isolated
m3, _ = await pipeline.anonymize("Bob loves Lyon.", thread_id="user-B")
# <<PERSON:1>> loves <<LOCATION:1>>.   ← counter independent from conversation A

ThreadAnonymizationPipeline also adds two operations that are useful for the agent case:

anonymize_with_ent(text, thread_id=...): pure string replacement, without detection. Uses the entities already known to the thread to anonymize a new text. Faster, but doesn't detect new PII.
deanonymize_with_ent(text, thread_id=...): inverse replacement. Useful when the LLM produces text with placeholders that we want to restore.

These two operations correctly handle cases where one placeholder is a prefix of another (<<PERSON:1>> vs <<PERSON:10>>) by replacing the longest ones first.

The agent problem

In a LangGraph agent, the LLM doesn't just process messages. It calls tools, reads their results, and reasons in a loop. Anonymizing properly in this context requires three interventions at precise moments.

Before the LLM call. All messages must be anonymized. This is the standard pipeline.anonymize(), applied to each message in the context.

Before and after a tool execution. The LLM calls send_email(to=<<PERSON:1>>). The tool needs the real address, not the placeholder. We deanonymize the arguments via deanonymize_with_ent, execute, then re-anonymize the result before handing it back to the LLM.

Before display to the user. The LLM produces "Done, I sent the email to <<PERSON:1>>". The user wants to see "Patrick", not the placeholder.

PIIAnonymizationMiddleware installs these three hooks in LangGraph:

from langchain.agents import create_agent
from piighost.middleware import PIIAnonymizationMiddleware

middleware = PIIAnonymizationMiddleware(pipeline=pipeline)

agent = create_agent(
    model="mistralai:mistral-large-latest",
    tools=[send_email, get_weather],
    middleware=[middleware],
)

Under the hood, the middleware reads the thread_id from the LangGraph config (get_config()["configurable"]["thread_id"]) and passes it to every pipeline operation. The LLM never sees the real values, the tools receive them normally, and the user gets their response with the names intact. No agent code to modify.

piighost-chat: the human-in-the-loop demo

To make all of this concrete, I built a chatbot on top of the library. The user sees what is going to be anonymized before the message is sent to the LLM. They can deselect a span flagged by mistake, or select text the detector missed. Once validated, the message goes into the pipeline.

This kind of human-in-the-loop UX is what makes automatic anonymization truly usable in real workflows, where automatic detection often plateaus around 90-95% and the few percent that slip through can be a real problem. The automatic pass does the bulk of the work, the human catches the edge cases.

For instance, you type your message, it goes through the piighost API, and the front end shows what was detected and what is about to be anonymized.

You can remove anonymized entities if there was a false positive.

You can also select text to add new entities to anonymize.

If you ask for information about an anonymized PII, for instance which letter the word starts with, the LLM won't be able to answer.

The library is in its early days. I tried to anticipate as many cases as possible starting from my own needs on notarial documents, but I know that's a particular angle and that many things can be debated: components that aren't generic enough, abstractions that serve no purpose, use cases I haven't seen.
If you try it, I'm genuinely interested in your feedback:

what you found missing or counter-intuitive,
what seems too complex or useless and should be removed,
the use cases where it doesn't hold up.

Anything is welcome, whether through a GitHub issue, a PR, or even a direct message. I'd rather decide early on what doesn't belong than accumulate debt.

Thanks for reading.