Clustering & Deduplication¶

🗺️ Context: we ingest data on circular economy actors from multiple sources
🔴 Problem: some actors are duplicated, polluting the user application
🟢 Solution: work to reduce duplicates and improve data consistency
🤔 How: see below

⚠️ Known limitations¶

We do not split children from an existing cluster: our choice for now; we only want to add new children. To change this would require CLUSTERING & DEDUP: re-cluster existing children. For now:
- 🟢 advantage: no risk of damaging existing clusters
- 🟠 drawback: no opportunity to re-cluster existing bad clusters
No child re-clustering = no child context: consequence of the above; children do not have their data fetched or normalized, which can cause context issues, so we did https://github.com/incubateur-ademe/quefairedemesobjets/pull/1379 in the meantime

📜 Definition¶

📦 Clustering: the act of grouping actors by their similarity
1️⃣ Deduplication: turn a cluster into a single actor so there are no more duplicates
🎏 Ideal final state of an actor: we focus on actors with status=ACTIF (ACTIVE)

State 🇫🇷	Code 🇬🇧	Definition	Corresponds to a specific source	Visible on the map
parent(s)	parent(s)	actor that has 1 or more children attached to it	🟠 NO we create an « artificial » parent so it lives independently from its children = more robust over time, no need to change the parent whenever children change	🟢 YES
child/children	child / children	actor attached to 1 parent (1 is the maximum)	🟢 YES This is the source the actor comes from	🟠 NO The parent is displayed
orphan(s)	orphan(s)	actor attached to 0 parents	🟢 YES This is the source the actor comes from	🟢 YES

➡️ State transitions: scenarios¶

change model	State before	State after	Scenario	Impact in revision	Impact in displayed
`acteur_create_as_parent`	Orphan	Parent	➕ New parent for new cluster	➕ Parent to create ➕ Data enriched as best as possible	same as revision
`acteur_keep_as_parent`	Parent	Parent	1️⃣ Single existing parent → keep	🟰 Still parent of cluster ➕ Data enriched as best as possible	same as revision
`acteur_keep_as_parent`	Parent	Parent	🎖️ 2+ parents in cluster → the one with most children → keep	🟰 Still parent of cluster ➕ Data enriched as best as possible	same as revision
`acteur_delete_as_parent`	Parent	Will no longer exist	🔴 2+ parents in cluster → not chosen → to delete	🛑 Should be automatically deleted after these children are updated (see `PR1247`)	🛑 Should disappear from displayed
`acteur_verify_in_revision`	Child	Child	🟰 Already points to new parent → nothing to do	None	None
`acteur_update_parent_id`	Child	Child	🔀 Pointed to a parent that was not chosen → point to new parent	🔀 Update parent_id to point to new parent	None
`acteur_update_parent_id`	Orphan	Child	🔀 to point to a parent	🔀 Update parent_id to point to new parent	🛑 Should disappear from displayed
(none)	Orphan	Orphan	Still not part of any cluster (no change)	None	None

🧪 Algorithm¶

🗓️ Past attempts¶

Consideration of https://github.com/dedupeio/dedupe: but seeing that the comparator evaluates 2 values at a time = O(n²) complexity at runtime, and with our volumes (~500K actors) = concern about going with a non-vectorized approach. We see user feedback in that direction (matching time exploding when going from a 1K to 5K sample).
Vectorized attempt: tried basic vectorization (e.g. TF-IDF to naturally deprioritize noise/redundancy) but without compute infra (e.g. state startup = frugal) the idea was abandoned (risk of building models we could not operate via Airflow).

👉🏻 Current¶

Very primitive: with normalization and TF-IDF but running at too small a scale to be really relevant. Lacks fuzzy tolerance.

💡 Improvements¶

Continue upstream normalization: because it benefits not only clustering but also data quality on the map:
- old → new city conversion: thanks to BAN
- address normalization: still with BAN
- name enrichment at matching time for more embeddings via AE
Extend vectorized scope: e.g. at department level, to offer a relevance vs. model size trade-off
Reconsider https://github.com/dedupeio/dedupe: again on subsets (e.g. city) to benefit from the library (which offers useful things like distances) without suffering too much from O(n²)
Embeddings: due to very diverse representations (e.g. waste collection centre vs. waste disposal sites)
Conversion mappings for limited/known cases (e.g. street name abbreviations, e.g. ESP → ESPLANADE)
Phonetic algorithms for typos
Language models: potentially compact models (SLMs) that would outperform the above by asking the simple question (« please cluster these samples »)

🚤 Performance¶

👉🏻 Current¶

Poor but sufficient for the business: runs the algorithm on Airflow and does something else while waiting
Main reason: loops and successive round-trips to choose parents and their data
Example: clustering 150K actors took ~6 hours

💡 Improvements¶

Rewrite tasks to use DBT models that prepare the data = eliminate loops / DB round-trips via Python.

🔀 Schema¶

        graph TD
   subgraph selection["🔎 <b>selection</b>: statut=ACTIF"]
      direction LR
      parents
      orphans
      children
   end

   subgraph normalization["🧹 <b>normalization</b>"]
       norma["lower case, no accents etc..."]
   end

   subgraph clustering["📦 <b>clustering</b>"]
        similar["exact + fuzzy similarity"]
   end

   subgraph dedup["1️⃣ <b>deduplication</b>"]
        parents_choose["🥇 Choose new parents"]
        parents_data["🗄️ Choose parents' data (from non-normalized data)"]
        parents_choose-->parents_data
        parents_data-->children_feed["⬅️ Add existing children"]
        children_feed-->changes_mark["📋 Define changes"]
   end

   subgraph suggestions[" <b>suggestions</b>"]
      propose
    end

   parents-->|displayed| normalization
   orphans-->|displayed| normalization
   normalization --> clustering
   children-->|revision| children_feed
   clustering-->dedup
   changes_mark-->suggestions