Clustering & Deduplication¶
🗺️ Context: we ingest data on circular economy actors from multiple sources
🔴 Problem: some actors are duplicated, polluting the user application
🟢 Solution: work to reduce duplicates and improve data consistency
🤔 How: see below
⚠️ Known limitations¶
We do not split children from an existing cluster: our choice for now; we only want to add new children. To change this would require CLUSTERING & DEDUP: re-cluster existing children. For now:
🟢 advantage: no risk of damaging existing clusters
🟠 drawback: no opportunity to re-cluster existing bad clusters
No child re-clustering = no child context: consequence of the above; children do not have their data fetched or normalized, which can cause context issues, so we did https://github.com/incubateur-ademe/quefairedemesobjets/pull/1379 in the meantime
📜 Definition¶
📦 Clustering: the act of grouping actors by their similarity
1️⃣ Deduplication: turn a cluster into a single actor so there are no more duplicates
🎏 Ideal final state of an actor: we focus on actors with status=ACTIF (ACTIVE)
State 🇫🇷 |
Code 🇬🇧 |
Definition |
Corresponds to a specific source |
Visible on the map |
|---|---|---|---|---|
parent(s) |
parent(s) |
actor that has 1 or more children attached to it |
🟠 NO |
🟢 YES |
child/children |
child / children |
actor attached to 1 parent (1 is the maximum) |
🟢 YES |
🟠 NO |
orphan(s) |
orphan(s) |
actor attached to 0 parents |
🟢 YES |
🟢 YES |
➡️ State transitions: scenarios¶
change model |
State before |
State after |
Scenario |
Impact in revision |
Impact in displayed |
|---|---|---|---|---|---|
|
Orphan |
Parent |
➕ New parent for new cluster |
➕ Parent to create |
same as revision |
|
Parent |
Parent |
1️⃣ Single existing parent → keep |
🟰 Still parent of cluster |
same as revision |
|
Parent |
Parent |
🎖️ 2+ parents in cluster → the one with most children → keep |
🟰 Still parent of cluster |
same as revision |
|
Parent |
Will no longer exist |
🔴 2+ parents in cluster → not chosen → to delete |
🛑 Should be automatically deleted after these children are updated (see |
🛑 Should disappear from displayed |
|
Child |
Child |
🟰 Already points to new parent → nothing to do |
None |
None |
|
Child |
Child |
🔀 Pointed to a parent that was not chosen → point to new parent |
🔀 Update parent_id to point to new parent |
None |
|
Orphan |
Child |
🔀 to point to a parent |
🔀 Update parent_id to point to new parent |
🛑 Should disappear from displayed |
(none) |
Orphan |
Orphan |
Still not part of any cluster (no change) |
None |
None |
🧪 Algorithm¶
🗓️ Past attempts¶
Consideration of https://github.com/dedupeio/dedupe: but seeing that the comparator evaluates 2 values at a time = O(n²) complexity at runtime, and with our volumes (~500K actors) = concern about going with a non-vectorized approach. We see user feedback in that direction (matching time exploding when going from a 1K to 5K sample).
Vectorized attempt: tried basic vectorization (e.g. TF-IDF to naturally deprioritize noise/redundancy) but without compute infra (e.g. state startup = frugal) the idea was abandoned (risk of building models we could not operate via Airflow).
👉🏻 Current¶
Very primitive: with normalization and TF-IDF but running at too small a scale to be really relevant. Lacks fuzzy tolerance.
💡 Improvements¶
Continue upstream normalization: because it benefits not only clustering but also data quality on the map:
Extend vectorized scope: e.g. at department level, to offer a relevance vs. model size trade-off
Reconsider https://github.com/dedupeio/dedupe: again on subsets (e.g. city) to benefit from the library (which offers useful things like distances) without suffering too much from O(n²)
Embeddings: due to very diverse representations (e.g. waste collection centre vs. waste disposal sites)
Conversion mappings for limited/known cases (e.g. street name abbreviations, e.g. ESP → ESPLANADE)
Phonetic algorithms for typos
Language models: potentially compact models (SLMs) that would outperform the above by asking the simple question (« please cluster these samples »)
🚤 Performance¶
👉🏻 Current¶
Poor but sufficient for the business: runs the algorithm on Airflow and does something else while waiting
Main reason: loops and successive round-trips to choose parents and their data
Example: clustering 150K actors took ~6 hours
💡 Improvements¶
Rewrite tasks to use DBT models that prepare the data = eliminate loops / DB round-trips via Python.
🔀 Schema¶
graph TD
subgraph selection["🔎 <b>selection</b>: statut=ACTIF"]
direction LR
parents
orphans
children
end
subgraph normalization["🧹 <b>normalization</b>"]
norma["lower case, no accents etc..."]
end
subgraph clustering["📦 <b>clustering</b>"]
similar["exact + fuzzy similarity"]
end
subgraph dedup["1️⃣ <b>deduplication</b>"]
parents_choose["🥇 Choose new parents"]
parents_data["🗄️ Choose parents' data (from non-normalized data)"]
parents_choose-->parents_data
parents_data-->children_feed["⬅️ Add existing children"]
children_feed-->changes_mark["📋 Define changes"]
end
subgraph suggestions[" <b>suggestions</b>"]
propose
end
parents-->|displayed| normalization
orphans-->|displayed| normalization
normalization --> clustering
children-->|revision| children_feed
clustering-->dedup
changes_mark-->suggestions