# Clustering & Deduplication
- πΊοΈ **Context**: we ingest data on circular economy actors from [multiple sources](https://data.ademe.fr/)
- π΄ **Problem**: some actors are duplicated, polluting the user application
- π’ **Solution**: work to reduce duplicates and improve data consistency
- π€ **How**: see below
## β οΈ Known limitations
- **We do not split children from an existing cluster**: our choice for now; we only want to add new children. To change this would require [CLUSTERING & DEDUP: re-cluster existing children](https://www.notion.so/CLUSTERING-DEDUP-re-clusteriser-les-enfants-existants-1956523d57d7801abe3bc132ec7e33b0?pvs=21). For now:
- π’ **advantage**: no risk of damaging existing clusters
- π **drawback**: no opportunity to re-cluster existing bad clusters
- **No child re-clustering = no child context**: consequence of the above; children do not have their data fetched or normalized, which can cause context issues, so we did https://github.com/incubateur-ademe/quefairedemesobjets/pull/1379 in the meantime
## π Definition
- π¦ **Clustering:** the act of **grouping** actors by their **similarity**
- 1οΈβ£ **Deduplication**: turn a **cluster** into **a single actor** so there are **no more duplicates**
- π **Ideal final state of an actor**: we focus on actors with status=ACTIF (ACTIVE)
| State π«π· | Code π¬π§ | Definition | Corresponds to a specific source | Visible on the map |
| -------------- | ---------------- | ------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------- |
| parent(s) | parent(s) | actor that has 1 or more children attached to it | π NO
we create an "artificial" parent so it lives independently from its children = more robust over time, no need to change the parent whenever children change | π’ YES |
| child/children | child / children | actor attached to 1 parent (1 is the maximum) | π’ YES
This is the source the actor comes from | π NO
The parent is displayed |
| orphan(s) | orphan(s) | actor attached to 0 parents | π’ YES
This is the source the actor comes from | π’ YES |
## β‘οΈ State transitions: scenarios
| change model | State before | State after | Scenario | Impact in revision | Impact in displayed |
| --------------------------- | ------------ | -------------------- | ---------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------- |
| `acteur_create_as_parent` | Orphan | Parent | β New parent for new cluster | β Parent to create
β Data enriched as best as possible | same as revision |
| `acteur_keep_as_parent` | Parent | Parent | 1οΈβ£ Single existing parent β keep | π° Still parent of cluster
β Data enriched as best as possible | same as revision |
| `acteur_keep_as_parent` | Parent | Parent | ποΈ 2+ parents in cluster β the one with most children β keep | π° Still parent of cluster
β Data enriched as best as possible | same as revision |
| `acteur_delete_as_parent` | Parent | Will no longer exist | π΄ 2+ parents in cluster β not chosen β to delete | π Should be automatically deleted after these children are updated (see [`PR1247`](https://github.com/incubateur-ademe/quefairedemesobjets/pull/1247)) | π Should disappear from displayed |
| `acteur_verify_in_revision` | Child | Child | π° Already points to new parent β nothing to do | None | None |
| `acteur_update_parent_id` | Child | Child | π Pointed to a parent that was not chosen β point to new parent | π Update parent_id to point to new parent | None |
| `acteur_update_parent_id` | Orphan | Child | π to point to a parent | π Update parent_id to point to new parent | π Should disappear from displayed |
| _(none)_ | Orphan | Orphan | Still not part of any cluster (no change) | None | None |
## π§ͺ Algorithm
### ποΈ Past attempts
- **Consideration of https://github.com/dedupeio/dedupe**: but seeing that the [comparator](https://github.com/search?q=repo%3Adedupeio%2Fdedupe+comparator&type=code) evaluates 2 values at a time = **O(nΒ²) complexity** at runtime, and with our volumes (~500K actors) = concern about going with a non-vectorized approach. We see [user feedback in that direction](https://github.com/dedupeio/dedupe/issues/940) (matching time exploding when going from a 1K to 5K sample).
- **Vectorized attempt**: tried basic vectorization (e.g. TF-IDF to naturally deprioritize noise/redundancy) but without compute infra (e.g. state startup = frugal) the idea was abandoned (risk of building models we could not operate via Airflow).
### ππ» Current
- **Very primitive**: with normalization and TF-IDF but running at too small a scale to be really relevant. Lacks fuzzy tolerance.
### π‘ Improvements
- **Continue upstream normalization**: because it benefits not only clustering but also data quality on the map:
- **old β new city conversion**: thanks to [BAN](https://github.com/incubateur-ademe/quefairedemesobjets/pull/1451)
- **address normalization**: still with BAN
- **name enrichment** at matching time for more embeddings via [AE](https://github.com/incubateur-ademe/quefairedemesobjets/pulls?q=Annuaire+entreprise)
- **Extend vectorized scope**: e.g. at department level, to offer a relevance vs. model size trade-off
- **Reconsider https://github.com/dedupeio/dedupe**: again on subsets (e.g. city) to benefit from the library (which offers useful things like distances) without suffering too much from O(nΒ²)
- **Embeddings**: due to very diverse representations (e.g. waste collection centre vs. waste disposal sites)
- **Conversion mappings** for limited/known cases (e.g. street name abbreviations, e.g. ESP β ESPLANADE)
- **Phonetic algorithms** for typos
- **Language models**: potentially compact models (SLMs) that would outperform the above by asking the simple question ("please cluster these samples")
## π€ Performance
### ππ» Current
- **Poor but sufficient for the business**: runs the algorithm on Airflow and does something else while waiting
- **Main reason**: loops and successive round-trips to choose parents and their data
- **Example**: clustering 150K actors took ~6 hours
### π‘ Improvements
- **Rewrite tasks** to use DBT models that prepare the data = eliminate loops / DB round-trips via Python.
## π Schema
```mermaid
graph TD
subgraph selection["π selection: statut=ACTIF"]
direction LR
parents
orphans
children
end
subgraph normalization["π§Ή normalization"]
norma["lower case, no accents etc..."]
end
subgraph clustering["π¦ clustering"]
similar["exact + fuzzy similarity"]
end
subgraph dedup["1οΈβ£ deduplication"]
parents_choose["π₯ Choose new parents"]
parents_data["ποΈ Choose parents' data (from non-normalized data)"]
parents_choose-->parents_data
parents_data-->children_feed["β¬
οΈ Add existing children"]
children_feed-->changes_mark["π Define changes"]
end
subgraph suggestions[" suggestions"]
propose
end
parents-->|displayed| normalization
orphans-->|displayed| normalization
normalization --> clustering
children-->|revision| children_feed
clustering-->dedup
changes_mark-->suggestions
```