dbt DocumentationΒΆ
Production-ready patterns for dbt (data build tool) including model organization, testing strategies, documentation, and incremental processing.
Model LayersΒΆ
source/ Source definitions (YAML only, declares external tables)
β
base/ 1:1 with source, light cleaning / filtering
β
intermediate/ Business logic, joins, aggregations, normalisation
β
marts/ Business-ready models (combines intermediate models)
β
exposure/ Final tables exposed to consumers (API, dashboards, Open Data)
Naming ConventionsΒΆ
Layer |
Prefix |
Example |
|---|---|---|
Source |
|
|
Base |
|
|
Intermediate |
|
|
Marts |
|
|
Exposure |
|
|
Each layer is organized by subject (business domain): acteurs, stats, geo, ban, ae_annuaire_entreprises, enrich.
Project StructureΒΆ
data-platform/dbt/
βββ dbt_project.yml
βββ Makefile
βββ macros/
β βββ field/ Field-level helpers
β β βββ acteur_status_is_active.sql
β β βββ coalesce_empty.sql
β β βββ field_empty.sql
β β βββ sscat_from_action.sql
β βββ table/ Table-level macros (reusable model logic)
β β βββ macro_acteur.sql
β β βββ macro_acteur_sources.sql
β β βββ macro_acteur_labels.sql
β β βββ macro_acteur_epci.sql
β β βββ macro_propositionservice.sql
β β βββ ...
β βββ udf/ User-defined functions (created on-run-start)
β β βββ udf_encode_base57.sql
β β βββ udf_uuid_to_int.sql
β β βββ ...
β βββ constants/ Shared constants
β βββ value_unavailable.sql
β βββ public_accueilli_exclus.sql
βββ models/
βββ source/ Source definitions (YAML only)
β βββ source_acteur.yml
β βββ source_stats.yml
β βββ source_ban_base_adresse_nationale.yml
β βββ source_ae_annuaire_entreprises.yml
β βββ source_enrich.yml
βββ base/ 1:1 with source, light cleaning
β βββ acteurs/
β β βββ schema.yml
β β βββ base_acteur.sql
β β βββ base_acteur_acteur_services.sql
β β βββ base_propositionservice.sql
β β βββ base_source.sql
β βββ stats/
β β βββ schema.yml
β β βββ base_action.sql
β β βββ base_vueacteur_visible.sql
β β βββ base_vuepropositionservice.sql
β β βββ base_vuepropositionservice_visible.sql
β β βββ base_random_position.sql
β βββ ban/
β β βββ schema.yml
β β βββ base_ban_adresses.sql
β β βββ base_ban_lieux_dits.sql
β βββ ae_annuaire_entreprises/
β βββ schema.yml
β βββ base_ae_unite_legale.sql
β βββ base_ae_etablissement.sql
βββ intermediate/ Business logic, joins, aggregations
β βββ acteurs/
β β βββ schema.yml
β β βββ int_acteur.sql
β βββ stats/
β β βββ schema.yml
β β βββ int_acteur_visible_location.sql
β β βββ int_acteur_visible_location_rounded.sql
β β βββ int_acteur_with_siren.sql
β β βββ int_acteur_with_siret.sql
β β βββ int_acteur_with_revision.sql
β β βββ int_stacked_location.sql
β β βββ int_stacked_location_rounded.sql
β βββ geo/
β β βββ schema.yml
β β βββ int_epci.sql
β βββ ban/
β β βββ int_ban_adresses.sql
β β βββ int_ban_villes.sql
β βββ ae_annuaire_entreprises/
β βββ schema.yml
β βββ int_ae_unite_legale.sql
β βββ int_ae_etablissement.sql
βββ marts/ Business-ready models
β βββ acteurs/
β β βββ carte/ Models for the map (carte)
β β β βββ schema.yml
β β β βββ marts_carte_acteur.sql
β β β βββ ...
β β βββ opendata/ Models for Open Data export
β β β βββ schema.yml
β β β βββ marts_opendata_acteur.sql
β β β βββ ...
β β βββ exhaustive/ All acteurs without filtering
β β β βββ ...
β β βββ sample/ Sample/displayed acteurs
β β βββ ...
β βββ stats/ Statistics and metrics
β β βββ schema.yml
β β βββ marts_acteur_stacked.sql
β β βββ marts_acteur_siren_actif.sql
β β βββ marts_distance_first_action.sql
β β βββ ...
β βββ enrich/ Data enrichment suggestions
β βββ schema.yml
β βββ marts_enrich_acteurs_closed_candidates.sql
β βββ ...
βββ exposure/ Final tables exposed to consumers
βββ acteurs/
β βββ carte/
β β βββ schema.yml
β β βββ exposure_carte_acteur.sql
β βββ opendata/
β β βββ schema.yml
β βββ exhaustive/
β β βββ ...
β βββ sample/
β βββ ...
βββ stats/
β βββ schema.yml
β βββ exposure_stats_acteur_stacked_history.sql
β βββ exposure_stats_acteur_no_siren_actif.sql
β βββ exposure_stats_distance_nb_acteur_by_action.sql
β βββ ...
βββ geo/
βββ schema.yml
βββ exposure_epci.sql
PatternsΒΆ
All examples below use real models from the stats subject.
Pattern 1: Source DefinitionsΒΆ
Sources are declared in YAML files in models/source/. Each source maps to a database schema.
# models/source/source_stats.yml
version: 2
sources:
- name: stats_qfdmo
schema: webapp_public
tables:
- name: qfdmo_vueacteur
- name: qfdmo_vuepropositionservice
- name: qfdmo_action
- name: stats_clone
schema: public
tables:
- name: clone_ca_epci_in_use
- name: clone_ae_unite_legale_in_use
- name: clone_ae_etablissement_in_use
Pattern 2: Base ModelsΒΆ
Base models are 1:1 with source tables. They apply minimal transformations: filtering, sampling.
-- models/base/stats/base_vueacteur_visible.sql
SELECT *
FROM {{ source('stats_qfdmo', 'qfdmo_vueacteur') }}
WHERE est_dans_carte IS TRUE OR est_dans_opendata IS TRUE
{% if env_var('DBT_SAMPLING', 'false') == 'true' %}
TABLESAMPLE SYSTEM (10)
{% endif %}
-- models/base/stats/base_action.sql
SELECT *
FROM {{ source('stats_qfdmo', 'qfdmo_action') }}
Base models that combine two base models with a simple join are also acceptable:
-- models/base/stats/base_vuepropositionservice_visible.sql
SELECT *
FROM {{ ref('base_vuepropositionservice') }} AS propositionservice
INNER JOIN {{ ref('base_vueacteur_visible') }} AS acteur
ON propositionservice.acteur_id = acteur.identifiant_unique
Pattern 3: Intermediate ModelsΒΆ
Intermediate models apply business logic: filtering, validation, rounding, aggregation.
-- models/intermediate/stats/int_acteur_visible_location.sql
SELECT identifiant_unique, latitude, longitude
FROM {{ ref('base_vueacteur_visible') }}
WHERE latitude IS NOT NULL AND longitude IS NOT NULL
AND latitude != 0 AND longitude != 0
-- models/intermediate/stats/int_acteur_with_siret.sql
SELECT identifiant_unique, siret
FROM {{ ref('base_vueacteur_visible') }}
WHERE LENGTH(siret) = 14
AND siret ~ '^[0-9]+$'
-- models/intermediate/stats/int_stacked_location.sql
SELECT latitude, longitude
FROM {{ ref('int_acteur_visible_location') }}
GROUP BY latitude, longitude
HAVING COUNT(*) > 1
Pattern 4: Marts ModelsΒΆ
Marts models combine intermediate models to produce business-ready datasets.
-- models/marts/stats/marts_acteur_stacked.sql
SELECT a.identifiant_unique, a.latitude, a.longitude
FROM {{ ref('int_acteur_visible_location') }} a
INNER JOIN {{ ref('int_stacked_location') }} c
ON a.latitude = c.latitude
AND a.longitude = c.longitude
ORDER BY a.latitude, a.longitude, a.identifiant_unique
-- models/marts/stats/marts_acteur_siren_actif.sql
SELECT a.identifiant_unique
FROM {{ ref('int_acteur_with_siren') }} a
INNER JOIN {{ ref('int_ae_siren_in_acteur') }} ae ON a.siren = ae.siren
WHERE ae.etat_administratif = 'A'
Pattern 5: Exposure ModelsΒΆ
Exposure models are the final tables exposed to consumers. They can be simple pass-throughs or incremental models that track history.
Simple pass-through:
-- models/exposure/stats/exposure_stats_acteur_no_siren_actif.sql
select * from {{ ref('marts_acteur_no_siren_actif') }}
Exposure with filtering/projection:
-- models/exposure/stats/exposure_stats_acteur_revision_siren.sql
SELECT
identifiant_unique,
nom,
source_code,
acteur_siren,
revision_siren,
acteur_modifie_le,
revision_modifie_le
FROM {{ ref('marts_acteur_revision') }}
WHERE revision_siren != ''
AND acteur_siren != ''
AND acteur_statut = 'ACTIF'
AND revision_statut = 'ACTIF'
AND revision_siren != acteur_siren
AND revision_modifie_le < acteur_modifie_le
Incremental exposure (history tracking):
-- models/exposure/stats/exposure_stats_acteur_stacked_history.sql
WITH stacked AS (
SELECT COUNT(*) AS nb_stacked
FROM {{ ref('marts_acteur_stacked') }}
),
stacked_rounded AS (
SELECT COUNT(*) AS nb_stacked_rounded
FROM {{ ref('marts_acteur_stacked_rounded') }}
),
visible AS (
SELECT COUNT(*) AS nb_total
FROM {{ ref('int_acteur_visible_location') }}
)
SELECT
CURRENT_DATE AS date_snapshot,
v.nb_total,
s.nb_stacked,
sr.nb_stacked_rounded,
CASE
WHEN v.nb_total = 0 THEN 0
ELSE ROUND((s.nb_stacked::NUMERIC / v.nb_total) * 100, 2)
END AS rate_stacked,
CASE
WHEN v.nb_total = 0 THEN 0
ELSE ROUND((sr.nb_stacked_rounded::NUMERIC / v.nb_total) * 100, 2)
END AS rate_stacked_rounded
FROM visible v
CROSS JOIN stacked s
CROSS JOIN stacked_rounded sr
Pattern 6: Schema & TestingΒΆ
Each folder contains a schema.yml that documents and tests models.
# models/base/stats/schema.yml
version: 2
models:
- name: base_vueacteur_visible
description: "Acteurs visibles"
columns:
- name: identifiant_unique
description: "L'identifiant unique de l'acteur"
data_tests:
- not_null
- unique
- name: nom
description: "Le nom de l'acteur"
data_tests:
- not_null
config:
materialized: table
unique_key: identifiant_unique
tags:
- stats
- base
- acteurs
- visible
Incremental exposure models use delete+insert strategy:
# models/exposure/stats/schema.yml (extract)
- name: exposure_stats_acteur_stacked_history
description: "Historique du nombre d'acteurs empilΓ©s"
columns:
- name: date_snapshot
description: "Date de l'exΓ©cution dbt"
data_tests:
- not_null
- unique
- name: nb_total
data_tests:
- not_null
config:
materialized: incremental
unique_key: date_snapshot
incremental_strategy: delete+insert
tags:
- exposure
- stats
Pattern 7: Sampling in DevΒΆ
Use the DBT_SAMPLING environment variable to limit data in development:
SELECT *
FROM {{ source('stats_qfdmo', 'qfdmo_vueacteur') }}
WHERE est_dans_carte IS TRUE OR est_dans_opendata IS TRUE
{% if env_var('DBT_SAMPLING', 'false') == 'true' %}
TABLESAMPLE SYSTEM (10)
{% endif %}
Pattern 8: UDFs (User-Defined Functions)ΒΆ
UDFs are created automatically at the start of each dbt run via on-run-start in dbt_project.yml:
# dbt_project.yml (extract)
on-run-start:
- "{{ create_udf_encode_base57() }}"
- "{{ create_udf_uuid_to_int() }}"
- "{{ create_udf_safe_divmod() }}"
- "{{ create_udf_normalize_string_alpha_for_match() }}"
- "{{ create_udf_ae_string_cleanup() }}"
UDF macros live in macros/udf/.
Makefile CommandsΒΆ
Run these commands from data-platform/dbt/:
make generate-docs # Generate dbt documentation (HTML)
make serve-docs # Serve documentation locally on port 8085
make clean-docs # Remove generated documentation (target/)
These commands use uv run dbt to ensure the correct Python environment.
dbt CommandsΒΆ
# Run models
uv run dbt run # Run all models
uv run dbt run --select stats # Run all stats models
uv run dbt run --select +marts_acteur_stacked # Run model and its upstream dependencies
uv run dbt run --select marts_acteur_stacked+ # Run model and its downstream dependents
uv run dbt run --full-refresh # Rebuild incremental models from scratch
uv run dbt run --select tag:stats # Run models tagged "stats"
# Testing
uv run dbt test # Run all tests
uv run dbt test --select stats # Test stats models only
uv run dbt build # Run + test in DAG order
# Documentation
uv run dbt docs generate # Generate docs
uv run dbt docs serve # Serve docs locally
# Debugging
uv run dbt compile # Compile SQL without running
uv run dbt debug # Test database connection
uv run dbt ls --select tag:stats # List models by tag
# Sampling (dev only, reduces dataset to ~10%)
DBT_SAMPLING=true uv run dbt run
Data Flow Example: Stats Stacked ActeursΒΆ
source_stats.yml (qfdmo_vueacteur)
β
base_vueacteur_visible.sql (filter: visible acteurs only)
β
int_acteur_visible_location.sql (extract lat/lon, filter nulls)
β
int_stacked_location.sql (GROUP BY lat/lon HAVING COUNT > 1)
β
marts_acteur_stacked.sql (JOIN acteurs with stacked locations)
β
exposure_stats_acteur_stacked_history.sql (incremental: daily snapshot with counts & rates)
Best PracticesΒΆ
DoβsΒΆ
Use base layer - Clean data once in base, reuse everywhere
Test aggressively -
not_null,unique,data_testson key columnsDocument everything - Column descriptions in
schema.ymlUse incremental - For history/snapshot tables (exposure layer)
Use tags - Tag models by subject (
stats,acteurs) and layer (base,marts)Use
env_varfor sampling -DBT_SAMPLING=truefor faster dev cycles
DonβtsΒΆ
Donβt skip base - Source β marts is tech debt
Donβt hardcode schemas - Declare sources in
source_*.ymlDonβt repeat logic - Extract to macros (
macros/table/,macros/field/)Donβt test in prod - Use sampling and dev targets
Donβt forget
schema.yml- Every folder needs one for documentation and tests