dbt DocumentationΒΆ

Production-ready patterns for dbt (data build tool) including model organization, testing strategies, documentation, and incremental processing.

Model LayersΒΆ

source/           Source definitions (YAML only, declares external tables)
    ↓
base/             1:1 with source, light cleaning / filtering
    ↓
intermediate/     Business logic, joins, aggregations, normalisation
    ↓
marts/            Business-ready models (combines intermediate models)
    ↓
exposure/         Final tables exposed to consumers (API, dashboards, Open Data)

Naming ConventionsΒΆ

Layer

Prefix

Example

Source

source_

source_stats.yml

Base

base_

base_vueacteur_visible.sql

Intermediate

int_

int_acteur_visible_location.sql

Marts

marts_

marts_acteur_stacked.sql

Exposure

exposure_

exposure_stats_acteur_stacked_history.sql

Each layer is organized by subject (business domain): acteurs, stats, geo, ban, ae_annuaire_entreprises, enrich.

Project StructureΒΆ

data-platform/dbt/
β”œβ”€β”€ dbt_project.yml
β”œβ”€β”€ Makefile
β”œβ”€β”€ macros/
β”‚   β”œβ”€β”€ field/                           Field-level helpers
β”‚   β”‚   β”œβ”€β”€ acteur_status_is_active.sql
β”‚   β”‚   β”œβ”€β”€ coalesce_empty.sql
β”‚   β”‚   β”œβ”€β”€ field_empty.sql
β”‚   β”‚   └── sscat_from_action.sql
β”‚   β”œβ”€β”€ table/                           Table-level macros (reusable model logic)
β”‚   β”‚   β”œβ”€β”€ macro_acteur.sql
β”‚   β”‚   β”œβ”€β”€ macro_acteur_sources.sql
β”‚   β”‚   β”œβ”€β”€ macro_acteur_labels.sql
β”‚   β”‚   β”œβ”€β”€ macro_acteur_epci.sql
β”‚   β”‚   β”œβ”€β”€ macro_propositionservice.sql
β”‚   β”‚   └── ...
β”‚   β”œβ”€β”€ udf/                             User-defined functions (created on-run-start)
β”‚   β”‚   β”œβ”€β”€ udf_encode_base57.sql
β”‚   β”‚   β”œβ”€β”€ udf_uuid_to_int.sql
β”‚   β”‚   └── ...
β”‚   └── constants/                       Shared constants
β”‚       β”œβ”€β”€ value_unavailable.sql
β”‚       └── public_accueilli_exclus.sql
└── models/
    β”œβ”€β”€ source/                          Source definitions (YAML only)
    β”‚   β”œβ”€β”€ source_acteur.yml
    β”‚   β”œβ”€β”€ source_stats.yml
    β”‚   β”œβ”€β”€ source_ban_base_adresse_nationale.yml
    β”‚   β”œβ”€β”€ source_ae_annuaire_entreprises.yml
    β”‚   └── source_enrich.yml
    β”œβ”€β”€ base/                            1:1 with source, light cleaning
    β”‚   β”œβ”€β”€ acteurs/
    β”‚   β”‚   β”œβ”€β”€ schema.yml
    β”‚   β”‚   β”œβ”€β”€ base_acteur.sql
    β”‚   β”‚   β”œβ”€β”€ base_acteur_acteur_services.sql
    β”‚   β”‚   β”œβ”€β”€ base_propositionservice.sql
    β”‚   β”‚   └── base_source.sql
    β”‚   β”œβ”€β”€ stats/
    β”‚   β”‚   β”œβ”€β”€ schema.yml
    β”‚   β”‚   β”œβ”€β”€ base_action.sql
    β”‚   β”‚   β”œβ”€β”€ base_vueacteur_visible.sql
    β”‚   β”‚   β”œβ”€β”€ base_vuepropositionservice.sql
    β”‚   β”‚   β”œβ”€β”€ base_vuepropositionservice_visible.sql
    β”‚   β”‚   └── base_random_position.sql
    β”‚   β”œβ”€β”€ ban/
    β”‚   β”‚   β”œβ”€β”€ schema.yml
    β”‚   β”‚   β”œβ”€β”€ base_ban_adresses.sql
    β”‚   β”‚   └── base_ban_lieux_dits.sql
    β”‚   └── ae_annuaire_entreprises/
    β”‚       β”œβ”€β”€ schema.yml
    β”‚       β”œβ”€β”€ base_ae_unite_legale.sql
    β”‚       └── base_ae_etablissement.sql
    β”œβ”€β”€ intermediate/                    Business logic, joins, aggregations
    β”‚   β”œβ”€β”€ acteurs/
    β”‚   β”‚   β”œβ”€β”€ schema.yml
    β”‚   β”‚   └── int_acteur.sql
    β”‚   β”œβ”€β”€ stats/
    β”‚   β”‚   β”œβ”€β”€ schema.yml
    β”‚   β”‚   β”œβ”€β”€ int_acteur_visible_location.sql
    β”‚   β”‚   β”œβ”€β”€ int_acteur_visible_location_rounded.sql
    β”‚   β”‚   β”œβ”€β”€ int_acteur_with_siren.sql
    β”‚   β”‚   β”œβ”€β”€ int_acteur_with_siret.sql
    β”‚   β”‚   β”œβ”€β”€ int_acteur_with_revision.sql
    β”‚   β”‚   β”œβ”€β”€ int_stacked_location.sql
    β”‚   β”‚   └── int_stacked_location_rounded.sql
    β”‚   β”œβ”€β”€ geo/
    β”‚   β”‚   β”œβ”€β”€ schema.yml
    β”‚   β”‚   └── int_epci.sql
    β”‚   β”œβ”€β”€ ban/
    β”‚   β”‚   β”œβ”€β”€ int_ban_adresses.sql
    β”‚   β”‚   └── int_ban_villes.sql
    β”‚   └── ae_annuaire_entreprises/
    β”‚       β”œβ”€β”€ schema.yml
    β”‚       β”œβ”€β”€ int_ae_unite_legale.sql
    β”‚       └── int_ae_etablissement.sql
    β”œβ”€β”€ marts/                           Business-ready models
    β”‚   β”œβ”€β”€ acteurs/
    β”‚   β”‚   β”œβ”€β”€ carte/                   Models for the map (carte)
    β”‚   β”‚   β”‚   β”œβ”€β”€ schema.yml
    β”‚   β”‚   β”‚   β”œβ”€β”€ marts_carte_acteur.sql
    β”‚   β”‚   β”‚   └── ...
    β”‚   β”‚   β”œβ”€β”€ opendata/                Models for Open Data export
    β”‚   β”‚   β”‚   β”œβ”€β”€ schema.yml
    β”‚   β”‚   β”‚   β”œβ”€β”€ marts_opendata_acteur.sql
    β”‚   β”‚   β”‚   └── ...
    β”‚   β”‚   β”œβ”€β”€ exhaustive/              All acteurs without filtering
    β”‚   β”‚   β”‚   └── ...
    β”‚   β”‚   └── sample/                  Sample/displayed acteurs
    β”‚   β”‚       └── ...
    β”‚   β”œβ”€β”€ stats/                       Statistics and metrics
    β”‚   β”‚   β”œβ”€β”€ schema.yml
    β”‚   β”‚   β”œβ”€β”€ marts_acteur_stacked.sql
    β”‚   β”‚   β”œβ”€β”€ marts_acteur_siren_actif.sql
    β”‚   β”‚   β”œβ”€β”€ marts_distance_first_action.sql
    β”‚   β”‚   └── ...
    β”‚   └── enrich/                      Data enrichment suggestions
    β”‚       β”œβ”€β”€ schema.yml
    β”‚       β”œβ”€β”€ marts_enrich_acteurs_closed_candidates.sql
    β”‚       └── ...
    └── exposure/                        Final tables exposed to consumers
        β”œβ”€β”€ acteurs/
        β”‚   β”œβ”€β”€ carte/
        β”‚   β”‚   β”œβ”€β”€ schema.yml
        β”‚   β”‚   └── exposure_carte_acteur.sql
        β”‚   β”œβ”€β”€ opendata/
        β”‚   β”‚   └── schema.yml
        β”‚   β”œβ”€β”€ exhaustive/
        β”‚   β”‚   └── ...
        β”‚   └── sample/
        β”‚       └── ...
        β”œβ”€β”€ stats/
        β”‚   β”œβ”€β”€ schema.yml
        β”‚   β”œβ”€β”€ exposure_stats_acteur_stacked_history.sql
        β”‚   β”œβ”€β”€ exposure_stats_acteur_no_siren_actif.sql
        β”‚   β”œβ”€β”€ exposure_stats_distance_nb_acteur_by_action.sql
        β”‚   └── ...
        └── geo/
            β”œβ”€β”€ schema.yml
            └── exposure_epci.sql

PatternsΒΆ

All examples below use real models from the stats subject.

Pattern 1: Source DefinitionsΒΆ

Sources are declared in YAML files in models/source/. Each source maps to a database schema.

# models/source/source_stats.yml
version: 2

sources:
  - name: stats_qfdmo
    schema: webapp_public
    tables:
      - name: qfdmo_vueacteur
      - name: qfdmo_vuepropositionservice
      - name: qfdmo_action
  - name: stats_clone
    schema: public
    tables:
      - name: clone_ca_epci_in_use
      - name: clone_ae_unite_legale_in_use
      - name: clone_ae_etablissement_in_use

Pattern 2: Base ModelsΒΆ

Base models are 1:1 with source tables. They apply minimal transformations: filtering, sampling.

-- models/base/stats/base_vueacteur_visible.sql
SELECT *
FROM {{ source('stats_qfdmo', 'qfdmo_vueacteur') }}
WHERE est_dans_carte IS TRUE OR est_dans_opendata IS TRUE
{% if env_var('DBT_SAMPLING', 'false') == 'true' %}
TABLESAMPLE SYSTEM (10)
{% endif %}
-- models/base/stats/base_action.sql
SELECT *
FROM {{ source('stats_qfdmo', 'qfdmo_action') }}

Base models that combine two base models with a simple join are also acceptable:

-- models/base/stats/base_vuepropositionservice_visible.sql
SELECT *
FROM {{ ref('base_vuepropositionservice') }} AS propositionservice
INNER JOIN {{ ref('base_vueacteur_visible') }} AS acteur
    ON propositionservice.acteur_id = acteur.identifiant_unique

Pattern 3: Intermediate ModelsΒΆ

Intermediate models apply business logic: filtering, validation, rounding, aggregation.

-- models/intermediate/stats/int_acteur_visible_location.sql
SELECT identifiant_unique, latitude, longitude
FROM {{ ref('base_vueacteur_visible') }}
WHERE latitude IS NOT NULL AND longitude IS NOT NULL
AND latitude != 0 AND longitude != 0
-- models/intermediate/stats/int_acteur_with_siret.sql
SELECT identifiant_unique, siret
FROM {{ ref('base_vueacteur_visible') }}
WHERE LENGTH(siret) = 14
AND siret ~ '^[0-9]+$'
-- models/intermediate/stats/int_stacked_location.sql
SELECT latitude, longitude
FROM {{ ref('int_acteur_visible_location') }}
GROUP BY latitude, longitude
HAVING COUNT(*) > 1

Pattern 4: Marts ModelsΒΆ

Marts models combine intermediate models to produce business-ready datasets.

-- models/marts/stats/marts_acteur_stacked.sql
SELECT a.identifiant_unique, a.latitude, a.longitude
FROM {{ ref('int_acteur_visible_location') }} a
INNER JOIN {{ ref('int_stacked_location') }} c
    ON a.latitude = c.latitude
    AND a.longitude = c.longitude
ORDER BY a.latitude, a.longitude, a.identifiant_unique
-- models/marts/stats/marts_acteur_siren_actif.sql
SELECT a.identifiant_unique
FROM {{ ref('int_acteur_with_siren') }} a
INNER JOIN {{ ref('int_ae_siren_in_acteur') }} ae ON a.siren = ae.siren
WHERE ae.etat_administratif = 'A'

Pattern 5: Exposure ModelsΒΆ

Exposure models are the final tables exposed to consumers. They can be simple pass-throughs or incremental models that track history.

Simple pass-through:

-- models/exposure/stats/exposure_stats_acteur_no_siren_actif.sql
select * from {{ ref('marts_acteur_no_siren_actif') }}

Exposure with filtering/projection:

-- models/exposure/stats/exposure_stats_acteur_revision_siren.sql
SELECT
  identifiant_unique,
  nom,
  source_code,
  acteur_siren,
  revision_siren,
  acteur_modifie_le,
  revision_modifie_le
FROM {{ ref('marts_acteur_revision') }}
WHERE revision_siren != ''
AND acteur_siren != ''
AND acteur_statut = 'ACTIF'
AND revision_statut = 'ACTIF'
AND revision_siren != acteur_siren
AND revision_modifie_le < acteur_modifie_le

Incremental exposure (history tracking):

-- models/exposure/stats/exposure_stats_acteur_stacked_history.sql
WITH stacked AS (
    SELECT COUNT(*) AS nb_stacked
    FROM {{ ref('marts_acteur_stacked') }}
),
stacked_rounded AS (
    SELECT COUNT(*) AS nb_stacked_rounded
    FROM {{ ref('marts_acteur_stacked_rounded') }}
),
visible AS (
    SELECT COUNT(*) AS nb_total
    FROM {{ ref('int_acteur_visible_location') }}
)
SELECT
    CURRENT_DATE AS date_snapshot,
    v.nb_total,
    s.nb_stacked,
    sr.nb_stacked_rounded,
    CASE
        WHEN v.nb_total = 0 THEN 0
        ELSE ROUND((s.nb_stacked::NUMERIC / v.nb_total) * 100, 2)
    END AS rate_stacked,
    CASE
        WHEN v.nb_total = 0 THEN 0
        ELSE ROUND((sr.nb_stacked_rounded::NUMERIC / v.nb_total) * 100, 2)
    END AS rate_stacked_rounded
FROM visible v
CROSS JOIN stacked s
CROSS JOIN stacked_rounded sr

Pattern 6: Schema & TestingΒΆ

Each folder contains a schema.yml that documents and tests models.

# models/base/stats/schema.yml
version: 2

models:
  - name: base_vueacteur_visible
    description: "Acteurs visibles"
    columns:
      - name: identifiant_unique
        description: "L'identifiant unique de l'acteur"
        data_tests:
          - not_null
          - unique
      - name: nom
        description: "Le nom de l'acteur"
        data_tests:
          - not_null
    config:
      materialized: table
      unique_key: identifiant_unique
      tags:
        - stats
        - base
        - acteurs
        - visible

Incremental exposure models use delete+insert strategy:

# models/exposure/stats/schema.yml (extract)
- name: exposure_stats_acteur_stacked_history
  description: "Historique du nombre d'acteurs empilΓ©s"
  columns:
    - name: date_snapshot
      description: "Date de l'exΓ©cution dbt"
      data_tests:
        - not_null
        - unique
    - name: nb_total
      data_tests:
        - not_null
  config:
    materialized: incremental
    unique_key: date_snapshot
    incremental_strategy: delete+insert
    tags:
      - exposure
      - stats

Pattern 7: Sampling in DevΒΆ

Use the DBT_SAMPLING environment variable to limit data in development:

SELECT *
FROM {{ source('stats_qfdmo', 'qfdmo_vueacteur') }}
WHERE est_dans_carte IS TRUE OR est_dans_opendata IS TRUE
{% if env_var('DBT_SAMPLING', 'false') == 'true' %}
TABLESAMPLE SYSTEM (10)
{% endif %}

Pattern 8: UDFs (User-Defined Functions)ΒΆ

UDFs are created automatically at the start of each dbt run via on-run-start in dbt_project.yml:

# dbt_project.yml (extract)
on-run-start:
  - "{{ create_udf_encode_base57() }}"
  - "{{ create_udf_uuid_to_int() }}"
  - "{{ create_udf_safe_divmod() }}"
  - "{{ create_udf_normalize_string_alpha_for_match() }}"
  - "{{ create_udf_ae_string_cleanup() }}"

UDF macros live in macros/udf/.

Makefile CommandsΒΆ

Run these commands from data-platform/dbt/:

make generate-docs    # Generate dbt documentation (HTML)
make serve-docs       # Serve documentation locally on port 8085
make clean-docs       # Remove generated documentation (target/)

These commands use uv run dbt to ensure the correct Python environment.

dbt CommandsΒΆ

# Run models
uv run dbt run                              # Run all models
uv run dbt run --select stats               # Run all stats models
uv run dbt run --select +marts_acteur_stacked  # Run model and its upstream dependencies
uv run dbt run --select marts_acteur_stacked+  # Run model and its downstream dependents
uv run dbt run --full-refresh               # Rebuild incremental models from scratch
uv run dbt run --select tag:stats           # Run models tagged "stats"

# Testing
uv run dbt test                             # Run all tests
uv run dbt test --select stats              # Test stats models only
uv run dbt build                            # Run + test in DAG order

# Documentation
uv run dbt docs generate                    # Generate docs
uv run dbt docs serve                       # Serve docs locally

# Debugging
uv run dbt compile                          # Compile SQL without running
uv run dbt debug                            # Test database connection
uv run dbt ls --select tag:stats            # List models by tag

# Sampling (dev only, reduces dataset to ~10%)
DBT_SAMPLING=true uv run dbt run

Data Flow Example: Stats Stacked ActeursΒΆ

source_stats.yml (qfdmo_vueacteur)
    ↓
base_vueacteur_visible.sql          (filter: visible acteurs only)
    ↓
int_acteur_visible_location.sql     (extract lat/lon, filter nulls)
    ↓
int_stacked_location.sql            (GROUP BY lat/lon HAVING COUNT > 1)
    ↓
marts_acteur_stacked.sql            (JOIN acteurs with stacked locations)
    ↓
exposure_stats_acteur_stacked_history.sql  (incremental: daily snapshot with counts & rates)

Best PracticesΒΆ

Do’sΒΆ

  • Use base layer - Clean data once in base, reuse everywhere

  • Test aggressively - not_null, unique, data_tests on key columns

  • Document everything - Column descriptions in schema.yml

  • Use incremental - For history/snapshot tables (exposure layer)

  • Use tags - Tag models by subject (stats, acteurs) and layer (base, marts)

  • Use env_var for sampling - DBT_SAMPLING=true for faster dev cycles

Don’tsΒΆ

  • Don’t skip base - Source β†’ marts is tech debt

  • Don’t hardcode schemas - Declare sources in source_*.yml

  • Don’t repeat logic - Extract to macros (macros/table/, macros/field/)

  • Don’t test in prod - Use sampling and dev targets

  • Don’t forget schema.yml - Every folder needs one for documentation and tests