Fitness Analytics Platform +LLM (End-to-End)

Author

Gabriel Ferreira

Published

February 12, 2026

Introduction

This project implements a complete end-to-end data engineering pipeline for strength training and nutrition data.

The objective was to design a production-style analytical data system capable of:

  • Auditing and validating raw operational data
  • Modeling relationally in PostgreSQL
  • Executing idempotent ETL pipelines
  • Producing deterministic weekly aggregations
  • Generating a canonical JSON contract for LLM consumption
  • Running in AWS under a controlled, low-cost architecture

Architecture Overview

graph TD
  A[Raw CSV: Hevy + Diet] --> B[Cleaning & Validation: Silver]
  B --> C[PostgreSQL: Normalized Schema]
  C --> D[Gold Weekly Views]
  D --> E[Canonical JSON Contract]
  E --> F[Amazon Bedrock: LLM Analysis]

The system was designed for deterministic weekly reproducibility.


Data Sources

Workout Data (Hevy)

  • Format: CSV export
  • Granularity: 1 row = 1 set
  • Historical coverage: 2023–2026
  • Total rows: 9542
  • Total unique exercises: 118

Diet Data

  • Source: SQLite (personal databse) → CSV export
  • Granularity: 1 row per day
  • Includes:
    • Macros
    • Calories
    • Cardio volume
    • Bodyweight
    • Cycle classification

Stage 1: Raw Workout Data

Before modeling or automation, a full exploratory and quality audit was conducted.

Iinitial Load:
Show Code
df = pd.read_csv('hevy_workouts_raw.csv')
df.shape
(9542, 14)
Sample:
Show Code
df.head(10)
title start_time end_time description exercise_title superset_id exercise_notes set_index set_type weight_kg reps distance_km duration_seconds rpe
0 Lower 13 Jan 2026, 18:43 13 Jan 2026, 20:03 NaN Hip Adduction (Machine) NaN NaN 0 normal 90.00 5.0 NaN NaN NaN
1 Lower 13 Jan 2026, 18:43 13 Jan 2026, 20:03 NaN Hip Adduction (Machine) NaN NaN 1 normal 86.25 6.0 NaN NaN NaN
2 Lower 13 Jan 2026, 18:43 13 Jan 2026, 20:03 NaN Hip Adduction (Machine) NaN NaN 2 failure 86.25 6.0 NaN NaN NaN
3 Lower 13 Jan 2026, 18:43 13 Jan 2026, 20:03 NaN Straight Leg Deadlift NaN 2RIR? 0 normal 100.00 6.0 NaN NaN NaN
4 Lower 13 Jan 2026, 18:43 13 Jan 2026, 20:03 NaN Straight Leg Deadlift NaN 2RIR? 1 normal 100.00 6.0 NaN NaN NaN
5 Lower 13 Jan 2026, 18:43 13 Jan 2026, 20:03 NaN Leg Press (Machine) NaN 2RIR 0 normal 260.00 5.0 NaN NaN NaN
6 Lower 13 Jan 2026, 18:43 13 Jan 2026, 20:03 NaN Leg Press (Machine) NaN 2RIR 1 normal 260.00 5.0 NaN NaN NaN
7 Lower 13 Jan 2026, 18:43 13 Jan 2026, 20:03 NaN Seated Leg Curl (Machine) NaN NaN 0 normal 110.00 6.0 NaN NaN NaN
8 Lower 13 Jan 2026, 18:43 13 Jan 2026, 20:03 NaN Leg Extension (Machine) NaN 1RIR 0 normal 101.25 9.0 NaN NaN NaN
9 Lower 13 Jan 2026, 18:43 13 Jan 2026, 20:03 NaN Leg Extension (Machine) NaN 1RIR 1 normal 101.25 7.0 NaN NaN NaN
Types:
Show Code
df.dtypes
title                object
start_time           object
end_time             object
description          object
exercise_title       object
superset_id         float64
exercise_notes       object
set_index             int64
set_type             object
weight_kg           float64
reps                float64
distance_km         float64
duration_seconds    float64
rpe                 float64
dtype: object
Columns:
Show Code
df.columns
Index(['title', 'start_time', 'end_time', 'description', 'exercise_title',
       'superset_id', 'exercise_notes', 'set_index', 'set_type', 'weight_kg',
       'reps', 'distance_km', 'duration_seconds', 'rpe'],
      dtype='object')

There is no need to capture the training start and end times along with the date; therefore, we must split them into a column for the training date (training_date) and another for the training duration (derived from start_time and end_time).

EDA - Exploratory Data Analysis

Checking Missing Values:
Show Code
valores_ausentes = df.isnull().sum().sort_values(ascending = False)
print(valores_ausentes)
superset_id         9542
rpe                 9542
distance_km         9541
duration_seconds    9209
exercise_notes      8466
description         5264
weight_kg            708
reps                 333
title                  0
start_time             0
end_time               0
exercise_title         0
set_index              0
set_type               0
dtype: int64
  • superset_id: Never used; consistently contains null values.
  • rpe: Same status as superset_id (all values missing).
  • distance_km: Only applicable for treadmill workouts; usually null as I track by time.
  • duration_seconds: Cardio duration; only present in cardio exercises.
  • exercise_notes: Exercise-specific comments; optional/not always required.
  • description: Training session description added upon completion; also optional.
  • weight_kg: Null for bodyweight exercises and cardio (no load).
  • reps: To be analyzed.
weight_kg column missing values where ‘weight_kg’ is NaN and exercise type is not cardio:
Show Code
exercicios_excluidos = ['Treadmill', 'Stair Machine (Floors)', 'Stair Machine (Steps)', 'Stair Machine','Spinning', 'Walking', 'Air Bike']

linhas_ausentes_carga = df[
    df['weight_kg'].isna() &
    ~df['exercise_title'].isin(exercicios_excluidos)
]
linhas_ausentes_carga
title start_time end_time description exercise_title superset_id exercise_notes set_index set_type weight_kg reps distance_km duration_seconds rpe
1740 Upper I 26 May 2025, 19:46 26 May 2025, 21:23 NaN Chest Dip (Assisted) NaN NaN 0 normal NaN 8.0 NaN NaN NaN
1741 Upper I 26 May 2025, 19:46 26 May 2025, 21:23 NaN Chest Dip (Assisted) NaN NaN 1 failure NaN 8.0 NaN NaN NaN
1799 Upper I 19 May 2025, 19:54 19 May 2025, 21:43 NaN Chest Dip (Assisted) NaN NaN 1 failure NaN 7.0 NaN NaN NaN
3154 Push 2 e Pull 2 28 Dec 2024, 09:18 28 Dec 2024, 10:59 Pull 2 Romanian Chair Sit Ups NaN NaN 0 normal NaN 9.0 NaN NaN NaN
3155 Push 2 e Pull 2 28 Dec 2024, 09:18 28 Dec 2024, 10:59 Pull 2 Romanian Chair Sit Ups NaN NaN 1 normal NaN 7.0 NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9411 Treino A - Costas + Biceps 27 Nov 2023, 17:19 27 Nov 2023, 18:18 NaN Back Extension (Hyperextension) NaN NaN 3 normal NaN 10.0 NaN NaN NaN
9518 Treino A - Costas + Biceps 21 Nov 2023, 17:54 21 Nov 2023, 18:48 NaN Back Extension (Hyperextension) NaN NaN 0 normal NaN 10.0 NaN NaN NaN
9519 Treino A - Costas + Biceps 21 Nov 2023, 17:54 21 Nov 2023, 18:48 NaN Back Extension (Hyperextension) NaN NaN 1 normal NaN 10.0 NaN NaN NaN
9520 Treino A - Costas + Biceps 21 Nov 2023, 17:54 21 Nov 2023, 18:48 NaN Back Extension (Hyperextension) NaN NaN 2 normal NaN 10.0 NaN NaN NaN
9521 Treino A - Costas + Biceps 21 Nov 2023, 17:54 21 Nov 2023, 18:48 NaN Back Extension (Hyperextension) NaN NaN 3 normal NaN 10.0 NaN NaN NaN

376 rows × 14 columns

Visualizing missing value distribution:

Show Code
linhas_ausentes_carga = df[
    df['weight_kg'].isna() &
    ~df['exercise_title'].isin(exercicios_excluidos)
]
linhas_ausentes_carga['exercise_title'].value_counts()
exercise_title
Back Extension (Hyperextension)    201
Pull Up                             86
Knee Raise Parallel Bars            53
Leg Raise Parallel Bars              8
Bench Dip                            7
Decline Crunch                       4
Chin Up                              4
Chest Dip (Assisted)                 3
Romanian Chair Sit Ups               3
Decline Crunch (Weighted)            2
Squat (Smith Machine)                1
Push Up                              1
Squat (Bodyweight)                   1
Bicep Curl (Machine)                 1
Plank                                1
Name: count, dtype: int64

Missing values for Back Extension (Hyperextension), Pull Up, Knee Raise Parallel Bars, Leg Raise Parallel Bars, Bench Dip, Decline Crunch, Chin Up, Chest Dip (Assisted), Romanian Chair Sit Ups, Decline Crunch (Weighted), Squat (Bodyweight), Push Up, and Plank are expected as these are bodyweight exercises. However, Bicep Curl (Machine) should not have missing values, and Squat (Smith Machine) was likely a zero-load warm-up, but we will investigate further.

Filtering these two exercises with null values in weight_kg:

Show Code
exercicios_interesse = [
    'Squat (Smith Machine)',
    'Bicep Curl (Machine)'
]

df[
    df['weight_kg'].isna() &
    df['exercise_title'].isin(exercicios_interesse)
]
title start_time end_time description exercise_title superset_id exercise_notes set_index set_type weight_kg reps distance_km duration_seconds rpe
5270 Perna Completo 31 Jul 2024, 07:42 31 Jul 2024, 09:01 NaN Squat (Smith Machine) NaN NaN 0 warmup NaN 30.0 NaN NaN NaN
6644 Costas + Biceps + Lombar - Treino 2 14 May 2024, 11:11 14 May 2024, 12:36 NaN Bicep Curl (Machine) NaN NaN 2 failure NaN 9.0 NaN NaN NaN

We confirmed the Smith Machine was a warm-up, but there is likely an error in the bicep exercise, as it is a set to failure and the set_index is 2, meaning it is the third set of this exercise.

Analyzing exactly this exercise on this specific date:

Show Code
df[
    (df['title'] == 'Costas + Biceps + Lombar - Treino 2') &
    (df['exercise_title'] == 'Bicep Curl (Machine)') &
    (df['start_time'].str.contains('14 May 2024'))
]
title start_time end_time description exercise_title superset_id exercise_notes set_index set_type weight_kg reps distance_km duration_seconds rpe
6642 Costas + Biceps + Lombar - Treino 2 14 May 2024, 11:11 14 May 2024, 12:36 NaN Bicep Curl (Machine) NaN NaN 0 normal 54.0 12.0 NaN NaN NaN
6643 Costas + Biceps + Lombar - Treino 2 14 May 2024, 11:11 14 May 2024, 12:36 NaN Bicep Curl (Machine) NaN NaN 1 failure 54.0 10.0 NaN NaN NaN
6644 Costas + Biceps + Lombar - Treino 2 14 May 2024, 11:11 14 May 2024, 12:36 NaN Bicep Curl (Machine) NaN NaN 2 failure NaN 9.0 NaN NaN NaN

After analysis, we confirmed a manual entry error, resulting in the missing value; the correct value is 54.0 kg.

Imputing the correct value for this set:
Show Code
df.loc[6644, 'weight_kg'] = 54.0
reps column missing values where ‘reps’ is NaN and exercise type is not cardio:
Show Code
exercicios_excluidos = ['Treadmill', 'Stair Machine (Floors)', 'Stair Machine (Steps)', 'Stair Machine','Spinning', 'Walking', 'Air Bike']

linhas_ausentes_repeticoes = df[
    df['reps'].isna() &
    ~df['exercise_title'].isin(exercicios_excluidos)
]

linhas_ausentes_repeticoes
title start_time end_time description exercise_title superset_id exercise_notes set_index set_type weight_kg reps distance_km duration_seconds rpe
8022 (S1 ) Peito + Triceps + Abs - Treino 1 26 Feb 2024, 10:54 26 Feb 2024, 11:53 Treino 6 Plank NaN NaN 0 normal NaN NaN NaN 48.0 NaN

Plank is also an exercise where the absence of repetitions is expected and acceptable

Outlier Detectition

Function to detect outliers using the IQR (Interquartile Range) method:
Show Code
def detectar_outliers(df, coluna):
    Q1 = df[coluna].quantile(0.25)
    Q3 = df[coluna].quantile(0.75)
    IQR = Q3 - Q1
    limite_inferior = Q1 - 1.5 * IQR
    limite_superior = Q3 + 1.5 * IQR
    
    outliers = df[(df[coluna] < limite_inferior) | (df[coluna] > limite_superior)]
    return outliers
Detecting and displaying total outliers per column:
Show Code
colunas = ['weight_kg', 'reps']
for coluna in colunas:
    outliers = detectar_outliers(df, coluna)
    print(f"Total de outliers em '{coluna}': {len(outliers)}")
Total de outliers em 'weight_kg': 121
Total de outliers em 'reps': 239
Detect and display outliers for the ‘weight_kg’ column:
Show Code
outliers_carga_kg = detectar_outliers(df, 'weight_kg')

if not outliers_carga_kg.empty:
    print("Outliers em 'weight_kg (119) ':")
    print(outliers_carga_kg[['exercise_title', 'start_time', 'weight_kg', 'reps']])
Outliers em 'weight_kg (119) ':
           exercise_title          start_time  weight_kg  reps
5     Leg Press (Machine)  13 Jan 2026, 18:43      260.0   5.0
6     Leg Press (Machine)  13 Jan 2026, 18:43      260.0   5.0
73    Leg Press (Machine)   6 Jan 2026, 19:46      280.0   6.0
74    Leg Press (Machine)   6 Jan 2026, 19:46      280.0   5.0
132   Leg Press (Machine)  30 Dec 2025, 18:26      280.0   5.0
133   Leg Press (Machine)  30 Dec 2025, 18:26      260.0   6.0
186   Leg Press (Machine)  23 Dec 2025, 11:58      280.0   7.0
187   Leg Press (Machine)  23 Dec 2025, 11:58      260.0   7.0
252   Leg Press (Machine)  16 Dec 2025, 19:47      280.0   7.0
253   Leg Press (Machine)  16 Dec 2025, 19:47      280.0   6.0
309   Leg Press (Machine)   9 Dec 2025, 20:32      280.0   7.0
310   Leg Press (Machine)   9 Dec 2025, 20:32      280.0   6.0
374   Leg Press (Machine)   2 Dec 2025, 20:03      280.0   6.0
375   Leg Press (Machine)   2 Dec 2025, 20:03      280.0   5.0
464   Leg Press (Machine)  18 Nov 2025, 19:25      275.0   6.0
465   Leg Press (Machine)  18 Nov 2025, 19:25      275.0   6.0
522   Leg Press (Machine)  11 Nov 2025, 20:13      270.0   7.0
523   Leg Press (Machine)  11 Nov 2025, 20:13      275.0   6.0
583   Leg Press (Machine)   4 Nov 2025, 20:10      270.0   7.0
584   Leg Press (Machine)   4 Nov 2025, 20:10      270.0   7.0
647   Leg Press (Machine)  28 Oct 2025, 19:41      270.0   7.0
648   Leg Press (Machine)  28 Oct 2025, 19:41      270.0   7.0
713   Leg Press (Machine)  21 Oct 2025, 19:50      270.0   6.0
714   Leg Press (Machine)  21 Oct 2025, 19:50      270.0   6.0
771   Leg Press (Machine)  14 Oct 2025, 19:57      265.0   7.0
772   Leg Press (Machine)  14 Oct 2025, 19:57      267.0   6.0
826   Leg Press (Machine)   7 Oct 2025, 20:24      265.0   7.0
827   Leg Press (Machine)   7 Oct 2025, 20:24      265.0   6.0
914   Leg Press (Machine)  23 Sep 2025, 19:21      265.0   6.0
915   Leg Press (Machine)  23 Sep 2025, 19:21      265.0   6.0
965   Leg Press (Machine)  16 Sep 2025, 19:37      265.0   6.0
966   Leg Press (Machine)  16 Sep 2025, 19:37      265.0   6.0
1022  Leg Press (Machine)   9 Sep 2025, 20:22      265.0   6.0
1023  Leg Press (Machine)   9 Sep 2025, 20:22      265.0   5.0
1081  Leg Press (Machine)   2 Sep 2025, 20:23      260.0   7.0
1082  Leg Press (Machine)   2 Sep 2025, 20:23      265.0   6.0
1136  Leg Press (Machine)  26 Aug 2025, 20:35      260.0   7.0
1137  Leg Press (Machine)  26 Aug 2025, 20:35      260.0   6.0
1236  Leg Press (Machine)  29 Jul 2025, 19:49      250.0   7.0
1237  Leg Press (Machine)  29 Jul 2025, 19:49      260.0   6.0
1554  Leg Press (Machine)  17 Jun 2025, 20:14      260.0   5.0
1555  Leg Press (Machine)  17 Jun 2025, 20:14      260.0   6.0
1613  Leg Press (Machine)  10 Jun 2025, 20:17      260.0   6.0
1614  Leg Press (Machine)  10 Jun 2025, 20:17      260.0   5.0
1671  Leg Press (Machine)   3 Jun 2025, 19:51      255.0   7.0
1672  Leg Press (Machine)   3 Jun 2025, 19:51      260.0   6.0
1729  Leg Press (Machine)  27 May 2025, 19:55      250.0   7.0
1730  Leg Press (Machine)  27 May 2025, 19:55      255.0   6.0
1787  Leg Press (Machine)  20 May 2025, 20:06      250.0   7.0
1788  Leg Press (Machine)  20 May 2025, 20:06      250.0   6.0
1847  Leg Press (Machine)  13 May 2025, 19:48      240.0   7.0
1848  Leg Press (Machine)  13 May 2025, 19:48      250.0   7.0
1905  Leg Press (Machine)   6 May 2025, 07:17      280.0   6.0
1906  Leg Press (Machine)   6 May 2025, 07:17      280.0   5.0
1961  Leg Press (Machine)  29 Apr 2025, 07:11      280.0   8.0
1962  Leg Press (Machine)  29 Apr 2025, 07:11      285.0   8.0
2036  Leg Press (Machine)  15 Apr 2025, 20:28      280.0   8.0
2037  Leg Press (Machine)  15 Apr 2025, 20:28      285.0   8.0
2096  Leg Press (Machine)   8 Apr 2025, 19:14      260.0   8.0
2097  Leg Press (Machine)   8 Apr 2025, 19:14      270.0   8.0
2158  Leg Press (Machine)   1 Apr 2025, 06:28      260.0   8.0
2159  Leg Press (Machine)   1 Apr 2025, 06:28      270.0   7.0
2222  Leg Press (Machine)  25 Mar 2025, 19:14      240.0   8.0
2223  Leg Press (Machine)  25 Mar 2025, 19:14      270.0   7.0
2288  Leg Press (Machine)  18 Mar 2025, 18:29      240.0   8.0
2289  Leg Press (Machine)  18 Mar 2025, 18:29      260.0   8.0
2330  Leg Press (Machine)  14 Mar 2025, 18:32      240.0   9.0
2331  Leg Press (Machine)  14 Mar 2025, 18:32      260.0   8.0
2332  Leg Press (Machine)  14 Mar 2025, 18:32      280.0   7.0
3776  Leg Press (Machine)  13 Nov 2024, 07:04      200.0  12.0
3777  Leg Press (Machine)  13 Nov 2024, 07:04      220.0  10.0
3778  Leg Press (Machine)  13 Nov 2024, 07:04      240.0   9.0
4002  Leg Press (Machine)  30 Oct 2024, 07:14      200.0  10.0
4003  Leg Press (Machine)  30 Oct 2024, 07:14      220.0  10.0
4004  Leg Press (Machine)  30 Oct 2024, 07:14      240.0   8.0
4206  Leg Press (Machine)  16 Oct 2024, 07:12      160.0  10.0
4207  Leg Press (Machine)  16 Oct 2024, 07:12      200.0  10.0
4208  Leg Press (Machine)  16 Oct 2024, 07:12      240.0   8.0
5520  Leg Press (Machine)  17 Jan 2024, 12:00      160.0  11.0
5521  Leg Press (Machine)  17 Jan 2024, 12:00      160.0  10.0
5522  Leg Press (Machine)  17 Jan 2024, 12:00      200.0  10.0
5523  Leg Press (Machine)  17 Jan 2024, 12:00      200.0  10.0
7692  Leg Press (Machine)  13 Mar 2024, 11:03      160.0  11.0
7693  Leg Press (Machine)  13 Mar 2024, 11:03      160.0  10.0
7694  Leg Press (Machine)  13 Mar 2024, 11:03      200.0  10.0
7695  Leg Press (Machine)  13 Mar 2024, 11:03      200.0  10.0
8190  Leg Press (Machine)  14 Feb 2024, 11:07      200.0  12.0
8191  Leg Press (Machine)  14 Feb 2024, 11:07      200.0  12.0
8192  Leg Press (Machine)  14 Feb 2024, 11:07      200.0  11.0
8428  Leg Press (Machine)  31 Jan 2024, 10:53      200.0  11.0
8429  Leg Press (Machine)  31 Jan 2024, 10:53      200.0  10.0
8430  Leg Press (Machine)  31 Jan 2024, 10:53      200.0  10.0
8431  Leg Press (Machine)  31 Jan 2024, 10:53      200.0  10.0
8566  Leg Press (Machine)  24 Jan 2024, 10:52      200.0  11.0
8567  Leg Press (Machine)  24 Jan 2024, 10:52      200.0  10.0
8568  Leg Press (Machine)  24 Jan 2024, 10:52      200.0  10.0
8569  Leg Press (Machine)  24 Jan 2024, 10:52      200.0  10.0
8792  Leg Press (Machine)   2 Jan 2024, 17:11      200.0  10.0
8793  Leg Press (Machine)   2 Jan 2024, 17:11      200.0  10.0
8794  Leg Press (Machine)   2 Jan 2024, 17:11      200.0  10.0
8795  Leg Press (Machine)   2 Jan 2024, 17:11      200.0  10.0
8871  Leg Press (Machine)  29 Dec 2023, 10:40      200.0  10.0
8872  Leg Press (Machine)  29 Dec 2023, 10:40      200.0  10.0
8873  Leg Press (Machine)  29 Dec 2023, 10:40      200.0  10.0
8874  Leg Press (Machine)  29 Dec 2023, 10:40      200.0  10.0
9057  Leg Press (Machine)  18 Dec 2023, 10:45      200.0  10.0
9058  Leg Press (Machine)  18 Dec 2023, 10:45      200.0  10.0
9059  Leg Press (Machine)  18 Dec 2023, 10:45      200.0  10.0
9060  Leg Press (Machine)  18 Dec 2023, 10:45      200.0  10.0
9140  Leg Press (Machine)  12 Dec 2023, 10:44      200.0  10.0
9141  Leg Press (Machine)  12 Dec 2023, 10:44      200.0  10.0
9142  Leg Press (Machine)  12 Dec 2023, 10:44      200.0  10.0
9143  Leg Press (Machine)  12 Dec 2023, 10:44      240.0   9.0
9250  Leg Press (Machine)   6 Dec 2023, 17:48      200.0  10.0
9251  Leg Press (Machine)   6 Dec 2023, 17:48      200.0  10.0
9252  Leg Press (Machine)   6 Dec 2023, 17:48      200.0  10.0
9253  Leg Press (Machine)   6 Dec 2023, 17:48      240.0   9.0
9464  Leg Press (Machine)  20 Nov 2023, 17:37      160.0  10.0
9465  Leg Press (Machine)  20 Nov 2023, 17:37      160.0  10.0
9466  Leg Press (Machine)  20 Nov 2023, 17:37      160.0  10.0
9467  Leg Press (Machine)  20 Nov 2023, 17:37      160.0  10.0
Groups and counts how many outliers there are by exercise title:
Show Code
# 1. Detecta todos os outliers novamente
outliers_original = detectar_outliers(df, 'weight_kg')
# 2. Agrupa e conta quantos outliers existem por título de exercício
contagem_outliers = outliers_original['exercise_title'].value_counts()
print("Distribuição de Outliers por Exercício:")
print(contagem_outliers)
Distribuição de Outliers por Exercício:
exercise_title
Leg Press (Machine)    121
Name: count, dtype: int64

Legpress is an exercise where the absolute weight is significantly higher compared to the other exercises but they are not necessarily outliers

Leg Press (Machine) visualization:
Show Code
exercise_name = "Leg Press (Machine)"
df_ex = df[df["exercise_title"] == exercise_name]
fig = px.box(df_ex,y="weight_kg",points="all",hover_data=["reps","set_index","set_type","rpe","start_time"],title=f"Weight Distribuiton — {exercise_name}")
fig.update_layout(yaxis_title="Weight",xaxis_title="",showlegend=False,height=600,width=900,title_x = 0.5)
fig.show()

Correct values. Since my training is based on progressive overload, it makes sense that values close to 280 kg have few records, as they represent the most recent loads lifted.

Detect and display outliers for the ‘reps’ column:
Show Code
outliers_repeticoes = detectar_outliers(df, 'reps')

if not outliers_repeticoes.empty:
    print("Outliers em 'repeticoes (209) ':")
    print(outliers_repeticoes[['exercise_title','weight_kg', 'reps']])
Outliers em 'repeticoes (209) ':
                       exercise_title  weight_kg  reps
3860    Standing Calf Raise (Machine)       20.0  30.0
3861    Standing Calf Raise (Machine)       40.0  25.0
3862    Standing Calf Raise (Machine)       60.0  20.0
3867    Standing Calf Raise (Machine)       40.0  24.0
3868    Standing Calf Raise (Machine)       20.0  30.0
...                               ...        ...   ...
8907   Incline Bench Press (Dumbbell)       17.5  20.0
8957  Seated Shoulder Press (Machine)       37.5  20.0
9021  Seated Shoulder Press (Machine)       37.5  20.0
9105  Seated Shoulder Press (Machine)       37.5  20.0
9282             Cable Fly Crossovers       10.0  16.0

[239 rows x 3 columns]
Group and count outliers per exercise title:
Show Code
outliers_original = detectar_outliers(df, 'reps')

# 2. Agrupa e conta quantos outliers existem por título de exercício
contagem_outliers = outliers_original['exercise_title'].value_counts()

print("Distribuição de Outliers por Exercício:")
print(contagem_outliers)
Distribuição de Outliers por Exercício:
exercise_title
Standing Calf Raise (Machine)               117
Incline Bench Press (Dumbbell)               18
Lateral Raise (Dumbbell)                     11
Calf Extension (Machine)                      8
Chest Fly (Machine)                           8
Triceps Rope Pushdown                         7
Single Arm Tricep Extension (Dumbbell)        6
Triceps Extension (Cable)                     6
Lat Pulldown (Cable)                          6
Crunch (Machine)                              5
Preacher Curl (Machine)                       4
Single Arm Lateral Raise (Cable)              4
Lateral Raise (Machine)                       4
Chest Press (Machine)                         4
Lat Pulldown (Machine)                        4
Iso-Lateral High Row (Machine)                4
Seated Cable Row - V Grip (Cable)             4
Seated Incline Curl (Dumbbell)                3
Seated Shoulder Press (Machine)               3
Seated Wrist Extension (Barbell)              3
Bicep Curl (Dumbbell)                         3
Squat (Smith Machine)                         2
Seated Leg Curl (Machine)                     1
Single Leg Standing Calf Raise (Machine)      1
Back Extension (Machine)                      1
Single Arm Triceps Pushdown (Cable)           1
Cable Fly Crossovers                          1
Name: count, dtype: int64
Standing Calf Raise (Machine):
Show Code
exercise_name = "Standing Calf Raise (Machine)"
df_ex = df[df["exercise_title"] == exercise_name]
fig = px.box(df_ex,y="reps",points="all",hover_data=["reps","weight_kg","set_index","set_type","rpe","start_time"],title=f"Reps Distribuiton — {exercise_name}")
fig.update_layout(yaxis_title="Reps",xaxis_title="",showlegend=False,height=600,width=900,title_x = 0.5)
fig.show()

We evaluated all exercises flagged as outliers and the majority (as seen in the example above) were warm-up sets where no load or a very low load was used; therefore, they do not qualify as incorrect data. The only exception was the Preacher Curl (Machine), detailed below, which indeed contained a manual entry error in the number of repetitions and was handled separately.

Preacher Curl (Machine):
Show Code
exercise_name = "Preacher Curl (Machine)"
df_ex = df[df["exercise_title"] == exercise_name]
fig = px.box(df_ex,y="reps",points="all",hover_data=["reps","weight_kg","set_index","set_type","rpe","start_time"],title=f"Reps Distribuiton — {exercise_name}")
fig.update_layout(yaxis_title="Reps",xaxis_title="",showlegend=False,height=600,width=900,title_x = 0.5)
fig.show()

Here we have an incorrect value: where 111 is recorded, it should actually be 11 repetitions

Analyzing this specific exercise instance:
Show Code
df[
    (df['exercise_title'] == 'Preacher Curl (Machine)') &
    (df['start_time'].str.contains('23 Jan 2024'))
]
title start_time end_time description exercise_title superset_id exercise_notes set_index set_type weight_kg reps distance_km duration_seconds rpe
8604 Treino B - Costas + Biceps 23 Jan 2024, 11:09 23 Jan 2024, 12:09 NaN Preacher Curl (Machine) NaN NaN 0 normal 48.75 12.0 NaN NaN NaN
8605 Treino B - Costas + Biceps 23 Jan 2024, 11:09 23 Jan 2024, 12:09 NaN Preacher Curl (Machine) NaN NaN 1 normal 48.75 111.0 NaN NaN NaN
8606 Treino B - Costas + Biceps 23 Jan 2024, 11:09 23 Jan 2024, 12:09 NaN Preacher Curl (Machine) NaN NaN 2 normal 48.75 12.0 NaN NaN NaN
Imputing the correct value for this set:
Show Code
df.loc[8605, 'reps'] = 11.0
Revalidating the corrected set:
Show Code
df[
    (df['exercise_title'] == 'Preacher Curl (Machine)') &
    (df['start_time'].str.contains('23 Jan 2024'))
]
title start_time end_time description exercise_title superset_id exercise_notes set_index set_type weight_kg reps distance_km duration_seconds rpe
8604 Treino B - Costas + Biceps 23 Jan 2024, 11:09 23 Jan 2024, 12:09 NaN Preacher Curl (Machine) NaN NaN 0 normal 48.75 12.0 NaN NaN NaN
8605 Treino B - Costas + Biceps 23 Jan 2024, 11:09 23 Jan 2024, 12:09 NaN Preacher Curl (Machine) NaN NaN 1 normal 48.75 11.0 NaN NaN NaN
8606 Treino B - Costas + Biceps 23 Jan 2024, 11:09 23 Jan 2024, 12:09 NaN Preacher Curl (Machine) NaN NaN 2 normal 48.75 12.0 NaN NaN NaN

Validations

set_index:
Show Code
# Ordenando para garantir consistência
df_sorted = df.sort_values(
    by=["start_time", "exercise_title", "set_index"]
)

# Agrupando por treino + exercício
set_index_check = (
    df_sorted
    .groupby(["start_time", "exercise_title"])["set_index"]
    .apply(list)
    .reset_index(name="set_index_list")
)

# Funções de validação
def starts_with_zero(seq):
    return min(seq) == 0

def has_gaps(seq):
    return sorted(seq) != list(range(min(seq), max(seq) + 1))

# Aplicando validações
set_index_check["starts_with_zero"] = set_index_check["set_index_list"].apply(starts_with_zero)
set_index_check["has_gaps"] = set_index_check["set_index_list"].apply(has_gaps)

# Casos problemáticos
problemas_set_index = set_index_check[
    (~set_index_check["starts_with_zero"]) |
    (set_index_check["has_gaps"])
]

print(f"Total de exercícios analisados: {len(set_index_check)}")
print(f"Casos com problemas de set_index: {len(problemas_set_index)}")

problemas_set_index.head()
Total de exercícios analisados: 3447
Casos com problemas de set_index: 0
start_time exercise_title set_index_list starts_with_zero has_gaps
Workout Duration:
Show Code
# Convertendo para datetime
df["start_time_dt"] = pd.to_datetime(df["start_time"], format="%d %b %Y, %H:%M")
df["end_time_dt"] = pd.to_datetime(df["end_time"], format="%d %b %Y, %H:%M")

# Calculando duração em minutos
df["duracao_treino_min"] = (
    (df["end_time_dt"] - df["start_time_dt"])
    .dt.total_seconds() / 60
)

# Agregando por treino (um treino pode ter várias linhas)
duracao_por_treino = (
    df.groupby(["title", "start_time"])
    .agg(
        duracao_min=("duracao_treino_min", "first")
    )
    .reset_index()
)

# Treinos muito longos (> 4 horas)
treinos_longos = duracao_por_treino[
    duracao_por_treino["duracao_min"] > 240
]

# Treinos muito curtos (< 10 minutos)
treinos_curtos = duracao_por_treino[
    duracao_por_treino["duracao_min"] < 10
]

print(f"Treinos > 4h: {len(treinos_longos)}")
print(f"Treinos < 10min: {len(treinos_curtos)}")

treinos_longos.head(), treinos_curtos.head()
Treinos > 4h: 0
Treinos < 10min: 0
(Empty DataFrame
 Columns: [title, start_time, duracao_min]
 Index: [],
 Empty DataFrame
 Columns: [title, start_time, duracao_min]
 Index: [])
Naming Consistency:
Show Code
# Total de exercícios únicos
exercicios_unicos = df["exercise_title"].nunique()

# Frequência por exercício
freq_exercicios = df["exercise_title"].value_counts()

# Exercícios que aparecem apenas 1 vez
exercicios_uma_vez = freq_exercicios[freq_exercicios == 1]

print(f"Total de exercícios únicos: {exercicios_unicos}")
print(f"Exercícios que aparecem apenas 1 vez: {len(exercicios_uma_vez)}")

exercicios_uma_vez.head(10)
Total de exercícios únicos: 118
Exercícios que aparecem apenas 1 vez: 5
exercise_title
Squat (Bodyweight)       1
Plank                    1
Push Up                  1
Walking                  1
Stair Machine (Steps)    1
Name: count, dtype: int64

Analysis

Workouts per day:
Show Code
# Treinos por dia
df["data_treino"] = df["start_time_dt"].dt.date

treinos_por_dia = (
    df.groupby("data_treino")["title"]
    .nunique()
    .reset_index(name="qtde_treinos")
)

fig = px.bar(
    treinos_por_dia,
    x="data_treino",
    y="qtde_treinos",
    title="Training Frequency Over Time"
)

fig.update_layout(
    xaxis_title="Data",
    yaxis_title="Workout Count",
    title_x=0.5,
    height=500
)

fig.show()
Load progression - Exercise: Seated Leg Curl (Machine):
Show Code
exercise_name = "Seated Leg Curl (Machine)"

df_ex = (
    df[df["exercise_title"] == exercise_name]
    .groupby("data_treino")
    .agg(carga_max=("weight_kg", "max"))
    .reset_index()
)

fig = px.line(
    df_ex,
    x="data_treino",
    y="carga_max",
    markers=True,
    title=f"Load progression — {exercise_name}"
)

fig.update_layout(
    xaxis_title="Data",
    yaxis_title="Maximum Weight (kg)",
    title_x=0.5,
    height=500
)

fig.show()
Load progression - Exercise: Chest Fly (Machine):
Show Code
exercise_name = "Chest Fly (Machine)"

df_ex = (
    df[df["exercise_title"] == exercise_name]
    .groupby("data_treino")
    .agg(carga_max=("weight_kg", "max"))
    .reset_index()
)

fig = px.line(
    df_ex,
    x="data_treino",
    y="carga_max",
    markers=True,
    title=f"Load progression — {exercise_name}"
)

fig.update_layout(
    xaxis_title="Data",
    yaxis_title="Maximum Weight (kg)",
    title_x=0.5,
    height=500
)

fig.show()
Scatter plot: weight vs. rep:
Show Code
fig = px.scatter(
    df,
    x="weight_kg",
    y="reps",
    color="set_type",
    opacity=0.6,
    title="Weight and Reps Scatter",
    hover_data=["exercise_title", "data_treino"]
)

fig.update_layout(
    xaxis_title="Wight (kg)",
    yaxis_title="Repetitions",
    title_x=0.5,
    height=600
)

fig.show()
Workout duration histogram:
Show Code
fig = px.histogram(
    duracao_por_treino,
    x="duracao_min",
    nbins=30,
    title="Workout Duration Distribution (min)"
)

fig.update_layout(
    xaxis_title="Duratiom (min)",
    yaxis_title="Train Count",
    title_x=0.5,
    height=500
)

fig.show()
Most frequent exercises:
Show Code
top_exercicios = (
    df["exercise_title"]
    .value_counts()
    .head(15)
    .reset_index()
)

top_exercicios.columns = ["exercise_title", "frequencia"]

fig = px.bar(
    top_exercicios,
    x="frequencia",
    y="exercise_title",
    orientation="h",
    title="Most Frequent Exercises (Top 15)"
)

fig.update_layout(
    xaxis_title="Number of sets",
    yaxis_title="",
    title_x=0.5,
    height=600,
    yaxis={'categoryorder':'total ascending'}
)

fig.show()
Set types (distribution):
Show Code
fig = px.pie(
    df,
    names="set_type",
    title="Set Type Distribution"
)

fig.update_layout(
    title_x=0.5,
    height=500
)

fig.show()
Training volume:
Show Code
# Garantindo a coluna de data
df["data_treino"] = df["start_time_dt"].dt.date

# Cada linha representa uma série → volume = contagem de linhas
volume_series_diario = (
    df.groupby("data_treino")
    .size()
    .reset_index(name="total_series")
)

fig = px.line(
    volume_series_diario,
    x="data_treino",
    y="total_series",
    markers=True,
    title="Trainning Volume per Day (Number of sets)"
)

fig.update_layout(
    xaxis_title="Data",
    yaxis_title="Total Sets",
    title_x=0.5,
    height=500
)

fig.show()
Set voulme by type:
Show Code
df["data_treino"] = df["start_time_dt"].dt.date

df_efetivas = df[df["set_type"].isin(["normal", "failure"])]

volume_por_tipo = (
    df_efetivas
    .groupby(["data_treino", "set_type"])
    .size()
    .reset_index(name="total_series")
)

fig = px.area(
    volume_por_tipo,
    x="data_treino",
    y="total_series",
    color="set_type",
    title="Total Sets per Type"
)

fig.update_layout(
    xaxis_title="Data",
    yaxis_title="Total Sets",
    title_x=0.5,
    height=500
)

fig.show()
Final Dataset Export (Silver Layer):
Show Code
df.to_csv("hevy_workouts_clean.csv",index=False,encoding="utf-8")

Stage 2: Raw Diet Data

Before integrating nutrition data into the relational model, a full validation and structural audit was performed on the exported dataset from my personal SQLite database.

The objective of this stage was to:

  • Validate structural integrity
  • Ensure domain consistency
  • Confirm macro–calorie coherence
  • Normalize cycle identification
  • Produce an immutable clean dataset for ingestion
Initial Load:
Show Code
pd.set_option("display.max_rows", 200)
pd.set_option("display.max_colwidth", None)
df = pd.read_csv("diet_daily_raw_export.csv")
df.head()
data carboidratos_g proteinas_g gorduras_g kcal cardio_semanal_min cicle peso_kg observacoes
0 2023-05-06 250 190 85 2525 210 1Cutting I (01/05/2023 - 17/09/2023) 104.3 NaN
1 2023-05-07 250 190 85 2525 210 1Cutting I (01/05/2023 - 17/09/2023) 104.3 NaN
2 2023-05-08 250 190 85 2525 210 1Cutting I (01/05/2023 - 17/09/2023) 104.3 NaN
3 2023-05-09 250 190 85 2525 210 1Cutting I (01/05/2023 - 17/09/2023) 104.3 NaN
4 2023-05-10 250 190 85 2525 210 1Cutting I (01/05/2023 - 17/09/2023) 104.3 NaN
Shape:
Show Code
df.shape
(982, 9)
Structural Overview:
Show Code
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 982 entries, 0 to 981
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   data                982 non-null    object 
 1   carboidratos_g      982 non-null    int64  
 2   proteinas_g         982 non-null    int64  
 3   gorduras_g          982 non-null    int64  
 4   kcal                982 non-null    int64  
 5   cardio_semanal_min  982 non-null    int64  
 6   cicle               982 non-null    object 
 7   peso_kg             982 non-null    float64
 8   observacoes         173 non-null    object 
dtypes: float64(1), int64(5), object(3)
memory usage: 69.2+ KB

Observations:

  • No missing values in numeric macro columns
  • observacoes is optional (notes field)
  • data initially loaded as object (string)
  • cicle stored as free-text categorical field
Granularity validation:
Show Code
df.duplicated(subset=["data"]).sum()
np.int64(0)

This confirms the dataset respects the intended grain:

1 row = 1 day

Date conversion:
Show Code
df["data"] = pd.to_datetime(df["data"], errors="coerce")
df[df["data"].isna()]
data carboidratos_g proteinas_g gorduras_g kcal cardio_semanal_min cicle peso_kg observacoes

No invalid dates detected.

This confirms full temporal integrity.

Missing Values Audit:
Show Code
df.isna().sum()
data                    0
carboidratos_g          0
proteinas_g             0
gorduras_g              0
kcal                    0
cardio_semanal_min      0
cicle                   0
peso_kg                 0
observacoes           809
dtype: int64
  • All macro and calorie fields fully populated
  • observacoes contains null values (expected and acceptable)
  • No missing bodyweight records
  • No missing cycle classification
  • No unexpected structural gaps were found.

Domain Validations:

Negative Value Check:
Show Code
numeric_cols = [
    "carboidratos_g",
    "proteinas_g",
    "gorduras_g",
    "kcal",
    "cardio_semanal_min",
    "peso_kg"
]

(df[numeric_cols] < 0).sum()
carboidratos_g        0
proteinas_g           0
gorduras_g            0
kcal                  0
cardio_semanal_min    0
peso_kg               0
dtype: int64

All zero.

No invalid negative values exist in:

  • Macros
  • Calories
  • Weekly cardio
  • Bodyweight
Caloric Consistency Validation:

Calories were validated against macro-derived energy:

  • Carbs = 4 kcal/g
  • Protein = 4 kcal/g
  • Fat = 9 kcal/g
Show Code
df["kcal_estimado"] = (
    df["carboidratos_g"] * 4 +
    df["proteinas_g"] * 4 +
    df["gorduras_g"] * 9
)

df[["kcal", "kcal_estimado"]].head(10)
kcal kcal_estimado
0 2525 2525
1 2525 2525
2 2525 2525
3 2525 2525
4 2525 2525
5 2525 2525
6 2525 2525
7 2425 2425
8 2425 2425
9 2425 2425
Difference distribution:
Show Code
(df["kcal"] - df["kcal_estimado"]).abs().describe()
count    982.000000
mean       0.020367
std        0.638226
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max       20.000000
dtype: float64

Findings:

  • Mean deviation ≈ 0.02 kcal
  • Maximum deviation: 20 kcal
  • 75% of records show zero difference

Cycle Normalization:

The cicle column contains descriptive strings combining:

  • Sequential cycle number
  • Strategy (Cutting, Bulking, Reverse, Maintenance)
  • Date range
Unique values::
Show Code
df["cicle"].unique()
array(['1Cutting I (01/05/2023 - 17/09/2023)',
       '2Reversa I (18/09/2023 - 29/10/2023)',
       '3Cutting II (30/10/2023 - 23/02/2024)',
       '4Reversa II (24/02/2024 - 22/03/2024)',
       '5Bulking I (23/03/2024 - 19/05/2024)',
       '6Cutting III (20/05/2024 - 28/09/2024)',
       '7Bulking II (28/10/2024 - 01/01/2025)',
       '8Cutting IV (02/01/2025 - 17/04/2025)',
       '9Manutenção I (18/04/2025 - 05/10/2025)',
       '10Cutting V (06/10/2025 - XX/XX/XXXX)'], dtype=object)

To normalize this for relational modeling, a deterministic mapping was created:

Cycle map:
Show Code
cycle_map = {
    "1Cutting I (01/05/2023 - 17/09/2023)": 1,
    "2Reversa I (18/09/2023 - 29/10/2023)": 2,
    "3Cutting II (30/10/2023 - 23/02/2024)": 3,
    "4Reversa II (24/02/2024 - 22/03/2024)": 4,
    "5Bulking I (23/03/2024 - 19/05/2024)": 5,
    "6Cutting III (20/05/2024 - 28/09/2024)": 6,
    "7Bulking II (28/10/2024 - 01/01/2025)": 7,
    "8Cutting IV (02/01/2025 - 17/04/2025)": 8,
    "9Manutenção I (18/04/2025 - 05/10/2025)": 9,
    "10Cutting V (06/10/2025 - XX/XX/XXXX)": 10
}

df["cycle_id"] = df["cicle"].map(cycle_map)
df[df["cycle_id"].isna()][["data", "cicle"]]
data cicle

Result:

No unmapped values.

This guarantees referential integrity for the future cycles dimension table.

Column Standardization:

Columns were renamed to follow the project’s naming conventions:

  • English
  • Snake case
  • Explicit semantic meaning
Show Code
df = df.rename(columns={
    "data": "diet_date",
    "carboidratos_g": "carbs_g",
    "proteinas_g": "protein_g",
    "gorduras_g": "fat_g",
    "kcal": "calories_kcal",
    "cardio_seamanal_min": "cardio_weekly_min",
    "peso_kg": "bodyweight_kg",
    "observacoes": "notes"
})
df = df.drop(columns=["cicle", "kcal_estimado"])
df = df.replace({np.nan: None})
Final Dataset Export (Silver Layer):
Show Code
df.to_csv(
    "diet_daily_clean.csv",
    index=False,
    encoding="utf-8"
)

Stage 3: Relational Modeling (Amazon RDS)

After validating and cleaning both training and nutrition datasets, the next step was the implementation of a normalized relational model.

This stage was executed in:

Amazon RDS — PostgreSQL (Managed Database Service)

The objective was to:

  • Move from flat CSV structure to normalized relational modeling
  • Enforce referential integrity
  • Guarantee deterministic aggregations
  • Enable performant weekly analytics
  • Prepare the system for idempotent ETL ingestion

Infrastructure Context

  • Database engine: PostgreSQL
  • Deployment: Amazon RDS
  • Schema: fitness
  • Connection: SSL-enabled

RDS was intentionally chosen to:

  • Simulate a production-style managed database
  • Avoid local-only architecture
  • Demonstrate cloud database provisioning
  • Keep operational overhead minimal

Database ER Diagram

Schema Definition

The schema was explicitly defined to isolate all project objects.

Show SQL
CREATE SCHEMA IF NOT EXISTS fitness;
SET search_path TO fitness;

This ensures:

  • Logical separation
  • Namespace clarity
  • Clean migration management

Core Tables

The modeling follows a normalized structure where:

1 workout → many exercises 1 exercise → many sets 1 day → 1 diet record

Foreign keys enforce consistency across all relationships.


Table: exercises

Show SQL
CREATE TABLE exercises (
    exercise_id SERIAL PRIMARY KEY,
    exercise_name TEXT NOT NULL UNIQUE,
    exercise_type TEXT,
    is_cardio BOOLEAN,
    created_at TIMESTAMP DEFAULT NOW()
);

Design Notes

  • exercise_name is UNIQUE to prevent duplication during ETL.
  • is_cardio allows future semantic filtering.
  • exercise_type reserved for future classification (not hardcoded yet).
  • Timestamp ensures traceability.

This table acts as a dimension table for exercises.

Table: workouts

Show SQL
CREATE TABLE workouts (
    workout_id SERIAL PRIMARY KEY,
    workout_name TEXT,
    workout_date DATE NOT NULL,
    start_time TIMESTAMP,
    end_time TIMESTAMP,
    duration_minutes NUMERIC(5,2),
    description TEXT,
    source TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

Design Notes

  • workout_date stored separately for aggregation performance.
  • duration_minutes precomputed to avoid runtime calculation overhead.
  • source explicitly tracks ingestion origin (e.g., hevy_csv).
  • Not enforcing uniqueness constraint intentionally — allows ingestion flexibility if future multi-source system is introduced.

This table represents the training session grain.

Table: workout_exercises

Show SQL
CREATE TABLE workout_exercises (
    workout_exercise_id SERIAL PRIMARY KEY,
    workout_id INT NOT NULL REFERENCES workouts(workout_id) ON DELETE CASCADE,
    exercise_id INT NOT NULL REFERENCES exercises(exercise_id),
    exercise_order INT,
    superset_id INT,
    notes TEXT
);

Design Notes

This is a bridge table resolving:

  • Many exercises per workout
  • Exercise order inside workout
  • Superset grouping

ON DELETE CASCADE ensures:

  • Deleting a workout automatically deletes dependent records.

This avoids orphan data.

Table: sets

Show SQL
CREATE TABLE sets (
    set_id SERIAL PRIMARY KEY,
    workout_exercise_id INT NOT NULL REFERENCES workout_exercises(workout_exercise_id) ON DELETE CASCADE,
    set_index INT NOT NULL,
    set_type TEXT,
    weight_kg NUMERIC(6,2),
    reps INT,
    distance_km NUMERIC(6,3),
    duration_seconds INT,
    rpe NUMERIC(3,1),
    created_at TIMESTAMP DEFAULT NOW()
);

Design Notes

This is the most granular fact table.

Grain:

1 row = 1 set

Key decisions:

  • Numeric precision defined explicitly
  • set_type not constrained via ENUM to allow ingestion flexibility
  • No derived metrics stored (volume, 1RM etc.)
  • RPE stored but not enforced

All analytical metrics will be calculated in Gold layer views.

Performance Indexes

Indexes were created to optimize common aggregation paths.

Show SQL
CREATE INDEX idx_workouts_date ON workouts(workout_date);
CREATE INDEX idx_exercises_name ON exercises(exercise_name);
CREATE INDEX idx_sets_type ON sets(set_type);
CREATE INDEX idx_sets_workout_exercise ON sets(workout_exercise_id);

Index Strategy

  • workout_date → weekly aggregations
  • exercise_name → lookup & joins
  • set_type → filtering warmups/failure sets
  • workout_exercise_id → join optimization

Indexes were added intentionally after understanding expected query patterns.

Nutrition Table

Show SQL
CREATE TABLE diet_daily (
    diet_date DATE PRIMARY KEY,
    carbs_g INT,
    protein_g INT,
    fat_g INT,
    calories_kcal INT,
    cardio_weekly_min INT,
    cycle_id INT
        REFERENCES cycles(cycle_id),
    bodyweight_kg NUMERIC(5,2),
    notes TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

Design Notes

Grain:

1 row = 1 day

Key decisions:

  • diet_date is PRIMARY KEY → prevents duplicates
  • cycle_id references a dimension table (cycles)
  • No macro-derived metrics stored
  • cardio_weekly_min kept at daily level for consistency with original structure

This table enables:

  • Weekly averages
  • Cycle segmentation
  • Cross-analysis with training

Referential Integrity Strategy

Foreign keys were enforced in:

  • workout_exercises → workouts
  • workout_exercises → exercises
  • sets → workout_exercises
  • diet_daily → cycles

This guarantees:

  • No orphan sets
  • No unmapped exercises
  • No diet record without valid cycle

The model prevents silent corruption.

Stage 4: ETL Pipeline (Amazon RDS)

After validating and cleaning the raw datasets, the next step was to load the data into a normalized relational schema deployed in Amazon RDS (PostgreSQL).

The ETL pipeline was executed against an Amazon RDS PostgreSQL instance under the fitness schema.

Environment Configuration

Database credentials were managed through environment variables to avoid hardcoding secrets.

Show Code
import os
from dotenv import load_dotenv

# Carrega o .env da raiz do projeto
load_dotenv()

DB_HOST = os.getenv("DB_HOST")
DB_PORT = os.getenv("DB_PORT", "5432")
DB_NAME = os.getenv("DB_NAME", "postgres")
DB_USER = os.getenv("DB_USER", "postgres")
DB_PASSWORD = os.getenv("DB_PASSWORD")
DB_SCHEMA = os.getenv("DB_SCHEMA", "fitness")

CSV_PATH = os.getenv("CSV_PATH")

This ensures:

  • Secure configuration
  • Easy migration between local and AWS environments
  • No credentials stored in version control

Database Engine (Amazon RDS Connection)

Show Code
from sqlalchemy import create_engine
from etl.config.settings import (
    DB_HOST, DB_PORT, DB_NAME,
    DB_USER, DB_PASSWORD, DB_SCHEMA
)

def get_engine():
    if not DB_PASSWORD:
        raise RuntimeError("DB_PASSWORD não foi carregada do .env")

    url = (
        f"postgresql+psycopg2://{DB_USER}:{DB_PASSWORD}"
        f"@{DB_HOST}:{DB_PORT}/{DB_NAME}"
    )

    engine = create_engine(
        url,
        connect_args={"options": f"-csearch_path={DB_SCHEMA}"}
    )

    return engine

This engine connects directly to:

Amazon RDS → PostgreSQL → fitness schema


Load Order Strategy

Because of foreign key dependencies, the loading order must respect relational integrity:

  1. exercises
  2. workouts
  3. workout_exercises
  4. sets

This guarantees:

  • No orphan records
  • Valid foreign keys
  • Controlled dependency resolution

Load — Exercises

Show Code
import pandas as pd
from sqlalchemy import text
from etl.utils.db import get_engine
from etl.config.settings import CSV_PATH


def load_exercises():
    engine = get_engine()

    # 1. Ler CSV (Bronze → memória)
    df = pd.read_csv(CSV_PATH)

    # 2. Extrair exercícios únicos
    exercises = (
        df["exercise_title"]
        .dropna()
        .drop_duplicates()
        .sort_values()
        .to_frame(name="exercise_name")
    )

    # 3. Inferir se é cardio
    cardio_mask = (
        df.groupby("exercise_title")[["distance_km", "duration_seconds"]]
        .apply(lambda x: x.notna().any().any())
    )

    exercises["is_cardio"] = (
        exercises["exercise_name"]
        .map(cardio_mask)
        .fillna(False)
    )

    # 4. Inserção idempotente
    insert_sql = text("""
        INSERT INTO exercises (exercise_name, is_cardio)
        VALUES (:exercise_name, :is_cardio)
        ON CONFLICT (exercise_name) DO NOTHING;
    """)

    with engine.begin() as conn:
        conn.execute(
            insert_sql,
            exercises.to_dict(orient="records")
        )

    print(f"[OK] {len(exercises)} exercises processed.")

Key logic:

  • Extract unique exercises from CSV
  • Infer is_cardio
  • Insert using ON CONFLICT DO NOTHING

Idempotent insert:

Show SQL
INSERT INTO exercises (exercise_name, is_cardio)
VALUES (:exercise_name, :is_cardio)
ON CONFLICT (exercise_name) DO NOTHING;

This ensures:

  • Safe re-runs
  • No duplication
  • Deterministic state in RDS

Load — Workouts:

Show Code
import pandas as pd
from sqlalchemy import text
from etl.utils.db import get_engine
from etl.config.settings import CSV_PATH


def load_workouts():
    engine = get_engine()

    # 1. Ler CSV
    df = pd.read_csv(CSV_PATH)

    # 2. Parse datas
    df["start_time"] = pd.to_datetime(df["start_time"])
    df["end_time"] = pd.to_datetime(df["end_time"])

    # 3. Agrupar por sessão de treino
    workouts = (
        df.groupby(["title", "start_time", "end_time"], as_index=False)
        .agg(
            workout_name=("title", "first"),
            description=("description", "first")
        )
    )

    # 4. Campos derivados
    workouts["workout_date"] = workouts["start_time"].dt.date
    workouts["duration_minutes"] = (
        (workouts["end_time"] - workouts["start_time"])
        .dt.total_seconds() / 60
    )
    workouts["source"] = "hevy_csv"

    # 5. SQLs
    select_sql = text("""
        SELECT workout_id
        FROM workouts
        WHERE workout_name = :workout_name
          AND start_time = :start_time
          AND end_time = :end_time
    """)

    insert_sql = text("""
        INSERT INTO workouts (
            workout_name,
            workout_date,
            start_time,
            end_time,
            duration_minutes,
            description,
            source
        )
        VALUES (
            :workout_name,
            :workout_date,
            :start_time,
            :end_time,
            :duration_minutes,
            :description,
            :source
        )
    """)

    inserted = 0

    with engine.begin() as conn:
        for _, row in workouts.iterrows():
            exists = conn.execute(
                select_sql,
                {
                    "workout_name": row["workout_name"],
                    "start_time": row["start_time"],
                    "end_time": row["end_time"],
                }
            ).fetchone()

            if exists:
                continue

            conn.execute(insert_sql, row.to_dict())
            inserted += 1

    print(f"[OK] {inserted} workouts inserted.")

Steps:

  • Group by session (title, start_time, end_time)

  • Derive:

    • workout_date
    • duration_minutes
    • source
Existence check:
Show SQL
SELECT workout_id
FROM workouts
WHERE workout_name = :workout_name
  AND start_time = :start_time
  AND end_time = :end_time

Only inserts if not found.

This prevents duplicate training sessions on re-execution.


Load — Workout Exercises:

Show Code
import pandas as pd
from sqlalchemy import text
from etl.utils.db import get_engine
from etl.config.settings import CSV_PATH


def load_workout_exercises():
    engine = get_engine()

    # 1. Ler CSV
    df = pd.read_csv(CSV_PATH)

    # Parse datas
    df["start_time"] = pd.to_datetime(df["start_time"])
    df["end_time"] = pd.to_datetime(df["end_time"])

    # 2. Buscar mapeamento de workouts
    with engine.begin() as conn:
        workouts_map = {
            (row.workout_name, row.start_time, row.end_time): row.workout_id
            for row in conn.execute(text("""
                SELECT workout_id, workout_name, start_time, end_time
                FROM workouts
            """))
        }

        exercises_map = {
            row.exercise_name: row.exercise_id
            for row in conn.execute(text("""
                SELECT exercise_id, exercise_name
                FROM exercises
            """))
        }

    # 3. Gerar estrutura workout_exercises
    records = []
    seen = set()

    for (
        workout_name,
        start_time,
        end_time
    ), group in df.groupby(["title", "start_time", "end_time"]):

        workout_id = workouts_map.get((workout_name, start_time, end_time))
        if not workout_id:
            continue

        exercise_sequence = (
            group["exercise_title"]
            .drop_duplicates()
            .tolist()
        )

        for order, exercise_name in enumerate(exercise_sequence):
            exercise_id = exercises_map.get(exercise_name)
            if not exercise_id:
                continue

            superset_id = (
                group[group["exercise_title"] == exercise_name]["superset_id"]
                .dropna()
                .unique()
            )
            superset_id = int(superset_id[0]) if len(superset_id) > 0 else None

            key = (workout_id, exercise_id, order)
            if key in seen:
                continue
            seen.add(key)

            records.append({
                "workout_id": workout_id,
                "exercise_id": exercise_id,
                "exercise_order": order,
                "superset_id": superset_id,
                "notes": None
            })

    # 4. Inserção idempotente
    insert_sql = text("""
        INSERT INTO workout_exercises (
            workout_id,
            exercise_id,
            exercise_order,
            superset_id,
            notes
        )
        VALUES (
            :workout_id,
            :exercise_id,
            :exercise_order,
            :superset_id,
            :notes
        )
        ON CONFLICT DO NOTHING;
    """)

    with engine.begin() as conn:
        conn.execute(insert_sql, records)

    print(f"[OK] {len(records)} workout_exercises processed.")

Responsibilities:

  • Map workouts → exercises
  • Preserve exercise order
  • Preserve superset relationships
Idempotent logic:
Show SQL
INSERT INTO workout_exercises (...)
VALUES (...)
ON CONFLICT DO NOTHING;

Ensures safe replay.


Load — Sets (Most Granular Layer):

Show Code
import pandas as pd
from sqlalchemy import text
from etl.utils.db import get_engine
from etl.config.settings import CSV_PATH


def clean_nan(value):
    """
    Converte NaN do pandas para None (NULL no PostgreSQL)
    """
    if pd.isna(value):
        return None
    return value


def load_sets():
    engine = get_engine()

    # 1. Ler CSV
    df = pd.read_csv(CSV_PATH)

    # Parse datas
    df["start_time"] = pd.to_datetime(df["start_time"])
    df["end_time"] = pd.to_datetime(df["end_time"])

    # 2. Mapas auxiliares
    with engine.begin() as conn:
        workouts_map = {
            (row.workout_name, row.start_time, row.end_time): row.workout_id
            for row in conn.execute(text("""
                SELECT workout_id, workout_name, start_time, end_time
                FROM workouts
            """))
        }

        exercises_map = {
            row.exercise_name: row.exercise_id
            for row in conn.execute(text("""
                SELECT exercise_id, exercise_name
                FROM exercises
            """))
        }

        workout_ex_map = {
            (row.workout_id, row.exercise_id, row.exercise_order): row.workout_exercise_id
            for row in conn.execute(text("""
                SELECT workout_exercise_id, workout_id, exercise_id, exercise_order
                FROM workout_exercises
            """))
        }

    inserted = 0

    insert_sql = text("""
        INSERT INTO sets (
            workout_exercise_id,
            set_index,
            set_type,
            weight_kg,
            reps,
            distance_km,
            duration_seconds,
            rpe
        )
        VALUES (
            :workout_exercise_id,
            :set_index,
            :set_type,
            :weight_kg,
            :reps,
            :distance_km,
            :duration_seconds,
            :rpe
        )
    """)

    exists_sql = text("""
        SELECT 1
        FROM sets
        WHERE workout_exercise_id = :workout_exercise_id
          AND set_index = :set_index
    """)

    with engine.begin() as conn:
        for (
            workout_name,
            start_time,
            end_time
        ), group in df.groupby(["title", "start_time", "end_time"]):

            workout_id = workouts_map.get((workout_name, start_time, end_time))
            if not workout_id:
                continue

            # Ordem dos exercícios dentro do treino
            exercise_sequence = (
                group["exercise_title"]
                .drop_duplicates()
                .tolist()
            )

            exercise_order_map = {
                name: idx for idx, name in enumerate(exercise_sequence)
            }

            for _, row in group.iterrows():
                exercise_id = exercises_map.get(row["exercise_title"])
                if not exercise_id:
                    continue

                exercise_order = exercise_order_map.get(row["exercise_title"])
                workout_exercise_id = workout_ex_map.get(
                    (workout_id, exercise_id, exercise_order)
                )

                if not workout_exercise_id:
                    continue

                exists = conn.execute(
                    exists_sql,
                    {
                        "workout_exercise_id": workout_exercise_id,
                        "set_index": int(row["set_index"]),
                    }
                ).fetchone()

                if exists:
                    continue

                conn.execute(
                    insert_sql,
                    {
                        "workout_exercise_id": workout_exercise_id,
                        "set_index": int(row["set_index"]),
                        "set_type": row["set_type"],
                        "weight_kg": clean_nan(row["weight_kg"]),
                        "reps": clean_nan(row["reps"]),
                        "distance_km": clean_nan(row["distance_km"]),
                        "duration_seconds": clean_nan(row["duration_seconds"]),
                        "rpe": clean_nan(row["rpe"]),
                    }
                )

                inserted += 1

    print(f"[OK] {inserted} sets inserted.")

This is the most sensitive layer:

  • 1 row = 1 set
  • Must match workout_exercise_id
  • Must avoid duplication

Before insertion:

  • Convert NaN → NULL (PostgreSQL compatible)
Existence check:
Show SQL
SELECT 1
FROM sets
WHERE workout_exercise_id = :workout_exercise_id
  AND set_index = :set_index

If exists → skip If not → insert

This makes the pipeline:

Fully idempotent at the lowest grain level

Pipeline Orchestration:

All steps were orchestrated through a single execution entrypoint:

Show Code
from etl.load.load_exercises import load_exercises
from etl.load.load_workouts import load_workouts
from etl.load.load_workout_exercises import load_workout_exercises
from etl.load.load_sets import load_sets


def run():
    print("=== ETL STARTED ===")

    load_exercises()
    load_workouts()
    load_workout_exercises()
    load_sets()

    print("=== ETL FINISHED ===")


if __name__ == "__main__":
    run()

This executed directly against:

Amazon RDS (PostgreSQL)

Idempotency Validation

The ETL pipeline was executed multiple times directly in Amazon RDS to validate replay safety and deterministic behavior. Across repeated executions, no duplicate exercises, workouts, or sets were created, and no referential integrity violations occurred. Row counts remained stable, confirming that the loading logic — including ON CONFLICT clauses and explicit existence checks — guarantees idempotent behavior and consistent state reconstruction.

This validates that the pipeline is replay-safe by design, meaning the same input dataset always produces the same relational state without unintended duplication or corruption.


Production Characteristics

The ETL architecture was intentionally designed with production principles in mind. The modular structure, explicit foreign key mapping, deterministic execution order, and SQLAlchemy-based connectivity ensure that the pipeline can be easily extended into scheduled batch jobs, containerized workflows, or orchestrated environments such as Lambda or Airflow.

At this stage, validated CSV datasets were transformed into a fully normalized relational schema in Amazon RDS using PostgreSQL, idempotent SQL patterns, and controlled schema management. The system is now structurally stable, cloud-portable, and prepared for analytical aggregation in the Gold layer and subsequent LLM contract generation.

Stage 5: Gold Layer

After loading the normalized relational model into Amazon RDS (PostgreSQL), the next step was to build the Gold Layer.

This layer was designed specifically for:

  • Weekly deterministic aggregation
  • Safe week-over-week comparison
  • Explicit metric semantics
  • No hidden calculations
  • LLM-ready consumption

All views were created directly inside the Amazon RDS instance under the fitness schema.

Weekly Training Summary

This view aggregates training metrics at the weekly level.

Show SQL
CREATE OR REPLACE VIEW gold_weekly_fitness_summary AS
WITH training_summary AS (
    SELECT
        DATE_TRUNC('week', w.workout_date)::date AS week_start,
        COUNT(DISTINCT w.workout_id)             AS training_sessions,
        COUNT(s.set_id)                          AS total_sets,
        COUNT(*) FILTER (WHERE s.set_type = 'failure') AS failure_sets,
        ROUND(AVG(s.reps), 2)                    AS avg_reps_per_set
    FROM workouts w
    JOIN workout_exercises we
        ON w.workout_id = we.workout_id
    JOIN sets s
        ON we.workout_exercise_id = s.workout_exercise_id
    WHERE s.set_type <> 'warmup'
    GROUP BY week_start
)
SELECT
    d.week_start,

    -- Diet
    d.avg_calories_kcal,
    d.avg_carbs_g,
    d.avg_protein_g,
    d.avg_fat_g,
    d.avg_bodyweight_kg,
    d.cardio_weekly_min,
    d.logged_days,

    -- Training
    t.training_sessions,
    t.total_sets,
    t.failure_sets,
    t.avg_reps_per_set,

    -- Cycle
    c.cycle_name,
    c.cycle_type
FROM gold_weekly_diet_summary d
LEFT JOIN training_summary t
    ON d.week_start = t.week_start
LEFT JOIN cycles c
    ON d.cycle_id = c.cycle_id;

Purpose

This creates a stable weekly anchor containing:

  • Diet averages
  • Training volume
  • Failure intensity
  • Active cycle

This becomes the high-level weekly context layer.


Weekly Training Detail (Per Exercise)

Show SQL
CREATE OR REPLACE VIEW gold_weekly_training_detail AS
SELECT
    DATE_TRUNC('week', w.workout_date)::date AS week_start,

    e.exercise_name,
    mg.muscle_group_name,

    COUNT(s.set_id)                          AS total_sets,
    ROUND(AVG(s.weight_kg), 2)               AS avg_weight_kg,
    ROUND(AVG(s.reps), 2)                    AS avg_reps,
    MAX(s.weight_kg)                         AS max_weight_kg,

    COUNT(*) FILTER (WHERE s.set_type = 'failure') AS failure_sets
FROM workouts w
JOIN workout_exercises we
    ON w.workout_id = we.workout_id
JOIN exercises e
    ON we.exercise_id = e.exercise_id
JOIN exercise_muscle_map emm
    ON e.exercise_id = emm.exercise_id
JOIN muscle_groups mg
    ON emm.muscle_group_id = mg.muscle_group_id
JOIN sets s
    ON we.workout_exercise_id = s.workout_exercise_id
WHERE s.set_type <> 'warmup'
GROUP BY
    week_start,
    e.exercise_name,
    mg.muscle_group_name;

Purpose

This view exposes:

  • Volume by exercise
  • Muscle group mapping
  • Average and maximum loads
  • Failure density

This enables:

  • Muscle-specific analysis
  • Volume allocation insights
  • Performance segmentation

Weekly Exercise Performance (Structured Metrics)

Show SQL
CREATE OR REPLACE VIEW gold_weekly_exercise_performance AS
SELECT
    DATE_TRUNC('week', w.workout_date)::date AS week_start,

    e.exercise_id,
    e.exercise_name,
    mg.muscle_group_name,

    COUNT(s.set_id) AS total_sets,

    COUNT(*) FILTER (
        WHERE s.set_type = 'failure'
    ) AS failure_sets,

    ROUND(AVG(s.reps), 2) AS avg_reps,

    ROUND(AVG(s.weight_kg), 2) AS avg_weight_kg,
    MAX(s.weight_kg)           AS max_weight_kg,

    SUM(s.reps) AS total_reps,

    COUNT(DISTINCT w.workout_id) AS sessions_count

FROM workouts w
JOIN workout_exercises we
    ON w.workout_id = we.workout_id
JOIN sets s
    ON we.workout_exercise_id = s.workout_exercise_id
JOIN exercises e
    ON we.exercise_id = e.exercise_id
JOIN exercise_muscle_map emm
    ON e.exercise_id = emm.exercise_id
JOIN muscle_groups mg
    ON emm.muscle_group_id = mg.muscle_group_id

WHERE s.set_type <> 'warmup'

GROUP BY
    week_start,
    e.exercise_id,
    e.exercise_name,
    mg.muscle_group_name;

Purpose

This is the LLM-facing metric layer.

It produces:

  • Explicit totals
  • Explicit averages
  • Explicit max loads
  • Explicit session count

No inferred metrics. No derived semantics. Only controlled aggregations.

Week-over-Week Progression

This view introduces temporal comparison using window functions.

Show SQL
CREATE OR REPLACE VIEW gold_weekly_exercise_progression AS
SELECT
    week_start,
    exercise_id,
    exercise_name,
    muscle_group_name,

    -- Current week metrics
    total_sets,
    total_reps,
    avg_reps,
    avg_weight_kg,
    max_weight_kg,
    failure_sets,
    sessions_count,

    -- Previous week metrics
    LAG(total_sets)      OVER w AS prev_total_sets,
    LAG(total_reps)      OVER w AS prev_total_reps,
    LAG(avg_reps)        OVER w AS prev_avg_reps,
    LAG(avg_weight_kg)   OVER w AS prev_avg_weight_kg,
    LAG(max_weight_kg)   OVER w AS prev_max_weight_kg,
    LAG(failure_sets)    OVER w AS prev_failure_sets,

    -- Deltas
    total_sets    - LAG(total_sets)    OVER w AS delta_total_sets,
    total_reps    - LAG(total_reps)    OVER w AS delta_total_reps,
    avg_reps      - LAG(avg_reps)      OVER w AS delta_avg_reps,
    avg_weight_kg - LAG(avg_weight_kg) OVER w AS delta_avg_weight_kg,
    max_weight_kg - LAG(max_weight_kg) OVER w AS delta_max_weight_kg,
    failure_sets  - LAG(failure_sets)  OVER w AS delta_failure_sets

FROM gold_weekly_exercise_performance
WINDOW w AS (
    PARTITION BY exercise_id
    ORDER BY week_start
);

Purpose

This enables:

  • Deterministic progression tracking
  • Explicit deltas
  • Safe LLM comparison

Instead of asking the LLM to compute differences, the database computes them deterministically.

Final LLM Context View

This is the final structured analytical contract inside RDS.

Show SQL
CREATE OR REPLACE VIEW gold_llm_weekly_exercise_context AS
SELECT
    p.week_start,

    -- Exercise identity
    p.exercise_id,
    p.exercise_name,
    p.muscle_group_name,

    -- Training (current week)
    p.total_sets,
    p.total_reps,
    p.avg_reps,
    p.avg_weight_kg,
    p.max_weight_kg,
    p.failure_sets,
    p.sessions_count,

    -- Training deltas (vs previous week)
    p.delta_total_sets,
    p.delta_total_reps,
    p.delta_avg_reps,
    p.delta_avg_weight_kg,
    p.delta_max_weight_kg,
    p.delta_failure_sets,

    -- Diet context (weekly)
    d.avg_calories_kcal,
    d.avg_carbs_g,
    d.avg_protein_g,
    d.avg_fat_g,
    d.cardio_weekly_min,
    d.avg_bodyweight_kg,
    d.logged_days,

    -- Cycle context
    c.cycle_id,
    c.cycle_name,
    c.cycle_type

FROM gold_weekly_exercise_progression p
LEFT JOIN gold_weekly_diet_summary d
    ON p.week_start = d.week_start
LEFT JOIN cycles c
    ON d.cycle_id = c.cycle_id;

Purpose

This is the final structured semantic layer consumed by the JSON extractor.

It merges:

  • Exercise performance
  • Progression metrics
  • Diet context
  • Cycle context

Into a single analytical surface.

Gold Layer Summary

The Gold Layer was fully implemented inside Amazon RDS using SQL views.

Stage 6: Canonical JSON (LLM)

After building the normalized relational model and the Gold analytical layer, the final step was to generate a canonical, deterministic JSON contract designed specifically for language model consumption.

The goal was not to simply serialize database rows, but to:

  • Eliminate relational complexity
  • Remove ambiguity
  • Provide explicit deltas (no inference required)
  • Deliver a schema-stable, versioned payload

This JSON acts as the single source of truth for LLM analysis.

Design Principles

The canonical contract was designed under the following constraints:

  1. One file per week
  2. Fully deterministic structure
  3. Explicit weekly progression deltas
  4. No derived metrics inside the LLM
  5. No missing semantic context
  6. Schema versioning support

Schema Definition (Versioned Contract)

The schema is explicitly defined and version-controlled:

Show Code
{
  "week_start": "YYYY-MM-DD",
  "cycle": {
    "cycle_id": "int",
    "cycle_name": "string",
    "cycle_type": "cutting | bulking | maintenance | reverse"
  },
  "diet": {
    "avg_calories_kcal": "number",
    "avg_carbs_g": "number",
    "avg_protein_g": "number",
    "avg_fat_g": "number",
    "cardio_weekly_min": "number",
    "avg_bodyweight_kg": "number",
    "logged_days": "int"
  },
  "training_overview": {
    "training_sessions": "int",
    "total_sets": "int",
    "failure_sets": "int",
    "avg_reps_per_set": "number"
  },
  "exercises": [
    {
      "exercise_id": "int",
      "exercise_name": "string",
      "muscle_group": "string",
      "current_week": { ... },
      "delta_vs_last_week": { ... }
    }
  ]
}

Payload Builder Implementation

The JSON is generated from Gold views using a dedicated extraction script.

Show Code
import json
import os
from datetime import date
from decimal import Decimal

from sqlalchemy import text
from etl.utils.db import get_engine

# ===============================
# CONFIG
# ===============================

OUTPUT_DIR = "llm/payloads"
SCHEMA_VERSION = "v1"


# ===============================
# UTILS
# ===============================

def json_safe(value):
    """
    Converte tipos não serializáveis (Decimal, date, datetime)
    para formatos compatíveis com JSON.
    """
    if isinstance(value, Decimal):
        return float(value)
    if hasattr(value, "isoformat"):
        return value.isoformat()
    return value


# ===============================
# FETCH FUNCTIONS
# ===============================

def fetch_weekly_summary(conn, week_start):
    sql = text("""
        SELECT *
        FROM gold_weekly_fitness_summary
        WHERE week_start = :week_start
    """)
    row = conn.execute(
        sql,
        {"week_start": week_start}
    ).mappings().fetchone()

    return dict(row) if row else None


def fetch_exercise_context(conn, week_start):
    sql = text("""
        SELECT *
        FROM gold_llm_weekly_exercise_context_v1
        WHERE week_start = :week_start
        ORDER BY exercise_name
    """)

    rows = conn.execute(
        sql,
        {"week_start": week_start}
    ).mappings().fetchall()

    return [dict(r) for r in rows]


# ===============================
# PAYLOAD BUILDER
# ===============================

def build_payload(week_start: str):
    engine = get_engine()

    with engine.begin() as conn:
        weekly = fetch_weekly_summary(conn, week_start)
        if not weekly:
            raise RuntimeError(
                f"No weekly summary found for week_start={week_start}"
            )

        exercises_raw = fetch_exercise_context(conn, week_start)
        if not exercises_raw:
            raise RuntimeError(
                f"No exercise context found for week_start={week_start}"
            )

    # -------------------------------
    # Exercises
    # -------------------------------
    exercises = []

    for r in exercises_raw:
        exercises.append({
            "exercise_id": r["exercise_id"],
            "exercise_name": r["exercise_name"],
            "muscle_group": r["muscle_group_name"],

            "current_week": {
                "total_sets": r["current_total_sets"],
                "total_reps": r["current_total_reps"],
                "avg_reps": json_safe(r["current_avg_reps"]),
                "avg_weight_kg": json_safe(r["current_avg_weight_kg"]),
                "max_weight_kg": json_safe(r["current_max_weight_kg"]),
                "failure_sets": r["current_failure_sets"],
                "sessions_count": r["current_sessions"]
            },

            "delta_vs_last_week": {
                "total_sets": r["delta_sets"],
                "total_reps": r["delta_reps"],
                "avg_reps": json_safe(r["delta_avg_reps"]),
                "avg_weight_kg": json_safe(r["delta_avg_weight_kg"]),
                "max_weight_kg": json_safe(r["delta_max_weight_kg"]),
                "failure_sets": r["delta_failure_sets"]
            }
        })

    # -------------------------------
    # Final payload
    # -------------------------------
    payload = {
        "schema_version": SCHEMA_VERSION,
        "generated_at": date.today().isoformat(),
        "week_start": weekly["week_start"].isoformat(),

        "cycle": {
            "cycle_name": weekly["cycle_name"],
            "cycle_type": weekly["cycle_type"]
        },

        "diet": {
            "avg_calories_kcal": json_safe(weekly["avg_calories_kcal"]),
            "avg_carbs_g": json_safe(weekly["avg_carbs_g"]),
            "avg_protein_g": json_safe(weekly["avg_protein_g"]),
            "avg_fat_g": json_safe(weekly["avg_fat_g"]),
            "cardio_weekly_min": weekly["cardio_weekly_min"],
            "avg_bodyweight_kg": json_safe(weekly["avg_bodyweight_kg"]),
            "logged_days": weekly["logged_days"]
        },

        "training_overview": {
            "training_sessions": weekly["training_sessions"],
            "total_sets": weekly["total_sets"],
            "failure_sets": weekly["failure_sets"],
            "avg_reps_per_set": json_safe(weekly["avg_reps_per_set"])
        },

        "exercises": exercises
    }

    return payload


# ===============================
# SAVE
# ===============================

def save_payload(payload):
    os.makedirs(OUTPUT_DIR, exist_ok=True)

    fname = f"weekly_fitness_context_{payload['week_start']}.json"
    path = os.path.join(OUTPUT_DIR, fname)

    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)

    return path


# ===============================
# MAIN
# ===============================

if __name__ == "__main__":
    # Use uma semana que você sabe que existe no banco
    WEEK_START = "2026-01-05"

    payload = build_payload(WEEK_START)
    path = save_payload(payload)

    print(f"[OK] Weekly context generated: {path}")

Why This Structure Is LLM-Optimal

The JSON contract was intentionally designed to eliminate the most common failure modes observed when large language models consume structured data. Instead of requiring the model to infer joins, compute temporal comparisons, or aggregate metrics on the fly, all calculations are performed upstream in the Gold layer. The LLM receives a fully contextualized, semantically complete representation of the week.

There are no implicit relationships left to interpret, no derived metrics to calculate, and no schema ambiguity. Weekly deltas are explicitly provided, progression is already computed, and all nutritional and training context is embedded within the same document. As a result, the model operates purely as an analytical reasoning layer rather than a data transformation engine. This dramatically increases reliability, interpretability, and consistency of responses.

Deterministic Properties

For any given week_start, the system always produces the exact same output. The canonical JSON is generated from stable Gold views, using a fixed schema definition, explicit ordering rules, and deterministic delta calculations based on window functions. There is no stochastic logic in the data layer, no dynamic field generation, and no schema mutation.

Because the transformation pipeline is idempotent and the Gold layer is reproducible, the JSON payload becomes a replay-safe artifact. Running the extraction process multiple times for the same week will always result in an identical file. This deterministic behavior is critical for safe LLM interaction, auditing, version control, and long-term system stability.

Architectural Role

The JSON contract represents the formal boundary between structured analytical data and generative AI systems. Upstream of this boundary, the system is strictly relational, normalized, and governed by SQL logic inside Amazon RDS. Downstream of this boundary, the system becomes model-agnostic and cloud-portable.

Because the contract is independent of any specific model provider, it can be stored in S3, consumed by Amazon Bedrock, OpenAI models, or any future inference endpoint without requiring changes to the data layer. This separation of concerns ensures that the analytical infrastructure remains stable even as the AI layer evolves.

Stage 7: AWS Execution

After the relational modeling, Gold-layer aggregation, and canonical JSON generation were completed, the final stage of the project was executed entirely within the AWS ecosystem.

The weekly payload generated from Amazon RDS was uploaded to Amazon S3 and consumed directly through Amazon Bedrock. The LLM interaction occurred inside the AWS environment, ensuring that the structured data layer, storage layer, and inference layer remained cloud-native and internally integrated.

The inference process followed this flow:

Amazon RDS (Gold Views) → Canonical JSON Extraction → S3 Upload → Amazon Bedrock (LLM Inference) → Generated Weekly Performance Report

The prompt was manually defined inside the Bedrock playground, and the JSON payload was provided as structured context. The model was instructed to generate a deterministic weekly analytical report, focusing on progression, regression, diet alignment, fatigue signals, and performance insights — without inventing metrics or extrapolating beyond the provided contract.

The entire execution was recorded to demonstrate real AWS operation, including payload submission and model response generation.


AWS Run-Once Execution (Bedrock Inference)

To demonstrate full cloud integration, the final analytical stage of the project was executed entirely inside Amazon Web Services.

This execution followed a run-once model, intentionally designed to validate the architecture while maintaining strict cost control. The objective was not continuous deployment, but architectural proof of integration between:

  • Structured relational data (Amazon RDS)
  • Canonical analytical artifact (JSON)
  • Generative AI inference (Amazon Bedrock)

S3 Bucket Creation and Canonical Payload Upload

The first step of the cloud execution layer was the creation of an Amazon S3 bucket to store the canonical JSON contract generated from the Gold layer.

The recorded demonstration includes:

  • Creation of a dedicated S3 bucket
  • Proper configuration and region selection
  • Upload of the generated weekly JSON payload
  • Verification of the object structure inside AWS

This step establishes S3 as the formal storage boundary between deterministic data engineering outputs and AI consumption layers.

By uploading the JSON to S3, the architecture aligns with production-grade patterns where:

  • Analytical artifacts are versioned
  • Outputs are cloud-stored
  • Downstream services can consume structured data
  • The data platform remains decoupled from inference engines

The S3 object represents the finalized analytical state of a given week_start.

Amazon Bedrock Prompt Execution and LLM Report Generation

After the canonical JSON was available in AWS, the next step was inference execution inside Amazon Bedrock.

In the recorded demonstration:

  • The Amazon Bedrock Playground was opened
  • A Claude model was selected
  • The canonical JSON payload was inserted as structured context
  • A engineered analytical prompt was defined
  • The model generated a weekly performance report
  • The result was reviewed and validated against the structured input

The prompt was designed to:

  • Constrain reasoning strictly to provided data
  • Avoid external assumptions
  • Focus on progression and regression signals
  • Correlate diet and training metrics
  • Produce structured analytical insights rather than narrative speculation

The model operates purely as an analytical reasoning layer, not as a data transformation layer.

This confirms that the system successfully transitions from:

Relational Modeling (RDS) → Analytical Aggregation (Gold Layer) → Deterministic JSON Contract → Cloud Storage (S3) → Generative Analysis (Bedrock)

The execution was run-once to minimize cloud costs, but the architecture fully supports future automation through event-driven or scheduled orchestration.


Project Conclusion

This project demonstrates the full construction of an end-to-end analytical data platform, starting from raw operational CSV exports and evolving into a cloud-executed, AI-ready reporting system. Every stage was deliberately engineered with production principles in mind: explicit data auditing before ingestion, normalized relational modeling in Amazon RDS, idempotent ETL design, deterministic Gold-layer aggregations, and a strictly defined canonical JSON contract. Rather than relying on the LLM to interpret raw or semi-structured data, the architecture ensures that all business logic, aggregation rules, and temporal comparisons are resolved upstream — preserving analytical correctness and reproducibility.

By executing the final stage inside AWS using Amazon S3 and Amazon Bedrock, the system transitions from a traditional data platform into a structured generative analytics engine. The run-once deployment validates real cloud integration while maintaining cost efficiency, proving that the architecture is portable, reproducible, and ready for automation. What began as personal training and nutrition logs was transformed into a deterministic, model-consistent, and LLM-optimized analytical system.


GitHub Repository

Access all code, datasets, notebooks, and files for this project:

Click here to access