Clustering Project: Lead Analysis for the “Prosperous Couple” Quiz

Author

Gabriel Ferreira

Published

June 4, 2025

Introduction

This project analyzes data collected through an interactive quiz aimed at couples planning their wedding. The main purpose is to segment these leads to better understand their profiles, needs, and stage of planning.

To achieve this, unsupervised learning techniques (clustering) were used to identify distinct groups of couples based on their responses, allowing for the creation of more effective and personalized marketing strategies.

Project Structure

The development of this project followed a structured and iterative methodology, covering everything from data collection and preparation to the evaluation and interpretation of results.

Data Collection and Preparation

The initial phase involved the consolidation and cleaning of data. Leads were collected from three distinct sources, requiring a unification and standardization process to ensure dataset quality.

Data Loading and Concatenation

  • Data Source: Collection of online quiz responses, distributed across three CSV files.
  • Loading and Unification: Reading the three files and consolidating them into a single dataset (a loading sketch is shown below).
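A minimal loading sketch, assuming the three quiz exports are plain CSV files in the working directory (the file names below are placeholders, not the actual export names):
import pandas as pd

# Placeholder file names; the real export names may differ
df_01 = pd.read_csv("quiz_leads_01.csv")
df_02 = pd.read_csv("quiz_leads_02.csv")
df_03 = pd.read_csv("quiz_leads_03.csv")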
First 5 rows of df_01:
Show Code
df_01.head(5)
code created_at button: oQcJxV options: opcoes_UAa9JQ button: 0FvZw6 options: opcoes_KxEbYn options: opcoes_iSdQa9 options: opcoes_GZPJo1 options: opcoes_n5nnNG options: opcoes_G7CtE4 options: opcoes_euOjgt options: opcoes_3FPmNX options: opcoes_kGxmr1 field: 11ABFZ field: 3hxWeL field: 5fSLVc button: enviar tracking
0 OdFIeA 30/04/2025 08:44:26 clicked NaN NaN (A) Ainda estamos completamente perdidos sobre... (B) Conseguimos nos manter, mas não sobra (B) Está envolvido(a), mas prefere que eu lidere (C) Criamos um planejamento inicial, mas ainda... (C) Teríamos que parcelar bastante ou contar c... (C) Uma cerimônia encantadora, com tudo bem feito (C) Temos destinos em mente, mas sem orçamento... (D) Estamos prontos, queremos agir e realizar ... Gines okmodesto@gmail.com 61974035966 clicked src: UNICOTRACKER-2348-2348174611426542420-174...
1 AShHDV 30/04/2025 08:47:31 clicked NaN NaN (A) Ainda estamos completamente perdidos sobre... (C) Temos folga, mas ainda não vivemos como go... (D) Prefere que eu resolva tudo sozinho(a) (D) Temos planilhas, metas e até cronograma de... (D) Poderíamos arcar com boa parte, mas querem... NaN NaN NaN Gines okmodesto@gmail.com 61974035966 clicked NaN
2 5lPo6L 30/04/2025 08:55:30 clicked NaN NaN (A) Ainda estamos completamente perdidos sobre... (B) Conseguimos nos manter, mas não sobra (A) Está completamente envolvido(a), sonha jun... (B) Temos anotações e ideias soltas (A) Não conseguiríamos bancar nada ainda NaN NaN NaN NaN NaN NaN NaN NaN
3 sSZ91S 30/04/2025 09:04:38 clicked NaN NaN (B) Temos algumas referências, mas nada decidido (A) Vivemos no limite e temos dívidas (B) Está envolvido(a), mas prefere que eu lidere (A) Não começamos ainda (B) Conseguiríamos fazer algo simples (A) Ainda não pensamos nisso (A) Nem pensamos nisso ainda (B) Queremos muito, mas temos medo de não dar ... Bruno Teste brunopurper3@gmail.com 48991954215 clicked NaN
4 jv69Q9 30/04/2025 09:08:10 clicked NaN NaN (A) Ainda estamos completamente perdidos sobre... (B) Conseguimos nos manter, mas não sobra (A) Está completamente envolvido(a), sonha jun... (A) Não começamos ainda (B) Conseguiríamos fazer algo simples (B) Algo íntimo e simples, só com pessoas próx... (B) Pensamos, mas parece fora da nossa realidade (B) Queremos muito, mas temos medo de não dar ... Gines okmodesto@gmail.com 61974035966 clicked NaN
Non-null count of each column in df_01:
Show Code
df_01.count()
code                      353
created_at                353
button: oQcJxV            331
options: opcoes_UAa9JQ      0
button: 0FvZw6             22
options: opcoes_KxEbYn    260
options: opcoes_iSdQa9    249
options: opcoes_GZPJo1    235
options: opcoes_n5nnNG    227
options: opcoes_G7CtE4    223
options: opcoes_euOjgt    220
options: opcoes_3FPmNX    220
options: opcoes_kGxmr1    218
field: 11ABFZ              74
field: 3hxWeL              74
field: 5fSLVc              74
button: enviar             74
tracking                  341
dtype: int64
Non-null count of each column in df_02:
Show Code
df_02.count()
code                      162
created_at                162
button: oQcJxV            120
options: opcoes_UAa9JQ      0
button: 0FvZw6             41
options: opcoes_KxEbYn    142
options: opcoes_iSdQa9    142
options: opcoes_GZPJo1    138
options: opcoes_n5nnNG    138
options: opcoes_G7CtE4    137
options: opcoes_euOjgt    136
options: opcoes_3FPmNX    135
options: opcoes_kGxmr1    135
field: 11ABFZ              63
field: 3hxWeL              63
field: 5fSLVc              63
button: enviar             63
tracking                  154
dtype: int64

Data Cleaning

The columns created_at, options: opcoes_UAa9JQ, code, button: oQcJxV, button: 0FvZw6, button: enviar, and tracking do not contain relevant information for the analysis, so we can remove them from df_01 and df_02. The columns field: 11ABFZ, field: 3hxWeL, and field: 5fSLVc contain data about the lead, but have too few non-null values to be useful, so we will also remove them from the dataframes.

Removing the selected columns from df_01 and df_02:
Show Code
df_01 = df_01.drop(columns=["created_at","options: opcoes_UAa9JQ","code", "button: oQcJxV", "button: 0FvZw6", "field: 11ABFZ", "field: 3hxWeL","field: 5fSLVc", "button: enviar", "tracking"])
df_02 = df_02.drop(columns=["created_at","options: opcoes_UAa9JQ","code", "button: oQcJxV", "button: 0FvZw6", "field: 11ABFZ", "field: 3hxWeL","field: 5fSLVc", "button: enviar", "tracking"])

Let’s remove the columns code, button: oQcJxV, options: opcoes_UAa9JQ, button: 0FvZw6, options: opcoes_PhNxWH, options: opcoes_NMBS1J, options: opcoes_Peavhn, options: opcoes_RyUh7O, options: opcoes_S8x7OR, options: opcoes_MNd05q, options: opcoes_AQo3UU, options: opcoes_yYueg1, field: 11ABFZ, field: 3hxWeL, field: 5fSLVc, button: enviar, and tracking from df_03, for the same reasons given for df_01 and df_02.

Removing the selected columns from df_03:
Show Code
df_03 = df_03.drop(columns=["created_at", "code", "button: oQcJxV", "options: opcoes_UAa9JQ", "button: 0FvZw6",
                            "options: opcoes_PhNxWH", "options: opcoes_NMBS1J", "options: opcoes_Peavhn",
                            "options: opcoes_RyUh7O", "options: opcoes_S8x7OR", "options: opcoes_MNd05q", "options: opcoes_AQo3UU",
                            "options: opcoes_yYueg1", "field: 11ABFZ", "field: 3hxWeL", "field: 5fSLVc", "button: enviar", "tracking"])

Now that all three dataframes have the same columns, let’s rename the columns and concatenate the dataframes.

Renaming the columns and concatenating the dataframes:
Show Code
rename_dict = {
    'options: opcoes_KxEbYn': 'pergunta_1',
    'options: opcoes_iSdQa9': 'pergunta_2',
    'options: opcoes_GZPJo1': 'pergunta_3',
    'options: opcoes_n5nnNG': 'pergunta_4',
    'options: opcoes_G7CtE4': 'pergunta_5',
    'options: opcoes_euOjgt': 'pergunta_6',
    'options: opcoes_3FPmNX': 'pergunta_7',
    'options: opcoes_kGxmr1': 'pergunta_8'
}
# Rename the question columns in all three dataframes
df_01 = df_01.rename(columns=rename_dict)
df_02 = df_02.rename(columns=rename_dict)
df_03 = df_03.rename(columns=rename_dict)
# Concatenate the dataframes
df_gf = pd.concat([df_01, df_02, df_03], ignore_index=True)
Sample of the cleaned and concatenated dataframe:
Show Code
df_gf.sample(3)
pergunta_1 pergunta_2 pergunta_3 pergunta_4 pergunta_5 pergunta_6 pergunta_7 pergunta_8
543 Ainda estamos completamente perdidos sobre tudo Temos folga, mas ainda não vivemos como gostar... Está envolvido(a), mas prefere que eu lidere Temos anotações e ideias soltas Conseguiríamos fazer algo simples Algo íntimo e simples, só com pessoas próximas Temos destinos em mente, mas sem orçamento ainda Desejamos, mas nos falta tempo e direção
252 NaN NaN NaN NaN NaN NaN NaN NaN
417 (A) Ainda estamos completamente perdidos sobre... (C) Temos folga, mas ainda não vivemos como go... (C) Me apoia, mas não se envolve muito com o p... (A) Não começamos ainda (A) Não conseguiríamos bancar nada ainda (C) Uma cerimônia encantadora, com tudo bem feito (D) Já sabemos onde queremos ir e estamos nos ... (B) Queremos muito, mas temos medo de não dar ...

We noticed that some answers are prefixed with the letter (A)/(B)/(C)/(D) while others are not, so the mapping needs to handle both formats.

Mapping responses (with and without letter):
Show Code
# Dictionary to map responses with letter
mapeamento_com_letra = {
    '(A) Ainda estamos completamente perdidos sobre tudo': 'A',
    '(B) Temos algumas referências, mas nada decidido': 'B',
    '(C) Já temos o estilo em mente, mas falta planejar': 'C',
    '(D) Sabemos exatamente o que queremos e já começamos a organizar': 'D',

    '(A) Vivemos no limite e temos dívidas': 'A',
    '(B) Conseguimos nos manter, mas não sobra': 'B',
    '(C) Temos folga, mas ainda não vivemos como gostaríamos': 'C',
    '(D) Estamos bem financeiramente, mas queremos crescer mais': 'D',

    '(A) Está completamente envolvido(a), sonha junto comigo': 'A',
    '(B) Está envolvido(a), mas prefere que eu lidere': 'B',
    '(C) Me apoia, mas não se envolve muito com o planejamento': 'C',
    '(D) Prefere que eu resolva tudo sozinho(a)': 'D',

    '(A) Não começamos ainda': 'A',
    '(B) Temos anotações e ideias soltas': 'B',
    '(C) Criamos um planejamento inicial, mas ainda sem orçamento': 'C',
    '(D) Temos planilhas, metas e até cronograma definido': 'D',

    '(A) Não conseguiríamos bancar nada ainda': 'A',
    '(B) Conseguiríamos fazer algo simples': 'B',
    '(C) Teríamos que parcelar bastante ou contar com ajuda': 'C',
    '(D) Poderíamos arcar com boa parte, mas queremos mais liberdade': 'D',

    '(A) Ainda não pensamos nisso': 'A',
    '(B) Algo íntimo e simples, só com pessoas próximas': 'B',
    '(C) Uma cerimônia encantadora, com tudo bem feito': 'C',
    '(D) Um evento inesquecível, com tudo que temos direito': 'D',

    '(A) Nem pensamos nisso ainda': 'A',
    '(B) Pensamos, mas parece fora da nossa realidade': 'B',
    '(C) Temos destinos em mente, mas sem orçamento ainda': 'C',
    '(D) Já sabemos onde queremos ir e estamos nos planejando': 'D',

    '(A) Desejamos, mas nos falta tempo e direção': 'A',
    '(B) Queremos muito, mas temos medo de não dar conta': 'B',
    '(C) Estamos dispostos, só falta um plano eficaz': 'C',
    '(D) Estamos prontos, queremos agir e realizar de verdade': 'D'
}
# Dictionary to map responses without letter
mapeamento_sem_letra = {
    'Ainda estamos completamente perdidos sobre tudo': 'A',
    'Temos algumas referências, mas nada decidido': 'B',
    'Já temos o estilo em mente, mas falta planejar': 'C',
    'Sabemos exatamente o que queremos e já começamos a organizar': 'D',

    'Vivemos no limite e temos dívidas': 'A',
    'Conseguimos nos manter, mas não sobra': 'B',
    'Temos folga, mas ainda não vivemos como gostaríamos': 'C',
    'Estamos bem financeiramente, mas queremos crescer mais': 'D',

    'Está completamente envolvido(a), sonha junto comigo': 'A',
    'Está envolvido(a), mas prefere que eu lidere': 'B',
    'Me apoia, mas não se envolve muito com o planejamento': 'C',
    'Prefere que eu resolva tudo sozinho(a)': 'D',

    'Não começamos ainda': 'A',
    'Temos anotações e ideias soltas': 'B',
    'Criamos um planejamento inicial, mas ainda sem orçamento': 'C',
    'Temos planilhas, metas e até cronograma definido': 'D',

    'Não conseguiríamos bancar nada ainda': 'A',
    'Conseguiríamos fazer algo simples': 'B',
    'Teríamos que parcelar bastante ou contar com ajuda': 'C',
    'Poderíamos arcar com boa parte, mas queremos mais liberdade': 'D',

    'Ainda não pensamos nisso': 'A',
    'Algo íntimo e simples, só com pessoas próximas': 'B',
    'Uma cerimônia encantadora, com tudo bem feito': 'C',
    'Um evento inesquecível, com tudo que temos direito': 'D',

    'Nem pensamos nisso ainda': 'A',
    'Pensamos, mas parece fora da nossa realidade': 'B',
    'Temos destinos em mente, mas sem orçamento ainda': 'C',
    'Já sabemos onde queremos ir e estamos nos planejando': 'D',

    'Desejamos, mas nos falta tempo e direção': 'A',
    'Queremos muito, mas temos medo de não dar conta': 'B',
    'Estamos dispostos, só falta um plano eficaz': 'C',
    'Estamos prontos, queremos agir e realizar de verdade': 'D'
}
# Function to map responses
def mapear_resposta(resposta, mapeamento_letra, mapeamento_sem):
    if isinstance(resposta, str):
        resposta = resposta.strip()

    # Try the mapping with the letter prefix first
    resposta_mapeada = mapeamento_letra.get(resposta)
    if resposta_mapeada:
        return resposta_mapeada

    # Fall back to the mapping without the letter prefix
    resposta_mapeada = mapeamento_sem.get(resposta)
    if resposta_mapeada:
        return resposta_mapeada

    # Unanswered or unmapped responses remain missing
    return None
# Applies the function to all question columns
colunas_perguntas = ['pergunta_1', 'pergunta_2', 'pergunta_3', 'pergunta_4', 'pergunta_5', 'pergunta_6', 'pergunta_7', 'pergunta_8']
df_gf[colunas_perguntas] = df_gf[colunas_perguntas].applymap(lambda resposta: mapear_resposta(resposta, mapeamento_com_letra, mapeamento_sem_letra))
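As a quick sanity check (a sketch, not part of the original notebook output), we can confirm that each question column now holds only the letters A to D, plus missing values:
# Every mapped question column should contain only A/B/C/D (NaN for unanswered)
for col in colunas_perguntas:
    print(col, sorted(df_gf[col].dropna().unique()))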

Handling Missing Values

Summary statistics of the dataframe:
Show Code
df_gf.describe()
pergunta_1 pergunta_2 pergunta_3 pergunta_4 pergunta_5 pergunta_6 pergunta_7 pergunta_8
count 495 482 457 449 443 439 438 436
unique 4 4 4 4 4 4 4 4
top A B A A B B A B
freq 178 189 219 184 170 189 184 124
Checking for missing values:
Show Code
print(df_gf.isnull().sum())
pergunta_1    132
pergunta_2    145
pergunta_3    170
pergunta_4    178
pergunta_5    184
pergunta_6    188
pergunta_7    189
pergunta_8    191
dtype: int64

Since we have a lot of NaN values, we decided to remove all rows in which every question is unanswered and then, for rows where only some questions are missing, impute the missing answers with the mode of each question.

Removing rows with null values:
Show Code
# Count the rows where every column is NaN
num_linhas_todas_nan = df_gf.isna().all(axis=1).sum()
# Remove rows where all columns are NaN
df_gf = df_gf.dropna(how='all')
# Check for null values again
print(df_gf.isnull().sum())
pergunta_1     0
pergunta_2    13
pergunta_3    38
pergunta_4    46
pergunta_5    52
pergunta_6    56
pergunta_7    57
pergunta_8    59
dtype: int64

Now let’s analyze the data to find the volume of missing values per question and how the answers are distributed for each question.

Calculating the percentage of missing values in each column:
Show Code
# Calculate the percentage of missing values in each column
percentual_na_perguntas = df_gf.isna().mean() * 100
# Display the percentage of missing values
print(percentual_na_perguntas)
pergunta_1     0.000000
pergunta_2     2.626263
pergunta_3     7.676768
pergunta_4     9.292929
pergunta_5    10.505051
pergunta_6    11.313131
pergunta_7    11.515152
pergunta_8    11.919192
dtype: float64

Before imputing the NaN values, let’s look at how the answers are distributed.

Creating the chart (excluding rows with null values):
Show Code
# Create a dataframe without the NaN values
df_limpo = df_gf.dropna()
# List of question columns
perguntas = ['pergunta_1', 'pergunta_2', 'pergunta_3', 'pergunta_4', 
             'pergunta_5', 'pergunta_6', 'pergunta_7', 'pergunta_8']
# Chart colors
cores = px.colors.qualitative.Vivid
# Create the figure with subplots
fig = sp.make_subplots(
    rows=2, cols=4, 
    subplot_titles=perguntas,
    horizontal_spacing=0.10,
    vertical_spacing=0.13
)
for i, pergunta in enumerate(perguntas):
    contagem = df_limpo[pergunta].value_counts().sort_index()
    
    fig.add_trace(
        go.Bar(
            x=contagem.index,
            y=contagem.values,
            marker_color=cores[:len(contagem)],
            name=pergunta
        ),
        row=(i//4)+1, col=(i%4)+1
    )
fig.update_layout(
    height=500,
    width=900,
    title={
        'text': "Frequency of Answers per Question",
        'y': 0.98,          
        'x': 0.5,           
        'xanchor': 'center',
        'yanchor': 'top',
        'pad': {'b': 28}
    },
    margin=dict(t=95, b=60, l=40, r=40),
    showlegend=False,
    template="plotly_white"
)
fig.show()

Since the remaining percentage of null values is low, we will replace them with the mode of each question.

Imputing null values with the mode of each column:
Show Code
for col in df_gf.columns:
    if col.startswith('pergunta'):
        moda = df_gf[col].mode()[0]
        df_gf[col].fillna(moda, inplace=True)
Checking for null values again:
Show Code
print(df_gf.isnull().sum())
pergunta_1    0
pergunta_2    0
pergunta_3    0
pergunta_4    0
pergunta_5    0
pergunta_6    0
pergunta_7    0
pergunta_8    0
dtype: int64

Let’s create a data dictionary in case it’s necessary to consult the questions and alternatives during the analysis.

Creating a dictionary with all questions and alternatives:
Show Code
perguntas_dict = {
    "pergunta_1": {
        "texto": "Nível de clareza sobre o casamento dos sonhos: Como vocês descreveriam o nível de clareza que têm sobre o casamento que desejam?",
        "alternativas": {
            "A": "Ainda estamos completamente perdidos sobre tudo",
            "B": "Temos algumas referências, mas nada decidido",
            "C": "Já temos o estilo em mente, mas falta planejar",
            "D": "Sabemos exatamente o que queremos e já começamos a organizar"
        }
    },
    "pergunta_2": {
        "texto": "Situação financeira atual: Como você descreveria a situação financeira atual de vocês dois?",
        "alternativas": {
            "A": "Vivemos no limite e temos dívidas",
            "B": "Conseguimos nos manter, mas não sobra",
            "C": "Temos folga, mas ainda não vivemos como gostaríamos",
            "D": "Estamos bem financeiramente, mas queremos crescer mais"
        }
    },
    "pergunta_3": {
        "texto": "Apoio mútuo e envolvimento no sonho de casamento: Como está o envolvimento do seu parceiro(a) na realização do casamento dos sonhos?",
        "alternativas": {
            "A": "Está completamente envolvido(a), sonha junto comigo",
            "B": "Está envolvido(a), mas prefere que eu lidere",
            "C": "Me apoia, mas não se envolve muito com o planejamento",
            "D": "Prefere que eu resolva tudo sozinho(a)"
        }
    },
    "pergunta_4": {
        "texto": "Nível de organização do planejamento: Como vocês estão se organizando para planejar o casamento?",
        "alternativas": {
            "A": "Não começamos ainda",
            "B": "Temos anotações e ideias soltas",
            "C": "Criamos um planejamento inicial, mas ainda sem orçamento",
            "D": "Temos planilhas, metas e até cronograma definido"
        }
    },
    "pergunta_5": {
        "texto": "Possibilidade de investimento atual no casamento: Se fossem realizar o casamento ideal hoje, como pagariam?",
        "alternativas": {
            "A": "Não conseguiríamos bancar nada ainda",
            "B": "Conseguiríamos fazer algo simples",
            "C": "Teríamos que parcelar bastante ou contar com ajuda",
            "D": "Poderíamos arcar com boa parte, mas queremos mais liberdade"
        }
    },
    "pergunta_6": {
        "texto": "Estilo de casamento desejado: Qual o estilo de casamento dos seus sonhos?",
        "alternativas": {
            "A": "Ainda não pensamos nisso",
            "B": "Algo íntimo e simples, só com pessoas próximas",
            "C": "Uma cerimônia encantadora, com tudo bem feito",
            "D": "Um evento inesquecível, com tudo que temos direito"
        }
    },
    "pergunta_7": {
        "texto": "Planejamento da lua de mel: Vocês já pensaram na lua de mel?",
        "alternativas": {
            "A": "Nem pensamos nisso ainda",
            "B": "Pensamos, mas parece fora da nossa realidade",
            "C": "Temos destinos em mente, mas sem orçamento ainda",
            "D": "Já sabemos onde queremos ir e estamos nos planejando"
        }
    },
    "pergunta_8": {
        "texto": "Comprometimento em tornar esse sonho realidade: O quanto vocês estão comprometidos em transformar esse sonho em realidade?",
        "alternativas": {
            "A": "Desejamos, mas nos falta tempo e direção",
            "B": "Queremos muito, mas temos medo de não dar conta",
            "C": "Estamos dispostos, só falta um plano eficaz",
            "D": "Estamos prontos, queremos agir e realizar de verdade"
        }
    }
}

Model Building and Validation

The KMeans algorithm groups points using the Euclidean distance between them, i.e., a measure of how far apart two points are in Euclidean space. Correlation can therefore have a significant impact on clustering models like KMeans: highly correlated variables effectively count the same information more than once, disproportionately influencing the distances between points and potentially distorting the formed clusters.
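A tiny illustration of this effect (a sketch with made-up values): if a binary feature is duplicated, i.e., perfectly correlated with another feature, the same underlying difference between two respondents is counted twice in the squared Euclidean distance, inflating it.
import numpy as np

# One informative feature vs. the same feature duplicated (perfectly correlated)
a, b = np.array([1.0]), np.array([0.0])
a_dup, b_dup = np.array([1.0, 1.0]), np.array([0.0, 0.0])

print(np.linalg.norm(a - b))          # 1.0
print(np.linalg.norm(a_dup - b_dup))  # ~1.41: the correlated copy inflates the distance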

We need to consider multicollinearity (if two variables are highly correlated), the scale of variables (since KMeans is sensitive to scale), and dimensionality reduction (to speed up the grouping process when there are a large number of variables).

Therefore, before applying KMeans, let’s first analyze the association between the categorical variables. We will use Cramér’s V, a measure of association between two nominal variables based on the chi-square statistic.
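For reference, the bias-corrected Cramér’s V implemented in the function below is, for an r × k contingency table with chi-square statistic χ² computed on n observations:

$$\tilde{V}=\sqrt{\frac{\tilde{\varphi}^{2}}{\min(\tilde{k}-1,\ \tilde{r}-1)}},\qquad
\tilde{\varphi}^{2}=\max\!\left(0,\ \frac{\chi^{2}}{n}-\frac{(k-1)(r-1)}{n-1}\right),\qquad
\tilde{k}=k-\frac{(k-1)^{2}}{n-1},\quad
\tilde{r}=r-\frac{(r-1)^{2}}{n-1}$$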

Calculating the Cramér’s V matrix and performing the analysis:

Show Code
# Function to compute Cramér's V (bias-corrected)
def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion_matrix, correction=False)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1)) / (n-1))
    rcorr = r - ((r-1)**2) / (n-1)
    kcorr = k - ((k-1)**2) / (n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
# Build the association matrix
df = df_gf
categorical_columns = df.columns
# Create an empty matrix
cramers_results = pd.DataFrame(np.zeros((len(categorical_columns), len(categorical_columns))),
                               index=categorical_columns,
                               columns=categorical_columns)
# Fill the matrix
for col1 in categorical_columns:
    for col2 in categorical_columns:
        cramers_results.loc[col1, col2] = cramers_v(df[col1], df[col2])
# Print the matrix
cramers_results
pergunta_1 pergunta_2 pergunta_3 pergunta_4 pergunta_5 pergunta_6 pergunta_7 pergunta_8
pergunta_1 1.000000 0.154443 0.000000 0.353861 0.193979 0.174872 0.150004 0.261624
pergunta_2 0.154443 1.000000 0.080247 0.217220 0.319514 0.159740 0.196799 0.175146
pergunta_3 0.000000 0.080247 1.000000 0.104207 0.099763 0.079644 0.102016 0.105724
pergunta_4 0.353861 0.217220 0.104207 1.000000 0.255274 0.215382 0.264954 0.324603
pergunta_5 0.193979 0.319514 0.099763 0.255274 1.000000 0.286524 0.228514 0.235780
pergunta_6 0.174872 0.159740 0.079644 0.215382 0.286524 1.000000 0.195961 0.227248
pergunta_7 0.150004 0.196799 0.102016 0.264954 0.228514 0.195961 1.000000 0.305950
pergunta_8 0.261624 0.175146 0.105724 0.324603 0.235780 0.227248 0.305950 1.000000

Visualizing the correlation matrix:

Show Code
plt.figure(figsize=(10,8))
sns.heatmap(cramers_results, annot=True, cmap='coolwarm', vmin=0, vmax=1)
plt.title("Cramér's V - Correlation between categorical variables")
plt.show()

We do not have highly correlated variables, but let’s analyze the three largest correlations to assess the consistency of this dataset.

Evaluating the three largest correlations:
Show Code
# Reshape the matrix into long format (variable pair - association value)
corr_pairs = (
    cramers_results.where(np.triu(np.ones(cramers_results.shape), k=1).astype(bool))  # keep only the upper triangle (avoids repetition)
    .stack()  # turn into a Series with a MultiIndex
    .reset_index()
)
corr_pairs.columns = ['1', '2', 'Cramers_V']
# Sort by the largest association values
top_corr = corr_pairs.sort_values(by='Cramers_V', ascending=False).head(3)
print(top_corr)
             1           2  Cramers_V
2   pergunta_1  pergunta_4   0.353861
21  pergunta_4  pergunta_8   0.324603
9   pergunta_2  pergunta_5   0.319514
Analyzing correlation between questions 1 and 4:
Show Code
perguntas_dict["pergunta_1"]["texto"], perguntas_dict["pergunta_4"]["texto"]
('Nível de clareza sobre o casamento dos sonhos: Como vocês descreveriam o nível de clareza que têm sobre o casamento que desejam?',
 'Nível de organização do planejamento: Como vocês estão se organizando para planejar o casamento?')

This correlation makes sense because those who have clarity about the wedding they desire tend to be more organized regarding the planning.

Analyzing correlation between questions 4 and 8:
Show Code
perguntas_dict["pergunta_4"]["texto"], perguntas_dict["pergunta_8"]["texto"]
('Nível de organização do planejamento: Como vocês estão se organizando para planejar o casamento?',
 'Comprometimento em tornar esse sonho realidade: O quanto vocês estão comprometidos em transformar esse sonho em realidade?')

This correlation also makes sense because those who are more organized will be more committed to making the wedding happen.

Analyzing correlation between questions 2 and 5:
Show Code
perguntas_dict["pergunta_2"]["texto"], perguntas_dict["pergunta_5"]["texto"]
('Situação financeira atual: Como você descreveria a situação financeira atual de vocês dois?',
 'Possibilidade de investimento atual no casamento: Se fossem realizar o casamento ideal hoje, como pagariam?')

These questions also have a justifiable correlation, as the couple’s financial situation directly impacts how much they could currently invest in the wedding.

Conclusion

  • There are no very strong relationships between the questions, which is expected in well-designed questionnaires where questions measure distinct, albeit related, aspects (no high risk of multicollinearity).
  • The highest correlations are classified as moderate or weak, indicating that each question captures different aspects of the leads’ profile or situation.

We will still apply PCA for dimensionality reduction; it is not mandatory because of multicollinearity between the variables, but rather because we are interested in reducing computational complexity and improving the efficiency of KMeans in high-dimensional spaces.

One-Hot Encoding Application

Since we have many features and need to feed the algorithm numerical variables, we will apply one-hot encoding (OHE) with the drop='first' option. OHE transforms categorical variables into a binary numerical format, which is essential for algorithms that do not natively handle categories, as is the case with KMeans.

Applying One-Hot Encoding with drop='first':
Show Code
# Instantiate the encoder
ohe = OneHotEncoder(drop='first', sparse=False)
# Fit and transform the data
ohe_array = ohe.fit_transform(df_gf)
# Get the column names generated by the OHE
ohe_columns = ohe.get_feature_names_out(df_gf.columns)
# Create a new DataFrame with the encoded data
df_gf_ohe = pd.DataFrame(ohe_array, columns=ohe_columns, index=df_gf.index)
# View a random sample of 10 rows
df_gf_ohe.sample(10)
pergunta_1_B pergunta_1_C pergunta_1_D pergunta_2_B pergunta_2_C pergunta_2_D pergunta_3_B pergunta_3_C pergunta_3_D pergunta_4_B ... pergunta_5_D pergunta_6_B pergunta_6_C pergunta_6_D pergunta_7_B pergunta_7_C pergunta_7_D pergunta_8_B pergunta_8_C pergunta_8_D
606 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
222 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
160 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
287 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
393 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
566 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
602 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
593 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
227 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
463 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

10 rows × 24 columns

We can see the effect of the drop='first' parameter, which removes the first category of each variable to reduce dimensionality: 8 questions × 3 remaining categories gives 24 features instead of 32.

PCA Application

Principal Component Analysis (PCA) is a statistical dimensionality reduction technique that transforms a set of possibly correlated variables into a new set of uncorrelated variables, called principal components.

The objective for our project is to capture the maximum variability of the data in the first components, facilitate visualization, clustering, and classification, and reduce noise. OHE significantly increases dimensionality (in this project’s case, from 8 questions to 24 variables after drop='first').

Applying PCA:
Show Code
# Standardize the data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_gf_ohe)
# Instantiate the PCA
pca = PCA()
# Fit the PCA to the data
pca.fit(df_scaled)
# Generate the principal components
df_pca = pca.transform(df_scaled)
# Convert to a DataFrame for viewing
df_pca = pd.DataFrame(df_pca, columns=[f'PC{i+1}' for i in range(df_pca.shape[1])])
# View the first 5 rows
print(df_pca.head())
        PC1       PC2       PC3       PC4       PC5       PC6       PC7  \
0  0.894431  2.735178 -0.826413  3.137230 -0.409440  0.161323 -0.650989   
1  2.981389 -1.378394  0.998942 -1.971190  0.027095 -1.719077 -1.687937   
2 -1.231628 -0.908463 -0.152123 -0.419888  0.238578  0.347453 -0.037611   
3 -1.217156 -0.110358 -1.339276 -0.422707 -0.274758  0.304763 -1.688694   
4 -1.842543 -1.716992 -1.497125  0.169806 -0.275240 -0.392597  0.011479   

        PC8       PC9      PC10  ...      PC15      PC16      PC17      PC18  \
0 -0.982705 -0.015318 -2.118046  ... -1.322269 -1.216456  0.611138 -1.027353   
1  1.375810  0.359631  1.131436  ... -1.148633  1.284379  2.483504  0.253910   
2  0.819806 -0.162855 -0.752639  ...  1.033716 -0.966877  0.895287 -0.054726   
3 -1.809124  0.729512  0.755415  ... -0.506001  0.649595 -1.187464 -0.104447   
4 -0.020066 -0.155247  0.608450  ...  0.350160  0.626633  0.647278  1.009298   

       PC19      PC20      PC21      PC22      PC23      PC24  
0 -0.089755  1.457162 -0.171782 -0.607896  0.618231 -0.156277  
1 -1.530543 -1.048446  0.320573 -0.063387  2.119997 -0.119219  
2 -0.394803 -0.066319  1.049833 -0.495020  0.547988 -0.459000  
3  0.070163 -0.923953 -1.024069  0.028527 -0.228793  1.166666  
4  0.507244  1.303698 -0.514279 -0.323762 -0.036504 -0.127500  

[5 rows x 24 columns]

The result of the principal component analysis provides a basis for deciding how many components to use and how much of the total variance we can explain in this dataset. We therefore choose between 1 and 24 PCs, based on how much of the data we want to explain, i.e., the minimum number of components that still preserves enough information for interpretation.

Scree Plot - Explained Variance by PCA:
Show Code
# Cumulative explained variance
variancia_acumulada = np.cumsum(pca.explained_variance_ratio_)
# Create the figure
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=list(range(1, len(variancia_acumulada) + 1)),
        y=variancia_acumulada,
        mode='lines+markers',
        line=dict(dash='dash', width=2),
        marker=dict(size=8, color='blue'),
        name='Accumulated Variance'
    )
)
fig.update_layout(
    title='Scree Plot - Explained Variance by PCA',
    xaxis_title='Number of Components',
    yaxis_title='Accumulated Explained Variance',
    template='plotly_white',
    width=800,
    height=500
)
fig.show()
Visualizing the explained variance by each component:
Show Code
for i, var in enumerate(pca.explained_variance_ratio_):
    print(f'PC{i+1}: {var:.4f} ({np.cumsum(pca.explained_variance_ratio_)[i]:.4f} accumulated)')
PC1: 0.1376 (0.1376 accumulated)
PC2: 0.0901 (0.2277 accumulated)
PC3: 0.0755 (0.3032 accumulated)
PC4: 0.0613 (0.3645 accumulated)
PC5: 0.0573 (0.4218 accumulated)
PC6: 0.0536 (0.4755 accumulated)
PC7: 0.0514 (0.5268 accumulated)
PC8: 0.0490 (0.5759 accumulated)
PC9: 0.0462 (0.6220 accumulated)
PC10: 0.0445 (0.6666 accumulated)
PC11: 0.0409 (0.7074 accumulated)
PC12: 0.0392 (0.7466 accumulated)
PC13: 0.0354 (0.7820 accumulated)
PC14: 0.0344 (0.8164 accumulated)
PC15: 0.0306 (0.8470 accumulated)
PC16: 0.0286 (0.8755 accumulated)
PC17: 0.0276 (0.9031 accumulated)
PC18: 0.0230 (0.9261 accumulated)
PC19: 0.0190 (0.9452 accumulated)
PC20: 0.0150 (0.9601 accumulated)
PC21: 0.0120 (0.9721 accumulated)
PC22: 0.0112 (0.9833 accumulated)
PC23: 0.0106 (0.9939 accumulated)
PC24: 0.0061 (1.0000 accumulated)

We chose to use 17 principal components, which preserve 90.31% of the total variance of the dataset, ensuring a balance between data simplification and information retention. This value was determined from the scree plot and the cumulative variance distribution, which shows no clear elbow, a pattern typical of one-hot encoded categorical data. This strategy allows for faster processing and improved model performance while maintaining analytical robustness.
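The same cut-off can also be obtained programmatically (a sketch reusing the cumulative variance computed for the scree plot; the 90% threshold is our own choice):
import numpy as np

# Smallest number of components whose cumulative explained variance reaches 90%
n_components_90 = int(np.argmax(variancia_acumulada >= 0.90)) + 1
print(n_components_90)  # 17 for this dataset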

Defining the Value of K in Clustering Models

The KMeans algorithm is a data clustering technique that organizes a set of points into groups (or “clusters”) based on their similarities. Choosing the appropriate value of k is an important step in the project, as it can significantly affect the usefulness of the formed clusters. To do this, we will use the Elbow Method and the Silhouette Score to help define an optimal value for k.

Defining the ideal value of K

Generating the dataset with 17 principal components:
Show Code
pca = PCA(n_components=17)
df_pca = pca.fit_transform(df_gf_ohe)
Elbow Method:
Show Code
inertia = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(df_pca)
    inertia.append(kmeans.inertia_)
# Elbow plot
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=list(K_range),
        y=inertia,
        mode='lines+markers',
        marker=dict(size=8, color='blue'),
        line=dict(width=2),
        name='Inertia'
    )
)
fig.update_layout(
    title='Elbow Method',
    xaxis_title='Number of Clusters (K)',
    yaxis_title='Inertia',
    template='plotly_white',
    width=800,
    height=500
)
fig.show()

Interpretation: the Elbow Method evaluates the sum of squared distances within clusters (Sum of Squared Errors, SSE) as a function of different k values. As k increases, the error decreases, because the clusters become smaller and more specific. In the SSE vs. k graph, we look for the point where there is a “break” or “bend” (an elbow). Beyond this point, increasing k brings only marginal gains in error reduction, signaling the optimal number of clusters.

We can see the curve forming an elbow between k=2 and k=3, with the bend slightly more pronounced at k=2, suggesting it as the best value for K, although the difference from k=3 is not very distinct.

Silhouette Index:
Show Code
silhouette_scores = []
K_range_sil = range(2, 11)
for k in K_range_sil:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(df_pca)
    score = silhouette_score(df_pca, labels)
    silhouette_scores.append(score)
fig_silhouette = go.Figure()
fig_silhouette.add_trace(
    go.Scatter(
        x=list(K_range_sil),
        y=silhouette_scores,
        mode='lines+markers',
        marker=dict(
            size=8,
            color='blue',
            symbol='circle'
        ),
        line=dict(
            width=2,
            color='blue'
        ),
        name='Silhouette Score'
    )
)

fig_silhouette.update_layout(
    title='Silhouette Index Analysis',
    xaxis_title='Number of Clusters (K)',
    yaxis_title='Silhouette Score',
    template='plotly_white',
    width=800,
    height=500
)

fig_silhouette.show()
Checking silhouette values:
Show Code
for k, score in zip(K_range_sil, silhouette_scores):
    print(f"K={k}: Silhouette Score={score:.4f}")
K=2: Silhouette Score=0.1199
K=3: Silhouette Score=0.1239
K=4: Silhouette Score=0.1017
K=5: Silhouette Score=0.0961
K=6: Silhouette Score=0.1023
K=7: Silhouette Score=0.1044
K=8: Silhouette Score=0.0939
K=9: Silhouette Score=0.1094
K=10: Silhouette Score=0.1023

Interpretation: the silhouette score measures cluster quality by calculating how similar a point is to its own cluster compared with the other clusters. The value ranges from -1 (poor grouping) to 1 (optimal grouping). Here the best value occurs at k=3, with k=2 very close behind.
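For a single point i, with a(i) the mean distance to the other points of its own cluster and b(i) the mean distance to the points of the nearest neighboring cluster, the silhouette is:

$$s(i)=\frac{b(i)-a(i)}{\max\{a(i),\,b(i)\}}$$

The score reported by silhouette_score is the mean of s(i) over all points, which is why values only slightly above 0 indicate overlapping, weakly separated clusters.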

Results and Decision

Elbow Method - K = 2: the elbow shows where the reduction in inertia begins to stabilize.

Practical interpretation: The data may have two macro structures, meaning a coarser division.

Silhouette - Best at K = 3: the highest silhouette value (0.1239) occurs with K=3.

We will fit KMeans with both k=2 and k=3 and analyze which segmentation best suits our project.

Applying KMeans Algorithm for K=2 and K=3:
Show Code
# Apply KMeans for K=2
kmeans_k2 = KMeans(n_clusters=2, random_state=42)
clusters_k2 = kmeans_k2.fit_predict(df_pca)
# DataFrame with the K=2 clusters
df_clusters_k2 = pd.DataFrame(df_pca, columns=[f'PC{i+1}' for i in range(df_pca.shape[1])])
df_clusters_k2['Cluster'] = clusters_k2
# Apply KMeans for K=3
kmeans_k3 = KMeans(n_clusters=3, random_state=42)
clusters_k3 = kmeans_k3.fit_predict(df_pca)
# DataFrame with the K=3 clusters
df_clusters_k3 = pd.DataFrame(df_pca, columns=[f'PC{i+1}' for i in range(df_pca.shape[1])])
df_clusters_k3['Cluster'] = clusters_k3
Quantitative Evaluation of the two models – Silhouette Score:
Show Code
silhouette_k2 = silhouette_score(df_pca, clusters_k2)
silhouette_k3 = silhouette_score(df_pca, clusters_k3)
print(f'Silhouette Score K=2: {silhouette_k2}')
print(f'Silhouette Score K=3: {silhouette_k3}')
Silhouette Score K=2: 0.11988114338535223
Silhouette Score K=3: 0.12394148922394936
Cluster Distribution:
Show Code
## Cluster Distribution
print("\nCluster Distribution - K=2")
print(pd.Series(clusters_k2).value_counts())

print("\nCluster Distribution - K=3")
print(pd.Series(clusters_k3).value_counts())

Cluster Distribution - K=2
1    252
0    243
Name: count, dtype: int64

Cluster Distribution - K=3
1    234
0    206
2     55
Name: count, dtype: int64
Comparison between K=2 and K=3:
Show Code
fig = make_subplots(rows=1, cols=2, subplot_titles=('Clusters with K=2', 'Clusters with K=3'))

# Colors for K=2
unique_clusters_k2 = sorted(df_clusters_k2['Cluster'].unique())
if len(unique_clusters_k2) > 1:
    colors_k2 = pcolors.sample_colorscale("Viridis", np.linspace(0, 1, len(unique_clusters_k2)))
elif len(unique_clusters_k2) == 1:
    colors_k2 = [pcolors.sample_colorscale("Viridis", 0.5)[0]]
else:
    colors_k2 = []

for i, cluster_val in enumerate(unique_clusters_k2):
    df_subset = df_clusters_k2[df_clusters_k2['Cluster'] == cluster_val]
    fig.add_trace(go.Scatter(
        x=df_subset['PC1'],
        y=df_subset['PC2'],
        mode='markers',
        marker=dict(
            color=colors_k2[i],
            opacity=0.7,
            size=7
        ),
        name=f'K=2, Cluster {cluster_val}',
        legendgroup='k2_group'
    ), row=1, col=1)

# Fixed colors for K=3
cores_k3 = ['red', 'blue', 'green']
unique_clusters_k3 = sorted(df_clusters_k3['Cluster'].unique())
for i, cluster_val in enumerate(unique_clusters_k3):
    df_subset = df_clusters_k3[df_clusters_k3['Cluster'] == cluster_val]
    cor = cores_k3[i % len(cores_k3)]
    fig.add_trace(go.Scatter(
        x=df_subset['PC1'],
        y=df_subset['PC2'],
        mode='markers',
        marker=dict(
            color=cor,
            opacity=0.7,
            size=7
        ),
        name=f'K=3, Cluster {cluster_val}',
        legendgroup='k3_group'
    ), row=1, col=2)

# Adjust the axes
fig.update_xaxes(title_text="PC1", row=1, col=1)
fig.update_yaxes(title_text="PC2", row=1, col=1)
fig.update_xaxes(title_text="PC1", row=1, col=2)
fig.update_yaxes(title_text="PC2", row=1, col=2)

# Adjusted layout
fig.update_layout(
    width=1100,
    height=550,
    hovermode='closest',
    showlegend=False,
    title={
        'text': "Cluster Distribution in space for K=1 and K=2",
        'y': 0.97,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'pad': {'b': 25}
    },
    margin=dict(t=80, b=50, l=40, r=40),
    template='plotly_white'
)

fig.show()
Visualizing the center of each cluster:
Show Code
cluster_colors = ['red', 'blue', 'green']
fig = go.Figure()
unique_clusters = sorted(df_clusters_k3['Cluster'].unique())
for cluster_idx, cluster_num in enumerate(unique_clusters):
    mask = df_clusters_k3['Cluster'] == cluster_num
    fig.add_trace(go.Scatter(
        x=df_clusters_k3.loc[mask, df_clusters_k3.columns[0]],
        y=df_clusters_k3.loc[mask, df_clusters_k3.columns[1]],
        mode='markers',
        marker=dict(
            color=cluster_colors[cluster_idx % len(cluster_colors)],
            opacity=0.7,
            size=8
        ),
        name=f'Cluster {cluster_num}'
    ))
fig.add_trace(go.Scatter(
    x=kmeans_k3.cluster_centers_[:, 0],
    y=kmeans_k3.cluster_centers_[:, 1],
    mode='markers',
    marker=dict(
        size=16,
        color='black',
        symbol='x'
    ),
    name='Centroids'
))
fig.update_layout(
    title="Centroids of each Cluster with k=3",
    xaxis_title=df_clusters_k3.columns[0] if len(df_clusters_k3.columns) > 0 else 'Component 1',
    yaxis_title=df_clusters_k3.columns[1] if len(df_clusters_k3.columns) > 1 else 'Component 2',
    legend_title_text='Legend',
    width=900,
    height=650,
)
fig.show()

Choice of K=3

Although the Silhouette metric is only slightly higher for K=3 (0.1239) than for K=2 (0.1199), it still suggests that the model with three clusters offers a more refined division of respondents’ behavioral profiles.

The 2D PCA plots show some overlap between clusters (consistent with low silhouette scores) but also reveal that the cluster centers are in distinct positions, indicating that K-Means successfully found different patterns and that the groups capture relevant differences in behavioral profiles.

Cluster Interpretation and Insight Generation

Getting and visualizing the centroids:
Show Code
centroids_pca = kmeans_k3.cluster_centers_
centroids_ohe = pca.inverse_transform(centroids_pca)
centroids_df = pd.DataFrame(centroids_ohe, columns=ohe.get_feature_names_out())
print(centroids_df)
   pergunta_1_B  pergunta_1_C  pergunta_1_D  pergunta_2_B  pergunta_2_C  \
0      0.263288      0.248406      0.098141      0.387035      0.193359   
1      0.219534      0.280162      0.095265      0.474151      0.123222   
2      0.025306      0.041283      0.881655      0.205811      0.278800   

   pergunta_2_D  pergunta_3_B  pergunta_3_C  pergunta_3_D  pergunta_4_B  ...  \
0      0.029919      0.223320      0.183665      0.055479      0.387110  ...   
1      0.069168      0.212591      0.204563      0.059880      0.330362  ...   
2      0.357298      0.204543      0.205405      0.155626      0.071828  ...   

   pergunta_5_D  pergunta_6_B  pergunta_6_C  pergunta_6_D  pergunta_7_B  \
0      0.049661      0.012986      0.641069      0.119277      0.205592   
1      0.041706      0.984988     -0.012480     -0.018666      0.196746   
2      0.472921      0.215230      0.397455      0.378123      0.138356   

   pergunta_7_C  pergunta_7_D  pergunta_8_B  pergunta_8_C  pergunta_8_D  
0      0.237598      0.047379      0.329218      0.284800      0.084105  
1      0.193919      0.042967      0.237891      0.185972      0.076092  
2      0.375957      0.421558      0.009357      0.069340      0.852161  

[3 rows x 24 columns]
Frequency of answers in each Cluster:
Show Code
# Join the cluster labels to the original dataframe
df_clusters = df_gf.copy()
df_clusters['Cluster'] = clusters_k3

titulos_perguntas = {
    'pergunta_1': 'Clarity Level',
    'pergunta_2': 'Current Financial Situation',
    'pergunta_3': 'Support and Involvement',
    'pergunta_4': 'Planning Organization Level',
    'pergunta_5': 'Current Investment Possibility',
    'pergunta_6': 'Desired Wedding Style',
    'pergunta_7': 'Honeymoon Planning',
    'pergunta_8': 'Commitment to make it real'
}

perguntas = df_gf.columns.tolist()
num_perguntas = len(perguntas)
n_rows = 3
n_cols = 3
subplot_titles_list = []
for i in range(n_rows * n_cols):
    if i < num_perguntas:
        pergunta_nome = perguntas[i]
        titulo = titulos_perguntas.get(pergunta_nome, pergunta_nome)
        subplot_titles_list.append(f'{titulo}')
    else:
        subplot_titles_list.append('')

fig = make_subplots(
    rows=n_rows,
    cols=n_cols,
    subplot_titles=subplot_titles_list,
    horizontal_spacing=0.07,
    vertical_spacing=0.12
)

palette = qualitative.Vivid
unique_cluster_values = sorted(df_clusters['Cluster'].unique())
cluster_color_map = {
    cluster_val: palette[i % len(palette)]
    for i, cluster_val in enumerate(unique_cluster_values)
}

for i, pergunta in enumerate(perguntas):
    if i >= n_rows * n_cols:
        break
    row_num = (i // n_cols) + 1
    col_num = (i % n_cols) + 1

    ordem_categorias = df_clusters[pergunta].value_counts().index.tolist()

    for cluster_val in unique_cluster_values:
        df_subset_cluster = df_clusters[df_clusters['Cluster'] == cluster_val]
        counts = df_subset_cluster[pergunta].value_counts()
        y_values = [counts.get(cat, 0) for cat in ordem_categorias]

        fig.add_trace(go.Bar(
            x=ordem_categorias,
            y=y_values,
            name=f'Cluster {cluster_val}',
            marker_color=cluster_color_map[cluster_val],
            showlegend=False
        ), row=row_num, col=col_num)

    fig.update_xaxes(
        type='category',
        categoryorder='array',
        categoryarray=ordem_categorias,
        tickangle=0,
        showticklabels=True,
        row=row_num,
        col=col_num
    )
    fig.update_yaxes(
        title_text=None,
        row=row_num,
        col=col_num
    )

fig.update_layout(
    height=1100,
    width=1100,
    barmode='group',
    template='plotly_white',
    showlegend=False,
    title={
        'text': 'Frequency of answers in each Cluster',
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'y': 0.97,
        'pad': {'b': 20}
    },
    margin=dict(t=100, b=65, l=35, r=35)
)

fig.show()
Calculating the top 5 answer proportions for each cluster:
Show Code
df_encoded = pd.get_dummies(df_clusters.drop('Cluster', axis=1), prefix_sep='_', drop_first=False)
df_encoded['Cluster'] = df_clusters['Cluster']
centroids = df_encoded.groupby('Cluster').mean()

fig = make_subplots(
    rows=1, 
    cols=3, 
    subplot_titles=[f"Cluster {i}" for i in centroids.index],
    specs=[[{"type": "table"}]*3]
)

for i, cluster in enumerate(centroids.index):
    top5 = centroids.loc[cluster].sort_values(ascending=False).head(5)

    fig.add_trace(
        go.Table(
            header=dict(
                values=["Answer", "Proportion"],
                fill_color="lightgrey",
                align="left",
                font=dict(color="black", size=12)
            ),
            cells=dict(
                values=[top5.index, top5.values.round(3)],
                align="left",
                height=30
            )
        ),
        row=1, col=i+1
    )

fig.update_layout(
    height=300, 
    width=1000,
    template="plotly_white",
    title={
        'text': 'Most representative answers per Cluster',
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'y': 0.96,
        'pad': {'b': 15}
    },
    margin=dict(t=70, b=35, l=30, r=30)
)
fig.show()

Based on the analysis of the centroids, which represent the average values (choice proportions) of each answer within each cluster, we can interpret that:

  • The closer to 1, the more predominant this characteristic is in the group.

  • Each value ranges from 0 to 1 and represents the relative frequency with which that option was chosen within the cluster.

This method allows us to understand the average profile and priorities of each segment.

Cluster Interpretation

Group 1: 234 leads (Cluster 1 — 47%)

Profile:
  • Wedding Style (Question 6): Predominantly, they desire “Something intimate and simple, only with close people” (🔸 pergunta_6_B – 99.6%). This is the most striking trait of this group.
  • Investment Possibility (Question 5): Most believe they “Could do something simple” if they were to have the wedding today (🔸 pergunta_5_B – 62.4%).
  • Organization Level (Question 4): A significant portion “Haven’t started yet” planning (🔸 pergunta_4_A – 57.3%).
  • Honeymoon Planning (Question 7): Most “Haven’t thought about it yet” (🔸 pergunta_7_A – 56.8%).
  • Partner’s Support (Question 3): The partner is “Completely involved, dreaming with me” (🔸 pergunta_3_A – 51.7%).
Behavior Summary:
  • This is the largest group and is characterized by a clear desire for a simpler, more intimate wedding.
  • Financially, they feel capable of holding a modest event, but have not yet started practical organization or honeymoon planning.
  • Partner involvement is high, indicating a shared dream.
  • Despite the desired simplicity, the lack of initiation in planning suggests a need for guidance to take the first steps, even for a smaller event.
Needs:
  • Ideas and inspirations for simple, elegant, and economical weddings.
  • Planning tools focused on smaller, more objective events.
  • Direction on how to start planning an intimate wedding without complications.
  • Content that validates the choice for a smaller wedding, showing its benefits and charm.

Group 2: 206 leads (Cluster 0 - 42%)

Profile:
  • Wedding Style (Question 6): They desire “A charming ceremony, with everything well done”, but not necessarily the most luxurious (🔸 pergunta_6_C – 62.6%).
  • Partner’s Support (Question 3): The partner is “Completely involved, dreaming with me” (🔸 pergunta_3_A – 53.4%).
  • Honeymoon Planning (Question 7): Nearly half “Haven’t thought about it yet” (🔸 pergunta_7_A – 49.0%).
  • Investment Possibility (Question 5): They feel they “Couldn’t afford anything yet” if the wedding were today (🔸 pergunta_5_A – 46.1%).
  • Organization Level (Question 4): They “Haven’t started yet” planning (🔸 pergunta_4_A – 44.7%).
Behavior Summary:
  • This group, one of the largest, is at a very early stage. They have clear dreams and desires about the ceremony style and have strong mutual support as a couple.
  • However, they face practical paralysis due to lack of organization and, crucially, the perception of financial incapacity at the moment.
  • They are dreaming big, but feel lost about where to start, with budget and organization being the main bottlenecks.
Needs:
  • Practical, step-by-step guides: “From scratch to dream wedding: a beginner’s guide”.
  • Basic organization tools: simple checklists, initial timelines, budget spreadsheet templates for beginners.
  • Solutions and ideas for affordable weddings: Content on how to achieve a charming ceremony on a limited budget.
  • Emotional and motivational content: Reinforcing that it’s normal to feel lost at the beginning and that it’s possible to turn the dream into reality with planning, even with limited resources.

Group 3: 55 leads (Cluster 2 — 11%)

Profile:
  • Clarity Level (Question 1): They have a very high level of clarity: “We know exactly what we want and have already started organizing” (🔸 pergunta_1_D – 89.1%).
  • Commitment (Question 8): They are highly committed: “We are ready, we want to act and truly achieve it” (🔸 pergunta_8_D – 85.5%).
  • Organization Level (Question 4): They are already well organized: “We have spreadsheets, goals, and even a defined timeline” (🔸 pergunta_4_D – 58.2%).
  • Investment Possibility (Question 5): They believe they “Could afford a good part, but we want more freedom” financially (🔸 pergunta_5_D – 49.1%).
  • Partner’s Support (Question 3): The partner is “Completely involved, dreaming with me” (🔸 pergunta_3_A – 47.3%).
Behavior Summary:
  • This is the smallest group, but it represents the most decisive and proactive couples.
  • They have total clarity about the desired wedding, are highly committed, and already have advanced planning.
  • Financially, they are in a relatively comfortable position, but seek to optimize their resources for “more freedom.”
  • Partner support is also strong, indicating a joint and aligned effort.
  • They have probably researched extensively and may be looking to optimize what they have already planned or find suppliers and solutions that fit their clear vision.
Needs:
  • Solutions to optimize the budget and maximize the value of the investment.
  • Advanced tools for vendor management or detailed timelines.
  • Specialized consulting to refine details or resolve specific planning points.
  • Inspiration for finishing touches or differentiated elements that add value to the already well-defined wedding.
  • Confirmation that they are on the right track and access to trusted suppliers.
Lead counts in each Group:
Show Code
# Count of leads per cluster
cluster_counts = df_clusters['Cluster'].value_counts().sort_index()
x_labels = {0: 'Group 2', 1: 'Group 1', 2: 'Group 3'}
x_axis_labels = [x_labels[c] for c in cluster_counts.index]
vivid_palette_px = px.colors.qualitative.Vivid
bar_colors = [vivid_palette_px[i % len(vivid_palette_px)] for i in range(len(cluster_counts.index))]
fig = go.Figure()

fig.add_trace(go.Bar(
    x=x_axis_labels,
    y=cluster_counts.values,
    marker_color=bar_colors,
    text=cluster_counts.values,
    texttemplate='<b>%{y}</b>',
    textposition='outside',
    textfont=dict(size=12, color='black', family='Arial')
))

fig.update_layout(
    title=dict(
        text='Leads by Group',
        font=dict(size=20, color='black', family='Arial Black'),
        x=0.5,
        xanchor='center',
        y=0.97,
        yanchor='top',
        pad={'b': 12}
    ),
    xaxis_title=None,
    xaxis=dict(
        tickangle=0,
        type='category',
        tickfont=dict(size=14, color='black')
    ),
    yaxis=dict(
        title=dict(
            text='Number of Leads',
            font=dict(size=14)
        ),
        showgrid=True,
        gridcolor='rgba(211, 211, 211, 0.7)',
        griddash='dash',
        gridwidth=1,
        range=[0, cluster_counts.values.max() * 1.15]
    ),
    width=520,
    height=450,
    plot_bgcolor='white',
    font=dict(size=14),
    margin=dict(t=75, b=55, l=60, r=60)
)

fig.show()

Segments Summary

Group Leads Style Organization Investment Emotion/Commitment
Group 1 234 Simple and intimate Loose ideas Something simple Dreaming together, cautious
Group 2 206 Charming, but lost Haven’t started Cannot afford Dreaming, but lost
Group 3 55 Charming to grandiose Extremely high Can afford well Very high commitment

The proposed segmentation clearly shows three very distinct profiles, with different needs, desires, and conditions.

The use of KMeans with K=3 was the most appropriate, as it captured:

  • Two large groups focusing on simplicity, but with subtle differences in the degree of organization and insecurity.

  • A smaller but very valuable group of leads with high conversion potential and a higher average ticket.