The objective of this project is to analyze customer churn data and build a Machine Learning model on top of it. The dataset describes a fictional telecommunications company that provided landline phone and internet services to 7,043 customers in California during the third quarter, and it indicates which customers left, stayed, or signed up for the service. Churn prediction identifies customers who are likely to cancel their contracts soon; if the company can identify these customers in advance, it can reach out to them before they leave.
The target variable we want to predict is categorical and has only two possible outcomes: churn or non-churn (Binary Classification). We would also like to understand why the model thinks our customers are leaving, and for that, we need to be able to interpret the model’s predictions. According to the description, this dataset contains the following information:
Customer services: phone; multiple lines; internet; tech support and extra services, such as online security, backup, device protection, and TV streaming
Account information: how long they’ve been a customer, contract type, payment method type
Charges: how much was charged to the customer last month and in total
Demographic information: gender, age, and whether they have dependents or a partner
Churn: yes/no, whether the customer left the company last month
Project Structure
The development of this project followed a structured and iterative methodology, covering everything from data collection and preparation to the evaluation and interpretation of results.
The TotalCharges column has 11 missing values. Let’s analyze these rows to define the best strategy for handling them.
Displaying the full rows with missing values:
df[df['TotalCharges'].isna()]
|      | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|------|------------|--------|---------------|---------|------------|--------|--------------|---------------|-----------------|----------------|-----|------------------|-------------|-------------|-----------------|----------|------------------|---------------|----------------|--------------|-------|
| 488  | 4472-LVYGI | Female | 0 | Yes | Yes | 0 | No  | No phone service | DSL | Yes | ... | Yes | Yes | Yes | No | Two year | Yes | Bank transfer (automatic) | 52.55 | NaN | No |
| 753  | 3115-CZMZD | Male   | 0 | No  | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.25 | NaN | No |
| 936  | 5709-LVOEQ | Female | 0 | Yes | Yes | 0 | Yes | No | DSL | Yes | ... | Yes | No | Yes | Yes | Two year | No | Mailed check | 80.85 | NaN | No |
| 1082 | 4367-NUYAO | Male   | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.75 | NaN | No |
| 1340 | 1371-DWPAZ | Female | 0 | Yes | Yes | 0 | No  | No phone service | DSL | Yes | ... | Yes | Yes | Yes | No | Two year | No | Credit card (automatic) | 56.05 | NaN | No |
| 3331 | 7644-OMVMY | Male   | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 19.85 | NaN | No |
| 3826 | 3213-VVOLG | Male   | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.35 | NaN | No |
| 4380 | 2520-SGTTA | Female | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.00 | NaN | No |
| 5218 | 2923-ARZLG | Male   | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | One year | Yes | Mailed check | 19.70 | NaN | No |
| 6670 | 4075-WKNIU | Female | 0 | Yes | Yes | 0 | Yes | Yes | DSL | No | ... | Yes | Yes | Yes | No | Two year | No | Mailed check | 73.35 | NaN | No |
| 6754 | 2775-SEFEE | Male   | 0 | No  | Yes | 0 | Yes | Yes | DSL | Yes | ... | No | Yes | No | No | Two year | Yes | Bank transfer (automatic) | 61.90 | NaN | No |

11 rows × 21 columns
We observe that the rows with missing TotalCharges have a tenure of 0. This means they are new customers who haven’t accumulated any charges yet. Therefore, we will impute these missing values with 0.
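A minimal sketch of this imputation using pandas (the exact cell is not shown, but a single fillna is sufficient):

```python
# tenure of 0 means no billing cycle has completed yet,
# so a TotalCharges of 0 is consistent with the data.
df['TotalCharges'] = df['TotalCharges'].fillna(0)
```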
We are going to replace ‘No internet service’ and ‘No phone service’ with ‘No’. This is because they carry the same meaning (the customer does not have that particular service), and keeping them as separate categories would create unnecessary features without adding relevant information for the model.
Replacing these values with “No” across the DataFrame:
# Categories that mean the customer does not have that service
valores_substituir = ["No internet service", "No phone service"]
df = df.replace(valores_substituir, "No")
Let’s analyze the churn rate across our categorical variables, comparing the rate for each category against the global average churn rate.
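A sketch of how this comparison can be computed (the column list is illustrative, and Churn is still 'Yes'/'No' at this stage of the analysis):

```python
# Global churn rate as the baseline for comparison
global_rate = (df['Churn'] == 'Yes').mean()
print(f"Global churn rate: {global_rate:.1%}")

# Churn rate per category, shown as the difference from the global rate
for col in ['Contract', 'PaymentMethod', 'PaperlessBilling', 'InternetService']:
    rate = df.groupby(col)['Churn'].apply(lambda s: (s == 'Yes').mean())
    print(f"\n{col} (difference from global rate):")
    print((rate - global_rate).round(3))
```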
The average churn rate found in the dataset is approximately 27%, meaning that roughly one in every four customers terminates their contract. This is a significant rate and should be treated as a critical retention indicator. An analysis of the tenure variable shows a median of 29 months, but with 25% of customers staying for nine months or less, revealing a high concentration of short-term customers and suggesting that early churn is a primary risk.
Regarding billing, the MonthlyCharges variable has a mean of around 64.8 but with a wide spread, indicating highly distinct consumption profiles. TotalCharges, in turn, reflects both the variation in contract length and the accumulation of spending over time.
Observing the categorical variables, it’s notable that customers on month-to-month contracts, those who use paperless billing, and those with certain payment methods exhibit churn rates higher than the global average. These profiles, therefore, constitute strategic targets for targeted interventions. It is also worth highlighting that the data handling applied to TotalCharges was essential to prevent distortions and ensure consistency in the comparisons made.
The relationship between churn and tenure reveals that customers who cancel tend to be concentrated in the first few months of their contract, reinforcing the need for specific onboarding and loyalty strategies in the very first cycles. The comparison between churn and MonthlyCharges indicates that there are billing ranges in which the propensity to cancel is higher, especially among customers with higher invoice values, which suggests a need to evaluate the perceived cost-benefit in this segment.
Data Preprocessing
Dropping the customerID column as it has no predictive value:
df = df.drop('customerID', axis=1)
Summarizing the categorical columns of the current DataFrame:
df.describe(include=['object', 'bool'])
| Column           | count | unique | top              | freq |
|------------------|-------|--------|------------------|------|
| gender           | 7043  | 2      | Male             | 3555 |
| Partner          | 7043  | 2      | No               | 3641 |
| Dependents       | 7043  | 2      | No               | 4933 |
| PhoneService     | 7043  | 2      | Yes              | 6361 |
| MultipleLines    | 7043  | 2      | No               | 4072 |
| InternetService  | 7043  | 3      | Fiber optic      | 3096 |
| OnlineSecurity   | 7043  | 2      | No               | 5024 |
| OnlineBackup     | 7043  | 2      | No               | 4614 |
| DeviceProtection | 7043  | 2      | No               | 4621 |
| TechSupport      | 7043  | 2      | No               | 4999 |
| StreamingTV      | 7043  | 2      | No               | 4336 |
| StreamingMovies  | 7043  | 2      | No               | 4311 |
| Contract         | 7043  | 3      | Month-to-month   | 3875 |
| PaperlessBilling | 7043  | 2      | Yes              | 4171 |
| PaymentMethod    | 7043  | 4      | Electronic check | 2365 |
| Churn            | 7043  | 2      | No               | 5174 |
Label Encoding for binary variables:
binarias = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
            'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
            'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'Churn']

# Mapping Yes/No to 1/0 and Male/Female to 1/0
mapeamento = {'Yes': 1, 'No': 0, 'Male': 1, 'Female': 0}

for col in binarias:
    df[col] = df[col].map(mapeamento)
The 0.826 correlation between tenure and TotalCharges is extremely high, indicating strong multicollinearity. TotalCharges is fundamentally the product of tenure and MonthlyCharges (or, more precisely, the sum of MonthlyCharges over the tenure). Keeping all three variables together would add redundancy and could complicate interpretability, especially for linear models, without necessarily adding more predictive power.
We opted to remove TotalCharges, as we still retain crucial information from tenure (the customer’s length of stay, a key indicator of loyalty and maturity) and MonthlyCharges (the current cost of service, which correlates with the quality/service package and is a strong predictor of churn). Together, these two variables already capture the essence of what TotalCharges represents.
The other strong correlations reflect the intrinsic business relationship between the service value (MonthlyCharges) and the service type (InternetService). Removing either of these variables would mean losing valuable information about the customer’s service package, which is a very strong predictor of churn. The Machine Learning model can and should use this information to learn the relationship. The multicollinearity here is more of a reflection that one variable (MonthlyCharges) is heavily influenced by another (InternetService), and both are relevant.
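The encoding and drop steps can be sketched as follows. This is an assumption rather than the original cell, but with drop_first=True it yields 22 features and dummy names such as 'Contract_Two year' and 'InternetService_Fiber optic', both of which match the training logs and the feature-importance analysis later on:

```python
import pandas as pd

# Drop the redundant TotalCharges (see discussion above)
df = df.drop('TotalCharges', axis=1)

# One-hot encode the remaining multi-category columns;
# drop_first=True removes one reference category per column
df = pd.get_dummies(
    df,
    columns=['InternetService', 'Contract', 'PaymentMethod'],
    drop_first=True,
)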
We are facing a class imbalance problem in our target variable. To address this, we will test different strategies and approaches to determine which one yields the best model. Our objective is to build a model with an accuracy above 80%.
Why Recall? In the context of customer churn prediction, Recall (also known as Sensitivity or True Positive Rate) is the primary metric to optimize. It measures our model’s ability to correctly identify the highest possible proportion of customers who will actually leave the company (the “Churn” minority class).
Minimizing Customer Losses (False Negatives): The cost of a False Negative (a customer the model predicted would not churn, but who actually did) is typically very high for a business. Each lost customer represents not only the interruption of future revenue but also the potential loss of Lifetime Value (LTV), the costs of acquiring new customers, and a negative impact on reputation. A high Recall means we are minimizing these False Negatives, ensuring that most customers at risk of churning are flagged for intervention.
Opportunity for Proactive Intervention: By identifying a customer with a high probability of churning (thanks to a high Recall), the company gains the opportunity to implement targeted retention strategies (personalized offers, proactive support, satisfaction surveys). Missing this opportunity due to a False Negative is the most costly error in this scenario.
Although Recall is the priority, we do not disregard Precision, F1-Score, and AUC-ROC.
In summary, our focus is to ensure that the largest possible number of at-risk customers are identified (high Recall), enabling effective retention actions. The other metrics help us refine the model, ensuring that these interventions are as efficient and targeted as possible, thus optimizing the return on investment (ROI) in churn prevention.
Separating Features (X) and Target Variable (y):
X = df.drop('Churn', axis=1)
y = df['Churn']
Splitting data into train and test:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
To address class imbalance, a hybrid resampling approach combining SMOTE with undersampling is highly effective. SMOTE (Synthetic Minority Over-sampling Technique) avoids overfitting by creating new, synthetic samples for the minority class based on its nearest neighbors, rather than just duplicating data. This is then coupled with an undersampling technique, such as Tomek Links (as in SMOTETomek), which cleans the feature space by removing majority class samples that are close to the newly created synthetic minority points. This dual strategy results in a more balanced and less noisy dataset, allowing the model to learn a clearer and more robust decision boundary between the classes.
Applying SMOTE + Undersampling (SMOTETomek):
from imblearn.combine import SMOTETomek

smote_tomek = SMOTETomek(random_state=42)
X_train_resampled, y_train_resampled = smote_tomek.fit_resample(X_train, y_train)

print("\nClass Distribution After Applying SMOTETomek")
print(y_train_resampled.value_counts())
Class Distribution After Applying SMOTETomek
Churn
0 3965
1 3965
Name: count, dtype: int64
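The training cell behind the log and report below is not shown; a minimal sketch, assuming a default LGBMClassifier fitted on the resampled training set and evaluated on the untouched test set:

```python
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report

# Fit on the balanced (resampled) data; evaluate on the original test split
lgbm_smote = LGBMClassifier(random_state=42)
lgbm_smote.fit(X_train_resampled, y_train_resampled)

y_pred = lgbm_smote.predict(X_test)
print("--- Classification Report (LightGBM with SMOTETomek) ---")
print(classification_report(y_test, y_pred))
```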
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Number of positive: 3965, number of negative: 3965
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000879 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 550
[LightGBM] [Info] Number of data points in the train set: 7930, number of used features: 22
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
--- Classification Report (LightGBM with SMOTETomek) ---
              precision    recall  f1-score   support

           0       0.88      0.78      0.83      1035
           1       0.54      0.69      0.61       374

    accuracy                           0.76      1409
   macro avg       0.71      0.74      0.72      1409
weighted avg       0.79      0.76      0.77      1409
from sklearn.metrics import make_scorer, recall_score

count_class_0 = y_train.value_counts()[0]
count_class_1 = y_train.value_counts()[1]
scale_pos_weight_value = count_class_0 / count_class_1

print(f"Number of No Churn (class 0) in training set: {count_class_0}")
print(f"Number of Churn (class 1) in training set: {count_class_1}")
print(f"Calculated scale_pos_weight: {scale_pos_weight_value:.4f}")

# Scorer that targets recall for the positive (churn) class during tuning
recall_scorer = make_scorer(recall_score, pos_label=1)
Number of No Churn (class 0) in training set: 4139
Number of Churn (class 1) in training set: 1495
Calculated scale_pos_weight: 2.7686
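The tuning cell itself is not shown; a sketch of what it plausibly looked like, assuming a GridSearchCV with three candidate values per parameter (3^5 = 243 candidates, matching the log below). The value lists are illustrative assumptions; only the best parameters reported in the output are confirmed:

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid: 3 values per parameter -> 3^5 = 243 candidates
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 300],
    'num_leaves': [20, 31, 50],
    'reg_alpha': [0.0, 0.5, 1.0],
    'reg_lambda': [0.0, 0.5, 1.0],
}

lgbm_base = LGBMClassifier(
    scale_pos_weight=scale_pos_weight_value,  # computed above
    random_state=42,
)

grid_search = GridSearchCV(
    estimator=lgbm_base,
    param_grid=param_grid,
    scoring=recall_scorer,  # optimize recall for the churn class
    cv=5,
    n_jobs=-1,
    verbose=1,
)

print("--- Optimizing Hyperparameters for LightGBM (scale_pos_weight) ---")
grid_search.fit(X_train, y_train)

print(f"Best Parameters (LightGBM): {grid_search.best_params_}")
print(f"Best Recall (LightGBM) on CV: {grid_search.best_score_:.4f}")

lgbm = grid_search.best_estimator_  # tuned model used in later cells
```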
--- Optimizing Hyperparameters for LightGBM (scale_pos_weight) ---
Fitting 5 folds for each of 243 candidates, totalling 1215 fits
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Number of positive: 1495, number of negative: 4139
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000778 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 369
[LightGBM] [Info] Number of data points in the train set: 5634, number of used features: 22
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.265353 -> initscore=-1.018328
[LightGBM] [Info] Start training from score -1.018328
Best Parameters (LightGBM): {'learning_rate': 0.05, 'n_estimators': 100, 'num_leaves': 20, 'reg_alpha': 0.5, 'reg_lambda': 0.5}
Best Recall (LightGBM) on CV: 0.7873
--- Classification Report (LightGBM Tuned) ---
              precision    recall  f1-score   support

           0       0.91      0.74      0.82      1035
           1       0.52      0.79      0.63       374

    accuracy                           0.75      1409
   macro avg       0.72      0.77      0.72      1409
weighted avg       0.81      0.75      0.77      1409
AUC-ROC (LightGBM Tuned): 0.8428
The optimized versions with scale_pos_weight consistently outperform their SMOTETomek-based counterparts in AUC, Recall, and F1-Score. The conclusion is that, for this problem and dataset, using class_weight and scale_pos_weight in conjunction with hyperparameter optimization was the most effective strategy for achieving high Recall and AUC, surpassing the resampling approach with SMOTETomek.
Considering the results, the most promising models are:
Optimized LightGBM: Highest AUC and F1-Score, and an excellent Recall (0.79). It offers the best overall balance.
Optimized SVM: Highest Recall (0.88), but with the lowest Precision (0.46). It is important to evaluate if the cost of False Positives is justified by the extremely high Recall.
Threshold Tuning
Now that we have optimized models with excellent metrics, the next step is threshold tuning.
All the models we have used have a default classification threshold of 0.5. This means that if the predicted probability of churn is >= 0.5, the customer is classified as “churn”; otherwise, they are classified as “no churn”. However, for our problem, where Recall is the priority and Precision is a secondary concern, the 0.5 threshold may not be optimal.
We need to find a cutoff point that maximizes Recall for the “Churn” class (to avoid missing customers who are going to cancel), while maintaining Precision at an acceptable level (to avoid spending excessive resources on False Positives).
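A sketch of such a threshold sweep for the tuned LightGBM (assuming `lgbm` is the best estimator refit by the grid search):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Predicted churn probabilities on the test set
y_proba = lgbm.predict_proba(X_test)[:, 1]

# Evaluate precision/recall at a range of candidate cutoffs
print("threshold  precision  recall")
for t in np.arange(0.30, 0.71, 0.05):
    y_pred_t = (y_proba >= t).astype(int)
    print(f"{t:9.2f}  {precision_score(y_test, y_pred_t):9.3f}"
          f"  {recall_score(y_test, y_pred_t):6.3f}")
```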
Considering our primary objective to prioritize Recall (capturing the maximum number of churners), without completely disregarding Precision (to avoid wasting resources):
The Optimized LightGBM with the default cutoff point of 0.50 proves to be the most balanced and effective choice.
It achieves a Recall of 0.79, meaning it correctly identifies 79% of the customers who will actually cancel. This is excellent for proactive retention.
It maintains a Precision of 0.52, which indicates that of the customers the model flags as potential churners, a little over half will actually cancel. This implies a manageable cost of False Positives.
Its F1-Score of 0.63 is the highest among the optimized models, confirming its strong balance.
Its AUPRC (0.6565) and AUC-ROC (0.8428) are the highest, attesting to its overall robustness.
The Optimized SVM is impressive for its potential for an extremely high Recall (reaching 0.88 at lower thresholds), but the drop in Precision would be too steep (below 0.50) to justify its selection for this project. This would only be viable if the cost of a False Positive were absolutely negligible, which is rarely the case in retention strategies.
The Optimized LightGBM, using its default cutoff point of 0.50, offers the best balance between detecting customers who will actually cancel (79% Recall) and minimizing unnecessary interventions (52% Precision). This balance is ideal for a business strategy where the priority is to retain customers, ensuring that the majority of potential churners are identified, while simultaneously optimizing the use of retention resources. Its superior performance in AUPRC and AUC-ROC also attests to its overall discriminatory power.
Feature Importance
Feature Importance Analysis seeks to quantify the contribution of each variable (feature) in our dataset to the model’s predictions. In other words, it helps us understand which customer characteristics have the greatest impact on the probability of churn, revealing the “triggers” and “protectors” behind this behavior.
For this analysis, we will employ the SHAP (SHapley Additive exPlanations) methodology. SHAP is an advanced and widely respected technique known for its ability to offer transparent and fair interpretability. Unlike simpler methods, SHAP not only tells us “how important” a feature is, but also:
The Magnitude of the Impact: How much a feature, on average, pushes the churn prediction up or down.
The Direction of the Impact: Whether high or low values of a feature tend to increase or decrease the probability of churn.
Individual Context: How each feature affects the churn prediction for each individual customer, allowing for a detailed understanding.
Why is this analysis vital for the Business?
Understanding the importance and directionality of features is fundamental for:
Strategic Prioritization: Directing marketing, sales, and customer service resources to the areas that truly make a difference in retention.
Developing Focused Actions: Creating retention campaigns, adjusting product/service offerings, or improving aspects of customer service based on concrete evidence.
Identifying Opportunities: Discovering new insights into the profile of customers who churn and those who remain loyal.
By the end of this analysis, we will have a clear view of the main drivers of churn, empowering the business to make more informed and proactive decisions to improve the satisfaction and loyalty of our customers.
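The SHAP computation itself is not reproduced here; the sketch below is a plausible reconstruction that produces the `imp_shap_df` ranking expected by the partial-dependence code that follows (again assuming `lgbm` is the fitted tuned model):

```python
import numpy as np
import pandas as pd
import shap

# TreeExplainer is the efficient SHAP backend for tree ensembles like LightGBM
explainer = shap.TreeExplainer(lgbm)
shap_values = explainer.shap_values(X_test)

# Older shap versions return one array per class for binary problems;
# keep the values for the positive (churn) class if so
if isinstance(shap_values, list):
    shap_values = shap_values[1]

# Global importance: mean absolute SHAP value per feature
imp_shap_df = (
    pd.DataFrame({
        "feature": X_test.columns,
        "mean_abs_shap": np.abs(shap_values).mean(axis=0),
    })
    .sort_values("mean_abs_shap", ascending=False)
    .reset_index(drop=True)
)

# Beeswarm plot: magnitude and direction of each feature's impact
shap.summary_plot(shap_values, X_test)
```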
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Select the top 6 features by the global SHAP metric
top6 = imp_shap_df["feature"].head(6).tolist()

# Plot a partial dependence curve for each of them
for f in top6:
    fig = plt.figure(figsize=(6, 4))
    PartialDependenceDisplay.from_estimator(lgbm, X_test, [f], kind="average")
    plt.title(f"PDP – {f}")
    plt.tight_layout()
    plt.show()
Contract Type – Two-Year Contracts (Contract_Two year): Customers with two-year contracts show a significantly lower risk of churn. Encourage migration to this type of contract through offers and benefits.
Contract Type – One-Year Contracts (Contract_One year): Customers with one-year contracts also present lower churn risk compared to month-to-month contracts, but higher than two-year contracts. Reinforce the value of renewal and promote the transition to longer-term contracts.
Tenure: Customers with short tenure are at higher risk. Implement robust onboarding programs and proactive follow-up during the first months to ensure satisfaction and early engagement.
Internet Service – Fiber Optic (InternetService_Fiber optic): The fact that fiber optic customers show higher churn risk is a red flag. Service quality, support, or perceived cost-benefit for fiber users should be urgently investigated. Potential issues may include performance problems or inadequate pricing in this segment.
Monthly Charges (MonthlyCharges): Higher monthly charges increase churn risk. Analyze the price ranges where churn is most prevalent. Consider offering bundled packages or renegotiation for customers with high charges, especially when combined with low tenure.
Payment Method – Electronic Check (PaymentMethod_Electronic check): Customers using electronic checks are at high risk. This may indicate a segment with lower loyalty, financial difficulties, or dissatisfaction. Consider incentives to switch to other payment methods or establish closer monitoring for these customers.
Online Security (OnlineSecurity): Not having online security services increases churn risk. Promote the importance and benefits of online security offerings.
Streaming Services (StreamingMovies, StreamingTV): Having streaming services may increase churn risk. This is counterintuitive and may suggest that these services are not adding enough value to retain customers, or that customers seeking such services tend to have a profile more prone to exploring alternatives.
Deploy
Check this model in production by running churn predictions:
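A minimal sketch of this step, assuming the tuned model is persisted with joblib and that `new_customers` is a hypothetical batch that has already gone through the same preprocessing as the training data:

```python
import joblib
import pandas as pd

# Persist the tuned model (file name is illustrative)
joblib.dump(lgbm, "churn_lgbm_model.joblib")

# Later, in the scoring job/service:
model = joblib.load("churn_lgbm_model.joblib")

# `new_customers` (hypothetical) must use the same binary mapping and
# one-hot columns as training, with TotalCharges dropped
new_customers = pd.read_csv("new_customers.csv")
churn_proba = model.predict_proba(new_customers)[:, 1]

# Flag customers at or above the chosen 0.50 threshold for retention outreach
at_risk = new_customers.loc[churn_proba >= 0.5]
print(f"{len(at_risk)} customers flagged for retention outreach")
```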