Customer Analysis Project for Shopping Center

Author

Gabriel Ferreira

Published

September 18, 2024

Introduction

This project was carried out using a real dataset from a shopping center, with the main objective of studying the characteristics and purchasing patterns of its customers. Through an end-to-end (E2E) data analysis, from initial cleaning and preparation to the creation of visualizations, the goal was not only to describe consumer behavior but also to provide actionable insights that can guide strategic decisions, sales, and customer experience management.

The goal is to transform raw data into strategic knowledge, enabling the shopping center to optimize its operations, personalize its offers, and maximize the value of each customer.

Project Structure

The development of this project followed a structured methodology to ensure the accuracy and reliability of the extracted insights, covering everything from data collection and preparation to evaluation and interpretation of the results. Each phase, from data extraction to the interpretation of visualizations, was executed with attention to detail and focus on maximizing analytical value.

Data Collection and Preprocessing

Data quality is the foundation of any analysis. At this stage, we ensured the dataset was clean and ready for processing.

Data Loading and Initial Overview

  • Data Source: Collected from a shopping center in Singapore

The starting point was loading the dataset. It contained 28 columns, covering a wide range of information about customers, from demographic data and marital status to purchasing behaviors and interactions with promotional campaigns.

Load dataset + first 5 rows:
Show Code
caminho_do_arquivo = '../data/dataset_shoppingcenter.csv'
df = pd.read_csv(caminho_do_arquivo)
df.head()
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Response Complain Country
0 1826 1970 Graduation Divorced $84,835.00 0 0 6/16/14 0 189 ... 6 1 0 0 0 0 0 1 0 SP
1 1 1961 Graduation Single $57,091.00 0 0 6/15/14 0 464 ... 7 5 0 0 0 0 1 1 0 CA
2 10476 1958 Graduation Married $67,267.00 0 1 5/13/14 0 134 ... 5 2 0 0 0 0 0 0 0 US
3 1386 1967 Graduation Together $32,474.00 1 1 5/11/14 0 10 ... 2 7 0 0 0 0 0 0 0 AUS
4 5371 1989 Graduation Single $21,474.00 1 0 4/8/14 0 6 ... 2 7 1 0 0 0 0 1 0 SP

5 rows × 28 columns

Exploratory Data Analysis and Cleaning

Column count df01:
Show Code
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   ID                   2240 non-null   int64 
 1   Year_Birth           2240 non-null   int64 
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4    Income              2216 non-null   object
 5   Kidhome              2240 non-null   int64 
 6   Teenhome             2240 non-null   int64 
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64 
 9   MntWines             2240 non-null   int64 
 10  MntFruits            2240 non-null   int64 
 11  MntMeatProducts      2240 non-null   int64 
 12  MntFishProducts      2240 non-null   int64 
 13  MntSweetProducts     2240 non-null   int64 
 14  MntGoldProds         2240 non-null   int64 
 15  NumDealsPurchases    2240 non-null   int64 
 16  NumWebPurchases      2240 non-null   int64 
 17  NumCatalogPurchases  2240 non-null   int64 
 18  NumStorePurchases    2240 non-null   int64 
 19  NumWebVisitsMonth    2240 non-null   int64 
 20  AcceptedCmp3         2240 non-null   int64 
 21  AcceptedCmp4         2240 non-null   int64 
 22  AcceptedCmp5         2240 non-null   int64 
 23  AcceptedCmp1         2240 non-null   int64 
 24  AcceptedCmp2         2240 non-null   int64 
 25  Response             2240 non-null   int64 
 26  Complain             2240 non-null   int64 
 27  Country              2240 non-null   object
dtypes: int64(23), object(5)
memory usage: 490.1+ KB

We identified that the column Income was formatted as an object due to special characters such as dollar signs ($) and commas (,). To enable numeric operations and statistical analyses, we converted this column to float. Additionally, a leading whitespace in the column name was fixed for easier handling. The Dt_Customer column (customer enrollment date) was also stored as an object and not recognized as a datetime. Conversion was necessary for temporal analyses, such as customer age at enrollment or sign-up trends over time.

Adjusting Income and Dt_Customer columns:
Show Code
df.rename({' Income ' : 'Income'}, axis = 1, inplace = True)
df['Income'] = df['Income'].str.replace('$', '')
df['Income'] = df['Income'].str.replace(',', '').astype(float)
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'])
Checking for Missing Values:
Show Code
df.isnull().sum()
ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Response                0
Complain                0
Country                 0
dtype: int64

After converting Income to numeric, 24 missing values remained. To determine the best imputation strategy, we investigated the statistical distribution of the variable.

Statistical Summary of Income:
Show Code
df['Income'].describe()
count      2216.000000
mean      52247.251354
std       25173.076661
min        1730.000000
25%       35303.000000
50%       51381.500000
75%       68522.000000
max      666666.000000
Name: Income, dtype: float64

The analysis revealed a significant difference between mean (52,247) and median (51,381). Most notably, the maximum value (666,666) was extremely distant from the third quartile (68,522), a strong indication of outliers. Such extreme values disproportionately affect the mean, making it unreliable for central tendency.

Decision: We chose to replace missing values with the median income, as the median is more robust and less sensitive to outliers, ensuring imputation did not distort the real distribution of income.

Replacing Missing Values with Median in Income:
Show Code
median_income = df['Income'].median()
df['Income'].fillna(median_income, inplace=True)

Although we imputed missing values using the median, the presence of extreme outliers could still negatively influence models and visualizations. We analyzed the distribution of income further.

Income variable distribution:
Show Code
plt.figure(figsize = (6, 6))
plt.title("Customer Income Distribution")
sns.boxplot(data = df['Income'], color = 'blue')

The $666,666 outlier stood out, and given the dataset size and its singularity, we removed it to reduce noise and improve representativeness.

Removing the outlier and plotting again:
Show Code
df = df[df['Income'] != 666666.0]
plt.figure(figsize = (6, 6))
plt.title("Customer Income Distribution (without outlier)")
sns.boxplot(data = df['Income'], color = 'blue')

Boxplots before and after confirmed the effectiveness of this intervention, resulting in a cleaner representation of customer income.

Feature Engineering

The creation of new variables, or Feature Engineering, is an important step to enrich the dataset and extract deeper insights. We engineered metrics to capture customer profiles and behaviors.

a) Customer Age at Enrollment (Customer_Age_When_Enrolled)

To understand age distribution at enrollment, we created a column by subtracting birth year (Year_Birth) from enrollment year (Dt_Customer). This metric supports age-targeted campaigns.

Creating Customer_Age_When_Enrolled:
Show Code
df['Customer_Age_When_Enrolled'] = df['Dt_Customer'].dt.year - df['Year_Birth']

b) Total Spending (Total_Spent)

While spending was divided across product categories (wine, fruits, meat, fish, sweets, gold), a consolidated “Total_Spent” metric provides a holistic view of customer value.

Creating Total_Spent:
Show Code
df['Total_Spent'] = df[['MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds']].sum(axis=1)

c) Total Purchases (Total_Purchases)

Similarly, summing purchases across different channels (promotions, web, catalog, in-store) provided a unified measure of engagement.

Total Purchases (Total_Purchases):
Show Code
df['Total_Purchases'] = df[['NumDealsPurchases','NumWebPurchases','NumStorePurchases','NumCatalogPurchases']].sum(axis=1)

d) Kids and Teens at Home (Kids_Teen_at_home)

We combined Kidhome and Teenhome into one variable to simplify analysis and identify households with dependents, which may impact shopping patterns.

Creating Kids_Teen_at_home:
Show Code
df['Kids_Teen_at_home'] = df[['Kidhome','Teenhome']].sum(axis=1)

Cleaning Original Variables

After aggregation, we dropped Kidhome and Teenhome to avoid redundancy.

Dropping original columns:
Show Code
df = df.drop(['Kidhome', 'Teenhome'], axis = 1)

Outlier Re-check with IQR

To ensure the quality of the new attributes and existing columns, a new outlier verification was performed using the Interquartile Range (IQR) method for the most relevant features for the analysis (Customer_Age_When_Enrolled, Total_Spent, Total_Purchases, Kids_Teen_at_home). The 2 times IQR rule was applied for a more sensitive detection.

Defining columns of interest, creating the function and applying it to the defined columns for outlier identification:
Show Code
# Colunas de interesse
colunas = ['Customer_Age_When_Enrolled', 'Total_Spent', 'Total_Purchases', 'Kids_Teen_at_home']
# FUncao para detectar outliers
def detectar_outliers_iqr(col):
    Q1 = col.quantile(0.25)
    Q3 = col.quantile(0.75)
    IQR = Q3 - Q1
    limite_inferior = Q1 - 2 * IQR
    limite_superior = Q3 + 2 * IQR
    return {
        'Q1': Q1,
        'Q3': Q3,
        'IQR': IQR,
        'limite_inferior': limite_inferior,
        'limite_superior': limite_superior,
        'outliers': col[(col < limite_inferior) | (col > limite_superior)]
    }
# Imprimindo outliers nas colunas
for coluna in colunas:
    res = detectar_outliers_iqr(df[coluna])
    print(f'{coluna}: {res["outliers"].count()} outliers ({100*res["outliers"].count()/df.shape[0]:.2f}%)')
Customer_Age_When_Enrolled: 3 outliers (0.13%)
Total_Spent: 0 outliers (0.00%)
Total_Purchases: 0 outliers (0.00%)
Kids_Teen_at_home: 0 outliers (0.00%)

It was found that only 3 outliers persisted in the Customer_Age_When_Enrolled variable, representing a very low percentage (0.13%) of the dataset.

For better visualization, let’s analyze the boxplot of these variables:

Boxplot of variables of interest:
Show Code
plt.figure(figsize=(12,6))
for i, coluna in enumerate(colunas):
    plt.subplot(1, len(colunas), i+1)
    sns.boxplot(y=df[coluna])
    plt.title(coluna)
plt.tight_layout()
plt.show()

Identification of found outliers:
Show Code
for coluna in colunas:
    res = detectar_outliers_iqr(df[coluna])
    outliers = res['outliers']
    if not outliers.empty:
        print(f"\nColuna: '{coluna}'")
        print(outliers)

Coluna: 'Customer_Age_When_Enrolled'
513     121
827     114
2233    113
Name: Customer_Age_When_Enrolled, dtype: int64

These extremely atypical values (ages like 113, 114, and 121 years, likely registration errors) were removed to avoid any noise or distortion in future age analyses, which are crucial for customer segmentation.

Removing extreme values from the variable:
Show Code
df = df[df['Customer_Age_When_Enrolled'] != 113]
df = df[df['Customer_Age_When_Enrolled'] != 114]
df = df[df['Customer_Age_When_Enrolled'] != 121]

Duplicate Check and Final Dataset Structure

Duplicate Check and Final Dataset Structure

Checking for duplicate values:
Show Code
df.duplicated().sum()
0
Dataset verification after cleaning and creating new features:
Show Code
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2236 entries, 0 to 2239
Data columns (total 30 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   ID                          2236 non-null   int64         
 1   Year_Birth                  2236 non-null   int64         
 2   Education                   2236 non-null   object        
 3   Marital_Status              2236 non-null   object        
 4   Income                      2236 non-null   float64       
 5   Dt_Customer                 2236 non-null   datetime64[ns]
 6   Recency                     2236 non-null   int64         
 7   MntWines                    2236 non-null   int64         
 8   MntFruits                   2236 non-null   int64         
 9   MntMeatProducts             2236 non-null   int64         
 10  MntFishProducts             2236 non-null   int64         
 11  MntSweetProducts            2236 non-null   int64         
 12  MntGoldProds                2236 non-null   int64         
 13  NumDealsPurchases           2236 non-null   int64         
 14  NumWebPurchases             2236 non-null   int64         
 15  NumCatalogPurchases         2236 non-null   int64         
 16  NumStorePurchases           2236 non-null   int64         
 17  NumWebVisitsMonth           2236 non-null   int64         
 18  AcceptedCmp3                2236 non-null   int64         
 19  AcceptedCmp4                2236 non-null   int64         
 20  AcceptedCmp5                2236 non-null   int64         
 21  AcceptedCmp1                2236 non-null   int64         
 22  AcceptedCmp2                2236 non-null   int64         
 23  Response                    2236 non-null   int64         
 24  Complain                    2236 non-null   int64         
 25  Country                     2236 non-null   object        
 26  Customer_Age_When_Enrolled  2236 non-null   int64         
 27  Total_Spent                 2236 non-null   int64         
 28  Total_Purchases             2236 non-null   int64         
 29  Kids_Teen_at_home           2236 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(25), object(3)
memory usage: 541.5+ KB

The resulting dataset, with 2,236 rows and 30 columns, is now clean, more complete with the new features, and ready for the exploratory analysis phase.

Exploratory Data Analysis and Insight Visualizations

Customer Age Analysis:

Show Code
plt.figure(figsize = (6, 6))
plt.title("Customer Age Distribution")
sns.violinplot(data = df['Customer_Age_When_Enrolled'], color = 'purple')

The distribution of customer ages, visualized through a violin plot, revealed a striking demographic characteristic: the majority of customers are concentrated in the 35 to 45 age group. This data is of utmost importance, as it points to the shopping center’s main market niche, indicating that marketing strategies, product and service selection, and even promotional events should be directed to meet the needs and preferences of this specific demographic group. Customers in this age range generally have greater purchasing power and financial stability, making them a highly valuable target audience.

Distribution by Education Level:

Show Code
education_counts = df['Education'].value_counts()
plt.figure(figsize = (14, 8))
plt.title("Customer Distribution by Education Level")
sns.countplot(x = df['Education'], palette = 'rocket_r')

The analysis of customer education levels, through a count plot, highlighted the predominance of customers with Bachelor’s degrees, followed by those with PhDs and Master’s degrees. A minority holds only a basic education level. This educational pattern suggests a customer profile that values quality, perhaps with greater access to information and decision-making power regarding their purchases. Understanding this characteristic can influence communication, the type of stores and services offered, and the sales approach, which can be more sophisticated and value-focused.

Profile by Marital Status:

Show Code
# Criando um df para armazernar a contagem de cada tipo de estado civil
estado_civil = df['Marital_Status'].value_counts().to_frame('Count')
# Criando o grafico
sns.set_palette('Blues_r')
plt.figure(figsize = (4, 7))
plt.pie(estado_civil['Count'], 
        labels = estado_civil.index, 
        explode = (0.1, 0, 0, 0, 0, 1, 2.5, 4), 
        shadow = True, 
        autopct = '%1.1f%%')
plt.show()

The distribution of marital status, illustrated by a pie chart, revealed that most customers are married or living together, followed by single individuals. The predominance of couples suggests that many purchasing decisions may be influenced by family dynamics. This could open opportunities for promotions and events aimed at couples or families, as well as offering products and services that cater to multiple household members.

Income Distribution and Spending Patterns:

Show Code
df = df[df['Income'] < 200000]
plt.figure(figsize = (12,7))
sns.distplot(df['Income'], color = 'magenta')
plt.xlabel('Salary')
plt.ylabel('Density')
plt.show()

The analysis of income distribution, through a distplot, confirmed that the vast majority of customers have an income around 50,000. Although the distribution is approximately normal,some customers have incomes above 150,000. This concentration indicates a public with medium-to-high purchasing power, who may be sensitive to value propositions but also seek products and services that justify the investment. Higher-income customers, though fewer in number, represent a high-potential segment for premium products and services.

Relationship between Income and Education:

Show Code
sns.set_palette('rocket')
plt.figure(figsize = (12, 7))
sns.violinplot(y = df['Income'], x = df['Education'])
plt.xlabel('Educational Level')
plt.ylabel('Salary')
plt.show()

The relationship between income and education, explored by a violin plot, validated a common trend: the higher the education level, the higher the income. This correlation reinforces the customer profile as being well-educated with good financial capacity. Stores and brands that align their products and communication with an intellectualized public with aspirations for quality will benefit more in this environment.

Relationship between Income and Spending on Gold Products:

Show Code
plt.figure(figsize = (12, 7))
sns.regplot(x = df['Income'], y = df['MntGoldProds'], color = 'green')
plt.xlabel('Salary')
plt.ylabel('Spending on Gold Products')
plt.show()

The relationship between income and spending on gold products, visualized by a regplot, showed a clear positive trend: customers with higher incomes tend to spend more on gold products. This is an interesting indicator of customer purchasing power and taste. Gold and other high-value-added products can be promoted more effectively to the higher-income segment, perhaps through exclusive shopping experiences or differentiated loyalty programs.

Relationship between Income and Total Spending on Purchases:

Show Code
plt.figure(figsize = (12, 7))
sns.regplot(x = df['Income'], y = df['Total_Spent'], color = 'red')
plt.xlabel('Salary')
plt.ylabel('Total Spending on Purchases')
plt.show()

Even more importantly, the analysis of the relationship between income and total spending on purchases, also through a regplot, confirmed a robust positive relationship: customers with higher incomes naturally contribute a larger volume of spending at the shopping center. This is one of the most fundamental insights, reinforcing the importance of attracting and retaining high-income customers, as they are the backbone of revenue. Strategies aimed at increasing disposable income or the perceived value for these customers can have a direct and significant impact on financial results.

Customer Age at Enrollment Date:

Show Code
plt.figure(figsize = (7, 5))
sns.distplot(df['Customer_Age_When_Enrolled'], color = 'brown')
plt.xlabel('Customer Age at Enrollment Date')
plt.ylabel('Density')
plt.show()

The distribution of customer age at the registration date, a distplot, reiterated the concentration of customers in their 40s. This solidifies the idea that the shopping center has a strong appeal to a more mature and established audience, consolidating an important and consistent market niche. Communication and offers should resonate with the interests and lifestyle of this demographic.

Customer Distribution by Country of Origin:

Show Code
plt.figure(figsize=(12, 6))
sns.countplot(
    x='Country',
    data=df,
    palette='icefire_r',
    order=df['Country'].value_counts().index
)
plt.xlabel('Country', fontsize=12)
plt.ylabel('Number of Customers', fontsize=12)
plt.tight_layout()
plt.show()

The analysis of customer distribution by country, using a countplot, revealed a clear concentration of customers in the dataset’s country of origin, Singapore, followed by Saudi Arabia and Canada. Singapore’s dominance is expected and reinforces the shopping center’s local base. The significant presence of customers from Saudi Arabia and Canada indicates a potential international or high-income clientele who may be visiting the region. This opens the way for shopping tourism strategies and partnerships with travel agencies, or even adapting offers for specific cultures.

Relationship between Country of Origin and Total Purchases:

Show Code
plt.figure(figsize = (12, 7))
sns.violinplot(x = df['Country'], y = df['Total_Purchases'], palette = 'coolwarm')
plt.show()

A violin plot comparing the country of origin with total purchases showed that, in general, countries exhibit similar purchasing behavior. However, there were notable exceptions: the United States showed some customers with significantly higher purchase values, while Mexico displayed distinctly different behavior from the others. This suggests that, although the average purchasing profile is consistent, there are market niches in specific countries, such as the “super buyers” from the USA, who deserve special attention. For Mexico, a more in-depth study of why its behavior diverges could reveal specific opportunities or challenges.

Grouping of Total Spending by Country:

Show Code
pm = df[['Total_Purchases', 'Country']].groupby(['Country']).agg([sum])
sns.set_palette('Spectral')
plt.figure(figsize = (7, 7))
plt.pie(
    pm['Total_Purchases']['sum'],
    labels=pm.index,
    shadow=True,
    autopct='%1.1f%%',
    labeldistance=1.3
)
plt.tight_layout()
plt.show()

The analysis of total aggregated spending by country, through a pie chart, confirmed what was already intuitive: the majority of revenue comes from Singapore, followed by Saudi Arabia, which together account for almost 55% of total purchases. This concentration of value in two countries reinforces the need to focus retention and engagement efforts on these primary markets. Personalized strategies and loyalty campaigns for customers from Singapore and Saudi Arabia are crucial for the shopping center’s financial health.

Customer Registrations by Month:

Show Code
cadastros_mes = df['Dt_Customer'].dt.month.value_counts().sort_index()
meses = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
cores = sns.color_palette("Spectral", len(meses))
plt.figure(figsize=(10,6))
plt.bar(meses, cadastros_mes, color=cores)
plt.xlabel('Month')
plt.ylabel('Registrations')
plt.tight_layout()
plt.show()

The distribution of customer registrations throughout the months of the year, presented in a bar chart, revealed that the acquisition process is relatively well distributed. There was a slight dip in June and July, which could be seasonal or indicate opportunities for acquisition campaigns during these periods. The stability in monthly registrations is a good sign. The slight declines in June and July could be targets for specific marketing actions to boost acquisition during these months, perhaps with exclusive promotions for new registrations or themed events.

Conclusion

This analysis of the shopping center’s customer profile revealed crucial insights that can guide future business strategies. By understanding who the customers are, where they come from, their purchasing power, and their spending habits, the shopping center can significantly optimize its operations.

Key Project Insights:

  • Defined Target Audience: The shopping center predominantly attracts an adult audience, aged between 35 and 45, with high education (Bachelor’s, PhD, Master’s) and mostly married or in a stable union. This is its core business and should be the main focus.

  • Consistent Purchasing Power: The majority of customers have income concentrated around $50,000, with a clear positive correlation between income, education, and total spending (including high-value products like gold). This validates the strategy of focusing on an audience that seeks quality and has investment capacity.

  • Geographic Customer Base: Although diversified, the customer base is predominantly local (Singapore) with a notable contribution from Saudi Arabia and Canada. This suggests opportunities to optimize strategies for both the domestic market and shopping tourism.

  • Country-Specific Behaviors: The identification of high-spending customers in the USA and a different pattern in Mexico points to the need for country-segmented approaches to maximize returns.

  • Acquisition Stability: Customer registrations are well distributed throughout the year, with small dips in June and July, which can be opportunities for targeted acquisition campaigns.

Ultimately, this project highlights how Data Science bridges the gap between raw data and business strategy, empowering retail organizations to remain competitive and customer-centric.”