TRUE ARTIFICIAL INTELLIGENCE

ML - Clustering of mall customers


Unsupervised machine learning is closely aligned with what some call true artificial intelligence. Unlike supervised machine learning, it has no supervisor, that is, no labeled outputs, to provide guidance. The algorithm has to find patterns in the input data on its own.

Clustering

Clustering is one of the unsupervised machine learning techniques used for analyzing data in many fields. Its task is to divide the input data into groups, called ‘clusters’, in such a way that observations in the same cluster are more similar to each other than to observations in other clusters.


Figure: clustering of similar data points in machine learning


In the example above, the squares are grouped in one cluster, the circles in another, and the triangles in a third. Practical examples include grouping similar news articles or grouping similar customers based on their profiles.

In this project we are going to use a mall customers dataset to group the customers based on their annual income and spending score. First we import the required libraries and read the dataset. Then we do some exploratory data analysis (EDA) to learn more about the data.


#importing necessary libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns


#reading the data

dataset=pd.read_csv('Mall_Customers.csv')


#EDA

dataset.head()

dataset.info()

dataset.isna().sum()

dataset.describe()

dataset.columns


Our dataset has five columns: customer ID, age, gender, annual income, and spending score. We are interested only in annual income and spending score, since we are going to group customers based on these two parameters, so we extract just these two columns for further analysis.


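#keeping only the two columns used for clustering
#(the names must match dataset.columns exactly; some copies of this file use
#'Annual Income (k$)' and 'Spending Score (1-100)' instead)

data=dataset[['Annual_Income_(k$)', 'Spending_Score']]

x=data[['Annual_Income_(k$)']]

y=data[['Spending_Score']]

#NumPy array of shape (number of customers, 2) that is passed to KMeans

X=data.iloc[:, [0,1]].values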


In the k-means clustering algorithm we have to specify the number of clusters, the k value, in advance. To find a suitable k we are going to use the elbow method.
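Before choosing k, it may help to see what the algorithm does once k is fixed. The following is a minimal NumPy sketch of the standard k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is only an illustration and not scikit-learn's implementation; the function name simple_kmeans, its parameters, and the toy data are made up for this example.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=100, seed=0):
    """Illustrative k-means loop; a sketch, not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    return labels, centroids

# Toy data: two small groups of points (made up for illustration)
toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = simple_kmeans(toy, k=2)
print(labels)
print(centroids)
```

The result depends on the random starting centroids, and k-means can settle in a local optimum, which is one reason library implementations usually restart from several initializations.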

 

Elbow method

Perform k-means clustering on the dataset for a range of k values (for example 1 to 10) and calculate the sum of squared errors (SSE) for each k. The SSE is also called the within-cluster sum of squares (WCSS); scikit-learn exposes it as the inertia_ attribute used in the code below. Plot a line chart of the number of clusters vs. SSE and look for an elbow shape in the line. The SSE tends to decrease as k increases (equivalently, the percentage of variance explained increases), so the goal is not the lowest SSE but the smallest k beyond which adding more clusters gives only a small improvement. That point is the elbow, and for this dataset it appears at k=5.
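To make the quantity on the y-axis concrete, here is a small sketch that recomputes the within-cluster sum of squares by hand and compares it with the inertia_ value reported by scikit-learn. The data is synthetic (three random blobs standing in for income and spending score, not the mall dataset), and the names demo_X and demo_model are only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the two feature columns (not the mall data)
rng = np.random.default_rng(0)
demo_X = np.vstack([
    rng.normal(loc=(20, 20), scale=3, size=(30, 2)),
    rng.normal(loc=(60, 50), scale=3, size=(30, 2)),
    rng.normal(loc=(90, 80), scale=3, size=(30, 2)),
])

demo_model = KMeans(n_clusters=3, n_init=10, random_state=56).fit(demo_X)

# Centroid of the cluster each point was assigned to, one row per point
assigned_centroids = demo_model.cluster_centers_[demo_model.labels_]

# Sum of squared distances from every point to its own centroid
manual_wcss = ((demo_X - assigned_centroids) ** 2).sum()

# The two values should agree up to floating-point error
print(manual_wcss, demo_model.inertia_)
```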


from sklearn.cluster import KMeans

wcss=[]

for i in range(1,11):

    kmeans_clu=KMeans(n_clusters = i, random_state=56)

    kmeans_clu.fit(X)

    wcss.append(kmeans_clu.inertia_)


plt.figure(figsize=(10,5))

plt.plot(range(1,11), wcss)

plt.title('The Elbow Method')

plt.xlabel('Number of clusters')

plt.ylabel('wcss')

plt.show()


Figure: elbow plot of WCSS against the number of clusters (k)


The elbow plot gives us k=5, so we can build our model using the k-means clustering algorithm with k=5. The code below builds the model and visualizes the clusters. There are five clusters in different colors, each with its centroid marked in black. Customers with similar characteristics are placed in the same cluster.


kmeans=KMeans(n_clusters=5,random_state=56)

y_kmeans=kmeans.fit_predict(X)


plt.scatter(X[y_kmeans==0,0], X[y_kmeans==0,1], s=100,c='pink', label='Cluster 1')

plt.scatter(X[y_kmeans==1,0], X[y_kmeans==1,1], s=100,c='yellow', label='Cluster 2')

plt.scatter(X[y_kmeans==2,0], X[y_kmeans==2,1], s=100,c='green', label='Cluster 3')

plt.scatter(X[y_kmeans==3,0], X[y_kmeans==3,1], s=100,c='blue', label='Cluster 4')

plt.scatter(X[y_kmeans==4,0], X[y_kmeans==4,1], s=100,c='red', label='Cluster 5')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='black', label='Centroids')

plt.title('clusters of customers')

plt.xlabel('Annual Income')

plt.ylabel('Spending score')

plt.legend()

plt.show()


Figure: five clusters of customers with their centroids
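Once every customer has a cluster label, a natural follow-up is to summarize each cluster by its size and average income and spending score. The sketch below shows one way to do this with pandas. It uses a tiny made-up table instead of Mall_Customers.csv and k=3 to keep it short; the names demo, demo_kmeans, and summary are only for this example, and the column names are assumed to follow the ones used above.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Tiny made-up table standing in for the mall data
demo = pd.DataFrame({
    'Annual_Income_(k$)': [15, 16, 17, 60, 62, 61, 120, 118, 122],
    'Spending_Score':     [80, 82, 78, 50, 48, 52, 15, 18, 12],
})

features = demo[['Annual_Income_(k$)', 'Spending_Score']].values
demo_kmeans = KMeans(n_clusters=3, n_init=10, random_state=56)

# Store the cluster label of each customer next to the original columns
demo['Cluster'] = demo_kmeans.fit_predict(features)

# Number of customers and average income/score per cluster
summary = demo.groupby('Cluster').agg(
    customers=('Spending_Score', 'count'),
    mean_income=('Annual_Income_(k$)', 'mean'),
    mean_score=('Spending_Score', 'mean'),
)
print(summary)
```

With the real data, the same pattern would be applied to dataset and y_kmeans, which makes it easier to describe each group, for example as high income with low spending.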


Conclusion

We have now built our unsupervised machine learning model for the mall customer dataset using the k-means clustering algorithm. We have successfully grouped customers into five clusters based on their annual income and spending score.


