ML - Clustering of mall customers
Unsupervised machine learning is sometimes described as being closer to what some call true artificial intelligence. Unlike supervised machine learning, there is no supervisor (no labeled output) to provide guidance. The algorithm has to find patterns in the input data on its own in order to understand it better.
Clustering
Clustering is an unsupervised machine learning technique used for analyzing data in many fields. Its task is to divide the input data into groups, called ‘clusters’, in such a way that observations in the same cluster are more similar to each other than to observations in other clusters.
In the example above, the squares are grouped into one cluster, the circles into another, and the triangles into a third. Practical applications include grouping similar news articles and grouping similar customers based on their profiles.
In this project we are going to use the mall customers dataset to group customers based on their annual income and spending score. First we import the required libraries and read the dataset. Then we do some exploratory data analysis (EDA) to learn more about the data.
#importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#reading the data
dataset=pd.read_csv('Mall_Customers.csv')
#EDA
dataset.head()
dataset.info()
dataset.isna().sum()
dataset.describe()
dataset.columns
Our dataset has five columns: customer id, age, gender, annual income, and spending score. We are interested in only two of them, annual income and spending score, because we are going to group customers based on these two parameters. So we extract just these two columns for further analysis.
data=dataset[['Annual_Income_(k$)', 'Spending_Score']]
x=data[['Annual_Income_(k$)']]
y=data[['Spending_Score']]
X=data.iloc[:, [0,1]].values
The k-means clustering algorithm requires us to choose the number of clusters, the k value, in advance. To find a suitable k we are going to use the elbow method.
Elbow method
Run k-means clustering on the dataset for a range of k values (for example, 1 to 10) and calculate the sum of squared errors (SSE), also called the within-cluster sum of squares (WCSS), for each k. The percentage of variance explained can be used instead; it tends to increase as k increases, just as SSE tends to decrease. Plot a line chart of the number of clusters against SSE and look for an elbow shape on the line. Because SSE keeps falling as k grows, the goal is not the lowest possible SSE but the smallest k beyond which adding more clusters brings only a marginal improvement, and the elbow usually marks this value. For this dataset the elbow appears at k=5.
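To make the quantity behind the elbow plot concrete, here is a minimal sketch that computes the within-cluster sum of squares by hand and compares it with scikit-learn's inertia_ attribute. The data is a synthetic stand-in (X_demo, three made-up blobs), not the mall dataset, and n_init=10 is an assumption added so the call behaves the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the (income, spending score) matrix; NOT the mall data.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=(20, 20), scale=3, size=(30, 2)),
    rng.normal(loc=(60, 50), scale=3, size=(30, 2)),
    rng.normal(loc=(90, 80), scale=3, size=(30, 2)),
])

model = KMeans(n_clusters=3, n_init=10, random_state=56).fit(X_demo)

# WCSS: squared distance from each point to the centroid of its own cluster,
# summed over all points.
assigned_centroids = model.cluster_centers_[model.labels_]
wcss_manual = float(((X_demo - assigned_centroids) ** 2).sum())

# The two numbers should agree up to floating-point rounding.
print(wcss_manual, model.inertia_)
```

This is the value that gets appended to the wcss list for each k in the elbow loop.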
from sklearn.cluster import KMeans
wcss=[]
for i in range(1,11):
    kmeans_clu=KMeans(n_clusters = i, random_state=56)
    kmeans_clu.fit(X)
    wcss.append(kmeans_clu.inertia_)
plt.figure(figsize=(10,5))
plt.plot(range(1,11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('wcss')
plt.show()
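Reading the elbow off a chart is a judgment call. As a rough cross-check, one common heuristic picks the point that lies farthest from the straight line joining the first and last points of the curve. The sketch below is one possible implementation, not part of the original analysis: the function name elbow_k is hypothetical and the example_wcss values are made up for illustration.

```python
import numpy as np

def elbow_k(k_values, wcss_values):
    """Return the k whose point is farthest from the chord joining the curve's endpoints."""
    k = np.asarray(list(k_values), dtype=float)
    w = np.asarray(list(wcss_values), dtype=float)
    # Rescale both axes to [0, 1] so that neither dominates the distance.
    k_n = (k - k.min()) / (k.max() - k.min())
    w_n = (w - w.min()) / (w.max() - w.min())
    # Direction of the chord from the first point to the last point.
    dx, dy = k_n[-1] - k_n[0], w_n[-1] - w_n[0]
    # Perpendicular distance of every point from that chord.
    dist = np.abs(dx * (w_n - w_n[0]) - dy * (k_n - k_n[0])) / np.hypot(dx, dy)
    return int(k[np.argmax(dist)])

# Made-up WCSS values with a visible bend at k=5 (illustration only).
example_wcss = [1000, 600, 350, 230, 90, 75, 62, 52, 44, 38]
best_k = elbow_k(range(1, 11), example_wcss)
print(best_k)  # 5 for these made-up values
```

With the real data, the call would be elbow_k(range(1, 11), wcss). The heuristic can disagree with a visual reading when the curve bends gradually, so it is best treated as a sanity check on the plot.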
We now have our k value, 5, and can build the model using the k-means clustering algorithm with k=5. The code below builds the model and visualizes the clusters on a graph. The five clusters are drawn in different colors, and each cluster's centroid is drawn in black. Customers with similar characteristics are placed in the same cluster.
kmeans=KMeans(n_clusters=5,random_state=56)
y_kmeans=kmeans.fit_predict(X)
plt.scatter(X[y_kmeans==0,0], X[y_kmeans==0,1], s=100,c='pink', label='Cluster 1')
plt.scatter(X[y_kmeans==1,0], X[y_kmeans==1,1], s=100,c='yellow', label='Cluster 2')
plt.scatter(X[y_kmeans==2,0], X[y_kmeans==2,1], s=100,c='green', label='Cluster 3')
plt.scatter(X[y_kmeans==3,0], X[y_kmeans==3,1], s=100,c='blue', label='Cluster 4')
plt.scatter(X[y_kmeans==4,0], X[y_kmeans==4,1], s=100,c='red', label='Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='black',
            label='Centroids')
plt.title('clusters of customers')
plt.xlabel('Annual Income')
plt.ylabel('Spending score')
plt.legend()
plt.show()
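Besides the scatter plot, a per-cluster summary table can help describe each group in words (for example, "high income, low spending"). The sketch below shows one way to build such a table with pandas. It runs on synthetic points that reuse the article's column names, so the numbers it prints say nothing about the real customers; the names demo and profile and the n_init=10 setting are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in with the article's column names; NOT the mall data.
rng = np.random.default_rng(1)
centers = [(25, 20), (25, 80), (55, 50), (90, 15), (90, 85)]
points = np.vstack([rng.normal(loc=c, scale=4, size=(40, 2)) for c in centers])
demo = pd.DataFrame(points, columns=['Annual_Income_(k$)', 'Spending_Score'])

# Fit k-means with k=5 and store 1-based labels to match the plot legend.
labels = KMeans(n_clusters=5, n_init=10, random_state=56).fit_predict(demo.values)
demo['Cluster'] = labels + 1

# Cluster size plus the average income and spending score of its members.
profile = demo.groupby('Cluster').agg(
    customers=('Spending_Score', 'count'),
    mean_income=('Annual_Income_(k$)', 'mean'),
    mean_score=('Spending_Score', 'mean'),
)
print(profile.round(1))
```

On the real data, the same groupby could be applied to a copy of data after adding y_kmeans + 1 as the Cluster column.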
Conclusion
We have now built an unsupervised machine learning model for the mall customers dataset using the k-means clustering algorithm. The model groups the customers into five clusters based on their annual income and spending score.