MLH Local Hack Day

What it does

K-means clustering algorithm

Clustering is a technique for grouping data points based on patterns in the data, so that we can learn something about its structure. It has many real-world applications: for example, a bank can cluster its clients to decide who receives which credit card offer. It is also used in image segmentation, document clustering, and recommendation engines such as Spotify's. In clustering we do not have a target to predict. We look at the data and try to group similar observations together. Hence it is an unsupervised learning problem.

All clusters should satisfy two properties:
1> Data points in the same cluster should be similar to each other.
2> Data points in different clusters should be as different as possible.

Evaluation metrics to determine the quality of clusters:
1> Inertia - the sum of intra-cluster distances, i.e. the (squared) distance of each point from the centroid of its cluster. The lower the inertia, the tighter the clusters.
2> Dunn index - the ratio of the smallest distance between two clusters to the largest distance within a cluster. The higher the Dunn index, the better separated the clusters.
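The two metrics above can be sketched in a few lines of Python. This is only a minimal illustration: the function names are made up for this example, inertia is taken as the sum of squared Euclidean distances, and the Dunn index uses one common definition (smallest distance between points of different clusters divided by the largest distance between points of the same cluster). It also assumes at least two clusters and at least one cluster with two or more points.

```python
import math
from itertools import combinations

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def inertia(clusters):
    """Sum of squared distances from every point to its own cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(math.dist(p, c) ** 2 for p in points)
    return total

def dunn_index(clusters):
    """Smallest between-cluster distance divided by largest within-cluster distance."""
    min_between = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    max_within = max(
        math.dist(p, q)
        for points in clusters
        for p, q in combinations(points, 2)
    )
    return min_between / max_within

# Toy data (an assumption for illustration): two tight, well-separated clusters.
clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
print(inertia(clusters))     # 4.0
print(dunn_index(clusters))  # 5.0
```

Here every point is 1 unit from its centroid, so the inertia is 4, and the clusters are 10 units apart with a diameter of 2, so the Dunn index is 5.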

K-means is a centroid-based, or distance-based, algorithm: we calculate distances to decide which cluster each point is assigned to. In K-means, each cluster is associated with a centroid.

The main objective of the K-means algorithm is to minimize the sum of squared distances between the points and their respective cluster centroids, i.e. the inertia.
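Written as a formula, this objective is usually stated with squared Euclidean distances. The symbols here are assumptions introduced for illustration: k is the number of clusters, C_j is the set of points in cluster j, and \mu_j is its centroid.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```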

Steps in the K-means clustering algorithm:
1> Select the value of k, the number of clusters.
2> Select k initial centroids, one for each cluster.
3> Assign every point to the closest centroid.
4> Recompute the centroid of each newly formed cluster.

Repeat steps 3 and 4 until one of the following holds:
1> The centroids of the newly formed clusters do not change.
2> The points remain in the same clusters.
3> The maximum number of iterations is reached.
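The steps and stopping rules above can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: the function name `kmeans`, its parameters, and the choice of picking k random data points as starting centroids are assumptions made for this example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of coordinate tuples into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the starting centroids (an assumption).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):  # stop 3: maximum number of iterations
        # Step 3: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stops 1 and 2: if no point changed cluster, the centroids cannot change either.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its assigned points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Toy data (an assumption for illustration): two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

On this toy data the first three points should share one label and the last three the other. Which group gets label 0 depends on the random starting centroids.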

Challenges in the K-means algorithm:
1> The clusters have different sizes.
2> The clusters have different densities, i.e. some points are spread out while others are packed closely together.

One solution to these challenges is to use a higher number of clusters. However, because the initial centroids are selected at random, different runs can lead to different cluster formations. The maximum possible number of clusters is equal to the number of observations in the dataset.

But then how can we decide the optimum number of clusters? One thing we can do is plot a graph, known as an elbow curve, where the x-axis represents the number of clusters and the y-axis an evaluation metric such as inertia. The number of clusters at which the decrease in inertia levels off can be chosen as the right value for our data.
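The elbow idea can be sketched by computing the inertia for several values of k and looking for the point where it stops dropping sharply. The code below is a rough illustration under stated assumptions: the helper names, the toy data, and the choice of keeping the best of several random restarts per k (to soften the effect of random starting centroids) are all made up for this example. It prints the values instead of plotting them.

```python
import math
import random

def kmeans_inertia(points, k, rng, max_iter=100):
    """Run k-means once from random starting centroids and return the final inertia."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable, so we are done
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Inertia: sum of squared distances from each point to its assigned centroid.
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))

def elbow_curve(points, max_k, n_init=10, seed=0):
    """Map each k in 1..max_k to the lowest inertia found over n_init random restarts."""
    rng = random.Random(seed)
    return {
        k: min(kmeans_inertia(points, k, rng) for _ in range(n_init))
        for k in range(1, max_k + 1)
    }

# Toy data (an assumption for illustration): three well-separated groups of three points.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]
curve = elbow_curve(points, max_k=5)
for k, value in curve.items():
    print(k, round(value, 2))
```

With three clear groups in the data, the inertia would be expected to fall steeply up to k = 3 and only slightly after that, which is the "elbow" to look for.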

How we built it

Challenges we ran into

Learning AI

What we learned

We learnt the K-means algorithm.

What's next for Learning about artificial intelligence algorithms

Learning other artificial intelligence algorithms

Built With

  • k-means