Imagine walking into a library where none of the books are labeled or organized, and you are expected to arrange them. You would most likely begin sorting books by appearance or feel: similar cover art, length, or topic, even though nobody told you the genres. That is the key idea of unsupervised learning: it seeks structure in data when no annotated responses are available.
This tutorial explains what unsupervised learning means, how it differs from supervised learning, and summarizes the most common families of algorithms: clustering and dimensionality reduction. It also introduces the mathematical intuition behind these methods and walks through K-Means and PCA examples in Python.
What is Unsupervised Learning, and why does it matter?
Unsupervised learning refers to a group of methods used to discover patterns, clusters, or structure in data without any target labels. The model does not learn an input-to-output mapping; instead, it learns the internal structure of the data.
Why it matters:
- Most real-world data is not labeled. Labeling is expensive and time-consuming.
- Exploration & discovery: it helps find groups, anomalies, and structure before launching costly label-based projects.
- Dimensionality reduction, one of its subfields, makes complex data easier to visualize and process.
- Useful for preprocessing, feature engineering, anomaly detection, recommendations, and more.
Key differences from supervised learning (simple examples)
Supervised learning
- Trained on labeled examples: (input, correct output) pairs.
- Example: Training a spam filter on a large number of emails labeled "spam" or "not spam".
- Goal: Predict the label of new, unseen data.
Unsupervised learning
- No labels; the model discovers structure on its own.
- Example: Partitioning customer purchase records into segments (no segment labels exist).
- Goal: Find patterns in the data, or compress and characterize it effectively.
Short table:

Aspect  | Supervised learning           | Unsupervised learning
Data    | Labeled (input, output) pairs | Unlabeled inputs only
Example | Spam vs. not-spam email filter | Customer segmentation
Goal    | Predict labels for new data   | Discover structure / compress the data
Main types of Unsupervised Learning
A. Clustering — grouping similar data points
- K-Means: Partitions the data into k clusters, each represented by a centroid (cluster center).
- Hierarchical Clustering: Builds a tree (dendrogram) showing how clusters merge or split.
- DBSCAN: A density-based algorithm that detects clusters of arbitrary shape and labels noise/outliers.
B. Dimensionality Reduction — simplifying high-dimensional data
- PCA (Principal Component Analysis): A linear method that finds the directions (principal components) of maximum variance.
- t-SNE: A nonlinear method, used mainly for visualization, that preserves local structure.
- UMAP: Another modern method that is effective for visualization, analogous to t-SNE.
Intuition + simple math (beginner-friendly)
K-Means (intuitive)
- Pick k (the number of clusters).
- Randomly place k centroids.
- Repeat:
  - Assign each point to its closest centroid.
  - Move each centroid to the mean of its assigned points.
- Terminate when the assignments stop changing.
Objective (what K-Means minimizes): the within-cluster sum of squared distances:

J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - \mu_j \|^2

This formula represents the total within-cluster variance — the quantity K-Means tries to minimize. Here’s what each term means:
- J: Total clustering cost (what K-Means minimizes).
- k: Number of clusters.
- C_j: Set of points assigned to cluster j.
- x_i: A data point.
- \mu_j: Centroid (mean) of cluster j.
- \| x_i - \mu_j \|^2: Squared distance between a point and its centroid.
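To make these steps and this objective concrete, here is a minimal NumPy sketch of the K-Means loop. The `kmeans` function is my own illustrative helper, not a library API; in practice you would use scikit-learn's KMeans.

```python
import numpy as np

def kmeans(X, k=3, n_iters=100, seed=0):
    """Minimal K-Means: assign points to the nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # assignments stopped changing
            break
        centroids = new_centroids
    return labels, centroids

# Toy example: three Gaussian blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, k=3)
# Total cost J: sum of squared distances of points to their assigned centroids
cost = sum(np.sum((X[labels == j] - centroids[j]) ** 2) for j in range(3))
print("Cluster sizes:", np.bincount(labels), "| cost J:", round(cost, 2))
```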
PCA (intuitive)
- Find a new coordinate system where the first axis (PC1) captures the most variance, PC2 the next most (and is orthogonal to PC1), and so on.
- You can project high-dimensional data onto the first few principal components for visualization or to reduce noise.
The explained variance ratio tells you how much of the total variance in the data is captured by each principal component: for component i it is \lambda_i / \sum_j \lambda_j, where \lambda_i is the variance along PCi. It is commonly used in PCA (Principal Component Analysis) to understand the importance of each component.
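The interpretation below refers to a clustering plot and a silhouette score. Here is a minimal sketch that produces that kind of output, assuming scikit-learn and matplotlib, with k=3 chosen because Iris has three species:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Load and scale the Iris features (scaling matters for both PCA and K-Means)
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Reduce to 2 principal components for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Cluster with K-Means (k=3, since Iris has three species)
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)
print("Silhouette score:", silhouette_score(X_scaled, labels))

# Plot clusters in PCA space; red Xs mark the centroids projected into 2-D
centroids_2d = pca.transform(kmeans.cluster_centers_)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="viridis", s=30)
plt.scatter(centroids_2d[:, 0], centroids_2d[:, 1], c="red", marker="X", s=200)
plt.xlabel("PC1"); plt.ylabel("PC2"); plt.title("K-Means clusters on Iris (PCA projection)")
plt.show()
```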
Explanation of results (how to interpret)
- Clusters: Points with the same color belong to the same cluster assigned by KMeans.
- Centroids: The red Xs are the cluster centers; each represents the "average" member of its cluster.
- Silhouette score: Gives a numeric sense of clustering quality. For Iris, you usually get a moderately good score since the species are somewhat separable.
Short notes on Hierarchical and DBSCAN (intuition)
Hierarchical clustering
- Build a tree of clusters (dendrogram).
- Good for small datasets and when you want multi-scale cluster views.
- You can "cut" the tree to get a chosen number of clusters.
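A minimal sketch of this idea using SciPy, assuming a small synthetic 2-D dataset: it builds the merge tree with Ward linkage, "cuts" it into two clusters, and draws the dendrogram.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Small synthetic dataset: two loose groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(6, 1, (10, 2))])

# Build the merge tree with Ward linkage (minimizes within-cluster variance)
Z = linkage(X, method="ward")

# "Cut" the tree to obtain a chosen number of clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print("Cluster labels:", labels)

# Visualize the dendrogram
dendrogram(Z)
plt.title("Hierarchical clustering dendrogram")
plt.show()
```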
DBSCAN
- Parameters: eps (neighborhood radius) and min_samples.
- Dense regions (core points) form clusters; low-density points are labeled noise.
- Great for clusters with weird shapes and automatic outlier detection.
- Not good with widely varying densities or very high dimensions.
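A short illustrative sketch with scikit-learn's DBSCAN on the two-moons toy dataset. The eps and min_samples values here are assumptions chosen for this particular data, not universal defaults:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaved half-moons: a shape K-Means handles poorly but DBSCAN handles well
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

# eps: neighborhood radius; min_samples: points needed to form a dense (core) region
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_  # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Estimated clusters: {n_clusters}, noise points: {n_noise}")
```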
t-SNE (very short overview)
- t-SNE is a nonlinear projection for visualization (keeps local neighbourhoods intact).
- Good for visualizing clusters in high-dimensional data, but:
  - It’s stochastic (use random_state).
  - It doesn’t preserve global distances well.
  - Use it only for visualization (not as a general dimensionality reduction for modelling).
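A brief sketch using scikit-learn's TSNE on the digits dataset, for visualization only. The perplexity value is an assumption, and random_state is fixed because t-SNE is stochastic:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional digit images projected to 2-D for visualization only
X, y = load_digits(return_X_y=True)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=10)
plt.title("t-SNE projection of the digits dataset")
plt.show()
```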
Common challenges & how to overcome them
- Choosing the right number of clusters (k)
  - Use the elbow method, silhouette score, or domain knowledge (see the sketch after this list).
- Feature scaling
  - Always scale numeric features before KMeans and PCA.
- Outliers influence KMeans
  - Use robust methods (DBSCAN) or remove/clip outliers beforehand.
- Cluster evaluation
  - No ground truth: use silhouette, Davies-Bouldin, or compare to business metrics.
- High dimensionality
  - Use PCA/UMAP to reduce dimensionality before clustering.
- Interpretability
  - Summarize clusters with representative examples or feature means.
- Different data types
  - For categorical features, use appropriate encodings or distance measures (K-Prototypes, Gower distance).
- Local optima / initialization
  - For KMeans, use multiple n_init runs and a good init (like 'k-means++').
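As referenced in the first item above, here is a small sketch of the elbow method and silhouette score for choosing k, using the scaled Iris data as an illustrative example:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

inertias, silhouettes = [], []
ks = range(2, 9)
for k in ks:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                          # within-cluster cost (elbow method)
    silhouettes.append(silhouette_score(X, km.labels_))   # higher is better

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(list(ks), inertias, marker="o"); ax1.set_title("Elbow method"); ax1.set_xlabel("k")
ax2.plot(list(ks), silhouettes, marker="o"); ax2.set_title("Silhouette score"); ax2.set_xlabel("k")
plt.show()
```

Look for the "elbow" where the cost stops dropping sharply, and for the k with the highest silhouette score; when the two disagree, domain knowledge should break the tie.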
Real-world applications (simple examples)
- Marketing: Segment customers for targeted campaigns (group by purchase patterns).
- E-commerce: Product clustering for recommendations (group similar products).
- Healthcare: Group patients by symptoms or gene expression to find subtypes.
- Finance: Detect anomalous transactions (fraud).
- Cybersecurity: Identify unusual login patterns or scans as anomalies.
- Manufacturing: Monitor sensor streams and detect equipment anomalies.
- NLP: Topic modeling and document clustering (group similar articles).
- Astronomy: Group stars/galaxies by spectral properties.
Summary & takeaways
- Unsupervised learning discovers structure in unlabeled data: clusters, low-dimensional structure, and anomalies.
- Clustering (K-Means, Hierarchical, DBSCAN) organizes data into groups — pick a method by data shape, size, and noise.
- Dimensionality reduction (PCA, t-SNE) helps visualization and reduces noise; PCA is linear and interpretable, t-SNE is for visualization only.
- Preprocessing matters: scale numeric data, handle categorical features appropriately.
- Evaluation is harder than in supervised learning — rely on silhouette, domain knowledge, and qualitative checks.
- Start simple: try PCA + K-Means, visualize clusters, then iterate with more advanced techniques (DBSCAN, UMAP, deep clustering).
If you found this post useful:
⮕ Share it with others who might benefit.
⮕ Leave a comment with your thoughts or questions—I’d love to hear from you.
⮕ Follow/Subscribe to the blog for more helpful guides, tips, and insights.