Unsupervised Machine Learning
Unsupervised Learning Overview
Unsupervised learning is a type of machine learning where the data does not come with predefined labels. The algorithm tries to identify hidden patterns or groupings in the data on its own. It is commonly used for clustering, dimensionality reduction, anomaly detection, and association rule learning.
1. K-Means Clustering
K-Means is a method that partitions data into k distinct clusters based on similarity. The algorithm initializes k centroids (typically at random), assigns each data point to its nearest centroid, then recomputes each centroid as the mean of the points assigned to it, repeating until the assignments stop changing.
- Use case: Customer segmentation, image compression, document clustering.
- Key point: You need to predefine the number of clusters (k).
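As a rough sketch of the fit-and-assign workflow (assuming scikit-learn, with synthetic blob data and k=3 as purely illustrative choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 3 well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k must be chosen up front; here we assume k=3.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # learned centroids (cluster means)
print(labels[:10])              # cluster assignment per data point
```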
2. Hierarchical Clustering
This technique builds a tree of clusters called a dendrogram. It starts either by treating each data point as its own cluster and merging them step by step (agglomerative) or by treating the entire dataset as one cluster and splitting it recursively (divisive).
- Use case: Gene expression analysis, social network grouping.
- Key point: Does not require a fixed number of clusters upfront.
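A minimal agglomerative sketch, assuming SciPy's hierarchy module and Ward linkage as illustrative choices; because the full merge tree is kept, the flat clustering can be cut at any level after the fact:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Build the merge tree (Ward linkage merges clusters to minimize
# the increase in within-cluster variance at each step).
Z = linkage(X, method="ward")

# Cut the tree into 3 flat clusters after the fact; any other cut works too.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree with matplotlib.
```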
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN groups together points that are closely packed and labels points in low-density areas as outliers. It's great for discovering clusters of arbitrary shape.
- Use case: Geospatial data clustering, fraud detection.
- Key point: Does not require the number of clusters to be specified and labels low-density points as noise, though it does require choosing the eps and min_samples parameters.
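A short sketch with scikit-learn's DBSCAN on two interleaving half-moons, a shape K-Means handles poorly; the eps and min_samples values are illustrative and usually need tuning for real data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# Points labelled -1 are treated as noise/outliers.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:", list(labels).count(-1))
```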
4. Principal Component Analysis (PCA)
PCA is a technique used for reducing the number of dimensions in your data while retaining the most important features. It transforms the data into a new coordinate system such that the greatest variance lies along the first axis, the second greatest along the second axis, and so on.
- Use case: Data compression, visualization, noise reduction.
- Key point: A linear method; it captures directions of maximum variance but cannot model non-linear structure.
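For example, the 64-dimensional digit images bundled with scikit-learn (used here only as a convenient dataset) can be projected onto their first two principal components:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # shape (1797, 64)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (1797, 2)
print(pca.explained_variance_ratio_)  # variance captured by each new axis
```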
5. t-SNE (t-Distributed Stochastic Neighbor Embedding)
t-SNE is used for visualizing high-dimensional data in two or three dimensions. It captures non-linear relationships and reveals clusters more clearly than PCA in many cases.
- Use case: Visualizing word embeddings, image features.
- Key point: Great for visualizations, but not for general-purpose dimensionality reduction.
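A minimal sketch embedding the same digit images into 2-D with scikit-learn's t-SNE; perplexity=30 is an assumed, illustrative value, and the layout changes noticeably with it:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)  # (1797, 2) -- typically scatter-plotted and colored by y
```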
6. Autoencoders
Autoencoders are a type of neural network trained to reconstruct their input. The middle (bottleneck) layer learns a compressed version of the data. They are powerful for unsupervised feature learning and dimensionality reduction.
- Use case: Anomaly detection, image denoising, feature extraction.
- Key point: Works well with non-linear data and large datasets.
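A minimal dense autoencoder sketch, assuming Keras as the framework and a 32-unit bottleneck as an illustrative compression of 784-dimensional MNIST inputs:

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(784,))
encoded = layers.Dense(32, activation="relu")(inputs)       # bottleneck code
decoded = layers.Dense(784, activation="sigmoid")(encoded)  # reconstruction

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# Train the network to reproduce its own input -- no labels needed.
(x_train, _), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
autoencoder.fit(x_train, x_train, epochs=5, batch_size=256)

# The encoder half yields the compressed representation for downstream use.
encoder = keras.Model(inputs, encoded)
codes = encoder.predict(x_train[:10])  # shape (10, 32)
```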