W1 - Unsupervised learning
This week, you will learn two key unsupervised learning algorithms: clustering and anomaly detection.
Learning Objectives
- Implement the k-means clustering algorithm
- Implement the k-means optimization objective
- Initialize the k-means algorithm
- Choose the number of clusters for the k-means algorithm
- Implement an anomaly detection system
- Decide when to use supervised learning vs. anomaly detection
- Implement the centroid update function in k-means
- Implement the function that finds the closest centroids to each point in k-means
Welcome!
Clustering
What is clustering?
A clustering algorithm looks at a number of data points and automatically finds data points that are related or similar to each other.
In supervised learning, the dataset included both the inputs x as well as the target outputs y.
In contrast, in unsupervised learning, you are given a dataset but not the target labels y.
Instead, we ask the algorithm to find something interesting about the data. In particular, a clustering algorithm looks at whether the dataset can be grouped into clusters, meaning groups of points that are similar to each other.
Examples of applications
K-means intuition
- STEP 0: randomly pick two points, shown here as a red cross and a blue cross, as initial guesses for where the centers of two different clusters might be.
- STEP 1: go through each of the points and check whether it is closer to the red cross or to the blue cross.
- STEP 2: move the red cross to the average location of the red points (that is, the red cluster centroid moves there), and do the same for the blue cross.
Keep repeating these two steps until there are no more changes.
K-means algorithm
When you implement this algorithm, you'll find that it's actually a little more convenient to minimize the squared distance, because the cluster centroid with the smallest squared distance is the same as the cluster centroid with the smallest distance.
Each of these x values is a vector with n numbers in it if you have n features.
If a cluster has zero training examples assigned to it:
- the most common thing to do is to just eliminate that cluster
- an alternative would be to just randomly reinitialize that cluster centroid
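As a concrete reference, here is a minimal NumPy sketch of the two steps. The function names `find_closest_centroids` and `compute_centroids` mirror the learning objectives above but are otherwise illustrative; an empty cluster is handled by random reinitialization, as mentioned in the list above.

```python
import numpy as np

def find_closest_centroids(X, centroids):
    """Step 1: assign each point to the index of its nearest centroid."""
    # dists has shape (m, K): squared distance from each point to each centroid
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def compute_centroids(X, idx, K):
    """Step 2: move each centroid to the mean of the points assigned to it."""
    centroids = np.zeros((K, X.shape[1]))
    for k in range(K):
        points = X[idx == k]
        if len(points) > 0:
            centroids[k] = points.mean(axis=0)
        else:
            # empty cluster: reinitialize it at a random training example
            centroids[k] = X[np.random.randint(len(X))]
    return centroids

def run_kmeans(X, initial_centroids, max_iters=10):
    """Alternate the two steps for a fixed number of iterations."""
    centroids = initial_centroids
    for _ in range(max_iters):
        idx = find_closest_centroids(X, centroids)
        centroids = compute_centroids(X, idx, K=len(centroids))
    return centroids, idx
```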
K-means is also frequently applied to data sets where the clusters are not that well separated
Optimization objective
The cost function J is a function of:
- $c^{(1)}$, …, $c^{(m)}$: the index of the cluster to which each point $x^{(i)}$ is assigned
- $\mu_1$, …, $\mu_K$: the locations of the cluster centroids
The cost function is the average squared distance between every training example $x^{(i)}$ and $\mu_{c^{(i)}}$, the location of the cluster centroid to which that training example has been assigned.
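Written out with the notation above, the distortion is:

$$J\big(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K\big) = \frac{1}{m} \sum_{i=1}^{m} \big\lVert x^{(i)} - \mu_{c^{(i)}} \big\rVert^2$$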
The K-means algorithm tries to find both the assignments of points to cluster centroids and the locations of the cluster centroids that minimize the squared distance.
And what it does on every step is try to update
- the cluster assignments
- the positions of the cluster centroids, in order to keep on reducing this cost function J
By the way, this cost function J is called the distortion function
Let’s now take a deeper look at the algorithm
- The first part updates $c^{(1)}$ through $c^{(m)}$ to minimize the cost function J as much as possible while holding $\mu_1$, …, $\mu_K$ fixed.
- The second part holds $c^{(1)}$ through $c^{(m)}$ fixed and updates $\mu_1$, …, $\mu_K$ to minimize the cost function, or the distortion, as much as possible.
How moving the centroid decreases the average squared distance
The distortion cost function should go down or stay the same in the worst case (if it goes up, that means there’s a bug in the code)
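One way to use that property as a sanity check, reusing the hypothetical helpers from the sketch above (note that the random reinitialization of an empty cluster can occasionally break the guarantee):

```python
import numpy as np

def distortion(X, idx, centroids):
    """Average squared distance between each point and its assigned centroid (the cost J)."""
    return ((X - centroids[idx]) ** 2).sum(axis=1).mean()

X = np.random.randn(300, 2)                        # synthetic data for illustration
centroids = X[np.random.choice(len(X), 3, replace=False)]
prev_J = np.inf
for _ in range(10):
    idx = find_closest_centroids(X, centroids)     # from the earlier sketch
    centroids = compute_centroids(X, idx, K=3)
    J = distortion(X, idx, centroids)
    assert J <= prev_J + 1e-12, "distortion increased: there is a bug"
    prev_J = J
```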
Initializing K-means
The very first step of the K-means clustering algorithm was to choose random locations as the initial guesses for the cluster centroids.
Depending on how you choose the initial random centroids of the clusters, you can end up with different clusterings. With a less fortunate choice of random initialization, the algorithm gets stuck in a local minimum.
The solution is to run it multiple times and then pick the best local optimum (the run with the lowest distortion).
More formally, here is the algorithm for running K-means multiple times using different random initializations:
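A sketch of that procedure, built on the hypothetical helpers above: initialize the centroids at K distinct, randomly chosen training examples, run K-means many times, and keep the run with the lowest distortion.

```python
def kmeans_best_of(X, K, num_runs=100, max_iters=10):
    """Run k-means num_runs times and keep the lowest-distortion result."""
    best_J, best = np.inf, None
    for _ in range(num_runs):
        # pick K distinct training examples as the initial centroids
        init = X[np.random.choice(len(X), K, replace=False)]
        centroids, idx = run_kmeans(X, init, max_iters)
        J = distortion(X, idx, centroids)
        if J < best_J:
            best_J, best = J, (centroids, idx)
    return best
```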
Choosing the number of clusters
There are lots of applications where the data itself does not give a clear indicator for how many clusters there are in it.
If you look at the academic literature on K-means, there are a few techniques for automatically choosing the number of clusters to use for a given application. One way to try to choose the value of K is called the elbow method (not one that Andrew uses personally).
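For reference, a minimal sketch of the elbow method, reusing the hypothetical helpers from the K-means sketches above: run K-means for a range of K values and plot the resulting distortion J, looking for the "elbow" where the curve stops decreasing rapidly.

```python
import matplotlib.pyplot as plt

# reuses X, kmeans_best_of, and distortion from the earlier sketches
Ks = range(1, 9)
Js = []
for K in Ks:
    centroids, idx = kmeans_best_of(X, K, num_runs=20)
    Js.append(distortion(X, idx, centroids))

plt.plot(list(Ks), Js, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("distortion J")
plt.show()
```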
For a lot of applications, the right number of clusters is genuinely ambiguous, because the cost function often just decreases smoothly as K grows and doesn't have a clear elbow.
Both of these are completely valid and completely fine groupings of the data into clusters, but whether you want to use three clusters or five clusters can now be decided based on what makes sense for your t-shirt business.
Anomaly detection
Finding unusual events
The most common way to carry out anomaly detection is through a technique called density estimation.
Anomaly detection examples
Gaussian (normal) distribution
This curve shows p(x), the probability (density) of x. It is known as the bell-shaped curve.
If you're wondering what p(x) really means, here's one way to interpret it: if you draw 100 numbers from this probability distribution and plot a histogram like the one below, and then go from 100 to an infinite number of examples, the histogram ends up matching this bell-shaped curve.
Probabilities always have to sum to one, which is why the area under the curve is always equal to one, and why the Gaussian distribution has to become taller when it becomes skinnier.
How to calculate the mean $\mu$, the variance $\sigma^2$, and the standard deviation $\sigma$
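For each feature $x_j$, the lectures estimate these as (note the $\frac{1}{m}$ rather than $\frac{1}{m-1}$):

$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)} \qquad \sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \big(x_j^{(i)} - \mu_j\big)^2$$

and the standard deviation is $\sigma_j = \sqrt{\sigma_j^2}$.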
Anomaly detection algorithm
We calculate p(x) as $p(x_1) \times p(x_2) \times \dots \times p(x_n)$. If you've taken an advanced class in probability and statistics before, you may recognize that this equation corresponds to assuming that the features $x_1$, $x_2$, …, $x_n$ are statistically independent. But this algorithm often works fine even if the features are not actually statistically independent.
If one or more of the features is either very large or very small relative to what was seen in the training set, then $p(x_j)$ is very small, and so the product is very small.
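A minimal NumPy sketch of this density estimation, assuming one row per (normal) training example and one column per feature; the variable names and the threshold value are illustrative:

```python
import numpy as np

def estimate_gaussian(X):
    """Per-feature mean and variance (using 1/m, as in the lectures)."""
    mu = X.mean(axis=0)
    var = ((X - mu) ** 2).mean(axis=0)
    return mu, var

def p(X, mu, var):
    """p(x) = product over the n features of the univariate Gaussian densities."""
    densities = np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return densities.prod(axis=-1)

# usage on hypothetical data: flag examples with low probability
X_train = np.random.randn(500, 2)    # stand-in for normal examples
mu, var = estimate_gaussian(X_train)
epsilon = 0.02                       # threshold, tuned on the cross-validation set
anomalies = p(X_train, mu, var) < epsilon
```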
Examples
Developing and evaluating an anomaly detection system
When you are developing a learning algorithm, making decisions on choosing different features or trying different values of the parameters is much easier if you have a way of evaluating the learning algorithm: computing a number that tells you whether the algorithm got better or worse.
Note that this is an unsupervised algorithm, because you train the model without anomaly labels (yes/no). But having a small number of labeled anomalies for evaluating and tuning the algorithm is helpful.
The cross-validation set is used to tune $\epsilon$ and to choose which features $x_j$ are needed.
Because the number of flawed engines is so small, there's one other alternative: don't use a test set, and keep only a training set and a cross-validation set.
Associated algorithm:
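A hedged sketch of one common version of that evaluation loop (the function name and the 1000-step sweep are illustrative): sweep $\epsilon$ over candidate values and keep the one with the best F1 score on the labeled cross-validation set, where `y_cv` is 1 for anomalies and 0 otherwise.

```python
import numpy as np

def select_threshold(p_cv, y_cv):
    """Pick the epsilon with the best F1 score on the cross-validation set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        preds = p_cv < eps                        # predicted anomalies
        tp = np.sum((preds == 1) & (y_cv == 1))   # true positives
        fp = np.sum((preds == 1) & (y_cv == 0))   # false positives
        fn = np.sum((preds == 0) & (y_cv == 1))   # false negatives
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```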
Anomaly detection vs. supervised learning
When should you use anomaly detection and when should you use supervised learning? The decision is actually quite subtle in some applications.
Other examples
Choosing what features to use
Carefully choosing the features is even more important for anomaly detection than for supervised learning approaches.
For a non-Gaussian distribution, try transforming the feature so that it becomes more Gaussian (for example $\log(x + C)$). In practice, try different values of the constant C and take a look at each result to pick the one that makes the distribution look the most Gaussian.
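A sketch of that exploration, assuming a log transform $\log(x + c)$ where c plays the role of the constant C above; the data here is a synthetic stand-in for a skewed feature:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.random.exponential(size=1000)   # synthetic stand-in for a skewed feature
candidates = [0.001, 0.1, 1.0, 10.0]
fig, axes = plt.subplots(1, len(candidates), figsize=(12, 3))
for ax, c in zip(axes, candidates):
    ax.hist(np.log(x + c), bins=50)    # try a log transform with shift c
    ax.set_title(f"log(x + {c})")
plt.show()
```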
The process is to train the model and then see which anomalies in the cross-validation set the algorithm fails to detect, and then to look at those examples to see if they can inspire the creation of new features that would allow the algorithm to spot them.
Another example of creating a new feature as a combination of two existing ones:
- high CPU load is not abnormal
- low network traffic is not abnormal
- but high CPU load combined with low network traffic may be abnormal
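One way to encode that combination as a single new feature, following the ratio idea; the variable names are hypothetical:

```python
# large when CPU load is high while network traffic is low
x_new = cpu_load / network_traffic   # hypothetical per-machine feature arrays
```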