What is unsupervised learning, and why?

Definition: To learn a function $f(x)$ merely from $x$.

In supervised learning, we learn a function $f$ from pairs of $x$ and $y$. In unsupervised learning, we only observe $x$, without $y$. Therefore, if $f$ is a regression or classification model, learning it is more difficult in the unsupervised setting.

This situation is common in practice. We can often collect a lot of $x$ easily, but we lack the corresponding labels. For example, suppose we want to build a model that classifies dogs by breed from their images. Collecting $x$ is easy: we can search “dog” in an image search engine and retrieve millions of images. However, only a few of those images come with labels indicating the breed, and getting $y$ for all $x$ requires costly manual labeling. In such situations, unsupervised learning is more applicable.

Typically, $f$ in unsupervised learning is more about learning a representation of $x$. The idea is to discover the patterns in $x$ by observing $x$ alone (without $y$). The discovered patterns (denoted $z$) can be useful and insightful in themselves, and they can also help further tasks (“downstream tasks”). For example, we can run a supervised learning task (regression or classification) with $z$ as its input (i.e., $y=f(z)$), and the resulting model can outperform a model learned directly from the raw input (i.e., $f$ learned from $y=f(x)$).
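To make the downstream-task idea concrete, here is a minimal Python sketch. It assumes scikit-learn is available and uses PCA as a stand-in representation learner (PCA is just one illustrative choice, not a method prescribed by this text): the representation $z$ is learned from $x$ alone, and a classifier is then trained on $z$.

```python
# A minimal sketch: learn a representation z from x alone (unsupervised),
# then train a supervised classifier on z.
# Assumes scikit-learn; PCA stands in for any representation learner.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))             # plenty of unlabeled x
y = (X[:, :5].sum(axis=1) > 0).astype(int)  # toy labels for illustration

# Step 1 (unsupervised): fit the representation on all x; no y is needed.
pca = PCA(n_components=10).fit(X)
Z = pca.transform(X)

# Step 2 (supervised): train a classifier with z as its input, y = f(z).
Z_train, Z_test, y_train, y_test = train_test_split(Z, y, random_state=0)
clf = LogisticRegression().fit(Z_train, y_train)
print("accuracy on z:", clf.score(Z_test, y_test))
```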

Classic models of unsupervised learning

Here we introduce the most classic models of unsupervised learning:

1. Clustering and K-means

Clustering is about classifying inputs $x$ without $y$:

Clustering: to learn $\hat{y} = f(x)$, given $x$

Since the model cannot learn from labels $y$, it must learn from the patterns (or structure) of $x$. Put simply, we observe the population of $x$ and identify the groups within it.

Identifying a group (a so-called “cluster”) amounts to finding samples of $x$ that are close to each other, which is why the method is named “clustering”. The basic concept is therefore “distance”, which measures the proximity between two samples of $x$. There are many different definitions of distance (see pages 7-16 of these slides for further details about distance metrics).
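As a quick illustration (my own, not from the slides), here is a small Python sketch, assuming NumPy and SciPy are available, showing that different distance definitions give different numbers for the same pair of points:

```python
# The same pair of points measures differently under different metrics.
# Assumes NumPy and SciPy.
import numpy as np
from scipy.spatial import distance

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print("Euclidean (L2):", distance.euclidean(a, b))  # 5.0
print("Manhattan (L1):", distance.cityblock(a, b))  # 7.0
print("Chebyshev:", distance.chebyshev(a, b))       # 4.0
```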

K-means is probably the most famous clustering method. Let’s have a look at how it works.

In K-means, we use the most common distance: the Euclidean distance (also called the “L2 distance”). Given the whole population $X$, for two samples $x_1$ and $x_2$ from $X$, the Euclidean distance between them is defined as:

$$ D(x_1,x_2)=\sqrt{\sum_{i=1}^n({x_1^i - x_2^i})^2} $$

Here $n$ is the number of dimensions of $x$. If $n=2$, the distance is the length of the straight-line path between $x_1$ and $x_2$:

[Figure: the Euclidean distance between two points in 2D is the length of the straight line connecting them.]
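The formula translates directly into code. Here is a minimal NumPy sketch (my own illustration) that implements the definition above and checks it against NumPy’s built-in norm:

```python
# Euclidean (L2) distance, implemented exactly as the formula above.
import numpy as np

def euclidean_distance(x1, x2):
    """D(x1, x2) = sqrt(sum_i (x1_i - x2_i)^2)."""
    x1, x2 = np.asarray(x1), np.asarray(x2)
    return np.sqrt(np.sum((x1 - x2) ** 2))

x1 = np.array([1.0, 2.0])
x2 = np.array([4.0, 6.0])
print(euclidean_distance(x1, x2))  # 5.0
print(np.linalg.norm(x1 - x2))     # same result via NumPy's norm
```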

With this definition of distance, K-means works in the following steps:

  1. Selecting the number of clusters $K$.

  2. Randomly picking $K$ samples from $X$ as the initial centers of the clusters. These centers are named “centroids”.