Thursday, December 31, 2020

Unsupervised learning

In supervised machine learning, the input data is labeled with the correct answer and is split into two parts: training data and test data. However, imagine a situation where the data is not labeled with a correct answer. Suppose you are given the Income Tax Returns of a huge number of users and you wish to find the anomalous return filers, or outliers. Or let's say you wish to map the users into different clusters as per their demographic attributes. In such situations, data scientists opt for unsupervised machine learning.

Unsupervised machine learning (UML) helps in discovering hidden patterns in the data. For instance, assume you have images of a set of animals. A UML model can help you partition and cluster the different animals into different buckets without being explicitly trained on their attributes and features. Let us understand how this magic happens.

When a huge amount of unlabeled data is passed to the model, the model finds certain common features in the data. On the basis of these common features, all the data points are clustered, partitioned, and categorized into different sets. For example, suppose you have 1000 points with X and Y coordinates and you wish to divide them into two clusters. In a primitive setup, one can take the following 5 steps:

1.  Initialize two hypothetical (X, Y) tuples m1 and m2 as the centroids of two empty buckets C1 and C2.

2.  For each point, find its distance from the two centroids.

3.  If the distance to C1 is less than the distance to C2 then put the point in bucket C1, else put the point in bucket C2.

4.  Recalculate the mean points of the two buckets C1 and C2 and take them as new centroids m1 and m2.

5.  Repeat steps 2-4 until the values in the buckets stabilize.

In the end, you will have all the points in the data space categorized into the two buckets C1 and C2.
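To make the five steps concrete, here is a minimal sketch in plain Python for K = 2. The function name, the iteration cap, and the toy data are illustrative assumptions, not part of the recipe above.

```python
import math
import random

def two_means(points, max_iters=100):
    # Step 1: pick two points as the initial centroids m1 and m2
    # (a common choice; the steps above leave initialization open).
    m1, m2 = random.sample(points, 2)
    c1, c2 = [], []
    for _ in range(max_iters):
        c1, c2 = [], []
        # Steps 2-3: put each point in the bucket of its nearer centroid.
        for p in points:
            if math.dist(p, m1) < math.dist(p, m2):
                c1.append(p)
            else:
                c2.append(p)
        # Step 4: recompute each centroid as the mean of its bucket
        # (assumes neither bucket ends up empty).
        new_m1 = (sum(x for x, _ in c1) / len(c1), sum(y for _, y in c1) / len(c1))
        new_m2 = (sum(x for x, _ in c2) / len(c2), sum(y for _, y in c2) / len(c2))
        # Step 5: stop once the centroids (and hence the buckets) stop moving.
        if new_m1 == m1 and new_m2 == m2:
            break
        m1, m2 = new_m1, new_m2
    return c1, c2

# Toy data: two blobs of 500 points each around (0, 0) and (5, 5).
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(500)] \
       + [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(500)]
c1, c2 = two_means(points)
print(len(c1), len(c2))
```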

A fine reader can spot the devils in the details of the above five steps. There are four of them:

1.  How do you decide the value of K, the number of clusters/buckets? It could be 2, 3, or even N. (A common heuristic, the elbow method, is sketched after this list.)

2.  How do you define the distance function used to calculate the distance between data points in a complex scenario?

3.  When do you say that C1 and C2 have stabilized? How do you calculate the loss, or distortion, that has to be minimized for an accurate estimate of stabilization?

4.  Who is going to validate the results?
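The first devil has no exact answer, but the elbow method is a common heuristic: run the clustering for several candidate values of K and watch the total within-cluster distortion; pick the K where the curve bends and flattens. A minimal sketch, assuming scikit-learn is available and using synthetic data:

```python
# Elbow-method sketch: inertia (total within-cluster squared distance)
# drops sharply until K matches the true number of clusters, then flattens.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(5, 1, (500, 2))])

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # look for the "elbow" at k = 2 here
```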

One should be careful in applying unsupervised machine learning to a problem space due to the following challenges:

1.  As the data is unlabeled, one can never be sure about the accuracy or precision of the outcome.

2.  There is no definitive outcome. For instance, if you use a K value of 2, you get two clusters; if you use a K value of 5, you get five clusters. Every such algorithm has its own share of pitfalls.

3.  It requires strong domain knowledge: an expert should be able to look at the outcome and validate the results of unsupervised machine learning.

Despite the above challenges, unsupervised machine learning has proved to be pathbreaking and revolutionary in the field of data science due to its widespread applications. It can be useful in many scenarios.

1.  Clustering: It is useful for grouping the data into different clusters. For instance, looking at the ITR profiles of people, one can cluster taxpayers into compliant and non-compliant groups.

2.  Anomaly detection: It can identify outliers in the data; any anomalous input can be spotted. For instance, an unusual hike in refund claims or exemption claims can be flagged. It could also be used to detect signals from an alien planet, or intrusions in a network.

3.  Association mining: It can be used to identify sets of items that occur together in a dataset. For example, if people tend to buy bread and butter together more often than bread and spices, a shopkeeper can decide to shelve bread next to butter rather than next to spices.

4.  Dimensionality reduction: If a data point contains thousands of attributes, it is difficult to analyze and study. Dimensionality reduction reduces the number of dimensions to analyze. For instance, out of 1000 attributes/features, the 10 most important could be extracted and studied.

Some of the common algorithms/models for achieving these objectives are listed below.

Clustering: A common approach to clustering is K-means. The five-step process explained above is an example of the K-means clustering algorithm with K = 2. Other prominent approaches are DBSCAN and OPTICS.
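For comparison, here is a minimal DBSCAN sketch with scikit-learn. Unlike K-means, it needs no K up front, though the eps and min_samples values below are illustrative and must be tuned per dataset:

```python
# DBSCAN sketch: clusters dense regions and labels sparse points as noise (-1),
# without requiring the number of clusters in advance.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (200, 2)), rng.normal(5, 0.5, (200, 2))])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))  # e.g. {0, 1}, plus -1 for any noise points
```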

Anomaly detection: Isolation forests and local outlier factors are commonly used for discovering anomalous patterns.
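A minimal isolation-forest sketch with scikit-learn, on synthetic data where a few extreme points play the role of suspicious returns; the contamination rate is an assumed parameter:

```python
# Isolation-forest sketch: anomalies are easier to isolate with random splits,
# so they get shorter average path lengths and predict() labels them -1.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (995, 2)),    # ordinary points
               rng.uniform(8, 10, (5, 2))])   # a few extreme outliers

clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = clf.predict(X)                       # +1 = inlier, -1 = outlier
print((labels == -1).sum(), "points flagged as anomalous")
```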

Association mining: The Apriori and FP-growth algorithms are prominently used for mining patterns from unlabeled data.
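The core of Apriori is counting the support of itemsets. A minimal sketch of that first step in plain Python, on an illustrative basket dataset echoing the bread-butter example:

```python
# Count the support of item pairs across baskets; frequent pairs are the
# candidates for association rules (the baskets below are illustrative).
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "spices"},
    {"bread", "butter"},
    {"milk", "spices"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(baskets))
# ('bread', 'butter') appears in 3/5 baskets, suggesting they be shelved together.
```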

Dimensionality reduction: Principal component analysis (PCA) and singular value decomposition (SVD) are the most commonly used dimensionality reduction approaches.
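A minimal PCA sketch with scikit-learn, reducing an assumed 1000-feature dataset to its 10 strongest components, in the spirit of the 1000-to-10 example above:

```python
# PCA sketch: project high-dimensional data onto the few directions of
# greatest variance (the synthetic data has 10 hidden factors by design).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
latent = rng.normal(size=(500, 10))            # 10 underlying factors
mixing = rng.normal(size=(10, 1000))           # spread across 1000 features
X = latent @ mixing + 0.1 * rng.normal(size=(500, 1000))

pca = PCA(n_components=10).fit(X)              # keep the 10 strongest components
X_reduced = pca.transform(X)
print(X_reduced.shape)                         # (500, 10)
print(pca.explained_variance_ratio_.sum())     # variance retained, close to 1.0 here
```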

In recent years, unsupervised learning has been used vigorously in neural networks, which has revolutionized research and application in machine learning and artificial intelligence. However, that deserves a dedicated article of its own.
