Session 1 summary: Fundamentals of machine learning


Ranjita Poudel

Sanjoy Dasgupta presented the first session of the Machine Learning Virtual Conference, “Fundamentals of Machine Learning”. He is a professor of computer science and engineering at the University of California, San Diego. He obtained his BA from Harvard University in 1993 and a PhD from the University of California, Berkeley. He works on algorithmic statistics, with a particular focus on unsupervised and minimally supervised learning, and is a coauthor of the textbook Algorithms. He was program co-chair of the Conference on Learning Theory in 2009 and of the International Conference on Machine Learning in 2013.
Machine learning is a method of data analysis that automates analytical model building. Although machine learning encompasses a broad range of activities, one of its most important objectives is to make predictions. The speaker went through a detailed example of a prediction problem and then gave an overview of the entire field.
One of the most popular prediction problems is the visual recognition of handwritten digits: given a picture of a handwritten digit, figure out which digit it is. Because handwriting is noisy, extracting this information from visual features alone is difficult. The machine learning approach instead pulls in a large dataset of handwritten digit images, each labelled with its corresponding digit, and learns to recognize the pattern.
The speaker used the example of the MNIST dataset, which consists of a collection of 60,000 training images, 6,000 of each digit. To solve this digit prediction problem, two sets of data are necessary. The first is the training set, which is used to learn a classifier (i.e., a function that takes an image and outputs the digit it thinks that image corresponds to). In addition to the training set, a test set of 10,000 additional labelled images is available to assess how good the classifier really is. The simplest type of classifier is the nearest-neighbor classifier: to classify a new image, it looks through the entire training set of 60,000 points, finds the one closest to the new image, and outputs the label of that nearest neighbor.
The nearest neighbor is found by computing the distance between the vectors that represent the images. Each image in the MNIST dataset is 28×28 pixels. An image is first stretched out into a vector of 784 coordinates: the first row of 28 pixels, then the second row, and so on, until a 784-dimensional (28×28) vector is formed. The distance between any two such points x and z is then the Euclidean distance, computed via the Pythagorean theorem: take the difference between the vectors along each of the 784 coordinates, square these differences, sum them, and take the square root. Once this Euclidean distance is available for the vector representations, the accuracy of the classifier can be assessed.
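As an illustration of this computation, here is a minimal sketch in Python with NumPy; the two images are random placeholders standing in for real MNIST digits, so the names and data are assumptions for illustration only:

```python
import numpy as np

# Placeholder 28x28 grayscale images standing in for two MNIST digits.
x_img = np.random.rand(28, 28)
z_img = np.random.rand(28, 28)

# Stretch each image out row by row into a 784-dimensional vector.
x = x_img.reshape(784)
z = z_img.reshape(784)

# Euclidean distance via the Pythagorean theorem: difference along each
# of the 784 coordinates, squared, summed, then square-rooted.
dist = np.sqrt(np.sum((x - z) ** 2))
print(dist)
```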
To assess how good the classifier is, one could evaluate its error rate on the training set (i.e., how many of the training points it gets wrong). However, every training point's nearest neighbor is itself, so the training error rate is zero. Training error is thus not a good measure of future performance; it is far too optimistic. The error rate on the test set is a better estimate of future performance. In the particular example the speaker presented, the test error rate is 3.09%, i.e., the classifier got 309 wrong out of 10,000 samples. Several techniques are available to reduce this error rate further.
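Putting the pieces together, here is a sketch of the nearest-neighbor classifier and its test error rate, assuming the MNIST arrays have already been loaded into NumPy; the array names and shapes below are assumptions, not part of the talk:

```python
import numpy as np

def nearest_neighbor_predict(train_x, train_y, query):
    """Label of the training point closest to `query` (1-NN)."""
    # Squared Euclidean distance to every training point at once.
    dists = np.sum((train_x - query) ** 2, axis=1)
    return train_y[np.argmin(dists)]

def test_error_rate(train_x, train_y, test_x, test_y):
    """Fraction of test points the classifier gets wrong."""
    wrong = sum(
        nearest_neighbor_predict(train_x, train_y, q) != y
        for q, y in zip(test_x, test_y)
    )
    return wrong / len(test_y)

# Assumed shapes: train_x (60000, 784), train_y (60000,),
# test_x (10000, 784), test_y (10000,). On real MNIST this
# procedure gives roughly the 3.09% test error quoted above.
```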
After the prediction problem, the speaker gave an overview of the field, briefly describing its three categories: supervised learning, unsupervised learning, and interactive learning.

  1. Supervised learning: This method is used to solve prediction problems such as labelling handwritten digits. In any prediction problem there is an input ‘X’ and an output ‘Y’. For example, ‘X’ could be an input image of an animal and ‘Y’ the animal's name. The learning algorithm is given a set of training examples in which a human has supplied the labels (e.g., if the picture is of a bear, then the output is “bear”). Based on these examples, the machine learns a mapping from ‘X’ to ‘Y’, so that given an image ‘X’ it outputs an animal name ‘Y’. The algorithm does this essentially by picking the mapping that performs well on the training set. The speaker briefly discussed three kinds of output in supervised learning (a short code sketch follows this list):
    a) Discrete: When there are two possible outputs, the problem is called binary classification. One example of binary classification is spam detection, in which each ‘X’ is an email message and the corresponding ‘Y’ is either “spam” or “not spam”. The other variation is multiclass classification, in which each ‘X’ takes one of multiple, but finitely many, labels ‘Y’. For example, in news categorization, ‘X’ is a news article and ‘Y’ could be a label such as ‘politics’, ‘business’, ‘sports’, and so on. Finally, a more complex discrete example is grammatical parsing, in which ‘X’ is a sentence and ‘Y’ is an entire parse tree of the sentence; the output space is discrete but very complex.
    b) Continuous: When the output can take continuous values, the problem is called regression. For example, to measure a state's air pollution, an air quality index is predicted (the index is a continuous value on a scale). Another example the speaker presented is the prediction of a person's life expectancy (a continuous value) based on age, sex, and other known attributes.
    c) Probability values: Here the output is a number that lies between 0 and 1. An important application of this type is credit card fraud detection, in which the input ‘X’ is the details of a transaction and the output ‘Y’ is the probability that the transaction is fraudulent.
    Apart from these examples, several methods are used in supervised learning: the nearest-neighbor method; linear methods such as linear regression, logistic regression, and support vector machines; and nonlinear methods such as kernel methods, decision trees, and neural networks.
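As a small worked example of supervised learning, here is a sketch using scikit-learn's logistic regression (one of the linear methods named above) on synthetic data; it produces both a discrete class label and a probability between 0 and 1, as in the fraud-detection example. The data and the rule that generates the labels are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic binary data: 200 points 'X' with 2 features each; the
# label 'Y' comes from a simple made-up rule.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

query = np.array([[0.5, -0.2]])
print(clf.predict(query))        # discrete output: class 0 or 1
print(clf.predict_proba(query))  # probability of each class, in [0, 1]
```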

  2. Unsupervised learning: This category of machine learning applies when we need to find a good representation of the available data. In unsupervised learning there is no ‘Y’ to be predicted, so the primary goal is to look for structure in the data. The structure can take different forms, such as clusters or linear subspaces.
    i) Clustering is one of the most widely used techniques in unsupervised learning. The given dataset is partitioned into groups of similar points. In typical clustering methods such as k-means or EM, the desired number of clusters is provided along with the input dataset (see the sketch after this list). Choosing that number requires trial and error, since cluster structure is present at multiple scales, or levels of granularity. Due to this limitation, a popular alternative is hierarchical clustering, in which the hierarchy is specified by a tree with a data point at each leaf. Each subtree hanging from a node is a cluster, so there are clusters at any desired level of detail. This technique is widely used in gene expression analysis, where clusters of co-expressed genes can be treated as a single unit.
    ii) Beyond clustering, another method of unsupervised learning is principal component analysis, one of the oldest procedures in statistics. This approach reduces the dimension of the dataset by projecting it onto a lower-dimensional subspace, compressing the data into fewer dimensions. One popular use of dimensionality reduction is to visualize high-dimensional data, for which the dimension has to be reduced to 2 or 3; popular methods for doing this include t-SNE and multidimensional scaling. However, it is also sometimes helpful to increase the dimension of the data, in order to unpack and expose the information it contains. This approach is called sparse coding: each point is mapped to a vector of possibly higher dimension that is sparse (i.e., has only a few nonzero coordinates).
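Here is a minimal sketch of k-means clustering and PCA with scikit-learn on synthetic data; the number of clusters, the number of components, and the blob structure are all illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: three Gaussian blobs in 10 dimensions.
centers = rng.normal(scale=5.0, size=(3, 10))
X = np.vstack([c + rng.normal(size=(100, 10)) for c in centers])

# k-means: the desired number of clusters must be supplied up front.
kmeans = KMeans(n_clusters=3, n_init=10).fit(X)
print(kmeans.labels_[:10])   # cluster assignment of the first 10 points

# PCA: project onto a 2-dimensional subspace for visualization.
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)            # (300, 2)
```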

  3. Interactive learning: The last category of machine learning is learning through interaction. One example, in which a scientist interacts with the machine learning algorithm as it is being built, is called “human in the loop”; this technology is still in its infancy. A more developed form of interactive learning is reinforcement learning, which provides an algorithmic model of an agent wandering around an environment, taking actions so as to maximize the reward it receives from the environment. Reinforcement learning has endless applications, and the model has been applied to humans, animals, robots, and games.
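The talk described reinforcement learning only at a high level; as one concrete (assumed) instance of the agent-and-reward model, here is a tiny tabular Q-learning sketch on a made-up chain environment, where the agent earns a reward only at the rightmost state:

```python
import numpy as np

n_states, n_actions = 5, 2  # chain of 5 states; actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration
rng = np.random.default_rng(0)

def step(state, action):
    """Move along the chain; reward 1.0 for reaching the right end."""
    nxt = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    return nxt, (1.0 if nxt == n_states - 1 else 0.0)

for _ in range(2000):        # episodes of interaction with the environment
    state = 0
    for _ in range(20):      # steps per episode
        # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        nxt, reward = step(state, action)
        # Q-learning update: nudge toward reward plus discounted best future value.
        Q[state, action] += alpha * (reward + gamma * Q[nxt].max() - Q[state, action])
        state = nxt

print(Q)  # learned values favor moving right, toward the reward
```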
    In conclusion, the speaker gave a basic overview of the machine learning field and provided a list of references for those who are interested in learning about and implementing machine learning in their research:
    • Kevin P. Murphy. Machine Learning: A Probabilistic Perspective.
    • Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning.
    • Christopher M. Bishop. Pattern Recognition and Machine Learning.
    • David MacKay. Information Theory, Inference, and Learning Algorithms.
