# Scalable feature learning

## Abstract/Contents

- Abstract
- Over the past decade, machine learning has emerged as a powerful methodology that empowers autonomous decision making by learning and generalizing from examples. Thanks to machine learning, we now have software that classifies spam emails, recog- nizes faces from images, recommends movies and books. Despite this success, machine learning often requires a large amount of labeled data and significant manual feature engineering. For example, it is difficult to design algorithms that can recognize ob- jects from images as well as humans can. This difficulty is due to the fact that data are high-dimensional (a small 100x100 pixel image is often represented as 10,000 di- mensional vector) and highly-variable (due to many factors of transformations such as translation, rotation, illumination, scaling, viewpoint changes). To simplify this task, it is often necessary to construct features which are invariant to transformations. Features have become the lens that machine learning algorithms see the world. Despite its importance to machine learning and A.I., the process of construct- ing features is typically carried out by human experts and requires a great deal of knowledge and time, typically years. Even worse, these features may only work on a restricted set of problems and it can be difficult to generalize them for other do- mains. It is generally believed that automating the process of creating features is an important step to move A.I. and machine learning forward. Deep learning and unsupervised feature learning have shown great promises as methods to overcome manual feature engineering by learning features from data. However, these methods have been fundamentally limited by our computational abil- ities, and typically applied to small-sized problems. My recent work on deep learning and unsupervised feature learning has mainly focused on addressing their scalability, especially when applied to big data. In particular, my work tackles fundamental challenges when scaling up these algorithms by i) simplifying their optimization problems, ii) enabling model parallelism via sparse network connections, iii) enabling robust data parallelism by relaxing synchronization in optimization. The details of these techniques are described below Making deep learning simple: While certain classes of unsupervised feature learning algorithms, such as Independent Component Analysis (ICA), are effective in learning feature representations from data, they are difficult to optimize. To ad- dress this, we developed a simplified training method, known as RICA, by introducing a reconstruction penalty as a replacement for orthogonalization. Via a sequence of mathematical equivalences, we proved that RICA is equivalent to the original ICA op- timization problem under certain hyperparameter settings. The new approach how- ever has more freedom to learn overcomplete representations and converges faster. Our proof also shows connections between ICA and other deep learning approaches (sparse coding, RBMs, autoencoders etc.). Our algorithm, RICA, in addition to being scalable and able to learn invariant features, can be used to learn features for different domains. We have succeeded in applying the algorithms to learn features to identify activities in videos and cancers in MRI medical images. The learned representations are now state-of-the-art in both domains. Enabling model parallelism via model partitioning: A major weakness of deep learning algorithms, including RICA, is that they can be slow when applied to large problems. This is due to the fact that in standard deep learning models, every feature connects to every input dimension, e.g., every feature on a 100x100 pixel image is a 10,000 dimensional vector. This fundamental weakness hinders our understanding of the deep learning's potential when applied to real problems. To address this weakness, I have also worked on methods to scale up deep learning al- gorithms to big data. To this end, I proposed the idea of tiling local receptive fields to significantly reduce the number of parameters in a deep model. Specifically, each feature in our models only connects to a small area of the input data (local receptive fields). To further reduce the parameters, features that are far away can share the weights. Unlike convolutional models, adjacent receptive fields may not share pa- rameters. This flexibility allows the model to learn more invariant properties of the data more than just translational invariances typically achieved by full weight sharing in convolutional models. Visualization shows that tiling RICA can learn rotational, scaling and translational invariances from unlabeled data. In addition to reducing the number of parameters, local receptive fields also allow model partitioning and thus parallelism. This can be achieved by splitting the feature computations for non-overlapped areas of input data to different machines. This scheme of model partitioning (also known as model parallelism) enables the use of hundreds of machines to compute and train features. While this approach works well with hundreds of machines, scaling further can be difficult. This is because the entire system may have to wait for one slow machine and the chance of having one slow machine goes up as we use more machines. In practice, we use model partitioning in combination with asynchronous stochastic gradient descent described below. Enabling data parallelism via asynchronous SGD: I have also contributed to the development of asynchronous stochastic gradient descent (SGD) for scaling up deep learning models using thousands of machines. In detail, the previous approach of parallelizing deep learning is to train multiple model replicas (each with model partitioning as described above) and then communicate parameters via a central server called master. The communication is typically synchronous: the master has to wait for messages from all slaves before computing updates; all slaves have to wait for the message from the master to perform new computations. This mechanism has a weakness that if one of the slave is slow, the entire training procedure is slow. We found that asynchronous communications address this problem. In particular, the master updates its parameters as long as it receives a message from the slave and vice versa. Even though messages can be out-of-date (e.g., gradient being computed on delayed parameters), the method works well, lets us scale to thousands of machines and is much faster than conventional synchronous updates.

## Description

Type of resource | text |
---|---|

Form | electronic; electronic resource; remote |

Extent | 1 online resource. |

Publication date | 2013 |

Issuance | monographic |

Language | English |

## Creators/Contributors

Associated with | Le, Quoc V | |
---|---|---|

Associated with | Stanford University, Department of Computer Science. | |

Primary advisor | Ng, Andrew Y, 1976- | |

Thesis advisor | Ng, Andrew Y, 1976- | |

Thesis advisor | Bengio, Yoshua | |

Thesis advisor | Koller, Daphne | |

Advisor | Bengio, Yoshua | |

Advisor | Koller, Daphne |

## Subjects

Genre | Theses |
---|

## Bibliographic information

Statement of responsibility | Quoc V. Le. |
---|---|

Note | Submitted to the Department of Computer Science. |

Thesis | Ph.D. Stanford University 2013 |

Location | electronic resource |

## Access conditions

- Copyright
- © 2013 by Quoc Viet Le
- License
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).

## Also listed in

Loading usage metrics...