Abstracts

IJCV-06

K. Q. Weinberger and L. K. Saul (2006). Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision 70(1): 77-90. [pdf]

Abstract: Can we detect low dimensional structure in high dimensional data sets of images? In this paper, we propose an algorithm for unsupervised learning of image manifolds by semidefinite programming. Given a data set of images, our algorithm computes a low dimensional representation of each image with the property that distances between nearby images are preserved. More generally, it can be used to analyze high dimensional data that lies on or near a low dimensional manifold. We illustrate the algorithm on easily visualized examples of curves and surfaces, as well as on actual images of faces, handwritten digits, and solid objects.

AAAI-06

K. Q. Weinberger and L. K. Saul (2006). An introduction to nonlinear dimensionality reduction by maximum variance unfolding. To appear in Proceedings of the Twenty First National Conference on Artificial Intelligence (AAAI-06), Boston, MA. [pdf]

Abstract: Many problems in AI are simplified by clever representations of sensory or symbolic input. How to discover such representations automatically, from large amounts of unlabeled data, remains a fundamental challenge. The goal of statistical methods for dimensionality reduction is to detect and discover low dimensional structure in high dimensional data. In this paper, we review a recently proposed algorithm -- maximum variance unfolding -- for learning faithful low dimensional representations of high dimensional data. The algorithm relies on modern tools in convex optimization that are proving increasingly useful in many areas of machine learning.

ICASSP-06

F. Sha and L. K. Saul (2006). Large margin Gaussian mixture modeling for phonetic classification and recognition. To appear in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP-06), Toulouse, France. [pdf]

Abstract: We develop a framework for large margin classification by Gaussian mixture models (GMMs). Large margin GMMs have many parallels to support vector machines (SVMs) but use ellipsoids to model classes instead of half-spaces. Model parameters are trained discriminatively to maximize the margin of correct classification, as measured in terms of Mahalanobis distances. The required optimization is convex over the model's parameter space of positive semidefinite matrices and can be performed efficiently. Large margin GMMs are naturally suited to large problems in multiway classification; we apply them to phonetic classification and recognition on the TIMIT database. On both tasks, we obtain significant improvement over baseline systems trained by maximum likelihood estimation. For the problem of phonetic classification, our results are competitive with other state-of-the-art classifiers, such as hidden conditional random fields.

NIPS-05

K. Q. Weinberger, J. Blitzer, and L. K. Saul (2006). Distance metric learning for large margin nearest neighbor classification. To appear in Y. Weiss, B. Schoelkopf, and J. Platt (eds.), Advances in Neural Information Processing Systems 18. MIT Press: Cambridge, MA. [pdf]

Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty,we find that metrics trained in this way lead to significant improvements in kNN classification---for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. Our approach has many parallels to support vector machines, including a convex objective function based on the hinge loss, but does not require modifications for problems with large numbers of classes.

book chapter

L. K. Saul, K. Q. Weinberger, J. H. Ham, F. Sha, and D. D. Lee (2006). Spectral methods for dimensionality reduction. To appear in O. Chapelle, B. Schoelkopf, and A. Zien (eds.), Semisupervised Learning. MIT Press: Cambridge, MA. [pdf]

Abstract: Spectral methods have recently emerged as a powerful tool for nonlinear dimensionality reduction and manifold learning. These methods are able to reveal low dimensional structure in high dimensional data from the top or bottom eigenvectors of specially constructed matrices. To analyze data that lies on a low dimensional submanifold, the matrices are constructed from sparse weighted graphs whose vertices represent input patterns and whose edges indicate neighborhood relations. The main computations for manifold learning are based on tractable, polynomial-time optimizations, such as shortest path problems, least squares fits, semidefinite programming, and matrix diagonalization. This chapter provides an overview of unsupervised learning algorithms that can be viewed as spectral methods for linear and nonlinear dimensionality reduction.

ISMIR J. A. Burgoyne and L. K. Saul (2005). Learning harmonic relationships in digital audio with Dirichlet-based hidden Markov models. In Proceedings of the Sixth International Conference on Music Information Retrieval (ISMIR-05), pages 438-443. London, England. [pdf]

Abstract: Harmonic analysis is a standard musicological tool for understanding many pieces of Western classical music and making comparisons among them. Traditionally, this analysis is done on paper scores, and most past research in machine-assisted analysis has begun with digital representations of them. Human music students are also taught to hear their musical analyses, however, in both musical recordings and performances. Our approach attempts to teach machines to do the same, beginning with a corpus of recorded Mozart symphonies. The audio files are first transformed into an ordered series of normalized pitch class profile (PCP) vectors. Simplified rules of tonal harmony are encoded in a transition matrix. Classical music tends to change key more frequently than popular music, and so these rules account not only for chords, as most previous work has done, but also for the keys in which they function. A hidden Markov model (HMM) is used with this transition matrix to train Dirichlet distributions for major and minor keys on the PCP vectors. The system tracks chords and keys successfully and shows promise for a real-time implementation.

ICMC-05 J. A. Burgoyne and L. K. Saul (2005). Visualization of low dimensional structure in tonal pitch space. In Proceedings of the International Computer Music Conference (ICMC-05), pages 243-246. Barcelona, Spain. [pdf]

Abstract: In his 2001 monograph Tonal Pitch Space, Fred Lerdahl defined a distance function over tonal and post-tonal harmonies distilled from years of research on music cognition. Although this work references the toroidal structure commonly associated with harmonic space, it stops short of presenting an explicit embedding of this torus. It is possible to use statistical techniques to recreate such an embedding from the distance function, yielding a more complex structure than the standard toroidal model has heretofore assumed. Nonlinear techniques can reduce the dimensionality of this structure and be tuned to emphasize global or local anatomy. The resulting manifolds highlight the relationships inherent in the tonal system and offer a basis for future work in machine-assisted analysis and music theory.

ICML-05

F. Sha and L. K. Saul (2005). Analysis and extension of spectral methods for nonlinear dimensionality reduction. In Proceedings of the Twenty Second International Conference on Machine Learning (ICML-05), pages 785-792. Bonn, Germany. [pdf]

Abstract: Many unsupervised algorithms for nonlinear dimensionality reduction, such as locally linear embedding (LLE) and Laplacian eigenmaps, are derived from the spectral decompositions of sparse matrices. While these algorithms aim to preserve certain proximity relations on average, their embeddings are not explicitly designed to preserve local features such as distances or angles. In this paper, we show how to construct a low dimensional embedding that maximally preserves angles between nearby data points. The embedding is derived from the bottom eigenvectors of LLE and/or Laplacian eigenmaps by solving an additional (but small) problem in semidefinite programming, whose size is independent of the number of data points. The solution obtained by semidefinite programming also yields an estimate of the data's intrinsic dimensionality. Experimental results on several data sets demonstrate the merits of our approach.

AISTATS-05

K. Q. Weinberger, B. D. Packer, and L. K. Saul (2005). Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. In Z. Ghahramani and R. Cowell (eds.), Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 381-388. Barbados, West Indies. Best student paper award. [pdf]

Abstract: We describe an algorithm for nonlinear dimensionality reduction based on semidefinite programming and kernel matrix factorization. The algorithm learns a kernel matrix for high dimensional data that lies on or near a low dimensional manifold. In earlier work, the kernel matrix was learned by maximizing the variance in feature space while preserving the distances and angles between nearest neighbors. In this paper, exploiting recent ideas for semi-supervised learning on graphs, we show that for well-sampled manifolds, the entire kernel matrix can be very accurately reconstructed from a much smaller submatrix of inner products between randomly chosen landmarks. By representing the kernel matrix in this way, we are able to achieve order-of-magnitude reductions in computation time and to handle much larger data sets.

AISTATS-05

J. H. Ham, D. D. Lee, and L. K. Saul (2005). Semisupervised alignment of manifolds. In Z. Ghahramani and R. Cowell (eds.), Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 120-127. Barbados, West Indies. [pdf]

Abstract: In this paper, we study a family of semi-supervised learning algorithms for "aligning" different data sets that are characterized by the same underlying manifold. The optimizations of these algorithms are based on graphs that provide a discretized approximation to the manifold. Partial alignments of the data sets -- obtained from prior knowledge of their manifold structure or from pairwise correspondences of subsets of labeled examples -- are completed by integrating supervised signals with unsupervised frameworks for manifold learning. As an illustration of this semisupervised setting, we show how to learn mappings between different data sets of images that are parameterized by the same underlying modes of variability (e.g., pose and viewing angle). The curse of dimensionality in these problems is overcome by exploiting the low dimensional structure of image manifolds.

NIPS-04

F. Sha and L. K. Saul (2005). Real-time pitch determination of one or more voices by nonnegative matrix factorization. In L. K. Saul, Y. Weiss, and L. Bottou (eds.), Advances in Neural Information Processing Systems 17, pages 1233-1240. MIT Press: Cambridge, MA. [pdf]

Abstract: An auditory "scene", composed of overlapping acoustic sources, can be viewed as a complex object whose constituent parts are the individual sources. Pitch is known to be an important cue for auditory scene analysis. In this paper, with the goal of building agents that operate in human environments, we describe a real-time system to identify the presence of one or more voices and compute their pitch. The signal processing in the front end is based on instantaneous frequency estimation, a method for tracking the partials of voiced speech, while the pattern-matching in the back end is based on nonnegative matrix factorization, an unsupervised algorithm for learning the parts of complex objects. While supporting a framework to analyze complicated auditory scenes, our system maintains real-time operability and state-of-the-art performance in clean speech.

NIPS-04

J. Blitzer, K. Q. Weinberger, L. K. Saul, and F. C. N. Pereira (2005). Hierarchical distributed representations for statistical language models. In L. K. Saul, Y. Weiss, and L. Bottou (eds.), Advances in Neural Information Processing Systems 17, pages 185-192. MIT Press: Cambridge, MA. [pdf]

Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction, then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al, our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models. We also discuss extensions of our approach to longer multiword contexts.

IMC-04

Y. Mao and L. K. Saul (2004). Modeling distances in large-scale networks by matrix factorization. In Proceedings of the Second Internet Measurement Conference (IMC-04), pages 278-287. Sicily, Italy. [pdf]

Abstract: In this paper, we propose a model for representing and predicting distances in large-scale networks by matrix factorization. The model is useful for network distance sensitive applications, such as content distribution networks, topology-aware overlays, and server selections. Our approach overcomes several limitations of previous coordinates-based mechanisms, which cannot model asymmetric routing or sub-optimal routing policies. We present two algorithms---singular value decomposition (SVD) and nonnegative matrix factorization (NMF)---that learn scalable models of network distances from limited numbers of measurements. Extensive simulations on real-world data sets show that these models lead to more accurate and robust predictions of latencies in large-scale networks than previous approaches.

ICML-04

K. Q. Weinberger, F. Sha, and L. K. Saul (2004). Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the Twenty First International Confernence on Machine Learning (ICML-04), pages 839-846. Banff, Canada. Outstanding student paper award. [pdf, matlab]

Abstract: We investigate how to learn a kernel matrix for high dimensional data that lies on or near a low dimensional manifold. Noting that the kernel matrix implicitly maps the data into a nonlinear feature space, we show how to discover a mapping that unfolds the underlying manifold from which the data was sampled. The kernel matrix is constructed by maximizing the variance in feature space subject to local constraints that preserve the angles and distances between nearest neighbors. The main optimization involves an instance of semidefinite programming---a fundamentally different computation than previous algorithms for manifold learning, such as Isomap and locally linear embedding. The optimized kernels perform better than polynomial and Gaussian kernels for problems in manifold learning, but worse for problems in large margin classification. We explain these results in terms of the geometric properties of different kernels and comment on various interpretations of other manifold learning algorithms as kernel methods.

CVPR-04

K. Q. Weinberger and L. K. Saul (2004). Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04), vol. 2, pages 988-995. Washington D.C. Outstanding student paper award. [pdf, matlab]

Abstract: Can we detect low dimensional structure in high dimensional data sets of images and video? The problem of dimensionality reduction arises often in computer vision and pattern recognition. In this paper, we propose a new solution to this problem based on semidefinite programming. Our algorithm can be used to analyze high dimensional data that lies on or near a low dimensional manifold. It overcomes certain limitations of previous work in manifold learning, such as Isomap and locally linear embedding. We illustrate the algorithm on easily visualized examples of curves and surfaces, as well as on actual images of faces, handwritten digits, and solid objects.

ICASSP-04

V. Jain and L. K. Saul (2004). Exploratory analysis and visualization of speech and music by locally linear embedding. In Proceedings of the International Conference of Speech, Acoustics, and Signal Processing (ICASSP-04), vol. 3, pages 984-987. Montreal, Canada. [pdf]

Abstract Many problems in voice recognition and audio processing involve feature extraction from raw waveforms. The goal of feature extraction is to reduce the dimensionality of the audio signal while preserving the informative signatures that, for example, distinguish different phonemes in speech or identify particular instruments in music. If the acoustic variability of a data set is described by a small number of continuous features, then we can imagine the data as lying on a low dimensional manifold in the high dimensional space of all possible waveforms. Locally linear embedding (LLE) is an unsupervised learning algorithm for feature extraction in this setting. In this paper, we present results from the exploratory analysis and visualization of speech and music by LLE.

ICASSP-04

Y. Lin, D. D. Lee, and L. K. Saul (2004). Nonnegative deconvolution for time of arrival estimation. In Proceedings of the International Conference of Speech, Acoustics, and Signal Processing (ICASSP-04), vol. 2, pages 377-380. Montreal, Canada. [pdf]

Abstract The interaural time difference (ITD) of arrival is a primary cue for acoustic sound source localization. Traditional estimation techniques for ITD based upon cross-correlation are related to maximum-likelihood estimation of a simple generative model. We generalize the time difference estimation into a deconvolution problem with nonnegativity constraints. The resulting nonnegative least squares optimization can be efficiently solved using a novel iterative algorithm with guaranteed global convergence properties. We illustrate the utility of this algorithm using simulations and experimental results from a robot platform.

ICASSP-04

F. Sha, J. A. Burgoyne, and L. K. Saul (2004). Multiband statistical learning for f0 estimation in speech. In Proceedings of the International Conference of Speech, Acoustics, and Signal Processing (ICASSP-04), vol. 5, pages 661-664. Montreal, Canada. [pdf]

Abstract: We investigate a simple algorithm that combines multiband processing and least squares fits to estimate f0 contours in speech. The algorithm is untraditional in several respects: it makes no use of FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high resolution estimates; and it works reliably, in real time, without the need for postprocessing to produce smooth contours. We show that a baseline implementation of the algorithm, though already quite accurate, is significantly improved by incorporating a model of statistical learning into its final stages. Model parameters are estimated from training data to minimize the likelihood of gross errors in f0 as well as errors in classifying voiced versus unvoiced speech. Experimental results on several databases confirm the benefits of statistical learning.

JMLR-03

L. K. Saul and S. T. Roweis (2003). Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research 4:119-155. [ps.gz, pdf]

Abstract: The problem of dimensionality reduction arises in many fields of information processing, including machine learning, data compression, scientific visualization, pattern recognition, and neural computation. Here we describe locally linear embedding (LLE), an unsupervised learning algorithm that computes low dimensional, neighborhood preserving embeddings of high dimensional data. The data, assumed to be sampled from an underlying manifold, are mapped into a single global coordinate system of lower dimensionality. The mapping is derived from the symmetries of locally linear reconstructions, and the actual computation of the embedding reduces to a sparse eigenvalue problem. Notably, the optimizations in LLE--though capable of generating highly nonlinear embeddings--are simple to implement, and they do not involve local minima. In this paper, we describe the implementation of the algorithm in detail and discuss several extensions that enhance its performance. We present results of the algorithm applied to data sampled from known manifolds, as well as to collections of images of faces, lips, and handwritten digits. These examples are used to provide extensive illustrations of the algorithm's performance--both successes and failures--and to relate the algorithm to previous and ongoing work in nonlinear dimensionality reduction.

EuroSpeech-03

L. K. Saul, F. Sha, and D. D. Lee (2003). Statistical signal processing with nonnegativity constraints. In Proceedings of the Eighth European Conference on Speech Communication and Technology, volume 2, pages 1001-1004, Geneva, Switzerland. [pdf]

Abstract: Nonnegativity constraints arise frequently in statistical learning and pattern recognition. Multiplicative updates provide natural solutions to optimizations involving these constraints. One well known set of multiplicative updates is given by the Expectation- Maximization algorithm for hidden Markov models, as used in automatic speech recognition. Recently, we have derived similar algorithms for nonnegative deconvolution and nonnegative quadratic programming. These algorithms have applications to low-level problems in voice processing, such as fundamental frequency estimation, as well as high-level problems, such as the training of large margin classifiers. In this paper, we describe these algorithms and the ideas that connect them.

ICML-03

J. H. Ham, D. D. Lee, and L. K. Saul (2003). Learning high dimensional correspondences from low dimensional manifolds. In Proceedings of the ICML 2003 Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, pages 34-41, Washington, D.C. [pdf]

Abstract: Many different high dimensional data sets are characterized by the same underlying modes of variability. When these modes of variability are continuous and few in number, they can be viewed as parameterizing a low dimensional manifold. The manifold provides a compact shared representation of the data, suggesting correspondences between the high dimensional examples from different data sets. These correspondences, though naturally induced by the underlying manifold, are difficult to learn using traditional methods in supervised learning. In this paper, we generalize three methods in unsupervised learning -- principal components analysis, factor analysis, and locally linear embedding -- to discover subspaces and manifolds that provide common low dimensional representations of different high dimensional data sets. We use the shared representations discovered by these algorithms to put high dimensional examples from different data sets into correspondence. Finally, we show that a notion of self-correspondence between examples in the same data set can be used to improve the performance of these algorithms on small data sets. The algorithms are demonstrated on images and text.

COLT-03

F. Sha, L. K. Saul, and D. D. Lee (2003). Multiplicative updates for large margin classifiers. In Proceedings of the Sixteenth Annual Conference on Computational Learning Theory, pages 188-202, Washington D.C. [pdf]

Abstract: Various problems in nonnegative quadratic programming arise in the training of large margin classifiers. We derive multiplicative updates for these problems that converge monotonically to the desired solutions for hard and soft margin classifiers. The updates differ strikingly in form from other multiplicative updates used in machine learning. In this paper, we provide complete proofs of convergence for these updates and extend previous work to incorporate sum and box constraints in addition to nonnegativity.

AISTATS-03

A. I. Schein, L. K. Saul, and L. Ungar (2003). A generalized linear model for principal component analysis of binary data. In C. M. Bishop and B. J. Frey (eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, pages 14-21, Key West, FL. [pdf]

Abstract: We investigate a generalized linear model for dimensionality reduction of binary data. The model is related to principal component analysis (PCA) in the same way that logistic regression is related to linear regression. Thus we refer to the model as logistic PCA. In this paper, we derive an alternating least squares method to estimate the basis vectors and generalized linear coefficients of the logistic PCA model. The resulting updates have a simple closed form and are guaranteed at each iteration to improve the model's likelihood. We evaluate the performance of logistic PCA as measured by reconstruction error rates on data sets drawn from four real world applications. In general, we find that logistic PCA is much better suited to modeling binary data than conventional PCA.

NIPS-02

L. K. Saul, D. D. Lee, C. L. Isbell, and Y. LeCun (2003). Real time voice processing with audiovisual feedback: toward autonomous agents with perfect pitch. In S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems 15, pages 1205-1212. MIT Press: Cambridge, MA. [pdf, matlab]

Abstract: We have implemented a real time front end for detecting voiced speech and estimating its fundamental frequency. The front end performs the signal processing for voice-driven agents that attend to the pitch contours of human speech and provide continuous audiovisual feedback. The algorithm we use for pitch tracking has several distinguishing features: it makes no use of FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high resolution estimates; and it works reliably over a four octave range, in real time, without the need for postprocessing to produce smooth contours. The algorithm is based on two simple ideas in neural computation: the introduction of a purposeful nonlinearity, and the error signal of a least squares fit. The pitch tracker is used in two real time multimedia applications: a voice-to-MIDI player that synthesizes electronic music from vocalized melodies, and an audiovisual Karaoke machine with multimodal feedback. Both applications run on a laptop and display the user's pitch scrolling across the screen as he or she sings into the computer.

NIPS-02

F. Sha, L. K. Saul, and D. D. Lee (2003). Multiplicative updates for nonnegative quadratic programming in support vector machines. In S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems 15, pages 1065-1073. MIT Press: Cambridge, MA. [pdf]

Abstract: We derive multiplicative updates for solving the nonnegative quadratic programming problem in support vector machines (SVMs). The updates have a simple closed form, and we prove that they converge monotonically to the solution of the maximum margin hyperplane. The updates optimize the traditionally proposed objective function for SVMs. They do not involve any heuristics such as choosing a learning rate or deciding which variables to update at each iteration. They can be used to adjust all the quadratic programming variables in parallel with a guarantee of improvement at each iteration. We analyze the asymptotic convergence of the updates and show that the coefficients of non-support vectors decay geometrically to zero at a rate that depends on their margins. In practice, the updates converge very rapidly to good classifiers.