Friday, April 28, 2017

Thesis: Randomized Algorithms for Large-Scale Data Analysis by Farhad Pourkamali-Anaraki

Image 1

Stephen just sent me the following:

Hi Igor, 
It's a pleasure to write to you again and announce the graduation of my PhD student Farhad Pourkamali-Anaraki.

His thesis contains a lot of good things, some published, some not. In particular (see attached image 1), he has great work on a 1-pass algorithm for K-means that seems to be one of the only 1-pass algorithms to accurately estimate cluster centers (implementation at https://github.com/stephenbeckr/SparsifiedKMeans ). He also has very recent work on efficient variations of the Nystrom method for approximating kernel matrices that seems to give the high accuracy of the clustered Nystrom method at a fraction of the computational cost (see image 2).
Best,
Stephen

Image 2




Thanks Stephen, but I think the following paper also does 1-pass K-means (Keriven N., Tremblay N., Traonmilin Y., Gribonval R., "Compressive K-means", and its implementation, SketchMLbox: A MATLAB toolbox for large-scale mixture learning), even though the construction seems different. Both of these implementations will be added to the Advanced Matrix Factorization Jungle page.
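For the curious, here is a toy Python sketch of the general idea behind clustering from a sparsified stream: each point is seen only on a random subset of its coordinates, assigned to the nearest current center on those coordinates, and the centers are updated in a single pass. This is just my own illustration of the concept (the subsampling rate, the running-mean update, and the synthetic data are arbitrary choices), not the SparsifiedKMeans or Compressive K-means algorithms themselves, which live in the MATLAB toolboxes linked above.

```python
import numpy as np

def one_pass_sparsified_kmeans(X, k, keep_frac=0.2, seed=0):
    """Illustrative one-pass clustering on coordinate-sampled data.

    Each point is observed only on a random subset of its coordinates,
    assigned to the nearest current center on those coordinates, and the
    winning center is updated with a per-coordinate running mean. This is
    a toy sketch of the idea, not the SparsifiedKMeans algorithm itself.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].copy()  # init from data
    counts = np.ones((k, d))  # per-coordinate observation counts

    for x in X:  # single pass over the stream
        mask = rng.random(d) < keep_frac               # coordinates we keep
        if not mask.any():
            continue
        # distance to each center restricted to the observed coordinates
        diffs = centers[:, mask] - x[mask]
        j = np.argmin(np.einsum('ij,ij->i', diffs, diffs))
        # running-mean update of the winning center on observed coordinates
        counts[j, mask] += 1
        centers[j, mask] += (x[mask] - centers[j, mask]) / counts[j, mask]
    return centers

# Tiny usage example on synthetic Gaussian blobs.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_centers = rng.normal(scale=5.0, size=(3, 50))
    X = np.vstack([c + rng.normal(size=(200, 50)) for c in true_centers])
    rng.shuffle(X)
    est = one_pass_sparsified_kmeans(X, k=3, keep_frac=0.3)
    print(est.shape)  # (3, 50)
```

The point is only to show the single-pass structure: each point is touched once and never stored, which is what makes such schemes attractive for streaming data.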

Anyway, congratulations Dr. Pourkamali-Anaraki !
Randomized Algorithms for Large-Scale Data Analysis by Farhad Pourkamali-Anaraki. The abstract reads:

Massive high-dimensional data sets are ubiquitous in all scientific disciplines. Extracting meaningful information from these data sets will bring future advances in fields of science and engineering. However, the complexity and high-dimensionality of modern data sets pose unique computational and statistical challenges. The computational requirements of analyzing large-scale data exceed the capacity of traditional data analytic tools. The challenges surrounding large high-dimensional data are felt not just in processing power, but also in memory access, storage requirements, and communication costs. For example, modern data sets are often too large to fit into the main memory of a single workstation and thus data points are processed sequentially without a chance to store the full data. Therefore, there is an urgent need for the development of scalable learning tools and efficient optimization algorithms in today’s high-dimensional data regimes.

A powerful approach to tackle these challenges is centered around preprocessing high-dimensional data sets via a dimensionality reduction technique that preserves the underlying geometry and structure of the data. This approach stems from the observation that high-dimensional data sets often have intrinsic dimension which is significantly smaller than the ambient dimension. Therefore, information-preserving dimensionality reduction methods are valuable tools for reducing the memory and computational requirements of data analytic tasks on large-scale data sets.

Recently, randomized dimension reduction has received a lot of attention in several fields, including signal processing, machine learning, and numerical linear algebra. These methods use random sampling or random projection to construct low-dimensional representations of the data, known as sketches or compressive measurements. These randomized methods are effective in modern data settings since they provide a non-adaptive data-independent mapping of high-dimensional data into a low-dimensional space. However, such methods require strong theoretical guarantees to ensure that the key properties of original data are preserved under a randomized mapping.

This dissertation focuses on the design and analysis of efficient data analytic tasks using randomized dimensionality reduction techniques. Specifically, four efficient signal processing and machine learning algorithms for large high-dimensional data sets are proposed: covariance estimation and principal component analysis, dictionary learning, clustering, and low-rank approximation of positive semidefinite kernel matrices. These techniques are valuable tools to extract important information and patterns from massive data sets. Moreover, an efficient data sparsification framework is introduced that does not require incoherence and distributional assumptions on the data. A main feature of the proposed compression scheme is that it requires only one pass over the data due to the randomized preconditioning transformation, which makes it applicable to streaming and distributed data settings.

The main contribution of this dissertation is threefold: (1) strong theoretical guarantees are provided to ensure that the proposed randomized methods preserve the key properties and structure of high-dimensional data; (2) tradeoffs between accuracy and memory/computation savings are characterized for a large class of data sets as well as dimensionality reduction methods, including random linear maps and random sampling; (3) extensive numerical experiments are presented to demonstrate the performance and benefits of our proposed methods compared to prior works.
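To make the non-adaptive, data-independent sketching mentioned in the abstract a bit more concrete, here is a minimal Python illustration of a Gaussian random projection; the dimensions, sketch size, and the pairs of points being checked are arbitrary choices, and this is a generic Johnson-Lindenstrauss-style example rather than anything specific to the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 500, 10_000, 300                 # n points in d dimensions, sketch dimension m

X = rng.normal(size=(n, d))                # synthetic high-dimensional data
S = rng.normal(size=(d, m)) / np.sqrt(m)   # data-independent Gaussian sketching matrix
Y = X @ S                                  # low-dimensional sketch of the data

# Compare a few pairwise distances before and after sketching.
idx = rng.choice(n, size=10, replace=False)
i, j = idx[:5], idx[5:]
orig = np.linalg.norm(X[i] - X[j], axis=1)
proj = np.linalg.norm(Y[i] - Y[j], axis=1)
print(np.round(proj / orig, 3))            # ratios close to 1: geometry is roughly preserved
```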
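And since low-rank approximation of positive semidefinite kernel matrices is one of the four problems listed in the abstract, here is a minimal sketch of the plain, uniform-sampling Nystrom approximation of an RBF kernel matrix. The clustered and other improved variants studied in the thesis differ mainly in how the landmark points are chosen; the bandwidth, sample size, and eigenvalue thresholding below are my own illustrative choices.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom(X, m, gamma=0.1, seed=0):
    """Rank-m Nystrom factor L with K ~= L @ L.T, using uniform landmark sampling."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)   # landmark points
    C = rbf_kernel(X, X[idx], gamma)                  # n x m cross-kernel
    W = C[idx]                                        # m x m landmark kernel
    # K ~= C W^+ C^T; return a factor built from the eigendecomposition of W.
    vals, vecs = np.linalg.eigh(W)
    keep = vals > 1e-10                               # drop numerically zero eigenvalues
    return C @ (vecs[:, keep] / np.sqrt(vals[keep]))

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 20))
L = nystrom(X, m=100)
K = rbf_kernel(X, X)
print(np.linalg.norm(K - L @ L.T) / np.linalg.norm(K))  # relative error of the approximation
```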

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.
