The Analysis of Data Project

The Analysis of Data Project provides educational material in the area of data analysis.

  • The project features comprehensive coverage of all relevant disciplines including probability, statistics, computing, and machine learning.
  • The content is almost self-contained and includes mathematical prerequisites and basic computing concepts.
  • The R programming language is used to illustrate the material. Full code is available, which makes the experiments reproducible and lets readers try variations of the code.
  • The presentation is mathematically rigorous, and includes derivations and proofs in most cases.
  • HTML versions are freely available on the website. Hard copies are available at affordable prices.

Please email the author with typos, comments, and suggestions for improvements.

About the Author
Guy Lebanon is a director at Amazon, where he leads Search-Lab, an innovation center that explores big bets for improving Amazon’s search experience. Previously, he was a director at Netflix, where he led its homepage personalization efforts, including its well-known movie recommendation system. Before that, he was a tenured professor at the Georgia Institute of Technology and held science and engineering management positions at Amazon and LinkedIn. Guy has published several books and over 70 refereed articles in machine learning, and received a PhD from Carnegie Mellon University. He chaired the 2015 AI & Statistics conference and the 2012 ACM CIKM conference, and was an action editor of the Journal of Machine Learning Research during 2013-2017. He won first place in the PASCAL image segmentation competition three times and received the NSF CAREER Award, the WWW best student paper award, the ICML best paper runner-up award, and the Yahoo Faculty Research and Engagement Award. He is also a Siebel Scholar.


Volume 1: Probability

Introduction to multivariate probability theory, including random vectors, random processes, Markov chains, limit theorems, and related mathematics such as set theory, metric spaces, differentiation, integration, and measure theory.

Print: amazon e-store
HTML: viewer 1, viewer 2
PDF: chapter 1, chapter A

Volume 2: Computing

Overview of essential computing for data analysis, including operating systems, C++ and R programming, data structures, databases, parallel computing, and big data.

HTML: viewer 1, viewer 2
PDF: chapter 4, chapter 5

Expected publication date: 2015.


Volume 3: Statistics and Machine Learning

Introduction to statistics and machine learning, including M-estimators, hypothesis tests, regression, clustering, classification, regularization, and non-parametric methods. The text will cover theory, methodology, and case studies.