The Analysis of Data Project

The Analysis of Data Project provides educational material in the area of data analysis.

  • The project features comprehensive coverage of all relevant disciplines including probability, statistics, computing, and machine learning.
  • The content is almost self-contained and includes mathematical prerequisites and basic computing concepts.
  • The R programming language is used to demonstrate the contents. Full code is available, facilitating reproducibility of experiments and letting readers experiment with variations of the code.
  • The presentation is mathematically rigorous, and includes derivations and proofs in most cases.
  • HTML versions are freely available on the website. Hardcopies are available at affordable prices.

Please email the author with typos, comments, and suggestions for improvements.

About the Author
Guy Lebanon is a senior manager at LinkedIn, where he leads the feed relevance team. Prior to that he was an advisor to an SVP and a senior manager at Amazon, where he led the machine learning science team at Amazon's main campus in Seattle, WA. Prior to that, Guy was a tenured professor at the Georgia Institute of Technology and a scientist at Google and Yahoo. His main research areas are machine learning and data science. Guy received his PhD from Carnegie Mellon University and BA and MS degrees from Technion - Israel Institute of Technology. Dr. Lebanon has authored over 60 refereed publications. He is an action editor of the Journal of Machine Learning Research, was the program chair of the 2012 ACM CIKM Conference, and will be the conference co-chair of AI and Statistics (AISTATS 2015). He received the NSF CAREER Award, the WWW best student paper award, the ICML best paper runner-up award, the Yahoo Faculty Research and Engagement Award, and is a Siebel Scholar.


Volume 1: Probability

Introduction to multivariate probability theory, including random vectors, random processes, Markov chains, limit theorems, and related mathematics such as set theory, metric spaces, differentiation, integration, and measure theory.

Print: amazon e-store
HTML: viewer 1 viewer 2
PDF: chapter 1 chapter A

Table of Contents

Volume 2: Computing

Overview of essential computing for data analysis, including operating systems, C++ and R programming, data structures, databases, parallel computing, and big data.

HTML: viewer 1 viewer 2
PDF: chapter 4 chapter 5

Expected publication date: 2015.


Volume 3: Statistics and Machine Learning

Introduction to statistics and machine learning, including M-estimators, hypothesis tests, regression, clustering, classification, regularization, and non-parametric methods. The text will cover theory, methodology, and case studies.