The Analysis of Data Project

The Analysis of Data Project provides educational material on data analysis.

  • The project covers the relevant disciplines comprehensively, including probability, statistics, computing, and machine learning.
  • The content is almost self-contained and includes mathematical prerequisites and basic computing concepts.
  • The R programming language is used to demonstrate the material. Full code is available, so readers can reproduce the experiments and try variations of the code.
  • The presentation is mathematically rigorous, and includes derivations and proofs in most cases.
  • HTML versions are freely available on the website http://theanalysisofdata.com. Hard copies are available at affordable prices.

About the Author

Guy Lebanon is an associate professor of computing at the Georgia Institute of Technology. His main research areas are statistical machine learning, computational statistics, and information visualization. He also serves as the associate director of the FODAVA project and as a visiting scientist at Google Research.

Before coming to Georgia Tech, Dr. Lebanon was an assistant professor of statistics and electrical and computer engineering at Purdue University. He received his PhD in 2005 from Carnegie Mellon University and his BA and MS degrees from Technion - Israel Institute of Technology.

Dr. Lebanon has authored over 50 refereed publications. He is the program co-chair of the 2012 ACM CIKM Conference and a guest editor of the Data Mining and Knowledge Discovery journal. He received the NSF CAREER Award and the Yahoo Faculty Research and Engagement Award, and is a Siebel Scholar.

 

Volume 1: Probability

Introduction to probability theory with an emphasis on the multivariate case. Includes random vectors, random processes, Markov chains, limit theorems, and related mathematics such as metric spaces, measure theory, and integration.

HTML: contents, viewer 1, viewer 2
PDF: contents, chapter 1, chapter A
Print: coming soon

Volume 2: Computing

Overview of essential computing for data analysis, including operating systems, C++ and R programming, data structures, databases, and parallel computing.

In progress

 

Volume 3: Statistics and Machine Learning

Introduction to statistics and machine learning, including M-estimators, hypothesis tests, regression, clustering, classification, regularization, and non-parametric methods. The text will cover theory, methodology, and case studies.

In progress