Set Up the Env

I would like to share my learning in Machine Learning in this space. The blog can be seem as a way to organize and share my learning, ideas and random thoughts as well as a way to review the concepts that I have learned both inside and outside my classes.

If you happened to visit my blog and find anything interests you, please feel free to leave your thoughts. I will be happy to have conversations here.

Also, if you see anything that is not correct in the posts, please let me know. Thanks.


K-Means and EM

Both k-means and EM algorithm belongs to the category of unsupervised learning. I have explained in this post what is unsupervised learning.


The k-means algorithm (aka Lloyd’s algorithm) is simple and straightforward. Given you have a dataset X, the basically idea is to alternating between the following two steps until you minimize the objective:

Step 1: Given K points that represents the center of K clusters (thus K classes), assign each data point to the class that has the center point closest to it.

Step 2: Now you have a set of data points for each class. So you calculate the center/mean for each class base on the data points belong to that class.

where the objective can be defined as the sum of squared distance between each data point and its closest cluster center.

Following the algorithm above, the squared distance will keep decreasing until it converges to a local minima.

(A simple example code can be found here.)


“The expectation maximization algorithm, or EM algorithm, is a general technique for finding maximum likelihood solutions for probabilistic models having latent variables.” (Chris Bishop, Patter Recognition and Machine Learning)

EM  (expectation maximization) is a generalization of k-means algorithm. In EM we introduce the notion of hidden variables or latent variables which are used to represent variables that we do not observe. Usually in clustering problem, we use hidden variables to represent the class assignments.

Similar to k-means, EM alternating between two steps until convergence:

E-step: calculate the expectation of the hidden variables

M-step: maximal likelihood estimation of the parameters of the assumed/given models

To understand EM and its convergence requires some statistics and calculus. Thus, it is helpful to look at one of most common uses of EM, i.e. its usage in Gaussian Mixture Model.

Linear Regression

What is Linear Regression?

Linear Regression is more of a statistic concept rather than a machine learning concept. In the simplest form (1-dimensional input and output), linear regression can be seem as fitting a line defined by y = ax + b that best estimates y for a given x.

From the name, we can also explained Linear Regression in two parts – “Linear” and “Regression”.

“Regression” means the method is used for analysis/estimation on data with continuous/quantitative outcome rather than categorical/qualitative outcome. For example, the estimate of whether a car is black or not is categorical/qualitative, whereas the estimate of the price of a given dish is continuous/quantitative.

“Linear” indicates that the method is in the category of linear models. Generally, linear model’s linearity is in term of parameter (see a brief explanation of difference between linear in parameters and linear in variables by Chinny84 on StackExchange). If a method’s estimate can be expressed as \hat{Y} = \hat{\beta}_0+\sum_{i=1}^dX_j\hat{\beta}_j, where \hat{\beta}_0 is the intercept, \hat{\beta}_j are the coefficients and X_j are the inputs in j dimensions, the method is a linear model.

Different Types of Linear Regression

Ordinary Least Square

Ordinary Least Square (OLS) uses least square error \sum_{i=1}^{n} Y_i-\hat{Y}_i to measure the cost. Then the problem become finding the \beta_0^* and \beta_i^* to minimize the least square error.

Ridge Regression

Ridge Regression is OLS with a L2-norm constraint on \beta_i. The purpose of the constraint is to restrict the coefficients in order to decrease the variance of the model. This is very helpful when the input features are highly correlated.


Lasso is very similar to Ridge, except it uses L1-norm. Lasso is more likely to reduce coefficients to 0 than Ridge. However, Lasso does not have a close form solution like Ridge and OLS.

to be cont…

Review on Probabilities

  1. Events
    • \mathbb{P}[A\  or\  B] \le \mathbb{P}[A] + \mathbb{P}[B]
    • \mathbb{P}[A\bigcup B] \le \mathbb{P}[A] + \mathbb{P}[B]
    • If A \Rightarrow B, then \mathbb{P} \le \mathbb{P}[B]
  2.  Expectations
    • E[X] = \int xf(x)dx
    • E[c] = c
    • E[X + c] = E[X] + c
    • E[cX] = cE[X]
    • E[X+Y] = E[X] + E[Y]
    • E[XY] = E[X]E[Y] if X and Y are independent
    • E[\vec{X}] = (E[X_1] \ E[X_2] \  ... \  E[X_d])^T
    • If X \ge 0, E[X] =\int_0^\infty \mathbb{P}[X\ge 0]dt
  3. Variance
    • Var(X) = E[(X - E[X])^2]
    • Var(X) = E[X^2] - (E[X])^2
    • Var(c) = 0
    • Var(X + c) = Var(X)
    • Var(cX) = c^2Var(X)
    • Var(X+Y) = Var(X)Var(Y) if X and Y are independent
    • Var(\sum\limits_{i=1}^NX_i) = \sum\limits_{i=1}^NVar(X_i) if the X_i...X_n are independent
  4. Covariance
    • Cov(X,Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X][Y]
    • Var(X) = Cov(X,X)
    • Cov(X,c) = 0
    • Cov(aX, bY) = abCov(X,Y)
    • Cov(\vec{X}) = E[(\vec{X} - E[\vec{X}])(\vec{X} - E[\vec{X}])^T] = E[\vec{X}\vec{X}^T] - E[\vec{X}](E[\vec{X}])^T
    • Cov(\vec{X}|\vec{Y}) = E[(\vec{X} - E[\vec{X}|\vec{Y}])(\vec{X} - E[\vec{X}|\vec{Y}])^T|\vec{Y}]
  5. Mean Square Error: MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = E[(\hat{\theta}-E[\hat{\theta}])^2] + (E[\hat{\theta}]-\theta)^2 = Var(\hat{\theta}) + Bias(\hat{\theta})^2, more detail explanation can be found on wikipedia.
  6. Law of Total Expectation E[X] = E[E[X|Y]]
  7. Inequalities
    • Jensen: for convex f, f(E[X]) \le E[f(X)]
    • Markov: If X \ge 0 then for all t >0, \mathbb[P][X\ge t] \le \frac{E[X]}{t}
    • Chebyshev-Cantelli: for t>0, \mathbb{P}[|X-E[X]|\ge t] \le \frac{Var(X)}{t^2}
    • Chebyshev’s Association: let f and g be nondecreasing (nonincreasing) real-valued functions defined on the real line. If X is a real-valued random variable then, E[f(X)g(X)] \ge E[f(X)]E[g(X)] (E[f(X)g(X)] \le E[f(X)]E[g(X)])
    • Harris’: extends Chebyshev’s Association to functions f,g:\mathbb{R}^n\rightarrow \mathbb{R}
    • Chernoff: for all t\in \mathbb{R}, \mathbb{P}[X\ge t] \le \inf\limits_{\lambda \ge 0}E[e^{\lambda (X-t)}]
    • Cauchy-Schwarz: if E[X^2]<\infty E[Y^2]<\infty, then |E[XY]|\le \sqrt{E[X^2]E[Y^2]}
    • Hoeffding’s tail: Let X_1,...,X_n be independent bounded random variable such that X_i falls in the interval [a_i,b_i] with probability one. Then for any t>0, \mathbb{P}[S_n-E[S_n]\ge t] \le \exp(\frac{-2t^2}{\sum\limits_{i=1}^n(b_i-a_i)^2}) and \mathbb{P}[S_n-E[S_n]\le -t] \le \exp(\frac{-2t^2}{\sum\limits_{i=1}^n(b_i-a_i)^2})
  8. Other
    • If \mathbb{P}[X > t]\le F(t), then with probability at least 1 - \delta, X\le F^{-1}(\delta).

Categories of Machine Learning Algorithms

There are a few ways to categorize a machine learning algorithm. Here I will give you a brief introduction of these categories in easy words.

Statistical Learning

There are some debates on whether Machine Learning is just Statistics. Here’s a really good comparison of the two by Brendan O’Connor. And here is a good discussion on StackExchange. My standing is that some machine learning approaches are not statistical, even though they can be explained by statistics. But do not be puzzled when you see statistics everywhere in a machine learning textbook.

Supervised and Unsupervised Learning

Depending on the data we have and what our goal is, we need to decide whether we will use supervised or unsupervised learning.

In a supervised learning, we will have a set of data (e.g., set S) that is correctly labeled. Then we can train algorithms using S or a subset of S that predict labels from the input variables. At the end, we typically choose the algorithm that minimize the difference between its predicted labels and the true labels.

Example: We have data of the heights of 100 people, and we have each of these heights link to a label which corresponds to the age range of that person (i.e., “up to 2-year-old”, “2 to 6”, “6 to 12”, “12 to 16”, “16 and older”). Then in a supervised learning, we train algorithms that use the input variable (i.e., the height) to predict the label (i.e., the age range corresponding to that height). We check the efficiency of the algorithms by comparing its predictions with the true labels.

In an unsupervised learning, we do not have correctly labeled data (or if we have we do not use). So we cannot train algorithms base on their “correctness” (the correctness in the supervised learning is subject to the final purpose of the machine learning problem, which I will write about in future posts). Instead, we try to find meaningful structures and relationships from the data. (Unsupervised learning is also known in Statistics as exploratory data analysis.)

Example: We have the height data as above but without the correct labels given. What we can do is to use unsupervised learning to group the heights into different groups and interpret them.

Parametric and Nonparametric Methods

“In statistics, a parametric model is a family of distributions that can be described using a finite number of parameters.” (wikipedia) Thus, a parametric method in machine learning assumes the distribution of the data takes a specific form. Or more precisely, a parametric method assumes the underlying real classifier or regressor takes a specific form.

On the contrary, a nonparametric method does not have such assumption. As a result, nonparametric methods are more flexible but generally harder to interpret.

Generative Model and Discriminative Model

A algorithm using the generative model has the ability to generate data similar to the training data, because it learns about the joint probability of the input variables and labels.

On the other hand, a algorithm using the discriminative model only learns about the conditional probability, i.e., given the input variable, what is the label going to be.

Here’s a discussion from StackExchange with some interesting examples.


These categories could easily overlap depends on the specific algorithm.

What is Machine Learning?

My definition of Machine Learning in simple words:

Machine Learning is a way to let machine/computer/silicon learn from experience/existing-data in order to give a more educated guess regarding the future/unknown.

However the definition may not be exactly the same as the ones you find on a typical machine learning textbook. And here’s a bunch of more formal textbook definitions from Jason Brownlee’s wonderful post What is Machine Learning: A Tour of Authoritative Definitions and a Handy One-Liner You Can Use.

Why Machine Learning?

The intuition is that if a machine learning method is evolved by experience, then the method is more flexible (i.e., less problem specific), thus can be applied more generally.

Linear Algebra Review

Since linear algebra is broadly used in Machine Learning, here are the concepts that I thank are important.

  1. Dot Product
    • dot product of two vectors can be seem as linearly transform one to the 1D line defined by the other
    • \vec{v} \cdot \vec{u} = |\vec{v}||\vec{u}|\cos\theta
    • \vec{v} \cdot \vec{u} = \vec{u} \cdot \vec{v}
    • c(\vec{v}\cdot\vec{u}) = (c\vec{v})\cdot\vec{u}
    • (c\vec{v} + d\vec{u})\cdot\vec{w} = (c\vec{v})\cdot\vec{w} + (d\vec{u})\cdot\vec{w}
    • |\vec{v}|^2 =\vec{v}\cdot\vec{v} = \vec{v}^T\vec{v} =\langle\vec{v},\vec{v}\rangle
    • \vec{v} \cdot \vec{u} = 0 if and only if \vec{v} and \vec{u} are orthogonal to each other
  2. Matrix Multiplication
    • AB\not= BA
    • A(B+C) = AB + BC
    • (A+B)C = AC + BC
    • c(AB) = (cA)B = A(cB)
    • A(BC) = (AB)C
    • AI = A = IA
    • AA^{-1} = I = A^{-1}A
    • (AB)^{-1} = B^{-1}A^{-1}
    • det(AB) = det(A) det(B)
    • A [\vec{v}_1\:\vec{v}_2\:\vec{v}_3] = [A\vec{v}_1\:A\vec{v}_2\:A\vec{v}_3]
  3. Transpose
    • (A^T)^T = A
    • (A+B)^T = A^T + B^T
    • (cA)^T = cA^T
    • (AB)^T = B^TA^T
    • (A^{-1})^T = (A^T)^{-1}
  4. Vector Differentiation
    • \frac{d(x^TAx)}{d(x)} = x^T(A^T + A)
  5. Determinants
    • determinant of two 2D vectors is the area of the signed parallelogram formed by these two vectors (\det(\vec{a}\vec{b}) = \left| \begin{array}{cc} \vec{a}_1 & \vec{a}_2 \\ \vec{b}_1 & \vec{b}_2 \end{array} \right| )
    • determinant of three 3D vectors is the signed volume of the parallelepiped formed by these three vectors
    • if the determinant of a matrix A is 0, then A is singular. Below are some more properties of determinant of matrix:
    • |aA| = a^d|A|
    • |AB| = |A||B|
    • |A| = |A^T|
    • |A| = \frac{1}{|A^{-1}|}
  6. Rank of a matrix (for real matrix A\in\mathbb{R}^{m\times n})
    • columnrank(A) = rowrank(A) = rank(A)
    • rank(A) = rank(A^T)
    • rank(A) \le \min(m,n)
    • rank(AB) \le \min(rank(A), rank(B))
    • rank(A+B) \le rank(A) + rank(B)
    • rank(A^TA) = rank(AA^T) = rank(A)
  7. Symetric Matrices
    • A matrix A is positive semidefinite if for all vectors \vec{v} such that \vec{v}^TA\vec{v} \ge 0
      • A is positive semidefinite iff A=A^T
      • A=U^TU is positive semidefinite
  8. Trace (for square matrix)
    • tr(A) = \sum\limits_{i=1}^nA_{ii} where A\in\mathbb{R}^{n\times n}
    • tr(A) = tr(A^T)
    • tr(A+B) = tr(A) + tr(B)
    • tr(cA) = ctr(A)
    • tr(AB) = tr(BA) = tr(B^TA^T) = tr(A^TB^T)
  9. Linear Transformation
  10. Invertibility of a matrix
    • A matrix is invertible iff it is full rank
  11. Orthogonal Matrix
    • “an orthogonal matrix is a square matrix with real entries whose columns and rows are orthogonal unit vectors” (from wikipedia)
    • “The rows of an orthogonal matrix are an orthonormal basis. That is, each row has length one, and are mutually perpendicular. Similarly, the columns are also an orthonormal basis. In fact, given any orthonormal basis, the matrix whose rows are that basis is an orthogonal matrix. It is automatically the case that the columns are another orthonormal basis.” (from Wolfram MathWorld)
    • Let A and B be orthogonal matrices:
      • A^TA = I = AA^T
      • \|A\vec{v}\|_2 = \|\vec{v}\|_2
      • \langle \vec{v},\vec{u} \rangle = \langle A\vec{v},A\vec{u} \rangle
      • \det(A)=\pm 1 (when it equals +1, A is a rotation matrix; o.w. A is a reflection matrix)
      • A + B and A^{-1} are both orthogonal matrices
  12. Eigenvectors and Eigenvalues
    1. when a transformation only scale or reverse the vector but doesn’t change the direction of the vector (except reverse), we say the vector is a eigenvector of the transformation
    2. A\vec{v} = \lambda\vec{v} where \lambda is the eigenvalue associates with the eigenvector \vec{v}
    3. when we assume there is at least one eigenvector, we can use this equation to find it: \det(\lambda I - A) = 0
    4. eigenvalue decomposition
      •  Definition: “Let P be a matrix of eigenvectors of a given square matrix A and D be a diagonal matrix with the corresponding eigenvalues on the diagonal. Then, as long as P is a square matrix, A can be written as an eigen decomposition A = PDP^{-1}. Furthermore, if A is symmetric, then the columns of P are orthogonal vectors.” (from Wolfram MathWorld)
    5. Properties
      • tr(A) = \sum\limits_{i=1}^n\lambda_i
      • |A| = \prod\limits_{i=1}^n\lambda_i
  13. Singular Value Decomposition

Loss Functions

According to Wikipedia, “a loss function or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some ‘cost’ associated with the event.”(link) The definition works well in Machine Learning.


0/1 loss: 1\{f(x_i)\not=y_i\}

However, 0/1 loss is not convex, thus hard to optimize. To work around this problem, we use surrogate loss functions. These are convex functions that simulate the 0/1 loss and will always be the upper bound of 0/1 loss:

Hinge loss: max(0, 1-f(x_i)y_i)

Logistic loss: \frac{1}{\ln2}\ln(1+e^{-f(x_i)y_i})



Exponential loss:

Squared loss:


  1. L1 loss / absolute loss: |y_i-f(x_i)|
  2. L2 loss / squared loss: (y_i-f(x_i))^2
    • Gini index