Review on Information Theory

  1. Entropy
    • Entropy \mathrm{H} of a discrete random variable X with possible values {x_1, ..., x_n} and probability mass function P(X) is: \mathrm{H}(X) = \mathbb{E}[\mathrm{I}(X)] = \mathbb{E}[-\log_b{P(X)}] = - \sum\limits_{i=1}^nP(x_i)\log_b{P(x_i)}
  2. Kullback–Leibler divergence (relative entropy)
    • For discrete probability distributions P and Q, the Kullback–Leibler divergence from Q to P is defined to be: D_{KL}(P\|Q) = \sum\limits_{i}P(x_i)\log_b{\frac{P(x_i)}{Q(x_i)}}
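
To make the two definitions above concrete, here is a minimal Python sketch (NumPy assumed; the helper names entropy and kl_divergence are my own) that evaluates both quantities for discrete pmfs:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(X) = -sum_i p_i log_b p_i of a discrete pmf."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                   # convention: 0 * log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

def kl_divergence(p, q, base=2):
    """D_KL(P || Q) = sum_i p_i log_b(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base)

print(entropy([0.5, 0.5]))                    # a fair coin carries 1 bit
print(entropy([0.9, 0.1]))                    # a biased coin carries less (~0.47 bits)
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # > 0; D_KL is 0 only when P = Q
```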

Review on Calculus

  1. Derivatives
  2. Matrix calculus
    • Derivative: for f:\mathbb{R}^{m\times n}\to\mathbb{R}, (\frac{\partial f(A)}{\partial A})_{i,j} = \frac{\partial f(A)}{\partial A_{i,j}} where A,\frac{\partial f(A)}{\partial A} \in \mathbb{R}^{m \times n}
    • Hessian: for f:\mathbb{R}^m \to \mathbb{R}, \mathbf{H}_{i,j} = \frac{\partial^2 f}{\partial x_i \partial x_j} where \mathbf{H}\in \mathbb{R}^{m\times m}
  3. Taylor series
    • f(x) = f(c) + \frac{f'(c)}{1!}(x-c) + \frac{f''(c)}{2!}(x-c)^2 + \cdots + \frac{f^{(n)}(c)}{n!}(x-c)^n + \cdots = \sum\limits_{n=0}^{\infty}\frac{f^{(n)}(c)}{n!}(x-c)^n, where c is the point of expansion (see the numerical sketch after this list)
    • Lagrange error bound: if |f^{(n+1)}(x)|\le M on an interval (c-r, c+r) for some r > 0, then |R_n(x)|\le M\frac{|x-c|^{n+1}}{(n+1)!}\le M\frac{r^{n+1}}{(n+1)!} for all x\in(c-r,c+r).
  4. Exponential
    • The Lambert W function is the inverse relation of the function f(x) = xe^x, so W(xe^x) = x, or equivalently W(xe^x)e^{W(xe^x)} = f(x)
      • W(0) = 0
      • W(1)=\Omega\approx 0.5671
      • W(e) = 1
  5. Some helpful inequalities:
    • for non-negative A, B, C, A \le B + C\sqrt{A} \Rightarrow A \le B + C^2 + C\sqrt{B}
    • Taylor expansion inequalities:
      • e^x \ge 1 + x
      • e^x \le 1 + x + \frac{x^2}{2} for x \le 0
      • e^x \le 1 + x + (e-2)x^2 for x \le 1
      • e^x + e^{-x} \le 2e^{x^2}
      • \frac{1}{2}e^x +\frac{1}{2}e^{-x} \le e^{\frac{1}{2}x^2}
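
As a quick numerical check of the Taylor-series items above, the following sketch (standard library only; the helper taylor_exp and the chosen point x, degree n, and radius r are my own) expands e^x around c = 0, verifies that the true remainder stays below the Lagrange error bound, and spot-checks two of the listed exponential inequalities:

```python
import math

def taylor_exp(x, n, c=0.0):
    """Degree-n Taylor polynomial of e^x around c (every derivative of e^x is e^x)."""
    return sum(math.exp(c) / math.factorial(k) * (x - c) ** k for k in range(n + 1))

x, n, r = 0.8, 4, 1.0                      # expand around c = 0 on the interval (-r, r)
M = math.exp(r)                            # |f^(n+1)(x)| = e^x <= e^r on (-r, r)
actual_error = abs(math.exp(x) - taylor_exp(x, n))
lagrange_bound = M * abs(x) ** (n + 1) / math.factorial(n + 1)
print(actual_error, lagrange_bound, actual_error <= lagrange_bound)   # True

# Spot-check two of the exponential inequalities listed above.
for t in (-2.0, -0.5, 0.3, 1.0):
    assert math.exp(t) >= 1 + t
    assert math.exp(t) + math.exp(-t) <= 2 * math.exp(t ** 2)
```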

Review on Probabilities

  1. Events
    • \mathbb{P}[A\ or\ B] = \mathbb{P}[A\cup B] \le \mathbb{P}[A] + \mathbb{P}[B] (the union bound)
    • If A \Rightarrow B, then \mathbb{P}[A] \le \mathbb{P}[B]
  2. Conditional Probabilities
    • \mathbb{P}[A|B] = \frac{\mathbb{P}[A,B]}{\mathbb{P}[B]}
    • Since \mathbb{P}[A_n,...,A_1] = \mathbb{P}[A_n|A_{n-1},...,A_1] \cdot \mathbb{P}[A_{n-1},...,A_1], \mathbb{P}[\bigcap\limits_{k=1}^nA_k] = \prod\limits_{k=1}^n\mathbb{P}[A_k|\bigcap\limits_{j=1}^{k-1}A_j]
  3. Expectations
    • E[X] = \int xf(x)dx
    • E[c] = c
    • E[X + c] = E[X] + c
    • E[cX] = cE[X]
    • E[X+Y] = E[X] + E[Y]
    • E[XY] = E[X]E[Y] if X and Y are independent
    • E[\vec{X}] = (E[X_1] \ E[X_2] \  ... \  E[X_d])^T
    • If X \ge 0, E[X] =\int_0^\infty \mathbb{P}[X\ge t]dt
  4. Variance
    • Var(X) = \int(x-\mu)^2f(x)dx
    • Var(X) = E[(X - E[X])^2]
    • Var(X) = E[X^2] - (E[X])^2
    • Var(c) = 0
    • Var(X + c) = Var(X)
    • Var(cX) = c^2Var(X)
    • Var(X+Y) = Var(X) + Var(Y) if X and Y are independent
    • Var(\sum\limits_{i=1}^NX_i) = \sum\limits_{i=1}^NVar(X_i) if X_1, ..., X_N are independent
  5. Covariance
    • Cov(X,Y) = \int\int(x-E[x])(y-E[y])f(x,y)dxdy
    • Cov(X,Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]
    • Var(X) = Cov(X,X)
    • Cov(X,c) = 0
    • Cov(aX, bY) = abCov(X,Y)
    • Cov(\vec{X}) = E[(\vec{X} - E[\vec{X}])(\vec{X} - E[\vec{X}])^T] = E[\vec{X}\vec{X}^T] - E[\vec{X}](E[\vec{X}])^T
    • Cov(\vec{X}|\vec{Y}) = E[(\vec{X} - E[\vec{X}|\vec{Y}])(\vec{X} - E[\vec{X}|\vec{Y}])^T|\vec{Y}]
  6. Mean Square Error: MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = E[(\hat{\theta}-E[\hat{\theta}])^2] + (E[\hat{\theta}]-\theta)^2 = Var(\hat{\theta}) + Bias(\hat{\theta})^2; a more detailed explanation can be found on Wikipedia.
  7. Law of Total Expectation E[X] = E[E[X|Y]]
  8. Inequalities
    • Jensen: for convex f, f(E[X]) \le E[f(X)]; for concave f, f(E[X]) \ge E[f(X)]
    • Markov: If X \ge 0, then for all t > 0, \mathbb{P}[X\ge t] \le \frac{E[X]}{t}
    • Chebyshev-Cantelli: for t>0, \mathbb{P}[X-E[X]\ge t] \le \frac{Var(X)}{Var(X) + t^2}
    • Chebyshev’s Association: let f and g be nondecreasing real-valued functions defined on the real line. If X is a real-valued random variable, then E[f(X)g(X)] \ge E[f(X)]E[g(X)]; if instead f is nonincreasing and g is nondecreasing, then E[f(X)g(X)] \le E[f(X)]E[g(X)]
    • Harris’: extends Chebyshev’s Association to functions f,g:\mathbb{R}^n\rightarrow \mathbb{R}
    • Chernoff bound: for any s > 0 and any t\in \mathbb{R}, and for any random variable X, \mathbb{P}[X\ge t] = \mathbb{P}[e^{sX}\ge e^{st}] \le \frac{E[e^{sX}]}{e^{st}}
    • Cauchy-Schwarz: if E[X^2]<\infty and E[Y^2]<\infty, then |E[XY]|\le \sqrt{E[X^2]E[Y^2]}
    • Hoeffding’s tail: Let X_1,...,X_n be independent bounded random variables such that X_i falls in the interval [a_i,b_i] with probability one, and let S_n = \sum\limits_{i=1}^nX_i. Then for any t>0, \mathbb{P}[S_n-E[S_n]\ge t] \le \exp(\frac{-2t^2}{\sum\limits_{i=1}^n(b_i-a_i)^2}) and \mathbb{P}[S_n-E[S_n]\le -t] \le \exp(\frac{-2t^2}{\sum\limits_{i=1}^n(b_i-a_i)^2}) (see the simulation sketch after this list)
  9. Other
    • If \mathbb{P}[X > t]\le F(t) for a decreasing, invertible F, then with probability at least 1 - \delta, X\le F^{-1}(\delta).
    • Moment generating function of a random variable X is M_X(t) = E[e^{tX}] where t\in\mathbb{R}
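
The tail bounds above can be sanity-checked by simulation. The sketch below (NumPy assumed; the Uniform[0,1] setup and the chosen n, t are my own) estimates \mathbb{P}[S_n - E[S_n] \ge t] by Monte Carlo and compares it with Hoeffding's bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 100, 5.0, 100_000

# X_i ~ Uniform[0, 1], so [a_i, b_i] = [0, 1] and E[S_n] = n / 2.
S_n = rng.uniform(0.0, 1.0, size=(trials, n)).sum(axis=1)

empirical = np.mean(S_n - n / 2 >= t)
hoeffding = np.exp(-2 * t ** 2 / n)        # sum_i (b_i - a_i)^2 = n here

print(f"empirical tail: {empirical:.4f}")  # roughly 0.04 in this setup
print(f"Hoeffding bound: {hoeffding:.4f}") # exp(-0.5) ~ 0.61, comfortably above
```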

Review on Linear Algebra

Since linear algebra is broadly used in Machine Learning, here are the concepts that I think are important.

  1. Dot Product
    • the dot product of two vectors can be seen as linearly projecting one onto the 1D line defined by the other
    • \vec{v} \cdot \vec{u} = |\vec{v}||\vec{u}|\cos\theta
    • \vec{v} \cdot \vec{u} = \vec{u} \cdot \vec{v}
    • c(\vec{v}\cdot\vec{u}) = (c\vec{v})\cdot\vec{u}
    • (c\vec{v} + d\vec{u})\cdot\vec{w} = (c\vec{v})\cdot\vec{w} + (d\vec{u})\cdot\vec{w}
    • |\vec{v}|^2 =\vec{v}\cdot\vec{v} = \vec{v}^T\vec{v} =\langle\vec{v},\vec{v}\rangle
    • \vec{v} \cdot \vec{u} = 0 if and only if \vec{v} and \vec{u} are orthogonal to each other
  2. Matrix Multiplication
    • AB\not= BA in general
    • A(B+C) = AB + AC
    • (A+B)C = AC + BC
    • c(AB) = (cA)B = A(cB)
    • A(BC) = (AB)C
    • AI = A = IA
    • AA^{-1} = I = A^{-1}A
    • (AB)^{-1} = B^{-1}A^{-1}
    • det(AB) = det(A) det(B)
    • A [\vec{v}_1\:\vec{v}_2\:\vec{v}_3] = [A\vec{v}_1\:A\vec{v}_2\:A\vec{v}_3]
  3. Transpose
    • (A^T)^T = A
    • (A+B)^T = A^T + B^T
    • (cA)^T = cA^T
    • (AB)^T = B^TA^T
    • (A^{-1})^T = (A^T)^{-1}
  4. Vector Differentiation
    • \frac{d(x^TAx)}{d(x)} = x^T(A^T + A)
  5. Determinants
    • the determinant of two 2D vectors is the signed area of the parallelogram formed by these two vectors (\det(\vec{a}\vec{b}) = \left| \begin{array}{cc} \vec{a}_1 & \vec{a}_2 \\ \vec{b}_1 & \vec{b}_2 \end{array} \right| )
    • determinant of three 3D vectors is the signed volume of the parallelepiped formed by these three vectors
    • if the determinant of a matrix A is 0, then A is singular. Below are some more properties of the determinant of a matrix:
    • |aA| = a^d|A| for A\in\mathbb{R}^{d\times d}
    • |AB| = |A||B|
    • |A| = |A^T|
    • |A| = \frac{1}{|A^{-1}|}
  6. p-norm (l_p-norm)
    • for p \ge 1, \vec{x} \in \mathbb{R}^n, \|\vec{x}\|_p = (\sum\limits_{i=1}^n |x_i|^p)^{\frac{1}{p}}
  7. Rank of a matrix (for real matrix A\in\mathbb{R}^{m\times n})
    • columnrank(A) = rowrank(A) = rank(A)
    • rank(A) = rank(A^T)
    • rank(A) \le \min(m,n)
    • rank(AB) \le \min(rank(A), rank(B))
    • rank(A+B) \le rank(A) + rank(B)
    • rank(A^TA) = rank(AA^T) = rank(A)
  8. Symmetric Matrices
    • A symmetric matrix A is positive semidefinite if \vec{v}^TA\vec{v} \ge 0 for all vectors \vec{v}
      • A symmetric matrix A is positive semidefinite iff all of its eigenvalues are nonnegative
      • Any matrix of the form A=U^TU is positive semidefinite
  9. Trace (for square matrix)
    • tr(A) = \sum\limits_{i=1}^nA_{ii} where A\in\mathbb{R}^{n\times n}
    • tr(A) = tr(A^T)
    • tr(A+B) = tr(A) + tr(B)
    • tr(cA) = ctr(A)
    • tr(AB) = tr(BA) = tr(B^TA^T) = tr(A^TB^T)
  10. Linear Transformation
  11. Invertibility of a matrix
    • A square matrix is invertible iff it is full rank
  12. Orthogonal Matrix
    • “an orthogonal matrix is a square matrix with real entries whose columns and rows are orthogonal unit vectors” (from wikipedia)
    • “The rows of an orthogonal matrix are an orthonormal basis. That is, each row has length one, and are mutually perpendicular. Similarly, the columns are also an orthonormal basis. In fact, given any orthonormal basis, the matrix whose rows are that basis is an orthogonal matrix. It is automatically the case that the columns are another orthonormal basis.” (from Wolfram MathWorld)
    • Let A and B be orthogonal matrices:
      • A^TA = I = AA^T
      • \|A\vec{v}\|_2 = \|\vec{v}\|_2
      • \langle \vec{v},\vec{u} \rangle = \langle A\vec{v},A\vec{u} \rangle
      • \det(A)=\pm 1 (when it equals +1, A is a rotation matrix; o.w. A is a reflection matrix)
      • AB and A^{-1} are both orthogonal matrices
  13. Eigenvectors and Eigenvalues
    1. when a transformation only scales or reverses a vector without changing its direction (other than reversing it), we say the vector is an eigenvector of the transformation
    2. A\vec{v} = \lambda\vec{v} where \lambda is the eigenvalue associated with the eigenvector \vec{v}
    3. assuming there is at least one eigenvector, the eigenvalues can be found by solving the characteristic equation \det(\lambda I - A) = 0; the corresponding eigenvectors are then obtained from (A - \lambda I)\vec{v} = \vec{0}
    4. eigenvalue decomposition
      •  Definition: “Let P be a matrix of eigenvectors of a given square matrix A and D be a diagonal matrix with the corresponding eigenvalues on the diagonal. Then, as long as P is a square matrix, A can be written as an eigen decomposition A = PDP^{-1}. Furthermore, if A is symmetric, then the columns of P are orthogonal vectors.” (from Wolfram MathWorld)
    5. Properties
      • tr(A) = \sum\limits_{i=1}^n\lambda_i
      • |A| = \prod\limits_{i=1}^n\lambda_i
  14. Singular Value Decomposition
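
To tie the eigendecomposition and SVD items above together, here is a minimal NumPy sketch (the example matrix is my own) that decomposes a symmetric matrix, checks that tr(A) equals the sum of the eigenvalues and \det(A) their product, and verifies an SVD reconstruction:

```python
import numpy as np

# A symmetric example matrix, so its eigenvectors can be chosen orthonormal.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

eigvals, P = np.linalg.eigh(A)             # columns of P are orthonormal eigenvectors
D = np.diag(eigvals)

print(np.allclose(A, P @ D @ P.T))                   # A = P D P^{-1}, and P^{-1} = P^T here
print(np.isclose(np.trace(A), eigvals.sum()))        # tr(A) = sum of eigenvalues
print(np.isclose(np.linalg.det(A), eigvals.prod()))  # det(A) = product of eigenvalues

# Singular value decomposition: any real matrix factors as U diag(s) V^T.
U, s, Vt = np.linalg.svd(A)
print(np.allclose(A, U @ np.diag(s) @ Vt))
```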