Review on Measure Theory and Set Theory

This is an evolving note; new concepts will be added over time.

  1. Power set 2^S is the set of all subsets of S
  2. Sigma-algebra
    • Given a set \Omega, a \sigma-algebra on \Omega is a nonempty collection \Sigma \subseteq 2^\Omega that is 1) closed under complements and 2) closed under countable unions. It follows that:
      • \Omega \in\Sigma
      • \emptyset \in\Sigma
      • \Sigma is closed under countable intersections
    • Examples of \sigma-algebras
      • \{\emptyset, \Omega\}
      • \{\emptyset, E, E^C, \Omega\} for any E \subseteq \Omega
      • 2^\Omega
      • Any intersection of \sigma-algebras is a \sigma-algebra
  3. Measure
    • A measure \mu on \Omega with \sigma-algebra \Sigma is a function \mu: \Sigma\to [0,\infty] s.t. 1) \mu(\emptyset)=0; 2) \mu(\bigcup\limits_{i=1}^{\infty}E_i) =\sum\limits_{i=1}^{\infty}\mu(E_i) for every countable collection \{E_i\}^\infty_{i=1} of pairwise disjoint sets in \Sigma (see the sketch after this list)
    • A probability measure is a measure P s.t. P(\Omega)=1
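
To make the definitions concrete, here is a minimal Python sketch (an assumed example, not part of the original note): it represents the \{\emptyset, E, E^C, \Omega\} example as a set of frozensets, checks the \sigma-algebra axioms on a finite \Omega, and uses the counting measure \mu(A) = |A|.

```python
# Minimal sketch (assumed example): the sigma-algebra {emptyset, E, E^C, Omega}
# on a finite Omega, with the counting measure mu(A) = |A|.
omega = frozenset({1, 2, 3})
E = frozenset({1})
sigma = {frozenset(), E, omega - E, omega}

def is_sigma_algebra(sigma, omega):
    # On a finite Omega, countable unions reduce to pairwise unions.
    return (omega in sigma
            and all(omega - A in sigma for A in sigma)
            and all(A | B in sigma for A in sigma for B in sigma))

mu = len  # counting measure: mu(emptyset) = 0, additive on disjoint sets

assert is_sigma_algebra(sigma, omega)
assert mu(frozenset()) == 0
assert mu(E | (omega - E)) == mu(E) + mu(omega - E)  # additivity on disjoint sets
```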

Review on Information Theory

This is an evolving note; new concepts will be added over time.

  1.  Entropy
    • Entropy \mathrm{H} of a discrete random variable X with possible values {x_1, ..., x_n} and probability mass function P(X) is: \mathrm{H}(X) = \mathbb{E}[\mathrm{I}(X)] = \mathbb{E}[-\log_b{P(X)}] = - \sum\limits_{i=1}^nP(x_i)\log_b{P(x_i)}
  2. Kullback–Leibler divergence (relative entropy)
    • For discrete probability distributions P and Q, the Kullback–Leibler divergence from Q to P is defined to be: D_{KL}(P\|Q) = \sum\limits_{i}P(x_i)\log_b{\frac{P(x_i)}{Q(x_i)}}
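
As a quick numerical sanity check of both definitions (an assumed example, not from the note), the following computes entropy and KL divergence with base b = 2: a fair coin has one bit of entropy, and D_{KL}(P\|Q) is positive unless P = Q.

```python
# Numerical check (assumed example): entropy of a fair coin is 1 bit, and
# D_KL(P||Q) >= 0 with equality iff P = Q.
import math

def entropy(p, base=2):
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def kl_divergence(p, q, base=2):
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

print(entropy([0.5, 0.5]))                     # 1.0
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # > 0
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0
```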

Review on Calculus

This is an evolving note; new concepts will be added over time.

  1. Derivatives
    • \frac{d}{dx}cx = c
    • \frac{d}{dx}x^n = nx^{n-1}
    • \frac{d}{dx}[f(x)+g(x)] = f'(x)+g'(x)
    • \frac{d}{dx}[f(x)g(x)] = f'(x)g(x) + f(x)g'(x)
    • \frac{d}{dx}\frac{f(x)}{g(x)} = \frac{f'(x)g(x)-f(x)g'(x)}{g(x)^2}
    • \frac{d}{dx}e^x = e^x
    • \frac{d}{dx}\log_a(x) = \frac{1}{x\ln(a)}
    • \frac{d}{dx}a^x = a^x\ln(a)
    • \frac{d}{dx} \sin(x) = \cos(x)
    • \frac{d}{dx} \cos(x) = -\sin(x)
    • \frac{d}{dx} \tan(x) = \sec^2(x)
    • Chain rule: \frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx} for y = y(u) and u = u(x)
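
The rules in this table can be spot-checked symbolically; a small sketch, assuming sympy is available:

```python
# Symbolic spot-check of the derivative table above (assumes sympy).
import sympy as sp

x = sp.symbols('x')
f, g = sp.Function('f')(x), sp.Function('g')(x)

assert sp.diff(x**3, x) == 3 * x**2                                                  # power rule
assert sp.simplify(sp.diff(f * g, x) - (f.diff(x) * g + f * g.diff(x))) == 0         # product rule
assert sp.simplify(sp.diff(f / g, x) - (f.diff(x) * g - f * g.diff(x)) / g**2) == 0  # quotient rule
assert sp.simplify(sp.diff(sp.log(x, 10), x) - 1 / (x * sp.log(10))) == 0            # log rule
assert sp.diff(sp.sin(x), x) == sp.cos(x)                                            # trig rule
```
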
  2. Matrix calculus
    • Derivative: for f:\mathbb{R}^{m\times n}\to\mathbb{R}, (\frac{\partial f(A)}{\partial A})_{i,j} = \frac{\partial f(A)}{\partial A_{i,j}} where A,\frac{\partial f(A)}{\partial A} \in \mathbb{R}^{m \times n}
    • Hessian: for f:\mathbb{R}^m \to \mathbb{R}, \mathbf{H}_{i,j} = \frac{\partial^2 f}{\partial x_i \partial x_j} where \mathbf{H}\in \mathbb{R}^{m\times m}
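
A finite-difference sketch (an assumed example) illustrating both definitions on the quadratic form f(x) = x^T A x, whose gradient is (A + A^T)x and whose Hessian is A + A^T:

```python
# Finite-difference check (assumed example): for f(x) = x^T A x,
# the gradient is (A + A^T) x and the Hessian is A + A^T.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x0 = rng.standard_normal(4)
f = lambda x: x @ A @ x
eps = 1e-4

# Central differences; exact for a quadratic up to floating-point roundoff.
grad_fd = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps) for e in np.eye(4)])
assert np.allclose(grad_fd, (A + A.T) @ x0, atol=1e-5)

hess_fd = np.array([[(f(x0 + eps * ei + eps * ej) - f(x0 + eps * ei - eps * ej)
                      - f(x0 - eps * ei + eps * ej) + f(x0 - eps * ei - eps * ej)) / (4 * eps**2)
                     for ej in np.eye(4)] for ei in np.eye(4)])
assert np.allclose(hess_fd, A + A.T, atol=1e-5)
```
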
  3. Taylor series
    • f(x) = f(c) + \frac{f'(c)}{1!}(x-c) + \frac{f''(c)}{2!}(x-c)^2 + \cdots + \frac{f^{(n)}(c)}{n!}(x-c)^n + \cdots = \sum\limits_{n=0}^{\infty}\frac{f^{(n)}(c)}{n!}(x-c)^n for some constant c
    • n-th Taylor polynomial: P_n(x) =\sum\limits_{i=0}^{n}\frac{f^{(i)}(c)}{i!}(x-c)^i
    • error of the n-th Taylor polynomial: R_n(x) = f(x) - P_n(x)
    • Lagrange error bound: if |f^{(n+1)}(x)|\le M on an interval (c-r, c+r) for some r > 0, then |R_n(x)|\le M\frac{|x-c|^{n+1}}{(n+1)!}\le M\frac{r^{n+1}}{(n+1)!} for all x\in(c-r,c+r).
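
A small numerical illustration (an assumed example) of the Taylor polynomial and the Lagrange bound for f(x) = e^x expanded at c = 0:

```python
# Taylor polynomial of e^x at c = 0 (assumed example): the actual error is
# within the Lagrange bound M r^{n+1} / (n+1)!.
import math

def taylor_exp(x, n, c=0.0):
    return sum(math.exp(c) / math.factorial(i) * (x - c)**i for i in range(n + 1))

x, n, r = 0.8, 4, 1.0
M = math.e  # |f^{(n+1)}(x)| = e^x <= e on (-1, 1)
actual = abs(math.exp(x) - taylor_exp(x, n))
bound = M * r**(n + 1) / math.factorial(n + 1)
assert actual <= bound
print(actual, bound)
```
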
  4. Exponentials
    • Properties
      • (x^a)^b = x^{ab}
      • x^a\cdot x^b = x^{a+b}
      • \frac{x^a}{x^b} = x^{a-b}
    • Lambert W function is the inverse relation of the function f(x) = xe^x, so W(xe^x) = x, or equivalently W(y)e^{W(y)} = y
      • W(0) = 0
      • W(1)=\Omega\approx 0.5671 (the omega constant)
      • W(e) = 1
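
These special values can be verified numerically, assuming scipy is available:

```python
# Spot-check of the Lambert W special values above (assumes scipy).
from scipy.special import lambertw
import math

assert abs(lambertw(0).real - 0.0) < 1e-12
assert abs(lambertw(1).real - 0.5671) < 1e-4   # the omega constant
assert abs(lambertw(math.e).real - 1.0) < 1e-12
```
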
  5. Logarithms
    • Properties
      • \log(x^a) = a\log(x)
      • \log_a(x) = \frac{\log_b(x)}{\log_b(a)}
      • \log(xy) = \log(x) + \log(y)
      • \log(\frac{x}{y}) = \log(x) - \log(y)
      • \log_a(a^x) = x
      • \log(1) = 0
  6. Some helpful inequalities:
    • for non-negative A, B, C, A \le B + C\sqrt{A} \Rightarrow A \le B + C^2 + C\sqrt{B}
    • Taylor expansion inequalities:
      • e^x \ge 1 + x
      • e^x \le 1 + x + \frac{x^2}{2} for x \le 0
      • e^x \le 1 + x + x^2 for x \le 1
      • e^x + e^{-x} \le 2e^{x^2}
      • \frac{1}{2}e^x +\frac{1}{2}e^{-x} \le e^{\frac{x^2}{2}}
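
A grid-based spot-check of the Taylor expansion inequalities above (an assumed example):

```python
# Numerical spot-check of the exponential inequalities on a grid (assumed example).
import numpy as np

x = np.linspace(-5, 5, 1001)
assert np.all(np.exp(x) >= 1 + x)
xa = x[x <= 0]
assert np.all(np.exp(xa) <= 1 + xa + xa**2 / 2)
xb = x[x <= 1]
assert np.all(np.exp(xb) <= 1 + xb + xb**2)
assert np.all(np.exp(x) + np.exp(-x) <= 2 * np.exp(x**2))
assert np.all(0.5 * np.exp(x) + 0.5 * np.exp(-x) <= np.exp(x**2 / 2))
```
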

Review on Probabilities

This is an evolving note; new concepts will be added over time.

  1.  Events
    • \mathbb{P}[A\  or\  B] = \mathbb{P}[A\bigcup B] = \mathbb{P}[A] + \mathbb{P}[B] - \mathbb{P}[A \bigcap B] \le \mathbb{P}[A] + \mathbb{P}[B]
    • \mathbb{P}[A\ and\ B] = \mathbb{P}[A\bigcap B]
    • If A \Rightarrow B, then \mathbb{P}[A] \le \mathbb{P}[B]
  2.  Expectations
    • E[X] = \int xf(x)dx
    • E[c] = c
    • E[X + c] = E[X] + c
    • E[cX] = cE[X]
    • E[X+Y] = E[X] + E[Y]
    • E[XY] = E[X]E[Y] if X and Y are independent
    • E[\vec{X}] = (E[X_1] \ E[X_2] \  ... \  E[X_d])^T
    • If X \ge 0, E[X] =\int_0^\infty \mathbb{P}[X\ge t]dt
    • Law of large numbers: for an infinite sequence of i.i.d. random variables X_1, X_2, \ldots, the sample mean \bar{X}_n = \frac{1}{n}(X_1+X_2+...+X_n) converges to E[X_i] as n approaches \infty (simulated in the sketch below)
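
A quick simulation of the law of large numbers, as referenced above (an assumed example, using Exp(1) draws so that E[X_i] = 1):

```python
# Sample means of i.i.d. Exp(1) draws approach E[X] = 1 as n grows (assumed example).
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 1_000, 100_000):
    print(n, rng.exponential(1.0, size=n).mean())
```
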
  3. Conditional Probabilities and Conditional Expectations
    • \mathbb{P}[A|B] = \frac{\mathbb{P}[A,B]}{\mathbb{P}[B]}
    • Since \mathbb{P}[A_n,...,A_1] = \mathbb{P}[A_n|A_{n-1},...,A_1] \cdot \mathbb{P}[A_{n-1},...,A_1], \mathbb{P}[\bigcap\limits_{k=1}^nA_k] = \prod\limits_{k=1}^n\mathbb{P}[A_k|\bigcap\limits_{j=1}^{k-1}A_j]
    • Conditional density of X given Y: f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}
    • Conditional expectation of X given Y: E[X|Y=y] = \int^\infty_{-\infty} x f_{X|Y}(x|y)dx
    • Law of Total Expectation: E[X] = E_Y[E_{X|Y}[X|Y]] (checked numerically below)
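
A Monte Carlo sanity check of the law of total expectation (an assumed example): with Y \sim N(0,1) and X|Y \sim N(Y,1), E[X|Y] = Y, so E[X] = E_Y[E[X|Y]] = E[Y] = 0.

```python
# Monte Carlo check (assumed example) of E[X] = E_Y[E[X|Y]].
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
y = rng.standard_normal(n)
x = y + rng.standard_normal(n)   # X | Y ~ N(Y, 1), so E[X|Y] = Y
print(x.mean(), y.mean())        # both near E[X] = E[Y] = 0
```
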
  4. Variance
    • Var(X) = \int(x-\mu)^2f(x)dx
    • Var(X) = E[(X - E[X])^2]
    • Var(X) = E[X^2] - (E[X])^2
    • Var(c) = 0
    • Var(X + c) = Var(X)
    • Var(cX) = c^2Var(X)
    • Var(X+Y) = Var(X) + Var(Y) if X and Y are independent
    • Var(\sum\limits_{i=1}^NX_i) = \sum\limits_{i=1}^NVar(X_i) if X_1,\ldots,X_N are independent
  5. Covariance
    • Cov(X,Y) = \int\int(x-E[X])(y-E[Y])f(x,y)dxdy
    • Cov(X,Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]
    • Var(X) = Cov(X,X)
    • Cov(X,c) = 0
    • Cov(aX, bY) = abCov(X,Y)
    • Cov(\vec{X}) = E[(\vec{X} - E[\vec{X}])(\vec{X} - E[\vec{X}])^T] = E[\vec{X}\vec{X}^T] - E[\vec{X}](E[\vec{X}])^T
    • Cov(\vec{X}|\vec{Y}) = E[(\vec{X} - E[\vec{X}|\vec{Y}])(\vec{X} - E[\vec{X}|\vec{Y}])^T|\vec{Y}]
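
The independence identity for variances and the covariance-matrix identity can be checked empirically (an assumed example):

```python
# Empirical check (assumed example): Var(X+Y) = Var(X) + Var(Y) for independent
# X, Y, and Cov(X) = E[X X^T] - E[X] E[X]^T for a random vector X.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
x = rng.standard_normal(n)
y = rng.exponential(2.0, size=n)             # independent of x
print(np.var(x + y), np.var(x) + np.var(y))  # nearly equal

X = rng.standard_normal((n, 3)) * np.array([1.0, 2.0, 0.5]) + np.array([0.0, 1.0, -1.0])
mu = X.mean(axis=0)
lhs = (X.T @ X) / n - np.outer(mu, mu)       # empirical E[XX^T] - E[X]E[X]^T
assert np.allclose(lhs, np.cov(X.T, bias=True), atol=1e-2)
```
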
  6. Mean Square Error: MSE(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = E[(\hat{\theta}-E[\hat{\theta}])^2] + (E[\hat{\theta}]-\theta)^2 = Var(\hat{\theta}) + Bias(\hat{\theta})^2; a more detailed explanation can be found on Wikipedia.
  7. Inequalities
    • Boole’s inequality (Union bound): for a countable set of events A_1, A_2,..., \mathbb{P}[\bigcup\limits_i A_i] \le \sum\limits_{i} \mathbb{P}[A_i]
    • Jensen: for convex f, f(E[X]) \le E[f(X)]; for concave f, f(E[X]) \ge E[f(X)]
    • Markov: If X \ge 0 then for all t >0, \mathbb{P}[X\ge t] \le \frac{E[X]}{t}
    • Chebyshev-Cantelli: for t>0, \mathbb{P}[X-E[X]\ge t] \le \frac{Var(X)}{Var(X) + t^2}
    • Chebyshev’s Association: let f and g be real-valued functions on the real line, both nondecreasing or both nonincreasing. If X is a real-valued random variable, then E[f(X)g(X)] \ge E[f(X)]E[g(X)]
      • if instead one of f, g is nondecreasing and the other is nonincreasing, then E[f(X)g(X)] \le E[f(X)]E[g(X)]
    • Harris’: extends Chebyshev’s Association to functions f,g:\mathbb{R}^n\rightarrow \mathbb{R}
    • Chernoff bound: for any t\in \mathbb{R} and any s > 0, for any random variable X, \mathbb{P}[X\ge t] = \mathbb{P}[e^{sX}\ge e^{st}] \le \frac{E[e^{sX}]}{e^{st}} (by Markov’s inequality)
    • Cauchy-Schwarz: if E[X^2]<\infty and E[Y^2]<\infty, then |E[XY]|\le \sqrt{E[X^2]E[Y^2]}
    • Hoeffding’s lemma: Let X be a random variable with E[X]=0 and a \le X \le b. Then for s>0, E[\exp(sX)] \le \exp(s^2\frac{(b-a)^2}{8})
    • Hoeffding’s tail: Let X_1,...,X_n be independent bounded random variables such that X_i falls in the interval [a_i,b_i] with probability one, and let S_n = \sum\limits_{i=1}^n X_i. Then for any t>0, \mathbb{P}[S_n-E[S_n]\ge t] \le \exp\Big(\frac{-2t^2}{\sum\limits_{i=1}^n(b_i-a_i)^2}\Big) and \mathbb{P}[S_n-E[S_n]\le -t] \le \exp\Big(\frac{-2t^2}{\sum\limits_{i=1}^n(b_i-a_i)^2}\Big) (illustrated numerically after this list)
    • Bernstein’s inequality: Let X_1,...,X_n be independent real-valued random variables with 0 mean, and assume that |X_i|\le c with probability 1. Let \sigma^2 = \frac{1}{n}\sum\limits_{i=1}^n Var[X_i]. Then, for any \epsilon > 0, \mathbb{P}[\frac{1}{n}\sum\limits_{i=1}^n X_i > \epsilon] \le \exp(-\frac{n\epsilon^2}{2\sigma^2 + 2c\epsilon/3}).
      • when \sigma^2 < \epsilon, the upper bound behaves like \exp(-n\epsilon) instead of the \exp(-n\epsilon^2) guaranteed by Hoeffding’s tail.
    • McDiarmid’s inequality: let X_1, X_2,...,X_n be i.i.d. random variables, and let f: \mathcal{X}^n \to \mathbb{R} be s.t. \forall i, a_1, a_2, ..., a_n, a'_i, |f(a_1, ..., a_i, ..., a_n) - f(a_1, ..., a'_i, ..., a_n)| \le c_i. Then \mathbb{P}[f(X_1, X_2, ..., X_n) - E[f(X_1, X_2,...,X_n)] \ge \epsilon] \le \exp\Big(-\frac{2\epsilon^2}{\sum\limits_{i=1}^n c_i^2}\Big).
    • Azuma-Hoeffding’s inequality: let X_0, X_1, ..., X_n be a martingale (see the definition of a martingale below) s.t. \forall k, |X_k - X_{k-1}| \le c_k for some constants c_k; then \mathbb{P}[X_n - X_0 \ge \epsilon] \le \exp\Big(- \frac{\epsilon^2}{2\sum\limits_{k=1}^n c_k^2}\Big).
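
To see Hoeffding’s tail in action, as referenced in the list above, here is a simulation with fair Bernoulli variables (an assumed example; each (b_i - a_i)^2 = 1, so the denominator is n):

```python
# Empirical illustration (assumed example): for S_n a sum of n independent fair
# Bernoulli variables, exp(-2 t^2 / n) upper-bounds P[S_n - E[S_n] >= t].
import numpy as np

rng = np.random.default_rng(3)
n, t, trials = 100, 10, 200_000
s = rng.integers(0, 2, size=(trials, n)).sum(axis=1)
empirical = np.mean(s - n / 2 >= t)
bound = np.exp(-2 * t**2 / n)   # sum of (b_i - a_i)^2 equals n for [0,1]-valued X_i
print(empirical, bound)
assert empirical <= bound
```
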
  8. Variables
    • Rademacher random variables: \sigma_1,...,\sigma_n s.t. \mathbb{P}[\sigma_i = 1] = \mathbb{P}[\sigma_i = -1] = 1/2, \forall i \in [n]
      • Rademacher averages: for a set of vectors S \subseteq \mathbb{R}^n, the Rademacher average is defined as \mbox{Rad}(S) = E_{\sigma}[\sup\limits_{v\in S}\langle v,\sigma \rangle_n], where \langle v, \sigma \rangle_n = \frac{1}{n}\sum\limits_{i=1}^n v_i\sigma_i
    • A martingale is a special sequence of random variables
      • A discrete-time martingale is a discrete-time stochastic process X_1, X_2, ... , X_m that satisfies, for all i,
        1. E[|X_i|] < \infty
        2. E[X_{i+1}|\mathcal{F}_i] = X_i, where the filtration \mathcal{F}_i = \sigma(X_1, X_2, ..., X_i) informally represents the information carried by X_1, X_2, ..., X_i
  9.  Other
    • If \mathbb{P}[X > t]\le F(t) for a strictly decreasing F, then with probability at least 1 - \delta, X\le F^{-1}(\delta).
    • Moment generating function of a random variable X is M_X(t) = E[e^{tX}] where t\in\mathbb{R}
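
For instance (an assumed example), for X \sim N(0,1) the moment generating function is M_X(t) = e^{t^2/2}, which a Monte Carlo estimate recovers:

```python
# Monte Carlo check (assumed example): for X ~ N(0,1), E[e^{tX}] = exp(t^2 / 2).
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(2_000_000)
for t in (0.5, 1.0):
    print(np.exp(t * x).mean(), np.exp(t**2 / 2))  # the two columns nearly match
```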

Review on Linear Algebra

This is an evolving note; new concepts will be added over time.

  1. Dot Product
    • the dot product of two vectors can be seen as linearly projecting one vector onto the 1D line defined by the other and scaling by that vector’s length
    • \vec{v} \cdot \vec{u} = |\vec{v}||\vec{u}|\cos\theta
    • \vec{v} \cdot \vec{u} = \vec{u} \cdot \vec{v}
    • c(\vec{v}\cdot\vec{u}) = (c\vec{v})\cdot\vec{u}
    • (c\vec{v} + d\vec{u})\cdot\vec{w} = (c\vec{v})\cdot\vec{w} + (d\vec{u})\cdot\vec{w}
    • |\vec{v}|^2 =\vec{v}\cdot\vec{v} = \vec{v}^T\vec{v} =\langle\vec{v},\vec{v}\rangle
    • \vec{v} \cdot \vec{u} = 0 if and only if \vec{v} and \vec{u} are orthogonal to each other
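
A small numpy check of the cosine formula and the orthogonality criterion (an assumed example):

```python
# numpy check (assumed example): v . u = |v||u| cos(theta), and orthogonal
# vectors have dot product zero.
import numpy as np

v, u = np.array([3.0, 0.0]), np.array([1.0, 1.0])
theta = np.pi / 4   # u is 45 degrees from v (which lies along the x-axis)
assert np.isclose(v @ u, np.linalg.norm(v) * np.linalg.norm(u) * np.cos(theta))
assert np.array([1.0, 0.0]) @ np.array([0.0, 1.0]) == 0.0   # orthogonal pair
```
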
  2. Matrix Multiplication
    • AB\not= BA
    • A(B+C) = AB + AC
    • (A+B)C = AC + BC
    • c(AB) = (cA)B = A(cB)
    • A(BC) = (AB)C
    • AI = A = IA
    • AA^{-1} = I = A^{-1}A
    • (AB)^{-1} = B^{-1}A^{-1}
    • det(AB) = det(A) det(B)
    • A [\vec{v}_1\:\vec{v}_2\:\vec{v}_3] = [A\vec{v}_1\:A\vec{v}_2\:A\vec{v}_3]
  3. Transpose
    • (A^T)^T = A
    • (A+B)^T = A^T + B^T
    • (cA)^T = cA^T
    • (AB)^T = B^TA^T
    • (A^{-1})^T = (A^T)^{-1}
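
Random matrices give a quick spot-check of the multiplication, inverse, and transpose identities above (an assumed example; a random square matrix is invertible with probability one):

```python
# Random-matrix spot-check (assumed example) of the identities above.
import numpy as np

rng = np.random.default_rng(5)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))

assert not np.allclose(A @ B, B @ A)                                           # AB != BA in general
assert np.allclose((A @ B).T, B.T @ A.T)                                       # transpose of a product
assert np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))  # inverse of a product
assert np.allclose(np.linalg.inv(A).T, np.linalg.inv(A.T))                     # (A^{-1})^T = (A^T)^{-1}
```
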
  4. Vector Differentiation
    • \frac{d(x^TAx)}{d(x)} = x^T(A^T + A)
  5. Determinants
    • the determinant of two 2D vectors is the signed area of the parallelogram formed by the two vectors (\det([\vec{a}\ \vec{b}]) = \left| \begin{array}{cc} a_1 & a_2 \\ b_1 & b_2 \end{array} \right|)
    • determinant of three 3D vectors is the signed volume of the parallelepiped formed by these three vectors
    • if the determinant of a matrix A is 0, then A is singular. Below are some more properties of the determinant of a matrix A \in \mathbb{R}^{d\times d}:
    • |aA| = a^d|A|
    • |AB| = |A||B|
    • |A| = |A^T|
    • |A| = \frac{1}{|A^{-1}|}
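
The determinant properties can be checked the same way (an assumed example, with d = 3):

```python
# Determinant property checks with numpy (assumed example), d = 3 here.
import numpy as np

rng = np.random.default_rng(6)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
a, d = 2.0, 3

assert np.isclose(np.linalg.det(a * A), a**d * np.linalg.det(A))
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))
assert np.isclose(np.linalg.det(A), np.linalg.det(A.T))
assert np.isclose(np.linalg.det(A), 1 / np.linalg.det(np.linalg.inv(A)))
```
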
  6. Norms
    • p-norm (l_p-norm): for p \ge 1, \vec{x} \in \mathbb{R}^n, \|\vec{x}\|_p = (\sum\limits_{i=1}^n |x_i|^p)^{\frac{1}{p}}
    • Uniform norm (sup norm): for \vec{x} = (x_1, x_2, \ldots, x_n), \|x\|_\infty = \max(|x_1|, |x_2|,\ldots, |x_n|)
  7. Rank of a matrix (for real matrix A\in\mathbb{R}^{m\times n})
    • column rank(A) = row rank(A) = rank(A)
    • rank(A) = rank(A^T)
    • rank(A) \le \min(m,n)
    • rank(AB) \le \min(rank(A), rank(B))
    • rank(A+B) \le rank(A) + rank(B)
    • rank(A^TA) = rank(AA^T) = rank(A)
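
A numerical spot-check of the rank properties (an assumed example):

```python
# Rank property checks (assumed example) using numpy.linalg.matrix_rank.
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((5, 3))
B = rng.standard_normal((3, 4))
r = np.linalg.matrix_rank

assert r(A) == r(A.T) <= min(A.shape)
assert r(A @ B) <= min(r(A), r(B))
assert r(A.T @ A) == r(A @ A.T) == r(A)
```
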
  8. Symmetric Matrices
    • A matrix A is positive semidefinite if \vec{v}^TA\vec{v} \ge 0 for all vectors \vec{v}
      • by convention, A is usually also required to be symmetric (A=A^T); a symmetric A is positive semidefinite iff all of its eigenvalues are nonnegative
      • A=U^TU is positive semidefinite for any matrix U
  9. Trace (for square matrix)
    • tr(A) = \sum\limits_{i=1}^nA_{ii} where A\in\mathbb{R}^{n\times n}
    • tr(A) = tr(A^T)
    • tr(A+B) = tr(A) + tr(B)
    • tr(cA) = ctr(A)
    • tr(AB) = tr(BA) = tr(B^TA^T) = tr(A^TB^T)
  10. Linear Transformation
  11. Invertibility of a matrix
    • A matrix is invertible iff it is full rank
  12. Orthogonal Matrix
    • “an orthogonal matrix is a square matrix with real entries whose columns and rows are orthogonal unit vectors” (from wikipedia)
    • “The rows of an orthogonal matrix are an orthonormal basis. That is, each row has length one, and are mutually perpendicular. Similarly, the columns are also an orthonormal basis. In fact, given any orthonormal basis, the matrix whose rows are that basis is an orthogonal matrix. It is automatically the case that the columns are another orthonormal basis.” (from Wolfram MathWorld)
    • Let A and B be orthogonal matrices:
      • A^TA = I = AA^T
      • \|A\vec{v}\|_2 = \|\vec{v}\|_2
      • \langle \vec{v},\vec{u} \rangle = \langle A\vec{v},A\vec{u} \rangle
      • \det(A)=\pm 1 (when it equals +1, A is a rotation matrix; o.w. A is a reflection matrix)
      • AB and A^{-1} are both orthogonal matrices
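
The QR decomposition of a random matrix yields an orthogonal Q, which makes these properties easy to check (an assumed example):

```python
# Checks on a random orthogonal matrix Q obtained via QR (assumed example).
import numpy as np

rng = np.random.default_rng(8)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
v, u = rng.standard_normal(4), rng.standard_normal(4)

assert np.allclose(Q.T @ Q, np.eye(4))                       # Q^T Q = I
assert np.isclose(np.linalg.norm(Q @ v), np.linalg.norm(v))  # lengths preserved
assert np.isclose(v @ u, (Q @ v) @ (Q @ u))                  # inner products preserved
assert np.isclose(abs(np.linalg.det(Q)), 1.0)                # det = +/-1
```
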
  13. Eigenvectors and Eigenvalues
    1. when a transformation only scales a vector (possibly reversing it) without otherwise changing its direction, we say the vector is an eigenvector of the transformation
    2. A\vec{v} = \lambda\vec{v} where \lambda is the eigenvalue associated with the eigenvector \vec{v}
    3. assuming at least one eigenvector exists, the eigenvalues can be found by solving the characteristic equation \det(\lambda I - A) = 0
    4. eigenvalue decomposition
      •  Definition: “Let P be a matrix of eigenvectors of a given square matrix A and D be a diagonal matrix with the corresponding eigenvalues on the diagonal. Then, as long as P is a square matrix, A can be written as an eigen decomposition A = PDP^{-1}. Furthermore, if A is symmetric, then the columns of P are orthogonal vectors.” (from Wolfram MathWorld)
    5. Properties (where \lambda_1, \ldots, \lambda_n are the eigenvalues of A\in\mathbb{R}^{n\times n})
      • tr(A) = \sum\limits_{i=1}^n\lambda_i
      • |A| = \prod\limits_{i=1}^n\lambda_i
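
A numpy check of the decomposition and the two properties above (an assumed example; a random real matrix may have complex eigenvalues, which come in conjugate pairs, so the trace and determinant identities still produce real values):

```python
# Eigendecomposition checks (assumed example): A = P D P^{-1},
# tr(A) = sum of eigenvalues, det(A) = product of eigenvalues.
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((4, 4))
lam, P = np.linalg.eig(A)

assert np.allclose(P @ np.diag(lam) @ np.linalg.inv(P), A)
assert np.isclose(np.trace(A), lam.sum().real)
assert np.isclose(np.linalg.det(A), lam.prod().real)
```
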
  14. Singular Value Decomposition