Machine Learning Notes

Undirected Graphical Models

  • Inference, parameter learning (MLE), Structure learning (Prior over graphs, Regularization)
  • Markov Blanket
  • DGM but not UGM: rain, sprinkler -> wet (a v-structure that a UGM cannot capture)
  • Inference: Generally intractable (approx. algorithms: MCMC)
  • Parameter Learning:
    • Complete Data -> Iterative Methods, max ent: doable but non-trivial
    • Incomplete Data -> E-M
    • Generating data from an undirected graph directly is very hard
    • How to generate data from an Undirected Graphical Model:
      • Random Walk (MCMC; see the sketch after this list)
    • Supervised learning: Data are labeled
    • Query: Nearest neighbor (works for classification and regression)
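A minimal sketch of the random-walk idea, assuming a tiny made-up pairwise binary UGM on a chain (the coupling W and the chain structure are illustrative, not from the notes). Each Gibbs step resamples one variable conditioned on its Markov blanket, here just its chain neighbors:

import math
import random

W = 1.0  # made-up coupling: potential exp(W) when neighbors agree, 1 otherwise

def gibbs_sample(n_vars=4, n_steps=10000):
    x = [random.randint(0, 1) for _ in range(n_vars)]
    for _ in range(n_steps):
        i = random.randrange(n_vars)
        def score(v):  # unnormalized P(x_i = v | Markov blanket of x_i)
            s = 0.0
            if i > 0:
                s += W * (v == x[i - 1])
            if i < n_vars - 1:
                s += W * (v == x[i + 1])
            return math.exp(s)
        p1 = score(1) / (score(0) + score(1))
        x[i] = 1 if random.random() < p1 else 0
    return x  # after burn-in, approximately a draw from the UGM

print(gibbs_sample())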

Bayesian Belief Network

  • every node, conditioned on all immediate parents, is independent of all non-descendants
  • Each node's table has $O(2^{|\text{Parents}|})$ entries: factorization.
  • c -> r -> w: c is not independent of w. However, conditioned on knowing r, w is independent of c.
  • Parameter Learning in DGM
  • If the data are complete, then we just need to worry about smoothing
  • Incomplete data -> EM algorithm: MLE from incomplete data
  • Structure learning
  • EM outline (e.g., for a mixture of Gaussians; see the sketch after these steps):
  1. initialize $\mu_1, \mu_2, \dots$ to some guess (centers of clusters of data points)
  2. calculate the “probabilistic assignment” of each $x_i$ to each cluster
  3. re-estimate the centers from the weighted assignments; repeat until convergence
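A minimal sketch of these three steps, assuming a 1-D mixture of two unit-variance Gaussians with equal mixing weights (the data and initial guesses are made up):

import math

def em_two_gaussians(xs, mu1=0.0, mu2=1.0, n_iters=50):
    # EM for a 1-D mixture of two Gaussians with sigma^2 = 1, equal weights.
    for _ in range(n_iters):
        # E-step: probabilistic assignment of each x_i to cluster 1.
        w = []
        for x in xs:
            p1 = math.exp(-0.5 * (x - mu1) ** 2)
            p2 = math.exp(-0.5 * (x - mu2) ** 2)
            w.append(p1 / (p1 + p2))
        # M-step: re-estimate each center from the weighted assignments.
        mu1 = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
        mu2 = sum((1 - wi) * x for wi, x in zip(w, xs)) / sum(1 - wi for wi in w)
    return mu1, mu2

print(em_two_gaussians([0.1, -0.2, 0.3, 4.9, 5.2, 5.1]))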

1st-order Markov Assumption

L-Z compression (1/3)
Suppose we start with a message, say "basketball", and compress it; this requires the compression algorithm to be adaptive.
Append each of ~20 candidate strings to the end of "basketball" and compare the compressed results; the candidate whose appended message compresses shortest is the one most related to "basketball".

When transmitting information, we may have a small DT (decision tree) plus some exceptions; we want to find the balance between the rule and the exception handling that is shortest overall.
MDL: Minimum Description Length
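A minimal sketch of the append-and-compress idea using zlib as the adaptive compressor (the base text and candidate strings are made up):

import zlib

base = b"basketball is played with a ball and a hoop on a court"
candidates = [b"slam dunk and free throw", b"interest rates and bonds"]

def size(data):
    return len(zlib.compress(data, 9))

# The candidate that adds the fewest bytes when appended to the base
# shares the most structure with it, so it is the most related.
for cand in candidates:
    extra = size(base + b" " + cand) - size(base)
    print(cand.decode(), extra)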

Naive Bayes Algorithm

  • Maps a conjunction of attribute values to a finite set of outputs

Lecture 19 Naive Bayes

  • Conditionally independent vs. truly independent
    • z = x XOR y: given z = 0, x tells me y and y tells me x, yet x and y are (marginally) independent
  • Naive Bayes: overly confident
    • Because features that are not actually independent are assumed independent, multiplying all their likelihoods together double-counts evidence and the posterior comes out overly confident
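A minimal Naive Bayes sketch (the weather-style data and attribute names are made up), multiplying per-attribute likelihoods per class, done in log space with Laplace smoothing:

import math
from collections import defaultdict

def train_nb(examples):
    # examples: list of (attribute_tuple, label)
    class_counts = defaultdict(int)
    attr_counts = defaultdict(int)  # (label, position, value) -> count
    for attrs, label in examples:
        class_counts[label] += 1
        for i, v in enumerate(attrs):
            attr_counts[(label, i, v)] += 1
    return class_counts, attr_counts

def predict_nb(x, class_counts, attr_counts, n_values=2):
    n = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label, c in class_counts.items():
        score = math.log(c / n)  # log prior
        for i, v in enumerate(x):
            # Laplace smoothing so unseen (label, value) pairs don't zero out.
            score += math.log((attr_counts[(label, i, v)] + 1) / (c + n_values))
        if score > best_score:
            best, best_score = label, score
    return best

data = [(("sunny", "hot"), "no"), (("rain", "cool"), "yes"), (("rain", "hot"), "yes")]
cc, ac = train_nb(data)
print(predict_nb(("rain", "hot"), cc, ac))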

Lecture 17

  • Find $h_{MAP}$:
    • If the maximum is hard to compute directly, take the log first; this is the same trick as taking the log of the likelihood before differentiating
  • Example: Gaussian
    • Given data $x_1, x_2, \dots$, assume $\sigma^2 = 1$, estimate $\mu$
    • Hard bias: the choice of a Gaussian; soft bias: the max-likelihood method
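The log-likelihood trick worked out for this example (a standard derivation, filled in here for completeness):

$\ln P(D|\mu) = \sum_i \ln\left( \frac{1}{\sqrt{2\pi}} e^{-(x_i-\mu)^2/2} \right) = \text{const} - \frac{1}{2}\sum_i (x_i - \mu)^2$

$\frac{d}{d\mu} \ln P(D|\mu) = \sum_i (x_i - \mu) = 0 \quad\Rightarrow\quad \mu_{ML} = \frac{1}{n}\sum_i x_i$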

Lecture 16

  • Q1: How good is a hypothesis?
    • True error: $Err_D(h)$ (confidence interval of true error)
  • Q2: Is my hypothesis better than yours:
    • $d = Err_D(h_B) - Err_D(h_A)$
    • $\hat{d} = Err_{S_B}(h_B) - Err_{S_A}(h_A)$: an unbiased estimator (the notes have the formula; it looks a lot like a confidence interval. $S_A$ and $S_B$ are two samples that should be as similar as possible: a paired test. For example, having one class take two different exam papers is not a fair comparison.)
  • Q3: Is my learning algorithm better than yours:
    • $h=L_A(s)$ (s is the training sample)
    • We compare $Err_D(L_A(s))$ and $Err_D(L_B(s))$. (s is observed, so both of these numbers are observed; for a different training sample the results may differ, so we take the expectation over s on both sides.)
    • Use some s as training and measure the error on a test sample to get $\hat{d}$. If the sample is limited, repeatedly take out a portion for training and use the remainder for testing.
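A minimal sketch of that resampling procedure (the split ratio, number of rounds, and learner interfaces are arbitrary choices):

import random

def paired_error_diff(data, learner_a, learner_b, k=10):
    # Estimate d_hat: mean over k splits of Err(h_B) - Err(h_A), where both
    # hypotheses are trained and tested on the SAME splits (a paired test).
    diffs = []
    for _ in range(k):
        random.shuffle(data)
        cut = 2 * len(data) // 3
        train, test = data[:cut], data[cut:]
        h_a, h_b = learner_a(train), learner_b(train)
        def err(h):
            return sum(h(x) != y for x, y in test) / len(test)
        diffs.append(err(h_b) - err(h_a))
    return sum(diffs) / k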

Bayesian Learning

  • Probabilistic soft bias. $\forall h\in H, p(h)$ captures initial belief.
  • Consistency -> Match $P(D|h)$: likelihood
  • Bayes Game
    • Start with $H = (h_1, h_2, …, h_n)$. PRIOR(INITIAL): $p(h)$.
    • Data D
    • Likelihood $P(D|h)$
    • Bayes formula(MUST)
      • $P(h, D) = P(h) P(D|h) = P(D) P(h|D)$
      • $P(h)$ captures the belief BEFORE seeing the data, PRIOR
      • $P(h|D)$ captures … AFTER, POSTERIOR
      • $P(h|D) = \frac{P(D|h) * P(h)}{P(D)}$
      • If we have some prior belief and likelihood, after we see some data, the posterior is something we should believe. Bayesian learning updates your belief about life.
      • $h_{MAP}$ is the $h$ that maximizes the posterior $p(h|D)$
      • If $p(h)=\frac{1}{|H|}$ (a uniform prior), this is equivalent to maximizing the likelihood $p(D|h)$; that $h$ is also called the max-likelihood hypothesis
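A minimal sketch of the Bayes game with a made-up hypothesis space: three candidate coin biases as the h's, coin flips as the data D:

H = [0.3, 0.5, 0.8]                          # hypothetical values of P(heads)
prior = {h: 1 / len(H) for h in H}           # p(h): uniform initial belief

D = [1, 1, 0, 1, 1]                          # observed flips, 1 = heads

def likelihood(h, data):                     # P(D|h) for independent flips
    p = 1.0
    for flip in data:
        p *= h if flip else (1 - h)
    return p

unnorm = {h: prior[h] * likelihood(h, D) for h in H}  # P(h) P(D|h)
z = sum(unnorm.values())                              # P(D)
posterior = {h: v / z for h, v in unnorm.items()}     # P(h|D)
h_map = max(posterior, key=posterior.get)             # maximizes p(h|D)
print(posterior, h_map)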

Lecture 1

Jan 14 Wed


Machine Learning: a program that gets better at some tasks (functions) with experience (data, examples)
Formalizing a task as a function learning task (f: board pos -> next move)
Human Annotation: human assigning scores

  • f: INPUT -> {choice1, choice2, …} : classification
  • f: INPUT -> Real : Regression
  • f: INPUT -> [0, 1] (a probability): Logistic Regression
  • f: INPUT -> P(INPUT): Probability Density Estimation
    Binary Classification = Concept Learning = Learning a subset of $X$.

Lecture 2

Wed Jan 21 10:48:53 EST 2015


  • Input space: A set of features
  • Target Concept: The concept or function to be learned, denoted by $c$, a boolean-valued function defined over the instances $X$; that is, $c: X \rightarrow \{0, 1\}$
  • Training Examples: $D$
  • Hypothesis: conjunctions of literals, e.g. $\langle ?, Cold, High, ?, ? \rangle$. We are looking for $h$ in the set of all possible hypotheses $H$ such that $\forall x, h(x) = c(x)$
  • $|X| \ll |H| \ll |C| = 2^{|X|}$, where $C$ is the set of all possible functions (the concept space)
  • $X$ is the input space, $H$ is the hypothesis space (each attribute's values + 1 for “?”), $C$ is the concept space ($2^{|X|}$)

Lecture 3

$D = \{(x_1, c(x_1)), \dots\}$: the data
Each concept $c_i$ is a bit vector over $X$.

Hypothesis consistency: $Consistent(h, D) \equiv \forall \langle x, c(x) \rangle \in D,\ h(x) = c(x)$
Version space: $VS_{H,D} = \{h \in H \mid Consistent(h, D)\}$

Find-S algorithm

def find_s(D):
    # D is the data [(x1, c(x1)), (x2, c(x2)), ...]; looking for target concept c in C.
    hypo = None  # stands for <null, null, ...>: says NO to everything
    for x, label in D:
        if not label:          # Find-S ignores negative examples
            continue
        if hypo is None:       # first positive example: adopt it exactly
            hypo = list(x)
            continue
        for i in range(len(hypo)):
            if hypo[i] != x[i]:
                hypo[i] = '?'  # the next more general constraint satisfied by x
    return hypo
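A quick usage example on made-up EnjoySport-style data:

D = [(("Sunny", "Warm", "Normal"), True),
     (("Sunny", "Warm", "High"), True),
     (("Rainy", "Cold", "High"), False)]
print(find_s(D))  # -> ['Sunny', 'Warm', '?']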

Problems with Find-S

  • Fails when training data inconsistent
  • Picks a maximally specific $h$

List-Then-Eliminate

def list_then_eliminate(H, D):
    hypos = list(H)  # start with a list containing every hypothesis
    for x, cx in D:
        hypos = [h for h in hypos if h(x) == cx]  # drop inconsistent h
    return hypos     # what remains is the version space VS_{H,D}

Lecture 4 Information Theory

Bias

  • Hard bias: e.g. “Language bias” (conjunctions)
  • Soft bias: “ranking”, preference bias (e.g. prefer hypotheses that tend to say NO)

Information = Reduction in Uncertainty
Surprisal = the information received when observing an outcome: $I(\text{sunny}) = \log_2 \frac{1}{P(\text{sunny})} \in [0, \infty)$
Information is additive: for independent events, $I(e_1 \wedge e_2) = I(e_1) + I(e_2)$.

Lecture 5

Surprise (information learned) = $\log_2\frac{1}{p(e)}$, where $p(e)$ is subjective, in BITS.
Cross entropy: $CH(p, q) = \Sigma_i p_i \log_2\frac{1}{q_i}$, the average surprise when events follow $p$ but our subjective beliefs are $q$.
Fundamental Inequality: the KL distance $D_{KL}(p \| q) = E_p\left[\log\frac{p(x)}{q(x)}\right] \ge 0$.
A special case is $p = q$, when we fully know the “truth”: $0 \le CH(p, p) = H(p) \le \log_2 k$, where $\log_2 k$ is attained by the fully uniform distribution over $k$ outcomes. (The 20-questions game.)
Ask questions whose answers you believe are equally likely to be $0$ and $1$, so that each answer teaches you the most information.

$H(\text{Letter}) = \log_2 27 \approx 4.75$ BITS.
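A minimal sketch of these quantities on made-up distributions $p$ and $q$ over $k = 4$ outcomes:

import math

p = [0.5, 0.25, 0.125, 0.125]   # the "true" distribution
q = [0.25, 0.25, 0.25, 0.25]    # our subjective (here: uniform) distribution

def entropy(p):                  # H(p) = sum_i p_i log2(1 / p_i)
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):         # CH(p, q) = sum_i p_i log2(1 / q_i)
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):                    # D_KL(p || q) = CH(p, q) - H(p) >= 0
    return cross_entropy(p, q) - entropy(p)

print(entropy(p), cross_entropy(p, q), kl(p, q))  # 1.75 2.0 0.25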

Lecture 6

Entropy is a tight lower bound on the average number of bits needed to emit events.
Regular Variable: holds a single value.
Random Variable: distribution
Joint Entropy, Conditional Entropy, Average Conditional Entropy.
Average Conditional Entropy: $H(T|M)$.
$I(T; M) = H(T) - H(T|M)$: how much $M$ is telling us on average about $T$.
$H(T) - H(T|M) = H(M) - H(M|T)$ (mutual information is symmetric)
$I(X; Y; Z) = I(X; Y) - I(X; Y|Z)$
$I(X; Y) = H(X) - H(X|Y)$
$I(y; \{x_1, x_2, \dots\}) = I(y; x_1) + I(y; x_2|x_1) + I(y; x_3|x_1, x_2) + \dots$ (chain rule)
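A minimal sketch computing $H$ and $I(X; Y)$ from a made-up joint distribution table, using the standard identity $I(X; Y) = H(X) + H(Y) - H(X, Y)$:

import math

# Made-up joint distribution P(X, Y) as a dict: (x, y) -> probability.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def H(dist):  # entropy of any distribution given as a dict of probabilities
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

px, py = {}, {}
for (x, y), p in P.items():
    px[x] = px.get(x, 0) + p  # marginal P(X)
    py[y] = py.get(y, 0) + p  # marginal P(Y)

mi = H(px) + H(py) - H(P)     # equals H(X) - H(X|Y), symmetric in X and Y
print(mi)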

Lecture 7

Decision Tree

The same attribute will not appear twice on a single path

Lecture 8

Draw a decision tree for $f(A, B, C, D, E) = (A\wedge B)\vee (C\wedge \neg D\wedge E)$
Decision trees can represent any boolean function (see the sketch below).
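A minimal sketch of one such tree, written as nested tests and checked against the formula on all $2^5$ inputs (this tree shape is one possible answer, not the unique one):

from itertools import product

def f(A, B, C, D, E):
    return (A and B) or (C and (not D) and E)

def tree(A, B, C, D, E):
    # Test A first; on each path, each attribute is tested at most once.
    if A:
        if B:
            return True
        return C and (not D) and E  # A=1, B=0 subtree
    return C and (not D) and E      # A=0 subtree

assert all(f(*bits) == tree(*bits) for bits in product([False, True], repeat=5))
print("tree matches f on all 32 inputs")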

Lecture 9

Law of total variance: $VAR(y) = VAR(E[y|x]) + E[VAR(y|x)]$
First term: how far apart the group centers are (between-cluster variance)
Second term: WITHIN-CLUSTER VARIANCE (the average variance inside each group)
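A quick numeric check of the identity on made-up grouped data:

# y values grouped by the value of x (made-up data).
groups = {"x=0": [1.0, 2.0, 3.0], "x=1": [10.0, 12.0]}

def mean(v): return sum(v) / len(v)
def var(v):  return mean([(yi - mean(v)) ** 2 for yi in v])  # population variance

ys = [y for g in groups.values() for y in g]
weights = [len(g) / len(ys) for g in groups.values()]  # P(x) per group

# VAR(E[y|x]): variance of the group means around the overall mean.
var_of_means = sum(w * (mean(g) - mean(ys)) ** 2
                   for w, g in zip(weights, groups.values()))
# E[VAR(y|x)]: weighted average of the within-group variances.
mean_of_vars = sum(w * var(g) for w, g in zip(weights, groups.values()))

assert abs(var(ys) - (var_of_means + mean_of_vars)) < 1e-9
print(var(ys), var_of_means, mean_of_vars)  # 20.24 = 19.44 + 0.8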

Lecture 11

Linear Regression

For multiple linear regression: $\beta_{OLS} = (X^TX)^{-1}(X^TY)$
Hard bias: …
Soft bias: the sum of squared residuals
Sparse Estimation: $p \gg n$, so $\beta$ is non-identifiable

  • L2 Norm(sum of squared beta): “Ridge Regression”
  • L1 Norm(sum of abs beta): “Lasso Regression”
  • L0 Norm(number of non-zero beta)
    Regularization: $argmin_{param}(Error(data, param) + Complexity(param))$
    In this example: the sum of squared residuals + a norm of $\beta$
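A minimal sketch of the two closed forms on made-up data (the penalty $\lambda$ is an arbitrary choice; adding $\lambda I$ inside the inverse is the standard ridge solution):

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.0])
Y = X @ beta_true + 0.1 * rng.normal(size=n)

# OLS: beta = (X^T X)^{-1} (X^T Y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Ridge (L2 penalty): beta = (X^T X + lambda * I)^{-1} (X^T Y)
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

print(beta_ols, beta_ridge)  # ridge shrinks the coefficients toward 0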

Lecture 12

Neural Networks

Works really well with complex real-world sensor-data
Human neurons switch slowly compared to computers, yet humans do complicated computations quickly; it is conjectured that humans do massively parallel computation.
Learning corresponds to assigning weights to edges in an acyclic directed graph.

Perceptron: takes a vector of real inputs and calculates a linear combination; if it is bigger than the threshold, output 1, otherwise 0
Learning a perceptron = choosing values for the coeffs!
It can represent the primitive boolean functions AND, OR, NAND, …
Networks of perceptrons can represent all boolean functions in general (a single perceptron cannot, e.g. XOR)
Training a perceptron: start with arbitrary weights, repeat… (p88)
To modify $w_i$ to a new $w_i$: $\Delta w_i = \eta (t-o)x_i$, where $\eta$ is the Learning Rate, a small value.
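A minimal sketch of this training rule learning AND (the learning rate and epoch count are arbitrary choices; the bias weight plays the role of the negated threshold):

def train_perceptron(examples, eta=0.1, epochs=20):
    # examples: list of (input_tuple, target in {0, 1})
    n = len(examples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, t in examples:
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            for i in range(n):
                w[i] += eta * (t - o) * x[i]  # delta w_i = eta (t - o) x_i
            b += eta * (t - o)
    return w, b

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(AND))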

Lecture 13

Model comparison:
  • Single Linear Unit: expressive power (hard bias) = all linear functions; optimization criterion (soft bias) = mean squared error; computation = algebraically, or gradient descent
  • Network of Linear Units: still linear; same criterion; same computation
  • Perceptron (STEP function, a classifier): linear classifiers; misclassification rate; perceptron learning rule
  • Network of Perceptrons: any decision problem; same criterion; not continuous (no known feasible algorithm)

(no bias -> no learning)
Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
Network of sigmoid units: any function / MSE / gradient descent (local minima exist)

Back Propagation:
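The notes stop at the heading, so here is a minimal sketch (assumptions: one hidden layer of sigmoid units, MSE loss, plain gradient descent on XOR; all sizes and constants are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: not linearly separable, so it needs the hidden layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # hidden -> output
eta = 0.5

for _ in range(5000):
    h = sigmoid(X @ W1 + b1)                   # forward pass
    o = sigmoid(h @ W2 + b2)
    delta_o = (o - T) * o * (1 - o)            # backward pass; sigma' = s(1-s)
    delta_h = (delta_o @ W2.T) * h * (1 - h)
    W2 -= eta * (h.T @ delta_o); b2 -= eta * delta_o.sum(axis=0)
    W1 -= eta * (X.T @ delta_h); b1 -= eta * delta_h.sum(axis=0)

# A local minimum is possible (the notes mention this); re-seed if the fit fails.
print(o.round(2))  # should approach [[0], [1], [1], [0]]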

Midterm Prep

Data mining: extract information from a large data set and transform it to a structure for future use
Target function
Hypothesis: h is a conjunction of constraints on attributes
constraints: a specific value; don't care (“?”); no value allowed (∅)
Concept learning:
Training example: $\langle x_1, c(x_1) \rangle$, where $c$ is the target function or concept.
Determine: h such that h(x) = c(x) for all x in D
Find-S: $h_0$ = most specific hypothesis (no to everything); for each positive example, “merge” in the attribute constraints
consistent(h, D): h(x) = c(x) for all $\langle x, c(x) \rangle$ in D
version space: subset of H(hypothesis space) consistent with all training data D
List-then-Eliminate: start with the biggest hypothesis space; find version space by checking each training example
Input Space: X, Hypothesis Space: H (+1), Concept Space: $2^{|X|}$

Information Theory
Information: Reduction in uncertainty: $I(E)=log_2\frac{1}{P(E)}$
Entropy: from a zero-memory source S, the entropy is $\Sigma_i p_i \cdot I(s_i) = E[\log\frac{1}{p_i}]$, the average amount of information
$H(T, M) \le H(T) + H(M)$, with equality iff $T$ and $M$ are independent
q distribution: the subjective distribution in cross entropy, $CH(p, q) = \Sigma_i p_i \log_2\frac{1}{q_i}$
$H(T) - H(T|M) = H(M) - H(M|T)$: both equal the mutual information $I(T; M)$, which is symmetric

Lecture 15

Hypothesis Testing

Q1: Given a hypothesis, how good is it?
Q2: Given two hypotheses, which one is better?
Q3: Given two learning algorithms, which one is better?
It is not about the proportion of mistakes in the whole input space, but the probability of making an error under the distribution $D$ (consider the case where most of the probability mass falls on the mistaken examples).
$Err_S(h)$, the sample error, is an estimator of $Err_D(h)$.
$BIAS = E[EST] - TRUTH$ (the mean should be around the truth)
$VARIANCE = E[(EST - E[EST])^2]$ (the estimator also needs to be accurate, i.e., sharp around the truth)
True error to sample error: probability calculation
Sample error to true error: statistical inference
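A minimal sketch of the normal-approximation confidence interval for a sample error (the counts are made up; $z = 1.96$ gives the usual 95% interval):

import math

def err_confidence_interval(mistakes, n, z=1.96):
    # CI for the true error around the sample error r/n (normal approximation).
    err_s = mistakes / n
    half = z * math.sqrt(err_s * (1 - err_s) / n)
    return err_s - half, err_s + half

# e.g., h makes 12 mistakes on a sample of 40 test examples
print(err_confidence_interval(12, 40))  # roughly (0.16, 0.44)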