Jan 31 2015

[15251 Uncountability and Uncomputability]

Jan 22 2015

[15251] Formalization of Proof, Finite Automata

Formalization of Proof

First Order Logic = Prop. Logic + {$\forall$, $\exists$, $=$, “constants”, “relations”, “functions”}
Variables are now objects.
Vocabulary is a collection of constant-names, function-names, relation-names.
Interpretation

Specifies the universe
Maps constant-name to object
Maps relation-name to actual relation
Maps function-name to actual function
Satisfiability/Tautology: Similar to that in first order logic.

Finite Automata

Alphabet: A nonempty finite set $\Sigma$ of symbols, $\Sigma = {0, 1}$ usually.
String: A finite sequence of 0 or more symbols.

The length-0 string is $\epsilon$
$\Sigma^n$ means all strings with length n.
$\Sigma^{*}$ means all strings over $\Sigma$

Language: A collection of strings. A subset of $\Sigma^{*}$.

Thus, we can think of a decision problem as a function:
$f: \Sigma^{*} \rightarrow {NO, YES}$
Decision problem is finding a language L such that f(element in L) is true

DFA: $L(M) = x\in \Sigma^{*}: M \text{accepts } x$.

A deterministic finite automaton is a 5-tuple:

$Q$ is a nonempty finite set of states
$\Sigma$ is an alphabet
$\delta: Q * \Sigma \rightarrow Q$ is the transition function
$q_0\in Q$ is the start state
$F\subset Q$ is the set of accepting states

Accept
Let $w = w_1w_2w_3…w_n$
We say that M accepts string w if there exists states $r_0r_1…r_n\in Q$ such that

$r_0 = q_0$
$\delta(r_{t-1}, w_t) = r_t$ for all $t$
$r_n\in F$.
Otherwise, we say $M$ rejects $w$.
The sequence $r_1r_2…r_n$ is called the computation trace.

Regular Languages
A language is regular if there is a DFA that decides it

Proving a language is not regular

Assume for contradiction there is a DFA M which decides language L
Argue there are two strings x, y which reach the same state in M.
Show there is a string z such that $xz\in L$ and $yz\notin L$.
Example: $a^nb^n$

$L_1\cup L_2$ is regular if $L_1$ and $L_2$ are.(Union Theorem)
$L_1\dot L_2$ (concatenation)
$L^{*}$ is regular if $L$ is.

More on Regular Languages

The empty language is regular
The singleton of empty string is regular
The singleton of each character in the alpahbet is regular
Union
Concatenation
Kleene star
In fact, all set operations can be performed
(On how to prove a language is regular)[http://cseweb.ucsd.edu/~clbailey/ClosureProofTemplate.htm]
(On how to prove regular languages are closed under reverse)[http://web.cecs.pdx.edu/~hook/cs581sp11/reverse.pdf]
Pumping Lemma Not really understand

Jan 21 2015

Machine Learning Notes

Undirected Graphical Models

Inference, parameter learning (MLE), Structure learning (Prior over graphs, Regularization)
Markov Blanket
DGM but not UGM: rain, sprinkle -> wet
Inference: Generally intractable (approx. algorithms: MCMC)
Parameter Learning:
- Complete Data -> Iterative Methods, max ent: doable but non-trivial
- Incomplete Data -> E-M
- Cannot generate data for undirected graphs (very hard)
- How to generate data from Undirected Graphical Model
  - Random Walk
- Supervised learning: Data are labeled
- Query: Nearest neighbor (works for classification and regression)

Bayesian Belief Network

every node, conditioned on all immediate parents, is independent of all non-descendants
$O(2^{|\text{Parents}|})$: Factorization.
c -> r -> w. c is not independent of w. However, conditioned on KNOWing r, w is independent of c.
Parameter Learning in DGM
If there are complete data, then just need to worry about smoothing
incomplete -> EM algorithm. MLE from incomplete data
Structure learning

initialize u1, u2, … to some guess (centers of clusters of data points)
calculate “probabilistic assignment” of each $x_i$:
Re-estimate

1st-order Markov Assumption

L-Z compression(1/3)
本来有一段信息basketball，compress。这需要compression algorithm to be adaptive
做20次添加到basketball的末尾，比对compress的结果。最后compression最短的那个添加的东西和basketball最有关

传送信息的时候，有小的DT但是有exceptions, 我们要找到rule + 处理exception的平衡（最短的）
MDL: Minimum Description Length

Naive Bayes Algorithm

Conjunction of attributes to a finite set of output

Lecture 19 Naive Bayes

Conditional independent vs. truly independent
- z = x XOR y: if z=0, then x tells me y, y tells me x, but x, y independent
Naive Bayes: overly confident
- 因为对于不是iid的data，假设了他们都independent，都乘起来就会overly confident

Lecture 17

Find $h_map$:
- 如果不好直接算最大值就取log。和对likelihood取log，算导数一样
Example: Gaussian
- Given data $x_1, x_2, …$, assume $\sigma^2 = 1$, estimate u
- hard bias: choice of gaussian; soft: max-likelihood method
  -

Lecture 16

Q1: How good is a hypothesis?
- True error: $Err_D(h)$ (confidence interval of true error)
Q2: Is my hypothesis better than yours:
- $d = Err_Dh_B - Err_Dh_A$
- $\hat{d} = Err{SA}….Err{SB}$: unbiased estimation (notes有公式，很像confidence interval. SA SB are two samples better be similar. Pair Test. 比如一个班做两套不同的卷子是不合理的)
Q3: Is my learning algorithm better than yours:
- $h=L_A(s)$ (s is the training sample)
- We compare $Err_D(L_A(s))$, $Err_D(L_B(s))$. (s是蓝色的(observed), 所以这两个数都是observed，所以对于不同的training sample, 结果可能不一样。所以变成左右两边取expetation.)
- 对于一些s当做training，用test sample测error，可以找到d hat。如果sample有限就每次拿出来一点当training另外的当testing

Bayesian Learning

Probabilistic soft bias. $\forall h\in H, p(h)$ captures initial belief.
Consistency -> Match $P(D|h)$: likelihood
Bayes Game
- Start with $H = (h_1, h_2, …, h_n)$. PRIOR(INITIAL): $p(h)$.
- Data D
- Likelihood $P(D|h)$
- Bayes formula(MUST)
  - $P(h, D) = P(h) P(D|h) = P(D) P(h|D)$
  - $P(h)$ captures the belief BEFORE seeing the data, PRIOR
  - $P(h|D)$ captures … AFTER, POSTERIOR
  - $P(h|D) = \frac{P(D|h) * P(h)}{P(D)}$
  - If we have some prior belief and likelihood, after we see some data, the posterior is something we should believe. Bayesian learning updates your belief about life.
  - $h_{map}$是maximize posterior $p(h|D)$的h
  - 如果$p(h)=\frac{1}{|H|}$, 那就相当于maximize$p(D|h)$, likelihood. 也叫 max likelihood hypothesis

Lecture 1

Jan 14 Wed

Machine Learning: program that gets better at some tasks(functions) with experiences(data, examples)
Formalizing a task as a function learning task (f: board pos -> next move)
Human Annotation: human assigning scores

f: INPUT -> {choice1, choice2, …} : classification
f: INPUT -> Real : Regression
f: INPUT -> {0, 1} probability: Logistic Regression
f: INPUT -> P(INPUT) Probability Density Estimation
Binary Classification = Concept Learning = Learning a subset of $X$.

Lecture 2

Wed Jan 21 10:48:53 EST 2015

Input space: A set of features
Target Concept: The concept or function to be learned, denoted by $c$, a boolean-
valued function defined over the instances $X$; that is, $c$ : $X \rightarrow {0, 1}$
Training Examples: $D$
Hypothesis Conjunctions of literals. Eg $<?, Cold, High, ?, ?>$. We are looking for $h$ in the set of all possible hypothesis $H$ such that $\forall x, h(x) = c(x)$
$|X| \ll |H| \ll |C| = 2^x$, $|C|$ is the set of all possible functions (concept space)
X is input space, H is hypothesis space (each item + 1), C is concept space ($2^{|X|}$)

Lecture 3

$D={(x_1, c(x_1)), …}$ data
Each concept c_i is a bit vector.

Hypothesis consistency $Consistent(h, D) = \forall \in D, h(x) = c(x)$
Version space $VS_{H,D} = {h\in H| Consistent(h, D)}$

Find-S algorithm

hypo = <null, null, ...> # says NO to everything
# D is the data {(x1, c(x1)), (x2, c(x2)), ...}
# looking for target concept c in C
for positive_example in D:
    for attribute_constraint in hypo:
        if pos_example satisfied by attr_constraint: pass
        else: hypo := next more general constraint satisfied by attr_constraint
return hypo

Problems with Find-S

Fails when training data inconsistent
Picks a maximally specific $h$

List-Then-Eliminate

hypos = a list containing every hypothesis
for <x, c(x)> in D:
    remove h in hypos if h(x) != c(x)
return hypos

Lecture 4 Informaiton theory

Bias

Hard bias: eg. “Language bias” (conjunctions)
Soft bias: “ranking”, preference bias (prefer hypothesis that tends to say NO)

Information = Reduction in Uncertainty
Suprisal = information received when observing an outcome = $I(\text{sunny}) = log_2 \frac{1}{prob(\text{sunny})} \in [0, \infty)$
Information is additive:

Lecture 5

Suprise(Information learned) = $log2\frac{1}{p(e)}$, where $p(e)$ is subjective, int BITS.
Cross Entropy $\Sigma p_i I(s_i)$.
Fundamental Inequality
Distance $D{KL}(p||q) = E_p[log\frac{p(x)}{q(x)}]\ge 0$
A special case would be $p = q$, when we fully know the “truth”. $0\lte CH(p, p) \lte log k$, where $log k$ is fully uniform. 20 questions game.
Ask questions which answers you think are equally likely to be $0$ and $1$, so that you can learn the most information.

$H(Letter) = log{27} = 4.75 BITS$.

Lecture 6

Entropy is a tight lower bound of efficiency of emitting events.
Regular Variable: holds a single value.
Random Variable: distribution
Joint Entropy, Conditional Entropy, Average Conditional Entropy.
Average Conditional Entropy, $H(T/M)$.
$H(T) - H(M/T)$, how much $M$ is telling us on average about $T$.
$H(T) - H(M/T) = H(M) - H(M/T)$
$I(X; Y; Z) = I(X,Y) - I(X,Y|Z)$
$I(X, Y) = H(X) - H(X|Y)$
$I(y, {x_1, x_2, …}) = I(y, x_1) + (y; x_2|x_1) + I(y, x_3|x_1, x_2) + …$

Lecture 7

Decision Tree

Same attribute will not appear twice in a single path

Lecture 8

Draw a decision tree from $f(A, B, C, D, E) = (A\wedge B)\vee (C\wedge \not D\wedge E)$
Decision trees can represent any functions.

Lecture 9

Law of total variance: $VAR(y) = VAR(E[y|x]) + E[VAR(y|x)]$
First: how far apart the centers are
Second: WITHIN-CLUSTER VARIANCE

Lecture 11

Linear Regression

To a multi-linear regression: $\beta{OLS} = (X^TX)^{-1}(X^TY)$
Hard bias: …
Soft bias: square of residue
_Sparse Estimation: p >> n. $\beta$ is non-identifiable

L2 Norm(sum of squared beta): “Ridge Regression”
L1 Norm(sum of abs beta): “Lasso Regression”
L0 Norm(number of non-zero beta)
Regularization: minimize $argmin(Error(data, param) + Complexity(Param))$
In this example: square of residues + norm of Beta

Lecture 12

Neural Networks

Works really well with complex real-world sensor-data
Since human neurons switch slowly compared to computers, but is able to do complicated computations quickly. It is conjectured that humans do parallel computation.
Learning corresponds to assigning weights to edges in an acyclic directed graph.

Perceptrons: Takes a vector of real inputs, calculates a linear comb. If bigger threshold, output 1; 0 otherwise
Learning a perceptron = choosing values for the coeffs!
It can represent primitive boolean functions AND, OR, NAND, …
And all boolean functions in general
Training a perceptron: start with arbitrary weights, repeat… (p88)
to modify $w_i$ to new $w_i$: $\Delta w_i = \ita (t-o)x_i$, where $\ita$ is the Learning Rate, which is a small value.

Lecture 13

Single Linear Unit / Network of L.U.’s / Perceptron(STEP fxn,Classifier) / Network of Perceptrons
Expressive Power(hard bias): All Linear fxns / Still Linear, / Linear Classifiers / Any decision problem
Optimization Criterion(soft bias): Mean Squared Error / Same / Misclassific Rate / Same
Computational complexity: Algebraically, Gradient Decent / Same / Perceptron Learning Rule / Not continuous (No known feasible algo.)

(no bias -> no learning)
Sigmoid function:
Network of Sigmoid: any function / MSE / Gradient Decent (exists local minimal)

Back Propagation:

Midterm Prep

Data mining: extract information from a large data set and transform it to a structure for future use
Target function
Hypothesis: h is a conjunction of constraints on attributes
constraints: value; don’t care; no value allowed;
Concept learning:
Training example: <$x_1$, c($x_1$)>, where c is the target function or concept.
Determine: h such that h(x) = c(x) for all x in D
Find-S: h0 = most specific (no to everything); for each pos example, “merge” the attribute constraint
consistent: (h, D) if h(x) = c(x) for all c in D
version space: subset of H(hypothesis space) consistent with all training data D
List-then-Eliminate: start with the biggest hypothesis space; find version space by checking each training example
Input Space: X, Hypothesis Space: H (+1), Concept Space: $2^{|X|}$

Information Theory
Information: Reduction in uncertainty: $I(E)=log_2\frac{1}{P(E)}$
Entropy: From a Zero-memory source S, the entropy is $\Sigma_i p_i * I(s_i) = E[log\frac{1}{p_i}]$
avg amount of information
H(T, M) < H(T) + H(M) ???
q distribution ???
H(T) - H(T|M) = H(M) - H(M|T) ???

Lecture 15

Hypothesis Testing

Q1: Given a hypothesis, how good is it?
Q2: Given two hypothesis, which one is better?
Q3: Given two learning algorithms, which one is better?
Not about the portion of the mistakes in the whole input space, but the probability of making errors.(Consider when most of the examples are mistakes)
$Err_r(h)$ is an estimator of $Err_D(h)$.
$BIAS = E[EST] - TRUTH$ (Mean is around the truth)
$VARIANCE = E[EST - E[EST]]^2$(Also needs to be accurate (sharp around the truth))
True Error to Sample Error (Probability calculation)
Sample Error to True Error (Statistical Inferrence)