10-701 Machine Learning Recitation 2: Probability / Statistics Dougal Sutherland 9/24/2013
Sample spaces
• Start with a sample space Ω for an “experiment”
– The set of all possible outcomes
– Flipping a coin: {H, T}
– Flipping a coin three times: {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
– A person’s age: the positive integers
– A person’s height: the positive reals
Events
• An event E is a subset of Ω
– Can do normal set operations
• Don’t need to allow for any arbitrary subset
• Just need a σ-algebra, which is a set B that
– Contains ∅
– Is closed under complements
– Is closed under countable unions
• In practice, usually don’t need to worry about it
Probability axioms
A probability function P is a function from events in our σ-algebra to real numbers satisfying:
1. Nonnegativity: P(E) ≥ 0
2. Unit measure: P(Ω) = 1
3. σ-additivity: P(E1 ∪ E2 ∪ …) = ∑_{i=1}^∞ P(Ei) if the Ei are mutually exclusive: Ei ∩ Ej = ∅ for i ≠ j
Consequences of axioms
• P(Ac) = 1 − P(A):
1 = P(Ω) = P(A ∪ Ac) (ax. 2) = P(A) + P(Ac) since A, Ac are disjoint (ax. 3)
• P(A) ≤ 1, since P(Ac) = 1 − P(A) ≥ 0
• P(∅) = 0: P(∅) = P(Ωc) = 1 − P(Ω) = 1 − 1 = 0
Possible probability functions
• For Ω = {H, T}:
– We know P(∅) = 0, P(Ω) = 1 always
– {∅, Ω} is a valid σ-algebra, so we could be done
– But we can probably observe H vs T:
• P(H) = p can be anything in [0, 1]
• P(T) = P({H}c) = 1 − p
More consequences of axioms
• P(B ∩ Ac) = P(B) − P(A ∩ B)
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
– Corollary: P(A ∩ B) ≥ P(A) + P(B) − 1
• A ⊆ B implies P(A) ≤ P(B)
Interpretation of probabilities
• Frequentist:
– long-run portion of times the event happens: P(E) = lim_{n→∞} (1/n) ∑_{i=1}^n 1(Xi ∈ E)
• Bayesian:
– Quantification of beliefs
– Can derive axioms from a certain set of “common sense” descriptions of how beliefs should work (Cox’s Theorem)
Defining probabilities
• Don’t want to have to check the axioms for every probability function
• One general pattern:
– Let {ω1, ω2, …} be a countable set of “atomic events” (mutually exclusive, cover all of Ω)
– Define corresponding nonnegative pi that sum to 1
– Then a valid probability function is P(E) = ∑_{i: ωi ∈ E} pi
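The atomic-event recipe above is easy to make concrete. A minimal Python sketch (the die example and the names `atoms` and `prob` are illustrative, not from the slides):

```python
# Atomic events for one fair die roll: mutually exclusive outcomes
# with nonnegative p_i summing to 1.
atoms = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}

def prob(event):
    """P(E) = sum of p_i over the atomic events contained in E."""
    return sum(p for omega, p in atoms.items() if omega in event)

print(prob({2, 4, 6}))    # P(roll is even), about 0.5
print(prob(set(atoms)))   # P(Omega) = 1, as axiom 2 requires
```

Because each p_i ≥ 0 and they sum to 1, any P defined this way automatically satisfies the axioms.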
Conditioning
• Basically, just change the sample space
– P(E1 | E2) changes P(E1) with sample space Ω to have sample space E2: P(E1∩E2) / P(E2)
• If E2 is empty this isn’t well-defined.
• If E1∩E2 = ∅, P(E1|E2) = 0.
(Venn diagram: events E1 and E2 inside Ω)
Bayes’ Rule
P(A, B) = P(A|B) P(B) = P(B|A) P(A), so P(A|B) = P(B|A) P(A) / P(B)
P(model|data) = P(data|model) P(model) / P(data)
P(English|French) = P(French|English) P(English) / P(French)
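A quick numeric instance of Bayes’ rule, with made-up numbers (a diagnostic-test scenario; none of these figures come from the slides):

```python
# P(disease | positive test) via Bayes' rule, with invented numbers.
p_pos_given_sick = 0.99       # P(data | model): test sensitivity
p_pos_given_healthy = 0.05    # false positive rate
p_sick = 0.01                 # prior P(model): 1% prevalence

# P(data), summing over both ways to see a positive test.
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(p_sick_given_pos)       # about 0.17
```

Even with a sensitive test, the small prior keeps the posterior low, which is exactly the prior-times-likelihood tradeoff in the formula.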
Independence
• Events A and B are independent (A⟂B) if P(A∩B) = P(A) P(B)
• Equivalently, P(A|B) = P(A), P(B|A) = P(B)
• If A⟂B, then Ac⟂B, A⟂Bc, Ac⟂Bc
Conditional independence
• P(A∩B|C) = P(A|C) P(B|C), or equivalently P(A|B,C) = P(A|C)
• WARNING: A⟂B doesn’t imply A⟂B|Z!
– Let A and B be coin flips and Z be A xor B.
– Then A⟂B, A⟂Z, B⟂Z, but B is fully determined given A and Z.
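The XOR example is easy to check by simulation. A sketch (variable names invented):

```python
# A, B fair coin flips; Z = A xor B. A and B are independent,
# but conditioned on Z they determine each other.
import random

random.seed(0)
samples = []
for _ in range(100_000):
    a, b = random.randint(0, 1), random.randint(0, 1)
    samples.append((a, b, a ^ b))

# Unconditionally, A tells us nothing about B: P(B=1 | A=1) is about 0.5.
b_given_a1 = [b for a, b, z in samples if a == 1]
frac_uncond = sum(b_given_a1) / len(b_given_a1)
print(frac_uncond)

# Given Z = 0, knowing A = 1 forces B = 1: P(B=1 | A=1, Z=0) = 1 exactly.
b_given_a1_z0 = [b for a, b, z in samples if a == 1 and z == 0]
frac_cond = sum(b_given_a1_z0) / len(b_given_a1_z0)
print(frac_cond)
```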
Independence of several events
• The last example had A⟂B, B⟂Z, A⟂Z (pairwise independent), but we don’t have A, B, Z all “mutually independent.”
• P(A∩B∩Z) = P(A) P(B) P(Z) also isn’t enough.
• The definition: for any subcollection i1, …, ik,
P(A_{i1} ∩ ⋯ ∩ A_{ik}) = P(A_{i1}) ⋯ P(A_{ik})
Random variables
• So far we’ve been talking only about events
• Usually we work with random variables
• Technically: a function on the sample space
– Whether a coin flip was heads: X(ω) = 1 if ω = H, 0 if ω = T
– Number of heads in a sequence: a function from Ω = {H, T}^n to {0, 1, …, n}
• Normally, the function is into R^d or Z^d
– Though it can be anything
Probability mass function
• Discrete (not “discreet”) RVs: domain is a countable subset of the reals
– e.g. X = number of heads in a sequence of coin flips
• Naturally defines atomic events for each value
– e.g. {X = 0}, {X = 1}, …, {X = n}
• Probability mass function: function from values to the probability of that value (basically a table)
– e.g. PX(k) = P({X = k})
• Nonnegative, sums to 1
Jointly distributed random variables
• If X and Y have a joint distribution, then they’re components of the random vector concat(X, Y)
• Joint PMF is just a multidimensional table
• Marginal of X is the distribution of X ignoring Y: P(X) = ∑_Y P(X, Y)
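As a concrete sketch of a joint PMF table and its marginals (the table entries here are made up for illustration):

```python
# Joint PMF of X in {0,1} (rows) and Y in {0,1,2} (columns) as a 2-D table.
import numpy as np

joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])
assert np.isclose(joint.sum(), 1.0)   # a valid joint PMF sums to 1

p_x = joint.sum(axis=1)   # marginal of X: sum out Y
p_y = joint.sum(axis=0)   # marginal of Y: sum out X
print(p_x)                # approximately [0.4, 0.6]

# Conditioning is renormalizing a slice of the table: P(Y | X = 1).
p_y_given_x1 = joint[1] / p_x[1]
print(p_y_given_x1)
```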
Conditioning, independence of RVs
• Conditioning, independence for RVs are basically the same as for events:
– P(A|B) = P(A, B) / P(B)
• but now talking about functions rather than scalars
– P(A|B) = P(B|A) P(A) / P(B)
– A⟂B if P(A, B) = P(A) P(B)
• also as functions, i.e. true for any values of A and B
• i.i.d.: “independent and identically distributed”
Cumulative distribution functions
• The cdf is FX(x) = P(X ≤ x)
• Useful for a lot of theoretical things:
– e.g. for independent Xi,
P(max_{1≤i≤n} Xi ≤ x) = P((X1 ≤ x) ∩ ⋯ ∩ (Xn ≤ x)) = P(X1 ≤ x) ⋯ P(Xn ≤ x) = (P(X1 ≤ x))^n if identically distributed
• F(−∞) = 0, F(∞) = 1, F is nondecreasing
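The max-of-iid identity above can be sanity-checked by simulation: for Uniform(0,1) draws, the CDF of the max should be x^n (a sketch; the particular n and x are arbitrary choices):

```python
# Estimate P(max of n iid Uniform(0,1) <= x) and compare with x ** n.
import random

random.seed(0)
n, trials, x = 5, 200_000, 0.8
hits = sum(max(random.random() for _ in range(n)) <= x for _ in range(trials))
estimate = hits / trials
print(estimate, x ** n)   # the two should be close
```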
CDFs for continuous RVs
• Can’t do a mass function: P(X=x) = 0 for any x
• Still can do FX(x) = P(X ≤ x) the same way
• F is continuous if X is a continuous RV; right-continuous if mixed
• Joint CDF: P(X ≤ x, Y ≤ y)
Probability density functions
• Derivative of the CDF: P(X ≤ x) = ∫_{−∞}^x f(t) dt
• Nonnegative, but can be > 1
• Integrates to 1
Important distributions
• Discrete:
– Bernoulli (coin flip)
– Binomial (# of heads in a series of coin flips)
– Categorical (dice roll)
– Multinomial (# of each result in a series of dice rolls)
– Poisson (# of events with a certain rate)
– Hypergeometric (sampling without replacement)
– Geometric (# of flips until heads)
– Negative binomial (# of flips until n heads)
Important distributions
• Continuous:
– Normal/Gaussian
– Uniform
– Beta
– Chi-squared
– Exponential
– Gamma
– Laplace
Functions of an RV
• If X is a random variable, so is X²
• We can get a distribution for Y = g(X): P(g(X) ∈ A) = P({x : g(x) ∈ A}) = P(X ∈ g⁻¹(A))
Expectation
• The mean of a random variable: E[X] = lim_{n→∞} (1/n) ∑_{i=1}^n Xi
Expectation
• The mean of a random variable: EX[g(X)] = ∫ g(x) dP(X = x)
– Discrete: EX[g(X)] = ∑_x g(x) P(X = x)
– Continuous: EX[g(X)] = ∫ g(x) p(X = x) dx
• Linear:
EX[αg(X) + h(X)] = ∫ (αg(x) + h(x)) dP(X = x)
= α ∫ g(x) dP(X = x) + ∫ h(x) dP(X = x)
= α EX[g(X)] + EX[h(X)]
Variance
• A measure of the spread of a distribution: Var(X) = E[(X − E[X])²]
• Standard deviation: √Var(X)
• With µ = E[X]:
Var(X) = E[(X − µ)²]
= E[X² − 2µX + µ²]
= E[X²] − 2µ E[X] + µ²
= E[X²] − µ²
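Both expressions for the variance are easy to verify on a small discrete distribution (the values and probabilities below are invented for illustration):

```python
# Var(X) two ways: E[(X - mu)^2] directly, and the shortcut E[X^2] - mu^2.
values = [1.0, 2.0, 5.0]
probs = [0.5, 0.3, 0.2]

mu = sum(p * v for v, p in zip(values, probs))
e_x2 = sum(p * v * v for v, p in zip(values, probs))

var_direct = sum(p * (v - mu) ** 2 for v, p in zip(values, probs))
var_shortcut = e_x2 - mu ** 2
print(var_direct, var_shortcut)   # the two formulas agree
```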
Law of large numbers
• Actually, I just lied to you: the definition of expectation is the integral, not the limit
• But it’s okay, because of the law of large numbers:
P(lim_{n→∞} (1/n) ∑_{i=1}^n Xi = E[X]) = 1
• Always true as long as the expectation exists
– though the proof is harder and convergence is slower if E[X²] doesn’t exist
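A quick empirical look at the law of large numbers, using fair die rolls with E[X] = 3.5 (the sample sizes are arbitrary choices):

```python
# Sample means of die rolls drift toward E[X] = 3.5 as n grows.
import random

random.seed(0)
for n in (100, 10_000, 1_000_000):
    mean = sum(random.randint(1, 6) for _ in range(n)) / n
    print(n, mean)   # the gap from 3.5 shrinks roughly like 1/sqrt(n)
```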
Central limit theorem
If X1, …, Xn are iid and have finite mean/variance:
(1/n) ∑_{i=1}^n Xi ∼ N(µ, σ²/n), approximately
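The theorem is also easy to see empirically: means of n Uniform(0,1) draws (µ = 1/2, σ² = 1/12) should have standard deviation near √(1/(12n)). A sketch with arbitrary n and repetition counts:

```python
# Distribution of the sample mean of n uniforms: check its mean and spread.
import random
import statistics

random.seed(0)
n, reps = 50, 20_000
means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]

print(statistics.mean(means))    # close to mu = 0.5
print(statistics.stdev(means))   # close to sqrt(1 / (12 * n)) ~= 0.0408
```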
Covariance
• Covariance: Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
• Measure of linear relationship between X, Y
• Note Var(X) = Cov(X, X)
Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
= E[XY − X E[Y] − E[X] Y + E[X] E[Y]]
= E[XY] − E[X] E[Y] − E[X] E[Y] + E[X] E[Y]
= E[XY] − E[X] E[Y]
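The shortcut Cov(X, Y) = E[XY] − E[X]E[Y] derived above can be checked on simulated data (the dependence Y = X + noise is an invented example):

```python
# Sample covariance two ways on deliberately correlated data.
import random

random.seed(0)
n = 100_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [x + random.gauss(0, 1) for x in xs]   # Cov(X, Y) should be Var(X) = 1

ex, ey = sum(xs) / n, sum(ys) / n
cov_direct = sum((x - ex) * (y - ey) for x, y in zip(xs, ys)) / n
cov_shortcut = sum(x * y for x, y in zip(xs, ys)) / n - ex * ey
print(cov_direct, cov_shortcut)   # both near 1
```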
Correlation
ρ_{X,Y} = Cov(X, Y) / (σX σY)
Covariance matrix
• If we have a random n-vector X:
Cov(X) = E[(X − E[X])(X − E[X])^T] = E[XX^T] − E[X] E[X]^T
• This is the n×n matrix whose (i, j) entry is Cov(Xi, Xj)
• Symmetric, positive semi-definite
• Cov(AX + a) = A Cov(X) A^T
• Cov(X, Y) = Cov(Y, X)^T
• Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z)
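The identity Cov(AX + a) = A Cov(X) A^T can be checked numerically (the particular matrices below are arbitrary illustrations):

```python
# Sample covariance transforms as A Cov(X) A^T under X -> A X + a.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 100_000))   # 3-dim X; columns are samples
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
a = np.array([[5.0], [7.0]])            # constant shift: drops out of Cov

cov_X = np.cov(X)
cov_AXa = np.cov(A @ X + a)
print(np.allclose(cov_AXa, A @ cov_X @ A.T))   # True
```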
Maximum likelihood estimate (MLE)
• If you think your data is e.g. normally distributed, you still need to find the mean and the variance.
• Most common way: maximize the likelihood of the data under the model (for iid data):
argmax_θ P(X; θ) = argmax_θ ∏_{i=1}^n P(Xi; θ)
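For the Gaussian example in the first bullet, the MLE has a closed form: the sample mean and the 1/n sample variance. A sketch on simulated data (the true parameters are invented):

```python
# Gaussian MLE: mu_hat = sample mean, var_hat = (1/n) sum (x - mu_hat)^2.
import random

random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(50_000)]

mu_mle = sum(data) / len(data)
var_mle = sum((x - mu_mle) ** 2 for x in data) / len(data)  # 1/n, not 1/(n-1)
print(mu_mle, var_mle)   # near the true mu = 10, sigma^2 = 4
```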
Maximum a posteriori (MAP)
• The MLE is prone to overfitting
• Partial solution: add a prior:
argmax_θ P(θ | X) = argmax_θ P(X | θ) P(θ) / ∫ P(X | ϑ) P(ϑ) dϑ
= argmax_θ P(X | θ) P(θ)
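A standard concrete case where the MAP formula simplifies: a coin’s heads-probability θ with a Beta(α, β) prior, for which the posterior mode has a closed form. The counts and prior below are invented:

```python
# MLE vs MAP for a Bernoulli parameter with a Beta prior.
heads, flips = 3, 3        # tiny sample: every flip came up heads
alpha, beta = 2.0, 2.0     # Beta(2, 2) prior, gently favoring fair coins

theta_mle = heads / flips                                     # = 1.0: overfit
theta_map = (heads + alpha - 1) / (flips + alpha + beta - 2)  # posterior mode
print(theta_mle, theta_map)   # the prior pulls the MAP back toward 1/2
```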
Posterior mean
• Another choice for a Bayesian estimate
– Minimizes L2 risk
– Posterior mode, aka MAP, minimizes L0 risk
– Posterior median minimizes L1 risk
• In general, harder to compute than MAP
– In “easy” parametric cases, might be known
– Can estimate from posterior samples
• Markov chain Monte Carlo
Posterior distribution
• The “best” answer: don’t use a point estimate!
P(θ | X) = P(X | θ) P(θ) / ∫ P(X | ϑ) P(ϑ) dϑ
• Problem: it’s hard to compute
– Can find the “best” approximation in some simpler family (Variational Bayes, expectation propagation)
– Can get approximate samples from the posterior (Markov chain Monte Carlo)
Nonparametric statistics
• Problem: real data rarely follows idealized distributions
– Multimodal
– Heavier tails
• If you blindly assume a parametric form, you might get tricked
• Instead:
– Histograms
– Kernel density estimation
– kNN density estimation
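Kernel density estimation is short enough to sketch directly: the estimated density is an average of Gaussian bumps centered at the data points (the bimodal data and the bandwidth h are made-up choices, not tuned values):

```python
# A minimal Gaussian KDE; a single parametric bump would fit this
# bimodal data badly, but the KDE recovers both modes.
import math
import random

random.seed(0)
data = ([random.gauss(-2, 0.5) for _ in range(500)] +
        [random.gauss(3, 0.5) for _ in range(500)])

def kde(x, points, h=0.3):
    """Average of N(x_i, h^2) densities evaluated at x."""
    norm = len(points) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in points) / norm

print(kde(-2, data), kde(3, data), kde(0.5, data))   # high, high, near zero
```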
Classification
• Now that we know how to talk about probability distributions, how do we use them?
• Start with two-class classification:
– Positive/negative generated by different distributions
P(+ | x) = P(x | +) P(+) / P(x)
P(− | x) = P(x | −) P(−) / P(x)
P(+ | x) / P(− | x) = P(x | +) P(+) / (P(x | −) P(−))
Classification
P(+ | x) / P(− | x) = P(x | +) P(+) / (P(x | −) P(−))
• P(+), P(−) are class prior probabilities
– Can just estimate with counts in training data
• P(x | +), P(x | −) are the core of the classifier:
– Nadaraya-Watson: use a kernel density estimate
– K-nearest-neighbor: use a kNN density estimate
– Naïve Bayes:
• assume P(x | class) = P(x1 | class) ⋯ P(xn | class)
• model each of those parametrically
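The naïve Bayes bullet can be sketched end-to-end in a few lines: count-based class priors, one Gaussian per feature per class, and classification by comparing log-posteriors. All data and parameters here are invented for illustration:

```python
# Tiny Gaussian naive Bayes for two classes in 2-D.
import math
import random

random.seed(0)
pos = [(random.gauss(2, 1), random.gauss(2, 1)) for _ in range(300)]
neg = [(random.gauss(-2, 1), random.gauss(-2, 1)) for _ in range(100)]

def fit(points):
    """Per-feature mean and variance: the 'model each parametrically' step."""
    n = len(points)
    out = []
    for j in range(2):
        m = sum(p[j] for p in points) / n
        v = sum((p[j] - m) ** 2 for p in points) / n
        out.append((m, v))
    return out

def log_gauss(x, m, v):
    return -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)

params = {"+": fit(pos), "-": fit(neg)}
prior = {"+": len(pos) / 400, "-": len(neg) / 400}   # class priors by counting

def predict(x):
    def score(c):   # log P(c) + sum_j log P(x_j | c): the naive assumption
        return math.log(prior[c]) + sum(log_gauss(x[j], *params[c][j])
                                        for j in range(2))
    return "+" if score("+") > score("-") else "-"

print(predict((1.5, 2.5)), predict((-2.0, -1.0)))
```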