10-701 Machine Learning Recitation 2: Probability / Statistics Dougal Sutherland 9/24/2013
Sample spaces
• Start with a sample space Ω for an “experiment”
– The set of all possible outcomes
– Flipping a coin: {H, T}
– Flipping a coin three times: {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
– A person’s age: the positive integers
– A person’s height: the positive reals
Events
• An event E is a subset of Ω
– Can do normal set operations
• Don’t need to allow for any arbitrary subset
• Just need a σ-algebra, which is a set B that
– Contains ∅
– Is closed under complements
– Is closed under countable unions
• In practice, usually don’t need to worry about it
Probability axioms
A probability function P is a function from events in our σ-algebra to real numbers satisfying:
1. Nonnegativity: P(E) ≥ 0
2. Unit measure: P(Ω) = 1
3. σ-additivity: P(E1 ∪ E2 ∪ …) = ∑_{i=1}^∞ P(Ei) if the Ei are mutually exclusive: Ei ∩ Ej = ∅ for i ≠ j
Consequences of axioms
• P(Ac) = 1 − P(A):
1 = P(Ω) = P(A ∪ Ac) (ax. 2) = P(A) + P(Ac) since A, Ac are disjoint (ax. 3)
• P(A) ≤ 1, since P(Ac) = 1 − P(A) ≥ 0
• P(∅) = 0: P(∅) = P(Ωc) = 1 − P(Ω) = 1 − 1 = 0
Possible probability functions
• For Ω = {H, T}:
– We know P(∅) = 0, P(Ω) = 1 always
– {∅, Ω} is a valid σ-algebra, so we could be done
– But we can probably observe H vs T:
• P(H) = p can be anything in [0, 1]
• P(T) = P({H}c) = 1 − p
More consequences of axioms
• P(B ∩ Ac) = P(B) − P(A ∩ B)
• P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
– Corollary: P(A ∩ B) ≥ P(A) + P(B) − 1
• A ⊆ B implies P(A) ≤ P(B)
Interpretation of probabilities
• Frequentist:
– long-run portion of times the event happens: P(E) = lim_{n→∞} (1/n) ∑_{i=1}^n 1(Xi ∈ E)
• Bayesian:
– Quantification of beliefs
– Can derive axioms from a certain set of “common sense” descriptions of how beliefs should work (Cox’s Theorem)
Defining probabilities
• Don’t want to have to check the axioms for every probability function
• One general pattern:
– Let {ω1, ω2, …} be a countable set of “atomic events” (mutually exclusive, cover all of Ω)
– Define corresponding nonnegative pi that sum to 1
– Then a valid probability function is P(E) = ∑_{i: ωi ∈ E} pi
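The atomic-event recipe above is easy to make concrete. A minimal Python sketch (the die example and the names `atoms` and `prob` are illustrative, not from the slides):

```python
# Atomic events for one fair die roll: mutually exclusive outcomes
# with nonnegative p_i summing to 1.
atoms = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}

def prob(event):
    """P(E) = sum of p_i over the atomic events contained in E."""
    return sum(p for omega, p in atoms.items() if omega in event)

print(prob({2, 4, 6}))    # P(roll is even), about 0.5
print(prob(set(atoms)))   # P(Omega) = 1, as axiom 2 requires
```

Because each p_i ≥ 0 and they sum to 1, any P defined this way automatically satisfies the axioms.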
Conditioning
• Basically, just change the sample space
– P(E1 | E2) changes P(E1) with sample space Ω to have sample space E2: P(E1∩E2) / P(E2)
• If E2 is empty this isn’t well-defined.
• If E1∩E2 = ∅, P(E1|E2) = 0.
(Venn diagram: events E1 and E2 inside Ω)
Bayes’ Rule
P(A, B) = P(A|B) P(B) = P(B|A) P(A), so P(A|B) = P(B|A) P(A) / P(B)
P(model|data) = P(data|model) P(model) / P(data)
P(English|French) = P(French|English) P(English) / P(French)
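A quick numeric instance of Bayes’ rule, with made-up numbers (a diagnostic-test scenario; none of these figures come from the slides):

```python
# P(disease | positive test) via Bayes' rule, with invented numbers.
p_pos_given_sick = 0.99       # P(data | model): test sensitivity
p_pos_given_healthy = 0.05    # false positive rate
p_sick = 0.01                 # prior P(model): 1% prevalence

# P(data), summing over both ways to see a positive test.
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(p_sick_given_pos)       # about 0.17
```

Even with a sensitive test, the small prior keeps the posterior low, which is exactly the prior-times-likelihood tradeoff in the formula.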
Independence
• Events A and B are independent (A⟂B) if P(A∩B) = P(A) P(B)
• Equivalently, P(A|B) = P(A), P(B|A) = P(B)
• If A⟂B, then Ac⟂B, A⟂Bc, Ac⟂Bc
Conditional independence
• P(A∩B|C) = P(A|C) P(B|C), or equivalently P(A|B,C) = P(A|C)
• WARNING: A⟂B doesn’t imply A⟂B|Z!
– Let A and B be coin flips and Z be A xor B.
– Then A⟂B, A⟂Z, B⟂Z, but B is fully determined given A and Z.
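The XOR example is easy to check by simulation. A sketch (variable names invented):

```python
# A, B fair coin flips; Z = A xor B. A and B are independent,
# but conditioned on Z they determine each other.
import random

random.seed(0)
samples = []
for _ in range(100_000):
    a, b = random.randint(0, 1), random.randint(0, 1)
    samples.append((a, b, a ^ b))

# Unconditionally, A tells us nothing about B: P(B=1 | A=1) is about 0.5.
b_given_a1 = [b for a, b, z in samples if a == 1]
frac_uncond = sum(b_given_a1) / len(b_given_a1)
print(frac_uncond)

# Given Z = 0, knowing A = 1 forces B = 1: P(B=1 | A=1, Z=0) = 1 exactly.
b_given_a1_z0 = [b for a, b, z in samples if a == 1 and z == 0]
frac_cond = sum(b_given_a1_z0) / len(b_given_a1_z0)
print(frac_cond)
```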
Independence of several events
• The last example had A⟂B, B⟂Z, A⟂Z (pairwise independent), but we don’t have A, B, Z all “mutually independent.”
• P(A∩B∩Z) = P(A) P(B) P(Z) also isn’t enough.
• The definition: for any subcollection i1, …, ik,
P(A_{i1} ∩ ⋯ ∩ A_{ik}) = P(A_{i1}) ⋯ P(A_{ik})
Random variables
• So far we’ve been talking only about events
• Usually we work with random variables
• Technically: a function on the sample space
– Whether a coin flip was heads: X(ω) = 1 if ω = H, 0 if ω = T
– Number of heads in a sequence: a function from Ω = {H, T}^n to {0, 1, …, n}
• Normally, the function is into R^d or Z^d
– Though it can be anything
Probability mass function
• Discrete (not “discreet”) RVs: domain is a countable subset of the reals
– e.g. X = number of heads in a sequence of coin flips
• Naturally defines atomic events for each value
– e.g. {X = 0}, {X = 1}, …, {X = n}
• Probability mass function: function from values to the probability of that value (basically a table)
– e.g. PX(k) = P({X = k})
• Nonnegative, sums to 1
Jointly distributed random variables
• If X and Y have a joint distribution, then they’re components of the random vector concat(X, Y)
• Joint PMF is just a multidimensional table
• Marginal of X is the distribution of X ignoring Y: P(X) = ∑_Y P(X, Y)
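As a concrete sketch of a joint PMF table and its marginals (the table entries here are made up for illustration):

```python
# Joint PMF of X in {0,1} (rows) and Y in {0,1,2} (columns) as a 2-D table.
import numpy as np

joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])
assert np.isclose(joint.sum(), 1.0)   # a valid joint PMF sums to 1

p_x = joint.sum(axis=1)   # marginal of X: sum out Y
p_y = joint.sum(axis=0)   # marginal of Y: sum out X
print(p_x)                # approximately [0.4, 0.6]

# Conditioning is renormalizing a slice of the table: P(Y | X = 1).
p_y_given_x1 = joint[1] / p_x[1]
print(p_y_given_x1)
```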
Conditioning, independence of RVs
• Conditioning, independence for RVs are basically the same as for events:
– P(A|B) = P(A, B) / P(B)
• but now talking about functions rather than scalars
– P(A|B) = P(B|A) P(A) / P(B)
– A⟂B if P(A, B) = P(A) P(B)
• also as functions, i.e. true for any values of A and B
• i.i.d.: “independent and identically distributed”
Cumulative distribution functions
• The cdf is FX(x) = P(X ≤ x)
• Useful for a lot of theoretical things:
– e.g. for independent Xi,
P(max_{1≤i≤n} Xi ≤ x) = P((X1 ≤ x) ∩ ⋯ ∩ (Xn ≤ x)) = P(X1 ≤ x) ⋯ P(Xn ≤ x) = (P(X1 ≤ x))^n if identically distributed
• F(−∞) = 0, F(∞) = 1, F is nondecreasing
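The max-of-iid identity above can be sanity-checked by simulation: for Uniform(0,1) draws, the CDF of the max should be x^n (a sketch; the particular n and x are arbitrary choices):

```python
# Estimate P(max of n iid Uniform(0,1) <= x) and compare with x ** n.
import random

random.seed(0)
n, trials, x = 5, 200_000, 0.8
hits = sum(max(random.random() for _ in range(n)) <= x for _ in range(trials))
estimate = hits / trials
print(estimate, x ** n)   # the two should be close
```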
CDFs for continuous RVs
• Can’t do a mass function: P(X=x) = 0 for any x
• Still can do FX(x) = P(X ≤ x) the same way
• F is continuous if X is a continuous RV; right-continuous if mixed
• Joint CDF: P(X ≤ x, Y ≤ y)
Probability density functions
• Derivative of the CDF: P(X ≤ x) = ∫_{−∞}^x f(t) dt
• Nonnegative, but can be > 1
• Integrates to 1
Important distributions
• Discrete:
– Bernoulli (coin flip)
– Binomial (# of heads in a series of coin flips)
– Categorical (dice roll)
– Multinomial (# of each result in a series of dice rolls)
– Poisson (# of events with a certain rate)
– Hypergeometric (sampling without replacement)
– Geometric (# of flips until heads)
– Negative binomial (# of flips until n heads)
Important distributions
• Continuous:
– Normal/Gaussian
– Uniform
– Beta
– Chi-squared
– Exponential
– Gamma
– Laplace
Functions of an RV
• If X is a random variable, so is X²
• We can get a distribution for Y = g(X): P(g(X) ∈ A) = P({x : g(x) ∈ A}) = P(X ∈ g⁻¹(A))
Expectation
• The mean of a random variable: E[X] = lim_{n→∞} (1/n) ∑_{i=1}^n Xi
Expectation
• The mean of a random variable: EX[g(X)] = ∫ g(x) dP(X = x)
– Discrete: EX[g(X)] = ∑_x g(x) P(X = x)
– Continuous: EX[g(X)] = ∫ g(x) p(X = x) dx
• Linear:
EX[αg(X) + h(X)] = ∫ (αg(x) + h(x)) dP(X = x)
= α ∫ g(x) dP(X = x) + ∫ h(x) dP(X = x)
= α EX[g(X)] + EX[h(X)]
Variance
• A measure of the spread of a distribution: Var(X) = E[(X − E[X])²]
• Standard deviation: √Var(X)
• With µ = E[X]:
Var(X) = E[(X − µ)²]
= E[X² − 2µX + µ²]
= E[X²] − 2µ E[X] + µ²
= E[X²] − µ²
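Both expressions for the variance are easy to verify on a small discrete distribution (the values and probabilities below are invented for illustration):

```python
# Var(X) two ways: E[(X - mu)^2] directly, and the shortcut E[X^2] - mu^2.
values = [1.0, 2.0, 5.0]
probs = [0.5, 0.3, 0.2]

mu = sum(p * v for v, p in zip(values, probs))
e_x2 = sum(p * v * v for v, p in zip(values, probs))

var_direct = sum(p * (v - mu) ** 2 for v, p in zip(values, probs))
var_shortcut = e_x2 - mu ** 2
print(var_direct, var_shortcut)   # the two formulas agree
```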
Law of large numbers
• Actually, I just lied to you: the definition of expectation is the integral, not the limit
• But it’s okay, because of the law of large numbers:
P(lim_{n→∞} (1/n) ∑_{i=1}^n Xi = E[X]) = 1
• Always true as long as the expectation exists
– though the proof is harder and convergence is slower if E[X²] doesn’t exist
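A quick empirical look at the law of large numbers, using fair die rolls with E[X] = 3.5 (the sample sizes are arbitrary choices):

```python
# Sample means of die rolls drift toward E[X] = 3.5 as n grows.
import random

random.seed(0)
for n in (100, 10_000, 1_000_000):
    mean = sum(random.randint(1, 6) for _ in range(n)) / n
    print(n, mean)   # the gap from 3.5 shrinks roughly like 1/sqrt(n)
```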
Central limit theorem
If X1, …, Xn are iid and have finite mean/variance:
(1/n) ∑_{i=1}^n Xi ∼ N(µ, σ²/n), approximately
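The theorem is also easy to see empirically: means of n Uniform(0,1) draws (µ = 1/2, σ² = 1/12) should have standard deviation near √(1/(12n)). A sketch with arbitrary n and repetition counts:

```python
# Distribution of the sample mean of n uniforms: check its mean and spread.
import random
import statistics

random.seed(0)
n, reps = 50, 20_000
means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]

print(statistics.mean(means))    # close to mu = 0.5
print(statistics.stdev(means))   # close to sqrt(1 / (12 * n)) ~= 0.0408
```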
Covariance
• Covariance: Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
• Measure of linear relationship between X, Y
• Note Var(X) = Cov(X, X)
Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
= E[XY − X E[Y] − E[X] Y + E[X] E[Y]]
= E[XY] − E[X] E[Y] − E[X] E[Y] + E[X] E[Y]
= E[XY] − E[X] E[Y]
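The shortcut Cov(X, Y) = E[XY] − E[X]E[Y] derived above can be checked on simulated data (the dependence Y = X + noise is an invented example):

```python
# Sample covariance two ways on deliberately correlated data.
import random

random.seed(0)
n = 100_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [x + random.gauss(0, 1) for x in xs]   # Cov(X, Y) should be Var(X) = 1

ex, ey = sum(xs) / n, sum(ys) / n
cov_direct = sum((x - ex) * (y - ey) for x, y in zip(xs, ys)) / n
cov_shortcut = sum(x * y for x, y in zip(xs, ys)) / n - ex * ey
print(cov_direct, cov_shortcut)   # both near 1
```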
Correlation
ρ_{X,Y} = Cov(X, Y) / (σX σY)
Covariance matrix
• If we have a random n-vector X:
Cov(X) = E[(X − E[X])(X − E[X])^T] = E[XX^T] − E[X] E[X]^T
• This is the n×n matrix whose (i, j) entry is Cov(Xi, Xj)
• Symmetric, positive semi-definite
• Cov(AX + a) = A Cov(X) A^T
• Cov(X, Y) = Cov(Y, X)^T
• Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z)
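The identity Cov(AX + a) = A Cov(X) A^T can be checked numerically (the particular matrices below are arbitrary illustrations):

```python
# Sample covariance transforms as A Cov(X) A^T under X -> A X + a.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 100_000))   # 3-dim X; columns are samples
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
a = np.array([[5.0], [7.0]])            # constant shift: drops out of Cov

cov_X = np.cov(X)
cov_AXa = np.cov(A @ X + a)
print(np.allclose(cov_AXa, A @ cov_X @ A.T))   # True
```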
Maximum likelihood estimate (MLE)
• If you think your data is e.g. normally distributed, you still need to find the mean and the variance.
• Most common way: maximize the likelihood of the data under the model (for iid data):
argmax_θ P(X; θ) = argmax_θ ∏_{i=1}^n P(Xi; θ)
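For the Gaussian example in the first bullet, the MLE has a closed form: the sample mean and the 1/n sample variance. A sketch on simulated data (the true parameters are invented):

```python
# Gaussian MLE: mu_hat = sample mean, var_hat = (1/n) sum (x - mu_hat)^2.
import random

random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(50_000)]

mu_mle = sum(data) / len(data)
var_mle = sum((x - mu_mle) ** 2 for x in data) / len(data)  # 1/n, not 1/(n-1)
print(mu_mle, var_mle)   # near the true mu = 10, sigma^2 = 4
```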
Maximum a posteriori (MAP)
• The MLE is prone to overfitting
• Partial solution: add a prior:
argmax_θ P(θ | X) = argmax_θ P(X | θ) P(θ) / ∫ P(X | ϑ) P(ϑ) dϑ
= argmax_θ P(X | θ) P(θ)
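A standard concrete case where the MAP formula simplifies: a coin’s heads-probability θ with a Beta(α, β) prior, for which the posterior mode has a closed form. The counts and prior below are invented:

```python
# MLE vs MAP for a Bernoulli parameter with a Beta prior.
heads, flips = 3, 3        # tiny sample: every flip came up heads
alpha, beta = 2.0, 2.0     # Beta(2, 2) prior, gently favoring fair coins

theta_mle = heads / flips                                     # = 1.0: overfit
theta_map = (heads + alpha - 1) / (flips + alpha + beta - 2)  # posterior mode
print(theta_mle, theta_map)   # the prior pulls the MAP back toward 1/2
```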
Posterior mean
• Another choice for a Bayesian estimate
– Minimizes L2 risk
– Posterior mode, aka MAP, minimizes L0 risk
– Posterior median minimizes L1 risk
• In general, harder to compute than MAP
– In “easy” parametric cases, might be known
– Can estimate from posterior samples
• Markov chain Monte Carlo
Posterior distribution
• The “best” answer: don’t use a point estimate!
P(θ | X) = P(X | θ) P(θ) / ∫ P(X | ϑ) P(ϑ) dϑ
• Problem: it’s hard to compute
– Can find the “best” approximation in some simpler family (Variational Bayes, expectation propagation)
– Can get approximate samples from the posterior (Markov chain Monte Carlo)
Nonparametric statistics
• Problem: real data rarely follows idealized distributions
– Multimodal
– Heavier tails
• If you blindly assume a parametric form, you might get tricked
• Instead:
– Histograms
– Kernel density estimation
– kNN density estimation
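Kernel density estimation is short enough to sketch directly: the estimated density is an average of Gaussian bumps centered at the data points (the bimodal data and the bandwidth h are made-up choices, not tuned values):

```python
# A minimal Gaussian KDE; a single parametric bump would fit this
# bimodal data badly, but the KDE recovers both modes.
import math
import random

random.seed(0)
data = ([random.gauss(-2, 0.5) for _ in range(500)] +
        [random.gauss(3, 0.5) for _ in range(500)])

def kde(x, points, h=0.3):
    """Average of N(x_i, h^2) densities evaluated at x."""
    norm = len(points) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in points) / norm

print(kde(-2, data), kde(3, data), kde(0.5, data))   # high, high, near zero
```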
Classification
• Now that we know how to talk about probability distributions, how do we use them?
• Start with two-class classification:
– Positive/negative generated by different distributions
P(+ | x) = P(x | +) P(+) / P(x)
P(− | x) = P(x | −) P(−) / P(x)
P(+ | x) / P(− | x) = P(x | +) P(+) / (P(x | −) P(−))
Classification
P(+ | x) / P(− | x) = P(x | +) P(+) / (P(x | −) P(−))
• P(+), P(−) are class prior probabilities
– Can just estimate with counts in training data
• P(x | +), P(x | −) are the core of the classifier:
– Nadaraya-Watson: use a kernel density estimate
– K-nearest-neighbor: use a kNN density estimate
– Naïve Bayes:
• assume P(x | class) = P(x1 | class) ⋯ P(xn | class)
• model each of those parametrically
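The naïve Bayes bullet can be sketched end-to-end in a few lines: count-based class priors, one Gaussian per feature per class, and classification by comparing log-posteriors. All data and parameters here are invented for illustration:

```python
# Tiny Gaussian naive Bayes for two classes in 2-D.
import math
import random

random.seed(0)
pos = [(random.gauss(2, 1), random.gauss(2, 1)) for _ in range(300)]
neg = [(random.gauss(-2, 1), random.gauss(-2, 1)) for _ in range(100)]

def fit(points):
    """Per-feature mean and variance: the 'model each parametrically' step."""
    n = len(points)
    out = []
    for j in range(2):
        m = sum(p[j] for p in points) / n
        v = sum((p[j] - m) ** 2 for p in points) / n
        out.append((m, v))
    return out

def log_gauss(x, m, v):
    return -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)

params = {"+": fit(pos), "-": fit(neg)}
prior = {"+": len(pos) / 400, "-": len(neg) / 400}   # class priors by counting

def predict(x):
    def score(c):   # log P(c) + sum_j log P(x_j | c): the naive assumption
        return math.log(prior[c]) + sum(log_gauss(x[j], *params[c][j])
                                        for j in range(2))
    return "+" if score("+") > score("-") else "-"

print(predict((1.5, 2.5)), predict((-2.0, -1.0)))
```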