Here's a cute thing I noticed over lunch, distilled down to the most trivial case I could think of:

Alice has a stream of bits she wants to model, with a single value σ. How to choose the best σ? Well, the one that minimizes the error. What error? Well, let's try squared distance to the actual value. Because, like... linear regression or something. So if the training data consists of C0 zeroes, and C1 ones, the total error is

E = C0σ² + C1(1-σ)²

and of course we minimize this by setting the derivative to zero:

0 = C0σ - C1(1-σ)
σ = C1 / (C0 + C1)
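A quick numerical sanity check (my sketch, not from the post; the counts C0 = 7, C1 = 3 are arbitrary): scanning a fine grid of candidate σ values, the squared-error minimizer lands right at C1 / (C0 + C1).

```python
C0, C1 = 7, 3  # arbitrary example counts of zeroes and ones

def squared_error(sigma):
    # Total squared error: each zero costs sigma^2, each one costs (1-sigma)^2.
    return C0 * sigma**2 + C1 * (1 - sigma)**2

# Brute-force scan over a fine grid of candidate sigmas.
candidates = [i / 10000 for i in range(10001)]
best = min(candidates, key=squared_error)
print(best)  # -> 0.3, i.e. C1 / (C0 + C1)
```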

Now Bob also has a stream of bits he wants to model, with a single value σ. How to choose the best σ? Well, the one that minimizes the error. What error? Well, let's try the entropy of the bit we actually see, given that we interpret σ as the probability with which we expect to see a 1. Because, like... information theory or something. So if the training data consists of C0 zeroes, and C1 ones, the total error is

E = -C0 lg(1-σ) - C1 lg(σ)

and of course we minimize this by setting the derivative to zero:

0 = C0/(1-σ) - C1/σ
σ = C1 / (C0 + C1)
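The same grid-scan check works here (again my sketch, with the same arbitrary counts): the cross-entropy minimizer comes out at the same place.

```python
import math

C0, C1 = 7, 3  # same arbitrary example counts as before

def cross_entropy(sigma):
    # -C0*lg(1-sigma) - C1*lg(sigma), where lg is log base 2.
    return -C0 * math.log2(1 - sigma) - C1 * math.log2(sigma)

# Scan the open interval (0, 1); the endpoints blow up.
candidates = [i / 10000 for i in range(1, 10000)]
best = min(candidates, key=cross_entropy)
print(best)  # -> 0.3, the same minimizer as the squared-error loss
```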

The same answer! It seems like the operative fact is that the derivative of x² and the derivative of lg(x) are reciprocal to one another, up to constant factors (the 2 from x², and the ln 2 hiding in lg).
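That reciprocal-derivative fact is easy to verify numerically (a sketch of mine): (x²)' = 2x and (lg x)' = 1/(x ln 2), so their product is the constant 2/ln 2 at every x.

```python
import math

# The product of the two derivatives should be 2/ln(2) regardless of x.
products = []
for x in [0.1, 0.5, 1.0, 3.0]:
    d_sq = 2 * x                   # derivative of x**2
    d_lg = 1 / (x * math.log(2))   # derivative of log2(x)
    products.append(d_sq * d_lg)

print(products)  # every entry is 2/ln(2), about 2.8854
```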