
[Sep. 16th, 2015, 03:56 pm]
Jason

Here's a cute thing I noticed over lunch, distilled down to the most trivial case I could think of:
Alice has a stream of bits she wants to model, with a single value σ. How to choose the best σ? Well, the one that minimizes the error. What error? Well, let's try squared distance to the actual value. Because, like... linear regression or something. So if the training data consists of C_{0} zeroes, and C_{1} ones, the total error is
E = C_{0}σ^{2} + C_{1}(1 − σ)^{2}
and of course we minimize this by setting the derivative to zero:
0 = C_{0}σ − C_{1}(1 − σ)   ⟹   σ = C_{1} / (C_{0} + C_{1})
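As a quick numerical sanity check (a minimal Python sketch, with hypothetical counts C_{0} = 30, C_{1} = 70), brute-force minimizing the squared error over a fine grid lands right at C_{1}/(C_{0} + C_{1}):

```python
C0, C1 = 30, 70  # hypothetical counts of zeroes and ones

# Alice's loss: total squared distance from sigma to each observed bit.
def E(s):
    return C0 * s**2 + C1 * (1 - s)**2

# Brute-force minimize over a fine grid of sigma in (0, 1).
grid = [i / 10000 for i in range(1, 10000)]
best = min(grid, key=E)
print(best)  # 0.7, i.e. C1 / (C0 + C1)
```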
Now Bob also has a stream of bits he wants to model, with a single value σ. How to choose the best σ? Well, the one that minimizes the error. What error? Well, let's try the entropy of the bit we actually see, given that we interpret σ as the probability with which we expect to see a 1. Because, like... information theory or something. So if the training data consists of C_{0} zeroes, and C_{1} ones, the total error is
E = −C_{0}lg(1 − σ) − C_{1}lg(σ)
and of course we minimize this by setting the derivative to zero:
0 = C_{0}/(1 − σ) − C_{1}/σ   ⟹   σ = C_{1} / (C_{0} + C_{1})
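The same grid-search sketch works for Bob's loss (again with the hypothetical counts C_{0} = 30, C_{1} = 70), reading σ as the probability of seeing a 1 and scoring each bit by its code length in bits:

```python
from math import log2

C0, C1 = 30, 70  # hypothetical counts of zeroes and ones

# Bob's loss: total code length in bits when sigma is read as P(bit = 1).
def E(s):
    return -C0 * log2(1 - s) - C1 * log2(s)

# Brute-force minimize over a fine grid of sigma in (0, 1).
grid = [i / 10000 for i in range(1, 10000)]
best = min(grid, key=E)
print(best)  # 0.7 again: C1 / (C0 + C1)
```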
The same answer! It seems like the operative fact is that the derivative of x^{2} and the derivative of lg(x) are reciprocal to one another (ignoring the not-relevant constant factors: the 2 from x^{2}, and the 1/ln 2 from the base-2 log).
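A quick check of that operative fact: the product of the two derivatives, 2x and 1/(x ln 2), is the constant 2/ln 2 no matter the x.

```python
from math import log

# (x^2)' = 2x and (lg x)' = 1/(x ln 2); their product is the
# constant 2/ln 2, independent of x.
products = [2 * x * (1 / (x * log(2))) for x in (0.1, 0.5, 0.9)]
print(products)  # each entry is 2/ln 2, about 2.885
```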

