Alice has a stream of bits she wants to model, with a single value σ. How to choose the best σ? Well, the one that minimizes the error. What error? Well, let's try squared distance to the actual value. Because, like... linear regression or something. So if the training data consists of C₀ zeroes and C₁ ones, the total error is

E = C₀σ² + C₁(1 − σ)²

and of course we minimize this by setting the derivative to zero:

0 = C₀σ − C₁(1 − σ)

σ = C₁ / (C₀ + C₁)
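A quick numerical sanity check of Alice's answer (the counts C0 = 30, C1 = 70 are made-up example values, not from the text):

```python
# Made-up example counts: C0 zeroes and C1 ones in the training data.
C0, C1 = 30, 70

def squared_error(s):
    # E = C0*s^2 + C1*(1-s)^2, as in the text.
    return C0 * s**2 + C1 * (1 - s)**2

# Brute-force scan over a fine grid of candidate sigmas; the error is
# strictly convex, so the grid minimizer sits at the true minimum.
best = min((i / 10000 for i in range(10001)), key=squared_error)
# best lands on sigma = C1 / (C0 + C1) = 0.7
```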

Now Bob also has a stream of bits he wants to model, with a single value σ. How to choose the best σ? Well, the one that minimizes the error. What error? Well, let's try the entropy of the bit we actually see, given that we interpret σ as the probability with which we expect to see a 1. Because, like... information theory or something. So if the training data consists of C₀ zeroes and C₁ ones, the total error is

E = −C₀ lg(1 − σ) − C₁ lg(σ)

and of course we minimize this by setting the derivative to zero:

0 = C₀/(1 − σ) − C₁/σ

σ = C₁ / (C₀ + C₁)
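The same check works for Bob's log-loss error (again with made-up example counts C0 = 30, C1 = 70):

```python
import math

# Made-up example counts: C0 zeroes and C1 ones in the training data.
C0, C1 = 30, 70

def cross_entropy(s):
    # E = -C0*lg(1-s) - C1*lg(s): total surprise in bits when s is
    # the predicted probability of seeing a 1.
    return -C0 * math.log2(1 - s) - C1 * math.log2(s)

# Scan the open interval (0, 1); the endpoints blow up to infinity.
best = min((i / 10000 for i in range(1, 10000)), key=cross_entropy)
# best again lands on sigma = C1 / (C0 + C1) = 0.7
```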

The same answer! It seems like the operative fact is that the derivative of x² and the derivative of lg(x) are reciprocal to one another (ignoring irrelevant constant factors: the 2 from the square and the 1/ln 2 from the base-2 logarithm).
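The reciprocal-derivative observation is easy to verify numerically: d/dx x² = 2x and d/dx lg(x) = 1/(x ln 2), so their product should be the constant 2/ln 2 at every x (sample points below are arbitrary):

```python
import math

# d/dx x^2 = 2x; d/dx lg(x) = 1/(x ln 2). Their product is the
# constant 2/ln 2 (~2.885) regardless of x, i.e. the two derivatives
# are reciprocal up to a constant factor.
products = [(2 * x) * (1 / (x * math.log(2))) for x in (0.1, 0.5, 0.9)]
```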