The main entry point for me is Differential Privacy by Cynthia Dwork. A bunch of CMU theory people's names (Avrim, Maria Balcan, Katrina, Aaron Roth --- and indeed my understanding was helped a lot by Katrina sketching out the picture a bit the last time I talked to her, when I visited Spoons'n'Copic in NYC a couple weekends ago) keep coming up downstream from this paper too.
The basic trick is both really simple and deep in a nice way: it is possible to have probabilistic guarantees of privacy by exporting to the public not your whole database, and not a deterministically "sanitized" version of your whole database (which historically has led to disaster even when people thought they knew what they were doing), but a randomly perturbed function of your database, under certain conditions.
Perturbed how?
Okay, well, let me set up the problem first. Let D be the collection of all possible databases of a certain type that might hypothetically arise. Perhaps D is the set of all sets of pairs (P, H) where P is a person and H is their height. Let f be some function D → Rn that computes some public information, a vector of n real numbers, from a database. For example, f outputs a histogram counting how many people are, rounded off to the nearest foot, k feet tall, for each k in {1,...,9} or whatever (so n = 9 here). The punchline coming up is that all I need to do is add some random noise to f, and it will protect the privacy of anyone sitting in the original database, in the following sense: if that person wasn't in the database, with high probability you wouldn't notice.
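Just to make that concrete, here is a quick Python sketch of one such f; the representation of a database as a list of (name, height) pairs, and the names in it, are made up purely for illustration.

    # A "database" is a list of (person, height_in_feet) pairs.
    # f maps a database to a vector of 9 counts: entry k-1 is the number of
    # people whose height rounds to k feet, for k in {1, ..., 9}.
    def f(db):
        counts = [0] * 9
        for person, height in db:
            k = round(height)              # nearest foot
            if 1 <= k <= 9:
                counts[k - 1] += 1
        return counts

    db1 = [("alice", 5.4), ("bob", 6.1), ("carol", 5.0)]
    print(f(db1))                          # [0, 0, 0, 0, 2, 1, 0, 0, 0]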
The example function f above has the useful property that it has low "sensitivity". The sensitivity Δf of an f : D → Rn is defined to be the maximum possible L1 norm (i.e. sum of the absolute values of all coordinates) ||f(D1) - f(D2)|| where D1 and D2 are databases that differ from one another only by one entry being removed or inserted. Histogramming functions like the f above have Δf = 1: the most effect you can have by adding or deleting one dude is to change one bucket's count by one.
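Continuing the toy snippet above, you can watch that bound in action by inserting one record and comparing the two histograms:

    # L1 distance between two count vectors
    def l1(u, v):
        return sum(abs(a - b) for a, b in zip(u, v))

    db2 = db1 + [("dave", 7.1)]            # one record inserted
    print(l1(f(db1), f(db2)))              # 1: only dave's bucket changed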
Now it's time to define ε-differential privacy. The f above isn't private yet; the plan is to define a randomized "release function" of the same type, based on it, that is. A random function g : D → Rn has ε-differential privacy if for any D1 and D2 from D that differ by one record, and any adversarial experiment S : Rn → bool, we get
Pr[S(g(D1))] ≤ exp(ε) Pr[S(g(D2))]
In plain English, the likelihood of the adversary getting the result he did wouldn't have been much different if your sensitive records weren't in the original database at all!
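An "adversarial experiment" here can be anything that looks at the released vector and answers yes or no. Sticking with the made-up histogram example, it might be:

    # "Does the released data suggest someone roughly 7 feet tall is in there?"
    # (Threshold 0.5 rather than 1 because the released vector will be
    # real-valued once we start adding noise.)
    def S(released):
        return released[6] >= 0.5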
Given any f, we let g be just the result of adding Laplace noise to f, independently in each coordinate, with scale parameter σ. The σ is just a knob we can tune, because the main theorem is:
Main Theorem: if f has sensitivity Δf, then this g has (Δf/σ)-differential privacy. We can turn up σ as much as we like to get more privacy, at the cost of fidelity of the published data.
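Here is a minimal sketch of such a g in the toy setting from before, reusing f, db1, db2 and S from the earlier snippets; numpy's Laplace sampler takes the scale parameter directly, which lines up with the σ here.

    import numpy as np

    rng = np.random.default_rng(0)

    # Release function: the true histogram plus independent Laplace(sigma)
    # noise in every coordinate.
    def g(db, sigma):
        true_counts = np.array(f(db), dtype=float)
        noise = rng.laplace(loc=0.0, scale=sigma, size=len(true_counts))
        return true_counts + noise

    # Rough Monte Carlo sanity check of the theorem with Δf = 1 and σ = 2,
    # i.e. ε = 1/2: the probabilities of S firing on the two neighboring
    # databases should differ by a factor of at most exp(1/2) ≈ 1.65.
    sigma, trials = 2.0, 100_000
    p1 = sum(S(g(db1, sigma)) for _ in range(trials)) / trials
    p2 = sum(S(g(db2, sigma)) for _ in range(trials)) / trials
    print(p1, p2, max(p1 / p2, p2 / p1))

With dave (our hypothetical seven-footer) present only in db2, p1 and p2 should come out around 0.39 and 0.61, a ratio of roughly 1.57, safely under exp(1/2).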
I almost felt let down by how easy the proof is. Unpacking definitions, we want to show
Pr[S(g(D1))] ≤ exp(Δf/σ) Pr[S(g(D2))]
We can forget about the generality of all possible adversarial experiments and focus on the density function at one particular vector: if the density of g(D1) is pointwise at most exp(Δf/σ) times the density of g(D2), then integrating over whatever event S the adversary cares about gives the inequality above. So it suffices to show, for all r ∈ Rn,
Pr[g(D1) = r] ≤ exp(Δf/σ) Pr[g(D2) = r]
Ok, now we unpack the definition of the Laplace distribution. Since the noise is Laplace with scale σ, added independently in each of the n coordinates, the density of g(D1) at r falls off exponentially in the L1 distance between r and f(D1), with σ of course controlling how quickly it falls off. There's a normalization factor of (1/2σ)^n out front as well. Our new goal is
(1/2σ)^n exp(-|r - f(D1)|/σ) ≤ exp(Δf/σ) (1/2σ)^n exp(-|r - f(D2)|/σ)
(where the vertical bars again denote the L1 norm). The normalization factors cancel on both sides
exp(-|r - f(D1)|/σ) ≤ exp(Δf/σ) exp(-|r - f(D2)|/σ)
take the log of both sides and multiply through by σ
-|r - f(D1)| ≤ Δf - |r - f(D2)|
swap the negatives around
|r - f(D2)| ≤ Δf + |r - f(D1)|
and we see that we could prove this by just appealing to the triangle inequality on the L1 norm
|r - f(D2)| ≤ |f(D1) - f(D2)| + |r - f(D1)|
if only Δf ≥ |f(D1) - f(D2)| --- except Δf was exactly chosen to be the max over all possible |f(D1) - f(D2)|, so it's true, QED.
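And if you'd rather poke at the pointwise density bound numerically than take the algebra's word for it, here is a throwaway check in the same toy setup as before (reusing f, db1 and db2, with Δf = 1 and σ = 2):

    import numpy as np

    # Density of g(D) = f(D) + independent Laplace(sigma) noise per coordinate,
    # evaluated at the point r.
    def density(r, center, sigma):
        n = len(r)
        return (1 / (2 * sigma)) ** n * np.exp(-np.abs(r - center).sum() / sigma)

    sigma, delta_f = 2.0, 1.0
    c1 = np.array(f(db1), dtype=float)
    c2 = np.array(f(db2), dtype=float)
    rng = np.random.default_rng(1)
    for _ in range(1000):
        r = rng.uniform(-2.0, 5.0, size=len(c1))         # arbitrary test points
        ratio = density(r, c1, sigma) / density(r, c2, sigma)
        assert ratio <= np.exp(delta_f / sigma) + 1e-9   # the bound we just proved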
Fundamentally this feels like it's just a corollary of a sort of uniform continuity property of appropriate functions f.