I am not sure if this paper is the canonical answer, but I think it's a rather nice one. The punchline is that concurrency can be thought of as a rather simple monad transformer.
What monad transformer is it? Let me start by discussing the CPS monad a bit first.
The Continuation-Passing Style monad consists first of all of the type operator, call it C, that takes a type A to the double-negation CA = (A → ⊥) → ⊥. The injection η : α → Cα just throws the value it gets to the current continuation. It's
λ x . λ k . k x
The Kleisli lifting operation * : Cα → (α → Cβ) → Cβ makes some sense if you put yourself in a real CPS-translation sort of mood. What it is is the following:
λ z . λ f . λ c . z (λ a:α . f a c)
The story goes like this. We have this computation z : Cα, and another computation that expects to get an α-value f : (α → Cβ). Our job is to build up a CPS computation, so that means that in any event we capture our current continuation c : (β → ⊥). This is where our ultimate answer, of type β, is supposed to be returned. So we launch off into the computation z. It will never return to us, but it will return an α-value to the continuation (λ a:α . f a c) we gave it. In this continuation, we grab the actual value, and feed it to f, which becomes a CPS computation monadically returning a β, so we subsequently feed it our original continuation c, and it's satisfied.
If you go through this story carefully, you can see that it doesn't actually depend on the type ⊥ being ⊥! For every type R of "results" or "ultimate return values" or however you want to think about it, there's a CPS-returning-R monad, whose type operator is CRA = (A → R) → R, and whose injection and kleisli lift are the same two lambda terms above.
What the paper does is, for any monad M that represents the effects that you already expected to be in your sequential language, make a clever choice of R, so that CR is the monad for the concurrent version of that language. And that choice of R is sort of like the binary tree monad, except you can inject little wrappers that apply the monad M internal unary tree nodes. It's the type (mixing ML and mathematical notation a bit)
datatype action = Atom of (M action) | Fork of action * action | Stop
The thing that I find tremendously cute about the whole setup is that you just can't avoid thinking correctly about what counts as atomic blocks. Every time you shell out to the monad M: that's an atomic block. By default, interleaving doesn't make any sense. The whole enterprise is to define what being allowed to engage in interleaving means, and you just have to tell the language which chunks are discernible enough to interleave. I remember Liu's talk at NJPLS had this same feeling about swapping the default assumption about atomicity and concurrency that I liked.
Anyway, once you capture the "trace" of your concurrent program in this data structure, it's then entirely definable how scheduling works. You can just write a function from CR() → M(). The paper does so, implementing a simple round-robin scheme. It's really slick. There's something that feels like cheating about it, since it seems possible to emulate fork-join parallelism, but the "trace" datastructure only has the forks in it — interaction between processes takes place in the sequential composition of the original monad M.