Shared posts

19 Jul 13:45

Bypassing no-go theorems

by John Shutt
This is not at all what I had in mind.
— Albert Einstein, in response to David Bohm's hidden variable theory.

A no-go theorem is a formal proof that a certain kind of theory cannot work.  (The term no-go theorem seems to be used mainly in physics; I find it useful in a more general context.)

A valid no-go theorem identifies a hopeless avenue of research; but in some cases, it also identifies a potentially valuable avenue for research.  This is because in some cases, the no-go theorem is commonly understood more broadly than its actual technical result.  Hence the no-go theorem is actually showing that some specific tactic doesn't work, but is interpreted to mean that some broad strategy doesn't work.  So when you see a no-go theorem that's being given a very broad interpretation, you may do well to ask whether there is, after all, a way to get around the theorem, by achieving what the theorem is informally understood to preclude without doing what the theorem formally precludes.

In this post, I'm going to look at four no-go theorems with broad informal interpretations.  Two of them are in physics; I touch on them briefly here, as examples of the pattern (having explored them in more detail in an earlier post).  A third is in programming-language semantics, where I found myself with a result that bypassed a no-go theorem of Mitchell Wand.  And the fourth is a no-go theorem in logic that I don't actually know quite how to bypass... or even whether it can be bypassed... yet... but I've got some ideas where to look, and it's good fun to have a go at it:  Gödel's Theorem.

John von Neumann's no-go theorem

In 1932, John von Neumann proved that no hidden variable theory can make exactly the same predictions as quantum mechanics (QM):  all hidden variable theories are experimentally distinguishable from QM.  In 1952, David Bohm published a hidden variable theory experimentally indistinguishable from QM.

How did Bohm bypass von Neumann's no-go theorem?  Simple, really.  (If bypassing a no-go theorem is possible at all, it's likely to be very simple once you see how).  The no-go theorem assumed that the hidden variable theory would be local; that is, that under the theory, the effect of an event in spacetime cannot propagate across spacetime faster than the speed of light.  This was, indeed, a property Einstein wanted out of a hidden variable theory:  no "spooky action at a distance".  But Bohm's hidden variable theory involved a quantum potential field that obeys Schrödinger's Equation — trivially adopting the mathematical infrastructure of quantum mechanics, spooky-action-at-a-distance and all, yet doing it in a way that gave each particle its own unobservable precise position and momentum.  Einstein remarked, "This is not at all what I had in mind."

John Stewart Bell's no-go theorem

In 1964, John Stewart Bell published a proof that all local hidden variable theories are experimentally distinguishable from QM.  For, of course, a suitable definition of "local hidden variable theory".  Bell's result can be bypassed by formulating a hidden variable theory in which signals can propagate backwards in time — an approach advocated by the so-called transactional interpretation of QM, and which, as noted in my earlier post on metaclassical physics, admits the possibility of a theory that is still "local" with respect to a fifth dimension of meta-time.

Mitchell Wand's no-go theorem

In 1998, Mitchell Wand published a paper The Theory of Fexprs is Trivial.

The obvious interpretation of the title of the paper is that if you include fexprs in your programming language, the theory of the language will be trivial.  When the paper first came out, I had recently hit on my key insight about how to handle fexprs, around which the Kernel programming language would grow, so naturally I scrutinized Wand's paper very closely to be sure it didn't represent a fundamental threat to what I was doing.  It didn't.  I might put Wand's central result this way:  If a programming language has reflection that makes all computational states observable in the syntactic theory of the language, and if computational states are in one-to-one correspondence with syntactic forms, then the syntactic theory of the language is trivial.  This isn't a problem for Kernel because neither of these conditions holds:  not all computational states are observable, and computational states are not in one-to-one correspondence with syntactic forms.  I could make a case, in fact, that in S-expression Lisp, input syntax represents only data:  computational states cannot be represented using input syntax at all, which means both that the syntactic theory of the language is already trivial on conceptual grounds, and also that the theory of fexprs is not syntactic.

At the time I started writing my dissertation, the best explanation I'd devised for why my theory was nontrivial despite Wand was that Wand did not distinguish between Lisp evaluation and calculus term rewriting, whereas for me Lisp evaluation was only one of several kinds of term rewriting.  Quotation, or fexprs, can prevent an operand from being evaluated; but trivialization results from a context preventing its subterm from being rewritten.  It's quite possible to prevent operand evaluation without trivializing the theory, provided evaluation is a specific kind of rewriting (requiring, in technical parlance, a redex that includes some context surrounding the evaluated term).

Despite myself, though, I was heavily influenced by Wand's result and by the long tradition it followed.  Fexprs had been rejected circa 1980 as a Lisp paradigm, in favor of macros.  A rejected paradigm is usually ridiculed in order to rally followers more strongly behind the new paradigm (see here).  My pursuit of $vau as a dissertation topic involved a years-long process of gradually ratcheting up expectations.  At first, I didn't think it would be worth formulating a vau-calculus at all, because of course it wouldn't be well-enough behaved to justify the formulation.  Then I thought, well, an operational semantics for an elementary subset of Kernel would be worth writing.  Then I studied Plotkin's and Felleisen's work, and realized I could provide a semantics for Kernel that would meet Plotkin's well-behavedness criteria, rather than the slightly weakened criteria Felleisen had used for his side-effectful calculus.  And then came the shock, when I realized that the vau-calculus I'd come up with, besides being essentially as well-behaved as Plotkin's call-by-value lambda-calculus, was actually (up to isomorphism) a conservative extension of call-by-value lambda-calculus.  In other words, my theory of fexprs consisted of the entire theory of call-by-value lambda-calculus plus additional theorems.

I was boggled.  And I was naively excited.  I figured, I'd better get this result published, quick, before somebody else notices it and publishes it first — because it's so incredibly obvious, it can't be long before someone else does find it.  Did I say "naively"?  That's an understatement.  There's some advice for prospective graduate students floating around, which for some reason I associate with Richard Feynman (though I could easily be wrong about that [note: a reader points out this]), to the effect that you shouldn't be afraid people will steal your ideas when you share them, because if your ideas are any good you'll have trouble getting anyone to listen to them at all.  In studying this stuff for years on end I had gone so far down untrodden paths that I was seeing things from a drastically unconventional angle, and if even so I had only just come around a corner to where I could see this thing, others were nowhere close to any vantage from which they could see it.

[note: I've since written a post elaborating on this, Explicit evaluation.]
Kurt Gödel's no-go theorem

Likely the single most famous no-go theorem around is Gödel's Theorem.  (Actually, it's Gödel's Theorems, plural, but the common informal understanding of the result doesn't require this distinction — and Gödel's result lends itself spectacularly to informal generalization.)  This is what I'm going to spend most of this post on, because, well, it's jolly good fun (recalling the remark attributed to Abraham Lincoln: People who like this sort of thing will find it just the sort of thing they like).

The backstory to Gödel was that in the early-to-mid nineteenth century, mathematics had gotten itself a shiny new foundation in the form of a formal axiomatic approach.  And through the second half of the nineteenth century mathematicians expanded on this idea.  Until, as the nineteenth century gave way to the twentieth, they started to uncover paradoxes implied by their sets of axioms.

A perennial favorite (perhaps because it's easily explained) is Russell's Paradox.  Let A be the set of all sets that do not contain themselves.  Does A contain itself?  Intuitively, one can see at once that if A contains itself, then by its definition it does not contain itself; and if it does not contain itself, then by its definition it does contain itself.  The paradox mattered for mathematicians, though, for how it arose from their logical axioms, so we'll be a bit more precise here.  The two key axioms involved are reductio ad absurdum and the Law of the Excluded Middle.

Reductio ad absurdum says that if you suppose a proposition P, and under this supposition you are able to derive a contradiction, then not-P.  Supposing A contains itself, we derive a contradiction, therefore A does not contain itself. Supposing A does not contain itself, we derive a contradiction, therefore —careful!— A does not not contain itself. This is where the Law of the Excluded Middle comes in:  A either does or does not contain itself, therefore since it does not not contain itself, it does contain itself.  We have therefore an antinomy, that is, we've proved both a proposition P and its negation not-P (A both does and does not contain itself).  And antinomies are really bad news, because according to these axioms we've already named, if there is some proposition P for which you can prove both P and not-P, then you can prove every proposition, no matter what it is.  Like this:  Take any proposition Q.  Suppose not-Q; then P and not-P, which is a contradiction, therefore by reductio ad absurdum, not-not-Q, and by the Law of the Excluded Middle, Q.

When Russell's Paradox was published, the shiny new axiomatic foundations of mathematics were still less than a human lifetime old.  Mathematicians started trying to figure out where things had gone wrong.  The axioms of classical logic were evidently inconsistent, leading to antinomies, and the Law of the Excluded Middle was identified as a problem.

One approach to the problem, proposed by David Hilbert, was to back off to a smaller set of axioms that were manifestly consistent, then use that smaller set of axioms to prove that a somewhat larger set of axioms was consistent.  Although clearly the whole of classical logic was inconsistent, Hilbert hoped to salvage as much of it as he could.  This plan to use a smaller set of axioms to bootstrap consistency of a larger set of axioms was called Hilbert's program, and I'm remarking on it because we'll have occasion to come back to it later.

Unfortunately, in 1931 Kurt Gödel proved a no-go theorem for Hilbert's program:  that for any reasonably powerful system of formal logic, if the logic is consistent, then it cannot prove the consistency of its own axioms, let alone its own axioms plus some more on the side.  The proof ran something like this:  For any sufficiently powerful formal logic M, one can construct a proposition A of M that amounts to "this proposition is unprovable".  If A were provable, that would prove that A is false, an antinomy; if not-A were provable, that would prove that A is true, again an antinomy; so M can only be consistent if both A and not-A are unprovable.  But if M were able to prove its own consistency, that would prove that A is unprovable (because A must be unprovable in order for M to be consistent), which would prove that A is true, producing an antinomy, and M would in fact be inconsistent.  Run by that again:  If M can prove its own consistency, then M is in fact inconsistent.

Typically, on completion of a scientific paradigm shift, the questions that caused the shift cease to be treated as viable questions by new researchers; research on those questions tapers off rapidly, pushed forward only by people who were already engaged by those questions at the time of the shift.  So it was with Gödel's results.  Later generations mostly treated them as the final word on the foundations of mathematics:  don't even bother, we know it's impossible.  That was pretty much the consensus view when I began studying this stuff in the 1980s, and it's still pretty much the consensus view today.

Going there

Having been trained to think of Gödel's Theorem as a force of nature, I nevertheless found myself studying it more seriously when writing the theoretical background material for my dissertation.  I found myself discoursing at length on the relationship between mathematics, logic, and computation, and a curious discrepancy caught my eye.  Consider the following Lisp predicate.

($define! A ($lambda (P) (not? (P P))))
Predicate A takes one argument, P, which is expected to be a predicate of one argument, and returns the negation of what P would return when passed to itself.  This is a direct Lisp translation of Russell's Paradox.  What happens when A is passed itself?

The answer is, when A is passed itself, (A A), nothing interesting happens — which is really very interesting.  The predicate attempts to recurse forever, never terminating; in practice it will eventually fill up all available memory with a stack of pending continuations and terminate with an error.  What it won't do is cause mathematicians to despair of finding a solid foundation for their subject.  If asking whether set A contains itself is so troublesome, why is applying predicate A to itself just a practical limit on how predicate A should be used?
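For anyone who'd like to poke at the same behaviour outside Lisp, here's a rough analogue in C++ — my own illustration, not anything from Kernel — in which self-application likewise just recurses until the stack runs out:

// A rough C++ analogue of the predicate A above (illustration only):
// a callable that negates the result of applying its argument to itself.
struct A {
    bool operator()(A a) const { return !a(a); }  // unbounded recursion
};
// A{}(A{}) never returns an answer; in practice it overflows the stack.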

That question came from my dissertation.  Meanwhile, another question came from the other major document I was developing, the R-1RK.  I wanted to devise a uniquely Lisp-ish variant of the concept of eager type-checking.  It seemed obvious to me that there should not be a fixed set of rules of type inference built into the language; that lacks generality, and is not extensible.  So my idea was this:  In keeping with the philosophy that everything should be first-class, let theorems about the program be an encapsulated type of first-class objects.  And carefully design the constructors for this theorem type so that you can't construct the object unless it's provable.  In effect, the constructors are the axioms of the logic.  Modus ponens, say, is a constructor:  given a theorem P and a theorem P-implies-Q, the modus-ponens constructor allows you to construct a theorem Q.  As desired, there is no built-in inference engine:  the programmer takes ultimate responsibility for figuring out how to prove things.
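To make the shape of the idea concrete, here's a minimal sketch — in C++ rather than Kernel, with a crude string representation of propositions; none of this is the actual R-1RK design — of a theorem type whose only public constructors are axioms and inference rules:

#include <stdexcept>
#include <string>
#include <utility>

// Encapsulated theorem type: the only way to obtain a Theorem value is
// through the axiom/inference-rule constructors below, so holding a
// Theorem amounts to holding a proof of its proposition.
class Theorem {
    std::string prop;                                        // proposition proved
    explicit Theorem(std::string p) : prop(std::move(p)) {}  // private!
public:
    const std::string &proposition() const { return prop; }

    // Axiom schema (illustrative choice): for any P, provably (P => P).
    static Theorem identity(const std::string &p) {
        return Theorem("(" + p + " => " + p + ")");
    }

    // Modus ponens as a constructor: from P and (P => Q), construct Q.
    static Theorem modus_ponens(const Theorem &p, const Theorem &p_implies_q) {
        const std::string prefix = "(" + p.prop + " => ";
        const std::string &s = p_implies_q.prop;
        if (s.compare(0, prefix.size(), prefix) != 0 || s.back() != ')')
            throw std::invalid_argument("premises do not match");
        return Theorem(s.substr(prefix.size(), s.size() - prefix.size() - 1));
    }
};

There is deliberately no prove() entry point anywhere: a program that wants a Theorem for Q has to build the proof out of these constructors itself, which is exactly the "no built-in inference engine" point.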

Of the many questions in how to design such a first-class theorem type, one of the early ones has to be, what system of axioms should we use?  Clearly not classical logic, because we know that would give us an inconsistent system.  This, though, was pretty discouraging, because it seemed I would find myself directly confronting in my design the very sort of problems that Gödel's Theorem says are ultimately unsolvable.

But then I had a whimsical thought; the sort of thing that seems at once not-impossible and yet such a long shot that one can just relax and enjoy exploring it without feeling under pressure to produce a result in any particular timeframe (and yet, I have moved my thinking forward on this over the years, which keeps it interesting).  What if we could find a way to take advantage of the fact that our logic is embedded in a computational system, by somehow bleeding off the paradoxes into mere nontermination?  So that they produce the anticlimax of functions that don't terminate instead of the existential angst of inconsistent mathematical foundations?

Fragments

At this point, my coherent vision dissolves into fragments of tentative insight, stray puzzle pieces I'm still pushing around hoping to fit together.

One fragment:  Alfred Tarski —who fits the aforementioned profile of someone already engaged by the questions when Gödel's results came out— suggested post-Gödel that the notion of consistency should be derived from common sense.  Hilbert's program had actually pursued a formal definition of consistency, as the property that not all propositions are provable; this does have a certain practicality to it, in that the practical difficulty with the classical antinomies was that they allowed all propositions Q to be proven, so that "Q can be proven" ceased to be an informative statement about Q.  Tarski, though, remarked that when a non-mathematician is told that both P and not-P are true, they can see that something is wrong without having to first receive a lecture on the formal consequences of antinomies in interaction with reductio ad absurdum.

So, how about we resort to some common sense, here?  A common-sensical description of Russell's Paradox might go like this:

A is the set of all sets that do not contain themselves.  If A contains itself, then it does not contain itself.  But if it does not contain itself, then it does contain itself.  But if it does contain itself, then it does not contain itself.  But if it does not contain itself...
And that is just what we want to happen:  instead of deriving an antinomy, the reasoning just regresses infinitely.  A human being can see very quickly that this is going nowhere, and doesn't bother to iterate beyond the first four sentences at most (and once they've learned the pattern, next time they'll probably stop after even fewer sentences), but they don't come out of the experience believing that A both does and does not belong to itself; they come out believing that there's no way of resolving the question.

So perhaps we should be asking how to make the conflict here do an infinite regress instead of producing a (common-sensically wrong) answer after a finite number of steps.  This seems to be some sort of deep structural change to how logical reasoning would work, possibly not even a modification of the axioms at all but rather of how they are used.  If it does involve tampering with an axiom, the axiom tampered with might well be reductio ad absurdum rather than the Law of the Excluded Middle.

This idea — tampering with reductio ad absurdum rather than the Law of the Excluded Middle — strikes a rather intriguing historical chord.  Because, you see, one of the mathematicians notably pursuing Hilbert's program pre-Gödel did try to eliminate the classical antinomies by leaving intact the Law of the Excluded Middle and instead modifying reductio ad absurdum.  His name was Alonzo Church —you may have heard of him— and the logic he produced had, in retrospect, a curiously computational flavor to it.  While he was at it, he took the opportunity to simplify the treatment of variables in his logic, by having only a single structure that binds variables, which (for reasons lost in history) he chose to call λ.  Universal and existential quantifiers in his logic were higher-order functions, which didn't themselves bind variables but instead operated on functions that did the binding for them.  Quite a clever device, this λ.

Unfortunately, it didn't take many years after Church's publication to show that antinomies arose in his system after all.  Following the natural reflex of Hilbert's program, Church tried to find a subset of his logical axioms that could be proven consistent — and succeeded.  It turned out that if you left out all the operators except λ you could prove that each proposition P was equivalent to at most one irreducible form.  This result was published in 1936 by Church and one of his students, J. Barkley Rosser, and today is known as the Church–Rosser Theorem (you may have heard of that, too).  In the long run, Church's logic is an obscure historical footnote, while its λ-only subset turned out to be of great interest for computation, and is well-known under the name "λ-calculus".

So evidently this idea of tampering with reductio ad absurdum and bringing computation into the mix is not exactly unprecedented.  Is it possible that there is something there that Alonzo Church didn't notice?  I'd have to say, "yes".  Alonzo Church is one of those people who (like Albert Einstein, you'll recall he came up in relation to the first no-go theorem I discussed) in retrospect appears to have set a standard of intelligence none of us can possibly aspire to — but all such people are limited by the time they live in.  Einstein died years before Bell's Theorem was published.  Heck, Aristotle was clearly a smart guy too, and just think of everything he missed through the inconvenience of being born about two millennia before the scientific revolution.  And Alonzo Church couldn't, by the nature of the beast, have created his logic based on a modern perspective on computation and logic since it was in part the further development of his own work over many decades that has produced that modern perspective.

I've got one more puzzle piece I'm pushing around, that seems like it ought to fit in somewhere.  Remember I said Church's logic was shown to have antinomies?  Well, at the time the antinomy derivation was rather baroque.  It involved a form of the Richard Paradox, which concerns the use of an expression in some class to designate an object that by definition cannot be designated by expressions of that class.  (A version due to G.G. Berry concerns the twenty-one syllable English expression "the least natural number not nameable in fewer than twenty-two syllables".)  The Richard Paradox is naturally facilitated by granting first-class status to functions, as λ-calculus and Lisp do.  But, it turns out, there is a much simpler paradox contained in Church's logic, involving less logical machinery and therefore better suited for understanding what goes wrong when λ-calculus is embedded in a logic.  This is Curry's Paradox.

I'll assume, for this last bit, that you're at least hazily familiar with λ-calculus, so it'll come back to you when you see it.

For Curry's Paradox, we need one logical operator, three logical axioms, and the machinery of λ-calculus itself.  Our one logical operator is the binary implication operator, ⇒.  The syntax of the augmented λ-calculus is then

T   ::=   x | c | (λx.T) | (TT) | (T⇒T)
That is, a term is either a variable, or a constant, or a lambda-expression, or an application, or an implication.  We don't need a negation operator, because we're sticking with the generalized notion of inconsistency as the property that all propositions are provable.  Our axioms assert that certain terms are provable:
  1. For all terms P and Q, if provably P and provably (P⇒Q), then provably Q.    (modus ponens)
  2. For all terms P, provably P⇒P.
  3. For all terms P and Q, provably ((P⇒(P⇒Q))⇒(P⇒Q)).
The sole rewriting axiom of λ-calculus, lest we forget, is the β-rule:
(λx.F)G → F[x ← G]
That is, to apply function (λx.F) to argument G, substitute G for all free occurrences of x in F.

To prove inconsistency, first we need a simple result that comes entirely from λ-calculus itself, called the Fixpoint Theorem.  This result says that for every term F, there exists a term G such that FG = G (that is, every term F has a fixpoint).  The proof works like this:

Suppose F is any term, and let G = (λx.(F(xx)))(λx.(F(xx))), where x doesn't occur in F.  Then G = (λx.(F(xx)))(λx.(F(xx))) → (F(xx))[x ← (λx.(F(xx)))] = F((λx.(F(xx)))(λx.(F(xx)))) = FG.
Notice that although the Fixpoint Theorem apparently says that every F has a fixpoint G, it does not actually require F to be a function at all:  instead of providing a G to which F can be applied, it provides a G from which FG can be derived.  And —moreover— for most F, derivation from G is a divergent computation (G → FG → F(FG) → F(F(FG)) → ...).

Now we're ready for our proof of inconsistency:  that for every term P, provably P.

Suppose P is any term.  Let Q = λx.(x⇒P).  By the Fixpoint Theorem, let R be a term such that QR = R.  By writing out the definition of Q and then applying the β-rule, we have QR = (λx.(x⇒P))R → (R⇒P), therefore R = (R⇒P).

By the second axiom, provably (R⇒R); but R = R⇒P, so, by replacing the right hand R in (R⇒R) with (R⇒P), provably (R⇒(R⇒P)).

By the third axiom, provably ((R⇒(R⇒P))⇒(R⇒P)); and we already have provably (R⇒(R⇒P)), so, by modus ponens, provably (R⇒P). But R = (R⇒P), so provably R.

Since provably R and provably (R⇒P), by modus ponens, provably P.

Note: I've had to fix errors in this proof twice since publication; there's some sort of lesson there about either formal proofs or paradoxes.
So, why did I go through all this in detail?  Besides, of course, enjoying a good paradox.  Well, mostly, this:  The entire derivation turns on the essential premise that derivation in the calculus, as occurs (oddly backwards) in the proof of the Fixpoint Theorem, is a relation between logical terms — which is to say that all terms in the calculus have logical meaning.

And we've seen something like that before, in my early explanation of Mitchell Wand's no-go theorem:  trivialization of theory resulted from assuming that all calculus derivation was evaluation.  So, if we got around Wand's no-go theorem by recognizing that some derivation is not evaluation, what can we do by recognizing that some derivation is not deduction?

19 Jul 13:41

Finding Undefined Behavior Bugs by Finding Dead Code

by regehr

Here’s a draft of a very cool paper by Xi Wang and others at MIT; it is going to appear at the next SOSP.

The idea is to look for code that becomes dead when a C/C++ compiler is smart about exploiting undefined behavior. The classic example of this class of error was found in the Linux kernel several years ago. The code was basically:

struct foo *s = ...;
int x = s->f;
if (!s) return ERROR;
... use s ...

The problem is that the dereference of s in line 2 permits a compiler to infer that s is not null (if the pointer is null then the function is undefined; the compiler can simply ignore this case). Thus, the null check in line 3 gets silently optimized away and now the kernel contains an exploitable bug if an attacker can find a way to invoke this code with a null pointer. Here’s another example that I like where the compiler silently discarded a security check that invoked undefined behavior. There are more examples in the paper and I encourage people to look at them. The point is that these problems are real, and they’re nasty because the problem can only be seen by looking at the compiler’s output. Compilers are getting smarter all the time, causing code that previously worked to break. A sufficiently advanced compiler is indistinguishable from an adversary.
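For concreteness, here is a sketch of my own (the function names are made up, and this isn't from the paper): first the usual fix for the kernel pattern above, then another well-known check that a compiler may delete because signed overflow is undefined.

struct foo { int f; };

/* Fixed version of the pattern above: test the pointer before
   dereferencing it, so no undefined behavior licenses the compiler
   to remove the check. */
int read_f(struct foo *s) {
    if (!s) return -1;   /* ERROR */
    return s->f;
}

/* Another classic "time bomb": because signed overflow is undefined,
   a compiler may assume len + 100 cannot wrap, fold this test to
   false, and drop the overflow check entirely. */
int add_ok(int len) {
    if (len + 100 < len) return -1;  /* intended overflow check */
    return len + 100;
}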

So the premise of this paper is that if you write some code that executes conditionally, and that code can be eliminated by a compiler that understands how to exploit undefined behavior, then there’s a potential bug that you’d like to learn about. But how can we find this kind of code? One way would be to instrument the compiler optimization passes that eliminate code based on undefined behavior. This has some problems. First, due to complex interactions between optimization passes in a real compiler, it is not at all straightforward to determine whether a particular piece of dead code is that way due to undefined behavior, vs. being dead in a boring way. Chris Lattner’s article about undefined behavior has some good discussion of this issue. The second problem is that if we, for example, instrument LLVM to detect cases where it optimizes away code due to undefined behavior, this doesn’t tell us much about problems that we might run into when using a different compiler, or using the next version of LLVM. In particular, it would be nice to be able to find “time bombs” — undefined behavior bugs that aren’t yet exploited by a compiler, but that might be exploited in the future.

The approach taken in this paper was to develop a custom undefined behavior detector based on static analysis of a single function at a time. It takes LLVM IR and converts it into a SAT instance that is satisfiable when code can be eliminated. This has a couple of advantages. First, by considering one function at a time, the tool can scale to very large programs (note that a single function at the LLVM level may contain inlined code from other functions — this increases precision without slowing things down too much). Second, modern SAT solvers are quite strong, giving the tool the opportunity to find bugs that cannot yet be exploited by production compilers. Third, by writing a standalone tool, problems of being entangled with all of the incidental complexity that is found in a collection of real compiler passes are avoided.

At this point you are probably thinking: “I do not want some stupid tool telling me about every piece of dead code in a huge legacy application, because there are millions of those and almost none of them are bugs.” The solution has two parts. First, the most obvious dead code gets eliminated by running LLVM passes like mem2reg, SCCP, and DCE. Second, the authors have a clever solution where they run their tool twice: first without knowledge of undefined behavior, and second with it. A bug is only reported if the second pass can eliminate code that the first one cannot. Based on this technique, the paper reports a very low rate of false positives.
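In other words, the reporting rule is roughly the following (a sketch with hypothetical names, not the tool's actual interface):

struct Block;  /* stand-in for whatever IR fragment is being queried */

/* Hypothetical oracle: can this block be eliminated, with or without
   letting the solver assume undefined behavior never happens? */
bool can_eliminate(const Block &b, bool assume_no_ub);
void report_possible_bug(const Block &b);

void check_block(const Block &b) {
    bool dead_plain   = can_eliminate(b, /* assume_no_ub = */ false);
    bool dead_with_ub = can_eliminate(b, /* assume_no_ub = */ true);
    /* Only flag code whose deadness depends on exploiting undefined
       behavior; boring dead code is filtered out by the first pass. */
    if (!dead_plain && dead_with_ub)
        report_possible_bug(b);
}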

One thing that is interesting about the tool in this paper is that it has very conservative criteria for reporting a bug: either undefined behavior is invoked on all paths (e.g. it finds a type 3 function) or else it finds code that is made dead by exploiting undefined behavior. Although this conservative behavior could be seen as a disadvantage, in many ways it is a big advantage: the problems it flags are probably serious, since the code that gets eliminated by the compiler is often a safety check of some sort. The authors were able to find, report, and convince developers to fix 157 bugs in open source programs, which is impressive. They also provide some good anecdotes about developers who could not be convinced to fix undefined behavior bugs, something I’ve also run into.

In summary, by adopting a solid premise (“developers want to know when code they write can be eliminated based on exploitation of undefined behavior”) the authors of this paper have found a way to home in on a serious class of bugs and also to avoid the false positive problems that might otherwise plague an intraprocedural static analysis tool. Xi says that the tool should be released in August.

18 Jul 15:33

Active Anti-Entropy

by kellabyte

Distributed systems that have relaxed consistency guarantees between replicas need ways to eventually converge on the same state. Doing this with a large set of data can be a real challenge. Processing and comparing full sets of data is expensive. If two replicas aren’t consistent you want to find out what pieces of data aren’t the same as efficiently as possible so that you can repair the inconsistent data.

Several eventually consistent databases such as Cassandra, Riak and Voldemort implement ideas introduced in the Amazon Dynamo paper. The Dynamo paper included the idea of using Merkle trees (hash trees) for comparing data efficiently. BitTorrent uses Merkle trees as well.

Merkle Tree

The leaves in a Merkle tree are hashes of data. Nodes further up in the tree are the hashes of their respective children. This allows you to compare in a top-down fashion until you find a mismatch. If the top-level hash is the same, both replicas are consistent without having to check any further.
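As a rough illustration — my own sketch, not code from any of these databases, with std::hash standing in for a real cryptographic hash and both replicas assumed to partition their data into the same blocks — building and comparing such a tree might look like:

#include <functional>
#include <string>
#include <vector>

// Minimal Merkle tree: leaves hash data blocks, interior nodes hash the
// concatenation of their children's hashes.
struct Node {
    std::size_t hash;
    std::vector<Node> children;   // empty for leaves
};

static std::size_t hash_str(const std::string &s) {
    return std::hash<std::string>{}(s);
}

Node build(const std::vector<std::string> &blocks, std::size_t lo, std::size_t hi) {
    if (hi - lo == 1)
        return Node{hash_str(blocks[lo]), {}};
    std::size_t mid = lo + (hi - lo) / 2;
    Node left = build(blocks, lo, mid);
    Node right = build(blocks, mid, hi);
    return Node{hash_str(std::to_string(left.hash) + std::to_string(right.hash)),
                {left, right}};
}

// Top-down comparison: only descend where hashes differ, so the work is
// proportional to the amount of divergent data rather than the whole set.
void diff(const Node &a, const Node &b, std::vector<std::size_t> &divergent_blocks,
          std::size_t lo, std::size_t hi) {
    if (a.hash == b.hash) return;                 // whole subtree matches
    if (a.children.empty()) { divergent_blocks.push_back(lo); return; }
    std::size_t mid = lo + (hi - lo) / 2;
    diff(a.children[0], b.children[0], divergent_blocks, lo, mid);
    diff(a.children[1], b.children[1], divergent_blocks, mid, hi);
}

Usage would be something like diff(build(a, 0, a.size()), build(b, 0, b.size()), out, 0, a.size()); if the root hashes already match, the walk stops immediately, which is the whole point.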

Anti-entropy

Anti-entropy is the process of detecting inconsistencies and repairing these inconsistencies. Sometimes this is simply described as a “repair” that you may issue manually. In Riak there is support for what’s called active anti-entropy. Active anti-entropy takes a more automated approach by continuously comparing Merkle trees between replicas in the background and repairing inconsistencies.

In my opinion, the choice of whether active anti-entropy makes sense depends on a lot of factors about your system and its workload characteristics.

Joseph Blomstedt has an excellent video describing how active anti-entropy works in Riak. The video is also a really great example of how Merkle trees work, so I suggest anyone using Cassandra or Riak, or just interested in distributed systems, give it a watch. Or four.
19 Jun 16:35

Excellent Papers for 2012

by Research @ Google
Posted by Corinna Cortes and Alfred Spector, Google Research

Googlers across the company actively engage with the scientific community by publishing technical papers, contributing open-source packages, working on standards, introducing new APIs and tools, giving talks and presentations, participating in ongoing technical debates, and much more. Our publications offer technical and algorithmic advances, feature aspects we learn as we develop novel products and services, and shed light on some of the technical challenges we face at Google.

In an effort to highlight some of our work, we periodically select a number of publications to be featured on this blog. We first posted a set of papers on this blog in mid-2010 and subsequently discussed them in more detail in the following blog postings. In a second round, we highlighted new noteworthy papers from the latter half of 2010 and again in 2011. This time we honor the influential papers authored or co-authored by Googlers covering all of 2012 -- roughly 6% of our total publications.  It’s tough choosing, so we may have left out some important papers.  So, do see the publications list to review the complete group.

In the coming weeks we will be offering a more in-depth look at some of these publications, but here are the summaries:

Algorithms and Theory

Online Matching with Stochastic Rewards
Aranyak Mehta*, Debmalya Panigrahi [FOCS'12]
Online advertising is inherently stochastic: value is realized only if the user clicks on the ad, while the ad platform knows only the probability of the click. This paper is the first to introduce the stochastic nature of the rewards to the rich algorithmic field of online allocations. The core algorithmic problem it formulates is online bipartite matching with stochastic rewards, with known click probabilities. The main result is an online algorithm which obtains a large fraction of the optimal value. The paper also shows the difficulty introduced by the stochastic nature, by showing how it behaves very differently from the classic (non-stochastic) online matching problem.

Matching with our Eyes Closed
Gagan Goel*, Pushkar Tripathi* [FOCS'12]
In this paper we propose a simple randomized algorithm for finding a matching in a large graph. Unlike most solutions to this problem, our approach does not rely on building large combinatorial structures (like blossoms) but works completely locally. We analyze the performance of our algorithm and show that it does significantly better than the greedy algorithm. In doing so we improve a celebrated 18-year-old result by Aronson et al.

Simultaneous Approximations for Adversarial and Stochastic Online Budgeted Allocation
Vahab Mirrokni*, Shayan Oveis Gharan, Morteza Zadimoghaddam, [SODA'12]
In this paper, we study online algorithms that simultaneously perform well in worst-case and average-case instances, or equivalently algorithms that perform well in both stochastic and adversarial models at the same time. This is motivated by online allocation of queries to advertisers with budget constraints. Stochastic models are not robust enough to deal with traffic spikes and adversarial models are too pessimistic. While several algorithms have been proposed for these problems, each algorithm was known to perform well in one model and not both, and we present new results for a single algorithm that works well in both models.

Economics and EC

Polyhedral Clinching Auctions and the Adwords Polytope
Gagan Goel*, Vahab Mirrokni*, Renato Paes Leme [STOC'12]
Budgets play a major role in ad auctions where advertisers explicitly declare budget constraints. Very little is known in auctions about satisfying such budget constraints while keeping incentive compatibility and efficiency. The problem becomes even harder in the presence of complex combinatorial constraints over the set of feasible allocations. We present a class of ascending-price auctions addressing this problem for a very general class of (polymatroid) allocation constraints including the AdWords problem with multiple keywords and multiple slots.

HCI

Backtracking Events as Indicators of Usability Problems in Creation-Oriented Applications
David Akers*, Robin Jeffries*, Matthew Simpson*, Terry Winograd [TOCHI '12]
Backtracking events such as undo can be useful automatic indicators of usability problems for creation-oriented applications such as word processors and photo editors. Our paper presents a new cost-effective usability evaluation method based on this insight.

Talking in Circles: Selective Sharing in Google+
Sanjay Kairam, Michael J. Brzozowski*, David Huffaker*, Ed H. Chi*, [CHI'12]
This paper explores why so many people share selectively on Google+: to protect their privacy but also to focus and target their audience. People use Circles to support these goals, organizing contacts by life facet, tie strength, and interest.

Information Retrieval

Online selection of diverse results
Debmalya Panigrahi, Atish Das Sarma, Gagan Aggarwal*, and Andrew Tomkins*, [WSDM'12]
We consider the problem of selecting subsets of items that are simultaneously diverse in multiple dimensions, which arises in the context of recommending interesting content to users. We formally model this optimization problem, identify its key structural characteristics, and use these observations to design an extremely scalable and efficient algorithm. We prove that the algorithm always produces a nearly optimal solution and also perform experiments on real-world data that indicate that the algorithm performs even better in practice than the analytical guarantees.

Machine Learning

Large Scale Distributed Deep Networks
Jeffrey Dean, Greg S. Corrado*, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Andrew Y. Ng, NIPS 2012;
In this paper, we examine several techniques to improve the time to convergence for neural networks and other models trained by gradient-based methods. The paper describes a system we have built that exploits both model-level parallelism (by partitioning the nodes of a large model across multiple machines) and data-level parallelism (by having multiple replicas of a model process different training data and coordinating the application of updates to the model state through a centralized-but-partitioned parameter server system). Our results show that very large neural networks can be trained effectively and quickly on large clusters of machines.

Open Problem: Better Bounds for Online Logistic Regression
Brendan McMahan* and Matthew Streeter*, COLT/ICML'12 Joint Open Problem Session, JMLR: Workshop and Conference Proceedings.
One of the goals of research at Google is to help point out important open problems--precise questions that are interesting academically but also have important practical ramifications. This open problem is about logistic regression, a widely used algorithm for predicting probabilities (what is the probability an email message is spam, or that a search ad will be clicked). We show that in the simple one-dimensional case, much better results are possible than current theoretical analysis suggests, and we ask whether our results can be generalized to arbitrary logistic regression problems.

Spectral Learning of General Weighted Automata via Constrained Matrix Completion
Borja Balle and Mehryar Mohri*, NIPS 2012.
Learning weighted automata from finite samples drawn from an unknown distribution is a central problem in machine learning and computer science in general, with a variety of applications in text and speech processing, bioinformatics, and other areas. This paper presents a new family of algorithms for tackling this problem for which it proves learning guarantees. The algorithms introduced combine ideas from two different domains: matrix completion and spectral methods.

Machine Translation

Improved Domain Adaptation for Statistical Machine Translation
Wei Wang*, Klaus Macherey*, Wolfgang Macherey*, Franz Och* and Peng Xu*, [AMTA'12]
Research in domain adaptation for machine translation has been mostly focusing on one domain. We present a simple and effective domain adaptation infrastructure that makes an MT system with a single translation model capable of providing adapted, close-to-upper-bound domain-specific accuracy while preserving the generic translation accuracy. Large-scale experiments on 20 language pairs for patent and generic domains show the viability of our approach.

Multimedia and Computer Vision

Reconstructing the World's Museums
Jianxiong Xiao and Yasutaka Furukawa*, [ECCV '12]
Virtual navigation and exploration of large indoor environments (e.g., museums) have been so far limited to either blueprint-style 2D maps that lack photo-realistic views of scenes, or ground-level image-to-image transitions, which are immersive but ill-suited for navigation. This paper presents a novel vision-based 3D reconstruction and visualization system to automatically produce clean and well-regularized texture-mapped 3D models for large indoor scenes, from ground-level photographs and 3D laser points. For the first time, we enable users to easily browse a large scale indoor environment from a bird's-eye view, locate specific room interiors, fly into a place of interest, view immersive ground-level panoramas, and zoom out again, all with seamless 3D transitions.

The intervalgram: An audio feature for large-scale melody recognition
Thomas C. Walters*, David Ross*, Richard F. Lyon*, [CMMR'12]
Intervalgrams are small images that summarize the structure of short segments of music by looking at the musical intervals between the notes present in the music. We use them for finding cover songs - different pieces of music that share the same underlying composition. We do this by comparing 'heatmaps' which look at the similarity between intervalgrams from different pieces of music over time. If we see a strong diagonal line in the heatmap, it's good evidence that the songs are musically similar.

General and Nested Wiberg Minimization
Dennis Strelow*, [CVPR'12]
Eriksson and van den Hengel’s CVPR 2010 paper showed that Wiberg’s least squares matrix factorization, which effectively eliminates one matrix from the factorization problem, could be applied to the harder case of L1 factorization. Our paper generalizes their approach beyond factorization to general nonlinear problems in two sets of variables, like perspective structure-from-motion. We also show that with our generalized method, one Wiberg minimization can also be nested inside another, effectively eliminating two of three sets of unknowns, and we demonstrated this idea using projective structure-from-motion.

Calibration-Free Rolling Shutter Removal
Matthias Grundmann*, Vivek Kwatra*, Daniel Castro, Irfan Essa*, International Conference on Computational Photography '12. Best paper.
Mobile phones and current-generation DSLRs contain an electronic rolling shutter, capturing each frame one row of pixels at a time. Consequently, if the camera moves during capture, it will cause image distortions ranging from shear to wobble. We propose a calibration-free solution based on a novel parametric mixture model to correct these rolling shutter distortions in videos, enabling real-time rolling shutter rectification as part of YouTube’s video stabilizer.

Natural Language Processing

Vine Pruning for Efficient Multi-Pass Dependency Parsing
Alexander Rush, Slav Petrov*, The 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL '12), Best Paper Award.
Being able to accurately analyze the grammatical structure of sentences is crucial for language understanding applications such as machine translation or question answering. In this paper we present a method that is up to 200 times faster than existing methods and enables the grammatical analysis of text in large-scale applications. The key idea is to perform the analysis in multiple coarse-to-fine passes, resolving easy ambiguities first and tackling the harder ones later on.

Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure
Oscar Tackstrom, Ryan McDonald*, Jakob Uszkoreit*, North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL '12), Best Student Paper Award.
This paper studies how to build meaningful cross-lingual word clusters, i.e., clusters containing lexical items from two languages that are coherent along some abstract dimension. This is done by coupling distributional statistics learned from huge amounts of language-specific data with constraints generated from parallel corpora. The resulting clusters are used to improve the accuracy of multi-lingual syntactic parsing for languages without any training resources.

Networks

How to Split a Flow
Tzvika Hartman*, Avinatan Hassidim*, Haim Kaplan*, Danny Raz*, Michal Segalov*, [INFOCOM '12]
Decomposing a flow into a small number of paths is a very important task that arises in various network optimization mechanisms. In this paper we develop an approximation algorithm for this problem that has both provable worst-case performance guarantees and good practical behavior.

Deadline-Aware Datacenter TCP (D2TCP)
Balajee Vamanan, Jahangir Hasan*, T. N. Vijaykumar, [SIGCOMM '12]
Some of our most important products like search and ads operate under soft-real-time constraints. They are architected and fine-tuned to return results to users within a few hundred milliseconds. Deadline-Aware Datacenter TCP is a research effort into making the datacenter networks deadline aware, thus improving the performance of such key applications.

Trickle: Rate Limiting YouTube Video Streaming
Monia Ghobadi, Yuchung Cheng*, Ankur Jain*, Matt Mathis* [USENIX '12]
Trickle is a server-side mechanism to stream YouTube video smoothly to reduce burst and buffer-bloat. It paces the video stream by placing an upper bound on TCP’s congestion window based on the streaming rate and the round-trip time. In initial evaluation Trickle reduces the TCP loss rate by up to 43% and the RTT by up to 28%. Given the promising results we are deploying Trickle to all YouTube servers.

Social Systems

Look Who I Found: Understanding the Effects of Sharing Curated Friend Groups
Lujun Fang*, Alex Fabrikant*, Kristen LeFevre*, [Web Science '12]. Best Student Paper award.
In this paper, we studied the impact of the Google+ circle-sharing feature, which allows individual users to share (publicly and privately) pre-curated groups of friends and contacts. We specifically investigated the impact on the growth and structure of the Google+ social network. In the course of the analysis, we identified two natural categories of shared circles ("communities" and "celebrities"). We also observed that the circle-sharing feature is associated with the accelerated densification of community-type circles.

Software Engineering

AddressSanitizer: A Fast Address Sanity Checker
Konstantin Serebryany*, Derek Bruening*, Alexander Potapenko*, Dmitry Vyukov*, [USENIX ATC '12].
The paper “AddressSanitizer: A Fast Address Sanity Checker” describes a dynamic tool that finds memory corruption bugs in C or C++ programs with only a 2x slowdown. The major feature of AddressSanitizer is simplicity -- this is why the tool is very fast.

Speech

Japanese and Korean Voice Search
Mike Schuster*, Kaisuke Nakajima*, IEEE International Conference on Acoustics, Speech, and Signal Processing [ICASSP'12].
"Japanese and Korean voice search" explains in detail how the Android voice search systems for these difficult languages were developed. We describe how to segment statistically to be able to handle infinite vocabularies without out-of-vocabulary words, how to handle the lack of spaces between words for language modeling and dictionary generation, and how to deal best with multiple ambiguities during evaluation scoring of reference transcriptions against hypotheses. The combination of techniques presented led to high quality speech recognition systems--as of 6/2013 Japanese and Korean are #2 and #3 in terms of traffic after the US.

Google's Cross-Dialect Arabic Voice Search
Fadi Biadsy*, Pedro J. Moreno*, Martin Jansche*, IEEE International Conference on Acoustics, Speech, and Signal Processing [ICASSP 2012].
This paper describes Google’s automatic speech recognition systems for recognizing several Arabic dialects spoken in the Middle East, with the potential to reach more than 125 million users. We suggest solutions for challenges specific to Arabic, such as the diacritization problem, where short vowels are not written in Arabic text. We conduct experiments to identify the optimal manner in which acoustic data should be clustered among dialects.

Deep Neural Networks for Acoustic Modeling in Speech Recognition
Geoffrey Hinton*, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew W. Senior*, Vincent Vanhoucke*, Patrick Nguyen, Tara Sainath, Brian Kingsbury, Signal Processing Magazine (2012)
Survey paper on the DNN breakthrough in automatic speech recognition accuracy.

Statistics

Empowering Online Advertisements by Empowering Viewers with the Right to Choose
Max Pashkevich*, Sundar Dorai-Raj*, Melanie Kellar*, Dan Zigmond*, Journal of Advertising Research, vol. 52 (2012).
YouTube’s TrueView in-stream video advertising format (a form of skippable in-stream ads) can improve the online video viewing experience for users without sacrificing advertising value for advertisers or content owners.

Structured Data

Efficient Spatial Sampling of Large Geographical Tables
Anish Das Sarma*, Hongrae Lee*, Hector Gonzalez*, Jayant Madhavan*, Alon Halevy*, [SIGMOD '12].
This paper presents fundamental results for the "thinning problem": determining appropriate samples of data to be shown on specific geographical regions and zoom levels. This problem is widely applicable to a number of cloud-based geographic visualization systems such as Google Maps and Fusion Tables, and the developed algorithms are part of the Fusion Tables backend. The SIGMOD 2012 paper was selected among the best papers of the conference, and invited to a special best-papers issue of TODS.

Systems

Spanner: Google's Globally-Distributed Database
James C. Corbett*, Jeffrey Dean*, Michael Epstein*, Andrew Fikes*, Christopher Frost*, JJ Furman*, Sanjay Ghemawat*, Andrey Gubarev*, Christopher Heiser*, Peter Hochschild*, Wilson Hsieh*, Sebastian Kanthak*, Eugene Kogan*, Hongyi Li*, Alexander Lloyd*, Sergey Melnik*, David Mwaura*, David Nagle*, Sean Quinlan*, Rajesh Rao*, Lindsay Rolig*, Dale Woodford*, Yasushi Saito*, Christopher Taylor*, Michal Szymaniak*, Ruth Wang*, [OSDI '12]
This paper shows how a new time API and its implementation can provide the abstraction of tightly synchronized clocks, even on a global scale. We describe how we used this technology to build a globally-distributed database that supports a variety of powerful features: non-blocking reads in the past, lock-free snapshot transactions, and atomic schema changes.
18 Jun 16:20

Software exam gets low grade in first test

Only 12 people took the first exam given in the U.S. to license software engineers for work on safety-critical systems, and only six passed it.

View the full article HERE.
18 Jun 15:34

Shuffles, Bayes' theorem and continuations.

by Dan Piponi
Introduction
Back in the 80s when I was a kid I came across a program for the BBC Micro that could tell what card you had picked from a deck of cards even though you'd buried your card within the deck wherever you wanted and had cut and shuffled the deck. I thought I'd try to implement a slightly more sophisticated version of the same trick that could handle multiple shuffles, and multiple types of shuffle.
The idea is that we prepare a deck of cards in some known sequence and have someone pick out a card and place it at the top of the deck. They perform some kind of randomisation procedure on the deck, e.g. cut and shuffle it a couple of times, and then you get to look at the final sequence of cards. Can we tell which card was picked out?
Some probability theory
Let's formalise this a bit. Our decks will have n cards. There is a small number of initial states our deck can be in, corresponding to the known sequence with a single card moved to the top. Let's label these initial states d_1, ..., d_n. There is a (usually large) number of permutations that could be applied through shuffling. We'll label these s_1, ..., s_m. We'll try to arrange that this isn't simply the set of all permutations (though it's not necessarily a disaster if it is).
We want to figure out the initial state d_i given some final state f. In other words we want P(d_i | f).
We can use Bayes' theorem to get: P(d_i | f) = P(f | d_i) P(d_i) / P(f).
Now P(f | d_i) is the sum over all ways of starting with d_i and ending up with f. So P(f | d_i) = Σ_j P(s_j),
where the sum is over all j such that s_j(d_i) = f. I'm assuming that the shuffles are independent of the initial sequence of cards. This gives us an algorithm. We do a brute force simulation of every possible shuffle that we're considering applied to each possible initial state. After each shuffle we sum the corresponding probability for those shuffles that give our known final state f.
Each shuffle is going to be built up as the product of a sequence of building blocks with each block randomly selected based on what happened before. Let's call the blocks names like b_1, b_2, and so on. So if s = b_k ∘ ... ∘ b_2 ∘ b_1 then P(s) = P(b_1) P(b_2 | b_1) ... P(b_k | b_1, ..., b_{k-1}). As we work through the shuffle we will accumulate the probability. After the first block we have a probability of P(b_1). The probability after the second is P(b_1) P(b_2 | b_1), and so on. At any point we'll call the probability accumulated so far the importance. I've borrowed that name from the world of rendering because this algorithm has a remarkable similarity to recursive ray-tracing.
Some computer science
I'd like to be able to chain a sequence of shuffles. But wait! There's a catch! Today's the day I finally want to get around to checking out the lambda expression support in C++. I've been putting this off for years. (I'm using gcc 4.7.) So I'm not going to have a Haskell non-determinism monad to make life easy.
Suppose I have two types of shuffle, type A and type B. I could easily write a loop to iterate over all shuffles of type A, and in the innermost part of the loop I could call another loop over all shuffles of type B. But then if I want to replace B with some other type of shuffle I have to change the code, replacing the inner part with code for the new type. That's no good. I'd like to be able to replace the innermost part of the outer loop with any code I want without actually editing that part of the code. It's easy with lambda expressions. I write the type A loop code so that it takes as argument a lambda function representing what I want done inside the loop.
There's another way of looking at this. You can skip this paragraph if you don't care about the connection to Haskell. But in Haskell you might do something like this by using a non-determinism monad, or even a probability monad. But as I pointed out a while back, you can fake every monad using the continuation monad. One way to implement continuations in C++ is to use continuation passing style. And that's what I'll do. The continuations are just the lambdas that I mentioned in the previous paragraph.
Some C++ code


> #include <iostream>
> #include <cstdlib>
> using namespace std;
You can bump up the deck size if you have the CPU power:
> const int deck_size = 13;
A deck of cards is represented by a simple array of integers with each card being assigned a unique integer.
> struct Deck {
>     int card[deck_size];
>     bool operator==(const Deck &other) {
>         for (int i = 0; i < deck_size; ++i) {
>             if (card[i] != other.card[i]) {
>                 return false;
>             }
>         }
>         return true;
>     }
> };
The riffle shuffle works by splitting the deck into two piles and interleaving the two piles onto a new destination deck.
The function riffle_helper helps loop through all possible riffles. I could assume that each card arriving at the destination is equally likely to come from the left pile or the right pile. But I observe that whenever I do a real riffle shuffle the cards seem to come in 'runs': if a card falls from the left pile then the next one is more likely to fall from the left as well. That's just an empirical observation based on a small number of trials; you can tweak the probabilities yourself to fit reality better. (Oh, and I got this code upside-down compared to what people really do. I need to fix it when I have a moment...)

> enum Side {
>     LEFT,
>     RIGHT,
>     NO_SIDE
> };
This function shuffles together cards from the locations given by left_ptr and right_ptr in src_deck into dest_deck, eventually calling cont on each result. I use a template because I don't know the type of the lambda expression I'm passing in. (If I want to know its type I think I have to mess with decltype. It's all a bit weird.)
> template<class Cont>
> void riffle_helper(double importance, int split,
>                    int left_ptr, int right_ptr, int dest_ptr, Side oldside,
>                    const Deck &src_deck, Deck dest_deck, Cont cont) {
>     if (dest_ptr == deck_size) {
>         cont(importance, dest_deck);
>         return;
>     }
First I deal with the cases where one or other of the piles is empty so there's no choice about where the next card is coming from:
>     if (left_ptr >= split) {
>         dest_deck.card[dest_ptr] = src_deck.card[right_ptr];
>         riffle_helper(importance, split, left_ptr, right_ptr+1, dest_ptr+1, RIGHT, src_deck, dest_deck, cont);
>         return;
>     }
>     if (right_ptr >= deck_size) {
>         dest_deck.card[dest_ptr] = src_deck.card[left_ptr];
>         riffle_helper(importance, split, left_ptr+1, right_ptr, dest_ptr+1, LEFT, src_deck, dest_deck, cont);
>         return;
>     }
>     double p;
>     if (oldside == NO_SIDE) {
>         p = 0.5;
>     } else {
>         p = LEFT == oldside ? 0.75 : 0.25;
>     }
>     double new_importance = importance*p;
>     dest_deck.card[dest_ptr] = src_deck.card[left_ptr];
>     riffle_helper(new_importance, split, left_ptr+1, right_ptr, dest_ptr+1, LEFT, src_deck, dest_deck, cont);

>     if (oldside == NO_SIDE) {
>         p = 0.5;
>     } else {
>         p = RIGHT == oldside ? 0.75 : 0.25;
>     }
>     new_importance = importance*p;
>     dest_deck.card[dest_ptr] = src_deck.card[right_ptr];
>     riffle_helper(new_importance, split, left_ptr, right_ptr+1, dest_ptr+1, RIGHT, src_deck, dest_deck, cont);
> }
The function riffle iterates over all possible riffle shuffles of src_deck calling cont on each one. Note that I assume that when the deck is split into two before shuffling together, each pile has at least 3 cards. You may want to change that assumption.
> template<class Cont>
> void riffle(double importance, const Deck &src_deck, Cont cont) {
>     double new_importance = importance/(deck_size-5);
>     for (int split = 3; split < deck_size-2; ++split) {
>         riffle_helper(new_importance, split, 0, split, 0, NO_SIDE, src_deck, Deck(), cont);
>     }
> }
Iterate over all possible cuts of src_deck, calling cont on each result. I assume the cut leaves at least 3 cards in each pile.
> template<class Cont>
> void cut(double importance, const Deck &src_deck, Cont cont) {
>     double new_importance = importance/(deck_size-5);
>     for (int split = 3; split < deck_size-2; ++split) {
>         Deck new_deck;
>         for (int i = 0; i < deck_size; ++i) {
>             if (i < deck_size-split) {
>                 new_deck.card[i] = src_deck.card[i+split];
>             } else {
>                 new_deck.card[i] = src_deck.card[i-(deck_size-split)];
>             }
>         }
>         cont(new_importance, new_deck);
>     }
> }
The function overhand_helper overhand-shuffles the remaining cards in src_deck into dest_deck. Here's an attempt to describe what an overhand shuffle does: it splits the deck into segments and reverses the order of the segments, leaving the order within each segment unchanged. For example, if a 6-card deck is split into the segments (1 2)(3 4 5)(6), the result is 6 3 4 5 1 2.

> template<class Cont>
> void overhand_helper(double importance, const Deck &src_deck,
>                      int cards_left, Deck dest_deck, Cont cont) {
>     if (cards_left <= 0) {
>         cont(importance, dest_deck);
>     } else {
>         double new_importance = importance/cards_left;
>         for (int ncards = 1; ncards <= cards_left; ++ncards) {
>             //
>             // Take ncards cards from the top of the source and place them at the
>             // bottom of the destination.
>             //
>             for (int j = 0; j < ncards; ++j) {
>                 dest_deck.card[cards_left-ncards+j] = src_deck.card[deck_size-cards_left+j];
>             }
>             overhand_helper(new_importance, src_deck, cards_left-ncards, dest_deck, cont);
>         }
>     }
> }
Iterate over all possible overhand shuffles of cards in src_deck, calling cont on each result. In practice I often find overhand shuffles result in cards mysteriously jumping segments and messing up the algorithm, whereas poorly executed riffle shuffles still work fine. I'm also assuming that each time a pile of cards is transferred the size of the pile is chosen uniformly from the set of all possible sizes at that stage.
> template<class Cont>
> void overhand(double importance, const Deck &src_deck, Cont cont) {
>     overhand_helper(importance, src_deck, deck_size, Deck(), cont);
> }
The final code doesn't bother computing the denominator P(f) from Bayes' theorem. The most likely initial state is the one that results in the highest score. If you normalise the scores to sum to one you'll get actual probabilities.
> int main() {
This is the array representation of the observed final deck:
>     Deck target = {{ 10, 11, 6, 12, 1, 13, 8, 2, 9, 3, 5, 4, 7 }};

>     Deck deck;
Our known starting sequence is just 1, 2, 3, ..., J, Q, K. We iterate over all ways to pick a card out from this sequence and place it at the top.
>     for (int k = 0; k < deck_size; ++k) {
>         deck.card[0] = k+1;
>         for (int i = 1; i < deck_size; ++i) {
>             deck.card[i] = (i > k ? i : i-1)+1;
>         }
>         double likelihood = 0.0;
Here is where I use the lambdas. For this example I'm doing an overhand shuffle followed by a riffle shuffle. (The syntax is pretty bizarre and it's also weird that I kinda sorta specify the type of my lambda even though that's not really what the type of the expression is. But having manually faked and lifted lambdas many times in C++ I can see why it's the way it is.) Note how I've captured likelihood by reference so that these lambda expressions get write access to it.
>         overhand(1.0, deck, [&likelihood, target](double importance, Deck &deck) -> void {
>            riffle(importance, deck, [&likelihood, target](double importance, Deck &deck) -> void {
>                if (deck == target) {
We sum the probabilities for all ways of generating the target deck:
>                    likelihood += importance;
>                }
>            });
>         });
>         cout << "If top card = " << deck.card[0] << endl;
>         cout << "then unnormalised probability = " << likelihood << endl;
>         cout << endl;
>     }

> }
Run the above code and you get unnormalised probabilities. Here's an excerpt of the output:
If top card = 4
then unnormalised probability = 5.7568e-12
If top card = 6
then unnormalised probability = 5.37301e-11
If top card = 7
then unnormalised probability = 1.791e-11
In fact, I had chosen 6.
Some discussion
Don't expect it to work perfectly! It can only give probabilities, but it's often surprisingly good. There is still a lot of room for improvement: some work looking at how people actually shuffle could give a better probabilistic model.
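As mentioned above, main only prints unnormalised scores. If you want actual probabilities you can collect the scores and divide each by their total. Here's a minimal standalone sketch of that step (not part of the literate program; for brevity it only normalises over the three scores shown above, whereas you should really include all thirteen candidates):

    #include <iostream>
    #include <utility>
    #include <vector>

    int main() {
        // (card, unnormalised score) pairs; these are the three values printed above.
        std::vector<std::pair<int, double> > scores;
        scores.push_back(std::make_pair(4, 5.7568e-12));
        scores.push_back(std::make_pair(6, 5.37301e-11));
        scores.push_back(std::make_pair(7, 1.791e-11));

        double total = 0.0;
        for (size_t i = 0; i < scores.size(); ++i) {
            total += scores[i].second;
        }
        for (size_t i = 0; i < scores.size(); ++i) {
            std::cout << "P(top card = " << scores[i].first << ") = "
                      << scores[i].second/total << std::endl;
        }
    }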
Some exercises
1. The code can be made orders of magnitude faster. As written, each shuffle is performed in full and only then is the result compared to the target sequence. But you can start comparing cards with the target before the shuffle is finished. Most of the time you'll only need to look at the first card of the result of a shuffle before you know you haven't matched the target. Fixing this will give a big speed-up.
2. The continuation passing style makes it easy to incorporate other sources of knowledge. For example if you 'accidentally' peek at the bottom card after the first shuffle you can incorporate that knowledge into the algorithm. Figure out how.
3. Write lots more kinds of shuffles and experiment. I'm hoping someone good with magic will come up with a sequence of operations that looks hopelessly random but allows a good probability of recovering the chosen card. You could also combine this with other techniques such as designing shuffles that maintain various kinds of invariant.
4. The code can be rewritten to work backwards from the final state to the initial states. Work out how to do this. (This parallels ray-tracing where you can work from eye to light or from light to eye.)
5. We're doing the same work over and over again. We don't need to compute all of the shuffles for each initial state. We can compute each shuffle once and reuse it on each initial state. Try implementing it.