What is the ideal way to debug a parallel program? In my mind, a parallel debugging framework should provide the following:
- In the production run, it records the execution with zero or very low overhead.
- In the debugging phase, it lets people deterministically replay what happened when the bug manifested. It is even better if the replay can go both forward and backward.
- Programmers can attach their favorite debuggers to the replay session.
PRES and ODR, the two systems discussed below, have the following in common:
- They are both software-only solutions.
- They both make a compromise: they push the complexity from recording to replaying.
- They both work on multiprocessors.
The idea of PRES is to sacrifice the efficiency of replay. Instead of reproducing the buggy execution on the first try, it may need more than one attempt. The reward is that it needs to record much less information in the production run. One could think of it as PRES drawing a "sketch" in the production run instead of a "finished painting". It may need several tries to recover the finished painting, but a sketch is fast to draw and enough to capture a rough idea of what happened. If painters use this idea, why can't we? Very smart! During replay, PRES uses both the sketch from the production run and feedback from previously failed attempts. When PRES encounters a data race that was not recorded in the production run, it makes a random guess. As soon as PRES finds that the execution no longer matches the sketch, or does not reproduce the symptoms of the buggy execution, it rolls back a bit, flips some data races, and gives it another try. This reminds me of another paper, Rx, which is one of the systems research papers that I like the most.
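To make the guess-and-retry loop concrete, here is a toy sketch in Python. It is not PRES itself: the "program" is a stand-in function, each data race is reduced to a single 0/1 outcome, and the names `run`, `replay`, `BUGGY` and `sketch` are all made up for illustration. It also simplifies "roll back a bit" to restarting the attempt with a combination of race outcomes that has not failed yet.

```python
import itertools
import random

# Hypothetical race outcomes of the buggy production run (unknown to replay).
BUGGY = (1, 0, 1, 1)


def run(races):
    """Toy 'program': returns the observable symptom for given race outcomes."""
    return "crash" if tuple(races) == BUGGY else "ok"


def replay(sketch, symptom, n_races, rng):
    """Search for race outcomes that agree with the sketch and show the symptom.

    `sketch` maps race index -> recorded outcome (the cheap partial log).
    Unrecorded races are guessed at random; combinations that already failed
    are remembered so that each new attempt flips at least one race.
    """
    unrecorded = [i for i in range(n_races) if i not in sketch]
    options = list(itertools.product((0, 1), repeat=len(unrecorded)))
    failed = set()  # feedback from previous attempts
    attempt = 0
    while len(failed) < len(options):
        attempt += 1
        guess = rng.choice([o for o in options if o not in failed])
        races = [sketch.get(i) for i in range(n_races)]
        for i, outcome in zip(unrecorded, guess):
            races[i] = outcome
        if run(races) == symptom:
            return races, attempt
        failed.add(guess)
    return None, attempt  # the sketch cannot lead to the symptom


# Only races 0 and 3 were recorded in production; races 1 and 2 must be guessed.
sketch = {0: 1, 3: 1}
races, attempts = replay(sketch, "crash", 4, random.Random(0))
print(races, "found after", attempts, "attempt(s)")
```

With two of the four races recorded, the search space shrinks from 16 combinations to 4, which is the trade the paragraph above describes: a smaller log in production in exchange for a few extra replay attempts.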
The idea of ODR is to sacrifice the accuracy of replay. Instead of reproducing the buggy execution down to its internal values, it tries to reproduce an execution that has the same output as the buggy one. Output includes segmentation faults, core dumps, printed strings and such. The reproduced execution may differ from the buggy one, but the rationale is that if it has the same output, it is very likely that the bug from the buggy execution also manifests in the reproduced execution. If output-determinism is enough, why would we want value-determinism?
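A small Python sketch may help show what "same output, possibly different execution" means. This is only a toy under my own assumptions, not how ODR works internally: two threads each perform a non-atomic `counter += 1`, a schedule is a string such as "BAAB", and the replayer simply enumerates schedules until one produces the recorded output. The names `execute` and `find_output_equivalent` are made up for illustration.

```python
from itertools import permutations


def execute(schedule):
    """Run two threads that each do a non-atomic `counter += 1`.

    `schedule` gives the order in which the threads' steps run. A thread's
    first step reads the counter; its second step writes back the value it
    read plus one. Returns the program's output.
    """
    counter = 0
    local = {}
    for thread in schedule:
        if thread not in local:
            local[thread] = counter      # step 1: read
        else:
            counter = local[thread] + 1  # step 2: write
    return f"counter={counter}"


def find_output_equivalent(recorded_output):
    """Return the first schedule whose output matches the recorded one."""
    for schedule in sorted(set(permutations("AABB"))):
        if execute(schedule) == recorded_output:
            return "".join(schedule)
    return None


production_schedule = "BAAB"             # what really happened (not recorded)
recorded = execute(production_schedule)  # only the output is recorded
replayed = find_output_equivalent(recorded)
print(recorded, production_schedule, replayed)
```

The replayed schedule is not the one that ran in production, yet it loses an update in the same way, so the bug is still there to be examined in the debugger.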
There are many other good papers about deterministic replay for debugging parallel programs. If you are interested, you can find more in the references of these two papers.
Besides academic research, VMware's replay debugger is a production-class parallel debugging tool. Although it only works for single-processor execution, it has almost everything an ideal parallel program debugger should have. I was really impressed. E Lewis gave a talk at Google about it:
6 comments:
The talk from Google is impressive, though the video was old; they have been working on that feature for a long time.
To Helen: Yeah, you're right. It was about one year ago that they put it into Workstation 6.5. For the latest information about VMware Replay Debugger, you can take a look at the presenter's blog: http://www.replaydebugging.com/.