Failure-oblivious computing is a compiler/runtime technique that keeps programs running even when they hit the memory errors common in unsafe languages like C (out-of-bounds accesses, invalid pointers). Introduced by Rinard and colleagues at OSDI 2004, it contrasts with typical memory checkers: instead of reporting the error or aborting, it “continues execution” by discarding invalid writes and returning synthesized (“manufactured”) values for invalid reads, so the address space is not corrupted and control flow moves on. The English-language Wikipedia’s “Fault tolerance” article frames the key point the same way: the system proceeds using manufactured values rather than stopping, the opposite of a fail-fast checker.
Mechanically, a “safe-C”-style compiler inserts dynamic bounds checks and, when a check fails, runs “continuation code” in place of the faulting access. Invalid writes are simply dropped, which localizes data corruption; invalid reads receive manufactured values, drawn from a small sequence of integers chosen so that loops consuming them cannot spin forever and the program can make forward progress. One concrete example (from Midnight Commander) is keeping a path-search loop from stalling by ensuring the returned sequence contains a ‘/’, so the scan can advance.
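As a rough illustration of what that continuation code amounts to, the C sketch below wraps reads and writes with bounds checks: out-of-bounds writes are discarded, out-of-bounds reads return values from a small cycling sequence. The function names, the explicit length parameter, and the particular value sequence are simplifications for illustration, not the actual instrumentation the compiler emits.

```c
#include <stddef.h>

/* Manufactured values for invalid reads: a small cycling sequence so a
 * loop scanning the results cannot spin forever on a single value. */
static const int manufactured[] = { 0, 1, 2, 0, 1, 3 };
static size_t next_value = 0;

/* Hypothetical stand-ins for the checks a failure-oblivious compiler
 * emits around each access; real instrumentation tracks object bounds
 * itself rather than taking an explicit length parameter. */
int checked_read(const int *base, size_t len, size_t idx)
{
    if (idx < len)
        return base[idx];                 /* in bounds: normal read */
    int v = manufactured[next_value];     /* out of bounds: manufacture a value */
    next_value = (next_value + 1) % (sizeof manufactured / sizeof manufactured[0]);
    return v;
}

void checked_write(int *base, size_t len, size_t idx, int value)
{
    if (idx < len)
        base[idx] = value;                /* in bounds: normal write */
    /* out of bounds: silently discard the write to keep corruption local */
}
```

Cycling through several distinct values is exactly the forward-progress argument above: a loop that keeps comparing the manufactured result against some sentinel eventually sees a value that lets it exit.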
An extension called Boundless Memory Blocks stores out-of-bounds writes in a hash table and returns the corresponding values on later out-of-bounds reads at the same logical address. This can better preserve the original program’s intent when the size computation is off, though it introduces the need to cap the stash (e.g., with an LRU policy). Rinard’s work also explored variants such as “wrapping” offsets within the same data unit.
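A minimal sketch of the Boundless Memory Blocks idea follows, under the assumption that out-of-bounds accesses can be keyed by the allocation’s base pointer plus a logical offset. The fixed-size table and round-robin eviction are crude stand-ins for the bounded stash with an LRU-style policy described above, and all names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stash for Boundless Memory Blocks: out-of-bounds writes
 * are remembered per (block, offset) and replayed on later out-of-bounds
 * reads of the same logical address. */
#define STASH_SLOTS 64

struct stash_entry {
    const void *block;   /* base pointer identifying the allocation */
    size_t      offset;  /* logical offset that was out of bounds   */
    int         value;   /* value the program tried to store        */
    int         used;
};

static struct stash_entry stash[STASH_SLOTS];
static size_t evict_next = 0;

void stash_oob_write(const void *block, size_t offset, int value)
{
    for (size_t i = 0; i < STASH_SLOTS; i++) {
        if (stash[i].used && stash[i].block == block && stash[i].offset == offset) {
            stash[i].value = value;       /* update the existing entry */
            return;
        }
    }
    /* No entry yet: overwrite a slot round-robin (stand-in for LRU). */
    stash[evict_next] = (struct stash_entry){ block, offset, value, 1 };
    evict_next = (evict_next + 1) % STASH_SLOTS;
}

int stash_oob_read(const void *block, size_t offset, int fallback)
{
    for (size_t i = 0; i < STASH_SLOTS; i++)
        if (stash[i].used && stash[i].block == block && stash[i].offset == offset)
            return stash[i].value;        /* replay the remembered value */
    return fallback;                      /* otherwise manufacture a value */
}
```

Replaying the stored value is what can preserve the program’s intent when only the size computation was wrong: the data the program meant to write is still the data it later reads back.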
Empirical evaluation on Apache, Sendmail, Pine, Mutt, and Midnight Commander shows that, even under inputs crafted to exploit known bugs, these systems can keep serving requests while neutralizing effects such as stack smashing. Performance costs vary with the workload and with how heavily the hot paths are instrumented: often under 2× overhead, but sometimes 8–12×; overview summaries also cite “80–500%” increases (≈1.8–6×). The spread is large, so the decision should be driven by measurements on your own workload.
Where it excels: systems whose error propagation is short and request-scoped (e.g., server handling of independent requests). Dropping writes confines damage; synthetic values on reads avert exceptions, so the next request is typically unaffected. Where it struggles: domains where a single error globally pollutes computation (e.g., numeric kernels) and during development, where crashing aids diagnosis. Conceptually it belongs to “acceptability-oriented” computing: a deliberate choice to prioritize execution continuity and availability over strict correctness.
Benefits include (1) higher availability (the process doesn’t crash), (2) mitigation of buffer-overrun attack classes by discarding dangerous writes, and (3) relatively low adoption cost (mostly recompilation). Downsides include (a) semantic drift, since execution may proceed along paths the original program never intended, (b) a “bystander effect” that can erode engineering discipline if teams lean on the safety net instead of fixing root causes, and (c) reduced visibility of defects. Mitigations: log every discarded write and manufactured read, restrict the technique to production guardrails rather than development builds, and keep those error reports visible so defects are still triaged and fixed. Follow-on research explores related “don’t crash” techniques, such as Recovery Shepherding (PLDI 2014) and context-aware strategies that search for safe continuations while bounding their impact.
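To make the logging mitigation concrete, here is a hypothetical hook that continuation code could call whenever it discards a write or manufactures a read value; the function name, arguments, and output format are assumptions for illustration, not part of the original system.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical logging hook for the "keep errors visible" mitigation:
 * every discarded write and every manufactured read gets recorded so
 * the underlying defect can still be found and fixed later. */
void log_oob_event(const char *kind, const void *block,
                   size_t offset, const char *site)
{
    /* In production this would feed structured logs or metrics rather
     * than stderr; "site" stands in for a source location that the
     * instrumenting compiler could pass along. */
    fprintf(stderr, "failure-oblivious: %s at %p+%zu (%s)\n",
            kind, (void *)block, offset, site);
}
```

In the sketches above, checked_write and stash_oob_write would call this hook on their out-of-bounds branches, so continuity of service does not come at the cost of losing sight of the bug.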
In short, failure-oblivious computing shines when you explicitly value “continuation over correctness”—for SLA-driven online systems, large-scale batch processing, or as a temporary guard to harden legacy C against unknown inputs. Conversely, in domains demanding numerical fidelity, reproducibility, or strong auditability, pair (or replace) it with static analysis, type-safe languages, formal methods, and a fail-fast operational posture, and evaluate adoption with particular caution.