Sunday, November 06, 2016

Design for Errors

Making finding and fixing errors easier. (Posted by Jerry Yoakum)


Errors in software are to be expected. Since you expect errors, you should make design decisions that maximize the likelihood that:
  1. Errors are not introduced.
    • Use code reviews to catch functional errors.
    • Use static analysis to catch coding errors.
  2. Errors that are introduced are easily detected.
    • Ensure that your code is setup log unexpected errors.
    • Use tools like AppDynamics to view the ranking of errors over different time frames.
    • Use tools such as Splunk to graph errors by type over time. This is especially critical before and after a deployment, because it highlights new errors introduced by the latest release.
  3. Errors that remain in the software after deployment are either noncritical or are compensated for during execution so that the error does not cause a disaster.
    • This is all about testing that things behave the way you want when they are broken. Here are some examples:
      1. Database is down.
        Don't let your service return anything that makes clients think that their order was saved. Based on your business, you have to decide whether it is better to return error responses or to refuse requests altogether. Which of those could get you into more trouble? But don't just code for that. Actually kill your database in the test environment and test it. Set up a functional test that does this every time you build.
      2. Disk is full and you can no longer log.
        Same thing as above. Set up a functional test that points the logs at a very small virtual drive, and run it with every build.
      3. System encoding changed.
        Seriously, this is a thing. On Macs the default is UTF-8, on Windows it is usually a legacy code page such as cp1252, and on Linux it depends on the locale. I have some Python 3 code that I originally wrote on a Mac, but when I moved it to a Windows machine it choked on the encoding, all because I didn't set one. I was [unknowingly] depending on the system default. Servers are group property. Maybe only your Ops team can touch them, but you probably have more than one Operations Engineer. Or your servers are configured via Chef or something else, and everyone misses the single line where someone decided it would be good to explicitly define the system encoding. Boom! Code that was working fails over what most would consider a trivial detail.
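To make the encoding example concrete, here is a minimal Python 3 sketch of the fix: pass an explicit encoding every time you open a text file, so the code no longer depends on whatever the machine's default happens to be. The function names and the file name are made up for illustration, and UTF-8 is an assumed choice; use whatever encoding your data actually requires.

```python
import os
import tempfile


def save_note(path, text):
    """Write text to path using an explicit encoding (UTF-8 assumed here)."""
    # Without encoding=..., open() may fall back to a platform-dependent default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def load_note(path):
    """Read text back using the same explicit encoding it was written with."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Usage: round-trip some non-ASCII text through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    note_path = os.path.join(tmp_dir, "note.txt")  # hypothetical file name
    save_note(note_path, "café ☕")
    assert load_note(note_path) == "café ☕"
```

Because the encoding is named at every read and write, the round trip should behave the same on a Mac, a Windows machine, or a server whose system encoding someone quietly changed.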
Such robustness is not easy to incorporate into a design. Some of the ideas that help include the following:
  1. Never fall out of a case statement. For example, if there are four possible values for a variable, don't check just for three and assume that the fourth is the only remaining possibility. Instead, assume the impossible; check for the fourth value and trap the error condition early.
  2. Predict as many "impossible" conditions as you can and develop strategies for recovery.
  3. To eliminate conditions that may cause disasters, do fault tree analysis for predictable unsafe conditions.
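The first idea above, never falling out of a case statement, might look like the following in Python. This is only a sketch: the OrderStatus values and the next_action function are hypothetical names invented for the example. All four known values are checked explicitly, and anything else is trapped right away instead of being treated as "the fourth one".

```python
from enum import Enum


class OrderStatus(Enum):
    # Hypothetical set of four possible values.
    PENDING = "pending"
    PAID = "paid"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"


def next_action(status):
    """Return the next step for an order, checking every known status."""
    if status is OrderStatus.PENDING:
        return "request payment"
    elif status is OrderStatus.PAID:
        return "ship order"
    elif status is OrderStatus.SHIPPED:
        return "send tracking email"
    elif status is OrderStatus.CANCELLED:
        # Checked explicitly rather than assumed to be "whatever is left".
        return "issue refund"
    else:
        # The "impossible" case: trap it early instead of guessing.
        raise ValueError(f"unexpected order status: {status!r}")
```

If someone later adds a fifth status, or a raw string slips through from a client, the function raises immediately at the point of the mistake rather than quietly issuing a refund.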