I have been fascinated by rockets since I was young. As I have gotten older, my appreciation has only grown for the amazing amount of engineering that goes into those majestic machines.
I would like to share with you some insights from one of my personal heroines: Margaret Hamilton, the hacker who saved Apollo 11. Here is a letter she wrote detailing how Neil Armstrong’s Eagle almost did not land on the moon — and how the design decisions she made saved the mission:
Due to an error in the checklist manual, the rendezvous radar switch was placed in the wrong position. This caused it to send erroneous signals to the computer. The result was that the computer was being asked to perform all of its normal functions for landing while receiving an extra load of spurious data which used up 15% of its time.
The computer (or rather the software in it) was smart enough to recognize that it was being asked to perform more tasks than it should be performing. It then sent out an alarm, which meant to the astronaut, I am overloaded with more tasks than I should be doing at this time and I am going to keep only the more important tasks; i.e. the ones needed for landing. Actually, the computer was programmed to do more than recognize error conditions.
A complete set of recovery programs was incorporated into the software. The software’s action, in this case, was to eliminate lower priority tasks and re-establish the more important ones … If the computer had not recognized this problem and taken recovery action, I doubt if Apollo 11 would have been the successful moon landing it was.
Three key takeaways from this story are instructive for software developers today:
Assume that things will break
You must plan for error conditions, especially in a developer role. You do not need to be exhaustive about finding every way that things could break. But you do need to ensure that your application does something sensible during an error condition. Sometimes, it is okay for this to be a custom 500 page shown to the user!
What you do NOT want is for the system to fail silently while the user assumes that everything is okay. This leads to data loss, frustration, and support tickets. It is a major drawback of many recent single-page applications: if the client cannot reach the API, the client can keep running for some time before the user notices that something is wrong. In addition, a crucial component of handling errors is the post-mortem that follows. One of the best ways to save yourself a headache in those situations is to ensure that you are logging enough data to debug problems after they have happened.
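Here is a minimal sketch of what "fail loudly and log enough" might look like in plain Ruby. The names `ApiUnreachableError` and `save_with_visible_failure` are hypothetical, made up for this illustration. The idea is simply that a failure produces both a structured log entry for the post-mortem and a result that the caller has to show to the user.

```ruby
require "logger"
require "json"
require "stringio"

# Hypothetical error raised when the backend API cannot be reached.
class ApiUnreachableError < StandardError; end

# Runs the given block. On failure, writes a structured log entry for later
# debugging and returns a result the caller must surface to the user,
# instead of swallowing the error.
def save_with_visible_failure(logger, record_id)
  yield
  { ok: true, message: "Saved." }
rescue ApiUnreachableError => e
  entry = { event: "save_failed", record_id: record_id,
            error: e.class.name, detail: e.message }
  logger.error(JSON.generate(entry))
  { ok: false, message: "Your changes were not saved. Please try again in a moment." }
end

# Usage: log to an in-memory buffer here; a real app would log to a file or service.
log_output = StringIO.new
logger = Logger.new(log_output)

result = save_with_visible_failure(logger, 42) do
  raise ApiUnreachableError, "connection timed out"
end
puts result[:message]
```

The log line carries the record id and the error class, which is usually the first thing you go looking for once the incident is over.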
Build in recovery and maintenance tools
While reading the story above, did you notice that the flight computer had programs to fix itself? Line-of-business applications do not usually need this sort of error correction outside of good database constraints. But it is a good idea to define what “normal” is for your applications, and write some cron jobs or recurring tasks in your background job solution (Sidekiq, Resque) to check that everything looks good. The more critical your process, and the higher the cost of deviation, the more crucial it is that you catch deviant behavior early.
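One way to define "normal" is as a list of named invariants that a recurring job evaluates. The sketch below is plain Ruby with made-up metrics (`stuck_orders`, `queue_depth`) and made-up thresholds. In a real application the snapshot would come from your database, and the check would run from cron or from a Sidekiq or Resque job.

```ruby
# A named expectation about what "normal" looks like.
Invariant = Struct.new(:name, :check)

# Runs every invariant against a snapshot of metrics and returns the names
# of the ones that failed. A check that raises counts as a failure too.
def violated_invariants(invariants, snapshot)
  failed = invariants.reject do |invariant|
    begin
      invariant.check.call(snapshot)
    rescue StandardError
      false
    end
  end
  failed.map(&:name)
end

# Hypothetical invariants; the thresholds are placeholders.
INVARIANTS = [
  Invariant.new("no orders stuck in processing",
                ->(s) { s.fetch(:stuck_orders).zero? }),
  Invariant.new("background queue depth under 1000",
                ->(s) { s.fetch(:queue_depth) < 1000 })
].freeze

# Usage: a recurring job would build the snapshot and alert on violations.
snapshot = { stuck_orders: 3, queue_depth: 120 }
violated_invariants(INVARIANTS, snapshot).each do |name|
  warn "Invariant violated: #{name}"
end
```

Treating a check that raises as a violation is a deliberate choice: a broken monitor should make noise rather than report that all is well.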
To fix anything that might have gone awry, a robust administration portal is essential. It helps you keep tabs on what is going on in your application. More importantly, it allows other teams within your organization to contribute to troubleshooting and error correction efforts.
In the Ruby on Rails world, Active Admin (activeadmin.info) is the de facto standard for creating this sort of interface. It is a very powerful tool, and it pays to start extending it with custom partials as early as possible. The alternative, developing rake tasks for recovery or maintenance work, prevents re-use and subjects those important pieces of code to bit-rot. This can leave them rusty and out of step with the rest of the application when you need them most.
Keep the human in the loop
One of the most striking things to me in the story above is that the Eagle lander raised some errors, even while attempting to continue on its own. When you have two well-trained pilots in the craft, you want all the expertise that they can give you. The same is true of your development and customer success teams: if you have hired well, you should have teams full of very well-trained and well-equipped people who can make better decisions than you can economically code.
I think many developers underestimate the intelligence of their customers. For that reason, developers do not tend to give enough attention to crafting error messages that let customers know what went wrong and how to fix it.
Just as investing in your administration portal is important, investing in error and notice communications to your customers is critical to reducing support overhead as your team scales. The end result is that you create empowered customers who have a sense of ownership over the application, which means they will be your best advocates in the future.
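To make that concrete, here is a small sketch of an error that forces you to write down both what went wrong and how to fix it. `ActionableError` and its fields are hypothetical names for illustration, not part of Rails or any library.

```ruby
# A user-facing error that carries what happened, how to fix it, and a
# reference code that support can search for in the logs.
class ActionableError < StandardError
  attr_reader :what_happened, :how_to_fix, :reference

  def initialize(what_happened:, how_to_fix:, reference:)
    @what_happened = what_happened
    @how_to_fix = how_to_fix
    @reference = reference
    super(what_happened)
  end

  # The text shown to the customer.
  def user_message
    "#{what_happened} #{how_to_fix} (Reference: #{reference})"
  end
end

# Usage with a made-up import failure.
error = ActionableError.new(
  what_happened: "We could not import your file because row 14 has no email address.",
  how_to_fix: "Add an email address to that row and upload the file again.",
  reference: "IMPORT-MISSING-EMAIL"
)
puts error.user_message
```

Making `how_to_fix` a required keyword means nobody can raise this error without first deciding what the customer should do next.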
Margaret Hamilton did all of the above in the 1960s, in assembly language, before any best practices for software engineering had been formalized. She even had to coin the term “software engineering”!
If the last 50 years have taught us anything about software development, it is that precious few endeavors work correctly on the first try. Even fewer keep working correctly in the future.
It benefits all of us to plan for and mitigate failure before it occurs.