Dec 6
Lessons learned (or not) from NASA
I recently found this story on IEEE Spectrum about Apollo 13. It seems that Ron Howard’s dramatization, surprise surprise, did not tell the whole story. The 18-page article covers the engineers, flight controllers, and flight directors behind the heroic and often miraculous actions taken to save the crew. I noticed a few patterns in it.
The first and most obvious one was emergency simulations. NASA trains its engineers to work not just when everything is peachy keen, but also when everything is balls-fuck upside down hell. They went through every conceivable “failure state”, even the ones that could never happen, and trained on how to handle them. This leads to the core concepts of NASA: “Don’t guess”, and “Check the instrumentation”. If you guess, people die. If the instrumentation is wrong, people die.
The next pattern I noticed was the attitude of “Get it right, or leave.” There were several instances in the story where an engineer or flight controller had run into a similar problem before, either in simulation or in a real-world situation, and couldn’t solve it at the time. But they researched, studied, and tested until they found a solution. Then they made sure that solution was on the books for reference.
Finally, there was communication. Several times during the Apollo 13 crisis, flight director Kranz would make a decision based on what he was told. The guys in the back room knew further problems would crop up because of these decisions, and set out to solve them before they came up. Every decision that was made was available for all to know, and the lower-down guys took care of their responsibilities, so they had a solution ready when it was needed.
So let’s apply these lessons to programming and IT. Your IT department (or you, for that matter) should train on the network or a reasonable analogue, and train to fix problems as they happen, from the root database master going down, to logic bombs, to random quantum interstices slicing Ethernet cables. You can’t know what will happen, but by the end, you should be able to properly diagnose a major or minor problem with your eyes closed and take the correct actions. If the website is suddenly being DDoSed by a 10-gigabit flood, what do you do? Do you have the plans in place? Do your people have the skills?
Next, take anything you couldn’t solve and find a correct solution. Make it available, and make it right. A massive time-based bug brings down payroll in a simulation, and you couldn’t get the db up in time? Learn how afterwards. Drill it in, and never forget it.
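To make the "on the books" idea concrete, here is a rough sketch of a drill log in Python. Everything in it (the `Drill` and `Runbook` names, the fields, the sample fix text) is made up for illustration, not taken from NASA or any real tool. The point is only that every failed drill stays on an open list until someone writes down a fix.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Drill:
    scenario: str          # what was simulated
    solved_in_time: bool   # did the team fix it during the drill?
    fix: str = ""          # documented solution; empty until one is written down


@dataclass
class Runbook:
    """Hypothetical drill log: unsolved scenarios stay open until documented."""

    drills: list[Drill] = field(default_factory=list)

    def record(self, scenario: str, solved_in_time: bool, fix: str = "") -> None:
        self.drills.append(Drill(scenario, solved_in_time, fix))

    def open_items(self) -> list[str]:
        # Anything without a written fix still needs research, even if
        # somebody got lucky during the drill itself.
        return [d.scenario for d in self.drills if not d.fix]

    def document_fix(self, scenario: str, fix: str) -> None:
        for d in self.drills:
            if d.scenario == scenario:
                d.fix = fix
                return
        raise KeyError(scenario)


# Example with invented data.
book = Runbook()
book.record("time-based bug takes down payroll db", solved_in_time=False)
book.record("DDoS flood on the website", solved_in_time=True,
            fix="placeholder: steps the team wrote up after the drill")
print(book.open_items())
```

After the post-mortem, calling `book.document_fix(...)` for the payroll scenario takes it off the open list.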
And finally, whoever is in charge should make their decisions known to everyone, ranging from “I’m going to shorten the TCP window on our routers” to “I’m going to ramp up our S3 servers.” If you know that a problem might crop up as a result, like inaccurate data flows or an overloaded bottleneck, start working on it.
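One way to picture this is a shared decision log, where each decision is announced along with the problems it might cause, and the back-room people claim those problems before they bite. The sketch below is only an illustration under my own assumptions: `Decision`, `DecisionLog`, and the owner name are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Decision:
    author: str
    summary: str
    # Maps each anticipated problem to whoever is working on it ("" = nobody yet).
    risks: dict[str, str] = field(default_factory=dict)


class DecisionLog:
    """Hypothetical shared log: every decision and its known risks are visible."""

    def __init__(self) -> None:
        self.decisions: list[Decision] = []

    def announce(self, author: str, summary: str, risks: list[str]) -> Decision:
        decision = Decision(author, summary, {risk: "" for risk in risks})
        self.decisions.append(decision)
        return decision

    def claim(self, decision: Decision, risk: str, owner: str) -> None:
        if risk not in decision.risks:
            raise KeyError(risk)
        decision.risks[risk] = owner

    def unclaimed(self) -> list[tuple[str, str]]:
        # (decision summary, risk) pairs that nobody has picked up yet.
        return [
            (d.summary, risk)
            for d in self.decisions
            for risk, owner in d.risks.items()
            if not owner
        ]


# Example with invented names.
log = DecisionLog()
d = log.announce(
    "lead",
    "shorten the TCP window on our routers",
    ["inaccurate data flows", "overloading the bottleneck"],
)
log.claim(d, "overloading the bottleneck", "alice")
print(log.unclaimed())
```

Anything still showing up in `unclaimed()` is a problem everyone knows about and nobody is working on, which is exactly the gap the back room closed.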
I will be applying these lessons to OddCo IT in the future, because I expect the best out of my people. I will take them through hell and back, and they will thank me for it, because it will make them better engineers.