The Essence of Failure
The Essence of Failure is a very well-known book in Japan. By analyzing the defeats of the Japanese Army and Navy during World War II, it holds up a mirror that shows readers how an organization can gradually weaken itself through its own systems, culture, and processes. Although the subject is the Japanese military, the underlying causes the book identifies apply to organizations in general.
Failure is often not the result of a momentary mistake, but of long-term institutional imbalance and cultural inertia.
In project work, failure is usually not caused by a single reason, but accumulates from small issues over time.
The book names several causes:
- Diffused responsibility
- Information blocking and filtering: lower-level staff dare not report the truth
- Valuing spiritual/ideological slogans over material support
- Dogmatism
- Lack of reflection and improvement
When an organization becomes used to hiding bad news, glorifying obedience rather than questioning, or narrowing success metrics to a single quantitative target, it becomes extremely fragile when external pressure arrives.
The author uses case studies to break down these effects and to show that, if structural defects are not corrected, individual heroism or makeshift improvisation cannot reverse the overall trend, let alone a bushido spirit that is expected to turn the tide of war.
Software Development
Information
When junior engineers do not dare to report risks or user pain points, management ends up making decisions based on distorted information.
Suppose you are not a front-line developer, and your first reaction to any issue is: “This feature (insert anything) is so simple. Why does it take so long to build?”
But most software requirements are inherently ambiguous and constantly changing, and coding itself involves a lot of exploration. Development is not just about implementing the feature; you also have to consider historical factors, how to integrate the feature into the existing codebase, and how to make future development smoother.
Don’t Get Stuck in Past Success
As LLMs and AI continue to advance, many established ways of working will gradually change. In such a period, aligning what you think with what you do is a good practice. But don’t let past success stop you from questioning, reviewing, or changing how you work.
Learn from Failure
In software development, bugs, accidents, and other issues are inevitable. Larger companies usually have an incident reporting mechanism. Suppose the system suddenly goes down and won’t boot. How do developers and management react? With anger, blame-shifting, and a search for a scapegoat, or with a standard process that minimizes the damage?
The least useful response to an incident is an emotional reaction.
You need to learn from failure so that the same mistake does not happen again. To do this, management needs to create an environment where no one is blamed, and instead focus on finding solutions.
- The product belongs to everyone; blaming one person means there is something wrong with the company’s system
- The codebase is the shared historical burden of the entire R&D team; if every error is attributed to a specific individual, that also means there is no good development process
Years ago, a Reddit post went viral: “Accidentally destroyed production database on first day of a job, and was told to leave, on top of this i was told by the CTO that they need to get legal involved, how screwed am i?”
The post describes how a new employee accidentally deleted the production database on his first day. In one step, instead of copying the value output by the tool, he used the environment variable recorded in the documentation, and that environment variable pointed to production. In the end, he was fired on the spot.
I don’t know how everyone would react to this story. Do you think the employee was careless for not even checking whether the value was correct? Or do you think it is simply outrageous that a developer could make such a mistake? If that is how you look at it, you will be hit all the harder when an incident arrives at your own company.
A company that relies on luck and does not back up its production database is, in itself, destined to pay the price sooner or later. You only realize how important backups are after data is lost. If a new employee can so easily delete the production database, that also means other employees will eventually be able to do the same.
Among the many comments, some people also shared Netflix’s approach.
I forget the exact details, but as I remember it, they not only regularly practiced how to respond quickly when disasters happen and how to back up systems, they also set permissions carefully in normal times, and even rewarded employees who managed to break the production database.
For the same problem, some companies start by blaming people; others address it through systems and processes to prevent human error.
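As a rough illustration of what "addressing it through systems" could look like in code, here is a minimal sketch of a guard that refuses a destructive operation when the target looks like production. Everything in it is an assumption made for illustration: the `reset_database` helper, the `"prod"` naming convention, the confirmation phrase, and the example URLs are hypothetical and do not come from the Reddit post or from any real tool.

```python
from typing import Optional

# Hypothetical marker and phrase; a real team would pick its own conventions.
PROTECTED_MARKER = "prod"
CONFIRM_PHRASE = "yes-this-is-production"


class ProductionGuardError(RuntimeError):
    """Raised when a destructive action targets a protected database."""


def is_protected(database_url: str) -> bool:
    """Treat any URL containing the marker as production (deliberately broad)."""
    return PROTECTED_MARKER in database_url.lower()


def reset_database(database_url: str, confirm: Optional[str] = None) -> str:
    """Pretend to wipe a database, refusing protected targets by default."""
    if is_protected(database_url) and confirm != CONFIRM_PHRASE:
        raise ProductionGuardError(
            f"refusing to reset {database_url!r} without explicit confirmation"
        )
    # A real implementation would drop and recreate the schema here.
    return f"reset {database_url}"


if __name__ == "__main__":
    print(reset_database("postgres://localhost/dev_app"))
    try:
        reset_database("postgres://db.prod.example.internal/app")
    except ProductionGuardError as err:
        print(f"blocked: {err}")
```

A check like this only makes one class of mistake harder. It does not replace backups or properly scoped permissions, which are the safeguards the story above was actually missing.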
Don’t Rely on Slogans
A slogan is only honored when it is implemented through systems. For example, companies that adopt “move fast and iterate quickly” as a slogan are often just using it as an excuse to squeeze more output out of developers, while completely disregarding engineers’ professionalism.
A company that really means it would show that in several areas:
- Placing great emphasis on CI/CD so developers can deploy without pain
- Recognizing that development speed has its limits
- Building all infrastructure properly so developers can focus on development
- Valuing the creation of documentation and processes
- Understanding that rapid iteration inevitably sacrifices some quality