Implementing Ray Dalio's Error Log method taught us to tolerate failures and mistakes in the team, as long as we learn from them.
Mark Zuckerberg, CEO of Facebook, once said: "Move fast and break things. Unless you are breaking stuff, you are not moving fast enough."
This is a philosophy Facebook follows, and a lot of its speed of execution can be attributed to this mantra. Startups love speed (one of their key strengths compared to incumbents in the market) and generally feel pumped to subscribe to this way of thinking. However, what no one highlights is the kind of internal systems Facebook has built to ensure that a developer has the freedom to build and break things without taking the business down.
I learned this from my conversations with engineers at Facebook, and it seemed like the right step for Facebook to take as it scaled up, considering it never wanted to lose the benefits of making mistakes.
When very early-stage startups blindly follow this philosophy, they conveniently ignore that a small error on their side can sometimes be enough to take their business down (or a considerable part of it). We tend to forget that we are fragile.
At Typito, we tried our best to be cautious whenever we pushed something to the public, be it a product update, an email campaign, or anything else. Being over-cautious was an overhead for sure, but we were not sure if there was another way to go about it. Life is nice when we don't commit mistakes or errors, right? But it turns out mistakes and errors are bound to happen, and you need to learn to cope with them.
The App downtime event and its cascading effects
It was a fine morning in 2017. I got up around 6:30 AM and checked my email for any important messages. I saw 3 emails from customers asking if our app was down. I immediately opened a browser, tried loading Typito, and realized that our SSL certificates had expired. My co-founder Srijith knew about the expiry, but he thought we had one more day, and he was traveling to his hometown that night. We realized the app had been down for more than 2 hours. I decided to vent my frustration on Srijith and did the silliest thing you can do at that moment: go to the #general channel on Slack and shout at him (typing in CAPS), as if that was going to solve the problem. He was an hour away from his hometown in Kerala, and we had to wait it out before he could resolve the issue (~3.5 hours of downtime). After this incident, our team started being extra cautious while executing, and it slowed us down.
Looking back, I realize it was a miserable thing for me to do, and I was setting a precedent for our small team that mistakes would be penalized. All this without realizing that we were still a small team and there was only so much we could do to ensure that we did not commit errors. I later apologized to Srijith and my team for the way I behaved in that instance. But the challenge remained: how do you build a culture that encourages failures and mistakes while being able to learn from them?
Error Logs by Ray Dalio
It was towards the end of 2017 that I got my hands on Principles, a book written by Ray Dalio, founder of Bridgewater Associates, the biggest hedge fund firm in the world. I had already read the summary version of Principles by then and was a big admirer of his efforts to build a team based on meritocracy. But the part of the book I could relate to the most was the section under "Work Principles" where he explains how he tried to build a culture in which it is okay to make mistakes and unacceptable not to learn from them.
He goes on to describe how the hedge fund lost hundreds of thousands of dollars because of a careless mistake by one of his employees. He understood that letting the person go would build a culture that's averse to failure, so he came up with a public log where everyone lists the mistakes they commit, explains in detail how each one impacted the customers or the company, and reflects on how it could be avoided in the future. With the error log, Bridgewater identified areas where mistakes were common, such as its approach to the foreign exchange market, and remedied them with actions like setting up a trading course in the building. All employees are expected to read the error logs as well, and this has helped them avoid recurring mistakes or failures while also building a culture that encourages the team to move fast and break things (if needed). (Mark should be happy now!)
Let's see how we adopted Error Logs by Ray Dalio as a culture experiment in Typito and how it turned out.
Following the Error Logs process from Ray's Principles should help us (Typito) build a culture that tolerates errors or failures and encourages the team to move fast while not making the same mistakes.
1. Maintain 2 separate documents: one for logs related to tech/engineering and another for non-tech/marketing/growth. We thought keeping them separate would make look-up easier. The documents can be called anything you like and can live in any tool you like. In our case, we call these docs "Production Issue Logs" and maintain the engineering one on Dropbox Paper and the non-engineering one on Trello as a list.
Brief of what we log for every issue
- Whenever a critical issue happens, the team first works towards resolving it. After that, the person most likely responsible for the issue spends time summarising what happened and adds a note to the Production Issue Log. Here's an example of a non-engineering issue log, related to an error I committed in April 2018 when we were experimenting with webinars on video marketing to help our customers:
The log should be self-explanatory so that anyone on the team can read and understand what happened.
- It's been a year since we started the Production Issue Log experiment. So far there are 27 logs in engineering and 6 logs in non-engineering. That's a total of 33 issues in the past year that we felt could have critically impacted our customers' experience! So yes, mistakes are bound to happen.
- Since we started this process, the number of recurring issues has dropped close to zero. Out of the 33 issues, only 1 recurred. While I don't have a comparison with how we did before this experiment (we didn't document issues then), I think we've gotten better at not repeating the same mistakes, since we now have a Bible that also acts as a checklist.
- Most importantly, this helped us build a lot more trust in each other, which resulted in everyone being more responsible for what they do at Typito. This is a qualitative, subjective observation. But here's what Basim, our product engineer, feels about the process: "Developers make mistakes all the time. The more experienced you get just means the more you make different types of mistakes. The PIL (Production Issue Logs) framework allows me to move past that paranoia of making mistakes. Being a small team, it allows me to push out code at a decent speed. Documenting the issues gives me a sense of admitting the issue, learning from it, and moving on. Over a period of time, it allows me to retrospect what I am doing wrong, to learn if there's a pattern in the mistakes and what can be avoided going forward."
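For teams that would rather keep such a log in a structured form than in a document, here is a minimal Python sketch of what one entry and a recurrence check could look like. It is not how we store our logs (those live in Dropbox Paper and Trello). The `IssueLog` fields mirror what the log is meant to capture (what happened, the impact, and how to avoid it), while the `category` label, the function names, and the sample entries are hypothetical and made up purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class IssueLog:
    """One entry in a Production Issue Log (illustrative structure only)."""
    logged_on: date
    area: str           # "engineering" or "non-engineering"
    category: str       # hypothetical label used to spot repeated mistakes
    what_happened: str
    impact: str         # how it affected customers or the company
    prevention: str     # reflection on how to avoid it in the future


def recurring_categories(logs):
    """Return the categories that were logged more than once, with counts."""
    counts = Counter(log.category for log in logs)
    return {category: n for category, n in counts.items() if n > 1}


def recurrence_rate(logs):
    """Fraction of entries that repeat a category already seen earlier."""
    if not logs:
        return 0.0
    seen = set()
    repeats = 0
    for log in sorted(logs, key=lambda entry: entry.logged_on):
        if log.category in seen:
            repeats += 1
        else:
            seen.add(log.category)
    return repeats / len(logs)


# Made-up sample entries, not taken from our real logs.
example_logs = [
    IssueLog(date(2018, 1, 5), "engineering", "ssl-expiry",
             "SSL certificate expired", "App unreachable for customers",
             "Add a renewal reminder a week before expiry"),
    IssueLog(date(2018, 2, 1), "non-engineering", "webinar-invite",
             "Invite sent with wrong details", "Confused attendees",
             "Have a second person review outgoing campaigns"),
    IssueLog(date(2018, 3, 1), "engineering", "ssl-expiry",
             "Certificate on a subdomain expired", "Partial downtime",
             "Track all certificates in one checklist"),
]

if __name__ == "__main__":
    print(recurring_categories(example_logs))
    print(f"{recurrence_rate(example_logs):.0%} of entries were repeats")
```

Applied to the numbers above, 1 recurring issue out of 33 works out to a recurrence rate of roughly 3%.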
We plan to continue with the Production Issue Log process going forward and we highly recommend this to other startups. It helps build better accountability, trust, and discipline in your early days as a team.