Reliability Engineering

Imagine that you are working on a team that is creating a new feature that allows users to submit and watch videos. The application uses a third party (we’ll call it Encodurz) to encode the videos that are uploaded to your application. You work hard to test the feature, making sure that the UI is flawless, that the user journeys make sense, and that the videos play correctly, among many other things.

It’s time for your new feature to be released to the world, and you’re excited! But on the day of the release, Encodurz has an outage. No videos can be encoded, so none of the uploaded videos will display! Your users don’t know that it’s not your fault; all they see is that the new feature doesn’t work. They call and complain to customer service and they post complaints on Twitter. This is why reliability engineering is important!

Reliability engineering focuses on the ability of applications to be as available as possible. It aims to offer a good user experience, even in the following situations:
* The server goes down
* The database is unavailable
* An API that your application relies on is unavailable
* A third-party provider that your application depends on is unavailable

Do you know how your application will behave in those scenarios? If not, it sounds like it’s a good time to test those scenarios! There are two ways to test:

Bring the service down
You can bring a server down by unplugging it, but chances are your server is not nearby for you to do that. You can also bring a server, a web service, or a database down by shutting it down with scripted commands. If you don’t have permission to do that, you can find someone in DevOps at your company who has the correct permissions.

Change your connection strings
It’s really easy to simulate an outage of a server, a database, an API, or any other service your application depends on by changing the way your application connects to it. For example, if your app connects to your company database with a username and password, all you need to do is send in a bad password. Or you could change the URI needed for the connection so that it’s incorrect.
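
As a rough sketch of what a test like this could look like in Python, the example below points some application code at a deliberately wrong URI and checks what the user would see. Everything named here is made up for illustration: the `db://` URI scheme, the `dashboard_banner` function, and the banner messages are assumptions, not part of any real application or library.

```python
import socket
from urllib.parse import urlsplit

OUTAGE_BANNER = "We can't reach the database right now. Please try again shortly."
HEALTHY_BANNER = "All systems operational."


def can_connect(uri: str, timeout_seconds: float = 2.0) -> bool:
    """Return True if a TCP connection to the host and port in the URI succeeds."""
    parts = urlsplit(uri)
    try:
        with socket.create_connection((parts.hostname, parts.port), timeout=timeout_seconds):
            return True
    except OSError:  # refused, timed out, unresolvable host, and so on
        return False


def dashboard_banner(database_uri: str) -> str:
    """Hypothetical application code: what the user sees for a given connection."""
    return HEALTHY_BANNER if can_connect(database_uri) else OUTAGE_BANNER


def unused_local_uri() -> str:
    """Build a deliberately wrong URI: a local port with nothing listening on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
        port = probe.getsockname()[1]
    # The probe socket is closed now, so connecting to this port should be refused.
    return f"db://127.0.0.1:{port}/app"


if __name__ == "__main__":
    # Simulated outage: the application should show the outage banner, not crash.
    print(dashboard_banner(unused_local_uri()))
```

The point of the sketch is the shape of the test: swap in a bad connection value, then assert on the behavior your users would experience.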

NOTE that it is a bad idea to try either of the above strategies in Production, at least until your application is VERY resilient. You will want to do this testing in your test environment.

Once you have discovered what happens when your application or one of its dependencies has an outage, it’s time to make your app as resilient as possible. Here are seven strategies for doing that:

  1. Use a “circuit-breaker”
    This method puts logic in the code that tries connecting to a resource a few times and, once those attempts fail, “trips” so that the application stops sending requests to the failing resource and switches over to a different one. For example, if your application usually points to Server A, and you have a backup server called Server B, when the circuit-breaker is tripped the connection changes over to Server B.
  2. Use retries
    Sometimes a third-party app will fail temporarily for an unknown reason. You don’t want your request to the app to fail and never try again, so you can build in some retries. Perhaps if the request fails, you wait 30 seconds and try again; if it fails again, you wait 60 seconds and try again, and so on. You don’t want to retry indefinitely, though. Instead, set some sort of limit (a maximum time or number of attempts) so that if the request still hasn’t succeeded when the limit is reached, an error is returned.
  3. Use cached data
    It’s a great idea to have some kind of caching service that will be able to serve up data if a request fails for some reason. When the request fails, your application just grabs the slightly-stale cached data and returns that instead.
  4. Enter into read-only mode
    If your application detects that there’s a problem writing to a data source, you can configure it to go into a read-only mode so that your users can at least see their data. You should set a message to display when this is the case to explain to users why they can’t update their data at the moment.
  5. Provide messaging that something isn’t quite right
    It is so annoying to get a cryptic error like “Error: T-128556” when using an application. That’s not helpful at all! Instead, provide your users with as much detail as you can about what’s wrong. In the example at the beginning of this post, there could have been a message that read “Sorry, we are having an issue connecting to our video encoding software at the moment. Please try again in a few minutes.”
  6. Have a status page that explains what is going on
    If your application goes down completely or is very degraded, it’s a great idea to have a status page (hosted on a different server) that provides a way to communicate to your end users what’s happening. You could include a timestamp with your time zone, a list of the features that are affected, and a message about what’s going wrong. Then you can keep the status page updated at regular intervals until the problem is fixed.
  7. After the problem has ended, do a post-mortem to see what lessons you’ve learned
    If the outage was very brief or only affected a few users, you might not need to make the post-mortem public. But if it was a big outage, it’s a great idea to communicate to customers what happened, why it happened, and how you are going to prevent it in the future. See Slack’s message about their January 4th outage for a really good example of this.
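
To make a few of these strategies more concrete, here is a minimal Python sketch that combines retries (strategy 2), cached data (strategy 3), and a friendly message (strategy 5). It is only one possible shape for this logic. The names `EncoderUnavailable`, `call_with_retries`, and `get_video_status` are hypothetical, the cache is a plain dictionary standing in for a real caching service, and the limit is expressed as a maximum number of attempts rather than a time limit.

```python
import time
from typing import Callable, Dict

FRIENDLY_ERROR = (
    "Sorry, we are having an issue connecting to our video encoding software "
    "at the moment. Please try again in a few minutes."
)


class EncoderUnavailable(Exception):
    """Raised by the (hypothetical) Encodurz client when a request fails."""


def call_with_retries(
    request: Callable[[], str],
    max_attempts: int = 3,
    first_delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the request up to max_attempts times, doubling the wait after each failure."""
    delay = first_delay_seconds
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except EncoderUnavailable:
            if attempt == max_attempts:
                raise  # limit reached: stop retrying and report the failure
            sleep(delay)
            delay *= 2  # 30 seconds, then 60, then 120, ...
    raise ValueError("max_attempts must be at least 1")


def get_video_status(
    video_id: str,
    fetch_status: Callable[[str], str],
    cache: Dict[str, str],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the live status, or slightly stale cached data, or a readable message."""
    try:
        status = call_with_retries(lambda: fetch_status(video_id), sleep=sleep)
    except EncoderUnavailable:
        if video_id in cache:
            return f"{cache[video_id]} (cached)"
        return FRIENDLY_ERROR
    cache[video_id] = status  # remember the last good answer for the next outage
    return status
```

When testing code like this, you could pass in a fake `fetch_status` that raises `EncoderUnavailable` and a `sleep` that does nothing (for example `lambda seconds: None`), so the simulated outage runs instantly instead of actually waiting 30 and 60 seconds.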

No application can run perfectly 100% of the time; servers are imperfect and our apps are almost always dependent upon outside forces. But it’s important to know exactly what will happen when services are down and to figure out the best ways to respond to those issues before they happen. As testers, we can encourage our team to participate in this process.

6 thoughts on “Reliability Engineering”

  2. Karlo Smid

    Hi Kristin!
    Great topic for your post, gives a nice overview of possible mitigation strategies.
    One note. We must be smart in simulating how the 3rd party system is in trouble. When this happens, the 3rd party system will respond, but in a very unusual way. Maybe the DNS service is not available, or there is a network connection problem between nodes on the connection path. Or it would answer, but very slowly. I like to read available post mortems; they are a great source of ideas on how things could go wrong.

    Regards, Karlo.

    1. kristinjackvony Post author

      Hi Karlo, thanks for adding this really great point! It’s definitely a good idea to research the third-party tools that your application depends on and find out how they communicate outages and handle problems.
