On crafting troubleshooting-friendly responses in web apps

Once again I have wasted 1-2 hours trying to figure out where the damn "403 Forbidden" was coming from. Yet, with a little forethought and a few seconds or minutes of time, hours could have been saved in such cases. This one would have been trivial had the following been in place instead of the original, spartan response.sendError(403):

response.addHeader("X-Authority", "HomeController");
response.sendError(403, "Access to the app (temporarily) disabled for everybody");

I would know to look into HomeController and that there is no problem with that particular user’s credentials. I could also search the code for that error message.

In total I have certainly spent days hunting for the source of HTTP errors, especially auth-related ones. So I beg you, when you write any error-handling or secondary-flow code, think about the poor person who is going to troubleshoot it and give her - likely your future self - friendly hints instead of an ugly, useless 403 Forbidden / 401 Unauthorized / 500 Internal Server Error / …. Let’s have a look at what you can do to prevent a lot of frustration and wasted time.

Causes and examples

But first, let’s look at why we have this problem. Requests might pass through a number of intermediary network components (in our case this included at one point AWS CloudFront, an AWS Load Balancer, Træfik as a Kubernetes ingress controller, and two Nginx proxies) and any of them can return an unexpected response (error, redirect). Inside your application, a response can be returned by a number of code units involved in the request processing - your own or library Filters and Interceptors, Controllers, ErrorHandlers, AOP interceptors, and who knows what. Some of those you don’t have the source code for so you cannot simply grep for "403" - and even where you can it doesn’t necessarily help because the actual code could use a constant from a number of competing types (org.(springframework|apache|eclipse.jetty).http.HttpStatus, javax.servlet.http.HttpServletResponse…). So figuring out where a response came from is hard, if not impossible. (Notice that I use Java examples here but the problem is not specific to any programming language, though some languages and frameworks make it worse.)

So what can we do about it?

First of all, make sure that you return useful status text with any non-200 response. Then include helpful troubleshooting headers. Here is an example from one of our XHR responses:

...
server: Minbedrift_Web_Server (1)
x-forwarded-via: MinBedrift (2)
x-minbedrift-uniqueprocesscallid: 4796623666-d9e5083a (3)
x-neo-duration-ms: 1744 (4)
x-touched-by-minbedrift-backend: true (5)
x-trb-request-uri: /api/neo/customers/subscriptions/you (6)

So we communicate:

  1. which network component is responding - Minbedrift_Web_Server

  2. (confusingly) that the request was just forwarded to another service

  3. a unique ID associated with that request and passed downstream and used in logs

  4. some timing info

  5. that the request passed through this network component (since I wasn’t sure that upstream proxies do not override the Server header)

  6. what the request path the application code saw looked like (since we and CloudFront do some URL rewriting and we had struggled to get it right)
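
The list above could be turned into a small helper that every component under your control calls before it responds. This is only a sketch using plain JDK types: the class name and the header names X-Request-Id, X-Duration-Ms and X-Seen-Request-Uri are my own placeholders, not the headers from our real setup.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds the troubleshooting headers one network component adds to a response. */
final class TroubleshootingHeaders {

    private TroubleshootingHeaders() {}

    /**
     * @param componentName  generic name of this component, without product or version
     * @param requestId      unique ID that is passed downstream and written to the logs
     * @param durationMs     time this component spent handling the request
     * @param requestUriSeen the request path as this component saw it, after any rewriting
     */
    static Map<String, String> build(String componentName, String requestId,
                                     long durationMs, String requestUriSeen) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Server", componentName);
        // Per-component marker in case another proxy overrides the Server header.
        headers.put("X-Touched-By-" + componentName, "true");
        // The three header names below are hypothetical; pick whatever fits your system.
        headers.put("X-Request-Id", requestId);
        headers.put("X-Duration-Ms", Long.toString(durationMs));
        headers.put("X-Seen-Request-Uri", requestUriSeen);
        return headers;
    }
}
```

In a servlet application you would then copy each entry onto the response with response.addHeader(name, value).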

Here is an example response body from an authentication filter:

{"error_message": "Unauthorized",
  "authority": "AuthenticationFilter",
  "reason": 3}

The good thing is that it tells you where the error originated - AuthenticationFilter (including the full package name might be even more useful) - and provides some details about the cause without exposing too much information to a hacker. Thus you can eventually learn that the cause is FAILED_JWT_VALIDATION. The bad thing is that you need to know to enable debug logging for a particular package (not the filter’s) to get a log message detailing why the JWT validation failed.
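
One way to produce such a body is to keep the reasons in an enum and only ever send the numeric code. The sketch below is an assumption about how this could look, not the actual filter code: only the pairing of 3 with FAILED_JWT_VALIDATION comes from the example above, the other constants and the class names are made up for illustration.

```java
/** Reasons an authentication filter might reject a request; only the numeric code leaves the server. */
enum DenialReason {
    MISSING_TOKEN(1),          // hypothetical
    EXPIRED_TOKEN(2),          // hypothetical
    FAILED_JWT_VALIDATION(3),  // the pairing seen in the example response
    UNKNOWN_ISSUER(4);         // hypothetical

    private final int code;

    DenialReason(int code) {
        this.code = code;
    }

    int code() {
        return code;
    }

    /** Lets a developer with access to the code translate a code from a response back to a name. */
    static DenialReason fromCode(int code) {
        for (DenialReason reason : values()) {
            if (reason.code == code) {
                return reason;
            }
        }
        throw new IllegalArgumentException("Unknown reason code: " + code);
    }
}

final class UnauthorizedBody {

    private UnauthorizedBody() {}

    /** Assumes authority is a plain class name, so no JSON escaping is done here. */
    static String toJson(String authority, DenialReason reason) {
        return "{\"error_message\": \"Unauthorized\", "
                + "\"authority\": \"" + authority + "\", "
                + "\"reason\": " + reason.code() + "}";
    }
}
```

Logging the reason's name at the same place, at a level that is enabled by default, might remove the need to know which package to turn debug logging on for.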

And here is an example of what a user of Telia Identity OIDC might see inside the page when configuration is missing for the provided redirect URI:

Technical error:

   Invalid redirect_url

What do we want to know

What you need to know about the processing of a request will vary based on the environment (production, test) and the actual endpoint. Some of the things you might be interested in:

A distributed request tracing ID

A distributed request tracing ID is a unique ID that follows the request across all systems and is included in each log message, and thus makes it possible to correlate logs across these systems. It is invaluable for figuring out where an error originates and what the cause is (since details are often not communicated upstream but are typically logged).

It is valuable to display the tracing ID to the user in the case of an error so that they have something to tell the customer support when they call in to complain. It is far more useful than when they get just a "server error." We show them something like:

Unexpected error

Sorry, something went wrong! Please try again in a short while. Talk to the customer support if the problem persists. Your error reference code: 51ee24c9-fd82-48ec-82fa-0608ca6bc514
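
Without a tracing library, the minimum is to reuse whatever ID arrived with the request, create one if none did, and put it into the message. Here is a rough sketch with plain JDK classes; the class and method names are invented for this example, and where the incoming ID comes from (a request header, most likely) is left to the caller.

```java
import java.util.UUID;

final class ErrorReference {

    private ErrorReference() {}

    /** Reuses the ID an earlier component already assigned, or creates a new random one. */
    static String resolveTraceId(String incomingTraceId) {
        if (incomingTraceId != null && !incomingTraceId.trim().isEmpty()) {
            return incomingTraceId.trim();
        }
        return UUID.randomUUID().toString();
    }

    /** The text shown to the user; the same ID must also go into every log message. */
    static String userMessage(String traceId) {
        return "Sorry, something went wrong! Please try again in a short while. "
                + "Talk to the customer support if the problem persists. "
                + "Your error reference code: " + traceId;
    }
}
```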

The name of the responding network component

Ideally we want to know which network component provided the response, at least among those under our direct control (so, in my case, anything between the browser and our backend-for-frontend). I communicate that using the standard Server header.

If I cannot be sure that an upstream proxy does not override the header, I (also) set the per-component header X-Touched-By-<component name>: true.

(All network components that support distributed tracing would automatically add their own "span" to the request "trace" thus superseding the need for this.)

The name of the responding code unit

Which of the filters/interceptors/controllers generated the response? Whenever this is not obvious I like to communicate this f.ex. using the X-Authority header - such as X-Authority: MyAppAuthorizationFilter - or an authority field in a response. With this information I can search the code for the corresponding class / namespace / function and look at its code or debug it to find out why it is responding in such a way. It is indispensable in a non-trivial application where you don’t have a clear picture of all the pieces in your head, especially if some of them come from libraries.
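
To make this hard to forget, the authority can be derived from the responding class itself instead of being typed in by hand. A minimal sketch, assuming the simple class name is specific enough in your code base; ErrorReply and the empty HomeController stand-in are hypothetical names used only to keep the example self-contained.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-in for a real controller; hypothetical, only here so the sketch compiles on its own. */
final class HomeController {}

/** An error response that always names the code unit that produced it. */
final class ErrorReply {
    final int status;
    final String statusText;
    final Map<String, String> headers;

    private ErrorReply(int status, String statusText, Map<String, String> headers) {
        this.status = status;
        this.statusText = statusText;
        this.headers = Collections.unmodifiableMap(headers);
    }

    static ErrorReply from(Class<?> respondingUnit, int status, String statusText) {
        Map<String, String> headers = new LinkedHashMap<>();
        // The simple name omits the package, so less of the internals is exposed.
        headers.put("X-Authority", respondingUnit.getSimpleName());
        return new ErrorReply(status, statusText, headers);
    }
}
```

A caller would create it with ErrorReply.from(HomeController.class, 403, "Access to the app (temporarily) disabled for everybody") and copy the headers, status and text onto the real response.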

The reason for the response

Providing details of why the particular decision has been made can also be indispensable. Authentication in particular can be denied for a number of reasons - a missing token, an invalid token, an expired token, an unknown issuer, …. Make sure to include sufficient details in the status text, response body, or a response HTTP header such as X-Reason.

From the examples above, that is the "reason": 3 (cryptic for security reasons, you must have access to the code to interpret it) and the "technical error: Invalid redirect_url".

A stack trace

Typically you don’t want to include stack traces in responses from your production application because they provide too much information to a possible hacker. But there is no reason not to include them in your test or staging environment, making troubleshooting quicker.

I like to include them under a name that makes it clear that they are only included in a non-production setting. Otherwise I will forget about it and get worried that we expose them also in production. (True story.)
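
Something along these lines is what I have in mind. It is a sketch only: the field name non_production_only_stack_trace, the class name, and the boolean production flag are assumptions, and how you determine the environment is up to your configuration.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

final class ErrorDetails {

    /** Hypothetical field name; the point is that it shouts "not for production". */
    static final String STACK_TRACE_FIELD = "non_production_only_stack_trace";

    private ErrorDetails() {}

    /** Builds the fields of an error body; the stack trace is added only outside production. */
    static Map<String, String> describe(Throwable error, String traceId, boolean production) {
        Map<String, String> body = new LinkedHashMap<>();
        body.put("error_message", "Unexpected error");
        body.put("error_reference", traceId);
        if (!production) {
            StringWriter buffer = new StringWriter();
            PrintWriter writer = new PrintWriter(buffer);
            error.printStackTrace(writer);
            writer.flush();
            body.put(STACK_TRACE_FIELD, buffer.toString());
        }
        return body;
    }
}
```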

Further thoughts

If returning a lot of troubleshooting data with each response is undesirable, you might make them conditional - return a minimal subset (e.g. tracing ID) and only return the full set if a particular request header or query parameter is present. (Beware that some intermediaries might strip off headers, f.ex. CloudFront.)
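
Such a switch might look roughly like this. The request header name X-Troubleshoot and the response header name X-Request-Id are placeholders I made up, and in production you would probably want something less guessable than a plain "true" before handing out the full set.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class ConditionalTroubleshooting {

    static final String DEBUG_REQUEST_HEADER = "X-Troubleshoot"; // hypothetical name
    static final String TRACE_ID_HEADER = "X-Request-Id";        // hypothetical name

    private ConditionalTroubleshooting() {}

    /** Returns all troubleshooting headers if the caller asked for them, otherwise only the tracing ID. */
    static Map<String, String> select(Map<String, String> requestHeaders,
                                      Map<String, String> allTroubleshootingHeaders) {
        // HTTP header names are case-insensitive, so look the switch up accordingly.
        Map<String, String> caseInsensitive = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        caseInsensitive.putAll(requestHeaders);

        if ("true".equalsIgnoreCase(caseInsensitive.get(DEBUG_REQUEST_HEADER))) {
            return new LinkedHashMap<>(allTroubleshootingHeaders);
        }
        Map<String, String> minimal = new LinkedHashMap<>();
        String traceId = allTroubleshootingHeaders.get(TRACE_ID_HEADER);
        if (traceId != null) {
            minimal.put(TRACE_ID_HEADER, traceId);
        }
        return minimal;
    }
}
```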

Security considerations

As mentioned in passing above, you want to provide as much information as possible to speed up troubleshooting but you also do not want to expose any information that might aid a potential hacker in breaking into your system.

In general you want to provide somewhat general information that makes sense to you (and any developer on the project) - that’s why we say "AuthenticationFilter" but exclude the package, "MinBedrift_Web_Server" and not "MinBedrift Jetty v1.2.3". You may also put the troubleshooting information someplace else - the code, a log file - and return just a reference to that, such as a tracing ID or the "reason: 3". These are just a few ideas; there are many more options.

Distributed Request Tracing FTW!

Distributed request tracing in general and OpenTracing (e.g. Jaeger) in particular solve many of these for you:

  1. Your request gets a unique tracing ID propagated to all downstream systems and included in each log, making it easy to find all the relevant log messages across the systems.

  2. Each network component and code unit that supports it creates its own, named span within the total trace so you can more easily see which components / units were involved in processing the request.

  3. Log messages are not logged directly but rather included with the trace (or rather a span within it) so you could directly return them in a testing environment and make them easily accessible in a troubleshooting application in production, without needing to run and access a log aggregation application.

  4. Spans can also have tags, such as db.query, instance.ip etc, providing even more troubleshooting context.

  5. It is secure - you only need to include the tracing ID in the response, which tells essentially nothing to a hacker yet gives you easy access to all the information - all the spans, names, logs, tags.

Libraries can add a good deal of troubleshooting information automatically to your traces but you are still responsible for making sure all the necessary information is there and perhaps manually adding spans for important code units. And you still must provide meaningful error messages.

Summary

Understanding where an unexpected response comes from in a distributed system with complex, multi-layered software is hard. With little effort, you can save yourself, other future developers, and customer support representatives a lot of time and frustration by making sure that you return helpful error messages / status codes and that you communicate sufficient context to be able to find out where and why the response originated. Distributed Request Tracing is a great help.