The Evolution of Engineering Challenges

On a recent phone call I was asked the question, “What is your biggest engineering challenge to date and how did you overcome it?”

I consider this question quite vague, since engineering is a broad subject. On one hand, I’d love to say there was a huge problem that needed solving and I solved it; that has happened many times. On the other hand, none of those huge problems was my most difficult engineering challenge to date.

Many people I’ve encountered will reach for the biggest problem they’ve ever experienced, but that isn’t necessarily the best example. For me, one of the most difficult engineering challenges can be as simple as fixing Time-of-Check to Time-of-Use (TOCTOU) bugs in applications. A number of fellow developers I’ve met consider TOCTOU bugs trivial and easy to resolve, and many of them likely are, but there are plenty of cases where TOCTOU is incredibly difficult to prevent.

Let’s take a point of sale system! You’re reloading your gift card balance and hit the “Add Balance” button twice after paying, which sends two requests to the backend within 10ms of each other. The backend application processes the requests at almost exactly the same time. Each one asks the database for the deposit’s status (5ms latency), and each is told the deposited balance hasn’t been transferred yet. They both proceed to add $25 to the gift card balance.

In an ideal world you’d have only a $25 balance on this gift card. In this case, however, the TOCTOU bug means a total of $50 was deposited even though only $25 was paid.
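
To make the double-click scenario concrete, here is a minimal Python sketch that simulates it with two threads. Everything in it is hypothetical: `GiftCardStore` is an in-memory stand-in for the database, and a `threading.Barrier` stands in for the 5ms of latency by holding both requests until each has finished its check.

```python
import threading


class GiftCardStore:
    """Hypothetical in-memory stand-in for the database."""

    def __init__(self):
        self.balance = 0
        self.deposit_applied = False
        self.write_lock = threading.Lock()


def add_balance(store, amount, both_checked):
    # Time of check: has this deposit already been applied?
    already_applied = store.deposit_applied

    # Stand-in for database latency: neither request continues
    # until both of them have completed the check above.
    both_checked.wait()

    # Time of use: act on an answer that may now be stale.
    if not already_applied:
        # The write itself is locked, but that does not help,
        # because the decision to write was made outside the lock.
        with store.write_lock:
            store.balance += amount
            store.deposit_applied = True


def simulate_double_click():
    store = GiftCardStore()
    both_checked = threading.Barrier(2)
    requests = [
        threading.Thread(target=add_balance, args=(store, 25, both_checked))
        for _ in range(2)
    ]
    for request in requests:
        request.start()
    for request in requests:
        request.join()
    return store.balance
```

Calling `simulate_double_click()` returns 50 rather than 25, because both requests saw `deposit_applied` as `False` before either one wrote anything. The lock around the write is there to show that protecting only the write is not enough when the check happens outside it.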

TOCTOU can be difficult to solve in these cases because one of the things we’re fighting is latency. If we send both requests in the same millisecond, and our database server is under load and takes 5ms to answer each of them, both checks come back showing that nothing has been deposited yet, and we’re in a race condition. Many database servers handle this gracefully under normal conditions, but under increased load from a “slow” denial of service, or from a massive number of users with no free resources to auto-scale, we’re going to run into this condition.

So, how do you solve the TOCTOU in this case? There are many different approaches you could take. One is disabling the button after the first click, which means adding some client-side code. It may stop 99% of accidental double submissions, but what if the user has scripted the requests? Then your client-side fix to disable the button won’t work.

We can also try other approaches. I’ve seen one former colleague implement a “double check” that, before updating the ORM model, re-polls the database to confirm the balance hasn’t changed since the model was first loaded. That is definitely one solution that can be applied, but there are still some tricky ways to bypass it, because a small window can remain between the re-check and the write.

When I encountered a very similar bug, I tried a few hacky fixes, and none of them felt like they really suited the bug. In the end, I decided on a clean approach with minimal impact: rate limiting.

Oftentimes, rate limiting is undervalued when tackling timing-based bugs such as this TOCTOU example. You can estimate the reasonable amount of traffic you expect per IP on this endpoint and apply a limit. In this example, I could allow one request every 20ms, which comfortably covers the known 5ms database latency with extra padding, without interfering with legitimate use of the program by other users on the same network.
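
As a rough illustration of the one-request-per-20ms rule, here is a minimal per-key limiter in Python. The class name `MinIntervalLimiter`, its parameters, and the injectable `clock` are all assumptions made for this sketch, and it keeps its state in a single process. Note that the limiter’s own check and update sit under a lock, so it doesn’t reintroduce the very race it is meant to close.

```python
import threading
import time


class MinIntervalLimiter:
    """Allow at most one request per key (e.g. client IP) every min_interval_s."""

    def __init__(self, min_interval_s=0.020, clock=time.monotonic):
        self._min_interval_s = min_interval_s
        self._clock = clock  # injectable so the behaviour can be tested
        self._last_allowed = {}  # key -> time of the last allowed request
        self._lock = threading.Lock()

    def allow(self, key):
        # Check and update happen under one lock, so two simultaneous
        # requests for the same key cannot both pass.
        with self._lock:
            now = self._clock()
            last = self._last_allowed.get(key)
            if last is not None and now - last < self._min_interval_s:
                return False
            self._last_allowed[key] = now
            return True
```

In a request handler you would call something like `limiter.allow(client_ip)` before touching the balance and reject the request when it returns `False`. This sketch never evicts old keys and only works within one process; with several backend servers the state would need to live somewhere shared.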

It is definitely not a perfect solution, but it’s a battle-tested one that works. Nothing will ever be the perfect solution; this is a complex engineering challenge where a good-enough implementation holds up and a careless one can break entire applications.

How would you solve a TOCTOU challenge?