
Your Worst-case Serverless Scenario Part II: Magic With Numbers

Niels van Bree


In this second installment of the “Your Worst-case Serverless Scenario” series, we explain how an intended 100k table records turned into 56 million. We do so with some pseudo functions from our code, a small case study and a visualization to help you grasp the concept. If you haven’t read the first part of this series (Invocation Hell) yet, we encourage you to do so, since this part builds on material discussed there.

Hackerman: How 100k Turned Into 56 Million

This one is quite interesting. Even though it didn’t directly cause much trouble, it definitely shouldn’t have happened. Before we dive into what went wrong, it’s good to show you some pseudocode of the part of the function that caused it. A lot more was going on in the real function, but this simplification is enough to understand the issues at hand.
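Roughly, in TypeScript-style pseudocode, that part of the function looked like this (the batch size of 25 and the helpers prepareItems and invokeNextFunction are simplified stand-ins rather than our exact code):

```typescript
// Stand-ins for the real helpers: formatting items, DynamoDB's batchWriteItem,
// and an asynchronous (Event-type) invocation of the next Lambda in the chain.
declare function prepareItems(count: number): Record<string, unknown>[];
declare function insertItemsIntoDynamoDB(items: Record<string, unknown>[]): Promise<void>;
declare function invokeNextFunction(payload: { amountOfItemsLeftToWrite: number }): Promise<void>;

const BATCH_SIZE = 25;

export async function hackerMan(amountOfItemsLeftToWrite: number): Promise<void> {
  try {
    const promises: Promise<unknown>[] = [];

    // Prepare at most one batch of items in the right format.
    const amountOfItemsToWriteNow = Math.min(amountOfItemsLeftToWrite, BATCH_SIZE);
    const items = prepareItems(amountOfItemsToWriteNow);

    // Start the batchWriteItem insertion, but don't await it yet:
    // the promise only gets pushed onto the array.
    promises.push(insertItemsIntoDynamoDB(items));

    // Subtract the batch size and, if there is work left, already invoke
    // the next function in the chain (fire-and-forget).
    const itemsLeft = amountOfItemsLeftToWrite - amountOfItemsToWriteNow;
    if (itemsLeft > 0) {
      promises.push(invokeNextFunction({ amountOfItemsLeftToWrite: itemsLeft }));
    }

    // Only now is everything awaited. If the insertion fails at this point,
    // the next invocation is already on its way.
    await Promise.all(promises);
  } catch (error) {
    // Generic error handling: log and let this link of the chain stop here.
    console.error('hackerMan failed', error);
    throw error;
  }
}
```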

Let’s walk through this briefly. The hackerMan function takes one argument, amountOfItemsLeftToWrite, which stands for the number of items we still need to write to our table. We prepare the items in the right format and insert them into DynamoDB with a batchWriteItem operation, pushing that insertion onto an array of promises. After that, amountOfItemsToWriteNow is subtracted from amountOfItemsLeftToWrite, and the result is passed to the next invocation if there are still items left to write. That invocation is also pushed onto the array of promises, which is only then awaited. If an error occurs somewhere, we have some generic error handling in place and the function (chain) should stop there.

First of all, there is one thing amiss that we won’t go into in detail here, because it falls outside the scope of this story, but that we would still like to address: if an error is thrown, the chain should stop, but as you can see, nothing is done about the items that have already been written to the database. That is a bad thing to allow, because we are left with a function that only partly fulfilled its purpose and no protocol telling us how to treat the part that succeeded.

Some Important Information

Now, let’s get to why 100k items turned into almost 56 million. To understand this, you need to understand three key components: asynchronous Lambda executions, the asynchronous code in our function, and DynamoDB’s batchWriteItem.

  1. Asynchronous Lambda executions we already discussed in the previous part: if there is an error, whether it’s a timeout or something else, Lambda retries the function up to 2 times.
  2. As for the asynchronous code in our function, we essentially executed the insertions and the invocation of the next function at the same time. This matters, because even if an error occurs while inserting items, a new function has already been invoked that knows nothing about that error.
  3. Lastly, DynamoDB’s batchWriteItem can throw an error, which in some cases is a timeout error. A timeout error is exactly what happened in this scenario, meaning that the Lambda execution contexts stayed occupied longer than usual, which made the concurrent Lambda execution problem described in the previous part even worse. Moreover, a timeout (or any other error) on batchWriteItem does not mean that none of the items were written to the table. In our case, that means anywhere from 0 to 24 items may have been written to the database by a function whose batchWriteItem operation errored.
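For reference, such an asynchronous, fire-and-forget invocation of the next function uses Lambda’s Event invocation type and looks roughly like this (a sketch with the AWS SDK for JavaScript v3; the function name is a placeholder):

```typescript
import { LambdaClient, InvokeCommand } from '@aws-sdk/client-lambda';

const lambda = new LambdaClient({});

// With InvocationType 'Event', Lambda queues the request, immediately returns
// a 202 acknowledgement and, if the invoked function errors, retries it up to
// 2 times. The caller never sees those failures.
async function invokeNextFunction(payload: { amountOfItemsLeftToWrite: number }): Promise<void> {
  await lambda.send(
    new InvokeCommand({
      FunctionName: 'hackerMan', // placeholder function name
      InvocationType: 'Event',
      Payload: Buffer.from(JSON.stringify(payload)),
    })
  );
}
```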

A Small Case Study

If you’ve read all of the above and can still follow, you might already have a hunch of what’s going on. If not, don’t worry, we’ll do our best to illustrate what happened on a smaller scale. Imagine a chain of 4 identical functions, each with the goal of writing 25 items to the origin table, for a collective goal of 100 items. If everything goes well and no errors are thrown, the goal is reached and everyone is happy. But what if an error occurs in the second invocation while executing the batchWriteItem operation? By that point, 25 items have been successfully written to the origin table. And not only that: the third invocation will already have been called, because that code was executed asynchronously together with the batchWriteItem operation that only later threw an error. If we assume that the third and fourth invocations don’t throw any errors, they write another 50 items to the origin table on top of the first 25, bringing the total to 75 items.

So what exactly happened to the second invocation? As stated above, anywhere between 0 and 24 items would have been written to the origin table before the error was thrown. For the sake of simplicity, we assume it wrote 10 items before the error, bringing the total to 85 items. What happens next is crucial for ending up with more items instead of fewer: because the first function invoked the second one with the Event InvocationType parameter, Lambda automatically retries that function up to 2 times.

Okay, so the total is already at 85 items, but now the second invocation is retried. Again, for simplicity’s sake, we presume that from now on everything goes as planned and no further errors are thrown. The retried function writes another 25 items to the origin table and invokes the next function with amountOfItemsLeftToWrite set to 50, because it itself received the parameter with a value of 75. Since it has no logic to handle retries, it knows nothing about its earlier failure and just carries on with the parameter it originally got. The next function receives the value 50, writes 25 items and invokes the next one, which also writes 25 items and concludes that the chain is done, because amountOfItemsLeftToWrite has reached 0. This brings the total number of items to 85 + 3 × 25 = 160.
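To make the bookkeeping explicit, here is a tiny sketch that replays this small case study (the numbers are those of the example above, not of our real workload):

```typescript
// Chain of 4 functions, batch size 25, intended total of 100 items.
// The second invocation fails after writing 10 items and is retried once.
let itemsWritten = 0;

// Original chain: amountOfItemsLeftToWrite goes 100 -> 75 -> 50 -> 25.
itemsWritten += 25; // invocation 1 succeeds
itemsWritten += 10; // invocation 2 writes 10 items, then batchWriteItem throws
itemsWritten += 25; // invocation 3 was already fired and succeeds
itemsWritten += 25; // invocation 4 succeeds and the chain "finishes"

// Lambda retries invocation 2, which still receives amountOfItemsLeftToWrite = 75,
// so it spawns a fresh sub-chain of 75 -> 50 -> 25.
itemsWritten += 25; // retried invocation 2
itemsWritten += 25; // new invocation 3
itemsWritten += 25; // new invocation 4

console.log(itemsWritten); // 160 instead of the intended 100
```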

So even with a very small chain of just 4 functions in the happy flow and a single error during the second invocation, we end up with 60 extra items: an increase of 60%! Of course, the real increase depends on where in the chain the errors occur, how many errors occur, how many retries Lambda performs, how many items batchWriteItem inserted before the error, the batch size, the chain size and when you force-quit the chain. In our case, ~100k became ~56 million, a shocking increase of 55,900%! Funnily enough, the increase could have been even higher if there were no such thing as the Lambda concurrent execution limit, because that limit drastically slowed down the concurrent executions of this function as well. Without it, we would also have noticed our huge mistake much later, resulting in an even longer-running chain.

How To Redesign This Mess As Quickly As Possible?

Now that you have a clearer picture of what exactly went wrong and which building blocks were responsible for these problems, let’s see how we can fix all this without starting from scratch. Note that this still isn’t the most desirable way of handling such functionality, but it shows how you can prevent most of the bigger problems with just a few adjustments.

  1. Get rid of the asynchronous code. You run code asynchronously because it saves execution time, and that only works if the parts involved don’t need each other’s results. Here, however, the function invocation clearly needs to know the result of the insertion, because we don’t want to continue with another function if this one errors; if we do, the parameter amountOfItemsLeftToWrite will be wrong as well. So the first step is to wait for the result of insertItemsIntoDynamoDB(items) before subtracting anything from amountOfItemsLeftToWrite and invoking another function.
  2. Make better use of Lambda’s execution time. A Lambda can run for up to 15 minutes per invocation. If the user doesn’t need to wait for the response (and in this case they don’t), why spin up more separate containers than necessary? Lambda’s context has a built-in function that tells you how much time is left before the execution terminates. By making good use of it, you can keep the process running as long as possible before invoking the next function, which drastically reduces the number of (concurrent) Lambdas you invoke.
  3. Build in a sleep function to prevent things from going wild. This depends on whether you can afford to take more time. In our use case, the items didn’t need to be inserted right away or as fast as possible; the job might as well take a few hours or even half a day. So to prevent DynamoDB throttling errors because auto scaling can’t scale up fast enough, you can also take a little more time between each (batch) insert.

The new function in pseudocode would look something like the one below.
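Something along these lines, in TypeScript-style pseudocode (getRemainingTimeInMillis is part of the Lambda Node.js context object; the batch size, the 30-second safety margin and the one-second pause are illustrative values, and the helpers are the same stand-ins as before):

```typescript
import type { Context } from 'aws-lambda';

declare function prepareItems(count: number): Record<string, unknown>[];
declare function insertItemsIntoDynamoDB(items: Record<string, unknown>[]): Promise<void>;
declare function invokeNextFunction(payload: { amountOfItemsLeftToWrite: number }): Promise<void>;

const BATCH_SIZE = 25;
const SAFETY_MARGIN_MS = 30_000;        // stop well before the 15-minute limit
const PAUSE_BETWEEN_BATCHES_MS = 1_000; // give DynamoDB auto scaling some air

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export async function hackerMan(
  amountOfItemsLeftToWrite: number,
  context: Context
): Promise<void> {
  // 2. Keep writing batches for as long as this execution has time left.
  while (
    amountOfItemsLeftToWrite > 0 &&
    context.getRemainingTimeInMillis() > SAFETY_MARGIN_MS
  ) {
    const amountOfItemsToWriteNow = Math.min(amountOfItemsLeftToWrite, BATCH_SIZE);
    const items = prepareItems(amountOfItemsToWriteNow);

    // 1. Wait for the insertion to succeed before touching the counter
    //    or invoking anything else.
    await insertItemsIntoDynamoDB(items);
    amountOfItemsLeftToWrite -= amountOfItemsToWriteNow;

    // 3. Slow down a little between batches so DynamoDB auto scaling can keep up.
    await sleep(PAUSE_BETWEEN_BATCHES_MS);
  }

  // Only invoke a new function if this one is about to run out of time
  // and there is genuinely work left to do.
  if (amountOfItemsLeftToWrite > 0) {
    await invokeNextFunction({ amountOfItemsLeftToWrite });
  }
}
```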

We hope that everything is clear so far. In the third and final part of this series, we will talk about an ‘invisible’ DynamoDB process that causes tables to malfunction, and wrap up this story.

Visuals in this post have been made by David Kempenaar.