How Memo Bank designs and builds digital products.
Written by Jérémie Martinez
Building a reliable Core Banking System
At Memo Bank, we have chosen from the very beginning to build our own Core Banking System from scratch. We realized early on that we could not fulfill our banking ambitions by using the same tools that banks and neobanks were already using.
Today, we would like to talk about resilience in a distributed system. This subject was indeed at the heart of our concerns alongside scalability, security and privacy. Unfortunately, resilience is too often overlooked.
In distributed systems, failures are inevitable (outages, network errors, crashes, …) , no doubt about that. Resilience is the ability of a system to embrace these contingencies and ensure errors are handled as gracefully as possible.
We built our bank to operate fully in real-time in order to set a new standard for the industry, hence achieving error handling without disrupting real-time traffic was mandatory for us. We wanted to be as fault tolerant as possible to ensure our bank’s continuity, high availability and avoid disruptions from any single point of failure.
In this article, we would like to focus on our dead letter queue (DLQ) system which is one of the tools we invested the most in our infrastructure.
Dead letter queue
A dead-letter queue is a holding queue for messages that failed to be delivered.
In our infrastructure, we have two distinct messaging systems:
- ActiveMQ for discrete messaging such as commands.
- Kafka for message streaming such as real time events.
We already went into details on our messaging systems in a previous article, now let’s dive into our failure management for each of them.
An ActiveMQ cluster has its own configurable mechanism built-in. After N tries, messages are isolated in another queue. To make things easier, all our messages are wrapped with the same envelope, so we can gather faulty messages in the same DLQ. Once we fix the bug or the outage, we reinject the messages in their original queue. That’s it.
Like discrete messaging, we isolate faulty messages into a DLQ, but things get more complicated when streaming messages, since we might have many consumers for a single topic. Therefore, we needed to ensure we could target a specific consumer when reinjecting. We managed to implement such a strategy thanks to our Kafka headers which contain a specific consumer name on the reinjected faulty message. Basically, all consumers ignore messages that are flagged with another consumer name ensuring that other consumers are not disturbed.
For some topics, the order of events is guaranteed by using the event number and a partition key. This is particularly practical for managing transactions and balance by account, for example. Indeed, you don’t want to make decisions based upon a wrong balance because one or many events were missed. So we had to implement isolation and reinjection in a way the order continues to be guaranteed even with one event going to DLQ.
As you have probably noticed, we wanted to use the same logic for both messaging strategies, so we could have a single tool to manage all topics regardless of the language, application and type of messaging. This tool evolves with our system and we regularly improve it based on new use cases.
Obviously, any action regarding a message is kept as an immutable trace of who, when and what, so we can achieve full traceability. We even went further by adding the ability to edit the content so we can fix bad data easily.
On this journey, we obviously encountered some pitfalls that we would like to share.
Don’t take anything for granted when it comes to message structure
Firstly, we should not have assumed that the messages would always have the correct structure when reading from the DLQ. Indeed, one of our applications failed to produce a correctly formatted message and we ended up blocking our DLQ completely. Therefore, you always need a fallback for the lowest structure possible such as a byte array, so that even if someone were to send an MP3 as a message, we would still be able to isolate, read and handle it.
Secondly, we should not have underestimated the various possible reasons for an outage. Obviously, we had put various circuit breakers in place to ensure one outage does not propagate in the whole system. Unfortunately, we missed some cases which led to a consumer failing on all messages and we were not able to prevent all events from going to the DLQ.
What’s next ?
As of today, our DLQ system has proven to be very efficient. It helped us to be more reliable, reduce outages and sleep better. We continue to improve it as we go and we listed improvements that could become necessary in the future.
Indeed, given the very low rate of errors we observe, our reinjection process is still quite manual but we have many ideas to automate it, such as precise error categorisation and automatic retries that could prevent human intervention on partner downtime for instance.
Finally, if building a distributed, scalable, available and obviously very resilient Core Banking System is a challenge you would like to tackle, please come help us.