The role of In-Memory Data Grids in Event Driven & Streaming Data Architectures
By Tanmay Ambre, Executive IT Architect, Financial Services Sector and Shahir A. Daya, IBM Distinguished Engineer and CTO Financial Services
Tanmay and I both work in the field as part of IBM Services. We are both hands-on and spend most of our time with our clients working on delivering their projects. We have seen the market adoption of Event-Driven and Streaming Architectures increase to the point where it is fast becoming the standard architecture for use cases that are event-driven or require real-time stream processing.
Based on recent client projects that Tanmay and I have delivered, we have found that In-Memory Data Grids have a role to play in high throughput / zero data loss event-driven architectures. What we would like to do in this article is share with you our view on what role data grids play in event-driven architectures and share our lessons learned.
We continue to evolve the architecture as our clients add additional use cases to leverage the platform even further and make it an integral part of their solutions. We plan to continue to share our experiences as we evolve this. We welcome the comments and experiences of others in the field.
Introduction to In-Memory Data Grids (IMDGs)
So, what exactly is an In-Memory Data Grid? There are many great articles available that describe what an In-Memory Data Grid is, so I’ll focus on the key characteristics. An IMDG is a data store in which:
- you can store objects
- the data is held in memory
- the data is partitioned across the nodes of a cluster
- nodes can be added easily to add more memory to the grid
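To make the partitioning idea concrete, here is a minimal, in-process sketch. It is not how any particular product implements partitioning: the `TinyGrid` class, the partition count, and the round-robin assignment of partitions to nodes are all assumptions made for illustration, and each "node" is just a Python dictionary.

```python
import hashlib


class TinyGrid:
    """Toy stand-in for an IMDG: each 'node' is a plain dict (illustrative only)."""

    def __init__(self, node_count, partitions=64):
        self.partitions = partitions
        self.nodes = [dict() for _ in range(node_count)]

    def _partition(self, key):
        # A stable hash, so the same key always maps to the same partition.
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % self.partitions

    def node_for(self, key):
        # Assumption of this sketch: partitions are spread over nodes round-robin.
        return self._partition(key) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self.node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self.node_for(key)].get(key)

    def add_node(self):
        # Adding a node adds memory; existing entries are rebalanced to their new owners.
        entries = [item for node in self.nodes for item in node.items()]
        self.nodes = [dict() for _ in range(len(self.nodes) + 1)]
        for key, value in entries:
            self.put(key, value)


grid = TinyGrid(node_count=3)
for i in range(1000):
    grid.put(f"trade-{i}", {"id": i})
sizes_before = [len(node) for node in grid.nodes]

grid.add_node()
sizes_after = [len(node) for node in grid.nodes]
```

The naive rebalancing here moves far more data than necessary. Real grids typically use assignment schemes designed to move only a fraction of the partitions when the cluster changes.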
With that context, let’s dig deeper into why IMDGs are important.
Why are IMDGs important?
Rapid advances in digital, social and IoT capabilities are giving rise to innovative business applications. They are also fueling exponential growth in data volumes. Systems process millions of events and generate a massive amount of data. Organizations are also exposing their business functionality as APIs with stringent response time and higher throughput Service Level Agreements (SLAs). These modern applications/services require high-speed access to data to deliver an engaging user experience and to meet throughput and response time SLAs.
To put this in context, consider these application scenarios:
1. Trade processing applications — these applications have massive throughput requirements, and each trade-related event must be processed in the context of its history, investor’s profile, and financial position. At the same time, business operations users must have a real-time aggregated view of key risk and financial metrics across geographies, clients, and types of securities.
2. Payment processing systems — again, massive throughput requirement. Each payment must be correlated to previous transaction history, customer’s financial position, and regulatory checks, which again could span across multiple transactions.
3. Online commerce applications — these applications require fast access to product catalogues, managing customer’s shopping carts, making recommendations based on items in the shopping cart and other criteria, tuning the search results based on user’s past browsing/buying behaviour, etc.
4. Open Banking — In a typical Open Banking scenario, a bank exposes an API to provide customer’s account balance and other details. Third-party application developers then leverage this API to build rich mobile applications. This results in an exponential rise in the number of account data queries.
Building these applications with traditional databases would involve substantial engineering complexity, especially in meeting the non-functional requirements of response time and throughput. By using IMDGs, a fast data layer can be created that helps with data lookup, storing state, and storing and maintaining business and operational metrics. An IMDG can also be used as the primary data store, with its write-behind capability writing the data into backend databases. Populating IMDGs is not expensive in terms of performance and effort. It can be done concurrently, using established architectural patterns such as pub-sub (publish-subscribe) and change data capture (CDC).
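The write-behind idea mentioned above can be sketched in a few lines. This is a simplified illustration under stated assumptions: `WriteBehindCache` is a hypothetical class, the "backend database" is a plain dictionary, and flushing happens synchronously once a batch size is reached, whereas real write-behind implementations usually flush asynchronously in the background.

```python
class WriteBehindCache:
    """Toy write-behind cache: writes land in memory first and reach the
    backing store later, in batches (illustrative only)."""

    def __init__(self, store, batch_size=3):
        self.store = store            # hypothetical backend, here just a dict
        self.batch_size = batch_size
        self.memory = {}              # what readers see immediately
        self.dirty = {}               # entries changed since the last flush

    def put(self, key, value):
        self.memory[key] = value
        self.dirty[key] = value       # repeated writes to one key coalesce
        if len(self.dirty) >= self.batch_size:
            self.flush()

    def get(self, key):
        return self.memory.get(key)

    def flush(self):
        # One batched write to the backend instead of one write per update.
        self.store.update(self.dirty)
        self.dirty.clear()


backend = {}
cache = WriteBehindCache(backend, batch_size=3)
cache.put("acct-1", 100)
cache.put("acct-1", 120)      # coalesces with the previous write
cache.put("acct-2", 50)
backend_before_flush = dict(backend)   # still empty: nothing flushed yet
cache.put("acct-3", 75)       # third dirty key triggers the flush
```

Readers see the new values immediately, while the backend receives fewer, larger writes. The trade-off is that anything still in `dirty` is at risk if the process fails before the flush.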
A typical usage pattern that we have observed is depicted below:
There are several characteristics of IMDGs that we like a lot:
- Fast and highly concurrent access to data
- Help reduce the load on Systems of Record
- Distributed and fault-tolerant in nature
- Easy to scale as data volumes increase
- Support for cloud-based architectures
- Ease of setup and deployment
- Some, such as Apache Ignite, provide an ANSI-99 compliant SQL interface with JDBC and ODBC drivers
- They are also rapidly becoming cost-effective, as memory costs come down
Following are some of the patterns where we think using an IMDG is beneficial:
1. High-Performance Computing (HPC)
HPC is about aggregating the compute and memory resources of multiple (commodity) servers to deliver higher performance and concurrency than a standalone server. The key principle is divide and conquer (map-reduce), i.e., dividing the data and compute tasks into smaller segments, distributing them to different servers in the grid for parallel processing, and aggregating the results of these smaller segments to compute the outcome.
As data keeps growing exponentially, partitioning data and enabling distributed querying to provide high throughput is key to HPC. IMDGs are an essential building block for HPC. They partition data and distribute it across multiple server nodes in the cluster. Queries also get split and distributed across the cluster. This means the performance of IMDGs can be scaled out by adding nodes to meet high throughput requirements.
Additionally, IMDG platforms such as Apache Ignite also provide distributed compute capability, i.e., the same nodes of their cluster can be used to schedule and execute concurrent compute tasks. They also provide “data and compute colocation”, i.e., running the distributed compute sub-tasks on the nodes that also hold the data required by those sub-tasks. This avoids network round trips for data access and provides higher performance.
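The divide-and-conquer flow with colocated compute can be sketched as follows. This is not the Apache Ignite compute API. It is a stand-in that uses a thread pool to play the role of the cluster, and the trade records, `map_on_node`, and `reduce_partials` are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical trade data, already partitioned across three "nodes".
node_data = [
    [{"symbol": "IBM", "qty": 100}, {"symbol": "AAPL", "qty": 50}],
    [{"symbol": "IBM", "qty": 25}],
    [{"symbol": "AAPL", "qty": 10}, {"symbol": "MSFT", "qty": 5}],
]


def map_on_node(local_trades):
    """Runs 'on the node': sees only local data and returns a small partial result."""
    partial = {}
    for trade in local_trades:
        partial[trade["symbol"]] = partial.get(trade["symbol"], 0) + trade["qty"]
    return partial


def reduce_partials(partials):
    """Runs on the caller: merges the per-node partial results."""
    total = {}
    for partial in partials:
        for symbol, qty in partial.items():
            total[symbol] = total.get(symbol, 0) + qty
    return total


# One task per node; each task is handed only that node's data.
with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
    partials = list(pool.map(map_on_node, node_data))

totals = reduce_partials(partials)
```

Only the small per-node dictionaries travel back to the caller, not the raw trades, which is the point of sending the computation to the data.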
2. Acting as “Q” for Command Query Responsibility Segregation (CQRS)
In a typical application, there is a significant skew between read and write workloads. The “reads” out-number the “writes” significantly. For example, a bank’s customer makes far fewer payments than account balance checks. CQRS provides for segregation of read and write workloads, thus ensuring write performance is not impacted adversely due to large volumes of reads.
We have observed the following characteristics while working on projects with our clients:
- Reads target “recent” data far more often than “old” data
- The UI requires data in a representation that is different from the underlying data model, typically an aggregation across multiple entities
- Real-time aggregations need to be built for display on dashboards
- Response time requirements for reads are aggressive
- Read workload increases significantly during marketing initiatives and global political or market-related events
IMDGs make sense in implementing the read part of CQRS for systems displaying the above characteristics due to the following:
- It is easy to populate these grids in event-driven and stream processing types of applications. These data streams provide a mechanism to transform/enrich the data so that it is better aligned to the consumption requirements.
- Most IMDGs support different eviction policies, which helps keep the “recent” data in memory while older data is dropped.
- Faster performance and high concurrency (through partitioning and appropriate colocation strategy) compared to traditional databases.
- Ease of scaling is an important characteristic. IMDGs are inherently distributed in nature and can be easily scaled horizontally.
- IMDGs also support replication, thus making them more resilient to failure.
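The points above can be illustrated with a small query-side view that is populated from events and evicts old entries. This is a sketch under assumptions: `BalanceReadModel`, the event fields (`account`, `amount`, `ts`), and the time-based eviction rule are hypothetical, and a real deployment would keep this state in the grid rather than in local dictionaries.

```python
from collections import defaultdict


class BalanceReadModel:
    """Toy query-side view kept up to date from payment events (illustrative only)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.balances = defaultdict(int)   # account -> current balance
        self.recent = defaultdict(list)    # account -> [(timestamp, amount)]

    def apply(self, event):
        # Called by the event/stream consumer for every write-side event.
        account = event["account"]
        self.balances[account] += event["amount"]
        self.recent[account].append((event["ts"], event["amount"]))

    def evict(self, now):
        # Time-based eviction: drop transaction entries older than the retention window.
        cutoff = now - self.retention_seconds
        for account in list(self.recent):
            self.recent[account] = [e for e in self.recent[account] if e[0] >= cutoff]

    def query(self, account):
        # Reads are served from the view and never touch the system of record.
        return {
            "balance": self.balances.get(account, 0),
            "recent": list(self.recent.get(account, [])),
        }


view = BalanceReadModel(retention_seconds=60)
view.apply({"account": "A1", "amount": 500, "ts": 0})
view.apply({"account": "A1", "amount": -120, "ts": 100})
view.evict(now=120)
result = view.query("A1")
```

The view stores data in the shape the UI needs (a ready-made balance plus recent activity), so a read is a single lookup rather than an aggregation over the write model.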
3. Decoupling of Reporting workloads from Systems of Record
Once an IMDG is populated, it could be used to service reporting workloads. This would be particularly useful in cases where reports are being generated directly from systems of record instead of from a Data Warehouse or Data Lake. However, this would require the IMDG to provide a connector interface that is compatible with the reporting tool. Apache Ignite, for example, provides ANSI-99 SQL compliance and JDBC/ODBC drivers, which most reporting tools support. By leveraging the IMDG for reporting, a significant amount of reporting workload can be taken off the systems of record.
There are a few things that need to be taken into account:
- If using the SQL interface (JDBC, ODBC drivers): (a) optimize queries for performance, (b) create proper indexes, (c) try to store data in de-normalized form to avoid joins, (d) if joins are unavoidable, try to ensure the data being joined is “colocated”.
- The IMDG needs to be tuned to support concurrent online and reporting workloads. This could involve segregating the workloads using different thread pools and having a proper throttling mechanism in place.
- The IMDG can lose its data in case of a cluster-wide failure. If the data is not backed by a persistent store, it needs to be rebuilt as part of recovery before reports can be generated.
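The advice about colocating joined data can be illustrated with a toy example. The sketch assumes that both caches are partitioned by the same affinity key (the customer id), so every payment lands on the node that also holds its customer record and the join never needs a cross-node lookup. The node count, the CRC32-based placement, and the cache layout are assumptions of the sketch, not the behaviour of a specific product.

```python
import zlib

NODE_COUNT = 4


def node_for(affinity_key):
    # Both caches are placed by customer id, so related rows share a node.
    return zlib.crc32(affinity_key.encode("utf-8")) % NODE_COUNT


# Each node holds its own slice of both caches.
accounts = [dict() for _ in range(NODE_COUNT)]   # customer_id -> customer name
payments = [[] for _ in range(NODE_COUNT)]       # (customer_id, amount) rows


def put_account(customer_id, name):
    accounts[node_for(customer_id)][customer_id] = name


def put_payment(customer_id, amount):
    payments[node_for(customer_id)].append((customer_id, amount))


def local_join(node):
    # The join touches only data held on this node.
    return [(accounts[node][cid], amount) for cid, amount in payments[node]]


put_account("C1", "Alice")
put_account("C2", "Bob")
put_payment("C1", 40)
put_payment("C2", 15)
put_payment("C1", 60)

# The report is simply the union of the per-node join results.
report = sorted(row for node in range(NODE_COUNT) for row in local_join(node))
```

If payments were placed by payment id instead, `local_join` would fail to find some customers locally, which is the situation a grid has to resolve with slower cross-node lookups.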
4. Real-time Analytics
There are many metrics and insights that need to be generated in near real-time. This is especially important for systems analyzing risk, systems detecting deviant behaviour (e.g., fraud detection and anomaly detection), and IoT-based systems. Some sample use cases are:
- Business dashboards that show metrics related to business KPI/SLA aggregated across different dimensions
- Risk assessment applications
- Pattern and anomaly detection applications such as fraud detection, high-value payment processing, predicting failure, etc.
- Machine Learning inferencing, especially for models that require a large volume of near real-time data
This entails:
- Correlating different types of events
- Aggregating events and data
- Running algorithms to identify patterns
- Fast access to near real-time data to feed into machine learning models
The key challenge in these use cases is to perform at scale and speed. Fast access to data is required. State needs to be maintained and continuously updated. Multiple datasets need to be joined to correlate events and identify patterns. All of this has to be done within stringent response time requirements and at scale. To add to the complexity, the data volume is continuously growing.
IMDGs are very helpful here. Using them, a fast data layer can be built that can maintain state, history and aggregated metrics. In event-driven and stream processing-based systems, the stream data processing pipelines can populate the grid and also publish aggregated metrics in near real-time. Event processors then reference the data in IMDGs as part of their execution, instead of querying databases or making service calls to systems of record. Another advantage of IMDGs is their scalability. As data volumes grow, the grid can be easily scaled.
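As a rough sketch of an event processor that keeps running aggregates in a fast state store and consults them while processing each event, consider the following. Everything here is hypothetical: `MetricsState` stands in for state held in the grid, and the "five times the running average" rule is an arbitrary example threshold, not a recommended fraud rule.

```python
from collections import defaultdict


class MetricsState:
    """Toy state store holding a running count and sum per key (illustrative only)."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def update(self, key, amount):
        self.count[key] += 1
        self.total[key] += amount

    def mean(self, key):
        return self.total[key] / self.count[key] if self.count[key] else 0.0


def process(event, state, factor=5.0, min_history=3):
    """Flag a payment that is far above the customer's running average,
    then fold the event into the state."""
    key = event["customer"]
    suspicious = (
        state.count[key] >= min_history
        and event["amount"] > factor * state.mean(key)
    )
    state.update(key, event["amount"])
    return suspicious


state = MetricsState()
events = [
    {"customer": "C1", "amount": 20.0},
    {"customer": "C1", "amount": 30.0},
    {"customer": "C1", "amount": 25.0},
    {"customer": "C1", "amount": 900.0},
]
flags = [process(event, state) for event in events]
```

The first three events only build up history. The fourth is compared against a running average of 25.0 and is flagged, without any call to a database or a system of record.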
IMDGs such as Apache Ignite provide compute APIs that enable running compute tasks directly on the data nodes with “map-reduce” capability. They support the colocation of compute tasks and data, which means compute tasks are executed on the node where the relevant data resides, thus reducing network traffic and improving performance. This feature can be used to create compute tasks for data aggregation.
Note: If you are using an event streaming platform like Kafka, consider using the Kafka Streams API for real-time analytics and event correlation. This is an option you should evaluate for your particular use case/scenario. The result from Kafka Streams processors can be published to IMDGs for fast access.
Typical challenges in IMDG implementation projects
IMDGs are extremely beneficial for providing fast data access and reducing load on existing systems of record. However, there are certain challenges pertaining to their use that need to be considered as part of design:
- Without persistence of in-memory data, there is a risk of data loss in case of IMDG cluster-wide failure. This can be addressed to a certain extent by having a cross datacenter cluster setup and ensuring replicas are maintained across data centers. A more reliable solution is to have persistence enabled for the IMDG. However, persistence has performance and storage costs associated with it. Persistence can either be on a disk (for example, native persistence in the case of Apache Ignite) or in a third-party database such as Cassandra or an RDBMS. IMDGs provide a “write-behind” pattern for persistence to improve performance. However, performance issues can still manifest when big bursts of inserts/updates hit the IMDG.
- Recovering from a cluster-wide crash. When the IMDG cluster crashes, data is lost. If this data is backed up on persistent storage, then it needs to be reloaded before applications are allowed to access it. Keeping recovery time short is extremely important to ensure minimum disruption. Recovery time grows with the data volume and shrinks as more threads are used for recovery. If you are dealing with a large amount of data, it is highly advisable to test recovery scenarios as part of the development and tune the IMDG and the infrastructure to optimize recovery time.
- Memory is still not very cheap and setting up an IMDG has cost implications. Increasing the cluster size as data volumes grow comes at a cost. It is prudent to set up limits on how many days’ worth of data needs to be available in the IMDG. This decision needs to be made, taking into consideration the pattern and volume of queries being made on the IMDG.
- The IMDG is an additional set of components in the architecture to manage and maintain. This adds to maintenance and operating costs. IT Management teams need to be enabled to manage and maintain these additional components. Development teams need to be educated on using an IMDG, especially on the differences between a database and an IMDG. We have noticed direct comparisons of IMDG being done with traditional databases. This should be avoided. Both have their own strengths and serve different purposes.
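For the recovery-time challenge above, a back-of-the-envelope estimate can help when planning tests. The function below is only a rough model: it assumes reload throughput scales linearly with the number of threads, and the data volume and per-thread throughput figures are made-up inputs, not measurements.

```python
def estimate_recovery_seconds(data_gb, threads, mb_per_sec_per_thread):
    """Rough reload time, assuming throughput scales linearly with threads."""
    if threads <= 0 or mb_per_sec_per_thread <= 0:
        raise ValueError("threads and throughput must be positive")
    total_mb = data_gb * 1024
    return total_mb / (threads * mb_per_sec_per_thread)


# Hypothetical inputs: 500 GB of data, 50 MB/s sustained per loader thread.
eight_threads = estimate_recovery_seconds(500, 8, 50)     # 1280.0 seconds
sixteen_threads = estimate_recovery_seconds(500, 16, 50)  # 640.0 seconds
```

In practice, disk, network, or the backing store saturates well before scaling stays linear, which is why the estimate is only a starting point and actual recovery runs with production-like volumes are what count.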
Lessons Learned
Here are some of the many lessons we have learnt as we have undertaken client projects that leveraged an IMDG:
- Data replication and rebalancing are very important for IMDGs because, without them, even a single node failure can make the grid inconsistent.
- Automation of basic admin activities (start/stop, etc.) on the IMDG is important, especially when there are a large number of nodes in the cluster.
- Monitoring dashboards and alerts need to be created for key IMDG metrics if out-of-the-box capabilities do not exist.
- Design data recovery mechanisms during initial stages of development. Don’t leave them for the last moment. Test the mechanisms with production-like volumes to benchmark data recovery times.
- Performance testing is important, especially in scenarios that result in massive bursts in insert/update rates and even more important when using persistence.
- Decide how much data you want to retain in the IMDG. Scaling the nodes is easy but costs money.
- Rolling upgrade of IMDG nodes is a must-have feature. Without rolling upgrade capability, every version upgrade would require the cluster to be restarted, and data recovery would be required.
- A considerable amount of time needs to be spent on optimizing and tuning the IMDG setup. This time needs to be factored into the overall project planning. Don’t take it lightly.
- IMDGs have a long list of configuration options. There will be a lot of decisions that need to be made related to the IMDG configuration. You need to make sure they are well-documented.
- It is important to educate development teams on the differences between a database and an IMDG, especially when using a SQL compliant IMDG such as Apache Ignite. Relational databases support very advanced and complex SQL queries that may not perform efficiently on an IMDG. Another aspect is data colocation, which is important to understand while creating data caches and designing queries that join multiple caches.
Summary
Tanmay and I have tried to share our experience with In-Memory Data Grids. We shared the kinds of use cases we see benefits in, what challenges we have needed to address, and lessons we have learnt along the way. We are both very passionate about this topic, and we hope that you have found this article useful. If you have implemented an IMDG and have other learnings not covered in our article, please share them in the comments.
References
- N. Ivanov, “In-Memory Data Grid: Explained…”, GridGain Systems, 2020. [Online]. Available: https://www.gridgain.com/resources/blog/in-memory-data-grid-explained. [Accessed: 23-Dec-2020]
- M. Fowler, “bliki: CQRS”, martinfowler.com, 2020. [Online]. Available: https://martinfowler.com/bliki/CQRS.html. [Accessed: 23-Dec-2020]
- “In-Memory Data Grid — White Paper”, Gridgain.com, 2013. [Online]. Available: https://www.gridgain.com/media/in-memory-datagrid.pdf. [Accessed: 20-Jan-2021]
- “Distributed Database — Apache Ignite®”, Ignite.apache.org, 2021. [Online]. Available: https://ignite.apache.org. [Accessed: 20-Jan-2021]
- J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Google.com, 2004. [Online]. Available: https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf. [Accessed: 21-Jan-2021]
- “Kafka Streams”, Apache Kafka Documentation, 2021. [Online]. Available: https://kafka.apache.org/documentation/streams/. [Accessed: 22-Jan-2021]
- “What is Machine Learning Inference?”, Hazelcast, 2021. [Online]. Available: https://hazelcast.com/glossary/machine-learning-inference/. [Accessed: 22-Jan-2021]