
Common Problems Resolved By ClearConnect

I have explained what real-time multi-process apps are and have gone through their core requirements. Now I'm going to explain how our Java framework, ClearConnect, satisfies those requirements and solves common problems.

Data Transport Technology

We chose a point-to-point design to avoid the network stability issues associated with data multicasting. The default implementation is TCP/IP (TCPChannel); however, it wasn't long before we came across the need to use Solace. So we enhanced our code to support pluggable transport technologies and developed a Solace implementation that preserves the point-to-point semantics.
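
To illustrate what "pluggable transport technologies" implies, here is a minimal sketch of the kind of abstraction involved; the names (TransportChannel and so on) are hypothetical and are not ClearConnect's actual classes:

    // Illustrative only: a minimal shape for a pluggable, point-to-point transport.
    // These names are hypothetical, not ClearConnect's actual API.
    import java.io.IOException;
    import java.util.function.Consumer;

    interface TransportChannel
    {
        void send(byte[] data) throws IOException;    // deliver to the single peer
        void setReceiver(Consumer<byte[]> receiver);  // invoked for each inbound frame
        void destroy();                               // release sockets/sessions/subscriptions
    }

A TCP implementation of such a contract wraps a socket; a Solace implementation maps the same point-to-point contract onto the messaging layer while the rest of the framework stays unchanged.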

Records (Data On ClearConnect)

Data on ClearConnect is homogenised into simple but flexible records, so all clients (record consumers) and services (record publishers) on ClearConnect 'speak the same language'. Records contain key-value pairs with a maximum depth of 2; in other words, a record value can itself be a set of key-value pairs (a sub-map), but a sub-map cannot contain further sub-maps. Record keys can only be text and the values can be any type in the table below:

Type          Java Type   Description
TextValue     String      Any arbitrary text.
LongValue     Long        A whole number, more precisely a 64-bit two's complement integer.
DoubleValue   Double      A decimal number, more precisely a double-precision 64-bit IEEE 754 floating point.
BlobValue     byte[]      Any object serialized into an array of bytes.
A sub-map     Map         A map that supports text keys and values that can be a TextValue, LongValue, DoubleValue or BlobValue.
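
To make the depth-of-2 rule concrete, here is a small illustration using ordinary Java maps; this is purely a sketch of the shape of a record, not the ClearConnect record classes themselves:

    import java.util.HashMap;
    import java.util.Map;

    // Illustration of the record shape using plain Java maps (not framework classes).
    public class RecordShapeExample
    {
        public static void main(String[] args)
        {
            Map<String, Object> record = new HashMap<>();
            record.put("name", "EURUSD");              // a TextValue
            record.put("updateCount", 42L);            // a LongValue
            record.put("mid", 1.0825d);                // a DoubleValue
            record.put("payload", new byte[] {1, 2});  // a BlobValue

            // One level of nesting is allowed: a sub-map with the same simple value types.
            Map<String, Object> bid = new HashMap<>();
            bid.put("price", 1.0820d);
            bid.put("size", 1_000_000L);
            record.put("bid", bid);

            // A sub-map may NOT contain another sub-map - that would exceed depth 2.
            System.out.println(record);
        }
    }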

Our keep-it-simple approach to records means

  • they are flexible enough to model objects of up to 2 dimensions (like a spreadsheet or database table)
  • you are not faced with a varied and confusing choice of types (e.g. a whole number can only be represented by a LongValue)
  • the model is restrictive enough to force a considered and rational approach to data modelling

Real-Time Approach

We put a lot of effort into ensuring that ClearConnect is fast, since being real-time is a key feature. On ClearConnect, records are published using a threading model that is optimised for speed and scales automatically with an increasing number of CPU cores. We also developed an optimised algorithm for transferring records which uses the 'image on join, followed by deltas' approach that I have already discussed. Finally, we provide a choice of codecs for encoding records before transfer and decoding them at the other end.
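
As an illustration of what a pluggable codec means in practice, the sketch below shows a minimal encode/decode contract; the names are hypothetical and ClearConnect's actual codec API may differ:

    import java.util.Map;

    // Hypothetical codec contract (illustration only): turn a record change into bytes
    // for the wire, and turn received bytes back into a record change.
    interface RecordCodec
    {
        byte[] encode(Map<String, Object> recordChange);
        Map<String, Object> decode(byte[] wireData);
    }

A text-based codec is easy to inspect on the wire; a binary codec typically trades that readability for smaller messages and faster encoding, which is exactly the kind of choice a pluggable codec leaves open.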

These features result in a very efficient and fast way of sharing data while presenting options for scaling and/or optimising further based on the kind of data that is prevalent in your system. To ensure that the performance of ClearConnect does not degrade as we enhance it and add features, we have a network throughput test which is used to benchmark our releases against each other. We also have other performance metrics that show a latency of 9 microseconds for an average message size of 134 bytes on an Intel Core i5 machine. These are discussed in more detail on our code wiki.

High Availability

High availability is another key feature for a real-time multi-process app: the framework has to be resilient to one or more processes becoming temporarily or permanently unavailable. In ClearConnect a service can run in one of two redundancy modes: fault tolerant and load balanced. Both rely on multiple instances (processes) being started for a single service.

In fault tolerant mode, one instance is the primary, active one and the others are passive, warm stand-bys. If the active one becomes unavailable, one of the warm stand-bys takes over and all clients re-sync with it. No records are lost.

In load balanced mode, each instance actively participates in record transfer, so the load is shared across instances. Service instances can be removed or new ones added at runtime without causing any disruption. This means you can scale your real-time multi-process app without stopping it, thereby delivering an uninterrupted service.

In ClearConnect, redundancy modes are possible because we have a single service (the registry) that coordinates connections between services and clients. The registry itself runs in fault tolerant mode, and the services/clients of a running system only depend on it while they establish their connections. This means that the registry can be stopped completely without any effect on existing components. However, when no registry service is available, new connections cannot be established.
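
The connection flow is roughly: a client asks the registry where an instance of a named service lives, connects to that instance directly, and from then on no longer needs the registry unless the connection has to be re-established. A hypothetical sketch of this flow (all names invented for illustration, not ClearConnect's API):

    // Illustrative registry-mediated connection set-up.
    final class ServiceEndpoint
    {
        final String host;
        final int port;
        ServiceEndpoint(String host, int port) { this.host = host; this.port = port; }
    }

    interface Registry
    {
        // Resolve the active instance of a named service (fault tolerant mode)
        // or one of the active instances (load balanced mode).
        ServiceEndpoint resolve(String serviceName);
    }

    final class Client
    {
        void connect(Registry registry, String serviceName)
        {
            ServiceEndpoint endpoint = registry.resolve(serviceName);
            // ...open a direct point-to-point channel to 'endpoint'; from here on the
            // registry is only needed again if this connection has to be re-established.
        }
    }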

Operational Use

From an operational point of view, development and support teams will need to be able to quickly see and diagnose issues. For this reason we supply appropriate tooling so that ClearConnect is completely transparent.

Our PlatformDesktop UI is fully featured and can show services, RPCs, clients, connections and records. It is used to check the health of the system as a whole and allows you to drill down into problem areas. These can be analysed in more depth by examining the logs for the problematic service or client.

Our logging has been optimised for speed and is asynchronous, so it will not block normal operation. ClearConnect logs have essential-only entries; by this I mean that the logs contain all the information needed to diagnose issues, but nothing more. We meticulously went through our log statements to ensure there is no redundant or distracting information.

Adoption

Greenfield IT projects are rare. Most projects are about enhancing an existing system, so we wanted to make sure that ClearConnect can be easily adopted. We have used the following approach for this:

  • No dependency: the platform just needs Java to run, so you cannot get into a situation where you have conflicting dependencies. You also don't need to install anything other than the platform itself.
  • Convention over configuration: to start a platform service you don’t have to configure anything because default values are used that can be overridden if required.
  • Discoverability: services on the platform advertise themselves and their RPCs. The code provides appropriate callbacks for services and their RPCs coming on-line, so it is easy to discover them (as sketched below).
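
As an illustration of the discoverability point above, a callback for services coming on-line and off-line might have a shape like the following; the names are hypothetical, not the actual ClearConnect listener interfaces:

    // Illustrative discovery callback; names are invented for this sketch.
    interface ServiceAvailableListener
    {
        void onServiceAvailable(String serviceName);    // a service instance has come on-line
        void onServiceUnavailable(String serviceName);  // the last instance has gone off-line
    }

Registering such a listener means a client reacts when, say, a pricing service appears, rather than being configured up-front with where and when that service will start.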

The Overall Solution

From a developer's point of view, the solution that the Fimtra platform offers is quick to start using and easy to pick up. Many of the difficult-to-solve problems are taken care of, so designing and implementing apps becomes easier. You can focus your efforts on the application's core business logic, knowing that data transfer and service management are efficient and functionally rich.

In an operational environment ClearConnect is very fast and transparent, with tooling that makes it possible to detect early signs of a problem. ClearConnect services are resilient to restarts, automatically and seamlessly reconnecting and resuming data exchange. There are also various options for scaling your apps: use hardware with more cores, add more load balanced nodes, use a tailored codec or use a faster transport technology.

A Framework For Real-Time Multi-Process Apps

Now that I have explained what I mean by real-time multi-process apps, I will go into some detail of what a framework needs to provide to develop them.

To coordinate multiple processes, a framework has to

  • use a communication protocol (data transport technology) because the processes may exist on separate hosts
  • enable data to flow between the processes
  • provide support for invoking behaviour on the processes

Data Transport Technology

The choice of data transport technology is important. Two core requirements that need to be satisfied are

  • it has to be fast because the aim is to be as close to real-time as possible
  • it has to be resilient because data cannot be lost

If you want data to be transferred quickly, a good starting point is to ensure that you only send the data that is absolutely necessary, so it is wise to choose a transport that minimises the use of metadata/headers. This means the choice becomes restricted to TCP/IP, UDP or any of the layers below them.

In industry I have never seen the lower layers used directly in an application; they are simply too low level. However, I have seen TCP/IP and UDP used, sometimes even together. I don't think the choice of either one is obvious, but in the interest of reducing complexity I will discount a hybrid solution for now. Comparing the two does help rationalise the choice:

                                     TCP/IP                                      UDP
Methodology                          Unicast                                     Multicast
Throughput (in a perfect network)    Very fast                                   Very fast (theoretically faster than TCP/IP)
Data integrity                       Guaranteed due to packet acknowledgement.   Lossy. Retransmission logic needs to be implemented.
Network stress                       Contained                                   Potentially intense

TCP/IP satisfies the requirement of guaranteed data at the expense of throughput (in a perfect network). It is also unicast (point to point) which means that the stress it puts on the network is contained. However, in situations where you do want to broadcast data, the logic has to be implemented.

UDP, on the other hand, is faster because no packet acknowledgement takes place, but it makes it difficult to target data at a single consumer. Retransmission logic also has to be coded to ensure data integrity, and this is not easy: if you stick to UDP, retransmitting data means that even consumers that haven't missed packets receive them again. These may simply ignore duplicate packets, but they still need the logic to do so. A single slow or faulty consumer that continuously requests retransmissions can put the network and applications under intense stress because they get flooded with data, all of which needs to be processed. Processing a lot of data ties up resources, which can cause more retransmission requests, exacerbating the problem. Ultimately this becomes an ever-increasing spiral that cannot be sustained. In a stressed network, UDP may be a contributing factor and will most likely perform more slowly than TCP/IP. I have seen major outages in well-managed, high-end networks happen because of this problem.

Data Flow Between Processes

For information to be exchanged between processes, the framework needs to support publishing and consumption of data. Doing this in a systematic way means a single implementation can be used, and putting some rules around the data simplifies the problem. I'll call data that follows these rules a record (a minimal sketch follows the list):

  • a record has a unique ID and version
  • a record is homogenised into a standard structure (e.g. a set of key-value pairs)
  • a record is immutable
  • records with the same ID but different version will contain different data
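
As mentioned above, here is a minimal sketch of what these rules look like in code; it is illustrative only, not a class from any framework:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    // Minimal illustration of the record rules: unique ID, a version,
    // a homogenised key-value structure, and immutability.
    final class Record
    {
        final String id;
        final long version;
        final Map<String, Object> data;

        Record(String id, long version, Map<String, Object> data)
        {
            this.id = id;
            this.version = version;
            // Defensive copy + unmodifiable view keeps the record immutable once built.
            this.data = Collections.unmodifiableMap(new HashMap<>(data));
        }

        // A new version of the same record is a NEW object, never a mutation of this one.
        Record withChanges(Map<String, Object> changes)
        {
            Map<String, Object> merged = new HashMap<>(this.data);
            merged.putAll(changes);
            return new Record(this.id, this.version + 1, merged);
        }
    }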

Records defined this way bring order to a potentially chaotic landscape. They also mean that optimisations can be put in place to reduce transfer time, keeping the data real-time, and to stop the network becoming flooded.

The aim is to design consumers of data (clients) so that they receive only the minimum amount of data they need. This is possible when a client receives all the key-value pairs of a record when it requests (or joins) it, and then only the changes (or deltas) on subsequent updates. I call this approach 'image on join, followed by deltas'. It requires the client to keep a local copy of a record and then merge changes into it for each update. With this approach it is also a good idea to detect out-of-sequence or missed updates so that a client can resynchronise the data if it needs to.
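
A minimal sketch of the client-side bookkeeping this approach implies is shown below; it is illustrative only, and a real implementation would also need to handle threading and error cases:

    import java.util.HashMap;
    import java.util.Map;

    // Client-side sketch of "image on join, followed by deltas".
    final class RecordSubscription
    {
        private final Map<String, Object> localImage = new HashMap<>();
        private long lastVersion = -1;

        // First message after joining: the full image of the record.
        void onImage(long version, Map<String, Object> image)
        {
            localImage.clear();
            localImage.putAll(image);
            lastVersion = version;
        }

        // Subsequent messages: only the changed key-value pairs.
        void onDelta(long version, Map<String, Object> changedEntries)
        {
            if (version != lastVersion + 1)
            {
                resync(); // missed or out-of-sequence update: ask for a fresh image
                return;
            }
            localImage.putAll(changedEntries);
            lastVersion = version;
        }

        void resync()
        {
            // e.g. re-join the record so the service sends the full image again
        }
    }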

Controlling Behaviour Using RPCs

In the above section you may have noticed the idea of clients performing actions by requesting or resynchronising data; this implies controlling the behaviour of a record publisher (service) from the client end, i.e. remotely. This is made possible with remote procedure calls (RPCs). When a client requests a record it invokes an RPC on the service, asking for the record with a given ID and, usually, for any subsequent updates.
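
In code terms, the join can be thought of as an RPC surface the service exposes; the signatures below are hypothetical and purely for illustration:

    // Hypothetical RPC surface a record publisher might expose to clients.
    interface RecordPublisherRpc
    {
        // "Send me the current image of this record, then every subsequent delta."
        void subscribe(String recordId);

        // "Stop sending me updates for this record."
        void unsubscribe(String recordId);

        // "I think I have missed updates - send me a fresh image."
        void resync(String recordId, long lastSeenVersion);
    }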

Summary And Conclusions

The above sections describe the minimum requirements of a framework for real-time multi-process applications. Due to their complex nature, services and clients are loosely coupled and rely on heartbeats and timeouts to infer the state of the various processes; interactions between them are also asynchronous. A framework should seek to conceal this complexity with a sensible API. This is only possible by applying rules to data, service behaviour and client behaviour.

In ClearConnect, records, services and clients have been defined and you are provided with a complete toolbox for creating your own. The API has a prevalence of callback methods, which are necessary to support the asynchronous nature. We also ensured that the implementation works out of the box with minimal configuration and added utility methods that give a synchronous feel to many operations. Our higher-order API makes use of our own utilities to really simplify interactions between services and clients so that you can focus on your application's business logic. Finally, we ensured that there is complete visibility of your real-time multi-process apps with essential-only logging and tooling that allows visualisation of all records, services and clients.
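
The "synchronous feel" mentioned above follows a common Java pattern: block the caller on a latch until the relevant callback fires or a timeout expires. The sketch below is a generic illustration of that pattern, not ClearConnect's actual utility:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    // Generic illustration of wrapping an asynchronous callback in a synchronous call.
    final class SyncWait
    {
        private final CountDownLatch latch = new CountDownLatch(1);

        // Invoked by a framework callback when, say, a service becomes available.
        void onEvent()
        {
            latch.countDown();
        }

        // Gives the caller a synchronous feel: wait for the callback, up to a timeout.
        boolean await(long timeoutMillis) throws InterruptedException
        {
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        }
    }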

Real-Time Multi-Process Applications

In software development the environment is constantly changing: programming languages are regularly released with improved APIs and more powerful frameworks, standards are updated to support innovations in hardware, cheap computation and storage are now widely available on all kinds of devices, and connectivity has become inherently supported. It is this that catalyses the evolution of how apps are written. One important part of this evolution relates to computational concurrency, i.e. the ability to execute logic at the same time. Hardware has supported this for a long time, and it is what has allowed applications to become multi-threaded, where each thread within a process can execute logic concurrently.

As hardware and programming languages have improved, the support for writing multi-threaded apps has also become more widely available. The real driving force behind this is the insatiable appetite for fast applications with rich functionality that can run not only on computers but also on an ever-increasing range of devices. These kinds of apps are more than just multi-threaded, though; they usually depend on multiple processes working together in a coordinated fashion.

Take a website as an example: it is hosted on a server, which has its own process with multiple threads, and it is viewed in a web browser, which also has its own process with multiple threads. In order to deliver a functional website the two processes have to be coordinated, and this is done with the glue between them: HTTP. The glue is sophisticated; it does more than enable process coordination. It is layered on top of TCP, so there is also a remote aspect at play where the processes are running on separate hosts.

A website and browser have very clear and, importantly, distinct responsibilities: one serves data and the other renders and interprets it. But what happens when you have coordinated processes that can run remotely and also have similar responsibilities? You can then capitalise on multiple threads to achieve performance while at the same time scaling across multiple processes and hosts. This is what is needed to really scale up in an industrial way, and the result is a technology that makes clustering, cloud-style redundancy, efficient use of big data and grid computing a reality. These are what I have come to think of as real-time multi-process apps.

When it comes to coding, there is a big difference between writing an app that runs in a single thread and one that uses multiple threads. The complexity of multiple threads interacting with each other forces a disciplined approach underpinned by an appropriate threading model. In short, it is hard to develop a multi-threaded app, and if you don't get it right, it is hard to identify and fix defects. The complexity, and therefore the difficulty, increases by orders of magnitude with real-time multi-process apps; these require a carefully designed framework that shields you from the complexities of managing threads and processes so that you can focus on implementing business logic.


This is what we strove to do when we wrote ClearConnect, our real-time multi-process application framework. It manages interactions of multi-threaded processes generically, which is complex and difficult, so we stuck to core concepts to help us rationalise this difficulty. For example, we standardised the data flowing between processes, we made components work with no complex configuration (convention over configuration + discoverability) and we made everything ClearConnect does transparent (comprehensive logs + tooling). A more complete description of our concepts can be found in the Concepts and Programming Manual. The result is a framework with a complete yet clear API that works with nothing except a Java-enabled computer.

Bearing in mind the complex nature of multiple threads and processes, ClearConnect is relatively easy to understand and it can be supported with some basic knowledge. We have tested and compared it so we know it is fast at runtime, and we have used it to develop business applications so we know it is quick and easy to use.

In my next post I will go into some detail of how real-time multi-process apps work, based on my experiences in industry and with ClearConnect.