Single Page Applications - Fault Isolation and Recovery
In my previous article, Browser Scalability, I addressed how a stateful browser using a stateless strategy can scale web applications and add the dimension of fault isolation. Included was a discussion of how the browser application and user experience would be unaffected if the location of the server-side APIs were to change due to server load or an outage. In that article I briefly mentioned web sockets but did not go into detail on how to ensure continuity in that area of the application architecture. This is a high-level article on how to solve that issue.
Single Page Applications (SPA) will always request and receive data via HyperText Transfer Protocol (HTTP). A request to the server contains a verb and optionally a body of data. The responsibility of the server is to respond to that request with a status indicating the success of the request and optionally a body of data. An example would be the SPA making a “GET” request for account information. Once the server has received the request, it retrieves the data from its database and sends it to the browser with a status of OK (Response Code: 200). Each request is independent, and if the SPA is maintaining application state, any capable server will be able to process that request.
HTTP requests are unidirectional, meaning the browser must initiate the request and wait for a reply. Some features in modern web applications require bidirectional communication, or a conversation, with the server. This allows the web application to contact the server and the server to contact the web application as needed. This type of communication is implemented via Web Sockets. Web Sockets were introduced in 2010/2011, specifically for the purpose of providing stateful, bidirectional communication with browser applications. When a web socket connection is established, we enter a state of server affinity where we are bound to a specific server for the duration of that connection.
In a web application that contains a real-time chat feature, the back-and-forth conversation takes place over a web socket connection. If the web application needs to connect to a different server due to an unplanned event, the web socket connection will be lost, and the chat session will be interrupted. When this happens, the experience of the user is compromised, and re-establishing that connection can prove technically difficult. In addition to chat functionality, other expected features such as notifications will also be compromised by the loss of the web socket connection. Without the proper architecture, data will be lost in a fail-over event.
Now for some good news. When a web socket connection is required, it is opened by the web application (the SPA). That web application may also close and reopen that connection as needed. As part of the connection open logic on the server side of the application, a session is created which allows the server to interact with the web application. The session is identified by a unique identifier, as there can be many sessions active at one time. Back to the chat application: if we can retain the contents on each side of the conversation and maintain a status for each post, we will have a durable context that is independent of the web socket connection. In effect, if the web socket connection is lost from one server and then established on another server, the conversation can be caught up. Granted, there might be a slight delay as this happens, but nonetheless everything will be intact.
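To make the idea of retaining posts with a status a little more concrete, here is a minimal sketch of the browser side of the chat. The names ChatOutbox, ChatPost and PostStatus are my own inventions for illustration, and the two statuses are an assumption; a real chat feature would likely track more.

```typescript
// Hypothetical names for illustration only; not part of any library.
type PostStatus = "pending" | "delivered";

interface ChatPost {
  id: number;
  text: string;
  status: PostStatus;
}

class ChatOutbox {
  private readonly posts: ChatPost[] = [];
  private nextId = 1;

  // Retain every post locally before it is sent over the web socket.
  add(text: string): ChatPost {
    const post: ChatPost = { id: this.nextId++, text, status: "pending" };
    this.posts.push(post);
    return post;
  }

  // Mark a post as delivered once the server confirms it.
  acknowledge(id: number): void {
    const post = this.posts.find((p) => p.id === id);
    if (post) post.status = "delivered";
  }

  // Posts still pending are the ones to resend after a reconnect.
  pending(): ChatPost[] {
    return this.posts.filter((p) => p.status === "pending");
  }
}
```

Because the outbox lives in the SPA rather than in the connection, whatever pending() returns can be sent again once a web socket is opened on another server.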
In another example, we will consider an equities trading application. This SPA allows traders to view real-time market price information. When this information is presented to the user in a browser SPA experience, a common method to deliver it is via web sockets. As it is critical that the trader has up-to-date information, a disconnect from the server will be disadvantageous and may result in the trader losing money. In this scenario, the server is pushing real-time quote data via a web socket with an expected maximum time between updates of 5 seconds. Should that time be exceeded, the SPA will contact the server via an HTTP request to determine if everything is okay. If the SPA does not receive a reply indicating the server is in good health, the SPA will begin to contact other servers to resume service.
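The detection step could be sketched along these lines. This is only an illustration under my own assumptions: QuoteWatchdog, decideNextStep and HealthCheck are hypothetical names, the clock is passed in as a number of milliseconds so the logic is easy to exercise, and the health check is whatever HTTP call your application uses.

```typescript
// Hypothetical names; the 5-second limit is the one used in the scenario above.
const MAX_QUIET_MS = 5000;

class QuoteWatchdog {
  private lastUpdateMs: number;
  private readonly maxQuietMs: number;

  constructor(startMs: number, maxQuietMs: number = MAX_QUIET_MS) {
    this.lastUpdateMs = startMs;
    this.maxQuietMs = maxQuietMs;
  }

  // Call this whenever a quote arrives over the web socket.
  recordUpdate(nowMs: number): void {
    this.lastUpdateMs = nowMs;
  }

  // True once the gap since the last update exceeds the limit.
  isOverdue(nowMs: number): boolean {
    return nowMs - this.lastUpdateMs > this.maxQuietMs;
  }
}

// Assumed shape of the HTTP health check: resolves to true when the server is healthy.
type HealthCheck = () => Promise<boolean>;

// Run on a timer tick: stay put, or start contacting other servers.
async function decideNextStep(
  watchdog: QuoteWatchdog,
  nowMs: number,
  checkHealth: HealthCheck,
): Promise<"ok" | "failover"> {
  if (!watchdog.isOverdue(nowMs)) return "ok";
  try {
    return (await checkHealth()) ? "ok" : "failover";
  } catch {
    // No reply at all is treated the same as an unhealthy reply.
    return "failover";
  }
}
```

In the browser, checkHealth would typically wrap a fetch call to a health endpoint, and decideNextStep would be driven by a timer.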
To accomplish this failover feat, the SPA must receive, during initialization, a list of servers which can support it. This list will be stored in the local memory of the SPA for future reference. Once it is detected that the current server is unavailable, the SPA will contact the next server in the list. If the problem arises again, the next server will be selected, and so on until service is established. Although it is outside the scope of this example, the SPA would also send messages to alert the operations team that a problem exists.
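Walking the server list might look something like the sketch below. ServerList is a hypothetical name, the probe is passed in as a function so the example stays independent of any network code, and I have chosen to stop after trying every other server once, where a real SPA might keep cycling until service is established.

```typescript
// Hypothetical helper; server addresses are supplied by the caller.
class ServerList {
  private index = 0;
  private readonly servers: string[];

  constructor(servers: string[]) {
    if (servers.length === 0) throw new Error("server list must not be empty");
    this.servers = [...servers];
  }

  // The server the SPA is currently bound to.
  current(): string {
    return this.servers[this.index];
  }

  // Move to the next server, wrapping around at the end of the list.
  advance(): string {
    this.index = (this.index + 1) % this.servers.length;
    return this.current();
  }

  // Try each of the other servers once; return the first reachable one, or null.
  failover(isReachable: (server: string) => boolean): string | null {
    for (let tried = 1; tried < this.servers.length; tried++) {
      const candidate = this.advance();
      if (isReachable(candidate)) return candidate;
    }
    return null;
  }
}
```

A null result is the point at which alerting the operations team becomes the only remaining option.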
When the current server becomes unavailable and the SPA is establishing a connection to the new server, market quote information is not being delivered. This information is critical, and it is the responsibility of the SPA and server application to recover and deliver the missing updates to the SPA. This may be accomplished using a server-side memory cache which is replicated to all servers which support our SPA. This cache will remember market quote information sent to each instance of the SPA application for a short amount of time, say 5 minutes. Each entry in the memory cache will have a sequence number and an identifier for the user who is receiving price quote data.
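A single-server view of such a cache could be sketched as follows. QuoteCache and CachedQuote are hypothetical names, the quote payload is simplified to a string, and replication between servers is deliberately left out; in practice that part would usually be handled by a shared or distributed cache product.

```typescript
// Hypothetical in-memory cache; replication across servers is not shown.
interface CachedQuote {
  user: string;
  sequence: number;
  payload: string;
  sentAtMs: number;
}

class QuoteCache {
  private readonly entries = new Map<string, CachedQuote[]>();
  private readonly lastSequence = new Map<string, number>();
  private readonly ttlMs: number;

  // Default retention of 5 minutes, as suggested in the text above.
  constructor(ttlMs: number = 5 * 60 * 1000) {
    this.ttlMs = ttlMs;
  }

  // Remember a quote sent to a user and return its sequence number.
  record(user: string, payload: string, nowMs: number): number {
    const sequence = (this.lastSequence.get(user) ?? 0) + 1;
    this.lastSequence.set(user, sequence);
    const kept = this.fresh(this.entries.get(user) ?? [], nowMs);
    kept.push({ user, sequence, payload, sentAtMs: nowMs });
    this.entries.set(user, kept);
    return sequence;
  }

  // Quotes the user has not seen yet, oldest first, ignoring expired entries.
  missedSince(user: string, lastSeen: number, nowMs: number): CachedQuote[] {
    const kept = this.fresh(this.entries.get(user) ?? [], nowMs);
    return kept.filter((quote) => quote.sequence > lastSeen);
  }

  // Drop entries older than the retention window.
  private fresh(list: CachedQuote[], nowMs: number): CachedQuote[] {
    return list.filter((quote) => nowMs - quote.sentAtMs <= this.ttlMs);
  }
}
```

Sequence numbers are kept per user and never reset when old entries expire, so the SPA can always compare the last number it saw against what the cache still holds.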
Each quote message received by the SPA will contain the sequence number, and the SPA will remember the last sequence number received. When the SPA is disconnected from its current server and establishes a new web socket connection with the next server in the availability list, it will send an ‘I was disconnected’ message with the username and last sequence number received. When the server receives that message, it will access its memory cache and send the missing price quote messages in sequence. Once these have been sent, normal updates over the web socket connection will resume.
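On the SPA side, the bookkeeping for this exchange can be quite small. The sketch below uses hypothetical names (QuoteReceiver, QuoteMessage, ResumeRequest) and an assumed message shape for the ‘I was disconnected’ message; the actual wire format would be whatever your server expects.

```typescript
// Hypothetical message shapes; the real wire format is up to the application.
interface QuoteMessage {
  sequence: number;
  payload: string;
}

interface ResumeRequest {
  type: "resume";
  user: string;
  lastSequence: number;
}

class QuoteReceiver {
  readonly applied: string[] = [];
  private lastSequence = 0;
  private readonly user: string;

  constructor(user: string) {
    this.user = user;
  }

  // Apply a quote and remember its sequence; duplicates and older messages are ignored.
  receive(message: QuoteMessage): boolean {
    if (message.sequence <= this.lastSequence) return false;
    this.lastSequence = message.sequence;
    this.applied.push(message.payload);
    return true;
  }

  // Build the 'I was disconnected' message to send on the new connection.
  buildResumeRequest(): ResumeRequest {
    return { type: "resume", user: this.user, lastSequence: this.lastSequence };
  }
}
```

After reconnecting, the SPA would send JSON.stringify(receiver.buildResumeRequest()) over the new web socket and pass every replayed message through receive, so a quote delivered twice during the handover is applied only once.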
A similar strategy may be implemented within the SPA for the case when the SPA and server application are engaged in a synchronous conversation. The playback of missing messages is a little more complex, as each side will need to make sure the messages are exchanged in order. From this example, you now understand how to fault isolate a stateful web socket connection. My post on fault isolating and managing state in your SPA provides additional guidance on how to manage other aspects of a SPA to ensure the user experience is not affected during a server-side outage or network disconnect.
With these principles in your core web application architecture along with other SPA fault isolation strategies, you will be well on your way to creating durable web applications which will provide your users with an experience of “it always works”.
For more information on Fault Isolation, have a look at this blog post – Architectural Principles – Fault Isolation and Swim Lanes.