Design To Be Monitored is one of those AKF architectural principles that resonate with many product engineering teams. After all, high availability, reliability, and visibility are a must for a firm to remain competitive in its markets.

The best applications and systems are designed from the ground up with this principle in mind: when implemented and architected well, they allow for self-diagnosis and potentially automated self-healing. Services must emit events and logs appropriately so that the system is designed for operations and a deep level of observability. This spans availability, performance, usage auditing, and security. Done right, teams gain a holistic view of the user. This can be very powerful, allowing teams to identify features that are worth further investment or divestment as well as user flow patterns that may require further iteration. Said another way, we empower product teams with varying goals to gain insight into the state of their distributed solutions and iterate more quickly.

Many engineering teams rely on systems that aggregate measurements from stale, aging data. Such an approach doesn’t allow teams to identify and correct issues quickly. Instead, we need deep traceability and the ability to log a request’s path through the system to pinpoint bottlenecks, performance issues, and failure points.
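
One common way to get that traceability, shown in the minimal sketch below, is to attach a correlation ID to each inbound request and pass it along on every downstream call so the request’s path can be reassembled from each service’s logs. The header name, service name, and the commented-out downstream helper are assumptions for illustration, not a prescribed implementation.

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout-service")  # hypothetical service name

def handle_request(headers: dict) -> dict:
    # Reuse the caller's correlation ID if one was passed; otherwise mint one.
    correlation_id = headers.get("X-Correlation-ID", str(uuid.uuid4()))
    log.info("correlation_id=%s event=request_received", correlation_id)

    # Forward the same ID on every downstream call so the full path through
    # the system can be reconstructed from the logs of each service.
    downstream_headers = {"X-Correlation-ID": correlation_id}
    # call_inventory_service(downstream_headers)  # hypothetical downstream call

    log.info("correlation_id=%s event=request_completed", correlation_id)
    return downstream_headers

if __name__ == "__main__":
    handle_request({})  # simulate an inbound request with no existing ID
```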

Monitoring should be designed while the application and system are designed, not after. Poor user experiences harm both the product’s reputation and the company’s reputation, and an ongoing poor experience will result in lost customers. We argue Design To Be Monitored should be interpreted as “If it’s not Observable, it can’t be Released.” 0 Observability = 0 Release

What Does Observability Mean?

Modern distributed systems are complex by nature. Far too many teams fail to fit their systems with the monitoring they need until it is too late. Their systems become poorly monitored black boxes, resulting in late and incomplete fault detection along with minimal data to inform future architectural enhancements. A single page application that interacts with multiple microservices, along with potentially several caching and persistence tiers built on varying technologies and cloud services, creates a level of complexity that demands strong observability.

Said another way, we must instrument our applications to produce monitoring data, log data, and tracing data so that we can interpret what’s being observed through alerting and graphing tools.

Good observability allows us to understand several key indicators and correlations (a small sketch computing a few of these follows the list):

  • The response rates for user requests
  • The number of concurrent user requests
  • The number of business transactions (a key business metric)
  • The volume of traffic
  • The rates at which transactions are being completed
  • Processing times for requests
  • The number of concurrent users versus request latency times
  • The number of concurrent users versus the average response times
  • The volume of requests versus the number of errors
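
To make the list concrete, here is a small, purely illustrative sketch that derives a few of these indicators (latency percentiles, error rate, and the concurrent-users-versus-latency correlation) from raw request records. The sample data is invented.

```python
from statistics import mean, quantiles

# Hypothetical raw request records: (concurrent_users, latency_ms, is_error)
samples = [(120, 85, False), (150, 110, False), (300, 240, True),
           (310, 260, False), (450, 400, True), (460, 390, False)]

users = [u for u, _, _ in samples]
latencies = [l for _, l, _ in samples]
error_rate = sum(1 for _, _, e in samples if e) / len(samples)

print(f"mean latency: {mean(latencies):.0f} ms")
print(f"approx p95 latency: {quantiles(latencies, n=20)[-1]:.0f} ms")
print(f"error rate: {error_rate:.1%}")

# Pearson correlation of concurrent users vs. latency: a value near 1.0 says
# latency climbs with load, one of the correlations called out above.
mu, ml = mean(users), mean(latencies)
cov = sum((u - mu) * (l - ml) for u, l in zip(users, latencies))
corr = cov / (sum((u - mu) ** 2 for u in users) ** 0.5
              * sum((l - ml) ** 2 for l in latencies) ** 0.5)
print(f"users vs. latency correlation: {corr:.2f}")
```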

Good observability also allows us to quickly coordinate incident response on the above (a sketch of such automation follows the list) by:

  • Scaling up virtual resources such as AWS EC2 instance types
  • Scaling out container nodes for more capacity
  • Backing out recent code deployments or turning off feature flags
  • Restarting unhealthy nodes
  • Alerting the right person at the right time
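
As a rough illustration of how those responses can be wired to the metrics, the sketch below maps threshold breaches to remediation steps. Every threshold, field name, and action string here is an assumption for illustration, not a production runbook.

```python
def incident_actions(metrics: dict) -> list[str]:
    """Return the remediation steps implied by the current metric snapshot."""
    actions = []
    if metrics["cpu_utilization"] > 0.85:
        actions.append("scale out: add container nodes for more capacity")
    if metrics["error_rate"] > 0.05 and metrics["minutes_since_deploy"] < 30:
        actions.append("back out: roll back the deployment or turn off the feature flag")
    for node in metrics["unhealthy_nodes"]:
        actions.append(f"restart unhealthy node: {node}")
    if actions:
        actions.append("alert the on-call engineer with the context above")
    return actions

print(incident_actions({
    "cpu_utilization": 0.92,
    "error_rate": 0.02,
    "minutes_since_deploy": 45,
    "unhealthy_nodes": ["node-7"],
}))
```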

Monitoring

To monitor, we collect ambient performance information such as disk utilization (I/O), memory utilization, CPU utilization, request queue length, bytes written and read, and network activity. We must also collect system data such as versions, states, services, processes, and resource consumption. Business metrics are also critical to monitor, as we discuss in Monitoring for Early Fault Detection. Third party services used by the system MUST also be monitored. If we collect all of the above well, we can profile the behavior of each application component and subcomponent and generate a snapshot of the system’s current health to verify that all components are functioning as expected. Any deviation from the norm allows an engineer to quickly spot components that are experiencing problems. Further, comprehensive monitoring allows us to measure availability through transactional metrics aligned with business outcomes.
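
A minimal collection sketch, assuming the third-party psutil package is available; in a real system these samples would be shipped to a time-series store and enriched with the system and business metrics mentioned above.

```python
import time

import psutil  # third-party: pip install psutil

def sample_host_metrics() -> dict:
    """Sample the ambient resource metrics named above for one host."""
    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    return {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=1),      # CPU utilization
        "memory_percent": psutil.virtual_memory().percent,  # memory utilization
        "disk_read_bytes": disk.read_bytes,                 # bytes read
        "disk_write_bytes": disk.write_bytes,                # bytes written
        "net_bytes_sent": net.bytes_sent,                   # network activity
        "net_bytes_recv": net.bytes_recv,
    }

if __name__ == "__main__":
    print(sample_host_metrics())
```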

Logging

Logging events is critical to correlating what customers were doing, and the demand they were placing on the application, at the time an anomaly is detected. An agent or a small service known as a collector on each system is responsible for reading each event, formatting it so it can be easily parsed, and sending it to an external platform outside of the system’s hosting environment. We should also monitor each collector to make sure it’s not consuming too many resources and that it remains operational.
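
A simplified sketch of such a collector follows, assuming a hypothetical HTTPS ingestion endpoint and an illustrative log path; real collectors also batch, retry, and apply back-pressure.

```python
import json
import time
import urllib.request

LOG_PLATFORM_URL = "https://logs.example.com/ingest"  # placeholder endpoint

def tail(path: str):
    """Yield lines appended to a log file, roughly like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(0.5)

def forward_event(raw_line: str) -> None:
    """Normalize one event to JSON and ship it off the host."""
    event = {"ts": time.time(), "host": "web-01", "message": raw_line.strip()}
    req = urllib.request.Request(
        LOG_PLATFORM_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

# for line in tail("/var/log/app/events.log"):  # illustrative path
#     forward_event(line)
```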

Tracing

Tracing is crucial because it allows us to see correlations across components in a distributed system. With tracing we can see metrics that show service dependencies, latency between components, and abnormal events that may occur between them. To trace, we must instrument our code to capture tracing data from the front-end components to the back-end components. Alternatively, managed service meshes such as Traefik Mesh, AWS App Mesh, Istio, and NGINX Service Mesh can automatically deploy proxies between endpoints to track requests and collect traces, giving us end-to-end visibility.
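
For code-level instrumentation, the OpenTelemetry SDK is one widely used option. The sketch below (assuming `pip install opentelemetry-sdk`) exports spans to the console for simplicity; a real deployment would export to a tracing backend, and the service and span names are illustrative.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print spans to stdout; a real system would export to a tracing backend instead.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def place_order():
    # One parent span per request; child spans expose per-component latency.
    with tracer.start_as_current_span("place_order"):
        with tracer.start_as_current_span("inventory_lookup"):
            pass  # call the inventory service here
        with tracer.start_as_current_span("charge_payment"):
            pass  # call the payment provider here

place_order()
```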

Using Data To Determine Future Issues

If we design to be monitored comprehensively as described above, performance metrics can be analyzed to determine whether the system will need additional resources or architectural changes to scale. Using recent and current workloads, we can spot trends and predict whether the system is likely to perform acceptably and remain healthy. To do such an analysis we should look at the rate of requests each service or component is handling, the response times of those requests, and the volume of data flowing in and out of each service. If any of these metrics violate defined thresholds, we can self-heal (auto scale, restart services, apply throttling, etc.), and we can use the data and threshold violations to determine what architectural changes are needed.
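
As a toy illustration of that kind of forward-looking analysis (the figures and the capacity ceiling below are invented), a simple linear fit over recent peak request rates can project when the current architecture runs out of headroom.

```python
from statistics import linear_regression  # Python 3.10+

weeks = [1, 2, 3, 4, 5, 6]
peak_rps = [420, 460, 510, 540, 600, 650]  # observed weekly peak requests/sec
capacity_rps = 1000                         # assumed ceiling of current design

slope, intercept = linear_regression(weeks, peak_rps)
weeks_to_ceiling = (capacity_rps - intercept) / slope

print(f"peak load is growing ~{slope:.0f} req/s per week")
print(f"projected to reach capacity around week {weeks_to_ceiling:.1f}")
```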

When building and operating products that scale and are highly available, we must be able to observe, react quickly, and determine our architectural needs.

AKF has helped many clients with an approach to monitoring and scaling. Contact us. We would love to help.