Demystifying M.E.L.T. - the key data for business observability
Article by New Relic senior director of customer solution, APJ, Jill Macmurchy.
We’ve previously discussed the role of business observability in software development and the core components required to make it a reality. Observability involves gathering different types of data about all components within a system, to establish the "Why?" rather than just the "What went wrong?". The acronym M.E.L.T. is used to define four essential data types: metrics, events, logs and traces.
Metrics are the starting point for observability. They’re an aggregated set of measurements grouped or collected at regular intervals. Most share several traits: a timestamp, a name, one or more numeric values and a count of how many events are represented. Metric examples include error rate, response time, or throughput.
Metrics are typically a compact, cost-effective way to store a lot of data. They’re also dimensional, which allows quick analysis and makes them a good way to measure overall system health. Because of this, many tools have emerged to collect metrics, such as Prometheus and StatsD. However, metrics require decisions to be made ahead of time about how the data should be analysed. Some values, such as a percentile that was not recorded at collection time, can’t be calculated after the fact unless all raw sample events are available for analysis. To have complete observability, collecting and analysing metrics is a must.
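As a minimal sketch of the traits described above (a timestamp, a name, numeric values and a count), the following Python aggregates raw measurements for one interval into a single metric. The `Metric` class, the `aggregate` function and the sample values are hypothetical and only illustrate the idea; real collectors such as Prometheus or StatsD define their own data models.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metric:
    """Hypothetical aggregated metric for one collection interval."""
    name: str
    timestamp: int   # start of the interval, in Unix seconds
    count: int       # how many raw measurements are represented
    total: float
    minimum: float
    maximum: float


def aggregate(name: str, interval_start: int, samples: List[float]) -> Metric:
    """Collapse raw samples into one compact record."""
    return Metric(name, interval_start, len(samples),
                  sum(samples), min(samples), max(samples))


# Illustrative response times (ms) observed during one interval.
response_times_ms = [120.0, 80.0, 100.0, 900.0]
metric = aggregate("response_time_ms", 1700000000, response_times_ms)

# The average can still be derived from the aggregate...
average_ms = metric.total / metric.count
# ...but a median or 95th percentile cannot, because the individual
# samples were discarded when the metric was built.
print(metric.count, average_ms, metric.maximum)
```

The trade-off is visible in the last lines: four samples are stored as one small record, but only the statistics chosen ahead of time remain available.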
An event is a discrete action happening at any moment in time. Take a vending machine for instance. An event could be the moment when a user makes a purchase from the machine. There might be states that are derived from other events, such as a product becoming sold out after a purchase.
Events are a critical telemetry type for any observability solution. They’re valuable because they can be used to validate the occurrence of a particular action at a particular time and enable a fine-grained analysis in real time. However, events are often overlooked or confused with logs. What’s the difference? Events sit at a higher level of abstraction than logs. Logs record everything, whereas events are records of selected significant things.
Adding metadata to events makes them much more powerful. With the vending machine example, we could add attributes such as “ItemCategory” and “PaymentType”. This allows questions to be asked, such as "How much money was made from each item category?" or "What is the most common payment type used?".
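The two questions above can be sketched in a few lines of Python. This is an illustration only: the event dictionaries, the `Amount` attribute and all values are hypothetical, and a real observability platform would answer such questions with its own query language rather than in application code.

```python
from collections import Counter, defaultdict

# Hypothetical purchase events from the vending machine example.
# "ItemCategory" and "PaymentType" come from the text; "Amount" and
# the timestamps are assumptions added for illustration.
purchases = [
    {"timestamp": 1700000000, "ItemCategory": "Drinks", "PaymentType": "Card", "Amount": 2.5},
    {"timestamp": 1700000060, "ItemCategory": "Snacks", "PaymentType": "Cash", "Amount": 1.5},
    {"timestamp": 1700000120, "ItemCategory": "Drinks", "PaymentType": "Card", "Amount": 2.0},
    {"timestamp": 1700000180, "ItemCategory": "Snacks", "PaymentType": "Card", "Amount": 1.0},
]

# "How much money was made from each item category?"
revenue_by_category = defaultdict(float)
for event in purchases:
    revenue_by_category[event["ItemCategory"]] += event["Amount"]

# "What is the most common payment type used?"
payment_counts = Counter(event["PaymentType"] for event in purchases)
most_common_payment = payment_counts.most_common(1)[0][0]

print(dict(revenue_by_category))
print(most_common_payment)
```

Neither question could be answered from a bare "purchase happened" event; the attributes are what make the grouping possible.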
The limitation of events is that each one takes some computational effort to collect and process, and storing large volumes of them can take up a lot of space in a database. Because of this, it’s necessary to be selective about what kinds of events are stored.
Logs are the original data type. They’re important when engineers are in deep debugging mode, trying to understand a problem and troubleshoot code. Logs provide high-fidelity data and detailed context around an event, so engineers can recreate what happened millisecond by millisecond.
Logs are particularly valuable for troubleshooting things such as databases, caches and load balancers, as well as older proprietary systems that aren’t friendly to in-process instrumentation.
Log data is sometimes unstructured, which makes it hard to parse in a systematic way. When log data is structured, it’s much easier to search the data and derive events or metrics from it. There are tools that reduce the toil of collecting, filtering, and exporting logs, such as Fluentd, Fluent Bit, Logstash, and AWS CloudWatch.
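To illustrate why structure helps, the sketch below assumes logs are written as one JSON object per line, which is one common convention and not the only one. The field names (`level`, `service`, `message`) and the log lines themselves are hypothetical.

```python
import json

# Hypothetical structured log lines, one JSON object per line.
raw_lines = [
    '{"timestamp": "2024-01-01T10:00:00Z", "level": "INFO", "service": "checkout", "message": "payment accepted"}',
    '{"timestamp": "2024-01-01T10:00:01Z", "level": "ERROR", "service": "checkout", "message": "card declined"}',
    '{"timestamp": "2024-01-01T10:00:02Z", "level": "INFO", "service": "cart", "message": "item added"}',
    '{"timestamp": "2024-01-01T10:00:03Z", "level": "INFO", "service": "checkout", "message": "payment accepted"}',
]

# Because every line has the same shape, parsing is a single call
# instead of a hand-written regular expression per log format.
records = [json.loads(line) for line in raw_lines]

# Searching becomes a simple filter on named fields.
checkout_errors = [
    r for r in records
    if r["service"] == "checkout" and r["level"] == "ERROR"
]

# A metric (error rate) derived from the logs.
error_rate = sum(r["level"] == "ERROR" for r in records) / len(records)

print(checkout_errors[0]["message"], error_rate)
```

With unstructured text, the same search would depend on pattern matching against free-form messages, which tends to break whenever the wording changes.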
Traces are samples of causal chains of events, and trace data is needed to determine the relationships between different entities. Traces are very valuable for highlighting inefficiencies, bottlenecks and roadblocks in the customer journey, as they can be used to show the end-to-end latency of individual calls in a distributed architecture.
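A simplified way to picture this: a trace is a set of spans that share a trace ID, each recording its parent, its start and its end. The `Span` class, the service names and the timings below are hypothetical, and the self-time calculation assumes a span's children do not overlap in time, which is not true of parallel calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical, simplified span record."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]   # None marks the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


# One illustrative transaction passing through four services.
spans = [
    Span("t1", "a", None, "frontend", 0, 450),
    Span("t1", "b", "a", "cart-service", 10, 60),
    Span("t1", "c", "a", "payment-service", 70, 430),
    Span("t1", "d", "c", "bank-gateway", 90, 420),
]

# End-to-end latency is the duration of the root span.
root = next(s for s in spans if s.parent_id is None)
end_to_end_ms = root.duration_ms


def self_time_ms(span: Span) -> int:
    """Time spent in the span itself, excluding its direct children.
    Assumes the children run one after another (no overlap)."""
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


# The span with the largest self time is the likely bottleneck.
bottleneck = max(spans, key=self_time_ms)
print(end_to_end_ms, bottleneck.name, self_time_ms(bottleneck))
```

Here the parent links are what turn four unrelated timings into a causal chain, so the slow call can be attributed to one service rather than to the request as a whole.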
Applications often call multiple other applications depending on the task they’re trying to accomplish, and often process data in parallel. This means the call-chain can be inconsistent and have unreliable timing. Passing a trace context between each service is the only way to ensure a consistent call-chain, and uniquely identify a single transaction through the entire chain.
W3C Trace Context is set to become the standard for propagating trace context across process boundaries. It makes distributed tracing easier to implement and more reliable, and therefore more valuable for developers working with highly distributed applications. Recently, the specification reached "Recommendation" status.
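As a rough sketch of what propagation looks like, W3C Trace Context carries the context in a `traceparent` HTTP header made of four hyphen-separated, lowercase hex fields: version, trace ID, parent (span) ID and flags. The helper functions below are hypothetical and deliberately incomplete: they do not reject all-zero IDs or handle the `tracestate` header, both of which the specification covers. In practice a tracing library would normally do this work.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the first service in the chain."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    parent_id = secrets.token_hex(8)   # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming: str) -> str:
    """Header a service sends on its own outgoing calls: the trace ID
    is kept, and the parent ID is replaced with this service's span ID."""
    ctx = parse_traceparent(incoming)
    span_id = secrets.token_hex(8)
    return f"{ctx['version']}-{ctx['trace_id']}-{span_id}-{ctx['flags']}"


incoming = new_traceparent()
outgoing = child_traceparent(incoming)
print(incoming)
print(outgoing)
```

Because every hop keeps the same trace ID while substituting its own span ID, a backend can stitch the spans from all services into a single transaction.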
Whatever stage of observability an organisation is at, understanding the use cases for each M.E.L.T. data type is a critical part of building an observability practice. It allows better sense to be made of data and relationships, so that issues can be resolved more quickly and easily and prevented from recurring. This supports reliability and performance, the key objectives of observability.
The result is improved reliability and operational agility and, ultimately, a better customer experience.