Tuesday 2020-06-02

In "A simple way to get more value from metrics", Dan Luu mentioned a "Maslow's Hierarchy of Observability", which appears to have originated from a Splunk post.

Splunk's hierarchy is phrased in terms of presentation (uptime monitoring, dashboarding, and so on), which puts it at a distance from the underlying data management.

From first principles, we have data, which then passes through three filters: data is either observed or not (Splunk's uptime monitoring); observed data is either recorded or not (Splunk's dashboarding); and recorded data is either cleaned or not.
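
As a toy restatement, here is a minimal Python sketch that treats each level as a filter over a stream of samples; the `Sample` shape and field names are my own invention, not Luu's or Splunk's:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One measurement moving through the hierarchy (hypothetical shape)."""
    name: str
    value: float
    observed: bool = False  # did any probe see it at all?
    recorded: bool = False  # did it land in durable storage?
    cleaned: bool = False   # has it been deduplicated/normalized?

def observe(samples):
    """Filter 1: data is either observed or not (uptime monitoring)."""
    return [s for s in samples if s.observed]

def record(samples, store):
    """Filter 2: observed data is either recorded or not (dashboarding)."""
    for s in samples:
        s.recorded = True
        store.append(s)
    return store

def clean(samples):
    """Filter 3: recorded data is either cleaned or not; crude
    deduplication stands in for real cleaning here."""
    seen, out = set(), []
    for s in samples:
        key = (s.name, s.value)
        if key not in seen:
            seen.add(key)
            s.cleaned = True
            out.append(s)
    return out
```

Each filter either drops a sample or marks it as having reached the next level; only what survives `clean` is fit for the statistics work discussed below.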

Real-world data will have duplicates, mismeasurements, and incomparability issues. Witness Luu's wrangling with 94 different names for JVM survivor space, some of which were counters and others gauges.
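
The cleaning step is where that kind of mess gets resolved. A hedged sketch of the two usual moves, using invented alias spellings rather than Luu's actual 94 names: collapse the spellings into one canonical metric name, and convert counters into per-interval deltas so they can be compared with gauges.

```python
import re

# Hypothetical aliases for the same underlying metric; the 94 real names
# from Luu's example are not reproduced here.
SURVIVOR_ALIASES = re.compile(
    r"(jvm[._-]?gc[._-]?)?survivor([._-]?space)?([._-]?(bytes|used|size))?",
    re.IGNORECASE,
)

def normalize_name(raw_name: str) -> str:
    """Collapse the many raw spellings into one canonical metric name."""
    if SURVIVOR_ALIASES.fullmatch(raw_name):
        return "jvm.gc.survivor_space.used_bytes"
    return raw_name

def counter_to_gauge(values):
    """Turn a monotonically increasing counter into per-interval deltas,
    so it can sit next to metrics that were reported as gauges."""
    return [b - a for a, b in zip(values, values[1:])]

print(normalize_name("JVM_GC_Survivor_Space_bytes"))  # -> jvm.gc.survivor_space.used_bytes
print(counter_to_gauge([10, 15, 15, 40]))             # -> [5, 0, 25]
```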

Cleaned data is either queryable or not by some general statistics system such as R. Stats queries could be treated as just another filter -- data is either presented or not -- but if querying is instead viewed as a control applied to cleaned data, then the hierarchy needs another level, one in which each lower level is under ad-hoc control.
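
The text has R in mind here; for consistency with the sketches above, the same kind of ad-hoc statistical query written against Python's standard library, over hypothetical cleaned values, might look like:

```python
import statistics

# Cleaned survivor-space samples (hypothetical values), ready for ad-hoc queries.
cleaned = [412.0, 398.5, 401.2, 405.7, 399.9, 1250.3]

# A one-off query of the sort one would run in R:
# summarize the distribution and flag obvious outliers.
mean = statistics.mean(cleaned)
stdev = statistics.stdev(cleaned)
outliers = [v for v in cleaned if abs(v - mean) > 2 * stdev]

print(f"mean={mean:.1f} stdev={stdev:.1f} outliers={outliers}")
```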

Ideally, monitoring will point out a problem, and also allow the ad-hoc collection, cleaning, and analysis of new data to pinpoint the issue.

Currently, this pipeline is a pain to use, as it requires (a) writing code in DTrace/BPF, SQL, some ETL cleaner, and R, and (b) predictive awareness of the pipeline's throughput limits and bottlenecks for each of those languages.