State of the document: Draft (it will be updated to reflect current deployment practices after discussion with DevOps)
There should be a platform-wide solution for log aggregation in order to quickly identify issues and find their root causes.
There should be configurable alerting in order to get information about new issues as soon as possible.
Log entries for the same business transaction in different microservices should share a unique correlation ID so that all log entries for that transaction can be found.
Assumption: Folio should be deployed in a cluster for production.
Okapi aggregates logs from all applications into a single file.
The Mapped Diagnostic Context (MDC) can be leveraged where possible. A mechanism should be implemented to supply a request ID to the MDC (e.g. MDC.put("requestId", "some-id");).
When a user request is routed to multiple microservices and something goes wrong and the request fails, the request ID included with every log message allows the request to be tracked across services.
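The mechanism above can be sketched with plain JDK classes. This is an illustrative stand-in, not FOLIO code: a real module would call org.slf4j.MDC directly and let logback's JsonLayout emit the MDC field; the class name, field names, and the "req-42" value here are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of what MDC provides: a per-thread map of diagnostic
// values that the log layout appends to every entry on that thread.
public class RequestIdDemo {
    // Thread-local context, analogous to what slf4j's MDC keeps internally.
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) { CONTEXT.get().put(key, value); }
    static void clear() { CONTEXT.get().clear(); }

    // Stand-in for a JSON log line; JsonLayout would include MDC fields similarly.
    static String log(String level, String message) {
        String requestId = CONTEXT.get().getOrDefault("requestId", "-");
        return String.format("{\"level\":\"%s\",\"requestId\":\"%s\",\"message\":\"%s\"}",
                level, requestId, message);
    }

    public static void main(String[] args) {
        // On request entry (e.g. in a filter/handler), the ID from an incoming
        // header would be stored once; every log call then carries it.
        put("requestId", "req-42");
        System.out.println(log("INFO", "loading record"));
        clear(); // always clear on request exit to avoid leaking across pooled threads
    }
}
```

Because the context is thread-local, the cleanup step matters: on thread pools, a stale request ID would otherwise attach itself to unrelated requests.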
Common JSON format
For any log aggregation solution, a JSON log format is preferable to plain text in terms of performance and simplicity of parsing. All Folio modules should use the same JSON format for logs, e.g. the following:
<configuration>
  <appender name="FILE" class="ch.qos.logback.core.FileAppender">
    <file>log/application.log</file>
    <encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
      <layout class="ch.qos.logback.contrib.json.classic.JsonLayout">
        <jsonFormatter class="ch.qos.logback.contrib.jackson.JacksonJsonFormatter"/>
        <appendLineSeparator>true</appendLineSeparator>
      </layout>
    </encoder>
  </appender>
  <root level="debug">
    <appender-ref ref="FILE"/>
  </root>
</configuration>
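With this layout, each log event is written as one JSON object per line, roughly as below (field names follow logback-contrib's JsonLayout defaults; the values are made-up examples):

```json
{"timestamp":"2020-01-01 12:00:00.000","level":"INFO","thread":"vert.x-eventloop-thread-0","mdc":{"requestId":"some-id"},"logger":"org.folio.SomeModule","message":"record saved","context":"default"}
```

Note that MDC entries, including the request ID, appear as a nested object, so the aggregation layer can index them as structured fields rather than parsing them out of the message text.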
Common library for logging
The common logging library should include the common logback.xml and provide the mechanism for supplying the request ID to the MDC.
The goal of a centralized logging stack is to quickly sort through and analyze a heavy volume of logs. One of the most popular centralized logging solutions is the Elasticsearch, Fluentd, and Kibana (EFK) stack.
- Elasticsearch may already be deployed within the platform for full-text search, so it would not be an additional technology in the stack
- Fluentd is used instead of Logstash, because the latter has certain performance problems and is not as simple to configure
- There are tools with richer functionality (e.g. Datadog)
Elasticsearch is a real-time, distributed, and scalable search engine which allows for full-text and structured search, as well as analytics. It is commonly used to index and search through large volumes of log data, but can also be used to search many different kinds of documents.
Elasticsearch is commonly deployed alongside Kibana, a powerful data visualization frontend and dashboard for Elasticsearch. Kibana allows you to explore your Elasticsearch log data through a web interface, and build dashboards and queries to quickly answer questions.
Fluentd will be used to collect, transform, and ship log data to the Elasticsearch backend. Fluentd is a popular open-source data collector that we’ll set up on our Kubernetes nodes to tail container log files, filter and transform the log data, and deliver it to the Elasticsearch cluster, where it will be indexed and stored.
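The collect/transform/ship pipeline described above maps onto three Fluentd configuration blocks. The fluentd-kubernetes-daemonset image (mentioned below) ships a complete version of this configuration; the sketch here only illustrates the shape, and the paths and options are the commonly used defaults, not a verified FOLIO configuration:

```
# Collect: tail container log files on each node
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  <parse>
    @type json                 # container runtime writes one JSON object per line
  </parse>
</source>

# Transform: enrich events with pod/namespace metadata
<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

# Ship: deliver to the Elasticsearch Service for indexing
<match kubernetes.**>
  @type elasticsearch
  host elasticsearch.kube-logging.svc.cluster.local
  port 9200
  logstash_format true         # write to daily logstash-YYYY.MM.DD indices
</match>
```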
There are many plugins available for watching and alerting on Elasticsearch indices in Kibana, e.g. X-Pack, SentiNL, ElastAlert. Alerting can be easily implemented in Kibana (see: https://www.elastic.co/blog/creating-a-threshold-alert-in-elasticsearch-is-simpler-than-ever)
ElastAlert is a simple and popular open-source tool for alerting on anomalies, spikes, or other patterns of interest found in data stored in Elasticsearch. It works with all versions of Elasticsearch.
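As an illustration of the "configurable alerting" requirement, a hypothetical ElastAlert frequency rule could fire when ERROR entries spike. Everything here (rule name, index pattern, threshold, email address) is an example, not an agreed configuration:

```yaml
# Hypothetical rule: alert when more than 10 ERROR log entries
# arrive within 5 minutes.
name: folio-error-spike
type: frequency
index: logstash-*          # indices written by Fluentd
num_events: 10
timeframe:
  minutes: 5
filter:
- term:
    level: "ERROR"         # relies on the common JSON log format above
alert:
- "email"
email:
- "ops@example.org"
```

The filter works only because all modules share the JSON format: `level` is an indexed field rather than free text.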
A kube-logging namespace should be created, into which the EFK stack components will be installed. This namespace also allows the logging stack to be quickly cleaned up and removed without any loss of function to the Kubernetes cluster. For cluster high availability, 3 Elasticsearch Pods should be deployed to avoid the "split-brain" issue (see A new era for cluster coordination in Elasticsearch and Voting configurations).
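For Elasticsearch 7.x (the version the cited "new era for cluster coordination" post describes), the voting configuration is managed automatically; only the initial master nodes must be listed at bootstrap. A sketch of the relevant StatefulSet environment variables, with illustrative Pod/Service names:

```yaml
# Excerpt from a 3-replica Elasticsearch StatefulSet spec (names are examples).
env:
- name: cluster.name
  value: k8s-logs
- name: discovery.seed_hosts
  value: "es-cluster-0.elasticsearch,es-cluster-1.elasticsearch,es-cluster-2.elasticsearch"
- name: cluster.initial_master_nodes
  value: "es-cluster-0,es-cluster-1,es-cluster-2"
```

With three master-eligible nodes, a majority of two is always required to elect a master, which is what prevents the split-brain scenario.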
K8s deployment: Kibana
To launch Kibana on Kubernetes, a Service called kibana should be created in the kube-logging namespace. The Deployment consists of one Pod replica. The latest Kibana Docker image is located at: docker.elastic.co/kibana/. A range of 0.1 vCPU - 1 vCPU should be guaranteed to the Pod.
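A sketch of the Kibana Deployment under these requirements. The image tag and Elasticsearch URL are illustrative; the resource figures implement the 0.1-1 vCPU range stated above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: kube-logging
spec:
  replicas: 1                     # single Pod replica, as stated above
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
      - name: kibana
        image: docker.elastic.co/kibana/kibana:7.2.0   # example tag
        resources:
          requests:
            cpu: 100m             # 0.1 vCPU guaranteed
          limits:
            cpu: 1000m            # capped at 1 vCPU
        env:
        - name: ELASTICSEARCH_URL
          value: http://elasticsearch.kube-logging.svc.cluster.local:9200
        ports:
        - containerPort: 5601     # Kibana web UI, exposed via the kibana Service
```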
K8s deployment: Fluentd
Fluentd should be deployed as a DaemonSet, which is a Kubernetes workload type that runs a copy of a given Pod on each node in the Kubernetes cluster (see: https://kubernetes.io/docs/concepts/cluster-administration/logging/#using-a-node-logging-agent).
Folio modules should use a single common slf4j configuration for writing JSON files on the nodes. The Fluentd Pod will tail these log files, filter log events, transform the log data, and ship it to Elasticsearch. The Fluentd DaemonSet spec provided by the Fluentd maintainers should be used, along with the docs they provide: Kubernetes Fluentd.
A Service Account called fluentd, which the Fluentd Pods will use to access the Kubernetes API, should be created in the kube-logging namespace with the label app: fluentd (see: Configure Service Accounts for Pods in the official Kubernetes docs). A ClusterRole with watch permissions on the namespaces objects should be created.
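The Service Account and RBAC objects described above can be sketched as follows. Only the namespaces/watch rule stated above is included; real deployments often also grant get and list on pods so the metadata filter can enrich events:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: kube-logging
  labels:
    app: fluentd
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
  labels:
    app: fluentd
rules:
- apiGroups: [""]
  resources: ["namespaces"]
  verbs: ["watch"]
---
# Bind the role to the Service Account so the Fluentd Pods inherit it
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  kind: ClusterRole
  name: fluentd
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: kube-logging
```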
NoSchedule toleration should be defined to match the equivalent taint on Kubernetes master nodes. This will ensure that the DaemonSet also gets rolled out to the Kubernetes masters (see: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/).
The image at https://hub.docker.com/r/fluent/fluentd-kubernetes-daemonset/ provided by the Fluentd maintainers should be used. The Dockerfile and contents of this image are available in Fluentd's fluentd-kubernetes-daemonset GitHub repo.
The following environment variables should be configured for Fluentd:
FLUENT_ELASTICSEARCH_HOST: the Elasticsearch headless Service address defined earlier: elasticsearch.kube-logging.svc.cluster.local. This will resolve to a list of IP addresses for the 3 Elasticsearch Pods. The actual Elasticsearch host will most likely be the first IP address returned in this list. To distribute logs across the cluster, the configuration of Fluentd's Elasticsearch Output plugin will need to be modified (see: Elasticsearch Output Plugin).
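Pulling the pieces above together, a trimmed DaemonSet sketch showing the Service Account, the NoSchedule toleration, the maintainers' image, and the environment variable; the image tag and volume layout are illustrative, not a verified deployment spec:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-logging
  labels:
    app: fluentd
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd          # RBAC identity created above
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule                 # also roll out to master nodes
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch  # example tag
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.kube-logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log              # node log files tailed by Fluentd
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
```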
|Decision|Status|
|---|---|
|Common logging format for all backend modules and infrastructure components (Okapi, PubSub)|Agreed|
|The same format for all frontend modules| |
|Format: JSON (configuration mentioned above)|POC for performance (plain vs JSON)|
|Including properties: log level| |
|Including (if possible) request-id in all log entries| |
|Logging to files|Should be discussed with DevOps (John Malconian)|
|Making optional artifact with EFK included| |