<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Stefan Thies</title>
    <description>The latest articles on Forem by Stefan Thies (@seti321).</description>
    <link>https://forem.com/seti321</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F138892%2F70ce1f6d-f731-49e4-a58b-50b2a8430627.png</url>
      <title>Forem: Stefan Thies</title>
      <link>https://forem.com/seti321</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/seti321"/>
    <language>en</language>
    <item>
      <title>How To Centralize Logs: Linux System Journal</title>
      <dc:creator>Stefan Thies</dc:creator>
      <pubDate>Tue, 16 Jul 2019 15:09:55 +0000</pubDate>
      <link>https://forem.com/sematext/how-to-centralize-logs-linux-system-journal-jpa</link>
      <guid>https://forem.com/sematext/how-to-centralize-logs-linux-system-journal-jpa</guid>
      <description>&lt;p&gt;Did you know that most Linux systems have a complete log management solution onboard? Distributions based on &lt;code&gt;systemd&lt;/code&gt; contain &lt;code&gt;journald&lt;/code&gt; and &lt;code&gt;journalctl&lt;/code&gt;.   &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;systemd-journald&lt;/strong&gt; - All Linux system processes write logs to the system journal, which is managed by &lt;code&gt;journald&lt;/code&gt;. The system journal is local log storage. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;journalctl&lt;/strong&gt; is the command-line client to display logs, with various filter options such as time, systemd unit, or any other field stored in the log event. For advanced searches, you can pipe the output to grep, which makes it easy to apply complex search expressions to journalctl output. &lt;br&gt;
The journalctl client is not only useful for log search; it also provides various other functions, such as management of the system journal storage. &lt;/p&gt;
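&lt;p&gt;A few common queries look like this (the unit name is just an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# show logs of a single unit since the last boot
journalctl -u ssh.service -b
# show logs from the last hour
journalctl --since "1 hour ago"
# show only errors and worse, and follow new entries
journalctl -p err -f
# combine journalctl with grep for complex searches
journalctl -u ssh.service | grep -i "failed password"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;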

&lt;p&gt;&lt;strong&gt;systemd-journal-upload&lt;/strong&gt; is a service that forwards log events to a remote &lt;code&gt;journald&lt;/code&gt; instance. Configuring &lt;code&gt;journal-upload&lt;/code&gt; on all your Linux machines to forward log events to a central log server is the easiest way to centralize logs. You can then use &lt;code&gt;journalctl&lt;/code&gt; on the central log server for log search.  Even though the Linux console is cool, you will want a web UI to search logs and visualize extracted data for easier and more practical troubleshooting.&lt;/p&gt;

&lt;p&gt;Shipping the logs to the Elastic Stack is a common practice to centralize logs, but how can this be done with journald? &lt;/p&gt;

&lt;p&gt;Unlike rsyslog, journald has no option to forward logs directly to Elasticsearch. Since we need JSON data, the output of &lt;code&gt;journalctl -o json&lt;/code&gt; is useful. Piping the output of journalctl to Logagent could be a solution:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;journalctl -o json -f | logagent -i mylogs -u http://elasticsearch:9200&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Cool, it works! However, a process started this way would not catch logs from boot time or recover gracefully after a restart - we might lose some logs - a no-go!&lt;/p&gt;

&lt;p&gt;Luckily, &lt;a href="https://sematext.com/logagent/" rel="noopener noreferrer"&gt;Logagent&lt;/a&gt; has got a &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;plugin&lt;/a&gt; that receives logs from the &lt;code&gt;systemd-journal-upload&lt;/code&gt; service. Let’s start from scratch and set up Logagent to receive journald logs and store them in Elasticsearch or Sematext Cloud.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/systemd-journal-upload.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/systemd-journal-upload.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Set up Logagent as a local logging hub
&lt;/h2&gt;

&lt;p&gt;To run Logagent you will need a Logs App token first. If you don't have a Sematext Logs App yet, you can &lt;a href="https://apps.sematext.com/ui/integrations" rel="noopener noreferrer"&gt;create one now&lt;/a&gt;.&lt;br&gt;
Then you can &lt;a href="https://sematext.com/docs/logagent/installation/" rel="noopener noreferrer"&gt;install Logagent&lt;/a&gt;. The default setup ships log files from /var/log to Sematext Cloud. &lt;/p&gt;

&lt;p&gt;To receive logs from the &lt;code&gt;journal-upload&lt;/code&gt; service, activate the plugin in &lt;code&gt;/etc/sematext/logagent.conf&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Global options&lt;/span&gt;
&lt;span class="na"&gt;options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;includeOriginalLine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;


&lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;journal-upload&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;module&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;input-journald-upload&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9090&lt;/span&gt;
    &lt;span class="na"&gt;worker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
    &lt;span class="na"&gt;systemdUnitFilter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
      &lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!!js/regexp&lt;/span&gt; &lt;span class="s"&gt;/.*/i&lt;/span&gt;
    &lt;span class="c1"&gt;# exclude: !!js/regexp /docker|containerd/i&lt;/span&gt;
    &lt;span class="c1"&gt;# add static tags to every log event &lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;log_shipper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;logagent&lt;/span&gt;
     &lt;span class="c1"&gt;# _index is special tag for log routing with elasticsearch output-plugin&lt;/span&gt;
     &lt;span class="c1"&gt;# Set the index name here in case journald logs should be &lt;/span&gt;
     &lt;span class="c1"&gt;# stored in a separate index&lt;/span&gt;
     &lt;span class="c1"&gt;# _index: MY_INDEX_FOR_ELASTICSEARCH_OUTPUT or &lt;/span&gt;
     &lt;span class="c1"&gt;#         YOUR_SEMATEXT_LOGS_TOKEN_HERE&lt;/span&gt;
     &lt;span class="c1"&gt;# you can add any other static tag &lt;/span&gt;
     &lt;span class="c1"&gt;# node_role: kubernetes_worker&lt;/span&gt;
     &lt;span class="c1"&gt;# journald might provide many fields, &lt;/span&gt;
     &lt;span class="c1"&gt;# to reduce storage usage you can remove redundant fields&lt;/span&gt;
    &lt;span class="na"&gt;removeFields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;__CURSOR&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;__REALTIME_TIMESTAMP&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;_SOURCE_REALTIME_TIMESTAMP&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;__MONOTONIC_TIMESTAMP&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;_TRANSPORT&lt;/span&gt;

&lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
  &lt;span class="c1"&gt;# output data for debugging on stdout in YAML format&lt;/span&gt;
  &lt;span class="c1"&gt;# stdout: yaml&lt;/span&gt;
  &lt;span class="na"&gt;sematext-cloud&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;module&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;elasticsearch&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://logsene-receiver.sematext.com&lt;/span&gt;
    &lt;span class="c1"&gt;# url: https://logsene-receiver.eu.sematext.com&lt;/span&gt;
    &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;YOUR_SEMATEXT_LOGS_TOKEN_HERE&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can restart the &lt;code&gt;logagent&lt;/code&gt; service with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl restart logagent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Perfect, our logging hub for journald logs is running. &lt;/p&gt;

&lt;p&gt;Let’s move on to set up the &lt;code&gt;systemd-journal-upload&lt;/code&gt; on our Linux server.&lt;/p&gt;

&lt;h1&gt;
  
  
  Set up systemd-journal-upload
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Note: The example uses 127.0.0.1 as the IP address for Logagent. Replace 127.0.0.1 with the actual IP address of the server running Logagent.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use the following command to install &lt;code&gt;systemd-journal-remote&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get install systemd-journal-remote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Edit &lt;code&gt;/etc/systemd/journal-upload.conf&lt;/code&gt; and change the URL property.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Upload]
URL=http://127.0.0.1:9090
# ServerKeyFile=/etc/ssl/private/journal-upload.pem
# ServerCertificateFile=/etc/ssl/certs/journal-upload.pem
# TrustedCertificateFile=/etc/ssl/ca/trusted.pem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enabling the service (see the &lt;code&gt;systemctl enable&lt;/code&gt; command below) will make sure &lt;code&gt;journal-upload&lt;/code&gt; starts on boot. &lt;/p&gt;

&lt;p&gt;Note that the upload service might stop if establishing the HTTP connection fails. Should that happen, the service stores the current cursor position in the system journal. Therefore, you should set sensible restart options in the service definition. &lt;/p&gt;

&lt;p&gt;Edit &lt;code&gt;/etc/systemd/system/multi-user.target.wants/systemd-journal-upload.service&lt;/code&gt; to change restart options.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Unit]
Description=Journal Remote Upload Service
Documentation=man:systemd-journal-upload(8)
After=network.target


[Service]
ExecStart=/lib/systemd/systemd-journal-upload \
          --save-state
User=systemd-journal-upload
SupplementaryGroups=systemd-journal
PrivateTmp=yes
PrivateDevices=yes
#WatchdogSec=3min
Restart=always
TimeoutStartSec=1
TimeoutStopSec=1
StartLimitBurst=1000
StartLimitIntervalSec=5
# If there are many split up journal files we need a lot of fds to
# access them all and combine
LimitNOFILE=16384
[Install]
WantedBy=multi-user.target
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the changes, then enable and restart &lt;code&gt;journal-upload&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl daemon-reload
sudo systemctl enable systemd-journal-upload.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check if your logs arrive in Sematext Cloud by opening your Logs App. &lt;br&gt;
The following &lt;a href="https://www.youtube.com/embed/glwZ8OCV0kc?list=PLT_fd32OFYpfLBFZz_HiafnqjdlTth1NS" rel="noopener noreferrer"&gt;video&lt;/a&gt; shows how to use the Sematext UI.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/glwZ8OCV0kc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>linux</category>
      <category>devops</category>
      <category>elasticstack</category>
    </item>
    <item>
      <title>Best Practices for Efficient Log Management and Monitoring</title>
      <dc:creator>Stefan Thies</dc:creator>
      <pubDate>Tue, 16 Apr 2019 10:52:13 +0000</pubDate>
      <link>https://forem.com/sematext/best-practices-for-efficient-log-management-and-monitoring-577d</link>
      <guid>https://forem.com/sematext/best-practices-for-efficient-log-management-and-monitoring-577d</guid>
      <description>

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--INsKCoEK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/04/Log-Management-best-practices-final-Fl-1024x560.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--INsKCoEK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/04/Log-Management-best-practices-final-Fl-1024x560.jpg" alt="log management and monitoring best practices"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When managing cloud-native applications, it’s essential to have end-to-end visibility into what’s happening at any given time. This is especially true because of the distributed and dynamic nature of cloud-native apps, which are often deployed using ephemeral technologies like containers and serverless functions.&lt;/p&gt;

&lt;p&gt;With so much flux and complexity across a cloud-native system, it’s important to have robust monitoring and logging in place to control and manage the inevitable chaos. This post discusses what we consider to be some of the &lt;strong&gt;best practices and standards to follow when logging and monitoring cloud-native applications&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Use a managed logging solution vs building your own infrastructure
&lt;/h2&gt;

&lt;p&gt;First off, logging should reflect your applications. In a world of cloud-native applications, logging solutions should be built on the same principles – high availability, distributed processing, and intelligent failover – that lay the foundation for the applications themselves. This is what differentiates modern cloud-native apps from legacy monolithic apps.&lt;/p&gt;

&lt;p&gt;The tools to implement this approach include Elasticsearch, Fluentd, Kibana (which, together, are often called the EFK stack), and others. They are architected to handle large-scale data analysis and deliver results in real time. They facilitate complex search queries over data and enable open API-based integration with other tools. However, though the raw materials are available, bringing it all together and making sure it meets your purposes is a whole other challenge.&lt;/p&gt;

&lt;p&gt;Rather than build out this system on your own, it makes sense to use a &lt;a href="https://sematext.com/logsene/"&gt;managed logging solution&lt;/a&gt; that is built and scaled by a vendor. We go over that in detail in &lt;a href="https://sematext.com/blog/5-benefits-run-elastic-stack-in-the-cloud/"&gt;5 Reasons to Run Elastic Stack in the Cloud&lt;/a&gt;.  With &lt;a href="https://sematext.com/integrations/"&gt;ready-made integrations&lt;/a&gt;, all you need to do is connect your sources and destinations, and you’re all set to analyze application logs the easy way. This leaves you free to spend more time &lt;a href="https://sematext.com/metrics-and-logs/"&gt;monitoring and logging your application&lt;/a&gt; rather than building out logging infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Know what logs to monitor, and what not to monitor
&lt;/h2&gt;

&lt;p&gt;Know what not to log. Just because you can log something doesn’t mean you should — and logging too much data can make it harder to find the data that actually matters. It also adds complexity to your log storage and management processes because it gives you more logs to manage.&lt;/p&gt;

&lt;p&gt;Thus, consider carefully what you actually need to log. Any types of production-environment data that are critical for compliance or auditing purposes should certainly be logged. So should data that helps you troubleshoot performance problems, &lt;a href="https://sematext.com/experience/"&gt;solve user-experience issues&lt;/a&gt; or monitor security-related events.&lt;/p&gt;

&lt;p&gt;On the other hand, there are categories of data that you do not need to log, such as data from test environments that are not an essential part of your software delivery pipeline. There are also some kinds of data that you should not log for compliance or security reasons. For example, if a user has enabled a do-not-track setting, you should not log data associated with that user. Similarly, you should avoid logging highly sensitive data, such as credit-card numbers, unless you are certain that your logging and storage processes meet the security requirements for that data.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Implement a log security and retention policy
&lt;/h2&gt;

&lt;p&gt;Logs contain sensitive data. A log security policy should define how to handle sensitive data – such as personal data of your clients or internal API access keys. Make sure that sensitive data gets anonymized or encrypted before you ship logs to any third party. &lt;a href="https://dev.to/seti321/gdpr-top-5-logging-best-practices-3fal-temp-slug-3630712"&gt;GDPR log management best practices&lt;/a&gt; teaches you good practices for protecting sensitive data &lt;a href="https://sematext.com/docs/logagent/how-to-gdpr_web_logs/"&gt;and personal data in web server logs&lt;/a&gt;. Secure transport of log data to log management servers requires encrypted endpoints (TLS or HTTPS) on both the client and server side.&lt;/p&gt;

&lt;p&gt;Logs from different sources might require different retention times. Some applications are only relevant for troubleshooting for a few days. Security-related or business transaction logs require longer retention times. Therefore, a retention policy should be flexible, depending on the log source.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Log Storage
&lt;/h2&gt;

&lt;p&gt;Capacity planning for log storage should account for high load peaks. When systems run well, the amount of data produced per day is nearly constant and depends mainly on system utilization and the number of transactions per day. In the case of critical system errors, we typically see accelerated growth in log volume. If the log storage hits its limits, you lose the latest logs - exactly the ones essential for fixing system errors. The log storage must therefore work as a cyclic buffer, deleting the oldest data first before any storage limit is reached.&lt;/p&gt;
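&lt;p&gt;As a concrete example, &lt;code&gt;systemd-journald&lt;/code&gt; already works like such a ring buffer; its local storage limits can be tuned in &lt;code&gt;/etc/systemd/journald.conf&lt;/code&gt; (the values below are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Journal]
# cap journal storage at 1 GB; the oldest entries are deleted first
SystemMaxUse=1G
# additionally drop entries older than one month
MaxRetentionSec=1month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;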

&lt;p&gt;Design your log storage so that it’s scalable and reliable – there is nothing worse than having system downtimes and a lack of information for troubleshooting, which in turn, can elongate downtime.&lt;/p&gt;

&lt;p&gt;Log storage should have a separate security policy. Attackers will try to avoid leaving traces or to delete them from log files. Therefore, you should ship logs in real time to the central log storage. If an attacker has access to your infrastructure, sending logs off-site, e.g., to a &lt;a href="https://sematext.com/logsene/"&gt;logging SaaS&lt;/a&gt;, will help keep evidence untampered.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Review &amp;amp; constantly maintain your logs
&lt;/h2&gt;

&lt;p&gt;Unmaintained log data could lead to longer troubleshooting times, risks of exposing sensitive data or higher costs for log storage. Review the log output of your applications and adjust it to your needs. Reviews should cover usability, operational and security aspects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create meaningful log messages
&lt;/h3&gt;

&lt;p&gt;Readable and useful log messages are key to faster troubleshooting. If logs contain only error codes or ‘cryptic’ error messages, they can be difficult to understand. As a developer, you can save your organization a lot of time by providing meaningful log messages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use structured log formats
&lt;/h3&gt;

&lt;p&gt;The log format should be structured (e.g., JSON or key/value format), with fields such as timestamp, severity, message, and any other relevant data fields like process ID, transaction ID, etc. If you don’t use a uniform log format across all your applications, normalize the logs in the log shipper: parse the logs and store them in a structured format.&lt;/p&gt;
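&lt;p&gt;For example, a structured log event in JSON might look like this (the field names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "@timestamp": "2019-07-16T15:09:55Z",
  "severity": "error",
  "message": "payment failed: gateway timeout",
  "process_id": 4711,
  "transaction_id": "tx-4242",
  "service": "checkout"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;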

&lt;h3&gt;
  
  
  Make log level configurable
&lt;/h3&gt;

&lt;p&gt;Some application logs are too verbose, while others don’t provide enough information about the activities. Adjustable log levels are key to configuring the verbosity of logs. Another topic for log reviews is balancing between logging relevant information and not exposing personal data or security-related information. If such data must be logged, make sure those messages can be anonymized or encrypted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inspect audit logs frequently
&lt;/h3&gt;

&lt;p&gt;Acting on security issues is crucial – so you should always have an eye on audit logs. Set up security tools such as auditd or OSSEC agents. These tools implement real-time log analysis and generate alert logs pointing to potential security issues. On top of such audit logs, you should define alerts on logs in order to be notified quickly of any suspicious activity. For more details, check out &lt;a href="https://dev.to/sematext/monitoring-linux-audit-logs-with-auditd-and-auditbeat-2m2n-temp-slug-9179415"&gt;a quick tutorial on using auditd&lt;/a&gt;, plus you’ll find some complementary frameworks too.&lt;/p&gt;
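&lt;p&gt;As a small example, auditd can watch security-relevant files for changes (the watched path and key name are just examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# watch /etc/passwd for writes and attribute changes
sudo auditctl -w /etc/passwd -p wa -k passwd_changes
# search the audit log by that key
sudo ausearch -k passwd_changes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;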

&lt;h3&gt;
  
  
  Use a checklist for log reviews:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Is the log message meaningful for users?&lt;/li&gt;
&lt;li&gt;Does the log message include context for troubleshooting?&lt;/li&gt;
&lt;li&gt;Are the log messages structured and do they include

&lt;ul&gt;
&lt;li&gt;timestamp,&lt;/li&gt;
&lt;li&gt;severity/log level&lt;/li&gt;
&lt;li&gt;message&lt;/li&gt;
&lt;li&gt;additional troubleshooting information in separate fields&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Are 3rd party logs parsed and structured (configure log shipper)?&lt;/li&gt;
&lt;li&gt;Are log levels configurable?&lt;/li&gt;
&lt;li&gt;Does the log message include personal data or security-related data?&lt;/li&gt;
&lt;li&gt;Inspect audit logs and adjust log alert rules&lt;/li&gt;
&lt;li&gt;Set up alerts on logs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  6. Don’t do log analysis in a silo: Correlate all data sources
&lt;/h2&gt;

&lt;p&gt;Connect the dots. Logging is one part of an entire monitoring strategy. To practice truly effective monitoring, you need to complement your logging with other types of monitoring like monitoring based on &lt;a href="https://sematext.com/events/"&gt;events&lt;/a&gt;, &lt;a href="https://sematext.com/alerts/"&gt;alerts&lt;/a&gt; and &lt;a href="http://www.sematext.com/tracing"&gt;tracing&lt;/a&gt;. This is the only way to get the whole story of what’s happening at any point in time. Logs are great for giving you high-definition detail on issues, but this is useful only once you’ve seen the forest and are ready to zoom into the trees. Metrics and events at an aggregate level may be more effective, especially when starting to troubleshoot an issue.&lt;/p&gt;

&lt;p&gt;Don’t look at logs in a silo; complement them with other types of monitoring like &lt;a href="https://sematext.com/application-monitoring/"&gt;APM&lt;/a&gt;, &lt;a href="https://sematext.com/network-monitoring/"&gt;network monitoring&lt;/a&gt;, &lt;a href="https://sematext.com/spm/"&gt;infrastructure monitoring&lt;/a&gt;, and more. See &lt;a href="https://sematext.com/blog/apm-vs-log-management/"&gt;APM vs. Log Management&lt;/a&gt; for more detail. This also means that the monitoring solution you use should be comprehensive enough to provide all your monitoring information in one place, or flexible enough to easily integrate with other tools that provide this information. This way, as a user, you have a &lt;a href="https://sematext.com/"&gt;single-pane view of your entire stack&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. View logging as an enabler of GitOps
&lt;/h2&gt;

&lt;p&gt;For busy DevOps teams, it’s easy to view logging as a nice-to-have, or an add-on that you can embrace once you’ve figured out automated CI/CD pipelines and are releasing more frequently. However, another way to look at logging is to see it as an enabler of DevOps and CI/CD. To practice automation at every step of the development pipeline, you need the visibility to know where issues are introduced, and what the main sources of these issues are — faulty code, dependency issues, external attacks, insufficient resources, or something else. The causes can be innumerable, but logging gives you the insight you need to find and fix these issues.&lt;/p&gt;

&lt;p&gt;As continuous integration increasingly becomes about enabling GitOps at the very start of the pipeline, there’s a need to not compromise on quality and security authentication in the name of automation and speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Get real-time feedback on any type of events
&lt;/h2&gt;

&lt;p&gt;Automated testing and new approaches like headless testing are making it possible to get real-time feedback on every single code change in a developer environment, even before a commit. As testing shifts left, and there is an increasing focus on the start of the pipeline, logging is essential to gain visibility and enable GitOps. Without the appropriate testing and logging, you’ll be left with runaway releases and deployment hell.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Use logging to identify automation opportunities &amp;amp; trends
&lt;/h2&gt;

&lt;p&gt;Logging helps to catch issues early on in the pipeline and saves your team valuable time and energy. It also helps you find opportunities for automation. You can set up custom alerts to trigger when something breaks, and even set up automated actions to be initiated when these alerts are triggered. Whether it’s through Slack, a custom script, or a Jenkins automation plugin, you can drive automation in your GitOps process using logs. For all these reasons, you need to view logging as an enabler and driver of GitOps rather than an add-on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion &amp;amp; next steps
&lt;/h2&gt;

&lt;p&gt;In conclusion, logging is an essential part of building and managing cloud-native apps. For logging to be successful, it should reflect the state of your applications and be able to scale along with them. Logging should never be done in a silo. This is why a monitoring solution for cloud-native applications should consider other types of monitoring and metrics. Logging can often be viewed as an afterthought, but teams that want to go all the way with GitOps see logging as a driver and enabler of observability, and hence, as indispensable.&lt;/p&gt;

&lt;p&gt;Looking for a full-stack monitoring solution? &lt;a href="https://apps.sematext.com/ui/registration?utm_source=blog&amp;amp;utm_medium=BestStrategies"&gt;Try Sematext Cloud free for 30 days.&lt;/a&gt; Sematext bridges the gap between logs, metrics, real user monitoring and tracing, allowing you to benefit from faster, actionable insights.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;About the Authors&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SzWg99AK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/04/twain-profile-150x150.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SzWg99AK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/04/twain-profile-150x150.png" alt="twain profile"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Twain Taylor | Contributor&lt;/p&gt;

&lt;p&gt;Twain is a Fixate IO Contributor who began his career at Google, where, among other things, he was involved in technical support for the AdWords team. His work involved reviewing stack traces, resolving issues affecting both customers and the Support team, and handling escalations. Later, he built branded social media applications and automation scripts to help startups better manage their marketing operations. Today, as a technology journalist, he helps IT magazines and startups change the way teams build and ship applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--esaY6uES--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/03/stefan-Thies-150x150.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--esaY6uES--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/03/stefan-Thies-150x150.png" alt="stefan Thies"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stefan Thies | DevOps Evangelist | Sematext&lt;/p&gt;

&lt;p&gt;10+ years of work experience as a product manager and pre-sales engineer in the telecommunications industry. Passionate about new software technologies and scalable system architectures. Likes NodeJS for POCs.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://sematext.com/blog/best-practices-for-efficient-log-management-and-monitoring/"&gt;Best Practices for Efficient Log Management and Monitoring&lt;/a&gt; appeared first on &lt;a href="https://sematext.com"&gt;Sematext&lt;/a&gt;.&lt;/p&gt;


</description>
      <category>logging</category>
      <category>centralisedlog</category>
      <category>logmanagement</category>
    </item>
    <item>
      <title>Better Observability with New Container Agents</title>
      <dc:creator>Stefan Thies</dc:creator>
      <pubDate>Wed, 03 Apr 2019 09:33:56 +0000</pubDate>
      <link>https://forem.com/sematext/better-observability-with-new-container-agents-16c</link>
      <guid>https://forem.com/sematext/better-observability-with-new-container-agents-16c</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FV2_better-observability-new-monitoring-agent-FI-1024x560.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FV2_better-observability-new-monitoring-agent-FI-1024x560.jpg" alt="better observability new container monitoring agent sematext"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a New Docker Agent?
&lt;/h2&gt;

&lt;p&gt;If you liked Sematext Docker Agent, you’ll love our new agent for Docker monitoring, which provides even more insight into your Docker, Kubernetes, and Swarm clusters.  Because of its power, small footprint, and ease of installation, the old Sematext Docker Agent enjoyed high adoption in the Docker DevOps community.&lt;/p&gt;

&lt;p&gt;An all-in-one Docker monitoring tool, certified by Docker since 2015, it could &lt;a href="https://sematext.com/blog/top-docker-metrics-to-watch/" rel="noopener noreferrer"&gt;monitor all key Docker metrics&lt;/a&gt;, container events, as well as &lt;a href="https://sematext.com/docs/logagent/parser/" rel="noopener noreferrer"&gt;collect and parse logs&lt;/a&gt;.  However, container technology is developing rapidly – Docker Enterprise and Kubernetes gained in popularity, as did cloud container platforms like Google GKE.  It was time for an update.  Except, we didn’t just update it.  We rewrote it, made it even smaller and more modular, and much more powerful.  The new Sematext Agent can monitor not only Docker containers but also see inside them.  It has &lt;a href="https://sematext.com/blog/streamlined-kubernetes-cluster-agent/" rel="noopener noreferrer"&gt;first-class Kubernetes monitoring&lt;/a&gt; support and &lt;a href="https://sematext.com/blog/linux-kernel-observability-ebpf/" rel="noopener noreferrer"&gt;kernel tracing&lt;/a&gt; capabilities, all with super low CPU and memory footprint.&lt;/p&gt;

&lt;p&gt;To better serve the need for advanced monitoring and advanced logging functionality we’ve split the agent in two.  This enables faster release cycles and even easier deployment for each specific use case – embracing containerized architectures and orchestration tools. There are two separate images but – importantly – you still benefit from having a single deployment via Helm on Kubernetes.&lt;/p&gt;

&lt;p&gt;The following new images replace the old Sematext Docker Agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://hub.docker.com/r/sematext/agent" rel="noopener noreferrer"&gt;sematext/agent&lt;/a&gt; – container monitoring, infrastructure monitoring, cluster monitoring and events from container engines and orchestration tools&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hub.docker.com/r/sematext/logagent/" rel="noopener noreferrer"&gt;sematext/logagent&lt;/a&gt; – log collection, log parsing, log enrichment, and log shipping for containers&lt;/li&gt;
&lt;li&gt;Both images are &lt;a href="https://hub.docker.com/publishers/sematext" rel="noopener noreferrer"&gt;Docker certified&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together with the new monitoring agent, we introduced new Dashboards to display the collected data, such as &lt;a href="https://sematext.com/blog/docker-container-monitoring-with-sematext/" rel="noopener noreferrer"&gt;Container Infrastructure monitoring&lt;/a&gt; and Kubernetes cluster metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2Fserver-monitoring-2-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2Fserver-monitoring-2-1.png" alt="Container monitoring with heatmap"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Container monitoring with heatmap&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FScreen-Shot-2018-11-26-at-14.55.15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FScreen-Shot-2018-11-26-at-14.55.15.png" alt="Kubernetes Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Kubernetes Dashboard –&lt;/em&gt; tracking deployment status and Pod restarts over time&lt;/p&gt;

&lt;p&gt;So let’s introduce you to sematext/logagent &amp;amp; sematext/agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Container Logs Processing with Logagent
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://sematext.com/logagent/" rel="noopener noreferrer"&gt;Logagent&lt;/a&gt; is a general purpose open-source log shipper. The &lt;a href="https://hub.docker.com/_/logagent" rel="noopener noreferrer"&gt;Logagent Docker image&lt;/a&gt; is pre-configured for the &lt;a href="https://sematext.com/blog/docker-container-monitoring-with-sematext/#toc-container-logs-0" rel="noopener noreferrer"&gt;log collection on container platforms&lt;/a&gt;. It runs as a tiny container on every Docker host and collects logs for all cluster nodes and their containers. All container logs are enriched with Kubernetes, Docker Enterprise, and Docker Swarm metadata.&lt;/p&gt;

&lt;p&gt;The deployment of Logagent is very similar to the deployment of Sematext Docker Agent and is fully compatible with all its &lt;a href="https://sematext.com/docs/logagent/installation-docker/#configuration-parameters" rel="noopener noreferrer"&gt;configuration options for logs&lt;/a&gt;. The format for &lt;a href="https://sematext.com/docs/logagent/parser/" rel="noopener noreferrer"&gt;log parser patterns&lt;/a&gt; also remains the same. Logagent, like its predecessor, recognizes log formats from various applications / official images out of the box.&lt;/p&gt;

&lt;p&gt;The following little example shows how easy it is to deploy Logagent, run a web server, and get structured web server logs for web analytics in Sematext.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;
&lt;span class="c"&gt;# Start Logagent&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;always &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;LOGS_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YourLogsToken &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-v&lt;/span&gt; /var/run/docker.sock:/var/run/docker.sock sematext/logagent
&lt;span class="c"&gt;# Start Nginx web server&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 8081:80 nginx
&lt;span class="c"&gt;# Access the web server&lt;/span&gt;
curl http://127.0.0.1:8081

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few seconds later, we see the result in Sematext: beautiful, structured web server logs, including container metadata.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FScreen-Shot-2019-03-12-at-14.35.26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FScreen-Shot-2019-03-12-at-14.35.26.png" alt="Structured web server logs with container metadata"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Structured web server logs with container metadata&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With a few clicks, we can add widgets to create a web server logs dashboard, showing Top IP addresses and Top URLs or containers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FScreen-Shot-2019-03-12-at-14.43.56.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FScreen-Shot-2019-03-12-at-14.43.56.png" alt="Sematext UI with Top N widgets for various log fields"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sematext UI with Top N widgets for various log fields&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That was easy for logs, so let’s have a look at the new Docker monitoring agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Containers with Sematext Agent
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://sematext.com/docs/containers/sematext-agent/" rel="noopener noreferrer"&gt;Sematext Agent&lt;/a&gt; collects metrics about hosts (CPU, memory, disk, network, processes), containers, and orchestration platforms, and ships them to &lt;a href="https://sematext.com/cloud" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;. To gain deep insight into the Linux kernel, Sematext Agent relies on eBPF to place instrumentation points on kernel functions (by attaching eBPF programs to kprobes). This kernel instrumentation gives Sematext Agent a very efficient and powerful way to explore the system. The agent can auto-discover services deployed on physical machines, virtual hosts, and containers, and it has a mechanism for collecting &lt;a href="https://sematext.com/inventory-monitoring" rel="noopener noreferrer"&gt;infrastructure inventory&lt;/a&gt; info. It also collects events from different sources, such as OOM notifications and container or Kubernetes events.&lt;/p&gt;

&lt;p&gt;The plethora of information collected to provide you with full stack observability of your applications, services, and infrastructure is neatly organized in dashboards for &lt;a href="https://sematext.com/spm" rel="noopener noreferrer"&gt;infrastructure monitoring&lt;/a&gt;, &lt;a href="https://sematext.com/docker" rel="noopener noreferrer"&gt;container monitoring&lt;/a&gt; and &lt;a href="https://sematext.com/kubernetes" rel="noopener noreferrer"&gt;Kubernetes cluster monitoring&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Kernel Tracing with eBPF&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Many traditional monitoring agents are based on checks running periodically. Such checks run scripts and may even use commands like ‘ps -efa’ to discover running processes. Periodic checks have several disadvantages: for example, they can miss short-running processes, and, depending on their frequency, the checks themselves add overhead. &lt;a href="https://sematext.com/blog/linux-kernel-observability-ebpf/" rel="noopener noreferrer"&gt;Linux kernel observability using eBPF&lt;/a&gt;, on the other hand, can trace &lt;em&gt;any&lt;/em&gt; kernel function call from user space. Using eBPF makes it possible to automatically discover new processes without periodic checks. There is a lot more eBPF can do.  For instance, it can also discover file system changes or the network activity of all processes, including containers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2Fimage8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2Fimage8.png" alt="eBPF architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;eBPF architecture – Source:&lt;/em&gt; &lt;em&gt;&lt;a href="https://github.com/cilium/cilium/#what-is-ebpf-and-xdp" rel="noopener noreferrer"&gt;https://github.com/cilium/cilium/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The new Sematext Agent makes heavy use of eBPF for auto-discovery of processes and their activity. To do that, Sematext Agent attaches bytecode at various hook points in the kernel to detect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process creation and termination&lt;/li&gt;
&lt;li&gt;Socket listen/accept&lt;/li&gt;
&lt;li&gt;Signals&lt;/li&gt;
&lt;li&gt;Out of memory errors&lt;/li&gt;
&lt;li&gt;File system activity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because eBPF is not available in older Linux kernels, the agent also has fallback mechanisms, such as polling /procfs. If you are curious whether eBPF is available on your machines, have a look at the new &lt;a href="https://sematext.com/inventory-monitoring" rel="noopener noreferrer"&gt;Inventory Monitoring&lt;/a&gt; in Sematext, as it displays all Linux kernel versions used across your infrastructure.&lt;/p&gt;
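&lt;p&gt;If you would rather check a machine from the shell than via the Inventory report, here is a quick sketch. The 4.9 baseline is an illustrative assumption on our part; exactly which eBPF features work depends on the kernel version and build configuration.&lt;/p&gt;

```shell
#!/bin/sh
# Rough shell check for eBPF support, using kernel 4.9 as an assumed
# baseline (illustrative only - real feature availability varies by
# kernel version and build configuration).
ebpf_baseline_ok() {
  # $1 is a kernel release string such as "4.15.0-54-generic"
  major=$(echo "$1" | cut -d. -f1)
  minor=$(echo "$1" | cut -d. -f2)
  [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 9 ]; }
}

if ebpf_baseline_ok "$(uname -r)"; then
  echo "kernel likely new enough for eBPF tracing"
else
  echo "kernel too old - the agent falls back to polling /proc"
fi
```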

&lt;h3&gt;
  
  
  &lt;strong&gt;Low CPU &amp;amp; Memory Footprint&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Saving money on cloud resources is a hot topic for every company deploying applications to the cloud.  Keeping costs down is a must for any company that wants to be competitive in today’s markets.  We are very keenly aware of that, being a fully bootstrapped and cost-conscious organization ourselves.  Sematext Agent is a native binary.  As such, it doesn’t have the overhead of a runtime environment such as JVM, Ruby, Python, etc.  Moreover, we have put a lot of effort into profiling and minimizing the Sematext Agent memory and CPU footprint to make it nearly invisible when it’s running on your infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Container and Kubernetes Monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;What are the advantages of the new agent for container monitoring?&lt;/p&gt;

&lt;p&gt;First of all, the Docker Remote API is limited to Docker environments, while Kubernetes has emerged as the most popular orchestration tool. In addition, more and more alternative container runtimes are available on the market. Therefore, the new Sematext Agent takes a container-runtime-agnostic approach to container monitoring.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Container runtime agnostic discovery and monitoring

&lt;ul&gt;
&lt;li&gt;Automatic container discovery&lt;/li&gt;
&lt;li&gt;Support for &lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; and &lt;a href="https://github.com/rkt/rkt" rel="noopener noreferrer"&gt;Rkt&lt;/a&gt; container engines&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Container metrics

&lt;ul&gt;
&lt;li&gt;CPU usage&lt;/li&gt;
&lt;li&gt;Disk space usage and IO stats&lt;/li&gt;
&lt;li&gt;Memory usage, memory limits, and memory fail counters&lt;/li&gt;
&lt;li&gt;Network IO stats&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Host inventory information

&lt;ul&gt;
&lt;li&gt;Host kernel version and other system information, like distro, architecture, number of CPUs, etc.&lt;/li&gt;
&lt;li&gt;Information about installed software packages&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Container metadata

&lt;ul&gt;
&lt;li&gt;Container name&lt;/li&gt;
&lt;li&gt;Image name&lt;/li&gt;
&lt;li&gt;Container networks&lt;/li&gt;
&lt;li&gt;Container volumes&lt;/li&gt;
&lt;li&gt;Container environment&lt;/li&gt;
&lt;li&gt;Container labels including relevant information about orchestration&lt;/li&gt;
&lt;li&gt;Kubernetes metadata such as Pod name, UUID, namespace&lt;/li&gt;
&lt;li&gt;Docker Swarm metadata such as service name, swarm task, etc.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Collection of container events

&lt;ul&gt;
&lt;li&gt;Docker events such as start/stop/die/volume mount, etc.&lt;/li&gt;
&lt;li&gt;Kubernetes events such as Pod status changes (deployed, destroyed, etc.)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Tracking deployment status and Pod restarts over time&lt;/li&gt;

&lt;li&gt;Process metrics such as CPU usage, memory usage and disk IO&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Let’s see how Sematext Agent is deployed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Started with Sematext Agent
&lt;/h3&gt;

&lt;p&gt;To run Sematext Agent you will need a Docker App token. If you don’t have any Docker Apps yet, you can &lt;a href="https://apps.sematext.com/ui/integrations" rel="noopener noreferrer"&gt;create one now&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
The Sematext UI &lt;a href="https://apps.sematext.com/ui/howto/docker/overview" rel="noopener noreferrer"&gt;displays copy-and-paste instructions&lt;/a&gt; for various deployment methods: Docker, Docker Enterprise/Swarm, Kubernetes DaemonSets, or Helm charts.&lt;/p&gt;
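&lt;p&gt;For a plain Docker host, the instructions boil down to a single docker run command. The sketch below is illustrative only: the INFRA_TOKEN variable name and the mounts reflect the agent docs at the time of writing and may differ for your setup, so copy the exact command from the Sematext UI.&lt;/p&gt;

```shell
# Illustrative sketch of deploying the new agent on a single Docker host.
# INFRA_TOKEN and the mounts are assumptions based on the docs; the
# in-app instructions are authoritative and may list additional mounts.
docker run -d --restart=always --name sematext-agent \
  -e INFRA_TOKEN=YourInfraAppToken \
  -v /var/run/docker.sock:/var/run/docker.sock \
  sematext/agent
```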

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2Fezgif-4-9c0c90a37622-1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2Fezgif-4-9c0c90a37622-1.gif" alt="setup sematext agent"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://sematext.com/docs/containers/sematext-agent/" rel="noopener noreferrer"&gt;Sematext Agent Documentation&lt;/a&gt; contains all configuration options. After a short time you will see container information in the infrastructure monitoring, Docker and Kubernetes reports.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Migration&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When migrating to the new agents you can do a simple “nearly in-place replacement” by first removing the old agent and then quickly setting up the new ones.  This may result in a small gap in your metrics and logs between the time you remove the old Sematext Docker Agent and set up the new agents, but the switch should take only a few minutes.  If this short gap in data is not acceptable, but a bit of data duplication is, then you can switch the order of operations – set up the new agents first and then remove the old one.&lt;/p&gt;

&lt;p&gt;Please read &lt;a href="https://sematext.com/blog/docker-container-monitoring-with-sematext/" rel="noopener noreferrer"&gt;Monitoring Docker With Sematext&lt;/a&gt;; it covers all the details: useful options for agent deployment, log search tips, alert rule definitions, and more.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Else is Planned in the Near Future?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The new Sematext Agent has the ability to auto-discover services running inside containers.  Once the discovered services are exposed in Sematext, the Sematext Agent will enable you to seamlessly start monitoring the applications you have running inside your containers. The automatic deployment will work on bare metal and VM servers, Docker Enterprise, and Kubernetes.&lt;/p&gt;

&lt;p&gt;Logagent will grow its &lt;a href="https://github.com/sematext/logagent-js" rel="noopener noreferrer"&gt;open-source repository&lt;/a&gt; of supported log formats and plugins for hands-free log collection and parsing.&lt;/p&gt;

&lt;p&gt;The new agent includes process monitoring and package inventory collection capabilities, which will soon start showing up in the Sematext UI. If you don’t have the new agent yet, now is a good time to upgrade your Sematext Agent!&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://sematext.com/blog/better-observability-new-container-agents/" rel="noopener noreferrer"&gt;Better Observability with New Container Agents&lt;/a&gt; appeared first on &lt;a href="https://sematext.com" rel="noopener noreferrer"&gt;Sematext&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>containermonitoring</category>
      <category>devops</category>
      <category>docker</category>
    </item>
    <item>
      <title>Docker Container Monitoring Open Source Tools</title>
      <dc:creator>Stefan Thies</dc:creator>
      <pubDate>Mon, 01 Apr 2019 16:30:05 +0000</pubDate>
      <link>https://forem.com/sematext/docker-container-monitoring-open-source-tools-2afj</link>
      <guid>https://forem.com/sematext/docker-container-monitoring-open-source-tools-2afj</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo23z83gp2ln226mrsrxh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo23z83gp2ln226mrsrxh.jpg" alt="Docker Container Monitoring Open Source Tools"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Part 1: &lt;a href="https://sematext.com/blog/docker-container-monitoring-management-challenges/" rel="noopener noreferrer"&gt;Docker Container monitoring and management Challenges&lt;/a&gt; – we discussed why container monitoring is challenging, especially in the context of orchestration tools. In Part 2 we described &lt;a href="https://sematext.com/blog/top-docker-metrics-to-watch/" rel="noopener noreferrer"&gt;Top 10 Container Metrics to Monitor&lt;/a&gt;.  Next, let’s have a look at examples and available container monitoring tools for better operational insights into container deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Command Line Tools&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The first step to get visibility into your container infrastructure is probably using the built-in tools like the &lt;a href="https://sematext.com/docker-commands-cheat-sheet/" rel="noopener noreferrer"&gt;docker command line&lt;/a&gt; and &lt;a href="https://sematext.com/kubernetes-cheat-sheet/" rel="noopener noreferrer"&gt;kubectl for Kubernetes&lt;/a&gt;. There’s a whole set of commands for finding the relevant information about containers. Please note that the docker and kubectl commands are typically available only to the few people who have direct access to the orchestration tool. Nevertheless, all cloud engineers require command-line skills and, in some situations, command-line tools are indeed the only tools available.&lt;/p&gt;
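&lt;p&gt;For reference, a few of the built-in commands this step typically involves (run on a Docker host, or on a machine with kubectl configured; CONTAINER_ID and POD_NAME are placeholders, and kubectl top requires the metrics-server add-on):&lt;/p&gt;

```shell
# Ad-hoc container inspection with the built-in CLI tools.
docker ps                              # list running containers
docker stats --no-stream               # one-shot CPU/memory/network/IO per container
docker logs --tail 100 CONTAINER_ID    # recent logs of one container
docker inspect CONTAINER_ID            # full container metadata as JSON

kubectl get pods --all-namespaces      # cluster-wide pod status
kubectl top pods                       # pod CPU/memory (needs metrics-server)
kubectl logs POD_NAME                  # logs of one pod
```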

&lt;p&gt;Before we start looking at Docker log collection tools, check out these two useful Docker Cheatsheets.&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://dev.to/docker-commands-cheat-sheet/?utm_medium=blogpost&amp;amp;utm_source=blogpost&amp;amp;utm_campaign=docker-container-monitoring-open-source-tools-blogpost&amp;amp;utm_content=blog-docker-commands-cheatsheet"&gt;Docker Commands Cheat Sheet&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://dev.to/docker-swarm-cheat-sheet/?utm_medium=blogpost&amp;amp;utm_source=blogpost&amp;amp;utm_campaign=docker-container-monitoring-open-source-tools-blogpost&amp;amp;utm_content=blog-doecker-smarm-cheatsheet"&gt;Docker Swarm Cheat Sheet&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are several monitoring dashboards available for orchestration tools like Rancher, Portainer, and Docker Enterprise. They typically provide a simple real-time metrics view and a real-time logs view. The navigation to containers takes a few clicks.  However, overviews with aggregated metrics or cluster-wide log search are typically not integrated. As such, the basic monitoring functionality in most orchestration tools is simply too basic.  Better tools are needed.  Let’s look at some more attractive and capable monitoring and log management solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Open Source Tools for Docker Monitoring, Logging and Tracing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Several &lt;a href="https://sematext.com/blog/open-source-docker-monitoring-logging/" rel="noopener noreferrer"&gt;open source tools are available for DIY-style container monitoring and logging&lt;/a&gt;. Typically logs and metrics are stored in different data stores. Elastic Stack is the tool of choice for logs while &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; is popular for metrics.&lt;/p&gt;

&lt;p&gt;Depending on your metrics and logs data store choices you may need to use a different set of data collectors and dashboard tools. &lt;a href="https://github.com/influxdata/telegraf" rel="noopener noreferrer"&gt;Telegraf&lt;/a&gt; and Prometheus are the most flexible open source data collectors we’ve evaluated. Prometheus exporters need a scraper (Prometheus Server or alternative 3rd party scraper) or a remote storage interface for Prometheus Server to store metrics in alternative data stores. Grafana is the most flexible monitoring dashboard tool with support for most data sources like Prometheus, InfluxDB, Elasticsearch, etc.&lt;/p&gt;

&lt;p&gt;Kibana and Metricbeat for data collection are tightly bound to Elasticsearch and are thus not usable with any other data store.&lt;/p&gt;

&lt;p&gt;The following matrix shows which data collectors typically play with which storage engine and monitoring dashboard tool. Note there are several other variations possible.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data Collector for Containers&lt;/th&gt;
&lt;th&gt;Storage / Time Series DB&lt;/th&gt;
&lt;th&gt;User Interface&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Log collectors&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fluentd&lt;/td&gt;
&lt;td&gt;Elasticsearch&lt;/td&gt;
&lt;td&gt;Kibana&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Filebeat&lt;/td&gt;
&lt;td&gt;Elasticsearch&lt;/td&gt;
&lt;td&gt;Kibana&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Telegraf / syslog + docker syslog driver&lt;/td&gt;
&lt;td&gt;InfluxDB&lt;/td&gt;
&lt;td&gt;Grafana, Chronograf&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logagent&lt;/td&gt;
&lt;td&gt;Elasticsearch &amp;amp; Sematext Cloud&lt;/td&gt;
&lt;td&gt;Kibana &amp;amp; Sematext&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metric collectors&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prometheus Exporters&lt;/td&gt;
&lt;td&gt;Prometheus&lt;/td&gt;
&lt;td&gt;Promdash, Grafana, various 3rd party and commercial integrations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metricbeat&lt;/td&gt;
&lt;td&gt;Elasticsearch&lt;/td&gt;
&lt;td&gt;Kibana&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Telegraf&lt;/td&gt;
&lt;td&gt;InfluxDB&lt;/td&gt;
&lt;td&gt;Grafana, Chronograf&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Telegraf Elasticsearch output&lt;/td&gt;
&lt;td&gt;Elasticsearch&lt;/td&gt;
&lt;td&gt;Kibana&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sematext Agent&lt;/td&gt;
&lt;td&gt;InfluxDB &amp;amp; Sematext Cloud&lt;/td&gt;
&lt;td&gt;Chronograf &amp;amp; Sematext Cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Compatibility of monitoring tools and time series storage engines&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Elastic Stack might seem like an excellent candidate to unify metrics and logs in one data store.  As providers of &lt;a href="https://semaext.com/consulting/elasticsearch" rel="noopener noreferrer"&gt;Elasticsearch consulting&lt;/a&gt;, training, and &lt;a href="https://semaext.com/support/elasticsearch" rel="noopener noreferrer"&gt;Elasticsearch support&lt;/a&gt; we would love nothing more than everyone using Elasticsearch for not just logs, but also metrics. However, the truth is that &lt;strong&gt;Elasticsearch is not the most efficient time series database for metrics&lt;/strong&gt;. Trust us, we’ve run numerous benchmarks and applied all kinds of performance tuning tricks from our rather big bag of Elasticsearch tricks, but it turns out there are better, more efficient, faster data stores for metrics than Elasticsearch.  The setup and maintenance of logging and monitoring infrastructure becomes complicated when it reaches a larger scale.&lt;/p&gt;

&lt;p&gt;After the initial setup of storage engines for metrics and logs the time-consuming work starts: the setup of log shippers and monitoring agents, dashboards and alert rules. Dealing with log collection for containers can be tricky, so you’ll want to consult &lt;a href="https://sematext.com/blog/top-10-docker-logging-gotchas/" rel="noopener noreferrer"&gt;top 10 Docker logging gotchas&lt;/a&gt; and &lt;a href="https://sematext.com/blog/docker-log-driver-alternatives/" rel="noopener noreferrer"&gt;Docker log driver alternatives&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Docker Monitoring with Grafana&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After the setup of data collectors, we need to visualize metrics and logs. The most popular dashboard tools are Grafana and Kibana. Grafana does an excellent job as a dashboard tool for showing data from a number of data sources including Elasticsearch, InfluxDB, and Prometheus. In general, though, Grafana is really more tailored for metrics even though &lt;a href="https://sematext.com/blog/using-grafana-with-elasticsearch-for-log-analytics-2/" rel="noopener noreferrer"&gt;using Grafana for logs with Elasticsearch&lt;/a&gt; is possible, too. Grafana is still very limited for ad-hoc log searches but has integrated alerting on logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9070da2f0g8nlflbtxa7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9070da2f0g8nlflbtxa7.png" alt="grafana"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Docker Monitoring with Kibana&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Kibana, on the other hand, supports only Elasticsearch as a data source. Some dashboard views are “impossible” to implement because different monitoring and logging tools have limited options to correlate data from different data stores. Once dashboards are built and ready to share with the team, the next hot topic for Kibana users is security, authentication and role-based access control (RBAC).  Grafana supports user authentication and simple roles, while Kibana (or in general the Elastic Stack) requires X-Pack as commercial extensions to support various security features like user authentication and RBAC. Depending on the requirements of your organization one of the &lt;a href="https://sematext.com/blog/x-pack-alternatives/" rel="noopener noreferrer"&gt;X-Pack Alternatives&lt;/a&gt; might be helpful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdqk7stf85kccahakulb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdqk7stf85kccahakulb.png" alt="kibana"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Total Cost of Ownership
&lt;/h3&gt;

&lt;p&gt;When planning the setup of open source monitoring, people often underestimate the amount of data generated by monitoring agents and log shippers. More specifically, most organizations underestimate the resources needed for processing, storage, and retrieval of metrics and logs as their volume grows and, even more importantly, the human effort and time that have to be invested into ongoing maintenance of the monitoring infrastructure and open-source tools. When that happens, not only does the cost of infrastructure for monitoring and logging jump beyond anyone’s predictions, but so does the time, and thus money, required for maintenance.  A common way to deal with this is to limit data retention.  This requires fewer resources, less expertise to scale the infrastructure and tools, and thus less maintenance, but it, of course, limits the visibility and insights one can derive from long-term data.&lt;/p&gt;

&lt;p&gt;Infrastructure costs are only one reason why we often see limited storage for metrics, traces, and logs. For example, InfluxDB has no clustering or sharding in the open source edition, and &lt;a href="https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations" rel="noopener noreferrer"&gt;Prometheus supports only short retention time&lt;/a&gt; to avoid performance problems.&lt;/p&gt;

&lt;p&gt;Another approach to dealing with this is reducing the granularity of metrics from 10-second resolution to a minute or even more, sampling, and so on. As a consequence, DevOps teams have less accurate information, less time to analyze problems, and limited visibility into permanent or recurring issues, historical trend analysis, and capacity planning.&lt;/p&gt;
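&lt;p&gt;To see why reducing granularity is so tempting, here is a quick back-of-the-envelope calculation. The series count and intervals below are hypothetical numbers picked for illustration, not figures from any benchmark.&lt;/p&gt;

```shell
# Daily data-point volume as a function of the collection interval.
points_per_day() {
  # $1 = number of metric series, $2 = collection interval in seconds
  echo $(( $1 * 86400 / $2 ))
}

# e.g. 100 hosts x 50 series each = 5,000 series:
points_per_day 5000 10   # 10s resolution -> 43200000 points/day
points_per_day 5000 60   # 60s resolution -> 7200000 points/day
```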

&lt;h2&gt;
  
  
  &lt;strong&gt;Microservices Distributed Transaction Tracing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this post, we discussed only monitoring and logging.  We deliberately ignored &lt;a href="https://sematext.com/tracing/" rel="noopener noreferrer"&gt;distributed transaction tracing&lt;/a&gt;, the third pillar of observability, for a moment. Keep in mind that as soon as we start collecting transaction traces across microservices, the amount of data will explode and thus further increase the total cost of ownership of an on-premise monitoring setup. Note that the data collection tools mentioned in this post handle only metrics and logs, not traces (for more on transaction tracing and, more specifically, OpenTracing-compatible tracers, see our &lt;a href="https://sematext.com/blog/jaeger-vs-zipkin-opentracing-distributed-tracers/" rel="noopener noreferrer"&gt;Jaeger vs. Zipkin&lt;/a&gt; comparison). Similarly, the dashboard tools we covered here don’t come with data collection and visualizations for transaction traces. &lt;strong&gt;This means that for distributed transaction tracing we need a third set of tools if we want to put together and run our own monitoring setup – welcome to the DevOps jungle!&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;DIY Container Monitoring Pros and Cons&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There are a number of &lt;a href="https://sematext.com/blog/open-source-docker-monitoring-logging/" rel="noopener noreferrer"&gt;open source container observability tools for logging, monitoring, and tracing&lt;/a&gt;. If you and your team have the time, and if observability really needs to be your team’s core competency, you’ll need to invest time into finding the most promising tools, learning how to actually use them while evaluating them, and finally installing, configuring, and maintaining them. It would be wise to compare multiple solutions and check how well the various tools play together.  We suggest the following evaluation criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coverage of collected metrics&lt;/strong&gt;. Some tools collect only a few metrics, some gather a ton of metrics (which you may not really need), while other tools let you configure which metrics to collect. Missing relevant metrics can be frustrating when one is working under pressure to solve a production issue, just like having too many or wrong metrics will make it harder to locate signals that truly matter. Tools that &lt;em&gt;require&lt;/em&gt; configuration for collection or visualization of each metric are time-consuming to set up and maintain. Don’t choose such tools.  Instead, look for tools that give you good defaults and freedom to customize which metrics to collect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage of log formats&lt;/strong&gt;. A typical application stack consists of multiple components like databases, web servers, message queues, etc. Make sure that you can &lt;a href="https://sematext.com/logagent/" rel="noopener noreferrer"&gt;structure logs&lt;/a&gt; from your applications. This is key if you want to use your logs not only for troubleshooting, but also for deriving insights from them.  Defining &lt;a href="https://sematext.com/docs/logagent/parser/" rel="noopener noreferrer"&gt;log parser patterns&lt;/a&gt; with regular expressions or grok is time-consuming, so it is very helpful to have a &lt;a href="https://github.com/sematext/logagent-js/blob/master/patterns.yml" rel="noopener noreferrer"&gt;library of existing patterns&lt;/a&gt;. This is a time saver, especially in the container world when you use official Docker images.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collection of events&lt;/strong&gt;. Any indication of why a service was restarted or crashed will help you classify problems quickly and get to the root cause faster. Any container monitoring tool should thus be collecting &lt;a href="https://docs.docker.com/engine/reference/commandline/events/" rel="noopener noreferrer"&gt;Docker events&lt;/a&gt; and Kubernetes status events if you run Kubernetes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correlation of metrics, logs, and traces&lt;/strong&gt;. Whether you initially spot a problem through metrics, logs, or traces, having access to all these observability data makes troubleshooting so much faster. A single UI displaying data from various sources is thus key for an interactive drill down, fast troubleshooting, faster MTTR and, frankly, makes devops’ job more enjoyable. &lt;a href="https://sematext.com/metrics-and-logs/" rel="noopener noreferrer"&gt;See example&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine Learning capabilities and&lt;/strong&gt; &lt;strong&gt;&lt;a href="https://sematext.com/alerts/" rel="noopener noreferrer"&gt;anomaly detection for alerting on logs and metrics&lt;/a&gt;&lt;/strong&gt;. Threshold-based alerts work well only for known and constant workloads. In dynamic environments, threshold-based alerts create too much noise.  Make sure the solution you select has this core capability and that it doesn’t take ages to learn the baseline or require too much tweaking, training, and such.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detect and&lt;/strong&gt; &lt;strong&gt;&lt;a href="https://sematext.com/product-updates/#/2018/automatic-metrics-correlation" rel="noopener noreferrer"&gt;correlate metrics&lt;/a&gt;&lt;/strong&gt; &lt;strong&gt;with the same behavior.&lt;/strong&gt; When metrics behave in similar patterns, we typically find that one of the metrics is the symptom of the root cause of a performance bottleneck. A good example we have seen in practice is high CPU usage paired with container swap activity and disk I/O – in such a case CPU usage, and even more so disk I/O, could be reduced by switching off swapping for containers.  For the system metrics above the correlation is often known – but when you track your application-specific metrics you might find new correlations and bottlenecks in your microservices to optimize.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single sign-on.&lt;/strong&gt;  Correlating data stored in silos is impossible.  Moreover, using multiple services often requires multiple accounts and forces you to learn not one, but multiple services, their UIs, etc.  Each time you need to use both of them there is the painful overhead of needing to adjust things like time ranges before you can look at data in them in separate windows. This costs time and money and makes it harder to share data with the team.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Role-based access control.&lt;/strong&gt;  Lack of RBAC is going to be a show-stopper for any tool seeking adoption at the corporate level. Tools that work fine for small teams and SMBs, but lack multi-user support with roles and permissions, almost never meet the requirements of large enterprises.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;To conclude, DevOps engineers need a well-integrated monitoring, logging, and tracing solution with advanced functionality like correlation between metrics, traces, and logs. The engineering and infrastructure costs saved by not running in-house monitoring can quickly pay off. Adjustable data retention times per monitored service help to optimize costs and satisfy operational needs. A better user experience for your DevOps team helps a lot; in particular, faster troubleshooting minimizes the revenue loss once a critical bug or performance issue hits your commercial services. In part 4 we will describe &lt;a href="https://sematext.com/blog/docker-container-monitoring-with-sematext/" rel="noopener noreferrer"&gt;container monitoring with Sematext&lt;/a&gt;. While developing &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;, we had the above ideas in mind with the goal to provide a better container monitoring solution.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://sematext.com/blog/open-source-docker-monitoring-logging/" rel="noopener noreferrer"&gt;Docker Container Monitoring Open Source Tools&lt;/a&gt; appeared first on &lt;a href="https://sematext.com" rel="noopener noreferrer"&gt;Sematext&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>logging</category>
      <category>monitoring</category>
      <category>devops</category>
      <category>docker</category>
    </item>
    <item>
      <title>Docker Container Performance Metrics</title>
      <dc:creator>Stefan Thies</dc:creator>
      <pubDate>Mon, 01 Apr 2019 16:25:18 +0000</pubDate>
      <link>https://forem.com/sematext/docker-container-performance-metrics-3a09</link>
      <guid>https://forem.com/sematext/docker-container-performance-metrics-3a09</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FDocker-monitoring-part-2-FI_black-1024x560.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FDocker-monitoring-part-2-FI_black-1024x560.jpg" alt="Docker container performance metrics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Part 1 we described &lt;a href="https://dev.to/sematext/docker-container-monitoring-and-management-challenges-4480-temp-slug-2993350"&gt;what makes monitoring container environments challenging&lt;/a&gt;: each container typically runs a single process, has its own environment, utilizes virtual networks, and has its own methods of managing storage. Traditional monitoring solutions take metrics from each server and the applications they run. These servers and the applications running on them are typically very static, with very long uptimes. Container deployments are different: a set of containers may run many applications, all sharing the resources of one or more underlying hosts. It’s not uncommon for Docker servers to run thousands of short-term containers (e.g., for batch jobs) while a set of permanent services runs in parallel. Traditional monitoring tools, not built for such dynamic environments, are not suited for such deployments. On the other hand, some modern monitoring solutions were built with such dynamic systems in mind and even have out-of-the-box reporting for &lt;a href="https://sematext.com/docker" rel="noopener noreferrer"&gt;Docker monitoring&lt;/a&gt;. Moreover, container resource sharing calls for stricter enforcement of resource usage limits, an additional issue you must watch carefully. To make appropriate adjustments for resource quotas you need good visibility into any limits containers may have reached or errors they may have encountered or caused. We recommend using &lt;a href="https://sematext.com/alerts" rel="noopener noreferrer"&gt;monitoring alerts&lt;/a&gt; according to defined limits; this way you can adjust limits or resource usage even before errors start happening.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2Fdocker-overview.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2Fdocker-overview.png" alt="docker container monitoring overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: All screenshots in this post are from &lt;a href="https://sematext.com/cloud" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt; and its &lt;a href="https://sematext.com/docker" rel="noopener noreferrer"&gt;Docker monitoring integration&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Watch Resources of Your Docker Hosts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Host CPU
&lt;/h3&gt;

&lt;p&gt;Understanding the CPU utilization of hosts and containers helps one optimize the resource usage of Docker hosts.  Container CPU usage can be throttled in order to avoid a single busy container slowing down other containers by using up all available CPU resources.  Throttling CPU time is a good way to ensure a minimum of processing power for essential services – it’s like the good old nice levels in Unix/Linux.&lt;/p&gt;

&lt;p&gt;When the resource usage is optimized, a high CPU utilization might actually be expected and even desired, and alerts might make sense only for when CPU utilization drops (think service outages) or increases for a longer period over some max limit (e.g. 85%).&lt;/p&gt;
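&lt;p&gt;The alerting rule described above can be sketched in a few lines of shell. This is a minimal illustration only – the utilization value is a hard-coded sample standing in for whatever your monitoring agent reports, and the 85% / 5% thresholds are the kind of limits you would tune per service.&lt;/p&gt;

```shell
# Minimal sketch of the alert logic described above: warn on sustained
# high CPU or on a suspicious drop toward zero. CPU_UTIL is a sample
# value; in a real setup it would come from your monitoring agent.
CPU_UTIL=92
if [ "$CPU_UTIL" -gt 85 ]; then
  STATUS="ALERT: sustained high CPU (${CPU_UTIL}%)"
elif [ "$CPU_UTIL" -lt 5 ]; then
  STATUS="ALERT: CPU near zero - possible service outage"
else
  STATUS="OK"
fi
echo "$STATUS"
```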

&lt;h3&gt;
  
  
  Host Memory
&lt;/h3&gt;

&lt;p&gt;The total memory used in each Docker host is important to know for the current operations and for capacity planning. Dynamic cluster managers like Docker Swarm use the total memory available on the host and the requested memory for containers to decide on which host a new container should ideally be launched. Deployments might fail if a cluster manager is unable to find a host with sufficient resources for the container. That’s why it is important to know the host memory usage and the memory limits of containers. Adjusting the capacity of new cluster nodes according to the footprint of Docker applications could help optimize resource usage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Host Disk Space
&lt;/h3&gt;

&lt;p&gt;Docker images and containers consume additional disk space. For example, an application image might include a Linux operating system and might have a size of 150-700 MB depending on the size of the base image and installed tools in the container. Persistent Docker volumes consume disk space on the host as well. In our experience watching the disk space and using cleanup tools are essential for continuous operations of Docker hosts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FBildschirmfoto-2016-03-22-um-22.48.09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FBildschirmfoto-2016-03-22-um-22.48.09.png" alt="Host Disk Space"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because disk space is very critical it makes sense to define alerts for disk space utilization to serve as early warnings and provide enough time to clean up disks or add additional volumes. For example, &lt;a href="https://sematext.com/spm/" rel="noopener noreferrer"&gt;Sematext Monitoring&lt;/a&gt; automatically sets alert rules for disk space usage for you, so you don’t have to remember to do it.&lt;/p&gt;

&lt;p&gt;A good practice is to run tasks to clean up the disk by removing unused containers and images frequently.&lt;/p&gt;
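&lt;p&gt;Such a cleanup task might look like the following sketch. The usage value is a hard-coded sample; in practice you would derive it from &lt;code&gt;df&lt;/code&gt; for the Docker data directory, and the prune commands (shown commented out) do the actual cleanup.&lt;/p&gt;

```shell
# Hedged sketch of a periodic cleanup task: prune unused Docker objects
# once disk usage crosses a threshold. USED_PCT is a sample value.
USED_PCT=91
THRESHOLD=85
ACTION="none"
if [ "$USED_PCT" -ge "$THRESHOLD" ]; then
  ACTION="prune"
  echo "disk at ${USED_PCT}% - pruning unused containers and images"
  # docker container prune -f
  # docker image prune -af
fi
```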

&lt;h2&gt;
  
  
  Total Number of Running Containers
&lt;/h2&gt;

&lt;p&gt;The current and historical number of containers is an interesting metric for many reasons.  For example, it is very handy during deployments and updates to check that everything is running like before.&lt;/p&gt;

&lt;p&gt;When cluster managers like Docker Swarm, Mesos, or Kubernetes automatically schedule containers to run on different hosts using different scheduling policies, the number of containers running on each host can help one verify the activated scheduling policies. A stacked bar chart displaying the number of containers on each host and the total number of containers provides a quick visualization of how the cluster manager distributed the containers across the available hosts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FScreen-Shot-2019-03-06-at-13.58.27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FScreen-Shot-2019-03-06-at-13.58.27.png" alt="Container counts per Docker host over time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Container counts per Docker host over time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This metric can have different “patterns” depending on the use case. For example, batch jobs running in containers vs. long-running services commonly result in different container count patterns.  A batch job typically starts a container on demand or starts it periodically, and the container with that job terminates after a relatively short time.  In such a scenario one might see a big variation in the number of containers running resulting in a “spiky” container count metric. On the other hand, long-running services such as web servers or databases typically run until they get re-deployed during software updates. Although scaling mechanisms might increase or decrease the number of containers depending on load, traffic, and other factors, the container count metric will typically be relatively steady because in such cases containers are often added and removed more gradually. Because of that, there is no general pattern we could use for a default Docker alert rule on the number of running containers.&lt;/p&gt;

&lt;p&gt;Nevertheless, alerts based on &lt;a href="https://sematext.com/alerts/" rel="noopener noreferrer"&gt;anomaly detection&lt;/a&gt;, which detect sudden changes in the number of the containers in total (or for specific hosts) in a short time window, can be very handy for most of the use cases. The simple threshold-based alerts make sense only when the maximum or the minimum number of running containers is known, and in dynamic environments that scale up and down based on external factors, this is often not the case.&lt;/p&gt;
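&lt;p&gt;A toy version of that idea, with sample counts standing in for periodic readings. Note that real anomaly detection learns a baseline over time; the fixed 20% change rule here is only an illustrative stand-in.&lt;/p&gt;

```shell
# Flag a sudden change in the container count between two samples.
# PREV/CURR are sample values; in practice they would come from
# periodic `docker ps -q | wc -l` readings.
PREV=120
CURR=85
DELTA=$((CURR - PREV))
ABS=${DELTA#-}
# Alert when the count moves by more than 20% of the previous sample.
if [ "$ABS" -gt $((PREV / 5)) ]; then
  echo "ALERT: container count changed by ${DELTA} in one interval"
fi
```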

&lt;h2&gt;
  
  
  Docker Container Metrics
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://sematext.com/docs/integration/docker" rel="noopener noreferrer"&gt;Docker container metrics&lt;/a&gt; are basically the same metrics available for every Linux process but include limits set via cgroups by Docker, such as limits for CPU or memory usage. Please note that sophisticated monitoring solutions like &lt;a href="https://sematext.com/cloud" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt; are able to aggregate container metrics on different levels like Docker Hosts/Cluster Nodes, Image Name or ID and Container Name or ID.  Having the ability to do that makes it easy to track resources usage by hosts, application types (image names) or specific containers. In the following examples, we might use aggregations on various levels.&lt;/p&gt;

&lt;h3&gt;
  
  
  Container CPU  – Throttled CPU Time
&lt;/h3&gt;

&lt;p&gt;One of the most basic pieces of information is how much CPU is being consumed by all containers, images, or by specific containers. A great advantage of using Docker is the capability to limit CPU utilization by containers.  Of course, you can’t tune and optimize something if you don’t measure it, so monitoring such limits is essential.  Observing the total time a container’s CPU usage was throttled provides the information one needs to adjust the setting for &lt;a href="https://docs.docker.com/engine/reference/run/#cpu-share-constraint" rel="noopener noreferrer"&gt;CPU shares in Docker&lt;/a&gt;. Please note that CPU time is throttled only when the host CPU usage is maxed out.  As long as the host has spare CPU cycles available for Docker, it will not throttle containers’ CPU usage. Therefore, throttled CPU is typically zero, and a spike in this metric is typically a good indication of one or more containers needing more CPU power than the host can provide.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FScreen-Shot-2019-03-06-at-13.49.47.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FScreen-Shot-2019-03-06-at-13.49.47.png" alt="Container CPU usage and throttled CPU time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Container CPU usage and throttled CPU time&lt;/em&gt;&lt;/p&gt;
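&lt;p&gt;On the host, this metric ultimately comes from the container’s cgroup accounting. A sketch, with sample &lt;code&gt;cpu.stat&lt;/code&gt; content standing in for the real file (cgroup v1 path assumed; field names differ slightly under cgroup v2):&lt;/p&gt;

```shell
# Extract the throttling counter from a container's cpu.stat file.
# In practice you would read /sys/fs/cgroup/cpu/docker/$CID/cpu.stat;
# here a sample stands in so the snippet is self-contained.
cpu_stat='nr_periods 2000
nr_throttled 25
throttled_time 1200000000'
THROTTLED=$(printf '%s\n' "$cpu_stat" | awk '/^nr_throttled/ {print $2}')
echo "CPU throttled in $THROTTLED of 2000 periods"
```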

&lt;h3&gt;
  
  
  Container Memory – Fail Counters
&lt;/h3&gt;

&lt;p&gt;It is a good practice to set memory limits for containers.  Doing that helps avoid a memory-hungry container taking all available memory and starving all other containers on the same server. Runtime constraints on resources can be defined in the &lt;a href="https://docs.docker.com/engine/reference/run/" rel="noopener noreferrer"&gt;Docker run command&lt;/a&gt;.  For example, “-m 300M” sets the memory limit for the container to 300 MB. Docker exposes a metric called container memory fail counters.  This counter is increased each time memory allocation fails — that is, each time the pre-set memory limit is hit.  Thus, spikes in this metric indicate one or more containers needing more memory than was allocated.  If the process in the container terminates because of this error, we might also see out of memory events from Docker.&lt;/p&gt;

&lt;p&gt;A spike in memory fail counters is a critical event and putting alerts on the memory fail counter is very helpful to detect wrong settings for the memory limits or to discover containers that try to consume more memory than expected.&lt;/p&gt;
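&lt;p&gt;A minimal sketch of checking that counter on a cgroup v1 host; the counter value below is a hard-coded sample and the image name is illustrative, not from the post.&lt;/p&gt;

```shell
# Starting a container with a 300 MB limit (as in the text) would be:
#   docker run -m 300M my-app          # my-app is an illustrative image
# The fail counter lives in the container's memory cgroup; the sample
# value stands in for:
#   cat /sys/fs/cgroup/memory/docker/$CID/memory.failcnt   # cgroup v1
FAILCNT=7
if [ "$FAILCNT" -gt 0 ]; then
  echo "memory limit hit ${FAILCNT} times - raise -m or debug the app"
fi
```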

&lt;h3&gt;
  
  
  Container Memory Usage
&lt;/h3&gt;

&lt;p&gt;Different applications have different memory footprints. Knowing the memory footprint of the application containers is important for having a stable environment. Container memory limits ensure that applications perform well, without using too much memory, which could affect other containers on the same host.  The best practice is to tune memory settings in a few iterations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Monitor memory usage of the application container&lt;/li&gt;
&lt;li&gt;Set memory limits according to the observations&lt;/li&gt;
&lt;li&gt;Continue monitoring of memory, memory fail counters, and Out-Of-Memory events. If OOM events happen, the container memory limits may need to be increased, or debugging is required to find the reason for the high memory consumption.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FBildschirmfoto-2016-03-22-um-22.53.56.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FBildschirmfoto-2016-03-22-um-22.53.56.png" alt="Container memory usage"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Container memory usage&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Container Swap
&lt;/h3&gt;

&lt;p&gt;Like the memory of any other process, a container’s memory could be swapped to disk. For applications like Elasticsearch or Solr one often finds instructions to deactivate swap on the Linux host – but if you run such applications on Docker it might be sufficient just to set &lt;code&gt;--memory-swappiness=0&lt;/code&gt; in the Docker run command!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FBildschirmfoto-2016-03-22-um-22.57.21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FBildschirmfoto-2016-03-22-um-22.57.21.png" alt="Container swap, memory pages, and swap rate"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Container swap, memory pages, and swap rate&lt;/em&gt;&lt;/p&gt;
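&lt;p&gt;For illustration, the flag in context (the image tag and memory limit are illustrative, and &lt;code&gt;--memory-swappiness&lt;/code&gt; applies on cgroup v1 hosts):&lt;/p&gt;

```shell
# Disable swapping for a single container instead of host-wide.
# Image name/tag and -m value are illustrative, not a recommendation.
docker run -d --memory-swappiness=0 -m 2g elasticsearch:7.17.0
```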

&lt;h3&gt;
  
  
  Container Disk I/O
&lt;/h3&gt;

&lt;p&gt;In Docker, multiple applications use the same resources concurrently.  Thus, watching the disk I/O helps one define limits for specific applications and give higher throughput to critical applications like data stores or web servers, while throttling disk I/O for batch operations. For example, the command &lt;code&gt;docker run -it --device-write-bps /dev/sda:1mb mybatchjob&lt;/code&gt; would limit the container disk writes to a maximum of 1 MB/s.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FBildschirmfoto-2016-02-08-um-19.09.55.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FBildschirmfoto-2016-02-08-um-19.09.55.png" alt="Container I/O throughput"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Container I/O throughput&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Container Network Metrics
&lt;/h3&gt;

&lt;p&gt;Networking for containers can be very challenging.  By default, all containers share a network, or containers might be linked together to share a separated network on the same host. However, when it comes to networking between containers running on different hosts an &lt;a href="https://en.wikipedia.org/wiki/Overlay_network" rel="noopener noreferrer"&gt;overlay network&lt;/a&gt; is required, or &lt;a href="https://docs.docker.com/network/network-tutorial-host/" rel="noopener noreferrer"&gt;containers could share the host network&lt;/a&gt;. Having many options for network configurations means there are many possible causes of network errors.&lt;/p&gt;
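&lt;p&gt;As a sketch of the overlay option mentioned above (the network and container names are illustrative, and an overlay network requires Swarm mode, e.g. after &lt;code&gt;docker swarm init&lt;/code&gt;):&lt;/p&gt;

```shell
# Create an attachable overlay network for cross-host container traffic
# and attach a container to it. Names here are illustrative.
docker network create --driver overlay --attachable app-net
docker run -d --network app-net --name web nginx
```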

&lt;p&gt;Moreover, errors and dropped packets are not the only things to watch out for.  Today, most applications depend heavily on network communication. The throughput of virtual networks could be a bottleneck, especially for containers like load balancers. In addition, network traffic might be a good indicator of how much applications are used by clients, and sometimes you might see high spikes, which could indicate denial-of-service attacks, load tests, or failures in client apps.&lt;/p&gt;

&lt;p&gt;So watch the container network traffic – it is a useful metric in many cases and complements &lt;a href="https://sematext.com/network-monitoring/" rel="noopener noreferrer"&gt;IT network monitoring&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FBildschirmfoto-2016-03-23-um-20.23.40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FBildschirmfoto-2016-03-23-um-20.23.40.png" alt="Network traffic and transmission rates"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Network traffic and transmission rates&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;There you have it — the top Docker metrics to watch. Staying focused on these top metrics and the corresponding analysis will help you stay on the road while driving towards successful Docker deployments on platforms such as Docker Enterprise, Kubernetes, AWS EKS, Google GKE or any other platform supporting Docker containers.&lt;/p&gt;

&lt;p&gt;If you’d like to learn even more about Docker Monitoring and Logging continue to read Part 3  &lt;a href="https://sematext.com/blog/open-source-docker-monitoring-logging/" rel="noopener noreferrer"&gt;Container Monitoring Tools&lt;/a&gt; of this series.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://sematext.com/blog/top-docker-metrics-to-watch/" rel="noopener noreferrer"&gt;Docker Container Performance Metrics&lt;/a&gt; appeared first on &lt;a href="https://sematext.com" rel="noopener noreferrer"&gt;Sematext&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>docker</category>
      <category>dockercloud</category>
      <category>dockerswarm</category>
    </item>
    <item>
      <title>Docker Container Monitoring and Management Challenges</title>
      <dc:creator>Stefan Thies</dc:creator>
      <pubDate>Mon, 01 Apr 2019 16:22:06 +0000</pubDate>
      <link>https://forem.com/sematext/docker-container-monitoring-and-management-challenges-3jfe</link>
      <guid>https://forem.com/sematext/docker-container-monitoring-and-management-challenges-3jfe</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FDocker-monitoring-part-1-FI_black-1024x560.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2FDocker-monitoring-part-1-FI_black-1024x560.jpg" alt="container monitoring challenges"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Organizations that adopt container orchestration tools for application deployment face a new maintenance challenge. Orchestration tools like Kubernetes or Docker Swarm are designed to decide to which host a container should be deployed, and potentially do that on an ongoing basis. Although this functionality is great for helping us make better use of the underlying infrastructure, it creates new challenges in production. While troubleshooting, a common first question is “Which specific container is having issues?”  The second question is often “On which host is it running?”  Being able to map container deployments to the underlying container hosts is essential for troubleshooting. But there is more to learn about &lt;a href="https://sematext.com/docker" rel="noopener noreferrer"&gt;container monitoring&lt;/a&gt;; let us see why &lt;a href="https://sematext.com/spm" rel="noopener noreferrer"&gt;infrastructure monitoring&lt;/a&gt; is different for containers.&lt;/p&gt;

&lt;p&gt;Monitoring a container infrastructure is different from traditional &lt;a href="https://sematext.com/server-monitoring" rel="noopener noreferrer"&gt;server monitoring&lt;/a&gt;. First of all, containers present a new infrastructure layer we simply didn’t have before. Secondly, we have to cope with dynamic placement in one or more clusters possibly running on different cloud services. Finally, containers provide new ways for resource management.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;New Infrastructure Layers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Containers add a new layer to the infrastructure and the mapping of containers to servers lets us see where exactly in our infrastructure each container is running. Modern container monitoring tools must thus discover all running containers automatically in order to capture dynamic changes in the deployment and update the container to host mapping in real-time. Thus, the traditional server performance monitoring that is not designed with such requirements in mind is inadequate for monitoring containers.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;New Dynamic Deployment and Orchestration&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Container orchestration tools like &lt;a href="https://sematext.com/kubernetes" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt; or &lt;a href="https://docs.docker.com/engine/swarm/" rel="noopener noreferrer"&gt;Docker Swarm&lt;/a&gt; are often used to dynamically allocate containers to the most suitable hosts, typically those that have sufficient resources to run them. Containers might thus move from one host to another, especially while scaling the number of containers up or down or during container redeployments. There is no static relation between hosts and services they are running anymore! For troubleshooting, this means one must first figure out which host is running which containers. And, vice versa, when a host exhibits poor performance, it’s highly valuable to be able to isolate which container is the source of the problem and if any containers suffer from the performance issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;New Resource Management and Metrics&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Containers’ use of resources can be restricted. Using inadequate container resource limits can lead to situations where a container performs poorly simply because it can’t allocate enough resources, while the cluster host itself might not be fully utilized. How can this problem be discovered and fixed?  A good example is monitoring memory fail counters – one of the &lt;a href="https://sematext.com/blog/top-docker-metrics-to-watch/" rel="noopener noreferrer"&gt;key container metrics&lt;/a&gt; – or throttled CPU time. In such situations, monitoring just the overall server performance would not indicate any slowness of containers hitting resource limits. Only monitoring the actual &lt;a href="https://sematext.com/docs/integration/docker/#metrics" rel="noopener noreferrer"&gt;container metrics&lt;/a&gt; for each container helps in this situation. Setting the right resource limits requires detailed knowledge of the resources a container might need under load. A good practice is to &lt;a href="https://sematext.com/docker" rel="noopener noreferrer"&gt;monitor container metrics&lt;/a&gt; and adjust the resource limits to match actual needs, or scale the number of containers behind a load balancer.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;New Log Management Needs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Docker changed not only how applications are deployed, but also the workflow for log management. Instead of writing logs to files, Docker logs are console output streams from containers. Docker Logging Drivers collect and forward logs to their destinations. To match this new logging paradigm, legacy applications need to be updated to write logs to the console instead of local log files. Some containers start multiple processes, so log streams might contain a mix of plain-text messages from start scripts and unstructured or structured logs from different containerized applications. The problem is obvious – you can’t just take both log streams (stderr/stdout) from multiple processes and containers, all mixed up, and treat them like a blob, or assume they all use the same log structure and format. You need to be able to tell which log event belongs to which container and which app, parse it correctly, etc.&lt;/p&gt;

&lt;p&gt;To do log parsing right, the origin of the container log output needs to be identified.  That knowledge can then be used to apply the right parser and add metadata like container name, container image, container ID to each log event. Docker Log Drivers simplified logging a lot, but there are still many &lt;a href="https://dev.to/sematext/top-10-docker-logging-gotchas-1mlk"&gt;Docker logging gotchas&lt;/a&gt;.  Luckily, &lt;a href="https://sematext.com/logagent/" rel="noopener noreferrer"&gt;modern log shippers&lt;/a&gt; integrate with Docker API or Kubernetes API and are able to apply log parsers for different applications.&lt;/p&gt;
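&lt;p&gt;As a minimal illustration (not the Sematext implementation), here is how the message and stream can be pulled out of a single log line written by Docker’s &lt;code&gt;json-file&lt;/code&gt; logging driver; a real log shipper would use a proper JSON parser and enrich each event with container name, image, and ID from the Docker API:&lt;/p&gt;

```shell
# One line as Docker's json-file driver stores it on disk.
line='{"log":"connection refused\n","stream":"stderr","time":"2019-01-01T10:00:00Z"}'

# Crude field extraction with sed; fine for a sketch, not for production.
msg=$(printf '%s' "$line" | sed -n 's/.*"log":"\([^"]*\)".*/\1/p' | sed 's/\\n$//')
stream=$(printf '%s' "$line" | sed -n 's/.*"stream":"\([^"]*\)".*/\1/p')

echo "[$stream] $msg"
```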

&lt;h2&gt;
  
  
  &lt;strong&gt;New Microservice Architecture and Distributed Transaction Tracing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Microservices have been around for over a decade under one name or another. Now that they are often deployed in separate containers, it became obvious we needed a way to trace transactions through the various microservice layers, from the client all the way down to queues, storage, calls to external services, etc. This created new interest in &lt;a href="https://sematext.com/tracing/" rel="noopener noreferrer"&gt;Distributed Transaction Tracing&lt;/a&gt;, which, although not new, has re-emerged as the &lt;a href="https://sematext.com/blog/opentracing-distributed-tracing-emerging-industry-standard/" rel="noopener noreferrer"&gt;third pillar of observability&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;New Container Monitoring Tools&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In Part 2 we’ll explore &lt;a href="https://sematext.com/blog/top-docker-metrics-to-watch/" rel="noopener noreferrer"&gt;key container metrics&lt;/a&gt; and in Part 3 we will learn &lt;a href="https://sematext.com/blog/open-source-docker-monitoring-logging/" rel="noopener noreferrer"&gt;essential monitoring commands and open-source monitoring tools for containers&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://sematext.com/blog/docker-container-monitoring-management-challenges/" rel="noopener noreferrer"&gt;Docker Container Monitoring and Management Challenges&lt;/a&gt; appeared first on &lt;a href="https://sematext.com" rel="noopener noreferrer"&gt;Sematext&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>monitoring</category>
      <category>containermonitoring</category>
      <category>containers</category>
    </item>
    <item>
      <title>Monitoring Elasticsearch with Sematext</title>
      <dc:creator>Stefan Thies</dc:creator>
      <pubDate>Wed, 27 Mar 2019 14:07:19 +0000</pubDate>
      <link>https://forem.com/sematext/monitoring-elasticsearch-with-sematext-2lcd</link>
      <guid>https://forem.com/sematext/monitoring-elasticsearch-with-sematext-2lcd</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F03%2FNew_Elasticsearch-monitoring-part-4-1024x560.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F03%2FNew_Elasticsearch-monitoring-part-4-1024x560.jpg" alt="Elasticsearch monitoring with sematext"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown in &lt;a href="https://sematext.com/blog/top-10-elasticsearch-metrics-to-watch/" rel="noopener noreferrer"&gt;Elasticsearch Key Metrics&lt;/a&gt;, the setup, tuning, and operations of Elasticsearch require deep insights into the performance metrics such as index rate, query rate, query latency, merge times, and many more. &lt;a href="https://sematext.com/" rel="noopener noreferrer"&gt;Sematext&lt;/a&gt; provides an excellent alternative to other Elasticsearch monitoring tools.&lt;/p&gt;

&lt;p&gt;Open-source tools to monitor Elasticsearch are free, but your time is not. Relatively speaking, they’re rather expensive. Thus, &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext&lt;/a&gt;&lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt; Cloud&lt;/a&gt; aims to save you time, effort, and your hair!&lt;/p&gt;

&lt;h2&gt;
  
  
  How Sematext Helps you Save Time
&lt;/h2&gt;

&lt;p&gt;Here are a few things you will &lt;strong&gt;NOT&lt;/strong&gt; have to do when using Sematext for &lt;strong&gt;Elasticsearch monitoring&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;figure out which metrics to collect and which ones to ignore&lt;/li&gt;
&lt;li&gt;give metrics meaningful labels&lt;/li&gt;
&lt;li&gt;hunt for metric descriptions in the docs so that you know what each one actually shows&lt;/li&gt;
&lt;li&gt;build charts to group metrics that you really want on the same charts, not several separate charts&lt;/li&gt;
&lt;li&gt;figure out which aggregation to use for each set of metrics (min? max? avg? something else?)&lt;/li&gt;
&lt;li&gt;build dashboards to combine charts with metrics you typically want to see together&lt;/li&gt;
&lt;li&gt;set up basic alert rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the above is not even the complete story. Do you want to collect Elasticsearch logs? How about structuring them? Sematext does all of this automatically for you!&lt;/p&gt;

&lt;p&gt;In this post, we will look at how Sematext provides more comprehensive – and easy to set up – monitoring for Elasticsearch and other technologies in your infrastructure. By combining events, logs, and metrics together in one integrated &lt;a href="https://sematext.com/cloud" rel="noopener noreferrer"&gt;full stack observability platform&lt;/a&gt; and using the Sematext &lt;a href="https://sematext.com/blog/now-open-source-sematext-monitoring-agent/" rel="noopener noreferrer"&gt;open-source monitoring agent&lt;/a&gt; and its integrations, which are also open-source, you can monitor your whole infrastructure and apps, not just your Elasticsearch cluster. You can also get deeper visibility into your entire software stack by collecting, processing, and analyzing your logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Elasticsearch Monitoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Collecting Elasticsearch Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Sematext Elasticsearch integration collects &lt;a href="https://sematext.com/docs/integration/elasticsearch/" rel="noopener noreferrer"&gt;over 100 different Elasticsearch metrics&lt;/a&gt; for JVM, index performance, cluster health, query performance and more. Sematext maintains and supports the &lt;a href="https://github.com/sematext/sematext-agent-integrations/tree/master/elasticsearch" rel="noopener noreferrer"&gt;official Elasticsearch monitoring integration&lt;/a&gt;. Moreover, the Sematext Elasticsearch integration is customizable and open source.&lt;/p&gt;

&lt;p&gt;Bottom line: you don’t need to deal with configuring the agent for metrics collection, which is the first huge time saver!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Installing Monitoring Agent&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Setting up the monitoring agent takes less than 5 minutes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an Elasticsearch App in the &lt;a href="https://apps.sematext.com/ui/monitoring-create" rel="noopener noreferrer"&gt;Integrations / Overview&lt;/a&gt; (or &lt;a href="https://apps.eu.sematext.com/ui/monitoring-create" rel="noopener noreferrer"&gt;Sematext Cloud Europe&lt;/a&gt;). This will let you install the agent and control access to your monitoring and logs data. The short &lt;a href="https://www.youtube.com/watch?v=TR_qXdR8DVk&amp;amp;index=14&amp;amp;list=PLT_fd32OFYpfLBFZz_HiafnqjdlTth1NS" rel="noopener noreferrer"&gt;What is an App in Sematext Cloud&lt;/a&gt; video has more details.&lt;/li&gt;
&lt;li&gt;Name your Elasticsearch monitoring App and, if you want to collect Elasticsearch logs as well, create a Logs App along the way.&lt;/li&gt;
&lt;li&gt;Install the Sematext Agent according to the &lt;a href="https://apps.sematext.com/ui/howto/Elasticsearch/overview" rel="noopener noreferrer"&gt;setup instructions&lt;/a&gt; displayed in the UI.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2Fcreate-elasticsearch-app.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F04%2Fcreate-elasticsearch-app.gif" alt="app creation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;App creation and setup instructions in Sematext Cloud&lt;/p&gt;

&lt;p&gt;For example, on Ubuntu, add Sematext Linux packages with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"deb http://pub-repo.sematext.com/ubuntu sematext main"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/sematext.list &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
wget &lt;span class="nt"&gt;-O&lt;/span&gt; - https://pub-repo.sematext.com/ubuntu/sematext.gpg.key | &lt;span class="nb"&gt;sudo &lt;/span&gt;apt-key add -
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;spm-client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then set up Elasticsearch monitoring by providing the Elasticsearch server connection details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;bash /opt/spm/bin/setup-sematext &lt;span class="nt"&gt;--monitoring-token&lt;/span&gt; &amp;lt;your-token-goes-here&amp;gt; &lt;span class="nt"&gt;--app-type&lt;/span&gt; elasticsearch &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--agent-type&lt;/span&gt; standalone &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--SPM_MONITOR_ES_NODE_HOSTPORT&lt;/span&gt; &lt;span class="s1"&gt;'localhost:9200'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--infra-token&lt;/span&gt; &amp;lt;your-token-goes-here&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In case you have Elasticsearch secured with HTTPS and basic authentication, you can add the following parameters to the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nt"&gt;--SPM_MONITOR_ES_NODE_BASICAUTH_USERNAME&lt;/span&gt; userName
&lt;span class="nt"&gt;--SPM_MONITOR_ES_NODE_BASICAUTH_PASSWORD&lt;/span&gt; passWord
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition, you need to specify HTTPS as the protocol in SPM_MONITOR_ES_NODE_HOSTPORT, as shown in the complete setup command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;bash /opt/spm/bin/setup-sematext &lt;span class="nt"&gt;--monitoring-token&lt;/span&gt; &amp;lt;your-token-goes-here&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--app-type&lt;/span&gt; elasticsearch &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--agent-type&lt;/span&gt; standalone &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--SPM_MONITOR_ES_NODE_HOSTPORT&lt;/span&gt; &lt;span class="s1"&gt;'https://localhost:9200'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--infra-token&lt;/span&gt; &amp;lt;your-token-goes-here&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--SPM_MONITOR_ES_NODE_BASICAUTH_USERNAME&lt;/span&gt; userName &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--SPM_MONITOR_ES_NODE_BASICAUTH_PASSWORD&lt;/span&gt; passWord
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Go grab a drink, but hurry! Elasticsearch metrics will start appearing in your charts in less than a minute.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Elasticsearch Monitoring Dashboard&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When you open the Elasticsearch App, you will find a predefined set of dashboards that organize more than 100 Elasticsearch metrics and general &lt;a href="https://sematext.com/server-monitoring/" rel="noopener noreferrer"&gt;server monitoring&lt;/a&gt; metrics into intuitively grouped charts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overview with charts for all key Elasticsearch metrics&lt;/li&gt;
&lt;li&gt;Operating System metrics such as CPU, memory, network, disk usage, etc.&lt;/li&gt;
&lt;li&gt;Java Virtual Machine metrics for Garbage collection, JVM Memory, JVM Threads and JVM open files&lt;/li&gt;
&lt;li&gt;Elasticsearch metrics

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Health&lt;/strong&gt;: The number of Elasticsearch nodes and shard status (active, relocating, initializing, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shard Stats&lt;/strong&gt;: The number of shards, shard status per index&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index Stats&lt;/strong&gt;: The number of indexed documents, size on disk, indexing rate, merging rate, merged documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search&lt;/strong&gt;: Request rate, query and fetch latency, realtime-get latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread Pools&lt;/strong&gt;: Number of threads per pool, thread pool size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Circuit Breakers&lt;/strong&gt;: Field data stats, request size stats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connections&lt;/strong&gt;: Connected sockets, node-to-node transport stats, TCP socket and traffic stats&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;
  &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F03%2Felasticsearch-metrics.gif" alt="elasticsearch metrics"&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.sematext.com/cloud" rel="noopener noreferrer"&gt;&lt;em&gt;Elasticsearch key metrics in Sematext Cloud&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Setup Alerts for Elasticsearch Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To save you time, Sematext automatically creates a set of default alert rules, such as alerts for low disk space. You can create additional alerts on any metric. Watch &lt;a href="https://www.youtube.com/watch?v=WE9xAUud28o" rel="noopener noreferrer"&gt;Alerts in Sematext Cloud&lt;/a&gt; for more details.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Alerting on Elasticsearch Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;There are &lt;a href="https://sematext.com/docs/alerts/" rel="noopener noreferrer"&gt;3 types of alerts&lt;/a&gt; in Sematext:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Heartbeat alerts&lt;/strong&gt;, which notify you when an Elasticsearch server is down&lt;/li&gt;
&lt;li&gt;Classic &lt;strong&gt;threshold-based alerts&lt;/strong&gt; that notify you when a metric value crosses a predefined threshold&lt;/li&gt;
&lt;li&gt;Alerts based on statistical &lt;strong&gt;anomaly detection&lt;/strong&gt; that notify you when metric values suddenly change and deviate from the baseline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s see how to actually create some alert rules for Elasticsearch metrics in the animation below. The request query count chart shows a spike. We normally have up to 100 requests, but we see it can jump to over 600 requests. To create an alert rule on a metric we’d go to the pulldown in the top right corner of a chart and choose “Create alert”. The alert rule applies the filters from the current view and you can choose various notification options such as email or configured &lt;a href="https://sematext.com/docs/alerts/#alert-integrations" rel="noopener noreferrer"&gt;notification hooks&lt;/a&gt; (PagerDuty, Slack, VictorOps, BigPanda, OpsGenie, Pusher, generic webhooks etc.). Alerts are triggered either by anomaly detection, watching metric changes in a given time window or through the use of classic threshold-based alerts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F03%2Felasticsearch-create-alert.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F03%2Felasticsearch-create-alert.gif" alt="elasticsearch create alert"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Alert creation for Elasticsearch request query count metric&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Elasticsearch Logs&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Shipping Elasticsearch Logs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Since having &lt;a href="https://sematext.com/metrics-and-logs/" rel="noopener noreferrer"&gt;logs and metrics in one platform&lt;/a&gt; makes troubleshooting simpler and faster let’s ship Elasticsearch logs too. You can use &lt;a href="https://sematext.com/docs/integration/#logging" rel="noopener noreferrer"&gt;many log shippers&lt;/a&gt;, but we’ll use &lt;a href="https://sematext.com/logagent/" rel="noopener noreferrer"&gt;Logagent&lt;/a&gt; because it’s lightweight, easy to set up, and because it can parse and structure logs out of the box. The log parser extracts timestamp, severity, and messages. For query traces, the log parser also extracts the unique query ID to group logs related to query execution.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://apps.sematext.com/ui/integrations" rel="noopener noreferrer"&gt;Create a Logs App&lt;/a&gt; to obtain an App token&lt;/li&gt;
&lt;li&gt;Install Logagent npm package
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; @sematext/logagent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you don’t have Node.js, you can install it easily. E.g. on Debian/Ubuntu:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sL&lt;/span&gt; https://deb.nodesource.com/setup_10.x | &lt;span class="nb"&gt;sudo&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; bash -
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nodejs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;Install the Logagent service by specifying the logs token and the path to Elasticsearch log files. You can use &lt;code&gt;-g '/var/log/**/elasticsearch*.log'&lt;/code&gt; to ship only logs from the Elasticsearch server. If you run other services on the same server, consider shipping all logs using &lt;code&gt;-g '/var/log/**/*.log'&lt;/code&gt;. The default settings ship all logs from &lt;code&gt;/var/log/**/*.log&lt;/code&gt; when the -g parameter is not specified. Logagent detects the init system and installs Systemd or Upstart service scripts. On Mac OS X it creates a launchd service. Simply run:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;logagent-setup &lt;span class="nt"&gt;-i&lt;/span&gt; YOUR_LOGS_TOKEN &lt;span class="nt"&gt;-g&lt;/span&gt; &lt;span class="s1"&gt;'/var/log/**/elasticsearch*.log'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the EU region, use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;logagent-setup &lt;span class="nt"&gt;-i&lt;/span&gt; LOGS_TOKEN &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-u&lt;/span&gt; logsene-receiver.eu.sematext.com &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-g&lt;/span&gt; &lt;span class="s1"&gt;'/var/log/**/elasticsearch*.log'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
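&lt;p&gt;The &lt;code&gt;**&lt;/code&gt; in the &lt;code&gt;-g&lt;/code&gt; parameter matches log files in nested directories. Here is a runnable sketch of that matching (assuming bash with the globstar option; a temporary directory stands in for &lt;code&gt;/var/log&lt;/code&gt;):&lt;/p&gt;

```shell
# '**' needs bash's globstar; without it '**' degrades to '*' (one level).
shopt -s globstar 2>/dev/null || true

base=$(mktemp -d)                      # stand-in for /var/log
mkdir -p "$base/elasticsearch"
touch "$base/elasticsearch/elasticsearch-cluster.log" "$base/syslog"

shipped=0
for f in "$base"/**/elasticsearch*.log; do
  echo "would ship: $f"                # only the Elasticsearch log matches
  shipped=$((shipped + 1))
done
rm -rf "$base"
```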



&lt;p&gt;The setup script generates the configuration file in &lt;code&gt;/etc/sematext/logagent.conf&lt;/code&gt; and starts Logagent as a system service.&lt;/p&gt;

&lt;p&gt;Note: if you run Elasticsearch in containers, &lt;a href="https://sematext.com/docs/logagent/installation-docker/" rel="noopener noreferrer"&gt;set up Logagent for container logs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Log Search and Dashboards&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once you have logs in Sematext you can search through them when troubleshooting, save queries you run frequently or &lt;a href="https://sematext.com/product-updates/#/2018/custom-reports-for-monitoring-logs-apps" rel="noopener noreferrer"&gt;create your individual logs dashboard&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F03%2Fclickhouse-logs-search.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F03%2Fclickhouse-logs-search.png" alt="logs search"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Search for Elasticsearch Logs&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log Search Syntax&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you know how to search with Google, you’ll know &lt;a href="https://sematext.com/docs/logs/search-syntax/" rel="noopener noreferrer"&gt;how to search your logs&lt;/a&gt; in Sematext Cloud.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use AND, OR, NOT operators – e.g. (error OR warn) NOT exception&lt;/li&gt;
&lt;li&gt;Group AND, OR, NOT clauses – e.g. message:(exception OR error OR timeout) AND severity:(error OR warn)&lt;/li&gt;
&lt;li&gt;Don’t like Booleans? Use + and – to include and exclude – e.g. +message:error -message:timeout -host:db1.example.com&lt;/li&gt;
&lt;li&gt;Use field references explicitly – e.g. message:timeout&lt;/li&gt;
&lt;li&gt;Need a phrase search? Use quotation marks – e.g. message:”fatal error”&lt;/li&gt;
&lt;/ul&gt;
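&lt;p&gt;These boolean queries can be approximated on plain log files with a grep pipeline. The first query above – (error OR warn) NOT exception – roughly corresponds to the pipeline below (a loose approximation on sample lines, not how the Sematext query engine works):&lt;/p&gt;

```shell
# Sample log lines standing in for real output.
matches=$(printf '%s\n' \
  "error: disk full" \
  "warn: high latency" \
  "info: service started" \
  "error: unhandled exception" |
  grep -E 'error|warn' |   # (error OR warn)
  grep -v 'exception')     # NOT exception

echo "$matches"
```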

&lt;p&gt;When digging through logs you might find yourself running the same searches again and again. To solve this annoyance, Sematext lets you save queries so you can re-execute them quickly without having to retype them. Please watch how &lt;a href="https://www.youtube.com/watch?v=glwZ8OCV0kc" rel="noopener noreferrer"&gt;using logs for troubleshooting&lt;/a&gt; simplifies your work.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Alerting on Elasticsearch Logs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To create an alert on logs, we start by running a query that matches exactly those log events that we want to be alerted about. To create the alert, just click the floppy disk icon.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F03%2Fpasted-image-0-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F03%2Fpasted-image-0-2.png" alt="alerts on elasticsearch logs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similar to the setup of metric alert rules, we can define threshold-based or anomaly detection alerts based on the number of matching log events the alert query returns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F03%2FScreen-Shot-2019-02-13-at-12.32.04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F03%2FScreen-Shot-2019-02-13-at-12.32.04.png" alt="alert screen"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please watch &lt;a href="https://www.youtube.com/watch?v=WE9xAUud28o" rel="noopener noreferrer"&gt;Alerts in Sematext Cloud&lt;/a&gt; for more details.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Elasticsearch Metrics and Log Correlation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A typical troubleshooting workflow starts from detecting a spike in the metrics, then digging into logs to find the root cause of the problem. Sematext makes this really simple and fast. Your metrics and logs live under the same roof. Logs are centralized, the search is fast, and the powerful &lt;a href="https://sematext.com/docs/logs/search-syntax/" rel="noopener noreferrer"&gt;log search syntax&lt;/a&gt; is simple to use.  Correlation of metrics and logs is literally one click away.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F03%2Fpasted-image-0-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2019%2F03%2Fpasted-image-0-1.png" alt="logs and metrics correlation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Elasticsearch logs and metrics in a single view&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Monitor Elasticsearch with Sematext&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Comprehensive monitoring for Elasticsearch involves identifying key metrics for Elasticsearch, collecting metrics and logs, and then connecting everything in a meaningful way. In this post, we’ve shown you how to monitor Elasticsearch metrics and logs in one place. We used OOTB and customized dashboards, metrics correlation, log correlation, anomaly detection, and alerts. And with other &lt;a href="https://sematext.com/blog/now-open-source-sematext-monitoring-agent/" rel="noopener noreferrer"&gt;open-source integrations&lt;/a&gt;, like &lt;a href="https://sematext.com/integrations/kafka-monitoring/" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt;, you can easily start monitoring Elasticsearch alongside metrics, logs, and distributed request traces from all of the other technologies in your infrastructure. Get deeper visibility into Elasticsearch today with a &lt;a href="https://apps.sematext.com/ui/registration" rel="noopener noreferrer"&gt;free Sematext trial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://sematext.com/blog/monitoring-elasticsearch-with-sematext/" rel="noopener noreferrer"&gt;Monitoring Elasticsearch with Sematext&lt;/a&gt; appeared first on &lt;a href="https://sematext.com" rel="noopener noreferrer"&gt;Sematext&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>elasticsearch</category>
      <category>metrics</category>
    </item>
    <item>
      <title>Monitoring ClickHouse with Sematext</title>
      <dc:creator>Stefan Thies</dc:creator>
      <pubDate>Fri, 15 Feb 2019 09:42:58 +0000</pubDate>
      <link>https://forem.com/sematext/monitoring-clickhouse-with-sematext-2pab</link>
      <guid>https://forem.com/sematext/monitoring-clickhouse-with-sematext-2pab</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nTXFPfhB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/clickhouse-monitoring-comprehensive-monitoring-sematext-1024x560.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nTXFPfhB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/clickhouse-monitoring-comprehensive-monitoring-sematext-1024x560.jpg" alt="clickhouse monitoring comprehensive monitoring sematext"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown in Part 1 – &lt;a href="https://sematext.com/blog/clickhouse-monitoring-key-metrics/"&gt;ClickHouse Monitoring Key Metrics&lt;/a&gt; – the setup, tuning, and operations of ClickHouse require deep insights into the performance metrics such as locks, replication status, merge operations, cache usage and many more. Sematext provides an excellent alternative to other &lt;a href="https://sematext.com/blog/clickhouse-monitoring-tools/"&gt;ClickHouse monitoring tools&lt;/a&gt;, a more comprehensive – and easy to set up – monitoring solution for ClickHouse and other technologies in your infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How much hair do you have?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Open-source monitoring tools are free, but your time is not. Relatively speaking, they’re actually rather expensive. Thus, Sematext aims to save you time, effort… and your hair.&lt;/p&gt;

&lt;p&gt;Here are a few things you will &lt;strong&gt;NOT&lt;/strong&gt; have to do when using Sematext for &lt;strong&gt;ClickHouse monitoring&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;figure out which metrics to collect and which ones to ignore&lt;/li&gt;
&lt;li&gt;give metrics meaningful labels&lt;/li&gt;
&lt;li&gt;hunt for metric descriptions in the docs so that you know what each of them actually shows&lt;/li&gt;
&lt;li&gt;build charts to group metrics that you really want on the same charts, not N separate charts&lt;/li&gt;
&lt;li&gt;figure out, for each metric, which aggregation to use (min? max? avg? something else?)&lt;/li&gt;
&lt;li&gt;build dashboards to combine charts with metrics you typically want to see together&lt;/li&gt;
&lt;li&gt;set up basic alert rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the above is not even the complete story. Want to collect ClickHouse logs? Want to structure them? Prepare to do more legwork – unless you use Sematext, which does all of this automatically for you!&lt;/p&gt;

&lt;p&gt;In this post, we will look at how Sematext provides more comprehensive – and easy to set up – monitoring for ClickHouse and other technologies in your infrastructure by combining events, logs, and metrics together in one integrated &lt;a href="https://sematext.com/cloud"&gt;full stack observability platform&lt;/a&gt;. By using the Sematext &lt;a href="https://sematext.com/blog/now-open-source-sematext-monitoring-agent/"&gt;open-source monitoring agent&lt;/a&gt; and its integrations, which are also open-source, you can monitor your whole infrastructure and apps, not just your ClickHouse DB. You can also get deeper visibility into your full stack by collecting, processing, and analyzing your logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;ClickHouse Monitoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Collecting ClickHouse Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Sematext ClickHouse integration collects &lt;a href="https://sematext.com/docs/integration/clickhouse/"&gt;over 70 different ClickHouse metrics&lt;/a&gt; for system, queries, merge tree, replication, replicas, mark cache, R/W buffers, dictionaries, locks, distributed engine, as well as Zookeeper errors &amp;amp; wait times. Sematext maintains and supports the &lt;a href="https://github.com/sematext/sematext-agent-integrations/tree/master/clickhouse"&gt;official ClickHouse monitoring integration&lt;/a&gt;. Moreover, the Sematext ClickHouse integration is customizable and open source.&lt;/p&gt;

&lt;p&gt;Bottom line: you don’t need to deal with configuring the agent for metrics collection, which is the first huge time saver!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Installing Monitoring Agent&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Setting up the monitoring agent takes less than 5 minutes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a ClickHouse App in the &lt;a href="https://apps.sematext.com/ui/monitoring-create"&gt;Integrations / Overview&lt;/a&gt; (or &lt;a href="https://apps.eu.sematext.com/ui/monitoring-create"&gt;Sematext Cloud Europe&lt;/a&gt;). This will let you install the agent and control access to your monitoring and logs data. The short &lt;a href="https://www.youtube.com/watch?v=TR_qXdR8DVk&amp;amp;index=14&amp;amp;list=PLT_fd32OFYpfLBFZz_HiafnqjdlTth1NS"&gt;What is an App in Sematext Cloud&lt;/a&gt; video has more details.&lt;/li&gt;
&lt;li&gt;Name your ClickHouse monitoring App and, if you want to collect ClickHouse logs as well, create a Logs App along the way.&lt;/li&gt;
&lt;li&gt;Install the Sematext Agent according to the &lt;a href="https://apps.sematext.com/ui/howto/ClickHouse/overview"&gt;setup instructions&lt;/a&gt; displayed in the UI.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--di7y-cFi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/create-clickhouse.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--di7y-cFi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/create-clickhouse.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, on Ubuntu, add Sematext Linux packages with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo "deb http://pub-repo.sematext.com/ubuntu sematext main" | sudo tee /etc/apt/sources.list.d/sematext.list \&amp;gt; /dev/null wget -O - https://pub-repo.sematext.com/ubuntu/sematext.gpg.key | sudo apt-key add - sudo apt-get update sudo apt-get install spm-client
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then set up ClickHouse monitoring by providing the ClickHouse server connection details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo bash /opt/spm/bin/setup-sematext --monitoring-token --app-type clickhouse --agent-type standalone --SPM\_MONITOR\_CLICKHOUSE\_DB\_USER '' --SPM\_MONITOR\_CLICKHOUSE\_DB\_PASSWORD '' --SPM\_MONITOR\_CLICKHOUSE\_DB\_HOST\_PORT 'localhost:8123'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The last step: &lt;strong&gt;go grab a drink… but hurry&lt;/strong&gt; – ClickHouse metrics will start appearing in your charts in less than a minute.&lt;/p&gt;

&lt;h3&gt;
  
  
  ClickHouse Monitoring Dashboard
&lt;/h3&gt;

&lt;p&gt;When you open the ClickHouse App you find a predefined set of dashboards that organize more than 70 ClickHouse metrics and general &lt;a href="https://sematext.com/server-monitoring/"&gt;server monitoring&lt;/a&gt; metrics into an intuitively organized set of monitoring dashboards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overview with charts for all key ClickHouse metrics&lt;/li&gt;
&lt;li&gt;Operating system metrics such as CPU, memory, network, disk usage, etc.&lt;/li&gt;
&lt;li&gt;ClickHouse metrics

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query&lt;/strong&gt;: query time, query count, query memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge&lt;/strong&gt;: merged bytes, merge count, merged rows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MergeTree stats&lt;/strong&gt;: table size on disk, row count, and active part count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication&lt;/strong&gt;: replication part checks, failed replication part checks, lost replicated parts, distributed connection retries, and distributed connection fails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zookeeper&lt;/strong&gt;: a breakdown of various ZooKeeper errors, ZooKeeper wait times, and leader elections&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replicas&lt;/strong&gt;: replication status, replica parts, replica queue size, replica queue inserts, replica queue merges&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System&lt;/strong&gt;: memory allocator stats, active HTTP and TCP connections, network errors, and cache dictionary stats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;R/W Buffer&lt;/strong&gt;: reads/writes, open files, and file descriptor failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mark Cache&lt;/strong&gt;: mark cache hits and misses, and mark cache size&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Locks&lt;/strong&gt;: lock acquired read blocks, RW lock reader wait time, lock write wait times, lock read wait times&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ejkvha9V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/clickhouse-metrics-overview-2-1.gif" alt="clickhouse metrics overview 2 1"&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ClickHouse key metrics in Sematext Cloud&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Set Up ClickHouse Alerts
&lt;/h3&gt;

&lt;p&gt;To save you time, Sematext automatically creates a set of default alert rules, such as alerts for low disk space. You can create additional alerts on any metric. Watch &lt;a href="https://www.youtube.com/watch?v=WE9xAUud28o"&gt;Alerts in Sematext Cloud&lt;/a&gt; for more details.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Alerting on ClickHouse Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;There are &lt;a href="https://sematext.com/docs/alerts/"&gt;3 types of alerts&lt;/a&gt; in Sematext:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Heartbeat alerts&lt;/strong&gt; , which notify you when a ClickHouse DB server is down&lt;/li&gt;
&lt;li&gt;Classic &lt;strong&gt;threshold-based alerts&lt;/strong&gt; that notify you when a metric value crosses a pre-defined threshold&lt;/li&gt;
&lt;li&gt;Alerts based on statistical &lt;strong&gt;anomaly detection&lt;/strong&gt; that notify you when metric values suddenly change and deviate from the baseline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s see how to actually create some alert rules for ClickHouse metrics in the animation below. The Query Processing Memory chart shows a spike. We normally have a really low query memory close to 3 MB, but we see it can jump to over 80 MB. To create an alert rule on a metric we’d go to the pulldown in the top right corner of a chart and choose “Create alert”. The alert rule applies the filters from the current view and you can choose various notification options such as email or configured &lt;a href="https://sematext.com/docs/alerts/#alert-integrations"&gt;notification hooks&lt;/a&gt; (PagerDuty, Slack, VictorOps, BigPanda, OpsGenie, Pusher, generic webhooks etc.). Alerts are triggered either by anomaly detection, watching metric changes in a given time window or through the use of classic threshold-based alerts.&lt;/p&gt;
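Conceptually, the two classic alert types boil down to a fixed limit versus a rolling baseline. The Node.js sketch below is an illustration of that idea only, not Sematext’s actual detection algorithm:

```javascript
// Illustration only – not Sematext's actual alerting implementation.
// A threshold alert fires when a metric crosses a fixed limit; an
// anomaly alert fires when a value deviates strongly from a rolling
// baseline built from recent values.
function thresholdAlert(value, limit) {
  return value > limit;
}

class AnomalyDetector {
  constructor(windowSize, k) {
    this.windowSize = windowSize; // number of recent values to keep
    this.k = k;                   // deviations (in std devs) that count as "anomalous"
    this.values = [];
  }

  isAnomaly(value) {
    let anomalous = false;
    if (this.values.length >= 10) { // need a minimal baseline first
      const n = this.values.length;
      const mean = this.values.reduce((a, b) => a + b, 0) / n;
      const variance = this.values.reduce((a, b) => a + (b - mean) ** 2, 0) / n;
      const std = Math.max(Math.sqrt(variance), 1e-9); // avoid zero std on flat data
      anomalous = Math.abs(value - mean) > this.k * std;
    }
    this.values.push(value);
    if (this.values.length > this.windowSize) this.values.shift();
    return anomalous;
  }
}
```

With query memory normally around 3 MB, a jump to 80 MB would trip both a threshold rule set at, say, 50 MB and the baseline detector.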

&lt;p&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9znrjIln--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/clickhouse-query-memory-alert.gif" alt="clickhouse query memory alert"&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Alert creation for ClickHouse query memory&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;ClickHouse Logs&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Shipping ClickHouse Logs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Since having &lt;a href="https://sematext.com/metrics-and-logs/"&gt;logs and metrics in one platform&lt;/a&gt; makes troubleshooting simpler and faster let’s ship ClickHouse logs, too. You can use &lt;a href="https://sematext.com/docs/integration/#logging"&gt;many log shippers&lt;/a&gt;, but we’ll use &lt;a href="https://sematext.com/logagent/"&gt;Logagent&lt;/a&gt; because it’s lightweight, easy to set up and because it can parse and structure logs out of the box. The log parser extracts timestamp, severity, and messages. For query traces, the log parser also extracts the unique query ID to group logs related to query execution.&lt;/p&gt;
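To make the structuring step concrete, here is a simplified sketch (not Logagent’s actual parser rules) of how a ClickHouse server log line of the typical form `2019.02.13 12:32:04.123456 [ 42 ] {query-id} <Debug> message` can be split into timestamp, severity, message, and query ID:

```javascript
// Simplified sketch of parsing a ClickHouse server log line into
// timestamp, thread ID, query ID, severity, and message. This is an
// illustration, not Logagent's actual parser configuration.
function parseClickHouseLogLine(line) {
  const re = /^(\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}\.\d+) \[ (\d+) \] \{([^}]*)\} <(\w+)> (.*)$/;
  const m = line.match(re);
  if (m === null) return null; // not a ClickHouse-format line
  return {
    timestamp: m[1],
    threadId: m[2],
    queryId: m[3], // lets us group all logs belonging to one query execution
    severity: m[4],
    message: m[5],
  };
}
```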

&lt;p&gt;&lt;strong&gt;Step 1.&lt;/strong&gt;  &lt;a href="https://apps.sematext.com/ui/integrations"&gt;Create a Logs App&lt;/a&gt; to obtain an App token&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2.&lt;/strong&gt; Install Logagent npm package&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo npm i -g @sematext/logagent
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If you don’t have Node.js, you can install it easily, e.g. on Debian/Ubuntu:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -sL https://deb.nodesource.com/setup\_10.x | sudo -E bash -sudo apt-get install -y nodejs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3.&lt;/strong&gt; Install the Logagent service by specifying the logs token and the path to the ClickHouse log files. You can use &lt;em&gt;-g '/var/log/**/clickhouse*.log'&lt;/em&gt; to ship only logs from the ClickHouse server. If you run other services, such as ZooKeeper or MySQL, on the same server, consider shipping all logs using &lt;em&gt;-g '/var/log/**/*.log'&lt;/em&gt;. The default settings ship all logs from &lt;em&gt;/var/log/**/*.log&lt;/em&gt; when the &lt;em&gt;-g&lt;/em&gt; parameter is not specified.&lt;/p&gt;

&lt;p&gt;Logagent detects the init system and installs Systemd or Upstart service scripts. On Mac OS X it creates a Launchd service. Simply run:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo logagent-setup -i YOUR_LOGS_TOKEN -g '/var/log/**/clickhouse*.log'
# for the EU region:
# sudo logagent-setup -i YOUR_LOGS_TOKEN \
#   -u logsene-receiver.eu.sematext.com \
#   -g '/var/log/**/clickhouse*.log'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The setup script generates the configuration file in /etc/sematext/logagent.conf and starts Logagent as a system service.&lt;/p&gt;

&lt;p&gt;Note: if you run ClickHouse in containers, &lt;a href="https://sematext.com/docs/logagent/installation-docker/"&gt;set up Logagent for container logs&lt;/a&gt;. Be aware that the &lt;a href="https://github.com/yandex/ClickHouse/issues/2241"&gt;ClickHouse server does not log to the console when running in containers&lt;/a&gt;. You need to mount a modified ClickHouse server config file to /etc/clickhouse-server/config.xml in the container to enable logging to the console:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;logger&amp;gt;
  &amp;lt;console&amp;gt;1&amp;lt;/console&amp;gt;
&amp;lt;/logger&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Log Search and Dashboards
&lt;/h3&gt;

&lt;p&gt;Once you have logs in Sematext you can search them when troubleshooting, save queries you run frequently or &lt;a href="https://sematext.com/product-updates/#/2018/custom-reports-for-monitoring-logs-apps"&gt;create your individual logs dashboard&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7btJwCCL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/clickhouse-logs-search.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7btJwCCL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/clickhouse-logs-search.png" alt="clickhouse logs search"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Search for ClickHouse Logs&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log Search Syntax&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you know how to search with Google, you’ll know &lt;a href="https://sematext.com/docs/logs/search-syntax/"&gt;how to search your logs&lt;/a&gt; in Sematext Cloud.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;em&gt;AND, OR, NOT&lt;/em&gt; operators – e.g. &lt;em&gt;(error OR warn) NOT exception&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Group your &lt;em&gt;AND, OR, NOT&lt;/em&gt; clauses – e.g. &lt;em&gt;message:(exception OR error OR timeout) AND severity:(error OR warn)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Don’t like Booleans? Use + and – to include and exclude – e.g. &lt;em&gt;+message:error -message:timeout -host:db1.example.com&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Use explicit field references – e.g. &lt;em&gt;message&lt;/em&gt;:timeout&lt;/li&gt;
&lt;li&gt;Need a phrase search? Use quotation marks – e.g. &lt;em&gt;message&lt;/em&gt;:”&lt;em&gt;fatal error”&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When digging through logs you might find yourself running the same searches again and again.  To solve this annoyance, Sematext lets you save queries so you can re-execute them quickly without having to retype them. Please watch how &lt;a href="https://www.youtube.com/watch?v=glwZ8OCV0kc"&gt;using logs for troubleshooting&lt;/a&gt; simplifies your work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Alerting on ClickHouse Logs
&lt;/h3&gt;

&lt;p&gt;To create an alert on logs we start by running a query that matches exactly those log events that we want to be alerted about. To create the alert, just click the floppy disk icon.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s5oUomj4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/pasted-image-0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s5oUomj4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/pasted-image-0.png" alt="pasted image 0"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similar to the setup of metric alert rules, we can define threshold-based or anomaly detection alerts based on the number of matching log events the alert query returns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nFQ5y0ob--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/Screen-Shot-2019-02-13-at-12.32.04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nFQ5y0ob--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/Screen-Shot-2019-02-13-at-12.32.04.png" alt="Screen Shot 2019 02 13 at 12.32.04"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please watch &lt;a href="https://www.youtube.com/watch?v=WE9xAUud28o"&gt;Alerts in Sematext Cloud&lt;/a&gt; for more details.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;ClickHouse Metrics and Log Correlation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A typical troubleshooting workflow starts from detecting a slowness in metrics, then digging into logs to find the root cause of the problem. Sematext makes this really simple and fast.  Your metrics and logs live under one roof.  Logs are centralized, the search is fast, and powerful &lt;a href="https://sematext.com/docs/logs/search-syntax/"&gt;log search syntax&lt;/a&gt; is simple to use.  Correlation of metrics and logs is literally a click away.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Bf_Nkh2I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/image3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Bf_Nkh2I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/image3.png" alt="image3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ClickHouse logs and metrics in a single view&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Full Stack Observability for &lt;strong&gt;ClickHouse &amp;amp; Friends&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;ClickHouse’s best friends are ZooKeeper, MySQL, and Kafka. While ZooKeeper is essential for ClickHouse cluster operations, the other integrations are optionally used to access data from external storage.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ZooKeeper&lt;/strong&gt; : ClickHouse relies on ZooKeeper to synchronize distributed workloads in ClickHouse clusters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MySQL&lt;/strong&gt; : ClickHouse supports &lt;a href="https://clickhouse.yandex/docs/en/operations/table_engines/mysql/"&gt;MySQL as an external storage engine for ClickHouse&lt;/a&gt; tables. In addition, MySQL can serve as an external ClickHouse dictionary for key/value lookups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka&lt;/strong&gt; : Apache &lt;a href="https://clickhouse.yandex/docs/en/operations/table_engines/kafka/"&gt;Kafka&lt;/a&gt; can be used as an external storage engine for ClickHouse tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Others&lt;/strong&gt; : Several other integrations exist for &lt;a href="https://clickhouse.yandex/docs/en/query_language/dicts/external_dicts_dict_sources/"&gt;external dictionaries&lt;/a&gt; (data lookups), such as generic ODBC interfaces or MongoDB or PostgreSQL (3rd party) integrations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ClickHouse and Zookeeper
&lt;/h3&gt;

&lt;p&gt;Because ClickHouse cluster stability depends on ZooKeeper performing well, we recommend setting up &lt;a href="https://sematext.com/docs/integration/zookeeper/"&gt;ZooKeeper monitoring&lt;/a&gt; and log collection for related logs. As shown in the dashboard below, this allows us to, for example, correlate high ZooKeeper wait times in ClickHouse metrics, with JVM garbage collection metrics in ZooKeeper.&lt;/p&gt;

&lt;p&gt;Distributed systems require ZooKeeper responses to be quick, so any delays caused by slow JVM garbage collection can cause performance and cluster stability issues. Having the ability to easily correlate ZooKeeper and ClickHouse metrics, like in the &lt;a href="https://www.youtube.com/watch?v=BuMtZCLN_Mk&amp;amp;list=PLT_fd32OFYpfLBFZz_HiafnqjdlTth1NS&amp;amp;index=13"&gt;custom ClickHouse monitoring dashboard&lt;/a&gt; below, makes it faster to find and fix the root cause of bad performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--49LMXThR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/Screen-Shot-2019-02-08-at-16.06.22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--49LMXThR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/Screen-Shot-2019-02-08-at-16.06.22.png" alt="Screen Shot 2019 02 08 at 16.06.22"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Correlation between ZooKeeper/JVM Garbage Collection and ClickHouse ZooKeeper wait time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In many cases, a performance issue can be solved by analyzing “related” metrics as well as the metric that was suspicious at the beginning of the investigation.  For ad-hoc metric correlation analysis use the &lt;a href="https://sematext.com/product-updates/#/2018/automatic-metrics-correlation"&gt;automatic metrics correlation&lt;/a&gt; to find all metrics whose patterns correlate to any base metric of your choice.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F-1v-D5L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2019/02/Screen-Shot-2019-02-12-at-21.57.23.png" alt="Screen Shot 2019 02 12 at 21.57.23"&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Metric correlation of garbage collection time and request latency in ZooKeeper&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration with MySQL or Kafka
&lt;/h3&gt;

&lt;p&gt;ClickHouse integrates with MySQL. MySQL can be used as an external storage engine to query data from MySQL tables in ClickHouse queries. In addition, ClickHouse supports MySQL for data lookups as an external dictionary. A dictionary is a mapping (key -&amp;gt; attributes) that is convenient for various types of reference lists. ClickHouse supports special functions for working with dictionaries that can be used in queries. It is easier and more efficient to use dictionaries with functions than a JOIN with reference tables.  If you run MySQL, see &lt;a href="https://sematext.com/blog/announcement-mysql-performance-monitoring-in-spm/"&gt;MySQL Monitoring&lt;/a&gt; for more info.&lt;/p&gt;

&lt;p&gt;Apache Kafka is only integrated as an external storage engine. When ClickHouse faces high latency while requesting data from external data sources such as MySQL or Kafka, we have to figure out what is slowing those sources down. By setting up MySQL and &lt;a href="https://sematext.com/integrations/kafka-monitoring/"&gt;Kafka monitoring&lt;/a&gt; you get your ZooKeeper, MySQL, Kafka, and ClickHouse metrics together with related logs for faster troubleshooting. See &lt;a href="https://sematext.com/blog/kafka-consumer-lag-offsets-monitoring/"&gt;Monitoring Kafka and Consumer Lag&lt;/a&gt; to learn more.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Monitor ClickHouse with Sematext&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Comprehensive monitoring for ClickHouse involves identifying key metrics for both the ClickHouse cluster and ZooKeeper, collecting metrics and logs, and connecting everything in a meaningful way. In this post, we’ve shown you how to monitor ClickHouse metrics and logs in one place. We used out-of-the-box and customized dashboards, metrics correlation, log correlation, anomaly detection, and alerts. And with other &lt;a href="https://sematext.com/blog/now-open-source-sematext-monitoring-agent/"&gt;open-source integrations&lt;/a&gt;, like MySQL or Kafka, you can easily start monitoring ClickHouse alongside metrics, logs, and distributed request traces from all of the other technologies in your infrastructure. Get deeper visibility into ClickHouse today with a &lt;a href="https://apps.sematext.com/ui/registration"&gt;free Sematext trial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://sematext.com/blog/clickhouse-monitoring-sematext/"&gt;Monitoring ClickHouse with Sematext&lt;/a&gt; appeared first on &lt;a href="https://sematext.com"&gt;Sematext&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>clickhouse</category>
      <category>monitoring</category>
      <category>logging</category>
    </item>
    <item>
      <title>IoT: Air Pollution Tracking with Node.js, Elastic Stack, and MQTT</title>
      <dc:creator>Stefan Thies</dc:creator>
      <pubDate>Mon, 05 Mar 2018 15:34:39 +0000</pubDate>
      <link>https://forem.com/sematext/iot-air-pollution-tracking-with-nodejs-elastic-stack-and-mqtt-3lmi</link>
      <guid>https://forem.com/sematext/iot-air-pollution-tracking-with-nodejs-elastic-stack-and-mqtt-3lmi</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe28uva9n6epllxfghv6r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe28uva9n6epllxfghv6r.png" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What can you do with a couple of IoT devices, Node.js, Elasticsearch, and MQTT?  You can put together your own Internet of Things setup for measuring air pollution, like I have.  In this blog post, I’ll share all the details about the hardware setup, software configuration, data analytics, an IoT dashboard, and MQTT broker-based integration with other tools from the IoT ecosystem, like Node-Red and Octoblu.  Of course, I’ll also share a few interesting findings about air pollution IoT sensor measurements taken at a few locations in Germany. Take a look – doing this is &lt;em&gt;much&lt;/em&gt; easier than you might think when you use the right tools!&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Motivation&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Recently, the &lt;a href="https://en.wikipedia.org/wiki/Volkswagen_emissions_scanda" rel="noopener noreferrer"&gt;Volkswagen Emission Scandal&lt;/a&gt; (Wikipedia) escalated again. The reasons were controversial animal experiments, as &lt;a href="https://www.nytimes.com/2018/01/30/opinion/animal-experiment-volkswagen.html" rel="noopener noreferrer"&gt;reported by the New York Times&lt;/a&gt;. This sparked numerous debates about &lt;a href="http://abcnews.go.com/Health/wireStory/german-court-issue-verdict-diesel-ban-case-53381726" rel="noopener noreferrer"&gt;banning Diesel cars from city centers in Germany&lt;/a&gt;, where I live.  People talk about global car bans, but I’m amazed nobody is really talking about smart-city concepts yet. Besides the discussion around the cheating on nitrogen oxide emissions, the EU wants to enforce lower limits on &lt;a href="https://en.wikipedia.org/wiki/Particulates" rel="noopener noreferrer"&gt;particulates&lt;/a&gt; (measured in &lt;a href="https://www.epa.gov/pm-pollution/particulate-matter-pm-basics" rel="noopener noreferrer"&gt;PM10 and PM2.5&lt;/a&gt;) in Germany. The impact on health of high PM10 concentrations is described in “&lt;a href="https://www.sciencedirect.com/science/article/pii/S143846390470303X" rel="noopener noreferrer"&gt;Health effects of particles in ambient air&lt;/a&gt;”.&lt;/p&gt;

&lt;p&gt;Well, that is politics and medicine, and we are computer scientists, data engineers, or DevOps specialists, so I asked myself:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“&lt;em&gt;What can we do for environmental protection?&lt;/em&gt;” Living in a world where data-driven decisions are becoming more common, collecting data and visualizing facts is one way to contribute.&lt;/strong&gt;&lt;/p&gt;


&lt;p&gt;As the recent scandal shows, large companies might influence scientific studies and lobbyists influence governments, so why not collect open data and create independent analytics and &lt;em&gt;independent opinions&lt;/em&gt; based on open data – or your own data! We can help with recipes for device setups and software configuration, with sharing data in a platform and analyzing it, with interpretation, and by speaking about it in public at meetups, conferences, etc.&lt;/p&gt;

&lt;p&gt;As for me, I just wanted to see measurements in my environment because public government data lists only major cities and the reports they provide typically have low-resolution maps. So I decided to start a little IoT DIY project with off-the-shelf components to measure the air pollution with a particulate matter/dust sensor, tracking the PM10, PM2.5, as well as the PM2.5/PM10 ratio values.  I wanted to be able to do this with a mobile device and measure in various locations where I work and live. My office is close to the main street and close to an industrial area, but I recently moved into a new house in a rural town that feels like a “climatic spa” and actually has a health resort. To make it easy for others to put together Internet of Things systems like the one described here, I created &lt;a href="https://github.com/megastef/AirPollutionTracker" rel="noopener noreferrer"&gt;“Air Pollution Tracker”&lt;/a&gt;, so anyone can collect data at their locations, experiment with the setup, and share their data.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;The Hardware&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Ok, let’s get technical and first see the hardware setup of the IoT sensor device I put together:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6m3i7ukvqbtg0rvvt17e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6m3i7ukvqbtg0rvvt17e.png" width="635" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So that’s what our setup looks like. Let’s see what each part of this IoT sensor device is and does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Measuring Particulate Matter with a Nova SDS011 dust sensor&lt;/li&gt;
&lt;li&gt;Logging the location of the measurement with a GPS sensor&lt;/li&gt;
&lt;li&gt;Wi-Fi connection to my mobile phone to transmit measurement results via MQTT&lt;/li&gt;
&lt;li&gt;A power bank provides the power for the Banana-Pi device &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Banana_Pi" rel="noopener noreferrer"&gt;Banana-Pi&lt;/a&gt; (more powerful than &lt;a href="https://www.raspberrypi.org/" rel="noopener noreferrer"&gt;Raspberry Pi&lt;/a&gt;) with Debian Linux and &lt;a href="https://nodejs.org" rel="noopener noreferrer"&gt;Node.js&lt;/a&gt; for data collection and shipping of the sensor data &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that the USB power might not be sufficient for GPS, Wi-Fi, PM sensor, and an internal Ethernet interface.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;The Software&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The software architecture is based on &lt;a href="https://en.wikipedia.org/wiki/MQTT" rel="noopener noreferrer"&gt;MQTT&lt;/a&gt; messaging, which is designed to scale to thousands of devices and supports an easy way of sharing data in real time for any kind of processing. We created &lt;a href="https://sematext.com/docs/logagent/plugins/" rel="noopener noreferrer"&gt;open source plugins&lt;/a&gt; for &lt;a href="https://github.com/sematext/logagent-js" rel="noopener noreferrer"&gt;@sematext/logagent&lt;/a&gt; in Node.js to collect and correlate data from the &lt;em&gt;Nova SDS011&lt;/em&gt; sensor and the GPS device. The measurements are shipped in JSON format to an MQTT broker, which can store data in Elasticsearch or, as we did, in &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;. The MQTT-based architecture allows other clients to listen to the event stream and, for example, create alerts, post public tweets, or control traffic lights when PM10 limits are reached. In addition, the MQTT messages are recorded for historical analysis and visualisation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsm8o7gnmxxdv0lbfp2mc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsm8o7gnmxxdv0lbfp2mc.png" width="800" height="450"&gt;&lt;/a&gt;&lt;em&gt;Architecture&lt;/em&gt;&lt;/p&gt;
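To make the message flow concrete, here is a sketch of the kind of JSON measurement event such a pipeline publishes. The field names are illustrative assumptions, not the exact schema of the Logagent plugins:

```javascript
// Illustrative measurement event (field names are assumptions, not the
// exact schema used by the Logagent plugins described in this post).
function buildMeasurement(pm10, pm25, location) {
  return {
    timestamp: new Date().toISOString(),
    pm10: pm10,
    pm25: pm25,
    ratio: pm25 / pm10, // the PM2.5/PM10 ratio tracked in this project
    location: { lat: location.lat, lon: location.lon },
  };
}

// Publishing with the popular `mqtt` npm package would then look like:
//   const client = require('mqtt').connect('mqtt://broker.example.com');
//   client.publish('air/measurements', JSON.stringify(buildMeasurement(...)));
```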

&lt;h2&gt;
  
  
  &lt;strong&gt;Sniffing fresh air and collecting data from PM sensor&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The project started with a Google search for particulate matter sensors, checking device availability and Node.js drivers, because Node.js is my favourite programming language. After some research I ordered the &lt;em&gt;Nova SDS011&lt;/em&gt; with a USB-to-serial converter. Reading values from the serial port looked easy to implement, and the USB interface works on both my MacBook and the Banana-Pi device. The next step was creating an input plugin for &lt;a href="https://sematext.com/logagent/" rel="noopener noreferrer"&gt;@sematext/logagent&lt;/a&gt; to inject the sensor data into the Logagent processing pipeline. Logagent can ship data to MQTT, Elasticsearch, Apache Kafka, or a simple file output.  &lt;/p&gt;
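Under the hood, reading the sensor comes down to decoding the 10-byte data frames the SDS011 writes to the serial port. Here is a minimal sketch of that decoding, not the actual plugin code; the frame layout follows the published SDS011 serial protocol, and the serial-port wiring is only indicated in a comment:

```javascript
// Decode a 10-byte Nova SDS011 data frame:
// [0xAA, 0xC0, PM2.5 low, PM2.5 high, PM10 low, PM10 high, id1, id2, checksum, 0xAB]
// The sensor reports values in tenths of µg/m³.
function decodeSds011Frame (buf) {
  if (buf.length !== 10 || buf[0] !== 0xaa || buf[1] !== 0xc0 || buf[9] !== 0xab) {
    return null // not a valid data frame
  }
  // checksum is the low byte of the sum of the six data bytes (2..7)
  const checksum = buf.slice(2, 8).reduce((sum, b) => sum + b, 0) & 0xff
  if (checksum !== buf[8]) {
    return null // corrupted frame
  }
  return {
    'PM2.5': ((buf[3] << 8) + buf[2]) / 10,
    PM10: ((buf[5] << 8) + buf[4]) / 10
  }
}

// In a real plugin this would be fed from the serial port, e.g. with the
// "serialport" npm package:
//   new SerialPort({ path: '/dev/ttyUSB0', baudRate: 9600 }).on('data', ...)
```

A frame carrying PM2.5=7.6 and PM10=18 decodes to exactly the numeric fields that show up later in the JSON messages.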

&lt;p&gt;I wanted to measure air quality in multiple locations, so I needed to collect the location of each measurement. This would let me visualize air pollution on a map. The initial approach was to add static location information to the plugin configuration, but I then changed things to get the location information from other sources, like GPS or by tracking my iPhone. The &lt;a href="https://www.npmjs.com/package/logagent-novasds" rel="noopener noreferrer"&gt;Logagent plugin for the Nova SDS011 sensor&lt;/a&gt; is open source and published in the npm registry. The Logagent configuration for the Nova SDS011 plugin requires the module name and the name of the serial port. Optionally, you can specify the measurement collection frequency in minutes using the workingPeriod setting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input: 
  novaSDS011: 
    module: input-nova-sda011 
    comPort: /dev/ttyUSB0 
    # persistent setting for measurement interval in minutes 
    workingPeriod: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Getting accurate GPS position&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After setting up the serial port driver and Logagent, the first experiments started on my MacBook. To get an accurate GPS position when changing places, I wanted to track my location automatically. At first I used the Logagent plugin &lt;a href="https://www.npmjs.com/package/logagent-apple-location" rel="noopener noreferrer"&gt;logagent-apple-location&lt;/a&gt; to track the position of my iPhone. To do that I had to extend the PM sensor plugin to listen to “&lt;em&gt;location&lt;/em&gt;” events and enrich the sensor data with GPS coordinates and the retrieved address. That was a good start for experiments until my new GPS device finally arrived and I switched to the &lt;a href="https://www.npmjs.com/package/logagent-gps" rel="noopener noreferrer"&gt;logagent-gps&lt;/a&gt; plugin to get accurate GPS positions independently of internet connectivity. When an internet connection is present, the plugin queries the Google Maps API to find the address of the current location and uses a cache to avoid quickly hitting the Google API limit. The downside of the cache is a loss of accuracy: with the cache in place, street numbers and addresses don’t change within a few hundred meters. The configuration for the Logagent GPS plugin is very simple. It needs only the COM port for the serial interface and the npm module name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input: 
  gps: 
    module: logagent-gps
    comPort: /dev/ttyACM0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
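The cache trade-off described above can be sketched as rounding coordinates to a coarse grid before using them as a cache key. This is an illustrative sketch, not the plugin's actual implementation; `lookupAddress` stands in for the real Google Maps API call:

```javascript
// Cache reverse-geocoding results on a coarse coordinate grid to stay
// below the Google Maps API rate limit. Rounding to 3 decimal places
// (roughly 100 m) means the returned address only changes every few
// hundred meters -- the loss of accuracy mentioned above.
const addressCache = new Map()

function cacheKey (lat, lon, precision = 3) {
  return `${lat.toFixed(precision)},${lon.toFixed(precision)}`
}

async function reverseGeocode (lat, lon, lookupAddress) {
  const key = cacheKey(lat, lon)
  if (!addressCache.has(key)) {
    // lookupAddress is a placeholder for the real geocoding API call
    addressCache.set(key, await lookupAddress(lat, lon))
  }
  return addressCache.get(key)
}
```

Two positions a few dozen meters apart map to the same key, so only the first triggers an API request.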



&lt;h2&gt;
  
  
  &lt;strong&gt;Calculating values from sensor measurements&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Smaller particles are considered more dangerous, so it is interesting to look at the ratio of PM2.5 to PM10 values. The Nova SDS011 provides only PM10 and PM2.5 measurements, so the PM2.5/PM10 ratio needs to be calculated. Please note that the mass of PM2.5 particles is a subset of the PM10 particles, so the PM2.5 value is always smaller than the PM10 value. &lt;a href="https://sematext.com/docs/logagent/filters/#output-filter" rel="noopener noreferrer"&gt;Logagent supports JavaScript functions&lt;/a&gt; for input and output filters in the configuration file, and that is what we used here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# calculate PM2.5/PM10 ratio in percent 
outputFilter:
  - module: !!js/function &amp;gt;
      function (context, config, eventEmitter, data, callback)  {
        if (data.PM10 &amp;amp;&amp;amp; data.PM10 &amp;gt; 0) {
            data.PM25ratio = (data['PM2.5']/data.PM10) * 100
        }
        callback(null, data)
      }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The “data” variable holds the current measurement values, and the callback function needs to be called to return the modified data object after the calculation. The new data object now contains the PM10, PM2.5, and calculated PM25ratio values!&lt;/p&gt;
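Applied to the sample values that appear in the JSON message later in this post (PM10=18, PM2.5=7.6), the filter's calculation works out like this:

```javascript
// Same calculation as the output filter above, applied to sample values
const data = { 'PM2.5': 7.6, PM10: 18 }
if (data.PM10 && data.PM10 > 0) {
  data.PM25ratio = (data['PM2.5'] / data.PM10) * 100
}
// data.PM25ratio is now ~42.22, i.e. PM2.5 makes up about 42% of the PM10 mass
```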

&lt;h2&gt;
  
  
  &lt;strong&gt;Shipping and consuming sensor data with MQTT&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The standardized MQTT protocol has very small overhead and most IoT tools support MQTT. MQTT uses a pub/sub mechanism to distribute messages to multiple clients. In our case, the sensor device sends JSON messages to the MQTT broker using a topic called “sensor-data”. We use the &lt;a href="https://sematext.com/docs/logagent/output-plugin-mqtt/" rel="noopener noreferrer"&gt;Logagent MQTT output plugin&lt;/a&gt; and the public service mqtt://test.mosquitto.org. Please note that you should use the test.mosquitto.org server only for short tests. For a production setup you should run your own MQTT broker. For example, you could run the &lt;a href="https://hub.docker.com/_/eclipse-mosquitto/" rel="noopener noreferrer"&gt;Mosquitto MQTT broker in a Docker container&lt;/a&gt;, or you could use the &lt;a href="https://sematext.com/docs/logagent/input-plugin-mqtt-broker/" rel="noopener noreferrer"&gt;Logagent MQTT-broker plugin&lt;/a&gt; and run another instance of Logagent as an MQTT broker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output:  
  mqtt:    
    module: output-mqtt    
    url: mqtt://test.mosquitto.org
    topic: sensor-data
    debug: false
    # optional filter settings matching data field with regular expressions
    filter:
      field: logSource
      match: Nova
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can use any MQTT client on another machine connected to the same MQTT broker and subscribe to messages arriving in the “sensor-data” topic.&lt;/p&gt;
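A consumer could look roughly like the sketch below. The threshold check is pulled into a plain function so it can be tested on its own; the broker wiring assumes the `mqtt` npm package and is shown only in comments:

```javascript
// Decide whether a sensor-data message exceeds the EU (40) or German (50)
// PM10 limit. The payload is the JSON that the Logagent MQTT output
// publishes on the "sensor-data" topic.
function classifyPm10 (payload) {
  const data = JSON.parse(payload)
  if (typeof data.PM10 !== 'number') return 'unknown'
  if (data.PM10 > 50) return 'above-DE-limit'
  if (data.PM10 > 40) return 'above-EU-limit'
  return 'ok'
}

// Wiring it to a broker would look roughly like this, assuming the
// "mqtt" npm package (left commented out, as it needs a live broker):
//
//   const mqtt = require('mqtt')
//   const client = mqtt.connect('mqtt://test.mosquitto.org')
//   client.on('connect', () => client.subscribe('sensor-data'))
//   client.on('message', (topic, payload) => {
//     console.log(topic, classifyPm10(payload))
//   })
```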

&lt;p&gt;If you want to process measurements in some way or act on them, you can use tools like &lt;a href="https://nodered.org/" rel="noopener noreferrer"&gt;Node-Red&lt;/a&gt; or &lt;a href="https://octoblu.github.io/" rel="noopener noreferrer"&gt;Octoblu&lt;/a&gt; and create IoT workflows. For example, the MQTT plugin in Node-Red takes the MQTT broker address and topic as parameters, so you can use it to subscribe to the “sensor-data” topic and receive the measurements that were sent to the MQTT broker. As soon as you start Node-Red pointed at the MQTT broker, the air pollution data flows into your Node-Red workflow. Then you can perform various actions on, or based on, the received measurements. For example, you could tweet messages when conditions match, change the color of LED lights according to the sensor values, or control air conditioning … the possibilities are endless here! Thinking a little bigger, a smart city might choose to control traffic and use air pollution as one of the criteria for traffic routing decisions. The Node-Red architecture can plug together devices, logic elements, or &lt;a href="https://flows.nodered.org/node/node-red-contrib-neuralnet" rel="noopener noreferrer"&gt;neural network&lt;/a&gt; components. Node-Red is a great playground for prototyping any logic based on the air pollution measurements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegafts1mtgifsp3gd2gz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegafts1mtgifsp3gd2gz.png" width="800" height="321"&gt;&lt;/a&gt;Node-Red IoT workflows&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Storing data in Elasticsearch or Sematext Cloud&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We stored what is effectively IoT time-series sensor data directly in Sematext Cloud via the &lt;a href="https://sematext.com/docs/logagent/output-elasticsearch/" rel="noopener noreferrer"&gt;Logagent Elasticsearch plugin&lt;/a&gt;. Sematext Cloud provides Elasticsearch API compatible endpoints for data, dashboards, and alerts. The Elasticsearch plugin needs the Elasticsearch URL and index name. For Sematext Cloud we use the write token provided by the Sematext UI as the index name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sematext-cloud:
    module: elasticsearch
    url: https://logsene-receiver.sematext.com
    index: 9eed3c42-1e14-44d2-b319-XXXXXXX  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;The complete device setup for Banana-PI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The setup for the Banana-PI device in a few steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create Bananian (Debian) SD card&lt;/li&gt;
&lt;li&gt;Configure the Wi-Fi card for your mobile phone by setting the &lt;em&gt;wpa-essid&lt;/em&gt; and the &lt;em&gt;wpa-password&lt;/em&gt; in &lt;em&gt;/etc/network/interfaces&lt;/em&gt; for the wlan0 interface. Enable Internet tethering on your mobile phone (“Hotspot” on iPhone). &lt;/li&gt;
&lt;li&gt;Install Node.js
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   curl -sL https://deb.nodesource.com/setup\_8.x | bash - &amp;amp;&amp;amp; apt-get install -y nodejs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Install @sematext/logagent and relevant plugins
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     npm i -g --unsafe-perm @sematext/logagent logagent-gps logagent-novasds      npm i -g --unsafe-perm @sematext/logagent-nodejs-monitor      logagent-setup -t YOUR-TOKEN -e [https://logsene-receiver.sematext.com](https://logsene-receiver.sematext.com)        service logagent stop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="5"&gt;
&lt;li&gt;Create the Logagent configuration (see below). Test the configuration with
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logagent --config logagent.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy the working configuration to &lt;em&gt;/etc/sematext/logagent.conf&lt;/em&gt; and start the service with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Example for Logagent configuration
# Please adjust the following settings: 
#   input.novaSDS011.comPort
#   input.gps.comPort
#   input.nodejsMonitor.SPM_TOKEN
#   output.mqtt.url
#   output.elasticsearch.url
#   output.elasticsearch.indices

options: 
  # suppress log event output to console
  suppress: true
  # Write Logagent stats in the Logagent log file /var/log/logagent.log
  # The stats show how many events have been processed and shipped
  # Log interval in seconds
  printStats: 60

input:
  novaSDS011:
    module: input-nova-sda011
    # Find TTY name: ls -l /dev/tty* | grep 'dialout'
    comPort: /dev/ttyUSB0
    # Working period in minutes. The setting is persistent 
    # for the connected Nova SDS011 sensor
    # 1 minute measurement interval
    workingPeriod: 1

  gps: 
    module: logagent-gps
    # Find TTY name: ls -l /dev/tty* | grep 'dialout'
    comPort: /dev/ttyACM0
    # Emit only location event, to share the location with nova sensor
    emitOnlyLocationEvent: true
    # disable debug output
    debug: false

  # Optional, monitor logagent and device performance
  # Create in Sematext Cloud a Node.js monitoring app
  # to obtain the SPM_TOKEN
  nodejsMonitor: 
    module: @sematext/logagent-nodejs-monitor
    SPM_TOKEN: YOUR_SEMATEXT_NODEJS_MONITORING_TOKEN

  # collect all system logs for troubleshooting
  files: 
    - /var/log/**/*.log

# calculate PM2.5/PM10 ratio in percent 
outputFilter:
  - module: !!js/function &amp;gt;
      function (context, config, eventEmitter, data, callback)  {
        if (data.PM10 &amp;amp;&amp;amp; data.PM10 &amp;gt; 0) {
            data.PM25ratio = (data['PM2.5']/data.PM10) * 100
        }
        callback(null, data)
      }

output: 
  # print log events in yaml format
  # when options.suppress=false
  stdout: yaml
  # Forward sensor logs to MQTT broker
  mqtt:
    module: output-mqtt
    url: mqtt://test.mosquitto.org
    topic: sensor-data
    debug: false
    # optional filter settings matching data field with regular expressions
    # we use the filter to exclude the system log files
    filter: 
      field: logSource
      match: Nova

  # Store log events &amp;amp; sensor data in Sematext Cloud or Elasticsearch
  # Create a log application in Sematext Cloud to obtain a token
  elasticsearch:
    module: elasticsearch
    url: https://logsene-receiver.sematext.com
    # url: https://logsene-receiver.eu.sematext.com
    # url: http://127.0.0.1:9200 
    # We route system logs and sensor data to different indices
    # each index has a list of regular expressions matching the logSource field
    indices:
      # sensor data index
      YOUR_SEMATEXT_LOGS_TOKEN: 
        - Nova
      # system logs index
      ANOTHER_SEMATEXT_LOGS_TOKEN:
        - var.log.*



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;CPU and memory footprint&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A lot of what I do at Sematext has to do with &lt;a href="https://sematext.com/spm/" rel="noopener noreferrer"&gt;performance monitoring&lt;/a&gt;, so I couldn’t help myself and had to look at the telemetry of this DIY IoT setup of mine. The low resource usage of the Node.js based Logagent with less than 1% CPU and less than 34 MB memory is impressive!  Other logging tools like Logstash require 20 times more memory (600 MB+) and would use most of the resources on microcomputers like Banana-Pi or Raspberry-Pi and exhaust the battery in no time!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8568m1atagnyept90mbk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8568m1atagnyept90mbk.png" width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re curious about performance like I am, but also if you want to be notified when there are performance or stability issues with your setup you may want to add the &lt;a href="https://github.com/megastef/logagent-nodejs-monitor" rel="noopener noreferrer"&gt;logagent-nodejs-monitor&lt;/a&gt; plugin as shown below. Finally, we complete the configuration with the collection of all device logs with the file input plugin. The log files in /var/log contain valuable information like Wi-Fi status or USB device information.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input:
  nodejs-monitor:
    module: '@sematext/logagent-nodejs-monitor'
    SPM_TOKEN: 2f3e0e1f-94b5-47ad-8c72-6a09721515d8
  files: 
    - /var/log/**/*.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We restart Logagent to apply configuration changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_service restart logagent_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a few seconds, we will see logs and metrics in the Sematext UI. Having performance metrics and logs in one view is really valuable for any kind of troubleshooting. In my case, a USB wire had a bad contact and the lost USB connection was logged in /var/log/kern.log (see screenshot).   &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2p0m3gyd5hxphnbf14rc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2p0m3gyd5hxphnbf14rc.png" width="800" height="346"&gt;&lt;/a&gt;Node.js performance metrics and Banana-Pi logs in Sematext Cloud&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Visualizing the air pollution&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before we create visualisations, we need to know the data structure of the messages produced by the sensor and Logagent. Numeric values such as &lt;em&gt;PM10, PM2_5&lt;/em&gt; and &lt;em&gt;PM25ratio&lt;/em&gt; can easily be drawn as a date histogram. Maps can be created with the geo-coordinates. Having the address of each measurement makes it easy to find measurements in a specific city, and the hostname might help us identify the sensor device.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "@timestamp": "2018-02-05T20:59:38.230Z",
  "severity": "info",
  "host": "bananapi",
  "ip": "172.20.10.9",
  "PM2_5": 7.6,
  "PM10": 18,
  "geoip": {
    "location": [
      6.83125466218682,
      49.53914001560465
    ]
  },
  "address": "Weiskirchen, Germany",
  "city": "Weiskirchen",
  "country": "Germany",
  "logSource": "NovaSDS011",
  "PM25ratio": 42.22222222222222,
  "@timestamp_received": "2018-02-05T20:59:58.569Z",
  "logsene_original_type": "logs"
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Example JSON message stored in Elasticsearch / Sematext Cloud&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To visualize all the data I used Kibana, which is integrated in Sematext Cloud. Once the visualisations are created in Kibana, we can create a dashboard showing the map and sensor values. At first glance we can immediately see that air pollution is 50% lower in the north (where I live) than at the office, which is close to the main street.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm0w00xsrcneen1q19kz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm0w00xsrcneen1q19kz.png" width="800" height="498"&gt;&lt;/a&gt;Kibana Dashboard in Sematext Cloud&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Observation of Particulate Matter concentrations in various scenarios&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The following graph was recorded while traveling from my office to my home. The spike in the graph happened when I stopped my car and took the measurement device out of it. You can see that the PM10 value jumped up to &lt;strong&gt;80 µg/m³&lt;/strong&gt; for about a minute, double the EU limit of 40 µg/m³ average per year. Good to know that the air in my hometown has only half the particulate matter of the office location – at least as long as I don’t start my diesel engine … anyhow, a good reason to stay in the home office. &lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg51k36wfckepd3286mmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg51k36wfckepd3286mmn.png" alt="🙂" width="72" height="72"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6513lditxl4szcwv3yd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv6513lditxl4szcwv3yd.png" width="286" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrwlgrwmdzg9wpus0kx9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrwlgrwmdzg9wpus0kx9.png" width="422" height="417"&gt;&lt;/a&gt;PM10 levels in Weiskirchen (orange) and Nalbach (red)&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Environment.on(”smog”, alert)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Having dashboards is cool, but you can’t really watch a dashboard all day long, so let’s use alerts. The open source ELK stack has its limits – no built-in alerting – but we can use alerts in Sematext Cloud. Here, a saved query filtering for PM10 values greater than 40 (EU limit) or 50 (German limit) is used to trigger alerts:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvuf8s4kcpfoleohoi8tg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvuf8s4kcpfoleohoi8tg.png" width="800" height="753"&gt;&lt;/a&gt;Setup Alert for PM10 values above 50.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs721l211s12095modx1b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs721l211s12095modx1b.png" width="800" height="306"&gt;&lt;/a&gt;Alert notification in Sematext UI&lt;/p&gt;

&lt;p&gt;With such an alert in place, we can add the event stream with alerts (screenshot above) to a dashboard (screenshot below) and receive alerts via a Slack channel on a mobile phone, for example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6dd0qbxxaf1l5vaudc1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6dd0qbxxaf1l5vaudc1.png" width="800" height="500"&gt;&lt;/a&gt;Dashboard in Sematext UI, including Alert notifications&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqbnezevtnclzqcxsrn6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqbnezevtnclzqcxsrn6.png" width="615" height="192"&gt;&lt;/a&gt;Slack notification when PM10 reaches the configured threshold of PM10&amp;gt;40 or PM10&amp;gt;50&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The costs of various sensor devices are low and assembling the gadgets and setting up the software could be done in literally a few hours. It took me much longer to find good solutions for various tiny problems and to code a few &lt;a href="https://sematext.com/docs/logagent/plugins/" rel="noopener noreferrer"&gt;Logagent plugins&lt;/a&gt;, but even scripting a plugin module takes only a few hours. Using &lt;a href="https://sematext.com/cloud" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt; instead of a local ELK stack is a big time saver for the server setup (I don’t need any physical or cloud servers, just devices and Sematext SaaS). The alerting for Elasticsearch queries and forwarding of alerts to Slack made the solution complete.&lt;/p&gt;

&lt;p&gt;The biggest source of satisfaction in this project was to make the invisible visible with the “electronic nose” – you feel like a Ghostbuster! You see PM10 values increasing when a window gets opened, or when you start to vacuum the living room or when you forget your spaghetti on the stove while programming … &lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg51k36wfckepd3286mmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg51k36wfckepd3286mmn.png" alt="🙂" width="72" height="72"&gt;&lt;/a&gt; Outside sensors “smell” when a neighbor starts his car engine, a visitor parks his car in front of your house, a guest starts smoking a cigarette on the terrace…&lt;/p&gt;

&lt;p&gt;An interesting fact is that PM10 values are higher close to the main street and actually reached the EU limit (PM10&amp;gt;40) and the German limit (PM10&amp;gt;50) during rush hour in front of my office! The maximum value measured at my office window was PM10=69. The PM10 values drop again within just a few hundred meters of the main street. Think about how being aware of this could impact your life decisions – a move to a new flat or office could really impact your health. Knowing the time of the highest air pollution could also help to keep particles out of your flat. My measurements showed that airing the office room before 2pm and after 9pm would be best to keep the PM concentration low. Luckily, I recently moved to a small village, and the good things I find here are fresh air and, as you can see, inspiration and time for fresh ideas!&lt;/p&gt;

&lt;p&gt;The real surprise to me was that I ended up in politics by calling the city administration for an appointment with the mayor to discuss a traffic light that switches to red once the PM10 limit is reached. Cars could be kept out of the town because a bypass road already exists, but it is currently underutilized and should be used much more. I hope I get my appointment with the mayor scheduled soon, and when I speak with him I will have data to back up my suggestions. The administration first asked for a written letter in order to give me an official statement – let’s see if we finally get one more smart city on this planet making data-driven decisions in real life, and not only in business. &lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg51k36wfckepd3286mmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg51k36wfckepd3286mmn.png" alt="🙂" width="72" height="72"&gt;&lt;/a&gt; Stay tuned!&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://sematext.com/blog/iot-air-pollution-tracking-with-node-js-elastic-stack-and-mqtt/" rel="noopener noreferrer"&gt;IoT: Air Pollution Tracking with Node.js, Elastic Stack, and MQTT&lt;/a&gt; appeared first on &lt;a href="https://sematext.com" rel="noopener noreferrer"&gt;Sematext&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>node</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>iot</category>
    </item>
    <item>
      <title>Top 10 Docker Logging Gotchas</title>
      <dc:creator>Stefan Thies</dc:creator>
      <pubDate>Wed, 10 Jan 2018 08:17:01 +0000</pubDate>
      <link>https://forem.com/sematext/top-10-docker-logging-gotchas-1mlk</link>
      <guid>https://forem.com/sematext/top-10-docker-logging-gotchas-1mlk</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OxNZdF8Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2018/01/top-10-docker-logging-gotchas.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OxNZdF8Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2018/01/top-10-docker-logging-gotchas.jpg" alt="top 10 docker logging gotchas"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Docker changed not only how applications are deployed, it also changed the workflow for log management. Instead of writing logs to files, containers write logs to the console (stdout/stderr), and &lt;a href="https://docs.docker.com/engine/admin/logging/overview/"&gt;Docker Logging Drivers&lt;/a&gt; forward logs to their destination. A quick look at the Docker &lt;a href="https://github.com/moby/moby/labels/area%2Flogging"&gt;GitHub issues&lt;/a&gt; shows that users run into various problems when dealing with Docker logs. Managing logs with Docker is tricky and requires deeper knowledge of the Docker logging driver implementations, as well as of the alternatives that overcome the issues people report.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QkpkIsGv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2018/06/docker.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QkpkIsGv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2018/06/docker.svg" alt=""&gt;&lt;/a&gt; Execute commands in containers, Docker networks, Data cleanup and more…  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://sematext.com/docker-commands-cheat-sheet/?utm_medium=blogpost&amp;amp;utm_source=devto&amp;amp;utm_campaign=top-10-docker-logging-gotchas&amp;amp;utm_content=blog-docker-commands-cheatsheet"&gt;Docker Commands Cheat Sheet Download&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;So what are the top 10 Docker logging gotchas, every Docker user should know?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First, let’s start with an overview of Docker Logging Drivers and options to ship logs to centralized Log Management solutions such as the Elastic Stack (formerly the ELK Stack) or &lt;a href="https://sematext.com/cloud/"&gt;Sematext Cloud&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8AwvesWp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2018/01/top-10-docker-logging-gatchas-sematext.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8AwvesWp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2018/01/top-10-docker-logging-gatchas-sematext.png" alt="top 10 docker logging gatchas sematext"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the early days of Docker, container logs were only available via the Docker remote API, i.e. via the “docker logs” command, and a few advanced log shippers. Later on, Docker introduced logging drivers as plugins to open Docker up for integrations with various log management tools. These logging drivers are implemented as binary plugins in the Docker daemon. Recently, the plugin architecture was extended to run logging drivers as external processes, which can register as plugins and retrieve logs via a Unix socket. Currently, the logging drivers shipped with the Docker binaries are binary plugins, but this might change in the near future.&lt;/p&gt;

&lt;p&gt;Docker Logging Drivers receive container logs and forward them to remote destinations or files. The default logging driver is “json-file”. It stores container logs in JSON format on local disk. Docker has a plugin architecture for logging drivers, so there are plugins for open source tools and commercial tools available:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Journald – storing container logs in the system journal&lt;/li&gt;
&lt;li&gt;Syslog Driver – supporting UDP, TCP, TLS&lt;/li&gt;
&lt;li&gt;Fluentd – supporting TCP or Unix socket connections to fluentd &lt;/li&gt;
&lt;li&gt;Splunk – HTTP/HTTPS forwarding to Splunk server&lt;/li&gt;
&lt;li&gt;Gelf – UDP log forwarding to Graylog2&lt;/li&gt;
&lt;/ul&gt;
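&lt;p&gt;As a sketch of how driver selection works: the driver can be set per container with --log-driver, or daemon-wide in daemon.json. The fluentd address below is only an example value, not a real endpoint:&lt;/p&gt;

```shell
# Per-container: override the logging driver at start time
# (fluentd-address is an example value).
docker run --log-driver=fluentd \
  --log-opt fluentd-address=localhost:24224 \
  nginx

# Daemon-wide default for all new containers: /etc/docker/daemon.json
#   { "log-driver": "fluentd",
#     "log-opts": { "fluentd-address": "localhost:24224" } }
# followed by a restart of the Docker daemon.
```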

&lt;p&gt;For a complete log management solution, additional tools need to be involved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log parser to structure logs, typically part of log shippers (fluentd, rsyslog, logstash, logagent, …)&lt;/li&gt;
&lt;li&gt;Log indexing, visualisation and alerting: 

&lt;ul&gt;
&lt;li&gt;Elasticsearch and Kibana (Elastic Stack, also known as ELK stack)&lt;/li&gt;
&lt;li&gt;Splunk&lt;/li&gt;
&lt;li&gt;Logentries&lt;/li&gt;
&lt;li&gt;Loggly&lt;/li&gt;
&lt;li&gt;Sumologic&lt;/li&gt;
&lt;li&gt;Graylog OSS / Enterprise&lt;/li&gt;
&lt;li&gt;Sematext Cloud / Enterprise&lt;/li&gt;
&lt;li&gt;and many more…&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To ship logs to one of the backends you might need to select a logging driver or logging tool that supports your Log Management solution of choice. If your tool needs Syslog input, you might choose the Syslog driver.&lt;/p&gt;

&lt;h3&gt;
  
  
  Let’s look into Top 10 Docker logging gotchas every Docker user should know.
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Docker logs command works only with json-file Logging driver&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The default logging driver “json-file” writes logs to the local disk, and json-file is the only driver that works in parallel with the “docker logs” command. As soon as one uses an alternative logging driver, such as Syslog, Gelf or Splunk, the Docker logs API calls start failing, and the “docker logs” command shows an error reporting the limitation instead of displaying the logs on the console. Not only does the docker logs command fail, but many other tools that use the Docker API for logs, such as Docker user interfaces like Portainer or log collection containers like Logspout, are not able to show the container logs in this situation.&lt;/p&gt;

&lt;p&gt;See &lt;a href="https://github.com/moby/moby/issues/30887"&gt;https://github.com/moby/moby/issues/30887&lt;/a&gt;&lt;/p&gt;
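&lt;p&gt;One pragmatic consequence: if you want to keep “docker logs” working, stay on json-file and cap its disk usage with the driver’s rotation options instead of switching drivers. A sketch (the option values are illustrative):&lt;/p&gt;

```shell
# Keep the default json-file driver so "docker logs" still works,
# but limit local disk usage via log rotation (illustrative values):
docker run --log-driver=json-file \
  --log-opt max-size=10m \
  --log-opt max-file=5 \
  nginx
```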

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Docker Syslog driver can block container deployment&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Using Docker Syslog driver with TCP or TLS is a reliable way to deliver logs. However, the Syslog logging driver requires an established TCP connection to the Syslog server when a container starts up.  If this connection can’t be established at container start time, the container start fails with an error message like&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker: Error response from daemon: Failed to initialize logging driver: dial tcp&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This means a temporary network problem or high network latency could block the deployment of containers. In addition, a restart of the Syslog server could tear down all containers logging via TCP/TLS to a central Syslog server, which is definitely a situation to avoid.&lt;/p&gt;

&lt;p&gt;See: &lt;a href="https://github.com/docker/docker/issues/21966"&gt;https://github.com/docker/docker/issues/21966&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;3. Docker syslog driver loses logs when destination is down&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Similar to issue #2 above, logs can be lost because Docker logging drivers lack the ability to buffer logs on disk when they can’t be delivered to their remote destination.&lt;/p&gt;

&lt;p&gt;An interesting issue to watch: &lt;a href="https://github.com/moby/moby/issues/30979"&gt;https://github.com/moby/moby/issues/30979&lt;/a&gt;&lt;/p&gt;
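&lt;p&gt;Newer Docker versions offer a partial mitigation for gotchas #2 and #3: non-blocking log delivery through an in-memory ring buffer. It trades blocking for possible message loss when the buffer fills up. A sketch (address and buffer size are example values):&lt;/p&gt;

```shell
# mode=non-blocking decouples the container from a slow or unreachable
# log destination: log lines go through an in-memory ring buffer and
# are dropped (rather than blocked on) when the buffer is full.
docker run --log-driver=syslog \
  --log-opt syslog-address=tcp://syslog.example.com:514 \
  --log-opt mode=non-blocking \
  --log-opt max-buffer-size=4m \
  nginx
```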

&lt;h4&gt;
  
  
  &lt;strong&gt;4. Docker logging drivers don’t support multi-line logs like error stack traces&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;When we think about logs, most people think of simple single-line logs, like Nginx or Apache logs. However, logs can also span multiple lines. For example, exception stack traces typically span multiple lines, so to help Logstash users we’ve shared how to &lt;a href="https://sematext.com/blog/handling-stack-traces-with-logstash/"&gt;handle stack traces with Logstash&lt;/a&gt;. Things are no better in the world of containers, where they get even more complicated because logs from all apps running in a container get emitted to the same output – stdout. No wonder seeing &lt;a href="https://github.com/moby/moby/issues/22920"&gt;issue #22920&lt;/a&gt; closed with “Closed. Don’t care.” disappointed so many people. Luckily, there are tools like the Sematext &lt;a href="https://github.com/sematext/sematext-agent-docker"&gt;Docker Agent&lt;/a&gt; that can parse multi-line logs out of the box, as well as apply custom multi-line patterns.&lt;/p&gt;
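&lt;p&gt;If your log shipper only handles single-line events, multi-line logs can be re-joined before shipping. A minimal sketch using awk, assuming Java-style stack traces where continuation lines start with whitespace:&lt;/p&gt;

```shell
# Join indented continuation lines (e.g. stack trace frames) into the
# preceding log event, using a literal "\n" as the in-event separator.
join_multiline() {
  awk '
    NR == 1         { buf = $0; next }              # first line starts the buffer
    /^[^[:space:]]/ { print buf; buf = $0; next }   # new event: flush previous one
                    { buf = buf "\\n" $0 }          # continuation line: append
    END             { print buf }
  '
}

# Three input lines become two events: the trace frame is merged
# into the ERROR event.
printf 'ERROR boom\n  at Foo.bar(Foo.java:1)\nINFO ok\n' | join_multiline
```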

&lt;h4&gt;
  
  
  &lt;strong&gt;5. Docker service logs command hangs with non-json logging driver&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;While the json-file driver seems robust, other log drivers could unfortunately still cause trouble with Docker Swarm mode.&lt;/p&gt;

&lt;p&gt;See Github Issue: &lt;a href="https://github.com/docker/docker/issues/28793"&gt;https://github.com/docker/docker/issues/28793&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;6. Docker daemon crashes if fluentd daemon is gone and buffer is full&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Another scenario where a logging driver causes trouble is when the remote destination is not reachable – in this &lt;a href="https://github.com/moby/moby/issues/32567"&gt;particular case&lt;/a&gt; the logging driver throws exceptions that crash the Docker daemon.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;7. Docker container gets stuck in Created state on Splunk driver failure&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;If the Splunk server returns a 504 on container start, the container is actually started, but Docker reports it as having failed to start. Once in this state, the container no longer appears under docker ps, and its process cannot be stopped with docker kill. The only way to stop the process is to kill it manually.&lt;/p&gt;

&lt;p&gt;Github: &lt;a href="https://github.com/moby/moby/issues/24376"&gt;https://github.com/moby/moby/issues/24376&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;8. Docker logs skipping/missing application logs (journald driver)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;It turns out that this issue is caused by journald rate limits: Docker writes logs for all running containers to the journal, and journald may skip some of them due to its rate-limit settings. To avoid losing logs, those limits need to be increased. So be aware of your journald settings when you connect Docker to it.&lt;/p&gt;
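&lt;p&gt;The limits live in journald’s configuration. A sketch with illustrative values (older systemd versions call the first key RateLimitInterval):&lt;/p&gt;

```shell
# Raise journald's rate limits so busy containers don't get their logs
# dropped; journald discards messages beyond RateLimitBurst per
# RateLimitIntervalSec. The values below are illustrative only.
printf '[Journal]\nRateLimitIntervalSec=30s\nRateLimitBurst=100000\n' \
  | sudo tee -a /etc/systemd/journald.conf
sudo systemctl restart systemd-journald
```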

&lt;h4&gt;
  
  
  &lt;strong&gt;9. Gelf driver issues&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The Gelf logging driver lacks a TCP or TLS option and supports only UDP, so log messages can get lost when UDP packets are dropped. Some issues also report DNS resolution/caching problems with the GELF driver, meaning your logs might be sent to “Nirvana” when your Graylog server’s IP changes – and this can happen quickly with container deployments.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;10. Docker does not support multiple log drivers&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;It would be nice to have logs stored locally on the server along with the possibility to ship them to remote servers. Currently, Docker does not support multiple log drivers, so users are forced to pick a single one. Not an easy decision, knowing the various issues listed in this post.&lt;/p&gt;
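&lt;p&gt;A common workaround: keep the json-file driver locally (so “docker logs” keeps working) and run a separate log shipper that reads container logs and forwards them to the remote destination. A sketch using the Sematext Docker Agent mentioned above (any Docker-aware log shipper works similarly; the token value is a placeholder):&lt;/p&gt;

```shell
# The agent reads container logs via the Docker socket and ships them,
# while the json-file driver keeps a local copy for "docker logs".
docker run -d --name sematext-agent \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e LOGSENE_TOKEN=YOUR_LOGS_TOKEN \
  sematext/sematext-agent-docker
```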

&lt;h3&gt;
  
  
  &lt;strong&gt;That’s it! My Top 10 Docker Logging Gotchas!&lt;/strong&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  What’s next?
&lt;/h3&gt;

&lt;p&gt;You should think about &lt;a href="https://sematext.com/docker"&gt;not only collecting logs but also host and container metrics, and events&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://sematext.com/blog/top-10-docker-logging-gotchas/"&gt;Top 10 Docker Logging Gotchas&lt;/a&gt; appeared first on &lt;a href="https://sematext.com"&gt;Sematext&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://sematext.com/kubernetes/cheatsheet/"&gt;&lt;br&gt;&lt;br&gt;
 &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tbbzyn9c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sematext.com/wp-content/uploads/2017/04/kubernetes-cheatsheet-pick.jpg" alt="Kubernetes Cheat Sheet" title="The Essential Kubernetes Cheat Sheet"&gt;&lt;br&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;a href="https://sematext.com/kubernetes-cheat-sheet/"&gt;Have you started using Kubernetes?&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;We’ve prepared &lt;a href="https://sematext.com/kubernetes-cheat-sheet/"&gt;a Kubernetes Cheat Sheet&lt;/a&gt; which puts all key Kubernetes commands (think kubectl) at your fingertips. Organized in logical groups from resource management (e.g. creating or listing pods, services, daemons), viewing and finding resources, to monitoring and logging. &lt;a href="https://sematext.com/kubernetes-cheat-sheet/"&gt;&lt;strong&gt;Download yours&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;




</description>
      <category>logging</category>
      <category>docker</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
