<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Radu Gheorghe</title>
    <description>The latest articles on Forem by Radu Gheorghe (@radu0gheorghe).</description>
    <link>https://forem.com/radu0gheorghe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F145777%2Ff5adba2d-3bc5-47bb-926f-5abd57d6f1c6.jpeg</url>
      <title>Forem: Radu Gheorghe</title>
      <link>https://forem.com/radu0gheorghe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/radu0gheorghe"/>
    <language>en</language>
    <item>
      <title>Linux Logging Tutorial: What Are Linux Logs, How to View, Search and Centralize Them</title>
      <dc:creator>Radu Gheorghe</dc:creator>
      <pubDate>Mon, 27 Jul 2020 12:17:37 +0000</pubDate>
      <link>https://forem.com/sematext/linux-logging-tutorial-what-are-linux-logs-how-to-view-search-and-centralize-them-2bi5</link>
      <guid>https://forem.com/sematext/linux-logging-tutorial-what-are-linux-logs-how-to-view-search-and-centralize-them-2bi5</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR note&lt;/strong&gt;: if you want the &lt;code&gt;bzip2 -9&lt;/code&gt; version of this post, scroll down to the very last section for some quick pointers. If you want to learn a bit about Linux system logs, please continue, as we'll talk about all these and more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;What are Linux logs&lt;/strong&gt; and who generates them&lt;/li&gt;
&lt;li&gt;  Important &lt;strong&gt;types of Linux logs&lt;/strong&gt; and their typical location&lt;/li&gt;
&lt;li&gt;  How to &lt;strong&gt;read and search logs&lt;/strong&gt;, whether they're written by journald or syslog&lt;/li&gt;
&lt;li&gt;  How to &lt;strong&gt;centralize logs&lt;/strong&gt; from many servers in one location. Spoiler alert: the easiest way is to send all system logs to Sematext Cloud in &lt;strong&gt;three commands&lt;/strong&gt;, so you can build actionable dashboards:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F06%2Flinux-logging-post-3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F06%2Flinux-logging-post-3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Short Recap: What Are Linux Logs?
&lt;/h2&gt;

&lt;p&gt;Linux logs are timestamped pieces of data that Linux writes about what the server, kernel, services, and applications running on it are doing. They often come with other structured data, such as a hostname, making them a valuable &lt;a href="https://sematext.com/blog/log-analysis/" rel="noopener noreferrer"&gt;analysis&lt;/a&gt; and troubleshooting tool for admins when they encounter performance issues. You can read more about logs and why you should monitor them in our &lt;a href="https://sematext.com/guides/log-management/" rel="noopener noreferrer"&gt;complete guide to log management&lt;/a&gt;. Here's an example of an SSH log entry from the &lt;code&gt;/var/log/auth.log&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;May 5 08:57:27 ubuntu-bionic sshd[5544]: pam_unix(sshd:session): session opened for user vagrant by (uid=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Notice how the log contains a few fields (the timestamp, the hostname, the process writing the log and its PID) before the message itself. In Linux, logs come from different sources, mainly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/blog/journald-logging-tutorial/" rel="noopener noreferrer"&gt;Systemd journal&lt;/a&gt;. Most Linux distros have &lt;a href="https://systemd.io/" rel="noopener noreferrer"&gt;systemd&lt;/a&gt; to manage services (like SSH above). Systemd catches the output of these services (i.e., logs like the one above) and writes them to the journal. The journal is written in a binary format, so you'll use &lt;a href="https://sematext.com/blog/journald-logging-tutorial#toc-journald-commands-via-journalctl-5" rel="noopener noreferrer"&gt;journalctl&lt;/a&gt; to explore it, like:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    $ journalctl
    ...
    May 05 08:57:27 ubuntu-bionic sshd[5544]: pam_unix(sshd:session): session opened for user vagrant by (uid=0)
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/" rel="noopener noreferrer"&gt;Syslog&lt;/a&gt;. When there's no systemd, processes like SSH can write to a UNIX socket (e.g., &lt;code&gt;/dev/log&lt;/code&gt;) in the &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/#toc-syslog-message-formats-2" rel="noopener noreferrer"&gt;syslog message format&lt;/a&gt;. A &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/#toc-syslog-daemons-0" rel="noopener noreferrer"&gt;syslog daemon&lt;/a&gt; (e.g., &lt;a href="https://www.rsyslog.com/" rel="noopener noreferrer"&gt;rsyslog&lt;/a&gt;) then picks the message, parses it and writes it to various destinations. By default, it writes to files in &lt;code&gt;/var/log&lt;/code&gt;, which is how we got the earlier message from /var/log/auth.log.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Linux kernel&lt;/strong&gt; writes its own logs to a ring buffer. Systemd or the syslog daemon can read logs from this buffer, then write to the journal or flat files (typically &lt;code&gt;/var/log/kern.log&lt;/code&gt;). You can also see kernel logs directly via &lt;code&gt;dmesg&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ dmesg -T
...
[Tue May 5 08:41:31 2020] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/blog/auditd-logs-auditbeat-elasticsearch-logsene/" rel="noopener noreferrer"&gt;Audit logs&lt;/a&gt;. These are a special case of kernel messages designed for auditing actions such as file access. You'd typically have a service to listen for such security logs, like auditd. By default, &lt;a href="https://sematext.com/blog/auditd-logs-auditbeat-elasticsearch-logsene/#toc-audit-logs-in-linux-a-quick-tutorial-on-using-auditd-1" rel="noopener noreferrer"&gt;auditd&lt;/a&gt; writes audit messages to &lt;code&gt;/var/log/audit/audit.log&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Application logs&lt;/strong&gt;. Non-system applications tend to write to /var/log as well. Here are some popular examples:

&lt;ul&gt;
&lt;li&gt;  Apache HTTPD logs are typically written to &lt;code&gt;/var/log/httpd&lt;/code&gt; or &lt;code&gt;/var/log/apache2&lt;/code&gt;. HTTP access logs would be in &lt;code&gt;/var/log/httpd/access.log&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;  MySQL logs typically go to &lt;code&gt;/var/log/mysql.log&lt;/code&gt; or &lt;code&gt;/var/log/mysqld.log&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;  Older Linux versions would record boot logs via &lt;a href="https://manpages.debian.org/buster/bootlogd/bootlogd.8.en.html" rel="noopener noreferrer"&gt;bootlogd&lt;/a&gt; to &lt;code&gt;/var/log/boot&lt;/code&gt; or &lt;code&gt;/var/log/boot.log&lt;/code&gt;. Systemd now takes care of this: you can view boot-related logs via &lt;code&gt;journalctl -b&lt;/code&gt;. Distros without systemd have a syslog daemon reading from the kernel ring buffer, which normally has all the boot messages. So you can find your boot/reboot logs in &lt;code&gt;/var/log/messages&lt;/code&gt; or &lt;code&gt;/var/log/syslog&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;  Last but not least, you may have your own apps using a &lt;a href="https://sematext.com/blog/logging-libraries-vs-log-shippers/" rel="noopener noreferrer"&gt;logging library&lt;/a&gt; to write to a specific file&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;These sources can interact with each other: journald can forward all its messages to syslog. Applications can write to syslog or the journal. It's Linux, where everything is configurable. But for now, we'll focus on the defaults: where can you &lt;strong&gt;typically&lt;/strong&gt; find different types of logs in most modern distributions?&lt;/p&gt;

&lt;h2&gt;
  
  
  Log Files Location: Where Are They Stored?
&lt;/h2&gt;

&lt;p&gt;Typically, you'll find Linux server logs in the &lt;code&gt;/var/log&lt;/code&gt; directory. This is where syslog daemons are normally configured to write. It's also where most applications (e.g., Apache HTTPD) write by default. For the systemd journal, the default location is &lt;code&gt;/var/log/journal&lt;/code&gt;, but you can't view the files directly because they're binary. So how &lt;strong&gt;do&lt;/strong&gt; you view them?&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Check Linux Logs
&lt;/h2&gt;

&lt;p&gt;If your Linux distro uses Systemd (and most modern distros do), then all your system logs are in the journal. You can view them with &lt;code&gt;journalctl&lt;/code&gt;, and you can find the most important &lt;a href="https://sematext.com/blog/journald-logging-tutorial/#toc-journald-commands-via-journalctl-5" rel="noopener noreferrer"&gt;journalctl commands here&lt;/a&gt;. If your distribution writes to local files via syslog, you can view them with standard text processing tools, such as &lt;a href="https://linux.die.net/man/1/cat" rel="noopener noreferrer"&gt;cat&lt;/a&gt;, &lt;a href="https://linux.die.net/man/1/less" rel="noopener noreferrer"&gt;less&lt;/a&gt; or &lt;a href="https://linux.die.net/man/1/grep" rel="noopener noreferrer"&gt;grep&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# grep "error" /var/log/syslog | tail
Mar 31 09:48:02 ubuntu-bionic rsyslogd: unexpected GnuTLS error -53 - this could be caused by a broken connection. GnuTLS reports: Error in the push function. [v8.2002.0 try https://www.rsyslog.com/e/2078 ]
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
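&lt;p&gt;Beyond &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;awk&lt;/code&gt; is handy for pulling individual fields out of a syslog line. A minimal sketch, assuming the default whitespace-separated layout (timestamp in fields 1-3, hostname in field 4, process tag in field 5):&lt;/p&gt;

```shell
#!/bin/sh
line='May 5 08:57:27 ubuntu-bionic sshd[5544]: pam_unix(sshd:session): session opened for user vagrant by (uid=0)'

# with whitespace-separated fields, $4 is the hostname and $5 the process tag with its PID
echo "$line" | awk '{print $4}'   # prints: ubuntu-bionic
echo "$line" | awk '{print $5}'   # prints: sshd[5544]:
```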



&lt;p&gt;If you're using &lt;a href="https://linux.die.net/man/8/auditd" rel="noopener noreferrer"&gt;auditd&lt;/a&gt; to manage audit logs, you can check them in &lt;code&gt;/var/log/audit.log&lt;/code&gt; by default, but you can also search them with &lt;a href="https://sematext.com/blog/auditd-logs-auditbeat-elasticsearch-logsene/#toc-searching-and-analyzing-audit-logs-with-ausearch-and-aureport-5" rel="noopener noreferrer"&gt;ausearch&lt;/a&gt;. That said, you're better off shipping these security logs to a central location, especially if you have multiple servers. For this task, a tool like &lt;a href="https://www.elastic.co/beats/auditbeat" rel="noopener noreferrer"&gt;Auditbeat&lt;/a&gt; might work better than auditd. We wrote a separate &lt;a href="https://sematext.com/blog/auditd-logs-auditbeat-elasticsearch-logsene/" rel="noopener noreferrer"&gt;tutorial on centralizing audit logs with Auditbeat&lt;/a&gt;, but in the next section we'll focus on centralizing Linux system logs in general.&lt;/p&gt;

&lt;h2&gt;
  
  
  Centralizing Linux Logs
&lt;/h2&gt;

&lt;p&gt;System logs can be in two places: systemd's journal or plain text files written by a syslog daemon. Some distributions (e.g., Ubuntu) have both: journald is set up to forward to syslog. This is done by setting &lt;code&gt;ForwardToSyslog=Yes&lt;/code&gt; in &lt;code&gt;journald.conf&lt;/code&gt;.&lt;/p&gt;
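&lt;p&gt;As a sketch, checking or changing this on a systemd machine comes down to one setting in &lt;code&gt;/etc/systemd/journald.conf&lt;/code&gt; (the default varies by distribution):&lt;/p&gt;

```ini
# /etc/systemd/journald.conf
[Journal]
# forward every journal message to the local syslog daemon
ForwardToSyslog=yes
```

Apply the change with `systemctl restart systemd-journald`.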

&lt;h3&gt;
  
  
  Centralizing Logs via Journald
&lt;/h3&gt;

&lt;p&gt;Our recommendation is to use &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-upload.html" rel="noopener noreferrer"&gt;journal-upload&lt;/a&gt; to &lt;a href="https://sematext.com/blog/log-aggregation/" rel="noopener noreferrer"&gt;centralize logs&lt;/a&gt; if the distribution has systemd. You can check this by running &lt;code&gt;journalctl&lt;/code&gt;: if the command isn't found, you don't have the journal. As promised earlier, you can &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;centralize your system logs to Sematext Cloud with three commands&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Install journal-upload&lt;/strong&gt;. On Ubuntu, this works via &lt;code&gt;sudo apt-get install systemd-journal-remote&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Configure journal-upload&lt;/strong&gt;. In &lt;code&gt;/etc/systemd/journal-upload.conf&lt;/code&gt;, set &lt;code&gt;URL=&lt;/code&gt;&lt;code&gt;http://logsene-journald-receiver.sematext.com:80/YOUR_LOGS_TOKEN&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Start journal-upload&lt;/strong&gt; now and on every boot: &lt;code&gt;systemctl enable systemd-journal-upload &amp;amp;&amp;amp; systemctl start systemd-journal-upload&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
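&lt;p&gt;On an Ubuntu box, the three steps above boil down to the following commands, run on the target host (a sketch: the &lt;code&gt;sed&lt;/code&gt; pattern assumes the stock commented-out &lt;code&gt;URL=&lt;/code&gt; line, and &lt;code&gt;YOUR_LOGS_TOKEN&lt;/code&gt; is a placeholder):&lt;/p&gt;

```shell
# 1. install journal-upload
sudo apt-get install systemd-journal-remote
# 2. point journal-upload at the receiver
sudo sed -i 's|^#\? *URL=.*|URL=http://logsene-journald-receiver.sematext.com:80/YOUR_LOGS_TOKEN|' /etc/systemd/journal-upload.conf
# 3. start it now and on every boot
sudo systemctl enable systemd-journal-upload
sudo systemctl start systemd-journal-upload
```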

&lt;p&gt;Alternatively, you can use &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;Logagent's journal-upload input&lt;/a&gt; to gather journal entries from one or more machines, before shipping them to a central location. That central location can be &lt;a href="https://sematext.com/logsene" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;, a local &lt;a href="https://sematext.com/guides/elk-stack/" rel="noopener noreferrer"&gt;ELK stack&lt;/a&gt; or something else:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F06%2Flinux-logging-post-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F06%2Flinux-logging-post-2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to learn more about journald and journalctl, as well as the options you have around centralizing the journal, have a look at our &lt;a href="https://sematext.com/blog/journald-logging-tutorial" rel="noopener noreferrer"&gt;complete guide to journald&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Centralizing Logs via syslog
&lt;/h3&gt;

&lt;p&gt;There are a few scenarios in which centralizing Linux logs with syslog might make sense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Your Linux distribution doesn't have journald. This means system logs go directly to your &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/#toc-syslog-daemons-0" rel="noopener noreferrer"&gt;syslog daemon&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  You want to &lt;strong&gt;use your syslog daemon to collect and parse application logs&lt;/strong&gt; as well. An example is described in our &lt;a href="https://sematext.com/blog/recipe-apache-logs-rsyslog-parsing-elasticsearch/" rel="noopener noreferrer"&gt;tutorial for Apache logs with rsyslog and Elasticsearch&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  You want to &lt;strong&gt;forward journal entries to syslog&lt;/strong&gt; (i.e., by setting &lt;code&gt;ForwardToSyslog=Yes&lt;/code&gt; in &lt;code&gt;journald.conf&lt;/code&gt;), so you can use a &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/#toc-syslog-protocols-6" rel="noopener noreferrer"&gt;syslog protocol&lt;/a&gt; as a transport. However, this approach will lose some of journald's structured data: journald only forwards &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/#toc-syslog-message-formats-2" rel="noopener noreferrer"&gt;syslog-specific fields&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  Similar to the above, except that you'd &lt;strong&gt;configure the syslog daemon to read from the journal&lt;/strong&gt; (like &lt;code&gt;journalctl&lt;/code&gt; does). This approach doesn't lose structured data, but is more error prone (e.g., in case of journal corruption) and adds more overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In all situations listed above, data will go through your syslog daemon. From there, you can send it to any of the supported destinations. Most Linux distributions come with &lt;a href="https://www.rsyslog.com/" rel="noopener noreferrer"&gt;rsyslog&lt;/a&gt; installed. To forward data to another syslog server via TCP, you can add this line in your &lt;code&gt;/etc/rsyslog.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;*.* @@logsene-syslog-receiver.sematext.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This particular line will forward data to &lt;a href="https://sematext.com/docs/logs/syslog/" rel="noopener noreferrer"&gt;Sematext Cloud's syslog endpoint&lt;/a&gt;, but you can replace &lt;code&gt;logsene-syslog-receiver.sematext.com&lt;/code&gt; with the host name of your own syslog server. Some syslog daemons can output data to Elasticsearch via HTTP/HTTPS. &lt;a href="https://rsyslog.readthedocs.io/en/latest/configuration/modules/omelasticsearch.html" rel="noopener noreferrer"&gt;rsyslog is one of them&lt;/a&gt; and &lt;a href="https://www.syslog-ng.com/technical-documents/doc/syslog-ng-open-source-edition/3.21/administration-guide/32#TOPIC-1197819" rel="noopener noreferrer"&gt;so is syslog-ng&lt;/a&gt;. For example, if you use rsyslog on Ubuntu, you'll install the Elasticsearch output module first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get install rsyslog-elasticsearch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, in the configuration file, you need two elements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;A template that formats your syslog messages as JSON&lt;/strong&gt;, for Elasticsearch to consume
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;template(name="LogseneFormat" type="list" option.json="on") {
 constant(value="{")
 constant(value="\\"@timestamp\\":\\"")
 property(name="timereported" dateFormat="rfc3339")
 constant(value="\\",\\"message\\":\\"")
 property(name="msg")
 constant(value="\\",\\"host\\":\\"")
 property(name="hostname")
 constant(value="\\",\\"severity\\":\\"")
 property(name="syslogseverity-text")
 constant(value="\\",\\"facility\\":\\"")
 property(name="syslogfacility-text")
 constant(value="\\",\\"syslog-tag\\":\\"")
 property(name="syslogtag")
 constant(value="\\",\\"source\\":\\"")
 property(name="programname")
 constant(value="\\"}")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt; &lt;strong&gt;An action that forwards data to Elasticsearch&lt;/strong&gt;, using the template specified above
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module(load="omelasticsearch")
action(type="omelasticsearch"
 template="LogseneFormat" # the template that you defined earlier
 searchIndex="LOGSENE_APP_TOKEN_GOES_HERE"
 server="logsene-receiver.sematext.com"
 serverport="443"
 usehttps="on"
 bulkmode="on"
 queue.dequeuebatchsize="100" # how many messages to send at once
 action.resumeretrycount="-1") # buffer messages if connection fails
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above example shows how to send messages to &lt;a href="https://sematext.com/docs/logs/index-events-via-elasticsearch-api/" rel="noopener noreferrer"&gt;Sematext Cloud's Elasticsearch API&lt;/a&gt;, but you can adjust the action element to point it to your local Elasticsearch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;searchIndex&lt;/code&gt; would be your own &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.6/indices-rollover-index.html" rel="noopener noreferrer"&gt;rolling index alias&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;server&lt;/code&gt; would be the hostname of an Elasticsearch node&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;serverport&lt;/code&gt; can be 9200 or a custom port Elasticsearch listens to&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;usehttps="off"&lt;/code&gt; would send data over plain HTTP&lt;/li&gt;
&lt;/ul&gt;
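&lt;p&gt;Putting those substitutions together, a local-Elasticsearch version of the action might look like this (the index and host names are placeholders):&lt;/p&gt;

```
module(load="omelasticsearch")
action(type="omelasticsearch"
 template="LogseneFormat" # the JSON template defined earlier
 searchIndex="system-logs" # your rolling index alias
 server="localhost" # an Elasticsearch node
 serverport="9200"
 usehttps="off" # plain HTTP
 bulkmode="on"
 queue.dequeuebatchsize="100"
 action.resumeretrycount="-1")
```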

&lt;p&gt;Whether you use a syslog protocol, the Elasticsearch API or something else, it's better to &lt;strong&gt;forward syslog directly&lt;/strong&gt; from the syslog daemon than to &lt;strong&gt;tail individual files from&lt;/strong&gt; /var/log using a &lt;a href="https://sematext.com/blog/logstash-alternatives/" rel="noopener noreferrer"&gt;different log shipper&lt;/a&gt;. Tailing files will add overhead and miss some of the metadata, such as facility or severity. Which is not to say that files in /var/log are useless. You'll need them in two scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Logs of applications that write directly to &lt;code&gt;/var/log&lt;/code&gt;. For example, HTTP logs, FTP logs, MySQL logs and so on. You can tail such files with a log shipper. We have tutorials on &lt;a href="https://sematext.com/blog/recipe-apache-logs-rsyslog-parsing-elasticsearch/" rel="noopener noreferrer"&gt;parsing Apache logs with rsyslog&lt;/a&gt; and &lt;a href="https://sematext.com/blog/getting-started-with-logstash/" rel="noopener noreferrer"&gt;with Logstash&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Processing system logs with UNIX text tools like grep&lt;/strong&gt;. Here, different log files contain different kinds of data. We'll look at the typical configuration in the next section.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Are the Most Important Log Files You Should Monitor?
&lt;/h2&gt;

&lt;p&gt;By default, some distributions write system logs to syslog (either directly or from the journal). The syslog daemon writes these logs to files under &lt;code&gt;/var/log&lt;/code&gt;. Typically that syslog daemon is rsyslog, though syslog-ng works in a similar fashion. In this section, we'll look at the important log files and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  what kind of information you'll find in them&lt;/li&gt;
&lt;li&gt;  how rsyslog is configured to write there (in case you want to change the configuration)&lt;/li&gt;
&lt;li&gt;  how to view the same information with &lt;code&gt;journalctl&lt;/code&gt;, in case it doesn't forward to syslog&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  /var/log/syslog or /var/log/messages
&lt;/h3&gt;

&lt;p&gt;This is the “catch-all” of syslog. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# logger "this is a test"
# tail -1 /var/log/syslog
May 7 15:33:11 ubuntu-bionic test-user: this is a test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Typically, you'll find all messages here (error logs, informational messages, and every other &lt;a href="https://en.wikipedia.org/wiki/Syslog#Severity_level" rel="noopener noreferrer"&gt;severity&lt;/a&gt;), as this line from &lt;code&gt;/etc/rsyslog.conf&lt;/code&gt; suggests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;*.* /var/log/syslog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only exception is the &lt;a href="https://www.rsyslog.com/doc/v8-stable/configuration/actions.html#discard-stop" rel="noopener noreferrer"&gt;stop action&lt;/a&gt;. For example, you may find something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;:msg,contains,"[UFW " /var/log/ufw.log
&amp;amp; stop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In plain English, this block says:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If the &lt;a href="https://www.rsyslog.com/doc/v8-stable/configuration/properties.html#message-properties" rel="noopener noreferrer"&gt;msg property&lt;/a&gt; of this message contains "[UFW "&lt;/li&gt;
&lt;li&gt;  Then write to /var/log/ufw.log (the &lt;a href="https://www.rsyslog.com/doc/v8-stable/configuration/modules/omfile.html" rel="noopener noreferrer"&gt;file output module&lt;/a&gt; is implied)&lt;/li&gt;
&lt;li&gt;  Apply the next action (&amp;amp; means "same filter as the previous line") to those messages: don't process them further (stop)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if the &lt;code&gt;/var/log/syslog&lt;/code&gt; action comes later, it won't write UFW messages there. If there's nothing in &lt;code&gt;/var/log/syslog&lt;/code&gt; or &lt;code&gt;/var/log/messages&lt;/code&gt;, you probably have journald set up not to forward to syslog. The same data (and more) can be viewed via &lt;code&gt;journalctl&lt;/code&gt; with no parameters. By default, &lt;code&gt;journalctl&lt;/code&gt; pages data through &lt;code&gt;less&lt;/code&gt;, but if you want to filter through &lt;code&gt;grep&lt;/code&gt; you'll need to disable paging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl --no-pager | grep "this is a test"
May 07 15:33:11 ubuntu-bionic test-user[7526]: this is a test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  /var/log/kern.log or /var/log/dmesg
&lt;/h3&gt;

&lt;p&gt;This is where kernel messages go by default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Apr 17 16:47:28 ubuntu-bionic kernel: [ 0.004000] console [tty1] enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's really down to filtering syslog messages by the &lt;code&gt;kern&lt;/code&gt; facility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kern.* /var/log/kern.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you don't have syslog (or the file is missing) and you have journald, you can show kernel messages in &lt;code&gt;journalctl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl -k
...
Apr 17 16:47:28 ubuntu-bionic kernel: console [tty1] enabled
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  /var/log/auth.log or /var/log/secure
&lt;/h3&gt;

&lt;p&gt;This is where you find authentication messages, generated by services like sshd:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;May 7 15:03:09 ubuntu-bionic sshd[1202]: pam_unix(sshd:session): session closed for user vagrant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is another filter by facility, this time by two values (&lt;code&gt;auth&lt;/code&gt; and &lt;code&gt;authpriv&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auth,authpriv.* /var/log/auth.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can do such filters in &lt;code&gt;journalctl&lt;/code&gt; as well, except that you have to provide &lt;a href="https://en.wikipedia.org/wiki/Syslog#Facility" rel="noopener noreferrer"&gt;numeric facility values&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl SYSLOG_FACILITY=4 SYSLOG_FACILITY=10
...
May 7 15:03:09 ubuntu-bionic sshd[1202]: pam_unix(sshd:session): session closed for user vagrant
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
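&lt;p&gt;Those numbers come from the syslog encoding, where facility and severity are packed into a single PRI value (PRI = facility * 8 + severity). A quick sketch to double-check the facilities used in this post (&lt;code&gt;mail&lt;/code&gt;=2, &lt;code&gt;auth&lt;/code&gt;=4, &lt;code&gt;cron&lt;/code&gt;=9, &lt;code&gt;authpriv&lt;/code&gt;=10):&lt;/p&gt;

```shell
#!/bin/sh
# PRI = facility * 8 + severity (per RFC 5424)
pri() { echo $(( $1 * 8 + $2 )); }

pri 2 6    # mail.info       prints: 22
pri 4 6    # auth.info       prints: 38
pri 10 5   # authpriv.notice prints: 85
```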



&lt;h3&gt;
  
  
  /var/log/cron.log
&lt;/h3&gt;

&lt;p&gt;This is where your &lt;a href="http://man7.org/linux/man-pages/man8/cron.8.html" rel="noopener noreferrer"&gt;cron&lt;/a&gt; messages go (i.e., jobs that run regularly):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;May 06 08:19:01 localhost.localdomain anacron[1142]: Job `cron.daily' started
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yet another facility filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cron.* /var/log/cron
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With journalctl, you'd do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl SYSLOG_FACILITY=9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  /var/log/mail.log or /var/log/maillog
&lt;/h3&gt;

&lt;p&gt;Email daemons such as Postfix typically log to syslog in the &lt;code&gt;mail&lt;/code&gt; facility, just like &lt;code&gt;cron&lt;/code&gt; logs to the &lt;code&gt;cron&lt;/code&gt; facility. Then, rsyslog puts these logs in a different file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mail.* /var/log/mail.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're using journald, you can still view mail logs with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl SYSLOG_FACILITY=2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because journald exposes the &lt;a href="https://linux.die.net/man/3/syslog" rel="noopener noreferrer"&gt;syslog API&lt;/a&gt;, everything that normally goes to syslog ends up in the journal.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR Takeaways
&lt;/h2&gt;

&lt;p&gt;Let's summarize the actionable points here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The location and format of your Linux system logs &lt;strong&gt;depend on how your distro is configured&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Most distros have systemd&lt;/strong&gt;. It means all your system &lt;strong&gt;logs live in the journal&lt;/strong&gt;. To view and search it, &lt;strong&gt;use journalctl&lt;/strong&gt;. Use the &lt;a href="https://sematext.com/blog/journald-logging-tutorial" rel="noopener noreferrer"&gt;complete guide to journald&lt;/a&gt; for reference.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Some distros send system logs to syslog&lt;/strong&gt;, either directly or through the journal. In this case, you likely have logs written to various files in &lt;code&gt;/var/log&lt;/code&gt;. Have a look at the section above for details on each important file.&lt;/li&gt;
&lt;li&gt;  Either way, if you manage multiple servers, you'll want to centralize system logs with &lt;a href="https://sematext.com/blog/best-log-management-tools/" rel="noopener noreferrer"&gt;log management software&lt;/a&gt; such as &lt;a href="https://sematext.com/cloud" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;. Sematext makes this very easy, as it has both &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;journald integration&lt;/a&gt; and &lt;a href="https://sematext.com/docs/logs/syslog/" rel="noopener noreferrer"&gt;syslog integration&lt;/a&gt;. Though you can use your own &lt;a href="https://sematext.com/guides/elk-stack/" rel="noopener noreferrer"&gt;ELK stack&lt;/a&gt; if you prefer to &lt;a href="https://sematext.com/elastic-stack-alternative/" rel="noopener noreferrer"&gt;build rather than buy&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  If you need help with your own ELK stack, please reach out, as we provide &lt;a href="https://sematext.com/consulting/logging/" rel="noopener noreferrer"&gt;ELK stack consulting&lt;/a&gt;, &lt;a href="https://sematext.com/support/elasticsearch-production-support/" rel="noopener noreferrer"&gt;Elasticsearch production support&lt;/a&gt; and &lt;a href="https://sematext.com/training/elasticsearch/" rel="noopener noreferrer"&gt;Elasticsearch and ELK stack training classes&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>linux</category>
      <category>logging</category>
    </item>
    <item>
      <title>Tutorial: Logging with journald</title>
      <dc:creator>Radu Gheorghe</dc:creator>
      <pubDate>Tue, 09 Jun 2020 07:37:39 +0000</pubDate>
      <link>https://forem.com/sematext/tutorial-logging-with-journald-50l8</link>
      <guid>https://forem.com/sematext/tutorial-logging-with-journald-50l8</guid>
      <description>&lt;p&gt;If you're using Linux, I'm sure you bumped into &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journald.service.html" rel="noopener noreferrer"&gt;journald&lt;/a&gt;: it's what most distros use by default for system logging. Most applications running as a service will also log to the journal. So how do you make use of these logs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  find the error or debug message that you're looking for?&lt;/li&gt;
&lt;li&gt;  make sure logs don't fill your disk?&lt;/li&gt;
&lt;li&gt;  centralize journals so you don't have to ssh to each box?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, we'll answer all the above and more. We will dive into the following topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;what is journald&lt;/strong&gt;, how it came to be and what are its benefits&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;main configuration options&lt;/strong&gt;, like when to remove old logs so you don't run out of disk&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;journald and containers&lt;/strong&gt;: can/should containers log to the journal?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;journald vs syslog&lt;/strong&gt;: advantages and disadvantages of both, how they integrate&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;ways to centralize journals&lt;/strong&gt;. Advantages and disadvantages of each method, and which is &lt;a href="https://sematext.com/product-updates/%23/2020/we-have-a-new-logs-integration-for-journald" rel="noopener noreferrer"&gt;the best&lt;/a&gt;. Spoiler alert: you can &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;configure journald to send logs directly to Sematext Cloud&lt;/a&gt;; or you can &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;use the open-source Logagent as a journald aggregator&lt;/a&gt;. Either way, you'll have one place to search and analyze your journal events:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F04%2Fjornald-logging-post-image2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F04%2Fjornald-logging-post-image2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are lots of other options to centralize journal entries, and lots of tools to help. We'll explore them in detail, but before that, let's zoom in to journald itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is journald?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;journald&lt;/strong&gt; is the part of &lt;a href="https://systemd.io/" rel="noopener noreferrer"&gt;systemd&lt;/a&gt; that deals with logging. &lt;strong&gt;systemd&lt;/strong&gt;, at its core, is in charge of managing services: it starts them up and keeps them alive.&lt;/p&gt;

&lt;p&gt;All services and systemd itself need to log: “ssh started” or “user root logged in”, they might say. That's where journald comes in: to capture these logs, record them, make them easy to find, and remove them when they pass a certain age.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use journald?
&lt;/h2&gt;

&lt;p&gt;In short, because syslog sucks :) Jokes aside, the &lt;a href="https://docs.google.com/document/pub?id%3D1IC9yOXj7j6cdLLxWEBAGRL6wl97tFxgjLUEHIX3MSTs%26pli%3D1" rel="noopener noreferrer"&gt;paper announcing journald&lt;/a&gt; explained that systemd needed functionality that was hard to get through &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/" rel="noopener noreferrer"&gt;existing syslog implementations&lt;/a&gt;. Examples include structured logging, indexing logs for fast search, access control and signed messages.&lt;/p&gt;

&lt;p&gt;As you might expect, &lt;a href="https://rainer.gerhards.net/2013/05/rsyslog-vs-systemd-journal.html" rel="noopener noreferrer"&gt;not everyone agrees with these statements&lt;/a&gt; or the general approach systemd took with journald. But by now, systemd is adopted by most Linux distributions, and it includes journald as well. journald happily coexists with syslog daemons, as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  some syslog daemons can both read from and write to the journal&lt;/li&gt;
&lt;li&gt;  journald exposes the &lt;a href="https://linux.die.net/man/3/syslog" rel="noopener noreferrer"&gt;syslog API&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  journald benefits
&lt;/h3&gt;

&lt;p&gt;Think of journald as your mini-command-line-&lt;a href="https://sematext.com/guides/elk-stack/" rel="noopener noreferrer"&gt;ELK&lt;/a&gt; that lives on virtually every Linux box. It provides lots of features, most importantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Indexing&lt;/strong&gt;. journald uses a binary storage for logs, where data is indexed. Lookups are much faster than with plain text files&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Structured logging&lt;/strong&gt;. Though &lt;a href="https://sematext.com/blog/structured-logging-with-rsyslog-and-elasticsearch/" rel="noopener noreferrer"&gt;it's possible with syslog, too&lt;/a&gt;, it's enforced here. Combined with indexing, it means you can easily filter specific logs (e.g. with a set priority, in a set timeframe)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Access control&lt;/strong&gt;. By default, storage files are split by user, with different permissions to each. As a regular user, you won't see everything root sees, but you'll see your own logs&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Automatic log rotation&lt;/strong&gt;. You can configure journald (see below) to keep logs only up to a space limit, or based on free space&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Configuring journald
&lt;/h2&gt;

&lt;p&gt;To tweak how journald behaves, you'll edit &lt;code&gt;/etc/systemd/journald.conf&lt;/code&gt; and then reload the journal service like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl reload systemd-journald.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Though &lt;a href="https://github.com/systemd/systemd/issues/2236" rel="noopener noreferrer"&gt;earlier versions of journald need to be restarted&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl restart systemd-journald.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Most important settings will be around storage: whether the journal should be kept in memory or on disk, when to remove old logs and how much to rate limit. We'll focus on some of those next, but you can see all the configuration options in &lt;a href="https://www.freedesktop.org/software/systemd/man/journald.conf.html%23" rel="noopener noreferrer"&gt;journald.conf's man page&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  journald storage
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;Storage&lt;/code&gt;&lt;/strong&gt; option controls whether the journal is stored in memory (under &lt;code&gt;/run/log/journal&lt;/code&gt;) or on disk (under &lt;code&gt;/var/log/journal&lt;/code&gt;). Setting &lt;strong&gt;&lt;code&gt;Storage=volatile&lt;/code&gt;&lt;/strong&gt; will store the journal in memory, while &lt;strong&gt;&lt;code&gt;Storage=persistent&lt;/code&gt;&lt;/strong&gt; will store it on disk. Most distributions have it set to &lt;code&gt;auto&lt;/code&gt;, which stores the journal on disk if &lt;code&gt;/var/log/journal&lt;/code&gt; exists, and in memory otherwise.&lt;/p&gt;

&lt;p&gt;Once you've decided where to store the journal, you may want to set some limits. For example, &lt;strong&gt;&lt;code&gt;SystemMaxUse=4G&lt;/code&gt;&lt;/strong&gt; will limit &lt;code&gt;/var/log/journal&lt;/code&gt; to about 4GB. Similarly, &lt;strong&gt;&lt;code&gt;SystemKeepFree=10G&lt;/code&gt;&lt;/strong&gt; will try to keep 10GB of disk space free. If you choose to keep the journal in memory, the equivalent options are &lt;strong&gt;&lt;code&gt;RuntimeMaxUse&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;RuntimeKeepFree&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;
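&lt;p&gt;Putting these storage options together, a hypothetical &lt;code&gt;/etc/systemd/journald.conf&lt;/code&gt; (values are illustrative, not recommendations) might contain:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Journal]
Storage=persistent
SystemMaxUse=4G
SystemKeepFree=10G
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Remember to reload (or, on older versions, restart) systemd-journald after editing, as shown above.&lt;/p&gt;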

&lt;p&gt;You can check the current disk usage of the journal with &lt;a href="https://www.freedesktop.org/software/systemd/man/journalctl.html" rel="noopener noreferrer"&gt;journalctl&lt;/a&gt; via &lt;strong&gt;&lt;code&gt;journalctl --disk-usage&lt;/code&gt;&lt;/strong&gt;. If you need to, you can clean it up on demand via &lt;strong&gt;&lt;code&gt;journalctl --vacuum-size=4GB&lt;/code&gt;&lt;/strong&gt; (i.e. to reduce it to 4GB).&lt;/p&gt;

&lt;p&gt;Compression is enabled by default on log entries larger than 512 bytes. If you want to change this threshold to, say, 1KB, you'd add &lt;strong&gt;&lt;code&gt;Compress=1K&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Also by default, journald will drop log messages from a service if it exceeds certain limits. These limits can be configured via &lt;strong&gt;&lt;code&gt;RateLimitBurst&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;RateLimitIntervalSec&lt;/code&gt;&lt;/strong&gt;, which default to &lt;strong&gt;&lt;code&gt;10000&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;30s&lt;/code&gt;&lt;/strong&gt; respectively. The effective burst scales with the available free disk space: for example, with more than 64GB free, the multiplier is 6, meaning journald will drop logs from a service after 60K messages sent within 30 seconds.&lt;/p&gt;
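&lt;p&gt;To make that scaling concrete, here's the back-of-the-envelope calculation in shell (the multiplier of 6 is the example for more than 64GB of free space):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# effective burst = default RateLimitBurst * free-space multiplier
base_burst=10000   # default RateLimitBurst
multiplier=6       # applies when free disk space exceeds 64GB
echo $((base_burst * multiplier))   # prints 60000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;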

&lt;p&gt;The rate limit defaults are sensible, unless you have a specific service that's generating lots of logs (e.g. a web server). In that case, it might be better to set &lt;strong&gt;&lt;code&gt;LogRateLimitBurst&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;LogRateLimitIntervalSec&lt;/code&gt;&lt;/strong&gt; in that application's &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd.exec.html" rel="noopener noreferrer"&gt;service definition&lt;/a&gt;.&lt;/p&gt;
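&lt;p&gt;For example, a hypothetical drop-in for a chatty web server unit (path, unit name, and values are illustrative) could raise the limits like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /etc/systemd/system/nginx.service.d/override.conf
[Service]
LogRateLimitBurst=100000
LogRateLimitIntervalSec=30s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run &lt;code&gt;systemctl daemon-reload&lt;/code&gt; and restart the service for the drop-in to take effect.&lt;/p&gt;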

&lt;h2&gt;
  
  
  journald commands via journalctl
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/journalctl.html" rel="noopener noreferrer"&gt;journalctl&lt;/a&gt; is your main tool for interacting with the journal. If you just run it, you'll see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  all entries, from oldest to newest&lt;/li&gt;
&lt;li&gt;  paged by &lt;a href="https://linux.die.net/man/1/less" rel="noopener noreferrer"&gt;less&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  lines go past the edge of your screen if they have to (use left and right arrow keys to navigate)&lt;/li&gt;
&lt;li&gt;  format is similar to the syslog output, as it is configured in most Linux distributions: &lt;strong&gt;syslog timestamp + hostname + program and its PID + message&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's an example snippet:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Apr 09 10:22:49 localhost.localdomain su[866]: pam_unix(su-l:session): session opened for user solr by (uid=0)&amp;lt;
Apr 09 10:22:49 localhost.localdomain systemd[1]: Started Session c1 of user solr.&amp;lt;
Apr 09 10:22:49 localhost.localdomain systemd[1]: Created slice User Slice of solr.&amp;lt;
Apr 09 10:22:49 localhost.localdomain su[866]: (to solr) root on none
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is rarely what you want. More common scenarios are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;last N lines&lt;/strong&gt; (equivalent of &lt;code&gt;tail -n 20&lt;/code&gt; for N=20): &lt;code&gt;journalctl -n 20&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;follow&lt;/strong&gt; (&lt;code&gt;tail -f&lt;/code&gt; equivalent): &lt;code&gt;journalctl -f&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  page &lt;strong&gt;from newest to oldest&lt;/strong&gt;: &lt;code&gt;journalctl --reverse&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;skip paging and just grep&lt;/strong&gt; for something (e.g. “solr”): &lt;code&gt;journalctl --no-pager | grep solr&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you often find yourself using &lt;code&gt;--no-pager&lt;/code&gt;, you can change the default pager through the &lt;code&gt;SYSTEMD_PAGER&lt;/code&gt; variable. &lt;code&gt;export SYSTEMD_PAGER=cat&lt;/code&gt; &lt;strong&gt;will disable paging&lt;/strong&gt;. That said, you might want to look into journalctl's own options for displaying and filtering - described below - before using text processing tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  journalctl display settings
&lt;/h3&gt;

&lt;p&gt;The main option here is &lt;code&gt;--output&lt;/code&gt;, which can &lt;a href="https://www.freedesktop.org/software/systemd/man/journalctl.html" rel="noopener noreferrer"&gt;take many values&lt;/a&gt;. As an &lt;a href="https://sematext.com/consulting/logging/" rel="noopener noreferrer"&gt;ELK consultant&lt;/a&gt;, I want my timestamps &lt;a href="https://en.wikipedia.org/wiki/ISO_8601" rel="noopener noreferrer"&gt;ISO 8601&lt;/a&gt;, and &lt;strong&gt;&lt;code&gt;--output=short-iso&lt;/code&gt;&lt;/strong&gt; will do just that. Now this is more like it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2020-04-09T10:23:01+0000 localhost.localdomain solr[860]: Started Solr server on port 8983 (pid=999). Happy searching!
2020-04-09T10:23:01+0000 localhost.localdomain su[866]: pam_unix(su-l:session): session closed for user solr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;journald keeps more information than what the &lt;strong&gt;short/short-iso&lt;/strong&gt; output shows. With &lt;strong&gt;&lt;code&gt;--output=json-pretty&lt;/code&gt;&lt;/strong&gt; (or just &lt;strong&gt;json&lt;/strong&gt; if you want it compact), a single event can look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
 "__CURSOR" : "s=83694dffb084461ea30a168e6cef1e6c;i=103f;b=f0bbba1703cb43229559a8fcb4cb08b9;m=c2c9508c;t=5a2d9c22f07ed;x=c5fe854a514cef39",
 "__REALTIME_TIMESTAMP" : "1586431033018349",
 "__MONOTONIC_TIMESTAMP" : "3267973260",
 "_BOOT_ID" : "f0bbba1703cb43229559a8fcb4cb08b9",
 "PRIORITY" : "6",
 "_UID" : "0",
 "_GID" : "0",
 "_MACHINE_ID" : "13e3a06d01d54447a683822d7e0b4dc9",
 "_HOSTNAME" : "localhost.localdomain",
 "SYSLOG_FACILITY" : "3",
 "SYSLOG_IDENTIFIER" : "systemd",
 "_TRANSPORT" : "journal",
 "_PID" : "1",
 "_COMM" : "systemd",
 "_EXE" : "/usr/lib/systemd/systemd",
 "_CAP_EFFECTIVE" : "1fffffffff",
 "_SYSTEMD_CGROUP" : "/",
 "CODE_FILE" : "src/core/job.c",
 "CODE_FUNCTION" : "job_log_status_message",
 "RESULT" : "done",
 "MESSAGE_ID" : "9d1aaa27d60140bd96365438aad20286",
 "_SELINUX_CONTEXT" : "system_u:system_r:init_t:s0",
 "UNIT" : "user-0.slice",
 "MESSAGE" : "Removed slice User Slice of root.",
 "CODE_LINE" : "781",
 "_CMDLINE" : "/usr/lib/systemd/systemd --switched-root --system --deserialize 22",
 "_SOURCE_REALTIME_TIMESTAMP" : "1586431033018103"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is where you can use structured logging to filter events. Next up, we'll look closer at the most important options for filtering.&lt;/p&gt;
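&lt;p&gt;Before diving into journalctl's own filters, note that the JSON output also plays nicely with standard shell tools. Here's a minimal sketch (the sample entry is abbreviated from the output above; in practice you'd pipe &lt;code&gt;journalctl --output=json&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# extract the PRIORITY field from a single JSON journal entry
entry='{"PRIORITY":"6","_HOSTNAME":"localhost.localdomain","MESSAGE":"Removed slice User Slice of root."}'
priority=$(printf '%s' "$entry" | sed -n 's/.*"PRIORITY":"\([0-9]*\)".*/\1/p')
echo "$priority"   # prints 6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For anything non-trivial, a JSON-aware tool like jq is a better fit than sed.&lt;/p&gt;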

&lt;h3&gt;
  
  
  journald log filtering
&lt;/h3&gt;

&lt;p&gt;You can filter by any field (see the JSON output above) by specifying &lt;strong&gt;&lt;em&gt;key=value arguments&lt;/em&gt;&lt;/strong&gt;, like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl _SYSTEMD_UNIT=sshd.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;There are shortcuts, for example the &lt;strong&gt;&lt;code&gt;_SYSTEMD_UNIT&lt;/code&gt;&lt;/strong&gt; above can be expressed as &lt;strong&gt;&lt;code&gt;-u&lt;/code&gt;&lt;/strong&gt;. The above command is the equivalent of:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl -u sshd.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Other useful shortcuts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;severity&lt;/strong&gt; (here called &lt;strong&gt;priority&lt;/strong&gt;). &lt;strong&gt;&lt;code&gt;journalctl -p warning&lt;/code&gt;&lt;/strong&gt; will show logs with at least a severity of &lt;strong&gt;&lt;code&gt;warning&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  show only kernel messages: &lt;strong&gt;&lt;code&gt;journalctl --dmesg&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also filter by time, of course. Here, you have multiple options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;code&gt;--since/--until&lt;/code&gt;&lt;/strong&gt; as a &lt;strong&gt;full timestamp&lt;/strong&gt;. For example: &lt;strong&gt;&lt;code&gt;journalctl --since="2020-04-09 11:30:00"&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;date only&lt;/strong&gt; (00:00:00 is assumed as the time): &lt;strong&gt;&lt;code&gt;journalctl --since=2020-04-09&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;abbreviations&lt;/strong&gt;: &lt;strong&gt;&lt;code&gt;journalctl --since=yesterday --until=now&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In general, you have to specify the exact value you're looking for. The exception is &lt;code&gt;_SYSTEMD_UNIT&lt;/code&gt; (i.e. the &lt;code&gt;-u&lt;/code&gt; shortcut), where patterns also work:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl -u sshd*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Newer versions of systemd also support a &lt;strong&gt;&lt;code&gt;--grep&lt;/code&gt;&lt;/strong&gt; flag, which filters the &lt;code&gt;MESSAGE&lt;/code&gt; field by regex. But you can always pipe the journalctl output through grep itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  journald and boots
&lt;/h3&gt;

&lt;p&gt;Besides messages logged by applications, journald remembers significant events, such as system reboots. Here's an example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl MESSAGE="Server listening on 0.0.0.0 port 22."
-- Logs begin at Wed 2020-04-08 11:53:18 UTC, end at Thu 2020-04-09 12:01:01 UTC. --
Apr 08 11:53:23 localhost.localdomain sshd[822]: Server listening on 0.0.0.0 port 22.
Apr 08 13:23:42 localhost.localdomain sshd[7425]: Server listening on 0.0.0.0 port 22.
-- Reboot --
Apr 09 10:22:49 localhost.localdomain sshd[857]: Server listening on 0.0.0.0 port 22.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can suppress these special messages via &lt;strong&gt;-q&lt;/strong&gt;. Use &lt;strong&gt;-b&lt;/strong&gt; to show only messages from a certain boot. For example, to show messages from the current boot:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl MESSAGE="Server listening on 0.0.0.0 port 22." -b
-- Logs begin at Wed 2020-04-08 11:53:18 UTC, end at Thu 2020-04-09 12:01:01 UTC. --
Apr 09 10:22:49 localhost.localdomain sshd[857]: Server listening on 0.0.0.0 port 22.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can specify a boot as an offset to the current one (e.g. &lt;strong&gt;-b -1&lt;/strong&gt; is the boot before the last). You can also specify a boot ID, but first you need to know which boot IDs are available:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl --list-boots
-1 d26652f008ef4020b15a3d510bbcb381 Wed 2020-04-08 11:53:18 UTC—Wed 2020-04-08 14:31:16 UTC
 0 f0bbba1703cb43229559a8fcb4cb08b9 Thu 2020-04-09 10:22:43 UTC—Thu 2020-04-09 12:01:01 UTC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl MESSAGE="Server listening on 0.0.0.0 port 22." -b d26652f008ef4020b15a3d510bbcb381
-- Logs begin at Wed 2020-04-08 11:53:18 UTC, end at Thu 2020-04-09 12:01:01 UTC. --
Apr 08 11:53:23 localhost.localdomain sshd[822]: Server listening on 0.0.0.0 port 22.
Apr 08 13:23:42 localhost.localdomain sshd[7425]: Server listening on 0.0.0.0 port 22.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that boot history like this is only kept if you configure journald for persistent storage (see the configuration section above).&lt;/p&gt;

&lt;h2&gt;
  
  
  journald centralized logging
&lt;/h2&gt;

&lt;p&gt;As you probably noticed, journald is quite host-centric. In practice, you'll want to access these logs in a central location, without having to SSH into each machine.&lt;/p&gt;

&lt;p&gt;There are multiple ways of centralizing journald logs, and we'll detail each below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-upload.html" rel="noopener noreferrer"&gt;systemd-journal-upload&lt;/a&gt; uploads journal entries&lt;/strong&gt;. Either &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;directly to Sematext Cloud&lt;/a&gt; or to a log shipper that can read its output, such as the &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;open-source Logagent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-remote.service.html%23" rel="noopener noreferrer"&gt;systemd-journal-remote&lt;/a&gt; as a “centralizer”&lt;/strong&gt;. The idea is to have all journals on one host, so you can use journalctl to search (see above). This can work in “pull” or “push” mode&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;a &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/" rel="noopener noreferrer"&gt;syslog daemon&lt;/a&gt; or &lt;a href="https://sematext.com/blog/logstash-alternatives/" rel="noopener noreferrer"&gt;another log shipper&lt;/a&gt; reads from the local journal&lt;/strong&gt;. Then, it forwards logs to a central store like &lt;a href="https://sematext.com/guides/elk-stack/" rel="noopener noreferrer"&gt;ELK&lt;/a&gt; or &lt;a href="https://sematext.com/docs/logs/syslog/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;journald forwards entries to a local syslog socket&lt;/strong&gt;. Then, a log shipper (typically a syslog daemon) picks messages up and forwards them to the central store&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  systemd-journal-upload to ELK or Sematext Cloud
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-upload.html" rel="noopener noreferrer"&gt;systemd-journal-upload&lt;/a&gt; is a service that pushes new journal entries over HTTP/HTTPS. That destination can be the &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;Sematext Cloud Journald Receiver&lt;/a&gt; - the easiest way to centralize journald logs. And probably the best, as we'll discuss below.&lt;/p&gt;

&lt;p&gt;Although it's part of journald/systemd, &lt;code&gt;systemd-journal-upload&lt;/code&gt; isn't installed by default on most distros. So you have to add it via something like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt-get install systemd-journal-remote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then, uploading journal entries is as easy as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemd-journal-upload --url=http://logsene-journald-receiver.sematext.com:80/YOUR_LOGS_TOKEN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Though most likely you'll want to configure it as a service:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat /etc/systemd/journal-upload.conf
[Upload]
URL=http://logsene-journald-receiver.sematext.com:80/YOUR_LOGS_TOKEN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you need more control, or if you want to send journal entries to your local Elasticsearch, you can use the &lt;a href="https://github.com/sematext/logagent-js" rel="noopener noreferrer"&gt;open-source Logagent&lt;/a&gt; with its &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;journald input plugin&lt;/a&gt; as a journald centralizer: &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F04%2Fjornald-logging-post-image1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F04%2Fjornald-logging-post-image1.png"&gt;&lt;/a&gt; Here's the relevant part of &lt;code&gt;logagent.conf&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input:
  journal-upload:
    module: input-journald-upload
    port: 9090
    worker: 0
    systemdUnitFilter:
      include: !!js/regexp /.*/i
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Using Logagent and Elasticsearch or Sematext Cloud&lt;/strong&gt; (i.e. we host Logagent and Elasticsearch for you) is probably &lt;strong&gt;the best option to centralize journald logs&lt;/strong&gt;. That's because you get all journald's structured data over a reliable protocol (HTTP/HTTPS) with minimal overhead. The catch? Initial import is tricky, because it can generate a massive HTTP payload. For this, you might want to do the initial import by streaming journalctl output through &lt;a href="https://github.com/sematext/logagent-js" rel="noopener noreferrer"&gt;Logagent&lt;/a&gt;, like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl --output=json --no-page | logagent --index SEMATEXT-LOGS-TOKEN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  systemd-journal-remote
&lt;/h3&gt;

&lt;p&gt;Journald comes with its own “log centralizer”: &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-remote.service.html%23" rel="noopener noreferrer"&gt;systemd-journal-remote&lt;/a&gt;. You don't get anywhere near the flexibility of ELK/Sematext Cloud, but it's already there and it might be enough for small environments.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;systemd-journal-remote&lt;/code&gt; can either pull journals from remote systems or listen for journal entries on HTTP/HTTPS. The push model - where &lt;code&gt;systemd-journal-upload&lt;/code&gt; is in charge of pushing logs - is typically better because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  it continuously tails the journal and remembers where it left off (i.e. it maintains a cursor)&lt;/li&gt;
&lt;li&gt;  you don't need to open access to the journal of every system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;systemd-journal-remote&lt;/code&gt; typically comes in the same package as &lt;code&gt;systemd-journal-upload&lt;/code&gt;. Once it's installed, you can make it listen to HTTP/HTTPS traffic:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;host2# systemd-journal-remote --listen-http=0.0.0.0:19352 --output=/var/log/journal/remote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now you can push the journal of a remote host like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;host1# systemd-journal-upload --url=http://host2:19352
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  systemd-journal-remote and systemd-journal-gatewayd
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-remote.service.html%23" rel="noopener noreferrer"&gt;systemd-journal-remote&lt;/a&gt; can also pull journal entries from remote hosts. These hosts would normally serve their journal via &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-gatewayd.service.html%23" rel="noopener noreferrer"&gt;systemd-journal-gatewayd&lt;/a&gt; (which is often provided by the same package). Once you have systemd-journal-gatewayd, you can start it via:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;host1# systemctl start systemd-journal-gatewayd.socket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can verify if it works like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl host1:19531/entries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then, from the “central” host, you can use systemd-journal-remote to fetch journal entries:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;host2# systemd-journal-remote --url [http://](http://host1:19531)[host1](http://host1:19531)[:19531](http://host1:19531)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;By default, systemd-journal-remote will write the imported journal entries to &lt;code&gt;/var/log/journal/remote/&lt;/code&gt; (you might have to create the directory first!), so you can search them via &lt;code&gt;journalctl&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl -D /var/log/journal/remote/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Tools that read directly from the journal
&lt;/h3&gt;

&lt;p&gt;Another approach for centralizing journald logs is to &lt;strong&gt;have a &lt;a href="https://sematext.com/blog/logstash-alternatives/" rel="noopener noreferrer"&gt;log shipper&lt;/a&gt; read from the journal&lt;/strong&gt;, much like journalctl does. Then, it can process logs and send them to destinations like Elasticsearch or Sematext Cloud (which exposes the &lt;a href="https://sematext.com/docs/logs/index-events-via-elasticsearch-api/" rel="noopener noreferrer"&gt;Elasticsearch API&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;For this approach, there's a PoC &lt;a href="https://github.com/logstash-plugins/logstash-input-journald" rel="noopener noreferrer"&gt;journald input plugin for Logstash&lt;/a&gt;. As you probably know, &lt;a href="https://sematext.com/blog/getting-started-with-logstash/" rel="noopener noreferrer"&gt;Logstash is easy to use&lt;/a&gt;, so reading from the journal is as easy as:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input {
  journald {
  # you may add other options here, but of course the defaults are sensible :)
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://www.elastic.co/guide/en/beats/journalbeat/master/journalbeat-overview.html" rel="noopener noreferrer"&gt;Journalbeat&lt;/a&gt; is also available. It's as easy to install and use as &lt;a href="https://sematext.com/blog/using-filebeat-to-send-elasticsearch-logs-to-logsene/" rel="noopener noreferrer"&gt;Filebeat&lt;/a&gt;, except that it reads from the journal. But it's marked as experimental.&lt;/p&gt;

&lt;p&gt;Why PoC and experimental? Because of potential journal corruption which might lead to nasty results. Check the comments in &lt;a href="https://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html" rel="noopener noreferrer"&gt;rsyslog's journal input documentation&lt;/a&gt; for details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/" rel="noopener noreferrer"&gt;Syslog daemons&lt;/a&gt; are also log shippers. Some of them can also read from the journal, or even write to it. There's a lot to say about syslog and the journal, so we'll dissect the topic in a section of its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  journald vs syslog
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Journald provides a good out-of-the-box logging experience&lt;/strong&gt; for systemd. The trade-off is that journald is &lt;strong&gt;a bit of a monolith&lt;/strong&gt;, having everything from log storage and rotation to log transport and search. Some would argue that &lt;strong&gt;syslog is more UNIX-y&lt;/strong&gt;: more lenient and easier to integrate with other tools, which was the main criticism of journald to begin with.&lt;/p&gt;

&lt;p&gt;Flame wars aside, there's good integration between the two. Journald provides a &lt;a href="https://manpages.debian.org/jessie/manpages-dev/syslog.3.en.html" rel="noopener noreferrer"&gt;syslog API&lt;/a&gt; and can forward to syslog (see below). On the other hand, syslog daemons have journal integrations. For example, &lt;a href="https://www.rsyslog.com/" rel="noopener noreferrer"&gt;rsyslog&lt;/a&gt; provides plugins to both &lt;a href="https://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html" rel="noopener noreferrer"&gt;read from journald&lt;/a&gt; and &lt;a href="https://rsyslog.readthedocs.io/en/latest/configuration/modules/omjournal.html" rel="noopener noreferrer"&gt;write to journald&lt;/a&gt;. In fact, they recommend two architectures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  A small setup (e.g. N embedded devices and one server) could work by centralizing journald logs (see above). If the embedded devices don't have systemd/journald but do have syslog, they can forward logs via syslog to the server, which then writes them to its own journal. This journal acts like a mini-ELK.&lt;/li&gt;
&lt;li&gt;  A larger setup can work by aggregating journal entries through a syslog daemon. We'll concentrate on this scenario in the rest of this section.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are two ways of centralizing journal entries via syslog:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;syslog daemon acts as a journald client&lt;/strong&gt; (like journalctl or Logstash or Journalbeat)&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;journald forwards messages to syslog&lt;/strong&gt; (via socket)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Option 1) is slower (reading from the journal is slower than reading from a socket), but it captures all the fields from the journal. Option 2) is safer (e.g. no issues with journal corruption), but the journal will only forward traditional syslog fields (like severity, hostname and message). Typically, you'd go for 2) unless you need the structured info. Here's an example configuration for implementing 1) with rsyslog, and writing all messages to Elasticsearch or &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# module that reads from journal
module(load="imjournal"
 StateFile="/var/run/journal.state" # we write here where we left off
 PersistStateInterval="100" # update the state file every 100 messages
)
# journal entries are read as JSON, we'll need this to parse them
module(load="mmjsonparse")
# Elasticsearch or Sematext Cloud HTTP output
module(load="omelasticsearch")

# this is done on every message (i.e. parses the JSON)
action(type="mmjsonparse")

# output template that simply writes the parsed JSON
template(name="all-json" type="list"){
 property(name="$!all-json")
}

action(type="omelasticsearch"
 template="all-json" # use the template defined earlier
 searchIndex="SEMATEXT-LOGS-APP-TOKEN-GOES-HERE"
 server="logsene-receiver.sematext.com"
 serverport="80"
 bulkmode="on" # use the bulk API
 action.resumeretrycount="-1" # retry indefinitely if Logsene/Elasticsearch is unreachable
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For option 2), we'll need to configure journald to forward to a socket. It's as easy as adding this to &lt;code&gt;/etc/systemd/journald.conf&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ForwardToSyslog=yes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And it will write messages, in syslog format, to &lt;code&gt;/run/systemd/journal/syslog&lt;/code&gt;. On the rsyslog side, you'll have to configure its &lt;a href="https://rsyslog-doc.readthedocs.io/en/latest/configuration/modules/imuxsock.html" rel="noopener noreferrer"&gt;socket input module&lt;/a&gt; to listen to that socket. Here's a similar example of sending logs to Elasticsearch or Sematext Cloud:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module(load="imuxsock"
 SysSock.Name="/run/systemd/journal/syslog")

# template to write traditional syslog fields as JSON
template(name="plain-syslog"
 type="list") {
 constant(value="{")
 constant(value="\"timestamp\":\"") property(name="timereported" dateFormat="rfc3339")
 constant(value="\",\"host\":\"") property(name="hostname")
 constant(value="\",\"severity\":\"") property(name="syslogseverity-text")
 constant(value="\",\"facility\":\"") property(name="syslogfacility-text")
 constant(value="\",\"tag\":\"") property(name="syslogtag" format="json")
 constant(value="\",\"message\":\"") property(name="msg" format="json")
 constant(value="\"}")
}

action(type="omelasticsearch"
 template="plain-syslog" # use the template defined earlier
 searchIndex="SEMATEXT-LOGS-APP-TOKEN-GOES-HERE"
 server="logsene-receiver.sematext.com"
 serverport="80"
 bulkmode="on" # use the bulk API
 action.resumeretrycount="-1" # retry indefinitely if Logsene/Elasticsearch is unreachable
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Whether you read the journal through syslog, systemd-journal-upload or through a log shipper, all the above methods assume that you're dealing with Linux running on bare metal or VMs. But what if you're using containers? Let's explore your options in the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  journald and containers
&lt;/h2&gt;

&lt;p&gt;In this context, I think it's worth making a distinction between Docker containers and systemd containers. Let's take them one at a time.&lt;/p&gt;

&lt;h3&gt;
  
  
  journald and Docker
&lt;/h3&gt;

&lt;p&gt;Typically, a Docker container won't have systemd, because it would make it too “heavy”. As a consequence, it won't have journald, either. That said, you probably have journald on the host, if the host is running Linux. This means you can use the &lt;a href="https://docs.docker.com/config/containers/logging/journald/" rel="noopener noreferrer"&gt;journald logging driver&lt;/a&gt; to send all the logs of a host's containers to that host's journal. It's as easy as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run my_container --log-driver=journald
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And that container's logs will be in the journal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl CONTAINER_NAME=my_container --all
Apr 09 13:03:28 localhost.localdomain dockerd-current[25558]: hello journal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you want to use journald by default, you can make the change in &lt;code&gt;/etc/docker/daemon.json&lt;/code&gt; and restart Docker:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# cat /etc/docker/daemon.json
{
 "log-driver": "journald"
}
systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you have more than one host, you're back to the centralizing problem that we explored in the previous section: getting all journals in one place. This makes journald an intermediate step that may not be necessary.&lt;/p&gt;

&lt;p&gt;A better approach is to &lt;a href="https://sematext.com/docs/logs/sending-docker-logs/" rel="noopener noreferrer"&gt;centralize container logs&lt;/a&gt; via Logagent, which can run as a container. Here, Logagent picks up logs and forwards them to a central place, like Elasticsearch or Sematext Cloud. But it's not the only way. In fact, we explore different approaches, with their pros and cons, in our &lt;a href="https://sematext.com/guides/docker-logs/" rel="noopener noreferrer"&gt;Complete Guide to Docker logging&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  journald and systemd containers
&lt;/h3&gt;

&lt;p&gt;systemd provides containers as well (called &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-machined.service.html" rel="noopener noreferrer"&gt;machines&lt;/a&gt;) via &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-nspawn.html" rel="noopener noreferrer"&gt;systemd-nspawn&lt;/a&gt;. Unlike Docker containers, systemd-nspawn machines can log to the journal directly. You can read the logs of a specific machine like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl --machine $MACHINE_NAME
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Where &lt;code&gt;$MACHINE_NAME&lt;/code&gt; is one of the running machines. You'd use &lt;code&gt;machinectl list&lt;/code&gt; to see all of them.&lt;/p&gt;

&lt;p&gt;As with Docker's journald logging driver, this setup might be challenging when you have multiple hosts. You can either centralize your journals, as described in the previous section, or send logs from your systemd containers directly to the central location, via a &lt;a href="https://sematext.com/blog/logging-libraries-vs-log-shippers/" rel="noopener noreferrer"&gt;log shipper or a logging library&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Did you read all the way to the end? You're a hero! And you probably figured that journald is good for structured logging, quick local searches, and tight integration with systemd. Its design shows its weaknesses when it comes to centralizing log events. Here we have many options, but none is perfect. That said, &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;Logagent's journald input&lt;/a&gt; and &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;Sematext Cloud's journald receiver&lt;/a&gt; (the hosted equivalent) come pretty close.&lt;/p&gt;

</description>
      <category>journald</category>
      <category>journalctl</category>
      <category>syslog</category>
      <category>elk</category>
    </item>
    <item>
      <title>Entity Extraction with spaCy</title>
      <dc:creator>Radu Gheorghe</dc:creator>
      <pubDate>Fri, 26 Apr 2019 09:56:58 +0000</pubDate>
      <link>https://forem.com/sematext/entity-extraction-with-spacy-fi</link>
      <guid>https://forem.com/sematext/entity-extraction-with-spacy-fi</guid>
      <description>&lt;h2&gt;
  
  
  What is Entity Extraction?
&lt;/h2&gt;

&lt;p&gt;Entity extraction is, in the context of search, the process of figuring out which fields a query should target, as opposed to always hitting all fields. The reason we may want to involve entity extraction in search is to improve precision. For example: how do we tell that, when the user typed in Apple iPhone, the intent was to run &lt;strong&gt;company:Apple&lt;/strong&gt; AND &lt;strong&gt;product:iPhone&lt;/strong&gt;? And not bring back phone stickers in the shape of an apple?&lt;/p&gt;
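&lt;p&gt;To make the goal concrete, here's a minimal sketch of what we want to do once entities are extracted: rewrite the free-text query into a fielded boolean query. The helper name &lt;code&gt;to_fielded_query&lt;/code&gt; is ours, purely for illustration:&lt;/p&gt;

```python
# Hypothetical helper: turn extracted (field, value) pairs
# into a fielded boolean query string.
def to_fielded_query(entities):
    return " AND ".join(f"{field}:{value}" for field, value in entities)

print(to_fielded_query([("company", "Apple"), ("product", "iPhone")]))
# company:Apple AND product:iPhone
```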

&lt;h2&gt;
  
  
  What is spaCy?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://spacy.io/"&gt;spaCy&lt;/a&gt; is a Python framework that can do many &lt;a href="https://en.wikipedia.org/wiki/Natural_language_processing"&gt;Natural Language Processing&lt;/a&gt; (NLP) tasks. &lt;a href="https://spacy.io/usage/linguistic-features#named-entities"&gt;Named Entity Extraction&lt;/a&gt; (NER) is one of them, along with &lt;a href="https://spacy.io/usage/training#textcat"&gt;text classification&lt;/a&gt;, &lt;a href="https://spacy.io/usage/linguistic-features#pos-tagging"&gt;part-of-speech tagging&lt;/a&gt;, and others.&lt;/p&gt;

&lt;p&gt;If this sounds familiar, that may be because we previously wrote about a different Python framework that can help us with entity extraction: &lt;a href="https://sematext.com/blog/entity-extraction-scikit-learn-classifiers/"&gt;Scikit-learn&lt;/a&gt;. That said, Scikit-learn is more of a collection of machine learning tools than an NLP framework; in terms of functionality, spaCy is closer to &lt;a href="https://sematext.com/blog/entity-extraction-opennlp-tutorial/"&gt;OpenNLP&lt;/a&gt;. We used all three for entity extraction during our &lt;a href="https://www.slideshare.net/sematext/entity-extraction-for-product-search"&gt;Activate 2018 presentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Getting spaCy is as easy as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install spacy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this post, we’ll use a pre-built model to extract entities, then we’ll build our own model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using a pre-built model
&lt;/h2&gt;

&lt;p&gt;spaCy comes with &lt;a href="https://spacy.io/usage/models"&gt;pre-built models for lots of languages&lt;/a&gt;. For example, to get the English one, you’d do:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m spacy download en_core_web_sm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then, in your Python application, it’s a matter of loading it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nlp = spacy.load('en_core_web_sm')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then you can use it to extract entities. In our &lt;a href="https://github.com/sematext/activate/blob/master/spacy/def.py"&gt;Activate example&lt;/a&gt;, we did:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doc = nlp(u"#bbuzz 2016: Rafał Kuć - Running High Performance And Fault Tolerant Elasticsearch")
for entity in doc.ents:
    print(entity.label_, ' | ', entity.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Which outputs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MONEY | #bbuzz
DATE | 2016
PERSON | Rafał Kuć - Running High
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For this particular example, this result is “approximate” at best. 2016 is indeed a date, but &lt;a href="https://berlinbuzzwords.de/"&gt;#bbuzz&lt;/a&gt; isn’t money. And I doubt that &lt;a href="https://sematext.com/blog/author/kucrafal/"&gt;Rafał&lt;/a&gt; was Running High while giving that presentation.&lt;/p&gt;

&lt;p&gt;For this use-case, we’d need to build our own model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training a new model
&lt;/h2&gt;

&lt;p&gt;To train a new model, we first need to create a pipeline that defines how we process data. In this case, we want to extract entities. Then, we’ll train a model by running test data through this pipeline. Once the model is trained, we can use it to extract entities from new data as well.&lt;/p&gt;

&lt;p&gt;Let’s zoom into each step.&lt;/p&gt;

&lt;h3&gt;
  
  
  spaCy pipelines
&lt;/h3&gt;

&lt;p&gt;With spaCy you can do much more than just entity extraction. For example, before extracting entities, you may need to &lt;a href="https://spacy.io/usage/linguistic-features#tokenization"&gt;pre-process text&lt;/a&gt;, say via stemming. Or we may want to do &lt;a href="https://spacy.io/usage/linguistic-features#pos-tagging"&gt;part-of-speech tagging&lt;/a&gt;: is this word a verb or a noun?&lt;/p&gt;

&lt;p&gt;For the scope of our tutorial, we’ll create an empty model, give it a name, then add a simple pipeline to it. That simple pipeline will only do named entity extraction (NER):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nlp = spacy.blank('en') # new, empty model. Let’s say it’s for the English language
nlp.vocab.vectors.name = 'example_model_training' # give a name to our list of vectors
# add NER pipeline
ner = nlp.create_pipe('ner') # our pipeline would just do NER
nlp.add_pipe(ner, last=True) # we add the pipeline to the model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Data and labels
&lt;/h3&gt;

&lt;p&gt;To train the model, we’ll need some training data. In the case of product search, these would be queries, where we pre-label entities. For example:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DATA = \[
  (u"Search Analytics: Business Value &amp;amp; BigData NoSQL Backend, Otis Gospodnetic ", {'entities': [ (58,74,'PERSON') ] }),
  (u"Introduction to Elasticsearch by Radu ", {'entities': [ (16,29,'TECH'), (33,37,'PERSON') ] }),
  # …
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Our training data has a few characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The text itself is Unicode&lt;/li&gt;
&lt;li&gt;  The entities array contains a list of tuples. Each tuple is an entity labeled from the text&lt;/li&gt;
&lt;li&gt;  Each tuple contains three elements: start offset, end offset and entity name&lt;/li&gt;
&lt;/ul&gt;
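&lt;p&gt;Off-by-one offsets are a common source of training problems (spaCy warns about entities that don't align with token boundaries), so it's worth double-checking that each tuple slices out exactly the intended text. A quick sanity check, with a made-up example (the helper name is ours, not spaCy's):&lt;/p&gt;

```python
# Hypothetical sanity check: see what each (start, end, label)
# tuple actually selects from the text.
def check_annotations(text, entities):
    return [(label, text[start:end]) for start, end, label in entities]

print(check_annotations(u"Introduction to Elasticsearch", [(16, 29, 'TECH')]))
# [('TECH', 'Elasticsearch')]
```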

&lt;h3&gt;
  
  
  Training the model
&lt;/h3&gt;

&lt;p&gt;Before training, we need to make our model aware of the possible entities. To do that, we add all the labels we’re aware of:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nlp.entity.add_label('PERSON')
nlp.entity.add_label('TECH')
# ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now we can begin training. We’ll need to initialize the model’s weights and get an optimizer via our &lt;a href="https://spacy.io/api/language"&gt;model&lt;/a&gt;’s &lt;a href="https://spacy.io/api/language#begin_training"&gt;begin_training()&lt;/a&gt; method:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;optimizer = nlp.begin_training()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then we update the model with our training data. Each text, with its annotations (those labeled entities), would be passed to the &lt;a href="https://spacy.io/api/language#update"&gt;update() function of our model&lt;/a&gt;, along with the newly created optimizer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nlp.update([text], [annotations], sgd=optimizer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In our &lt;a href="https://github.com/sematext/activate/blob/master/spacy/train.py"&gt;Activate example&lt;/a&gt;, because we have little training data, we just loop through it a few times, in random order:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for i in range(20):
    random.shuffle(DATA)
    for text, annotations in DATA:
        nlp.update([text], [annotations], sgd=optimizer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And that’s it! Now we have a model built for our own use-case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Predicting entities
&lt;/h3&gt;

&lt;p&gt;The model we just built is already loaded in memory. If you don’t want to train it every time, you can &lt;a href="https://spacy.io/api/language#to_disk"&gt;save it to disk&lt;/a&gt; and &lt;a href="https://spacy.io/api/language#from_disk"&gt;load it&lt;/a&gt; when needed. With the model loaded, you’ll use it to predict entities just as you would with a pre-built model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doc = nlp(u"#bbuzz 2016: Rafał Kuć - Running High Performance And Fault Tolerant Elasticsearch")
for entity in doc.ents:
    print(entity.label_, ' | ', entity.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Even with this small dataset, results &lt;strong&gt;typically&lt;/strong&gt; look better than with the default model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PERSON | Rafał Kuć
TECH | Elasticsearch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I’ve mentioned &lt;strong&gt;typically&lt;/strong&gt; because, due to the randomization, the model comes out a little different on every run. Ultimately, if you want accurate results, there’s no substitute for training set size. Unless something was indeed fishy with &lt;a href="https://sematext.com/blog/author/kucrafal/"&gt;Rafał&lt;/a&gt; in 2016, because at times I get:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PERSON | Rafał Kuć&lt;br&gt;
TECH | High&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
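&lt;p&gt;If the run-to-run variation gets in the way of experimenting, one mitigation is to seed the randomness before shuffling. This is a generic Python sketch (recent spaCy versions also expose a helper to seed their internals, which we don’t use here); training itself may still introduce some variation:&lt;/p&gt;

```python
import random

# Seeding makes the shuffle order reproducible across runs.
DATA = ["query one", "query two", "query three", "query four"]

run1, run2 = list(DATA), list(DATA)
random.seed(42)
random.shuffle(run1)
random.seed(42)
random.shuffle(run2)
print(run1 == run2)
# True
```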
&lt;h2&gt;
  
  
  Conclusions and next steps
&lt;/h2&gt;

&lt;p&gt;Like in the &lt;a href="https://sematext.com/blog/entity-extraction-opennlp-tutorial/"&gt;OpenNLP example&lt;/a&gt; we showed before, spaCy comes with pre-built models and makes it easy to build your own. It also comes with a &lt;a href="https://spacy.io/api/cli#train"&gt;command-line training tool&lt;/a&gt;. That said, it’s less configurable, or at least the options aren’t as accessible as in a purpose-built tool like &lt;a href="https://sematext.com/blog/entity-extraction-scikit-learn-classifiers"&gt;Scikit-learn&lt;/a&gt;. For entity extraction, spaCy will use a &lt;a href="https://en.wikipedia.org/wiki/Convolutional_neural_network"&gt;Convolutional Neural Network&lt;/a&gt;, but you can &lt;a href="https://spacy.io/api/entityrecognizer#model"&gt;plug in your own model&lt;/a&gt; if you need to.&lt;/p&gt;

&lt;p&gt;If you find this stuff exciting, please join us: &lt;a href="https://sematext.com/jobs/"&gt;we’re hiring worldwide&lt;/a&gt;. If you need entity extraction, relevancy tuning, or any other help with your search infrastructure, please &lt;a href="https://sematext.com/contact/"&gt;reach out&lt;/a&gt;, because we provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/consulting/"&gt;Solr, Elasticsearch and Elastic Stack consulting&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/support/"&gt;Solr, Elasticsearch and Elastic Stack production support&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/training/"&gt;Solr, Elasticsearch and Elastic Stack training classes&lt;/a&gt; (on site and remote, public and private)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/cloud/"&gt;Monitoring, log centralization and tracing&lt;/a&gt; for not only Solr and Elasticsearch, but for other applications (e.g. Kafka, Zookeeper), &lt;a href="https://sematext.com/spm"&gt;infrastructure&lt;/a&gt; and &lt;a href="https://sematext.com/docker"&gt;containers&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>search</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Entity Extraction with Scikit-learn Classifiers</title>
      <dc:creator>Radu Gheorghe</dc:creator>
      <pubDate>Mon, 18 Mar 2019 14:39:28 +0000</pubDate>
      <link>https://forem.com/sematext/entity-extraction-with-scikit-learn-classifiers-28ag</link>
      <guid>https://forem.com/sematext/entity-extraction-with-scikit-learn-classifiers-28ag</guid>
      <description>&lt;h2&gt;
  
  
  What is entity extraction?
&lt;/h2&gt;

&lt;p&gt;Entity extraction is the process of figuring out which fields a query should target, as opposed to always hitting all fields. For example: how to tell, when the user typed in &lt;strong&gt;Apple iPhone&lt;/strong&gt;, that the intent was to run &lt;strong&gt;company:Apple&lt;/strong&gt; AND &lt;strong&gt;product:iPhone&lt;/strong&gt;?&lt;/p&gt;

&lt;h2&gt;
  
  
  Is entity extraction a classification problem?
&lt;/h2&gt;

&lt;p&gt;Typically, &lt;a href="https://sematext.com/blog/entity-extraction-opennlp-tutorial/"&gt;when you think about entity extraction, you think about context&lt;/a&gt;: in &lt;strong&gt;Nokia 3310 is an old phone&lt;/strong&gt; words like &lt;strong&gt;is&lt;/strong&gt; or &lt;strong&gt;an&lt;/strong&gt; are strong indicators that before them, we have a subject. E-commerce queries are a special case: we often have little context. In &lt;a href="https://www.slideshare.net/sematext/entity-extraction-for-product-search"&gt;our “Entity Extraction for Product Searches” presentation at Activate&lt;/a&gt;, we argued that if all you have is &lt;strong&gt;Nokia 3310&lt;/strong&gt;, figuring out that &lt;strong&gt;Nokia&lt;/strong&gt; is a manufacturer and &lt;strong&gt;3310&lt;/strong&gt; is a model is a classification problem. In this post, we’ll explore one of the approaches to solve this classification problem: training and using &lt;a href="https://scikit-learn.org"&gt;Scikit-learn&lt;/a&gt; classification models.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Scikit-learn and how can I get it?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://scikit-learn.org"&gt;Scikit-learn&lt;/a&gt; is a popular machine learning library. It’s written in Python, so to get it, you can just:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install sklearn
pip install numpy
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We’ll install &lt;a href="https://docs.scipy.org/doc/numpy/index.html"&gt;NumPy&lt;/a&gt; as well, because we need to provide the training set as a &lt;a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html"&gt;NumPy array&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature selection
&lt;/h2&gt;

&lt;p&gt;Before implementing anything, we need to figure out which features are relevant for classification. &lt;a href="https://en.wikipedia.org/wiki/Feature_selection"&gt;Feature selection&lt;/a&gt; is a continuous process, but we need something to begin with. In the &lt;a href="https://github.com/sematext/activate/tree/master/sklearn"&gt;Activate example&lt;/a&gt;, we used three features: term frequency, number of digits and number of spaces. We assume that, typically, manufacturer names will occur more often in our index than model numbers, which are pretty unique. We expect more digits in model numbers and more spaces in manufacturer names. The fundamental question is: what would help distinguish one entity from another? In this case, the manufacturer from the model number. You can get creative with features: &lt;a href="https://sematext.com/blog/using-solr-tag-text/"&gt;does the entity match a dictionary of manufacturers or models&lt;/a&gt;? How long is the query, and in which position(s) is our entity located? Position matters because there are common constructs in E-commerce, such as manufacturer+model (&lt;strong&gt;Nokia 3310&lt;/strong&gt;) or model+generation (&lt;strong&gt;iPhone 3GS&lt;/strong&gt;, if we stick to old school).&lt;/p&gt;

&lt;h2&gt;
  
  
  Training and test sets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data cleanup
&lt;/h3&gt;

&lt;p&gt;When it comes to training and testing a model, the old “garbage in, garbage out” saying applies here as well. You’ll want to curate your data as you see fit: lowercasing and stemming would be useful in many entity extraction setups. Just as they are for regular search :) When testing or applying the model, you’ll notice that some “entities” span multiple words. You can take word &lt;a href="https://en.wikipedia.org/wiki/N-gram"&gt;n-grams&lt;/a&gt; to fix this problem. For example, in &lt;strong&gt;Apple Mac Book&lt;/strong&gt;, you’d take &lt;strong&gt;apple&lt;/strong&gt;, &lt;strong&gt;mac&lt;/strong&gt;, &lt;strong&gt;book&lt;/strong&gt;, &lt;strong&gt;apple mac&lt;/strong&gt; and &lt;strong&gt;mac book&lt;/strong&gt;, and expect to get &lt;strong&gt;apple&lt;/strong&gt; as the manufacturer and &lt;strong&gt;mac&lt;/strong&gt; and &lt;strong&gt;mac book&lt;/strong&gt; as models. From there, you can take the larger gram (mac book) or both (mac + mac book, but rank “mac book” higher), depending on how you’d like to balance &lt;a href="https://en.wikipedia.org/wiki/Precision_and_recall"&gt;precision and recall&lt;/a&gt;.&lt;/p&gt;
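&lt;p&gt;The word n-gram idea can be sketched in a few lines of Python (a hypothetical helper, not taken from the Activate code):&lt;/p&gt;

```python
# Hypothetical helper: all word n-grams up to max_n, unigrams first.
def word_ngrams(text, max_n=2):
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

print(word_ngrams("Apple Mac Book"))
# ['apple', 'mac', 'book', 'apple mac', 'mac book']
```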

&lt;h3&gt;
  
  
  Parsing entities into feature arrays
&lt;/h3&gt;

&lt;p&gt;When training a model, you don’t feed Scikit-learn the actual words, but the features of those words. You’ll need code that, given the queries (or entities), can generate feature arrays. In our example, for &lt;strong&gt;Nokia&lt;/strong&gt;, you’ll have 0 numbers, 0 spaces and its frequency in your index. In our sample code, we read data from a file. We assume each line contains an entity and we also use the file to judge frequencies: if we encounter an entity N times, we’ll get a frequency of N. In the end, we return a dictionary, where the entity is the key, and the value is the feature array for that entity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_into_feature_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
     &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;le_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
         &lt;span class="n"&gt;le_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
         &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;le_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
             &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;n"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;le_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                 &lt;span class="c1"&gt;# other features besides frequency
&lt;/span&gt;                 &lt;span class="n"&gt;digits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isdigit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                 &lt;span class="n"&gt;spaces&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isspace&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                 &lt;span class="c1"&gt;# initialize an array of [frequency, digits, spaces]. Frequency is initially 1
&lt;/span&gt;                 &lt;span class="n"&gt;le_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spaces&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
             &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                 &lt;span class="c1"&gt;# increment frequency if we met this before
&lt;/span&gt;                 &lt;span class="n"&gt;le_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;le_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
         &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;le_dict&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
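
<p>To make the shape of the feature dictionary concrete, here's a minimal, self-contained sketch of the same logic with made-up sample lines (the entity names below are hypothetical, not from the actual data files):</p>

```python
# Same [frequency, digits, spaces] layout as above, on made-up sample lines
def features_for(lines):
    le_dict = {}
    for line in lines:
        line = line.strip("\n")
        if line not in le_dict:
            # other features besides frequency
            digits = sum(c.isdigit() for c in line)
            spaces = sum(c.isspace() for c in line)
            # [frequency, digits, spaces]; frequency starts at 1
            le_dict[line] = [1, digits, spaces]
        else:
            # increment frequency if we met this line before
            le_dict[line][0] += 1
    return le_dict

print(features_for(["Dell", "XPS 13", "Dell"]))
# → {'Dell': [2, 0, 0], 'XPS 13': [1, 2, 1]}
```

<p>Note how "Dell" appears twice, so its frequency is 2, while "XPS 13" picks up two digits and one space.</p>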



&lt;h2&gt;
  
  
  Training a model
&lt;/h2&gt;

&lt;p&gt;To train the model, we’ll need only the list of feature arrays, without the keys. This list of feature arrays is our training set (X), but we’ll also need labels for each entity (y). In our case, labels are manufacturers or models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# we have a file with manufacturers and one with models. Read them into dictionaries
&lt;/span&gt;&lt;span class="n"&gt;mfr_feature_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_into_feature_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"mfrs"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model_feature_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_into_feature_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# from the dictionaries, we get only the feature arrays and add them to one list
&lt;/span&gt;&lt;span class="n"&gt;training&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;mfr_feature_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mfr_feature_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model_feature_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_feature_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# make the list a NumPy array. That’s what Scikit-learn requires
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# add training labels. We know that we first added manufacturers, then models
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mfr_feature_dict&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"mfr"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_feature_dict&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;At this point, we can select a model and train it. Scikit-learn comes with a variety of &lt;a href="https://scikit-learn.org/stable/supervised_learning.html"&gt;classifiers&lt;/a&gt; out of the box, from the simple linear &lt;a href="https://scikit-learn.org/stable/modules/svm.html#classification"&gt;Support Vector Machine&lt;/a&gt; we’re using in this example to &lt;a href="https://scikit-learn.org/stable/modules/tree.html#classification"&gt;decision trees&lt;/a&gt; and &lt;a href="https://scikit-learn.org/stable/modules/neural_networks_supervised.html#classification"&gt;perceptrons&lt;/a&gt; (the same sort of algorithms you saw in our &lt;a href="https://sematext.com/blog/entity-extraction-opennlp-tutorial/"&gt;OpenNLP tutorial&lt;/a&gt;). You’d use them in a similar way, though each takes different parameters, of course. With our training X and y, and the algorithm selected, we can train a classifier. For a linear &lt;a href="https://en.wikipedia.org/wiki/Support-vector_machine#Support-vector_clustering_(SVC)"&gt;SVC&lt;/a&gt;, the code can be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# select the algorithm. Here, linear SVC
&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SVC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'linear'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# train it
&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Here, &lt;strong&gt;&lt;a href="https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769"&gt;C&lt;/a&gt;&lt;/strong&gt; &lt;a href="https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769"&gt;is the penalty parameter for the error term&lt;/a&gt;. The intuition is that, with higher C, your model will fit your training set better, but it may also lead to overfitting. There are &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC"&gt;other SVC parameters&lt;/a&gt; as well, such as the number of iterations.&lt;/p&gt;
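<p>Rather than guessing a value for <strong>C</strong>, you can also let cross-validation pick one. Here's a hedged sketch using Scikit-learn's <strong>GridSearchCV</strong> on synthetic data (in the post, X and y would come from the feature dictionaries, not from random numbers):</p>

```python
# Sketch: choose C by cross-validation instead of hardcoding it.
# The data here is synthetic, just to make the example runnable.
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
# two well-separated clusters of 3-dimensional feature arrays
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(3, 1, (20, 3))])
y = ["mfr"] * 20 + ["model"] * 20

# try several C values; GridSearchCV keeps the one with the best CV score
search = GridSearchCV(svm.SVC(kernel="linear"), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)  # the C that generalized best across folds
```

<p>The point of cross-validation here is exactly the overfitting trade-off described above: a C that fits the training folds perfectly but scores poorly on the held-out folds gets rejected.</p>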

&lt;h2&gt;
  
  
  Using the model to predict entities
&lt;/h2&gt;

&lt;p&gt;At this point, we can use our model for entity extraction. Or at least we can test it. To do that, we can build a test X from some test samples and use the &lt;strong&gt;predict()&lt;/strong&gt; function of our classifier to get the suggested entities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_from_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_file&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;test_X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

  &lt;span class="c1"&gt;# same function that we used for the training set: read manufacturers/codes from a file
&lt;/span&gt;  &lt;span class="c1"&gt;# then turn them into a dictionary of entities to feature arrays
&lt;/span&gt;  &lt;span class="n"&gt;test_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_into_feature_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# concatenate feature arrays into our X
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;feature_set&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;test_X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;feature_set&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_dict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# use our model to predict entities for each entity
&lt;/span&gt;  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_X&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
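
<p>If you'd rather try the classifier without reading test files, here's a self-contained sketch of the whole train-then-predict loop on hand-built <strong>[frequency, digits, spaces]</strong> feature arrays (the values are made up; real ones come from the feature dictionaries above):</p>

```python
# End-to-end sketch: train a linear SVC on tiny made-up feature arrays,
# then predict the entity type for two unseen feature arrays.
import numpy as np
from sklearn import svm

# manufacturers tend to have no digits; model codes tend to have several
X = np.array([[5, 0, 0], [3, 0, 1], [4, 0, 0],   # mfr-like
              [1, 4, 0], [1, 3, 1], [2, 5, 0]])  # model-like
y = ["mfr"] * 3 + ["model"] * 3

clf = svm.SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.predict([[2, 0, 0], [1, 4, 1]]))
# → ['mfr' 'model']
```

<p>Since the digit count cleanly separates the two classes in this toy data, the model labels the digit-free array as a manufacturer and the digit-heavy one as a model.</p>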



&lt;h2&gt;
  
  
  Conclusions and next steps
&lt;/h2&gt;

&lt;p&gt;With well-selected features, classification is a good way to extract entities from E-commerce queries. We showed an example here with Scikit-learn, but of course, there are other good options. &lt;a href="https://spacy.io/"&gt;SpaCy&lt;/a&gt; is one of them, and we’ll publish another how-to here soon! If you find this stuff exciting, please join us: &lt;a href="https://sematext.com/jobs/"&gt;we’re hiring worldwide&lt;/a&gt;. If you need entity extraction, relevancy tuning, or any other help with your search infrastructure, please &lt;a href="https://sematext.com/contact/"&gt;reach out&lt;/a&gt;, because we provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/consulting/"&gt;Solr, Elasticsearch and Elastic Stack consulting&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/support/"&gt;Solr, Elasticsearch and Elastic Stack production support&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/training/"&gt;Solr, Elasticsearch and Elastic Stack training classes&lt;/a&gt; (on site and remote, public and private)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/cloud/"&gt;Monitoring, log centralization and tracing&lt;/a&gt; for not only Solr and Elasticsearch but for other applications (e.g. Kafka, Zookeeper), &lt;a href="https://sematext.com/spm"&gt;infrastructure&lt;/a&gt; and &lt;a href="https://sematext.com/docker"&gt;containers&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to boost your productivity with Solr or Elasticsearch, check out &lt;strong&gt;two useful Cheat Sheets&lt;/strong&gt; that will save you time when you’re working with either of these two open-source search engines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to access all the new Solr features – Running Solr, Data Manipulation, Searching, Faceting, etc. &lt;a href="https://sematext.com/solr-cheat-sheet/?utm_medium=blogpost&amp;amp;utm_source=blogpost&amp;amp;utm_campaign=scikit-learn-classifiers-blogpost&amp;amp;utm_content=blog-solr-cheat-sheet"&gt;Download yours here&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Key Elasticsearch operations every developer needs – index creation, mapping manipulation, indexing API, and more! &lt;a href="https://sematext.com/elasticsearch-developer-cheat-sheet/?utm_medium=blogpost&amp;amp;utm_source=blogpost&amp;amp;utm_campaign=scikit-learn-classifiers-blogpost&amp;amp;utm_content=blog-elasticsearch-developer-cheatsheet"&gt;Download yours here&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>search</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
