<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Radu Gheorghe</title>
    <description>The latest articles on Forem by Radu Gheorghe (@radu0gheorghe).</description>
    <link>https://forem.com/radu0gheorghe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F145777%2Ff5adba2d-3bc5-47bb-926f-5abd57d6f1c6.jpeg</url>
      <title>Forem: Radu Gheorghe</title>
      <link>https://forem.com/radu0gheorghe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/radu0gheorghe"/>
    <language>en</language>
    <item>
      <title>Linux Logging Tutorial: What Are Linux Logs, How to View, Search and Centralize Them</title>
      <dc:creator>Radu Gheorghe</dc:creator>
      <pubDate>Mon, 27 Jul 2020 12:17:37 +0000</pubDate>
      <link>https://forem.com/sematext/linux-logging-tutorial-what-are-linux-logs-how-to-view-search-and-centralize-them-2bi5</link>
      <guid>https://forem.com/sematext/linux-logging-tutorial-what-are-linux-logs-how-to-view-search-and-centralize-them-2bi5</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR note&lt;/strong&gt;: if you want the &lt;code&gt;bzip2 -9&lt;/code&gt; version of this post, scroll down to the very last section for some quick pointers. If you want to learn a bit about Linux system logs, please continue, as we'll talk about all these and more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;What are Linux logs&lt;/strong&gt; and who generates them&lt;/li&gt;
&lt;li&gt;  Important &lt;strong&gt;types of Linux logs&lt;/strong&gt; and their typical location&lt;/li&gt;
&lt;li&gt;  How to &lt;strong&gt;read and search logs&lt;/strong&gt;, whether they're written by journald or syslog&lt;/li&gt;
&lt;li&gt;  How to &lt;strong&gt;centralize logs&lt;/strong&gt; from many servers in one location. Spoiler alert: the easiest way is to send all system logs to Sematext Cloud in &lt;strong&gt;three commands&lt;/strong&gt;, so you can build actionable dashboards:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F06%2Flinux-logging-post-3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F06%2Flinux-logging-post-3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Short Recap: What Are Linux Logs?
&lt;/h2&gt;

&lt;p&gt;Linux logs are timestamped pieces of data that Linux writes about what the server, kernel, services, and applications running on it are doing. They often come with other structured data, such as a hostname, making them a valuable &lt;a href="https://sematext.com/blog/log-analysis/" rel="noopener noreferrer"&gt;analysis&lt;/a&gt; and troubleshooting tool for admins when they encounter performance issues. You can read more about logs and why you should monitor them in our &lt;a href="https://sematext.com/guides/log-management/" rel="noopener noreferrer"&gt;complete guide to log management&lt;/a&gt;. Here's an example of an SSH log entry from the &lt;code&gt;/var/log/auth.log&lt;/code&gt; file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;May 5 08:57:27 ubuntu-bionic sshd[5544]: pam_unix(sshd:session): session opened for user vagrant by (uid=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Notice how the log contains a few fields (the timestamp, the hostname, the process writing the log and its PID) before the message itself. In Linux, logs come from different sources, mainly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/blog/journald-logging-tutorial/" rel="noopener noreferrer"&gt;Systemd journal&lt;/a&gt;. Most Linux distros have &lt;a href="https://systemd.io/" rel="noopener noreferrer"&gt;systemd&lt;/a&gt; to manage services (like SSH above). Systemd catches the output of these services (i.e., logs like the one above) and writes them to the journal. The journal is written in a binary format, so you'll use &lt;a href="https://sematext.com/blog/journald-logging-tutorial#toc-journald-commands-via-journalctl-5" rel="noopener noreferrer"&gt;journalctl&lt;/a&gt; to explore it, like:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    $ journalctl
    ...
    May 05 08:57:27 ubuntu-bionic sshd[5544]: pam_unix(sshd:session): session opened for user vagrant by (uid=0)
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/" rel="noopener noreferrer"&gt;Syslog&lt;/a&gt;. When there's no systemd, processes like SSH can write to a UNIX socket (e.g., &lt;code&gt;/dev/log&lt;/code&gt;) in the &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/#toc-syslog-message-formats-2" rel="noopener noreferrer"&gt;syslog message format&lt;/a&gt;. A &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/#toc-syslog-daemons-0" rel="noopener noreferrer"&gt;syslog daemon&lt;/a&gt; (e.g., &lt;a href="https://www.rsyslog.com/" rel="noopener noreferrer"&gt;rsyslog&lt;/a&gt;) then picks the message, parses it and writes it to various destinations. By default, it writes to files in &lt;code&gt;/var/log&lt;/code&gt;, which is how we got the earlier message from /var/log/auth.log.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Linux kernel&lt;/strong&gt; writes its own logs to a ring buffer. Systemd or the syslog daemon can read logs from this buffer, then write to the journal or flat files (typically &lt;code&gt;/var/log/kern.log&lt;/code&gt;). You can also see kernel logs directly via &lt;code&gt;dmesg&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ dmesg -T
...
[Tue May 5 08:41:31 2020] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/blog/auditd-logs-auditbeat-elasticsearch-logsene/" rel="noopener noreferrer"&gt;Audit logs&lt;/a&gt;. These are a special case of kernel messages designed for auditing actions such as file access. You'd typically have a service to listen for such security logs, like auditd. By default, &lt;a href="https://sematext.com/blog/auditd-logs-auditbeat-elasticsearch-logsene/#toc-audit-logs-in-linux-a-quick-tutorial-on-using-auditd-1" rel="noopener noreferrer"&gt;auditd&lt;/a&gt; writes audit messages to &lt;code&gt;/var/log/audit/audit.log&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Application logs&lt;/strong&gt;. Non-system applications tend to write to /var/log as well. Here are some popular examples:

&lt;ul&gt;
&lt;li&gt;  Apache HTTPD logs are typically written to &lt;code&gt;/var/log/httpd&lt;/code&gt; or &lt;code&gt;/var/log/apache2&lt;/code&gt;. HTTP access logs would be in &lt;code&gt;/var/log/httpd/access.log&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;  MySQL logs typically go to &lt;code&gt;/var/log/mysql.log&lt;/code&gt; or &lt;code&gt;/var/log/mysqld.log&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;  Older Linux versions would record boot logs via &lt;a href="https://manpages.debian.org/buster/bootlogd/bootlogd.8.en.html" rel="noopener noreferrer"&gt;bootlogd&lt;/a&gt; to &lt;code&gt;/var/log/boot&lt;/code&gt; or &lt;code&gt;/var/log/boot.log&lt;/code&gt;. Systemd now takes care of this: you can view boot-related logs via &lt;code&gt;journalctl -b&lt;/code&gt;. Distros without systemd have a syslog daemon reading from the kernel ring buffer, which normally has all the boot messages. So you can find your boot/reboot logs in &lt;code&gt;/var/log/messages&lt;/code&gt; or &lt;code&gt;/var/log/syslog&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;  Last but not least, you may have your own apps using a &lt;a href="https://sematext.com/blog/logging-libraries-vs-log-shippers/" rel="noopener noreferrer"&gt;logging library&lt;/a&gt; to write to a specific file&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;These sources can interact with each other: journald can forward all its messages to syslog. Applications can write to syslog or the journal. It's Linux, where everything is configurable. But for now, we'll focus on the defaults: where can you &lt;strong&gt;typically&lt;/strong&gt; find different types of logs in most modern distributions?&lt;/p&gt;

&lt;h2&gt;
  
  
  Log Files Location: Where Are They Stored?
&lt;/h2&gt;

&lt;p&gt;Typically, you'll find Linux server logs in the &lt;code&gt;/var/log&lt;/code&gt; directory. This is where syslog daemons are normally configured to write. It's also where most applications (e.g., Apache HTTPD) write by default. For the systemd journal, the default location is &lt;code&gt;/var/log/journal&lt;/code&gt;, but you can't view the files directly because they're binary. So how &lt;strong&gt;do&lt;/strong&gt; you view them?&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Check Linux Logs
&lt;/h2&gt;

&lt;p&gt;If your Linux distro uses Systemd (and most modern distros do), then all your system logs are in the journal. You can view them with &lt;code&gt;journalctl&lt;/code&gt;, and you can find the most important &lt;a href="https://sematext.com/blog/journald-logging-tutorial/#toc-journald-commands-via-journalctl-5" rel="noopener noreferrer"&gt;journalctl commands here&lt;/a&gt;. If your distribution writes to local files via syslog, you can view them with standard text processing tools, such as &lt;a href="https://linux.die.net/man/1/cat" rel="noopener noreferrer"&gt;cat&lt;/a&gt;, &lt;a href="https://linux.die.net/man/1/less" rel="noopener noreferrer"&gt;less&lt;/a&gt; or &lt;a href="https://linux.die.net/man/1/grep" rel="noopener noreferrer"&gt;grep&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# grep "error" /var/log/syslog | tail
Mar 31 09:48:02 ubuntu-bionic rsyslogd: unexpected GnuTLS error -53 - this could be caused by a broken connection. GnuTLS reports: Error in the push function. [v8.2002.0 try https://www.rsyslog.com/e/2078 ]
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
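&lt;p&gt;Beyond &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;awk&lt;/code&gt; is handy for pulling individual fields out of a syslog line. A minimal sketch, assuming the default whitespace-separated layout (timestamp in fields 1-3, hostname in field 4, process tag in field 5):&lt;/p&gt;

```shell
#!/bin/sh
line='May 5 08:57:27 ubuntu-bionic sshd[5544]: pam_unix(sshd:session): session opened for user vagrant by (uid=0)'

# with whitespace-separated fields, $4 is the hostname and $5 the process tag with its PID
echo "$line" | awk '{print $4}'   # prints: ubuntu-bionic
echo "$line" | awk '{print $5}'   # prints: sshd[5544]:
```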



&lt;p&gt;If you're using &lt;a href="https://linux.die.net/man/8/auditd" rel="noopener noreferrer"&gt;auditd&lt;/a&gt; to manage audit logs, you can check them in &lt;code&gt;/var/log/audit.log&lt;/code&gt; by default, but you can also search them with &lt;a href="https://sematext.com/blog/auditd-logs-auditbeat-elasticsearch-logsene/#toc-searching-and-analyzing-audit-logs-with-ausearch-and-aureport-5" rel="noopener noreferrer"&gt;ausearch&lt;/a&gt;. That said, you're better off shipping these security logs to a central location, especially if you have multiple servers. For this task, a tool like &lt;a href="https://www.elastic.co/beats/auditbeat" rel="noopener noreferrer"&gt;Auditbeat&lt;/a&gt; might work better than auditd. We wrote a separate &lt;a href="https://sematext.com/blog/auditd-logs-auditbeat-elasticsearch-logsene/" rel="noopener noreferrer"&gt;tutorial on centralizing audit logs with Auditbeat&lt;/a&gt;, but in the next section we'll focus on centralizing Linux system logs in general.&lt;/p&gt;

&lt;h2&gt;
  
  
  Centralizing Linux Logs
&lt;/h2&gt;

&lt;p&gt;System logs can be in two places: systemd's journal or plain text files written by a syslog daemon. Some distributions (e.g., Ubuntu) have both: journald is set up to forward to syslog. This is done by setting &lt;code&gt;ForwardToSyslog=Yes&lt;/code&gt; in &lt;code&gt;journald.conf&lt;/code&gt;.&lt;/p&gt;
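&lt;p&gt;As a sketch, checking or changing this on a systemd machine comes down to one setting in &lt;code&gt;/etc/systemd/journald.conf&lt;/code&gt; (the default varies by distribution):&lt;/p&gt;

```ini
# /etc/systemd/journald.conf
[Journal]
# forward every journal message to the local syslog daemon
ForwardToSyslog=yes
```

Apply the change with `systemctl restart systemd-journald`.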

&lt;h3&gt;
  
  
  Centralizing Logs via Journald
&lt;/h3&gt;

&lt;p&gt;Our recommendation is to use &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-upload.html" rel="noopener noreferrer"&gt;journal-upload&lt;/a&gt; to &lt;a href="https://sematext.com/blog/log-aggregation/" rel="noopener noreferrer"&gt;centralize logs&lt;/a&gt; if the distribution has systemd. You can check this by running &lt;code&gt;journalctl&lt;/code&gt;: if the command isn't found, you don't have the journal. As promised earlier, you can &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;centralize your system logs to Sematext Cloud with three commands&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Install journal-upload&lt;/strong&gt;. On Ubuntu, this works via &lt;code&gt;sudo apt-get install systemd-journal-remote&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Configure journal-upload&lt;/strong&gt;. In &lt;code&gt;/etc/systemd/journal-upload.conf&lt;/code&gt;, set &lt;code&gt;URL=&lt;/code&gt;&lt;code&gt;http://logsene-journald-receiver.sematext.com:80/YOUR_LOGS_TOKEN&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Start journal-upload&lt;/strong&gt; now and on every boot: &lt;code&gt;systemctl enable systemd-journal-upload &amp;amp;&amp;amp; systemctl start systemd-journal-upload&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
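&lt;p&gt;On an Ubuntu box, the three steps above boil down to the following commands, run on the target host (a sketch: the &lt;code&gt;sed&lt;/code&gt; pattern assumes the stock commented-out &lt;code&gt;URL=&lt;/code&gt; line, and &lt;code&gt;YOUR_LOGS_TOKEN&lt;/code&gt; is a placeholder):&lt;/p&gt;

```shell
# 1. install journal-upload
sudo apt-get install systemd-journal-remote
# 2. point journal-upload at the receiver
sudo sed -i 's|^#\? *URL=.*|URL=http://logsene-journald-receiver.sematext.com:80/YOUR_LOGS_TOKEN|' /etc/systemd/journal-upload.conf
# 3. start it now and on every boot
sudo systemctl enable systemd-journal-upload
sudo systemctl start systemd-journal-upload
```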

&lt;p&gt;Alternatively, you can use &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;Logagent's journal-upload input&lt;/a&gt; to gather journal entries from one or more machines, before shipping them to a central location. That central location can be &lt;a href="https://sematext.com/logsene" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;, a local &lt;a href="https://sematext.com/guides/elk-stack/" rel="noopener noreferrer"&gt;ELK stack&lt;/a&gt; or something else:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F06%2Flinux-logging-post-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F06%2Flinux-logging-post-2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to learn more about journald and journalctl, as well as the options you have around centralizing the journal, have a look at our &lt;a href="https://sematext.com/blog/journald-logging-tutorial" rel="noopener noreferrer"&gt;complete guide to journald&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Centralizing Logs via syslog
&lt;/h3&gt;

&lt;p&gt;There are a few scenarios in which centralizing Linux logs with syslog might make sense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Your Linux distribution doesn't have journald. This means system logs go directly to your &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/#toc-syslog-daemons-0" rel="noopener noreferrer"&gt;syslog daemon&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  You want to &lt;strong&gt;use your syslog daemon to collect and parse application logs&lt;/strong&gt; as well. An example is described in our &lt;a href="https://sematext.com/blog/recipe-apache-logs-rsyslog-parsing-elasticsearch/" rel="noopener noreferrer"&gt;tutorial for Apache logs with rsyslog and Elasticsearch&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  You want to &lt;strong&gt;forward journal entries to syslog&lt;/strong&gt; (i.e., by setting &lt;code&gt;ForwardToSyslog=Yes&lt;/code&gt; in &lt;code&gt;journald.conf&lt;/code&gt;), so you can use a &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/#toc-syslog-protocols-6" rel="noopener noreferrer"&gt;syslog protocol&lt;/a&gt; as a transport. However, this approach will lose some of journald's structured data: journald only forwards &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/#toc-syslog-message-formats-2" rel="noopener noreferrer"&gt;syslog-specific fields&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  Similar to the above, except that you'd &lt;strong&gt;configure the syslog daemon to read from the journal&lt;/strong&gt; (like &lt;code&gt;journalctl&lt;/code&gt; does). This approach doesn't lose structured data, but is more error prone (e.g., in case of journal corruption) and adds more overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In all situations listed above, data will go through your syslog daemon. From there, you can send it to any of the supported destinations. Most Linux distributions come with &lt;a href="https://www.rsyslog.com/" rel="noopener noreferrer"&gt;rsyslog&lt;/a&gt; installed. To forward data to another syslog server via TCP, you can add this line in your &lt;code&gt;/etc/rsyslog.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;*.* @@logsene-syslog-receiver.sematext.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This particular line will forward data to &lt;a href="https://sematext.com/docs/logs/syslog/" rel="noopener noreferrer"&gt;Sematext Cloud's syslog endpoint&lt;/a&gt;, but you can replace &lt;code&gt;logsene-syslog-receiver.sematext.com&lt;/code&gt; with the host name of your own syslog server. Some syslog daemons can output data to Elasticsearch via HTTP/HTTPS. &lt;a href="https://rsyslog.readthedocs.io/en/latest/configuration/modules/omelasticsearch.html" rel="noopener noreferrer"&gt;rsyslog is one of them&lt;/a&gt; and &lt;a href="https://www.syslog-ng.com/technical-documents/doc/syslog-ng-open-source-edition/3.21/administration-guide/32#TOPIC-1197819" rel="noopener noreferrer"&gt;so is syslog-ng&lt;/a&gt;. For example, if you use rsyslog on Ubuntu, you'll install the Elasticsearch output module first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get install rsyslog-elasticsearch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, in the configuration file, you need two elements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;A template that formats your syslog messages as JSON&lt;/strong&gt;, for Elasticsearch to consume
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;template(name="LogseneFormat" type="list" option.json="on") {
 constant(value="{")
 constant(value="\\"@timestamp\\":\\"")
 property(name="timereported" dateFormat="rfc3339")
 constant(value="\\",\\"message\\":\\"")
 property(name="msg")
 constant(value="\\",\\"host\\":\\"")
 property(name="hostname")
 constant(value="\\",\\"severity\\":\\"")
 property(name="syslogseverity-text")
 constant(value="\\",\\"facility\\":\\"")
 property(name="syslogfacility-text")
 constant(value="\\",\\"syslog-tag\\":\\"")
 property(name="syslogtag")
 constant(value="\\",\\"source\\":\\"")
 property(name="programname")
 constant(value="\\"}")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt; &lt;strong&gt;An action that forwards data to Elasticsearch&lt;/strong&gt;, using the template specified above
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module(load="omelasticsearch")
action(type="omelasticsearch"
 template="LogseneFormat" # the template that you defined earlier
 searchIndex="LOGSENE_APP_TOKEN_GOES_HERE"
 server="logsene-receiver.sematext.com"
 serverport="443"
 usehttps="on"
 bulkmode="on"
 queue.dequeuebatchsize="100" # how many messages to send at once
 action.resumeretrycount="-1") # buffer messages if connection fails
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above example shows how to send messages to &lt;a href="https://sematext.com/docs/logs/index-events-via-elasticsearch-api/" rel="noopener noreferrer"&gt;Sematext Cloud's Elasticsearch API&lt;/a&gt;, but you can adjust the action element to point it to your local Elasticsearch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;searchIndex&lt;/code&gt; would be your own &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.6/indices-rollover-index.html" rel="noopener noreferrer"&gt;rolling index alias&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;server&lt;/code&gt; would be the hostname of an Elasticsearch node&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;serverport&lt;/code&gt; can be 9200 or a custom port Elasticsearch listens to&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;usehttps="off"&lt;/code&gt; would send data over plain HTTP&lt;/li&gt;
&lt;/ul&gt;
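&lt;p&gt;Putting those substitutions together, a local-Elasticsearch version of the action might look like this (the index and host names are placeholders):&lt;/p&gt;

```
module(load="omelasticsearch")
action(type="omelasticsearch"
 template="LogseneFormat" # the JSON template defined earlier
 searchIndex="system-logs" # your rolling index alias
 server="localhost" # an Elasticsearch node
 serverport="9200"
 usehttps="off" # plain HTTP
 bulkmode="on"
 queue.dequeuebatchsize="100"
 action.resumeretrycount="-1")
```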

&lt;p&gt;Whether you use a syslog protocol, the Elasticsearch API or something else, it's better to &lt;strong&gt;forward syslog directly&lt;/strong&gt; from the syslog daemon than to &lt;strong&gt;tail individual files from&lt;/strong&gt; /var/log using a &lt;a href="https://sematext.com/blog/logstash-alternatives/" rel="noopener noreferrer"&gt;different log shipper&lt;/a&gt;. Tailing files will add overhead and miss some of the metadata, such as facility or severity. Which is not to say that files in /var/log are useless. You'll need them in two scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Logs of applications that write directly to &lt;code&gt;/var/log&lt;/code&gt;. For example, HTTP logs, FTP logs, MySQL logs and so on. You can tail such files with a log shipper. We have tutorials on &lt;a href="https://sematext.com/blog/recipe-apache-logs-rsyslog-parsing-elasticsearch/" rel="noopener noreferrer"&gt;parsing Apache logs with rsyslog&lt;/a&gt; and &lt;a href="https://sematext.com/blog/getting-started-with-logstash/" rel="noopener noreferrer"&gt;with Logstash&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Processing system logs with UNIX text tools like grep&lt;/strong&gt;. Here, different log files contain different kinds of data. We'll look at the typical configuration in the next section.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Are the Most Important Log Files You Should Monitor?
&lt;/h2&gt;

&lt;p&gt;By default, some distributions write system logs to syslog (either directly or from the journal). The syslog daemon writes these logs to files under &lt;code&gt;/var/log&lt;/code&gt;. Typically that syslog daemon is rsyslog, though syslog-ng works in a similar fashion. In this section, we'll look at the important log files and:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  what kind of information you'll find in them&lt;/li&gt;
&lt;li&gt;  how rsyslog is configured to write there (in case you want to change the configuration)&lt;/li&gt;
&lt;li&gt;  how to view the same information with &lt;code&gt;journalctl&lt;/code&gt;, in case it doesn't forward to syslog&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  /var/log/syslog or /var/log/messages
&lt;/h3&gt;

&lt;p&gt;This is the “catch-all” of syslog. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# logger "this is a test"
# tail -1 /var/log/syslog
May 7 15:33:11 ubuntu-bionic test-user: this is a test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Typically, you'll find all messages here (error logs, informational messages, and every other &lt;a href="https://en.wikipedia.org/wiki/Syslog#Severity_level" rel="noopener noreferrer"&gt;severity&lt;/a&gt;), as this line from &lt;code&gt;/etc/rsyslog.conf&lt;/code&gt; suggests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;*.* /var/log/syslog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The only exception is the &lt;a href="https://www.rsyslog.com/doc/v8-stable/configuration/actions.html#discard-stop" rel="noopener noreferrer"&gt;stop action&lt;/a&gt;. For example, you may find something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;:msg,contains,"[UFW " /var/log/ufw.log
&amp;amp; stop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In plain English, this block says:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If the &lt;a href="https://www.rsyslog.com/doc/v8-stable/configuration/properties.html#message-properties" rel="noopener noreferrer"&gt;msg property&lt;/a&gt; of this message contains "[UFW "&lt;/li&gt;
&lt;li&gt;  Then write to /var/log/ufw.log (the &lt;a href="https://www.rsyslog.com/doc/v8-stable/configuration/modules/omfile.html" rel="noopener noreferrer"&gt;file output module&lt;/a&gt; is implied)&lt;/li&gt;
&lt;li&gt;  Apply the next action (&amp;amp; means "same filter as the previous line") to those messages: don't process them further (stop)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if the &lt;code&gt;/var/log/syslog&lt;/code&gt; action comes later, it won't write UFW messages there. If there's nothing in &lt;code&gt;/var/log/syslog&lt;/code&gt; or &lt;code&gt;/var/log/messages&lt;/code&gt;, you probably have journald set up not to forward to syslog. The same data (and more) can be viewed via &lt;code&gt;journalctl&lt;/code&gt; with no parameters. By default, &lt;code&gt;journalctl&lt;/code&gt; pages data through &lt;code&gt;less&lt;/code&gt;, but if you want to filter through &lt;code&gt;grep&lt;/code&gt; you'll need to disable paging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl --no-pager | grep "this is a test"
May 07 15:33:11 ubuntu-bionic test-user[7526]: this is a test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  /var/log/kern.log or /var/log/dmesg
&lt;/h3&gt;

&lt;p&gt;This is where kernel messages go by default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Apr 17 16:47:28 ubuntu-bionic kernel: [ 0.004000] console [tty1] enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's really down to filtering syslog messages by the &lt;code&gt;kern&lt;/code&gt; facility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kern.* /var/log/kern.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you don't have syslog (or the file is missing) and you have journald, you can show kernel messages in &lt;code&gt;journalctl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl -k
...
Apr 17 16:47:28 ubuntu-bionic kernel: console [tty1] enabled
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  /var/log/auth.log or /var/log/secure
&lt;/h3&gt;

&lt;p&gt;This is where you find authentication messages, generated by services like sshd:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;May 7 15:03:09 ubuntu-bionic sshd[1202]: pam_unix(sshd:session): session closed for user vagrant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is another filter by facility, this time by two values (&lt;code&gt;auth&lt;/code&gt; and &lt;code&gt;authpriv&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auth,authpriv.* /var/log/auth.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can do such filters in &lt;code&gt;journalctl&lt;/code&gt; as well, except that you have to provide &lt;a href="https://en.wikipedia.org/wiki/Syslog#Facility" rel="noopener noreferrer"&gt;numeric facility values&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl SYSLOG_FACILITY=4 SYSLOG_FACILITY=10
...
May 7 15:03:09 ubuntu-bionic sshd[1202]: pam_unix(sshd:session): session closed for user vagrant
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
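&lt;p&gt;Those numbers come from the syslog encoding, where facility and severity are packed into a single PRI value (PRI = facility * 8 + severity). A quick sketch to double-check the facilities used in this post (&lt;code&gt;mail&lt;/code&gt;=2, &lt;code&gt;auth&lt;/code&gt;=4, &lt;code&gt;cron&lt;/code&gt;=9, &lt;code&gt;authpriv&lt;/code&gt;=10):&lt;/p&gt;

```shell
#!/bin/sh
# PRI = facility * 8 + severity (per RFC 5424)
pri() { echo $(( $1 * 8 + $2 )); }

pri 2 6    # mail.info       prints: 22
pri 4 6    # auth.info       prints: 38
pri 10 5   # authpriv.notice prints: 85
```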



&lt;h3&gt;
  
  
  /var/log/cron.log
&lt;/h3&gt;

&lt;p&gt;This is where your &lt;a href="http://man7.org/linux/man-pages/man8/cron.8.html" rel="noopener noreferrer"&gt;cron&lt;/a&gt; messages go (i.e., jobs that run regularly):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;May 06 08:19:01 localhost.localdomain anacron[1142]: Job `cron.daily' started
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yet another facility filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cron.* /var/log/cron
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With journalctl, you'd do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl SYSLOG_FACILITY=9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  /var/log/mail.log or /var/log/maillog
&lt;/h3&gt;

&lt;p&gt;Email daemons such as Postfix typically log to syslog in the &lt;code&gt;mail&lt;/code&gt; facility, just like &lt;code&gt;cron&lt;/code&gt; logs to the &lt;code&gt;cron&lt;/code&gt; facility. Then, rsyslog puts these logs in a different file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mail.* /var/log/mail.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you're using journald, you can still view mail logs with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl SYSLOG_FACILITY=2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because journald exposes the &lt;a href="https://linux.die.net/man/3/syslog" rel="noopener noreferrer"&gt;syslog API&lt;/a&gt;, everything that normally goes to syslog ends up in the journal.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR Takeaways
&lt;/h2&gt;

&lt;p&gt;Let's summarize the actionable points here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The location and format of your Linux system logs &lt;strong&gt;depend on how your distro is configured&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Most distros have systemd&lt;/strong&gt;. It means all your system &lt;strong&gt;logs live in the journal&lt;/strong&gt;. To view and search it, &lt;strong&gt;use journalctl&lt;/strong&gt;. Use the &lt;a href="https://sematext.com/blog/journald-logging-tutorial" rel="noopener noreferrer"&gt;complete guide to journald&lt;/a&gt; for reference.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Some distros send system logs to syslog&lt;/strong&gt;, either directly or through the journal. In this case, you likely have logs written to various files in &lt;code&gt;/var/log&lt;/code&gt;. Have a look at the section above for details on each important file.&lt;/li&gt;
&lt;li&gt;  Either way, if you manage multiple servers, you'll want to centralize system logs with &lt;a href="https://sematext.com/blog/best-log-management-tools/" rel="noopener noreferrer"&gt;log management software&lt;/a&gt; such as &lt;a href="https://sematext.com/cloud" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;. Sematext makes this very easy, as it has both &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;journald integration&lt;/a&gt; and &lt;a href="https://sematext.com/docs/logs/syslog/" rel="noopener noreferrer"&gt;syslog integration&lt;/a&gt;. Though you can use your own &lt;a href="https://sematext.com/guides/elk-stack/" rel="noopener noreferrer"&gt;ELK stack&lt;/a&gt; if you prefer to &lt;a href="https://sematext.com/elastic-stack-alternative/" rel="noopener noreferrer"&gt;build rather than buy&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  If you need help with your own ELK stack, please reach out, as we provide &lt;a href="https://sematext.com/consulting/logging/" rel="noopener noreferrer"&gt;ELK stack consulting&lt;/a&gt;, &lt;a href="https://sematext.com/support/elasticsearch-production-support/" rel="noopener noreferrer"&gt;Elasticsearch production support&lt;/a&gt; and &lt;a href="https://sematext.com/training/elasticsearch/" rel="noopener noreferrer"&gt;Elasticsearch and ELK stack training classes&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>linux</category>
      <category>logging</category>
    </item>
    <item>
      <title>Tutorial: Logging with journald</title>
      <dc:creator>Radu Gheorghe</dc:creator>
      <pubDate>Tue, 09 Jun 2020 07:37:39 +0000</pubDate>
      <link>https://forem.com/sematext/tutorial-logging-with-journald-50l8</link>
      <guid>https://forem.com/sematext/tutorial-logging-with-journald-50l8</guid>
      <description>&lt;p&gt;If you're using Linux, I'm sure you bumped into &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journald.service.html" rel="noopener noreferrer"&gt;journald&lt;/a&gt;: it's what most distros use by default for system logging. Most applications running as a service will also log to the journal. So how do you make use of these logs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  find the error or debug message that you're looking for?&lt;/li&gt;
&lt;li&gt;  make sure logs don't fill your disk?&lt;/li&gt;
&lt;li&gt;  centralize journals so you don't have to ssh to each box?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, we'll answer all the above and more. We will dive into the following topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;what is journald&lt;/strong&gt;, how it came to be and what are its benefits&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;main configuration options&lt;/strong&gt;, like when to remove old logs so you don't run out of disk&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;journald and containers&lt;/strong&gt;: can/should containers log to the journal?&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;journald vs syslog&lt;/strong&gt;: advantages and disadvantages of both, how they integrate&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;ways to centralize journals&lt;/strong&gt;. Advantages and disadvantages of each method, and which is &lt;a href="https://sematext.com/product-updates/%23/2020/we-have-a-new-logs-integration-for-journald" rel="noopener noreferrer"&gt;the best&lt;/a&gt;. Spoiler alert: you can &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;configure journald to send logs directly to Sematext Cloud&lt;/a&gt;; or you can &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;use the open-source Logagent as a journald aggregator&lt;/a&gt;. Either way, you'll have one place to search and analyze your journal events:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F04%2Fjornald-logging-post-image2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F04%2Fjornald-logging-post-image2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are lots of other options to centralize journal entries, and lots of tools to help. We'll explore them in detail, but before that, let's zoom in to journald itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is journald?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;journald&lt;/strong&gt; is the part of &lt;a href="https://systemd.io/" rel="noopener noreferrer"&gt;systemd&lt;/a&gt; that deals with logging. &lt;strong&gt;systemd&lt;/strong&gt;, at its core, is in charge of managing services: it starts them up and keeps them alive.&lt;/p&gt;

&lt;p&gt;All services and systemd itself need to log: “ssh started” or “user root logged in”, they might say. That's where journald comes in: to capture these logs, record them, make them easy to find, and remove them when they pass a certain age.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use journald?
&lt;/h2&gt;

&lt;p&gt;In short, because syslog sucks :) Jokes aside, the &lt;a href="https://docs.google.com/document/pub?id%3D1IC9yOXj7j6cdLLxWEBAGRL6wl97tFxgjLUEHIX3MSTs%26pli%3D1" rel="noopener noreferrer"&gt;paper announcing journald&lt;/a&gt; explained that systemd needed functionality that was hard to get through &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/" rel="noopener noreferrer"&gt;existing syslog implementations&lt;/a&gt;. Examples include structured logging, indexing logs for fast search, access control and signed messages.&lt;/p&gt;

&lt;p&gt;As you might expect, &lt;a href="https://rainer.gerhards.net/2013/05/rsyslog-vs-systemd-journal.html" rel="noopener noreferrer"&gt;not everyone agrees with these statements&lt;/a&gt; or the general approach systemd took with journald. But by now, systemd is adopted by most Linux distributions, and it includes journald as well. journald happily coexists with syslog daemons, as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  some syslog daemons can both read from and write to the journal&lt;/li&gt;
&lt;li&gt;  journald exposes the &lt;a href="https://linux.die.net/man/3/syslog" rel="noopener noreferrer"&gt;syslog API&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  journald benefits
&lt;/h3&gt;

&lt;p&gt;Think of journald as your mini-command-line-&lt;a href="https://sematext.com/guides/elk-stack/" rel="noopener noreferrer"&gt;ELK&lt;/a&gt; that lives on virtually every Linux box. It provides lots of features, most importantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Indexing&lt;/strong&gt;. journald uses a binary storage for logs, where data is indexed. Lookups are much faster than with plain text files&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Structured logging&lt;/strong&gt;. Though &lt;a href="https://sematext.com/blog/structured-logging-with-rsyslog-and-elasticsearch/" rel="noopener noreferrer"&gt;it's possible with syslog, too&lt;/a&gt;, it's enforced here. Combined with indexing, it means you can easily filter specific logs (e.g. with a set priority, in a set timeframe)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Access control&lt;/strong&gt;. By default, storage files are split by user, with different permissions to each. As a regular user, you won't see everything root sees, but you'll see your own logs&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Automatic log rotation&lt;/strong&gt;. You can configure journald (see below) to keep logs only up to a space limit, or based on free space&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Configuring journald
&lt;/h2&gt;

&lt;p&gt;To tweak how journald behaves, you'll edit &lt;code&gt;/etc/systemd/journald.conf&lt;/code&gt; and then reload the journal service like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl reload systemd-journald.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Though &lt;a href="https://github.com/systemd/systemd/issues/2236" rel="noopener noreferrer"&gt;earlier versions of journald need to be restarted&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemctl restart systemd-journald.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Most important settings will be around storage: whether the journal should be kept in memory or on disk, when to remove old logs and how much to rate limit. We'll focus on some of those next, but you can see all the configuration options in &lt;a href="https://www.freedesktop.org/software/systemd/man/journald.conf.html%23" rel="noopener noreferrer"&gt;journald.conf's man page&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  journald storage
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;&lt;code&gt;Storage&lt;/code&gt;&lt;/strong&gt; option controls whether the journal is stored in memory (under &lt;code&gt;/run/log/journal&lt;/code&gt;) or on disk (under &lt;code&gt;/var/log/journal&lt;/code&gt;). Setting &lt;strong&gt;&lt;code&gt;Storage=volatile&lt;/code&gt;&lt;/strong&gt; will store the journal in memory, while &lt;strong&gt;&lt;code&gt;Storage=persistent&lt;/code&gt;&lt;/strong&gt; will store it on disk. Most distributions have it set to &lt;code&gt;auto&lt;/code&gt;, which stores the journal on disk if &lt;code&gt;/var/log/journal&lt;/code&gt; exists, and in memory otherwise.&lt;/p&gt;

&lt;p&gt;Once you've decided where to store the journal, you may want to set some limits. For example, &lt;strong&gt;&lt;code&gt;SystemMaxUse=4G&lt;/code&gt;&lt;/strong&gt; will limit &lt;code&gt;/var/log/journal&lt;/code&gt; to about 4GB. Similarly, &lt;strong&gt;&lt;code&gt;SystemKeepFree=10G&lt;/code&gt;&lt;/strong&gt; will try to keep 10GB of disk space free. If you choose to keep the journal in memory, the equivalent options are &lt;strong&gt;&lt;code&gt;RuntimeMaxUse&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;RuntimeKeepFree&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;
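&lt;p&gt;Putting these storage options together, a hypothetical &lt;code&gt;/etc/systemd/journald.conf&lt;/code&gt; (values are illustrative, not recommendations) might contain:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Journal]
Storage=persistent
SystemMaxUse=4G
SystemKeepFree=10G
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Remember to reload (or, on older versions, restart) systemd-journald after editing, as shown above.&lt;/p&gt;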

&lt;p&gt;You can check the current disk usage of the journal with &lt;a href="https://www.freedesktop.org/software/systemd/man/journalctl.html" rel="noopener noreferrer"&gt;journalctl&lt;/a&gt; via &lt;strong&gt;&lt;code&gt;journalctl --disk-usage&lt;/code&gt;&lt;/strong&gt;. If you need to, you can clean it up on demand via &lt;strong&gt;&lt;code&gt;journalctl --vacuum-size=4GB&lt;/code&gt;&lt;/strong&gt; (i.e. to reduce it to 4GB).&lt;/p&gt;

&lt;p&gt;Compression is enabled by default on log entries larger than 512 bytes. If you want to change this threshold to, say, 1KB, you'd add &lt;strong&gt;&lt;code&gt;Compress=1K&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Also by default, journald will drop log messages from a service if it exceeds certain limits. These limits can be configured via &lt;strong&gt;&lt;code&gt;RateLimitBurst&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;RateLimitIntervalSec&lt;/code&gt;&lt;/strong&gt;, which default to &lt;strong&gt;&lt;code&gt;10000&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;30s&lt;/code&gt;&lt;/strong&gt; respectively. The effective burst scales with the available free disk space: for example, with more than 64GB free, the multiplier is 6, meaning journald will drop logs from a service after 60K messages sent within 30 seconds.&lt;/p&gt;
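&lt;p&gt;To make that scaling concrete, here's the back-of-the-envelope calculation in shell (the multiplier of 6 is the example for more than 64GB of free space):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# effective burst = default RateLimitBurst * free-space multiplier
base_burst=10000   # default RateLimitBurst
multiplier=6       # applies when free disk space exceeds 64GB
echo $((base_burst * multiplier))   # prints 60000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;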

&lt;p&gt;The rate limit defaults are sensible, unless you have a specific service that's generating lots of logs (e.g. a web server). In that case, it might be better to set &lt;strong&gt;&lt;code&gt;LogRateLimitBurst&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;LogRateLimitIntervalSec&lt;/code&gt;&lt;/strong&gt; in that application's &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd.exec.html" rel="noopener noreferrer"&gt;service definition&lt;/a&gt;.&lt;/p&gt;
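&lt;p&gt;For example, a hypothetical drop-in for a chatty web server unit (path, unit name, and values are illustrative) could raise the limits like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /etc/systemd/system/nginx.service.d/override.conf
[Service]
LogRateLimitBurst=100000
LogRateLimitIntervalSec=30s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run &lt;code&gt;systemctl daemon-reload&lt;/code&gt; and restart the service for the drop-in to take effect.&lt;/p&gt;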

&lt;h2&gt;
  
  
  journald commands via journalctl
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/journalctl.html" rel="noopener noreferrer"&gt;journalctl&lt;/a&gt; is your main tool for interacting with the journal. If you just run it, you'll see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  all entries, from oldest to newest&lt;/li&gt;
&lt;li&gt;  paged by &lt;a href="https://linux.die.net/man/1/less" rel="noopener noreferrer"&gt;less&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  lines go past the edge of your screen if they have to (use left and right arrow keys to navigate)&lt;/li&gt;
&lt;li&gt;  format is similar to the syslog output, as it is configured in most Linux distributions: &lt;strong&gt;syslog timestamp + hostname + program and its PID + message&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's an example snippet:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Apr 09 10:22:49 localhost.localdomain su[866]: pam_unix(su-l:session): session opened for user solr by (uid=0)&amp;lt;
Apr 09 10:22:49 localhost.localdomain systemd[1]: Started Session c1 of user solr.&amp;lt;
Apr 09 10:22:49 localhost.localdomain systemd[1]: Created slice User Slice of solr.&amp;lt;
Apr 09 10:22:49 localhost.localdomain su[866]: (to solr) root on none
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is rarely what you want. More common scenarios are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;last N lines&lt;/strong&gt; (equivalent of &lt;code&gt;tail -n 20&lt;/code&gt; for N=20): &lt;code&gt;journalctl -n 20&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;follow&lt;/strong&gt; (&lt;code&gt;tail -f&lt;/code&gt; equivalent): &lt;code&gt;journalctl -f&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  page &lt;strong&gt;from newest to oldest&lt;/strong&gt;: &lt;code&gt;journalctl --reverse&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;skip paging and just grep&lt;/strong&gt; for something (e.g. “solr”): &lt;code&gt;journalctl --no-pager | grep solr&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you often find yourself using &lt;code&gt;--no-pager&lt;/code&gt;, you can change the default pager through the &lt;code&gt;SYSTEMD_PAGER&lt;/code&gt; variable. &lt;code&gt;export SYSTEMD_PAGER=cat&lt;/code&gt; &lt;strong&gt;will disable paging&lt;/strong&gt;. That said, you might want to look into journalctl's own options for displaying and filtering - described below - before using text processing tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  journalctl display settings
&lt;/h3&gt;

&lt;p&gt;The main option here is &lt;code&gt;--output&lt;/code&gt;, which can &lt;a href="https://www.freedesktop.org/software/systemd/man/journalctl.html" rel="noopener noreferrer"&gt;take many values&lt;/a&gt;. As an &lt;a href="https://sematext.com/consulting/logging/" rel="noopener noreferrer"&gt;ELK consultant&lt;/a&gt;, I want my timestamps &lt;a href="https://en.wikipedia.org/wiki/ISO_8601" rel="noopener noreferrer"&gt;ISO 8601&lt;/a&gt;, and &lt;strong&gt;&lt;code&gt;--output=short-iso&lt;/code&gt;&lt;/strong&gt; will do just that. Now this is more like it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2020-04-09T10:23:01+0000 localhost.localdomain solr[860]: Started Solr server on port 8983 (pid=999). Happy searching!
2020-04-09T10:23:01+0000 localhost.localdomain su[866]: pam_unix(su-l:session): session closed for user solr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;journald keeps more information than what the &lt;strong&gt;short/short-iso&lt;/strong&gt; output shows. With &lt;strong&gt;&lt;code&gt;--output=json-pretty&lt;/code&gt;&lt;/strong&gt; (or just &lt;strong&gt;json&lt;/strong&gt; if you want it compact), a single event can look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
 "__CURSOR" : "s=83694dffb084461ea30a168e6cef1e6c;i=103f;b=f0bbba1703cb43229559a8fcb4cb08b9;m=c2c9508c;t=5a2d9c22f07ed;x=c5fe854a514cef39",
 "__REALTIME_TIMESTAMP" : "1586431033018349",
 "__MONOTONIC_TIMESTAMP" : "3267973260",
 "_BOOT_ID" : "f0bbba1703cb43229559a8fcb4cb08b9",
 "PRIORITY" : "6",
 "_UID" : "0",
 "_GID" : "0",
 "_MACHINE_ID" : "13e3a06d01d54447a683822d7e0b4dc9",
 "_HOSTNAME" : "localhost.localdomain",
 "SYSLOG_FACILITY" : "3",
 "SYSLOG_IDENTIFIER" : "systemd",
 "_TRANSPORT" : "journal",
 "_PID" : "1",
 "_COMM" : "systemd",
 "_EXE" : "/usr/lib/systemd/systemd",
 "_CAP_EFFECTIVE" : "1fffffffff",
 "_SYSTEMD_CGROUP" : "/",
 "CODE_FILE" : "src/core/job.c",
 "CODE_FUNCTION" : "job_log_status_message",
 "RESULT" : "done",
 "MESSAGE_ID" : "9d1aaa27d60140bd96365438aad20286",
 "_SELINUX_CONTEXT" : "system_u:system_r:init_t:s0",
 "UNIT" : "user-0.slice",
 "MESSAGE" : "Removed slice User Slice of root.",
 "CODE_LINE" : "781",
 "_CMDLINE" : "/usr/lib/systemd/systemd --switched-root --system --deserialize 22",
 "_SOURCE_REALTIME_TIMESTAMP" : "1586431033018103"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is where you can use structured logging to filter events. Next up, we'll look closer at the most important options for filtering.&lt;/p&gt;
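&lt;p&gt;Before diving into journalctl's own filters, note that the JSON output also plays nicely with standard shell tools. Here's a minimal sketch (the sample entry is abbreviated from the output above; in practice you'd pipe &lt;code&gt;journalctl --output=json&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# extract the PRIORITY field from a single JSON journal entry
entry='{"PRIORITY":"6","_HOSTNAME":"localhost.localdomain","MESSAGE":"Removed slice User Slice of root."}'
priority=$(printf '%s' "$entry" | sed -n 's/.*"PRIORITY":"\([0-9]*\)".*/\1/p')
echo "$priority"   # prints 6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For anything non-trivial, a JSON-aware tool like jq is a better fit than sed.&lt;/p&gt;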

&lt;h3&gt;
  
  
  journald log filtering
&lt;/h3&gt;

&lt;p&gt;You can filter by any field (see the JSON output above) by specifying &lt;strong&gt;&lt;em&gt;key=value arguments&lt;/em&gt;&lt;/strong&gt;, like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl _SYSTEMD_UNIT=sshd.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;There are shortcuts, for example the &lt;strong&gt;&lt;code&gt;_SYSTEMD_UNIT&lt;/code&gt;&lt;/strong&gt; above can be expressed as &lt;strong&gt;&lt;code&gt;-u&lt;/code&gt;&lt;/strong&gt;. The above command is the equivalent of:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl -u sshd.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Other useful shortcuts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;severity&lt;/strong&gt; (here called &lt;strong&gt;priority&lt;/strong&gt;). &lt;strong&gt;&lt;code&gt;journalctl -p warning&lt;/code&gt;&lt;/strong&gt; will show logs with at least a severity of &lt;strong&gt;&lt;code&gt;warning&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  show only kernel messages: &lt;strong&gt;&lt;code&gt;journalctl --dmesg&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also filter by time, of course. Here, you have multiple options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;code&gt;--since/--until&lt;/code&gt;&lt;/strong&gt; as a &lt;strong&gt;full timestamp&lt;/strong&gt;. For example: &lt;strong&gt;&lt;code&gt;journalctl --since="2020-04-09 11:30:00"&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;date only&lt;/strong&gt; (00:00:00 is assumed as the time): &lt;strong&gt;&lt;code&gt;journalctl --since=2020-04-09&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;abbreviations&lt;/strong&gt;: &lt;strong&gt;&lt;code&gt;journalctl --since=yesterday --until=now&lt;/code&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In general, you have to specify the exact value you're looking for. The exception is &lt;code&gt;_SYSTEMD_UNIT&lt;/code&gt; (i.e. the &lt;code&gt;-u&lt;/code&gt; shortcut), where patterns also work:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl -u sshd*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Newer versions of systemd also support a &lt;strong&gt;&lt;code&gt;--grep&lt;/code&gt;&lt;/strong&gt; flag, which filters the &lt;code&gt;MESSAGE&lt;/code&gt; field by regex. But you can always pipe the journalctl output through grep itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  journald and boots
&lt;/h3&gt;

&lt;p&gt;Besides messages logged by applications, journald remembers significant events, such as system reboots. Here's an example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl MESSAGE="Server listening on 0.0.0.0 port 22."
-- Logs begin at Wed 2020-04-08 11:53:18 UTC, end at Thu 2020-04-09 12:01:01 UTC. --
Apr 08 11:53:23 localhost.localdomain sshd[822]: Server listening on 0.0.0.0 port 22.
Apr 08 13:23:42 localhost.localdomain sshd[7425]: Server listening on 0.0.0.0 port 22.
-- Reboot --
Apr 09 10:22:49 localhost.localdomain sshd[857]: Server listening on 0.0.0.0 port 22.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can suppress these special messages via &lt;strong&gt;-q&lt;/strong&gt;. Use &lt;strong&gt;-b&lt;/strong&gt; to show only messages from a certain boot. For example, to show messages from the current boot:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl MESSAGE="Server listening on 0.0.0.0 port 22." -b
-- Logs begin at Wed 2020-04-08 11:53:18 UTC, end at Thu 2020-04-09 12:01:01 UTC. --
Apr 09 10:22:49 localhost.localdomain sshd[857]: Server listening on 0.0.0.0 port 22.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can specify a boot as an offset to the current one (e.g. &lt;strong&gt;-b -1&lt;/strong&gt; is the boot before the last). You can also specify a boot ID, but first you need to know which boot IDs are available:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl --list-boots
-1 d26652f008ef4020b15a3d510bbcb381 Wed 2020-04-08 11:53:18 UTC—Wed 2020-04-08 14:31:16 UTC
 0 f0bbba1703cb43229559a8fcb4cb08b9 Thu 2020-04-09 10:22:43 UTC—Thu 2020-04-09 12:01:01 UTC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl MESSAGE="Server listening on 0.0.0.0 port 22." -b d26652f008ef4020b15a3d510bbcb381
-- Logs begin at Wed 2020-04-08 11:53:18 UTC, end at Thu 2020-04-09 12:01:01 UTC. --
Apr 08 11:53:23 localhost.localdomain sshd[822]: Server listening on 0.0.0.0 port 22.
Apr 08 13:23:42 localhost.localdomain sshd[7425]: Server listening on 0.0.0.0 port 22.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Note that boot history like this is only kept if you configure journald for persistent storage (see the configuration section above).&lt;/p&gt;

&lt;h2&gt;
  
  
  journald centralized logging
&lt;/h2&gt;

&lt;p&gt;As you probably noticed, journald is quite host-centric. In practice, you'll want to access these logs in a central location, without having to SSH into each machine.&lt;/p&gt;

&lt;p&gt;There are multiple ways of centralizing journald logs, and we'll detail each below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-upload.html" rel="noopener noreferrer"&gt;systemd-journal-upload&lt;/a&gt; uploads journal entries&lt;/strong&gt;. Either &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;directly to Sematext Cloud&lt;/a&gt; or to a log shipper that can read its output, such as the &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;open-source Logagent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-remote.service.html%23" rel="noopener noreferrer"&gt;systemd-journal-remote&lt;/a&gt; as a “centralizer”&lt;/strong&gt;. The idea is to have all journals on one host, so you can use journalctl to search (see above). This can work in “pull” or “push” mode&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;a &lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/" rel="noopener noreferrer"&gt;syslog daemon&lt;/a&gt; or &lt;a href="https://sematext.com/blog/logstash-alternatives/" rel="noopener noreferrer"&gt;another log shipper&lt;/a&gt; reads from the local journal&lt;/strong&gt;. Then, it forwards logs to a central store like &lt;a href="https://sematext.com/guides/elk-stack/" rel="noopener noreferrer"&gt;ELK&lt;/a&gt; or &lt;a href="https://sematext.com/docs/logs/syslog/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;journald forwards entries to a local syslog socket&lt;/strong&gt;. Then, a log shipper (typically a syslog daemon) picks messages up and forwards them to the central store&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  systemd-journal-upload to ELK or Sematext Cloud
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-upload.html" rel="noopener noreferrer"&gt;systemd-journal-upload&lt;/a&gt; is a service that pushes new journal entries over HTTP/HTTPS. That destination can be the &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;Sematext Cloud Journald Receiver&lt;/a&gt; - the easiest way to centralize journald logs. And probably the best, as we'll discuss below.&lt;/p&gt;

&lt;p&gt;Although it's part of journald/systemd, &lt;code&gt;systemd-journal-upload&lt;/code&gt; isn't installed by default on most distros. So you have to add it via something like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt-get install systemd-journal-remote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then, uploading journal entries is as easy as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;systemd-journal-upload --url=http://logsene-journald-receiver.sematext.com:80/YOUR_LOGS_TOKEN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Though most likely you'll want to configure it as a service:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat /etc/systemd/journal-upload.conf
[Upload]
URL=http://logsene-journald-receiver.sematext.com:80/YOUR_LOGS_TOKEN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you need more control, or if you want to send journal entries to your local Elasticsearch, you can use the &lt;a href="https://github.com/sematext/logagent-js" rel="noopener noreferrer"&gt;open-source Logagent&lt;/a&gt; with its &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;journald input plugin&lt;/a&gt; as a journald centralizer: &lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F04%2Fjornald-logging-post-image1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsematext.com%2Fwp-content%2Fuploads%2F2020%2F04%2Fjornald-logging-post-image1.png"&gt;&lt;/a&gt; Here's the relevant part of &lt;code&gt;logagent.conf&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input:
  journal-upload:
    module: input-journald-upload
    port: 9090
    worker: 0
    systemdUnitFilter:
      include: !!js/regexp /.*/i
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Using Logagent and Elasticsearch or Sematext Cloud&lt;/strong&gt; (i.e. we host Logagent and Elasticsearch for you) is probably &lt;strong&gt;the best option to centralize journald logs&lt;/strong&gt;. That's because you get all journald's structured data over a reliable protocol (HTTP/HTTPS) with minimal overhead. The catch? Initial import is tricky, because it can generate a massive HTTP payload. For this, you might want to do the initial import by streaming journalctl output through &lt;a href="https://github.com/sematext/logagent-js" rel="noopener noreferrer"&gt;Logagent&lt;/a&gt;, like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl --output=json --no-page | logagent --index SEMATEXT-LOGS-TOKEN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  systemd-journal-remote
&lt;/h3&gt;

&lt;p&gt;Journald comes with its own “log centralizer”: &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-remote.service.html%23" rel="noopener noreferrer"&gt;systemd-journal-remote&lt;/a&gt;. You don't get anywhere near the flexibility of ELK/Sematext Cloud, but it's already there and it might be enough for small environments.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;systemd-journal-remote&lt;/code&gt; can either pull journals from remote systems or listen for journal entries on HTTP/HTTPS. The push model - where &lt;code&gt;systemd-journal-upload&lt;/code&gt; is in charge of pushing logs - is typically better because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  it continuously tails the journal and remembers where it left off (i.e. it maintains a cursor)&lt;/li&gt;
&lt;li&gt;  you don't need to open access to the journal of every system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;systemd-journal-remote&lt;/code&gt; typically comes in the same package as &lt;code&gt;systemd-journal-upload&lt;/code&gt;. Once it's installed, you can make it listen to HTTP/HTTPS traffic:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;host2# systemd-journal-remote --listen-http=0.0.0.0:19352 --output=/var/log/journal/remote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now you can push the journal of a remote host like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;host1# systemd-journal-upload --url=http://host2:19352
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  systemd-journal-remote and systemd-journal-gatewayd
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-remote.service.html%23" rel="noopener noreferrer"&gt;systemd-journal-remote&lt;/a&gt; can also pull journal entries from remote hosts. These hosts would normally serve their journal via &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-journal-gatewayd.service.html%23" rel="noopener noreferrer"&gt;systemd-journal-gatewayd&lt;/a&gt; (which is often provided by the same package). Once you have systemd-journal-gatewayd, you can start it via:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;host1# systemctl start systemd-journal-gatewayd.socket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can verify if it works like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl host1:19531/entries
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then, from the “central” host, you can use systemd-journal-remote to fetch journal entries:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;host2# systemd-journal-remote --url [http://](http://host1:19531)[host1](http://host1:19531)[:19531](http://host1:19531)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;By default, systemd-journal-remote will write the imported journal entries to &lt;code&gt;/var/log/journal/remote/&lt;/code&gt; (you might have to create the directory first!), so you can search them via &lt;code&gt;journalctl&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl -D /var/log/journal/remote/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Tools that read directly from the journal
&lt;/h3&gt;

&lt;p&gt;Another approach for centralizing journald logs is to &lt;strong&gt;have a &lt;a href="https://sematext.com/blog/logstash-alternatives/" rel="noopener noreferrer"&gt;log shipper&lt;/a&gt; read from the journal&lt;/strong&gt;, much like journalctl does. Then, it can process logs and send them to destinations like Elasticsearch or Sematext Cloud (which exposes the &lt;a href="https://sematext.com/docs/logs/index-events-via-elasticsearch-api/" rel="noopener noreferrer"&gt;Elasticsearch API&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;For this approach, there's a PoC &lt;a href="https://github.com/logstash-plugins/logstash-input-journald" rel="noopener noreferrer"&gt;journald input plugin for Logstash&lt;/a&gt;. As you probably know, &lt;a href="https://sematext.com/blog/getting-started-with-logstash/" rel="noopener noreferrer"&gt;Logstash is easy to use&lt;/a&gt;, so reading from the journal is as easy as:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input {
  journald {
  # you may add other options here, but of course the defaults are sensible :)
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://www.elastic.co/guide/en/beats/journalbeat/master/journalbeat-overview.html" rel="noopener noreferrer"&gt;Journalbeat&lt;/a&gt; is also available. It's as easy to install and use as &lt;a href="https://sematext.com/blog/using-filebeat-to-send-elasticsearch-logs-to-logsene/" rel="noopener noreferrer"&gt;Filebeat&lt;/a&gt;, except that it reads from the journal. But it's marked as experimental.&lt;/p&gt;

&lt;p&gt;Why PoC and experimental? Because of potential journal corruption which might lead to nasty results. Check the comments in &lt;a href="https://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html" rel="noopener noreferrer"&gt;rsyslog's journal input documentation&lt;/a&gt; for details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sematext.com/blog/what-is-syslog-daemons-message-formats-and-protocols/" rel="noopener noreferrer"&gt;Syslog daemons&lt;/a&gt; are also log shippers. Some of them can also read from the journal, or even write to it. There's a lot to say about syslog and the journal, so we'll dissect the topic in a section of its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  journald vs syslog
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Journald provides a good out-of-the-box logging experience&lt;/strong&gt; for systemd. The trade-off is that journald is &lt;strong&gt;a bit of a monolith&lt;/strong&gt;, having everything from log storage and rotation to log transport and search. Some would argue that &lt;strong&gt;syslog is more UNIX-y&lt;/strong&gt;: more lenient and easier to integrate with other tools, which was the main criticism of journald to begin with.&lt;/p&gt;

&lt;p&gt;Flame wars aside, there's good integration between the two. Journald provides a &lt;a href="https://manpages.debian.org/jessie/manpages-dev/syslog.3.en.html" rel="noopener noreferrer"&gt;syslog API&lt;/a&gt; and can forward to syslog (see below). On the other hand, syslog daemons have journal integrations. For example, &lt;a href="https://www.rsyslog.com/" rel="noopener noreferrer"&gt;rsyslog&lt;/a&gt; provides plugins to both &lt;a href="https://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html" rel="noopener noreferrer"&gt;read from journald&lt;/a&gt; and &lt;a href="https://rsyslog.readthedocs.io/en/latest/configuration/modules/omjournal.html" rel="noopener noreferrer"&gt;write to journald&lt;/a&gt;. In fact, they recommend two architectures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  A small setup (e.g. N embedded devices and one server) could work by centralizing journald logs (see above). If the embedded devices don't have systemd/journald but do have syslog, they can forward logs via syslog to the server, which then writes them to its own journal. This journal acts like a mini-ELK.&lt;/li&gt;
&lt;li&gt;  A larger setup can work by aggregating journal entries through a syslog daemon. We'll concentrate on this scenario in the rest of this section.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are two ways of centralizing journal entries via syslog:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;syslog daemon acts as a journald client&lt;/strong&gt; (like journalctl or Logstash or Journalbeat)&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;journald forwards messages to syslog&lt;/strong&gt; (via socket)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Option 1) is slower (reading from the journal is slower than reading from a socket), but it captures all the fields from the journal. Option 2) is safer (e.g. no issues with journal corruption), but the journal will only forward traditional syslog fields (like severity, hostname and message). Typically, you'd go for 2) unless you need the structured info. Here's an example configuration for implementing 1) with rsyslog, and writing all messages to Elasticsearch or &lt;a href="https://sematext.com/cloud/" rel="noopener noreferrer"&gt;Sematext Cloud&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# module that reads from journal
module(load="imjournal"
 StateFile="/var/run/journal.state" # we write here where we left off
 PersistStateInterval="100" # update the state file every 100 messages
)
# journal entries are read as JSON, we'll need this to parse them
module(load="mmjsonparse")
# Elasticsearch or Sematext Cloud HTTP output
module(load="omelasticsearch")

# this is done on every message (i.e. parses the JSON)
action(type="mmjsonparse")

# output template that simply writes the parsed JSON
template(name="all-json" type="list"){
 property(name="$!all-json")
}

action(type="omelasticsearch"
 template="all-json" # use the template defined earlier
 searchIndex="SEMATEXT-LOGS-APP-TOKEN-GOES-HERE"
 server="logsene-receiver.sematext.com"
 serverport="80"
 bulkmode="on" # use the bulk API
 action.resumeretrycount="-1" # retry indefinitely if Logsene/Elasticsearch is unreachable
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For option 2), we'll need to configure journald to forward to a socket. It's as easy as adding this to &lt;code&gt;/etc/systemd/journald.conf&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ForwardToSyslog=yes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And it will write messages, in syslog format, to &lt;code&gt;/run/systemd/journal/syslog&lt;/code&gt;. On the rsyslog side, you'll have to configure its &lt;a href="https://rsyslog-doc.readthedocs.io/en/latest/configuration/modules/imuxsock.html" rel="noopener noreferrer"&gt;socket input module&lt;/a&gt; to listen to that socket. Here's a similar example of sending logs to Elasticsearch or Sematext Cloud:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;module(load="imuxsock"
 SysSock.Name="/run/systemd/journal/syslog")

# template to write traditional syslog fields as JSON
template(name="plain-syslog"
 type="list") {
 constant(value="{")
 constant(value="\"timestamp\":\"") property(name="timereported" dateFormat="rfc3339")
 constant(value="\",\"host\":\"") property(name="hostname")
 constant(value="\",\"severity\":\"") property(name="syslogseverity-text")
 constant(value="\",\"facility\":\"") property(name="syslogfacility-text")
 constant(value="\",\"tag\":\"") property(name="syslogtag" format="json")
 constant(value="\",\"message\":\"") property(name="msg" format="json")
 constant(value="\"}")
}

action(type="omelasticsearch"
 template="plain-syslog" # use the template defined earlier
 searchIndex="SEMATEXT-LOGS-APP-TOKEN-GOES-HERE"
 server="logsene-receiver.sematext.com"
 serverport="80"
 bulkmode="on" # use the bulk API
 action.resumeretrycount="-1" # retry indefinitely if Logsene/Elasticsearch is unreachable
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Whether you read the journal through syslog, systemd-journal-upload or through a log shipper, all the above methods assume that you're dealing with Linux running on bare metal or VMs. But what if you're using containers? Let's explore your options in the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  journald and containers
&lt;/h2&gt;

&lt;p&gt;In this context, I think it's worth making a distinction between Docker containers and systemd containers. Let's take them one at a time.&lt;/p&gt;

&lt;h3&gt;
  
  
  journald and Docker
&lt;/h3&gt;

&lt;p&gt;Typically, a Docker container won't have systemd, because it would make it too “heavy”. As a consequence, it won't have journald, either. That said, you probably have journald on the host, if the host is running Linux. This means you can use the &lt;a href="https://docs.docker.com/config/containers/logging/journald/" rel="noopener noreferrer"&gt;journald logging driver&lt;/a&gt; to send all the logs of a host's containers to that host's journal. It's as easy as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run my_container --log-driver=journald
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And that container's logs will be in the journal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# journalctl CONTAINER_NAME=my_container --all
Apr 09 13:03:28 localhost.localdomain dockerd-current[25558]: hello journal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you want to use journald by default, you can make the change in &lt;code&gt;/etc/docker/daemon.json&lt;/code&gt; and restart Docker:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# cat /etc/docker/daemon.json
{
 "log-driver": "journald"
}
systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you have more than one host, you're back to the centralizing problem that we explored in the previous section: getting all journals in one place. This makes journald an intermediate step that may not be necessary.&lt;/p&gt;

&lt;p&gt;A better approach is to &lt;a href="https://sematext.com/docs/logs/sending-docker-logs/" rel="noopener noreferrer"&gt;centralize container logs&lt;/a&gt; via Logagent, which can run as a container. Here, Logagent picks up logs and forwards them to a central place, like Elasticsearch or Sematext Cloud. But it's not the only way. In fact, we explore different approaches, with their pros and cons, in our &lt;a href="https://sematext.com/guides/docker-logs/" rel="noopener noreferrer"&gt;Complete Guide to Docker logging&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  journald and systemd containers
&lt;/h3&gt;

&lt;p&gt;systemd provides containers as well (called &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-machined.service.html" rel="noopener noreferrer"&gt;machines&lt;/a&gt;) via &lt;a href="https://www.freedesktop.org/software/systemd/man/systemd-nspawn.html" rel="noopener noreferrer"&gt;systemd-nspawn&lt;/a&gt;. Unlike Docker containers, systemd-nspawn machines can log to the journal directly. You can read the logs of a specific machine like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;journalctl --machine $MACHINE_NAME
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Where &lt;code&gt;$MACHINE_NAME&lt;/code&gt; is one of the running machines. You'd use &lt;code&gt;machinectl list&lt;/code&gt; to see all of them.&lt;/p&gt;

&lt;p&gt;As with Docker's journald logging driver, this setup might be challenging when you have multiple hosts. You can either centralize your journals, as described in the previous section, or send logs from your systemd containers directly to the central location, via a &lt;a href="https://sematext.com/blog/logging-libraries-vs-log-shippers/" rel="noopener noreferrer"&gt;log shipper or a logging library&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Did you read all the way to the end? You're a hero! And you probably figured that journald is good for structured logging, quick local searches, and tight integration with systemd. Its design shows its weaknesses when it comes to centralizing log events. Here we have many options, but none is perfect. That said, &lt;a href="https://sematext.com/docs/logagent/input-plugin-journald-upload/" rel="noopener noreferrer"&gt;Logagent's journald input&lt;/a&gt; and &lt;a href="https://sematext.com/docs/integration/journald-integration/" rel="noopener noreferrer"&gt;Sematext Cloud's journald receiver&lt;/a&gt; (the hosted equivalent) come pretty close.&lt;/p&gt;

</description>
      <category>journald</category>
      <category>journalctl</category>
      <category>syslog</category>
      <category>elk</category>
    </item>
    <item>
      <title>Entity Extraction with spaCy</title>
      <dc:creator>Radu Gheorghe</dc:creator>
      <pubDate>Fri, 26 Apr 2019 09:56:58 +0000</pubDate>
      <link>https://forem.com/sematext/entity-extraction-with-spacy-fi</link>
      <guid>https://forem.com/sematext/entity-extraction-with-spacy-fi</guid>
      <description>&lt;h2&gt;
  
  
  What is Entity Extraction?
&lt;/h2&gt;

&lt;p&gt;Entity extraction is, in the context of search, the process of figuring out which fields a query should target, as opposed to always hitting all fields. The reason we may want to involve entity extraction in search is to improve precision. For example: how do we tell that, when the user typed in Apple iPhone, the intent was to run &lt;strong&gt;company:Apple&lt;/strong&gt; AND &lt;strong&gt;product:iPhone&lt;/strong&gt;? And not bring back phone stickers in the shape of an apple?&lt;/p&gt;
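&lt;p&gt;To make the goal concrete, here's a minimal sketch of what we want to do once entities are extracted: rewrite the free-text query into a fielded boolean query. The helper name &lt;code&gt;to_fielded_query&lt;/code&gt; is ours, purely for illustration:&lt;/p&gt;

```python
# Hypothetical helper: turn extracted (field, value) pairs
# into a fielded boolean query string.
def to_fielded_query(entities):
    return " AND ".join(f"{field}:{value}" for field, value in entities)

print(to_fielded_query([("company", "Apple"), ("product", "iPhone")]))
# company:Apple AND product:iPhone
```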

&lt;h2&gt;
  
  
  What is spaCy?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://spacy.io/"&gt;spaCy&lt;/a&gt; is a Python framework that can do many &lt;a href="https://en.wikipedia.org/wiki/Natural_language_processing"&gt;Natural Language Processing&lt;/a&gt; (NLP) tasks. &lt;a href="https://spacy.io/usage/linguistic-features#named-entities"&gt;Named Entity Extraction&lt;/a&gt; (NER) is one of them, along with &lt;a href="https://spacy.io/usage/training#textcat"&gt;text classification&lt;/a&gt;, &lt;a href="https://spacy.io/usage/linguistic-features#pos-tagging"&gt;part-of-speech tagging&lt;/a&gt;, and others.&lt;/p&gt;

&lt;p&gt;If this sounds familiar, that may be because we previously wrote about a different Python framework that can help us with entity extraction: &lt;a href="https://sematext.com/blog/entity-extraction-scikit-learn-classifiers/"&gt;Scikit-learn&lt;/a&gt;. That said, Scikit-learn is more of a collection of machine learning tools than an NLP framework; in terms of functionality, spaCy is closer to &lt;a href="https://sematext.com/blog/entity-extraction-opennlp-tutorial/"&gt;OpenNLP&lt;/a&gt;. We used all three for entity extraction during our &lt;a href="https://www.slideshare.net/sematext/entity-extraction-for-product-search"&gt;Activate 2018 presentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Getting spaCy is as easy as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install spacy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this post, we’ll use a pre-built model to extract entities, then we’ll build our own model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using a pre-built model
&lt;/h2&gt;

&lt;p&gt;spaCy comes with &lt;a href="https://spacy.io/usage/models"&gt;pre-built models for lots of languages&lt;/a&gt;. For example, to get the English one, you’d do:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m spacy download en_core_web_sm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then, in your Python application, it’s a matter of loading it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nlp = spacy.load('en_core_web_sm')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then you can use it to extract entities. In our &lt;a href="https://github.com/sematext/activate/blob/master/spacy/def.py"&gt;Activate example&lt;/a&gt;, we did:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doc = nlp(u"#bbuzz 2016: Rafał Kuć - Running High Performance And Fault Tolerant Elasticsearch")
for entity in doc.ents:
    print(entity.label_, ' | ', entity.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Which outputs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MONEY | #bbuzz
DATE | 2016
PERSON | Rafał Kuć - Running High
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For this particular example, this result is “approximate” at best. 2016 is indeed a date, but &lt;a href="https://berlinbuzzwords.de/"&gt;#bbuzz&lt;/a&gt; isn’t money. And I doubt that &lt;a href="https://sematext.com/blog/author/kucrafal/"&gt;Rafał&lt;/a&gt; was Running High while giving that presentation.&lt;/p&gt;

&lt;p&gt;For this use-case, we’d need to build our own model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training a new model
&lt;/h2&gt;

&lt;p&gt;To train a new model, we first need to create a pipeline that defines how we process data. In this case, we want to extract entities. Then, we’ll train a model by running test data through this pipeline. Once the model is trained, we can use it to extract entities from new data as well.&lt;/p&gt;

&lt;p&gt;Let’s zoom into each step.&lt;/p&gt;

&lt;h3&gt;
  
  
  spaCy pipelines
&lt;/h3&gt;

&lt;p&gt;With spaCy you can do much more than just entity extraction. For example, before extracting entities, you may need to &lt;a href="https://spacy.io/usage/linguistic-features#tokenization"&gt;pre-process text&lt;/a&gt;, say via stemming. Or we may want to do &lt;a href="https://spacy.io/usage/linguistic-features#pos-tagging"&gt;part-of-speech tagging&lt;/a&gt;: is this word a verb or a noun?&lt;/p&gt;

&lt;p&gt;For the scope of our tutorial, we’ll create an empty model, give it a name, then add a simple pipeline to it. That simple pipeline will only do named entity extraction (NER):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nlp = spacy.blank('en') # new, empty model. Let’s say it’s for the English language
nlp.vocab.vectors.name = 'example_model_training' # give a name to our list of vectors
# add NER pipeline
ner = nlp.create_pipe('ner') # our pipeline would just do NER
nlp.add_pipe(ner, last=True) # we add the pipeline to the model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Data and labels
&lt;/h3&gt;

&lt;p&gt;To train the model, we’ll need some training data. In the case of product search, these would be queries, where we pre-label entities. For example:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DATA = \[
  (u"Search Analytics: Business Value &amp;amp; BigData NoSQL Backend, Otis Gospodnetic ", {'entities': [ (58,74,'PERSON') ] }),
  (u"Introduction to Elasticsearch by Radu ", {'entities': [ (16,29,'TECH'), (33,37,'PERSON') ] }),
  # …
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Our training data has a few characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The text itself is Unicode&lt;/li&gt;
&lt;li&gt;  The entities array contains a list of tuples. Each tuple is an entity labeled from the text&lt;/li&gt;
&lt;li&gt;  Each tuple contains three elements: start offset, end offset and entity name&lt;/li&gt;
&lt;/ul&gt;
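&lt;p&gt;Off-by-one offsets are a common source of training problems (spaCy warns about entities that don't align with token boundaries), so it's worth double-checking that each tuple slices out exactly the intended text. A quick sanity check, with a made-up example (the helper name is ours, not spaCy's):&lt;/p&gt;

```python
# Hypothetical sanity check: see what each (start, end, label)
# tuple actually selects from the text.
def check_annotations(text, entities):
    return [(label, text[start:end]) for start, end, label in entities]

print(check_annotations(u"Introduction to Elasticsearch", [(16, 29, 'TECH')]))
# [('TECH', 'Elasticsearch')]
```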

&lt;h3&gt;
  
  
  Training the model
&lt;/h3&gt;

&lt;p&gt;Before training, we need to make our model aware of the possible entities. To do that, we add all the labels we’re aware of:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nlp.entity.add_label('PERSON')
nlp.entity.add_label('TECH')
# ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now we can begin training. We’ll need to initialize the model’s weights and get an optimizer via our &lt;a href="https://spacy.io/api/language"&gt;model&lt;/a&gt;’s &lt;a href="https://spacy.io/api/language#begin_training"&gt;begin_training()&lt;/a&gt; method:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;optimizer = nlp.begin_training()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then we update the model with our training data. Each text, with its annotations (those labeled entities), would be passed to the &lt;a href="https://spacy.io/api/language#update"&gt;update() function of our model&lt;/a&gt;, along with the newly created optimizer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nlp.update([text], [annotations], sgd=optimizer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In our &lt;a href="https://github.com/sematext/activate/blob/master/spacy/train.py"&gt;Activate example&lt;/a&gt;, because we have little training data, we just loop through it a few times, in random order:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for i in range(20):
    random.shuffle(DATA)
    for text, annotations in DATA:
        nlp.update([text], [annotations], sgd=optimizer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And that’s it! Now we have a model built for our own use-case.&lt;/p&gt;

&lt;h3&gt;
  
  
  Predicting entities
&lt;/h3&gt;

&lt;p&gt;The model we just built is already loaded in memory. If you don’t want to train it every time, you can &lt;a href="https://spacy.io/api/language#to_disk"&gt;save it to disk&lt;/a&gt; and &lt;a href="https://spacy.io/api/language#from_disk"&gt;load it&lt;/a&gt; when needed. With the model loaded, you’ll use it to predict entities just as you would with a pre-built model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;doc = nlp(u"#bbuzz 2016: Rafał Kuć - Running High Performance And Fault Tolerant Elasticsearch")
for entity in doc.ents:
    print(entity.label_, ' | ', entity.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Even with this small dataset, results &lt;strong&gt;typically&lt;/strong&gt; look better than with the default model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PERSON | Rafał Kuć
TECH | Elasticsearch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I’ve mentioned &lt;strong&gt;typically&lt;/strong&gt; because, due to the randomization, the model comes out a little different on every run. Ultimately, if you want accurate results, there’s no substitute for training set size. Unless something was indeed fishy with &lt;a href="https://sematext.com/blog/author/kucrafal/"&gt;Rafał&lt;/a&gt; in 2016, because at times I get:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PERSON | Rafał Kuć&lt;br&gt;
TECH | High&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
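&lt;p&gt;If the run-to-run variation gets in the way of experimenting, one mitigation is to seed the randomness before shuffling. This is a generic Python sketch (recent spaCy versions also expose a helper to seed their internals, which we don’t use here); training itself may still introduce some variation:&lt;/p&gt;

```python
import random

# Seeding makes the shuffle order reproducible across runs.
DATA = ["query one", "query two", "query three", "query four"]

run1, run2 = list(DATA), list(DATA)
random.seed(42)
random.shuffle(run1)
random.seed(42)
random.shuffle(run2)
print(run1 == run2)
# True
```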
&lt;h2&gt;
  
  
  Conclusions and next steps
&lt;/h2&gt;

&lt;p&gt;Like in the &lt;a href="https://sematext.com/blog/entity-extraction-opennlp-tutorial/"&gt;OpenNLP example&lt;/a&gt; we showed before, spaCy comes with pre-built models and makes it easy to build your own. It also comes with a &lt;a href="https://spacy.io/api/cli#train"&gt;command-line training tool&lt;/a&gt;. That said, it’s less configurable, or at least the options aren’t as accessible as in a purpose-built tool like &lt;a href="https://sematext.com/blog/entity-extraction-scikit-learn-classifiers"&gt;Scikit-learn&lt;/a&gt;. For entity extraction, spaCy will use a &lt;a href="https://en.wikipedia.org/wiki/Convolutional_neural_network"&gt;Convolutional Neural Network&lt;/a&gt;, but you can &lt;a href="https://spacy.io/api/entityrecognizer#model"&gt;plug in your own model&lt;/a&gt; if you need to.&lt;/p&gt;

&lt;p&gt;If you find this stuff exciting, please join us: &lt;a href="https://sematext.com/jobs/"&gt;we’re hiring worldwide&lt;/a&gt;. If you need entity extraction, relevancy tuning, or any other help with your search infrastructure, please &lt;a href="https://sematext.com/contact/"&gt;reach out&lt;/a&gt;, because we provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/consulting/"&gt;Solr, Elasticsearch and Elastic Stack consulting&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/support/"&gt;Solr, Elasticsearch and Elastic Stack production support&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/training/"&gt;Solr, Elasticsearch and Elastic Stack training classes&lt;/a&gt; (on site and remote, public and private)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/cloud/"&gt;Monitoring, log centralization and tracing&lt;/a&gt; for not only Solr and Elasticsearch, but for other applications (e.g. Kafka, Zookeeper), &lt;a href="https://sematext.com/spm"&gt;infrastructure&lt;/a&gt; and &lt;a href="https://sematext.com/docker"&gt;containers&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>search</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Entity Extraction with Scikit-learn Classifiers</title>
      <dc:creator>Radu Gheorghe</dc:creator>
      <pubDate>Mon, 18 Mar 2019 14:39:28 +0000</pubDate>
      <link>https://forem.com/sematext/entity-extraction-with-scikit-learn-classifiers-28ag</link>
      <guid>https://forem.com/sematext/entity-extraction-with-scikit-learn-classifiers-28ag</guid>
      <description>&lt;h2&gt;
  
  
  What is entity extraction?
&lt;/h2&gt;

&lt;p&gt;Entity extraction is the process of figuring out which fields a query should target, as opposed to always hitting all fields. For example: how to tell, when the user typed in &lt;strong&gt;Apple iPhone&lt;/strong&gt;, that the intent was to run &lt;strong&gt;company:Apple&lt;/strong&gt; AND &lt;strong&gt;product:iPhone&lt;/strong&gt;?&lt;/p&gt;

&lt;h2&gt;
  
  
  Is entity extraction a classification problem?
&lt;/h2&gt;

&lt;p&gt;Typically, &lt;a href="https://sematext.com/blog/entity-extraction-opennlp-tutorial/"&gt;when you think about entity extraction, you think about context&lt;/a&gt;: in &lt;strong&gt;Nokia 3310 is an old phone&lt;/strong&gt; words like &lt;strong&gt;is&lt;/strong&gt; or &lt;strong&gt;an&lt;/strong&gt; are strong indicators that before them, we have a subject. E-commerce queries are a special case: we often have little context. In &lt;a href="https://www.slideshare.net/sematext/entity-extraction-for-product-search"&gt;our “Entity Extraction for Product Searches” presentation at Activate&lt;/a&gt;, we argued that if all you have is &lt;strong&gt;Nokia 3310&lt;/strong&gt;, figuring out that &lt;strong&gt;Nokia&lt;/strong&gt; is a manufacturer and &lt;strong&gt;3310&lt;/strong&gt; is a model is a classification problem. In this post, we’ll explore one of the approaches to solve this classification problem: training and using &lt;a href="https://scikit-learn.org"&gt;Scikit-learn&lt;/a&gt; classification models.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Scikit-learn and how can I get it?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://scikit-learn.org"&gt;Scikit-learn&lt;/a&gt; is a popular machine learning library. It’s written in Python, so to get it, you can just:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install sklearn
pip install numpy
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;We’ll install &lt;a href="https://docs.scipy.org/doc/numpy/index.html"&gt;NumPy&lt;/a&gt; as well, because we need to provide the training set as a &lt;a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html"&gt;NumPy array&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature selection
&lt;/h2&gt;

&lt;p&gt;Before implementing anything, we need to figure out which features are relevant for classification. &lt;a href="https://en.wikipedia.org/wiki/Feature_selection"&gt;Feature selection&lt;/a&gt; is a continuous process, but we need something to begin with. In the &lt;a href="https://github.com/sematext/activate/tree/master/sklearn"&gt;Activate example&lt;/a&gt;, we used three features: term frequency, number of digits and number of spaces. We assume that, typically, manufacturer names will occur more often in our index than model numbers, which are pretty unique. We expect more digits in model numbers and more spaces in manufacturer names. The fundamental question is: what would help distinguish one entity from another? In this case, the manufacturer from the model number. You can get creative with features: &lt;a href="https://sematext.com/blog/using-solr-tag-text/"&gt;does the entity match a dictionary of manufacturers or models&lt;/a&gt;? How long is the query, and in which position(s) is our entity located? Position matters because there are common constructs in E-commerce, such as manufacturer+model (&lt;strong&gt;Nokia 3310&lt;/strong&gt;) or model+generation (&lt;strong&gt;iPhone 3GS&lt;/strong&gt;, if we stick to old school).&lt;/p&gt;

&lt;h2&gt;
  
  
  Training and test sets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data cleanup
&lt;/h3&gt;

&lt;p&gt;When it comes to training and testing a model, the old “garbage in, garbage out” saying applies here as well. You’ll want to curate your data as you see fit: lowercasing and stemming would be useful in many entity extraction setups. Just as they are for regular search :) When testing or applying the model, you’ll notice that some “entities” span multiple words. You can take word &lt;a href="https://en.wikipedia.org/wiki/N-gram"&gt;n-grams&lt;/a&gt; to fix this problem. For example, in &lt;strong&gt;Apple Mac Book&lt;/strong&gt;, you’d take &lt;strong&gt;apple&lt;/strong&gt;, &lt;strong&gt;mac&lt;/strong&gt;, &lt;strong&gt;book&lt;/strong&gt;, &lt;strong&gt;apple mac&lt;/strong&gt; and &lt;strong&gt;mac book&lt;/strong&gt;, and expect to get &lt;strong&gt;apple&lt;/strong&gt; as the manufacturer and &lt;strong&gt;mac&lt;/strong&gt; and &lt;strong&gt;mac book&lt;/strong&gt; as models. From there, you can take the larger gram (mac book) or both (mac + mac book, but rank “mac book” higher), depending on how you’d like to balance &lt;a href="https://en.wikipedia.org/wiki/Precision_and_recall"&gt;precision and recall&lt;/a&gt;.&lt;/p&gt;
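&lt;p&gt;The word n-gram idea can be sketched in a few lines of Python (a hypothetical helper, not taken from the Activate code):&lt;/p&gt;

```python
# Hypothetical helper: all word n-grams up to max_n, unigrams first.
def word_ngrams(text, max_n=2):
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

print(word_ngrams("Apple Mac Book"))
# ['apple', 'mac', 'book', 'apple mac', 'mac book']
```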

&lt;h3&gt;
  
  
  Parsing entities into feature arrays
&lt;/h3&gt;

&lt;p&gt;When training a model, you don’t feed Scikit-learn the actual words, but the features of those words. You’ll need code that, given the queries (or entities), can generate feature arrays. In our example, for &lt;strong&gt;Nokia&lt;/strong&gt;, you’ll have 0 numbers, 0 spaces and its frequency in your index. In our sample code, we read data from a file. We assume each line contains an entity and we also use the file to judge frequencies: if we encounter an entity N times, we’ll get a frequency of N. In the end, we return a dictionary, where the entity is the key, and the value is the feature array for that entity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_into_feature_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
     &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;le_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
         &lt;span class="n"&gt;le_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
         &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;le_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
             &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;n"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
             &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;le_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                 &lt;span class="c1"&gt;# other features besides frequency
&lt;/span&gt;                 &lt;span class="n"&gt;digits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isdigit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                 &lt;span class="n"&gt;spaces&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isspace&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                 &lt;span class="c1"&gt;# initialize an array of [frequency, digits, spaces]. Frequency is initially 1
&lt;/span&gt;                 &lt;span class="n"&gt;le_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;digits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spaces&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
             &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                 &lt;span class="c1"&gt;# increment frequency if we met this before
&lt;/span&gt;                 &lt;span class="n"&gt;le_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;le_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
         &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;le_dict&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
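
<p>To make the shape of the feature dictionary concrete, here's a minimal, self-contained sketch of the same logic with made-up sample lines (the entity names below are hypothetical, not from the actual data files):</p>

```python
# Same [frequency, digits, spaces] layout as above, on made-up sample lines
def features_for(lines):
    le_dict = {}
    for line in lines:
        line = line.strip("\n")
        if line not in le_dict:
            # other features besides frequency
            digits = sum(c.isdigit() for c in line)
            spaces = sum(c.isspace() for c in line)
            # [frequency, digits, spaces]; frequency starts at 1
            le_dict[line] = [1, digits, spaces]
        else:
            # increment frequency if we met this line before
            le_dict[line][0] += 1
    return le_dict

print(features_for(["Dell", "XPS 13", "Dell"]))
# → {'Dell': [2, 0, 0], 'XPS 13': [1, 2, 1]}
```

<p>Note how "Dell" appears twice, so its frequency is 2, while "XPS 13" picks up two digits and one space.</p>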



&lt;h2&gt;
  
  
  Training a model
&lt;/h2&gt;

&lt;p&gt;To train the model, we’ll need only the list of feature arrays, without the keys. This list of feature arrays is our training set (X), but we’ll also need labels for each entity (y). In our case, labels are manufacturers or models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# we have a file with manufacturers and one with models. Read them into dictionaries
&lt;/span&gt;&lt;span class="n"&gt;mfr_feature_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_into_feature_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"mfrs"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model_feature_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_into_feature_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# from the dictionaries, we get only the feature arrays and add them to one list
&lt;/span&gt;&lt;span class="n"&gt;training&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;mfr_feature_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mfr_feature_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model_feature_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_feature_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# make the list a NumPy array. That’s what Scikit-learn requires
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;training&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# add training labels. We know that we first added manufacturers, then models
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mfr_feature_dict&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"mfr"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_feature_dict&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;At this point, we can select a model and train it. Scikit-learn comes with a variety of &lt;a href="https://scikit-learn.org/stable/supervised_learning.html"&gt;classifiers&lt;/a&gt; out of the box, from the simple linear &lt;a href="https://scikit-learn.org/stable/modules/svm.html#classification"&gt;Support Vector Machine&lt;/a&gt; we’re using in this example to &lt;a href="https://scikit-learn.org/stable/modules/tree.html#classification"&gt;decision trees&lt;/a&gt; and &lt;a href="https://scikit-learn.org/stable/modules/neural_networks_supervised.html#classification"&gt;perceptrons&lt;/a&gt; (the same sort of algorithms you saw in our &lt;a href="https://sematext.com/blog/entity-extraction-opennlp-tutorial/"&gt;OpenNLP tutorial&lt;/a&gt;). You’d use them in a similar way, though each takes different parameters, of course. With our training X and y, and the algorithm selected, we can train a classifier. For a linear &lt;a href="https://en.wikipedia.org/wiki/Support-vector_machine#Support-vector_clustering_(SVC)"&gt;SVC&lt;/a&gt;, the code can be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# select the algorithm. Here, linear SVC
&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SVC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'linear'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;C&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# train it
&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Here, &lt;strong&gt;&lt;a href="https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769"&gt;C&lt;/a&gt;&lt;/strong&gt; &lt;a href="https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769"&gt;is the penalty parameter for the error term&lt;/a&gt;. The intuition is that, with higher C, your model will fit your training set better, but it may also lead to overfitting. There are &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC"&gt;other SVC parameters&lt;/a&gt; as well, such as the number of iterations.&lt;/p&gt;
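<p>Rather than guessing a value for <strong>C</strong>, you can also let cross-validation pick one. Here's a hedged sketch using Scikit-learn's <strong>GridSearchCV</strong> on synthetic data (in the post, X and y would come from the feature dictionaries, not from random numbers):</p>

```python
# Sketch: choose C by cross-validation instead of hardcoding it.
# The data here is synthetic, just to make the example runnable.
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
# two well-separated clusters of 3-dimensional feature arrays
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(3, 1, (20, 3))])
y = ["mfr"] * 20 + ["model"] * 20

# try several C values; GridSearchCV keeps the one with the best CV score
search = GridSearchCV(svm.SVC(kernel="linear"), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)  # the C that generalized best across folds
```

<p>The point of cross-validation here is exactly the overfitting trade-off described above: a C that fits the training folds perfectly but scores poorly on the held-out folds gets rejected.</p>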

&lt;h2&gt;
  
  
  Using the model to predict entities
&lt;/h2&gt;

&lt;p&gt;At this point, we can use our model for entity extraction. Or at least we can test it. To do that, we can build a test X from some test samples and use the &lt;strong&gt;predict()&lt;/strong&gt; function of our classifier to get the suggested entities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_from_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_file&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;test_X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

  &lt;span class="c1"&gt;# same function that we used for the training set: read manufacturers/codes from a file
&lt;/span&gt;  &lt;span class="c1"&gt;# then turn them into a dictionary of entities to feature arrays
&lt;/span&gt;  &lt;span class="n"&gt;test_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;read_into_feature_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# concatenate feature arrays into our X
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;feature_set&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;test_dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;test_X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;feature_set&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_dict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# use our model to predict entities for each entity
&lt;/span&gt;  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_X&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
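
<p>If you'd rather try the classifier without reading test files, here's a self-contained sketch of the whole train-then-predict loop on hand-built <strong>[frequency, digits, spaces]</strong> feature arrays (the values are made up; real ones come from the feature dictionaries above):</p>

```python
# End-to-end sketch: train a linear SVC on tiny made-up feature arrays,
# then predict the entity type for two unseen feature arrays.
import numpy as np
from sklearn import svm

# manufacturers tend to have no digits; model codes tend to have several
X = np.array([[5, 0, 0], [3, 0, 1], [4, 0, 0],   # mfr-like
              [1, 4, 0], [1, 3, 1], [2, 5, 0]])  # model-like
y = ["mfr"] * 3 + ["model"] * 3

clf = svm.SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.predict([[2, 0, 0], [1, 4, 1]]))
# → ['mfr' 'model']
```

<p>Since the digit count cleanly separates the two classes in this toy data, the model labels the digit-free array as a manufacturer and the digit-heavy one as a model.</p>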



&lt;h2&gt;
  
  
  Conclusions and next steps
&lt;/h2&gt;

&lt;p&gt;With well-selected features, classification is a good way to extract entities from E-commerce queries. We showed an example here with Scikit-learn, but of course, there are other good options. &lt;a href="https://spacy.io/"&gt;SpaCy&lt;/a&gt; is one of them, and we’ll publish another how-to here soon! If you find this stuff exciting, please join us: &lt;a href="https://sematext.com/jobs/"&gt;we’re hiring worldwide&lt;/a&gt;. If you need entity extraction, relevancy tuning, or any other help with your search infrastructure, please &lt;a href="https://sematext.com/contact/"&gt;reach out&lt;/a&gt;, because we provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/consulting/"&gt;Solr, Elasticsearch and Elastic Stack consulting&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/support/"&gt;Solr, Elasticsearch and Elastic Stack production support&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/training/"&gt;Solr, Elasticsearch and Elastic Stack training classes&lt;/a&gt; (on site and remote, public and private)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://sematext.com/cloud/"&gt;Monitoring, log centralization and tracing&lt;/a&gt; for not only Solr and Elasticsearch but for other applications (e.g. Kafka, Zookeeper), &lt;a href="https://sematext.com/spm"&gt;infrastructure&lt;/a&gt; and &lt;a href="https://sematext.com/docker"&gt;containers&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to boost your productivity with Solr or Elasticsearch, check out &lt;strong&gt;two useful Cheat Sheets&lt;/strong&gt; that will save you time when you’re working with either of these two open-source search engines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to access all the new Solr features – Running Solr, Data Manipulation, Searching, Faceting, etc. &lt;a href="https://sematext.com/solr-cheat-sheet/?utm_medium=blogpost&amp;amp;utm_source=blogpost&amp;amp;utm_campaign=scikit-learn-classifiers-blogpost&amp;amp;utm_content=blog-solr-cheat-sheet"&gt;Download yours here&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Key Elasticsearch operations every developer needs – index creation, mapping manipulation, indexing API, and more! &lt;a href="https://sematext.com/elasticsearch-developer-cheat-sheet/?utm_medium=blogpost&amp;amp;utm_source=blogpost&amp;amp;utm_campaign=scikit-learn-classifiers-blogpost&amp;amp;utm_content=blog-elasticsearch-developer-cheatsheet"&gt;Download yours here&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>search</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
