Forem: Kentik

The (Mostly) Complete Guide to Installing Kentik NMS

Leon Adato — Fri, 12 Apr 2024 13:52:24 +0000

Introduction

The saying “nobody likes a know-it-all” applies equally well to blog series – they’re really not terribly loveable. That goes double for a deep-dive technical blog series. We, who make our livelihood and career in tech, appreciate in-depth information and tutorials. But if a post has “part 8 of 33” in the title, it’s a safe bet that most folks will scroll right by because who wants to make that kind of commitment?

I share this with you to explain that I never intended to create a blog series on using Kentik NMS. My goal was to share what I knew about Kentik’s newest addition to the platform, and to do so in a way that was focused and easy to consume in a reasonable amount of time.

But there’s so dang much that NMS can do! One topic triggered an idea for another, and another, and here we are, six posts later, and I haven’t really gone into the details of how to install NMS yet.

I’ll admit the oversight – not starting with the NMS installation – was (slightly) intentional. I’m tired of slogging through 30 paragraphs covering “how to install” before I even know if the tool I’m reading about does anything I need or care about. So, I made the conscious decision to start by digging into the useful features and circle back to installation once I felt NMS had proven its worth.

That time has come.

Observations* on the NMS architecture

(*You see what I did there, right?)

I’m not going to belabor the overall design of NMS with a bunch of “…color glossy photographs with circles and arrows and a paragraph on the back of each one explaining what each one was…” (hat tip to Arlo Guthrie) because it’s pretty simple:

The targets: This is the stuff you want to monitor – network gear and servers that sit on-premises, in the cloud, or both.
The Ranger Collector: A system – it can be physical, virtual, or nothing more than a host for a Docker container – that’s in the same logical network, so it’s able to receive data and pull metrics from the stuff you want to monitor.
The Kentik platform: This is the system – located remotely from you and your devices – to which the Ranger Collector sends data.

Ok, maybe just one photograph:

Measure twice, cut once

Eagle-eyed readers will note that I did, in fact, briefly touch on how to install Kentik NMS in both the NMS migration guide and also the blog Getting Started With Kentik NMS. This blog will go into far greater detail than those two, but I will still use bits from other blogs if they work well. Back in “Getting Started…” I wrote:

There’s nothing more frustrating than being ready to test out a new piece of technology and then finding out you’re not prepared. So before you head down to the “Installation and configuration” section, make sure you have the following things in hand:

A system to install the Kentik NMS collector on. The collector is an agent that can be installed directly onto a Linux-based system or as a Docker container. Per the Kentik Knowledge Base instructions, you’ll want a system with at least a single core and 4GB of RAM.
Verify the system can access the required remote sites:
- Docker Hub
- TCP 443 to grpc.api.kentik.com (or kentik.eu for Europe)
Verify the system can access the devices you want to monitor:
- Ping (ICMP)
- SNMP (UDP port 161)
Check that you have the following information for the devices you want to monitor:
- A list of IP addresses and/or one or more CIDR notated subnets (example: 192.168.1.0/24)
- The SNMP v2c read-only community string and/or SNMP version 3 username, authentication type and passphrase, privacy type and passphrase.
You have a Kentik account. If you don’t, head over to https://portal.kentik.com/login and get one set up. If you are just testing NMS, we recommend not using an existing production account.

Once you’ve got all of your technical ducks in a row (which, to be honest, shouldn’t take that long), you’re ready to get started on this NMS adventure!
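If you'd like to script the "can I reach the portal?" part of that checklist, here is a minimal Python sketch. The helper name tcp_reachable is mine, not part of any Kentik tooling, and it only covers the TCP side. Ping (ICMP) and SNMP (UDP 161) need other tools.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Run from the system that will host the collector, tcp_reachable("grpc.api.kentik.com", 443) should come back True (swap in the European hostname if that applies to you). A False usually points at a firewall or proxy between the collector and the platform.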

That about sums it up. To get NMS up and running, you just need:

A system to install the collector on
Systems to monitor

And with that, you’re ready to get installing.

Installing Kentik NMS

As I mentioned earlier, there are two primary options for installing the Kentik NMS Ranger Collector: direct or Docker. I will cover both, but regardless of which one you plan to use, you’ll start in the Kentik portal. Click the “hamburger menu” (the three lines in the upper left corner), which shows the full portal menu:

Click “Devices” and then click the friendly blue “Discover Devices” button in the upper right corner.

There’s a quick question on whether you want full monitoring or “ping-only” – just to know if systems are up or down:

The next screen allows you to install the collector, either as a Docker container:

Or on a full Linux system:

Installing direct

Let’s start with a “direct” installation on a regular Linux system. Again, this can be an actual bare-metal server or a VM on-site or in the cloud. The only requirement is that the machine you’re installing on can access the systems you want to monitor.

Copy the command from the portal, SSH to the target system, and paste that command. (Note: You must have sudo permission to run this command.)

The installer will complete and… well, in most cases, that’s pretty much it.

No, really. That’s it. The next step involves getting things set up in the Kentik portal, so I will leave that aside for now.

Installing with Docker

This process starts out the same as with the Direct option – copy the command from the Kentik portal, SSH to the system hosting the Docker container, and paste.

However… there are a couple of implicit expectations that are worth stating out loud:

The system you’re using already has Docker and all its necessary components installed.
The user under which you’re installing has full permission to create, run, shut down, and update docker containers.

Presuming that’s the case, cut, paste, and run!

Once the Docker run command completes, there’s not much to see:

Scanning and adding devices (aka: the fun part)

Shortly after installing the Ranger Collector, you’ll see the agent name (or the name of the system the agent is installed on) show up in the “Select an Agent” area below the installation commands:

Go ahead and click “Use this Agent.” That will automatically authorize it and take you to the next screen, where you can specify the devices you want to monitor. There, you’ll enter an IP address, a comma-separated list of IPs, or a range in CIDR notation (example: 192.168.1.0/24).

Trick: You can mix and match, including individual IPs and CIDR ranges.

Another trick: if there are specific systems you want to ignore, list them with a minus (-) in front.
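To get a feel for what a mixed list like that expands to before you kick off a scan, here is a rough Python sketch. It is my own approximation and not how the portal actually parses the field: I'm assuming comma-separated entries, IPv4 only, and a leading minus meaning "exclude."

```python
import ipaddress


def expand_entry(entry: str) -> set:
    """Turn one IP or CIDR string into a set of address objects."""
    entry = entry.strip()
    if "/" in entry:
        # hosts() skips the network and broadcast addresses of the subnet.
        return set(ipaddress.ip_network(entry, strict=False).hosts())
    return {ipaddress.ip_address(entry)}


def expand_targets(spec: str) -> list:
    """Expand a comma-separated list; entries starting with '-' are excluded."""
    include, exclude = set(), set()
    for raw in spec.split(","):
        item = raw.strip()
        if not item:
            continue
        if item.startswith("-"):
            exclude |= expand_entry(item[1:])
        else:
            include |= expand_entry(item)
    return [str(ip) for ip in sorted(include - exclude)]


print(expand_targets("192.168.1.0/30, 10.0.0.5, -192.168.1.2"))
# ['10.0.0.5', '192.168.1.1']
```

The /30 contributes two usable hosts, the minus entry removes one of them, and the single IP rides along untouched.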

Presuming this is your first time adding devices, you’ll probably have to click “Add New Credential.”

Let’s get this out of the way: You will never select SNMP v1. Just don’t.

That said, select SNMP v2c or v3, include the relevant credentials, give it a unique name, and click “Add Credential.”

Then select it from the previous screen.

At that point, click “Start Discovery” to kick off the real excitement.

The collector will start pinging devices and ensuring they respond to SNMP. Once completed, you’ll see a list of devices. You can check/uncheck the ones you want to monitor and click “Add Devices.”

Docker for the distracted

I don’t want to presume that every reader is already familiar – let alone comfortable – with Docker and its basic commands. I recognize that there are a lot of Docker tutorials on the internet. In fact, I’ve lost several weeks of my life looking for (and at) them. But I also recognize that not every reader of this blog manages swarms of containers. So here’s the absolute minimum information you’ll need to be able to maintain your Ranger collector Docker container.

You can see which containers are running (along with their container IDs) with the following command:

docker ps

You can see the output of what a Docker container is doing with a command like this:

docker logs --follow <container id>

If you have issues with any containers, including the command to build or run the Kentik agent, you can easily stop a container with this command:

docker stop <container id>

Once a container is stopped, you probably will want to restart it. The problem is that it won’t show up using docker ps. To see containers that are no longer running, use the -a (“all”) switch:

docker ps -a

And then, to start that container again, run:

docker start <container id>

If something has gone horribly wrong, you can stop and then completely remove the container. (Warning: You’ll need to go back into the Kentik portal and re-run the original command to rebuild it, re-authenticate it, and add devices to be monitored by it.)

docker rm <container id>

Variety (and customization) is the spice of life

If you’ve been following along in the blog series, you’ll know that you can add custom SNMP metrics along with the ones collected by default. To do that, you need to create certain files and make them discoverable by the NMS Ranger Collector agent when it starts up. Whether you’re using the direct or Docker version of the agent, do the following:

Create the directory: /opt/kentik/components/ranger/local/config. If you’re running the direct agent, everything up to “ranger” will already be there, but you’ll have to create local/config.
In that directory, create three directories:
- /profiles
- /reports
- /sources
Make the user:group “kentik:kentik” the owner of everything you just created and all the files and directories beneath it.

sudo chown -R kentik:kentik /opt/kentik/components/ranger/local/config

Note: You must monitor at least one device for /opt/kentik/components/ranger to exist.

Another note: If you add more files, you’ll probably need to re-issue that command.

Trick: You can also make this directory easier to get to by using the Linux “symbolic link” capability.
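On the command line, a symbolic link is normally a one-liner of the form ln -s <target> <link>. For this post's paths, I'd assume something like sudo ln -s /opt/kentik/components/ranger/local/config /local_kentik. The Python sketch below does the same thing, plus the three sub-directories, inside a throwaway temp directory so you can try it without touching /opt. The function name prepare_config is hypothetical.

```python
import tempfile
from pathlib import Path

SUBDIRS = ("profiles", "reports", "sources")


def prepare_config(config_dir: Path, link: Path) -> Path:
    """Create the three config sub-directories and a symlink pointing at them."""
    for name in SUBDIRS:
        (config_dir / name).mkdir(parents=True, exist_ok=True)
    if not link.is_symlink():
        # Equivalent to: ln -s <config_dir> <link>
        link.symlink_to(config_dir, target_is_directory=True)
    return link


# Demo in a temporary directory standing in for the real filesystem root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cfg = root / "opt/kentik/components/ranger/local/config"
    shortcut = prepare_config(cfg, root / "local_kentik")
    print(sorted(p.name for p in shortcut.iterdir()))
    # ['profiles', 'reports', 'sources']
```

On a real collector you'd need root privileges for both steps, and the ownership change described in this section still applies afterward.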

This would change the chown command to:

sudo chown -R kentik:kentik /local_kentik

You still need that directory for the Docker version of Kentik NMS, so do everything I described at the beginning of this section. But you’ll also need to tell Docker to mount it as a custom folder. To do that, we’ll start by looking at the “docker run” command that you used to install the container in the first place:

docker run --name=kagent --detach --restart unless-stopped \
  --pull=always --cap-add NET_RAW --env K_COMPANY_ID=1234567 \
  --env K_API_ROOT=grpc.api.kentik.com:443 \
  --mount source=kagent-data,target=/opt/kentik/ kentik/kagent:latest

Adding the custom folder means adding a -v option that maps the host directory to the same path inside the container:

-v /opt/kentik/components/ranger/local/config:/opt/kentik/components/ranger/local/config

Docker hands anything that comes after the image name to the container as arguments, so the option has to go before kentik/kagent:latest. Which would look like this:

docker run --name=kagent --detach --restart unless-stopped \
  --pull=always --cap-add NET_RAW --env K_COMPANY_ID=1234567 \
  --env K_API_ROOT=grpc.api.kentik.com:443 \
  --mount source=kagent-data,target=/opt/kentik/ \
  -v /opt/kentik/components/ranger/local/config:/opt/kentik/components/ranger/local/config \
  kentik/kagent:latest

But wait! If you used the symlink trick from earlier, the host side of that mapping becomes slightly easier to manage:

docker run --name=kagent --detach --restart unless-stopped \
  --pull=always --cap-add NET_RAW --env K_COMPANY_ID=1234567 \
  --env K_API_ROOT=grpc.api.kentik.com:443 \
  --mount source=kagent-data,target=/opt/kentik/ \
  -v /local_kentik:/opt/kentik/components/ranger/local/config \
  kentik/kagent:latest

Troubleshooting and other swear words

Working in tech, we become used to the fact that things rarely go right the first time. Only through careful consideration, iteration, and correction can we achieve the result we initially envisioned.

This section is devoted to a few things that might not initially work as you hoped when setting up Kentik NMS.

Peeking behind the curtain

The Kentik Ranger Collector agent runs pretty silently during installation and afterward, which is usually a good thing. Still, when you suspect something is going wrong, that silence can be anything from slightly unsettling to downright rage-inducing. The good news is that most of the information you need is in the systemd journal (“the Journal” from here on). The Journal records every outbound message, error, update, whine, sigh, and grumble that your Linux system experiences – especially when it concerns services managed through the systemctl utility.

The command to peer inside the Journal is, appropriately enough, journalctl. But typing that by itself will likely yield a metric tonne of mostly irrelevant information. To see messages and output specific to Kentik NMS, you should use the command:

sudo journalctl -u kagent

What this is saying is, “Show me the Journal, but filter for the following UNIT” (hence the “-u”), where the unit is either the name of a service or a pattern to match. If that’s still too much information, try this:

sudo journalctl -u kagent --since "10 minutes ago"

That asks for the Journal to be filtered down to messages about the Kentik agent (“kagent”), specifically those that have appeared in the last 10 minutes.

If you want to see a running list of messages as they show up in the journal in real time, use this. The “-f” means “follow”:

sudo journalctl -f -u kagent

Meanwhile, if you have a Docker container, you can see what’s happening with Docker’s logs:

docker logs --follow <container id>

No devices are discovered

If you’ve installed the NMS Ranger Collector agent, authenticated it into the platform, and run a scan, and no devices have been found, obviously, something needs to be fixed. Here’s a short list of things that might have gone wrong:

You can’t reach the target devices.

This is hands-down the most common issue. Whether due to firewall issues or a simple routing oversight, it’s important to start by verifying that the machine on which you’re running the NMS Ranger Collector agent can talk to the devices you want to monitor.

Start off by running ping and ensuring you’ve got a clean response.

Next, make sure you can reach the device via SNMP. Here are the essential steps:

On the system running the Ranger Collector agent, go to the command line/terminal.
Type “snmpwalk -V” (that’s a capital “V”) to verify that the SNMP command-line tools are installed on this system. If not, install them.
Next, run snmpwalk against the system object ID, which is present on all devices that are running SNMP:

snmpwalk -v 2c -c <snmp community string> <device IP address> 1.3.6.1.2.1.1.2

If that works, you’ll see a response like this:

At this point, you’ll know a few things:

If you can’t ping the device, you have a routing or firewall issue.
If you can ping but can’t get SNMP information, then:
- The target device is refusing the SNMP request.
- Or it’s not running SNMP.
- Or you need to correct some piece of information, like the community string.

Custom OIDs aren't being collected

While this blog doesn’t get into it, a few other posts in this series delved deep into getting custom metrics and sending them to the Kentik platform. If that’s not working, here’s a list of things to check:

Check the YAML files in an editor that shows the type of whitespace you’re using. Mixing spaces with tabs will never end well for anyone.

Verify that all the files in /opt/kentik/components/ranger/local/config are owned by kentik:kentik.

Verify that the custom files are the correct “kind” – profile, report, or source.
Make sure the metadata name elements match up from file to file.
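If your editor doesn't show whitespace, a few lines of Python will flag the offenders. This is a generic sketch and nothing Kentik-specific: it reports every line whose indentation contains a tab character.

```python
def tab_indented_lines(text: str) -> list:
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Everything before the first non-whitespace character is the indent.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad.append(number)
    return bad


sample = "reports:\n  /device/linux/temp:\n\tinterval: 60s\n"
print(tab_indented_lines(sample))  # [3]
```

To check a real file, read it first (for example with pathlib.Path(...).read_text()) and pass the contents in. An empty list means the indentation is tab-free.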

Red light, green light (stopping and starting)

Sometimes, the Kentik agent needs a good swift kick in the process. To do that, you can use the systemctl utility:

sudo systemctl stop kagent.service
sudo systemctl start kagent.service
sudo systemctl restart kagent.service

Meanwhile, as discussed in the section on Docker, sometimes you need to restart the container itself:

docker ps: To get a list of running containers
docker ps -a: To get a list of all containers (running or stopped)
docker stop <container ID>: To stop the container
docker start <container ID>: To start the container
docker restart <container ID>: To stop and start the container all at once
docker rm <container ID>: To delete the container
- Note: You have to stop the container first
- Another note: All the systems monitored by this container will have to be re-added when you recreate it.

The end of the beginning

If you’ve arrived here after reading the previous posts on using Kentik NMS to troubleshoot, adding a single custom metric, adding multiple custom metrics, or modifying custom metrics before they’re sent to the Kentik platform, you now have everything you need to get the Kentik NMS Ranger Collector agent installed, running, and collecting monitoring data from your systems.

On the other hand, if you arrived here fresh from the internet and this is your first encounter with Kentik NMS, I invite you to use the links in the paragraph above to explore further.

Adjusting Data Before Sending it to Kentik NMS

Leon Adato — Tue, 02 Apr 2024 14:36:07 +0000

In my ongoing exploration of Kentik NMS, I continue to peel back not only the layers of what the product can do but also the layers of information I quietly glossed over in my original example, hoping nobody noticed.

In this blog, I want to both admit to and correct one of the most glaring ones:

If that is the temperature of one of your devices, you should seek immediate assistance. I don’t want to alarm you, but that’s six times hotter than the surface of the sun.

In reality, the SNMP OID in question gives temperature in mC (millidegrees Celsius), so all we really need to do is divide by 1,000. But this opens the door to plenty of other situations where it’s not only nice but necessary to adjust metrics before sending them to Kentik NMS.

Kentik comes with scripting capabilities courtesy of Starlark (formerly known as Skylark), a Python-like language created by Google.

That last sentence will either set your mind at ease or send you running for the door, and I’m honestly not sure how I feel about it myself.

But, back to the task at hand, Starlark will let you take the values that come in via an OID and then manipulate them.

A script block, which goes in the reports file, must define a function called process with two parameters: the record and the index set. It typically looks like this:

reports:
  /foo/bar/baz:
    script: !starlark |
      def process(n, indexes):
          (do stuff here)

That’s really all you have to know for now.

If you missed the original post and don’t feel like going back and reading it, here are the essentials:

Move to (or create if it doesn’t exist) the dedicated folder on the system where the Kentik agent (kagent) is running:

/opt/kentik/components/ranger/local/config

In that directory, create directories for /sources, /reports, and /profiles
Create three specific files:

Under /sources, a file that lists the custom OID to be collected
Under /reports, a file that associates the custom OID with the data category it will appear under within the Kentik portal
Under /profiles, a file that describes a type of device (Using the SNMP System Object ID) and the report(s) to be associated with that device type
Make sure all of those directories (and the files beneath them) are owned by the Kentik user and group:

sudo chown -R kentik:kentik /opt/kentik/components/ranger/

sources/linux.yml

version: 1
metadata:
  name: local-linux
  kind: sources
sources:
  CPUTemp: !snmp
    value: 1.3.6.1.4.1.2021.13.16.2.1.3.1
    interval: 60s

reports/linux_temps_report.yml

version: 1
metadata:
  name: local-temp
  kind: reports
reports:
  /device/linux/temp:
    fields:
      CPUTemp: !snmp
        value: 1.3.6.1.4.1.2021.13.16.2.1.3.1
        metric: true
    interval: 60s

profiles/local-net-snmp.yml

version: 1
metadata:
  name: local-net-snmp
  kind: profile
profile:
  match:
    sysobjectid:
      - 1.3.6.1.4.1.8072.*
  reports:
    - local-temp
  include:
    - device_name_ip

As I showed earlier in this post, that gives you data that looks like this in Metrics Explorer:

Notice that my temperature readings are up around the 33,000 mark? We gotta do something about that.

First, we’ll do the simple math - dividing our output by 1000.

sources/linux.yml - stays the same
profiles/local-net-snmp.yml - stays the same

Our new reports/linux_temps_report.yml file becomes:

version: 1
metadata:
  name: local-temp
  kind: reports
reports:
  /device/linux/temp:
    script: !starlark |
      def process(n, indexes):
        n['CPUTemp'].value = n['CPUTemp'].value//1000
    fields:
      CPUTemp: !snmp
        value: 1.3.6.1.4.1.2021.13.16.2.1.3.1
        metric: true
    interval: 60s

Let’s take a moment to unpack the changes to this file:

Under the category /device/linux/temp, we’re going to declare a Starlark script.
That script (the | just tells YAML that a multi-line block follows) defines a process function that takes:
- n, the record containing the data
- indexes, the index set for the record
It re-assigns the CPUTemp value in the record, replacing it with the original value divided by 1000.
- To dig into the guts of Starlark for a moment, the two slashes (“//”) indicate “floored division” - which rounds the result down to a whole number.
The YAML file then goes on to identify the record itself, pulling the value from the OID 1.3.6.1.4.1 (and so on).
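Python's // operator is also floored division, so you can try the arithmetic in any Python prompt before committing it to a report file. A quick sketch (the function name is just for illustration):

```python
def scale_reading(raw: int) -> int:
    """Divide a raw reading by 1000 using floored division."""
    return raw // 1000


for raw in (33000, 33999, -500):
    print(raw, "->", scale_reading(raw))
# 33000 -> 33
# 33999 -> 33
# -500 -> -1
```

Two things worth noticing: 33999 still comes out as 33 because nothing is rounded up, and a negative reading floors downward to -1 rather than truncating to 0.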

I’m going to re-phrase because the order of the file is actually backward from the order in which things happen:

The script: block declares the process but doesn’t run it. It’s just setting the stage.

The fields: block is the part that identifies the data we’re pulling. Every time a machine returns temperature information (a record set), that process is run, replacing the original CPUTemp value with CPUTemp/1000.
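To make the "declared once, run for every record" idea concrete, here is a toy Python harness. Everything except the body of process is invented for illustration: the Field class and run_report function are hypothetical stand-ins for what the collector does, not Ranger's real internals.

```python
class Field:
    """Hypothetical stand-in for one field of a record; only .value is modelled."""

    def __init__(self, value):
        self.value = value


def process(n, indexes):
    # Same body as the script block in the report file.
    n['CPUTemp'].value = n['CPUTemp'].value // 1000


def run_report(samples):
    """Pretend to be the collector: build one record per sample, then call process."""
    results = []
    for raw in samples:
        record = {'CPUTemp': Field(raw)}
        process(record, indexes=None)
        results.append(record['CPUTemp'].value)
    return results


print(run_report([33000, 34500, 36250]))  # [33, 34, 36]
```

Defining process does nothing on its own. The values only change when the harness hands a record to it, which is the point of the re-phrasing above.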

The result is an entirely different set of temperature values:

When you need a dessert topping AND a floor wax

Sometimes, you need to do the math but also store (and display) the original value. In that case, you just need one small change:

version: 1
metadata:
  name: local-temp
  kind: reports
reports:
  /device/linux/temp:
    script: !starlark |
      def process(n, indexes):
        n.append('CPUTempC', n['CPUTemp'].value//1000, metric=True)
    fields:
      CPUTemp: !snmp
        value: 1.3.6.1.4.1.2021.13.16.2.1.3.1
        metric: true
    interval: 60s

To build on the previous example, this is what it would look like if you wanted to take that Celsius result and convert it to Fahrenheit:

version: 1
metadata:
  name: local-tempF
  kind: reports
reports:
  /device/linux/tempF:
    script: !starlark |
      def process(n, indexes):
        n.append('CPUTempF', n['CPUTemp'].value//1000*9/5+32, metric=True)
    fields:
      CPUTemp: !snmp
        value: 1.3.6.1.4.1.2021.13.16.2.1.3.1
        metric: true
    interval: 60s
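Operator order matters in that Fahrenheit expression, so here is the same arithmetic in plain Python. I'm assuming Starlark's single slash performs true (floating-point) division the way Python 3's does. The second function is my own variation, included only to show what the floor throws away.

```python
def floor_then_convert(raw: int) -> float:
    """The expression from the report: floor to whole degrees, then convert."""
    return raw // 1000 * 9 / 5 + 32


def convert_without_floor(raw: int) -> float:
    """Variation: keep the fractional degrees all the way through."""
    return raw / 1000 * 9 / 5 + 32


raw = 33567
print(round(floor_then_convert(raw), 2))     # 91.4
print(round(convert_without_floor(raw), 2))  # 92.42
```

Evaluation runs left to right: 33567 // 1000 is 33, times 9 is 297, divided by 5 is 59.4, plus 32 is 91.4. Flooring first costs about a degree Fahrenheit in this case, which may or may not matter for your graphs.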

There’s a lot more to say about (and explore with) Starlark, but I want to leave you with just a few tidbits for now:

Ranger will call the process function every time the report runs.
For table-based reports, the process function will be called once for each row.
Scripts can also:
- create new records
- maintain state across calls to process
- combine data from multiple table rows
Scripts can be included in the report (as shown in this blog), or referenced as an external file:

script: !external
  type: starlark
  file: test.star

In my most recent blog on adding custom OIDs, I showed how to add a table of values instead of just a single item. The specific use case was providing temperatures for each of the CPUs in a system.

The YAML files to do that looked like this:

sources/linux.yml

version: 1
metadata:
  name: local-linux
  kind: sources
sources:
  CPUTemp: !snmp
    table: 1.3.6.1.4.1.2021.13.16.2
    interval: 60s

reports/temp.yml

version: 1
metadata:
  name: local-temp
  kind: reports
reports:
  /device/linux/temp:
    fields:
      name: !snmp
        table: 1.3.6.1.4.1.2021.13.16.2
        value: 1.3.6.1.4.1.2021.13.16.2.1.2
        metric: false
      CPUTemp: !snmp
        table: 1.3.6.1.4.1.2021.13.16.2
        value: 1.3.6.1.4.1.2021.13.16.2.1.3
        metric: true
    interval: 60s

profiles/local-net-snmp.yml

version: 1
metadata:
  name: local-net-snmp
  kind: profile
profile:
  match:
    sysobjectid:
      - 1.3.6.1.4.1.8072.*
  reports:
    - local-temp
  include:
    - device_name_ip

Incorporating what we’ve learned in this post, here are the changes. You’ll note that I’ve renamed a few things mostly to keep these new elements from conflicting with what we created before:

linux_multitemp.yml

version: 1
metadata:
  name: linux_multitemp
  kind: sources
sources:
  CPUTemp_Multi: !snmp
    table: 1.3.6.1.4.1.2021.13.16.2
    interval: 60s

This is effectively the same as the linux_temp.yml I re-posted from the last post. But again, I renamed the file, the metadata name, and the source name to keep things a little separate from what we’ve done.

linux_multitempsc_reports.yml

version: 1
metadata:
  name: local-multitempC
  kind: reports
reports:
  /device/linux/multitempC:
    script: !starlark |
      def process(n, indexes):
        n['CPUTemp_Multi'].value = n['CPUTemp_Multi'].value//1000
    fields:
      CPUname: !snmp
        table: 1.3.6.1.4.1.2021.13.16.2
        value: 1.3.6.1.4.1.2021.13.16.2.1.2
        metric: false
      CPUTemp_Multi: !snmp
        table: 1.3.6.1.4.1.2021.13.16.2
        value: 1.3.6.1.4.1.2021.13.16.2.1.3
        metric: true
    interval: 60s

The major change here is the addition of the script block. The other changes are simply renaming:

local_net_snmp.yml

version: 1
metadata:
  name: local-net-snmp
  kind: profile
profile:
  match:
    sysobjectid:
      - 1.3.6.1.4.1.8072.*
  reports:
    - local-temp
    - local-multitempC
  include:
    - device_name_ip

In this file, the only change is adding local-multitempC to the reports section.

The result is a delightful blend of everything we’ve tested out so far. We have temperature values for each of the CPUs on a given system, and those values have been converted from millidegrees Celsius to Celsius.

This post, along with all those that have come before, again highlights the incredible flexibility and capability of Kentik NMS. But there are so many more things to show! How to ingest non-SNMP data, how to add totally new device types, and how to install the NMS in the first place.

Wait… THAT HASN’T BEEN COVERED YET?!?!

Oof. I’d better get started writing the next post.

As always, I hope you’ll stick with me as we learn more about this together. If you’d like to get started now, start a free trial or request a personalized demo.

Getting to Work with Kentik NMS

Leon Adato — Wed, 13 Mar 2024 18:27:17 +0000

I recently explored why Kentik built and released an all-new network monitoring system (NMS) that includes traditional and more modern telemetry collection techniques, such as APIs, OpenTelemetry, and Influx.

After that, I briefly covered the steps to install Kentik NMS and start monitoring a few devices.

What I left out and will cover in this post is what it might look like when you have everything installed and configured. Along the way, I’ll dig a little deeper into the various screens and features associated with Kentik NMS.

This raises the question, as eloquently put by The Talking Heads, “Well, how did I get here?” Meaning: Where in this post do I explain how to install NMS?

My answer, for this moment only, is, “I don’t care.” You can refer to the previous post for a walkthrough of the installation, and many “how to install Kentik NMS” knowledge articles, blog posts, music videos*, and Broadway plays are either already available or will exist by the time you finish reading this post. But for this post, I’m not going to spend a single sentence explaining how NMS is installed. My focus is entirely on the benefit and value of Kentik NMS once it’s up and running.

* There will absolutely NOT be a music video - the Kentik legal team.

** OH HELL YES, WE ABSOLUTELY WILL!! - the Kentik creative marketing group (who do the final edit of blogs before they post)

And to think, your day had started so well. The sun was shining, the birds were singing, the coffee was fresh and hot, and you could feel the first flutters of hope – hope that you’d be able to get some good work done today, hope that you could take a chunk out of those important tasks, or hope that you could avoid the unplanned work of system outages.

And then came the call.

“The Application” was down. Nobody could get in. Nothing was working.

Now, let’s be completely honest about this. “The Application” was not, in fact, down. The servers were responding, the application services were running, and so on. But, being equally honest, it was slow.

As we all know, “slow” is the new “broken.” Even if it wasn’t literally down, it wasn’t fully accessible and responsive, which means it was effectively down.

What differentiates today from all the dark days in the past is that today, you have Kentik NMS installed, configured, and collecting data – data that the Kentik platform transforms into usable information that you can use to drive action.

Let’s look at “The Application” – at the data flowing across the wire:

By any account, that’s pretty down-ish.

The problem is that a count of the inbound and outbound data doesn’t tell us what’s wrong; it just tells us that something is wrong.

Likewise, the information from so-called “higher level” tools – monitoring solutions that focus on traces and such – might tell us that the flow of data has slowed or even stopped, but there’s no indication why.

This is why network monitoring still matters – both to you as a monitoring practitioner, engineer, and aficionado and to teams, departments, and businesses overall.

We can see, at exactly the same moment, a drop in the most basic metric of all: the ICMP packets received by the devices.

Now, ICMP packets (also known as the good old “ping”) are still data, but when they’re affected equally and simultaneously with application-layer traffic, there’s a good chance the problem is network-based.

What was the problem? I’ll leave it to your experience, history, and imagination to fill in the blanks. In my example above, I changed the duplex setting on one of the ports, forcing a duplex mismatch that caused every other packet (or so) to drop. But it could have been a busy network device, a misconfigured route, or even a bad cable.

In terms of making the case for Kentik NMS, the upshot is that network errors still occur. Often. And application-centric tools are ill-equipped to identify them, let alone help you resolve them.

Almost as fast as it started, the problem is resolved. With the duplex mismatch reversed, pings are back up to normal:

And the application traffic is back up with it:

You pour yourself a fresh cup of coffee, listen to the birds chirping outside the window, and settle into what continues to look like a great day.

Now that I’ve given you a reason to want to look around, I thought I’d spend some time pointing out the highlights and features of Kentik NMS so you could see the full range of what’s possible.

The main NMS screen

We’ll start at the main Kentik screen, the one you see when you log into https://portal.kentik.com. From here, click the “hamburger” menu in the upper left corner and choose “Network Monitoring System.” That will drop you into the main dashboard.

On the main screen, you’ll see:

A geographic map showing the location of your devices
A graph and a table showing the availability information for those devices
An overview of the traffic (bandwidth) being passed by/through your infrastructure
Any active alerts
Tables with a sorted list of devices that have high bandwidth, CPU, or memory utilization

Returning to the hamburger menu, we’ll revisit the “Devices” list, but now that we have devices, we’ll take a closer look.

This page is exactly what it claims to be – a list of your devices. From this one screen, you have easy access to the ability to:

Sort the list by clicking on the column headings.
Search for specific devices using any data types shown on the screen.
Filter the list using the categories in the left-hand column.

There are also some drop-down elements worth noting:

The “Group By” drop-down adds collapsible groupings to the list of devices.

The “Actions” drop-down will export the displayed data to CSV or push it out to Metrics Explorer for deeper analysis.

The “Customize” option in the upper right corner lets you add or remove data columns.

The friendly blue “Discover Devices” button allows you to add new devices to newly added or existing collector instances.

Remember all the cool stuff I just covered about devices? The following image looks similar, except it focuses on your network interfaces.

Metrics Explorer is, in many ways, identical to Kentik’s existing Data Explorer capability. It’s also incredibly robust and nuanced, so much so that it deserves and will get its own dedicated blog post.

For now, I will give this incredibly brief overview just to keep this post moving along:

First, all the real “action” (meaning how you interact with Metrics Explorer) happens in the right-hand column.

Second, it’s important to remember that the entire point of the Metrics Explorer is to help you graphically build a query of the network observability data Kentik is collecting.

With those two points out of the way, the right-hand column has five primary sections:

Measurement allows you to select which data elements are used and how.
- The initial drop-down lets you select the broad category of telemetry from which your metrics will be drawn. For NMS, you will often select from either the /interfaces/ or the /device/ grouping.
- Metrics indicate the data elements that should be used for the graph.
- Group by Dimensions will create sub-groupings of that data on the graph. Absent any “group by,” you end up with a single set of data points across time. Grouping by name, location, etc, will create a more granular breakdown.
- Merge Series is a summary option that allows you to apply sum, min, max, or average functions to the data based on the groupings.
Visualization options: This section controls how the data displays on the left.
- Chart type: Line, bar, pie, table only, etc.
- Metric: The column that is used as the scale for the Y-axis.
- Aggregation: Whether the graph should map every data point, an average, a sum, etc.
- Sample size: When aggregating, all the data from a specific time period (from 1 to 60 minutes) will be combined.
- Series count: How many items from the full data set should be displayed in the graph
- Transformation: Whether to treat the data points as they are or as counters.
Time: The period of time from which to display data and whether to display time markings in UTC or “local” (the time of whatever computer is viewing the graph).
Filtering: This will let you add limitations to include data that matches (or does not match) specific criteria.
- Dimension: Non-numeric columns such as location, name, vendor, or subnet.
- Metric: Numeric data.
Table options: These set the options for the table of data that displays below the graph and lets you select how many rows and whether they’ll be aggregated by Last, Min, Max, Average, or P95 methods.
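
If the aggregation names in the table options are unfamiliar, here is a minimal sketch of what each one computes over a series of samples. It is an illustration only: the function name summarize is hypothetical, and the nearest-rank method used for P95 is an assumption, not necessarily the calculation Kentik applies.

```python
def summarize(samples):
    """Reduce a time-ordered list of numeric samples to the five table aggregations."""
    ordered = sorted(samples)
    # Nearest-rank P95 (an assumption): the smallest value that has at least
    # 95% of the samples at or below it. Integer math avoids float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "last": samples[-1],                      # most recent sample
        "min": ordered[0],
        "max": ordered[-1],
        "average": sum(samples) / len(samples),
        "p95": ordered[rank - 1],
    }
```

P95 is often preferred over Max for utilization data because a single short spike does not dominate the result.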

AD&D: About data and device types

After folks see how helpful Kentik can be, the next question is usually, “Will it cover my gear?” While a list of vendors isn’t the same as a comprehensive list of each make and model, this is a blog post, and nobody will take that kind of time. Meanwhile, the list below should still give you a good idea of what is available out of the box. From there, modifying existing profiles to include specific metrics or even completely new devices is relatively simple.

As of the time of this writing, Kentik NMS automatically collects data from devices made by the following vendors. For the legal-eagles in the group who are sensitive about trademarks, capitalization, and such, please note that this is a dump directly out of the device-type directory:

3com
a10_networks
accedian
adva
alteon
apc
arista
aruba
audiocodes
avaya
avocent
broadforward
brother
calix
canon
cisco
corero
datacom
dell
elemental
exagrid
extreme
f5
fortinet
fscom
gigamon
hp
huawei
ibm
infoblox
juniper
lantronix
meraki
mikrotik
netapp
nokia
nvidia
opengear
palo_alto
pf_sense
server_iron
servertech
sunbird
ubiquiti
velocloud
vertiv
vyos

Of course, this is just the start of your Kentik NMS journey. There is so much more to the platform, from adding custom metrics and new devices to building comprehensive dashboards that contextualize data to creating alerts that convert monitoring information into action. I will be digging into all that and more in the coming weeks and months, even as Kentik NMS continues to grow and improve.

I hope you’ll stick with me as we learn more about this together. If you’d like to get started now, sign up for a free trial.

Getting Started with Kentik NMS

Leon Adato — Thu, 07 Mar 2024 20:54:55 +0000

Kentik NMS is here to provide better network performance monitoring and observability. Read how to get started, from installation and configuration to a brief tour of what you can expect from Kentik’s new network monitoring solution.

In my last post, I explored the reasons why Kentik recently built and released an all-new network monitoring system (NMS) that includes traditional techniques like SNMP along with more modern methods of collecting telemetry like APIs, OpenTelemetry, and influx.

For this article, we’ll jump right into getting set up in Kentik NMS, from installation and configuration to a brief tour of what you can expect from Kentik’s new network monitoring solution.

There’s nothing more frustrating than being ready to test out a new piece of technology and then finding out you’re not prepared. So before you head down to the “Installation and Configuration” section, make sure you have the following things in hand:

A system to install the Kentik NMS collector on. The collector is an agent that can be installed directly onto a Linux-based system or as a Docker container. Per the Kentik Knowledge Base instructions, you’ll want a system with at least a single core and 4GB of RAM.
Verify the system can access the required remote sites:
- Docker Hub
- TCP 443 to grpc.api.kentik.com (or kentik.eu for Europe)
Verify the system can access the devices you want to monitor:
- Ping (ICMP)
- SNMP (UDP port 161)
Check that you have the following information for the devices you want to monitor:
- A list of IP addresses and/or one or more CIDR notated subnets (example: 192.168.1.0/24)
- The SNMP v2c read-only community string and/or the SNMP version 3 username, authentication type and passphrase, and privacy type and passphrase
You have a Kentik account. If you are just testing NMS out, we’d recommend not using an existing production account. (Not because NMS is unsafe, but because the possibility of accidentally triggering an event that the real Helpdesk will get and think they have to respond to is no fun. Be nice to your support folks. They know where all your data is kept.) If you don’t have an account, head over to https://portal.kentik.com/login and get one set up.

Once you’ve got all of your technical ducks in a row (which, to be honest, shouldn’t take that long), you’re ready to get started on this NMS adventure!
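
The TCP part of that checklist can be spot-checked from the collector host before you install anything. This is a minimal sketch, not a Kentik tool: the helper name tcp_reachable is made up for illustration, and it only covers TCP, so ping (ICMP) and SNMP (UDP 161) still need their own checks.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts alike.
        return False
```

Running tcp_reachable("grpc.api.kentik.com", 443) on the system where the collector will live should return True if the outbound path is open.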

Installation and configuration

Whether you install the Kentik NMS collector on a Linux system (physical or virtual) or in a Docker container, you’ll start in the portal. Click the “hamburger menu” (the three lines in the upper left corner), which shows the full portal menu.

Click “Devices” and then select the friendly blue “Discover Devices” button in the upper right corner.

The next screen allows you to install the collector, either as a Docker container:

or on a full Linux system.

Shortly after doing that, you’ll see the agent name (or the name of the system the agent is installed on) show up in the “Select an Agent” area below.

Go ahead and click “Use this Agent.”

From the next screen, you’ll enter an IP address, a comma-separated list of IPs, or a CIDR-notated range (example: 192.168.1.0/24).

Trick: You can mix and match, including individual IPs and CIDR ranges.

Another trick: If there are specific systems you want to ignore, list them with a minus (-) in front.
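
To get a feel for which addresses a mixed entry will cover, the sketch below expands the same kind of input locally. It is a rough illustration using Python's ipaddress module, not the portal's own parser: the function name expand_targets is hypothetical, and skipping the network and broadcast addresses of a CIDR range is an assumption about how discovery treats them.

```python
import ipaddress
from typing import List


def expand_targets(spec: str) -> List[str]:
    """Expand a comma-separated list of IPs, CIDR ranges, and '-' exclusions."""
    include, exclude = set(), set()
    for item in (part.strip() for part in spec.split(",")):
        if not item:
            continue
        # A leading minus marks an address or range to leave out.
        bucket = exclude if item.startswith("-") else include
        network = ipaddress.ip_network(item.lstrip("-"), strict=False)
        if network.num_addresses == 1:
            bucket.add(network.network_address)
        else:
            bucket.update(network.hosts())
    return [str(ip) for ip in sorted(include - exclude)]


print(expand_targets("192.168.1.0/30, 10.0.0.5, -192.168.1.2"))
# ['10.0.0.5', '192.168.1.1']
```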

Presuming this is your first time adding devices, you’ll probably have to click “Add New Credential.”

Let’s get this out of the way: You will never select SNMP v1. Just don’t.

That said, choose SNMP v2c or v3, include the relevant credentials, give it a unique name, and click “Add Credential.”

Then select it from the previous screen.

At that point, click “Start Discovery” to kick off the real excitement.

The main NMS screen

Now that we have some devices installed and are collecting data, let’s take a quick look around.

Back up in the main Kentik menu, click “Network Monitoring System.” That will drop you into the main dashboard.

On the main screen, you’ll see:

A geographic map showing the location of your devices
A graph and a table showing the availability information for those devices
An overview of the traffic (bandwidth) being passed by/through your infrastructure
Any active alerts
Tables with a sorted list of devices that have high bandwidth, CPU, or memory utilization

Returning to the hamburger menu, we’ll revisit the “Devices” list, but now that we have devices, we’ll take a closer look.

This page is exactly what it claims to be – a list of your devices. From this one screen, you have easy access to the ability to:

Sort the list by clicking on the column headings.
Search for specific devices using any data types shown on the screen.
Filter the list using the categories in the left-hand column.

There are also some drop-down elements worth noting:

The “Group By” drop-down adds collapsible groupings to the list of devices.
The “Actions” drop-down will export the displayed data to CSV or push it out to Metrics Explorer for deeper analysis.
The “Customize” option in the upper right corner lets you add or remove data columns.

And we’re already familiar with the friendly blue “Discover Devices” button.

Remember all the cool stuff I just covered about devices? The following image looks similar, except it focuses on your network interfaces.

Metrics Explorer is, in many ways, identical to Kentik’s existing Data Explorer capability. It’s also incredibly robust and nuanced. So much so that it deserves, and will get, its own dedicated blog post.

For now, I will give this incredibly brief overview just to keep this post moving along:

First, all the real “action” (meaning how you interact with Metrics Explorer) happens in the right-hand column.

Second, it’s important to remember that the entire point of the Metrics Explorer is to help you graphically build a query of the network observability data Kentik is collecting.

With those two points out of the way, the right-hand column has five primary sections:

Measurement allows you to select which data elements are used and how.
Visualization options: This section controls how the data displays on the left.
Time: The period of time to display data from and whether to display time markings in UTC or “local” (the time of whatever computer is viewing the graph).
Filtering: This will let you add limitations so that only data that matches (or does not match) certain criteria is included.
Table options: These set the options for the table of data that displays below the graph and lets you select how many rows and whether they’ll be aggregated by Last, Min, Max, Average, or P95 methods.

And that ends our brief tour!

We’ve only skimmed the surface of what Kentik NMS offers, but hopefully, you’re ready to start adding your own devices and interfaces. We’ll be back soon with more NMS tutorials and walkthroughs, but in the meantime, sign up now to get started with a 30-day free trial of Kentik and see Kentik NMS in action yourself.

How to Configure Kentik NMS to Collect Custom SNMP Metrics

Leon Adato — Thu, 07 Mar 2024 20:42:50 +0000

Out of the gate, NMS collects an impressive array of metrics and telemetry. But there will always be bits that need to be added. This brings me to the topic of today’s blog post: How to configure NMS to collect a custom SNMP metric.

The recent release of Kentik NMS has impressed and excited a lot of folks, as evidenced by the volume of current Kentik customers kicking the tires of our newest capability, as well as folks who hadn’t dipped their toes in the warm and welcoming waters of Kentik’s platform until they heard about NMS.

Out of the gate, NMS collects an impressive array of metrics and telemetry. But that doesn’t mean it knows about absolutely everything. No matter how diligently Kentik’s engineers work to incorporate devices and data points (both new and old), there will always be bits that need to be added.

Not only is it impossible for any monitoring and observability solution to know about every possible data point, but making a tool collect “every” metric would cause it to be unreasonably slow.

The goal, instead, is to collect all the telemetry commonly needed and provide the ability to extend the tool to collect other metrics specific to each company’s circumstances.

This brings me to the topic of today’s blog post: How to configure NMS to collect a custom SNMP metric.

Imagine sitting at your desk, monitoring your little heart out with Kentik NMS. Even the Raspberry Pi boxes you’re using for small but essential tasks are showing up. Things are looking great.

Until you realize two things in quick succession:

Those Raspberry Pis are warm enough to heat up a slice of… well, actual raspberry pie.
Temperature stats aren’t showing up.

To be clear, I’m using temperature as a simple but common example. It could just as easily be toner status on a printer or a list of services running on a server, complete with CPU, RAM, and IO utilization for each service. What I’m about to explain is how to include any new SNMP metric, irrespective of data type or vendor.

With those clarifications out of the way, let’s add some temperature stats to our view to see whether we should stock up on fire extinguishers.

Before we start making changes, I want to go over the information you need at your fingertips.

First and foremost, you need to have the SNMP objects (OIDs) that get the data you want, and you should be certain the device responds to those objects in the way you expect.

In my case, the OID I want is: 1.3.6.1.4.1.2021.13.16.2.1.3.1

There are lots of sources for OID information; one such source is https://oidref.com.

To validate that my device responds correctly, I can use the SNMPWalk utility to poll just that value:

Now that we have our OID and we’ve confirmed it works on the device in question, our last step is to ensure we understand how this value is formatted. In this case, it’s in “milli-Celsius,” so 39166 is actually 39.166 degrees Celsius (or about 102.5 degrees Fahrenheit).
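
The unit conversion is simple enough to sketch in a few lines. The function name below is made up for illustration; the only assumption is the one stated above, that the raw value is in thousandths of a degree Celsius.

```python
def millicelsius_to_c_and_f(raw: int):
    """Convert a raw milli-Celsius reading to (Celsius, Fahrenheit)."""
    celsius = raw / 1000
    fahrenheit = celsius * 9 / 5 + 32
    return celsius, fahrenheit


celsius, fahrenheit = millicelsius_to_c_and_f(39166)
print(f"{celsius:.3f} C / {fahrenheit:.1f} F")  # 39.166 C / 102.5 F
```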

Finally, I have to understand the SNMP system object (sysobjectid) of the device to which I want to add my data. You can find that by going into Kentik’s portal, visiting the Devices page, and adding the SysObjectID column.

Or go to the details page for a specific device and view it in the left-hand column:

For this example, I’ll be using 1.3.6.1.4.1.8072.3.2.10.

Note that this will affect any Linux-based system because Raspberry Pis don’t have their own unique system ID.

Now we’re ready to get this value added and displayed in Kentik NMS!

I’m not going to hide the important information behind a wall of step-by-step text. This section is the straightforward, simple, direct answer. But it lacks context and detail and, therefore, might not make much sense. That’s what the rest of this post is about. But for those who want to get right to the point:

Customizations to Kentik NMS all go in /opt/kentik/components/ranger/local/config. Whether you are adding a custom OID, overwriting an existing OID with a new source (not covered in this post), or adding a new device type (also not covered in this post), it all goes there.
- This directory might already exist. If it doesn’t, go ahead and create it yourself.
In that directory, create three subdirectories:
1. profiles/
2. reports/
3. sources/
In sources/, create linux-temps_source.yml and add the following information:

version: 1
metadata:
  name: local-linux
  kind: sources
sources:
  CPUtemp: !snmp
    value: 1.3.6.1.4.1.2021.13.16.2.1.3.1
    interval: 60s

In reports/, create linux_temps_report.yml and add the following information:

version: 1
metadata:
  name: local-temp
  kind: reports
reports:
  /device/linux/temp:
    fields:
      CPUTemp: !snmp
        value: 1.3.6.1.4.1.2021.13.16.2.1.3.1
        metric: true
    interval: 60s

In profiles/, create a file named local-net-snmp.yml and add the following information:

version: 1
metadata:
  name: local-net-snmp
  kind: profile
profile:
  match:
    sysobjectid:
      - 1.3.6.1.4.1.8072.*
  reports:
    - local-temp
  include:
    - device_name_ip

Make the user:group “kentik:kentik” the owner of everything you just created and all the files and directories beneath it.

sudo chown -R kentik:kentik /opt/kentik/components/ranger/local/config

Note: This is only necessary if you’re running the Kentik NMS agent on a regular Linux system, whether it’s a VM or not. This isn’t necessary for Docker-based agents, but I’ll explicitly cover that in a later section.

Restart the collector process (kagent):

sudo systemctl restart kagent.service

Wait a polling cycle or two, and you’ll be able to see it in the Metrics Explorer:

The previous section presented a lot of information in a very tight package. It was probably just enough for folks who are already familiar with NMS and its internal structures. But if you’re newer to the platform, you may be looking for additional information, detail, or context. That’s what I plan to present in the rest of this post.

Kentik NMS is, at its heart, a straightforward set of processes and directories. When you install it, all the essential files will be located in /opt/kentik/components/ranger/current.

The LATEST.ZIP file contains all of the device profiles and information needed to collect data from those devices. The beauty of this system is that NMS works with LATEST.ZIP as-is, without unpacking or unzipping it. Every time you restart the Kentik agent (kagent), it checks for a newer version and downloads it if necessary. So you’re guaranteed to get all the latest updates and goodies without any special upgrade process.

Note: Sharp-eyed Linux-literate readers will notice that “current” is actually a symbolic link to the latest version. This is important because if you make changes here, you’ll find those changes inexplicably lost after the next update.

Upshot: Avoid future headaches. Don’t make changes in this directory.

If you did unpack LATEST.ZIP (but, as I said, don’t), you’d find a specific directory structure underneath it.

The key directories there are Profiles, Reports, and Sources. Each one contains a set of YAML files defining an aspect of the collected data.

Important note: The names of the files aren’t important. What matters is the information you provide in the name: element within each YAML file. That will allow you to connect or associate a profile to a report, a report to a source, and so on.

A source tells Kentik NMS about one or more OIDs to collect. Here’s an example:

version: 1
metadata:
  name: local-linux
  kind: sources
sources:
  temp: !snmp
    value: 1.3.6.1.4.1.2021.13.16.2.1.3.1
    interval: 60s

This file can be understood as:

A source named “local-linux”
The type (or kind) of file is a “source” (there are others, which you’ll understand in a minute)
The collection method is SNMP
The SNMP object (OID) to collect is 1.3.6.1.4.1.2021.13.16.2.1.3.1
That value should be collected every 60 seconds

Files in the Report folder tell Kentik NMS how to display a specific OID within the Metrics Explorer. There are several elements that repeat the things in the Source file, but – for reasons beyond the scope of this post – they’re necessary in both files.

Here’s an example:

version: 1
metadata:
  name: local-temp
  kind: reports
reports:
  /device/linux/temp:
    fields:
      CPUTemp: !snmp
        value: 1.3.6.1.4.1.2021.13.16.2.1.3.1
        metric: true
    interval: 60s

This information can be parsed as follows:

The name of this report is local-temp
The type (or “kind”) of file is – somewhat obviously – a report
Within Metrics Explorer, the data being collected will show up under /device/linux/temp
The data element that will be available in Metrics Explorer is “CPUTemp,” which is an SNMP data element
- This element will contain the data collected by the SNMP OID 1.3.6.1.4.1.2021.13.16.2.1.3.1
- Which is a metric rather than a table or some other type of data structure.
The data will be displayed in 60-second increments.

Profiles associate specific reports with the device types (as identified by their SNMP System Object ID, or sysobjectid) and also mention common data elements (like name, or IP) that should be associated with the data.

version: 1
metadata:
  name: local-net-snmp
  kind: profile
profile:
  match:
    sysobjectid:
      - 1.3.6.1.4.1.8072.*
  reports:
    - local-temp
  include:
    - device_name_ip

One more time, let’s parse this out:

The name of the profile is local-net-snmp
The type (or kind, there’s that word again) of the file is a profile
This profile applies to anything with an SNMP SysObjectID starting with 1.3.6.1.4.1.8072.* (this means most Linux-type machines that run net-SNMP).
Devices that match this profile will collect data found in the Report file with a name: element “local-temp.”
The device name and IP data should be included along with the data in the local-temp report and associated source.
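
Because the files are tied together only by their name: values, a typo in a profile's reports list is easy to make and hard to spot. The sketch below shows the idea of a cross-reference check. It is not part of Kentik NMS: the function name is hypothetical, and plain dictionaries (trimmed to the fields used here) stand in for what a YAML parser would produce from the report and profile files in this post.

```python
def unresolved_reports(profiles, reports):
    """Map each profile name to the report names it lists that no report file defines."""
    defined = {doc["metadata"]["name"] for doc in reports}
    problems = {}
    for doc in profiles:
        missing = [name for name in doc["profile"].get("reports", []) if name not in defined]
        if missing:
            problems[doc["metadata"]["name"]] = missing
    return problems


# Stand-ins for the parsed report and profile files shown above.
report = {"version": 1, "metadata": {"name": "local-temp", "kind": "reports"}}
profile = {
    "version": 1,
    "metadata": {"name": "local-net-snmp", "kind": "profile"},
    "profile": {
        "match": {"sysobjectid": ["1.3.6.1.4.1.8072.*"]},
        "reports": ["local-temp"],
        "include": ["device_name_ip"],
    },
}

print(unresolved_reports([profile], [report]))  # {}
```

An empty result means every report a profile asks for is defined somewhere; anything else lists the names that will not match.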

You may have noticed that the kind element in all three files above identifies the file type and matches the file’s directory. If you have a nagging suspicion that the directory structure matches up with the kind label, you are right.

In fact, you don’t actually need the directories. You could put all your files in a single folder, and as long as the kind: value was correct, everything would match up. We here at Kentik encourage you to use the three-directory approach because it makes organizing, tracking, and maintaining a large number of profiles much easier in the long run.

Possession is 9/10ths of the law and 10/10ths of Linux permissions

Once all your files are in place, it’s important to ensure Kentik can access them. This comes down to giving ownership to both the “kentik” user and the “kentik” group.

Remembering that all of your customizations will go in the folder /opt/kentik/components/ranger/local/config, we need to make sure everything you just created will be owned by the kentik user and group. The command to give ownership would be:

sudo chown -R kentik:kentik /opt/kentik/components/ranger/local/config

Note: This is only necessary if you’re running the Kentik NMS agent on a regular Linux system, whether it’s a VM or not. This isn’t necessary for docker-based agents, but I’ll explicitly cover that in a later section.

Finally, restart the collector process (kagent):

sudo systemctl restart kagent.service

Wait a polling cycle or two, and you’ll be able to see it in the Metrics Explorer:

Throughout this post, I’ve focused on the commands and options for the direct installation of the Kentik agent. If you’re running the containerized version, very little changes, but it’s still worth running through those differences for folks who prefer the Docker version of the Kentik NMS collector.

Before we move on, I don’t want to presume that every reader is already familiar – let alone comfortable – with Docker and its basic commands. Here are a few that you might need.

You can see which containers are running (along with their container IDs) with the following command:

docker ps

You can see an output of what a Docker container is doing with this command (you get the container ID with the docker ps command):

docker logs --follow <container id>

Finally, if you have issues with any containers, including the command to build or run the Kentik agent, you can easily stop and remove a container with these commands:

docker stop <container id>
docker rm <container id>

For the Docker version of Kentik NMS, you will need to mount your custom folder into the container and add that path to the Kentik agent command line. What is that custom folder, you ask? If you’ve been paying attention, you can probably already guess:

/opt/kentik/components/ranger/local/config

That’s right, it’s the same folder we’ve already been working with.

Starting with the “docker run” command that you used to install the container in the first place:

docker run --name=kagent --detach --restart unless-stopped --pull=always --cap-add NET_RAW --env K_COMPANY_ID=1234567 --env K_API_ROOT=grpc.api.kentik.com:443 --mount source=kagent-data,target=/opt/kentik/ kentik/kagent:latest

You would simply add

-v /opt/kentik/components/ranger/local/config

…to the end of that command.

The full command would look like this:

docker run --name=kagent --detach --restart unless-stopped --pull=always --cap-add NET_RAW --env K_COMPANY_ID=1234567 --env K_API_ROOT=grpc.api.kentik.com:443 --mount source=kagent-data,target=/opt/kentik/ kentik/kagent:latest -v /opt/kentik/components/ranger/local/config

That’s it! Everything else in this post still applies, and you don’t even need to run the chown command to ensure ownership of that directory.

Troubleshooting and other swear words

Despite our best efforts, careful planning, detailed analysis, and heartfelt prayers – even with all that, things sometimes go awry. As the philosopher John Bender said in the profoundly philosophical work The Breakfast Club:

“Screws fall out all the time; the world is an imperfect place.”

With that great truth firmly in mind, I wanted to offer some tools and techniques you can use to identify where things may have gone off the rails.

YAML originally stood for “yet another markup language” (officially, it’s now the recursive “YAML Ain’t Markup Language”), which, like most acronyms, tells you exactly nothing about it. YAML is similar in many ways to XML or JSON, an insight that provides little comfort to many of us who have an emotionally complicated relationship with those other two systems.

My personal trauma aside, YAML is great for configuration files because it’s highly structured. But for that same reason, it can be easy to bork something up because of a small (and hard-to-find) oversight. Here are the ones that might trip you up:

Everything you do in a YAML file will be in the form of a key/value pair that follows the pattern: “key: value”
- Some examples:
  - name: local-temp
  - kind: profile
  - interval: 60s
Underscores, dashes, or spaces can separate words in keys.
A key will always end with a colon (:)
Indentation in the file matters!
- You must have indents.
- Each level must be indented by a consistent number of spaces.
- Indents must use spaces. They cannot be tabs.
- Things that are on the same level (“name” and “kind,” for example) must line up with the same number of spaces.
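
Tabs are the hardest of these to spot by eye, since most editors render them just like spaces. Here is a minimal sketch of a check you could run against a file's contents before restarting the agent. The function name is made up for illustration, and it looks only for tabs in leading whitespace; it does not validate anything else about the YAML.

```python
def find_tab_indents(yaml_text: str):
    """Return the 1-based line numbers whose leading whitespace contains a tab."""
    problems = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append(number)
    return problems


sample = "metadata:\n  name: local-temp\n\tkind: reports\n"
print(find_tab_indents(sample))  # [3]
```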

Sometimes, the Kentik agent needs a good swift kick in the… process. To do that, you can use the systemctl utility:

sudo systemctl stop kagent.service
sudo systemctl start kagent.service
sudo systemctl restart kagent.service

The Linux journal isn’t some specialized magazine or email newsletter. It’s the onboard record of every outbound message, error, update, whine, sigh, and grumble that your Linux system experiences – especially when it concerns services that run through the systemctl utility.

The command to peer inside the Journal is, appropriately enough, journalctl. But typing that by itself will likely yield a metric tonne of mostly irrelevant information. In order to see messages and output specific to Kentik NMS, you should use the command:

sudo journalctl -u kagent

If that list is overwhelming, include the following bit:

sudo journalctl -u kagent --since "10 minutes ago"

And if you want to see the messages appearing in the Journal in real time, use this:

sudo journalctl -u kagent -f

There is more – a whole lot more! – to explore with Kentik NMS, including ways to add SNMP table data, create profiles for completely new device types, and even add data that isn’t coming from SNMP in the first place.

Even so, this post should get you moving ahead in collecting those bits of information you know are available on your devices but aren’t collected by default by Kentik NMS today.

To get started with Kentik NMS, sign up for a free 30-day trial.

Setting Sail with Kentik NMS: Unified Network Telemetry

Leon Adato — Mon, 04 Mar 2024 15:00:00 +0000

Kentik NMS has launched and is setting sail in familiar waters. Monitoring with SNMP and streaming telemetry is only the first leg of the journey. In short order, we’ll unfurl additional options, increasing NMS’s velocity and maneuverability.

So you may have heard by now that Kentik has released a network monitoring system, commonly known as an “NMS,” in the smoky rooms of observability aficionados. This is more than just a cute little add-on to our robust flow-based monitoring capabilities. This is a stem-to-stern product that could stand alone if we wanted it to.

But the question that arises in many people’s minds, like the first light peeking over a distant horizon, is: Why?

Why, when the discipline of network monitoring is solidly into its third decade, would we think its shores are yet uncharted?

More to the point, is Kentik trying to imply that – in this age of ubiquitous cloud, containers, microservices, and APIs – the regular old route-and-switch network even matters anymore?

Given that we’re releasing Kentik NMS, the answer is obviously “yes.” But in this blog post, I need to get at why we’re doing it. Besides, you know, “customers keep asking for it” (although, admittedly, that is a pretty good reason).

First, traditional monitoring still matters. Hardware — both in terms of availability and performance — still matters. On-premises systems still matter. And “the network” - meaning anything from bare metal packet pushers in a closet all the way up to a Kubernetes cluster in the cloud — matters.

All of those things I’ve just named, along with a myriad of other infrastructure elements, are still critical components for organizations large and small. Being able to collect telemetry and visualize it effectively, turning data into information that drives action, is a core capability for any — and every — business.

Second, the people responsible for running and maintaining their networks keep telling us that the current set of solutions on the market has either failed to keep up or that the cost of keeping up is so high their speed of adoption is unacceptably slow.

Before you interpret what I just said as an insult to existing vendors, let me be clear: I have a tremendous amount of respect for and even love the existing monitoring tools on the market. They do many things well, and in some cases, they were the first to do those things. They blazed trails, educated consumers, and established whole markets and sub-specialties within IT.

But pivoting an entire product line is almost impossibly complicated. An established tool has existing customers who cannot be abandoned, which means keeping the current solutions more or less the same. Adding new capabilities is predicated on the ability of the tool to accommodate those new functions without breaking existing ones.

For example, let’s look at collecting network metrics via API rather than a more traditional method like using SNMP. And please understand the irony of calling SNMP “traditional” versus APIs when network devices have included API options for the better part of a decade.

While it’s been possible to collect data from hardware via API calls for quite some time, precious few network monitoring tools support this capability or do it particularly well. To be sure, the solutions that focus on application monitoring do it better, but even there, it’s in the context of the application rather than hardware.

The reason for this isn’t that monitoring solutions vendors are lazy or uninspired. It’s that the work of adding an API collector is hard: different vendors have implemented their APIs in just different enough ways to create additional hurdles, and normalizing the API data with the other telemetry presents its own challenges.

This difficulty stems from the fact that hardware didn’t support APIs when the tools were conceived and written.

And if all of that is true for a 10-year-old technology like RESTful APIs, how much more so is it true for OpenTelemetry and its Cisco-specific cousin, streaming telemetry?

Kentik NMS in action

All of this is my way of saying that Kentik realized the world needed a new NMS because:

A) That data still matters, and

B) Creating an NMS from the ground up was actually easier than bolting additional capabilities onto an existing tool.

This brings us to the point we find ourselves at today: Kentik NMS has launched and is setting sail in familiar waters. Monitoring with SNMP and streaming telemetry is only the first leg of the journey. In short order, we’ll unfurl additional options, increasing NMS’s velocity and maneuverability.

So, now that the ship has set sail, I hope you come aboard and have a look around.

Alerts Should Work for You, Not the Other Way Around

Leon Adato — Wed, 15 Nov 2023 20:17:04 +0000

“A few years back, I was tuning an alert rule and accidentally triggered the alert, which created 772 tickets. Twice.”

This (all too true) story introduces the main thesis of my talk in the video below (and the reason for its title): that alerts, contrary to the popular opinion held by IT practitioners across the spectrum of tech, don’t inherently suck. The problem lies in how alerts are typically created, which causes them to be… well, let’s just say “sub-optimal” and leave it at that.

I’ve given this talk frequently at conferences such as DevOpsDays BRUM, DevOpsDays TLV, Monitorama, and others. I believe its popularity is largely due to its fun approach to a frustrating issue.

I’d like to take a few moments of your time here to emphasize points I make in the talk, but then extend those ideas in ways that don’t fit the limitations of time or format common in conference presentations.

The slippery slope to “Monitoring Engineer”

If you’ve read this far, there’s a good chance you care about alerts for more than just your own personal reasons. You probably have people — whether on your immediate team or in the larger organization — who look to and rely on you for help designing, implementing, maintaining, and fixing alerts.

While most of us first encounter monitoring solutions because we want to know more about our own sh… tuff, it quickly follows that we’re helping others set up monitoring for themselves. Before long, we find ourselves in the “resident expert” role. Once that reputation gets around, the job (whether official or not) is irrevocably added to our responsibilities.

The good news is that this is a huge opportunity for those who enjoy the work. Monitoring is an undeniable game-changer in organizations willing to embrace and use it.

Alerts ≠ Monitoring

One of my first encounters with alerting that was completely off the rails was at a company that defined uptime as “100% minus the # of alerts” in a given period. It was utterly unhinged.

While it was an extreme example, the underlying issue — confusing alerting with monitoring — isn’t rare at all. For many individuals (and teams, departments, and entire companies), the raison d’être for monitoring is to have alerts, which is simply not helpful or effective.

Monitoring is nothing more (and nothing less) than the consistent, persistent collection of data from a set of systems. Everything else that a monitoring and observability solution provides — dashboards, widgets, reports, automation, and alerts — is merely a happy by-product of having monitoring in the first place.

As a monitoring engineer, I know something is amiss when I see people hyper-focusing on alerts to the exclusion (if not the detriment) of monitoring.

Alerts need proof of their value

An alert should only exist if it has a proven, measurable, meaningful impact. The best way to validate that is to see if an alert is intended to cause

someone
to do something
RIGHT. NOW.
about a problem

If all of those conditions aren’t met, you’re looking at an alert that is trying to replace some other monitoring structure — a dashboard, a report, etc.
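
If it helps to see that checklist as something you could run against a list of proposed alerts, here is a minimal sketch in Python. The class, the field names, and the two example alerts are all hypothetical and exist only for illustration; the four questions are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class AlertProposal:
    """A proposed alert, described by the four questions in the checklist (hypothetical structure)."""
    name: str
    has_owner: bool    # "someone": a specific person or team receives it
    has_action: bool   # "to do something": there is a concrete response to take
    is_urgent: bool    # "RIGHT. NOW.": it cannot wait for the next report
    is_problem: bool   # "about a problem": not merely an interesting fact


def is_actionable(proposal: AlertProposal) -> bool:
    """True only when all four conditions are met."""
    return all((proposal.has_owner, proposal.has_action,
                proposal.is_urgent, proposal.is_problem))


def recommend(proposal: AlertProposal) -> str:
    """Suggest where this signal belongs."""
    return "alert" if is_actionable(proposal) else "dashboard or report"


# Two made-up examples
checkout_down = AlertProposal("checkout API failing", True, True, True, True)
weekly_trend = AlertProposal("weekly bandwidth trend", True, False, False, False)

print(recommend(checkout_down))  # alert
print(recommend(weekly_trend))   # dashboard or report
```

Passing this check only says the alert is actionable. It says nothing yet about whether the alert is worth having.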

But that merely proves that an alert is actionable, not valuable. And I must be clear: “important” isn’t the same as “valuable.” Importance implies that it is technically, intellectually, or (believe it or not) emotionally meaningful to some person or group.

“Valuable” is much more particular: The existence of the alert can be directly tied to a financial outcome.

How does one establish this? Start with what the world would look like without the alert:

How would the people who can fix the issue find out about the problem? And more to the point, how LONG would it take for the people who can resolve the issue to find out?
Are there any inherent losses while the problem is happening? An online sales system that generates $1,000 an hour loses that amount every hour it’s unavailable.
How long would it take to fix the problem? In some cases, it’s the same amount of time, alert or not. But in far more circumstances, if the problem were left unaddressed for the length of time identified in the first bullet, it would take longer (possibly significantly longer) to resolve.
What is the regular (“total loaded”) rate for the staff who can fix the issue?
What is the “interruption cost” for that staff? This means the staff is (ostensibly) not sitting around waiting for this particular error. So what is the value of their normal work? Because they will NOT be doing it during the time they are addressing this issue.

You are welcome to take the formula above and, as the saying goes, “salt to taste.”

Once you have this, recalculate all of the above WITH the alert. The difference between the first calculation and the second is the dollar value of the alert.
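
To make the arithmetic concrete, here is a rough sketch in Python. The field names, the way the pieces are added together, and every number in the example are my own illustrative assumptions (apart from the $1,000-an-hour sales system mentioned above), so adjust them to match how your organization actually counts costs.

```python
from dataclasses import dataclass


@dataclass
class IncidentCost:
    """Rough cost of ONE occurrence of a problem. All inputs are illustrative assumptions."""
    hours_to_notice: float        # time until someone who can fix it finds out
    hours_to_fix: float           # time the repair takes once work starts
    loss_per_hour: float          # inherent losses while the problem is happening
    staff_rate_per_hour: float    # "total loaded" hourly rate of the people fixing it
    interruption_per_hour: float  # value of the normal work they are NOT doing

    def total(self) -> float:
        # The business loses money for as long as the problem exists
        outage_hours = self.hours_to_notice + self.hours_to_fix
        business_loss = outage_hours * self.loss_per_hour
        # Staff cost covers only the time spent actually fixing the issue
        staff_cost = self.hours_to_fix * (self.staff_rate_per_hour + self.interruption_per_hour)
        return business_loss + staff_cost


def alert_value(without_alert: IncidentCost, with_alert: IncidentCost) -> float:
    """Dollar value of the alert: cost without it minus cost with it."""
    return without_alert.total() - with_alert.total()


# Made-up scenario: the alert shortens both detection and repair
without_alert = IncidentCost(hours_to_notice=2.0, hours_to_fix=3.0, loss_per_hour=1000.0,
                             staff_rate_per_hour=100.0, interruption_per_hour=50.0)
with_alert = IncidentCost(hours_to_notice=0.25, hours_to_fix=1.0, loss_per_hour=1000.0,
                          staff_rate_per_hour=100.0, interruption_per_hour=50.0)

print(without_alert.total())                   # 5450.0
print(with_alert.total())                      # 1400.0
print(alert_value(without_alert, with_alert))  # 4050.0
```

In this invented scenario, each time the alert fires it is worth $4,050 to the business.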

Now, you can set up a simple report showing the number of times the alert triggered in a given period, multiplied by that value. That is the amount this one alert has saved the company during that time.
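
A report like that can be very small indeed. The sketch below assumes you can export the dates on which an alert triggered; the function name, the trigger history, and the $4,050 per-occurrence value are hypothetical placeholders.

```python
from datetime import date


def alert_savings(trigger_dates, value_per_occurrence, start, end):
    """Return (occurrences, total saved) for triggers between start and end, inclusive."""
    occurrences = sum(1 for d in trigger_dates if start <= d <= end)
    return occurrences, occurrences * value_per_occurrence


# Hypothetical trigger history for one alert, e.g. exported from a ticketing system
triggers = [
    date(2023, 9, 28),
    date(2023, 10, 3),
    date(2023, 10, 17),
    date(2023, 10, 29),
    date(2023, 11, 2),
]

count, saved = alert_savings(triggers, 4050.0, date(2023, 10, 1), date(2023, 10, 31))
print(f"October: {count} occurrences, ${saved:,.2f} saved")
# October: 3 occurrences, $12,150.00 saved
```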

Observability enables us to change our focus

Back when I started working with monitoring solutions (yeah, yeah, Grampa. When dinosaurs ruled the earth and you had to chisel each bit into the hard drive by hand with a lodestone), we had to guess at the user’s experience from an array of much lower-level data. We’d look at network traffic, disk I/O, server connections, and other metrics and use those metrics to guess what was happening at the top of the OSI model.

We didn’t do it because we thought it was the best option. We did it because it was the ONLY option. Tracing, in terms of true application monitoring, didn’t really come onto the scene until 2010. And it only took hold because of the fundamental change in application architecture.

The widespread adoption of cloud computing (AWS EC2 launched in public beta in 2006) and mobile phones (the first iPhone came on the scene in 2007) radically changed how we interacted with applications. Facebook had an unbelievable (for the time) 600 million users in 2010. That number grew to 800 million in 2011 and over 1 billion in 2012.

Against THAT backdrop, application tracing and real user monitoring went from something we could only do in carefully controlled QA environments to a technique that was not only possible but game-changing.

Because the entire reason we have monitoring — the whole damn point — is to understand what users are experiencing with an application. That’s it. That’s the whole enchilada.

So, I will go on record as saying that alerting should focus on that aspect first and foremost. If the user experience is impacted, sound the alarm and get people out of bed if necessary.

At that point, all the other telemetry (metrics, events, and logs) can be used to understand the details of the impact. But those lower-level data points no longer have to be the trigger point for alerts. Not in most cases.

Where do we go from here?

Hopefully, you have enough time between this blog and my talk to reflect on your existing alerts with an eye toward real improvement. You may find yourself deleting alerts you once thought essential. You will also undoubtedly spend time tweaking your alerts to make them more actionable, meaningful, and valuable.

Just ensure you don’t trigger an alert storm in the process, or you’ll end up in the helpdesk, manually closing 1,544 tickets.

Don’t ask me how I know.

Lessons From a Six-Month Job Search

Leon Adato — Mon, 13 Nov 2023 19:27:30 +0000

Now that I’m free to share the news that I’ve landed at Kentik – a visionary company filled with an amazing group of folks who believe that the value of their team goes far beyond what they might offer to the business – I wanted to take a minute to reflect on my job search, comment on the state of the job market, and share some lessons I’ve picked up along the way.

Let me be clear – I’m under no illusion that the world has breathlessly awaited the thoughts of a middle-aged white dude and will now, graced with my heretofore-undiscovered wisdom, be a truly better place. People with far more knowledge, experience, and expertise have written and spoken on this topic, with data and examples that are far more eloquent and compelling than anything I could hope to share.

No, my point in sharing this is to offer what little confirmation and comfort I can to anyone else currently looking for work. Confirmation that the job market is really, REALLY challenging right now. And comfort that the rejections and (possibly worse) the ghostings you’re getting daily are both common in the current market and not at all a reflection on you, your skills, your experience, your value to a business, or your worth as a person.

I’ll be blunt and honest: As a white (or at least white-passing), cis-gendered, heterosexual, middle-aged man with over 35 years of industry experience, I have literally every ounce of privilege a human body can contain – with the possible exception that I’m not the son of a CEO who will inherit the company some day.

Despite all those advantages, it took me six months of solid searching to find my next gig. Even when you take into account that the job I was seeking (Developer Relations Advocate) is less common than, say, a staff developer or sales engineer, it’s still a significant span of time to spend finding a job.

Now, when I say it’s a “bad” job market, I want to offer some qualifiers only time and experience can bring: It’s not the 2009 “the housing market just crashed and we’re in the Great Recession” bad. It’s not the 2002 “the dot-com bubble just burst” bad, either. If we’re really trying to find the low water mark, it’s also not as bad as the 1973-1975 US recession, let alone the Great Depression.

But playing the “back in my day” game is cold comfort to anyone looking for a new job in the here-and-now, whether by choice or due to one of the far-too-common layoffs playing havoc with the tech industry right now.

At the same time, the context offers some reassurances: Yes, it’s bad. No, it’s not the worst ever seen. And in those other cases, the economy and job market bounced back.

While financial analysts will tell you it took 15 years for the NASDAQ to recover from the 2000 dot-com bubble, most folks in tech will say that the job market cleared up in about 8 to 18 months, depending on where you lived and what you did.

The market recovered from the 2009 housing market crash that triggered the Great Recession in about 4 years. Employers began adding jobs a year after that.

Beyond context and perspective, there are other lessons I’ve learned, which I plan to share in upcoming blog posts. But my message for this one is pretty simple: Being honest with yourself will carry you forward when everything else seems uncertain.

Take time every day (yes, every day) to check in with yourself and ask if the goals you are pursuing are the ones you still want, and whether the cost – in time, energy (physical, mental, emotional), and also money – is still worth it. If the equation doesn’t balance out, be equally honest about what you think would.

I have a few friends who have done this math and pivoted to entirely different careers; or to different areas of tech; or who have put their job hunt on hold.

All of these are valid choices, and don’t let anyone tell you otherwise.

At the same time, the job market won’t be this hard forever. By all indications, it may not even be “this hard” by the time you read this post. Things appear to be easing up a bit. While that might be the result of a September surge, it could just as easily be a sign that things are looking up. Only time will tell.

From the start of my job search, I had an inkling that my experiences might be the source of a blog. Actually, who am I kidding? I *KNEW* that I’d try to turn this into a blog, no matter what happened. In any case, I resolved to keep copious notes on which companies I applied to, and how those conversations went.

What I hadn’t expected was how hard it would be to turn those notes into a meaningful visualization. As it turns out, I needed three separate graphs to really convey everything about the process.

This first animation (courtesy of Flourish) does a good job of showing the number of simultaneous job conversations over time. (click the image to see the full animation)

But it fails to really give you a sense of how those conversations went. For that, my good friend and former colleague Thomas LaRock suggested I use a Sankey view. Honestly, I struggled to figure out exactly HOW to turn my information into Sankey data, until I saw this post from Eric Browning (an amazing Product and UX designer who happens to be looking for his next adventure. You really should check him out.).

But even those two graphs still fail to convey the sheer volume of applications, rejections, and so on. For that I’ve defaulted to a simple bar chart:

Across six months, I applied to over 40 companies, actively interviewed at 30, and gained a lot of interesting insights along the way. I plan to share some of those lessons in the coming weeks, and even the output from some of the take-home assignments I was given as part of the interview process.

But for now, my main point is: You can do this. You can handle it. It’s a lot, but it’s not more than you can bear. You may need to approach it differently, or consider an alternate path forward, but it’s nevertheless something you can do.

Visibly Kentik

Leon Adato — Mon, 30 Oct 2023 14:00:00 +0000

(This post originally appeared on the Kentik blog)

What’s the opposite of “Hello, I Must Be Going”?

I ask because, here at Kentik, one of the requirements for a blog is to have a solid music reference. Or several. I inferred this from a review of existing blog content. OK, actually, just this blog.

It’s important that I get this right because – as both the location and the very existence of this blog imply – I’m able to call Kentik my home now, at least professionally, and I don’t want to do anything that might get things off on the wrong foot. Honestly, I cannot believe it’s true. And while I know you can’t hurry love, I already feel like I’m hanging with old friends.

I won’t belabor the point (or the song references). Still, I did want to highlight some of the things I’m excited about now that I’ve officially become a “Kentikian” (yeah, that’s what we call ourselves around here, but it don’t matter to me).

Observability is for everyone

Monitoring – both network-specific and the broader category of solutions that include APM, synthetics, traces, and even capital-O Observability (whether you spell it out or write it “o11y” like the super cool DevOps folks) – isn’t just a niche skill practiced by a few key people in an org.

If there’s anything I’ve learned over the quarter-century of installing, managing, and extending monitoring solutions, it’s that the data they contain is the lifeblood of the business and a superpower for any IT practitioner motivated enough to leverage it. Monitoring and the insights it provides via alerts, reports, and dashboards allow organizations to react to changes more quickly, identify and recover from issues more reliably, and understand the true health and performance of the business and the systems on which that business is built.

If there’s one thing I’m excited to share now that I’m here at Kentik, it’s the experiences, lessons, and insights I’ve gleaned from using dozens of tools over thousands of hours at companies that ranged from modest (25-100 systems), to moderate (1-5000 systems), to mind-boggling (250,000 systems).

Content by IT folks, for IT folks

Whether it’s a blog highlighting the ways network monitoring has (and hasn’t) changed or an analysis of what it looks like when an entire island cuts over from satellite internet to (undersea) fiber, Kentik places a high value on talking to IT folks the way we speak to each other.

Of course, that starts with clear, concise, and detailed explanations of the latest technology or technique and how to leverage it within Kentik’s platform.

But it also includes frank explanations when something doesn’t stack up. Even more importantly, Kentik isn’t afraid to share honest looks at what we need to do as tech practitioners to show up and do our best work every day for the businesses that depend on us.

It also means having fun sometimes. Because if the work were a never-ending joyless struggle, most of us would have built our careers as hamster ranchers, deep sea carpentry engineers, or competitive maraschino cherry jugglers.

With all due respect to Sesame Street, “c” is for community

A user community has to be more than a support forum. It has to not only allow but encourage conversations and connections between members irrespective of their company affiliations or the problems they’re encountering at the moment. A community should uplift, inform, inspire, comfort, and celebrate.

I’ll be honest (and one of the reasons I’m thrilled to be at Kentik is because this kind of honesty is not only permitted but valued), our community isn’t there – YET. And my saying that we need it is far more than aspirational. It’s happening. Stay tuned to this channel for more information as it becomes available.

But community can be found in many places and in many ways. Community also happens in the comment section of blogs and videos, in the shared stories of the hallway track at conferences and user groups, and in the whispered interjections and hastily scribbled notes during keynotes.

Kentik – and incredibly, I’m now included in that amazing collection of folks – is committed to building a vibrant, passionate, engaging, participatory community, and I hope you will stick around to be part of it because it’s going to be something special.

The (mostly) unnecessary summary

I still haven’t figured out the opposite of “Hello, I Must Be Going,” but perhaps I don’t need to. In 1930, 21 years before Mr. Collins graced us with his presence, let alone his musical genius, Groucho Marx sang the song in the classic movie “Animal Crackers.” The lyrics make it clear that leaving doesn’t mean not staying:

“I’ll stay a week or two

I’ll stay the summer through

But I am telling you:

I must be going”

Like Groucho, no matter how many times I sign off, I don’t plan to really go anywhere. I plan to be extremely visible here at Kentik, whether on the blog, in the video channel, at conferences, or in community spaces.

I hope to see you there, too!

*For those who skipped comparative linguistics in school, “Kentik” is Yiddish for “visible.”