<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Netdata</title>
    <description>The latest articles on Forem by Netdata (@netdata).</description>
    <link>https://forem.com/netdata</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F3293%2F0fc83944-0e3d-438e-bc6f-7d187e7562d7.png</url>
      <title>Forem: Netdata</title>
      <link>https://forem.com/netdata</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/netdata"/>
    <language>en</language>
    <item>
      <title>How to extend the Geth-Netdata integration</title>
      <dc:creator>Odysseas Lamtzidis</dc:creator>
      <pubDate>Mon, 02 Aug 2021 15:28:49 +0000</pubDate>
      <link>https://forem.com/netdata/how-to-extend-the-geth-netdata-integration-4o68</link>
      <guid>https://forem.com/netdata/how-to-extend-the-geth-netdata-integration-4o68</guid>
      <description>&lt;h1&gt;
  
  
  How to extend the Geth collector
&lt;/h1&gt;

&lt;p&gt;This is the second and last post in a two-part series on Netdata and Geth. If you missed the first, be sure to check it out &lt;a href=""&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Geth is short for Go-Ethereum; it is the official Go implementation of the Ethereum client. It is currently one of the most widely used implementations and a core piece of infrastructure for the Ethereum ecosystem. &lt;/p&gt;

&lt;p&gt;With this proof of concept, I wanted to showcase how easy it really is to gather data from any Prometheus endpoint and visualize it in Netdata. This has the added benefit of leveraging all the other features of Netdata, namely its per-second data collection, automatic deployment and configuration, and superb system monitoring. &lt;/p&gt;

&lt;p&gt;The most challenging aspect is making sense of the metrics and organizing them into meaningful charts. In other words, it takes expertise to understand what each metric means and whether it makes sense to surface it for the user. &lt;/p&gt;

&lt;p&gt;Note that different metrics matter to different users. We want to surface &lt;strong&gt;all metrics that make sense&lt;/strong&gt;. When developing an application, you need much lower-level metrics (e.g. &lt;a href="https://containerjournal.com/topics/container-management/using-ebpf-monitoring-to-know-what-to-measure-and-why/"&gt;eBPF&lt;/a&gt;) than when operating it.&lt;/p&gt;

&lt;p&gt;Let's get down to it. &lt;/p&gt;

&lt;h3&gt;
  
  
  A note on collectors
&lt;/h3&gt;

&lt;p&gt;First, let's do a very brief intro to what a collector is. &lt;/p&gt;

&lt;p&gt;In Netdata, every collector is composed of a plugin and a module. The plugin is an orchestrator process responsible for running jobs; each job is an instance of a module. &lt;/p&gt;

&lt;p&gt;When we are "creating" a collector, in essence we select a plugin and we develop a module for that plugin. &lt;/p&gt;

&lt;p&gt;For Geth, since we are using the Prometheus endpoint, it's easiest to use our Go plugin (go.d.plugin), as it has internal libraries for gathering data from Prometheus endpoints. &lt;/p&gt;

&lt;p&gt;The following image illustrates the plugin/module architecture:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PuFSqLHQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aws1.discourse-cdn.com/business5/uploads/netdata2/original/1X/3cc1ef3cb489e7d3146d73bedefb812e49631cc3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PuFSqLHQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://aws1.discourse-cdn.com/business5/uploads/netdata2/original/1X/3cc1ef3cb489e7d3146d73bedefb812e49631cc3.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to dive into the Netdata Collector framework:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://community.netdata.cloud/docs?topic=1189"&gt;FAQ: What are collectors and how do they work?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.netdata.cloud/docs/agent/collectors/plugins.d"&gt;External plugins overview&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Geth collector structure
&lt;/h3&gt;

&lt;p&gt;So, in essence, the Geth collector is the Geth module of go.d.plugin.&lt;/p&gt;

&lt;p&gt;As you can see on &lt;a href="https://github.com/netdata/go.d.plugin/tree/master/modules/geth"&gt;GitHub&lt;/a&gt;, the module is composed of four files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;charts.go&lt;/code&gt;: Chart definitions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;collect.go&lt;/code&gt;: Actual data collection, using the metric variables defined in &lt;code&gt;metrics.go&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;geth.go&lt;/code&gt;: Main structure, mostly boilerplate. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;metrics.go&lt;/code&gt;: Maps metric variables to the corresponding Prometheus metric names&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  How to extend the Geth collector with a new metric
&lt;/h3&gt;

&lt;p&gt;It's very simple, really. &lt;/p&gt;

&lt;p&gt;Open your Prometheus endpoint and find the metrics that you want to visualize with Netdata. &lt;/p&gt;

&lt;p&gt;e.g. &lt;code&gt;p2p_ingress_eth_65_0x08&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Open &lt;code&gt;metrics.go&lt;/code&gt; and define a new variable:&lt;/p&gt;

&lt;p&gt;e.g. &lt;code&gt;const p2pIngressEth650x08 = "p2p_ingress_eth_65_0x08"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Open &lt;code&gt;collect.go&lt;/code&gt; and create a new function, modeled on the ones that already exist. Although it doesn't make much difference in our case, we strive to organize the metrics into sensible functions (e.g. gathering all &lt;code&gt;p2pEth65&lt;/code&gt; metrics in one function). This is the function where we perform any computation on the raw values we gather. &lt;/p&gt;

&lt;p&gt;Note that Netdata automatically takes care of units such as &lt;code&gt;bytes&lt;/code&gt; and shows the most human-readable unit on the dashboard (e.g. MB, GB, etc.).&lt;/p&gt;

&lt;p&gt;e.g.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (v *Geth) collectP2pEth65(mx map[string]float64, pms prometheus.Metrics) {
    pms = pms.FindByNames(
        p2pIngressEth650x08
    )
    v.collectEth(mx, pms)
    mx[p2pIngressEth650x08] = mx[p2pIngressEth650x08] + 1234

}

func (v *Geth) collectEth(mx map[string]float64, pms prometheus.Metrics) {
    for _, pm := range pms {
        mx[pm.Name()] += pm.Value
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also need to add the function to the central function that the module calls at the defined interval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Geth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;collectGeth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pms&lt;/span&gt; &lt;span class="n"&gt;prometheus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;mx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collectChainData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collectP2P&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collectTxPool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collectRpc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collectP2pEth65&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lastly, now that we have the value inside the module, we need to create the chart for that value. We do that in &lt;code&gt;charts.go&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;chartReorgs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Chart&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;"reorgs_executed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Executed Reorgs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Units&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"reorgs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Fam&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;"reorgs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Ctx&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;"geth.reorgs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Dims&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dims&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reorgsExecuted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"executed"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;chartReorgsBlocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Chart&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;"reorgs_blocks"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Blocks Added/Removed from Reorg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Units&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"blocks"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Fam&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;"reorgs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Ctx&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;"geth.reorgs_blocks"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;Line&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="n"&gt;Dims&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dims&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reorgsAdd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"added"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Algorithm&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"absolute"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reorgsDropped&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"dropped"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's explain the fields of the structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ID&lt;/code&gt;: The unique identifier for the chart.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Title&lt;/code&gt;: A human-readable title for the front-end.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Units&lt;/code&gt;: The units for the dimension. Notice that Netdata can automatically scale certain units, so that the raw collector value stays in &lt;code&gt;bytes&lt;/code&gt; but the user sees &lt;code&gt;Megabytes&lt;/code&gt; on the dashboard. You can find a list of supported "automatically scaled" units on this &lt;a href="https://github.com/netdata/dashboard/blob/068bbbb975db7871920406be56af5a641c79a08e/src/utils/units-conversion.ts"&gt;file&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Fam&lt;/code&gt;: The submenu title, used to group multiple charts together.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Ctx&lt;/code&gt;: The context of the chart, an identifier much like the ID. Use the convention &lt;code&gt;&amp;lt;collector_name&amp;gt;.&amp;lt;chart_id&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Type&lt;/code&gt;: &lt;code&gt;Line&lt;/code&gt; (default), &lt;code&gt;Area&lt;/code&gt;, or &lt;code&gt;Stacked&lt;/code&gt;. &lt;code&gt;Area&lt;/code&gt; is best used with dimensions that signify "bandwidth". Use &lt;code&gt;Stacked&lt;/code&gt; when it makes sense to visually observe the &lt;code&gt;sum&lt;/code&gt; of the dimensions (e.g. the &lt;code&gt;system.ram&lt;/code&gt; chart is stacked).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Dims&lt;/code&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ID&lt;/code&gt;: The variable name for that dimension.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Name&lt;/code&gt;: A human-readable name for the dimension.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Algorithm&lt;/code&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;absolute&lt;/code&gt;: The default (if omitted). Netdata shows the value exactly as it gets it from the collector. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;incremental&lt;/code&gt;: Netdata shows the per-second rate of the value. It automatically takes the delta between two data collections and converts it to a per-second value. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;percentage&lt;/code&gt;: Netdata shows the percentage of the dimension in relation to the &lt;code&gt;sum&lt;/code&gt; of all the dimensions of the chart. If four dimensions each have value &lt;code&gt;1&lt;/code&gt;, each shows as &lt;code&gt;25%&lt;/code&gt;. &lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Mul&lt;/code&gt;: Multiply the collected value by this integer.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Div&lt;/code&gt;: Divide the collected value by this integer.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A final note on extending Geth
&lt;/h3&gt;

&lt;p&gt;The Prometheus endpoint is not the only way to monitor Geth, but it is the simplest. &lt;/p&gt;

&lt;p&gt;If you feel adventurous, you can try to implement a collector that also uses Geth's RPC endpoint to pull data (e.g. showing charts about specific contracts in real time), or even Geth's logs. &lt;/p&gt;

&lt;p&gt;To use Geth's RPC endpoint with Golang, take a look at &lt;a href="https://geth.ethereum.org/docs/dapp/native"&gt;Geth's documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To monitor Geth's logs, you can use our &lt;a href="https://github.com/netdata/go.d.plugin/tree/ec9980149c3d32e4a90912826edd344dfb0413ac/modules/weblog"&gt;weblog collector&lt;/a&gt; as a template. It monitors Apache and NGINX servers by parsing their logs. &lt;/p&gt;

&lt;h3&gt;
  
  
  Add alerts to Geth charts
&lt;/h3&gt;

&lt;p&gt;Now that we have defined the new charts, we may want to define alerts for them. The full alert syntax is out-of-scope for this tutorial, but it shouldn't be difficult once you get the hang of it. &lt;/p&gt;

&lt;p&gt;For example, here is a simple alarm that tells me if Geth is synced or not, based on whether &lt;code&gt;header&lt;/code&gt; and &lt;code&gt;block&lt;/code&gt; values are the same:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  1 #chainhead_header is expected momenterarily to be ahead. If its considerably ahead (e.g more than 5 blocks), then the node is definetely out of sync.
  2  template: geth_chainhead_diff_between_header_block
  3        on: geth.chainhead
  4     class: Workload
  5      type: ethereum_node
  6 component: geth
  7     every: 10s
  8      calc: $chain_head_block -  $chain_head_header
  9     units: blocks
 10      warn: $this != 0
 11      crit: $this &amp;gt; 5
 12     delay: up 5s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;You can read the above example as follows:&lt;/strong&gt;&lt;br&gt;
On the charts that have the context &lt;code&gt;geth.chainhead&lt;/code&gt; (thus all the Geth nodes that we may monitor with a single Netdata Agent), every 10s, calculate the difference between the dimensions &lt;code&gt;chain_head_header&lt;/code&gt; and &lt;code&gt;chain_head_block&lt;/code&gt;. If it's not 0, raise a &lt;code&gt;warn&lt;/code&gt; alert. If it's more than 5, escalate to &lt;code&gt;critical&lt;/code&gt;.  &lt;/p&gt;


&lt;p&gt;Note that if you create an alert and it works for you, a great idea is to submit a PR to the main &lt;code&gt;netdata/netdata&lt;/code&gt; &lt;a href="https://github.com/netdata/netdata"&gt;repository&lt;/a&gt;. That way, the alert definition will ship with every Netdata installation, and you will help countless other Geth users. &lt;/p&gt;

&lt;p&gt;Here are some useful resources to create new alerts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=aWYj9VT8I5A"&gt;Youtube - Creating your first health alarm in Netdata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.netdata.cloud/docs/monitor/configure-alarms"&gt;Docs - Configure health alert
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.netdata.cloud/docs/agent/health/reference"&gt;Docs - alert configuration reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learn.netdata.cloud/docs/monitor/enable-notifications"&gt;Docs - Enable alert notifications&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Extend Geth collector for other clients
&lt;/h2&gt;

&lt;p&gt;The beauty of this solution is that it's &lt;strong&gt;trivial&lt;/strong&gt; to duplicate the collector and gather metrics from all Ethereum clients that support the Prometheus endpoint:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.nethermind.io/nethermind/ethereum-client/metrics/setting-up-local-metrics-infrastracture"&gt;Nethermind&lt;/a&gt;,&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://besu.hyperledger.org/en/stable/HowTo/Monitor/Metrics/"&gt;Besu&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ledgerwatch/erigon"&gt;Erigon&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The only difference between a Geth collector and a &lt;a href="https://nethermind.io/client"&gt;Nethermind&lt;/a&gt; collector is that they might expose different metrics, or the same metrics under different Prometheus metric names. So we just need to change the Prometheus metric names in the &lt;code&gt;metrics.go&lt;/code&gt; source file and propagate any changes to the other source files. &lt;/p&gt;

&lt;p&gt;The logic that I described above stays exactly the same. &lt;/p&gt;

&lt;h2&gt;
  
  
  In conclusion
&lt;/h2&gt;

&lt;p&gt;Extending Geth for more metrics is trivial. &lt;/p&gt;

&lt;p&gt;As you may suspect, this guide applies to any data source that exposes its metrics in the Prometheus format. &lt;/p&gt;

</description>
      <category>ethereum</category>
      <category>go</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Introduction to StatsD</title>
      <dc:creator>Odysseas Lamtzidis</dc:creator>
      <pubDate>Mon, 15 Feb 2021 14:00:14 +0000</pubDate>
      <link>https://forem.com/netdata/introduction-to-statsd-1ci9</link>
      <guid>https://forem.com/netdata/introduction-to-statsd-1ci9</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9Qgh66Uk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5160ztstkjwl8ng46eoj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9Qgh66Uk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5160ztstkjwl8ng46eoj.png" alt="Intro image"&gt;&lt;/a&gt;StatsD is an industry-standard technology stack for monitoring applications and instrumenting any piece of software to deliver custom metrics. The StatsD architecture is based on delivering the metrics via UDP packets from any application to a central statsD server. Although the original StatsD server was written in Node.js, there are many implementations today, with Netdata being one of them.&lt;/p&gt;

&lt;p&gt;StatsD makes it easier for you to instrument your applications, delivering value around three main pillars: open-source, control, and modularity. That’s a real windfall for full-stack developers who need to code quickly, troubleshoot application issues on the fly, and often don’t have the necessary background knowledge to use complex monitoring platforms.&lt;/p&gt;

&lt;p&gt;First and foremost, StatsD is an open-source standard, meaning that vendor lock-in is simply not possible. With most of the monitoring solutions offering a StatsD server, you know that your instrumentation will play nicely with any solution you might want to use in the future.&lt;/p&gt;

&lt;p&gt;The second is that you have absolute control over the data you send, since the StatsD server just listens for metrics. You can choose how, when, or why to send data from any application you build, whether it’s in aggregate or as highly cardinal data points. You also don’t need to spend any time configuring the StatsD server, since it will accept any metrics in any form you choose via your instrumentation.&lt;/p&gt;

&lt;p&gt;Finally, there is a complete decoupling of each component of the stack. The client doesn’t care about the implementation of the server, and the server is agnostic about the backend. You can mix and match any combination of client, server, and backend that works best for you, or migrate between them as your needs change.&lt;/p&gt;

&lt;p&gt;Historically, it has always been easier to measure and collect metrics about systems and networks than about applications. In 2011, Erik Kastner developed StatsD while working at Etsy, to collect metrics from instrumented code. The original implementation, in Node.js, listened on a UDP port for incoming metrics data, extracted it, and periodically sent batches of metrics to Graphite. Since then, countless applications have implemented StatsD and can be configured to send their metrics to any StatsD server, while the number of available libraries makes it trivial to use the protocol in any language.&lt;/p&gt;

&lt;h1&gt;
  
  
  How does StatsD work?
&lt;/h1&gt;

&lt;p&gt;The architecture of StatsD is divided into 3 main pieces: client, server, and backend. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;client&lt;/strong&gt; is what creates and delivers metrics. In most cases, this is a StatsD library, added to your application, that pushes metrics at specific points where you add the relevant code.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;server&lt;/strong&gt; is a daemon process responsible for listening for metric data as it’s pushed from the client, batching them, and sending them to the backend.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;backend&lt;/strong&gt;, which is where metrics data is stored for analysis and visualization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;StatsD typically uses UDP because the client and server usually reside on the same host, where packet loss is minimal and you get maximum throughput with the least overhead. TCP is also an option when the client and server reside on different hosts and deliverability of metrics is a primary concern; in that case, metrics collection will be slower due to the overhead of TCP.&lt;/p&gt;

&lt;p&gt;In case you are wondering about the difference between TCP and UDP, this image is most illustrative:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AkgNvpTd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ydevern.files.wordpress.com/2018/09/tcp-vs-udp.png%3Fw%3D809" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AkgNvpTd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ydevern.files.wordpress.com/2018/09/tcp-vs-udp.png%3Fw%3D809" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ydevern.wordpress.com/2018/09/26/ccna-udp-vs-tcp/"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More often than not, an HTTP-based connection is used to send the metrics from the server to the backend, and because the backend is used for long-term analysis and storage, it often resides on a different host than the server and clients.&lt;/p&gt;

&lt;h1&gt;
  
  
  StatsD in &lt;a href="https://netdata.cloud"&gt;Netdata&lt;/a&gt;
&lt;/h1&gt;

&lt;p&gt;Netdata is a fully featured StatsD server, meaning it collects formatted metrics from any application that you instrumented with your library of choice. Netdata is also its own backend implementation, as it offers instant visualization and long-term storage using the embedded time-series database (TSDB). When you install Netdata, you immediately get a fully functional StatsD implementation running on port 8125.&lt;/p&gt;

&lt;p&gt;Since StatsD uses UDP or TCP to send instrumented metrics, either across localhost or between separate nodes, you’re free to deploy your application in whatever way works best for you, and it can still connect to Netdata’s server implementation. As soon as your application exposes metrics and starts sending packets on port 8125, Netdata turns the incoming metrics into charts and visualizes them in a meaningful fashion. &lt;/p&gt;

&lt;p&gt;Your applications can be deployed in a variety of ways and still easily surface monitoring data to Netdata. Since there are a myriad of different setups, Netdata also offers a robust server implementation that can be configured to organize the metrics into charts that make sense, so you can easily improve the default visualization with a few simple modifications. &lt;/p&gt;

&lt;p&gt;Because StatsD is a robust, mature technology, developers have built libraries to easily instrument applications in most popular languages.&lt;/p&gt;

&lt;p&gt;Python:  &lt;a href="https://github.com/jsocol/pystatsd"&gt;https://github.com/jsocol/pystatsd&lt;/a&gt;&lt;br&gt;
Python Django: &lt;a href="https://github.com/WoLpH/django-statsd"&gt;https://github.com/WoLpH/django-statsd&lt;/a&gt;&lt;br&gt;
Java: &lt;a href="https://github.com/tim-group/java-statsd-client"&gt;https://github.com/tim-group/java-statsd-client&lt;/a&gt;&lt;br&gt;
Clojure: &lt;a href="https://github.com/pyr/clj-statsd"&gt;https://github.com/pyr/clj-statsd&lt;/a&gt;&lt;br&gt;
Nodes/Javascript: &lt;a href="https://github.com/sivy/node-statsd"&gt;https://github.com/sivy/node-statsd&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Taking the example from python-statsd, you only need a reachable Netdata Agent (locally or over the internet) and a couple of lines of code. This hello_world example illustrates just how simple it is to send any metric you care about to Netdata and instantly visualize it. &lt;/p&gt;

&lt;p&gt;Even with no configuration at all, Netdata automatically creates charts for you. Netdata, being a robust monitoring agent, is also capable of organizing incoming metrics in any way you find most meaningful.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;statsd&lt;/span&gt;
&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;statsd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StatsClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'localhost'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8125&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'foo'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Increment the 'foo' counter.
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100000000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'bar'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'foo'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;decr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'bar'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;timing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'stats.timed'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;320&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Record a 320ms 'stats.timed'.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
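&lt;p&gt;Under the hood there is no magic in the client library: StatsD is a plain-text protocol over UDP, and each call simply sends a tiny datagram such as &lt;code&gt;foo:1|c&lt;/code&gt;. Here is a minimal, dependency-free sketch of the same idea using only Python's standard library (the host and port are Netdata's defaults; the class and helper names are our own illustration, not part of any library):&lt;/p&gt;

```python
import socket

class TinyStatsd:
    """Minimal StatsD client: formats metrics as 'name:value|type' datagrams."""

    def __init__(self, host="localhost", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _format(self, name, value, kind):
        # StatsD wire format: the metric name, a colon, the value, a pipe,
        # and a type code ("c" for counters, "ms" for timers).
        return f"{name}:{value}|{kind}"

    def _send(self, payload):
        # UDP is fire-and-forget: no connection, no acknowledgement,
        # which is why instrumenting hot paths with StatsD is so cheap.
        self.sock.sendto(payload.encode("ascii"), self.addr)

    def incr(self, name):
        self._send(self._format(name, 1, "c"))      # counter increment

    def timing(self, name, ms):
        self._send(self._format(name, ms, "ms"))    # timer, in milliseconds

c = TinyStatsd()
c.incr("foo")                 # sends the datagram "foo:1|c"
c.timing("stats.timed", 320)  # sends the datagram "stats.timed:320|ms"
```

&lt;p&gt;Because the protocol is this simple, any language that can open a UDP socket can feed metrics to Netdata's StatsD server.&lt;/p&gt;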



&lt;p&gt;Netdata’s StatsD server is also highly performant, which means you can monitor applications where they run without worrying about bottlenecks or resource limits:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Netdata StatsD is fast. It can collect more than 1,200,000 metrics per second on modern hardware, sustaining more than 200 Mbps of StatsD traffic, using a single CPU core.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Netdata does this on top of gathering metrics from other data sources. Netdata monitors an application’s full stack, from hardware to operating system to underlying services, and every available metric is automatically organized into meaningful categories on a single dashboard.&lt;/p&gt;

&lt;p&gt;Ready to get started?&lt;br&gt;
In the next part of the StatsD series, we'll illustrate how to configure Netdata to organize the metrics of any application, using k6 as our use case. &lt;/p&gt;

&lt;p&gt;If you can’t wait until then, join our Community Forums where we have kickstarted a discussion around StatsD.&lt;/p&gt;

&lt;p&gt;Here are a couple of interesting resources to get you started with StatsD:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  StatsD GitHub &lt;a href="https://github.com/statsd/statsd"&gt;repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://medium.com/@DoorDash/scaling-statsd-84d456a7cc2a"&gt;Scaling StatsD at DoorDash&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;  Netdata StatsD reference &lt;a href="https://learn.netdata.cloud/docs/agent/collectors/statsd.plugin"&gt;documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>netdata</category>
      <category>statsd</category>
      <category>monitoring</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Deploy real-time monitoring with Netdata and Ansible</title>
      <dc:creator>Joel Hans</dc:creator>
      <pubDate>Tue, 17 Nov 2020 14:40:25 +0000</pubDate>
      <link>https://forem.com/netdata/deploy-real-time-monitoring-with-netdata-and-ansible-3d49</link>
      <guid>https://forem.com/netdata/deploy-real-time-monitoring-with-netdata-and-ansible-3d49</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ncoe0ykf5n2fj2r51na.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ncoe0ykf5n2fj2r51na.png" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hello, Joel here! I'm working with &lt;a href="https://dev.to/netdata"&gt;Netdata&lt;/a&gt; to help more people deploy real-time system and application monitoring. I hope this Ansible guide helps a few of you build some extraordinary infrastructure.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Netdata's &lt;a href="https://learn.netdata.cloud/docs/get" rel="noopener noreferrer"&gt;one-line kickstart&lt;/a&gt; is zero-configuration, highly adaptable, and compatible with tons of different operating systems and Linux distributions. You can use it on bare metal, VMs, containers, and everything in-between.&lt;/p&gt;

&lt;p&gt;But what if you're trying to bootstrap an infrastructure monitoring solution as quickly as possible? What if you need to deploy Netdata across an entire infrastructure with many nodes? What if you want to make this deployment reliable, repeatable, and idempotent? What if you want to write and deploy your infrastructure or cloud monitoring system like code?&lt;/p&gt;

&lt;p&gt;Enter &lt;a href="https://ansible.com" rel="noopener noreferrer"&gt;Ansible&lt;/a&gt;, a popular system provisioning, configuration management, and infrastructure as code (IaC) tool. Ansible uses &lt;strong&gt;playbooks&lt;/strong&gt; to glue many standardized operations together with a simple syntax, then run those operations over standard and secure SSH connections. There's no agent to install on the remote system, so all you have to worry about is your application and your monitoring software. &lt;/p&gt;

&lt;p&gt;Ansible has some competition from the likes of &lt;a href="https://puppet.com/" rel="noopener noreferrer"&gt;Puppet&lt;/a&gt; or &lt;a href="https://www.chef.io/" rel="noopener noreferrer"&gt;Chef&lt;/a&gt;, but Ansible's most valuable feature is that every operation is &lt;strong&gt;idempotent&lt;/strong&gt;. From the &lt;a href="https://docs.ansible.com/ansible/latest/reference_appendices/glossary.html" rel="noopener noreferrer"&gt;Ansible glossary&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An operation is idempotent if the result of performing it once is exactly the same as the result of performing it repeatedly without any intervening actions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Idempotency means you can run an Ansible playbook against your nodes any number of times without affecting how they operate. When you deploy Netdata with Ansible, you're also deploying &lt;em&gt;monitoring as code&lt;/em&gt;.&lt;/p&gt;
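&lt;p&gt;Idempotency is easy to see in miniature. The sketch below is our own illustration, not Ansible code: it mirrors what a task like &lt;code&gt;lineinfile&lt;/code&gt; does, changing the system only when the desired state is absent, so running it once or a hundred times produces the same result:&lt;/p&gt;

```python
def ensure_line(lines, wanted):
    """Append 'wanted' only if it is missing; a no-op when the state already holds."""
    if wanted in lines:
        return lines, False          # changed=False, like Ansible reporting "ok"
    return lines + [wanted], True    # changed=True, like Ansible reporting "changed"

config = ["bind to = 127.0.0.1"]
config, changed_first = ensure_line(config, "web mode = none")
config, changed_again = ensure_line(config, "web mode = none")

# The first run changes the state; every rerun leaves it untouched.
print(changed_first, changed_again)
```

&lt;p&gt;Ansible applies the same principle to packages, services, files, and users, which is what makes rerunning a playbook safe.&lt;/p&gt;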

&lt;p&gt;In this guide, we'll walk through the process of using an &lt;a href="https://github.com/netdata/community/tree/main/netdata-agent-deployment/ansible-quickstart" rel="noopener noreferrer"&gt;Ansible playbook&lt;/a&gt; to automatically deploy the Netdata Agent to any number of distributed nodes, manage the configuration of each node, and claim them to your Netdata Cloud account. You'll go from a handful of unmonitored nodes to an infrastructure monitoring solution in a matter of minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  A Netdata Cloud account. &lt;a href="https://app.netdata.cloud" rel="noopener noreferrer"&gt;Sign in and create one&lt;/a&gt; if you don't have one already.&lt;/li&gt;
&lt;li&gt;  An administration system with &lt;a href="https://www.ansible.com/" rel="noopener noreferrer"&gt;Ansible&lt;/a&gt; installed.&lt;/li&gt;
&lt;li&gt;  One or more nodes that your administration system can access via &lt;a href="https://git-scm.com/book/en/v2/Git-on-the-Server-Generating-Your-SSH-Public-Key" rel="noopener noreferrer"&gt;SSH public keys&lt;/a&gt; (preferably password-less).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Download and configure the playbook
&lt;/h2&gt;

&lt;p&gt;First, clone the repository containing the &lt;a href="https://github.com/netdata/community/tree/main/netdata-agent-deployment/ansible-quickstart" rel="noopener noreferrer"&gt;playbook&lt;/a&gt;, move the playbook directory into the current directory, and remove the rest of the clone, as it's not required for using the Ansible playbook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/netdata/community.git
&lt;span class="nb"&gt;mv &lt;/span&gt;community/netdata-agent-deployment/ansible-quickstart &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; community
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, &lt;code&gt;cd&lt;/code&gt; into the Ansible directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;ansible-quickstart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Edit the &lt;code&gt;hosts&lt;/code&gt; file
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;hosts&lt;/code&gt; file contains the list of IP addresses or hostnames that Ansible will run the playbook against. The &lt;code&gt;hosts&lt;/code&gt; file that comes with the repository contains two example IP addresses, which you should replace with the addresses or hostnames of your own nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="m"&gt;203&lt;/span&gt;.&lt;span class="m"&gt;0&lt;/span&gt;.&lt;span class="m"&gt;113&lt;/span&gt;.&lt;span class="m"&gt;0&lt;/span&gt;  &lt;span class="n"&gt;hostname&lt;/span&gt;=&lt;span class="n"&gt;node&lt;/span&gt;-&lt;span class="m"&gt;01&lt;/span&gt;
&lt;span class="m"&gt;203&lt;/span&gt;.&lt;span class="m"&gt;0&lt;/span&gt;.&lt;span class="m"&gt;113&lt;/span&gt;.&lt;span class="m"&gt;1&lt;/span&gt;  &lt;span class="n"&gt;hostname&lt;/span&gt;=&lt;span class="n"&gt;node&lt;/span&gt;-&lt;span class="m"&gt;02&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also set the &lt;code&gt;hostname&lt;/code&gt; variable, which appears both on the local Agent dashboard and Netdata Cloud, or you can omit the &lt;code&gt;hostname=&lt;/code&gt; string entirely to use the system's default hostname.&lt;/p&gt;

&lt;h4&gt;
  
  
  Set the login user (optional)
&lt;/h4&gt;

&lt;p&gt;If you SSH into your nodes as a user other than &lt;code&gt;root&lt;/code&gt;, you need to configure &lt;code&gt;hosts&lt;/code&gt; according to those user names. Use the &lt;code&gt;ansible_user&lt;/code&gt; variable to set the login user. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="m"&gt;203&lt;/span&gt;.&lt;span class="m"&gt;0&lt;/span&gt;.&lt;span class="m"&gt;113&lt;/span&gt;.&lt;span class="m"&gt;0&lt;/span&gt;  &lt;span class="n"&gt;hostname&lt;/span&gt;=&lt;span class="n"&gt;ansible&lt;/span&gt;-&lt;span class="m"&gt;01&lt;/span&gt;  &lt;span class="n"&gt;ansible_user&lt;/span&gt;=&lt;span class="n"&gt;example&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Set your SSH key (optional)
&lt;/h4&gt;

&lt;p&gt;If you use an SSH key other than &lt;code&gt;~/.ssh/id_rsa&lt;/code&gt; for logging into your nodes, you can set that on a per-node basis in the &lt;code&gt;hosts&lt;/code&gt; file with the &lt;code&gt;ansible_ssh_private_key_file&lt;/code&gt; variable. For example, to log into two Lightsail instances using different SSH keys supplied by AWS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="m"&gt;203&lt;/span&gt;.&lt;span class="m"&gt;0&lt;/span&gt;.&lt;span class="m"&gt;113&lt;/span&gt;.&lt;span class="m"&gt;0&lt;/span&gt;  &lt;span class="n"&gt;hostname&lt;/span&gt;=&lt;span class="n"&gt;ansible&lt;/span&gt;-&lt;span class="m"&gt;01&lt;/span&gt;  &lt;span class="n"&gt;ansible_ssh_private_key_file&lt;/span&gt;=~/.&lt;span class="n"&gt;ssh&lt;/span&gt;/&lt;span class="n"&gt;LightsailDefaultKey&lt;/span&gt;-&lt;span class="n"&gt;us&lt;/span&gt;-&lt;span class="n"&gt;west&lt;/span&gt;-&lt;span class="m"&gt;2&lt;/span&gt;.&lt;span class="n"&gt;pem&lt;/span&gt;
&lt;span class="m"&gt;203&lt;/span&gt;.&lt;span class="m"&gt;0&lt;/span&gt;.&lt;span class="m"&gt;113&lt;/span&gt;.&lt;span class="m"&gt;1&lt;/span&gt;  &lt;span class="n"&gt;hostname&lt;/span&gt;=&lt;span class="n"&gt;ansible&lt;/span&gt;-&lt;span class="m"&gt;02&lt;/span&gt;  &lt;span class="n"&gt;ansible_ssh_private_key_file&lt;/span&gt;=~/.&lt;span class="n"&gt;ssh&lt;/span&gt;/&lt;span class="n"&gt;LightsailDefaultKey&lt;/span&gt;-&lt;span class="n"&gt;us&lt;/span&gt;-&lt;span class="n"&gt;east&lt;/span&gt;-&lt;span class="m"&gt;1&lt;/span&gt;.&lt;span class="n"&gt;pem&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Edit the &lt;code&gt;vars/main.yml&lt;/code&gt; file
&lt;/h3&gt;

&lt;p&gt;In order to claim your node(s) to your Space in Netdata Cloud, and see all their metrics in real-time in &lt;a href="https://learn.netdata.cloud/docs/visualize/overview-infrastructure" rel="noopener noreferrer"&gt;composite charts&lt;/a&gt; or perform &lt;a href="https://learn.netdata.cloud/docs/cloud/insights/metric-correlations" rel="noopener noreferrer"&gt;Metric Correlations&lt;/a&gt;, you need to set the &lt;code&gt;claim_token&lt;/code&gt; and &lt;code&gt;claim_rooms&lt;/code&gt; variables.&lt;/p&gt;

&lt;p&gt;To find your &lt;code&gt;claim_token&lt;/code&gt; and &lt;code&gt;claim_room&lt;/code&gt;, go to Netdata Cloud, then click on your Space's name in the top navigation, then click on &lt;strong&gt;Manage your Space&lt;/strong&gt;. Click on the &lt;strong&gt;Nodes&lt;/strong&gt; tab in the panel that appears, which displays a script with &lt;code&gt;token&lt;/code&gt; and &lt;code&gt;room&lt;/code&gt; strings. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27q5537s5twu0l6pq7jf.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27q5537s5twu0l6pq7jf.gif" alt="Animated GIF of finding the claiming script and the token and room strings" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copy those strings into the &lt;code&gt;claim_token&lt;/code&gt; and &lt;code&gt;claim_rooms&lt;/code&gt; variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;claim_token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XXXXX&lt;/span&gt;
&lt;span class="na"&gt;claim_rooms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XXXXX&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Change the &lt;code&gt;dbengine_multihost_disk_space&lt;/code&gt; variable if you want to change the metrics retention policy by allocating more or less disk space for storing metrics. The default is 2048 MiB, or 2 GiB. &lt;/p&gt;
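&lt;p&gt;For example, to double the retention allocation, you would set the variable in &lt;code&gt;vars/main.yml&lt;/code&gt; like so (the value is illustrative):&lt;/p&gt;

```yaml
# vars/main.yml -- allocate 4 GiB for metrics storage instead of the default 2 GiB
dbengine_multihost_disk_space: 4096
```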

&lt;p&gt;Because we're claiming this node to Netdata Cloud, and will view its dashboards there instead of via the IP address or hostname of the node, the playbook disables that local dashboard by setting &lt;code&gt;web_mode&lt;/code&gt; to &lt;code&gt;none&lt;/code&gt;. This gives a small security boost by not allowing any unwanted access to the local dashboard.&lt;/p&gt;

&lt;p&gt;You can read more about this decision, or other ways you might lock down the local dashboard, in our &lt;a href="https://learn.netdata.cloud/docs/configure/secure-nodes" rel="noopener noreferrer"&gt;node security doc&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Curious about why Netdata's dashboard is open by default? Read our &lt;a href="https://www.netdata.cloud/blog/netdata-agent-dashboard/" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; on that zero-configuration design decision.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Run the playbook
&lt;/h2&gt;

&lt;p&gt;Time to run the playbook from your administration system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ansible-playbook &lt;span class="nt"&gt;-i&lt;/span&gt; hosts tasks/main.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ansible first connects to your node(s) via SSH, then &lt;a href="https://docs.ansible.com/ansible/latest/user_guide/playbooks_vars_facts.html#ansible-facts" rel="noopener noreferrer"&gt;collects facts&lt;/a&gt; about the system. This playbook doesn't use these facts, but you could expand it to provision specific types of systems based on the makeup of your infrastructure.&lt;/p&gt;

&lt;p&gt;Next, Ansible makes changes to each node according to the &lt;code&gt;tasks&lt;/code&gt; defined in the playbook, and &lt;a href="https://docs.ansible.com/ansible/latest/reference_appendices/common_return_values.html#changed" rel="noopener noreferrer"&gt;reports&lt;/a&gt; whether each task resulted in a change, failed, or was skipped entirely.&lt;/p&gt;

&lt;p&gt;The task to install Netdata will take a few minutes per node, so be patient! Once the playbook reaches the claiming task, your nodes start populating your Space in Netdata Cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;Go use Netdata!&lt;/p&gt;

&lt;p&gt;If you need a bit more guidance for how you can use Netdata for health monitoring and performance troubleshooting, see our &lt;a href="https://learn.netdata.cloud/docs" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;. It's designed like a comprehensive guide, based on what you might want to do with Netdata, so use those categories to dive in.&lt;/p&gt;

&lt;p&gt;Some of the best places to start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://learn.netdata.cloud/docs/collect/enable-configure" rel="noopener noreferrer"&gt;Enable or configure a collector&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://learn.netdata.cloud/docs/agent/collectors/collectors" rel="noopener noreferrer"&gt;Supported collectors list&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://learn.netdata.cloud/docs/visualize/overview-infrastructure" rel="noopener noreferrer"&gt;See an overview of your infrastructure&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://learn.netdata.cloud/docs/visualize/interact-dashboards-charts" rel="noopener noreferrer"&gt;Interact with dashboards and charts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://learn.netdata.cloud/docs/store/change-metrics-storage" rel="noopener noreferrer"&gt;Change how long Netdata stores metrics&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We're looking for more deployment and configuration management strategies, whether via Ansible or other provisioning/infrastructure-as-code software, such as Chef or Puppet, in Netdata's &lt;a href="https://github.com/netdata/community" rel="noopener noreferrer"&gt;community repo&lt;/a&gt;. Anyone can fork the repo and submit a PR, whether to improve this playbook, extend it, or create an entirely new experience for deploying Netdata across an entire infrastructure.&lt;/p&gt;

</description>
      <category>ansible</category>
      <category>monitoring</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Introduction to community repository: Consul, Ansible, ML</title>
      <dc:creator>Odysseas Lamtzidis</dc:creator>
      <pubDate>Mon, 16 Nov 2020 16:46:02 +0000</pubDate>
      <link>https://forem.com/netdata/introduction-to-community-repository-consul-ansible-ml-4hma</link>
      <guid>https://forem.com/netdata/introduction-to-community-repository-consul-ansible-ml-4hma</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;The post was originally posted on the &lt;a href="https://www.netdata.cloud/blog/welcome-to-netdatas-community-repository-consul-ansible-ml/"&gt;Netdata blog&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QYd9Z7L8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/csg9ef98iqi6g78y785g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QYd9Z7L8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/csg9ef98iqi6g78y785g.png" alt="Cover Image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On our journey to democratize monitoring, we are proud to have open source at the core of both our products and our company values. What started as a project born out of frustration with the lack of existing alternatives (see &lt;a href="https://www.rexfeng.com/blog/2016/01/anger-driven-development/"&gt;anger-driven development&lt;/a&gt;) quickly became one of the most starred open-source projects on all of GitHub. &lt;/p&gt;

&lt;p&gt;Fast-forward a couple of years, and the Netdata Agent, our open-source monitoring agent, is maturing into the best single-node monitoring experience, offering unparalleled efficiency and thousands of metrics collected every second. At the same time, we have gathered a considerable community on our &lt;a href="https://github.com/netdata/netdata"&gt;GitHub repository&lt;/a&gt; and new forums.&lt;/p&gt;

&lt;p&gt;As the community grows, and considering our belief that extensibility is key to adoption, it was only natural to start brainstorming a way to share code and sample applications that supercharge the user experience and the Netdata Agent’s capabilities. &lt;/p&gt;

&lt;p&gt;Thus, without further ado, please say hello to our &lt;a href="https://github.com/netdata/community"&gt;Community Repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z1WkbeYr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.netdata.cloud/wp-content/uploads/2020/11/netdata-community.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z1WkbeYr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://www.netdata.cloud/wp-content/uploads/2020/11/netdata-community.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although still in its infancy, we expect this repository to be filled by community members who want to share their experience of running Netdata in a production environment or integrated into a technological stack. At the moment, the repository will be used to house all sample applications, which are divided into categories, depending on the use case.&lt;/p&gt;

&lt;p&gt;Currently, there are three example applications, all contributed by the Netdata team, which were originally developed for internal use. Let’s take a look at them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration Management
&lt;/h2&gt;

&lt;p&gt;The first sample application is one I built that focuses on the issue of configuration management for an arbitrary number of Netdata Agents. More specifically, I opted to use &lt;a href="https://www.consul.io/"&gt;Consul&lt;/a&gt;, an amazing open-source project by HashiCorp, to dynamically manage the configuration of a Netdata Agent. The keyword is “dynamically”: whenever I change a configuration variable, the Netdata Agent restarts automatically so that it can pick up the change from the configuration files.&lt;/p&gt;

&lt;p&gt;Consul, per their documentation, is a “service mesh solution providing a full-featured control plane with service discovery, configuration, and segmentation functionality”. As such, Consul is already routinely used in cloud-native applications, and it’s ideal as a simple key/value store to house the configuration variables we wish to change dynamically. Since Netdata can’t pick up configuration from a RESTful interface, we use consul-template, another open-source tool by HashiCorp, which watches a Consul node for a specific set of keys, picks up changes to their values, and inserts them into templates, generating the updated configuration files in the process.&lt;/p&gt;
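&lt;p&gt;To make that concrete, a consul-template template is just the target configuration file with placeholders that read from Consul's key/value store. A minimal sketch might look like this (the KV path and file name are hypothetical):&lt;/p&gt;

```text
# netdata.conf.ctmpl -- consul-template re-renders this file whenever the
# watched key changes, and Netdata restarts to pick up the new value
[global]
    history = {{ key "netdata/config/history" }}
```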

&lt;p&gt;The code and documentation for this sample application can be found in the specific &lt;a href="https://github.com/netdata/community/tree/main/configuration-management/consul-quickstart"&gt;consul-quickstart directory&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Machine Learning and Netdata Agent’s API
&lt;/h2&gt;

&lt;p&gt;The second contribution came from &lt;a href="https://www.netdata.cloud/author/amaguire/"&gt;Andrew Maguire&lt;/a&gt;, who contributed a few examples built on the &lt;a href="https://registry.my-netdata.io/swagger/#/default/get_data"&gt;Netdata Agent’s API&lt;/a&gt;. The API offers anyone the ability to extract data from the Netdata Agent in an extremely efficient way and build real-time applications on top of it. He leveraged his in-house &lt;a href="https://github.com/netdata/netdata-pandas/tree/master/"&gt;Python library&lt;/a&gt; to automatically extract data, load it into pandas DataFrames, and enable live ML capabilities, such as anomaly detection.&lt;/p&gt;

&lt;p&gt;You can find the examples in the &lt;a href="https://github.com/netdata/community/tree/main/netdata-agent-api/netdata-pandas"&gt;appropriate directory&lt;/a&gt; of the community repository and open them in Google Colab. We suggest Google Colab not only because it’s free, but also because it spins up a VM and installs all the required dependencies, making it the fastest way to try out the examples and play with the API. To open a notebook in Google Colab, simply open it on GitHub and click the Open in Colab button.&lt;/p&gt;
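&lt;p&gt;The API itself is plain HTTP returning JSON: a data query responds with a &lt;code&gt;labels&lt;/code&gt; row naming the dimensions and a &lt;code&gt;data&lt;/code&gt; matrix of per-second values. The sketch below parses a hand-written sample of that shape into per-dimension averages using only the standard library (the sample numbers are invented for illustration, and the response shape is our reading of the API docs):&lt;/p&gt;

```python
import json

# A hand-written sample in the shape returned by the Agent's /api/v1/data
# endpoint with format=json (the values are invented for illustration).
sample = json.loads("""
{
  "labels": ["time", "user", "system"],
  "data": [
    [1605540000, 4.0, 1.0],
    [1605540001, 6.0, 3.0],
    [1605540002, 5.0, 2.0]
  ]
}
""")

def dimension_averages(payload):
    """Average each dimension column, skipping the leading 'time' column."""
    names = payload["labels"][1:]
    rows = payload["data"]
    return {
        name: sum(row[i + 1] for row in rows) / len(rows)
        for i, name in enumerate(names)
    }

print(dimension_averages(sample))  # {'user': 5.0, 'system': 2.0}
```

&lt;p&gt;The netdata-pandas library automates exactly this kind of fetch-and-reshape work across many charts and hosts at once.&lt;/p&gt;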

&lt;h2&gt;
  
  
  Automatic provisioning of Netdata Agents
&lt;/h2&gt;

&lt;p&gt;Last but not least, &lt;a href="https://www.netdata.cloud/author/joel/"&gt;Joel Hans&lt;/a&gt; pulled together the scripts he had created to automatically provision and claim any number of Netdata Agents on remote servers. The sample application is enabled by Ansible, a popular system provisioning, configuration management, and infrastructure-as-code tool. The user defines a set of steps in a &lt;code&gt;.yaml&lt;/code&gt; file, called a playbook, and Ansible is then responsible for running this playbook against a number of hosts, with SSH as the only requirement. &lt;/p&gt;

&lt;p&gt;With &lt;a href="https://www.ansible.com/"&gt;Ansible&lt;/a&gt;, Joel can install and claim any number of Netdata Agents automatically, so that he can access and monitor his nodes in a matter of minutes, through Netdata Cloud. It’s that easy. You can learn more in the &lt;a href="https://learn.netdata.cloud/guides/deploy/ansible"&gt;guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Now, it’s your turn
&lt;/h2&gt;

&lt;p&gt;The repository is up and running, but we need you to participate. If you are using any of the aforementioned tools and platforms and feel that we could have done something in a better way, please do let us know and make a pull request with your suggestions. &lt;/p&gt;

&lt;p&gt;If, on the other hand, you are using Netdata with another application in a way that greatly improves the experience, please do create a README about the project and submit a PR to the appropriate category. The value of this repository is of a compounding nature: the more examples we gather, the more value our users (like you) receive, and the repository's growing popularity will in turn invite even more sample applications.&lt;/p&gt;

&lt;p&gt;See you all on our repo!&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>monitoring</category>
      <category>machinelearning</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
