Forem: Eric Goebelbecker

How to Merge Log Files

Eric Goebelbecker — Tue, 24 Mar 2020 14:20:15 +0000

You have log files from two or more applications, and you need to see them together. Viewing the data together in proper sequence will make it easier to correlate events, and listing them side-by-side in windows or tabs isn’t cutting it.

You need to merge log files by timestamps.

But just merging them by timestamp isn’t the only thing you need. Many log files have entries with more than one line, and not all of those lines have timestamps on them.

Merge Log Files by Timestamp

Let’s take a look at the simple case. We have two files from Linux's syslog daemon. One is the messages file and the other is the crontab log.

Here are four lines from the messages file:

Sep 4 00:00:08 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 120000ms.
Sep 4 00:02:08 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 124910ms.
Sep 4 00:04:13 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 109850ms.
Sep 4 00:06:03 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 112380ms.

And here are five lines from cron:

Sep 4 00:01:01 ip-10-97-55-50 CROND[18843]: (root) CMD (run-parts /etc/cron.hourly)
Sep 4 00:01:01 ip-10-97-55-50 run-parts(/etc/cron.hourly)[18843]: starting 0anacron
Sep 4 00:01:01 ip-10-97-55-50 anacron[18853]: Anacron started on 2018-09-04
Sep 4 00:01:01 ip-10-97-55-50 anacron[18853]: Jobs will be executed sequentially
Sep 4 00:01:01 ip-10-97-55-50 anacron[18853]: Normal exit (0 jobs run)

When we’re only dealing with nine lines of logs, it’s easy to see where the merge belongs. The five lines in the cron log belong between the first and second lines of the messages log.

But with a bigger dataset, we need a tool that can merge these two files on the date and the time. The good news is that Linux has a tool for this already.

Merge Log Files With Sort

The sort command can, as its name implies, sort input. We can stream both log files into sort and give it a hint on how to sort the two logs.

Let’s give it a try.

cat messages.log cron.log | sort --key=2,3 > merge.log

This creates a new file named merge.log. Here’s what it looks like:

Sep 4 00:00:08 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 120000ms.
Sep 4 00:01:01 ip-10-97-55-50 CROND[18843]: (root) CMD (run-parts /etc/cron.hourly)
Sep 4 00:01:01 ip-10-97-55-50 anacron[18853]: Anacron started on 2018-09-04
Sep 4 00:01:01 ip-10-97-55-50 anacron[18853]: Jobs will be executed sequentially
Sep 4 00:01:01 ip-10-97-55-50 anacron[18853]: Normal exit (0 jobs run)
Sep 4 00:01:01 ip-10-97-55-50 run-parts(/etc/cron.hourly)[18843]: starting 0anacron
Sep 4 00:02:08 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 124910ms.
Sep 4 00:04:13 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 109850ms.
Sep 4 00:06:03 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 112380ms.

It worked!

Let’s dissect that command.

cat messages.log cron.log |

Cat concatenates files. We used it to send both logs to standard output. In this case, it sent messages.log first and then cron.log.

The pipe | is what it sounds like. It’s a pipe between two programs. It sends the contents of the two files to the next part of the command. Sort can also take file names directly on the command line, but piping the files in on standard input lets us look at each half of the command separately.

sort --key=2,3 > merge.log

Sort receives the contents of two files and sorts them. Its output goes to the > redirect operator, which creates the new file.

The most important part of this command is --key=2,3. We used this to tell sort to sort its input using fields two and three of each line. For some reason, sort starts counting fields at one instead of zero.

So sort was able to merge the two files using the day of the month and the timestamp.
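To make the field numbering concrete, here is a minimal Python 3 sketch of the same idea. The helper name field_key and the shortened sample lines are made up for illustration, and the sketch only approximates what sort does (it ignores locale rules and compares plain text).

```python
def field_key(line, first=2, last=3):
    """Return fields first..last of a line, counting from one as sort does."""
    fields = line.split()
    # Python slices count from zero, so field 2 is index 1.
    return fields[first - 1:last]


# Shortened, made-up sample lines in the same layout as the logs above.
lines = [
    "Sep 4 00:02:08 host dhclient[2588]: second solicit",
    "Sep 4 00:01:01 host CROND[18843]: hourly job",
    "Sep 4 00:00:08 host dhclient[2588]: first solicit",
]

for line in sorted(lines, key=field_key):
    print(field_key(line), line)
```

Because the sketch compares text, a day of 10 would sort before a day of 4. That is harmless for this sample, where every line shares the same day.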

This is our easy case. These log files both had single line entries, and our dataset was for less than thirty days. So we don't have to worry about sorting by months.
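If the data did cross a month boundary, comparing parsed dates would be safer than comparing text. The following Python 3 sketch is one way to do that, assuming each input is already in time order. The names syslog_key, merge_sorted_logs, and ASSUMED_YEAR are hypothetical, and the year is an assumption because syslog lines in this format do not carry one.

```python
import heapq
from datetime import datetime

ASSUMED_YEAR = 2018  # assumption: these syslog lines carry no year of their own


def syslog_key(line):
    """Build a datetime from the month, day, and time at the start of a line."""
    month, day, clock = line.split()[:3]
    stamp = "{} {} {} {}".format(ASSUMED_YEAR, month, day, clock)
    return datetime.strptime(stamp, "%Y %b %d %H:%M:%S")


def merge_sorted_logs(*line_iterables):
    """Lazily merge inputs that are each already in timestamp order."""
    return heapq.merge(*line_iterables, key=syslog_key)


# Shortened, made-up sample lines.
messages = [
    "Sep 4 00:00:08 host dhclient[2588]: XMT: Solicit on eth0",
    "Sep 4 00:02:08 host dhclient[2588]: XMT: Solicit on eth0",
]
cron = [
    "Sep 4 00:01:01 host CROND[18843]: (root) CMD (run-parts /etc/cron.hourly)",
]

merged = list(merge_sorted_logs(messages, cron))
for line in merged:
    print(line)
```

heapq.merge reads its inputs lazily, so passing open file objects instead of lists would avoid loading whole logs into memory.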

Let’s look at something that’s a little more complicated.

Merge Log Files With Multiline Entries

Here are a couple of Java application logs that we would like to merge.

Here’s the first:

2018-09-06 15:20:40,980 [INFO] Heimdall main:26 [main] 

Fix Engine is starting.


2018-09-06 15:20:45,639 [ERROR] AcceptorFactory createSessionSettings:92 [main] 

Session settings: [default]
SocketAcceptPort=7000
ConnectionType=acceptor
ValidateUserDefinedFields=N
ValidateLengthAndChecksum=N
ValidateFieldsOutOfOrder=N


2018-09-06 15:20:50,645 [ERROR] AcceptorFactory getSessionSettings:123 [main]

Second Session settings: [default]
SocketAcceptPort=7000
ConnectionType=acceptor
ValidateUserDefinedFields=N
ValidateLengthAndChecksum=N
ValidateFieldsOutOfOrder=N


2018-09-06 15:21:45,653 [INFO] ThreadedSocketAcceptor startSessionTimer:291 [main] SessionTimer started
2018-09-06 15:21:47,711 [INFO] NetworkingOptions logOption:119 [main] Socket option: SocketTcpNoDelay=true
2018-09-06 15:21:59,919 [INFO] SendMessageToSolace addSession:51 [QF/J Session dispatcher: FIX.4.2:FOOU->TEST02] Adding session: FIX.4.2:FOOU->TEST02
2018-09-06 15:22:59,920 [INFO] MessageClient openTopic:422 [QF/J Session dispatcher: FIX.4.2:FOOU->TEST02]
Opening FOO/DEV/AMER/FixEngine/Admin/*/TEST02
2018-09-06 15:23:59,937 [ERROR] ConsumerNodeStatusHandler setStateUp:186 [QF/J Session dispatcher: FIX.4.2:FOOU->TEST02] Setting State up: TEST02
2018-09-06 15:24:03,962 [INFO] MessageClient openTopic:422 [stateHeartbeat]

Opening FOO/DEV/AMER/State/Admin/Events
2018-09-06 15:25:00,536 [INFO] incoming messageReceived:146 [NioProcessor-2] FIX.4.2:FOOU->TEST02: 8=FIX.4.29=6235=149=TEST0256=FOOU34=252=20180906-15:21:00.528112=TEST10=198

This log has a lot of whitespace and entries that span multiple lines.

Here’s the other:

2018-09-06 15:20:43:031 [INFO] com.foobar.atr.rest.controller.SessionStatusCache addSessionStatus():28 [lettuce-nioEventLoop-10-5] Adding session: TEST02 at 1536243961031
2018-09-06 15:20:46:031 [INFO] com.foobar.atr.rest.controller.SessionStatusCache addSessionStatus():28 [lettuce-nioEventLoop-13-4] Adding session: TEST02 at 1536243961031
2018-09-06 15:23:15:032 [INFO] com.foobar.atr.rest.controller.SessionStatusCache addSessionStatus():28 [lettuce-nioEventLoop-7-5] Adding session: TEST02 at 1536243961032
2018-09-06 15:24:35:257 [INFO] com.foobar.atr.rest.controller.StatusController getSessionStatus():67 [http-nio-8010-exec-4] Received request a fix session, senderCompId:RBSG2
2018-09-06 15:27:30:691 [INFO] com.foobar.atr.rest.controller.SessionStatusCache addSessionStatus():28 [lettuce-nioEventLoop-10-5] Adding session: PLOP02 at 1536244050691

This log is more uniform, with entries that only span a single line.

When we merge these two files, we want the multiline log message to remain together. So sorting the raw lines with sort won’t work. We need a tool that's capable of associating the lines without timestamps with the last line that has one.

Unfortunately, no command line tool does this. We’re going to have to write some code.

A Merging Algorithm

Here’s an algorithm for merging log files that have multiline entries.

First, we need to preprocess the log files.

1. Scan the log file line by line until we reach the end.
2. If a line has a timestamp, write the previously saved line to a new file, then save the current line.
3. If a line has no timestamp, append it to the saved line, after replacing the newline with a special character.
4. Continue with step 1. When we reach the end of the file, write out the saved line too.

We could do this in memory, but what happens when we’re dealing with huge log files? We’ll save the preprocessed log entries to disk so that this tool will work on huge log files.

After we perform this on both files, we have a new one that is full of single line entries. We’ll use the sort command to sort it for us, rather than reinventing the wheel. Then, we’ll replace the special characters with new lines, and we have a merged log file.
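The whole round trip can be sketched in memory on a few lines before we deal with files. This is a Python 3 illustration only: the names fold, unfold, SEPARATOR, and STARTS_ENTRY are made up, the timestamp pattern is an assumption about the log format, and Python's sorted stands in for the sort command.

```python
import re

SEPARATOR = "\x01"  # stand-in for newlines inside a folded entry
STARTS_ENTRY = re.compile(r"^\d{4}-\d{2}-\d{2} ")  # assumed timestamp format


def fold(lines):
    """Yield one string per log entry, joining continuation lines with SEPARATOR."""
    entry = None
    for line in lines:
        line = line.rstrip("\n")
        if STARTS_ENTRY.match(line):
            if entry is not None:
                yield entry
            entry = line
        elif entry is not None:
            # Lines that appear before any timestamp are dropped in this sketch.
            entry += SEPARATOR + line
    if entry is not None:
        yield entry  # the final entry has no successor to trigger its output


def unfold(entry):
    """Turn the separators back into newlines."""
    return entry.replace(SEPARATOR, "\n")


first = [
    "2018-09-06 15:20:45,639 [ERROR] settings:",
    "SocketAcceptPort=7000",
    "2018-09-06 15:21:45,653 [INFO] started",
]
second = ["2018-09-06 15:20:46:031 [INFO] Adding session"]

merged = [unfold(e) for e in sorted(list(fold(first)) + list(fold(second)))]
print("\n".join(merged))
```

The multiline entry stays intact and the entry from the second list lands between the two entries of the first.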

And we’re done!

Let's do it.

Merge Log Files With Python

We’ll use Python. It’s available on most systems, and it’s easy to write a cross-platform tool that manipulates text files. I wrote the code for this article with version 2.7.14. You can find the entire script here on GitHub.

First, we need to process our input files.

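import argparse
import re

parser = argparse.ArgumentParser(description="Process input and output file names")
parser.add_argument("-f", "--files", help="list of input files", required=True, nargs='+')
parser.add_argument("-o", "--output", help="output file", required=True, type=argparse.FileType('w'))
args = parser.parse_args()

# Matches a line whose first character is neither a digit nor a dash,
# which for these logs means a line that does not start with a timestamp.
line_regex = re.compile(r"^[^0-90-90-90-9\-0-90-9\-0-90-9]")

with open("tmp.log", "w") as out_file:
    for filename in args.files:
        lastline = ""
        with open(filename, "r") as in_file:
            for line in in_file:
                if line_regex.search(line):
                    # No timestamp: fold this line into the saved entry.
                    lastline = lastline.rstrip('\n')
                    lastline += '\1'
                    lastline += line
                else:
                    # New entry: write out the saved one and start again.
                    out_file.write(lastline)
                    lastline = line
        # Write the final entry of this file, which nothing else flushes.
        out_file.write(lastline)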

We'll start by processing command line arguments. This script accepts two:

-f is a space-separated list of input files
-o is the name of the file to write the output to

Argparse gives us a list from the arguments passed to -f and opens the output file for us, as we’ll see below.
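Here is a small Python sketch of how nargs='+' behaves. It passes a hard-coded argument list to parse_args so it can run without a command line, and it swaps argparse.FileType('w') for a plain string so the sketch does not create a file. Both changes are for illustration only.

```python
import argparse

parser = argparse.ArgumentParser(description="Process input and output file names")
# nargs="+" collects every value after -f, up to the next option, into a list.
parser.add_argument("-f", "--files", help="list of input files", required=True, nargs="+")
# Plain string here (an assumption for this sketch) instead of argparse.FileType('w').
parser.add_argument("-o", "--output", help="output file", required=True)

args = parser.parse_args(["-f", "foo.log", "bar.log", "-o", "output.log"])
print(args.files)
print(args.output)
```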

Python Regular Expressions

Then we'll create a regular expression. Let’s take a close look at it since this is what you’ll need to change if your logs are formatted differently.

Here’s the whole expression:

^[^0-90-90-90-9\-0-90-9\-0-90-9]

The expression starts with a caret ^. This means the beginning of a line.

But then we have this: [^ ] with some characters in the middle. Square brackets with a caret at the beginning mean "any single character that is not in this set."

The set is inside the brackets.

0-90-90-90-9\-0-90-9\-0-90-9

Each 0-9 is a range that covers the numerals, and each \- is a dash. The ranges are written out to mirror the NNNN-NN-NN shape of the date we see at the beginning of each log entry, but inside brackets the repeats are redundant: the set is simply the digits plus the dash.

So in English, the expression means “if the first character of the line is not a digit or a dash.” Every entry in these logs begins with the year, so for our purposes that is the same as “if the line does not begin with a date.”

If you need to process logs with a different format, you'll need to change this. There's a guide to Python regular expressions here.
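A stricter check is possible by matching the whole timestamp and negating the result in code. The Python 3 sketch below assumes entries start with YYYY-MM-DD HH:MM:SS. The names ENTRY_START and is_continuation are hypothetical and are not part of the script.

```python
import re

# Assumed entry format: "YYYY-MM-DD HH:MM:SS" at the very start of the line.
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def is_continuation(line):
    """Return True when the line does not begin a new log entry."""
    return ENTRY_START.match(line) is None


samples = [
    "2018-09-06 15:20:45,639 [ERROR] AcceptorFactory createSessionSettings:92 [main]",
    "SocketAcceptPort=7000",
    "",
    "8=FIX.4.2 payload that happens to start with a digit",
]

for sample in samples:
    print(is_continuation(sample), repr(sample))
```

The last sample is a continuation line that starts with a digit. A test that looks only at the first character would mistake it for a new entry, while the full pattern does not.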

Sorting the Results

Now, we'll start the real work.

Open a temporary file.
Open the first log file.
Join lines with no timestamp to their predecessors, as described above.
Repeat this for each file passed on the command line.

For the third step, we'll chop the newline '\n' from the end of the last line we saved. Then we'll add an SOH ('\1') character and concatenate the lines. (I could've done this in one line, but I spelled it out to make it clear.)

We're replacing newlines '\n' with the SOH character instead of NULLs ('\0') because nulls would confuse python's string processing libraries and we'd lose data.

Finally, the result of this code is a file named tmp.log that contains the log files preprocessed to be one line per entry.

Let’s finish the job.

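import os
from subprocess import check_output

# universal_newlines=True makes check_output return text rather than bytes on Python 3.
sorted_logs = check_output(["/usr/bin/sort", "--key=1,2", "tmp.log"], universal_newlines=True)

os.remove("tmp.log")

lines = sorted_logs.split('\n')
for line in lines:
    # Turn each SOH character back into the newline it replaced.
    newline = line.replace('\1', '\n')
    args.output.write(newline + "\n")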

Check_output executes an external command and captures the output.

So we'll use it to run sort on our temporary file and return the results to us as a string. Then, we'll remove the temporary file.

We wouldn’t want to capture the result in memory with a large file, but to keep this post short, I cheated. An alternative is to send the output of sort to a file with the -o option and then open that file and remove the special characters.
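That alternative might look like the Python 3 sketch below. The function names sort_to_file and restore_entries are made up. The sketch assumes a sort command on the PATH that accepts -o, and the demo at the bottom uses in-memory stand-ins rather than real files, so it does not call sort_to_file.

```python
import io
import subprocess

SEPARATOR = "\x01"  # the SOH character used to fold multiline entries


def sort_to_file(source_path, sorted_path):
    """Run sort with -o so the result goes to disk instead of a Python string."""
    subprocess.check_call(["sort", "--key=1,2", "-o", sorted_path, source_path])


def restore_entries(sorted_lines, output):
    """Stream sorted lines to output, turning each separator back into a newline."""
    for line in sorted_lines:
        output.write(line.replace(SEPARATOR, "\n"))


# In-memory stand-ins for the sorted file and the final output file.
sorted_file = io.StringIO(
    "2018-09-06 15:20:45,639 settings:" + SEPARATOR + "Port=7000\n"
    "2018-09-06 15:20:46:031 Adding session\n"
)
result = io.StringIO()
restore_entries(sorted_file, result)
print(result.getvalue(), end="")
```

With real files, the flow would be sort_to_file("tmp.log", "sorted.log") followed by restore_entries on the opened sorted file, so only one line is held in memory at a time.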

Next, we'll split the output on the new lines into an array. Then we'll process that array and undo the special characters. We'll write each line to the file opened for us by argparse.

We’re done!

Let's run this script on two files:

./mergelogs.py -f foo.log bar.log -o output.log

And we'll see output that begins like this.

2018-09-06 15:20:40,980 [INFO] Heimdall main:26 [main] 

Fix Engine is starting.


2018-09-06 15:20:43:031 [INFO] com.foobar.atr.rest.controller.SessionStatusCache addSessionStatus():28 [lettuce-nioEventLoop-10-5] Adding session: TEST02 at 1536243961031
2018-09-06 15:20:45,639 [ERROR] AcceptorFactory createSessionSettings:92 [main] 

Session settings: [default]
SocketAcceptPort=7000
ConnectionType=acceptor
ValidateUserDefinedFields=N
ValidateLengthAndChecksum=N
ValidateFieldsOutOfOrder=N


2018-09-06 15:20:46:031 [INFO] com.foobar.atr.rest.controller.SessionStatusCache addSessionStatus():28 [lettuce-nioEventLoop-13-4] Adding session: TEST02 at 1536243961031
2018-09-06 15:20:50,645 [ERROR] AcceptorFactory getSessionSettings:123 [main]

Second Session settings: [default]
SocketAcceptPort=7000
ConnectionType=acceptor
ValidateUserDefinedFields=N
ValidateLengthAndChecksum=N
ValidateFieldsOutOfOrder=N


2018-09-06 15:21:45,653 [INFO] ThreadedSocketAcceptor startSessionTimer:291 [main] SessionTimer started
2018-09-06 15:21:47,711 [INFO] NetworkingOptions logOption:119 [main] Socket option: SocketTcpNoDelay=true
2018-09-06 15:21:59,919 [INFO] SendMessageToSolace addSession:51 [QF/J Session dispatcher: FIX.4.2:FOOU->TEST02] Adding session: FIX.4.2:FOOU->TEST02
2018-09-06 15:22:59,920 [INFO] MessageClient openTopic:422 [QF/J Session dispatcher: FIX.4.2:FOOU->TEST02]

Opening FOO/DEV/AMER/FixEngine/Admin/*/TEST02
2018-09-06 15:23:15:032 [INFO] com.foobar.atr.rest.controller.SessionStatusCache addSessionStatus():28 [lettuce-nioEventLoop-7-5] Adding session: TEST02 at 1536243961032
2018-09-06 15:23:59,937 [ERROR] ConsumerNodeStatusHandler setStateUp:186 [QF/J Session dispatcher: FIX.4.2:FOOU->TEST02] Setting State up: TEST02
2018-09-06 15:24:03,962 [INFO] MessageClient openTopic:422 [stateHeartbeat]

Opening FOO/DEV/AMER/State/Admin/Events
2018-09-06 15:24:35:257 [INFO] com.foobar.atr.rest.controller.StatusController getSessionStatus():67 [http-nio-8010-exec-4] Received request a fix session, senderCompId:RBSG2

Log Files, Merged

In this tutorial, we covered how to merge log files, looking at a straightforward case and then a more complicated situation. The code for this is available on GitHub, and you're free to download and modify it for your individual needs.

Jaeger Tracing Tutorial: Get Going From Scratch

Eric Goebelbecker — Tue, 17 Sep 2019 14:49:12 +0000

The Jaeger tracing system is an open-source tracing system for microservices, and it supports the OpenTracing standard. We talked about OpenTracing and why it's essential in a previous post. So now, let's talk more about Jaeger.

Jaeger was initially published as open source by Uber Technologies and has evolved since then. The system gives you distributed tracing, root cause analysis, service dependency analysis, and more.

We're going to get started with Jaeger tracing by installing it and using it to examine some RESTful API calls to a single microservice. To do this, we'll need to build a small service with tracing enabled. Jaeger has tooling for Go, Java, JavaScript (Node.js), Python, and C++. We'll use Java for this tutorial, but the concepts we cover here will apply to any supported platform.

Installation and Setup

Docker

The preferred way to install and run Jaeger tracing is with Docker. It's also the easiest. So if you're not running Docker yet, take a look at the installation process for your platform here. The Community Edition is more than adequate for this tutorial.

Install Jaeger

Jaeger is a set of distributed components for collecting, storing, and displaying trace information. But it also ships as an "all-in-one" image that runs the entire system. We'll use that to keep the install simple for this tutorial. There are instructions for getting started here, but I'll cover a condensed version in this post.

Docker will download the image for you when you try to start a container. I'll use a shorter command line than the one in Jaeger's instructions because we're only going to use one of the system's tracing modes.

So, when you run the container, your command should look like this.

docker run -d --name jaeger -p 16686:16686 -p 6831:6831/udp jaegertracing/all-in-one:1.9

When the command finishes, check to see if the server is running with docker ps -a.

You should see the container name, jaeger, with up in the status column. You'll also see a lot of information about service ports.

Now, you can connect to the Jaeger console at http://localhost:16686

We see the Jaeger user interface. It's running!

Java Microservice

We'll use a simple Spring Boot service to create some traces. The code for the project is on GitHub.

The project includes scripts for running the service directly or in a container. If you want to run both the service and Jaeger in containers, you'll need to know how to get them to connect over UDP using Docker networking. That's beyond the scope of this tutorial.

Jaeger and Open Tracing Concepts

Before we start our service, we can take a look at Jaeger's interface and review some basic open tracing concepts. The user interface service reports its queries so that we can see examples of a few basic traces.

Look at the box on the left-hand side of the page labeled Find Traces. The first control, a chooser, lists the services available for tracing. The count should show one. (If it doesn't, try refreshing the page.) Now, click the chooser and you'll see jaeger-query listed as the only service.

A service is an application that's registered itself to Jaeger. We'll see how to register our application below.

Next, with jaeger-query selected, click the Find Traces button on the bottom of the form.

A list of traces will appear on the right-hand side of the screen. The traces have titles that correspond to the Operation selector on the search form. So, select /api/services in the Operation box and click the Find button again. Depending on how many times you reloaded the page, you'll see a few operations.

Now click on one of the traces.

This trace has one operation in it. It took 0.14 ms. There's not much to look at here. But we can look at what the service sent to the Jaeger Tracing server. So click on the box in the upper right-hand side of the page.

Jaeger Tracing Tags

Next, let's look at the JSON.

{
  "data": [
    {
      "traceID": "3b8496f91e044c34",
      "spans": [
        {
          "traceID": "3b8496f91e044c34",
          "spanID": "3b8496f91e044c34",
          "flags": 1,
          "operationName": "/api/traces",
          "references": [],
          "startTime": 1549827709524283,
          "duration": 142,
          "tags": [
            {
              "key": "sampler.type",
              "type": "string",
              "value": "const"
            },
            {
              "key": "sampler.param",
              "type": "bool",
              "value": true
            },
            {
              "key": "span.kind",
              "type": "string",
              "value": "server"
            },
            {
              "key": "http.method",
              "type": "string",
              "value": "GET"
            },
            {
              "key": "http.url",
              "type": "string",
              "value": "/api/traces?end=1549827709522000\u0026limit=20\u0026lookback=1h\u0026maxDuration\u0026minDuration\u0026service=jaeger-query\u0026start=1549824109522000\u0026tags=%7B%22http.status_code%22%3A%22404%22%7D"
            },
            {
              "key": "component",
              "type": "string",
              "value": "net/http"
            },
            {
              "key": "http.status_code",
              "type": "int64",
              "value": 200
            }
          ],
          "logs": [],
          "processID": "p1",
          "warnings": null
        }
      ],
      "processes": {
        "p1": {
          "serviceName": "jaeger-query",
          "tags": [
            {
              "key": "client-uuid",
              "type": "string",
              "value": "6550fb460c8ee430"
            },
            {
              "key": "hostname",
              "type": "string",
              "value": "9f77a41dfd0c"
            },
            {
              "key": "ip",
              "type": "string",
              "value": "172.17.0.2"
            },
            {
              "key": "jaeger.version",
              "type": "string",
              "value": "Go-2.15.1dev"
            }
          ]
        }
      },
      "warnings": null
    }
  ],
  "total": 0,
  "limit": 0,
  "offset": 0,
  "errors": null
}

There's a lot of information here. Toward the top of the JSON, you see an array of spans. This trace only has one. A trace consists of one or more spans. A span is, as you might guess, an interval of time that contains one or more operations. We'll take a closer look at spans when we add some code to the Java service. Inside the span, there's an array of tags. Tags are attributes an application adds to traces. Here are two:

{
    "key": "http.method",
    "type": "string",
    "value": "GET"
},
{
    "key": "http.status_code",
    "type": "int64",
    "value": 200
}
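If you save that JSON, the tags are easy to pull out with a few lines of code. The Python 3 sketch below works on a trimmed copy of the trace shown above. The helper name span_tags is made up, and the sketch is just one way to read the export.

```python
import json

# A trimmed copy of the trace JSON shown above, keeping only two tags.
trace_json = """
{"data": [{"spans": [{"operationName": "/api/traces",
  "tags": [
    {"key": "http.method", "type": "string", "value": "GET"},
    {"key": "http.status_code", "type": "int64", "value": 200}
  ]}]}]}
"""


def span_tags(span):
    """Collapse a span's list of tag objects into a key-to-value dictionary."""
    return {tag["key"]: tag["value"] for tag in span["tags"]}


trace = json.loads(trace_json)["data"][0]
for span in trace["spans"]:
    print(span["operationName"], span_tags(span))
```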

We'll see how to add these tags to our spans below. For now, let's go back to the main page and use tags to search.

Now enter http.method=get in the Tags field and click the find button again.

You'll see a list of traces. Most of the operations in the Jaeger UI are GETs, which makes sense.

That's the basics of the Jaeger interface. Let's connect a service.

Tracing a Service

The Jaeger tutorial application contains a create-read-update-delete (CRUD) API for managing employee records. The records are stored in a local hashmap. We're going to add a trace with two spans to the application.

Creating a Tracer

To add tracing to an application, you need a Tracer. We'll create one and use Spring to supply it to the microservice's service and controller classes.

Here's the method for creating the tracer:

@Bean
public static JaegerTracer getTracer() {
    // A "const" sampler with a parameter of 1 samples every trace
    Configuration.SamplerConfiguration samplerConfig = Configuration.SamplerConfiguration.fromEnv().withType("const").withParam(1);
    // Log each span as it is reported
    Configuration.ReporterConfiguration reporterConfig = Configuration.ReporterConfiguration.fromEnv().withLogSpans(true);
    // The string passed here is the service name that appears in the Jaeger UI
    Configuration config = new Configuration("jaeger tutorial").withSampler(samplerConfig).withReporter(reporterConfig);
    return config.getTracer();
}

The first step is constructing configuration classes. You use them to create the Tracer. Jaeger has an extensive set of tools for configuration. We're reading any settings from the environment, using a constant sampler that samples every trace, logging spans as they're reported, and naming our tracer jaeger tutorial.

This method is in the class with the service's main method. We're treating it like a Spring Bean and injecting it into the constructors of the controller and service classes. If you don't understand Spring dependency injection, you can assume that the controller and service methods have access to a tracer.

You can learn more about Jaeger configuration here and here.

Tracing a REST API Call

Let's start with adding a single span to a POST method. Here's our code for adding a new employee.

@ApiOperation(value = "Create Employee ", response = ResponseEntity.class)
@RequestMapping(value = "/api/tutorial/1.0/employees", method = RequestMethod.POST)
public ResponseEntity createEmployee(@RequestBody Employee employee) {

    // Create a span
    Span span = tracer.buildSpan("create employee").start();
        
    HttpStatus status = HttpStatus.FORBIDDEN;

    log.info("Receive Request to add employee {}", employee);
    if (employeeService.addEmployee(employee)) {
        status = HttpStatus.CREATED;
            
        // Set http status code
        span.setTag("http.status_code", 201);
    } else {
        span.setTag("http.status_code", 403);
    }
        
    // Close the span
    span.finish();
    return new ResponseEntity(null, status);
}

We create a Span at the start of the method, using our Tracer instance. Then we set a tag corresponding to the HTTP status code of the request. This should make our trace look a lot like the Jaeger query service. The service has a Swagger interface, so we can use it to add an employee.

Fill out details for an employee and then click the Try it out! button twice. The first request will succeed. The second will fail because the service will not accept a new employee with an existing ID.

Now, take a look at the Jaeger search page. Select jaeger tutorial in the service selector and create employee in the operation selector and click the find button.

We see two traces, but we know one failed and one succeeded. Let's refine the search. Enter http.status_code=403 in the Tags text box.

Now, click the find button again. You'll see only one trace. Tags are useful for filtering traces and looking at specific criteria.

Multiple Spans and Log Messages

Let's finish up by adding a second span to a trace, along with log messages.

Here is the controller's delete method:

@ApiOperation(value = "Delete Employee ", response = ResponseEntity.class)
@RequestMapping(value = "/api/tutorial/1.0/employees/{id}", method = RequestMethod.DELETE)
public ResponseEntity deleteEmployee(@PathVariable("id") String idString) {

    Span span = tracer.buildSpan("delete employee").start();

    HttpStatus status = HttpStatus.NO_CONTENT;

    try {
        int id = Integer.parseInt(idString);
        log.info("Received Request to delete employee {}", id);
        span.log(ImmutableMap.of("event", "delete-request", "value", idString));
        if (employeeService.deleteEmployee(id, span)) {
            span.log(ImmutableMap.of("event", "delete-success", "value", idString));
            span.setTag("http.status_code", 200);
            status = HttpStatus.OK;
        } else {
            span.log(ImmutableMap.of("event", "delete-fail", "value", "does not exist"));
            span.setTag("http.status_code", 204);
        }
    } catch (NumberFormatException | NoSuchElementException nfe) {
        span.log(ImmutableMap.of("event", "delete-fail", "value", idString));
        span.setTag("http.status_code", 204);
    }

    span.finish();
    return new ResponseEntity(null, status);
}

Like the add method, we're opening a span at the start of the method. We're also setting the status code tag based on the result of the delete request. Also, the code has log messages based on the outcome of the query.

We're also passing our Span object to the service. Let's look at why. Here is the delete method in the service:

public boolean deleteEmployee(int id, Span rootSpan) {

    Span span = tracer.buildSpan("service delete employee").asChildOf(rootSpan).start();

    boolean result = false;
    if (employeeMap.containsKey(id)) {
        employeeMap.remove(id);
        result = true;
    }
    span.finish();
    return result;
}

We're creating a new span inside the method, setting it as a child of the span that was passed in.

Run the service and try to delete a valid employee and then an invalid one.

Now, select delete employee in the operation control and click the find button.

You should see two operations with two spans each.

Inspect each trace, and you'll see a few new things. Here's the successful trace, with both spans displayed:

You can see the second span and the log messages. The failed delete has different log messages.
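The parent and child relationship also shows up if you export the trace as JSON. The Python 3 sketch below reads a hypothetical, trimmed export. The span IDs are made up, and the shape of each entry in references (refType, traceID, spanID) is an assumption about the export format, so check it against your own output.

```python
import json

# Hypothetical, trimmed export. The IDs are made up, and the layout of the
# "references" entries is an assumption about Jaeger's JSON output.
trace_json = """
{"spans": [
  {"spanID": "a1", "operationName": "delete employee", "references": []},
  {"spanID": "b2", "operationName": "service delete employee",
   "references": [{"refType": "CHILD_OF", "traceID": "t1", "spanID": "a1"}]}
]}
"""


def parent_of(span):
    """Return the span ID of the parent, or None for a root span."""
    for ref in span["references"]:
        if ref["refType"] == "CHILD_OF":
            return ref["spanID"]
    return None


spans = json.loads(trace_json)["spans"]
names = {span["spanID"]: span["operationName"] for span in spans}
for span in spans:
    print(span["operationName"], "<-", names.get(parent_of(span), "root"))
```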

So, with a few lines of code, we can see how long operations take and get an idea of why!

Jaeger Tracing for Microservices

Jaeger tracing is an open-source implementation of the OpenTracing standard. In just a few minutes we installed the system and used it to trace a REST microservice. Tracing is an essential strategy for managing your services and monitoring your users' experience, so enjoy the fruits of this new knowledge!

Java Stack Trace: Understanding It and Using It to Debug

Eric Goebelbecker — Mon, 29 Jul 2019 20:02:36 +0000

Deploying your Java code to production limits your troubleshooting options. Connecting to your app in production with a debugger is usually out of the question, and you might not even be able to get console access. So even with monitoring, you’re going to end up troubleshooting many problems post-mortem. This means looking at logs and, if you’re lucky, working with a Java stack trace.

That’s right, I said you’re lucky if you have a stack trace. It’s like getting a compass, a map, and a first-class airplane ticket handed to you all at once! Let’s talk about what a Java stack trace is and how you can use it.

What's a Java Stack Trace?

A stack trace, also called a stack backtrace or even just a backtrace, is a list of stack frames. These frames represent a moment during an application’s execution. A stack frame is information about a method or function that your code called. So the Java stack trace is a list of frames that starts at the current method and extends to when the program started.

Sometimes there’s confusion between a stack and the Stack. A stack is a data structure that acts as a stack of papers on your desk: it’s first-in-last-out. You add documents to the pile and take them off in the reverse order you put them there. The Stack, more accurately called the runtime or call stack, is a set of stack frames a program creates as it executes, organized in a stack data structure.
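The first-in-last-out behavior is easy to see in a few lines of code. This is a language-neutral sketch written in Python, with a plain list standing in for the call stack and made-up method names standing in for frames.

```python
call_stack = []

# Calling a method pushes a frame on top of the stack.
for method in ["main", "a", "b", "c", "d"]:
    call_stack.append(method)

# A stack trace lists the newest frame first.
print(list(reversed(call_stack)))

# Returning from a method pops its frame, so frames leave in reverse order.
returned = []
while call_stack:
    returned.append(call_stack.pop())
print(returned)
```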

Let’s look at an example.

Java Stack Trace Example

Let’s take a look at a Java program. This class calls four methods and prints a stack trace to the console from the last one.

public class StackTrace {

  public static void main(String[] args) {
    a();
  }

  static void a() {
    b();
  }

  static void b() {
    c();
  }

  static void c() {
    d();
  }

  static void d() {
    Thread.dumpStack();
  }
}

When you run the class, you’ll see something like this:

java.lang.Exception: Stack trace
at java.base/java.lang.Thread.dumpStack(Thread.java:1383)
at com.ericgoebelbecker.stacktraces.StackTrace.d(StackTrace.java:23)
at com.ericgoebelbecker.stacktraces.StackTrace.c(StackTrace.java:19)
at com.ericgoebelbecker.stacktraces.StackTrace.b(StackTrace.java:15)
at com.ericgoebelbecker.stacktraces.StackTrace.a(StackTrace.java:11)
at com.ericgoebelbecker.stacktraces.StackTrace.main(StackTrace.java:7)

The d() method is near the top of the stack, just below dumpStack(), because that’s where the app generated the trace. The main() method is at the bottom because that’s where the program started. When the program started, the Java runtime executed the main() method. Then main() called a(), a() called b(), and b() called c(), which called d(). Finally, d() called dumpStack(), which generated the output. Read from bottom to top, this Java stack trace gives us a picture of what the program did, in the order that it did it.

A Java stack trace is a snapshot of a moment in time. You can see where your application was and how it got there. That’s valuable insight that you can use in a few different ways.

How to Use Java Stack Traces

Now that you’ve seen what Java stack traces show you, how can you use them?

Java Exceptions

Stack traces and exceptions are often associated with each other. When you see a Java application throw an exception, you usually see a stack trace logged with it. This is because of how exceptions work.

When Java code throws an exception, the runtime looks up the stack for a method that has a handler that can process it. If it finds one, it passes the exception to it. If it doesn’t, the program exits. So exceptions and the call stack are linked directly. Understanding this relationship will help you figure out why your code threw an exception.

Let’s change our sample code.

First, modify the d() method:

static void d() {
  throw new NullPointerException("Oops!");
}

Then, change main() and a() so main can catch an exception. You'll need to add a checked exception to a() so the code will compile. InvalidClassException lives in java.io, so import it too.

public static void main(String[] args) 
{
  try {
    a();
  } catch (InvalidClassException ice) {
    System.err.println(ice.getMessage());
  }
}

static void a() throws InvalidClassException 
{
  b();
}

You’re deliberately catching the “wrong” exception. Run this code and watch what happens.

Exception in thread "main" java.lang.NullPointerException: Oops!
at com.ericgoebelbecker.stacktraces.StackTrace.d(StackTrace.java:29)
at com.ericgoebelbecker.stacktraces.StackTrace.c(StackTrace.java:24)
at com.ericgoebelbecker.stacktraces.StackTrace.b(StackTrace.java:20)
at com.ericgoebelbecker.stacktraces.StackTrace.a(StackTrace.java:16)
at com.ericgoebelbecker.stacktraces.StackTrace.main(StackTrace.java:9)

The exception bubbled up the stack past main() because you were trying to catch a different exception. Nothing handled it, so the runtime printed the stack trace and terminated the application. That trace makes it easy to determine what happened.
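You can also hook into that last step yourself. The sketch below is one way to do it, assuming a hypothetical class named UncaughtHandlerSketch and a helper called runAndCapture() that are not part of the sample code. It runs a task on a separate thread and uses the JDK's Thread.setUncaughtExceptionHandler() to grab whatever exception escapes, stack trace included.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical class for illustration; not part of the article's sample project.
class UncaughtHandlerSketch {

    // Runs the task on a new thread and returns the exception that escaped it, or null.
    static Throwable runAndCapture(Runnable task) {
        AtomicReference<Throwable> escaped = new AtomicReference<>();
        Thread worker = new Thread(task, "worker");
        // The runtime calls this handler when an exception leaves run() uncaught.
        worker.setUncaughtExceptionHandler((thread, error) -> escaped.set(error));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to callers
        }
        return escaped.get();
    }

    public static void main(String[] args) {
        Throwable error = runAndCapture(() -> {
            throw new NullPointerException("Oops!");
        });
        if (error != null) {
            error.printStackTrace();
        }
    }
}
```

A handler like this is a reasonable place to send the trace to your logger instead of relying on whatever the runtime prints by default.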

Now, change main() to catch a NullPointerException instead. You can remove the checked exception from a() too.

public static void main(String[] args) {
  try {
    a();
  } catch (NullPointerException npe) {
    System.err.println(npe.getMessage());
  }
}

static void a() {
  b();
}

Rerun the program.

Oops!

We lost the stack trace! By only printing the message attached to the exception, you missed some vital context. Unless you can remember why you wrote Oops! in that message, tracking down this problem is going to be complicated. Let’s try again.

public static void main(String[] args) {
  try {
    a();
  } catch (NullPointerException npe) {
    npe.printStackTrace();
  }
}

Rerun the application.

java.lang.NullPointerException: Oops!
at com.ericgoebelbecker.stacktraces.StackTrace.d(StackTrace.java:28)
at com.ericgoebelbecker.stacktraces.StackTrace.c(StackTrace.java:24)
at com.ericgoebelbecker.stacktraces.StackTrace.b(StackTrace.java:20)
at com.ericgoebelbecker.stacktraces.StackTrace.a(StackTrace.java:16)
at com.ericgoebelbecker.stacktraces.StackTrace.main(StackTrace.java:9)

That’s better! We see the stack trace, and its top frame is d(), where the exception occurred, even though main() printed it.
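The trace points at d() no matter who prints it because Java records the stack when the exception object is created, not when it's caught. Here's a small sketch of that idea. The class name ThrowSiteSketch and the topFrame() helper are hypothetical, invented for this illustration.

```java
// Hypothetical class for illustration; not part of the article's sample project.
class ThrowSiteSketch {

    static void d() {
        throw new NullPointerException("Oops!");
    }

    static void c() {
        d();
    }

    // Catches the exception two calls away from where it was created
    // and returns the top frame of its recorded stack trace.
    static StackTraceElement topFrame() {
        try {
            c();
            return null; // not reached: d() always throws
        } catch (NullPointerException npe) {
            return npe.getStackTrace()[0];
        }
    }

    public static void main(String[] args) {
        StackTraceElement top = topFrame();
        // The method name is d, even though topFrame() did the catching.
        System.out.println(top.getMethodName() + " at line " + top.getLineNumber());
    }
}
```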

Logging Java Stack Traces

What if you don’t want to print an error message to the console but to a log file instead? The good news is that most loggers, including Log4j and Logback, will write exceptions with stack traces if you call them with the right arguments.

Pass in the exception object as the last argument to the logging call, without a formatting directive for it in the message. So if you used Log4j or Logback with the sample code like this:

logger.error("Something bad happened:", npe);

You would see this in your log file:

Something bad happened:
java.lang.NullPointerException: Oops!
at com.ericgoebelbecker.stacktraces.StackTrace.d(StackTrace.java:28)
at com.ericgoebelbecker.stacktraces.StackTrace.c(StackTrace.java:24)
at com.ericgoebelbecker.stacktraces.StackTrace.b(StackTrace.java:20)
at com.ericgoebelbecker.stacktraces.StackTrace.a(StackTrace.java:16)
at com.ericgoebelbecker.stacktraces.StackTrace.main(StackTrace.java:9)

One of the best things you can do with exceptions and stack traces is to log them so you can use them to isolate a problem. If you get in the habit of writing useful log messages that include details like stack traces, then log indexing and search tools like Scalyr become some of the most powerful tools in your troubleshooting tool bag.
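If you want to try this without adding a dependency, the same pattern works with the JDK's built-in java.util.logging package. To be clear, this sketch swaps in java.util.logging for Log4j and Logback so it runs on a bare JDK. The class name JulLoggingSketch and the logger name "stacktraces.sketch" are assumptions made up for the example.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical class for illustration; uses java.util.logging instead of Log4j or Logback.
class JulLoggingSketch {

    // The logger name is an arbitrary choice for this sketch.
    static final Logger LOGGER = Logger.getLogger("stacktraces.sketch");

    static void d() {
        throw new NullPointerException("Oops!");
    }

    static void runAndLog() {
        try {
            d();
        } catch (NullPointerException npe) {
            // Pass the exception itself, not just its message,
            // so the log handler can write the full stack trace.
            LOGGER.log(Level.SEVERE, "Something bad happened:", npe);
        }
    }

    public static void main(String[] args) {
        runAndLog();
    }
}
```

With the JDK's default configuration, this writes the message followed by the exception and its stack trace to the console. Pointing it at a file is a matter of logging configuration rather than code.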

The Java Debugger

Debuggers work by taking control of a program's runtime and letting you both observe and control it. To do this, a debugger shows you the program stack and lets you move up and down it. When you’re in a debugger, you get a more complete picture of a stack frame than you do when looking at stack traces in a log message.

Let’s make a small code change and then throw the sample code into a debugger.

First, add a local variable to the d() method:

static void d() {
  String message = "Oops.";
  throw new NullPointerException(message);
}

Then add a breakpoint where d() throws the exception in your debugger. I’m using IntelliJ's debugger for this image.

Here you can see that the string we added to d() is part of the stack frame because it’s a local variable. Debuggers operate inside the stack and give you a detailed picture of each frame.

Forcing a Thread Dump

Thread dumps are great post-mortem tools, but they can be useful for runtime issues too. If your application stops responding or is consuming more CPU or memory than you expect, you can retrieve information about the running app with jstack.

Modify main() so the application will run until killed:

public static void main(String[] args) throws Exception {
  try {
    while (true) {
      Thread.sleep(1000);
    }
  } catch (NullPointerException npe) {
    npe.printStackTrace();
  }
}

Run the app, determine its pid (the JDK's jps tool lists running Java processes), and then run jstack. On Windows, you can also press Ctrl+Break in the console window where the app is running to print a thread dump.

$ jstack <pid>

jstack will generate a lot of output.

2019-05-13 10:06:17
Full thread dump OpenJDK 64-Bit Server VM (12+33 mixed mode, sharing):

Threads class SMR info:
_java_thread_list=0x00007f8bb2727190, length=10, elements={
0x00007f8bb3807000, 0x00007f8bb2875000, 0x00007f8bb2878000, 0x00007f8bb4000800,
0x00007f8bb300a800, 0x00007f8bb287b800, 0x00007f8bb287f000, 0x00007f8bb28ff800,
0x00007f8bb300b800, 0x00007f8bb3805000
}

"main" #1 prio=5 os_prio=31 cpu=60.42ms elapsed=103.32s tid=0x00007f8bb3807000 nid=0x2503 waiting on condition  [0x0000700001a0e000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(java.base@12/Native Method)
    at com.ericgoebelbecker.stacktraces.StackTrace.main(StackTrace.java:9)

"Reference Handler" #2 daemon prio=10 os_prio=31 cpu=0.08ms elapsed=103.29s tid=0x00007f8bb2875000 nid=0x4603 waiting on condition  [0x0000700002123000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.ref.Reference.waitForReferencePendingList(java.base@12/Native Method)
    at java.lang.ref.Reference.processPendingReferences(java.base@12/Reference.java:241)
    at java.lang.ref.Reference$ReferenceHandler.run(java.base@12/Reference.java:213)

"Finalizer" #3 daemon prio=8 os_prio=31 cpu=0.13ms elapsed=103.29s tid=0x00007f8bb2878000 nid=0x3903 in Object.wait()  [0x0000700002226000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(java.base@12/Native Method)
    - waiting on <0x000000070ff02770> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(java.base@12/ReferenceQueue.java:155)
    - locked <0x000000070ff02770> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(java.base@12/ReferenceQueue.java:176)
    at java.lang.ref.Finalizer$FinalizerThread.run(java.base@12/Finalizer.java:170)

"Signal Dispatcher" #4 daemon prio=9 os_prio=31 cpu=0.27ms elapsed=103.28s tid=0x00007f8bb4000800 nid=0x3e03 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" #5 daemon prio=9 os_prio=31 cpu=6.12ms elapsed=103.28s tid=0x00007f8bb300a800 nid=0x5603 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   No compile task

"C1 CompilerThread0" #7 daemon prio=9 os_prio=31 cpu=12.01ms elapsed=103.28s tid=0x00007f8bb287b800 nid=0xa803 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   No compile task

"Sweeper thread" #8 daemon prio=9 os_prio=31 cpu=0.73ms elapsed=103.28s tid=0x00007f8bb287f000 nid=0xa603 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Service Thread" #9 daemon prio=9 os_prio=31 cpu=0.04ms elapsed=103.27s tid=0x00007f8bb28ff800 nid=0xa503 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Common-Cleaner" #10 daemon prio=8 os_prio=31 cpu=0.27ms elapsed=103.27s tid=0x00007f8bb300b800 nid=0xa303 in Object.wait()  [0x000070000293b000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(java.base@12/Native Method)
    - waiting on <0x000000070ff91690> (a java.lang.ref.ReferenceQueue$Lock)
    at java.lang.ref.ReferenceQueue.remove(java.base@12/ReferenceQueue.java:155)
    - locked <0x000000070ff91690> (a java.lang.ref.ReferenceQueue$Lock)
    at jdk.internal.ref.CleanerImpl.run(java.base@12/CleanerImpl.java:148)
    at java.lang.Thread.run(java.base@12/Thread.java:835)
    at jdk.internal.misc.InnocuousThread.run(java.base@12/InnocuousThread.java:134)

"Attach Listener" #11 daemon prio=9 os_prio=31 cpu=0.72ms elapsed=0.10s tid=0x00007f8bb3805000 nid=0x5e03 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"VM Thread" os_prio=31 cpu=3.83ms elapsed=103.29s tid=0x00007f8bb2874800 nid=0x3703 runnable

"GC Thread#0" os_prio=31 cpu=0.13ms elapsed=103.31s tid=0x00007f8bb282b800 nid=0x3003 runnable

"G1 Main Marker" os_prio=31 cpu=0.26ms elapsed=103.31s tid=0x00007f8bb2845000 nid=0x3103 runnable

"G1 Conc#0" os_prio=31 cpu=0.04ms elapsed=103.31s tid=0x00007f8bb3810000 nid=0x3303 runnable

"G1 Refine#0" os_prio=31 cpu=0.39ms elapsed=103.31s tid=0x00007f8bb2871000 nid=0x3403 runnable

"G1 Young RemSet Sampling" os_prio=31 cpu=13.60ms elapsed=103.31s tid=0x00007f8bb2872000 nid=0x4d03 runnable
"VM Periodic Task Thread" os_prio=31 cpu=66.44ms elapsed=103.27s tid=0x00007f8bb2900800 nid=0xa403 waiting on condition

JNI global refs: 5, weak refs: 0

My application was running ten Java threads plus a handful of internal JVM threads, and jstack reported on all of them. The first thread, helpfully named main, is the one we're concerned with. You can see it in the TIMED_WAITING state, sleeping inside Thread.sleep().
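You can get a rough equivalent from inside the application too. This sketch assumes a hypothetical class named InProcessThreadDump and uses the JDK's Thread.getAllStackTraces() to build a listing of every live Java thread. It won't include the JVM's internal threads or the lock details that jstack shows, so treat it as a lightweight supplement rather than a replacement.

```java
import java.util.Map;

// Hypothetical class for illustration; not part of the article's sample project.
class InProcessThreadDump {

    // Builds a jstack-like listing of the live Java threads in this process.
    static String dump() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append("\" ")
               .append(thread.isDaemon() ? "daemon " : "")
               .append("state=").append(thread.getState())
               .append('\n');
            // Frames are listed most recent call first, like a printed stack trace.
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Calling dump() from a diagnostic endpoint or a signal handler gives you thread names, states, and stacks without needing shell access to the machine.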

Java Stack Traces: Your Roadmap

A stack trace is more than just a picture inside your application. It's a snapshot of a moment in time that includes every step your code took to get there. There's no reason to dread seeing one in your logs because they're a gift from Java that tells you exactly what happened. Make sure you're logging them when an error crops up and send them to a tool like Scalyr so they're easy to find.

Now that you understand what a Java stack trace is and how to use it, take a look at your code. Are you throwing away critical information about errors and exceptions in your code? Is there a spot where a call to Thread.dumpStack() might help you isolate a recurring bug? Perhaps it's time to run your app through the debugger a few times with some strategically chosen breakpoints.