Forem: #benaryorg

Handling your personal data online.

#benaryorg — Sun, 10 Sep 2017 01:01:01 +0000

How are you dealing with your online identity?
I tend to keep my private life and my online persona somewhat separated.
This includes but is not limited to:

keeping my real name offline
not posting pictures of me online

How do you handle that?
Have you encountered any problems regarding that?
Do you have any recommendations?

I'm an ops person. Ask me anything!

#benaryorg — Sun, 10 Sep 2017 00:55:59 +0000

I'm an ops person a year into business, ask me anything.
Ask me technical questions, ask me to design a theoretical system, ask me things about my career.…

This will be my permanent AMA for anything that's technical.

Keeping Track of your Skills

#benaryorg — Sun, 10 Sep 2017 00:47:36 +0000

Hi, I'm an ops person.
I've been a developer.
Done distributed systems and System Engineering.
I did things with Haskell, Rust, C, Perl, Shell (lots of shell), GNU/Linux, {Free,Open}BSD, TLS, x509 and whatnot.

It's been some time since I started and sometimes I lose track of what I – hey, there's Python, JS, HTML/CSS missing in that list above – actually learned over the years.
There are still moments when I notice that I actually know what TIME_WAIT on Linux is and why it's there, and also what the difference between an abstract class and an interface is in Java.

So now my final question: do you keep track of all that, and if yes, how?

Note: I know that there are certain advantages to not keeping track (e.g. not accidentally claiming to know tech when your knowledge is hopelessly out of date).

Lines of Code don't matter.

#benaryorg — Sat, 02 Sep 2017 00:52:08 +0000

We all learned long ago that LOC (Lines Of Code) are a terrible unit of measurement.
Well, at least most of us learned that.

When I sat down this Friday to work on some internal magic to get some text from your console to a dashboard (easier said than done; I've found CouchDB to be the tool of choice), I ended up doubting my productivity.

The Result

At the end I had two things:

three hours in our time tracking, might have been quite a bit more though
104 (one hundred and four) lines of beautiful shell code in our GitLab

That's even less than a line of code per minute.
Hell, that even contained a block of code that's been duplicated five times with a different variable name
(it's a non-trivial case to DRY that part).

Granted, we did port the whole thing from a terrible vim file;pandoc file | curl home-grown-nodejs-daemon to a cleaner database solution with revisions and stuff, but the discussion part was just about an hour or so.

The Script

So what does the script do?
Basically it

uses getopts to get some variables filled
reads missing variables from stdin if the tty is interactive
fails if mandatory variables are missing
downloads the current document
merges the current document with the new entry
pushes that back to CouchDB wrapped with the correct revision
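
The first three steps could look roughly like the sketch below. This is not the script from the post (which isn't shown here), just a minimal bash illustration; the option letters, the variable names title and message, and the helpers parse_args and require_var are all made up for the example.

```shell
#!/usr/bin/env bash
# Sketch only: option letters, variable names and helper names are hypothetical.

# Fill variables from command line options using the getopts builtin.
parse_args() {
    local opt OPTIND=1
    while getopts 't:m:' opt; do
        case $opt in
            t) title=$OPTARG ;;
            m) message=$OPTARG ;;
            *) return 2 ;;
        esac
    done
}

# Make sure the variable named by $1 has a value: ask for it when stdin is a
# terminal, fail otherwise.
require_var() {
    local _name=$1
    if [ -z "${!_name:-}" ]; then
        if [ -t 0 ]; then
            read -r -p "$_name: " "$_name"
        else
            echo "missing mandatory value: $_name" >&2
            return 1
        fi
    fi
    # Still empty (e.g. the user just hit enter)? Then fail as well.
    [ -n "${!_name:-}" ]
}
```

A caller would run parse_args "$@" and then something like require_var title || exit 1 for every mandatory value.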

Seems like an easy task, right?

Why Lines of Code are a bad measurement.

I've put a lot of effort into making the script as robust as possible.
If at some point you enter something like a literal my"name\":{}\x123 it will be stored the very same way in the database.
Everyone who has ever dealt with shell scripts will know that it's a hell of an effort to not fail at this.
There is your shell which has escaping.
There is the JSON merging which needs the string input to be escaped.
There is the curl which could possibly fail.
There is so much that could go wrong.

It took five lines of (maybe too) tightly packed shell that use a variable, read it if not set, but only if the tty is interactive, fail otherwise, escape it (properly, not only a simple s/"/\"/g) and store it in a new read-only variable.
This works for all inputs, including special characters that need special escaping in JSON (think: "binary" characters, multibyte, hell even emoji).

That's five lines.
You'd have trouble fitting that into such a tightly packed piece of code even in programming languages that don't need super special escaping.
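
To give an idea of what "properly" means for the escaping step, here is a sketch in plain bash. It is not the five lines mentioned above (it is quite a bit longer), and the function name json_escape is an assumption for illustration; in practice a JSON tool would usually do this job. It escapes the quote, the backslash and all control characters, and passes everything else (multibyte characters, emoji) through untouched, which JSON allows.

```shell
#!/usr/bin/env bash
# Sketch only: json_escape is a hypothetical helper, not code from the post.
# Prints its first argument as a JSON string literal, including the quotes.
# Note: bash variables cannot hold NUL bytes, so those never show up here.
json_escape() {
    local s=$1 out='' c i
    for ((i = 0; i < ${#s}; i++)); do
        c=${s:i:1}
        case $c in
            '"')   out+='\"' ;;
            \\)    out+='\\' ;;
            $'\n') out+='\n' ;;
            $'\t') out+='\t' ;;
            $'\r') out+='\r' ;;
            [[:cntrl:]])
                # Any other control character becomes a \u00XX escape.
                printf -v c '\\u%04x' "'$c"
                out+=$c ;;
            *)     out+=$c ;;
        esac
    done
    printf '"%s"\n' "$out"
}
```

Storing the result in a read-only variable could then be done with name_json=$(json_escape "$name") followed by readonly name_json (both variable names are again just examples).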

There is no meaning to the number of lines of code, because it's an artificial number that can be changed at will (blank lines, moving lines together, comments, …).

But further, there are tasks which seem thoroughly trivial but end up being a lot of work. Sometimes they even turn out to actually be plain simple, but that might not be obvious at first. There often is an elegant solution to a simple problem that is so elegant and plain that you simply don't see it.

How To Build A RegEx

#benaryorg — Sun, 16 Jul 2017 14:51:39 +0000

Updates

I might at some point update this post, but over at my own blog.

How to build a RegEx

I see people abusing regexes just about every day.
If you're really good at regexes then you will certainly feel some sort of pain
as soon as you see .* or similar constructs in inappropriate places.

So here is my guide to doing it the right way.

Notes

I'm going to use PCRE all over the place.
My preferred syntax is m{something} and s{a}{b}g so I'll stick with those.

You can try all examples using:

# for searches, just start typing, quit using ^D
perl -ne 'm{a(.)c} && print "$1\n"'
# for replacements
perl -pe 's{a.c}{abc}g'

Why not use `.*`?

You're looking for a needle in a haystack.
A practical example: you look for your favourite plushie in your room.

A nice regex for that would be m{\bplushie\b}.
It looks for your plushie as one word, meaning that it will match only if on
each side the word ends.
See also word boundaries.

What I see people do, which is completely ridiculous, is
m{^.*\bplushie\b.*$}.

Let me explain:

They match the beginning of the line even though they don't need it.
They match the end of the line.
They let their parser go through all of the characters, even though they already found what they were looking for.

If you look for an occurrence that does not need to be anywhere specific,
then why are you looking at everything else?
You don't walk into your room and start looking from one side sequentially to
the other.
What you do is look in your room, see the plushie sitting on the bed and take
it.

Complex Examples

Let's choose fail2ban as an example.
We want to block every IP sending more than 1000 HTTP requests per five
seconds.
I'll ignore how to configure fail2ban as it's not relevant to the regex
thing.

First try, without even looking at the format of the logs: m{^.*<HOST>.*$}

You probably messed up very hard here.
To explain why you've messed up so hard, let's look at one line of logs:

10.0.0.1 - - [16/Jul/2017:15:38:54 +0200] "GET /robots.txt HTTP/1.0" 404 319 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)" "-"

I took that line out of my server, the 10.0.0.1 is the part that we want
(please ignore that it's an internal IP).

m{.*} does greedy matching, so the above regex will be very easy to break.
Just put an IP in the User Agent.
The User Agent contains spaces as you might have noticed, which is fine as the
string is quoted (and nginx does some escaping on the contained string).
Now what'll happen if I put an IP address in the User Agent?
Right, due to greedy matching the rightmost <HOST> will match.
This will of course result in some serious problems.

As a malicious hacker I could:

Circumvent fail2ban altogether by putting 127.0.0.1 into my User Agent. That will effectively turn off fail2ban for my requests as long as 127.0.0.1 is whitelisted. If that IP is not whitelisted, you've got far worse problems (assuming fail2ban uses iptables, this will break at least half of your server's software; think of accessing MySQL on localhost).
Block arbitrary IPs from accessing your website in much the same way.

So how are we going to construct a Regex that will match only what we're
looking for here?

Well, the IP is right at the start, so we'll take m{^<HOST>.*$} right?

Technically right, but I wouldn't write it that way.
That regex would again match everything after the IP, but we really don't care
about that.

What we should do is a simple m{^<HOST>}.
This works as intended and it only looks at the first few characters.
If you want to make sure that it's followed by a space, go ahead, please.

So we end up with a fool-proof regex: m{^<HOST>\s}.
This will for sure fulfill all our needs.

To be honest, this example is kind of easy as the IP is right at the start.
Let's assume some other format so we can work out a more general way for this
to work.

Copy, Paste, Replace

We are going to perform CPR on a line of text.
As said above, this time we want to get something that is not right at the
start.
We'll look at the following line:

[16/Jul/2017:16:27:41 +0200] openbsd.cloud.bsocat.net - - 66.133.109.36 "GET /.well-known/acme-challenge/tTRnUGY9gZEVz2llGWqn1m3mHznMDOFH3zCXsgelh7w HTTP/1.1" 200 87

This is a slightly modified OpenBSD httpd log-format.
By slightly I mean:

date and time moved to the front
host moved a bit backwards

Now let's do this:

# copy the line, verbatim
[16/Jul/2017:16:27:41 +0200] openbsd.cloud.bsocat.net - - 66.133.109.36 "GET /.well-known/acme-challenge/tTRnUGY9gZEVz2llGWqn1m3mHznMDOFH3zCXsgelh7w HTTP/1.1" 200 87

# remove everything after the needle (except for the delimiter), we don't need it
[16/Jul/2017:16:27:41 +0200] openbsd.cloud.bsocat.net - - 66.133.109.36\s

# add needed meta-characters (start of line)
^[16/Jul/2017:16:27:41 +0200] openbsd.cloud.bsocat.net - - 66.133.109.36\s

# escape all the characters that need escaping
^\[16/Jul/2017:16:27:41 \+0200\] openbsd\.cloud\.bsocat\.net - - 66\.133\.109\.36\s

# replace the host
^\[16/Jul/2017:16:27:41 \+0200\] openbsd\.cloud\.bsocat\.net - - <HOST>\s

# replace everything that is not static with its possible values
# this requires a lot of in-depth knowledge about the log format
# let's do this the easy way and just replace using dots for the date
^\[../.../....:..:..:.. .....\] openbsd\.cloud\.bsocat\.net - - <HOST>\s

# for the other fields we just specify them to "not contain spaces" as these
# are used for delimiting, so they will not occur in the fields
^\[../.../....:..:..:.. .....\] [^\s]+ [^\s]+ [^\s]+ <HOST>\s

# fail2ban needs spaces escaped I think
^\[../.../....:..:..:..\s.....\]\s[^\s]+\s[^\s]+\s[^\s]+\s<HOST>\s

How do we check?

If it's fail2ban, just run that.
For everything else:

perl -ne 'm{^\[../.../....:..:..:..\s.....\]\s[^\s]+\s[^\s]+\s[^\s]+\s([^\s]+)\s} && print "$1\n"'

This should output only the IP we were looking for.

My Way of (Purely Functional) Programming

#benaryorg — Wed, 15 Mar 2017 18:27:49 +0000