Forem: David Alexander

Nix Is Worth the Complexity

David Alexander — Sun, 04 Jul 2021 05:11:27 +0000

Recently I've gotten fed up with the breaking changes in the Homebrew package manager. After some research, Nixpkgs seemed like a far more stable option for GNU/Linux tooling on macOS, albeit with a decent learning curve for configuration.

Without going too much further into it, Nix is pretty cool.

Over the following months, I spent what free time I had tinkering with Nix on macOS, specifically with Home Manager and nix-darwin. Nix is cross-platform between Linux and macOS and, frankly, I had found myself maintaining an increasing number of shell scripts for installing the important tools I use. I got really good at writing bash scripts. 🤣

Bash scripts are really handy, but there are limits. There is no clean state with them; there's only whatever you're working with right then. You try your best to make them idempotent, but there's no reasonable way to test that they meet that expectation, since a script can only be reliably tested from a clean state once. On the other hand, Nix builds in a clean room every time.

Don't pollute the global state

Nix gives me a clean state. It's purpose-built for isolation between programs, allowing me to better follow the adage "don't mutate global state" and sandbox each tool I need. Then I can selectively upgrade and, if the upgrade breaks something, roll back to the previous state easily.

More times than I can count, I've helped coworkers through a borked python setup after the underlying python version got upgraded in place. Thanks, brew upgrade... 😑

Instead of digging through all of the virtualenvs out there, and rummaging through whether pyenv was set up right for that shell, or any number of other issues, why not decouple it?

Nix offers the best of both dynamic and static linking when building an application. It allows for multiple versions of python 3 at the same time. Or Java. Or Haskell. Or Go. Or glibc. Upgrade one library and it doesn't need to update them all. Similarly if 5 applications all use the same library, there's no reason to duplicate it 5 times on the disk.

Keep project-specific tools with the project

I tend to use direnv in my workflow for exactly this purpose: I can keep project-specific settings (e.g., tool selection) specific to that base directory. Unfortunately this is typically limited to versions of known programs (e.g., python 3.7.3 instead of python 3.9.1) and workstation-specific environment variables (e.g., path to secret files).

Introducing Nix changes my workflow. By using nixify (roughly inspired by this bash function) I am able to install and use postgres while limiting it to this one project directory. Maybe I'll (re)use postgres in another project. Do I need it installed globally? Absolutely not. This is a development machine, not an application server.

Current state of integration

I've been working with Home Manager for managing my dotfiles and (user-scoped) system configuration. So far it has been difficult translating certain parts of RCM's framework, such as its overlay approach (having both ~/.dotfiles and ~/.dotfiles-local repos cloned with the latter containing higher priority config files).

Instead of symlinking files into place, thereby ensuring any changes to them in-place are reflected back in the git repo, they're made immutable and the only way to change them is from the git repo.

I've begun ripping out the version managers like pyenv, asdf, chruby, and others to completely replace them with project-specific Nix expressions.

7 principles of a good sysadmin

David Alexander — Thu, 15 Apr 2021 03:55:13 +0000

This seems to be a common topic of conversation, so I figure I should put it on paper (so to speak) what I value as a systems administrator, or "sysadmin."

Keep it simple
Ensure it can be reproduced
Keep it close to stock
Magic is bad
No development tools on the server
Prefer complexity at compile time over runtime
Consume artifacts

What does this mean?

The fewer moving parts, the easier to diagnose

By keeping things simple, reproducible, and close to their defaults, a sysadmin sets themselves up for success when things go wrong. Moreover, by keeping things close to the default settings you maximize the chance that your setup overlaps someone else's. Bonus to finding info on Stack Overflow or that one forum post!

TIP: Script it out

Scripting out everything you do, no matter how small, ensures you can walk away mid-thought and pick up where you left off later.

Keep scripts in some central location to share with your team of sysadmins
Version control systems (VCS) are best for iterating on these scripts
Make the VCS repo private, lest credentials be mistakenly hardcoded

Make your shell script executable documentation. Write it by defining your own shell functions describing each step you're taking.

Know your tools

If you don't understand how a thing works, fix that. Learn about it. Pull back that abstraction layer and look under the hood.

Why is magic bad? When you're troubleshooting some error, how can you logically rule out the tool as a contributing factor?

This principle does not preclude you from using said magical tool, but it does mandate you dispel that magic by working to understand how it is implemented.

Compile time vs. runtime

In a sysadmin context, compile time can mean "the stuff done to configure and set up a service, system, or application before it is immediately needed."

In contrast, runtime means "stuff being done as the application is being put into its 'running' state."

Example: Docker

Some container images contain a shell script as an entrypoint (e.g., entrypoint.sh). These shift some compile time tasks to execute at runtime, then defer to running the underlying application.

Reasons for this design might be:

More in-depth configuration changes required to pivot between environments
Rapid action may be required to change credentials, so they're only passed at runtime

Actions taken in a Dockerfile when docker build is run are considered "compile time."

Actions taken when running docker exec/docker run are considered "runtime."

Complexity costs at compile time need only be paid down once: when things are being set up.

If the complexity sits at runtime, its cost is paid down every time the application is started.

Keep it simple. Pay down the cost as soon as possible.

Artifacts are like gold

Whether the application uses an interpreted language (e.g., python) or is statically compiled (e.g., golang), an artifact can be built to make rolling forward and reverting simple.

In the case of python, a wheel (my_pkg-0.1.0-py3-none-any.whl) is a well-formed package holding the python source code. In a venv, install it with pip install ./my_pkg-0.1.1-py3-none-any.whl to upgrade and pip install ./my_pkg-0.1.0-py3-none-any.whl to roll back.

With golang it's even easier. Just drop the new binary in place and, if it doesn't work, drop the old binary in that spot instead.

Some VCS providers, like GitHub.com, allow uploading binaries and other assets related to a release to a location that can be accessed later, perhaps even by shell scripts.

What is a Symbol?

David Alexander — Mon, 09 Oct 2017 23:02:00 +0000

As I was browsing twitter this evening, I came across a tweet by @searls asking newer rubyists (<5 years) what some still confusing concepts are in Ruby. The most common concept that remained confusing was symbols, so here’s my explanation of it.

In order to explain symbols, I must first explain datatypes.

Datatypes

Do you know the difference between datatypes?

Primitives

There’s a string (e.g., "Lorem ipsum dolor"), which is a sequence of characters (chars). There are various numbers: an integer (e.g., 1 but not 1.4), a float (e.g., 1.4 or 1.0), and a double (a float with twice the precision). Integers can be signed (including negative numbers) or unsigned (zero and positive numbers only). There are super simple concepts like booleans, which can be exactly “yes” or “no”.

Why are there all these variations? Isn’t it all just text? Think about this:

Binary

A computer thinks in binary: ones and zeros. Binary can represent any number. For instance, 000000 represents zero and 000001 represents one, as one might imagine. The tricky part is 000010 represents two. How’s that? Let’s continue.

000011 → three
000100 → four
000101 → five
000110 → six
000111 → seven
001000 → eight

Get the idea? If we continue with this pattern, we’ll see that we eventually hit 111111 (sixty-three), but that’s not the last number that exists, is it? We need another digit to hold a zero or one, and with it we can represent everything up to one-hundred twenty-seven. Wow, that’s quite a jump, right? How does all of this have anything to do with data types?
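As a quick check on the counting above, here is a minimal Python sketch. The helper names to_binary and max_value are made up for illustration; the only real machinery is Python's built-in format with a binary format spec.

```python
def to_binary(number, digits=6):
    """Render a non-negative integer as a zero-padded string of 1's and 0's."""
    return format(number, "0{}b".format(digits))


def max_value(digits):
    """Largest number that fits in the given count of binary digits."""
    return 2 ** digits - 1


# Count from zero to eight, as in the list above.
for n in range(9):
    print(to_binary(n), "->", n)

print(max_value(6))  # 63
print(max_value(7))  # 127
```

Each extra digit roughly doubles how high we can count, which is why going from six digits to seven jumps from 63 to 127.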

Datatypes in binary

A computer has to store a representation of data in memory. But wait, didn’t we just cover that a computer thinks in binary? Well that’s still true. We have to come up with a shorthand for storing things like 1.3 in binary.

Let’s tear this apart. This is a float datatype. We have to come up with some shorthand for noting what is before the decimal point and what is after. How about we say the first four digits are the number before the decimal point and the last four are the number after. We can extend this beyond eight digits later.

In this case the binary representation of 1.3 might be 00010011. What’s the difference between this and the integer 19? Both are represented by the same binary, right? We’ll need to come up with some universal shorthand to note the difference.
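To see the ambiguity for yourself, here is a small Python sketch of that made-up float shorthand. The function name encode_toy_float is hypothetical, and this is not how real floats are stored; it only mirrors the four-digits-before, four-digits-after idea.

```python
def encode_toy_float(whole, fraction):
    """Four digits for the part before the decimal point, four for the part after."""
    return format(whole, "04b") + format(fraction, "04b")


bits = encode_toy_float(1, 3)  # stands in for 1.3
print(bits)                    # 00010011
print(int(bits, 2))            # 19, if the same digits are read as a plain integer
```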

In an integer, we know the same value 010011 (19) might also be positive or negative. This is called a “signed” integer. Other numeric datatypes can be signed as well, but we’ll keep it simple with integers for now. We’ll keep our shorthand of splitting up the digits, but whether it’s positive or negative is just 2 choices. We can probably reserve the first digit for that, where 1 means negative and 0 means positive. Therefore 19 is written 0010011 and -19 is written 1010011.
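Here is a rough Python sketch of that sign-digit shorthand. The names encode_signed and decode_signed are invented for this example, and the six-digit magnitude is an assumption chosen to match the 19 and -19 values above.

```python
def encode_signed(number, magnitude_digits=6):
    """First digit is the sign (1 = negative, 0 = positive), the rest is the size."""
    sign = "1" if number < 0 else "0"
    return sign + format(abs(number), "0{}b".format(magnitude_digits))


def decode_signed(bits):
    """Reverse of encode_signed: peel off the sign digit, then read the number."""
    magnitude = int(bits[1:], 2)
    return -magnitude if bits[0] == "1" else magnitude


print(encode_signed(19))         # 0010011
print(encode_signed(-19))        # 1010011
print(decode_signed("1010011"))  # -19
```

Most real hardware stores negative integers as two's complement rather than with a plain sign digit, but the sign-digit version is easier to read by eye.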

Data headers

We have all these shorthand tricks we have to remember, but at the end of the day it’s just 1’s and 0’s. How will we differentiate between an unsigned 19 and 1.3 from before? Let’s come up with one, universal shorthand to determine which data type we will be storing for the next few digits. Three digits should cover it for now. We’ll remember some datatypes by this mapping for now:

001 → char
010 → unsigned integer
011 → signed integer
100 → float
110 → double
111 → boolean

To integrate this we’ll tack this onto the start of each value and just remember to interpret the first 3 digits as the datatype indicators.

Data maximums/minimums

In order to have all of these shorthands, we have to agree on the next X number of digits which will signify the value of the datatype. A boolean, true or false, only needs one digit (plus the header), while an unsigned integer can only count between 0 and 127 with 7 digits, which might not be enough for fun things like counting the number of seats in a stadium.

We’ll come up with some arbitrary lengths of digits for these datatypes. Here are some examples:

0000000000 → char (10 digits)
0000000000000 → float (13 digits)
000000000 → unsigned integer (9 digits)
0000000000 → signed integer (10 digits)
0 → boolean (1 digit)

All together, a float will take up 13 digits for the value + 3 digits for the header indicating that it’s a float. That's 16 digits. If we take multiple instances of data (datatype header + value in binary), we can string them together.

1000000000010011 → float datatype of value 1.3
1110 → boolean datatype of value false

Let’s take these two example data instances and put them together like a computer might keep it in memory: 11101000000000010011

What?

Okay, let's tear it apart again.

If we saw just this binary without any other context, we can remember our shorthand for the headers by reading left to right. The first 3 digits signify it’s a boolean, which we know has a value length of 1 digit. We read it as boolean false and we’ve parsed the first 4 digits in the example data.

Since we're done with the first 4, we'll start at digit 5, follow the same pattern of interpreting the first 3 digits as the datatype (100 = float), and parse the next X digits (float value → 13 digits) as the value. The next 13 digits amount to 0000000010011. We’ll pretend we already established a shorthand where floats have the last 4 digits reserved for the number after the decimal point, just like before. This makes it easier since we remember this was 1.3.

Doing this again with another example value: 01100000111010010001011100

Reading the first 3 digits (011) we see it’s a signed integer. This means we read the next 10 digits (0000011101) as the value. Knowing our shorthand for signed integers, we recall the first digit of the value (0) tells us whether it’s positive or negative, and the remaining digits (000011101) are the number itself. We can then tell it’s +29.

The next three digits after that are 001, which means a character. You know from before that a character datatype reserves the next 10 digits for its value, but we don’t know how to decode it. We haven’t established that shorthand. The important part here, though, is that we can differentiate between the different datatypes when they’re thrown together in one long, unbroken string of 1’s and 0’s.
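The walk-through above is mechanical enough to write down. Here is a minimal Python sketch of it, using the header and length tables from earlier. The names HEADERS, parse_stream, and decode_signed are made up for this example, and double is left out because we never gave it a length.

```python
# Header digits -> (datatype name, number of value digits), from the tables above.
# "double" is omitted because no length was ever agreed on for it.
HEADERS = {
    "001": ("char", 10),
    "010": ("unsigned integer", 9),
    "011": ("signed integer", 10),
    "100": ("float", 13),
    "111": ("boolean", 1),
}


def parse_stream(bits):
    """Split one long string of 1's and 0's into (datatype, value digits) pairs."""
    items = []
    position = 0
    while position < len(bits):
        header = bits[position:position + 3]
        name, length = HEADERS[header]  # raises KeyError on an unknown header
        value = bits[position + 3:position + 3 + length]
        items.append((name, value))
        position += 3 + length
    return items


def decode_signed(value):
    """First digit is the sign (1 = negative), the rest is the number itself."""
    magnitude = int(value[1:], 2)
    return -magnitude if value[0] == "1" else magnitude


print(parse_stream("11101000000000010011"))
# [('boolean', '0'), ('float', '0000000010011')]

second = parse_stream("01100000111010010001011100")
print(second)
# [('signed integer', '0000011101'), ('char', '0001011100')]
print(decode_signed(second[0][1]))  # 29
```

The parser never needs a separator between values, because each header tells it exactly how many digits to read next.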

Characters versus Strings

In case you weren’t already aware, a character might represent any one key on your keyboard, including letters, numbers, punctuation (like ? and !), or other bits of written language (like { or )). A character can make up individual letters you might not see on your keyboard, like “é” or “→”. The point is, a character can be a large number of possible values taking up the same physical space on your screen as a 1 or 0.

Going back to what was said before, computers think in binary. We could come up with a way to store every character, mapping to a numerical value. These are called character encodings. Like other datatypes this tells us how many digits each character will take up. A character might be ASCII and be only what you might see on a QWERTY keyboard, or it might be UTF-8 and include emoji and other richer styles of a character. For the sake of simplicity, we’ll stick with ASCII.

There are 26 letters in the English alphabet, plus 11 special symbol keys (upper- and lowercase for each, so 22 characters), plus the number key line of 10 keys (so +20 characters), plus the spacebar, which brings us to 69 possible characters. We need enough binary digits, or bits, to handle a maximum number of 69. By my calculations, that’s 7 bits. Not too far off from our 10 bits we reserved earlier for holding character data, right?

To make a string like "Hello world!", we need to dissect it. It’s the character H, then the character e, then the character l, and so on for the word “Hello”.

Wait a second, we have capital letters here too?! Shoot. Let’s amend our count of 69 characters to add in 26 more, 1 more for each letter of the alphabet in its capital form. That’s a total of 95 possible characters. Good thing our 7 bits are able to store a representation of any number between 0 and 127.

So we have “Hello” and “world!” separated by a space character, which amounts to 12 characters. Since each character takes up 10 bits (3 header + 7 value), we’re looking at 120 bits to render "Hello world!" as binary data. Here’s where I’m going to stop and let you ponder that for a minute.
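If you'd rather let the computer do the pondering, here is a rough Python sketch. It assumes each character's 7-bit value is its ASCII code (via Python's built-in ord), which is one possible mapping and not something we established above. The name encode_string is made up.

```python
CHAR_HEADER = "001"  # the header we picked for a char


def encode_string(text):
    """Encode each character as a 3-digit header plus a 7-digit value."""
    chunks = []
    for character in text:
        code = ord(character)  # assumption: use the ASCII code as the value
        if code > 127:
            raise ValueError("only 7-bit characters fit this toy format")
        chunks.append(CHAR_HEADER + format(code, "07b"))
    return "".join(chunks)


bits = encode_string("Hello world!")
print(len(bits))  # 120
print(bits[:10])  # 0011001000 (header 001, then H as 1001000)
```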

Symbol datatype

Let’s assume you have a mapping of characters to bit values and the maximum bits mapped out perfectly. You have that string that takes up 12 characters for 120 bits just to say hello to the world. What if there’s something you only want to reference internally for your own purposes? We don’t care about the actual value, we just care that the value happens to be unique from any other value. It has semantic meaning only. You know how we have those mappings of letters, base10 numbers, etc. to binary? Those are the same concept of a symbol. That’s part of the char shorthand. What if we took it a step further?

First, after all that binary-talk, I need to take my head out of the theoretical for a moment. I need to go back to the programming I know: Ruby, Python, etc.

Ruby and Python are dynamic languages where you rarely have to know how many bits in memory a variable, holding a particular datatype, will take up. The way I operate, I care about how easily I can keep the inner workings of a program in my head at any one time. I need cues that won’t matter to the computer, but do only to me as the programmer. Take a logging class (in most any language), for example.

We have the concept of a logger which takes an enum of various error levels: ERROR, WARN, INFO, DEBUG, and possibly more. Do we care about storing the string of characters to represent it? No. We just need some internal representation to reference. Let’s choose a datatype that has a really small memory footprint here.
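Python doesn't have Ruby-style symbols, but its standard enum module gets at the same idea. Here is a minimal sketch. The LogLevel class and the describe function are invented for illustration, and the numbers assigned to each level are arbitrary.

```python
from enum import Enum


class LogLevel(Enum):
    # The numbers are arbitrary; they only need to differ from one another.
    DEBUG = 1
    INFO = 2
    WARN = 3
    ERROR = 4


def describe(level):
    """Branch on which label we were handed, never on its underlying value."""
    if level is LogLevel.ERROR:
        return "something broke"
    return "carry on"


print(describe(LogLevel.ERROR))                # something broke
print(describe(LogLevel.INFO))                 # carry on
print(LogLevel["ERROR"] is LogLevel.ERROR)     # True, there is only one ERROR
```

The program only ever asks "is this ERROR or not?", which is exactly the kind of internal reference where the spelled-out string adds nothing.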

Representing "ERROR" (as a string) would be 5 characters × 10 bits per character = 50 bits. Other logging levels might be greater or fewer characters in number, so we just need to have a datatype less than 50 bits, or binary digits long to stand in for the one other, more memory intensive value. Let’s choose the unsigned integer 4, and the other logging levels as unsigned integers as well. We don’t care about the value 4, only what it represents, so it could just as easily be 986 or a poop emoji. It is meant to differentiate ERROR from WARN and the others.

So the unsigned integer 4 is represented with 010000000100 in binary. That’s 12 bits long when the string version could be 50 bits. That’s quite the savings! Only about a quarter of the original memory footprint!

In a language like ruby, a small savings like that might not make too big of a difference, but say the language could store just the digits it needs instead of a full unsigned integer, reclaiming all the wasted space up front. We only need the first 3 bits for the header and the 3 bits following for the value of 4. What if our compiler was smart enough to see we only needed a maximum value of 4?

In that case, we could remember a new datatype that says the following X digits represent a symbol, where X is determined once the entire program is analyzed to figure out the max value one might ever see. In our case, it’s 4 so we downsize the number of bits used from 12 to 6. What once was 010000000100 is now 010100.

But wait, we still have the datatype header for an unsigned integer! That’ll screw everything up! That’s very true. We can't keep breaking our own shorthand conventions for encoding/decoding binary, so we’ll have to come up with a new, on-the-fly datatype which we’ll refer to as a symbol. It’ll go by the header 101, since it hasn’t been used yet.

From the original 50 bits, to 010000000100 (12 bits), to 101100 (6 bits), that’s quite a bit of downsizing for the exact same functionality as far as both the programmer and the end user are concerned.
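Here is the same downsizing as a short Python sketch. All of the names are hypothetical, and the numbering of the levels other than ERROR is an assumption made only so there is a "largest value" to measure.

```python
HEADER_DIGITS = 3
CHAR_VALUE_DIGITS = 7   # per the character count worked out earlier
UNSIGNED_DIGITS = 9     # the length we agreed on for unsigned integers

# Assumed numbering; only ERROR = 4 comes from the text above.
LEVELS = {"DEBUG": 1, "INFO": 2, "WARN": 3, "ERROR": 4}


def string_cost(text):
    """Bits needed to store text as a run of header + value characters."""
    return len(text) * (HEADER_DIGITS + CHAR_VALUE_DIGITS)


def encode_unsigned(number):
    """Header 010 followed by the full 9-digit unsigned value."""
    return "010" + format(number, "0{}b".format(UNSIGNED_DIGITS))


def encode_symbol(number, largest):
    """Header 101 followed by only as many digits as the largest value needs."""
    width = largest.bit_length()
    return "101" + format(number, "0{}b".format(width))


largest = max(LEVELS.values())
print(string_cost("ERROR"))                     # 50
print(encode_unsigned(LEVELS["ERROR"]))         # 010000000100
print(encode_symbol(LEVELS["ERROR"], largest))  # 101100
```

The width is worked out once, up front, from the largest value in the whole program, which is the "compiler is smart enough" step described above.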

Recap

Symbols are just what they sound like: an in-memory stand-in value for something else. Their function is to save on memory usage when you don’t need to allocate a ton of bits in order to have a label that only needs to represent that it's different from other labels. The values don't actually matter. They are a handy representation which shifts work onto the compiler initially, but nets less memory (and equal computation power) at runtime.

Moving forward

These optimizations occur not only for storing data, but also for referencing the binary chunks of data that represent logic the computer is supposed to follow. Do you really think your computer remembers to look in that file for the string my_main_function() in order to call the logic defined therein? Do you feel the computer cares how you name things? No! It reads the logic into binary and determines a symbol only it remembers in order for it to easily call the functionality. These are more compilation optimizations that happen automatically in the programming languages you use.

Languages like Ruby allow you direct access to symbols as a datatype (C's enums fill a similar role), but languages like PHP and earlier versions of Java require you to declare your preferred datatype and value, leaving the memory optimization to the programmer when defining that a symbol exists.

Are symbols helpful? Sure. How often? That depends on what and how you’re coding the task at hand. Hopefully this will serve as an introduction for what circumstances would be best to use symbols versus other datatypes.

On the Importance of Packaging

David Alexander — Tue, 06 Jun 2017 10:18:53 +0000

You have an idea and want to turn it into a bit of code to carry it out. What do you do? You open up your IDE/Editor, perhaps you structure it inside of a folder with some default tooling for linting and (if necessary) compiling, and you get the code to work. It's hacky, it's ugly, but it works.

Now that you have something, you want to add one more feature to it or work out a bug. What do you do? You follow the same process. If you're feeling zesty, maybe you init a new git repository at the base of the project, but there's no real point to branching and merging back into master. Backing up the code? Might be an account on Bitbucket, or maybe something with GitLab. They both have free usage tiers with private repositories, and this definitely isn't good enough for the public to see.

This continues with more features hacked on, with no tests and zero documentation (except a comment here and there), all kept 100% private. Maybe you try to make it work on a Digital Ocean server you have stood up as a floating workspace with a persistent internet connection. It was difficult, but it works now. Kinda. Well, some of the time, anyway.

Your code portfolio is still missing a lot. Not a whole lot is visible because not a whole lot is in good enough condition to show anybody. You have a lot of these tiny projects that are riddled with what you know are bad practices, but were so much easier than figuring out the right way to do it. You treat your open source code contributions almost like a stage performance where you have everything perfectly thought out. First you have to write documentation, which means you have to clearly define that snowflake server you have set up to run your script. It also means you have to write usage documentation. Automated tests too, with varying importance depending on the community surrounding the language in which your project is implemented.

This has been my process for years with any number of once-off scripts. Projects that were intended for a recurring, minusculely scoped purpose I would keep private and, when I didn't have need for them anymore or they broke due to changes elsewhere (e.g., web scrapers breaking because of website updates), I would just delete the project and call it done. Of course I would only be able to verbally mention "Yeah, I built something like that in my free time" during job interviews, but wouldn't be able to prove it because it was longer than a week ago and I didn't save the work.

What I've found, recently, is that there is a way to ease these growing pains.

Packaging

Every (legitimate) language these days has a package management strategy. Not familiar with the concept of package management? Let's talk about that for a sec.

You have a new project, let's say it's written in Python for now. The standard package format is well defined for installing using pip. It includes dependencies, name of the package, author name and contact info, version constraints for dependencies... a lot of information that is--and should be--standardized. In python, that's consistently defined in setup.py at the base of the repository.
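To make that concrete, here is a rough sketch of the kind of information a setup.py carries. Every value below is a placeholder (the package name, author, and the requests dependency are all hypothetical), and wrapping the fields in a build_metadata function is just my way of showing them on their own. The keyword names themselves are ones setuptools' setup() accepts.

```python
def build_metadata():
    """Collect the fields a setup.py would hand to setuptools.setup()."""
    return {
        "name": "my_project",                        # placeholder package name
        "version": "0.1.0",
        "author": "Your Name",                       # placeholder
        "author_email": "you@example.com",           # placeholder
        "description": "One-line summary of what the package does",
        "packages": ["my_project"],                  # importable package directories
        "install_requires": ["requests>=2.0,<3.0"],  # hypothetical dependency + version constraint
    }


# In a real setup.py this would be followed by:
#     from setuptools import setup
#     setup(**build_metadata())

for field, value in build_metadata().items():
    print(field, "=", value)
```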

Consider the following repository structure:

├── LICENSE.md
├── README.md
├── setup.py
└── my_project
    ├── __init__.py
    ├── core
    │   ├── admin_commands.py
    │   ├── inbox.py
    │   ├── initialize.py
    │   ├── mentions.py
    │   ├── posts.py
    │   ├── user_interaction.py
    │   └── validation.py
    ├── helpers
    │   ├── misc.py
    │   └── wiki.py
    ├── main.py
    └── strings
        ├── debug.py
        ├── posts.py
        ├── responses.py
        └── urls.py

We have a project named my_project, which should be the name of the package in setup.py. Python has the Java-esque convention of import package.name.here to map to package/name/here.py filesystem structure.
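That mapping is simple enough to sketch in a few lines of Python. The function name module_to_path is made up, and it deliberately ignores the case where the dotted name points at a package directory (which maps to its __init__.py instead).

```python
from pathlib import PurePosixPath


def module_to_path(dotted_name):
    """Turn a dotted module name into the relative file path it is loaded from."""
    parts = dotted_name.split(".")
    return str(PurePosixPath(*parts).with_suffix(".py"))


print(module_to_path("package.name.here"))      # package/name/here.py
print(module_to_path("my_project.core.inbox"))  # my_project/core/inbox.py
```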

Readme

Packages always have a README. If you use a package generator (like bundle gem) and it generates a README, always remove the default generated description (and other TODO-lines) and insert your own. Not sure what your project is about? Note what it currently covers. You can update it later. That's the point of version control.

In the README, include installation instructions that are realistic to your situation. Are there native OS dependencies that you need installed too? Those might not fit well in the setup.py or *.gemspec files. In that case, put it in the README.

A lot of times the boilerplate includes a lot of noise that might not be necessary for publicly publishing a package. Here are the bare minimum requirements for any package I make, public or private. I'm a forgetful person so coming back to a project 2 months later will require some getting up-to-speed again anyway. I like to make it easier for myself with this list of requirements for a README:

Usage
Installation procedure (assume a freshly installed OS)
Description
(if applicable) Assumptions of native platform dependencies

Dependencies

Packages provide a very clear definition of what it takes to run a piece of software that was written. I've been hearing for years that successful projects include a very narrowly defined scope and that the best way to do that is through narrowly defined interfaces.

In programming, you might see a junior dev hack away at a magic method that takes inputs in any variety of forms and is able to normalize it. Unless that is the primary function of the method/command/atomic unit they're creating, more experienced developers realize this very easily snowballs into a maintenance nightmare. The same might be said for dependency tracking. Having a clearly defined set of required dependencies (instead of the end user doing trial and error to install everything) will always be better.

Overkill

Python, for example, involves some bootstrapping to set up a package. For me, it has involved looking up a prior project's setup.py file, modifying it to fit the new project, and creating some default structures like my_project/__init__.py containing variables for version (__version__) and author (__author__). It also involves setting up automated testing with unittest and shell scripts to make automated testing easy to execute.

Here's my heuristic when it comes to creating a package:

Are there any dependencies that require a package manager (e.g., pip, npm, gem)?
Does the source code need to be split up into multiple files?
Am I testing my work with any more granularity than full, end-to-end tests?
Do I need to deploy this easily as, e.g., a command line tool?
Do I need to version the releases and have stable, beta, and dev copies of the source code?
Am I tracking code changes with git?

If I've answered yes to any one of these questions, I know to turn the project into a package.

This means I...

create a new git repository (and private remote, for backup purposes)
create a README (in markdown, preferably)
track dependencies I install or remove, as I install or remove them
formulate a testing strategy to make sure everything works as expected (either automated or manual)
create a set of convenience scripts for running repeatable tasks scoped only to the current project (e.g., Makefile, Rakefile, bundler binstubs, shell script)

Local Gems With Bundler

David Alexander — Wed, 28 Dec 2016 17:27:06 +0000

Testing an unreleased version of a gem? Want to develop 2 unreleased projects that are based on each other and not have to worry about the following?

gem 'some_gem', path: '../my-dev-snapshot'

Leaving the Gemfile as such will screw with your project history, so we want the version of Gemfile as it will be when the gem is released, keeping that version in our “git memory”. See the following:

bundle config local.some_gem "$(realpath ../my-dev-snapshot)"

This will allow your Gemfile to remain pristine without the path: '../foo' hacks so others can set their own path to the gem source directory.

Caveats:

The gem, in this case some_gem, must be pointed at a git repository. In this case, it would need to be:

gem 'some_gem', github: 'foo/bar', branch: 'master'

This would allow us not only to optimize network traffic so we don’t make calls out to the git repository all the time, but also point it toward a local working copy.

There is an additional caveat, though. The given branch – in the example, master – must match the branch of the current working copy at the path specified with bundle config from earlier.