<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Dian Fay</title>
    <description>The latest articles on Forem by Dian Fay (@dmfay).</description>
    <link>https://forem.com/dmfay</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F17296%2Fd3eddff4-7ac8-4959-a5c5-5477d4cf7850.png</url>
      <title>Forem: Dian Fay</title>
      <link>https://forem.com/dmfay</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dmfay"/>
    <language>en</language>
    <item>
      <title>ectomigo: Safer Schema Migrations</title>
      <dc:creator>Dian Fay</dc:creator>
      <pubDate>Tue, 29 Mar 2022 00:00:00 +0000</pubDate>
      <link>https://forem.com/dmfay/ectomigo-safer-schema-migrations-5g6b</link>
      <guid>https://forem.com/dmfay/ectomigo-safer-schema-migrations-5g6b</guid>
      <description>&lt;p&gt;The team I work with at my day job maintains many applications and processes interacting across a smaller number of databases. This is hardly exceptional. We are also constantly adding, subtracting, and refining not only the client programs but also the database schemas themselves. This too is hardly exceptional: business requirements change, external systems expose new information and deprecate old interfaces, von Moltke's Law ("no plan of operations remains certain once the armies have met") comes calling. Every now and again we just make a modeling or implementation mistake that manages to sneak through review and up to production. Sic semper startups.&lt;/p&gt;

&lt;p&gt;So our database schemas are continually evolving. Each of those many applications and processes has to evolve along with them, or we get paged when the renamed column or dropped table breaks something we hadn't accounted for, and instant breakage is the &lt;em&gt;best&lt;/em&gt; case. We've had schema incompatibilities lie in wait for over a month to catch us completely flatfooted. The complexities of even a single moderately-sized codebase are beyond the grasp of human memory. What hope do we have of recalling which relevant subset of database interactions appear where across two or ten or more?&lt;/p&gt;

&lt;p&gt;What we need is a distinctly &lt;em&gt;inhuman&lt;/em&gt; memory, one for which summoning up each and every reference to a changing table or view takes a moment's effort, and which cannot forget. A memory which operates at the level of the organization, rather than that of the project or of the individual developer/reviewer, who can only focus on a single target at a time. A memory we can consult when, or better yet before, code is ready to deploy -- "&lt;a href="https://en.wikipedia.org/wiki/Shift-left_testing"&gt;shifting left&lt;/a&gt;", as they say.&lt;/p&gt;

&lt;p&gt;We need a database.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ectomigo.com"&gt;I built one&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WoK8yjKy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://di.nmfay.com/images/ectomigo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WoK8yjKy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://di.nmfay.com/images/ectomigo.png" alt="a schema migration alters a table, renaming a column; ectomigo leaves a GitHub review comment pointing out references to that table in two repositories. Each reference includes the columns ectomigo has been able to identify. One reference uses the column's new name, indicating it's been updated, but another in the second repository still uses the old name and must be fixed." width="784" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ectomigo is a continuous integration module (initially a &lt;a href="https://github.com/ectomigo/ectomigo"&gt;GitHub action&lt;/a&gt;) which parses your source files using &lt;a href="https://tree-sitter.github.io/tree-sitter/"&gt;tree-sitter&lt;/a&gt; to find data access code: SQL scripts and inline SQL in Java, JavaScript, and Python; &lt;a href="https://massivejs.org"&gt;MassiveJS&lt;/a&gt; calls; &lt;a href="https://www.sqlalchemy.org"&gt;SQLAlchemy&lt;/a&gt; definitions; and more languages, data access patterns, analysis features, and platform support on the way after launch. Everything it finds it indexes, storing database object names and the file row-column positions of each reference.&lt;/p&gt;
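&lt;p&gt;As a much-simplified sketch of the indexing step -- a naive regex scan and a hard-coded object list standing in for ectomigo's actual tree-sitter parse -- something along these lines records each reference with its position:&lt;/p&gt;

```python
import re

# Hypothetical object names to look for; ectomigo derives these from
# parsed SQL rather than a hard-coded list, and uses tree-sitter instead
# of this naive regex scan.
TABLES = ["users", "orders"]

def index_references(files):
    """Map each object name to the (file, row, column) of every reference."""
    index = {}
    for path, text in files.items():
        for row, line in enumerate(text.splitlines(), start=1):
            for name in TABLES:
                for match in re.finditer(r"\b%s\b" % re.escape(name), line):
                    index.setdefault(name, []).append((path, row, match.start() + 1))
    return index

files = {"app.py": 'cur.execute("SELECT id FROM users WHERE active")'}
print(index_references(files))
```

&lt;p&gt;The real parse distinguishes a table reference in SQL from a coincidental string match, which is exactly why tree-sitter earns its keep here.&lt;/p&gt;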

&lt;p&gt;When you submit schema changes for review, it parses &lt;em&gt;that&lt;/em&gt; code as well, and matches the targets you're altering or dropping against every codebase your organization has enabled. If it does find any matches -- in other words, you still have live references to an affected database object, in this or another repository -- it leaves review comments alerting you to each potential problem.&lt;/p&gt;
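&lt;p&gt;The matching step then amounts to a lookup: for every object a migration alters or drops, surface each surviving reference across the indexed repositories. A sketch, with a hypothetical index shape (the real storage format is ectomigo's own):&lt;/p&gt;

```python
# Hypothetical index as ectomigo might store it: object name mapped to
# reference positions across every enabled repository.
index = {
    "users": [("api/app.py", 41, 19), ("etl/load.py", 7, 30)],
    "orders": [("api/app.py", 88, 22)],
}

def check_migration(dropped, index):
    """Return one review comment per live reference to a dropped or renamed object."""
    comments = []
    for name in dropped:
        for repo_file, row, col in index.get(name, []):
            comments.append(f"{repo_file}:{row}:{col} still references '{name}'")
    return comments

for comment in check_migration({"users"}, index):
    print(comment)
```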

&lt;p&gt;ectomigo is &lt;a href="https://github.com/ectomigo/ectomigo"&gt;launching on GitHub&lt;/a&gt; free for public projects and up to two private ones, with &lt;a href="https://ectomigo.com/pricing"&gt;pricing available&lt;/a&gt; beyond that. The action code and the &lt;a href="https://gitlab.com/ectomigo/core"&gt;core&lt;/a&gt; code analysis library it integrates are open under the AGPL, should you be interested.&lt;/p&gt;

&lt;p&gt;We've been using early ectomigo builds at my workplace for a couple of months now, and it's already saved our bacon a few times with reports on database object usage in places we'd forgotten. If you're faced with migration risk yourself, I hope it can help you.&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>news</category>
      <category>devops</category>
      <category>database</category>
    </item>
    <item>
      <title>Exploring Databases Visually</title>
      <dc:creator>Dian Fay</dc:creator>
      <pubDate>Sun, 04 Apr 2021 00:00:00 +0000</pubDate>
      <link>https://forem.com/dmfay/exploring-databases-visually-41j5</link>
      <guid>https://forem.com/dmfay/exploring-databases-visually-41j5</guid>
      <description>&lt;p&gt;In &lt;a href="https://gitlab.com/dmfay/dotfiles/-/blob/master/zsh/fks.zsh"&gt;"things you can do with a terminal emulator that renders images"&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;One way to look at a database's structure is as a graph of foreign key relationships among tables. Two styles of visual representation predominate: models or entity-relationship diagrams (ERDs) created as part of requirements negotiation and design, and descriptive diagrams of an extant database. The former are drawn by hand on a whiteboard or in diagramming software; the latter are often generated by database management tools with some manual cleanup and organization. Both styles usually take the complete database as their object, and whether descriptive or prescriptive, their role in the software development process is as reference material, or documentation.&lt;/p&gt;

&lt;p&gt;Documentation isn't disposable. Even though these diagrams are out of date practically as soon as they're saved off, they take effort to make, or at least to make legible -- automated tools are only so good at layout, especially as table and relationship counts grow. That effort isn't lightly discarded, and anyway a diagram that's still &lt;em&gt;mostly&lt;/em&gt; accurate remains a useful reference.&lt;/p&gt;

&lt;p&gt;Documentation isn't disposable. But documentation isn't the only tool we have for orienting ourselves in a system: we can also explore, view the system in parts and from different angles, follow individual paths through the model from concept to concept. Exploration depends on adopting a partial, mobile perspective from the inside of the model, with rapid feedback and enough context to navigate but not so much as to be overwhelmed. The view from a single point is more or less important depending on the point itself, but in order to facilitate exploration that view has to be generated and discarded on demand. Look, move, look, move.&lt;/p&gt;

&lt;p&gt;This is a partial perspective of the &lt;a href="https://github.com/devrimgunduz/pagila"&gt;pagila&lt;/a&gt; sample database, from the table &lt;code&gt;film&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NGRRjHpT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://di.nmfay.com/images/fks-film.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NGRRjHpT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://di.nmfay.com/images/fks-film.png" alt="the &amp;quot;film&amp;quot; table in a graph showing its dependence via foreign key on the &amp;quot;language&amp;quot; table, and other tables' dependencies on &amp;quot;film&amp;quot;. A film has corresponding records in &amp;quot;film_actor&amp;quot; and &amp;quot;film_category&amp;quot; (junction tables, to &amp;quot;actor&amp;quot; and &amp;quot;category&amp;quot; tables not shown in this partial perspective); copies of a film are in &amp;quot;inventory&amp;quot;; inventory items in turn are referenced in &amp;quot;rental&amp;quot;; and rentals turn up in a set of &amp;quot;payment&amp;quot; tables partitioned by month."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's generated by &lt;a href="https://gitlab.com/dmfay/dotfiles/-/blob/master/zsh/fks.zsh"&gt;this &lt;code&gt;fks&lt;/code&gt; zsh function&lt;/a&gt; which queries Postgres' catalog of foreign keys using a &lt;a href="https://www.citusdata.com/blog/2018/05/15/fun-with-sql-recursive-ctes/"&gt;recursive common table expression&lt;/a&gt; to identify and visualize everything connected in a straight line to the target. The query output is passed to the &lt;a href="https://graphviz.org"&gt;Graphviz suite's &lt;code&gt;dot&lt;/code&gt;&lt;/a&gt; with a template, rendered to png, and the png displayed with &lt;a href="https://wezfurlong.org/wezterm/"&gt;&lt;code&gt;wezterm imgcat&lt;/code&gt;&lt;/a&gt;. No files are created or harmed at any point in the process.&lt;/p&gt;
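&lt;p&gt;The real work happens in SQL and zsh, but purely to illustrate the traversal, here's a Python sketch over a hypothetical hand-written foreign key list (standing in for the &lt;code&gt;pg_constraint&lt;/code&gt; query): it walks straight lines up and down from a target table and emits the DOT text handed to Graphviz:&lt;/p&gt;

```python
# Hypothetical foreign keys as (child, parent) pairs, a stand-in for what
# fks reads out of Postgres' pg_constraint catalog for pagila.
FKS = [
    ("film", "language"), ("inventory", "film"), ("rental", "inventory"),
    ("film_actor", "film"), ("payment", "rental"), ("film_actor", "actor"),
]

def straight_line(target):
    """Collect FK edges connected to target by purely-up or purely-down paths."""
    edges = set()
    def walk(table, down):
        for child, parent in FKS:
            if down and parent == table and (child, parent) not in edges:
                edges.add((child, parent))   # a table depending on this one
                walk(child, down=True)
            if not down and child == table and (child, parent) not in edges:
                edges.add((child, parent))   # a table this one depends on
                walk(parent, down=False)
    walk(target, down=True)
    walk(target, down=False)
    return edges

def to_dot(edges):
    """Render the edge set as a DOT digraph for Graphviz."""
    lines = ["digraph fks {"] + [f'  "{c}" -> "{p}";' for c, p in sorted(edges)] + ["}"]
    return "\n".join(lines)

print(to_dot(straight_line("film")))
```

&lt;p&gt;From &lt;code&gt;film&lt;/code&gt;, this reaches &lt;code&gt;film_actor&lt;/code&gt; (one step down) but not &lt;code&gt;actor&lt;/code&gt;, which sits around the corner on the far side of the junction.&lt;/p&gt;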

&lt;p&gt;Why only a straight line, though? The graph above has obvious gaps: &lt;code&gt;film_actor&lt;/code&gt; implies an &lt;code&gt;actor&lt;/code&gt;, and &lt;code&gt;film_category&lt;/code&gt; its own table on the other side of the junction. &lt;code&gt;inventory&lt;/code&gt; probably wants a &lt;code&gt;store&lt;/code&gt;, and &lt;code&gt;rental&lt;/code&gt; and the payment tables aren't much use without a &lt;code&gt;customer&lt;/code&gt;. The view from &lt;code&gt;rental&lt;/code&gt; is markedly different, with half a dozen tables that weren't visible at all from &lt;code&gt;film&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7KWe-s4V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://di.nmfay.com/images/fks-rental.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7KWe-s4V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://di.nmfay.com/images/fks-rental.png" alt='a perspective on the pagila sample database from the "rental" table. The same "payment" tables depend on it, but upstream "inventory" is joined by "customer" and "staff", and further up "store", "address" (relating to customers, staff, and stores), "city", and "country" tables. "Film" and "language" are also present upstream from "inventory".'&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This graph is familiar in part: there's &lt;code&gt;rental&lt;/code&gt; itself, the payment tables, &lt;code&gt;inventory&lt;/code&gt;, &lt;code&gt;film&lt;/code&gt; -- the last shorn of the junctions to the still-missing &lt;code&gt;actor&lt;/code&gt; and &lt;code&gt;category&lt;/code&gt; tables. Those have passed around a metaphorical corner, since in order to get from &lt;code&gt;rental&lt;/code&gt; to &lt;code&gt;film_actor&lt;/code&gt; you must travel first &lt;em&gt;up&lt;/em&gt; foreign keys into &lt;code&gt;film&lt;/code&gt; (via &lt;code&gt;rental.inventory_id&lt;/code&gt; and &lt;code&gt;inventory.film_id&lt;/code&gt;), then &lt;em&gt;down&lt;/em&gt; by way of &lt;code&gt;film_actor.film_id&lt;/code&gt;. &lt;code&gt;language&lt;/code&gt;, meanwhile, is "upwards" of &lt;code&gt;film&lt;/code&gt; and therefore remains visible from &lt;code&gt;rental&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The reason &lt;code&gt;fks&lt;/code&gt; restricts its search to straight lines from the target table is to keep context narrow. You can get a fuller picture of the table structure by navigating and viewing the graph from multiple perspectives; what &lt;code&gt;fks&lt;/code&gt; shows is the set of tables which can affect the target, or which will be affected by changes in the target. If you delete a &lt;code&gt;store&lt;/code&gt; or a &lt;code&gt;film&lt;/code&gt;, rentals from that store or of that film are invalidated (and, unless the intermediary foreign keys are set to cascade, the delete fails). But deleting a &lt;code&gt;film_actor&lt;/code&gt; has nothing to do with &lt;code&gt;rental&lt;/code&gt;, and vice versa.&lt;/p&gt;

&lt;p&gt;There's an actual, serious problem with unrestricted traversal, too. If you recurse through &lt;em&gt;all&lt;/em&gt; relationships, you wind up mapping entire subgraphs, or clusters of related tables. And clusters grow quickly. Stuart Kauffman has a great illustration of the principle in his book &lt;em&gt;At Home in the Universe: The Search for the Laws of Self-Organization and Complexity&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Imagine 10,000 buttons scattered on a hardwood floor. Randomly choose two buttons and connect them with a thread. Now put this pair down and randomly choose two more buttons, pick them up, and connect them with a thread. As you continue to do this, at first you will almost certainly pick up buttons that you have not picked up before. After a while, however, you are more likely to pick at random a pair of buttons and find that you have already chosen one of the pair. So when you tie a thread between the two newly chosen buttons, you will find three buttons tied together. In short, as you continue to choose random pairs of buttons to connect with a thread, after a while the buttons start becoming interconnected into larger clusters.&lt;/p&gt;
&lt;/blockquote&gt;
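&lt;p&gt;The buttons experiment is easy to replay in a few lines -- a union-find sketch, with the button count and thread:button ratios picked for illustration:&lt;/p&gt;

```python
import random

def largest_cluster(n_buttons, n_threads, seed=1):
    """Tie n_threads random pairs of n_buttons together; return the biggest cluster."""
    parent = list(range(n_buttons))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    rng = random.Random(seed)
    for _ in range(n_threads):
        a, b = rng.randrange(n_buttons), rng.randrange(n_buttons)
        parent[find(a)] = find(b)  # thread the two buttons' clusters together
    sizes = {}
    for i in range(n_buttons):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values())

for ratio in (0.25, 0.5, 0.75, 1.0):
    print(ratio, largest_cluster(10_000, int(ratio * 10_000)))
```

&lt;p&gt;Below a ratio of 0.5 the largest cluster stays negligible; past it, a supercluster containing most of the buttons appears.&lt;/p&gt;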

&lt;p&gt;When the ratio of threads to buttons, or relationships to tables, passes 0.5, there's a phase transition. Enough clusters exist that the next thread or relationship will likely connect one cluster to another, and the next, and the next. A supercluster emerges, nearly the size of the entire relationship graph. We can see what the relationship:table ratio looks like in a database by querying the system catalogs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;tbls&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tables&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;table_schema&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'pg_catalog'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'information_schema'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;fks&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_constraint&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;contype&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'f'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;fks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;tbls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;fks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;decimal&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;tbls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tbls&lt;/span&gt; &lt;span class="k"&gt;CROSS&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;fks&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lowest ratio I have in a real working database is 0.56, and it's a small one, with f=14 and t=25. Others range from 0.61 (f=78, t=126) all the way up to 1.96 (f=2171, t=1107 thanks to a heavily partitioned table with multiple foreign keys); pagila itself is in the middle at 1.08 (f=27, t=25). I don't have enough data to back this up, but I think it's reasonable to expect that the number of relationships tends to increase faster than the number of tables. Without restrictions on traversal, you might as well draw a regular ERD: superclusters are inevitable.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fks&lt;/code&gt; will draw a regular ERD if passed only the database name, but like I said earlier, automated tools are only so good at layout (and in a terminal of limited width, even a smallish database is liable to produce an illegibly zoomed-out model). With no way to add universal render hints, Graphviz does a lot better with the smaller, more restricted graphs from local perspectives inside the database -- and so do humans. Reading a full-scale data model is hard! Tens or hundreds of nodes have to be sorted by relevance to the problem at hand; nodes and relationships which matter have to be mapped, the irrelevant actively ignored, others tagged with a mental question mark. Often a given problem involves more relevant entities than the human mind can track unaided. &lt;code&gt;fks&lt;/code&gt; doesn't resolve the issue completely, but making a database spatial and navigating that space visually goes some way to meet our limitations and those of our tools.&lt;/p&gt;

</description>
      <category>database</category>
      <category>showdev</category>
      <category>shell</category>
    </item>
    <item>
      <title>Extra-fuzzy History Searching with Mnem</title>
      <dc:creator>Dian Fay</dc:creator>
      <pubDate>Thu, 17 Sep 2020 00:00:00 +0000</pubDate>
      <link>https://forem.com/dmfay/extra-fuzzy-history-searching-with-mnem-5ccl</link>
      <guid>https://forem.com/dmfay/extra-fuzzy-history-searching-with-mnem-5ccl</guid>
      <description>&lt;p&gt;I use a lot of Rust command-line tools: &lt;a href="https://github.com/BurntSushi/ripgrep"&gt;ripgrep&lt;/a&gt;, &lt;a href="https://github.com/sharkdp/fd"&gt;fd&lt;/a&gt;, &lt;a href="https://github.com/bootandy/dust"&gt;dust&lt;/a&gt;, and more. So when I had my own idea for a better command-line mousetrap, it seemed like the way to go.&lt;/p&gt;

&lt;p&gt;Shells log the commands you enter to a history file. Bash has &lt;code&gt;.bash_history&lt;/code&gt;, zsh uses &lt;code&gt;.histfile&lt;/code&gt;. The &lt;code&gt;EXTENDED_HISTORY&lt;/code&gt; option in the latter adds timestamps, but that's about as fancy as it gets. Both shells (and presumably others) also have "reverse search" functionality which lets you look backwards and forwards through the history, one line at a time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_eiB_S1o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://di.nmfay.com/images/mnem-ctrl-r.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_eiB_S1o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://di.nmfay.com/images/mnem-ctrl-r.gif" alt="reverse searching for rustc calls"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Functional! But not especially friendly. Only seeing one result at a time makes it difficult to evaluate multiple similar matches; matching is strictly linear, as you can see from my typos; and chronological order is only sometimes the most useful.&lt;/p&gt;

&lt;p&gt;I do a lot with the AWS CLI, SaltStack, and other complicated command-line interfaces. I want to compare invocations to see how I've combined verbs and flags in the past, and for tasks I repeat just often enough to forget how to do them, sorting by overall frequency is more useful than sorting by time.&lt;/p&gt;

&lt;p&gt;Enter &lt;a href="https://gitlab.com/dmfay/mnem"&gt;Mnem&lt;/a&gt; (regrettably, I missed getting &lt;code&gt;clio&lt;/code&gt;, the Muse of history, by a matter of weeks):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7tdFRtpc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://di.nmfay.com/images/mnem.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7tdFRtpc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://di.nmfay.com/images/mnem.gif" alt="mnem in use"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The idea is pretty simple: load the history file, and reduce every command to its syntactic structure. &lt;code&gt;git commit -m "some message here"&lt;/code&gt; becomes &lt;code&gt;git commit -m &amp;lt;val&amp;gt;&lt;/code&gt;; &lt;code&gt;mv "hither" "thither"&lt;/code&gt; turns into &lt;code&gt;mv &amp;lt;arg1&amp;gt; &amp;lt;arg2&amp;gt;&lt;/code&gt;. Many entries will have the same structure, especially if switches are sorted consistently, so counting up occurrences yields each structure's overall popularity.&lt;/p&gt;
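&lt;p&gt;A rough sketch of that reduction -- the placeholder spelling and the rules for deciding what counts as a value here are guesses for illustration, not Mnem's actual implementation:&lt;/p&gt;

```python
import re
from collections import Counter

# Anything quoted is definitely a value.
QUOTED = re.compile(r"""(['"]).*?\1""")

def structure(command):
    """Reduce a command line to its syntactic structure."""
    cmd = QUOTED.sub("VAL", command)
    out = []
    for i, tok in enumerate(cmd.split()):
        if i == 0 or tok.startswith("-") or tok == "VAL":
            out.append(tok)  # keep the program name, switches, and placeholders
        elif re.search(r"[/.\d]", tok):
            out.append("VAL")  # paths and numbers are values too (a guess)
        else:
            out.append(tok)  # bare words pass through as subcommands
    return " ".join(out)

history = [
    'git commit -m "some message here"',
    'git commit -m "fix the build"',
    "git status",
]
popularity = Counter(structure(cmd) for cmd in history)
print(popularity.most_common())
```

&lt;p&gt;&lt;code&gt;Counter.most_common&lt;/code&gt; then hands back exactly the frequency-ordered aggregates the first selector shows.&lt;/p&gt;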

&lt;p&gt;Picking one such aggregate yields a second selector over the original incidences, and selecting one of those prints it to stdout. This can be referenced, copied and pasted, or even &lt;code&gt;eval&lt;/code&gt;ed in the shell.&lt;/p&gt;

&lt;p&gt;So far I've released Mnem to the &lt;a href="https://aur.archlinux.org/packages/mnem/"&gt;Arch AUR&lt;/a&gt; and a Homebrew tap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew tap dmfay/mnem https://gitlab.com/dmfay/homebrew-mnem.git
brew install dmfay/mnem/mnem
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



</description>
      <category>rust</category>
      <category>shell</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Should sponsors be able to hide comments on their posts?</title>
      <dc:creator>Dian Fay</dc:creator>
      <pubDate>Thu, 23 Jul 2020 15:43:53 +0000</pubDate>
      <link>https://forem.com/dmfay/should-sponsors-be-able-to-hide-comments-on-their-posts-35b1</link>
      <guid>https://forem.com/dmfay/should-sponsors-be-able-to-hide-comments-on-their-posts-35b1</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/github/hi-we-re-github-and-we-re-excited-to-be-at-codeland-32dg"&gt;GitHub is sponsoring CodeLand&lt;/a&gt;, and have the expected friendly DevRel spokesperson, announcement thread, raffle for early access to new stuff, and the like.&lt;/p&gt;

&lt;p&gt;I'm not going to belabor &lt;a href="https://www.theverge.com/2019/10/9/20906213/github-ice-microsoft-software-email-contract-immigration-nonprofit-donation"&gt;the issues I have with GitHub&lt;/a&gt;. Suffice it to say that when I asked the friendly DevRel spokesperson &lt;a class="comment-mentioned-user" href="https://dev.to/bdougieyo"&gt;@bdougieyo&lt;/a&gt; how he felt about his company's collaboration with the US government's Immigration and Customs Enforcement agency (ICE), it was in good faith -- I'm genuinely interested in hearing how he does or doesn't grapple with his employer's ethics!&lt;/p&gt;

&lt;p&gt;Instead of answering, ignoring, or even brushing me off with patter about how donations to nonprofits absolve GitHub of responsibility, he used the post author tools to hide my comment. And &lt;em&gt;that&lt;/em&gt; I have a problem with: it crosses the line between GitHub using DEV to whitewash its reputation, and DEV providing tools to &lt;em&gt;help&lt;/em&gt; GitHub whitewash its reputation.&lt;/p&gt;

&lt;p&gt;If a comment on a sponsor post violates no rule, sponsors shouldn't be able to hide it for mere political expediency. They're here to attract attention and drum up business; if their practices mean that that attention isn't all as positive and uncritical as they'd like, that's for them to handle -- not DEV.&lt;/p&gt;

&lt;p&gt;What do you think?&lt;/p&gt;

</description>
      <category>meta</category>
      <category>discuss</category>
      <category>codeland</category>
    </item>
    <item>
      <title>Plex: A Life</title>
      <dc:creator>Dian Fay</dc:creator>
      <pubDate>Fri, 06 Sep 2019 00:00:00 +0000</pubDate>
      <link>https://forem.com/dmfay/plex-a-life-1n9a</link>
      <guid>https://forem.com/dmfay/plex-a-life-1n9a</guid>
      <description>&lt;p&gt;A little while back I got my hands on a copy of &lt;em&gt;Software Development and Reality Construction&lt;/em&gt;, the output of a conference held in Berlin in 1988. Among a variety of other more or less philosophical treatments of the theory and practice of software development, Don Knuth analyzes errors he made in his work on TeX; Kristen "SIMULA" Nygaard reviews his collaboration with labor unions to ensure that software meant to coordinate and control work does not wind up controlling the workers as well, a rather grim read in the era of Uber and Amazon; Heinz Klein and Kalle Lyytinen embark on a discussion of data modeling as production rather than as interpretation or hermeneutics. In all, it's some of the most insightful writing about programming and software engineering I've encountered.&lt;/p&gt;

&lt;p&gt;This isn't about those contributions.&lt;/p&gt;

&lt;p&gt;There's an entry fairly early on from one Douglas T. Ross, called "From Scientific Practice to Epistemological Discovery". Ross, who died in 2007, was a computer scientist and engineer most remembered today for the influential APT machine tools programming language and for coining the term "computer-aided design".&lt;/p&gt;

&lt;p&gt;This isn't about the things Doug Ross is remembered for.&lt;/p&gt;

&lt;p&gt;Doug Ross had a &lt;em&gt;system&lt;/em&gt;. The system began its public life as an early software engineering methodology in the Cambrian explosion of such methodologies enabled by the spread of high-level programming languages in the 60s and 70s. The system went by a few names. Ross's company, SofTech Inc., called it the Structured Analysis and Design Technique or SADT. The US Air Force, never wont to use merely one acronym where two will do, called it IDEF0: ICAM (Integrated Computer Aided Manufacturing) DEFinition for function modeling.&lt;/p&gt;

&lt;p&gt;To Doug Ross, the system was Plex. And Plex was everything. When the Department of Defense cut the Structured Analysis data modeling approach from IDEF0 in favor of a simpler methodology to be developed by SofTech subcontractors and named IDEF1, Ross decried the decision as destroying the "mathematical elegance and symmetric completeness of SADT [...] IDEF0 became merely the best of a competing zoo of other software development CASE tools, none of which were scientifically founded". He saw his career, and, indeed, his life, as drawing him inevitably toward the discovery and promulgation of his "philosophy of problem-solving", and furthering Plex's development became more and more important to him as time went on. In the mid-80s, he stopped drawing a salary at SofTech and went back to MIT, lecturing part-time on electrical engineering in order to focus more of his efforts on Plex.&lt;/p&gt;

&lt;p&gt;But even MIT was, in Ross's own words, "not yet ready for [Structured Analysis] much less Plex". A graduate seminar on Plex itself was briefly offered in 1984, but was canceled due to lack of student interest. In "From Scientific Practice" Ross bemoans his inability to gain traction for Plex, writing of feeling "an intolerable burden of responsibility to still be the only person in the world (to my knowledge) pursuing it". His only recourse was to turn inward and "generate book after book on Plex in my office at home, in order that Plex will be ready when the world is ready for it!"&lt;/p&gt;

&lt;p&gt;At this point, Doug Ross might be sounding a little bit like a crank. Let me be clear: Douglas T. Ross, computer science pioneer, was &lt;em&gt;absolutely&lt;/em&gt; a crank of the first water. This is just as absolutely to his credit; any fool can make it from the sublime to the ridiculous, but it takes real talent to go in the other direction. And Plex &lt;em&gt;is&lt;/em&gt; sublime, if in its own dry, academic way. Ross is not the celestial paranoiac Francis E. Dec, ranting and raving about the &lt;a href="http://www.bentoandstarchky.com/dec/intro.htm"&gt;Worldwide Deadly Gangster Communist Computer God&lt;/a&gt; and lunar brain storage depots; nor is Plex the gonzo experience of &lt;a href="https://timecube.2enp.com/"&gt;Nature's Harmonic Simultaneous 4-Day Time Cube&lt;/a&gt;. That said, Ross never devolves into the racist vituperations Dec and Time Cube's Gene Ray were sometimes given to, either. So it goes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⁂&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Plex itself is a sprawling, incoherent metaphysics built, according to Ross, on the foundation of a single pun (or, more properly, double entendre): "nothing can be left out". Thus inspired, Ross embarks upon the classic Cartesian thought experiment. But where Descartes discards every proposition except the cogito ("I think, therefore I am"), Ross's buck stops at "nothing doesn't exist".&lt;/p&gt;

&lt;p&gt;Or, in Ross's own framing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Nothing doesn't exist&lt;/strong&gt;. That is &lt;em&gt;the&lt;/em&gt; &lt;strong&gt;First Definition&lt;/strong&gt; of Plex -- a scientific philosophy whose aim is &lt;em&gt;understanding our understanding of the nature of nature&lt;/em&gt;. Plex does not attempt to understand nature &lt;em&gt;itself&lt;/em&gt;, but only our &lt;em&gt;understanding&lt;/em&gt; of it. We are &lt;em&gt;included&lt;/em&gt; in nature as we do "our understanding", both scientific and informal, so we must understand &lt;em&gt;ourselves&lt;/em&gt; as well -- not just what we &lt;em&gt;think&lt;/em&gt; we are, but as we &lt;em&gt;really&lt;/em&gt; are, as &lt;em&gt;integral, natural &lt;strong&gt;beings&lt;/strong&gt; of nature&lt;/em&gt;. &lt;em&gt;How&lt;/em&gt; one "understand"s and even who "we" &lt;em&gt;are&lt;/em&gt; as we &lt;em&gt;do&lt;/em&gt; "our understanding" necessarily is left completely open, for all that must arise &lt;em&gt;naturally&lt;/em&gt; from the very &lt;em&gt;nature&lt;/em&gt; of nature.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All emphasis -- all of it, I assure you -- original. Ross's dedication to bold and italic text wavers from work to work and page to page, but on balance "From Scientific Practice to Epistemological Discovery" is in fine form. Early entries in what he refers to as his "thousands of C-pages" (that is, "chronological working pages", all of which may or may not have been lost) and the &lt;a href="https://groups.csail.mit.edu/mac/projects/studentaut/lecture4/plex_lectures_book_ok.htm"&gt;lecture notes he prepared in 1985&lt;/a&gt; sometimes switch between up to eight colors every few words. The lecture notes are of particular interest compared to the other extant materials, comprising a "study of an SADT Data Model which expresses all aspects of any object which obeys laws of physical cause and effect" delivered as a dialogue between Ross and a genie reminiscent of &lt;em&gt;Gödel, Escher, Bach&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Having arrived at the First Definition, Ross next attempts to deduce everything else from it, claiming that Plex need make no assumptions. "Nothing doesn't exist" leads, expanded this way and that, to "Only that which is &lt;em&gt;known by definition&lt;/em&gt; is &lt;strong&gt;known&lt;/strong&gt; -- by definition", as, "&lt;em&gt;without&lt;/em&gt; a definition for something, we only can know it as Nothing". Within the space of a few paragraphs, he's slammed what appears to be his own misinterpretation of Stephen Hawking and (unknowingly?) reinvented Spinoza's pantheism, on the grounds that "Nothing &lt;strong&gt;isn't&lt;/strong&gt; ; Plex is what Nothing &lt;strong&gt;isn't&lt;/strong&gt;". And for what it's worth, this is all still in the first two pages of "From Scientific Practice".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⁂&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In another instance, Plex guides Ross to enlightenment regarding questions of information theory. It turns out that a single bit actually requires 3/2 binary digits for encoding, "because the value of the &lt;em&gt;half-bit&lt;/em&gt; is 3/4 !!!".&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;-- which ultimately results from the fact that in &lt;em&gt;actuality&lt;/em&gt;, when you don't have something, it is &lt;em&gt;not&lt;/em&gt; the case that you &lt;em&gt;have&lt;/em&gt; it &lt;em&gt;but&lt;/em&gt; it is Nothing -- it is that you &lt;strong&gt;don't have&lt;/strong&gt; it; whereas when you &lt;em&gt;do&lt;/em&gt; have something, that is because you &lt;strong&gt;don't have&lt;/strong&gt; what it &lt;em&gt;isn't&lt;/em&gt;!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On closer reading, this isn't necessarily the gibberish it might seem at first blush. Plex's foundation in "Nothing" makes &lt;code&gt;zero&lt;/code&gt; the default state. But &lt;code&gt;one&lt;/code&gt; is only understandable when there's an understood meaning for &lt;code&gt;one&lt;/code&gt;. The elaboration about nothings and somethings makes it seem like Ross is counting this other &lt;code&gt;one&lt;/code&gt; -- that is, half a bit -- towards the cost of encoding any other bit. In semiotic terms, this is the &lt;em&gt;interpretant&lt;/em&gt; or subjective value Charles Sanders Peirce sees implicit in signification. But if Ross ever investigated the ways logicians and linguists had already been exploring this territory, there's no indication that he attached any significance (as it were) to their work. And while including the interpretant for half the possible values may yield the same final figure, it does not account for the 3/4 half-bit; so in the face of storage hardware design as practiced, Ross's insistence on 3/2 seems more mystical than scientific.&lt;/p&gt;
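&lt;p&gt;For comparison, the orthodox arithmetic is easy to state: a maximally uncertain binary choice carries exactly one bit of Shannon entropy, and one binary digit stores it. A quick sketch in plain JavaScript -- illustrative only, and nothing here comes from Ross:&lt;/p&gt;

```javascript
// Shannon entropy in bits: H = -sum over outcomes of p * log2(p).
// A fair coin (two equiprobable outcomes) comes out to exactly 1 bit,
// i.e. one binary digit per bit -- the textbook figure, not Ross's 3/2.
function entropyBits(probabilities) {
  let h = 0;
  for (const p of probabilities) {
    if (p > 0) {
      h -= p * Math.log2(p);
    }
  }
  return h;
}

console.log(entropyBits([0.5, 0.5]));   // 1
console.log(entropyBits([0.75, 0.25])); // ~0.811: a biased bit carries less
```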

&lt;p&gt;I have no idea how &lt;em&gt;au courant&lt;/em&gt; Ross was with the humanities in general, but it seems likely that the answer is "not very". He was, of course, quite well-versed in math and engineering. Even deep in the mire of Plex, one can find him struggling to accommodate the realization that he was, in essence, defining formal systems backwards (he settles this with the ingenious maneuver of declaring the distinction akin to chirality), but the only philosopher he mentions is Plato. His efforts at deductive logic too seem thoroughly warped, as evinced by his "proof that every point is the whole world". For reference, an object's "identity" is tautologically defined as above: the set of "that" which "this" isn't.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  I n = 1: A world of one point is the whole world.
 II Assume the theorem is true for (n - 1) points. (n &amp;gt; 1),
     i.e., for any collection of (n - 1) points, every point is the whole world.
     [ed: remember, Plex needs no assumptions, let alone "assume the theorem is true"]
III To prove the theorem for n points given its truth for (n - 1) points
     (n &amp;gt; 1)
     (a) The identity of any one point, p, in the collection is a collection of (n -
         1) points, each of which is the whole world, by II.
     (b) The identity of any other point, q, i.e., a point of the identity of p, is
         a collection of (n - 1) points, each of which is the whole world, by II.
     (c) The identity of p and the identity of q are identical except that where
         the identity of p has q the identity of q has p. In any case p is the
         whole world by (b) and q is the whole world by (a).
     (d) Hence both p and q are the whole world, as are all the other points (if
         any) in their respective identities (and shared between them).
     (e) Hence all n points are the whole world.
 IV For n = 2, I is used (via II) in IIIa and IIIb, q.e.d.
  V Q.E.D. by natural induction.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;As mentioned, Ross generated a wealth of C-pages, lecture notes, and other writings on Plex, but except for a small fraction apparently hosted on &lt;a href="https://groups.csail.mit.edu/mac/projects/studentaut/index.htm"&gt;his last MIT faculty/program page&lt;/a&gt;, I have no idea where most of this collection ended up. If you're interested in reading further in Ross's own words, the best places to start are probably "From Scientific Practice to Epistemological Discovery" in &lt;a href="https://www.researchgate.net/publication/242530010_Software_Development_and_Reality_Construction"&gt;&lt;em&gt;Software Development and Reality Construction&lt;/em&gt;&lt;/a&gt; or &lt;a href="https://groups.csail.mit.edu/mac/projects/studentaut/The%20Plex%20Tract.htm"&gt;The Plex Tract&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Coda&lt;/h2&gt;

&lt;p&gt;Doug Ross himself remains a rather cryptic figure. There's some biographical information out there, but after his birth to missionary parents in what's now Guangdong and childhood homecoming to the Finger Lakes region of New York it mostly concerns where, when, with whom, and on what he was working. In his writings he comes off somewhat full of himself, as tends to be the case with esoteric philosophers and visionaries for whom the world is not yet and will never be ready. But when Ross talks about the necessary perfection, or perfect necessity, of his marriage to his wife Pat, herself a human computer at MIT's Lincoln Laboratory, it's still a little bit charming. And when he writes, with complete seriousness, that "being a pioneer came naturally" to him, I can't exactly say otherwise.&lt;/p&gt;

&lt;p&gt;I wonder what it was like in that conference hall in 1988. I don't know whether the attendees or the organizers knew what they were in for when Ross got up to talk about this beautiful, all-consuming nonsense that was driving him to desperation. But sense isn't everything; and as a project of &lt;em&gt;reality construction&lt;/em&gt; Plex is a monumental accomplishment. And the reality we ourselves have collectively constructed, in which points are points, a bit corresponds to a single binary digit, and genies obstinately refuse to appear no matter how we manipulate bottles, is the richer for its existence.&lt;/p&gt;

</description>
      <category>history</category>
      <category>philosophy</category>
      <category>biography</category>
    </item>
    <item>
      <title>JOIN Semiotics and MassiveJS v6</title>
      <dc:creator>Dian Fay</dc:creator>
      <pubDate>Tue, 13 Aug 2019 00:00:00 +0000</pubDate>
      <link>https://forem.com/dmfay/join-semiotics-and-massivejs-v6-44he</link>
      <guid>https://forem.com/dmfay/join-semiotics-and-massivejs-v6-44he</guid>
      <description>&lt;p&gt;&lt;a href="https://massivejs.org"&gt;MassiveJS&lt;/a&gt; version 6 is imminent. This next release closes the widest remaining gap between Massive-generated APIs and everyday SQL, not to mention other higher-level data access libraries: &lt;code&gt;JOIN&lt;/code&gt;s.&lt;/p&gt;

&lt;p&gt;This is something of a reversal for Massive, which until now has had very limited functionality for working with multiple database entities at once. I've even &lt;a href="https://dev.to/dmfay/centralize-your-query-logic-5bhh"&gt;written about this as a constraint not without benefits&lt;/a&gt; (and, for the record, I think that still -- ad-hoc joins are a tool to be used judiciously in application code!).&lt;/p&gt;

&lt;p&gt;But the main reason for this lack was always that I'd never come up with any solution that didn't fit awkwardly into an already-awkward options object. &lt;a href="https://massivejs.org/docs/persistence#deep-insert"&gt;Deep insert&lt;/a&gt; and &lt;a href="https://massivejs.org/docs/resultset-decomposition"&gt;resultset decomposition&lt;/a&gt; were quite enough to keep track of. I am naturally loath to concede any inherent advantages to constructing models, but this really seemed like one for the longest time.&lt;/p&gt;

&lt;p&gt;There are, however, ways. Here's what Massive joins look like, if we invade the imaginary privacy of an imaginary library system's imaginary patrons:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;whoCheckedOutCalvino&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;libraries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;join&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;books&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;library_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;patron_books&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;LEFT OUTER&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;pk&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;patron_id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;book_id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
      &lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;book_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;books.id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="na"&gt;omit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;who_checked_out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;LEFT OUTER&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;relation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;patrons&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;patron_books.patron_id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;EV&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;books.author ILIKE&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;calvino, %&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;(&lt;code&gt;relation&lt;/code&gt; in this sense indicates a table or view.)&lt;/p&gt;

&lt;p&gt;And the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;East Virginia State U&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;state&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;EV&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;books&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;author&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Calvino, Italo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;library_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;title&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Cosmicomics&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;who_checked_out&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Lauren Ipsum&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Neitherfolk Public Library&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;state&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;EV&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;books&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;author&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Calvino, Italo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;library_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;title&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Cosmicomics&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;who_checked_out&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Daler S. Ahmet&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;author&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Calvino, Italo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;library_id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;title&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Invisible Cities&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;who_checked_out&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Or in other words, exactly what you'd hope it would look like -- and what, if you use Massive, you may previously have needed a view and a decomposition schema to achieve. This is a moderately complex example, and between defaults (e.g. &lt;code&gt;type&lt;/code&gt; defaulting to &lt;code&gt;INNER&lt;/code&gt;) and introspection, declaring a join can be as simple as naming the target: &lt;code&gt;db.libraries.join('books')&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The join schema is something of an evolution on the decomposition schema, sharing the same structure but inferring column lists, table primary keys, and even some &lt;code&gt;on&lt;/code&gt; conditions where unambiguous foreign key relationships exist. It's more concise, less fragile, and still only defined exactly when and where it's needed. Even better, compound entities created from tables can use persistence methods, meaning that &lt;code&gt;join()&lt;/code&gt; can replace many if not most existing usages of deep insert and resultset decomposition.&lt;/p&gt;
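&lt;p&gt;The decomposition step itself is easy to picture outside Massive. Here's a hand-rolled sketch -- not Massive's actual implementation, and the flat row shape and key names are invented for illustration -- of folding joined rows, one per library/book pair, into the nested output shown above:&lt;/p&gt;

```javascript
// Fold flat JOIN results (one row per library/book pairing) into the
// nested object tree a join() or decomposition schema produces.
// Row shape is hypothetical: {library_id, library_name, book_id, title}.
function nestLibraries(rows) {
  const byId = new Map();
  for (const row of rows) {
    if (!byId.has(row.library_id)) {
      byId.set(row.library_id, {
        id: row.library_id,
        name: row.library_name,
        books: []
      });
    }
    // A null book_id means the LEFT OUTER JOIN found no match.
    if (row.book_id !== null && row.book_id !== undefined) {
      byId.get(row.library_id).books.push({ id: row.book_id, title: row.title });
    }
  }
  return Array.from(byId.values());
}

const flat = [
  { library_id: 3, library_name: 'Neitherfolk Public Library', book_id: 2, title: 'Cosmicomics' },
  { library_id: 3, library_name: 'Neitherfolk Public Library', book_id: 4, title: 'Invisible Cities' }
];
console.log(JSON.stringify(nestLibraries(flat), null, 2));
```

&lt;p&gt;The real thing keys on primary keys rather than a single id column -- which is why &lt;code&gt;patron_books&lt;/code&gt; needed its &lt;code&gt;pk&lt;/code&gt; spelled out in the earlier example -- but the shape of the work is the same.&lt;/p&gt;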

&lt;p&gt;It might seem a little unconventional to just invent ersatz database entities out of whole cloth. There's some precedent -- Massive already treats scripts like database functions -- but the compound entities created by &lt;code&gt;Readable.join()&lt;/code&gt; are a good bit more complex than that. There's a method to this madness though, and its origins date back to before Ted Codd came up with the idea of the relational database itself.&lt;/p&gt;

&lt;h2&gt;Semiotics from 30,000 Feet&lt;/h2&gt;

&lt;p&gt;Semiotics is, briefly, the study of meaning-making, with 19th-century roots in both linguistics and formal logic. It's also a sprawling intellectual tradition in dialogue with multifarious other sprawling intellectual traditions, so I am not remotely going to do it justice here. The foundational idea is credited on the linguistics side to Ferdinand de Saussure: meaning is produced in the relation of a &lt;em&gt;signifier&lt;/em&gt; to a &lt;em&gt;signified&lt;/em&gt;, or taken together a &lt;em&gt;sign&lt;/em&gt;. Smoke to fire, letter to sound, and so forth. Everything else proceeds from that relationship. There is, of course, a lot more of that everything else, and like so many other foundational ideas the original Saussurean dyad is something of a museum piece.&lt;/p&gt;

&lt;p&gt;But the idea of theorizing meaning itself in almost algebraic terms would outlive de Saussure. The logician Charles Sanders Peirce had already come to similar conclusions, and had realized to boot that the interpreted value of the signifier's relationship to its signified is as important as the other two. Peirce, following this line of reasoning, understood this "interpretant" itself to be a sign comprising its own signifier and signified which in turn yield their own interpretant, in infinite chains of signification. Louis Hjelmslev, meanwhile, reimagined de Saussure's dyad as a relation of &lt;em&gt;expression&lt;/em&gt; to &lt;em&gt;content&lt;/em&gt;, and added a second dimension of &lt;em&gt;form&lt;/em&gt; and &lt;em&gt;substance&lt;/em&gt;. To Hjelmslev, a sign is a function, in the mathematical sense, mapping the "form of expression" to the "form of content", naming as the "substance of expression" and "substance of content" the raw materials formed into the sign.&lt;/p&gt;

&lt;p&gt;The use of the term "substance" sounds kind of like some sort of philosophically-détourned jargon, but there are no tricks here: it's just &lt;em&gt;stuff&lt;/em&gt;. There's no more specific designation than the likes of "substance" for "that which has been made into a sign"; the category includes everything from physical materials to light, gesture, positioning, electricity, and more, in endless combinations. A sign is created by these matters being selected and formed into content and expression: fuel, oxygen, and heat organized into fire and smoke, or sounds uttered in an order corresponding to a known linguistic quantity. It should be said also that consciousness need not enter into it: anything can make a sign, and even a plant can interpret one.&lt;/p&gt;

&lt;p&gt;This all is to say: there's stuff out there, and what it has in common is that it is made to mean things. Most stuff, in fact, is constantly meaning many things at the same time, as long as there's an interpreting process -- and there's always &lt;em&gt;something&lt;/em&gt;. The philosopher-psychologist tag team of Gilles Deleuze and Felix Guattari envisioned the primordial soup of matters-awaiting-further-formation as a spatial dimension: the &lt;em&gt;plane of consistency&lt;/em&gt; or &lt;em&gt;plane of immanence&lt;/em&gt;. Signification, as they proposed in &lt;em&gt;1000 Plateaus&lt;/em&gt;, happens on and above the plane of consistency, as matters are selected and drawn up from it to become substance and sign. The recursive nature of signification means that these signs are then selected into the substance of yet other signs, becoming layers or strata on the plane in a fashion they compare to the formation of sedimentary rock.&lt;/p&gt;

&lt;h2&gt;Signs and Databases&lt;/h2&gt;

&lt;p&gt;A database management system, like any other program, is an immensely complex system of signs. However, what sets DBMSs (and some other categories of software, like ledgers and version control systems) apart is that they're designed to manage &lt;em&gt;other&lt;/em&gt; systems of signs. Thanks to this recursive aspect, a database can be imagined as a plane of consistency, a space from which any combination of unformed bytes might be drawn up into column-signs and row-signs which in turn are gathered into table-signs and view-signs and query-signs.&lt;/p&gt;

&lt;p&gt;And if tables and views and queries are all still signs at base, where exactly do the differences come in? Tables store persistent data and are therefore mutable, while views and queries do not and are not, and must be constituted from tables themselves and (in the case of views) from each other. Tables constitute a lower stratum of signs, with views forming table- and view-substance into signs on higher strata, and queries higher still, at a sufficient remove from the plane of consistency that they're no longer stored in the database itself.&lt;/p&gt;

&lt;p&gt;This is, of course, arriving at inheritance the long way around. In Massive terms, database entities are first instances of a base &lt;code&gt;Entity&lt;/code&gt; class, after which they inherit a second prototype: one of &lt;code&gt;Sequence&lt;/code&gt;, &lt;code&gt;Executable&lt;/code&gt;, or &lt;code&gt;Readable&lt;/code&gt;. Some of the latter may be further articulated as &lt;code&gt;Writable&lt;/code&gt;s, as well; there are no &lt;code&gt;Writable&lt;/code&gt;s which are not also &lt;code&gt;Readable&lt;/code&gt;s.&lt;/p&gt;
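&lt;p&gt;That layering is straightforward to picture in plain JavaScript. A toy sketch only, not Massive's source -- the class names come from the paragraph above, everything else is invented:&lt;/p&gt;

```javascript
// Toy model of the Entity -> Readable -> Writable layering described above.
class Entity {
  constructor(name) {
    this.name = name;
  }
}

// Views and tables alike can be queried...
class Readable extends Entity {
  find() {
    return `SELECT * FROM ${this.name}`;
  }
}

// ...but only tables persist data. Writable extends Readable, so there
// can be no Writable which is not also a Readable.
class Writable extends Readable {
  insert(record) {
    const columns = Object.keys(record).join(', ');
    return `INSERT INTO ${this.name} (${columns}) ...`;
  }
}

const view = new Readable('recent_checkouts');
const table = new Writable('books');
console.log(view.find());               // SELECT * FROM recent_checkouts
console.log(table instanceof Readable); // true
```

&lt;p&gt;The prototype chain encodes the strata directly: a &lt;code&gt;Readable&lt;/code&gt; with no &lt;code&gt;insert&lt;/code&gt; is the view case, and a &lt;code&gt;Writable&lt;/code&gt; inherits querying for free.&lt;/p&gt;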

&lt;p&gt;But there's more than one thing happening here, and the ordering of tables, views, and database functions into class-strata is the second step -- matters must be chosen before they can be formed into signs. It's in this first step of stratification that Massive adds script files to the API system of signs, treating them (almost) identically to functions and procedures.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Readable.join()&lt;/code&gt; takes the same idea further to expand on the database's relations: before, a &lt;code&gt;Readable&lt;/code&gt; mapped one-to-one with a single table or view. But as long as SQL can be generated to suit, there's no reason one &lt;code&gt;Readable&lt;/code&gt; couldn't map to multiple relations. &lt;code&gt;Writable&lt;/code&gt;s too, for that matter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;librariesWithBooks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;libraries&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;books&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;libraryMembers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;patrons&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;libraries&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// inserts work exactly like deep insert, persisting an&lt;/span&gt;
&lt;span class="c1"&gt;// entire object tree&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;newLibrary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;librariesWithBooks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Lichfield Public Library&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;EV&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;books&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="na"&gt;library_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Jurgen: A Comedy of Justice&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Cabell, James Branch&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;library_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;If On a Winter&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s1"&gt;s Night a Traveller&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Calvino, Italo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// updates make changes in the origin table, based on&lt;/span&gt;
&lt;span class="c1"&gt;// criteria which can reference the joined tables&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;withCabell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;librariesWithBooks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;books.author ilike&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cabell, %&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;has_cabell&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// deletes, like updates, affect the origin table only&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;iplPatrons&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;libraryMembers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;destroy&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;libraries.name ilike&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Imaginary Public Library&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Try it Out!
&lt;/h2&gt;

&lt;p&gt;The first v6 prerelease is available now: &lt;code&gt;npm i massive@next&lt;/code&gt;. There's now a &lt;a href="https://massivejs.org/docs/prerelease"&gt;prerelease section of the docs&lt;/a&gt; going over what's new and different in detail. But to sum up the other changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node &amp;lt; 7.6 is no longer supported.&lt;/li&gt;
&lt;li&gt;Implicit ordering has been dropped.&lt;/li&gt;
&lt;li&gt;Resultset decomposition now yields arrays instead of objects by default. The &lt;code&gt;array&lt;/code&gt; schema field is no longer recognized, and you'll need to remove it from your existing decomposition schemas. To yield objects, set &lt;code&gt;decomposeTo: 'object'&lt;/code&gt; instead.&lt;/li&gt;
&lt;li&gt;JSON and JSONB properties are now sorted as their original type instead of being processed as text.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;type&lt;/code&gt; property of the &lt;code&gt;order&lt;/code&gt; option has been deprecated in favor of Postgres-style &lt;code&gt;field::type&lt;/code&gt; casting as used elsewhere. It will continue to work through the 6.x lifecycle but may be removed in a subsequent major release.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a feature I've been wishing I could make happen somehow ever since I first published the original resultset decomposition Gist more than two years ago. It's involved extensive changes to table loading, criteria parsing, and statement generation. I've endeavored &lt;em&gt;not&lt;/em&gt; to break these areas, and have informally experimented by dropping pre-prerelease versions into an existing codebase. Results have been good, but should you find an issue with this or any other Massive functionality, please &lt;a href="https://gitlab.com/dmfay/massive-js/issues"&gt;let me know&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;I'm really excited to see just how far joins expand Massive's capabilities, but in truth there's just one thing I think I and most other Massive users will get the most mileage out of: plain old query predicate generation with criteria objects, without having to define and manage a plethora of views to cover basic &lt;code&gt;JOIN&lt;/code&gt;s. Stratification is a useful way to think about the production of meaning -- but strata themselves can also be dead weight.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>showdev</category>
      <category>sql</category>
      <category>database</category>
    </item>
    <item>
      <title>A Self-Sourcing Cassandra Cluster with SaltStack and EC2</title>
      <dc:creator>Dian Fay</dc:creator>
      <pubDate>Wed, 27 Mar 2019 00:00:00 +0000</pubDate>
      <link>https://forem.com/dmfay/a-self-sourcing-cassandra-cluster-with-saltstack-and-ec2-48kd</link>
      <guid>https://forem.com/dmfay/a-self-sourcing-cassandra-cluster-with-saltstack-and-ec2-48kd</guid>
      <description>&lt;p&gt;Anybody doing something interesting to a production Cassandra cluster is generally advised, for a host of excellent reasons, to try it out in a test environment first. Here's how to make those environments effectively disposable.&lt;/p&gt;

&lt;p&gt;The something interesting we're trying to do to our Cassandra cluster is actually two somethings: upgrading from v2 to v3, while also factoring Cassandra itself out from the group of EC2 servers that currently run Cassandra-and-also-some-other-important-stuff. We have a "pets" situation and want a "cattle" situation, per Bill Baker: pets have names and you care deeply about each one's welfare, while cattle are, not to put too fine a point on it, fungible. If we can bring new dedicated nodes into the cluster, start removing the original nodes as replication takes its course, and finally upgrade this Database of Theseus, that'll be some significant progress -- and without downtime, even! But it's going to take a lot of testing, to say nothing of managing the new nodes for real.&lt;/p&gt;

&lt;p&gt;We already use SaltStack to monitor and manage other areas of our infrastructure besides the data pipeline, and SaltStack includes a "salt-cloud" module which can work with EC2. I'd rather have a single infra-as-code solution, so that part's all good. What isn't: the &lt;a href="https://github.com/salt-formulas/salt-formula-cassandra"&gt;official Cassandra formula&lt;/a&gt; is geared more towards single-node instances or some-assembly-required clusters, and provisioning is a separate concern. I expect to be creating and destroying clusters with abandon, so I need this to be as automatic as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Salt-Cloud Configuration
&lt;/h2&gt;

&lt;p&gt;The first part of connecting salt-cloud is to set up a provider and profile. On the Salt master, these are in /etc/cloud.providers.d and /etc/cloud.profiles.d. We keep everything in source control and symlink these directories.&lt;/p&gt;

&lt;p&gt;Our cloud stuff is hosted on AWS, so we're using the &lt;a href="https://docs.saltstack.com/en/latest/topics/cloud/aws.html"&gt;EC2 provider&lt;/a&gt;. That part is basically stock, but in profiles we do need to define a template for the Cassandra nodes themselves.&lt;/p&gt;
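&lt;p&gt;For reference, a minimal provider definition looks something like this (abridged, with placeholder credentials and paths; see the linked salt-cloud documentation for the full set of options):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my-ec2-provider:
  driver: ec2
  id: 'YOUR_AWS_ACCESS_KEY'
  key: 'YOUR_AWS_SECRET_KEY'
  keyname: your-keypair-name
  private_key: /etc/salt/your-keypair.pem
  location: us-west-2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The top-level name (&lt;code&gt;my-ec2-provider&lt;/code&gt; here) is what the profile's &lt;code&gt;provider&lt;/code&gt; field refers back to.&lt;/p&gt;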

&lt;h3&gt;
  
  
  etc/cloud.profiles.d/ec2.conf
&lt;/h3&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cassandra_node:
  provider: [your provider from etc/cloud.providers.d/ec2.conf]
  image: ami-abc123
  ssh_interface: private_ips
  size: m5.large
  securitygroup:
    - default
    - others
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  cassandra-test.map
&lt;/h3&gt;

&lt;p&gt;With the &lt;code&gt;cassandra_node&lt;/code&gt; template defined in the profile configuration, we can establish the cluster layout in a &lt;em&gt;map file&lt;/em&gt;. The filename doesn't matter; mine is cassandra-test.map. One important thing to note is that we're establishing a naming convention for our nodes: &lt;code&gt;cassandra-*&lt;/code&gt;. Each node is also defined as &lt;code&gt;t2.small&lt;/code&gt; size, overriding the default &lt;code&gt;m5.large&lt;/code&gt; -- we don't need all that horsepower while we're just testing! &lt;code&gt;t2.micro&lt;/code&gt; instances, however, did prove to be too underpowered to run Cassandra.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cassandra_node:
  - cassandra-1:
      size: t2.small
      cassandra-seed: true
  - cassandra-2:
      size: t2.small
      cassandra-seed: true
  - cassandra-3:
      size: t2.small
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;code&gt;cassandra-seed&lt;/code&gt; (and &lt;code&gt;size&lt;/code&gt;, for that matter) is a &lt;em&gt;grain&lt;/em&gt;, a fact each Salt-managed "minion" knows about itself. When Cassandra comes up in a multi-node configuration, each node looks for help joining the cluster from a list of "seed" nodes. Without seeds, nothing can join the cluster; however, only non-seeds will bootstrap data from the seeds on joining so it's not a good idea to make everything a seed. And the seed layout needs to toposort: if A has B and C for seeds, B has A and C, and C has A and B, it's the same situation as no seeds. If two instances know that they're special somehow, we can use grain matching to target them specifically.&lt;/p&gt;
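&lt;p&gt;Once minions are up, grain matching lets us address the seeds directly; for example, a quick sanity check that both seed nodes are responding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo salt -G 'cassandra-seed:true' test.ping
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;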

&lt;h2&gt;
  
  
  Pillar and Mine
&lt;/h2&gt;

&lt;p&gt;The Salt "pillar" is a centralized configuration database stored on the master. Minions make local copies on initialization, and their caches can be updated with &lt;code&gt;salt minion-name saltutil.refresh_pillar&lt;/code&gt;. Pillars can target nodes based on name, grains, or other criteria, and are commonly used to store configuration. We have a lot of configuration, and most of it will be the same for all nodes, so using pillars is a natural fit.&lt;/p&gt;

&lt;h3&gt;
  
  
  srv/salt/pillar/top.sls
&lt;/h3&gt;

&lt;p&gt;Like the &lt;code&gt;top.sls&lt;/code&gt; for Salt itself, the Pillar &lt;code&gt;top.sls&lt;/code&gt; defines a &lt;em&gt;highstate&lt;/em&gt; or default state for new minions. First, we declare that the pillars we're adding appertain to minions whose names match the pattern &lt;code&gt;cassandra-*&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;base:
  'cassandra-*':
    - system-user-ubuntu
    - mine-network-info
    - java
    - cassandra
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  srv/salt/pillar/system-user-ubuntu.sls
&lt;/h3&gt;

&lt;p&gt;Nothing special here, just a user so we can ssh in and poke things. The private key for the user is defined in the cloud provider configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;system:
  user: ubuntu
  home: /home/ubuntu
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  srv/salt/pillar/mine-network-info.sls
&lt;/h3&gt;

&lt;p&gt;The Salt "mine" is another centralized database, this one storing grain information so minions can retrieve facts about other minions from the master instead of dealing with peer-to-peer communication. Minions use a &lt;code&gt;mine_functions&lt;/code&gt; pillar (or salt-minion configuration, but we're sticking with the pillar) to determine whether and what to store. For Cassandra nodes, we want internal network configuration and the public DNS name, which latter each node has to get by asking AWS where it is with &lt;code&gt;curl&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mine_functions:
  network.interfaces: [eth0]
  network.ip_addrs: [eth0]
  # ask amazon's network config what we're public as
  public_dns:
    - mine_function: cmd.run
    - 'curl -s http://169.254.169.254/latest/meta-data/public-hostname'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  srv/salt/pillar/java.sls
&lt;/h3&gt;

&lt;p&gt;Cassandra requires Java 8 to be installed (&lt;a href="https://issues.apache.org/jira/browse/CASSANDRA-9608"&gt;prospective Java 9 support became prospective Java 11 support&lt;/a&gt; and is due with Cassandra 4). This pillar sets up the &lt;a href="https://github.com/saltstack-formulas/sun-java-formula"&gt;official Java formula&lt;/a&gt; accordingly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;java:
  # vitals
  release: '8'
  major: '0'
  minor: '202'
  development: false

  # tarball
  prefix: /usr/share/java # unpack here
  version_name: jdk1.8.0_202 # root directory name
  source_url: https://download.oracle.com/otn-pub/java/jdk/8u202-b08/1961070e4c9b4e26a04e7f5a083f551e/server-jre-8u202-linux-x64.tar.gz
  #source_hash: sha256=9efb1493fcf636e39c94f47bacf4f4324821df2d3aeea2dc3ea1bdc86428cb82
  source_hash: sha256=61292e9d9ef84d9702f0e30f57b208e8fbd9a272d87cd530aece4f5213c98e4e
  dl_opts: -b oraclelicense=accept-securebackup-cookie -L
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  srv/salt/pillar/cassandra.sls
&lt;/h3&gt;

&lt;p&gt;Finally, the Cassandra pillar defines properties common to all nodes in the cluster. My upgrade plan is to bring everything up on 2.2.12, switch the central pillar definition over, and then supply the new version number to each minion by refreshing its pillar as part of the upgrade process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cassandra:
  version: '2.2.12'
  cluster_name: 'Test Cluster'
  authenticator: 'AllowAllAuthenticator'
  endpoint_snitch: 'Ec2Snitch'
  twcs_jar:
    '2.2.12': 'TimeWindowCompactionStrategy-2.2.5.jar'
    '3.0.8': 'TimeWindowCompactionStrategy-3.0.0.jar'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;twcs_jar&lt;/code&gt; dictionary gets into one of the reasons I'm not using the official formula: we're using the &lt;a href="http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html"&gt;TimeWindowCompactionStrategy&lt;/a&gt;. TWCS was integrated into Cassandra starting in 3.0.8 or 3.8, but it has to be compiled and installed separately for earlier versions. Pre-integration versions of TWCS also have a different package name (&lt;code&gt;com.jeffjirsa&lt;/code&gt; instead of &lt;code&gt;org.apache&lt;/code&gt;). 3.0.8 is the common point, having the &lt;code&gt;org.apache&lt;/code&gt; TWCS built in but also being a valid compilation target for the &lt;code&gt;com.jeffjirsa&lt;/code&gt; TWCS. After upgrading to 3.0.8 I'll be able to &lt;code&gt;ALTER TABLE&lt;/code&gt; to apply the &lt;code&gt;org.apache&lt;/code&gt; version before proceeding.&lt;/p&gt;
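&lt;p&gt;That &lt;code&gt;ALTER TABLE&lt;/code&gt; is a one-statement change per affected table, along these lines (the keyspace, table, and window settings here are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE metrics.events
  WITH compaction = {
    'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
  };
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;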

&lt;p&gt;With the provider, profile, map file, and pillars set up, we can now spin up a barebones cluster of Ubuntu VMs and retrieve the centrally-stored network information from the Salt mine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo salt-cloud -m cassandra-test.map

sudo salt 'cassandra-1' 'mine.get' '*' 'public_dns'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We can't do much else, since we don't have anything installed on the nodes yet, but it's progress!&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cassandra State
&lt;/h2&gt;

&lt;p&gt;The state definition includes everything a Cassandra node &lt;em&gt;has&lt;/em&gt; to have in order to be part of the cluster: the installed binaries, a &lt;code&gt;cassandra&lt;/code&gt; group and user, a config file, a data directory, and a running SystemD unit. The definition itself is sort of an ouroboros of YAML and Jinja:&lt;/p&gt;

&lt;h3&gt;
  
  
  srv/salt/cassandra/defaults.yaml
&lt;/h3&gt;

&lt;p&gt;First, there's a perfectly ordinary YAML file with some defaults. These could easily be in the pillar we set up above (or the pillar config could all be in this file); the principal distinction seems to be in whether you want to propagate changes by refreshing pillars with &lt;code&gt;saltutil.refresh_pillar&lt;/code&gt; or by (re)applying the Cassandra state either directly or via highstate. This is definitely more complicated than it needs to be right now, but given that this is my first major SaltStack project, I don't yet know enough to land on one side or the other, or if combining a defaults file with the pillar configuration will eventually be necessary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cassandra:
  dc: dc1
  rack: rack1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  srv/salt/cassandra/map.jinja
&lt;/h3&gt;

&lt;p&gt;The map template loads the defaults file and merges them with the pillar, creating a &lt;code&gt;server&lt;/code&gt; dictionary with all the Cassandra parameters we're setting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{% import_yaml "cassandra/defaults.yaml" as default_settings %}

{% set server = salt['pillar.get']('cassandra', default=default_settings.cassandra, merge=True) %}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
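&lt;p&gt;The &lt;code&gt;merge=True&lt;/code&gt; argument matters here: pillar values override the defaults key by key rather than replacing the whole dictionary. A minimal Python sketch of that behavior (an illustration of the semantics, not Salt's actual implementation):&lt;/p&gt;

```python
def merge(defaults, overrides):
    """Layer overrides on top of defaults, recursing into nested
    dictionaries so overrides win key by key."""
    result = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

# defaults.yaml contents alongside (a subset of) the cassandra pillar
defaults = {"dc": "dc1", "rack": "rack1"}
pillar = {"version": "2.2.12", "cluster_name": "Test Cluster"}

server = merge(defaults, pillar)
```

&lt;p&gt;So every node ends up with &lt;code&gt;dc&lt;/code&gt; and &lt;code&gt;rack&lt;/code&gt; from the defaults file unless a pillar explicitly overrides them.&lt;/p&gt;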



&lt;h3&gt;
  
  
  srv/salt/cassandra/init.sls
&lt;/h3&gt;

&lt;p&gt;Finally, the Cassandra state entrypoint init.sls is another Jinja template that happens to look a lot like a YAML file and renders a YAML file, which for SaltStack is good enough. Jinja is required here since values from the &lt;code&gt;server&lt;/code&gt; dictionary, like the server version or the TWCS JAR filename, need to be interpolated at the time the state is applied.&lt;/p&gt;

&lt;p&gt;When the Cassandra state is applied to a fresh minion:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;wget&lt;/code&gt; will be installed&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;CASSANDRA_VERSION&lt;/code&gt; environment variable will be set to the value defined in the pillar&lt;/li&gt;
&lt;li&gt;A user and group named &lt;code&gt;cassandra&lt;/code&gt; will be created&lt;/li&gt;
&lt;li&gt;A script named &lt;code&gt;install.sh&lt;/code&gt; will download and extract Cassandra itself, once the above three conditions are met&lt;/li&gt;
&lt;li&gt;A node configuration file named &lt;code&gt;cassandra.yaml&lt;/code&gt; will be generated from a Jinja template and installed to &lt;code&gt;/etc/cassandra&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If necessary, the TWCS jar will be added to the Cassandra lib directory&lt;/li&gt;
&lt;li&gt;The directory &lt;code&gt;/var/lib/cassandra&lt;/code&gt; will be created and chowned to the &lt;code&gt;cassandra&lt;/code&gt; user&lt;/li&gt;
&lt;li&gt;A SystemD unit for Cassandra will be installed and started once all its prerequisites are in order
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{% from "cassandra/map.jinja" import server with context %}

wget:
  pkg.installed

cassandra:
  environ.setenv:
    - name: CASSANDRA_VERSION
    - value: {{ server.version }}

  cmd.script:
    - require:
      - pkg: wget
      - user: cassandra
      - environ: CASSANDRA_VERSION
    - source: salt://cassandra/files/install.sh
    - user: root
    - cwd: ~

  group.present: []

  user.present:
    - require:
      - group: cassandra
    - gid_from_name: True
    - createhome: False

  service.running:
    - enable: True
    - require:
      - file: /etc/cassandra/cassandra.yaml
      - file: /etc/systemd/system/cassandra.service
{%- if server.twcs_jar[server.version] %}
      - file: /opt/cassandra/lib/{{ server.twcs_jar[server.version] }}
{%- endif %}

# Main configuration
/etc/cassandra/cassandra.yaml:
  file.managed:
    - source: salt://cassandra/files/{{ server.version }}/cassandra.yaml
    - template: jinja
    - makedirs: True
    - user: cassandra
    - group: cassandra
    - mode: 644

# Load TWCS jar if necessary
{%- if server.twcs_jar[server.version] %}
/opt/cassandra/lib/{{ server.twcs_jar[server.version] }}:
  file.managed:
    - require:
      - user: cassandra
      - group: cassandra
    - source: salt://cassandra/files/{{ server.version }}/{{ server.twcs_jar[server.version] }}
    - user: cassandra
    - group: cassandra
    - mode: 644
{%- endif %}

# Data directory
/var/lib/cassandra:
  file.directory:
    - user: cassandra
    - group: cassandra
    - mode: 755

# SystemD unit
/etc/systemd/system/cassandra.service:
  file.managed:
    - source: salt://cassandra/files/cassandra.service
    - user: root
    - group: root
    - mode: 644
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  srv/salt/cassandra/files/install.sh
&lt;/h3&gt;

&lt;p&gt;This script downloads and extracts the target version of Cassandra and points the symlink &lt;code&gt;/opt/cassandra&lt;/code&gt; to it. If the target version already exists, it just updates the symlink since everything else is already set up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash

update_symlink() {
  rm /opt/cassandra
  ln -s "/opt/apache-cassandra-$CASSANDRA_VERSION" /opt/cassandra

  echo "Updated symlink"
}

# already installed?
if [ -d "/opt/apache-cassandra-$CASSANDRA_VERSION" ]; then
  echo "Cassandra $CASSANDRA_VERSION is already installed!"

  update_symlink

  exit 0
fi

# download and extract
wget "https://archive.apache.org/dist/cassandra/$CASSANDRA_VERSION/apache-cassandra-$CASSANDRA_VERSION-bin.tar.gz"
tar xf "apache-cassandra-$CASSANDRA_VERSION-bin.tar.gz"
rm "apache-cassandra-$CASSANDRA_VERSION-bin.tar.gz"

# install to /opt and link /opt/cassandra
mv "apache-cassandra-$CASSANDRA_VERSION" /opt
update_symlink

# create log directory
mkdir -p /opt/cassandra/logs

# set ownership
chown -R cassandra:cassandra "/opt/apache-cassandra-$CASSANDRA_VERSION"
chown cassandra:cassandra /opt/cassandra
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It's probably possible to do most of this, at least the symlink juggling and directory management, with "pure" Salt (and the environment variable could be eliminated by rendering &lt;code&gt;install.sh&lt;/code&gt; as a Jinja template with the &lt;code&gt;server&lt;/code&gt; dictionary), but the script does what I want it to and it's already idempotent and centrally managed.&lt;/p&gt;

&lt;h3&gt;
  
  
  srv/salt/cassandra/files/cassandra.service
&lt;/h3&gt;

&lt;p&gt;This is a basic SystemD unit, with some system limits customized to give Cassandra enough room to run. It starts whatever Cassandra executable it finds at /opt/cassandra, so all that's necessary to resume operations after the symlink changes during the upgrade is to restart the service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Unit]
Description=Apache Cassandra database server
Documentation=http://cassandra.apache.org
Requires=network.target remote-fs.target
After=network.target remote-fs.target

[Service]
Type=forking
User=cassandra
Group=cassandra
ExecStart=/opt/cassandra/bin/cassandra -Dcassandra.config=file:///etc/cassandra/cassandra.yaml
LimitNOFILE=100000
LimitNPROC=32768
LimitMEMLOCK=infinity
LimitAS=infinity

[Install]
WantedBy=multi-user.target
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  srv/salt/cassandra/files/2.2.12/cassandra.yaml
&lt;/h3&gt;

&lt;p&gt;The full &lt;code&gt;cassandra.yaml&lt;/code&gt; is enormous, so I won't reproduce it here in full. The interesting parts are where values are being automatically interpolated by Salt. Like the Cassandra state, this is actually a Jinja template which &lt;em&gt;renders&lt;/em&gt; a YAML file.&lt;/p&gt;

&lt;p&gt;First, we get a list of internal IP addresses corresponding to &lt;code&gt;cassandra-seed&lt;/code&gt; minions from the Salt mine and build a list of &lt;code&gt;known_seeds&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{%- from 'cassandra/map.jinja' import server with context -%}
{% set known_seeds = [] %}
{% for minion, ip_array in salt['mine.get']('cassandra-seed:true', 'network.ip_addrs', 'grain').items() if ip_array is not sameas false and known_seeds|length &amp;lt; 2 %}
{% for ip in ip_array %}
{% do known_seeds.append(ip) %}
{% endfor %}
{% endfor %}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This becomes the list of seeds the node looks for when trying to join the cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "{{ known_seeds|unique|join(',') }}"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
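&lt;p&gt;Taken together, the loop and filter above collect at most two seed minions' addresses and deduplicate them into the &lt;code&gt;seeds&lt;/code&gt; string. The same logic in plain Python, with an assumed shape for the mine return value:&lt;/p&gt;

```python
# What mine.get('cassandra-seed:true', 'network.ip_addrs', 'grain')
# might return; the shape is assumed for illustration
mine = {
    "cassandra-1": ["10.0.0.11"],
    "cassandra-2": ["10.0.0.12"],
    "cassandra-3": False,  # a minion whose mine call failed
}

known_seeds = []
for minion, ip_array in mine.items():
    # mirror `is not sameas false`: skip failed returns
    if ip_array is False:
        continue
    # mirror the known_seeds|length filter: stop once we have two
    if min(len(known_seeds), 2) == 2:
        break
    known_seeds.extend(ip_array)

# the template's |unique|join(',') step, preserving order
seeds = ",".join(dict.fromkeys(known_seeds))
```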



&lt;p&gt;Listen and broadcast addresses are configured per node. The broadcast addresses are a little special due to our network configuration needs: each node has to get its public dns name from the Salt mine. This is perhaps a bit overcomplicated compared to a custom grain or capturing the output from running the Salt modules at render time, but it's there and it works and at this point messing with it isn't a great use of time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;listen_address: {{ grains['fqdn'] }}
broadcast_address: {{ salt['mine.get'](grains['id'], 'public_dns').items()[0][1] }}
rpc_address: {{ grains['fqdn'] }}
broadcast_rpc_address: {{ salt['mine.get'](grains['id'], 'public_dns').items()[0][1] }}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The cluster name and other central settings are interpolated from the pillar+defaults &lt;code&gt;server&lt;/code&gt; dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cluster_name: "{{ server.cluster_name }}"
...
authenticator: "{{ server.authenticator }}"
...
endpoint_snitch: "{{ server.endpoint_snitch }}"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The changes to the Cassandra 3.0.8 configuration are identical.&lt;/p&gt;

&lt;h3&gt;
  
  
  srv/salt/cassandra/files/2.2.12/TimeWindowCompactionStrategy-2.2.5.jar
&lt;/h3&gt;

&lt;p&gt;See &lt;a href="http://thelastpickle.com/blog/2017/01/10/twcs-part2.html"&gt;this post on TheLastPickle&lt;/a&gt; for directions on building the TWCS jar.&lt;/p&gt;

&lt;h2&gt;
  
  
  Highstate
&lt;/h2&gt;

&lt;p&gt;Finally, the Salt highstate needs to ensure that our &lt;code&gt;cassandra-*&lt;/code&gt; nodes have the Java and Cassandra states applied. Since Salt-Cloud minions come configured, however, we have to ensure the default &lt;code&gt;salt.minion&lt;/code&gt; state is excluded from our Cassandra nodes since otherwise a highstate will blow away the cloud-specific configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  srv/salt/top.sls changes
&lt;/h3&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;base:
  'not cassandra-*':
    - match: compound
    - salt.minion
  'cassandra-*':
    - sun-java
    - sun-java.env
    - cassandra
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
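&lt;p&gt;With the top file in place, converging the whole cluster onto its expected state is a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo salt 'cassandra-*' state.highstate
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;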



&lt;h2&gt;
  
  
  Startup!
&lt;/h2&gt;

&lt;p&gt;Set the Salt config dir to &lt;code&gt;etc&lt;/code&gt; with &lt;code&gt;-c&lt;/code&gt; and pass in the map file with &lt;code&gt;-m&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo salt-cloud -c etc -m cassandra-test.map
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To clean up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo salt-cloud -d cassandra-1 cassandra-2 cassandra-3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



</description>
      <category>cassandra</category>
      <category>devops</category>
      <category>saltstack</category>
      <category>aws</category>
    </item>
    <item>
      <title>Automatic Node Deploys to Elastic Beanstalk</title>
      <dc:creator>Dian Fay</dc:creator>
      <pubDate>Mon, 08 Oct 2018 00:00:00 +0000</pubDate>
      <link>https://forem.com/dmfay/automatic-node-deploys-to-elastic-beanstalk-d6h</link>
      <guid>https://forem.com/dmfay/automatic-node-deploys-to-elastic-beanstalk-d6h</guid>
      <description>&lt;p&gt;One of my favorite good ideas to ignore is the maxim that you should have your deployment pipeline ready to go before you start writing code. There's always some wrinkle you couldn't have anticipated anyway, so while it sounds good on paper I just don't think it's the best possible use of time. But with anything sufficiently complicated, there's a point where you just have to buckle down and automate rather than waste time repeating the same steps yet again (or, worse, forgetting one). I hit that point recently: the application isn't in production yet, so I'd been "deploying" by means of pulling the repo on an EC2 server, installing dependencies and building in-place, then killing and restarting the node process with &lt;code&gt;nohup&lt;/code&gt;. Good enough for demos, not sustainable long-term. Also, I might have in fact missed a step Friday before last and not realized things were mostly broken until the following Monday.&lt;/p&gt;

&lt;p&gt;I'd been using CircleCI to build and test the application already, so I wanted to stick with it for deployment as well. However, this precluded using the same EC2 instance: the build container would need to connect to it to run commands over SSH, &lt;em&gt;but&lt;/em&gt; this connection would be coming from any of a huge possible range of build container IP addresses. I didn't want to open the server up to the whole world to accommodate the build system. Eventually I settled on Elastic Beanstalk, which can be controlled through the AWS command-line interface with the proper credentials instead of the morass of VPCs and security groups. Just upload a zip file!&lt;/p&gt;

&lt;p&gt;The cost of using Elastic Beanstalk, it turned out, was that while it made difficult things easy it also made easy things difficult. How do you deploy the same application to different environments? You don't. Everything has to be in that zip file, and if that includes any per-environment configuration then the right config files had better be where they're expected to be. This is less than ideal, but at least it can be scripted. Here's the whole thing (assuming &lt;code&gt;awscli&lt;/code&gt; has already been installed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# what time is it?&lt;/span&gt;
&lt;span class="nv"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d%H%M%S&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# work around Elastic Beanstalk permissions for node-gyp (bcrypt)&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"unsafe-perm=true"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; .npmrc

&lt;span class="c"&gt;# generate artifacts&lt;/span&gt;
npm run build

&lt;span class="c"&gt;# download config&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;s3://elasticbeanstalk-bucket-name/app/development.config.json &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# zip everything up&lt;/span&gt;
zip &lt;span class="nt"&gt;-r&lt;/span&gt; app-dev.zip &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--exclude&lt;/span&gt; &lt;span class="s2"&gt;"node_modules/*"&lt;/span&gt; &lt;span class="s2"&gt;".git/*"&lt;/span&gt; &lt;span class="s2"&gt;"coverage/*"&lt;/span&gt; &lt;span class="s2"&gt;".nyc_output/*"&lt;/span&gt; &lt;span class="s2"&gt;"test/*"&lt;/span&gt; &lt;span class="s2"&gt;".circleci/*"&lt;/span&gt;

&lt;span class="c"&gt;# upload to s3&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;mv&lt;/span&gt; ./app-dev.zip s3://elasticbeanstalk-bucket-name/app/app-dev-&lt;span class="nv"&gt;$TIMESTAMP&lt;/span&gt;.zip

&lt;span class="c"&gt;# create new version&lt;/span&gt;
aws elasticbeanstalk create-application-version &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--application-name&lt;/span&gt; app &lt;span class="nt"&gt;--version-label&lt;/span&gt; development-&lt;span class="nv"&gt;$TIMESTAMP&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source-bundle&lt;/span&gt; &lt;span class="nv"&gt;S3Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;elasticbeanstalk-bucket-name,S3Key&lt;span class="o"&gt;=&lt;/span&gt;app/app-dev-&lt;span class="nv"&gt;$TIMESTAMP&lt;/span&gt;.zip

&lt;span class="c"&gt;# deploy to dev environment&lt;/span&gt;
&lt;span class="c"&gt;# --application-name app is not specified because apt installs&lt;/span&gt;
&lt;span class="c"&gt;# an older version of awscli which doesn't accept that option&lt;/span&gt;
aws elasticbeanstalk update-environment &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2 &lt;span class="nt"&gt;--environment-name&lt;/span&gt; app-dev &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version-label&lt;/span&gt; development-&lt;span class="nv"&gt;$TIMESTAMP&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;TIMESTAMP&lt;/code&gt; ensures the build can be uniquely identified later. The &lt;code&gt;.npmrc&lt;/code&gt; setting is for AWS reasons: as detailed in &lt;a href="https://stackoverflow.com/questions/46001516/beanstalk-node-js-deployment-node-gyp-fails-due-to-permission-denied"&gt;this StackOverflow answer&lt;/a&gt;, the unfortunately-acronymed &lt;code&gt;node-gyp&lt;/code&gt; runs as the instance's ec2-user account and doesn't have the permissions it needs to compile bcrypt. If you're not using bcrypt (or another project that involves a &lt;code&gt;node-gyp&lt;/code&gt; step on install), you don't need that line.&lt;/p&gt;

&lt;p&gt;The zip is assembled in three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;npm run build&lt;/code&gt; compiles stylesheets, dynamic Pug templates, frontend JavaScript, and so forth.&lt;/li&gt;
&lt;li&gt;The appropriate environment config is downloaded from an S3 bucket.&lt;/li&gt;
&lt;li&gt;Everything is rolled together in the zip file, minus the detritus of source control and test results.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, the Elastic Beanstalk deploy happens in two stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;aws elasticbeanstalk create-application-version&lt;/code&gt; does what it sounds like: each timestamped zip file becomes a new "version". Since the bundle bakes in per-environment configuration, these don't map exactly to versions as more commonly understood, so naming them for the target environment and including the timestamp helps identify them.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;aws elasticbeanstalk update-environment&lt;/code&gt; actually deploys the newly-created "version" to the destination environment.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Obviously, when it comes time to roll the project out to production, I'll factor the environment out into a variable to download and upload the appropriate artifacts. But even in its current state, this one small script has almost made deployment continuous: every pushed commit gets deployed to Elastic Beanstalk with no manual intervention, unless there are database changes. That's next.&lt;/p&gt;
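&lt;p&gt;As a sketch of that factoring (nothing here actually talks to AWS; the bucket, application, and environment names simply mirror the script above, with the environment passed as a positional argument):&lt;/p&gt;

```shell
# Hypothetical parameterization of the deploy script: the target environment
# comes in as the first argument, defaulting to dev.
ENVIRONMENT="${1:-dev}"
TIMESTAMP=$(date +%Y%m%d%H%M%S)

# Derive all environment-specific names in one place.
BUNDLE="app-$ENVIRONMENT.zip"
S3_KEY="app/app-$ENVIRONMENT-$TIMESTAMP.zip"
VERSION_LABEL="$ENVIRONMENT-$TIMESTAMP"

# The config download and aws calls then use the variables, e.g.:
# aws s3 cp "s3://elasticbeanstalk-bucket-name/app/$ENVIRONMENT.config.json" .
# aws s3 mv "./$BUNDLE" "s3://elasticbeanstalk-bucket-name/$S3_KEY"
# aws elasticbeanstalk create-application-version --region us-west-2 \
#   --application-name app --version-label "$VERSION_LABEL" \
#   --source-bundle S3Bucket=elasticbeanstalk-bucket-name,S3Key="$S3_KEY"
# aws elasticbeanstalk update-environment --region us-west-2 \
#   --environment-name "app-$ENVIRONMENT" --version-label "$VERSION_LABEL"
```

&lt;p&gt;CircleCI (or any CI system) would then invoke it as &lt;code&gt;./deploy.sh dev&lt;/code&gt; or &lt;code&gt;./deploy.sh prod&lt;/code&gt;, with &lt;code&gt;deploy.sh&lt;/code&gt; being a hypothetical name for the script.&lt;/p&gt;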

</description>
      <category>node</category>
      <category>cloud</category>
      <category>bash</category>
      <category>circleci</category>
    </item>
    <item>
      <title>What's your atomic bomb?</title>
      <dc:creator>Dian Fay</dc:creator>
      <pubDate>Mon, 10 Sep 2018 10:11:42 +0000</pubDate>
      <link>https://forem.com/dmfay/whats-your-atomic-bomb-20ob</link>
      <guid>https://forem.com/dmfay/whats-your-atomic-bomb-20ob</guid>
      <description>&lt;p&gt;Inspired by &lt;a href="https://dev.to/bertilmuth/software-that-helps-33ge"&gt;Software that helps&lt;/a&gt;, except I disagree with Bertil's implicit assertion that all software is helpful in the long run :) The development of the atomic bomb is the classic example of how diffusion of responsibility through a team and/or management structure helps people rationalize working on all kinds of stuff. Likely none of us have reached the ethical depths plumbed by Oppenheimer and his team, but what &lt;em&gt;aren't&lt;/em&gt; you proud of?&lt;/p&gt;

&lt;p&gt;I worked in advertising for a few years, making it easier to waste paper and annoy people worldwide with those mailers from car dealerships you throw away without looking at. I'm never working in advertising again.&lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>Surrealist Remixes with Markov Chains</title>
      <dc:creator>Dian Fay</dc:creator>
      <pubDate>Sun, 05 Aug 2018 00:00:00 +0000</pubDate>
      <link>https://forem.com/dmfay/surrealist-remixes-with-markov-chains-22pp</link>
      <guid>https://forem.com/dmfay/surrealist-remixes-with-markov-chains-22pp</guid>
      <description>&lt;p&gt;There's a new button at the bottom of this (and each) post. Try clicking it! (If you're reading this on &lt;a href="https://dev.to"&gt;dev.to&lt;/a&gt; or an RSS reader, you'll need to visit &lt;a href="https://di.nmfay.com/markov-remix"&gt;di.nmfay.com&lt;/a&gt; to see it)&lt;/p&gt;

&lt;p&gt;By now everyone's run into Twitter bots and automated text generators that combine words in ways that &lt;em&gt;almost&lt;/em&gt; compute. There's even a &lt;a href="https://www.reddit.com/r/SubredditSimulator/"&gt;subreddit&lt;/a&gt; that runs the user-generated content of other subreddits through individual accounts which make posts that seem vaguely representative of their sources, but either defy comprehension or break through into a sublime silliness.&lt;/p&gt;

&lt;p&gt;People have engaged in wordplay (and word-work) for as long as we've communicated with words. Taking language apart and putting it back together in novel ways has been the domain of poets, philosophers, and magicians alike for eons, to say nothing of puns, dad jokes, glossolalia, and word salad.&lt;/p&gt;

&lt;p&gt;In the early 20th century, artists associated with the Surrealist movement played a game, variously for entertainment and inspiration, called "exquisite corpse". Each player writes a word (in this version, everyone is assigned a part of speech ahead of time) or draws on an exposed section of paper, then folds the sheet over to obscure their work from the next player. Once everyone's had a turn, the full sentence or picture is revealed. The game takes its name from its first recorded result: &lt;em&gt;le cadavre exquis boira le vin nouveau&lt;/em&gt;, or "the exquisite corpse shall drink the new wine".&lt;/p&gt;

&lt;p&gt;The Surrealist seeds fell on fertile ground and their ideas spread throughout the artistic and literary world, just as they themselves had been informed by earlier avant-garde movements like Symbolism and Dada. In the mid-century, writers and occultists like Brion Gysin and William Burroughs used similar techniques to discover new meanings in old texts. The only real difference in our modern toys is that they run on their own -- it's a little bit horror movie ouija board, except you can see the workings for yourself.&lt;/p&gt;

&lt;p&gt;There are a variety of ways to implement this kind of functionality. On the more primitive side, you have "mad libs" algorithms which select random values to insert into known placeholders, as many Twitter bots such as &lt;a href="https://twitter.com/godtributes"&gt;@godtributes&lt;/a&gt; or &lt;a href="https://twitter.com/bottest_takes"&gt;@bottest_takes&lt;/a&gt; do. This method runs up against obvious limitations fairly quickly: the set of substitutions is finite, and the structure they're substituted into likewise becomes predictable.&lt;/p&gt;

&lt;p&gt;More advanced text generators are predictive, reorganizing words or phrases from a body of text or &lt;em&gt;corpus&lt;/em&gt; in ways which reflect the composition of the corpus itself: words aren't simply jumbled up at random, but follow each other in identifiable sequences. Many generators like these run on Markov chains, probabilistic state machines where the next state is a function only of the current state.&lt;/p&gt;

&lt;h2&gt;Implementing a Textual Markov Chain&lt;/h2&gt;

&lt;p&gt;The first order of business in using a Markov chain to generate text is to break up the original corpus. Regular expressions matching whitespace make that easy enough, turning it into an array of words. The next step is to establish the links between states, which is where things start getting a little complex.&lt;/p&gt;

&lt;p&gt;Textual Markov chains have one important parameter: the prefix length, which defines how many previous states (words) comprise the current state and must be evaluated to find potential next states. Prefixes must comprise at least one word, but for the purposes of natural-seeming text generation the sweet spot tends to be between two and four words depending on corpus length. With too short a prefix length, the output tends to be simply garbled; too long a prefix or too short a corpus, and there may be too few potential next states for the chain to diverge from the original text.&lt;/p&gt;

&lt;p&gt;Mapping prefixes to next states requires a sliding window on the array. This is more easily illustrated. Here's a passage from &lt;em&gt;Les Chants de Maldoror&lt;/em&gt;, a 19th-century prose poem rediscovered and given new fame (or infamy) by the Surrealists, who identified in its obscene grandeur a deconstruction of language and the still-developing format of the modern novel that prefigured their own artistic ideology:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;He is as fair as the retractility of the claws of birds of prey; or again, as the uncertainty of the muscular movements in wounds in the soft parts of the lower cervical region; or rather, as that perpetual rat-trap always reset by the trapped animal, which by itself can catch rodents indefinitely and work even when hidden under straw; and above all, as the chance meeting on a dissecting-table of a sewing-machine and an umbrella!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Assuming a prefix length of 2, the mapping might start to take this shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"He is": ["as"],
"is as": ["fair"],
"as fair": ["as"],
"fair as": ["the"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Starting from the first prefix ("He is"), there is only one next state possible since the words "He is" only appear once in the corpus. Upon reaching the next state, the active prefix is now "is as", which likewise has only one possible next state, and so forth. But when the current state reaches "as the", the next word to be added may be "retractility", "uncertainty", or "chance", and what happens after that depends on the route taken. Multiple next states introduce the potential for divergence; this is also why having too long a prefix length, or too short a corpus, results in uninteresting output!&lt;/p&gt;
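&lt;p&gt;Building that mapping can be sketched in a few lines of JavaScript (&lt;code&gt;buildChain&lt;/code&gt; is a hypothetical name; the real implementation lives in &lt;code&gt;remix.js&lt;/code&gt;):&lt;/p&gt;

```javascript
// Build a prefix -> next-words mapping with a sliding window over the corpus.
// A prefixLength of 2 matches the example above.
function buildChain(corpus, prefixLength = 2) {
  const words = corpus.trim().split(/\s+/);
  const chain = new Map();

  // Stop while there's still a next word beyond the window.
  for (let i = 0; i + prefixLength < words.length; i++) {
    // Stringifying the prefix array makes it usable as a Map key.
    const prefix = JSON.stringify(words.slice(i, i + prefixLength));
    const next = words[i + prefixLength];

    if (!chain.has(prefix)) { chain.set(prefix, []); }

    chain.get(prefix).push(next);
  }

  return chain;
}
```

&lt;p&gt;A prefix that recurs in the corpus, like "as the" above, accumulates one entry in its next-state array per occurrence.&lt;/p&gt;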

&lt;p&gt;Because the prefix is constantly losing its earliest word and appending the next, it's stored as a stringified array rather than as a concatenated string. The order of operations goes like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select one of the potential next states for the current stringified prefix array.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;shift&lt;/code&gt; the earliest word out of the prefix array and &lt;code&gt;push&lt;/code&gt; the selected next word onto the end.&lt;/li&gt;
&lt;li&gt;Stringify the new prefix array.&lt;/li&gt;
&lt;li&gt;Repeat until bored, or until there's no possible next state.&lt;/li&gt;
&lt;/ol&gt;
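&lt;p&gt;The steps above can be sketched as follows, with a &lt;code&gt;chooseNext&lt;/code&gt; callback standing in for whatever selects among multiple next states (by default it just takes the first candidate, which makes this sketch deterministic rather than probabilistic):&lt;/p&gt;

```javascript
// Walk a prefix -> next-words Map from a starting prefix, following the
// shift/push/stringify loop described above.
function walk(chain, startPrefix, maxSteps, chooseNext = (options) => options[0]) {
  const prefix = [...startPrefix];
  const output = [...startPrefix];

  for (let i = 0; i < maxSteps; i++) {
    const options = chain.get(JSON.stringify(prefix)); // current state's successors
    if (!options) { break; }                           // no possible next state

    const next = chooseNext(options); // 1. select a next state
    prefix.shift();                   // 2. drop the earliest word...
    prefix.push(next);                //    ...and append the selected one
    output.push(next);                // (3. the JSON.stringify above re-keys it)
  }

  return output.join(' ');
}
```

&lt;p&gt;Swapping &lt;code&gt;chooseNext&lt;/code&gt; for a random pick gives the classic generator; prompting the reader instead gives the interactive version described below.&lt;/p&gt;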

&lt;h2&gt;Remix!&lt;/h2&gt;

&lt;p&gt;If you're interested in the actual code, it's &lt;code&gt;remix.js&lt;/code&gt; in devtools, or you can find it in &lt;a href="https://github.com/dmfay/blog/blob/master/assets/remix.js"&gt;source control&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Markov chain generators aren't usually interactive; that's where the "probabilistic" part of "probabilistic state machine" comes into play. This makes the implementation here incomplete by design. Where only one possible next state exists, the state machine advances on its own, but where there are multiple, it allows the user to choose how to proceed. This, along with starting from the beginning instead of selecting a random opening prefix, gives it more of an exploratory direction than if it simply restructured the entire corpus at the push of a button. The jury's still out on whether any great insights lie waiting to be unearthed, as the more mystically-minded practitioners of aleatory editing hoped, but in the meantime, the results are at least good fun.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>surrealism</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Summer 2018: Massive, Twice Over</title>
      <dc:creator>Dian Fay</dc:creator>
      <pubDate>Mon, 30 Jul 2018 00:00:00 +0000</pubDate>
      <link>https://forem.com/dmfay/summer-2018-massive-twice-over-14g0</link>
      <guid>https://forem.com/dmfay/summer-2018-massive-twice-over-14g0</guid>
      <description>&lt;p&gt;&lt;a href="https://vimeo.com/281409168"&gt;NDC talks are up&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;There's also the &lt;a href="https://skillsmatter.com/skillscasts/12008-database-as-api-with-postgresql-and-massive-js"&gt;FullStack London&lt;/a&gt; version which is slightly condensed for a shorter timeslot, if you have a SkillsMatter account and want to get right to the fun parts.&lt;/p&gt;

&lt;p&gt;If you've read (almost) anything I've written, text or code, odds are you've run into &lt;a href="https://github.com/dmfay/massive-js"&gt;Massive.js&lt;/a&gt;. On the off chance you haven't, the elevator pitch is that PostgreSQL exclusivity lets you get a lot more mileage out of your database (as long as it's Postgres) and JavaScript being a dynamically typed, functional-ish language lets you get away with it really easily.&lt;/p&gt;

&lt;p&gt;This talk goes over Massive in much more depth: first laying out a case for alternatives to the dominant object-relational mapping data access technique, in general and especially in JavaScript; and then diving into the architecture of Massive itself with plenty of examples. Also, there's some trivia about early 20th century Russian avant-garde art and another bit poking fun at French modernist architect Le Corbusier.&lt;/p&gt;

&lt;p&gt;It's the second talk I've done, and overall I was pretty happy with how it went in Oslo and London both! I'm the furthest thing from a natural public speaker but I covered what I wanted to cover, finished at a reasonable time, and didn't screw anything up too badly -- so that's a success in my book. And after all, the only way to improve this particular skill is to keep doing it.&lt;/p&gt;

</description>
      <category>database</category>
      <category>javascript</category>
      <category>node</category>
      <category>speaking</category>
    </item>
    <item>
      <title>Centralize Your Query Logic!</title>
      <dc:creator>Dian Fay</dc:creator>
      <pubDate>Wed, 25 Jul 2018 00:00:00 +0000</pubDate>
      <link>https://forem.com/dmfay/centralize-your-query-logic-5bhh</link>
      <guid>https://forem.com/dmfay/centralize-your-query-logic-5bhh</guid>
      <description>&lt;p&gt;At a talk I gave earlier this month, an audience member asked if &lt;a href="https://github.com/dmfay/massive-js"&gt;Massive&lt;/a&gt; supported joining information from multiple tables together. It's come up on the issue tracker before as well. Massive does not currently have this functionality, and while I'm open to suggestions it's not on my own radar.&lt;/p&gt;

&lt;p&gt;The central reason for this is that join logic can be tricky to manage from the application architecture side. The ability to correlate and combine what you need when you need it is certainly powerful, but it also embeds assumptions about your database layout in client code. As the database and application evolve, these assumptions can easily fall out of date and out of sync with each other. In real terms, if your application's "model" (whether implicit or explicit) of a user loaded from the database includes only the user record itself sometimes, but other times looks for information in a separate profile table, adds current statistics, et cetera, and you have functionality that operates on A User, either you understand that users come in different shapes and handle them accordingly across the board or you are living on borrowed uptime.&lt;/p&gt;

&lt;p&gt;Some application architectures approach this scenario by grouping the query logic together. In the enterprise world, &lt;em&gt;n-tier&lt;/em&gt; applications frequently pull related queries into "services" or Data Access Objects (DAOs) so there's at least some kind of organizational schema. This reduces the maintenance overhead somewhat, but it's an imperfect solution, not least because there's nothing but fallible code reviews (if that) standing in the way of someone dropping data access code somewhere else.&lt;/p&gt;

&lt;p&gt;Fortunately, there's already part of the application-database ecosystem dedicated to organizing things -- the database itself! And as an organizing principle, it already has its own way to manage complex queries. Sure, it'll involve writing a little SQL, but let's face it: you were going to wind up writing SQL eventually anyway.&lt;/p&gt;

&lt;p&gt;If you've only scratched the surface of working with databases, you might not be familiar with views. The good news is they're pretty straightforward: a view is a stored SQL query with a name, given life with the statement &lt;code&gt;CREATE VIEW myview AS SELECT...&lt;/code&gt;. You can &lt;code&gt;SELECT&lt;/code&gt; from a view just like you can a table, optionally with &lt;code&gt;JOIN&lt;/code&gt;s and a &lt;code&gt;WHERE&lt;/code&gt; clause and all the other trimmings, whereupon the database executes the query. Results are not stored so the information you get out of a view is always current, unless you intentionally sacrifice realtime data for speed by creating a &lt;em&gt;materialized&lt;/em&gt; view which does persist results and has to be manually refreshed.&lt;/p&gt;
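&lt;p&gt;To make that concrete, here's a self-contained sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; (SQLite rather than Postgres purely so it runs anywhere; the tables and the &lt;code&gt;user_details&lt;/code&gt; view are hypothetical):&lt;/p&gt;

```python
import sqlite3

# Hypothetical schema: user records split across two tables, unified by a
# view so that "what information comprises a user" is defined in one place.
db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
  CREATE TABLE profiles (user_id INTEGER REFERENCES users (id), display_name TEXT);

  CREATE VIEW user_details AS
    SELECT u.id, u.email, p.display_name
    FROM users u
    JOIN profiles p ON p.user_id = u.id;

  INSERT INTO users VALUES (1, 'dmfay@example.com');
  INSERT INTO profiles VALUES (1, 'Dian');
""")

# SELECT from the view as though it were a table; the join runs on demand.
row = db.execute("SELECT email, display_name FROM user_details WHERE id = 1").fetchone()
```

&lt;p&gt;Insert another profile row and the same &lt;code&gt;SELECT&lt;/code&gt; reflects it immediately -- no refresh step, because the view stores the query, not the results.&lt;/p&gt;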

&lt;p&gt;The reason views are underrated and underutilized in application development has mostly to do with the frameworks developers use to communicate with databases. When you have to provide a concrete implementation of a unary &lt;code&gt;User&lt;/code&gt; model, odds are you only care about things you can both read &lt;em&gt;and&lt;/em&gt; write to, so you back it up with tables instead of using views to shape data for your needs. There's little room for views in object/relational mapping, and when I've had to use O/RMs I've really only been able to take advantage of views to streamline the raw SQL queries you have to write anyway when you use O/RMs.&lt;/p&gt;

&lt;p&gt;If you're not stuck with an object-relational mapper, though, you can really get your money's worth out of views! Retrieving user records from a view, or building more complex user-inclusive results by joining it into other views, ensures that you have a consistent definition of &lt;em&gt;what information comprises a user&lt;/em&gt; built into your database. You can't always stop other developers from winging it, naturally, but having that central definition to point to eliminates at least one major potential ambiguity. Massive's omission of the join feature encourages developers using it to center their thinking on the database and the tools it offers for organizing information.&lt;/p&gt;

&lt;p&gt;As with anything, there are tradeoffs. Here, it's flexibility. Views may be ephemeral stored queries, but they're still part of the database schema for all that, and the schema takes more planning and effort to change than does application code. But it's a good idea to be thinking carefully about this stuff in the first place.&lt;/p&gt;

</description>
      <category>database</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
