<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Brett Hoyer</title>
    <description>The latest articles on Forem by Brett Hoyer (@bretthoyer).</description>
    <link>https://forem.com/bretthoyer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F883541%2F0edbda57-83ea-453b-8506-0d63906f14e5.png</url>
      <title>Forem: Brett Hoyer</title>
      <link>https://forem.com/bretthoyer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/bretthoyer"/>
    <language>en</language>
    <item>
      <title>Handling Automatic ID Generation in PostgreSQL with Node.js and Sequelize</title>
      <dc:creator>Brett Hoyer</dc:creator>
      <pubDate>Tue, 03 Jan 2023 20:22:01 +0000</pubDate>
      <link>https://forem.com/yugabyte/handling-automatic-id-generation-in-postgresql-with-nodejs-and-sequelize-2420</link>
      <guid>https://forem.com/yugabyte/handling-automatic-id-generation-in-postgresql-with-nodejs-and-sequelize-2420</guid>
      <description>&lt;p&gt;&lt;em&gt;Automatic ID generation for database records is a fundamental part of application development. In this article, I’ll demonstrate four ways to handle automatic ID generation in &lt;a href="https://sequelize.org/" rel="noopener noreferrer"&gt;Sequelize&lt;/a&gt; for &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt; and &lt;a href="https://www.yugabyte.com/yugabytedb/" rel="noopener noreferrer"&gt;YugabyteDB&lt;/a&gt;, the open source, cloud native,  distributed SQL database built on PostgreSQL.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are many ways to handle ID generation in PostgreSQL, but I've chosen to investigate these approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Auto-incrementing (SERIAL data type)&lt;/li&gt;
&lt;li&gt;Sequence-caching&lt;/li&gt;
&lt;li&gt;Sequence-incrementing with client-side ID management&lt;/li&gt;
&lt;li&gt;UUID-generation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Depending on your application and your underlying database tables, you might choose to employ one or more of these options. Below I'll explain how each can be achieved in Node.js using the &lt;a href="https://sequelize.org/docs/v6/" rel="noopener noreferrer"&gt;Sequelize&lt;/a&gt; ORM.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Auto-Incrementing
&lt;/h2&gt;

&lt;p&gt;Most developers choose the most straightforward option before exploring potential optimizations. I'm no different! Here's how you can create an auto-incrementing ID field in your Sequelize model definitions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Sequelize
const { DataTypes } = require('sequelize');
const Product = sequelize.define(
   "product",
   {
       id: {
           type: DataTypes.INTEGER,
           autoIncrement: true,
           primaryKey: true,
       },
       title: {
           type: DataTypes.STRING,
       }
   }
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you’re familiar with Sequelize, you’ll be no stranger to this syntax, but others might wonder what's actually happening under the hood.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;autoIncrement&lt;/code&gt; flag tells PostgreSQL to create an &lt;code&gt;id&lt;/code&gt; column with a &lt;a href="https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-SERIAL" rel="noopener noreferrer"&gt;SERIAL&lt;/a&gt; data type. This data type implicitly creates a &lt;a href="https://www.postgresql.org/docs/current/sql-createsequence.html" rel="noopener noreferrer"&gt;SEQUENCE&lt;/a&gt; which is owned by the &lt;code&gt;products&lt;/code&gt; table's &lt;code&gt;id&lt;/code&gt; column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// PostgreSQL equivalent
CREATE SEQUENCE products_id_seq;
CREATE TABLE products
(
   id INT NOT NULL DEFAULT NEXTVAL('products_id_seq'),
   title VARCHAR(255)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When inserting a product into our table, we don't need to supply a value for &lt;code&gt;id&lt;/code&gt;, as it's automatically generated from the underlying sequence.&lt;/p&gt;

&lt;p&gt;We can simply run the following to insert a product.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Sequelize
await Product.create({title: "iPad Pro"});


//PostgreSQL equivalent
INSERT INTO products (title) VALUES ('iPad Pro');

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dropping our table will also drop the automatically created sequence, &lt;code&gt;products_id_seq&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Sequelize
await Product.drop();

// PostgreSQL equivalent
DROP TABLE products CASCADE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although this approach is extremely easy to implement, our PostgreSQL server needs to access the sequence to get its next value on every write, which comes at a latency cost. This is particularly bad in distributed deployments. YugabyteDB sets a &lt;a href="https://docs.yugabyte.com/preview/reference/configuration/yb-tserver/#ysql-sequence-cache-minval" rel="noopener noreferrer"&gt;default&lt;/a&gt; sequence cache size of 100. I’ll outline why this is so important below.&lt;/p&gt;

&lt;p&gt;Now that we have the basics out of the way, let's try to speed things up. As we all know, "cache is king."&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Sequence-Caching
&lt;/h2&gt;

&lt;p&gt;Although the &lt;code&gt;autoIncrement&lt;/code&gt; flag in Sequelize model definitions totally eliminates the need to interact with sequences directly, there are scenarios where you might consider doing so. For instance, what if you wanted to speed up writes by caching sequence values? Fear not, with a little extra effort, we can make this happen.&lt;/p&gt;

&lt;p&gt;Sequelize doesn't have API support to make this happen, as noted on Github (&lt;a href="https://github.com/sequelize/sequelize/issues/3555#issuecomment-1132630072" rel="noopener noreferrer"&gt;https://github.com/sequelize/sequelize/issues/3555#issuecomment-1132630072&lt;/a&gt;), but there's a simple workaround. By utilizing the built-in &lt;code&gt;literal&lt;/code&gt; function, we are able to access a predefined sequence in our model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { literal, DataTypes } = require('sequelize');
const Product = sequelize.define("product", {
 id: {
   type: DataTypes.INTEGER,
   primaryKey: true,
   defaultValue: literal("nextval('custom_sequence')"),
 },
});

sequelize.beforeSync(async () =&amp;gt; {
 await sequelize.query('CREATE SEQUENCE IF NOT EXISTS custom_sequence CACHE 50');
});

await sequelize.sync();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's not too bad. Here's what changed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We've created our own sequence, named &lt;code&gt;custom_sequence&lt;/code&gt;, which is used to set the default value for our product ID.&lt;/li&gt;
&lt;li&gt;This sequence is created in the &lt;code&gt;beforeSync&lt;/code&gt; hook, so it exists before the products table, and its &lt;code&gt;CACHE&lt;/code&gt; value is set to 50.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;defaultValue&lt;/code&gt; is set to the next value in our custom sequence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Well, what about the cache? Sequences in PostgreSQL can optionally be supplied a &lt;code&gt;CACHE&lt;/code&gt; value upon creation, which allots a certain number of values to be stored in memory per session. With our cache set at 50, here's how that works.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;//Database Session A
&amp;gt; SELECT nextval('custom_sequence');
1
&amp;gt; SELECT nextval('custom_sequence');
2

//Database Session B
&amp;gt; SELECT nextval('custom_sequence');
51
&amp;gt; SELECT nextval('custom_sequence');
52
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For an application with multiple database connections, such as one running microservices or multiple servers behind a load balancer, each connection will receive a set of cached values. No session will contain duplicate values in its cache, ensuring there are no collisions when inserting records. In fact, depending on how your database is configured, you might even find &lt;a href="https://www.cybertec-postgresql.com/en/gaps-in-sequences-postgresql/" rel="noopener noreferrer"&gt;gaps&lt;/a&gt; in your sequenced &lt;code&gt;id&lt;/code&gt; column if a database connection fails and is restarted without using all of the values allotted in its cache. However, this generally isn't a problem, as we're only concerned with uniqueness.&lt;/p&gt;

&lt;p&gt;So, what's the point? Speed. Speed is the point!&lt;/p&gt;

&lt;p&gt;By caching values on our PostgreSQL backend and storing them in memory, we're able to retrieve the next value very quickly. In fact, YugabyteDB caches &lt;a href="https://stackoverflow.com/a/70740723" rel="noopener noreferrer"&gt;100 sequence values by default&lt;/a&gt;, as opposed to the PostgreSQL default of 1. This allows the database to scale, without needing to repeatedly obtain the next sequence value from the master node on writes. Of course, caching comes with the drawback of an increased memory constraint on the PostgreSQL server.&lt;/p&gt;

&lt;p&gt;Depending on your infrastructure, this could be a worthy optimization!&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Client-Side Sequencing
&lt;/h2&gt;

&lt;p&gt;Sequence-caching improves performance by caching values on our PostgreSQL backend. How could we use a sequence to cache values on our client instead?&lt;/p&gt;

&lt;p&gt;Sequences in PostgreSQL have an additional parameter called &lt;code&gt;INCREMENT BY&lt;/code&gt; that can be used to achieve this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// DB Initialization
const { literal, DataTypes } = require('sequelize');
const Product = sequelize.define("product", {
 id: {
   type: DataTypes.INTEGER,
   primaryKey: true
 },
});

sequelize.beforeSync(() =&amp;gt; {
 await sequelize.query('CREATE SEQUENCE IF NOT EXISTS custom_sequence INCREMENT BY 50');
});

await sequelize.sync();

// Caller
let startVal = await sequelize.query("SELECT nextval('custom_sequence')");
let limit = startVal + 50;

if (startVal &amp;gt;= limit) {
   startVal = await sequelize.query("SELECT nextval('custom_sequence')");
   limit = startVal + 50;
}

await Product.create({id: startVal, title: "iPad Pro"})
startVal += 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we're utilizing our custom sequence in a slightly different way. No default value is supplied to our model definition. Instead, we're using this sequence to set unique values client-side, by looping through the values in the increment range. When we've exhausted all of the values in this range, we make another call to our database to get the next value in our sequence to "refresh" our range.&lt;/p&gt;

&lt;p&gt;Here's an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Database Session A

&amp;gt; SELECT nextval('custom_sequence');
1

*
 inserts 50 records
 // id 1
 // id 2
 ...
 // id 50
*

&amp;gt; SELECT nextval('custom_sequence');
151

// Database Session B

&amp;gt; SELECT nextval('custom_sequence');
51

* inserts 50 records before Session A has used all numbers in its range *

&amp;gt; SELECT nextval('custom_sequence');
101
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Database Session A&lt;/em&gt; connects and receives the first value in the sequence. &lt;em&gt;Database Session B&lt;/em&gt; connects and receives the value of 51 because we've set our &lt;code&gt;INCREMENT BY&lt;/code&gt; value to &lt;code&gt;50&lt;/code&gt;. Like our auto-incrementing solutions, we can ensure that there are no ID collisions by referencing our PostgreSQL sequence to determine the start value for our range.&lt;/p&gt;

&lt;p&gt;What problems might arise from this solution? Well, it's possible that a database administrator could choose to increase or decrease the &lt;code&gt;INCREMENT BY&lt;/code&gt; value for a particular sequence, without application developers being notified of this change. This would break application logic.&lt;/p&gt;

&lt;p&gt;How can we benefit from client-side sequencing? If you have a lot of available memory on your application server nodes, this could be a potential performance benefit over sequence-caching on database nodes. &lt;/p&gt;

&lt;p&gt;In fact, you might be wondering if it’s possible to utilize a cache on the client and server in the same implementation. The short answer is YES. By creating a sequence with &lt;code&gt;CACHE&lt;/code&gt; and &lt;code&gt;INCREMENT BY&lt;/code&gt; values, we benefit from a server-side cache of our sequence values and a client-side cache for the next value in our range. This performance optimization provides the best of both worlds if memory constraints are not of primary concern.&lt;/p&gt;
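The combined approach can be sketched in one statement of plain PostgreSQL. The sequence name and the values of 50 simply follow the earlier examples, not a recommendation for your workload:

```sql
-- Server-side: each session caches 50 sequence values in memory (CACHE 50).
-- Client-side: each nextval() call reserves a block of 50 IDs (INCREMENT BY 50).
CREATE SEQUENCE IF NOT EXISTS custom_sequence INCREMENT BY 50 CACHE 50;
```

The client-side range logic shown earlier works unchanged against this sequence; the database simply refreshes its cache less often.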

&lt;p&gt;Enough with the sequences already! Let's move on to unique identifiers.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. UUID-Generation
&lt;/h2&gt;

&lt;p&gt;We've covered three ways to generate sequential, integer-based IDs. Another data type, the Universally Unique Identifier (&lt;a href="https://en.wikipedia.org/wiki/Universally_unique_identifier" rel="noopener noreferrer"&gt;UUID&lt;/a&gt;), removes the need for sequences entirely.&lt;/p&gt;

&lt;p&gt;A UUID is a 128-bit identifier, which comes with the guarantee of uniqueness due to the incredibly small probability that the same ID would be generated twice.&lt;/p&gt;

&lt;p&gt;PostgreSQL comes with an extension called &lt;a href="https://www.postgresql.org/docs/current/pgcrypto.html" rel="noopener noreferrer"&gt;pgcrypto&lt;/a&gt; (also supported by YugabyteDB), which can be installed to generate UUIDs with the &lt;a href="https://www.postgresql.org/docs/current/functions-uuid.html" rel="noopener noreferrer"&gt;gen_random_uuid&lt;/a&gt; function. This function generates a UUID value for a database column, much the same way that &lt;code&gt;nextval&lt;/code&gt; is used with sequences.&lt;/p&gt;

&lt;p&gt;Additionally, Node.js has several packages which generate UUIDs, such as, you guessed it, &lt;a href="https://www.npmjs.com/package/uuid" rel="noopener noreferrer"&gt;uuid&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Sequelize
const { literal, DataTypes } = require('sequelize');
const Product = sequelize.define(
   "product",
   {
       id: {
           type: DataTypes.UUID,
            defaultValue: literal('gen_random_uuid()'),
           primaryKey: true,
       },
       title: {
           type: DataTypes.STRING,
       }
   }
);

sequelize.beforeSync(async () =&amp;gt; {
 await sequelize.query('CREATE EXTENSION IF NOT EXISTS "pgcrypto"');
});


// PostgreSQL equivalent
CREATE TABLE products
(
   id UUID NOT NULL DEFAULT gen_random_uuid(),
   title VARCHAR(255)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows us to generate a UUID client-side, with a server-side default, if required.&lt;/p&gt;

&lt;p&gt;A UUID-based approach brings its own benefits. The random nature of the data type can be helpful in certain data migrations. It also helps with API security, as the unique identifier is in no way tied to the information being stored.&lt;/p&gt;

&lt;p&gt;Additionally, the ability to generate an ID client side without managing state is helpful in a distributed deployment, where network latencies play a big role in application performance.&lt;/p&gt;

&lt;p&gt;For example, in a &lt;a href="https://docs.yugabyte.com/preview/explore/multi-region-deployments/row-level-geo-partitioning/" rel="noopener noreferrer"&gt;geo-partitioned&lt;/a&gt; YugabyteDB cluster, connections are made to the nearest database node to serve low-latency reads. However, on writes, this node must forward the request to the primary node in the cluster (which could reside in another region of the world) to determine the next sequence value. The use of UUIDs eliminates this traffic, providing a performance boost.&lt;/p&gt;

&lt;p&gt;So, what's the downside? Well, the topic of UUIDs is somewhat &lt;a href="https://www.depesz.com/2020/02/19/why-im-not-fan-of-uuid-datatype/" rel="noopener noreferrer"&gt;polarizing&lt;/a&gt;. One obvious downside would be the storage size of a UUID relative to an integer, 16 bytes as opposed to 4 bytes for an &lt;code&gt;INTEGER&lt;/code&gt; and 8 for a &lt;code&gt;BIGINT&lt;/code&gt;. UUIDs also take some time to generate, which is a performance consideration. &lt;/p&gt;

&lt;p&gt;Some of the concerns regarding using UUIDs as primary keys are illustrated in this &lt;a href="https://vladmihalcea.com/uuid-database-primary-key" rel="noopener noreferrer"&gt;post&lt;/a&gt; and are discussed further with regards to YugabyteDB in this &lt;a href="https://twitter.com/vlad_mihalcea/status/1600739305055735808?s=20&amp;amp;t=ph7CKNyjSAcS_cZJPLIWTQ" rel="noopener noreferrer"&gt;thread&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can read more about the tradeoffs between integer and UUID based IDs &lt;a href="https://dev.to/yugabyte/uuid-or-cached-sequences-42fi"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Building
&lt;/h2&gt;

&lt;p&gt;Ultimately, there are many factors to consider when choosing how to generate your database IDs. You might choose to use auto-incrementing IDs for a table with infrequent writes, or one that doesn't require low-latency writes. Another table, spread across multiple geographies in a multi-node deployment, might benefit from using UUIDs. There's only one way to find out. Get out there and write some code!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you're interested in using an always-free, PostgreSQL-compatible database node, give &lt;a href="https://www.yugabyte.com/managed/" rel="noopener noreferrer"&gt;YugabyteDB Managed&lt;/a&gt; a try.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Bulk Loading Data In PostgreSQL With Node.js and Sequelize</title>
      <dc:creator>Brett Hoyer</dc:creator>
      <pubDate>Tue, 27 Dec 2022 18:17:56 +0000</pubDate>
      <link>https://forem.com/yugabyte/bulk-loading-data-in-postgresql-with-nodejs-and-sequelize-1bn7</link>
      <guid>https://forem.com/yugabyte/bulk-loading-data-in-postgresql-with-nodejs-and-sequelize-1bn7</guid>
      <description>&lt;p&gt;&lt;em&gt;Application development often requires seeding data in a database for testing and development. The following article will outline how to handle this using Node.js and Sequelize.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Whether you're building an application from scratch with zero users, or adding features to an existing application, working with data during development is a necessity. This can take different forms, from mock data APIs reading data files in development, to seeded database deployments closely mirroring an expected production environment.&lt;/p&gt;

&lt;p&gt;I prefer the latter, as I find fewer deviations from my production toolset lead to fewer bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Humble Beginning
&lt;/h2&gt;

&lt;p&gt;For the sake of this discussion, let's assume we're building an online learning platform offering various coding courses. In its simplest form, our Node.js API layer might look like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// server.js

const express = require("express");
const App = express();

const courses = [
   {title: "CSS Fundamentals", thumbnail: "https://fake-url.com/css"},
   {title: "JavaScript Basics", thumbnail: "https://fake-url.com/js-basics"},
   {title: "Intermediate JavaScript", thumbnail: "https://fake-url.com/intermediate-js"}
];

App.get("/courses", (req, res) =&amp;gt; {
   res.json({data: courses});
});

App.listen(3000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If all you need is a few items to start building your UI, this is enough to get going. Making a call to our &lt;code&gt;/courses&lt;/code&gt; endpoint will return all of the courses defined in this file. However, what if we want to begin testing with a dataset more representative of a full-fledged database-backed application?&lt;/p&gt;

&lt;h2&gt;
  
  
  Working With JSON
&lt;/h2&gt;

&lt;p&gt;Suppose we inherited a script exporting a JSON-array containing thousands of courses. We could import the data, like so.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// courses.js

module.exports = [
   {title: "CSS Fundamentals", thumbnail: "https://fake-url.com/css"},
   {title: "JavaScript Basics", thumbnail: "https://fake-url.com/js-basics"},
   {title: "Intermediate JavaScript", thumbnail: "https://fake-url.com/intermediate-js"},
   ...
];

// server.js

...
const courses = require("/path/to/courses.js");
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This eliminates the need to define our mock data within our server file, and now we have plenty of data to work with. We could enhance our endpoint by adding parameters to paginate the results and set limits on how many records are returned. But, what about allowing users to post their own courses? How about editing courses?&lt;/p&gt;
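As a sketch of that pagination enhancement (the `limit` and `offset` parameter names and the clamping rules are my own choices, not part of the original code), the endpoint could slice the in-memory array before responding:

```javascript
// Hypothetical pagination helper for the /courses endpoint.
// limit and offset arrive as strings from the query string, so they are coerced.
function paginate(records, limit, offset) {
  const safeLimit = Math.min(Number(limit) || 10, 100); // default 10, cap at 100
  const safeOffset = Math.max(Number(offset) || 0, 0);  // never negative
  return records.slice(safeOffset, safeOffset + safeLimit);
}

const courses = [
  { title: "CSS Fundamentals", thumbnail: "https://fake-url.com/css" },
  { title: "JavaScript Basics", thumbnail: "https://fake-url.com/js-basics" },
  { title: "Intermediate JavaScript", thumbnail: "https://fake-url.com/intermediate-js" },
];

// e.g. a request for /courses with limit=2 and offset=1
console.log(paginate(courses, "2", "1").map((c) => c.title));
// [ 'JavaScript Basics', 'Intermediate JavaScript' ]
```

In the Express handler, the values would come from `req.query` and the result would be passed to `res.json` as before.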

&lt;p&gt;This solution gets out of hand quickly as you begin to add functionality. We'll have to write additional code to simulate features of a relational database. After all, databases were created to store data. So, let's do that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bulk Loading JSON With Sequelize
&lt;/h2&gt;

&lt;p&gt;For an application of this nature, PostgreSQL is an appropriate database selection. We have the option of &lt;a href="https://www.postgresql.org/download/" rel="noopener noreferrer"&gt;running PostgreSQL locally&lt;/a&gt;, or connecting to a PostgreSQL-compatible cloud native database, like &lt;a href="https://www.yugabyte.com/managed/" rel="noopener noreferrer"&gt;YugabyteDB Managed&lt;/a&gt;. Apart from being a highly-performant distributed SQL database, developers using YugabyteDB benefit from a cluster that can be shared by multiple users. As the application grows, our data layer can scale out to multiple nodes and regions.&lt;/p&gt;

&lt;p&gt;After creating a YugabyteDB Managed account and spinning up a free database cluster, we're ready to seed our database and refactor our code, using &lt;a href="https://sequelize.org/" rel="noopener noreferrer"&gt;Sequelize&lt;/a&gt;. The Sequelize ORM allows us to model our data to create database tables and execute commands. Here's how that works.&lt;/p&gt;

&lt;p&gt;First, we install Sequelize from our terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// terminal
&amp;gt; npm i sequelize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we use Sequelize to establish a connection to our database, create a table, and seed our table with data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// database.js

// JSON-array of courses
const courses = require("/path/to/courses.js");

// Certificate file downloaded from YugabyteDB Managed
const fs = require("fs");
const cert = fs.readFileSync(CERTIFICATE_PATH).toString();

// Create a Sequelize instance with our database connection details
const { Sequelize, DataTypes } = require("sequelize");
const sequelize = new Sequelize("yugabyte", "admin", DB_PASSWORD, {
   host: DB_HOST,
   port: "5433",
   dialect: "postgres",
   dialectOptions: {
      ssl: {
         require: true,
         rejectUnauthorized: true,
         ca: cert,
      },
   },
   pool: {
      max: 5,
      min: 1,
      acquire: 30000,
      idle: 10000,
   }
});

// Defining our Course model
const Course = sequelize.define(
   "course",
   {
       id: {
           type: DataTypes.INTEGER,
           autoIncrement: true,
           primaryKey: true,
       },
       title: {
           type: DataTypes.STRING,
       },

       thumbnail: {
           type: DataTypes.STRING,
       },
   }
);


async function seedDatabase() {
   try {
       // Verify that database connection is valid
       await sequelize.authenticate();

       // Create database tables based on the models we've defined
       // Drops existing tables if there are any
       await sequelize.sync({ force: true });

       // Creates course records in bulk from our JSON-array
       await Course.bulkCreate(courses);

       console.log("Courses created successfully!");
   } catch(e) {
       console.log(`Error in seeding database with courses: ${e}`);
   }
}

// Running our seeding function
seedDatabase();

// Export the model for use in our API layer
module.exports = { Course };
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By leveraging Sequelize’s &lt;a href="https://sequelize.org/api/v6/class/src/model.js~model#static-method-bulkCreate" rel="noopener noreferrer"&gt;bulkCreate&lt;/a&gt; method, we’re able to insert multiple records in one statement. This is more performant than inserting requests one at a time, like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;. . .
// JSON-array of courses
const courses = require("/path/to/courses.js");

async function insertCourses(){
    for(let i = 0; i &amp;lt; courses.length; i++) {
    await Course.create(courses[i]); 
}
}

insertCourses();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Individual inserts carry per-statement overhead: connecting, sending the request, parsing it, updating indexes, closing the connection, and so on. Connection pooling mitigates some of these costs, but generally speaking, the performance benefits of inserting in bulk are immense, and bulk inserts are far more convenient. The &lt;code&gt;bulkCreate&lt;/code&gt; method even accepts a benchmarking option that passes query execution times to your logging functions, should performance be a primary concern.&lt;/p&gt;
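To make that benchmarking option concrete, here's a sketch. With `benchmark: true`, Sequelize passes the query execution time in milliseconds as the second argument to the `logging` callback. The `logSlowQueries` helper and its threshold are hypothetical, and the actual `bulkCreate` call is shown as a comment since it needs a live connection:

```javascript
// Collects any query that takes longer than thresholdMs.
function logSlowQueries(thresholdMs) {
  const slow = [];
  function logging(sql, timingMs) {
    if (timingMs >= thresholdMs) {
      slow.push({ sql, timingMs });
    }
  }
  return { logging, slow };
}

const logger = logSlowQueries(100);

// With a real connection, you would wire it up like:
// await Course.bulkCreate(courses, { benchmark: true, logging: logger.logging });

// Simulating the calls Sequelize would make:
logger.logging("INSERT INTO courses (...) VALUES (...)", 250);
logger.logging("SELECT 1", 2);
console.log(logger.slow.length); // 1
```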

&lt;p&gt;Now that our database is seeded with records, our API layer can use this Sequelize model to query the database and return courses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// server.js

const express = require("express");
const App = express();

// Course model exported from database.js
const { Course } = require("/path/to/database.js")

App.get("/courses", async (req, res) =&amp;gt; {
   try {
       const courses = await Course.findAll();
       res.json({data: courses});
   } catch(e) {
       console.log(`Error in courses endpoint: ${e}`);
       res.status(500).json({error: "Unable to fetch courses"});
   }
});
App.listen(3000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Well, that was easy! We've moved from a static data structure to a fully-functional database in no time.&lt;/p&gt;

&lt;p&gt;What if we're provided the dataset in another data format, say, a CSV file exported from Microsoft Excel? How can we use it to seed our database?&lt;/p&gt;

&lt;h2&gt;
  
  
  Working With CSVs
&lt;/h2&gt;

&lt;p&gt;There are many NPM packages to convert CSV files to JSON, but none are quite as easy to use as &lt;a href="https://www.npmjs.com/package/csvtojson" rel="noopener noreferrer"&gt;csvtojson&lt;/a&gt;. Start by installing the package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// terminal
&amp;gt; npm i csvtojson
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we use this package to convert our CSV file to a JSON-array, which can be used by Sequelize.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// courses.csv
title,thumbnail
CSS Fundamentals,https://fake-url.com/css
JavaScript Basics,https://fake-url.com/js-basics
Intermediate JavaScript,https://fake-url.com/intermediate-js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// database.js
...
const csv = require('csvtojson');
const csvFilePath = "/path/to/courses.csv";

// JSON-array of courses from CSV
const courses = await csv().fromFile(csvFilePath);
...
await Course.bulkCreate(courses);
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just as with our well-formatted &lt;code&gt;courses.js&lt;/code&gt; file, we're able to easily convert our &lt;code&gt;courses.csv&lt;/code&gt; file to bulk insert records via Sequelize.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Developing applications with hardcoded data can only take us so far. I find that investing in tooling early in the development process sets me on the path towards bug-free coding (or so I hope!).&lt;/p&gt;

&lt;p&gt;By bulk loading records, we’re able to work with a representative dataset, in a representative application environment. As I’m sure many agree, that’s often a major bottleneck in the application development process.&lt;/p&gt;

&lt;p&gt;Give Sequelize and &lt;a href="https://cloud.yugabyte.com/signup" rel="noopener noreferrer"&gt;YugabyteDB&lt;/a&gt; a try in your next Node.js coding adventure!&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>node</category>
      <category>postgres</category>
      <category>mysql</category>
    </item>
    <item>
      <title>How to Improve Node.js Application Latency Using Different Distributed SQL Deployments</title>
      <dc:creator>Brett Hoyer</dc:creator>
      <pubDate>Thu, 03 Nov 2022 18:49:09 +0000</pubDate>
      <link>https://forem.com/yugabyte/how-to-improve-nodejs-application-latency-using-different-distributed-sql-deployments-36c0</link>
      <guid>https://forem.com/yugabyte/how-to-improve-nodejs-application-latency-using-different-distributed-sql-deployments-36c0</guid>
      <description>&lt;p&gt;&lt;em&gt;Let’s improve  Node.js application latency using different YugabyteDB distributed SQL database configurations.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Application developers rely on different database configurations (and sometimes different databases altogether) to improve latency. &lt;a href="https://dev.to/bretthoyer/series/19070"&gt;The Largest River&lt;/a&gt;, my first distributed web application, is no different. &lt;/p&gt;

&lt;p&gt;From the start of this project, I've set out to explore how different &lt;a href="https://www.yugabyte.com/yugabytedb" rel="noopener noreferrer"&gt;YugabyteDB&lt;/a&gt; configurations could be deployed to improve latency for users across the globe. During this process, I've deployed three databases with various configurations in &lt;a href="https://www.yugabyte.com/managed/" rel="noopener noreferrer"&gt;YugabyteDB Managed&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is how it went….&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Beginning
&lt;/h2&gt;

&lt;p&gt;Initially, I deployed a single-region, multi-zone cluster in &lt;a href="https://cloud.google.com/" rel="noopener noreferrer"&gt;Google Cloud&lt;/a&gt; in the &lt;code&gt;us-west2&lt;/code&gt; cloud region. &lt;/p&gt;

&lt;p&gt;Considering that databases have traditionally scaled vertically (adding more compute power to existing nodes) rather than horizontally (adding more instances), this basic &lt;a href="https://www.yugabyte.com/tech/distributed-sql/" rel="noopener noreferrer"&gt;distributed SQL&lt;/a&gt; configuration already had some benefits over many standard offerings. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5t76i5bkhlaucp3zzpc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5t76i5bkhlaucp3zzpc.png" alt="Single-region, multi-zone" width="800" height="538"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While I'm optimizing for latency, it is worth noting that a multi-zone deployment safeguards against outages in a particular data center. If one of our nodes failed, our YugabyteDB cluster would continue to operate by serving requests from the remaining nodes in the cluster.&lt;/p&gt;

&lt;p&gt;However, in terms of latency, this deployment does pose some issues. Users connecting from a nearby application instance will experience low latency (as little as &lt;code&gt;4ms&lt;/code&gt;), but those connecting from the other side of the globe, say in Australia, will suffer high latency (&lt;code&gt;250ms&lt;/code&gt; or more in some cases). &lt;/p&gt;

&lt;p&gt;This is often a reasonable tradeoff, but here we're building the next big global business (well, we aren't, but pretend we are!) and demand faster reads and writes.&lt;/p&gt;
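&lt;p&gt;To put numbers like these on your own setup, you can time a database round trip directly in Node.js. The sketch below is illustrative: &lt;code&gt;runQuery&lt;/code&gt; is a hypothetical stand-in for a real driver call (for example, a query through node-postgres), with a timer standing in for network latency.&lt;/p&gt;

```javascript
// Minimal latency-measurement sketch. `runQuery` is a hypothetical
// stand-in for a real driver call (e.g. a query through node-postgres).
async function timeQuery(runQuery) {
  const start = process.hrtime.bigint();
  await runQuery();
  const end = process.hrtime.bigint();
  return Number(end - start) / 1e6; // nanoseconds to milliseconds
}

// Usage: compare a round trip to a nearby node vs. a distant one.
// Here a setTimeout stands in for network latency.
const fakeNearbyQuery = () => new Promise((resolve) => setTimeout(resolve, 4));
timeQuery(fakeNearbyQuery).then((ms) => {
  console.log(`round trip took about ${ms.toFixed(1)}ms`);
});
```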

&lt;h2&gt;
  
  
  Going Global
&lt;/h2&gt;

&lt;p&gt;Our friends on other continents are suffering. A &lt;a href="https://www.yugabyte.com/blog/weathering-a-regional-cloud-outage-with-yugabytedb/" rel="noopener noreferrer"&gt;multi-region&lt;/a&gt;, multi-zone configuration with read replicas to the rescue!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejwshtkfyadj5ejrasa6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejwshtkfyadj5ejrasa6.png" alt="Multi-region, multi-zone with read replicas" width="800" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this cluster deployment, we are putting our data closer to our end users, which drastically improves read latency. To illustrate, let’s use those users in Sydney as an example. Their reads are now down to as little as &lt;code&gt;4ms&lt;/code&gt;, by connecting to the nearby read replica node. This is a major performance win on reads, but how about writes?&lt;/p&gt;

&lt;p&gt;However, writes from Sydney are still relatively slow in this deployment. Although our application can connect and send writes to the nearest replica node, each write is forwarded to the primary cluster to be committed. After all, a replica node is just that: a replica of the primary database. &lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.yugabyte.com/preview/architecture/docdb-replication/" rel="noopener noreferrer"&gt;DocDB replication layer&lt;/a&gt; has numerous options for replicating your data depending on your needs, so I suggest you explore further if going global is in your sights! &lt;/p&gt;

&lt;h2&gt;
  
  
  Keeping Things Separate
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.yugabyte.com/preview/explore/multi-region-deployments/row-level-geo-partitioning/" rel="noopener noreferrer"&gt;Geo-partioned&lt;/a&gt; configurations can be used both to improve latency and uphold &lt;a href="https://www.yugabyte.com/blog/achieving-gdpr-compliance-with-yugabytedb/" rel="noopener noreferrer"&gt;compliance regulations&lt;/a&gt;. For instance, European Union and Indian regulations state that all user data collected in these territories must also be stored in these territories.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2estde93clp7b5gbsti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2estde93clp7b5gbsti.png" alt="Geo-partitioned" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This configuration might also make sense for a global e-commerce application, as product catalogs might be different across geographies. Serving reads from these nodes is extremely fast, much like our previous multi-region deployment. In addition, writes are fast because we are able to commit writes to the geo-partitioned node in our user's local geography.&lt;/p&gt;

&lt;h2&gt;
  
  
  Counting Down
&lt;/h2&gt;

&lt;p&gt;So, these are just a few of the many distributed database configurations at your disposal. Which one you choose depends entirely on your application's needs. More often than not, though, distributing your data layer will improve latency (as well as resiliency, data compliance, and more!).&lt;/p&gt;

&lt;p&gt;Look out for my next article on managing your database connections in Node.js!&lt;/p&gt;

</description>
      <category>yugabytedb</category>
      <category>postgres</category>
      <category>node</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Managing Your Distributed Node.js Application Environment and Configuration</title>
      <dc:creator>Brett Hoyer</dc:creator>
      <pubDate>Wed, 19 Oct 2022 20:23:49 +0000</pubDate>
      <link>https://forem.com/yugabyte/managing-your-distributed-nodejs-application-environment-and-configuration-236c</link>
      <guid>https://forem.com/yugabyte/managing-your-distributed-nodejs-application-environment-and-configuration-236c</guid>
      <description>&lt;p&gt;&lt;em&gt;See how to effectively manage the environment and configuration of your distributed Node.js applications.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Node.js applications rely on &lt;a href="https://en.wikipedia.org/wiki/Environment_variable" rel="noopener noreferrer"&gt;environment variables&lt;/a&gt; to differentiate between…environments!&lt;/p&gt;

&lt;p&gt;This means that applications running locally often behave differently from those deployed to testing, staging, or production environments. Often, this just means listening on a different port, or pointing to a different database URL.&lt;/p&gt;
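&lt;p&gt;As a minimal sketch of that pattern (the variable names &lt;code&gt;PORT&lt;/code&gt; and &lt;code&gt;DATABASE_URL&lt;/code&gt; are illustrative, not taken from the project):&lt;/p&gt;

```javascript
// Sketch: environment-dependent settings with local fallbacks.
// PORT and DATABASE_URL are illustrative names, not from the project.
const port = Number(process.env.PORT || 3000);
const dbUrl = process.env.DATABASE_URL || "postgres://localhost:5433/dev";

console.log(`listening on ${port}, database at ${dbUrl}`);
```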

&lt;p&gt;&lt;a href="https://dev.to/bretthoyer/series/19070"&gt;The Largest River&lt;/a&gt;, my first foray into distributed application development, leverages both environment variables and configuration files to differentiate between the mode in which the code is being run, and the cloud region where it's being run.&lt;/p&gt;

&lt;p&gt;In this article, I'm going to demonstrate how the &lt;a href="https://www.npmjs.com/package/dotenv" rel="noopener noreferrer"&gt;dotenv&lt;/a&gt; and &lt;a href="https://www.npmjs.com/package/config" rel="noopener noreferrer"&gt;node-config&lt;/a&gt; NPM packages can be used together to keep your Node.js application code organized across environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting the Environment
&lt;/h2&gt;

&lt;p&gt;Application configuration can be done in a variety of ways, depending on the complexity of the system being built. In The Largest River project, I've chosen to use environment variables minimally, only setting &lt;code&gt;NODE_APP_INSTANCE&lt;/code&gt; and &lt;code&gt;NODE_ENV&lt;/code&gt;.  &lt;/p&gt;

&lt;p&gt;Let's see how this works.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#.env
NODE_APP_INSTANCE=los-angeles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# server.js

if (process.env.NODE_ENV === "development") { 
  require("dotenv").config();
}

// rest of file has access to process.env.NODE_APP_INSTANCE
if (process.env.NODE_APP_INSTANCE === "london") {
  // connect to database node nearest to London
}
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You might be wondering where I've set the value of &lt;code&gt;NODE_ENV&lt;/code&gt;. Strictly speaking, Node.js leaves it undefined unless you set it, but node-config treats an unset &lt;code&gt;NODE_ENV&lt;/code&gt; as &lt;code&gt;development&lt;/code&gt;. For configuration purposes, the following two commands are therefore equivalent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NODE_ENV=development node server.js

node server.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production, we set the variables explicitly in the startup script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# startup_script.sh
...
# Set NODE_APP_INSTANCE from instance metadata
NODE_APP_INSTANCE=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/instance_id -H "Metadata-Flavor: Google")
echo "NODE_APP_INSTANCE=${NODE_APP_INSTANCE}" | sudo tee -a /etc/environment

# Set NODE_ENV to production
echo "NODE_ENV=production" | sudo tee -a /etc/environment
source /etc/environment

# Start the Node server with explicit command-line arguments
NODE_ENV=$NODE_ENV NODE_APP_INSTANCE=$NODE_APP_INSTANCE node server.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that even on server reboots, our application will start reliably, with the proper environment set. The &lt;code&gt;/etc/environment&lt;/code&gt; file is system-wide and persistent, making it a reasonable place to store our environment variables on production servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond the Basics
&lt;/h2&gt;

&lt;p&gt;With our environment settled, it's time to check out our application configuration.  &lt;/p&gt;

&lt;p&gt;What if our app needs more expressive configuration? Environment variables &lt;em&gt;can&lt;/em&gt; do the job, but their values are always strings. What if we want to include objects and nested data types?  &lt;/p&gt;
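&lt;p&gt;To see why strings alone fall short, note that anything read back from the environment is a string, no matter what you assign (a quick illustration with a hypothetical &lt;code&gt;RETRIES&lt;/code&gt; variable):&lt;/p&gt;

```javascript
// Environment variable values are always strings.
// RETRIES is a hypothetical variable, set here purely for illustration.
process.env.RETRIES = 3;
console.log(typeof process.env.RETRIES); // "string", not "number"

// Numbers must be parsed back out explicitly:
const retries = Number(process.env.RETRIES);
```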

&lt;p&gt;&lt;a href="https://12factor.net/config" rel="noopener noreferrer"&gt;Some developers&lt;/a&gt; would say, "&lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/stringify" rel="noopener noreferrer"&gt;Stringify your JSON&lt;/a&gt; and add it to your environment."&lt;/p&gt;

&lt;p&gt;While I understand the sentiment, I disagree. It is much more pleasant to work directly inside a JSON file. So long as you take care to not check sensitive information into source control, this can be a powerful tool in your development arsenal.&lt;/p&gt;

&lt;p&gt;Here's an example, utilizing the node-config package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# default.json
{
  "Databases": [
    {
      "id": "single_region",
      "type": "single_region",
      "host": "[HOST_FOR_DATABASE]", 
      "primaryRegion": "us-west2",
      "username": "[DB_USERNAME]",
      "password": "[DB_PASSWORD]",
      "cert": "[PATH_TO_DB_CERT]",
      "dev_cert": "[PATH_TO_CERT_IN_DEVELOPMENT]",
      "label": "Single-region, multi-zone",
      "sublabel": "3 nodes deployed in US West",
      "nodes": [
        {
          "coords": [35.37387, -119.01946],
          "label": "Bakersfield",
          "zone": "us-west2"
        },
        {
          "coords": [34.95313, -120.43586],
          "label": "Santa Maria",
          "zone": "us-west2"
        },
        {
          "coords": [32.71742, -117.16277],
          "label": "San Diego",
          "zone": "us-west2"
        }
      ]
    },
    ...
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# server.js

const config = require("config");
const Databases = config.get("Databases");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By utilizing a JSON configuration file, we're able to easily edit our database details. Given that The Largest River features multiple &lt;a href="https://www.yugabyte.com/managed/" rel="noopener noreferrer"&gt;YugabyteDB Managed&lt;/a&gt; databases with varying deployments, it is helpful to not be restricted to key-value string pairs.&lt;/p&gt;

&lt;p&gt;As a matter of interest, this is what working with the same configuration would look like as an environment variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .env
DATABASES='{"Databases":[{"id":"single_region","type":"single_region","host":"[HOST_FOR_DATABASE]","primaryRegion":"us-west2","username":"[DB_USERNAME]","password":"[DB_PASSWORD]","cert":"[PATH_TO_DB_CERT]","dev_cert":"[PATH_TO_CERT_IN_DEVELOPMENT]","label":"Single-region, multi-zone","sublabel":"3 nodes deployed in US West","nodes":[{"coords":[35.37387,-119.01946],"label":"Bakersfield","zone":"us-west2"},{"coords":[34.95313,-120.43586],"label":"Santa Maria","zone":"us-west2"},{"coords":[32.71742,-117.16277],"label":"San Diego","zone":"us-west2"}]}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# server.js
if (process.env.NODE_ENV === "development") { 
  require("dotenv").config();
}

const Databases = JSON.parse(process.env.DATABASES)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Earlier we set an environment variable called &lt;code&gt;NODE_APP_INSTANCE&lt;/code&gt;.  &lt;a href="https://github.com/node-config/node-config/wiki/Environment-Variables#node_app_instance" rel="noopener noreferrer"&gt;This variable&lt;/a&gt; can be used to set instance-specific configuration in a multi-instance deployment. This can be hugely helpful if instances of the same application need to behave differently.  &lt;/p&gt;

&lt;p&gt;The Largest River comprises six application instances, each connecting to different database nodes depending on its configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiztkfngvtddkh6r88vex.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiztkfngvtddkh6r88vex.png" alt="The Largest River Application Instances" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;
The Largest River Application Instances



&lt;p&gt;If &lt;code&gt;default.json&lt;/code&gt; needs to be overridden, these application instances will read from &lt;code&gt;production-los-angeles.json&lt;/code&gt;, &lt;code&gt;development-mumbai.json&lt;/code&gt;, etc., depending on the environment and cloud region.&lt;/p&gt;
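&lt;p&gt;Under the hood, node-config merges files in a well-defined precedence order, with later files overriding earlier ones. A simplified sketch of that order (the real loader also supports hostname-specific files, &lt;code&gt;local.json&lt;/code&gt;, and more; see the node-config docs for the full list):&lt;/p&gt;

```javascript
// Simplified sketch of node-config's file precedence: later files override
// earlier ones. (The real loader also supports hostname-specific files,
// local.json, custom-environment-variables, and more.)
function configLoadOrder(env, instance) {
  return [
    "default.json",
    `default-${instance}.json`,
    `${env}.json`,
    `${env}-${instance}.json`,
  ];
}

// Merging works like Object.assign: later sources win on conflicting keys.
const merged = Object.assign(
  {},
  { port: 3000, label: "default" },      // from default.json (illustrative)
  { label: "production-los-angeles" }    // from production-los-angeles.json
);
```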

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;The way you manage your application environment and configuration is entirely up to you. After all, you're the one that has to work with it!  &lt;/p&gt;

&lt;p&gt;There are no hard and fast rules. Personally, I'm always looking for ways to remove configuration details from my application logic. This has a number of benefits: as my applications evolve, configuration has a single source of truth, and code duplication is reduced.  &lt;/p&gt;

&lt;p&gt;I hope you'll consider some of these tips when you create your next distributed Node application. And remember, this is your journey, do what works best for &lt;em&gt;you&lt;/em&gt;!&lt;/p&gt;

&lt;p&gt;Look out for my next blog, which will look at &lt;a href="https://www.yugabyte.com/tech/distributed-sql/" rel="noopener noreferrer"&gt;distributed SQL&lt;/a&gt; database configurations in depth!&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>node</category>
      <category>programming</category>
      <category>api</category>
    </item>
    <item>
      <title>Developing a Node.js Application in a Virtual Private Cloud</title>
      <dc:creator>Brett Hoyer</dc:creator>
      <pubDate>Mon, 10 Oct 2022 20:32:02 +0000</pubDate>
      <link>https://forem.com/yugabyte/developing-a-nodejs-application-in-a-virtual-private-network-29k7</link>
      <guid>https://forem.com/yugabyte/developing-a-nodejs-application-in-a-virtual-private-network-29k7</guid>
      <description>&lt;p&gt;Distributed applications rely on Virtual Private Clouds (&lt;a href="https://www.cloudflare.com/learning/cloud/what-is-a-virtual-private-cloud/" rel="noopener noreferrer"&gt;VPCs&lt;/a&gt;) to increase security and reduce latencies. Traffic is routed through the VPC, rather than the public internet, eliminating the need for any network hops along the way. &lt;/p&gt;

&lt;p&gt;In my continued development of &lt;a href="https://dev.to/bretthoyer/series/19070"&gt;The Largest River&lt;/a&gt;, I've chosen to deploy my application instances inside of a &lt;a href="https://cloud.google.com/vpc" rel="noopener noreferrer"&gt;Google Cloud VPC&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;To access these resources in the cloud, both in development and production environments, there are multiple considerations. How will I connect to my &lt;a href="https://www.yugabyte.com/yugabytedb" rel="noopener noreferrer"&gt;YugabyteDB&lt;/a&gt; clusters? And how will I connect from my local machine to remote resources in the VPC?  &lt;/p&gt;

&lt;p&gt;The following outlines my solutions to these problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  YugabyteDB Deployment in a VPC
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.yugabyte.com/managed/" rel="noopener noreferrer"&gt;YugabyteDB Managed&lt;/a&gt; relies on &lt;a href="https://docs.yugabyte.com/preview/yugabyte-cloud/cloud-basics/cloud-vpcs/" rel="noopener noreferrer"&gt;VPC Peering&lt;/a&gt;, to keep network traffic within the cloud provider's network and to establish connectivity between nodes from different cloud regions. &lt;/p&gt;

&lt;p&gt;This is an optional configuration in single-region clusters, but for multi-region deployments, it is mandatory. I’ve configured two multi-region clusters for this application, so VPC peering is required.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlcs5f1wrdp6bzdetszy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlcs5f1wrdp6bzdetszy.png" alt="YugabyteDB Managed" width="800" height="862"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this to work, I also need to set up &lt;a href="https://cloud.google.com/vpc/docs/vpc-peering" rel="noopener noreferrer"&gt;VPC network peering&lt;/a&gt; in Google Cloud.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo3btlj6tnaxdee59twd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo3btlj6tnaxdee59twd.png" alt="Google Cloud VPC Network Peering" width="800" height="668"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That was easy! &lt;/p&gt;

&lt;p&gt;Now our application nodes and database nodes have a peered network connection inside of Google Cloud. This also means that all of our database connections must come from machines inside of this network. &lt;/p&gt;

&lt;p&gt;What does this mean for local development?&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up SSH Tunneling
&lt;/h2&gt;

&lt;p&gt;This was uncharted territory for me. I needed to find a way to establish database connections to various multi-node clusters from my development machine, in order to iterate quickly.  &lt;/p&gt;

&lt;p&gt;Initially, I thought about writing application code directly on a VM using &lt;a href="https://code.visualstudio.com/docs/remote/ssh"&gt;remote SSH&lt;/a&gt; in Visual Studio Code. This was problematic, in that I would still need to push and update the code on all other application instances. I needed a solution which didn’t require continuously deploying code to remote servers within the VPC for testing.&lt;/p&gt;

&lt;p&gt;Here is where I discovered the power of &lt;a href="https://www.ssh.com/academy/ssh/tunneling" rel="noopener noreferrer"&gt;SSH tunneling&lt;/a&gt; (also known as port forwarding). By tunneling connections from my local machine through a server in the VPC, I'm able to establish a database connection to each and every database node relying on this peered VPC connection.&lt;/p&gt;

&lt;p&gt;Here's an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# development machine command line
ssh -N -i /path/to/gcp/ssh/key -L 5000:fake-database-node-url.gcp.ybdb.io:5433 vm_username@[VM_IP_ADDRESS]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command forwards &lt;code&gt;localhost:5000&lt;/code&gt; to &lt;code&gt;fake-database-node-url.gcp.ybdb.io:5433&lt;/code&gt; through a VM inside the VPC, to which I'm able to establish an SSH connection. &lt;/p&gt;

&lt;p&gt;Despite being incredibly simple to set up, this discovery has been instrumental in my development journey. By utilizing port forwarding, my Node.js application development environment has remained unchanged. &lt;/p&gt;

&lt;p&gt;When the &lt;code&gt;NODE_ENV&lt;/code&gt; environment variable is set to &lt;code&gt;development&lt;/code&gt;, the app knows to establish connections through a particular local port for each database node. When the environment variable is set to &lt;code&gt;production&lt;/code&gt;, the application knows requests should be made directly to the database URL. It just works, and it does so securely. &lt;/p&gt;
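&lt;p&gt;A minimal sketch of that branching logic (the tunnel port &lt;code&gt;5000&lt;/code&gt; and YSQL port &lt;code&gt;5433&lt;/code&gt; follow the ssh example above; the function and field names are illustrative):&lt;/p&gt;

```javascript
// Sketch: choose the connection target based on NODE_ENV. In development,
// connect through the locally forwarded port (the SSH tunnel); in production,
// connect straight to the database host. Names and ports are illustrative.
function connectionTarget(env, node) {
  if (env === "development") {
    return { host: "localhost", port: node.tunnelPort };
  }
  return { host: node.host, port: 5433 };
}

const node = { host: "fake-database-node-url.gcp.ybdb.io", tunnelPort: 5000 };
console.log(connectionTarget(process.env.NODE_ENV || "development", node));
```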

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcpqwbveqyobxm4yjvvs.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcpqwbveqyobxm4yjvvs.gif" alt="It just works" width="498" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In much the same way, SSH tunneling can be used within a database client. Here's what this looks like inside of my client of choice, &lt;a href="https://dbeaver.io/" rel="noopener noreferrer"&gt;DBeaver&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuovwtk7jc46i446bgmuc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuovwtk7jc46i446bgmuc.png" alt="DBeaver SSH Tunneling" width="695" height="613"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;For those working as IT administrators, DevOps engineers, or network engineers, this might seem like a rudimentary discovery!  &lt;/p&gt;

&lt;p&gt;SSH tunneling has a wide range of use cases and I'm sure I've just scratched the surface. However, nothing feels better than finding a simple, frictionless solution to a blocking issue you've never faced before. &lt;/p&gt;

&lt;p&gt;Without SSH tunneling, I was essentially stuck, doomed only to deploy my code to an instance within the VPC in order to test its validity. I will not be apologizing for my excitement at this time. 😎&lt;/p&gt;

&lt;p&gt;We’re approaching the finish line, stay tuned!&lt;/p&gt;

</description>
      <category>node</category>
      <category>javascript</category>
      <category>webdev</category>
      <category>yugabytedb</category>
    </item>
    <item>
      <title>Deploying a Node.js application across multiple geographies with Terraform and Ansible</title>
      <dc:creator>Brett Hoyer</dc:creator>
      <pubDate>Fri, 16 Sep 2022 17:21:49 +0000</pubDate>
      <link>https://forem.com/yugabyte/deploying-a-nodejs-application-across-multiple-geographies-with-terraform-and-ansible-20n7</link>
      <guid>https://forem.com/yugabyte/deploying-a-nodejs-application-across-multiple-geographies-with-terraform-and-ansible-20n7</guid>
      <description>&lt;p&gt;A &lt;a href="https://dev.to/yugabyte/the-largest-river-part-1-first-steps-to-building-a-globally-distributed-application-47ek"&gt;plan&lt;/a&gt; is in place, &lt;a href="https://dev.to/yugabyte/geo-distributed-applications-using-terraforms-infrastructure-automation-3mj3"&gt;infrastructure&lt;/a&gt; is provisioned, now what? &lt;/p&gt;

&lt;p&gt;Well, I've decided to take a step back and review the distributed system holistically, before surging forward with application development.&lt;/p&gt;

&lt;p&gt;Which system level improvements can I make to increase the speed with which I can iterate on my designs, test my code, and alert myself to potential bottlenecks?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03mkgs38a0chuaf6r6kr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03mkgs38a0chuaf6r6kr.png" alt="Phone Location" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbif88qz2vkrdudl4i546.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbif88qz2vkrdudl4i546.png" alt="Main View" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s a sneak peek into the globally-distributed bookstore that I’ve been working on. Users can simulate connecting from six different locations across the globe: Los Angeles, Washington, D.C., São Paulo, London, Mumbai, and Sydney. Multiple &lt;a href="https://www.yugabyte.com/managed/" rel="noopener noreferrer"&gt;YugabyteDB Managed&lt;/a&gt; database clusters are deployed to support this application. So far, I've deployed three clusters: a single-region cluster, a multi-region cluster with &lt;a href="https://docs.yugabyte.com/preview/explore/multi-region-deployments/read-replicas-ysql/" rel="noopener noreferrer"&gt;read replicas&lt;/a&gt;, and a &lt;a href="https://docs.yugabyte.com/preview/explore/multi-region-deployments/row-level-geo-partitioning/" rel="noopener noreferrer"&gt;geo-partitioned&lt;/a&gt; cluster. Users may choose which database the application connects to, in order to highlight the latency discrepancies between database configurations. &lt;/p&gt;

&lt;p&gt;In this blog, I’ll explain how multiple Node.js servers are deployed across these geographies to make this possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Changing it up
&lt;/h2&gt;

&lt;p&gt;The main benefit of &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; is the ability to easily change cloud infrastructure when plans change. &lt;/p&gt;

&lt;p&gt;Well, plans have changed! &lt;/p&gt;

&lt;p&gt;I have decided to scrap Google's &lt;a href="https://cloud.google.com/container-optimized-os/docs" rel="noopener noreferrer"&gt;Container-Optimized OS&lt;/a&gt; in favor of &lt;a href="https://cloud.google.com/compute/docs/create-linux-vm-instance" rel="noopener noreferrer"&gt;Ubuntu&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;In the short term, this will increase my productivity. I'll be able to update my code and manipulate the environment without needing to push builds to &lt;a href="https://cloud.google.com/container-registry" rel="noopener noreferrer"&gt;Google Container Registry&lt;/a&gt;. Change can be hard, but with Terraform, it's incredibly easy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# main.tf

boot_disk {
   initialize_params {
       # image = "cos-cloud/cos-stable"          # old value, now commented out
       image = "ubuntu-os-cloud/ubuntu-2004-lts" # new
   }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the only configuration change required to provision new infrastructure, with the OS of my choosing. The beauty of automation is that the rest of the system, the networking, the instance sizes, and the locations, remain unchanged. A little upfront configuration saved me a ton of time and a lot of headaches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4tjnh3dpyxwf49pbg1l.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4tjnh3dpyxwf49pbg1l.gif" alt="Easy work" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Automating Application Deployment
&lt;/h2&gt;

&lt;p&gt;However, it's not all rainbows and butterflies! &lt;/p&gt;

&lt;p&gt;You might recall that in my &lt;a href="https://dev.to/yugabyte/geo-distributed-applications-using-terraforms-infrastructure-automation-3mj3"&gt;previous post&lt;/a&gt;, I outlined how Google's Container-Optimized OS automatically pulls and runs container images upon provisioning infrastructure. &lt;/p&gt;

&lt;p&gt;Now that we're no longer running containers on our VMs, we'll need to deploy and run our code another way. &lt;/p&gt;

&lt;p&gt;Fear not, there are many tools out there which make this a breeze. I've chosen to use &lt;a href="https://www.ansible.com/" rel="noopener noreferrer"&gt;Ansible&lt;/a&gt; for my code automation. &lt;/p&gt;

&lt;p&gt;Let's dive into it, starting with some configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# inventory.yml

[Private]
10.168.0.2 #Los Angeles
10.154.0.2 #London
10.160.0.2 #Mumbai
10.158.0.2 #Sao Paulo
10.152.0.2 #Sydney
10.150.0.2 #Washington, D.C.

[Private:vars]
ansible_ssh_common_args='-o ProxyCommand="ssh -W %h:%p -q [USERNAME]@[IP_ADDRESS]"'
ansible_ssh_extra_args='-o StrictHostKeyChecking=no'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, I'm setting the IP addresses of the 6 application instances we provisioned using Terraform, plus some variables to be used by our Ansible playbook.&lt;/p&gt;

&lt;p&gt;You might have noticed that I've set internal IP addresses, which typically cannot be accessed from outside of the private network. This is correct, and I'll be covering how this works in my next blog in this series (hint: SSH tunneling). For now, just assume these are publicly-accessible addresses.&lt;/p&gt;

&lt;p&gt;Now, on to the playbook…&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# playbook.yml

---
- hosts: all
  become: yes
  vars:
    server_name: "{{ ansible_default_ipv4.address }}"
    document_root: /home/
    app_root: ../api_service/
  tasks:
    - name: Copy api service to /home directory
      synchronize:
        src: "{{ app_root }}"
        dest: "{{ document_root }}"
        rsync_opts:
          - "--no-motd"
          - "--exclude=node_modules"
          - "--rsh='ssh {{ ansible_ssh_common_args }} {{ ansible_ssh_extra_args }}'"

    - name: Install project dependencies
      command: sudo npm install --ignore-scripts
      args:
        chdir: /home/

    - name: Kill Node Server
      command: sudo pm2 kill
      args:
        chdir: /home/

    - name: Restart Node Server
      shell: NODE_ENV=$NODE_ENV NODE_APP_INSTANCE=$NODE_APP_INSTANCE sudo pm2 start index.js --name node-app
      args:
        chdir: /home/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script lays out the steps required to provision our code and run our Node.js server. &lt;/p&gt;

&lt;p&gt;We start by reading the hosts from our &lt;code&gt;inventory.yml&lt;/code&gt; file and using rsync to push code to each of them in sequence. &lt;/p&gt;

&lt;p&gt;Our application dependencies are installed, the server is stopped if currently running, and then restarted with environment variables set on each machine. This is all remarkably easy to set up, understand, and replicate. &lt;/p&gt;

&lt;p&gt;We now essentially have just two buttons to push - one to spin up our VMs and another to provision our application code.&lt;/p&gt;
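
&lt;p&gt;In practice, those two "buttons" are just shell commands. A hypothetical pair of invocations might look like the following (the working directories, and any SSH user flags you may need, are assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Button one: provision the VMs (run from the Terraform config directory)
terraform apply

# Button two: deploy the application code to every host in the inventory
ansible-playbook -i inventory.yml playbook.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;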

&lt;h2&gt;
  
  
  Starting Fresh
&lt;/h2&gt;

&lt;p&gt;You might be wondering where those environment variables are being set on our Ubuntu VMs. Don't we need to configure these instances and install system-level dependencies? &lt;/p&gt;

&lt;p&gt;We sure do! &lt;/p&gt;

&lt;p&gt;We can add a startup script to our Terraform config which will run whenever an instance starts or reboots.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# startup_script.sh
#! /bin/bash

if [ ! -f "/etc/initialized_on_startup" ]; then
    echo "Launching the VM for the first time."

    sudo apt update
    sudo apt-get update
    curl -fsSL https://deb.nodesource.com/setup_16.x | sudo -E bash -
    sudo apt-get install -y nodejs
    sudo npm install pm2 -g
    NODE_APP_INSTANCE=$(curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/instance_id -H "Metadata-Flavor: Google")
    echo "NODE_APP_INSTANCE=${NODE_APP_INSTANCE}" | sudo tee -a /etc/environment
    echo $NODE_APP_INSTANCE
    echo "NODE_ENV=production" | sudo tee -a /etc/environment
    source /etc/environment
    sudo touch /etc/initialized_on_startup
else
    # Executed on restarts
    source /etc/environment
    cd /home/api_service &amp;amp;&amp;amp; NODE_ENV=$NODE_ENV NODE_APP_INSTANCE=$NODE_APP_INSTANCE sudo pm2 start index.js --name node-app
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The startup script installs Node.js and PM2 and reads from our instance metadata to set the &lt;code&gt;NODE_APP_INSTANCE&lt;/code&gt; environment variable. &lt;/p&gt;

&lt;p&gt;This environment variable will be used in our application to determine where our application is running, and which database nodes the API service should connect to. I’ll  cover this in more detail in a future post. &lt;/p&gt;
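
&lt;p&gt;While the details are saved for a future post, a minimal sketch of how this variable might select database hosts is shown below. The instance IDs and host addresses are illustrative assumptions, not the project's real configuration.&lt;/p&gt;

```javascript
// Hypothetical mapping from instance ID (set by the startup script)
// to nearby database hosts. The IDs mirror the Terraform metadata
// above; the host addresses are made up for illustration.
const DB_HOSTS_BY_INSTANCE = {
  "los-angeles": ["10.168.0.10"],
  "london": ["10.154.0.10"],
  "mumbai": ["10.160.0.10"],
};

// Resolve the hosts for the current instance, failing loudly on an
// unknown ID so a misconfigured VM is caught at startup.
function dbHostsFor(instanceId) {
  const hosts = DB_HOSTS_BY_INSTANCE[instanceId];
  if (!hosts) {
    throw new Error("Unknown NODE_APP_INSTANCE: " + instanceId);
  }
  return hosts;
}

// At startup, read the variable written to /etc/environment by the VM.
const instanceId = process.env.NODE_APP_INSTANCE || "los-angeles";
console.log("Connecting to database hosts:", dbHostsFor(instanceId));
```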

&lt;p&gt;If our VM requires a restart, this script will re-run our Node server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# main.tf
variable "instances" {
  type = map(object({
    name = string
    zone = string
    network_ip = string
    metadata = object({ instance_id = string })
    startup_script = string
  }))
  default = {
    "los_angeles" = {
      metadata = { instance_id = "los-angeles" }
      startup_script = "startup_script.sh"
      name = "los-angeles"
      zone = "us-west2-a" // Los Angeles, CA
      network_ip = "10.168.0.2"
    },
    ...
  }
}

resource "google_compute_instance" "vm_instance" {
  ...
  metadata_startup_script = file("${path.module}/${each.value.startup_script}")
  ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After writing this basic startup script, I've added it to the Terraform config, so that it's run on all provisioned API service instances. &lt;/p&gt;

&lt;p&gt;I now have a great foundation with which to build out my application logic. My infrastructure and application code have been deployed in a way that is easily replicable.&lt;/p&gt;

&lt;p&gt;I'm looking forward to wrapping up the application logic, so I can unveil the final product. I guess I better get coding…&lt;br&gt;&lt;br&gt;
Follow along for more updates!&lt;/p&gt;

</description>
      <category>ansible</category>
      <category>terraform</category>
      <category>webdev</category>
      <category>yugabytedb</category>
    </item>
    <item>
      <title>Automating Infrastructure For Geo-Distributed Applications With Terraform</title>
      <dc:creator>Brett Hoyer</dc:creator>
      <pubDate>Thu, 28 Jul 2022 20:59:00 +0000</pubDate>
      <link>https://forem.com/yugabyte/geo-distributed-applications-using-terraforms-infrastructure-automation-3mj3</link>
      <guid>https://forem.com/yugabyte/geo-distributed-applications-using-terraforms-infrastructure-automation-3mj3</guid>
      <description>&lt;p&gt;An increasing number of products are being developed with performance and scalability in mind. Modern developers need to rely on improved tooling to efficiently and reliably build, test and deploy their applications.  &lt;/p&gt;

&lt;p&gt;Most of us have moved on from hosting applications on machines in our bedroom closets. Instead we now choose to deploy to instances managed by &lt;a href="https://aws.amazon.com/" rel="noopener noreferrer"&gt;Amazon Web Services&lt;/a&gt;, &lt;a href="https://cloud.google.com/" rel="noopener noreferrer"&gt;Google Cloud&lt;/a&gt;, or one of countless other cloud providers.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek83kj5uf2jrh7gfppnz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fek83kj5uf2jrh7gfppnz.jpg" alt="Server Room Disaster" width="500" height="375"&gt;&lt;/a&gt;&lt;/p&gt;
"This looks like fun!" - Nobody



&lt;p&gt; &lt;/p&gt;

&lt;p&gt;To continue my development of &lt;a href="https://dev.to/yugabyte/the-largest-river-part-1-first-steps-to-building-a-globally-distributed-application-47ek"&gt;The Largest River&lt;/a&gt; (my first foray into global application development), I’ve chosen to use &lt;a href="https://cloud.google.com/compute" rel="noopener noreferrer"&gt;Google Compute Engine&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;Major cloud providers like Google provide both web and command-line interfaces to create and manage virtual machines, as well as offering countless other products. However, when deploying apps at scale, this can become quite an arduous task! What if the deployment environment needs to be replicated, say, to create a test or staging environment? Surely, we’d want this environment to directly mirror that of production.  Any configuration step which isn’t automated leaves us open to unexpected bugs.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdc1wrm8azlo2p75emk0.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdc1wrm8azlo2p75emk0.gif" alt="Unexpected Bugs" width="480" height="362"&gt;&lt;/a&gt;&lt;br&gt;
 &lt;/p&gt;
&lt;h2&gt;
  
  
  Infrastructure as Code
&lt;/h2&gt;

&lt;p&gt;Rather than relying on time-consuming manual tasks or a series of shell scripts to create and modify infrastructure, I’ve decided to use &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; to do this from a single configuration file. &lt;/p&gt;

&lt;p&gt;Although I’ve only scratched the surface with its functionality, the benefits are immediately apparent.  Here are some of the ways I’ve begun using Terraform with The Largest River.&lt;/p&gt;

&lt;p&gt;Using the &lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs" rel="noopener noreferrer"&gt;Google Cloud Platform Provider&lt;/a&gt;, I’m easily able to spin up and tear down my project infrastructure on GCP.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# main.tf

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "4.24.0"
    }
  }
}

provider "google" {
  credentials = file([path_to_json_keyfile_downloaded_from_gcp])
  project = [gcp_project_name]
  region  = "us-central1"
  zone    = "us-central1-c"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After downloading my JSON keyfile from GCP, I’ve set up the Google provider, which can be used to securely connect to my project in the cloud.  Although my application resources will be spread across multiple &lt;a href="https://cloud.google.com/compute/docs/regions-zones" rel="noopener noreferrer"&gt;regions and zones&lt;/a&gt;, I’ve chosen the us-central1 region and us-central1-c zone as my project defaults.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keeping Things Private
&lt;/h2&gt;

&lt;p&gt;With the Google provider configured, I’m able to set up network, firewall, and instance resources. &lt;/p&gt;

&lt;p&gt;In the case of The Largest River, setting up a &lt;a href="https://cloud.google.com/vpc/docs/vpc" rel="noopener noreferrer"&gt;VPC network&lt;/a&gt; is important for multiple reasons:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I’ll be able to communicate between servers in a multi-region deployment, without having to traverse the public internet. This comes with latency and security benefits.
&lt;/li&gt;
&lt;li&gt;As I’ll use this VPC for my multi-region &lt;a href="https://www.yugabyte.com/managed/" rel="noopener noreferrer"&gt;YugabyteDB Managed&lt;/a&gt; deployments, my application servers and database nodes will live within the same global network.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s how the network is initialized using the &lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_network" rel="noopener noreferrer"&gt;google_compute_network&lt;/a&gt; resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "google_compute_network" "vpc_network" {
 name = "tlr-network"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make use of this network, I’ve added some &lt;a href="https://cloud.google.com/vpc/docs/firewalls" rel="noopener noreferrer"&gt;firewall&lt;/a&gt; rules to enable SSH and HTTP traffic to my instances, using the &lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_firewall" rel="noopener noreferrer"&gt;google_compute_firewall&lt;/a&gt; resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "google_compute_firewall" "ssh-rule" {
 name    = "ssh-rule"
 network = google_compute_network.vpc_network.name
 allow {
   protocol = "tcp"
   ports    = ["22"]
 }
 target_tags   = ["allow-ssh"]
 source_ranges = ["0.0.0.0/0"]
}

resource "google_compute_firewall" "http-rule" {
 name    = "http-rule"
 network = google_compute_network.vpc_network.name
 allow {
   protocol = "tcp"
   ports    = ["80", "8080"]
 }
 target_tags   = ["allow-http"]
 source_ranges = ["0.0.0.0/0"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These rules include some new parameters, namely, target_tags and source_ranges: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Target tags apply specific firewall rules to an instance in the network. We’ll see how this works soon, as I begin to configure some virtual machines.
&lt;/li&gt;
&lt;li&gt;Source ranges determine the IP addresses this rule applies to.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this example, we’ve opened up connections to the whole internet, which, you know, isn’t great! In reality, we’d set our source ranges to a sensible range within our subnet for instances that don’t need to be exposed to the public internet.&lt;/p&gt;
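
&lt;p&gt;As a sketch only (the CIDR below is an assumption, not this project's actual subnet), a locked-down rule would simply swap the source range:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resource "google_compute_firewall" "internal-http-rule" {
  name    = "internal-http-rule"
  network = google_compute_network.vpc_network.name
  allow {
    protocol = "tcp"
    ports    = ["8080"]
  }
  target_tags   = ["allow-http"]
  # Only hosts inside this (hypothetical) internal range may connect.
  source_ranges = ["10.128.0.0/9"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;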

&lt;h2&gt;
  
  
  Choosing the Right Compute Instances
&lt;/h2&gt;

&lt;p&gt;What good is a network without any instances?  We currently have roads leading to nowhere.  Let’s change that, using the &lt;a href="https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance" rel="noopener noreferrer"&gt;google_compute_instance &lt;/a&gt; resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable "instances" {
 type = map(object({
   name = string
   zone = string
 }))
 default = {
   "usa" = {
     name = "instance-usa"
     zone = "us-central1-c"
   },
   "europe" = {
     name = "instance-europe"
     zone = "europe-west3-b"
   },
   "asia" = {
     name = "instance-asia"
     zone = "asia-east1-a"
   }
 }
}

resource "google_compute_instance" "vm_instance" {
 for_each = var.instances

 name                      = each.value.name
 machine_type              = "f1-micro"
 zone                      = each.value.zone
 allow_stopping_for_update = true

 tags = ["allow-ssh", "allow-http"]

 boot_disk {
   initialize_params {
     image = "cos-cloud/cos-stable"
   }
 }

 metadata = {
   gce-container-declaration = "spec:\n  containers:\n    - name: tlr-api\n      image: [container_image_url_in_gcr]\n      stdin: false\n      tty: false\n  restartPolicy: Always\n"
 }

 service_account {
   email = [email_for_project_service_account]
   scopes = [
     "https://www.googleapis.com/auth/cloud-platform"
   ]
 }

 network_interface {
   network = google_compute_network.vpc_network.name
   access_config {
   }
 }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have some containerized houses at the end of the road, we’re starting to get somewhere...  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vbu07gs53h60j7p4k5d.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vbu07gs53h60j7p4k5d.jpeg" alt="Container House" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;
(Cringeworthy developer joke, my apologies.)



&lt;p&gt; &lt;/p&gt;

&lt;p&gt;In all seriousness, GCP offers a wide range of instance types.  For The Largest River, I’ve chosen &lt;a href="https://cloud.google.com/container-registry" rel="noopener noreferrer"&gt;Google Container Registry&lt;/a&gt;, and thus, a &lt;a href="https://cloud.google.com/container-optimized-os/docs/concepts/features-and-benefits#:~:text=Container%2DOptimized%20OS%20from%20Google,open%20source%20Chromium%20OS%20project." rel="noopener noreferrer"&gt;Container-Optimized OS&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;To deploy to multiple zones without code duplication, I’ve set the &lt;strong&gt;instances&lt;/strong&gt; &lt;a href="https://www.terraform.io/language/values/variables" rel="noopener noreferrer"&gt;variable&lt;/a&gt;, which is looped to create instances in the USA, Europe and Asia. I’ve added two tags to these instances, &lt;em&gt;allow-ssh&lt;/em&gt; and &lt;em&gt;allow-http&lt;/em&gt;. These tags match the target tags specified in our firewall rule blocks, which means these rules will be applied to the deployed instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;With the core elements defined in our configuration, we can make use of the &lt;a href="https://www.terraform.io/cli/commands" rel="noopener noreferrer"&gt;Terraform CLI&lt;/a&gt; to provision the infrastructure.  You don’t even need to click in the GCP console as Terraform elegantly tracks changes to this configuration, making planning and updating a breeze. &lt;/p&gt;
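
&lt;p&gt;The core workflow boils down to a handful of CLI commands, shown here as a minimal sketch (run from the directory containing &lt;code&gt;main.tf&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform init      # download the google provider declared in main.tf
terraform plan      # preview the changes Terraform intends to make
terraform apply     # create or update the network, firewall, and instances
terraform destroy   # tear everything down when finished
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;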

&lt;p&gt;Much like core app development, the infrastructure as code community has fully adopted code reuse and expressive language support. The &lt;a href="https://www.terraform.io/language" rel="noopener noreferrer"&gt;Terraform Language&lt;/a&gt; includes many such features and I look forward to diving deeper, as I continue to build this geo-distributed application. You can revisit the start of my journey &lt;a href="https://dev.to/yugabyte/the-largest-river-part-1-first-steps-to-building-a-globally-distributed-application-47ek"&gt;here&lt;/a&gt;, and stay tuned for more updates!&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>yugabytedb</category>
      <category>distributedsql</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>First Steps to Building a Globally-Distributed Application</title>
      <dc:creator>Brett Hoyer</dc:creator>
      <pubDate>Wed, 29 Jun 2022 18:12:00 +0000</pubDate>
      <link>https://forem.com/yugabyte/the-largest-river-part-1-first-steps-to-building-a-globally-distributed-application-47ek</link>
      <guid>https://forem.com/yugabyte/the-largest-river-part-1-first-steps-to-building-a-globally-distributed-application-47ek</guid>
      <description>&lt;p&gt;As software developers, we’re often prompted to learn new technologies, either by our employers, or by our own curiosities.  This endless learning is one of the primary reasons we got into this field to begin with.  UI developers wish they had a deeper understanding of backend frameworks, and backend developers wish they could write CSS transitions and animations (no they don’t, but you get what I mean).  &lt;/p&gt;

&lt;p&gt;Throughout my own software journey, my desire to enhance my skills across the stack has sent me down a seemingly endless maze of blog posts, tutorials and instructional videos.  While these mediums serve their purpose, I’m often left wanting to learn through my own explorations and failures to determine what’s “best”.  &lt;/p&gt;

&lt;p&gt;As such, I’ve started to build a new globally-distributed application called “The Largest River” that will certainly satisfy this desire. This blog series will highlight my discoveries, shortcomings, and everything in between as I work to complete this project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Project
&lt;/h2&gt;

&lt;p&gt;Today’s application development landscape is drastically different from that of years past. We’re handling scalability in new and exciting ways, and serving traffic from all over the globe. This is what I want to focus on. How can I build a distributed application that will service a global marketplace? We’ve all built more than our fair share of “to-do list” applications.  This will not be one of them.&lt;/p&gt;

&lt;p&gt;There are a few key aspects I’d like to highlight:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serving traffic globally with low latency&lt;/li&gt;
&lt;li&gt;Being resilient to potential zone or region outages&lt;/li&gt;
&lt;li&gt;Properly adhering to data compliance laws (for instance, all EU user data must be stored in the EU)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While the precise features of the application are immaterial, the architecture is of primary importance. A lot of tools (and buzzwords) come to mind when trying to architect a modern web application. Assets can be served from a &lt;a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/" rel="noopener noreferrer"&gt;CDN&lt;/a&gt; to improve page load speed. A &lt;a href="https://cloud.google.com/load-balancing" rel="noopener noreferrer"&gt;global load balancer&lt;/a&gt; can front all traffic, sending requests to the nearest server. &lt;a href="https://cloud.google.com/serverless" rel="noopener noreferrer"&gt;Serverless functions&lt;/a&gt; and &lt;a href="https://vercel.com/docs/concepts/functions/edge-functions" rel="noopener noreferrer"&gt;edge functions&lt;/a&gt; can be used to handle requests, eliminating the need to manage infrastructure altogether. &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt; can be deployed for container orchestration, networking and healing, amongst many other production-grade features.  The list goes on.&lt;/p&gt;

&lt;p&gt;In an attempt to walk before I run, I’ve decided to start with a &lt;em&gt;relatively&lt;/em&gt; simple architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01mmo7pq55dcshp0vffc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01mmo7pq55dcshp0vffc.png" alt="Architecture Diagram" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://reactjs.org/" rel="noopener noreferrer"&gt;React&lt;/a&gt; frontend sends traffic through an &lt;a href="https://www.nginx.com/" rel="noopener noreferrer"&gt;Nginx&lt;/a&gt; reverse proxy, to VMs running in multiple regions. Running VMs in multiple regions (once properly load-balanced) will result in shorter round trips, as well as enabling us to reroute traffic in the event of a region outage. These VMs are all running the same containerized &lt;a href="https://nodejs.org/en/" rel="noopener noreferrer"&gt;Node.js&lt;/a&gt; process, which creates a connection to a &lt;a href="https://www.yugabyte.com/" rel="noopener noreferrer"&gt;YugabyteDB&lt;/a&gt; database. YugabyteDB is a &lt;a href="https://www.yugabyte.com/tech/postgres-compatibility/" rel="noopener noreferrer"&gt;Postgres-compatible&lt;/a&gt;, highly-available, &lt;a href="https://www.yugabyte.com/tech/distributed-sql/" rel="noopener noreferrer"&gt;distributed SQL&lt;/a&gt; database. If you’d like to spin up an &lt;a href="https://cloud.yugabyte.com/signup" rel="noopener noreferrer"&gt;always-free single-node cluster&lt;/a&gt; for yourself, it is easy to do so.&lt;/p&gt;
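
&lt;p&gt;To make the reverse proxy step concrete, here is a minimal, hypothetical Nginx snippet. The upstream addresses and the &lt;code&gt;/api/&lt;/code&gt; path are placeholders for illustration, not the project's real configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# nginx.conf (sketch)
upstream api_servers {
  # Placeholder VM addresses, one per region
  server 10.168.0.2:8080;
  server 10.154.0.2:8080;
}

server {
  listen 80;
  location /api/ {
    proxy_pass http://api_servers;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;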

&lt;p&gt;This architecture is intentionally a bit naive. I’m able to demonstrate that serving traffic to a single database node in another region comes with extremely high latencies. Businesses have operated this way for many years, scaling their databases vertically, at the cost of network latency (amongst many other things). As I continue to iterate on this design, I’ll deploy a multi-zone, multi-region database, which will be more representative of a modernized deployment. This will allow for both zone and region failures and enable data compliance, in addition to improving read and write latencies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia2.giphy.com%2Fmedia%2Fl0IylOPCNkiqOgMyA%2Fgiphy.gif%3Fcid%3D790b7611e008b5d9002a012f5e5ab8259905b0a954c8c4ba%26rid%3Dgiphy.gif%26ct%3Dg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmedia2.giphy.com%2Fmedia%2Fl0IylOPCNkiqOgMyA%2Fgiphy.gif%3Fcid%3D790b7611e008b5d9002a012f5e5ab8259905b0a954c8c4ba%26rid%3Dgiphy.gif%26ct%3Dg" title="99% of the development process." alt="chaos gif" width="480" height="304"&gt;&lt;/a&gt;&lt;/p&gt;
99% of the development process.



&lt;h2&gt;
  
  
  The Development Environment
&lt;/h2&gt;

&lt;p&gt;I decided to use &lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; and Docker Compose to simulate this distributed environment on my local machine. Containerization affords me the ability to easily manage and isolate dependencies, while also mirroring the production environment. Through a single command, I’m able to spin up all of the processes locally, passing the environment variables required to make connections to my remote database. Additionally, I’m using &lt;a href="https://docs.docker.com/storage/volumes/" rel="noopener noreferrer"&gt;volumes&lt;/a&gt; to persist data, which affords me all of the niceties expected of modern application development, such as client reloads and server restarts on file changes.&lt;/p&gt;
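
&lt;p&gt;A pared-down sketch of what such a Compose file might contain (the service names, ports, and variable names here are assumptions for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# docker-compose.yml (sketch)
version: "3.8"
services:
  api:
    build: ./api_service
    ports:
      - "8080:8080"
    environment:
      - DB_HOST=${DB_HOST}       # remote YugabyteDB endpoint
      - NODE_ENV=development
    volumes:
      - ./api_service:/app       # server restarts on file changes
      - /app/node_modules
  client:
    build: ./client
    ports:
      - "3000:3000"
    volumes:
      - ./client:/app            # client reloads on file changes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;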

&lt;h2&gt;
  
  
  The Production Environment
&lt;/h2&gt;

&lt;p&gt;After countless hours of research and development, I’ve decided to run a &lt;a href="https://cloud.google.com/container-optimized-os/docs" rel="noopener noreferrer"&gt;Container-Optimized OS&lt;/a&gt; on &lt;a href="https://cloud.google.com/compute" rel="noopener noreferrer"&gt;Google Compute Engine VMs&lt;/a&gt;. These machines run images, which I’ve pushed to the &lt;a href="https://cloud.google.com/container-registry" rel="noopener noreferrer"&gt;Google Container Registry&lt;/a&gt;. As mentioned previously, this is helpful in that the same Docker images can be run locally and in production, with minimal differences in configuration.  &lt;/p&gt;

&lt;p&gt;Of course, this all sounds great, but how are the containers managed? How are they networked? In my career, I’ve rarely been faced with handling network traffic between multiple VMs and database nodes, so this learning curve is particularly steep. Thankfully, I’ve already made great progress (or so I think). I look forward to sharing my findings in future blog posts.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/nUoXu8yH0ZffVCudD0/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/nUoXu8yH0ZffVCudD0/giphy.gif" title="The 1% that makes it all worthwhile." alt="success gif" width="500" height="500"&gt;&lt;/a&gt;&lt;/p&gt;
The 1% that makes it all worthwhile.



</description>
      <category>node</category>
      <category>react</category>
      <category>yugabytedb</category>
      <category>distributedsql</category>
    </item>
  </channel>
</rss>
