Forem: Lawrence Cooke

Setting up Claude Code for success

Lawrence Cooke — Fri, 27 Mar 2026 16:42:44 +0000

When first starting a new project using Claude Code, it is easy to jump ahead, diving straight into coding. However, if you spend a bit of time setting up Claude Code, the outcome will be a smoother and more enjoyable development experience.

Generating a CLAUDE.md file

When first starting your project, spend time talking through the requirements of the project with Claude. Take your tech stack into account: what language are you writing it in? Are you using a framework?

Beyond the tech stack, also discuss how you want the project built. How do you want the folder structure to be laid out? How do you want to interact with the database?

Here is a general list of things you might include:

Programming Language (and version)
Framework and routing approach
Dependency injection patterns
Database access layer (raw queries, query builder, ORM)
Coding standard
Folder and namespace conventions
Error handling approach
Unit testing
Third-party packages you might include

The goal of this conversation is to produce a CLAUDE.md file: a markdown document that lives in your project root and is automatically loaded into Claude's context at the start of every session. It's the source of truth that means you never have to re-explain your stack again.

What a CLAUDE.md looks like

Here’s an example CLAUDE.md file

# Project Blueprint

## Stack
- Language: PHP 8.3
- Framework: FlightPHP (micro-framework)
- Database: Postgres 18
- Coding standard: PSR-12

## Database Access
Use Flight PHP's built-in PDO wrapper with prepared statements exclusively.
All queries live inside the business logic — never in controllers.

## Folder Structure
app/
  Controllers/ # Thin controllers, only reference Logic classes and UI
  Logic/       # Business logic
  config/      # Config files — DO NOT READ OR EDIT

## Code Style
Run PSR-12 checks via: vendor/bin/phpcs --standard=PSR12 app/
Auto-fix via:          vendor/bin/phpcbf --standard=PSR12 app/

## What NOT to do
- Never use static methods on service classes
- Never put SQL in controllers
- Never read from app/config/config.php
- Never commit directly — all git operations are manual

Once this file exists, every new Claude Code session starts with this context. Claude knows your conventions without being told.

The CLAUDE.md file is a time saver when coding across multiple sessions. It also allows more frequent context clearing, which helps keep Claude Code focused on the current ask, lowers costs, and saves you from explaining design choices every session.

Items can also be added to the file as you come across issues and technical asks that may not have been in the CLAUDE.md file initially, building up a good repository of information that Claude can use to help build the application to your specifications.

We all have different ways of coding. Teaching Claude how you like to code through the CLAUDE.md file results in code similar to what you would write yourself, which helps when reviewing it, as it will seem familiar.

Guardrails

Claude Code is powerful, which is why it needs guardrails. Left unconstrained, an AI can read your environment files, touch your git history, run database commands, or make network requests you didn’t intend. None of that maliciously, just helpfully, and that’s the problem.

The permissions system in Claude Code lets you define exactly what it can and cannot do, locked into your project’s .claude/settings.json. The deny list is your non-negotiable safety layer.

There are three settings files:

managed-settings.json
settings.local.json
settings.json

These are hierarchical.

managed-settings.json is the top tier. It lives outside the repo.

OS	Path
macOS	`/Library/Application Support/ClaudeCode/managed-settings.json`
Linux	`/etc/claude-code/managed-settings.json`
Windows	`C:\Program Files\ClaudeCode\managed-settings.json`

Instructions in the managed-settings.json cannot be overridden by instructions in the repo level json files.

settings.json should be committed to your repository.
settings.local.json should not be committed to the repository.
settings.local.json overrides settings.json.

The difference between these is that settings.json is shared across developers in a multi-developer setup, while settings.local.json is intended for individual developer instructions.

In a business setting, putting the most critical instructions in managed-settings.json, and limiting the use of settings.json and settings.local.json, sets the system up for success. As an individual developer, just using the settings.local.json file might be sufficient.

What belongs on the deny list

Environment files & secrets
Your .env, .env.*, certificates (.pem, .key, .p12), and SSH/AWS credential folders should be completely off-limits. Claude has no reason to read them. With config files, creating a sample config with no secret keys set is useful for giving Claude access to the config structure without giving access to the keys.

Destructive git operations
Block git commit, git push, git merge, git reset, git clean, and anything else that writes to your history. You own the git history, not Claude. Every commit should be decided on and controlled by a person. This allows the code to be reviewed before it is committed.

Direct database access
No mysql, psql, pg_dump, mysqldump, or any other direct database CLI commands. Claude should generate migration files and queries in code, not run commands directly against your data.

Network and remote access
Block curl, wget, ssh, scp, rsync, and similar tools. Claude should fetch docs through approved WebFetch domains, not make arbitrary outbound calls.

System-level commands
rm, sudo, chmod, chown, kill, crontab — anything that can damage your system or escalate privileges.

The full settings file

The allow list is just as important as the deny list. It gives explicit permission for the tools you do want Claude to be able to run: fetching framework docs, requiring Composer packages, running your linter, and so on.

Special Note

Within a settings file, DENY instructions override ALLOW instructions. However, when there is a hierarchy of settings files, the higher priority file's instructions override the instructions from lower priority files. If something is denied in managed-settings.json, adding an ALLOW in settings.json will not override the denial in managed-settings.json.

{
  "permissions": {
    "allow": [
      "Bash(composer require:*)",
      "Bash(vendor/bin/phpcbf --standard=PSR12 app/)",
      "Bash(vendor/bin/phpcs --standard=PSR12 app/)"
    ],
    "deny": [
      "Read(app/config/config.php)",
      "Edit(app/config/config.php)",
      "Read(.env)",
      "Edit(.env)",
      "Read(**/*.pem)",
      "Read(**/*.key)",
      "Read(**/*.p12)",
      "Read(**/*.pfx)",
      "Read(~/.ssh/**)",
      "Read(~/.aws/**)",
      "Read(./.git/**)",
      "Bash(git commit:*)",
      "Bash(git push:*)",
      "Bash(git merge:*)",
      "Bash(git rebase:*)",
      "Bash(git reset:*)",
      "Bash(git clean:*)",
      "Bash(git branch -d:*)",
      "Bash(git branch -D:*)",
      "Bash(git tag -d:*)",
      "Bash(git stash drop:*)",
      "Bash(git stash clear:*)",
      "Bash(git remote add:*)",
      "Bash(git remote remove:*)",
      "Bash(git remote set-url:*)",
      "Bash(git config --global:*)",
      "Bash(rm:*)",
      "Bash(rmdir:*)",
      "Bash(shred:*)",
      "Bash(dd:*)",
      "Bash(mkfs:*)",
      "Bash(fdisk:*)",
      "Bash(curl:*)",
      "Bash(wget:*)",
      "Bash(nc:*)",
      "Bash(netcat:*)",
      "Bash(nmap:*)",
      "Bash(telnet:*)",
      "Bash(ftp:*)",
      "Bash(sftp:*)",
      "Bash(scp:*)",
      "Bash(rsync:*)",
      "Bash(ssh:*)",
      "Bash(mysql:*)",
      "Bash(mysqldump:*)",
      "Bash(mysqlimport:*)",
      "Bash(mariadb:*)",
      "Bash(mariadb-dump:*)",
      "Bash(psql:*)",
      "Bash(pg_dump:*)",
      "Bash(pg_restore:*)",
      "Bash(pg_dumpall:*)",
      "Bash(redis-cli:*)",
      "Bash(sudo:*)",
      "Bash(su:*)",
      "Bash(chmod:*)",
      "Bash(chown:*)",
      "Bash(passwd:*)",
      "Bash(useradd:*)",
      "Bash(usermod:*)",
      "Bash(crontab:*)",
      "Bash(kill:*)",
      "Bash(pkill:*)"
    ]
  }
}

The pattern to remember: tight allow, broad deny. Explicitly permit the specific tools Claude needs; broadly block everything that could cause harm if run without your supervision.
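The settings file above does not include a rule for fetching docs or for the .env.* files mentioned earlier. As a sketch only, assuming the WebFetch(domain:...) rule form and using docs.flightphp.com purely as a placeholder for whichever documentation site you want to approve, the extra entries might look like this:

```json
{
  "permissions": {
    "allow": [
      "WebFetch(domain:docs.flightphp.com)"
    ],
    "deny": [
      "Read(.env.*)",
      "Edit(.env.*)"
    ]
  }
}
```

These entries would be merged into the existing allow and deny arrays rather than kept as a separate file.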

Custom Commands

Custom commands are saved prompts you can use during any session to run a repeatable, structured action across your codebase.

The most valuable commands aren’t about generating code, they’re about reviewing what’s been built. Running a security audit through a project catches problems before they’re buried under layers of new code.

Custom commands live in .claude/commands/ as markdown files. Each file is a detailed prompt that Claude executes when you invoke the slash command.

Claude Code itself can help you draft these command files efficiently.

Commands worth building

/security-audit
Reviews all files in the application for SQL injection risks, unvalidated input, missing authentication checks, insecure direct object references, and exposed error messages. Outputs a prioritised list of findings with file and line references.

/architecture-review
Checks that the architectural rules set out in CLAUDE.md are being followed, for example that controllers stay thin and contain no SQL.

/update-claude
"Review what we have just discussed and update CLAUDE.md with any new architectural patterns or 'What not to do' rules we have discovered".

Example: the security audit command

Perform a security audit on the PHP codebase in app/.

Check for the following issues and report each with file
path, line number, and severity (high/medium/low):

1. SQL injection risks — any string concatenation in
   queries, any unparameterised input
2. Missing input validation — user-supplied data used
   without sanitisation or type checking
3. Authentication gaps — routes or methods that should
   require auth but don't check for a session
4. IDOR risks — fetching records by ID without verifying
   the current user owns that record
5. Verbose error exposure — raw exceptions or stack
   traces that could leak system details

Output format:
[SEVERITY] File: path/to/file.php (line N)
Issue: description
Fix: recommended action

Do not fix anything — report only. Fixes happen
in a separate pass once the full list is reviewed.

Run audits often. The best time to run a security audit is not at the end of the project, it's during development. Issues caught early are easy to fix. Issues caught after three months of layered code become complex.

Claude Ignore

Similar to .gitignore, there is a .claudeignore file. Entries in .claudeignore tell Claude that the listed files and folders are not relevant.

Folders like node_modules, vendor, logs, and caches are good ones to add to .claudeignore.

.claudeignore is not a replacement for settings.json. Claude can still read items listed in .claudeignore, and you can still ask Claude to read them to gain context.

.claudeignore saves you tokens by not having Claude read the files upfront.

Context Compacting

As a Claude Code session grows longer, it will eventually compact, compressing older parts of the conversation to free up context window space. Claude does its best to preserve what matters, but compacting loses context. Architectural decisions made three hours ago, the reasoning behind a particular pattern choice, all of that is at risk of being quietly forgotten.

This is where the CLAUDE.md file helps.

Because CLAUDE.md is loaded at the start of every session, and re-read whenever Claude needs to orient itself, your core context is never actually lost to compacting. It doesn't live in the conversation history. It lives in the file.

Compacting can chew through hours of back-and-forth. Your architecture conventions are still right there, intact, waiting to be loaded again.

In long conversations, you may want to update the CLAUDE.md file with new information that came about during the conversation.

This can change how you work. Instead of using a single long session and worrying about context drift, you can:

Use /clear freely and often to start fresh without losing your architectural context
Treat each task or feature as its own clean session: describe the scope, build it, clear, repeat

Working in small, focused sessions with frequent clears is actually a healthier pattern than one long marathon session anyway. It forces you to scope tasks tightly, keeps the context window lean, and means Claude is always working with fresh, uncluttered context.

Time to code

With CLAUDE.md set up and your system secured, you've created a safe space for Claude Code to be genuinely excellent.

This lets you focus on working through the architecture and delivering a quality product.

Setting up Claude Code isn't about restricting the AI, it's about defining the playing field so you can stop worrying about the boundaries and start focusing on the architecture.

Partial Indexes in PostgreSQL

Lawrence Cooke — Sun, 15 Feb 2026 20:56:41 +0000

Partial indexes are refined indexes, used to target specific access patterns. Instead of indexing every row in a table, they only index the rows that match a condition — making them smaller, faster, and more efficient for the right use cases.

Partial indexes work best on queries that filter first then scan multiple rows, target meaningful subsets of data, or might otherwise hit index thresholds that cause the planner to ignore the index entirely.

The setup

To demonstrate partial indexes, I am using the MySQL sample database converted to PostgreSQL. The salaries table has about 3 million rows.

CREATE TABLE "employees"."salaries" (
    "id" int4 GENERATED ALWAYS AS IDENTITY,
    "emp_no" int4,
    "salary" int4,
    "from_date" date,
    "to_date" date,
    PRIMARY KEY ("id")
);

Here is a query with a standard index on the salary field:

SELECT COUNT(*) 
FROM salaries 
WHERE salary > 100000 
AND to_date = '9999-01-01';

In this sample database, 9999-01-01 is used to indicate an active salary.

This query took about 140ms (cold) and 40ms (hot) to run.

Choosing when to create a partial index

If queries are often run using the same pattern, they're good candidates for a partial index.

Before deciding to add one, consider the data itself. How much would be filtered out by adding this partial index? To really benefit, you want significant filtering — the more rows excluded, the better.
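One way to get a feel for how much filtering a partial index would give you is to count the rows that match the candidate condition against the whole table. This is a minimal sketch against the salaries table from the setup; the column aliases are just illustrative names.

```sql
-- Compare the total row count with the rows the partial index would cover.
SELECT
    COUNT(*) AS total_rows,
    COUNT(*) FILTER (WHERE to_date = '9999-01-01') AS active_rows
FROM salaries;
```

The smaller active_rows is relative to total_rows, the more the partial index has to gain over a full index.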

In this case, most of our queries focus on active employees and their current salary. Out of the 3 million rows, only 247,000 are current. That's a solid reduction, making it a good candidate for a partial index.

CREATE INDEX idx_salaries_salary_todate_partial
ON salaries (salary) 
WHERE to_date = '9999-01-01';

This creates the index on the salary field again, but this time it only includes active employee salaries.

Adding this partial index took the query time down to 16ms (both cold and hot).
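Timings vary between machines, so it is worth confirming that the planner is actually choosing the partial index rather than relying on the numbers alone. A simple way to check, using the same query as above, is EXPLAIN with the ANALYZE option:

```sql
-- Shows the chosen plan and actual run time.
-- Look for idx_salaries_salary_todate_partial in the plan output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM salaries
WHERE salary > 100000
AND to_date = '9999-01-01';
```

Note that ANALYZE really executes the query, so run it against data you are happy to query.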

Index thresholds

Here's where partial indexes really start to shine.

Range queries like the one we've used are susceptible to having the index ignored when the result set gets too large. This is often referred to as the 30% rule: once the result set exceeds roughly 30% of total rows (it's not an exact number, as the planner decides based on estimated cost), the query planner often chooses a table scan over using the index.

Changing the query to look for a different salary filter demonstrates this:

SELECT COUNT(*) 
FROM salaries 
WHERE salary > 50000 
AND to_date = '9999-01-01';

While > 100000 only returned 17,000 rows, > 50000 returns 215,000 out of the 247,000 rows. With a standard index, the planner ignores it and uses a table scan instead.

In this scenario, the query time (hot) was around 120ms.

The partial index, however, is still used because it reduces the starting row count upfront. Even with this broader filter, results come back in around 40ms.

Meaningful subsets of data

Another scenario where partial indexes shine is when you're consistently querying a meaningful subset of data.

A good example is a queue, where you might have an is_processed boolean field and only care about the ones not yet processed. The number of unprocessed rows should be small compared to the processed ones, and over time that difference only grows larger.

CREATE INDEX idx_unprocessed_queue
ON queue (created_at) 
WHERE is_processed = false;
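For the planner to consider this index, the query needs to include the same condition the index was built with. A sketch of a typical "fetch the next batch" query follows; the id column and the batch size of 10 are assumptions for illustration, as the article does not define the queue table.

```sql
-- The WHERE clause matches the index predicate,
-- and the ORDER BY matches the indexed column.
SELECT id, created_at
FROM queue
WHERE is_processed = false
ORDER BY created_at
LIMIT 10;
```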

Index size benefits

An advantage of partial indexes worth mentioning is the index size.

In our salary example, the full salary index is 58MB while the partial index is just 7MB.
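If you want to compare index sizes on your own tables, one option is to query the statistics view PostgreSQL keeps for user indexes. This is a small sketch using the salaries table from this article:

```sql
-- List every index on the salaries table with its on-disk size.
SELECT
    indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'salaries';
```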

That said, partial indexes are often used alongside regular indexes rather than replacing them. The partial index solves for a specific access pattern, while the regular index covers other scenarios that the partial index wouldn't help with.

When partial indexes might not be as beneficial

Partial indexes are great, but there are times where their benefits are minimal.

A table where you're already using a unique index is one scenario:

SELECT id FROM users WHERE email = 'test@test.com' AND is_active = true

Since email is almost certainly a unique field, it's unlikely that adding a partial index on is_active would produce any real gains in query execution time.

This doesn't mean a partial index is never worth adding in these cases. If the table is large, a partial index might be used simply to reduce index size. Smaller indexes fit better into the buffer cache, potentially keeping the index in memory where it belongs.

Summary

Partial indexes are a powerful tool for reducing query times on specific access patterns. They're not right for every situation — but when you've got queries that consistently filter on the same conditions, they're absolutely worth considering. Start by looking at your most common query patterns and checking how much data would be filtered out. If the numbers look good, it might be worth adding a partial index.

Journey into Claude Code

Lawrence Cooke — Mon, 26 Jan 2026 19:30:14 +0000

Using AI in your daily development process requires a shift in how you think about writing code.

Having built websites for over 30 years, I’ve lived through many changes in how software is developed. AI is not the first major shift developers have had to adapt to, and it certainly won’t be the last. Like previous transitions, it has caused disruption, some of it useful, some of it driven by marketing hype that sets expectations far too high.

When those expectations aren’t met, the conclusion is often that AI has failed. In reality, it’s not that AI isn’t useful, it’s that it needs to be used differently. It works best as a tool within your workflow, not something you can simply turn on and forget.

Delving into Claude Code

My first attempt at using Claude Code didn’t go well. I approached it the way it was being marketed: asking it to “build a product.” The scope of that request was far too large. By giving the AI so much latitude, I was asking it to make architectural, stylistic, and framework-level decisions without sufficient context.

The result was a tangled web of code that often didn’t even respect the framework I had already set up. It would fall back to raw PDO statements instead of using framework tooling. It created methods and classes that were later abandoned but left behind. Coding standards shifted mid-stream, and the overall result felt incoherent.

This wasn’t a failure of Claude Code so much as a failure in how I was using it.

Second Attempt: Adapting the Approach

The second attempt required a mental shift. Rather than expecting AI to work instead of me, I needed it to work with me.

AI isn’t there to replace the developer, it’s there to support the developer. That means staying in control. In practical terms, this meant drastically reducing the scope of what I asked it to do and taking on more of an architectural role, guiding it much as I would guide a junior developer.

Instead of asking it to build an entire project, I broke the work down into much smaller pieces: a class, a method, or a clearly defined section of functionality. You can expand the scope somewhat, but keeping it small was critical while learning how AI fit into my workflow.

The results were noticeably better. There were still odd decisions, sometimes it used framework tools, sometimes it didn’t, but at this scale it was easy to catch and correct those issues. The code was generally solid, required some cleanup, and was far quicker to produce than writing everything myself. Importantly, the problems were easy to spot and fix.

The Importance of Review

Throughout this process, code review became even more important.

Any code you didn’t write yourself should be reviewed carefully, and AI-generated code is no exception. In truth, even code you did write yourself benefits from review; we all make mistakes.

Regularly reviewing AI-generated code helps keep things from becoming large and unwieldy. It’s how you maintain control, catch deviations early, and ensure the codebase stays consistent and understandable.

Attempt Three: Teaching AI How I Code

While the output from the second attempt was good, it still didn’t feel quite right. The code wasn’t bad, it just wasn’t how I would write it.

In PHP, there are countless ways to solve the same problem, but most developers have their way of doing things. If Claude Code could follow my preferences, there were clear benefits: the code would feel familiar, differences would stand out more clearly, and issues would be easier to identify.

The third attempt focused on creating a CLAUDE.md file that clearly documented how I wanted code to be written. This included decisions around function and variable naming, whether to use static methods, specific usage patterns for Flight PHP (my framework of choice), the PHP version I target, database choices, dependency injection approaches, and adherence to SOLID principles.

That file grew to around 500 lines of guidance, including examples of what to do, and what not to do. With those constraints in place, working with Claude Code became a much more pleasant experience.

Having detailed instructions on how to code also made it possible to slightly increase the scope of requests, though still not to the level of “build me an entire product.”

Final Thoughts

Whether you love AI or hate it, learning how to use it has become important for long-term relevance as a developer. Not taking the time to understand how it can best support your workflow risks being left behind.

One unexpected benefit I’ve found is that development feels more restful. Building software is mentally demanding, from designing architecture to writing code to reasoning about database access. Allowing AI to take on some of that load can reduce cognitive strain without sacrificing quality.

You still can’t let it run unchecked. You’re ultimately responsible for the code it produces. But by guiding it with clear constraints and documentation, and by keeping yourself firmly in control, AI can become a focused, effective, and genuinely helpful part of your development process.

It's not magic. But used well, it might just make the work a little lighter.

Using TF-IDF Vectors With PHP & PostgreSQL

Lawrence Cooke — Thu, 27 Mar 2025 23:17:45 +0000

Vectors in PostgreSQL are used to compare data to find similarities, outliers, groupings, classifications and other things.

pgvector is a popular extension that adds vector functionality to PostgreSQL.

What is TF-IDF?

TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a way to measure the importance of a word in a document relative to a collection of documents.

Term Frequency

Term frequency refers to how often a word is used within a document. In a 100 word document, if the word 'test' occurs 5 times, then the term frequency would be 5/100 = 0.05.

Inverse Document Frequency

Inverse Document Frequency measures how unique a word is across a group of documents.

Common words like "the" or "and" appear in almost all documents, so they are assigned a low IDF score. Rare, specific words are assigned a higher IDF score.

The TF-IDF score is TF * IDF.

Normalizing TF-IDF

A drawback to using TF-IDF is that it unfairly advantages long documents over short documents.

Longer documents can accumulate higher TF-IDF scores simply because they contain more words, not necessarily because the word is more relevant.

This can be corrected by normalizing the score based on the document length.

TF-IDF score / total words in document.

PHP Implementation Guide

To create vectors in PHP, select all articles from a database and loop through them.

$tokenizedDocuments = [];
$documentFrequencies = [];

foreach ($articles as $article) {
    // Tokenize once, then reuse the words for both arrays.
    $words = $this->tokenizeArticle($article['description']);
    $tokenizedDocuments[$article['id']] = $words;
    $this->updateDocumentFrequencies($documentFrequencies, $words);
}

Break up the document into an array of words. Additional word processing could be done here if required.

protected function tokenizeArticle(string $text): array
{
    $text = strtolower($text);
    $text = preg_replace('/[^\w\s]/', '', $text);
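    // PREG_SPLIT_NO_EMPTY avoids returning a single empty "word" for empty text.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);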

    return $words;
}

Create an array to keep track of the word frequency across all documents.

protected function updateDocumentFrequencies(array &$documentFrequencies, array $words): void
{
    $uniqueWords = array_unique($words);
    foreach ($uniqueWords as $word) {
        if (!isset($documentFrequencies[$word])) {
            $documentFrequencies[$word] = 0;
        }
        $documentFrequencies[$word]++;
    }
}

Once the articles have been processed, create the embedding vector:

protected function createEmbeddings(
    array $articles,
    array $tokenizedDocuments,
    array $documentFrequencies,
): void {
    $totalDocuments = count($articles);

    foreach ($articles as $article) {
        $articleId = $article['id'];
        $words = $tokenizedDocuments[$articleId];

        $embedding = $this->calculateEmbedding(
            $words,
            $documentFrequencies,
            $totalDocuments
        );
    }
}

calculateEmbedding() is where the main calculations for the TF-IDF score are done.

protected function calculateEmbedding(
    array $words,
    array $documentFrequencies,
    int $totalDocuments
): array {
    $termFrequencies = array_count_values($words);
    $totalWords = count($words);

    $embedding = array_fill(0, $this->dimension, 0.0);

    foreach ($termFrequencies as $word => $count) {
        $tf = $count / $totalWords;
        $idf = log($totalDocuments / ($documentFrequencies[$word] + 1));
        $tfidf = $tf * $idf;

        $index = abs(crc32($word) % $this->dimension);
        $embedding[$index] += $tfidf;
    }

    return $this->normalizeVector($embedding);
}

Dimensions

The number chosen for dimensions is critical to good quality TF-IDF. The number should be large enough to hold the number of unique words in any of your documents. 768 or 1536 are good numbers for medium sized documents. As a general rule about 20 - 30% of words in a document are unique. 1536 equates to about a 20 to 30 page document.

Calculate TF

Divide the number of times a word occurs in a document by the total words in the document.

$tf = $count / $totalWords;

Calculate IDF

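Since IDF is based on the inverse of the document frequency, we take the log of the ratio so that very rare words don't dominate the score. Note that with the + 1 in the denominator, a word that appears in every document ends up with a slightly negative IDF.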

$idf = log($totalDocuments / ($documentFrequencies[$word] + 1));

Calculate TF-IDF

$tfidf = $tf * $idf;
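As a rough worked example of the three calculations above, the sketch below wraps them in a standalone helper. The function name tfidf() and the document counts are made up for illustration; only the formulas come from this article.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper combining the TF, IDF and TF-IDF lines above.
 */
function tfidf(int $count, int $totalWords, int $totalDocuments, int $documentsWithWord): float
{
    $tf = $count / $totalWords;
    $idf = log($totalDocuments / ($documentsWithWord + 1));

    return $tf * $idf;
}

// "test" appears 5 times in a 100 word document, and in 9 of 1000 documents.
// tf = 0.05, idf = log(1000 / 10) = log(100), roughly 4.605.
printf("rare word:   %.4f\n", tfidf(5, 100, 1000, 9));

// A word that appears in 999 of 1000 documents: idf = log(1000 / 1000) = 0.
printf("common word: %.4f\n", tfidf(5, 100, 1000, 999));
```

The rare word scores roughly 0.23 while the common word scores 0, even though both appear 5 times in the document. That difference is the whole point of the IDF part.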

TF-IDF array

TF-IDF arrays do not store values in order. Instead, each value is stored at a calculated array key. This ensures that the same word will always appear in the same array position across all documents.

It is possible for two different words to calculate the same array key. When that happens, their scores are simply added together in the same position, even though the words are unrelated. As long as the vector's dimension size is appropriate for the size of the documents, these duplicates are rare.

To calculate the position, use crc32 to generate an integer representation of the word and then divide it by the dimension size, and use the remainder as the array key position.

This will give a good spread of spaces that are filled with the TF-IDF scores.
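The position calculation can be tried on its own. The sketch below pulls the crc32 and remainder step out of calculateEmbedding() into a standalone function; the name wordPosition() and the sample words are assumptions for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Map a word to its position in the embedding array,
 * using the same crc32 + remainder approach as calculateEmbedding().
 */
function wordPosition(string $word, int $dimension): int
{
    return abs(crc32($word) % $dimension);
}

$dimension = 1536;

foreach (['postgres', 'index', 'postgres'] as $word) {
    // The same word always maps to the same position.
    printf("%-10s -> %d\n", $word, wordPosition($word, $dimension));
}
```

Running this with a much smaller dimension is a quick way to see how often different words start landing on the same position.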

Normalizing

Earlier we talked about normalizing by dividing by the document length. While it can be calculated this way, normalizing is more commonly done by dividing each value by the vector's Euclidean norm: √(x₁² + x₂² + ... + xₙ²)

The normalizeVector method is a PHP representation of this formula.

protected function normalizeVector(array $vector): array
{
    $magnitude = sqrt(array_sum(array_map(function ($x) {
        return $x * $x;
    }, $vector)));

    if ($magnitude > 0) {
        return array_map(function ($x) use ($magnitude) {
            return $x / $magnitude;
        }, $vector);
    }

    return $vector;
}

The final vector may look something like this:

[0.052876625,0,0,0,0,0,0,0,0,0,0,0,0.013156515,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-0.012633555,0,0,0,0,0.0065987236,0,0 ...]

This is known as a sparse vector. A sparse vector has a lot of empty array keys whereas a dense vector is much more filled in.

Dense vectors can improve the quality of the vector. One method of doing this is to include bi-grams in the vector along with the single words.

For example:

The sentence "This is known as a sparse vector" would include each word [this,is,known,as,a,sparse,vector]. Adding bi-grams would also include [this_is,is_known,known_as,as_a,a_sparse,sparse_vector], which adds more context to the words by taking into account the words around them.
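A minimal sketch of how bi-grams might be added to the tokenized words follows. The function name withBigrams() is hypothetical, and it assumes the input is the array returned by tokenizeArticle().

```php
<?php

declare(strict_types=1);

/**
 * Return the original words followed by their bi-grams,
 * where each bi-gram joins two neighbouring words with an underscore.
 */
function withBigrams(array $words): array
{
    $words = array_values($words);
    $bigrams = [];
    $lastIndex = count($words) - 1;

    for ($i = 0; $i < $lastIndex; $i++) {
        $bigrams[] = $words[$i] . '_' . $words[$i + 1];
    }

    return array_merge($words, $bigrams);
}

print_r(withBigrams(['this', 'is', 'known']));
// Contains: this, is, known, this_is, is_known
```

The combined array can then be passed through the rest of the process in place of the single words, so each bi-gram gets its own TF-IDF score and array position.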

Creating Queries in PostgreSQL

Once vectors have been generated for your documents, it's time to store them in PostgreSQL.

Selecting the right dimension for your documents is also critical here. Once you choose a dimension size, all vectors going into the field have to be the same dimension.

ALTER TABLE "articles"
ADD COLUMN "embedding" vector(1536);
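The createEmbeddings() method shown earlier stops once $embedding has been calculated. One possible way to save it is sketched below, assuming PHP 8, a PDO connection to a database with the pgvector extension enabled, and the articles table above. The function names toVectorLiteral() and storeEmbedding() are made up for this example.

```php
<?php

declare(strict_types=1);

/**
 * Convert a PHP array of floats into pgvector's text format, e.g. "[0.5,0,0.25]".
 */
function toVectorLiteral(array $embedding): string
{
    return '[' . implode(',', $embedding) . ']';
}

/**
 * Store the embedding for one article.
 * Assumes $pdo is connected to a database where pgvector is installed.
 */
function storeEmbedding(PDO $pdo, int $articleId, array $embedding): void
{
    $stmt = $pdo->prepare(
        'UPDATE articles SET embedding = CAST(:embedding AS vector) WHERE id = :id'
    );
    $stmt->execute([
        'embedding' => toVectorLiteral($embedding),
        'id' => $articleId,
    ]);
}

echo toVectorLiteral([0.5, 0.0, 0.25]), "\n"; // prints [0.5,0,0.25]
```

The same toVectorLiteral() helper could be used to bind a search vector to a query parameter.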

Types of Comparisons

pgvector provides several distance operators. The three most commonly used are:

Euclidean (L2) distance: <-> : Measures how far apart two vectors are. Smaller numbers mean vectors are more similar. Good for finding similar products etc.

Cosine distance: <=> : Measures the angle between vectors, ignoring their magnitude. Smaller numbers mean vectors are more similar in direction. Good for text similarity where length shouldn't matter.

Negative inner product: <#> : Measures how much vectors "align" with each other. pgvector returns the inner product multiplied by -1, so, as with the other operators, smaller (more negative) numbers mean vectors are more similar. Useful for comparing normalized vectors.

Try them all with your data to find the one that best suits your use case.

Creating a recommendation system

One of the use cases of vectors is to create a recommendation system, in this case to find articles that are related in some way to the one you are currently reading.

To do this, we need to order the rows by the comparison to find the ones most relevant to the current article.

In this query, the embedding of the current article is selected first, and the other articles are then compared against it to find the most relevant.

For a recommendation, filtering out the current article from the query makes sense.

WITH
    search_article AS (
        SELECT embedding
        FROM articles
        WHERE id = 12
    )
SELECT id,title
FROM articles
WHERE id <> 12
ORDER BY embedding <-> (
        SELECT embedding
        FROM search_article
    )
LIMIT 4;

Creating a search engine

Vectors can be used to create a search engine for your documents by comparing articles with the user-entered question or keywords.

To do this, the user-entered question needs to be converted into a vector using the document frequencies of your current articles (I recommend storing these in the database so you are not recalculating them every time a search query is run). The user vector needs to be the same dimension size as the article vectors.

Create a query to compare the user vector to the stored vectors:

SELECT id,title
FROM articles
ORDER BY embedding <=> :embedding
LIMIT 4;
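The step of turning the user's question into a vector could look something like the sketch below. It is a minimal illustration, not the exact method used for the article vectors: the tokenizer, the IDF formula, and the names build_idf, query_vector and to_vector_literal are all assumptions, and the vector has one dimension per vocabulary term rather than a fixed size. The important part is that the question is weighted with the same vocabulary, in the same order, as the stored articles.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_idf(documents):
    """Return (vocabulary, idf) computed once from the stored article texts."""
    tokenised = [set(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*tokenised))
    n_docs = len(documents)
    idf = {
        term: math.log(n_docs / sum(term in doc for doc in tokenised))
        for term in vocabulary
    }
    return vocabulary, idf

def query_vector(question, vocabulary, idf):
    """TF-IDF vector for the question; words outside the vocabulary are ignored."""
    counts = Counter(tokenize(question))
    total = sum(counts.values()) or 1
    return [counts[term] / total * idf[term] for term in vocabulary]

def to_vector_literal(vector):
    """Format as '[v1,v2,...]' text that can be bound to :embedding."""
    return "[" + ",".join(f"{v:.6f}" for v in vector) + "]"

# Hypothetical article texts standing in for the articles table.
documents = [
    "postgres vector search",
    "postgres materialized views",
    "php micro frameworks",
]
vocabulary, idf = build_idf(documents)
vec = query_vector("vector search in postgres", vocabulary, idf)
print(to_vector_literal(vec))
```

In practice the vocabulary and IDF values would be loaded from the database rather than rebuilt on every search.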

Other use cases

Classifying Articles

A more complex use case for vectors would be to classify documents to put similar articles together. You may not have specific tags/keywords to classify documents against, but articles can still be classified into similar items.

Clustering the embeddings results in similar articles having the same cluster id.
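One way to assign cluster ids is k-means clustering. The sketch below is an assumption about how this could be done rather than the method used here: it is a plain Python version of k-means with a naive starting point (the first k vectors), and the sample embeddings are made up.

```python
import math

def kmeans(vectors, k, iterations=20):
    """Assign each vector a cluster id using plain k-means."""
    centroids = [list(v) for v in vectors[:k]]  # naive, deterministic start
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical article ids and two-dimensional embeddings.
embeddings = {
    1: [0.9, 0.1],
    2: [1.0, 0.0],
    3: [0.1, 0.9],
    4: [0.0, 1.0],
}
labels = kmeans(list(embeddings.values()), k=2)
cluster_ids = dict(zip(embeddings.keys(), labels))
print(cluster_ids)
```

The resulting cluster id could then be stored in a column on the articles table so that related articles can be fetched with a simple WHERE clause.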

Finding Anomalies

If users post articles about tech, and suddenly someone posts an article about places to buy plushies, that would be an anomaly and might be worth checking to see if it fits the site's requirements.

To implement an anomaly checker, a distance threshold would need to be set and anything further away than the threshold would be flagged for manual review.

WITH 
    article_distances AS (
        SELECT 
            id, 
            title,
            embedding <-> (
                SELECT AVG(embedding)::vector(1536) 
                FROM articles
            ) AS distance_from_average
        FROM articles
    )
SELECT id, title, distance_from_average
FROM article_distances
WHERE distance_from_average > 0.75
ORDER BY distance_from_average DESC;

This query calculates the "average" embedding across all articles (representing your typical content) and then finds articles that are significantly different from this average.

Experiment with the threshold to find what is right for your use case.
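When experimenting with thresholds, it can help to reproduce the same calculation outside the database. The sketch below mirrors the query above in plain Python: average all embeddings, measure each article's Euclidean distance from that average, and keep the ones beyond the threshold. The ids, vectors and function names are made up for the example.

```python
import math

def centroid(vectors):
    """Element-wise average of the vectors, like AVG(embedding)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def find_anomalies(embeddings, threshold):
    """Return (id, distance) pairs beyond the threshold, furthest first."""
    average = centroid(list(embeddings.values()))
    distances = {id_: math.dist(vec, average) for id_, vec in embeddings.items()}
    flagged = [(id_, d) for id_, d in distances.items() if d > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings: three similar articles and one outlier.
embeddings = {
    1: [1.0, 0.0],
    2: [0.9, 0.1],
    3: [1.0, 0.1],
    4: [0.0, 1.0],
}
anomalies = find_anomalies(embeddings, threshold=0.75)
print(anomalies)
```

Note that a single extreme article also pulls the average towards itself, so on a small dataset the threshold may need to be lower than expected.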

Conclusion

Vectors are both complex and powerful. Well planned vectors can help automate many use cases or add features to your website.

While TF-IDF is the method I chose here, it's not the only way to generate vectors. OpenAI has its own models for generating vectors from text, as does Ollama. These may or may not be better for your use case.

It's important to experiment with different approaches - test various dimension sizes, comparison methods, and even vector generation techniques to find what works best for your specific needs.

How To Use Materialized Views

Lawrence Cooke — Mon, 23 Dec 2024 16:35:29 +0000

There are times when a query takes a long time to run. While indexes and good query design often help, sometimes the query is inherently slow.

In cases like this, we need to find an alternative way to collect the data to prevent the queries from creating slowness on a website.

Materialized views can provide significant improvement in query performance, and are especially useful in aggregated and reporting queries.

To demonstrate this I am importing a 100 million row CSV of mock temperature data from about 400 cities.

Amsterdam;2010-06-28;15.4
Kano;2017-04-23;18.7
Calgary;2016-05-07;4.3
Reggane;2014-10-04;32.0
Fukuoka;2010-04-17;22.6
Khartoum;2017-05-29;29.8
Vilnius;2014-06-16;4.2
Murmansk;2011-09-29;3.8
Parakou;2010-09-12;10.6
Cairo;2014-03-29;42.6
Edmonton;2015-04-24;-2.1
...

Creating and importing data into a table

To create the table in PostgreSQL the following table definition can be used:

CREATE TABLE weather_data (
    id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    city VARCHAR(100),
    recorded_date DATE,
    temperature FLOAT
);

Importing the data can be done through psql

psql -h localhost -U postgres -W

\copy weather_data(city, recorded_date, temperature) FROM 'measurements.csv' WITH (FORMAT csv, DELIMITER ';', HEADER);

Querying the data

The query used here is one derived from the 1 billion row challenge.

SELECT city, 
  MIN(temperature) AS min_temperature, 
  MAX(temperature) AS max_temperature, 
  AVG(temperature) AS avg_temperature
FROM weather_data 
GROUP BY city
ORDER BY city;

On 100 million rows, this took about 7 seconds to run. If a query like this was running on a website, it would be too slow to be practical.

An option to consider is to use Redis or Elasticsearch to take the load off the database.

Another option is to use a materialized view in PostgreSQL to store the returned results.

What are Views?

A view in SQL is a virtual table. Rather than storing data, it stores a SQL query. When the virtual table is queried, the more complex query underneath is run to return the result.

CREATE VIEW weather_stats AS
SELECT city, 
  MIN(temperature) AS min_temperature, 
  MAX(temperature) AS max_temperature, 
  AVG(temperature) AS avg_temperature
FROM weather_data 
GROUP BY city
ORDER BY city;

To query the view we would use a query like

SELECT * FROM weather_stats;

Since this is just running the stored SQL query, this does not provide any query execution time improvement.

To improve execution time, a materialized view would be a better option.

What are Materialized Views?

Materialized views are quite different from regular views. Instead of storing a SQL query and running the query each time it's used, materialized views store the result set on disk.

This makes accessing the result set much faster than just running the raw query.

To create a materialized view, the syntax is almost the same as a regular view

CREATE MATERIALIZED VIEW weather_stats AS
SELECT city, 
  MIN(temperature) AS min_temperature, 
  MAX(temperature) AS max_temperature, 
  AVG(temperature) AS avg_temperature
FROM weather_data 
GROUP BY city
ORDER BY city;

Creating the materialized view takes about as long as running the query itself. However, querying the materialized view returns results in about 5ms instead of 7 seconds.

SELECT * FROM weather_stats;

Refreshing materialized views

Since materialized views store the result set, the data can become stale.

To prevent the data from becoming too stale, the materialized view can be refreshed as regularly as needed to keep the data up to date.

To refresh the data, a REFRESH query is run

REFRESH MATERIALIZED VIEW weather_stats;

Refreshing the data will temporarily block access to the view, causing downtime.

Once the refresh is complete, the data will become available again. The downtime would be based on how long the underlying query takes to run.
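The difference between a regular view, a stale stored result and a refreshed one can be seen on a tiny dataset. The sketch below is only an emulation: SQLite has no materialized views, so a table filled from the view stands in for one, and refresh_snapshot is a made-up helper that plays the role of REFRESH MATERIALIZED VIEW. The relation names and sample rows are also assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE weather_data (city TEXT, recorded_date TEXT, temperature REAL);
    INSERT INTO weather_data VALUES
        ('Cairo', '2014-03-29', 42.6),
        ('Cairo', '2014-03-30', 30.0),
        ('Vilnius', '2014-06-16', 4.2);

    -- A regular view re-runs its query on every read.
    CREATE VIEW weather_stats_view AS
        SELECT city, MAX(temperature) AS max_temperature
        FROM weather_data GROUP BY city;

    -- Stand-in for a materialized view: the result set copied into a table.
    CREATE TABLE weather_stats_snapshot AS
        SELECT * FROM weather_stats_view;
""")

def max_temp(relation, city):
    row = conn.execute(
        f"SELECT max_temperature FROM {relation} WHERE city = ?", (city,)
    ).fetchone()
    return row[0]

def refresh_snapshot():
    """Rebuild the stored result set from the underlying query."""
    with conn:
        conn.execute("DELETE FROM weather_stats_snapshot")
        conn.execute(
            "INSERT INTO weather_stats_snapshot SELECT * FROM weather_stats_view"
        )

conn.execute("INSERT INTO weather_data VALUES ('Cairo', '2014-03-31', 45.1)")
live = max_temp("weather_stats_view", "Cairo")       # the view sees the new row
stale = max_temp("weather_stats_snapshot", "Cairo")  # the stored result does not
refresh_snapshot()
fresh = max_temp("weather_stats_snapshot", "Cairo")  # up to date after a refresh
print(live, stale, fresh)
```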

Concurrent Materialized View Refresh

To avoid downtime, concurrent refreshes can be used.

A concurrent refresh builds the new result set in a temporary table, compares it with the existing data, and applies only the differences to the materialized view. This allows access to the data during a data refresh.

Concurrent refreshes require a unique index to be added to the materialized view.

CREATE UNIQUE INDEX idx_weather_stats_city ON weather_stats(city);

Once the index has been created, a concurrent refresh can be run

REFRESH MATERIALIZED VIEW CONCURRENTLY weather_stats;

Updating A Materialized View Schema

Updating the view's schema requires dropping the view and recreating it. Downtime would occur in these cases.

DROP MATERIALIZED VIEW IF EXISTS weather_stats;

After the view has been dropped, a new CREATE query can be run to build the materialized view again and import the result set into the new table definition.

Scheduling the materialized view refresh

To schedule the materialized view refresh, you could create an external script and use crontab to trigger it.

Alternatively, it's possible to schedule the refresh inside PostgreSQL

Installing pg_cron extension

To run cron jobs inside PostgreSQL, the PostgreSQL pg_cron extension needs to be installed.

sudo apt-get install postgresql-17-cron

Make sure you install the version compatible with the version of PostgreSQL you have installed.

Add the extension to postgresql.conf

sudo nano /etc/postgresql/17/main/postgresql.conf

and add the extension at the bottom of the conf file

shared_preload_libraries = 'pg_cron'

Restart PostgreSQL

sudo systemctl restart postgresql

Activate the extension in PostgreSQL by running the CREATE EXTENSION query.

CREATE EXTENSION IF NOT EXISTS pg_cron;

Now we can schedule the view refresh

SELECT cron.schedule('refresh_weather', '0 * * * *', 
    'REFRESH MATERIALIZED VIEW CONCURRENTLY weather_stats');

This will set the cron to refresh the view every hour on the hour.
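If the five fields of a cron expression are unfamiliar, they are minute, hour, day of month, month and day of week, in that order. The small Python sketch below shows how '0 * * * *' matches only the top of each hour. It is a deliberately simplified matcher written for this explanation (cron_matches is a made-up name): it only understands `*` and plain integers, not ranges, lists or steps, and it ignores cron's special handling when both day fields are set.

```python
from datetime import datetime

def cron_matches(expression, moment):
    """Check a five-field cron expression made only of `*` and plain integers."""
    fields = expression.split()
    actual = [
        moment.minute,
        moment.hour,
        moment.day,
        moment.month,
        (moment.weekday() + 1) % 7,  # cron counts Sunday as 0, Python counts Monday as 0
    ]
    return all(f == "*" or int(f) == a for f, a in zip(fields, actual))

on_the_hour = cron_matches("0 * * * *", datetime(2024, 12, 23, 16, 0))
mid_hour = cron_matches("0 * * * *", datetime(2024, 12, 23, 16, 35))
print(on_the_hour, mid_hour)
```

Changing the first field to 30 would run the refresh at half past every hour instead.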

Check that the cron is running by executing this query

SELECT * FROM cron.job;

jobid	schedule	command	nodename	nodeport	database	username	active	jobname
1	0 * * * *	REFRESH MATERIALIZED VIEW CONCURRENTLY weather_stats	localhost	5432	postgres	postgres	true	refresh_weather

When not to use materialized views

While materialized views are a useful feature, they are not the right solution for every situation.

In cases where data changes frequently and/or when data consistency is critical, materialized views are not a good option to use.

Currency exchange rates, where the data needs to be up to date to the second would not be a good use case for materialized views unless you were looking to store historical data.

It's also worth noting that because materialized views are stored on disk, the disk space needed to store the view needs to be considered.

Summary

While there are a few concepts to learn with materialized views, they can significantly improve query execution time compared to raw queries, which will benefit your website.

Materialized views are not a good option for all cases. Understanding your data and customer needs will help determine if they are a viable solution or not.

Optimizing SQL Queries

Lawrence Cooke — Sun, 13 Oct 2024 15:41:51 +0000

When writing queries, we should always take time to find the best way to write the query.

Sometimes this can mean using methods that on the surface seem like they wouldn't be fast, but actually are.

Query optimization is critical to having an efficient website.

While query optimization also applies to reporting and analytics, queries that run as part of a web service are the ones most noticed by users of your website.

For this article I am using the MySQL test employee database: https://dev.mysql.com/doc/employee/en/

The Schema

CREATE TABLE `employees` (
  `emp_no` int NOT NULL,
  `birth_date` date NOT NULL,
  `first_name` varchar(14) NOT NULL,
  `last_name` varchar(16) NOT NULL,
  `gender` enum('M','F') NOT NULL,
  `hire_date` date NOT NULL,
  PRIMARY KEY (`emp_no`),
  KEY `name` (`first_name`,`last_name`) 
)

CREATE TABLE `salaries` (
  `emp_no` int NOT NULL,
  `salary` int NOT NULL,
  `from_date` date NOT NULL,
  `to_date` date NOT NULL,
  PRIMARY KEY (`emp_no`,`from_date`),
  KEY `salary` (`emp_no`,`salary`)
)

The salaries table can contain the same employee multiple times. Each time an employee's salary changes, a new row is added to the salaries table.

The task

The task for this query is to return a unique list of employees (emp_no, first_name, last_name) who earn over $50,000 a year.

Along with selecting the data, we will need to ensure there are no duplicate employees.

Using DISTINCT

SELECT DISTINCT
    employees.emp_no,
    first_name,
    last_name
FROM
    employees
    INNER JOIN salaries USING (emp_no)
WHERE
    salary > 50000

In general, the use of DISTINCT is an indication that the query could be written better.

DISTINCT fetches all the possible rows, and at the end of the query process, strips out duplicate rows it doesn't need.

DISTINCT is applied across all of the selected columns, not just the employee. This means that it's possible to return duplicate names in some cases.

An example of when this could occur would be if we included a column whose value changes from row to row for the same employee, for example salary:

SELECT DISTINCT
    employees.emp_no,
    first_name,
    last_name,
    salary
FROM
    employees
    INNER JOIN salaries USING (emp_no)
WHERE
    salary > 50000

Query Execution Plan:

-> Table scan on <temporary>  (cost=241946..245972 rows=321886)
   └─> Temporary table with deduplication  (cost=241946..241946 rows=321886)
      └─> Nested loop inner join  (cost=209757 rows=321886)
         ├─> Filter: (salaries.salary > 50000)  (cost=97097 rows=321886)
         │  └─> Index scan on salaries using salary  (cost=97097 rows=965756)
         └─> Single-row index lookup on employees using PRIMARY (emp_no=salaries.emp_no)  (cost=0.25 rows=1)

The execution plan shows the use of a temporary table and a high cost. Queries that need a temporary table are generally slower. Temporary tables are necessary at times, but if you can find a way to query without one, it's generally going to be more efficient.

Average response time: 745ms
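The duplicate-name behaviour of DISTINCT described above is easy to reproduce on a couple of rows. The sketch below uses an in-memory SQLite database as a stand-in for MySQL, with a cut-down schema and made-up employees; it only demonstrates the result sets, not the timings or execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_no INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE salaries (emp_no INTEGER, salary INTEGER, from_date TEXT,
                           PRIMARY KEY (emp_no, from_date));
    -- Hypothetical rows: employee 1 has two salary rows over 50000.
    INSERT INTO employees VALUES (1, 'Ada', 'Hill'), (2, 'Ben', 'Cole');
    INSERT INTO salaries VALUES
        (1, 60000, '2000-01-01'),
        (1, 65000, '2001-01-01'),
        (2, 40000, '2000-01-01');
""")

# DISTINCT over emp_no and name: one row per employee.
names_only = conn.execute("""
    SELECT DISTINCT employees.emp_no, first_name, last_name
    FROM employees
    INNER JOIN salaries ON salaries.emp_no = employees.emp_no
    WHERE salary > 50000
""").fetchall()

# Adding salary makes each row different, so the employee appears twice.
with_salary = conn.execute("""
    SELECT DISTINCT employees.emp_no, first_name, last_name, salary
    FROM employees
    INNER JOIN salaries ON salaries.emp_no = employees.emp_no
    WHERE salary > 50000
    ORDER BY salary
""").fetchall()

print(names_only)
print(with_salary)
```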

Using GROUP BY

A common method of ensuring unique employees is to use GROUP BY.

GROUP BY is generally faster than DISTINCT. It doesn't need that last step of removing duplicates to complete the query plan

SELECT
    employees.emp_no,
    first_name,
    last_name
FROM
    employees
    INNER JOIN salaries USING(emp_no)
WHERE
    salary > 50000
GROUP BY
    employees.emp_no

Query Execution Plan:

-> Table scan on <temporary>  (cost=241946..245972 rows=321886)
   └─> Temporary table with deduplication  (cost=241946..241946 rows=321886)
      └─> Nested loop inner join  (cost=209757 rows=321886)
         ├─> Filter: (salaries.salary > 50000)  (cost=97097 rows=321886)
         │  └─> Index scan on salaries using salary  (cost=97097 rows=965756)
         └─> Single-row index lookup on employees using PRIMARY (emp_no=salaries.emp_no)  (cost=0.25 rows=1)

While the GROUP BY is slightly faster than DISTINCT, the execution plan is the same. The difference between them in this case most likely comes down to internal factors such as caching and normal run-to-run variation rather than the query itself.

While execution plans are very useful, they don't always give you the whole story of what is going on internally, which leads to subtle differences between queries that might have the same execution plan.

Average response time: 721ms

Using Subquery

While subqueries are often viewed as less efficient, there are times where they can reduce the row count, which can make queries faster.

In this case, we are going to use a subquery to find the employee numbers where salary is over $50,000

SELECT
    employees.emp_no,
    first_name,
    last_name
FROM
    employees
WHERE
    emp_no IN(
        SELECT
            emp_no FROM salaries
        WHERE
            salary > 50000)

Using this method, the query time drops significantly.

Query Execution Plan:

-> Nested loop inner join  (cost=89029 rows=33961)
   ├─> Remove duplicates from input sorted on salary  (cost=5161 rows=33961)
   │  └─> Filter: (salaries.salary > 50000)  (cost=5161 rows=33961)
   │     └─> Index scan on salaries using salary  (cost=5161 rows=965756)
   └─> Single-row index lookup on employees using PRIMARY (emp_no=salaries.emp_no)  (cost=80472 rows=1)

Here you will see that the query is no longer using a temporary table, and is using a much simpler plan, with a much lower cost value.

These factors lead to a faster response time.

Average response time: 234ms

While using a subquery significantly improved the query performance, we may be able to achieve better results by using the EXISTS clause, which offers some advantages over the IN statement used in the subquery.

Using EXISTS

When using EXISTS, the query early terminates once it finds a match. In this case, it will early terminate once it has found a specific employee.

While there are multiple rows in the salaries table for an employee, it does not need to continue checking if that specific employee exists if it has found a matching row, so it stops looking for the employee and moves onto looking for the next one.

SELECT
    employees.emp_no,
    first_name,
    last_name
FROM
    employees
WHERE
    EXISTS (
        SELECT
            1
        FROM
            salaries
        WHERE
            salaries.emp_no = employees.emp_no
            AND salary > 50000)

We use SELECT 1 in this query because EXISTS only returns TRUE or FALSE, not what the row contains.

While we could use SELECT emp_no or SELECT *, returning a constant makes the intent of the query clearer, and in some cases, can be more efficient.

Query Execution Plan:

-> Nested loop inner join  (cost=89029 rows=33961)
   ├─> Remove duplicates from input sorted on salary  (cost=5161 rows=33961)
   │  └─> Filter: (salaries.salary > 50000)  (cost=5161 rows=33961)
   │     └─> Index scan on salaries using salary  (cost=5161 rows=965756)
   └─> Single-row index lookup on employees using PRIMARY (emp_no=salaries.emp_no)  (cost=80472 rows=1)

While this query plan is the same as the subquery query plan, the early termination improves the execution time.

Average response time: 220ms
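When swapping one query shape for another, it is worth checking that they really return the same rows before comparing speed. The sketch below runs all four approaches against a tiny in-memory SQLite database. SQLite is used here only as a convenient stand-in for MySQL, the schema is cut down, and the employees are made up, so it says nothing about performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_no INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE salaries (emp_no INTEGER, salary INTEGER, from_date TEXT,
                           PRIMARY KEY (emp_no, from_date));
    -- Hypothetical sample rows.
    INSERT INTO employees VALUES (1, 'Ada', 'Hill'), (2, 'Ben', 'Cole'), (3, 'Cy', 'Ng');
    INSERT INTO salaries VALUES
        (1, 60000, '2000-01-01'), (1, 65000, '2001-01-01'),
        (2, 40000, '2000-01-01'),
        (3, 45000, '2000-01-01'), (3, 55000, '2001-01-01');
""")

QUERIES = {
    "distinct": """
        SELECT DISTINCT employees.emp_no, first_name, last_name
        FROM employees
        INNER JOIN salaries ON salaries.emp_no = employees.emp_no
        WHERE salary > 50000""",
    "group_by": """
        SELECT employees.emp_no, first_name, last_name
        FROM employees
        INNER JOIN salaries ON salaries.emp_no = employees.emp_no
        WHERE salary > 50000
        GROUP BY employees.emp_no""",
    "in_subquery": """
        SELECT emp_no, first_name, last_name
        FROM employees
        WHERE emp_no IN (SELECT emp_no FROM salaries WHERE salary > 50000)""",
    "exists": """
        SELECT emp_no, first_name, last_name
        FROM employees
        WHERE EXISTS (
            SELECT 1 FROM salaries
            WHERE salaries.emp_no = employees.emp_no AND salary > 50000)""",
}

# Sort each result so the comparison does not depend on row order.
results = {name: sorted(conn.execute(sql).fetchall()) for name, sql in QUERIES.items()}
for name, rows in results.items():
    print(name, rows)
```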

Summary

Distinct: 745ms
Group By: 721ms
Subquery: 234ms
Exists : 220ms

Using subqueries is not always the most efficient querying method, however, in scenarios like this, it can significantly improve your query.

While just changing the query can help fix slow queries, there are other optimizations that could be considered.

Creating better indexes can also help resolve slow queries, but adding indexes should be reserved for times where rewriting the query doesn't help the query to be more efficient.

It's important to try out different query strategies on your own data. While EXISTS was the most efficient strategy when querying this dataset, results may differ on other datasets, so try out a variety of queries and see which one works best for you.

Slim and Flight PHP Framework Comparison

Lawrence Cooke — Sat, 14 Sep 2024 00:17:18 +0000

Why use a micro framework?

On social media, often new PHP devs ask "What framework should I use for my project" and generally the answers given are "Laravel" or "Symfony".

While these are both good options, the right answer to this question should be "What do you need the framework to do?"

The right framework should be one that does what you need it to, without loads of features you will never use.

If you are making a website with one route, using Laravel or Symfony would be over engineering the site, while for a complex site, Laravel or Symfony may be the right choice.

Micro frameworks are great for building small to medium sized sites that don't need all of the features a full stack framework provides.

While there are many, Slim and Flight PHP are both great examples of micro frameworks.

Recently I built a small website that asks the user to solve 10 database related questions. It had three routes, and some basic queries to fetch the questions and compare the answers.

For a small project like this, a micro framework is a great choice. I built the site on both Slim and Flight PHP to compare them.

Skeletons

If you haven't used a particular framework before, using the provided skeleton project is generally a great place to start.

Flight PHP's skeleton project is pretty much what I expected: lightweight, with a simple MVC setup and a folder structure that makes it easy to know where everything should go in the project.

For someone new to the framework, the learning curve to getting up and running is minimal.

It is light on composer libraries, with just 5 in total (including the core library), 4 of them used in production.

The production size of the skeleton was 1.6MB.

Slim's skeleton project surprised me. The directory structure was more complex than I had anticipated, geared more towards a larger project than a small one. For a micro framework, this wasn't expected.

The Slim skeleton was a bit heavier than Flight PHP, with 21 composer libraries, 9 of them used in production. The production size of the project was 3.3MB.

Both worked out of the box with minimal additional configuration needed.

Building From Scratch

Instead of using the skeletons, I decided to build the sites by creating my own setup. The advantage of doing this is that I was able to tailor the frameworks to suit my needs, and see how flexible they were to different structures.

One of the big advantages of using micro frameworks is being able to build them to do exactly what you need without unnecessary overhead, adding features and libraries as they become needed.

My setup with Flight PHP wasn't significantly different from the skeleton. While I did end up with fewer directories and different composer libraries, structurally it was similar.

With Slim, the structure of the project ended up significantly different from the skeleton.

It was nice that Slim was flexible and wasn't making assumptions about structure and worked just fine with a completely different structure than the skeleton.

Flight PHP is also flexible in this way, allowing for more complex structures if needed. Adding new libraries into the framework was straightforward.

The Code

Routing

From a routing point of view, both were nice to work with. They were both easy to set up without much documentation reading necessary.

Routes in Flight PHP were slightly simpler to set up than in Slim, and used less code to do so, but neither was difficult to set up.

Routing groups, regex abilities and middleware options made routes flexible while still being easy to work with.

Database Connections

With Slim, the expectation is that you should use an ORM like Eloquent or Doctrine for your database queries, whereas Flight PHP provides a simple wrapper for PDO that can be used if you need to and optionally, Active Record can be added to the project for query building.

For a small project like the one I was working on, using an ORM seemed to be a bit more than necessary, so I ended up building a small PDO wrapper class for Slim, similar to the one that comes built into Flight PHP.

ORMs are great, but having the flexibility built in to choose how I wish to code database queries is a good feature.

General Coding

Both Slim and Flight PHP Frameworks are good at allowing you to write code your own way.

Some frameworks tend to force you into coding a specific way and at times it can feel like you are fighting against the framework.

Frameworks should work with you not against you, and both of these felt like they were working with me.

Slim also provides a number of handy add ons including CSRF integration and HTTP caching.

Flight PHP provides additional add ons including Permissions and Active Record.

All of these add ons are helpful additions without having to use 3rd party solutions or build your own.

Returning JSON as a response is cleaner in Flight PHP than it is in Slim. Slim 3 had a convenient withJson response method. Slim 4 adheres more closely to PSR-7, but that means building the JSON response requires more code.

If I was going to be using JSON responses a lot, I would likely create a wrapper to make it more convenient while still adhering to the PSR-7 standard.

This is a significant difference between the two frameworks. Slim feels like it needs to be tailored more by creating classes to clean up and simplify the codebase, while Flight PHP has already done this for you.

Slim provides a number of helper middleware. The middleware is required in order to make some features work.

An example of this is sending data from JavaScript using fetch. Slim has a getParsedBody method on the request that creates a data array from the POST request.

However, in order to use it, the body parsing middleware needs to be added to the app with addBodyParsingMiddleware.

It's a little bit of a trap for new devs, but it also makes features optional, which can lower the framework's overall footprint by only enabling the features you need.

Flight PHP achieves this through a config file. Some features can be turned on and off through the config rather than through middleware enablement.

Speed Tests

According to benchmarks, comparing the two has interesting results. Slim edges out Flight PHP in some areas, while Flight PHP edges out Slim in others.

Putting the two frameworks to a test on my own code showed that Flight PHP had faster and more consistent response times than Slim.

Response time charts (not reproduced here): Front End, GET request returning JSON, POST request returning JSON.

What I found noteworthy was the outlier spikes when using Slim.

Running these tests multiple times produced similar results each time to the ones I have shown above, with generally good response times for both but with outlier spikes in Slim that didn't occur when testing Flight PHP, and Flight PHP generally having better response times.

Final Thoughts

If you haven't ventured into micro frameworks, give them a go. There are a few out there, and it can be a great learning experience to try them out and see what you like and what you don't like in each one.

Both Slim and Flight PHP are great micro frameworks.

Slim is a solid framework with some nice-to-have features, that will work quietly for you.

Flight PHP is lighter weight, and its simplicity makes learning the framework really easy.

Good response times and more simplified code to achieve the same thing makes it a really good choice for a micro framework to use.

After putting these two side by side, I do prefer Flight PHP over Slim, but as with any framework, give it a go and see if it works for you.

After all, the right framework is a framework that does what you need it to do.

Flight PHP
Slim Framework

Web Developer Burnout

Lawrence Cooke — Thu, 15 Aug 2024 14:00:00 +0000

A Little Background

Back when I started my web developer journey, everything was much simpler than it is today.

I started out using Mosaic web browser and then not too long after moved to the beta version of Netscape.

Developers at that time learned things together, as new features were added into the browser, we would share how we used them over a snack at the local bakery.

This led to a decent amount of shared learning. Back then there were basically two things that were handy to know: HTML and Perl.

HTML for the web design and if there was any web form processing, we would use Perl.

While Perl has other more beneficial uses, just knowing enough to process a web form was enough for a lot of use cases.

At the time, database administration was a more specialized job, and MySQL and Postgres were not yet around, so the job of a web developer was more concentrated on HTML, Perl and web design.

Not long after, CSS and JavaScript were introduced, and as the 90s rolled on, Cold Fusion, PHP and others were added in.

PHP was simplistic and, for me, was the next logical choice, as was dipping my toes into databases now that they were more accessible.

The Web Development Shift

During the next few years, there was a large shift in web design and development. PHP and other server side scripting became more polished and accessible.

While Perl was still relevant, other options like Python and Ruby gained popularity.

There was also a shift in front end web design. Faster internet speeds opened up the internet to more graphic heavy design, and web standards were developed and tuned.

While developers were busy learning new back end technologies, front end development matured into more complex designs requiring better graphic design skills.

This created a split in web development into back end and front end specialists.

Burnout In Developers

Fast forward to today, where it's all too common to see posts on social media, especially from junior developers, about suffering from burnout.

Polls taken have shown that up to 83% of developers have suffered burnout, with high work load the most common reason.

High work load can be a combination of potentially long work hours and pressures to spend time after work learning more just to keep up.

Back when I started, the journey tended to lead itself, where new technologies were introduced at a rate where learning the new technology was just a natural progression, we were eased into learning new things.

I still believe that taking time to introduce yourself to a new technology and concentrate on it has a better outcome than introducing multiple technologies at once, where you have to split your time between them.

In a lot of cases, the pressure to learn multiple technologies to keep up, while balancing workplace expectations, leads to developer burnout and often leads to really good developers leaving the industry for simpler pastures.

New developers are bombarded with a large range of technologies, technologies that are ever changing, as are coding standards around these technologies.

Even within technologies there are multiple frameworks and libraries to learn, while trying to grasp the language itself, best practices, coding styles, and design patterns.

A look on Indeed at jobs listed as junior developer jobs shows vastly different tech stack requirements, leading new developers to ask "where should I start?".

Adding to the problem is that the answers they receive can be wildly different, depending on the responder's own developer journey.

There is not just one clear path to take, leaving junior developers needing to learn many technologies quickly to be successful.

To progress, they need opportunities to learn on the job but also take the initiative to continue learning outside work hours, stretching them thin.

Even as an experienced developer, the need to continually learn still remains. Those who don't spend time learning new things eventually get left behind.

One developer I recently spoke to had been working his 8 hour days, then spending 3 to 5 hours a night on learning. Wanting to learn is a great ambition to have, but it leads to burnout very quickly. We need time to rest and absorb.

Finding a good balance between learning what we need to learn and our own health and well-being can be hard.

Other Burnout factors

Along with burnout from trying to keep up and meet expectations from the workplace, burnout comes from other factors as well.

Home life can affect your developer life.

When you are dealing with life difficulties, mentally, there is only so much that you can do.

The needs of the home outweigh development (as they should). At some point you just reach a breaking point.

How others talk to you at work or online can lead to burnout. There are only so many negative code reviews that could have been written in a more positive manner, only so many times you can do a good job that goes unnoticed, before it takes its toll.

It's not that developers sit around waiting for appreciation, it's just that hearing it may be the inspiration they needed to keep going.

How can we help?

Developer burnout is multi-faceted, but we can all do our bit to keep our fellow developers from suffering, and prevent good developers from leaving the industry.

Be willing to share your knowledge
Even as new developers, share what you just learned, your excitement can be contagious.

Have patience with developers
While there might be gaps in their knowledge, there are going to be days where they return the favor and are able to teach you something.

Even experienced developers have gaps, don't look down on them because of their gaps.

Be mindful
While developers may write code that is not always great, they are learning, and I doubt there is a developer out there who hasn't been where they are and written code they are not proud of.

Code Reviews
While at times code reviews need to be negative, there are ways to write them that lessen the blow to the developer.

When writing them, think about who you are reviewing, take the time to get to know the person if you can, it will help connect with them in a code review.

Often in code reviews, we are so fixated on finding what is wrong with the code, we forget to mention what is right with it.

Give praise where praise is due
Praise doesn't need to just be for big wins, praise for small wins may just be what the developer needed to hear that day.

It can be as simple as telling someone they did a really good job with something, even if there are flaws in their code, there are still positives that can be highlighted.

With the challenges new developers face, how we handle our interaction with them can make or break them as a developer.

This still applies to interacting with experienced developers, the challenges they face may be different from the challenges new developers face, but our interaction can just as easily make or break them as a developer.

You don't know what else might be going on in a developer's life, and treating people with empathy can help them more than you will ever know.

How can we help ourselves?

We also need to remember to take care of ourselves to avoid burnout.

Take time for yourself
While there may be pressure to learn more and do more, taking time away will rejuvenate you.

Switch off for a bit
Do something you like to do that doesn't involve tech.

For me it's turning off the computer, putting my phone on silent and spending time outdoors, the beach if I can, lakes, forests and waterfalls if I can't get to a beach. A place where I can empty my head a bit.

Burnout tends to come from either pressure being put on you, or putting pressure on yourself. Take that load off where you can, so you can rest and reset.

Do something different
Learn something new just for yourself, or work on a new side project, something where you can do what you love doing but without any pressures.

The journey back from burnout is made of small steps. Gaining confidence back in yourself, confidence in your skill set.

Learning to once again love what you do.

Find the right balance
Find the right balance to be able to still learn, but learn at your own pace.

Learning at your own pace creates a more enjoyable and rewarding experience. It can also help prevent burnout while still improving your skillset.

Final Thoughts

Developer burnout is a serious issue, affecting both newcomers and experienced professionals.

The rapidly evolving tech landscape, coupled with high expectations and personal pressure, can be overwhelming.

Creating a supportive environment, using empathy in our feedback to other developers, and prioritizing our own well-being can help with reducing burnout.

It is important to continually learn in this industry, but do what works for you. Learn at a pace that ensures you remain in a good space.

What that pace looks like will differ from person to person. What's right for someone else isn't necessarily right for you.

You are more beneficial to both your employer and your family if you are mentally and physically feeling 100%.

If you are struggling, talk to someone about it.

A fellow developer who you trust, your manager at work if you have a good rapport with them.

Someone who might have experienced a similar problem and worked through it. They may be able to offer you good advice and insights to help you get back on your feet again.

Creating Custom Functions In PostgreSQL

Lawrence Cooke — Thu, 25 Jul 2024 13:19:46 +0000

In PostgreSQL, custom functions can be created to solve complex problems.

These can be written using the default PL/pgSQL scripting language, or they can be written in another scripting language.

Python, Perl, Tcl and R are some of the scripting languages supported.

While PL/pgSQL comes with any Postgres installation, using other languages requires some setup.

Installing the extension

Before an extension can be used, the extension package needs to be installed.

On Ubuntu you would run:

Perl

sudo apt-get -y install postgresql-plperl-14

The package name 'postgresql-plperl-14' is specific to PostgreSQL version 14. If you're using a different version of PostgreSQL, you need to change the version number in the package name to match your installed PostgreSQL version.

Python 3

sudo apt-get install postgresql-plpython3-14

Activating the extension

To activate the extension in PostgreSQL the extension must be defined using the CREATE EXTENSION statement.

Perl

CREATE EXTENSION plperl;

Python (PL/Python is only available as an untrusted language, so its extension name ends in u)

CREATE EXTENSION plpython3u;

Hello world example

Once the extension has been created, a custom function can be created using the extension.

Perl

CREATE OR REPLACE FUNCTION hello(name text) 
RETURNS text AS $$
    my ($name) = @_;
    return "Hello, $name!";
$$ LANGUAGE plperl;

Python

CREATE OR REPLACE FUNCTION hello(name text)
RETURNS text AS $$
    return "Hello, " + name + "!"
$$ LANGUAGE plpython3u;

Breaking this down line by line

CREATE OR REPLACE FUNCTION hello(name text)

This line is how a function is created in Postgres. By using CREATE OR REPLACE, it will overwrite whatever function is already defined with the name hello with the new function.

Using CREATE FUNCTION hello(name text) will prevent the function from overwriting an existing function and will error if the function already exists.

RETURNS text AS $$

This defines what Postgres data type will be returned. It's important that the data type specified is a type recognized by Postgres. A custom data type can be specified, if the custom type is already defined.

$$ is a delimiter to mark the beginning and end of a block of code. In this line it's marking the start of the code block.

All code between the start and end $$ will be executed by Postgres

$$ LANGUAGE plperl;

$$ denotes the end of the script, and the LANGUAGE clause tells Postgres what language the script should be parsed as.

Using the function

Functions can be used like any built-in Postgres function

SELECT hello('world');

This will return a single column with the value Hello, world!

Functions can be part of more complex queries:

SELECT id, title, hello('world') AS greeting FROM my_table;

More complex example

Here is an example function that accepts text from a field and returns a word count.

CREATE OR REPLACE FUNCTION word_count(paragraph text)
RETURNS json AS $$
use strict;
use warnings;

my ($text) = @_;

my @words = $text =~ /\w+/g;
my $word_count = scalar @words;

my $result = '{' .
    '"word_count":' . $word_count .
'}';
return $result;
$$ LANGUAGE plperl;

This returns a JSON formatted result with the word count.

We can add more detailed statistics to the function.

CREATE OR REPLACE FUNCTION word_count(paragraph text)
RETURNS json AS $$
use strict;
use warnings;

my ($text) = @_;

my @words = $text =~ /\w+/g;

my $word_count = scalar @words;

my $sentence_count = ( $text =~ tr/!?./!?./ ) || 0;

my $average_words_per_sentence =
  $sentence_count > 0 ? $word_count / $sentence_count : 0;

my $result = '{' .
    '"word_count":' . $word_count . ',' .
    '"sentence_count":' . $sentence_count . ',' .
    '"average_words_per_sentence":"' . sprintf("%.2f", $average_words_per_sentence) . '"' .
'}';

return $result;
$$ LANGUAGE plperl SECURITY DEFINER;

Now when we use it in a query

SELECT word_count(text_field) AS word_count FROM my_table;

It will return JSON like

{"word_count":116,"sentence_count":15,"average_words_per_sentence":"7.73"}
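
For comparison, here is a rough sketch of the same statistics in Python. It is written as a plain Python function so it can be run outside the database; inside PostgreSQL the body would go in a plpython3u function. The name word_stats is made up for this example, and using the json module instead of building the string by hand is a choice made here, not something the Perl version does.

```python
import json
import re


def word_stats(text: str) -> str:
    """Return word and sentence statistics for text as a JSON string."""
    words = re.findall(r"\w+", text)
    word_count = len(words)
    # Count sentence-ending punctuation, like the Perl tr/!?./!?./ call.
    sentence_count = sum(text.count(mark) for mark in "!?.")
    average = word_count / sentence_count if sentence_count else 0
    return json.dumps(
        {
            "word_count": word_count,
            "sentence_count": sentence_count,
            # Kept as a two-decimal string to match the Perl output.
            "average_words_per_sentence": f"{average:.2f}",
        },
        separators=(",", ":"),
    )


print(word_stats("Hello world. How are you?"))
# {"word_count":5,"sentence_count":2,"average_words_per_sentence":"2.50"}
```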

Security considerations

When using custom functions or external scripting languages, there are additional security considerations to take into account. It can be a juggling act to get the right balance between usability and security.

Security Definer vs Security Invoker

In the previous function, the SECURITY DEFINER option was added to the create function statement.

It's important to think about how you want a function run from a security point of view.

The default behavior is to use SECURITY INVOKER. This will run the function with the privileges of the user who is running the function.

SECURITY DEFINER provides more control over the privileges granted to the function. Using this mode, the function will run with the privileges of the user who created the function.

This can be both good and bad. If a function is created by a user with limited privileges, then there is little harm that can be done to the database.

If the function is created by a user with high access privileges, then the function will run with those same privileges. Depending on the type of function, this could allow a user to run the function with more open privileges than they have been granted.

There are times where this is useful. For example, if a user does not have read privileges to a table, but read is required within the function, using SECURITY DEFINER can allow the required read privileges for the function to run.

Trusted and untrusted extensions

When creating the extensions above, the trusted plperl extension was used for Perl. In most circumstances a trusted extension is the right choice.

Trusted extensions have limited access to the server's file system and system calls.

Extension names ending in u (plperlu, plpython3u) are untrusted extensions.

These allow more access to the server's file system. PL/Python is only available in this untrusted form, which is why its extension name is plpython3u.

There may be cases where this is required, for example, if you want to use Perl modules, Python Libraries, or use system calls.

In the example above, the JSON output was generated as a string. If desired, the Perl JSON module could have been used to encode the data as JSON. Doing this would require using the untrusted extension to access the JSON module.

It's advisable to not use the untrusted extensions, but if necessary, use with caution and understand the potential risks.

If Perl is being used, Perl will run in taint mode when the untrusted extension is in use.

Final Thoughts

Being able to take advantage of Perl's advanced text processing and memory management, or Python's data analytics libraries, within PostgreSQL can be really powerful.

Passing off complex tasks to tools more suited to handling the task can reduce overhead on the database.

As always, when using custom functions and external scripting languages, take precautions to ensure secure usage.

Migrating from MySQL to PostgreSQL

Lawrence Cooke — Thu, 11 Jul 2024 05:00:00 +0000

Migrating a database from MySQL to Postgres is a challenging process.

While MySQL and Postgres do a similar job, there are some fundamental differences between them and those differences can create issues that need addressing for the migration to be successful.

Where to start?

pgloader is a tool that can be used to move your data to PostgreSQL. It's not perfect, but it can work well in some cases, so it's worth looking at to see if it's the direction you want to go.

Another approach to take is to create custom scripts.

Custom scripts offer greater flexibility and scope to address issues specific to your dataset.

For this article, custom scripts were built to handle the migration process.

Exporting the data

How the data is exported is critical to a smooth migration. Using mysqldump in its default setup will lead to a more difficult process.

Use the --compatible=ansi option to export the data in a format PostgreSQL requires.

To make the migration easier to handle, split up the schema and data dumps so they can be processed separately. The processing requirements for each file are very different and creating a script for each will make it more manageable.

Schema differences

Data Types

There are differences in what data types are available in MySQL and PostgreSQL. This means when processing your schema you are going to need to decide what field data types work best for your data.

Category	MySQL	PostgreSQL
Numeric	INT, TINYINT, SMALLINT, MEDIUMINT, BIGINT, FLOAT, DOUBLE, DECIMAL	INTEGER, SMALLINT, BIGINT, NUMERIC, REAL, DOUBLE PRECISION, SERIAL, SMALLSERIAL, BIGSERIAL
String	CHAR, VARCHAR, TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT	CHAR, VARCHAR, TEXT
Date and Time	DATE, TIME, DATETIME, TIMESTAMP, YEAR	DATE, TIME, TIMESTAMP, INTERVAL, TIMESTAMPTZ
Binary	BINARY, VARBINARY, TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB	BYTEA
Boolean	BOOLEAN (TINYINT(1))	BOOLEAN
Enum and Set	ENUM, SET	ENUM (no SET equivalent)
JSON	JSON	JSON, JSONB
Geometric	GEOMETRY, POINT, LINESTRING, POLYGON	POINT, LINE, LSEG, BOX, PATH, POLYGON, CIRCLE
Network Address	No built-in types	CIDR, INET, MACADDR
UUID	No built-in type (can use CHAR(36))	UUID
Array	No built-in support	Supports arrays of any data type
XML	No built-in type	XML
Range Types	No built-in support	int4range, int8range, numrange, tsrange, tstzrange, daterange
Composite Types	No built-in support	User-defined composite types

Tinyint field type

Tinyint doesn't exist in PostgreSQL. You have the choice of smallint or boolean to replace it with. Choose the data type most like the current dataset.

 $line =~ s/\btinyint\b(?:\(\d+\))?/smallint/gi;
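
As a sketch of the same substitution in Python (the original script is Perl), the pattern below puts the word boundary directly after tinyint and makes the display width optional after it, so that tinyint(1) becomes smallint rather than smallint(1). The function name convert_tinyint and the sample column are made up for illustration.

```python
import re

# Word boundaries sit on both sides of "tinyint"; the "(n)" width is optional.
TINYINT = re.compile(r"\btinyint\b(?:\(\d+\))?", re.IGNORECASE)


def convert_tinyint(line: str) -> str:
    """Replace tinyint or tinyint(n) with smallint in one schema line."""
    return TINYINT.sub("smallint", line)


print(convert_tinyint('"is_active" tinyint(1) NOT NULL DEFAULT 0,'))
# "is_active" smallint NOT NULL DEFAULT 0,
```

If a column really holds true/false values, swapping "smallint" for "boolean" here would be the other option mentioned above, although the data would then need converting as well.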

Enum Field type

Enum fields are a little more complex. While enums exist in PostgreSQL, they require creating custom types.

To avoid duplicating custom types, it is better to plan out what enum types are required and create the minimum number of custom types needed for your schema. Custom types are not table specific, so one custom type can be used on multiple tables.

CREATE TYPE color_enum AS ENUM ('blue', 'green');

...
"shirt_color" color_enum NOT NULL DEFAULT 'blue',
"pant_color" color_enum NOT NULL DEFAULT 'green',
...

The creation of the types would need to be done before the SQL is imported. The script could then be adjusted to use the custom types that have been created.

If there are multiple fields using enum('blue','green'), these should all be using the same enum custom type. Creating custom types for each individual field would not be good database design.

if ( $line =~ /"([^"]+)"\s+enum\(([^)]+)\)/ ) {
    my $column_name = $1;
    my $enum_values = $2;
    if ( $enum_values !~ /''/ ) {
        $enum_values .= ",''";
    }

    my @items = $enum_values =~ /'([^']*)'/g;

    my $sorted_enum_values = join( ',', sort @items );

    my $enum_type_name;
    if ( exists $enum_types{$sorted_enum_values} ) {
        $enum_type_name = $enum_types{$sorted_enum_values};
    }
    else {
        $enum_type_name = create_enum_type_name($sorted_enum_values);
        $enum_types{$sorted_enum_values} = $enum_type_name;

        # Add CREATE TYPE statement to post-processing
        push @enum_lines,
        "CREATE TYPE $enum_type_name AS ENUM ($enum_values);\n";
    }

    # Replace the line with the new ENUM type
    $line =~ s/enum\([^)]+\)/$enum_type_name/;
}
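
The idea of reusing one custom type per set of enum values can be sketched in Python as follows. This is a simplified illustration, not a port of the script: it skips the empty-string value the Perl code appends, and the enum_type_N naming is a hypothetical stand-in for the create_enum_type_name helper, which is not shown in this article.

```python
import re

ENUM_COLUMN = re.compile(r'"([^"]+)"\s+enum\(([^)]+)\)', re.IGNORECASE)


def convert_enums(lines: list[str]) -> tuple[list[str], list[str]]:
    """Return (converted schema lines, CREATE TYPE statements)."""
    types: dict[tuple[str, ...], str] = {}
    create_statements: list[str] = []
    converted: list[str] = []
    for line in lines:
        match = ENUM_COLUMN.search(line)
        if match:
            values = re.findall(r"'([^']*)'", match.group(2))
            key = tuple(sorted(values))  # same set of values -> same type
            if key not in types:
                # Hypothetical naming scheme: enum_type_1, enum_type_2, ...
                types[key] = f"enum_type_{len(types) + 1}"
                quoted = ", ".join(f"'{v}'" for v in values)
                create_statements.append(
                    f"CREATE TYPE {types[key]} AS ENUM ({quoted});"
                )
            column = f'"{match.group(1)}" {types[key]}'
            line = line[: match.start()] + column + line[match.end():]
        converted.append(line)
    return converted, create_statements


schema = [
    """  "shirt_color" enum('blue','green') NOT NULL DEFAULT 'blue',""",
    """  "pant_color" enum('green','blue') NOT NULL DEFAULT 'green',""",
]
converted, creates = convert_enums(schema)
print(creates)
print(converted)
```

Both columns end up sharing enum_type_1, because the lookup key is the sorted list of values rather than the order they were written in.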

Indexes

There are differences in how indexes are created. There are two variations of indexes: indexes with character limitations and indexes without character limitations. Both of these needed to be handled, removed from the SQL, and put into a separate SQL file to be run after the import is complete (run_after.sql).

if ($line =~ /^\s*KEY\s+/i) {
    if ($line =~ /KEY\s+"([^"]+)"\s+\("([^"]+)"\)/) {
        my $index_name = $1;
        my $column_name = $2;
        push @post_process_lines, "CREATE INDEX idx_${current_table}_$index_name ON \"$current_table\" (\"$column_name\");\n";
    } elsif ($line =~ /KEY\s+"([^"]+)"\s+\("([^"]+)"\((\d+)\)\)/i) {
        my $index_name = $1;
        my $column_name = $2;
        my $prefix_length = $3;
        push @post_process_lines, "CREATE INDEX idx_${current_table}_$index_name ON \"$current_table\" (LEFT(\"$column_name\", $prefix_length));\n";
    }
    next;
}

Full text indexes work quite differently in PostgreSQL. To create a full text index, the data must first be converted into a vector.

The vector can then be indexed. There are two index types to choose from when indexing vectors: GIN and GiST. Both have pros and cons. Generally GIN is preferred over GiST. While GIN is slower to build, it's faster for lookups.

if ( $line =~ /^\s*FULLTEXT\s+KEY\s+"([^"]+)"\s+\("([^"]+)"\)/i ) {
    my $index_name  = $1;
    my $column_name = $2;
    push @post_process_lines,
    "CREATE INDEX idx_fts_${current_table}_$index_name ON \"$current_table\" USING GIN (to_tsvector('english', \"$column_name\"));\n";
    next;
}

Auto increment

PostgreSQL doesn't use MySQL's AUTO_INCREMENT keyword, instead it uses GENERATED ALWAYS AS IDENTITY.

There is a catch with using GENERATED ALWAYS AS IDENTITY while importing data: it is not designed for importing IDs. When inserting a row into a table, the ID field cannot be specified, because the ID value is auto generated. Trying to insert your own IDs into the row will produce an error.

To work around this issue, the ID field can be set as SERIAL type instead of int GENERATED ALWAYS AS IDENTITY. SERIAL is much more flexible for imports, but it is not recommended to leave the field as SERIAL.

An alternative to using this method would be to add OVERRIDING SYSTEM VALUE into the insert query.

INSERT INTO table (id, name)
OVERRIDING SYSTEM VALUE
VALUES (100, 'A Name');

If you use SERIAL, some queries will need to be written into run_after.sql to change the SERIAL to GENERATED ALWAYS AS IDENTITY and reset the internal counter after the schema is created and the data has been inserted.

if ( $line =~ /^\s*"(\w+)"\s+(int|bigint)\s+NOT\s+NULL\s+AUTO_INCREMENT\s*,/i ) {
    my $column_name = $1;
    $line =~ s/^\s*"$column_name"\s+(int|bigint)\s+NOT\s+NULL\s+AUTO_INCREMENT\s*,/"$column_name" SERIAL,/;

    push @post_process_lines, "ALTER TABLE \"$current_table\" ALTER COLUMN \"$column_name\" DROP DEFAULT;\n";

    push @post_process_lines, "DROP SEQUENCE ${current_table}_${column_name}_seq;\n";

    push @post_process_lines, "ALTER TABLE \"$current_table\" ALTER COLUMN \"$column_name\" ADD GENERATED ALWAYS AS IDENTITY;\n";

    push @post_process_lines, "SELECT setval('${current_table}_${column_name}_seq', (SELECT COALESCE(MAX(\"$column_name\"), 1) FROM \"$current_table\"));\n\n";

}

Schema results

Original schema after exporting from MySQL

DROP TABLE IF EXISTS "address_book";
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE "address_book" (
  "id" int NOT NULL AUTO_INCREMENT,
  "user_id" varchar(50) NOT NULL,
  "common_name" varchar(50) NOT NULL,
  "display_name" varchar(50) NOT NULL,
  PRIMARY KEY ("id"),
  KEY "user_id" ("user_id")
);

Processed main SQL file (the varchar lengths have been padded, as covered under Character sets below)

DROP TABLE IF EXISTS "address_book";
CREATE TABLE "address_book" (
  "id" SERIAL,
  "user_id" varchar(85) NOT NULL,
  "common_name" varchar(85) NOT NULL,
  "display_name" varchar(85) NOT NULL,
  PRIMARY KEY ("id")
);

Run_after.sql

ALTER TABLE "address_book" ALTER COLUMN "id" DROP DEFAULT;
DROP SEQUENCE address_book_id_seq;
ALTER TABLE "address_book" ALTER COLUMN "id" ADD GENERATED ALWAYS AS IDENTITY;
SELECT setval('address_book_id_seq', (SELECT COALESCE(MAX("id"), 1) FROM "address_book"));
CREATE INDEX idx_address_book_user_id ON "address_book" ("user_id");

It's worth noting the index naming convention used in the migration. The index name includes both the table name and the field name. Index names have to be unique, not only within the table the index was added to, but across the whole schema. Adding the table name and the column name reduces the chances of duplicates in your script.

Data processing

The biggest hurdle in migrating your database is getting the data into a format PostgreSQL accepts. There are some differences in how PostgreSQL stores data that requires extra attention.

Character sets

The dataset used for this article predated utf8mb4 and uses the old default of Latin1. This charset is not compatible with PostgreSQL's default charset, UTF8. It should be noted that PostgreSQL's UTF8 corresponds to MySQL's utf8mb4, not to MySQL's older three-byte utf8 charset.

The issue with migrating from Latin1 to UTF8 is how the data is stored. In Latin1 each character is a single byte, while in UTF8 the characters can be multibyte, up to 4 bytes.

An example of this is the word café.

In Latin1 the data is stored as 4 bytes and in UTF8 as 5 bytes. During migration of character sets, the byte value is taken into account and can lead to truncated data in UTF8. PostgreSQL will error on this truncation.

To avoid truncation, add padding to affected Varchar fields.
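
The byte counts for café can be checked with a few lines of Python. This is only an illustration of the encoding difference; it is not part of the migration scripts.

```python
word = "café"

latin1_bytes = word.encode("latin-1")
utf8_bytes = word.encode("utf-8")

print(len(word))          # 4 characters
print(len(latin1_bytes))  # 4 bytes: every character is one byte
print(len(utf8_bytes))    # 5 bytes: "é" takes two bytes in UTF-8
```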

It's worth noting that this same truncation issue could occur if you were changing character sets within MySQL.

Character Escaping

It's not uncommon to see backslash escaped single quotes stored in a database.

However, PostgreSQL doesn't support this by default. Instead, the ANSI SQL standard method of using double single quotes is used.

If the varchar field contains It\'s it would need to be changed to It''s

 $line =~ s/\\'/\'\'/g;
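
The same replacement, sketched in Python. The function name and the sample INSERT statement are made up for illustration.

```python
def ansi_quote_escapes(line: str) -> str:
    """Turn backslash-escaped single quotes (\\') into doubled quotes ('')."""
    return line.replace("\\'", "''")


print(ansi_quote_escapes(r"INSERT INTO notes VALUES ('It\'s here');"))
# INSERT INTO notes VALUES ('It''s here');
```

Like the Perl one-liner, this is a plain text replacement. A value that ends in an escaped backslash directly before a closing quote would be changed too, so data like that needs its own handling.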

Table Locking

In SQL dumps there are table locking calls before each insert.

LOCK TABLES "address_book" WRITE;

Generally it is unnecessary to manually lock a table in PostgreSQL.

PostgreSQL handles transactions by using Multi-Version Concurrency Control (MVCC). When a row is updated, it creates a new version. Once the old version is no longer in use, it will be removed. This means that table locking is often not needed. PostgreSQL will use locks alongside MVCC to improve concurrency. Manually setting locks can negatively affect concurrency.

For this reason, removing the manual locks from the SQL dump and letting PostgreSQL handle the locks as needed is the better choice.
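
Removing the lock statements can be sketched in Python like this. It assumes each LOCK TABLES and matching UNLOCK TABLES statement sits on its own line, which is how mysqldump writes them; the function name is made up for this example.

```python
import re

LOCK_STATEMENT = re.compile(
    r"^\s*(?:LOCK\s+TABLES\b.*|UNLOCK\s+TABLES\s*);\s*$", re.IGNORECASE
)


def strip_table_locks(lines):
    """Yield every line except LOCK TABLES / UNLOCK TABLES statements."""
    for line in lines:
        if not LOCK_STATEMENT.match(line):
            yield line


dump = [
    'LOCK TABLES "address_book" WRITE;\n',
    'INSERT INTO "address_book" VALUES (1,\'a\',\'b\',\'c\');\n',
    "UNLOCK TABLES;\n",
]
print(list(strip_table_locks(dump)))
```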

Importing data

The next step in the migration process is running the SQL files generated by the script. If the previous steps were done correctly, this part should go smoothly. In practice, the import picks up problems that went unseen in the prior steps, which means going back, adjusting the scripts and trying again.

To run the SQL files, sign into the Postgres database using psql and run the \i command

\i /path/to/converted_schema.sql

The two main errors to watch out for:

ERROR: value too long for type character varying(50)

This can be fixed by increasing varchar field character length as mentioned earlier.

ERROR: invalid command \n

This error can be caused by stray escaped single quotes, or other incompatible data values. To fix these, regex may need to be added to the data processing script to target the specific problem area.

Some of these errors require a harder look at the insert statements to find where the issues are. This can be challenging in a large SQL file. To help with this, write out the INSERT statements that were erroring to a separate, much smaller SQL file, which can more easily be studied to find the issues.

my %lines_to_debug = map { $_ => 1 } (1148, 1195); 
 ...
if (exists $lines_to_debug{$current_line_number}) {
    print $debug_data "$line";  
}

Chunking Data

Regardless of what scripting language you choose to use for your migration, chunking data is going to be important on large SQL files.

For this script, the data was chunked into 1 MB chunks, which helped keep the script efficient. You should pick a chunk size that makes sense for your dataset.

my $bytes_read = read( $original_data, $chunk, $chunk_size );
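
A rough Python sketch of chunked reading is shown below. One thing to watch with fixed-size chunks is that a chunk can end part-way through a line, so this version carries the unfinished piece over to the next chunk. The function name is made up, and whether the original Perl script handles the boundary the same way is not covered in this article.

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1 MB, as used above; tune for your dataset


def read_lines_in_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield complete lines from a binary stream, one chunk at a time."""
    leftover = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts = (leftover + chunk).split(b"\n")
        leftover = parts.pop()  # the last piece may be an incomplete line
        for part in parts:
            yield part + b"\n"
    if leftover:
        yield leftover  # final line without a trailing newline


data = io.BytesIO(b"INSERT 1;\nINSERT 2;\nINSERT 3;")
for line in read_lines_in_chunks(data, chunk_size=8):
    print(line)
```

With a real dump the stream would come from open(path, "rb") instead of io.BytesIO.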

Verifying Data

There are a few methods of verifying the data

Row Count

Doing a row count is an easy way to ensure at least all the rows were inserted. Count the rows in the old database and compare that to the rows in the new database.

SELECT count(*) FROM address_book

Checksum

Running a checksum across the columns may help, but bear in mind that some fields, especially varchar fields, could have been changed to ANSI standard format. So while this will work on some fields, it won't be accurate on all fields.

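For MySQL (GROUP_CONCAT separates values with a comma by default, so the separator is set to an empty string to match the PostgreSQL query)

SELECT MD5(GROUP_CONCAT(COALESCE(user_id, '') ORDER BY id SEPARATOR '')) FROM address_book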

For PostgreSQL

SELECT MD5(STRING_AGG(COALESCE(user_id, ''), '' ORDER BY id)) FROM address_book
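
The two checksums only agree if both sides join the values in exactly the same way: same order, same separator, same handling of NULL. A small Python sketch of the calculation makes that visible. The function name and the sample values are made up for illustration.

```python
import hashlib


def column_checksum(values, separator=""):
    """MD5 of the column values joined in order, treating None as ''."""
    joined = separator.join("" if v is None else str(v) for v in values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


user_ids = ["alice", None, "bob"]
print(column_checksum(user_ids))                 # hashes "alicebob"
print(column_checksum(user_ids, separator=","))  # hashes "alice,,bob"
```

The two printed values differ, which is what a mismatched separator between the MySQL and PostgreSQL queries would look like.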

Manual Data Check

You are going to want to verify the data through a manual process also. Run some queries that make sense, queries that would be likely to pick up issues with the import.

Final thoughts

Migrating databases is a large undertaking, but with careful planning and a good understanding of both your dataset and the differences between the two database systems, it can be completed successfully.

There is more to migrating to a new database than just the import, but a solid dataset migration will put you in a good place for the rest of the transition.

Scripts created for this migration can be found on GitHub.

Postgres Arrays

Lawrence Cooke — Fri, 21 Jun 2024 02:00:43 +0000

What are Postgres arrays?

Arrays are columns that can hold multiple values. They are useful when there is additional data that is tightly coupled to a row of data in a table.

Storing tags associated with a row, or values from a web form where multiple options can be selected, are both examples of where you could use an array.

Arrays do not replace lookup tables. Lookup tables can generally be accessed from multiple rows in a table and are not tightly coupled to a specific row.

Example without using arrays

Here is a simplified schema for a migraine tracker that stores both the start and end time, and a list of triggers.

Main table

CREATE TABLE
  public.migraines (
    id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    user_id integer NOT NULL,
    start_dt timestamp without time zone NULL,
    end_dt timestamp without time zone NULL
  );

Lookup table for trigger type names

CREATE TABLE
  public.trigger_types (
    id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    name character varying(30) NOT NULL
  );

Table to store selected triggers

CREATE TABLE
  public.migraine_triggers (
    id integer NOT NULL GENERATED BY DEFAULT AS IDENTITY,
    user_id integer NOT NULL,
    migraine_id integer NOT NULL,
    trigger_id integer NOT NULL
  );

Inserting Data

Inserting data requires two separate actions

Insert data into the migraine table
Insert Triggers into the migraine_triggers table

The insert into the migraine_triggers is likely a multi row insert.

INSERT INTO migraines (user_id,start_dt,end_dt) 
VALUES (1,'2024-06-18 09:30:00', '2024-06-18 10:30:00');

INSERT INTO migraine_triggers (user_id,migraine_id,trigger_id) 
VALUES (1,2, 3),(1,2, 4),(1,2, 5);

Updating Data

Updating data is not entirely straightforward. You have to decide what approach you want to take (or the approach that works best with your data).

1) Run a SELECT before UPDATE to find which ones already exist and INSERT items not already in the list. You may need to also delete rows that are no longer in the list.

2) Use a conflict resolution insert (if the table is indexed to allow it)

INSERT INTO migraine_triggers (user_id, migraine_id, trigger_id)
VALUES 
    (1, 2, 3),
    (1, 2, 4),
    (1, 2, 5)
ON CONFLICT (migraine_id, trigger_id) 
DO NOTHING;

With this method, you may also need to delete rows that are no longer in the list.

3) Run a delete query to delete all rows related to the migraine_id and INSERT all the new items.

In all these scenarios multiple queries are required to update the data.
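
For the first approach, the application has to work out which trigger rows to add and which to remove before it can build the INSERT and DELETE queries. A minimal sketch of that step in Python, with a made-up function name and sample ids:

```python
def plan_trigger_changes(existing, selected):
    """Return (to_insert, to_delete) trigger ids for one migraine."""
    existing, selected = set(existing), set(selected)
    to_insert = sorted(selected - existing)  # newly selected triggers
    to_delete = sorted(existing - selected)  # triggers no longer selected
    return to_insert, to_delete


print(plan_trigger_changes(existing=[3, 4, 5], selected=[3, 5, 6]))
# ([6], [4])
```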

Selecting Data

A simple selection might be to find migraines where the migraine was triggered by trigger 3

SELECT migraines.*, migraine_triggers.trigger_id 
FROM migraines
INNER JOIN migraine_triggers 
ON migraine_triggers.migraine_id = migraines.id
WHERE migraine_triggers.trigger_id = 3

Now a slightly more complex query to bring back the name of the trigger from the trigger_types table

SELECT migraines.*,trigger_types.name 
FROM migraines
INNER JOIN migraine_triggers 
ON migraine_triggers.migraine_id = migraines.id 
INNER JOIN trigger_types 
ON trigger_types.id = migraine_triggers.trigger_id
WHERE migraine_triggers.trigger_id = 3

Example using arrays

Using arrays we can simplify the database design and the queries needed to retrieve the same information in the above examples.

One of the features of arrays that separates them from JSON or JSONB fields is that the data is strictly typed.

The data that goes into an array must be the right type of data.

This ensures that data integrity is maintained in the array.

In this example the data type would be INTEGER. A CHAR could be used but using an integer and utilizing a lookup table has some advantages over just storing the names in the array.

Adding an array field

Instead of using the migraine_triggers table, we can add a column to the migraine table to hold the trigger_ids selected for the migraine.

This will prevent the need for multiple row inserts, deletes and updates. It can also improve select performance because the queries can be simplified in some cases. It also reduces the size of the database by not needing an additional, potentially large table.

To add an array column, add [] after the column's data type.

CREATE TABLE
  public.migraines (
    id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    user_id integer NOT NULL,
    start_dt timestamp without time zone NULL,
    end_dt timestamp without time zone NULL,
    trigger_types integer[] NULL
  );

Inserting Data

Inserting data will now require just one query. Wrap the values for the array in {} to insert the array.

INSERT INTO public.migraines (user_id, start_dt, end_dt, trigger_types)
VALUES (2, '2024-06-18 09:30:00', '2024-06-18 10:00:00', '{1,2}');

Updating Data

Updating data is similar, just one query to update the migraine and the trigger data.

UPDATE public.migraines
SET end_dt='2024-06-18 11:00:00', trigger_types = '{1,3}'
WHERE id = 1;

Selecting Data

A simple selection to find migraines where the migraine was triggered by trigger 3 can now be simplified from what it was before.

SELECT * FROM migraines WHERE 3 = ANY(trigger_types);

In this case, there is no overhead from table joins.

Here is a more complex query where we want to pull in the trigger name from the trigger_types table.

SELECT migraines.*, trigger_types.name
FROM migraines 
INNER JOIN UNNEST(trigger_types) trigger_id ON trigger_id = 3
INNER JOIN trigger_types ON trigger_types.id = trigger_id;

In this case we can use unnest to turn the array into rows and then join those rows with the trigger_types table.

Indexing Arrays

To improve performance, you can add an index to an array field.

Using a GIN (Generalized Inverted Index) is most likely the best index type to choose.

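GIN is designed for fields where multiple values are present. Arrays and JSONB are both examples where you might want to use a GIN index. Note that the index is used by the array operators, such as @> (contains) and && (overlaps), so an indexed lookup is written as trigger_types @> ARRAY[3]. A condition written as 3 = ANY(trigger_types) will not use it.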

CREATE INDEX idx_gin_triggers ON migraines USING GIN (trigger_types);

Arrays are not right for every situation, but can provide an efficient way to store row metadata.

They can simplify database design and queries, while maintaining data integrity and ease of access.

Further information on arrays can be found in the Postgres Manual

Updating legacy code to php 8.x

Lawrence Cooke — Tue, 30 Apr 2024 14:41:22 +0000

When you have really old code, while it might work on PHP 7.4, getting it ready for PHP 8 is a daunting task.

But if you take it in small steps, you can do it.

I have a code base that was originally written 20+ years ago. There were no classes, OOP, frameworks, or any of the tools now at our disposal, so this was old procedural code for the most part, that by some miracle actually worked on PHP 7.4.

Knowing where to start is hard, but you have to start somewhere, so start somewhere simple.

Basic Cleanup

I started with short open tags.

Back in the early days of PHP, short open tags were the standard, until issues arose with XML, which is why I still had a bunch of these in this code.

I updated these manually, and the reason I did this manually was that I wanted to see what gremlins I would find in the process.

To find these short open tags I used a regex.

<\?(?!php|xml|=)
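
The same pattern can be tried out in Python to see what it does and does not match. This is just a sketch for checking the regex; the function name and the sample template are made up, and an editor's project-wide search works equally well.

```python
import re

SHORT_OPEN_TAG = re.compile(r"<\?(?!php|xml|=)")


def find_short_open_tags(source: str) -> list[int]:
    """Return the 1-based line numbers that contain a short open tag."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if SHORT_OPEN_TAG.search(line)
    ]


sample = (
    "<?php echo 1; ?>\n"
    "<? echo 2; ?>\n"
    "<?= $three ?>\n"
    '<?xml version="1.0"?>\n'
)
print(find_short_open_tags(sample))
# [2]
```

The negative lookahead skips <?php, <?xml and <?=, so only the bare <? on line 2 is reported.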

This was a helpful process, because I did notice some other things while doing this.

While it's tempting to try and fix these, don't, but make notes on them. If you get distracted from what you are trying to fix, you will end up going down rabbit holes and never finishing what you started.

Once all the short open tags were converted, I found a lot of places where I had

<?php echo "something"?>

While this isn't a deal breaker, this process is a chance to clean up your code, and <?php echo can be written more neatly as <?= so why not make the change?

Now that the open tags are consistent, the ECHOs are easy to identify and update with a simple find and replace across the project.
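
As a sketch, that find and replace could look like this in Python. The function name is made up, and in practice an editor's project-wide replace does the same job.

```python
import re

ECHO_TAG = re.compile(r"<\?php\s+echo\s+")


def use_short_echo(source: str) -> str:
    """Rewrite '<?php echo ' openings to the shorter '<?= ' form."""
    return ECHO_TAG.sub("<?= ", source)


print(use_short_echo('<p><?php echo "something"?></p>'))
# <p><?= "something"?></p>
```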

Code Cleanup

Now that some basics are out of the way, it's time to clean up the code. While this clean up doesn't get us any closer to PHP 8, having neat, well laid out code will be a big benefit in this process.

For this I set up my phpcs.xml rules with just a couple of rules

<rule ref="PSR12" />
<rule ref="Generic.Arrays.DisallowLongArraySyntax"/>

This will use the latest PSR standard for the most part with one additional rule to convert array() into []. You may feel differently about this, but I prefer the shorter array definition these days.

Random Errors

It was in this code clean up process where I started running into trouble. I started getting errors.

This was a great outcome. The beautifier was finding places in my code that had had little bugs for years. With some of them I am still wondering why the pages worked at all, but here we are.

This allowed me to fix these minor issues as they got picked up. There weren't many of these, but it was great to find them.

Secondary Cleanup

After the beautifier had run, the code looked neat, but there were places where I had multiple blank lines, probably from removing code over time.

I'm not sure if Code Sniffer has a rule for this. I couldn't find one in a quick search, but rather than spending time hunting it down, I thought I would just run another regex across the project to find these unnecessary gaps.

Using the following regex did the job nicely.

^\n{1,}$
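The same regex can be applied from PHP if you want to script it. This is a small sketch rather than what I actually ran: `collapseBlankLines` is a made up name, the `m` modifier is needed so that ^ and $ match at line boundaries, and it assumes files use plain \n line endings.

```php
<?php

/**
 * Collapse runs of blank lines down to a single blank line.
 * Assumes "\n" line endings; the "m" modifier makes ^ and $ match per line.
 */
function collapseBlankLines(string $source): string
{
    /* Fall back to the original text if the regex engine reports an error. */
    return preg_replace('/^\n{1,}$/m', '', $source) ?? $source;
}

$messy = "first();\n\n\n\nsecond();\n";
echo collapseBlankLines($messy);
```

Single blank lines are left alone, so the normal spacing between blocks of code survives.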

Finally a clean codebase!

Updating to PHP 8

For the next phase of updates, I installed PHP Rector.

There is a temptation to configure Rector and set it straight to the latest version of PHP 8.

With how old the code base is, and how much has changed, this led to an overwhelming amount of changes.

In total, doing it this way ended up at over 1000 files changed and it was too many to cope with.

As great as Rector is, it's not a tool that you can just let do whatever it wants with your code. That will end very badly.

Light Bulb Moment

During this process I realized that just because code ran on PHP 7.4, that didn't mean it was PHP 7.4 code. This code had been through every version of PHP since PHP 2.0. There was a lot of old coding in it, and Rector wanted to fix it all.

Back Tracking

To make this task easier, I pushed the Rector PHP version back down as far as I could to PHP 5.3.

With Rector targeted at PHP 5.3, a couple of minor issues were picked up. Three files were affected, and the changes were easy.

When Rector gave a clean bill of health on 5.3, I increased it to PHP 5.4. Again there were a couple of minor things, in 2 files.

It wasn't until PHP 5.6 that I caught anything mildly interesting, but what going back to 5.3 did was ensure I was fixing things in small bites. This is really significant to the process.
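To give a rough idea of what stepping through versions looks like in practice, here is a sketch of a rector.php for the first step. Treat it as an assumption rather than my exact file: the `src` path is a placeholder, and it relies on the `RectorConfig::configure()` builder and the `LevelSetList` constants, so check the Rector documentation for the version you have installed, since newer releases offer other ways to pick the PHP sets.

```php
<?php

use Rector\Config\RectorConfig;
use Rector\Set\ValueObject\LevelSetList;

/* Placeholder path: point this at wherever your legacy code lives. */
return RectorConfig::configure()
    ->withPaths([__DIR__ . '/src'])
    /* Start low. Raise this one level at a time once a run comes back clean. */
    ->withSets([LevelSetList::UP_TO_PHP_53]);
```

Running `vendor/bin/rector process --dry-run` shows what would change without touching any files, which is a handy way to judge how big each step is before applying it.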

Updating to PHP 7.4

Each time I did a version update in Rector, it found new things to fix. Most of them could be fixed automatically, and those that couldn't, I could fix quickly by hand.

In the process I found bunches of little code issues that might end up with warnings in PHP 7.4 but errors in PHP 8.

Rector with PHP 7.1 brought about the first really odd results, and these odd results are why blindly allowing Rector to fix everything is a bad idea.

Hidden amongst the changes was this really odd change:

if ($show_time_out == 0)

was changed to:

if (0 == 0)

A look at the code revealed that there were variables that were initialized as the wrong type. It also appeared that this change was triggered by variables that were set but never updated, probably from old code that had been removed while remnants remained.

One thing to note about Rector is that you can filter rules out. Long term you are going to want to keep most of them in, but when you end up with 150 files to update, automating what is safe to update makes dealing with the rest less daunting. In my case I had a total of 143 files with issues at PHP 7.1. A lot of those were safe to update, so I filtered out the more difficult rules and fixed those violations manually.
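Filtering a rule out is done in rector.php as well. The sketch below is illustrative only: the `src` path is a placeholder, and `RemoveExtraParametersRector` is simply an example of a rule class from the PHP 7.1 set, not necessarily one of the rules I skipped. Swap in whichever rule names show up in your own dry run output.

```php
<?php

use Rector\Config\RectorConfig;
use Rector\Php71\Rector\FuncCall\RemoveExtraParametersRector;
use Rector\Set\ValueObject\LevelSetList;

return RectorConfig::configure()
    ->withPaths([__DIR__ . '/src'])
    ->withSets([LevelSetList::UP_TO_PHP_71])
    /* Rules listed here are left out of the run, to be handled by hand later. */
    ->withSkip([
        RemoveExtraParametersRector::class,
    ]);
```

Once the safe changes are committed, remove the rule from the skip list and work through what is left.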

Updating to PHP 8

Finally, the moment came to update to PHP 8. When I first started out, the update to PHP 8 was daunting. There were a lot of issues, too many to deal with. But because I made the changes incrementally, the actual change for PHP 8 was only 25 files instead of 1000+, and most of those were string manipulation changes that could be fixed automatically.

Both PHP 8.0 and 8.1 brought changes, but once I reached PHP 8.1, no further code changes were found. I pushed Rector up to 8.4 to check this.

Left Overs

Rector can't do everything for you. There are going to be left overs, and for me those left overs were related to type mixing.

$var = 0;
if($var == "")

While I am still working through and testing to find these gremlins, there are only a few in the code base that were not corrected in some way by Rector during the process.
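The reason this kind of check matters is that PHP 8 changed how numbers are compared to non-numeric strings: 0 == "" is true on PHP 7 but false on PHP 8. One way to deal with it is to spell out what the old check really meant using strict comparisons. The helper below is a hypothetical sketch, not code from my project, and which values should count as "blank" is an assumption you will need to confirm for each case.

```php
<?php

/**
 * Hypothetical helper: true for the values that the old loose check
 * ($value == "") treated as blank on PHP 7, using strict comparisons
 * so the result is the same on PHP 7 and PHP 8.
 */
function isBlankOrZero($value): bool
{
    return $value === null
        || $value === false
        || $value === ''
        || $value === 0
        || $value === 0.0;
}

$var = 0;
if (isBlankOrZero($var)) {
    echo "treated as blank\n";
}
```

The better long term fix is to initialize each variable with the right type in the first place, but a helper like this can keep behaviour stable while you track those down.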

As a result, the site itself is working. It still needs to be put through the usual testing phase, but the outcome has been better than expected.

Final Thoughts

Take your time with the process. Rushing it will end with broken code.

Commit code regularly. You want to be able to backtrack if needed and try again without having to go too far back. Committing after each PHP version bump made sense to me.

While I would love to update this code base into a framework, getting it to PHP 8 was a higher priority. The code works, even though it's not the best code.

I will move it to a framework in stages, but having a working code base to work from will make that transition a much easier one.