Forem: Chan Ro

Massive node_modules in production issue

Chan Ro — Wed, 25 Oct 2023 02:21:48 +0000

Sometimes node_modules becomes so massive that deploying code to the production environment gets harder and slower. In this post, I want to cover a few things that can be done to reduce the size of node_modules.

npm commands

When deploying code to production, devDependencies should be excluded, because the dependencies under that property are only needed for development, e.g. jest, typescript, etc.

Run npm ci --production to exclude devDependencies from node_modules (on recent npm versions the equivalent flag is --omit=dev)

Sometimes after npm install, the dependency tree that npm list shows isn't necessarily the same as the node_modules file tree, and compatible copies of the same package can end up installed more than once. This can be simplified by deduplicating the tree:

Run npm dedupe

Unify sub dependencies version

Sometimes developers face a situation where multiple dependencies require the same sub-dependency, but npm installs multiple copies because they require different versions of it, e.g. axios, aws-sdk.

If those dependencies are client libraries that you own, you can simply update package.json in those libraries to use the same dependency version. E.g. the tree below will install 2 aws-sdk packages into node_modules:

library-a
 |-----> aws-sdk@2.1.0
library-b
 |-----> aws-sdk@2.2.0

The tree below will install 1 aws-sdk package, as there's only one version:

library-a
 |-----> aws-sdk@2.2.0
library-b
 |-----> aws-sdk@2.2.0
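To find out which sub-dependencies are affected, the comparison above can be sketched in a few lines. This is only an illustration: the input shape (each library mapped to the sub-dependency versions it requires) and the function name are assumptions, not something npm produces in this form.

```javascript
// Hypothetical helper: lists sub-dependencies that are required in more than one version.
// `libraries` maps a library name to the sub-dependency versions it requires.
const findDuplicatedDependencies = (libraries) => {
  const versionsByPackage = new Map();

  for (const dependencies of Object.values(libraries)) {
    for (const [name, version] of Object.entries(dependencies)) {
      if (!versionsByPackage.has(name)) {
        versionsByPackage.set(name, new Set());
      }
      versionsByPackage.get(name).add(version);
    }
  }

  // Keep only the packages that appear with two or more distinct versions.
  const duplicated = {};
  for (const [name, versions] of versionsByPackage) {
    if (versions.size > 1) {
      duplicated[name] = [...versions].sort();
    }
  }
  return duplicated;
};

// First tree from above: aws-sdk is required in two versions.
const mixedVersions = findDuplicatedDependencies({
  "library-a": { "aws-sdk": "2.1.0" },
  "library-b": { "aws-sdk": "2.2.0" },
});

// Second tree from above: both libraries agree, so nothing is reported.
const unifiedVersions = findDuplicatedDependencies({
  "library-a": { "aws-sdk": "2.2.0" },
  "library-b": { "aws-sdk": "2.2.0" },
});
```

In a real project, npm ls aws-sdk gives the same kind of overview directly from the installed tree.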

The overrides property in package.json (available in npm 8.3 and later) can also resolve the multiple versions issue. E.g.

{
  ...,
  "dependencies": {
    "library-a": "1.0.0",
    "library-b": "1.0.5",
    ...
  },
  "overrides": {
    "library-a": {
      "aws-sdk": "2.2.0"
    },
    "library-b": {
      "aws-sdk": "2.2.0"
    }
  }
}

Remove unnecessary packages

Developers tend to solve every problem with an npm package, and that is not always good practice, as it quickly piles up the size of node_modules.
Common examples are lodash, ramda, etc.
Try to avoid pulling in these kinds of libraries for one or two helpers. If you can write the function yourself relatively quickly, then just do it. Don't be lazy.
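As a rough illustration of "just do it", here are a few helpers that people often import a utility library for, written with plain JavaScript. The names mirror the familiar utility functions, but these are simplified sketches and do not cover every edge case the library versions handle.

```javascript
// Remove duplicate values while keeping the first occurrence of each.
const uniq = (items) => [...new Set(items)];

// Split an array into chunks of at most `size` items.
const chunk = (items, size) => {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
};

// Group items by the key returned from `getKey`.
const groupBy = (items, getKey) =>
  items.reduce((groups, item) => {
    const key = getKey(item);
    if (!groups[key]) {
      groups[key] = [];
    }
    groups[key].push(item);
    return groups;
  }, {});

// Copy only the listed keys that exist on the object.
const pick = (object, keys) =>
  Object.fromEntries(
    keys.filter((key) => key in object).map((key) => [key, object[key]])
  );
```

If the project really needs dozens of such helpers, a well-tested library may still be the better trade-off. The point is to check before installing.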

Use cases

The most useful use case is AWS Lambda. AWS Lambda has a deployment package size limit (250 MB unzipped), so having too many packages or duplicated dependencies can easily become the cause of a failed deployment.

Elasticsearch bulk reindex strategy

Chan Ro — Fri, 08 Sep 2023 06:28:47 +0000

What is reindexing?

Elasticsearch reindexing is rebuilding an index using the data that is already stored in an Elasticsearch index, in order to replace the old index.

Now, the question is why do we need to reindex?

Elasticsearch is a NoSQL database that is not schemaless, which means each object property in Elasticsearch needs a data type (by default, Elasticsearch maps string values to text with a keyword sub-field). So there are some restrictions in Elasticsearch when we need to change the data type of an existing property.

Example case

For a simple example, let's say we have a schema:

{
  "properties" : {
    "phoneNumber" : {
      "type" : "keyword"
    }
  }
}

Now, during development, the development team realised that the phoneNumber property needs to be a numeric type to support numeric sorting. Thus, the team changed the schema:

{
  "properties" : {
    "phoneNumber" : {
      "type" : "long"
    }
  }
}

Now, when the team tries to update the schema/mappings, this will throw an error, because the type of an existing field cannot be changed once records are already indexed with it. The only way to resolve this is by running a reindex sequence.

Reindexing small database

Reindexing a small database is pretty simple. You can just run:

POST _reindex
{
  "source": {
    "index": "my-index-000001"
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}

Reindexing large database

There is a problem with reindexing a large database. The reindexing process needs to copy every record to the new index, so the process itself is not fast, and in a large database it will take quite a while to complete. (I am talking about hours or maybe days..)

That does not mean users cannot read data from the old index while the reindex is happening. However, what if users add more records while reindexing is in progress? Then we may or may not lose these new records.

In reality, we need to somehow allow users to access the data while reindexing is happening (dynamic reindexing), because we cannot make users wait for hours or days (unless you have a good excuse for clients and put your application under maintenance).

Alias

The core of the solution is the alias. An alias is a secondary name that we can set on an index. The same alias can be set on multiple indices, and then we can access those indices by using the alias name.

Approach 1.

A simple approach is to set the alias on both the old and the new index, so that we can read data from both indexes and ensure that none of the additional records are missed while the reindex is running in the background.

However, the problem is that reads will show duplicated results, because the reindex is copying data from the old index to the new index and both are behind the same alias.

Approach 2.


In this approach, I separated both the indexes and the aliases into writer and reader, with the writer index carrying 2 aliases (reader and writer). For reindexing, we create another 2 indexes, one for the writer and one for the reader.

Then I ran the reindex from the old reader and writer indexes into the new reader index, and removed the writer alias from the old writer index so that users can no longer add new records to it. The writer alias is added to the new writer index, which is now the index for adding new records.

After the reindex finished, I moved the reader alias to the new reader index and then safely removed the old indexes.

This way, users can read data while reindexing is running, and new records can be added without worrying about the reindex. (This leaves aside the question of available data nodes, that's a different story.)
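The two alias switches in this approach can be sketched as request bodies for the Elasticsearch _aliases endpoint (POST _aliases), where all actions in one request are applied atomically. This is a sketch, not the exact requests I ran: the alias names and function names are assumptions, and so is giving the new writer index the reader alias (mirroring the old writer index, which had both).

```javascript
// Alias names are assumptions based on the description above.
const READER_ALIAS = 'reader';
const WRITER_ALIAS = 'writer';

// Builds the body that moves the writer alias from the old writer index to the new one.
const switchWriterActions = (oldWriterIndex, newWriterIndex) => ({
  actions: [
    { remove: { index: oldWriterIndex, alias: WRITER_ALIAS } },
    { add: { index: newWriterIndex, alias: WRITER_ALIAS } },
    // Assumption: new records should be readable too, so the new writer index
    // also gets the reader alias, like the old writer index had.
    { add: { index: newWriterIndex, alias: READER_ALIAS } },
  ],
});

// Builds the body that moves the reader alias to the new reader index
// once the reindex has finished.
const switchReaderActions = (oldIndexes, newReaderIndex) => ({
  actions: [
    ...oldIndexes.map((index) => ({ remove: { index, alias: READER_ALIAS } })),
    { add: { index: newReaderIndex, alias: READER_ALIAS } },
  ],
});
```

Each returned object is meant to be sent as the JSON body of one POST _aliases request. Deleting the old indexes afterwards is a separate step.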

Queue actions in optimistic updates

Chan Ro — Tue, 05 Sep 2023 08:46:46 +0000

What is Optimistic update?

An optimistic update is a process where the UI behaves as though a change was successfully completed before the backend server has finished saving the data in the database. The UI will eventually get a confirmation or (less likely) an error. Overall, this provides a more responsive UX.

General process of Optimistic update

When updating a value in the UI, the frontend sends the data change request to the backend service and also changes the state on the UI side without waiting for the response data from the backend.

However, sometimes we want to sync the UI state with the response data from the backend, since it comes from the source of truth (the database). Thus, upon completing a request, we can re-update the state in the UI with the response data if needed.

Problem 1

Updating state on the UI before the backend completes is okay. BUT what if the user spams or interacts with the UI very fast? Or maybe changes the value of the same ID quickly and frequently?

This will cause a few potential problems:

Possible spamming of API calls
Possible race conditions when updating data in the database
Unnecessary usage of the database connection pool

The good news is that this can be avoided by debouncing requests on the frontend.

"Debounce" it

Debounce is a process that delays the processing of an event/action until the user has stopped triggering it (e.g. typing) for a given amount of time.

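// Returns a wrapper that only runs `callback` once calls have stopped for `delay` ms.
const debounce = (callback, delay = 1000) => {
  let timeout;
  return (...args) => {
    // Cancel the previously scheduled call, if any, and schedule a new one.
    clearTimeout(timeout);
    timeout = setTimeout(() => {
      callback(...args);
    }, delay);
  };
};

const requestCall = debounce(() => console.log("request"));
requestCall();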

Since this will not execute the API call until the user stops interacting for the given amount of time, all the potential problems from the list should be resolved.

Problem 2

(This problem really depends on how much you need to care about the source-of-truth data from the API response.)

The stress on the backend API and the race conditions on the database side should be resolved. BUT there's one other problem with debounce: a possible data sync problem on the frontend after getting the response from the backend API.

Let's say we have a debouncer with a delay of 1 second, and the API request can take up to 1 second to complete. What happens if the user re-updates the value right before the API request completes? If we need to prioritise the value on the UI side then this is no problem, but what if we need to prioritise the value from the API response?

Then most likely the user will see the value change optimistically (changed by the user), then change again to the value from the API response, and then change again after the most recent API request.

Seeing values keep jumping like this is quite a bad user experience.

Hash/Stack actions

Because of the problems above, I decided not to use debounce but to create my own action hash, which waits for the ongoing request before executing the request stored in the hash.

The approach can be either a hash or a stack. With a hash, we store the latest action that needs to be executed after the current one. With a LIFO (last in, first out) stack, we pop the latest action for the next execution and can send the rest of the requests in the stack to the server as well for logging purposes.

For example, this is my data:

{
  id: 1,
  value: "Hi"
}

Let's say I updated it in the following order:

Update the value to "Hello"
Update the value to "World"
Update the value to "!"

In this case, I do not need to care about sending the value "World" to the backend while the client side is still waiting for the response for "Hello". Once "Hello" is completed, I take "!" as the next execution.

So.. the process will be like below:

(Simplified diagram. The blue lines indicate the life cycle after the current process finishes.)

1. Update the value to "Hello" - The request is sent immediately and is also pushed to a hash that tracks all the requests currently being processed.
2. Update the value to "World" - The request goes into the pending hash, using the id as key.
3. Update the value to "!" - Overrides the request (from 2.) in the pending hash.
4. When 1. completes, we check the pending hash to see if there is a request, then execute that request and clear the hash.

As shown above, there are two hashes (they can be replaced with stacks, whichever you prefer): one for the requests currently being processed and one for pending requests.
We also have a pending process checker that only gets triggered after the current request finishes and checks whether there's a pending request in the hash.

With the approach above, the potential client-side data sync issue in an optimistic UI application can be resolved.
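A minimal sketch of the two hashes and the pending process checker could look like the code below. It is not the exact code from my project: createActionQueue and sendRequest are hypothetical names, sendRequest is assumed to be any function that takes (id, value) and returns a Promise, and error handling is reduced to logging.

```javascript
// Sketch of the "current" and "pending" hashes described above.
// `sendRequest` is a hypothetical (id, value) => Promise function, e.g. a fetch wrapper.
const createActionQueue = (sendRequest) => {
  const inFlight = new Set(); // ids whose request is currently being processed
  const pending = new Map(); // id -> latest value waiting for the current request

  const execute = async (id, value) => {
    inFlight.add(id);
    try {
      await sendRequest(id, value);
    } catch (error) {
      // Rollback or retry logic is out of scope for this sketch.
      console.error(`Request for id ${id} failed`, error);
    } finally {
      inFlight.delete(id);
    }

    // Pending process checker: runs only after the current request has finished.
    if (pending.has(id)) {
      const latestValue = pending.get(id);
      pending.delete(id);
      execute(id, latestValue);
    }
  };

  // The returned function is what the UI calls on every change.
  return (id, value) => {
    if (inFlight.has(id)) {
      // A newer action overrides any older pending action for the same id.
      pending.set(id, value);
      return;
    }
    execute(id, value);
  };
};
```

With the example data, calling the returned function with "Hello", "World" and "!" for id 1 sends "Hello" immediately, drops "World", and sends "!" once the "Hello" request has finished.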

Possible alternative(?)

This is also achievable using an abort controller, however:
An abort controller only aborts on the client side and does not stop execution on the server side. So even if we have a bunch of cancelled requests on the client side, the server might still be handling those requests.

Summary

Debouncing is a really good solution to avoid making unnecessary requests to the server and to make data handling more efficient, but there are a few concerns that need to be thought through when it comes to optimistic UI. Sometimes, creating your own process hash/stack/queue handler on the client side might work better than debouncing.

Frontend webpack optimization

Chan Ro — Mon, 03 Jul 2023 07:39:45 +0000

Most developers use webpack to build their applications for the production environment.

We all know webpack also minimizes our code during the build.

Importance of built file sizes

The size of the built files is very important, as users need to download these files to run the application on the web, and obviously a large file size costs performance.

How do we minimize these files?

There are a few things that can minimise the files further:

Remove all unused packages
Try to build your own functions instead of relying on external packages, if they are simple enough to build. (E.g. underscore and jquery are good examples. These are quite huge packages, but most of their core functions are covered by recent ECMAScript versions)
Tree shaking
Analyse the bundle to remove any unnecessary files
Remove duplicated packages

Webpack analyzer

Analyzing the webpack output is a good way to start investigating optimizations. Webpack analyzer is a webpack plugin that displays all packages in the application as an interactive diagram, where developers can easily identify unnecessary packages and files that can be removed to make the build files smaller.

The analyzer is very simple to set up.

Install the package
Include the plugin in the webpack config file

const BundleAnalyzerPlugin = require('webpack-bundle-analyzer').BundleAnalyzerPlugin;

module.exports = {
  plugins: [
    new BundleAnalyzerPlugin()
  ]
}

You can find more from the link below.
webpack-bundle-analyzer

Tree shaking

Tree shaking is a process of removing unused exports from a module.

This can be done by using the ECMAScript import/export keywords instead of require. This way webpack can work out during the build which exports are actually used and leave the rest out of the bundle.
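On the webpack side, a config along these lines is a reasonable starting point for tree shaking. Treat it as a sketch: the entry path is an assumption, and in production mode webpack already enables these optimization flags by default, so they are spelled out here only to make them visible.

```javascript
// webpack.config.js (sketch; the entry path is an assumption)
const config = {
  // Production mode enables minification, which is what actually drops the unused code.
  mode: 'production',
  entry: './src/index.js',
  optimization: {
    // Mark exports that are never imported so the minifier can remove them.
    usedExports: true,
    // Respect the "sideEffects" field of package.json to skip whole unused modules.
    sideEffects: true,
  },
};

module.exports = config;
```

For the sideEffects option to help, the package itself has to declare "sideEffects": false (or a list of files with side effects) in its package.json.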

Also, if you are a React user, code splitting can help with optimization as well. For example, we can lazy load at component level so that the component is loaded asynchronously.
E.g.

const ComponentA = React.lazy(() => import('./ComponentA'));

This allows the component's code to be loaded only when it's needed.

Removing duplicated packages

Sometimes when we install packages, some of them use the same dependency but with different versions.

(Resources are from Atlassian)
For example:

As we can see, there are 3 different button package versions (1.0.0, 1.3.0, 2.5.0), but ideally we can improve this to look like:

This way, it can technically remove 2 dependencies in total, which can reduce the size quite a bit.

How is it done? It is quite simple. Atlassian provides an npm package and a GitHub repository for this webpack plugin, and developers can simply add the plugin to the webpack config.

webpack-deduplication-plugin

const { WebpackDeduplicationPlugin } = require('webpack-deduplication-plugin');

module.exports = {
  plugins: [
    new WebpackDeduplicationPlugin({
        cacheDir: cacheDirPath,
        rootPath: rootPath,
    }),
  ]
}

However, be careful before using the plugin, because some packages require a specific version of a dependency, and if that's the case this may cause more problems.

Building your own functions

Packages like jquery and underscore provide very useful functions which developers can immediately pick up and use to build applications faster.
However, most of the core functions that these packages provide are covered by recent ECMAScript versions, so technically developers can live without them.

Speed in development is quite important, because we need to hit deadlines and sometimes these deadlines are ridiculous. BUT if you are building a platform/product that needs to be maintained afterwards, then you should think twice about using these kinds of packages, since:

They are generally huge
Most of their functions are covered by recent ECMAScript versions (e.g. map, forEach, reduce, etc.)
Replacing/removing these packages later is a pain.

Overall, I was able to save ~42% of the build size after applying these methods in a recent project (42% is HUGE).

Elasticsearch parent-child relation(join) field type

Chan Ro — Sat, 01 Jul 2023 11:04:02 +0000

(Quick Personal work note)

Elasticsearch provides a variety of field types in the schema to support users' use cases and to index data for the search engine.

In this post, I will write briefly about one field type called join, which is probably used less often than other field types, along with its use cases and a few things that are good to know.

What is join field type?

The join field type is basically a field that forms a parent/child relationship between records in the same index.

This can be done by setting the type property to join in the schema/mapping, together with the names of the keys that define the parent and the child.

{
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword"
      },
      "document_join_field": { 
        "type": "join",
        "relations": {
          "documentSet": "document" 
        }
      }
    }
  }
}

The schema above lets every record in the index have a property called "document_join_field", which takes a property called name whose value is either documentSet (parent) or document (child).

Updating Document Set (Parent)

PUT /example-index/_doc/1?routing=1
{
  "name": "Document folderA",
  "document_join_field": {
    "name": "documentSet"
  }
}

Updating Document (Child)

PUT /example-index/_doc/2?routing=1
{
  "name": "Document Name A",
  "document_join_field": {
    "name": "document",
    "parent": "1"
  }
}

As we can see, when updating a record there's a parameter called routing. Elasticsearch needs parent and child records to be indexed in the same shard, so the routing parameter (here, the parent's ID) is used to make that happen.

Use case and consideration

The join field type can be considered over the nested field type based on the points below:

Will each record have quite a large number of child records?
Does the number of child records need to grow in real time? (E.g. a consumer adds a new child record at app level)

If your use case needs to cover the points above, then I suggest using join over the nested field type.

Usage and challenge in Join field type

When we search the index, it will return both child and parent records, since a child is also a record in the index. Thus, to search or filter on a join field, we can use either the has_child or the has_parent query, which helps us target the records that we want.

has_child query

The has_child query returns parent records whose child records match the search/filter query that the user defines. With inner_hits, the matching child records are returned along with each parent.

GET /my-index/_search
{
  "query": {
    "has_child": {
      "type": "document",
      "query": {
        ...YOUR_SEARCH_QUERY_HERE
      },
      "inner_hits": {}    
    }
  }
}

has_parent query

The has_parent query returns child records whose parent record matches the search/filter query that the user defines.

GET /my-index/_search
{
  "query": {
    "has_parent": {
      "parent_type": "documentSet",
      "query": {
        ...YOUR_SEARCH_QUERY_HERE
      }
    }
  }
}

In addition, we can nest a has_child query within a has_parent query to find child records whose parent has other child records matching the search query.

GET /my-index/_search
{
  "query": {
    "has_parent": {
      "parent_type": "documentSet",
      "query": {
        "has_child": {
            "type": "document",
            "query": {
               ...YOUR_SEARCH_QUERY_HERE
            },
            "inner_hits": {}  
        }
      }
    }
  }
}

The has_parent and has_child queries can cover most search and filter queries. BUT one thing to be careful about is not making the search queries massive. (Sometimes a query for a join field can get ugly and hard to read, so be careful with that)

Sorting (challenge)

Sorting is quite a challenge for the join field type. For the nested field type, sorting can be done with a nested option (nested_path in older versions) within the sort query, but there's no equivalent option for the join field.

One useful approach to make sorting work (kind of) is to use script_score in the search query and then sort by the script score.

This may or may not work for your case. The challenge still exists nowadays and people are still trying to solve it (unless Elasticsearch provides an update on this).

So is this better than nested field type?

Many people will say "Can't we just use the nested field type then?" and my answer will generally be yes. We can just use the nested field type, and the nested field type should be faster than the join field type.

Nested field is generally faster on read/search queries
Nested field may be slower on write/update queries (this depends on the data size, but generally a nested field reindexes every child when one of them updates), while a join field just needs to reindex the child/parent record the user is trying to update
Nested field has a limit on the number of nested objects that each record can have (a configurable index setting), while join field does not need to worry about this
Join field search queries can become massive (they may become hard to read or maintain unless you are an Elasticsearch expert)
Join field has the sort query challenge that needs to be solved
Join field data is easier to maintain than nested field data

Again, both nested and join fields have pros and cons, and their use cases are quite different. But generally, if you are trying to work with massive nested data, then the join field is preferred over the nested field.

Test

Chan Ro — Wed, 28 Jun 2023 07:48:07 +0000

test