<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ankit Utekar</title>
    <description>The latest articles on Forem by Ankit Utekar (@ankitutekar).</description>
    <link>https://forem.com/ankitutekar</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F38246%2F607e7ecb-5248-4404-bba9-dd03bbea64f5.jpg</url>
      <title>Forem: Ankit Utekar</title>
      <link>https://forem.com/ankitutekar</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ankitutekar"/>
    <language>en</language>
    <item>
      <title>Economies of scale in the cloud</title>
      <dc:creator>Ankit Utekar</dc:creator>
      <pubDate>Sat, 10 Sep 2022 13:23:27 +0000</pubDate>
      <link>https://forem.com/ankitutekar/economies-of-scale-in-the-cloud-1lb2</link>
      <guid>https://forem.com/ankitutekar/economies-of-scale-in-the-cloud-1lb2</guid>
      <description>&lt;p&gt;Owing to the increasing adoption of cloud computing, familiarity with cloud computing services is becoming one of the most crucial skills for software development professionals. The cost efficiency of the cloud plays a key role in this continuously increasing adoption. In the quest to understand why public cloud providers are able to offer such cost-efficient solutions,  I learned about economies of scale experienced by cloud providers. In this article, we will be discussing different factors contributing to this effect.&lt;/p&gt;

&lt;h2&gt;Economies of scale basics&lt;/h2&gt;

&lt;p&gt;Let's go through the basics of economies of scale before applying them in the context of the cloud. Economies of scale come into the picture when a company starts benefiting from the sheer size of its operation, as depicted in the image below. Even though output (i.e. production) keeps increasing, the production cost per unit decreases from Cx to Cy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LvIinFmg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jxwkgnpfb953dmxz2swy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LvIinFmg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jxwkgnpfb953dmxz2swy.png" alt="Economies of scale" width="880" height="579"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s try to understand this effect with an example - say you are running a restaurant that serves donuts. When you started, you were serving around 5 customers per day. Almost 6 months later, your business has grown and you are now serving around 250 customers per day. You will need to increase the production of donuts, but the production cost per donut will go down because of the following factors -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can now buy ingredients in large quantities at a discounted price, which in turn lowers the ingredient cost per donut&lt;/li&gt;
&lt;li&gt;Let’s say you had employed people anticipating a demand of 225 customers per day. Your employees are now more productive instead of sitting idle as in the earlier days. This increased productivity generates more output without causing costs (salaries!) to go up in the same proportion&lt;/li&gt;
&lt;li&gt;Other fixed costs such as advertising, rent, and electricity stay almost the same&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can further standardize processes (e.g. recipes) and have employees do specialized tasks. This will further increase the speed and overall efficiency of operations, contributing to more cost-efficient production of donuts.&lt;/p&gt;

&lt;h2&gt;Economies of scale in the cloud&lt;/h2&gt;

&lt;p&gt;A company with large data centers that provides computing resources as a service is more likely to benefit from economies of scale than companies running their own in-house servers. This is because of factors such as resource pooling, demand aggregation, demand variation, and certain fixed costs involved in setting up and maintaining the infrastructure. Let’s see how these supply-side savings and consumer demand aggregation contribute to economies of scale in the cloud -&lt;/p&gt;

&lt;h3&gt;Supply-side savings&lt;/h3&gt;

&lt;p&gt;Similar to how the increased size of operation reduces costs in our donut restaurant above, a cloud provider benefits in the following ways -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Buying in bulk&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud providers buy thousands of units of hardware at a time and can sign long-term contracts with suppliers. These factors get them much larger discounts than retail buyers. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost saving in power expenses&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Power supply costs are an important component of expenses in an IT company. Power supply and related labor costs vary geographically. If you have your own private data center, it is most likely co-located with your office in a big city. You are very unlikely to move to a cheaper location because the costs of setting up new infrastructure there will outweigh the benefits. But given the size of their operations, cloud providers can afford to make this move and take advantage of this geographical variation in costs. Also, large data centers are likely to have better power utilization than private data centers because of demand aggregation. Cloud providers can also get better deals on the infrastructure required for power supply and cooling systems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost savings in labor expenses&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Large data centers have many systems of similar configuration. A cloud provider can invest in bringing in more expertise, incurring some fixed costs, but the solutions developed can then be reused across a large number of systems. A cloud provider also has a lower maintenance cost per machine than an in-house data center - a single operator can handle thousands of machines, and automation helps them do their jobs faster. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Standardization and homogenization of hardware&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Standardizing and deciding on a fixed set of hardware models can also contribute to additional cost savings -&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buying equipment in large quantities gives larger discounts&lt;/li&gt;
&lt;li&gt;Maintenance tasks and updates are easier to manage and can be automated&lt;/li&gt;
&lt;li&gt;Costs of investment in security software, automation, and other tools are spread over thousands of systems, resulting in a lower cost per system&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Consumer demand aggregation&lt;/h3&gt;

&lt;p&gt;Pooling resources, making them available to thousands of customers, and serving their aggregated but varying demands results in better overall hardware utilization. Usage patterns are rarely constant throughout the day - one hour there are millions of requests coming into your services, and the next hour that number drops to a few hundred! Different factors decide the usage pattern of a particular piece of software and, in turn, affect the utilization of the underlying hardware:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Time of day&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some apps get very high traffic in the morning (e.g. news aggregators) and some in the evening (e.g. food delivery). Consumer-targeted services such as social media see high traffic in the mornings and evenings, whereas services we use at work (e.g. code hosting, email, and other communication services) experience traffic during the day but are idle during non-office hours.&lt;br&gt;
Suppose you own a software company that develops SaaS communication tools (chat, email, video conferencing) for other companies. Your hardware utilization might be good throughout the day, but the hardware might sit idle outside office hours. You need sufficient demand and scale for it to be used efficiently around the clock. If your products are used worldwide, that can generate constant demand to keep your hardware utilization at its maximum - but only large-scale companies and a few products have that kind of demand to get the most out of their hardware 24/7.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Industry&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Logistics services may see very high traffic during holiday seasons but stay quiet the rest of the year. Financial services may experience high resource requirements around the end of the financial year but minimal usage otherwise. Engineers working on these solutions need to consider these peak-usage periods while provisioning resources. Festival season sales need to be considered for your e-commerce shop, and traffic during tax filing season will be important for a tax filing app.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Resource profile&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A machine comes with a certain amount of compute, storage, and networking capability, but the software running on it won’t use all of these in equal proportion. A server hosting a Redis instance needs a large amount of memory, whereas an image processing task needs a GPU on the machine it runs on. Software doing scientific calculations will need a high amount of compute but won't need networking resources of the same scale. Sure, you can buy specialized hardware, but can you guarantee continuous demand for these resources? Such configurations are expensive compared to commodity hardware.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Uncertainty in growth predictions&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you are launching something new, you have to plan the capacity for running it and procure resources accordingly. Most of the time, these activities need to happen well in advance because approvals and other communications take time. Capacity planning is difficult because demand is difficult to anticipate. You can't be sure whether your TikTok rival app will be a hit or will just make TikTok more popular. Will people even use the new stories feature you copied from other apps, or will they consider it unnecessary, forcing you to eventually get rid of it?&lt;/p&gt;

&lt;p&gt;All of these factors open up opportunities for achieving better utilization through demand aggregation and resource pooling. Large data centers are more likely to benefit from these than in-house data centers. Why? Because it is difficult for most IT companies to generate this kind of demand - continuous demand with complementary access patterns. In-house solutions have to provision for peak usage while planning capacity. To benefit from the factors listed above, cloud providers procure thousands of resources in a data center, implement intelligent resource-sharing algorithms for their efficient utilization, and provide computing services on an on-demand basis to users all over the world. &lt;/p&gt;

&lt;p&gt;So this is how supply-side savings and consumer demand aggregation play a critical role in cloud economics. Are public cloud providers the only ones benefiting from economies of scale? Definitely not! If an organization consolidates private data centers spread across multiple offices into a few larger ones, we can say that it is bringing in cost efficiency through economies of scale. Has your organization benefited from this effect in any way? Do let us know in the comments!&lt;/p&gt;

&lt;p&gt;[ Cover image by &lt;a href="https://unsplash.com/@killerfvith?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Alex wong&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/data-center?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt; ]&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>azure</category>
      <category>aws</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Implementing auto-complete functionality in Elasticsearch - Part III: Completion suggester</title>
      <dc:creator>Ankit Utekar</dc:creator>
      <pubDate>Sat, 17 Apr 2021 15:45:30 +0000</pubDate>
      <link>https://forem.com/ankitutekar/implementing-auto-complete-functionality-in-elasticsearch-part-iii-completion-suggester-3adj</link>
      <guid>https://forem.com/ankitutekar/implementing-auto-complete-functionality-in-elasticsearch-part-iii-completion-suggester-3adj</guid>
      <description>&lt;p&gt;This is part III of &lt;a href="https://dev.to/ankitutekar/series/12283"&gt;my series&lt;/a&gt; on designing auto-complete feature in Elasticsearch. In this part, we will talk about &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/search-suggesters.html#completion-suggester"&gt;completion suggester&lt;/a&gt;  -  a type of &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/search-suggesters.html"&gt;suggester&lt;/a&gt; which is optimized for auto-complete functionality and considered to be faster than the approaches we have discussed so far.&lt;/p&gt;

&lt;p&gt;Completion suggesters use a data structure known as &lt;a href="https://stackoverflow.com/questions/4872115/what-is-a-finite-state-transducer"&gt;Finite State Transducer&lt;/a&gt; which is similar to the Trie data structure and is optimized for faster look-ups. These data structures are stored in-memory on nodes to enable faster searches. Like edge-n-gram and search_as_you_type, this also does most of the work at index time by updating in-memory FSTs with the input that we provide.&lt;/p&gt;

&lt;p&gt;A special ES field type - &lt;em&gt;completion&lt;/em&gt; - is used for implementing it -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mappings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"completion"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mapping also supports &lt;em&gt;analyzer, search_analyzer,&lt;/em&gt; and &lt;em&gt;max_input_length&lt;/em&gt; parameters for the completion field. The analyzer value defaults to the &lt;em&gt;simple&lt;/em&gt; analyzer, which lower-cases the input and tokenizes on any non-letter character such as a number, space, or hyphen. Analyzers on completion fields behave differently than analyzers on other text fields: after analysis, tokens are not available separately - they are put together and inserted into the FST based on their order in the input text. Also, we can't test our mappings using the &lt;em&gt;_analyze&lt;/em&gt; endpoint in this approach. If we try to do so, ES throws an error saying '&lt;em&gt;Can't process field [title], Analysis requests are only supported on tokenized fields&lt;/em&gt;'.&lt;/p&gt;
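
&lt;p&gt;For instance, a mapping that sets these parameters explicitly (a hypothetical example; the parameter names and the default &lt;em&gt;max_input_length&lt;/em&gt; of 50 are as per the linked documentation) might look like this -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;PUT /movies
{
    "mappings": {
        "properties": {
            "title": {
                "type": "completion",
                "analyzer": "standard",
                "search_analyzer": "standard",
                "max_input_length": 50
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;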

&lt;p&gt;While indexing a document, we specify &lt;em&gt;input&lt;/em&gt; and an optional &lt;em&gt;weight&lt;/em&gt; parameter -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_doc/&lt;/span&gt;&lt;span class="mi"&gt;1001&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter and the Goblet of Fire"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Goblet of Fire"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_doc/&lt;/span&gt;&lt;span class="mi"&gt;1002&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter and the Goblet of Fire"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                  &lt;/span&gt;&lt;span class="s2"&gt;"Goblet of Fire"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"weight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can specify multiple matches for a single document using the &lt;em&gt;input&lt;/em&gt; parameter. The &lt;em&gt;weight&lt;/em&gt; parameter controls the ranking of documents in search results. It can be specified per input, as in the first document (1001) above, or kept the same for all inputs, as in the second document (1002).&lt;/p&gt;

&lt;p&gt;Suggester fields are queried using the &lt;em&gt;suggest&lt;/em&gt; clause inside the request body of the &lt;em&gt;_search&lt;/em&gt; endpoint. Before ES version 5.0, there was a separate endpoint for suggesters - &lt;em&gt;_suggest&lt;/em&gt; - and many examples on the internet still use it. Since version 5.0, the &lt;em&gt;_search&lt;/em&gt; endpoint itself supports suggesters.&lt;/p&gt;

&lt;p&gt;By default, Elasticsearch returns the entire matching document. If we are only interested in the suggestion text, we can use the &lt;em&gt;_source&lt;/em&gt; option to restrict the fields that are returned. This way, we minimize disk fetches and transport overhead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"suggest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"suggest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"harry_suggest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"goblet of f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"completion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above query returns both documents - 1001 and 1002 - as both contain "Goblet of Fire" as one of the suggestions for the title. The first document is ranked higher because its matching input has the higher weight, i.e. 10. This can be observed in the response to the above query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"took"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timed_out"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"suggest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"harry_suggest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"goblet of f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"offset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"length"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Goblet of Fire"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"_index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"movies"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"_doc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Goblet of Fire"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"_index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"movies"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"_doc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1002"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Goblet of Fire" is returned twice in suggestions as we had provided this text as input in both the documents. This can be avoided by using &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/search-suggesters.html#skip_duplicates"&gt;&lt;em&gt;skip_duplicates&lt;/em&gt;&lt;/a&gt; option.&lt;/p&gt;

&lt;p&gt;With the completion suggester, ES matches documents one character at a time, starting from the first character and moving ahead one position as each new character is typed. As discussed above, it preserves the order of the input in the FST, so it can't match in the middle of the input the way n-gram based approaches can. I.e., if you have a movie named "Harry Potter and the Goblet of Fire" and you type in "goblet of fire", it won't return the document as a match. You can, however, use the input option to provide multiple matches: manually tokenize your input string and pass the tokens to Elasticsearch in the input option, as we have done in the examples above by providing "Goblet of Fire" as an additional input.&lt;/p&gt;

&lt;p&gt;The completion suggester supports &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/search-suggesters.html#fuzzy"&gt;fuzzy queries&lt;/a&gt; that allow us to account for typos while searching documents. You can also specify the prefix text as a &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/search-suggesters.html#regex"&gt;regex query&lt;/a&gt;. Both queries in the example below return "Goblet of Fire" as a suggestion -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;/******************&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;Fuzzy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;query&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="err"&gt;******************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"suggest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"suggest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"harry_suggest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gobet of f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"completion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"fuzzy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"fuzziness"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;/******************&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="err"&gt;regex&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;query&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="err"&gt;******************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"suggest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"suggest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"harry_suggest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"regex"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"g[aieou]b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"completion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Adding Context to searches
&lt;/h3&gt;

&lt;p&gt;Unlike other queries, completion suggesters don't support adding filters to your queries, i.e. you can't filter out suggestions based on the values of other fields in the document. Suppose we have an index that stores movies and we are developing auto-complete based on the title field. Say we have mapped title as a completion type, and there are other fields like genres, ratings, production companies, etc. There is a document with the title "Goblet of Fire" whose genre is "action". Now, if we try to filter auto-complete suggestions by genre = "romance", we expect that it shouldn't return "Goblet of Fire":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"bool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"term"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                            &lt;/span&gt;&lt;span class="nl"&gt;"genre"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"romance"&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"suggest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"harry_suggest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"goblet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"completion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This doesn't work as we expect - it returns "Goblet of Fire" as a suggestion even though it belongs to the "action" genre. The main reason behind this limitation is the design: as already discussed, suggestions are stored in a separate data structure, the in-memory FST, whereas other fields are stored on disk. This design enables fast searches through the in-memory FST, and queries like the one above go against it.&lt;/p&gt;

&lt;p&gt;However, Elasticsearch does provide the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/search-suggesters.html#context-suggester"&gt;context suggester&lt;/a&gt; to work around this limitation to some extent. To use the context suggester, we have to provide contexts while creating the mapping for the index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mappings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"completion"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"contexts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                                &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"genre"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                                &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"category"&lt;/span&gt;&lt;span class="w"&gt;
                            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
                  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a particular completion field, we can define multiple contexts with unique names. Two types of contexts are supported:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Category&lt;/strong&gt; =&amp;gt; Category of the thing you are indexing, e.g. genre of a movie/song&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geo&lt;/strong&gt; =&amp;gt; Geo-points for the documents you are indexing, allows filtering suggestions based on lat-long.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each context type supports some advanced parameters, such as precision and neighbours for the geo context, and a boost at query time so that documents having a particular category are scored higher. Note that for context-enabled completion fields, the contexts parameter is mandatory both while indexing a document and while querying it.&lt;/p&gt;
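
&lt;p&gt;As an illustration of the boost parameter, a category context can be passed as an object with a boost so that suggestions from that genre are scored higher - a sketch based on the context suggester documentation (the boost value here is arbitrary):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;GET /movies/_search
{
    "suggest": {
        "harry_suggest": {
            "prefix": "harry",
            "completion": {
                "field": "title",
                "contexts": {
                    "genre": [
                        { "context": "crime", "boost": 2 },
                        { "context": "mystery" }
                    ]
                }
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;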

&lt;p&gt;Let's index some documents in the index created above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_doc/&lt;/span&gt;&lt;span class="mi"&gt;2001&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter and the Chamber of Secrets"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"contexts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"genre"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mystery"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_doc/&lt;/span&gt;&lt;span class="mi"&gt;2002&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter and the Prisoner of Azkaban"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"contexts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"genre"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"crime"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above, we have indexed "Harry Potter and the Prisoner of Azkaban" as a movie in "crime" genre and "Harry Potter and the Chamber of Secrets" in "mystery" genre. Let's try to get suggestions for prefix "harry":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;/******************&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;Request&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;******************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"suggest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"suggest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"harry_suggest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"harry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"completion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"contexts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                            &lt;/span&gt;&lt;span class="nl"&gt;"genre"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"crime"&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;/******************&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;Response&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;******************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"took"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"timed_out"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"hits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"suggest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"potter_suggest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"harry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"offset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"length"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                                &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter
                                 and the Prisoner of Azkaban"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                                &lt;/span&gt;&lt;span class="nl"&gt;"_index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"movies"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                                &lt;/span&gt;&lt;span class="nl"&gt;"_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"_doc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                                &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2002"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                                &lt;/span&gt;&lt;span class="nl"&gt;"_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                                &lt;/span&gt;&lt;span class="nl"&gt;"_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;&lt;span class="w"&gt;
                                &lt;/span&gt;&lt;span class="nl"&gt;"contexts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                                    &lt;/span&gt;&lt;span class="nl"&gt;"genre"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                                        &lt;/span&gt;&lt;span class="s2"&gt;"crime"&lt;/span&gt;&lt;span class="w"&gt;
                                    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
                                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As can be seen in the response above, only "Harry Potter and the Prisoner of Azkaban", from the "crime" genre, is returned, even though the supplied prefix matches both of the documents indexed above.&lt;/p&gt;

&lt;p&gt;That was all about our third approach to implementing auto-complete in ES. So how does the completion suggester compare with the other approaches seen so far? It is definitely the fastest, as the data to be searched is held in memory, but there are some things we need to keep in mind if we decide to implement auto-complete with it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have to be mindful of the size of the index, as suggestions are stored in memory.&lt;/li&gt;
&lt;li&gt;Infix matching, e.g. matching by middle-name, is not supported.&lt;/li&gt;
&lt;li&gt;Advanced filtering for suggestions by other fields in the document is not supported.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To conclude this series, the following factors should be considered while choosing an approach for implementing auto-complete functionality in Elasticsearch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the data already indexed? In what format? Can we re-index it to make it more suitable for auto-complete? If the data is already indexed as a &lt;em&gt;text&lt;/em&gt; field and we can't re-index it, we will need to go with the query-time approach, i.e. prefix queries!&lt;/li&gt;
&lt;li&gt;In what ways can this field be queried? Does it make sense to store it in more than one way?&lt;/li&gt;
&lt;li&gt;Does it need to support infix matches? Is the order of words in the text fixed? Is that order well known to users? Completion suggesters don't support infix matches, so they are suitable only for fields whose word order users know well.&lt;/li&gt;
&lt;li&gt;What is the maximum size of the text that will be supplied as a value for our field? Can it create problems if it's kept in memory? Completion suggesters keep data in memory, and n-gram based approaches create additional tokens on top of basic tokenization for faster matches, which inflates the index.&lt;/li&gt;
&lt;li&gt;Do we need a separate index for this field? If none of the three approaches mentioned here satisfies your requirements, you can create another index that stores only the fields needed for auto-complete as separate documents, instead of keeping them alongside other data in the main index. This minimizes the chance of bloating nodes and can provide faster suggestions. But it is a separate index after all: you will have to keep the data in sync between your main index and the new one, and there is the overhead of managing another index too.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope you enjoyed this series and it added something to your knowledge. Do let me know your feedback in the comments.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;May Elasticsearch be the Firebolt for your search-engine game!!!&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>elasticsearch</category>
      <category>tutorial</category>
      <category>search</category>
    </item>
    <item>
      <title>Implementing auto-complete functionality in Elasticsearch - Part II: n-grams</title>
      <dc:creator>Ankit Utekar</dc:creator>
      <pubDate>Sat, 17 Apr 2021 15:13:23 +0000</pubDate>
      <link>https://forem.com/ankitutekar/implementing-auto-complete-functionality-in-elasticsearch-part-ii-n-grams-2afg</link>
      <guid>https://forem.com/ankitutekar/implementing-auto-complete-functionality-in-elasticsearch-part-ii-n-grams-2afg</guid>
      <description>&lt;p&gt;This is part II of &lt;a href="https://dev.to/ankitutekar/series/12283"&gt;my series&lt;/a&gt; on implementing the auto-completion feature using Elasticsearch. In the &lt;a href="https://dev.to/ankitutekar/implementing-auto-complete-functionality-in-elasticsearch-part-i-prefix-queries-ilp"&gt;first part&lt;/a&gt;, we talked about using prefix queries, a query-time approach to auto-completion. In this post, we will talk about &lt;em&gt;n-grams&lt;/em&gt; - an index-time approach that generates additional tokens after basic tokenization so that we can have faster prefix matches at query time. But before that, let's see what an n-gram is. As per Wikipedia -&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;an n-gram is a contiguous sequence of n items from a given sequence of text or speech&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yes, it is as simple as that - just a sequence of text. Here 'n' items means 'n' characters in the case of &lt;em&gt;character level n-grams&lt;/em&gt; and 'n' words in the case of &lt;em&gt;word level n-grams&lt;/em&gt;. &lt;em&gt;Word level n-grams&lt;/em&gt; are also known as &lt;em&gt;shingles&lt;/em&gt;. Further, based on the value of 'n', these are categorized as uni-grams (n=1), bi-grams (n=2), tri-grams (n=3) and so on.&lt;/p&gt;

&lt;p&gt;The example below will make this clearer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Character&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;n-grams&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"harry"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"h"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ha"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rr"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ry"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"har"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arr"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rry"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Word&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;n-grams&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"harry potter and the goblet of fire"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"harry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"potter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"and"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"the"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"goblet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"of"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fire"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"harry potter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"potter and"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"and the"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"the goblet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
           &lt;/span&gt;&lt;span class="s2"&gt;"goblet of"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"of fire"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;n&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"harry potter and"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"potter and the"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"and the goblet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"the goblet of"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"goblet of fire"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this post, we will discuss two n-gram-based approaches - first using the &lt;em&gt;edge-n-gram tokenizer&lt;/em&gt; and then using the built-in &lt;em&gt;search-as-you-type&lt;/em&gt; type, which also uses n-gram tokenization internally. These additional tokens are written to the inverted index while indexing the document, which minimizes search-time latency. At query time, ES simply has to compare the input with these tokens, unlike the prefix query approach, where it needed to check whether each individual token starts with the given input.&lt;/p&gt;
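&lt;p&gt;To make that difference concrete, here is a minimal Python sketch (an illustration of the idea, not ES internals) contrasting a query-time prefix scan with a lookup against precomputed edge n-grams:&lt;/p&gt;

```python
# Inverted-index tokens for the title "harry potter" under a standard analyzer.
standard_tokens = ["harry", "potter"]

# Prefix query: every token must be scanned with startswith() at query time.
print(any(t.startswith("pot") for t in standard_tokens))  # True

# Edge-n-gram indexing: prefixes are precomputed once at index time,
# so query time becomes an exact set lookup instead of a per-token scan.
ngram_tokens = {t[:n] for t in standard_tokens for n in range(1, len(t) + 1)}
print("pot" in ngram_tokens)  # True
```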

&lt;h3&gt;
  
  
  Edge-n-gram tokenizer
&lt;/h3&gt;

&lt;p&gt;As we have already seen, text fields are analyzed and stored in the inverted index. Tokenization is the second step in this three-step &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/analyzer-anatomy.html"&gt;analysis process&lt;/a&gt;, run after character filtering but before token filters are applied. The &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/analysis-edgengram-tokenizer.html"&gt;edge-n-gram tokenizer&lt;/a&gt; is one of the built-in tokenizers available in ES. It first breaks the given text down into tokens, then generates character-level n-grams for each of those tokens.&lt;/p&gt;

&lt;p&gt;Let's create an index for movies, this time using the edge-n-gram tokenizer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"settings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"analysis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"analyzer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"custom_edge_ngram_analyzer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"custom"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"tokenizer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"customized_edge_tokenizer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                            &lt;/span&gt;&lt;span class="s2"&gt;"lowercase"&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"tokenizer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"customized_edge_tokenizer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"edge_ngram"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"min_gram"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"max_gram"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"token_chars"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                            &lt;/span&gt;&lt;span class="s2"&gt;"letter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                            &lt;/span&gt;&lt;span class="s2"&gt;"digit"&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"mappings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"analyzer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"custom_edge_ngram_analyzer"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the prefix query examples, we weren't passing the &lt;em&gt;analyzer&lt;/em&gt; parameter to any of the fields in the mapping; we were relying on the default &lt;em&gt;standard&lt;/em&gt; analyzer. Above, we have first created a custom analyzer &lt;em&gt;custom_edge_ngram_analyzer&lt;/em&gt; by passing it a customized tokenizer &lt;em&gt;customized_edge_tokenizer&lt;/em&gt; of type &lt;em&gt;edge_ngram&lt;/em&gt;. The edge_ngram tokenizer can be customized using the parameters below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;min_gram&lt;/strong&gt; ⇒ Minimum number of characters to put in a gram; defaults to 1, similar to the uni-gram example seen above&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;max_gram&lt;/strong&gt; ⇒ Maximum number of characters to put in a gram; defaults to 2, similar to the bi-gram example seen above&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;token_chars&lt;/strong&gt; ⇒ Character classes that are to be kept in a token. If ES encounters any character that doesn't belong to the provided list, it uses that character as a break point for a new token. Supported character classes include letter, digit, punctuation, symbol and whitespace. In the above mapping, we have kept letters and digits as part of the token, so if we pass the input string "harry potter: deathly hallows", ES will generate ["harry", "potter", "deathly", "hallows"] by breaking on whitespace and punctuation.&lt;/li&gt;
&lt;/ul&gt;
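&lt;p&gt;The combined effect of these parameters can be sketched in plain Python (a simplified illustration of the tokenizer's documented behaviour, not ES internals):&lt;/p&gt;

```python
import re

def edge_ngram_tokenize(text, min_gram=2, max_gram=10, token_chars="a-zA-Z0-9"):
    """Split on any character outside token_chars, then emit edge n-grams
    of length min_gram..max_gram for each resulting token."""
    tokens = re.findall(f"[{token_chars}]+", text.lower())
    grams = []
    for tok in tokens:
        grams.extend(tok[:n] for n in range(min_gram, min(max_gram, len(tok)) + 1))
    return grams

print(edge_ngram_tokenize("Harry Potter: Order"))
# ['ha', 'har', 'harr', 'harry', 'po', 'pot', 'pott', 'potte', 'potter',
#  'or', 'ord', 'orde', 'order']
```

&lt;p&gt;Note how the colon acts as a break point because it is in neither the letter nor the digit class.&lt;/p&gt;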

&lt;p&gt;Let's use the &lt;em&gt;_analyze&lt;/em&gt; API to test how our custom edge-n-gram analyzer behaves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;/**********Request**********/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_analyze&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter and the Order of the Phoenix"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;/**********Response*********/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;ha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;har&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;harr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;harry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;po&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;pot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;pott&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;potte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;potter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;an&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="err"&gt;and&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;th&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;or&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ord&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;orde&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;order&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;of&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;th&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;ph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;pho&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="err"&gt;phoe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;phoen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;phoeni&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;phoenix&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To keep it concise, I haven't included the actual response, which contains an array of objects, one per gram, with metadata about that gram. As can be observed, though, our custom analyzer is working as designed, emitting grams for the passed string, lower-cased and with lengths within the min/max settings. Let's index some movies to test the auto-complete functionality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_doc&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter and the Half-Blood Prince"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_doc&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter and the Deathly Hallows – Part 1"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Edge-n-grammed fields support matches on non-leading words as well, i.e. you can match the document with title 'harry potter and the deathly hallows' by passing 'har' and 'dead' too. This makes it a suitable approach for auto-complete implementations where there is no fixed ordering of words in the input text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;/**Matches&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;second&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;document**/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deathly "&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;/**Matches&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;both&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;documents**/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"harry pot"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;/**Also&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;matches&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;both&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;documents**/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"potter har"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, search queries on analyzed fields (title in the above example) &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/specify-analyzer.html#specify-search-analyzer"&gt;run the analyzer on the search term as well.&lt;/a&gt; If you specify the search term "deathly potter" hoping that it will match only the second document, you will be surprised, because it matches both documents. That's because the search term "deathly potter" is tokenized too, outputting "deathly" and "potter" as separate tokens. Although "Harry Potter and the Deathly Hallows – Part 1" is matched with the highest score, the input query tokens are matched separately, giving us both documents in the result. If you think this can cause issues, you can &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/specify-analyzer.html#specify-search-query-analyzer"&gt;specify a separate analyzer for the search query.&lt;/a&gt;&lt;/p&gt;
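&lt;p&gt;A heavily simplified Python sketch of that behaviour (whitespace tokenization only, ignoring n-gramming and scoring) shows why both documents match:&lt;/p&gt;

```python
# The match query analyzes the search term with the same analyzer,
# so "deathly potter" becomes two tokens matched independently.
def analyze(text):
    return text.lower().split()

query_tokens = analyze("deathly potter")
doc1 = set(analyze("Harry Potter and the Half-Blood Prince"))
doc2 = set(analyze("Harry Potter and the Deathly Hallows Part 1"))

# By default a document matches if ANY query token matches (OR semantics).
print(any(t in doc1 for t in query_tokens))  # True, via "potter"
print(any(t in doc2 for t in query_tokens))  # True, via "deathly" and "potter"
```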

&lt;p&gt;Thus, edge-n-gram overcomes the limitations of the prefix query by saving additional tokens in the inverted index, minimizing query-time latency. But these additional tokens do take up extra space on the nodes and can cause performance degradation. We should be careful while choosing fields for n-gramming, because some fields can have values of unbounded size and can bloat your indices.&lt;/p&gt;

&lt;h3&gt;
  
  
  Search_as_you_type
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/search-as-you-type.html"&gt;search_as_you_type&lt;/a&gt; datatype, introduced in Elasticsearch 7.2, is designed to provide out-of-the-box support for auto-complete functionality. Like the edge-n-gram approach, it does most of the work at index time by generating additional tokens to optimize auto-complete queries. When a field is mapped as search_as_you_type, additional sub-fields are created for it internally. Let's change our title field's type to search_as_you_type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"mappings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search_as_you_type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"max_shingle_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the title property in the above index, three sub-fields will be created. These sub-fields use the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/analysis-shingle-tokenfilter.html"&gt;shingle token filter&lt;/a&gt;. Shingles are nothing but groups of consecutive words (word n-grams, as seen above).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;the root title field&lt;/strong&gt; ⇒ analyzed using the analyzer provided in the mapping; the default one is used if none is provided&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;title._2gram&lt;/strong&gt; ⇒ This will split up the title into parts having two words each, i.e. shingles of size 2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;title._3gram&lt;/strong&gt; ⇒ This will split up the title into parts having three words each.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;title._index_prefix&lt;/strong&gt; ⇒ This will perform further edge-ngram tokenization on tokens generated under title._3gram.&lt;/li&gt;
&lt;/ul&gt;
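&lt;p&gt;Shingle generation itself is straightforward; here is a short Python sketch (assuming simple whitespace tokenization, which is an illustration rather than the actual analyzer chain):&lt;/p&gt;

```python
def shingles(text, size):
    """Generate word n-grams (shingles) of a fixed size."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

title = "Harry Potter and the Goblet of Fire"
print(shingles(title, 2))  # roughly what title._2gram indexes
print(shingles(title, 3))  # roughly what title._3gram indexes
```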

&lt;p&gt;We can test its behavior using our favourite &lt;em&gt;_analyze&lt;/em&gt; API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Input&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter and the Goblet of Fire"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;title._&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;gram:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"harry potter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"potter and"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"and the"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"the goblet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"goblet of"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"of fire"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;title._&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="err"&gt;gram:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"harry potter and"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"potter and the"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"and the goblet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"the goblet of"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
                &lt;/span&gt;&lt;span class="s2"&gt;"goblet of fire"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;title._index_prefix&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;on&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"goblet of fire"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;token&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;from&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;title._&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="err"&gt;gram:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"g"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"go"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gob"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gobl"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"goble"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
               &lt;/span&gt;&lt;span class="s2"&gt;"goblet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"goblet "&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"goblet o"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"goblet of"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The number of sub-fields created is decided by the &lt;em&gt;max_shingle_size&lt;/em&gt; parameter, which defaults to 3 and can be set to 2, 3, or 4. &lt;em&gt;Search_as_you_type&lt;/em&gt; is a text-like field, so &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/search-as-you-type.html#general-params"&gt;additional options&lt;/a&gt; that we use for text fields (e.g. analyzer, index, store, search_analyzer) are also supported.&lt;/p&gt;
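&lt;p&gt;The shingle sub-fields shown earlier can be illustrated with a small standalone sketch (plain Python, not Elasticsearch code): lowercase the title, split it into words, and emit word shingles of the requested size.&lt;/p&gt;

```python
def shingles(text, size):
    """Generate word shingles (n-grams of whole words) of the given size,
    after the lowercasing a standard-style analyzer would apply."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

title = "Harry Potter and the Goblet of Fire"
print(shingles(title, 2))  # ['harry potter', 'potter and', 'and the', 'the goblet', 'goblet of', 'of fire']
print(shingles(title, 3))  # ['harry potter and', ..., 'goblet of fire']
```

&lt;p&gt;This matches the &lt;em&gt;_analyze&lt;/em&gt; output above for the &lt;em&gt;_2gram&lt;/em&gt; and &lt;em&gt;_3gram&lt;/em&gt; sub-fields.&lt;/p&gt;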

&lt;p&gt;As you may have guessed by now, it supports prefix as well as infix matches. While querying, we need to use a &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/query-dsl-multi-match-query.html"&gt;multi_match&lt;/a&gt; query, as we need to target its sub-fields too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;/**********Request**********/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"multi_match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"the goblet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bool_prefix"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"analyzer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"keyword"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"title._2gram"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"title._3gram"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;/**********Response**********/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"took"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timed_out"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"relation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eq"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"max_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"_index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"movies"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"_doc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"r03hm3gBuCt11zp-Z_lC"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter and the Goblet of Fire"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have set the query type to &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/query-dsl-multi-match-query.html#type-bool-prefix"&gt;bool_prefix&lt;/a&gt; here. The query will match documents whose titles contain the terms in any order, but documents whose term order matches the query text will be ranked higher. In the above example, we have passed "the goblet" as the query text, so documents titled "the goblet of fire" will be ranked higher than documents titled "fire goblet".&lt;/p&gt;

&lt;p&gt;Also, we have specified the query analyzer to be keyword, so that our query text "the goblet" won't be analyzed and will be matched as it is. Without this, along with documents titled "Harry Potter and the Goblet of Fire", documents titled "Harry Potter and the Deathly Hallows – Part 1" would also match.&lt;/p&gt;
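&lt;p&gt;To see why the analyzer choice matters, here is a rough sketch (plain Python, a simplification of what Elasticsearch actually does) of how the two analyzers turn the query text into terms:&lt;/p&gt;

```python
def analyze(text, analyzer):
    """Simplified analysis: "standard" splits on whitespace and lowercases;
    "keyword" keeps the whole input as a single term."""
    if analyzer == "keyword":
        return [text]
    return text.lower().split()

print(analyze("the goblet", "standard"))  # ['the', 'goblet'] - each term matched separately
print(analyze("the goblet", "keyword"))   # ['the goblet'] - matched as a whole
```

&lt;p&gt;With the standard analyzer, the term "the" alone would be enough to match other titles containing "the", which is why we pin the query analyzer to keyword.&lt;/p&gt;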

&lt;p&gt;This is not the only way to query a &lt;em&gt;search_as_you_type&lt;/em&gt; field, but it is certainly the most suitable for our auto-complete use case.&lt;/p&gt;

&lt;p&gt;Like edge-n-grams, &lt;em&gt;search_as_you_type&lt;/em&gt; overcomes the limitations of the prefix query approach by storing data that is optimized for auto-completion. So in this approach too, we have to be careful about what we store in this field, as additional space is required for the n-grammed tokens.&lt;/p&gt;
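&lt;p&gt;As a back-of-the-envelope illustration of that extra space (a sketch, not Elasticsearch's actual storage accounting, and ignoring the &lt;em&gt;_index_prefix&lt;/em&gt; sub-field, which adds even more tokens), we can count the tokens each sub-field stores for a single title:&lt;/p&gt;

```python
def shingle_token_counts(title, max_shingle_size=3):
    """Count tokens stored by the root field and each shingle sub-field
    for one title. A rough estimate of the indexing overhead."""
    n = len(title.split())
    return {"root": n,
            **{f"_{size}gram": max(n - size + 1, 0)
               for size in range(2, max_shingle_size + 1)}}

counts = shingle_token_counts("Harry Potter and the Goblet of Fire")
print(counts)  # {'root': 7, '_2gram': 6, '_3gram': 5}
```

&lt;p&gt;So even before prefix tokens are added, a seven-word title produces 18 tokens instead of 7.&lt;/p&gt;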

&lt;p&gt;In &lt;a href="https://dev.to/ankitutekar/implementing-auto-complete-functionality-in-elasticsearch-part-iii-completion-suggester-3adj"&gt;part III&lt;/a&gt;, we will talk about completion suggesters, another index-time approach that further speeds up queries by keeping suggestions in memory. Do let me know your feedback on this part in the comments below.&lt;/p&gt;

</description>
      <category>elasticsearch</category>
      <category>tutorial</category>
      <category>search</category>
    </item>
    <item>
      <title>Implementing auto-complete functionality in Elasticsearch - Part I: Prefix queries</title>
      <dc:creator>Ankit Utekar</dc:creator>
      <pubDate>Sat, 17 Apr 2021 13:04:52 +0000</pubDate>
      <link>https://forem.com/ankitutekar/implementing-auto-complete-functionality-in-elasticsearch-part-i-prefix-queries-ilp</link>
      <guid>https://forem.com/ankitutekar/implementing-auto-complete-functionality-in-elasticsearch-part-i-prefix-queries-ilp</guid>
      <description>&lt;p&gt;Being a software engineer, I tend to judge products and companies behind those products based on how efficiently they have implemented technically challenging things. One of the things on the internet that fascinates me is blazing fast auto-complete implementations! Especially those, which load things asynchronously, from a large data-set in the back-end.&lt;/p&gt;

&lt;p&gt;Auto-complete is not like regular search functionality - we are supposed to update the auto-complete options as soon as the user types the next character, hitting the database on every keystroke, filtering through millions of records, without causing any performance degradation!&lt;/p&gt;

&lt;p&gt;A technology that makes it easy to implement such features is Elasticsearch - a search and analytics engine built on top of the &lt;a href="https://lucene.apache.org/"&gt;Apache Lucene library&lt;/a&gt;. Elasticsearch has a distributed, multi-tenant architecture with built-in routing and re-balancing, making it easy to scale. It's a widely used data store for storing, searching, and analyzing large volumes of data.&lt;/p&gt;

&lt;p&gt;In this &lt;a href="https://dev.to/ankitutekar/series/12283"&gt;three-part series of blog posts&lt;/a&gt;, I will be going into the details of how we can implement auto-complete functionality using various options available in Elasticsearch. In the first part (i.e. this post), we will talk about prefix queries. In the second part, we will have a look at n-grams, and in the final part, we will discuss completion suggesters. I will be using &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/index.html"&gt;Elasticsearch 7.12&lt;/a&gt;, the &lt;em&gt;current&lt;/em&gt; version at the time of writing.&lt;/p&gt;

&lt;p&gt;For example purposes, we will be using an index that stores data about movies. To keep it simple, &lt;em&gt;title&lt;/em&gt; will be the only property present in this index. As Elasticsearch exposes a REST interface for its operations, you can use any REST-based tool to communicate with it.&lt;/p&gt;

&lt;p&gt;This series assumes basic familiarity with Elasticsearch. If you are new to Elasticsearch, I highly recommend reading &lt;a href="https://www.knowi.com/blog/what-is-elastic-search/"&gt;an article&lt;/a&gt; or two on the basics of Elasticsearch.&lt;/p&gt;

&lt;p&gt;So let's get started, shall we?&lt;/p&gt;

&lt;h3&gt;
  
  
  Prefix queries
&lt;/h3&gt;

&lt;p&gt;Prefix queries are the simplest form of auto-complete implementation in Elasticsearch. We don't do anything special while storing the field; most of the work is done at query time. The field is indexed (stored!) as a simple text/keyword field, and it is queried with queries that match documents based on the passed prefix.&lt;/p&gt;
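&lt;p&gt;Conceptually, a prefix query behaves like this naive scan (a plain-Python sketch; Elasticsearch's sorted term dictionary is what makes the real thing efficient):&lt;/p&gt;

```python
titles = [
    "Harry Potter and the Chamber of Secrets",
    "Harry Potter and the Prisoner of Azkaban",
]

def prefix_search(prefix, docs):
    # A term-level prefix match: compare against the stored value as-is,
    # with no analysis of either the field or the query.
    return [t for t in docs if t.startswith(prefix)]

print(prefix_search("Harr", titles))    # both titles match
print(prefix_search("Potter", titles))  # [] - only the start of the value counts
```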

&lt;p&gt;Let's create an index to run prefix queries on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"mappings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"keyword"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"analyzed_title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While creating an index, we need to provide a mapping indicating the type of data we intend to store. For the purposes of the examples below, the &lt;em&gt;title&lt;/em&gt; is mapped as a keyword field and also as a text field to support full-text queries. A field can be mapped as more than one type using the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/multi-fields.html"&gt;multi-fields&lt;/a&gt; feature of Elasticsearch.&lt;/p&gt;

&lt;p&gt;The key difference between a keyword field and a text field is that keyword fields are not analyzed, i.e. data we pass to a keyword field is stored as it is. Text fields are analyzed, i.e. tokenized, possibly transformed (e.g. lowercased, stemmed, etc.), and stored in an &lt;em&gt;inverted index&lt;/em&gt;. An &lt;a href="https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up#inverted-indexes-and-index-terms"&gt;inverted index&lt;/a&gt; is a data structure that maps terms to the documents they appear in, enabling efficient full-text searches.&lt;/p&gt;
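&lt;p&gt;A toy inverted index over a couple of movie titles might look like this (a heavy simplification; Lucene also stores positions, offsets, frequencies, and more):&lt;/p&gt;

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():  # stand-in for the standard analyzer
            index[term].add(doc_id)
    return index

docs = [
    "Harry Potter and the Chamber of Secrets",
    "Harry Potter and the Prisoner of Azkaban",
]
index = build_inverted_index(docs)
print(index["secrets"])  # {0} - only the first document contains this term
print(index["harry"])    # {0, 1}
```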

&lt;p&gt;To test how our data will be analyzed, we can use the &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/indices-analyze.html"&gt;_analyze API&lt;/a&gt;. Let's see how our main title field will be analyzed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;/***********************&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="err"&gt;Request&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;***********************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_analyze&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Chamber of Secrets"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"title"&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;/***********************&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="err"&gt;Response&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;***********************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"token"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Chamber of Secrets"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"start_offset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"end_offset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"word"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, it returned only a single token. Why? Because it's a keyword field! Let's test how our analyzed title will behave:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;/***********************&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="err"&gt;Request&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;***********************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_analyze&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Chamber of Secrets"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"title.analyzed_title"&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;/***********************&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="err"&gt;Response&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;***********************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"token"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"chamber"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"start_offset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"end_offset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;ALPHANUM&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"token"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"of"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"start_offset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"end_offset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;ALPHANUM&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"token"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"secrets"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"start_offset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"end_offset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;ALPHANUM&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As expected, it was broken down into three tokens. Moreover, the tokens are lower-cased. Why is that? Because, even if we don't specify an analyzer, the default &lt;em&gt;standard analyzer&lt;/em&gt; is applied to text fields, which performs grammar-based tokenization and also lower-cases the tokens. &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/analysis.html"&gt;Text analysis&lt;/a&gt; is a highly configurable process consisting of zero or more character filters, a tokenizer, and zero or more token filters, running in a pipeline. We can create our own analyzers and can also customize the built-in ones.&lt;/p&gt;
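&lt;p&gt;The standard analyzer's behavior on our title can be approximated in a few lines (grammar-based tokenization reduced to a simple regex here; the real analyzer follows the Unicode text segmentation rules):&lt;/p&gt;

```python
import re

def standard_analyze(text):
    # Tokenize on non-alphanumeric boundaries, then lowercase each token -
    # a rough stand-in for Elasticsearch's standard analyzer.
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

print(standard_analyze("Chamber of Secrets"))  # ['chamber', 'of', 'secrets']
```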

&lt;p&gt;Let's add some Harry Potter movies to our index, i.e. let's &lt;em&gt;index&lt;/em&gt; some documents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_doc&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter and the Chamber of Secrets"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_doc&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter and the Prisoner of Azkaban"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's try to query our main title field (keyword) using a &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/query-dsl-prefix-query.html"&gt;prefix query&lt;/a&gt;. The prefix query is a term-level query used to query non-analyzed fields. We will try two different requests - first with a prefix of the first word in the title, then another with a prefix of the second word:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;/************************&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;Query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;prefix&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harr"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;*************************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harr"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;/***********************&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;Returns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;below&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;response&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="err"&gt;************************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"took"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timed_out"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"relation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eq"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"max_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"_index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"movies"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"_doc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qk1qlngBuCt11zp-N_lD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter and the
                                Chamber of Secrets"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"_index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"movies"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"_doc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"q01rlngBuCt11zp-GPl_"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter and the
                                Prisoner of Azkaban"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;/***********************&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;Query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;prefix&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pott"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;************************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Pott"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;/*********************&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;Returns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;below&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;response&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="err"&gt;**********************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"took"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timed_out"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"relation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eq"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"max_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because the title is a keyword field, we have to provide the prefix with the correct casing; passing 'harr' in the query won't match anything. The first request returns both of the documents indexed above, as expected. The second request, however, returns nothing, because the prefix query doesn't support infix matches (matching in the middle of the title).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Note&lt;/strong&gt;: To keep the responses a bit shorter, I will keep only the relevant parts &amp;amp; replace the rest with '...'.&lt;/em&gt;&lt;/p&gt;
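&lt;p&gt;The keyword-field behavior above can be sketched in a few lines of Python. This is a toy illustration of the observable behavior, not how Elasticsearch is implemented: a keyword field is stored as a single unanalyzed term, so a prefix query amounts to a case-sensitive starts-with check on the whole title.&lt;/p&gt;

```python
# Toy sketch: a keyword field is one unanalyzed term per document,
# so prefix matching is a case-sensitive check from the very start.
titles = [
    "Harry Potter and the Chamber of Secrets",
    "Harry Potter and the Prisoner of Azkaban",
]

def keyword_prefix_search(docs, prefix):
    # The whole title is a single term; only the beginning can match.
    return [t for t in docs if t.startswith(prefix)]

print(keyword_prefix_search(titles, "Harr"))  # both titles
print(keyword_prefix_search(titles, "harr"))  # [] - wrong casing
print(keyword_prefix_search(titles, "Pott"))  # [] - no infix matching
```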

&lt;p&gt;If we want to match inside the title, we should use &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/query-dsl-match-query-phrase-prefix.html"&gt;match_phrase_prefix&lt;/a&gt; - a query used for prefix matching on analyzed text fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;/**************************&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="err"&gt;Query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;using&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;prefix&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pott"&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="err"&gt;**************************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"match_phrase_prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"title.analyzed_title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pott"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;/***********************&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="err"&gt;Returns&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;below&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;response&lt;/span&gt;&lt;span class="w"&gt; 
&lt;/span&gt;&lt;span class="err"&gt;************************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"took"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"timed_out"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"relation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eq"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"max_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.1461155&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"hits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter and the
                                    Chamber of Secrets"&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harry Potter and
                                    the Prisoner of Azkaban"&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because we are searching the analyzed title, which is tokenized, the prefix "pott" matches the token "potter", which belongs to both of our documents. So both documents are returned.&lt;/p&gt;
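&lt;p&gt;A rough Python sketch of why this works (an illustration, not Elasticsearch's actual analysis pipeline): a standard-style analyzer lowercases the title and splits it into tokens, and the query term is prefix-matched against each token rather than against the whole title.&lt;/p&gt;

```python
# Toy analyzer: lowercase and split on whitespace, like a simplified
# standard analyzer. Prefix matching then runs per token.
def analyze(text):
    return text.lower().split()

def token_prefix_match(docs, prefix):
    # A document matches if any of its tokens starts with the prefix.
    return [t for t in docs if any(tok.startswith(prefix) for tok in analyze(t))]

titles = [
    "Harry Potter and the Chamber of Secrets",
    "Harry Potter and the Prisoner of Azkaban",
]
print(token_prefix_match(titles, "pott"))   # both documents, via token "potter"
print(token_prefix_match(titles, "chamb"))  # only "Chamber of Secrets"
```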

&lt;p&gt;What about out-of-order prefixes? Since the words inside the title are tokenized, we might expect "potter harry" to match both documents. But this being a &lt;em&gt;phrase prefix&lt;/em&gt; query, it respects the order of the input. If we want out-of-order matches, we can use &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/query-dsl-match-bool-prefix-query.html"&gt;match_bool_prefix&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;/***********************************&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;Below&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;doesn't&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;anything&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;***********************************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"match_phrase_prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title.analyzed_title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"potter harry"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;/************************************************&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;Below&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;query&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;DOES&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;both&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;documents,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;similar&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;response&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;match_phrase_prefix&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;above&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;*************************************************/&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/movies/_search&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"match_bool_prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"title.analyzed_title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pott harr"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's all I had to say about auto-complete using prefix queries. There are a few things to consider before choosing this approach for implementing auto-complete functionality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;   This is the least recommended approach and is considered the slowest compared to other auto-complete implementations in ES. Searches are slow because no work is done at index time to help auto-complete queries: the field is indexed as a simple text field, so most of the work of matching documents against the queried text happens at search time. Elasticsearch has to go through the inverted index and check whether any token &lt;em&gt;starts with&lt;/em&gt; the text provided in the query, which is an expensive operation.&lt;/li&gt;
&lt;li&gt;   In recent versions of Elasticsearch, the &lt;em&gt;index_prefixes&lt;/em&gt; option has been added for the term-level prefix query, which allows you to &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.12/query-dsl-prefix-query.html#prefix-query-index-prefixes"&gt;speed up prefix queries&lt;/a&gt; by storing prefixes in a separate field at index time.&lt;/li&gt;
&lt;li&gt;   If you already have a working index and don't want to update the mapping, prefix queries are a suitable approach, provided auto-complete is not one of the heavily used features of your system. If it is, you might run into performance issues, and it will be better to use one of the approaches discussed in the next parts of this series and re-index the data.&lt;/li&gt;
&lt;/ul&gt;
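&lt;p&gt;To get an intuition for the search-time cost mentioned above, here is a simplified Python sketch. Elasticsearch's term dictionary is far more sophisticated than a sorted list, but the idea is similar: with no index-time preparation, a prefix query must locate every stored term that starts with the given text and expand each of them into the query.&lt;/p&gt;

```python
# Simplified sketch: find all terms in a sorted term dictionary that
# share a prefix. bisect narrows the range, but every matching term
# still has to be collected and matched against documents.
import bisect

sorted_terms = sorted(["and", "azkaban", "chamber", "harry", "of",
                       "potter", "prisoner", "secrets", "the"])

def terms_with_prefix(terms, prefix):
    lo = bisect.bisect_left(terms, prefix)
    # Everything starting with `prefix` sorts before prefix + a max char.
    hi = bisect.bisect_left(terms, prefix + "\uffff")
    return terms[lo:hi]

print(terms_with_prefix(sorted_terms, "p"))     # ['potter', 'prisoner']
print(terms_with_prefix(sorted_terms, "harr"))  # ['harry']
```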

&lt;p&gt;In &lt;a href="https://dev.to/ankitutekar/implementing-auto-complete-functionality-in-elasticsearch-part-ii-n-grams-2afg"&gt;part II&lt;/a&gt; we will talk about n-grams, an index-time approach to auto-complete in ES. Do let me know your feedback on this part in the comments below!&lt;/p&gt;

</description>
      <category>elasticsearch</category>
      <category>tutorial</category>
      <category>search</category>
    </item>
    <item>
      <title>Making sense of FaaS by learning about Azure Functions – Part II</title>
      <dc:creator>Ankit Utekar</dc:creator>
      <pubDate>Sat, 28 Nov 2020 03:13:28 +0000</pubDate>
      <link>https://forem.com/ankitutekar/making-sense-of-faas-by-learning-about-azure-functions-part-ii-16bm</link>
      <guid>https://forem.com/ankitutekar/making-sense-of-faas-by-learning-about-azure-functions-part-ii-16bm</guid>
      <description>&lt;p&gt;I have been getting into Azure lately and out of myriad of available services, I chose Azure Functions to explore first. In &lt;a href="https://dev.to/ankitutekar/making-sense-of-faas-by-learning-about-azure-functions-part-i-4j8l"&gt;Part I&lt;/a&gt;, we learned about characteristics of &lt;strong&gt;Function As A Service&lt;/strong&gt;. We also saw some features of Azure's &lt;strong&gt;Azure Functions&lt;/strong&gt; offering as examples of those characteristics. In this post, we will dive into Azure Functions by learning about its key concepts and see some examples.&lt;/p&gt;

&lt;p&gt;I will be referencing code snippets from a small web app that I worked on while learning Azure Functions. It uses multiple serverless Azure solutions and is implemented in .NET Core (C#) and React. If you are not familiar with C#, no worries - I will explain the C#-specific syntax in the examples below.&lt;/p&gt;

&lt;p&gt;The app gives real-time share market updates to clients. Data is stored in Cosmos DB, an Azure storage solution, and several Azure Functions manipulate and fetch the data. Azure SignalR pushes real-time updates to connected clients accessing the React-based app in their browser. If you want to read more about it or see the code, have a look at my GitHub repository:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/ankitutekar" rel="noopener noreferrer"&gt;
        ankitutekar
      &lt;/a&gt; / &lt;a href="https://github.com/ankitutekar/share-market-live-updates-using-Azure-services" rel="noopener noreferrer"&gt;
        share-market-live-updates-using-Azure-services
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      This is a small web-app implemented to get hands-on on Azure services. It provides live share market updates to connected clients.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Share Market live updates using Azure services&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;This is a small web-app implemented to get hands-on on Azure services. It provides live share market updates to connected clients. Note that, the share market updates are not consuming any real service, updates are simulated using custom update function written using Azure Functions and stored in Azure Cosmos DB. Also the shares are of made up companies!&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Technologies used:&lt;/h3&gt;
&lt;/div&gt;
&lt;p&gt;As this was developed while learning about serverless technologies, it heavily uses cloud services -
&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/azure-functions/functions-overview" rel="nofollow noopener noreferrer"&gt;Azure Functions&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/azure-signalr/signalr-overview" rel="nofollow noopener noreferrer"&gt;Azure SignalR&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/cosmos-db/introduction" rel="nofollow noopener noreferrer"&gt;Azure Cosmos DB&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;
&lt;a href="https://docs.microsoft.com/en-us/visualstudio/ide/quickstart-aspnet-core?view=vs-2019" rel="nofollow noopener noreferrer"&gt;Aspnetcore web application&lt;/a&gt;, with AspNetCore.SpaServices extension for integrating Create-React-App&lt;/li&gt;
  &lt;li&gt;&lt;a href="https://reactjs.org/" rel="nofollow noopener noreferrer"&gt;React JS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Flow diagram:&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/ankitutekar/share-market-live-updates-using-Azure-services/ShareMarketLiveUpdatesFlowDiagram.jpg"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fankitutekar%2Fshare-market-live-updates-using-Azure-services%2FShareMarketLiveUpdatesFlowDiagram.jpg" alt="ShareMarketLiveUpdatesFlowDiagram"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;So the main components are - 
&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A client app accessed by users to get real time share updates&lt;/li&gt;
  &lt;li&gt;Azure Function app having 4 different functions&lt;/li&gt;
  &lt;li&gt;An Azure SignalR service to act as backplane for all client connections&lt;/li&gt;
  &lt;li&gt;An Azure Cosmos DB instance which stores the share details&lt;/li&gt;
&lt;/ul&gt;…&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/ankitutekar/share-market-live-updates-using-Azure-services" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;The function is the primary concept in Azure Functions. Functions here are similar to the functions/methods of traditional programming, but with some superpowers! Azure Functions can be written in a variety of languages, including C#, Java, JavaScript, Python, and more. Moreover, you can add third-party dependencies such as npm or NuGet packages to extend their functionality.&lt;/p&gt;

&lt;p&gt;The building blocks that make Azure Functions so powerful are triggers and bindings. A trigger specifies when a function is to be executed: there are timer triggers, HTTP triggers, and triggers that fire on events happening in other Azure services. Bindings let you connect to other services seamlessly. Azure needs a function.json file to execute a function; it has the trigger and binding information embedded in it. Let's talk about these concepts one by one:&lt;/p&gt;
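&lt;p&gt;For reference, a function.json for a Cosmos DB trigger paired with a SignalR output binding might look roughly like the sketch below. This is a hedged illustration - the database, collection, hub, and setting names are made up for the example, not taken from the app - and note that for C# class libraries this file is generated from the attributes at build time rather than written by hand.&lt;/p&gt;

```json
{
  "bindings": [
    {
      "type": "cosmosDBTrigger",
      "direction": "in",
      "name": "updatedDocs",
      "databaseName": "sharesDb",
      "collectionName": "shares",
      "connectionStringSetting": "CosmosDBConnection",
      "leaseCollectionName": "leases",
      "createLeaseCollectionIfNotExists": true
    },
    {
      "type": "signalR",
      "direction": "out",
      "name": "signalRMessages",
      "hubName": "sharesHub",
      "connectionStringSetting": "AzureSignalRConnectionString"
    }
  ]
}
```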
&lt;h3&gt;
  
  
  Triggers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A trigger is what causes our function to run. Every function must have exactly one trigger. Suppose you want to add a message to a queue when a record is inserted in the DB, or schedule some code to run every 10 minutes - triggers can do that and much more. Azure supports timer, HTTP, Event Hub, Blob, and Queue triggers, and webhooks too.&lt;/li&gt;
&lt;li&gt;One of the most important benefits of triggers is that you don't have to write code to connect to other Azure services; this is done declaratively. You just specify the trigger's source information, and the data from the trigger event is received in a function parameter.&lt;/li&gt;
&lt;li&gt;To make sense of this, let's look at an example from the share market updates web app. In the example below, the function runs whenever an update occurs in Cosmos DB, and the updated document is reported to the connected SignalR clients.
&lt;/li&gt;

&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ff29cqp7c0dymswzi0bry.png"&gt;

&lt;li&gt;If you are not familiar with C#, you may find the parameter declarations a bit weird. These are C# attributes used on function parameters, which let us decorate parameters with additional metadata. Attributes declaratively specify metadata about entities (classes, methods, parameters, etc.) in the format [AttributeName(AttributeParameter1 = Value, AttributeParameter2 = Value, ...)].&lt;/li&gt;
&lt;li&gt;In C# Azure Functions, attributes are used to specify trigger and binding information on function parameters. The FunctionName attribute marks the entry-point method of the class. In the above example, we have passed databaseName and collectionName as parameters to the attribute class constructor, and ConnectionStringSetting, LeaseCollectionName, and CreateLeaseCollectionIfNotExists as further attribute parameters. This attribute provides metadata about the updatedDocs function parameter, specifying the collection to monitor for changes. If any records in the collection are updated, the function is triggered and the updated records are available in the updatedDocs parameter.&lt;/li&gt;
&lt;li&gt;
We have specified the Cosmos DB connection information with its trigger, and the SignalR hub information with an output binding, inside the function parameters. As you can see, there is no code written to open a connection to Cosmos DB, fetch the data, or close the connection! Similarly, there is no code written to open a connection to the SignalR hub; we have just specified the connection information declaratively. These operations are performed behind the scenes by the Azure Functions runtime and the appropriate SDKs, based on the triggers/bindings.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Bindings&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Similar to triggers, bindings allow us to declaratively connect our function to other Azure services. Unlike triggers, bindings are optional, and a function can have multiple bindings. Azure supports connecting to a variety of services through bindings, e.g. storage services (Blob, Queue, Table, Cosmos DB), Event Grid, Event Hubs, SignalR, Twilio, etc. Moreover, you can also add your own custom bindings.&lt;/li&gt;
&lt;li&gt;Bindings have a direction - in, out, and inout. The direction specifies whether we are loading data into the function from another Azure service or sending data from the function to a service. In the above example, we have used a SignalR output binding, specified with [SignalR(HubName="notifications")]. Note that the direction is optional and can be inferred from the way the parameters are declared, e.g. using IReadOnlyList and IAsyncCollector in the above example. The SignalR output binding sends the updated share data to all connected clients; it also allows us to target specific user IDs and groups.&lt;/li&gt;
&lt;li&gt;Many trigger/binding specific attribute parameters are supported, with which we can do much more than what is shown in the above example.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;function.json&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;function.json is another important part of Azure Functions, alongside the code that we write. It holds the binding and trigger information, which the runtime uses for monitoring events, determining how to pass data to and from a function, and making scaling decisions.&lt;/li&gt;
&lt;li&gt;This file is auto-generated if you are using a compiled language, but for scripting languages it is your responsibility to provide one. For compiled languages, it is generated from the annotations (the attributes in the above example) that you have written. It can also be edited directly through the Azure Portal, but you shouldn't do so unless you are just trying things out.&lt;/li&gt;
&lt;li&gt;For C# Azure Functions, you will find function.json generated in the bin folder, one file per function in your function app. A function app is a collection of functions and the unit of Azure Functions deployment. The file below was generated for the example above -
&lt;/li&gt;
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmznjq257bcizy2q4peb5.png"&gt;

&lt;li&gt;Along with this, there are two function-app-level config files:
 &lt;ol&gt;
&lt;li&gt;host.json - holds runtime configuration&lt;/li&gt;
 &lt;li&gt;local.settings.json - holds secrets such as connection strings and is not meant to be published to Azure.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
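
&lt;p&gt;To make the file's shape concrete, a function.json for a Cosmos DB trigger paired with a SignalR output binding might look roughly like the sketch below. The database, collection, and connection-setting names here are illustrative placeholders, not the exact values from my project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "bindings": [
    {
      "type": "cosmosDBTrigger",
      "name": "updatedDocs",
      "direction": "in",
      "databaseName": "sharesDB",
      "collectionName": "shares",
      "connectionStringSetting": "CosmosDBConnection",
      "leaseCollectionName": "leases",
      "createLeaseCollectionIfNotExists": true
    },
    {
      "type": "signalR",
      "name": "signalRMessages",
      "direction": "out",
      "hubName": "notifications",
      "connectionStringSetting": "AzureSignalRConnectionString"
    }
  ],
  "disabled": false
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Each entry in the bindings array corresponds to one attribute in the C# code: the in-direction entry is the trigger, and the out-direction entry is the SignalR output binding.&lt;/p&gt;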



&lt;p&gt;Azure provides extensions for Visual Studio and VS Code for &lt;a href="https://docs.microsoft.com/en-us/azure/azure-functions/functions-develop-local" rel="noopener noreferrer"&gt;local development of Azure Functions&lt;/a&gt;. It also supports Eclipse and IntelliJ. Functions can be deployed in &lt;a href="https://docs.microsoft.com/en-us/azure/azure-functions/functions-deployment-technologies" rel="noopener noreferrer"&gt;multiple ways&lt;/a&gt; - GitHub Actions, Azure DevOps, Azure App Service based deployments, etc.&lt;/p&gt;

&lt;p&gt;This was my first time trying out serverless things, and I must tell you, I have been fascinated by how productive you can get in so little time. I hope you enjoyed these articles and that I added something to your understanding of FaaS with Azure. Do let me know if you have any queries; I'll try my best to answer them.&lt;/p&gt;

&lt;h3&gt;References and further reading&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/azure-functions/functions-reference" rel="noopener noreferrer"&gt;Azure Functions developer guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/azure-functions/supported-languages" rel="noopener noreferrer"&gt;Supported languages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/azure-functions/functions-triggers-bindings?tabs=csharp#bindings-code-examples" rel="noopener noreferrer"&gt;Supported bindings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/azure-signalr/signalr-concept-serverless-development-config" rel="noopener noreferrer"&gt;Azure Functions development and configuration with Azure SignalR Service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-cosmosdb-v2-trigger?tabs=csharp" rel="noopener noreferrer"&gt;Azure Cosmos DB trigger for Azure Functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://json.schemastore.org/function" rel="noopener noreferrer"&gt;Full JSON schema for function.json&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Azure/azure-functions-host/wiki/function.json" rel="noopener noreferrer"&gt;Some additional examples of function.json&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cloud</category>
      <category>azure</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Making sense of FaaS by learning about Azure Functions – Part I</title>
      <dc:creator>Ankit Utekar</dc:creator>
      <pubDate>Sat, 17 Oct 2020 11:15:13 +0000</pubDate>
      <link>https://forem.com/ankitutekar/making-sense-of-faas-by-learning-about-azure-functions-part-i-4j8l</link>
      <guid>https://forem.com/ankitutekar/making-sense-of-faas-by-learning-about-azure-functions-part-i-4j8l</guid>
      <description>&lt;p&gt;You must've seen the term FaaS or Function as a Service popping up on the internet every now and then. It has gained quite popularity recently. When I came across the term for the first time, I was puzzled. Function as a service? Do they write functions for us and we call them from our code? Is it like a paid library of utility functions having some secret code that only cloud providers can write? Well, after reading a few articles I came to know that I couldn't have been more wrong! I have been exploring this technology lately and been fascinated by how powerful it is.&lt;/p&gt;

&lt;p&gt;In this article, we will explore FaaS by learning about some of its characteristics that are different from our traditional ways of doing things in software development. I will be using Azure Functions as a point of reference for explaining a few things. In part II, we will dive into Azure Functions.&lt;/p&gt;

&lt;p&gt;First, let's start by defining FaaS - It is a service provided by cloud providers that allows developers to run their code without worrying about building and managing required infrastructure.&lt;/p&gt;

&lt;p&gt;So, it is one service among the plethora of services provided by cloud providers. Most of the major cloud providers offer it; some examples include &lt;a href="https://aws.amazon.com/lambda/"&gt;AWS Lambda&lt;/a&gt;, Azure Functions, and &lt;a href="https://developers.google.com/learn/topics/functions"&gt;GCP Functions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The characteristics explained below will give you more insight into the definition above -&lt;/p&gt;

&lt;h2&gt;No building and management of server side things&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Cloud computing has different models based on how much responsibility we outsource - &lt;a href="https://www.bmc.com/blogs/saas-vs-paas-vs-iaas-whats-the-difference-and-how-to-choose/"&gt;IaaS, PaaS, SaaS&lt;/a&gt; and FaaS. FaaS is a serverless technology: our apps are deployed on someone else's servers, i.e. in the data centers of cloud providers. But serverless doesn't just mean the provisioning of servers by cloud providers; there is more to it.&lt;/li&gt;
&lt;li&gt;In FaaS, we are responsible only for the code our functions contain and possibly a couple of configuration files. All the infrastructure building and management tasks - container/VM management, supplying required dependencies for the app, OS-level process management, host networking, etc. - are performed by cloud providers and the systems they set up. We just have to &lt;b&gt;specify&lt;/b&gt; our resource requirements, like memory size and CPUs or the maximum number of VM instances; Azure will provision those resources for us and take care of managing them too.&lt;/li&gt;
&lt;li&gt;Azure provides different plans for hosting our function apps. The default one is the consumption plan - it provides auto-scaling, requires minimal configuration, and is more suitable for irregular traffic patterns. You can switch to the premium plan if you have higher resource requirements than the consumption plan provides. It also auto-scales, offers more features than the consumption plan - such as pre-warmed instances and VNet connectivity - and is more suitable for apps that are always or mostly running.
&lt;/li&gt;
&lt;li&gt;You can also run your functions in dedicated App Service Plan or Kubernetes cluster. Detailed comparison of different hosting plans for functions can be found &lt;a href="https://docs.microsoft.com/en-us/azure/azure-functions/functions-scale#hosting-plans-comparison"&gt;here&lt;/a&gt; on MS docs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Auto-scale based on load&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;As the number of requests increases, our function app is automatically scaled to support the increased load. If we had to provision server instances (physical or virtual) ourselves, we would need to provision capacity in advance, and in some cases under-provisioning can also happen. These scenarios are challenging to handle from a cost and management perspective.&lt;/li&gt;
&lt;li&gt;Due to the elastic scaling nature of FaaS, functions can also be scaled down to zero instances if there isn't enough activity to keep them alive. This is beneficial from a cost perspective, as we do not pay for the time when our functions are not running.&lt;/li&gt;
&lt;li&gt;In the case of Azure Functions, the consumption plan and the premium plan both add the necessary compute resources as load increases. If you are deploying functions in a dedicated App Service plan or on a Kubernetes cluster, you will need to configure scaling manually, and it is not as fast as the elastic scaling of the consumption and premium plans.&lt;/li&gt;
&lt;li&gt;The consumption plan can scale out to a maximum of 200 instances, and the premium plan up to 100 instances. A single instance can serve multiple requests. We can also throttle the scaling through configuration files.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Smaller execution duration&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The design of FaaS demands that functions be ephemeral. This makes FaaS more suitable for event-driven use cases: something is triggered as a result of an event (e.g. a record added to the database), does its job, and shuts down.&lt;/li&gt;
&lt;li&gt;If your function has been idle for a few minutes, it will be scaled down to zero instances. Serving new requests then requires restarting the function app, which might happen on another instance and can take some time. This is known as the cold start issue. The delay may be mere milliseconds, but if it happens frequently it can cause performance issues. The cold start time is affected by a number of factors, like how many dependencies your function needs to load and whether an instance was already running.&lt;/li&gt;
&lt;li&gt;Azure &lt;a href="https://docs.microsoft.com/en-us/azure/azure-functions/functions-best-practices"&gt;recommends&lt;/a&gt; avoiding long-running functions. The timeout duration depends on the plan you are using and is configurable. For the consumption plan, the default is 5 minutes and the maximum configurable is 10 minutes. Switching to the premium plan gives you a default of 30 minutes, and you can configure it to never time out.&lt;/li&gt;
&lt;li&gt;You can mitigate the cold start issue with the premium plan, which provides the option of pre-warmed and always-ready instances, or by running your function app on a dedicated App Service plan. These options are more suitable for apps that are always or mostly running, and can entail costs even when there are no requests; e.g. the premium plan requires at least one instance to be pre-warmed, so there is a minimum cost associated with it.&lt;/li&gt;
&lt;li&gt;Azure provides another option, &lt;a href="https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-overview?tabs=csharp"&gt;durable functions&lt;/a&gt;, which are an extension to Azure Functions. They are designed to be stateful, are suitable for potentially long-running workflows, and provide abstractions for workflow orchestration.&lt;/li&gt;
&lt;/ul&gt;
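
&lt;p&gt;For reference, the timeout mentioned above is controlled by the functionTimeout setting in host.json. A minimal sketch, e.g. raising a consumption-plan app to the 10-minute maximum (the value is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "version": "2.0",
  "functionTimeout": "00:10:00"
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;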

&lt;h2&gt;Minimal/ no local state&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;As stated above, the lifetime of functions is supposed to be short, which constrains the amount of data that can be held in main memory. The process will be terminated if there aren't enough requests to keep the instance alive, and your function app might be moved between server/container instances every now and then.&lt;/li&gt;
&lt;li&gt;Due to this, any state needs to be kept in some external persistence layer, like a Redis cache or external storage.&lt;/li&gt;
&lt;li&gt;Azure recommends that our functions be stateless. You can switch to durable functions, which provide the ability to preserve state across requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Pay as per consumption model&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
Under the pay-per-consumption model of FaaS, we pay only for the requests that are actually executed, unlike PaaS or IaaS offerings and on-prem setups, where we have to keep our servers running even when there are no requests. With FaaS, we are not paying for idle CPU cycles. This brings huge cost benefits.
&lt;/li&gt;
&lt;li&gt;Say you have launched a product for users in your country and it gets traffic only in the mornings and evenings. Using FaaS services wherever applicable will be beneficial for you from a cost perspective. Such irregular traffic patterns, and trying things out for newly launched products, are ideal scenarios for FaaS offerings.&lt;/li&gt;
&lt;li&gt;This is most evident in the consumption plan of Azure Functions, where you pay based on resource consumption per request and the number of executions. The average memory consumption per request in gigabytes is multiplied by the number of seconds your request executes for, giving a metric called gigabyte-seconds (by the way, the first million executions are free each month!). In the premium plan, you pay based on the number of core-seconds and the memory used across instances.&lt;/li&gt;
&lt;/ul&gt;
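
&lt;p&gt;To make the gigabyte-seconds arithmetic concrete, here is a rough sketch in Python. The free grant and per-gigabyte-second rate below are illustrative assumptions based on the published pricing at the time of writing; check the official pricing page for current numbers, and note that the small per-execution charge is ignored here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```python
def consumption_cost(memory_gb, duration_s, executions,
                     price_per_gb_s=0.000016, free_gb_s=400_000):
    """Rough consumption-plan estimate: resource cost is billed in
    gigabyte-seconds (average memory in GB multiplied by execution time
    in seconds). Rates and the monthly free grant are illustrative."""
    gb_seconds = memory_gb * duration_s * executions
    billable = max(gb_seconds - free_gb_s, 0)
    return billable * price_per_gb_s

# 1 million requests, each using 0.5 GB for 1 second:
# 500,000 GB-s total, 400,000 GB-s free, 100,000 GB-s billable.
cost = consumption_cost(0.5, 1.0, 1_000_000)
```&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;With these assumed numbers, the month's resource cost comes out to a couple of dollars - which is the point: irregular or modest traffic is where the consumption plan shines.&lt;/p&gt;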



&lt;p&gt;So, does FaaS make sense for your application? As explained above, it can really help you cut costs when applied to the right scenarios. Cloud providers want to support the needs of all their customers, so their offerings are flexible, and the line between PaaS offerings and FaaS offerings is getting blurry. Many of the above characteristics apply to PaaS to a certain degree, and some of them do not apply to certain plans of FaaS offerings. We shouldn't get too hung up on the terminology; instead, we should properly examine our use case and choose what fits our needs.&lt;/p&gt;

&lt;p&gt;Have you created something with FaaS? How has your experience been? Do share your thoughts below.&lt;/p&gt;



&lt;p&gt;&lt;a href="https://dev.to/ankitutekar/making-sense-of-faas-by-learning-about-azure-functions-part-ii-16bm"&gt;Part II of this post has been published here!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Cover image credits &lt;a href="https://unsplash.com/@tvick?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Taylor Vick&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  References:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://martinfowler.com/articles/serverless.html"&gt;Serverless Architectures by Mike roberts on MartinFowler.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/azure-functions/functions-scale"&gt;MS Docs - Azure Functions scale and hosting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://azure.microsoft.com/en-us/pricing/details/functions/"&gt;Azure Functions pricing&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>cloud</category>
      <category>azure</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Lessons learnt from three years of professional software development</title>
      <dc:creator>Ankit Utekar</dc:creator>
      <pubDate>Sat, 18 Jul 2020 11:54:00 +0000</pubDate>
      <link>https://forem.com/ankitutekar/lessons-learnt-from-three-years-of-professional-software-development-1md0</link>
      <guid>https://forem.com/ankitutekar/lessons-learnt-from-three-years-of-professional-software-development-1md0</guid>
      <description>&lt;p&gt;So in the last week of July, I'll be completing three years of professional software development. It has been an amazing journey. Mistakes made, things learnt. If my three years younger self had a time machine to talk to his future self (i.e. my present self), it would've been quite a lecture. Below I have penned down the things I wish someone had told me when I started my professional software development journey, hoping that it helps someone who is starting their journey now-&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://www.quora.com/Whats-shiny-object-syndrome"&gt;shiny object syndrome&lt;/a&gt; is real and it is going to hit you real hard. The earlier you educate yourself about it, the better. It will always be there, even if you are working on the shiny object which you thought will be a cure-all, but being aware about it will make it easy for you to deal with it.&lt;/li&gt;
&lt;li&gt;When starting with any new technology, you don't need to search for the best resource for learning technology 'X'. You will start with the 'YDKJS' series because that's what the internet told you the best resource to learn JavaScript is, but you will give up within a week. While 'YDKJS' is one of the best and most comprehensive book series out there, it won't help you get up and running quickly. You will eventually develop expertise in the technologies you work on, but beginner-level tutorials should be enough to get started quickly on an existing code base. Concentrate on the basics first; advanced topics can be covered later, as you come across them while working. Know that it's a continuous learning process.&lt;/li&gt;
&lt;li&gt;When you are being trained for a new project, don't feel frightened by the number of technologies and tools used in the project. Try to understand the big picture first. No one expects you to learn each and every technology used in the project. We work in teams; there will be people to assist you, so learn to ask for help and learn to ask questions. Dive into the key areas you will mainly be working in; a conceptual understanding of the other things should be enough. Your knowledge will grow over time.&lt;/li&gt;
&lt;li&gt;Initially, some things won't make sense - you won't understand the 'patterns' or why we follow a particular 'standard', no matter how much you read about it. It's because you haven't worked on the code enough to understand the problems. Patterns and standards are developed by people who have gotten their hands really dirty in that language/framework. You will have your 'Aha!' moments too, but learn to go along for a while; these people know what they are talking about.&lt;/li&gt;
&lt;li&gt;Know that the best technology is the one that gets your job done and brings value to the business you are developing software for. Be open to learning new tech every now and then, but don't let it distract you from your current stack. Keep learning new things about the technologies you use and try to understand their problems. That will make it easier to learn new tech that is trying to solve those problems, and will also enable you to get the best out of it.&lt;/li&gt;
&lt;li&gt;Whenever you are asked to compare two frameworks/languages/tools, you should always say &lt;b&gt;'It depends'&lt;/b&gt;. There are always trade-offs, and everything has its pros and cons. We programmers are known to get defensive when talking about the tech we use. When choosing a tool for industrial projects, we have to consider many more factors than you can imagine. Just because something works great for project 'X' doesn't mean it will work for your project too! Also, understand that innovation happens and technology evolves. Be aware of the cons/shortcomings of the languages/frameworks you use.&lt;/li&gt;
&lt;li&gt;The developer community is amazing and the content creators are doing a great job, but that doesn't mean you have to consume it ALL!!! The FOMO will kick in, and you will learn new things like: Chrome shows a smiley icon as the tab count when you have over 100 tabs, and there isn't any functionality in your mail client that lets you 'unsubscribe from all' newsletters. Also, no app is going to reward you for having the maximum number of bookmarks. You'll be skimming through hundreds of articles while actually gaining very little knowledge. Learn to focus, buddy; don't overwhelm yourself. Focus!!!&lt;/li&gt;
&lt;li&gt;A lot of the time, you will feel like an imposter, especially in the early days of your career. People in our industry often complain about &lt;a href="https://www.verywellmind.com/imposter-syndrome-and-social-anxiety-disorder-4156469"&gt;imposter syndrome&lt;/a&gt;, so go learn about it. Know that many people, no matter their experience level, have to deal with some degree of imposter syndrome. There is no one who knows it all; 10X engineers aren't real. Take on harder tasks and don't shy away because you are a beginner. You will struggle, but it will build your confidence; that's the only way to deal with it. After three years, you will want to write about the things you have learnt. There will be lots of self-doubt and you will procrastinate because of it, but write and publish anyway!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;span&gt;Cover image credits - &lt;a href="https://unsplash.com/@saulomohana?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Saulo Mohana&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

</description>
      <category>career</category>
      <category>beginners</category>
      <category>learning</category>
    </item>
    <item>
      <title>Understanding topological sort by cooking some Biryani</title>
      <dc:creator>Ankit Utekar</dc:creator>
      <pubDate>Sun, 21 Jun 2020 10:56:18 +0000</pubDate>
      <link>https://forem.com/ankitutekar/understanding-topological-sort-by-cooking-some-biryani-43c3</link>
      <guid>https://forem.com/ankitutekar/understanding-topological-sort-by-cooking-some-biryani-43c3</guid>
      <description>&lt;p&gt;Topological sort is an algorithm I wish I had programmed in my brain. So many tasks to do every day, with so much dependencies among them. It would've definitely improved my productivity without being dependent on any external tool.&lt;/p&gt;

&lt;h5&gt;
  
  
  Disclaimer:
&lt;/h5&gt;

&lt;p&gt;The example of chicken Biryani is chosen just to make it sound fun. I have tried to generalize the steps and removed details. The steps were taken from the first link Google gave me, and there are multiple variations of this recipe. If you don’t know what Biryani is, visit &lt;a href="https://en.wikipedia.org/wiki/Biryani"&gt;here&lt;/a&gt;. If you ever decide to cook some Biryani, this is not the post you should be referring to. I guess you know what happens when we start development with vague requirements, right?&lt;/p&gt;

&lt;p&gt;Alright then, let's start cooking some chicken Biryani.&lt;/p&gt;

&lt;h3&gt;
  
  
  Topological sort:
&lt;/h3&gt;


&lt;ul&gt;
&lt;li&gt;Topological sort is an algorithm used for ordering the vertices of a graph. It outputs a linear ordering of the vertices based on their dependencies, where dependencies are represented as edges of the graph.&lt;/li&gt;
&lt;li&gt;It works only on Directed Acyclic Graphs (DAGs) - graphs whose edges indicate a direction, so the vertices have a one-way relationship among them. Also, there shouldn’t be any cyclic dependency among the vertices.&lt;/li&gt;
&lt;li&gt;Some common applications of this algorithm in the world of computer science include- 
&lt;ul&gt;
&lt;li&gt;Scheduling jobs&lt;/li&gt;
&lt;li&gt;Instruction scheduling&lt;/li&gt;
&lt;li&gt;Deciding the compilation sequence through makefiles&lt;/li&gt;

&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The problem:
&lt;/h3&gt;

&lt;p&gt;Suppose we were making Biryani and wanted someone to tell us the linear ordering of steps in its recipe, i.e. a sequence of steps that can be followed by taking one step at a time. The steps in making Biryani are interdependent, just like in any other recipe. The generalized process of cooking chicken Biryani (from the link that I used) is outlined below -
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Buy all the required ingredients.&lt;/li&gt;
&lt;li&gt;Soak some rice.&lt;/li&gt;
&lt;li&gt;Marinate chicken.&lt;/li&gt;
&lt;li&gt;Prepare layering, i.e. fried Onions, mint leaves, etc.&lt;/li&gt;
&lt;li&gt;Cook the rice.&lt;/li&gt;
&lt;li&gt;Cook chicken.&lt;/li&gt;
&lt;li&gt;Add layering on chicken.&lt;/li&gt;
&lt;li&gt;Put the rice on top of it.&lt;/li&gt;
&lt;li&gt;Serve!&lt;/li&gt;
&lt;/ul&gt;
 

&lt;p&gt;This problem sounds trivial, but real-world applications of this algorithm can involve hundreds of steps (vertices) and much more complex dependencies. For the purpose of this post, we will keep it short and easy to understand.&lt;/p&gt;
&lt;h3&gt;
  
  
  The algorithm:
&lt;/h3&gt;

&lt;p&gt;There are multiple topological sorting &lt;a href="https://en.wikipedia.org/wiki/Topological_sorting#Algorithms"&gt;algorithms&lt;/a&gt; to approach this problem. One of these algorithms is &lt;a href="https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm"&gt;Kahn's algorithm&lt;/a&gt;, which is what I have chosen to explain in this post. &lt;/p&gt;

&lt;p&gt;The algorithm works in the below steps:
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Find a vertex which doesn’t have any dependency, i.e. the first task that can be performed.&lt;/li&gt;
&lt;li&gt;Remove that vertex from the graph, removing the dependencies (edges) that other vertices had on it, and add it to the result list.&lt;/li&gt;
&lt;li&gt;Repeat the above two steps for the rest of the graph.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  Pseudocode can be written as:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RS &amp;lt;- List to hold final ordered vertices
PQ &amp;lt;- A queue to hold vertices chosen for processing after 
they are removed from the graph

while PQ is not empty:
    -Current &amp;lt;- Dequeue a vertex from PQ
    -Add Current to RS
    for every vertex n dependent on Current:
        -Remove edge Current -&amp;gt; n
        -if n has no other incoming edge, enqueue it to PQ

if graph still has edges then
    return error (graph is not a DAG)
else
    return RS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This uses the &lt;a href="https://en.wikipedia.org/wiki/Breadth-first_search"&gt;Breadth First Search&lt;/a&gt; approach. There is also a &lt;a href="https://en.wikipedia.org/wiki/Topological_sorting#Depth-first_search"&gt;Depth First Search&lt;/a&gt; variation of this algorithm. &lt;br&gt;
Basically, the main goal of this algorithm is to make sure that any step 'y' that depends on a step 'x' is performed after 'x'.&lt;/p&gt;

&lt;p&gt;Alright then, our Biryani making problem can be solved by algorithm above. Let's put the steps in graph form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6uV3WFMl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/aj7hnmecfmbf14hwf4gx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6uV3WFMl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/aj7hnmecfmbf14hwf4gx.png" alt="Alt Text" width="709" height="549"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We start with a vertex that is not dependent on anything else. In our case, it is 'Buy Ingredients (BI)'. We add it to the processing queue and start the loop.&lt;/li&gt;
&lt;li&gt;We remove the first vertex from the processing queue - currently only 'BI' is present - and add it to the result list.&lt;/li&gt;
&lt;li&gt;We then remove the edges BI-&amp;gt;SRC, BI-&amp;gt;PL and BI-&amp;gt;MCH from the graph. What 'removing the edge' actually means depends on your implementation. In my implementation (linked below), I maintain an in-degree counter at each vertex 'x' to track the number of vertices 'x' is dependent on, along with an adjacency list at 'x' indicating the next steps that can be taken from it. Each time a vertex 'v' is added to the result list, the in-degrees of the vertices that were dependent on 'v' are decremented by one.&lt;/li&gt;
&lt;li&gt;For all three vertices (SRC, PL, MCH), BI was the only dependency. Now that BI has been added to the result, these three vertices are added to the processing queue.&lt;/li&gt;
&lt;li&gt;The above steps are repeated until the processing queue is empty.&lt;/li&gt;
&lt;/ul&gt;
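&lt;p&gt;The steps above can be sketched in code. Below is a minimal Python sketch of the BFS-based (Kahn's) algorithm - the repository linked below contains the actual C# implementation, and the example graph here includes only the edges named in the text (the full recipe graph is in the image above):&lt;/p&gt;

```python
from collections import deque

def topological_sort(graph):
    """Kahn's algorithm. 'graph' maps each vertex to the list of
    vertices that depend on it (its outgoing edges)."""
    # In-degree counter: how many vertices each vertex depends on.
    in_degree = {v: 0 for v in graph}
    for v in graph:
        for w in graph[v]:
            in_degree[w] += 1

    # Start with vertices that do not depend on anything.
    queue = deque(v for v in graph if in_degree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        # "Remove" v's outgoing edges by decrementing in-degrees;
        # a vertex whose last dependency was just processed joins the queue.
        for w in graph[v]:
            in_degree[w] -= 1
            if in_degree[w] == 0:
                queue.append(w)
    return order

# Only the edges named in the text: BI precedes SRC, PL, and MCH.
steps = {"BI": ["SRC", "PL", "MCH"], "SRC": [], "PL": [], "MCH": []}
print(topological_sort(steps))  # ['BI', 'SRC', 'PL', 'MCH']
```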

&lt;p&gt;I have depicted the processing for the above graph in the table below. If it's hard to read, the same table has been uploaded to the repository linked below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VunzRwmK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/94r4eepi61tb14doikpz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VunzRwmK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/94r4eepi61tb14doikpz.png" alt="Alt Text" width="880" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The implementation of this problem can be found in my repository below -&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--566lAguM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/ankitutekar"&gt;
        ankitutekar
      &lt;/a&gt; / &lt;a href="https://github.com/ankitutekar/TopologicalSortOfBiryaniRecipe"&gt;
        TopologicalSortOfBiryaniRecipe
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;h1&gt;
TopologicalSortOfBiryaniRecipe&lt;/h1&gt;
&lt;p&gt;Topological sort in C# for sorting the steps in making Biryani. This is a supplementary implementation for the blog that I have written - &lt;a href="https://dev.to/ankitutekar/understanding-topological-sort-by-cooking-some-biryani-43c3" rel="nofollow"&gt;https://dev.to/ankitutekar/understanding-topological-sort-by-cooking-some-biryani-43c3&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/ankitutekar/TopologicalSortOfBiryaniRecipe"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;The core of the topological sort algorithm can be found in the repository linked above.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  Some facts about topological sort:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;It only works on directed acyclic graphs (DAGs). If a graph has cycles, it can't have a valid topological sort. &lt;/li&gt;
&lt;li&gt;A graph can have more than one valid topological sort. In the above example, if we had processed vertex MCH (Marinate Chicken) before vertex PL (Prepare Layering), the ordering would have been a bit different (and still valid). &lt;/li&gt;
&lt;li&gt;Every DAG has at least one valid topological sort. &lt;/li&gt;
&lt;li&gt;The time complexity of this algorithm is O(|V| + |E|), where |V| is the number of vertices in the graph and |E| is the number of edges. We visit every vertex once and spend an additional O(|E|) time calculating in-degrees.&lt;/li&gt;
&lt;/ul&gt;
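&lt;p&gt;The first fact can be checked with the BFS-based algorithm itself: in a cyclic graph, the vertices on a cycle never reach an in-degree of zero, so the processing queue empties before every vertex has been processed. A minimal Python sketch (the vertex names are made up):&lt;/p&gt;

```python
from collections import deque

def has_valid_topological_sort(graph):
    """Returns True only for a DAG: vertices on a cycle never reach
    in-degree zero, so they are never added to the processing queue."""
    in_degree = {v: 0 for v in graph}
    for v in graph:
        for w in graph[v]:
            in_degree[w] += 1
    queue = deque(v for v in graph if in_degree[v] == 0)
    processed = 0
    while queue:
        v = queue.popleft()
        processed += 1
        for w in graph[v]:
            in_degree[w] -= 1
            if in_degree[w] == 0:
                queue.append(w)
    # If some vertices were never processed, they sit on a cycle.
    return processed == len(graph)

print(has_valid_topological_sort({"a": ["b"], "b": ["c"], "c": []}))    # True
print(has_valid_topological_sort({"a": ["b"], "b": ["c"], "c": ["a"]})) # False (cycle)
```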

&lt;p&gt;Feel free to ping me if you have any doubts. Also, do try Biryani at least once!&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>computerscience</category>
      <category>datastructure</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Dependency Injection : The what and whys</title>
      <dc:creator>Ankit Utekar</dc:creator>
      <pubDate>Sat, 11 Apr 2020 11:12:59 +0000</pubDate>
      <link>https://forem.com/ankitutekar/dependency-injection-the-what-and-whys-240m</link>
      <guid>https://forem.com/ankitutekar/dependency-injection-the-what-and-whys-240m</guid>
      <description>&lt;p&gt;Dependency Injection is one of those terms that frightened me when I started my career in software development. People used to throw it around with some DI framework's name attached, which made it even more intimidating. When you are a junior working on industrial projects, most of the basic configuration is already set up for you to code in. Most of the time, you will get your job done without understanding the low-level configuration. But you should make sure you spend some time understanding these things every once in a while.&lt;/p&gt;

&lt;p&gt;Anyway, in this post I aim to share what I have learned about DI. I am assuming that you are familiar with basic OOP concepts such as classes, interfaces, and constructors. So, let's get started, shall we?&lt;/p&gt;

&lt;h1&gt;
  
  
  What is a dependency?
&lt;/h1&gt;

&lt;p&gt;Let's consider an example - say you are working on an e-commerce project that has functionality for creating product orders (quite surprising, right!). Say you have a class named OrdersApi.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrdersApi&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="cm"&gt;/*
 Has methods to create order, retrieve order by ID, Delete an order, etc.
*/&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Suppose you have another service for calculating delivery dates. Delivery date calculation involves some data-science work, hence it's abstracted away in another service. You also have some DB connectivity. You have written separate classes in your orders service for handling these two functionalities (modularity!!!) - OrdersRepository (for making DB calls) and DeliveryService (for making HTTP requests to the delivery date calculation service). These two classes are dependencies of our OrdersApi class. &lt;/p&gt; &lt;p&gt;Let's say you decide to log every request to OrdersApi in a log file. You don't want to repeat the logging configuration code in every API it is used in, so you put it in a new file - Logger - making it another dependency of OrdersApi.&lt;/p&gt;

&lt;p&gt;Note that the example I am using for this post is based on OOP concepts, but DI is not limited to OOP languages. For example, the components and helper files in your React project are also dependencies.&lt;/p&gt;

&lt;h1&gt;
  
  
  Different ways of supplying dependencies:
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. new() it up everywhere:
&lt;/h2&gt;

&lt;p&gt;The simplest way of supplying these dependencies to our OrdersApi is to instantiate them inside OrdersApi -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrdersApi&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;dbConnectionString&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"your://db-connection-string"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;deliveryServiceAddress&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"their://delivery-service-address"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;OrdersRepository&lt;/span&gt; &lt;span class="n"&gt;ordersRepository&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;OrdersRepository&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dbConnectionString&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;DeliveryService&lt;/span&gt; &lt;span class="n"&gt;deliveryService&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;DeliveryService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deliveryServiceAddress&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="n"&gt;Logger&lt;/span&gt; &lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Logger&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This just works! But there are some problems associated with it - &lt;/p&gt;
&lt;ul&gt;

&lt;li&gt;

Our OrdersApi is now tightly coupled with its dependencies. Along with its own responsibilities (i.e. orders management), OrdersApi needs to know how to initialize these dependencies. This is a violation of the &lt;a href="https://en.wikipedia.org/wiki/Single-responsibility_principle"&gt;Single Responsibility Principle&lt;/a&gt;.
&lt;/li&gt;

&lt;li&gt; Tightly coupled classes make unit testing difficult. When writing unit tests for OrdersApi, OrdersApi should be the only module targeted by the test cases. I should be able to mock its dependencies (mocking mimics the behavior of dependencies without instantiating the actual dependencies).&lt;/li&gt;

&lt;li&gt; This reduces maintainability. Changes in lower-level modules (i.e. Logger, OrdersRepository, DeliveryService) shouldn't cause any changes in higher-level modules (i.e. our OrdersApi class). I should be able to change the logging library in the Logger class without having to update the other classes that depend on it. My OrdersApi shouldn't worry about the type of DB being used. With this design, there is a high chance of frequent changes in your application because:&lt;ul&gt; &lt;li&gt; There is no contract (interfaces!) being shared between dependencies&lt;/li&gt; &lt;li&gt;Dependent modules are responsible for instantiating their dependencies, so they need to be aware of the concrete implementations of those dependencies.&lt;/li&gt;
&lt;/ul&gt;

This breaks the &lt;a href="https://stackoverflow.com/questions/3058/what-is-inversion-of-control"&gt;Inversion of Control&lt;/a&gt; principle.
&lt;/li&gt;

&lt;/ul&gt;
&lt;h2&gt;
  
  
  2. Let the consumer supply it:
&lt;/h2&gt;

&lt;p&gt;So we will ask the consumer of OrdersApi to supply these dependencies by accepting them as constructor parameters in OrdersApi -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrdersApi&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;OrdersApi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OrdersRepository&lt;/span&gt; &lt;span class="n"&gt;_ordersRepository&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DeliveryService&lt;/span&gt; &lt;span class="n"&gt;_deliveryService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Logger&lt;/span&gt; &lt;span class="n"&gt;_logger&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ordersRepository&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_ordersRepository&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deliveryService&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_deliveryService&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_logger&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This solution doesn't really resolve the issues mentioned above - it just transfers them to the consumer of OrdersApi. In this way, we keep creating a chain, making our code more complicated and harder to maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Let some third party take over!
&lt;/h2&gt;

&lt;p&gt;What if there were someone to take care of all these dependencies? Someone to make sure that all dependents are provided with their dependencies, managing proper ordering, letting dependents do their job without having to know the concrete implementations of their dependencies, and letting us (the developers) manage our code more effectively? That someone is called a Dependency Injection framework!&lt;br&gt;
We design our code so that dependent modules do not use their dependencies directly; they use abstractions (i.e. dependencies and dependents share some sort of contract they adhere to - interfaces!). Then we tell the DI framework who depends on what and who implements which contract. With this type of implementation, our OrdersApi will look like this -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OrdersApi&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;OrdersApi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IOrdersRepository&lt;/span&gt; &lt;span class="n"&gt;_ordersRepository&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IDeliveryService&lt;/span&gt; &lt;span class="n"&gt;_deliveryService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ILogger&lt;/span&gt; &lt;span class="n"&gt;_logger&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ordersRepository&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_ordersRepository&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deliveryService&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_deliveryService&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_logger&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, IOrdersRepository, IDeliveryService, and ILogger are contracts (interfaces). We register these configurations in the DI container at application startup. The syntax, and which file to register them in, depends on the language we are working with and the DI framework we are using. Also, it is not necessary to always use a framework - people sometimes roll their own variation of a DI framework, and some languages provide built-in support for handling these kinds of issues. The idea is the same, and the issues addressed are similar to those described above.&lt;br&gt;
The example below shows a snippet from the Startup.cs file of an ASP.NET Core application -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;ConfigureServices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IServiceCollection&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddSingleton&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ILogger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Logger&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;
    &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddScoped&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;IDeliveryService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DeliveryService&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;
    &lt;span class="n"&gt;services&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddTransient&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;IOrdersRepository&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OrdersRepository&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Having a dedicated entity to handle dependencies makes our job easier in the following ways - &lt;/p&gt;
&lt;ul&gt;

&lt;li&gt;Higher-level modules are no longer responsible for their lower-level dependencies. They will always be provided with the required dependencies by the DI container.&lt;/li&gt;

&lt;li&gt;Coupling is reduced and the testability of our code is increased! We can safely mock the behavior of dependencies, making it easy to test modules in isolation.&lt;/li&gt;

&lt;li&gt;Since we make sure there is no direct dependency and everything depends on contracts and abstractions, minimal changes are required if you decide to update the implementation of a lower-level dependency or replace it with something else. This increases the maintainability of our project.&lt;/li&gt;

&lt;li&gt;Using a dedicated DI framework comes with additional benefits, such as configuring the lifetime of a particular dependency. As you can see in the code snippet above, AddSingleton, AddScoped, and AddTransient are the different lifetimes supported by the framework; they control how many instances are created during a particular request and how they are provided to different parts of the application.
&lt;/li&gt;
&lt;/ul&gt;
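&lt;p&gt;To make the idea of rolling your own DI concrete, here is a toy container sketched in Python (the names are made up for illustration - this is not the ASP.NET Core container used in the snippet above). It registers one factory per contract and caches singleton instances:&lt;/p&gt;

```python
class Container:
    """A toy DI container: register one factory per contract;
    resolve() builds the object, and singleton instances are cached."""
    def __init__(self):
        self._factories = {}
        self._singletons = {}

    def add_transient(self, contract, factory):
        self._factories[contract] = factory

    def add_singleton(self, contract, factory):
        self._factories[contract] = factory
        self._singletons[contract] = None  # not built yet

    def resolve(self, contract):
        if contract in self._singletons:
            if self._singletons[contract] is None:
                self._singletons[contract] = self._factories[contract](self)
            return self._singletons[contract]
        return self._factories[contract](self)

# Made-up stand-ins for the classes from the article:
class Logger:
    def log(self, message):
        print(message)

class OrdersApi:
    def __init__(self, logger):
        self.logger = logger

container = Container()
container.add_singleton("ILogger", lambda c: Logger())
container.add_transient("OrdersApi", lambda c: OrdersApi(c.resolve("ILogger")))

api1 = container.resolve("OrdersApi")
api2 = container.resolve("OrdersApi")
print(api1 is api2)                # False - transient: a new instance each time
print(api1.logger is api2.logger)  # True  - singleton: one shared Logger
```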

&lt;p&gt;So that's all I have to say about this topic. Personally, learning about DI has really benefited me. It exposed me to different concepts such as IoC, SRP, mocking in unit testing, testability, and the use of interfaces in real-world projects. It has definitely made me a better programmer.&lt;/p&gt;

&lt;p&gt;So how do you handle dependencies in your project? Do share it below. Also, feel free to mock me(lol) in the comments section.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>learning</category>
      <category>programming</category>
    </item>
    <item>
      <title>Explain package-lock.json like I am five</title>
      <dc:creator>Ankit Utekar</dc:creator>
      <pubDate>Wed, 30 Jan 2019 14:52:13 +0000</pubDate>
      <link>https://forem.com/ankitutekar/explain-package-lockjson-like-i-am-five-3i8a</link>
      <guid>https://forem.com/ankitutekar/explain-package-lockjson-like-i-am-five-3i8a</guid>
      <description>&lt;p&gt;What is the point of having the same dependency tree? And how does the dependency tree actually work? Should I commit package-lock.json every time I add a new package? How is it related to the symbols (e.g. ^) that are put before package versions in package.json?&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>help</category>
      <category>javascript</category>
      <category>explainlikeimfive</category>
    </item>
    <item>
      <title>When to snapshot ?</title>
      <dc:creator>Ankit Utekar</dc:creator>
      <pubDate>Thu, 24 Jan 2019 02:54:22 +0000</pubDate>
      <link>https://forem.com/ankitutekar/when-to-snapshot--3di9</link>
      <guid>https://forem.com/ankitutekar/when-to-snapshot--3di9</guid>
      <description>&lt;p&gt;I am having a hard time understanding valid use cases for &lt;a href="https://jestjs.io/docs/en/snapshot-testing"&gt;Jest snapshot testing&lt;/a&gt;.&lt;br&gt;
There are lots of opinionated posts explaining snapshot testing.&lt;br&gt;
Have you used snapshot tests in your projects? If yes, for what kinds of components (functional vs. stateful with lots of logic, components that are edited frequently vs. components that aren't)? If not, why? &lt;/p&gt;

</description>
      <category>help</category>
      <category>discuss</category>
      <category>javascript</category>
      <category>testing</category>
    </item>
    <item>
      <title>What is the *legit* way to get the current date and time of system ?</title>
      <dc:creator>Ankit Utekar</dc:creator>
      <pubDate>Sun, 13 Jan 2019 11:11:44 +0000</pubDate>
      <link>https://forem.com/ankitutekar/what-is-the-legit-way-to-get-the-current-date-and-time-of-system--mg9</link>
      <guid>https://forem.com/ankitutekar/what-is-the-legit-way-to-get-the-current-date-and-time-of-system--mg9</guid>
      <description>&lt;p&gt;Yes, the title is a bit weird, but read on - I have an interesting story to tell. &lt;br&gt;
   Recently at work, I was tasked with implementing a timer. It was fairly simple - at least, that's what I thought. The user performs some action -&amp;gt; an event is fired to the backend -&amp;gt; the backend does some computations, saves the data along with the current date-time, and returns a response including the recorded date-time -&amp;gt; the browser starts the timer after the response has been received. The timer was supposed to run for 1 hour from the time the event happened in the backend.&lt;br&gt;
    The problem was, my system clock was set to run 8 seconds behind the actual clock. As a result, the timer would count an extra 8 seconds.&lt;br&gt;
We had server-side validations, so even though the user could perform the action after the timer had ended, we were fine.&lt;br&gt;
    My question is: is there any better alternative to relying on the client system for the current date-time? Do you use server-side timings or third-party services for these things? If yes, what about network latency? How do systems like online examination systems handle it?&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>help</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
