<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Md. Ishtiaque Zafar</title>
    <description>The latest articles on Forem by Md. Ishtiaque Zafar (@ishtiaque).</description>
    <link>https://forem.com/ishtiaque</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F351710%2Faca91490-5c9f-45b3-ab81-b18a753d1e2a.png</url>
      <title>Forem: Md. Ishtiaque Zafar</title>
      <link>https://forem.com/ishtiaque</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ishtiaque"/>
    <language>en</language>
    <item>
      <title>How to stay one step ahead of errors and downtime as you scale up your business</title>
      <dc:creator>Md. Ishtiaque Zafar</dc:creator>
      <pubDate>Thu, 06 Oct 2022 14:13:20 +0000</pubDate>
      <link>https://forem.com/finnauto/how-to-stay-one-step-ahead-of-errors-and-downtime-as-you-scale-up-your-business-lmm</link>
      <guid>https://forem.com/finnauto/how-to-stay-one-step-ahead-of-errors-and-downtime-as-you-scale-up-your-business-lmm</guid>
      <description>&lt;p&gt;Hey, so you managed to scale your business from an early-stage startup to a fast-growing scale-up. This means your tech has grown a lot since. With an ever-growing system, how do you keep an eye on everything and make sure that it is all running and customers are not facing any downtimes or errors?&lt;/p&gt;

&lt;p&gt;Well, we went through the same phase at FINN, growing from a 30-person startup in 2020 to 350+ people in 2022 while also expanding to the US. This article presents some of my key learnings on how to stay ahead of errors and downtime.&lt;br&gt;
The holy trinity when it comes to minimising downtimes and errors is &lt;strong&gt;logging&lt;/strong&gt;, &lt;strong&gt;monitoring&lt;/strong&gt; and &lt;strong&gt;alerting&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Logging
&lt;/h2&gt;

&lt;p&gt;Simply speaking, logging is the act of keeping logs. Logs are entries of "events" happening in a system, for example: when did your system start, what's the hourly CPU utilisation, which user has logged in to the system, what requests were made to your API server, etc.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why should you log?
&lt;/h3&gt;

&lt;p&gt;Why do we need logging? The answer is: &lt;strong&gt;visibility&lt;/strong&gt;. We want to know what's going on in our system and we want our system to provide that information easily, rather than cracking our heads and making blind guesses as to what might have been the issue. To ensure this, it is our responsibility to make sure that we add proper log entries.&lt;/p&gt;

&lt;p&gt;Logs can also be processed and aggregated to get metrics like requests per hour, response time, as well as interesting business insights, such as how many orders were created in a day or an hour.&lt;/p&gt;
&lt;h3&gt;
  
  
  What should you log?
&lt;/h3&gt;

&lt;p&gt;The key here is consistency. You should have a team-wide/organisation-wide agreement on what to log and ensure that it is adhered to in code. At FINN we have agreed to log:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Timestamp (time) - so we know when it happened&lt;/li&gt;
&lt;li&gt;Log level (level) - so we know what it's meant to express (error, warning, etc)&lt;/li&gt;
&lt;li&gt;Info about source (scope) - so we can find it in the source (e.g. the file name)&lt;/li&gt;
&lt;li&gt;Context - so we can investigate issues (e.g. the order ID)&lt;/li&gt;
&lt;li&gt;Message that tells us what happened in English (message) - so that we can read the logs. This is the usual log message.&lt;/li&gt;
&lt;li&gt;The version of the running service (v) - so that we can tell which version of our software logged the message and debug against that specific version.&lt;/li&gt;
&lt;li&gt;Who is performing the action (actor) - This is important for auditing and traceability purposes. We can easily identify who asked our systems to perform a certain action.&lt;/li&gt;
&lt;/ul&gt;
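&lt;p&gt;Put together, a single entry following this agreement could look like the following (all values here are made up for illustration):&lt;/p&gt;

```json
{
  "time": "2022-10-06T14:13:20.000Z",
  "level": "error",
  "scope": "orders.ts",
  "context": { "orderId": "ord_123" },
  "message": "Failed to persist order",
  "v": "1.4.2",
  "actor": "user_42"
}
```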

&lt;p&gt;It is also very important to ensure that sensitive information is never logged, for example: credentials and Personally Identifiable Information (PII) such as names, email addresses, phone numbers, social security numbers and so on. You can instead log something such as the ID which your system uses to identify a person or order. If you really need to log some of this information, log a hash of it so you can check for equality without exposing the value.&lt;/p&gt;

&lt;p&gt;Another thing to keep in mind is to log only what you need. Logging too much will create noise and increase costs as you will need more log storage.&lt;/p&gt;
&lt;h3&gt;
  
  
  How should you log?
&lt;/h3&gt;

&lt;p&gt;How you log your data is also crucial. If you log everything in plain text, it will be very difficult to process and derive meaning from it. To make lives easier, always log in a format that can be parsed, such as JSON or XML. We chose JSON at our company because most logging libraries support it.&lt;/p&gt;

&lt;p&gt;You should also take advantage of the different log levels and use the correct one for each entry.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;error: Error logs usually come from unrecoverable errors. Things went wrong when we weren't expecting it: some process failed and the system could not complete what it started, for example, you wanted to save a file, but the write to your storage failed.&lt;/li&gt;
&lt;li&gt;warning: A warning indicates something went wrong, but the system survived and managed to complete the process it started. For example, if your system failed to send an email to the customer after order creation.&lt;/li&gt;
&lt;li&gt;info: This level of logging is used to keep track of informational metrics like a successful completion of a request, response times, etc.&lt;/li&gt;
&lt;li&gt;debug: Debug logs are used for logging things that can help you debug your system when things go wrong. It's good practice to use debug logs to keep track of your execution. You can log data every time you complete a step, for example when a user is authenticated, a user's profile is verified, a user's requested item is available, an order is created successfully, or an email is sent to the user.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Implementing what we learnt
&lt;/h2&gt;

&lt;p&gt;Theories without examples and illustrations are boring.&lt;/p&gt;

&lt;p&gt;I'll present our code in TypeScript to give you an example of how we implemented things, but it can easily be implemented in any other language as long as you stick to the principles :)&lt;/p&gt;

&lt;p&gt;We used &lt;a href="https://github.com/winstonjs/winston" rel="noopener noreferrer"&gt;Winston&lt;/a&gt; as the logging library. We also implemented our own class on top of Winston logger to enforce our conventions.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
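&lt;p&gt;Our actual wrapper isn't public, so here is a minimal, dependency-free sketch of the idea: a class that enforces the conventions above on every entry. In our code the class delegates to a Winston logger; the field names follow the agreement described earlier, everything else is illustrative.&lt;/p&gt;

```typescript
// Minimal, dependency-free sketch of a logger wrapper enforcing the
// conventions above. The real FINN class delegates to Winston; this
// stand-in just serialises each entry as one JSON line.
type Level = "error" | "warn" | "info" | "debug";

class ServiceLogger {
  constructor(
    private scope: string,                              // source file or module
    private version: string = "unknown",                // version of the running service
    private sink: (line: string) => void = console.log, // where the lines go
  ) {}

  private write(level: Level, message: string, context: object, actor: string): string {
    const line = JSON.stringify({
      time: new Date().toISOString(),
      level,
      scope: this.scope,
      message,
      context,
      actor,
      v: this.version,
    });
    this.sink(line);
    return line;
  }

  error(msg: string, ctx: object = {}, actor = "system") { return this.write("error", msg, ctx, actor); }
  warn(msg: string, ctx: object = {}, actor = "system") { return this.write("warn", msg, ctx, actor); }
  info(msg: string, ctx: object = {}, actor = "system") { return this.write("info", msg, ctx, actor); }
  debug(msg: string, ctx: object = {}, actor = "system") { return this.write("debug", msg, ctx, actor); }
}

// One logger per module; the scope travels with every entry.
const logger = new ServiceLogger("orders.ts", "1.4.2", () => { /* silenced for the example */ });
const line = logger.info("Order created", { orderId: "ord_123" }, "user_42");
```

The point of the wrapper is that nobody has to remember the conventions: every entry automatically carries time, level, scope, actor and version.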


&lt;h2&gt;
  
  
  Monitoring
&lt;/h2&gt;

&lt;p&gt;Monitoring is keeping an eye on your systems, to see if everything is doing fine. The key to good monitoring is knowing what to monitor. Imagine monitoring a palace: you can’t put security cameras everywhere. It would be too much to handle and would distract you from keeping an eye on the main stuff. Similarly in software, you have to know which parts matter the most: for example, the checkout process is very important for an e-commerce company.&lt;/p&gt;

&lt;p&gt;Once you have identified the crucial parts, there are multiple ways to monitor them: health checks, end-to-end (E2E) tests and log-processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Health Checks
&lt;/h3&gt;

&lt;p&gt;Health checks are the first step towards monitoring. A simple health check just verifies that you can connect to your system from the outside world. During the health check, your system should also try connecting to its dependencies, such as a database, and report whether every critical dependency is available and working. If not, the check should fail, and you immediately know that your system is unhealthy.&lt;/p&gt;
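&lt;p&gt;As a sketch, a health endpoint could look like this in TypeScript (the &lt;code&gt;checkDatabase&lt;/code&gt; probe is a placeholder, not our actual implementation):&lt;/p&gt;

```typescript
// Sketch of a health endpoint: report "ok" only when every critical
// dependency answers. checkDatabase is a placeholder probe; a real one
// would e.g. run `SELECT 1` against the database.
import { createServer } from "node:http";

function checkDatabase(): boolean {
  return true; // replace with a real connectivity check
}

function healthStatus(): { status: "ok" | "degraded"; db: boolean } {
  const db = checkDatabase();
  return { status: db ? "ok" : "degraded", db };
}

const server = createServer((_req, res) => {
  const body = healthStatus();
  // 503 tells the monitoring tool (or load balancer) the instance is unhealthy.
  res.writeHead(body.status === "ok" ? 200 : 503, { "Content-Type": "application/json" });
  res.end(JSON.stringify(body));
});
// server.listen(3000); // GET http://localhost:3000/ now returns the report
```

A monitoring tool then only needs to poll this endpoint and check for a 200.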

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqayy1bnwlj1s8yum64lp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqayy1bnwlj1s8yum64lp.png" alt="At FINN we use Checkly for health checks and general monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  End-to-end tests
&lt;/h3&gt;

&lt;p&gt;Health checks are limited to testing one part of a system at a time; they cannot tell you whether the whole system, comprised of multiple dependent components, works as a whole. End-to-end tests fill that gap. They simulate a human using the system as a whole, such as going through the checkout process to order a car. This tests everything, from the input components on the website, to the backend APIs handling user requests, to the storage that stores the order data. If any test fails, we know that real users must be facing the same issues, and jump into action.&lt;/p&gt;

&lt;p&gt;At FINN we use &lt;a href="https://checklyhq.com" rel="noopener noreferrer"&gt;Checkly&lt;/a&gt; once again to run our scheduled E2E tests. Some teams also use a combination of &lt;a href="https://www.make.com/en" rel="noopener noreferrer"&gt;Make&lt;/a&gt; and &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify&lt;/a&gt; to run scheduled E2E tests on smaller projects.&lt;/p&gt;

&lt;p&gt;One thing I love the most about Checkly is that it allows you to write &lt;strong&gt;Monitoring as Code&lt;/strong&gt; together with Terraform. Article linked at the end.&lt;/p&gt;

&lt;h3&gt;
  
  
  Processing logs
&lt;/h3&gt;

&lt;p&gt;The previous two methods give you overall monitoring and monitoring of critical parts of the system, but not system-wide monitoring. This is where your logs come into play! If you logged errors, warnings and other information, you can process these logs and create beautiful dashboards to keep track of your systems, such as how many errors were logged per hour, how many warnings were logged in a day, the number of requests served successfully, and so on.&lt;/p&gt;

&lt;p&gt;If you’re using the &lt;a href="https://www.serverless.com/framework" rel="noopener noreferrer"&gt;Serverless framework&lt;/a&gt;, using the &lt;a href="https://www.serverless.com/dashboard" rel="noopener noreferrer"&gt;Serverless Dashboard&lt;/a&gt; together with it makes things easier and provides you with ample monitoring to get started.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqr01hi8kov156nuvvzj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqr01hi8kov156nuvvzj.png" alt="One of our dashboards using Serverless Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to go pro, you can always use the mighty AWS CloudWatch and create your own dashboards. For example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5cs7dw4ba73ckn07hdf.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd5cs7dw4ba73ckn07hdf.jpeg" alt="AWS CloudWatch Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS CloudWatch also comes with powerful querying capabilities. To use these, head over to your CloudWatch console &amp;gt; Log Groups &amp;gt; Select your resource &amp;gt; Search Log Group.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
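&lt;p&gt;For example, assuming the structured JSON logs described earlier, filter patterns like these (field names from our convention, values illustrative) match specific entries:&lt;/p&gt;

```
{ $.level = "error" }
{ $.level = "error" && $.scope = "orders.ts" }
```

The first pattern matches every error-level entry in the log group; the second narrows the search down to a single module.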


&lt;p&gt;CloudWatch is much too powerful to be described completely here. Please read AWS’ documentation on &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html" rel="noopener noreferrer"&gt;Pattern and Filter syntax&lt;/a&gt; to know more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alerting
&lt;/h2&gt;

&lt;p&gt;Alerting exists so that you get informed about errors (and maybe warnings) as soon as they happen. Alerting is essentially automation that keeps an eye on your monitored metrics (such as 5xx errors, requests per unit of time, 4xx errors, etc.) and performs some action as soon as those metrics cross a pre-defined threshold.&lt;/p&gt;
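&lt;p&gt;Stripped of the tooling, the core mechanic is a simple threshold check, sketched here (the metric name and threshold are illustrative):&lt;/p&gt;

```typescript
// The mechanics of alerting reduced to a threshold check. In practice
// your monitoring platform evaluates rules like this for you.
interface AlertRule {
  metric: string;    // e.g. 5xx responses observed per minute
  threshold: number; // fire once the reading exceeds this
}

function shouldAlert(rule: AlertRule, reading: number): boolean {
  return reading > rule.threshold;
}

const rule: AlertRule = { metric: "5xx_per_minute", threshold: 10 };
const fire = shouldAlert(rule, 25);  // exceeds the threshold: someone gets notified
const quiet = shouldAlert(rule, 3);  // below the threshold: all calm
```

The interesting work is in choosing good thresholds and in what happens after `shouldAlert` returns true: a Slack message, a ticket, or a phone call.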

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwm18p78r3olueurby4ck.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwm18p78r3olueurby4ck.png" alt="Basic alerting setp"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most of the teams at FINN have different alerting setups; the one pictured above is one of them. For non-critical incidents, posting on Slack is enough so that the team knows about it and can solve the issue at their own pace. For critical money-making processes, however, we use Opsgenie, which has call escalation policies in place to ensure that critical incidents are responded to within a certain amount of time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;From an end-user’s perspective, one could say that logging, monitoring and alerting do not add much value, because they are not tangible features that the user can see or use. But oh boy, are they important for maintaining a good user experience. Imagine how bad it would look if your customers had to call and notify you that a feature is not working.&lt;/p&gt;

&lt;p&gt;No system is 100% error-free. Rather than trying to predict all the things that can go wrong (which I think becomes a waste of time after some point) and coming up with counter-measures, a better approach IMO is to embrace the errors, be ready and have systems in place to notify you as soon as they happen.&lt;/p&gt;

&lt;p&gt;Please add your thoughts/opinions in the comments below. Would love to know how you are tackling this in your team/organisation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Let’s Connect!
&lt;/h2&gt;

&lt;p&gt;If you liked this article, giving it a clap and sharing it with others would help a lot!&lt;/p&gt;

&lt;p&gt;If you would like to read more on topics such as this, including software engineering, productivity, etc, consider following me! Much appreciated.&lt;/p&gt;

&lt;p&gt;Connect with me on &lt;a href="https://www.linkedin.com/in/ishtiaquezafar/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, or on &lt;a href="https://twitter.com/ZafarIshtiaque" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;, I’m always happy to talk about tech, architecture and software design.&lt;/p&gt;

</description>
      <category>startup</category>
      <category>scaleup</category>
      <category>logging</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>No-code isn’t scalable. Our learnings at FINN going from 1000 toward 100,000 car subscriptions</title>
      <dc:creator>Md. Ishtiaque Zafar</dc:creator>
      <pubDate>Tue, 19 Jul 2022 06:37:55 +0000</pubDate>
      <link>https://forem.com/finnauto/no-code-isnt-scalable-our-learnings-at-finn-going-from-1000-toward-100000-car-subscriptions-50l0</link>
      <guid>https://forem.com/finnauto/no-code-isnt-scalable-our-learnings-at-finn-going-from-1000-toward-100000-car-subscriptions-50l0</guid>
      <description>&lt;p&gt;At FINN, we grew from 0 to 20,000 car subscriptions, expanded to the US, all in just two years. Those statistics bring one word to mind: speed. We were fast, and how we did so is no secret either. If you want to know, read it here:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag__link"&gt;
  &lt;a href="https://medium.com/@ishtiaque/how-a-german-start-up-achieved-4-mn-arr-in-one-year-89fe123e0079" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oko1i9t4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/fit/c/96/96/1%2A-D1L9KYJDj8SD7sg3NeYiQ.png" alt="Ishtiaque Zafar"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://medium.com/@ishtiaque/how-a-german-start-up-achieved-4-mn-arr-in-one-year-89fe123e0079" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;How A German Start-up Achieved 4 Mn € ARR In One Year | by Ishtiaque Zafar | Medium&lt;/h2&gt;
      &lt;h3&gt;Ishtiaque Zafar ・ &lt;time&gt;Apr 9, 2022&lt;/time&gt; ・ 
      &lt;div class="ltag__link__servicename"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hnDHPsJs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/medium-f709f79cf29704f9f4c2a83f950b2964e95007a3e311b77f686915c71574fef2.svg" alt="Medium Logo"&gt;
        Medium
      &lt;/div&gt;
    &lt;/h3&gt;
&lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;p&gt;All that speed came at a cost. The tech relied heavily on no/low-code tools like Airtable, Make (Integromat) and Retool, to name a few. We built a lot of microservices and automation workflows, which enabled us to automate many of the processes revolving around e-commerce and car subscriptions.&lt;/p&gt;

&lt;p&gt;In early 2021, we had our 1000th car subscription and things weren’t slowing down. Around mid-2021 we were selling 70 car subscriptions daily, our record being 242 cars in a day! Imagine that! By the end of 2021 we had crossed 10K subscriptions! Growth-wise it was looking great, but the tech was already at its peak and showing signs of struggle.&lt;/p&gt;

&lt;p&gt;These two years of rapid prototyping and experimenting helped us learn what works well and what doesn’t, but the thing with prototypes is that they are not reliable, they break often and do not scale well.&lt;/p&gt;

&lt;p&gt;Our situation back then looked something like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database (Airtable) limits&lt;/li&gt;
&lt;li&gt;Synchronisation issues&lt;/li&gt;
&lt;li&gt;Overuse of no-code tools like Make for critical business purposes&lt;/li&gt;
&lt;li&gt;Extreme coupling at data level&lt;/li&gt;
&lt;li&gt;Lack of ownership of the data model&lt;/li&gt;
&lt;li&gt;Lack of access and change control process&lt;/li&gt;
&lt;li&gt;Difficult testing process&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As with all prototypes, there comes a time when we realise it's not working for us anymore, and we start building finer products, inspired by those prototypes and the learnings. The latter half of 2021 indicated that FINN had outgrown its low-code prototypes and needed something better.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GrP3Ix8x--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x9mc1htkk4j70f715x6o.png" alt="No-code vs pro-code over time" width="880" height="487"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Image Credits - &lt;a href="https://jasonmorrissc.github.io/post/2022-02-24_no-code/"&gt;https://jasonmorrissc.github.io/post/2022-02-24_no-code/&lt;/a&gt; &lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Mitigating database failures
&lt;/h2&gt;

&lt;p&gt;Airtable was our database of choice in the initial days. The reason: ease of use, easy schema changes, quick rollbacks and snapshot recovery. We chose Airtable because everyone (and not just engineers) could work with it. Soon enough we had a database with more than 40 tables, with the “cars” table having nearly 400 fields! (Highly inefficient, we know.) To add to that, we had dozens of automated workflows reading and changing that data, and soon our database gave up. Airtable comes with a limit of 5 requests per second per base. We were constantly hitting that limit and started having more than 100 timeouts per day.&lt;/p&gt;

&lt;p&gt;We quickly put a squad into action. It was called the “No Time to Timeout” squad, after the recently released James Bond movie “No Time to Die”, with one goal: reduce daily Airtable timeouts to zero. The solution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Identify data that are loosely coupled and move them into separate bases, so that they can be managed independently&lt;/li&gt;
&lt;li&gt;Create read replicas for data that is used company-wide, e.g.: cars and subscription related data&lt;/li&gt;
&lt;li&gt;Identify read-heavy processes and update them to read from read replicas (this introduces a little staleness in data, but will do for now)&lt;/li&gt;
&lt;li&gt;Identify write-heavy processes. For these we had to implement solutions case by case. For example, for car in-fleeting we implemented a diffing function which compares the recent changes sent by car manufacturers with the stored data and only updates the cars whose data has changed. For others, we batch the write operations&lt;/li&gt;
&lt;/ol&gt;
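&lt;p&gt;The diffing idea from step 4 can be sketched like this (the &lt;code&gt;Car&lt;/code&gt; shape and the comparison are simplified, not our production code):&lt;/p&gt;

```typescript
// Sketch of the diffing step: compare incoming manufacturer data with
// what is stored and keep only the cars that actually changed, so we
// spend Airtable writes only where needed.
type Car = { id: string; [field: string]: unknown };

function changedCars(incoming: Car[], stored: Map<string, Car>): Car[] {
  return incoming.filter((car) => {
    const prev = stored.get(car.id);
    // JSON comparison is good enough for a sketch; production code
    // would compare field by field.
    return prev === undefined || JSON.stringify(prev) !== JSON.stringify(car);
  });
}

const stored = new Map<string, Car>([["c1", { id: "c1", color: "red" }]]);
const incoming: Car[] = [
  { id: "c1", color: "red" },  // unchanged: skip the write
  { id: "c2", color: "blue" }, // new: needs a write
];
const toWrite = changedCars(incoming, stored);
```

With thousands of cars reported daily but only a handful actually changing, this kind of filter cuts write traffic dramatically.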

&lt;p&gt;This reduced the timeouts to zero, but it was only our first step towards a stable system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maintaining Trust in Our Data
&lt;/h2&gt;

&lt;p&gt;We started having sync issues. Multiple sources of data, combined with a lack of ownership of the data and no change control process, introduced a lot of data quality issues. Too many people changing the data directly, and partial updates to the data required by certain processes, made everything worse.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ownership
&lt;/h3&gt;

&lt;p&gt;This was the time to promote ownership. Teams were given ownership of certain data sets, e.g. the User Acquisition team owns the leads, Operations owns the cars and subscription management, Finance owns the monetary related data, and so on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Access control
&lt;/h3&gt;

&lt;p&gt;Controlling read and write access to data was tackled by cutting direct database access wherever possible and instead exposing data via micro-services written by the data-owning teams. These APIs were not generic CRUD APIs allowing you to do anything, but rather intent-driven APIs which only exposed a certain part of the dataset, depending on what you needed to do.&lt;/p&gt;
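&lt;p&gt;A toy contrast, with made-up route names: instead of one generic CRUD endpoint accepting arbitrary field updates, each endpoint expresses a single intent and validates only what that intent needs:&lt;/p&gt;

```typescript
// Illustrative intent-driven API surface. Route names and shapes are
// invented for this example; the point is that each handler exposes one
// narrow business intent over the owned data, not the whole dataset.
type Result = { status: number; body: object };

const intents: Record<string, (body: Record<string, unknown>) => Result> = {
  // Instead of PATCH /cars/:id with arbitrary fields:
  "POST /cars/mark-delivered": ({ carId }) =>
    typeof carId === "string"
      ? { status: 200, body: { carId, delivered: true } }
      : { status: 400, body: { error: "carId required" } },

  // Instead of PATCH /subscriptions/:id:
  "POST /subscriptions/activate": ({ subscriptionId }) =>
    typeof subscriptionId === "string"
      ? { status: 200, body: { subscriptionId, active: true } }
      : { status: 400, body: { error: "subscriptionId required" } },
};

const ok = intents["POST /cars/mark-delivered"]({ carId: "car_1" });
const bad = intents["POST /subscriptions/activate"]({});
```

Because each intent is narrow, the owning team can attach exactly the validations and side effects that intent requires.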

&lt;h3&gt;
  
  
  State machines
&lt;/h3&gt;

&lt;p&gt;We started defining state machines for our critical entities like Car and Subscription. We had to define what data can be changed, when, and under what conditions. The plan was to increase data consistency by making it impossible to change data in a way that would lead to conflicts. For example, marking a subscription as active first requires checking with the Car Management and Deliveries team whether they have actually handed over the car, and with the Finance team whether the deposit has been paid. Here’s one of our drafts for the Subscription state machine:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Uj5HayTe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nh3f9gddexesp77x7ah0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Uj5HayTe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nh3f9gddexesp77x7ah0.png" alt="subscription state machine" width="880" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Going pro-code with project Green Dragon 🐉
&lt;/h2&gt;

&lt;p&gt;The issues mentioned above made it crystal clear that we would not be able to scale. Moreover, we would soon have to expand into the US, and things would only get more intense. We needed a system that was reliable, as we simply could not afford to be in fire-fighting mode all the time. Where would the time for new features come from, if all engineering effort was spent on maintaining the current system?&lt;/p&gt;

&lt;p&gt;We decided to build new systems based on time-tested technologies and move away from the no-code/low-code solutions we had. We laid down some basic Engineering Principles to be followed. This was very challenging, because we had to keep the company and the business running till the new systems could take over.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reduce risk to a minimum
&lt;/h2&gt;

&lt;p&gt;In my colleague Andrea’s words: migrations are never easy, and they hardly ever go as smoothly as we hope. Our goal was to reduce the risk to a minimum so that we could act quickly to respond to problems that might arise.&lt;/p&gt;

&lt;p&gt;We came up with a strategy called “Think 10x”. The idea was to start small, with just 10 cars targeted to be sold, and then scale 10 times in each phase: 10 cars → 100 cars → 1,000 … 100,000 cars. This ambitious project was called “Green Dragon”.&lt;/p&gt;

&lt;p&gt;The first phase, called “MVP 10”, was to focus on one part of the entire car subscription process and to go live with just 10 cars up for subscription from the new system on our website.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enforce good engineering practices
&lt;/h2&gt;

&lt;p&gt;Good systems are built on good principles. They guide us and help us make decisions when we are in doubt. At FINN, we needed good principles for our new phase, the one where we double down on tech and make our critical systems better. Here are the engineering principles we laid down at FINN:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep it simple: we want to solve today’s problem in a simple way with an eye looking at tomorrow, rather than solving tomorrow’s problem today&lt;/li&gt;
&lt;li&gt;Embrace errors: We operate in a legacy environment where failures are common. We don’t want to fight errors, we want to embrace them and design with failure in mind&lt;/li&gt;
&lt;li&gt;Make it visible: Our services must talk to us, tell us what is happening and show us how well they are doing. We don’t want to find things out, we want to be told&lt;/li&gt;
&lt;li&gt;Internalise complexity: We cannot change how our partners work, but we can change how easy we make it for them to work with us&lt;/li&gt;
&lt;li&gt;Customer first: Our customers rely on us to give them the best possible experience, but they also trust us to manage their data and information&lt;/li&gt;
&lt;li&gt;Data is always accessible: we never want to block people from reading our data and we should always provide a way for them to do it without any work&lt;/li&gt;
&lt;li&gt;Clear is better than clever: We strive for readable code that is easy to understand and change. Our goal is not to write the least amount of code, but the clearest. Magic is not our friend here&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once again, credit for laying down these principles goes to &lt;a href="https://www.linkedin.com/in/andreaperizzato/"&gt;Andrea&lt;/a&gt; :)&lt;/p&gt;

&lt;h2&gt;
  
  
  Reducing disruption to a minimum
&lt;/h2&gt;

&lt;p&gt;FINN is very pro-automation, and we didn’t want to disrupt that. Everyone at FINN knows how to work with the automation platform Make (formerly Integromat) and leverages it to automate their day-to-day work. Since the data would no longer be so open, we needed to provide a way for our colleagues to use the new system from the comfort of Make. We built custom apps on Make which enabled our colleagues to use our new core APIs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1Bjtsuvt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/184crqe8is5h4a28g0ce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1Bjtsuvt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/184crqe8is5h4a28g0ce.png" alt="reducing disruption via no-code bridges" width="880" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our colleagues also need custom dashboards showing different data to enable their day-to-day work, such as overseeing and managing daily deliveries, damage management, incident handling and customer care. They need parts of the subscription data, car data, financial data and much more. They also need to be able to perform some actions on this data. For this, we built them dashboards using &lt;a href="https://retool.com/"&gt;Retool&lt;/a&gt;, so that they don’t need direct access to the data as in the old system at FINN. This allowed us to have data validations and checks in place, and to only allow data writes through our intent-driven APIs. This was a great enabler, helping us keep our data consistent while promoting easy access to it at the same time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where are we today?
&lt;/h2&gt;

&lt;p&gt;Mitigating our issues to keep the current system working and to give ourselves some breathing room to focus on developing the new system surely felt something like this famous meme:&lt;br&gt;
&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://gfycat.com/blushingbeautifulaustraliansilkyterrier" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--VkBsYNX---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://thumbs.gfycat.com/BlushingBeautifulAustraliansilkyterrier-size_restricted.gif" height="273" class="m-0" width="341"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://gfycat.com/blushingbeautifulaustraliansilkyterrier" rel="noopener noreferrer" class="c-link"&gt;
          Fixing A Bug In A Production GIF | Gfycat
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Watch and share Fixing A Bug In A Production GIFs on Gfycat
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--e_7jJj7z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://gfycat.com/assets/favicons/favicon-16x16.png" width="16" height="16"&gt;
        gfycat.com
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Thankfully, with excellent teamwork across departments and multiple mitigation squads, we were able to stabilize our current systems. We took the Green Dragon project as an opportunity to show what real tech could do in all its glory. Everyone was super energetic and enthusiastic about developing systems that would support FINN for the next phase of growth! We went live smoothly with the MVP 10 phase on June 28th, 2022, sold our first “Green Dragon” car subscription on June 30th, and we aren’t slowing down. Huge shout out to &lt;a href="https://www.linkedin.com/in/andreaperizzato/"&gt;Andrea Perizzato&lt;/a&gt; for leading this initiative ❤&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next?
&lt;/h2&gt;

&lt;p&gt;We are going flat out as we rebuild the tech that powers FINN. Watch this space for more insights into our tech and the next phases we go live with!&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;No-code is cool and fast, and it got us to where we are today much faster than our competitors, but it won’t get us to where we want to go. We have entered the next phase of growth, where we take what we have learnt over the last two years and make it even better, more robust and more scalable. We have put good practices in place to guide us along the way, and we will continue to embrace the struggle until we reach 100,000 car subscriptions.&lt;/p&gt;




&lt;p&gt;If you liked this article, liking and sharing it with others would help a lot!&lt;/p&gt;

&lt;p&gt;If you would like to read more on topics such as this, software engineering, productivity, etc., consider following me! Much appreciated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/ishtiaquezafar/"&gt;Connect with me on LinkedIn&lt;/a&gt;, I’m always happy to talk about tech, architecture and software design.&lt;/p&gt;

&lt;p&gt;Lastly, FINN is hiring! If you want to be a part of this crazy growth and work alongside super awesome colleagues, consider having a look at our &lt;a href="https://finn.auto/careers"&gt;careers page&lt;/a&gt; or simply reaching out to me :) FINN is very diverse and inclusive, also consider having a look at our &lt;a href="https://jobs.lever.co/finn.auto/d9e72ce9-eb12-45e4-b049-1e55f24adc52?lever-origin=applied&amp;amp;lever-source%5B%5D=Blog%20Article"&gt;Women in Tech&lt;/a&gt; programme ❤&lt;/p&gt;

</description>
      <category>startup</category>
      <category>scalability</category>
      <category>nocode</category>
    </item>
    <item>
      <title>Why I don’t like API frameworks together with Serverless</title>
      <dc:creator>Md. Ishtiaque Zafar</dc:creator>
      <pubDate>Mon, 02 May 2022 10:57:13 +0000</pubDate>
      <link>https://forem.com/finnauto/why-i-dont-like-api-frameworks-together-with-serverless-12m5</link>
      <guid>https://forem.com/finnauto/why-i-dont-like-api-frameworks-together-with-serverless-12m5</guid>
      <description>&lt;p&gt;&lt;em&gt;This blog was originally published &lt;a href="https://medium.com/@ishtiaque/why-i-dont-like-api-frameworks-together-with-serverless-39200063b4e0" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every time I create a new API that is meant to be serverless, I consider, or am asked to consider, using an API framework together with a serverless application framework like &lt;a href="https://www.serverless.com/framework" rel="noopener noreferrer"&gt;Serverless Framework&lt;/a&gt; or &lt;a href="https://aws.amazon.com/serverless/sam/" rel="noopener noreferrer"&gt;AWS SAM&lt;/a&gt;. So far, none has impressed me. Here’s why:&lt;/p&gt;

&lt;h2&gt;
  
  
  They do not promote modularity
&lt;/h2&gt;

&lt;p&gt;If you have a look at frameworks like &lt;a href="https://github.com/dougmoscrop/serverless-http" rel="noopener noreferrer"&gt;serverless-http&lt;/a&gt; or &lt;a href="https://docs.nestjs.com/faq/serverless" rel="noopener noreferrer"&gt;nestjs + serverless&lt;/a&gt;, or at &lt;a href="https://pythonforundergradengineers.com/deploy-serverless-web-app-aws-lambda-zappa.html" rel="noopener noreferrer"&gt;zappa + flask&lt;/a&gt; for python lovers, all of them follow the same pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Instantiate an app&lt;/li&gt;
&lt;li&gt;Add routes and associated handler functions to the app&lt;/li&gt;
&lt;li&gt;Add middlewares for authorisation, logging, etc, in the app itself&lt;/li&gt;
&lt;li&gt;Bundle everything into a single lambda function which gets triggered with the base path &lt;code&gt;ANY /&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
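
&lt;p&gt;Step 4 above is usually wired up with a single catch-all route in the deployment config. A minimal Serverless Framework sketch (the function and handler names here are illustrative):&lt;/p&gt;

```yaml
functions:
  app:
    handler: src/app.handler   # the entire framework app behind one handler
    events:
      - http:
          path: /
          method: any
      - http:
          path: /{proxy+}      # every sub-path also hits the same function
          method: any
```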

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqx7ta3c4d2a08qab09k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqx7ta3c4d2a08qab09k.png" alt="How your deployed code would look with an API framework"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;How your deployed code would look with an API framework&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can see how heavyweight this gets. It’s like packing your whole monolith into a single lambda function and executing it. Heavier code means longer boot times, and hence longer response times, compared to the no-framework approach. Poor lambdas are not meant to be used this way. You might get away with it for small apps, but if your monolith is huge, you are better off with another solution, such as containerisation with ECS.&lt;br&gt;
A better solution, in my view, looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftk235cwsi9slu0erjgoc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftk235cwsi9slu0erjgoc.png" alt="Without any API framework"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Without any API framework&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Every lambda function contains just your business logic and code needed to interact with data sources. Such lightweight, much wow, very fast.&lt;/p&gt;
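
&lt;p&gt;As a sketch of that lighter approach, a single-purpose handler is just the business logic. The route, field names and inline lookup below are illustrative, with the data source stubbed out:&lt;/p&gt;

```javascript
// Handler for one route, e.g. GET /orders/{id}.
// Routing, input validation and auth are assumed to happen upstream
// (API Gateway), so the function body is only business logic.
const handler = async (event) => {
  const id = event.pathParameters.id;

  // Stand-in for a real data-source call (DynamoDB, RDS, ...).
  const order = { id, status: "active" };

  return {
    statusCode: 200,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(order),
  };
};

module.exports = { handler };
```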

&lt;h2&gt;
  
  
  Why not offload overhead actions elsewhere?
&lt;/h2&gt;

&lt;p&gt;An API does not just include your business logic, it needs to have some other must-have functionalities, like routing, input validation and authorisation, to name a few.&lt;/p&gt;

&lt;p&gt;Most of these API frameworks come from times before the serverless revolution, and were designed to create API services to be run on “dumb” infrastructure, dumb in the sense that they were just compute resources, not capable of anything else like routing, request authorisation, etc. Hence, everything had to be done (and written by you!) in the code.&lt;/p&gt;

&lt;p&gt;Going truly serverless means that you break your system into parts and then use readily available managed services which take care of individual responsibilities. For example, routing and input validation can be done by the API Gateway. You can have a custom lambda authoriser for authenticating all your requests. For your business logic, you can have lambda functions small enough to focus on one topic per function.&lt;br&gt;
Have a look below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0i7dvhwhqx8ml7h7ypnp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0i7dvhwhqx8ml7h7ypnp.png" alt="A better serverless system with offloaded responsibilities"&gt;&lt;/a&gt;&lt;/p&gt;
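
&lt;p&gt;For instance, the authentication piece can live in a small custom lambda authoriser of its own. A minimal TOKEN-authoriser sketch, where the hard-coded token comparison stands in for real verification such as JWT validation:&lt;/p&gt;

```javascript
// A Lambda TOKEN authoriser returns an IAM policy that allows or denies
// the request before it ever reaches the business-logic functions.
const authorize = async (event) => {
  // Stand-in check; a real authoriser would verify a JWT or session token.
  const effect =
    event.authorizationToken === "Bearer let-me-in" ? "Allow" : "Deny";

  return {
    principalId: "user",
    policyDocument: {
      Version: "2012-10-17",
      Statement: [
        {
          Action: "execute-api:Invoke",
          Effect: effect,
          Resource: event.methodArn,
        },
      ],
    },
  };
};

module.exports = { authorize };
```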

&lt;h2&gt;
  
  
  Adds unnecessary learning curve
&lt;/h2&gt;

&lt;p&gt;They add an unnecessary learning curve for the team: every new engineer also has to learn the framework. So why introduce one more thing for your team to learn, when writing small functions is the ultimate goal?&lt;/p&gt;

&lt;p&gt;Take Django, for example. It has a steep learning curve and is quite opinionated. Lambda functions were meant to let developers execute arbitrary functions, not entire apps, in a lightweight container-like solution. Read that again: “…functions, not entire apps, in a lightweight container-like…”. Using just the Serverless Framework or AWS SAM gives you far more flexibility and speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Downsides
&lt;/h2&gt;

&lt;p&gt;One downside I have noticed when working with a barebones setup is that the code is non-standard. Each project uses the Serverless Framework, but the code structure differs in every one of them. Here are three different structures I’ve worked with recently:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrcokstj40uv3v06rpoc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhrcokstj40uv3v06rpoc.png" alt="Different folder structures"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see how widely the structure varies, mostly depending on the team, which can add to onboarding time. As mentioned earlier, this is a tradeoff against learning an API framework and the heaviness that comes with it, and it can be tackled by following the same core principles that those API frameworks use.&lt;/p&gt;

&lt;p&gt;Another downside of hyper-granularity is that you end up with many functions that share a significant portion of the same code, and that many more cold starts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In tech, no solution is perfect; every single one comes with tradeoffs, so it’s up to you to decide which ones you can live with.&lt;br&gt;
Got a different opinion? I would love to discuss it. Drop it in the comments below!&lt;/p&gt;




&lt;p&gt;If you liked this article, sharing it with your network would help a lot!&lt;/p&gt;

&lt;p&gt;Connect with me on &lt;a href="https://www.linkedin.com/in/ishtiaquezafar/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://twitter.com/zafarishtiaque" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or &lt;a href="https://medium.com/@ishtiaque" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>api</category>
    </item>
  </channel>
</rss>
