<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ashutosh Sahu</title>
    <description>The latest articles on Forem by Ashutosh Sahu (@ashuto7h).</description>
    <link>https://forem.com/ashuto7h</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F535287%2F71c1b23a-fdb9-4b26-898c-ba6721cd1840.png</url>
      <title>Forem: Ashutosh Sahu</title>
      <link>https://forem.com/ashuto7h</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ashuto7h"/>
    <language>en</language>
    <item>
      <title>Working with circular dependencies in sequelize-typescript</title>
      <dc:creator>Ashutosh Sahu</dc:creator>
      <pubDate>Wed, 19 Feb 2025 14:36:50 +0000</pubDate>
      <link>https://forem.com/ashuto7h/working-with-circular-dependencies-in-sequelize-typescript-1b13</link>
      <guid>https://forem.com/ashuto7h/working-with-circular-dependencies-in-sequelize-typescript-1b13</guid>
      <description>&lt;p&gt;When designing relational databases, it's pretty common to run into tables that just can't stop referencing each other. These circular references can be a bit of a headache, especially when using an ORM like Sequelize. Whether you're dealing with tables that reference themselves or ones that are interdependent, knowing how to manage these circular references is crucial for maintaining a clean and functional database schema.&lt;/p&gt;

&lt;p&gt;In this article, we'll explore what circular references are, why they happen, and how to effectively handle them in Sequelize-typescript.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Are Circular References?
&lt;/h3&gt;

&lt;p&gt;Circular references happen when two or more tables reference each other, creating a loop. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Table A&lt;/strong&gt; has a foreign key pointing to &lt;strong&gt;Table B&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table B&lt;/strong&gt; has a foreign key pointing back to &lt;strong&gt;Table A&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This loop can complicate database operations like inserts, updates, and deletions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Scenarios for Circular References
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Self-Referencing Tables&lt;/strong&gt;: A table references itself, like an &lt;code&gt;Employee&lt;/code&gt; table where each employee has a &lt;code&gt;manager_id&lt;/code&gt; referencing another employee in the same table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interdependent Tables&lt;/strong&gt;: Two or more tables reference each other, like a &lt;code&gt;User&lt;/code&gt; table and a &lt;code&gt;Team&lt;/code&gt; table where a user belongs to a team, and a team has a leader who is a user.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In these cases, making the foreign keys nullable can help manage the relationships better.&lt;/p&gt;





&lt;h3&gt;
  
  
  The Problem
&lt;/h3&gt;

&lt;p&gt;When I started writing models for these circular references in &lt;code&gt;sequelize-typescript&lt;/code&gt;, I ran into an issue right away: the linter flagged a circular dependency problem in the imports.&lt;/p&gt;

&lt;p&gt;For example, I had two models: &lt;code&gt;User&lt;/code&gt; and &lt;code&gt;Team&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// user.model.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Team&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./team.model&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Model&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;...&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;ForeignKey&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;Team&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DataTypes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;allowNull&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;BelongsTo&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;Team&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;team&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;Team&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// team.model.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;User&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./user.model&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Team&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Model&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
  &lt;span class="p"&gt;...&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;ForeignKey&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DataTypes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;allowNull&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="nx"&gt;leaderId&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;HasOne&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;User&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;leader&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;User&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the problem? The &lt;code&gt;User&lt;/code&gt; model needs to import the &lt;code&gt;Team&lt;/code&gt; model, and the &lt;code&gt;Team&lt;/code&gt; model needs to import the &lt;code&gt;User&lt;/code&gt; model. This circular dependency in imports is causing issues and preventing progress.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Search for a Solution
&lt;/h3&gt;

&lt;p&gt;I searched high and low for a way to resolve this dependency problem, but there was no direct solution that fit my needs. Every answer on Stack Overflow seemed to say the same thing: avoid circular dependencies altogether. But my situation was unique.&lt;/p&gt;

&lt;p&gt;Then, I stumbled upon a suggestion to introduce a middleman to break the circular dependencies. For example, if A depends on B and B depends on A, you can introduce C so that A depends on C and B depends on C. &lt;/p&gt;

&lt;p&gt;But how could I split my models any further? 🤔 This was the challenge I needed to tackle. &lt;/p&gt;

&lt;h3&gt;
  
  
  Inheritance to the Rescue
&lt;/h3&gt;

&lt;p&gt;I found a solution using inheritance. Models typically contain two types of data: actual database columns and associations. So, what if I create a separate class for associations that extends the model containing the table columns?&lt;/p&gt;

&lt;p&gt;One limitation with this approach is that it doesn't support deep nested joins in code. However, I see this more as a best practice than a limitation. When defining associations, I only need access to the columns of associated tables, not their associations.&lt;/p&gt;

&lt;p&gt;For example, if I need to access the name of the team leader for a given user, Sequelize allows me to use &lt;code&gt;user.team.leader.name&lt;/code&gt;. But I'm opting out of this convenience. Instead, I'll get the &lt;code&gt;user.team.leaderId&lt;/code&gt; and then fetch the user in a second query using the leaderId.&lt;/p&gt;
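&lt;p&gt;The two-query approach can be sketched like this. The row shapes and the &lt;code&gt;findUserById&lt;/code&gt; lookup below are stand-ins for Sequelize's &lt;code&gt;findByPk&lt;/code&gt;/&lt;code&gt;findOne&lt;/code&gt; calls, not the article's actual models:&lt;/p&gt;

```typescript
// Sketch of the two-step lookup: instead of a deep nested include
// (user.team.leader.name), read leaderId from the first query's result
// and resolve the leader with a second query.

interface TeamRow { id: number; leaderId?: number }
interface UserRow { id: number; name: string; teamId?: number; team?: TeamRow }

function getTeamLeaderName(
  user: UserRow,
  findUserById: (id: number) => UserRow | null, // second-query stand-in
): string | null {
  // The first query already loaded user.team (e.g. via an include).
  const leaderId = user.team?.leaderId;
  if (leaderId == null) return null; // team or leader not set
  const leader = findUserById(leaderId); // second query
  return leader ? leader.name : null;
}

// In-memory stand-in data to exercise the helper:
const users: UserRow[] = [
  { id: 1, name: 'Alice', teamId: 10, team: { id: 10, leaderId: 2 } },
  { id: 2, name: 'Bob', teamId: 10 },
];
const byId = (id: number) => users.find((u) => u.id === id) ?? null;

console.log(getTeamLeaderName(users[0], byId)); // Alice
```

&lt;p&gt;The same pattern applies to any association you would otherwise reach through a deep nested include.&lt;/p&gt;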

&lt;p&gt;This approach solves most of the problem, but I still needed the import for defining &lt;code&gt;@ForeignKey&lt;/code&gt;. Fortunately, I discovered that the decorator isn't necessary if you provide the keys directly to the association decorators. Here's what my solution looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// user.table.ts&lt;/span&gt;
&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;tableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;underscored&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UserTable&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Model&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DataTypes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;allowNull&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nx"&gt;teamId&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// user.model.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;TeamTable&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./team.table.ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;UserTable&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;BelongsTo&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;Team&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;foreignKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;teamId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;targetKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nx"&gt;team&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;TeamTable&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// team.table.ts&lt;/span&gt;

&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;tableName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;team&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;underscored&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TeamTable&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Model&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DataTypes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;allowNull&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nx"&gt;leaderId&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// team.model.ts&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;UserTable&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./user.table.ts&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Team&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;TeamTable&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;HasOne&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;User&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;sourceKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;leaderId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;foreignKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nx"&gt;leader&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;UserTable&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way, I can manage circular dependencies without running into import issues.&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>sequelize</category>
      <category>database</category>
      <category>javascript</category>
    </item>
    <item>
      <title>3. Designing a Microservice: choosing a DB</title>
      <dc:creator>Ashutosh Sahu</dc:creator>
      <pubDate>Mon, 17 Feb 2025 08:03:01 +0000</pubDate>
      <link>https://forem.com/ashuto7h/3-designing-a-microservice-choosing-a-db-3231</link>
      <guid>https://forem.com/ashuto7h/3-designing-a-microservice-choosing-a-db-3231</guid>
      <description>&lt;p&gt;In the previous article, we looked into the characteristics of microservice architecture and the steps involved in their design process, such as domain analysis, bounded contexts, and selecting a communication channel, along with an example based on our food ordering application. This article discusses selecting databases and applying some design patterns and scaling techniques to our microservice architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4. Selecting the Database:
&lt;/h2&gt;

&lt;p&gt;Databases are an essential part of any architecture. While selecting a database, we need to consider various factors like consistency, query performance, cost, scalability, and most importantly - the type of structure we will save in the database. Let's look at different types of databases:&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL (MySQL, PostgreSQL, MSSQL)
&lt;/h3&gt;

&lt;p&gt;SQL databases are used for storing structured data whose records are related to one another. If your service has to store data such as customer information, financial transactions, or inventory, where different pieces of information are related in some way and need to be queried frequently, then you should use an SQL database.&lt;/p&gt;

&lt;h3&gt;
  
  
  NoSQL
&lt;/h3&gt;

&lt;p&gt;NoSQL databases are used for storing unstructured or semi-structured data. They are widely used in web applications for storing user-generated content such as social media posts, comments, and reviews. NoSQL databases come in different types, each with its purpose.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key-Value Store (DynamoDB, Redis, Memcached)&lt;/strong&gt;: Key-value stores hold data structures such as strings, integers, or objects, each tied to a key. They are widely used for caching and session management. Redis is an in-memory database used mainly for caching. DynamoDB is a cloud-native database offered by Amazon Web Services (AWS), designed to handle large amounts of data and traffic while maintaining consistently fast performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document Store (MongoDB, CouchDB)&lt;/strong&gt;: Document stores are used for storing semi-structured data such as JSON documents. They are widely used in web applications for storing user-generated content. If your service needs to store user-related data that is largely unstructured, changes frequently, and needs to scale easily, then MongoDB or CouchDB are good choices.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Column Store (Cassandra, HBase)&lt;/strong&gt;: Column stores are used for storing large amounts of data that can be queried quickly. They are widely used in big data applications for storing and analyzing large datasets. Column stores bring a degree of structure to NoSQL databases, enabling faster queries. If you need to process real-time or high-volume data, you should choose a column store database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graph Store (Amazon Neptune, Neo4j)&lt;/strong&gt;: Graph stores are used for storing data with complex relationships. They are widely used in social networking and recommendation systems. If your service provides recommendations based on users' past activities, then a graph store is the best choice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid (CockroachDB, PostgreSQL)&lt;/strong&gt;: Hybrid databases combine the benefits of SQL and NoSQL databases. They are used for storing structured and unstructured data. CockroachDB is a distributed SQL database that is designed for scalability and high availability. PostgreSQL is a hybrid database that supports both SQL and NoSQL data models.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When deciding which type of database to use, consider the following factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Structure&lt;/strong&gt;: If your data is structured and related, an SQL database is a good choice. If your data is unstructured or semi-structured, a NoSQL database is a better choice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: If you need to scale your database horizontally, a NoSQL database is a better choice. If vertical scaling is enough, an SQL database works well.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Querying&lt;/strong&gt;: If you need to perform complex queries on your data, an SQL database is a better choice. For simple queries, a NoSQL database is often sufficient.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;: If you need strong consistency guarantees, an SQL database is a better choice. If you can tolerate eventual consistency, a NoSQL database is a better choice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: SQL databases are generally more expensive than NoSQL databases. If cost is a concern, a NoSQL database is a better choice.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Let's continue with assigning a database for our food ordering microservice architecture. &lt;br&gt;
For users and restaurant inventory-related data, we can use an SQL database like Postgres. &lt;br&gt;
For managing orders, we can use a NoSQL database like MongoDB.&lt;br&gt;
For real-time delivery tracking and queries, we can use Cassandra or HBase.&lt;br&gt;
For recommendation systems, we can use Neptune as a graph db.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5. Integrating Design Patterns or Techniques
&lt;/h2&gt;

&lt;p&gt;The final step is to apply some design patterns or techniques that can help us solve some common problems or challenges in a microservice architecture. They are not required at the beginning phase, but eventually, you will need them.&lt;/p&gt;

&lt;p&gt;Some common design patterns or practices are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strangler Pattern&lt;/li&gt;
&lt;li&gt;SAGA Pattern&lt;/li&gt;
&lt;li&gt;CQRS Pattern&lt;/li&gt;
&lt;li&gt;API Gateway Routing&lt;/li&gt;
&lt;li&gt;Circuit Breaker Pattern&lt;/li&gt;
&lt;li&gt;Backend for Frontend&lt;/li&gt;
&lt;li&gt;Load balancing&lt;/li&gt;
&lt;li&gt;Consistent Hashing&lt;/li&gt;
&lt;li&gt;Partitioning&lt;/li&gt;
&lt;li&gt;Replication&lt;/li&gt;
&lt;li&gt;Caching&lt;/li&gt;
&lt;li&gt;Sharding&lt;/li&gt;
&lt;li&gt;Rate Limiting&lt;/li&gt;
&lt;li&gt;Locking and Idempotency handling&lt;/li&gt;
&lt;li&gt;Concurrency handling&lt;/li&gt;
&lt;li&gt;Application metrics&lt;/li&gt;
&lt;li&gt;Audit logging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will discuss them in detail later in our series.&lt;/p&gt;

&lt;h2&gt;
  
  
  A few things to remember
&lt;/h2&gt;

&lt;p&gt;There are a few more small things that can help further in the design process.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;It is not always necessary to create a service; if you just need separation of concerns in code, you can create a library or a package instead. For example, suppose part of your codebase designs email templates and generates PDF invoices or Excel reports. These tasks are independent functions in themselves: they just need an input and produce an output, they share common util functions and imports, and sometimes they are needed across different services. For such tasks, it's better to create a library/package rather than a service. A library lets you abstract that part of the code away from your services, and it is also faster to serve because no inter-service communication is involved.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Try to make service communication unidirectional. Your microservice structure should be like a tree, with parent and child services, where parents can send requests to children and get responses; the reverse should be avoided as much as possible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Always use a centralized logging system with a standard format so issues are easy to find across different requests. Use headers like a correlation ID to see all logs related to a single transaction across all services. Also log key data points, like the order ID, so you can retrieve all logs related to them at any point.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stay near the code. There are too many small business details that are never covered in any document; they live only in the code. No person knows them better than your code, so stay near the code or the people writing it. It's easy to draw boxes and design, but harder to implement. Knowing how things like message queues, load balancers, and rate limiting are implemented and work at a low level is equally important.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
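&lt;p&gt;Point 3 above can be sketched minimally. The header name &lt;code&gt;x-correlation-id&lt;/code&gt; and the log format here are assumptions; a real setup would wire this into middleware and a structured logger:&lt;/p&gt;

```typescript
// Minimal correlation-ID logging sketch: reuse the caller's ID when a
// request already carries one, otherwise start a new one, and stamp it
// on every log line so one transaction can be traced across services.

import { randomUUID } from 'crypto';

interface Headers { [name: string]: string | undefined }

// Reuse the incoming correlation ID if present, else generate a fresh one.
function correlationIdFrom(headers: Headers): string {
  return headers['x-correlation-id'] ?? randomUUID();
}

// Standard-format log line: correlation ID, service name, then the message
// (which should include key data points like an order ID).
function logLine(correlationId: string, service: string, message: string): string {
  return `[${correlationId}] [${service}] ${message}`;
}

const cid = correlationIdFrom({ 'x-correlation-id': 'abc-123' });
console.log(logLine(cid, 'order-service', 'order 42 placed'));
// [abc-123] [order-service] order 42 placed
```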




&lt;p&gt;In our next article, we are going to discuss infrastructure estimations for microservice architecture, how to choose the right infrastructure and calculate costs.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>softwareengineering</category>
      <category>softwaredevelopment</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>2. Designing a microservice Architecture</title>
      <dc:creator>Ashutosh Sahu</dc:creator>
      <pubDate>Tue, 21 Jan 2025 06:25:59 +0000</pubDate>
      <link>https://forem.com/ashuto7h/2-designing-a-microservice-architecture-2p12</link>
      <guid>https://forem.com/ashuto7h/2-designing-a-microservice-architecture-2p12</guid>
      <description>&lt;p&gt;In our last article, we saw the differences between microservices and monoliths, along with the advantages and common challenges of both architectures. In this article, we are going to dive deeper into how to design a microservice architecture, using a food delivery application as an example.&lt;br&gt;
There are certain characteristic principles of microservices that serve as a rule set and help us tackle some of the challenges faced in microservices. Beware that these characteristics break some conventions and principles that were previously followed in software engineering. Let's take a look at them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Characteristics to Be Present in a Microservice Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Single responsibility principle
&lt;/h3&gt;

&lt;p&gt;This principle helps greatly in deciding what our architecture should look like. According to the single responsibility principle, a microservice should focus on the single business capability or function that it provides. We should avoid creating services with multiple or unclear responsibilities, and avoid creating services that duplicate or overlap each other's functionality.&lt;/p&gt;

&lt;p&gt;The problem with this rule is that it's hard to follow. As a general rule, we should not create a service that is too small or too large. This conflicts with the single responsibility principle and raises the question: how small or how big should a service be to fulfill this principle?&lt;/p&gt;

&lt;p&gt;At the low level, there is always too much going on. A new requirement at first seems adjustable to an existing service, which also saves implementation cost and time. Slowly these requirements accumulate into a large service handling many different tasks, doing too much beyond its bounds and breaking our single responsibility principle.&lt;/p&gt;

&lt;p&gt;The solution is to gradually split services as we develop new features. For example, suppose you created a single service at the beginning to handle product inventory management and searching. It also stored prices and related discounts, so later you started adding pricing, sales, offers, and discount handling to it. At first, all of this looked like part of inventory management. As the features grew, the pricing logic became complex, based on seasons, analytics, previous sales, and so on. The codebase evolves around these and becomes a much bigger part of the process. Now it's time to separate pricing into a new service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuidi2sfs711lv0v1uvy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuidi2sfs711lv0v1uvy1.png" alt="Splitting services" width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One key factor while deciding on the main services is that we should always keep scope to split a service into smaller services, so that when required we only have to spawn a new service and move half of the functionality to it, rather than ending up in a mess where we need to redesign the whole service architecture.&lt;/p&gt;

&lt;p&gt;The final decision on how to split into different microservices, and how big or small a service should be depends completely on your application and you. There is no hard constraint or rule for it. It's okay to compromise with some principles to gain benefits in other sectors.&lt;/p&gt;

&lt;h3&gt;
  
  
  One Database one service
&lt;/h3&gt;

&lt;p&gt;A microservice should carry its own data and manage its persistence and transactions separately. We should avoid sharing databases or tables among different services, as this can create coupling and inconsistency issues. We should also avoid using foreign key constraints or joins across different services, as this can create performance and scalability issues. This separation also helps with scaling, since you can scale a single database based on its requirements.&lt;br&gt;
This rule breaks the conventions of a single database, where we used to rely on foreign keys and transactions to maintain consistency. We were also able to join different tables under a single database to pull any kind of data we needed.&lt;/p&gt;

&lt;p&gt;In an inconsistent microservice system, anything can go wrong. The following are the most common cases:&lt;br&gt;
 &lt;strong&gt;Case 1&lt;/strong&gt;: You have three different services to manage orders, payments, and inventory. You got an order: it saved the order details and updated the stock in inventory, but the payment failed. All these services have their own databases, so they all run under different transactions. The failure of the payment transaction will not revert the inventory and order transactions, so an order gets placed without a successful payment.&lt;br&gt;
&lt;strong&gt;Case 2&lt;/strong&gt;: Take the same example again: you got an order, updated the inventory to mark an item as sold, and mapped the order ID against that item. But before payment, the user canceled the order, and it was soft deleted from the order service. Now the inventory has an item reserved for an order that does not exist. Foreign keys could have made such situations easier to handle via delete cascades, but they are not available.&lt;/p&gt;

&lt;p&gt;These cases are the most common examples of errors in a microservice system, and they are complex to handle. If we need data consistency and want to avoid such situations, we need to implement the Saga pattern, which coordinates the transactions in different services, using either a choreography or an orchestration approach, and compensates for failures or errors by undoing or reversing the previous transactions.&lt;/p&gt;
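&lt;p&gt;As a rough illustration of the idea (the step and service names here are hypothetical, and this is a sketch, not a full Saga framework), an orchestration-style saga can run each service's local transaction in order and, on a failure, invoke compensating actions for the already-completed steps in reverse order:&lt;/p&gt;

```typescript
// Minimal orchestration-style saga sketch: each step pairs a local
// transaction with a compensating action that undoes it.
interface SagaStep {
  name: string;
  action: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Runs steps in order; on failure, compensates the completed steps in
// reverse order and reports which steps were rolled back.
async function runSaga(
  steps: SagaStep[]
): Promise<{ ok: boolean; compensated: string[] }> {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.action();
      done.push(step);
    } catch {
      const compensated: string[] = [];
      // Undo in reverse order: the most recent step is reverted first.
      for (const s of done.reverse()) {
        await s.compensate();
        compensated.push(s.name);
      }
      return { ok: false, compensated };
    }
  }
  return { ok: true, compensated: [] };
}
```

For example, if hypothetical `createOrder` and `reserveStock` steps succeed but a `chargePayment` step throws, the orchestrator releases the stock and then cancels the order, leaving the system consistent again.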

&lt;p&gt;Similarly, if we frequently need to run join queries across tables owned by different services, we can use the CQRS pattern for that.&lt;br&gt;
We will discuss these patterns in more detail later in our series.&lt;/p&gt;

&lt;p&gt;Again, the decision of whether to use a centralized database or a database per service is up to you. It's not strictly necessary to use a separate database for each service. If your application is small, if its data needs to be highly coupled, or if implementing Saga or CQRS is not affordable, you can go with a centralized database. You can also take a hybrid approach, such as 5 microservices sharing 2 databases: 3 services using db1 and the other 2 services using db2.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;p&gt;A microservice should be able to handle increasing or decreasing demand without affecting its performance or availability. We should be able to scale our services independently from each other by adding or removing instances as needed. We should also be able to scale our services across different dimensions such as load balancing, partitioning, replication, and caching. We will discuss these techniques in detail later in our series.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loosely coupled
&lt;/h3&gt;

&lt;p&gt;Microservices should be able to communicate with each other without knowing too much about their internal details or dependencies. We should use well-defined interfaces that expose only the necessary functionality and hide other implementation details. We should also use standard protocols and formats that enable interoperability and compatibility among different services.&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Process
&lt;/h2&gt;

&lt;p&gt;Now that we have understood the challenges and characteristics of a microservice architecture, let us see how we can design one for our application. The design process involves the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Business-oriented division of services&lt;/li&gt;
&lt;li&gt;Bounded contexts&lt;/li&gt;
&lt;li&gt;Selecting a communication type&lt;/li&gt;
&lt;li&gt;Selecting a database&lt;/li&gt;
&lt;li&gt;Applying Design Pattern / Scaling&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 1. Business-oriented division of services
&lt;/h3&gt;

&lt;p&gt;The first step is to analyze our domain and identify the business capabilities or functions that our application provides or will provide in the future. Each system is built around certain unique features, and these features sometimes have the potential to completely change the architecture. To keep the system stable within a chosen architecture, it's necessary to analyze the domain first.&lt;br&gt;
When gathering domain knowledge, start by building a vocabulary and language around it. In a team, discussions are often verbose, short on detail, and refer to the same thing with different words: terms like slider, carousel, rotating view, dynamic view, and animation frequently get mixed up. We should avoid personal jargon during such discussions and use a simplified, predefined vocabulary.&lt;br&gt;
Documents and diagrams are good, but only up to a point. UML and flowcharts are usually better than explaining something in words. However, during domain analysis the vocabulary grows over time, so each diagram or document should be kept as small as possible and divided into parts. Early on, changes are frequent and tend to touch most of the design; keeping track of all of them and updating one big diagram or document accordingly is hard. Eventually, with a single monolithic document, there comes a point where you can no longer trust it as a source of information, and you have to go through the code and implementation to learn what a piece is supposed to do.&lt;/p&gt;

&lt;p&gt;Let's start designing the food ordering application, by creating a vocabulary. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3c2nmn5p2l4ijylh9b4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3c2nmn5p2l4ijylh9b4.png" alt="Vocabulary" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is just a basic vocabulary example. Notice how with the vocabulary, we fixed up a terminology. We also identified major entities (white circles), some functionalities (gray circles), and a soft relationship between them using arrows. Now we can pick each of these entities separately and identify its attributes, functionalities, and relationships. This analysis could introduce more entities as we go deeper into it. While doing so, maintain small documents dedicated to a specific entity only. &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2. Bounded contexts
&lt;/h3&gt;

&lt;p&gt;After we are done with the domain analysis, we know the different entities, their attributes, relationships, and functionalities in our application well. Now we can define bounded contexts: groupings of related subdomains that share a common language and model. When bounding contexts, start with a smaller set of, say, 2 or 3 contexts. Pick an entity and assign it to a context based on its functionalities. As you do this, you will see opportunities to introduce a new context or to split an existing one. If you are in doubt about whether to allow a new context into the system, don't add it yet, but leave enough room so that it can easily be added in the future. Then assign a microservice to each context. While doing this, keep in check the characteristics that we went through earlier. Ask yourself: Is your microservice going to be scalable? Does it follow the single responsibility principle? Restructure it if you are in doubt. These questions will guide you to a rough high-level design of the system.&lt;/p&gt;




&lt;p&gt;Let's start with 3 contexts for our food-ordering app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Context&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add customer, restaurant, and delivery partner&lt;/li&gt;
&lt;li&gt;Manage profile&lt;/li&gt;
&lt;li&gt;Authentication/authorization&lt;/li&gt;
&lt;li&gt;Restaurant menu/inventory CRUD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ordering Context&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintaining a cart&lt;/li&gt;
&lt;li&gt;Order food&lt;/li&gt;
&lt;li&gt;Payment confirmation&lt;/li&gt;
&lt;li&gt;Cancel order&lt;/li&gt;
&lt;li&gt;Refund/cashback&lt;/li&gt;
&lt;li&gt;Delivery partner / dish rating&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tracking and Navigation Context&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assign a delivery partner to an order&lt;/li&gt;
&lt;li&gt;Track order for customer&lt;/li&gt;
&lt;li&gt;Navigator for delivery partner&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3. Selecting the Communication type
&lt;/h3&gt;

&lt;p&gt;The next step is to choose the communication model for our microservices. There are two main types of communication models: synchronous and asynchronous.&lt;/p&gt;

&lt;h4&gt;
  
  
  Synchronous Communication
&lt;/h4&gt;

&lt;p&gt;Synchronous communication is a real-time communication model, where the sender and the receiver are both active and available at the same time. Synchronous communication is usually implemented using request/response-based protocols, such as HTTP, over TCP/IP or UDP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TCP/IP&lt;/strong&gt;: This protocol is used where reliability and consistency are important. Each data packet sent receives an acknowledgment, and the order in which data is sent is preserved. This makes it consistent but slow. It's widely used for most API calls and is the default transport for a Node.js server.&lt;br&gt;
 &lt;br&gt;
 &lt;strong&gt;UDP&lt;/strong&gt;: This protocol is used for building custom messaging protocols. It guarantees neither delivery nor ordering of data, which makes it fast but also lossy and inconsistent. It's suitable for situations where losing some data is acceptable, like a live video or audio stream, where ensuring earlier data arrived is unnecessary and only the latest real-time data matters. Node.js provides the dgram module for creating a UDP server.&lt;/p&gt;
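&lt;p&gt;As a minimal sketch of the built-in dgram API (the helper names here are illustrative, not part of Node.js), a UDP receiver and a fire-and-forget sender look like this:&lt;/p&gt;

```typescript
import dgram from "node:dgram";

// A UDP receiver: no handshake, no acknowledgments, no delivery guarantee.
// The caller binds the returned socket, e.g. socket.bind(41234).
function createUdpReceiver(onMessage: (text: string) => void): dgram.Socket {
  const socket = dgram.createSocket("udp4");
  socket.on("message", (msg) => {
    // Each datagram arrives as a whole Buffer; order is not guaranteed.
    onMessage(msg.toString());
  });
  return socket;
}

// Sends a single datagram and closes the socket; if the packet is
// lost in transit, the sender is never told.
function sendUdpMessage(port: number, host: string, text: string): void {
  const client = dgram.createSocket("udp4");
  client.send(Buffer.from(text), port, host, () => client.close());
}
```

Note the contrast with TCP: there is no connection to establish and no retry on loss, which is exactly what makes UDP suitable for real-time streams.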

&lt;p&gt;&lt;strong&gt;REST (Representational State Transfer)&lt;/strong&gt;: REST is a widely used architectural style for designing web APIs, based on the principles of statelessness, uniform interface, and resource orientation. REST uses HTTP/1.1 with verbs (GET, POST, PUT, DELETE, etc.) to perform operations on resources, identified by URIs, and exchanges data in various formats, such as JSON, XML, or plain text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;gRPC (Remote Procedure Call)&lt;/strong&gt;: gRPC is a high-performance, open-source framework for RPC communication, developed by Google. gRPC uses HTTP/2 as the underlying transport layer and protocol buffers as the default data serialization format. Protocol buffers are binary, compact, schema-based messages, defined in .proto files, from which native code can be generated for various languages and platforms. gRPC calls are typically faster than REST calls, so gRPC is preferred for inter-service communication.&lt;/p&gt;

&lt;p&gt;Generally, you should use REST for client-to-service communication and gRPC for service-to-service communication. Use synchronous communication where the user expects an immediate response or feedback from the system, such as logging in, placing an order, or running data queries such as retrieving product details, a customer profile, or a report.&lt;/p&gt;

&lt;h4&gt;
  
  
  Asynchronous Communication
&lt;/h4&gt;

&lt;p&gt;Asynchronous communication is a non-blocking communication model, where the sender and the receiver are not required to be active and available at the same time. The sender sends a message and continues with its task, without waiting for the response from the receiver. The receiver processes the message and sends back the response whenever it is ready. Asynchronous communication is usually implemented using message-based protocols, such as AMQP, MQTT, or STOMP, over TCP/IP.&lt;/p&gt;

&lt;p&gt;Some of the common asynchronous messaging mechanisms for microservices are:&lt;br&gt;
 &lt;strong&gt;Message queue&lt;/strong&gt;: A message queue is a data structure that stores and transmits messages between the sender and the receiver, using a FIFO (first-in, first-out) or a priority-based order. A message queue acts as a buffer and a mediator, decoupling the sender and the receiver, and ensuring reliable and durable delivery of the messages. A message queue can have one or more producers and consumers, but each message is delivered to exactly one consumer. Some examples of message queue technologies are Amazon SQS, RabbitMQ, ActiveMQ, and Azure Service Bus.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event stream&lt;/strong&gt;: An event stream is a data structure that stores and transmits messages between the sender and the receiver, using an append-only and immutable order. An event stream acts as a log and a source of truth, capturing the history and the state of the system, and enabling event-driven communication. An event stream can have one or more producers and consumers, and messages are available for all consumers. Some examples of event stream technologies are Apache Kafka, Apache Pulsar, and Amazon Kinesis.&lt;/p&gt;
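&lt;p&gt;The difference between the two delivery models can be sketched with a toy in-memory implementation (hypothetical classes for illustration, not a real broker API): a queue hands each message to exactly one consumer, while a stream keeps an append-only log that every consumer reads at its own offset:&lt;/p&gt;

```typescript
// Queue semantics: each message is consumed by exactly one consumer.
class MessageQueue<T> {
  private buffer: T[] = [];
  publish(msg: T): void {
    this.buffer.push(msg);
  }
  // FIFO, destructive read: once consumed, the message is gone.
  consume(): T | undefined {
    return this.buffer.shift();
  }
}

// Stream semantics: append-only log; every consumer tracks its own
// offset, so the same events are visible to all consumers independently.
class EventStream<T> {
  private log: T[] = [];
  private offsets = new Map<string, number>();
  append(event: T): void {
    this.log.push(event);
  }
  read(consumer: string): T | undefined {
    const offset = this.offsets.get(consumer) ?? 0;
    if (offset >= this.log.length) return undefined;
    this.offsets.set(consumer, offset + 1);
    return this.log[offset];
  }
}
```

In the stream, "consuming" an event only advances that consumer's offset; the log itself is immutable, which is what makes it usable as a source of truth and a replayable history.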

&lt;p&gt;There can be three types of routing strategies/exchanges involved in a message queue: Direct, Fan-out, and Topic.&lt;/p&gt;

&lt;h5&gt;
  
  
  Direct
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;This strategy routes messages based on their routing keys. For example, a direct exchange can bind the queue "order-service" to the routing key "order.created", and send the messages with that routing key to that queue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It has the advantages of simplicity and efficiency, as it delivers the messages to the exact consumers that need them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It has the disadvantages of rigidity and redundancy, as it requires the producers and the consumers to agree on the routing keys, and it does not support broadcasting of the messages.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Fanout
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Fanout routing is a strategy that delivers a message to all the consumers who are subscribed to the message queue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The fanout exchange sends the messages to all the queues regardless of their routing keys. For example, a fanout exchange can send the same message to the queues "order-service", "inventory-service", and "notification-service", regardless of their routing keys.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It has the advantages of flexibility and scalability, as it allows the producers and the consumers to be loosely coupled, and it supports the broadcasting of the messages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It also has the disadvantages of inefficiency and waste, as it delivers the messages to the consumers that may not need them, and it consumes more network and system resources.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Topic
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Topic routing is a strategy that delivers a message to multiple consumers that match the topic of the message. The topic is a string that consists of words separated by dots, and it can represent a hierarchy or a category of the message. For example, a message with the topic "order.created.usa" can represent an order creation event that occurred in the USA.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The topic exchange binds queues to topics and sends each message to the queues whose binding pattern matches the message's topic. The topic exchange can also use wildcards: "*" matches exactly one word, and "#" matches zero or more words. For example, a topic exchange can bind the queue "order-service" to the pattern "order.*", and send the messages with the topics "order.created", "order.updated", and "order.deleted" to that queue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It has the advantages of versatility and granularity, as it allows the producers and the consumers to use different levels of abstraction and specificity for the messages, and it supports filtering and grouping of the messages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It also has the disadvantages of complexity and ambiguity, as it requires the producers and the consumers to understand and follow the topic conventions, and it may cause conflicts or overlaps of the topics.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
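&lt;p&gt;The topic-matching rules used by AMQP-style brokers ("*" matches exactly one word, "#" matches zero or more words, words are dot-separated) can be sketched as a small recursive matcher. This is a simplified illustration of the semantics, not RabbitMQ's actual implementation:&lt;/p&gt;

```typescript
// Returns true if a routing key matches a binding pattern, using
// AMQP-style wildcards: "*" matches exactly one word, "#" matches
// zero or more words.
function topicMatches(pattern: string, routingKey: string): boolean {
  const p = pattern.split(".");
  const k = routingKey.split(".");

  const match = (pi: number, ki: number): boolean => {
    // Pattern exhausted: match only if the key is also exhausted.
    if (pi === p.length) return ki === k.length;
    if (p[pi] === "#") {
      // "#" may consume zero or more of the remaining key words.
      for (let skip = ki; skip <= k.length; skip++) {
        if (match(pi + 1, skip)) return true;
      }
      return false;
    }
    // Key exhausted but pattern still expects a word.
    if (ki === k.length) return false;
    // "*" matches any single word; otherwise words must be equal.
    return (p[pi] === "*" || p[pi] === k[ki]) && match(pi + 1, ki + 1);
  };

  return match(0, 0);
}
```

So a queue bound to "order.*" receives "order.created" but not "order.created.usa", while "order.#" receives both.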

&lt;p&gt;Asynchronous communication is suitable for cases like notifying about a change in the system, such as a new order, a status update, or a payment confirmation. It can also be used between microservices to perform actions like processing tasks, updating the database, or sending async reports via mail, based on events/messages. Asynchronous communication can ensure eventual consistency and fault tolerance: a slow receiver can catch up with a fast sender, even if there are delays or failures in the communication.&lt;/p&gt;




&lt;p&gt;In the next article, we will discuss selecting the database for each of our microservices and applying some design patterns and techniques for better scaling.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>softwaredevelopment</category>
      <category>softwareengineering</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>1. Designing A Microservice Architecture: Microservice vs Monolith</title>
      <dc:creator>Ashutosh Sahu</dc:creator>
      <pubDate>Tue, 21 Jan 2025 05:42:48 +0000</pubDate>
      <link>https://forem.com/ashuto7h/1-designing-a-microservice-architecture-microservice-vs-monolith-4nn8</link>
      <guid>https://forem.com/ashuto7h/1-designing-a-microservice-architecture-microservice-vs-monolith-4nn8</guid>
      <description>&lt;p&gt;Microservices are a popular architectural style that aims to create modular, scalable, and resilient applications by breaking them into small, independent, and loosely coupled services. Each service focuses on a specific business capability and communicates with other services.&lt;/p&gt;

&lt;p&gt;However, designing a microservice architecture is not a trivial task. It involves many challenges and trade-offs that need to be carefully considered and addressed. The process of designing a microservice architecture runs from identifying the business domains and services, to choosing the database and communication models, to applying the best practices and patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not create a monolith?
&lt;/h2&gt;

&lt;p&gt;Before going into details of microservice design, let's look at the traditional monolithic architecture, where the entire application is built as a single unit that runs on a single server.&lt;/p&gt;

&lt;p&gt;Monolithic architecture is a very simple architecture that comes with the following advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Easy deployment&lt;/strong&gt;: A monolithic application can be deployed as a single executable file or directory, which makes the deployment process easier and faster. There is no need to manage multiple services, dependencies, or configurations as in microservices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High performance&lt;/strong&gt;: A monolithic application can provide high performance, as it can avoid the overhead of network communication and data serialization between different services. The application can also leverage the benefits of shared memory, caching, and transactions within a single process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simplicity&lt;/strong&gt;: A monolithic application can be simpler to develop, test, and debug, as it has a single code base and a single development environment. The application can also use common frameworks and libraries that support the entire functionality. There is no need to deal with the complexity of distributed systems, such as latency, inconsistency, concurrency, and failure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cost Effective&lt;/strong&gt;: A monolithic application is cost efficient for multiple reasons like&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced infrastructure complexity as it can run on a single server&lt;/li&gt;
&lt;li&gt;Simplified operations and maintenance as it can be deployed and updated as a single unit&lt;/li&gt;
&lt;li&gt;It saves the operational overhead and costs as it can leverage the benefits of shared memory, caching, and transactions within a single process.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Recently, Amazon Prime Video migrated its audio/video monitoring service from microservices to a monolith architecture, which helped reduce its costs by about 90%. You can read more about it &lt;a href="https://www.primevideotech.com/video-streaming/scaling-up-the-prime-video-audio-video-monitoring-service-and-reducing-costs-by-90" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Going through these advantages of a monolith, it's natural to wonder why we need a microservice architecture in the first place.&lt;/p&gt;

&lt;p&gt;And that's right: not every application needs a microservice architecture. It depends on many factors, such as business requirements, how many resources are required and can be afforded, and how traffic is distributed among the different components of your application. We will identify the factors that help us decide what the right choice is.&lt;/p&gt;

&lt;p&gt;Let's take a look at some common challenges in a monolithic architecture and how microservices help in solving them:&lt;/p&gt;

&lt;h3&gt;
  
  
  Complexity
&lt;/h3&gt;

&lt;p&gt;As a monolith application grows in size and functionality, it becomes harder to understand, maintain, and test. The codebase becomes bloated with interdependent components that have multiple responsibilities and dependencies.&lt;/p&gt;

&lt;p&gt;The developer experience degrades in such a codebase, especially for new developers, who cannot fully explore the codebase and learn every line of code, the business logic, and the hidden behaviors. They don't know whether a function already exists for a task, and so end up recreating it, or what side effects a change to a component used by other modules will cause.&lt;/p&gt;

&lt;p&gt;In a big codebase, loading all the code in an IDE, finding a suitable place to make changes, and running tests/builds on CI/CD, all require too much time and resources.&lt;/p&gt;

&lt;p&gt;Microservices can reduce the code complexity of an application by breaking it down into smaller and simpler units that focus on specific functionalities. Each microservice can be developed, tested, deployed, and maintained independently, which makes the application easier to manage and understand. Microservices also enable better separation of concerns and modularity, which improves the cohesion and coupling of the code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;p&gt;As the application receives more traffic and load, it becomes harder to scale it horizontally (by adding more servers) or vertically (by adding more resources to the existing server). Monoliths grow in size and can consume gigabytes of resources.&lt;/p&gt;

&lt;p&gt;Generally, in such systems, only 30-40% of the code, resources, or components are responsible for most of the traffic, while the rest serve rare tasks that don't happen very often.&lt;/p&gt;

&lt;p&gt;Taking the example of an e-commerce application: it receives many orders and product search queries, and most of the traffic is around these components only. Other components, like customer service and user profile management, don't receive that much traffic. But when a monolith is scaled, it is scaled as a whole unit. Suppose you are getting 5 times more traffic for orders than for customer service, but you scale both the database and the servers to keep serving the traffic. That usually ends up more costly, because more than half of the components of your application are never going to use that many resources.&lt;/p&gt;

&lt;p&gt;In microservices, each component is built independently as a service. so they can be scaled independently according to their demand and resource consumption. This means that only the services that need more resources can be scaled up or down, without affecting the rest of the application. Microservices also enable horizontal scaling, which means adding more instances of the same service to handle more load.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reliability
&lt;/h3&gt;

&lt;p&gt;As the application runs as a single unit, it has a single point of failure: any bug or error can bring down the entire system. The application becomes vulnerable to security breaches and data loss, and the recovery process can become longer and more difficult.&lt;/p&gt;

&lt;p&gt;Microservices can enhance the reliability of an application by making it more resilient to failures and errors. Since each microservice is isolated from the others, a failure in one service does not affect the whole application, as long as there are fallback mechanisms in place. Microservices also enable faster recovery and fault tolerance, as each service can be restarted or replaced without disrupting the entire system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Innovation
&lt;/h3&gt;

&lt;p&gt;As the application is tightly coupled to a specific technology stack, migrating or switching to a new package or library, updating an existing one, or switching to a better design pattern, technology, or framework becomes so hard that it's often easier to leave things as they are. The application becomes outdated and less competitive due to these factors.&lt;/p&gt;

&lt;p&gt;Each microservice can be developed and deployed independently, which reduces the risk of breaking the whole application or introducing bugs. This enables experimentation and exploration, as new features, functionalities, libraries, or frameworks can be added or removed easily without affecting the core services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges in a Microservice
&lt;/h2&gt;

&lt;p&gt;While microservices can address many of the challenges in a monolithic architecture, they also introduce new challenges that must be tackled. Some of the common challenges in a microservice architecture are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;As the application comprises multiple services that run on different servers and communicate over the network, it becomes harder to coordinate, monitor, and troubleshoot. The system becomes distributed and asynchronous, which introduces issues such as latency, inconsistency, concurrency, and partial failure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As the application relies on multiple services that depend on each other, ensuring availability and quality becomes harder. The system becomes vulnerable to network failures, service failures, data corruption, and inconsistency. The system requires robust mechanisms for fault tolerance, error handling, logging, tracing, testing, and security.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the next article, we will go through the process of Designing a microservice architecture, considering various factors, problems, and tradeoffs.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>softwareengineering</category>
      <category>learning</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>Correlating the logs for tracking In NestJs</title>
      <dc:creator>Ashutosh Sahu</dc:creator>
      <pubDate>Sat, 17 Aug 2024 13:55:02 +0000</pubDate>
      <link>https://forem.com/ashuto7h/correlating-the-logs-for-tracking-in-nestjs-98l</link>
      <guid>https://forem.com/ashuto7h/correlating-the-logs-for-tracking-in-nestjs-98l</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Imagine you’re at the helm of an e-commerce juggernaut—a complex system with multiple services orchestrating tasks like managing user carts, processing orders, handling payments, and tracking shipments. It’s a bustling digital marketplace, and transactions flow seamlessly between these services.&lt;br&gt;
But here’s the catch: With so many moving parts, there’s bound to be a hiccup somewhere. Maybe an order fails to process, a payment gateway misbehaves, or a shipment mysteriously vanishes into the digital ether. When these glitches occur, you need to play detective and trace the breadcrumbs back to their source.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Role of Logs
&lt;/h3&gt;

&lt;p&gt;In this intricate dance of microservices, logs become our trusty companions. Each service diligently records its activities, creating a trail of events. But here’s the twist: In a distributed system, these logs converge into a single timeline—a grand mosaic of actions and reactions. It’s like assembling a jigsaw puzzle, where every piece matters.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Quest for Unique Identifiers
&lt;/h3&gt;

&lt;p&gt;When trouble strikes, we reach for our magnifying glass—the transaction ID or order ID. These unique identifiers are like cosmic coordinates, pinpointing the exact moment when things went awry. Armed with this information, we dive into the logs, hoping to unravel the mystery.&lt;br&gt;
But what if the crucial identifier isn’t there? What if it’s missing from the log where the error occurred? Suddenly, we’re lost in a labyrinth, desperately seeking context. Perhaps the error log stares back at us, cryptic and unyielding. We know something’s amiss, but without that golden thread—the transaction ID—we’re left fumbling in the dark.&lt;/p&gt;

&lt;p&gt;And so, we face the conundrum: How do we bridge the gap between logs and reality? How do we find the elusive context that ties it all together? It’s a challenge that haunts every distributed system engineer—the quest for clarity amidst chaos.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;

&lt;p&gt;The solution is to have a &lt;code&gt;correlation ID&lt;/code&gt;.  It’s like a golden thread woven through the fabric of your distributed system. Shared between all services involved in a request or transaction, this ID becomes our beacon. And here’s the magic: By embedding it in every log, we create a breadcrumb trail—a lifeline to the heart of the matter.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Art of Sharing IDs
&lt;/h3&gt;

&lt;p&gt;But how do we pass this mystical ID between services? Fear not; it’s simpler than deciphering ancient scrolls. Here’s the playbook:&lt;br&gt;
&lt;strong&gt;Source Generation&lt;/strong&gt;: The correlation ID starts at the source, which could be the frontend or a backend-for-frontend service. Think of it as a unique mark given to each transaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Header Travel&lt;/strong&gt;: As the request travels, it carries an “x-correlation-id” header. This header is like a tiny parchment containing the secret—the essence of our journey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service Adoption&lt;/strong&gt;: Every internal service receives this sacred header. They treat it like a valuable tool, recording it in their logs.&lt;/p&gt;

&lt;p&gt;When an error occurs, we consult the logs. Armed with the correlation ID, we trace all related events through time and find out our culprit.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to do this in NestJS?
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Async local storage
&lt;/h3&gt;

&lt;p&gt;AsyncLocalStorage is a powerful Node.js API that simplifies data storage and retrieval across asynchronous operations. Imagine it as a persistent context that spans multiple asynchronous calls, akin to a global variable tailored to a specific asynchronous execution context. With AsyncLocalStorage, you can seamlessly manage state and share data without the complexities of manual bookkeeping. &lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AsyncLocalStorage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:async_hooks&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Create an instance of AsyncLocalStorage&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AsyncLocalStorage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// run an async operation with context&lt;/span&gt;
&lt;span class="nx"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;isAdmin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nf"&gt;createUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this scenario, &lt;code&gt;createUser&lt;/code&gt; represents a sophisticated function that orchestrates multiple asynchronous tasks, such as checking for duplicate users in a database, creating the user, and triggering email verification. The beauty of &lt;code&gt;AsyncLocalStorage&lt;/code&gt; lies in its ability to maintain the &lt;code&gt;{ isAdmin: true }&lt;/code&gt; context across these asynchronous steps. At any point during the user creation process, you can effortlessly retrieve the context value using &lt;code&gt;store.getStore()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;isAdmin&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach ensures that critical information, like the &lt;code&gt;correlationId&lt;/code&gt;, remains accessible throughout your Node.js application. &lt;/p&gt;
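&lt;p&gt;As a minimal sketch (the &lt;code&gt;innerStep&lt;/code&gt; function and the timeout are illustrative), the value passed to &lt;code&gt;run()&lt;/code&gt; stays visible even after an &lt;code&gt;await&lt;/code&gt;:&lt;/p&gt;

```javascript
// Illustrative sketch: context set by run() survives async boundaries.
import { AsyncLocalStorage } from 'node:async_hooks';

const store = new AsyncLocalStorage();

async function innerStep() {
  // simulate an async hop, e.g. a database call
  await new Promise((resolve) => setTimeout(resolve, 10));
  // the context set by run() is still available here
  return store.getStore();
}

store.run({ correlationId: 'abc-123' }, async () => {
  const ctx = await innerStep();
  console.log('context after await:', ctx); // { correlationId: 'abc-123' }
});
```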

&lt;p&gt;In an Express or NestJS application, the natural place to establish this context is a middleware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// with-context.middleware.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AsyncLocalStorage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node:async_hooks&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NextFunction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Response&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;ContextStore&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;correlationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;globalStore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;AsyncLocalStorage&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ContextStore&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;withCorrelationId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;next&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;NextFunction&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;globalStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getContext&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;correlationId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x-correlation-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nx"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;v4&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setHeader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;x-correlation-id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;correlationId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;globalStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;({},&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we add a middleware &lt;code&gt;withCorrelationId&lt;/code&gt; that reads the correlation id from the request headers (or generates one), stores it in a context, and&lt;br&gt;
runs the &lt;code&gt;next()&lt;/code&gt; middleware or route handler within that context.&lt;/p&gt;

&lt;p&gt;Now we need to add this middleware to our main application&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// main.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NestFactory&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@nestjs/core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AppModule&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./app.module&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;withCorrelationId&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./with-context.middleware&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;bootstrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;NestFactory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;AppModule&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;warn&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;debug&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="na"&gt;cors&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;withCorrelationId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="c1"&gt;// other middlewares or initializers&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Listening on port: 3000`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To add this correlation id to every log line, we can create a custom Winston-based logger service with &lt;code&gt;TRANSIENT&lt;/code&gt; scope, so each consumer gets a fresh instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// logging.service.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Injectable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Scope&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@nestjs/common&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createLogger&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;winston&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;globalStore&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./with-context.middleware&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;


&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Injectable&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Scope&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;TRANSIENT&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LoggingService&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;rootLogger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createLogger&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

     &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(){&lt;/span&gt;
         &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;correlationId&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;globalStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getStore&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;childLogger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rootLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;child&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="nx"&gt;correlationId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;});&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;childLogger&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;[]){&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="c1"&gt;// other log level implementations&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And there we have our logger. Use it like a normal logger.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Injectable&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;YourService&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="k"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;LoggingService&lt;/span&gt;&lt;span class="p"&gt;){}&lt;/span&gt;

    &lt;span class="nf"&gt;someFunction&lt;/span&gt;&lt;span class="p"&gt;(){&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nf"&gt;someAsyncOp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;some error&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way, we never need to fetch the context in every service file; the correlation id gets logged automatically. &lt;br&gt;
You can also use the context to attach additional ids or any other kind of data to the logs in the middle of a process, without passing that data explicitly to each function in the chain.&lt;/p&gt;
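&lt;p&gt;For example, here is a small sketch (the &lt;code&gt;authenticate&lt;/code&gt; and &lt;code&gt;log&lt;/code&gt; helpers are hypothetical) of enriching the context mid-flow, so that later log calls pick up the extra data automatically:&lt;/p&gt;

```javascript
// Illustrative sketch: the object returned by getStore() is mutable, so any
// function inside the context can attach data for later log calls to see.
import { AsyncLocalStorage } from 'node:async_hooks';

const store = new AsyncLocalStorage();

function authenticate() {
  // deep inside the request flow: enrich the shared context in place
  const ctx = store.getStore();
  if (ctx) ctx.userId = 'user-42';
}

function log(message) {
  const ctx = store.getStore() ?? {};
  console.log(JSON.stringify({ ...ctx, message }));
}

store.run({ correlationId: 'abc-123' }, () => {
  authenticate();
  log('after auth'); // logs correlationId AND userId, neither passed explicitly
});
```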

</description>
      <category>typescript</category>
      <category>nestjs</category>
      <category>backend</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Why do we use OTP libraries when we can just do Math.random()</title>
      <dc:creator>Ashutosh Sahu</dc:creator>
      <pubDate>Sat, 17 Aug 2024 11:57:57 +0000</pubDate>
      <link>https://forem.com/ashuto7h/why-do-we-use-otp-libraries-when-we-can-just-do-mathrandom-21gg</link>
      <guid>https://forem.com/ashuto7h/why-do-we-use-otp-libraries-when-we-can-just-do-mathrandom-21gg</guid>
      <description>&lt;p&gt;One-time passwords (OTPs) are widely used for authentication and verification purposes in various applications and services. A server usually generates them and sends them to the user via SMS, email, or other channels. The user then enters the OTP to confirm their identity or perform an action.&lt;/p&gt;

&lt;p&gt;I got a task where we had to implement OTP-based verification in Node.js. Before integrating something like this, most of us developers look on the Internet for best practices, tutorials, recent technical trends, and the problems other major software systems have faced in their implementations. So I did, and the thing that most caught my attention was libraries like &lt;code&gt;otp-lib&lt;/code&gt; and &lt;code&gt;otp-generator&lt;/code&gt;, whose only function is to generate an OTP. The rest of the work, like sending it over SMS or email, still has to be done by other means. The first question that comes to mind on learning that such libraries exist is: why go to the length of using a library to generate an OTP when all we have to do is write a one-liner?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;otp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ceil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this blog post, I will explain what I learned during our small research into OTP generators: why using &lt;code&gt;Math.random()&lt;/code&gt; to generate OTPs is a bad idea, what other ways there are to generate an OTP, and why a library should be used for such a task.&lt;/p&gt;
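&lt;p&gt;As a quick aside, even before considering predictability, the one-liner above is subtly broken: &lt;code&gt;Math.random()&lt;/code&gt; returns values in [0, 1), so &lt;code&gt;Math.ceil&lt;/code&gt; of the product does not even guarantee a fixed-length code. The &lt;code&gt;naive&lt;/code&gt; helper below just makes the random input explicit:&lt;/p&gt;

```javascript
// Illustrative sketch: r stands in for a Math.random() result
const naive = (r) => Math.ceil(r * 10000);

console.log(naive(0.00005)); // 1     -> a "1-digit OTP"
console.log(naive(0.99995)); // 10000 -> a 5-digit one
```

&lt;p&gt;Real OTP generators pad or clamp to a fixed number of digits; the naive one-liner does neither.&lt;/p&gt;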

&lt;h2&gt;
  
  
  Types of random numbers
&lt;/h2&gt;

&lt;p&gt;There are mainly two types of random numbers: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pseudo-Random Numbers (PRN)
&lt;/li&gt;
&lt;li&gt;Cryptographic Random Numbers (CRN).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pseudo-Random Numbers
&lt;/h3&gt;

&lt;p&gt;Pseudo-random numbers are generated by an algorithm that takes an initial value, called a seed, and produces a sequence of numbers that appear to be random. However, the algorithm is deterministic, meaning that if you know the seed and the algorithm, you can predict the next number in the sequence. JavaScript's &lt;code&gt;Math.random()&lt;/code&gt; and Python's &lt;code&gt;random.randint()&lt;/code&gt; are examples of pseudo-random number generators.&lt;/p&gt;
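&lt;p&gt;A toy illustration of this determinism (a simple linear congruential generator, not the algorithm V8 actually uses): two generators created with the same seed produce the exact same "random" sequence:&lt;/p&gt;

```javascript
// Illustrative sketch (not V8's generator): a toy linear congruential
// generator. Given the same seed, it reproduces the same sequence, which
// is exactly what makes pseudo-random numbers predictable.
function makeLcg(seed) {
  let state = seed;
  return () => {
    state = (state * 1664525 + 1013904223) % 2 ** 32;
    return state / 2 ** 32; // scale to [0, 1) like Math.random()
  };
}

const a = makeLcg(42);
const b = makeLcg(42);
console.log(a(), a(), a());
console.log(b(), b(), b()); // identical sequence: the seed determines everything
```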

&lt;h3&gt;
  
  
  Cryptographic Random Numbers
&lt;/h3&gt;

&lt;p&gt;Cryptographic random numbers are generated by a process that is unpredictable and cannot be reproduced or guessed. They are usually based on some physical phenomenon, such as atmospheric noise, thermal noise, or quantum effects.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Math.random() works?
&lt;/h2&gt;

&lt;p&gt;Different Javascript engines behave a little differently when generating a random number, but it all essentially comes down to a single algorithm &lt;code&gt;XorShift128+&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;XorShift128+ is a deterministic algorithm that uses XORs, shifts, and a final addition as a cheap non-linear transformation. Compared to peers that rely on multiplication, it is faster, and it fails fewer statistical tests than the Mersenne Twister (used by Python's random module).&lt;/p&gt;

&lt;p&gt;The algorithm takes in two state variables, applies some XOR and shift on them, and returns the sum of the updated state variables which is an Integer. The states are generally seeded using the system clock because that is a good source for a unique number.&lt;/p&gt;

&lt;p&gt;An implementation of XOR shift plus in javascript looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;state0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;state1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;xorShiftPlus&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;state0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;s0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;state1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
    &lt;span class="nx"&gt;state0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;s0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  
    &lt;span class="nx"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;^=&lt;/span&gt; &lt;span class="nx"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;^=&lt;/span&gt; &lt;span class="nx"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;^=&lt;/span&gt; &lt;span class="nx"&gt;s0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;s1&lt;/span&gt; &lt;span class="o"&gt;^=&lt;/span&gt; &lt;span class="nx"&gt;s0&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;state1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;s1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;state0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;state1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The returned integer is then mapped to a double, essentially by keeping the mantissa bits and OR-ing in a constant exponent. You can find the detailed implementation in the &lt;a href="https://chromium.googlesource.com/v8/v8/+/6d706ae3a0153cf0272760132b775ae06ef13b1a/src/base/utils/random-number-generator.h#111" rel="noopener noreferrer"&gt;Chromium source code&lt;/a&gt;.&lt;/p&gt;
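&lt;p&gt;One common way to sketch that integer-to-double mapping (equivalent in effect to V8's bit trick, though not the same code) is to keep the top 53 bits, a double's mantissa precision, and divide:&lt;/p&gt;

```javascript
// Illustrative sketch: map a 64-bit unsigned value (as a BigInt) to [0, 1)
function toDouble(u64) {
  // drop the low 11 bits, keeping 53 bits of precision, then scale down
  return Number(u64 >> 11n) / 2 ** 53;
}

console.log(toDouble(0n));             // 0
console.log(toDouble(2n ** 64n - 1n)); // just under 1
```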

&lt;h2&gt;
  
  
  How to predict a random number generated by Math.random()
&lt;/h2&gt;

&lt;p&gt;Predicting the outcome of &lt;code&gt;Math.random()&lt;/code&gt; is hard, however, it is not completely impossible. Knowing the algorithm, you can easily regenerate the same random numbers if you know the values of &lt;code&gt;state0&lt;/code&gt; and &lt;code&gt;state1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;XorShift128+ can be reverse engineered: using a &lt;a href="https://en.wikipedia.org/wiki/Z3_Theorem_Prover" rel="noopener noreferrer"&gt;Z3 theorem prover&lt;/a&gt;, you can recover the values of &lt;code&gt;state0&lt;/code&gt; and &lt;code&gt;state1&lt;/code&gt; from 3 consecutive random numbers generated by a server.&lt;/p&gt;

&lt;p&gt;The implementation of the Z3 solver can be found &lt;a href="https://github.com/douggard/XorShift128Plus/blob/master/xs128p.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now the question is how to get those 3 random numbers from a server. That's the hard part; they can be obtained in cases like the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If an API returns a randomly generated number in its response or headers, it can easily be obtained by sending requests at set intervals.&lt;/li&gt;
&lt;li&gt;API documentation like OpenAPI/Swagger in modern applications is generated on the server. Sometimes their responses can contain an example value that uses a random number.&lt;/li&gt;
&lt;li&gt;With frameworks like NextJS that use server-side rendering while also being capable of handling backend API integrations, there are high chances of getting randomly generated numbers from the content served by them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Another approach to exploiting the random number generator uses the fact that &lt;code&gt;Math.random()&lt;/code&gt; only returns numbers between 0 and 1 with about 16 decimal digits of precision. This means there are only around 10^16 possible values that &lt;code&gt;Math.random()&lt;/code&gt; can return, and the space of possible OTPs is far smaller still: if your OTP has 6 digits, there are only 10^6 possible values. This &lt;a href="https://github.com/lordpoint/xorshift-sandbox-and-visualizer" rel="noopener noreferrer"&gt;visualizer&lt;/a&gt; shows that there is a pattern to the numbers generated; using it, the possibilities can be reduced by 30%. Therefore, if you can guess or brute-force some of the digits of the OTP, you can shrink the space of possible values and increase your chances of finding the correct OTP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating a Cryptographic Random Number in NodeJS
&lt;/h2&gt;

&lt;p&gt;As mentioned previously, cryptographic random numbers are non-deterministic because they depend on the physical factors of a system. Every programming language can access those factors using low-level OS kernel calls.&lt;/p&gt;

&lt;p&gt;NodeJS provides its inbuilt crypto module, which we can use to generate randomBytes and then convert them to a number. These random bytes are cryptographic and purely random in nature. The generated number can easily be truncated to the exact number of digits we want in OTP.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crypto&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randomBytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// num.toString().slice(0,4)  // truncate to 4 digits&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Node.js 14.10+ provides another function in crypto to generate a uniform random number in a given range, where min is inclusive and max is exclusive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randomInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9999&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even after knowing the vulnerability of &lt;code&gt;Math.random()&lt;/code&gt; and finding a more secure way to generate a random number cryptographically, we still remain with the same question from the beginning. Why do we have to go to such lengths to use a library to generate OTP when all we have to do is write a one-liner?&lt;/p&gt;

&lt;p&gt;Before answering this question, let's look at the inconvenience of handling and storing an OTP. The problem with using the above method to generate OTPs is that you have to store them in a database in order to verify them later. Storing OTPs in the database is not a good practice, for the following reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Storing OTPs in the database creates a lot of garbage data that has to be cleaned up periodically. OTP means a one-time password that can expire after a single use. It can also expire if not used for a specific duration or a new OTP is requested without using the previous one. This mainly adds unnecessary overhead to the database operations for maintaining valid OTPs while also consuming storage space.&lt;/li&gt;
&lt;li&gt;Storing OTPs in the database poses a security risk if the database is compromised. An attacker who gains access to the database can read the OTPs and use them to bypass the authentication or verification process. This can lead to account takeover, identity theft, or fraud.&lt;/li&gt;
&lt;li&gt;Storing OTPs in the database makes them vulnerable to replay attacks. A replay attack is when an attacker intercepts an incoming valid OTP and uses it again before it expires. This can allow the attacker to perform unauthorised actions or access sensitive information.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What do the OTP libraries do differently?
&lt;/h2&gt;

&lt;p&gt;The OTP libraries use different algorithms and techniques to generate and verify OTPs that behave similarly to a Cryptographic random OTP, while also removing the overhead to store the OTP in a database.&lt;/p&gt;

&lt;p&gt;There are mainly two types of OTP implementation techniques.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;HOTP&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;HOTP stands for HMAC-based One-Time Password. It is an algorithm that generates an OTP based on a secret key and a counter. The secret key is a random string that is shared between the server and the user. The counter is an integer that increments every time an OTP is generated or verified.&lt;/p&gt;

&lt;p&gt;The algorithm works as follows:&lt;/p&gt;

&lt;p&gt;• The server and the user generate the same OTP by applying a keyed hash (HMAC, typically with SHA-1) to the counter value under the shared secret key.&lt;br&gt;
• The server and the user truncate the hash value to obtain a fixed-length OTP, usually 6 or 8 digits.&lt;br&gt;
• The user sends the OTP to the server for verification.&lt;br&gt;
• The server compares the OTP with its own generated OTP and verifies it if they match.&lt;br&gt;
• The server and the user increment their counters by one.&lt;/p&gt;
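&lt;p&gt;As a rough sketch, the steps above (following RFC 4226: HMAC-SHA1 plus dynamic truncation) look like this in Python:&lt;/p&gt;

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    # HMAC-SHA1 keyed with the shared secret, over the 8-byte big-endian counter
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte pick a 4-byte window
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    # Keep the last `digits` decimal digits, zero-padded
    return str(code % 10 ** digits).zfill(digits)


# RFC 4226 test vector: this secret at counter 0 gives "755224"
print(hotp(b"12345678901234567890", 0))
```

&lt;p&gt;The same function runs on both sides; only the shared secret and the counter need to agree.&lt;/p&gt;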

&lt;p&gt;HOTP is mostly used in hardware token-based authentication such as &lt;a href="https://www.yubico.com/products/how-the-yubikey-works/" rel="noopener noreferrer"&gt;YubiKey&lt;/a&gt;. A YubiKey is a pre-programmed hardware key that you physically connect to your computer or phone. Instead of receiving a code via SMS or email, you just press a button on the YubiKey to verify and authenticate yourself.&lt;/p&gt;

&lt;h4&gt;
  
  
  The advantages of HOTP are:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;It does not require storing the OTP in the database, as it can be generated and verified on the fly.&lt;/li&gt;
&lt;li&gt;It does not rely on pseudo-random numbers, as it uses a cryptographic hash function that is unpredictable and irreversible.&lt;/li&gt;
&lt;li&gt;It is resistant to replay attacks, as each OTP is valid only once.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  The disadvantages of HOTP are:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;It requires synchronization between the server's and the user's counters. If they fall out of sync, due to network delays, transmission errors, or device loss, verification will fail.&lt;/li&gt;
&lt;li&gt;A generated HOTP remains valid until it (or a newer one) is used, which can be a vulnerability.&lt;/li&gt;
&lt;li&gt;It requires a secure way to distribute and store the secret keys. If the secret keys are leaked or stolen, the OTPs can be compromised.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  TOTP
&lt;/h3&gt;

&lt;p&gt;TOTP stands for Time-based One-Time Password. It is an algorithm that generates an OTP based on a secret key, timestamp, and epoch.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The secret key is a random string that is shared between the server and the user. It can be created uniquely for each user, for example by computing &lt;code&gt;SHA1( "secretvalue" + user_id )&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  The timestamp is an integer that represents the current time in seconds.&lt;/li&gt;
&lt;li&gt;  The epoch is the duration for which the algorithm will produce the same result; it is generally kept between 30 seconds and 1 minute.&lt;/li&gt;
&lt;/ul&gt;
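&lt;p&gt;The per-user secret derivation above could be sketched as follows; &lt;code&gt;user_secret&lt;/code&gt; and the &lt;code&gt;"secretvalue"&lt;/code&gt; string are illustrative names, not a real API:&lt;/p&gt;

```python
import hashlib


def user_secret(server_secret, user_id):
    # Hypothetical helper: derive a per-user shared secret as
    # SHA1(server_secret + user_id), hex-encoded
    return hashlib.sha1((server_secret + str(user_id)).encode()).hexdigest()


print(user_secret("secretvalue", 42))  # 40 hex characters, stable per user
```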

&lt;p&gt;The algorithm works as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The server decides a secret key for the user and shares it over a medium such as an authenticator app.&lt;/li&gt;
&lt;li&gt;The server can directly generate an OTP and send it to the user by mail or SMS, or it can ask the user to generate an OTP from the shared key using an authenticator.&lt;/li&gt;
&lt;li&gt;The user sends back the OTP received by mail or SMS, or, in the 2FA case, the one generated in the authenticator app within the fixed time window.&lt;/li&gt;
&lt;li&gt;The server compares the OTP with its own generated OTP and accepts it if they fall within the same epoch time range.&lt;/li&gt;
&lt;/ul&gt;
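&lt;p&gt;A minimal sketch of the flow above: TOTP (RFC 6238) is just HOTP with the counter replaced by the number of epochs elapsed since the Unix epoch. An &lt;code&gt;hotp&lt;/code&gt; helper implementing the HMAC-and-truncation procedure described under HOTP is included to keep the example self-contained:&lt;/p&gt;

```python
import hashlib
import hmac
import struct
import time


def hotp(secret, counter, digits=6):
    # HMAC-SHA1 plus dynamic truncation, as described under HOTP above
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret, epoch=30, digits=6, at=None):
    # The "counter" is the number of whole epochs elapsed since the Unix epoch
    counter = int((time.time() if at is None else at) // epoch)
    return hotp(secret, counter, digits)


# At t = 59 s with a 30 s epoch the counter is 1, matching RFC 6238's
# SHA-1 test vector truncated to 6 digits
print(totp(b"12345678901234567890", at=59))
```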

&lt;h4&gt;
  
  
  The advantages of TOTP are:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;It does not require storing the OTP in the database, as it can be generated and verified on the fly.&lt;/li&gt;
&lt;li&gt;It does not rely on pseudo-random numbers, as it uses a cryptographic hash function that is unpredictable and irreversible.&lt;/li&gt;
&lt;li&gt;It is resistant to replay attacks, as each OTP is valid only for a short period of time.&lt;/li&gt;
&lt;li&gt;It does not require synchronization between the server's and the user's counters. As long as they have reasonably accurate clocks, they can generate and verify OTPs independently.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  The disadvantages of TOTP are:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;It requires a secure way to distribute and store the secret keys. If the secret keys are leaked or stolen, the OTPs can be compromised.&lt;/li&gt;
&lt;li&gt;It requires a reliable source of time for both the server and the user. If their clocks are skewed or tampered with, verification will fail.&lt;/li&gt;
&lt;li&gt;The server has to account for clock drift and request-processing delay, so it should accept a slightly wider time window than the client's epoch.&lt;/li&gt;
&lt;/ul&gt;
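&lt;p&gt;To handle the drift mentioned above, a server-side check can accept codes from a small window of adjacent time steps. A hedged sketch (the function names are illustrative), reusing the same HMAC-and-truncation helper:&lt;/p&gt;

```python
import hashlib
import hmac
import struct


def hotp(secret, counter, digits=6):
    # HMAC-SHA1 plus dynamic truncation (RFC 4226)
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_totp(secret, submitted, at, epoch=30, drift_steps=1):
    # Accept the current time step plus `drift_steps` neighbours on each
    # side, so a code generated just before a window boundary still passes
    counter = int(at // epoch)
    return any(
        hmac.compare_digest(hotp(secret, counter + d), submitted)
        for d in range(-drift_steps, drift_steps + 1)
    )


secret = b"12345678901234567890"
code = hotp(secret, 1)  # code for the window covering t = 30..59 s
print(verify_totp(secret, code, at=61))                 # one step late: accepted
print(verify_totp(secret, code, at=61, drift_steps=0))  # strict: rejected
```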

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Through our little research journey into OTPs, we came to know that &lt;code&gt;Math.random()&lt;/code&gt; can be predicted, exploited, and replayed. We also learned that storing OTPs in the database is not a good practice.&lt;/p&gt;

&lt;p&gt;TOTP can generate secure and efficient OTPs, and can also verify them. It can generate an OTP offline as well as online, does not require synchronization or storage, and is resistant to replay attacks. Thus it solves most of our concerns related to best practices, security, and reliability.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>webdev</category>
      <category>programming</category>
      <category>node</category>
    </item>
    <item>
      <title>Creating a Chess Engine from Scratch</title>
      <dc:creator>Ashutosh Sahu</dc:creator>
      <pubDate>Sat, 23 Mar 2024 06:04:03 +0000</pubDate>
      <link>https://forem.com/ashuto7h/creating-a-chess-engine-from-scratch-hcf</link>
      <guid>https://forem.com/ashuto7h/creating-a-chess-engine-from-scratch-hcf</guid>
      <description>&lt;p&gt;I am writing this article as a Journal to keep a record of my progress so far in this experimental project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backstory and Inspiration
&lt;/h2&gt;

&lt;p&gt;I have been bad at chess since childhood; I never won against a good opponent, even at the easiest level in a mobile game. At some point I found some system design notes and problems. I enjoyed solving LLD (low-level design) problems in my head and then reading the solution to check the depth of my knowledge. One of them was designing a chess board game, and the design was very much what I had thought: create an abstract class for a Piece and inherit the other pieces from it, with each class having its own methods to move the piece and to calculate the score of moving it. Still, many questions remained unanswered after that, like: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is chess so simple to design using OOP concepts? Do modern games do it like this?

&lt;ul&gt;
&lt;li&gt;I think that is what they do. What other ways could there be?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;How do they manage levels in a chess game?

&lt;ul&gt;
&lt;li&gt;If I had to guess, they take the top 10 moves with high scores and based on level choose the move to play.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Is chess AI nothing more than a set of rules to decide the move with the highest score?

&lt;ul&gt;
&lt;li&gt;This is something I had to discover, and I found that there is no real AI involved; it's just advanced algorithms like minimax/negamax, alpha-beta pruning, etc.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anyway, I decided I would test whether I could make a chess game that beats me. I chose &lt;strong&gt;Go&lt;/strong&gt; as the programming language because I have only learned the basics of it, and this would enhance my skills. I chose fyne.io as the GUI framework because it was the only one with good-looking, easy documentation. &lt;/p&gt;

&lt;h2&gt;
  
  
  Finding the best move
&lt;/h2&gt;

&lt;p&gt;I want to keep it simple in the beginning: we will go with only one layer, searching all the next possible moves.&lt;br&gt;
I found two cases where I need to calculate the score:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A piece shouldn't move&lt;/li&gt;
&lt;li&gt;A piece should move &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I think we eventually have to move a piece, so to simplify things, I came up with the following idea.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if a piece moves

&lt;ul&gt;
&lt;li&gt;calculate its score&lt;/li&gt;
&lt;li&gt;calculate the other pieces' loss if this one moves.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;final score = sum of both scores&lt;/li&gt;
&lt;li&gt;move the piece with the highest score&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scoring mechanism
&lt;/h2&gt;

&lt;p&gt;I searched for scoring in chess and found that the rook is given more importance than the knight and the bishop.&lt;br&gt;
I think this choice differs from player to player. From my point of view, I would give up my rook to capture the opponent's knight or bishop, so they should all have equal scores.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a piece gives check to the king: +20&lt;/li&gt;
&lt;li&gt;a piece captures a pawn: +20&lt;/li&gt;
&lt;li&gt;a piece captures a rook: +30&lt;/li&gt;
&lt;li&gt;a piece captures a knight: +30&lt;/li&gt;
&lt;li&gt;a piece captures a bishop: +30&lt;/li&gt;
&lt;li&gt;a piece captures the queen: +40&lt;/li&gt;
&lt;li&gt;a piece captures the king: +50&lt;/li&gt;
&lt;li&gt;a piece removes a check on the king: +50&lt;/li&gt;
&lt;li&gt;a pawn reaches the last row: +20&lt;/li&gt;
&lt;li&gt;a pawn is threatened on reaching a position: -30&lt;/li&gt;
&lt;li&gt;a rook, bishop, or knight is threatened on reaching a position: -40&lt;/li&gt;
&lt;li&gt;the queen is threatened on reaching a position: -40&lt;/li&gt;
&lt;li&gt;the king is threatened on reaching a position: -50&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Giving check to the king is worth only 20 points because your queen might be giving check while also being in danger, so I kept it at the minimum.&lt;br&gt;
Initially I started scoring from 10 points but changed it to 20. I realized there would be situations where the pieces try to keep their distance from one another, resulting in a draw, so I will calculate a piece's absolute distance from its nearest enemy and subtract that distance from its score.&lt;br&gt;
Getting yourself captured is considered a loss, and to minimize it, I increased the loss score by 10.&lt;br&gt;
That's simple enough to start with.&lt;/p&gt;
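&lt;p&gt;As an illustrative sketch only (not the project's actual Go code), the "score every move, pick the highest" rule above boils down to:&lt;/p&gt;

```python
def best_move(scored_moves):
    # scored_moves: list of (move, gain, loss) tuples, where `gain` is the
    # score the moving piece earns and `loss` (zero or negative) is what the
    # other pieces lose if it moves; final score = gain + loss
    return max(scored_moves, key=lambda m: m[1] + m[2])[0]


# Capturing the queen (+40) while exposing a rook (-40) nets 0, so the
# safe pawn capture (+20) is chosen
print(best_move([("QxQ", 40, -40), ("PxP", 20, 0)]))
```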

&lt;h2&gt;
  
  
  March 17, 2024
&lt;/h2&gt;

&lt;p&gt;I decided that I wanted to make a cross-platform application with a stack I don't know, so that I would learn it. I spent the whole day creating the board design in Fyne. It's not that simple: the framework is evolving, but it doesn't give me what I want. I tried to put 64 cells in an 8x8 grid and resize all cells to 50x50, but the resize was not working. I checked many answers on Stack Overflow related to resizing, but nothing worked. Later I found a video on YouTube where the person did something different from what I was doing (still using the resize function) and it worked. I got super annoyed by this. As you can see, I have so much going on in my mind related to chess scoring, capturing pieces, and designing with the software principles I learned, and I am stuck on a UI issue. At this rate, I will not be able to get anywhere.&lt;/p&gt;

&lt;p&gt;So the next day I went looking for other options. The next language was Python, where I found the Qt and Tkinter frameworks. I have made a small tic-tac-toe game in Tkinter in the past, and it's good too. But Qt is the most advanced and popular, so I went to look at its documentation, and it's nothing compared to React UI libraries, Tkinter, Fyne, or Flutter: there is no good visual tutorial, only references to the API, and it's huge. I don't want to go with Tkinter because it's very basic, like Fyne, and I am afraid I will get stuck there too. &lt;/p&gt;

&lt;p&gt;The next option I know is Flutter/Dart. I have worked with it before and it's great. It now supports desktop builds too and has very flexible widgets that you can style as you want. But I would need to reinstall it and set it up for desktop and mobile builds. I can't do that, because it would take another whole day and I would not be able to put my ideas into code. I needed something fast. So I decided I would fall back to React/JS for making the game and rewrite the logic in Flutter later. &lt;/p&gt;

&lt;p&gt;And I think I made the right choice. Making the game in React was fast because designing the chess board was very easy with plain HTML and CSS. I made all the chess piece classes, wrote the move and scoring logic for all of them, and got it running. Everything in my head is in the code now. Now I can get some sleep.&lt;/p&gt;

&lt;p&gt;The game is very nice, playing by the rules. It can beat me at this point, just like any other chess game out there. For debugging, I added a score display: when I click on a piece, it shows me the score of that piece for each of the cells it can move to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2pz8mwc58u7g2t1zac9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2pz8mwc58u7g2t1zac9.png" alt="Chess game progress 1" width="606" height="588"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkgf94uz5vehllnxq3of.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkgf94uz5vehllnxq3of.png" alt="Chess game progress 2" width="587" height="581"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are a few things left to add to it like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;adding the castling move (king and rook interchange positions)&lt;/li&gt;
&lt;li&gt;detecting a checkmate or draw and stopping the game there&lt;/li&gt;
&lt;li&gt;adding a history of moves and undo functionality&lt;/li&gt;
&lt;li&gt;adding animation for moves&lt;/li&gt;
&lt;li&gt;loading the game from a saved file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I realized I am missing one thing: the motivation to move a pawn. Moving it forward helps promote it, but in the meantime it needs to be well protected too. That is hard to solve.&lt;/p&gt;

&lt;h2&gt;
  
  
  March 18, 2024
&lt;/h2&gt;

&lt;p&gt;Last night my game was almost complete, so now I had the question: can it beat any other chess AI out there? Of course not. Now I have two more questions to answer: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What do other chess game engines do differently?&lt;/li&gt;
&lt;li&gt;Is chess AI a real thing, or is it just better algorithms?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I spent the whole night on the Internet looking for answers. I found that other chess engines have different scoring mechanisms than mine and different strategies for deciding moves. &lt;/p&gt;

&lt;p&gt;One such &lt;a href="https://andreasstckl.medium.com/writing-a-chess-program-in-one-day-30daff4610ec"&gt;implementation&lt;/a&gt; is to assign a fixed score to every position of every piece, then for each move calculate the sum of all your scores and all your opponent's scores, and subtract them to get your score for that move. I am not going with that idea because I cannot understand the &lt;a href="https://www.chessprogramming.org/Simplified_Evaluation_Function"&gt;piece square table&lt;/a&gt;, since it is a stationary table while pieces can move anywhere on the board. &lt;/p&gt;

&lt;p&gt;But this gave me the idea of making my algorithm better. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;trial 1&lt;/strong&gt;: find all moves; for each move, calculate my total score and the enemy's total score over all pieces, and subtract them to get the final score.&lt;/p&gt;
&lt;/blockquote&gt;
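&lt;p&gt;Trial 1 amounts to a one-ply material evaluation. A sketch, with an illustrative value table rather than my real scoring constants:&lt;/p&gt;

```python
# Illustrative values only, echoing the capture scores from the
# "Scoring mechanism" section rather than a standard piece table
PIECE_VALUES = {"pawn": 20, "knight": 30, "bishop": 30, "rook": 30,
                "queen": 40, "king": 50}


def evaluate(my_pieces, enemy_pieces):
    # Sum my piece values, sum the enemy's, and subtract: a higher result
    # means the position after the candidate move favours me
    mine = sum(PIECE_VALUES[p] for p in my_pieces)
    theirs = sum(PIECE_VALUES[p] for p in enemy_pieces)
    return mine - theirs


print(evaluate(["king", "queen", "pawn"], ["king", "pawn"]))  # 40
```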

&lt;p&gt;The implementation also uses minimax and alpha-beta pruning, which I am not going to use until I understand those two completely.&lt;/p&gt;




&lt;p&gt;I kept looking for AI chess engine implementations and found &lt;a href="https://stockfishchess.org/"&gt;Stockfish&lt;/a&gt; and &lt;a href="https://lczero.org/"&gt;Leela Chess Zero (LC0)&lt;/a&gt;. I found that LC0 uses MCTS (Monte Carlo Tree Search). It's another term I don't want to know about, because when I hear AI, I want to hear tensorflow/keras in play.&lt;/p&gt;

&lt;p&gt;And so I changed the search prompt a little, to "Neural Network Based Chess Engine".&lt;br&gt;
I found this &lt;a href="https://github.com/undera/chess-engine-nn"&gt;repository&lt;/a&gt;, which keeps a journal of the chess engine's progress. I liked reading it and decided that I would maintain my own journal too, and so I am writing all this.&lt;/p&gt;

&lt;p&gt;At first, I wasn't clear about the way he had implemented it. To create a neural network, what I need is an input layer, an output layer, an optimizer, and a loss function. I was able to get all of those from his journal except the output layer. I kept wondering: what is the target we have to predict, and what will the output layer give us? How will the NN help us find the best move? Then I found &lt;a href="https://erikbern.com/2014/11/29/deep-learning-for-chess"&gt;DeepPink&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It was a similar implementation to the one in his journal. I found that the output is in the form of -1, 0, 1, where -1 is a loss, 0 is a draw, and 1 is a win.&lt;br&gt;
Still, after reading all this, I am not able to grasp everything completely, like the evaluation function they used. It's just that I don't understand that level of mathematics. &lt;/p&gt;

&lt;p&gt;Also, about the dataset: they use the PGN format. It's good, but I am not sure how to build my matrix from that format yet. For now, I will generate my own data; if I need more data, I will think about revisiting PGN.&lt;/p&gt;




&lt;p&gt;Here is what I am going to do:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;trial 3&lt;/strong&gt;: The data will be generated by recording all moves of both players in 8 x 8 x 12 matrix form, which gives the positions of the 12 piece types across the 8x8 board. After the game ends, each move is assigned one of three targets: 'win', 'lose', or 'draw'.&lt;br&gt;
Example: if p1 wins, all moves of p1 are labelled as win.&lt;/p&gt;

&lt;p&gt;I will train it using ReLU activations and categorical cross-entropy loss; the hidden layers are not decided yet, and the output will be Dense(3).&lt;/p&gt;

&lt;p&gt;For prediction, I will find all 1st-level possible moves from the current state of the board (I am not going deeper to get more moves) and feed them all to my NN. Each prediction outputs three numbers: the probabilities of win, lose, and draw. If a win is available, the move with the highest win probability is selected; else if a draw is available, the move with the highest draw probability is selected; otherwise, the move with the lowest loss probability is selected.&lt;/p&gt;
&lt;/blockquote&gt;
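&lt;p&gt;The 8 x 8 x 12 board encoding from trial 3 can be sketched in plain Python; the piece ordering and the &lt;code&gt;board&lt;/code&gt; dictionary shape are arbitrary choices for illustration:&lt;/p&gt;

```python
# Arbitrary plane ordering for the 12 piece types (w = white, b = black)
PIECES = ["wp", "wn", "wb", "wr", "wq", "wk",
          "bp", "bn", "bb", "br", "bq", "bk"]
PLANE = {p: i for i, p in enumerate(PIECES)}


def encode(board):
    # `board` maps (row, col) -> piece code, e.g. {(0, 4): "wk"}.
    # The result is an 8 x 8 x 12 nested list of 0/1 flags, one plane
    # per piece type, as described in trial 3.
    tensor = [[[0] * 12 for _ in range(8)] for _ in range(8)]
    for (row, col), piece in board.items():
        tensor[row][col][PLANE[piece]] = 1
    return tensor


t = encode({(0, 4): "wk", (7, 4): "bk"})
print(t[0][4][PLANE["wk"]], t[7][4][PLANE["bk"]])  # 1 1
```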

&lt;p&gt;Before I do that, I want to try other classification algorithms, like decision trees. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;trial 2&lt;/strong&gt;: for that, I will convert the data to a 2D table by one-hot encoding each of the 8x8x12 dimensions and dropping one column from each. This gives me 7 + 7 + 11 = 25 features.&lt;/p&gt;
&lt;/blockquote&gt;
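&lt;p&gt;The 7 + 7 + 11 = 25 arithmetic above comes from drop-first one-hot encoding of the row (8 values), the column (8), and the piece type (12). A sketch with hypothetical helper names:&lt;/p&gt;

```python
def one_hot_drop_first(value, n_categories):
    # Drop-first one-hot: category 0 maps to all zeros, so n categories
    # need only n - 1 binary columns
    return [1 if value == i else 0 for i in range(1, n_categories)]


def move_features(row, col, piece_index):
    # 7 + 7 + 11 = 25 binary features for one (row, col, piece) triple
    return (one_hot_drop_first(row, 8)
            + one_hot_drop_first(col, 8)
            + one_hot_drop_first(piece_index, 12))


print(len(move_features(3, 7, 11)))  # 25
```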

&lt;p&gt;I also thought about using the score as a parameter in training, but then the results would be highly biased towards the score, so I am dropping this idea.&lt;/p&gt;




&lt;h2&gt;
  
  
  March 19, 2024
&lt;/h2&gt;

&lt;p&gt;Woke up this morning with a new idea. Chess is a sequence of moves, where you can play a certain move only based on the outcomes of all previous moves, so it's like an NLP problem, isn't it? We can represent the chess board as a long sentence, and for each move we have to predict the next word in that sentence.&lt;br&gt;
For this, I am going to make the data repetitive: if a game has 10 moves, I will make 10 strings, each adding one move to the previous one, and then annotate them with win, lose, or draw based on which player played the move.&lt;br&gt;
The problem now is how to represent the whole board as a sentence, and if I get the next word in a sentence, how do I convert it back to a move? I was looking on the internet for any such approach and found this &lt;a href="https://towardsdatascience.com/watching-a-language-model-learning-play-chess-f6e6c0e094c"&gt;article&lt;/a&gt; and some other research papers. It's simply the PGN format; however, the format itself is a little ambiguous to me. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;trial 4&lt;/strong&gt;: I will modify the PGN format a little to include the piece type at the beginning of the word. Since training a case-sensitive model is not good enough, I will change the piece-type notation from k and K to bk and wk for representing the black king and the white king.&lt;br&gt;
A problem I identified with the NLP approach is: what if the predicted word is an impossible move? While training, we will add a custom loss function that increases the loss if the generated move is impossible, and if an impossible move occurs during prediction, we will fall back to the best-scoring move. I will not use transfer learning for this, because I think I am feeding it a completely different language, one that is not used for communicating.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>go</category>
      <category>react</category>
    </item>
    <item>
      <title>#P5 - Data Visualization</title>
      <dc:creator>Ashutosh Sahu</dc:creator>
      <pubDate>Fri, 23 Jul 2021 06:52:25 +0000</pubDate>
      <link>https://forem.com/ashuto7h/p4-data-visualization-264p</link>
      <guid>https://forem.com/ashuto7h/p4-data-visualization-264p</guid>
      <description>&lt;p&gt;Data visualization is the graphical representation of information and data by means of various graphs, charts and diagrams that helps to understand and get relevant information from data. We will see how they help to get various informations.&lt;/p&gt;

&lt;p&gt;In python, there are some libraries that provide data visualization utilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Matplotlib
&lt;/h2&gt;

&lt;p&gt;&lt;a href="//github.com/matplotlib/matplotlib"&gt;view on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension, NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK. SciPy, Pandas, and Seaborn are other libraries that depend on Matplotlib.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Seaborn
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/mwaskom/seaborn" rel="noopener noreferrer"&gt;view on GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Seaborn is a wrapper library built on top of Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. This means you can draw graphs similar to Seaborn's using Matplotlib alone, just with some extra code. It also provides various color schemes and themes.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Plotly
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/plotly/plotly.py" rel="noopener noreferrer"&gt;view on Github&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Plotly is an interactive graphing library that lets you interact with the graph: reading x and y values by hovering over objects, zooming in and out, highlighting an area, etc. It is the best analytical tool of the three, but it is also slower and much more resource-consuming. &lt;/p&gt;

&lt;p&gt;You can check their thorough documentation for the various graph customizations. This article contains only a few code examples.&lt;/p&gt;

&lt;h1&gt;
  
  
  Types of Plots and Charts
&lt;/h1&gt;

&lt;p&gt;Almost every day we see some analytics in a newspaper, on TV, in a mobile application, or on a website. Most of us know about bar charts or pie charts, but there are many other types of visualization plots. &lt;/p&gt;

&lt;h2&gt;
  
  
  1. Scatter Plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A scatter plot visualizes the scatter of the data values of two features.&lt;/li&gt;
&lt;li&gt;It is used to find relationships in bivariate data, most commonly correlations between two continuous variables.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import seaborn
import matplotlib.pyplot as pyplot

seaborn.scatterplot(data = df, x = 'col1', y = 'col2')
pyplot.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9qlnpxarca43pzdqhcqf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9qlnpxarca43pzdqhcqf.png" alt="scatter plot example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Line Plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A line plot is a univariate analysis plot. It creates a line that connects all data points.&lt;/li&gt;
&lt;li&gt;It is very useful for observing trends and for time series analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;seaborn.lineplot(data = df, x = "year", y = "passengers")
pyplot.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eocpti3dz56gso5asad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eocpti3dz56gso5asad.png" alt="alt text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Bar Plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bar plots use bars of different heights to represent data values.&lt;/li&gt;
&lt;li&gt;They are mainly used for ranking values.&lt;/li&gt;
&lt;li&gt;They are mostly used with data that has few distinct values.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;seaborn.barplot(data = df, x = "tips", y = "day")
pyplot.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fjysdb6xngyc2y5h04m5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fjysdb6xngyc2y5h04m5.png" alt="bar plot example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Histogram (Hist Plot)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Histograms are used to observe the distribution of a single variable.&lt;/li&gt;
&lt;li&gt;They are used to identify the type of data distribution of a variable.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;seaborn.histplot(data = df, x = "distance")
pyplot.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgxmv2w5zkexl4x73nrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgxmv2w5zkexl4x73nrg.png" alt="alt text"&gt;&lt;/a&gt;&lt;br&gt;
You can also see the kernel density estimation in Histplot by passing parameter kde=True.&lt;/p&gt;
&lt;h2&gt;
  
  
  5. Box Plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A box plot, also called a box-and-whisker plot, displays the five-number summary of a set of data: the minimum, 25th percentile, median, 75th percentile, and maximum.&lt;/li&gt;
&lt;li&gt;It helps in various kinds of analysis, such as spotting outliers.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;seaborn.boxplot(data = df, x = "day", y = "total_bill")
pyplot.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6enmmmj7s5e7qmh3e72h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6enmmmj7s5e7qmh3e72h.png" alt="box plot example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Violin Plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A violin plot is a more comprehensive box plot that shows KDE (kernel density estimation) curves alongside the whiskers.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;seaborn.violinplot(data = df, x = 'cat_var', y = 'num_var')
pyplot.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzao5eqo0apsdntibn8om.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzao5eqo0apsdntibn8om.png" alt="alt text"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  7. Pair Plot
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A pair plot shows all pairwise relations between numerical variables, with each variable's frequency distribution on the diagonal.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;seaborn.pairplot(df, hue = 'species')
pyplot.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/13rerr0cpdo227m30v84.png" class="article-body-image-wrapper"&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/13rerr0cpdo227m30v84.png" alt="pair plot example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Heatmap
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The heatmap was already demonstrated in the previous article of this series. It can take any 2D data and show it as a grid of cells of varying color intensity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wvx64ar9yixbi0czh7oy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wvx64ar9yixbi0czh7oy.png" alt="heatmap example"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are many other types of visualization that can be used as needed, but the ones above are the most informative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Subplots
&lt;/h2&gt;

&lt;p&gt;There is a good article on subplots, which you can see here:&lt;/p&gt;

&lt;div class="ltag__link"&gt;
  &lt;a href="/thalesbruno" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F188803%2F4dbbca19-5163-4e50-958e-bf13b0c0dcae.jpg" alt="thalesbruno"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/thalesbruno/subplotting-with-matplotlib-and-seaborn-5ei8" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Subplotting with matplotlib and seaborn&lt;/h2&gt;
      &lt;h3&gt;Thales Bruno ・ Jun 21 '20&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#python&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#datascience&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;

&lt;p&gt;Or you can go with the subplot constructors&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pyplot&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pyplot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;pyplot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;seaborn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;violinplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pyplot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;seaborn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;violinplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pyplot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;seaborn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;violinplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;c&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;z&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="o"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;pre&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;lt;/&lt;/span&gt;&lt;span class="n"&gt;div&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;/div&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>#P4 - Data Preprocessing</title>
      <dc:creator>Ashutosh Sahu</dc:creator>
      <pubDate>Fri, 11 Jun 2021 08:30:50 +0000</pubDate>
      <link>https://forem.com/ashuto7h/p4-data-preprocessing-44lo</link>
      <guid>https://forem.com/ashuto7h/p4-data-preprocessing-44lo</guid>
      <description>&lt;p&gt;In the previous articles, we saw a simple example of how machine learning helps us predict a target. There are many algorithms and ways to train a model, but all of them need data, and data taken from the real world always contains irregularities. Data preprocessing is the first and foremost step after acquiring data: it puts the data into the desired format so that a model can be trained on it.&lt;/p&gt;

&lt;p&gt;The whole process is known as EDA (Exploratory Data Analysis). EDA also includes visualizing data using various visualization techniques, which we will cover in the next part.&lt;/p&gt;

&lt;p&gt;There is a great article on data preprocessing; &lt;a href="https://towardsdatascience.com/data-preprocessing-concepts-fa946d11c825"&gt;see it here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of Data
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;When we talk about data in machine learning, we usually think of a table consisting of rows and columns: columns represent features, and rows represent individual observations (records).&lt;/li&gt;
&lt;li&gt;There are other types of data, like JSON files, which contain information in the form of embedded documents and fields.&lt;/li&gt;
&lt;li&gt;There is one more type of data that is completely unstructured, unlike the above two: image, video, and other such files.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here we discuss only the first type of data and its preprocessing.&lt;/p&gt;

&lt;h1&gt;
  
  
  Steps of Data preprocessing
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. Gathering Business Knowledge
&lt;/h2&gt;

&lt;p&gt;Business knowledge is the most important thing while preprocessing data. It is often underestimated, but it matters most of the time.&lt;br&gt;
Suppose a dataset contains a feature &lt;code&gt;country&lt;/code&gt;. Some entries are written as &lt;code&gt;UK&lt;/code&gt; while others are written as &lt;code&gt;United Kingdom&lt;/code&gt;. Both mean the same thing, but without business knowledge we cannot figure this out, and the model will end up being trained on inconsistent data.&lt;/p&gt;
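
&lt;p&gt;With that business knowledge in hand, inconsistent labels can be standardized with pandas. A minimal sketch, assuming a hypothetical &lt;code&gt;country&lt;/code&gt; column:&lt;/p&gt;

```python
import pandas

# hypothetical data with inconsistent country labels
df = pandas.DataFrame({'country': ['UK', 'United Kingdom', 'India']})

# map every known variant onto a single canonical label
df['country'] = df['country'].replace({'UK': 'United Kingdom'})
```

&lt;p&gt;After this, both spellings count as one category during training.&lt;/p&gt;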
&lt;h2&gt;
  
  
  2. Data Exploration
&lt;/h2&gt;

&lt;p&gt;Data exploration means examining various aspects of the data. The following are some ways to do so in Python:&lt;/p&gt;
&lt;h3&gt;
  
  
  a. describe()
&lt;/h3&gt;

&lt;p&gt;When we import a dataset in Python using pandas, it is stored in an object called a DataFrame. The DataFrame object has various built-in methods, among which the describe() method gives detailed statistics for the numerical features of the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;path/to/csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The describe() function returns a DataFrame in the following format.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i4p1cab89gk0w3llota.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i4p1cab89gk0w3llota.png" alt="describe" width="449" height="277"&gt;&lt;/a&gt;&lt;br&gt;
As you can see, it tells us the &lt;code&gt;count&lt;/code&gt;, &lt;code&gt;mean&lt;/code&gt;, &lt;code&gt;standard deviation (std)&lt;/code&gt;, &lt;code&gt;min&lt;/code&gt;, &lt;code&gt;max&lt;/code&gt; and the spread of values at the 25%, 50% and 75% percentiles.&lt;/p&gt;
&lt;h3&gt;
  
  
  b. info()
&lt;/h3&gt;

&lt;p&gt;info() is another method provided by the DataFrame object; it shows all the features with their data types and the number of non-null values in each feature.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgyvy1vfqe651k2e86wvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgyvy1vfqe651k2e86wvh.png" alt="info" width="461" height="364"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  c. Pandas Profiling
&lt;/h3&gt;

&lt;p&gt;This is a much better way than the above two because it extracts almost all the information required to explore the data properly. Pandas Profiling is a separate library from pandas.&lt;/p&gt;

&lt;p&gt;To install or upgrade it, use&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pandas-profiling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;then import &lt;code&gt;pandas_profiling&lt;/code&gt; and use it as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;profile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;profile_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;explorative&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minimal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;filename&amp;gt;.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pandas Profiling generates an HTML report that gives you various insights into each feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Missing Values Imputation
&lt;/h2&gt;

&lt;p&gt;It is very common to have missing values in datasets. Mostly they arise because people do not fill in every detail of a form. Sometimes they are caused by machine errors, such as a malfunctioning sensor or device collecting the data.&lt;br&gt;
Be aware that missing values can appear in forms other than an empty cell in a table.&lt;br&gt;
When fields are marked as required and people don't want to fill them, they often enter characters like &lt;code&gt;-&lt;/code&gt;, &lt;code&gt;?&lt;/code&gt; or &lt;code&gt;NIL&lt;/code&gt;. Data gathered by devices may use default values in place of null, like &lt;code&gt;-1&lt;/code&gt;, &lt;code&gt;0&lt;/code&gt; or &lt;code&gt;-99&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Look at the frequencies of such values. If they occur often in your dataset, you can consider them as &lt;code&gt;NaN (Not a Number)&lt;/code&gt;. You can replace these values with NaN, or interpret them as NaN directly while loading the dataset with pandas.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;na_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;na_values&lt;/code&gt; parameter replaces every value that matches an entry in the given list with NaN.&lt;/p&gt;
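
&lt;p&gt;To check the frequency of suspicious placeholder values before treating them as missing, &lt;code&gt;value_counts()&lt;/code&gt; is handy. A minimal sketch, assuming a hypothetical &lt;code&gt;age&lt;/code&gt; column where &lt;code&gt;?&lt;/code&gt; marks missing entries:&lt;/p&gt;

```python
import pandas

# hypothetical column where '?' was entered in place of missing values
df = pandas.DataFrame({'age': ['25', '?', '31', '?', '40']})

# value_counts shows how often each value (including placeholders) occurs
counts = df['age'].value_counts()
# counts['?'] is the number of '?' placeholders (2 in this toy data)
```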

&lt;h3&gt;
  
  
  Getting the count of missing values
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Handling missing values
&lt;/h3&gt;

&lt;p&gt;There are mainly two ways to handle such values.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove the observations (rows) containing them.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# select only those observation for a given column that 
# don't have a Null or Nan value.
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;column&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;notna&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This option should be used when we have a very large dataset and the number of rows removed will not significantly reduce it.&lt;/p&gt;
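
&lt;p&gt;When missing values may appear in any column, pandas also provides &lt;code&gt;dropna()&lt;/code&gt; to drop every row that contains one. A minimal sketch on a toy DataFrame:&lt;/p&gt;

```python
import pandas

# toy data: each column has one missing entry
df = pandas.DataFrame({'a': [1, None, 3], 'b': [4, 5, None]})

# drop every row that has a NaN in any column
cleaned = df.dropna()
```

&lt;p&gt;Only the first row survives here, since it is the only one with no NaN in either column.&lt;/p&gt;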

&lt;ul&gt;
&lt;li&gt;Remove the whole column.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This option should be used only when the proportion of missing values in a column is very high (&amp;gt; 50%).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Impute missing values with the central tendencies of data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A central tendency is a central or typical value for a probability distribution; it is the value around which the bulk of the data lies.&lt;/p&gt;

&lt;h4&gt;
  
  
  When data is numerical
&lt;/h4&gt;

&lt;p&gt;There are two ways to find the central tendency: the mean and the median.&lt;br&gt;
The mean should be chosen only when the data is distributed evenly. Suppose most of the values are like 100, 150, 200, 250 and only a few are like 800; the mean of these values is 300, but the median is 200. In this case, the metric that better approximates the central tendency is the median.&lt;br&gt;
So whenever there is a large difference between the mean and the median, we should choose the median.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
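
&lt;p&gt;The mean/median gap from the 100, 150, 200, 250, 800 example can be verified directly with pandas:&lt;/p&gt;

```python
import pandas

# the toy values from the example: four typical values and one large one
values = pandas.Series([100, 150, 200, 250, 800])

mean = values.mean()      # 300.0, pulled up by the single large value
median = values.median()  # 200.0, closer to the typical value
```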



&lt;h4&gt;
  
  
  When data is categorical
&lt;/h4&gt;

&lt;p&gt;In such cases, the central tendency is the most frequent category (the mode).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Outlier Treatment
&lt;/h2&gt;

&lt;p&gt;An outlier is a data point that differs significantly from the other observations.&lt;/p&gt;

&lt;p&gt;Outlier in scatter plot -&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvx6y6soex6q9w8bg8crs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvx6y6soex6q9w8bg8crs.png" alt="scatter" width="400" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Outlier in Box plot -&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzvnlnrcifad4m15aftq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzvnlnrcifad4m15aftq.png" alt="box" width="560" height="420"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;image source: google&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We will cover the plotting of these graphs in the next part.&lt;br&gt;
Outliers can decrease the performance of a model, so they should be removed or treated.&lt;br&gt;
&lt;strong&gt;IQR (Inter-Quartile Range):&lt;/strong&gt; The Inter-Quartile Range is the distance from the upper quartile (Q3) to the lower quartile (Q1),&lt;br&gt;
where Q3 is the median of the upper half of the data and Q1 is the median of the lower half. Values that fall outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are treated as outliers and removed.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5q7xpotcksngt6p3ee90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5q7xpotcksngt6p3ee90.png" alt="iqr" width="443" height="165"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;image source: google&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lowest quartile 
&lt;/span&gt;&lt;span class="n"&gt;q1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;q2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;q3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;quantile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;iqr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q3&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;q1&lt;/span&gt; 

&lt;span class="n"&gt;low&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;iqr&lt;/span&gt;
&lt;span class="n"&gt;high&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;qr&lt;/span&gt;

&lt;span class="c1"&gt;# remove
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;high&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;low&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="ow"&gt;or&lt;/span&gt; 
&lt;span class="c1"&gt;# replace with median
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;high&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;low&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;col&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
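
&lt;p&gt;The IQR bounds can be checked on a toy series (the values below are made up for illustration):&lt;/p&gt;

```python
import pandas

# toy data: five typical values and one obvious outlier (100)
s = pandas.Series([10, 12, 11, 13, 12, 100])

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

low = q1 - 1.5 * iqr
high = q3 + 1.5 * iqr

# keep only the values inside [low, high]; 100 falls outside and is dropped
filtered = s[(s >= low) & (s <= high)]
```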



&lt;h2&gt;
  
  
  5. Variable Transformation
&lt;/h2&gt;

&lt;p&gt;Sometimes, a single feature or attribute contains multiple values. &lt;br&gt;
example -&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi03z273sjajq859aqq9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi03z273sjajq859aqq9.png" alt="value example" width="155" height="57"&gt;&lt;/a&gt;&lt;br&gt;
In DBMS, normalization requires that a database in first normal form contain only atomic values (values that cannot be divided further). To solve such problems, you can refer to &lt;a href="https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows"&gt;stackoverflow&lt;/a&gt;.&lt;/p&gt;
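
&lt;p&gt;One common recipe from that thread is &lt;code&gt;str.split()&lt;/code&gt; followed by &lt;code&gt;explode()&lt;/code&gt;. A minimal sketch, assuming a hypothetical &lt;code&gt;tags&lt;/code&gt; column holding comma-separated values:&lt;/p&gt;

```python
import pandas

# hypothetical data: one row holds two values in a single cell
df = pandas.DataFrame({'id': [1, 2], 'tags': ['a,b', 'c']})

# split the string into a list, then explode each list item onto its own row
df = df.assign(tags=df['tags'].str.split(',')).explode('tags')
```

&lt;p&gt;After exploding, every row contains a single atomic value.&lt;/p&gt;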

&lt;p&gt;Another common problem is dealing with dates, which usually arrive as strings. We can use pandas to extract the year, month, day, week, etc. from a date.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;year&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;month&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;day&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;week&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. Seasonality
&lt;/h2&gt;

&lt;p&gt;Seasonality is a characteristic of a time series in which the data experiences regular and predictable changes that recur every calendar year. Any predictable fluctuation or pattern that recurs or repeats over a one-year period is said to be seasonal. The simplest example is rainfall.&lt;/p&gt;

&lt;p&gt;Any feature of the dataset that shows seasonality over time can destabilize the model, because it never shows a clear relationship with the output. So if the target itself is not a seasonal value (dedicated models exist for such predictions), we should remove the seasonality from such data.&lt;/p&gt;

&lt;p&gt;A common way to deal with seasonal data is differencing. If the season lasts a week, we can remove it from today's observation by subtracting the value from last week.&lt;br&gt;
Similarly, if a season cycles over a year, like rainfall, we can subtract the daily maximum rainfall on the same day last year to correct for seasonality.&lt;/p&gt;
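
&lt;p&gt;Seasonal differencing can be sketched with the pandas &lt;code&gt;diff()&lt;/code&gt; method; the exact 7-day cycle below is a toy example:&lt;/p&gt;

```python
import pandas

# toy series that repeats the same 7-day pattern three times
values = [1, 2, 3, 4, 5, 6, 7] * 3
s = pandas.Series(values)

# subtract the value from one season (7 observations) earlier;
# the first 7 entries become NaN because they have no earlier season
deseasonalized = s.diff(periods=7)
```

&lt;p&gt;Since the toy pattern repeats exactly, every remaining difference is zero; real data would leave the trend and noise behind.&lt;/p&gt;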

&lt;p&gt;&lt;strong&gt;seasonality&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9duagsuw8jj015g6jc4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9duagsuw8jj015g6jc4n.png" alt="seasonality" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;after removing seasonality&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3iqz3c9g90x3l221e07.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3iqz3c9g90x3l221e07.png" alt="sremove" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;image source :&lt;/strong&gt; &lt;a href="https://machinelearningmastery.com/time-series-seasonality-with-python/"&gt;machinelearningmastery.com&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  7. Bivariate / Correlation Analysis
&lt;/h2&gt;

&lt;p&gt;Bivariate analysis is a quantitative analysis that involves two variables, for determining the empirical relationship between them.&lt;br&gt;
This can be done by drawing plots between the two variables and checking their distributions.&lt;/p&gt;

&lt;p&gt;Correlation, or dependence, is a statistical relationship between two variables or bivariate data. Correlation refers to the degree to which a pair of variables are linearly related. A positive correlation shows that the variables are directly proportional, while a negative one shows that they are inversely proportional. The value of the correlation coefficient lies between -1 and 1 only. &lt;br&gt;
&lt;strong&gt;correlation formula&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmy9la6tcaipe5wkckig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmy9la6tcaipe5wkckig.png" alt="correlation" width="367" height="279"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Why should we remove highly correlated features?
&lt;/h3&gt;

&lt;p&gt;Correlation measures only the association between two variables; it does not tell us about causation. That is, large values of y are not necessarily caused by large values of x. When we have highly correlated features in the dataset, the variance also becomes high, which causes instability in the model: the model becomes sensitive to these values, and slight changes in them affect the whole model.&lt;br&gt;
So it is better to drop one of any two features that show a high correlation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pyplot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;seaborn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;annot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.2f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;square&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vmax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vmin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;linewidths&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dark2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;output&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fdvk0xqcn8lywxjrvvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fdvk0xqcn8lywxjrvvh.png" alt="correlation" width="711" height="645"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;image source : google&lt;/strong&gt;&lt;br&gt;
The heatmap shows the correlation values of all the features with each other. We can analyse which two features are highly correlated and remove one of them.&lt;/p&gt;
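&lt;p&gt;One common recipe (a sketch on a toy dataframe) is to scan the upper triangle of the correlation matrix and drop every column whose absolute correlation with an earlier column exceeds a threshold:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy dataframe: 'b' is perfectly correlated with 'a' (b = 2a)
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [2, 4, 6, 8, 10],
                   'c': [5, 3, 8, 1, 7]})

corr = df.corr().abs()
# Keep only the upper triangle (each pair counted once, diagonal excluded)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)                   # ['b']
print(list(df_reduced.columns))  # ['a', 'c']
```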
&lt;h2&gt;
  
  
  8. Label Encoding
&lt;/h2&gt;

&lt;p&gt;Before moving further we have to recognise what categorical data is. Many times we assume that a variable is categorical only if it is stored as objects (strings). &lt;br&gt;
Suppose a dataset contains a feature year, which takes only 3 distinct values - 2008, 2009, 2010. Are these values categorical or numerical? &lt;br&gt;
They are categorical values, but their numerical representation is misleading for the model. Also, most models can only work with numerical data, which means we cannot use categorical data containing strings directly. To use such features, we have to transform them into numerical form.&lt;br&gt;
For example, we map our categorical data to simple counting numbers like 0, 1, 2, 3...&lt;br&gt;
This is called label encoding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LabelEncoder&lt;/span&gt;
&lt;span class="n"&gt;le&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LabelEncoder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;le&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What if the categories are ordinal? Then we have to encode them in order. For example, if a dataset contains grades from &lt;code&gt;A&lt;/code&gt; to &lt;code&gt;D&lt;/code&gt; that correlate with marks - &lt;code&gt;A&lt;/code&gt; means good marks, &lt;code&gt;D&lt;/code&gt; means bad marks - then this is an ordinal variable. We have to encode it in the same order it follows: &lt;code&gt;A = 0, B = 1, C = 2, D = 3&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;le&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grade_A&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grade_B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grade_C&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;grade_D&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;le&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classes_&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;grades&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;le&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;grades&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
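&lt;p&gt;One caveat: LabelEncoder always assigns codes in sorted order of the labels, which happens to match the grade order here. When the desired ordinal order differs from the alphabetical one, an explicit mapping (a hypothetical sketch) is safer:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical grades column; the desired order is A (best) to D (worst)
df = pd.DataFrame({'grades': ['B', 'A', 'D', 'C', 'A']})

grade_order = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
df['grades'] = df['grades'].map(grade_order)

print(df['grades'].tolist())  # [1, 0, 3, 2, 0]
```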



&lt;h2&gt;
  
  
  9. One Hot Encoding
&lt;/h2&gt;

&lt;p&gt;One hot encoding allows the representation of categorical data to be more expressive. Many machine learning algorithms cannot work with categorical data directly. When one hot encoding is used, it may offer a more nuanced set of predictions than a single label.&lt;br&gt;
In one hot encoding, the label encoded variable is removed and a new binary variable is added for each unique label value. &lt;br&gt;
A visual explains it better:&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdseny4yh9gm7avoahoi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdseny4yh9gm7avoahoi4.png" alt="ohe" width="359" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that one hot encoding introduces high correlation among the new columns (the dummy variable trap), so we should drop one of the generated columns.&lt;br&gt;
Scikit-learn provides this functionality built in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OneHotEncoder&lt;/span&gt;
&lt;span class="n"&gt;df_enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OneHotEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;first&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;grades&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pandas also provides OHE via its get_dummies() function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_enc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;grades&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;drop_first&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  10. Scaling
&lt;/h2&gt;

&lt;p&gt;Scaling means fitting the data values into a common scale (range).&lt;br&gt;
Suppose we have a dataset with two features on completely different scales: one lies in the range 1 to 30, while the other lies in the range 4,000 to 100,000. Some algorithms, like K-nearest neighbors, classify data points based on the distances between their features. In such algorithms, the feature with the small range will barely affect the distance, so keeping it unscaled makes it effectively meaningless. Almost every ML algorithm deals with this kind of geometric distance, except decision-tree-based algorithms.&lt;br&gt;
You can refer to &lt;a href="https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/"&gt;this article&lt;/a&gt; for more info about which algorithms need scaling. &lt;br&gt;
So there is a need to bring the features onto a common scale.&lt;/p&gt;

&lt;p&gt;There are mainly two types of scaling techniques.&lt;/p&gt;
&lt;h3&gt;
  
  
  Standardization
&lt;/h3&gt;

&lt;p&gt;Standardization centers the values around the mean with unit standard deviation: after the transformation, the mean becomes zero and the standard deviation becomes one.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewwlnff9655oa0402itx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewwlnff9655oa0402itx.png" alt="standard formula" width="97" height="37"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;
&lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;expenses&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;expenses&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;before and after standardization&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89qd1cochhxaosutx2u3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F89qd1cochhxaosutx2u3.png" alt="after standard" width="692" height="656"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;image source : google&lt;/strong&gt;&lt;br&gt;
Standardization should be done when the data shows a bell curve, a.k.a. a Normal (Gaussian) Distribution.&lt;/p&gt;
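&lt;p&gt;A quick sanity check of this behaviour, applying the z = (x - mean) / std formula directly to made-up numbers:&lt;/p&gt;

```python
import numpy as np

# Hypothetical feature on a large scale
x = np.array([4000.0, 25000.0, 60000.0, 100000.0])

# The same per-column formula StandardScaler applies
z = (x - x.mean()) / x.std()

print(abs(z.mean()) < 1e-6)     # True: mean is ~0
print(abs(z.std() - 1) < 1e-6)  # True: standard deviation is ~1
```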
&lt;h3&gt;
  
  
  Normalisation
&lt;/h3&gt;

&lt;p&gt;Normalization is used to scale the data of an attribute so that it falls within a smaller range, most commonly between 0 and 1.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23y5a8sg685innf9y31d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F23y5a8sg685innf9y31d.png" alt="norm formula" width="150" height="39"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MinMaxScaler&lt;/span&gt;
&lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MinMaxScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;expenses&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;expenses&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2hbevcozt3nt9wkc3gi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2hbevcozt3nt9wkc3gi.png" alt="normalization" width="224" height="224"&gt;&lt;/a&gt;&lt;br&gt;
Normalization is well suited when the data does not follow a Gaussian (bell curve) distribution. &lt;/p&gt;
&lt;h2&gt;
  
  
  11. Oversampling
&lt;/h2&gt;

&lt;p&gt;Around 2018, it was reported that Amazon had built an AI recruiting tool that reviewed job applicants' resumes and gave them ratings, similar to Amazon shopping ratings. The AI was trained on previous applicants' data, most of whom were men. The model became biased towards men and taught itself that male candidates were preferred. The gender of a candidate was not explicitly given to it, but it inferred gender from resume words like women or female chess champion, and rated those resumes lower.&lt;br&gt;
This all happened because the training data was imbalanced. There are two ways to handle imbalanced data. &lt;/p&gt;
&lt;h3&gt;
  
  
  Undersampling
&lt;/h3&gt;

&lt;p&gt;Reduce the number of samples of the class that has more samples. This method is used when we have a very large dataset and removing instances doesn't cause much loss of data.&lt;/p&gt;
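&lt;p&gt;A minimal undersampling sketch on made-up data (imblearn also ships a ready-made RandomUnderSampler for this): downsample every class to the size of the smallest one.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical imbalanced dataset: 6 samples of class 0, 2 of class 1
df = pd.DataFrame({'feature': range(8),
                   'label':   [0, 0, 0, 0, 0, 0, 1, 1]})

minority_count = df['label'].value_counts().min()  # 2

# Randomly keep minority_count rows from each class
balanced = pd.concat([group.sample(n=minority_count, random_state=0)
                      for _, group in df.groupby('label')])

counts = balanced['label'].value_counts()
print(counts.min(), counts.max())  # 2 2
```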
&lt;h3&gt;
  
  
  Oversampling
&lt;/h3&gt;

&lt;p&gt;Increase the number of samples of the class that has fewer samples. This is also known as data augmentation.&lt;br&gt;
There are two major algorithms for oversampling.&lt;/p&gt;
&lt;h4&gt;
  
  
  SMOTE (Synthetic Minority Oversampling Technique)
&lt;/h4&gt;

&lt;p&gt;SMOTE works by selecting examples that are close in the feature space, drawing a line between them, and creating a new synthetic sample at a point along that line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# x = all features except y, y = imbalanced feature
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;imblearn.over_sampling&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SMOTE&lt;/span&gt;
&lt;span class="n"&gt;smote&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SMOTE&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;os_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;os_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smote&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  ADASYN (Adaptive Synthetic)
&lt;/h4&gt;

&lt;p&gt;ADASYN is a generalized form of the SMOTE algorithm. The only difference between SMOTE and ADASYN is that ADASYN considers the density distribution, which decides the number of synthetic instances generated for each sample. This helps it find the samples that are difficult to learn and, in turn, adaptively shift the decision boundaries toward those samples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;imblearn.over_sampling&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ADASYN&lt;/span&gt;
&lt;span class="n"&gt;adasyn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ADASYN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;99&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;os_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adasyn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;We will see different visualization techniques and plots using matplotlib and seaborn in the next article.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>#P3 - Linear Regression</title>
      <dc:creator>Ashutosh Sahu</dc:creator>
      <pubDate>Fri, 01 Jan 2021 09:35:25 +0000</pubDate>
      <link>https://forem.com/ashuto7h/p3-linear-regression-242</link>
      <guid>https://forem.com/ashuto7h/p3-linear-regression-242</guid>
      <description>&lt;p&gt;Linear regression attempts to find the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an independent variable, and the other is considered to be a dependent variable.&lt;br&gt;
The dependent variable is also known as the &lt;code&gt;criterion variable&lt;/code&gt; and the independent variable is also known as the &lt;code&gt;predictor variable&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here our task is to find how the dependent variable (Y) can be predicted on the basis of the independent variable (X). For this, we assume that all the points (x, y) lie on a straight line, i.e., that there is a linear relationship between them.&lt;/p&gt;

&lt;p&gt;But how can such an assumption give accurate results? It is true that some points don't lie on the line and there will always be some error in our result, but you cannot expect a model to be fully accurate. With a large dataset we will get good accuracy.&lt;/p&gt;

&lt;p&gt;So now we have to find the line that best satisfies the following conditions - &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The line should pass through the point(x,y)
or&lt;/li&gt;
&lt;li&gt;The distance between the line and the point should be minimum.
&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fukyl4xqvlpdg65l3tqn4.gif" alt="Alt Text" width="600" height="450"&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h6&gt;
  
  
  source - google
&lt;/h6&gt;

&lt;p&gt;Now the question arises: how do we find such a line? No, we don't have to take a paper and start plotting all the points.&lt;br&gt;
Recall geometry, which states that the equation of a line is &lt;br&gt;
&lt;code&gt;y = a + bx&lt;/code&gt;, where &lt;code&gt;b&lt;/code&gt; is the &lt;code&gt;slope (gradient)&lt;/code&gt; and &lt;code&gt;a&lt;/code&gt; is the &lt;code&gt;y-intercept&lt;/code&gt;. If we can find the values of a and b, we can compute y for any given x. In this way we will be able to predict the value of y.&lt;/p&gt;

&lt;p&gt;The values of a and b can be calculated by the given formula&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffc06y7s1hvo25bv1ybx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffc06y7s1hvo25bv1ybx4.png" alt="Alt Text" width="229" height="97"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h6&gt;
  
  
  source - &lt;a href="https://www.statisticshowto.com/probability-and-statistics/regression-analysis/find-a-linear-regression-equation/#FindaLinear"&gt;click here&lt;/a&gt;
&lt;/h6&gt;
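&lt;p&gt;The formulas above translate directly into NumPy (made-up sample points for illustration; np.polyfit gives the same coefficients):&lt;/p&gt;

```python
import numpy as np

# Hypothetical observations: y is roughly 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

n = len(x)
# b = (n*Sum(xy) - Sum(x)*Sum(y)) / (n*Sum(x^2) - Sum(x)^2)
b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x ** 2).sum() - x.sum() ** 2)
# a = (Sum(y) - b*Sum(x)) / n
a = (y.sum() - b * x.sum()) / n

print(round(b, 3), round(a, 3))  # 1.95 1.15
```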

&lt;p&gt;What we have discussed till now was based on simple linear regression, in which the value of y depends on one independent variable x.&lt;/p&gt;
&lt;h3&gt;
  
  
  Multiple Linear Regression
&lt;/h3&gt;

&lt;p&gt;When the dependent variable depends on more than one independent variable, multiple linear regression is used.&lt;br&gt;
Here we have to fit a regression line through a multidimensional space of data points.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftsjxikihdyxbk1dyr773.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftsjxikihdyxbk1dyr773.png" alt="Alt Text" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The equation of the line is given by&lt;br&gt;
&lt;code&gt;y = b0 + b1.x1 + b2.x2  + ......&lt;/code&gt;&lt;br&gt;
where &lt;code&gt;x1,x2,...&lt;/code&gt; are the independent variables, &lt;code&gt;b0&lt;/code&gt; is the y-intercept, and &lt;code&gt;b1, b2,...&lt;/code&gt; are the slopes.&lt;br&gt;
Finding the values of &lt;code&gt;b0,b1,b2&lt;/code&gt; in this case is done using some matrix algebra.&lt;/p&gt;
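&lt;p&gt;That matrix algebra is typically the normal equation, b = (X^T X)^(-1) X^T y, which NumPy can solve numerically; a sketch with made-up, noise-free data:&lt;/p&gt;

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no noise)
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0])
y = 1 + 2 * x1 + 3 * x2

# Prepend a column of ones so that b0 becomes the y-intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution of X @ b = y
b, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(b, 6))  # [1. 2. 3.]
```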
&lt;h3&gt;
  
  
  Steps for training a model
&lt;/h3&gt;
&lt;h5&gt;
  
  
  Prerequisite - Python, Google Colab or Jupyter
&lt;/h5&gt;
&lt;h4&gt;
  
  
  The Environment
&lt;/h4&gt;

&lt;p&gt;You can use Jupyter Notebooks along with Anaconda, or simply Google Colab. If you are using Google Colab you have to import the file from GitHub or via the google.colab module. Colab has some good accessibility features. Jupyter runs on your local machine, so if you are low on resources, you should go for Google Colab.&lt;/p&gt;
&lt;h4&gt;
  
  
  Data Collection
&lt;/h4&gt;

&lt;p&gt;The first step in getting a model trained is to collect data. I prefer using &lt;a href="https://www.kaggle.com"&gt;Kaggle&lt;/a&gt; or the &lt;a href="https://archive.ics.uci.edu/ml/index.php"&gt;UCI Machine Learning Repository&lt;/a&gt;, which provide various types of datasets. Datasets are mostly available in the form of CSV (comma-separated values) files.&lt;/p&gt;
&lt;h5&gt;
  
  
  About Dataset
&lt;/h5&gt;

&lt;p&gt;The dataset we have taken for multiple linear regression is from the UCI Machine Learning Repository.&lt;br&gt;
You can get the CSV files from &lt;strong&gt;&lt;a href="https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant"&gt;here&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011) when the power plant was set to work with a full load. Features consist of hourly average ambient variables Temperature (T), Ambient Pressure (AP), Relative Humidity (RH), and Exhaust Vacuum (V) to predict the net hourly electrical energy output (EP) of the plant.&lt;/p&gt;
&lt;h5&gt;
  
  
  Attribute Information:
&lt;/h5&gt;

&lt;p&gt;Features consist of hourly average ambient variables&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Temperature (AT) in the range 1.81°C and 37.11°C,&lt;br&gt;
    Ambient Pressure (AP) in the range 992.89-1033.30 millibar,&lt;br&gt;
    Relative Humidity (RH) in the range of 25.56% to 100.16%&lt;br&gt;
    Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg&lt;br&gt;
    Net hourly electrical energy output (PE) 420.26-495.76 MW&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The averages are taken from various sensors located around the plant that record the ambient variables every second. The variables are given without &lt;strong&gt;normalization&lt;/strong&gt;. &lt;br&gt;
We have to train the model to predict PE.&lt;/p&gt;
&lt;h5&gt;
  
  
  Importing the dataset
&lt;/h5&gt;

&lt;p&gt;pandas is a Python library that can convert CSV, Excel, list, dict, or NumPy array data into a dataframe. &lt;br&gt;
We will use it to import our CSV file.&lt;br&gt;
If you are using Colab, first upload your CSV file to Colab; its path will then be available as '/content/file_name.csv'&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;
&lt;span class="c1"&gt;# enter your CSV file path here
&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path\to\csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;info() prints a summary of your DataFrame (column names, types, and non-null counts), and head(5) returns the first 5 rows of the DataFrame.&lt;/p&gt;

&lt;h4&gt;
  
  
  Separating Independent and Dependent variables
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;[  14.96   41.76 1024.07   73.17] &lt;br&gt;
 463.26&lt;/code&gt;&lt;br&gt;
loc is a property of the DataFrame used to select rows and columns by label; it also accepts boolean arrays. &lt;code&gt;:&lt;/code&gt; is used to select all rows or columns.&lt;/p&gt;
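&lt;p&gt;As a quick illustration of how the boolean mask over &lt;code&gt;df.columns&lt;/code&gt; works (the two-row frame below is made-up toy data, not the power-plant dataset):&lt;/p&gt;

```python
import pandas as pd

# toy two-row frame (hypothetical values, not the real dataset)
df = pd.DataFrame({"AT": [14.96, 25.18], "V": [41.76, 62.96], "PE": [463.26, 444.37]})

# all rows, every column except "PE" (df.columns != "PE" is a boolean mask)
features = df.loc[:, df.columns != "PE"]
# all rows, only the "PE" column
target = df.loc[:, "PE"]

print(features.columns.tolist())  # -> ['AT', 'V']
```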
&lt;h3&gt;
  
  
  Data Preprocessing
&lt;/h3&gt;

&lt;p&gt;Before training a model on any type of data, the data needs to be preprocessed to make it ready for training. We will cover all the preprocessing techniques in the next article.&lt;/p&gt;

&lt;p&gt;Our dataset currently doesn't require any preprocessing, except for feature scaling, but that is also managed internally by &lt;code&gt;sklearn&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  Splitting training and test data
&lt;/h4&gt;

&lt;p&gt;In any supervised learning model, we divide the whole dataset into two parts: a training set and a test set. We train the model on the training set and then evaluate it on the test set. Generally, the training set occupies about 70% to 80% of the whole dataset.&lt;br&gt;
Scikit-learn (sklearn) is a Python library that provides many supervised and unsupervised learning algorithms. It is built on NumPy, SciPy, and Matplotlib.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;test_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here &lt;code&gt;test_size&lt;/code&gt; determines the fraction of the data held out for testing; in this case the test set is 20% of the whole data. &lt;code&gt;random_state&lt;/code&gt; seeds the random shuffling, so the same value always produces the same split.&lt;/p&gt;
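&lt;p&gt;A minimal sketch of what &lt;code&gt;random_state&lt;/code&gt; buys you, using tiny made-up arrays: two calls with the same seed always yield the same split.&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10).reshape(5, 2)   # 5 toy samples, 2 features each
labels = np.arange(5)

# two calls with the same random_state produce identical splits
a_train, a_test, _, _ = train_test_split(data, labels, test_size=0.2, random_state=0)
b_train, b_test, _, _ = train_test_split(data, labels, test_size=0.2, random_state=0)

print((a_test == b_test).all())  # -> True
print(len(a_test))               # -> 1 (20% of 5 samples)
```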

&lt;h4&gt;
  
  
  Fitting the model and predicting the values.
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# predictions
&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actual  |  predicted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{:.2f}  |  {:.2f}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;actual  |  predicted&lt;br&gt;
431.23  |  431.43&lt;br&gt;
460.01  |  458.56&lt;br&gt;
461.14  |  462.75&lt;br&gt;
445.90  |  448.60&lt;br&gt;
451.29  |  457.87&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;And with that, your model is trained.&lt;br&gt;
You can see how closely it predicts the values of "PE".&lt;/p&gt;
&lt;h4&gt;
  
  
  Calculating R-Square, Intercept, Slopes
&lt;/h4&gt;

&lt;p&gt;R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;r_sq&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;r_sq_train&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;r_sq_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r_sq : &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;r_sq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r_sq_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r_sq_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;r_sq&lt;/span&gt;


&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;intercept :&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;intercept_&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;slope :&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coef_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;r_sq :  0.9286947104407257 0.9277253998587902 0.9325315554761303&lt;br&gt;
intercept : 452.8410371616384&lt;br&gt;
slope : [-1.97313099 -0.23649993  0.06387891 -0.15807019]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;model.score()&lt;/code&gt; calculates the R-squared value.&lt;br&gt;
In this case, r_sq tells us that the model explains about 92.86% of the variance in the whole dataset, 92.77% in the training set, and 93.25% in the test set.&lt;br&gt;
&lt;code&gt;model.intercept_&lt;/code&gt; returns the intercept value &lt;code&gt;b0&lt;/code&gt;&lt;br&gt;
&lt;code&gt;model.coef_&lt;/code&gt; returns the list of coefficients (slopes) &lt;code&gt;b1 b2 b3....&lt;/code&gt;&lt;/p&gt;
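&lt;p&gt;To see that &lt;code&gt;intercept_&lt;/code&gt; and &lt;code&gt;coef_&lt;/code&gt; really are the &lt;code&gt;b0, b1, b2...&lt;/code&gt; of the regression equation, here is a small sketch on made-up, exactly linear data; a prediction is just &lt;code&gt;b0 + b1.x1 + b2.x2&lt;/code&gt;:&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# tiny made-up dataset generated exactly as y = 3 + 2*x1 - 1*x2 (no noise)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])
y = 3 + 2 * X[:, 0] - 1 * X[:, 1]

model = LinearRegression().fit(X, y)

# model.predict(X) is exactly b0 + X @ [b1, b2, ...]
manual = model.intercept_ + X @ model.coef_
print(np.allclose(manual, model.predict(X)))  # -> True
print(round(model.intercept_, 2))             # -> 3.0
```

Because the toy data is exactly linear, the fitted intercept and coefficients recover the generating values (3, 2, and -1) up to floating-point error.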
&lt;h2&gt;
  
  
  Feature Selection
&lt;/h2&gt;

&lt;p&gt;Feature selection is the process of reducing the number of independent variables when creating a predictive model. It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and to improve the performance of the model.&lt;/p&gt;
&lt;h3&gt;
  
  
  Backward Elimination Method
&lt;/h3&gt;

&lt;p&gt;It is one of the methods used for feature selection.&lt;br&gt;
Steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select a significance level (generally SL = 0.05)&lt;/li&gt;
&lt;li&gt;Fit the model with all possible predictors&lt;/li&gt;
&lt;li&gt;Find the p-values of all predictors&lt;/li&gt;
&lt;li&gt;Remove the predictor with the highest p-value, fit the model again, and repeat until no p-value is greater than 0.05&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There is one thing to take care of.&lt;br&gt;
&lt;code&gt;y = b0 + b1.x1 + b2.x2 + b3.x3 ...&lt;/code&gt;&lt;br&gt;
In the above equation, notice that every &lt;code&gt;xn&lt;/code&gt; has a multiplier &lt;code&gt;bn&lt;/code&gt;, but the constant &lt;code&gt;b0&lt;/code&gt; does not. The statsmodels package only considers a multiplier if it is attached to a feature; since &lt;code&gt;b0&lt;/code&gt; has no feature, it would not get picked up while creating the model and would simply be dropped. Adding a feature &lt;code&gt;x0&lt;/code&gt; fixed at &lt;code&gt;1&lt;/code&gt; solves the problem, so we need to create a feature column with every value equal to 1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;statsmodels.regression.linear_model&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sm&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="c1"&gt;# add a column of values = 1 (int)
&lt;/span&gt;&lt;span class="n"&gt;be_x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ones&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;9568&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;be_x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# finding significance level
&lt;/span&gt;&lt;span class="n"&gt;x_opt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;be_x&lt;/span&gt;&lt;span class="p"&gt;[:,[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;ols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OLS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;endog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;x_opt&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ols&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9qtdfwpo56f0yxj0zbiu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9qtdfwpo56f0yxj0zbiu.png" alt="Alt Text" width="673" height="548"&gt;&lt;/a&gt;&lt;br&gt;
If you look at the &lt;code&gt;P &amp;gt; |t|&lt;/code&gt; column, no value is greater than 0.05, so there is no useless feature in our model.&lt;/p&gt;
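&lt;p&gt;The loop in step 4 can also be sketched by hand. The helper below computes OLS p-values with plain NumPy, using a normal approximation to the t distribution (an assumption made to keep the sketch dependency-free; statsmodels' summary uses the exact t distribution, so its p-values differ slightly on small samples):&lt;/p&gt;

```python
import numpy as np
from math import erf, sqrt

def ols_p_values(X, y):
    """Two-sided p-values for OLS coefficients (normal approximation)."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y                 # OLS fit: (X'X)^-1 X'y
    resid = y - X @ beta
    s2 = resid @ resid / (n - k)             # residual variance
    se = np.sqrt(s2 * np.diag(xtx_inv))      # standard errors
    t = beta / se
    # p = 2 * (1 - Phi(|t|)) = 1 - erf(|t| / sqrt(2))
    return np.array([1 - erf(abs(ti) / sqrt(2)) for ti in t])

def backward_eliminate(X, y, sl=0.05):
    """Drop the highest-p predictor until all remaining p-values are <= sl."""
    cols = list(range(X.shape[1]))
    while True:
        p = ols_p_values(X[:, cols], y)
        worst = int(p.argmax())
        if p[worst] <= sl:
            return cols                      # all remaining predictors are significant
        cols.pop(worst)                      # remove the worst predictor and refit

# made-up data: column 0 is the intercept, column 1 matters, column 2 is pure noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
noise_col = rng.normal(size=300)
X = np.column_stack([np.ones(300), x1, noise_col])
y = 3 + 2 * x1 + rng.normal(scale=0.5, size=300)

kept = backward_eliminate(X, y)
print(kept)  # the intercept and x1 columns survive elimination
```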

&lt;h4&gt;
  
  
  What's Next
&lt;/h4&gt;

&lt;p&gt;Practice by yourself: choose a dataset and try to fit a model to it. Remember that we have not yet dealt with factors like categorical variables and null values; they will all be covered in the next article, on data preprocessing. Choose your dataset wisely.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>#P2 - ML Types</title>
      <dc:creator>Ashutosh Sahu</dc:creator>
      <pubDate>Thu, 31 Dec 2020 07:38:30 +0000</pubDate>
      <link>https://forem.com/ashuto7h/p2-ml-types-44ed</link>
      <guid>https://forem.com/ashuto7h/p2-ml-types-44ed</guid>
      <description>&lt;p&gt;There are three types of Machine Learning&lt;/p&gt;

&lt;h2&gt;
  
  
  Supervised Machine Learning
&lt;/h2&gt;

&lt;p&gt;It is task-oriented learning, where you provide various input and output samples. The machine then tries to learn from the given samples and figure out a relation (a mapping between the input and output). At a certain point it has learned enough from those samples to predict nearly the correct output for new inputs.&lt;/p&gt;

&lt;p&gt;There are two types of supervised learning.&lt;br&gt;
&lt;strong&gt;1. Regression&lt;/strong&gt;&lt;br&gt;
It deals with predictions related to numerical data: you have to predict a numerical (continuous) value as output.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example:&lt;br&gt;
You are given a student's total marks for each of the last 7 years, and you have to predict the total for this year.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;2. Classification&lt;/strong&gt;&lt;br&gt;
It deals with predictions related to categorical data: on the basis of the input, you have to predict the category to which the data belongs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: sentiment analysis&lt;br&gt;
You are given the facial features of a person and you have to identify their sentiment (happy, sad, angry).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Unsupervised Machine Learning
&lt;/h2&gt;

&lt;p&gt;In this type of learning, we have only inputs and no outputs, so supervised learning cannot be applied. Here the data is processed to find all the possible ways to group it into types called labels (classes); assigning a class to a member of the data is called labeling.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: the buying habits of people&lt;br&gt;
Different people have different needs and interests, and it is hard to group them. By applying unsupervised learning, we can identify new trends in buying habits.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Reinforcement Learning
&lt;/h2&gt;

&lt;p&gt;It is a mixed style of learning, in which a previously learned model is continuously improved upon.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: an action game that gets harder as you play it.&lt;br&gt;
I am not sure, but this could be an example of reinforcement learning. When I play Shadow Fight 3, I find that it gives suggestions like "your opponent is learning your moves, use different moves", so I think it could be related to reinforcement learning.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>#P1 - Introduction</title>
      <dc:creator>Ashutosh Sahu</dc:creator>
      <pubDate>Thu, 31 Dec 2020 06:43:31 +0000</pubDate>
      <link>https://forem.com/ashuto7h/p1-introduction-4ma4</link>
      <guid>https://forem.com/ashuto7h/p1-introduction-4ma4</guid>
      <description>&lt;h2&gt;
  
  
  Machine learning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Arthur Samuel&lt;/strong&gt; first coined the term and described it as “the field of study that gives computers the ability to learn without being explicitly programmed.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tom Mitchell&lt;/strong&gt; provided a more formal definition:&lt;br&gt;
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”&lt;/p&gt;

&lt;p&gt;Sounds both interesting and confusing at the same time, doesn't it?&lt;/p&gt;

&lt;p&gt;Another question that comes to mind: why are there so many related terms, like deep learning, artificial intelligence, and neural networks? Are they all the same, or something different?&lt;/p&gt;

&lt;h2&gt;
  
  
  Artificial Intelligence
&lt;/h2&gt;

&lt;p&gt;Artificial intelligence (AI) refers to the simulation of human intelligence in machines that exhibit traits associated with a human mind, such as learning and problem-solving.&lt;br&gt;
In simple words, when you see a machine taking decisions like a human, it is an AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Learning
&lt;/h2&gt;

&lt;p&gt;Deep learning is a form of AI that mimics the workings of the human brain in processing data, and it is used for detecting objects, recognizing speech, translating languages, and making decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Neural Networks
&lt;/h2&gt;

&lt;p&gt;Neural networks are a series of algorithms that mimic the operations of a human brain to recognize relationships between vast amounts of data. &lt;/p&gt;

&lt;p&gt;You can relate all the above terms using the image below.&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fopfqzb8zpd0tbxodp2ra.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fopfqzb8zpd0tbxodp2ra.png" alt="Alt Text" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  courtesy: Google
&lt;/h6&gt;

&lt;h3&gt;
  
  
  My Opinions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Should we go for ML or Development ?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my view, ML is one of the most overrated topics of the decade, and development is now becoming underrated, only because of the cool things you can do with ML.&lt;br&gt;
Actually, it is hard to develop something completely from scratch, and that work is in no way less demanding than ML.&lt;br&gt;
Just by learning how to use a Python library and applying it to a dataset to create a model, you cannot say that you know machine learning.&lt;br&gt;
So when choosing between the two, you should base your decision on other factors.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
