<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sridhar CR</title>
    <description>The latest articles on Forem by Sridhar CR (@sridharcr).</description>
    <link>https://forem.com/sridharcr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1048049%2Fc233f2b7-59b6-4d00-853e-2a2875d73cc4.jpeg</url>
      <title>Forem: Sridhar CR</title>
      <link>https://forem.com/sridharcr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sridharcr"/>
    <language>en</language>
    <item>
      <title>Stop AI From Seeing What It Shouldn’t: A Practical Guide to PII Safety</title>
      <dc:creator>Sridhar CR</dc:creator>
      <pubDate>Sun, 16 Nov 2025 07:15:00 +0000</pubDate>
      <link>https://forem.com/sridharcr/stop-ai-from-seeing-what-it-shouldnt-a-practical-guide-to-pii-safety-38ll</link>
      <guid>https://forem.com/sridharcr/stop-ai-from-seeing-what-it-shouldnt-a-practical-guide-to-pii-safety-38ll</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;
AI features are great, but if you feed them personal data, things can go wrong very quickly. Before your app talks to an AI model, make sure it is not accidentally sending PII.&lt;/p&gt;
&lt;h1&gt;
  
  
  We all know what PII is
&lt;/h1&gt;

&lt;p&gt;PII stands for Personally Identifiable Information. It is anything that can directly or indirectly identify a person.&lt;/p&gt;

&lt;p&gt;Names, emails, phone numbers, home addresses, government IDs, bank details, biometrics: the list goes on.&lt;/p&gt;

&lt;p&gt;We deal with it every day while building products. Nothing new here.&lt;/p&gt;

&lt;p&gt;The real problem starts when we mix PII with AI features.&lt;br&gt;
The challenge isn’t that PII exists. The challenge is that AI systems eagerly consume whatever data we send them... even when they shouldn’t.&lt;/p&gt;
&lt;h1&gt;
  
  
  How PII can get exposed to AI (directly or accidentally)
&lt;/h1&gt;

&lt;p&gt;There are many situations where PII slips into the AI layer without anyone intending it. A few examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Intelligent search&lt;/strong&gt;&lt;br&gt;
A user searches for something like "John Smith account balance 4367" and the search system forwards the full prompt to an LLM behind the scenes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data analytics&lt;/strong&gt;&lt;br&gt;
We feed big datasets into an AI model to extract insights and forget that those tables contain names, emails and other personal info.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fraud or risk detection&lt;/strong&gt;&lt;br&gt;
AI models often use IDs, location and behavioral history to detect fraud. Sometimes the pipeline ends up passing the raw data to an LLM.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chatbots and customer support&lt;/strong&gt;&lt;br&gt;
Users often share private information like account numbers or home addresses in messages and those messages get forwarded to an AI agent to generate answers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Document or content summarization&lt;/strong&gt;&lt;br&gt;
Users upload resumes, invoices, medical reports or contracts, and the AI summarizer sees everything inside those documents.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In all of these cases the team never &lt;em&gt;intends&lt;/em&gt; to expose PII. But it happens because the AI layer simply receives whatever data comes its way.&lt;br&gt;
And importantly: this risk applies during both &lt;strong&gt;training&lt;/strong&gt; and &lt;strong&gt;inference&lt;/strong&gt; (every time your app calls an AI API).&lt;/p&gt;
&lt;h1&gt;
  
  
  The risk of exposing PII to AI
&lt;/h1&gt;

&lt;p&gt;Here is why this gets scary fast:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI does not forget&lt;/strong&gt;&lt;br&gt;
If a model sees PII during training or in a prompt, it may surface it later. For example, someone asks the model for sample bank fraud data and the output ends up containing a real email or phone number that it once saw.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Legal trouble&lt;/strong&gt;&lt;br&gt;
Regulations like GDPR and CCPA are very strict about how personal data is processed. If your AI feature handles PII without the right controls, you can get into real legal and financial trouble.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data exposure&lt;/strong&gt;&lt;br&gt;
Even if the model behaves correctly, logs, prompts, intermediate storage, training data and prompt history can leak personal information to the wrong place.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;User trust&lt;/strong&gt;&lt;br&gt;
Once users think their private data was used in a way they did not agree to, trust is gone. Sometimes permanently.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;
  
  
  How to prevent it
&lt;/h1&gt;

&lt;p&gt;Here are practical ways to stop PII from reaching your AI model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Safest option: do not send PII to AI&lt;/strong&gt;&lt;br&gt;
If the task does not need personal info, strip it out. Replace names with IDs, or anonymize the entire dataset before doing any AI processing. This is often impractical though.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mask, redact or remove PII&lt;/strong&gt;&lt;br&gt;
If the data is simple and well-scoped (for example, Indian phone numbers only), regex can sometimes be enough. But for open-ended scenarios, tools like Presidio or NER-based models help detect names, emails, phone numbers and other identifiers before the text or dataset reaches the AI model.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
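&lt;p&gt;The replace-names-with-IDs idea can be sketched in a few lines. This is a minimal, illustrative pseudonymization sketch (the &lt;code&gt;user_N&lt;/code&gt; scheme is an assumption): the mapping table stays on your side and is never sent to the model.&lt;/p&gt;

```python
import itertools

# Mapping from real identifiers to stable pseudonyms.
# Kept outside the AI layer; never shipped with the prompt.
counter = itertools.count(1)
pseudonyms = {}

def pseudonymize(value):
    # Reuse the same pseudonym for repeated values so joins still work.
    if value not in pseudonyms:
        pseudonyms[value] = f"user_{next(counter)}"
    return pseudonyms[value]

record = {"name": "John Smith", "balance": 4200}
safe_record = {"name": pseudonymize(record["name"]), "balance": record["balance"]}
print(safe_record)  # -> {'name': 'user_1', 'balance': 4200}
```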

&lt;p&gt;You can mask values (e.g. joh***@gmail.com), replace them with placeholders, or delete them completely.&lt;/p&gt;
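&lt;p&gt;For the simpler, well-scoped cases, a couple of regexes can be enough. A minimal sketch; the patterns below are illustrative, not production-grade:&lt;/p&gt;

```python
import re

# Illustrative patterns only; real-world PII detection needs far more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")

def mask_email(match):
    local, _, domain = match.group(0).partition("@")
    # Keep the first three characters, mask the rest of the local part.
    return local[:3] + "*" * max(len(local) - 3, 1) + "@" + domain

def redact(text):
    text = EMAIL_RE.sub(mask_email, text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(redact("Reach john.doe@example.com or +91 98765 43210"))
# -> Reach joh*****@example.com or [PHONE]
```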

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkpx83sxz6wdct3e07f3.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvkpx83sxz6wdct3e07f3.gif" alt=" " width="720" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s a quick code example using Presidio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;presidio_analyzer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnalyzerEngine&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;presidio_anonymizer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnonymizerEngine&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;presidio_analyzer.nlp_engine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SpacyNlpEngine&lt;/span&gt;

&lt;span class="c1"&gt;# --- Initialize Presidio ---
&lt;/span&gt;&lt;span class="n"&gt;nlp_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SpacyNlpEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en_core_web_lg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnalyzerEngine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nlp_engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;nlp_engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;supported_languages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;anonymizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnonymizerEngine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# --- Input text ---
&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My name is John Doe, my email is john.doe@example.com and my phone number is +1-202-555-0170.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# --- Detect PII ---
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# --- Mask PII ---
&lt;/span&gt;&lt;span class="n"&gt;anonymized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anonymizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anonymize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;analyzer_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;anonymizers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEFAULT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mask&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Original:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Anonymized:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;anonymized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Upon execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Original: My name is John Doe, my email is john.doe@example.com and my phone number is +1-202-555-0170.
Anonymized: My name is **********, my email is ********************** and my phone number is *******************
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once masking or redaction is in place, the next step is to ensure the AI only receives the minimum amount of information it needs.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data minimization&lt;/strong&gt;&lt;br&gt;
Only send the data that is absolutely necessary. Most AI features do not need full user profiles.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add guardrails to AI inputs and outputs&lt;/strong&gt;&lt;br&gt;
Scan incoming user prompts for PII and block or redact them.&lt;br&gt;
Also scan AI responses to make sure the model is not trying to output PII back to the user.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Secure logs and storage&lt;/strong&gt;&lt;br&gt;
If you store prompts, chat messages or training data that might contain PII, add access controls and deletion policies.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
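&lt;p&gt;An input guardrail reduces to a pre-flight check: detect, then redact or block before the prompt leaves your service. A minimal sketch of the idea, with a regex standing in for a real detector such as Presidio (the patterns and the block-vs-redact policy here are assumptions):&lt;/p&gt;

```python
import re

# Stand-in detector; a real guardrail would use Presidio or an NER model.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{8,}\d"),
}

def guard_prompt(prompt, policy="redact"):
    """Return (safe_prompt, findings). With policy='block', raise instead."""
    findings = []
    safe = prompt
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(safe):
            findings.append(label)
            safe = pattern.sub(f"[{label}]", safe)
    if findings and policy == "block":
        raise ValueError(f"PII detected: {findings}")
    return safe, findings

safe, found = guard_prompt("Email me at jane@corp.io about order 7")
print(safe)  # -> Email me at [EMAIL] about order 7
```

The same check can run on the model's response before it is shown to the user, covering the output direction as well.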

&lt;h1&gt;
  
  
  How the industry handles it
&lt;/h1&gt;

&lt;p&gt;Big AI platforms are already aware of this problem and are adding safety layers.&lt;/p&gt;

&lt;p&gt;For example, &lt;strong&gt;OpenAI Agent Builder&lt;/strong&gt; has built-in guardrails that detect and redact PII from both inputs and outputs. &lt;/p&gt;

&lt;p&gt;It combines rule-based techniques and AI detection to catch things like names, addresses, emails and financial details before the model sees them or before the model tries to return them.&lt;/p&gt;

&lt;p&gt;There are also popular open source tools such as &lt;strong&gt;Microsoft Presidio&lt;/strong&gt;, which many companies use to scan and mask PII before running analytics or AI workloads.&lt;/p&gt;

&lt;p&gt;The pattern is the same across the industry:&lt;br&gt;
&lt;strong&gt;Detect PII early, remove or protect it, then use AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These systems are not perfect, but they reflect one clear lesson: privacy cleanup must happen &lt;em&gt;before&lt;/em&gt; the AI sees the data, not after.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;If you are an entrepreneur, a developer or someone excited about adding AI features to your product, here is the simple takeaway:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Make your app smarter, but do not make it reckless.&lt;br&gt;
Handle PII before the data reaches the AI layer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is always easier to prevent a privacy issue than to fix one after damage has been done.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Microsoft Presidio – Data Protection and De-identification SDK&lt;br&gt;
&lt;a href="https://microsoft.github.io/presidio/" rel="noopener noreferrer"&gt;https://microsoft.github.io/presidio/&lt;/a&gt;&lt;br&gt;
(Microsoft GitHub)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Implementing Text PII Anonymization” – Arize AI blog&lt;br&gt;
&lt;a href="https://arize.com/blog/pii-removal-microsoft-presidio-chatbot/" rel="noopener noreferrer"&gt;https://arize.com/blog/pii-removal-microsoft-presidio-chatbot/&lt;/a&gt;&lt;br&gt;
(Arize AI)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Enhancing the De-identification of Personally Identifiable Information in Educational Data”&lt;br&gt;
Y. Shen et al., Jan 2025&lt;br&gt;
&lt;a href="https://arxiv.org/abs/2501.09765" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2501.09765&lt;/a&gt;&lt;br&gt;
(arXiv)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Proactive Privacy Amnesia for Large Language Models: Safeguarding PII with Negligible Impact on Model Utility”&lt;br&gt;
Martin Kuo et al., Feb 2025&lt;br&gt;
&lt;a href="https://arxiv.org/abs/2502.17591" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2502.17591&lt;/a&gt;&lt;br&gt;
(arXiv)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>data</category>
      <category>infosec</category>
    </item>
    <item>
      <title>How to Migrate Massive Data in Record Time—Without a Single Minute of Downtime 🕑</title>
      <dc:creator>Sridhar CR</dc:creator>
      <pubDate>Fri, 13 Dec 2024 17:26:22 +0000</pubDate>
      <link>https://forem.com/sridharcr/how-to-migrate-massive-data-in-record-time-without-a-single-minute-of-downtime-113g</link>
      <guid>https://forem.com/sridharcr/how-to-migrate-massive-data-in-record-time-without-a-single-minute-of-downtime-113g</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Imagine you're working at one of the biggest enterprise multitenant SaaS platforms. Everything is moving at lightning speed: new data is pouring in, orders are flowing from multiple channels, and your team is constantly iterating on new features. But there's one problem: your data infrastructure, built on MongoDB, is becoming a bottleneck.&lt;/p&gt;

&lt;p&gt;While MongoDB serves well for operational data, it's struggling to handle the complexity of modern data analytics, aggregations, and transformations. Running advanced queries or performing complex analytics is becoming increasingly difficult.&lt;/p&gt;

&lt;p&gt;To stay competitive, your team is planning a migration to an HTAP (Hybrid Transactional/Analytical Processing) SQL database—one that can seamlessly support both OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) workloads. The challenge? Zero downtime is non-negotiable. Disrupting customer operations or compromising data integrity during migration simply isn't an option.&lt;/p&gt;

&lt;p&gt;Keep reading to discover how I solved this challenge and ensured a seamless migration based on my experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Researching Market Tools
&lt;/h2&gt;

&lt;p&gt;There are numerous tools for ETL, and readily available options like Airbyte can help move data between databases. In this case, however, the migration is from NoSQL to SQL and includes data transformations to normalize the data. After some analysis, I decided to go with Apache Spark.&lt;/p&gt;

&lt;p&gt;Apache Spark is known for its speed, scalability, and ability to handle massive datasets. But the real question is: how do you leverage Spark to ensure a fast, efficient ETL process that migrates your data without disrupting business operations?&lt;/p&gt;

&lt;h2&gt;
  
  
  Design and Architecture
&lt;/h2&gt;

&lt;p&gt;To achieve zero downtime migration, I broke down the task into two key components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Migration Phase: Moving the initial bulk of data to the new system.&lt;/li&gt;
&lt;li&gt;Live Sync Phase: Ensuring continuous synchronization between the old and new systems during and after the migration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By designing these two phases carefully and coupling them with a timeline approach, I ensured a seamless, uninterrupted transition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzmf8ks3rn7b71xuz9z8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzmf8ks3rn7b71xuz9z8.png" alt="Image description" width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migration Phase: Lifting the Bulk Data&lt;/strong&gt;&lt;br&gt;
The Migration Phase focuses on transferring large volumes of historical data from the source database to the target system efficiently. The goal is to move the majority of the data in one go, with minimal impact on ongoing business operations.&lt;/p&gt;

&lt;p&gt;During this phase, Spark handles the heavy lifting. Instead of performing the migration sequentially—something that could take days or weeks for large datasets—Spark divides the data into smaller chunks called partitions. Each partition is processed in parallel across multiple worker nodes in the Spark cluster, speeding up the migration process significantly.&lt;/p&gt;
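&lt;p&gt;The chunk-and-parallelize idea can be illustrated in plain Python, with a thread pool standing in for Spark executors (this is a toy sketch, not Spark code; the names and chunk size are illustrative):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# Toy "source collection"; in the real migration this was a MongoDB collection.
source = [{"_id": i, "name": f"user-{i}"} for i in range(100)]
target = []

def chunks(rows, size):
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def migrate_chunk(chunk):
    # Transform (normalize) then load; Spark does this per partition.
    return [{"id": row["_id"], "name": row["name"].upper()} for row in chunk]

# Each chunk is processed independently, like a Spark partition on a worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(migrate_chunk, chunks(source, 25)):
        target.extend(result)

print(len(target))  # -> 100
```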

&lt;p&gt;&lt;strong&gt;Live Sync Phase: Syncing the Continuous Data&lt;/strong&gt;&lt;br&gt;
While the Migration Phase focuses on transferring bulk historical data, the Live Sync Phase ensures that any ongoing changes to the data in the source database are continuously reflected in the target database in real-time. This phase keeps both systems in sync during migration and handles inserts, updates, and deletes without downtime.&lt;/p&gt;

&lt;p&gt;By relying on real-time data processing, I ensured that new data was continuously migrated, without interrupting business operations.&lt;/p&gt;
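&lt;p&gt;The sync loop itself reduces to applying an ordered stream of change events to the target. A minimal sketch of that apply step (the event shape is an assumption; in MongoDB these events would come from a change stream):&lt;/p&gt;

```python
# Target table modeled as a dict keyed by primary key.
target_table = {}

def apply_event(event):
    # Upserts make replayed events idempotent, which matters when the
    # sync job restarts and re-reads part of the stream.
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        target_table[key] = event["doc"]
    elif op == "delete":
        target_table.pop(key, None)

events = [
    {"op": "insert", "key": 1, "doc": {"status": "new"}},
    {"op": "update", "key": 1, "doc": {"status": "paid"}},
    {"op": "insert", "key": 2, "doc": {"status": "new"}},
    {"op": "delete", "key": 2, "doc": None},
]
for event in events:
    apply_event(event)

print(target_table)  # -> {1: {'status': 'paid'}}
```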

&lt;h2&gt;
  
  
  Parallelism—The Secret to Speed and Efficiency
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Spark’s Native Parallelism&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At the heart of Spark’s power is its ability to partition and process data in parallel across multiple nodes. By splitting large datasets into smaller partitions, Spark handles each partition independently, dramatically speeding up the entire ETL process. Spark’s built-in parallelism handles the distribution of data automatically, ensuring optimal performance without micromanaging the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Custom Parallelism—Scaling with Multiple Spark Clusters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the workload is too large for a single cluster to handle efficiently, custom parallelism comes into play. By running multiple Spark clusters in parallel, I was able to distribute the workload across different clusters, each processing a subset of the data. This horizontal scaling significantly improved performance, allowing me to manage even larger datasets across distributed environments.&lt;/p&gt;

&lt;p&gt;In this setup, each Spark cluster operates independently, but they are orchestrated to maintain a smooth and efficient migration process. By customizing parallelism, I was able to maximize available resources, ensuring no single cluster became overwhelmed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring and Safeguarding the Migration
&lt;/h2&gt;

&lt;p&gt;To ensure a smooth, error-free migration, I designed the read and write operations and the data transformations as individual action classes, each adhering to the Single Responsibility Principle. This modular approach made the migration pipeline easy to manage and extend. I used a Finite State Automaton (FSA) to track the stages of the migration, from extraction to transformation and loading, ensuring that each step executed in sequence and that errors could be quickly pinpointed.&lt;/p&gt;

&lt;p&gt;Error tracking was integrated into every action class, with granular logging to capture failures at specific points, making it easy to troubleshoot and recover. I also implemented regular checkpoints, which allowed me to resume the migration from the last successful point in case of failure, minimizing downtime and reprocessing. Additionally, I continuously monitored performance to track execution times, resource usage, and error rates, helping to optimize the pipeline and ensure everything ran smoothly throughout the migration.&lt;/p&gt;
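&lt;p&gt;The checkpointing idea can be sketched as recording the last successfully processed chunk and skipping completed work on restart (all names here are illustrative):&lt;/p&gt;

```python
checkpoint = {"last_done": -1}  # persisted to durable storage in practice

def run_migration(chunk_ids, process, checkpoint):
    for chunk_id in chunk_ids:
        if chunk_id > checkpoint["last_done"]:
            process(chunk_id)
            # Commit the checkpoint only after the chunk fully succeeds.
            checkpoint["last_done"] = chunk_id

done = []
run_migration(range(5), done.append, checkpoint)

# Simulate a restart: completed chunks are skipped, nothing is reprocessed.
run_migration(range(5), done.append, checkpoint)
print(done)  # -> [0, 1, 2, 3, 4]
```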

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As I reflect on the journey, it’s clear: Apache Spark was the true hero of the migration. From enabling efficient ETL processes with parallelism to ensuring zero downtime migration, Spark transformed the way our company handles data.&lt;/p&gt;

&lt;p&gt;With Spark, I’ve learned that scalability, speed, and reliability don’t have to be mutually exclusive. Most importantly, I’ve unlocked the ability to migrate and transform data without ever missing a beat.&lt;/p&gt;




&lt;p&gt;Are you ready to revolutionize your ETL workflows with Spark? Whether you're migrating data or building a real-time data pipeline, Spark’s power can help you achieve the performance, scalability, and reliability your business needs.&lt;/p&gt;

&lt;p&gt;Feel free to share your experiences, ask questions, or let me know how you've used Apache Spark for your ETL processes. Let’s continue the conversation!&lt;/p&gt;

</description>
      <category>etl</category>
      <category>spark</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Boost Your Redis-Powered Caching: Unleashing the Magic of Efficient Data Serialization</title>
      <dc:creator>Sridhar CR</dc:creator>
      <pubDate>Tue, 10 Oct 2023 13:13:10 +0000</pubDate>
      <link>https://forem.com/sridharcr/boost-your-redis-powered-caching-unleashing-the-magic-of-efficient-data-serialization-l13</link>
      <guid>https://forem.com/sridharcr/boost-your-redis-powered-caching-unleashing-the-magic-of-efficient-data-serialization-l13</guid>
      <description>&lt;p&gt;In the quest to optimize application performance, caching is often the go-to strategy. However, many popular standalone caching solutions like &lt;strong&gt;Redis&lt;/strong&gt; and &lt;strong&gt;Memcached&lt;/strong&gt; primarily deal with string data. Consequently, data originating from the application must be converted into string format before it's pushed into the cache. This process essentially requires data serialization for storage.&lt;/p&gt;

&lt;p&gt;Even with various caching strategies like Cache-Aside, Write-Through, Write-Behind, or Read-Through, the fundamental requirement remains the same: reading and writing data to and from the cache. When retrieving data from the cache, the application is faced with the task of deserialization, converting the stored string back into its original data format. It's important to note that this serialization and deserialization process can consume a significant amount of time, with the time required directly linked to the size of the data.&lt;/p&gt;

&lt;p&gt;Thankfully, there are alternative methods for serializing data that don't involve converting it to JSON. One such approach is the &lt;strong&gt;MessagePack&lt;/strong&gt; format. MessagePack offers efficient data storage and significantly faster serialization. Find out more about &lt;a href="https://msgpack.org/index.html" rel="noopener noreferrer"&gt;MessagePack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A simple representation of how MessagePack stores data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fder79tuke6fyryu8r3y8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fder79tuke6fyryu8r3y8.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Advantages of MessagePack
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;: MessagePack is a binary serialization format, whereas JSON is a text-based format. This means MessagePack typically requires less space to represent the same data, resulting in smaller message sizes. Smaller payloads can lead to reduced network and storage usage, which can be especially important in resource-constrained environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed&lt;/strong&gt;: Due to its binary nature, MessagePack is faster to encode (serialize) and decode (deserialize) compared to JSON. This speed advantage is particularly beneficial in applications where rapid data serialization and deserialization are crucial for performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compatibility&lt;/strong&gt;: MessagePack is designed to be language-agnostic, and there are libraries available for a wide range of programming languages. This makes it easier to work with MessagePack in a multi-language or multi-platform environment, where different components of an application may be written in different languages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native Data Types&lt;/strong&gt;: MessagePack includes native support for a wide range of data types, including integers, floats, strings, arrays, and maps. This can lead to more efficient and accurate serialization and deserialization of data, especially when dealing with complex or nested data structures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-Platform&lt;/strong&gt;: MessagePack's binary format makes it suitable for cross-platform data exchange. It can be used to transmit data between different systems and architectures without compatibility issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backward and Forward Compatibility&lt;/strong&gt;: MessagePack provides a level of backward and forward compatibility. New fields or data types can be added to your MessagePack messages without breaking compatibility with older versions of your software. This flexibility is valuable when working with evolving data schemas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reduced Overhead&lt;/strong&gt;: JSON includes metadata and human-readable keys, which add overhead to the data. MessagePack, being a binary format, eliminates much of this overhead, resulting in more compact data representation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming Support&lt;/strong&gt;: MessagePack is well-suited for streaming data, as you can serialize and deserialize data incrementally. This is particularly useful for scenarios like real-time data processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Binary Data Handling&lt;/strong&gt;: MessagePack is excellent for handling binary data, such as images or serialized objects, as it doesn't require escaping or encoding of binary characters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ecosystem&lt;/strong&gt;: MessagePack has a growing ecosystem of libraries and tools that support it, making it easier to integrate into your existing applications and infrastructure.&lt;/p&gt;

&lt;p&gt;One of MessagePack's noteworthy features is its compatibility with a wide range of programming languages. In this post, we explore MessagePack's performance benefits by using Python to serialize a dictionary with MessagePack and comparing the results to JSON. We ran benchmark tests on 70KB of JSON data loaded as a dictionary.&lt;/p&gt;

&lt;p&gt;With MessagePack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Serialization is approximately &lt;strong&gt;4 times faster&lt;/strong&gt; compared to JSON.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deserialization is around &lt;strong&gt;1.2 times faster&lt;/strong&gt; than with JSON.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;msgpack&lt;/span&gt;

&lt;span class="c1"&gt;# Replace your data in 'dict_70_kb'
&lt;/span&gt;&lt;span class="n"&gt;nonserialized_dict_70kb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dict_70_kb&lt;/span&gt;

&lt;span class="c1"&gt;# Serializer methods
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;json_serializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;json_serialized_dict_70kb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json_serialized_dict_70kb&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;msgpack_serializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;msgpack_serialized_dict_70kb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msgpack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;msgpack_serialized_dict_70kb&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;msgpack_pack_serializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;msgpack_serialized_dict_70kb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msgpack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;packb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;msgpack_serialized_dict_70kb&lt;/span&gt;

&lt;span class="c1"&gt;# Deserializer methods
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;json_deserializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;deserialized_dict_70kb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;deserialized_dict_70kb&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;msgpack_deserializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;deserialized_dict_70kb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msgpack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;deserialized_dict_70kb&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;msgpack_pack_deserializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;deserialized_dict_70kb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msgpack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unpackb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;deserialized_dict_70kb&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JSON serialization: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="nf"&gt;json_serializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nonserialized_dict_70kb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Msgpack serialization: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="nf"&gt;msgpack_serializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nonserialized_dict_70kb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Msgpack pack serialization: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="nf"&gt;msgpack_pack_serializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nonserialized_dict_70kb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;json_serialized_dict_70kb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;json_serializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nonserialized_dict_70kb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;msgpack_serialized_dict_70kb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;msgpack_serializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nonserialized_dict_70kb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;msgpack_pack_serialized_dict_70kb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;msgpack_pack_serializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nonserialized_dict_70kb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JSON deserialization: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="nf"&gt;json_deserializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_serialized_dict_70kb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Msgpack deserialization: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="nf"&gt;msgpack_deserializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msgpack_serialized_dict_70kb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Msgpack pack deserialization: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;timeit&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="nf"&gt;msgpack_pack_deserializer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msgpack_pack_serialized_dict_70kb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here are Python's timing benchmarks, run over a large number of iterations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON serialization - conversion to a JSON string&lt;/li&gt;
&lt;li&gt;MessagePack serialization and MessagePack pack serialization (both are equivalent; &lt;code&gt;dumps&lt;/code&gt; is an alias for &lt;code&gt;packb&lt;/code&gt;) - conversion to binary MessagePack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The same applies to deserialization as well.&lt;/p&gt;
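&lt;p&gt;To reproduce a comparison like this outside IPython (where &lt;code&gt;%timeit&lt;/code&gt; is unavailable), a minimal standard-library harness can be sketched as follows. The sample payload is a stand-in for the 70 KB dictionary, and the msgpack calls are shown only in comments since that package must be installed separately.&lt;/p&gt;

```python
import json
import timeit

# Sample payload standing in for the 70 KB dictionary from the benchmark;
# substitute your own data to reproduce the comparison.
sample = {"users": [{"id": i, "name": f"user{i}", "active": True} for i in range(100)]}

def bench(func, arg, number=1000):
    """Average seconds per call of func(arg) over `number` iterations."""
    return timeit.timeit(lambda: func(arg), number=number) / number

encoded = json.dumps(sample)
print("json.dumps:", bench(json.dumps, sample))
print("json.loads:", bench(json.loads, encoded))

# With msgpack installed (pip install msgpack), the same harness applies:
#   bench(msgpack.packb, sample)
#   bench(msgpack.unpackb, msgpack.packb(sample))
```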

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4g20d3hjjz343aqig3l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4g20d3hjjz343aqig3l.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A simple scatter plot showing the difference between the strategies.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwkvp6zovpcys9jdnghb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwkvp6zovpcys9jdnghb.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Try MessagePack in your application and run the performance benchmarks yourself. Let me know if you have any questions or run into difficulties with MessagePack.&lt;/p&gt;

</description>
      <category>redis</category>
      <category>cache</category>
      <category>backend</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Enhancing Code Quality and Security: Building a Rock-Solid CI Test Suite for Seamless Development</title>
      <dc:creator>Sridhar CR</dc:creator>
      <pubDate>Mon, 03 Jul 2023 16:14:03 +0000</pubDate>
      <link>https://forem.com/sridharcr/enhancing-code-quality-and-security-building-a-rock-solid-ci-test-suite-for-seamless-development-1ld0</link>
      <guid>https://forem.com/sridharcr/enhancing-code-quality-and-security-building-a-rock-solid-ci-test-suite-for-seamless-development-1ld0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In today's rapidly evolving software development landscape, ensuring code quality and security is of paramount importance. Continuous Integration (CI) has become an essential practice in software development, enabling developers to integrate their code changes frequently and detect issues early in the development lifecycle. One crucial aspect of CI is the test suite, which encompasses various checks to ensure code quality and security. In this blog post, we will explore the pipeline of a CI test suite for a Python project, highlighting the different steps involved and the tools utilized.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; Here's a quick overview of a CI suite, covering a wide variety of steps and tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--axEwdKfo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c4f9tl665fm1swno7g4j.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--axEwdKfo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c4f9tl665fm1swno7g4j.jpg" alt="Image description" width="800" height="840"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Code formatting
&lt;/h2&gt;

&lt;p&gt;Consistent code formatting is vital for maintainability and collaboration within a development team. To enforce a consistent coding style, the first step in the CI test suite is code formatting. There are many code formatters available, such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Black&lt;/li&gt;
&lt;li&gt;autopep8&lt;/li&gt;
&lt;li&gt;yapf&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Black&lt;/strong&gt; is a popular Python code formatter that automatically reformats code to adhere to a defined style guide. By integrating Black into the CI pipeline, developers can ensure that their code follows consistent formatting conventions, enhancing readability and reducing code review overhead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LY5rviO---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9aop6fvu0samjhb94h35.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LY5rviO---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9aop6fvu0samjhb94h35.png" alt="Image description" width="800" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Code linting
&lt;/h2&gt;

&lt;p&gt;Linting is another crucial aspect of code quality. It helps to enforce the best practices and identifies common pitfalls, improving the overall code quality. Some of the commonly used code linters are as follows,&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pylint&lt;/li&gt;
&lt;li&gt;Sonarlint (does more than linting)&lt;/li&gt;
&lt;li&gt;flake8&lt;/li&gt;
&lt;li&gt;autopep8 &lt;/li&gt;
&lt;li&gt;bandit&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Pylint&lt;/strong&gt;, a widely used Python static code analyzer, performs lint checks to identify potential programming errors, stylistic issues, and adherence to coding standards. Integrating Pylint into the CI test suite ensures that code is thoroughly analyzed for potential issues before merging into the main codebase.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kDrI3N6y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d29rvmcpz3vif7ob42rz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kDrI3N6y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d29rvmcpz3vif7ob42rz.png" alt="Image description" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Code vulnerability checks
&lt;/h2&gt;

&lt;p&gt;In an era where security threats are prevalent, it is essential to detect vulnerabilities and security hotspots in the codebase. Various tools for these checks are,&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sonarqube (sonarlint + other checks)&lt;/li&gt;
&lt;li&gt;cycode&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Sonarqube, a code analysis tool, performs static security analysis to identify potential security vulnerabilities, such as insecure authentication mechanisms or code that is susceptible to injection attacks. By integrating Sonarqube into the CI pipeline, developers can proactively identify and address security issues, reducing the risk of potential exploits.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4UUMJNmX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uoa7h6ddt0hqk3yn4ueh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4UUMJNmX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uoa7h6ddt0hqk3yn4ueh.png" alt="Image description" width="800" height="549"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Unit test cases
&lt;/h2&gt;

&lt;p&gt;Unit testing is a fundamental practice in software development to validate the correctness of individual code units or components. &lt;/p&gt;
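&lt;p&gt;As a minimal illustration with the standard library's &lt;code&gt;unittest&lt;/code&gt; module (the &lt;code&gt;add&lt;/code&gt; function here is a hypothetical unit under test, not from any specific project):&lt;/p&gt;

```python
import unittest

def add(a, b):
    """Toy unit under test; stands in for any function in your codebase."""
    return a + b

class TestAdd(unittest.TestCase):
    def test_adds_positive_numbers(self):
        self.assertEqual(add(2, 3), 5)

    def test_adds_negative_numbers(self):
        self.assertEqual(add(-1, -1), -2)

# Run the suite programmatically; in CI you would typically invoke
# `python -m unittest` or `python -m pytest` instead.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAdd)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```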

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--enCnF9Ge--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/umbuydlo6yaz2d1lpf9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--enCnF9Ge--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/umbuydlo6yaz2d1lpf9k.png" alt="Image description" width="800" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Scenario test cases
&lt;/h2&gt;

&lt;p&gt;Beyond unit tests, scenario or integration tests provide end-to-end validation of the system's behavior. &lt;br&gt;
&lt;strong&gt;Behave&lt;/strong&gt;, a popular Behavior-Driven Development (BDD) framework for Python, allows developers to define test scenarios in a human-readable format. These scenarios describe the expected behavior of the system from a user's perspective. By including scenario test cases in the CI test suite, developers can ensure that the software functions correctly in real-world scenarios and that different components interact seamlessly.&lt;/p&gt;

&lt;p&gt;When connected to a reporting tool called &lt;strong&gt;Allure&lt;/strong&gt;, we can check the results in an interactive UI:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SAKoe5CF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y481dhjs562xv9dchp86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SAKoe5CF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y481dhjs562xv9dchp86.png" alt="Image description" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Code coverage benchmarks
&lt;/h2&gt;

&lt;p&gt;For the testing stages, coverage benchmarks should be addressed thoroughly. For both unit tests and scenario tests, enforcing a strict coverage percentage pushes developers to write test cases for boundary conditions and edge cases. It also quantifies how much of the code has been exercised and builds confidence in the test suite.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q4izmF_o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xuxjfxcdg5uckk8x7jc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q4izmF_o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xuxjfxcdg5uckk8x7jc6.png" alt="Image description" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: AppSec Checks
&lt;/h2&gt;

&lt;p&gt;Security plays a vital role in the development lifecycle, and each category of security check can be covered in the pipeline. These include image scans and DAST (Dynamic Application Security Testing).&lt;/p&gt;

&lt;p&gt;Image scanning can be done with various tools, such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clair (also being used in ECR)&lt;/li&gt;
&lt;li&gt;Trivy&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Tools like Clair provide image scanning capabilities that analyze the container image for known vulnerabilities and adherence to security best practices. Integrating Clair image scan checks into the CI pipeline allows developers to identify and address any security issues before deploying containerized applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Rx5mX-qB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n84t831elntqlposrhmj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Rx5mX-qB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n84t831elntqlposrhmj.png" alt="Image description" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DAST checks can be automated up to a point: the application should be able to withstand common scans and attacks. For example, SQL injection can be checked with &lt;a href="https://github.com/sqlmapproject/sqlmap"&gt;sqlmap&lt;/a&gt;, which tests every type of SQL injection payload and reports the results back to the user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YbUMAtCB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/14q196ybzpni21jgd9sd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YbUMAtCB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/14q196ybzpni21jgd9sd.png" alt="Image description" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A strict pipeline of a CI test suite provides a comprehensive approach to ensure code quality and security throughout the software development process. By incorporating code formatting checks, linting, security analysis, unit tests, scenario tests, and image scans, developers can mitigate potential issues early on and deliver robust and secure software.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>ci</category>
      <category>development</category>
      <category>python</category>
    </item>
    <item>
      <title>The Ultimate Guide to Web Application Security (As a developer)</title>
      <dc:creator>Sridhar CR</dc:creator>
      <pubDate>Fri, 28 Apr 2023 20:10:55 +0000</pubDate>
      <link>https://forem.com/sridharcr/the-ultimate-guide-to-web-application-security-as-a-developer-26d9</link>
      <guid>https://forem.com/sridharcr/the-ultimate-guide-to-web-application-security-as-a-developer-26d9</guid>
      <description>&lt;p&gt;In this era of digital information, web security is one of the crucial aspects of the web apps. Due to their wide spread availability and accessibility, web applications are vulnerable to a variety of attacks, such as cross-site scripting (XSS), SQL injection, and file inclusion attacks.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A bug was reported to LinkedIn describing an issue that allowed users to delete any published content, even content not published by that user.&lt;br&gt;
It is a classic example of an access control issue.&lt;br&gt;
&lt;a href="https://www.moneycontrol.com/news/business/indian-hacker-found-bug-that-could-have-led-to-deletion-of-any-linkedin-posts-10451971.html"&gt;Click here&lt;/a&gt; to read more...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To ensure web application security, it's essential to implement various measures in order to build a safe and secure web application. I have listed the crucial measures for building safe web apps below. Let's go over each concept and understand how it helps us reach our goal.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Authentication and Authorization
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Authentication&lt;/strong&gt; is the act of validating that users are whom they claim to be. This is the first step in any security process. &lt;/p&gt;

&lt;p&gt;Authentication can be implemented with various methods based on the needs of the application; it can be as simple as &lt;strong&gt;OAuth with JWT tokens&lt;/strong&gt;, or it can involve SSO with &lt;strong&gt;OpenID Connect&lt;/strong&gt;, &lt;strong&gt;SAML&lt;/strong&gt;, etc. A strict authentication process is an absolute necessity for building secure apps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authorization&lt;/strong&gt; is the process of giving the user permission to access a specific resource or function. This term is often used interchangeably with access control or client privilege.&lt;/p&gt;

&lt;p&gt;The means of access control is tightly coupled with business needs. &lt;strong&gt;Object-level authorization&lt;/strong&gt; is also an absolute necessity for our goal. An ideal access control scheme should have resource-level (API-level) privileges, e.g. &lt;code&gt;/products&lt;/code&gt;, as well as data-access-level privileges with separate read and write access.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. HTTPS (SSL/TLS) and Data Security
&lt;/h3&gt;

&lt;p&gt;Secure REST services must only provide HTTPS endpoints. This protects authentication credentials in transit, for example passwords, API keys or JSON Web Tokens. It also allows clients to authenticate the service and guarantees integrity of the transmitted data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nHQIAgTq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yh4lk925szccrzrnoidm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nHQIAgTq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yh4lk925szccrzrnoidm.png" alt="Image description" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The sensitive data should be stored properly by employing various techniques such as &lt;strong&gt;encryption, hashing&lt;/strong&gt;, etc. For example, the password hash can be stored instead of storing the actual password. The encryption keys, hash keys and salts should be stored safely and should not be revealed under any circumstances.&lt;/p&gt;
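&lt;p&gt;As a minimal sketch of salted password hashing with the standard library (PBKDF2 via &lt;code&gt;hashlib&lt;/code&gt;; the iteration count here is an assumption to tune for your hardware):&lt;/p&gt;

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None, iterations=100_000):
    """Return (salt, derived_key); store both instead of the plaintext password."""
    salt = salt if salt is not None else os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, key

def verify_password(password, salt, expected_key):
    """Constant-time comparison against the stored hash."""
    return hmac.compare_digest(hash_password(password, salt)[1], expected_key)

salt, key = hash_password("s3cret")
print(verify_password("s3cret", salt, key))  # True
print(verify_password("wrong", salt, key))   # False
```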

&lt;h3&gt;
  
  
  3. Input data Validation
&lt;/h3&gt;

&lt;p&gt;The input payloads of the API should be promptly validated before the actual execution of the API. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do not trust input parameters/objects.&lt;/li&gt;
&lt;li&gt;Validate input: length / range / format and type.&lt;/li&gt;
&lt;li&gt;Achieve an implicit input validation by using strong types like numbers, booleans, dates, times or fixed data ranges in API parameters.&lt;/li&gt;
&lt;li&gt;Constrain string inputs with regexps.&lt;/li&gt;
&lt;li&gt;Reject unexpected/illegal content.&lt;/li&gt;
&lt;li&gt;Make use of validation/sanitation libraries or frameworks in your specific language.&lt;/li&gt;
&lt;li&gt;Define an appropriate request size limit and reject requests exceeding the limit with HTTP response status 413 Request Entity Too Large.&lt;/li&gt;
&lt;li&gt;Consider logging input validation failures. Assume that someone who is performing hundreds of failed input validations per second is up to no good.&lt;/li&gt;
&lt;/ul&gt;
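&lt;p&gt;The checklist above can be sketched as a plain-Python validator; the field names and limits below are illustrative assumptions, not from any specific API:&lt;/p&gt;

```python
import re

# Constrain string inputs with a regex: 3-30 word characters only.
USERNAME_RE = re.compile(r"^[A-Za-z0-9_]{3,30}$")

def validate_order(payload):
    """Return a list of validation errors; an empty list means the payload is acceptable."""
    errors = []
    quantity = payload.get("quantity")
    # Strong typing plus a fixed range for numeric inputs.
    if not isinstance(quantity, int) or not 1 <= quantity <= 100:
        errors.append("quantity must be an integer between 1 and 100")
    username = payload.get("username")
    if not isinstance(username, str) or not USERNAME_RE.match(username):
        errors.append("username must be 3-30 letters, digits or underscores")
    return errors

print(validate_order({"quantity": 5, "username": "alice_1"}))   # []
print(validate_order({"quantity": 0, "username": "<script>"}))  # two errors
```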

&lt;h3&gt;
  
  
  4. Security headers
&lt;/h3&gt;

&lt;p&gt;There are a number of security related headers that can be returned in the HTTP responses to instruct browsers to act in specific ways. However, some of these headers are intended to be used with HTML responses, and as such may provide little or no security benefits on an API that does not return HTML.&lt;/p&gt;

&lt;p&gt;The following headers should be included in all API responses:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DgMZ128i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g1p68wkkabc3kfu8ur27.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DgMZ128i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g1p68wkkabc3kfu8ur27.png" alt="Security headers info" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These headers are highly configurable, so make use of this flexibility to grant only the least privileges that users require.&lt;/p&gt;
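&lt;p&gt;As a framework-agnostic sketch, a default header set along the lines of OWASP's REST security guidance might look like the following; the exact values are assumptions to be tuned per application:&lt;/p&gt;

```python
# Default security headers in line with OWASP's REST security guidance;
# the values here are starting points, not mandates.
SECURITY_HEADERS = {
    "Cache-Control": "no-store",
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "Content-Security-Policy": "frame-ancestors 'none'",
}

def apply_security_headers(response_headers):
    """Merge the defaults into an existing header mapping, keeping explicit values."""
    merged = dict(SECURITY_HEADERS)
    merged.update(response_headers)
    return merged

print(apply_security_headers({"Content-Type": "application/json"}))
```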

&lt;h3&gt;
  
  
  5. Cookie security
&lt;/h3&gt;

&lt;p&gt;There are several use cases where we use cookies to store relevant information, such as session, tokens, CSRF tokens, etc. In order to enforce the cookie security, we have several attributes which can be set. They are listed as follows,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XNfdFEkw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o8z38d76zz582jg01qav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XNfdFEkw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o8z38d76zz582jg01qav.png" alt="Cookie security info" width="800" height="113"&gt;&lt;/a&gt;&lt;/p&gt;
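&lt;p&gt;A minimal sketch with Python's standard &lt;code&gt;http.cookies&lt;/code&gt; module, setting the usual hardening attributes (the session value is a placeholder):&lt;/p&gt;

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["session_id"] = "abc123"  # placeholder value
morsel = cookie["session_id"]
morsel["secure"] = True        # only transmit over HTTPS
morsel["httponly"] = True      # hide from JavaScript, mitigating XSS token theft
morsel["samesite"] = "Strict"  # mitigate CSRF by restricting cross-site sends
morsel["path"] = "/"

print(cookie.output())
```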

&lt;h3&gt;
  
  
  6. Error handling
&lt;/h3&gt;

&lt;p&gt;We should use classic error-mapping techniques to show generic error messages to the client, avoiding detailed technical information such as tracebacks, SQL errors, system errors, etc.&lt;/p&gt;

&lt;p&gt;If an error is caused by the input payload, we can return a generic message such as &lt;code&gt;HTTP 400 - Bad Request&lt;/code&gt;; for unexpected server-side failures, a generic &lt;code&gt;HTTP 500 - Internal Server Error&lt;/code&gt;.&lt;/p&gt;
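&lt;p&gt;A minimal sketch of such an error-mapping layer (the exception class is hypothetical):&lt;/p&gt;

```python
class ValidationError(Exception):
    """Hypothetical internal exception raised on bad input."""

def to_client_error(exc):
    """Map an internal exception to a (status_code, message) pair that leaks no details."""
    if isinstance(exc, ValidationError):
        return 400, "Bad Request"
    # Log the full traceback server-side; never return it to the client.
    return 500, "Internal Server Error"

print(to_client_error(ValidationError("email field missing")))  # (400, 'Bad Request')
print(to_client_error(KeyError("db_password")))                 # (500, 'Internal Server Error')
```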

&lt;h3&gt;
  
  
  7. Audit logs
&lt;/h3&gt;

&lt;p&gt;Write audit logs before and after security-related events. Consider logging token validation errors in order to detect attacks. Guard against log injection attacks by sanitizing log data beforehand.&lt;/p&gt;
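&lt;p&gt;A minimal sketch of sanitizing attacker-controlled input before logging, escaping the CR/LF characters used in log injection:&lt;/p&gt;

```python
def sanitize_for_log(value):
    """Escape CR/LF so attacker-controlled input cannot forge extra log lines."""
    return str(value).replace("\r", "\\r").replace("\n", "\\n")

# An attacker-supplied username attempting to inject a fake log entry:
malicious = "alice\n2023-01-01 INFO login ok user=admin"
print("login failed user=" + sanitize_for_log(malicious))
```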

&lt;h3&gt;
  
  
  8. Up-to date images and libraries
&lt;/h3&gt;

&lt;p&gt;In the microservices era, we use Docker images from public repositories and install the required packages on top of them. Outdated versions of utilities and libraries create potential vulnerabilities waiting to be exploited, so we need to take care of these issues as well.&lt;/p&gt;

&lt;p&gt;This is not a one-time activity: new bugs and issues will always pop up, and we need to routinely check for and resolve them. There are many open-source scanners available; figure out the tool that best suits your application and get familiar with these types of vulnerabilities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kn-rT_Uk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/32mkdndowdh3vrw297o2.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kn-rT_Uk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/32mkdndowdh3vrw297o2.PNG" alt="Image description" width="761" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's a good scanner, &lt;strong&gt;Clair - Vulnerability Static Analysis for containers&lt;/strong&gt; - &lt;a href="https://github.com/quay/clair"&gt;Github&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By strictly following all these approaches, we should be able to build a secure web application that can fend off most attacks, especially when hosted in a secure network behind a firewall. Even with all these techniques in place, though, the responsibility ultimately rests with the developers, who have to write code without exploitable bugs.&lt;/p&gt;

&lt;p&gt;THANK YOU FOR READING&lt;br&gt;
I hope you found this little article helpful. Please share it with your friends and colleagues. Sharing is caring. Connect with me on &lt;a href="https://www.linkedin.com/in/sridharcr/"&gt;LinkedIn&lt;/a&gt; to read more about Python, Architecture design, Security, and more…!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>security</category>
      <category>api</category>
      <category>cybersecurity</category>
    </item>
    <item>
      <title>Why consistency and availability cannot coexist in distributed systems?</title>
      <dc:creator>Sridhar CR</dc:creator>
      <pubDate>Sun, 26 Mar 2023 17:25:24 +0000</pubDate>
      <link>https://forem.com/sridharcr/why-consistency-and-availability-cannot-coexist-in-distributed-systems-297</link>
      <guid>https://forem.com/sridharcr/why-consistency-and-availability-cannot-coexist-in-distributed-systems-297</guid>
      <description>&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;Have you ever pondered why consistency and availability cannot coexist in distributed systems, as the title suggests?&lt;/p&gt;

&lt;p&gt;While it may be achievable in a single server mode, it could be considered a single point of failure, therefore making availability uncertain. In distributed systems, with multiple partitions, attaining both consistency and availability is not feasible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_v4AkzKa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xpg6vxuxsefi7guz46rb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_v4AkzKa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xpg6vxuxsefi7guz46rb.jpg" alt="Image description" width="250" height="166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  CAP theorem
&lt;/h2&gt;

&lt;p&gt;CAP theorem, also known as Brewer's theorem, is an essential concept in distributed computing that explains the trade-offs involved in building a distributed system. It was introduced by Eric Brewer in 2000, and it is still relevant today in the development of modern distributed systems.&lt;/p&gt;

&lt;p&gt;The CAP theorem states that it is impossible to achieve all three of the following guarantees in a distributed system simultaneously:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency:&lt;/strong&gt; All nodes in the system see the same data at the same time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Availability:&lt;/strong&gt; Every request made to the system gets a response, without guarantee of the freshness of the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partition tolerance:&lt;/strong&gt; The system continues to function even when network partitions occur.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DUdv1FJ6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zr5j652e4fc72mzb7279.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DUdv1FJ6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zr5j652e4fc72mzb7279.jpg" alt="Image description" width="880" height="862"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The theorem's fundamental idea is that a distributed system can provide at most two of these three guarantees at any given time. In other words, we have to make a trade-off between consistency, availability, and partition tolerance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt; means that all nodes in the distributed system see the same data at the same time. Achieving consistency is essential for many applications where data accuracy is critical. However, ensuring consistency can be challenging in a distributed system, especially when multiple nodes try to update the same data simultaneously.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The C in CAP is different from the C in ACID. CAP's C means that data is consistent across all the nodes, while ACID's C denotes data correctness at the database level. Both, however, denote a form of consistency.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Availability&lt;/strong&gt; means that every request made to the system gets a response, but it doesn't guarantee the freshness of the data. If a node cannot respond to a request, the system is considered unavailable. In some cases, it may be acceptable to serve stale data, but in others, it may not be.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partition tolerance&lt;/strong&gt; refers to the ability of a distributed system to continue functioning when communication between nodes is disrupted. A network partition occurs when a group of nodes becomes disconnected from the rest of the system due to a network failure or other issues. Partition tolerance is essential for the system to continue operating in such situations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use-cases
&lt;/h2&gt;

&lt;p&gt;The CAP theorem tells us that in a distributed system, we have to choose between consistency and availability while ensuring partition tolerance. In reality, different applications have different requirements and priorities, and there is no one-size-fits-all solution. The choice of which two guarantees to provide depends on the application's specific needs, the scale of the system, and the potential impact of network partitions.&lt;/p&gt;

&lt;p&gt;For example, a system that handles financial transactions must prioritize consistency to ensure that all transactions are accurate and complete. On the other hand, a social media platform can prioritize availability to ensure that users can access the platform even during network disruptions, even if it means serving stale data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advancing solutions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Eventual consistency&lt;/strong&gt;&lt;br&gt;
This is a consistency model used in distributed systems in which all updates made to a data store eventually propagate to every replica or node, resulting in a consistent state over time. It is a weak form of consistency that allows temporary inconsistencies or conflicts in the data until all updates are fully propagated across all nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Consensus algorithms&lt;/strong&gt;&lt;br&gt;
A consensus algorithm is a distributed computing protocol that enables a group of nodes or processes to agree on a single value or decision, even if some of the nodes fail or behave maliciously. Such algorithms are a way to achieve consistent data across all nodes; in fact, they are the backbone of blockchain applications as well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Hybrid architecture&lt;/strong&gt;&lt;br&gt;
With the CAP theorem in mind, a database architect can make informed decisions such as applying strong consistency to critical data while accepting weak consistency for non-critical data.&lt;/p&gt;

&lt;p&gt;This problem has existed for around two decades, yet there is no ideal solution that fixes it completely. Let me know if there are new approaches able to solve it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The CAP theorem is a critical concept in distributed systems that highlights the trade-offs involved in building such systems. While we cannot have all three guarantees of consistency, availability, and partition tolerance simultaneously, understanding the theorem's implications can help us make informed decisions when building distributed systems.&lt;/p&gt;

</description>
      <category>database</category>
      <category>webdev</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Design architecture for large file downloads</title>
      <dc:creator>Sridhar CR</dc:creator>
      <pubDate>Sun, 19 Mar 2023 18:27:17 +0000</pubDate>
      <link>https://forem.com/sridharcr/design-architecture-for-large-file-downloads-1ojn</link>
      <guid>https://forem.com/sridharcr/design-architecture-for-large-file-downloads-1ojn</guid>
      <description>&lt;h2&gt;
  
  
  Situation
&lt;/h2&gt;

&lt;p&gt;You are tasked with implementing a feature that allows users to download a massive CSV file, weighing in at 10 GB. The data must be sourced from a database, undergo the necessary transformations, and be served to the user as a downloadable file that starts immediately upon selecting the download option.&lt;/p&gt;

&lt;p&gt;It's important to note that the data is subject to change at regular intervals and users may not always require a download, so it's not feasible to pre-push the data to S3. Instead, this process needs to be available on-demand to ensure timely and efficient delivery of the necessary data to users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5i7uv7o4ca90u2v0l1hi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5i7uv7o4ca90u2v0l1hi.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution - TL;DR
&lt;/h2&gt;

&lt;p&gt;To handle this situation effectively, one can opt for &lt;strong&gt;streaming data from the API&lt;/strong&gt;, since it provides a lot of technical advantages while staying simple. It is one of the best solutions for this use case and offers a great user experience, since the download starts instantaneously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common problems with naive solution
&lt;/h2&gt;

&lt;p&gt;While it may seem like a simple solution, writing an API to handle this scenario could prove problematic, particularly when dealing with large datasets. In such cases, the API may encounter special cases that result in failure, such as a 504 Gateway Timeout if the time limit is exceeded.&lt;/p&gt;

&lt;p&gt;Additionally, if we plan to handle huge datasets with this approach, we may need to resort to vertical scaling, which could be excessive and ultimately unsustainable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding the solution
&lt;/h2&gt;

&lt;p&gt;Let's find optimized solutions for this problem and compare them objectively. When evaluating architectures, we should weigh the usual factors: scalability, availability, fault tolerance, consistency, and security.&lt;/p&gt;

&lt;p&gt;Several factors affect these solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compute resources (CPU and Memory)&lt;/li&gt;
&lt;li&gt;Database Disk IO
&lt;/li&gt;
&lt;li&gt;Network IO&lt;/li&gt;
&lt;li&gt;Data size&lt;/li&gt;
&lt;li&gt;Technical difficulties&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After a good amount of research, I was able to narrow down the best solution approaches, which are as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Asynchronous approach with queues and consumers&lt;/li&gt;
&lt;li&gt;Event based approach with websockets&lt;/li&gt;
&lt;li&gt;Streaming data from API&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1. Asynchronous approach with queues and consumers
&lt;/h3&gt;

&lt;p&gt;In this method, whenever a user clicks download, that request is immediately pushed to a queue. The requests are then fulfilled by one of the consumer/worker processes: a worker fetches a request from the queue and proceeds with the data fetching and data transformation.&lt;/p&gt;

&lt;p&gt;This process can take &lt;em&gt;x seconds&lt;/em&gt;, directly proportional to the size of the data. After transformation, the worker pushes the file to S3 and fetches a presigned URL to hand to the client. Meanwhile, the client polls the backend server every &lt;em&gt;y seconds&lt;/em&gt; for this presigned URL.&lt;/p&gt;

&lt;p&gt;Here's a high level architecture view of this solution,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejry81wk0zodv8y1jkl0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fejry81wk0zodv8y1jkl0.jpg" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; When implemented properly, this solution would provide great scalability with n workers, high availability, and failure handling with acknowledgement in queues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; The user experience is not great, and the infrastructure for these components can be costly to set up.&lt;/p&gt;
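&lt;p&gt;The queue-and-worker flow can be sketched in-process with Python's standard library (a real deployment would use a broker such as RabbitMQ or SQS, and &lt;code&gt;push_to_s3&lt;/code&gt; here is a hypothetical stand-in for the transform-upload-presign step):&lt;/p&gt;

```python
import queue
import threading

jobs = queue.Queue()
results = {}  # request_id -> presigned URL, polled by the client

def push_to_s3(request_id: str) -> str:
    # Stand-in: fetch the data, transform it, upload to S3,
    # and return a presigned URL for the client to download.
    return "https://s3.example.com/exports/" + request_id + ".csv"

def worker():
    while True:
        request_id = jobs.get()
        if request_id is None:  # sentinel to stop the worker
            break
        results[request_id] = push_to_s3(request_id)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
jobs.put("req-42")   # user clicked download
jobs.join()          # the client's polling loop would now find the URL
```

Scaling out is then a matter of running more worker processes against the same queue.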

&lt;h3&gt;
  
  
  2. Event based approach with websockets
&lt;/h3&gt;

&lt;p&gt;In this approach, instead of conventional REST APIs, we use WebSockets for an event-driven design that eliminates polling; the compute processing is done on the same backend server. Depending on the design, this approach can be considered asynchronous as well.&lt;/p&gt;

&lt;p&gt;The WebSocket is opened once the user requests the download; the server then works through the data fetching and transformation logic, computes the resultant file, and pushes it to S3. After fetching the presigned URL, the server sends it to the client for download.&lt;/p&gt;

&lt;p&gt;Here's an overview of the design,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkoycen12hlmjr876zznb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkoycen12hlmjr876zznb.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Easy to setup, cost effective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; Infrastructure scaling for compute resource is not flexible, since processing happens in backend server.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Streaming data from API
&lt;/h3&gt;

&lt;p&gt;This approach is more straightforward and can be achieved with a conventional API with a small change: instead of doing the heavy lifting in a single shot and returning the response, we chunk and stream the response.&lt;/p&gt;

&lt;p&gt;From the user's perspective, the file starts downloading immediately after the trigger. The download speed is directly proportional to the parameters mentioned above, so there is a trade-off between raw performance and user experience. Here's the overview of the streaming API design.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujvmov2orq8j9ikekb5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujvmov2orq8j9ikekb5w.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some technicalities need to be crafted carefully based on the infrastructure resources, such as how many records can be processed in a single chunk. A good metric is to size chunks by data volume, for example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;100 MB data can be processed in a single chunk&lt;/code&gt;&lt;/p&gt;
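&lt;p&gt;One way to realize that metric (the helper name and threshold here are illustrative) is to accumulate rows until a byte budget is reached, then yield the chunk:&lt;/p&gt;

```python
def chunk_rows(rows, max_bytes=100 * 1024 * 1024):
    """Group CSV lines into chunks of roughly max_bytes each.

    Sizing by bytes rather than row count keeps memory use predictable
    even when individual rows vary in width.
    """
    buffer, size = [], 0
    for row in rows:
        buffer.append(row)
        size += len(row.encode("utf-8"))
        if size >= max_bytes:
            yield "".join(buffer)
            buffer, size = [], 0
    if buffer:  # flush the final partial chunk
        yield "".join(buffer)
```

A generator like this slots directly into a streaming response, with each yielded chunk written to the wire as it is produced.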

&lt;p&gt;A simple implementation of streaming API with Flask - python is as follows, &lt;a href="https://flask.palletsprojects.com/en/2.1.x/patterns/streaming/" rel="noopener noreferrer"&gt;flask docs&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="nd"&gt;@app.route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/large.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_large_csv&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;iter_all_rows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;response_class&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;mimetype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text/csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt; Easy to setup, cost effective, best user experience etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt; Infrastructure scaling for compute resource is not flexible, since processing happens in backend server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;All three approaches offer powerful, best-in-class solutions when implemented properly, but choosing among them should be based on your use case and needs; any of them can handle large files. The third approach, streaming data, matched my use case best, so I went with it. Of course there can be implementation-level issues, which are ignored in this article for the sake of simplicity.&lt;/p&gt;

&lt;p&gt;Share your perspective in the comments.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>architecture</category>
      <category>restapi</category>
      <category>flask</category>
    </item>
  </channel>
</rss>
