<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Neurelo</title>
    <description>The latest articles on Forem by Neurelo (@neurelo).</description>
    <link>https://forem.com/neurelo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F8821%2F77dbba52-5893-4870-815c-7a36b431b298.png</url>
      <title>Forem: Neurelo</title>
      <link>https://forem.com/neurelo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/neurelo"/>
    <language>en</language>
    <item>
      <title>In the land of LLMs, can we do better mock data generation?</title>
      <dc:creator>Parth Parikh</dc:creator>
      <pubDate>Thu, 12 Sep 2024 15:55:37 +0000</pubDate>
      <link>https://forem.com/neurelo/in-the-land-of-llms-can-we-do-better-mock-data-generation-489</link>
      <guid>https://forem.com/neurelo/in-the-land-of-llms-can-we-do-better-mock-data-generation-489</guid>
      <description>&lt;p&gt;"Lorem ipsum is latin, slightly jumbled, the remnants of a passage from Cicero's _de Finibus_ 1.10.32, which begins 'Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit...' &lt;em&gt;[There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain.]. [de Finibus Bonorum et Malorum, written in 45 BC, is a treatise on the theory of ethics very popular in the Renaissance.]&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;"What I find remarkable is that this text has been the industry's standard dummy text ever since some printed in the 1500s took a galley of type and scrambled it to make a type specimen book; it has survived not only four centuries of letter-by-letter resetting but even the leap into electronic typesetting, essentially unchanged except for an occasional 'ing' or 'y' thrown in. It's ironic that when the then-understood Latin was scrambled, it became as incomprehensible as Greek; the phrase 'it's Greek to me' and 'greeking' have common semantic roots!"&lt;/p&gt;

&lt;p&gt;       - &lt;em&gt;Richard McClintock, in a &lt;a href="https://nwalsh.com/comp.fonts/FAQ/cf_36.htm" rel="noopener noreferrer"&gt;1994 letter to the editor of Before &amp;amp; After magazine&lt;/a&gt;, describing his discovery of Lorem ipsum’s origin&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Mock data or synthetic data &lt;a href="https://arxiv.org/pdf/2205.03257" rel="noopener noreferrer"&gt;is not a replacement for real data&lt;/a&gt;, but it is a technology with significant utility in testing and development. High-fidelity mock data can considerably accelerate software development lifecycles. Yet, if you look at the history of mock data generation, it has remained fairly stagnant for decades and still needs a lot of work to get right. While we have seen evolutionary progress, the revolution has yet to come.&lt;/p&gt;

&lt;p&gt;So, what would this revolution look like? “High-fidelity mock data that just works!” screamed a random programmer holding a Quackles rubber duck. But what does “high fidelity” mean in this context? I believe what we really want is a technology that can deeply understand a user’s schema and its underlying goals to generate high-fidelity synthetic data for a test database in “one-click”.&lt;/p&gt;

&lt;p&gt;At Neurelo, our goal is to make it way easier to build with databases. Many of our users initially connect to us with an empty data source to evaluate our platform. Others work on new features by taking their current schema, making changes, testing them, and committing the changes once they are satisfied with the schema and its corresponding Neurelo auto-generated and Custom APIs. To test these changes, they usually also start with an empty database. Early on, we realized that to give these users a production-like experience when using these APIs, it was essential for us to intelligently mock data for their empty data sources. And that's exactly what we did.&lt;/p&gt;

&lt;p&gt;In this blog post, I will discuss how we went about creating Neurelo’s “&lt;a href="https://docs.neurelo.com/guides/mock-data-generation" rel="noopener noreferrer"&gt;mock data generation&lt;/a&gt;” technology—the nuts and bolts, gears and cogs, silos and pipelines of it all!&lt;/p&gt;

&lt;p&gt;When we started this project, we had five requirements in mind:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It should be diverse, meaning it should work with all three data sources we support: MongoDB, MySQL, and Postgres.
&lt;/li&gt;
&lt;li&gt;It should generate realistic data based solely on the schema, without requiring any external user input—a “one-click” solution with minimal friction.
&lt;/li&gt;
&lt;li&gt;The cost should be as low as possible, balancing scalability with accuracy.
&lt;/li&gt;
&lt;li&gt;Response time should be very fast.
&lt;/li&gt;
&lt;li&gt;It should be implemented entirely in native Rust.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To understand why it was essential for us to use Rust for this project, let’s take a look at a simplified view of Neurelo’s underlying architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9zboqc7w2ea651p5t8c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9zboqc7w2ea651p5t8c.png" alt="A simplified view of Neurelo’s underlying architecture" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Neurelo’s management layer hosts our entire frontend and is primarily responsible for communicating various user-specified actions to our operations plane. The operations plane, in turn, is where the actual execution of jobs takes place. In many cases, it delegates these jobs to be run against our query runners.&lt;/p&gt;

&lt;p&gt;Our query runners are at the heart of Neurelo. They are primarily responsible for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Translating incoming data APIs and generating queries against the database,
&lt;/li&gt;
&lt;li&gt;Introspecting an existing database schema,
&lt;/li&gt;
&lt;li&gt;Managing schema changes over time using our migration tools, etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;“Mock Data Generation” acts as another “job” in the operations plane, which eventually gets delegated to our query runners. This is crucial because it allows us to leverage our existing Write API code paths to execute queries against the database without having to build new paths. This approach benefits us in terms of both performance and maintainability. Since our entire query runner is written solely in Rust, we needed a native Rust-based mock data generator.&lt;/p&gt;
&lt;h2&gt;
  
  
  LLMs to the rescue?
&lt;/h2&gt;

&lt;p&gt;Right from the outset, we thought LLMs would be a great fit for this problem and asked ourselves how best to leverage them.&lt;/p&gt;

&lt;p&gt;Our initial approach involved using LLMs to generate Rust code that would dynamically create raw &lt;code&gt;INSERT&lt;/code&gt; queries. However, despite several attempts at prompt tuning, we were unable to achieve reproducible accuracy over multiple iterations. Two significant issues plagued this approach: first, the generated Rust code was not always compilable; second, when it did compile, the quality of the synthetic data was subpar, tending to regress to generic formats like “Movie1,” “Movie2,” etc., for movie names, despite explicit prompt tuning. &lt;/p&gt;

&lt;p&gt;We could have tried extending beyond the zero-shot learning methodology and fine-tuned prompts further, but it didn’t take long for us to realize we were spinning our wheels with this approach and needed a more deterministic foundation.&lt;/p&gt;

&lt;p&gt;But curiosity got the better of us, and we wondered, could we do the same in Python? Based on our understanding, LLMs are particularly effective at writing executable Python code. Additionally, prompting them to leverage Python’s “&lt;a href="https://faker.readthedocs.io/en/master/" rel="noopener noreferrer"&gt;faker&lt;/a&gt;” third-party module could help us mitigate the second issue as well.&lt;/p&gt;

&lt;p&gt;This is when we stumbled upon a rabbit hole! You see, when mocking multiple tables for a database schema, &lt;strong&gt;the order of insertion matters&lt;/strong&gt;!&lt;/p&gt;
&lt;h2&gt;
  
  
  Mad Hatter pulls out the “Order of Insertion”
&lt;/h2&gt;

&lt;p&gt;Say you have three tables, &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;B&lt;/code&gt;, and &lt;code&gt;C&lt;/code&gt;, in a database schema. Table &lt;code&gt;A&lt;/code&gt; has a foreign key, &lt;code&gt;B_fk&lt;/code&gt;, corresponding to table &lt;code&gt;B&lt;/code&gt;’s primary key. Similarly, table &lt;code&gt;B&lt;/code&gt; has a foreign key, &lt;code&gt;C_fk&lt;/code&gt;, corresponding to table &lt;code&gt;C&lt;/code&gt;’s primary key. In this scenario, it is crucial to follow the correct insertion order to maintain the referential integrity enforced by the foreign keys.&lt;/p&gt;

&lt;p&gt;For this, we should first mock table &lt;code&gt;C&lt;/code&gt; since it does not have any foreign keys, meaning there are no referential integrity constraints dependent on it. Next, we can use this data and mock table &lt;code&gt;B&lt;/code&gt;, which is now possible because we have ensured that the referenced records in table &lt;code&gt;C&lt;/code&gt; (via &lt;code&gt;C_fk&lt;/code&gt;) exist. Finally, we can mock table &lt;code&gt;A&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;As you can see from the above example, the generation of &lt;code&gt;INSERT&lt;/code&gt; statements for each table must respect their relationships: the order matters. As the French proverb goes, “Il ne faut pas mettre la charrue avant les bœufs” (loosely translated, “We must not put the cart before the horse”).&lt;/p&gt;

&lt;p&gt;So, let us set aside LLMs for a moment and ask ourselves: given a DB schema, how can we ensure that the order of insertion is always correct? Or, in other words, how do we derive an accurate order of insertion from the DB schema?&lt;/p&gt;

&lt;p&gt;The answer - &lt;a href="https://en.wikipedia.org/wiki/Topological_sorting" rel="noopener noreferrer"&gt;&lt;strong&gt;Topological sorting&lt;/strong&gt;&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;The first step is to create a &lt;a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph" rel="noopener noreferrer"&gt;directed acyclic graph&lt;/a&gt; (DAG) from the relationships present in our DB schema. A DAG is a type of graph where edges have a direction and there are no cycles. This means that you cannot start at one node and follow a path that eventually loops back to the same node.&lt;/p&gt;

&lt;p&gt;Note the word “acyclic.” &lt;em&gt;A topological ordering is possible if and only if the graph has no directed cycles!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The simplest way to perform topological sorting is by using &lt;a href="https://dl.acm.org/doi/pdf/10.1145/368996.369025" rel="noopener noreferrer"&gt;Kahn’s algorithm&lt;/a&gt;. The idea is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start by calculating the in-degree (number of incoming edges) for each node in the graph.&lt;/li&gt;
&lt;li&gt;Next, enqueue all nodes with an in-degree of &lt;code&gt;0&lt;/code&gt;, meaning nodes with no incoming edges.&lt;/li&gt;
&lt;li&gt;While the queue is not empty:

&lt;ol&gt;
&lt;li&gt;Dequeue a node &lt;code&gt;N&lt;/code&gt; and add it to the topological sort’s result.
&lt;/li&gt;
&lt;li&gt;For each of &lt;code&gt;N&lt;/code&gt;’s neighboring nodes, reduce its in-degree by &lt;code&gt;1&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;If any neighboring nodes’ in-degree becomes &lt;code&gt;0&lt;/code&gt;, enqueue it.&lt;/li&gt;
&lt;li&gt;Continue until the queue is empty.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The order in which the nodes are added to the topological sort’s result represents the topological sort of the DAG.&lt;/p&gt;
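&lt;p&gt;The steps above can be sketched as follows (a minimal Python illustration of Kahn’s algorithm; our production implementation is in Rust):&lt;/p&gt;

```python
from collections import deque

def insertion_order(tables, fks):
    """Derive a valid insertion order via Kahn's algorithm.

    tables: list of table names.
    fks: list of (child, parent) pairs, i.e. `child` has a foreign
         key referencing `parent`, so `parent` must be inserted first.
    """
    # Edge parent -> child: parent must appear before child.
    children = {t: [] for t in tables}
    in_degree = {t: 0 for t in tables}
    for child, parent in fks:
        children[parent].append(child)
        in_degree[child] += 1

    # Start from tables with no incoming edges (no FK pointing at them).
    queue = deque(t for t in tables if in_degree[t] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in children[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)

    if len(order) != len(tables):
        raise ValueError("cycle detected: topological order impossible")
    return order
```

&lt;p&gt;For the earlier &lt;code&gt;A&lt;/code&gt;/&lt;code&gt;B&lt;/code&gt;/&lt;code&gt;C&lt;/code&gt; schema, &lt;code&gt;insertion_order(["A", "B", "C"], [("A", "B"), ("B", "C")])&lt;/code&gt; yields &lt;code&gt;["C", "B", "A"]&lt;/code&gt;.&lt;/p&gt;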

&lt;p&gt;Here’s an example to demonstrate this:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw45uzei0bri924wl62l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw45uzei0bri924wl62l3.png" alt="Example to demonstrate how Kahn's algorithm can be used to create an insertion order for Mock data generation." width="800" height="1134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;“But what if my schema has recursion in it?” a curious George asks. Well, as you might have noticed, cycles complicate Kahn’s algorithm. Imagine you had a cycle involving Authors. In such a scenario, none of the in-degrees would start at zero. Moreover, if the cycle were on Books, this would create complications when reducing the in-degrees of the neighboring nodes by 1, potentially leading to a situation where not all nodes are traversed.&lt;/p&gt;

&lt;p&gt;If you think about it, this limitation can actually be a blessing in disguise. After Kahn’s algorithm finishes, if there are still nodes left in the graph that never reached an in-degree of &lt;code&gt;0&lt;/code&gt; (i.e., the sorted result contains fewer nodes than the graph), we can be certain that cycles exist.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://stackoverflow.com/questions/14791309/is-it-ok-to-have-cyclic-foreign-key-dependencies-in-a-relational-database?rq=3" rel="noopener noreferrer"&gt;If cyclic relationships are essential&lt;/a&gt;, one way to handle them is by breaking cycles using &lt;code&gt;NULL&lt;/code&gt; values. The first step is to identify where the cycles are occurring, which can be done using &lt;a href="https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm" rel="noopener noreferrer"&gt;Tarjan's strongly connected components algorithm&lt;/a&gt;. When inserting mock data, we can temporarily insert &lt;code&gt;NULL&lt;/code&gt; values in the foreign key fields. After all the mock data has been inserted for each strongly connected component, we can update the &lt;code&gt;NULL&lt;/code&gt; foreign keys with the actual mock data values. Note that this assumes the foreign key fields can be nullable. There are other pitfalls as well, such as what happens if the table has database triggers (like &lt;code&gt;AFTER INSERT&lt;/code&gt;) that depend on foreign key relationships. Inserting &lt;code&gt;NULL&lt;/code&gt; values should not trigger incorrect behavior in such scenarios. Composite keys add another layer of complexity to this. All of these are things we needed to handle.&lt;/p&gt;
&lt;h2&gt;
  
  
  Into the Mine Fields of Mock Data Generation
&lt;/h2&gt;

&lt;p&gt;So, how do we go about creating our “mock data generation” technology? As we’ve seen, generating dynamic code using LLMs that in turn dynamically generates raw &lt;code&gt;INSERT&lt;/code&gt; queries does not seem to be the solution—regardless of the language! However, there are a few positives we gained from this experimentation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Leveraging a “faker”-like module can help us generate better quality mock data, elevating us from the generic world of “Name1” to “SpongeBob SquarePants.”
&lt;/li&gt;
&lt;li&gt;We now have a way to derive an insertion order from our schema.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Returning to the drawing board, we chalked up an interesting idea: instead of using LLMs to generate data, how about using them to predict which method to select from a list of methods that faker-like modules expose? For example, a &lt;code&gt;(Column, ColumnType)&lt;/code&gt; of &lt;code&gt;(hashval, string)&lt;/code&gt; can be mapped to “&lt;code&gt;md5()&lt;/code&gt;”, and a &lt;code&gt;(language, string)&lt;/code&gt; can be mapped to “&lt;code&gt;language_name()&lt;/code&gt;”. This approach has its caveats, but it’s a perfect starting point. It’s a fast and low-cost solution.&lt;/p&gt;
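&lt;p&gt;A minimal sketch of this idea in Python (the generator methods and the classification mapping below are hypothetical stand-ins for a faker-like module and the LLM’s output):&lt;/p&gt;

```python
import hashlib
import random

# Hypothetical subset of faker-like generator methods.
GENERATORS = {
    "md5": lambda: hashlib.md5(str(random.random()).encode()).hexdigest(),
    "language_name": lambda: random.choice(["English", "French", "Hindi"]),
    "word": lambda: random.choice(["alpha", "beta", "gamma"]),
}

# Classification as the LLM might return it: each (column, type)
# tuple is mapped to the name of a generator method, not to data.
classification = {
    ("hashval", "string"): "md5",
    ("language", "string"): "language_name",
}

def mock_value(column, col_type):
    # Fall back to a generic generator when no classification exists.
    method = classification.get((column, col_type), "word")
    return GENERATORS[method]()
```

&lt;p&gt;The LLM is only asked to pick a method name per column; the actual values come from deterministic generator code, which keeps the output format reliable.&lt;/p&gt;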

&lt;p&gt;To implement this, we leveraged Neurelo’s JSON-schema spec, which we call &lt;a href="https://docs.neurelo.com/neurelo-schema-language-nsl" rel="noopener noreferrer"&gt;Neurelo Schema Language (NSL)&lt;/a&gt;. Once a user introspects their data source with Neurelo, we use this NSL spec throughout to represent their schema. &lt;/p&gt;

&lt;p&gt;This schema can now become the anchor for us to build our mock data generation logic. Here’s what our 1.0 “Mock Data Generation” logic looked like:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F342yr97blb1rbreyj3um.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F342yr97blb1rbreyj3um.png" alt="Neurelo's 1.0 “Mock Data Generation” pipeline" width="800" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The LLM prompt itself classifies an array of tuples of the form &lt;code&gt;(column name, column type)&lt;/code&gt; into a JSON dictionary where the classification results are present:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="err"&gt;table_name_&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="err"&gt;column_name_&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;mapped_method_&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="err"&gt;column_name_&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;mapped_method_&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="err"&gt;table_name_&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;  
        &lt;/span&gt;&lt;span class="err"&gt;column_name_&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;mapped_method_&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;  
    &lt;/span&gt;&lt;span class="err"&gt;.....&lt;/span&gt;&lt;span class="w"&gt;  
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we couldn’t find any suitable alternative to &lt;a href="https://faker.readthedocs.io/en/master/" rel="noopener noreferrer"&gt;Python’s faker module&lt;/a&gt; in Rust, we ended up writing an equivalent in Rust.&lt;/p&gt;

&lt;p&gt;Tangentially, we encountered an interesting issue while implementing this Rust-based faker module. We faced a scenario where the entire module would suddenly panic. Uncertain about the cause, we began narrowing down the data source and its affected versions, discovering that only systems running MySQL versions earlier than 5.6.4 were affected. This led us to the infamous &lt;a href="https://en.wikipedia.org/wiki/Year_2038_problem" rel="noopener noreferrer"&gt;Year 2038 problem&lt;/a&gt; associated with 32-bit systems. By constraining our mocked timestamp range to between &lt;code&gt;1970-01-01&lt;/code&gt; and &lt;code&gt;2038-01-19&lt;/code&gt;, we resolved the issue. As my colleague George aptly put it, dealing with anything closely tied to timestamps is a direct line to a mental institution.&lt;/p&gt;
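&lt;p&gt;In Python terms, the fix amounts to something like this (a sketch; our actual faker module is in Rust):&lt;/p&gt;

```python
import random
from datetime import datetime, timezone

# MySQL versions earlier than 5.6.4 store TIMESTAMP as a signed
# 32-bit epoch value, so we constrain mocked timestamps to the
# safe range instead of the full datetime domain.
MIN_TS = 0            # 1970-01-01 00:00:00 UTC
MAX_TS = 2**31 - 1    # 2038-01-19 03:14:07 UTC

def mock_timestamp():
    ts = random.randint(MIN_TS, MAX_TS)
    return datetime.fromtimestamp(ts, tz=timezone.utc)
```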

&lt;h2&gt;
  
  
  What’s up with References?
&lt;/h2&gt;

&lt;p&gt;Here is yet another engineering problem: how do we ensure that foreign keys and primary keys are correctly mapped to each other? Moreover, how do we make this work for NoSQL databases such as MongoDB (with object id references)?&lt;/p&gt;

&lt;p&gt;One approach is to keep a global index counter whose value corresponds to the row number being mocked. This can be directly useful when auto-increment is enabled for a primary key.&lt;/p&gt;

&lt;p&gt;For most scenarios, to ensure that foreign keys and primary keys match correctly while mocking data, one can use the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Track mock data: Use a dictionary to keep track of the mock data for all tables related to the current row you’re working on.
&lt;/li&gt;
&lt;li&gt;Mock each table:

&lt;ol&gt;
&lt;li&gt;For each table, generate mock data based on its columns and their types.
&lt;/li&gt;
&lt;li&gt;Identify and set the foreign key and primary key relationships so they match in your dictionary.&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;

&lt;li&gt;Store mock data: Save the generated mock data for each table in your dictionary.&lt;/li&gt;

&lt;/ol&gt;
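&lt;p&gt;A simplified Python sketch of the steps above (the helper and its arguments are hypothetical; it assumes one auto-incremented primary key per table and mocks one row at a time):&lt;/p&gt;

```python
import itertools

# Global counter standing in for an auto-increment primary key.
_next_id = itertools.count(1)

def mock_row(table, columns, generated):
    """Mock one row, wiring foreign keys to already-mocked rows.

    columns: list of (name, fk_table) pairs; fk_table is None for
    ordinary columns. `generated` is the tracking dictionary that
    maps table -> its most recently mocked row.
    """
    row = {"id": next(_next_id)}
    for name, fk_table in columns:
        if fk_table is not None:
            # Point the FK at the primary key of the referenced row,
            # which must already exist thanks to the insertion order.
            row[name] = generated[fk_table]["id"]
        else:
            row[name] = f"{table}_{name}_{row['id']}"
    generated[table] = row
    return row
```

&lt;p&gt;Mocking &lt;code&gt;C&lt;/code&gt;, then &lt;code&gt;B&lt;/code&gt;, then &lt;code&gt;A&lt;/code&gt; through this helper guarantees each foreign key lands on a primary key that already exists in the tracking dictionary.&lt;/p&gt;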

&lt;p&gt;However, this approach cannot be directly applied to MongoDB, as it supports implicit referencing. To address this, we used another layer of LLM! For each MongoDB collection corresponding to a database, we identify properties that contain an &lt;code&gt;ObjectId&lt;/code&gt; and, by default, treat them as part of a “relation.” The LLM prompt then performs another classification to convert these: &lt;code&gt;{collection_name1: [relation1, relation2, ….., relationN], …..}&lt;/code&gt; into a JSON dictionary of &lt;code&gt;{relationX: referencing_collection_nameY, relationA: referencing_collection_nameB, …..}&lt;/code&gt; If a relation cannot be accurately mapped to a collection name, we simply discard it by assigning it a JSON &lt;code&gt;null&lt;/code&gt; value. We then use these output references to ensure that the &lt;code&gt;ObjectId&lt;/code&gt; values match each other.&lt;/p&gt;
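&lt;p&gt;A toy sketch of how the resulting mapping might be applied (the relation and collection names are hypothetical):&lt;/p&gt;

```python
# Hypothetical output of the second LLM classification pass: each
# ObjectId-holding property is mapped to the collection it
# references, or None when no confident match exists.
relation_map = {
    "author_id": "authors",
    "publisher_ref": "publishers",
    "misc_oid": None,  # discarded: no confident mapping
}

def resolve_references(doc, mocked):
    """Replace ObjectId placeholder fields with the _id values of
    already-mocked documents from the referenced collections."""
    resolved = dict(doc)
    for field, collection in relation_map.items():
        if field in resolved and collection is not None:
            resolved[field] = mocked[collection]["_id"]
    return resolved
```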

&lt;p&gt;Et voilà! Referencing just got a pair of Ray-Bans! ;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Ahem, ahem … Unique Constraints?
&lt;/h2&gt;

&lt;p&gt;We used a similar strategy for enforcing unique constraints. However, we encountered an interesting issue when mocking large amounts of data for a foreign key with a unique constraint. Because the LLM maps such a column to a faker method, and these methods can sometimes produce duplicate values, unique constraint violations can occur. For example, if "zip code" is a foreign key generated by a pseudo-random method, duplicates are likely even with just 1,000 rows (zip codes are not very long), causing the constraint to fail.&lt;/p&gt;

&lt;p&gt;This issue triggers a chain reaction: since the current table isn’t populated, subsequent tables that depend on it also fail. The root cause is that some columns simply cannot supply the requested number of distinct values, and the same limit applies to tables themselves. One solution is to select unique values from a pool of pre-generated, distinct data rather than generating them on the fly. However, this approach might not be ideal due to memory consumption.&lt;/p&gt;
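&lt;p&gt;A minimal sketch of the pool-based approach (hypothetical helper; in practice, the pool size also bounds how many rows can safely be mocked):&lt;/p&gt;

```python
import random

def unique_sample(pool, n):
    """Draw n distinct values for a unique column from a
    pre-generated pool of distinct candidates. If the pool is
    smaller than n, the row count is capped rather than risking
    a unique-constraint violation downstream."""
    if n > len(pool):
        n = len(pool)
    # random.sample never repeats elements, unlike repeated
    # pseudo-random generation.
    return random.sample(pool, n)
```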

&lt;p&gt;We have resolved these challenges to a large extent, but there is still more to be done which is something we are actively working on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Even the best tailors get it wrong
&lt;/h2&gt;

&lt;p&gt;All of this work proved to be a solid start for us, and we did manage to ship a pretty good mock data generator into production and test it with our early users. However, it wasn’t long before we started noticing the familiar weed that has previously poisoned many AI models—“&lt;strong&gt;overfitting&lt;/strong&gt;”!&lt;/p&gt;

&lt;p&gt;You see, one shortcoming of this implementation is that the quality of the mock data is directly tied to the accuracy of our LLM classification pipeline. We often encounter a scenario with &lt;code&gt;(Column, ColumnType)&lt;/code&gt; as &lt;code&gt;(name, string)&lt;/code&gt;, which is classified as a “name.” However, this can be a red herring: the column might belong to a “Movie” table, a “mediaType” table, or even a “Brand” table. So our next lesson was that integrating table names into the pipeline is absolutely essential.&lt;/p&gt;

&lt;p&gt;However, even with table names integrated, there are scenarios where we can’t realistically map them to one of our faker modules. Although we previously implemented around 217 faker modules in Rust, this proved to be a drop in the ocean. Customers can present scenarios like &lt;code&gt;(film, description, string)&lt;/code&gt; or &lt;code&gt;(posts, title, string)&lt;/code&gt; and expect us to go beyond the age-old “Lorem Ipsum.” To address this, we developed a novel solution called the “&lt;strong&gt;Genesis Point Strategy&lt;/strong&gt;.”&lt;/p&gt;

&lt;p&gt;The underlying motivation for the Genesis Point Strategy is that while we would like to leverage LLMs to intelligently generate this fake data, doing so can be both time-consuming and costly. So, how can we achieve this in a cheap, fast, and intelligent way?&lt;/p&gt;

&lt;p&gt;In comes the power of a “Cross product”! The idea is that if we want to mock 1,000 rows with the constraint that all rows must be unique, we don’t need to generate 1,000 unique values from GPT. Instead, we need only &lt;code&gt;ceil(sqrt(1000)) = 32&lt;/code&gt; fake data elements. We can then duplicate these into two sets, &lt;code&gt;A&lt;/code&gt; and &lt;code&gt;B&lt;/code&gt;, and use the cross product &lt;code&gt;A×B&lt;/code&gt; to produce our final mocked data elements. While this may not be perfectly realistic, it should suffice for most practical purposes.&lt;/p&gt;
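&lt;p&gt;A quick sketch of the cross-product expansion (hypothetical helper names):&lt;/p&gt;

```python
import itertools
import math

def cross_product_values(seeds, n):
    """Expand a small set of LLM-generated seed values into n unique
    combined values via a cross product of the seeds with themselves."""
    pairs = itertools.product(seeds, seeds)
    # Joining distinct seeds with a separator keeps every pair unique.
    return [f"{a} {b}" for a, b in itertools.islice(pairs, n)]

# For 1,000 unique rows we only need ceil(sqrt(1000)) = 32 seeds,
# since 32 x 32 = 1024 >= 1000.
k = math.ceil(math.sqrt(1000))
```

&lt;p&gt;So a single cheap LLM call for 32 values covers a thousand unique rows.&lt;/p&gt;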

&lt;p&gt;Note that we don’t discard the older pipeline. Instead, we use zero-shot prompting to have our LLM classify “would-be-overfitted” columns into a “NoneOfTheAbove” category. We then only need to parse and generate mock data for the columns that fall into this “NoneOfTheAbove” classification.&lt;/p&gt;

&lt;p&gt;Here’s the revised iteration for our 2.0 “Mock Data Generation” logic:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zkov0dph3z6y44nwblv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zkov0dph3z6y44nwblv.png" alt="Neurelo's 2.0 “Mock Data Generation” pipeline" width="800" height="661"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that’s it, folks! This was a summary of our mock data generation pipeline. Our users can now enjoy one-click, high-fidelity mock data generation based on their schemas for their development and testing needs. &lt;/p&gt;

&lt;p&gt;The future holds exciting possibilities. With the lessons we have learned from these iterations, we are looking forward to tackling more complex challenges, whether it is further optimizing for unique constraints, supporting composite types and multi schemas, or integrating more cost-driven LLM strategies. One thing is clear though, the revolution in mock data generation is underway, and we are thrilled to be leading the charge.&lt;/p&gt;

&lt;p&gt;We encourage you to explore this feature and more at &lt;a href="http://dashboard.neurelo.com" rel="noopener noreferrer"&gt;dashboard.neurelo.com&lt;/a&gt;. To learn more about how you can use our platform, be sure to check out &lt;a href="https://www.neurelo.com/" rel="noopener noreferrer"&gt;our website&lt;/a&gt; and our &lt;a href="https://docs.neurelo.com/tutorials" rel="noopener noreferrer"&gt;getting-started tutorials&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>mongodb</category>
      <category>database</category>
    </item>
    <item>
      <title>Top 8 Managed Postgres Providers</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Wed, 07 Aug 2024 19:23:00 +0000</pubDate>
      <link>https://forem.com/neurelo/top-8-managed-postgres-providers-gab</link>
      <guid>https://forem.com/neurelo/top-8-managed-postgres-providers-gab</guid>
      <description>&lt;p&gt;&lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt;, often called Postgres, is a very strong and open-source object-relational database system. It's reliable, keeps data safe, has many useful features, and can grow with more data easily. It has become quite popular due to its ability to manage huge amounts of data while supporting &lt;a href="https://www.postgresql.org/docs/current/datatype.html" rel="noopener noreferrer"&gt;advanced data types&lt;/a&gt; and &lt;a href="https://www.postgresql.org/docs/current/external-pl.html" rel="noopener noreferrer"&gt;multiple programming interfaces&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;PostgreSQL is also very flexible: developers can define their own functions and operators and create custom data types. This is why PostgreSQL is a popular choice for many kinds of applications, from small mobile apps to large enterprise systems.&lt;/p&gt;

&lt;p&gt;It follows &lt;a href="https://www.w3schools.com/sql/" rel="noopener noreferrer"&gt;SQL&lt;/a&gt; standards closely and also supports JSON and other &lt;a href="https://www.neurelo.com/post/sql-vs-nosql" rel="noopener noreferrer"&gt;NoSQL&lt;/a&gt; features, making it useful for both structured and semi-structured data.&lt;/p&gt;

&lt;p&gt;Handling databases like PostgreSQL can be quite complicated, which is why managed database services have become popular. These services let businesses offload the difficult work of setting up, maintaining, securing, and scaling their databases to outside providers.&lt;/p&gt;

&lt;p&gt;Managed services provide high availability, disaster recovery, and the ability to scale up or down as needed, which is essential for handling unexpected workloads and keeping services running without interruption. They also offer strong security features, such as regular updates and patches, to protect the database from vulnerabilities.&lt;/p&gt;

&lt;p&gt;In this post, you'll learn about managed PostgreSQL services: their advantages and key characteristics, along with a complete view of the top providers in the market. The goal is to help you choose the most suitable managed Postgres service.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Managed Postgres?
&lt;/h2&gt;

&lt;p&gt;Managed PostgreSQL services delegate almost all &lt;a href="https://www.neurelo.com/post/database-observability-introduction" rel="noopener noreferrer"&gt;database&lt;/a&gt; management tasks to a third-party provider. These services cover setting up, running, and maintaining PostgreSQL databases in cloud environments. The main aim is to free client organizations from difficult database administration tasks so they can focus on their core business activities while making sure their databases perform well, stay secure, and remain highly available.&lt;/p&gt;

&lt;p&gt;Using a managed PostgreSQL service has many advantages. A major one is automated backups, which protect data and simplify recovery when something goes wrong. Another is scalability: companies can adjust resources to match demand without manual intervention, which saves money and makes operations more flexible.&lt;/p&gt;

&lt;p&gt;Regular updates and maintenance, done by the service provider, ensure that the system stays safe and works well. This allows internal teams to concentrate on main business tasks instead of managing databases.&lt;/p&gt;

&lt;p&gt;Managing a PostgreSQL database yourself can demand significant resources: you need in-house expertise to set up the system, keep it maintained, and troubleshoot problems as they arise. Managed services reduce this burden with automated tooling and professional support, which frequently results in a lower total cost of ownership and improved reliability.&lt;/p&gt;


&lt;h2&gt;
  
  
  Key Features of Managed Postgres Providers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability&lt;/strong&gt;—Managed PostgreSQL services let databases grow with the business. Resources can scale automatically with demand, which helps when workloads fluctuate and data volumes grow, with no manual adjustment required.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance&lt;/strong&gt;—Managed services continuously monitor database performance, tracking key &lt;a href="https://www.qlik.com/us/kpi" rel="noopener noreferrer"&gt;performance indicators (KPIs)&lt;/a&gt; to find and fix problems that could affect throughput or responsiveness before they escalate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt;—Providers implement strong security measures, such as network protection, data encryption, and access controls, to safeguard sensitive information. Regulatory compliance is also handled so that data management meets legal requirements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backup and recovery&lt;/strong&gt;—Regular backups keep data safe, with recovery plans ready if data is lost or corrupted. Dedicated disaster recovery plans minimize downtime and data loss when major failures occur.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;—Managed PostgreSQL services can save money because you don't need dedicated database experts on staff. Pricing is typically subscription-based, so monthly costs are predictable and cover infrastructure, software licenses, and ongoing maintenance, lowering both capital and operating expenses. Easily scalable resources also keep spending aligned with actual needs, avoiding investment in capacity you don't use.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Top Managed Postgres Providers
&lt;/h2&gt;

&lt;p&gt;Navigating the world of managed Postgres services is like entering a never-ending carnival with endless options. Over 5,000 forks of PostgreSQL on GitHub indicate developers' eagerness to tinker and experiment. With each new version, a fresh batch of managed service providers emerges, each boasting unique features.&lt;/p&gt;

&lt;p&gt;With such a big and always-growing ecosystem, making an exact list of the best providers is really hard. Factors such as scalability, cost, support, security features, and additional services (like backups, monitoring tools, and automation resources) can greatly influence which provider is most suitable for a specific situation. Therefore, it’s more important to look for the best match rather than seeking out one "best" provider that fits all needs.&lt;/p&gt;

&lt;p&gt;Let’s take a quick look at some of the “top” providers:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Supabase
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://supabase.com/" rel="noopener noreferrer"&gt;Supabase&lt;/a&gt; is a platform for backend-as-a-service that provides an open-source option to &lt;a href="https://firebase.google.com/" rel="noopener noreferrer"&gt;Firebase&lt;/a&gt;. It uses PostgreSQL as the database part and aims to make backend development easier with a collection of built-in tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Real-time data synchronization&lt;/li&gt;
&lt;li&gt;Auto-generated APIs&lt;/li&gt;
&lt;li&gt;Authentication services&lt;/li&gt;
&lt;li&gt;Interface for &lt;a href="https://www.neurelo.com/post/simplifying-database-programming-with-neurelo" rel="noopener noreferrer"&gt;database&lt;/a&gt; management&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Special Advantages
&lt;/h2&gt;

&lt;p&gt;The platform's real-time capabilities let developers add interactive features with ease. This is a good fit for projects that need to move quickly while keeping the solid foundation PostgreSQL provides.&lt;/p&gt;

&lt;h2&gt;
  
  
  Justification for Ranking
&lt;/h2&gt;

&lt;p&gt;Supabase ranks highly because its developer-friendly tools and features speed up backend development while keeping the flexibility of open source. It's a natural fit for startups and projects that need to build and launch products quickly.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Aiven
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aiven.io/" rel="noopener noreferrer"&gt;Aiven&lt;/a&gt; provides managed cloud service for PostgreSQL, making sure databases run smoothly, safely, and can grow easily on different cloud platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Multi-cloud support&lt;/li&gt;
&lt;li&gt;Automatic backups&lt;/li&gt;
&lt;li&gt;Strong security focus, including private networking and encryption&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Special Advantages
&lt;/h2&gt;

&lt;p&gt;Aiven runs on multiple clouds (AWS, Google Cloud, Azure), offering flexibility and resilience. This makes it a good fit for companies that need robust disaster recovery plans.&lt;/p&gt;

&lt;h2&gt;
  
  
  Justification for Ranking
&lt;/h2&gt;

&lt;p&gt;Aiven is popular for its strong security features and its availability across cloud platforms, which helps businesses that prioritize data protection and compliance avoid lock-in to a single cloud provider.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Amazon RDS for PostgreSQL
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/rds/postgresql/" rel="noopener noreferrer"&gt;Amazon RDS&lt;/a&gt; is a managed service for relational databases that makes PostgreSQL setup, scaling, and management automatic. This lets developers concentrate on creating applications instead of handling database tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Simple scaling&lt;/li&gt;
&lt;li&gt;Automatic backups&lt;/li&gt;
&lt;li&gt;Deep integration with other &lt;a href="https://docs.neurelo.com/guides/provisioning-cloud-databases-for-using-with-neurelo/postgresql/aws-rds-postgresql?_gl=1*evwq1h*_gcl_au*MTg2MjkyODU1OS4xNzIyMDE5ODM3" rel="noopener noreferrer"&gt;AWS&lt;/a&gt; services&lt;/li&gt;
&lt;li&gt;Maintenance tools such as automated patch management&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Special Advantages
&lt;/h2&gt;

&lt;p&gt;The service integrates tightly with the AWS ecosystem, adding capabilities such as automatic scaling and managed backups that matter for large applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Justification for Ranking
&lt;/h2&gt;

&lt;p&gt;Its seamless fit with AWS and the strength of the underlying infrastructure justify its top position, especially for companies already on AWS that need high scalability and reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Google Cloud SQL for PostgreSQL
&lt;/h2&gt;

&lt;p&gt;This is &lt;a href="https://cloud.google.com/sql/docs/postgres" rel="noopener noreferrer"&gt;Google's managed service&lt;/a&gt; for databases that makes it easier to set up, maintain, and manage PostgreSQL databases on Google Cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Provides high availability&lt;/li&gt;
&lt;li&gt;Automatic failover to a standby instance&lt;/li&gt;
&lt;li&gt;Tight integration with Google Cloud services&lt;/li&gt;
&lt;li&gt;Strong data encryption&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Special Advantages
&lt;/h2&gt;

&lt;p&gt;Built on Google's infrastructure, it delivers low latency and high throughput, and it connects well with Google's analytics and machine learning services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Justification for Ranking
&lt;/h2&gt;

&lt;p&gt;This service is best suited for organizations that work with big data and AI, taking advantage of Google's cutting-edge analytics and machine learning features integrated directly into their database management systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Microsoft Azure Database for PostgreSQL
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://azure.microsoft.com/en-us/products/postgresql" rel="noopener noreferrer"&gt;Microsoft Azure Database&lt;/a&gt; is a completely managed PostgreSQL database service made to manage important workloads with included high availability and ability to grow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Automatic performance tuning&lt;/li&gt;
&lt;li&gt;High availability&lt;/li&gt;
&lt;li&gt;Seamless integration with Azure services&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Special Advantages
&lt;/h2&gt;

&lt;p&gt;It offers enterprise-grade security, strict regulatory compliance, and high performance, making it a strong choice for large companies that already use other Azure services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Justification for Ranking
&lt;/h2&gt;

&lt;p&gt;The deep integration with Microsoft's ecosystem and the provision of significant scalability and security features make it a favored option for enterprises needing strong data management capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Heroku Postgres
&lt;/h2&gt;

&lt;p&gt;Available as a fully managed service within the &lt;a href="https://www.heroku.com/" rel="noopener noreferrer"&gt;Heroku cloud platform&lt;/a&gt;, &lt;a href="https://www.heroku.com/postgres" rel="noopener noreferrer"&gt;Heroku Postgres&lt;/a&gt; makes application development easier by combining data services directly with app creation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Features such as fork and follow&lt;/li&gt;
&lt;li&gt;Data rollbacks&lt;/li&gt;
&lt;li&gt;High availability setups&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Special Advantages
&lt;/h2&gt;

&lt;p&gt;Integration with Heroku’s developer tools and workflows makes building and deploying apps feel seamless, so managing the database becomes a natural part of creating the app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Justification for Ranking
&lt;/h2&gt;

&lt;p&gt;It's very easy to use and fits well with Heroku’s development platform, which makes it especially attractive to developers and smaller businesses that favor agile workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. DigitalOcean Managed Databases for PostgreSQL
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.digitalocean.com/products/managed-databases-postgresql" rel="noopener noreferrer"&gt;DigitalOcean&lt;/a&gt; offers an easy and affordable managed database service designed for developers and small-to-medium businesses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Simple scaling options&lt;/li&gt;
&lt;li&gt;Automated backups&lt;/li&gt;
&lt;li&gt;A user-friendly management interface&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Special Advantages
&lt;/h2&gt;

&lt;p&gt;Known for its simple, developer-friendly platform, it offers a good balance of performance and ease of use without the complexity of the larger cloud providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Justification for Ranking
&lt;/h2&gt;

&lt;p&gt;The provider's emphasis on ease of use and low cost makes it appealing to startups and smaller businesses that need a reliable database solution without large expenses.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Rackspace Managed PostgreSQL
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.rackspace.com/docs/managed-databases" rel="noopener noreferrer"&gt;Rackspace&lt;/a&gt; provides very flexible PostgreSQL hosting services, focusing on personalized setups and high-level customer support.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;24/7 customer support&lt;/li&gt;
&lt;li&gt;Custom configurations tailored to specific business needs&lt;/li&gt;
&lt;li&gt;Proactive performance management&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Special Advantages
&lt;/h2&gt;

&lt;p&gt;The focus on custom solutions and dedicated support means businesses get a PostgreSQL environment designed for their specific needs, with expert help along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Justification for Ranking
&lt;/h2&gt;

&lt;p&gt;Rackspace suits businesses that need extensive customization and hands-on assistance, making it a good fit for companies looking beyond standard, off-the-shelf solutions.&lt;/p&gt;

&lt;p&gt;Though these providers often get attention, other services such as &lt;a href="https://scalegrid.io/" rel="noopener noreferrer"&gt;ScaleGrid&lt;/a&gt;, &lt;a href="https://www.crunchydata.com/" rel="noopener noreferrer"&gt;Crunchy Data&lt;/a&gt;, &lt;a href="https://www.ibm.com/products/databases-for-postgresql" rel="noopener noreferrer"&gt;IBM Cloud Databases for PostgreSQL&lt;/a&gt;, and &lt;a href="https://www.elephantsql.com/" rel="noopener noreferrer"&gt;ElephantSQL&lt;/a&gt; also provide useful features. Each one meets different needs, including flexibility, security, or ease of use. This makes them good options to think about for certain business purposes or smaller projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;After reading this post, you're aware of several options for managed PostgreSQL services. These services simplify database operations with features like automatic backups, scalability, and regular updates. They are especially valuable for companies that want to streamline how they work and spend more time on core development instead of database upkeep.&lt;/p&gt;

&lt;p&gt;You've seen top providers such as Supabase, Aiven, and Amazon RDS for PostgreSQL, among others, selected because they meet key criteria and have a solid standing in the market. Each has its own strengths that suit different business requirements and technology stacks. Choosing the right managed PostgreSQL provider should come down to your business's specific needs, budget, and desired technical features.&lt;/p&gt;

</description>
      <category>database</category>
      <category>api</category>
      <category>architecture</category>
      <category>sql</category>
    </item>
    <item>
      <title>Raw SQL or ORM. Which one is better?</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Wed, 24 Jul 2024 01:13:23 +0000</pubDate>
      <link>https://forem.com/neurelo/raw-sql-or-orm-which-one-is-better-1pp2</link>
      <guid>https://forem.com/neurelo/raw-sql-or-orm-which-one-is-better-1pp2</guid>
      <description>&lt;p&gt;In the world of database management, there are two main approaches that developers often grapple with: using raw SQL queries or utilizing an ORM (Object Relational Mapper). Both methods have their strengths and weaknesses, leaving developers wondering which one is ultimately better. You’ll hear lots of opinions about both approaches. In this article, we will delve into the intricacies of raw SQL and ORM, analyze their pros and cons, and explore the scenarios where one might be preferred over the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Raw SQL and ORM
&lt;/h2&gt;

&lt;p&gt;Before we can compare the two approaches, it's crucial to have a clear understanding of what each entails. Raw SQL refers to writing SQL queries directly in the code, while an &lt;a href="https://www.neurelo.com/post/what-is-an-orm" rel="noopener noreferrer"&gt;ORM&lt;/a&gt; like SQLAlchemy, Active Record, or Drizzle is a software library that maps objects to relational database tables. &lt;/p&gt;

&lt;p&gt;When opting for raw SQL, developers have more control over the queries executed against the database. This control allows for complex queries and fine-tuning performance. On the other hand, an ORM acts as an abstraction layer, making database interactions more intuitive and reducing the need for repetitive and verbose SQL code.&lt;/p&gt;

&lt;p&gt;Raw SQL queries can be highly efficient when dealing with complex database operations that require specific optimizations. By directly writing SQL statements, developers can leverage database-specific features and optimizations that might not be easily achievable through an ORM. This level of control is particularly useful in scenarios where performance is a critical factor and the query logic needs to be finely tuned to meet strict requirements.&lt;/p&gt;

&lt;p&gt;On the contrary, ORMs provide a higher level of abstraction that simplifies the interaction with the database. By mapping database tables to object-oriented models, developers can work with familiar programming constructs, such as classes and objects, instead of dealing with raw SQL syntax. This abstraction not only enhances code readability but also promotes rapid development by reducing the amount of boilerplate SQL code that needs to be written and maintained.&lt;/p&gt;
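
&lt;p&gt;The contrast can be sketched in a few lines of Python with the standard-library sqlite3 module. The tiny User class below is invented for illustration and is not a real ORM API; libraries like SQLAlchemy do far more, but the mapping idea is the same.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Raw SQL: full control, but SQL strings live in application code.
conn.execute("INSERT INTO users (name) VALUES (?)", ("Ada",))
name = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]

# ORM-style: a minimal mapping from a class to the users table.
class User:
    def __init__(self, name):
        self.name = name

    def save(self):
        cur = conn.execute("INSERT INTO users (name) VALUES (?)", (self.name,))
        self.id = cur.lastrowid

    @staticmethod
    def find(user_id):
        row = conn.execute(
            "SELECT id, name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        user = User(row[1])
        user.id = row[0]
        return user

u = User("Grace")
u.save()
print(User.find(u.id).name)  # -> Grace
```

Even this toy mapper shows the trade: callers work with objects and never touch SQL, at the cost of hiding the exact queries being run.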

&lt;h2&gt;
  
  
  Pros and Cons of Using Raw SQL
&lt;/h2&gt;

&lt;p&gt;Raw SQL can be a powerful tool for developers who are well-versed in SQL and want maximum control over their database interactions. It allows for highly optimized queries, resulting in faster execution times. Additionally, raw SQL grants developers the flexibility to work with specific database features and syntax that may not be supported by an ORM.&lt;/p&gt;

&lt;p&gt;On top of these advantages, using raw SQL can also provide a deeper understanding of how databases function at a fundamental level. Developers can fine-tune their queries to leverage database indexes efficiently, leading to significant performance improvements. This level of control can be especially beneficial when working with large datasets or complex data structures.&lt;/p&gt;

&lt;p&gt;However, raw SQL does come with a set of challenges. It can be error-prone, difficult to troubleshoot, and, if mishandled, can lead to security risks like SQL injection attacks. Maintaining and updating raw SQL code can also become cumbersome, especially as the complexity of the project grows. &lt;/p&gt;

&lt;p&gt;Another potential drawback of relying heavily on raw SQL is the lack of portability across different database management systems. SQL syntax and functions can vary between databases, making it challenging to write queries that are compatible with multiple platforms. This limitation can hinder the scalability of a project, as it may necessitate significant rewrites if the database technology needs to be changed in the future. It's essential for developers to weigh the benefits of using raw SQL against these portability concerns when deciding on the approach to database interactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pros and Cons of Using ORMs
&lt;/h2&gt;

&lt;p&gt;Using Object-Relational Mappers (ORMs) in software development offers several advantages that can boost productivity and streamline database interactions. One of the primary benefits is the significant reduction in boilerplate code, which allows developers to focus more on implementing business logic rather than writing repetitive SQL queries. ORMs abstract the database layer, enabling developers to work with objects rather than SQL statements, which aligns well with object-oriented programming principles. This abstraction not only simplifies CRUD operations—Create, Read, Update, and Delete—but also makes it easier to switch between different database systems with minimal code changes. Additionally, ORMs can enhance code maintenance by consolidating schema definitions and relationships in a single location, reducing redundancy and facilitating easier management of schema migrations over time.&lt;/p&gt;

&lt;p&gt;Moreover, an ORM offers built-in security features that help protect against SQL injection attacks. By parameterizing queries and escaping values, ORMs ensure that malicious input cannot compromise the database's integrity.&lt;/p&gt;

&lt;p&gt;On the downside, ORMs can introduce performance overhead due to the abstraction layer, which might not be suitable for applications requiring high-performance or finely-tuned SQL queries. The queries generated by ORMs may not always be as optimized as hand-crafted SQL, potentially affecting application performance. Moreover, the complexity of mastering ORM frameworks and understanding their behavior can be a challenge, particularly for developers new to the technology. The abstraction provided by ORMs can sometimes obscure the understanding of underlying database interactions, making debugging more complex. Additionally, ORMs can limit control over query optimization and execution plans, and they may not immediately support the latest database features or functionalities.&lt;/p&gt;

&lt;p&gt;Furthermore, ORMs can create "leaky abstractions," where underlying database-specific quirks and behaviors still need to be added into the code, which can complicate development. This tight coupling to the ORM might also make it difficult to switch to another data access method if needed. While ORMs offer substantial benefits in terms of productivity, maintainability, and security, these advantages come with trade-offs in performance, control, and complexity. Therefore, the decision to use an ORM should be carefully considered based on the specific needs and constraints of the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Comparison: Raw SQL vs ORM
&lt;/h2&gt;

&lt;p&gt;Performance is a critical factor when deciding between raw SQL and an ORM. In general, raw SQL queries tend to offer faster execution times because developers can optimize them by hand. However, the difference in performance heavily depends on the complexity of the database interactions and the efficiency of the underlying ORM implementation.&lt;/p&gt;

&lt;p&gt;ORMs often compensate for any performance deficiencies by providing caching mechanisms. By caching query results and reducing the number of database calls, ORM-powered applications can achieve better overall performance in certain scenarios.&lt;/p&gt;
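
&lt;p&gt;As a rough sketch of the idea, here is a minimal query cache in Python over the standard-library sqlite3 module; real ORM caches also handle invalidation when data changes, which this toy version deliberately omits.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO products (price) VALUES (?)", [(9.5,), (20.0,)])

_cache = {}
calls = {"db": 0}

def cached_query(sql, params=()):
    """Return cached rows for (sql, params), hitting the database only once."""
    key = (sql, params)
    if key not in _cache:
        calls["db"] += 1
        _cache[key] = conn.execute(sql, params).fetchall()
    return _cache[key]

cached_query("SELECT price FROM products WHERE id = ?", (1,))
cached_query("SELECT price FROM products WHERE id = ?", (1,))  # served from cache
print(calls["db"])  # -> 1
```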

&lt;h2&gt;
  
  
  Security Considerations: Raw SQL vs ORM
&lt;/h2&gt;

&lt;p&gt;When it comes to security, both raw SQL and ORM have their considerations. Raw SQL introduces the risk of SQL injection attacks if proper measures are not taken, making it crucial for developers to sanitize user inputs and use prepared statements or query builders.&lt;/p&gt;
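
&lt;p&gt;A minimal Python sketch with the standard-library sqlite3 module shows the difference between interpolating user input into a query string and binding it as a parameter.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (user TEXT, secret TEXT)")
conn.execute("INSERT INTO accounts VALUES ('alice', 's3cret'), ('bob', 'hunter2')")

malicious = "nobody' OR '1'='1"

# Unsafe: string interpolation lets the input rewrite the query,
# so it matches every row in the table.
unsafe = conn.execute(
    "SELECT user FROM accounts WHERE user = '%s'" % malicious
).fetchall()

# Safe: a parameterized query treats the input as a plain value.
safe = conn.execute(
    "SELECT user FROM accounts WHERE user = ?", (malicious,)
).fetchall()

print(len(unsafe), len(safe))  # -> 2 0
```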

&lt;p&gt;ORMs, while offering protection against SQL injection, can still be susceptible to other security vulnerabilities. Developers need to ensure that the ORM library they choose is actively maintained and regularly updated to address any security issues that arise.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Choose Raw SQL Over ORM
&lt;/h2&gt;

&lt;p&gt;Although ORMs provide numerous advantages, there are situations where raw SQL might be the preferable choice. Raw SQL has the upper hand when it comes to complex queries, as it allows developers to fine-tune and optimize them according to specific database requirements. If the project heavily relies on database-specific features or requires low-level control, raw SQL might be the better option.&lt;/p&gt;

&lt;p&gt;Additionally, raw SQL can be advantageous for performance-critical applications that demand lightning-fast database interactions. In such cases, the control over query optimization and the ability to manually fine-tune the code can make a noticeable impact on performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Deciding between raw SQL and an ORM is not a one-size-fits-all approach. It ultimately depends on the specific requirements of the project and the tradeoffs the development team is willing to make. &lt;/p&gt;

&lt;p&gt;For developers seeking to balance the control and performance of raw SQL with the efficiency, maintainability, and code simplicity of ORMs, Neurelo is a cloud &lt;a href="https://www.neurelo.com/features/platform" rel="noopener noreferrer"&gt;data API platform&lt;/a&gt; that offers &lt;a href="https://www.neurelo.com/features/auto-generated-apis" rel="noopener noreferrer"&gt;auto-generated APIs&lt;/a&gt;, together with an &lt;a href="https://www.neurelo.com/features/ai-powered-query-playground" rel="noopener noreferrer"&gt;AI-powered complex query builder&lt;/a&gt;, &lt;a href="https://www.neurelo.com/features/schema-as-code" rel="noopener noreferrer"&gt;Git-like schema management&lt;/a&gt;, &lt;a href="https://www.neurelo.com/features/query-observability" rel="noopener noreferrer"&gt;observability&lt;/a&gt;, and more. Performance and simplicity should not be tradeoffs. &lt;a href="https://dashboard.neurelo.com/register" rel="noopener noreferrer"&gt;Click here&lt;/a&gt; to try it yourself. &lt;/p&gt;

</description>
      <category>database</category>
      <category>api</category>
      <category>architecture</category>
      <category>sql</category>
    </item>
    <item>
      <title>Database Observability: An Introductory Guide</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Tue, 04 Jun 2024 15:00:00 +0000</pubDate>
      <link>https://forem.com/neurelo/database-observability-an-introductory-guide-1pbc</link>
      <guid>https://forem.com/neurelo/database-observability-an-introductory-guide-1pbc</guid>
      <description>&lt;p&gt;Ever wondered what happens behind the scenes when you use an app or website? A crucial part of the magic lies in the database—a vast digital system storing all the information that keeps things running smoothly. But just like any complex system, databases require constant care and attention to ensure optimal performance. This is where database observability comes in. With database observability, it's like there's a guardian watching over your data.&lt;/p&gt;

&lt;p&gt;This post will teach you the importance of database observability, prepare you for the challenges that might be encountered, and equip you with practical strategies for implementing it effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Database Observability?
&lt;/h2&gt;

&lt;p&gt;Database observability, to put it simply, is the process of actively tracking and comprehending the functionality and state of your database systems. It's similar to having a live window into your database, letting you see possible problems early on, maximize efficiency, and make sure your data is always available.&lt;/p&gt;

&lt;p&gt;Database observability relies on three key components to provide this comprehensive view:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics: These are numerical assessments that monitor several facets of the health of your database, including disk use, connection counts, and &lt;a href="https://www.neurelo.com/features/query-observability"&gt;query execution&lt;/a&gt; times. They provide an instantaneous overview of your database's current state.&lt;/li&gt;
&lt;li&gt;Logs: Imagine a detailed record of everything happening within your database. Logs capture events like successful or failed queries, user actions, and error messages. By analyzing logs, you can gain deeper insights into potential problems and identify root causes.&lt;/li&gt;
&lt;li&gt;Traces: Think of traces as the behind-the-scenes story of a query. They capture the entire journey of a query as it travels through your database system, pinpointing any bottlenecks or slowdowns that might be hindering performance.&lt;/li&gt;
&lt;/ul&gt;
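
&lt;p&gt;As a minimal illustration of the metrics-and-logs idea, the Python sketch below wraps each query with a counter, a latency measurement, and a log line, using only the standard library; the metric names are invented for this example.&lt;/p&gt;

```python
import logging
import sqlite3
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("db")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")

metrics = {"queries": 0, "total_ms": 0.0}

def observed_query(sql, params=()):
    """Run a query while recording metrics (count, latency) and a log line."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000
    metrics["queries"] += 1
    metrics["total_ms"] += elapsed_ms
    log.info("query=%r rows=%d elapsed_ms=%.2f", sql, len(rows), elapsed_ms)
    return rows

observed_query("SELECT kind FROM events")
print(metrics["queries"])  # -> 1
```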

&lt;h2&gt;
  
  
  Importance of Database Observability
&lt;/h2&gt;

&lt;p&gt;Consider your database the central nervous system of your application: it houses all the vital data that keeps everything operating. A healthy database is as necessary to your applications and websites as a healthy heart is to a person's well-being. This is where database observability becomes critical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Dive into Production and Application Behavior
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Significance of API and query-level insights: Database observability allows you to see beyond overall database health and delve into granular details. By monitoring &lt;a href="https://www.neurelo.com/features/auto-generated-apis"&gt;API&lt;/a&gt; and query-level metrics, you can pinpoint exactly how specific applications and functionalities interact with your database. This helps you identify areas where queries might be slow or inefficient, impacting the user experience.&lt;/li&gt;
&lt;li&gt;Impact on identifying and solving issues promptly: Traditional monitoring might only alert you after a major issue arises. Database observability empowers you to be proactive. By tracking key metrics and analyzing logs, you can identify potential problems early on—before they snowball into critical failures. This allows for faster troubleshooting and resolution, minimizing downtime and ensuring a smooth user experience.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Build a Reliable Database Fortress
&lt;/h2&gt;

&lt;p&gt;A sluggish database can significantly impact your application's performance. Database observability helps you identify bottlenecks and performance issues within your database. By analyzing query execution times, connection pools, and resource utilization, you can optimize your database configuration and fine-tune queries, leading to a faster and more responsive system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enhanced Scalability
&lt;/h2&gt;

&lt;p&gt;As your application grows, your database needs to keep pace. Database observability provides valuable insights into your database's resource usage, allowing you to proactively scale your infrastructure to meet evolving demands and ensure smooth performance under increasing loads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improved Development and Operations Collaboration
&lt;/h2&gt;

&lt;p&gt;Database observability fosters better communication between developers and operations teams. By providing shared visibility into database health and performance, both teams can work together to optimize &lt;a href="https://docs.neurelo.com/definitions/custom-query-endpoints-as-apis/write-and-commit-custom-queries?_gl=1*xi5ewg*_gcl_au*NzIwNzAxNzY1LjE3MTQ0MzIxMzI."&gt;queries&lt;/a&gt;, identify potential issues early on, and ensure a more efficient development and deployment process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimizing Resource Utilization
&lt;/h2&gt;

&lt;p&gt;Database observability acts as a resource manager, akin to a wise gardener tending to a flourishing garden. It optimizes resource utilization, ensuring that every byte and cycle is utilized effectively. This not only improves efficiency but also reduces unnecessary expenses, much like turning off lights in unoccupied rooms to save energy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges of Database Observability
&lt;/h2&gt;

&lt;p&gt;While database observability offers immense benefits, it's not without its challenges. Here are some key obstacles you might encounter on your journey.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Privacy and Security
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Balancing observability with privacy concerns: Database observability involves collecting and analyzing data about your database's operation, which might include sensitive information. It's crucial to strike a balance between gaining valuable insights and protecting user privacy.&lt;/li&gt;
&lt;li&gt;Strategies for safeguarding sensitive information: There are several strategies to ensure data security while maintaining observability. You can implement &lt;a href="https://aws.amazon.com/what-is/data-masking/"&gt;data masking&lt;/a&gt; to hide sensitive data in logs, leverage &lt;a href="https://auth0.com/docs/manage-users/access-control/rbac"&gt;role-based access control&lt;/a&gt; to limit access to sensitive information, and encrypt sensitive data at rest and in transit.&lt;/li&gt;
&lt;/ul&gt;
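
&lt;p&gt;One way to picture the data-masking strategy above is to scrub known-sensitive patterns from every log line before it leaves the application. A small sketch; the two patterns below are illustrative, not an exhaustive PII catalog:&lt;/p&gt;

```python
import re

# Illustrative patterns; a real deployment would cover more PII types.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
CARD = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")

def mask(line):
    """Redact sensitive values before a log line is emitted."""
    line = EMAIL.sub("[email redacted]", line)
    line = CARD.sub("[card redacted]", line)
    return line

raw = "UPDATE users SET email='ada@example.com', card='4111 1111 1111 1111'"
print(mask(raw))
# UPDATE users SET email='[email redacted]', card='[card redacted]'
```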

&lt;h2&gt;
  
  
  Complexity of Design and Maintenance
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Navigating intricate database structures: Modern databases can be complex, with intricate structures and relationships between tables. This complexity can make it challenging to determine which metrics and logs are most relevant for monitoring and troubleshooting.&lt;/li&gt;
&lt;li&gt;Addressing challenges in maintaining observability tools: Database observability tools themselves require ongoing maintenance and updates. You'll need to invest time and resources in selecting the right tools, configuring them effectively, and ensuring they stay up-to-date to provide accurate and reliable insights.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real-Time Observability
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Importance of real-time insights: In today's fast-paced world, real-time insights are crucial for identifying and responding to issues promptly. Delays in data collection and analysis can hinder your ability to react quickly to potential problems.&lt;/li&gt;
&lt;li&gt;Overcoming obstacles in achieving real-time observability: Achieving real-time observability can be challenging, especially for large and complex databases. Factors like data volume, processing power, and network latency can all contribute to delays. You can overcome these obstacles by implementing efficient data collection methods, leveraging streaming technologies, and optimizing infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resource Scalability
&lt;/h2&gt;

&lt;p&gt;As data volumes grow, so does the need for scalable observability solutions. Addressing this challenge involves adopting cloud-based solutions and optimizing resource allocation. It's akin to ensuring that your ship not only sails smoothly but also adapts to the ever-changing tides without capsizing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strategies for Implementing Database Observability: A Roadmap to Success
&lt;/h2&gt;

&lt;p&gt;Equipping yourself with the right strategies is essential for unlocking the true power of database observability. These strategies act as your roadmap, guiding you toward a comprehensive understanding of your database's health and performance. Let's delve into some key strategies that will empower you to effectively implement database observability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demystifying Production Environments and Application Behavior
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring slow queries: Slow queries can significantly impact user experience. Here's how to tackle them:

&lt;ul&gt;
&lt;li&gt;Identify bottlenecks: Use your observability tools to pinpoint queries with longer execution times. Analyze query plans and execution paths to identify bottlenecks that might be slowing down data retrieval.&lt;/li&gt;
&lt;li&gt;Optimize slow queries: Once you've identified bottlenecks, you can optimize slow queries. This might involve rewriting inefficient queries, creating appropriate indexes, or adjusting database configuration settings.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Managing queries interfering with one another: Sometimes queries compete for resources and slow each other down. Here's how to address this:

&lt;ul&gt;
&lt;li&gt;Analyze query dependencies: Use your observability tools to track query dependencies and identify situations where one query might be blocking another.&lt;/li&gt;
&lt;li&gt;Implement isolation techniques: Use database features like transactions and locking mechanisms to ensure queries execute without interference, preventing slowdowns.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Understanding Read/Write Patterns
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Analyzing data access patterns: Gaining insights into how data is accessed within your database is crucial. Here's what you need to track:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Analyze read and write frequencies: Monitor the ratio of read operations to write operations within your database. This helps you understand how your application primarily interacts with the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adjust resources based on usage patterns: Based on your read/write analysis, you might need to adjust resources allocated to your database. For instance, if you have a read-heavy application, scaling your read replicas can improve performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scaling for Optimal Performance
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;When to scale (scaling up or out): As your application grows, your database might need to scale as well. Here's how to decide:

&lt;ul&gt;
&lt;li&gt;Recognize signs of increased load: Monitor key metrics like CPU usage, memory consumption, and connection pool saturation. When these metrics consistently approach their limits, it's a sign you might need to scale.&lt;/li&gt;
&lt;li&gt;Implement scaling strategies effectively: There are two main approaches: scaling up (adding more resources to a single server) and scaling out (distributing the database load across multiple servers). Choosing the right one depends on your specific needs and infrastructure.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;What to scale: Not all database components need to be scaled equally.

&lt;ul&gt;
&lt;li&gt;Identify components for scaling: Focus on scaling CPU, memory, or storage based on which resources are reaching their limits.&lt;/li&gt;
&lt;li&gt;Ensure cost-effectiveness in scaling decisions: Consider the cost implications of scaling. Explore cost-effective options like cloud-based database services with auto-scaling features.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By implementing these strategies and tailoring them to your specific database environment, you'll gain a deeper understanding of how your application interacts with your database, optimize performance, and ensure your database scales to meet your growing needs. Remember, database observability is an ongoing journey, and these strategies will serve as your guide as you refine your approach and continuously improve the health and performance of your databases.&lt;/p&gt;
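
&lt;p&gt;To make the read/write analysis and scaling signals above concrete, here's a toy helper that turns coarse workload counters into a suggestion. The thresholds are invented for illustration; real decisions should come from baselines you observe in your own metrics.&lt;/p&gt;

```python
def scaling_hint(reads, writes, cpu_pct, mem_pct):
    """Turn coarse workload counters into a scaling suggestion.

    The thresholds here are illustrative, not universal.
    """
    hints = []
    if reads > writes * 10:
        hints.append("read-heavy: consider read replicas (scale out)")
    if writes > reads * 10:
        hints.append("write-heavy: consider sharding or faster storage")
    if cpu_pct > 80 or mem_pct > 80:
        hints.append("resource pressure: scale up CPU/memory")
    return hints or ["load looks comfortable; keep monitoring"]

print(scaling_hint(reads=50_000, writes=900, cpu_pct=35, mem_pct=60))
# ['read-heavy: consider read replicas (scale out)']
print(scaling_hint(reads=1_000, writes=1_200, cpu_pct=92, mem_pct=70))
# ['resource pressure: scale up CPU/memory']
```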

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You should now have a solid understanding of database observability, a critical practice for ensuring your database runs smoothly and efficiently. We've unpacked its importance, shedding light on how it helps you understand application behavior, improve system reliability, and ensure data remains readily accessible.&lt;/p&gt;

&lt;p&gt;We've also equipped you with practical strategies for implementing database observability. You've learned how to monitor queries, analyze read/write patterns, and effectively scale your database for optimal performance. By following these steps and continuing to explore this essential practice, you can ensure your database remains the strong foundation of your applications and websites.&lt;/p&gt;

&lt;p&gt;This post was written by Gourav Bais. &lt;a href="https://www.analyticsvidhya.com/blog/author/gourav29/?utm_source=social&amp;amp;utm_medium=linkedin&amp;amp;utm_campaign=author_profile"&gt;Gourav&lt;/a&gt; is an applied machine learning engineer skilled in computer vision/deep learning pipeline development, creating machine learning models, retraining systems, and transforming data science prototypes into production-grade solutions.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>mongodb</category>
      <category>mysql</category>
      <category>database</category>
    </item>
    <item>
      <title>Most popular Postgres ORMs</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Wed, 29 May 2024 15:00:00 +0000</pubDate>
      <link>https://forem.com/neurelo/postgres-orms-4g91</link>
      <guid>https://forem.com/neurelo/postgres-orms-4g91</guid>
      <description>&lt;p&gt;This blog post will discuss popular Postgres ORM tools and other alternatives.&lt;/p&gt;

&lt;p&gt;Those who are not familiar with ORMs should read "What is an ORM and when developers should and shouldn't use it," but to recap: Object-Relational Mappers (ORMs) abstract the database interactions in applications by converting database records into objects and vice versa.&lt;/p&gt;

&lt;p&gt;This blog post includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Introduction&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Postgres ORM tools&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Popular Postgres ORMs&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sequelize&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SQLAlchemy&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Active Record&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Eloquent&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hibernate&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GORM&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Postgres ORMs - Challenges and limitations&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.postgresql.org/"&gt;PostgreSQL&lt;/a&gt; (commonly also referred to as Postgres) is one of the most popular open-source relational database management systems (RDBMS). Renowned for its robustness, extensibility, and adherence to SQL standards, PostgreSQL has positioned itself as one of the most reliable and feature-rich relational databases available. Its support for complex data types and advanced indexing methods makes it suitable for handling diverse data requirements. Postgres is also recognized for its extensibility through the use of custom functions, data types, and procedural languages.&lt;/p&gt;


&lt;h2&gt;
  
  
  Postgres ORM Tools
&lt;/h2&gt;

&lt;p&gt;PostgreSQL supports various ORM tools for different programming languages, with some popular ones being SQLAlchemy for Python, Hibernate for Java, and Sequelize for TypeScript/Node.js. These tools provide a set of abstractions that allow developers to interact with the PostgreSQL database using object-oriented concepts.&lt;/p&gt;

&lt;p&gt;This enables developers to model database entities as objects and manipulate them using familiar programming language constructs, reducing the need for raw SQL queries. It also enhances code readability and maintainability while also promoting a more natural development workflow.&lt;/p&gt;


&lt;h2&gt;
  
  
  Popular Postgres ORMs
&lt;/h2&gt;

&lt;p&gt;Here are some popular PostgreSQL ORMs for specific programming languages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequelize (for TypeScript): Sequelize is a widely used ORM for TypeScript and JavaScript. It supports PostgreSQL and a few other databases, and provides a powerful set of features for working with and querying databases.&lt;/li&gt;
&lt;li&gt;SQLAlchemy (for Python): SQLAlchemy is a robust ORM for Python, offering a flexible and expressive way to interact with PostgreSQL databases. It supports various PostgreSQL-specific features and provides a powerful SQL expression language.&lt;/li&gt;
&lt;li&gt;Active Record (for Ruby): Active Record, part of the Ruby on Rails framework, is a popular ORM for Ruby. It simplifies database interactions and supports PostgreSQL seamlessly, allowing developers to work with Ruby classes representing database tables.&lt;/li&gt;
&lt;li&gt;Eloquent (for PHP): Eloquent is the ORM included in the Laravel PHP framework. It simplifies database operations and supports PostgreSQL, offering an elegant syntax for querying databases using PHP.&lt;/li&gt;
&lt;li&gt;Hibernate (for Java): Hibernate is a widely used ORM for Java applications. It supports PostgreSQL and provides a robust framework for mapping Java objects to database tables, along with powerful querying capabilities.&lt;/li&gt;
&lt;li&gt;GORM (for Go): GORM is a popular ORM for the Go programming language. It offers support for PostgreSQL and provides a simple and concise syntax for defining and querying database models in Go.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Postgres ORMs - Challenges and limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Performance Optimization: ORM-generated queries may not always be optimized for specific use cases, requiring developers to manually fine-tune queries in their code for better performance. These performance problems tend to worsen as the application scales.&lt;/li&gt;
&lt;li&gt;Learning Curve: ORMs are specific to a particular programming language and working with an ORM introduces a learning curve. Developers need to understand the intricacies of the ORM tool, programming language framework, and PostgreSQL to use them effectively.&lt;/li&gt;
&lt;li&gt;Abstraction Leaks: Abstraction leaks often occur as applications move beyond basic CRUD queries. This leads to messy code, unexpected behavior and performance issues. Developers must be cautious and carefully consider their application requirements, architecture, and data access patterns.&lt;/li&gt;
&lt;li&gt;Schema Changes and Migrations: Managing schema changes and migrations can be challenging, particularly in large applications. Developers need to plan for seamless database evolution as the application evolves.&lt;/li&gt;
&lt;li&gt;Modern application architectures: ORM tools may not seamlessly align with modern cloud and serverless architectures. As applications scale, ORM's centralized management of database connections and transactions can become a bottleneck, leading to increased latency and decreased overall throughput. Similarly, in serverless setups, ORM's heavyweight abstraction may struggle with short-lived compute instances. Developers need to carefully evaluate the trade-offs between ORM convenience and cloud-native benefits when architecting applications. &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Postgres ORM Alternative with Cloud-Based Data APIs
&lt;/h2&gt;

&lt;p&gt;Neurelo’s programming interface for databases addresses many of the challenges associated with Postgres ORMs such as N+1 queries and leaky abstractions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Full Transparency of Queries:
&lt;/h2&gt;

&lt;p&gt;Traditional ORMs obscure the SQL layer; Neurelo provides full visibility into the queries being executed. This not only makes debugging much easier but also gives developers the information they need to optimize database interactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Query Strategies:
&lt;/h2&gt;

&lt;p&gt;Neurelo minimizes the number of queries made to the database using intelligent techniques like eager loading (while also offering the option of lazy loading, when needed) which retrieves the data in a single query using joins. This solves N+1 query issues associated with many traditional ORMs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Override Capability for Queries:
&lt;/h2&gt;

&lt;p&gt;Flexibility and control are key in application development, so Neurelo allows developers to override the default query generation behavior for APIs when needed, according to the application's requirements. This is particularly useful for optimizing performance or handling edge cases as the application evolves, offering both a short-term benefit (get started quickly with the queries Neurelo has already optimized) and long-term flexibility (debug, understand, and update these queries as your data, use cases, and applications demand).&lt;/p&gt;

&lt;h2&gt;
  
  
  Extending Neurelo APIs for Complex Queries:
&lt;/h2&gt;

&lt;p&gt;Neurelo’s &lt;a href="https://docs.neurelo.com/definitions/custom-query-endpoints-as-apis/ai-assisted-query-generation?_gl=1*1waumcc*_gcl_au*NzIwNzAxNzY1LjE3MTQ0MzIxMzI."&gt;AI-powered custom APIs&lt;/a&gt; go beyond the basic CRUD operations (typically offered by ORM frameworks) with auto-generated APIs that provide the ability to work on single entities (tables/collections) as well as advanced join read/write tasks that go across multiple entities. This means developers can craft bespoke solutions for specific requirements without compromising the integrity or efficiency of the application, thus solving “leaky abstractions”.&lt;/p&gt;

&lt;h2&gt;
  
  
  Schema-as-code and alignment to software development lifecycle
&lt;/h2&gt;

&lt;p&gt;By treating database schemas as code, Neurelo empowers developers to manage schema changes seamlessly within their existing codebase, enhancing collaboration, &lt;a href="https://docs.neurelo.com/definitions/branches-and-commits?_gl=1*yawpku*_gcl_au*NzIwNzAxNzY1LjE3MTQ0MzIxMzI."&gt;version control&lt;/a&gt;, and automation. With &lt;a href="https://www.neurelo.com/features/schema-as-code"&gt;Schema as Code&lt;/a&gt;, developers can define database schemas using familiar language and version control systems, enabling them to track changes, review diffs, and rollback modifications with ease. This approach revolutionizes schema management by integrating it into the software development lifecycle, streamlining workflows, and ensuring consistency between application code and database schema. Neurelo's Schema as Code concept represents a paradigm shift in database development, offering a modern and agile approach to schema management for the next generation of applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dashboard.neurelo.com/register"&gt;‍Try Neurelo&lt;/a&gt; now and never experience ORM limitations.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>orm</category>
      <category>api</category>
      <category>database</category>
    </item>
    <item>
      <title>What is an ORM and when developers should and shouldn't use it</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Sat, 25 May 2024 16:56:08 +0000</pubDate>
      <link>https://forem.com/neurelo/what-is-an-orm-and-when-developers-should-and-shouldnt-use-it-1hm2</link>
      <guid>https://forem.com/neurelo/what-is-an-orm-and-when-developers-should-and-shouldnt-use-it-1hm2</guid>
      <description>&lt;h2&gt;
  
  
  What is an ORM
&lt;/h2&gt;

&lt;p&gt;Object-Relational Mapping (ORM) is a programming technique that facilitates the interaction between a relational database and programming language. It acts as a bridge between the logical, object-oriented representation of data in the code and the physical, relational structure in a database. The primary purpose of an ORM is to eliminate the impedance mismatch that can occur between the object model used in application code and the relational model employed by databases.&lt;/p&gt;
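
&lt;p&gt;That bridge can be pictured with a deliberately tiny, hand-rolled mapper over Python's bundled SQLite. Real ORMs (SQLAlchemy, Hibernate, Active Record, and so on) add change tracking, relationships, and query builders on top, but the core round trip is this; the &lt;code&gt;User&lt;/code&gt; class and table are invented for the illustration:&lt;/p&gt;

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class User:          # the object-oriented side of the bridge
    id: int
    name: str

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def save(user):
    """Object to row: mapping in one direction..."""
    conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user.id, user.name))

def find(user_id):
    """...and row back to object in the other."""
    row = conn.execute("SELECT id, name FROM users WHERE id = ?", (user_id,)).fetchone()
    return User(*row) if row else None

save(User(1, "Ada"))
print(find(1))  # User(id=1, name='Ada')
```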


&lt;h2&gt;
  
  
  Why use an ORM
&lt;/h2&gt;

&lt;p&gt;Developers use ORMs for several reasons. Firstly, it simplifies and accelerates the development process by allowing them to work with objects and classes in the programming language rather than writing SQL queries. This abstraction minimizes the need to deal directly with database-specific syntax and intricacies.&lt;/p&gt;

&lt;p&gt;Secondly, an ORM enhances code maintainability by providing some level of an abstraction. Changes to the database schema can be managed more easily, as developers can update the mapping in the ORM layer without extensively modifying application code.&lt;/p&gt;

&lt;p&gt;Thirdly, an ORM promotes code reusability and portability. By abstracting the database interactions, developers can write database-agnostic code, making it easier to switch to a different database system without substantial code changes.&lt;/p&gt;


&lt;h2&gt;
  
  
  Challenges using ORMs
&lt;/h2&gt;

&lt;p&gt;While an Object-Relational Mapping (ORM) layer provides numerous advantages, developers often encounter challenges and trade-offs when deciding to use this approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dev/Build Time Challenges:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;N+1 Queries: ORMs often lead to the N+1 query problem, where multiple database queries are executed instead of a more optimized query, causing performance issues for data access. Developers need to be mindful of the data access patterns to minimize this problem.&lt;/li&gt;
&lt;li&gt;Leakiness: As queries become more advanced and complex, beyond basic CRUD operations, the abstraction provided by an ORM can start to leak, forcing developers to deal directly with database-specific query code and behavior. This can lead to unexpected code maintainability issues, particularly when attempting to change the schema or optimize queries.&lt;/li&gt;
&lt;/ul&gt;
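
&lt;p&gt;The N+1 problem is easiest to see by counting queries. The toy sketch below (stdlib SQLite, invented schema) issues one query per author the naive way, then fetches everything with a single join, which is essentially what an ORM's eager-loading option does for you:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO posts VALUES (1, 1, 'On engines'), (2, 1, 'On cards'), (3, 2, 'On compilers');
""")

queries = []
def run(sql, params=()):
    queries.append(sql)                    # count every round trip
    return conn.execute(sql, params).fetchall()

# N+1 pattern: one query for the authors, then one more per author.
queries.clear()
for author_id, _ in run("SELECT id, name FROM authors"):
    run("SELECT title FROM posts WHERE author_id = ?", (author_id,))
n_plus_one = len(queries)  # 1 + N = 3 here

# Eager alternative: a single join fetches everything at once.
queries.clear()
run("SELECT a.name, p.title FROM authors a JOIN posts p ON p.author_id = a.id")
eager = len(queries)  # 1

print(n_plus_one, eager)  # 3 1
```

&lt;p&gt;With 2 authors the difference is trivial; with 10,000 it is the difference between one round trip and 10,001.&lt;/p&gt;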

&lt;h2&gt;
  
  
  Production Performance, Scale, Optimization, and Security:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Performance: While ORM systems generate SQL queries, they may not always be ideal and developers may need to fine-tune or handcraft queries for optimal performance in specific situations. In high-performance scenarios, such optimizations become crucial.&lt;/li&gt;
&lt;li&gt;Scale: As applications scale, ORM-generated queries may not always scale seamlessly. Developers may need to consider database-specific optimizations or even move away from certain ORM features to ensure performance at scale.&lt;/li&gt;
&lt;li&gt;DB Connection Management: Connection pools manage reusable database connections, enhancing performance and scalability. ORMs may not always integrate seamlessly with connection pools, leading to issues such as inefficient connection management, impacting performance. Additionally, configuring the connection pool settings to align with the ORM's requirements and ensuring proper handling of connections within the ORM layer can be complex tasks.&lt;/li&gt;
&lt;li&gt;Security: ORM systems abstract away much of the SQL, but developers must remain vigilant about security. Poorly sanitized inputs, improper use of ORM features, or overlooking security configurations can introduce vulnerabilities.&lt;/li&gt;
&lt;/ul&gt;
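
&lt;p&gt;The security point above is concrete enough to demonstrate: parameter binding keeps attacker-controlled input out of the SQL text, while naive string interpolation splices it in. A small sketch with Python's bundled SQLite driver (the table and payload are invented for the example):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (user TEXT, secret TEXT)")
conn.execute("INSERT INTO accounts VALUES ('alice', 's3cr3t')")

attacker_input = "nobody' OR '1'='1"

# Unsafe: string interpolation splices the payload into the SQL itself.
unsafe_sql = "SELECT secret FROM accounts WHERE user = '%s'" % attacker_input
leaked = conn.execute(unsafe_sql).fetchall()   # returns every row!

# Safe: the driver binds the value; the payload is just an odd username.
safe = conn.execute(
    "SELECT secret FROM accounts WHERE user = ?", (attacker_input,)
).fetchall()

print(len(leaked), len(safe))  # 1 0
```

&lt;p&gt;ORMs generate the parameterized form for you, but the moment you drop to raw SQL or build query fragments from user input, this class of vulnerability is back on the table.&lt;/p&gt;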

&lt;h2&gt;
  
  
  Suitability to SDLC Processes (e.g., CI/CD):
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Continuous Integration/Continuous Deployment (CI/CD): ORMs can pose challenges in CI/CD pipelines, especially when dealing with database schema changes. Migrations and updates need careful consideration to avoid disruptions and ensure smooth deployment processes.&lt;/li&gt;
&lt;li&gt;Adaptation to Change: As applications evolve with new features, the ORM mappings may need frequent updates. Ensuring that these changes do not disrupt existing functionality and that migrations are handled seamlessly becomes a crucial aspect of the development process.&lt;/li&gt;
&lt;li&gt;Modern application architectures: ORM tools may not seamlessly align with modern cloud and serverless architectures. As applications scale, ORM's centralized management of database connections and transactions can become a bottleneck, leading to increased latency and decreased overall throughput. Similarly, in serverless setups, ORM's heavyweight abstraction may struggle with short-lived compute instances. Developers need to carefully evaluate the trade-offs between ORM convenience and cloud-native benefits when architecting applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In conclusion, while an ORM simplifies database interactions and enhances code maintainability, developers should be aware of these challenges and make informed decisions based on the specific needs of their applications. Mitigating these challenges often involves a combination of careful design, optimization, and a deep understanding of both the ORM framework and the underlying database system.&lt;/p&gt;

</description>
      <category>database</category>
      <category>mongodb</category>
      <category>postgres</category>
      <category>mysql</category>
    </item>
    <item>
      <title>Top 4 Alternatives to MySQL</title>
      <dc:creator>Shani Shoham</dc:creator>
      <pubDate>Wed, 15 May 2024 23:32:26 +0000</pubDate>
      <link>https://forem.com/neurelo/mysql-or-top-alternatives-in-2024-and-how-to-choose-one-1oj5</link>
      <guid>https://forem.com/neurelo/mysql-or-top-alternatives-in-2024-and-how-to-choose-one-1oj5</guid>
      <description>&lt;p&gt;In recent years, the landscape of DataBase Management Systems (DBMS) has undergone significant changes, leading to the rise of new and innovative solutions that challenge the traditional dominance of MySQL. As businesses strive for improved developer experience, performance, scalability, and security, it becomes imperative to also explore the top alternatives to MySQL in 2024 and choose the best database for your application needs. &lt;/p&gt;


&lt;h2&gt;
  
  
  Understanding the Need for MySQL Alternatives
&lt;/h2&gt;

&lt;p&gt;Before delving into the world of alternatives, it is crucial to comprehend why this shift is happening in the first place. The evolution of DBMS has played a pivotal role, with traditional relational databases now facing competition from modern and specialized solutions, ensuring organizations have the right tools for the job.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evolution of Database Management Systems
&lt;/h2&gt;

&lt;p&gt;Over the years, database management systems have transformed from simple storage and retrieval tools to sophisticated systems capable of handling complex data structures and large volumes of information. This evolution has resulted in a diverse array of options, each catering to specific needs and use cases.&lt;/p&gt;

&lt;p&gt;One key aspect of this evolution is the rise of NoSQL databases. Unlike traditional relational databases like MySQL, NoSQL databases offer a flexible data model that allows for the storage and retrieval of unstructured and semi-structured data. This makes them particularly well-suited for applications dealing with large amounts of diverse data, such as social media platforms or IoT devices.&lt;/p&gt;

&lt;p&gt;Another significant development in the world of database management systems is the emergence of cloud-native databases. These databases are designed specifically for cloud environments and offer features like automatic scaling, high availability, and distributed data storage. They provide organizations with the ability to easily scale their databases as their needs grow, without the need for complex infrastructure management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Consider Alternatives to MySQL?
&lt;/h2&gt;

&lt;p&gt;MySQL is a widely-used open-source relational database management system (RDBMS) that offers several advantages for both small and large applications. One of its primary benefits is its speed and efficiency in handling large volumes of data, making it suitable for dynamic websites and applications that require fast data processing. MySQL supports a broad range of operating systems, ensuring flexibility across different environments. It is known for its reliability and robustness, with features that ensure data integrity and security, including transactional support, which allows for consistent data states even in case of system failures. Additionally, MySQL is easy to use and manage, with a vast community and comprehensive documentation available for support.&lt;/p&gt;

&lt;p&gt;While MySQL has long been, and still is, a popular choice due to its ease of use and widespread adoption, there are several reasons why businesses may want to consider alternatives.&lt;/p&gt;

&lt;p&gt;For instance, if your application requires high-performance real-time analytics, the limitations of MySQL's architecture may become a factor. MySQL is primarily designed for transactional workloads, where data consistency is of utmost importance. However, when it comes to analytical workloads that require complex queries and aggregations on large datasets, MySQL may not deliver the required performance. In such cases, alternative databases like Apache Cassandra or Amazon Redshift, which are optimized for analytical workloads, can provide significant performance improvements.&lt;/p&gt;

&lt;p&gt;Furthermore, MySQL's scalability and reliability might not fully meet the demands of rapidly expanding datasets and high-traffic applications. As data volumes grow and user concurrency increases, MySQL's single-node architecture can become a bottleneck. Alternative databases like MongoDB or Apache HBase, which offer distributed architectures and automatic sharding, can provide better scalability and fault tolerance.&lt;/p&gt;

&lt;p&gt;It is crucial to assess your specific requirements and determine whether an alternative DBMS can better address your organization's needs. Consider factors such as the nature of your data, the workload characteristics, and the scalability and performance requirements. By carefully evaluating your options, you can ensure that you choose the right database management system that aligns with your business goals and technical requirements.&lt;/p&gt;


&lt;h2&gt;
  
  
  Overview of Top MySQL Alternatives in 2024
&lt;/h2&gt;

&lt;p&gt;Now that we understand the need for alternative DBMS options, let's explore some of the leading contenders in 2024.&lt;/p&gt;

&lt;h2&gt;
  
  
  PostgreSQL: An Advanced Open-Source Option
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.postgresql.org/"&gt;PostgreSQL&lt;/a&gt; has gained significant traction in recent years due to its feature-richness and robustness. As an open-source DBMS, it offers a high degree of customization and scalability, making it a suitable choice for a wide range of applications. With support for advanced data types, spatial data, and full-text search capabilities, PostgreSQL continues to enhance its reputation as a powerful alternative.&lt;/p&gt;

&lt;h2&gt;
  
  
  MongoDB: A Leading NoSQL Database
&lt;/h2&gt;

&lt;p&gt;In the world of NoSQL databases, &lt;a href="https://www.mongodb.com/"&gt;MongoDB&lt;/a&gt; stands out as a popular choice. Its flexible document-based data model allows for easy integration of complex and evolving data structures. Moreover, MongoDB's horizontal scalability and automatic sharding make it an excellent option for applications that handle large amounts of data distributed across multiple servers.&lt;/p&gt;

&lt;h2&gt;
  
  
  PlanetScale: A MySQL-Compatible Cloud Database
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://planetscale.com/"&gt;PlanetScale&lt;/a&gt; is a MySQL-compatible database that offers scale, performance, and reliability, and many more powerful database features. Leveraging cloud-native architecture, PlanetScale enables organizations to deploy, manage, and scale MySQL-compatible databases with ease. With features such as automatic sharding, distributed transactions, and high availability, PlanetScale enables businesses to handle large volumes of data and support complex workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  SQLite: A Lightweight Database System
&lt;/h2&gt;

&lt;p&gt;Unlike the previous alternatives mentioned, &lt;a href="https://www.sqlite.org/"&gt;SQLite&lt;/a&gt; offers a different approach by delivering a lightweight, serverless, and embedded DBMS. Ideal for simple applications and mobile devices, SQLite provides a compact and efficient solution without the need for external dependencies or configurations.&lt;/p&gt;
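
&lt;p&gt;"Serverless" here means there is no separate database process at all: the engine runs inside your application, and the whole database is a single file (or memory). Python ships the driver in its standard library, so the following runs with no installation or configuration:&lt;/p&gt;

```python
import sqlite3

# No server to install or configure: the database lives in a single
# file (here, in memory) and the engine runs inside this process.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello from an embedded DB",))
rows = conn.execute("SELECT body FROM notes").fetchall()
print(rows)  # [('hello from an embedded DB',)]
```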


&lt;h2&gt;
  
  
  Key Factors to Consider When Choosing MySQL or an Alternative
&lt;/h2&gt;

&lt;p&gt;With numerous alternatives available, selecting the right DBMS warrants careful consideration of various factors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluating Your Database Needs
&lt;/h2&gt;

&lt;p&gt;Understanding your specific application requirements, including the volume of data, performance expectations, and scalability needs, will aid in selecting the most suitable alternative for your organization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Importance of Scalability
&lt;/h2&gt;

&lt;p&gt;As your business grows, the ability of your DBMS to scale seamlessly becomes crucial. An alternative that can handle increasing data loads and easily adapt to changing demands ensures long-term viability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Developer Experience
&lt;/h2&gt;

&lt;p&gt;Developers are the main users of databases, connecting applications to the database management system (DBMS). A developer-friendly interface can significantly streamline the development process, allowing developers to focus on building and refining their applications rather than wrestling with intricate database-related tasks.&lt;/p&gt;

&lt;p&gt;It's essential to consider the effort involved in designing data models, automating schema migrations, creating a scalable and maintainable DBMS interface, and constructing SQL queries for CRUD operations and other complex data-interaction queries. Factors such as version control and collaboration across team members and stages of the software development lifecycle (SDLC) must also be taken into account. Prioritizing developer experience not only ensures smoother development cycles and faster time-to-market but also preempts many complexities in production, ultimately leading to greater project success.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Features
&lt;/h2&gt;

&lt;p&gt;Data security remains a top priority for every organization. Consider options that offer robust security features such as encryption, authentication mechanisms, and access control to safeguard your critical information.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Role of Community and Support
&lt;/h2&gt;

&lt;p&gt;Considering the availability of a vibrant community and reliable support resources is essential for smooth operations. Active developer communities often lead to continuous improvements, bug fixes, and available documentation.&lt;/p&gt;


&lt;h2&gt;
  
  
  Managing Your Developers' Access
&lt;/h2&gt;

&lt;p&gt;Regardless of whether you are using MySQL or an alternative, every solution requires significant effort to create APIs your developers can use: knowing the specific query syntax, keeping track of changes to those APIs, testing and debugging them, and monitoring their performance (e.g., timeouts and response times).&lt;/p&gt;

&lt;p&gt;Neurelo gives you a cloud data API platform that is database agnostic. Neurelo auto-detects your schema and auto-generates your APIs. It uses AI that understands your schema to create APIs for complex custom queries in minutes and provides robust version control and observability capabilities. &lt;/p&gt;

&lt;p&gt;With Neurelo, you can give developers streamlined access to your database in under a minute. Try it yourself: &lt;a href="https://dashboard.neurelo.com/register"&gt;https://dashboard.neurelo.com/register&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This content was originally posted at &lt;a href="https://www.neurelo.com/post/mqsql-or-top-alternatives"&gt;https://www.neurelo.com/post/mqsql-or-top-alternatives&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>mysql</category>
      <category>mongodb</category>
      <category>postgres</category>
    </item>
  </channel>
</rss>
