<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sam J</title>
    <description>The latest articles on Forem by Sam J (@samj).</description>
    <link>https://forem.com/samj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F401366%2F7cc8ceb9-2e88-40f0-8837-7697b157a99a.png</url>
      <title>Forem: Sam J</title>
      <link>https://forem.com/samj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/samj"/>
    <language>en</language>
    <item>
      <title>HarperDB’s New Upsert Feature</title>
      <dc:creator>Sam J</dc:creator>
      <pubDate>Mon, 14 Dec 2020 18:06:01 +0000</pubDate>
      <link>https://forem.com/harperfast/harperdb-s-new-upsert-feature-l5l</link>
      <guid>https://forem.com/harperfast/harperdb-s-new-upsert-feature-l5l</guid>
      <description>&lt;p&gt;In our new &lt;a href="https://harperdb.io/developers/release-notes/" rel="noopener noreferrer"&gt;2.3.0 release&lt;/a&gt;, we included an often requested NoSQL upsert operation to HarperDB*.  This new hybrid operation will insert new records, if they do not exist, or update them, if they do.&lt;/p&gt;

&lt;p&gt;This new feature can be used in two ways via HarperDB’s API - as a standalone NoSQL &lt;code&gt;operation&lt;/code&gt; or as the &lt;code&gt;action&lt;/code&gt; for a bulk load operation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;*This new operation is only available in HDB instances utilizing an LMDB data store. While the File System (FS) data store is still configurable and supported in HDB, some new or more advanced features may not be implemented for FS moving forward.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  NoSQL Upsert Operation
&lt;/h2&gt;

&lt;p&gt;As noted above, HarperDB users can now call an upsert operation via our API to insert new records and/or update existing ones.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;new record&lt;/strong&gt; (to be inserted) is one that either does not include a hash value or includes a hash value that does not already exist in the table being upserted to.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;existing record&lt;/strong&gt; (to be updated) is identified by a hash value already present in the table and will be updated with the attribute values included in that record’s JSON - &lt;em&gt;i.e. as with update, any attributes not included in the record’s JSON will NOT be updated.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Example NoSQL Upsert Operation
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Example Request&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"operation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"upsert"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dev"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dog"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"records"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;            
                &lt;/span&gt;&lt;span class="nl"&gt;"nickname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Sammy"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Harper"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"nickname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Good boy!"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"breed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Mutt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="nl"&gt;"weight_lbs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;155&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Example Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"upserted 2 of 2 records"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"upserted_hashes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="s2"&gt;"6bca9762-ad06-40bd-8ac8-299c920d0aad"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In the above example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The existing record with hash value equal to &lt;code&gt;1&lt;/code&gt; will have its nickname attribute updated to “Sammy”, and all other attribute values for that record will remain untouched. 
&lt;em&gt;Note: if there were no record with &lt;code&gt;id&lt;/code&gt; equal to &lt;code&gt;1&lt;/code&gt;, a new record would be inserted with the provided nickname value.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The new record will be inserted as written and with a system generated hash value.  &lt;em&gt;If a new, unused hash value had been provided for this record, we would have used that hash value when inserting the new record.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
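&lt;p&gt;From Node, sending an upsert like the one above is a single POST to the instance’s operations API. The sketch below is illustrative only - the URL, port, and credentials are placeholders for your own instance, and the helper names are mine, not part of HarperDB.&lt;/p&gt;

```javascript
// Hypothetical helpers for calling the HarperDB operations API from Node 18+
// (global fetch). URL, port, and credentials are placeholders - adjust for
// your own instance.
function buildUpsert(schema, table, records) {
  // Records with an existing hash value are updated; the rest are inserted.
  return { operation: 'upsert', schema, table, records };
}

async function postOperation(body) {
  const res = await fetch('http://localhost:9925', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: 'Basic ' + Buffer.from('admin:password').toString('base64'),
    },
    body: JSON.stringify(body),
  });
  return res.json();
}

// Example (requires a running instance):
// postOperation(buildUpsert('dev', 'dog', [{ id: 1, nickname: 'Sammy' }]))
//   .then((result) => console.log(result.upserted_hashes));
```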

&lt;h2&gt;
  
  
  NoSQL Bulk Load Upsert
&lt;/h2&gt;

&lt;p&gt;Similar to our NoSQL &lt;code&gt;insert&lt;/code&gt; and &lt;code&gt;update&lt;/code&gt; operations, &lt;code&gt;upsert&lt;/code&gt; can now also be specified as the &lt;code&gt;action&lt;/code&gt; on a bulk load API operation. This tells the bulk load job to run an upsert operation on the large data set provided.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bulk Load w/ Upsert Action
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Request&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"operation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"csv_url_load"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"upsert"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"dev"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"dogs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"csv_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"https://s3.amazonaws.com/data/dogs.csv"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Starting job with id e047424c-5518-402f-9bd4-998535b65336"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Example Response from &lt;code&gt;get_job&lt;/code&gt; operation for bulk load&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"__createdtime__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1607897781553&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"__updatedtime__"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1607897784027&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"created_datetime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1607897781549&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"end_datetime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1607897784026&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"e047424c-5518-402f-9bd4-998535b65336"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"job_body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"successfully loaded 348 of 348 records"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"start_datetime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1607897781562&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"COMPLETE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"csv_url_load"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"admin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"start_datetime_converted"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2020-12-13T22:16:21.562Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"end_datetime_converted"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2020-12-13T22:16:24.026Z"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In the above example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;csv_url_load&lt;/code&gt; bulk load operation is started using the linked data set.  All records included in the linked data will be upserted into the table identified using the logic described above.&lt;/li&gt;
&lt;li&gt;Hitting the &lt;code&gt;get_job&lt;/code&gt; endpoint with the job id will provide you with an updated status of the bulk load job and, when complete, confirm the number of records upserted from the linked data set.&lt;/li&gt;
&lt;/ul&gt;
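&lt;p&gt;Since bulk loads run as background jobs, a small polling loop around &lt;code&gt;get_job&lt;/code&gt; is handy. A hedged sketch - &lt;code&gt;post&lt;/code&gt; stands in for whatever helper sends an operation to your instance and parses the JSON response:&lt;/p&gt;

```javascript
// Poll get_job until the bulk load job reaches a terminal status.
// 'COMPLETE' comes from the example response above; 'ERROR' is assumed
// here as the failure state - check your instance's actual statuses.
function isJobFinished(job) {
  return job.status === 'COMPLETE' || job.status === 'ERROR';
}

async function waitForJob(post, jobId, intervalMs = 1000) {
  for (;;) {
    const [job] = await post({ operation: 'get_job', id: jobId });
    if (isJobFinished(job)) return job;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```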

&lt;h2&gt;
  
  
  A Note RE: Clustering
&lt;/h2&gt;

&lt;p&gt;As with other database operations in HarperDB like &lt;code&gt;insert&lt;/code&gt;, &lt;code&gt;update&lt;/code&gt;, &lt;code&gt;csv_file_load&lt;/code&gt;, etc., an &lt;code&gt;upsert&lt;/code&gt; operation on a table on a specific node will be distributed to the other nodes subscribed to changes on that table.&lt;/p&gt;

&lt;p&gt;A few things to keep in mind when thinking through how this will play out for your clustering architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In a scenario where you are upserting new records without hash values provided, the system generated hashes will be included in the transaction payload that is shipped to connected nodes - &lt;em&gt;i.e. the auto-generated hashes for the new records will be mirrored on connected nodes.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In a clustered architecture, it is important to take a moment to consider the best NoSQL operation to use in each situation. While it may seem easy to simply use upsert even when you only intend to insert or update records, that strategy can have unintended consequences for your data integrity.&lt;br&gt;&lt;br&gt;
For example, in a scenario where you have provided the hash values for upsert records, the upsert transaction will do one of the following on any connected nodes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;If no matching hash value is found on the subscribing table, a new record will be inserted on the table even if the operation on the publishing node was an update on the record&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;If there is a hash value match on the subscribing table, the record will be updated even if the operation on the publishing node was a record insert&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;To be specific, in some scenarios, using upsert could cause hash values for what you consider to be the same record to become out of sync across the cluster.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In some cases this may not diminish the overall value or use of your data cluster - it could even be the preferred outcome! - but in others your data may be affected negatively, so think through your use case carefully. Being explicit about the operation you want to transact will also make reviewing and understanding the &lt;a href="https://harperdb.io/developers/documentation/reference/transaction-log/" rel="noopener noreferrer"&gt;transaction logs&lt;/a&gt; on your clustered nodes easier in the case where an issue arises and a rollback/fix is needed.&lt;br&gt;
&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Happy upserting!&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Do you have a new feature idea for HarperDB?&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Our &lt;a href="https://feedback.harperdb.io/" rel="noopener noreferrer"&gt;Feedback Board&lt;/a&gt; is a great place to vote and leave product suggestions, and you can always connect with our team in the community &lt;a href="https://join.slack.com/t/harperdbcommunity/shared_invite/zt-e8w6u1pu-2UFAXl_f4ZHo7F7DVkHIDA" rel="noopener noreferrer"&gt;Slack Channel&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>node</category>
      <category>nosql</category>
      <category>harperdb</category>
    </item>
    <item>
      <title>HarperDB's New Approach to Role Permissions</title>
      <dc:creator>Sam J</dc:creator>
      <pubDate>Thu, 27 Aug 2020 15:59:40 +0000</pubDate>
      <link>https://forem.com/harperfast/harperdb-s-new-approach-to-role-permission-27nd</link>
      <guid>https://forem.com/harperfast/harperdb-s-new-approach-to-role-permission-27nd</guid>
      <description>&lt;p&gt;In our 2.2.0 version release this week, we made major changes to the way role permissions are managed and used in HarperDB.  &lt;/p&gt;

&lt;p&gt;Prior to this release, we used permissions as a way to explicitly identify schema items to restrict role access to.  In our new release, we have flipped that paradigm and now use permissions as a way to explicitly identify schemas and tables to grant role access to.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There are two important things to call out at a high-level about what these changes mean:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Any schema and/or table that does not have CRUD permissions explicitly defined for a role will not be accessible for that role.&lt;/li&gt;
&lt;li&gt;If a role does NOT have specific attribute permissions set on a table, all attribute permissions will mirror the table’s.  If there are attribute permissions set, all other attributes will be fully locked down.&lt;/li&gt;
&lt;/ol&gt;
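&lt;p&gt;Those two rules can be modeled as a simple read-permission check. This is an illustrative model of the semantics only - the function and field names are assumptions loosely following the permission JSON shape, not HarperDB source:&lt;/p&gt;

```javascript
// Illustrative model of the two rules above - not HarperDB source.
// Field names (tables, attribute_permissions, etc.) are assumptions.
function canRead(permission, schema, table, attribute) {
  if (permission.super_user) return true;
  const schemaPerms = permission[schema];
  const tablePerms = schemaPerms ? schemaPerms.tables[table] : undefined;
  // Rule 1: no explicit CRUD permissions means no access at all.
  if (!tablePerms) return false;
  const attrPerms = tablePerms.attribute_permissions || [];
  // Rule 2a: no attribute permissions set - mirror the table's.
  if (attrPerms.length === 0) return tablePerms.read === true;
  // Rule 2b: attribute permissions set - unlisted attributes locked down.
  const attr = attrPerms.find((a) => a.attribute_name === attribute);
  return attr ? attr.read === true : false;
}
```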

&lt;h2&gt;
  
  
  More Details about HarperDB Role Permissions
&lt;/h2&gt;

&lt;p&gt;HarperDB utilizes a Role-Based Access Control (RBAC) framework to manage access to HarperDB instances.  A user is assigned a role that determines the user's permissions to access database resources and run core operations.&lt;/p&gt;

&lt;p&gt;Role permissions in HarperDB are broken into two categories - permissions around database manipulation and permissions around database definition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database Manipulation&lt;/strong&gt;:  A role defines CRUD permissions against database resources (i.e. data) in a HarperDB instance. Roles not assigned super_user permissions will only have the schema CRUD access explicitly defined within their role’s permissions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database Definition&lt;/strong&gt;:  Permissions related to managing schemas, tables, roles, users, and other system settings and operations are restricted to the built-in super_user role. &lt;/p&gt;

&lt;h2&gt;
  
  
  Built-In Roles
&lt;/h2&gt;

&lt;p&gt;There are two built-in roles within HarperDB. See a full breakdown of operations restricted to super_user roles &lt;a href="https://harperdbhelp.zendesk.com/hc/en-us/articles/360053414613" rel="noopener noreferrer"&gt;in our docs&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;super_user&lt;/strong&gt;&lt;/em&gt; - this role provides full access to all operations and methods within a HarperDB instance; it can be considered the admin role.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This role provides full access to all Database Definition operations and the ability to run Database Manipulation operations across the entire database schema with no restrictions.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;strong&gt;cluster_user&lt;/strong&gt;&lt;/em&gt; - this is an internal system role, managed by HarperDB itself, that allows clustered instances to communicate with one another.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  User-Defined Roles
&lt;/h2&gt;

&lt;p&gt;In addition to the built-in roles above, admins (i.e. users assigned to the super_user role) can create customized roles for other users to interact with and manipulate the data within explicitly defined tables and attributes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Unless the user-defined role is given "super_user" permissions, permissions must be defined explicitly within the request body JSON.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Describe operations will return metadata for all schemas, tables, and attributes that a user-defined role has CRUD permissions for.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
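&lt;p&gt;As a concrete sketch, an &lt;code&gt;add_role&lt;/code&gt; request body might look like the following. Treat the role, schema, table, and attribute names as made up for illustration, and verify the exact shape against our docs:&lt;/p&gt;

```javascript
// Hedged sketch of an add_role request body. All names are illustrative;
// check the documentation for the authoritative permission JSON shape.
const addRole = {
  operation: 'add_role',
  role: 'developer',
  permission: {
    super_user: false,
    dev: {
      tables: {
        dog: {
          read: true,
          insert: true,
          update: true,
          delete: false,
          // Listing any attribute locks down all unlisted attributes.
          attribute_permissions: [
            { attribute_name: 'nickname', read: true, insert: true, update: true },
          ],
        },
      },
    },
  },
};

console.log(JSON.stringify(addRole, null, 2));
```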

&lt;p&gt;&lt;em&gt;&lt;strong&gt;More information and details about how to effectively create and manage role permissions in our new paradigm can be found &lt;a href="https://harperdbhelp.zendesk.com/hc/en-us/articles/360051486534" rel="noopener noreferrer"&gt;in our docs&lt;/a&gt;.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Coming Over the Horizon
&lt;/h2&gt;

&lt;p&gt;We are always working to make improvements to the way HarperDB allows our users to easily and effectively create and manage their database instances.  One additional enhancement we are looking to implement for role permissions soon is to allow user-defined, non-super_user roles to be assigned access to specific database definition and/or manipulation operations that are currently restricted to super_user roles.&lt;/p&gt;

&lt;p&gt;Allowing a more ad hoc approach to assigning permissions on an operation-specific level, in addition to the existing schema-level permissions, will enable administrators to customize roles more effectively for their individual use cases.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Do you have requests for HarperDB’s role permission functionality or another feature/functionality?  Be sure to post it on our &lt;a href="https://harperdb.featureupvote.com/" rel="noopener noreferrer"&gt;Feature Request board&lt;/a&gt;!&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;HarperDB is a distributed database focused on making data management simple. It has an easy to use REST API and supports NoSQL and SQL (including joins). Sign up for free and have your new HarperDB instance up and running in minutes &lt;a href="https://studio.harperdb.io/sign-up?code=SamHDB" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>database</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Performance Testing Javascript &amp; Node with Benchmark.js</title>
      <dc:creator>Sam J</dc:creator>
      <pubDate>Wed, 03 Jun 2020 20:58:51 +0000</pubDate>
      <link>https://forem.com/harperfast/performance-testing-javascript-node-with-benchmark-js-4g1f</link>
      <guid>https://forem.com/harperfast/performance-testing-javascript-node-with-benchmark-js-4g1f</guid>
      <description>&lt;p&gt;At HarperDB, we’re working to build the best distributed database solution from the edge to the cloud. As a software developer on the team, I spend most of my time thinking about how to increase the stability and speed of our codebase – ideally, any work I’m doing achieves both of these priorities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Benchmark.js to Test Functions in Node
&lt;/h2&gt;

&lt;p&gt;The purpose of this post is to share one way I use &lt;a href="https://benchmarkjs.com" rel="noopener noreferrer"&gt;Benchmark.js&lt;/a&gt; as a framework for quickly testing the most performant way to complete an operation in our code.  In our upcoming release slated for late October, we spent a lot of time pulling out our file system code and putting it behind a data layer facade.  This allowed me many opportunities to look at more performant options for things both big and small.&lt;/p&gt;

&lt;p&gt;The example I’ve chosen to use below is a simple one I created when working through new ways to strip the .hdb file extension from the hash values we retrieve when searching for data in the file system.  You can learn more about how we use FS in our patented data model in my last blog post &lt;a href="https://www.harperdb.io/blog/building-an-edge-database-for-iot" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting Up a Benchmark Project
&lt;/h3&gt;

&lt;p&gt;In order to make this as easy as possible, I have a project saved locally that allows me to quickly setup a performance test. I’ve created a &lt;a href="https://github.com/sljohnsondev/performance_playground" rel="noopener noreferrer"&gt;sample repo on Github&lt;/a&gt; to give you an idea of what this looks like.&lt;/p&gt;

&lt;p&gt;When I’m looking to test a new way to complete an operation in the code I’m writing, I create a new directory with performance-test and test-methods files (or overwrite existing ones) in the “performance-playground” project I have saved locally.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;In the &lt;code&gt;test-methods&lt;/code&gt; file, I write up the different functions I am looking to test with a descriptive function name.&lt;/strong&gt; Usually, I include the initial method as a reference point. If I’m working on refactoring only a small part of a larger function, I will break it out to ensure I’m only testing the specific operation I’m thinking about/working on.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmjyptk4o7z4d09qeeks8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmjyptk4o7z4d09qeeks8.png" alt="Alt Text" width="681" height="643"&gt;&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Once that’s done, I build out the performance-test to run each of the methods I’m testing with the same data I’ve manually set directly in the module, or a larger data set I’ve built out in a loop like the &lt;code&gt;create_test_array&lt;/code&gt; method above.&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F64r69zpu7pv1efxdt27f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F64r69zpu7pv1efxdt27f.png" alt="Alt Text" width="706" height="642"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Once I’ve got my test set up, I can run the performance test in my terminal with &lt;code&gt;npm test&lt;/code&gt; or by manually running the module in WebStorm.&lt;/strong&gt; I get the following results…&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffrw79kvfc3vrd3vij7qa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffrw79kvfc3vrd3vij7qa.png" alt="Alt Text" width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
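&lt;p&gt;For readers who can’t make out the screenshots, the shape of the comparison is roughly the following. Benchmark.js adds warm-up, statistical sampling, and ops/sec reporting on top of this idea; the sketch below is a dependency-free stand-in, and the function names are mine, not the ones in the sample repo:&lt;/p&gt;

```javascript
// Dependency-free stand-in for the Benchmark.js comparison shown above.
// Function names are illustrative, not the ones in the sample repo.
function stripWithSlice(fileName) {
  return fileName.slice(0, -4); // drop the trailing '.hdb'
}

function stripWithReplace(fileName) {
  return fileName.replace('.hdb', '');
}

function stripWithSplit(fileName) {
  return fileName.split('.hdb')[0];
}

// Crude timing loop; Benchmark.js does this with proper sampling.
function timeIt(label, fn, iterations = 1e6) {
  const start = process.hrtime.bigint();
  for (let i = iterations; i > 0; i--) {
    fn('87c63e47-431a-4cf6-b46b-e604c61ca0d9.hdb');
  }
  const ns = Number(process.hrtime.bigint() - start);
  console.log(label + ': ' + (ns / 1e6).toFixed(1) + ' ms for ' + iterations + ' calls');
}

[stripWithSlice, stripWithReplace, stripWithSplit].forEach((fn) => timeIt(fn.name, fn));
```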

&lt;h3&gt;
  
  
  Evaluating the Benchmark Results
&lt;/h3&gt;

&lt;p&gt;The above test clearly shows that &lt;code&gt;slice()&lt;/code&gt; is the most performant way to remove the &lt;code&gt;.hdb&lt;/code&gt; file extension from a string. With a clear direction to go, I would normally start thinking about other aspects of the method I’m working on and whether there are other ways to tune for performance, but in this instance, updating the method to use &lt;code&gt;map()&lt;/code&gt; and &lt;code&gt;slice()&lt;/code&gt; will provide a big performance improvement over the existing method.&lt;/p&gt;

&lt;p&gt;While this example is simple, I think it provides a clear, easy-to-use framework for quickly testing different theories around the most performant way to code an operation in JavaScript. There are numerous ways this can be built out to test more robust functions and also with asynchronous methods in Node – e.g. I’ve used this to test different ways of using the async methods in the FS module.&lt;/p&gt;

</description>
      <category>node</category>
      <category>testing</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
