<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jens Goldhammer</title>
    <description>The latest articles on Forem by Jens Goldhammer (@jenswr).</description>
    <link>https://forem.com/jenswr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F312232%2Fa9d844c1-f44a-450c-8514-663f7e988cd8.jpeg</url>
      <title>Forem: Jens Goldhammer</title>
      <link>https://forem.com/jenswr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jenswr"/>
    <language>en</language>
    <item>
      <title>Azure CosmosDB — why technology choices matter</title>
      <dc:creator>Jens Goldhammer</dc:creator>
      <pubDate>Wed, 09 Aug 2023 05:48:03 +0000</pubDate>
      <link>https://forem.com/fmegroup/azure-cosmosdb-why-technology-choices-matter-14j1</link>
      <guid>https://forem.com/fmegroup/azure-cosmosdb-why-technology-choices-matter-14j1</guid>
      <description>&lt;p&gt;Some months ago, my colleague Florian and me joined a development team of one of our clients. We are involved as architects and engineers of the application used in their retail stores.&lt;/p&gt;

&lt;p&gt;The client currently migrates the core software from runni ng decentralized in each retail store (with its own databases) to a central solution. They heavily invested in Microsoft Azure as a Cloud provider and are moving more and more workloads to the Azure Cloud.&lt;/p&gt;

&lt;p&gt;The client currently uses MSSQL databases in combination with the open-source Firebird database and has started to migrate data into the cloud. They have decided to use Cosmos DB as standard database for all new services in the cloud some years ago as it was the cheapest choice from their point of view.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2JCNcOiz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/obkfadxlcenxqeeczsc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2JCNcOiz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/obkfadxlcenxqeeczsc2.png" alt="Image description" width="423" height="175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;What is Azure Cosmos DB?&lt;/h1&gt;

&lt;p&gt;Azure Cosmos DB is Microsoft's solution for fast NoSQL databases. For those who live in the AWS (Amazon Web Services) world, Cosmos DB is comparable to DynamoDB. You can learn more about Cosmos DB here: &lt;a href="https://azure.microsoft.com/en-us/products/cosmos-db"&gt;https://azure.microsoft.com/en-us/products/cosmos-db&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j1139z56--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cz3eorhfltnxmb5oxx06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j1139z56--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cz3eorhfltnxmb5oxx06.png" alt="Image description" width="800" height="429"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://azure.microsoft.com/en-us/products/cosmos-db"&gt;https://azure.microsoft.com/en-us/products/cosmos-db&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When working with Cosmos DB, you have to forget many of the things you learned in the relational database world. To design a good data model, you need to model around your future access patterns, because the performance of Cosmos DB depends on its partitions. You therefore must put more effort into data modelling upfront. You can find more about this here: &lt;a href="https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data"&gt;https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cosmos DB instances can be created on demand and used from many programming languages.&lt;/p&gt;

&lt;p&gt;What sets Cosmos DB apart from traditional relational databases is the worldwide distribution of the stored data, the on-demand scalability and the effortless way to get data out of it. Thanks to its guaranteed low response times, Cosmos DB suits web, mobile, gaming and IoT use cases that handle many reads and writes.&lt;/p&gt;

&lt;p&gt;Further use cases can be found here: &lt;a href="https://learn.microsoft.com/en/azure/cosmos-db/use-cases"&gt;https://learn.microsoft.com/en/azure/cosmos-db/use-cases&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having joined the project as an architect and engineer, I was critical of using Azure Cosmos DB from the start, as I am a big fan of relational databases, especially for transactional data. My Cosmos DB journey began with writing a centralized device service to store clients’ purchased devices. We have used Azure Functions to implement the business logic on top of Azure Cosmos DB to retrieve and store the data.&lt;/p&gt;

&lt;h1&gt;Structure of Azure Cosmos DB&lt;/h1&gt;

&lt;p&gt;Microsoft allows its customers to create several Cosmos DB instances in one Azure tenant; you can compare this to a database holding several tables. These instances can be used to separate workloads for different teams, stages or use cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cr1S4Sfl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p4q0qfsc1yfudomsgj0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cr1S4Sfl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p4q0qfsc1yfudomsgj0w.png" alt="Image description" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, in the document database world you can model your data without a fixed schema. You can create your own JSON structures, which keeps things flexible. Often the idea is to combine different data into one item to allow fast reads. To reference data in other domains, you can use unique identifiers, like the property id in the customer object.&lt;/p&gt;
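&lt;p&gt;To make this concrete, here is a small Python sketch of such a data model. The document shapes and property names are purely illustrative, not the project's real schema:&lt;/p&gt;

```python
import json

# Hypothetical document shapes: a customer item embeds fields that are
# usually read together, and references its devices by id only, since the
# devices live in their own container.
customer = {
    "id": "customer-4711",
    "name": "Jane Doe",
    "address": {"city": "Berlin", "zip": "10115"},  # embedded: read together
    "deviceIds": ["device-1", "device-2"],          # referenced: own container
}

device = {
    "id": "device-1",
    "customerId": "customer-4711",  # back-reference for per-customer queries
    "model": "Scanner X200",
}

# Cosmos DB stores items as schemaless JSON, so serialization is all it takes.
doc = json.loads(json.dumps(customer))
```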

&lt;p&gt;You can find more about the structure of Azure Cosmos DB here: &lt;a href="https://learn.microsoft.com/en-us/azure/cosmos-db/account-databases-containers-items"&gt;https://learn.microsoft.com/en-us/azure/cosmos-db/account-databases-containers-items&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Accessing data&lt;/h1&gt;

&lt;p&gt;Azure provides multiple ways to query data from Azure Cosmos DB.&lt;/p&gt;

&lt;p&gt;The following interfaces are possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  NoSQL API&lt;/li&gt;
&lt;li&gt;  MongoDB API&lt;/li&gt;
&lt;li&gt;  Cassandra API&lt;/li&gt;
&lt;li&gt;  Gremlin API&lt;/li&gt;
&lt;li&gt;  Table API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An overview can be found here: &lt;a href="https://learn.microsoft.com/en-us/azure/cosmos-db/choose-api"&gt;https://learn.microsoft.com/en-us/azure/cosmos-db/choose-api&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We started with the NoSQL API and its SQL-like interface in our project. Coming from SQL-based relational databases, it was an easier migration path than the other interfaces. To access the data, you can also use the Data Explorer in the Azure Portal, which lets you browse your collections and query and manipulate data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SNnqWtuc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ad97nq1b88pvatz08ms1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SNnqWtuc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ad97nq1b88pvatz08ms1.png" alt="Image description" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Migration / Import mass data&lt;/h1&gt;

&lt;p&gt;Cosmos DB lets you import data flexibly via its different APIs.&lt;br&gt;
There are two options at the moment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Import via the Azure Data Factory service, which can be used out of the box&lt;/li&gt;
&lt;li&gt;  Import via a custom CLI tool that uses the Cosmos DB API; this tool needs to be developed first&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One way to import mass data is Azure Data Factory, which lets you build pipelines that map data from various sources and import it into a Cosmos DB collection. We have used this mechanism a lot to transfer data from on-premises relational databases into the cloud and migrate it via pipelines into Cosmos DB collections.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2BG2Ul3i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jcxu1at219jd051tqqk0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2BG2Ul3i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jcxu1at219jd051tqqk0.png" alt="Image description" width="800" height="435"&gt;&lt;/a&gt;&lt;br&gt;
Source: Azure Portal example pipeline&lt;/p&gt;

&lt;p&gt;Azure Data Factory works quite well, is fast and very flexible, but it has its own drawbacks and challenges. Unfortunately, that topic is a subject in itself, so we cannot go into more detail here.&lt;/p&gt;

&lt;p&gt;In the past we have also written our own CLI tools; they are more flexible for the data mapping and can be reviewed more easily by other team members. Using bulk import with parallel threads against the Cosmos API, you can be as fast as importing the data via Azure Data Factory.&lt;/p&gt;

&lt;p&gt;You can find a list of available SDKs here: &lt;a href="https://developer.azurecosmosdb.com/community/sdk"&gt;https://developer.azurecosmosdb.com/community/sdk&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Querying data&lt;/h1&gt;

&lt;p&gt;Azure Cosmos DB provides several APIs to retrieve data from its containers. We decided to use the SQL interface to retrieve data in our Azure Functions.&lt;/p&gt;

&lt;p&gt;For example, you can take this SQL-like query to select all devices of a customer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lOZkFA10--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sosl30hu1enrvoqqnqd1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lOZkFA10--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sosl30hu1enrvoqqnqd1.png" alt="Image description" width="800" height="105"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This looks familiar, right?&lt;/p&gt;
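&lt;p&gt;For readers who cannot see the screenshot, the query presumably resembles the sketch below; the container and field names are assumptions. Since the SQL is evaluated server-side by Cosmos DB, the filter is also simulated locally in Python:&lt;/p&gt;

```python
# Illustrative only: a Cosmos DB SQL-style query like the one in the
# screenshot (field names are assumptions, not the project's real schema).
QUERY = "SELECT * FROM devices d WHERE d.customerId = @customerId"

# Locally simulated data, since the SQL runs inside Cosmos DB itself.
devices = [
    {"id": "device-1", "customerId": "customer-4711"},
    {"id": "device-2", "customerId": "customer-0815"},
]

def select_devices_of_customer(items, customer_id):
    """Mimics the WHERE clause of the query above on plain dicts."""
    return [d for d in items if d["customerId"] == customer_id]

result = select_devices_of_customer(devices, "customer-4711")
```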

&lt;p&gt;After some time, you notice that the SQL capabilities are limited, as Azure Cosmos DB implements only a subset of the SQL specification:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Cosmos DB does not allow joining items from different collections; it only allows joining an item with itself, which means you need to read data from different collections separately and join it yourself. The documentation says you have to change your data model if you need joins.&lt;/li&gt;
&lt;li&gt;  Cosmos DB provides functions as well, but you may recognize only a few of them, and some behave completely differently from the SQL functions you know. You have to learn the Cosmos-specific syntax, as there is no standard for querying data in NoSQL databases.&lt;/li&gt;
&lt;li&gt;  Cosmos DB has limited capabilities for GROUP BY with a HAVING clause. Sometimes there are workarounds, sometimes not.&lt;/li&gt;
&lt;li&gt;  Cosmos DB supports LIMIT and OFFSET, but the implementation is very slow, so you should use continuation tokens instead. If you are interested in understanding why, read here: &lt;a href="https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/offset-limit"&gt;https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/offset-limit&lt;/a&gt; and &lt;a href="https://stackoverflow.com/questions/58771772/cosmos-db-paging-performance-with-offset-and-limit"&gt;https://stackoverflow.com/questions/58771772/cosmos-db-paging-performance-with-offset-and-limit&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
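&lt;p&gt;The idea behind continuation tokens can be sketched in a few lines of Python. This is a local simulation only; in the real SDKs the token is an opaque string returned with each page of query results:&lt;/p&gt;

```python
# Sketch of continuation-token paging: instead of re-scanning from the start
# as OFFSET/LIMIT does, each response carries a token that encodes where the
# next page begins. Here the "token" is simply the next start index.
def query_page(items, page_size, continuation=None):
    start = continuation or 0
    page = items[start:start + page_size]
    next_token = start + page_size if start + page_size < len(items) else None
    return page, next_token

items = [{"id": f"device-{i}"} for i in range(7)]

pages, token = [], None
while True:
    page, token = query_page(items, page_size=3, continuation=token)
    pages.append(page)
    if token is None:  # no token means we have reached the last page
        break
```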

&lt;p&gt;My experience was that you often find a good workaround or a completely Cosmos-specific way, but sometimes there is no solution at all, which is a little frustrating.&lt;/p&gt;

&lt;p&gt;Nevertheless, the most painful issue was that Cosmos DB often reports query errors only with the message “One of the input values is invalid.”, without any useful hint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--66Nacy9W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z7p6d6qeawgt7v7193gz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--66Nacy9W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z7p6d6qeawgt7v7193gz.png" alt="Image description" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, I made the mistake of putting a semicolon at the end of the query.&lt;/p&gt;

&lt;h1&gt;Updating data&lt;/h1&gt;

&lt;p&gt;Updating one or multiple rows in a relational database with one SQL statement is a common request for processing data.&lt;/p&gt;

&lt;p&gt;Azure Cosmos DB, in contrast, only allows updating exactly one item within a container, and even that requires multiple requests.&lt;/p&gt;

&lt;p&gt;The procedure looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Retrieve the whole document you want to update&lt;/li&gt;
&lt;li&gt;  Update the fields you want to update in your application code&lt;/li&gt;
&lt;li&gt;  Write back the whole document to Cosmos DB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Microsoft now provides a partial update API to update an item without reading it first. The syntax for updating data follows the JSON Patch standard (&lt;a href="https://jsonpatch.com"&gt;https://jsonpatch.com&lt;/a&gt;). This feature was in preview for a long time and is now generally available in Azure Cosmos DB.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/azure/cosmos-db/partial-document-update-getting-started?tabs=dotnet"&gt;https://learn.microsoft.com/en-us/azure/cosmos-db/partial-document-update-getting-started?tabs=dotnet&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Mass updates to a larger set of documents cannot be done with one SQL statement out of the box; you must update each document separately. This limitation is a little surprising when you want to evolve your document schema.&lt;/p&gt;

&lt;p&gt;Yes, you can write a tool based on the bulk API. But updating a lot of data this way is slow and involves much more effort than writing a single update query as in the relational world.&lt;/p&gt;

&lt;h1&gt;Deleting data&lt;/h1&gt;

&lt;p&gt;Deleting data in Azure Cosmos DB is generally possible by removing one item at a time via the API. Unfortunately, there is no support for SQL DELETE statements!&lt;/p&gt;

&lt;p&gt;In general, the limitations on mass operations in Cosmos DB have their reasons, for instance the guaranteed response times for any action in Cosmos DB; operations on a bigger set of data might lead to higher execution times.&lt;/p&gt;

&lt;p&gt;But this is indeed an annoying point while writing and testing your software. Sometimes you need to remove specific data very quickly. One workaround is to drop the whole container and create your test data again, but often you want to keep specific data in it.&lt;/p&gt;

&lt;p&gt;For example, we had to remove two million entries from a collection to repeat a migration, but wanted to keep other data in the collection. Using a tool we developed, this action took half an hour.&lt;/p&gt;
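&lt;p&gt;The loop such a tool has to run can be sketched as follows; the container is simulated with a dict, and a real tool would issue one delete request per item via the SDK:&lt;/p&gt;

```python
# Sketch of the item-by-item mass delete a custom tool has to perform, since
# Cosmos DB has no SQL DELETE. The container and documents are made up.
container = {f"migration-{i}": {"source": "migration"} for i in range(5)}
container["keep-1"] = {"source": "manual"}

def delete_where(store, predicate):
    """Find matching item ids, then remove them one by one."""
    doomed = [item_id for item_id, doc in store.items() if predicate(doc)]
    for item_id in doomed:  # one request per item against the real API
        del store[item_id]
    return len(doomed)

deleted = delete_where(container, lambda doc: doc["source"] == "migration")
```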

&lt;h1&gt;Transactions&lt;/h1&gt;

&lt;p&gt;Azure Cosmos DB provides a simple transactional concept: it allows grouping a set of operations into a batch operation. Unfortunately, the batch concept is not integrated nicely into the API, as it does not let you wrap your code into a transactional block the way interfaces to relational databases do.&lt;/p&gt;

&lt;p&gt;Additionally, transactions cannot update documents from different partitions, which is understandable from a technical point of view but very limiting. In our service we had the use case of updating several documents from several partitions at once, and in the end we had to live without transactions.&lt;/p&gt;

&lt;h1&gt;Ecosystem &amp;amp; Community&lt;/h1&gt;

&lt;p&gt;Starting with Cosmos DB, I was very surprised to find so few resources, articles and tools around the platform. But I quickly understood why: Cosmos DB is an exclusive, commercial service of Microsoft and is not as popular as Amazon DynamoDB, for example.&lt;/p&gt;

&lt;p&gt;Microsoft itself provides limited tooling only:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the Azure Portal with the Data Explorer (&lt;a href="https://cosmos.azure.com"&gt;https://cosmos.azure.com&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;a Visual Studio Code extension: &lt;a href="https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-cosmosdb"&gt;https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-cosmosdb&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;a Cosmos DB emulator which runs natively under Windows: &lt;a href="https://learn.microsoft.com/de-de/azure/cosmos-db/data-explorer"&gt;https://learn.microsoft.com/de-de/azure/cosmos-db/data-explorer&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The community has written some tooling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Cosmos DB Explorer for Windows: &lt;a href="https://github.com/sachabruttin/CosmosDbExplorer"&gt;https://github.com/sachabruttin/CosmosDbExplorer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CosmicClone to clone data from one container to another &lt;a href="https://github.com/microsoft/CosmicClone"&gt;https://github.com/microsoft/CosmicClone&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unfortunately, there is no big community around Cosmos DB like there is for PostgreSQL or MySQL/MariaDB. Also, most of the well-known database tools that support several vendors do not support Azure Cosmos DB, mostly because it works completely differently from relational databases.&lt;/p&gt;

&lt;h1&gt;Advanced topics&lt;/h1&gt;

&lt;p&gt;Additionally, Azure Cosmos DB allows the use of stored procedures. Wait, aren't stored procedures a thing of the last century? Why should we use them? You will probably notice that you need stored procedures for some scenarios, such as mass deletion of entries in a collection, as this is not supported out of the box.&lt;/p&gt;

&lt;p&gt;Stored procedures in Cosmos DB are written in JavaScript. As you may know, testing this kind of code is challenging, and besides that, most of the backend developers in our team are not familiar with JavaScript. Because of these challenges, we decided not to use stored procedures within the application, apart from administrative purposes.&lt;/p&gt;

&lt;p&gt;There are many advanced topics for Cosmos DB like scaling, partition keys etc. — these topics need their own blog post. You can read more about that in the official documentation: &lt;a href="https://learn.microsoft.com/en-us/azure/cosmos-db/"&gt;https://learn.microsoft.com/en-us/azure/cosmos-db/&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Summary&lt;/h1&gt;

&lt;p&gt;Using a document database is not a no-brainer. Document databases like Azure Cosmos DB are not a replacement for relational databases, and that was never the intention.&lt;/p&gt;

&lt;p&gt;Yes, Azure Cosmos DB has its use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If you have a “write once, read many” use case (for example, just storing data with a stable structure), you can use it.&lt;/li&gt;
&lt;li&gt;  If you need global distribution of your data, you probably need it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem is that business applications do not have these requirements very often. In my opinion, most applications do not need to scale this way (unless you are Amazon, Microsoft, Netflix or another global player).&lt;/p&gt;

&lt;p&gt;On the other hand, Azure Cosmos DB has some severe limitations when working with the data, especially if you want to evolve your schema. If you want to store relational data in Cosmos DB and that data changes a lot over time, Cosmos DB makes things very complicated and is currently not a good choice from my point of view.&lt;/p&gt;

&lt;p&gt;Besides these considerations, one task is very important right from the beginning: designing how to model and partition your data. But that is a story of its own.&lt;/p&gt;

&lt;h1&gt;Resources&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://medium.com/@yurexus/features-and-pitfalls-of-azure-cosmos-db-3b18c7831255"&gt;https://medium.com/@yurexus/features-and-pitfalls-of-azure-cosmos-db-3b18c7831255&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://mattruma.com/adventures-with-azure-cosmos-db-limit-query-rus/"&gt;https://mattruma.com/adventures-with-azure-cosmos-db-limit-query-rus/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://blog.scooletz.com/2019/06/04/Cosmos%20DB-and-its-limitations"&gt;https://blog.scooletz.com/2019/06/04/Cosmos DB-and-its-limitations&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>azure</category>
      <category>cloud</category>
      <category>cosmosdb</category>
    </item>
    <item>
      <title>Recommender systems based on AWS Personalize</title>
      <dc:creator>Jens Goldhammer</dc:creator>
      <pubDate>Tue, 27 Sep 2022 08:47:15 +0000</pubDate>
      <link>https://forem.com/fmegroup/recommender-systems-based-on-aws-personalize-1fjk</link>
      <guid>https://forem.com/fmegroup/recommender-systems-based-on-aws-personalize-1fjk</guid>
      <description>&lt;p&gt;With its Personalize service, AWS offers a complete solution for&lt;br&gt;
building and using recommendation systems in its own solutions. The&lt;br&gt;
service, which is now also offered in the Europe/Frankfurt region, has&lt;br&gt;
been available since 2019 and is constantly being improved. Only last&lt;br&gt;
year, major improvements in the area of filters were added to the&lt;br&gt;
product.&lt;/p&gt;

&lt;p&gt;AWS Personalize allows customers to create recommendations based on an&lt;br&gt;
ML model for platform or product users. The following activities are&lt;br&gt;
abstracted and made particularly easy by AWS Personalize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Import business data into AWS Personalize&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Continuous training of models with current data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Read recommendations with filtering capabilities from AWS Personalize&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might think that recommendation systems based on machine learning are old news and that you can do it all yourself anyway. Machine learning has been around for a few years now, and with it the possibility of developing such recommendation systems yourself.&lt;/p&gt;

&lt;p&gt;But the difference is: AWS Personalize takes the complete management of machine learning environments off the users' hands and lets you take your first steps very quickly. And we don't need the best ML experts on the team, because AWS Personalize handles many of the more complex issues for us. The challenges described below show why it is still good to understand machine learning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3QdDmmXp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nw7mhukj0qcj9pczz2ly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3QdDmmXp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nw7mhukj0qcj9pczz2ly.png" alt="" width="704" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Personalize makes it easier than ever to create recommendations. From a technical perspective, everything seems easy to master; the challenges lie mainly in clearly delineating the use case and in the meaningfulness of the recommendations.&lt;/p&gt;

&lt;p&gt;The recommendations should be delimited as clearly as possible for a use case. This influences both the selection of data and the structure of the data model and schema.&lt;/p&gt;

&lt;p&gt;A recommendation system lives and dies by the relevance and timeliness of the recommendations it displays. If, for example, I show a user recommendations that he already knows, that are two years old or that are not relevant at all, I lose his interest and trust. Recommendations are initially viewed critically and must therefore be convincing from the outset, even if this is of course partly subjective.&lt;/p&gt;

&lt;p&gt;Therefore, we need to keep the following in mind from the beginning:&lt;/p&gt;

&lt;p&gt;It is very important to constantly validate the recommendations created by AWS Personalize. At the start, it is important to validate the recommendations manually, i.e., to check randomly whether they appear meaningful to a user at all. It is therefore advisable to start with a recommendation system whose validity can be checked easily. In order to give a user recommendations that he or she does not yet know, you need to work a lot with recommendation filters, so that users' favourites or content that has already been seen do not appear again.&lt;/p&gt;

&lt;p&gt;Now how do we make Personalize create recommendations for us? There are a few steps to complete.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tqz4jHtZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jtoz7175f7qw4m6qr2qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tqz4jHtZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jtoz7175f7qw4m6qr2qr.png" alt="Image description" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;First, you should select a domain that best matches the use case (1).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the case of a user-defined use case, data models are defined afterwards. Importing your own data into Personalize is done once or continuously, based on the defined data models (2).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Amazon Personalize uses the imported data to train and provide recommendation models (3).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To query recommendations, both an HTTP-based real-time API for one user and batch jobs for multiple users can be integrated (4).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let's take a look at these data models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In addition to the e-commerce and video use cases, AWS Personalize offers the option of mapping your own use case (domain). The bottom line is that it is always about the following datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Users&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Items&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interactions of the user with these items&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YazgG5xH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lcqh7avr220qcj24l2uq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YazgG5xH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lcqh7avr220qcj24l2uq.png" alt="Image description" width="751" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These datasets form a dataset group and are used as a whole in Personalize. Crucial here are the interactions, which are necessary for most ML models and are used for training. A short example illustrates this data model:&lt;/p&gt;

&lt;p&gt;"A fme employee reads the blog post "AWS Partnership" on the social&lt;br&gt;
intranet and writes a comment below it."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Item&lt;/strong&gt;: Blogpost "AWS Partnership&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interactions&lt;/strong&gt;: read | comment&lt;/p&gt;

&lt;p&gt;For this data, a developer can define a custom schema: one schema each for Interactions, Users and Items.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zd7HJZhW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/625kujsrsl6lxm2zxim2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zd7HJZhW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/625kujsrsl6lxm2zxim2.png" alt="Image description" width="749" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following is an example schema for a user with 6 fields. These fields can later be used to get recommendations for content for specific users, e.g. users from a specific company or country.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6mejD3-o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ijwibmz8kabhivbpohp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6mejD3-o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ijwibmz8kabhivbpohp5.png" alt="Image description" width="426" height="620"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When importing data, this schema must be followed. All three datasets have mandatory attributes (e.g. ID) as well as additional attributes that help refine the ML model so that the recommendations become even more precise. The additional attributes can be textual or categorical, and they can also be used to filter recommendations.&lt;/p&gt;

&lt;p&gt;However, there are a few restrictions in modeling that you need to be aware of, such as the limit of 1000 characters per metadata field. This is especially important if you want to model lists of values.&lt;/p&gt;

&lt;p&gt;Further info can be found &lt;a href="https://docs.aws.amazon.com/personalize/latest/dg/custom-datasets-and-schemas.html"&gt;here&lt;/a&gt;.&lt;/p&gt;
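&lt;p&gt;To illustrate, here is a minimal sketch of such a schema in code. The field names beyond the mandatory USER_ID are purely illustrative, not the exact fields from the screenshot above:&lt;/p&gt;

```python
import json

# A hypothetical Users schema in the Avro format AWS Personalize expects.
# USER_ID is mandatory; COMPANY and COUNTRY are illustrative metadata fields.
# Categorical fields must be flagged with "categorical": True.
users_schema = {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "COMPANY", "type": "string", "categorical": True},
        {"name": "COUNTRY", "type": "string", "categorical": True},
    ],
    "version": "1.0",
}

schema_json = json.dumps(users_schema)

# The schema would then be registered with the service, e.g. via boto3:
# personalize = boto3.client("personalize")
# personalize.create_schema(name="users-schema", schema=schema_json)
```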

&lt;p&gt;&lt;strong&gt;Import data into Personalize&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The quality of recommendations is dependent on the data provided. But how does the data get into the system?&lt;/p&gt;

&lt;p&gt;Data is always imported into the so-called datasets (see above): there is exactly one dataset each for Users, Items, and Interactions. These datasets are combined in a dataset group.&lt;/p&gt;

&lt;p&gt;To be able to train the ML model, the datasets have to be imported initially (bulk import via S3). The data can also be updated continuously (via an API), which ensures that the model can be improved over time.&lt;/p&gt;

&lt;p&gt;When you start with AWS Personalize, you usually already have a lot of historical data in your own application. This is important because recommendations only "work" meaningfully once a certain amount of data is available (as with any ML application).&lt;/p&gt;

&lt;p&gt;Here it is recommended to use the bulk import APIs of AWS Personalize. For this, the data must first be stored in S3 in CSV format, following the previously defined schema. Then you can start import jobs (one per dataset) via the AWS Console, AWS CLI, AWS API, or AWS SDKs.&lt;/p&gt;
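&lt;p&gt;A minimal sketch of the CSV step: the column names follow the hypothetical Users schema from above, and the bucket, dataset ARN, and role ARN in the commented import call are placeholders:&lt;/p&gt;

```python
import csv
import io

# Serialize user records into the CSV layout the Users schema defines.
# The header row must match the schema field names.
def users_to_csv(users):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["USER_ID", "COMPANY", "COUNTRY"])
    writer.writeheader()
    writer.writerows(users)
    return buf.getvalue()

csv_body = users_to_csv([
    {"USER_ID": "u1", "COMPANY": "acme", "COUNTRY": "DE"},
    {"USER_ID": "u2", "COMPANY": "acme", "COUNTRY": "US"},
])

# The file would then be uploaded to S3 and imported, e.g. via boto3:
# personalize.create_dataset_import_job(
#     jobName="users-import",
#     datasetArn=users_dataset_arn,                      # assumed to exist
#     dataSource={"dataLocation": "s3://my-bucket/users.csv"},
#     roleArn=import_role_arn,                           # assumed to exist
# )
```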

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xuOOuBqa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o39ao0i5f62mrg744mpu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xuOOuBqa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o39ao0i5f62mrg744mpu.png" alt="Image description" width="743" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For continuous updating of the Users and Items datasets, AWS Personalize&lt;br&gt;
provides REST APIs that can be easily used with the AWS Client SDKs.&lt;/p&gt;

&lt;p&gt;A so-called event tracker can be used to update the interactions. Once created, this tracker can ingest a large number of events within a very short time via HTTP.&lt;/p&gt;
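&lt;p&gt;A sketch of what such an event payload could look like with the Python SDK; the tracking ID (which comes from a previously created event tracker), the event type, and all IDs are placeholders:&lt;/p&gt;

```python
import time

# Build a single interaction event for the Personalize event tracker.
# "TRACKING_ID" is a placeholder for the ID returned by create_event_tracker.
def build_click_event(user_id, session_id, item_id):
    return {
        "trackingId": "TRACKING_ID",
        "userId": user_id,
        "sessionId": session_id,
        "eventList": [
            {
                "eventType": "click",          # illustrative event type
                "itemId": item_id,
                "sentAt": int(time.time()),    # unix timestamp
            }
        ],
    }

payload = build_click_event("u1", "session-42", "item-7")

# The payload would be sent via the personalize-events client, e.g.:
# events = boto3.client("personalize-events")
# events.put_events(**payload)
```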

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qjrLxWfO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kqwv2fjjo484adx5bu52.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qjrLxWfO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kqwv2fjjo484adx5bu52.jpeg" alt="Image description" width="749" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Train models&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the initial data is imported, AWS Personalize can use this data, in the form of the dataset group, to train a model. To do this, you first create a Solution, which acts as a "folder" for models and sets the recipe that Personalize should use.&lt;/p&gt;

&lt;p&gt;The recipe represents the ML model, which is later trained (as a solution version) with user-defined data. Different types of recipes offer different types of recommendations. For example, &lt;em&gt;USER_PERSONALIZATION&lt;/em&gt; provides personalized recommendations (from all items), and &lt;em&gt;PERSONALIZED_RANKING&lt;/em&gt; provides a ranked list of items for a particular user. Some recipes use all three datasets, and some use only parts of them (e.g. SIMS does not need user data).&lt;/p&gt;

&lt;p&gt;After creating a solution, it can be trained with the current state of the datasets, resulting in a solution version. Depending on the amount of data, this can take a while; our tests showed runtimes of around 45 minutes. A solution version is the fully trained model that can be used directly for batch inference jobs or as the basis for a campaign, a real-time API for recommendations.&lt;/p&gt;
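&lt;p&gt;Sketched with the Python SDK, the flow from solution to solution version to campaign could look like this; names and ARNs are placeholders, while the recipe ARNs are the documented built-in ones:&lt;/p&gt;

```python
# Built-in recipe ARNs as documented for AWS Personalize.
RECIPES = {
    "USER_PERSONALIZATION": "arn:aws:personalize:::recipe/aws-user-personalization",
    "PERSONALIZED_RANKING": "arn:aws:personalize:::recipe/aws-personalized-ranking",
    "SIMS": "arn:aws:personalize:::recipe/aws-sims",
}

def recipe_arn(kind):
    """Look up the recipe ARN for a recommendation use case."""
    return RECIPES[kind]

# The training flow itself (ARNs assumed to exist):
# personalize = boto3.client("personalize")
# solution = personalize.create_solution(
#     name="my-solution",
#     datasetGroupArn=dataset_group_arn,
#     recipeArn=recipe_arn("USER_PERSONALIZATION"),
# )
# version = personalize.create_solution_version(
#     solutionArn=solution["solutionArn"]
# )
# campaign = personalize.create_campaign(
#     name="my-campaign",
#     solutionVersionArn=version["solutionVersionArn"],
#     minProvisionedTPS=1,
# )
```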

&lt;p&gt;&lt;strong&gt;Use recommendations in your own application&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now it's time to integrate recommendations into our own application. AWS provides a REST interface that allows us to retrieve recommendations from AWS Personalize in real time. This makes it easy to integrate with any system.&lt;/p&gt;

&lt;p&gt;Recommendations in AWS Personalize are always user-related. They can therefore look different for each user, but can also be the same for certain recipes, as in the case of "Popularity count".&lt;/p&gt;

&lt;p&gt;The response is a list of recommendations in the form of IDs of the recommended items, each with a score. The items are uniquely referenced via the ID.&lt;/p&gt;
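&lt;p&gt;A small sketch of this linking step: the response shape follows the get_recommendations API, while the item lookup table standing in for your own database is purely illustrative:&lt;/p&gt;

```python
# Map the itemList of a get_recommendations response back to content
# from your own database (here a plain dict stands in for that database).
def resolve_recommendations(response, items_by_id):
    results = []
    for entry in response["itemList"]:
        item = items_by_id.get(entry["itemId"])
        if item is not None:
            results.append({"item": item, "score": entry.get("score")})
    return results

# A real call would look like:
# runtime = boto3.client("personalize-runtime")
# response = runtime.get_recommendations(
#     campaignArn=campaign_arn, userId="u1", numResults=10
# )
sample_response = {"itemList": [{"itemId": "item-7", "score": 0.83}]}
resolved = resolve_recommendations(
    sample_response, {"item-7": {"title": "Some article"}}
)
```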

&lt;p&gt;These recommendations can now be evaluated in your own application, linked with the content from your own database, and then displayed to the user in a user interface. The performance of the query (at least for smaller amounts of data) is good enough to run it live. However, you can also consider caching the results per user for a while, so as not to have to query the service constantly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rq6FNKMW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/swywmoa5nfwh3677imse.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rq6FNKMW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/swywmoa5nfwh3677imse.png" alt="Image description" width="508" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you need recommendations for a large number of users, e.g. for mailings, batch jobs ( &lt;strong&gt;batch inference jobs&lt;/strong&gt; ) can efficiently create these recommendations in the background. These batch jobs are "fed" with the user IDs; the result is recommendations for each user within one big JSON file.&lt;/p&gt;
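&lt;p&gt;A sketch of parsing such a result file, assuming the documented JSON Lines output format (one record per user, pairing the input user with recommended item IDs and scores):&lt;/p&gt;

```python
import json

# Parse the JSON Lines output of a Personalize batch inference job
# into a dict mapping each user ID to its recommended item IDs.
def parse_batch_output(text):
    recs = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        user_id = record["input"]["userId"]
        recs[user_id] = record["output"]["recommendedItems"]
    return recs

# One illustrative output line as written by a batch inference job:
sample = (
    '{"input": {"userId": "u1"}, '
    '"output": {"recommendedItems": ["item-7", "item-3"], '
    '"scores": [0.8, 0.5]}}'
)
by_user = parse_batch_output(sample)
```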

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--z4w9YEEu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5oz7qudz18b9q4tbra7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--z4w9YEEu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5oz7qudz18b9q4tbra7c.png" alt="Example for a result of the batch inference&amp;lt;br&amp;gt;
jobs" width="529" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is Personalize worth the effort?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The pricing model of the service can be quite demanding, so it is advisable to define in advance the result you want to achieve with the recommendations and the resulting follow-up activities or repeat business.&lt;/p&gt;

&lt;p&gt;As a guide: for individual recommendations per user via Personalize batch jobs, we assume about 0.06 ct per recommendation per user. That doesn't sound like a lot, but with several hundred thousand users and individual recommendations, it becomes part of the overall consideration. Depending on how often and to what extent batch runs for mailings etc. take place, it can get expensive, and the instances AWS uses for batch runs are very large and very fast. For testing purposes, we created several batch jobs to mass-export recommendations for 200k users. The batch jobs ran overnight, and we incurred costs of several hundred euros; we had probably underestimated the numbers in the AWS Calculator a bit.&lt;/p&gt;

&lt;p&gt;If recommendations have a positive impact on the business and thus directly generate more sales for the customer, it can pay off very well. But what if my recommendations do not have a direct positive impact on my sales? One goal could be to retain customers more closely (subscription model); in the long term, this will in turn lead to more sales, but perhaps not in the short term.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS Personalize is a service that makes it very easy to get started with recommendation systems. As a development team, all you have to do is deliver the data in the right format and pick up the recommendations. It doesn't get much easier than that from a technical perspective.&lt;/p&gt;

&lt;p&gt;AWS Personalize can therefore be used well to extend existing systems&lt;br&gt;
without having to make deep changes. With the ability to create custom&lt;br&gt;
data models and tune the different ML algorithms, you can apply AWS&lt;br&gt;
Personalize to a wide variety of scenarios.&lt;/p&gt;

&lt;p&gt;The real work is in finding meaningful use cases, delineating them from&lt;br&gt;
one another, and providing the system with the right data.&lt;/p&gt;

&lt;p&gt;As always, this comes at a price. Is it worth it for you? Let's find out together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References and Links&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Below are a few more links to help dig deeper into the topic:&lt;/p&gt;

&lt;p&gt;Official documentation:&lt;br&gt;
&lt;a href="https://aws.amazon.com/de/personalize"&gt;[https://aws.amazon.com/de/personalize]{.underline}&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Blogposts:&lt;br&gt;
&lt;a href="https://aws.amazon.com/de/blogs/machine-learning/category/artificial-intelligence/amazon-personalize/"&gt;[https://aws.amazon.com/de/blogs/machine-learning/category/artificial-intelligence/amazon-personalize/]{.underline}&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Personalize Best Practices:&lt;br&gt;
&lt;a href="https://github.com/aws-samples/amazon-personalize-samples/blob/master/PersonalizeCheatSheet2.0.md"&gt;[https://github.com/aws-samples/amazon-personalize-samples/blob/master/PersonalizeCheatSheet2.0.md]{.underline}&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Efficiency of models:&lt;br&gt;
&lt;a href="https://aws.amazon.com/de/blogs/machine-learning/using-a-b-testing-to-measure-the-efficacy-of-recommendations-generated-by-amazon-personalize/"&gt;[https://aws.amazon.com/de/blogs/machine-learning/using-a-b-testing-to-measure-the-efficacy-of-recommendations-generated-by-amazon-personalize/]{.underline}&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Personalize Code Samples:&lt;br&gt;
&lt;a href="https://github.com/aws-samples/personalization-apis"&gt;[https://github.com/aws-samples/personalization-apis]{.underline}&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;br&gt;
&lt;a href="https://content.fme.de/en/blog/aws-personalize"&gt;https://content.fme.de&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
