Forem: Brian Fletcher

Improving Backstage performance (by up to 48x)

Brian Fletcher — Fri, 04 Oct 2024 07:57:19 +0000

Backstage is an excellent framework for building an internal developer portal. It provides all of the fundamental building blocks to improve developer experience in an organization.

Core to Backstage is its Catalog of entities. The Catalog provides a database of software components, resources, libraries, and other kinds of software items. It provides client code and an API backend to retrieve items in the catalog, along with the software interfaces required to populate the entity catalog. Its model is flexible, customizable, and powerful.

However, with great power comes great responsibility. Without experience, it's easy to develop anti-patterns in Backstage catalog usage. These anti-patterns can then turn into major performance issues at scale, which in turn affects trust in and usage of the product as a whole.

At Roadie, we provide an out-of-the-box version of Backstage for our customers. We have come across many of the ways in which non-optimal Catalog client usage can affect the performance of the application as a whole. We have seen these performance issues result in lagging page loads and, in extreme cases, cause page loads in Backstage to fail.

By applying the patterns explained in this post, you could see a huge improvement in Catalog response time. In some cases, you may even see Catalog queries perform 48x faster!

Architecture of the Entity catalog

The entity catalog in Backstage is made up of three components: a Catalog client, the Catalog backend, and the Catalog database. When Backstage starts up for the first time, it will have a Catalog database and a Catalog backend. When you visit the Backstage application in your browser, it will make use of the Catalog client to retrieve data from the Catalog backend. The Catalog backend in turn retrieves the requested catalog items from the Catalog database.

Using the Backstage Catalog Client

Soon after deploying Backstage in an organization, users will want to customize it.

Frequent customizations we come across include loading entities into the Catalog from an in-house platform or visualizing data from an internal system in the Backstage UI. Customization is normal and is a sign that Backstage is adding value for teams.

When developers write extensions to Backstage, it's likely they will come across the need to interact with the Catalog. There are two ways they can do this:

via a Frontend Backstage extension
via a Backend Backstage extension

To make use of the Catalog client in a frontend Backstage extension, you are likely to be using the useApi hook, along with a useAsync function.

import { catalogApiRef } from '@backstage/plugin-catalog-react';
import { useApi } from '@backstage/core-plugin-api';
import { stringifyEntityRef } from '@backstage/catalog-model';
import useAsync from 'react-use/lib/useAsync';

export const CustomReactComponent = () => {
  const catalogApi = useApi(catalogApiRef);
  const {
    value: entities,
  } = useAsync(async () => {
    const response = await catalogApi.getEntities();
    return response.items || [];
  }, [catalogApi]);

  // `entities` is undefined until the request resolves, so fall back to an empty list
  return (<>{(entities ?? []).map(stringifyEntityRef).join('\n')}</>)
}

In a backend Backstage extension, you are likely to be constructing the Catalog client using the discovery client. The discovery client is a helper that allows plugins to discover the API location of other plugins.

Generally, if you are writing a backend plugin, like a new REST API or a Catalog processor, you will have access to the discovery client. Depending on your particular situation, you may have access to the discovery client in a different way.

import { CatalogClient } from '@backstage/catalog-client';
import { DiscoveryApi } from '@backstage/core-plugin-api';

export const getAllEntities = async (discovery: DiscoveryApi) => { 
  const catalogApi = new CatalogClient({
    discoveryApi: discovery,
  });

  const response = await catalogApi.getEntities();
  return response.items;
}

You will notice that once you have an instance of the CatalogApi, it is used in the same way in either the frontend or a backend extension.

(await catalogClient.getEntities()).items;

Check the Backstage docs for a more comprehensive explanation of the full Catalog interface.

How big can a Backstage Catalog get?

When thinking about Catalog size, it’s useful to think about two things:

How big is each individual entity?
How many entities do you have?

A typical Entity

A typical Backstage Entity looks like this:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: artist-web
  description: The place to be, for great artists
  labels:
    example.com/custom: custom_label_value
  annotations:
    example.com/service-discovery: artistweb
    circleci.com/project-slug: github/example-org/artist-website
  tags:
    - java
  links:
    - url: https://admin.example-org.com
      title: Admin Dashboard
      icon: dashboard
      type: admin-dashboard
spec:
  type: website
  lifecycle: production
  owner: artist-relations-team
  system: public-websites

It describes a website called artist-web. It has a few basic Backstage entity properties, and some annotations, tags, and links. It's encoded in YAML here, but in Backstage's database, it is stored as plain text in JSON format.

Uncompressed, this entity definition is about half a kilobyte. Therefore, a Catalog containing about 20,000 similarly sized entities would add up to about 10 megabytes of data uncompressed. That's a pretty big chunk of data to be sending over the wire.

However, we haven't seen anything yet…

The Backstage Catalog model defines an API Kind. These are used to document the endpoints that services make available in Backstage. API entities often contain an embedded OpenAPI doc.

For example:

apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: artist-api
  description: Retrieve artist details
spec:
  type: openapi
  lifecycle: production
  owner: artist-relations-team
  system: artist-engagement-portal
  # The embedded OpenAPI spec is in the definition
  definition: |
    openapi: "3.0.0"
    info:
      version: 1.0.0
      title: Artist API
      license:
        name: MIT
    servers:
      - url: http://artist.spotify.net/v1
    paths:
      /artists:
        get:
          summary: List all artists
    ...

At Roadie, we have seen multiple customers with API kind entities in their Catalog with embedded OpenAPI docs as large as 1 megabyte in size. It's easy for even a small-sized engineering organization to have 50 such APIs documented in Backstage. Unoptimized, that's 50MB+ of data that's being transferred every time we query the full Catalog.

Catalog size matters because careless use of the Catalog APIs can trigger huge database queries and API responses. The result is unnecessary network traffic and time wasted transferring, encoding, and decoding that data.
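The arithmetic behind those figures can be sketched in a few lines. This is only a back-of-the-envelope estimate: the helper names are ours, and the inputs are the round numbers quoted above (about half a kilobyte per typical entity, about 1 megabyte per large API entity), not measurements.

```typescript
// Rough payload estimate for an unfiltered "fetch everything" Catalog call.
// All inputs are the illustrative round numbers quoted in the text.
interface EntityGroup {
  label: string;
  count: number;    // how many entities of this shape
  avgBytes: number; // average uncompressed JSON size per entity
}

const KB = 1000;
const MB = 1000 * KB;

// Sum of count * average size across all groups.
function estimatePayloadBytes(groups: EntityGroup[]): number {
  return groups.reduce((total, group) => total + group.count * group.avgBytes, 0);
}

function toMegabytes(bytes: number): number {
  return bytes / MB;
}

const typicalEntities: EntityGroup = {
  label: 'typical entities',
  count: 20000,
  avgBytes: 0.5 * KB,
};
const largeApiEntities: EntityGroup = {
  label: 'API entities with embedded OpenAPI docs',
  count: 50,
  avgBytes: 1 * MB,
};

const typicalOnlyMb = toMegabytes(estimatePayloadBytes([typicalEntities]));
const apisOnlyMb = toMegabytes(estimatePayloadBytes([largeApiEntities]));
```

Plugging in your own entity counts and average sizes gives a quick feel for how much data a full Catalog fetch moves before any of the optimizations below are applied.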

How to Make Good Use of the Entity Catalog

With all this said, we wanted to run through some good practices that are going to help with improving the Backstage experience. The examples we use in the tables below are measured in the browser on a production Catalog that has 14k entities.

Unoptimized, we're looking at 2.16 seconds and 59.5 MB of data. That's our starting point. Each experiment below improves on that baseline.

Only query the entity fields that you need

By default, when retrieving entities from the Backstage Catalog, Backstage will return the whole entity for each item listed. As mentioned above, an entity might be as large as 1 megabyte. As such, limiting fields requested to the ones that are strictly required can help a lot. For example, you might have code like the following that is requesting every entity in the Catalog and then converting the result into an array of entity references.

(await catalogClient.getEntities()).items.map(stringifyEntityRef);

If you look under the hood, you'll find that the stringifyEntityRef function only makes use of the kind, name, and namespace. As such, we can cut down on the amount of data transferred across the network by limiting the fields that are requested.

(await catalogClient.getEntities({
  fields: ['kind', 'metadata.name', 'metadata.namespace']
})).items.map(stringifyEntityRef);

Operation	Time	Size
`catalogClient.getEntities()`	2.16 seconds	59.5 MB
`catalogClient.getEntities({ fields: ['kind', 'metadata.name', 'metadata.namespace'] })`	0.767 seconds	1.2 MB
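To get a feel for why a field projection shrinks the payload so much, here is a local approximation of what keeping only a few dot-separated paths does to one entity. This is not how Backstage implements the fields option; pickFields and getPath are hypothetical helpers written for this illustration, and the 100,000-character definition is a stand-in for a large OpenAPI doc.

```typescript
// Local approximation of a `fields` projection: keep only the listed dot paths.
// NOT the Backstage implementation; it only illustrates the effect on size.
// Limitation: keys that themselves contain dots are not handled.
type Json = { [key: string]: unknown };

// Walk a dot path and return the value, or undefined if any step is missing.
function getPath(source: unknown, path: string[]): unknown {
  let current = source;
  for (const key of path) {
    if (typeof current !== 'object' || current === null) return undefined;
    current = (current as Json)[key];
  }
  return current;
}

function pickFields(entity: Json, fields: string[]): Json {
  const result: Json = {};
  for (const field of fields) {
    const path = field.split('.');
    const value = getPath(entity, path);
    if (value === undefined) continue; // field not present on this entity
    let target = result;
    for (const key of path.slice(0, -1)) {
      if (typeof target[key] !== 'object' || target[key] === null) target[key] = {};
      target = target[key] as Json;
    }
    target[path[path.length - 1]] = value;
  }
  return result;
}

const sampleEntity: Json = {
  apiVersion: 'backstage.io/v1alpha1',
  kind: 'API',
  metadata: { name: 'artist-api', namespace: 'default', description: 'Retrieve artist details' },
  spec: { type: 'openapi', definition: 'x'.repeat(100000) }, // stand-in for a big OpenAPI doc
};

const projected = pickFields(sampleEntity, ['kind', 'metadata.name', 'metadata.namespace']);
const fullSize = JSON.stringify(sampleEntity).length;
const projectedSize = JSON.stringify(projected).length;
```

Comparing fullSize with projectedSize for a handful of your own entities is a cheap way to predict how much a fields option would save before changing real calls.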

Make use of the filter option to retrieve only entities that you need

We have seen a pattern develop whereby the Catalog client is used to retrieve all of the entities in the Catalog, and then the list is filtered client side.

(await catalogClient.getEntities()).items.filter((entity) => entity.kind === 'Group');

It is more efficient to send a filter to the catalog client so that filtering is done either in the Backstage backend or the Backstage database.

(await catalogClient.getEntities({ filter: { kind: ['Group'] } })).items

Operation	Time	Size
`catalogClient.getEntities()`	2.16 seconds	59.5 MB
`catalogClient.getEntities({ filter: { kind: ['Component'] } })`	69 milliseconds	2.5 MB
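The filter and fields options can be combined in one request. The sketch below assumes a client whose getEntities accepts both options in the same call, as the Catalog client does; the *Like interfaces are minimal stand-ins we define here so the snippet is self-contained, and listGroupNames is a hypothetical helper.

```typescript
// Minimal structural types for this sketch. The real types live in
// @backstage/catalog-client and @backstage/catalog-model.
interface EntityLike {
  metadata: { name: string };
}
interface GetEntitiesRequestLike {
  filter?: Record<string, string | string[]>;
  fields?: string[];
}
interface EntitiesClientLike {
  getEntities(request?: GetEntitiesRequestLike): Promise<{ items: EntityLike[] }>;
}

// Hypothetical helper: names of all Group entities, fetched with a
// server-side filter plus a field projection instead of filtering client side.
async function listGroupNames(client: EntitiesClientLike): Promise<string[]> {
  const response = await client.getEntities({
    filter: { kind: ['Group'] },
    fields: ['metadata.name'],
  });
  return response.items.map(entity => entity.metadata.name);
}
```

A real CatalogClient or the catalogApi from useApi should be usable in place of the stand-in client, since only the getEntities method is required here.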

Avoid retrieving all entities in order to count entities

A common pattern we see in Backstage is for developers to download the whole contents of the Catalog in order to count the entities in that Catalog. The following code will cause Backstage to query the database for every entity, then the client will need to decode the JSON it retrieves in order to count the number of entities.

(await catalogApi.getEntities({})).items.length

There is a far more performant way to do this, using the query API. The following requests a limit of 1 entity to be returned, and also requests that the uid field is the only item returned for that entity. The query API always returns the total count of entities for that query. As such, it gives us what we need without downloading the whole Catalog to the client.

(await catalogClient.queryEntities({
  fields: ['metadata.uid'],
  limit: 1,
})).totalItems;

The change suggested here is going to save work for the Backstage database, the Backstage backend, and the Backstage frontend.

Operation	Time	Size
`catalogClient.getEntities()`	2.16 seconds	59.5 MB
`catalogClient.queryEntities({ fields: ['metadata.uid'], limit: 1 })`	45 milliseconds	0.5 KB
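The counting pattern is easy to wrap in a small reusable function. The sketch below is one way to do it, with an optional filter so you can count a subset; countEntities is a hypothetical helper and the *Like interfaces are minimal stand-ins for the real client types.

```typescript
// Minimal structural types for this sketch; the real request and response
// types come from @backstage/catalog-client.
type FilterLike = Record<string, string | string[]>;
interface QueryEntitiesRequestLike {
  filter?: FilterLike;
  fields?: string[];
  limit?: number;
}
interface QueryEntitiesResponseLike {
  items: unknown[];
  totalItems: number;
}
interface QueryClientLike {
  queryEntities(request?: QueryEntitiesRequestLike): Promise<QueryEntitiesResponseLike>;
}

// Hypothetical helper: count entities without downloading them.
// Asks for a single item with a single small field and reads totalItems.
async function countEntities(client: QueryClientLike, filter?: FilterLike): Promise<number> {
  const response = await client.queryEntities({
    ...(filter ? { filter } : {}),
    fields: ['metadata.uid'],
    limit: 1,
  });
  return response.totalItems;
}
```

For example, countEntities(catalogClient, { kind: ['Component'] }) would return the number of Component entities while transferring at most one tiny item.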

Enable Gzip Compression

When not using the Catalog Client, we recommend using gzip encoding to reduce the amount of data transferred. This is crucial because requests for large amounts of data directly from the Backstage APIs can be massive. Enabling compression significantly decreases the data volume sent to the client. You can achieve this by including the Accept-Encoding header with your client requests.

Operation	Size
`curl https://backstage-server/api/catalog/entities`	59.5 MB
`curl https://backstage-server/api/catalog/entities -H 'Accept-Encoding: gzip'`	6.7 MB
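Catalog JSON tends to compress well because entities repeat the same keys and values over and over. The following local experiment, which assumes a Node.js runtime and uses a made-up entity shape, compares raw and gzipped sizes of a synthetic payload. The ratio you see on a real Catalog will differ.

```typescript
import { gzipSync, gunzipSync } from 'node:zlib';

// Build a synthetic, catalog-like JSON payload. The entity shape and count
// are invented for illustration; real catalogs will compress differently.
function buildSamplePayload(entityCount: number): string {
  const items: object[] = [];
  for (let i = 0; i < entityCount; i++) {
    items.push({
      apiVersion: 'backstage.io/v1alpha1',
      kind: 'Component',
      metadata: { name: `service-${i}`, namespace: 'default', tags: ['java'] },
      spec: { type: 'website', lifecycle: 'production', owner: 'artist-relations-team' },
    });
  }
  return JSON.stringify(items);
}

const samplePayload = buildSamplePayload(1000);

// Size on the wire without compression (UTF-8 bytes).
const rawBytes = new TextEncoder().encode(samplePayload).length;

// Size after gzip, which is what a server sends when the client
// asks for `Accept-Encoding: gzip` and compression is enabled.
const gzippedBytes = gzipSync(samplePayload).length;

const compressionRatio = rawBytes / gzippedBytes;
```

Decompression is handled by most HTTP clients transparently, so the cost on the consuming side is usually limited to adding the header.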

Keep Backstage up to date

Perhaps the most important thing to consider is keeping Backstage up to date. If you are using Roadie, you are already using a very recent version of Backstage. However, if you are managing Backstage yourself, you may have fallen behind. Backstage releases new versions at least once a month, and these versions often contain very valuable performance improvements to the Catalog.

For example, in version 1.6.7 of the Catalog client library, there was an optimization. Previously, the Catalog client would sort all entities before returning them to the caller. This is a nice, helpful utility until there are thousands of entities to sort. Often, it is not necessary or optimal to receive a sorted list of entities.

Collaborate on the OSS Backstage core project

As part of research for this document, we spoke with the core maintainers of Backstage, and there are some great ideas about how to continue to improve the performance. For example, it has been discussed that by default, the getEntities function should be replaced by an iterator object. That iterator would be used to page over the list of entities rather than retrieving the whole list.

As such, keeping up to date with Backstage releases will allow you to benefit from these performance improvements.
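Until something like that iterator exists in the client, a similar effect can be approximated on top of cursor-based paging. The sketch below assumes a queryEntities method that accepts limit, fields, and cursor, and returns pageInfo.nextCursor when more pages exist; iterateEntities is a hypothetical helper, not a Backstage API, and the interfaces are minimal stand-ins.

```typescript
// Minimal structural types for this sketch, loosely modelled on the
// cursor-based query API of the Catalog client.
interface PageRequest {
  limit?: number;
  fields?: string[];
  cursor?: string;
}
interface PageResponse<T> {
  items: T[];
  totalItems: number;
  pageInfo: { nextCursor?: string };
}
interface PagedClient<T> {
  queryEntities(request?: PageRequest): Promise<PageResponse<T>>;
}

// Hypothetical helper: yields entities one page at a time so the caller
// never holds the whole Catalog in memory.
async function* iterateEntities<T>(
  client: PagedClient<T>,
  pageSize: number,
  fields?: string[],
): AsyncGenerator<T> {
  let cursor: string | undefined;
  do {
    const page = await client.queryEntities(
      cursor ? { cursor, limit: pageSize, fields } : { limit: pageSize, fields },
    );
    for (const item of page.items) {
      yield item;
    }
    cursor = page.pageInfo.nextCursor;
  } while (cursor);
}
```

A caller would consume this with for await (const entity of iterateEntities(client, 500, ['metadata.name'])) and can stop early without fetching the remaining pages.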

Conclusion

This article illustrates some of the performance gains that can be achieved, and your mileage may vary. We have not delved into the implications of these changes for backend memory and database query performance. However, we can say that these changes can greatly improve those too. It's difficult to quantify, but at Roadie we were seeing huge memory spikes and large garbage collections in Backstage whenever the whole Catalog was queried. This is possibly due to the sheer size of the entity Catalogs and the serialization and deserialization that occurs between the client, backend, and database.

We have shown that making use of some good patterns can result in much improved load times for users. We have shown some examples where timings are reduced from multiple seconds to sub-second. We have also shown that the sizes sent across the wire can be greatly reduced, from 59.5 MB to a few megabytes and, when only a count is needed, to less than a kilobyte.

A well-managed and optimized internal developer portal can make your software engineers more efficient and empower them with the information they need. When load times are reduced from multiple seconds to under a second, developers enjoy a fast, responsive experience that means they’re more likely to use Backstage and find what they need.

Scaling Backstage

Brian Fletcher — Fri, 27 Sep 2024 09:00:29 +0000

There are multiple challenges that arise when the volume of data in Backstage grows to thousands or tens of thousands of entities, ranging from performance to ease of use. We’ll explore these in this article and suggest possible ways around them for your own Backstage deployments.

The Backstage developer portal is an excellent tool for platform teams, as well as engineers, to keep a handle on their software, maintain compliance statuses and spin up new services. Unfortunately, open source Backstage is known for being difficult to set up and cumbersome to maintain.

This pain is often made worse when the catalog within a Backstage instance gets large. Below are a few high-level pointers that we have come across during our journey to support bigger engineering organizations. We’ve scaled our installation to support organizations with multiple tens of thousands of entities over the last year. We’ll dig deeper into individual topics, based on interest, in further articles.

Let us know about your optimization problems or questions in either Roadie or Backstage Discord channels!

Handling the Catalog data

We have written more extensively about catalog performance and how to improve that in a separate blog post.

When developing on top of Backstage, you are always building on the foundation of solid catalog data. This usually makes the CatalogAPI the most used API on both the backend and frontend of the application. It may be that the entities in the system grow large (looking at you, API specs) or that there is just a large quantity of them (looking at you, automatically ingested AWS resources). Therefore, it is important to retrieve only the fields that are actually displayed to the user and to limit the number of entities being fetched.

The default CatalogClient has the option to retrieve only relevant fields through the API. Do use it. Also make sure to use pagination where possible and to hit the correct endpoints with your catalog client. Retrieving less data is always going to be cheaper than retrieving more data. It makes a big difference if you can avoid JSON parsing and stringifying the biggest API docs in the world multiple times along with the rest of the catalog data, or running inefficient string manipulations just for sorting purposes.

For some cases, we at Roadie have needed to introduce our own endpoints and catalog queries to improve performance in larger catalogs. These endpoints can be as simple as pointed subset queries run directly against the database table to identify only the needed entities, or they can return partial entity data in preformatted response shapes. Being able to run use-case specific queries for relevant data, and to lean on the better performance usually available in the database layer, makes a big difference at times.

Processors vs. Providers

In the early days of Backstage, the approach to ingesting entities into the catalog was to use Processors to retrieve data from third party sources. This is still a remnant within the product, and is (unfortunately) still used by some integrations. The main purpose of processors nowadays is to enhance the entities, but the same caveats on their usage still apply.

The CNCF maintainers of the project introduced Providers to Backstage at a later stage. These providers allow more room to schedule and modify the payloads that you are sending to the catalog. Being able to chunk the ingested entities into smaller buckets, schedule the intervals with more (or less) granularity, and get better visibility into the internals of the catalog is a big benefit when tuning catalog ingestion to work optimally.
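As an illustration of the chunking idea, the sketch below splits a large list of entities into smaller batches and applies them one delta mutation at a time. The interfaces are simplified stand-ins loosely modelled on the entity provider connection (check the real types before using this), and chunk and emitInChunks are hypothetical helpers.

```typescript
// Simplified stand-ins for the provider connection types. The real ones
// carry more fields, such as a location key per deferred entity.
interface DeferredEntityLike {
  entity: { kind: string; metadata: { name: string } };
}
interface ConnectionLike {
  applyMutation(mutation: {
    type: 'delta';
    added: DeferredEntityLike[];
    removed: DeferredEntityLike[];
  }): Promise<void>;
}

// Split a list into consecutive slices of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error('size must be positive');
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Hypothetical helper: emit entities in small batches rather than as one
// huge payload. Returns the number of batches that were applied.
async function emitInChunks(
  connection: ConnectionLike,
  entities: DeferredEntityLike[],
  chunkSize: number,
): Promise<number> {
  const batches = chunk(entities, chunkSize);
  for (const added of batches) {
    await connection.applyMutation({ type: 'delta', added, removed: [] });
  }
  return batches.length;
}
```

The right batch size depends on entity size and database capacity, so it is worth treating chunkSize as a tunable rather than a constant.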

In many cases, though, the problem may still remain. Providers may need to emit locations or other intermediate data before it is finally stored in the system as a full entity. In those cases, the entity may need to go through the processing pipeline again.

When you encounter issues that may be related to this approach, make sure that your processors are nimble beasts and are definitely not blocking the event loop. Small milliseconds make a difference here. The system is both processing a lot of data and doing it multiple times. User experience may also suffer when processing times get large or some immediately expected entities are clogged up behind a large processing queue.
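One common way to keep a long-running loop from monopolizing the event loop is to process items in small slices and yield between them. The sketch below shows the general Node.js technique. It is not taken from Backstage, and processInSlices and yieldToEventLoop are hypothetical helper names.

```typescript
// Resolve on a later turn of the event loop so pending I/O callbacks and
// timers get a chance to run.
function yieldToEventLoop(): Promise<void> {
  return new Promise<void>(resolve => setTimeout(resolve, 0));
}

// Hypothetical helper: apply `handle` to every item, pausing after each
// slice of `sliceSize` items instead of running one long synchronous loop.
async function processInSlices<T, R>(
  items: T[],
  sliceSize: number,
  handle: (item: T) => R,
): Promise<R[]> {
  if (sliceSize <= 0) throw new Error('sliceSize must be positive');
  const results: R[] = [];
  for (let start = 0; start < items.length; start += sliceSize) {
    for (const item of items.slice(start, start + sliceSize)) {
      results.push(handle(item));
    }
    await yieldToEventLoop(); // hand control back before the next slice
  }
  return results;
}
```

Yielding adds a little latency per slice, so the slice size is a tradeoff between total processing time and how responsive the process stays for other requests.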

At Roadie, we have taken an even more performant alternative approach for a few specific entities that we natively support and ingest. We want more fine-grained control to serve our larger users better and have created an alternative, self-contained processing module to do the processing for some specific use cases.

Scaffolder

By default the Scaffolder within a Backstage project runs in the same process as the rest of the application. This is by design, but there is an escape hatch that can be used to externalize this from the codebase. Backstage is built as a modular monolith and in theory has the possibility to be spread out into multiple services.

There is a fair amount of work to achieve that, but the payoff is usually there. The decision to make here is which tradeoffs your company is willing to accept. Is it OK that only a single scaffolder run is manageable at one time? Does it matter if the rest of the Backstage application begins to show signs of slowness when other processes are running?

The Scaffolder, and larger Tech Insights installations, take a lot of CPU cycles from the underlying hardware which may negatively interfere with the user experience. Blocking the event loop is the big no-no in the Node.js world when it comes to performance. If you are running heavy tasks within the same process that you are using to serve your users, you may encounter bad times. If bad times appear, consider externalizing some of the chunkier pieces of your instance. These may include the Scaffolder, Tech Insights, Search indexing, Cost Insights and the catalog processing loop.

Roadie has extracted the larger, more resource hungry processes into ephemeral standalone processes to avoid eating up the event loop cycles of the main application. In our case these are running in AWS, where we host your Roadie instances, as ECS tasks or Lambda functions, depending on the use case. With the backend system fully released for the Backstage project, it should be much easier to spin out supporting services into their own processes and leave the catalog alone to do what it does best: showing entities to users in a performant manner.

Perceived Frontend Performance

Of course, performance is relevant to users only if they are able to feel it. In standard Backstage installations, this is felt in the frontend layer of the application. Does your catalog load fast? Do you have a ton of frontend plugins installed, making your bundle sizes big? Do you need to rebuild your tech docs every time you navigate to the docs page?

For the frontend resources, there are multiple well-known performance tricks that can be included in the build process and hosting solutions that you are using to serve your frontend app. In the end, the Backstage frontend is a single-page application with all the known benefits and caveats. All the data displayed will need to be retrieved from somewhere before it enters the JavaScript runtime to be rendered on the screen.

In most cases, getting the data we want to display means API calls. For some data that is OK, like getting cheap values from fast endpoints, but for other data the roundtrip to the server is not worth it. You can embed relevant information into the index.html that is either served from the backend (in newer Backstage installations) or pre-built during the deployment process. You can also use localStorage to your advantage. In fact, Backstage does use this for some of its data, but unfortunately not necessarily for caching purposes.
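A minimal sketch of using localStorage as a cache might look like the following. It is a generic time-to-live cache and not something Backstage ships: cachedFetch, the KeyValueStore interface, and the key names are all assumptions made for this example. In the browser, window.localStorage satisfies the KeyValueStore interface.

```typescript
// The subset of the Web Storage API this sketch needs.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

interface CachedValue<T> {
  expiresAt: number; // epoch milliseconds
  value: T;
}

// Hypothetical helper: return a cached value while it is fresh, otherwise
// call `load`, store the result with an expiry, and return it.
async function cachedFetch<T>(
  store: KeyValueStore,
  key: string,
  ttlMs: number,
  load: () => Promise<T>,
  now: () => number = () => Date.now(),
): Promise<T> {
  const raw = store.getItem(key);
  if (raw !== null) {
    try {
      const cached = JSON.parse(raw) as CachedValue<T>;
      if (cached.expiresAt > now()) return cached.value;
    } catch {
      // Corrupt cache entry: fall through and reload.
    }
  }
  const value = await load();
  store.setItem(key, JSON.stringify({ expiresAt: now() + ttlMs, value }));
  return value;
}
```

This only suits small, JSON-serializable values that can tolerate being slightly stale, such as a list of group names for a picker. Large payloads will run into browser storage quotas.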

YAMLs and scaling maintainability

The canonical and recommended approach by the CNCF open source maintainers of Backstage is to use catalog manifest files, usually called catalog-info.yaml, within code repositories to store entity data for the catalog. In a large number of cases this is the wrong approach. You may be able to keep domain and system entities up to date easily, since they are small in number and change rarely. For other kinds and types of entities, we have seen with many of our customers that keeping those YAML files up to date across an engineering organization is difficult.

A better approach to ingest entities in many cases is to automate the process of at least initial entity information from a more robust source of truth. In the end, trusting humans to update a random file in their repository just for the sake of updating it seems unlikely to succeed 100% of the time.

In a modern engineering organization there are multiple good sources to use as the canonical starting point for your entity data. For users and groups you have your Oktas or Azure Entra Ids. For repositories you have your GitHub APIs. For components, APIs and resources you have your running instances, your exposed OpenAPI endpoints and your K8s or cloud provider APIs.

Ingesting the relevant data automatically from these allows you to trust that the relevant information is up to date and mirrors correctly the software that is actually being developed within your organization and what is running in environments.
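To make that concrete, here is a sketch of turning a record from a source system into a Component entity. The RepoRecord shape, the choice of type, the lifecycle mapping, and the fallback owner are all assumptions for illustration. Only the entity envelope and the github.com/project-slug annotation follow the conventions shown earlier.

```typescript
// Hypothetical shape of a record pulled from a repository hosting API.
interface RepoRecord {
  fullName: string; // e.g. "example-org/artist-website"
  description?: string;
  archived: boolean;
  ownerTeam?: string;
}

// Derive an entity name: lowercased, unsupported characters replaced,
// trimmed to start and end with an alphanumeric, capped at 63 characters.
function toEntityName(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/[^a-z0-9._-]+/g, '-')
    .replace(/^[^a-z0-9]+|[^a-z0-9]+$/g, '')
    .slice(0, 63);
}

// Hypothetical mapping from a source record to a Component entity.
function repoToComponent(repo: RepoRecord) {
  const repoName = repo.fullName.split('/').pop() ?? repo.fullName;
  return {
    apiVersion: 'backstage.io/v1alpha1',
    kind: 'Component',
    metadata: {
      name: toEntityName(repoName),
      description: repo.description,
      annotations: { 'github.com/project-slug': repo.fullName },
    },
    spec: {
      type: 'service', // assumption: refine from repository topics or language
      lifecycle: repo.archived ? 'deprecated' : 'production', // assumed mapping
      owner: repo.ownerTeam ?? 'unknown', // assumed fallback owner
    },
  };
}
```

Because the entity is derived from the source on every run, changes such as archiving a repository show up in the catalog without anyone editing a YAML file.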

That is, in the end, the purpose of a developer portal: a mapping of the software that you are providing to your customers, not a mapping of software that you have at some point written into a well-formatted text file.

Rate limits

There are downsides to automating your catalog ingestion as well. Backstage relies heavily on integrations with third party APIs, and this has implications for how up to date it can keep the catalog information. Being so reliant on other services and wanting to be the single pane of glass to display that information means that you need to be aware of the limitations of this system.

Backstage by default is a pull-based system which contacts third parties using API tokens or other authentication information and retrieves relevant data. When this happens in plugins, it usually isn’t a massive issue, since the actual concurrent user count is relatively small. Even for the bigger clients we don’t usually see morning rushes in the high three digits. The just-in-time nature of retrieving data from third parties at runtime to display on the familiar frontend thus works well.

On the other hand, Backstage also stores data internally: data that it gathers automatically from third parties and uses to generate insights or enhance entities. These processing loops usually run on a schedule and try to slurp in as much as they can. This is where rate limits become a problem.

Monitoring rate limits against different systems is extremely important and helps you identify when you are getting close to making the downstream service angry. Backstage offers a good set of monitoring primitives to expose metrics from your providers. You can, for example, set up OpenTelemetry to gather the information you need. You can expose rate limit information either by querying it periodically in a loop or by embedding it directly in your fetch client implementations. The former approach gives you the ability to manually tweak your calling schedules to accommodate your integrations, the latter may give you the ability to automatically slow down the calling loop to satisfy the limits.
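As a sketch of the second approach, the functions below read GitHub-style rate limit headers from a response and work out how long to wait before the next call so the remaining budget is spread out until the limit resets. The header names follow GitHub's convention and other services use different ones. The reserve value and the function names are assumptions.

```typescript
interface RateLimitSnapshot {
  remaining: number; // calls left in the current window
  resetAtMs: number; // when the window resets, in epoch milliseconds
}

// Read GitHub-style headers (reset is given in epoch seconds).
// Returns undefined when the headers are missing or not numeric.
function parseRateLimit(headers: { get(name: string): string | null }): RateLimitSnapshot | undefined {
  const remaining = headers.get('x-ratelimit-remaining');
  const reset = headers.get('x-ratelimit-reset');
  if (remaining === null || reset === null) return undefined;
  const snapshot = { remaining: Number(remaining), resetAtMs: Number(reset) * 1000 };
  return Number.isFinite(snapshot.remaining) && Number.isFinite(snapshot.resetAtMs)
    ? snapshot
    : undefined;
}

// Spread the usable budget evenly over the time left in the window.
// `reserve` keeps some calls back for interactive users (assumed value).
function delayBeforeNextCallMs(snapshot: RateLimitSnapshot, nowMs: number, reserve = 50): number {
  const msUntilReset = Math.max(0, snapshot.resetAtMs - nowMs);
  const usable = snapshot.remaining - reserve;
  if (usable <= 0) return msUntilReset; // budget exhausted: wait for the reset
  return Math.floor(msUntilReset / usable);
}
```

The same snapshot can also be reported as a metric, which covers the monitoring side without an extra polling loop.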

Let us know if you have encountered any other aspects or approaches that have helped you scale your Backstage instance and work effectively within your organization.