Forem: Łukasz Pluszczewski

Make Notion search great again: Vector Database

Łukasz Pluszczewski — Mon, 20 Nov 2023 09:15:57 +0000

In this series we’re looking into the implementation of a vector index built from the contents of our company Notion pages. It allows us not only to search for relevant information but also to enable a language model to directly answer our questions with Notion as its knowledge base. In this article, we will see how we’ve used a vector database to finally achieve this.

Numbers, vectors, and charts are real data unless stated otherwise

Last time we downloaded and processed data from Notion API. Let’s do something with it.

Vector Database

To find semantically similar texts we need to calculate the distance between vectors. While we have just a few short texts we can brute-force it: calculate the distance between our query and each text embedding one by one and see which one is the closest. When we deal with thousands or even millions of entries in our database, however, we need a more efficient way of comparing vectors. Just like for any other way of searching through a lot of entries, an index can help here. To make our life easier we’ll use Weaviate DB - a vector database that implements the HNSW vector index to improve the performance of vector search.
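To make the brute-force approach concrete, here is a minimal sketch of it. The tiny 2-dimensional vectors and the names `cosineDistance` and `bruteForceSearch` are made up for illustration; real embeddings have many more dimensions.

```typescript
interface Embedded {
  text: string;
  vector: number[];
}

// Cosine distance = 1 - cosine similarity; 0 means "pointing in the same direction".
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Compare the query with every entry, one by one, and keep the k closest.
// The cost grows linearly with the number of entries.
function bruteForceSearch(query: number[], entries: Embedded[], k: number): Embedded[] {
  return entries
    .map((entry) => ({ entry, distance: cosineDistance(query, entry.vector) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k)
    .map((scored) => scored.entry);
}

// Toy data, made up for this example.
const toyEntries: Embedded[] = [
  { text: 'fox', vector: [1, 0] },
  { text: 'dog', vector: [0.9, 0.1] },
  { text: 'plane', vector: [0, 1] },
];
const closestTexts = bruteForceSearch([1, 0.05], toyEntries, 2).map((e) => e.text);
```

Every query touches every stored vector, which is exactly the cost a vector index like HNSW is designed to avoid.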

There are a lot of different vector databases you can use. We’ve used Weaviate DB because it has reasonable defaults, including vector and BM25 indexes working out of the box, and a lot of features that can be enabled with modules (like “rerank” mentioned before). You can also consider the Postgres extension “pgvector” to take advantage of SQL goodness: relations, joins, subqueries, and so on, while Weaviate may be more limited in that regard. Choose wisely!

I may revisit the topic of vector indexes in the future but in this article I’ll just use the database that implements it. To learn more about HNSW itself look here, and to learn more about configuring vector index in Weaviate DB look here.

Weaviate DB

Weaviate DB is an open-source, scalable, vector database that you can easily use in your own projects. The vector goodness is just one docker container away and you can run it like this:

docker run -p 8080:8080 -d semitechnologies/weaviate:latest

Weaviate is modular, and there are a number of modules allowing you to add functionality to your database. You can provide the embedding vectors to the database entries yourself, but there are modules to calculate those for you, like text2vec-openai module that uses the openAI API. There are modules allowing you to easily backup your DB data to S3, add rerank functionality to your searches, and many more. Enabling a module is as simple as adding an environment variable:

docker run -p 8080:8080 -d \
  -e ENABLE_MODULES=text2vec-openai,backup-s3,reranker-cohere \
  semitechnologies/weaviate:latest

Now, to connect to the database from our typescript project:

import weaviate from 'weaviate-ts-client';

const client = weaviate.client({
  scheme: 'http',
  host: 'localhost:8080',
});

All the data in Weaviate DB is stored in classes (equivalent to tables in SQL or collections in MongoDB), containing data objects. Objects have one or more properties of various types, and each object can be represented by exactly one vector. Just like SQL databases, Weaviate is schema-based. We define a class with its name, properties, and additional configuration, like which modules should be used for vectorization. Here is the simplest class with one property.

{
  class: 'MagicIndex',
  properties: [
    {
      name: 'content',
      dataType: ['text'],
    },
  ],
}

We can add as many properties as we like. There are a number of types available: int, number, text, boolean, date, geoCoordinates (with special ways to query based on the location), blob, or lists of most of these like int[] or text[]:

{
  class: 'MagicIndex',
  properties: [
    { name: 'content', dataType: ['text'] },
    { name: 'tags', dataType: ['text[]'] },
    { name: 'lastUpdated', dataType: ['date'] },
    { name: 'file', dataType: ['blob'] },
    { name: 'location', dataType: ['geoCoordinates'] },
  ],
}

You can also control how, and for what properties the embeddings are going to be calculated if you don’t want to provide them yourself:

{
  class: 'MagicIndex',
  properties: [
    { name: 'content', dataType: ['text'] },
    {
      name: 'metadata',
      dataType: ['text'],
      moduleConfig: {
        'text2vec-openai': {
          skip: true,
        },
      },
    },
  ],
  vectorizer: 'text2vec-openai',
}

In this case, we’re going to use the text2vec-openai module to calculate vectors but only from the content property.

Weaviate stores exactly one vector per object, so if more than one field is vectorized (or you have class name vectorization enabled), the embedding is going to be calculated from the concatenated texts. If you want to have separate vectors for different properties of the document (like different chunks, title, metadata etc.) you need separate entries in the database.

Applying a schema is as simple as:

await client.schema
  .classCreator()
  .withClass(classDefinition)
  .do();

Let’s see what the data objects look like in our Notion index:

{
  pageTitle: 'Locomotive Kinematics of Quick Brown Foxes: An In-Depth Analysis of Canine Velocity Over Lazy Canid Obstacles',
  chunk: '1',
  originalContent: '# Abstract\n\nThe paradigm of quick brown foxes leaping over lazy dogs has long fascinated both the scientific community and the general public...',
  content: 'abstract\nthe paradigm of quick brown foxes leaping over lazy dogs has long fascinated both the scientific community and the general public...',
  pageId: 'dfda9d5d-b059-4186-95f4-7cb8cdf42545',
  pageType: 'page',
  pageUrl: 'https://www.notion.so/LeapFoxSolutions/dfda9d5d-b059-4186-95f4-7cb8cdf42545',
  lastUpdated: '2023-04-12T23:20:50.52Z'
}

Let’s get what is obvious out of the way: we store the page title, its ID, URL, and the last update date. Only the content property is vectorized: the vectorizer ignores the title, originalContent, and so on.

You probably noticed a chunk property though. What is it? Vectors work best when texts are not too long: embeddings are generally used for texts no longer than a short paragraph, so we split the contents of Notion pages into smaller chunks. We’ve used LangChain’s recursive text splitter. It tries to split the text first by double newlines; if some chunks are still too long, it splits them by single newlines, then by spaces, and so on. This way we keep paragraphs together if possible. We’ve set the target chunk length to 1000 characters with a 200-character overlap.

The length of the chunks and the way you split them can have a huge impact on vector search performance. It is generally assumed that chunk size should be similar to the length of the query (so during the search you compare vectors of similarly sized texts). In our case chunks 1000 characters long, although pretty big, seem to work best but your mileage may vary. Additionally, we also make sure that table rows are not sliced in half to avoid “orphaned” columns. This is a huge topic and I may revisit it in one of the future posts.
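The splitting strategy described above can be sketched in a few lines. This is a simplified illustration of the idea, not LangChain's implementation: the function names are made up, pieces are rejoined with a single space (so paragraph breaks are lost), and the overlap is built from whole pieces rather than exact character counts.

```typescript
// Separators from coarsest to finest.
const SEPARATORS = ['\n\n', '\n', ' '];

// Split on the coarsest separator; pieces that are still too long are split
// again with the next, finer separator. As a last resort, cut by length.
function splitRecursively(text: string, chunkSize: number, level = 0): string[] {
  if (text.length <= chunkSize) return [text];
  if (level >= SEPARATORS.length) {
    const hardCuts: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) hardCuts.push(text.slice(i, i + chunkSize));
    return hardCuts;
  }
  const pieces: string[] = [];
  for (const part of text.split(SEPARATORS[level])) {
    if (part.length > 0) pieces.push(...splitRecursively(part, chunkSize, level + 1));
  }
  return pieces;
}

// Pack pieces into chunks of at most chunkSize characters. When a chunk is
// emitted, its last pieces (up to `overlap` characters) start the next chunk.
function mergePieces(pieces: string[], chunkSize: number, overlap: number): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  const currentLength = () => current.join(' ').length;
  for (const piece of pieces) {
    if (current.length > 0 && currentLength() + 1 + piece.length > chunkSize) {
      chunks.push(current.join(' '));
      while (
        current.length > 0 &&
        (currentLength() > overlap || currentLength() + 1 + piece.length > chunkSize)
      ) {
        current.shift();
      }
    }
    current.push(piece);
  }
  if (current.length > 0) chunks.push(current.join(' '));
  return chunks;
}

function splitIntoChunks(text: string, chunkSize: number, overlap: number): string[] {
  return mergePieces(splitRecursively(text, chunkSize), chunkSize, overlap);
}

// Tiny sizes so the overlap is easy to see.
const demoChunks = splitIntoChunks('aaaa bbbb cccc dddd', 9, 4);
```

With a chunk size of 9 and an overlap of 4, each chunk in this toy example repeats the last word of the previous one. In our index the corresponding values are 1000 and 200.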

We save each chunk separately in the database and the chunk property is an index of the chunk. Why is it a string and not a number though? Because we don’t vectorize the title property, we save a separate entry for it that looks like this:

{
  pageTitle: 'Locomotive Kinematics of Quick Brown Foxes: An In-Depth Analysis of Canine Velocity Over Lazy Canid Obstacles',
  chunk: 'title',
  originalContent: 'Locomotive Kinematics of Quick Brown Foxes An In-Depth Analysis of Canine Velocity Over Lazy Canid Obstacles',
  ...
}

In the future, we may decide that we want to vectorize more properties of the page than just content and title. We can do that easily just by adding a new possible value to the chunk property.
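A minimal sketch of how such entries could be assembled for one page is shown below. The `buildEntries` helper and its types are hypothetical, only a subset of the properties is included, and numbering chunks from 1 is an assumption.

```typescript
interface PageInfo {
  pageId: string;
  pageTitle: string;
  pageUrl: string;
}

interface IndexEntry extends PageInfo {
  chunk: string; // 'title' or the chunk number as a string
  originalContent: string;
}

// One entry for the title plus one entry per content chunk.
function buildEntries(page: PageInfo, chunks: string[]): IndexEntry[] {
  const titleEntry: IndexEntry = {
    ...page,
    chunk: 'title',
    originalContent: page.pageTitle,
  };
  const chunkEntries: IndexEntry[] = chunks.map((text, i) => ({
    ...page,
    chunk: String(i + 1), // assumption: chunks are numbered from 1
    originalContent: text,
  }));
  return [titleEntry, ...chunkEntries];
}

const pageEntries = buildEntries(
  {
    pageId: '1ba0b851-d443-4290-8415-3cd295850d14',
    pageTitle: 'Vulpine Agility vs. Canine Apathy: A Comparative Study',
    pageUrl: 'https://www.notion.so/LeapFoxSolutions/1ba0b851-d443-4290-8415-3cd295850d14',
  },
  ['## Background ...', '## Methods ...'],
);
```

Keeping the chunk property a string is what makes values like 'title' possible next to the numeric indexes.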

What’s the deal with content and originalContent properties? To spare the vectorizer some noise in the data, we prepare a cleaned-up version of each chunk. We remove all special characters, replace multiple whitespaces with a single one, and change the text to lowercase. In our testing, vector search is slightly more accurate with this simple cleanup. We still keep originalContent though, because this is what we pass to rerank and use for the traditional, inverted index search.
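The cleanup could look roughly like the sketch below. The function name `cleanForEmbedding` and the exact rules are assumptions: this version keeps only ASCII letters and digits, so texts in other alphabets would need a Unicode-aware character class.

```typescript
// Sketch of the cleanup: strip special characters, collapse whitespace, lowercase.
function cleanForEmbedding(text: string): string {
  return text
    .replace(/[^a-zA-Z0-9\s]/g, ' ') // special characters become spaces
    .replace(/[^\S\n]+/g, ' ') // collapse runs of spaces and tabs (newlines are kept)
    .replace(/ ?\n[ \n]*/g, '\n') // collapse runs of newlines and the spaces around them
    .trim()
    .toLowerCase();
}

const cleanedSample = cleanForEmbedding('# Abstract\n\nThe  quick, brown fox!');
// cleanedSample is 'abstract\nthe quick brown fox'
```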

Lastly, we have pageType property which is just a result of a Notion quirk: a page in Notion can be either a page or a database. As mentioned in the previous article, we treat both the same way in our index: databases are converted to simple tables.

Ok, we have an idea of what data we are going to store in the database, but how to add, fetch, and query that data?

Weaviate interface

Weaviate offers two interfaces to interact with it, RESTful and GraphQL APIs, and this is reflected in the available typescript client methods. We will focus on the GraphQL interface. To get entries from the database, we simply need to provide a class name and the fields we want to get:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withFields('pageTitle originalContent pageUrl')
  .do();

It is recommended that each query is limited and uses cursor-based pagination if necessary:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withFields('pageTitle originalContent pageUrl')
  .withLimit(50)
  .withAfter(cursor)
  .do();

Let’s add some entries to the database:

await client.data
  .creator()
  .withClassName('MagicIndex')
  .withProperties({
    pageTitle: 'Vulpine Agility vs. Canine Apathy: A Comparative Study',
    chunk: '2',
    originalContent: '## Background \n\n Though colloquially immortalized in typographical tests, the scenario of a quick brown fox vaulting over a lazy dog presents...',
    content: 'background\nthough colloquially immortalized in typographical tests the scenario of a quick brown fox vaulting over a lazy dog presents...',
    pageId: '1ba0b851-d443-4290-8415-3cd295850d14',
    pageType: 'page',
    pageUrl: 'https://www.notion.so/LeapFoxSolutions/1ba0b851-d443-4290-8415-3cd295850d14',
    lastUpdated: '2023-03-01T12:21:30.12Z'
  })
  .do();

With vectorizer enabled for MagicIndex class, that’s all we need to do. The entry is added to the database together with its vector representation calculated by OpenAI’s ADA embedding model. Now we can search for texts about foxes and dogs all day long.

Traditional search

Weaviate allows us to search with traditional inverted index methods too! We have a bag-of-words ranking function called BM25F at our disposal. It’s configured with reasonable defaults out of the box. Let’s see it in action:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withBm25({
    query: 'Can the fox really jump over the dog?',
    properties: ['originalContent'],
  })
  .withLimit(5)
  .withFields('pageTitle originalContent pageUrl _additional { score }')
  .do();

You can see the _additional property that we can request in the query. It can contain various additional data related to the object itself (like its ID) or the search (like BM25 score or the cosine distance in case of vector search).

Vector search

Of course, an inverted index search will miss many texts that talk about brown foxes without using those exact words. Thankfully, semantic search is just as easy to perform:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({ concepts: ['Can the fox really jump over the dog?'] })
  .withLimit(5)
  .withFields('pageTitle originalContent pageUrl _additional { distance }')
  .do();

There is some additional magic that we can do to make the search even better like setting the maximum cosine distance that we accept in the search results, or using the autocut feature:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({
    concepts: ['Can the fox really jump over the dog?'],
    distance: 0.25,
  })
  .withAutocut(2)
  .withLimit(10)
  .withFields('pageTitle originalContent pageUrl _additional { distance }')
  .do();

Now, not only do we get only results with a cosine distance less than 0.25 (that’s what the distance setting in the withNearText method does), but additionally, Weaviate’s autocut feature will group the results by similar distance and return the first two groups (more on how autocut works here).
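To illustrate the idea of cutting results at jumps in distance, here is a simplified sketch. It is not Weaviate's autocut algorithm: the `cutResults` function and the fixed `jumpThreshold` that decides what counts as a jump are assumptions made for this example.

```typescript
interface ScoredResult {
  id: string;
  distance: number;
}

// Keep results below maxDistance, then return only the first `groups` groups,
// where a new group starts whenever the distance jumps by more than jumpThreshold.
function cutResults(
  results: ScoredResult[],
  maxDistance: number,
  groups: number,
  jumpThreshold = 0.05, // hypothetical value, chosen for this toy data
): ScoredResult[] {
  const sorted = results
    .filter((r) => r.distance < maxDistance)
    .sort((a, b) => a.distance - b.distance);
  const kept: ScoredResult[] = [];
  let jumps = 0;
  for (let i = 0; i < sorted.length; i++) {
    if (i > 0 && sorted[i].distance - sorted[i - 1].distance > jumpThreshold) {
      jumps++;
      if (jumps >= groups) break;
    }
    kept.push(sorted[i]);
  }
  return kept;
}

// Made-up distances: a tight group, a second group, and one outlier.
const toyResults: ScoredResult[] = [
  { id: 'a', distance: 0.1 },
  { id: 'b', distance: 0.11 },
  { id: 'c', distance: 0.12 },
  { id: 'd', distance: 0.2 },
  { id: 'e', distance: 0.21 },
  { id: 'f', distance: 0.24 },
  { id: 'g', distance: 0.4 },
];
const firstGroup = cutResults(toyResults, 0.25, 1).map((r) => r.id);
```

In this toy data the first group ends before the jump from 0.12 to 0.2, so asking for one group returns only the three closest results.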

But that’s not all. We can also steer the search toward some concepts and away from others:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({
    concepts: ['Can the fox really jump over the dog?'],
    moveAwayFrom: {
      concepts: ['typography'],
      force: 0.45,
    },
    moveTo: {
      concepts: ['scientific'],
      force: 0.85,
    },
  })
  .withFields('pageTitle originalContent pageUrl')
  .do();

While the example with foxes is a little silly, you can imagine many scenarios where that feature can be really useful. Maybe you’re looking for “ways to fly” but you want to move away from “planes” and move toward “animals”. Or you may search for a query, but keep the results similar to some other object in the database:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({
    concepts: ['Can the fox really jump over the dog?'],
    moveTo: {
      objects: [{ id: '84ab0371-a73b-4774-8b03-eccb97b640ae' }],
      force: 0.85,
    },
  })
  .withFields('pageTitle originalContent pageUrl')
  .do();

There are many other features that you may want to experiment with. Read more on those in the Weaviate documentation.

Hybrid search

Finally, we can combine the power of vector search with the BM25 index! Here comes the hybrid search, which uses both methods and combines their results with given weights:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withHybrid({
    query: 'Can the fox really jump over the dog?',
  })
  .withLimit(5)
  .withFields('pageTitle originalContent pageUrl _additional { distance score explainScore }')
  .do();

In the _additional.explainScore property, you will find the details about score contributions from the vector and inverted index searches. By default, the vector search result has a weight of 0.75 and the inverted index result a weight of 0.25, and those are the values we use in our Notion search. More about how hybrid search works and how to customize the query (including how to change the way vector and inverted index results are combined) can be found here.
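The effect of the weights can be illustrated with a simple weighted sum. This is only a sketch of the weighting idea, assuming both scores are already normalized to the 0..1 range; Weaviate's actual fusion algorithms work differently in detail, and the `hybridScore` function is made up for this example.

```typescript
// alpha is the weight of the vector search; the keyword search gets the rest.
function hybridScore(vectorScore: number, keywordScore: number, alpha = 0.75): number {
  return alpha * vectorScore + (1 - alpha) * keywordScore;
}

// A perfect vector match with no keyword match still scores 0.75,
// while a perfect keyword match with no vector match scores only 0.25.
const vectorOnly = hybridScore(1, 0);
const keywordOnly = hybridScore(0, 1);
```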

Rerank

If we enable the rerank module, we can use it to improve the quality of search results. It works for any search method: vector, BM25, or hybrid:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withHybrid({
    query: 'Can the fox really jump over the dog?',
  })
  .withLimit(100)
  .withFields('pageTitle originalContent pageUrl _additional { rerank(property: "originalContent" query: "Can the fox really jump over the dog?") { score } }')
  .do();

Adding a rerank score field to the query will make Weaviate call a rerank module and reorder the results based on the score received. To increase the chance of finding relevant results, we’ve also increased the limit: now rerank has more texts to work on and can find relevant results even if we had a lot of false positives from a hybrid search.

Summary

To summarize: in our Notion index we’ve used Weaviate DB with the following modules:

text2vec-openai enabling Weaviate to calculate embeddings using OpenAI API and ADA model
reranker-cohere allowing us to use CohereAI’s reranking model to improve search results
backup-s3 just to make it easier to backup data and migrate between environments

To get the data to index, we fetch all Notion pages using a search endpoint with an empty query. In each page, we recursively fetch all blocks that are then parsed by a set of parsers: specific for each type of block. We then have a markdown-formatted string for each page.

We then split the contents of each page into chunks: 1000 characters long with 200 characters of overlap. We also “clean up” the texts by removing special characters and multiple whitespaces to improve the performance of vector search.

The data for each page chunk is then inserted into the database with a fairly straightforward schema. We have an index of the chunk and some properties of the Notion page: URL, ID, title, and type. Additionally, we keep both original, unaltered content and cleaned-up versions but we calculate embeddings only from the latter.

To find information in the index, we use the hybrid search with a default limit of 100 chunks, with rerank enabled by default.

What worked and what didn’t

So, the $100mln question. Does it work?

Absolutely! We have a working semantic search that allows us to reliably search for information even without using the exact wording used on the pages we’re looking for. You can search for “parking around the office” or “where to leave my car around the office” or even just “parking?”. How to use a coffee machine? What benefits are available in Brainhub? Which member of the team is skilled in martial arts? Who should I talk to if I want a new laptop? What are Brainhub’s values?

Not everything works perfectly though. Finding information in large tables may be challenging if you’re not smart about chunking them (for example, we have a table of team members that is long, has a lot of columns, and contains long texts). It helps to ensure that one row stays in one chunk, even a very long one, to avoid orphaned columns. Even then the search is not perfect: when asking who is a UX designer in our team, it may find a chunk with only one of the three UX designers in the table. While this is fine for search (in the results, you still get the link to the correct page that contains the whole table), it may not be enough for a Q&A bot, which may miss some information because of it.

Another issue is noise. One of the reasons we wanted a better search was thousands of pages of meeting notes, outdated guidelines, and other mostly irrelevant stuff that lurks in the depths of our Notion workspace. We did implement some mitigations to improve search results and get rid of noise, like lowering the “search score” of old pages, but it was not enough. The best method was still manually excluding the areas that were most problematic. That’s not ideal of course. We would like our search engine to figure out what’s relevant automatically, so that’s something to do more research on.

In general though, the results are more than satisfactory and, while there were a lot of small tweaks here and there needed, we’ve managed to create a Notion search that actually works.

Make Notion search great again: Notion API

Łukasz Pluszczewski — Tue, 17 Oct 2023 07:42:42 +0000

In this series, we’re looking into the implementation of a vector index built from the contents of our company Notion pages. It allows us not only to search for relevant information but also to enable a language model to directly answer our questions with Notion as its knowledge base. In this article, we will explore the Notion API.

Before we can create a searchable index, we need to get the contents of the Notion pages. Let’s see how we used Notion API to do that.

Notion integrations

Before we can start using Notion API we must create an integration, sometimes called a “Connection” in the UI. An “Integration Secret” is generated for each integration which can be then used to access API. You can select permissions for the connection which Notion calls “capabilities”. Our index does not write anything to indexed pages, so we selected only “read” permissions. We also allowed the integration to read user information so that we can replace mentions with people's real names.

An integration can only access pages that it has been manually added to. This access also extends to all of their descendant pages. Keep in mind that what an integration has access to is not as straightforward as you might think. Removing it from a page may not propagate if either permissions or integration access to some child page has been separately modified. Also, an integration has access to non-public pages! Because our solution will give indirect access to all indexed pages to everyone in the company (via the language model answers), we’ve made sure that we’re assigning it only to the pages we’re absolutely sure don’t contain any private information.

Notion client

To communicate with Notion API we’ve used a dedicated typescript library, @notionhq/client.

Below is an example of configuration and simple request:



import { Client as NotionClient, LogLevel } from '@notionhq/client';

const client = new NotionClient({
  auth: 'abcdef123',
  logLevel: LogLevel.WARN,
});

await client.pages.retrieve({ page_id: '3853acec-eebc-42e2-843b-2c340f769b80' });

Besides retrieving pages, we can also fetch databases, blocks, block’s children, etc.



await client.databases.retrieve({ database_id: 'c2da9700-8244-4bc0-bff1-8dccd909b211' });
await client.blocks.retrieve({ block_id: 'b00afc3d-1db2-4cf3-9801-868bd84f06f8' });
await client.blocks.children.list({ block_id: 'b00afc3d-1db2-4cf3-9801-868bd84f06f8' });

Rate limiting

Notion API doesn’t have a hard limit. You’re expected to not exceed an average of 3 req/s but occasional bursts above that are allowed. That’s why our internal rate limiter allows slightly more frequent requests, with proper rate limit error handling just in case. If you reach the rate limit, the Notion API will respond with a specific error code and a “retry-after” header that indicates the wait time in seconds.
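A client-side limiter of this kind can be as simple as spacing requests by a minimum interval. The sketch below is not our actual limiter: the `RequestSpacer` class and the 300 ms interval (a bit more than 3 req/s) are assumptions for illustration. The clock is passed in explicitly so the logic is easy to test.

```typescript
class RequestSpacer {
  private nextSlot = 0;
  private readonly minIntervalMs: number;

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  // Reserves the next free slot and returns how many milliseconds
  // the caller should wait before sending its request.
  reserve(now: number): number {
    const start = Math.max(now, this.nextSlot);
    this.nextSlot = start + this.minIntervalMs;
    return start - now;
  }
}

// Hypothetical interval: 300 ms between requests.
const spacer = new RequestSpacer(300);
const waits = [spacer.reserve(0), spacer.reserve(0), spacer.reserve(100), spacer.reserve(2000)];
// waits is [0, 300, 500, 0]: the burst is spread out, the late request goes straight through
```

In real code the caller would sleep for the returned number of milliseconds (for example with a timer-based promise) before calling the API.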

To ensure that we handle API’s rate limits correctly, we’ve implemented an API client wrapper that handles errors appropriately. Below is a simplified example of rate limit handling:



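// `client` is the NotionClient created earlier. `delay` is assumed to be a
// helper that calls the given function after the given number of milliseconds.
async function request() {
  try {
    return await client.databases.retrieve({ database_id: 'c2da9700-8244-4bc0-bff1-8dccd909b211' });
  } catch (error) {
    // Anything other than a rate limit error is rethrown.
    if (!isNotionApiErrorOfType(error, APIErrorCode.RateLimited)) {
      throw error;
    }

    // Wait time from the retry-after header, converted to milliseconds below.
    const retryAfter = parseInt(error.headers.get('retry-after'), 10);

    return delay(
      () => request(),
      retryAfter * 1000,
    );
  }
}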

Pagination

Most endpoints have a limit on the number of entries returned and provide a “cursor” if there is more data to fetch. Below is a simple example of how to handle the pagination if we want to load all data - a function that fetches all pages:



private async fetchAllPages(query?: string, cursor?: string) {
  // Pass the cursor from the previous response (undefined on the first call).
  const response = await client.search({
    query,
    start_cursor: cursor,
  });

  // A non-empty next_cursor means there are more results to fetch.
  if (response.next_cursor) {
    return [
      ...response.results,
      ...(await this.fetchAllPages(query, response.next_cursor)),
    ];
  }
  return response.results;
}

Since the Notion API does not have a “get all pages” endpoint, the function above uses the search endpoint with an empty query to retrieve all pages. This is not a fully reliable way of doing that: for example, recently added pages or databases may not have been indexed yet and are not going to be returned. We’ve decided that it’s good enough for now.

Blocks

Texts in Notion are structured around “blocks” which are the basic units of content. Whatever you add to the page is a block: a paragraph, a list, a table, and so on. Each block can be standalone, like a paragraph with just some text in it, or have child blocks, like a list item containing a sub-list, etc. Below is an example of a block (from the Notion documentation):



{
    "id": "c02fc1d3-db8b-45c5-a222-27595b15aea7",
    "type": "heading_2",
    "heading_2": {
        "rich_text": [
            {
                "type": "text",
                "text": {
                    "content": "Lacinato kale",
                    "link": null
                },
                "annotations": {
                    "bold": false,
                    "italic": false,
                    "strikethrough": false,
                    "underline": false,
                    "code": false,
                    "color": "green"
                },
                "plain_text": "Lacinato kale",
                "href": null
            }
        ],
        ...
    }
}

There are more properties (like “parent” or “last_edited_time”) that we’ve hidden so that we can focus on what’s important. Each block has a “type” property that tells us what kind of block it is, but also, where to get its contents from. Different blocks have different data structures, so we have a separate piece of code, called “parser” to handle each block type. Below are two examples of parsers:



[BlockTypes.NumberedListItem]: {
  parse: (block, ctx) => {
    const number =
      'number' in block.numbered_list_item
        ? block.numbered_list_item.number
        : null;
    const text = getPlainText(block.numbered_list_item.rich_text, ctx);
    return number ? `${number}. ${text}` : text;
  },
},
[BlockTypes.ToDo]: {
  parse: (block, ctx) => {
    const text = getPlainText(block.to_do.rich_text, ctx);
    const checkbox = block.to_do.checked ? '[X]' : `[ ]`;
    return `${checkbox} ${text}`;
  },
},

The “rich_text” property contains an array of elements that we need to parse. The “getPlainText” function is a simple helper that converts that array into a string. Additionally, it receives a “context” containing the list of users so that it can replace all mentions with actual names.

Our parsers return text formatted as markdown, as it is easily understandable by LLMs and, unlike HTML, doesn’t leave much garbage after removing special characters for embeddings.
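A helper like this could look roughly as follows. This is a sketch, not our actual implementation: the types are a heavily simplified, hypothetical subset of Notion's rich text objects, and the context is assumed to be a plain map from user ID to name.

```typescript
// Simplified shape of a rich_text element (only the fields used here).
interface RichTextItem {
  type: 'text' | 'mention';
  plain_text: string;
  mention?: { type: string; user?: { id: string } };
}

// Assumed shape of the parser context: user ID mapped to a display name.
interface ParserContext {
  users: Record<string, string>;
}

function getPlainTextSketch(richText: RichTextItem[], ctx: ParserContext): string {
  return richText
    .map((item) => {
      // For user mentions, look the user up in the context.
      const userId = item.type === 'mention' ? item.mention?.user?.id : undefined;
      const userName = userId !== undefined ? ctx.users[userId] : undefined;
      // Fall back to Notion's own plain_text when there is nothing to replace.
      return userName !== undefined ? '@' + userName : item.plain_text;
    })
    .join('');
}

const sentence = getPlainTextSketch(
  [
    { type: 'text', plain_text: 'Ask ' },
    { type: 'mention', plain_text: '@Unknown', mention: { type: 'user', user: { id: 'u1' } } },
    { type: 'text', plain_text: ' about laptops' },
  ],
  { users: { u1: 'Jane Doe' } },
);
// sentence is 'Ask @Jane Doe about laptops'
```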

Since blocks can have child blocks, we fetch blocks recursively:



private async getBlocksRecursively(pageOrBlockId: string): Promise<BlockObjectResponse[]> {
  // Pagination of the children list is omitted here for brevity.
  const blocks = await this.notion.blocks.children.list({
    block_id: pageOrBlockId,
  });

  // Each callback returns a promise; Promise.all waits for all of them.
  return await Promise.all(
    blocks.results.map(async (block) => {
      if (block.has_children) {
        return {
          ...block,
          children: await this.getBlocksRecursively(block.id),
        };
      }
      return block;
    }),
  );
}
By gathering all blocks recursively and converting them to text, we get a nice, markdown formatted content of the page.
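The conversion of the block tree to text can be sketched as a simple recursive walk. The `ParsedBlock` type and the `renderBlocks` function are hypothetical: they assume each block has already been turned into a line of markdown by its parser, and that children are indented by two spaces per level.

```typescript
// A block after parsing: its own markdown text plus optional child blocks.
interface ParsedBlock {
  text: string;
  children?: ParsedBlock[];
}

// Render blocks depth-first; children are indented under their parent.
function renderBlocks(blocks: ParsedBlock[], depth = 0): string {
  const lines: string[] = [];
  for (const block of blocks) {
    lines.push('  '.repeat(depth) + block.text);
    if (block.children && block.children.length > 0) {
      lines.push(renderBlocks(block.children, depth + 1));
    }
  }
  return lines.join('\n');
}

const renderedPage = renderBlocks([
  { text: '# Title' },
  { text: '- item', children: [{ text: '- sub-item' }] },
]);
// renderedPage is '# Title\n- item\n  - sub-item'
```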

Pages and Databases

Notion organizes its content into pages. In addition to the block contents described above, each page can also have properties. They are similar to blocks, but they have keys, don't have children, their ID is not a UUID, and they lack certain properties like 'last_edited_time'. The possible types and formats of property values are the same as those of blocks, so we use the same code to parse them. Below is an example of a property, with the key “When” and type “date”:



"When": {
    "id": "some-id",
    "type": "date",
    "date": {
        "start": "2023-03-23",
        "end": "2023-05-05",
        "time_zone": null
    }
},

Notion also includes databases, which are collections of pages that can be filtered, sorted, and organized as needed. When you view the database as a table, what you see in the column contains the value of the corresponding “property” of the given page. In our index, we represent databases and simple tables in the same way.
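Representing a database as a simple table could be sketched like this. The `toMarkdownTable` function is hypothetical and assumes that every page's properties have already been parsed into plain strings keyed by property name.

```typescript
// Each row is one page of the database: property name mapped to its parsed text.
type DatabaseRow = Record<string, string | undefined>;

function toMarkdownTable(columns: string[], rows: DatabaseRow[]): string {
  // Pipes and newlines would break the table layout, so neutralize them.
  const escapeCell = (value: string) => value.replace(/\|/g, '\\|').replace(/\n/g, ' ');
  const header = `| ${columns.map(escapeCell).join(' | ')} |`;
  const divider = `| ${columns.map(() => '---').join(' | ')} |`;
  const body = rows.map(
    (row) => `| ${columns.map((column) => escapeCell(row[column] ?? '')).join(' | ')} |`,
  );
  return [header, divider, ...body].join('\n');
}

// Made-up data for illustration.
const teamTable = toMarkdownTable(
  ['Name', 'Role'],
  [{ Name: 'Ada', Role: 'UX' }, { Name: 'Bob' }],
);
```

Missing properties simply become empty cells, so every row keeps the same number of columns.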

Pages that are members of databases, in addition to their properties, can also have ordinary content. In other words, with each “row” being a page, each “column” is a property of that page, but because the page itself works just like any other, users can add normal content to it: paragraphs, lists, images, and so on. Because the content is not visible in any database view in Notion and is not visible in our representation of a database, we additionally index all members as separate pages.

Retrieving data from Notion can be unintuitive and sometimes tedious, especially when dealing with permission management and handling different types of blocks. However, we've successfully parsed all page and database contents into clean markdown texts. The only thing left to do is to build a vector index from these contents, but we'll cover that in the next article. Stay tuned!

Make Notion search great again: semantic search

Łukasz Pluszczewski — Wed, 04 Oct 2023 11:22:25 +0000

In this series, we’re looking into the implementation of a vector index built from the contents of our company Notion pages. It allows us not only to search for relevant information but also to enable a language model to directly answer our questions with Notion as its knowledge base. In this article, we will see how to use vector embeddings to search and how to improve its performance.

Numbers, vectors, and charts are real data unless stated otherwise

Last time we explored vector embeddings and their main utility for our case: distances between them represent semantic similarities between texts. Let’s see them in action.

Airbus or Boeing?

Let’s consider the following texts:

"Pope John Paul II was the first non-Italian pope in more than..."
"Pope Francis is the head of the Catholic Church, the bishop..."
"Nicolaus Copernicus, a Renaissance-era astronomer..."
"Johannes Kepler was a German astronomer, mathematician, astrologer,..."
"The Tesla Model 3 is an electric car produced by..."
"The Ford Focus is a compact car manufactured by Ford..."
"The Ford Mustang is a series of American automobile..."
"The Dodge Challenger is the name of three different..."
"The Boeing 737 is a narrow-body aircraft produced ..."
"The Airbus A380 is a large wide-body airliner that..."
"The Airbus A320 family consists of short to..."
"Salamanders are a group of amphibians typically..."
"The dog is a domesticated descendant of the wolf..."
"The cat is a domestic species of small carnivorous mammal..."
"Elephants are the largest living land animals..."
"The tiger (Panthera tigris) is the largest living..."
"Rabbits, also known as bunnies or bunny rabbits..."

We have texts about cars, planes, animals, two popes, and two astronomers. We can calculate embeddings for each text and see how far they are from each other. Using OpenAI's ADA-2 model, we would get 1536-dimensional vectors, so we would have a hard time visualizing them. But we have a tool up our sleeves that will help us out. What is that? That’s right, embeddings 🙂

There is nothing stopping us from calculating 2-dimensional embeddings of those vectors so that we can see relationships between them on a flat screen. This time, the small embeddings were calculated algorithmically, not by neural network. Below is the result:

If you’re curious about how to reduce the dimensionality of vectors using the same algorithm, you can read more here

You can find a few interesting things here. Firstly, it’s easy to see that different categories of texts are clearly separated: animals, cars, people, and planes have their own place in the chart and are quite far from other categories. But that’s not all. Two popes are close together and a little further away from astronomers. A dog is close to a cat, but quite far away from a tiger, which in turn is closer to a cat than to a dog or a bunny. Of course, because we decreased the dimensionality of the vectors, we’ve most likely lost a lot of semantic data that is encoded in the 1536 values of the original vectors. We can still see the relationships though.

While embeddings can be used as inputs to neural networks, we don’t need neural networks to use the spatial relationships between them to implement efficient semantic search. It’s enough to calculate the embedding of a search query using the same method and find its closest neighbors in the semantic space. Let’s write some queries about a few topics, calculate embeddings for those queries, and add them to our chart:

It’s quite clear what texts are going to answer our question. The current pope is even closer than the previous one (but we’ve probably just got lucky on that one)!

We’d rather fly Airbus than Nicolaus Copernicus, sounds about right.

AI has spoken! Dogs are the cutest.

I should consider Boeing 737 🤔.

The last one is not so clear. While we see that the vector for the query is closer to the cars than people, in a search engine, Boeing 737 would still be quite high in search results for cars to buy.

Looking at the examples above, keep in mind that we don’t see the actual vectors - just their 2-D embeddings. Regardless, you can probably see the utility of vector spaces: you can find texts that are on similar topics fairly easily. While not perfect, this method is a great first step towards more complex semantic search solutions. Let’s dive deeper to see what we can do with the results to make them better.
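The "find the closest neighbors" idea can be sketched in a few lines of Python. This is a brute-force illustration only: the three-dimensional vectors are made up to stand in for real embeddings, and the names `nearest_texts` and `TOY_INDEX` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_texts(query_vector, indexed, top_k=3):
    """Compare the query with every (text, vector) pair and keep the best top_k."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up 3-dimensional vectors; real embeddings have far more dimensions.
TOY_INDEX = [
    ("The Ford Focus is a compact car", [0.9, 0.1, 0.0]),
    ("The Airbus A380 is a large wide-body airliner", [0.6, 0.7, 0.1]),
    ("The tiger is the largest living cat species", [0.0, 0.2, 0.9]),
]

# Pretend this is the embedding of "What car should I buy?"
toy_query = [0.8, 0.2, 0.1]
results = nearest_texts(toy_query, TOY_INDEX, top_k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

Comparing the query with every stored vector is fine for a handful of texts, but it does not scale, which is why a real search engine needs a vector index.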

Vectors are stupid, language models are not

Well, maybe they are. But they will help us nonetheless. Let’s imagine that for the query “What car should I buy?” we’ve got the following results (not the actual vector search result):

“The Boeing 737 is a narrow-body aircraft...”
“The tiger (Panthera tigris) is the largest living...”
“The Ford Mustang is a series of American...”
“The Tesla Model 3 is an electric car produced by...”
“The Airbus A380 is a large wide-body airliner...”
“The Dodge Challenger is the name of three...”
“The Ford Focus is a compact car manufactured...”

The Boeing 737 is probably not your first choice. Of course, for such a simple query the actual results are much more accurate (vector search is not that stupid), but irrelevant results may appear for more complex and nuanced queries. A capable language model would clearly distinguish between a plane and a car, or between a text that is roughly about the topic and one that actually contains the answer - even a nuanced one. Here comes the rerank!

Rerank

While it’s not feasible to use a big and expensive language model to analyze the thousands or even millions of texts you may have in your database, you can easily afford to let it clean up and reorder your initial vector search results if needed. That’s exactly what rerank models do. They are language models, so, unlike simple vector search, they understand the contents of the texts they are processing. They accept a query and a list of text documents and they “rerank” those documents, giving them scores based on how relevant they are to the query. It’s much more expensive to use those models than to just calculate embeddings, so we only use them after the initial vector search. Let’s use Cohere.ai's rerank model on our not-so-perfect car buying search. The rerank score is in the brackets; while the initial order from the vector search was made up, the results from the rerank model are real:

[0.41] “The Tesla Model 3 is an electric car...”
[0.36] “The Dodge Challenger is the name...”
[0.34] “The Ford Focus is a compact car...”
[0.32] “The Ford Mustang is a series of...”
[0.20] “The tiger (Panthera tigris) is the...”
[0.08] “The Boeing 737 is a narrow-body...”
[0.05] “The Airbus A380 is a large...”

Now we’re talking! We have relevant results at the top thanks to rerank’s ability to actually understand the query and the texts. While it’s still not perfect (tiger seems dangerously close to the Ford Mustang for some reason), it’s enough in the vast majority of cases. Now let’s put all of that into practice and build a proper search engine!
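The two-stage flow (cheap vector search first, expensive rerank on the few survivors) can be sketched as below. The scorer is pluggable: `toy_overlap_score` is a deliberately naive word-overlap stand-in so the example runs offline. It is not a rerank model and has nothing to do with how Cohere's model works. In a real setup you would pass a function that calls your rerank provider instead.

```python
def rerank(query, documents, score_fn, top_n=3):
    """Score each candidate against the query and return the best top_n.

    score_fn(query, document) -> float is supplied by the caller.
    """
    scored = [(score_fn(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query, document):
    """Stand-in scorer: fraction of query words found in the document.

    NOT a language model; it only keeps the example self-contained.
    """
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words)


# Pretend these came back from the initial vector search, in this order.
candidates = [
    "The Boeing 737 is a narrow-body aircraft",
    "The Ford Focus is a compact car",
    "The tiger is the largest living cat species",
]

reranked = rerank("which compact car should I buy", candidates, toy_overlap_score, top_n=2)
for score, doc in reranked:
    print(f"[{score:.2f}] {doc}")
```

Keeping the scorer as a parameter means the search code does not care whether the scores come from a hosted rerank API, a local model, or a test double.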

In the next articles, we’ll see how we get the data from Notion using its API and how we used the Weaviate vector database to build a searchable index out of it.

Make Notion search great again: vector embeddings

Łukasz Pluszczewski — Thu, 28 Sep 2023 08:14:41 +0000

In this series, we’re looking into the implementation of a vector index built from the contents of our company Notion pages that allows us not only to search for relevant information but also to enable a language model to directly answer our questions with Notion as its knowledge base. In this article, we will focus on the theory of vector embeddings.

Numbers, vectors, and charts are real data unless stated otherwise

Notion, with its infinite flexibility, is perfect for keeping unstructured notes, structured databases, and anything in between. Thanks to that flexibility, adding stuff is easy. It’s so easy in fact, that we add, and add, and add, then add some more until we have 5000 pages of meeting-notes, temporary-notes and god-knows-what-notes.
Do you have that problem at your company? Tens of thousands of pages in Notion, created by dozens of people with different ideas about naming and structuring data. You’re adding new pages regularly just to keep things from being lost and then… they are lost. You try to search for something and you get pages that are kind-of on-topic but don’t answer the question, some completely unrelated stuff, a random note from 2016, and an empty page for good measure. Notion truly works in mysterious ways.

We can divide companies into two groups: those before their failed Notion reorganization attempt, and those after. We can try to clean it up, reorganize, and add tags but trying to clean up thousands of pages, with new ones being added daily, is doomed to fail. So we’ve decided to solve this problem with the power of neural networks. Our goal was to create a separate index that would allow us to efficiently search through all the pages in Notion and find ones that are actually relevant to our query. And we did it! (with some caveats - more on that at the end of the series)

But before we get there, let’s dive into the technologies we’re going to need.

Brown foxes and lazy dogs

Let’s consider a simple neural network that accepts text as input. It could be a classification network for example. Let’s say that you want to train the network to detect texts about foxes. How can you represent a text so that a neural network can work on it?

The quick brown fox jumps over the lazy dog

Neural networks understand and can work on vectors.

Well, actually, while a vector is just a one-dimensional tensor and simple neural networks can have vectors as input, more complex architectures, including most language models, work on higher-dimensional tensors. We’re simplifying a bit just to get to the point.

An n-dimensional vector is just a one-dimensional array with length n. The text about the brown fox is definitely not a vector. Let’s do something about it. One way of converting the text to numbers is to create a dictionary and assign each word a unique ID, like this (these are actual IDs from the dictionary used by OpenAI’s GPT models):

The	quick	brown	fox	jumps	over	the	lazy	dog
464	2068	7586	21831	18045	625	262	16931	3290

You may have noticed that 'The' and ' the' (lowercase, with a space at the beginning) have different IDs. Why do you think it’s useful to have those as separate entries in the dictionary?
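A toy version of this dictionary lookup is sketched below. The IDs are copied from the table above; the leading spaces on every word except the first are an assumption based on the remark about ' the'. The greedy longest-match loop is a simplification for illustration, not how the real tokenizer splits text.

```python
# IDs copied from the table above. A real dictionary has tens of thousands of
# entries; the leading spaces are an assumption based on the ' the' remark.
VOCAB = {
    "The": 464, " quick": 2068, " brown": 7586, " fox": 21831,
    " jumps": 18045, " over": 625, " the": 262, " lazy": 16931, " dog": 3290,
}


def encode(text, vocab):
    """Turn text into IDs by repeatedly taking the longest matching entry.

    Raises KeyError when a piece of text has no entry in the dictionary.
    """
    ids = []
    position = 0
    while position < len(text):
        match = None
        # Try the longest remaining substring first, then shorter ones.
        for end in range(len(text), position, -1):
            if text[position:end] in vocab:
                match = text[position:end]
                break
        if match is None:
            raise KeyError(f"no dictionary entry at position {position}")
        ids.append(vocab[match])
        position += len(match)
    return ids


fox_ids = encode("The quick brown fox jumps over the lazy dog", VOCAB)
print(fox_ids)
```

Try encoding "The quick red fox" with this tiny dictionary: it fails on the unknown word, which is exactly the missing-entry problem discussed next.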

So we figured out how to convert text into a vector, great! There are issues with it though. First, even though our example is straightforward and small, real texts may be much longer. Our vector changes dimensionality based on the length of the input text (which in itself may be an issue for some simple architectures of neural networks) so for long texts, we need to deal with huge vectors. Additionally, it’s harder to successfully train the neural network on data that contains big numbers (without going into details, while neural networks do calculations on big numbers just fine, the issue is the scale of the numbers and the potential for numerical instability during training when using gradient-based optimization methods). We could try one-hot encoding (by taking all entries from a dictionary, assigning 0 to entries that do not appear in our text, and 1 to ones that do) to have a constant-length vector with only ones and zeros, but that doesn’t really help us that much. We still end up with a large vector - this time it’s huge no matter the length of the input text and additionally, we lose the information about word position. And last but not least: what about the words we don’t have in our dictionary? We need syllables or even separate letters to be assigned IDs in some texts which makes the resulting vector even longer and messier.
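To see the constant length and the lost word order in practice, here is a minimal sketch of the zeros-and-ones vector described above (with several ones set at once it is strictly a multi-hot, or bag-of-words, vector). `VOCAB_SIZE` is an assumption: 50257 is the size of the GPT-2 dictionary that the IDs above appear to come from.

```python
VOCAB_SIZE = 50257  # assumption: size of the GPT-2 dictionary


def presence_vector(token_ids, vocab_size):
    """1 at every position whose ID appears in the text, 0 everywhere else."""
    vector = [0] * vocab_size
    for token_id in token_ids:
        vector[token_id] = 1
    return vector


fox_ids = [464, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290]
fox_vector = presence_vector(fox_ids, VOCAB_SIZE)

# The same words in reverse order produce exactly the same vector,
# so the information about word position is gone.
reversed_vector = presence_vector(list(reversed(fox_ids)), VOCAB_SIZE)

print(len(fox_vector), sum(fox_vector), fox_vector == reversed_vector)
```

Nine ones in a vector of over fifty thousand entries: constant length, but mostly empty space.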

The dictionary-based vectors are used in practical NLP applications e.g. as inputs to neural networks. One-hot encoding is also used but for different purposes (e.g. when the dictionary is small, or for categorization problems).

There is one thing that our vector could tell us about the input text, but doesn’t - the meaning. We could decide if the text is about foxes just by checking if there is an element 21831 in it, but what about the texts about "omnivorous mammals belonging to several genera of the family Canidae” or about "Vulpes bengalensis”? Meaning can be conveyed in many more ways than we can include in a simple algorithm or a whitelist of words (I bet you wouldn’t think of adding “bengalensis” to your dictionary, would you?). For that reason, to know for sure, we need to process the whole text, even if thousands of words long. The only way we can reliably decide if the text is about foxes is to pass the vector on to the advanced, and expensive, language model to process it and understand what those words, and the relationships between them, mean. Or is it?

Embeddings to the rescue

Embed… what? In simple terms, embedding is a way of representing data in a lower-dimensional space while preserving some relationship between data points. In other words, an embedding is a smaller vector representation of a larger vector (or, more generally, a tensor) that still contains some important information about it. It can be calculated algorithmically, but in cases like ours this is done with the help of neural networks.

There are many ways to reduce the dimensionality of the data, like principal component analysis which is fast, deterministic, and used in data compression, noise reduction, and many other areas, or t-SNE which is used mainly in high dimensional data visualization and in the next article of this series ;)

The simplest example would be to represent our huge vector with potentially thousands of dimensions as just one number, representing how much about foxes the text is. For the sentence about jumping fox, we would probably get something like this:

0.89

Not very exciting. In reality, embeddings are a little bigger - depending on the use case they may have hundreds or thousands of dimensions.

You may now think: "Wait a moment. Thousands of dimensions? That is certainly not less than those few numbers representing our jumping fox, is it?". In our case yes - the embedding will actually be bigger than the dictionary-based vector. But that vector is also an embedding of a huge, practically infinite-dimensional vector in the space of all possible texts. In that sense, we have a dictionary-based embedding which can be small for short texts but is not very useful, and a powerful embedding that may be a little bigger but gives us way more easily extractable information about the text.

Embeddings can be calculated from many different types of data: images, sounds, texts, etc. Each of those is just a big vector, or actually, a tensor. You can think about an image as a 2-dimensional tensor (a matrix) of color values, with as many entries as there are pixels in the image. A sound file is often represented as a spectrogram, which is also a 2-dimensional tensor.

In each case, each dimension of the, often smaller, embedding vector encodes a different aspect of the item’s meaning.

It’s also worth mentioning that, while in principle each dimension of the embedding encodes some aspect of the original, in most cases it is not possible to determine what exactly each number in the embedding means. The idea of “foxness”, while it may be preserved in the embedding in some way, is foreign to the embedding model and is unlikely to be encoded in a single dimension, but rather in a much more complicated and nuanced way.

Because vectors are also points in n-dimensional spaces, we can think about the relative positions in those spaces as semantic relationships. The closer the embedding vectors are, the more similar the data points are. In the case of image embeddings, images of dogs would be very close to each other, quite close to images of foxes but far away from images of spaceships. The same applies to text embeddings.

By “closer” we don’t necessarily mean the lowest Euclidean distance but rather the smallest angle, measured by cosine similarity. The reason for that is that in high-dimensional space the distances or the magnitudes of vectors become much less informative. The actual semantic differences are encoded in the angles, or directions, of vectors. The magnitudes may also contain useful information such as emphasis or frequency, but they’re harder to use due to "the curse of dimensionality" - a set of phenomena resulting from increasing the number of dimensions, like the rapid increase in space volume.
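A small sketch shows how the two measures can disagree. The three vectors are made up for illustration: two point in exactly the same direction but differ in length, and the third has a similar length but points elsewhere.

```python
import math


def cosine_similarity(a, b):
    """1.0 means the same direction, 0.0 means perpendicular."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


short = [1.0, 2.0, 3.0]
long_same_direction = [10.0, 20.0, 30.0]  # same direction, ten times longer
different_direction = [3.0, 2.0, 1.0]     # similar length, different direction

for name, other in [("same direction", long_same_direction),
                    ("different direction", different_direction)]:
    print(name,
          "cosine:", round(cosine_similarity(short, other), 3),
          "euclidean:", round(euclidean_distance(short, other), 3))
```

Euclidean distance considers the vector pointing in a different direction to be the closer one, simply because it is short; cosine similarity ignores the length and picks the vector with the same direction.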

While all this is true of the types of embeddings we talk about (and care about) in this series, in reality not all embedding methods encode semantic information about the original. What is encoded in the embedding values, and what are the relationships between datapoints depend on the specific training process of the embedding model and its purpose. You can easily imagine an embedding calculated from an image that captures some aspects of color palette, or an audio embedding that encodes its atmosphere instead of semantic contents etc.

As mentioned earlier, we calculate embeddings like this with the help of a neural network. There are many models that can do this, from simple and small models you can run on your computer to large, multi-language, commercially available models like OpenAI's ADA. In our case, for simplicity of use and to get the best performance possible, we’ve decided to use the latter. Below is a fragment of the text embedding of our fox text calculated using OpenAI's ADA embeddings model. The particular version of the ADA model we’ve used creates 1536-long vectors:

0.0035436812	0.0083188545	-0.014174725	…	-0.031340215	0.018409548	-0.041342948

A long vector like this is not terribly useful for us by itself. We can’t really “read” any interesting information from it, nor can we determine if it’s about foxes just by looking at it. However, we can use the relationships between embeddings of different texts to our advantage, for example, to find foxes or to implement efficient semantic search. But this is a topic for the next article.

Case study: PDF Insights with AWS Textract and OpenAI integration

Łukasz Pluszczewski — Mon, 14 Aug 2023 11:57:32 +0000

Original problem - automated PDF summarization

The company approached us with the issue of a large quantity of data to sift through in the form of pitchdecks. While each pitchdeck is generally fairly short, in most cases around 10 slides, the issue is the number of them to analyze. We were faced with the task of automating the extraction of the most important information from an unstructured, hard-to-parse format - PDF. Additionally, the data is in the form of slides, with a lot of graphical cues and geometric relations between words that convey information not easily inferred from the text itself. To make it easier to analyze a large amount of data, we would need a solution that would automate as much of that process as possible: from reading the document itself, to finding interesting pieces of information like names of people involved, financial data, and so on.

Why is text extraction so hard?

The first issue we faced was getting the text contents from a PDF file. While extracting text directly from the PDF using open-source tools like pdf-parse (which is used internally by langchain’s pdf-loader) did the job most of the time, we still had some issues with it: some PDFs were not parsed correctly and the tool returned an empty string (as in the case of the Uber sample pitchdeck), some words came out split into individual characters, and so on.

Unfortunately, getting the text contents of the PDF was just the beginning. The text in PDF is all over the place: we had slides with two or three words, some tables, lists, or just paragraphs squished between images. Below is the example of text extracted from page 2 of the example reproduction of AirBnB early pitchdeck (link, extraction done with pdf-parse library):

Welcome
AirBed&
Breakfast
Book rooms with locals, rather than hotels.
1
This is a PowerPoint reproduction of
an early AirBnB
pitchdeck
via Business Insider @
http
://
www.businessinsider.com
/airbnb
-
a
-
13
-
billion
-
dollar
-
startups
-
first
-
ever
-
pitch
-
deck
-
2011
-
9

And this is one of the better ones!

While parsing text like this is hard in itself, we would also like to be able to modify what we extract from the text. We may want to know what people are involved in a business. Or do we just want to get all financial data, or maybe just the name of the industry? Each type of data extracted requires a different approach to parsing and validating text, and then a lot of testing.

How can it be solved?

Reliable text extraction

First, we’ve decided to leave open-source solutions behind. We’ve used AWS Textract to parse PDF files. This way we don’t rely on the internal structure of the PDF to get text from it (or to get nothing - like in the case of the Uber example). Textract uses OCR and machine learning to get not only text but also spatial information from the document.

Here is the Textract result (with all geometric information stripped) from the same page of the AirBnB pitchdeck reproduction:

AirBed&Breakfast
Book rooms with locals, rather than hotels.
This is a PowerPoint reproduction of an early AirBnB pitchdeck via Business Insider @
http://www.businessinsider.com/airbnb-a-13-billion-dollar-startups-first-ever-pitch-deck-2011-9

But that’s not all! Textract responds with a list of Blocks (like “Page”, or “Line” for a line of text), together with their position and relationships, which we can use to understand the structure of the document better:

{
    "BlockType": "LINE",
    "Confidence": 99.91034698486328,
    "Geometry": {
      "BoundingBox": {
        "Height": 0.22368884086608887,
        "Left": 0.8931880593299866,
        "Top": 0.024829095229506493,
        "Width": 0.05453843995928764
      },
      "Polygon": [
        {
          "X": 0.9477264881134033,
          "Y": 0.02518528886139393
        },
        {
          "X": 0.9472269415855408,
          "Y": 0.2485179454088211
        },
        {
          "X": 0.8931880593299866,
          "Y": 0.2481813281774521
        },
        {
          "X": 0.8936926126480103,
          "Y": 0.024829095229506493
        }
      ]
    },
    "Id": "7a88c32b-a0f6-4392-aed5-c5ab8977f162",
    "Page": 1,
    "Relationships": [
      {
        "Ids": [
          "396d8b87-4712-4db0-a77d-0abbbf151bd3"
        ],
        "Type": "CHILD"
      }
    ],
    "Text": "Welcome"
  },

Most of the time, we don’t need such details, so in our case, we use only a fraction of them.
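Stripping a response down to plain lines of text could look roughly like the sketch below. It relies only on the fields visible in the fragment above (`BlockType`, `Page`, `Geometry.BoundingBox.Top`, `Text`) and on the response keeping its blocks under a `Blocks` key. The sample response is hand-written and heavily shortened, with made-up positions, and `extract_lines` is a hypothetical helper rather than the code we run in production.

```python
def extract_lines(textract_response):
    """Return the text of every LINE block, ordered by page and then top-to-bottom."""
    lines = [
        block for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
    lines.sort(key=lambda b: (b.get("Page", 1), b["Geometry"]["BoundingBox"]["Top"]))
    return [block["Text"] for block in lines]


# Hand-written, shortened sample; positions are made up for illustration.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1,
         "Geometry": {"BoundingBox": {"Top": 0.0, "Left": 0.0}}},
        {"BlockType": "LINE", "Page": 1, "Text": "Book rooms with locals, rather than hotels.",
         "Geometry": {"BoundingBox": {"Top": 0.55, "Left": 0.2}}},
        {"BlockType": "LINE", "Page": 1, "Text": "AirBed&Breakfast",
         "Geometry": {"BoundingBox": {"Top": 0.40, "Left": 0.3}}},
        {"BlockType": "WORD", "Page": 1, "Text": "AirBed&Breakfast",
         "Geometry": {"BoundingBox": {"Top": 0.40, "Left": 0.3}}},
    ]
}

page_text = "\n".join(extract_lines(sample_response))
print(page_text)
```

Sorting only by the top edge is a simplification: slides with multiple columns would need the left edge, or the block relationships, taken into account as well.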

Summarisation process and AI

Now to actually parse the text and pull what we want from it. For that, the only solution that seemed viable was to use a language model. While we tested some open-source solutions, they were not up to the task. Hallucinations were too common, and responses too unpredictable. Additionally, most capable open-source models available today are not licensed to be used commercially. So we went with the OpenAI GPT-3.5 and GPT-4 models.

We’ve decided to first let the model summarise the text and include all information from the pitchdeck in that summary. That way we have text that is complete (not just the outline) and has a structure that is easier to work with. We’ve used the following prompt for each page of the document:

Below is the text extracted from a single page of pitchdeck PDF. Write a summary of the page. List all people, statistics, and other data mentioned.
Include only what is in the text, avoid adding your own opinions or assumptions.
Answer with the bullet point summary and nothing more.

With additional instructions like “avoid adding your own opinions or assumptions” we minimize the hallucinations (models like to add fake data to the summary. GPT-3 even added a completely fake financial analysis!). When we have a summary of all pages we can ask the model to extract information from it. Here is an example of the prompt we’ve used to get the list of people referenced in the document:

List all people mentioned in the pages of the pitchdeck. Add their roles if that information is in the text.
Answer with the bullet point list and nothing more. Include only information that is in the pages, don't add your own opinions or assumptions.
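The two-step flow (summarise every page, then ask questions about the summaries) can be sketched as follows. The prompts are the ones quoted above; everything else is an assumption made for illustration: the `complete` parameter stands for whatever function sends a prompt to the model and returns its answer, the page text is simply appended below the prompt, and `recording_complete` is a fake that lets the example run without any API.

```python
SUMMARY_PROMPT = (
    "Below is the text extracted from a single page of pitchdeck PDF. "
    "Write a summary of the page. List all people, statistics, and other data mentioned.\n"
    "Include only what is in the text, avoid adding your own opinions or assumptions.\n"
    "Answer with the bullet point summary and nothing more."
)
PEOPLE_PROMPT = (
    "List all people mentioned in the pages of the pitchdeck. "
    "Add their roles if that information is in the text.\n"
    "Answer with the bullet point list and nothing more. Include only information "
    "that is in the pages, don't add your own opinions or assumptions."
)


def summarise_pages(pages, complete):
    """One model call per page; `complete` is a caller-supplied prompt -> answer function."""
    return [complete(f"{SUMMARY_PROMPT}\n\n{page}") for page in pages]


def extract_people(summaries, complete):
    """A single call over all page summaries joined together."""
    joined = "\n\n".join(summaries)
    return complete(f"{PEOPLE_PROMPT}\n\n{joined}")


# Fake model used only to keep the sketch runnable; it records what it was asked.
calls = []


def recording_complete(prompt):
    calls.append(prompt)
    return "- summary of: " + prompt.splitlines()[-1]


pages = ["Welcome AirBed&Breakfast", "Book rooms with locals, rather than hotels."]
summaries = summarise_pages(pages, recording_complete)
people = extract_people(summaries, recording_complete)
print(len(calls), "model calls for", len(pages), "pages")
```

The call count makes the cost visible: one request per page plus one per question, which is worth keeping in mind for longer decks.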

The summarisation returned by the models (both GPT-3 and GPT-4) is of good quality: the information returned is factual and whatever is plainly stated in the document will end up in the summary as well.

However, the extracting of the list of people is a different story. Models, especially GPT-3, often answer with a list similar to this (not an actual response):

- Uber
- John Doe (CEO)
- Anabella Moody (CPO/CTO)
- j.doe@example.com
- (123) 555-0123

Not only is this clearly not a correct list of people, but the email was not in the source text at all: the model made it up!

We’ve also experimented with many variations of that prompt like:

Adding information that this is text extracted from PDF doesn’t seem to make any difference - models treat the input text the same way. When looking at the data there really isn’t any information for the model to infer anything from. We would need to include actual geometry data.
Skipping the summarisation part, and asking the model to get information directly from the text extracted from the whole document. This didn’t have much effect either (although I’ve seen slightly worse responses in at least one case, the difference was very subtle), which would suggest that we don’t need the summarisation step, especially since we do it for each page and so make quite a lot of requests. We’ve decided to keep it, however, as we may need a summary anyway.
Providing GPT with text together with spatial information returned by Textract. While this seems like a way to allow the model to infer some visual cues it is hard to figure out the right format. The JSON that Textract returns is quite verbose and it’s often too long to pass to the model (even with unnecessary fields stripped). Splitting up a page into smaller chunks seems wrong as the page context is often important to understand a chunk. This still needs investigating and more experiments.
While trying to solve the issue with inaccurate or hallucinated answers we’ve tried feeding the model with its own answer so that it can validate and fix it. Unfortunately, our tests with GPT-3 failed - it didn’t see any issues with its made-up emails and phone numbers on a list that was supposed to contain the names of people. We need more tests with this approach using the GPT-4 model though.
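One cheap complement to asking the model to check itself is a plain, deterministic check of its answer against the source text. The sketch below is not something we have in the project; it is a hypothetical illustration that rejects list items that look like an email or phone number, or whose name does not appear in the source text at all. The source text and the `check_people_list` helper are made up for the example.

```python
import re

# Crude heuristic (an assumption): an "@" or three digits in a row is not a person's name.
EMAIL_OR_PHONE = re.compile(r"@|\d{3}")


def check_people_list(answer, source_text):
    """Split a bullet-list answer into (accepted, rejected) items."""
    accepted, rejected = [], []
    haystack = source_text.lower()
    for line in answer.splitlines():
        item = line.lstrip("- ").strip()
        if not item:
            continue
        # Drop a trailing "(role)" before looking the name up in the source.
        name = re.sub(r"\s*\(.*?\)\s*$", "", item)
        if EMAIL_OR_PHONE.search(item) or name.lower() not in haystack:
            rejected.append(item)
        else:
            accepted.append(item)
    return accepted, rejected


source_text = "Uber pitch deck. Team: John Doe, CEO. Anabella Moody, CPO/CTO."
answer = """- Uber
- John Doe (CEO)
- Anabella Moody (CPO/CTO)
- j.doe@example.com
- (123) 555-0123"""

accepted, rejected = check_people_list(answer, source_text)
print("accepted:", accepted)
print("rejected:", rejected)
```

Note that "Uber" still passes: the check can tell that something was made up or is obviously not a name, but it cannot tell a company from a person, so it narrows the problem rather than solving it.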

Next steps

What we’re still missing, and what is probably the most difficult part, is the ability to interpret the images and spatial relationships in PDF slides. While AWS Textract returns some spatial information, it does not recognize images, and the data returned is hard to pass to the model. We’re still investigating how to make the model understand arrows, charts, and tables. Additionally, we would like to automate the process of online research, e.g. find more information about companies mentioned in the documents using available APIs (like Crunchbase) or fetch more data on the people involved.

Summary

The case study addresses automating the extraction of vital details from numerous PDF pitchdecks. These decks are concise but numerous, making manual analysis impractical. The challenge involves extracting text and interpreting graphical elements. AWS Textract was employed for text extraction due to its OCR and layout understanding capabilities. OpenAI's GPT-3.5 and GPT-4 models were used to summarize and extract information, yet challenges arose in accurately extracting specific data like people's names or financial data. The study acknowledges the need to enhance image interpretation to understand visual elements better.

[EDIT]: Since the publication of this case study some new tools have appeared that make the process of parsing PDF presentations much, much easier. With multimodal language models like GPT-Vision, we can skip the OCR step and allow the system to interpret visual cues better than any pure text solution ever could. Stay tuned for more on that!