Forem: Sameer Kulkarni

What are Vector Databases? A Beginner's Guide

Sameer Kulkarni — Fri, 12 Jul 2024 05:18:49 +0000

Have you ever used AI tools like ChatGPT to chat with a robot or Stable Diffusion to generate unique images? Wondered how they work?

Familiar apps, like your contacts app, use structured data, i.e., name, phone number, email address, and a few more text and numeric fields. AI apps, by contrast, deal with unstructured data such as articles, chats, pictures, and multimedia. This kind of data is complex for computer programs to process, and its context is hard for them to understand.

Applications like contact apps work very well with SQL-type databases since their requirements are fairly simple. However, when it comes to AI apps and tools that refer to different kinds of unstructured data, regular databases aren't made to handle it. This is where vector databases come in! They're a special kind of database that helps AI platforms store and process unstructured data easily. That is why many AI tools are switching to vector databases or using them from the beginning.

Vector databases are the backbone of many AI applications. In this blog post, we'll look at what they are, how they differ from traditional SQL and NoSQL databases, and the benefits they bring to the table. Let’s get started 🚀.

Types of data

Before diving deeper into vectors and vector databases, let's first understand the types of data and how each type affects which database is best suited for storing and querying it.

(K21Academy)

Structured data

Any data that has a predefined and fixed schema for it is called structured data. This data can be represented in a tabular format. SQL databases, spreadsheets, etc. are best suited for storing and interacting with this type of data.

Semi-structured data

Semi-structured data is data with a flexible schema. In other words, it is “self-describing” data such as JSON or XML files. These files contain data as well as structural details about themselves. This type of data is ever-evolving, as it lacks a fixed schema.

Document stores, Key-Value stores, and Graph databases are some examples of suitable databases for storing and analyzing semi-structured data.

Unstructured data

Unstructured data is any other data, mostly in the form of natural language and/or multimedia. Some examples are news articles, social media discussions, images, and audio-visual content. Most of the data that exists today is unstructured, and it is also the most difficult kind for computer systems to comprehend. For example, consider a simple sentence: “Sam likes Football”. Depending on the context and the background of the person, “Football” can mean either American Football or Soccer. Similarly, a single picture can be seen and interpreted in numerous ways depending on the person, their point of view, and the information being sought from it. This is called the latent semantic meaning of the given data.

The latent semantic meaning of the data can be extracted by evaluating it against a large number of attributes, such as words immediately preceding and following the text, context of the data, person’s background, grammar, etc., and converting it into n-dimensional numerical arrays, i.e., vector embeddings.

Vector databases excel at storing, indexing, and querying such data.

The image below shows how the same kind of data can be represented in the three types listed above.

What are vectors?

A vector is a mathematical object that represents information along more than one dimension. Being able to represent data in multiple dimensions is what makes vectors so useful in AI and machine learning (ML). Unstructured data carries latent semantic meaning in multiple attributes. Hence, the semantic meaning of any given data point needs to be captured with respect to all the attributes relevant to the model.

(Wikimedia)

Below are some examples of unstructured data containing semantic meaning in multiple attributes.

Text: The text you read can contain semantic information in many ways, such as individual words, their relationships, part of speech, sentence structure, grammar rules, overall topic/theme, sentiment, etc. Depending on the purpose, multiple AI models can use the same text data to evaluate the latent semantic meaning and form different vector data.
Images: Pixel data in images can be used to describe the image in vector forms. Some of the features that the models can consider would be colors, tones, shapes, edges, brightness, identifying objects, location, arrangement, artistic style, etc.
Audio: For audio data, sound waves can be converted to numerical representations and evaluated on a number of parameters such as volume, timbre, tempo, attack and decay, frequencies, harmonics, etc.

Vector embeddings

In simple terms, vector embeddings, a.k.a. embeddings, are high-dimensional vectors generated by neural networks. A typical vector database stores many such embeddings.

Vectors help us represent data in mathematical form, which modern CPUs and GPUs are well suited to process. However, turning data into numbers isn’t useful unless the numbers capture the meaning of the data. For example, we can easily compare the words in two sentences using computers, but that may not tell us whether they mean the same thing or are about the same topic. This is where vector embeddings and embedding models come into play. An embedding model is trained on a large amount of data and converts each input into a numerical representation in a high-dimensional space. Each dimension captures some attribute or feature that the model has learned to extract from the data.

Here is an example of vector embeddings generated using word2vec, a well-known model for generating word embeddings. As you can see, it converts each word into an array of floating point numbers. It does that by evaluating the words on a set number of parameters, since computers can process numbers far more easily than the meaning of the words themselves.

Word: "king"

Embedding: [-0.25, 0.42, 0.18, ..., 0.71]

Word: "queen"

Embedding: [-0.31, 0.38, 0.24, ..., 0.68]

Word: "computer"

Embedding: [0.17, -0.52, 0.89, ..., 0.34]

Word2vec is a classic example of a text embedding model used in Natural Language Processing (NLP). This model encodes semantic meaning and relationships between words. Thus, it places words with similar meanings closer together. The image below shows the visualization produced by TensorFlow’s projector tool for this model.

(Word2vec 10k embedding projector)
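To make "closer together" concrete, here is a minimal sketch that compares embeddings with cosine similarity. The 4-dimensional vectors are made up for illustration; they are not real word2vec output, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, invented for this example only.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1, 0.2],
    "queen": [0.85, 0.75, 0.2, 0.3],
    "computer": [0.1, 0.2, 0.9, 0.8],
}

king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
king_computer = cosine_similarity(toy_embeddings["king"], toy_embeddings["computer"])

print(f"king vs queen:    {king_queen:.3f}")
print(f"king vs computer: {king_computer:.3f}")
```

With these toy values, "king" scores higher against "queen" than against "computer", which is the behaviour a trained embedding model aims for.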

How do vector databases work?

Vector databases help store the vector embeddings generated by the embedding model. These vector embeddings are multidimensional numerical arrays that can sometimes span across hundreds of dimensions. Traditional scalar based databases can’t work with this type of data efficiently; this is why we need specialized databases like vector databases which are specifically designed to handle this data type. Some reasons why traditional databases can’t work with vector data efficiently are:

Vector embeddings are high-dimensional arrays of numbers that don’t fit neatly into a structured, tabular format.
The indexing algorithms of traditional databases are optimized for structured and semi-structured data. Vector embeddings need a different set of algorithms for indexing.
Most traditional databases lack built-in functionality for the distance calculations, like cosine similarity, needed to compare and query vectors.

The diagram below shows a simplified view of how a typical vector database works.

First, we feed the required data to the embedding model in large quantities. The model runs a data pipeline that largely consists of two steps:

Preprocessing: During preprocessing, the model cleans the data, removes noise and inconsistencies, and then normalizes the data.
Embedding: The embedding stage includes feature extraction & vectorization (encoding) of the data, followed by dimensionality reduction. This last step reduces the number of dimensions of the vector, if needed, for efficiency.
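The two steps above can be sketched with a toy pipeline. This is only a stand-in, assuming a hashed bag-of-words in place of a neural embedding model: it shows the shape of the pipeline (clean, vectorize, normalize) but does not capture semantic meaning. The names `preprocess`, `embed`, and `DIMENSIONS` are hypothetical.

```python
import math
import re
import zlib

DIMENSIONS = 16  # hypothetical size; real embeddings often use hundreds of dimensions


def preprocess(text):
    """Lowercase the text and keep only alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text, dimensions=DIMENSIONS):
    """Toy stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vector = [0.0] * dimensions
    for token in preprocess(text):
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        vector[zlib.crc32(token.encode("utf-8")) % dimensions] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


vector = embed("Sam likes Football!")
print(len(vector), round(sum(v * v for v in vector), 6))
```

Normalizing to unit length is a common design choice because it lets a later similarity search rely on the dot product alone.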

The vector embeddings are then inserted into the vector database for long-term storage. Each embedding is stored with a reference to the original content it was created from. Because the embedding model places similar content near each other, data with similar semantic meaning ends up close together in the vector space.

In order to get query results from the database, the query is also run through the same embedding model as the data. This converts the query into another vector. The vector database then searches for data points closest to the query vector based on the distance metric and returns the results. In some cases, the database also needs to post-process the nearest neighbors of the queried data before returning them.
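The insert-then-query flow can be sketched as a tiny in-memory store that scans every row. This is a brute-force illustration, not how a production vector database searches; the class name `TinyVectorStore`, the 3-dimensional vectors, and the document references are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, so smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


class TinyVectorStore:
    """Keeps (vector, reference) pairs and answers queries by scanning all of them."""

    def __init__(self):
        self._rows = []

    def insert(self, vector, reference):
        self._rows.append((list(vector), reference))

    def query(self, vector, k=2):
        # Rank every stored vector by its distance to the query vector.
        ranked = sorted(self._rows, key=lambda row: cosine_distance(vector, row[0]))
        return [reference for _, reference in ranked[:k]]


store = TinyVectorStore()
store.insert([0.9, 0.1, 0.0], "doc-1: football match report")
store.insert([0.8, 0.3, 0.1], "doc-2: soccer league table")
store.insert([0.0, 0.2, 0.9], "doc-3: database tuning guide")

# In a real system the query vector would come from the same embedding model.
print(store.query([1.0, 0.2, 0.0], k=2))
```

A full scan like this costs time proportional to the number of stored vectors on every query, which is the cost that indexes are meant to reduce.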

Indexing algorithms

AI and ML applications need to be trained on a large amount of data. On top of that, the generated vector embeddings can have hundreds of dimensions, so vector databases need to store a huge amount of data. Querying such a large volume of data is slow, even for vector databases. Hence, just as in scalar-based databases, indexing helps vector databases answer queries faster. Vector database indexes build a quickly traversable data structure using one of several algorithms. Below are some of the common ones.

Hierarchical Navigable Small World (HNSW)

HNSW builds a hierarchy of graphs. Each node represents a vector, and each edge connects two vectors that are similar to each other. The data points are organized into a series of graphs at different levels: the upper levels contain fewer points and act as shortcuts, while the lowest level contains every point. A search starts at the top level and moves down, following edges toward the points closest to the query.

The benefits of HNSW include efficient approximate nearest neighbor searches, fast search times, and scalability with large datasets.
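The sketch below shows greedy search on a single-layer proximity graph, which is the building block that HNSW stacks into multiple levels. It is not a full HNSW implementation: there are no layers, no candidate list, and the graph is built by brute force. The function names and sample points are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_knn_graph(points, m=2):
    """Link every point to its m nearest neighbours (brute force, for illustration)."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: euclidean(p, points[j]),
        )
        graph[i] = others[:m]
    return graph


def greedy_search(points, graph, query, entry=0):
    """Walk the graph, always moving to the neighbour closest to the query."""
    current = entry
    while True:
        best = min(graph[current], key=lambda j: euclidean(query, points[j]))
        if euclidean(query, points[best]) >= euclidean(query, points[current]):
            return current  # no neighbour is closer, so stop here
        current = best


points = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
graph = build_knn_graph(points, m=2)
print(greedy_search(points, graph, [4.2, 0.0], entry=0))
```

Greedy search only inspects the neighbours along its path, so it can stop at a local optimum. That is why the result is an approximate nearest neighbour, and why HNSW adds upper layers with long-range links.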

Locality Sensitive Hashing (LSH)

Locality sensitive hashing maps data points into buckets in multiple hash tables using hash functions. These hash functions are chosen so that similar vectors land in the same or nearby buckets, which helps the database locate query results faster.

The benefits of using LSH for indexing are very fast filtering of dissimilar points and good performance on large datasets with high dimensionality.
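One common LSH family for cosine similarity uses random hyperplanes: each hyperplane contributes one bit depending on which side of it the vector falls. The sketch below assumes that scheme with a single hash table; the function names, the seed, and the sample vectors are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_planes, dimensions, seed=0):
    """Draw random hyperplane normals; a fixed seed keeps the hashes reproducible."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dimensions)] for _ in range(num_planes)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        int(sum(v * h for v, h in zip(vector, plane)) >= 0.0)
        for plane in hyperplanes
    )


hyperplanes = make_hyperplanes(num_planes=8, dimensions=3)
vectors = {
    "a": [1.0, 0.9, 0.1],
    "a_scaled": [2.0, 1.8, 0.2],      # same direction as "a"
    "opposite": [-1.0, -0.9, -0.1],   # points the other way
}

buckets = defaultdict(list)
for name, vector in vectors.items():
    buckets[lsh_signature(vector, hyperplanes)].append(name)

for signature, names in buckets.items():
    print(signature, names)
```

At query time, only the vectors in the query's bucket need an exact distance check, which is where the fast filtering comes from.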

Product Quantization (PQ)

Product quantization is a lossy compression technique for high-dimensional data. It splits each vector into subvectors and quantizes each subvector separately using its own codebook. During encoding, each subvector is compared to the codewords in its codebook, and the index of the closest codeword is stored in place of the original values.

PQ helps in efficient approximate nearest neighbor search and reduction in storage requirements.
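Here is a minimal sketch of product quantization, assuming plain k-means to train each codebook and random data in place of real embeddings. The sizes (8 dimensions, 4 subvectors, 16 codewords) and all function names are hypothetical choices for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(vector, num_subvectors):
    """Cut a vector into equal-length, contiguous subvectors."""
    size = len(vector) // num_subvectors
    return [vector[i * size:(i + 1) * size] for i in range(num_subvectors)]


def train_codebook(subvectors, num_codewords, iterations=10, seed=0):
    """Plain k-means over one subspace; the centroids become the codewords."""
    rng = random.Random(seed)
    codewords = [list(v) for v in rng.sample(subvectors, num_codewords)]
    for _ in range(iterations):
        clusters = [[] for _ in codewords]
        for v in subvectors:
            nearest = min(range(len(codewords)), key=lambda c: squared_distance(v, codewords[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old codeword if a cluster ends up empty
                codewords[c] = [sum(col) / len(members) for col in zip(*members)]
    return codewords


def encode(vector, codebooks):
    """Replace each subvector with the index of its closest codeword."""
    return [
        min(range(len(book)), key=lambda c: squared_distance(sub, book[c]))
        for sub, book in zip(split(vector, len(codebooks)), codebooks)
    ]


def decode(codes, codebooks):
    """Approximate reconstruction: concatenate the chosen codewords."""
    return [x for code, book in zip(codes, codebooks) for x in book[code]]


rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(200)]
NUM_SUBVECTORS, NUM_CODEWORDS = 4, 16
codebooks = [
    train_codebook([split(v, NUM_SUBVECTORS)[s] for v in data], NUM_CODEWORDS)
    for s in range(NUM_SUBVECTORS)
]

codes = encode(data[0], codebooks)
print("codes:", codes)
print("reconstruction error:", squared_distance(data[0], decode(codes, codebooks)))
```

Each 8-number vector is stored as just 4 small integers, which is the source of the storage savings; the reconstruction error is the price of that lossy compression.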

Use cases of Vector databases

Vector databases are a useful tool in many applications. Here are some of their use cases.

Recommendation systems

Vector databases help recommend articles, movies, music, products, etc., based on user preferences and/or a description given in text or other formats. They can store user profiles and item embeddings for efficient retrieval of similar items to recommend. This can help social media platforms, media publishing houses, and OTT platforms engage users and build brand loyalty.

Natural language processing (NLP)

Vector databases help make tasks like sentiment analysis, topic modeling, and machine translation more efficient. They can store word as well as sentence embeddings, enabling fast processing of large volumes of natural language data. This helps search engines, chatbots, and virtual assistants provide better results.

Fraud detection and anomaly detection

Vector databases help in analyzing and detecting unusual patterns, objects, and behaviors in various forms of data. They can store historical transaction embeddings or sensor data embeddings, which allows real-time detection of suspicious patterns. Financial organizations use this capability to stop fraud in its tracks.

Drug discovery and material science

Finding similarity to existing data is an important capability for identifying potential drug candidates or new alternatives to existing materials. Vector databases can store molecular embeddings or material property embeddings, enabling faster scientific discoveries. This helps to significantly reduce the time it takes to develop new drugs and discover newer materials for specific applications.

Vector DB classification

Vector databases are broadly classified along two aspects:

Dedicated vector database, or a traditional database with vector search support
Licensing model

Dedicated vector databases have certain advantages over traditional databases that support vector search. They are built for speed, with optimized indexing and scalability. On the other hand, traditional databases with support for vector search have advantages like the ability to handle structured and/or unstructured data along with vectors, with a smoother learning curve.

(Vector databases landscape)

The diagram above shows a few of the popular vector database options available today and how they fit into the landscape.

Conclusion

Vector databases are revolutionizing how we handle and interact with large and complex data in the age of artificial intelligence. They unlock a vast range of possibilities through their ability to capture, persist and efficiently search through the essence of data. From powering recommendation systems to accelerating scientific discovery, vector databases are bound to play a central role in the upcoming data driven future.

As new vector databases emerge and evolve, we can expect even faster speeds, larger storage capacities, more sophisticated indexing and world-wide adoption of them across industries.

Now that you have learned about vector databases, how they work, their use cases, and classification, you can bring in AI & GPU Cloud experts to help you build your own AI cloud.


Securing Kubernetes Secrets with Conjur

Sameer Kulkarni — Fri, 26 Mar 2021 17:58:02 +0000

Why secure Kubernetes secrets?

Secrets management is an important aspect of securing your Kubernetes cluster. Out of the box, Kubernetes stores secrets with only base64 encoding, which is not enough because base64 is an encoding, not encryption. You have to implement a number of security best practices on top to prevent possible security breaches; etcd encryption at rest and access control with RBAC are a couple of examples. Using a secrets management solution like CyberArk Conjur not only secures secrets for Kubernetes, but also provides other benefits, as we will see in this post.
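To see why base64 alone offers no protection, here is a short sketch using Python's standard `base64` module. The secret value is a placeholder; anyone who can read the encoded string can reverse it without a key.

```python
import base64

# Encoding a placeholder secret the way it would appear in a Secret manifest
stored = base64.b64encode(b"supersecret").decode("ascii")
print("stored: ", stored)

# Decoding needs no key or password
recovered = base64.b64decode(stored).decode("utf-8")
print("decoded:", recovered)
```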

What is Conjur?

CyberArk Conjur is a secrets manager. It helps you manage secrets in Kubernetes, as well as across applications, tools & clouds. It offers Role Based Access Control (RBAC) with an audit trail to easily track each stored secret. It implements encryption at rest with AES-256-GCM and in transit using mTLS. Additionally, you can manage the access for each secret & can also rotate the secrets automatically.

In this post, we will see how to install Conjur OSS on Kubernetes. We will go through a basic set of Conjur policies and will load them into Conjur. We’ll also see how to run an application in Kubernetes which uses secrets from Conjur by conforming to the defined policies.

Pre-requisites

Familiarity with advanced YAML concepts.

You may already be familiar with the way Kubernetes spec files are written in YAML. However, you also need a few more YAML concepts to understand & define Conjur policies, viz. tags, anchors & aliases. The Conjur website has a quick refresher on this. Alternatively, you can go through the full YAML documentation.

A working Kubernetes cluster
Docker installed locally

Setup

How to Install Conjur?

The easiest way to install Conjur on a Kubernetes cluster is by using the Helm chart. Let's first create a custom values file for the Helm chart.

$  DATA_KEY="$(docker run --rm cyberark/conjur data-key generate)"
$  HELM_RELEASE_NAME=conjur-oss
$  cat >values.yaml <<EOT
account:
  name: "default"
  create: true
authenticators: "authn-k8s/namespace,authn-k8s/deployment,authn-k8s/service_account,authn-k8s/demo,authn"
dataKey: $DATA_KEY
ssl:
  altNames:
  - $HELM_RELEASE_NAME
  - $HELM_RELEASE_NAME-ingress
  - $HELM_RELEASE_NAME.conjur.svc.cluster.local
  - $HELM_RELEASE_NAME-ingress.conjur.svc.cluster.local
service:
  external:
    enabled: false
replicaCount: 1
EOT

The dataKey is used for encrypting the secrets in the db. The ssl.altNames will be used for the SSL configuration of the Conjur service that the Helm chart will create.

Install Conjur OSS on a Kubernetes cluster, with the following commands.

$  CONJUR_NAMESPACE=conjur
$  kubectl create namespace "$CONJUR_NAMESPACE"
$  VERSION=2.0.3
$  helm repo update
$  helm install \
   -n "$CONJUR_NAMESPACE" \
   -f values.yaml \
   "$HELM_RELEASE_NAME" \
   https://github.com/cyberark/conjur-oss-helm-chart/releases/download/v$VERSION/conjur-oss-$VERSION.tgz

The VERSION declared above is the Conjur Helm chart release version. As of writing this post, the latest Conjur OSS Helm chart version is 2.0.3. Refer to the Conjur Helm chart releases for the latest Conjur Helm chart available.

Once the Helm chart is installed, it creates an admin user with an API key. You will need this key for the initial load of the Conjur policies, secrets, etc. You'll also need it in "break-glass" scenarios, so store it in a safe place. You can fetch it using the commands below.

$ POD_NAME=$(kubectl get pods --namespace $CONJUR_NAMESPACE \
                     -l "app=$HELM_RELEASE_NAME,release=$HELM_RELEASE_NAME" \
                     -o jsonpath="{.items[0].metadata.name}")
$ kubectl exec --namespace $CONJUR_NAMESPACE \
          $POD_NAME \
          --container=$HELM_RELEASE_NAME \
          -- conjurctl role retrieve-key default:user:admin | tail -1

Verify the installation.

$ kubectl get po,svc -n $CONJUR_NAMESPACE
NAME                             READY   STATUS    RESTARTS   AGE
pod/conjur-oss-b888db5d5-vmfl5   1/2     Running   0          77s
pod/conjur-oss-postgres-0        1/1     Running   0          77s

NAME                          TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)         AGE
service/conjur-oss            NodePort    10.68.34.148   <none>        443:31022/TCP   79s
service/conjur-oss-postgres   ClusterIP   10.68.36.72    <none>        5432/TCP        79s

Define Conjur Policies

Conjur policies help define objects in its database in a tree structure. Some examples of the objects defined in the policies are users, roles, secrets & applications. Policies also define the rules for role based access control. While the Conjur documentation defines the policy best practices, we will use one of the Conjur demo repositories to define policies. I've used the policies in the demo repository as the base and have simplified them further to make the basic concepts easier to understand. Download and review the simplified policy files from my repository. Note that all the policies need to have a .yml extension.

1_users.yml file defines users and roles. It also grants role based access to a group of users as well as to individual users.
2_app-authn-def.yml file defines applications & groups them into a layer for easier access management.
3_cluster-authn-svc-def.yml file defines the authenticator service & the SSL certificates for the mTLS communication between Conjur & its clients. In this case, Conjur clients are applications running on Kubernetes. It also defines role based access to authenticate with the service.
4_app-identity-def.yml file connects the authentication identities to application identities.
5_authn-any-policy-branch.yml policy is defined to verify that hosts can authenticate with Conjur from anywhere in the policy branch to retrieve secrets for Kubernetes.
6_app-access.yml defines secret variables for the different applications that will use Conjur as their secrets manager. Note that the variables mentioned here are just secret variable names & not the values.

Generate mTLS cert & key

Run the create_mtls_certs.sh shell script to create the mTLS cert & key. Make sure to update the AUTHENTICATOR_ID & CONJUR_ACCOUNT in the script with the values appropriate for the Conjur installation. AUTHENTICATOR_ID is the part that follows conjur/authn-k8s/ in the id of the 3_cluster-authn-svc-def.yml policy file.

Load policies & secrets

We'll use conjur-cli to load policies & data into Conjur. Conjur has a container image pre-packaged with the Conjur CLI. We will run the Conjur client as a pod on the cluster, with the policies and certificates mounted on it as configmap volumes. conjur-cli will load the policies & certificates to the Conjur server from these volumes.

Place all the policy files in the policy directory and the mTLS cert-key pair in the mtls directory, then create configmaps from them.

$ # Create configmap containing the mTLS cert & key
$ kubectl create configmap conjur-ca -n $CONJUR_NAMESPACE --from-file $(pwd)/mtls
$ # Create a configmap containing all the policy files
$ kubectl create configmap policies -n $CONJUR_NAMESPACE --from-file $(pwd)/policy

Run the Conjur client pod with the above configmaps mounted as volumes. Download the sample pod config from here. Create the pod with downloaded config & exec into it to load values.

$ kubectl create -f conjur-client.yaml
$ kubectl exec -it -n $CONJUR_NAMESPACE conjur-client -- sh

Connect to the Conjur server & authenticate as an admin user

$ export CONJUR_URL=https://conjur-oss
$ export ACCOUNT=default
$ conjur init -u $CONJUR_URL -a $ACCOUNT
$ conjur authn login -u admin -p <admin_api_key_printed_by_helm_install>

We can start loading policies now. Loading the policy files 1_users.yml and 2_app-authn-def.yml in Conjur generates API keys for the users & hosts defined in it. The user API keys can be distributed to respective team members, allowing them to authenticate & interact with Conjur. We will use the Kubernetes authenticator instead of host API keys to get the application authenticated with Conjur.

$ conjur policy load root policy/1_users.yml
$ conjur policy load root policy/2_app-authn-def.yml
$ conjur policy load root policy/3_cluster-authn-svc-def.yml
$ conjur policy load root policy/4_app-identity-def.yml
$ conjur policy load root policy/5_authn-any-policy-branch.yml
$ conjur policy load root policy/6_app-access.yml

Load mTLS certificate & key.

$ conjur variable values add conjur/authn-k8s/demo/ca/cert "$(cat conjur-ca/ca.cert)"
$ conjur variable values add conjur/authn-k8s/demo/ca/key "$(cat conjur-ca/ca.key)"

Load secret values.

$ conjur variable values add demo-app-vars/url "https://my.app.com"
$ conjur variable values add demo-app-vars/username "myuser"
$ conjur variable values add demo-app-vars/password "supersecret"

You don’t want the client pod to continue running in the cluster, especially because it’s currently logged in to the Conjur server. Hence either log out from the Conjur server with conjur authn logout or delete the conjur-client pod. Also, delete the configmaps mounted on the client pod.

$ kubectl delete -f conjur-client.yaml
$ kubectl delete cm -n $CONJUR_NAMESPACE conjur-ca policies

Configure & run application

Conjur offers various authenticators for users and hosts. Here we will use the Kubernetes authenticator to get our application host authenticated with the Conjur server. The Kubernetes authenticator uses Kubernetes APIs to authenticate resources like Pod, Deployment, etc. Refer to the Conjur documentation for the full list of supported Kubernetes resources that can be defined as hosts.

Kubernetes Authenticator Client is one of the two options for using this authentication method. You can run it as an initContainer or as a sidecar for each application. Configure the Conjur URL, account, login, etc. as env variables on the application and the authenticator container. Login is the full host id as defined in the 2_app-authn-def.yml policy file. You also need to mount the SSL certs for the Conjur service. In this case, we have to use the SSL certificates generated by the Helm chart during Conjur installation. If you have your own SSL certificates configured on the Conjur server, you can use those instead. Note that the value of CONJUR_AUTHN_URL on the authenticator container is slightly different from the CONJUR_APPLIANCE_URL on the application container. CONJUR_AUTHN_URL has the authentication service id appended to the CONJUR_APPLIANCE_URL.

The authenticator client authenticates itself to the configured Conjur server URL & provides its identity, i.e., the login id, pod name, namespace, etc. Conjur validates the information provided by the authenticator client against the defined policies as well as against Kubernetes, & provides an access token. The access token is valid only for 8 mins. The client container saves the access token on an in-memory volume, which is mounted to both containers, viz. application & authenticator.

Summon, which is a separate open source utility from CyberArk Conjur, uses this token to fetch the values for secrets. Your application container needs to run Summon as its main process, with the path to a secrets.yml file that lists all the secret values it needs to pull from Conjur. Summon runs the application executable as a sub-process & passes the secret values it fetched as env variables. See the command configured in the Dockerfile of the application we're about to run.

$ tail -n1 Dockerfile
ENTRYPOINT ["summon", "--provider", "summon-conjur", "-f", "/etc/secrets.yml", "/bin/sh", "-c", "while true; do printenv | grep PASSWORD; sleep 5; done"]

The !var in the secrets.yml file indicates that the value needs to be injected as env variable. It can be replaced with !file:var in case you want the value to be written to a file. In that case, the env variable name on the left side will contain the file's path where secret content is written. Observe the contents of our secrets.yml below.

$ cat secrets.yml
PASSWORD: !var demo-app-vars/password

Before running the application, let's first create our application namespace & copy the secret containing Conjur tls certs to our application namespace.

$ kubectl create namespace test
$ kubectl get secret conjur-oss-conjur-ssl-cert --namespace=conjur -oyaml |\
 sed 's/namespace: conjur/namespace: test/g' |\
 kubectl apply -f -

Use the example application deployment file to run the application. As you saw in the Dockerfile, my example application is a busybox container. It just prints the value of secret demo-app-vars/password from Conjur every 5 seconds. A typical application should never print secret values in logs; but we'll use it only to demonstrate that the value is available to the application from Conjur. Let's run the same & observe the logs.

$ kubectl create -f busybox.yaml

Check to see if the application has the secret value from Conjur available as an environment variable.

$ kubectl logs -f -lapp=busybox
pass: supersecret
pass: supersecret
pass: supersecret
pass: supersecret

An additional security feature you get by using Conjur & Summon is that the secret value is only available to the application & not to the entire container. This means, if an attacker were to get access inside the application container, they won't be able to access the secret values by listing the environment variables in the container.

$ kubectl exec -it "$(kubectl get pods -l app=busybox -o name | head -n1)" -- sh
$ printenv | grep PASSWORD
$
$ # No output above
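The mechanism behind this can be sketched in a few lines: a parent process starts a child with extra environment variables that exist only in the child's environment. This is an illustration of the idea, not Summon's implementation; `run_with_secrets` is a hypothetical helper and the secret value is a placeholder.

```python
import os
import subprocess
import sys


def run_with_secrets(command, secrets):
    """Start `command` with extra env vars that exist only in the child process."""
    child_env = {**os.environ, **secrets}
    return subprocess.run(command, env=child_env, capture_output=True, text=True, check=True)


# The child prints the variable it was given; the parent's environment is untouched.
child = [sys.executable, "-c", "import os; print(os.environ['PASSWORD'])"]
result = run_with_secrets(child, {"PASSWORD": "supersecret"})

print("child sees: ", result.stdout.strip())
print("parent sees:", os.environ.get("PASSWORD"))
```

Note that this limits exposure rather than eliminating it: on Linux, a process with sufficient privileges can still read another process's environment through /proc/<pid>/environ.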

Cleanup

Clean up all the resources created in this post with the commands below:

$ kubectl delete -f busybox.yaml
$ helm delete $HELM_RELEASE_NAME -n $CONJUR_NAMESPACE
$ 
$ ## Delete client pod & configmap, if not already removed
$ kubectl delete -f conjur-client.yaml
$ kubectl delete cm -n $CONJUR_NAMESPACE conjur-ca policies

Conclusion

In this post, we looked at what Conjur is, its uses & basic concepts. We also installed Conjur on a Kubernetes cluster and integrated it with a sample application running in Kubernetes. I hope this gives you a good start for using Conjur with Kubernetes.

We’re always thrilled to connect to people working with cloud native technologies. For any queries or comments, you can reach out to us via Twitter and LinkedIn.

References

conjur.org

Conjur OSS Helm Chart

Kubernetes Conjur Demo

My GitHub repo with all the resources used in the post