<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Cristian Carballo</title>
    <description>The latest articles on Forem by Cristian Carballo (@criscarba).</description>
    <link>https://forem.com/criscarba</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F899050%2Fc90b9d9b-b5a1-4b32-a977-04192e0f21e5.png</url>
      <title>Forem: Cristian Carballo</title>
      <link>https://forem.com/criscarba</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/criscarba"/>
    <language>en</language>
    <item>
      <title>An Alternative to Bedrock Knowledge Base</title>
      <dc:creator>Cristian Carballo</dc:creator>
      <pubDate>Mon, 10 Feb 2025 14:11:53 +0000</pubDate>
      <link>https://forem.com/aws-builders/alternativa-a-bedrock-knowledge-base-2pf</link>
      <guid>https://forem.com/aws-builders/alternativa-a-bedrock-knowledge-base-2pf</guid>
      <description>&lt;p&gt;In this post I will share a solution architecture I designed to build a knowledge base for a RAG model.&lt;/p&gt;

&lt;p&gt;First of all, we may ask ourselves: &lt;strong&gt;what is a RAG model?&lt;/strong&gt; In short, a RAG (Retrieval-Augmented Generation) model is an artificial-intelligence architecture that combines information retrieval with text generation to improve the accuracy and relevance of the answers produced by language models.&lt;/p&gt;

&lt;p&gt;This approach is mainly used in generative AI systems that need to answer questions or generate content based on up-to-date, contextual information, rather than relying solely on the model's pre-trained knowledge.&lt;/p&gt;
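&lt;p&gt;The retrieve-then-generate loop can be sketched in a few lines. This is only an illustration: the tiny corpus, the bag-of-words scoring, and the stubbed generator are hypothetical stand-ins for a real embedding model and LLM.&lt;/p&gt;

```python
# Minimal RAG sketch: retrieve the most relevant document, then hand it
# to a (stubbed) generator together with the question.
def score(question, document):
    # Naive word-overlap score stands in for a real embedding model.
    q_words = set(question.lower().split())
    d_words = set(document.lower().split())
    return len(q_words.intersection(d_words))

def retrieve(question, corpus):
    # Retrieval step: pick the document with the highest overlap score.
    return max(corpus, key=lambda doc: score(question, doc))

def answer(question, corpus):
    # Generation step (stubbed): a real system would call an LLM with
    # the retrieved context prepended to the prompt.
    context = retrieve(question, corpus)
    return "Context: " + context + " | Question: " + question

corpus = [
    "replication copies objects between s3 buckets",
    "embeddings enable semantic search in opensearch",
]
```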

&lt;p&gt;AWS Bedrock currently offers an excellent feature that lets us create Knowledge Bases very easily. For more information about it, see the following &lt;a href="https://aws.amazon.com/bedrock/knowledge-bases/" rel="noopener noreferrer"&gt;link&lt;/a&gt;. However, we can also design scalable solutions for the same purpose within AWS. &lt;/p&gt;

&lt;p&gt;Below I share a solution architecture designed to generate embeddings as new files land in Amazon S3. It represents a Knowledge Base on AWS that stores and processes information, producing embeddings that enable semantic search with Amazon OpenSearch. Each component and its role in the data flow are described next. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fes53adakcmc19a2pw7ro.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fes53adakcmc19a2pw7ro.png" alt="Image description" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;Main Components&lt;/strong&gt;&lt;/u&gt;&lt;br&gt;
&lt;strong&gt;🔹 Infrastructure and Storage&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS CloudFormation: Automates infrastructure creation, deploying the Knowledge Base stack.&lt;/li&gt;
&lt;li&gt;S3 Knowledge Base: Stores the documents and files to be processed.&lt;/li&gt;
&lt;li&gt;Files Metadata Table: Database that keeps the metadata of processed files.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🔹 Data Processing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon EventBridge: Detects events when files are uploaded to S3 and triggers the processing flow.&lt;/li&gt;
&lt;li&gt;Extract Metadata (Lambda): Extracts metadata from the files and writes it to the Files Metadata Table.&lt;/li&gt;
&lt;li&gt;File Processing Queue (SQS): Processing queue that manages the requests to generate embeddings.&lt;/li&gt;
&lt;li&gt;File Processing Dead Letter Queue (SQS): Stores failed messages that could not be processed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🔹 Embedding Generation and Vectorization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate Embeddings (Lambda): Generates embeddings for the files using a machine-learning model.&lt;/li&gt;
&lt;li&gt;ECR (Elastic Container Registry): Holds the Docker images used to run the metadata-extraction and embedding-generation functions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🔹 Indexing and Semantic Search&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon OpenSearch Service: Stores embeddings and serves embedding-based queries, enabling semantic search across the Knowledge Base.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Process Narrative: &lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;File upload → A user uploads a file to the S3 Knowledge Base.&lt;/li&gt;
&lt;li&gt;EventBridge event → Detects the upload and triggers the Extract Metadata Lambda.&lt;/li&gt;
&lt;li&gt;Metadata extraction → The Lambda extracts metadata and stores it in the Files Metadata Table.&lt;/li&gt;
&lt;li&gt;Processing queue (SQS) → A message is sent to the File Processing Queue to start embedding generation.&lt;/li&gt;
&lt;li&gt;Embedding generation → The Generate Embeddings Lambda converts the file into a numeric vector.&lt;/li&gt;
&lt;li&gt;Indexing in OpenSearch → The embeddings are stored in Amazon OpenSearch for semantic search.&lt;/li&gt;
&lt;li&gt;Status update → The metadata table is updated with the file's final status.&lt;/li&gt;
&lt;/ol&gt;
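&lt;p&gt;The numbered flow can be sketched as a Lambda-style handler for the embedding step. Everything here is illustrative and hedged: the SQS message shape, the &lt;code&gt;embed&lt;/code&gt; stub, and the indexed document fields are assumptions, not the exact implementation behind the diagram.&lt;/p&gt;

```python
import json

def embed(text):
    # Stub: a real pipeline would call an embedding model here.
    return [float(len(word)) for word in text.split()]

def build_opensearch_doc(bucket, key, text):
    # Shape of the document indexed into OpenSearch (illustrative).
    return {
        "s3_uri": "s3://" + bucket + "/" + key,
        "vector": embed(text),
        "status": "INDEXED",
    }

def handler(event, get_object_text):
    # Processes SQS messages carrying S3 object references (step 5),
    # producing the documents to index into OpenSearch (step 6).
    docs = []
    for record in event["Records"]:
        body = json.loads(record["body"])
        bucket, key = body["bucket"], body["key"]
        docs.append(build_opensearch_doc(bucket, key, get_object_text(bucket, key)))
    return docs
```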

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Benefits of this Architecture&lt;/u&gt;&lt;/strong&gt;&lt;br&gt;
✅ Full automation with AWS CloudFormation and EventBridge.&lt;br&gt;
✅ Scalable processing with AWS Lambda and SQS.&lt;br&gt;
✅ Advanced semantic search through Amazon OpenSearch and embeddings.&lt;br&gt;
✅ High availability and fault tolerance with S3, SQS, and the Dead Letter Queue.&lt;/p&gt;

&lt;p&gt;I hope this reference architecture is useful to you. If you have any questions, don't hesitate to contact me! &lt;a href="https://www.linkedin.com/in/cristianrcarballo/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/cristianrcarballo/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>data</category>
      <category>cloud</category>
      <category>aws</category>
      <category>rag</category>
    </item>
    <item>
      <title>Don't Fear AWS LakeFormation</title>
      <dc:creator>Cristian Carballo</dc:creator>
      <pubDate>Mon, 10 Feb 2025 13:54:56 +0000</pubDate>
      <link>https://forem.com/aws-builders/no-le-temas-a-aws-lakeformation-e67</link>
      <guid>https://forem.com/aws-builders/no-le-temas-a-aws-lakeformation-e67</guid>
      <description>&lt;p&gt;When designing a solution on AWS, it is extremely important to pay attention to security, even more so when the solution involves access to data. &lt;/p&gt;

&lt;p&gt;In this post I will cover an option the AWS LakeFormation service offers to safeguard access to data.&lt;/p&gt;

&lt;p&gt;While several alternatives exist, here we will focus on access control managed through LF-Tags. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohkm5psi7vpk71hw1ex0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohkm5psi7vpk71hw1ex0.png" alt="Image description" width="800" height="560"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The components and how they work are broken down below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Key Services Used&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;u&gt;AWS Lake Formation&lt;/u&gt;: Manages database- and table-level permissions using LF-Tags.&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;AWS Glue Data Catalog&lt;/u&gt;: Stores metadata for the data in S3.&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Databases in AWS Glue&lt;/u&gt;:
-- Glue DB#1 (SALES): Contains tables with sales data.
-- Glue DB#2 (MKT): Contains tables with marketing data.
-- Glue DB#3 (PRIVATE): Contains private/sensitive data.&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Athena&lt;/u&gt;: Enables SQL queries over the data.&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Amazon QuickSight&lt;/u&gt;: Used for data analysis and visualization.&lt;/li&gt;
&lt;li&gt;
&lt;u&gt;Amazon SageMaker&lt;/u&gt;: Used for machine learning and predictive analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Access Control with LF-Tags&lt;/strong&gt;&lt;br&gt;
The architecture uses LF-Tags to define permissions at the database, table, and column level. As shown in the diagram, each business unit has been assigned specific LF-Tags:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Green tags → Sales data (SALES)&lt;/li&gt;
&lt;li&gt;Blue tags → Marketing data (MKT)&lt;/li&gt;
&lt;li&gt;Red tags → Private/restricted data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By assigning tags to the different databases we can allow or deny access to the data in a more practical way, so that: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users granted only the “Sales” LF-Tags can query data in Glue DB#1, but cannot see Marketing or Private data.&lt;/li&gt;
&lt;li&gt;Users granted the “Marketing” LF-Tags can only query Glue DB#2.&lt;/li&gt;
&lt;li&gt;Private data (PRIVATE) is restricted and requires special permissions.&lt;/li&gt;
&lt;/ul&gt;
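&lt;p&gt;The tag-matching rule behind these bullets can be simulated locally in a few lines. This is only an illustration of the decision logic; real enforcement happens inside Lake Formation, and the tag names and values below are hypothetical.&lt;/p&gt;

```python
def can_access(granted_expression, resource_tags):
    # granted_expression: LF-Tag expression granted to a principal,
    # mapping a tag key to the list of values it may access.
    # resource_tags: the LF-Tags attached to a database/table.
    # Access is allowed only if every resource tag is covered.
    return all(
        key in granted_expression and value in granted_expression[key]
        for key, value in resource_tags.items()
    )
```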

&lt;p&gt;&lt;strong&gt;3. Data Flow&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The data is registered in AWS Lake Formation.&lt;/li&gt;
&lt;li&gt;The AWS Glue Data Catalog stores the database and table metadata.&lt;/li&gt;
&lt;li&gt;LF-Tags are assigned to databases and tables.&lt;/li&gt;
&lt;li&gt;IAM roles + LF-Tags control access for users, services, or groups.&lt;/li&gt;
&lt;li&gt;Athena, QuickSight, and SageMaker access the data securely, honoring the LF-Tag restrictions.&lt;/li&gt;
&lt;/ul&gt;
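&lt;p&gt;The "IAM roles + LF-Tags" step is typically wired up with Lake Formation's &lt;code&gt;grant_permissions&lt;/code&gt; API. The sketch below only builds the request payload; the role ARN, tag key, and values are placeholders, and the actual boto3 call is left commented out since it requires AWS credentials.&lt;/p&gt;

```python
def build_lftag_grant(principal_arn, tag_key, tag_values):
    # Payload for lakeformation.grant_permissions granting
    # DESCRIBE/SELECT on every table matching the LF-Tag expression.
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": tag_key, "TagValues": tag_values}],
            }
        },
        "Permissions": ["DESCRIBE", "SELECT"],
    }

# import boto3
# boto3.client("lakeformation").grant_permissions(
#     **build_lftag_grant("arn:aws:iam::111122223333:role/sales-analyst",
#                         "domain", ["sales"]))
```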

&lt;p&gt;&lt;strong&gt;4. Key Benefits&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ Data security with LF-Tags → Ensures that only authorized users access specific datasets.&lt;br&gt;
✅ Segmentation by business unit → Sales, Marketing, and Private data are separated through tag-based control.&lt;br&gt;
✅ Integration with AWS analytics services → Athena, QuickSight, and SageMaker access the data in a controlled way.&lt;br&gt;
✅ A centralized IAM role → Simplifies permission management at the service level.&lt;/p&gt;

</description>
      <category>data</category>
      <category>bigdata</category>
      <category>aws</category>
      <category>security</category>
    </item>
    <item>
      <title>Ways to Replicate Data with S3</title>
      <dc:creator>Cristian Carballo</dc:creator>
      <pubDate>Sat, 30 Dec 2023 13:31:04 +0000</pubDate>
      <link>https://forem.com/criscarba/formas-de-replicar-datos-con-s3-2ld9</link>
      <guid>https://forem.com/criscarba/formas-de-replicar-datos-con-s3-2ld9</guid>
      <description>&lt;p&gt;When it comes to replicating data between S3 buckets, we commonly run into two scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The buckets (source &amp;amp; target) are in the same account.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gqgn7o2yl3zmg3nhpgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gqgn7o2yl3zmg3nhpgg.png" alt="Image description" width="624" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The buckets (source &amp;amp; target) are in different accounts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0re8wi6ptzsa1qnmedp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0re8wi6ptzsa1qnmedp.png" alt="Image description" width="624" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is worth noting that there are multiple methods for replicating data between buckets, but it is essential to understand the use case at hand. In this post we focus on cases where objects are strictly replicated between S3 buckets, so we can rely on the standard replication feature offered by S3 itself. We will implement these solutions using &lt;strong&gt;CloudFormation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Although both scenarios are very similar, they differ slightly in how they must be implemented. When &lt;strong&gt;the buckets are in the same account&lt;/strong&gt;, no &lt;strong&gt;bucket policy&lt;/strong&gt; is needed on the target bucket; however, &lt;strong&gt;versioning&lt;/strong&gt; must be enabled, since it is a prerequisite for replication to work and it also allows restoring objects to a previous version.&lt;/p&gt;

&lt;p&gt;To deploy &lt;strong&gt;Solution #1 (buckets in the same account)&lt;/strong&gt;, we need to create the following files:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Deploy script (deploy.sh)
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;###
STACK_NAME="s3-replication-same-account-template"
TEMPLATE_FILE_NAME="s3-replication-same-account-template"
###

PROFILE="default"
ENV="dev"   # environment suffix used in the packaged-template file name
ARTIFACTORY_BUCKET="**Name of an existing bucket used to upload the packaged template**"

#1) Create Package
aws cloudformation package --template ./$TEMPLATE_FILE_NAME.yaml \
                           --s3-bucket $ARTIFACTORY_BUCKET \
                           --output json &amp;gt; $TEMPLATE_FILE_NAME-packaged-$ENV.yaml \
                           --profile $PROFILE

#2) Create Stack (From Package)
aws cloudformation create-stack --stack-name $STACK_NAME \
                                --parameters file://./parameters.json \
                                --template-body file://./$TEMPLATE_FILE_NAME-packaged-$ENV.yaml \
                                --profile $PROFILE \
                                --region us-east-1 \
                                --capabilities CAPABILITY_AUTO_EXPAND CAPABILITY_NAMED_IAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Parameters file (parameters.json)
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
    {
        "ParameterKey": "pSampleType",
        "ParameterValue": "same-account-replication"
    }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;CloudFormation template (s3-replication-same-account-template.yaml)
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWSTemplateFormatVersion: 2010-09-09
Transform: AWS::Serverless-2016-10-31
Description: "Same Account - S3 Replication"

Parameters:
  pSampleType:
    Description: S3 Replication Sample Type
    Type: String 


Resources:

  #####################################################
  #################### S3 BUCKET ######################
  #####################################################
  rSourceBucket:
    #DependsOn: rReplicationRole  
    Type: AWS::S3::Bucket
    Properties:       
      BucketName: !Sub "${pSampleType}-source-bucket-${AWS::AccountId}"            
      VersioningConfiguration: 
        Status: Enabled
      ReplicationConfiguration:
        Role: !GetAtt rReplicationRole.Arn
        Rules:
          - Id: !Sub "${pSampleType}-sample"
            Status: Enabled
            Prefix: datalake
            Destination:
              Bucket: !GetAtt rDestinationBucket.Arn
              StorageClass: STANDARD  
      Tags: 
        - Key: "S3-BucketName"        
          Value: !Sub "${pSampleType}-source-bucket-${AWS::AccountId}"
        - Key: "CostCenter"
          Value: "00000"  

  rDestinationBucket:  
    Type: AWS::S3::Bucket
    Properties:       
      BucketName: !Sub "${pSampleType}-destination-bucket-${AWS::AccountId}"            
      VersioningConfiguration: 
        Status: Enabled      
      Tags: 
        - Key: "S3-BucketName"        
          Value: !Sub "${pSampleType}-destination-bucket-${AWS::AccountId}"
        - Key: "CostCenter"
          Value: "00000"  



  #####################################################
  ##################### IAM ROLE ######################
  #####################################################

  rReplicationRole:
    Type: "AWS::IAM::Role"
    Properties:
      RoleName: !Sub "${pSampleType}-role"      
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Action:
              - "sts:AssumeRole"
            Effect: "Allow"
            Principal:
              Service:
                - "s3.amazonaws.com"
      Policies:
        - PolicyName: S3ReplicationPolicy
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - "s3:GetObjectVersionForReplication"
                  - "s3:GetObjectVersionAcl"
                  - "s3:GetObjectVersionTagging"
                Resource: !Sub "arn:aws:s3:::${pSampleType}-source-bucket-${AWS::AccountId}/*"
              - Effect: Allow
                Action:
                  - "s3:ListBucket"
                  - "s3:GetReplicationConfiguration"
                Resource: !Sub "arn:aws:s3:::${pSampleType}-source-bucket-${AWS::AccountId}"
              - Effect: Allow
                Action:
                  - "s3:ReplicateObject"
                  - "s3:ReplicateDelete"
                  - "s3:ReplicateTags"
                Resource: !Sub "arn:aws:s3:::${pSampleType}-destination-bucket-${AWS::AccountId}/*"



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To deploy the solution, simply run the deploy script as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bash deploy.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Moving on to &lt;strong&gt;Solution #2 (buckets in different accounts)&lt;/strong&gt;, we need to create the following files:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Source deploy script (deploy_source.sh)
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;###
STACK_NAME="s3-replication-different-account-source-template"
TEMPLATE_FILE_NAME="s3-replication-different-account-source-template"
###

PROFILE="default"
ENV="dev"   # environment suffix used in the packaged-template file name
ARTIFACTORY_BUCKET="**Name of an existing bucket used to upload the packaged template**"

#1) Create Package
aws cloudformation package --template ./$TEMPLATE_FILE_NAME.yaml \
                           --s3-bucket $ARTIFACTORY_BUCKET \
                           --output json &amp;gt; $TEMPLATE_FILE_NAME-packaged-$ENV.yaml \
                           --profile $PROFILE

#2) Create Stack (From Package)
aws cloudformation create-stack --stack-name $STACK_NAME \
                                --parameters file://./parameters-source.json \
                                --template-body file://./$TEMPLATE_FILE_NAME-packaged-$ENV.yaml \
                                --profile $PROFILE \
                                --region us-east-1 \
                                --capabilities CAPABILITY_AUTO_EXPAND CAPABILITY_NAMED_IAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Target deploy script (deploy_destination.sh)
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;###
STACK_NAME="s3-replication-different-account-destination-template"
TEMPLATE_FILE_NAME="s3-replication-different-account-destination-template"
###

PROFILE="default"
ENV="dev"   # environment suffix used in the packaged-template file name
ARTIFACTORY_BUCKET="**Name of an existing bucket used to upload the packaged template**"

#1) Create Package
aws cloudformation package --template ./$TEMPLATE_FILE_NAME.yaml \
                           --s3-bucket $ARTIFACTORY_BUCKET \
                           --output json &amp;gt; $TEMPLATE_FILE_NAME-packaged-$ENV.yaml \
                           --profile $PROFILE

#2) Create Stack (From Package)
aws cloudformation create-stack --stack-name $STACK_NAME \
                                --parameters file://./parameters-destination.json \
                                --template-body file://./$TEMPLATE_FILE_NAME-packaged-$ENV.yaml \
                                --profile $PROFILE \
                                --region us-east-1 \
                                --capabilities CAPABILITY_AUTO_EXPAND CAPABILITY_NAMED_IAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Source parameters file (parameters-source.json)
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
    {
        "ParameterKey": "pSampleType",
        "ParameterValue": "different-account-replication"
    },
    {
        "ParameterKey": "pDestinationBucketName",
        "ParameterValue": "different-account-replication-destination-bucket-aws-account-id"
    }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Target parameters file (parameters-destination.json)
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
    {
        "ParameterKey": "pSampleType",
        "ParameterValue": "different-account-replication"
    },
    {
        "ParameterKey": "pReplicationRoleArn",
        "ParameterValue": "arn:aws:iam::aws-account-id:role/different-account-replication-role"
    }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;CloudFormation template (s3-replication-different-account-source-template.yaml)
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWSTemplateFormatVersion: 2010-09-09
Transform: AWS::Serverless-2016-10-31
Description: "Different Account - S3 Replication (Source)"

Parameters:
  pSampleType:
    Description: S3 Replication Sample Type
    Type: String 
  pDestinationBucketName:
    Description: S3 Destination Bucket
    Type: String 

Resources:

  #####################################################
  #################### S3 BUCKET ######################
  #####################################################
  rSourceBucket:
    #DependsOn: rReplicationRole  
    Type: AWS::S3::Bucket
    Properties:       
      BucketName: !Sub "${pSampleType}-source-bucket-${AWS::AccountId}"            
      VersioningConfiguration: 
        Status: Enabled
      ReplicationConfiguration:
        Role: !GetAtt rReplicationRole.Arn
        Rules:
          - Id: !Sub "${pSampleType}-sample"
            Status: Enabled
            Prefix: datalake
            Destination:
              Bucket: !Sub "arn:aws:s3:::${pDestinationBucketName}"
              StorageClass: STANDARD  
      Tags: 
        - Key: "S3-BucketName"        
          Value: !Sub "${pSampleType}-source-bucket-${AWS::AccountId}"
        - Key: "CostCenter"
          Value: "00000"  

  #####################################################
  ##################### IAM ROLE ######################
  #####################################################

  rReplicationRole:
    Type: "AWS::IAM::Role"
    Properties:
      RoleName: !Sub "${pSampleType}-role"      
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Action:
              - "sts:AssumeRole"
            Effect: "Allow"
            Principal:
              Service:
                - "s3.amazonaws.com"
      Policies:
        - PolicyName: S3ReplicationPolicy
          PolicyDocument:
            Statement:
              - Effect: Allow
                Action:
                  - "s3:GetObjectVersionForReplication"
                  - "s3:GetObjectVersionAcl"
                  - "s3:GetObjectVersionTagging"
                Resource: !Sub "arn:aws:s3:::${pSampleType}-source-bucket-${AWS::AccountId}/*"
              - Effect: Allow
                Action:
                  - "s3:ListBucket"
                  - "s3:GetReplicationConfiguration"
                Resource: !Sub "arn:aws:s3:::${pSampleType}-source-bucket-${AWS::AccountId}"
              - Effect: Allow
                Action:
                  - "s3:ReplicateObject"
                  - "s3:ReplicateDelete"
                  - "s3:ReplicateTags"
                Resource: !Sub "arn:aws:s3:::${pDestinationBucketName}/*"





&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;CloudFormation template (s3-replication-different-account-destination-template.yaml)
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWSTemplateFormatVersion: 2010-09-09
Transform: AWS::Serverless-2016-10-31
Description: "Different Account - S3 Replication (Destination)"

Parameters:
  pSampleType:
    Description: S3 Replication Sample Type
    Type: String

  pReplicationRoleArn:
    Description: Role in the source account for the replication
    Type: String


Resources:

  #####################################################
  #################### S3 BUCKET ######################
  #####################################################

  rDestinationBucket:  
    Type: AWS::S3::Bucket
    Properties:       
      BucketName: !Sub "${pSampleType}-destination-bucket-${AWS::AccountId}"            
      VersioningConfiguration: 
        Status: Enabled      
      Tags: 
        - Key: "S3-BucketName"        
          Value: !Sub "${pSampleType}-destination-bucket-${AWS::AccountId}"
        - Key: "CostCenter"
          Value: "00000"  



  # #####################################################
  # ################## BUCKET POLICY ####################
  # #####################################################

  rDestinationBucketsPolicy:
    Type: AWS::S3::BucketPolicy
    Properties: 
      Bucket: !Ref rDestinationBucket
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              AWS: !Ref pReplicationRoleArn
            Action: 
              - "s3:ReplicateObject"
              - "s3:ReplicateDelete"
            Resource: !Sub "${rDestinationBucket.Arn}/*"
          - Effect: "Allow"
            Principal: 
              AWS: !Ref pReplicationRoleArn
            Action:
              - "s3:List*"
              - "s3:GetBucketVersioning"
              - "s3:PutBucketVersioning"
            Resource: !GetAtt rDestinationBucket.Arn



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To deploy the solution, simply run the deploy scripts, but in both AWS accounts (source &amp;amp; target), the same way as in scenario #1. &lt;/p&gt;

&lt;p&gt;Finally, keep in mind that object replication can take anywhere from a few seconds to several minutes, since timing depends on AWS. However, AWS offers &lt;strong&gt;RTC&lt;/strong&gt; (Replication Time Control), essentially an SLA guaranteeing that objects are replicated from source to target within 15 minutes at most. This feature is key for critical use cases where having the objects on time is a must. &lt;/p&gt;
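&lt;p&gt;For reference, RTC is enabled per replication rule in the templates above. A hedged sketch of the extra properties (the V2 rule schema also requires Filter, Priority, and DeleteMarkerReplication; the rule Id below is a placeholder):&lt;/p&gt;

```yaml
        Rules:
          - Id: rtc-sample
            Status: Enabled
            Priority: 1
            Filter:
              Prefix: datalake
            DeleteMarkerReplication:
              Status: Disabled
            Destination:
              Bucket: !GetAtt rDestinationBucket.Arn
              StorageClass: STANDARD
              # RTC: 15-minute replication SLA plus replication metrics
              ReplicationTime:
                Status: Enabled
                Time:
                  Minutes: 15
              Metrics:
                Status: Enabled
                EventThreshold:
                  Minutes: 15
```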

&lt;p&gt;Thank you very much!&lt;br&gt;
Cristian R. Carballo &lt;br&gt;
&lt;a href="https://www.linkedin.com/in/cristianrcarballo/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/cristianrcarballo/&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Ingesting in Near Real Time with Kinesis Data Firehose</title>
      <dc:creator>Cristian Carballo</dc:creator>
      <pubDate>Thu, 28 Dec 2023 12:24:18 +0000</pubDate>
      <link>https://forem.com/criscarba/ingestando-en-near-real-time-con-kinesis-data-firehose-4fji</link>
      <guid>https://forem.com/criscarba/ingestando-en-near-real-time-con-kinesis-data-firehose-4fji</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1t2b5az0wre8ul5tzy4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1t2b5az0wre8ul5tzy4k.png" alt="Image description" width="772" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are multiple data-ingestion options on AWS, thanks to the many services available in the console; however, it is often challenging to decide which one to use for a given need or use case. &lt;/p&gt;

&lt;p&gt;In this post we focus on scenarios where our data "producer" constantly generates multiple streams. Such producers could be, for example: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IoT devices&lt;/li&gt;
&lt;li&gt;Application servers generating logs&lt;/li&gt;
&lt;li&gt;VPC Flow Logs&lt;/li&gt;
&lt;li&gt;Vehicle telemetry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these use cases we will use &lt;strong&gt;Amazon Kinesis Data Firehose&lt;/strong&gt;, an ETL service that captures, transforms, and ingests streaming data into data lakes and/or other natively integrated services, such as OpenSearch or Redshift.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqikbo2tgehzbky9et96.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqikbo2tgehzbky9et96.png" alt="Image description" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the main characteristics of Amazon Kinesis Data Firehose is that it ingests data in near real time. This means that the streams received by the Delivery Stream are not delivered to its output until one of the following two conditions is met:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Time-based buffer (&lt;u&gt;Minimum&lt;/u&gt;: 60 seconds | &lt;u&gt;Maximum&lt;/u&gt;: 900 seconds)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Size-based buffer (&lt;u&gt;Minimum&lt;/u&gt;: 1 MB | &lt;u&gt;Maximum&lt;/u&gt;: 128 MB)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whichever of the two occurs first triggers the dump of the data held by the Delivery Stream into the configured output. &lt;/p&gt;
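&lt;p&gt;The "whichever comes first" rule can be sketched as a small decision function. This is an illustrative model of the buffering behavior, not Firehose's actual implementation; the default thresholds below are the minimum values mentioned above.&lt;/p&gt;

```python
# Illustrative model of Firehose buffering (not the real implementation):
# the buffer is flushed as soon as EITHER the time threshold or the size
# threshold is reached, whichever happens first.
def should_flush(elapsed_seconds, buffered_mb, interval_s=60, size_mb=1):
    """Return the reason the buffer would be flushed, or None."""
    if elapsed_seconds >= interval_s:
        return "time"
    if buffered_mb >= size_mb:
        return "size"
    return None
```

&lt;p&gt;For example, &lt;code&gt;should_flush(30, 1.5)&lt;/code&gt; returns "size": the buffer filled up before the 60-second window elapsed.&lt;/p&gt;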

&lt;p&gt;As an example, we will implement the following architecture in the AWS console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkojpj8eyem6dpixje08.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flkojpj8eyem6dpixje08.png" alt="Image description" width="718" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will simulate an IoT device generating streams that are sent to the Delivery Stream. After 1 minute, the data will be written to S3. &lt;/p&gt;

&lt;p&gt;The steps to follow are listed below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open the AWS Console&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7em2yvgbwe82rr5mp3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk7em2yvgbwe82rr5mp3u.png" alt="Image description" width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;In the service search box, type &lt;strong&gt;S3&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusij40oaahwskzj84cpd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fusij40oaahwskzj84cpd.png" alt="Image description" width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Create a bucket that will be the target of our Delivery Stream.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;NOTE&lt;/strong&gt;&lt;/u&gt;: Bucket names must be globally unique, so do not try to create the same one shown in the image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8limyw8k1vgpestejn7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8limyw8k1vgpestejn7.png" alt="Image description" width="800" height="630"&gt;&lt;/a&gt;&lt;/p&gt;
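&lt;p&gt;Since bucket names must be globally unique, one option is to generate the name programmatically with a random suffix. The helper and prefix below are hypothetical examples, not the bucket name used in this demo.&lt;/p&gt;

```python
import uuid

# Hypothetical helper: append a random suffix so the bucket name is
# unlikely to collide with an existing one (the prefix is an example).
def unique_bucket_name(prefix="demo-kdf-target"):
    name = "{}-{}".format(prefix, uuid.uuid4().hex[:8]).lower()
    if len(name) > 63:  # S3 allows 3-63 characters
        raise ValueError("bucket name too long")
    return name
```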

&lt;ol start="4"&gt;
&lt;li&gt;In the service search box, type &lt;strong&gt;Kinesis Data Firehose&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidlvwt436cntpfdwc7qf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidlvwt436cntpfdwc7qf.png" alt="Image description" width="800" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Select &lt;strong&gt;Create Delivery Stream&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5n7gnop1wp0w2vb3t5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5n7gnop1wp0w2vb3t5i.png" alt="Image description" width="800" height="59"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx58u1vbyxeur1tjvoynr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx58u1vbyxeur1tjvoynr.png" alt="Image description" width="800" height="647"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the target bucket and set the buffer to its minimum values&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95owwalczy459igbtqke.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95owwalczy459igbtqke.png" alt="Image description" width="800" height="1043"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Delivery Stream will take a few minutes to create.&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;To simulate stream ingestion, I wrote the following script, which loads random data with a fixed structure into the Delivery Stream:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import names   # third-party package: pip install names
from random import randint
import boto3
import time
import json

DeliveryStreamName = 'demo-kdf'

# Create a Firehose client from the local AWS profile
session_dev = boto3.session.Session(profile_name='default')
firehose = session_dev.client('firehose', region_name='us-east-1')

# Send 10 records with a fixed structure and random values
cnt_streams = 10
for i in range(cnt_streams):
    record = {
      'id': i,
      'name': names.get_first_name(),
      'surname': names.get_last_name(),
      'age': randint(18, 80)
    }

    print(record)

    response = firehose.put_record(DeliveryStreamName=DeliveryStreamName,
                                   Record={'Data': json.dumps(record)})

    # brief pause between records
    time.sleep(0.1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the script runs, the records sent to the delivery stream are printed to the terminal&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5nt7t8f37y4rzozn50y6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5nt7t8f37y4rzozn50y6.png" alt="Image description" width="590" height="209"&gt;&lt;/a&gt;&lt;/p&gt;
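&lt;p&gt;As a side note, for larger volumes the loop above could send records in groups using Firehose's &lt;code&gt;put_record_batch&lt;/code&gt; operation, which accepts up to 500 records per request. Below is a minimal sketch of the batching logic; the AWS call is shown commented out so the snippet stays self-contained.&lt;/p&gt;

```python
import json

def chunk(records, size=500):
    """Split records into batches of at most `size` items
    (500 is the PutRecordBatch limit per request)."""
    return [records[i:i + size] for i in range(0, len(records), size)]

# for batch in chunk(all_records):
#     firehose.put_record_batch(
#         DeliveryStreamName='demo-kdf',
#         Records=[{'Data': json.dumps(r)} for r in batch])
```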

&lt;ol start="7"&gt;
&lt;li&gt;After a few minutes, checking the S3 bucket (output) we will find the data dump&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsj9o03tt21d1a3hlo4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsj9o03tt21d1a3hlo4m.png" alt="Image description" width="800" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The content can be inspected with S3 Select&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fulrvdlv0i4drjreb0t91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fulrvdlv0i4drjreb0t91.png" alt="Image description" width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you very much!!&lt;br&gt;
Cristian R. Carballo &lt;a href="https://www.linkedin.com/in/cristianrcarballo/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/cristianrcarballo/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/cristianrcarballo/&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>¿Cómo migrar tu base de datos on premise a la nube de AWS?</title>
      <dc:creator>Cristian Carballo</dc:creator>
      <pubDate>Thu, 11 Aug 2022 19:18:53 +0000</pubDate>
      <link>https://forem.com/criscarba/como-migrar-tu-base-de-datos-on-premise-a-la-nube-de-aws-2956</link>
      <guid>https://forem.com/criscarba/como-migrar-tu-base-de-datos-on-premise-a-la-nube-de-aws-2956</guid>
      <description>&lt;p&gt;En esta publicación voy a estar hablando sobre el servicio de AWS llamado &lt;strong&gt;Database Migration Service (DMS)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;Caracteristicas Principales&lt;/strong&gt;&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;AWS Database Migration Service (DMS) es un servicio de AWS que nos permite la migración de bases de datos a la nube.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Permite realizar la migración de bases de datos On Premise (y cloud) hacia AWS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DMS es un servicio resiliente a potenciales fallos (highly resilient &amp;amp; self–healing)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A lo largo de la migración de datos la base de datos de origen (Source) se mantiene activa sin interrupciones. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DMS Soporta migraciones de bases de datos de manera:&lt;br&gt;
 &lt;strong&gt;Homogéneas&lt;/strong&gt;: Ej. Caso de uso: PostgreSQL ⇒ PostgreSQL&lt;br&gt;
 &lt;strong&gt;Heterogéneas&lt;/strong&gt;: Ej. Caso de uso: MS Sql Server ⇒ Aurora (Debe utilizarse el SCT (Schema Conversion Tool))&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Los tipos de migraciones pueden ser:&lt;br&gt;
 &lt;strong&gt;Full/Snapshot&lt;/strong&gt;&lt;br&gt;
 &lt;strong&gt;Change Data Capture (CDC)&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;El Schema Conversion Tool (SCT) permite convertir el schema de una BDD de un motor a otro.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Puede ser utilizado entre bases de datos OLTP / OLAP&lt;br&gt;
 Ej. OLTP: Oracle ⇒ PostgreSQL&lt;br&gt;
 Ej. OLAP: Teradata ⇒ Redshift&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cb7i09yjmmguyjphksp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cb7i09yjmmguyjphksp.png" alt="DMS1" width="800" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;Endpoints&lt;/strong&gt;&lt;/u&gt;&lt;br&gt;
To use DMS you must define a Source and a Target endpoint. The ones currently available are shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv3sbjvvrowfvkegtp9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv3sbjvvrowfvkegtp9g.png" alt="DMS2" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;DEMO&lt;/strong&gt;&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Scenario description&lt;/u&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The company “Sin Nombre SRL” is taking its first steps in the cloud world. &lt;/li&gt;
&lt;li&gt;After analyzing the solutions and technologies available on the market, they chose to migrate their database infrastructure to AWS using the RDS service. &lt;/li&gt;
&lt;li&gt;The current database is PostgreSQL, running on a dedicated server that requires maintenance (patches/security/OS/etc.), which must be reduced. &lt;/li&gt;
&lt;li&gt;100% of the current data must be migrated to the cloud, except for a few data entities. &lt;/li&gt;
&lt;li&gt;The changes occurring at the source must be replicated daily.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Architecture:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2jwrar5alk8vpyquac0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2jwrar5alk8vpyquac0.png" alt="DMS3" width="765" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;NOTE: We must have a Cloud9 instance available&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Steps to follow:&lt;/u&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create the 2 instances (Source and Target) with a CloudFormation script.&lt;/li&gt;
&lt;li&gt;Create an inbound rule in each database's security group to open the PostgreSQL port.&lt;/li&gt;
&lt;li&gt;Create a replication instance in DMS.&lt;/li&gt;
&lt;li&gt;Install the Postgres tools (psql)&lt;/li&gt;
&lt;li&gt;Connect to the Source DB&lt;/li&gt;
&lt;li&gt;Run the table-creation script on the Source DB&lt;/li&gt;
&lt;li&gt;Create the Source and Target endpoints in DMS.&lt;/li&gt;
&lt;li&gt;Test the connection to the Target DB and verify it contains no data&lt;/li&gt;
&lt;li&gt;Create a replication task (FULL LOAD) and run it. &lt;/li&gt;
&lt;li&gt;Verify that the data was replicated to the target.&lt;/li&gt;
&lt;li&gt;Create a replication task (CDC) and run it. &lt;/li&gt;
&lt;li&gt;Insert new records into the Source DB table.&lt;/li&gt;
&lt;li&gt;Verify that the data was replicated to the target.&lt;/li&gt;
&lt;li&gt;Delete the resources.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Step # 1: Create the 2 instances (Source and Target) with a CloudFormation script.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run the following command to change into the directory containing the bash script used for the deployment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;cd /home/ec2-user/environment/Datapath/DMS&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fim64562xhagn9430ai0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fim64562xhagn9430ai0p.png" alt="DMS6" width="562" height="33"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run the “deploy.sh” file (creates the CloudFormation stack)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;bash deploy.sh&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ydyxaodkgqmv6c4cpre.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ydyxaodkgqmv6c4cpre.png" alt="DMS7" width="800" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verify that the stack is running in CloudFormation
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagfwifgk2fpsjx60vt4b.png" alt="DSMS99" width="800" height="369"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Step # 2: Create an inbound rule in each database's security group to open the PostgreSQL port.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to the RDS service and modify the security groups to open port 5432 (PostgreSQL)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0j5j8522g1inj5ey3hcd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0j5j8522g1inj5ey3hcd.png" alt="DMS024" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The security group of our DB instance is shown at the bottom right. Click the security group link to go to its settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab29buhs1kf2r73m7g2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab29buhs1kf2r73m7g2u.png" alt="DMS124" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click “&lt;strong&gt;Edit inbound rules&lt;/strong&gt;” and add the rule shown in the images below to open the Postgres port.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g1f2bk8kxflot68xd3l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g1f2bk8kxflot68xd3l.png" alt="DMS1235" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Step # 3: Create a replication instance in DMS.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to the &lt;strong&gt;DMS&lt;/strong&gt; service and create a new replication instance by clicking the “&lt;strong&gt;Create replication instance&lt;/strong&gt;” button (orange button)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzj2ecuyoyjlsg0h1zubo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzj2ecuyoyjlsg0h1zubo.png" alt="DMS55" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Enter a name (do not use the one in the image); it must be unique. Adding your AWS account ID as a suffix is recommended&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keep the default values and set:&lt;br&gt;
Allocated storage: 5 GB&lt;br&gt;
&lt;u&gt;&lt;em&gt;VPC&lt;/em&gt;&lt;/u&gt;: keep the default&lt;br&gt;
&lt;u&gt;&lt;em&gt;Type&lt;/em&gt;&lt;/u&gt;: Single AZ&lt;br&gt;
&lt;u&gt;&lt;em&gt;Publicly Accessible&lt;/em&gt;&lt;/u&gt;: unchecked&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5q0jrgwwc7vg2apeprw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk5q0jrgwwc7vg2apeprw.png" alt="DMS268" width="800" height="769"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Step # 4: Install the Postgres tools (psql)&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inside the Cloud9 instance, run the following command:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;sudo yum install postgresql -y&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fat9v0owe2cczthddhmf2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fat9v0owe2cczthddhmf2.png" alt="DMS5477" width="677" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Step #5: Connect to the Source DB&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inside the Cloud9 instance, run the following command to connect to the Source DB:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Make sure to replace the &lt;code&gt;[[++SOURCE_ENDPOINT++]]&lt;/code&gt; placeholder with the endpoint of the source database:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;psql -h [[++SOURCE_ENDPOINT++]] \&lt;br&gt;
     -U postgres \&lt;br&gt;
     -p 5432 \&lt;br&gt;
     -d postgres&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The password is: source_p4assw0rd&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1uxwe5yfbcse70jnurh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1uxwe5yfbcse70jnurh.png" alt="DMSSQL" width="611" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Step #6: Run the table-creation script on the Source DB&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;create table if not exists personas
(
    id            bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    nombre        varchar(100),
    apellido      varchar(100),
    telefono      varchar(100)
);

insert into personas (nombre, apellido, telefono) values ('cristian','carballo','1234');
insert into personas (nombre, apellido, telefono) values ('Juan','Lopez','1234');
insert into personas (nombre, apellido, telefono) values ('Miguel','Garcia','1234');
insert into personas (nombre, apellido, telefono) values ('Roman','Riquelme','1234');
insert into personas (nombre, apellido, telefono) values ('Diego','Maradona','1234');

create table if not exists marcas
(
    id            bigint GENERATED ALWAYS AS IDENTITY,
    nombre        varchar(100),
    ubicacion     varchar(100)
);

insert into marcas (nombre, ubicacion) values ('Nike','USA');
insert into marcas (nombre, ubicacion) values ('Reebok','UK');
insert into marcas (nombre, ubicacion) values ('Adidas','Nigeria');
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqzi58uwj8bbypeasxv1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqzi58uwj8bbypeasxv1.png" alt="DMS56278" width="688" height="591"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Step #7: Create the Source and Target endpoints in DMS.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to the DMS service and, under the Endpoints option, create 2 endpoints:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Source Endpoint ⇒ PostgreSQL Source &lt;br&gt;
Target Endpoint ⇒ PostgreSQL Target&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8blrwcib95r4exk8ao8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8blrwcib95r4exk8ao8.png" alt="SDM" width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2v7coby5lpwhej58dh6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2v7coby5lpwhej58dh6l.png" alt="SOURCE" width="723" height="919"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1mlgh2xpold6m5nqkm0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1mlgh2xpold6m5nqkm0.png" alt="TARGET" width="696" height="919"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Step #8: Test the connection to the Target DB and verify it contains no data&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Inside the Cloud9 instance, run the command to connect to the Target DB, just as in step # 5.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To verify that the target DB is empty, run the following command:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SELECT *&lt;br&gt;
FROM pg_catalog.pg_tables&lt;br&gt;
WHERE   schemaname != 'pg_catalog' AND &lt;br&gt;
        schemaname != 'information_schema';&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffru85e75mp9ystturxgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffru85e75mp9ystturxgg.png" alt="MM" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Step #9: Create a replication task (FULL LOAD) and run it.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the DMS service, create a Migration Task with the following parameters: &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiz0qb3ii38upbd6of0dl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiz0qb3ii38upbd6of0dl.png" alt="DEME" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;
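&lt;p&gt;If you prefer to script this step, the same kind of task can be defined through the DMS API. The sketch below builds a table-mapping document that selects every table in the &lt;code&gt;public&lt;/code&gt; schema and excludes one of them, which is how the scenario's "migrate everything except some entities" requirement could be expressed. The excluded table name is just an illustrative example.&lt;/p&gt;

```python
import json

# Hedged sketch of a DMS table-mapping document (selection-rule format).
# Excluding "marcas" is an example of leaving some entities out.
table_mappings = {
    "rules": [
        {"rule-type": "selection", "rule-id": "1", "rule-name": "include-all",
         "object-locator": {"schema-name": "public", "table-name": "%"},
         "rule-action": "include"},
        {"rule-type": "selection", "rule-id": "2", "rule-name": "exclude-marcas",
         "object-locator": {"schema-name": "public", "table-name": "marcas"},
         "rule-action": "exclude"},
    ]
}
mappings_json = json.dumps(table_mappings)

# This JSON would be passed as TableMappings to
# dms.create_replication_task(..., MigrationType='full-load').
```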

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Step #10: Verify that the data was replicated to the target.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run the following command on the Target DB, confirm that the personas table was created, and then query it:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;SELECT *&lt;br&gt;
FROM pg_catalog.pg_tables&lt;br&gt;
WHERE   schemaname != 'pg_catalog' AND &lt;br&gt;
        schemaname != 'information_schema';&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhi669eycam9xll0u1v7u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhi669eycam9xll0u1v7u.png" alt="DMSK" width="800" height="237"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Step #11: Create a replication task (CDC) and run it.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the DMS service, create a Migration Task with the following parameters:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxq6o8rqwsa4b7iaqt13n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxq6o8rqwsa4b7iaqt13n.png" alt="MKL" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Step # 12: Insert new records into the Source DB table.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For this example we will insert 2 records and update an existing one &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn966ithd7mwmnt5y2dna.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn966ithd7mwmnt5y2dna.png" alt="IUY" width="775" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Step #13: Validate that the data was replicated to the target.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As shown in the image below, the records were migrated correctly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrmwj5fqr830s2rh5b0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrmwj5fqr830s2rh5b0g.png" alt="KIU" width="577" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Step #14: Delete resources.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delete the resources in the following order:&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Stop the CDC migration task.&lt;/li&gt;
&lt;li&gt;Wait for it to stop, then delete both the CDC and FULL migration tasks.&lt;/li&gt;
&lt;li&gt;Wait for both migration tasks to finish deleting, then delete the Replication Instance.&lt;/li&gt;
&lt;li&gt;Delete the two endpoints (Source &amp;amp; Target) created in DMS.&lt;/li&gt;
&lt;li&gt;Manually delete the two RDS databases, making sure not to take a final backup.&lt;/li&gt;
&lt;li&gt;In CloudFormation, delete the RDS stack (StackRDS).&lt;/li&gt;
&lt;li&gt;In CloudFormation, delete the stack generated by the Cloud9 creation (aws-cloud9-datapath-xxxxx).&lt;/li&gt;
&lt;/ol&gt;
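&lt;p&gt;If you prefer scripting this cleanup instead of clicking through the console, the same order can be sketched as AWS CLI invocations. This is a sketch, not a tested teardown script: every ARN, identifier, and stack name below is a placeholder, and a real script would also wait between steps (for example with the &lt;code&gt;aws dms wait&lt;/code&gt; subcommands) before moving on.&lt;/p&gt;

```python
# Placeholder ARNs and identifiers; replace with the values from your account.
CDC_TASK_ARN = "arn:aws:dms:REGION:ACCOUNT:task:CDC-TASK-ID"
FULL_TASK_ARN = "arn:aws:dms:REGION:ACCOUNT:task:FULL-TASK-ID"
REPLICATION_INSTANCE_ARN = "arn:aws:dms:REGION:ACCOUNT:rep:INSTANCE-ID"
SOURCE_ENDPOINT_ARN = "arn:aws:dms:REGION:ACCOUNT:endpoint:SOURCE-ID"
TARGET_ENDPOINT_ARN = "arn:aws:dms:REGION:ACCOUNT:endpoint:TARGET-ID"

# Teardown commands in the order described above; each step must finish
# (task stopped / resource deleted) before the next one starts.
teardown = [
    ["aws", "dms", "stop-replication-task", "--replication-task-arn", CDC_TASK_ARN],
    ["aws", "dms", "delete-replication-task", "--replication-task-arn", CDC_TASK_ARN],
    ["aws", "dms", "delete-replication-task", "--replication-task-arn", FULL_TASK_ARN],
    ["aws", "dms", "delete-replication-instance", "--replication-instance-arn", REPLICATION_INSTANCE_ARN],
    ["aws", "dms", "delete-endpoint", "--endpoint-arn", SOURCE_ENDPOINT_ARN],
    ["aws", "dms", "delete-endpoint", "--endpoint-arn", TARGET_ENDPOINT_ARN],
    ["aws", "rds", "delete-db-instance", "--db-instance-identifier", "SOURCE-DB-ID", "--skip-final-snapshot"],
    ["aws", "rds", "delete-db-instance", "--db-instance-identifier", "TARGET-DB-ID", "--skip-final-snapshot"],
    ["aws", "cloudformation", "delete-stack", "--stack-name", "StackRDS"],
    ["aws", "cloudformation", "delete-stack", "--stack-name", "aws-cloud9-datapath-xxxxx"],
]

for cmd in teardown:
    print(" ".join(cmd))
```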

</description>
    </item>
    <item>
      <title>How to use "redshift-data" API with AWS CLI</title>
      <dc:creator>Cristian Carballo</dc:creator>
      <pubDate>Wed, 03 Aug 2022 14:58:18 +0000</pubDate>
      <link>https://forem.com/criscarba/how-to-use-redshift-data-api-with-aws-cli-1hbb</link>
      <guid>https://forem.com/criscarba/how-to-use-redshift-data-api-with-aws-cli-1hbb</guid>
      <description>&lt;p&gt;In this post i will list the steps to use the "&lt;strong&gt;redshift-data&lt;/strong&gt;" thru AWS CLI. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2e7buyrrfqex255h7c5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2e7buyrrfqex255h7c5k.png" alt="RedshiftDataApi" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Redshift Data API enables you to run statements against the cluster and fetch their results from outside of it.&lt;/p&gt;

&lt;p&gt;There are at least 2 popular ways to use it: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;BOTO3&lt;/strong&gt; Library &lt;a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift-data.html" rel="noopener noreferrer"&gt;Boto3 Documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AWS CLI&lt;/strong&gt; &lt;a href="https://docs.aws.amazon.com/cli/latest/reference/redshift-data/index.html" rel="noopener noreferrer"&gt;CLI Documentation&lt;/a&gt; (In this post we will use this method)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First of all, you need to create a Redshift cluster. If you just want to test the "&lt;strong&gt;redshift-data&lt;/strong&gt;" API, I highly recommend creating it with the default configuration, using the smallest instance type and node count.&lt;/p&gt;

&lt;p&gt;AWS recommends creating the user "&lt;strong&gt;redshift_data_api_user&lt;/strong&gt;" within the cluster, because the AWS managed policy ("&lt;strong&gt;AmazonRedshiftDataFullAccess&lt;/strong&gt;") already contains the grants needed to connect from outside and gives them to that user by default. If you want to grant access to another user, copy the entire policy document and replace the default user with the one you prefer.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;DEFAULT&lt;/strong&gt;&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{&lt;br&gt;
            "Sid": "GetCredentialsForAPIUser",&lt;br&gt;
            "Effect": "Allow",&lt;br&gt;
            "Action": "redshift:GetClusterCredentials",&lt;br&gt;
            "Resource": [&lt;br&gt;
                "arn:aws:redshift:*:*:dbname:*/*",&lt;br&gt;
                "arn:aws:redshift:*:*:dbuser:*/redshift_data_api_user"&lt;br&gt;
            ]&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;IF YOU WANT TO CHANGE DEFAULT USER&lt;/strong&gt;&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{&lt;br&gt;
            "Sid": "GetCredentialsForAPIUser",&lt;br&gt;
            "Effect": "Allow",&lt;br&gt;
            "Action": "redshift:GetClusterCredentials",&lt;br&gt;
            "Resource": [&lt;br&gt;
                "arn:aws:redshift:*:*:dbname:*/*",&lt;br&gt;
                "arn:aws:redshift:*:*:dbuser:*/&amp;lt;CUSTOM_USER/&amp;gt;"&lt;br&gt;
            ]&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;
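&lt;p&gt;As a quick sketch of that substitution, the snippet below rewrites the default statement's dbuser ARN for a custom user ("etl_user" is a hypothetical name used only for illustration):&lt;/p&gt;

```python
import copy
import json

# Statement from the AmazonRedshiftDataFullAccess managed policy (shown above).
default_stmt = {
    "Sid": "GetCredentialsForAPIUser",
    "Effect": "Allow",
    "Action": "redshift:GetClusterCredentials",
    "Resource": [
        "arn:aws:redshift:*:*:dbname:*/*",
        "arn:aws:redshift:*:*:dbuser:*/redshift_data_api_user",
    ],
}

def with_custom_user(stmt, user):
    """Return a copy of the statement whose dbuser ARN targets a custom user."""
    new = copy.deepcopy(stmt)
    new["Resource"] = [
        r.replace("redshift_data_api_user", user) for r in new["Resource"]
    ]
    return new

# "etl_user" is a hypothetical user name.
print(json.dumps(with_custom_user(default_stmt, "etl_user"), indent=2))
```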

&lt;p&gt;Once you decide whether to use the default user or a custom one, make sure your user/role has the policy attached. As you can see in the image below, I am attaching the default managed policy to my IAM user:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftyazxrfok3clfvxiixgo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftyazxrfok3clfvxiixgo.png" alt="IAM" width="800" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;/u&gt;: The user also has the policies to change its password &amp;amp; to access the Redshift console to query data. No other IAM permissions are attached to my example user.&lt;/p&gt;

&lt;p&gt;After attaching the permission to the user, perform the following steps to enable the redshift-data CLI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create the user within the cluster: &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;create user redshift_data_api_user password 'Password1234';&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grant USAGE and CREATE permissions on an example schema:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;GRANT USAGE on SCHEMA example to redshift_data_api_user;&lt;/code&gt;&lt;br&gt;
&lt;code&gt;GRANT CREATE on SCHEMA example to redshift_data_api_user;&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finally, we can use "&lt;strong&gt;redshift-data&lt;/strong&gt;" through the CLI to execute a statement. In this example I will create a sample table within the schema "&lt;strong&gt;example&lt;/strong&gt;":&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;aws redshift-data execute-statement \&lt;br&gt;
                  --region AWS-REGION \&lt;br&gt;
                  --db-user redshift_data_api_user \&lt;br&gt;
                  --cluster-identifier CLUSTER-ID \&lt;br&gt;
                  --database DATABASE \&lt;br&gt;
                  --sql "create table example.customer(name varchar(10), surname varchar(10))"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;Note&lt;/strong&gt;&lt;/u&gt;: Replace the region, cluster-identifier, and database values with those of your own cluster. Keep the "&lt;strong&gt;db-user&lt;/strong&gt;" parameter set to "&lt;strong&gt;redshift_data_api_user&lt;/strong&gt;", since that is the user created above and the one the policy allows us to get credentials for.&lt;/p&gt;

&lt;p&gt;I hope this example is helpful for you!&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br&gt;
Cristian.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Create an AWS Cloud9 Environment</title>
      <dc:creator>Cristian Carballo</dc:creator>
      <pubDate>Wed, 03 Aug 2022 14:02:09 +0000</pubDate>
      <link>https://forem.com/criscarba/create-an-aws-cloud9-environment-3a1f</link>
      <guid>https://forem.com/criscarba/create-an-aws-cloud9-environment-3a1f</guid>
      <description>&lt;p&gt;In this post i will explain you how to create in simple steps a Clou9 Environment for delopment within your AWS Account. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7i4vu3z1qorobgz4myc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7i4vu3z1qorobgz4myc.png" alt="Cloud9" width="600" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First things first. What is Cloud9?.&lt;/strong&gt;&lt;br&gt;
AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser. It includes a code editor, debugger, and terminal. Cloud9 comes prepackaged with essential tools for popular programming languages, including JavaScript, Python, PHP, and more, so you don’t need to install files or configure your development machine to start new projects. Since your Cloud9 IDE is cloud-based, you can work on your projects from your office, home, or anywhere using an internet-connected machine. Cloud9 also provides a seamless experience for developing serverless applications enabling you to easily define resources, debug, and switch between local and remote execution of serverless applications. With Cloud9, you can quickly share your development environment with your team, enabling you to pair program and track each other's inputs in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Enables you to develop using your web browser. The IDE is really similar to VS Code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Docker &amp;amp; Git are installed by default. You can develop and test your Dockerfiles using the terminal. In addition, you can clone your own repositories and work with them in the Cloud9 environment.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F345y4whiiyeyg77t8o0e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F345y4whiiyeyg77t8o0e.png" alt="Docker&amp;amp;Git" width="700" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Linux-based OS, which enables you to install packages into your environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;By using Cloud9 you can easily collaborate with other teammates.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmb80p7j9wv1laorvswa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmb80p7j9wv1laorvswa.png" alt="Collaboration" width="800" height="616"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating a Cloud9 Environment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You need to have an AWS Account. &lt;a href="https://signin.aws.amazon.com/" rel="noopener noreferrer"&gt;https://signin.aws.amazon.com/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the Services box, search for "Cloud9":&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhefr7dz2zwmlyiycg8b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxhefr7dz2zwmlyiycg8b.png" alt="C9" width="800" height="172"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click in "Create environment"&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkc3lx0r3bxkfksvs6uj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkc3lx0r3bxkfksvs6uj.png" alt="C92" width="185" height="69"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It is highly recommended to keep the default "&lt;strong&gt;Environment Type&lt;/strong&gt;" of "&lt;strong&gt;EC2 instance&lt;/strong&gt;" and "&lt;strong&gt;Instance type&lt;/strong&gt;" of "&lt;strong&gt;t2.micro&lt;/strong&gt;", since "&lt;strong&gt;t2.micro&lt;/strong&gt;" is part of the free tier. It might be enough for testing; however, depending on your use case, you may need to scale up the instance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdm4czdn68ysyb31ozfcq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdm4czdn68ysyb31ozfcq.png" alt="C93" width="800" height="805"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select the VPC in which you want to provision your environment. If you want to use the default VPC, you don't need to change anything; just click &lt;strong&gt;NEXT&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5klp671u00k4qznmkrn8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5klp671u00k4qznmkrn8.png" alt="C94" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finally, you will see a summary of the Cloud9 configuration. To create it, click "&lt;strong&gt;Create&lt;/strong&gt;".&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8g2e0p671a2ta6dafhur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8g2e0p671a2ta6dafhur.png" alt="C95" width="800" height="809"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once it is created, you can log in to your environment by clicking "&lt;strong&gt;Open in Cloud9&lt;/strong&gt;".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmcvtaoadss48kdzz3cb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmcvtaoadss48kdzz3cb.png" alt="C96" width="800" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvugmsc2629jzs8lpvgk9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvugmsc2629jzs8lpvgk9.png" alt="C97" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, it already has Docker &amp;amp; Git:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9woak83pwxm7w4usfa3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd9woak83pwxm7w4usfa3.png" alt="D&amp;amp;G" width="444" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope this information is useful for you!&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br&gt;
Cristian.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Payload Validation in AWS REST API using PYDANTIC</title>
      <dc:creator>Cristian Carballo</dc:creator>
      <pubDate>Wed, 27 Jul 2022 18:18:34 +0000</pubDate>
      <link>https://forem.com/criscarba/payload-validation-in-aws-rest-api-using-pydantic-2c7n</link>
      <guid>https://forem.com/criscarba/payload-validation-in-aws-rest-api-using-pydantic-2c7n</guid>
      <description>&lt;p&gt;This post will show an example of how to validate the payload received in a REST API developed in Python using SAM (Serverless Application Model). In addition i will show how to deploy it locally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PYDANTIC&lt;/strong&gt; provides data validation and settings management using Python type annotations. Furthermore, &lt;strong&gt;PYDANTIC&lt;/strong&gt; enforces type hints at runtime and provides user-friendly errors when data is invalid.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faibmn9ivxpf0w8us7p4b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faibmn9ivxpf0w8us7p4b.png" alt="SAM" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prerequisites for this example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up a &lt;a href="https://dev.to/criscarba/aws-local-serverless-environment-setup-using-aws-sam-6h8"&gt;Local Serverless Environment&lt;/a&gt; (click to open).&lt;/li&gt;
&lt;li&gt;Create a GitHub Account and Fork the following Repo: &lt;a href="https://github.com/criscarba/aws_sam_app_public/tree/master/payload-validator" rel="noopener noreferrer"&gt;https://github.com/criscarba/aws_sam_app_public/tree/master/payload-validator&lt;/a&gt; . Finally clone/download it locally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Please make sure you have successfully completed the prerequisites listed above before continuing. Reach out to me if you have any trouble.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STEP #1&lt;/strong&gt; — Open VSCode in the directory where the repository was downloaded or cloned:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjylch8ukbzw7h139fyk9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjylch8ukbzw7h139fyk9.png" alt="1" width="578" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frifnj18flovyi2ttkwg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frifnj18flovyi2ttkwg5.png" alt="2" width="542" height="217"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STEP #2&lt;/strong&gt;— Build the REST API with AWS SAM&lt;/p&gt;

&lt;p&gt;Open a new terminal in VSCode, make sure you are in the REST API directory where SAM can find the “template.yaml” file, and run the command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sam build&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5hkab5og079xam8y0xp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5hkab5og079xam8y0xp.png" alt="3" width="528" height="70"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STEP #3&lt;/strong&gt; — Start the API Locally&lt;/p&gt;

&lt;p&gt;Within the terminal in VSCode, run the command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sam local start-api&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70yqxm30akq2b0vgwacd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70yqxm30akq2b0vgwacd.png" alt="4" width="553" height="72"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfi1nqlqf33u9c4wjti4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfi1nqlqf33u9c4wjti4.png" alt="5" width="800" height="108"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will see that the application is running at &lt;a href="http://127.0.0.1:3000" rel="noopener noreferrer"&gt;http://127.0.0.1:3000&lt;/a&gt; (localhost)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STEP #4:&lt;/strong&gt; — Execute API using POSTMAN&lt;/p&gt;

&lt;p&gt;Download POSTMAN from their official site &lt;a href="https://www.postman.com/" rel="noopener noreferrer"&gt;https://www.postman.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Create a new workspace and test the API:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0aufxo8jcwppbqacbpl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0aufxo8jcwppbqacbpl.png" alt="Postman" width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STEP #5&lt;/strong&gt;: Understand PYDANTIC&lt;/p&gt;

&lt;p&gt;As explained above, &lt;strong&gt;PYDANTIC&lt;/strong&gt; provides data validation and settings management using Python type annotations, enforces type hints at runtime, and provides user-friendly errors when data is invalid.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ud9vv1k1wzyirulfjnx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ud9vv1k1wzyirulfjnx.png" alt="py" width="348" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Within the “services” folder there is a Python file called “event_check.py”; it defines a PYDANTIC class that models the expected event payload.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5bsjna14o3cpz43qiia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5bsjna14o3cpz43qiia.png" alt="py2" width="588" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STEP #6&lt;/strong&gt;: Test PYDANTIC Behavior&lt;/p&gt;

&lt;p&gt;Now that you understand how to create the model of your event, which allows PYDANTIC to enforce the validation, it is time to call the API with invalid payloads and see the outcomes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Send an unexpected KEY (table_2):&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sx06j9pc6ke152drtyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sx06j9pc6ke152drtyt.png" alt="py5" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Send fewer keys than expected (only table):&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jdzlnl7k3ojxhj7xl23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jdzlnl7k3ojxhj7xl23.png" alt="py6" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I hope the content is useful for everyone. Thanks a lot!&lt;br&gt;
Cheers!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cristian Carballo&lt;br&gt;
&lt;a href="mailto:cristian.carballo3@gmail.com"&gt;cristian.carballo3@gmail.com&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>AWS Local Serverless Environment Setup using AWS SAM</title>
      <dc:creator>Cristian Carballo</dc:creator>
      <pubDate>Wed, 27 Jul 2022 18:10:01 +0000</pubDate>
      <link>https://forem.com/criscarba/aws-local-serverless-environment-setup-using-aws-sam-6h8</link>
      <guid>https://forem.com/criscarba/aws-local-serverless-environment-setup-using-aws-sam-6h8</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhisvkv89lwx2b47agm3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhisvkv89lwx2b47agm3.png" alt="SAM" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post is intended to list the steps for setting up your local development environment for creating serverless applications using the AWS SAM CLI.&lt;/p&gt;

&lt;p&gt;I will list below all the pre-requisites that you need to have installed in your local machine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A valid AWS Account&lt;/strong&gt; — To build and deploy our serverless function to AWS Lambda, you must have a valid AWS account. If you are new and do not have an account yet, you can navigate to &lt;a href="http://console.aws.amazon.com/" rel="noopener noreferrer"&gt;http://console.aws.amazon.com/&lt;/a&gt; and sign up for a new account.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt; — My example is developed in Python, so I recommend installing Python if you want to use my code. You can download the latest version by visiting &lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;https://www.python.org/downloads/&lt;/a&gt; and install it for your operating system. It is important to mention that you can build your serverless functions in any language of your choice; AWS Lambda supports quite a few, like Python, C#, Ruby, NodeJS, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AWS CLI&lt;/strong&gt; — In addition to building the serverless apps locally, we will also need to access the AWS services programmatically. This can be achieved by installing the AWS CLI or the command-line interface, using which you can perform many administrative activities on your AWS Account.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AWS SAM CLI&lt;/strong&gt; — In order to develop and test the applications locally, you need to install the AWS SAM CLI on your machine. The AWS SAM CLI will provide an AWS Lambda like execution environment using which you can run your code locally and get the output&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Docker&lt;/strong&gt; — Finally, you also need to get Docker installed on your machine if you want to test the application locally. The AWS SAM CLI will use Docker to mount an image where the execution will be performed. You can install Docker by visiting &lt;a href="https://docs.docker.com/desktop/" rel="noopener noreferrer"&gt;https://docs.docker.com/desktop/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Visual Studio Code&lt;/strong&gt; — For developing the code, we are going to use Visual Studio Code as the editor. You can download it from &lt;a href="http://code.visualstudio.com/" rel="noopener noreferrer"&gt;http://code.visualstudio.com/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you have installed all the prerequisites on your machine, you can check the installed versions by running the following commands:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Python — python --version&lt;br&gt;
AWS CLI — aws --version&lt;br&gt;
AWS SAM CLI — sam --version&lt;br&gt;
Docker — docker --version&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjviqw80g2sgfu950x5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgjviqw80g2sgfu950x5e.png" alt="versions" width="800" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I hope the content is useful for everyone. Thanks a lot!&lt;br&gt;
Cheers!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cristian Carballo&lt;br&gt;
&lt;a href="mailto:cristian.carballo3@gmail.com"&gt;cristian.carballo3@gmail.com&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Develop your AWS Glue Jobs Locally using Jupyter Notebook</title>
      <dc:creator>Cristian Carballo</dc:creator>
      <pubDate>Wed, 27 Jul 2022 15:28:31 +0000</pubDate>
      <link>https://forem.com/criscarba/develop-your-aws-glue-jobs-locally-using-jupyter-notebook-2pb4</link>
      <guid>https://forem.com/criscarba/develop-your-aws-glue-jobs-locally-using-jupyter-notebook-2pb4</guid>
      <description>&lt;p&gt;This post is mainly intended for professionals who are Data Engineers and use AWS as a cloud provider. It will be covered how to create a local experimental environment step by step.&lt;/p&gt;

&lt;p&gt;As you well know, AWS offers multiple data oriented services, where AWS Glue stands out as a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there’s no infrastructure to set up or manage.&lt;/p&gt;

&lt;p&gt;AWS Glue is designed to work with semi-structured data. It introduces a component called a dynamic frame, which you can use in your ETL scripts. A dynamic frame is similar to an Apache Spark dataframe, which is a data abstraction used to organize data into rows and columns, except that each record is self-describing so no schema is required initially. With dynamic frames, you get schema flexibility and a set of advanced transformations specifically designed for dynamic frames. You can convert between dynamic frames and Spark dataframes, so that you can take advantage of both AWS Glue and Spark transformations to do the kinds of analysis that you want.&lt;/p&gt;

&lt;h2&gt;&lt;strong&gt;What is Jupyter Notebook?&lt;/strong&gt;&lt;/h2&gt;

&lt;p&gt;The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjc4ecra9us1gavadc8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnjc4ecra9us1gavadc8m.png" alt="jupyter" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How can we take advantage of Jupyter Notebook?&lt;/strong&gt; Basically, inside a Jupyter notebook we can perform all the necessary experimentation for our pipeline (transformations, aggregations, cleansing, enrichment, etc.) and then export it as a Python script (.py) for use in AWS Glue.&lt;/p&gt;
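
&lt;p&gt;One way to do that export, assuming the notebook is named &lt;code&gt;my_glue_job.ipynb&lt;/code&gt; (a placeholder; adjust to your file), is the &lt;code&gt;nbconvert&lt;/code&gt; tool that ships with Jupyter:&lt;/p&gt;

```shell
# Convert the notebook into a plain .py script for AWS Glue
# (my_glue_job.ipynb is a placeholder name)
jupyter nbconvert --to script my_glue_job.ipynb
```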

&lt;p&gt;&lt;strong&gt;Let’s Get Started!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1) Install the Anaconda environment with Python 3.x&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For Windows 64-Bit: &lt;a href="https://repo.anaconda.com/archive/Anaconda3-2020.11-Windows-x86_64.exe" rel="noopener noreferrer"&gt;https://repo.anaconda.com/archive/Anaconda3-2020.11-Windows-x86_64.exe&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;For Windows 32-Bit: &lt;a href="https://repo.anaconda.com/archive/Anaconda3-2020.11-Windows-x86.exe" rel="noopener noreferrer"&gt;https://repo.anaconda.com/archive/Anaconda3-2020.11-Windows-x86.exe&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;For other OS, find the version here: &lt;a href="https://www.anaconda.com/products/individual" rel="noopener noreferrer"&gt;https://www.anaconda.com/products/individual&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; I recommend using Python 3.7&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2) Install Apache Maven&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Download Link: &lt;a href="https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz" rel="noopener noreferrer"&gt;https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Unzip the file into C:\apache-maven-3.6.0 (Recommended)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkg7av3x6s8a5g2ghlhtq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkg7av3x6s8a5g2ghlhtq.png" alt="zip" width="583" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create the &lt;strong&gt;MAVEN_HOME&lt;/strong&gt; system variable (Windows =&amp;gt; Edit the system environment variables =&amp;gt; Environment Variables). Follow the instructions below:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkfvb04ilrnq1ti3ow0e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnkfvb04ilrnq1ti3ow0e.png" alt="maven" width="764" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fviuxq6wnorhjo5x8axxa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fviuxq6wnorhjo5x8axxa.png" alt="maven" width="601" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Modify the PATH Environment Variable in order to make the MAVEN_HOME variable visible:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftx13vdux78d2wvevwa7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftx13vdux78d2wvevwa7r.png" alt="path" width="731" height="411"&gt;&lt;/a&gt;&lt;/p&gt;
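
&lt;p&gt;The same setup can be done from a Windows command prompt. A minimal sketch, assuming Maven was unzipped into the recommended C:\apache-maven-3.6.0 (the same pattern applies to the JAVA_HOME, SPARK_HOME, and HADOOP_HOME variables created in the later steps):&lt;/p&gt;

```shell
:: Windows CMD sketch -- the path assumes the recommended location above;
:: adjust if you unzipped Maven elsewhere.
setx MAVEN_HOME "C:\apache-maven-3.6.0"
setx PATH "%PATH%;%MAVEN_HOME%\bin"
```

&lt;p&gt;Note that &lt;code&gt;setx&lt;/code&gt; persists the variables for new sessions only, so open a fresh prompt afterwards.&lt;/p&gt;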

&lt;p&gt;&lt;strong&gt;3) Install Java 8 Version&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Download the product for your OS Version &lt;a href="https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html" rel="noopener noreferrer"&gt;https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IMPORTANT:&lt;/strong&gt; During the installation, make sure to set the installation directory to C:\jdk for the Java Development Kit and C:\jre for the Java Runtime (otherwise, use the directories you chose during the installation).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnnbsxgyvlipch1q6p5ta.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnnbsxgyvlipch1q6p5ta.png" alt="java" width="467" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs34u66axj43b59utlddi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs34u66axj43b59utlddi.png" alt="java2" width="469" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create the &lt;strong&gt;JAVA_HOME&lt;/strong&gt; environment variable and make sure to add it to the PATH variable (same process as for MAVEN_HOME).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98psbh0s7b728f24n1bn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98psbh0s7b728f24n1bn.png" alt="pathvar" width="368" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4) Install the Spark distribution from the following location, based on the Glue version:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Glue version 1.0: &lt;a href="https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz" rel="noopener noreferrer"&gt;https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Unzip the file into the C:\spark_glue directory (or another directory of your choice).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpmrsy1vrjkxuu6xx3sr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpmrsy1vrjkxuu6xx3sr.png" alt="spark" width="513" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create the &lt;strong&gt;SPARK_HOME&lt;/strong&gt; environment variable and add it to the PATH variable (same process as for MAVEN_HOME).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faeb9zc69446lg26dmhwx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faeb9zc69446lg26dmhwx.png" alt="sparkhome" width="371" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5) Download the Hadoop Binaries&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Download Link: &lt;a href="https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz" rel="noopener noreferrer"&gt;https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create the HADOOP_HOME environment variable pointing to the folder where you extracted the binaries (e.g. “C:\hadoop”), and add %HADOOP_HOME%\bin to the PATH variable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You also need to download “winutils.exe”, available from this link: &lt;a href="https://github.com/steveloughran/winutils/blob/master/hadoop-3.0.0/bin/winutils.exe" rel="noopener noreferrer"&gt;https://github.com/steveloughran/winutils/blob/master/hadoop-3.0.0/bin/winutils.exe&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: Make sure that the “winutils.exe” file is within the “bin” folder of the Hadoop directory.&lt;/p&gt;
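
&lt;p&gt;A quick way to verify that note is a small Python check (pure standard library; the function name is just for illustration):&lt;/p&gt;

```python
import os
from pathlib import Path

# Illustrative helper: True when winutils.exe sits in HADOOP_HOME\bin.
def winutils_present(hadoop_home=None):
    home = hadoop_home or os.environ.get("HADOOP_HOME")
    if not home:
        return False
    return (Path(home) / "bin" / "winutils.exe").is_file()

print(winutils_present())
```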

&lt;p&gt;&lt;strong&gt;6) Install Python 3.7 in your Anaconda virtual environment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open an Anaconda Prompt and execute the command &lt;code&gt;conda install python=3.7&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50fo2nokdgy610e8lo53.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50fo2nokdgy610e8lo53.png" alt="Anaconda" width="538" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: This Process will take ~30 min&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7) Install “awsglue-local” in your Anaconda virtual environment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open an Anaconda Prompt and run the command &lt;code&gt;pip install awsglue-local&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxryt2cymgn11j44obw2f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxryt2cymgn11j44obw2f.png" alt="ana" width="534" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8) Download the Pre_Build_Glue_Jar dependencies (REQUIRED FOR CREATING THE INSTANCE OF THE SPARK SESSION)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Download Link: &lt;a href="https://drive.google.com/file/d/19JlxsFykugjDXeRSK5zwQ8M0nWzpGHdt/view" rel="noopener noreferrer"&gt;https://drive.google.com/file/d/19JlxsFykugjDXeRSK5zwQ8M0nWzpGHdt/view&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Unzip the jar file into the same folder where you will create your .ipynb (Jupyter Notebook). This .jar file is required for creating the Spark session instance. Below is an example of how to create the Spark session:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1k0aolqqgw01yly0q9x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm1k0aolqqgw01yly0q9x.png" alt="sp" width="440" height="90"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9) Confirm that you have installed everything successfully&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open a new Anaconda Prompt and execute the following commands:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;conda list awsglue-local&lt;br&gt;
java -version&lt;br&gt;
mvn -version&lt;br&gt;
pyspark&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxtopzanq0x5wnb86cdq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxtopzanq0x5wnb86cdq.png" alt="glue" width="552" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggqzke8p2fuiri9mb24y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggqzke8p2fuiri9mb24y.png" alt="javav" width="552" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dnssc7zck6z8pecy8qd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dnssc7zck6z8pecy8qd.png" alt="mvnv" width="552" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmiuxx86x6v3va3yan57s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmiuxx86x6v3va3yan57s.png" alt="sparkver" width="552" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10) Once everything is completed, open a Jupyter notebook.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Open a new Anaconda Prompt, run the command &lt;code&gt;pip install findspark&lt;/code&gt;, and wait until it completes. Once it has completed, close the Anaconda Prompt. In most cases this is only required the first time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Re-open the Anaconda Prompt and run the command &lt;code&gt;jupyter-lab&lt;/code&gt; to open a Jupyter notebook.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjnkyuwxy6mw0l0kvjj0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjnkyuwxy6mw0l0kvjj0u.png" alt="consola" width="543" height="104"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a Jupyter notebook and execute the following commands (this is one time only):&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ft555elgyeimwwm9vdx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ft555elgyeimwwm9vdx.png" alt="oto" width="166" height="60"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import findspark&lt;br&gt;
findspark.init()&lt;br&gt;
import pyspark&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You won’t need to execute this code again, since this is a typical step for the initial installation of Spark. The findspark library generates some references on the local machine to link the pyspark library with the bin files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I hope the content is useful for everyone. Thanks a lot!&lt;br&gt;
Cheers!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cristian Carballo&lt;br&gt;
&lt;a href="mailto:cristian.carballo3@gmail.com"&gt;cristian.carballo3@gmail.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/cristianrcarballo/" rel="noopener noreferrer"&gt;LinkedIn Profile&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
