<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Vijay Dilip Bhosale</title>
    <description>The latest articles on Forem by Vijay Dilip Bhosale (@bhosalevijayawssa).</description>
    <link>https://forem.com/bhosalevijayawssa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2781781%2F7922275d-35ab-4d9a-8068-32ffb51e211b.png</url>
      <title>Forem: Vijay Dilip Bhosale</title>
      <link>https://forem.com/bhosalevijayawssa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/bhosalevijayawssa"/>
    <language>en</language>
    <item>
      <title>AWS Data migration service (DMS) configuration considerations</title>
      <dc:creator>Vijay Dilip Bhosale</dc:creator>
      <pubDate>Wed, 18 Jun 2025 09:37:48 +0000</pubDate>
      <link>https://forem.com/bhosalevijayawssa/aws-data-migration-service-dms-configurations-2cci</link>
      <guid>https://forem.com/bhosalevijayawssa/aws-data-migration-service-dms-configurations-2cci</guid>
      <description>&lt;p&gt;A data migration to AWS involves leveraging various AWS services tailored to the type of data being migrated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relational Database Migration&lt;/strong&gt;&lt;br&gt;
Use &lt;strong&gt;AWS Database Migration Service (DMS)&lt;/strong&gt; for migrating relational databases.&lt;br&gt;
Additionally, native database tools such as Oracle GoldenGate, SQL Server Replication, bulk load, etc., can be considered based on the source and target environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;File-Based Migration&lt;/strong&gt;&lt;br&gt;
Utilize services such as &lt;strong&gt;AWS Transfer Family&lt;/strong&gt; and &lt;strong&gt;AWS DataSync&lt;/strong&gt; for migrating file-based data efficiently and securely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-Time Data Migration&lt;/strong&gt;&lt;br&gt;
For streaming or real-time data, consider &lt;strong&gt;Amazon Kinesis&lt;/strong&gt; or &lt;strong&gt;Amazon MSK (Managed Streaming for Apache Kafka)&lt;/strong&gt; to ensure low-latency and scalable data movement.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this blog, we will explore &lt;strong&gt;AWS Database Migration Service (DMS)&lt;/strong&gt;—its core capabilities, configuration options, and key considerations around cost and optimization. Whether one is planning a large-scale migration or a targeted database shift, understanding how to configure DMS effectively and manage associated costs is essential for a successful cloud transition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS DMS:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS Database Migration Service is a cloud service for migrating relational databases, NoSQL databases, data warehouses and other types of data stores into the AWS Cloud, or between cloud and on-premises setups, efficiently and securely. DMS supports many source and target databases, such as Oracle, Microsoft SQL Server, MySQL, PostgreSQL, Amazon Aurora, Amazon RDS, Amazon Redshift and Amazon S3. &lt;/p&gt;

&lt;p&gt;The service runs on an &lt;strong&gt;Amazon EC2 instance (Replication instance)&lt;/strong&gt;, which performs the following tasks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reads&lt;/strong&gt; data from the source database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transforms/formats&lt;/strong&gt; the data as needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loads&lt;/strong&gt; the data into the target database.&lt;/li&gt;
&lt;/ol&gt;
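In practice, these read/transform/load steps are defined in a replication task. Below is a minimal sketch of assembling the parameters for the `create_replication_task` API call; all ARNs and identifiers are hypothetical placeholders, and the table mapping simply includes everything.

```python
import json

def build_replication_task_params(task_id, source_arn, target_arn,
                                  instance_arn,
                                  migration_type="full-load-and-cdc"):
    """Assemble the request body for dms.create_replication_task()."""
    # Selection rule that includes every schema and table.
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        # One of: full-load | cdc | full-load-and-cdc
        "MigrationType": migration_type,
        "TableMappings": json.dumps(table_mappings),
    }

params = build_replication_task_params(
    "demo-task",
    "arn:aws:dms:us-east-1:111122223333:endpoint:SRC",
    "arn:aws:dms:us-east-1:111122223333:endpoint:TGT",
    "arn:aws:dms:us-east-1:111122223333:rep:RI",
)
```

The dictionary would be passed as keyword arguments to a boto3 DMS client's `create_replication_task` call.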

&lt;p&gt;AWS DMS operates through a secure, highly available architecture designed for reliable database migrations. Here's the detailed architecture and implementation approach:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flp8b6cumkdsh06zyqrg6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flp8b6cumkdsh06zyqrg6.png" alt=" " width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Implementation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Network Security&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "VPC": {
        "CIDR": "10.0.0.0/16",
        "Subnets": ["10.0.1.0/24", "10.0.2.0/24"],
        "SecurityGroups": {
            "DMSReplicationInstance": {
                "Inbound": [
                    {"Port": 3306, "Source": "SourceDB-SG"},
                    {"Port": 5432, "Source": "TargetDB-SG"}
                ]
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encryption Configuration&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Encryption:
  AtRest:
    - KMS key for replication instance
    - KMS key for stored credentials
  InTransit:
    - SSL/TLS for database connections
    - AWS Certificate Manager integration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
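As a sketch of how the encryption settings above map onto DMS API parameters: at-rest encryption is a KMS key on the replication instance, and in-transit encryption is an SSL mode plus certificate on each endpoint. The key and certificate ARNs below are hypothetical placeholders.

```python
def encrypted_instance_params(instance_id, kms_key_arn):
    """At-rest encryption: pass a KMS key to create_replication_instance()."""
    return {
        "ReplicationInstanceIdentifier": instance_id,
        "ReplicationInstanceClass": "dms.r5.xlarge",
        "KmsKeyId": kms_key_arn,            # KMS key for instance storage
        "PubliclyAccessible": False,
    }

def tls_endpoint_params(endpoint_id, engine, certificate_arn):
    """In-transit encryption: require SSL on the endpoint connection."""
    return {
        "EndpointIdentifier": endpoint_id,
        "EndpointType": "source",
        "EngineName": engine,
        "SslMode": "verify-full",           # none | require | verify-ca | verify-full
        "CertificateArn": certificate_arn,  # imported via dms.import_certificate()
    }
```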



&lt;p&gt;&lt;strong&gt;Performance Optimization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Replication Instance Sizing Matrix&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data Volume | Instance Class  | Storage (GB) | Network Performance
-----------|----------------|--------------|--------------------
&amp;lt; 1 TB     | dms.t3.large  | 100          | Up to 5 Gbps
1-5 TB     | dms.r5.xlarge | 200          | Up to 10 Gbps
5-10 TB    | dms.r5.2xlarge| 400          | Up to 15 Gbps
&amp;gt; 10 TB    | dms.r5.4xlarge| 1000         | Up to 25 Gbps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
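The sizing matrix above can be encoded in a small helper so instance selection is scriptable rather than looked up by hand. The thresholds mirror the table and should be treated as starting points, not authoritative guidance.

```python
def pick_replication_instance(data_volume_tb):
    """Return (instance_class, storage_gb) for a given data volume in TB,
    following the sizing matrix above."""
    if data_volume_tb < 1:
        return ("dms.t3.large", 100)
    if data_volume_tb <= 5:
        return ("dms.r5.xlarge", 200)
    if data_volume_tb <= 10:
        return ("dms.r5.2xlarge", 400)
    return ("dms.r5.4xlarge", 1000)
```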



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task Configuration for Optimal Performance&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "TaskSettings": {
        "TargetMetadata": {
            "BatchApplyEnabled": true,
            "ParallelLoadThreads": 8,
            "ParallelLoadBufferSize": 50000
        },
        "FullLoadSettings": {
            "MaxFullLoadSubTasks": 8,
            "TransactionConsistencyTimeout": 600
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Monitoring and Alerting&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch Metrics Dashboard&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Essential Metrics:
  - CPUUtilization: Alert threshold &amp;gt; 80%
  - FreeableMemory: Alert threshold &amp;lt; 2GB
  - FreeStorageSpace: Alert threshold &amp;lt; 10%
  - ReplicationLag: Alert threshold &amp;gt; 300 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Operational Alerts&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_cloudwatch_alarm():
    cloudwatch.put_metric_alarm(
        AlarmName='DMS-ReplicationLag',
        MetricName='ReplicationLag',
        Threshold=300,
        Period=300,
        EvaluationPeriods=2,
        ComparisonOperator='GreaterThanThreshold',
        AlarmActions=[sns_topic_arn]
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Disaster Recovery&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Backup Strategy:
  - Task settings backup
  - Source endpoint configurations
  - Target endpoint configurations
  - CloudFormation templates

Recovery Procedures:
  1. Launch new replication instance
  2. Restore task settings
  3. Recreate endpoints
  4. Resume replication
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The main cost components for AWS DMS are as follows: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Instance class&lt;/strong&gt;: &lt;br&gt;
The instance class chosen for the replication instance that performs the actual data migration. AWS DMS supports the T2, T3, C4, C5, C6i, R4, R5 and R6i instance classes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage cost&lt;/strong&gt;:&lt;br&gt;
Each instance has 50/100 GB or more of storage attached for the data cache and replication logs, billed at $0.115 per GB-month for a Single-AZ instance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-AZ&lt;/strong&gt;:&lt;br&gt;
Running the replication instance in Multi-AZ mode for high availability increases the cost of the above configuration accordingly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Transfer&lt;/strong&gt;:&lt;br&gt;
A standard data transfer cost for any data out of the region or the AWS account.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
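These components can be combined into a rough monthly estimate, sketched below. Only the $0.115 per GB-month storage rate comes from the text; the hourly instance rate is a hypothetical input, real prices vary by region and change over time, and the assumption that Multi-AZ roughly doubles the bill is an approximation.

```python
def estimate_monthly_cost(hourly_rate, storage_gb, multi_az=False,
                          storage_rate=0.115, hours=730):
    """Rough monthly DMS cost: instance hours plus attached storage.
    hourly_rate is a placeholder; look up the real rate for your region."""
    instance = hourly_rate * hours          # replication instance compute
    storage = storage_gb * storage_rate     # $/GB-month attached storage
    total = instance + storage
    # Approximation: Multi-AZ billed at roughly twice the Single-AZ rate.
    return round(total * 2 if multi_az else total, 2)
```

Data transfer out of the region or account would be added on top at standard AWS rates.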

&lt;p&gt;&lt;strong&gt;Database migrations and Change Data Capture (CDC)&lt;/strong&gt; are inherently complex processes influenced by multiple factors. To ensure a successful and optimized migration using AWS Database Migration Service (DMS), it is essential to carefully evaluate and configure the service based on the specific data migration requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Source database engine and versions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Target database engine and versions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Whether the target database is relational or non-relational.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Homogeneous or heterogeneous migration, which determines the schema conversion tool (AWS SCT) considerations accordingly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Requirement of one-time data migration or continuous data replication.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Selected tables or entire database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Number of tables, columns, transformations required.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;LOB columns to consider for the data migration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In case of CDC, acceptable amount of replication latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rate of change of data at source during peak hours.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Source and target database utilization during peak hours.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Transaction log size.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Network connectivity between source and target.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Network latency between the source and the replication instance, and between the replication instance and the target.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Additional source database engine configurations such as storage on ASM, RAC implementation for Oracle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Execute AWS provided scripts on the source database to capture configuration details.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The factors that affect performance during data migration using AWS DMS are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource availability on the source.&lt;/li&gt;
&lt;li&gt;The available network throughput.&lt;/li&gt;
&lt;li&gt;The resource capacity of the replication server.&lt;/li&gt;
&lt;li&gt;The ability of the target to ingest changes.&lt;/li&gt;
&lt;li&gt;The type and distribution of source data.&lt;/li&gt;
&lt;li&gt;The number of objects to be migrated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Considering all the above points, one needs to take into account the following configurations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The latest version of AWS DMS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sufficient storage on the replication instance for transaction logs and buffering of the data during peak hours.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS DMS data migration limitations for each source and target database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multi-AZ configuration: lower environments such as Dev, SIT and UAT can run in Single-AZ, with Multi-AZ reserved for the Prod environment for high availability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Replication instance sizing&lt;/strong&gt; - AWS DMS internally provisions an EC2 instance as the replication instance, and sizing it correctly is critical to the actual data migration process. One needs to consider the following factors for replication instance sizing: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Database and table size&lt;/strong&gt; : Data volume helps determine the task configuration to optimize full load performance. For example, for TBs of schemas, one can partition tables into multiple tasks of GBs and run them in parallel. The possible parallelism depends on the CPU resources available on the replication instance. That is why it is a good idea to understand the size of your database and tables: it determines the number of tasks you can run.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LOBs&lt;/strong&gt; : The data types present in the migration scope can affect performance. In particular, large objects (LOBs) impact performance and memory consumption. To migrate a LOB value, AWS DMS performs a two-step process. First, AWS DMS inserts the row into the target without the LOB value. Second, AWS DMS updates the row with the LOB value. This has an impact on memory, so it is important to identify LOB columns in the source and analyze their size.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Load frequency and transaction size&lt;/strong&gt; : Load frequency and transactions per second (TPS) influence memory usage. A high number of TPS or data manipulation language (DML) activities leads to high memory usage, because DMS caches the changes until they are applied to the target. During CDC, this leads to swapping (writing to the physical disk due to memory overflow), which causes latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Table keys and referential integrity&lt;/strong&gt; : Information about the table's keys determines the CDC mode (batch apply or transactional apply) used to migrate data. In general, transactional apply is slower than batch apply. For long-running transactions, there can be many changes to migrate; with transactional apply, AWS DMS might require more memory to store the changes compared to batch apply. If one migrates tables without primary keys, batch apply will fail and the DMS task moves to transactional apply mode. When referential integrity is active between tables during CDC, AWS DMS uses transactional apply by default.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
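The table-partitioning idea above can be sketched as a simple greedy packer that splits a large schema into tasks under a size cap; the table names, sizes and cap below are illustrative only.

```python
def partition_tables(tables, max_task_gb=500):
    """Greedily pack (name, size_gb) pairs into tasks of at most
    max_task_gb each, largest tables first."""
    tasks, current, current_size = [], [], 0
    for name, size in sorted(tables, key=lambda t: t[1], reverse=True):
        # Start a new task when adding this table would exceed the cap.
        if current and current_size + size > max_task_gb:
            tasks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        tasks.append(current)
    return tasks

# Illustrative schema: four tables totalling 1 TB.
plan = partition_tables([("a", 400), ("b", 300), ("c", 200), ("d", 100)])
```

How many of these tasks can actually run in parallel is then bounded by the replication instance's CPU, as noted above.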

&lt;p&gt;&lt;strong&gt;Best Practices Checklist&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Pre-Migration&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
 Network bandwidth assessment&lt;br&gt;
 Source/target database version compatibility check&lt;br&gt;
 Storage capacity planning&lt;br&gt;
 Security group and IAM role configuration&lt;br&gt;
 SSL certificate setup&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;During Migration&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
 Monitor replication lag&lt;br&gt;
 Track CPU/Memory utilization&lt;br&gt;
 Validate data consistency&lt;br&gt;
 Monitor network throughput&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Post-Migration&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
 Application connectivity verification&lt;br&gt;
 Performance baseline comparison&lt;br&gt;
 Data integrity validation&lt;br&gt;
 Cleanup temporary resources&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One can test the DMS implementation by starting with a subset of the data, adding tables gradually, and adjusting the DMS configuration accordingly. Document all of these details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;
This blog covers the points to consider before migrating relational databases using AWS DMS, and should help account for all the factors affecting a database migration. &lt;/p&gt;

</description>
      <category>aws</category>
      <category>awsbigdata</category>
      <category>awsdatalake</category>
      <category>s3</category>
    </item>
    <item>
      <title>AWS S3: Restricted Access for End Users</title>
      <dc:creator>Vijay Dilip Bhosale</dc:creator>
      <pubDate>Wed, 04 Jun 2025 05:30:15 +0000</pubDate>
      <link>https://forem.com/bhosalevijayawssa/aws-s3-restricted-access-for-end-users-34o0</link>
      <guid>https://forem.com/bhosalevijayawssa/aws-s3-restricted-access-for-end-users-34o0</guid>
      <description>&lt;p&gt;AWS S3 (Simple Storage Service) plays key role in modern data architecture as a centralized enterprise data lake layer for all the stages from ingestion, transformation, visualization, backup and DR, archival solutions.  As a data lake, AWS S3 provides a &lt;strong&gt;centralized storage layer&lt;/strong&gt; capable of storing &lt;strong&gt;structured, semi-structured, and unstructured data&lt;/strong&gt; from diverse sources including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Relational databases (RDBMS)&lt;/li&gt;
&lt;li&gt;APIs&lt;/li&gt;
&lt;li&gt;IoT devices&lt;/li&gt;
&lt;li&gt;External data sources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Integration with Internal Business Units:&lt;/strong&gt;&lt;br&gt;
Many organizations have internal business units that manage their own datasets manually. These datasets often need to be integrated with centralized data lakes and data warehouses hosted in the cloud.&lt;br&gt;
To support this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To enforce secure data practices, read and write access should be granted only to specific S3 prefixes based on user roles and responsibilities. This ensures that end users can access only the data relevant to their business unit or function.&lt;/li&gt;
&lt;li&gt;Access permissions must be configured in AWS S3 to allow end users to upload their datasets into designated S3 buckets.&lt;/li&gt;
&lt;li&gt;Since these datasets are often restricted to specific business units, access controls must be enforced accordingly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Access Control Considerations:&lt;/strong&gt;&lt;br&gt;
While row-level access controls are commonly implemented in data warehouses, similar access restrictions should also be applied at the S3 data lake level. This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only authorized users can access specific datasets.&lt;/li&gt;
&lt;li&gt;Business unit-level data segregation is maintained.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Scenario:&lt;/strong&gt;&lt;br&gt;
Let’s understand this with a practical example (to be detailed next), illustrating how access permissions and data integration work in a real-world AWS S3 data lake setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzr3mltf720p8lobyfsw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzr3mltf720p8lobyfsw.png" alt="Image description" width="697" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business Unit Access Control in AWS S3 Data Lake:&lt;/strong&gt;&lt;br&gt;
In a typical enterprise setup, business units such as Risk, Cards, and Loans should have read and write access only to their respective S3 prefixes. This ensures that each unit can access and manage only the datasets relevant to them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IAM Role-Based Access Design:&lt;/strong&gt;&lt;br&gt;
To enforce this segregation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Separate IAM roles should be created for each business unit.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;End users from each business unit should be mapped to their respective IAM role.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example: Risk Business Unit:&lt;/strong&gt;&lt;br&gt;
For instance, the IAM role risk-data-lake can be configured to allow access only to the S3 prefix:&lt;br&gt;
s3:///pba/manual/raw/risk&lt;/p&gt;

&lt;p&gt;This role would grant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Read and write permissions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Access restricted to the Risk unit’s data only&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach ensures data security, compliance, and operational clarity across business units.&lt;br&gt;
It can be implemented by creating the policy below, creating a role from it, and mapping users to that role.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Action": "s3:ListBucket",
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::mycompany-dev-landing",
            "Condition": {
                "StringEquals": {
                    "s3:prefix": [
                        "",
                        "pba/",
                        "pba/manual/",
                        "pba/manual/raw/",
                        "pba/manual/raw/risk/"
                    ],
                    "s3:delimiter": [
                        "/"
                    ]
                }
            }
        },
        {
            "Sid": "VisualEditor0A",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::mycompany-dev-landing",
            "Condition": {
                "StringLike": {
                    "s3:prefix": "pba/manual/raw/risk/*"
                }
            }
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::mycompany-dev-landing/pba/manual/raw/risk/*"
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": "s3:GetBucketLocation",
            "Resource": "arn:aws:s3:::*"
        },
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "arn:aws:s3:::*"
        }
    ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This policy grants the following permissions: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;List all the buckets in AWS S3&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;List the objects in the mycompany-dev-landing bucket under the permitted prefixes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Read and write objects under the specific prefix - pba/manual/raw/risk/*&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
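As a sketch, the same policy shape can be generated for any business unit. Note that the ListBucket condition needs every parent prefix of the unit's folder, which is easy to get wrong by hand. The bucket name and risk prefix come from the example above; applying it to other units (cards, loans) is an assumption that follows the same pattern.

```python
def prefix_chain(prefix):
    """'pba/manual/raw/risk/' -> ['', 'pba/', 'pba/manual/', ...]
    i.e. every parent prefix the ListBucket condition must allow."""
    parts = prefix.strip("/").split("/")
    return [""] + ["/".join(parts[:i + 1]) + "/" for i in range(len(parts))]

def unit_policy(bucket, unit_prefix):
    """Build the per-business-unit IAM policy document."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListParents",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {
                    "StringEquals": {
                        "s3:prefix": prefix_chain(unit_prefix),
                        "s3:delimiter": ["/"],
                    }
                },
            },
            {
                "Sid": "ReadWriteUnitData",
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{unit_prefix}*",
            },
        ],
    }

policy = unit_policy("mycompany-dev-landing", "pba/manual/raw/risk/")
```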

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; &lt;br&gt;
This approach enables the implementation of restricted access in the AWS S3 data lake, aligning with business unit boundaries and compliance requirements.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>awsdatalake</category>
      <category>awsbigdata</category>
      <category>s3</category>
    </item>
    <item>
      <title>OAuth grant type setup for Salesforce from Amazon AppFlow</title>
      <dc:creator>Vijay Dilip Bhosale</dc:creator>
      <pubDate>Fri, 09 May 2025 08:12:28 +0000</pubDate>
      <link>https://forem.com/bhosalevijayawssa/oauth-grant-type-setup-for-salesforce-from-amazon-appflow-4jg5</link>
      <guid>https://forem.com/bhosalevijayawssa/oauth-grant-type-setup-for-salesforce-from-amazon-appflow-4jg5</guid>
      <description>&lt;p&gt;&lt;strong&gt;OAuth grant type setup for Salesforce from Amazon AppFlow - JSON Web Token (JWT) and Authorization code:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the last blog of the three-part series on data transfer between Salesforce (Software-as-a-Service) CRM and AWS using Amazon AppFlow.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/bhosalevijayawssa/overview-of-amazon-appflow-for-data-transfer-between-salesforce-crm-and-aws-and-cost-factors-23ja"&gt;Part I&lt;/a&gt;, we discussed an overview of Amazon AppFlow for data transfer from external CRM applications. It also covered sample flow configuration and cost considerations when configuring and executing flows.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/bhosalevijayawssa/considerations-during-data-transfer-between-salesforce-software-as-a-service-crm-and-aws-using-4g6d"&gt;Part II&lt;/a&gt;, we discussed considerations and observations during data transfer between Salesforce (Software-as-a-Service) CRM and AWS using Amazon AppFlow.&lt;/p&gt;

&lt;p&gt;This blog covers in detail the JSON Web Token (JWT) setup and the authorization code option for the OAuth grant type: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JSON Web Token (JWT):&lt;/strong&gt;&lt;br&gt;
With this option, one needs to provide a JWT to access data from Salesforce. The JWT is passed along with the connection, and Salesforce grants access.&lt;/p&gt;

&lt;p&gt;As discussed earlier, due to organizational mandates or compliance reasons, the Authorization code option may not be supported, and one may need to connect to Salesforce from AWS using a &lt;strong&gt;JSON Web Token (JWT)&lt;/strong&gt;. This option uses a certificate to sign the JWT request and does not need explicit user credentials.&lt;/p&gt;

&lt;p&gt;One needs to create a certificate using the commands below, and the created certificate must be uploaded to Salesforce to obtain a JWT for the OAuth JWT grant type.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Generate RSA private key
openssl genrsa -des3 -passout pass:Password -out server.pass.key 2048
openssl rsa -passin pass:Password -in server.pass.key -out server.key

# Generate certificate signing request
openssl req -new -key server.key -out server.csr

# Generate self-signed X.509 certificate
openssl x509 -req -sha256 -days 365 -in server.csr -signkey server.key -out server.crt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These commands create the private key and the actual X.509 certificate. &lt;/p&gt;

&lt;p&gt;Once the X.509 certificate is created, one needs to create a connected app in Salesforce for the OAuth connection setup to use the JWT option with this certificate. The connected app needs the following options selected: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Enable OAuth setting&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use digital signature and upload the certificate already created&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Refer to the link below for detailed steps on configuring the connected app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://help.salesforce.com/s/articleView?id=sf.connected_app_overview.htm&amp;amp;language=en_US&amp;amp;type=5" rel="noopener noreferrer"&gt;https://help.salesforce.com/s/articleView?id=sf.connected_app_overview.htm&amp;amp;language=en_US&amp;amp;type=5&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the connected app is created, one can create a valid JWT, which consists of a header, payload and signature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Headers&lt;/strong&gt; – JWT header with format alg:RS256&lt;br&gt;
&lt;strong&gt;Payload&lt;/strong&gt; – JSON claims set with the following parameters – &lt;/p&gt;

&lt;p&gt;{&lt;br&gt;
"iss": "3MVG99OxTyEMCQ3gNp2PjkqeZKxnmAiG1xV4oHh9AKL_rSK.BoSVPGZHQukXnVjzRgSuQqGn75NL7yfkQcyy7",&lt;br&gt;
"aud": "&lt;a href="https://test.salesforce.com" rel="noopener noreferrer"&gt;https://test.salesforce.com&lt;/a&gt;",&lt;br&gt;
"sub": "&lt;a href="mailto:sample@gmail.com"&gt;sample@gmail.com&lt;/a&gt;",&lt;br&gt;
"exp": "1333685628"&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signature&lt;/strong&gt;: The Base64URL-encoded header and payload, signed with the certificate's private key&lt;/p&gt;

&lt;p&gt;With the above details, and using external tools such as jwt.io, a valid base64-encoded JWT can be generated. &lt;/p&gt;
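As a sketch of what such tools do under the hood, the header and claims segments can be assembled with the Python standard library alone. The consumer key below is a placeholder, and the RS256 signing step, which needs the private key and a cryptography library, is only indicated as the string that would be signed.

```python
import base64
import json
import time

def b64url(data: bytes) -> str:
    """Base64URL encoding without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

header = {"alg": "RS256"}
claims = {
    "iss": "<connected-app-consumer-key>",  # from the Salesforce connected app
    "aud": "https://test.salesforce.com",
    "sub": "sample@gmail.com",
    "exp": str(int(time.time()) + 300),     # short-lived expiry
}

# header.payload string; the signature over this string (made with the
# private key) would be appended as the third dot-separated segment.
signing_input = b64url(json.dumps(header).encode()) + "." + \
                b64url(json.dumps(claims).encode())
# token = signing_input + "." + b64url(rs256_sign(signing_input, private_key))
```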

&lt;p&gt;A connection to Salesforce in Amazon AppFlow can then be created by selecting the JSON web token (JWT) option and entering the valid JWT created above, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14xr4ndw6nbps1txkqbt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14xr4ndw6nbps1txkqbt.png" alt="Image description" width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authorization code:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With this option, one selects Authorization code as the OAuth grant type when setting up the Salesforce connection in Amazon AppFlow, and a separate window prompts for the user credentials to be entered. &lt;/p&gt;

&lt;p&gt;As discussed before, this option needs no explicit setup on the Salesforce or AWS side; the given user only needs access to the Salesforce account. However, it may not be available in higher environments due to compliance requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1h2fs7tixmt8mcm4pok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1h2fs7tixmt8mcm4pok.png" alt="Image description" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zf8xah8puzgotl1ysf2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zf8xah8puzgotl1ysf2.png" alt="Image description" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This blog provides details about the &lt;strong&gt;JSON Web Token (JWT)&lt;/strong&gt; setup, on both the Salesforce and Amazon AppFlow sides, for connecting to Salesforce using JWT as the OAuth grant type. It also discusses the &lt;strong&gt;Authorization code&lt;/strong&gt; OAuth grant type for connecting to Salesforce.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>salesforce</category>
      <category>awsbigdata</category>
      <category>awsdatalake</category>
    </item>
    <item>
      <title>Considerations during data transfer between Salesforce and AWS using Amazon AppFlow:</title>
      <dc:creator>Vijay Dilip Bhosale</dc:creator>
      <pubDate>Tue, 22 Apr 2025 06:17:32 +0000</pubDate>
      <link>https://forem.com/bhosalevijayawssa/considerations-during-data-transfer-between-salesforce-software-as-a-service-crm-and-aws-using-4g6d</link>
      <guid>https://forem.com/bhosalevijayawssa/considerations-during-data-transfer-between-salesforce-software-as-a-service-crm-and-aws-using-4g6d</guid>
      <description>&lt;p&gt;This is the second blog of the three-part series on considerations and observations during data transfer between Salesforce (Software-as-a-service) CRM and AWS using Amazon AppFlow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/bhosalevijayawssa/overview-of-amazon-appflow-for-data-transfer-between-salesforce-crm-and-aws-and-cost-factors-23ja"&gt;Part I&lt;/a&gt; provided an overview of Amazon AppFlow for data transfer from external CRM applications. It also covered a sample flow configuration and cost considerations when configuring and executing flows.&lt;/p&gt;

&lt;p&gt;This blog will cover the following considerations for Salesforce as a source: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Large data migration from Salesforce to AWS (S3 considered as the destination here):&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are limits on how much data a single Amazon AppFlow flow run can transfer. For example, a maximum of 1 GB of data can be transferred per flow run with &lt;strong&gt;Marketo&lt;/strong&gt; as the source; with &lt;strong&gt;Salesforce as the source&lt;/strong&gt;, the limit is 15 GB per flow run.&lt;/p&gt;

&lt;p&gt;Assuming an average size of &lt;strong&gt;2 KB&lt;/strong&gt; to &lt;strong&gt;4 KB&lt;/strong&gt; per &lt;strong&gt;Salesforce&lt;/strong&gt; record, the 15 GB limit corresponds to approximately &lt;strong&gt;7.5 million&lt;/strong&gt; Salesforce records (at 2 KB per record). This works fine for daily incremental or scheduled flow runs.&lt;/p&gt;
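&lt;p&gt;The record-count estimate above can be reproduced with a quick calculation; the 2 KB average record size is an assumption that should be verified against the actual org:&lt;/p&gt;

```python
# Estimate how many Salesforce records fit in one AppFlow flow run,
# given the 15 GB per-run limit for Salesforce as the source.
LIMIT_GB = 15
AVG_RECORD_KB = 2  # assumed average record size; verify for your org

limit_kb = LIMIT_GB * 1_000_000  # decimal units: 1 GB = 10^6 KB
records_per_run = limit_kb // AVG_RECORD_KB
print(records_per_run)  # 7500000 records per flow run at 2 KB/record
```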

&lt;p&gt;For the initial full load run, however, this only works for small to medium-sized tables that fall under the size limit, so the transfer completes in a single flow run.&lt;/p&gt;

&lt;p&gt;For large tables with millions of records and more than 15 GB of data in a single table, such as &lt;strong&gt;30, 50, or 100 GB&lt;/strong&gt;, the transfer must be split across separate data flows, each configured to move no more than 15 GB.&lt;/p&gt;

&lt;p&gt;It is better to filter the data on &lt;strong&gt;a date column&lt;/strong&gt; so it can be split by month, quarter, or year and loaded accordingly. This also helps with archiving historic data, if required.&lt;/p&gt;

&lt;p&gt;Let’s consider three tables with different data sizes here for the data flow design considerations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Table A – 12 GB data size&lt;/strong&gt;&lt;br&gt;
Suppose this table has data from the &lt;strong&gt;last 10 years&lt;/strong&gt;. As this data size is within the limit of 15 GB, this table can be loaded in a single flow run in Amazon AppFlow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Table B – 40 GB data size&lt;/strong&gt;&lt;br&gt;
Suppose this table has data from the &lt;strong&gt;last 5 years&lt;/strong&gt;. It can be split &lt;strong&gt;year-wise&lt;/strong&gt; on the modified-date column, verifying the record count for each year to ensure it does not exceed the 15 GB limit. Accordingly, five separate flows can be created to load the data, which is then consolidated and validated post-migration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Table C – 82 GB data size&lt;/strong&gt;&lt;br&gt;
Suppose this table has data from the &lt;strong&gt;last 4 years&lt;/strong&gt;. Splitting this table by year would not work, as a single year may exceed the 15 GB limit. In this case, the data may need to be split into &lt;strong&gt;quarterly&lt;/strong&gt; or &lt;strong&gt;half-yearly&lt;/strong&gt; slices, and again consolidated and validated post-migration.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In all these scenarios, the data needs to be queried first to extract the record count and approximate data size per year, half-year, or quarter, and the flows designed accordingly.&lt;/p&gt;
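&lt;p&gt;The sizing exercise above can be sketched as a small helper that estimates the minimum number of 15 GB flow runs a table needs. The record counts and the 2 KB average record size are illustrative assumptions, not measured values:&lt;/p&gt;

```python
# Minimum number of 15 GB AppFlow flow runs needed for a table, assuming
# an average Salesforce record size of 2 KB (verify against real data).
AVG_RECORD_KB = 2
LIMIT_KB = 15 * 1_000_000  # 15 GB per flow run, in decimal KB

def min_flows_needed(record_count):
    size_kb = record_count * AVG_RECORD_KB
    # ceiling division: each flow run may carry at most LIMIT_KB
    return -(-size_kb // LIMIT_KB)

# Hypothetical record counts roughly matching Tables A, B, C above
for name, count in [("A", 6_000_000), ("B", 20_000_000), ("C", 41_000_000)]:
    print(name, min_flows_needed(count))
```

&lt;p&gt;Note this is only a lower bound: the actual number of flows is usually higher, since splits follow calendar boundaries such as years or quarters rather than exact 15 GB chunks.&lt;/p&gt;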

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Salesforce API preference:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Salesforce as a source or destination, one can configure the &lt;strong&gt;Salesforce API preference&lt;/strong&gt; setting, which specifies which Salesforce APIs Amazon AppFlow can use during data transfer from Salesforce to AWS. This allows the transfer to be optimized for small, medium, and large datasets. &lt;/p&gt;

&lt;p&gt;There are three options available:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Standard&lt;/strong&gt; – Uses Salesforce REST API and is optimized for small to medium-sized data transfers.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Bulk&lt;/strong&gt; – Uses Salesforce Bulk API 2.0, which runs asynchronous data transfers and is optimized for large data transfers.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Automatic&lt;/strong&gt; – AppFlow decides which API to use based on the number of records the flow transfers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With Salesforce as the source and the &lt;strong&gt;Automatic&lt;/strong&gt; option selected, the API preference is determined as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Salesforce REST API&lt;/strong&gt; – For fewer than 1,000,000 Salesforce records.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Salesforce Bulk API 2.0&lt;/strong&gt; – For 1,000,000 or more Salesforce records.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
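&lt;p&gt;The Automatic behavior described above amounts to a simple threshold check on the record count; a minimal sketch, with the threshold value taken from the text:&lt;/p&gt;

```python
# Sketch of the Automatic API preference rule: Salesforce REST API below
# 1,000,000 records, Salesforce Bulk API 2.0 at or above it.
THRESHOLD = 1_000_000

def preferred_api(record_count):
    # integer division gives 0 below the threshold; min caps it at 1
    return ("REST API", "Bulk API 2.0")[min(record_count // THRESHOLD, 1)]

print(preferred_api(50_000))     # small dataset
print(preferred_api(5_000_000))  # large dataset
```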

&lt;p&gt;It works with &lt;strong&gt;Automatic&lt;/strong&gt; as the API preference in most cases, using the REST API for small and medium datasets and the Bulk API for large datasets. However, with the Bulk API for large datasets, there is one limitation as follows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Flow can't transfer Salesforce compound fields as Bulk API 2.0 doesn't support them.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Suppose there are fields such as &lt;strong&gt;First name&lt;/strong&gt;, &lt;strong&gt;Last name&lt;/strong&gt;, and &lt;strong&gt;Full name&lt;/strong&gt; in the Salesforce object (Table), and Full name is configured as a compound field in Salesforce as:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Full name = First name + Last name&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this scenario, the &lt;strong&gt;First name&lt;/strong&gt; and &lt;strong&gt;Last name&lt;/strong&gt; fields would be transferred to AWS, but the &lt;strong&gt;Full name&lt;/strong&gt; would not be. Therefore, one would need to understand the logic for any compound fields in the dataset and derive them again at the destination during the transformations.&lt;/p&gt;

&lt;p&gt;Similarly, there could be another compound field such as &lt;strong&gt;Complete address&lt;/strong&gt; derived from &lt;strong&gt;Address1&lt;/strong&gt;, &lt;strong&gt;Address2&lt;/strong&gt;, &lt;strong&gt;City&lt;/strong&gt;, &lt;strong&gt;State&lt;/strong&gt;, &lt;strong&gt;Pin code&lt;/strong&gt;, &lt;strong&gt;Country&lt;/strong&gt;, etc.&lt;/p&gt;
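&lt;p&gt;Since Bulk API 2.0 drops compound fields, they must be re-derived at the destination during the transformations. A minimal sketch of such a derivation; the field names follow the examples above but are hypothetical, and the real compound-field logic should be confirmed in the Salesforce setup:&lt;/p&gt;

```python
# Re-derive Salesforce compound fields that Bulk API 2.0 cannot transfer.
def derive_compound_fields(record):
    out = dict(record)
    # Full name = First name + Last name (per the compound-field example)
    first = record.get("FirstName", "")
    last = record.get("LastName", "")
    out["FullName"] = f"{first} {last}".strip()
    # Complete address assembled from its component fields
    parts = [record.get(k, "") for k in
             ("Address1", "Address2", "City", "State", "PinCode", "Country")]
    out["CompleteAddress"] = ", ".join(p for p in parts if p)
    return out

rec = {"FirstName": "Jane", "LastName": "Doe",
       "City": "Pune", "Country": "India"}
print(derive_compound_fields(rec)["FullName"])         # Jane Doe
print(derive_compound_fields(rec)["CompleteAddress"])  # Pune, India
```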

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OAuth grant type for Amazon AppFlow to communicate with Salesforce:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During Salesforce connection setup, one needs to choose an OAuth grant type. This choice determines how Amazon AppFlow communicates with Salesforce to access the data. There are two options available:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authorization code&lt;/strong&gt;&lt;br&gt;
With this option, the Amazon AppFlow console shows a window that prompts for authorization to the Salesforce account. Once signed in with the credentials, one needs to choose "Allow" to permit Amazon AppFlow to access Salesforce data. AppFlow then creates an AWS-managed connected app in the Salesforce account. No additional setup is required with this option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JSON Web Token (JWT)&lt;/strong&gt;&lt;br&gt;
With this option, one needs to provide a JWT to access the data from Salesforce. The JWT is passed along with the connection, and Salesforce grants access.&lt;/p&gt;

&lt;p&gt;The JWT needs to be created before it can be used to access Salesforce data.&lt;/p&gt;
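&lt;p&gt;For orientation, the JWT used by the Salesforce JWT bearer flow carries a small set of claims. A stdlib-only sketch of the unsigned header and claim set; signing (RS256 with the connected app's private key) is not shown, and all values below are placeholders rather than real settings:&lt;/p&gt;

```python
import base64
import json
import time

# Build the unsigned header and claim set for a Salesforce JWT bearer
# token. A real token must be signed with RS256 using the connected
# app's private key, which is outside this stdlib-only sketch.
def jwt_unsigned_parts(consumer_key, username, audience, lifetime_s=180):
    def b64(d):
        raw = json.dumps(d).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    header = {"alg": "RS256"}
    claims = {
        "iss": consumer_key,                   # connected app consumer key
        "sub": username,                       # Salesforce username
        "aud": audience,                       # login URL for the org type
        "exp": int(time.time()) + lifetime_s,  # short-lived expiry
    }
    return b64(header) + "." + b64(claims)

print(jwt_unsigned_parts("MY_CONSUMER_KEY", "integration@example.com",
                         "https://login.salesforce.com"))
```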

&lt;p&gt;The &lt;strong&gt;Authorization code&lt;/strong&gt; option may be available for lower environments such as Dev, Sandbox, etc., but one may need to access data using &lt;strong&gt;JWT&lt;/strong&gt; for higher environments such as Prod, UAT, etc.&lt;/p&gt;

&lt;p&gt;Additionally, there could be an organizational mandate to restrict Authorization code access and use only JWT-based access.&lt;/p&gt;

&lt;p&gt;A detailed process to set up JWT and access data using the same will be discussed in the next blog separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This blog provides considerations, limitations, and their workarounds for data transfer between Salesforce and AWS using Amazon AppFlow.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>awsdatalake</category>
      <category>s3</category>
      <category>salesforce</category>
    </item>
    <item>
      <title>Overview of Amazon Appflow for data transfer between Salesforce CRM and AWS and Cost Factors Considerations</title>
      <dc:creator>Vijay Dilip Bhosale</dc:creator>
      <pubDate>Wed, 09 Apr 2025 16:11:54 +0000</pubDate>
      <link>https://forem.com/bhosalevijayawssa/overview-of-amazon-appflow-for-data-transfer-between-salesforce-crm-and-aws-and-cost-factors-23ja</link>
      <guid>https://forem.com/bhosalevijayawssa/overview-of-amazon-appflow-for-data-transfer-between-salesforce-crm-and-aws-and-cost-factors-23ja</guid>
      <description>&lt;p&gt;Organizations on their data modernization journey need to migrate existing, on-premises data from relational databases, external sources, files, and software-as-a-service (SaaS) applications into AWS data lakes and data warehouses to derive valuable insights from the data.&lt;/p&gt;

&lt;p&gt;Previous blogs discussed in detail the process of migrating data from relational databases into AWS using AWS Database Migration Service (DMS). This blog will focus on data migration options for SaaS applications.&lt;/p&gt;

&lt;p&gt;Organizations often use CRM applications and platforms like Salesforce, ServiceNow, Slack, Marketo, HubSpot, Zendesk, Zoho, etc., to manage customers, sales, marketing, and service processes. These SaaS applications offer public APIs, but developers need to spend time developing the code to push or pull data from these services. Amazon AppFlow can help here by providing no-code integration with these applications, simplifying the data migration process.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Amazon AppFlow&lt;/strong&gt; is a fully managed integration service that enables you to securely exchange data between software as a service (SaaS) applications, such as Salesforce, and AWS services, such as Amazon Simple Storage Service (Amazon S3) and Amazon Redshift. For example, you can ingest contact records from Salesforce to Amazon Redshift or pull support tickets from Zendesk to an Amazon S3 bucket.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Amazon AppFlow supports a wide range of applications for integration with AWS; some of the major ones are Salesforce, ServiceNow, Slack, Marketo, and Zendesk. Supported AWS destinations include Amazon S3, Amazon Redshift, and Amazon EventBridge. &lt;/p&gt;

&lt;p&gt;Some of the use cases for AppFlow are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Migrate SaaS data&lt;/li&gt;
&lt;li&gt;Bi-directional data sync between CRM applications and AWS&lt;/li&gt;
&lt;li&gt;Data lake integration&lt;/li&gt;
&lt;li&gt;Data warehouse sync&lt;/li&gt;
&lt;li&gt;Customer 360 visibility&lt;/li&gt;
&lt;li&gt;Event-driven workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Features of Amazon AppFlow for data integration are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Scalable and managed service&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;No-code integration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bi-directional data flow&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Transformations during the data transfer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Private data transfer using AWS PrivateLink&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Built-in security features for data in transit (TLS) and data at rest (KMS)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here Salesforce is in scope for data integration with AWS and let’s go through Salesforce concepts as follows:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Salesforce is the world's leading customer relationship management (CRM) platform that uses cloud-based software to help businesses manage customer data and interactions across sales, service, marketing, and e-commerce. It provides a unified view of customer data to help companies build stronger customer relationships, increase productivity through automation, and gain insights from analytics.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Data from Salesforce CRM needs to be pulled into Amazon S3 for data lake integration, where the required transformations are performed, and the data is then synced with the data warehouse for downstream applications and business reporting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7o41tec16vb3t8hl5gfw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7o41tec16vb3t8hl5gfw.jpg" alt=" " width="592" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon AppFlow setup:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection –&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To set up a connection between a software-as-a-service (SaaS) application and AWS using Amazon AppFlow, one needs to create connections within Amazon AppFlow. This involves storing the configuration details and credentials that Amazon AppFlow requires to transfer data between these applications.&lt;/p&gt;

&lt;p&gt;Salesforce connector – &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cjlvs6h98fq7nyvd59l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9cjlvs6h98fq7nyvd59l.png" alt=" " width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Create connection to Salesforce – &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl9q82m5th729mdytbqh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcl9q82m5th729mdytbqh.png" alt=" " width="800" height="666"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Salesforce side authentication – &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yoqrq4vhjt4rdspvs4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yoqrq4vhjt4rdspvs4r.png" alt=" " width="800" height="669"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow –&lt;/strong&gt;&lt;br&gt;
Amazon AppFlow transfers data between SaaS applications, such as Salesforce in this case, and AWS. The flow setup involves specifying several details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Source and Destination Details&lt;/strong&gt;: Define the source (Salesforce) and the destination (AWS).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Encryption&lt;/strong&gt;: Configure encryption settings to ensure data security during transfer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Mapping&lt;/strong&gt;: Decide how the data from the source will be mapped and placed into the destination.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Filter Conditions&lt;/strong&gt;: Control which records are transferred to the destination based on specified filter conditions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flow Trigger&lt;/strong&gt;: Set up triggers to determine when the flow should be executed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Create flow –&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjmsv7on49k1g2md2ixg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjmsv7on49k1g2md2ixg.png" alt=" " width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Configure flow for source – &lt;strong&gt;Salesforce&lt;/strong&gt; and destination – &lt;strong&gt;S3&lt;/strong&gt; connection – &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9d6i74b9nn401wx3hstx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9d6i74b9nn401wx3hstx.png" alt=" " width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The create-flow setup has additional tabs for mapping data fields, adding filters, and other details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flow Trigger Options in Amazon AppFlow:&lt;/strong&gt;&lt;br&gt;
The flow trigger helps to decide how the flow will run, and it has three options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run on demand&lt;/strong&gt; – Run manually, as per the data transfer requirements&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run on schedule&lt;/strong&gt; – The flow runs on a recurring schedule&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run on event&lt;/strong&gt; – The flow runs in response to change events from the source SaaS application&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
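&lt;p&gt;In the AppFlow API, these three options map to the flow's trigger configuration. A hedged sketch of the request fragments, shaped like the triggerConfig argument of the CreateFlow API; field names follow the boto3 AppFlow API and should be verified against the current documentation before use:&lt;/p&gt;

```python
# Trigger configuration fragments for the three AppFlow trigger types,
# shaped like the triggerConfig argument of the AppFlow CreateFlow API.
on_demand = {"triggerType": "OnDemand"}

scheduled = {
    "triggerType": "Scheduled",
    "triggerProperties": {
        # rate expression for hourly runs; check the exact schedule
        # expression syntax that AppFlow accepts
        "Scheduled": {"scheduleExpression": "rate(1hours)"}
    },
}

event_based = {"triggerType": "Event"}

for cfg in (on_demand, scheduled, event_based):
    print(cfg["triggerType"])
```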

&lt;p&gt;&lt;strong&gt;Amazon AppFlow costing:&lt;/strong&gt;&lt;br&gt;
First, review the Amazon AppFlow pricing to understand the costs associated with different flow triggers, and consider how each trigger option (on-demand, scheduled, and event-based) impacts the overall cost.&lt;/p&gt;

&lt;p&gt;Amazon AppFlow charges are based on two components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flow run cost&lt;/strong&gt; – The flow run cost is determined by the number of flow runs executed to call the source applications and transfer data to the destination. Even if a flow run only checks for data availability at the source and finds no data, it is still counted and contributes to the overall cost. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Price per flow run – &lt;strong&gt;$0.001&lt;/strong&gt;&lt;br&gt;
Maximum number of flow runs per AWS account per month – &lt;strong&gt;10 million&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data processing cost&lt;/strong&gt; - The volume of data processed in a month by Amazon AppFlow is billed per GB, aggregated across all flows in an AWS account.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data processing price for flows whose destinations are hosted on AWS - &lt;strong&gt;$0.02 per GB&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running on demand&lt;/strong&gt; is ideal for ad hoc data transfer as needed, and each ad hoc flow run is charged individually. &lt;br&gt;
However, when choosing between running on event and running on schedule, one needs to select cautiously, considering the flow run cost and whether the use case requires near real-time or scheduled batch data transfer.&lt;/p&gt;

&lt;p&gt;Costing is explained in detail in the table below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ty6twpwpum3urj9fhs4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ty6twpwpum3urj9fhs4.png" alt=" " width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This example considers 100,000 records processed in a single day from the source SaaS application. The number of records processed may increase depending on how many records are changed and processed at the source within a given month. Consequently, an equal number of flow runs may be triggered by the &lt;strong&gt;run-on-event trigger&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running on schedule&lt;/strong&gt; can transfer the same number of records at recurring intervals, such as every hour. This results in only 24 flow runs per day (about 720 per month). Reducing the number of flow runs considerably lowers the overall AppFlow cost. &lt;/p&gt;
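&lt;p&gt;The schedule-versus-event trade-off can be made concrete with the prices quoted above. The record volume, the 2 KB average record size, and the one-run-per-event worst case are illustrative assumptions:&lt;/p&gt;

```python
# Rough monthly cost comparison for event-driven versus hourly scheduled
# flows, using the published prices quoted above.
PRICE_PER_RUN = 0.001   # USD per flow run
PRICE_PER_GB = 0.02     # USD per GB processed (AWS-hosted destination)

DAYS = 30
RECORDS_PER_DAY = 100_000
gb_per_month = RECORDS_PER_DAY * DAYS * 2 / 1_000_000  # 2 KB per record

# Worst case for the event trigger: one flow run per change event
event_runs = RECORDS_PER_DAY * DAYS
event_cost = event_runs * PRICE_PER_RUN + gb_per_month * PRICE_PER_GB

# Hourly schedule: 24 runs per day regardless of record volume
sched_runs = 24 * DAYS
sched_cost = sched_runs * PRICE_PER_RUN + gb_per_month * PRICE_PER_GB

print(f"event-driven:    {event_runs} runs, ${event_cost:.2f} per month")
print(f"hourly schedule: {sched_runs} runs, ${sched_cost:.2f} per month")
```

&lt;p&gt;The data-processing cost is identical in both cases; the flow-run count is what dominates the difference, which is why reducing the number of runs matters.&lt;/p&gt;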

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;
This blog provides an overview of Amazon AppFlow, a fully managed service for integration and data transfer between software-as-a-service (SaaS) applications such as Salesforce and AWS. It explains the configuration of Amazon AppFlow, the actual data transfer process, and the key &lt;strong&gt;cost considerations to keep in mind&lt;/strong&gt; when running AppFlow flows.&lt;/p&gt;

</description>
      <category>awsdatalake</category>
      <category>aws</category>
      <category>s3</category>
      <category>salesforce</category>
    </item>
    <item>
      <title>AWS Data migration full load task configurations for efficient data loading</title>
      <dc:creator>Vijay Dilip Bhosale</dc:creator>
      <pubDate>Thu, 27 Mar 2025 07:52:51 +0000</pubDate>
      <link>https://forem.com/bhosalevijayawssa/aws-data-migration-using-dms-full-load-configurations-for-efficient-data-loading-243</link>
      <guid>https://forem.com/bhosalevijayawssa/aws-data-migration-using-dms-full-load-configurations-for-efficient-data-loading-243</guid>
      <description>&lt;p&gt;&lt;strong&gt;Relational databases migration to AWS environment into Amazon S3 data lake using AWS Database Migration Service Part V:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the fifth blog in a multi-part series on considerations and observations during the migration of relational databases to an Amazon S3 data lake using the AWS DMS service. &lt;/p&gt;

&lt;p&gt;In the previous blogs, the following considerations were discussed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/bhosalevijayawssa/aws-data-migration-considerations-part-i-5c9i"&gt;Part I&lt;/a&gt; – Oracle DB CDC processing and network bandwidth&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/bhosalevijayawssa/aws-data-migration-considerations-part-ii-4hgb"&gt;Part II&lt;/a&gt; – MS SQL Server LOB and Datetime columns&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/bhosalevijayawssa/aws-data-migration-considerations-part-iii-4b01"&gt;Part III&lt;/a&gt; – MongoDB load balancing and deleted records&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/bhosalevijayawssa/aws-data-migration-considerations-part-iv-3kei"&gt;Part
IV&lt;/a&gt; – Network bandwidth issues in detail during the migration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This blog will focus mainly on full load data migration settings and configurations for Oracle as the RDBMS source; they may be applicable to other RDBMS sources too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem statement:&lt;/strong&gt;&lt;br&gt;
With major RDBMS databases in scope for the data migration to the AWS cloud, the actual data migration was the critical task of this requirement. For small and medium-sized tables across the databases, full load data migration completed very quickly.&lt;/p&gt;

&lt;p&gt;For the Oracle migration, however, with large tables in scope and a heavy transaction rate on the source, full load tasks were taking quite a long time: more than one day for the 12-15 large Oracle tables, and with ETL transformations and data validations, the complete process used to take 2-3 days.&lt;/p&gt;

&lt;p&gt;Source tables in scope for the Oracle data loading were as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total tables : 120
Large to very large tables : 15
Rows in large tables : 100M – 500M

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full load task in DMS was configured as shown below for one of the tables –&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {
                "schema-name": "fin",
                "table-name": "charge_dtls"
            },
            "rule-action": "include",
            "filters": []
        }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also, this complete data processing was performed 3-4 times during the development phase for the following reasons – &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DMS tasks failed because the source was not available for a few hours over the weekend for maintenance activities&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The modified timestamp used to process changed records was not populated correctly in the source tables&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Business requirements and reloads due to data issues, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each such data reload added an overall delay of 2-3 days for the complete data processing of the large tables, along with additional cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution overview:&lt;/strong&gt;&lt;br&gt;
After reviewing the configured full load tasks and the time required for their completion, it was decided to use the parallel load options to improve full load completion time. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Parallel load helps speed up the migration process and makes it more efficient. With the parallel options, AWS DMS splits a single segmented table across different threads, and each thread runs in parallel.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Parallel load offers different configurations for loading data in parallel using separate threads. Here, the &lt;strong&gt;parallel load ranges&lt;/strong&gt; type, with column value boundaries on a specific column, is used as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
            "rule-type": "table-settings",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {
                "schema-name": "fin",
                "table-name": "charge_dtls"
            },
            "parallel-load": {
                "type": "ranges",
                "columns": ["charge_Id"],
                "boundaries": [
                    ["1000000"], ["2000000"], ["3000000"], ["4000000"],
                    ["5000000"], ["6000000"], ["7000000"], ["8000000"],
                    ["9000000"], ["10000000"]
                ]
            }
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here the &lt;strong&gt;charge_dtls&lt;/strong&gt; table is loaded using a parallel load with ranges on the &lt;strong&gt;charge_Id&lt;/strong&gt; column. &lt;/p&gt;

&lt;p&gt;First, one needs to identify a column on which to create boundaries; a column with a primary key, unique constraint, or index is a good choice.&lt;br&gt;
The ranges type helps in scenarios where the table has no data partitions, as in this case.&lt;/p&gt;
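&lt;p&gt;For a numeric key like charge_Id, the boundary list can be generated rather than written by hand. A small sketch producing the ten boundaries used in the rule above; it assumes roughly uniform ids, and for skewed keys the boundaries should be derived from actual percentiles instead:&lt;/p&gt;

```python
# Generate evenly spaced boundary values for a DMS "ranges" parallel load.
def range_boundaries(max_id, segments):
    # DMS adds an implicit final segment for values above the last
    # boundary, so `segments` parallel segments need segments - 1 rows
    step = max_id // segments
    return [[str(step * i)] for i in range(1, segments)]

# Eleven segments with ten boundaries, as in the rule above
print(range_boundaries(11_000_000, 11))
```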

&lt;p&gt;It helped reduce the overall full load time for these tables by &lt;strong&gt;50-60%&lt;/strong&gt;, allowing the data loading, ETL jobs, and data validation activities to complete within &lt;strong&gt;&lt;em&gt;1.5 days against 2-3 days before&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The full load task in DMS can be configured with other parallel load types, such as –&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partitions-auto&lt;/strong&gt;: for tables with partitions already defined at table creation. For example, if the above table were partitioned on &lt;strong&gt;charge_year&lt;/strong&gt;, it would be configured as follows –
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
            "rule-type": "table-settings",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {
                "schema-name": "fin",
                "table-name": "charge_dtls"
            },
            "parallel-load": {
            "type": "partitions-auto"
           }
     }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partitions-list&lt;/strong&gt;: The above table can also be loaded using a partitions list on the same partition column, &lt;strong&gt;charge_year&lt;/strong&gt;, to filter and fetch data only from specific partitions, as shown below for the partition values &lt;em&gt;&lt;strong&gt;2024 and 2025 of the charge_year column&lt;/strong&gt;&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
            "rule-type": "table-settings",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {
                "schema-name": "fin",
                "table-name": "charge_dtls"
            },
            "parallel-load": {
            "type": "partitions-list",
            "partitions": ["2024", "2025"]
           }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;&lt;strong&gt;MaxFullLoadSubTasks&lt;/strong&gt; is the setting for the number of tables or table segments to load in parallel, with a maximum value of 49 and a default of 8.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;It needs to be tuned accordingly when running full load with the parallel configurations, or when running multiple small tables in parallel.&lt;/p&gt;
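&lt;p&gt;MaxFullLoadSubTasks belongs in the task settings JSON rather than the table mappings. A minimal sketch of the relevant fragment, built and printed in Python; the value 16 is illustrative and should be tuned for the replication instance size:&lt;/p&gt;

```python
import json

# Task-settings fragment raising MaxFullLoadSubTasks (default 8, max 49)
# so that more table segments or small tables load in parallel.
task_settings = {
    "FullLoadSettings": {
        "MaxFullLoadSubTasks": 16
    }
}
print(json.dumps(task_settings, indent=4))
```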

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;
This blog discussed the parallel load options available during full load data migration for improving full load performance and reducing the required data loading time. These options are often overlooked, and data loading is left to complete using the default configurations. &lt;/p&gt;

</description>
      <category>awsbigdata</category>
      <category>awsdatalake</category>
      <category>aws</category>
      <category>s3</category>
    </item>
    <item>
      <title>AWS Data Migration Considerations for network bandwidth issue</title>
      <dc:creator>Vijay Dilip Bhosale</dc:creator>
      <pubDate>Sun, 09 Mar 2025 14:50:23 +0000</pubDate>
      <link>https://forem.com/bhosalevijayawssa/aws-data-migration-considerations-part-iv-3kei</link>
      <guid>https://forem.com/bhosalevijayawssa/aws-data-migration-considerations-part-iv-3kei</guid>
      <description>&lt;p&gt;&lt;strong&gt;Relational databases migration to AWS environment into Amazon S3 data lake using AWS Database Migration Service - Part IV:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the fourth blog in a multi-part series on considerations and observations during the migration of relational databases to an Amazon S3 data lake using the AWS DMS service. Relational databases such as Oracle, MS SQL Server, MySQL, PostgreSQL, and others were migrated to AWS. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/bhosalevijayawssa/aws-data-migration-considerations-part-i-5c9i"&gt;Part I&lt;/a&gt; in this series covered considerations such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Modified date not populated properly at the source.&lt;/li&gt;
&lt;li&gt;Enabling supplemental logging for Oracle as the source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network bandwidth issues between source and target databases&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This blog will cover the following additional considerations to address the network bandwidth issue (&lt;strong&gt;#3&lt;/strong&gt;) mentioned above:   &lt;/p&gt;

&lt;p&gt;The first blog discussed the following CloudWatch metrics, which were verified to identify whether the delay was at the Oracle source or the S3 target:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CDCLatencySource&lt;/strong&gt; - The gap, in seconds, between the last event captured from the source endpoint and the current system timestamp of the AWS DMS instance. It reflects the latency between the source database and the replication instance in the AWS cloud. A high CDCLatencySource means the process of capturing changes from the source is delayed. CDCLatencySource stays close to zero when there is no replication lag between the source and the replication instance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CDCLatencyTarget&lt;/strong&gt; - The gap, in seconds, between the first event timestamp waiting to commit on the target and the current timestamp of the AWS DMS instance. Target latency is the difference between the replication instance server time and the oldest unconfirmed event ID forwarded to a target component; in other words, it is the timestamp difference between the replication instance and the oldest event applied but not yet confirmed by the target endpoint. To identify latency in an ongoing replication, view this metric together with CDCLatencySource. If &lt;strong&gt;CDCLatencyTarget&lt;/strong&gt; is higher than CDCLatencySource, investigate the following factors:&lt;/p&gt;

&lt;p&gt;• No primary keys or indexes are in the target&lt;br&gt;
 • Resource bottlenecks occur in the target or replication instance&lt;br&gt;
 • Network issues reside between replication instance and target&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CDCIncomingChanges&lt;/strong&gt; - The total number of change events at a point-in-time that are waiting to be applied to the target. Note that this is not the same as a measure of the transaction change rate of the source endpoint. A large number for this metric usually indicates AWS DMS is unable to apply captured changes in a timely manner, thus causing high target latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;NetworkReceiveThroughput&lt;/strong&gt; - The incoming (Receive) network traffic on the replication instance, including both customer database traffic and AWS DMS traffic used for monitoring and replication.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From these metrics, it was clear that the issue was network bandwidth between the source (Oracle) and the target (Amazon S3), surfacing in the &lt;strong&gt;CDCLatencySource&lt;/strong&gt; metric. &lt;/p&gt;

&lt;p&gt;The following additional points were considered, and changes were made in the DMS task configuration, to resolve the network bandwidth issue:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single or few ongoing changes (CDC) tasks for a given source:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS DMS tasks for the Oracle tables were configured as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separate full load task for each large table.&lt;/li&gt;
&lt;li&gt;Separate CDC task for each large table.&lt;/li&gt;
&lt;li&gt;One full load plus CDC task for each schema with small and medium tables grouped&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full load tasks ran fine, capturing the source data one time and loading it into the target. A CDC task, however, reads committed transactions from the redo and archived logs (with Oracle as the source here), and network bandwidth is consumed to ship these logs to the DMS replication instance for further processing. &lt;/p&gt;

&lt;p&gt;It was observed that network bandwidth issues were most evident around midnight during end-of-day and month-end processing, when archived log generation reached around 50-60 GB per hour compared to 10-15 GB per hour at other times.&lt;/p&gt;

&lt;p&gt;Also, the volume of Oracle redo and archived logs shipped to the DMS replication instance for CDC processing is multiplied by the number of CDC tasks configured in DMS.&lt;/p&gt;

&lt;p&gt;For example, with 5 CDC tasks for the Oracle tables and 50 GB of archived logs generated, 50*5 = 250 GB of logs in total would be shipped to the replication instance to process the changes.&lt;/p&gt;

&lt;p&gt;Hence, given the network bandwidth constraint, it was decided to configure a single CDC task (or very few) to avoid shipping multiple copies of the archived logs to the replication instance. The CDC tasks were therefore reconfigured as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Single or very few CDC tasks for all the tables across schemas&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With a single CDC task, only &lt;strong&gt;50 GB&lt;/strong&gt; of archived logs would be shipped, compared to &lt;strong&gt;50*5 = 250 GB&lt;/strong&gt; for the same tables split across 5 CDC tasks.&lt;/p&gt;
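&lt;p&gt;A single CDC task can cover tables across schemas by using wildcard selection rules in the task's table mappings. A hedged sketch (the wildcard filter shown is illustrative; in practice it can be narrowed to the schemas in scope):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all-schemas",
            "object-locator": {
                "schema-name": "%",
                "table-name": "%"
            },
            "rule-action": "include"
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;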

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fac9s1vdjwrwyhuuh32dz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fac9s1vdjwrwyhuuh32dz.jpg" alt=" " width="800" height="437"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx23cyw16rdthdkeczgfz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx23cyw16rdthdkeczgfz.jpg" alt=" " width="700" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Changes in memory configuration for DMS CDC tasks:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MemoryKeepTime&lt;/strong&gt; - Sets the maximum time, in seconds, that each transaction can stay in memory before being written to disk; the default value is 60. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MemoryLimitTotal&lt;/strong&gt; - Sets the maximum size, in MB, that all transactions can occupy in memory before being written to disk; the default value is 1024.&lt;/p&gt;

&lt;p&gt;After consulting the AWS technical team, it was agreed that increasing both of these values would help the CDC tasks manage memory more efficiently by keeping transactions in memory longer. Hence, the properties were changed as follows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MemoryKeepTime - 600&lt;br&gt;
MemoryLimitTotal - 10240&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The exact values to set for these properties may differ on a case-by-case basis, and it is necessary to consult the AWS Tech team before changing these values.&lt;/em&gt;&lt;/p&gt;
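&lt;p&gt;For reference, both properties sit in the &lt;strong&gt;ChangeProcessingTuning&lt;/strong&gt; section of the task settings JSON; a sketch using the values above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "ChangeProcessingTuning": {
        "MemoryKeepTime": 600,
        "MemoryLimitTotal": 10240
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;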

&lt;p&gt;&lt;strong&gt;Other points checked for addressing the performance issues for the DMS tasks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source database configuration to verify CPU and Memory for overall load on the source database during CDC processing.&lt;/li&gt;
&lt;li&gt;Replication instance configuration for CPU and Memory to process the incoming changes.&lt;/li&gt;
&lt;li&gt;Archived logs generation (in size in GBs) at the source.&lt;/li&gt;
&lt;li&gt;Network logs for network bandwidth used between Oracle source and AWS cloud.&lt;/li&gt;
&lt;li&gt;LOB support for the tables – The number of LOB columns in tables and the LOB option (such as Limited, Full, and Inline) for migration can affect data migration performance.&lt;/li&gt;
&lt;li&gt;Using Binary Reader instead of LogMiner for Oracle as the source to read log files, for better CDC performance.&lt;/li&gt;
&lt;li&gt;Minimum required supplemental logging to avoid additional details captured in archived logs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;
This blog focused on performance-optimization considerations for the network bandwidth issue faced during CDC tasks while migrating Oracle tables.&lt;/p&gt;

</description>
      <category>awsbigdata</category>
      <category>awsdatalake</category>
      <category>aws</category>
      <category>s3</category>
    </item>
    <item>
      <title>AWS Data Migration Considerations for MongoDB DB deleted records</title>
      <dc:creator>Vijay Dilip Bhosale</dc:creator>
      <pubDate>Mon, 24 Feb 2025 09:33:16 +0000</pubDate>
      <link>https://forem.com/bhosalevijayawssa/aws-data-migration-considerations-part-iii-4b01</link>
      <guid>https://forem.com/bhosalevijayawssa/aws-data-migration-considerations-part-iii-4b01</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Relational Databases Migration to AWS Environment into S3 Data Lake Using AWS DMS - Part III&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the third blog of the multi-part series on considerations and observations during database migration to an AWS S3 data lake using the AWS DMS service. Databases such as MongoDB, Oracle, MS SQL Server, MySQL, PostgreSQL, etc., were migrated to AWS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/bhosalevijayawssa/aws-data-migration-considerations-part-i-5c9i"&gt;Part I&lt;/a&gt; in this series covered considerations such as:&lt;br&gt;
• Modified date not populated properly at the source&lt;br&gt;
• Enabling supplemental logging for Oracle as the source&lt;br&gt;
• Network bandwidth issues between source and target databases&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/bhosalevijayawssa/aws-data-migration-considerations-part-ii-4hgb"&gt;Part II&lt;/a&gt; in this series covered considerations such as:&lt;br&gt;
• Missing LOB columns for change data capture from MS SQL Server as source&lt;br&gt;
• MS SQL Server time zone issue for timestamp column&lt;br&gt;
• RDBMS instances pause or shutdown during non-business hours&lt;/p&gt;

&lt;p&gt;First understand the important MongoDB database concepts as follows:&lt;/p&gt;

&lt;p&gt;MongoDB is an open-source, document-oriented NoSQL database.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Document&lt;/strong&gt;: Records in a MongoDB database are called documents, and the field values may include numbers, strings, Booleans, arrays, or even nested documents. A document is roughly equivalent to a row in a table in relational database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Collection&lt;/strong&gt;: Collection is a group of documents and is similar to a table in RDBMS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;: A database in MongoDB is a set of collections, and is roughly equivalent to a schema in a relational database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A MongoDB document is stored internally as binary JSON (BSON) in a compressed format that includes a type for each field in the document. Each document has a unique ID.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS DMS supports two migration modes when using MongoDB as a source as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Document mode&lt;/strong&gt;&lt;br&gt;
In document mode, the MongoDB document is migrated as is, meaning that the document data is consolidated into a single column named _doc in a target table. Document mode is the default setting when you use MongoDB as a source endpoint.&lt;br&gt;
You can optionally set the extra connection attribute extractDocID to true to create a second column named "_id" that acts as the primary key. If you are going to use CDC, set this parameter to true.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Table mode&lt;/strong&gt;&lt;br&gt;
In table mode, AWS DMS transforms each top-level field in a MongoDB document into a column in the target table. If a field is nested, AWS DMS flattens the nested values into a single column. AWS DMS then adds a key field and data types to the target table's column set.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
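&lt;p&gt;The mode is selected on the MongoDB source endpoint. A hedged sketch of the relevant endpoint settings for document mode with &lt;strong&gt;extractDocID&lt;/strong&gt; enabled (in the AWS DMS MongoDB settings, &lt;strong&gt;NestingLevel&lt;/strong&gt; "none" corresponds to document mode and "one" to table mode; values shown are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "NestingLevel": "none",
    "ExtractDocId": "true"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;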

&lt;p&gt;The following example illustrates the data structure in document mode and table mode:&lt;/p&gt;

&lt;p&gt;In MongoDB, each document within a collection is uniquely identified by a field called &lt;strong&gt;_id&lt;/strong&gt;. By default, this field uses the &lt;strong&gt;ObjectId&lt;/strong&gt; format, a 12-byte BSON data type that ensures uniqueness and embeds valuable metadata, such as the creation timestamp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source data&lt;/strong&gt;:&lt;br&gt;
{ "_id" : ObjectId("abcdef101"), "Name" : "ABC", "City" : "DEF" }&lt;br&gt;
{ "_id" : ObjectId("abcdef102"), "Name" : "ABC1", "City" : "DEF1" }&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document mode&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6l9n51obha3c6k92bl9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6l9n51obha3c6k92bl9.png" alt=" " width="616" height="72"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table mode&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3d2qoan4u2k5ho1xa84e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3d2qoan4u2k5ho1xa84e.png" alt=" " width="615" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq63nxh14tk0twe9emfg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq63nxh14tk0twe9emfg.png" alt=" " width="800" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This blog addresses the following considerations observed and managed during the MongoDB data migration to AWS S3:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deleted MongoDB records moved to S3 data lake:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As discussed earlier, MongoDB records are referred to as documents: data structures composed of field and value pairs. Here they were migrated in document mode into the AWS S3 data lake using AWS DMS. Both new and updated documents were successfully populated into the data lake and validated to match the source.&lt;/p&gt;

&lt;p&gt;On one occasion, several thousand deleted records were received alongside inserted and updated records.&lt;/p&gt;

&lt;p&gt;Initially, these records were treated as actual deletions and processed accordingly. However, it was later observed that during the same period, records with the same object IDs were received as inserted records a few milliseconds later in a different replica set (MongoDB server instance).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I, 2024-11-21 04:18:55:009812, “{“”id””: {  “” objid”” : “” 2345687fge8“” } , “”CustId”” : “” 121”” }”
D, 2024-11-21 04:18:56:001234, “{“”id””: {  “” objid”” : “” 2345687fge8“” }  }”

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Further analysis revealed that these documents were not newly inserted or deleted during the specified time period. Instead, they were added to the MongoDB server a few hours to a few days earlier.&lt;/p&gt;

&lt;p&gt;Following further analysis and discussion with the MongoDB DBA team, it was determined that there were no actual deleted and inserted records during that time. The reason is that the MongoDB server is horizontally scaled in shards (replica sets) using multiple servers. The MongoDB balancer automatically redistributes chunks of data across shards to ensure that each shard has a balanced amount of data. The balancer begins distributing the data for sharded collections based on a migration threshold.&lt;/p&gt;

&lt;p&gt;During this phase, collection data from one shard (primary node) is inserted into secondary nodes (shards), and these documents are deleted from the primary node. Hence, newly inserted and deleted records were received in the oplog for the MongoDB database.&lt;/p&gt;

&lt;p&gt;As there were no actual deletions occurring in the MongoDB server, these deleted records were skipped and not processed in the S3 data lake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MongoDB with load balancing across multiple replica sets:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;During data migration from MongoDB, note that the server may be scaled into shards/replica sets. Each shard holds a subset of the collection data to ensure high availability and boost read/write throughput. This is primarily used in large databases for horizontal scaling.&lt;/p&gt;

&lt;p&gt;The table below lists databases stored in different shards (instances):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sr. No. Database    Instance 1 (xx.xx.xx.1) Instance 2 (xx.xx.xx.2)
1   Customer    Yes                 Yes
2   Master      Yes                 No
3   UserConfig  No                  Yes

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Customer database is distributed across two instances, with the Master and UserConfig databases stored in instance 1 and instance 2, respectively.&lt;/p&gt;

&lt;p&gt;Create two separate source endpoints in AWS DMS to fetch data from the Customer database instances. Merge and process the data from both instances (instance 1 and instance 2) at the destination, which is the S3 data lake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Cost Optimization Strategies for MongoDB&lt;/em&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage Classes&lt;/strong&gt;:&lt;br&gt;
Use S3 Intelligent-Tiering for dynamic data access patterns&lt;br&gt;
Implement lifecycle policies for aging data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DMS Instance Optimization&lt;/strong&gt;:&lt;br&gt;
Scale instances based on workload&lt;br&gt;
Use reserved instances for long-term migrations&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
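&lt;p&gt;As a hedged sketch of the lifecycle-policy idea above (the prefix and day counts here are hypothetical and must be tuned to actual access patterns), an S3 lifecycle rule could look like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Rules": [
        {
            "ID": "age-out-migrated-data",
            "Filter": { "Prefix": "dms/full-load/" },
            "Status": "Enabled",
            "Transitions": [
                { "Days": 90, "StorageClass": "INTELLIGENT_TIERING" }
            ],
            "Expiration": { "Days": 730 }
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;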

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Operational Excellence Migration Checklist&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PreMigration:
  - Validate source connectivity
  - Configure monitoring
  - Test network throughput
  - Verify IAM permissions

During Migration:
  - Monitor replication lag
  - Track error rates
  - Validate data integrity
  - Monitor resource utilization

Post Migration:
  - Verify record counts
  - Validate data consistency
  - Archive migration logs
  - Document lessons learned
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;em&gt;Performance Considerations Monitoring Dashboard&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metrics:
  - CPUUtilization
  - FreeableMemory
  - WriteIOPS
  - ReadIOPS
  - ReplicationLag
Alarms:
  - CPUUtilization &amp;gt; 80%
  - ReplicationLag &amp;gt; 300
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Oracle archived logs retention matter:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When capturing change data from Oracle, an issue with archive logs was noted. The CloudWatch logs for the DMS task showed the following error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;00019300: 2024-11-29T02:22:13 [SOURCE_CAPTURE  ]E:  Archived Redo log with the sequence XXXXX does not exist, thread 1 [XXXXX]  (oradcdc_thread.c:XXX)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The issue was traced to the scheduled purging of archived logs from the Oracle database source. Increasing the retention period for these logs allowed the DMS change data capture tasks to be restarted successfully. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is Part III of a series on database migration challenges using AWS DMS. This blog focused mainly on considerations during data migration from a MongoDB database.&lt;/p&gt;

</description>
      <category>awsbigdata</category>
      <category>awsdatalake</category>
      <category>dms</category>
      <category>s3</category>
    </item>
    <item>
      <title>AWS Data Migration Considerations for MS SQL Server DB LOB and Datetime columns</title>
      <dc:creator>Vijay Dilip Bhosale</dc:creator>
      <pubDate>Mon, 10 Feb 2025 06:56:53 +0000</pubDate>
      <link>https://forem.com/bhosalevijayawssa/aws-data-migration-considerations-part-ii-4hgb</link>
      <guid>https://forem.com/bhosalevijayawssa/aws-data-migration-considerations-part-ii-4hgb</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Relational databases migration to AWS environment into Amazon S3 data lake using AWS Database Migration Service Part II:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the second blog of the multi-part series on considerations and observations during relational databases migration to Amazon S3 data lake using AWS DMS service. Relational databases such as Oracle, MS SQL Server, MySQL, PostgreSQL, etc., were migrated to AWS cloud. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/bhosalevijayawssa/aws-data-migration-considerations-part-i-5c9i"&gt;Part I&lt;/a&gt; in this series covered considerations such as:&lt;br&gt;
• Modified date not populated properly at the source&lt;br&gt;
• Enabling supplemental logging for Oracle as the source&lt;br&gt;
• Network bandwidth issues between source and target databases&lt;/p&gt;

&lt;p&gt;This blog covers the following considerations:  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing LOB columns for change data capture from MS SQL Server as source:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Full load data from MS SQL Server was completed using AWS DMS into Amazon S3 data lake. Most of these were small to medium-sized tables, and data and record count validation was completed and matched with the source.&lt;/p&gt;

&lt;p&gt;Change data capture (CDC) tasks were configured for the tables with a one-hour interval for writing the data into the data lake, and data was populated accordingly. Post-migration data validation is essential and critical to make sure all columns and data types migrated correctly, and that there are no issues such as data truncation or mismatches in datetime, int, and float columns. During this validation, it was observed that a few columns were missing data in the CDC tasks, even though data was available at the source and had been populated for the same columns during the full load.&lt;/p&gt;

&lt;p&gt;Upon further analysis the observation was as follows:&lt;br&gt;
• These columns were of the following data types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Xml&lt;/li&gt;
&lt;li&gt;Varchar(max)&lt;/li&gt;
&lt;li&gt;Nvarchar(max)&lt;/li&gt;
&lt;li&gt;Text&lt;/li&gt;
&lt;li&gt;Image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;• These are LOB columns in the MS SQL Server database.&lt;br&gt;
• Hence, it was found that LOB columns were not being populated during ongoing replication. &lt;br&gt;
• However, this was not the case for all tables with LOB columns; some such tables were getting data populated correctly.&lt;/p&gt;

&lt;p&gt;After analysing CloudWatch logs for these tasks, the following error logs were found:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"2024-08-27T11:49:23 [SOURCE_UNLOAD ]W: Column 'COLUMN NAME' was removed from table definition SCHEMA NAME.TABLE NAME': the column data type is LOB and the table has no primary key or unique index (metadatamanager.c:2492)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Investigating this issue further in the AWS documentation revealed that:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;During CDC, AWS DMS supports CLOB data types only in tables that include a primary key.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;• Upon validation, tables whose LOB columns were populated correctly had primary keys. &lt;br&gt;
• Tables with no data populated for LOB columns did not have primary keys.&lt;/p&gt;

&lt;p&gt;Primary keys were added to all those tables and the data was loaded again; it was then confirmed that the LOB columns were populated and matched the source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt; – LOB columns can be migrated to the target in full LOB mode, or in limited LOB mode up to a specific data size, using the AWS DMS service.&lt;/p&gt;
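&lt;p&gt;LOB handling is controlled in the &lt;strong&gt;TargetMetadata&lt;/strong&gt; section of the task settings. A hedged sketch of limited LOB mode with a 32 KB per-column cap (the cap value is illustrative and must fit the largest LOBs in scope):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "TargetMetadata": {
        "SupportLobs": true,
        "FullLobMode": false,
        "LimitedSizeLobMode": true,
        "LobMaxSize": 32
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;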

&lt;p&gt;&lt;strong&gt;MS SQL Server time zone issue for timestamp column:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another issue found during the data migration of MS SQL Server tables was that the timestamp columns were populated incorrectly in the target data store.&lt;/p&gt;

&lt;p&gt;Timestamp columns for all the source tables were configured and populated with a non-UTC time zone as – &lt;strong&gt;Asia/Calcutta, UTC+5.30&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;During post-migration data validation, the timestamp columns in the target data store had different values compared to the source data. &lt;br&gt;
Analysis showed that the data at the target had been populated in a different time zone – &lt;strong&gt;UTC&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After further analysis, it was found that one property needs to be set in the extra connection attributes for MS SQL Server as the source – &lt;strong&gt;serverTimezone&lt;/strong&gt;. &lt;br&gt;
Set &lt;strong&gt;serverTimezone&lt;/strong&gt; on the source endpoint to the non-UTC value used at the source:&lt;br&gt;
&lt;strong&gt;serverTimezone=Asia/Calcutta;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Upon setting this property for the source endpoint and reloading the data into the S3 data lake, it was found that the timestamp column was populated correctly with the same value as source and the non-UTC Asia/Calcutta time zone.&lt;/p&gt;

&lt;p&gt;There is a similar property – &lt;strong&gt;Initstmt=SET time_zone&lt;/strong&gt; – for MySQL-compatible target databases in a non-UTC time zone, set as follows:&lt;br&gt;
&lt;strong&gt;Initstmt=SET time_zone=Asia/Kolkata;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDBMS instances pause or shutdown during non-business hours:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another issue observed was AWS DMS change data capture tasks failing for MySQL and Oracle databases because the source database was unavailable for a few hours each day, mainly during non-business hours. The errors logged in CloudWatch were:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Oracle:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2024-08-08T06:24:45 [SOURCE_CAPTURE  ]W:  Oracle error code is '1089' ORA-01089: immediate shutdown or close in progress - no operations are permitted Process ID: 41371 Session ID: XXXX Serial number: XXXXX   (oracdc_merger.c:176)
2024-15-08T06:24:47 [METADATA_MANAGE ]W:  Oracle error code is '12514' ORA-12514: TNS:listener does not currently know of service requested in connect descriptor   (oracle_endpoint_conn.c:914)
Last Error Task 'TASK NAME’ was suspended after 9 successive recovery failures Stop Reason FATAL_ERROR Error Level FATAL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;MySQL:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2024-06-21T46:16:00 [METADATA_MANAGE ]E:  RetCode: SQL_ERROR  SqlState: HY000 NativeError: 2003 Message: [unixODBC][MySQL][ODBC 8.0(w) Driver]Can't connect to MySQL server on 'XX.XX.XX.XX' (XXX) [XXXXXXX]  (ar_odbc_conn.c:XXX)
Last Error Task TASK NAME' was suspended after 9 successive recovery failures Stop Reason FATAL_ERROR Error Level FATAL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There could be various reasons for a DMS task not being able to connect to the source, such as the source not being available, network issues, changes in permissions, security configuration settings, etc.&lt;/p&gt;

&lt;p&gt;In this case, further analysis found that these were non-production servers, hosted on the cloud or on-premises, that are paused or shut down daily during non-business hours for cost optimization or regular maintenance activities.&lt;/p&gt;

&lt;p&gt;AWS DMS change data capture tasks could not connect to the source databases during non-business hours and failed after a certain number of retries. One needs to &lt;strong&gt;RESUME/RESTART&lt;/strong&gt; these tasks once the source server is back up. To avoid regular task failures, such pauses or shutdowns should be minimized, or the servers kept running continuously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;
This is Part II of the multi-part series on relational database migration challenges using AWS DMS and their implemented solutions. In this blog, we discussed mainly the considerations during table migration from MS SQL Server as the source.&lt;/p&gt;

</description>
      <category>awsbigdata</category>
      <category>awsdatalake</category>
      <category>dms</category>
      <category>s3</category>
    </item>
    <item>
      <title>AWS Data Migration Considerations for Oracle DB CDC processing and network bandwidth</title>
      <dc:creator>Vijay Dilip Bhosale</dc:creator>
      <pubDate>Wed, 29 Jan 2025 08:11:25 +0000</pubDate>
      <link>https://forem.com/bhosalevijayawssa/aws-data-migration-considerations-part-i-5c9i</link>
      <guid>https://forem.com/bhosalevijayawssa/aws-data-migration-considerations-part-i-5c9i</guid>
      <description>&lt;h2&gt;
  
  
  Relational databases migration to Amazon S3 data lake using AWS Database Migration Service Part I –
&lt;/h2&gt;

&lt;p&gt;AWS Database Migration Service (DMS) is a cloud service for migrating relational databases, NoSQL databases, data warehouses, and other types of data stores into the AWS Cloud, or between cloud and on-premises setups, efficiently and securely. DMS supports several types of source and target databases, such as Oracle, MS SQL Server, MySQL, PostgreSQL, Amazon Aurora, Amazon RDS, Amazon Redshift, and Amazon S3.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observations during the data migration –
&lt;/h2&gt;

&lt;p&gt;While designing and building an Amazon S3 data lake and a data warehouse in Amazon Redshift, the data sources were on-premises Oracle, MS SQL Server, MySQL, and PostgreSQL relational databases, along with MongoDB. AWS DMS was used here for the initial full load and the daily incremental data transfer from these sources into Amazon S3. This series of posts explains the various challenges faced during the actual data migration with these different databases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modified date not populated properly at the source&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;AWS DMS is used for the initial full load and the daily change data capture from source databases into AWS. AWS DMS captures changed records based on the transaction logs of the source database, such as the redo and archived logs for an Oracle database. &lt;br&gt;
A modified-date column that is updated properly on the source helps apply deduplication logic at the destination and extract the latest modified version of each record on the target, here Amazon S3.&lt;/p&gt;

&lt;p&gt;In case the modified-date data is not available for a table, or it is not updated properly, AWS DMS provides a transformation-rule property to add a new column while extracting data from the source database. &lt;/p&gt;

&lt;p&gt;Here the &lt;strong&gt;AR_H_CHANGE_SEQ&lt;/strong&gt; header helps add a new column whose value is a unique, incrementing number from the source database, consisting of a timestamp and an auto-incrementing number.&lt;/p&gt;

&lt;p&gt;The code example below adds a new column named &lt;strong&gt;DMS_CHANGE_SEQ&lt;/strong&gt; to the target, holding the unique incrementing number from the source. This is a 35-digit unique number, with the first 16 digits for the timestamp and the next 19 digits for the record ID number incremented by the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "2",
            "rule-target": "column",
            "object-locator": {
                "schema-name": "%",
                "table-name": "%"
            },
            "rule-action": "add-column",
            "value": "DMS_CHANGE_SEQ",
            "expression": "$AR_H_CHANGE_SEQ",
            "data-type": {
                "type": "string",
                "length": 100
            }
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
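&lt;p&gt;On the target side, this column can then drive the deduplication. A minimal sketch in Python (the record layout and column names are illustrative; it relies on the fact that fixed-width, zero-padded digit strings sort lexicographically in numeric order):&lt;/p&gt;

```python
# Sketch of target-side deduplication using the added DMS_CHANGE_SEQ
# column: for each primary key, keep the row with the highest sequence.
# The record layout and column names here are illustrative, not DMS APIs.

def dedupe_by_change_seq(records, key_col="id", seq_col="DMS_CHANGE_SEQ"):
    """Return only the latest version of each record by DMS_CHANGE_SEQ."""
    latest = {}
    for rec in records:
        key = rec[key_col]
        # Fixed-width, zero-padded digit strings compare lexicographically
        # in the same order as their numeric values.
        if key not in latest or rec[seq_col] > latest[key][seq_col]:
            latest[key] = rec
    return list(latest.values())
```
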



&lt;p&gt;&lt;strong&gt;Enabling supplemental logging for Oracle as source&lt;/strong&gt; &lt;br&gt;
For Oracle as the source database, AWS DMS requires that at least minimal supplemental logging is enabled on the source in order to capture ongoing changes. This includes additional information and columns in the redo logs to identify the changes that happened at the source.&lt;/p&gt;

&lt;p&gt;Supplemental logging can be enabled for primary keys, unique keys, a set of columns, or all columns at the source. Supplemental logging for all columns captures every column of a changed record on the source tables, which helps overwrite the full record in the target, in this case the Amazon S3 layer. &lt;/p&gt;

&lt;p&gt;However, supplemental logging with all columns will increase the redo log size, as all columns of the table are written into the logs. One needs to configure the redo and archive logs accordingly to accommodate this additional information.&lt;/p&gt;
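&lt;p&gt;For reference, supplemental logging is enabled on Oracle with statements like the following (the schema and table names are placeholders):&lt;/p&gt;

```sql
-- Database-level minimal supplemental logging, required by AWS DMS for CDC
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;

-- Table-level logging of all columns (increases redo volume)
ALTER TABLE app_schema.orders ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

-- Alternative: log only the primary key columns
ALTER TABLE app_schema.orders ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;
```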

&lt;p&gt;&lt;strong&gt;AWS DMS performance issue - network bandwidth&lt;/strong&gt; &lt;br&gt;
The initial full load from the on-premises sources for Oracle, MS SQL Server, etc. worked fine, as did change data capture most of the time. &lt;/p&gt;

&lt;p&gt;Overall, there was a moderate number of transactions for most of the day in a given month, except for the daily end-of-business-day processing post-midnight and the month-end activities. It was observed that during these periods the DMS migration tasks fell out of sync with the data loaded into the target, or failed, due to high latency between the source and the AWS replication instance.&lt;/p&gt;

&lt;p&gt;There could be different reasons for the high latency such as source and target latency, load on source and target systems, replication instance configuration, network speed and bandwidth etc.&lt;/p&gt;

&lt;p&gt;In this case, the source and target systems did not have a heavy transaction load, and the replication instance was configured well to handle the volume of incoming data, with minimal load on it.&lt;/p&gt;

&lt;p&gt;After reviewing the source, target, and replication instance metrics in CloudWatch, the observations below were made. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CDCLatencySource&lt;/strong&gt; - the gap, in seconds, between the last event captured from the source endpoint and the current system timestamp of the AWS DMS instance. It reflects the latency between the source database and the replication instance in the AWS Cloud; a high CDCLatencySource means the process of capturing changes from the source is delayed.&lt;/p&gt;

&lt;p&gt;Generally, &lt;strong&gt;CDCIncomingChanges&lt;/strong&gt; (the total number of change events at a point in time that are waiting to be applied to the target) grows from zero to a few thousand during the post-midnight end-of-day activity and the early-morning reconciliation activities. &lt;br&gt;
Similarly, &lt;strong&gt;CDCLatencySource&lt;/strong&gt; grows from zero to as much as 10-12K seconds during the daily post-midnight reconciliation activities, and reached up to 40K seconds during month-end activities.&lt;/p&gt;

&lt;p&gt;Upon further log analysis and review of other metrics, it was observed that:&lt;/p&gt;

&lt;p&gt;The AWS DMS metric &lt;strong&gt;NetworkReceiveThroughput&lt;/strong&gt; reflects the incoming traffic on the DMS replication instance, covering both the customer database and DMS traffic. This metric helps identify network-related issues, if any, between the source database and the DMS replication instance. &lt;/p&gt;
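&lt;p&gt;These metrics can be pulled from CloudWatch for a given replication instance, for example with the AWS CLI (the instance identifier and time range below are placeholders):&lt;/p&gt;

```
aws cloudwatch get-metric-statistics \
    --namespace AWS/DMS \
    --metric-name NetworkReceiveThroughput \
    --dimensions Name=ReplicationInstanceIdentifier,Value=my-dms-instance \
    --statistics Average \
    --period 300 \
    --start-time 2025-01-28T00:00:00Z \
    --end-time 2025-01-29T00:00:00Z
```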

&lt;p&gt;&lt;strong&gt;Network bandwidth issue between the source and the AWS network&lt;/strong&gt;&lt;br&gt;
It was observed that the network receive throughput peaked at about 30 MB/s (roughly 250 Mb/s) on the VPN connection between the source and AWS, and the same VPN was shared with other applications too.&lt;/p&gt;

&lt;p&gt;Good network connectivity and bandwidth between the source and target databases are critical for a successful data migration. One should make sure sufficient network bandwidth is set up between the on-premises (or other cloud) source databases and the AWS environment before performing the actual data migration. Different network options and topologies are available for connectivity between on-premises environments and the AWS Cloud.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A VPN tunnel such as &lt;strong&gt;AWS Site-to-Site VPN&lt;/strong&gt; or &lt;strong&gt;Oracle Cloud Infrastructure (OCI) Site-to-Site VPN (Oracle to AWS)&lt;/strong&gt; can provide throughput up to 1.25 Gbps. This is sufficient for migrating small tables or tables with light DML traffic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For large data migrations with heavy transactions per second on the tables, one should consider &lt;strong&gt;AWS Direct Connect&lt;/strong&gt;. It provides the option to create a dedicated private connection with supported bandwidths of 1 Gbps, 10 Gbps, 100 Gbps, etc. One also needs to allow for an initial setup time of about a month for a Direct Connect connection.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
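&lt;p&gt;As a rough sizing check before choosing a link, the transfer time for a given data volume can be estimated as bits divided by bandwidth. A small illustrative helper (not part of AWS DMS; decimal units assumed):&lt;/p&gt;

```python
def transfer_hours(data_gb, link_mbps, efficiency=1.0):
    """Estimate hours to move data_gb gigabytes over a link_mbps link.

    efficiency below 1.0 accounts for protocol overhead and link sharing.
    """
    bits = data_gb * 8 * 1000 ** 3            # decimal GB to bits
    seconds = bits / (link_mbps * 1000 ** 2 * efficiency)
    return seconds / 3600

# A 1 TB full load over a 250 Mb/s VPN takes roughly 9 hours at best,
# versus well under an hour on a 10 Gbps Direct Connect link.
```
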

&lt;p&gt;&lt;strong&gt;Conclusion –&lt;/strong&gt;&lt;br&gt;
This is Part I of a multi-part series on relational database migration challenges using AWS DMS and the solutions implemented for them. AWS DMS network speed and performance issues and other factors will be explained in the next posts. Most of the challenges mentioned in this series can occur during a database migration, and these solutions can serve as a reference. &lt;/p&gt;

</description>
      <category>awsbigdata</category>
      <category>awsdatalake</category>
      <category>dms</category>
      <category>s3</category>
    </item>
  </channel>
</rss>
