Forem: Yoshiki Fujiwara(藤原善基)@AWS Community Builder

Why Delta, Iceberg, and Hudi Can't Write to FSx S3 Access Points — And What Works Instead

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Tue, 26 May 2026 17:00:49 +0000

TL;DR

In Parts 1–6 of this series, we validated read paths across six engines. This Part 7 answers the write question: Can you use Delta Lake, Apache Iceberg, or Apache Hudi on FSx for ONTAP S3 Access Points?

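No. None of the three transactional table formats can commit writes on FSx S3 AP. Delta Lake and Iceberg failed in testing, and Hudi is ruled out by its commit protocol. Each depends on an operation the access point does not provide: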

Format	Failure Point	Error	Root Cause
Delta Lake	Commit (write)	`501 Not Implemented`	No conditional writes (If-None-Match)
Apache Hudi	Timeline commit	Not Supported (by design)	No atomic rename (.inflight → .commit)
Apache Iceberg	Metadata write	`NullPointerException`	Metadata commit fails on the AP alias (exact cause not isolated)

What DOES work for write on FSx S3 AP:

✅ Flat Parquet append (PutObject)
✅ Athena CTAS (write-back)
✅ DuckDB COPY TO (write-back)
✅ EMR Spark df.write.parquet() (flat Parquet)

Quick Decision Guide:

Need transactional table format → Write to native S3, read FSxN data via S3 AP separately
Need write-back to FSxN → Use flat Parquet (append-only, no transactions)
Need ACID on NAS data → Use NFS/SMB protocol directly (not S3 AP)

GitHub: fsxn-lakehouse-integrations

How to Read This Article

This article is:

A consolidated "Not Supported" evidence report for transactional writes
Root cause analysis for each table format
Architecture guidance for working alternatives

Read by role:

Data engineer: Failure Evidence → What Works Instead
Architect: Root Cause Analysis → Architecture Patterns
Partner / SA: Partner Decision Card → Discovery Questions
Storage engineer: S3 API Compatibility → Why This Is Fundamental

Prerequisite Concepts

Before reading this article, it helps to understand:

Delta Lake — Databricks' open table format using _delta_log/ JSON commits with conditional writes
Apache Iceberg — Netflix's table format using metadata files with atomic commit protocol
Apache Hudi — Uber's table format using .hoodie/ timeline with atomic rename
Atomic rename — renaming a file in one atomic operation (S3 does not support this)
Conditional writes — writing only if a condition is met (e.g., If-None-Match); FSx S3 AP returns 501
PutObject — S3's basic write operation (supported by FSx S3 AP for files ≤ 5 GB)

Why Transactional Table Formats Need Special S3 Operations

All three formats solve the same problem: concurrent write safety on object storage. They each use a commit protocol that requires operations beyond basic PutObject:

Delta Lake:    PutObject with If-None-Match header (conditional write)
               → Prevents two writers from creating the same commit file

Apache Hudi:   Rename .inflight → .commit (atomic rename)
               → Marks a commit as complete atomically

Apache Iceberg: PutObject + HeadObject/GetObject for metadata verification
                → Verifies metadata was written correctly before commit

FSx for ONTAP S3 AP does NOT support:

Conditional writes (If-None-Match → returns 501 Not Implemented)
Atomic rename (S3 API has no rename operation)
Reliable metadata commit on AP alias (Iceberg fails with a NullPointerException; exact cause not isolated)

This is not a configuration issue — it's a fundamental API limitation.
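To make the conditional-write requirement concrete, here is a minimal in-memory sketch of the "create only if absent" semantics that `If-None-Match: *` provides. `ConditionalStore` and `race_for_commit` are hypothetical names for illustration only; they do not come from Delta Lake, delta-rs, or any AWS SDK.

```python
import threading


class ConditionalStore:
    """In-memory stand-in for an object store that honors If-None-Match: *."""

    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, body):
        # The existence check and the write happen as one atomic step,
        # which is what a conditional PutObject guarantees on the server side.
        with self._lock:
            if key in self._objects:
                return False  # analogous to HTTP 412 Precondition Failed
            self._objects[key] = body
            return True


def race_for_commit(store, key, writer_ids):
    """Let several writers try to create the same commit file; return the winners."""
    winners = []
    winners_lock = threading.Lock()

    def attempt(writer_id):
        if store.put_if_absent(key, f"commit by {writer_id}".encode()):
            with winners_lock:
                winners.append(writer_id)

    threads = [threading.Thread(target=attempt, args=(w,)) for w in writer_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners


store = ConditionalStore()
commit_key = "_delta_log/00000000000000000000.json"
winners = race_for_commit(store, commit_key, ["writer-a", "writer-b", "writer-c"])
print(len(winners))  # 1: exactly one writer creates the commit file
```

Without the atomic check-and-write, every writer's PutObject would succeed and the last one would silently overwrite the others, which is the failure mode the commit protocols are designed to prevent.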

Architecture: What's Supported vs What's Not

                    FSx for ONTAP S3 Access Point
                              │
              ┌───────────────┼───────────────┐
              │               │               │
         ✅ READ          ✅ WRITE         ❌ WRITE
         (all engines)    (flat Parquet)   (transactional)
              │               │               │
    ┌─────────┤         ┌─────┤         ┌─────┤
    │         │         │     │         │     │
  Athena   Redshift   EMR  DuckDB   Delta  Iceberg  Hudi
  Snowflake Spectrum  Spark Lambda   Lake
  DuckDB
  EMR

Failure Evidence: Delta Lake

Test: delta-rs (Rust) write to FSx S3 AP
Date: 2026-05-23
Result: 501 Not Implemented

Error: Generic S3 error: Error performing put request to
s3://verification-tes-...-ext-s3alias/_delta_log/00000000000000000000.json:
response error "501 Not Implemented"

Root cause: Delta Lake's commit protocol uses If-None-Match header on PutObject to ensure only one writer creates each commit file. FSx for ONTAP S3 AP does not implement conditional writes and returns 501.

CloudShell reproduction: delta-rs write to FSx S3 AP returns "501 Not Implemented" — conditional writes (If-None-Match) are not supported.

Spark Delta fallback: Spark's Delta writer uses CopyObject + DeleteObject as a rename fallback, but this is not atomic — two concurrent writers can corrupt the log.

Conclusion: Delta Lake write is Not Supported on FSx S3 AP. This is a fundamental limitation, not a configuration issue.
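If you want to check an endpoint yourself before involving a table format, a small probe along these lines may help. It is a sketch under assumptions: `probe_conditional_write`, `StubClient`, and the probe key are hypothetical, and the real call assumes a boto3 version recent enough to accept the `IfNoneMatch` parameter on `put_object`. The stub stands in for a real client so the classification logic can be run offline.

```python
def _put_status(s3_client, bucket, key):
    """Return the HTTP status of a PutObject call carrying If-None-Match: *."""
    try:
        s3_client.put_object(Bucket=bucket, Key=key, Body=b"probe", IfNoneMatch="*")
        return 200
    except Exception as exc:
        # botocore's ClientError exposes the parsed error as `exc.response`.
        response = getattr(exc, "response", None)
        if not isinstance(response, dict):
            raise
        return response.get("ResponseMetadata", {}).get("HTTPStatusCode")


def probe_conditional_write(s3_client, bucket, key):
    """Classify an endpoint's handling of conditional writes using a throwaway key."""
    first = _put_status(s3_client, bucket, key)
    if first == 501:
        return "not implemented"
    if first == 412:
        return "supported"  # key already existed and the precondition was enforced
    if first != 200:
        return f"unexpected status {first}"
    # The first write succeeded; a second one must be rejected if the header is honored.
    second = _put_status(s3_client, bucket, key)
    if second == 412:
        return "supported"
    if second == 200:
        return "header ignored"
    return f"unexpected status {second}"


class StubError(Exception):
    """Test double mimicking the shape of botocore's ClientError."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.response = {"ResponseMetadata": {"HTTPStatusCode": status}}


class StubClient:
    """Test double: rejects every write with a fixed HTTP status."""

    def __init__(self, status):
        self.status = status

    def put_object(self, **kwargs):
        raise StubError(self.status)


print(probe_conditional_write(StubClient(501), "<AP-alias>", "_probe/conditional.txt"))
# not implemented
```

Against a real endpoint you would pass `boto3.client("s3")` and the access point alias instead of the stub, and delete the probe object afterwards if the write was accepted.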

Failure Evidence: Apache Hudi

Test: Logical deduction from Delta Lake verification + Hudi architecture analysis
Date: 2026-05-24
Result: Not Supported (by design)

Root cause: Apache Hudi's commit protocol requires atomic rename for its timeline:

.hoodie/[instant].inflight → .hoodie/[instant].commit

This rename must be atomic to prevent partial commits from being visible. S3 has no rename operation — the only way to "rename" is CopyObject + DeleteObject, which is not atomic.
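The gap between the two steps can be shown with a toy object store. This is a sketch, not Hudi code: `ObjectStore`, `emulated_rename`, and the instant name `001` are hypothetical, and the `observe` callback simply records what a concurrent reader could list between the copy and the delete.

```python
class ObjectStore:
    """Minimal in-memory object store exposing only copy and delete."""

    def __init__(self):
        self.objects = {}

    def copy(self, src, dst):
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        del self.objects[key]


def emulated_rename(store, src, dst, observe):
    """Emulate rename as copy + delete, calling `observe` between the two steps."""
    store.copy(src, dst)
    observe(sorted(store.objects))  # what a concurrent reader could list here
    store.delete(src)


store = ObjectStore()
store.objects[".hoodie/001.inflight"] = b"instant metadata"
seen = []
emulated_rename(store, ".hoodie/001.inflight", ".hoodie/001.commit", seen.append)

print(seen[0])                # ['.hoodie/001.commit', '.hoodie/001.inflight']
print(sorted(store.objects))  # ['.hoodie/001.commit']
```

The intermediate listing shows both files at once. If the process dies before the delete, that state becomes permanent, and a reader has no way to tell whether the instant is committed or still in flight.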

Attempted verification: EMR Serverless with Hudi write — Hudi catalog plugin not available in EMR 7.1.0 default configuration. However, the fundamental constraint (no atomic rename) makes the outcome deterministic.

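Conclusion: Apache Hudi write is Not Supported on FSx S3 AP. As with Delta Lake, the commit protocol depends on an atomic primitive that the access point cannot provide; for Hudi that primitive is atomic rename rather than a conditional write.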

Failure Evidence: Apache Iceberg

Test: EMR Serverless (emr-7.1.0) with Iceberg S3FileIO + Glue Catalog
Date: 2026-05-24
Result: NullPointerException

java.lang.NullPointerException: Cannot invoke
"org.apache.iceberg.TableMetadata.metadataFileLocation()"
because "metadata" is null

Root cause: Iceberg's S3FileIO attempts to write metadata files to the warehouse path on FSx S3 AP. The NullPointerException occurs during the commit phase when Iceberg tries to verify the metadata file was written successfully. Possible causes:

S3FileIO may not correctly handle S3 AP alias as bucket name
The metadata write (PutObject) may succeed but subsequent HeadObject/GetObject to verify fails due to AP alias resolution
Iceberg's commit protocol may require operations not fully supported by FSx S3 AP

Note: Iceberg READ of pre-existing tables (metadata in Glue, data files on S3 AP) may still work — this was not tested in this run.

Conclusion: Apache Iceberg write is Not Supported on FSx S3 AP. The failure is in metadata write/verify, not data file write.

Summary: Why All Three Fail

Requirement	Delta Lake	Apache Hudi	Apache Iceberg	FSx S3 AP Support
Basic PutObject	✅ Uses	✅ Uses	✅ Uses	✅ Supported
Conditional write (If-None-Match)	✅ Required	❌ Not used	❌ Not used	❌ 501 Not Implemented
Atomic rename	❌ Not used	✅ Required	❌ Not used	❌ No S3 rename API
Metadata write + verify on AP alias	❌ Not needed	❌ Not needed	✅ Required	❌ NullPointerException
Write result	❌ Failed	❌ Not Supported	❌ Failed	—

Common thread: Each format requires at least one operation beyond basic PutObject that FSx S3 AP does not support. This is not a bug — it's a design boundary of the S3 AP interface.

What Works Instead

✅ Flat Parquet Append (PutObject)

EMR Spark and DuckDB can both write flat Parquet files to FSx S3 AP:

# EMR Spark
df.write.mode("append").parquet("s3://<AP>/gold/output/")

# DuckDB
COPY (SELECT * FROM result) TO 's3://<AP>/gold/output.parquet' (FORMAT PARQUET);

Limitations: No ACID transactions, no schema evolution, no time travel. Append-only pattern.

✅ Athena CTAS (Write-back)

CREATE TABLE fsxn_gold.aggregated_sensors
WITH (
  external_location = 's3://<AP>/gold/athena-output/',
  format = 'PARQUET'
) AS
SELECT status, COUNT(*) AS reading_count, AVG(temperature) AS avg_temperature
FROM fsxn_athena_verification.sensor_readings
GROUP BY status;

✅ DuckDB COPY TO

conn.execute("""
    COPY (SELECT * FROM read_parquet('s3://<AP>/sensor-data/*.parquet')
          WHERE temperature > 30)
    TO 's3://<AP>/gold/hot_sensors.parquet' (FORMAT PARQUET)
""")

✅ EMR Spark Write (Flat Parquet)

agg_df.write.mode("overwrite").parquet("s3://<AP>/gold/emr_output/")

For regulated workloads: Flat Parquet on FSx for ONTAP S3 AP does NOT provide ACID guarantees, schema evolution, or time travel. If your compliance framework requires transactional consistency (e.g., SOX audit trail, HIPAA data integrity), flat Parquet is insufficient. For workloads that need both FSx for ONTAP as the source and ACID guarantees on the analytics layer, use DataSync → S3 → Delta/Iceberg with Lake Formation governance (Part 6).

Architecture Patterns for Transactional Workloads

If you need transactional table formats AND FSx for ONTAP data:

Sync mechanism note (verified May 2026): SnapMirror S3 (ONTAP S3 bucket → AWS S3 replication) is not available on FSx for ONTAP — the snapmirror object-store commands are disabled as a managed service restriction. AWS DataSync (NFS → S3) is the only validated sync mechanism for moving FSx for ONTAP data to standard S3 buckets where Delta/Iceberg/Hudi can write safely.

Pattern 1: Read from FSx for ONTAP, Write to Native S3

FSx for ONTAP (source) ──S3 AP──▶ EMR Spark (read + transform)
                                        │
                                        ▼
                              Native S3 (Delta/Iceberg table)
                                        │
                                        ▼
                              Athena / Databricks / Redshift (query)

Use when: You need Delta/Iceberg for downstream analytics but source data lives on FSxN.

Pattern 2: Write via NFS, Read via S3 AP

Application ──NFS/SMB──▶ FSx for ONTAP Volume (write files)
                                        │
                              S3 Access Point (read-only)
                                        │
                                        ▼
                              Athena / Redshift / DuckDB (query)

Use when: Applications write via NFS/SMB and analytics engines read via S3 AP. No transactional format needed because NFS provides POSIX semantics.

Pattern 3: Hybrid (FSxN for raw, S3 for curated)

FSx for ONTAP (raw/bronze) ──S3 AP──▶ EMR Spark ──▶ S3 (silver/gold, Iceberg)
         │                                                    │
         └── NFS/SMB access for apps                          └── Athena + Lake Formation

Use when: Raw data stays on FSxN (multi-protocol access), curated data goes to S3 with full lakehouse capabilities.
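One way to read the three patterns is as a small decision function. The sketch below is an interpretation of the "Use when" notes, not a rule stated by the series: `choose_pattern` and its two boolean inputs are hypothetical, and real decisions will involve more factors (cost, governance, latency).

```python
def choose_pattern(needs_transactional_table, apps_use_nfs_smb):
    """Map two yes/no requirements to one of the patterns described above."""
    if needs_transactional_table and apps_use_nfs_smb:
        # Raw data stays multi-protocol on FSxN; curated tables live on S3.
        return "Pattern 3: Hybrid (FSxN for raw, S3 for curated)"
    if needs_transactional_table:
        return "Pattern 1: Read from FSx for ONTAP, Write to Native S3"
    if apps_use_nfs_smb:
        return "Pattern 2: Write via NFS, Read via S3 AP"
    # No transactions and no file-protocol writers: plain append is enough.
    return "Flat Parquet append via S3 AP"


for needs_tx in (True, False):
    for nfs in (True, False):
        print(needs_tx, nfs, "->", choose_pattern(needs_tx, nfs))
```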

Comparison with Other Engines in This Series

Engine	Read from FSxN S3 AP	Write flat Parquet	Write Delta/Iceberg/Hudi
Athena (Part 1)	✅	✅ CTAS	❌
Databricks (Part 2)	⚠️ Partial	❌ (UC blocks)	❌
Snowflake (Part 3)	✅	⚠️ TBD	❌
DuckDB Lambda (Part 4)	✅	✅ COPY TO	❌
EMR Spark (Part 5)	✅	✅ df.write	❌
Redshift Spectrum (Part 6)	✅	❌ (read-only)	❌

Key insight: All six engines can read from FSx S3 AP, although Databricks only partially. Three of them (Athena, DuckDB, EMR Spark) can write flat Parquet. None can write transactional table formats. This is a property of the S3 AP interface, not the engines.

Partner Decision Card

Customer requirement	FSx S3 AP path	Recommended alternative
Read NAS data from analytics engines	✅ Works (all engines)	Use any engine from Parts 1-6
Write flat Parquet back to NAS	✅ Works (EMR, DuckDB, Athena CTAS)	Use EMR Spark or DuckDB
Delta Lake on NAS data	❌ Not Supported	Write Delta to native S3; read FSxN separately
Iceberg on NAS data	❌ Not Supported	Write Iceberg to native S3; read FSxN separately
Hudi on NAS data	❌ Not Supported	Write Hudi to native S3; read FSxN separately
ACID transactions on NAS	❌ Not via S3 AP	Use NFS/SMB protocol directly
Schema evolution on NAS data	❌ Not via S3 AP	Use Glue Catalog for schema management

Discovery Questions for Partners

When a customer asks about transactional table formats on FSx for ONTAP S3 AP:

Is the requirement for transactional WRITE or just READ? (Read of pre-existing tables may work for Iceberg)
Can the transactional table live on native S3 while source data stays on FSxN? (Pattern 1)
Is the write pattern append-only or does it require updates/deletes? (Append-only works with flat Parquet)
Does the application already write via NFS/SMB? (Pattern 2 — no S3 AP write needed)
Is schema evolution required? (Use Glue Catalog for schema management without table format)
What is the concurrency requirement? (Single-writer flat Parquet is safe; multi-writer needs transactions)

Governance Impact

Write pattern	Governance model	Concurrency safety	Production suitability
Flat Parquet (single writer)	IAM + S3 AP + Glue Catalog	✅ Safe (single writer)	Production-ready
Flat Parquet (multi-writer)	IAM + S3 AP + Glue Catalog	⚠️ Risk (no conflict detection)	Use with caution
Delta/Iceberg/Hudi	N/A	N/A	❌ Not Supported
NFS/SMB write + S3 AP read	POSIX + IAM + S3 AP	✅ Safe (POSIX locking)	Production-ready

For multi-writer scenarios: If multiple processes need to write to the same prefix on FSxN via S3 AP, use a coordination mechanism (e.g., Step Functions, SQS queue) to serialize writes. Without transactional table formats, there is no built-in conflict detection.
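The serialization idea can be illustrated in-process with a queue and a single worker. This is only a stand-in for a managed service such as SQS or Step Functions: `run_serialized_writer` is a hypothetical helper, and `write_fn` represents whatever actually performs the PutObject in a real pipeline.

```python
import queue
import threading


def run_serialized_writer(write_fn, requests):
    """Funnel write requests through one worker so only one write runs at a time."""
    pending = queue.Queue()
    completed = []

    def worker():
        while True:
            item = pending.get()
            if item is None:  # sentinel: no more work
                break
            key, body = item
            write_fn(key, body)  # e.g. a PutObject call in a real pipeline
            completed.append(key)

    thread = threading.Thread(target=worker)
    thread.start()
    for request in requests:
        pending.put(request)
    pending.put(None)
    thread.join()
    return completed


written = {}
order = run_serialized_writer(
    written.__setitem__,
    [("gold/output/part-0001.parquet", b"a"), ("gold/output/part-0002.parquet", b"b")],
)
print(order)  # ['gold/output/part-0001.parquet', 'gold/output/part-0002.parquet']
```

Producers can enqueue from any number of threads or processes; because a single consumer performs the writes, no two writes to the prefix overlap. This gives ordering, not conflict detection, so distinct object keys per write are still advisable.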

AI Readiness Score

Pattern	Governance	Performance	AI Capability	Cost	Operational Simplicity	Overall
Flat Parquet + Glue Catalog	★★★☆☆	★★★★☆	★★☆☆☆	★★★★★	★★★★★	3.8
NFS write + S3 AP read	★★★☆☆	★★★★☆	★★☆☆☆	★★★★★	★★★★☆	3.6
Hybrid (FSxN raw + S3 Iceberg)	★★★★★	★★★★★	★★★★☆	★★★☆☆	★★☆☆☆	3.8

Scoring methodology: Flat Parquet + Glue Catalog scores highest on Cost and Simplicity (no additional infrastructure). Hybrid pattern scores highest on Governance and Performance (full lakehouse on S3) but lower on Simplicity (two storage tiers to manage).
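The Overall column is consistent with an unweighted mean of the five star ratings, rounded to one decimal. That formula is an inference from the numbers in the table, not something the methodology note states; the sketch below reproduces it.

```python
from statistics import mean

# Star counts per pattern, in column order:
# Governance, Performance, AI Capability, Cost, Operational Simplicity
RATINGS = {
    "Flat Parquet + Glue Catalog": [3, 4, 2, 5, 5],
    "NFS write + S3 AP read": [3, 4, 2, 5, 4],
    "Hybrid (FSxN raw + S3 Iceberg)": [5, 5, 4, 3, 2],
}


def overall_score(stars):
    """Unweighted mean of the star ratings, rounded to one decimal place."""
    return round(mean(stars), 1)


OVERALL = {pattern: overall_score(stars) for pattern, stars in RATINGS.items()}
for pattern, score in OVERALL.items():
    print(f"{pattern}: {score}")
```

Because every dimension carries equal weight, two patterns with very different profiles (cheap and simple versus governed and fast) can tie; teams with a dominant requirement may prefer to weight the columns.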

Cost Analysis

Pattern	Additional cost beyond FSxN	Notes
Flat Parquet (append-only)	$0	Just PutObject to existing FSxN
Hybrid (FSxN + S3 Iceberg)	S3 storage + Glue Catalog	Duplicate storage for curated layer
NFS write + S3 AP read	$0	Same FSxN volume, two access paths

Key insight: The cheapest write pattern is flat Parquet directly to FSxN via S3 AP. If you need transactional capabilities, the cost is maintaining a separate S3 tier for the curated layer.

Known Failure Signatures

Symptom	Format	Root cause	Resolution
`501 Not Implemented`	Delta Lake	Conditional write (If-None-Match) not supported	Use flat Parquet instead
`NullPointerException` on metadata	Iceberg	S3FileIO cannot handle AP alias	Write Iceberg to native S3
Rename fails / timeline corrupt	Hudi	No atomic rename in S3 API	Write Hudi to native S3
`CopyObject` + `DeleteObject` (non-atomic)	Delta (Spark fallback)	Spark uses copy+delete as rename	Not safe for concurrent writes
Write succeeds but table is corrupt	Any format (if forced)	Missing concurrency control	Do not force transactional writes

What's Next

This article concludes the core engine validation series (Parts 1-7). The series has validated:

✅ 6 read engines (Athena, Databricks, Snowflake, DuckDB, EMR, Redshift)
✅ 3 write patterns (flat Parquet via EMR, DuckDB, Athena CTAS)
❌ 3 table formats that don't work (Delta, Iceberg, Hudi)
✅ Enterprise governance (Lake Formation fine-grained: column, row, tag)
✅ AI/ML integration (Snowflake Cortex: 8/10 functions, Bedrock KB: zero-copy RAG)

Start Here: 3 Steps to Validate in Your Environment

1. Choose your engine using the comparison tables in this series:
   - Cheapest: DuckDB Lambda (Part 4), $0.00001/query
   - Most governed: Redshift Spectrum + Lake Formation (Part 6)
   - Best AI: Snowflake External Table + Cortex (Part 3)
   - Best ETL: EMR Serverless (Part 5)
   - Best for Databricks customers: DataSync → S3 → UC (Part 2)
2. Deploy the verification template from GitHub; each engine has a CloudFormation template and setup guide
3. Record evidence using the verification-pack templates for consistent, reviewable results across environments

PoC Cost Summary (1-day validation)

Engine	PoC Cost (1 day)	What you validate
DuckDB Lambda	~$0.01	Read + write Parquet, sub-second queries
Athena	~$0.05	Serverless SQL, Glue catalog integration
EMR Serverless	~$0.50	Spark ETL, write-back, distributed processing
Redshift Serverless	~$1.50	DWH JOINs, Lake Formation governance
Snowflake	~$5 (1 credit)	External Table, Cortex AI, governance tags

Previously in this series:

Part 1: Athena — Query NAS Data In Place
Part 2: Databricks — A Layer-by-Layer Validation of Observed Boundaries
Part 3: Snowflake — From 'Access Denied' to Working External Tables
Part 4: DuckDB Lambda — Serverless Analytics for $0.00001/Query
Part 5: EMR Spark — Read-Write ETL on NAS Data
Part 6: Redshift Spectrum + Lake Formation — Enterprise Governance

References

Key achievement: This validation established the write boundary for FSx for ONTAP S3 Access Points: transactional table formats (Delta, Iceberg, Hudi) are not supported. Delta Lake fails on conditional writes, Hudi depends on an atomic rename that the S3 API does not offer, and Iceberg fails during its metadata commit. The working alternative is flat Parquet append via PutObject, which is supported by EMR Spark, DuckDB, and Athena CTAS. For teams that need transactional capabilities, the recommended pattern is hybrid: raw data on FSxN (multi-protocol access) with curated Iceberg/Delta tables on native S3.

Evidence from verification-pack: delta-lake/, iceberg/, hudi/

Disclaimer: This article is an independent validation report and does not represent AWS, NetApp, Databricks, or Apache Software Foundation official guidance. Product behavior and platform capabilities may change. Always validate in your own environment.

Redshift Spectrum + Lake Formation — Enterprise Governance on NAS Data

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Tue, 26 May 2026 16:58:02 +0000

TL;DR

In Part 1, Athena provided serverless SQL. In Part 2, Databricks hit boundaries. In Part 3, Snowflake worked once configured. In Part 4, DuckDB Lambda was cheapest. In Part 5, EMR Spark delivered full ETL. This Part 6 adds enterprise governance: Redshift Spectrum + Lake Formation provides 4-layer authorization on NAS data.

Redshift Serverless (8 RPU) successfully queries FSx for ONTAP data via S3 Access Points using the same Glue Catalog tables as Athena — no additional data registration needed. Add Lake Formation on top for table-level, column-level, and tag-based access control.

Query	Duration	Comparison with Athena
COUNT(*) 10K rows	3,231 ms	Athena: ~1,500 ms
GROUP BY aggregation	2,580 ms	Athena: ~1,800 ms
COUNT(*) 5M rows	4,277 ms	Athena: 2,196 ms

Redshift Serverless was roughly 1.4x to 2.2x slower than Athena on these three queries (cold start overhead), but Redshift adds DWH capabilities: federated JOINs with local tables, materialized views, and stored procedures.

Quick Decision Guide:

Need DWH JOINs with NAS data → Redshift Spectrum (this article)
Need enterprise governance (table/column/tag) → Add Lake Formation
Need serverless SQL only (no DWH) → Use Athena (Part 1) — faster and cheaper

GitHub: fsxn-lakehouse-integrations

How to Read This Article

This article is:

A reproduction-focused validation report
Evidence from one environment (Redshift Serverless 8 RPU, ap-northeast-1)
A governance architecture guide for Lake Formation + FSx S3 AP

Read by role:

DWH engineer: Architecture → Setup → Benchmark Results
Security / governance reviewer: 4-Layer Authorization → Governance Impact
Data engineer: When to Use → Comparison with Athena
Partner / SA: Partner Decision Card → Discovery Questions

Prerequisite Concepts

Before reading this article, it helps to understand:

Redshift Spectrum — Redshift's ability to query data in S3 via external schemas (Glue Catalog)
Redshift Serverless — pay-per-query Redshift without cluster management (measured in RPU)
Lake Formation — AWS's centralized governance layer for data lakes (table/column/tag permissions)
Glue Catalog — AWS's metadata catalog (shared by Athena, Redshift Spectrum, EMR, Glue)
External Schema — a Redshift schema that maps to a Glue Catalog database

Architecture

┌─────────────────────────────────────────────────────────────────┐
│  Redshift Serverless (8 RPU)                                     │
│                                                                  │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  SQL Query                                                │   │
│  │  SELECT * FROM fsxn_spectrum.sensor_readings              │   │
│  │         JOIN local_table ON ...                            │   │
│  └──────────────────────────────────────────────────────────┘   │
│                          │                                       │
│              External Schema (Glue Catalog)                       │
└──────────────────────────┼───────────────────────────────────────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
    Lake Formation    IAM Role    S3 Access Point
    (table/column     (API        (resource
     permissions)      access)     policy)
              │            │            │
              └────────────┼────────────┘
                           │
                           ▼
              FSx for ONTAP Volume (Parquet files)

4-Layer Authorization:

Lake Formation — Who can access which tables/columns (fine-grained)
IAM — Who can call which AWS APIs
S3 Access Point Policy — Which principals can access this access point
File System — UNIX permissions on the underlying files
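The layering can be pictured as a chain of independent checks where a request must pass every one. The sketch below is a conceptual model only: the `Layer` type, the rules inside `LAYERS`, and the request fields are hypothetical and do not reflect how AWS evaluates policies internally.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Layer:
    name: str
    allows: Callable[[dict], bool]


def first_denying_layer(layers, request) -> Optional[str]:
    """Return the name of the first layer that rejects the request, or None."""
    for layer in layers:
        if not layer.allows(request):
            return layer.name
    return None


# Hypothetical rules, one per layer, in the order listed above.
LAYERS = [
    Layer("Lake Formation", lambda r: r["column"] in {"device_id", "temperature", "status"}),
    Layer("IAM", lambda r: r["action"] in {"s3:GetObject", "s3:ListBucket"}),
    Layer("S3 Access Point Policy", lambda r: r["principal"] == "fsxn-analyst-role"),
    Layer("File System", lambda r: r["unix_read"]),
]

allowed = {
    "column": "temperature",
    "action": "s3:GetObject",
    "principal": "fsxn-analyst-role",
    "unix_read": True,
}
denied = dict(allowed, column="humidity")

print(first_denying_layer(LAYERS, allowed))  # None
print(first_denying_layer(LAYERS, denied))   # Lake Formation
```

The practical consequence is that a "permission denied" can originate at any of the four layers, so troubleshooting usually means checking them one at a time.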

Benchmark Results

Query	Duration (ms)	Rows	Notes
CREATE EXTERNAL SCHEMA	240	—	One-time setup
COUNT(*) 10K rows	3,231	10,000	Cold start overhead
GROUP BY + AVG aggregation	2,580	3 groups	Status grouping
COUNT(*) 5M rows	4,277	5,000,000	Large scan

Environment: Redshift Serverless 8 RPU, ap-northeast-1. FSx for ONTAP Single-AZ, 128 MB/s.

Performance note: Redshift Serverless has cold start overhead (~2-3s for first query). Warm queries on provisioned Redshift clusters would be faster. For simple scans, Athena is ~2x faster because it has no DWH initialization overhead.
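The slowdown can be computed per query from the Redshift timings above and the Athena timings quoted in the TL;DR. Two of the Athena figures are approximate in the source, so the ratios below are indicative rather than precise.

```python
# (Redshift Serverless ms, Athena ms) per query, as reported in this article.
TIMINGS_MS = {
    "COUNT(*) 10K rows": (3231, 1500),
    "GROUP BY aggregation": (2580, 1800),
    "COUNT(*) 5M rows": (4277, 2196),
}

SLOWDOWN = {
    query: round(redshift_ms / athena_ms, 2)
    for query, (redshift_ms, athena_ms) in TIMINGS_MS.items()
}

for query, ratio in SLOWDOWN.items():
    print(f"{query}: {ratio}x")
```

The ratio is largest on the smallest query, which is what you would expect if a roughly fixed startup overhead dominates short scans.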

Evidence Matrix

Layer	Evidence	Result	Interpretation
Redshift Serverless	Workgroup creation (8 RPU)	✅ Pass	Serverless endpoint available
IAM role	Spectrum role with S3 AP permissions	✅ Pass	GetObject + ListBucket on AP ARN
External Schema	CREATE EXTERNAL SCHEMA from Glue	✅ Pass	Same catalog as Athena
Spectrum read (small)	COUNT(*) 10K rows	✅ Pass	3,231ms
Spectrum read (aggregation)	GROUP BY + AVG	✅ Pass	2,580ms
Spectrum read (large)	COUNT(*) 5M rows	✅ Pass	4,277ms
Lake Formation admin	put-data-lake-settings	✅ Pass	Admin configured
Lake Formation grant	Table-level SELECT grant	✅ Pass	Fine-grained permission works
LF column-level	SELECT on 3 permitted columns	✅ Pass	Non-permitted column returns "cannot be resolved"
LF column deny	SELECT on denied column (humidity)	✅ Pass (denied)	"Column cannot be resolved or requester is not authorized"
LF row filter	Data cells filter creation	✅ Pass	Row filter (status='normal') + column filter combined
LF-Tag creation	sensitivity tag (public/internal/confidential)	✅ Pass	Tag created and assigned to table
LF-Tag permission	Tag-based DESCRIBE+ASSOCIATE grant	✅ Pass	Scalable governance via classification
Athena under LF	Query with LF permissions active	✅ Pass	Same governance applies to Athena

Setup

Step 1: Create External Schema (reuses Glue Catalog)

CREATE EXTERNAL SCHEMA fsxn_spectrum
FROM DATA CATALOG
DATABASE 'fsxn_athena_verification'
IAM_ROLE 'arn:aws:iam::<ACCOUNT_ID>:role/fsxn-redshift-spectrum-role'
REGION 'ap-northeast-1';

Key insight: This uses the same Glue Catalog database that Athena uses. No additional table registration needed — if Athena can query it, Redshift Spectrum can too.

Step 2: Query FSx for ONTAP Data

-- Simple count
SELECT COUNT(*) FROM fsxn_spectrum.sensor_readings;
-- Result: 10000 (3,231ms)

-- Aggregation
SELECT status, COUNT(*), AVG(temperature)
FROM fsxn_spectrum.sensor_readings
GROUP BY status;
-- Result: 3 groups (2,580ms)

-- JOIN with local Redshift table (DWH capability)
SELECT s.device_id, s.temperature, d.location
FROM fsxn_spectrum.sensor_readings s
JOIN device_master d ON s.device_id = d.device_id
WHERE s.temperature > 35;

Step 3: Add Lake Formation Governance

# Set Lake Formation admin
aws lakeformation put-data-lake-settings \
  --data-lake-settings '{"DataLakeAdmins": [{"DataLakePrincipalIdentifier": "arn:aws:iam::<ACCOUNT_ID>:user/<admin>"}]}'

# Grant table-level SELECT to a role
aws lakeformation grant-permissions \
  --principal '{"DataLakePrincipalIdentifier": "arn:aws:iam::<ACCOUNT_ID>:role/fsxn-analyst-role"}' \
  --resource '{"Table": {"DatabaseName": "fsxn_athena_verification", "Name": "sensor_readings"}}' \
  --permissions '["SELECT", "DESCRIBE"]'

Lake Formation Data Permissions: fine-grained table-level SELECT grant for fsxn-athena-glue-role on sensor_readings table.

Lake Formation Governance Value

Capability	Without Lake Formation	With Lake Formation
Table-level access	S3 AP policy (all-or-nothing per prefix)	Per-table SELECT/DESCRIBE grants
Column-level security	❌ Not possible	✅ Column-level grants + masking
Row-level filtering	❌ Not possible	✅ Data Cells Filter (row filter expressions)
Tag-based access control	❌ Not possible	✅ Classify data → auto-grant by tag (LF-Tags)
Centralized audit	CloudTrail (API-level)	Lake Formation audit (table/column-level)
Cross-account sharing	Share S3 AP (complex)	Share tables via Lake Formation (simple)

Fine-Grained Governance — Verified (May 2026)

All three fine-grained Lake Formation capabilities have been validated on FSx for ONTAP S3 AP data:

Feature	Test	Result
Column-level permission	Grant SELECT on 3 of 4 columns; query the denied column	✅ Permitted columns return data; denied column (`humidity`) returns "cannot be resolved"
Row filter (Data Cells Filter)	Create filter `status = 'normal'`; query returns only matching rows	✅ Only rows matching the filter expression are returned
LF-Tag	Create tag `sensitivity: public/internal/confidential`; assign to table	✅ Tag created, assigned, and queryable via Lake Formation console

Governance implication for regulated workloads: Lake Formation on FSx for ONTAP S3 AP data provides the same fine-grained access control as on native S3 data. Column-level permissions, row filtering, and tag-based classification all work without data movement. This is the strongest AWS-native governance path for FSx for ONTAP data.
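For reference, a column-level grant like the one tested could be expressed as a request for the Lake Formation `GrantPermissions` API. This is a sketch, not the exact command used in the validation: `build_column_grant` is a hypothetical helper, and the three permitted column names are an assumption (the article only states that `humidity` was the denied column).

```python
def build_column_grant(principal_arn, database, table, columns):
    """Build keyword arguments for a column-scoped SELECT grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


grant = build_column_grant(
    "arn:aws:iam::<ACCOUNT_ID>:role/fsxn-analyst-role",
    "fsxn_athena_verification",
    "sensor_readings",
    ["device_id", "temperature", "status"],  # assumed permitted columns
)
print(grant["Resource"]["TableWithColumns"]["ColumnNames"])

# With boto3 installed and credentials configured, the grant could be applied as:
#   boto3.client("lakeformation").grant_permissions(**grant)
```

Any column left out of `ColumnNames` is simply not visible to the principal, which matches the "cannot be resolved" error observed for `humidity`.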

Iceberg + Lake Formation path: Glue Data Catalog supports Iceberg table registration natively. For transactional workloads requiring ACID guarantees: sync FSx for ONTAP data to S3 via DataSync → write as Iceberg table (EMR Spark) → register in Glue Catalog → query via Redshift Spectrum with full Lake Formation governance (column/row/tag). This provides the best of both worlds: FSx for ONTAP as source of truth + Iceberg ACID + Lake Formation governance.

Enterprise governance use cases:

Healthcare: Column-level masking of PHI fields (e.g., hide patient_name from analysts)
Finance: Row-level filtering by business unit (each team sees only their data)
Public sector: LF-Tag classification enforcement (sensitivity: public/internal/confidential)

Comparison with Other Engines in This Series

Aspect	Redshift Spectrum	Athena (Part 1)	DuckDB Lambda (Part 4)	EMR Spark (Part 5)
Query latency (5M rows)	4,277ms	2,196ms	N/A (memory limit)	6,780ms
DWH JOINs with local tables	✅ Best	❌	❌	❌
Lake Formation governance	✅	✅	❌	⚠️ Optional
Materialized views	✅	❌	❌	❌
Stored procedures	✅	❌	❌	❌
Zero idle cost	✅ (Serverless)	✅	✅	✅
Write-back to FSxN	❌ (results stay in Redshift)	✅ CTAS	✅ COPY TO	✅ Best
Cold start	~3s (Serverless)	~2s	1.9s	20s
Cost model	RPU-seconds	$/TB scanned	$/invocation	$/job

Partner Decision Card

Customer requirement	Redshift Spectrum + LF today	Recommended path
JOIN NAS data with DWH tables	✅ Best fit	Redshift Spectrum external schema
Enterprise governance (table/column/tag)	✅ Best fit	Add Lake Formation
Existing Redshift investment	✅ Natural extension	Add external schema to existing cluster
Serverless SQL only (no DWH)	⚠️ Overkill	Use Athena (faster, cheaper for simple queries)
Write-back to FSxN	❌ Not supported	Use EMR Serverless (Part 5)
Sub-second latency	❌ Cold start overhead	Use DuckDB Lambda (Part 4)
Cross-account data sharing	✅ Lake Formation sharing	Configure LF cross-account grants
Column-level masking for compliance	✅ Lake Formation	Configure column-level permissions

Discovery Questions for Partners

When a customer asks about Redshift Spectrum + Lake Formation + FSx for ONTAP S3 AP:

Does the customer already have a Redshift cluster or Serverless workgroup? (If yes, adding Spectrum is trivial)
Do they need to JOIN NAS data with existing DWH tables? (This is Redshift Spectrum's unique value)
Is table/column-level governance required? (Lake Formation adds this layer)
Is the workload read-only analytics, or does it need write-back? (Spectrum is read-only from external data)
What is the query frequency? (For < 10 queries/day, Athena is cheaper)
Is cross-account data sharing needed? (Lake Formation simplifies this)
Are there compliance requirements for column-level masking? (Lake Formation provides this)
What is the acceptable query latency? (Redshift Serverless has ~3s cold start)

Governance Impact Summary

Access path	Authorization layers	Auditability	Production suitability
Redshift Spectrum (no LF)	IAM + S3 AP + File System (3 layers)	Medium (CloudTrail)	Good for non-regulated workloads
Redshift Spectrum + Lake Formation	LF + IAM + S3 AP + File System (4 layers)	High (LF audit + CloudTrail)	Recommended for regulated workloads
Athena + Lake Formation	LF + IAM + S3 AP + File System (4 layers)	High (LF audit + CloudTrail)	Recommended for serverless regulated workloads

Key insight: Redshift Spectrum and Athena share the same Glue Catalog and Lake Formation permissions. Governance configured for one automatically applies to the other. This means you can use EMR Spark for write-back, register output in Glue, apply Lake Formation permissions, and query from both Athena and Redshift Spectrum with the same governance.

AI Readiness Score

Pattern	Governance	Performance	AI Capability	Cost	Operational Simplicity	Overall
Redshift Spectrum + LF	★★★★★	★★★☆☆	★★☆☆☆	★★★☆☆	★★★☆☆	3.2
Athena + Lake Formation	★★★★★	★★★☆☆	★★☆☆☆	★★★★☆	★★★★☆	3.6
Snowflake External Table	★★★★☆	★★☆☆☆	★★★★☆	★★★☆☆	★★★★☆	3.4
DuckDB Lambda	★☆☆☆☆	★★★★☆	★☆☆☆☆	★★★★★	★★★★★	3.2
EMR Serverless Spark	★★☆☆☆	★★★★☆	★★★☆☆	★★★☆☆	★★★☆☆	3.0

Scoring methodology: Redshift Spectrum + LF scores highest on Governance (same as Athena + LF) but lower on Cost and Simplicity due to RPU pricing and DWH management overhead. Choose Redshift Spectrum when DWH JOINs are required; choose Athena when serverless SQL is sufficient.

Cost Analysis

Component	Cost
Redshift Serverless (8 RPU, per query)	~$0.36/RPU-hour (billed per second)
Redshift Serverless (idle)	$0 (scales to zero)
Lake Formation	$0 (no additional charge)
Glue Catalog	$1/100K objects/month
FSx for ONTAP (existing)	$0 incremental

Monthly estimate (100 queries/day, avg 5s each):

100 queries × 5s × 8 RPU × $0.36/RPU-hour ÷ 3600 = ~$0.40/day = ~$12/month
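The same arithmetic as a small function, so the inputs can be swapped for your own workload. It mirrors the formula above and nothing more: it does not model any minimum billing duration or time spent scaling up, and `redshift_serverless_daily_cost` is a hypothetical helper name.

```python
def redshift_serverless_daily_cost(queries_per_day, seconds_per_query, rpu, usd_per_rpu_hour):
    """Daily compute cost: RPU-seconds converted to RPU-hours, times the hourly rate."""
    rpu_hours = queries_per_day * seconds_per_query * rpu / 3600
    return rpu_hours * usd_per_rpu_hour


daily = redshift_serverless_daily_cost(
    queries_per_day=100, seconds_per_query=5, rpu=8, usd_per_rpu_hour=0.36
)
monthly = daily * 30  # 30-day month, as in the estimate above

print(f"${daily:.2f}/day, ${monthly:.2f}/month")  # $0.40/day, $12.00/month
```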

Compare with:

Athena (same queries): ~$5/TB × data scanned
DuckDB Lambda: ~$1.10/month (but no DWH JOINs)

When Redshift Spectrum is cost-justified: When you already have Redshift and need to JOIN NAS data with local tables. The marginal cost of adding Spectrum queries is low.

When to Use (and When Not To)

Use Redshift Spectrum + Lake Formation when:

Customer already has Redshift (adding Spectrum is trivial)
Need to JOIN NAS data with DWH tables
Enterprise governance (table/column/tag) is required
Cross-account data sharing is needed
Compliance requires column-level masking

Don't use when:

Simple serverless SQL is sufficient (use Athena — faster, cheaper)
Need write-back to FSxN (use EMR Serverless)
Need sub-second latency (use DuckDB Lambda)
No existing Redshift investment (Athena is simpler to start)
Dataset is small and ad-hoc (DuckDB Lambda is cheapest)

Known Failure Signatures

Symptom	Likely cause	Next step
`permission denied for schema`	IAM role not associated with Redshift	Associate IAM role with Redshift namespace
`S3 access denied` on external table	IAM role missing S3 AP permissions	Add S3 AP ARN to role policy
External schema creation fails	Glue database doesn't exist	Create database in Glue Catalog first (or use Athena)
Query returns 0 rows	Table location doesn't match S3 AP path	Verify Glue table LOCATION uses AP alias
`Spectrum is not supported`	Using provisioned cluster without Spectrum	Enable Spectrum or use Serverless
Lake Formation permission denied	LF permissions not granted	Grant SELECT via `aws lakeformation grant-permissions`

What's Next

Part 7: Table Format Boundaries — why Delta, Iceberg, and Hudi can't write to FSx S3 AP, and what flat Parquet patterns work instead (critical knowledge for architecture decisions)

Previously in this series:

Part 1: Athena — Query NAS Data In Place
Part 2: Databricks — A Layer-by-Layer Validation of Observed Boundaries
Part 3: Snowflake — From 'Access Denied' to Working External Tables
Part 4: DuckDB Lambda — Serverless Analytics for $0.00001/Query
Part 5: EMR Spark — Read-Write ETL on NAS Data

References

Key achievement: This validation established that Redshift Spectrum + Lake Formation provides the strongest enterprise governance path for FSx for ONTAP S3 AP data — 4-layer authorization (Lake Formation → IAM → S3 AP → File System), table/column-level access control, and seamless sharing of Glue Catalog with Athena. The same governance configuration applies to both Athena and Redshift Spectrum queries, enabling a unified governance model across query engines.

All benchmarks are from a specific test environment (Redshift Serverless 8 RPU, FSx for ONTAP Single-AZ 128 MB/s, ap-northeast-1). Performance improves with warm queries and provisioned clusters.

Disclaimer: This article is an independent validation report and does not represent AWS or NetApp official guidance. Product behavior and platform capabilities may change. Always validate in your own environment.

Read-Write ETL on NAS Data with EMR Serverless Spark — No Cluster, No Copy

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Tue, 26 May 2026 16:57:34 +0000

TL;DR

In Part 1, Athena provided serverless read-only SQL. In Part 2, Databricks hit session policy boundaries. In Part 3, Snowflake worked once the stage was configured with AWS_ACCESS_POINT_ARN. In Part 4, DuckDB Lambda delivered the cheapest path. This Part 5 shows the full-power Spark ETL path with write-back.

EMR Serverless Spark can read, transform, and write back Parquet files on FSx for ONTAP via S3 Access Points. Total Spark execution: 16 seconds for a full ETL pipeline (read → aggregate → window → write). Job total including cold start: 37 seconds. Cost: ~$0.05 per job.

No cluster to manage. No data to copy. No idle cost.

Quick Decision Guide:

Need Spark's full power (UDFs, ML, window functions) + write-back → EMR Serverless
Read-only SQL, no Spark needed → Use Athena (Part 1) or DuckDB Lambda (Part 4)
Need enterprise governance on results → Combine EMR write-back + Athena/Lake Formation for reads

GitHub: fsxn-lakehouse-integrations/integrations/emr-spark/

How to Read This Article

This article is:

A reproduction-focused validation report
Evidence from one environment (EMR Serverless emr-7.1.0, ap-northeast-1)
A deployment guide for EMR Serverless + FSx for ONTAP S3 AP

Read by role:

Data engineer: Architecture → Critical Findings → PySpark Job
Platform engineer: Deploy and Run → Gotchas → Cost Analysis
Partner / SA: Partner Decision Card → Discovery Questions
Security reviewer: Governance Impact → When to Use

Prerequisite Concepts

Before reading this article, it helps to understand:

EMR Serverless — a deployment option for EMR that runs Spark/Hive jobs without managing clusters
EMRFS — EMR's S3 filesystem implementation (s3:// prefix) that natively supports S3 AP aliases
S3A vs EMRFS — s3a:// (Hadoop's S3AFileSystem) does NOT support S3 AP aliases; always use s3://
PySpark — Python API for Apache Spark
Parquet timestamp resolution — Spark requires microsecond timestamps; nanosecond (pandas default) causes errors

Why EMR Serverless + FSx for ONTAP?

Traditional ETL	This approach
Provision EMR cluster (minutes)	Submit job to EMR Serverless (seconds)
Copy data from NAS to S3	Read NAS data in place via S3 AP
Pay for idle cluster	Pay only during job execution
Manage cluster scaling	Auto-scales per job
Write results to separate S3 bucket	Write results back to FSx for ONTAP

EMR Serverless eliminates cluster management entirely. Combined with FSx S3 AP, you get a fully serverless ETL pipeline that reads and writes directly to your NAS storage.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│  EMR Serverless Application (Spark 3.5, emr-7.1.0)              │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  PySpark Job                                             │   │
│  │  ├── Read: spark.read.parquet("s3://<AP>/sensor-data/")  │   │
│  │  ├── Transform: GROUP BY, Window functions               │   │
│  │  └── Write: df.write.parquet("s3://<AP>/gold/output/")   │   │
│  └──────────────────────────────────────────────────────────┘   │
│                          │                                      │
│                    EMRFS (s3://)                                │
└──────────────────────────┼──────────────────────────────────────┘
                           │
                           ▼
              S3 Access Point (internet-origin)
                           │
                           ▼
              FSx for ONTAP Volume (Parquet files)

Key: EMR Serverless uses EMRFS (s3:// prefix), which natively supports S3 AP aliases. No special configuration is needed.

Benchmark Results

Operation	Duration	Notes
Read 10K rows	6.78s	First read includes Spark initialization
GROUP BY aggregation	2.52s	Status + AVG(temperature)
Window function	1.19s	Moving average per device
Write-back to FSxN	3.61s	Parquet output to S3 AP
Total Spark execution	16.35s	All operations combined
Job total (with cold start)	37s	Includes EMR Serverless startup

Environment: EMR Serverless, emr-7.1.0, Spark 3.5, ap-northeast-1. FSx for ONTAP Single-AZ, 128 MB/s.
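
As a quick cross-check of the table, the sketch below does plain arithmetic on the reported numbers only. The variable names are illustrative, and the labels "untimed" and "startup" are interpretations: the article does not break those gaps down further.

```python
# Durations (seconds) copied from the benchmark table above
steps = {"read": 6.78, "group_by": 2.52, "window": 1.19, "write_back": 3.61}
spark_total = 16.35   # "Total Spark execution"
job_total = 37.0      # "Job total (with cold start)"

timed = sum(steps.values())          # the four individually timed steps
untimed = spark_total - timed        # Spark time outside those four timers
startup = job_total - spark_total    # everything before/after Spark execution

print(f"timed steps: {timed:.2f}s")
print(f"untimed Spark time: {untimed:.2f}s")
print(f"startup and teardown: {startup:.2f}s")
```

The last figure is consistent with the ~20 second cold start quoted elsewhere in this article.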

Evidence Matrix

Layer	Evidence	Result	Interpretation
EMR Serverless app	create-application	✅ Pass	Spark 3.5 app created
IAM role	Execution role with S3 AP permissions	✅ Pass	GetObject + PutObject on AP ARN
EMRFS read	spark.read.parquet("s3://AP/...")	✅ Pass	EMRFS natively handles AP alias
Spark transforms	GROUP BY, Window, aggregation	✅ Pass	Full Spark SQL works
Write-back	df.write.parquet("s3://AP/gold/...")	✅ Pass	PutObject to FSxN via S3 AP
S3A (negative test)	spark.read.parquet("s3a://AP/...")	❌ Expected fail	S3A cannot parse AP alias
Job lifecycle	start → running → success	✅ Pass	37s total including cold start

Critical Finding: EMRFS vs S3A

This is the most important thing to know:

# ✅ WORKS — EMRFS natively supports S3 AP aliases
df = spark.read.parquet("s3://my-ap-alias-ext-s3alias/sensor-data/")

# ❌ FAILS — S3A cannot parse AP alias URLs
df = spark.read.parquet("s3a://my-ap-alias-ext-s3alias/sensor-data/")
# Error: IllegalArgumentException: Invalid S3 URI

Always use s3:// (EMRFS) with EMR. The s3a:// filesystem (Hadoop's S3AFileSystem) does not understand S3 AP alias format.
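
If job configuration or shared libraries still carry s3a:// paths, a small guard can normalize them before they reach Spark. This is a minimal sketch; `to_emrfs_uri` is a hypothetical helper name, not an EMR or Spark API.

```python
def to_emrfs_uri(uri):
    """Rewrite s3a:// or s3n:// URIs to s3:// so EMR resolves them via EMRFS."""
    for prefix in ("s3a://", "s3n://"):
        if uri.startswith(prefix):
            return "s3://" + uri[len(prefix):]
    # Already s3:// (or not an S3 URI at all): return unchanged
    return uri


print(to_emrfs_uri("s3a://my-ap-alias-ext-s3alias/sensor-data/"))
```

In a job, the call would wrap the path argument, for example `spark.read.parquet(to_emrfs_uri(path))`.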

Critical Finding: Parquet Timestamp Compatibility

If you generate Parquet files with pandas or DuckDB, they default to nanosecond timestamps. Spark cannot read these:

AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS, true))

Fix: Generate Parquet with microsecond timestamps:

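import pyarrow as pa
import pyarrow.parquet as pq

# df is an existing pandas DataFrame with datetime64[ns] columns
table = pa.Table.from_pandas(df)

# Rebuild the schema: downgrade every timestamp field from ns to us,
# keeping its time zone (if any) so UTC-adjusted columns stay UTC-adjusted
new_fields = []
for field in table.schema:
    if pa.types.is_timestamp(field.type):
        new_fields.append(field.with_type(pa.timestamp('us', tz=field.type.tz)))
    else:
        new_fields.append(field)

# safe=False truncates sub-microsecond digits instead of raising an error
table = table.cast(pa.schema(new_fields), safe=False)
pq.write_table(table, 'output.parquet')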

This affects cross-engine compatibility: if you write Parquet with DuckDB or pandas and want to read it with Spark (EMR, Glue, Databricks), always use microsecond resolution.

Comparison with Other Engines in This Series

Aspect	EMR Serverless	Athena (Part 1)	DuckDB Lambda (Part 4)	Snowflake (Part 3)	Databricks (Part 2)
Read from FSx for ONTAP S3 AP	✅ Direct	✅ Direct	✅ Direct	✅ With ARN	⚠️ Partial (explicit path only)
Write-back to FSx for ONTAP	✅ Best	✅ CTAS	✅ COPY TO	⚠️ TBD	❌ Blocked
Complex transforms (UDF, ML)	✅ Best	❌ SQL only	❌ SQL only	⚠️ Snowpark	✅ Best (if data in UC)
Cold start	20s	~2s	1.9s	N/A	N/A (cluster always on)
Cost per job	$0.05	$5/TB scanned	$0.00001	Credits	DBU
Governance	IAM only	✅ Glue + LF	❌ None	✅ Tags + RBAC	❌ UC blocked on S3 AP
Distributed processing	✅ Best	✅	❌	✅	✅ Best (if data in UC)
Session policy issues	❌ None	❌ None	❌ None	Resolved with ARN	❌ Blocks table creation

Why EMR Serverless instead of Databricks for FSx for ONTAP S3 AP?

EMR Serverless uses direct IAM role credentials without intermediary session policies. The S3 AP ARN format works natively — no special configuration needed. In contrast, Databricks UC generates a restrictive session policy that blocks subdirectory listing, table creation, and write operations on FSx for ONTAP S3 AP paths (confirmed by Databricks Support, May 2026).

For teams that need Spark processing on FSx for ONTAP data today:

EMR Serverless: Direct read + write-back, no session policy issues, IAM governance
Databricks: Requires DataSync → S3 → UC (data copy), but provides full UC governance + Mosaic AI

Partner Decision Card

Customer requirement	EMR Serverless today	Recommended path
Full Spark ETL with write-back	✅ Best fit	Deploy EMR Serverless
Complex transforms (UDFs, ML pipelines)	✅ Best fit	Deploy EMR Serverless
Large-scale distributed processing	✅ Best fit	Deploy EMR Serverless
Read-only SQL analytics	⚠️ Overkill	Use Athena or DuckDB Lambda
Sub-second query latency	❌ 20s cold start	Use DuckDB Lambda
Enterprise governance on results	⚠️ IAM only	Write to FSxN → read via Athena + Lake Formation
Delta/Iceberg table format	❌ Write not supported on S3 AP	Write flat Parquet only. Iceberg read (pre-existing table) is theoretically possible via GetObject but not validated.
Scheduled batch ETL	✅ Good fit	EMR Serverless + Step Functions

Discovery Questions for Partners

When a customer asks about EMR Serverless + FSx for ONTAP S3 Access Points:

Does the workload require Spark-specific features (UDFs, ML, window functions, graph)?
Is write-back to FSx for ONTAP required? (EMR is the best write-back path)
What is the typical dataset size? (EMR shines at > 1 GB; for < 1 GB, DuckDB Lambda is cheaper)
Is the workload batch or interactive? (EMR has 20s cold start — not suitable for interactive)
Does the team have Spark expertise? (If not, Athena SQL may be simpler)
Is Delta/Iceberg table format required? (Not supported for write on FSx S3 AP)
What is the job frequency? (10 jobs/day = $15/month; 100 jobs/day = $150/month)
Is there an existing EMR or Glue investment? (Leverage existing IAM roles and scripts)

Governance Impact

Capability	EMR Serverless	Notes
Authentication	IAM (execution role)	Standard AWS IAM
Authorization	S3 AP policy + IAM	No table/column-level control natively
Audit trail	CloudWatch Logs + CloudTrail	Job logs + S3 API calls logged
Data classification	❌ None built-in	Can integrate with Lake Formation for reads
Row/column security	❌ None built-in	Apply at read layer (Athena + LF)
Catalog integration	⚠️ Optional (Glue Catalog)	Can register output in Glue for downstream governance

Governance model: EMR Serverless uses IAM + S3 AP policy for access control. For enterprise governance, write results back to FSxN and read them via Athena + Lake Formation (Part 6). This gives you Spark's processing power with Lake Formation's governance on the output.

Recommended pattern for governed ETL:

FSxN (raw) → EMR Spark (transform) → FSxN (gold) → Athena + Lake Formation (governed read)

AI Readiness Score

Pattern	Governance	Performance	AI Capability	Cost	Operational Simplicity	Overall
EMR Serverless Spark	★★☆☆☆	★★★★☆	★★★☆☆	★★★☆☆	★★★☆☆	3.0
Athena + Lake Formation	★★★★★	★★★☆☆	★★☆☆☆	★★★★☆	★★★★☆	3.6
DuckDB Lambda	★☆☆☆☆	★★★★☆	★☆☆☆☆	★★★★★	★★★★★	3.2
Snowflake External Table	★★★★☆	★★☆☆☆	★★★★☆	★★★☆☆	★★★★☆	3.4

Governance: Access control, audit, classification capabilities
Performance: Processing throughput for ETL workloads
AI Capability: Built-in ML/AI integration (Spark MLlib, etc.)
Cost: Total cost for batch ETL workloads
Operational Simplicity: Setup and maintenance effort

Scoring methodology: Each dimension rated by the author based on validated evidence. EMR scores highest on Performance and AI Capability (Spark MLlib, distributed ML) but lower on Governance (IAM-only) and Simplicity (requires Spark expertise).

Cost Analysis

Component	Cost
EMR Serverless (37s job)	~$0.05
FSx for ONTAP (existing)	$0 incremental
S3 AP requests	$0 (included in FSx)
Script storage (S3)	< $0.01

Monthly estimate (10 jobs/day):

300 jobs × $0.05 = $15/month
Zero idle cost (application stopped between jobs)

Compare with:

EMR on EC2 (m5.xlarge cluster): ~$200/month (always-on)
Glue ETL (same workload): ~$0.44/job × 300 = $132/month
DuckDB Lambda: ~$1.10/month (but no distributed processing)
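
The comparison above can be tabulated with a short script. This is a sketch that uses only the per-job and monthly figures quoted in this section; the break-even calculation is an added illustration and assumes the always-on cluster cost stays flat regardless of job count.

```python
JOBS_PER_MONTH = 10 * 30  # 10 jobs/day

# Per-job prices quoted in this section (USD)
per_job = {"EMR Serverless": 0.05, "Glue ETL": 0.44}
monthly = {name: round(cost * JOBS_PER_MONTH, 2) for name, cost in per_job.items()}

# Flat monthly figures quoted in this section (USD)
monthly["EMR on EC2 (always-on)"] = 200.00
monthly["DuckDB Lambda"] = 1.10

for name, usd in sorted(monthly.items(), key=lambda item: item[1]):
    print(f"{name:<24} ${usd:>7.2f}/month")

# Jobs per month at which EMR Serverless would match the always-on cluster
break_even_jobs = round(monthly["EMR on EC2 (always-on)"] / per_job["EMR Serverless"])
print(f"Break-even vs always-on cluster: {break_even_jobs} jobs/month")
```

At these prices the serverless option stays cheaper until the workload reaches roughly 4,000 jobs of this size per month.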

The PySpark Job

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
import time

spark = SparkSession.builder.appName("FSxN-S3AP-Verification").getOrCreate()

S3_AP = "s3://<your-ap-alias-ext-s3alias>"

# --- Read ---
start = time.time()
df = spark.read.parquet(f"{S3_AP}/sensor-data/sensor_data_microsecond.parquet")
row_count = df.count()
print(f"Read: {row_count} rows in {time.time()-start:.2f}s")

# --- Transform: GROUP BY ---
start = time.time()
agg_df = df.groupBy("status").agg(
    F.count("*").alias("count"),
    F.avg("temperature").alias("avg_temp"),
    F.avg("humidity").alias("avg_humidity")
)
agg_df.show()
print(f"GROUP BY: {time.time()-start:.2f}s")

# --- Transform: Window function ---
start = time.time()
window_spec = Window.partitionBy("device_id").orderBy("timestamp").rowsBetween(-5, 0)
window_df = df.withColumn("moving_avg_temp", F.avg("temperature").over(window_spec))
window_df.select("device_id", "timestamp", "temperature", "moving_avg_temp").show(5)
print(f"Window: {time.time()-start:.2f}s")

# --- Write-back ---
start = time.time()
agg_df.write.mode("overwrite").parquet(f"{S3_AP}/gold/emr_spark_output/")
print(f"Write-back: {time.time()-start:.2f}s")

spark.stop()
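
The repeated `start = time.time()` pattern in the job can be factored into a small context manager. This is a sketch with a hypothetical helper name (`timed`); it is not part of the published script. Because Spark evaluates lazily, a timer only captures work when the block contains an action such as `count()`, `show()`, or `write`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f}s")


# Stand-in workload; in the job this would be e.g. `row_count = df.count()`
timings = {}
with timed("Example step", timings):
    total = sum(range(1000))
```

Collecting the durations in a dictionary makes it easy to log the full breakdown once at the end of the job.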

Deploy and Run

# 1. Create EMR Serverless application
aws emr-serverless create-application \
  --name "fsxn-spark" \
  --release-label "emr-7.1.0" \
  --type "SPARK" \
  --region ap-northeast-1

# 2. Upload script to S3 (regular bucket, not S3 AP)
aws s3 cp scripts/spark_verification.py \
  s3://my-scripts-bucket/emr-scripts/

# 3. Submit job
aws emr-serverless start-job-run \
  --application-id <app-id> \
  --execution-role-arn arn:aws:iam::<ACCOUNT_ID>:role/emr-serverless-role \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://my-scripts-bucket/emr-scripts/spark_verification.py"
    }
  }'

# 4. Check status
aws emr-serverless get-job-run \
  --application-id <app-id> \
  --job-run-id <job-run-id>

# 5. Stop application (zero cost when stopped)
aws emr-serverless stop-application --application-id <app-id>
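
The same submission can be scripted from Python. The sketch below only builds the request; the boto3 calls are shown as comments because they need credentials and a real application ID. `build_job_run_request` is a hypothetical helper, and the `-s3alias` suffix check is a heuristic for catching a script path that points at an access point alias instead of a regular bucket.

```python
def build_job_run_request(application_id, execution_role_arn, entry_point):
    """Build keyword arguments for the EMR Serverless StartJobRun call."""
    if not entry_point.startswith("s3://"):
        raise ValueError("entry point must be an s3:// URI")
    bucket = entry_point[len("s3://"):].split("/", 1)[0]
    # Heuristic: S3 AP aliases end in "-s3alias"; the script must live in a regular bucket
    if bucket.endswith("-s3alias"):
        raise ValueError("entry point must be on a regular S3 bucket, not an S3 AP alias")
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {"sparkSubmit": {"entryPoint": entry_point}},
    }


# Job run states after which polling can stop
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}

request = build_job_run_request(
    "<app-id>",
    "arn:aws:iam::<ACCOUNT_ID>:role/emr-serverless-role",
    "s3://my-scripts-bucket/emr-scripts/spark_verification.py",
)

# With boto3 installed and credentials configured (not executed here):
#   import boto3
#   client = boto3.client("emr-serverless", region_name="ap-northeast-1")
#   run = client.start_job_run(**request)
#   state = client.get_job_run(
#       applicationId=request["applicationId"], jobRunId=run["jobRunId"]
#   )["jobRun"]["state"]
#   # poll get_job_run until state is in TERMINAL_STATES
```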

Known Failure Signatures

Symptom	Likely cause	Next step
`IllegalArgumentException: Invalid S3 URI`	Using `s3a://` instead of `s3://`	Switch to EMRFS (`s3://`) prefix
`Illegal Parquet type: INT64 (TIMESTAMP(NANOS))`	Nanosecond timestamps in Parquet	Regenerate with microsecond resolution
Job stuck in PENDING > 60s	EMR Serverless capacity	Check service quotas; retry
`AccessDeniedException` on S3 AP	IAM role missing AP permissions	Add S3 AP ARN to execution role policy
Script not found	Script on S3 AP instead of regular S3	Move script to regular S3 bucket
Write fails with 501	Attempting Delta/Iceberg write	Use flat Parquet write only

Gotchas and Lessons

1. Script must be on regular S3 (not S3 AP)

EMR Serverless loads the PySpark script from S3. The script location must be a regular S3 bucket, not an FSx S3 AP. The script then reads/writes data from/to the S3 AP.

2. IAM role needs both S3 bucket and S3 AP permissions

{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
  "Resource": [
    "arn:aws:s3:::my-scripts-bucket",
    "arn:aws:s3:::my-scripts-bucket/*",
    "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/<ap-name>",
    "arn:aws:s3:ap-northeast-1:<ACCOUNT_ID>:accesspoint/<ap-name>/object/*"
  ]
}
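
For teams that generate policies in code, the statement can be built programmatically. This is a sketch under stated assumptions: `build_execution_policy` is a hypothetical helper, and splitting the actions into two statements is a design choice (list actions apply to the bucket and access point ARNs, object actions to the object ARNs) and not the form used in the repository.

```python
import json


def build_execution_policy(scripts_bucket, region, account_id, ap_name):
    """Return an IAM policy document covering the scripts bucket and the S3 AP."""
    ap_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{ap_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket is evaluated against the bucket / access point itself
                "Sid": "ListBucketAndAccessPoint",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{scripts_bucket}", ap_arn],
            },
            {
                # Object reads and writes are evaluated against object ARNs
                "Sid": "ReadWriteObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{scripts_bucket}/*", f"{ap_arn}/object/*"],
            },
        ],
    }


policy = build_execution_policy(
    "my-scripts-bucket", "ap-northeast-1", "<ACCOUNT_ID>", "<ap-name>"
)
print(json.dumps(policy, indent=2))
```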

3. Cold start is ~20 seconds

EMR Serverless has a cold start of ~20 seconds before Spark begins executing. For latency-sensitive workloads, keep the application in "started" state (costs ~$0.01/hour for pre-initialized capacity).

4. No session policy issues

Unlike Databricks and Snowflake, EMR Serverless uses direct IAM role credentials without intermediary session policies. The S3 AP ARN format works natively.

When to Use EMR Serverless vs Other Engines

Requirement	EMR Serverless	Athena	DuckDB Lambda	Glue ETL
Read-only SQL	✅	✅ Best	✅	✅
Write-back to FSxN	✅ Best	✅ (CTAS)	✅	✅
Complex Spark transformations	✅ Best	❌	❌	✅
Sub-second latency	❌ (cold start)	❌	✅ Best	❌
Zero idle cost	✅	✅	✅	✅
Large-scale distributed	✅ Best	✅	❌	✅

What's Next

Part 6: Redshift Spectrum + Lake Formation — for teams that need DWH-integrated analytics with enterprise governance (4-layer authorization) on NAS data
Part 7: Table Format Boundaries — why Delta, Iceberg, and Hudi can't write to FSx S3 AP, and what flat Parquet patterns work instead

Previously in this series:

Part 1: Athena — Query NAS Data In Place
Part 2: Databricks — A Layer-by-Layer Validation of Observed Boundaries
Part 3: Snowflake — From 'Access Denied' to Working External Tables
Part 4: DuckDB Lambda — Serverless Analytics for $0.00001/Query

References

Key achievement: This validation established that EMR Serverless Spark provides the most capable read-write ETL path for FSx for ONTAP S3 AP data — full Spark SQL, UDFs, window functions, and write-back in 16 seconds of Spark execution at $0.05/job. No cluster management, no data copy, no session policy issues. The trade-off is cold start latency (20s) and lack of built-in governance — pair with Athena + Lake Formation for governed reads on the output.

All benchmarks are from a specific test environment (EMR Serverless emr-7.1.0, FSx for ONTAP Single-AZ 128 MB/s, ap-northeast-1). Scale throughput provisioning for production workloads.

Disclaimer: This article is an independent validation report and does not represent AWS or NetApp official guidance. Product behavior and platform capabilities may change. Always validate in your own environment.

Serverless Analytics on NAS Data for $0.00001/Query — DuckDB Lambda FSx for ONTAP

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Tue, 26 May 2026 16:55:30 +0000

TL;DR

In Part 1, Athena worked cleanly. In Part 2, Databricks hit session policy boundaries. In Part 3, Snowflake works with AWS_ACCESS_POINT_ARN config. This Part 4 shows the cheapest path: $0.00001/query.

You can deploy DuckDB inside a Lambda function (arm64, 1024 MB) and query Parquet files on FSx for ONTAP via S3 Access Points. Warm queries return in 452ms for 10,000 rows. Cold start is ~1.9s. Cost per query: approximately $0.00001.

No database server. No cluster. No idle cost. Just a Lambda function with an 18 MB layer.

Quick Decision Guide:

Cheapest ad-hoc analytics on NAS data → DuckDB Lambda ($1.10/month for 1000 queries/day)
Need governance / catalog → Use Athena + Lake Formation instead
Need distributed processing > 10 GB → Use EMR Serverless (Part 5)

GitHub: fsxn-lakehouse-integrations/integrations/duckdb/

How to Read This Article

This article is:

A reproduction-focused validation report
Evidence from one environment (Lambda arm64, 1024 MB, ap-northeast-1)
A deployment guide for DuckDB + Lambda + FSx for ONTAP S3 AP

Read by role:

Developer / Data engineer: Architecture → Deploy in 5 Minutes → Handler config
Cost-conscious team lead: Cost Analysis → When to Use
Partner / SA: Partner Decision Card → Discovery Questions
Security reviewer: Governance Impact → When Not To Use

Prerequisite Concepts

Before reading this article, it helps to understand:

DuckDB — an in-process SQL engine (like SQLite for analytics) that runs inside your application
httpfs extension — DuckDB's HTTP/S3 file system extension for reading remote Parquet files
S3 Access Point alias — the *-ext-s3alias hostname that FSx for ONTAP S3 AP exposes
Lambda arm64 (Graviton2) — AWS Lambda on ARM architecture, ~20% cheaper than x86
Path-style S3 access — required for S3 AP aliases (s3_url_style = 'path')

Why DuckDB + Lambda + FSx for ONTAP?

Traditional approach	This approach
Provision Redshift/EMR cluster	Deploy Lambda function
Pay for idle compute	Pay only when queried
Copy data from NAS to S3	Query NAS data in place
Manage infrastructure	Zero infrastructure
Minutes to first query	Sub-second (warm)

DuckDB is an in-process SQL engine — think "SQLite for analytics." It runs inside your Lambda function, reads Parquet directly from S3 via the httpfs extension, and returns results. No external database, no connection pooling, no cluster scaling.

Architecture

Client (API Gateway / SDK / CLI)
    │
    ▼
Lambda Function (arm64, Python 3.12, 1024 MB)
    │
    ├── DuckDB (in-process, 18 MB layer)
    │   └── httpfs extension (S3 access)
    │
    └── S3 Access Point (internet-origin)
            │
            └── FSx for ONTAP Volume
                ├── sensor_data.parquet (10K rows)
                └── sensor_data_large.parquet (5M rows)

No VPC attachment needed (internet-origin AP). This avoids the ~1-2s ENI cold start penalty.

Lambda function: arm64 (Graviton2), Python 3.12, 1024 MB memory, DuckDB httpfs layer attached.

Benchmark Results

Test	Latency	Notes
Cold start (simple query)	1,854 ms	httpfs INSTALL + credential setup
Warm (simple query)	0.96 ms	Connection reused
Warm COUNT(*) 10K rows	452 ms	S3 AP → Parquet → result
Warm GROUP BY 10K rows	1,411 ms	Full scan + aggregation
Local COUNT(*) 5M rows	779 ms	For comparison (over internet)
Local COPY TO Parquet	304 ms	Write-back to FSxN

Environment: Lambda arm64, 1024 MB, ap-northeast-1. FSx for ONTAP Single-AZ, 128 MB/s.

Storage context (Yakio-san lens): The 10K-row Parquet file is ~250 KB. At 128 MB/s provisioned throughput, a single query consumes negligible bandwidth (<0.2% of capacity). NFS/SMB workloads sharing the same file system are not impacted. For concurrent Lambda invocations, throughput becomes a factor at ~500+ simultaneous queries reading large files.
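
The bandwidth figures above follow from simple arithmetic. The sketch below is a back-of-the-envelope model only: it assumes each query reads the whole 250 KB file once per second and ignores caching, request overhead, and competing NFS/SMB traffic.

```python
FILE_KB = 250            # approximate size of the 10K-row Parquet file
THROUGHPUT_MB_S = 128    # provisioned FSx for ONTAP throughput

file_mb = FILE_KB / 1024
# Share of one second of provisioned throughput used by a single read
share = file_mb / THROUGHPUT_MB_S
# Number of such reads per second needed to fill the provisioned throughput
reads_to_saturate = THROUGHPUT_MB_S / file_mb

print(f"single query: {share:.2%} of capacity")
print(f"reads per second to saturate: {reads_to_saturate:.0f}")
```

Under these assumptions, saturation sits in the low hundreds of reads per second for this small file, and is reached much sooner with larger files.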

Key insight: Lambda warm queries (452ms) are faster than local queries (628ms) because Lambda runs in the same region as FSxN — lower network latency.

Lambda test invocation: cold start 2071ms, query returns 3 status groups (normal: 8505, warning: 1221, critical: 274). Memory used: 185 MB.

Evidence Matrix

Layer	Evidence	Result	Interpretation
Lambda deployment	CloudFormation stack	✅ Pass	Function + Layer deployed
DuckDB initialization	httpfs INSTALL + LOAD	✅ Pass	Extension loads in ~1.8s cold
S3 AP connectivity	read_parquet() via httpfs	✅ Pass	Path-style + endpoint config works
Read (small)	COUNT(*) 10K rows	✅ Pass	452ms warm
Read (aggregation)	GROUP BY + AVG	✅ Pass	1,411ms warm
Read (large)	COUNT(*) 5M rows	✅ Pass	779ms (local test)
Write-back	COPY TO Parquet	✅ Pass	304ms write to S3 AP
IAM authorization	Lambda execution role	✅ Pass	s3:GetObject/PutObject on AP ARN

Deploy in 5 Minutes

1. Build the Layer (Docker required)

cd integrations/duckdb
docker run --rm --platform linux/arm64 --entrypoint bash \
  -v "$(pwd)/dist:/output" \
  public.ecr.aws/lambda/python:3.12-arm64 \
  -c "dnf install -y zip > /dev/null 2>&1 && \
      pip install duckdb==1.1.3 --target /tmp/python/lib/python3.12/site-packages/ --quiet && \
      cd /tmp && zip -qr /output/duckdb-layer.zip python/"

2. Deploy

./deploy.sh --region ap-northeast-1

This builds the layer, uploads it to S3, deploys the CloudFormation stack (Lambda + IAM + Layer), and runs a test invocation.

3. Query

aws lambda invoke \
  --function-name fsxn-duckdb-query \
  --payload '{"query": "SELECT status, COUNT(*) FROM read_parquet('\''s3://{S3_AP}/sensor-data/sensor_data.parquet'\'') GROUP BY status"}' \
  --cli-binary-format raw-in-base64-out \
  response.json && cat response.json | jq .

The {S3_AP} placeholder is automatically replaced with your S3 AP alias from the Lambda environment variable.
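
The substitution can be pictured as a plain string replacement. This is a sketch, not the code from lambda/handler.py: the environment variable name `S3_AP_ALIAS` and the function name are assumptions made for illustration.

```python
import os


def resolve_query(query, env=None):
    """Replace the {S3_AP} placeholder with the alias from the environment."""
    env = os.environ if env is None else env
    alias = env["S3_AP_ALIAS"]  # hypothetical variable name
    return query.replace("{S3_AP}", alias)


example_env = {"S3_AP_ALIAS": "my-ap-alias-ext-s3alias"}
resolved = resolve_query(
    "SELECT COUNT(*) FROM read_parquet('s3://{S3_AP}/sensor-data/sensor_data.parquet')",
    env=example_env,
)
print(resolved)
```

Keeping the alias out of the payload means callers never need to know the access point name.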

The Handler (Key Configuration)

Three settings are critical for DuckDB + S3 AP in Lambda:

# 1. Lambda has no home directory
conn.execute("SET home_directory = '/tmp';")

# 2. S3 AP aliases require path-style access
conn.execute("SET s3_url_style = 'path';")

# 3. Explicit endpoint for AP alias resolution
conn.execute("SET s3_endpoint = 's3.ap-northeast-1.amazonaws.com';")

Without these, you'll get:

Can't find the home directory (missing #1)
Unknown error for HTTP HEAD (missing #2 or #3)

Full handler: lambda/handler.py
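
To keep the three settings together and derive the endpoint from the region, they can be generated by one function. This is a sketch with a hypothetical helper name (`s3_ap_settings`); it only builds the statements, and the DuckDB call that would run them is shown as a comment.

```python
def s3_ap_settings(region, home_directory="/tmp"):
    """Return the DuckDB SET statements needed for S3 AP access from Lambda."""
    return [
        # Lambda has no home directory; /tmp is the writable path
        f"SET home_directory = '{home_directory}';",
        # S3 AP aliases require path-style access
        "SET s3_url_style = 'path';",
        # Regional endpoint used for AP alias resolution
        f"SET s3_endpoint = 's3.{region}.amazonaws.com';",
    ]


statements = s3_ap_settings("ap-northeast-1")
for statement in statements:
    print(statement)

# In the handler, with an open DuckDB connection `conn`:
#   for statement in s3_ap_settings(region):
#       conn.execute(statement)
```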

Comparison with Other Engines in This Series

Aspect	DuckDB Lambda	Athena (Part 1)	Snowflake (Part 3)	EMR Spark (Part 5)
Query latency (10K rows)	452ms (warm)	~2s	~3s	6.78s
Cold start	1.9s	~2s	N/A (warehouse)	20s
Cost per query	$0.00001	$5/TB scanned	Credits	$0.05/job
Write-back to FSxN	✅ COPY TO	✅ CTAS	⚠️ TBD	✅ Spark write
Governance / Catalog	❌ None	✅ Glue + LF	✅ Tags + RBAC	⚠️ IAM only
Max dataset size	~1 GB (Lambda limit)	Unlimited	Unlimited	Unlimited
Distributed processing	❌ Single process	✅	✅	✅

Partner Decision Card

Customer requirement	DuckDB Lambda today	Recommended path
Cheapest ad-hoc analytics	✅ Best ($1.10/month)	Deploy DuckDB Lambda
API-driven analytics (behind API GW)	✅ Best (sub-second warm)	Deploy with API Gateway
IoT / edge data quick analysis	✅ Good fit	Deploy DuckDB Lambda
Need governance / audit trail	❌ No built-in governance	Use Athena + Lake Formation
Dataset > 10 GB	❌ Lambda memory limit	Use EMR Serverless or Athena
Need JOINs with DWH tables	❌ Isolated engine	Use Redshift Spectrum
Need catalog integration	❌ No Glue/catalog support	Use Athena or Redshift
Write-back (small files)	✅ COPY TO works	DuckDB Lambda for small writes

Discovery Questions for Partners

When a customer asks about DuckDB Lambda + FSx for ONTAP S3 Access Points:

What is the typical dataset size per query? (DuckDB Lambda works best < 1 GB)
How many concurrent queries are expected? (Lambda scales horizontally but each invocation is isolated)
Is governance / audit trail required? (DuckDB has none — consider Athena + Lake Formation)
Is the workload ad-hoc or scheduled? (Lambda excels at sporadic, event-driven queries)
Does the team need SQL JOINs with other data sources? (DuckDB is isolated — no cross-source JOINs)
Is sub-second latency required? (Warm DuckDB Lambda delivers; cold start adds ~1.9s)
Is there an existing analytics platform? (If yes, DuckDB Lambda may be redundant)
What is the budget tolerance? (DuckDB Lambda is the cheapest option in this series)

Governance Impact

Capability	DuckDB Lambda	Notes
Authentication	IAM (Lambda execution role)	Standard AWS IAM
Authorization	S3 AP policy + IAM	No table/column-level control
Audit trail	CloudWatch Logs + CloudTrail	Query text logged if configured
Data classification	❌ None	No tagging or masking
Row/column security	❌ None	All-or-nothing file access
Catalog integration	❌ None	No Glue, no schema registry

Governance model: DuckDB Lambda relies entirely on IAM + S3 AP policy for access control. There is no built-in governance layer. For regulated workloads requiring table-level access control, column masking, or audit trails, use Athena + Lake Formation (Part 1 + Part 6) or Snowflake External Tables (Part 3).

AI Readiness Score

Pattern	Governance	Performance	AI Capability	Cost	Operational Simplicity	Overall
DuckDB Lambda	★☆☆☆☆	★★★★☆	★☆☆☆☆	★★★★★	★★★★★	3.2
Athena + Lake Formation	★★★★★	★★★☆☆	★★☆☆☆	★★★★☆	★★★★☆	3.6
Snowflake External Table	★★★★☆	★★☆☆☆	★★★★☆	★★★☆☆	★★★★☆	3.4
EMR Serverless Spark	★★☆☆☆	★★★★☆	★★★☆☆	★★★☆☆	★★★☆☆	3.0

Governance: Access control, audit, classification capabilities
Performance: Query latency for typical workloads
AI Capability: Built-in ML/AI integration
Cost: Total cost of ownership for low-frequency workloads
Operational Simplicity: Setup and maintenance effort

Scoring methodology: Each dimension rated by the author based on validated evidence. This is not an official AWS assessment. DuckDB Lambda scores highest on Cost and Simplicity but lowest on Governance and AI — it's the "quick and cheap" option, not the "governed enterprise" option.

Cost Analysis

Component	Monthly Cost (1000 queries/day)
Lambda invocations	~$0.60
Lambda compute (1024 MB × 1s avg)	~$0.50
FSx for ONTAP (128 MB/s, existing)	$0 incremental
S3 AP requests	$0 (included in FSx)
Total	~$1.10/month

Compare with:

Redshift Serverless (8 RPU): ~$2.88/hour when active
Athena: $5/TB scanned (but no idle cost)
EMR Serverless: ~$0.05/job (but 20s cold start per job)

DuckDB Lambda is the cheapest option for ad-hoc, low-frequency analytics on FSxN data.

When to Use (and When Not To)

Use DuckDB Lambda when:

Ad-hoc queries on < 1 GB datasets
API-driven analytics (behind API Gateway)
Cost is the primary concern
No existing analytics infrastructure
Edge/IoT data analysis
Databricks customer waiting for UC + FSx for ONTAP S3 AP support: Quick NAS data validation without spinning up a Databricks cluster or copying data to S3. Use as a lightweight bridge until Databricks UC natively supports FSx for ONTAP S3 Access Points.

Don't use when:

Datasets > 10 GB (Lambda memory/timeout limits)
High concurrency (> 100 concurrent queries)
Need JOINs with DWH tables (use Redshift Spectrum)
Need governance/catalog integration (use Athena + Lake Formation)
Need Delta/Iceberg table format (not supported on FSxN S3 AP)

Known Failure Signatures

Symptom	Likely cause	Next step
`Can't find the home directory`	Missing `SET home_directory = '/tmp'`	Add to handler initialization
`Unknown error for HTTP HEAD`	Missing path-style or endpoint config	Set `s3_url_style = 'path'` and `s3_endpoint`
`HTTP 403 Forbidden`	IAM role missing S3 AP permissions	Add `s3:GetObject` on AP ARN to execution role
Timeout (15 min Lambda limit)	Dataset too large for Lambda memory	Increase memory or use EMR Serverless
`Out of Memory`	Dataset exceeds Lambda memory	Reduce query scope or increase to 10 GB memory
Cold start > 3s	Layer too large or extension install slow	Pre-install httpfs in layer (avoid runtime INSTALL)

Local Development

You can also run DuckDB locally without Lambda:

import boto3
import duckdb

# Resolve credentials from the default boto3 chain (env vars, profile, SSO, ...)
session = boto3.Session(region_name='ap-northeast-1')
creds = session.get_credentials().get_frozen_credentials()

conn = duckdb.connect(':memory:')
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute("SET s3_region = 'ap-northeast-1';")
conn.execute(f"SET s3_access_key_id = '{creds.access_key}';")
conn.execute(f"SET s3_secret_access_key = '{creds.secret_key}';")
# Long-term IAM user keys carry no session token; only set it when present
if creds.token:
    conn.execute(f"SET s3_session_token = '{creds.token}';")
# S3 AP aliases require path-style access
conn.execute("SET s3_url_style = 'path';")

result = conn.execute("""
    SELECT status, COUNT(*), AVG(temperature)
    FROM read_parquet('s3://<your-ap-alias>/sensor-data/sensor_data.parquet')
    GROUP BY status
""").fetchall()
print(result)

What's Next

Part 5: EMR Spark — Read-Write ETL on NAS Data — for teams that need distributed Spark processing with write-back capability and larger-than-memory datasets
Part 6: Redshift Spectrum + Lake Formation — for teams that need DWH-integrated analytics with enterprise governance on NAS data
Part 7: Table Format Boundaries — why Delta, Iceberg, and Hudi can't write to FSx for ONTAP S3 AP, and what works instead

Future exploration: DuckDB's iceberg extension may enable reading pre-existing Iceberg tables on FSx for ONTAP S3 AP (metadata + data files accessed via GetObject). Not validated in this article.

Previously in this series:

Part 1: Athena — Query NAS Data In Place
Part 2: Databricks — A Layer-by-Layer Validation of Observed Boundaries
Part 3: Snowflake — From 'Access Denied' to Working External Tables

Key achievement: This validation established that DuckDB Lambda is the lowest-cost analytics path for FSx for ONTAP S3 AP data — $0.00001/query with 452ms warm latency. Zero infrastructure, zero idle cost, and full read-write capability on Parquet files. The trade-off is zero governance — for regulated workloads, pair with Athena + Lake Formation or use Snowflake External Tables.

All benchmarks are from a specific test environment (FSx for ONTAP Single-AZ, 128 MB/s, ap-northeast-1). Scale throughput provisioning for production workloads. Full evidence: verification-pack/duckdb-local/

Disclaimer: This article is an independent validation report and does not represent AWS or NetApp official guidance. Product behavior and platform capabilities may change. Always validate in your own environment.

Snowflake and FSx for ONTAP S3 Access Points — From 'Access Denied' to Working External Tables

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Mon, 25 May 2026 14:52:05 +0000

TL;DR

In Part 1, Athena worked cleanly. In Part 2, Databricks hit session policy boundaries. This Part 3 validates Snowflake's path — and it works.

Snowflake can query FSx for ONTAP S3 Access Point data — but only with the correct stage configuration. Without the AWS_ACCESS_POINT_ARN parameter, SELECT fails with "access denied" while LIST works. With it, the tested read, governance, and AI paths work: SELECT, External Tables, COPY INTO load, Directory Tables, governance tags, and 8 out of 10 Cortex AI functions work on FSx data (7 directly, 1 via COPY INTO for Cortex Search).

Configuration	LIST	SELECT	External Table	Cortex AI (text)	Vision AI
AP alias only (no ARN)	✅	❌ Access Denied	❌	❌	❌
AP alias + `AWS_ACCESS_POINT_ARN`	✅	✅	✅	✅ Direct	✅ Via staging

This appears to be a recurring integration pattern in this series: platforms that generate restrictive session policies need an explicit S3 Access Point ARN parameter so the generated policy includes the regional access point ARN.

Quick Decision Guide:

Zero-copy governed read on NAS data → External Table with AWS_ACCESS_POINT_ARN
Full AI + maximum query performance → COPY INTO internal table
RAG / semantic search over NAS documents → COPY INTO → Cortex Search Service (198ms)

GitHub Repository: fsxn-lakehouse-integrations

How to Read This Article

This article is:

A reproduction-focused validation report
Evidence from one environment (Snowflake Standard, ap-northeast-1)
A configuration guide for Snowflake + FSx for ONTAP S3 AP

Read by role:

Snowflake admin: Stage configuration → Working setup
Storage engineer: Evidence matrix → Root cause analysis
Data engineer: What works today → External Table setup
Partner / SA: Partner Decision Card → Architecture guidance
Security / governance reviewer: Governance Impact Summary → Regulated Workload Checklist
AI/ML engineer: AI / ML Integration Path → MLOps Boundary

Prerequisite Concepts

Before reading this article, it helps to understand:

Snowflake Storage Integration — an object that stores a reference to an IAM role for accessing external cloud storage
Snowflake External Stage — maps a cloud storage URL to a storage integration for data access
External Table — a Snowflake table that reads data directly from files on an external stage (no data copy)
AWS_ACCESS_POINT_ARN — a stage parameter that tells Snowflake to include the S3 Access Point ARN in its generated session policy
S3 Access Point ARN vs S3 bucket ARN — S3 AP uses arn:aws:s3:<region>:<account>:accesspoint/<name>, not arn:aws:s3:::<bucket>
Directory Table — a Snowflake feature that exposes file metadata (path, size, date) from a stage as a queryable table

Important premise: Snowflake does NOT officially document FSx for ONTAP S3 Access Points as a supported External Stage storage backend. The AWS_ACCESS_POINT_ARN parameter exists in Snowflake's CREATE STAGE documentation for S3 Access Points generally, but FSx for ONTAP S3 AP is not listed as a validated target. Our validation confirms that read and governance operations work when configured correctly, but this should not be interpreted as an officially supported configuration by Snowflake. Consult Snowflake Support before production use.

The Goal

Query structured and unstructured data stored on FSx for ONTAP from Snowflake — without copying data to a native S3 bucket. FSx for ONTAP S3 Access Points should make this possible by exposing NFS/SMB file data via S3 API.

In Part 1, Athena worked cleanly. In Part 2, Databricks required the access_point field and still has limitations. This article validates Snowflake's path.

Test Environment

Snowflake Account: Standard edition, AWS ap-northeast-1
Warehouse: COMPUTE_WH (X-Small)
Role: ACCOUNTADMIN
FSx for ONTAP: <FILE_SYSTEM_ID> (ONTAP 9.17.1)
SVM: <SVM_NAME>
S3 Access Point: Internet-origin, UNIX file system user

Scope: This article validates Snowflake Standard edition. Enterprise features (e.g., advanced governance, private connectivity) may provide additional capabilities not tested here.

The Setup

Snowflake accesses external data through a three-layer configuration:

Storage Integration (IAM Role ARN + trust)
    │
    └── External Stage (S3 URL + AWS_ACCESS_POINT_ARN + file format)
            │
            └── External Table / SELECT @stage (data access)

Visual Story: Before and After

❌ Before: SELECT Fails Without `AWS_ACCESS_POINT_ARN`

CREATE OR REPLACE STAGE fsxn_stage_without_arn
  STORAGE_INTEGRATION = fsxn_verification_integration
  URL = 's3://<ap-alias>/'
  FILE_FORMAT = (TYPE = PARQUET);

LIST @fsxn_stage_without_arn/sensor-data/;   -- ✅ Works
SELECT $1 FROM @fsxn_stage_without_arn/sensor-data/sensor_data.parquet LIMIT 3;  -- ❌ Access Denied

"Failed to access remote file: access denied. Please check your credentials." — The same file that LIST found cannot be read.

✅ After: SELECT Succeeds With `AWS_ACCESS_POINT_ARN`

CREATE OR REPLACE STAGE fsxn_stage_with_arn
  STORAGE_INTEGRATION = fsxn_verification_integration
  URL = 's3://<ap-alias>/'
  AWS_ACCESS_POINT_ARN = 'arn:aws:s3:<region>:<account>:accesspoint/<ap-name>'
  FILE_FORMAT = (TYPE = PARQUET);

SELECT $1 FROM @fsxn_stage_with_arn/sensor-data/sensor_data.parquet LIMIT 3;  -- ✅ SUCCESS

Result: 3 rows of sensor data returned successfully.

{"humidity": 32.2, "id": 1, "pressure": 1002.1, "sensor_id": "S004", "status": "normal", "temperature": 21.13}
{"humidity": 45.63, "id": 2, "pressure": 1004.13, "sensor_id": "S005", "status": "normal", "temperature": 23.07}
{"humidity": 42.79, "id": 3, "pressure": 1000.18, "sensor_id": "S003", "status": "normal", "temperature": 36.96}

✅ External Table Also Works

CREATE OR REPLACE EXTERNAL TABLE fsxn_sensor_ext_table
  LOCATION = @fsxn_stage_with_arn/sensor-data/
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = FALSE;

SELECT * FROM fsxn_sensor_ext_table LIMIT 3;  -- ✅ SUCCESS (3 rows)
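
Because the working and failing stages differ only in the `AWS_ACCESS_POINT_ARN` line, it can help to generate the stage DDL from a template that refuses a bucket-style ARN. The sketch below is illustrative only: `build_stage_sql` and `AP_ARN_RE` are hypothetical names, the regex assumes the standard `aws` partition and a 12-digit account ID, and the function only renders SQL text without connecting to Snowflake.

```python
import re

# Regional S3 Access Point ARN: arn:aws:s3:<region>:<account>:accesspoint/<name>
# Assumption: standard "aws" partition and a 12-digit account ID.
AP_ARN_RE = re.compile(r"arn:aws:s3:[a-z0-9-]+:\d{12}:accesspoint/[a-z0-9-]+")


def build_stage_sql(stage: str, integration: str, ap_alias: str, ap_arn: str) -> str:
    """Render a CREATE STAGE statement that includes AWS_ACCESS_POINT_ARN."""
    if not AP_ARN_RE.fullmatch(ap_arn):
        raise ValueError(f"not a regional S3 Access Point ARN: {ap_arn!r}")
    return (
        f"CREATE OR REPLACE STAGE {stage}\n"
        f"  STORAGE_INTEGRATION = {integration}\n"
        f"  URL = 's3://{ap_alias}/'\n"
        f"  AWS_ACCESS_POINT_ARN = '{ap_arn}'\n"
        f"  FILE_FORMAT = (TYPE = PARQUET);"
    )


if __name__ == "__main__":
    # Placeholder values; substitute your own alias and ARN.
    print(build_stage_sql(
        "fsxn_stage_with_arn",
        "fsxn_verification_integration",
        "example-alias-ext-s3alias",
        "arn:aws:s3:ap-northeast-1:123456789012:accesspoint/example-ap",
    ))
```

Rejecting `arn:aws:s3:::<bucket>` up front catches the most likely copy-paste mistake before the stage is ever created.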

Complete Capability Matrix

Capability	Status	Notes
Read operations
SELECT from `@stage` (Parquet)	✅ Verified	GetObject with `AWS_ACCESS_POINT_ARN`
SELECT from `@stage` (CSV)	✅ Verified	CSV with SKIP_HEADER works
SELECT from `@stage` (JSON)	✅ Expected	Same GetObject path (no JSON files in test data)
External Table (read)	✅ Verified	CREATE + SELECT both succeed
LIST `@stage` (all prefixes)	✅ Verified	Subdirectories included
GET_PRESIGNED_URL	✅ Observed	Works but not officially supported
Load operations
COPY INTO (stage → table)	✅ Verified	4.9s for Parquet load
Governance
Governance Tags on External Table	✅ Verified	CREATE TAG + ALTER TABLE SET TAG
SYSTEM$GET_TAG	✅ Verified	Tag retrieval works
Row Access Policy	✅ Expected	Standard Snowflake feature on tables
Column Masking	✅ Expected	Standard Snowflake feature on tables
Write operations
PutObject (via COPY INTO unload)	⚠️ TBD	FSx S3 AP supports PutObject ≤5GB
Event-driven
Snowpipe (auto-ingest)	❌ Not possible	S3 Event Notifications not supported on FSx S3 AP
AUTO_REFRESH on External Table	❌ Not possible	Requires S3 Event Notifications
Transactional table formats
Iceberg Table read (pre-existing metadata)	⚠️ TBD	Requires separate validation
Iceberg Table write-back	❌ Not suitable	Conditional writes not supported on FSx for ONTAP S3 AP. For Iceberg, use Snowflake Managed Iceberg Table on standard S3 (COPY INTO from FSx for ONTAP External Stage → Iceberg table on S3). External engines (Spark, Athena) can then read the same Iceberg table.
Delta / Hudi write	❌ Not suitable	Conditional writes not supported
Supported file formats
Parquet	✅ Verified	Primary format for analytics
CSV	✅ Verified	With header skip, delimiter options
JSON	✅ Expected	Same read path as Parquet/CSV
Avro	✅ Expected	Snowflake-supported format, same read path
ORC	✅ Expected	Snowflake-supported format, same read path

Key insight: With AWS_ACCESS_POINT_ARN, Snowflake achieves broad read and governance integration for the tested paths. The limitations in this matrix are event-driven features (Snowpipe, AUTO_REFRESH) and transactional table-format writes (Iceberg, Delta, Hudi); both stem from FSx S3 AP API limitations, not Snowflake limitations.

The Root Cause: Session Policy ARN Mismatch

When Snowflake performs sts:AssumeRole, it applies a session policy. Without AWS_ACCESS_POINT_ARN, this session policy uses standard S3 bucket ARN patterns that don't match the FSx S3 AP regional ARN format:

Without AWS_ACCESS_POINT_ARN:
  Session policy allows GetObject on: arn:aws:s3:::*/*
  FSx S3 AP actual ARN:              arn:aws:s3:<region>:<account>:accesspoint/<name>/object/*
  → NO MATCH → AccessDenied

With AWS_ACCESS_POINT_ARN:
  Session policy includes:            arn:aws:s3:<region>:<account>:accesspoint/<name>/*
  → MATCH → GetObject succeeds

This is the same pattern as Databricks Unity Catalog's access_point field — both platforms need the S3 AP ARN explicitly specified to include it in the generated session policy.
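
The mismatch can be reproduced with plain wildcard matching. This is a simplified model, not Snowflake's or IAM's actual evaluation logic: it takes the resource patterns described above as given, treats `*` as matching any run of characters, and uses placeholder region, account, and access point values. `policy_allows` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase


def policy_allows(resource_patterns, resource_arn):
    """True if any pattern matches the ARN ('*' spans any characters)."""
    return any(fnmatchcase(resource_arn, p) for p in resource_patterns)


# Placeholder values for illustration only.
object_arn = (
    "arn:aws:s3:ap-northeast-1:123456789012:"
    "accesspoint/my-ap/object/sensor-data/sensor_data.parquet"
)
without_arn = ["arn:aws:s3:::*/*"]
with_arn = without_arn + [
    "arn:aws:s3:ap-northeast-1:123456789012:accesspoint/my-ap/*"
]

print(policy_allows(without_arn, object_arn))  # bucket-style pattern only
print(policy_allows(with_arn, object_arn))     # access point pattern added
```

The bucket-style pattern has empty region and account fields (`:::`), so it can never match an access point ARN that carries both, no matter how broad the trailing wildcard is.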

Support Confirmation (May 2026): Snowflake Support confirmed this resolution. The original issue (LIST works, SELECT fails with "access denied") is resolved by adding the AWS_ACCESS_POINT_ARN parameter to the stage definition. Unlike Databricks (where the equivalent access_point field was never GA and has been removed), Snowflake's AWS_ACCESS_POINT_ARN is a documented, supported parameter in the CREATE STAGE reference.

Evidence Matrix

Layer	Evidence	Result	Interpretation
Snowflake integration	DESCRIBE INTEGRATION	✅ Pass	Trust established
Stage metadata	LIST `@stage`	✅ Pass	ListBucket path works (bucket-level ARN matches)
Object read (no ARN)	SELECT `@stage`	❌ Fail	GetObject blocked by session policy
Object read (with ARN)	SELECT `@stage`	✅ Pass	`AWS_ACCESS_POINT_ARN` resolves session policy
External Table	CREATE + SELECT	✅ Pass	Governed table access works with ARN
Same role direct	AWS CLI List/Get/Head	✅ Pass	IAM/AP/FSx permissions are correct
FSx authorization	File system user permissions	✅ Pass	FSx-side permission permits access
Operational health	SVM DNS check	✅ Pass	Distinguish ReadTimeout from AccessDenied
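
The matrix reads as a decision procedure: compare what Snowflake can do with what the same IAM role can do directly through the AWS CLI. A minimal sketch of that reasoning follows. `locate_failure` and its return strings are hypothetical, and the three booleans are assumed to come from running the checks in the table by hand.

```python
def locate_failure(list_ok: bool, select_ok: bool, cli_get_ok: bool) -> str:
    """Map the three observations from the evidence matrix to a likely layer."""
    if select_ok:
        return "no failure"
    if not cli_get_ok:
        # The same role fails outside Snowflake, so the problem is on the AWS/FSx side.
        return "IAM / access point policy / FSx file system user"
    if list_ok:
        # Role works directly and LIST works, but object reads fail in Snowflake.
        return "Snowflake session policy (check AWS_ACCESS_POINT_ARN on the stage)"
    return "Snowflake integration or stage (check DESCRIBE INTEGRATION and the stage URL)"


if __name__ == "__main__":
    # The situation described in this article before the fix.
    print(locate_failure(list_ok=True, select_ok=False, cli_get_ok=True))
```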

FSx for ONTAP S3 AP Authorization Path

FSx for ONTAP S3 Access Points use a dual-layer authorization model:

Layer 1 — S3-side authorization:

IAM identity-based policy (Snowflake's assumed role session)
S3 Access Point resource policy
Session policy generated by Snowflake (requires AWS_ACCESS_POINT_ARN to include AP ARN)

Layer 2 — FSx for ONTAP-side authorization:

File system user associated with the access point
UNIX mode-bits / NFSv4 ACLs (for UNIX security style volumes)

In the Snowflake validation, the initial failure occurred at Layer 1 — Snowflake's generated session policy did not include the S3 AP ARN pattern. Setting AWS_ACCESS_POINT_ARN resolves this by instructing Snowflake to include the AP ARN in the session policy, allowing both layers to be evaluated normally.

S3 API Compatibility and Snowflake Operations

Snowflake operation	Likely S3 operation	FSx S3 AP support	Observed result (with ARN)
LIST `@stage`	ListObjectsV2	✅ Supported	✅ Success
SELECT `@stage`	GetObject / HeadObject	✅ Supported	✅ Success
GET_PRESIGNED_URL	Presign / signed GetObject URL	Presign not supported in FSx S3 AP docs	Observed working; not a supported production path
External Table read	GetObject	✅ Supported	✅ Success
Iceberg metadata read	Head/Get + conditional	Partial (conditional writes not supported)	TBD

Comparison: Snowflake vs Databricks

Aspect	Snowflake	Databricks
Parameter name	`AWS_ACCESS_POINT_ARN` (on stage)	`access_point` (on External Location)
LIST without parameter	✅ Works	❌ Blocked (before `access_point`)
SELECT without parameter	❌ Fails	❌ Fails
SELECT with parameter	✅ Works	✅ Works (explicit path only)
External Table / UC Table	✅ Works	❌ CREATE TABLE still fails
Subdirectory listing	✅ Works	❌ Blocked
Documentation	CREATE STAGE docs	Databricks Support (May 2026)

Key difference: Snowflake's AWS_ACCESS_POINT_ARN resolves the issue more completely than Databricks' access_point field. Snowflake achieves full External Table support, while Databricks still cannot create UC tables.

Partner Decision Card

Customer requirement	Snowflake + FSx S3 AP today	Recommended path
File discovery only	✅ Works (LIST / Directory Table)	Use directly
Query file contents in Snowflake	✅ Works with `AWS_ACCESS_POINT_ARN`	Configure stage with ARN
Governed Snowflake external tables	✅ Works with `AWS_ACCESS_POINT_ARN`	Configure stage with ARN
Zero-copy SQL on NAS data	✅ Snowflake or Athena	Both work; choose by workload
Snowflake ML / Snowpark on NAS data	✅ Possible via External Table	Configure stage with ARN, validate Snowpark path
Iceberg Table on FSx S3 AP	TBD (conditional writes not supported)	Validate separately

Choose Snowflake when governed external tables, tags, Directory Tables, or Snowpark integration are required. Choose Athena when lightweight AWS-native serverless SQL over NAS data is sufficient.

Discovery Questions for Partners

When a customer asks about Snowflake + FSx for ONTAP S3 Access Points:

Is the workload read-only analytics, or does it require write-back?
Is Snowflake governance (tags, row access policy, masking) required?
Does the workload need real-time file detection (Snowpipe), or is scheduled refresh acceptable?
Are the target files structured (Parquet/CSV/JSON) or unstructured (images/documents)?
Is the data regulated (PHI, PII, financial)? If so, review presigned URL governance.
Does the customer need Iceberg table format? (Write-back not supported on FSx S3 AP)
What is the expected file count and average file size? (Impacts LIST/REFRESH latency)
Is the Snowflake account in the same AWS region as FSx for ONTAP?

Governance Impact

Capability	Status	Governance impact
LIST `@stage`	✅ Works	File inventory; not data access governance
SELECT `@stage`	✅ Works (with ARN)	Query-level access via Snowflake governance
External Table	✅ Works (with ARN)	Governed schema/table abstraction available
Iceberg Table	❌ Write not suitable	Conditional writes not supported; read of pre-existing tables TBD
GET_PRESIGNED_URL	⚠️ Observed only	Risk of bypassing Snowflake query governance if misused

For regulated workloads, do not use GET_PRESIGNED_URL as a workaround for query access. Even if URL generation is observed to work, it is not a governed Snowflake query path and should be reviewed separately for auditability, expiration, data classification, and access logging.

Governance Impact Summary

Important premise: FSx for ONTAP S3 Access Points are NOT officially documented by Snowflake as a supported External Stage storage backend. The governance paths described below are validated in this environment but should not be treated as officially supported configurations without Snowflake Support confirmation.

Access path	Governance model	Auditability	Production suitability
External Table (with `AWS_ACCESS_POINT_ARN`)	Snowflake RBAC + Tags + Row Access Policy	High (Snowflake Access History, query logs)	Recommended governed read path
COPY INTO (load to Snowflake table)	Full Snowflake governance on loaded data	High (standard Snowflake table governance)	Recommended for ML/AI workloads requiring full governance
Directory Table + GET_PRESIGNED_URL	File catalog governed; URL access is external	Medium (catalog queries logged; URL access not logged by Snowflake)	File discovery governed; downstream access requires separate audit
BUILD_SCOPED_FILE_URL	Snowflake-mediated access	High (access mediated through Snowflake privileges)	Preferred for governed unstructured data access
GET_PRESIGNED_URL (direct)	External access path	Low (Snowflake does not log URL usage after generation)	PoC / non-regulated only; requires separate access logging

Snowflake Access History captures query-level access to External Tables. However, presigned URL usage after generation is not tracked by Snowflake — use CloudTrail S3 data events for downstream audit if required.

MLOps Boundary

Reading data from FSx for ONTAP S3 AP via Snowflake External Table does not automatically make the downstream ML workflow governed.

If the data accessed via External Table or COPY INTO is used for ML or GenAI:

Register derived datasets in governed Snowflake tables
Track experiments with Snowflake ML lineage or external experiment tracking
Document source data access path (stage name, S3 AP alias, prefix, timestamp)
Record whether training data lineage is captured within Snowflake or externalized
Ensure Snowpark ML workloads use appropriate role privileges
If using Cortex functions, validate that input data classification is appropriate for the model

Snowflake's ML Lineage tracks feature-to-model relationships. If the source data path is an External Table on FSx S3 AP, document this as the lineage origin.

AI / RAG Data Readiness Checklist

If the FSx for ONTAP S3 AP data is intended for AI, RAG, or GenAI pipelines via Snowflake:

[ ] Are documents classified by sensitivity (PHI, PII, financial, internal, public)?
[ ] Are file-level permissions preserved or re-modeled for the AI pipeline?
[ ] Is metadata available for filtering and retrieval (file type, date, owner)? → Use Directory Table
[ ] Is freshness requirement defined (real-time, daily, weekly)? → Define REFRESH schedule
[ ] Is read-only access sufficient, or does the pipeline need write-back?
[ ] Is human review required for generated output before downstream use?
[ ] Is permission-aware retrieval required (user A sees only their authorized documents)?

If permission-aware retrieval is required, define one of:

Enforce at source access path — use per-user or per-group S3 Access Points with scoped file system users
Re-model permissions in metadata index — extract file-level ACLs into Directory Table metadata and filter at query time
Filter retrieval results by user/group claims — apply Snowflake Row Access Policy on External Table based on authenticated user identity
Do not proceed until authorization model is validated and approved by security owner
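
As an illustration of the third option, filtering retrieval results by user/group claims, the sketch below drops any retrieved document whose allowed groups do not overlap the caller's groups. The `Doc` type, the `allowed_groups` field, and `filter_by_claims` are all hypothetical. In a real pipeline the group metadata would have to be extracted from the source ACLs and the claims taken from the authenticated identity, neither of which is shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    path: str
    allowed_groups: frozenset  # groups permitted to see this document


def filter_by_claims(docs, user_groups):
    """Keep only documents whose allowed groups overlap the user's groups."""
    claims = frozenset(user_groups)
    return [d for d in docs if d.allowed_groups & claims]


if __name__ == "__main__":
    retrieved = [
        Doc("documents/report.pdf", frozenset({"finance"})),
        Doc("documents/manual.pdf", frozenset({"engineering", "finance"})),
    ]
    for doc in filter_by_claims(retrieved, ["engineering"]):
        print(doc.path)
```

A document with an empty group set is never returned, so missing metadata fails closed rather than open.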

Snowflake + FSx S3 AP approval requirements (for regulated workloads):

Data owner approval for External Table / stage access
Security owner approval for presigned URL generation policy
Platform owner approval for COPY INTO (data leaves FSx, enters Snowflake)
Defined: allowed prefix, allowed operations, refresh schedule, expiration date
Approval record location (where the decision is stored)
Review / expiration date (when the approval must be re-evaluated)

For regulated workloads, exercise caution with:

GET_PRESIGNED_URL for patient-facing or financial data (bypasses Snowflake query governance)
COPY INTO without data classification review (data moves from FSx to Snowflake storage)
Cortex LLM functions on sensitive data without human review gate
Unreviewed access to regulated datasets via scoped URLs

Unstructured Data Support

Format	Support	Access Method	Use Case
Images (JPEG, PNG, TIFF)	✅	GET_PRESIGNED_URL / BUILD_SCOPED_FILE_URL	Thumbnail generation, ML inference, quality inspection
Video (MP4, MOV)	✅	GET_PRESIGNED_URL	Streaming, frame extraction
Documents (PDF, DOCX)	✅	GET_PRESIGNED_URL / Snowpark File Access	Text extraction, RAG, document processing
Audio (WAV, MP3)	✅	GET_PRESIGNED_URL	Transcription, speech analytics
Binary / Archives	✅	GET_PRESIGNED_URL	Download, transfer

How to manage unstructured data as a library:

-- Enable Directory Table for file catalog
ALTER STAGE fsxn_stage SET DIRECTORY = (ENABLE = TRUE);
ALTER STAGE fsxn_stage REFRESH;

-- Query file catalog (search by path, size, date)
SELECT RELATIVE_PATH, SIZE, LAST_MODIFIED
FROM DIRECTORY(@fsxn_stage)
WHERE RELATIVE_PATH LIKE '%images/%'
ORDER BY LAST_MODIFIED DESC;

-- Generate download URL for applications (valid 1 hour)
SELECT GET_PRESIGNED_URL(@fsxn_stage, 'images/photo001.jpg', 3600);

-- Generate Snowflake-proxied secure URL
SELECT BUILD_SCOPED_FILE_URL(@fsxn_stage, 'documents/report.pdf');

Note: AUTO_REFRESH is not available because FSx S3 AP does not support S3 Event Notifications (GetBucketNotificationConfiguration is not supported). Use ALTER STAGE REFRESH manually or via Snowflake Task on a schedule.
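
For the scheduled option, the statements might be generated as in the sketch below. It was not part of the validation in this article. It assumes that a Snowflake Task can run `ALTER STAGE ... REFRESH` as its single statement, so verify that in your account. `refresh_task_sql` and the task name are hypothetical. Newly created tasks start suspended, which is why the second statement resumes the task.

```python
def refresh_task_sql(task: str, warehouse: str, stage: str, minutes: int) -> list[str]:
    """Render statements for a task that refreshes a stage's directory table."""
    if minutes < 1:
        raise ValueError("minutes must be >= 1")
    return [
        f"CREATE OR REPLACE TASK {task}\n"
        f"  WAREHOUSE = {warehouse}\n"
        f"  SCHEDULE = '{minutes} MINUTE'\n"
        f"AS\n"
        f"  ALTER STAGE {stage} REFRESH;",
        # Tasks are created in a suspended state and must be resumed.
        f"ALTER TASK {task} RESUME;",
    ]


if __name__ == "__main__":
    for stmt in refresh_task_sql("fsxn_stage_refresh_task", "COMPUTE_WH", "fsxn_stage", 60):
        print(stmt, end="\n\n")
```

Pick the interval from the freshness requirement rather than a default, since each refresh issues a full listing of the stage prefix.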

URL type guidance: Use BUILD_SCOPED_FILE_URL when you want access mediated through Snowflake role privileges (governed path). Treat GET_PRESIGNED_URL as an external object access path that bypasses Snowflake query governance and requires separate review for regulated workloads.

AI / ML Integration Path

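Snowflake provides AI/ML capabilities that can leverage FSx for ONTAP data via S3 AP. In the table below, six functions run directly against FSx S3 AP data without copying, two more (Vision COMPLETE and Cortex Search) work after a staging step, and direct TO_FILE on the external stage is blocked.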

Snowflake AI/ML Feature	FSx S3 AP Compatibility	Access Path	Duration	Use Case
CORTEX.SUMMARIZE	✅ Direct	External Table → Cortex	3.3s	Text summarization on NAS documents
CORTEX.TRANSLATE	✅ Direct	External Table → Cortex	5.1s	Multi-language support
CORTEX.SENTIMENT	✅ Direct	External Table → Cortex	2.5s	Sentiment analysis
CORTEX.COMPLETE (text)	✅ Direct	External Table → Cortex	16s	AI analysis, anomaly detection
CORTEX.EXTRACT_ANSWER	✅ Direct	External Table → Cortex	2.7s	Information extraction
PARSE_DOCUMENT (OCR)	✅ Direct	Stage path → OCR	~8s	Invoice/report text extraction
COMPLETE (Vision/Multimodal)	✅ Workaround	COPY FILES → internal stage → TO_FILE	41s	Image analysis, defect detection
TO_FILE on FSx S3 AP	❌ Blocked	—	—	"Remote file not found"
Cortex Search (RAG)	✅ Verified	External Table → COPY INTO → Cortex Search Service	198ms query	Semantic search over NAS documents

Key finding: Text-based Cortex functions, PARSE_DOCUMENT, and Cortex Search all work on FSx S3 AP data (Cortex Search requires COPY INTO as a staging step). Vision AI (multimodal COMPLETE) requires a staging step because TO_FILE() cannot resolve files on S3 AP external stages.

Validated AI/ML paths:

✅ Cortex LLM SUMMARIZE on External Table data — AI-generated summary in 3.3s
✅ Cortex TRANSLATE on External Table data — English to Japanese in 5.1s
✅ Cortex SENTIMENT on External Table data — sentiment scores in 2.5s
✅ Cortex COMPLETE (text) on External Table data — AI anomaly analysis in 16s
✅ Cortex EXTRACT_ANSWER on External Table data — information extraction in 2.7s
✅ PARSE_DOCUMENT (OCR) on FSx S3 AP stage file — text extraction from images in ~8s
✅ COMPLETE (Vision AI) via COPY FILES workaround — image analysis in 41s (pixtral-large)
✅ Cortex Search (RAG) — External Table → COPY INTO → Cortex Search Service → semantic query in 198ms
✅ COPY INTO loads NAS data into Snowflake tables → available for all Cortex/ML functions
✅ Directory Table catalogs unstructured files → enables file discovery for processing pipelines
✅ GET_PRESIGNED_URL generates download URLs → enables external ML services to access files

Vision AI Workaround (Validated)

Direct TO_FILE() on FSx S3 AP external stage returns "Remote file not found." The workaround:

-- 1. Create unencrypted internal stage (SNOWFLAKE_SSE required — default encryption blocks TO_FILE)
CREATE OR REPLACE STAGE fsxn_ai_stage ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- 2. Copy image from FSx S3 AP to internal stage
COPY FILES INTO @fsxn_ai_stage FROM @fsxn_ap_arn_test_stage/media/documents/invoice_sample.png;

-- 3. Enable Cross-Region Inference (required for vision models in ap-northeast-1)
ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'ANY_REGION';

-- 4. Run Vision AI
ALTER STAGE fsxn_ai_stage SET DIRECTORY = (ENABLE = TRUE);
ALTER STAGE fsxn_ai_stage REFRESH;
SELECT SNOWFLAKE.CORTEX.COMPLETE('pixtral-large',
  'Describe this invoice. What is the invoice number, customer, and amount?', FILE
) AS vision_result
FROM (SELECT TO_FILE(BUILD_SCOPED_FILE_URL(@fsxn_ai_stage, RELATIVE_PATH)) AS FILE
      FROM DIRECTORY(@fsxn_ai_stage) WHERE RELATIVE_PATH LIKE '%.png' LIMIT 1);

Result: Vision AI correctly identified Invoice #INV-2026-0524, Customer: Acme Corp, Amount: USD 1,234.56.

Data residency note: The COPY FILES step moves image data from FSx for ONTAP to Snowflake-managed internal storage. Cross-Region Inference may route data to US/EU regions for model processing. Verify compliance with your data residency requirements before enabling for regulated workloads.

Cortex Search (RAG) — Validated

Cortex Search provides semantic search over text data and is the Snowflake-native RAG building block. The validated path is External Stage → COPY INTO → Cortex Search Service:

-- 1. Load FSx S3 AP data into internal table (required for Cortex Search)
COPY INTO sensor_documents FROM @fsxn_stage_with_arn/sensor-data/
  FILE_FORMAT = (TYPE = PARQUET);

-- 2. Create Cortex Search Service on the loaded data
CREATE OR REPLACE CORTEX SEARCH SERVICE sensor_search_service
  ON text_column
  WAREHOUSE = COMPUTE_WH
  TARGET_LAG = '1 hour'
  AS (SELECT * FROM sensor_documents);

-- 3. Semantic search query
SELECT PARSE_JSON(
  SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
    'sensor_search_service',
    '{"query": "high temperature anomaly", "columns": ["text_column"], "limit": 5}'
  )
);
-- Result: Relevant documents returned in 198ms

Dataset context: This validation used the sensor data loaded via COPY INTO from FSx S3 AP (1000 rows of IoT sensor readings). Cortex Search performance at scale (millions of documents, large text corpora) should be validated separately — 198ms is a sizing reference for this dataset size, not a service-level guarantee.

GA status: Verify that Cortex Search Service and its query functions are Generally Available (GA) in your Snowflake edition and region before production use. Preview features may not be covered by Snowflake SLA and should not be used for regulated workloads without explicit vendor confirmation.

Cortex Search Service created on data loaded from FSx for ONTAP via COPY INTO.

Semantic search query returns relevant results in 198ms — RAG-style retrieval over NAS-originated data.

Key insight: Cortex Search requires COPY INTO (data must be in a Snowflake internal table), but the end-to-end path from FSx for ONTAP → External Stage → COPY INTO → Cortex Search Service → semantic query is validated. This provides a Snowflake-native RAG path for NAS documents.

Data residency change: COPY INTO moves data from FSx for ONTAP to Snowflake-managed storage. Once loaded, the data is subject to Snowflake's storage lifecycle, not ONTAP's. For regulated workloads, obtain data owner approval before COPY INTO and document the residency change in your compliance records. Cortex Search Service indexes are stored in the same region as the Snowflake account — no cross-region data movement occurs for the index itself.

Comparison with Bedrock Knowledge Bases: Cortex Search requires a COPY INTO step (data moves to Snowflake storage). Bedrock Knowledge Bases can read directly from FSx S3 AP without copying. Choose Cortex Search when the RAG pipeline must stay within Snowflake governance. Choose Bedrock KB when data residency on FSx is mandatory and AWS-native RAG is preferred.

PoC Quick Start — Validate Cortex Search on your NAS data in 3 steps (estimated: 30 minutes with pre-configured stage):

Configure External Stage with AWS_ACCESS_POINT_ARN (see Configuration Guide above)
Run COPY INTO <target_table> FROM @fsxn_stage/<your-documents-prefix>/ to load text data
Create Cortex Search Service on the loaded table and run a semantic query to validate retrieval quality

Manufacturing Use Case: OCR + AI on NAS Data

-- OCR: Extract text from inspection report image stored on FSx for ONTAP
SELECT SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
  @fsxn_stage,
  'media/documents/invoice_sample.png',
  {'mode': 'OCR'}
) AS ocr_result;
-- Result: "INVOICE #INV-2026-0524", "Customer: Acme Corp", "Amount: USD 1,234.56"

-- AI Analysis: Analyze sensor data for anomalies
SELECT SNOWFLAKE.CORTEX.COMPLETE('mistral-large2',
  'Analyze this IoT sensor reading and identify anomalies: ' || VALUE::VARCHAR
) AS ai_analysis FROM fsxn_sensor_ext_table LIMIT 1;

PARSE_DOCUMENT (OCR mode) extracts text from an image on FSx for ONTAP via S3 AP — works directly without copying.

Cortex COMPLETE (mistral-large2) generates AI anomaly analysis of IoT sensor data on FSx for ONTAP — works directly on External Table data.

Vision AI (pixtral-large) correctly extracts invoice details from an image originally on FSx for ONTAP — requires COPY FILES to internal stage.

Not validated in this article:

Snowpark File Access (SnowflakeFile.open()) for direct binary file processing in UDFs
AI_TRANSCRIBE for audio files on FSx S3 AP

Comparison with Databricks AI/ML path:

AI/ML Capability	Snowflake + FSx S3 AP	Databricks + FSx S3 AP
Governed table as ML input	✅ External Table	❌ UC Table creation blocked
Text AI (LLM) on NAS data	✅ 6 Cortex functions direct	⚠️ boto3 + external LLM (bypasses UC)
Vision AI on NAS images	✅ Via staging workaround (41s)	⚠️ boto3 driver-only (bypasses UC)
OCR / Document extraction	✅ PARSE_DOCUMENT direct (8s)	⚠️ boto3 + external OCR
Feature engineering	✅ Snowpark DataFrame on External Table	⚠️ spark.read with explicit path only
File catalog for ML pipeline	✅ Directory Table	⚠️ dbutils.fs.ls (top-level only)
RAG over NAS documents	✅ Cortex Search (via COPY INTO, 198ms)	⚠️ boto3 + external RAG (bypasses UC)

Key insight: Snowflake's AI/ML path benefits from governed External Tables and direct Cortex function access — 8 out of 10 tested functions work on FSx data (7 directly without copying, 1 via COPY INTO for Cortex Search). Databricks' AI/ML path is limited by UC table creation failure, forcing boto3 workarounds that bypass governance.

For end-to-end RAG on NAS documents: Use Snowflake Cortex Search (validated: External Table → COPY INTO → Cortex Search Service, 198ms query latency) or Amazon Bedrock Knowledge Bases as the AWS-documented path (no copy needed).

Decision guidance: Use Snowflake when the customer already needs Snowflake governance, Cortex/Snowpark processing, or table-based feature engineering. Use Bedrock Knowledge Bases when the primary requirement is AWS-native permission-aware RAG over NAS documents.

Comparison: Snowflake vs Databricks (Governance)

Governance Capability	Snowflake + FSx S3 AP	Databricks + FSx S3 AP
Table creation	✅ External Table	❌ CREATE TABLE fails
Data classification tags	✅ Governance Tags	❌ UC Table not creatable
Access control	✅ Row Access Policy	❌ UC governance not applicable
File catalog	✅ Directory Table	⚠️ dbutils.fs.ls (top-level only)
Secure URL generation	✅ BUILD_SCOPED_FILE_URL	❌
Column masking	✅ Available	❌
COPY INTO (data load)	✅	❌
Unstructured data catalog	✅ Directory Table + Presigned URL	⚠️ boto3 only (bypasses governance)

Key takeaway: In this validation, Snowflake with AWS_ACCESS_POINT_ARN achieved a more complete governed read path than the Databricks path tested in Part 2. Snowflake can create governed tables, apply tags, and manage unstructured data catalogs — capabilities that remain blocked in Databricks due to UC table creation failure.

For regulated workloads: Snowflake provides a more complete governed path today (External Table + Tags + Row Access Policy + audit trail). Databricks requires staged ingestion to S3 for equivalent governance. If your compliance framework requires governed table-level access control on the data, Snowflake is the validated path for FSx S3 AP integration.

Business Impact

Requirement	Observed result	Business impact	Recommended decision
Zero-copy Snowflake query over NAS	✅ Works (with ARN)	Eliminates copy pipeline	Use `AWS_ACCESS_POINT_ARN` stage
Snowflake governance on FSx data	✅ External Table works	Governed table abstraction available	Create External Tables
File inventory from Snowflake	✅ Works	Metadata cataloging possible	Use LIST / Directory Tables
RAG / AI over NAS documents	✅ Cortex Search validated (198ms)	Snowflake-native RAG path available	COPY INTO → Cortex Search Service
Text AI on NAS data (no copy)	✅ 7 functions direct	AI processing without data movement	Use Cortex functions on External Table

Detailed validation metrics (refresh duration, file count, query latency, COPY INTO duration, URL generation success rate) should be recorded in the verification-pack evidence files rather than treated as universal benchmark numbers.

Use Case Fit Matrix

Use case	Best current path	Why
SQL analytics on structured NAS files	Snowflake External Table or Athena	Both validated; Snowflake adds governance tags
Unstructured data catalog	Snowflake Directory Table	File metadata queryable with governance
Data load from NAS to Snowflake	COPY INTO from FSx S3 AP stage	Validated (4.9s for Parquet)
RAG over NAS documents	Cortex Search (via COPY INTO, validated 198ms) or Bedrock KB (AWS-native)	Cortex Search validated; Bedrock KB is AWS-documented path
ML feature engineering	Snowpark DataFrame on External Table	Governed read path available
Real-time ingestion	Not FSx S3 AP path	Use native S3 + Snowpipe
Iceberg / transactional tables	Not FSx S3 AP path	Use native S3 for write-back

Cost Model Considerations

Component	Cost driver	Notes
Snowflake warehouse	Credit consumption during queries	X-Small sufficient for validation; scale per workload
FSx for ONTAP	Throughput capacity + storage	S3 AP queries share throughput with NFS/SMB workloads
S3 AP requests	No additional S3 request charges	FSx S3 AP does not incur separate S3 API fees
Data transfer	Standard AWS data transfer	Snowflake SaaS in same region minimizes transfer

Cost comparison across engines is not the focus of this article. Snowflake's credit-based model differs fundamentally from Athena's per-TB-scanned model. Evaluate based on workload pattern, governance requirements, and existing Snowflake investment.

Configuration Guide

Step 1: Create Storage Integration

CREATE OR REPLACE STORAGE INTEGRATION fsxn_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account>:role/<role-name>'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<ap-alias>/');
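
After the integration exists, Snowflake's documented flow is to run DESC INTEGRATION and copy the STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID values into the trust policy of the IAM role. The following is a minimal sketch of that trust policy, not the exact document used in this validation. The function name and the sample ARN and external ID are placeholders.

```python
import json


def build_trust_policy(snowflake_user_arn: str, external_id: str) -> dict:
    """Trust policy that lets the Snowflake IAM user assume the integration role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": snowflake_user_arn},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values: copy the real ones from DESC INTEGRATION fsxn_integration.
    policy = build_trust_policy(
        "arn:aws:iam::123456789012:user/example-snowflake-user",
        "EXAMPLE_EXTERNAL_ID",
    )
    print(json.dumps(policy, indent=2))
```

The external ID condition is what ties the role to this specific integration, so recreating the integration with CREATE OR REPLACE typically means the trust policy needs to be updated again.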

Step 2: Create Stage WITH `AWS_ACCESS_POINT_ARN`

CREATE OR REPLACE STAGE fsxn_stage
  STORAGE_INTEGRATION = fsxn_integration
  URL = 's3://<ap-alias>/'
  AWS_ACCESS_POINT_ARN = 'arn:aws:s3:<region>:<account>:accesspoint/<ap-name>'
  FILE_FORMAT = (TYPE = PARQUET);
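
Because the stage takes both an alias (in URL) and an ARN (in AWS_ACCESS_POINT_ARN), it is easy to paste the wrong value into the wrong parameter. The sketch below is a hypothetical pre-flight check that distinguishes an access point ARN from a bucket ARN or an alias. The regular expression is my own approximation of the ARN shape shown above, not an official validator.

```python
import re

# Approximate shape: arn:aws:s3:<region>:<12-digit account>:accesspoint/<name>
_AP_ARN = re.compile(r"^arn:aws[a-z-]*:s3:[a-z0-9-]+:\d{12}:accesspoint/[a-z0-9-]+$")


def is_access_point_arn(value: str) -> bool:
    """Return True if value looks like an S3 Access Point ARN (not a bucket ARN or alias)."""
    return _AP_ARN.match(value) is not None


if __name__ == "__main__":
    for candidate in (
        "arn:aws:s3:ap-northeast-1:123456789012:accesspoint/example-ap",
        "arn:aws:s3:::example-bucket",
        "example-ap-alias",
    ):
        print(candidate, is_access_point_arn(candidate))
```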

Step 3: Verify

LIST @fsxn_stage/;                                    -- File discovery
SELECT $1 FROM @fsxn_stage/path/to/file.parquet LIMIT 5;  -- Data read

Step 4: Create External Table (optional)

The following DDL is simplified for readability. See the GitHub SQL scripts for the exact tested definition.

CREATE OR REPLACE EXTERNAL TABLE my_ext_table
  LOCATION = @fsxn_stage/sensor-data/
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = FALSE;

Internal Table vs External Table — Design Guide

Understanding the difference between internal (managed) tables and external tables is critical for architecture decisions when integrating FSx for ONTAP with Snowflake.

Comparison Matrix

Aspect	External Table (on FSx S3 AP)	Internal Table (COPY INTO)
Data location	Remains on FSx for ONTAP (zero-copy)	Copied into Snowflake-managed storage
Multi-protocol access	Same data via NFS/SMB/S3 AP simultaneously	Only accessible via Snowflake
Data freshness	Real-time (reads current file state)	Stale until next COPY INTO
Query performance	Slower (estimated ~2-5s for small queries based on observed S3 AP GetObject latency)	Faster (sub-second with micro-partitions, pruning)
Governance (Tags, Masking)	✅ Full support	✅ Full support
Time Travel	❌ Not available	✅ Available (up to 90 days on Enterprise Edition or higher; 1 day on Standard)
Cortex AI (text functions)	✅ Direct (SUMMARIZE, TRANSLATE, etc.)	✅ Direct
Cortex AI (Vision/TO_FILE)	❌ TO_FILE blocked on FSx S3 AP	✅ Works on internal stage
Cortex Search (RAG)	❌ Requires COPY INTO first	✅ Direct
ONTAP features preserved	✅ Snapshot, FlexClone, Dedup, FPolicy	❌ Data is outside ONTAP
Storage cost	FSx for ONTAP only (no Snowflake storage)	FSx + Snowflake storage (duplicate)

Decision Flowchart

Q: Does the data need to stay on FSx for ONTAP?
├── YES → External Table
│         Q: Do you need Vision AI or Cortex Search?
│         ├── YES → Hybrid: External Table + selective COPY INTO
│         └── NO → External Table is sufficient (text AI works directly)
│
└── NO → COPY INTO internal table
          Q: Do you need real-time freshness?
          ├── YES → Scheduled COPY INTO (Task) or FPolicy → Lambda → Snowpipe
          └── NO → Batch COPY INTO on schedule
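
The flowchart above can be written as a small function, which is handy when the same questions are asked across many datasets. This is a direct transcription of the four leaves. The function and parameter names are mine.

```python
def choose_pattern(
    must_stay_on_fsx: bool,
    needs_vision_or_search: bool = False,
    needs_realtime: bool = False,
) -> str:
    """Return the table pattern the decision flowchart recommends."""
    if must_stay_on_fsx:
        if needs_vision_or_search:
            return "Hybrid: External Table + selective COPY INTO"
        return "External Table"
    if needs_realtime:
        return "Scheduled COPY INTO (Task) or FPolicy → Lambda → Snowpipe"
    return "Batch COPY INTO on schedule"


if __name__ == "__main__":
    print(choose_pattern(must_stay_on_fsx=True, needs_vision_or_search=True))
```

Note that needs_realtime is only consulted when the data may leave FSx, mirroring the flowchart, where freshness is already real-time on the External Table branch.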

Cost Comparison

Pattern	FSx Storage	Snowflake Storage	Best For
External Table only	✅ (existing)	None	Read-heavy, compliance, multi-protocol
COPY INTO (full)	✅ (existing)	+ full copy	Max performance, Time Travel, full AI
Hybrid (External + selective COPY)	✅ (existing)	+ images/RAG data only	AI workloads with data residency needs

Industry-Specific Recommendations

Industry	Recommended Pattern	Rationale	PoC Success Criteria
Manufacturing	External Table + PARSE_DOCUMENT (OCR)	Data stays on FSx; inspection images processed in place	OCR extracts text from 10+ inspection images in <10s each
Financial Services	Hybrid (External Table + COPY INTO for Cortex Search)	Compliance requires data on FSx; RAG needs internal table	Cortex Search returns relevant compliance docs in <500ms
Healthcare	External Table + SnapLock	PHI must not leave controlled storage; immutable audit	SELECT on External Table succeeds with governance tags applied
Media / Entertainment	External Table + COPY FILES (Vision AI)	Large media files stay on FSx; selective staging for AI	Vision AI describes image content correctly via staging path
Cross-Industry Analytics	COPY INTO (full)	Maximum query performance; data duplication acceptable	COPY INTO completes in <10s for representative dataset

Snowpipe Alternatives for FSx for ONTAP

Since FSx S3 AP does not support S3 Event Notifications, standard Snowpipe auto-ingest is not available. Use these alternatives:

Option 1: FPolicy → Lambda → SNS → Snowpipe REST API (Recommended)

FSx for ONTAP ──FPolicy──▶ Lambda ──▶ SNS ──▶ Snowpipe REST API ──▶ COPY INTO target table
     │                                              │
     └── NFS/SMB users access same data             └── Snowflake governance on loaded data

Latency: Seconds (<30s from file write to Snowflake availability)
Complexity: Medium (requires FPolicy configuration + Lambda function)
Best for: Near-real-time ingestion requirements

FPolicy throughput note: FPolicy introduces minimal latency on the NFS/SMB I/O path (typically <1ms per operation for passthrough mode). However, under high-frequency file write workloads (thousands of files/second), validate throughput impact on the FSx for ONTAP file system before production deployment.

Option 2: Snowflake Task + COPY INTO (Simple)

-- Create a task that runs COPY INTO every 5 minutes.
-- fsxn_stage_with_arn must be a stage created with AWS_ACCESS_POINT_ARN (as in Step 2).
CREATE OR REPLACE TASK fsxn_ingest_task
  WAREHOUSE = COMPUTE_WH
  SCHEDULE = '5 MINUTE'
AS
  COPY INTO target_table FROM @fsxn_stage_with_arn/incoming/
  FILE_FORMAT = (TYPE = PARQUET)
  PATTERN = '.*[.]parquet';

ALTER TASK fsxn_ingest_task RESUME;

Latency: Minutes (configurable schedule interval)
Complexity: Low (pure Snowflake SQL)
Best for: Batch ingestion where minutes-level latency is acceptable

Option 3: Snowpipe REST API (Manual Trigger)

Applications call the Snowpipe REST API (the insertFiles endpoint) with a list of file paths once they know new files have landed.

Latency: Seconds (triggered by application)
Complexity: Low (API call from any application)
Best for: Application-controlled ingestion workflows
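
A minimal sketch of the manual trigger, assuming a pipe has already been created over the FSx S3 AP stage and that a key-pair JWT is generated elsewhere (JWT creation is omitted here). The account identifier, pipe name, and file path are placeholders. The request is only built, not sent.

```python
import json
import uuid
from urllib.request import Request


def build_insert_files_request(
    account: str, pipe: str, paths: list, jwt_token: str
) -> Request:
    """Build (but do not send) a Snowpipe insertFiles request for the given paths."""
    url = (
        f"https://{account}.snowflakecomputing.com"
        f"/v1/data/pipes/{pipe}/insertFiles?requestId={uuid.uuid4()}"
    )
    # Paths are relative to the stage location used in the pipe definition.
    body = json.dumps({"files": [{"path": p} for p in paths]}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {jwt_token}",
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    return Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_insert_files_request(
        "example-account", "MY_DB.MY_SCHEMA.MY_PIPE",
        ["incoming/file-0001.parquet"], "<jwt>",
    )
    print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would send it. Snowpipe then queues the listed files for loading, so a successful response means "accepted", not "loaded".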

Snowpipe / COPY INTO Supported Formats

Format	Snowpipe	COPY INTO	External Table	Notes
CSV	✅	✅	✅	Delimiter, header, encoding options
JSON	✅	✅	✅	Nested, semi-structured
Parquet	✅	✅	✅	Column pruning, predicate pushdown
Avro	✅	✅	✅	Schema evolution supported
ORC	✅	✅	✅	Read-only
XML	✅	✅	✅	Native support

Stop Criteria

Stop the Snowflake direct-access PoC when:

SELECT from stage fails with AccessDenied after AWS_ACCESS_POINT_ARN is configured and IAM/AP/FSx permissions are proven correct
The workload requires Iceberg Table write-back (conditional writes not supported on FSx S3 AP)
Data owner does not approve the access path
ReadTimeout occurs (check SVM DNS/AD configuration — see Networking Troubleshooting)

Regulated Workload Checklist

Before using Snowflake + FSx S3 AP for regulated data:

[ ] Confirm the S3 Access Point file-system user identity and least-privilege permissions
[ ] Confirm Snowflake role privileges for stage, external table, and tag access
[ ] Define whether users may generate presigned or scoped URLs (prefer BUILD_SCOPED_FILE_URL for governed access)
[ ] Record derived data locations if COPY INTO loads data into Snowflake tables
[ ] Define manual refresh schedule and evidence retention
[ ] Store approval owner, review date, and expiration date
[ ] Validate that GET_PRESIGNED_URL is not used as a bypass for query-level governance
[ ] If Vision AI is required: Approve COPY FILES to internal stage (data moves to Snowflake-managed storage)
[ ] If Cross-Region Inference is enabled: Verify that image/document data may be processed in US/EU regions
[ ] If Cortex Search is used: Approve COPY INTO (data moves to Snowflake storage) AND Cortex Search Service index creation (data residency changes twice — once for table load, once for search index). Cortex Search Service index is stored in the Snowflake account region.

Store the checklist result with an approval ID, owner, review date, expiration date, and evidence location so the PoC decision can be audited later.

Cross-Region Inference — Data Residency Warning

When CORTEX_ENABLED_CROSS_REGION = 'ANY_REGION' is set, Cortex AI functions may route data to model endpoints in other AWS regions (US, EU) for processing. For regulated workloads:

Verify: Does your compliance framework allow data processing outside the home region?
Alternatives: Use AWS_US or AWS_EU instead of ANY_REGION to limit routing scope
Mitigation: Process only non-regulated images via Vision AI; keep PHI/PII in text-only Cortex functions (which run in-region)
Documentation: Record which Cross-Region setting is used and which data types are processed

Compliance Framework Mapping

Framework	Recommended Pattern	Key Controls
HIPAA (PHI)	External Table + SnapLock + FPolicy audit	Data never leaves FSx; file access audited; admin cannot delete during retention
SOX (Financial)	COPY INTO + Time Travel + audit trail	Full change history; point-in-time queries for audit
GDPR (PII)	External Table + Row Access Policy + Tag-based Masking	Data minimization at query time; PII masked for non-authorized roles
FINRA (Records)	External Table + SnapLock Compliance	Non-erasable, non-writable records for retention period

Approval Evidence Example

approval_id: "FSXN-SF-POC-001"
data_owner: "<name/group>"
security_owner: "<name/group>"
platform_owner: "<name/group>"
allowed_prefixes:
  - "s3://<ap-alias>/sensor-data/"
  - "s3://<ap-alias>/bronze/"
allowed_operations:
  - LIST
  - SELECT (External Table)
  - COPY INTO (load only)
  - Directory Table
  - BUILD_SCOPED_FILE_URL
  - Cortex text functions (SUMMARIZE, TRANSLATE, SENTIMENT)
  - COPY FILES to internal stage (for Vision AI only)
disallowed_operations:
  - GET_PRESIGNED_URL for regulated data
  - COPY INTO unload (write-back)
  - Cortex LLM on PHI/PII without human review
  - Cross-Region Inference on regulated images (unless approved)
cross_region_inference: "ANY_REGION"  # or "DISABLED" for regulated data
review_date: "<YYYY-MM-DD>"
expiration_date: "<YYYY-MM-DD>"
evidence_location: "verification-pack/snowflake/evidence/<date>/evidence-record.yaml"

COPY INTO unload (write-back to FSx S3 AP) was not validated in this article. Although FSx S3 AP supports PutObject, Snowflake unload behavior should be tested separately before positioning write-back as supported.

Data residency note: COPY INTO (load) and COPY FILES change the data residency model — source files remain on FSx, but a derived copy is created in Snowflake-managed storage. Cross-Region Inference may further route data to other regions. Treat loaded tables and staged files as derived regulated data and apply retention, classification, and deletion controls separately.
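
To keep approval records auditable, it helps to check them mechanically before a PoC run. The sketch below assumes the evidence record has already been loaded into a Python dict (for example from the YAML above). The required-field list and the three checks are my own choices, not a format defined by the repository.

```python
from datetime import date

REQUIRED_FIELDS = (
    "approval_id", "data_owner", "security_owner", "platform_owner",
    "allowed_prefixes", "allowed_operations", "disallowed_operations",
    "review_date", "expiration_date", "evidence_location",
)


def check_evidence(record: dict, today: date) -> list:
    """Return a list of problems found in an approval evidence record."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if not record.get(f)]
    try:
        review = date.fromisoformat(str(record.get("review_date") or ""))
        expiry = date.fromisoformat(str(record.get("expiration_date") or ""))
    except ValueError:
        problems.append("review_date and expiration_date must be YYYY-MM-DD")
        return problems
    if expiry <= review:
        problems.append("expiration_date must be after review_date")
    if expiry < today:
        problems.append("approval has expired")
    return problems
```

An empty list means the record is complete and still within its approval window. Anything else should block the run until the owner re-approves.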

Troubleshooting Playbook

When Snowflake access to FSx for ONTAP S3 AP fails, isolate one layer at a time:

Stage configuration — Is AWS_ACCESS_POINT_ARN set? Without it, GetObject will fail.
IAM — Does the Storage Integration role have s3:GetObject, s3:ListBucket on the S3 AP ARN?
S3 AP policy — Does the Access Point resource policy allow the Snowflake IAM user ARN?
FSx file system — Is the file system user (e.g., root) permitted to read the target files?
Network — Is the AP internet-origin? (Snowflake SaaS cannot use VPC-origin APs)
Operational — Does vserver services dns check show healthy DNS? (ReadTimeout = DNS/AD issue)

Known Failure Signatures

Symptom	Likely layer	Next step
LIST works, SELECT fails with "access denied"	Missing `AWS_ACCESS_POINT_ARN`	Add ARN parameter to stage
LIST and SELECT both fail with "access denied"	IAM role or S3 AP policy	Check DESCRIBE INTEGRATION, verify trust policy
ReadTimeout (no response)	SVM DNS/AD or FSx backend	Check `vserver services dns check`; verify S3 AP lifecycle
Stage creation fails	Storage Integration config	Verify STORAGE_ALLOWED_LOCATIONS includes the AP alias
External Table creation fails	Stage or file format issue	Verify LIST works first, then check FILE_FORMAT
COPY INTO fails	File format mismatch or permissions	Verify SELECT works first
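
The first three rows of the table reduce to two observations: whether LIST works and whether SELECT works (or times out). A small triage helper, written as an illustration of the table rather than a diagnostic tool, could look like this. The function name and return strings are mine.

```python
def triage(list_ok: bool, select_ok: bool, timed_out: bool = False) -> str:
    """Map LIST/SELECT outcomes to the layer named in the failure-signature table."""
    if timed_out:
        return "SVM DNS/AD or FSx backend"
    if list_ok and not select_ok:
        return "Missing AWS_ACCESS_POINT_ARN"
    if not list_ok and not select_ok:
        return "IAM role or S3 AP policy"
    if list_ok and select_ok:
        return "Read path OK; check FILE_FORMAT or permissions"
    return "Not covered by the table"


if __name__ == "__main__":
    print(triage(list_ok=True, select_ok=False))
```

Checking the timeout first matters: a ReadTimeout says nothing about permissions, so the IAM and stage checks would be wasted effort until DNS/AD is healthy.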

What This Article Does Not Conclude

This article does not conclude that Snowflake + FSx for ONTAP S3 AP is production-certified for all workloads. It documents the behavior observed in one validated environment and identifies the configuration required for successful integration.

Specifically, this article does not validate:

Snowpipe auto-ingest (requires S3 Event Notifications)
Iceberg Table write-back (requires conditional writes)
COPY INTO unload / write-back to FSx S3 AP
Snowpark File Access (SnowflakeFile.open) for binary processing
Performance at scale (large file counts, concurrent queries, large directory refreshes, or mixed NFS/SMB/S3 workload contention on the FSx file system)
Private connectivity (PrivateLink) path

Operational Note: ReadTimeout vs AccessDenied

During this validation series, all S3 APs on one SVM became unresponsive for 7+ days due to orphaned DNS/AD configuration.

Important distinction:

ReadTimeout (no response) → Check SVM DNS/AD configuration
AccessDenied (immediate error) → Check AWS_ACCESS_POINT_ARN stage parameter

See FSx S3 AP Networking — DNS/AD Troubleshooting for details.

Lessons Learned

1. Platform documentation holds the answer

The AWS_ACCESS_POINT_ARN parameter exists in Snowflake's CREATE STAGE documentation. The initial "no workaround" conclusion was premature — always check platform docs for S3 AP-specific parameters before concluding incompatibility.

2. The same pattern recurs across platforms

Both Snowflake (AWS_ACCESS_POINT_ARN) and Databricks (access_point field) require explicit S3 AP ARN configuration. This appears to be a recurring integration pattern: platforms that generate restrictive session policies need an explicit parameter so the generated policy includes the regional access point ARN format.

3. LIST ≠ READ (but the fix is simple)

The partial success (LIST works, SELECT doesn't) is confusing but has a clear fix. The root cause is that ListBucket uses bucket-level ARN matching while GetObject requires object-level ARN matching — and the AP ARN parameter resolves both.
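
To make the two matching levels concrete, the sketch below builds the resource ARNs that AWS documents for access points: the access point ARN itself for ListBucket, and the same ARN with an /object/ suffix for GetObject. How Snowflake composes its session policy internally is not visible to me, so this only illustrates the ARN shapes a policy has to cover. The function name is hypothetical.

```python
def access_point_resources(region: str, account: str, ap_name: str) -> dict:
    """Return the IAM resource ARN each S3 action is evaluated against on an access point."""
    ap_arn = f"arn:aws:s3:{region}:{account}:accesspoint/{ap_name}"
    return {
        "s3:ListBucket": ap_arn,                  # access-point-level resource
        "s3:GetObject": f"{ap_arn}/object/*",     # object-level resource
    }


if __name__ == "__main__":
    for action, resource in access_point_resources(
        "ap-northeast-1", "123456789012", "example-ap"
    ).items():
        print(action, resource)
```

A policy that lists only one of these two resources would produce exactly the LIST-works, SELECT-fails split described above.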

4. SVM DNS/AD configuration can silently break S3 AP

ReadTimeout (not AccessDenied) indicates an operational issue, not a session policy issue. Check vserver services dns check on the SVM.

5. Pre-signed URLs work but are not a governed path

GET_PRESIGNED_URL() generates valid URLs for FSx S3 AP objects. However, this bypasses Snowflake query governance and should not be used as a production workaround for regulated workloads.

What to Tell Stakeholders

Current recommendation (8 out of 10 tested AI functions validated on FSx data):

Use Snowflake External Stage with AWS_ACCESS_POINT_ARN for governed read access to FSx for ONTAP data
Use External Tables for governed schema abstraction with tags and access policies
Use COPY INTO when data needs to be loaded into Snowflake for ML/AI processing
Use Directory Table for unstructured data cataloging
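Do not rely on AUTO_REFRESH or Snowpipe auto-ingest (both depend on S3 Event Notifications); schedule ALTER EXTERNAL TABLE ... REFRESH for External Tables and ALTER STAGE ... REFRESH for Directory Tables instead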
Do not position Iceberg write-back on FSx S3 AP as supported
For end-to-end RAG, use Cortex Search (validated: External Table → COPY INTO → Cortex Search Service, 198ms query) or Bedrock Knowledge Bases (AWS-documented path, no copy needed)

This validation should be used to guide architecture selection and stage configuration, not as a production certification.

What's Next

Part 1: Athena — Query NAS Data In Place (validated read-oriented SQL path)
Part 2: Databricks — A Layer-by-Layer Validation of Observed Boundaries (session policy + access_point field)
Part 4: DuckDB Lambda — Serverless analytics at $0.00001/query (for teams that need lightweight, zero-idle-cost SQL without warehouse management)
Part 5: EMR Spark — Read-Write ETL Pipeline (for teams that need distributed Spark processing with write-back to S3 for downstream lakehouse consumption)

References

Key achievement: This validation established that Snowflake + FSx for ONTAP S3 AP provides a governed, AI-ready read path — 8 out of 10 tested Cortex AI functions work on NAS data, External Tables enable full governance (tags, masking, row policies), and Cortex Search delivers 198ms semantic search over NAS-originated documents. This is the most complete governed integration path validated in this series.

This article documents observed behavior in one validated environment (Snowflake Standard edition, AWS ap-northeast-1, May 2026). Platform behavior may change with future updates.

Disclaimer: This article is an independent validation report and does not represent Snowflake, AWS, or NetApp official guidance. Product behavior, support status, and platform capabilities may change. Always validate in your own environment and consult vendor documentation and support channels.

Databricks and FSx for ONTAP S3 Access Points — A Layer-by-Layer Validation of Observed Boundaries

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Sun, 24 May 2026 11:17:38 +0000

TL;DR

Connecting Databricks to FSx for ONTAP S3 Access Points is significantly harder than Athena (Part 1). After testing every approach I could find — Unity Catalog External Locations, NFS mounts, Instance Profiles, multiple VPC configurations — here is what I found:

Unity Catalog's session policy initially blocked the FSx for ONTAP S3 AP ARN pattern → 403
Setting the access_point field on the External Location partially resolves the session policy: explicit-path file read succeeds, but UC table creation, subdirectory listing, and write operations remain blocked — meaning UC governance features (lineage, tags, fine-grained access) cannot yet be applied
NFS kernel mount is blocked by seccomp by design (confirmed by Databricks Support)
Instance Profile + boto3 works for direct S3 AP access (bypassing Unity Catalog)
Spark read with explicit file path works under UC governance — 1000 rows of sensor data readable with full schema inference, proving data access is possible even if table creation is blocked

Quick Decision Guide:

Read-only SQL analytics on NAS data → Use Athena (Part 1) or Snowflake External Table (Part 3)
Governed Databricks lakehouse on NAS data → Stage via FPolicy → Lambda → S3 → Auto Loader → UC Managed Table
Exploratory PoC (time-limited) → Instance Profile + boto3 with compensating controls

This article is a layer-by-layer validation of observed integration boundaries between Databricks and FSx for ONTAP S3 Access Points. It is not an argument against Databricks. Databricks remains a strong platform for lakehouse, ML, and production Delta workloads. This article focuses narrowly on one integration boundary: direct access from Databricks to FSx for ONTAP S3 Access Points.

This article documents the full troubleshooting journey, including the strace analysis that identified the root cause of NFS mount failures.

This article documents observed behavior in one validated environment. It should not be interpreted as a general compatibility statement for all Databricks configurations or future platform versions.

GitHub Repository: fsxn-lakehouse-integrations

If you want to reproduce this validation, the repository's integrations/databricks/ directory contains environment setup notes, and verification-pack/ contains test templates and evidence recording formats. The verification pack is intentionally template-first, so validation runs can produce consistent, reviewable evidence across environments. Actual result files will be added as validation runs are completed.

This article also includes a Snowflake ↔ Databricks concept mapping table (showing which capabilities work on each platform) and an AI Readiness Score to help teams quantitatively compare pattern options for FSx for ONTAP integration.

How to Read This Article

This article is:

A reproduction-focused validation report
Evidence from one environment (DBR 17.3 LTS, ap-northeast-1)
A starting point for vendor confirmation and architecture discussion

This article is not:

A general compatibility statement
A production certification
A statement on behalf of Databricks

Read by role:

Databricks admin: Unity Catalog External Location → Governance Impact Summary
Storage engineer: NFS Mount investigation → Evidence Matrix
Data engineer: Instance Profile + boto3 → Next Validation Metrics
Partner / SA: Decision Matrix → Discovery Questions → Partner Conversation Guide
Opening a support case: Databricks Support Case Packet

Prerequisite Concepts

Before reading this article, it helps to understand:

Unity Catalog Storage Credential — an object that stores a reference to a cloud IAM role for accessing external storage
Unity Catalog External Location — maps a cloud storage path to a storage credential for governed access
Instance Profile on AWS — an IAM role attached to an EC2 instance, providing credentials via IMDS
Databricks-managed VPC vs Customer-managed VPC — whether Databricks or the customer controls the workspace network
Cluster access modes — Standard (shared, multi-user with UC governance) and Dedicated (single-user with sudo access). Unity Catalog requires standard or dedicated access mode
S3 Access Point ARN vs S3 bucket ARN — S3 AP uses arn:aws:s3:<region>:<account>:accesspoint/<name>, not arn:aws:s3:::<bucket>
Driver vs executor behavior in Spark — the driver orchestrates; executors run distributed tasks. Credentials and network paths may differ between them

Key Concepts: Databricks Storage & Ingestion Architecture

Understanding Databricks' storage and ingestion concepts is essential for evaluating FSx for ONTAP S3 AP integration boundaries.

Partner Quick Reference: If your customer asks "Can Databricks read our NAS data via S3 Access Points?" — the answer is "partially, with limitations." File-level read works under UC governance, but table creation and directory listing are blocked. For governed analytics on NAS data today, recommend Snowflake External Table or Athena. For Databricks-specific workloads, recommend staged ingestion to S3 → UC Managed Table (see Recommended Architecture Pattern). If the customer already uses Databricks, the FPolicy → Lambda → S3 → Auto Loader pattern preserves full UC governance on ingested data. See Discovery Questions for customer qualification.

Storage Credential → External Location → External Table/Volume

Storage Credential (IAM Role ARN + External ID)
    │
    └── External Location (cloud storage path + credential + access_point field)
            │
            ├── External Table (tabular data: Parquet, Delta, Iceberg)
            └── External Volume (non-tabular: images, documents, audio)

Concept	Description	FSx S3 AP Status
Storage Credential	IAM Role that Databricks assumes to access cloud storage. During AssumeRole, Databricks generates a session policy that restricts what the assumed session can do — even if the IAM role itself has broader permissions.	✅ Created
External Location	Maps S3 path to a Storage Credential; defines access boundary	✅ Created (with `access_point` field)
External Table	UC-governed table whose data resides in External Location	❌ CREATE TABLE blocked
External Volume	UC-governed volume for unstructured files in External Location	❌ Blocked (same session policy issue)

External Volume is the Databricks equivalent of Snowflake's Directory Table — it provides governed access to non-tabular files (images, documents, audio, video). Since External Volume requires External Location creation with full subdirectory access, it is currently blocked by the same session policy limitation that blocks External Table creation.

Auto Loader (Incremental Ingestion)

Auto Loader is Databricks' equivalent of Snowflake's Snowpipe — it incrementally processes new files as they arrive in cloud storage.

Mode	Description	FSx S3 AP Status
Directory Listing	Periodically lists directory to find new files	⚠️ Blocked (External Location exists, but subdirectory listing fails)
File Notification	Uses S3 Event Notifications + SQS for real-time detection	❌ Not possible (FSx S3 AP doesn't support S3 Events)

Auto Loader supported formats (8 formats): JSON, CSV, Parquet, Avro, ORC, XML, TEXT, BINARYFILE.

FSx S3 AP latency context: Even if Directory Listing mode were unblocked, FSx S3 AP ListObjectsV2 latency is significantly higher than native S3 (tens of seconds to minutes for large directories). This would impact Auto Loader polling intervals and new-file detection speed. Plan for minutes-level detection latency, not seconds.

Concept Mapping: Snowflake ↔ Databricks

Snowflake Concept	Databricks Equivalent	FSx S3 AP (Snowflake)	FSx S3 AP (Databricks)
Storage Integration	Storage Credential	✅	✅
External Stage	External Location	✅	✅ (partial)
External Table	External Table	✅	❌ Blocked
Directory Table	External Volume	✅	❌ Blocked
Snowpipe	Auto Loader	⚠️ (no S3 Events)	❌ Blocked
COPY INTO	COPY INTO / Auto Loader	✅	❌ Blocked
`AWS_ACCESS_POINT_ARN`	`access_point` field	✅ (resolves all)	⚠️ (partial resolution)
Cortex Search (RAG)	Mosaic AI / MLflow	✅ (via COPY INTO)	⚠️ (boto3 + external)

Data Ingestion Alternatives for FSx for ONTAP (When Auto Loader Is Blocked)

Throughput constraint: All S3 AP operations are bounded by the FSx for ONTAP file system's provisioned throughput capacity (e.g., 128 MB/s in this validation environment). This throughput is shared with NFS/SMB workloads on the same file system. Plan ingestion windows and concurrent access accordingly.
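
As a rough planning aid, the throughput ceiling translates into a lower bound on ingestion time. The sketch below is back-of-the-envelope arithmetic only: it treats 1 GB as 1024 MB, ignores per-request latency, and models NFS/SMB contention with a hypothetical share parameter (the fraction of throughput available to S3 AP reads).

```python
def min_transfer_seconds(
    dataset_gb: float, throughput_mb_s: float = 128.0, share: float = 1.0
) -> float:
    """Lower bound on read time when S3 AP traffic gets `share` of file system throughput."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return dataset_gb * 1024 / (throughput_mb_s * share)


if __name__ == "__main__":
    # 100 GB at the full 128 MB/s, then with half the throughput left for S3 AP reads.
    print(min_transfer_seconds(100))
    print(min_transfer_seconds(100, share=0.5))
```

With these assumptions, 100 GB needs at least 800 seconds at full throughput and 1,600 seconds when NFS/SMB workloads consume half of it. Real runs will be slower.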

Since Auto Loader needs directory listing through an External Location (subdirectory listing is currently blocked on FSx S3 AP), use these alternatives:

Method	Description	Latency	Governance
FPolicy → Lambda → S3 → Auto Loader	FPolicy detects file changes → Lambda copies to S3 → Auto Loader ingests	Seconds	✅ Full UC (on S3 copy)
AWS Glue ETL	Glue job reads from FSx S3 AP → writes to S3/Delta	Minutes	AWS-side
EMR Serverless	Spark job reads from FSx S3 AP → writes to S3/Delta	Minutes	AWS-side
AWS DataSync	Scheduled sync from FSx NFS → S3 bucket	Minutes-Hours	AWS-side
SnapMirror to S3	ONTAP-native replication to S3 bucket	Minutes	ONTAP-side

SnapMirror to S3 caveat: Object metadata in SnapMirror S3 targets differs from NFS file metadata. Validate schema compatibility and file naming conventions before using SnapMirror S3 as an ingestion path for analytics engines.

Recommended production pattern:

FSx for ONTAP ──FPolicy──▶ Lambda ──▶ S3 Bucket ──▶ Auto Loader ──▶ Delta Table (UC governed)

Iceberg interoperability note: Once data is in UC as a managed Delta or Iceberg table, external engines can access it via UC's Iceberg REST Catalog, enabling Athena, EMR, and Trino to query the same governed table without data duplication. This makes the staged S3 → UC path a hub for multi-engine access.

AI Readiness Score

Pattern	Governance	Performance	AI Capability	Cost	Operational Simplicity	Overall
Athena + FSx S3 AP	★★★☆☆	★★★★☆	★☆☆☆☆ (SQL only)	★★★★★	★★★★★	3.6
Snowflake External Table	★★★★☆	★★★☆☆	★★★★☆ (Cortex AI)	★★★★★	★★★★☆	4.0
Staged to S3 → UC Table	★★★★★	★★★★★	★★★★★ (full Mosaic AI)	★★☆☆☆	★★☆☆☆	3.8
boto3 PoC (Databricks)	★☆☆☆☆	★★☆☆☆	★★★☆☆ (driver-only)	★★★★★	★★★☆☆	2.8
Bedrock KB + FSx S3 AP	★★★☆☆	★★★★☆	★★★★☆ (RAG)	★★★★☆	★★★★☆	3.8

Governance: UC lineage, tags, masking, row filters
Performance: Query latency, distributed processing
AI Capability: Breadth of AI/ML functions available
Cost: Storage efficiency, compute cost
Operational Simplicity: Setup, maintenance, pipeline complexity

Scoring methodology: Each dimension rated by the author based on validated evidence in this article series. This is not an official AWS assessment or certification. Scores reflect observed capabilities in one test environment.

Performance note: Performance scores reflect relative comparison within FSx S3 AP access patterns, not comparison with native S3 bucket performance. All patterns accessing FSx S3 AP have higher latency than equivalent native S3 operations.

How to use this score: Use Overall score as a starting point for pattern selection. Scores ≥ 4.0 indicate strong fit for governed production workloads. Scores 3.5–3.9 indicate viable paths with trade-offs. Scores < 3.0 indicate PoC-only paths requiring compensating controls.
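
The Overall column in the table matches an unweighted mean of the five star ratings, rounded to one decimal. The sketch below reproduces that arithmetic and applies the score bands from the paragraph above. Treating Overall as a plain mean is my reading of the numbers, not a stated formula, and scores from 3.0 to 3.4 fall outside the stated bands.

```python
# Star ratings in table order: Governance, Performance, AI Capability, Cost, Simplicity.
STARS = {
    "Athena + FSx S3 AP": (3, 4, 1, 5, 5),
    "Snowflake External Table": (4, 3, 4, 5, 4),
    "Staged to S3 → UC Table": (5, 5, 5, 2, 2),
    "boto3 PoC (Databricks)": (1, 2, 3, 5, 3),
    "Bedrock KB + FSx S3 AP": (3, 4, 4, 4, 4),
}


def overall(stars: tuple) -> float:
    """Unweighted mean of the dimension ratings, rounded to one decimal."""
    return round(sum(stars) / len(stars), 1)


def band(score: float) -> str:
    """Apply the score bands stated in the article."""
    if score >= 4.0:
        return "strong fit for governed production workloads"
    if 3.5 <= score <= 3.9:
        return "viable path with trade-offs"
    if score < 3.0:
        return "PoC-only, requires compensating controls"
    return "not classified by the article"


if __name__ == "__main__":
    for name, stars in STARS.items():
        print(f"{name}: {overall(stars)} ({band(overall(stars))})")
```

Because every dimension carries equal weight, a team that cares mostly about governance or cost should re-rank the patterns with its own weights rather than rely on the Overall value.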

When to choose which:

Choose Snowflake External Table (4.0) when governed AI on NAS data without copying is the priority
Choose Staged to S3 → UC Table (3.8) when maximum Databricks performance and full Mosaic AI are required (accepts data duplication cost)
Choose Bedrock KB (3.8) when AWS-native RAG with zero-copy on FSx is the primary requirement
Choose boto3 PoC (2.8) only for time-limited exploration with explicit approval; with compensating controls (see Compensating Controls section), governance risk can be partially mitigated for PoC scope. Post-expiration actions must be defined: terminate cluster, remove instance profile, archive evidence.

The Goal

Process unstructured data (images, documents, audio) stored on FSx for ONTAP from Databricks — without copying data to S3. FSx for ONTAP S3 Access Points should make this possible by exposing NFS/SMB file data via S3 API.

In Part 1, Athena worked cleanly in my validation using the official AWS tutorial pattern. Databricks, however, has multiple security layers that interact with S3 AP in unexpected ways.

Test Environment

I tested across two workspace configurations:

Runtime scope: Only DBR 17.3 LTS (Spark 4.0.0) was tested. This article does not compare DBR 16.x, 18.x, ML runtimes, GPU runtimes, serverless compute, or access modes beyond those listed below. Behavior may differ across runtime versions and compute types.

┌─────────────────────────────────────────────────────────────────────┐
│ Workspace 1: Databricks-managed VPC                                 │
│ - VPC created and managed by Databricks                             │
│ - Limited network control                                           │
│ - VPC Peering to FSx for ONTAP VPC                                  │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ Workspace 2: Customer-managed VPC (same VPC as FSx for ONTAP)       │
│ - Full network control                                              │
│ - Direct connectivity to FSx for ONTAP (no peering needed)          │
│ - NAT Gateway for Databricks control plane                          │
└─────────────────────────────────────────────────────────────────────┘

Cluster modes tested:

Standard (Shared Access)
Dedicated (Single User) — provides sudo/root access
Dedicated with Instance Profile

All tests used DBR 17.3 LTS (Spark 4.0.0), ap-northeast-1.

Approach 1: Unity Catalog External Location

The Setup

The Databricks-governed path for S3 data access is to create a Storage Credential and External Location. I tested whether the same pattern could work with an FSx for ONTAP S3 Access Point.

# What I expected to work
files = dbutils.fs.ls("s3://<FSx-S3-AP-alias>/")

The Error

AccessDenied: User: arn:aws:sts::<ACCOUNT>:assumed-role/databricks-...-cross-account-role/
  databricks-unity-catalog-credential-<WORKSPACE_ID>
is not authorized to perform: s3:ListBucket on resource:
  "arn:aws:s3:<REGION>:<ACCOUNT>:accesspoint/<AP_NAME>"
because no session policy allows the s3:ListBucket action

Observed Boundary

Unity Catalog applies a session policy when it calls AssumeRole. This session policy acts as a permissions boundary — even if the IAM role has s3:* on *, the session policy restricts what the assumed session can do.

The evidence narrows the failure domain, but does not identify Databricks internal implementation details.

In this validation, the generated session policy behavior allowed access to a standard S3 bucket path but did not allow the FSx for ONTAP S3 Access Point ARN pattern:

arn:aws:s3:::bucket-name       ✅ Allowed
arn:aws:s3:::bucket-name/*     ✅ Allowed

But FSx for ONTAP S3 AP uses a different ARN format:

arn:aws:s3:<region>:<account>:accesspoint/<name>    ❌ Not in session policy
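To make the boundary concrete, here is a minimal sketch of the matching logic, assuming the session policy lists only bucket-style resource patterns. I did not inspect the real generated session policy, so the patterns, the bucket name, and the access point name below are hypothetical. The sketch uses Python's `fnmatchcase` as a stand-in for IAM wildcard matching, which is a simplification.

```python
from fnmatch import fnmatchcase

# Hypothetical session policy resources: bucket-style ARNs only.
SESSION_POLICY_RESOURCES = [
    "arn:aws:s3:::my-workspace-storage-bucket",
    "arn:aws:s3:::my-workspace-storage-bucket/*",
]


def matches_any(resource_arn: str, patterns: list[str]) -> bool:
    """True if the ARN matches at least one wildcard pattern."""
    return any(fnmatchcase(resource_arn, p) for p in patterns)


def effectively_allowed(resource_arn: str,
                        role_patterns: list[str],
                        session_patterns: list[str]) -> bool:
    """A session policy can only narrow the role: both must allow the resource."""
    return (matches_any(resource_arn, role_patterns)
            and matches_any(resource_arn, session_patterns))


role = ["*"]  # the IAM role itself allows everything
bucket_object = "arn:aws:s3:::my-workspace-storage-bucket/data/file.txt"
access_point = "arn:aws:s3:ap-northeast-1:123456789012:accesspoint/example-ap"

print(effectively_allowed(bucket_object, role, SESSION_POLICY_RESOURCES))
print(effectively_allowed(access_point, role, SESSION_POLICY_RESOURCES))
```

The point of the sketch is that a broad role policy does not help: the access point ARN has a region and account segment and an `accesspoint/` resource type, so no bucket-style pattern can match it.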

Proof

The same IAM role works fine for regular S3 buckets through Unity Catalog:

# This works — regular S3 bucket
dbutils.fs.ls("s3://my-workspace-storage-bucket/")
# SUCCESS

# This fails — FSx for ONTAP S3 Access Point
dbutils.fs.ls("s3://<FSx-S3-AP-alias>/")
# AccessDenied: no session policy allows...

Status

In my initial validation, this behaved like a platform boundary in Unity Catalog's generated session policy. I opened a support case to confirm whether S3 Access Point ARN patterns can be supported for external locations.

Before (access_point field not set) — Unity Catalog session policy blocks all S3 AP operations:

Without the access_point field, dbutils.fs.ls on the S3 AP alias returns UNAUTHORIZED_ACCESS. The session policy only allows standard S3 bucket ARNs.

Update (2026-05-24): `access_point` Field Resolves Session Policy

Critical Update (2026-05-26): Databricks Support subsequently confirmed that the access_point field was never released as a generally available feature and has been removed from documentation. The partial success described below is "a side effect of incomplete internal handling, not a supported code path." Unity Catalog External Locations do not currently support S3 Access Points. See the full support confirmation at the end of this section.

Original finding, superseded by the Critical Update above: Databricks Support initially indicated (May 2026) that Unity Catalog External Locations support an access_point field. In my validation, setting this field appeared to include the S3 AP ARN in the generated session policy.

Configuration that works:

External Location:
  URL: s3://<FSx-S3-AP-alias>/
  Credential: <storage-credential-name>
  access_point: arn:aws:s3:<region>:<account>:accesspoint/<ap-name>

API call to set the field:

curl -X PATCH \
  "https://<workspace>/api/2.1/unity-catalog/external-locations/<location-name>" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"access_point": "arn:aws:s3:<region>:<account>:accesspoint/<ap-name>"}'

What now works under UC governance:

Operation	Result	Notes
`dbutils.fs.ls("s3://<alias>/")`	✅	Top-level listing (287 items)
`dbutils.fs.head("s3://<alias>/file.txt")`	✅	Read file content
`spark.read.text("s3://<alias>/file.txt")`	✅	Spark read with explicit file path
`spark.read.csv("s3://<alias>/path/to/file.csv")`	✅	1000 rows, schema inferred

After (access_point field set) — Top-level listing succeeds, 287 items visible:

With the access_point field configured, dbutils.fs.ls at the top level returns 287 items from the FSx for ONTAP volume.

Sensor data read via Spark — 1000 rows with schema inference:

spark.read.csv with explicit file path successfully reads 1000 sensor readings with full schema inference (timestamp, machine_id, temperature_c, vibration_mm_s, pressure_bar, rpm, status, location).

What still does NOT work:

Operation	Result	Error
`dbutils.fs.ls("s3://<alias>/subdir/")`	❌	AccessDenied on getFileStatus
`spark.read.load("s3://<alias>/subdir/")`	❌	Forbidden (directory-level access)
`CREATE TABLE LOCATION 's3://<alias>/...'`	❌	UC_CLOUD_STORAGE_ACCESS_FAILURE
`dbutils.fs.cp` (PutObject)	❌	AccessDenied

Remaining blockers — Subdirectory listing and UC table creation fail:

Subdirectory dbutils.fs.ls returns UNAUTHORIZED_ACCESS. CREATE TABLE LOCATION fails with UC_CLOUD_STORAGE_ACCESS_FAILURE. Without a UC table, governance features (lineage, tags, fine-grained access control) cannot be applied.

Summary: Data is readable but not governable. The critical blocker is CREATE TABLE LOCATION failure, which prevents Unity Catalog governance from being applied to the data.

Key pattern: File-level read operations succeed (GetObject with explicit key). Directory-level operations (ListObjectsV2 with prefix, HeadObject on prefix) fail for subdirectories. This suggests the session policy scopes ListObjectsV2 to the root prefix only.

Implication: Explicit-path file read works, but without UC table creation, Unity Catalog governance features — lineage, fine-grained access control, governance tags, column masking, row filtering — cannot be applied. The data is technically readable through the External Location path but not registerable as a governed UC table. This limits the practical value for production governance use cases until the subdirectory listing and table creation issues are resolved.

Requirements for this path:

Customer-managed VPC workspace (same VPC as FSx for ONTAP)
External Location with access_point field set
Storage Credential IAM role with S3 AP permissions
NAT Gateway for control plane connectivity

Approach 2: NFS Mount (Managed VPC)

The Idea

If S3 AP doesn't work through Unity Catalog, mount the FSx for ONTAP volume directly via NFS.

The Setup

Created VPC Peering between Databricks-managed VPC and FSx for ONTAP VPC. Updated route tables and security groups.

The Result

%sh
timeout 3 bash -c 'echo > /dev/tcp/10.0.3.133/2049' && echo "REACHABLE" || echo "NOT REACHABLE"
# NOT REACHABLE

NFS port (TCP 2049) is unreachable from Databricks-managed VPC, even with VPC Peering configured. From the customer-controlled routing perspective, route tables and FSx for ONTAP-side security groups were configured to allow NFS. However, cluster-side egress remained governed by the Databricks-managed environment, and NFS egress was not permitted.

Lesson

Databricks-managed VPC gives you limited network control. The egress rules on cluster instances are managed by Databricks, not by customer-added security group rules.

Approach 3: NFS Mount (Customer-managed VPC)

The Setup

Deployed a new workspace in the same VPC as FSx for ONTAP. No peering needed — direct L3 connectivity.

Network Verification (All Pass)

%sh
echo "TCP 2049 (NFS):"
timeout 3 bash -c 'echo > /dev/tcp/10.0.3.133/2049' && echo "REACHABLE"
echo "TCP 111 (portmapper):"
timeout 3 bash -c 'echo > /dev/tcp/10.0.3.133/111' && echo "REACHABLE"
echo "TCP 635 (mountd):"
timeout 3 bash -c 'echo > /dev/tcp/10.0.3.133/635' && echo "REACHABLE"

TCP 2049 (NFS): REACHABLE ✅
TCP 111 (portmapper): REACHABLE ✅
TCP 635 (mountd): REACHABLE ✅

Note: The /dev/tcp test confirms TCP reachability. NFSv3 mountd may use TCP or UDP depending on configuration. The exact transport should be validated with rpcinfo if needed.
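The same TCP reachability check can be done from a Python notebook cell instead of `%sh`. This is a minimal sketch using only the standard library; the function name is my own, and like the `/dev/tcp` test it says nothing about UDP or about whether a mount will succeed.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Example usage against the ports checked above (IP taken from this article):
# for name, port in [("NFS", 2049), ("portmapper", 111), ("mountd", 635)]:
#     state = "REACHABLE" if tcp_reachable("10.0.3.133", port) else "NOT REACHABLE"
#     print(f"TCP {port} ({name}): {state}")
```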

sudo Access (Dedicated Mode)

%sh
sudo whoami
# root ✅

NFS Client Installation and Export Verification

%sh
sudo apt-get install -y nfs-common
showmount -e 10.0.3.133

Export list for 10.0.3.133:
/vol1 (everyone) ✅

Everything looks perfect. Network connected, root access available, NFS exports visible. Let's mount:

The Mount Attempt

%sh
sudo mkdir -p /mnt/fsxn
sudo mount -t nfs -o nfsvers=3,nolock 10.0.3.133:/vol1 /mnt/fsxn

mount.nfs: access denied by server while mounting 10.0.3.133:/vol1

Wait, what? The server is showing the export to everyone, we have root access, the network is connected... why "access denied by server"?

The Investigation: Why NFS Mount Fails

This is where it gets interesting. The error message says "access denied by server" — but is it really the server?

Step 1: Verify ONTAP Export Policy

Via ONTAP REST API (accessible from the same cluster):

{
  "rules": [{
    "clients": [{"match": "0.0.0.0/0"}],
    "ro_rule": ["any"],
    "rw_rule": ["any"],
    "superuser": ["any"],
    "protocols": ["any"]
  }]
}

The export policy is maximally permissive — all clients, all protocols, read-write, superuser. ONTAP is not denying access.

Important: This permissive export policy was used only to eliminate ONTAP export restrictions as a variable during troubleshooting. It is not a production recommendation. For production, restrict: client CIDR, protocol, read/write rule, superuser mapping, and volume/junction path scope.

ONTAP Production Hardening Checklist

For production deployments, harden the ONTAP configuration:

[ ] Restrict export policy client CIDR to known analytics subnets only
[ ] Avoid rw=any and superuser=any — use explicit security flavors
[ ] Map S3 Access Point file system user to a least-privilege NAS user (not root/UID 0)
[ ] Validate NFS/SMB ACL behavior when S3 AP is active
[ ] Validate S3 API access against file-level permissions
[ ] Capture ONTAP audit evidence where required (ONTAP FPolicy)
[ ] Document junction path and volume scope
[ ] Isolate analytics volumes from production NFS/SMB workloads if throughput contention is a concern
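The first two checklist items can be checked mechanically against an export policy in the JSON shape shown in Step 1. The sketch below is an assumption-laden helper, not an ONTAP tool: the function name is hypothetical, and it only looks at the `clients`, `rw_rule`, and `superuser` fields that appear in that JSON.

```python
import json


def permissive_findings(policy: dict) -> list[str]:
    """List export rules that are open to all clients or allow 'any' for rw/superuser."""
    findings = []
    for i, rule in enumerate(policy.get("rules", [])):
        clients = [c.get("match") for c in rule.get("clients", [])]
        if "0.0.0.0/0" in clients:
            findings.append(f"rule {i}: open to all clients (0.0.0.0/0)")
        for field in ("rw_rule", "superuser"):
            if "any" in rule.get(field, []):
                findings.append(f"rule {i}: {field} allows 'any'")
    return findings


# The troubleshooting policy from Step 1 (not a production configuration).
troubleshooting_policy = json.loads("""
{
  "rules": [{
    "clients": [{"match": "0.0.0.0/0"}],
    "ro_rule": ["any"],
    "rw_rule": ["any"],
    "superuser": ["any"],
    "protocols": ["any"]
  }]
}
""")

for finding in permissive_findings(troubleshooting_policy):
    print(finding)
```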

Step 2: strace the mount command

%sh
sudo strace -f -e trace=mount mount -t nfs -o nfsvers=3,nolock 10.0.3.133:/vol1 /mnt/fsxn 2>&1

mount.nfs: trying 10.0.3.133 prog 100003 vers 3 prot TCP port 2049
mount.nfs: trying 10.0.3.133 prog 100005 vers 3 prot UDP port 635
mount("10.0.3.133:/vol1", "/mnt/fsxn", "nfs", ...) = -1 EACCES (Permission denied)
mount.nfs: mount(2): Permission denied

Key finding: mount.nfs successfully connects to both NFS (port 2049) and mountd (port 635), but the mount() syscall returns EACCES. The denial happens at the kernel level, not at the server.

TCP/UDP note: The initial reachability check used /dev/tcp, confirming TCP reachability. During the actual mount attempt, mount.nfs tried mountd over UDP as shown in the strace output. This is not a contradiction — NFSv3 mountd may use either transport. For production troubleshooting, use rpcinfo and packet capture to confirm the actual protocol and port mapping.

Step 3: Manual NFS RPC Calls (User-space)

To prove ONTAP is granting access, I performed manual NFS RPC calls using Python sockets:

import socket, struct

# MOUNT RPC (program 100005, version 3, procedure MNT)
# mount_rpc_packet is a hand-built ONC RPC CALL message (construction not shown here)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(5)
sock.sendto(mount_rpc_packet, ("10.0.3.133", 635))
response = sock.recv(4096)
# Parse: status=0 (MNT3_OK), file_handle=44 bytes
print("MOUNT RPC: SUCCESS ✅")

# NFS3 FSINFO, GETATTR, READDIRPLUS — all succeed
print("NFS3 FSINFO: SUCCESS ✅")
print("NFS3 GETATTR: SUCCESS ✅")
print("NFS3 READDIRPLUS: SUCCESS ✅")

All NFS operations succeed at user-space level. ONTAP grants full access. The problem is not the server.
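For readers who want to reproduce the MOUNT step, here is a minimal sketch of how such a packet could be built and how the reply could be parsed, following the ONC RPC and MOUNT v3 wire formats. It is not the exact code used in this validation: the helper names are hypothetical, and the AUTH_SYS credential with uid/gid 0 and the machine name are assumptions. It targets UDP, so no TCP record marking is added.

```python
import struct

MOUNT_PROGRAM = 100005   # matches "prog 100005" in the mount.nfs output
MOUNT_VERSION = 3
MOUNTPROC3_MNT = 1
AUTH_NONE, AUTH_SYS = 0, 1


def xdr_opaque(data: bytes) -> bytes:
    """XDR variable-length opaque: 4-byte length, data, zero padding to 4 bytes."""
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad


def build_mnt_call(xid: int, export_path: str, machine: str = "client") -> bytes:
    """Build an RPC CALL for MOUNT v3 MNT with an AUTH_SYS credential (uid=0, gid=0)."""
    # AUTH_SYS body: stamp, machine name, uid, gid, number of auxiliary gids
    cred_body = (struct.pack(">I", 0) + xdr_opaque(machine.encode())
                 + struct.pack(">III", 0, 0, 0))
    # xid, msg_type=CALL(0), rpcvers=2, program, version, procedure
    header = struct.pack(">IIIIII", xid, 0, 2,
                         MOUNT_PROGRAM, MOUNT_VERSION, MOUNTPROC3_MNT)
    cred = struct.pack(">I", AUTH_SYS) + xdr_opaque(cred_body)
    verf = struct.pack(">I", AUTH_NONE) + xdr_opaque(b"")
    return header + cred + verf + xdr_opaque(export_path.encode())


def parse_mnt_reply(reply: bytes) -> tuple[int, bytes]:
    """Return (mount status, file handle) from an accepted MNT reply."""
    _xid, msg_type, reply_stat = struct.unpack_from(">III", reply, 0)
    if msg_type != 1 or reply_stat != 0:
        raise ValueError("not an accepted RPC reply")
    (verf_len,) = struct.unpack_from(">I", reply, 16)
    offset = 20 + verf_len + (4 - verf_len % 4) % 4
    accept_stat, status = struct.unpack_from(">II", reply, offset)
    if accept_stat != 0:
        raise ValueError(f"RPC call not executed (accept_stat={accept_stat})")
    if status != 0:          # anything other than MNT3_OK carries no file handle
        return status, b""
    (fh_len,) = struct.unpack_from(">I", reply, offset + 8)
    return status, reply[offset + 12: offset + 12 + fh_len]


mount_rpc_packet = build_mnt_call(xid=1, export_path="/vol1")
print(len(mount_rpc_packet), "bytes")
```

A status of 0 with a non-empty file handle corresponds to the "status=0 (MNT3_OK)" result described above.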

Step 4: tmpfs Mount Test

%sh
sudo mount -t tmpfs tmpfs /tmp/test_mount && echo "SUCCESS" || echo "FAILED"

SUCCESS ✅

The mount() syscall itself is allowed. Only NFS filesystem type is blocked.

Step 5: Seccomp Status

%sh
grep Seccomp /proc/self/status

Seccomp:        2
Seccomp_filters:        1

Seccomp: 2 = BPF filter mode active.
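The `Seccomp` field in `/proc/<pid>/status` takes the values 0 (disabled), 1 (strict), and 2 (filter). A small sketch for reading it from Python follows; the function name is hypothetical, and the result only describes the process whose status text is passed in, not every process on the node.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter (BPF)"}


def seccomp_mode(status_text: str) -> str:
    """Extract the Seccomp mode from the contents of a /proc/<pid>/status file."""
    for line in status_text.splitlines():
        # "Seccomp_filters:" is a different field and must not match here.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, f"unknown ({value})")
    return "not reported"


# Sample taken from the output shown above.
sample = "Seccomp:\t2\nSeccomp_filters:\t1\n"
print(seccomp_mode(sample))

# On a cluster node, the current process can be inspected with:
# with open("/proc/self/status") as f:
#     print(seccomp_mode(f.read()))
```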

The Conclusion

┌─────────────────────────────────────────────────────────────────┐
│ Evidence Chain:                                                 │
│                                                                 │
│ 1. Network connectivity      → ✅ All NFS ports reachable       │
│ 2. ONTAP export policy       → ✅ 0.0.0.0/0, rw=any, su=any     │
│ 3. NFS RPC (user-space)      → ✅ All operations succeed        │
│ 4. mount() with type="nfs"   → ❌ EACCES                        │
│ 5. mount() with type="tmpfs" → ✅ Success                       │
│ 6. Seccomp                   → Active (BPF filter mode)         │
│                                                                 │
│ Conclusion: The evidence points to a local platform security    │
│ boundary, likely seccomp filtering or an equivalent runtime     │
│ restriction, blocking the NFS mount path.                       │
└─────────────────────────────────────────────────────────────────┘

The error message "access denied by server" is misleading. The mount.nfs program interprets the kernel's EACCES as a server-side denial, but strace reveals the truth: the denial is local.

If sharing this finding: This is not a Databricks compatibility verdict. It is a layer-by-layer validation of observed boundaries in one environment (DBR 17.3 LTS, ap-northeast-1). Platform behavior may differ across runtime versions, access modes, and configurations.

Important: Because Databricks does not publicly document this specific syscall/filesystem-type behavior, treat this as validation evidence rather than an official platform statement until confirmed by Databricks Support.

All Mount Options Tested

Options	Result
`-o nfsvers=3,nolock`	access denied
`-o nfsvers=4.1`	access denied
`-o nfsvers=3,nolock,resvport`	access denied
`-o nfsvers=3,nolock,noresvport`	access denied
`-o sec=sys`	access denied
(no options)	access denied
tmpfs	SUCCESS

Evidence Matrix

Layer	Evidence	Result	Interpretation
Network	TCP 2049 / TCP 111 / TCP 635 reachable	✅ Pass	Network path exists between cluster and FSx for ONTAP
ONTAP export	Export policy allows 0.0.0.0/0, rw=any, su=any	✅ Pass	Export policy is not the blocker
NFS server RPC	MOUNT / FSINFO / GETATTR / READDIRPLUS succeed via user-space	✅ Pass	ONTAP grants NFS operations to this client
Local syscall	`mount(type=nfs)` returns EACCES	❌ Fail	Evidence points to a local runtime boundary affecting kernel NFS mount
Local syscall control	`mount(type=tmpfs)` succeeds	✅ Pass	`mount()` syscall is not universally blocked
Runtime security	Seccomp mode 2 observed in the tested process context	Observed	Runtime filtering may restrict NFS-specific mount
Unity Catalog S3	External Location test on S3 AP ARN → AccessDenied	❌ Fail	Session policy does not allow S3 AP ARN pattern
Instance Profile S3	boto3 GetObject on S3 AP → Success	✅ Pass	IAM role itself has correct permissions

showmount -e confirms only that the export is visible through mountd. It does not guarantee that the local runtime allows the kernel NFS mount to complete, and it does not validate the file system user identity associated with the S3 Access Point. For S3 AP authorization, record the associated UNIX or Windows identity and verify file-level permissions separately; these are independent authorization paths.

FSx for ONTAP S3 AP Authorization Path

FSx for ONTAP S3 Access Points use a dual-layer authorization model that combines AWS IAM permissions with file system-level permissions:

Layer 1 — S3-side authorization:

IAM identity-based policy (caller's permissions)
S3 Access Point resource policy
VPC endpoint policy (if applicable)
SCP / RCP (if applicable)

Layer 2 — FSx for ONTAP-side authorization:

File system user associated with the access point
UNIX mode-bits / NFSv4 ACLs (for UNIX security style volumes)
Windows ACLs (for NTFS security style volumes)

In the Databricks validation, the failure occurs before Layer 2 — Unity Catalog's generated session policy restricts the assumed role session at the S3 API level, preventing the request from reaching FSx for ONTAP-side authorization. The Instance Profile + boto3 path bypasses Unity Catalog's session policy, allowing both layers to be evaluated normally.

For production, both layers must be configured with least-privilege. A permissive file system user (e.g., root / UID 0) combined with a broad IAM policy creates an overly permissive access path.
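As a mental model only, the two layers can be pictured as sequential gates where the first denial wins. The sketch below is a deliberate simplification with hypothetical names; it does not reproduce real IAM policy evaluation or ONTAP permission checks, and it treats the Unity Catalog session policy as one more S3-side check.

```python
def evaluate(s3_side_checks: dict[str, bool], fs_side_allows: bool) -> str:
    """Report where a request stops in the simplified two-layer model."""
    # Layer 1: every S3-side check must allow the request.
    for name, allowed in s3_side_checks.items():
        if not allowed:
            return f"denied at Layer 1 ({name})"
    # Layer 2: the file system user must have permission on the file.
    if not fs_side_allows:
        return "denied at Layer 2 (file system permissions)"
    return "allowed"


# Unity Catalog path as observed in this validation: stopped before Layer 2.
print(evaluate({"identity policy": True, "session policy": False}, True))

# Instance Profile + boto3 path: no session policy, both layers evaluated.
print(evaluate({"identity policy": True, "access point policy": True}, True))
```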

Approach 4: Instance Profile + boto3

The Setup

Customer-managed VPC workspace, Dedicated cluster with an Instance Profile attached.

IMDS Access

import urllib.request, json

# IMDSv2 token
req = urllib.request.Request(
    "http://169.254.169.254/latest/api/token",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    method="PUT"
)
token = urllib.request.urlopen(req, timeout=2).read().decode()
print(f"Token: {token[:20]}...")  # ✅ Success
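Once a token is available, the same IMDSv2 endpoint can report which IAM role the instance profile provides, which is a quick way to confirm that the expected role is attached. This is a minimal sketch; the helper names are my own, and only the metadata path and token header are part of the IMDS interface.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"


def imds_request(path: str, token: str) -> urllib.request.Request:
    """Build an IMDSv2 GET request that carries the session token."""
    return urllib.request.Request(
        f"{IMDS_BASE}/{path}",
        headers={"X-aws-ec2-metadata-token": token},
        method="GET",
    )


def instance_profile_role(token: str, timeout: float = 2.0) -> str:
    """Return the IAM role name exposed through the instance profile."""
    req = imds_request("meta-data/iam/security-credentials/", token)
    return urllib.request.urlopen(req, timeout=timeout).read().decode().strip()


# Example usage on a cluster where IMDS is reachable, reusing the token from above:
# print(instance_profile_role(token))
```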

Regular S3 Access

import boto3
s3 = boto3.client("s3", region_name="ap-northeast-1")
buckets = s3.list_buckets()
print(f"ListBuckets: {len(buckets['Buckets'])} buckets")  # ✅ 58 buckets

FSx for ONTAP S3 AP Access

response = s3.list_objects_v2(
    Bucket="<FSx-S3-AP-alias>",
    MaxKeys=10
)
print(f"Objects: {response['KeyCount']}")  # ✅ Works

This works. Instance Profile credentials bypass Unity Catalog's session policy entirely. boto3 talks directly to the S3 API with the EC2 instance's IAM role.

Governance warning
Instance Profile + boto3 is a pragmatic workaround for PoC and controlled experiments. It bypasses Unity Catalog governance, including fine-grained access control, lineage, and centralized data access auditing. Do not treat this as a production lakehouse governance pattern without a separate security and compliance review. Databricks recommends Unity Catalog external locations as the standard governed access mechanism.

Scope note
The Instance Profile + boto3 sample above runs on the driver node only (single-node PoC pattern). Whether the same credential, network path, and concurrency behavior applies to Spark executors in a multi-node cluster requires separate validation.
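One way to start that executor-side validation is a per-partition reader that creates its own S3 client on whichever node runs it. The sketch below is untested at executor scale: the function name and the client-factory parameter are hypothetical, and the Spark wiring is shown only as a comment because it was not part of this validation.

```python
from typing import Callable, Iterable, Iterator, Tuple


def read_partition(keys: Iterable[str],
                   bucket: str,
                   client_factory: Callable[[], object]) -> Iterator[Tuple[str, int]]:
    """Fetch each key and yield (key, size in bytes).

    The client is created inside the function so that, under mapPartitions,
    it is built on the executor rather than serialized from the driver.
    """
    s3 = client_factory()
    for key in keys:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        yield key, len(body)


# Possible wiring on a multi-node Dedicated cluster (assumption, not validated here):
# import boto3
# make_client = lambda: boto3.client("s3", region_name="ap-northeast-1")
# rdd = spark.sparkContext.parallelize(keys, numSlices=8)
# sizes = rdd.mapPartitions(
#     lambda part: read_partition(part, "<FSx-S3-AP-alias>", make_client)
# ).collect()
```

Passing a factory instead of a client also makes the function easy to exercise with a stub before touching real data.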

Approach 5: S3 AP + Instance Profile (Managed VPC with VPC Peering)

The Hypothesis

If Instance Profile + boto3 works on a Customer-managed VPC (Approach 4), does it also work from a Databricks-managed VPC with VPC Peering to the FSx for ONTAP VPC? This would validate whether the S3 Gateway Endpoint in the Databricks-managed VPC can route S3 AP requests to the FSx for ONTAP backend.

The Setup

Databricks-managed VPC (vpc-060209589cbe4c298, CIDR: 10.53.0.0/16)
FSx for ONTAP VPC (vpc-0ae01826f906191af, CIDR: 10.0.0.0/16)
VPC Peering: pcx-02167ddf900a30782 (active)
Route tables: updated in both directions
FSx for ONTAP security group: allows all traffic (0.0.0.0/0)
S3 Gateway Endpoint: vpce-020b59ab4da0b44b8 (full access policy)
Cluster: m5.large × 3, DBR 17.3 LTS, Dedicated mode, Instance Profile attached

The Result

{
  "dns_resolution": {"success": true, "ip": "52.219.151.110"},
  "vpc_peering_443": {"success": false, "result_code": 11},
  "vpc_peering_nfs": {"success": false, "result_code": 11},
  "s3_ap_access": {"success": false, "error": "Read timeout"},
  "imds": {"success": true}
}

Analysis

Layer	Result	Interpretation
DNS resolution	✅	S3 AP alias resolves to S3 endpoint IP (52.219.x.x)
VPC Peering (TCP 443)	❌	FSx for ONTAP management IP unreachable — egress blocked
VPC Peering (NFS 2049)	❌	NFS port unreachable — egress blocked
S3 AP via S3 Gateway Endpoint	❌	Read timeout — S3 service reachable but FSx for ONTAP backend connection fails
IMDS / Instance Profile	✅	Credentials available and valid

Key finding: Even with VPC Peering established, routes configured, and security groups permissive, the Databricks-managed VPC's egress restrictions block connectivity to the FSx for ONTAP backend. The S3 Gateway Endpoint routes requests to the S3 service, but FSx for ONTAP S3 AP requires the S3 service to reach the FSx for ONTAP file system — which is in a different VPC from the Databricks cluster. The S3 service-side routing to the FSx for ONTAP backend is not affected by customer-side VPC Peering.

Important: This result suggests that FSx for ONTAP S3 AP access from Databricks requires the cluster to be in the same VPC as the FSx for ONTAP file system, or a network configuration in which the S3 service can reach the FSx for ONTAP backend. VPC Peering between the requester VPC and the FSx for ONTAP VPC did not help in this test, because S3 AP requests are routed through the S3 service, not directly to the FSx for ONTAP IP.

Lesson

S3 AP requests do not traverse VPC Peering. They are routed through the S3 service endpoint. For FSx for ONTAP S3 AP to work, the S3 service must be able to reach the FSx for ONTAP file system's internal endpoint. This is handled by AWS internally when the request originates from the same region, but the Databricks-managed VPC's egress restrictions appear to interfere with this path.

Customer-managed VPC (same VPC as FSx for ONTAP) remains the only validated path for Instance Profile + boto3 access to FSx for ONTAP S3 AP from Databricks.

IMDS Access Matrix

Cluster Mode	Workspace Type	IMDS	boto3 S3	boto3 S3 AP
Standard (Shared)	Managed VPC	❌	❌	❌
Dedicated	Managed VPC	❌	❌	❌
Dedicated	Customer VPC	❌	❌	❌
Dedicated + Instance Profile	Managed VPC (VPC Peering)	✅	⚠️	❌
Dedicated + Instance Profile	Customer VPC	✅	✅	✅

Row 4 note: IMDS works and Instance Profile credentials are valid, but S3 AP access times out because the Databricks-managed VPC egress restrictions block FSx for ONTAP backend connectivity. Regular S3 bucket access was not tested with a permissive policy (AccessDenied was due to intentionally scoped IAM policy, not network).

IMDS is blocked on all configurations except Dedicated mode with an explicitly registered Instance Profile. Of the two Instance Profile configurations, only the Customer-managed VPC workspace also reached the FSx for ONTAP S3 AP.

Complete Results Summary

#	Approach	Result	Blocker / Notes
1	UC External Location + dbutils.fs (without `access_point` field)	❌	Generated session policy did not allow S3 AP ARN
1b	UC External Location + `access_point` field (file-level read)	✅	Top-level ls, head, spark.read with explicit path all work
1c	UC External Location + `access_point` field (subdirectory ls)	❌	Prefix-based ListObjectsV2 still blocked for subdirectories
1d	UC External Location + CREATE TABLE LOCATION	❌	UC_CLOUD_STORAGE_ACCESS_FAILURE during internal validation
2	UC External Location + Spark read (directory)	❌	Same prefix-level access issue
3	NFS mount (Managed VPC, VPC Peering)	❌	Egress blocked (port 2049)
4	NFS mount (Customer VPC, Dedicated)	❌	NFS mount blocked by seccomp by design (confirmed by Databricks Support)
5	boto3 (Managed VPC, no Instance Profile)	❌	IMDS blocked
6	boto3 (Customer VPC, no Instance Profile)	❌	IMDS blocked
7	Instance Profile + boto3 (Customer VPC)	✅	Works (bypasses UC governance)
8	NFS RPC user-space (Customer VPC)	✅	Works but impractical for production
9	No Isolation Shared mode	❌	Legacy access mode; not pursued
10	S3 AP + Instance Profile + boto3 (Managed VPC, VPC Peering)	❌	Managed VPC egress blocks FSx for ONTAP backend connectivity

Governance Impact Summary

Documentation status (Updated 2026-05-26): Databricks Support confirmed that the access_point field was never released as GA and has been removed from documentation. Unity Catalog External Locations do not currently support S3 Access Points as storage targets. The partial success observed is a side effect, not a supported code path. Feature gap reported to UC engineering — no timeline available.

Access path	Governance model	Auditability	Production suitability
Unity Catalog External Location	Centralized UC governance (fine-grained, lineage)	High (if supported)	Preferred, but blocked in this validation
Instance Profile + boto3	EC2 IAM role based	AWS-side logs possible if enabled; UC lineage not captured	PoC only unless separately approved
Kernel NFS mount	Filesystem / OS level	Outside UC governance	Not viable in this validation
User-space NFS RPC	Custom application path	Custom logging required	Experimental only
Athena + FSx for ONTAP S3 AP	IAM / S3 AP / Athena workgroup	AWS-side evidence possible	Best current read-only SQL analytics fit
Bedrock Knowledge Bases + FSx for ONTAP S3 AP	IAM / S3 AP / Bedrock Knowledge Base role / guardrails where used	AWS-side evidence possible	AWS-documented RAG / GenAI path; validated with permission-aware retrieval in related series
Glue / EMR Serverless + FSx for ONTAP S3 AP	IAM / S3 AP / Glue / EMR job roles	AWS-side evidence possible	Validated ETL / Spark path in this broader series where verification-pack evidence is available; validate production write-back semantics separately

AWS-side audit events, such as CloudTrail data events where enabled and applicable, may show S3 API access by the instance profile, but they do not replace Unity Catalog lineage, table-level privileges, or centralized Databricks governance controls.

MLOps Boundary

Using boto3 to read objects from FSx for ONTAP S3 AP does not automatically make the downstream ML workflow governed.

If the data retrieved via Instance Profile + boto3 is used for ML or GenAI:

Register derived datasets in governed storage (Unity Catalog managed location)
Track experiments with MLflow
Register models in Unity Catalog where applicable
Document source data access path (S3 AP alias, prefix, timestamp)
Record whether training data lineage is captured or externalized
Ensure the ML compute uses an access mode compatible with Unity Catalog governance

Models in Unity Catalog provides centralized access control, auditing, lineage, and model discovery across workspaces. If the PoC data path bypasses UC, the model lifecycle should still be governed through UC model registry.

AI / RAG Data Readiness Checklist

If the FSx for ONTAP S3 AP data is intended for AI, RAG, or GenAI pipelines:

[ ] Are documents classified by sensitivity (PHI, PII, financial, internal, public)?
[ ] Are file-level permissions preserved or re-modeled for the AI pipeline?
[ ] Is metadata available for filtering and retrieval (file type, date, owner)?
[ ] Is freshness requirement defined (real-time, daily, weekly)?
[ ] Is read-only access sufficient, or does the pipeline need write-back?
[ ] Is human review required for generated output before downstream use?
[ ] Is permission-aware retrieval required (user A sees only their authorized documents)?

If permission-aware retrieval is required, define one of:

Enforce at source access path — use per-user or per-group S3 Access Points with scoped file system users
Re-model permissions in metadata index — extract file-level ACLs into a searchable metadata store and filter at query time
Filter retrieval results by user/group claims — apply post-retrieval filtering based on authenticated user identity
Do not proceed until authorization model is validated and approved by security owner
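The third option, filtering retrieval results by user or group claims, can be sketched in a few lines. Everything here is hypothetical: the `allowed_groups` field, the document shape, and the group names are assumptions about how permissions might be re-modeled in a metadata index, not something this validation implemented.

```python
def filter_by_claims(results: list[dict], user_groups: list[str]) -> list[dict]:
    """Keep only retrieved documents that share at least one group with the user."""
    allowed = set(user_groups)
    return [doc for doc in results if allowed & set(doc["allowed_groups"])]


# Hypothetical retrieval results carrying group metadata extracted from file ACLs.
retrieved = [
    {"key": "hr/policy.pdf", "allowed_groups": ["hr"]},
    {"key": "eng/design.md", "allowed_groups": ["engineering", "architects"]},
    {"key": "public/faq.txt", "allowed_groups": ["everyone"]},
]

visible = filter_by_claims(retrieved, ["engineering", "everyone"])
print([doc["key"] for doc in visible])
```

Post-retrieval filtering of this kind depends on the metadata staying in sync with the source ACLs, which is why the authorization model needs sign-off before use.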

Instance Profile + boto3 approval requirements (for regulated workloads):

Data owner approval
Security owner approval
Platform owner approval
Compliance reviewer approval (if regulated data involved)
Defined: allowed prefix, allowed operations, logging requirements, expiration date
Approval record location (where the decision is stored)
Review / expiration date (when the approval must be re-evaluated)
Incident escalation contact

For regulated workloads, do not use Instance Profile + boto3 for:

Patient-facing responses or clinical decision support
Financial decision automation
Unreviewed access to regulated datasets
Writeback to source-controlled data locations
Workloads requiring Unity Catalog lineage

Decision Matrix

Requirement	Recommended path today	Notes	Next validation action
SQL query on structured files	Athena + FSx for ONTAP S3 AP (Part 1)	Verified, simple, governed	Scale test with production data sizes
RAG / GenAI over NAS documents	Bedrock Knowledge Bases + FSx for ONTAP S3 AP	AWS-documented tutorial	Validate retrieval accuracy, permission-aware filtering, and sync freshness
ETL pipeline on NAS data	Glue or EMR Serverless + FSx for ONTAP S3 AP	Validated in this broader series where verification-pack evidence is available	Validate throughput impact and production write-back semantics
Serverless file processing	Lambda + FSx for ONTAP S3 AP	AWS-documented tutorial	Validate concurrency and throughput for your workload
Databricks governance with Unity Catalog	Wait for platform support	UC session policy currently blocks S3 AP ARN in my validation	Track the feature gap reported to UC engineering (no timeline available)
Databricks unstructured data PoC	Dedicated cluster + Instance Profile + boto3	Works, but bypasses UC governance	Validate executor-scale behavior separately
Production Databricks lakehouse tables	Use supported cloud storage (S3 bucket)	Required for Delta write semantics	N/A — use standard pattern
Databricks distributed processing over FSx for ONTAP S3 AP	Not validated yet	Driver-only boto3 success does not prove executor-scale behavior	Test with multi-node cluster and Spark mapPartitions
Enterprise read-only analytics	Athena / Glue / EMR Serverless + FSx for ONTAP S3 AP	Best current fit for AWS-native path	Production workload isolation test
Video streaming from NAS	CloudFront + FSx for ONTAP S3 AP	AWS-documented tutorial	Validate caching and latency for your content

This article does not recommend bypassing Unity Catalog for production governed lakehouse workloads. The Instance Profile + boto3 path is documented because it worked in a controlled validation environment, not because it is the preferred governance model.

Architecture Decision Guidance

Databricks remains the recommended platform for curated lakehouse workloads, governed Delta tables, ML pipelines, and multi-step data engineering. FSx for ONTAP S3 AP should be treated as a source integration boundary that may require staging, validation, or an alternate read path depending on governance requirements.

Use Databricks when:

Data is already in supported object storage (S3 bucket)
Delta Lake write semantics are required (INSERT, MERGE, OPTIMIZE, VACUUM)
Unity Catalog lineage and fine-grained governance are mandatory
Large-scale Spark processing is required
ML/AI workloads need integrated compute

Use AWS-native services + FSx for ONTAP S3 AP when:

The primary requirement is read-only SQL analytics over NAS data → Athena (validated in Part 1)
RAG / GenAI over enterprise documents → Bedrock Knowledge Bases (AWS-documented path)
ETL pipelines reading/transforming NAS data → Glue (validated in this broader series where verification-pack evidence is available)
Spark-scale processing without persistent clusters → EMR Serverless (validated in this broader series where verification-pack evidence is available)
Serverless file processing (thumbnails, text extraction, transcription) → Lambda (AWS-documented path)
Video streaming from NAS → CloudFront (AWS-documented path)
External partner file exchange → Transfer Family (AWS-documented path)
BI and AI-assisted analytics → QuickSight candidate path, typically via Athena or Glue Catalog
Source data copy should be minimized
Workload isolation and governance can be validated with AWS-side controls
Serverless, pay-per-query or pay-per-invocation cost model is preferred

Use controlled boto3 PoC only when:

The workload is exploratory and time-limited
Unity Catalog lineage is not required for the PoC scope
Explicit approval is obtained from data owner, security owner, and platform owner
Compensating controls are defined and documented

FSx for ONTAP Sizing Considerations

Before selecting an analytics engine, validate FSx for ONTAP-side capacity:

Throughput capacity — S3 API throughput is bounded by the FSx for ONTAP file system's provisioned throughput
Expected S3 API request rate — high-frequency small object reads may hit IOPS limits
File count and average object size — large directories with many small files may increase listing latency
Prefix layout — flat vs hierarchical prefix design affects listing performance
NFS/SMB production workload window — analytics queries share throughput with existing file workloads
Snapshot / backup / replication schedule — SnapMirror and backup operations consume throughput
Isolation strategy — consider a dedicated volume or SVM for analytics access to avoid contention

Delta Lake production workloads require more than object read access. They require validated behavior for transaction log writes, atomic commit assumptions, concurrent writers, checkpointing, recovery, and lifecycle operations. This article does not validate FSx for ONTAP S3 AP for Delta write-path semantics.

Compensating Controls for Controlled boto3 PoC

If Instance Profile + boto3 is approved for a controlled PoC, define:

Dedicated cluster only (no shared compute)
Single-purpose instance profile (not reused across workloads)
Least-privilege S3 Access Point policy (specific prefix only)
Read-only permissions by default
Allowed prefix list (explicitly documented)
CloudTrail data event coverage where enabled and applicable
Notebook/job owner (named individual)
Approval expiration date
No production writeback
No regulated data unless separately approved with compensating controls
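
As a rough illustration of the "least-privilege, specific prefix, read-only" controls above, the sketch below builds an IAM identity policy document for a single approved prefix. The account ID, access point name, prefix, and statement IDs are placeholders, and the layout is one possible shape rather than the policy used in this validation. FSx-side file permissions still apply as a separate layer.

```python
import json


def build_readonly_prefix_policy(ap_arn: str, prefix: str) -> dict:
    """Return an IAM policy that allows list and get under one prefix only."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListApprovedPrefixOnly",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": ap_arn,
                # Listing is limited to keys under the approved prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadApprovedPrefixOnly",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                # Object resources behind an access point use the /object/ segment.
                "Resource": f"{ap_arn}/object/{prefix}/*",
            },
        ],
    }


# Placeholder ARN and prefix, for illustration only.
EXAMPLE_AP_ARN = "arn:aws:s3:ap-northeast-1:111122223333:accesspoint/example-fsxn-ap"
policy = build_readonly_prefix_policy(EXAMPLE_AP_ARN, "poc/approved/")
print(json.dumps(policy, indent=2))
```

No write actions appear in the document, which matches the "read-only permissions by default" and "no production writeback" items. A matching S3 Access Point policy would still need to be reviewed separately.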

Recommended Databricks-side controls:

Restrict instance profile usage to an approved group via workspace admin settings
Enforce dedicated access mode through cluster policy
Restrict cluster creation permissions to approved users
Tag PoC clusters with owner, approval ID, and expiration date
Disable or terminate clusters after approval expiration
Review workspace audit logs for cluster and instance profile usage
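
The tagging and expiration controls above can be checked mechanically. The following is a minimal sketch that assumes cluster metadata has already been exported into plain dictionaries. The `cluster_id`, `tags`, and `approval_expires` names are hypothetical and do not correspond to a specific Databricks API response.

```python
from datetime import date


def expired_clusters(clusters: list[dict], today: date) -> list[str]:
    """Return IDs of clusters whose approval has expired or is not recorded."""
    flagged = []
    for cluster in clusters:
        raw = cluster.get("tags", {}).get("approval_expires")
        try:
            expires = date.fromisoformat(raw)
        except (TypeError, ValueError):
            # Missing or malformed tag: fail closed and flag the cluster.
            flagged.append(cluster["cluster_id"])
            continue
        if expires < today:
            flagged.append(cluster["cluster_id"])
    return flagged


# Hypothetical inventory for illustration.
inventory = [
    {"cluster_id": "poc-a", "tags": {"owner": "alice", "approval_expires": "2026-05-01"}},
    {"cluster_id": "poc-b", "tags": {"owner": "bob", "approval_expires": "2026-12-31"}},
    {"cluster_id": "poc-c", "tags": {"owner": "carol"}},
]
print(expired_clusters(inventory, date(2026, 6, 1)))
```

Treating a missing tag as expired is a design choice here: an untagged PoC cluster has no recorded approval, so it is safer to surface it for review.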

Post-expiration mandatory actions:

Terminate all PoC clusters using the instance profile
Remove the instance profile from workspace admin settings
Archive all evidence (notebooks, logs, results) to approved storage
Update approval record with completion date and findings
Confirm no residual access paths remain (audit workspace settings)

Data Protection Considerations

FSx for ONTAP S3 AP exposes access to file data; it does not replace ONTAP volume-level protection. When analytics workloads access source data via S3 AP, validate:

Snapshot schedule impact — analytics reads do not conflict with scheduled snapshots, but heavy write-back could
SnapMirror replication policy — source volume replication continues regardless of S3 AP access
Backup window vs analytics query window — concurrent backup and analytics may compete for throughput
Write-back isolation — analytics results should be written to a separate volume or prefix, not the source-of-record volume
Recovery behavior — if analytics workload reads during a failover event, understand the RPO/RTO implications

ONTAP S3 NAS bucket data is protected by volume-level SnapMirror asynchronous replication, not by S3-level replication. Plan DR at the volume level.

Discovery Questions for Partners

When a customer asks about Databricks + FSx for ONTAP S3 Access Points:

Are the target files currently stored on NFS, SMB, or both?
Is the workload read-only analytics, unstructured object processing, or Delta write?
Is Unity Catalog lineage mandatory for this use case?
Is this a regulated dataset (PHI, PII, financial)?
Can the PoC run with a dedicated instance profile and limited prefix?
What is the required concurrency and data size?
Is executor-scale Spark processing required, or is driver-only sufficient?
What rollback action is acceptable if FSx for ONTAP throughput impact is observed?
Who approves non-Unity Catalog access paths?
What evidence is required for security review?

Troubleshooting Playbook

When Databricks access to FSx for ONTAP S3 AP fails, isolate one layer at a time:

IAM — Can the instance profile call s3:ListBucket on the S3 AP ARN? Can it call s3:GetObject?
Unity Catalog — Does the same role work for a standard S3 bucket? Does it fail only for the FSx for ONTAP S3 AP ARN?
Network — Is the workspace customer-managed or Databricks-managed? Can the cluster reach NFS TCP 2049? Are route tables and security groups correct?
NFS server — Does showmount -e work? Does the ONTAP export policy allow the client?
Local runtime — Does strace show mount() returning EACCES? Does tmpfs mount succeed? Does user-space NFS RPC succeed?
Workaround — Does Dedicated + Instance Profile + boto3 work? Is bypassing Unity Catalog acceptable for this PoC?

Known Failure Signatures

Symptom	Likely layer	Next step
`no session policy allows s3:ListBucket`	Unity Catalog session policy	Compare regular S3 bucket vs FSx for ONTAP S3 AP with the same role
TCP 2049 unreachable	Network / managed VPC boundary	Test from customer-managed VPC
`mount.nfs: access denied by server` with `mount()` EACCES in strace	Local runtime restriction	Capture strace and `/proc/self/status` seccomp output
boto3 `NoCredentialsError`	Instance profile / IMDS blocked	Verify cluster mode is Dedicated and instance profile is registered
boto3 `ReadTimeoutError` on S3 AP	FSx for ONTAP backend or VPC endpoint routing	Test with a fresh SVM/volume to isolate; check FSx for ONTAP CPU utilization
boto3 `ReadTimeoutError` on S3 AP from Managed VPC (IMDS works)	Managed VPC egress restriction blocking FSx for ONTAP backend	Deploy in Customer-managed VPC (same VPC as FSx for ONTAP); VPC Peering does not resolve this
Driver-only boto3 works, but Spark job fails	Executor credential/network path	Validate credentials, routing, and concurrency from executors separately
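
For teams that collect error text from notebooks or job logs, the signature table can be turned into a small lookup. This is a sketch only: the substrings are taken from the table above, the function name is hypothetical, and a substring match is a first hint rather than a diagnosis.

```python
# (substring to look for, likely layer, suggested next step), taken from the table above.
SIGNATURES = [
    ("no session policy allows", "Unity Catalog session policy",
     "Compare regular S3 bucket vs FSx for ONTAP S3 AP with the same role"),
    ("access denied by server", "Local runtime restriction (only if strace shows mount() EACCES)",
     "Capture strace and /proc/self/status seccomp output"),
    ("NoCredentialsError", "Instance profile / IMDS blocked",
     "Verify cluster mode is Dedicated and instance profile is registered"),
    ("ReadTimeoutError", "FSx for ONTAP backend or VPC endpoint routing",
     "Test with a fresh SVM/volume; check whether the workspace uses a managed VPC"),
]


def classify(message: str) -> tuple[str, str]:
    """Map an error message to (likely layer, next step) using substring matching."""
    lowered = message.lower()
    for needle, layer, next_step in SIGNATURES:
        if needle.lower() in lowered:
            return layer, next_step
    return "Unclassified", "Isolate one layer at a time using the playbook"


layer, next_step = classify("AccessDenied: no session policy allows s3:ListBucket")
print(layer, "->", next_step)
```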

What This Article Does Not Conclude

This article does not conclude that Databricks cannot ever support FSx for ONTAP S3 AP. It documents the behavior observed in one validated environment and identifies the platform boundaries that need vendor confirmation or additional support.

What to Tell Stakeholders

Current recommendation:

Use AWS-documented native service paths where they match the workload: Athena for SQL, Bedrock Knowledge Bases for RAG/GenAI, Glue or EMR Serverless for ETL/Spark, Lambda for serverless file processing, CloudFront for streaming, and Transfer Family for partner file exchange
Treat Athena as the validated read-oriented SQL path in Part 1. Treat Glue / EMR Serverless as validated ETL / Spark paths only where corresponding verification-pack evidence is available.
Treat Bedrock Knowledge Bases, Lambda (file processing), CloudFront, and Transfer Family as AWS-documented candidate paths that still require workload-specific validation
Use Databricks + Instance Profile + boto3 only for controlled PoC or unstructured data experiments
Do not position Unity Catalog + FSx for ONTAP S3 AP as production-ready until the session policy supports S3 Access Point ARN patterns
Do not rely on kernel NFS mounts inside Databricks until the platform explicitly supports this path
For Delta Lake production tables, continue to use supported object storage patterns

This validation should be used to guide architecture selection, not to disqualify Databricks from lakehouse workloads.

This validation should not be used to compare AWS-native services and Databricks as competing platforms. AWS-native services (Athena, Bedrock, Glue, EMR Serverless, Lambda) each have AWS-documented integration paths with FSx for ONTAP S3 AP — some validated in this series, others requiring workload-specific validation. Databricks is strong for governed lakehouse, Delta, ML, and production-scale data engineering workloads. The right choice depends on the access pattern, governance requirement, and workload type.

Key contributions of this validation:

Identified the root cause of NFS mount failure (seccomp BPF filter, not server-side denial) via strace analysis
Discovered the access_point field on External Location (via Databricks Support) that partially resolves the session policy
Showed that file-level read under UC governance is possible in this environment (1000 rows, schema inference)
Mapped the complete evidence chain: network → ONTAP → NFS RPC → kernel → seccomp
Established that Customer-managed VPC (same VPC as FSx) is the only validated network path
Provided a reusable troubleshooting playbook for future S3 AP integration attempts

Lessons Learned

1. "S3-compatible" ≠ "works everywhere S3 works"

FSx for ONTAP S3 AP is S3-compatible at the API level, but platform security layers (session policies, VPC restrictions) may not recognize the ARN format. S3 API compatibility and platform-integrated S3 governance are different things.

2. Error messages can be misleading

The "mount.nfs: access denied by server" message made me spend hours checking ONTAP export policies. The real issue was a local runtime restriction. Always use strace when mount fails unexpectedly.

3. Platform security boundaries are not always documented

You discover these boundaries by hitting them. The troubleshooting playbook above can save you time.

4. Customer-managed VPC is essential for storage integration

If you need to connect Databricks to anything beyond standard S3 buckets, deploy in a Customer-managed VPC. A Databricks-managed VPC gives you less control over cluster networking than a Customer-managed VPC.

This was further confirmed by testing S3 AP access from a Databricks-managed VPC with VPC Peering: even with VPC Peering active, routes configured, security groups permissive, and an S3 Gateway Endpoint present, S3 AP requests to FSx for ONTAP timed out. The Databricks-managed VPC egress restrictions block not only direct IP communication but also S3 AP backend connectivity.

S3 AP routing note: S3 AP requests are routed through the S3 service endpoint, not directly to the FSx for ONTAP IP. VPC Peering between the requester VPC and the FSx for ONTAP VPC does not help because the S3 service needs internal connectivity to the FSx for ONTAP file system. Customer-managed VPC (same VPC as FSx for ONTAP) is the only validated path.

Databricks Control Plane (SaaS)
        ^
        | NAT Gateway (required outbound)
        |
Databricks Cluster ENI (Customer VPC, private subnet)
        |
        | Private VPC routing (no internet required)
        v
FSx for ONTAP ENI / SVM (same VPC, private subnet)

For the Databricks Support Case Packet, include network evidence: cluster subnet ID, FSx for ONTAP subnet ID, route table IDs, security group rules, and DNS resolution for the FSx for ONTAP endpoint.

5. Instance Profile is a pragmatic PoC workaround

Use Instance Profile + boto3 as a controlled PoC workaround. Do not use it as a substitute for Unity Catalog governance without a formal security review.

6. Always isolate variables when troubleshooting

When FSx for ONTAP S3 AP wasn't responding, I created a new SVM and volume to isolate the issue. This confirmed the problem was SVM-specific rather than a platform-wide limitation.

7. Negative validation creates value

A failed integration path can still create value when it prevents the wrong production architecture. This validation helps teams avoid assuming S3 API compatibility equals platform governance compatibility, choose the right engine for the right access pattern, and reduce time spent on ambiguous troubleshooting.

Databricks Support Case Packet

If you open a support case with Databricks, include:

Workspace type: Databricks-managed VPC or customer-managed VPC
Cluster access mode and DBR version
IAM role / instance profile configuration
Unity Catalog storage credential and external location configuration
Full AccessDenied error message (including the ARN and "no session policy" text)
S3 AP ARN and alias format
Network test results for NFS ports (TCP 2049, TCP 111, TCP 635)
strace output showing mount() EACCES
/proc/self/status showing seccomp mode
User-space NFS RPC success evidence (if applicable)
Instance Profile boto3 success evidence (if applicable)
showmount -e output (confirms export visibility)
tmpfs mount success evidence (proves mount syscall itself is allowed)
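
To capture the seccomp evidence in a consistent form, the `Seccomp:` field of `/proc/self/status` can be decoded. On Linux, the value is 0 (disabled), 1 (strict), or 2 (filter). The helper below is a minimal sketch; the function name is hypothetical, and the sample text stands in for real output so the snippet also runs off-cluster.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str) -> str:
    """Decode the Seccomp field from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        # Match "Seccomp:" exactly so "Seccomp_filters:" is not picked up.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, f"unknown ({value})")
    return "not reported"


# Sample text for illustration. On a cluster node, read the real file instead:
#   status_text = open("/proc/self/status").read()
sample = "Name:\tpython3\nNoNewPrivs:\t1\nSeccomp:\t2\nSeccomp_filters:\t1\n"
print(seccomp_mode(sample))
```

A result of "filter" only shows that a BPF filter is installed. Which syscalls it blocks still has to come from the strace output.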

Use Case Fit Matrix

When this article says "validated in this broader series," it refers to evidence captured in the linked verification-pack or related articles, not to Databricks-specific validation in this Part 2 article.

Use case	Best current path	Why
SQL analytics on structured NAS files	Athena + FSx for ONTAP S3 AP	Verified read-oriented path with AWS-side governance controls, serverless
Enterprise IT RAG over documents	Bedrock Knowledge Bases + FSx for ONTAP S3 AP	AWS-documented tutorial; also validated in related series with permission-aware retrieval
ETL / data transformation	Glue or EMR Serverless + FSx for ONTAP S3 AP	Validated in this broader series where verification-pack evidence is available; validate production write-back semantics separately
Serverless file processing (thumbnails, OCR, transcription)	Lambda + FSx for ONTAP S3 AP	AWS-documented tutorial; validate for your workload
Large-scale Spark ETL	EMR Serverless + FSx for ONTAP S3 AP or standard S3 bucket	Validated in this series; Databricks executor-scale not validated on S3 AP
Production Delta Lake tables	Supported object storage (S3 bucket)	Required for Delta write semantics and UC governance
Unstructured data experimentation (Databricks)	Instance Profile + boto3 PoC	Works in driver-only pattern, needs governance review
Video streaming from NAS	CloudFront + FSx for ONTAP S3 AP	AWS-documented tutorial; validate caching, latency, and file size for your content
External partner file exchange	Transfer Family + FSx for ONTAP S3 AP	AWS-documented path; also validated in related series; validate file operation limitations (rename, append, upload size)
Lightweight serverless analytics	DuckDB Lambda + FSx for ONTAP S3 AP	Part 4 validation; candidate for lightweight, low-idle-cost analytics
BI / dashboarding over NAS data	Candidate: QuickSight via Athena or Glue Catalog	AWS positions BI as a candidate use case; validate whether access path is Athena-backed or catalog-mediated

Cost Model Considerations

Engine	Primary cost driver	Best fit
Athena	Data scanned (per TB)	Occasional SQL queries, serverless
Bedrock Knowledge Bases	Model invocation + embedding + retrieval	RAG / GenAI over enterprise documents
Glue	DPU-hours	ETL pipelines, data transformation
Databricks	DBU + cloud compute instance hours	Lakehouse pipelines, ML, Delta workloads
EMR Serverless	vCPU / memory × runtime duration	Spark ETL without persistent clusters
Lambda + DuckDB	Invocation duration × memory	Lightweight serverless analytics, event-driven
CloudFront	Data transfer + requests	Video/media streaming from NAS

Cost comparison is not the focus of this article. Each engine has a fundamentally different pricing model. Databricks provides compute policies to control cluster creation, instance types, auto-termination, and cost-related attributes. For cost optimization, evaluate based on workload pattern (interactive vs batch, frequency, data volume) rather than unit price alone.

Partner / Customer Conversation Guide

If a customer asks whether Databricks can directly process FSx for ONTAP S3 Access Point data:

AWS-native service paths such as Athena, Bedrock Knowledge Bases, Glue, EMR Serverless, Lambda, CloudFront, and Transfer Family have AWS-documented integration patterns with FSx for ONTAP S3 AP. In this series, Athena (Part 1), Glue, and EMR Serverless have been validated; the other paths should be validated per workload, Region, IAM model, FSx for ONTAP-side authorization, and governance requirement.
Databricks Unity Catalog integration requires vendor confirmation for S3 Access Point ARN handling
Instance Profile + boto3 can be used for controlled PoC experiments, but it bypasses Unity Catalog governance and is classified as a legacy data access pattern by Databricks
Production Delta Lake workloads should continue to use supported object storage patterns
Any Databricks integration should be validated per workspace type, cluster mode, runtime version, IAM path, and governance requirement

Next Validation Metrics

Current blocker: Executor-scale validation requires a Customer-managed VPC workspace (same VPC as FSx for ONTAP). The Databricks-managed VPC workspace was tested with VPC Peering and Instance Profile (2026-05-24), and S3 AP access timed out due to managed VPC egress restrictions. Creation of a Customer-managed VPC workspace is pending resolution of a Databricks support ticket.

For executor-scale validation (not yet performed):

Object listing latency per executor
Total objects processed across cluster
Per-executor success/failure rate
Throughput per executor
Retry count and S3 API error rate
FSx for ONTAP throughput utilization during distributed access
Cost per processed GB

Driver-only boto3 success is not sufficient for Spark workloads. The next validation should run boto3 calls from executors using mapPartitions and compare credential, routing, latency, and error behavior across workers.

Executor-scale validation should not only test success/failure. It should capture per-executor latency, retry count, error code, and object count so that routing and concurrency behavior can be reviewed.
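
One way to structure such an executor-side probe is sketched below. It has not been run at executor scale in this validation. The client is injected through a factory so that it is created on each executor, and `make_probe`, the bucket value, and the prefix are hypothetical names. The sketch uses `mapPartitionsWithIndex` (a variant of `mapPartitions`) only to obtain a partition ID.

```python
import json
import socket
import time
from typing import Callable, Iterable, Iterator


def make_probe(client_factory: Callable[[], object], bucket: str, prefix: str):
    """Build a function that can be passed to rdd.mapPartitionsWithIndex."""

    def probe(partition_id: int, _rows: Iterable) -> Iterator[str]:
        client = client_factory()  # created on the executor, not shipped from the driver
        record = {
            "executor_host": socket.gethostname(),
            "partition_id": partition_id,
            "operation": "list_objects_v2",
            "status": "success",
            "latency_ms": None,
            "objects_seen": 0,
            "error_code": None,
        }
        start = time.perf_counter()
        try:
            response = client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=100)
            record["objects_seen"] = response.get("KeyCount", 0)
        except Exception as exc:  # keep going so every partition reports a record
            record["status"] = "error"
            details = getattr(exc, "response", None)
            error = details.get("Error", {}) if isinstance(details, dict) else {}
            record["error_code"] = error.get("Code", type(exc).__name__)
        record["latency_ms"] = round((time.perf_counter() - start) * 1000)
        yield json.dumps(record)

    return probe


# On a cluster (assumes boto3 is installed on executors and `spark` exists):
#   import boto3
#   probe = make_probe(lambda: boto3.client("s3"), "<S3 AP alias or ARN>", "poc/approved/")
#   lines = spark.sparkContext.parallelize(range(8), 8).mapPartitionsWithIndex(probe).collect()
```

Catching the exception inside the probe is deliberate: a failed partition still yields a record with an error code, which is what makes per-executor comparison possible.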

Benchmark run guidance:

Cold run: at least 1 (first access after cluster start, no metadata cache)
Warm metadata run: at least 1 (after initial listing populates metadata cache)
Repeated run: at least 3 (steady-state measurement)
Report: p50, p90, p95, p99 latency, plus average, min, max, and outliers
Include: object count, average object size, prefix depth, concurrent executor count
Include: FSx for ONTAP throughput utilization during test window
Note: S3 AP via FSx for ONTAP may exhibit metadata warm-up effects and prefix layout sensitivity. Cold vs warm differences should be documented explicitly.
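
The reporting items above can be computed with the standard library. This sketch uses the nearest-rank percentile definition, which is one of several common conventions, and the sample latencies are made up for illustration.

```python
import statistics


def percentile(ordered: list[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]


def summarize(latencies_ms: list[float]) -> dict:
    """Return p50/p90/p95/p99 plus average, min, max, and sample count."""
    ordered = sorted(latencies_ms)
    summary = {f"p{p}": percentile(ordered, p) for p in (50, 90, 95, 99)}
    summary.update(
        avg=statistics.fmean(ordered),
        min=ordered[0],
        max=ordered[-1],
        n=len(ordered),
    )
    return summary


# Made-up latencies in milliseconds, for illustration only.
sample_latencies = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(summarize(sample_latencies))
```

With only a handful of runs, the high percentiles collapse to the maximum, so p95 and p99 become meaningful only once enough per-request samples are collected. Cold, warm, and repeated runs should be summarized separately.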

Additional FSx for ONTAP metrics to capture:

FSx for ONTAP throughput utilization (% of provisioned capacity)
FSx for ONTAP CPU utilization
Network throughput (inbound/outbound)
S3 API request count by operation (List, Get, Head)
File count per prefix
Average object size
NFS/SMB latency during concurrent S3 API reads (contention indicator)

Expected output format (JSONL per executor):

{"executor_host": "ip-10-0-xx-yy", "partition_id": 3, "operation": "list_objects_v2", "status": "success", "latency_ms": 183, "objects_seen": 100, "error_code": null}
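
Records in this shape are straightforward to roll up per executor. The sketch below assumes one JSON object per line with the fields shown above. The function name and the sample lines are illustrative.

```python
import json
from collections import defaultdict


def summarize_by_executor(jsonl_lines) -> dict:
    """Aggregate probe records into per-executor totals and success rates."""
    stats = defaultdict(lambda: {"total": 0, "errors": 0, "objects_seen": 0})
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines in collected output
        record = json.loads(line)
        entry = stats[record["executor_host"]]
        entry["total"] += 1
        entry["objects_seen"] += record.get("objects_seen") or 0
        if record["status"] != "success":
            entry["errors"] += 1
    return {
        host: {**entry, "success_rate": (entry["total"] - entry["errors"]) / entry["total"]}
        for host, entry in stats.items()
    }


# Illustrative records; host names and values are made up.
sample_lines = [
    '{"executor_host": "host-a", "partition_id": 0, "operation": "list_objects_v2", "status": "success", "latency_ms": 183, "objects_seen": 100, "error_code": null}',
    '{"executor_host": "host-a", "partition_id": 1, "operation": "list_objects_v2", "status": "error", "latency_ms": 60000, "objects_seen": 0, "error_code": "ReadTimeoutError"}',
    '{"executor_host": "host-b", "partition_id": 2, "operation": "list_objects_v2", "status": "success", "latency_ms": 170, "objects_seen": 100, "error_code": null}',
]
print(summarize_by_executor(sample_lines))
```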

Adoption Success Metrics

For a controlled Databricks + FSx for ONTAP S3 AP PoC, define success criteria beyond technical pass/fail:

Baseline metrics (capture before validation):

Average search/access time (minutes) for target documents
Monthly document access count via current path
Current copy pipeline runtime (if applicable)
Current data freshness lag (hours)
Current support ticket count related to data access

PoC outcome metrics:

Number of target datasets evaluated
Number of successful read operations
Number of governance exceptions required
Time to first successful access
Number of support issues raised
Whether the customer selected Athena, Databricks, or another engine after validation
Decision outcome: proceed / adjust / stop
Time saved by early boundary identification (vs discovering in production)

Stop criteria:

No measurable business value after validation period
Governance exception required for production path with no compensating control available
Executor-scale validation fails with unacceptable error rate (define threshold before PoC)
FSx for ONTAP workload impact exceeds approved threshold (e.g., throughput utilization > 80%)
Vendor confirmation indicates unsupported path with no roadmap commitment
Security review rejects the access path without remediation option

Series Evaluation Criteria

Across this series, each engine is evaluated by:

Read-path compatibility
Write-path compatibility
Governance model
Operational impact
Performance evidence
Production readiness gap
Best-fit use case

Well-Architected Mapping

These criteria align with the AWS Well-Architected Data Analytics Lens:

Pillar	Evaluation focus in this series
Security	Governance model, IAM/AP policy, audit evidence, session policy behavior
Reliability	Failure modes, rollback path, support case evidence, DR considerations
Performance Efficiency	Throughput, executor-scale behavior, FSx for ONTAP utilization, latency
Cost Optimization	Engine-specific cost model, idle cost, cost per processed GB
Operational Excellence	Runbook, evidence template, support packet, monitoring

Business Value of Negative Validation

Negative validation is not failure. It is risk reduction.

This validation helps teams:

Avoid assuming S3 API compatibility equals platform governance compatibility
Choose the right engine for the right access pattern (Athena for read-only SQL, Databricks for lakehouse/ML)
Identify early when vendor confirmation is required before committing architecture
Reduce time spent on ambiguous troubleshooting by providing reproducible evidence
Prevent wasted PoC investment by documenting boundaries before production design
Enable informed conversations with vendors, partners, and security reviewers

For enterprise customers, early boundary identification can save weeks of engineering time and prevent costly architecture rework after production deployment.

What's Next

Series index:

Part 1: Athena — Query NAS Data In Place (validated read-oriented path, 9/9 negative tests pass)
Part 2: Databricks (this article) — session policy deep dive
Part 3: Snowflake — LIST Works, SELECT Doesn't (same session policy pattern)
Part 4: DuckDB Lambda — lightweight serverless analytics validation
Part 5: EMR Spark — read-write ETL pipeline (coming soon)
Part 6: Redshift Spectrum — DWH meets NAS data (coming soon)
Part 7: Trino — open-source SQL on NAS data (coming soon)

Open items:

Support cases: Waiting for Databricks response on session policy and NFS mount questions
FUSE NFS client: Investigating whether a user-space NFS client can bypass the runtime restriction

Caution on FUSE/user-space NFS: FUSE or user-space NFS clients should be treated as experimental only. They require separate validation for POSIX semantics, caching behavior, consistency, performance, failure recovery, and vendor supportability. Do not treat user-space NFS RPC success as a production workaround.

References

Related series by the same author (FSx for ONTAP S3 Access Points with other AWS services):

Building an Agentic Access-Aware RAG System with Amazon FSx for NetApp ONTAP, S3 Vectors, and S3 Access Points — Bedrock Knowledge Bases + permission-aware retrieval (GitHub)
FSx for ONTAP S3 Access Points as a Serverless Automation Boundary — AI Data Pipelines, Volume-Level SnapMirror DR, and Capacity Guardrails — Lambda, Bedrock, SageMaker, 17 industry use cases (GitHub)
Smart Routing, Transfer Family Ingestion, and Voice Chat — Permission-Aware RAG v4.2 — Transfer Family + SFTP ingestion for RAG pipeline

ONTAP S3 Multiprotocol vs FSx for ONTAP S3 Access Points:

ONTAP S3 multiprotocol (ONTAP 9.12.1+): S3 NAS bucket model on ONTAP SVM, enabling S3 clients to access NAS data directly on the ONTAP cluster
FSx for ONTAP S3 Access Points: AWS-managed S3 Access Point endpoint attached to FSx for ONTAP volume, integrating with AWS IAM, VPC, and S3-compatible services
Both expose NAS data via S3-style access, but the authorization path, service integration, and operational model differ. This article focuses on FSx for ONTAP S3 Access Points.

This article is part of the "FSx for ONTAP S3 Access Points × Lakehouse Deep Dive" series. All tests were performed in a real AWS environment with FSx for ONTAP (ONTAP 9.17.1, ap-northeast-1) and Databricks (DBR 17.3 LTS, Premium tier) in May 2026.

Scope reminder: This article documents observed behavior in one validated environment. It does not validate production readiness, distributed executor-scale processing, or all Databricks runtime versions. Terminology uses "observed in this environment" rather than "unsupported" or "incompatible" — platform behavior may change with future updates.

Future updates: If Databricks platform behavior changes or vendor confirmation becomes available, this article should be updated with the new validation result rather than treated as a permanent compatibility statement.

Disclaimer: This article is an independent validation report and does not represent Databricks, AWS, or NetApp official guidance. Product behavior, support status, and platform capabilities may change. Always validate in your own environment and consult vendor documentation and support channels.

FSx for ONTAP S3 Access Points Lakehouse — What Works, What Doesn't, and Why

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Sun, 24 May 2026 07:24:34 +0000

TL;DR

Amazon FSx for ONTAP S3 Access Points let you access NAS file data through S3-compatible APIs — without first copying source files to S3.

I tested multiple analytics, AI/ML, and lakehouse access patterns across AWS-native services, open-source engines, and third-party platforms. The results fall into four categories:

Verified in this series ✅	Candidate (AWS-documented) 🔎	Partially resolved, not production-ready ⚠️	Not suitable for this path ❌
Athena, Glue, EMR Spark, Redshift Spectrum, DuckDB Lambda, Trino, Snowflake (with `AWS_ACCESS_POINT_ARN`)	Bedrock KB, Lake Formation, Quick	Databricks UC (session policy partially resolved; UC table creation and directory listing still blocked)	Delta / Iceberg / Hudi transactional write paths

The pattern: Read-oriented analytics and flat-file writes (such as Parquet append) worked reliably in my validation environment. Transactional table-format write paths failed in this validation because they require commit semantics (atomic rename, conditional metadata update) that were not satisfied through the FSx S3 AP path.

GitHub Repository: fsxn-lakehouse-integrations

Validation Vocabulary

Term	Meaning
Verified	Worked in my test environment with evidence in verification-pack/
Candidate	AWS-documented or related-series path that still requires workload-specific validation
Blocked	Failed due to integration-layer behavior observed in validation
Not suitable	Failed because required table-format semantics were unavailable or incompatible

When this article says "Verified," it means the behavior was observed in my test environment and evidence is available. It does not mean production certification or a vendor support guarantee.

Why This Matters

Enterprise organizations store petabytes of file data on NAS (NFS/SMB). To analyze this data with modern tools, they typically:

1. Copy data from NAS to S3 (ETL pipeline)
2. Register in a catalog (Glue, Unity Catalog)
3. Query with analytics platform

FSx for ONTAP S3 Access Points eliminate step 1. The same files accessible via NFS/SMB are now queryable via S3 API — zero source-file movement, zero sync pipeline, zero duplicate storage.

Before: NFS/SMB → [ETL Copy] → S3 → Analytics Platform
After:  NFS/SMB ←→ FSx for ONTAP ←→ S3 Access Point → Analytics Platform
                    (same data, same volume)

Note for regulated workloads: "Zero data movement" means source files do not need to be copied from FSx for ONTAP to S3 for the tested access paths. However, metadata, query results, logs, embeddings, temporary files, and derived datasets may still be created by the consuming service. See Note for Regulated Workloads below.

From File Access to AI-Ready Data

Eliminating the copy pipeline is step one. The real business value comes from what you do next — turning raw file data into AI-ready data products that drive business outcomes.

The engines validated in this series form a multi-engine data product journey:

FSx for ONTAP (source of truth)
  ↓ S3 Access Point (zero-copy access)
  ├── Athena / Redshift Spectrum → Ad-hoc discovery, data profiling
  ├── Glue / EMR Spark → ETL, curated Parquet/Iceberg datasets
  ├── DuckDB Lambda → Lightweight validation, cost-optimized queries
  ├── Snowflake External Table → Governed analytics, Cortex AI (summarize, RAG, sentiment)
  ├── Lake Formation → Fine-grained access control (column/row/tag)
  └── Databricks (via DataSync → S3) → ML training, feature engineering, Mosaic AI

The key insight: You don't need to pick one engine. Each platform excels at a different stage of the data product lifecycle:

Stage	Best engine	Why
Discover	Athena, DuckDB Lambda	Cheapest way to explore what's in your NAS data
Profile & validate	Glue Crawler, DuckDB	Schema discovery, data quality checks
Transform & curate	Glue ETL, EMR Spark	Medallion architecture, write curated Parquet
Govern	Lake Formation, Snowflake	Column/row/tag access control, governance tags
Analyze & share	Snowflake, Redshift Spectrum	Governed analytics, data sharing, Cortex AI
Train & predict	Databricks, EMR	ML training, feature store, model serving

This is not "pick one platform" — it's "use the right engine for each stage, with FSx for ONTAP as the single source of truth."

Zero-ETL design principle: The goal is not "zero processing" — it's "no hand-built copy pipelines, no duplicate storage management, no stale data." Where a platform requires data in S3 (Databricks UC, Delta/Iceberg writes), use DataSync as a managed bridge — not a custom ETL pipeline.

Security Model

Every request to FSx for ONTAP S3 Access Points must pass two authorization layers (AWS documentation):

S3-side authorization: IAM identity policy, S3 Access Point policy, VPC endpoint policy (if applicable), SCP
FSx-side authorization: Associated UNIX or Windows file system user permissions on the underlying volume

Both layers must permit the request. A permissive IAM policy does not override restrictive file system permissions, and vice versa.

The Compatibility Map

Verified (Evidence in verification-pack)

Platform	Pattern	Benchmark	Cost/Query
Athena	Serverless SQL via Glue Catalog	54.8 MB/s (5M rows in 2.2s)	~$0.0005
DuckDB Lambda	In-process analytics (arm64)	10K rows in 452ms (warm)	~$0.00001
EMR Spark	Distributed Spark SQL	10K rows read+write in 16s	~$0.001
Redshift Spectrum	DWH + external data JOIN	5M rows in 4.3s	~$0.005
Trino	Open-source distributed SQL	5M rows in 1.5s	Compute cost only
Glue ETL	PySpark medallion pipeline	10K rows transform in 64s	~$0.02

Candidate (AWS-documented, requires workload validation)

Platform	Pattern	Notes
Lake Formation	Governance overlay	Table/column-level access behavior observed; production workload validation needed
Bedrock KB	RAG document ingestion	Per AWS tutorial; permission-aware retrieval requires separate validation

Blocked in Validation (Third-Party Platforms)

Platform	Symptom	Root Cause	Workaround
Databricks (Unity Catalog)	Subdirectory ls → AccessDenied; CREATE TABLE → fails	Session policy partially resolved with `access_point` field; prefix-level listing and UC table creation still blocked	Explicit-path spark.read works but without UC table registration, governance features (lineage, tags, fine-grained access) cannot be applied; Instance Profile + boto3 for full access (bypasses UC entirely)
Snowflake (External Stage)	✅ Works with `AWS_ACCESS_POINT_ARN`	`AWS_ACCESS_POINT_ARN` stage parameter resolves session policy for GetObject	Full zero-copy analytics: SELECT, External Table, Directory Table, Cortex AI (summarize/translate/sentiment), Governance Tags, Row/Column policies — all verified. See [Part 3]

Databricks update (2026-05-24): Setting the access_point field on the UC External Location partially resolves the session policy issue. Top-level dbutils.fs.ls, dbutils.fs.head, and spark.read with explicit file paths now succeed. However, UC table creation (CREATE TABLE LOCATION) fails, subdirectory listing is blocked, and write operations are denied. Without UC table registration, Unity Catalog governance features — lineage tracking, fine-grained access control, governance tags, and audit — cannot be applied to the data. This means the data is technically readable but not governable through UC. Support case active — awaiting guidance on table creation and prefix-level access.

Support cases filed with both vendors.

Not Suitable for This Path (Table Format Constraints)

Format	Write Operation	Why It Failed in Validation
Delta Lake	INSERT/MERGE/VACUUM	Requires conditional writes (`If-None-Match`) for `_delta_log/` commit — FSx for ONTAP S3 AP returns 501 Not Implemented
Apache Iceberg	CREATE TABLE/INSERT	S3FileIO metadata write requires conditional writes for atomic commit — same root cause as Delta
Apache Hudi	Upsert/Compaction	Timeline commit requires atomic rename — not available on FSx for ONTAP S3 AP

Important distinction (read vs write): The failures above are for write operations only. Reading pre-existing Iceberg/Delta tables (where metadata and data files already exist on storage) is theoretically possible via GetObject — but has not been validated on FSx for ONTAP S3 AP. If you have Iceberg tables written to standard S3 (via EMR/Glue), those can be registered in Glue Data Catalog and queried alongside FSx for ONTAP external tables from the same Athena/Redshift session.

In this validation, transactional table writes failed because the tested engines required conditional writes (If-None-Match / put-if-absent) that FSx for ONTAP S3 AP does not support (returns 501 Not Implemented). AWS feature request submitted (May 2026). See API support documentation.

What DOES work for writes: Flat Parquet/CSV append via PutObject (Athena CTAS, Glue ETL write-back, EMR Spark write, DuckDB COPY TO).
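The put-if-absent dependency described above can be illustrated with a small in-memory sketch. This is not Delta Lake's commit code and not a real S3 client: `InMemoryStore`, `put_if_absent`, and `commit` are hypothetical names, and the store only models the two behaviors discussed here (a conditional put that succeeds for exactly one writer, and an endpoint that rejects the conditional put).

```python
class InMemoryStore:
    """Toy object store used only for illustration (not a real S3 client)."""

    def __init__(self, supports_conditional_put: bool):
        self.objects: dict[str, bytes] = {}
        self.supports_conditional_put = supports_conditional_put

    def put_if_absent(self, key: str, body: bytes) -> bool:
        """Model PutObject with `If-None-Match: *`."""
        if not self.supports_conditional_put:
            # Models the 501 Not Implemented response described above.
            raise NotImplementedError("501 Not Implemented")
        if key in self.objects:
            return False  # models 412 Precondition Failed
        self.objects[key] = body
        return True


def commit(store: InMemoryStore, version: int, writer: str) -> str:
    """Try to claim one log version; at most one writer may succeed."""
    key = f"_delta_log/{version:020d}.json"
    try:
        created = store.put_if_absent(key, writer.encode())
    except NotImplementedError:
        return "unsupported"
    return "committed" if created else "conflict"


native_s3 = InMemoryStore(supports_conditional_put=True)
print(commit(native_s3, 1, "writer-a"))  # first writer claims version 1
print(commit(native_s3, 1, "writer-b"))  # second writer must retry on version 2

fsx_s3_ap = InMemoryStore(supports_conditional_put=False)
print(commit(fsx_s3_ap, 1, "writer-a"))  # no safe way to claim a version
```

Flat Parquet append avoids this problem because each writer produces a new object name and no writer has to win a race for a shared log entry.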

Benchmark Methodology

All benchmark numbers should be read with the following context:

Parameter	Value
FSx for ONTAP deployment type	Single-AZ
Provisioned throughput	128 MB/s
Region	ap-northeast-1
Dataset shape	10K rows (250 KB) and 5M rows (103 MB), single Parquet file
Run type	Warm (unless noted as cold start)
Network path	Internet-origin AP (no VPC attachment for managed services)

Future benchmark runs will also capture: prefix depth, file count per prefix, average object size, p50/p90/p95/p99 latency where available, and cold/warm/repeated run count.

FSx S3 AP latency is in the tens of milliseconds range, and throughput depends on the file system's provisioned throughput capacity (AWS documentation). These benchmarks are sizing references from one test environment, not service limits or guarantees.

Architecture Decision Guide

Q: Do you need to WRITE transactional tables (Delta/Iceberg/Hudi)?
  → Yes: Use native S3 for write path; FSx S3 AP for read-only source data
  → No: FSx S3 AP can handle the read-oriented and flat-file write patterns validated in this series

Q: Do you need lower latency or more concurrency than the file system's provisioned throughput allows?
  → Yes: Use native S3
  → No: FSx S3 AP (tens of ms, provisioned throughput)

Q: Do you have existing NAS data you want to analyze?
  → Yes: FSx S3 AP eliminates the copy pipeline
  → No: Native S3 may be simpler

Q: Do you need NFS/SMB access alongside S3 analytics?
  → Yes: FSx S3 AP (multi-protocol on same data)
  → No: Evaluate based on above
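As a rough sketch, three of the questions above (transactional writes, existing NAS data, multi-protocol access) can be encoded as a small helper for a discovery checklist. The `Workload` fields and the `recommend` function are hypothetical names introduced here, and the returned strings paraphrase the guide. Treat it as a conversation aid, not a sizing tool.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    needs_transactional_writes: bool  # Delta / Iceberg / Hudi writes
    has_existing_nas_data: bool       # data already on NFS/SMB shares
    needs_nfs_smb_access: bool        # file protocols alongside S3 analytics


def recommend(w: Workload) -> list[str]:
    """Return one note per question in the decision guide."""
    notes = []
    if w.needs_transactional_writes:
        notes.append("write path: native S3; FSx S3 AP as read-only source")
    else:
        notes.append("FSx S3 AP: read and flat-file write patterns")
    if w.has_existing_nas_data:
        notes.append("FSx S3 AP removes the copy pipeline")
    else:
        notes.append("native S3 may be simpler")
    if w.needs_nfs_smb_access:
        notes.append("FSx S3 AP: multi-protocol access on the same data")
    return notes


for note in recommend(Workload(True, True, False)):
    print(note)
```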

Decision Criteria

Scale when:

Business metric improves (freshness, cost, time-to-insight)
Governance path is approved
Performance impact is within threshold

Adjust when:

Engine works but governance or performance needs redesign
Staging to native S3 is required for write path

Stop when:

Transactional table write semantics are mandatory on the same path
Vendor session policy blocks production path with no approved workaround
Security owner rejects the access model

Business Value Hypotheses

Business issue	Baseline metric	Expected value	Validation path	Decision owner
NAS analytics requires nightly copy to S3	Copy pipeline runtime, freshness lag	Reduce data freshness lag to near-zero	Athena / Glue / EMR direct query	Data platform owner
Enterprise documents are hard to search	Avg search time per user	Faster document discovery	Bedrock KB / permission-aware RAG	Information management owner
ETL pipeline duplicates storage	Duplicate storage cost	Lower copy and storage overhead	Glue / EMR write-back to same volume	Storage / FinOps owner
Platform selection is unclear	Weeks spent on PoC	Faster architecture decision	This compatibility map	Architecture lead

Partner Offer Paths

Customer need	Suggested offer	Exit decision
Query NAS data without copy	Athena / Redshift Spectrum validation pilot	Scale / adjust / stop
ETL from NAS to curated Parquet	Glue or EMR Serverless validation sprint	Production design / stage to S3
RAG over enterprise documents	Bedrock KB / permission-aware RAG assessment	Proceed only with authorization model validated
Databricks lakehouse integration	UC External Location with `access_point` field for read; staging to native S3 for Delta write	File-level read works under UC; subdirectory listing and table creation pending vendor resolution
Transactional table write	Native S3 table storage design	FSx S3 AP as source, not table log storage

The purpose of these offers is not to force every workload onto FSx S3 AP, but to quickly identify the right access path, the right engine, and the right stop condition.

Key Technical Findings

1. Internet-Origin AP Required for Managed Services

In this validation, managed service paths (Athena, Glue, Redshift Spectrum, Bedrock) required internet-origin access points because the service access path did not originate from the customer VPC. Validate this per service, region, and network configuration.

2. Parquet Timestamp Compatibility

pandas and DuckDB generate Parquet with nanosecond timestamps by default. Spark (Glue, EMR) could not read these files in this validation. Write timestamps at microsecond resolution for cross-engine compatibility.

3. EMRFS vs S3A

EMR's EMRFS (s3://) natively supports S3 AP aliases. The S3A FileSystem (s3a://) does NOT work with AP aliases (URL parsing error). Use s3:// prefix in EMR.

4. DuckDB httpfs Configuration

DuckDB requires s3_url_style = 'path' and explicit s3_endpoint to work with S3 AP aliases. In Lambda, also set home_directory = '/tmp'.
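A minimal sketch of those settings, written as a helper that only builds the DuckDB statements so they can be reviewed before execution. The function name `duckdb_s3ap_statements` is hypothetical, and the region and endpoint values are assumptions you must replace with the ones validated in your environment. `s3_region` is included as an additional standard DuckDB setting that the text above does not mention.

```python
def duckdb_s3ap_statements(region: str, endpoint: str, in_lambda: bool = False) -> list[str]:
    """Build the SET statements described above (values are not escaped)."""
    statements = []
    if in_lambda:
        # Lambda only offers /tmp as writable space, so set the home
        # directory before the extension is installed.
        statements.append("SET home_directory = '/tmp';")
    statements += [
        "INSTALL httpfs;",
        "LOAD httpfs;",
        f"SET s3_region = '{region}';",
        f"SET s3_endpoint = '{endpoint}';",  # explicit endpoint
        "SET s3_url_style = 'path';",         # path-style access
    ]
    return statements


# Placeholder values for illustration only.
for stmt in duckdb_s3ap_statements("ap-northeast-1", "s3.ap-northeast-1.amazonaws.com", in_lambda=True):
    print(stmt)
```

Each statement can then be passed to a DuckDB connection's `execute` call in order.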

5. Trino Hive Connector

Trino requires hive.s3.path-style-access=true and explicit hive.s3.endpoint to resolve S3 AP aliases. Same pattern as DuckDB — path-style access is the key.

6. S3 Gateway Endpoint Routing

VPC-attached compute (Lambda in VPC, EC2) may experience timeouts when accessing FSx S3 AP through an S3 Gateway VPC Endpoint. The FSx S3 AP alias resolves to s3-r-w.<region>.amazonaws.com which may not route correctly through the Gateway endpoint. Workaround: use NAT Gateway or place compute outside VPC. See FSx S3 AP Networking Considerations.

7. Session Policy Is the Common Blocker for Third-Party Platforms

The session policy issue is not unique to one vendor in this validation. It may affect any analytics platform that applies restrictive AssumeRole session policies designed around standard S3 bucket ARN patterns. AWS-native services work because they use IAM roles directly without intermediary session policies.
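The mechanism can be sketched as follows. With a session policy, the effective permissions are the intersection of the role's policies and the session policy, so a session policy whose resources only match bucket-style ARNs denies a request that arrives with an access point ARN. The code below is a deliberate simplification: it uses glob matching from the standard library instead of real IAM evaluation, and all ARNs, account IDs, and names are illustrative placeholders.

```python
from fnmatch import fnmatchcase


def resource_allowed(patterns: list[str], requested_arn: str) -> bool:
    """Simplified wildcard match; real IAM evaluation has more rules."""
    return any(fnmatchcase(requested_arn, p) for p in patterns)


def effective_allow(role_resources: list[str], session_resources: list[str], arn: str) -> bool:
    """A request must be allowed by BOTH the role and the session policy."""
    return resource_allowed(role_resources, arn) and resource_allowed(session_resources, arn)


role_resources = ["*"]  # permissive role (placeholder)
session_resources = [   # session policy written for bucket ARNs only
    "arn:aws:s3:::example-bucket",
    "arn:aws:s3:::example-bucket/*",
]

via_bucket = "arn:aws:s3:::example-bucket/data/file.parquet"
via_access_point = (
    "arn:aws:s3:ap-northeast-1:111122223333:accesspoint/example-ap/object/data/file.parquet"
)

print(effective_allow(role_resources, session_resources, via_bucket))
print(effective_allow(role_resources, session_resources, via_access_point))
```

The role alone would allow both requests; only the bucket-shaped session policy rejects the access point ARN.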

Note for Regulated Workloads

"Zero data movement" means source files do not need to be copied from FSx for ONTAP to S3 for the tested access paths. However, metadata, query results, logs, embeddings, temporary files, and derived datasets may still be created by the consuming service.

For regulated workloads, validate:

Data classification of source and derived data
Derived data location (query results, embeddings, temp files)
Encryption and key ownership at each layer
Audit log coverage (CloudTrail, platform logs, ONTAP audit)
Retention and deletion policy
Approval owner and expiration date

Bedrock KB is a strong candidate for RAG over NAS documents, but regulated use cases must validate permission-aware retrieval, data classification, human review requirements, and residual risk acceptance before production use.

For regulated workloads, do not start a PoC until the data owner, security owner, and platform owner agree on the allowed prefixes, derived data locations, logging scope, rollback plan, and approval expiration date.

Assurance artifacts to prepare:

Non-technical overview for stakeholders
Data flow diagram (source → AP → service → output)
Access control summary (dual-layer authorization)
Audit evidence summary
Rollback plan
Residual risk register

Store these artifacts with an approval ID, owner, review date, and expiration date so the PoC decision can be audited later.

GenAI / RAG Evaluation Metrics

For GenAI and RAG workloads on FSx for ONTAP data, measure:

Retrieval accuracy (relevant documents returned)
Permission-aware retrieval pass rate (unauthorized documents NOT returned)
Hallucination reduction vs baseline
Data freshness lag (NFS write → S3 AP availability)
Human review workload
User time saved vs previous search method

Start with read-only, permission-aware, human-review-attached PoC before production deployment.
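As one possible way to compute the permission-aware retrieval pass rate listed above, the sketch below counts a query as passing only when every returned document is in the requesting user's allowed set. The function name, the input shapes, and the sample users and document IDs are all hypothetical.

```python
def permission_pass_rate(results: list[tuple[str, list[str]]],
                         allowed: dict[str, set[str]]) -> float:
    """Fraction of queries that returned no unauthorized document.

    results: (user, returned document IDs) per query
    allowed: user -> document IDs that user may read
    """
    if not results:
        raise ValueError("at least one query result is required")
    passed = sum(
        1 for user, docs in results
        if set(docs) <= allowed.get(user, set())
    )
    return passed / len(results)


allowed = {"alice": {"doc-a", "doc-b"}, "bob": {"doc-a"}}
results = [
    ("alice", ["doc-a", "doc-b"]),  # all authorized
    ("bob", ["doc-a", "doc-c"]),    # doc-c is not authorized for bob
]
print(permission_pass_rate(results, allowed))
```

A user missing from `allowed` is treated as having no access, so any document returned for that user counts as a failure.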

Series Index

This is the series overview for "FSx for ONTAP S3 Access Points × Lakehouse Deep Dive."

Part	Platform	Status	URL
Part 1	Athena — Query NAS Data In Place	✅ Published	dev.to
Part 2	Databricks — A Layer-by-Layer Validation of Observed Boundaries	✅ Published	—
Part 3	Snowflake — From 'Access Denied' to Working External Tables	✅ Resolved	—
Part 4	DuckDB Lambda — Serverless for $0.00001/query	Ready to publish	—
Part 5	EMR Spark — Read-Write ETL Pipeline	Ready to publish	—
Part 6	Redshift Spectrum — DWH Meets NAS Data	Coming soon	—
Part 7	Trino — Open-Source SQL on NAS Data	Coming soon	—
Summary	This article (Overview — What Works and What Doesn't)	Ready to publish	—

Note: This overview article can be published as the final "summary" post in the series, or as a standalone reference.

Update to Part 1 (Athena)

Since Part 1 was published, additional verification has been completed and published as a v1.1 update:

CTAS write-back: Verified as WORKING (3.7s, writes Parquet back to FSxN S3 AP)
Partition projection: Verified with Hive-style partitioning
Benchmark: 54.8 MB/s peak throughput (5M rows, 103 MB scan in 2.2s)
9/9 negative tests pass: Unauthorized access correctly denied

Try It Yourself

git clone https://github.com/Yoshiki0705/fsxn-lakehouse-integrations.git
cd fsxn-lakehouse-integrations

# Deploy base infrastructure
aws cloudformation deploy \
  --template-file shared/cloudformation/fsxn-s3ap-base.yaml \
  --stack-name fsxn-lakehouse-base \
  --capabilities CAPABILITY_IAM

# Validate connectivity
python shared/scripts/validate-access.py --access-point-alias <your-ap-alias>

# Choose your platform: integrations/athena/, integrations/duckdb/, etc.

Each integration directory includes a README, CloudFormation template, deployment script, and sample queries.

What's Next

Databricks UC + access_point field — partial success confirmed (2026-05-24); awaiting vendor guidance on subdirectory listing and table creation
Snowflake AWS_ACCESS_POINT_ARN — resolved (2026-05-24); SELECT and External Table work with stage parameter
Apache Iceberg community engagement (S3FileIO + AP alias support)
ONTAP feature quantification (dedup ratio, snapshot RTO), now unblocked: the orphaned DNS/AD configuration was removed and S3 AP recovered on 2026-05-24
Redshift Spectrum and Trino deep-dive articles
Customer PoC execution with measured business outcomes

Operational Lessons Learned

S3 AP Timeout Caused by Orphaned DNS/AD Configuration (2026-05-24)

During this series validation, all S3 APs on one SVM became unresponsive for 7+ days. Root cause: the SVM had DNS servers configured for an AD domain that no longer existed. When the S3 AP backend processes requests on an AD-joined SVM, ONTAP's name-service stack attempts DNS resolution for user-mapping — if DNS is unreachable, requests block until timeout.

Key findings:

Disabling customer-configured FPolicy did NOT fix the issue
A separate SVM without DNS/AD worked normally on the same file system
Removing the orphaned CIFS/DNS configuration restored S3 AP instantly

Prevention: Do not leave orphaned DNS/AD configurations on SVMs used for S3 AP access. If AD is decommissioned, clean up vserver cifs and vserver services dns settings. See FSx S3 AP Networking — Section 7 for full details.

References

This series is based on hands-on verification, not documentation review. Every "Verified" claim has a corresponding evidence record in the verification-pack/ directory.

Disclaimer: This article is an independent validation report and does not represent AWS, NetApp, Databricks, or Snowflake official guidance. Product behavior, support status, and platform capabilities may change. Always validate in your own environment and consult vendor documentation and support channels.

From Serverless Patterns to Field-Ready Reference Architecture — FSx for ONTAP S3 Access Points, Phase 13.

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Fri, 22 May 2026 15:50:35 +0000

TL;DR

Previous phases showed what can be built.
Phase 13 shows how to evaluate, govern, deliver, and operate it.

In this context, field-ready means that the repository now includes not only deployable patterns, but also the guidance needed to evaluate, govern, size, deliver, and operate them in customer-facing scenarios.

The repository now includes success metrics, readiness guidance, governance controls, and benchmark-backed sizing references.

📊 Stats: 17 industry use cases + event-driven FPolicy + 6 FlexCache/FlexClone patterns | 1,499+ tests | 126 test files | 6 deployed CloudFormation stacks | Python 3.12 + SAM Transform

These stats represent repository validation coverage and sample stack verification, not a blanket production certification. The point of these numbers is not volume itself, but evidence that the repository now covers implementation, validation, delivery, and governance paths.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

Why Phase 13 Was Needed

As the repository grew from patterns into a multi-use-case library, the main question changed. It was no longer only "Can this be implemented?" but "How should teams evaluate it, govern it, size it, and explain it to customers?" Phase 13 adds that adoption layer.

Who This Is For

Phase 13 is useful for four audiences:

Architects who need deployment, trigger-mode, and sizing guidance
Security and governance reviewers who need authorization, audit, and human-review controls
Partners and SIs who need delivery assets for workshops and PoCs
Platform teams who need production-readiness criteria and operational guardrails

What Phase 13 Delivers

Phase 13 has two layers: technical implementation and adoption guidance. The technical layer provides the repository-validated building blocks; the adoption layer explains how to evaluate and deliver them safely.

Technical Implementation

FlexClone serverless automation: Step Functions orchestrates Snapshot → Clone → ProcessFiles through S3AP → CIFS Share → Notify, with split VPC placement
FlexCache Anycast/DR and Dynamic FlexCache patterns: FC1 (Anycast DR with health check + route decision + failover simulation), FC2 (dynamic FlexCache create/delete per job), FC3-FC6 (GenAI RAG, Automotive CAE, Life Sciences, Gaming) — 6 new use case patterns with CloudFormation templates
Event-driven and replay-safe processing: FPolicy-based ingestion (not native S3 bucket notifications), replay storm testing (1000+ events), audit-oriented lineage (v2 fields + S3 Object Lock), protobuf wire validation
Operational visibility: Split-path S3AP monitoring, cost dashboard (Metrics Math), benchmark results (p50/p90/p99 + concurrent access)
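To make "replay-safe" concrete, here is a minimal idempotency sketch: each event carries an ID, and an event that has already been seen is skipped, so a replay storm does not cause duplicate processing. This is not the repository's implementation. The class name, the `event_id` and `path` fields, and the in-memory set are assumptions; a deployed version would need a durable store for the seen IDs.

```python
class ReplaySafeProcessor:
    """Skip events whose ID has already been processed."""

    def __init__(self) -> None:
        self._seen: set[str] = set()    # in-memory only, for illustration
        self.processed: list[str] = []  # paths handled exactly once

    def handle(self, event: dict) -> bool:
        """Return True if the event was processed, False if it was a replay."""
        event_id = event["event_id"]
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        self.processed.append(event["path"])
        return True


processor = ReplaySafeProcessor()
events = [{"event_id": f"evt-{i}", "path": f"/vol/data/file-{i}"} for i in range(1000)]

# Deliver every event twice to simulate a replay storm.
outcomes = [processor.handle(e) for e in events + events]
print(len(processor.processed), outcomes.count(False))
```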

Adoption Guidance

A key shift in Phase 13 is outcome-driven evaluation. Each use case now includes Success Metrics structured as Outcome, Metric, and Measurement Method, so teams can define what success means before running the PoC or deploying the pattern. This matters because teams can now align technical patterns with business outcomes before starting implementation, rather than treating deployment success as the PoC goal.

The deployment profiles and trigger-mode guide also define where teams can safely experiment: start with polling and PoC profiles, add monitoring and governance controls, and move toward production only after exit criteria are met.

Start and evaluate — helps teams identify the right entry point and define success before deployment:

Quick Start Guide + E2E Demo Script
Customer Discovery Template
17 UC Success Metrics (Outcome / Metric / Measurement Method)

Decide and design — helps architects choose the right trigger mode, deployment profile, authorization model, and sizing assumptions:

S3AP Authorization Model + Troubleshooting Commands
Deployment Profiles + Trigger Mode Decision Guide
S3AP Benchmark Results + Fargate vs EC2 Decision Matrix
Persistent Store Sizing Calculator

Deliver and operate — helps partners, platform teams, and reviewers move from PoC to governed operation:

Partner/SI Delivery Checklist + PoC Proposal Example + Industry Expansion Guide
Workshop Guide (1-day structure with participant roles)
Production Readiness (4-level maturity model with exit criteria)
Well-Architected 6-Pillar Mapping + Trade-offs
Governance Checklist + Public Sector Adoption Roadmap

FlexClone Serverless Pipeline (Technical Highlight)

A Step Functions state machine orchestrates the complete FlexClone lifecycle with split VPC placement:

Lambda	VPC Config	Reason
CreateSnapshot	VPC-internal	ONTAP REST API requires management LIF access
CreateFlexClone	VPC-internal	ONTAP REST API
ProcessFiles	VPC-external	S3AP object access uses a different network path from ONTAP management API
CreateCIFSShare	VPC-internal	ONTAP REST API
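The lifecycle can be sketched as a linear Amazon States Language definition built in Python. The step names follow the flow described in this article. The Lambda ARNs are placeholders, and the repository's actual state machine may include wait states, retries, and error handling that this sketch omits.

```python
import json

# Step order taken from the flow described in the article.
STEPS = ["CreateSnapshot", "CreateFlexClone", "ProcessFiles", "CreateCIFSShare", "Notify"]


def build_state_machine(lambda_arns: dict[str, str]) -> dict:
    """Chain the steps as Task states; the last state ends the execution."""
    states = {}
    for index, name in enumerate(STEPS):
        state = {"Type": "Task", "Resource": lambda_arns[name]}
        if index + 1 < len(STEPS):
            state["Next"] = STEPS[index + 1]
        else:
            state["End"] = True
        states[name] = state
    return {"StartAt": STEPS[0], "States": states}


# Placeholder ARNs for illustration only.
arns = {name: f"arn:aws:lambda:ap-northeast-1:111122223333:function:{name}" for name in STEPS}
definition = build_state_machine(arns)
print(json.dumps(definition, indent=2))
```

The split VPC placement from the table is a property of each Lambda function's configuration, so it does not appear in the state machine definition itself.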

Industry applications: Media/VFX (render QC), EDA (parallel simulation), Healthcare (dataset branching), Financial (audit copies), DevOps (DB refresh).

Live verification: Snapshot → Clone → WaitForOnline → Notify completed in < 10 seconds against real FSx for ONTAP.

FlexCache Anycast/DR Pattern (FC1)

FlexCache Anycast/DR provides geographic read distribution and disaster recovery for FSx for ONTAP volumes:

Health Check Lambda: Monitors FlexCache origin and cache volumes via ONTAP REST API
Route Decision Lambda: Determines optimal read path based on cache health, latency, and availability
Failover Simulation: Validates route-decision behavior when the origin path or selected cache path is marked unavailable in the sample routing state
DynamoDB Routing Table: Tracks active/standby cache topology

In this sample, "Anycast" refers to application-level routing decisions based on cache health and availability, not a replacement for network-layer anycast design.
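A minimal sketch of an application-level route decision is shown below. The policy is an assumption made for illustration (prefer the healthy cache with the lowest latency, otherwise fall back to a healthy origin); the repository's Route Decision Lambda may weigh these signals differently. `PathHealth` and `choose_read_path` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PathHealth:
    name: str
    healthy: bool
    latency_ms: float
    is_origin: bool = False


def choose_read_path(paths: list[PathHealth]) -> PathHealth:
    """Prefer the fastest healthy cache; fall back to a healthy origin."""
    healthy = [p for p in paths if p.healthy]
    if not healthy:
        raise RuntimeError("no healthy read path available")
    caches = [p for p in healthy if not p.is_origin]
    candidates = caches or healthy
    return min(candidates, key=lambda p: p.latency_ms)


paths = [
    PathHealth("origin", healthy=True, latency_ms=40.0, is_origin=True),
    PathHealth("cache-a", healthy=True, latency_ms=12.0),
    PathHealth("cache-b", healthy=False, latency_ms=5.0),
]
print(choose_read_path(paths).name)

# Failover simulation: mark the selected cache unavailable and decide again.
paths[1].healthy = False
print(choose_read_path(paths).name)
```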

This pattern addresses scenarios where read performance must be distributed across deployment locations, subject to the supported and tested FSx for ONTAP configuration. FlexCache provides read acceleration while the origin volume remains the authoritative source of truth; cache volumes are acceleration paths whose access and route changes should be observable. This also helps governance discussions because teams can reason about where authoritative data resides and how cached access paths are audited.

This pattern focuses on read-path resilience and cache-aware routing; it does not replace a full DR strategy such as backup, replication, and recovery planning. For regulated environments, failover decisions should also define decision ownership, approval flow, and audit evidence for route changes. Route changes and failover decisions should be logged as audit events so that teams can review who changed the active path, when, and why.

The business outcome is faster and more resilient read access for distributed teams without requiring a full independent copy of the dataset.

The repository also includes FC2 (Dynamic FlexCache per-job lifecycle), FC3 (GenAI RAG with permission-aware chunking — connecting back to governance by keeping RAG preprocessing permission-aware), FC4 (Automotive CAE solver output analysis), FC5 (Life Sciences research data classification), and FC6 (Gaming build pipeline asset QC). Each has a deployable CloudFormation template and tests. The FlexCache/FlexClone patterns follow the same outcome-driven structure: each pattern should be evaluated through workload-specific success metrics, not only deployment success. Future updates will extend the same Outcome / Metric / Measurement Method structure to the FlexCache/FlexClone pattern READMEs.

Full documentation: flexcache-anycast-dr/

S3AP Benchmark Results (Sizing References)

These are sizing references from a specific test environment, not service-level guarantees. Validate in your own environment.

Test environment: FSx for ONTAP Single-AZ, 128 MBps throughput, ap-northeast-1.

S3AP object access was tested from the VPC-external path because ONTAP management API calls and S3AP object access used different network paths in the tested setup.

The percentile table is based on 20 repeated runs per object size. See the full benchmark document for methodology and raw observations.

GetObject Latency (concurrency=1)

Size	P50	P90	P99
1 KB	35.5 ms	39.0 ms	40.2 ms
1 MB	47.8 ms	63.3 ms	92.3 ms
5 MB	108.0 ms	115.8 ms	134.8 ms
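For readers who want to produce a comparable table from their own runs, percentiles can be computed with the Python standard library. This is a sketch under assumptions: the benchmark's actual tooling and interpolation method are not stated in this article, and the sample values below are synthetic, not measured data.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return p50/p90/p99 using linear interpolation between samples."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}


# Synthetic latencies in milliseconds, for illustration only.
samples = [float(v) for v in range(1, 101)]
print(latency_percentiles(samples))
```

With only 20 runs per object size, the p99 value is effectively an interpolation close to the slowest run, so it is better read as an upper-tail indicator than as a stable percentile.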

Concurrent Access (1 MB file)

Concurrency	Avg Latency	Aggregate Throughput
1	63.8 ms	35.6 MB/s
5	112.9 ms	108.9 MB/s
10	151.7 ms	137.6 MB/s

Key finding: In this test environment, FSx Throughput Capacity became the bottleneck for parallel access. At 128 MBps provisioned throughput, concurrency=10 reached the observed saturation point. Higher parallelism should be evaluated with a higher FSx throughput configuration. Short-duration aggregate throughput can appear slightly above the provisioned value due to measurement windows, rounding, and burst behavior; sustained throughput should be validated against the provisioned FSx throughput capacity.

Range GET: Confirmed working in this test environment. Useful for DICOM headers (4 KB), GDS/OASIS headers (1 KB), SEG-Y trace headers (3.6 KB), PDF first-page OCR (100 KB).
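A header-only read comes down to sending an HTTP `Range` header with an inclusive byte range. The helper below (a hypothetical name) builds that value; the resulting string is what an S3 client would send, for example as the `Range` argument of boto3's `get_object`.

```python
def range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value; byte positions are inclusive."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


# First 4 KB of an object, e.g. to inspect a file header.
print(range_header(0, 4096))
```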

This article uses MB/s; it is equivalent to the MBps notation shown in some AWS console contexts.

Full results: S3AP Benchmark Results

Important Architectural Clarifications

S3AP is an access boundary, not a full S3 bucket. FSx for ONTAP S3 Access Points provide an S3-facing access boundary for file data. Data remains on FSx for ONTAP and continues to be accessible through NFS and SMB. Some bucket-level features and integration patterns, such as native S3 bucket notifications, lifecycle policies, and versioning, do not apply directly. See the S3AP compatibility notes for the current tested behavior.

Trigger strategy matters. Because native S3AP event notifications are not available, the repository provides POLLING (simplest), EVENT_DRIVEN (FPolicy-based, near-real-time; not native S3 bucket notifications), and HYBRID modes. Default is POLLING.
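The idea behind POLLING mode can be sketched as a comparison of two listings taken at different times. This is an illustration, not the repository's code: `detect_changes` is a hypothetical helper, and each snapshot is assumed to map an object key to a change marker such as the ETag or last-modified value returned by a listing call.

```python
def detect_changes(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two listing snapshots (key -> change marker)."""
    created = sorted(k for k in current if k not in previous)
    modified = sorted(k for k in current if k in previous and current[k] != previous[k])
    deleted = sorted(k for k in previous if k not in current)
    return {"created": created, "modified": modified, "deleted": deleted}


# Synthetic snapshots for illustration only.
before = {"logs/a.json": "etag-1", "logs/b.json": "etag-2"}
after = {"logs/a.json": "etag-1", "logs/b.json": "etag-3", "logs/c.json": "etag-4"}
print(detect_changes(before, after))
```

Polling trades latency for simplicity: changes are noticed only at the next poll, which is why the near-real-time path relies on FPolicy events instead.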

Performance depends on FSx sizing. S3 API access does not remove the need to size FSx for ONTAP correctly. S3AP, NFS, and SMB access share the provisioned throughput of the same FSx file system. The split-path design separates ONTAP management API access from S3AP object access because they use different network paths in the tested environment.

Authorization is dual-layer. Both AWS IAM and ONTAP file system identity must permit the request. S3 API access does not bypass ONTAP file-system permissions.

Governance and Responsible AI

Phase 13 explicitly adds governance guidance for regulated workloads:

Human Review: High-risk scenarios such as healthcare, genomics, sensitive operations, and government archives are modeled with 100% human confirmation as the recommended default in these sample patterns.

This is because anonymization leaks, variant misclassification, alert errors, and redaction failures can affect patient privacy, sensitive operational decisions, citizen privacy, and public trust. These patterns treat AI outputs as assistive signals, not final decisions. Actual review thresholds should be adjusted based on each organization's risk assessment, data classification, and governance requirements.

Audit trail: DynamoDB records (who/when/what reviewed), with S3 Object Lock or similar immutability controls for tamper-resistant retention. The sample uses DynamoDB as one implementation option; customer implementations should align with the organization's existing audit platform, retention policy, and access-control model.
Separation of duties: Reviewer ≠ Approver ≠ Auditor ≠ Operator.
Periodic review: Example cadence: quarterly AI output quality review, annual compliance mapping update, and incident-triggered process revision.

The goal is not to automate final judgment, but to make AI-assisted processing reviewable, attributable, and auditable. Before moving beyond PoC, teams should identify who owns the decision to proceed, who approves AI-assisted outputs, and who reviews audit evidence. For regulated scenarios, this should be treated as a multi-stakeholder decision involving business owners, security, compliance, operations, and data owners.

For public sector and regulated workloads, the first step is often confirming data readiness: where the data resides, how it is classified, who can access it, and how review and audit records are retained. The Public Sector Adoption Roadmap maps PoC, controlled rollout, and broader adoption to governance checkpoints and stakeholder decisions.

This checklist provides governance guidance for architectural and operational review. It does not replace legal, compliance, privacy, or regulatory assessment by the responsible organization.

Full checklist: Governance Checklist

The Reading Path

Use the reading path below to choose the shortest route based on your role.

For partners and system integrators, Phase 13 provides reusable delivery assets rather than only reference code.

For a customer-facing motion:

Use the Partner/SI Delivery Checklist as the primary entry point.
Use the Workshop Guide for facilitation.
Use UC Success Metrics to define PoC success criteria.

The workshop assets are designed to end with a concrete decision package: selected use case, trigger mode, deployment profile, success criteria, stakeholders, and next-phase actions.

Typical stakeholders include:

Legal Ops
Digital Transformation Office
Medical IT
Application Owner
Operations / Risk
Plant IT

The expected output is a customer-ready PoC plan: selected use case, architecture option, success metrics, governance considerations, estimated effort and cost assumptions, and next-phase criteria.

If you are new to the repository:
Start with Choose Your Path and Quick Start Guide.

If you are a security or governance reviewer:
Start with S3AP Authorization Model and Governance Checklist.

If you are a Partner or SI:
Start with Partner/SI Delivery Checklist, Workshop Guide, and the PoC Proposal Example in the checklist. For customers interested in distributed read performance, DR, or workload-specific cache/clone automation, also review the FlexCache/FlexClone patterns FC1–FC6.

If you are planning production rollout:
Use Production Readiness, S3AP Performance, and Deployment Profiles.

If you are evaluating for public sector / regulated workloads:
Use Public Sector Adoption Roadmap, Governance Checklist, and Customer Discovery Template.

If you only have 30 minutes:

Read Choose Your Path
Deploy the Quick Start pattern
Review the Success Metrics for the closest UC

If you are preparing a customer conversation:

Review the Partner/SI Delivery Checklist
Pick the closest industry use case from the expansion guide
Copy the PoC Proposal Example and adapt the Success Metrics

What's Next

The Phase 13 documentation alignment backlog is complete.

Future validation may include:

Optional FSx 256/512 MBps benchmark runs (requires throughput configuration change and additional cost)
Additional customer-specific sample runs
Expanded CloudWatch correlation during higher-throughput tests

These are optional evidence expansions and do not change the recommended reading path or the core architecture guidance.

Conclusion

Phase 13 makes the pattern library field-ready.

Previous phases proved the architecture through 17 industry use cases, event-driven ingestion, multi-account distribution, FlexClone automation, and extensive repository validation.

This phase makes it easier to evaluate, govern, deliver, and operate.

The repository is no longer just a collection of serverless patterns. It is a reference package that a partner can use in a customer workshop, a security reviewer can use in an architecture review, and a platform team can use to plan production rollout — with the core guidance available from the same GitHub repository.

The constraints are documented. Benchmarks are provided as sizing references. Governance controls are explicit. The delivery path is structured.

That is what field-ready means here: not a final endpoint, but a practical baseline for informed evaluation, governed experimentation, and structured delivery.

If you are evaluating FSx for ONTAP S3 Access Points today, start with Choose Your Path, pick the closest use case, and review its Success Metrics before deploying.

Repository: github.com/Yoshiki0705/FSx-for-ONTAP-S3AccessPoints-Serverless-Patterns

Previous phases: Phase 1 · Phase 7 · Phase 8 · Phase 9 · Phase 10 · Phase 11 · Phase 12

Query NAS Data In Place with Athena and FSx for ONTAP S3 Access Points

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Fri, 22 May 2026 08:28:29 +0000

TL;DR

You can query files stored on Amazon FSx for NetApp ONTAP directly from Amazon Athena through an FSx-attached S3 Access Point — without copying the source data to an S3 bucket. The source files remain on the FSx for ONTAP volume and are accessed through S3 object APIs.

I verified this end-to-end: Parquet files written via NFS are immediately queryable from Athena using the official AWS tutorial pattern.

This is Part 1 of a series exploring how FSx for ONTAP S3 Access Points integrate with various Lakehouse platforms. Part 2 covers Databricks — where platform security boundaries make things significantly more complex.

GitHub Repository: fsxn-lakehouse-integrations

If you want to reproduce this validation, start from the repository's integrations/athena/ directory, which contains CloudFormation templates, sample data generators, and query scripts.

v1.1 Update (2026-05-24): Since initial publication, additional verification has been completed:

✅ CTAS write-back: Athena CTAS successfully writes Parquet back to FSx S3 AP (3.7s)

✅ Benchmark: 54.8 MB/s peak throughput (5M rows, 103 MB full scan in 2.2s)

✅ Partition projection: Hive-style partitioning with columnar pruning confirmed

✅ Security Verified: 9/9 negative security tests PASS + CloudTrail audit confirmed

See Benchmark Results and Security Verification sections below.

What Is Verified in This Article

Verified:

NFS-written Parquet file is visible via FSx S3 AP (ListObjectsV2, StorageClass: FSX_ONTAP)
Athena can query the file through Glue Data Catalog
Standard S3 bucket result location works as the documented pattern
Experimental FSx S3 AP result output worked in my environment
CTAS write-back to FSx S3 AP (v1.1 — previously listed as failed)
Benchmark: 54.8 MB/s peak throughput on 5M row dataset (v1.1)
Partition projection with Hive-style partitioning (v1.1)
Security Verified: 9/9 negative tests PASS (v1.1)

Not verified:

Delta / Hudi / Iceberg writes
S3 bucket event notification semantics
Large-scale performance limits
CloudTrail data event coverage for production (validated for PoC in v1.1)

Why This Matters

Enterprise file servers hold massive amounts of data — design files, inspection images, research documents, log archives. Traditionally, to analyze this data with cloud-native tools like Athena, you had to:

Copy data from NFS/SMB to S3 (DataSync, scripts, etc.)
Maintain sync pipelines
Pay for duplicate storage
Deal with stale data

FSx for ONTAP S3 Access Points (launched December 2025) change this. The same volume that serves NFS/SMB clients now exposes an S3-compatible API. Athena queries hit the same bytes that your NFS clients read — no copy required for the source dataset.

Users (NFS/SMB)                    Athena (S3 API)
      │                                  │
      ▼                                  ▼
┌─────────────────────────────────────────────┐
│         FSx for ONTAP Volume                │
│         /analytics/sensor_data.parquet      │
│         /analytics/logs/*.json              │
└─────────────────────────────────────────────┘

Use Cases This Unlocks

This pattern is useful when enterprise data already lives on NFS/SMB file shares and analytics teams want to query it without building a copy pipeline to S3.

Examples:

Manufacturing: Sensor logs, inspection results, quality reports produced by factory systems → Outcome: Cut data staleness from 24h (nightly batch copy) to near-zero (direct query)
SAP / ERP: Batch export files, operational reports, reconciliation extracts, and analytics copies (not a direct replacement for application-native persistence or HA design) → Outcome: Eliminate sync pipeline maintenance
Financial services: Reconciliation files, transaction logs, regulatory extracts → Outcome: Audit evidence available in minutes instead of hours
Healthcare research: De-identified datasets, imaging metadata, study outputs → Outcome: Researchers query data without waiting for IT to provision S3 copies
EDA / Semiconductor: Design artifacts, simulation outputs, verification logs → Outcome: Engineers find relevant past designs without manual search
Enterprise file services: Archives for compliance analysis, audit evidence → Outcome: Compliance queries run on-demand instead of requiring pre-staged data

Mission-critical workload note
This pattern provides an analytics read-access layer for existing file data. It does not replace workload-specific HA, backup, Snapshot, SnapMirror, or DR designs. For SAP, databases, VDI, and enterprise file services, treat Athena-on-FSx as an analytics and evidence layer, not as the primary resilience architecture.

Workload Isolation Guidance

For mission-critical workloads, do not point exploratory analytics directly at the same directory used by latency-sensitive application writes unless the operational impact has been tested.

Recommended pattern:

Application-owned path: /prod/app-output/
Analytics landing path: /analytics/curated/
Athena query result path: Standard S3 bucket (conservative), or a separately validated output path
Snapshot / backup policy: Owned by the workload team
Glue/Athena access: Owned by the analytics platform team

For SAP, database exports, or ERP file drops, treat this pattern as a read-access analytics layer. Do not change application HA, backup, restore, or DR design just because the files are queryable through S3 APIs.

In this context, an analytics copy means an application-produced or batch-exported file that is safe for downstream analytics, not the primary application persistence path.

Operational Impact Validation

Before production use, validate operational impact:

Baseline NFS/SMB workload latency and throughput before enabling analytics queries
Athena query behavior during normal application write activity
FSx provisioned throughput utilization during scans (analytics and application workloads share the same backend throughput)
Query concurrency limits for the analytics team
Rollback plan if analytics workload affects application workload

Recommended metrics include FSx throughput utilization, client-side NFS/SMB latency, Athena query runtime, bytes scanned, and application-side error or timeout rates during query execution.

Rollback plan examples include disabling the Athena workgroup, revoking the S3 Access Point policy for analytics roles, reducing analytics query concurrency, or moving analytics to an isolated curated path.

What This Means for Production

For production, treat this as a shared-storage analytics access pattern. The value is eliminating source data copy; the responsibility is validating workload isolation, throughput impact, governance, and rollback.

This article is not a production certification. It is intended to start a production readiness discussion around workload isolation, governance, and rollback.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│  AWS Account                                                    │
│                                                                 │
│  ┌──────────────┐     ┌──────────────┐     ┌────────────────┐   │
│  │ FSx for ONTAP│     │ S3 Access    │     │ Athena         │   │
│  │ Volume       │◄────│ Point        │◄────│ (Serverless)   │   │
│  │              │     │ (Internet    │     │                │   │
│  │ /analytics/  │     │  origin)     │     │ SELECT ...     │   │
│  └──────────────┘     └──────────────┘     │ FROM table     │   │
│        ▲                     ▲             └────────────────┘   │
│        │                     │                      │           │
│   NFS/SMB clients      Glue Crawler          Query results      │
│   (write data)         (schema discovery)    (→ S3 bucket)      │
└─────────────────────────────────────────────────────────────────┘

Key points:

The access point must use Internet network origin. Athena accesses S3 from managed infrastructure outside your VPC, so VPC-origin access points deny its requests. The AWS tutorial requires Internet network origin for this path.
Glue Data Catalog provides the schema layer between Athena and the S3 AP
Query results are written to an S3 bucket (the standard Athena pattern), not back to the FSx volume. See Observed Behavior for an experimental alternative.

Prerequisites

FSx for ONTAP file system (ONTAP 9.17.1+)
A volume with data (Parquet, CSV, JSON, etc.)
S3 Access Point created with Internet network origin
An Athena workgroup with a query results location (standard S3 bucket)
IAM permissions for Athena, Glue, and S3 AP access

Step 1: Create the S3 Access Point

aws fsx create-and-attach-s3-access-point \
  --name my-analytics-ap \
  --type ONTAP \
  --ontap-configuration '{
    "VolumeId": "<YOUR_VOLUME_ID>",
    "FileSystemIdentity": {
      "Type": "UNIX",
      "UnixUser": {"Name": "fsxn_athena_reader"}
    }
  }' \
  --region <YOUR_REGION>

Wait for the lifecycle to become AVAILABLE:

aws fsx describe-s3-access-point-attachments \
  --filters Name=volume-id,Values=<YOUR_VOLUME_ID> \
  --region <YOUR_REGION> \
  --query 'S3AccessPointAttachments[].{Name:Name,Lifecycle:Lifecycle,Alias:S3AccessPoint.Alias}'

Output:

[{
  "Name": "my-analytics-ap",
  "Lifecycle": "AVAILABLE",
  "Alias": "my-analytics-ap-xxxxxxxxxxxxxxxxxxxxxxxxxxxx-ext-s3alias"
}]

Note: The alias ending in -ext-s3alias identifies this as an FSx for ONTAP S3 Access Point (as opposed to regular S3 Access Points which end in -s3alias).
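The suffix convention in the note above can be checked in code before a script assumes bucket-level features. The following is a minimal sketch, not part of the repository; the function name and the return labels are hypothetical, and it relies only on the two suffixes mentioned in the note.

```python
def classify_access_point_alias(alias: str) -> str:
    """Classify an S3 access point alias by its suffix (hypothetical helper)."""
    # Check the longer suffix first: "-ext-s3alias" also ends with "-s3alias".
    if alias.endswith("-ext-s3alias"):
        return "fsx-attached"
    if alias.endswith("-s3alias"):
        return "regular-s3"
    return "unknown"
```

A caller could use the result to skip steps that depend on standard S3 bucket features when the alias is FSx-attached.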

Security note for file-system identity
This walkthrough uses a dedicated read-only identity (fsxn_athena_reader). Make sure the corresponding UNIX/Windows permissions allow read access to the analytics path. Avoid using root in production — scope the identity to the minimum permissions required.

Step 2: Set the Access Point Policy

This walkthrough uses role-based principals for Athena and Glue. Replace the placeholder role ARNs with the IAM roles used by your Athena workgroup and Glue crawler. Avoid account-wide principals in production.

aws s3control put-access-point-policy \
  --account-id <YOUR_ACCOUNT_ID> \
  --name my-analytics-ap \
  --policy '{
    "Version": "2012-10-17",
    "Statement": [{
      "Sid": "AllowAnalyticsRead",
      "Effect": "Allow",
      "Principal": {"AWS": [
        "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<ATHENA_QUERY_ROLE>",
        "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<GLUE_CRAWLER_ROLE>"
      ]},
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:accesspoint/my-analytics-ap",
        "arn:aws:s3:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:accesspoint/my-analytics-ap/object/*"
      ]
    }]
  }' \
  --region <YOUR_REGION>

The policy above is the conservative read-only analytics policy. If you intentionally test query result output to the FSx S3 Access Point (see Observed Behavior), add s3:PutObject scoped to the experimental output prefix only:

{
  "Sid": "AllowExperimentalResultWrite",
  "Effect": "Allow",
  "Principal": {"AWS": "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<ATHENA_QUERY_ROLE>"},
  "Action": "s3:PutObject",
  "Resource": "arn:aws:s3:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:accesspoint/my-analytics-ap/object/athena-results/*"
}

Security note: FSx for ONTAP S3 Access Points enforce S3 Block Public Access by default — this cannot be disabled. All requests require valid IAM credentials. Additionally, the file system user associated with the access point must have read permission on the files being queried.

Policy note: The policy above is the minimum that worked in my validation. If your Glue crawler or Athena workgroup reports location-related access errors, compare the policy with the official tutorial and CloudTrail events, and add only the required actions.
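If you prefer to generate the policy document instead of hand-editing JSON, the read-only policy from this step could be built as follows. This is a sketch under assumptions: the function name and its parameters are hypothetical, and the output simply mirrors the statement shown above.

```python
import json


def build_read_only_policy(region: str, account_id: str, ap_name: str, role_names: list) -> dict:
    """Build the read-only access point policy from this step (hypothetical helper)."""
    ap_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{ap_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowAnalyticsRead",
            "Effect": "Allow",
            # Role-based principals only; no account-wide principal.
            "Principal": {"AWS": [f"arn:aws:iam::{account_id}:role/{name}" for name in role_names]},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            # The access point itself (for listing) and the objects under it (for reads).
            "Resource": [ap_arn, f"{ap_arn}/object/*"],
        }],
    }


if __name__ == "__main__":
    # Placeholder values; replace with your own account, region, and role names.
    policy = build_read_only_policy("us-east-1", "111122223333", "my-analytics-ap",
                                    ["athena-query-role", "glue-crawler-role"])
    print(json.dumps(policy, indent=2))
```

The printed JSON can be passed to the --policy argument of the put-access-point-policy command shown above.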

Step 3: Upload Test Data via NFS

On a machine with NFS access to the FSx volume:

import pandas as pd
import numpy as np

# Generate 10,000 rows of sensor data
np.random.seed(42)
n_rows = 10000
df = pd.DataFrame({
    'timestamp': pd.date_range('2026-01-01', periods=n_rows, freq='1min'),
    'sensor_id': np.random.choice(['sensor_A', 'sensor_B', 'sensor_C',
                                    'sensor_D', 'sensor_E'], n_rows),
    'temperature': np.round(np.random.normal(25, 5, n_rows), 2),
    'humidity': np.round(np.random.uniform(30, 90, n_rows), 2),
    'pressure': np.round(np.random.normal(1013, 10, n_rows), 2),
    'status': np.random.choice(['normal', 'warning', 'critical'], n_rows,
                                p=[0.85, 0.12, 0.03])
})

# Write as Parquet to the NFS-mounted volume
df.to_parquet('/mnt/fsxn/analytics/sensor-data/sensor_data.parquet', index=False)
print(f"Wrote {len(df)} rows ({df.memory_usage(deep=True).sum()/1024:.0f} KB in memory)")

The same file is now accessible via both NFS (/mnt/fsxn/analytics/sensor-data/sensor_data.parquet) and S3 API (s3://<AP_ALIAS>/sensor-data/sensor_data.parquet).
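The path mapping above can be expressed as a small helper. This sketch assumes the volume root is mounted at /mnt/fsxn/analytics, which is what the two paths in this walkthrough imply; the function name is hypothetical.

```python
from pathlib import PurePosixPath


def nfs_path_to_s3_uri(nfs_path: str, mount_point: str, ap_alias: str) -> str:
    """Translate a path under the NFS mount into the matching S3 AP URI."""
    # relative_to raises ValueError if nfs_path is not under mount_point.
    relative = PurePosixPath(nfs_path).relative_to(PurePosixPath(mount_point))
    return f"s3://{ap_alias}/{relative.as_posix()}"
```

For example, passing the Parquet path written above with mount point /mnt/fsxn/analytics yields a URI ending in /sensor-data/sensor_data.parquet.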

Step 4: Verify S3 AP Access

aws s3api list-objects-v2 \
  --bucket "$AP_ALIAS" \
  --prefix "sensor-data/" \
  --region <YOUR_REGION>

Output:

{
  "Contents": [{
    "Key": "sensor-data/sensor_data.parquet",
    "Size": 252858,
    "StorageClass": "FSX_ONTAP"
  }]
}

Note the StorageClass: FSX_ONTAP — this confirms the data lives on FSx, not S3.
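To turn this check into a repeatable validation step, the listing response can be inspected programmatically. The sketch below works on the parsed JSON shown above (or on the dict returned by an SDK list call); the function name is hypothetical.

```python
def keys_not_on_fsx(list_response: dict) -> list:
    """Return keys whose StorageClass is not FSX_ONTAP (empty list means all good)."""
    return [
        obj["Key"]
        for obj in list_response.get("Contents", [])
        if obj.get("StorageClass") != "FSX_ONTAP"
    ]
```

An empty result for a non-empty listing is the condition you want to record as PoC evidence.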

Step 5: Create Glue Database and Table

aws glue create-database \
  --database-input '{"Name": "fsxn_analytics"}' \
  --region <YOUR_REGION>

You can either run a Glue Crawler for automatic schema discovery (recommended by the AWS tutorial), or create the table manually via Athena:

CREATE EXTERNAL TABLE fsxn_analytics.sensor_data (
  `timestamp` TIMESTAMP,
  sensor_id STRING,
  temperature DOUBLE,
  humidity DOUBLE,
  pressure DOUBLE,
  status STRING
)
STORED AS PARQUET
LOCATION 's3://<AP_ALIAS>/sensor-data/'
TBLPROPERTIES ('parquet.compression'='SNAPPY');

Step 6: Query with Athena

Basic aggregation

SELECT
  sensor_id,
  COUNT(*) AS readings,
  ROUND(AVG(temperature), 2) AS avg_temp,
  ROUND(AVG(humidity), 2) AS avg_humidity,
  SUM(CASE WHEN status = 'critical' THEN 1 ELSE 0 END) AS critical_count
FROM fsxn_analytics.sensor_data
GROUP BY sensor_id
ORDER BY critical_count DESC;

Verified result

sensor_id | readings | avg_temp | avg_humidity | critical_count
----------|----------|----------|--------------|---------------
sensor_A  |    2027  |   24.89  |    59.84     |      68
sensor_B  |    1986  |   25.11  |    60.23     |      62
sensor_C  |    2013  |   24.95  |    59.91     |      59
sensor_E  |    2000  |   24.98  |    60.02     |      56
sensor_D  |    1974  |   25.03  |    60.15     |      55

Query time: 1.46 seconds | Data scanned: 67 KB | Engine: Athena v3

Observed Behavior: Query Results Written to the FSx S3 Access Point

The AWS tutorial states:

"Athena reads data from your FSx for ONTAP volume through the access point. Athena query results are written to the Amazon S3 results bucket, not back to the FSx for ONTAP volume."

In my validation, however, setting OutputLocation to the FSx for ONTAP S3 Access Point alias succeeded and wrote the .csv and .metadata files back to the FSx volume:

aws athena start-query-execution \
  --query-string "SELECT 1 AS test" \
  --result-configuration \
    "OutputLocation=s3://<AP_ALIAS>/athena-results/" \
  --work-group primary \
  --region <YOUR_REGION>

Result: SUCCEEDED in 584ms

The result files appeared on the FSx volume and were immediately accessible via NFS.

Treat this as observed behavior from my environment, not a general production recommendation. The conservative production pattern is:

Source data: FSx for ONTAP S3 Access Point
Athena query results: Standard S3 bucket (as documented)

The experimental pattern validated in this post:

Source data: FSx for ONTAP S3 Access Point
Athena query results: FSx for ONTAP S3 Access Point (observed to work, not documented)

Validate this in your own environment before relying on it.

Governance warning: Do not enable experimental query result output to FSx S3 AP for sensitive datasets unless query result retention, encryption, audit evidence, and file-system permissions are reviewed. Query results may contain derived sensitive information. For sensitive datasets, experimental result output should require approval from the data owner, security owner, and workload owner.

Performance Characteristics

Metric	Observed	Notes
Simple SELECT query	584 ms	Includes result write
Aggregation (10K rows, 67 KB scanned)	1.46 s	GROUP BY with 4 aggregate expressions
Data scan cost	Standard Athena pricing	$5 per TB scanned
Storage class	FSX_ONTAP	Confirmed in ListObjects

Performance note
These numbers validate functional compatibility, not performance limits. The dataset is intentionally small (10K rows, 67 KB scanned). For real analytics workloads, test with realistic file sizes, object counts, partition layouts, concurrent queries, and FSx provisioned throughput. The throughput available through the S3 API depends on the FSx file system's provisioned throughput capacity (AWS documentation).
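To put the $5 per TB figure from the table in context, here is a rough cost estimate per query. It is a sketch under stated assumptions: a binary terabyte (1024^4 bytes) and a 10 MB minimum billed scan per query, which is my reading of the Athena pricing page; check current pricing for your Region before relying on the numbers.

```python
PRICE_PER_TB_USD = 5.0              # from the table above
BYTES_PER_TB = 1024 ** 4            # assumption: binary terabyte
MIN_BILLED_BYTES = 10 * 1024 ** 2   # assumption: 10 MB minimum per query


def estimate_scan_cost_usd(bytes_scanned: int) -> float:
    """Estimate the scan charge for one query (rough, excludes other services)."""
    billed = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed / BYTES_PER_TB * PRICE_PER_TB_USD
```

Under these assumptions, the 67 KB aggregation above is billed at the minimum, so its scan charge is a small fraction of a cent.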

S3 API Compatibility Boundary

FSx for ONTAP S3 Access Points expose file data through S3 object APIs, but they should not be treated as standard S3 buckets.

The safe mental model is:

Use S3 APIs for object read/write access to files on FSx
Use Glue and Athena for read-oriented analytics
Do not assume S3 bucket-level features exist (event notifications, versioning, lifecycle policies)
Do not assume lakehouse commit semantics (rename, conditional writes)
Validate every platform integration separately

In this article, the verified pattern is read-oriented analytics over Parquet/CSV/JSON files. Transactional table formats and commit protocols are outside the safe default boundary.
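One way to check the conditional-write boundary in your own environment is a small probe. The sketch below is an illustration, not the method used in this series: the function name is hypothetical, it expects a boto3 S3 client (whose put_object accepts IfNoneMatch), and it inspects the HTTP status on the raised error instead of importing botocore so the logic stays easy to stub.

```python
def probe_conditional_write(s3_client, bucket: str, key: str) -> str:
    """Try a PutObject with If-None-Match: * and classify the outcome."""
    try:
        s3_client.put_object(Bucket=bucket, Key=key, Body=b"probe", IfNoneMatch="*")
        return "supported"  # object was created conditionally
    except Exception as exc:  # botocore.exceptions.ClientError in practice
        response = getattr(exc, "response", None) or {}
        status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
        if status == 412:
            return "supported"  # precondition failed: the key already existed
        if status == 501:
            return "not_implemented"
        raise  # anything else (403, throttling, ...) is not a compatibility signal
```

The probe needs s3:PutObject on the target prefix and leaves a small object behind when the write succeeds, so point it at an isolated test prefix.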

Compatibility Matrix

Validated by legend:

This validation: Actually executed commands or queries in this environment and confirmed the result
Supported operations review: Confirmed based on the supported operations documentation or official tutorial
Supported operations review required: Not yet confirmed; additional validation needed before use

Capability	Status	Validated by	Notes
ListObjectsV2	✅ Verified	This validation	S3 AP alias worked
GetObject (Parquet scan)	✅ Verified	This validation	Athena v3
PutObject (small result file)	⚠️ Observed	This validation	Not documented as Athena result pattern
Glue table over S3 AP	✅ Verified	This validation	Manual DDL and Crawler
CTAS to S3 AP	✅ Verified (v1.1)	Post-publication validation	Writes Parquet via `external_location` parameter. 3.7s for GROUP BY → 3 rows.
Delta Lake writes	❌ Not supported	Part 7 validation (May 2026)	Requires conditional writes (`If-None-Match`) for `_delta_log/` commit — FSx for ONTAP S3 AP returns 501 Not Implemented
Iceberg writes	❌ Not supported	Part 7 validation (May 2026)	S3FileIO metadata write requires conditional writes — same root cause as Delta. AWS feature request submitted.
Iceberg reads (pre-existing table)	⚠️ Not validated	—	Theoretically possible if metadata files accessible via GetObject; requires separate validation
Hudi writes	❌ Not supported	Part 7 validation (May 2026)	Timeline commit requires atomic rename — not available on FSx for ONTAP S3 AP
S3 bucket event notifications	❌ Not part of verified pattern	Supported operations review required	Do not assume bucket-level eventing; validate against supported operations

CTAS is a write-path pattern, not just a read query. Treat CTAS separately from read-oriented SELECT validation because it writes new table data to a target S3 location and may leave partial/orphaned files on failure. CTAS should not be included in the initial read-oriented validation scope.

Transactional lakehouse formats may require semantics beyond simple object read/write, such as:

Atomic commit behavior
Rename or move-like commit operations
Conditional writes (If-None-Match)
Manifest consistency
Concurrent writer coordination
Cleanup of partial/orphaned files

This article does not validate those semantics. It validates read-oriented analytics over existing files.

Governance and Compliance Considerations

This pattern keeps the source files on FSx for ONTAP, but it does not remove the need for data governance.

Before using this pattern with regulated or sensitive datasets, review:

Data classification of source files
IAM and S3 Access Point policy scope (least privilege)
File system identity mapped to the access point (UNIX/Windows user permissions apply)
Glue Data Catalog permissions (who can see the table metadata)
Athena workgroup controls (query limits, result encryption)
Query result location and retention (results may contain derived sensitive data)
CloudTrail / audit evidence requirements
Snapshot, backup, retention, and deletion policy

Query results can be more sensitive than the original dataset because they may aggregate, filter, or derive new information. Apply encryption, retention, and access controls to the Athena result location as carefully as the source dataset.

This article is a technical validation, not a compliance attestation.

Production Controls Checklist

For regulated or sensitive datasets, define the following before production use:

[ ] Athena workgroup result location (standard S3 bucket)
[ ] Whether workgroup settings override client-side result settings
[ ] Query result encryption mode and KMS key ownership
[ ] Query result retention and deletion policy
[ ] IAM principals allowed to query the Glue table
[ ] File-system identity mapped to the S3 Access Point (dedicated, not root)
[ ] Audit evidence approach defined and validated (e.g., CloudTrail coverage for the S3 Access Point where applicable, with sample events captured as PoC evidence)
[ ] Approval process for enabling experimental result output to FSx S3 AP

For regulated workloads, consider enabling Athena workgroup override so that query result location and encryption cannot be changed by client-side settings. This prevents individual clients from changing where query results are written or how they are encrypted.
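As an illustration of the override setting, the workgroup configuration could be assembled like this before being passed to the Athena CreateWorkGroup or UpdateWorkGroup API. It is a sketch: the function name and arguments are hypothetical, and the choice of SSE_KMS is an assumption, not a setting validated in this article.

```python
from typing import Optional


def build_workgroup_configuration(result_location: str, kms_key_arn: str,
                                  scan_limit_bytes: Optional[int] = None) -> dict:
    """Build an Athena WorkGroupConfiguration that clients cannot override."""
    config = {
        # Forces the workgroup's result location and encryption on every query.
        "EnforceWorkGroupConfiguration": True,
        "ResultConfiguration": {
            "OutputLocation": result_location,
            "EncryptionConfiguration": {
                "EncryptionOption": "SSE_KMS",  # assumption: KMS-managed keys
                "KmsKey": kms_key_arn,
            },
        },
    }
    if scan_limit_bytes is not None:
        # Optional per-query scan limit, one of the workgroup controls listed earlier.
        config["BytesScannedCutoffPerQuery"] = scan_limit_bytes
    return config
```

With boto3, the returned dict would go in the Configuration argument of create_work_group.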

For regulated workloads, experimental writeback should be disabled by default and enabled only after explicit approval from the data owner, security owner, and workload owner.

Experimental writeback may be enabled only when:

Approval scope is documented
Output path is isolated from source data
Encryption and retention are defined for the output path
Cleanup and rollback procedures are documented
Review expiration date is set

Minimum audit evidence artifacts for PoC completion:

Scope statement: what the audit evidence demonstrates and what it does not (e.g., "validates access path and query result control for PoC scope; does not demonstrate full production compliance")
Access path description (IAM → AP policy → file-system identity)
Sample successful read event
Sample denied access event (if applicable)
Query result location configuration
Encryption configuration
Workgroup override setting (if used)
Reviewer sign-off (name, role, date, decision)

30-Minute Validation Flow

Create or verify the FSx S3 Access Point (AVAILABLE lifecycle)
Write one Parquet file through NFS to the analytics path
Confirm StorageClass: FSX_ONTAP with list-objects-v2
Create the Glue table (manual DDL or crawler)
Run one Athena query
Capture the validation artifacts (see below)
Decide Go / No-Go using the PoC Success Criteria

First Success Path

If you are validating this for the first time, keep the scope small.

Expected outcome:

One Parquet file written through NFS is visible through the S3 Access Point
Glue table creation or crawler schema discovery succeeds
Athena can query the file in place
Query result location behavior is validated and documented
NFS/SMB clients can still access the original file
IAM and file-system identity boundaries are understood

Do not start with Delta Lake, Hudi, Iceberg writes, large scans, or concurrent workloads. Prove the read path first.

PoC Success Criteria

Minimum success:

S3 Access Point attachment is AVAILABLE
ListObjectsV2 returns the expected test file
Glue table points to the S3 AP alias
Athena query succeeds and returns correct results
Results are reproducible from a clean workgroup/session

Operational success:

IAM role and S3 AP policy are scoped to the analytics roles
Athena workgroup controls are defined
Query result location and retention are documented
Dataset size and scan cost are measured
FSx throughput impact is measured during query
Existing NFS/SMB application workload impact is measured during Athena queries

Go / No-Go criteria:

Go: Read-only analytics on Parquet/CSV/JSON works with acceptable latency and cost
No-Go: Workload requires Delta/Hudi/Iceberg write commits through the S3 AP
No-Go: Platform governance requires Unity Catalog external locations and the platform cannot yet authorize the S3 AP (see Part 2)

Performance Test Plan

Note: This section defines the performance test plan and metrics to collect. It does not present benchmark results. Actual benchmark outputs will be added under verification-pack/ after validation runs are completed.

The next validation should include:

1 GB / 10 GB / 100 GB datasets
Many small files vs fewer large Parquet files
Partitioned layout (date=YYYY-MM-DD/sensor_id=...)
Concurrent Athena queries
Different FSx throughput capacity settings (128 / 256 / 512+ MBps)
NFS writer activity during Athena scans
Standard S3 result bucket vs observed FSx S3 AP result output

The goal is to separate Athena scan behavior, Glue metadata behavior, and FSx provisioned-throughput impact.
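For the partitioned-layout test, the object keys follow the Hive-style pattern listed above. A minimal sketch of a key builder, assuming date and sensor_id as the two partition columns; the function name and the file name are hypothetical.

```python
from datetime import date


def partition_key(prefix: str, day: date, sensor_id: str, filename: str) -> str:
    """Build a Hive-style key: <prefix>/date=YYYY-MM-DD/sensor_id=<id>/<file>."""
    return f"{prefix.strip('/')}/date={day.isoformat()}/sensor_id={sensor_id}/{filename}"
```

The same relative path applies on the NFS side, so a writer on the mounted volume can create the directories directly.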

Additional request pattern considerations:

Sequential vs parallel S3 API reads
Prefix layout impact on listing performance
Small object listing overhead
Repeated query behavior with warm Glue/Athena metadata

Metrics collection sources:

FSx metrics: CloudWatch (FSx namespace)
Athena query metrics: get-query-execution API (EngineExecutionTimeInMillis, DataScannedInBytes)
Client-side latency: CLI timing or SDK instrumentation
Error/timeout sources: Athena query execution status and failure reason, client-side logs, application-side timeout logs, CloudTrail events where applicable

Record results separately for cold runs (at least 1), warm-metadata runs (at least 1), and repeated runs (at least 3 executions). Report the average, minimum, maximum, and any notable outliers.
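The Athena side of this collection can be scripted. The sketch below assumes you already have the dict returned by the get-query-execution API for each run; the two function names are hypothetical, and outlier detection is left to the reader.

```python
from statistics import mean


def extract_query_metrics(execution_response: dict) -> dict:
    """Pull runtime and scanned bytes out of a get-query-execution response."""
    stats = execution_response["QueryExecution"]["Statistics"]
    return {
        "engine_ms": stats["EngineExecutionTimeInMillis"],
        "scanned_bytes": stats["DataScannedInBytes"],
    }


def summarize_runs(engine_ms_values) -> dict:
    """Summarize one run category (cold, warm-metadata, or repeated)."""
    values = list(engine_ms_values)
    return {
        "runs": len(values),
        "avg_ms": mean(values),
        "min_ms": min(values),
        "max_ms": max(values),
    }
```

Keeping one summary per run category avoids averaging cold and warm runs together.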

Validation Artifacts

For reproducibility, capture the following artifacts in your PoC:

S3 Access Point attachment lifecycle output (describe-s3-access-point-attachments)
list-objects-v2 output showing StorageClass: FSX_ONTAP
Glue table DDL or crawler output
Athena query execution ID
Athena query runtime and scanned bytes
Query result location and file listing
NFS listing showing the original source file is unchanged
IAM policy and access point policy used for the test

What's Next

In Part 2, I cover what happens when you try to connect Databricks to FSx for ONTAP S3 Access Points — where Unity Catalog's session policy, seccomp filters, and platform security boundaries create a significantly more complex picture.

Beyond Athena: The Multi-Engine Data Product Journey

Athena is often the first step in turning NAS file data into AI-ready data products. Here's how it fits in a broader multi-engine architecture:

FSx for ONTAP (source of truth)
  ↓ S3 Access Point
Athena (discover & profile)
  ↓ CTAS → curated Parquet on FSx for ONTAP or S3
  ├── Snowflake External Table → Cortex AI (summarize, RAG, sentiment)
  ├── Glue / EMR → further transformation, Iceberg tables on S3
  ├── Lake Formation → fine-grained governance (column/row/tag)
  └── Databricks (via DataSync → S3) → ML training, feature engineering

Why this matters for AI: The fastest path from "files on NAS" to "AI-ready data" is often:

Athena to discover what data exists and validate quality (at $5 per TB scanned, roughly $0.005 per GB)
Snowflake External Table for immediate Cortex AI functions (zero-copy, no COPY INTO needed for text AI)
Glue ETL to create curated Parquet datasets for ML training
Databricks for full ML lifecycle (if already in the ecosystem)

Each engine adds value at a different stage. Athena's role is discovery and validation — the cheapest way to confirm that your NAS data is queryable and useful before investing in downstream pipelines.

Series index:

Part 1: Athena (this article) — Security Verified
Part 2: Databricks — session policy deep dive (coming soon)
Part 3: DuckDB Lambda — serverless for $0.00001/query (coming soon)
Part 4: EMR Spark — read-write ETL pipeline (coming soon)

References

Scope reminder: This article verifies a limited read-oriented scenario. It does not validate production readiness, write-path behavior, distributed executor-scale processing, or all third-party analytics engines.

Article update plan:

v1.0 (2026-05-22) — Scope, observed behavior, validation plan.

v1.1 (2026-05-24, current) — Benchmark (54.8 MB/s), CTAS verified, partition projection, Security Verified (9/9 tests).

v1.2 (planned) — Production workload isolation, concurrent query scaling.

v1.3 (planned) — Cross-platform comparison (DuckDB, EMR, Redshift on same dataset).

Direct-to-Grafana: Shipping FSx for ONTAP Logs to Grafana Cloud Loki via OTLP Gateway

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Thu, 21 May 2026 09:12:09 +0000

TL;DR

We built a direct Lambda-to-Grafana Cloud pipeline that ships FSx for ONTAP audit logs to Loki without an intermediate OTel Collector. Three Lambda functions cover all event sources:

FSx for ONTAP audit logs → EventBridge Scheduler (every 5 min) → Lambda (polls & reads via S3 Access Point) → OTLP Gateway → Loki
EMS webhooks (ransomware alerts, quota warnings) → API Gateway → Lambda → OTLP Gateway → Loki
FPolicy file operations (real-time CIFS/SMB events) → ECS Fargate → SQS → Bridge Lambda → EventBridge → Lambda → OTLP Gateway → Loki

Everything is CloudFormation-templated and deployable with a single script. The infrastructure is fully parameterized, with no hardcoded values. This is a Grafana-specific direct integration by design; use the Collector path from Part 5 when you need backend portability.

If you only want to validate the path quickly, jump to First Success Path and deploy the audit poller first.

This is the single-backend counterpart to Part 5: simpler when Grafana Cloud is the chosen destination, less flexible when backend portability, enrichment, redaction, or multi-backend routing is required.

Why Direct Send (Without OTel Collector)?

In Part 5, we showed how the OTel Collector decouples Lambda from backends. That's the right choice when you need multi-backend delivery or vendor migration flexibility.

But if Grafana Cloud is your single observability platform and your goal is a simple serverless path, direct OTLP can be a good starting point. For production pipelines that need richer buffering, metadata enrichment, redaction, or routing, Grafana recommends an Alloy / Collector-based architecture.

Approach	Components	Latency	Cost
OTel Collector	Lambda → Collector (ECS/EC2) → Grafana	+50-100ms	Collector compute
Direct send	Lambda → Grafana OTLP Gateway	Minimal	Lambda only

The direct path is simpler, cheaper, and has fewer failure points. You can always graduate to the Collector path later (Part 5 shows how). Direct send is a good fit when operational simplicity is more important than in-pipeline enrichment, redaction, buffering, and multi-backend routing. If those requirements become mandatory, move the same OTLP payload model behind Alloy or the OpenTelemetry Collector.

Direct send reduces moving parts, but it also removes the Collector / Alloy queueing layer. For production, decide whether Lambda retry and DLQ are sufficient, or whether you need SQS buffering, DLQ replay, or the Collector / Alloy path for stronger delivery guarantees during endpoint outages or throttling.

Delivery guarantee decision (see full pattern guide):

Quickstart (this template): Scheduler retry + Scheduler DLQ + Lambda reserved concurrency + checkpoint retry

Medium volume: add Lambda failure destination and operational replay procedures

Higher reliability: insert SQS before shipping, or place Alloy / OTel Collector behind Lambda for batching, retry with persistent queue, transform, redaction, and multi-backend routing

Multi-backend or redaction/routing: use Part 5 Collector path

Architecture

┌─────────────────────────────────────────────────────┐
│ Event Sources                                        │
├─────────────────────────────────────────────────────┤
│                                                      │
│  EventBridge Scheduler                               │
│  rate(5 minutes) ──→ Lambda                          │
│                       │ lists new files via           │
│                       │ S3 Access Point              │
│                       │ (checkpoint in SSM)          │
│                       ▼                              │
│                OTLP Gateway                          │
│                (Grafana Cloud)                        │
│                       │                              │
│  EMS Webhook          │                              │
│  ──→ API GW ──→ Lambda ────────────┤                │
│     (ems_handler)                   │                │
│                                     ▼                │
│  FPolicy                           Loki             │
│  ──→ ECS Fargate ──→ SQS          (Explore,        │
│  ──→ Bridge Lambda                  Dashboard)      │
│  ──→ EventBridge                                    │
│  ──→ Lambda (fpolicy_handler) ─────────────────────┤
└─────────────────────────────────────────────────────┘

The audit log path uses a polling pattern: EventBridge Scheduler invokes Lambda every 5 minutes. Lambda lists new objects via the S3 Access Point, reads and processes them, then updates an SSM Parameter Store checkpoint to track progress. This avoids reliance on S3 Event Notifications, which are not supported by FSx for ONTAP S3 Access Points.
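The checkpoint step can be sketched as a pure function. This is an illustration of the polling pattern, not the repository's handler: the function name is hypothetical, and it assumes the checkpoint is the LastModified timestamp of the newest object already processed, which this section does not specify.

```python
from datetime import datetime


def select_new_objects(objects: list, checkpoint: datetime):
    """Return (keys newer than the checkpoint in processing order, next checkpoint)."""
    new = sorted(
        (obj for obj in objects if obj["LastModified"] > checkpoint),
        key=lambda obj: (obj["LastModified"], obj["Key"]),
    )
    # Advance the checkpoint only when something new was found.
    next_checkpoint = new[-1]["LastModified"] if new else checkpoint
    return [obj["Key"] for obj in new], next_checkpoint
```

A timestamp-only checkpoint can skip a file that lands with the same LastModified as the stored value, so a production handler may also want to track the last processed key. Persist the new checkpoint only after the batch has been shipped successfully.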

The same S3 Access Point boundary can be reused for other automation patterns (AI/ML, analytics, compliance archival) because the audit files remain on FSx for ONTAP while Lambda reads them through standard S3 object APIs — no data copy or NFS/SMB mount required.

This pattern does not replace ONTAP audit, EMS, or FPolicy configuration; it provides an AWS-native delivery and visualization layer for those ONTAP-native signals.

For business-critical workloads such as SAP, databases, VDI, or enterprise file services, treat this pipeline as an observability and evidence layer. It complements, but does not replace, workload-specific HA, backup, restore, and DR designs.

Use cases this unlocks:

Investigate file access activity for FSx for ONTAP-hosted enterprise file shares
Monitor available ONTAP EMS alerts, such as ransomware-related events, quota warnings, and storage/system events
Correlate audit logs, EMS, and FPolicy file operations in a single Grafana dashboard
Provide a lightweight observability path for SAP, database, VDI, and file service workloads using FSx for ONTAP
Start with direct OTLP delivery and graduate to Alloy / Collector when governance or multi-backend routing is required

The FPolicy path has two Lambda roles: a bridge Lambda that converts ECS/FPolicy server SQS output into EventBridge events, and fpolicy_handler.py, which ships those normalized EventBridge events to Grafana Cloud.

Key Discovery: OTLP Gateway, Not Loki Push API

During E2E verification, the Loki Push API returned HTTP 530 in my trial account. The OTLP Gateway worked reliably in this project and is the recommended Grafana Cloud OTLP ingestion path.

For logs, Grafana Cloud routes OTLP log data to Loki, where it becomes queryable with LogQL.

Our Lambda auto-detects the endpoint mode from the URL:

def _is_otlp_endpoint(endpoint: str) -> bool:
    """Detect Grafana OTLP Gateway or generic OTLP/HTTP logs endpoint."""
    endpoint = endpoint.rstrip("/")
    return (
        "otlp-gateway" in endpoint
        or endpoint.endswith("/otlp")
        or endpoint.endswith("/otlp/v1/logs")
        or endpoint.endswith("/v1/logs")
    )

USE_OTLP = _is_otlp_endpoint(LOKI_ENDPOINT)

When using the OTLP Gateway, configure LOKI_ENDPOINT as the base OTLP endpoint ending in /otlp. The Lambda appends /v1/logs when sending logs:

# Configure as base endpoint (Lambda appends /v1/logs)
LOKI_ENDPOINT=https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp
# Lambda POSTs to: https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp/v1/logs

The handler also accepts the full path (/otlp/v1/logs) without double-appending.
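
The append-without-doubling behavior described above can be sketched as a small helper. This is a minimal illustration, not the project's actual code; the function name `build_otlp_logs_url` is hypothetical.

```python
def build_otlp_logs_url(endpoint: str) -> str:
    """Return the OTLP/HTTP logs URL for a base or full endpoint.

    Hypothetical helper: accepts either the base endpoint (ending in /otlp)
    or the full logs path (ending in /v1/logs).
    """
    base = endpoint.rstrip("/")
    if base.endswith("/v1/logs"):
        return base  # Full path already configured; do not append again
    return f"{base}/v1/logs"
```

Normalizing the trailing slash first keeps the suffix check simple and makes both configuration styles resolve to the same URL.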

Endpoint	URL Pattern	Status
OTLP Gateway (preferred)	`https://otlp-gateway-prod-<region>.grafana.net/otlp`	✅ Recommended by Grafana Cloud docs; verified in this project
Loki Push API (fallback)	`https://logs-prod-<region>.grafana.net/loki/api/v1/push`	⚠️ May behave differently by account state; returned 530 in my trial validation
Self-hosted Loki OTLP	`https://<loki-host>/otlp`	Requires Loki OTLP ingestion support and structured metadata configuration; Loki 3.0+ enables structured metadata by default

Authentication: Basic Auth with base64 Encoding

Grafana Cloud uses Basic Auth for both endpoints. The critical detail: the header value is base64(instanceId:apiToken), not the plain-text instanceId:apiToken pair.

from base64 import b64encode

instance_id = "123456"  # From Grafana Cloud console
api_token = "glc_..."   # logs:write scope

credentials = f"{instance_id}:{api_token}"
auth_header = f"Basic {b64encode(credentials.encode()).decode()}"

Credentials are stored in AWS Secrets Manager as JSON:

{"instance_id": "<id>", "api_key": "<token>"}

The Lambda reads this at cold start and caches the auth header for subsequent invocations. For production, use the shared auth_cache.py module which provides TTL-based caching with automatic reload-on-401/403, so credential rotation does not require waiting for a new Lambda execution environment.
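
One way TTL-based caching with reload on 401/403 could look is sketched below. This is not the contents of auth_cache.py; the class name `AuthHeaderCache`, the `load_secret` callable, and the 300-second default TTL are assumptions made for illustration. In a Lambda, `load_secret` would typically wrap a Secrets Manager read and return the JSON shown above.

```python
import time
from base64 import b64encode
from typing import Callable, Optional


class AuthHeaderCache:
    """Cache a Basic Auth header for ttl_seconds; reload early on auth errors."""

    def __init__(self, load_secret: Callable[[], dict], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load_secret = load_secret  # Assumed to return {"instance_id", "api_key"}
        self._ttl = ttl_seconds
        self._clock = clock
        self._header: Optional[str] = None
        self._loaded_at = 0.0

    def get(self, force_reload: bool = False) -> str:
        expired = self._clock() - self._loaded_at >= self._ttl
        if force_reload or self._header is None or expired:
            secret = self._load_secret()
            raw = f"{secret['instance_id']}:{secret['api_key']}"
            self._header = "Basic " + b64encode(raw.encode()).decode()
            self._loaded_at = self._clock()
        return self._header

    def on_response(self, status_code: int) -> bool:
        """Reload on 401/403; return True if the caller should retry once."""
        if status_code in (401, 403):
            self.get(force_reload=True)
            return True
        return False
```

The injected clock keeps the expiry logic testable without sleeping, and the single retry after a forced reload is what lets a rotated token take effect inside a warm execution environment.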

Internally, normalized records are now converted directly to OTLP as the primary path. Loki Push formatting is kept only as a fallback mode. This aligns with Part 5's "OTLP as producer contract" principle. For the full OTLP resource/log-record/body mapping and fsxn.* attribute naming policy, see the Grafana Operations Guide.

The Three Lambda Handlers

1. FSx Audit Log Handler via S3 Access Point (`handler.py`)

Polls for new FSx ONTAP audit log files via S3 Access Point, parses JSON/EVTX, and ships to Grafana Cloud. Uses SSM Parameter Store to checkpoint progress between invocations.

def lambda_handler(event, context):
    auth_header = get_auth_header()  # Cached from Secrets Manager

    if event.get("source") == "scheduler":
        # Polling mode: list new files, process, update checkpoint
        last_key = get_checkpoint()  # SSM Parameter Store
        new_keys = list_new_keys(S3_ACCESS_POINT_ARN, prefix, last_key,
                                 limit=MAX_KEYS_PER_RUN)

        for key in new_keys:
            if context.get_remaining_time_in_millis() < SAFETY_THRESHOLD_MS:
                break  # Stop early, resume on next scheduled run
            raw = s3_client.get_object(Bucket=S3_ACCESS_POINT_ARN, Key=key)
            logs = parse_logs(raw["Body"].read(), key)
            ship_to_grafana(logs, key, auth_header)  # Raises on failure
            set_checkpoint(key)  # Only after confirmed delivery
    else:
        # Manual test mode using an S3-event-shaped payload
        for record in extract_s3_records(event):
            raw = s3_client.get_object(Bucket=S3_ACCESS_POINT_ARN, Key=record["key"])
            logs = parse_logs(raw["Body"].read(), record["key"])
            ship_to_grafana(logs, record["key"], auth_header)

Query in Grafana Explore:

{service_name="fsxn-audit"} | json | Operation="create"

2. EMS Webhook Handler (`ems_handler.py`)

Receives ONTAP EMS events via API Gateway, parses with the shared EMS parser layer, and forwards to Grafana.

def lambda_handler(event, context):
    auth_header = get_auth_header()  # Cached from Secrets Manager, as in handler.py
    body = event.get("body", "")
    normalized = parse_ems_event(body)  # Shared Lambda Layer

    if USE_OTLP:
        payload = format_for_otlp(normalized)
    else:
        payload = format_for_loki(normalized)

    ship_to_grafana(payload, auth_header)

Labels: {service_name="fsxn-ems", source="ontap", severity="alert"}

Security note: Do not expose the EMS webhook endpoint as an unauthenticated public API in production. Use API Gateway authorization controls such as an API key, IAM authorization, Lambda authorizer, resource policy, WAF, or source IP restrictions based on your network design. The quickstart template uses AuthorizationType: NONE for simplicity — add appropriate controls before production use. See the webhook security guide for a full comparison of auth modes and a recommended shared-secret Lambda authorizer pattern.

3. FPolicy Handler (`fpolicy_handler.py`)

Subscribes to EventBridge events from the FPolicy ECS Fargate server and forwards file operation events.

def lambda_handler(event, context):
    auth_header = get_auth_header()  # Cached from Secrets Manager, as in handler.py
    detail = event.get("detail")  # EventBridge event

    if USE_OTLP:
        payload = format_for_otlp(detail)
    else:
        payload = format_for_loki(detail)

    ship_to_grafana(payload, auth_header)

Labels: {service_name="fsxn-fpolicy", source="ontap", operation="create"}

CloudFormation: Three Templates, Zero Hardcoded Values

Each template is fully parameterized:

Template	Purpose	Key Parameters
`template.yaml`	FSx audit log poller Lambda	S3AccessPointArn, GrafanaCredentialsSecretArn, LokiEndpoint, ScheduleExpression
`template-ems.yaml`	EMS webhook Lambda	GrafanaCredentialsSecretArn, LokiEndpoint, EmsParserLayerArn
`template-fpolicy.yaml`	FPolicy EventBridge Lambda	GrafanaCredentialsSecretArn, LokiEndpoint, EventBusName

The LokiEndpoint parameter accepts both OTLP Gateway and Loki Push API URLs — the Lambda auto-detects the mode. The quickstart template also sets Lambda reserved concurrency to 1 and provisions a Scheduler DLQ with retry policy to avoid overlapping poller runs and preserve failed scheduled invocations. Processing bounds (MAX_KEYS_PER_RUN, SAFETY_THRESHOLD_MS) are configured via Lambda environment variables.

Trigger Model: EventBridge Scheduler Polling

FSx for ONTAP S3 Access Points do not support S3 Event Notifications or EventBridge ObjectCreated events. Instead, this integration uses an EventBridge Scheduler polling pattern:

EventBridge Scheduler invokes the Lambda every 5 minutes (configurable via ScheduleExpression parameter)
Lambda lists new files via ListObjectsV2 on the S3 Access Point, using StartAfter to skip already-processed keys
Lambda reads and processes each new file, shipping logs to Grafana Cloud
Checkpoint (SSM Parameter Store) tracks the last successfully processed S3 key — on the next invocation, only newer files are processed

This pattern is simple, cost-effective, and works with AWS S3 API-compatible read paths such as FSx for ONTAP S3 Access Points. The trade-off is polling latency (up to 5 minutes by default) vs. the near-real-time delivery of event-driven triggers.
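
The listing step above can be sketched as follows. This is a minimal illustration of `list_new_keys` rather than the repository's implementation: the signature is an assumption, and the client is passed in so the function can be exercised with a stub. It assumes the access point returns keys in ascending lexical order, as standard S3 does for ListObjectsV2.

```python
from typing import Any, List


def list_new_keys(s3_client: Any, access_point_arn: str, prefix: str,
                  last_key: str, limit: int = 100) -> List[str]:
    """List up to `limit` object keys that sort after `last_key`."""
    keys: List[str] = []
    kwargs = {"Bucket": access_point_arn, "Prefix": prefix}
    if last_key:
        kwargs["StartAfter"] = last_key  # Skip keys at or before the checkpoint
    while len(keys) < limit:
        kwargs["MaxKeys"] = min(1000, limit - len(keys))
        page = s3_client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
    return keys[:limit]
```

With boto3, `s3_client` would be `boto3.client("s3")` and `access_point_arn` the access point ARN passed as `Bucket`. Capping `MaxKeys` at the remaining budget avoids listing far more keys than one invocation will process.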

CloudTrail alternative: CloudTrail data events do work with FSx ONTAP S3 Access Points (confirmed by NetApp Workload Factory's Journal table feature). However, they add delivery latency (5–15 minutes end to end in my validation) and cost $0.10 per 100,000 events, which makes the polling pattern the better default for this use case. See the CloudTrail trigger alternative for a full analysis and CloudFormation example.

# CloudFormation: EventBridge Scheduler with retry and DLQ
AuditLogSchedule:
  Type: AWS::Scheduler::Schedule
  Properties:
    ScheduleExpression: !Ref ScheduleExpression  # default: rate(5 minutes)
    FlexibleTimeWindow:
      Mode: 'OFF'
    Target:
      Arn: !GetAtt LogShipperFunction.Arn
      RoleArn: !GetAtt SchedulerRole.Arn
      Input: !Sub '{"source": "scheduler", "s3_access_point_arn": "${S3AccessPointArn}", "prefix": "${S3KeyPrefix}"}'
      RetryPolicy:
        MaximumRetryAttempts: 2
        MaximumEventAgeInSeconds: 3600
      DeadLetterConfig:
        Arn: !GetAtt SchedulerDLQ.Arn

The handler also accepts S3 event format for manual testing via aws lambda invoke, so you can still test individual files without waiting for the scheduler.

Checkpoint Semantics

The quickstart uses a simple high-watermark checkpoint: the last successfully processed object key is stored in SSM Parameter Store, and the next run lists keys after that value.

This works when audit log object keys are monotonically increasing and immutable. For production, validate your audit log naming and rotation behavior. If files can arrive late, be overwritten, or appear out of lexical order, use a stronger checkpoint model such as:

Keeping a short lookback window
Deduplicating by object key + ETag or LastModified
Storing per-object processing state in DynamoDB
Updating the checkpoint only after confirmed Grafana delivery

The checkpoint is advanced only after Grafana returns a successful response for that object. If delivery fails after retries, the Lambda raises an error and the next scheduled run will retry from the last checkpoint.

Failure-path tests verify this behavior: if OTLP delivery returns failure after retries, the Lambda raises and the checkpoint does not advance past the failed object.

Files that parse successfully but contain no shippable records are treated as successfully processed and checkpointed; only delivery failures or parse errors prevent checkpoint advancement.

For production, add a poison-pill policy for files that repeatedly fail parsing or delivery; otherwise one bad file can block later audit logs when using a high-watermark checkpoint. See the Grafana operations guide for poison-pill handling, pipeline health alarms, and custom metrics.
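
A poison-pill policy on top of a high-watermark checkpoint could be sketched like this. It is an illustration under assumptions, not the policy from the operations guide: the function name, the `failure_counts` mapping (which would need to live in durable storage such as DynamoDB between runs), and the default of three attempts are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def process_with_poison_pill(
    keys: Iterable[str],
    process: Callable[[str], None],
    failure_counts: Dict[str, int],
    max_attempts: int = 3,
) -> Tuple[str, List[str]]:
    """Process keys in order; return (new_checkpoint, quarantined_keys).

    Stops at the first failing key that still has attempts left, so the next
    run retries it. A key that has failed max_attempts times is quarantined
    and skipped so later keys are not blocked. An empty checkpoint string
    means nothing advanced and the caller should keep the previous value.
    """
    checkpoint = ""
    quarantined: List[str] = []
    for key in keys:
        try:
            process(key)
        except Exception:
            failure_counts[key] = failure_counts.get(key, 0) + 1
            if failure_counts[key] < max_attempts:
                break  # Retry this key on the next scheduled run
            quarantined.append(key)  # Give up: record it and move past it
        else:
            failure_counts.pop(key, None)
        checkpoint = key
    return checkpoint, quarantined
```

In practice it is probably safer to count only parse errors toward quarantine. Counting delivery failures as well would skip healthy files during a Grafana outage.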

Use SSM Parameter Store for the quickstart high-watermark checkpoint. Move to DynamoDB when you need per-object state, deduplication, replay tracking, or concurrent workers.

Delivery semantics: This pipeline provides at-least-once delivery, not exactly-once. If a Lambda invocation succeeds in sending logs to Grafana but fails before updating the checkpoint (e.g., timeout or transient SSM error), the next run will re-process and re-send those objects. For most observability use cases, duplicate log entries are acceptable. If deduplication is required, implement it explicitly using object key + ETag, event ID, or payload hash in DynamoDB. Do not rely on backend-side deduplication as the primary correctness mechanism.
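
Explicit deduplication by object key + ETag can be sketched as below. The function names are hypothetical, and the three callables stand in for whatever store is used; with DynamoDB, `mark_shipped` would typically be a conditional put on the ID.

```python
import hashlib
from typing import Callable


def dedup_id(object_key: str, etag: str) -> str:
    """Stable identifier for one version of one object."""
    return hashlib.sha256(f"{object_key}\n{etag}".encode()).hexdigest()


def ship_once(object_key: str, etag: str,
              already_shipped: Callable[[str], bool],
              ship: Callable[[], None],
              mark_shipped: Callable[[str], None]) -> bool:
    """Ship an object unless this key+ETag was already recorded.

    Returns True if the object was shipped in this call.
    """
    item_id = dedup_id(object_key, etag)
    if already_shipped(item_id):
        return False
    ship()                 # Raises on delivery failure, so nothing is recorded
    mark_shipped(item_id)  # Record only after confirmed delivery
    return True
```

This narrows the duplicate window but does not close it: a crash between `ship()` and `mark_shipped()` still causes a re-send, so the pipeline remains at-least-once.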

Avoid Overlapping Poller Runs

Because the audit-log poller is schedule-driven, overlapping Lambda executions can race on the same key range and checkpoint. The quickstart template sets ReservedConcurrentExecutions: 1 to prevent this.

For higher-volume production pipelines, use a distributed lock (e.g., DynamoDB conditional write) and per-object processing state instead of relying on single-concurrency.

The quickstart also configures EventBridge Scheduler with a retry policy (2 retries, 1-hour event age) and a dedicated DLQ. If a scheduled invocation is throttled or fails, the event is preserved in the Scheduler DLQ for visibility and replay.

The quickstart uses 2 retries and 1-hour maximum event age to surface persistent failures quickly while avoiding unbounded retry storms. Increase these values only if your Grafana endpoint outage tolerance and duplicate-handling strategy are defined.

Processing Bounds

The poller bounds work per invocation to avoid timeout-related checkpoint corruption:

Max keys per run (MAX_KEYS_PER_RUN, default: 100): caps the number of files processed in a single invocation
Safety threshold (SAFETY_THRESHOLD_MS, default: 30000): stops processing when remaining Lambda time falls below 30 seconds

Variable	Default	Purpose
`MAX_KEYS_PER_RUN`	`100`	Maximum audit log files processed per invocation
`SAFETY_THRESHOLD_MS`	`30000`	Stop processing before Lambda timeout

Tune these values after observing Lambda duration, checkpoint age, Scheduler DLQ depth, FSx S3 Access Point read throughput, and Grafana send latency.

Because the checkpoint advances after each successfully delivered object, the next scheduled run resumes safely from where the previous run stopped.

S3 API Compatibility Boundary

FSx for ONTAP S3 Access Points provide S3 object API access (GetObject, ListObjectsV2, etc.) to file data that remains on the FSx for ONTAP file system. They should not be assumed to have the same bucket-level features or eventing behavior as standard S3 buckets. In this integration, the important difference is eventing: the audit log path uses Scheduler polling instead of S3 Event Notifications.

Minimum Read-Path Permissions

For the audit-log Lambda, verify:

s3:ListBucket on the S3 Access Point ARN
s3:GetObject on the S3 Access Point object ARN ({arn}/object/*)
S3 Access Point policy allows the Lambda execution role
The file-system user associated with the access point has read permission on the audit log path
If the access point is VPC-restricted, the Lambda network path can reach the S3 endpoint

IAM resource ARN examples:

# List access (s3:ListBucket)
Resource: arn:aws:s3:<region>:<account>:accesspoint/<access-point-name>

# Object read (s3:GetObject)
Resource: arn:aws:s3:<region>:<account>:accesspoint/<access-point-name>/object/*
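
The two resource ARNs above can be combined into an IAM policy document. The sketch below builds one as a Python dict; the function name and the `Sid` values are illustrative assumptions, and the access point policy and file-system permissions listed earlier still have to be configured separately.

```python
import json


def read_path_policy(access_point_arn: str) -> dict:
    """Build a least-privilege identity policy for the audit-log poller."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListAuditObjects",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": access_point_arn,
            },
            {
                "Sid": "ReadAuditObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"{access_point_arn}/object/*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN for illustration only
    arn = "arn:aws:s3:ap-northeast-1:111122223333:accesspoint/fsxn-audit-ap"
    print(json.dumps(read_path_policy(arn), indent=2))
```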

First Success Path

If this is your first deployment, start small:

# Deploy only the audit log poller
export MAX_KEYS_PER_RUN=1
export SAFETY_THRESHOLD_MS=30000
bash integrations/grafana/scripts/deploy.sh --audit-only

Then validate:

Confirm {service_name="fsxn-audit"} in Grafana Explore
Check the Scheduler DLQ is empty
Verify the SSM checkpoint advanced
Create the dashboard
Add EMS and FPolicy only after the audit path works (deploy.sh --all)

deploy.sh passes MAX_KEYS_PER_RUN and SAFETY_THRESHOLD_MS as Lambda environment variables. If unset, the template defaults (100 / 30000) are used.

The first validation should prove three things:

One audit file is visible in Grafana ({service_name="fsxn-audit"})
The SSM checkpoint advanced to the processed key
The Scheduler DLQ remains empty

One-Command Deploy and Cleanup

# Deploy all 3 stacks + update Lambda code (default is --all)
export GRAFANA_SECRET_ARN="arn:aws:secretsmanager:ap-northeast-1:<account>:secret:grafana/fsxn-loki-credentials-XXXXXX"
export S3_ACCESS_POINT_ARN="arn:aws:s3:ap-northeast-1:<account>:accesspoint/fsxn-audit-ap"
export LOKI_ENDPOINT="https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp"

bash integrations/grafana/scripts/deploy.sh --all

The cleanup script removes CloudFormation stacks and optionally deletes synthetic test objects. It does not delete production FSx audit files through the FSx-attached S3 Access Point — those remain on the FSx file system. Pass --s3-bucket and --s3-prefix only if you uploaded test data to a regular S3 bucket during validation.

# Tear down everything (dependency-safe order)
bash integrations/grafana/scripts/cleanup.sh --all \
  --s3-bucket your-bucket --s3-prefix audit/svm-prod-01/

The cleanup script deletes stacks in dependency-safe order (API Gateway before Lambda) and handles DELETE_FAILED states gracefully.

LogQL Query Examples

High-cardinality fields such as UserName and ObjectName remain in the log body and are extracted at query time with | json; they are intentionally not promoted to Loki labels to avoid index bloat and cost.

Once logs arrive, Grafana Explore becomes your investigation tool:

# All audit logs
{service_name="fsxn-audit"}

# Filter by operation
{service_name="fsxn-audit"} | json | Operation="delete"

# Failed access attempts (security investigation)
{service_name="fsxn-audit"} | json | Result="Failure"

# EMS ransomware alerts
{service_name="fsxn-ems"} | json | event_name="arw.volume.state"

# FPolicy file operations
{service_name="fsxn-fpolicy"} | json | operation="create"

# Human-readable format
{service_name="fsxn-audit"} | json | line_format "{{.UserName}} {{.Operation}} {{.ObjectName}}"

# Log volume over time (for dashboards)
count_over_time({service_name="fsxn-audit"}[5m])

Dashboard: 4 Panels for Storage Observability

The repository includes a dashboard creation script, scripts/create-dashboard.sh, that provisions a Grafana dashboard with four panels through the Grafana HTTP API. The panel queries below are the ones that script generates, verified against this project's OTLP-ingested log shape:

Log Volume (Time series): count_over_time({service_name="fsxn-audit"}[5m])
Operations Breakdown (Pie chart): sum by (Operation) (count_over_time({service_name="fsxn-audit"} | json [1h]))
User Activity Top 10 (Bar gauge): topk(10, sum by (UserName) (count_over_time({service_name="fsxn-audit"} | json [1h])))
Failed Events (Time series): count_over_time({service_name="fsxn-audit"} | json | Result="Failure" [5m])

Alerting: Ransomware Detection and Security Monitoring

Beyond dashboards, the integration includes three Grafana alerting rules provisioned via scripts/create-alerts.sh:

The table below shows the alert conditions. The provisioning script wraps these into Grafana alert expressions using count/reduce/threshold steps.

Alert	Detection Query (alert condition)	Severity
Ransomware Detection (ARP)	`count_over_time({service_name="fsxn-ems"} \	json \
Quota Soft Limit Exceeded	`count_over_time({service_name="fsxn-ems"} \	json \
Failed Access Spike	`count_over_time({service_name="fsxn-audit"} \	json \

The rules use Grafana's unified alerting format and are deployed to a "FSxN Alerts" folder. Configure contact points (Slack, PagerDuty, email) and notification policies in the Grafana UI to route alerts by severity or team label. The rule definitions are available as alerting/rules.yaml; see the alerting README for provisioning details, no-data behavior, contact point caveats, and threshold tuning guidance.

API compatibility: This script uses Grafana's Alerting Provisioning HTTP API (/api/v1/provisioning). Grafana 13+ introduces newer /apis routes while legacy /api routes remain available; check your Grafana Cloud version if provisioning fails. Provisioning alert rules does not automatically configure notification delivery — create or map contact points and notification policies before relying on these alerts for production response.

The sample rules treat "No data" as OK, because absence of matching ransomware, quota, or failed-access events is expected in normal operation. Query execution errors are routed as Error state for operator attention. These thresholds are starter defaults — tune them per SVM, workload, and normal user behavior before enabling production paging.

For production, monitor the pipeline itself: Scheduler DLQ depth, Lambda errors/throttles/duration, checkpoint age, and Grafana send failures.

Scheduler DLQ Replay

The Scheduler DLQ message is primarily an operational signal and replay payload. Because the poller uses a checkpoint, the next scheduled run may already retry the failed key range automatically.

When a scheduled invocation fails and lands in the Scheduler DLQ:

Inspect the DLQ message (contains the scheduler input payload)
Check the current checkpoint in SSM Parameter Store
Check whether a later scheduled run has already advanced the checkpoint and delivered the missed objects
If the checkpoint has advanced and Grafana shows the data, the failure was auto-recovered — delete the DLQ message
If the checkpoint has NOT advanced, the next scheduled run will retry automatically from the last checkpoint
For manual replay (if auto-retry is insufficient): invoke the Lambda directly with the scheduler payload, then delete the DLQ message

Before manually replaying a DLQ message, compare the DLQ payload with the current SSM checkpoint and Grafana ingestion state to avoid duplicate delivery.

For production, set a CloudWatch alarm on ApproximateNumberOfMessagesVisible > 0 for the Scheduler DLQ.
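
The alarm described above could be created with CloudWatch's PutMetricAlarm. The sketch below only builds the parameters; the function name, alarm name, period, and SNS topic are assumptions to adapt to your own conventions.

```python
def scheduler_dlq_alarm_params(queue_name: str, sns_topic_arn: str) -> dict:
    """Parameters for an alarm that fires when the Scheduler DLQ is not empty."""
    return {
        "AlarmName": f"{queue_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,                # Seconds; assumed to match the 5-minute schedule
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # An idle, empty queue is healthy
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 this would be applied as `boto3.client("cloudwatch").put_metric_alarm(**scheduler_dlq_alarm_params(name, topic_arn))`. Returning a plain dict keeps the alarm definition easy to inspect before it is applied.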

Lessons Learned

#	Lesson	Impact
1	Grafana Cloud OTLP endpoint is the recommended ingestion path; in my trial validation, OTLP Gateway succeeded while Loki Push API returned 530	Use OTLP Gateway as default
2	Basic Auth = `base64(instanceId:apiToken)`, not plain text	Auth failures if wrong encoding
3	Loki / Grafana Cloud can reject old timestamps depending on tenant limits; in my validation, logs older than 7 days were rejected	Use current timestamps in test data
4	Grafana HTTP API needs a Grafana Service Account token, not the Grafana Cloud ingestion token used for OTLP writes	Dashboard creation fails with wrong token
5	OTLP-ingested logs use `service_name` label, not `job`	Different query syntax than Loki Push API
6	CloudFormation stack deletion order matters (API GW before Lambda)	DELETE_FAILED if wrong order

Verified Query Matrix

In this Grafana Cloud environment, service.name was exposed as the service_name index label via Loki's default OTLP attribute-to-label mapping. This mapping is configurable per tenant, so validate labels in your own environment if queries return unexpected results.

All queries tested with OTLP-ingested fields in this project's Grafana Cloud instance:

Query	Expected	Verified
`{service_name="fsxn-audit"}`	Audit logs visible	✅
`{service_name="fsxn-audit"} | json | Operation="delete"`
`{service_name="fsxn-audit"} | json | Result="Failure"`
`{service_name="fsxn-ems"} | json | event_name="arw.volume.state"`
`{service_name="fsxn-fpolicy"} | json | operation="create"`
`count_over_time({service_name="fsxn-audit"}[5m])`	Time series data	✅

Production and PoC Resources

For deeper validation and production planning:

Delivery Guarantee Patterns — Quickstart → Medium → Higher reliability → Multi-backend
Webhook Security Guide — Auth modes, Lambda authorizer, production baseline
Grafana Operations Guide — Alarms, tuning, poison-pill, ownership, compliance
CloudTrail Trigger Alternative — Event-driven alternative analysis
PoC Checklist — Go/No-Go criteria for stakeholder sign-off
Cost Model — Direct send vs Collector vs Firehose cost comparison
Alerting README — Provisioning details, thresholds, contact point caveats
Graduating to Alloy — Move from direct Lambda OTLP send to an Alloy-backed telemetry pipeline
Partner Solution Brief — Target customers, PoC scope, deliverables, and responsibility boundaries

What's Next

Part 7: Splunk HEC — serverless log delivery with built-in Firehose support
Elastic integration: Bulk API with date-based indices
Cost model refinement: validate the Cost Model with measured volume tiers from real-world FSx for ONTAP workloads

Series Navigation

Part 1: Why Your FSx for ONTAP Logs Deserve Better
Part 2: Shipping FSx for ONTAP Logs to Datadog — The Serverless Way
Part 3: Event-Driven Ransomware Detection with ONTAP ARP + Datadog
Part 4: FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate
Part 5: Escape Vendor Lock-in with OTel Collector
Part 6: Direct-to-Grafana: Shipping Logs via OTLP Gateway (this post)

Questions about the Grafana Cloud integration or OTLP Gateway? Drop a comment below.

Previous: Part 5 — Escape Vendor Lock-in with OTel Collector

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations

Escape Vendor Lock-in: Multi-Backend Log Delivery with OTel Collector for FSx for ONTAP.

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Tue, 19 May 2026 09:10:53 +0000

TL;DR

We shipped the same FSx for ONTAP audit logs to three backends simultaneously — Datadog, Grafana Cloud, and Honeycomb — without changing a single line of Lambda code. The OpenTelemetry Collector sits between our Lambda and the backends as a routing layer. Adding or removing a backend is a YAML config change, not a code deployment.

Same audit logs → 3 backends simultaneously
Zero Lambda code changes between backends (SHA-256 verified)
OTel Collector as the vendor-neutral routing layer
All 3 event sources work: FSx audit logs via S3 Access Point, EMS webhooks, FPolicy file operations

What We're Building

In Part 2, we built a Lambda that speaks Datadog's API directly. It works great — but what happens when your security team wants Splunk, your SRE team wants Grafana, and your platform team is evaluating Honeycomb?

You'd need three separate Lambdas, each with vendor-specific formatting, auth, and retry logic. That's vendor lock-in expressed as infrastructure.

The Problem: Vendor-Specific APIs = Lock-in

Every observability vendor has their own wire format:

Vendor	Auth Header	Payload Format	Endpoint Pattern
Datadog	`DD-API-KEY: <key>`	Custom JSON schema	`https://http-intake.logs.{site}/api/v2/logs`
Splunk	`Authorization: Splunk <token>`	HEC `event` wrapper	`https://<host>:8088/services/collector/event`
Grafana Cloud	`Authorization: Basic <b64>`	OTLP	`https://otlp-gateway-prod-<region>.grafana.net/otlp`
Honeycomb	`x-honeycomb-team: <key>`	OTLP	`https://api.honeycomb.io`

If your Lambda speaks Datadog's API, switching to Grafana Cloud means rewriting your Lambda. That's the lock-in.

The Solution: OTLP as the Producer-to-Collector Contract

OpenTelemetry Protocol (OTLP) is the vendor-neutral producer-to-Collector contract. Our Lambda speaks OTLP — period. The OTel Collector handles routing, processing, and backend-specific export.

┌─────────────────────────────────────────────────────────────────────┐
│ AWS Account                                                         │
│                                                                     │
│  ┌──────────────┐     ┌──────────────────┐     ┌─────────────────┐  │
│  │ Audit Logs   │────▶│ Lambda           │     │ OTel Collector  │  │
│  │ (via S3 AP)  │────▶│ (OTLP Shipper)   │────▶│ (Docker/Fargate)│  │
│  │ EMS/FPolicy  │────▶│                  │     │                 │  │
│  └──────────────┘     └──────────────────┘     └─┬──────┬──────┬─┘  │
│                                                  │      │      │    │
└──────────────────────────────────────────────────┼──────┼──────┼────┘
                                                   │      │      │
                                                   ▼      ▼      ▼
                                              Datadog  Grafana Honeycomb
                                               (AP1)    Cloud

The Lambda sends OTLP/HTTP to the Collector. The Collector fans out to any combination of backends. Adding Honeycomb? Add 5 lines of YAML. Dropping Datadog? Remove 4 lines. No Lambda redeployment.

Prerequisites

Before starting, you need:

FSx for ONTAP with audit logging configured (see Part 2 for setup)
Docker installed locally (Colima works — see troubleshooting for compose compatibility)
At least one backend account:
- Datadog: API key + site (e.g., ap1.datadoghq.com)
- Grafana Cloud: Instance ID + API token (Cloud Portal → OTLP)
- Honeycomb: Ingest API key (starts with hcaik_)
AWS account with Lambda deployment capability
Parts 1–4 context (recommended but not required — this integration works standalone)

FSx for ONTAP S3 Access Point note: The Lambda reads audit logs through an S3 Access Point attached to the FSx for ONTAP volume. Data remains on the FSx file system — it is not copied to a separate S3 bucket. S3 API throughput via FSx depends on the file system's provisioned throughput capacity, not standard S3 scaling. Validate FSx read throughput separately from Collector and backend ingest throughput.

The OTel Collector Configuration

The Collector config is the heart of this pattern. Here's the full verified configuration for multi-backend delivery:

# otel-collector-config.yaml
# ✅ VERIFIED WORKING (2026-05-18)
# Image: otel/opentelemetry-collector-contrib:0.152.0
# Backends: Grafana Cloud (ap-northeast-0) + Honeycomb

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  # memory_limiter:        # Recommended for production
  #   check_interval: 1s
  #   limit_mib: 512
  #   spike_limit_mib: 128
  batch:
    timeout: 5s
    send_batch_size: 1000

exporters:
  otlp_http/grafana:
    endpoint: ${env:GRAFANA_OTLP_ENDPOINT}
    headers:
      Authorization: "Basic ${env:GRAFANA_BASIC_AUTH}"

  otlp_http/honeycomb:
    endpoint: https://api.honeycomb.io
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
      x-honeycomb-dataset: ${env:HONEYCOMB_DATASET}

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp_http/grafana, otlp_http/honeycomb]

Depending on your Honeycomb environment and dataset model, x-honeycomb-dataset may be optional or handled differently. Refer to your Honeycomb OTLP setup page for the recommended configuration.

This article uses otlp_http (the forward-compatible component name). If your Collector version does not recognize it, use the older otlphttp alias or upgrade the Collector.

Section Breakdown

Section	Purpose	Key Settings
`receivers.otlp`	Accepts OTLP/HTTP from Lambda	Port 4318 (OTLP standard)
`processors.batch`	Buffers logs before export	5s timeout OR 1000 records (whichever first)
`exporters.otlp_http/*`	Sends to each backend	Per-backend auth headers
`extensions.health_check`	Liveness probe	Port 13133 for `curl -f` checks
`service.pipelines`	Wires components together	logs: receiver → processor → exporters

Production note: This configuration is suitable for development and validation. For production, add retry_on_failure and sending_queue settings to exporters, configure memory_limiter processor, and consider persistent storage extensions. Without persistent buffering, telemetry in the Collector's in-memory batch can be lost during Collector restarts.

Adding Datadog as a Third Backend

To send to all three simultaneously, add the Datadog exporter:

exporters:
  # ... existing grafana + honeycomb exporters ...

  datadog:
    api:
      key: ${env:DD_API_KEY}
      site: ${env:DD_SITE}

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp_http/grafana, otlp_http/honeycomb, datadog]

That's it. Restart the Collector. Same Lambda, same OTLP payload, now three destinations.

For Datadog, this example uses the Collector's dedicated datadog exporter rather than generic otlp_http, because it handles Datadog-specific intake behavior, metadata mapping, and host tagging.

The Lambda Handler (OTLP Shipper)

Key Design Decisions

Why OTLP? — It gives the Lambda a single producer-to-Collector contract. The Collector then handles each backend's supported exporter or intake path. One format to maintain, not three.
Why no vendor SDK? — SDKs add cold start latency, dependency management, and vendor coupling. Pure urllib3 + JSON keeps the Lambda lean.
Why AUTH_MODE? — Different Collectors may need different auth. The Lambda supports none, basic, and bearer modes without code changes.
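
A sketch of how the three AUTH_MODE values could translate into request headers follows. Only `AUTH_MODE` and its values none, basic, and bearer come from the article; the other variable names (`AUTH_USERNAME`, `AUTH_PASSWORD`, `AUTH_TOKEN`) and the function name are hypothetical.

```python
from base64 import b64encode
from typing import Dict, Mapping


def build_auth_headers(env: Mapping[str, str]) -> Dict[str, str]:
    """Build OTLP/HTTP request headers for the configured AUTH_MODE."""
    mode = env.get("AUTH_MODE", "none").lower()
    headers = {"Content-Type": "application/json"}
    if mode == "none":
        return headers  # e.g. a Collector reachable only inside the VPC
    if mode == "basic":
        raw = f"{env['AUTH_USERNAME']}:{env['AUTH_PASSWORD']}"
        headers["Authorization"] = "Basic " + b64encode(raw.encode()).decode()
        return headers
    if mode == "bearer":
        headers["Authorization"] = f"Bearer {env['AUTH_TOKEN']}"
        return headers
    raise ValueError(f"Unsupported AUTH_MODE: {mode!r}")
```

In a Lambda this would be called as `build_auth_headers(os.environ)`. Failing loudly on an unknown mode avoids silently sending unauthenticated requests after a configuration typo.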

Field Mapping: FSx ONTAP → OTLP Attributes

The Lambda maps FSx ONTAP audit fields to semantic OTLP attribute keys:

FSx ONTAP Field	OTLP Attribute Key	Example Value
`EventID`	`event.type`	`4663`
`UserName`	`user.name`	`admin@corp.local`
`ClientIP`	`client.address`	`10.0.1.50`
`Operation`	`fsxn.operation`	`ReadData`
`ObjectName`	`fsxn.path`	`/vol/data/reports/q4.xlsx`
`Result`	`fsxn.result`	`Success`
`SVMName`	`fsxn.svm`	`svm-prod-01`
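As a rough illustration of the mapping in this table, the sketch below converts one audit record into OTLP attribute entries. `FIELD_MAP` and `map_attributes` are hypothetical names used only for this example; the repository's own `map_log_record` may be structured differently, and coercing every value to `stringValue` is an assumption.

```python
from typing import Any

# FSx ONTAP field -> OTLP attribute key, copied from the table above.
FIELD_MAP = {
    "EventID": "event.type",
    "UserName": "user.name",
    "ClientIP": "client.address",
    "Operation": "fsxn.operation",
    "ObjectName": "fsxn.path",
    "Result": "fsxn.result",
    "SVMName": "fsxn.svm",
}


def map_attributes(log: dict[str, Any]) -> list[dict[str, Any]]:
    """Return OTLP key/value attributes for the known fields present in `log`."""
    return [
        # Assumption: every value is sent as a string, including numeric event IDs.
        {"key": otlp_key, "value": {"stringValue": str(log[field])}}
        for field, otlp_key in FIELD_MAP.items()
        if log.get(field) is not None
    ]
```

Fields that are missing or not listed in the table are simply skipped, so a partially populated record still produces a valid attribute list.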

The examples above focus on S3 audit logs because they are the highest-volume path. The same OTLP shipper pattern is reused for EMS webhook events and FPolicy file operations using source-specific field mappers (ems_handler.py, fpolicy_handler.py), while preserving the same Collector-facing OTLP contract. For EMS and FPolicy, source-specific service names are used (fsxn-ems, fsxn-fpolicy) to distinguish event sources in the backend.

Resource-level attributes (set once per payload, not per log record):

Attribute	Value	Purpose
`service.name`	`fsxn-audit`	Service identification
`cloud.provider`	`aws`	Cloud context
`cloud.platform`	`aws_fsx`	Platform context

cloud.platform=aws_fsx is a project-specific value used to identify FSx for ONTAP as the data source. It is not part of the OpenTelemetry semantic conventions standard cloud.platform values (which include aws_ec2, aws_ecs, aws_eks, aws_lambda, etc.).

Severity Determination Logic

The Lambda determines OTLP severity from the Result field:

from typing import Optional

WARN_KEYWORDS = ("fail", "denied", "error")

def determine_severity(result: Optional[str]) -> tuple[int, str]:
    """Determine OTLP severity from FSx ONTAP Result field."""
    if not result:
        return (9, "INFO")
    lower = result.lower()
    for keyword in WARN_KEYWORDS:
        if keyword in lower:
            return (13, "WARN")
    return (9, "INFO")

This means failed access attempts (Result: "Failure") automatically get severityNumber: 13 (WARN), making them easy to filter in any backend.

The Lambda sets both severityNumber and severityText according to the OpenTelemetry Logs Data Model severity level definitions.

OTLP Payload Construction

def build_otlp_payload(
    logs: list[dict[str, Any]],
    service_name: str,
    source_key: str,
) -> dict[str, Any]:
    """Build OTLP Log Data Model payload."""
    log_records = [map_log_record(log) for log in logs]

    return {
        "resourceLogs": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}},
                    {"key": "cloud.provider", "value": {"stringValue": "aws"}},
                    {"key": "cloud.platform", "value": {"stringValue": "aws_fsx"}},
                ]
            },
            "scopeLogs": [{
                "scope": {"name": "fsxn-otel-shipper", "version": "1.0.0"},
                "logRecords": log_records,
            }],
        }]
    }

No vendor SDK. No vendor-specific formatting. Just the OTLP Log Data Model.

Retry with Exponential Backoff

import json
import random
import time

import urllib3

MAX_RETRIES = 3
BASE_INTERVAL = 2  # seconds

# Assumed module-level pool; the original snippet does not show how `http` is created.
http = urllib3.PoolManager()

def _send_otlp_payload(payload, endpoint, auth_headers=None) -> bool:
    """Send OTLP payload via HTTP POST with retry logic.

    Retries on HTTP 429 and 5xx. Does not retry on 4xx (except 429).
    Exponential backoff between attempts: 2s, then 4s, each with jitter.
    """
    url = f"{endpoint}/v1/logs"
    headers = {"Content-Type": "application/json"}
    if auth_headers:
        headers.update(auth_headers)

    json_body = json.dumps(payload).encode("utf-8")

    for attempt in range(MAX_RETRIES):
        response = http.request("POST", url, body=json_body, headers=headers, timeout=30.0)

        if response.status < 300:
            return True
        if response.status == 429 or response.status >= 500:
            # Sleep only if another attempt follows; waiting after the last one is wasted time.
            if attempt < MAX_RETRIES - 1:
                wait_time = BASE_INTERVAL * (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait_time)
            continue
        # Client error (4xx except 429): don't retry
        return False
    return False

AUTH_MODE Support

The Lambda supports three authentication modes via the AUTH_MODE environment variable:

AUTH_MODE	Behavior	Use Case
`none`	No auth headers sent	Local Collector (no auth needed)
`basic`	`Authorization: Basic <base64(token)>`	Grafana Cloud direct
`bearer`	`Authorization: Bearer <token>`	Generic OTLP endpoints

When using the Collector pattern, set AUTH_MODE=none on the Lambda — the Collector handles backend auth via its own config.

Direct auth modes (basic, bearer) are useful for testing or bypassing the Collector. In the multi-backend pattern, keep AUTH_MODE=none and let the Collector handle backend credentials.
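The three modes in the table can be sketched as a small header builder. This is a minimal illustration, not the repository's implementation: `build_auth_headers` and its `token` parameter are hypothetical names, and raising `ValueError` on an unknown mode is an assumption about reasonable behavior.

```python
import base64


def build_auth_headers(auth_mode: str, token: str = "") -> dict[str, str]:
    """Translate an AUTH_MODE value into HTTP headers for the OTLP request."""
    mode = auth_mode.strip().lower()
    if mode == "none":
        return {}  # the Collector handles backend auth
    if not token:
        raise ValueError(f"AUTH_MODE={mode} requires a token")
    if mode == "basic":
        # For Grafana Cloud the token is "instanceId:apiToken" before encoding.
        encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
        return {"Authorization": f"Basic {encoded}"}
    if mode == "bearer":
        return {"Authorization": f"Bearer {token}"}
    raise ValueError(f"Unsupported AUTH_MODE: {auth_mode}")
```

The returned dictionary can be passed as `auth_headers` to `_send_otlp_payload`; with `AUTH_MODE=none` it is empty and the request carries only the content type.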

Deployment

Local Development: Docker Run

# 1. Configure credentials
cd integrations/otel-collector
cp .env.example .env
# Edit .env with your backend credentials:
#   GRAFANA_OTLP_ENDPOINT=https://otlp-gateway-prod-ap-northeast-0.grafana.net/otlp
#   GRAFANA_BASIC_AUTH=<base64(instanceId:apiToken)>
#   HONEYCOMB_API_KEY=hcaik_<your-ingest-key>
#   HONEYCOMB_DATASET=fsxn-audit

# 2. Start OTel Collector
docker run -d --name otel-collector \
  -p 4318:4318 -p 13133:13133 \
  -v $(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml \
  --env-file .env \
  otel/opentelemetry-collector-contrib:0.152.0

# 3. Verify health
curl -f http://localhost:13133/
# Expected: HTTP 200 — {"status":"Server available", ...}

The health_check extension confirms the Collector process is available; it does not guarantee that each backend exporter is successfully delivering logs. Monitor exporter errors separately using the Collector's internal telemetry metrics if enabled and exposed.

# 4. Send a test payload
bash scripts/generate-otlp-payload.sh --output /tmp/payload.json
curl -X POST http://localhost:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d @/tmp/payload.json

Colima users: docker compose v2 plugin is NOT available in Colima. All scripts in this repo detect this and fall back to docker run. If you see "docker compose: command not found", this is expected behavior.

First Success Path

If you're trying this for the first time, start small:

Run the Collector locally with one backend.
Send one fresh OTLP payload.
Confirm the event appears in that backend.
Add the second exporter.
Only then move to multi-backend or AWS deployment.

This keeps the first validation focused on the producer-to-Collector contract before introducing backend parity and production networking.

AWS Deployment: CloudFormation

aws cloudformation deploy \
  --template-file integrations/otel-collector/template.yaml \
  --stack-name fsxn-otel-integration \
  --parameter-overrides \
    S3AccessPointArn=arn:aws:s3:ap-northeast-1:123456789012:accesspoint/fsxn-audit-ap \
    OtlpEndpoint=http://<your-collector-endpoint>:4318 \
    ApiKeySecretArn=arn:aws:secretsmanager:ap-northeast-1:123456789012:secret:fsxn-otel-key-XXXXXX \
    AuthMode=none \
  --capabilities CAPABILITY_IAM \
  --region ap-northeast-1

This template deploys the Lambda-side OTLP shipper. The Collector endpoint must already be reachable from the Lambda — for example, a local Collector for development, an EC2-hosted Collector, or an ECS/Fargate-based Collector in the same VPC. If the Lambda is in a VPC, ensure security groups allow outbound TCP 4318 to the Collector. See the repository's VPC Deployment Guide and Security Hardening Guide for production Collector deployment.

When the Collector handles auth, set AuthMode=none on the Lambda. The Collector config contains the per-backend credentials via environment variables (sourced from .env or Secrets Manager in production).

Environment Variables

Variable	Lambda	Collector	Description
`OTLP_ENDPOINT`	✅	—	Collector URL (e.g., `http://collector:4318`)
`AUTH_MODE`	✅	—	`none` / `basic` / `bearer`
`SERVICE_NAME`	✅	—	OTLP `service.name` attribute
`GRAFANA_OTLP_ENDPOINT`	—	✅	Grafana Cloud OTLP gateway URL
`GRAFANA_BASIC_AUTH`	—	✅	base64(instanceId:apiToken)
`HONEYCOMB_API_KEY`	—	✅	Ingest key (hcaik_...)
`HONEYCOMB_DATASET`	—	✅	Dataset name
`DD_API_KEY`	—	✅	Datadog API key
`DD_SITE`	—	✅	Datadog site (`datadoghq.com`, `datadoghq.eu`, `ap1.datadoghq.com`, etc.)

Verified Results

All backends were tested on 2026-05-18 using otel/opentelemetry-collector-contrib:0.152.0:

Backend	Region/Site	Status	Event Sources	Auth Method
Datadog	ap1.datadoghq.com	✅ Verified	S3 audit + EMS + FPolicy	Datadog exporter (`DD-API-KEY`)
Grafana Cloud	ap-northeast-0	✅ Verified	S3 audit + EMS + FPolicy	Basic Auth via `otlp_http`
Honeycomb	—	✅ Verified	S3 audit + EMS + FPolicy	`x-honeycomb-team` via `otlp_http`
Multi-Backend	Grafana + Honeycomb	✅ Verified	Simultaneous delivery	Both auth methods
Multi-Backend	Datadog + Grafana + Honeycomb	✅ Verified	Simultaneous 3-way delivery	All three exporters

All three backends received the same structured attributes:

event.type, user.name, client.address
fsxn.operation, fsxn.path, fsxn.result, fsxn.svm
cloud.provider=aws, cloud.platform=aws_fsx

OTLP standardizes the producer-to-Collector contract, but backend-specific indexing, query semantics, and retention behavior still need to be validated per destination. OpenTelemetry is not a backend — it defines APIs, protocols, and Collector components for telemetry generation, collection, processing, and export. Storage, visualization, and alerting are handled by the backends themselves. See the Backend Parity Matrix and PoC Checklist for backend-specific validation details.

The Proof: Zero Code Changes

Here's the key evidence. The Lambda handler's SHA-256 hash is identical regardless of which backend receives the logs:

$ shasum -a 256 integrations/otel-collector/lambda/handler.py
# Same hash whether targeting Datadog, Grafana Cloud, or Honeycomb
# The file never changes — only the Collector config does

What changes between backends? Only the OTel Collector config file.
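If `shasum` is not available, the same check can be done with Python's standard library. The sketch below is one way to do it; `file_sha256` is a hypothetical helper name, and the path you pass in is whatever your checkout uses.

```python
import hashlib
from pathlib import Path


def file_sha256(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Record the digest before changing the Collector config and compare it afterwards; an identical value confirms the handler file itself was not modified.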

Demonstration: Adding a Backend

Starting state: Grafana Cloud only.

# Before: single backend
service:
  pipelines:
    logs:
      exporters: [otlp_http/grafana]

Adding Honeycomb:

# After: add 5 lines to exporters section + update pipeline
exporters:
  otlp_http/honeycomb:
    endpoint: https://api.honeycomb.io
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
      x-honeycomb-dataset: ${env:HONEYCOMB_DATASET}

service:
  pipelines:
    logs:
      exporters: [otlp_http/grafana, otlp_http/honeycomb]

Restart the Collector. Done. No Lambda redeployment, no code review, no CI/CD pipeline for the shipper.

Demonstration: Removing a Backend

Dropping Datadog during a migration to Grafana Cloud:

# Remove from exporters list — that's it
service:
  pipelines:
    logs:
      exporters: [otlp_http/grafana]  # removed: datadog

Troubleshooting

Timestamp Rejection / Static Payload Gotcha

Datadog documents that logs older than 18 hours are dropped at intake (Datadog Logs API docs). Other backends may also reject or hide events with timestamps outside their accepted windows. In my testing, future timestamps also caused ingestion issues on some backends. When testing with static payloads, always generate fresh timestamps.

Fix: Use the payload generator to create fresh timestamps:

bash scripts/generate-otlp-payload.sh --output /tmp/fresh-payload.json
curl -X POST http://localhost:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d @/tmp/fresh-payload.json
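
For cases where the shell generator is not at hand, a fresh-timestamp payload can also be built in Python. This is a minimal sketch under stated assumptions: `build_test_payload` is a hypothetical helper, the attribute values are sample data from the field-mapping table, and the output of the repository's `generate-otlp-payload.sh` may differ. OTLP/JSON encodes `timeUnixNano` as a decimal string, which is why the value is converted with `str()`.

```python
import json
import time
from typing import Any, Optional


def build_test_payload(now_ns: Optional[int] = None) -> dict[str, Any]:
    """Build a one-record OTLP log payload stamped with the current time."""
    timestamp_ns = time.time_ns() if now_ns is None else now_ns
    record = {
        "timeUnixNano": str(timestamp_ns),  # 64-bit integers are strings in OTLP/JSON
        "severityNumber": 9,
        "severityText": "INFO",
        "body": {"stringValue": "fsxn test event"},
        "attributes": [
            {"key": "fsxn.operation", "value": {"stringValue": "ReadData"}},
            {"key": "fsxn.result", "value": {"stringValue": "Success"}},
        ],
    }
    return {
        "resourceLogs": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": "fsxn-audit"}},
            ]},
            "scopeLogs": [{"logRecords": [record]}],
        }]
    }


if __name__ == "__main__":
    print(json.dumps(build_test_payload(), indent=2))
```

Redirect the output to a file and post it with the same `curl` command; because the timestamp is generated at run time, the record stays inside the backend's accepted window.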

Grafana Cloud Auth Format

The loki exporter is NOT the correct approach for OTLP → Grafana Cloud.

❌ loki exporter with Loki push API
✅ otlp_http/grafana with OTLP gateway endpoint

The Basic Auth value must be base64(instanceId:apiToken):

# Generate the auth value
echo -n "<your-instance-id>:<your-grafana-cloud-api-token>" | base64

Where the instance ID is your numeric Grafana Cloud instance ID (found in Cloud Portal → OTLP configuration).

Honeycomb Key Types

Honeycomb has two key types. Only ingest keys work for data ingestion:

Key Prefix	Type	Works for OTLP?
`hcaik_`	Ingest API key	✅ Yes
`hcxik_`	Environment key	❌ No

If you see 401 Unauthorized from Honeycomb, check your key prefix.

Colima Docker Compose Compatibility

docker compose v2 plugin is not available in Colima environments. All scripts in this repository detect this automatically and fall back to docker run. This is expected — not an error.

If you need compose-like orchestration on Colima, use the explicit docker run commands shown in the Deployment section.

Common Mistake: loki Exporter vs otlp_http

A frequent misconfiguration when targeting Grafana Cloud:

# ❌ WRONG — loki exporter uses Loki-specific push API
exporters:
  loki:
    endpoint: https://logs-prod-<region>.grafana.net/loki/api/v1/push

# ✅ CORRECT — otlp_http uses the OTLP gateway
exporters:
  otlp_http/grafana:
    endpoint: https://otlp-gateway-prod-<region>.grafana.net/otlp

The OTLP gateway is Grafana Cloud's native OTLP ingestion endpoint. It handles logs, metrics, and traces through a single URL.

Cost Model: How to Think About It

Lambda Cost (OTLP Path vs Direct Send)

In my validation, the OTLP Lambda was simpler and shorter-lived than the vendor-specific direct-send path. Your duration will vary depending on batching, payload size, network path, and backend response time.

Component	Direct Send (Part 2)	OTLP + Collector
Lambda complexity	Vendor formatting + HTTP + retry	OTLP POST to nearby Collector
Lambda memory	256MB	256MB
Vendor SDK deps	Yes (adds cold start)	None
Retry complexity	Per-vendor	Delegated to Collector

OTel Collector Cost

The Collector introduces a fixed infrastructure cost that is independent of event volume:

Deployment	Best For
Docker on local machine	Development, testing
Docker on EC2 Spot (t3.small)	Low-volume production
ECS Fargate (0.5 vCPU, 1GB)	Production (no OS management)
ECS Fargate + NAT Gateway	VPC-internal production

When to Use Each Pattern

Scenario	Recommendation
Single vendor, low volume	Direct Send (Part 2 pattern) — no Collector overhead
Single vendor, high volume	Collector (buffering + backpressure benefits)
Multi-vendor evaluation	Collector (add/remove exporters freely)
Vendor migration in progress	Collector (parallel delivery during cutover)
Compliance: logs in multiple systems	Collector (fan-out is a config change)

Because the Collector's infrastructure cost is fixed, the Collector path becomes more cost-effective as volume increases or vendors multiply: each event is processed once and fanned out outside the Lambda. Direct-send can also fan out within one Lambda, but that pushes vendor-specific formatting, retry behavior, and failure isolation back into application code.

Important: Backend ingest/retention costs are not included in these AWS-side estimates. Datadog, Grafana Cloud, and Honeycomb each have their own pricing models that can become the dominant cost at scale.

When to Use This Pattern

Multi-Vendor Evaluation

Want to try Honeycomb for a month alongside your existing Datadog setup? Add one exporter to the Collector config. No Lambda redeployment. No risk to your existing pipeline.

Compliance: Logs in Multiple Systems

Some organizations require audit logs in multiple systems — security team uses Splunk, dev team uses Datadog, compliance team needs a cold archive. The Collector fans out to all simultaneously from a single OTLP stream.

Migration Between Vendors

Moving from Datadog to Grafana Cloud? Run both exporters in parallel during migration. Verify data parity in the new system. Remove the old exporter when satisfied. Zero-downtime vendor migration.

Cost Optimization: Route by Volume

Use the Collector's processor pipeline to route high-volume noisy logs (read operations) to a cheaper backend while keeping security-critical events (deletes, permission changes) on a premium platform with alerting.

What's Next

For production hardening, the repository includes guides covering VPC deployment, health monitoring, persistent buffering, security hardening, and benchmarking. Auto-scaling and Multi-AZ deployment are natural next steps for production Collector operations.

For production and partner-led deployments, the repository includes:

Architecture Decision Record
VPC Deployment Guide — private networking, security groups, and Collector reachability from Lambda
Config Governance Guide
Security Hardening Guide
Operations Guide
Cost Model
PoC Checklist
Routing and Filtering Examples
Compliance Evidence Note
Migration Guide — zero-downtime migration from direct-send to the Collector path
OTel Semantic Mapping Guide — standard vs project-specific attributes, schema evolution, and what OTLP does not solve
Backend Parity Matrix — visibility and query behavior across Datadog, Grafana Cloud, and Honeycomb
Glossary / 用語集 — English/Japanese OTel terminology used in this project
Enterprise Workload Addendum — SAP, VMware, and mission-critical workload considerations
Storage Service Selection Note — when to use FSx for ONTAP, Amazon S3, Amazon EFS, and Amazon EBS

Key Takeaways

OTLP is the stable producer contract. Your Lambda speaks one protocol; the Collector handles backend-specific exporters.
OTel Collector is the routing and processing layer that decouples log producers from observability backends.
Zero Lambda code changes when switching or adding backends — verified with SHA-256 hash comparison.
Multi-backend delivery is a config change, not a code change. Add 5 lines of YAML, restart the Collector.
All three FSx ONTAP event sources work: FSx audit logs via S3 Access Point (Part 2), EMS webhooks (Part 3), and FPolicy file operations (Part 4).
Collector economics improve as volume increases or vendors multiply — fixed Collector cost is amortized across all destinations.
Start with direct send (Part 2) for simplicity. Graduate to the Collector when you need multi-backend, vendor migration, or volume-based routing.

Series Navigation

Part 1: Why Your FSx for ONTAP Logs Deserve Better
Part 2: Shipping FSx for ONTAP Logs to Datadog — The Serverless Way
Part 3: Event-Driven Ransomware Detection with ONTAP ARP + Datadog
Part 4: FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate
Part 5: Escape Vendor Lock-in with OTel Collector (this post)

Questions about the OTel Collector pattern or multi-backend delivery? Drop a comment below.

Previous: Part 4 — FPolicy File Activity Pipeline

GitHub: github.com/Yoshiki0705/fsxn-observability-integrations

FPolicy File Activity Pipeline — ONTAP to Datadog via ECS Fargate

Yoshiki Fujiwara(藤原善基)@AWS Community Builder — Mon, 18 May 2026 02:31:34 +0000

TL;DR

ONTAP FPolicy pushes file operation notifications over a persistent TCP connection. We run a lightweight Python server on ECS Fargate that receives these events, normalizes them, and forwards them to SQS → Lambda → Datadog. In my validation environment, create events reached Datadog in about 6 seconds. Rename/delete behavior depends on FPolicy mode, protocol, and FSx for ONTAP behavior, so this post documents both the working path and the limitations observed.

Update — production hardening path
This article remains the Datadog-specific introduction to the FPolicy file activity pipeline. Since publishing it, the repository has been expanded with production-readiness guidance, governance and security review checklists, sample payloads, CI policy, cfn-guard rules, and shared Python helpers for observability and idempotent object processing.

For production planning, start from the repository README:

Choose Your Path

Recommended first 30 minutes

Production Readiness Levels

PoC Success Criteria

Security Review Checklist

Governance and Compliance Guide

CI Policy

The FPolicy pattern has also been expanded with Persistent Store guidance, idempotent object processing, EventBridge dispatch, and a hybrid polling/event-driven migration path. This Part 4 article focuses on the Datadog delivery path; the repository now documents the broader production baseline.

Why FPolicy Needs Fargate

In Part 3, we showed how EMS webhooks deliver ARP alerts via API Gateway → Lambda. That works because EMS uses standard HTTPS.

FPolicy is different. ONTAP's FPolicy subsystem uses a proprietary binary protocol over persistent TCP connections. ONTAP initiates the connection to the FPolicy server and maintains it with periodic KeepAlive messages. This means:

❌ Lambda — No persistent TCP connections, max 15-minute timeout
❌ API Gateway — HTTP/HTTPS only, no raw TCP
✅ ECS Fargate — Persistent TCP listener, private IP, auto-restart

Why I Did Not Use an NLB in This Validation

I tested an NLB-based approach, but it did not work reliably in my validation. The issue was not that NLB cannot forward binary TCP traffic; it can. The challenge was FPolicy's stateful session negotiation and ONTAP's expectation of configured FPolicy server IPs. Health checks and connection behavior introduced additional complexity. For this validation, the simplest reliable path was to let ONTAP connect directly to the Fargate task's private IP and automate external-engine IP updates on task restart.

The Fargate task runs a Python server that:

Listens on TCP:9898
Handles FPolicy protocol negotiation (version handshake)
Receives KeepAlive messages (connection health)
Parses file operation notifications
Forwards structured events to SQS

Architecture

SMB/NFS Client
    │ file create/write/rename/delete
    ▼
FSx for ONTAP (FPolicy enabled)
    │ proprietary TCP protocol
    ▼
ECS Fargate (TCP:9898)
    │ parse → normalize → forward
    ▼
SQS Queue
    │ event source mapping
    ▼
Lambda (fpolicy_handler)
    │ format → ship
    ▼
Datadog Logs API v2 (source:fsxn-fpolicy)

Key design decisions:

ONTAP connects TO Fargate — the Fargate task must be reachable on a private IP. Because that IP can change on task restart, the ONTAP external engine must be updated automatically or operationally.
SQS decouples the TCP server from the shipping logic — if Datadog is slow, events buffer in SQS
Lambda handles Datadog shipping — retry logic, batch formatting, API key management
No NLB — ONTAP connects directly to the Fargate task's private IP
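
On the Lambda side of the SQS decoupling, the handler receives a batch of queue messages and has to unpack each body. The sketch below shows one way to do that; `parse_sqs_records` is a hypothetical helper, and it assumes the Fargate server puts one JSON document per message. The `batchItemFailures` shape is only honored by Lambda when `ReportBatchItemFailures` is enabled on the event source mapping, which this article does not confirm for the repository's template.

```python
import json
from typing import Any


def parse_sqs_records(
    event: dict[str, Any],
) -> tuple[list[dict[str, Any]], list[dict[str, str]]]:
    """Split an SQS-triggered Lambda event into parsed events and failed message IDs."""
    events: list[dict[str, Any]] = []
    failures: list[dict[str, str]] = []
    for record in event.get("Records", []):
        try:
            events.append(json.loads(record["body"]))
        except (KeyError, TypeError, json.JSONDecodeError):
            # Report only the bad message so the rest of the batch is not retried.
            failures.append({"itemIdentifier": record.get("messageId", "")})
    return events, failures
```

A handler would ship `events` and return `{"batchItemFailures": failures}`, so a single malformed message goes back to the queue without blocking the others.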

Production Boundary: Why FPolicy Needs More Than Lambda

The audit-log and EMS paths are natural fits for Lambda:

Audit logs are file/object reads through the FSx for ONTAP S3 Access Point read path
EMS events are HTTPS webhook payloads

FPolicy is different. ONTAP FPolicy uses a persistent TCP connection to an external FPolicy server. That makes it a poor fit for API Gateway + Lambda as the first receiver.

This is why the production-oriented path is:

ONTAP FPolicy
  → ECS Fargate TCP listener
  → SQS
  → Lambda shipper
  → Datadog

## Deployment

### Prerequisites

- FSx for ONTAP file system with a CIFS-enabled SVM
- VPC with private subnets (same as FSx for ONTAP)
- ECR repository with the FPolicy server image
- Private subnet egress for Fargate: either a NAT Gateway or VPC endpoints for ECR image pull, CloudWatch Logs, and SQS access

### Step 1: Deploy the Fargate Stack

```bash
aws cloudformation deploy \
  --template-file shared/templates/fpolicy-server-fargate.yaml \
  --stack-name fsxn-fpolicy-server \
  --parameter-overrides \
    VpcId= \
    SubnetIds= \
    FsxnSvmSecurityGroupId= \
    ContainerImage=.dkr.ecr..amazonaws.com/fsxn-fpolicy-server:latest \
  --capabilities CAPABILITY_NAMED_IAM
```


This creates:
- ECS Cluster + Fargate Service (1 task)
- SQS Queue for FPolicy events
- Security Group (inbound TCP:9898 from FSx SG)
- CloudWatch Log Group

### Step 2: Deploy the Datadog Shipping Lambda

The template accepts the SQS queue ARN as a parameter and automatically creates the event source mapping:

```bash
# Get the SQS queue ARN from Step 1 outputs
SQS_ARN=$(aws cloudformation describe-stacks \
  --stack-name fsxn-fpolicy-server \
  --query "Stacks[0].Outputs[?OutputKey=='FPolicyQueueArn'].OutputValue" \
  --output text)

aws cloudformation deploy \
  --template-file integrations/datadog/template-ems-fpolicy.yaml \
  --stack-name fsxn-datadog-ems-fpolicy \
  --parameter-overrides \
    DatadogApiKeySecretArn= \
    DatadogSite=ap1.datadoghq.com \
    FPolicySqsQueueArn=${SQS_ARN} \
  --capabilities CAPABILITY_NAMED_IAM
```


This creates the Lambda function with an SQS event source mapping — no manual `create-event-source-mapping` needed.

### Step 3: Get the Fargate Task IP

```bash
TASK_ARN=$(aws ecs list-tasks \
  --cluster fsxn-fpolicy-server-cluster \
  --service-name fsxn-fpolicy-server-service \
  --query "taskArns[0]" --output text)

aws ecs describe-tasks \
  --cluster fsxn-fpolicy-server-cluster \
  --tasks $TASK_ARN \
  --query "tasks[0].containers[0].networkInterfaces[0].privateIpv4Address" \
  --output text
```


## ONTAP FPolicy Configuration

> **CLI note**: Some ONTAP versions show these commands under `vserver fpolicy ...`, while newer CLI contexts may allow shortened forms. Use the command form supported by your ONTAP version. The examples below use the form validated in my environment (FSx for ONTAP 9.17.1). See [NetApp CLI reference](https://docs.netapp.com/us-en/ontap-cli-9151/vserver-fpolicy-policy-external-engine-create.html) for the full command syntax.

FPolicy requires three components: an External Engine (where to send events), an Event (what to monitor), and a Policy (linking them together).

### Create the External Engine

```shell
vserver fpolicy policy external-engine create -vserver \
  -engine-name fpolicy_aws_engine \
  -primary-servers \
  -port 9898 \
  -extern-engine-type asynchronous \
  -ssl-option no-auth
```


> **Production note**: For production deployments, evaluate `server-auth` or `mutual-auth` instead of `no-auth`, and validate certificate handling between ONTAP and the FPolicy server. See [NetApp FPolicy external engine documentation](https://docs.netapp.com/us-en/ontap/nas-audit/create-fpolicy-external-engine-task.html).

### Create the FPolicy Event

```shell
vserver fpolicy policy event create -vserver \
  -event-name cifs_file_events \
  -protocol cifs \
  -file-operations create,write,rename,delete
```


> **Tip**: For write-heavy workloads, review the protocol-specific FPolicy filters supported by your ONTAP version and protocol. Where supported, use close/modify-oriented filters to reduce duplicate or noisy write events.

### Create and Enable the Policy

```shell
vserver fpolicy policy create -vserver \
  -policy-name fpolicy_aws \
  -events cifs_file_events \
  -engine fpolicy_aws_engine \
  -is-mandatory false

vserver fpolicy enable -vserver \
  -policy-name fpolicy_aws \
  -sequence-number 1
```


This example uses an asynchronous, non-mandatory policy so client file operations are not blocked by FPolicy server processing or Datadog delivery. If the FPolicy server is unavailable, file operations continue unimpeded — but notifications may be buffered or lost depending on your ONTAP version and configuration.

### Verify Connection

```shell
vserver fpolicy show-engine -vserver -engine-name fpolicy_aws_engine
```


You should see `connected` status. In the ECS logs, KeepAlive messages confirm the connection:

```console
[INFO] fpolicy-server: [+] Connection from ('10.0.x.x', 44107)
[INFO] fpolicy-server: [Handshake] Policy=fpolicy_aws | Session=... | VsUUID=...
[INFO] fpolicy-server: [Send] NEGO_RESP | Version=1.2 | Policy=fpolicy_aws
[INFO] fpolicy-server: [KeepAlive] Received — connection healthy
```


## E2E Validation Results

File operations on the SMB share produce events that flow through the entire pipeline:

| Operation | ECS Log | SQS | Lambda | Datadog | Latency |
|-----------|---------|-----|--------|---------|---------|
| create `blog_demo_create.txt` | ✅ | ✅ | ✅ shipped:1 | ✅ | ~6 seconds |
| create `blog_demo_write.txt` | ✅ | ✅ | ✅ shipped:1 | ✅ | ~6 seconds |
| create `confidential_report_2026.xlsx` | ✅ | ✅ | ✅ shipped:1 | ✅ | ~6 seconds |

### ECS Fargate Logs — Connection Lifecycle

The FPolicy server logs show the complete lifecycle: server start → ONTAP connection → protocol handshake → KeepAlive → file events → SQS delivery.

![ECS Fargate CloudWatch Logs](https://raw.githubusercontent.com/Yoshiki0705/fsxn-observability-integrations/main/docs/screenshots/aws-ecs-fpolicy-logs.png)

### Lambda CloudWatch Logs — Event Processing

Each SQS message triggers a Lambda invocation. Processing time is typically 30-50ms per event.

![Lambda CloudWatch Logs](https://raw.githubusercontent.com/Yoshiki0705/fsxn-observability-integrations/main/docs/screenshots/aws-lambda-fpolicy-logs.png)

### Datadog Log Explorer

Query: `source:fsxn-fpolicy`

Each event contains structured attributes:
- `operation_type`: The file operation (create, write, rename, delete)
- `file_path`: The file that was operated on
- `client_ip`: The client that performed the operation
- `volume_name`: The ONTAP volume
- `svm`: The ONTAP SVM name (may show "unknown" if not resolved from handshake context)
- `timestamp`: When the operation occurred

![FPolicy events in Datadog Log Explorer](https://raw.githubusercontent.com/Yoshiki0705/fsxn-observability-integrations/main/docs/screenshots/datadog-fpolicy-full-path.png)

![FPolicy event detail — structured attributes visible in the side panel](https://raw.githubusercontent.com/Yoshiki0705/fsxn-observability-integrations/main/docs/screenshots/datadog-fpolicy-detail.png)

## Correlating FPolicy with ARP

The real power emerges when you combine FPolicy file activity with ARP ransomware detection from Part 3:

```plaintext
source:(fsxn-fpolicy OR fsxn-ems) @attributes.svm:svm-prod-01
```


This correlation query shows:
1. **ARP alert** (from EMS): "Ransomware detected on volume X"
2. **File operations** (from FPolicy): Which user, from which IP, created/renamed which files

Together they answer the critical incident response questions: *What happened, who did it, and from where?*

### Security Use Case: Detecting Suspicious File Creation Bursts

With FPolicy create events in Datadog, you can create a Monitor that fires when a single client creates more than 50 files in 5 minutes — a potential indicator of ransomware encryption or unauthorized bulk operations:

**Datadog Monitor query:**

```plaintext
logs("source:fsxn-fpolicy @attributes.operation_type:create").rollup("count").by("@attributes.client_ip").last("5m") > 50
```


**Alert message:**

```plaintext
🚨 Suspicious file creation burst detected on FSx for ONTAP

Client IP: {{@attributes.client_ip}}
Volume: {{@attributes.volume_name}}
Count: {{value}} file creations in 5 minutes

Investigate immediately — check if this is authorized batch processing or potential ransomware activity.
```


> **Note on delete monitoring**: If your FPolicy configuration and ONTAP version reliably deliver delete events (e.g., synchronous mode or a future ONTAP release), you can extend this pattern to bulk deletion detection. In my async-mode validation, delete notifications were not reliably delivered — I recommend using audit logs from [Part 2](https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c) for delete-event completeness.

This is difficult to achieve with traditional audit log polling, which depends on rotation and scheduler intervals. FPolicy's event-driven delivery makes sub-minute detection possible for the operations it reliably captures.
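
To make the monitor's logic concrete, the threshold rule above (more than 50 creates from one client within 5 minutes) can be sketched as a sliding-window counter. This is only an illustration of the rule, not how Datadog evaluates monitors; `BurstDetector` and its parameters are hypothetical names.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300   # 5 minutes, matching the monitor's evaluation window
THRESHOLD = 50         # alert when a client exceeds this many creates


class BurstDetector:
    """Track create events per client IP inside a sliding time window."""

    def __init__(self, window: float = WINDOW_SECONDS, threshold: int = THRESHOLD):
        self.window = window
        self.threshold = threshold
        self._events: dict[str, deque] = defaultdict(deque)

    def record(self, client_ip: str, timestamp: float) -> bool:
        """Record one create event; return True if the client is over the threshold."""
        events = self._events[client_ip]
        events.append(timestamp)
        # Drop events that have fallen out of the window.
        while events and events[0] <= timestamp - self.window:
            events.popleft()
        return len(events) > self.threshold
```

Feeding the detector one event per second from a single client returns `True` for the first time on the 51st event, which mirrors the `> 50` condition in the monitor query.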

## Operational Considerations

### Fargate Task IP Changes

When a Fargate task restarts (deployment, crash, scaling), it gets a new private IP. ONTAP's External Engine must be updated with the new IP. Options:

1. **Manual update**: `vserver fpolicy policy external-engine modify -primary-servers <new-ip>`
2. **Automated**: Lambda triggered by ECS task state change → ONTAP REST API update

The repository includes a helper script (`shared/scripts/fpolicy-update-engine-ip.sh --auto`) that detects the current task IP and updates the ONTAP engine. For full automation, wire an EventBridge rule on ECS task state changes to an update Lambda — this is not included in the base stack but is straightforward to add. Automated updates require network reachability to the ONTAP management endpoint and credentials (stored in Secrets Manager) with permission to modify the FPolicy external engine.
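
The first step of such an update Lambda is pulling the new private IP out of the ECS task state change event. The sketch below shows that step only, under stated assumptions: `extract_task_private_ip` is a hypothetical helper, and it relies on the event carrying the ENI attachment details with a `privateIPv4Address` entry, so verify the shape against a real event from your account. The ONTAP REST call that applies the IP is deliberately left out.

```python
from typing import Any, Optional


def extract_task_private_ip(event: dict[str, Any]) -> Optional[str]:
    """Return the private IPv4 address of a RUNNING task, or None if not available."""
    detail = event.get("detail", {})
    # Ignore PROVISIONING, PENDING, STOPPED and other intermediate states.
    if detail.get("lastStatus") != "RUNNING":
        return None
    for attachment in detail.get("attachments", []):
        for item in attachment.get("details", []):
            if item.get("name") == "privateIPv4Address":
                return item.get("value")
    return None
```

Returning `None` for non-running states keeps the Lambda from pushing a stale or missing address to the external engine while a task is still starting or shutting down.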

### Restart Resilience — Validated

I tested the full restart cycle to confirm the pipeline recovers gracefully:

| Step | Result | Time |
|------|--------|------|
| Stop Fargate (scale to 0) | Task stopped | ~30s |
| Restart Fargate (scale to 1) | New task, new IP | ~45s |
| Update ONTAP Engine IP | Reconnection | ~20s |
| File operation after restart | Event delivered to Datadog | ~6s |
| **Total recovery time** | | **~2 minutes** |

The Lambda's retry logic also proved itself: on the first request after reconnection, a transient `RemoteDisconnected` error occurred. The exponential backoff retry succeeded on the second attempt — exactly the behavior we designed for.

```console
[WARNING] HTTP error shipping to Datadog (attempt 1/3): RemoteDisconnected
[INFO] Processing complete: {"statusCode": 200, "body": {"shipped": 1}}
```
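The retry behavior seen in that log can be sketched as follows. This is a minimal illustration of exponential backoff over three attempts, not the Lambda's actual code. The `send` callable, the base delay, and the injected `sleep` parameter are assumptions introduced to keep the example self-contained.

```python
import http.client
import time


def ship_with_retry(send, payload, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send(payload), retrying transient connection errors with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload)
        except (http.client.RemoteDisconnected, ConnectionError, TimeoutError) as exc:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller.
            # Delay doubles each attempt: base, 2x base, 4x base, ...
            delay = base_delay * 2 ** (attempt - 1)
            print(
                f"[WARNING] HTTP error shipping (attempt {attempt}/{max_attempts}): "
                f"{type(exc).__name__}"
            )
            sleep(delay)
```

In a Lambda behind an SQS event source, re-raising on the final attempt returns the message to the queue (or a dead-letter queue if one is configured), so events are not silently dropped.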


### Cost Profile

| Component | Monthly Cost (estimate) |
|-----------|------------------------|
| Fargate (0.25 vCPU, 0.5 GB) | ~$10 |
| SQS (low volume) | < $1 |
| Lambda (event-driven) | < $1 |
| CloudWatch Logs | ~$2 |
| **Total** | **~$14/month** |

Compare this to an always-on EC2-based collector, plus OS patching, agent management, and HA considerations. Exact EC2 costs vary by region and instance type.

> This is an AWS-side estimate and excludes Datadog ingest/retention costs, NAT Gateway or VPC endpoint charges, ECR storage, and high-volume CloudWatch Logs.

### Scaling

A single Fargate task is sufficient for the low-volume validation scenarios in this post. The architecture can scale by tuning Fargate CPU/memory, SQS buffering, and Lambda concurrency, but you should benchmark your own workload before assuming a specific events/sec capacity.

### Monitoring

Key CloudWatch metrics to watch:
- `CPUUtilization` (namespace `AWS/ECS`): Fargate task health
- `ApproximateNumberOfMessagesVisible` (namespace `AWS/SQS`): Queue depth (should stay near 0)
- `Errors` (namespace `AWS/Lambda`): Shipping failures
- `Duration` (namespace `AWS/Lambda`): Processing time per batch
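As one way to act on the queue-depth metric, the sketch below builds the arguments for a CloudWatch alarm on SQS backlog. The alarm name, threshold, period, and evaluation count are illustrative assumptions, not values from the repository. The function only constructs the parameters, so it runs without AWS credentials.

```python
def queue_depth_alarm(queue_name, threshold=100, periods=3):
    """Build keyword arguments for CloudWatch put_metric_alarm on SQS queue depth.

    The threshold and evaluation periods are placeholder values. Pick
    numbers that reflect your own event volume.
    """
    return {
        "AlarmName": f"{queue_name}-backlog",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,  # seconds per evaluation period
        "EvaluationPeriods": periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }
```

With boto3 installed, the result can be passed as `boto3.client("cloudwatch").put_metric_alarm(**queue_depth_alarm("my-queue"))`, where `my-queue` is a placeholder queue name.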

## The FPolicy Server

The FPolicy server (`shared/fpolicy-server/fpolicy_server.py`) implements:

- **Protocol negotiation**: Responds to ONTAP's version handshake
- **KeepAlive handling**: Acknowledges connection health checks
- **Event parsing**: Extracts file path, operation, user, client IP from binary frames
- **SQS forwarding**: Sends normalized JSON events to the queue
- **Write coalescing**: Configurable delay to batch rapid write events (default: 5 seconds)

The server runs in `realtime` mode — events are forwarded as they arrive, with optional write-complete delay to avoid duplicate notifications for multi-write operations.
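The write-coalescing idea can be illustrated with a small sketch: hold the latest event per file path and release it only after the path has been quiet for the configured delay. This is an assumption about how such coalescing could work, not a copy of `fpolicy_server.py`. The class name, the `file_path` key, and the injectable clock are hypothetical.

```python
import time


class WriteCoalescer:
    """Hold events per file path and release one event once the path is quiet."""

    def __init__(self, delay_sec=5.0, clock=time.monotonic):
        self.delay_sec = delay_sec
        self.clock = clock
        self._pending = {}  # file_path -> (last_seen, latest_event)

    def add(self, event):
        # A newer event for the same path replaces the older one and resets the timer.
        self._pending[event["file_path"]] = (self.clock(), event)

    def flush_ready(self):
        """Return events whose path has seen no activity for delay_sec."""
        now = self.clock()
        ready = [
            path
            for path, (seen, _) in self._pending.items()
            if now - seen >= self.delay_sec
        ]
        return [self._pending.pop(path)[1] for path in ready]
```

A server loop would call `add()` for each incoming write notification and periodically forward whatever `flush_ready()` returns to SQS, so a burst of writes to one file produces a single downstream event.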

## Limitations and Future Work

### Rename/Delete Events Not Delivered in Async Mode

In my E2E testing, ONTAP did not deliver rename or delete notifications to the FPolicy server in asynchronous mode — even though these operations are configured in the FPolicy event definition. Only create events were reliably delivered. This appears to be a limitation of FSx for ONTAP's FPolicy implementation in async mode for certain operation types.

**Workaround options:**
- Use synchronous mode (adds latency to file operations — not recommended for production)
- Combine FPolicy (event-driven create) with audit log polling (catches rename/delete in EVTX)
- Accept create-only monitoring for event-driven alerting, use audit logs for forensic completeness

### NFS Protocol Support

| Protocol | FPolicy Support | Notes |
|----------|----------------|-------|
| SMB/CIFS | ✅ Verified | Primary validation protocol |
| NFSv3 | ✅ Supported | Requires explicit `vers=3` mount option |
| NFSv4.0 | ✅ Supported | Requires explicit `vers=4.0` |
| NFSv4.1 | ✅ Supported | Requires ONTAP 9.15.1+, explicit `vers=4.1` |
| NFSv4.2 | ❌ Not supported | ONTAP FPolicy does not monitor NFSv4.2 operations |

For protocol support details, verify your ONTAP version. NetApp [documents](https://kb.netapp.com/onprem/ontap/da/NAS/Does_ONTAP_support_FPolicy_for_NFS_4.2) that FPolicy does not currently support NFSv4.2; supported NFS protocols include NFSv3, NFSv4.0, and NFSv4.1 (ONTAP 9.15.1+).

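**Critical gotcha:** `mount -o vers=4` on modern Linux negotiates the highest NFSv4 minor version that both client and server support, which is typically NFSv4.2. ONTAP FPolicy does **not** support NFSv4.2, so always specify the version explicitly: `mount -o vers=4.1` or `vers=3`.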
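A quick way to catch this on a client is to inspect the negotiated version in the mount table. The sketch below parses text in `/proc/mounts` format and returns mount points that ended up on NFSv4.2. The function name is hypothetical, and the check assumes the kernel reports the version as a `vers=` mount option, which is the usual Linux behavior.

```python
def unsupported_nfs_mounts(mounts_text):
    """Return mount points of NFS mounts that negotiated NFSv4.2."""
    flagged = []
    for line in mounts_text.splitlines():
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        fields = line.split()
        if len(fields) < 4 or not fields[2].startswith("nfs"):
            continue
        options = fields[3].split(",")
        if "vers=4.2" in options:
            flagged.append(fields[1])
    return flagged
```

On a Linux client this could be run as `unsupported_nfs_mounts(open("/proc/mounts").read())`. Any path it returns should be remounted with an explicit `vers=4.1` or `vers=3`.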

**NFS + FPolicy latency:** NFSv3 lacks close semantics, so the FPolicy server cannot know when a write is complete. The server uses a configurable `WRITE_COMPLETE_DELAY_SEC` (default: 5s) to wait before forwarding the event. This adds latency but prevents premature processing of incomplete files.

**NFS write hang (observed):** In some configurations, NFS write operations may hang when FPolicy is enabled — even with `is-mandatory=false`. This is a [known ONTAP behavior](https://kb.netapp.com/onprem/ontap/da/NAS/NFS_hung_slowness_issue_when_dealing_with_long_path_names_with_FPolicy_enabled) related to FPolicy notification processing. If you experience this, verify your ONTAP version and consider limiting FPolicy scope to specific volumes.

### User Identity

In the current implementation, the `user` field may be empty for some operations depending on ONTAP's FPolicy notification content. The FPolicy binary frame includes user identity in extended attributes that require additional parsing logic. Future versions will extract this from the NOTI_REQ body.

### Event Durability During Restarts

In my validation, events generated while the Fargate server was disconnected were not observed downstream in Datadog after reconnection. Treat FPolicy delivery during server outages as something you must validate in your own environment.

ONTAP [documentation](https://docs.netapp.com/us-en/ontap/nas-audit/synchronous-asynchronous-notifications-concept.html) describes buffering behavior for asynchronous notifications — notifications generated during a network outage are stored on the storage node and can be fetched when the server comes back online. Beginning with ONTAP 9.14.1, [FPolicy persistent store](https://docs.netapp.com/us-en/ontap/nas-audit/persistent-stores.html) support is available for asynchronous non-mandatory policies. If you cannot tolerate event loss during FPolicy server restarts, evaluate persistent store and validate the behavior on your FSx for ONTAP version.

## Try It Yourself

```bash
# Clone the repository
git clone https://github.com/Yoshiki0705/fsxn-observability-integrations.git
cd fsxn-observability-integrations

# Deploy prerequisites (if not already done)
# Supply a value for each parameter override before running
aws cloudformation deploy \
  --template-file shared/templates/fpolicy-server-fargate.yaml \
  --stack-name fsxn-fpolicy-server \
  --parameter-overrides \
    VpcId= \
    SubnetIds= \
    FsxnSvmSecurityGroupId= \
    ContainerImage= \
  --capabilities CAPABILITY_NAMED_IAM

# Configure ONTAP FPolicy (see ONTAP section above)
# Create a file on the SMB share
# Check Datadog: source:fsxn-fpolicy
```




## Where FPolicy Fits in ONTAP Telemetry

This series covers three ONTAP telemetry sources. Each serves a different purpose:

| Use Case | Best Source | Latency | Coverage |
|----------|-------------|---------|----------|
| Compliance audit trail | Audit logs ([Part 2](https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c)) | Minutes (scheduler interval) | Complete historical record |
| Ransomware detection | ARP via EMS ([Part 3](https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda)) | ~30 seconds (webhook) | ML-based pattern detection |
| Event-driven file activity signal | FPolicy (this post) | ~6 seconds (TCP) | Create events validated; other operations depend on mode/version |
| Forensic investigation | Audit logs + FPolicy correlation | Combined | Timeline reconstruction |

**FPolicy is not a replacement for audit logs.** It provides an event-driven signal for detection and alerting. Audit logs provide the authoritative, complete historical record for compliance and forensics. Use them together.

## Key Takeaways

1. **Use Fargate for the FPolicy TCP listener.** Lambda cannot host a long-lived TCP listener that ONTAP stays connected to. Fargate provides the long-running listener without OS management.
2. **Use SQS to decouple ingestion from shipping.** If Datadog is slow or Lambda is throttled, events buffer safely in SQS.
3. **Validate operation coverage in your environment.** Async mode reliably delivered create events in my testing. Rename/delete behavior varies by ONTAP version and mode.
4. **Use audit logs for forensic completeness.** FPolicy provides an event-driven signal for detection; audit logs (Part 2) provide the complete historical record.
5. **Treat FPolicy as event-driven alerting, not a full audit replacement.** The two are complementary, not interchangeable.

## Production Considerations Beyond This Validation

This post validates the end-to-end path. For production deployments, the following topics warrant additional design work:

| Topic | Key Questions |
|-------|--------------|
| **HA / Multi-AZ** | ONTAP external engine supports `primary-servers` and `secondary-servers`. How to run multiple Fargate tasks across AZs? |
| **Scope Design** | Which volumes, operations, and protocols to monitor? How to avoid noisy workloads? |
| **Security Hardening** | TLS/mTLS for FPolicy, ECR image scanning, VPC Flow Logs, task role least-privilege |
| **Cost Model** | FPolicy generates events per file operation — Datadog ingest can become the dominant cost at scale |
| **Operations Runbook** | Task restart, engine disconnected, SQS backlog, Datadog missing events, NFS hang |
| **Stable Endpoint** | Auto-update Lambda for engine IP, or primary/secondary server design for zero-downtime restarts |

These topics are documented in the repository:

- **[Production Architecture Patterns](https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/fpolicy-production-architecture-patterns.md)** — Single task, primary/secondary, auto-update, multi-AZ patterns with failure mode matrix
- **[Operational Guide](https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/fpolicy-operational-guide.md)** — 4-layer health model, runbooks, IP reconciliation, synthetic health check
- **[PoC Checklist](https://github.com/Yoshiki0705/fsxn-observability-integrations/blob/main/docs/en/fpolicy-poc-checklist.md)** — Preconditions, scope, validation steps, success criteria, go/no-go

Contributions and questions are welcome.

## Series Navigation

- **Part 1**: [Why Your FSx for ONTAP Logs Deserve Better](https://dev.to/aws-builders/why-your-fsx-for-ontap-audit-logs-deserve-better-than-ec2-kod)
- **Part 2**: [Shipping FSx for ONTAP Logs to Datadog, The Serverless Way](https://dev.to/aws-builders/shipping-fsx-for-ontap-logs-to-datadog-the-serverless-way-n9c)
- **Part 3**: [Event-Driven Ransomware Detection with ONTAP ARP + Datadog](https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda)
- **Part 4**: FPolicy File Activity Pipeline (this post)

Coming next:
- **Splunk**: Replacing EC2 + Universal Forwarder with Lambda + HEC
- **OpenTelemetry**: The vendor-neutral escape hatch

---

*Questions about FPolicy or the Fargate architecture? Drop a comment below.*

*Previous: [Part 3 — Event-Driven Ransomware Detection with ONTAP ARP + Datadog](https://dev.to/aws-builders/event-driven-ransomware-detection-with-ontap-arp-datadog-4cda)*

**GitHub**: [github.com/Yoshiki0705/fsxn-observability-integrations](https://github.com/Yoshiki0705/fsxn-observability-integrations)
