<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: LAVANYA PRIYA</title>
    <description>The latest articles on Forem by LAVANYA PRIYA (@lavanya_priya_3c9225a7a5b).</description>
    <link>https://forem.com/lavanya_priya_3c9225a7a5b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3458303%2Ff4afc9ff-1566-4008-9afd-ee4476dfc8b7.png</url>
      <title>Forem: LAVANYA PRIYA</title>
      <link>https://forem.com/lavanya_priya_3c9225a7a5b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lavanya_priya_3c9225a7a5b"/>
    <language>en</language>
    <item>
      <title>Understanding 6 Common Data Formats in Cloud &amp; Data Analytics</title>
      <dc:creator>LAVANYA PRIYA</dc:creator>
      <pubDate>Fri, 03 Oct 2025 14:35:12 +0000</pubDate>
      <link>https://forem.com/lavanya_priya_3c9225a7a5b/understanding-6-common-data-formats-in-cloud-data-analytics-1g8o</link>
      <guid>https://forem.com/lavanya_priya_3c9225a7a5b/understanding-6-common-data-formats-in-cloud-data-analytics-1g8o</guid>
      <description>&lt;p&gt;When working with data in the cloud or in analytics pipelines, the way data is stored and exchanged plays a huge role in performance, compatibility, and scalability.&lt;/p&gt;

&lt;p&gt;In this blog, we’ll explore six popular data formats used in analytics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSV (Comma Separated Values)&lt;/li&gt;
&lt;li&gt;SQL (Relational Table Format)&lt;/li&gt;
&lt;li&gt;JSON (JavaScript Object Notation)&lt;/li&gt;
&lt;li&gt;Parquet (Columnar Storage Format)&lt;/li&gt;
&lt;li&gt;XML (Extensible Markup Language)&lt;/li&gt;
&lt;li&gt;Avro (Row-based Storage Format)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll use a small dataset of students (name, register number, subject, marks) and represent it in all six formats.&lt;/p&gt;

&lt;p&gt;🎯 Sample Dataset&lt;/p&gt;

&lt;p&gt;Here’s the dataset we’ll use throughout:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fff64axfxz016zbrrf8ms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fff64axfxz016zbrrf8ms.png" alt="Sample Dataset" width="464" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1️⃣ CSV (Comma Separated Values)&lt;/p&gt;

&lt;p&gt;CSV is the simplest and most widely used format. Data is stored as plain text, with commas separating the values.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alice,101,Math,85
Bob,102,Physics,78
Charlie,103,Chemistry,92
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ Pros: Easy to read, lightweight.&lt;br&gt;
⚠️ Cons: No schema, can get messy with nested/complex data.&lt;/p&gt;
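&lt;p&gt;The snippet above can be parsed with Python’s built-in csv module; a minimal sketch (the column names are an assumption, since the file itself carries no header row):&lt;/p&gt;

```python
import csv
import io

# The same three rows as the example above (no header row)
raw = """Alice,101,Math,85
Bob,102,Physics,78
Charlie,103,Chemistry,92
"""

# Assumed column names -- CSV itself carries no schema
fields = ["Name", "Register_No", "Subject", "Marks"]

rows = [dict(zip(fields, row)) for row in csv.reader(io.StringIO(raw))]

print(rows[0]["Name"])   # Alice
print(rows[2]["Marks"])  # 92 -- but still a string: CSV has no types
```

&lt;p&gt;Note that every value comes back as a string; converting Marks to an integer is the reader’s job, which is exactly the “no schema” drawback in practice.&lt;/p&gt;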

&lt;p&gt;2️⃣ SQL (Relational Table Format)&lt;/p&gt;

&lt;p&gt;Relational databases store data in tables with rows and columns; the schema is defined and rows are inserted using SQL statements.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE Students (
    Name VARCHAR(50),
    Register_No INT,
    Subject VARCHAR(50),
    Marks INT
);

INSERT INTO Students (Name, Register_No, Subject, Marks) VALUES
('Alice', 101, 'Math', 85),
('Bob', 102, 'Physics', 78),
('Charlie', 103, 'Chemistry', 92);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ Pros: Schema enforcement, supports queries.&lt;br&gt;
⚠️ Cons: Not ideal for unstructured or semi-structured data.&lt;/p&gt;
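&lt;p&gt;The same CREATE TABLE and INSERT statements can be tried without installing a server, using SQLite, which ships with Python. A minimal sketch:&lt;/p&gt;

```python
import sqlite3

# In-memory SQLite database -- a lightweight stand-in for any SQL engine
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Students (
        Name VARCHAR(50),
        Register_No INT,
        Subject VARCHAR(50),
        Marks INT
    )
""")
conn.executemany(
    "INSERT INTO Students (Name, Register_No, Subject, Marks) VALUES (?, ?, ?, ?)",
    [("Alice", 101, "Math", 85),
     ("Bob", 102, "Physics", 78),
     ("Charlie", 103, "Chemistry", 92)],
)

# Schema enforcement pays off: typed columns plus declarative queries
top = conn.execute(
    "SELECT Name, Marks FROM Students ORDER BY Marks DESC LIMIT 1"
).fetchone()
print(top)  # ('Charlie', 92)
```

&lt;p&gt;The same statements work on any relational engine; only the connection setup changes.&lt;/p&gt;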

&lt;p&gt;3️⃣ JSON (JavaScript Object Notation)&lt;/p&gt;

&lt;p&gt;JSON represents data in a structured key-value pair format, often used in APIs and NoSQL databases.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
  {
    "Name": "Alice",
    "Register_No": 101,
    "Subject": "Math",
    "Marks": 85
  },
  {
    "Name": "Bob",
    "Register_No": 102,
    "Subject": "Physics",
    "Marks": 78
  },
  {
    "Name": "Charlie",
    "Register_No": 103,
    "Subject": "Chemistry",
    "Marks": 92
  }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ Pros: Great for APIs, supports nested data.&lt;br&gt;
⚠️ Cons: Not space-efficient compared to Parquet/Avro.&lt;/p&gt;
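&lt;p&gt;Serializing and parsing JSON in Python is a one-call round trip with the standard json module; a minimal sketch using the same records:&lt;/p&gt;

```python
import json

# The same records as the JSON example above
students = [
    {"Name": "Alice", "Register_No": 101, "Subject": "Math", "Marks": 85},
    {"Name": "Bob", "Register_No": 102, "Subject": "Physics", "Marks": 78},
    {"Name": "Charlie", "Register_No": 103, "Subject": "Chemistry", "Marks": 92},
]

# Serialize to a JSON string, then parse it back -- a lossless round trip
payload = json.dumps(students, indent=2)
decoded = json.loads(payload)

print(decoded[1]["Subject"])  # Physics
```

&lt;p&gt;Unlike CSV, numbers survive the round trip as integers, which is why JSON is the default interchange format for APIs.&lt;/p&gt;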

&lt;p&gt;4️⃣ Parquet (Columnar Storage Format)&lt;/p&gt;

&lt;p&gt;Parquet is a binary, columnar format designed for analytics (popular with Hadoop, Spark, AWS Athena, and BigQuery). Because data is stored column by column, queries that touch only a few columns read far less data and run faster.&lt;/p&gt;

&lt;p&gt;Example in Python (to generate a .parquet file):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Sample dataset
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Register_No": [101, 102, 103],
    "Subject": ["Math", "Physics", "Chemistry"],
    "Marks": [85, 78, 92]
}

df = pd.DataFrame(data)
df.to_parquet("students.parquet", engine="pyarrow", index=False)

print("✅ Parquet file saved as students.parquet")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⚡ Binary file (not human-readable). If you open it, you’ll see compressed binary data.&lt;/p&gt;

&lt;p&gt;✅ Pros: Efficient for large datasets, great compression, optimized for analytics.&lt;br&gt;
⚠️ Cons: Not human-readable, needs libraries to parse.&lt;/p&gt;

&lt;p&gt;5️⃣ XML (Extensible Markup Language)&lt;/p&gt;

&lt;p&gt;XML stores data in a tag-based hierarchical structure. It is verbose but still used in enterprise systems.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;Students&amp;gt;
  &amp;lt;Student&amp;gt;
    &amp;lt;Name&amp;gt;Alice&amp;lt;/Name&amp;gt;
    &amp;lt;Register_No&amp;gt;101&amp;lt;/Register_No&amp;gt;
    &amp;lt;Subject&amp;gt;Math&amp;lt;/Subject&amp;gt;
    &amp;lt;Marks&amp;gt;85&amp;lt;/Marks&amp;gt;
  &amp;lt;/Student&amp;gt;
  &amp;lt;Student&amp;gt;
    &amp;lt;Name&amp;gt;Bob&amp;lt;/Name&amp;gt;
    &amp;lt;Register_No&amp;gt;102&amp;lt;/Register_No&amp;gt;
    &amp;lt;Subject&amp;gt;Physics&amp;lt;/Subject&amp;gt;
    &amp;lt;Marks&amp;gt;78&amp;lt;/Marks&amp;gt;
  &amp;lt;/Student&amp;gt;
  &amp;lt;Student&amp;gt;
    &amp;lt;Name&amp;gt;Charlie&amp;lt;/Name&amp;gt;
    &amp;lt;Register_No&amp;gt;103&amp;lt;/Register_No&amp;gt;
    &amp;lt;Subject&amp;gt;Chemistry&amp;lt;/Subject&amp;gt;
    &amp;lt;Marks&amp;gt;92&amp;lt;/Marks&amp;gt;
  &amp;lt;/Student&amp;gt;
&amp;lt;/Students&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ Pros: Self-descriptive, supports hierarchical data.&lt;br&gt;
⚠️ Cons: Verbose, storage-heavy compared to JSON.&lt;/p&gt;
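&lt;p&gt;Python’s standard xml.etree.ElementTree module can build and parse the same hierarchy; a minimal sketch that constructs two of the students programmatically and reads a value back:&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Build the same Students/Student hierarchy as the example above
root = ET.Element("Students")
for name, reg, subj, marks in [("Alice", 101, "Math", 85),
                               ("Bob", 102, "Physics", 78)]:
    s = ET.SubElement(root, "Student")
    ET.SubElement(s, "Name").text = name
    ET.SubElement(s, "Register_No").text = str(reg)
    ET.SubElement(s, "Subject").text = subj
    ET.SubElement(s, "Marks").text = str(marks)

# Serialize, then parse back and pull out one value by path
doc = ET.tostring(root, encoding="unicode")
parsed = ET.fromstring(doc)
print(parsed.find("Student/Name").text)  # Alice
```

&lt;p&gt;As with CSV, element text comes back as strings; the verbosity is the price of the self-describing tags.&lt;/p&gt;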

&lt;p&gt;6️⃣ Avro (Row-based Storage Format)&lt;/p&gt;

&lt;p&gt;Avro is a row-based binary format developed by the Apache Software Foundation. It’s great for data serialization and works well with Kafka &amp;amp; Hadoop.&lt;/p&gt;

&lt;p&gt;Avro requires both data and a schema.&lt;/p&gt;

&lt;p&gt;Schema (students.avsc):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "Register_No", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Example in Python (to generate a .avro file):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from fastavro import writer

# Define schema
schema = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Register_No", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"}
    ]
}

# Data
records = [
    {"Name": "Alice", "Register_No": 101, "Subject": "Math", "Marks": 85},
    {"Name": "Bob", "Register_No": 102, "Subject": "Physics", "Marks": 78},
    {"Name": "Charlie", "Register_No": 103, "Subject": "Chemistry", "Marks": 92}
]

# Save to Avro file
with open("students.avro", "wb") as out:
    writer(out, schema, records)

print("✅ Avro file saved as students.avro")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⚡ Like Parquet, the Avro file is binary and not human-readable.&lt;/p&gt;

&lt;p&gt;✅ Pros: Schema-based, efficient row storage, great for streaming.&lt;br&gt;
⚠️ Cons: Needs schema + libraries to parse.&lt;/p&gt;

</description>
      <category>data</category>
    </item>
    <item>
      <title>“MongoDB: The Netflix Data Saga”</title>
      <dc:creator>LAVANYA PRIYA</dc:creator>
      <pubDate>Mon, 25 Aug 2025 17:14:40 +0000</pubDate>
      <link>https://forem.com/lavanya_priya_3c9225a7a5b/mongodb-the-netflix-data-saga-3plc</link>
      <guid>https://forem.com/lavanya_priya_3c9225a7a5b/mongodb-the-netflix-data-saga-3plc</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Episode 1: Data Never Sleeps&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tonight’s episode takes us deep into the world of Netflix titles, but not through binge-watching. Instead, we’re stepping behind the scenes — into the database where all the magic begins.&lt;/p&gt;

&lt;p&gt;Our tool of choice? MongoDB.&lt;br&gt;
Our mission? To clean, query, and command the data — just like the mastermind of a thrilling heist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 1: The Pilot — Setting Up MongoDB&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every great show starts with a setup.&lt;br&gt;
We installed MongoDB, spun up our database called netflixDB, and created a collection called titles.&lt;br&gt;
Think of it as the stage where our actors (movies &amp;amp; TV shows) perform.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp0ko2pn7tbpowrm4eih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp0ko2pn7tbpowrm4eih.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 2: Enter the Cast — Insert Records&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Just like introducing characters in the first season, we manually inserted 10 Netflix titles into our titles collection. Each came with its attributes: title, country, release_year, description, and even a ratingValue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.titles.insertMany([
  { "show_id": "s1", "title": "The Irishman", "country": "United States", "ratingValue": 4.7 },
  { "show_id": "s2", "title": "Sacred Games", "country": "India", "ratingValue": 4.5 },
  ...
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpn6xpeavp9w2bwo295w7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpn6xpeavp9w2bwo295w7.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 3: The Drama — Who Rules the Ratings?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every series has its critics, and in our dataset, ratings decide who takes the throne.&lt;br&gt;
We asked MongoDB: “Show us the top 5 Netflix titles with the highest average rating.”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.titles.aggregate([
  { $group: { _id: "$title", avgRating: { $avg: "$ratingValue" } } },
  { $sort: { avgRating: -1 } },
  { $limit: 5 }
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm08t94mu4axx7o2ukhl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm08t94mu4axx7o2ukhl.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And just like in a finale twist, Inception and Dangal battled for the top spot!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 4: The Mystery of the Word “Good”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What if we only looked at descriptions with the word good?&lt;br&gt;
MongoDB, our detective, quickly scanned all storylines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.titles.countDocuments({
  description: { $regex: "good", $options: "i" }
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuruekb8fnax7k8vk99l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuruekb8fnax7k8vk99l.png" alt=" " width="800" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 5: Stories from India&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every drama has its backdrop.&lt;br&gt;
We filtered our data to see only those titles that originated in India:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.titles.find({ country: "India" })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F246g80pkeuwrazufbecr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F246g80pkeuwrazufbecr.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 6: Plot Twists — Update &amp;amp; Delete&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Just like characters evolve, so does data.&lt;br&gt;
We updated one record (s3) with a refreshed description:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.titles.updateOne(
  { show_id: "s3" },
  { $set: { description: "Updated review: A very good and insightful Netflix documentary." } }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fionc18orwnk0q0kopcc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fionc18orwnk0q0kopcc2.png" alt=" " width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But sometimes, characters leave the show.&lt;br&gt;
So we deleted record s6:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.titles.deleteOne({ show_id: "s6" })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwszmar7i7va472u5zjx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwszmar7i7va472u5zjx.png" alt=" " width="663" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finale: Exporting the Series&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And no season is complete without taking the story global.&lt;br&gt;
We exported our data to JSON and CSV formats for safekeeping and further analysis.&lt;/p&gt;
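&lt;p&gt;A sketch of the export commands using MongoDB’s mongoexport tool (the field list is an assumption about the full document shape; adjust it to match your collection):&lt;/p&gt;

```shell
# Export the whole collection as JSON (one document per line)
mongoexport --db=netflixDB --collection=titles --out=titles.json

# CSV export requires an explicit field list
mongoexport --db=netflixDB --collection=titles --type=csv \
  --fields=show_id,title,country,release_year,ratingValue \
  --out=titles.csv
```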

&lt;p&gt;&lt;strong&gt;Closing Credits 🎬&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By the end of this binge-worthy database journey, we:&lt;br&gt;
✅ Inserted Netflix titles&lt;br&gt;
✅ Found top 5 highest-rated shows&lt;br&gt;
✅ Counted descriptions with “good”&lt;br&gt;
✅ Filtered titles by country&lt;br&gt;
✅ Updated and deleted records&lt;br&gt;
✅ Exported query results&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡 Why it matters:&lt;/strong&gt;&lt;br&gt;
This exercise mirrors real-world data engineering tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inserting raw data,&lt;/li&gt;
&lt;li&gt;performing aggregations,&lt;/li&gt;
&lt;li&gt;filtering for insights,&lt;/li&gt;
&lt;li&gt;maintaining data quality,&lt;/li&gt;
&lt;li&gt;and finally exporting results for downstream use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;And just like any great show, this is only Season 1.&lt;br&gt;
Stay tuned for more adventures in MongoDB &amp;amp; Data Engineering.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>mongodb</category>
      <category>database</category>
      <category>learningjourney</category>
    </item>
  </channel>
</rss>
