<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: LAVANYA PRIYA</title>
    <description>The latest articles on Forem by LAVANYA PRIYA (@lavanya_priya_3c9225a7a5b).</description>
    <link>https://forem.com/lavanya_priya_3c9225a7a5b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3458303%2Ff4afc9ff-1566-4008-9afd-ee4476dfc8b7.png</url>
      <title>Forem: LAVANYA PRIYA</title>
      <link>https://forem.com/lavanya_priya_3c9225a7a5b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lavanya_priya_3c9225a7a5b"/>
    <language>en</language>
    <item>
      <title>Understanding 6 Common Data Formats in Cloud &amp; Data Analytics</title>
      <dc:creator>LAVANYA PRIYA</dc:creator>
      <pubDate>Fri, 03 Oct 2025 14:35:12 +0000</pubDate>
      <link>https://forem.com/lavanya_priya_3c9225a7a5b/understanding-6-common-data-formats-in-cloud-data-analytics-1g8o</link>
      <guid>https://forem.com/lavanya_priya_3c9225a7a5b/understanding-6-common-data-formats-in-cloud-data-analytics-1g8o</guid>
      <description>&lt;p&gt;When working with data in the cloud or in analytics pipelines, the way data is stored and exchanged plays a huge role in performance, compatibility, and scalability.&lt;/p&gt;

&lt;p&gt;In this blog, we’ll explore six popular data formats used in analytics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSV (Comma Separated Values)&lt;/li&gt;
&lt;li&gt;SQL (Relational Table Format)&lt;/li&gt;
&lt;li&gt;JSON (JavaScript Object Notation)&lt;/li&gt;
&lt;li&gt;Parquet (Columnar Storage Format)&lt;/li&gt;
&lt;li&gt;XML (Extensible Markup Language)&lt;/li&gt;
&lt;li&gt;Avro (Row-based Storage Format)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll use a small dataset of students (name, register number, subject, marks) and represent it in all six formats.&lt;/p&gt;

&lt;p&gt;🎯 Sample Dataset&lt;/p&gt;

&lt;p&gt;Here’s the dataset we’ll use throughout:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fff64axfxz016zbrrf8ms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fff64axfxz016zbrrf8ms.png" alt="Sample Dataset" width="464" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1️⃣ CSV (Comma Separated Values)&lt;/p&gt;

&lt;p&gt;CSV is the simplest and most widely used format. Data is stored as plain text, with commas separating the values.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alice,101,Math,85
Bob,102,Physics,78
Charlie,103,Chemistry,92
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ Pros: Easy to read, lightweight.&lt;br&gt;
⚠️ Cons: No schema, can get messy with nested/complex data.&lt;/p&gt;
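&lt;p&gt;The snippet above can be parsed with Python’s built-in csv module; a minimal sketch (the column names are an assumption, since the file itself carries no header row):&lt;/p&gt;

```python
import csv
import io

# The same three rows as the example above (no header row)
raw = """Alice,101,Math,85
Bob,102,Physics,78
Charlie,103,Chemistry,92
"""

# Assumed column names -- CSV itself carries no schema
fields = ["Name", "Register_No", "Subject", "Marks"]

rows = [dict(zip(fields, row)) for row in csv.reader(io.StringIO(raw))]

print(rows[0]["Name"])   # Alice
print(rows[2]["Marks"])  # 92 -- but still a string: CSV has no types
```

&lt;p&gt;Note that every value comes back as a string; converting Marks to an integer is the reader’s job, which is exactly the “no schema” drawback in practice.&lt;/p&gt;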

&lt;p&gt;2️⃣ SQL (Relational Table Format)&lt;/p&gt;

&lt;p&gt;Relational databases store data in tables with rows and columns; the schema is defined and rows are inserted using SQL statements.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE Students (
    Name VARCHAR(50),
    Register_No INT,
    Subject VARCHAR(50),
    Marks INT
);

INSERT INTO Students (Name, Register_No, Subject, Marks) VALUES
('Alice', 101, 'Math', 85),
('Bob', 102, 'Physics', 78),
('Charlie', 103, 'Chemistry', 92);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ Pros: Schema enforcement, supports queries.&lt;br&gt;
⚠️ Cons: Not ideal for unstructured or semi-structured data.&lt;/p&gt;
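&lt;p&gt;The same CREATE TABLE and INSERT statements can be tried without installing a server, using SQLite, which ships with Python. A minimal sketch:&lt;/p&gt;

```python
import sqlite3

# In-memory SQLite database -- a lightweight stand-in for any SQL engine
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Students (
        Name VARCHAR(50),
        Register_No INT,
        Subject VARCHAR(50),
        Marks INT
    )
""")
conn.executemany(
    "INSERT INTO Students (Name, Register_No, Subject, Marks) VALUES (?, ?, ?, ?)",
    [("Alice", 101, "Math", 85),
     ("Bob", 102, "Physics", 78),
     ("Charlie", 103, "Chemistry", 92)],
)

# Schema enforcement pays off: typed columns plus declarative queries
top = conn.execute(
    "SELECT Name, Marks FROM Students ORDER BY Marks DESC LIMIT 1"
).fetchone()
print(top)  # ('Charlie', 92)
```

&lt;p&gt;The same statements work on any relational engine; only the connection setup changes.&lt;/p&gt;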

&lt;p&gt;3️⃣ JSON (JavaScript Object Notation)&lt;/p&gt;

&lt;p&gt;JSON represents data in a structured key-value pair format, often used in APIs and NoSQL databases.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
  {
    "Name": "Alice",
    "Register_No": 101,
    "Subject": "Math",
    "Marks": 85
  },
  {
    "Name": "Bob",
    "Register_No": 102,
    "Subject": "Physics",
    "Marks": 78
  },
  {
    "Name": "Charlie",
    "Register_No": 103,
    "Subject": "Chemistry",
    "Marks": 92
  }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ Pros: Great for APIs, supports nested data.&lt;br&gt;
⚠️ Cons: Not space-efficient compared to Parquet/Avro.&lt;/p&gt;
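&lt;p&gt;Serializing and parsing JSON in Python is a one-call round trip with the standard json module; a minimal sketch using the same records:&lt;/p&gt;

```python
import json

# The same records as the JSON example above
students = [
    {"Name": "Alice", "Register_No": 101, "Subject": "Math", "Marks": 85},
    {"Name": "Bob", "Register_No": 102, "Subject": "Physics", "Marks": 78},
    {"Name": "Charlie", "Register_No": 103, "Subject": "Chemistry", "Marks": 92},
]

# Serialize to a JSON string, then parse it back -- a lossless round trip
payload = json.dumps(students, indent=2)
decoded = json.loads(payload)

print(decoded[1]["Subject"])  # Physics
```

&lt;p&gt;Unlike CSV, numbers survive the round trip as integers, which is why JSON is the default interchange format for APIs.&lt;/p&gt;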

&lt;p&gt;4️⃣ Parquet (Columnar Storage Format)&lt;/p&gt;

&lt;p&gt;Parquet is a binary, columnar format designed for analytics (popular with Hadoop, Spark, AWS Athena, and BigQuery). Because data is stored column by column, queries that touch only a few columns read far less data and run faster.&lt;/p&gt;

&lt;p&gt;Example in Python (to generate a .parquet file):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Sample dataset
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Register_No": [101, 102, 103],
    "Subject": ["Math", "Physics", "Chemistry"],
    "Marks": [85, 78, 92]
}

df = pd.DataFrame(data)
df.to_parquet("students.parquet", engine="pyarrow", index=False)

print("✅ Parquet file saved as students.parquet")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⚡ Binary file (not human-readable). If you open it, you’ll see compressed binary data.&lt;/p&gt;

&lt;p&gt;✅ Pros: Efficient for large datasets, great compression, optimized for analytics.&lt;br&gt;
⚠️ Cons: Not human-readable, needs libraries to parse.&lt;/p&gt;

&lt;p&gt;5️⃣ XML (Extensible Markup Language)&lt;/p&gt;

&lt;p&gt;XML stores data in a tag-based hierarchical structure. It is verbose but still used in enterprise systems.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;Students&amp;gt;
  &amp;lt;Student&amp;gt;
    &amp;lt;Name&amp;gt;Alice&amp;lt;/Name&amp;gt;
    &amp;lt;Register_No&amp;gt;101&amp;lt;/Register_No&amp;gt;
    &amp;lt;Subject&amp;gt;Math&amp;lt;/Subject&amp;gt;
    &amp;lt;Marks&amp;gt;85&amp;lt;/Marks&amp;gt;
  &amp;lt;/Student&amp;gt;
  &amp;lt;Student&amp;gt;
    &amp;lt;Name&amp;gt;Bob&amp;lt;/Name&amp;gt;
    &amp;lt;Register_No&amp;gt;102&amp;lt;/Register_No&amp;gt;
    &amp;lt;Subject&amp;gt;Physics&amp;lt;/Subject&amp;gt;
    &amp;lt;Marks&amp;gt;78&amp;lt;/Marks&amp;gt;
  &amp;lt;/Student&amp;gt;
  &amp;lt;Student&amp;gt;
    &amp;lt;Name&amp;gt;Charlie&amp;lt;/Name&amp;gt;
    &amp;lt;Register_No&amp;gt;103&amp;lt;/Register_No&amp;gt;
    &amp;lt;Subject&amp;gt;Chemistry&amp;lt;/Subject&amp;gt;
    &amp;lt;Marks&amp;gt;92&amp;lt;/Marks&amp;gt;
  &amp;lt;/Student&amp;gt;
&amp;lt;/Students&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✅ Pros: Self-descriptive, supports hierarchical data.&lt;br&gt;
⚠️ Cons: Verbose, storage-heavy compared to JSON.&lt;/p&gt;
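&lt;p&gt;Python’s standard xml.etree.ElementTree module can build and parse the same hierarchy; a minimal sketch that constructs two of the students programmatically and reads a value back:&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Build the same Students/Student hierarchy as the example above
root = ET.Element("Students")
for name, reg, subj, marks in [("Alice", 101, "Math", 85),
                               ("Bob", 102, "Physics", 78)]:
    s = ET.SubElement(root, "Student")
    ET.SubElement(s, "Name").text = name
    ET.SubElement(s, "Register_No").text = str(reg)
    ET.SubElement(s, "Subject").text = subj
    ET.SubElement(s, "Marks").text = str(marks)

# Serialize, then parse back and pull out one value by path
doc = ET.tostring(root, encoding="unicode")
parsed = ET.fromstring(doc)
print(parsed.find("Student/Name").text)  # Alice
```

&lt;p&gt;As with CSV, element text comes back as strings; the verbosity is the price of the self-describing tags.&lt;/p&gt;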

&lt;p&gt;6️⃣ Avro (Row-based Storage Format)&lt;/p&gt;

&lt;p&gt;Avro is a row-based binary format developed by the Apache Software Foundation. It’s great for data serialization and works well with Kafka &amp;amp; Hadoop.&lt;/p&gt;

&lt;p&gt;Avro requires both data and a schema.&lt;/p&gt;

&lt;p&gt;Schema (students.avsc):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "Register_No", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Example in Python (to generate a .avro file):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from fastavro import writer

# Define schema
schema = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Register_No", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"}
    ]
}

# Data
records = [
    {"Name": "Alice", "Register_No": 101, "Subject": "Math", "Marks": 85},
    {"Name": "Bob", "Register_No": 102, "Subject": "Physics", "Marks": 78},
    {"Name": "Charlie", "Register_No": 103, "Subject": "Chemistry", "Marks": 92}
]

# Save to Avro file
with open("students.avro", "wb") as out:
    writer(out, schema, records)

print("✅ Avro file saved as students.avro")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⚡ Like Parquet, the Avro file is binary and not human-readable.&lt;/p&gt;

&lt;p&gt;✅ Pros: Schema-based, efficient row storage, great for streaming.&lt;br&gt;
⚠️ Cons: Needs schema + libraries to parse.&lt;/p&gt;

</description>
      <category>data</category>
    </item>
    <item>
      <title>“MongoDB: The Netflix Data Saga”</title>
      <dc:creator>LAVANYA PRIYA</dc:creator>
      <pubDate>Mon, 25 Aug 2025 17:14:40 +0000</pubDate>
      <link>https://forem.com/lavanya_priya_3c9225a7a5b/mongodb-the-netflix-data-saga-3plc</link>
      <guid>https://forem.com/lavanya_priya_3c9225a7a5b/mongodb-the-netflix-data-saga-3plc</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Episode 1: Data Never Sleeps&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tonight’s episode takes us deep into the world of Netflix titles, but not through binge-watching. Instead, we’re stepping behind the scenes — into the database where all the magic begins.&lt;/p&gt;

&lt;p&gt;Our tool of choice? MongoDB.&lt;br&gt;
Our mission? To clean, query, and command the data — just like the mastermind of a thrilling heist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 1: The Pilot — Setting Up MongoDB&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every great show starts with a setup.&lt;br&gt;
We installed MongoDB, spun up our database called netflixDB, and created a collection called titles.&lt;br&gt;
Think of it as the stage where our actors (movies &amp;amp; TV shows) perform.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp0ko2pn7tbpowrm4eih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp0ko2pn7tbpowrm4eih.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 2: Enter the Cast — Insert Records&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Just like introducing characters in the first season, we manually inserted 10 Netflix titles into our titles collection. Each came with its attributes: title, country, release_year, description, and even a ratingValue.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.titles.insertMany([
  { "show_id": "s1", "title": "The Irishman", "country": "United States", "ratingValue": 4.7 },
  { "show_id": "s2", "title": "Sacred Games", "country": "India", "ratingValue": 4.5 },
  ...
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpn6xpeavp9w2bwo295w7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpn6xpeavp9w2bwo295w7.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 3: The Drama — Who Rules the Ratings?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every series has its critics, and in our dataset, ratings decide who takes the throne.&lt;br&gt;
We asked MongoDB: “Show us the top 5 Netflix titles with the highest average rating.”&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.titles.aggregate([
  { $group: { _id: "$title", avgRating: { $avg: "$ratingValue" } } },
  { $sort: { avgRating: -1 } },
  { $limit: 5 }
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm08t94mu4axx7o2ukhl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm08t94mu4axx7o2ukhl.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And just like in a finale twist, Inception and Dangal battled for the top spot!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 4: The Mystery of the Word “Good”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What if we only looked at descriptions with the word good?&lt;br&gt;
MongoDB, our detective, quickly scanned all storylines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.titles.countDocuments({
  description: { $regex: "good", $options: "i" }
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuruekb8fnax7k8vk99l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuruekb8fnax7k8vk99l.png" alt=" " width="800" height="148"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 5: Stories from India&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every drama has its backdrop.&lt;br&gt;
We filtered our data to see only those titles that originated in India:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.titles.find({ country: "India" })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F246g80pkeuwrazufbecr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F246g80pkeuwrazufbecr.png" alt=" " width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act 6: Plot Twists — Update &amp;amp; Delete&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Just like characters evolve, so does data.&lt;br&gt;
We updated one record (s3) with a refreshed description:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.titles.updateOne(
  { show_id: "s3" },
  { $set: { description: "Updated review: A very good and insightful Netflix documentary." } }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fionc18orwnk0q0kopcc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fionc18orwnk0q0kopcc2.png" alt=" " width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But sometimes, characters leave the show.&lt;br&gt;
So we deleted record s6:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.titles.deleteOne({ show_id: "s6" })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwszmar7i7va472u5zjx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcwszmar7i7va472u5zjx.png" alt=" " width="663" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finale: Exporting the Series&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And no season is complete without taking the story global.&lt;br&gt;
We exported our data to JSON and CSV formats for safekeeping and further analysis.&lt;/p&gt;
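&lt;p&gt;A sketch of the export commands using MongoDB’s mongoexport tool (the field list is an assumption about the full document shape; adjust it to match your collection):&lt;/p&gt;

```shell
# Export the whole collection as JSON (one document per line)
mongoexport --db=netflixDB --collection=titles --out=titles.json

# CSV export requires an explicit field list
mongoexport --db=netflixDB --collection=titles --type=csv \
  --fields=show_id,title,country,release_year,ratingValue \
  --out=titles.csv
```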

&lt;p&gt;&lt;strong&gt;Closing Credits 🎬&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By the end of this binge-worthy database journey, we:&lt;br&gt;
✅ Inserted Netflix titles&lt;br&gt;
✅ Found top 5 highest-rated shows&lt;br&gt;
✅ Counted descriptions with “good”&lt;br&gt;
✅ Filtered titles by country&lt;br&gt;
✅ Updated and deleted records&lt;br&gt;
✅ Exported query results&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡 Why it matters:&lt;/strong&gt;&lt;br&gt;
This exercise mirrors real-world data engineering tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inserting raw data,&lt;/li&gt;
&lt;li&gt;performing aggregations,&lt;/li&gt;
&lt;li&gt;filtering for insights,&lt;/li&gt;
&lt;li&gt;maintaining data quality,&lt;/li&gt;
&lt;li&gt;and finally exporting results for downstream use.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;And just like any great show, this is only Season 1.&lt;br&gt;
Stay tuned for more adventures in MongoDB &amp;amp; Data Engineering.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>mongodb</category>
      <category>database</category>
      <category>learningjourney</category>
    </item>
  </channel>
</rss>
