<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Akhilesh Pratap Shahi</title>
    <description>The latest articles on Forem by Akhilesh Pratap Shahi (@shahiakhilesh1304).</description>
    <link>https://forem.com/shahiakhilesh1304</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F999477%2Fd8ef8d99-e47d-4dee-b0d0-2d080c9fff93.jpeg</url>
      <title>Forem: Akhilesh Pratap Shahi</title>
      <link>https://forem.com/shahiakhilesh1304</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/shahiakhilesh1304"/>
    <language>en</language>
    <item>
      <title>Apache Spark Installation</title>
      <dc:creator>Akhilesh Pratap Shahi</dc:creator>
      <pubDate>Mon, 19 Jan 2026 09:34:50 +0000</pubDate>
      <link>https://forem.com/shahiakhilesh1304/apache-spark-installation-bh0</link>
      <guid>https://forem.com/shahiakhilesh1304/apache-spark-installation-bh0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjaz0l232v418zskjrze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjaz0l232v418zskjrze.png" alt="Spark Modules" width="684" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every step matters when you are learning something new. With a new technology, setting up the environment first is what lets you practice effectively; we call it a "baby step".&lt;br&gt;
Installing Apache Spark involves a few key steps: ensuring the prerequisites are installed, then downloading, extracting, and configuring the Spark binaries for your operating system.&lt;br&gt;
Apache Spark runs on the Java Virtual Machine (JVM), so the Java Development Kit (JDK) is a requirement. Hey, don't panic! Installing the JDK doesn't mean you have to code in Java. It simply provides the JVM, the runtime environment Spark needs to execute its tasks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Java
&lt;/h3&gt;

&lt;p&gt;Spark requires &lt;code&gt;Java 8 or 11&lt;/code&gt;; make sure your system has one of these installed.&lt;br&gt;&lt;br&gt;
My suggestion is to go with &lt;code&gt;Java 8&lt;/code&gt; because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hadoop &lt;code&gt;3.0.x&lt;/code&gt; and &lt;code&gt;3.2.x&lt;/code&gt; only support Java 8&lt;/li&gt;
&lt;li&gt;Hadoop &lt;code&gt;3.3+&lt;/code&gt; supports &lt;code&gt;Java 8 and 11 (runtime only)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check Java version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If not installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;openjdk-8-jdk &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="c"&gt;# Change according to your choice of version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Minimum System Requirement
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OS (Operating System):&lt;/strong&gt; Ubuntu 20.04 or 22.04
&lt;a href="https://ubuntu.com/download" rel="noopener noreferrer"&gt;https://ubuntu.com/download&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM:&lt;/strong&gt; 4GB (Recommended 8GB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk:&lt;/strong&gt; 20GB free space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU:&lt;/strong&gt; 2 cores&lt;/li&gt;
&lt;/ul&gt;
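&lt;p&gt;A quick way to check your machine against these minimums. This is a small sketch using standard Linux tools (&lt;code&gt;/proc/meminfo&lt;/code&gt;, &lt;code&gt;df&lt;/code&gt;, &lt;code&gt;nproc&lt;/code&gt;); compare the output with the numbers listed above.&lt;/p&gt;

```shell
# Print the figures that matter for the minimum requirements above
awk '/MemTotal/ {printf "RAM (GB): %d\n", $2/1024/1024}' /proc/meminfo  # total RAM
df -h / | awk 'NR==2 {print "Free disk on /: " $4}'                      # free space on root
echo "CPU cores: $(nproc)"                                               # core count
```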

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;To create this setup on Windows, first create a Linux environment using WSL (Windows Subsystem for Linux).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you want an installation guide for macOS, drop a comment and I will follow up with a Mac setup.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3. Python (For PySpark)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Required: &lt;strong&gt;Python 3.7+&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
pip3 &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If not installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;python3 python3-pip &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. SSH (Mandatory for Hadoop)
&lt;/h3&gt;

&lt;p&gt;Hadoop daemons require &lt;strong&gt;passwordless SSH&lt;/strong&gt;, even on a single machine.&lt;/p&gt;

&lt;p&gt;Check SSH:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If not installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;openssh-server &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Linux Utilities (Required)
&lt;/h3&gt;

&lt;p&gt;Install basic tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
wget &lt;span class="se"&gt;\&lt;/span&gt;
curl &lt;span class="se"&gt;\&lt;/span&gt;
rsync &lt;span class="se"&gt;\&lt;/span&gt;
vim &lt;span class="se"&gt;\&lt;/span&gt;
nano &lt;span class="se"&gt;\&lt;/span&gt;
net-tools &lt;span class="se"&gt;\&lt;/span&gt;
procps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;rsync&lt;/code&gt;: Hadoop file sync&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;procps&lt;/code&gt;: &lt;code&gt;jps&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;net-tools&lt;/code&gt;: network checks&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Environment Variables
&lt;/h3&gt;

&lt;p&gt;(We will sort this out together below.)&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Browser (For UIs)
&lt;/h3&gt;

&lt;p&gt;Any browser of your choice will work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;chrome&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;safari&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;firefox&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8. Permissions
&lt;/h3&gt;

&lt;p&gt;You must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have &lt;code&gt;sudo&lt;/code&gt; access&lt;/li&gt;
&lt;li&gt;Be able to write to &lt;code&gt;/opt&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Run the commands below. If all of them pass, we are ready to move forward with the installation of Hadoop + Spark.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-version&lt;/span&gt;
python3 &lt;span class="nt"&gt;--version&lt;/span&gt;
ssh localhost
&lt;span class="nb"&gt;sudo ls&lt;/span&gt; /opt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;LET'S START WITH WHAT WE ARE ACTUALLY HERE FOR&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;STEP 1&lt;/strong&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Install &amp;amp; Configure Hadoop (Single Node Cluster)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Setup Passwordless SSH (Mandatory)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What's the use of this?&lt;/strong&gt;&lt;br&gt;
Hadoop uses SSH to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start Daemons&lt;/li&gt;
&lt;li&gt;Stop Daemons&lt;/li&gt;
&lt;li&gt;Manage Nodes (even localhost)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;strong&gt;"Even on one machine Hadoop behaves like a cluster"&lt;/strong&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a very common question, and you might be thinking the same: why passwordless, if we can just use a password? Well, you can type a password, but what about your machine? Don't mind me here, but your machine is dumb, dumber than you think; it cannot sit at a prompt and type a password. The daemons are automated processes, so we hand them passwordless access instead of expecting them to answer a login prompt.&lt;br&gt;
We have to generate the SSH keys and allow localhost login:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh-keygen &lt;span class="nt"&gt;-t&lt;/span&gt; rsa &lt;span class="nt"&gt;-P&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; ~/.ssh/id_rsa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command creates two keys: &lt;code&gt;id_rsa&lt;/code&gt; (private key) and &lt;code&gt;id_rsa.pub&lt;/code&gt; (public key).&lt;br&gt;
Now we have to authorize it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ~/.ssh/id_rsa.pub &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.ssh/authorized_keys
&lt;span class="nb"&gt;chmod &lt;/span&gt;600 ~/.ssh/authorized_keys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are done with this part; now test whether it logs in without a password. If it does, voila, our &lt;code&gt;SSH&lt;/code&gt; layer is ready to go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. Download Hadoop
&lt;/h3&gt;

&lt;p&gt;We will be downloading &lt;code&gt;hadoop&lt;/code&gt;. When I say Hadoop, don't take it as a simple program; it is a set of Java services, which includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HDFS&lt;/li&gt;
&lt;li&gt;YARN&lt;/li&gt;
&lt;li&gt;MapReduce (runtime)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will keep all third-party software under &lt;code&gt;/opt&lt;/code&gt; to keep our system clean and sorted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /opt
&lt;span class="nb"&gt;sudo &lt;/span&gt;wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
&lt;span class="nb"&gt;sudo tar&lt;/span&gt; &lt;span class="nt"&gt;-xvzf&lt;/span&gt; hadoop-3.3.6.tar.gz
&lt;span class="nb"&gt;sudo mv &lt;/span&gt;hadoop-3.3.6 hadoop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;(&lt;a href="https://downloads.apache.org/hadoop/common/" rel="noopener noreferrer"&gt;https://downloads.apache.org/hadoop/common/&lt;/a&gt;) lists every available Hadoop version. Pick whichever suits your work. If you want the current stable version, go inside the stable folder, copy the hadoop-x.x.x.tar.gz path, and use it in the above command.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Change the ownership:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; &lt;span class="nv"&gt;$USER&lt;/span&gt;:&lt;span class="nv"&gt;$USER&lt;/span&gt; /opt/hadoop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  3. Environment Variable
&lt;/h3&gt;

&lt;p&gt;As I told you before, your machine is dumb, so it won't find Hadoop's binaries and config files on its own. We set environment variables to tell Linux:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where Hadoop is Installed&lt;/li&gt;
&lt;li&gt;Where its Binary File lives&lt;/li&gt;
&lt;li&gt;Where config files live&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Make sure you have VS Code (or any IDE you like) installed; it will make managing things easier going forward. My preference is VS Code, so that is what I will suggest.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HADOOP_HOME&lt;/td&gt;
&lt;td&gt;Hadoop root directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HADOOP_CONF_DIR&lt;/td&gt;
&lt;td&gt;Hadoop XML configuration files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PATH&lt;/td&gt;
&lt;td&gt;Run &lt;code&gt;hadoop&lt;/code&gt;, &lt;code&gt;hdfs&lt;/code&gt;, &lt;code&gt;yarn&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here you will find a file named &lt;code&gt;.bashrc&lt;/code&gt;. This file contains environment variables and commands that should run at shell startup.&lt;/p&gt;

&lt;p&gt;Open &lt;code&gt;.bashrc&lt;/code&gt; in VSCode.&lt;/p&gt;

&lt;p&gt;At the end of the file, add the lines below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/opt/hadoop
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_CONF_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;/etc/hadoop
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PATH&lt;/span&gt;:&lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;/bin:&lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;/sbin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
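&lt;p&gt;If you prefer to script this step, here is a minimal sketch of an idempotent alternative to hand-editing. &lt;code&gt;add_hadoop_env&lt;/code&gt; is a hypothetical helper name; it appends the same three exports only if they are not already present, so re-running it never duplicates them.&lt;/p&gt;

```shell
# Append the Hadoop exports to a shell rc file, but only once (idempotent)
add_hadoop_env() {
  local rc="$1"
  # skip if the marker line is already there
  grep -q 'HADOOP_HOME=/opt/hadoop' "$rc" 2>/dev/null && return 0
  cat >> "$rc" <<'EOF'
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF
}
# usage: add_hadoop_env ~/.bashrc
```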



&lt;p&gt;Save the file and run the below command on terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; ~/.bashrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we have to check whether Hadoop is wired properly. If it is, we have completed another step successfully.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hadoop version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  4. Hadoop Configuration Files
&lt;/h3&gt;

&lt;p&gt;Hadoop behaves exactly how we tell it to behave, within its limitations. And to tell Hadoop how to behave we don't use a broom like mom used to; we give it a manual, provided through XML files.&lt;br&gt;
These files are found at &lt;code&gt;$HADOOP_CONF_DIR&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  4.1. core-site.xml
&lt;/h4&gt;

&lt;p&gt;This controls the filesystem abstraction and sets the default filesystem URI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0"?&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Default filesystem --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;fs.defaultFS&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;hdfs://localhost:9000&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Temporary directory used by Hadoop --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;hadoop.tmp.dir&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;/opt/hadoop/tmp&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the contents of &lt;code&gt;core-site.xml&lt;/code&gt; with the configuration I have provided above; do not forget to back up what was in the default core-site.xml file first.&lt;/p&gt;
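&lt;p&gt;One way to take that backup for all the config files at once; &lt;code&gt;backup_confs&lt;/code&gt; is a hypothetical helper name, and in real use you would pass it &lt;code&gt;$HADOOP_CONF_DIR&lt;/code&gt;.&lt;/p&gt;

```shell
# Copy every *-site.xml in a config directory to a .bak file before editing
backup_confs() {
  for f in "$1"/*-site.xml; do
    [ -e "$f" ] && cp "$f" "$f.bak"
  done
}
# usage: backup_confs "$HADOOP_CONF_DIR"
```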

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What this change actually means: any file operation defaults to HDFS, the NameNode runs on &lt;code&gt;localhost&lt;/code&gt;, and the HDFS RPC port is set to 9000.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every &lt;code&gt;hdfs dfs&lt;/code&gt; command uses this URI&lt;/li&gt;
&lt;li&gt;Spark also reads this when accessing HDFS&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Create the temp directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /opt/hadoop/tmp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4.2. hdfs-site.xml
&lt;/h4&gt;

&lt;p&gt;This controls HDFS replication, metadata storage, and block storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0"?&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Single node replication --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;dfs.replication&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;1&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- NameNode metadata storage --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;dfs.namenode.name.dir&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;file:///opt/hadoop/data/namenode&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- DataNode block storage --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;dfs.datanode.data.dir&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;file:///opt/hadoop/data/datanode&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Enable Web UI --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;dfs.webhdfs.enabled&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;true&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the contents of &lt;code&gt;hdfs-site.xml&lt;/code&gt; with the configuration I have provided above; do not forget to back up what was in the default hdfs-site.xml file first.&lt;br&gt;
We create these directories ourselves because Hadoop will not create them for us.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /opt/hadoop/data/namenode
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /opt/hadoop/data/datanode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4.3. mapred-site.xml
&lt;/h4&gt;

&lt;p&gt;This controls the MapReduce execution engine (needed for YARN + Spark).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0"?&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Run MapReduce on YARN --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;mapreduce.framework.name&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;yarn&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- MapReduce job history server address --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;mapreduce.jobhistory.address&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;localhost:10020&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;mapreduce.jobhistory.webapp.address&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;localhost:19888&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;When we run Spark on &lt;code&gt;YARN&lt;/code&gt; it reuses the &lt;code&gt;MapReduce&lt;/code&gt; shuffle service, so this configuration is mandatory even if you never run MR jobs.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  4.4. yarn-site.xml
&lt;/h4&gt;

&lt;p&gt;This controls YARN resource management and container execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0"?&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Enable shuffle service --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;yarn.nodemanager.aux-services&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;mapreduce_shuffle&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- ResourceManager hostname --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;yarn.resourcemanager.hostname&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;localhost&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Memory allocation (adjust to your RAM) --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;yarn.nodemanager.resource.memory-mb&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;4096&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- CPU allocation --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;yarn.nodemanager.resource.cpu-vcores&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;2&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Minimum container memory --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;yarn.scheduler.minimum-allocation-mb&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;512&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Maximum container memory --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;yarn.scheduler.maximum-allocation-mb&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;4096&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Spark executors are YARN containers, and these limits decide the executor size.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
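&lt;p&gt;To see what those limits mean for Spark, here is a small sketch that checks whether a requested executor fits in a YARN container. It assumes Spark's default overhead rule (the larger of 10% of executor memory or 384 MB, per &lt;code&gt;spark.executor.memoryOverhead&lt;/code&gt;); verify that rule against your Spark version. &lt;code&gt;fits_in_container&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```shell
# Does executor memory + overhead fit under yarn.scheduler.maximum-allocation-mb?
fits_in_container() {
  local executor_mb="$1" max_alloc_mb="$2"
  local overhead_mb=$(( executor_mb / 10 ))
  [ "$overhead_mb" -lt 384 ] && overhead_mb=384      # overhead floor of 384 MB
  [ $(( executor_mb + overhead_mb )) -le "$max_alloc_mb" ]
}
fits_in_container 2048 4096 && echo "a 2g executor fits" || echo "too large"
# prints "a 2g executor fits" (2048 + 384 = 2432 <= 4096)
```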

&lt;h4&gt;
  
  
  4.5. hadoop-env.sh
&lt;/h4&gt;

&lt;p&gt;This tells Hadoop which Java to use; without it, the Hadoop daemons will fail to start. Set &lt;code&gt;JAVA_HOME&lt;/code&gt; to match your installed JDK (for example &lt;code&gt;java-8-openjdk-amd64&lt;/code&gt; if you followed the Java 8 suggestion above).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;JAVA_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/lib/jvm/java-11-openjdk-amd64
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_HEAPSIZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1024
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4.6. Slaves (Workers)
&lt;/h4&gt;

&lt;p&gt;Previously this file was known as &lt;code&gt;slaves&lt;/code&gt;; we got civilized and now call the same thing &lt;code&gt;workers&lt;/code&gt;. It tells Hadoop where the DataNode and NodeManager will run.&lt;/p&gt;

&lt;p&gt;Go to the Hadoop config directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="nv"&gt;$HADOOP_CONF_DIR&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open the workers file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nano workers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure it contains exactly this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it doesn't, edit it accordingly and then save.&lt;/p&gt;

&lt;h4&gt;
  
  
  4.7. Verify Configuration
&lt;/h4&gt;

&lt;p&gt;This step ensures that Hadoop is properly initialized and actually running, not just configured on disk.&lt;/p&gt;

&lt;p&gt;Format the NameNode (first-time task only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hdfs namenode &lt;span class="nt"&gt;-format&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates the metadata and namespace; without formatting, HDFS cannot start. Make sure you do this only once in the lifetime of a Hadoop installation: formatting again later will delete the HDFS metadata.&lt;/p&gt;
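&lt;p&gt;A small guard against formatting twice, if you want to script it. The first format creates &lt;code&gt;current/VERSION&lt;/code&gt; under &lt;code&gt;dfs.namenode.name.dir&lt;/code&gt; (the path from hdfs-site.xml above); &lt;code&gt;already_formatted&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```shell
# Return success if a NameNode metadata directory has already been formatted
already_formatted() {
  [ -f "$1/current/VERSION" ]
}
# usage: already_formatted /opt/hadoop/data/namenode || hdfs namenode -format
```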

&lt;p&gt;Now this is done; Hadoop is installed and we will start the services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;start-dfs.sh
start-yarn.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start the &lt;code&gt;HDFS Daemons&lt;/code&gt; (storage level) and &lt;code&gt;YARN Daemons&lt;/code&gt; (resource level)&lt;/p&gt;

&lt;p&gt;Verify the running daemons:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command should return something like below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mkaku6u79o8b1kmyv61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mkaku6u79o8b1kmyv61.png" alt="jps Output" width="334" height="151"&gt;&lt;/a&gt;&lt;/p&gt;
jps Output



&lt;p&gt;If you see this, it means all the Hadoop JVM processes are alive. If anything is missing, Hadoop is not fully up; in that case, retrace your steps.&lt;/p&gt;
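&lt;p&gt;When a daemon is missing, its log usually says why. Hadoop writes one log file per daemon under &lt;code&gt;$HADOOP_HOME/logs&lt;/code&gt;; this sketch (with a hypothetical helper &lt;code&gt;latest_log&lt;/code&gt;) tails the most recently modified one.&lt;/p&gt;

```shell
# Print the path of the newest *.log file in a directory
latest_log() {
  ls -t "$1"/*.log 2>/dev/null | head -n 1
}
# usage: tail -n 50 "$(latest_log /opt/hadoop/logs)"
```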




&lt;h3&gt;
  
  
  5. Web Interface
&lt;/h3&gt;

&lt;p&gt;These show the real-time cluster state.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;URL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NameNode UI&lt;/td&gt;
&lt;td&gt;&lt;a href="http://localhost:9870" rel="noopener noreferrer"&gt;http://localhost:9870&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YARN UI&lt;/td&gt;
&lt;td&gt;&lt;a href="http://localhost:8088" rel="noopener noreferrer"&gt;http://localhost:8088&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43ldhp87a36npnxvcx6p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43ldhp87a36npnxvcx6p.png" alt="Name Node UI" width="800" height="653"&gt;&lt;/a&gt;&lt;/p&gt;
Name Node UI



&lt;p&gt;We can also browse the HDFS directories from &lt;code&gt;utilities &amp;gt; browse the file system&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc71dac78m3t17wwda40k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc71dac78m3t17wwda40k.png" alt="HDFS file system" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;
HDFS file system



&lt;p&gt;If both pages open, Hadoop is running correctly.&lt;/p&gt;
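&lt;p&gt;As a scripted alternative to opening the browser, a small Python probe can check whether the UI ports answer. Ports 9870 and 8088 are the defaults shown above; change them if you remapped the UIs.&lt;/p&gt;

```python
import socket

def ui_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Default single-node UI ports (adjust if you remapped them).
for name, port in [("NameNode UI", 9870), ("YARN UI", 8088)]:
    status = "up" if ui_reachable("localhost", port) else "down"
    print(f"{name} on port {port}: {status}")
```

&lt;p&gt;A TCP connect only proves something is listening; the browser check above remains the definitive test.&lt;/p&gt;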

&lt;h3&gt;
  
  
  6. Confirmation
&lt;/h3&gt;

&lt;p&gt;Confirmation is important: we must ensure &lt;code&gt;HDFS&lt;/code&gt; and &lt;code&gt;YARN&lt;/code&gt; work and that the &lt;code&gt;daemons&lt;/code&gt; are healthy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hdfs dfs &lt;span class="nt"&gt;-mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /user/&lt;span class="nv"&gt;$USER&lt;/span&gt;
hdfs dfs &lt;span class="nt"&gt;-ls&lt;/span&gt; /user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this works properly, it confirms that the Hadoop layer is stable.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;STEP 2&lt;/strong&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Installing Spark
&lt;/h2&gt;

&lt;p&gt;What we will be doing here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installing Spark&lt;/li&gt;
&lt;li&gt;Telling Spark where the Hadoop configuration lives&lt;/li&gt;
&lt;li&gt;Making Spark submit jobs to YARN&lt;/li&gt;
&lt;li&gt;Enabling Spark to read/write HDFS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this setup Spark always depends on YARN for resources; it does not run its own standalone cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Download Spark (Hadoop Compatible)
&lt;/h3&gt;

&lt;p&gt;We will download a pre-built Spark binary that already includes the Hadoop integration libraries. Spark relies internally on the Hadoop FileSystem API to talk to HDFS and on the YARN client APIs to request containers. If Spark is not built against Hadoop, it cannot read from or write to HDFS, nor submit applications to YARN.&lt;/p&gt;

&lt;p&gt;(Download index with a link for each version: &lt;a href="https://downloads.apache.org/spark/" rel="noopener noreferrer"&gt;https://downloads.apache.org/spark/&lt;/a&gt;)&lt;br&gt;
Copy the link for the folder shown in the picture below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieu0inhrzwfjcuqlvcfe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieu0inhrzwfjcuqlvcfe.png" alt="Spark Directory Suitable With Hadoop" width="500" height="37"&gt;&lt;/a&gt;&lt;/p&gt;
Spark Directory Suitable With Hadoop



&lt;p&gt;Choose the Hadoop-compatible Spark &lt;code&gt;tgz&lt;/code&gt; file to proceed with the installation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /opt
&lt;span class="nb"&gt;sudo &lt;/span&gt;wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extract:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo tar&lt;/span&gt; &lt;span class="nt"&gt;-xvzf&lt;/span&gt; spark-3.5.0-bin-hadoop3.tgz
&lt;span class="nb"&gt;sudo mv &lt;/span&gt;spark-3.5.0-bin-hadoop3 spark
&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; &lt;span class="nv"&gt;$USER&lt;/span&gt;:&lt;span class="nv"&gt;$USER&lt;/span&gt; /opt/spark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Hadoop-aware Spark binaries are now on our machine, but not yet connected to the cluster.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Setting Up Spark Environment Variables
&lt;/h3&gt;

&lt;p&gt;This step tells our machine where Spark is installed, where the Spark commands live, and which Python interpreter Spark should use. Linux doesn't automatically know about software installed in &lt;code&gt;/opt&lt;/code&gt;, so setting the variables below makes the system aware of the Spark installation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SPARK_HOME&lt;/code&gt;: Spark root directory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PATH&lt;/code&gt;: where &lt;code&gt;spark-shell&lt;/code&gt;, &lt;code&gt;spark-submit&lt;/code&gt;, &lt;code&gt;pyspark&lt;/code&gt; live&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PYSPARK_PYTHON&lt;/code&gt;: avoids Python version mismatches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures the commands can run from anywhere and that PySpark consistently uses &lt;code&gt;python3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Open &lt;code&gt;.bashrc&lt;/code&gt; in VS Code.&lt;br&gt;
Add the following lines at the end of &lt;code&gt;.bashrc&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Spark&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SPARK_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/opt/spark
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PATH&lt;/span&gt;:&lt;span class="nv"&gt;$SPARK_HOME&lt;/span&gt;/bin:&lt;span class="nv"&gt;$SPARK_HOME&lt;/span&gt;/sbin
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PYSPARK_PYTHON&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;python3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the &lt;code&gt;.bashrc&lt;/code&gt; file and run the command below in the terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; ~/.bashrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;spark-shell &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this prints the Spark version, congratulations: Spark is successfully installed.&lt;/p&gt;
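&lt;p&gt;For a scripted sanity check of the three variables, here is a small Python sketch. The expected values mirror the &lt;code&gt;.bashrc&lt;/code&gt; snippet above; adjust them if your paths differ.&lt;/p&gt;

```python
import os

def check_spark_env(env):
    """Return a list of problems with the Spark environment variables."""
    problems = []
    if "SPARK_HOME" not in env:
        problems.append("SPARK_HOME is not set")
    else:
        home = env["SPARK_HOME"]
        entries = env.get("PATH", "").split(":")
        for sub in ("/bin", "/sbin"):
            if home + sub not in entries:
                problems.append(home + sub + " missing from PATH")
    if env.get("PYSPARK_PYTHON") != "python3":
        problems.append("PYSPARK_PYTHON should be python3")
    return problems

# Check the live shell environment; prints nothing when all is well.
for problem in check_spark_env(os.environ):
    print("WARN:", problem)
```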

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The &lt;code&gt;bin-hadoop3&lt;/code&gt; build contains the Hadoop client libraries.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  3. Configure Spark to use Hadoop &amp;amp; YARN
&lt;/h3&gt;

&lt;p&gt;This is one of the most crucial parts, &lt;code&gt;do not miss it&lt;/code&gt;. It explicitly connects Spark to the Hadoop cluster.&lt;br&gt;
Use the commands below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="nv"&gt;$SPARK_HOME&lt;/span&gt;/conf
&lt;span class="nb"&gt;cp &lt;/span&gt;spark-env.sh.template spark-env.sh
nano spark-env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the lines below to the &lt;code&gt;spark-env.sh&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;JAVA_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/lib/jvm/java-11-openjdk-amd64
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_CONF_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/opt/hadoop/etc/hadoop
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;YARN_CONF_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/opt/hadoop/etc/hadoop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Spark does not auto-discover Hadoop. These settings tell Spark which Java runtime to use and where the HDFS and YARN configurations live. If we miss this, Spark won't be able to locate the NameNode, the ResourceManager, or HDFS paths.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4. Prepare HDFS for Spark Execution
&lt;/h3&gt;

&lt;p&gt;We will be creating the required directories.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hdfs dfs &lt;span class="nt"&gt;-mkdir&lt;/span&gt; /spark
hdfs dfs &lt;span class="nt"&gt;-mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /user/&lt;span class="nv"&gt;$USER&lt;/span&gt;
hdfs dfs &lt;span class="nt"&gt;-chmod&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; 777 /spark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;When running Spark on YARN, Spark uploads JARs and configs to HDFS, uses HDFS for application staging, and writes logs and metadata under &lt;code&gt;/user/&amp;lt;username&amp;gt;&lt;/code&gt;. If these directories are not available, it throws a runtime failure, not a startup error.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  5. Run Spark Using YARN (Validation)
&lt;/h3&gt;

&lt;p&gt;Here we will run Spark using Hadoop's resource manager.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Python&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pyspark &lt;span class="nt"&gt;--master&lt;/span&gt; yarn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rdd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parallelize&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rdd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit Pyspark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Scala&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;spark-shell &lt;span class="nt"&gt;--master&lt;/span&gt; yarn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;sc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;parallelize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;_&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit Spark Scala:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;:quit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  If this runs perfectly, Apache Spark is set up and we are ready to practice the code.
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;YARN log aggregation is intentionally skipped here.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;It will be covered later when we discuss debugging Spark jobs.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>spark</category>
      <category>installation</category>
      <category>java</category>
      <category>python</category>
    </item>
    <item>
      <title>The Ultimate Data Engineering Roadmap: From Beginner to Pro</title>
      <dc:creator>Akhilesh Pratap Shahi</dc:creator>
      <pubDate>Sun, 10 Nov 2024 22:58:07 +0000</pubDate>
      <link>https://forem.com/shahiakhilesh1304/the-ultimate-data-engineering-roadmap-from-beginner-to-pro-21nf</link>
      <guid>https://forem.com/shahiakhilesh1304/the-ultimate-data-engineering-roadmap-from-beginner-to-pro-21nf</guid>
      <description>&lt;p&gt;🎉 &lt;strong&gt;Data Engineering Roadmap: From Newbie to Data Dynamo!&lt;/strong&gt; 🌐&lt;br&gt;&lt;br&gt;
Data engineering is the backbone of today’s data-driven world. From designing data pipelines to wrangling big data, data engineers make sure data is accessible, reliable, and ready to power insights. If you’re thinking about diving into this field, this roadmap will guide you from rookie to rockstar, covering essential skills, tools, and some project ideas to get you going.&lt;/p&gt;

&lt;p&gt;Today, data is everywhere — overflowing from our apps, devices, websites, and yes, even our smart fridges. But data alone is a bit like buried treasure; valuable, sure, but only if you know how to dig it up. That’s where data engineers come in! Imagine if every time a company wanted feedback on a product, they had to survey a million people by hand. Or if every click on a site just disappeared into the digital void. Data engineers save the day by managing, organizing, and optimizing data pipelines so businesses can know exactly what’s happening in real time. They’re the superheroes without capes, but probably with a trusty hoodie and coffee mug. ☕  &lt;/p&gt;

&lt;p&gt;So, why consider data engineering? For starters, demand is sky-high — companies know data is their goldmine, and they need skilled pros to dig it up. Data engineering is one of the fastest-growing jobs in tech, with excellent pay, strong growth prospects, and the satisfaction of knowing you’re the backbone of decision-making and innovation.  &lt;/p&gt;

&lt;p&gt;But it’s more than just job security. Data engineering is the perfect blend of creativity and logic, with challenges that keep you on your toes. Whether it’s setting up a database that can handle billions of records or designing a pipeline that pulls in data from around the world in seconds, data engineers are at the forefront of cool tech.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctwzfhjhsc7lqn8vcniu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctwzfhjhsc7lqn8vcniu.jpg" alt="Road Map Diagram" width="800" height="1928"&gt;&lt;/a&gt;&lt;/p&gt;
Keep this roadmap handy! It’ll help you stay updated with market demands, understand what’s in demand, and guide you on what to learn next and when. 📈



&lt;p&gt;If you’re excited about tech, data, and a bit of organized chaos, data engineering could be your calling. Let this guide be your step-by-step roadmap to go from beginner to data engineering pro, with the skills, tools, and hands-on projects that’ll make you job-ready and set for a thrilling career in this fast-paced field.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Understand the Role of a Data Engineer 🕶️&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before you roll up your sleeves, let’s get clear on what data engineers actually do (hint: it’s a LOT more than staring at a screen full of code). Here’s your quick “Data Engineer Starter Pack”:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Responsibilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build Data Pipelines:&lt;/strong&gt; Think of these as conveyor belts for data, moving it smoothly from one place to another.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ETL Magic:&lt;/strong&gt; Extract, Transform, Load (or “Every Time Late” — kidding!) processes that prep data for analysis.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Quality &amp;amp; Governance:&lt;/strong&gt; Making sure data is accurate, clean, and not full of mysterious empty values.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Solutions:&lt;/strong&gt; Picking the right data warehouses, lakes, or… “lakehouses”? Yep, that’s a thing now. 🏠💧
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimization:&lt;/strong&gt; If your data is moving like a turtle, you’re doing it wrong. Data engineers are the speed champions.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaboration:&lt;/strong&gt; You’ll be the bridge between data science, business, and engineering teams. Social skills + tech skills = data engineer gold.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Nail Down the Basics 📚&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you’re new to this, don’t worry — everyone starts here! Let’s talk about the building blocks. And yes, there will be homework (projects) later! 📝&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databases (They’re Everywhere!) 🗄️&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL Databases:&lt;/strong&gt; Start with SQL for relational data. Practice in MySQL or PostgreSQL. If you can’t remember, just think “SQL” stands for “Super Quick Learner” (okay, not really).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NoSQL Databases:&lt;/strong&gt; For semi-structured data, dabble with MongoDB or Cassandra. You’ll want to handle unstructured data, too!
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph &amp;amp; Time-series Databases:&lt;/strong&gt; For when your data has lots of relationships or time-specific values, tools like Neo4j and InfluxDB are amazing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Warehouses and Modeling 🏛️&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn the difference between Star Schemas and Snowflake Schemas (hint: one is simpler, the other is more detailed).
&lt;/li&gt;
&lt;li&gt;Master the ETL Process: Imagine you’re Marie Kondo for data — organize, clean, and prepare it to spark joy for your analysts. ✨&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Big Data Tech 🚂&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Big data isn’t just big, it’s also messy. Learn to handle it with:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Hadoop&lt;/strong&gt; for storage.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Spark&lt;/strong&gt; for processing — like the jetpack for big data, Spark makes it FLY. 🔥&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Pick Up Key Tools &amp;amp; Technologies 🔧&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Welcome to the “choose your own adventure” part of the roadmap. Data engineering has a LOT of tools, but you can get started with these essentials:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Processing with Apache Spark&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark is like the Batman of data engineering. It’s versatile and saves the day in a lot of situations.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PySpark:&lt;/strong&gt; The Python API for Spark, making it easier to work with large datasets. (Python + Spark awesomeness.)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark SQL:&lt;/strong&gt; A module for querying structured data in Spark. (SQL-like data manipulation.)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark MLlib:&lt;/strong&gt; For machine learning in Spark.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark Streaming:&lt;/strong&gt; Enables real-time data processing.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mastering Spark allows you to handle large datasets, a crucial skill in big data environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud Platforms (AWS, Azure)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Everything’s moving to the cloud! Learn the essentials on either platform (or both if you’re ambitious):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS:&lt;/strong&gt; Start with S3 (storage), Redshift (warehouse), Glue (ETL), and EMR (processing).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure:&lt;/strong&gt; Try out Azure Data Lake, Synapse Analytics, and Azure Databricks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AWS:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon S3:&lt;/strong&gt; Object storage, commonly used for data lakes.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Redshift:&lt;/strong&gt; Data warehousing solution optimized for analytics.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Glue:&lt;/strong&gt; Serverless ETL service.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon EMR:&lt;/strong&gt; Managed Hadoop and Spark clusters for big data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Azure:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure Data Lake Storage:&lt;/strong&gt; Optimized for big data storage.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Synapse Analytics:&lt;/strong&gt; Combines data warehousing, big data, and data integration.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Databricks:&lt;/strong&gt; Managed Spark service for collaborative work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having hands-on experience with both platforms will make you adaptable and increase job opportunities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databricks for Big Data and Machine Learning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s Spark, but with a cool notebook-style interface. Perfect for collaborative big data work:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Collaborative Notebooks:&lt;/strong&gt; For developing ETL workflows and machine learning models.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta Lake:&lt;/strong&gt; Adds reliability to data lakes with ACID transactions and schema enforcement.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLflow:&lt;/strong&gt; Manages the machine learning lifecycle, from experimentation to deployment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mastering Databricks will help you run scalable data processing and machine learning workflows in a collaborative environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Airflow (Workflow Orchestration)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data pipelines need maintenance, and Airflow helps schedule and monitor tasks. Think of it as a calendar for your data’s journey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version Control with Git&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Git is essential for version control and collaboration, especially in larger projects. Familiarize yourself with branching, merging, and pull requests to streamline teamwork.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Get Your Coding Skills in Shape 💻&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You’re a data engineer — you’ll code more than you might expect. Here’s the lowdown:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🐍 Python Programming&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Python is the backbone for many data engineering tasks. Start with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pandas:&lt;/strong&gt; For data manipulation and analysis (data wrangling).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NumPy:&lt;/strong&gt; For handling multi-dimensional arrays (numerical operations).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PySpark:&lt;/strong&gt; Python API for Spark (big data jobs (because Spark is a big deal!)).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;💻 Shell Scripting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Need to automate something? The command line is your best friend. Basic bash skills will save you HOURS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scala&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you’re working heavily with Spark, Scala is worth learning due to its efficiency in distributed systems and Spark’s native support for Scala.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQL &amp;amp; NoSQL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SQL is critical for structured data, while NoSQL databases (like MongoDB) are useful for unstructured or semi-structured data, making them essential in big data applications.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5: Build Projects to Show Off Your Skills 🎨&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now the fun part — hands-on projects! Pick one (or all) of these and show the world your skills:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ETL Pipeline with APIs:&lt;/strong&gt; Pull data from an API, transform it, load it somewhere cool. Imagine turning Twitter data into a table of “tweets worth reading.”
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Warehouse Schema Design:&lt;/strong&gt; Build a schema for an imaginary e-commerce business. Show off your Star and Snowflake schemas!
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Data Processing:&lt;/strong&gt; Combine Kafka and Spark Streaming for a real-time project, like a stock price tracker or live sports analytics.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Data Workflows:&lt;/strong&gt; Use Airflow to automate an ETL process, so you can sleep while data does the heavy lifting.&lt;/li&gt;
&lt;/ul&gt;
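&lt;p&gt;To make the first project idea concrete, here is a toy ETL transform in plain Python. The field names and records are made up for illustration; a real pipeline would extract from a live API rather than an inline string.&lt;/p&gt;

```python
import json

# Extract: in a real pipeline this JSON would come from an API response.
raw = json.loads('[{"user": "a", "score": "10"}, {"user": "b", "score": "oops"}]')

def transform(records):
    """Keep only records with a numeric score, casting the score to int."""
    clean = []
    for record in records:
        if str(record.get("score", "")).isdigit():
            clean.append({"user": record["user"], "score": int(record["score"])})
    return clean

# Load: here we just keep the cleaned rows in a list; a real pipeline
# would write them to a database or data warehouse.
warehouse = transform(raw)
print(warehouse)
```

&lt;p&gt;Tiny as it is, this is the same extract–transform–load shape you would scale up with Spark or Airflow.&lt;/p&gt;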




&lt;h3&gt;
  
  
  &lt;strong&gt;Step 6: Learn Data Governance &amp;amp; Security 🔒&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As a data engineer, making data accessible but secure is a huge part of your job. Dive into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Quality &amp;amp; Lineage:&lt;/strong&gt; Know where your data comes from and what it’s been through. Trace it like a detective. 🕵️&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security:&lt;/strong&gt; Understand encryption, access control, and other best practices to keep sensitive data protected.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 7: DevOps &amp;amp; Agile for Data Engineers 🚀
&lt;/h3&gt;

&lt;p&gt;Data engineering isn’t just about the tech — you’ll work with teams and need to get data in front of people fast. Embrace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD Pipelines&lt;/strong&gt;: Jenkins and Docker to make sure your code always works, even on Friday afternoons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agile Principles&lt;/strong&gt;: Data teams often work in Agile. Learn Jira for task management and brush up on sprints, stand-ups, and the like.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Step 8: Document and Showcase Your Work
&lt;/h3&gt;

&lt;p&gt;Building a portfolio is crucial for data engineering roles. Host your projects on GitHub, with detailed READMEs and explanations.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Final Countdown: Sum It Up, Data Dynamo! 🎉
&lt;/h3&gt;

&lt;p&gt;Phew! You’ve made it this far, and that’s no small feat. Becoming a data engineer is like assembling a 5,000-piece puzzle… without the picture on the box! 🧩 But trust me, it’s worth every late night, every caffeine-fueled coding session, and every “why won’t this query work?!” moment.&lt;/p&gt;

&lt;p&gt;So, what’s the deal with data engineering? Well, you’re building the backbone of the digital world. You make sure data flows smoothly from point A to point Z (and everywhere in between), ready for the analysts, scientists, and executives to turn it into insights and decisions. You’re the unsung hero, the wizard behind the curtain, duhh… okay, you get the picture. 🧙‍♂️✨&lt;/p&gt;




&lt;h3&gt;
  
  
  What You’ve Learned (and Survived)
&lt;/h3&gt;

&lt;p&gt;From SQL basics to Spark sorcery, every skill you’ve picked up has leveled you up. Now you’re armed with the knowledge of databases, ETL processes, data lakes, cloud tech, and big data frameworks. And that’s no joke! Each of these is a superpower on its own. Here’s what your roadmap has covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SQL Mastery&lt;/strong&gt;: Because knowing how to wrangle data is like knowing the right spell for every situation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Warehouse &amp;amp; Big Data Know-How&lt;/strong&gt;: You’ve learned how to store data, transform it, and make it accessible for analysis at scale. Hello, Hadoop and Spark! 🚀&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ETL and Data Pipelines&lt;/strong&gt;: The art of getting data from here to there, transformed and ready to rock.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Lake Deep Dive&lt;/strong&gt;: Because sometimes, you need to store it all and let the data scientists sort it out later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python and Beyond&lt;/strong&gt;: Coding for data wrangling, automation, and more. Pandas, NumPy, and PySpark are now your BFFs. 🐼🐍&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Tech Mastery&lt;/strong&gt;: From AWS to Azure, you’re building in the cloud, where data engineering lives and breathes these days.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Project-Ready Skills&lt;/strong&gt;: Version control with Git, automation with Airflow, and CI/CD with DevOps practices — you’re equipped to take on real-world projects.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Why This is a Marathon, Not a Sprint 🏃‍♂️☕
&lt;/h3&gt;

&lt;p&gt;Let’s face it: data engineering is no quick certification. It’s a long haul, like assembling IKEA furniture without the instructions (and with a few mystery parts). You’ll need perseverance, curiosity, and yes, a strong tolerance for caffeine.&lt;/p&gt;

&lt;p&gt;The best way to make progress? Start with small steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SQL Basics&lt;/strong&gt; ➡️ then to &lt;strong&gt;Advanced Joins&lt;/strong&gt; ➡️ finally to &lt;strong&gt;Optimization Techniques&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python for Data Wrangling&lt;/strong&gt; ➡️ then to &lt;strong&gt;PySpark&lt;/strong&gt; ➡️ finally to &lt;strong&gt;Big Data Magic&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Design an ETL Pipeline&lt;/strong&gt; ➡️ then to &lt;strong&gt;Data Lake Architecture&lt;/strong&gt; ➡️ eventually to &lt;strong&gt;Orchestrating Complex Pipelines with Airflow&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And remember, it’s okay to make mistakes! Every data engineer has spent countless hours debugging queries, rewriting code, and scratching their head over a missed comma. Mistakes are just part of the process.&lt;/p&gt;




&lt;h3&gt;
  
  
  Here’s What’s Next: Your Data Engineer’s To-Do List 📝
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Get Hands-On&lt;/strong&gt;: Build projects that showcase your skills, whether it’s a small ETL pipeline or a real-time data streaming setup. Trust me, nothing teaches like doing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Explore New Tools&lt;/strong&gt;: The field’s evolving fast! Stay curious about new technologies and trends.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network with Fellow Data Engineers&lt;/strong&gt;: Connect with other data professionals, join meetups, and ask questions. The data community is here to help.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Document Everything&lt;/strong&gt;: Make your GitHub shine. Write READMEs, share your process, and let your future employers see your journey.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  The Final Pep Talk 🌟
&lt;/h3&gt;

&lt;p&gt;Data engineering is tough, but so are you. You’re now equipped with a roadmap to success, and every project you build brings you one step closer to mastery. Embrace the journey, savor those small wins, and don’t let the bugs bring you down.&lt;/p&gt;

&lt;p&gt;So, grab your laptop, your favorite playlist, and a cup of your favorite fuel — you’ve got this. 🚀&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Akhilesh Pratap Shahi&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>computerscience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>SQL MASTERY - P00 (Introduction to SQL and Its Technicality)</title>
      <dc:creator>Akhilesh Pratap Shahi</dc:creator>
      <pubDate>Sun, 17 Mar 2024 15:54:20 +0000</pubDate>
      <link>https://forem.com/shahiakhilesh1304/sql-mastery-p00-introduction-to-sql-and-its-technicality-52h2</link>
      <guid>https://forem.com/shahiakhilesh1304/sql-mastery-p00-introduction-to-sql-and-its-technicality-52h2</guid>
      <description>&lt;p&gt;When we hear the word SQL, the first thing that pops into our head is data. What exactly does SQL mean? It stands for Structured Query Language. Using SQL, we can write queries that let us manipulate data in almost mystical ways. Data is the fuel of today's market, and anyone who knows how to work with it holds real power to shape it.&lt;/p&gt;

&lt;p&gt;Umm... Data! Now the question comes up: what actually is "DATA"? Data itself is divided into categories: structured, semi-structured, and unstructured. Structured data is organized into a specific format, like tables in a database, with clearly defined fields. Semi-structured data doesn't have a rigid structure, but it may carry some organizational elements, like tags or keys, as in XML or JSON files. Unstructured data has no predefined format and includes things like text documents, images, and videos. By now you know what data is: it's your name, your location, or anything about you that lives on a digital platform. Even your digital footprint is data that can be used to predict your interests.&lt;/p&gt;
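&lt;p&gt;To make the three categories concrete, here is a tiny illustrative sketch in Python (the record and its fields are made up for the example):&lt;/p&gt;

```python
import json

# Structured: a fixed set of named columns, like one row of a database table.
# (Hypothetical columns: name, city.)
structured_row = ("Akhilesh", "Varanasi")

# Semi-structured: a JSON document — it has keys, but no rigid schema.
semi_structured = json.loads(
    '{"name": "Akhilesh", "city": "Varanasi", "interests": ["data", "sql"]}'
)

# Unstructured: free text with no predefined fields at all.
unstructured = "Akhilesh wrote a post about SQL from Varanasi."

print(structured_row[0])        # fields are found by position (column)
print(semi_structured["name"])  # fields are found by key
print("SQL" in unstructured)    # no fields — we can only search the raw text
```

&lt;p&gt;Notice the pattern: the more structure the data has, the more precisely we can ask questions about it — which is exactly what SQL exploits.&lt;/p&gt;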

&lt;p&gt;In this series, we are going to cover structured data; the name "Structured Query Language" suggests as much. Oh! So data is everywhere, but how and where do we store this messy thing? Wherever there is data, a database comes along with it, and vice versa. You might have heard of MySQL, PostgreSQL, Oracle, MongoDB, etc. These are database systems: software applications designed to store, manage, and manipulate data. (Strictly speaking, MongoDB is a NoSQL document database, while the others are relational.) In simple words, a database is a platform where we store lots and lots of data, and a query language like SQL gives us the mechanisms for creating, updating, querying, and administering it. Going further, we will focus on MySQL, as it is one of the most common systems, and later we will cover Google BigQuery for heavy datasets.&lt;/p&gt;

&lt;p&gt;Now you know what you are jumping into. In the next post, the real game begins: first the installation, and then writing our first query to read data.&lt;/p&gt;
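&lt;p&gt;If you want a small taste before the installation post, here is a minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. SQLite is not MySQL, but the SQL itself is nearly identical, and nothing needs to be installed. The table and values are invented for the example:&lt;/p&gt;

```python
import sqlite3

# An in-memory database: nothing to install, perfect for a first experiment.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE: define a table (structured data — fixed, named columns).
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# INSERT: add a couple of rows.
cur.executemany("INSERT INTO users (name) VALUES (?)", [("Ada",), ("Grace",)])

# SELECT: the "first query to read data".
rows = cur.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(rows)  # [(1, 'Ada'), (2, 'Grace')]

conn.close()
```

&lt;p&gt;Those three statements — CREATE, INSERT, SELECT — are the heart of what we will build on throughout this series.&lt;/p&gt;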

</description>
      <category>sql</category>
      <category>learning</category>
      <category>tutorial</category>
      <category>series</category>
    </item>
  </channel>
</rss>
