<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Akhilesh Pratap Shahi</title>
    <description>The latest articles on Forem by Akhilesh Pratap Shahi (@shahiakhilesh1304).</description>
    <link>https://forem.com/shahiakhilesh1304</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F999477%2Fd8ef8d99-e47d-4dee-b0d0-2d080c9fff93.jpeg</url>
      <title>Forem: Akhilesh Pratap Shahi</title>
      <link>https://forem.com/shahiakhilesh1304</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/shahiakhilesh1304"/>
    <language>en</language>
    <item>
      <title>Apache Spark Installation</title>
      <dc:creator>Akhilesh Pratap Shahi</dc:creator>
      <pubDate>Mon, 19 Jan 2026 09:34:50 +0000</pubDate>
      <link>https://forem.com/shahiakhilesh1304/apache-spark-installation-bh0</link>
      <guid>https://forem.com/shahiakhilesh1304/apache-spark-installation-bh0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjaz0l232v418zskjrze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbjaz0l232v418zskjrze.png" alt="Spark Modules" width="684" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every step matters when you are learning something new. With a new technology, setting up the environment first is what lets you practice effectively; we call it a "baby step".&lt;br&gt;
Installing Apache Spark involves a few key steps: ensuring the prerequisites are installed, then downloading, extracting, and configuring the Spark binaries for your operating system.&lt;br&gt;
Apache Spark runs on the Java Virtual Machine (JVM), so the Java Development Kit (JDK) is a requirement. Hey, don't panic! Installing the JDK doesn't mean you have to code in Java. It simply provides the JVM, the runtime environment Spark needs to execute its tasks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Java
&lt;/h3&gt;

&lt;p&gt;Spark requires &lt;code&gt;Java 8 or 11&lt;/code&gt;; make sure your system has one of these installed.&lt;br&gt;&lt;br&gt;
My suggestion is to go with &lt;code&gt;Java 8&lt;/code&gt; because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hadoop &lt;code&gt;3.0.x&lt;/code&gt; and &lt;code&gt;3.2.x&lt;/code&gt; only support Java 8&lt;/li&gt;
&lt;li&gt;Hadoop &lt;code&gt;3.3+&lt;/code&gt; supports &lt;code&gt;Java 8 and 11 (runtime only)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check Java version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If not installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;openjdk-8-jdk &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="c"&gt;# Change according to your choice of version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Minimum System Requirement
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OS (Operating System):&lt;/strong&gt; Ubuntu 20.04 or 22.04
&lt;a href="https://ubuntu.com/download" rel="noopener noreferrer"&gt;https://ubuntu.com/download&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM:&lt;/strong&gt; 4GB (Recommended 8GB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk:&lt;/strong&gt; 20GB free space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU:&lt;/strong&gt; 2 cores&lt;/li&gt;
&lt;/ul&gt;
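&lt;p&gt;A quick way to check your machine against these minimums. This is a small sketch using standard Linux tools (&lt;code&gt;/proc/meminfo&lt;/code&gt;, &lt;code&gt;df&lt;/code&gt;, &lt;code&gt;nproc&lt;/code&gt;); compare the output with the numbers listed above.&lt;/p&gt;

```shell
# Print the figures that matter for the minimum requirements above
awk '/MemTotal/ {printf "RAM (GB): %d\n", $2/1024/1024}' /proc/meminfo  # total RAM
df -h / | awk 'NR==2 {print "Free disk on /: " $4}'                      # free space on root
echo "CPU cores: $(nproc)"                                               # core count
```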

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;To create this setup on Windows, first create a Linux environment using WSL (Windows Subsystem for Linux).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you want an installation guide for macOS, drop a comment and I will follow up with a Mac setup.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3. Python (For PySpark)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Required: &lt;strong&gt;Python 3.7+&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
pip3 &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If not installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;python3 python3-pip &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. SSH (Mandatory for Hadoop)
&lt;/h3&gt;

&lt;p&gt;Hadoop daemons require &lt;strong&gt;passwordless SSH&lt;/strong&gt;, even on a single machine.&lt;/p&gt;

&lt;p&gt;Check SSH:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If not installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;openssh-server &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Linux Utilities (Required)
&lt;/h3&gt;

&lt;p&gt;Install basic tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
wget &lt;span class="se"&gt;\&lt;/span&gt;
curl &lt;span class="se"&gt;\&lt;/span&gt;
rsync &lt;span class="se"&gt;\&lt;/span&gt;
vim &lt;span class="se"&gt;\&lt;/span&gt;
nano &lt;span class="se"&gt;\&lt;/span&gt;
net-tools &lt;span class="se"&gt;\&lt;/span&gt;
procps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;rsync&lt;/code&gt;: Hadoop file sync&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;procps&lt;/code&gt;: &lt;code&gt;jps&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;net-tools&lt;/code&gt;: network checks&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Environment Variables
&lt;/h3&gt;

&lt;p&gt;(We will sort this out together below.)&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Browser (For UIs)
&lt;/h3&gt;

&lt;p&gt;Any browser of your choice will work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;chrome&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;safari&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;firefox&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8. Permissions
&lt;/h3&gt;

&lt;p&gt;You must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have &lt;code&gt;sudo&lt;/code&gt; access&lt;/li&gt;
&lt;li&gt;Be able to write to &lt;code&gt;/opt&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Run the commands below. If all of them pass, we are ready to move forward with the installation of Hadoop + Spark.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-version&lt;/span&gt;
python3 &lt;span class="nt"&gt;--version&lt;/span&gt;
ssh localhost
&lt;span class="nb"&gt;sudo ls&lt;/span&gt; /opt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;strong&gt;LET'S START WITH WHAT WE ARE ACTUALLY HERE FOR&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;STEP 1&lt;/strong&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Install &amp;amp; Configure Hadoop (Single Node Cluster)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Setup Passwordless SSH (Mandatory)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What's the use of this?&lt;/strong&gt;&lt;br&gt;
Hadoop uses SSH to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start Daemons&lt;/li&gt;
&lt;li&gt;Stop Daemons&lt;/li&gt;
&lt;li&gt;Manage Nodes (even localhost)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;strong&gt;"Even on one machine Hadoop behaves like a cluster"&lt;/strong&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a very common question, and you might be thinking the same: why passwordless, if we can just use a password? Well, you can type a password, but what about your machine? Don't mind me here, but your machine is dumb, dumber than you think; it cannot sit at a prompt and type a password. The daemons are automated processes, so we hand them passwordless access instead of expecting them to answer a login prompt.&lt;br&gt;
We have to generate the SSH keys and allow localhost login:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh-keygen &lt;span class="nt"&gt;-t&lt;/span&gt; rsa &lt;span class="nt"&gt;-P&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; ~/.ssh/id_rsa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command creates two keys: &lt;code&gt;id_rsa&lt;/code&gt; (private key) and &lt;code&gt;id_rsa.pub&lt;/code&gt; (public key).&lt;br&gt;
Now we have to authorize it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ~/.ssh/id_rsa.pub &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.ssh/authorized_keys
&lt;span class="nb"&gt;chmod &lt;/span&gt;600 ~/.ssh/authorized_keys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are done with this part; now test whether it logs in without a password. If it does, voila, our &lt;code&gt;SSH&lt;/code&gt; layer is ready to go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. Download Hadoop
&lt;/h3&gt;

&lt;p&gt;We will be downloading &lt;code&gt;hadoop&lt;/code&gt;. When I say Hadoop, don't take it as a simple program; it is a set of Java services, which includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HDFS&lt;/li&gt;
&lt;li&gt;YARN&lt;/li&gt;
&lt;li&gt;MapReduce (runtime)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will keep all third-party software under &lt;code&gt;/opt&lt;/code&gt; to keep our system clean and sorted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /opt
&lt;span class="nb"&gt;sudo &lt;/span&gt;wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
&lt;span class="nb"&gt;sudo tar&lt;/span&gt; &lt;span class="nt"&gt;-xvzf&lt;/span&gt; hadoop-3.3.6.tar.gz
&lt;span class="nb"&gt;sudo mv &lt;/span&gt;hadoop-3.3.6 hadoop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;(&lt;a href="https://downloads.apache.org/hadoop/common/" rel="noopener noreferrer"&gt;https://downloads.apache.org/hadoop/common/&lt;/a&gt;) lists every available Hadoop version. Pick whichever suits your work. If you want the current stable version, go inside the stable folder, copy the hadoop-x.x.x.tar.gz path, and use it in the above command.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Change the ownership:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; &lt;span class="nv"&gt;$USER&lt;/span&gt;:&lt;span class="nv"&gt;$USER&lt;/span&gt; /opt/hadoop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  3. Environment Variable
&lt;/h3&gt;

&lt;p&gt;As I told you before, your machine is dumb, so it won't find Hadoop's binaries and config files on its own. We set environment variables to tell Linux:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where Hadoop is Installed&lt;/li&gt;
&lt;li&gt;Where its Binary File lives&lt;/li&gt;
&lt;li&gt;Where config files live&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Make sure you have VS Code (or any IDE you like) installed; it will make managing things easier going forward. My preference is VS Code, so that is what I will suggest.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HADOOP_HOME&lt;/td&gt;
&lt;td&gt;Hadoop root directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HADOOP_CONF_DIR&lt;/td&gt;
&lt;td&gt;Hadoop XML configuration files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PATH&lt;/td&gt;
&lt;td&gt;Run &lt;code&gt;hadoop&lt;/code&gt;, &lt;code&gt;hdfs&lt;/code&gt;, &lt;code&gt;yarn&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here you will find a file named &lt;code&gt;.bashrc&lt;/code&gt;. This file contains environment variables and commands that should run at shell startup.&lt;/p&gt;

&lt;p&gt;Open &lt;code&gt;.bashrc&lt;/code&gt; in VSCode.&lt;/p&gt;

&lt;p&gt;At the end of the file, add the lines below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/opt/hadoop
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_CONF_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;/etc/hadoop
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PATH&lt;/span&gt;:&lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;/bin:&lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;/sbin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
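&lt;p&gt;If you prefer to script this step, here is a minimal sketch of an idempotent alternative to hand-editing. &lt;code&gt;add_hadoop_env&lt;/code&gt; is a hypothetical helper name; it appends the same three exports only if they are not already present, so re-running it never duplicates them.&lt;/p&gt;

```shell
# Append the Hadoop exports to a shell rc file, but only once (idempotent)
add_hadoop_env() {
  local rc="$1"
  # skip if the marker line is already there
  grep -q 'HADOOP_HOME=/opt/hadoop' "$rc" 2>/dev/null && return 0
  cat >> "$rc" <<'EOF'
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF
}
# usage: add_hadoop_env ~/.bashrc
```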



&lt;p&gt;Save the file and run the below command on terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; ~/.bashrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we have to check whether Hadoop is wired properly. If it is, we have completed another step successfully.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hadoop version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  4. Hadoop Configuration Files
&lt;/h3&gt;

&lt;p&gt;Hadoop behaves exactly how we tell it to behave, within its limitations. And to tell Hadoop how to behave we don't use a broom like mom used to; we give it a manual, provided through XML files.&lt;br&gt;
These files are found at &lt;code&gt;$HADOOP_CONF_DIR&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  4.1. core-site.xml
&lt;/h4&gt;

&lt;p&gt;This controls the filesystem abstraction and sets the default filesystem URI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0"?&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Default filesystem --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;fs.defaultFS&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;hdfs://localhost:9000&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Temporary directory used by Hadoop --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;hadoop.tmp.dir&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;/opt/hadoop/tmp&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the contents of &lt;code&gt;core-site.xml&lt;/code&gt; with the configuration I have provided above; do not forget to back up what was in the default core-site.xml file first.&lt;/p&gt;
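&lt;p&gt;One way to take that backup for all the config files at once; &lt;code&gt;backup_confs&lt;/code&gt; is a hypothetical helper name, and in real use you would pass it &lt;code&gt;$HADOOP_CONF_DIR&lt;/code&gt;.&lt;/p&gt;

```shell
# Copy every *-site.xml in a config directory to a .bak file before editing
backup_confs() {
  for f in "$1"/*-site.xml; do
    [ -e "$f" ] && cp "$f" "$f.bak"
  done
}
# usage: backup_confs "$HADOOP_CONF_DIR"
```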

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What this change actually means: any file operation defaults to HDFS, the NameNode runs on &lt;code&gt;localhost&lt;/code&gt;, and the HDFS RPC port is set to 9000.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every &lt;code&gt;hdfs dfs&lt;/code&gt; command uses this URI&lt;/li&gt;
&lt;li&gt;Spark also reads this when accessing HDFS&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Create the temp directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /opt/hadoop/tmp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4.2. hdfs-site.xml
&lt;/h4&gt;

&lt;p&gt;This controls HDFS replication, metadata storage, and block storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0"?&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Single node replication --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;dfs.replication&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;1&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- NameNode metadata storage --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;dfs.namenode.name.dir&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;file:///opt/hadoop/data/namenode&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- DataNode block storage --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;dfs.datanode.data.dir&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;file:///opt/hadoop/data/datanode&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Enable Web UI --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;dfs.webhdfs.enabled&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;true&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace the contents of &lt;code&gt;hdfs-site.xml&lt;/code&gt; with the configuration I have provided above; do not forget to back up what was in the default hdfs-site.xml file first.&lt;br&gt;
We create these directories ourselves because Hadoop will not create them for us.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /opt/hadoop/data/namenode
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /opt/hadoop/data/datanode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4.3. mapred-site.xml
&lt;/h4&gt;

&lt;p&gt;This controls the MapReduce execution engine (needed for YARN + Spark).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0"?&amp;gt;&lt;/span&gt;
&lt;span class="cp"&gt;&amp;lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Run MapReduce on YARN --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;mapreduce.framework.name&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;yarn&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- MapReduce job history server address --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;mapreduce.jobhistory.address&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;localhost:10020&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;mapreduce.jobhistory.webapp.address&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;localhost:19888&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;When we run Spark on &lt;code&gt;YARN&lt;/code&gt; it reuses the &lt;code&gt;MapReduce&lt;/code&gt; shuffle service, so this configuration is mandatory even if you never run MR jobs.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  4.4. yarn-site.xml
&lt;/h4&gt;

&lt;p&gt;This controls YARN resource management and container execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0"?&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Enable shuffle service --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;yarn.nodemanager.aux-services&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;mapreduce_shuffle&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- ResourceManager hostname --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;yarn.resourcemanager.hostname&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;localhost&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Memory allocation (adjust to your RAM) --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;yarn.nodemanager.resource.memory-mb&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;4096&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- CPU allocation --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;yarn.nodemanager.resource.cpu-vcores&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;2&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Minimum container memory --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;yarn.scheduler.minimum-allocation-mb&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;512&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

  &lt;span class="c"&gt;&amp;lt;!-- Maximum container memory --&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;name&amp;gt;&lt;/span&gt;yarn.scheduler.maximum-allocation-mb&lt;span class="nt"&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;value&amp;gt;&lt;/span&gt;4096&lt;span class="nt"&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Spark executors are YARN containers, and these limits decide the executor size.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
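&lt;p&gt;To see what those limits mean for Spark, here is a small sketch that checks whether a requested executor fits in a YARN container. It assumes Spark's default overhead rule (the larger of 10% of executor memory or 384 MB, per &lt;code&gt;spark.executor.memoryOverhead&lt;/code&gt;); verify that rule against your Spark version. &lt;code&gt;fits_in_container&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```shell
# Does executor memory + overhead fit under yarn.scheduler.maximum-allocation-mb?
fits_in_container() {
  local executor_mb="$1" max_alloc_mb="$2"
  local overhead_mb=$(( executor_mb / 10 ))
  [ "$overhead_mb" -lt 384 ] && overhead_mb=384      # overhead floor of 384 MB
  [ $(( executor_mb + overhead_mb )) -le "$max_alloc_mb" ]
}
fits_in_container 2048 4096 && echo "a 2g executor fits" || echo "too large"
# prints "a 2g executor fits" (2048 + 384 = 2432 <= 4096)
```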

&lt;h4&gt;
  
  
  4.5. hadoop-env.sh
&lt;/h4&gt;

&lt;p&gt;This tells Hadoop which Java to use; without it, the Hadoop daemons will fail to start. Set &lt;code&gt;JAVA_HOME&lt;/code&gt; to match your installed JDK (for example &lt;code&gt;java-8-openjdk-amd64&lt;/code&gt; if you followed the Java 8 suggestion above).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;JAVA_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/lib/jvm/java-11-openjdk-amd64
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_HEAPSIZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1024
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4.6. Slaves (Workers)
&lt;/h4&gt;

&lt;p&gt;Previously this file was known as &lt;code&gt;slaves&lt;/code&gt;; we got civilized and now call the same thing &lt;code&gt;workers&lt;/code&gt;. It tells Hadoop where the DataNode and NodeManager will run.&lt;/p&gt;

&lt;p&gt;Go to the Hadoop config directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="nv"&gt;$HADOOP_CONF_DIR&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open the workers file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nano workers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure it contains exactly this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it doesn't, edit it accordingly and then save.&lt;/p&gt;

&lt;h4&gt;
  
  
  4.7. Verify Configuration
&lt;/h4&gt;

&lt;p&gt;This step ensures that Hadoop is properly initialized and actually running, not just configured on disk.&lt;/p&gt;

&lt;p&gt;Format the NameNode (first-time task only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hdfs namenode &lt;span class="nt"&gt;-format&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates the metadata and namespace; without formatting, HDFS cannot start. Make sure you do this only once in the lifetime of a Hadoop installation: formatting again later will delete the HDFS metadata.&lt;/p&gt;
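&lt;p&gt;A small guard against formatting twice, if you want to script it. The first format creates &lt;code&gt;current/VERSION&lt;/code&gt; under &lt;code&gt;dfs.namenode.name.dir&lt;/code&gt; (the path from hdfs-site.xml above); &lt;code&gt;already_formatted&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```shell
# Return success if a NameNode metadata directory has already been formatted
already_formatted() {
  [ -f "$1/current/VERSION" ]
}
# usage: already_formatted /opt/hadoop/data/namenode || hdfs namenode -format
```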

&lt;p&gt;Now this is done; Hadoop is installed and we will start the services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;start-dfs.sh
start-yarn.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will start the &lt;code&gt;HDFS Daemons&lt;/code&gt; (storage level) and &lt;code&gt;YARN Daemons&lt;/code&gt; (resource level)&lt;/p&gt;

&lt;p&gt;Verify the running daemons:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command should return something like below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mkaku6u79o8b1kmyv61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mkaku6u79o8b1kmyv61.png" alt="jps Output" width="334" height="151"&gt;&lt;/a&gt;&lt;/p&gt;
jps Output



&lt;p&gt;If you see this, it means all the Hadoop JVM processes are alive. If anything is missing, Hadoop is not fully up; in that case, retrace your steps.&lt;/p&gt;
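&lt;p&gt;When a daemon is missing, its log usually says why. Hadoop writes one log file per daemon under &lt;code&gt;$HADOOP_HOME/logs&lt;/code&gt;; this sketch (with a hypothetical helper &lt;code&gt;latest_log&lt;/code&gt;) tails the most recently modified one.&lt;/p&gt;

```shell
# Print the path of the newest *.log file in a directory
latest_log() {
  ls -t "$1"/*.log 2>/dev/null | head -n 1
}
# usage: tail -n 50 "$(latest_log /opt/hadoop/logs)"
```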




&lt;h3&gt;
  
  
  5. Web Interface
&lt;/h3&gt;

&lt;p&gt;These show the real-time cluster state.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;URL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;NameNode UI&lt;/td&gt;
&lt;td&gt;&lt;a href="http://localhost:9870" rel="noopener noreferrer"&gt;http://localhost:9870&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;YARN UI&lt;/td&gt;
&lt;td&gt;&lt;a href="http://localhost:8088" rel="noopener noreferrer"&gt;http://localhost:8088&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43ldhp87a36npnxvcx6p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43ldhp87a36npnxvcx6p.png" alt="Name Node UI" width="800" height="653"&gt;&lt;/a&gt;&lt;/p&gt;
Name Node UI



&lt;p&gt;We can also browse the HDFS directories from &lt;code&gt;utilities &amp;gt; browse the file system&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc71dac78m3t17wwda40k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc71dac78m3t17wwda40k.png" alt="HDFS file system" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;
HDFS file system



&lt;p&gt;If both pages open, Hadoop is running correctly.&lt;/p&gt;
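&lt;p&gt;As a scripted alternative to opening the browser, a small Python probe can check whether the UI ports answer. Ports 9870 and 8088 are the defaults shown above; change them if you remapped the UIs.&lt;/p&gt;

```python
import socket

def ui_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Default single-node UI ports (adjust if you remapped them).
for name, port in [("NameNode UI", 9870), ("YARN UI", 8088)]:
    status = "up" if ui_reachable("localhost", port) else "down"
    print(f"{name} on port {port}: {status}")
```

&lt;p&gt;A TCP connect only proves something is listening; the browser check above remains the definitive test.&lt;/p&gt;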

&lt;h3&gt;
  
  
  6. Confirmation
&lt;/h3&gt;

&lt;p&gt;Confirmation is important: we must ensure &lt;code&gt;HDFS&lt;/code&gt; and &lt;code&gt;YARN&lt;/code&gt; work and that the &lt;code&gt;daemons&lt;/code&gt; are healthy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hdfs dfs &lt;span class="nt"&gt;-mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /user/&lt;span class="nv"&gt;$USER&lt;/span&gt;
hdfs dfs &lt;span class="nt"&gt;-ls&lt;/span&gt; /user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this works properly, it confirms that the Hadoop layer is stable.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;STEP 2&lt;/strong&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Installing Spark
&lt;/h2&gt;

&lt;p&gt;What we will be doing here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installing Spark&lt;/li&gt;
&lt;li&gt;Telling Spark where the Hadoop configuration lives&lt;/li&gt;
&lt;li&gt;Making Spark submit jobs to YARN&lt;/li&gt;
&lt;li&gt;Enabling Spark to read/write HDFS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this setup Spark always depends on YARN for resources; it does not run its own standalone cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Download Spark (Hadoop Compatible)
&lt;/h3&gt;

&lt;p&gt;We will download a pre-built Spark binary that already includes the Hadoop integration libraries. Spark relies internally on the Hadoop FileSystem API to talk to HDFS and on the YARN client APIs to request containers. If Spark is not built against Hadoop, it cannot read from or write to HDFS, nor submit applications to YARN.&lt;/p&gt;

&lt;p&gt;(Download index with a link for each version: &lt;a href="https://downloads.apache.org/spark/" rel="noopener noreferrer"&gt;https://downloads.apache.org/spark/&lt;/a&gt;)&lt;br&gt;
Copy the link for the folder shown in the picture below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieu0inhrzwfjcuqlvcfe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieu0inhrzwfjcuqlvcfe.png" alt="Spark Directory Suitable With Hadoop" width="500" height="37"&gt;&lt;/a&gt;&lt;/p&gt;
Spark Directory Suitable With Hadoop



&lt;p&gt;Choose the Hadoop-compatible Spark &lt;code&gt;tgz&lt;/code&gt; file to proceed with the installation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /opt
&lt;span class="nb"&gt;sudo &lt;/span&gt;wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extract:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo tar&lt;/span&gt; &lt;span class="nt"&gt;-xvzf&lt;/span&gt; spark-3.5.0-bin-hadoop3.tgz
&lt;span class="nb"&gt;sudo mv &lt;/span&gt;spark-3.5.0-bin-hadoop3 spark
&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; &lt;span class="nv"&gt;$USER&lt;/span&gt;:&lt;span class="nv"&gt;$USER&lt;/span&gt; /opt/spark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Hadoop-aware Spark binaries are now on our machine, but not yet connected to the cluster.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Setting Up Spark Environment Variables
&lt;/h3&gt;

&lt;p&gt;This step tells our machine where Spark is installed, where the Spark commands live, and which Python interpreter Spark should use. Linux doesn't automatically know about software installed in &lt;code&gt;/opt&lt;/code&gt;, so setting the variables below makes the system aware of the Spark installation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SPARK_HOME&lt;/code&gt;: Spark root directory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PATH&lt;/code&gt;: where &lt;code&gt;spark-shell&lt;/code&gt;, &lt;code&gt;spark-submit&lt;/code&gt;, &lt;code&gt;pyspark&lt;/code&gt; live&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PYSPARK_PYTHON&lt;/code&gt;: avoids Python version mismatches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures the commands can run from anywhere and that PySpark consistently uses &lt;code&gt;python3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Open &lt;code&gt;.bashrc&lt;/code&gt; in VS Code.&lt;br&gt;
Add the following lines at the end of &lt;code&gt;.bashrc&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Spark&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;SPARK_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/opt/spark
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$PATH&lt;/span&gt;:&lt;span class="nv"&gt;$SPARK_HOME&lt;/span&gt;/bin:&lt;span class="nv"&gt;$SPARK_HOME&lt;/span&gt;/sbin
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PYSPARK_PYTHON&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;python3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the &lt;code&gt;.bashrc&lt;/code&gt; file and run the command below in the terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; ~/.bashrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;spark-shell &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this prints the Spark version, congratulations: Spark is successfully installed.&lt;/p&gt;
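&lt;p&gt;For a scripted sanity check of the three variables, here is a small Python sketch. The expected values mirror the &lt;code&gt;.bashrc&lt;/code&gt; snippet above; adjust them if your paths differ.&lt;/p&gt;

```python
import os

def check_spark_env(env):
    """Return a list of problems with the Spark environment variables."""
    problems = []
    if "SPARK_HOME" not in env:
        problems.append("SPARK_HOME is not set")
    else:
        home = env["SPARK_HOME"]
        entries = env.get("PATH", "").split(":")
        for sub in ("/bin", "/sbin"):
            if home + sub not in entries:
                problems.append(home + sub + " missing from PATH")
    if env.get("PYSPARK_PYTHON") != "python3":
        problems.append("PYSPARK_PYTHON should be python3")
    return problems

# Check the live shell environment; prints nothing when all is well.
for problem in check_spark_env(os.environ):
    print("WARN:", problem)
```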

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The &lt;code&gt;bin-hadoop3&lt;/code&gt; build contains the Hadoop client libraries.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  3. Configure Spark to use Hadoop &amp;amp; YARN
&lt;/h3&gt;

&lt;p&gt;This is one of the most crucial parts, &lt;code&gt;do not miss it&lt;/code&gt;. It explicitly connects Spark to the Hadoop cluster.&lt;br&gt;
Use the commands below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="nv"&gt;$SPARK_HOME&lt;/span&gt;/conf
&lt;span class="nb"&gt;cp &lt;/span&gt;spark-env.sh.template spark-env.sh
nano spark-env.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the lines below to the &lt;code&gt;spark-env.sh&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;JAVA_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/lib/jvm/java-11-openjdk-amd64
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_CONF_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/opt/hadoop/etc/hadoop
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;YARN_CONF_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/opt/hadoop/etc/hadoop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Spark does not auto-discover Hadoop. These settings tell Spark which Java runtime to use and where the HDFS and YARN configurations live. If we miss this, Spark won't be able to locate the NameNode, the ResourceManager, or HDFS paths.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4. Prepare HDFS for Spark Execution
&lt;/h3&gt;

&lt;p&gt;We will be creating the required directories.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hdfs dfs &lt;span class="nt"&gt;-mkdir&lt;/span&gt; /spark
hdfs dfs &lt;span class="nt"&gt;-mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /user/&lt;span class="nv"&gt;$USER&lt;/span&gt;
hdfs dfs &lt;span class="nt"&gt;-chmod&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; 777 /spark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;When running Spark on YARN, Spark uploads JARs and configs to HDFS, uses HDFS for application staging, and writes logs and metadata under &lt;code&gt;/user/&amp;lt;username&amp;gt;&lt;/code&gt;. If these directories are not available, it throws a runtime failure, not a startup error.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  5. Run Spark Using YARN (Validation)
&lt;/h3&gt;

&lt;p&gt;Here we will run Spark using Hadoop's resource manager.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Python&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pyspark &lt;span class="nt"&gt;--master&lt;/span&gt; yarn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rdd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parallelize&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rdd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit Pyspark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Scala&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;spark-shell &lt;span class="nt"&gt;--master&lt;/span&gt; yarn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="nv"&gt;sc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;parallelize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;_&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit Spark Scala:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;:quit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  If this runs perfectly, Apache Spark is set up and we are ready to practice the code.
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;YARN log aggregation is intentionally skipped here.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;It will be covered later when we discuss debugging Spark jobs.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>spark</category>
      <category>installation</category>
      <category>java</category>
      <category>python</category>
    </item>
    <item>
      <title>The Ultimate Data Engineering Roadmap: From Beginner to Pro</title>
      <dc:creator>Akhilesh Pratap Shahi</dc:creator>
      <pubDate>Sun, 10 Nov 2024 22:58:07 +0000</pubDate>
      <link>https://forem.com/shahiakhilesh1304/the-ultimate-data-engineering-roadmap-from-beginner-to-pro-21nf</link>
      <guid>https://forem.com/shahiakhilesh1304/the-ultimate-data-engineering-roadmap-from-beginner-to-pro-21nf</guid>
      <description>&lt;p&gt;🎉 &lt;strong&gt;Data Engineering Roadmap: From Newbie to Data Dynamo!&lt;/strong&gt; 🌐&lt;br&gt;&lt;br&gt;
Data engineering is the backbone of today’s data-driven world. From designing data pipelines to wrangling big data, data engineers make sure data is accessible, reliable, and ready to power insights. If you’re thinking about diving into this field, this roadmap will guide you from rookie to rockstar, covering essential skills, tools, and some project ideas to get you going.&lt;/p&gt;

&lt;p&gt;Today, data is everywhere — overflowing from our apps, devices, websites, and yes, even our smart fridges. But data alone is a bit like buried treasure; valuable, sure, but only if you know how to dig it up. That’s where data engineers come in! Imagine if every time a company wanted feedback on a product, they had to survey a million people by hand. Or if every click on a site just disappeared into the digital void. Data engineers save the day by managing, organizing, and optimizing data pipelines so businesses can know exactly what’s happening in real time. They’re the superheroes without capes, but probably with a trusty hoodie and coffee mug. ☕  &lt;/p&gt;

&lt;p&gt;So, why consider data engineering? For starters, demand is sky-high — companies know data is their goldmine, and they need skilled pros to dig it up. Data engineering is one of the fastest-growing jobs in tech, with excellent pay, strong growth prospects, and the satisfaction of knowing you’re the backbone of decision-making and innovation.  &lt;/p&gt;

&lt;p&gt;But it’s more than just job security. Data engineering is the perfect blend of creativity and logic, with challenges that keep you on your toes. Whether it’s setting up a database that can handle billions of records or designing a pipeline that pulls in data from around the world in seconds, data engineers are at the forefront of cool tech.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctwzfhjhsc7lqn8vcniu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctwzfhjhsc7lqn8vcniu.jpg" alt="Road Map Diagram" width="800" height="1928"&gt;&lt;/a&gt;&lt;/p&gt;
Keep this roadmap handy! It’ll help you stay updated with market demands, understand what’s in demand, and guide you on what to learn next and when. 📈



&lt;p&gt;If you’re excited about tech, data, and a bit of organized chaos, data engineering could be your calling. Let this guide be your step-by-step roadmap to go from beginner to data engineering pro, with the skills, tools, and hands-on projects that’ll make you job-ready and set for a thrilling career in this fast-paced field.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Understand the Role of a Data Engineer 🕶️&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before you roll up your sleeves, let’s get clear on what data engineers actually do (hint: it’s a LOT more than staring at a screen full of code). Here’s your quick “Data Engineer Starter Pack”:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Responsibilities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build Data Pipelines:&lt;/strong&gt; Think of these as conveyor belts for data, moving it smoothly from one place to another.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ETL Magic:&lt;/strong&gt; Extract, Transform, Load (or “Every Time Late” — kidding!) processes that prep data for analysis.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Quality &amp;amp; Governance:&lt;/strong&gt; Making sure data is accurate, clean, and not full of mysterious empty values.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage Solutions:&lt;/strong&gt; Picking the right data warehouses, lakes, or… “lakehouses”? Yep, that’s a thing now. 🏠💧
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimization:&lt;/strong&gt; If your data is moving like a turtle, you’re doing it wrong. Data engineers are the speed champions.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaboration:&lt;/strong&gt; You’ll be the bridge between data science, business, and engineering teams. Social skills + tech skills = data engineer gold.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Nail Down the Basics 📚&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you’re new to this, don’t worry — everyone starts here! Let’s talk about the building blocks. And yes, there will be homework (projects) later! 📝&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databases (They’re Everywhere!) 🗄️&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL Databases:&lt;/strong&gt; Start with SQL for relational data. Practice in MySQL or PostgreSQL. If you can’t remember, just think “SQL” stands for “Super Quick Learner” (okay, not really).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NoSQL Databases:&lt;/strong&gt; For semi-structured data, dabble with MongoDB or Cassandra. You’ll want to handle unstructured data, too!
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph &amp;amp; Time-series Databases:&lt;/strong&gt; For when your data has lots of relationships or time-specific values, tools like Neo4j and InfluxDB are amazing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Warehouses and Modeling 🏛️&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn the difference between Star Schemas and Snowflake Schemas (hint: one is simpler, the other is more detailed).
&lt;/li&gt;
&lt;li&gt;Master the ETL Process: Imagine you’re Marie Kondo for data — organize, clean, and prepare it to spark joy for your analysts. ✨&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Big Data Tech 🚂&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Big data isn’t just big, it’s also messy. Learn to handle it with:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Hadoop&lt;/strong&gt; for storage.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Spark&lt;/strong&gt; for processing — like the jetpack for big data, Spark makes it FLY. 🔥&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Pick Up Key Tools &amp;amp; Technologies 🔧&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Welcome to the “choose your own adventure” part of the roadmap. Data engineering has a LOT of tools, but you can get started with these essentials:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Processing with Apache Spark&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark is like the Batman of data engineering. It’s versatile and saves the day in a lot of situations.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PySpark:&lt;/strong&gt; The Python API for Spark, making it easier to work with large datasets. (Python + Spark awesomeness.)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark SQL:&lt;/strong&gt; A module for querying structured data in Spark. (SQL-like data manipulation.)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark MLlib:&lt;/strong&gt; For machine learning in Spark.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark Streaming:&lt;/strong&gt; Enables real-time data processing.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mastering Spark allows you to handle large datasets, a crucial skill in big data environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud Platforms (AWS, Azure)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Everything’s moving to the cloud! Learn the essentials on either platform (or both if you’re ambitious):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS:&lt;/strong&gt; Start with S3 (storage), Redshift (warehouse), Glue (ETL), and EMR (processing).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure:&lt;/strong&gt; Try out Azure Data Lake, Synapse Analytics, and Azure Databricks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AWS:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon S3:&lt;/strong&gt; Object storage, commonly used for data lakes.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Redshift:&lt;/strong&gt; Data warehousing solution optimized for analytics.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Glue:&lt;/strong&gt; Serverless ETL service.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon EMR:&lt;/strong&gt; Managed Hadoop and Spark clusters for big data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Azure:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure Data Lake Storage:&lt;/strong&gt; Optimized for big data storage.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Synapse Analytics:&lt;/strong&gt; Combines data warehousing, big data, and data integration.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Databricks:&lt;/strong&gt; Managed Spark service for collaborative work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having hands-on experience with both platforms will make you adaptable and increase job opportunities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databricks for Big Data and Machine Learning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s Spark, but with a cool notebook-style interface. Perfect for collaborative big data work:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Collaborative Notebooks:&lt;/strong&gt; For developing ETL workflows and machine learning models.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta Lake:&lt;/strong&gt; Adds reliability to data lakes with ACID transactions and schema enforcement.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLflow:&lt;/strong&gt; Manages the machine learning lifecycle, from experimentation to deployment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mastering Databricks will help you run scalable data processing and machine learning workflows in a collaborative environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Airflow (Workflow Orchestration)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data pipelines need maintenance, and Airflow helps schedule and monitor tasks. Think of it as a calendar for your data’s journey.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version Control with Git&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Git is essential for version control and collaboration, especially in larger projects. Familiarize yourself with branching, merging, and pull requests to streamline teamwork.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Get Your Coding Skills in Shape 💻&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You’re a data engineer — you’ll code more than you might expect. Here’s the lowdown:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🐍 Python Programming&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Python is the backbone for many data engineering tasks. Start with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pandas:&lt;/strong&gt; For data manipulation and analysis (data wrangling).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NumPy:&lt;/strong&gt; For handling multi-dimensional arrays (numerical operations).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PySpark:&lt;/strong&gt; Python API for Spark (big data jobs (because Spark is a big deal!)).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;💻 Shell Scripting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Need to automate something? The command line is your best friend. Basic bash skills will save you HOURS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scala&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you’re working heavily with Spark, Scala is worth learning due to its efficiency in distributed systems and Spark’s native support for Scala.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQL &amp;amp; NoSQL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SQL is critical for structured data, while NoSQL databases (like MongoDB) are useful for unstructured or semi-structured data, making them essential in big data applications.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5: Build Projects to Show Off Your Skills 🎨&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now the fun part — hands-on projects! Pick one (or all) of these and show the world your skills:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ETL Pipeline with APIs:&lt;/strong&gt; Pull data from an API, transform it, load it somewhere cool. Imagine turning Twitter data into a table of “tweets worth reading.”
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Warehouse Schema Design:&lt;/strong&gt; Build a schema for an imaginary e-commerce business. Show off your Star and Snowflake schemas!
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Data Processing:&lt;/strong&gt; Combine Kafka and Spark Streaming for a real-time project, like a stock price tracker or live sports analytics.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Data Workflows:&lt;/strong&gt; Use Airflow to automate an ETL process, so you can sleep while data does the heavy lifting.&lt;/li&gt;
&lt;/ul&gt;
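&lt;p&gt;To make the first project idea concrete, here is a toy ETL transform in plain Python. The field names and records are made up for illustration; a real pipeline would extract from a live API rather than an inline string.&lt;/p&gt;

```python
import json

# Extract: in a real pipeline this JSON would come from an API response.
raw = json.loads('[{"user": "a", "score": "10"}, {"user": "b", "score": "oops"}]')

def transform(records):
    """Keep only records with a numeric score, casting the score to int."""
    clean = []
    for record in records:
        if str(record.get("score", "")).isdigit():
            clean.append({"user": record["user"], "score": int(record["score"])})
    return clean

# Load: here we just keep the cleaned rows in a list; a real pipeline
# would write them to a database or data warehouse.
warehouse = transform(raw)
print(warehouse)
```

&lt;p&gt;Tiny as it is, this is the same extract–transform–load shape you would scale up with Spark or Airflow.&lt;/p&gt;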




&lt;h3&gt;
  
  
  &lt;strong&gt;Step 6: Learn Data Governance &amp;amp; Security 🔒&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As a data engineer, making data accessible but secure is a huge part of your job. Dive into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Quality &amp;amp; Lineage:&lt;/strong&gt; Know where your data comes from and what it’s been through. Trace it like a detective. 🕵️&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security:&lt;/strong&gt; Understand encryption, access control, and other best practices to keep sensitive data protected.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 7: DevOps &amp;amp; Agile for Data Engineers 🚀
&lt;/h3&gt;

&lt;p&gt;Data engineering isn’t just about the tech — you’ll work with teams and need to get data in front of people fast. Embrace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD Pipelines&lt;/strong&gt;: Jenkins and Docker to make sure your code always works, even on Friday afternoons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agile Principles&lt;/strong&gt;: Data teams often work in Agile. Learn Jira for task management and brush up on sprints, stand-ups, and the like.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Step 8: Document and Showcase Your Work
&lt;/h3&gt;

&lt;p&gt;Building a portfolio is crucial for data engineering roles. Host your projects on GitHub, with detailed READMEs and explanations.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Final Countdown: Sum It Up, Data Dynamo! 🎉
&lt;/h3&gt;

&lt;p&gt;Phew! You’ve made it this far, and that’s no small feat. Becoming a data engineer is like assembling a 5,000-piece puzzle… without the picture on the box! 🧩 But trust me, it’s worth every late night, every caffeine-fueled coding session, and every “why won’t this query work?!” moment.&lt;/p&gt;

&lt;p&gt;So, what’s the deal with data engineering? Well, you’re building the backbone of the digital world. You make sure data flows smoothly from point A to point Z (and everywhere in between), ready for the analysts, scientists, and executives to turn it into insights and decisions. You’re the unsung hero, the wizard behind the curtain, duhh… okay, you get the picture. 🧙‍♂️✨&lt;/p&gt;




&lt;h3&gt;
  
  
  What You’ve Learned (and Survived)
&lt;/h3&gt;

&lt;p&gt;From SQL basics to Spark sorcery, every skill you’ve picked up has leveled you up. Now you’re armed with the knowledge of databases, ETL processes, data lakes, cloud tech, and big data frameworks. And that’s no joke! Each of these is a superpower on its own. Here’s what your roadmap has covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SQL Mastery&lt;/strong&gt;: Because knowing how to wrangle data is like knowing the right spell for every situation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Warehouse &amp;amp; Big Data Know-How&lt;/strong&gt;: You’ve learned how to store data, transform it, and make it accessible for analysis at scale. Hello, Hadoop and Spark! 🚀&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ETL and Data Pipelines&lt;/strong&gt;: The art of getting data from here to there, transformed and ready to rock.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Lake Deep Dive&lt;/strong&gt;: Because sometimes, you need to store it all and let the data scientists sort it out later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python and Beyond&lt;/strong&gt;: Coding for data wrangling, automation, and more. Pandas, NumPy, and PySpark are now your BFFs. 🐼🐍&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud Tech Mastery&lt;/strong&gt;: From AWS to Azure, you’re building in the cloud, where data engineering lives and breathes these days.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Project-Ready Skills&lt;/strong&gt;: Version control with Git, automation with Airflow, and CI/CD with DevOps practices — you’re equipped to take on real-world projects.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Why This is a Marathon, Not a Sprint 🏃‍♂️☕
&lt;/h3&gt;

&lt;p&gt;Let’s face it: data engineering is no quick certification. It’s a long haul, like assembling IKEA furniture without the instructions (and with a few mystery parts). You’ll need perseverance, curiosity, and yes, a strong tolerance for caffeine.&lt;/p&gt;

&lt;p&gt;The best way to make progress? Start with small steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SQL Basics&lt;/strong&gt; ➡️ then to &lt;strong&gt;Advanced Joins&lt;/strong&gt; ➡️ finally to &lt;strong&gt;Optimization Techniques&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python for Data Wrangling&lt;/strong&gt; ➡️ then to &lt;strong&gt;PySpark&lt;/strong&gt; ➡️ finally to &lt;strong&gt;Big Data Magic&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Design an ETL Pipeline&lt;/strong&gt; ➡️ then to &lt;strong&gt;Data Lake Architecture&lt;/strong&gt; ➡️ eventually to &lt;strong&gt;Orchestrating Complex Pipelines with Airflow&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And remember, it’s okay to make mistakes! Every data engineer has spent countless hours debugging queries, rewriting code, and scratching their head over a missed comma. Mistakes are just part of the process.&lt;/p&gt;




&lt;h3&gt;
  
  
  Here’s What’s Next: Your Data Engineer’s To-Do List 📝
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Get Hands-On&lt;/strong&gt;: Build projects that showcase your skills, whether it’s a small ETL pipeline or a real-time data streaming setup. Trust me, nothing teaches like doing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Explore New Tools&lt;/strong&gt;: The field’s evolving fast! Stay curious about new technologies and trends.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network with Fellow Data Engineers&lt;/strong&gt;: Connect with other data professionals, join meetups, and ask questions. The data community is here to help.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Document Everything&lt;/strong&gt;: Make your GitHub shine. Write READMEs, share your process, and let your future employers see your journey.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  The Final Pep Talk 🌟
&lt;/h3&gt;

&lt;p&gt;Data engineering is tough, but so are you. You’re now equipped with a roadmap to success, and every project you build brings you one step closer to mastery. Embrace the journey, savor those small wins, and don’t let the bugs bring you down.&lt;/p&gt;

&lt;p&gt;So, grab your laptop, your favorite playlist, and a cup of your favorite fuel — you’ve got this. 🚀&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Akhilesh Pratap Shahi&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>computerscience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>SQL MASTERY - P00 (Introduction to SQL and Its Technicality)</title>
      <dc:creator>Akhilesh Pratap Shahi</dc:creator>
      <pubDate>Sun, 17 Mar 2024 15:54:20 +0000</pubDate>
      <link>https://forem.com/shahiakhilesh1304/sql-mastery-p00-introduction-to-sql-and-its-technicality-52h2</link>
      <guid>https://forem.com/shahiakhilesh1304/sql-mastery-p00-introduction-to-sql-and-its-technicality-52h2</guid>
      <description>&lt;p&gt;When we hear the word SQL, the first thing that pops into our head is data. What exactly does SQL mean? It stands for Structured Query Language. Using SQL, we can write queries that let us manipulate data in almost mystical ways. Data is the fuel of today's market, and anyone who knows how to work with it holds real power to shape it.&lt;/p&gt;

&lt;p&gt;Umm... Data! Now the question comes up: what actually is "DATA"? Data itself is divided into categories: structured, semi-structured, and unstructured. Structured data is organized into a specific format, like tables in a database, with clearly defined fields. Semi-structured data doesn't have a rigid structure, but it may carry some organizational elements, like tags or keys, as in XML or JSON files. Unstructured data has no predefined format and includes things like text documents, images, and videos. By now you know what data is: it's your name, your location, or anything about you that lives on a digital platform. Even your digital footprint is data that can be used to predict your interests.&lt;/p&gt;
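&lt;p&gt;To make the three categories concrete, here is a tiny illustrative sketch in Python (the record and its fields are made up for the example):&lt;/p&gt;

```python
import json

# Structured: a fixed set of named columns, like one row of a database table.
# (Hypothetical columns: name, city.)
structured_row = ("Akhilesh", "Varanasi")

# Semi-structured: a JSON document — it has keys, but no rigid schema.
semi_structured = json.loads(
    '{"name": "Akhilesh", "city": "Varanasi", "interests": ["data", "sql"]}'
)

# Unstructured: free text with no predefined fields at all.
unstructured = "Akhilesh wrote a post about SQL from Varanasi."

print(structured_row[0])        # fields are found by position (column)
print(semi_structured["name"])  # fields are found by key
print("SQL" in unstructured)    # no fields — we can only search the raw text
```

&lt;p&gt;Notice the pattern: the more structure the data has, the more precisely we can ask questions about it — which is exactly what SQL exploits.&lt;/p&gt;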

&lt;p&gt;In this series, we are going to cover structured data; the name "Structured Query Language" suggests as much. Oh! So data is everywhere, but how and where do we store this messy thing? Wherever there is data, a database comes along with it, and vice versa. You might have heard of MySQL, PostgreSQL, Oracle, MongoDB, etc. These are database systems: software applications designed to store, manage, and manipulate data. (Strictly speaking, MongoDB is a NoSQL document database, while the others are relational.) In simple words, a database is a platform where we store lots and lots of data, and a query language like SQL gives us the mechanisms for creating, updating, querying, and administering it. Going further, we will focus on MySQL, as it is one of the most common systems, and later we will cover Google BigQuery for heavy datasets.&lt;/p&gt;

&lt;p&gt;Now you know what you are jumping into. In the next post, the real game begins: first the installation, and then writing our first query to read data.&lt;/p&gt;
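&lt;p&gt;If you want a small taste before the installation post, here is a minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. SQLite is not MySQL, but the SQL itself is nearly identical, and nothing needs to be installed. The table and values are invented for the example:&lt;/p&gt;

```python
import sqlite3

# An in-memory database: nothing to install, perfect for a first experiment.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE: define a table (structured data — fixed, named columns).
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# INSERT: add a couple of rows.
cur.executemany("INSERT INTO users (name) VALUES (?)", [("Ada",), ("Grace",)])

# SELECT: the "first query to read data".
rows = cur.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(rows)  # [(1, 'Ada'), (2, 'Grace')]

conn.close()
```

&lt;p&gt;Those three statements — CREATE, INSERT, SELECT — are the heart of what we will build on throughout this series.&lt;/p&gt;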

</description>
      <category>sql</category>
      <category>learning</category>
      <category>tutorial</category>
      <category>series</category>
    </item>
  </channel>
</rss>
