<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Tiago Xavier</title>
    <description>The latest articles on Forem by Tiago Xavier (@tiagotxm).</description>
    <link>https://forem.com/tiagotxm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F592028%2F4f0a4c92-c487-44a1-9ecb-501a54dc7803.jpg</url>
      <title>Forem: Tiago Xavier</title>
      <link>https://forem.com/tiagotxm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tiagotxm"/>
    <language>en</language>
    <item>
      <title>[Spark-k8s] — Getting started # Part 1</title>
      <dc:creator>Tiago Xavier</dc:creator>
      <pubDate>Tue, 19 Jul 2022 17:17:49 +0000</pubDate>
      <link>https://forem.com/tiagotxm/spark-k8s-getting-started-part-1-j0g</link>
      <guid>https://forem.com/tiagotxm/spark-k8s-getting-started-part-1-j0g</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oKv8hByW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l1vgadbyg7m5l1rh48di.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oKv8hByW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l1vgadbyg7m5l1rh48di.png" alt="Image description" width="880" height="462"&gt;&lt;/a&gt;&lt;br&gt;
This is the first part of a post series about running Spark on Kubernetes. You can read more about the advantages of using this approach in this Spot.io post &lt;a href="https://spot.io/blog/the-pros-and-cons-of-running-apache-spark-on-kubernetes/"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By the end of this post, you will have a local environment ready to use minikube to run spark.&lt;/p&gt;
&lt;h2&gt;
  
  
  Create local environment
&lt;/h2&gt;

&lt;p&gt;We will use this local environment to work with spark throughout this series, except for autoscaling, which I will simulate with AWS EKS, Cluster Autoscaler, and Karpenter. The post on scalability will be published soon.&lt;/p&gt;

&lt;p&gt;To follow along locally, you must install minikube. The procedure is very simple, and the official documentation is available &lt;a href="https://minikube.sigs.k8s.io/docs/start/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tip: I strongly advise installing this &lt;a href="https://github.com/ohmyzsh/ohmyzsh/tree/master/plugins/kubectl"&gt;plugin&lt;/a&gt; if you use &lt;a href="https://github.com/ohmyzsh/ohmyzsh"&gt;ohmyzsh&lt;/a&gt;, because it provides aliases for popular kubectl commands.&lt;/p&gt;
&lt;/blockquote&gt;
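For reference, here is a small subset of the shortcuts that plugin provides, sketched as plain shell aliases (the plugin defines many more):

```shell
# a few kubectl shortcuts from the ohmyzsh kubectl plugin,
# written out as plain shell aliases
alias k='kubectl'                       # e.g. k create namespace processing
alias kgp='kubectl get pods'
alias kgpw='kubectl get pods --watch'   # used later to follow the spark job
```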

&lt;p&gt;After installing minikube, let’s create a multi-node cluster with 2 nodes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;minikube start --nodes 2 -p my-local-cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Perhaps you’re wondering why there are two local nodes. The answer is that Kubernetes lets us choose which node each application runs on.&lt;br&gt;
In this post, we will simulate a small production environment; in the next post, the spark driver will run on ON_DEMAND nodes and the executors on SPOT nodes, which lets us reduce cloud costs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can check that everything worked by looking at this output in the console.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eQJ_XKLr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vkngl0gfdns9h7gu9ugc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eQJ_XKLr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vkngl0gfdns9h7gu9ugc.png" alt="Image description" width="860" height="96"&gt;&lt;/a&gt;&lt;/p&gt;
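Screenshots aside, you can also list the nodes yourself; assuming the cluster created above, both nodes should report a Ready status (run against your own cluster, output will vary):

```shell
# both my-local-cluster and my-local-cluster-m02 should be "Ready"
kubectl get nodes
```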

&lt;p&gt;Imagine the following hypothetical scenario: my-local-cluster is an ON_DEMAND instance and my-local-cluster-m02 is a SPOT instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IoOsqThh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lt0mozz5o4ye808eqyvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IoOsqThh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lt0mozz5o4ye808eqyvn.png" alt="Image description" width="528" height="298"&gt;&lt;/a&gt;&lt;/p&gt;
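If you want to make that scenario concrete, you could tag each node with a capacity-type label; the `node-type` label key here is purely illustrative and is not used elsewhere in this post:

```shell
# hypothetical labels mirroring ON_DEMAND vs SPOT capacity
kubectl label node my-local-cluster node-type=on-demand
kubectl label node my-local-cluster-m02 node-type=spot
```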

&lt;p&gt;Create the processing namespace in which our spark jobs will run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ k create namespace processing

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Installing Spark
&lt;/h2&gt;

&lt;p&gt;The SparkOperator must be installed before we can use Spark on Kubernetes. Google created this operator, which is available on &lt;a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator"&gt;GitHub&lt;/a&gt;. In a nutshell, the operator is in charge of watching the cluster for events related to spark jobs, also known as kind: SparkApplication resources.&lt;/p&gt;

&lt;p&gt;The simplest way to install is to follow the documentation using Helm, but we can customize some interesting features by downloading the SparkOperator helm chart&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
$ helm pull spark-operator/spark-operator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will get this result&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AH3zo5Bh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yd35u6dc6dttl4m6ld2b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AH3zo5Bh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yd35u6dc6dttl4m6ld2b.png" alt="Image description" width="418" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's open the &lt;strong&gt;values.yaml&lt;/strong&gt; file and make some changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;...
sparkJobNamespace: "processing"
...
webhook:
  enable: true
  ...
  namespaceSelector: kubernetes.io/metadata.name=processing
...
nodeSelector:
    kubernetes.io/hostname=my-local-cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;sparkJobNamespace&lt;/strong&gt; specifies which namespace the SparkOperator will watch for spark job events.&lt;/p&gt;

&lt;p&gt;Because we will control the instance types for spark drivers and executors in the following post, &lt;strong&gt;webhook&lt;/strong&gt; is set to true. Don’t be concerned about it right now.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;nodeSelector&lt;/strong&gt; parameter specifies which node instance the SparkOperator pod will be installed on. We need to ensure that we do not lose the spark-operator pod, so we use an ON_DEMAND instance in our hypothetical example.&lt;/p&gt;
&lt;/blockquote&gt;
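As an alternative to editing values.yaml, the same values can be overridden at install time; this sketch assumes the chart keys shown in the file above:

```shell
# equivalent to the values.yaml edits above (chart keys assumed from that file)
helm install spark-operator ./spark-operator \
  -n spark-operator --create-namespace \
  --set sparkJobNamespace=processing \
  --set webhook.enable=true
```

The namespaceSelector and nodeSelector entries can be passed the same way, though with `--set` any dots inside a key must be escaped (e.g. `kubernetes\.io/hostname`).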

&lt;p&gt;Run the following to install&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# general form: helm install &amp;lt;chart-name&amp;gt; &amp;lt;chart-folder&amp;gt; -n &amp;lt;namespace&amp;gt;
$ helm install spark-operator spark-operator -n spark-operator --create-namespace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now have SparkOperator installed and running&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nRHBRU0E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ij355cduhob542hz85xt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nRHBRU0E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ij355cduhob542hz85xt.png" alt="Image description" width="880" height="126"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The final step is to set up a spark job to run.&lt;/p&gt;

&lt;p&gt;An example of a spark job defined in a YAML file is provided below. As you can see, we’re defining the &lt;strong&gt;processing namespace&lt;/strong&gt; in which all of our spark jobs will run, and from which the SparkOperator will receive events.&lt;/p&gt;

&lt;p&gt;In addition, &lt;strong&gt;kind: SparkApplication&lt;/strong&gt; is specified&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: my-job-with-operator
  namespace: processing
spec:
  type: Scala
  mode: cluster
  image: "tiagotxm/spark3.1:latest"
  imagePullPolicy: Always 
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.1.1
    serviceAccount: spark-operator-spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"     
  executor:
    cores: 3
    instances: 3
    memory: "512m"
    labels:
      version: 3.1.1
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll use a spark &lt;strong&gt;image&lt;/strong&gt; and run a simple spark job defined in mainApplicationFile, which is contained within the image.&lt;/p&gt;

&lt;p&gt;Let's get started on our simple task.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ k apply -f hello-spark-operator.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
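Once the driver finishes, you can check its logs; the SparkPi example prints an approximation of pi. The pod name below assumes the operator's usual app-name-plus-"-driver" naming convention:

```shell
# the driver pod is named after the SparkApplication, suffixed with -driver
kubectl logs my-job-with-operator-driver -n processing | grep "Pi is roughly"
```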



&lt;p&gt;We can monitor the job’s start and all spark-related events, such as the creation of the driver pod and the executor pods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kgpw -n processing  # plugin alias for 'kubectl get pods --watch'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZNlukFVe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4ipr3ypldi2bc8j0ll79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZNlukFVe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4ipr3ypldi2bc8j0ll79.png" alt="Image description" width="772" height="652"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, double-check the final status&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ k get sparkapplications -n processing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we created a local environment with two namespaces for running spark jobs. The SparkOperator is installed in the &lt;strong&gt;spark-operator namespace&lt;/strong&gt;, and it handles spark events from the &lt;strong&gt;processing namespace&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/68hmecxcry46n41c48la.png" alt="Image description"&gt;&lt;/p&gt;

&lt;p&gt;I hope this post was useful. You can find a complete example in my git &lt;a href="https://github.com/tiagotxm/data-engineer-projects/blob/main/jobs/getting-started-spark-operator/hello-spark-operator.yaml"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thank you for your time!&lt;/p&gt;

&lt;p&gt;You can also read this post on &lt;a href="https://tiagotxm.medium.com/spark-k8s-getting-started-part-1-44200fb53606"&gt;Medium&lt;/a&gt; if you prefer.&lt;br&gt;
Join me: &lt;a href="https://www.linkedin.com/in/tiagotxm/"&gt;https://www.linkedin.com/in/tiagotxm/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>spark</category>
      <category>kubernetes</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
