DEV Community

kination

'Spark Connect' in Apache Spark 4.0

Apache Spark 4.0 was officially released a few months ago, with many enhancements and new features.

While the release includes various improvements, the evolution of Spark Connect particularly stands out as a significant leap forward, making remote Spark sessions more seamless and powerful.

What is Spark Connect?

If you're unfamiliar with it, you can think of it as a "decoupled client-server architecture" for Apache Spark.

[Figure: Spark Connect architecture]

As you can see in this image, it has a client component and a server component. Through the client API, developers can interact with a Spark cluster from any application or environment, such as IDEs, notebooks, and applications written in various languages. It resembles Apache Livy, but its client is much thinner, which keeps applications lightweight. Moreover, it uses the same API names and format as the classic Spark module, so the client needs no additional API translation.
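
As a rough illustration of how thin the client side is, here is a minimal sketch of connecting to a remote server. It assumes a Spark Connect server is already running and that PySpark is installed with its Connect dependencies; the `connect_url` and `remote_session` helpers and the host names are hypothetical, while the `sc://host:port/;key=value` connection string and the default port 15002 follow the Spark Connect client documentation.

```python
def connect_url(host, port=15002, **params):
    """Build a Spark Connect connection string such as sc://host:15002."""
    url = f"sc://{host}:{port}"
    if params:
        # Optional parameters are appended after "/;" as key=value pairs.
        url += "/;" + ";".join(f"{key}={value}" for key, value in params.items())
    return url


def remote_session(url):
    """Open a remote session. Needs PySpark and a reachable Connect server."""
    # Imported lazily so the URL helper above works without PySpark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(url).getOrCreate()


print(connect_url("localhost"))
```

With a server listening locally, `remote_session(connect_url("localhost"))` would return a session that exposes the usual DataFrame API.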

[Figure: Spark Connect client-server communication]

It starts on the client side, where DataFrame operations are translated into unresolved logical plans encoded with protocol buffers. These are sent to the server over gRPC, and the received protobuf messages are reconstructed into a logical plan. From there, the usual Spark execution process continues: optimizing the logical plan, transforming it into a physical plan, and executing it.
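
To make this flow concrete, here is a toy model in plain Python. It is not Spark's implementation: JSON stands in for protocol buffers, a function call stands in for the gRPC request, and every name here (`read`, `filter_gt`, `project`, `encode`, `execute`, `TABLES`) is made up for illustration. The point is only that the client describes work as a nested plan and ships it, while the server rebuilds and runs it.

```python
import json

# --- Client side: build an unresolved plan; nothing is executed here. ---
def read(table):
    return {"op": "read", "table": table}

def filter_gt(child, column, value):
    return {"op": "filter_gt", "column": column, "value": value, "child": child}

def project(child, columns):
    return {"op": "project", "columns": columns, "child": child}

def encode(plan):
    # Stand-in for protobuf serialization.
    return json.dumps(plan).encode("utf-8")

# --- Server side: decode the message, then walk the plan and execute it. ---
TABLES = {"people": [{"name": "Ann", "age": 34}, {"name": "Bob", "age": 19}]}

def execute(message):
    plan = json.loads(message.decode("utf-8"))
    return _run(plan)

def _run(plan):
    op = plan["op"]
    if op == "read":
        return list(TABLES[plan["table"]])
    rows = _run(plan["child"])  # evaluate the child plan first
    if op == "filter_gt":
        return [r for r in rows if r[plan["column"]] > plan["value"]]
    if op == "project":
        return [{c: r[c] for c in plan["columns"]} for r in rows]
    raise ValueError(f"unknown op: {op}")

plan = project(filter_gt(read("people"), "age", 21), ["name"])
result = execute(encode(plan))
print(result)  # [{'name': 'Ann'}]
```

A real server would also resolve column names against a catalog and optimize the plan before running it; the toy skips both steps.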

After execution, the results are returned as Apache Arrow record batches, streamed back over gRPC, so the client can read them sequentially.
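
The sequential read can be pictured with a generator. In this sketch, plain Python lists stand in for Arrow record batches and the generator stands in for the gRPC response stream; `stream_batches` and the batch size are hypothetical and only show the consumption pattern.

```python
from typing import Iterator, List

def stream_batches(rows: List[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield the result in fixed-size chunks, one chunk per 'message'."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

rows = [{"id": i} for i in range(5)]

# The client consumes batches in arrival order and never needs the whole
# result in a single message.
received = []
for batch in stream_batches(rows, batch_size=2):
    received.extend(batch)

print(len(received))  # 5
```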

This separation of client and server provides greater flexibility and easier dependency management.

What's new for Spark Connect in Spark 4.0

Enhanced API Coverage and Stability

Previous Spark Connect users may have encountered missing or partially implemented APIs. The development team has made a great effort to address these gaps in Spark 4.0.

Thanks to the engineering work done in SPARK-47908 and SPARK-49248, Spark Connect now achieves much higher API parity with the classic Spark implementation. This means that more of your existing Spark code will work seamlessly in 'connect' mode, without requiring extensive rewrites.

This improved compatibility makes migrating existing applications to the connect architecture a much smoother and more predictable process.

Simple trigger switch for 'connect' mode

One of the most user-friendly changes in Spark 4.0 is the simplified process for enabling Spark Connect. Previously, setting up a remote connection required more specific configuration. Now activating it is as simple as adding a single configuration property.

Thanks to the work in SPARK-50605, you can now switch to 'connect' mode just by setting "spark.api.mode":

from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.api.mode", "connect").master("...").getOrCreate()

or

$ spark-submit --master "..." --conf spark.api.mode=connect

The default value is 'classic', so Spark behaves as before if you don't set anything, which preserves backward compatibility. It is a small change, but it significantly lowers the barrier to entry for developers who want the benefits of a remote, interactive Spark environment.
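
If you launch jobs from a script, the switch can be wrapped in a small helper. This is only a sketch: `submit_args`, the `local[2]` master, and `job.py` are hypothetical, and the only facts taken from the text above are the `spark.api.mode` property and its two values. It builds the argument list without running anything.

```python
VALID_MODES = ("classic", "connect")

def submit_args(master, app, mode="classic"):
    """Build a spark-submit argument list for the chosen API mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    args = ["spark-submit", "--master", master]
    if mode != "classic":
        # 'classic' is the default, so the property is only added when needed.
        args += ["--conf", f"spark.api.mode={mode}"]
    args.append(app)
    return args

print(" ".join(submit_args("local[2]", "job.py", mode="connect")))
```

The resulting list can be passed to `subprocess.run` on a machine where `spark-submit` is on the PATH.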

Machine Learning modules

Another significant leap is in Machine Learning (ML) capabilities. As outlined in SPARK-50812, you can now execute not only SQL/DataFrame operations, but also ML workloads through Spark Connect.

This will be a great improvement for data scientists who want to build, train, and manage models on a powerful cluster directly from their local development environment or notebook.

Work on additional improvements targeting 4.1 continues in SPARK-51236, so there is more to look forward to.
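
As a sketch of what this could look like, the function below trains a small model with the regular `pyspark.ml` API on whatever session it is given. It assumes PySpark 4.0 is installed, that `spark` is a session created in 'connect' mode (or a classic one), and that `LogisticRegression` is among the estimators supported over Connect in your version. The `train_demo` name and the toy data are made up for illustration.

```python
def train_demo(spark):
    """Fit a tiny logistic regression model using the given SparkSession."""
    # Imported lazily so this file can be loaded without PySpark installed.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    # Toy training set: one feature vector and one label per row.
    train = spark.createDataFrame(
        [
            (Vectors.dense([0.0, 1.1]), 0.0),
            (Vectors.dense([2.0, 1.0]), 1.0),
            (Vectors.dense([0.1, 1.2]), 0.0),
            (Vectors.dense([2.2, 0.9]), 1.0),
        ],
        ["features", "label"],
    )

    # The same estimator code is used regardless of the API mode.
    model = LogisticRegression(maxIter=10, regParam=0.01).fit(train)
    return model.transform(train).select("label", "prediction")
```

Calling `train_demo(spark).show()` with a connected session would print the labels next to the model's predictions.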
