Apache Spark 4.0 became official a few months ago, bringing many enhancements and new features.
While the release includes various improvements, the evolution of Spark Connect
particularly stands out as a significant leap forward, making remote Spark sessions more seamless and powerful.
What is Spark Connect?
If you're unfamiliar with it, you can think of Spark Connect as a decoupled client-server architecture for Apache Spark.
As you can see in the image, it has a client and a server component. Through the client API, developers can interact with a Spark cluster from any application or environment, such as IDEs, notebooks, and applications written in various languages. It is similar in spirit to Apache Livy, but its client is much thinner, which keeps applications lightweight. Moreover, it uses the same API shape and names as the classic Spark modules, so no additional API translation is required on the client side.
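For example, here is a minimal sketch of creating a remote session from the client side; the endpoint "sc://localhost:15002" is only a placeholder for your own Spark Connect server address.

from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server (placeholder endpoint).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# On the client, the DataFrame API looks exactly like classic Spark.
spark.range(5).show()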
It starts on the client side by translating DataFrame operations into unresolved logical plans encoded as protocol buffers. These are sent to the server using the gRPC framework, where the received protobuf messages are reconstructed into a logical plan. From there, the usual Spark execution process takes over: optimizing the logical plan, transforming it into a physical plan, and executing it.
After execution, the results are returned in the Apache Arrow record format and streamed back over gRPC, so the client can read them sequentially.
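To make the flow concrete, here is a small sketch (again using a placeholder endpoint) with comments marking where the plan translation, gRPC transfer, and Arrow result streaming happen:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Building the DataFrame only constructs an unresolved logical plan on the
# client; nothing is executed on the cluster yet.
df = spark.range(1_000_000).withColumn("squared", F.col("id") * F.col("id"))

# The action triggers the request: the plan is serialized as protobuf, sent
# over gRPC, optimized and executed on the server, and the resulting rows are
# streamed back to the client as Apache Arrow batches.
rows = df.filter(F.col("id") % 2 == 0).limit(10).collect()
print(rows)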
This separation of client and server provides greater flexibility and easier dependency management.
What's in Spark 4.0 for this
Enhanced API Coverage and Stability
Users of previous Spark Connect releases may have encountered missing or partially implemented APIs. The development team has made a great effort to address these gaps in Spark 4.0.
Thanks to the engineering work done in SPARK-47908 and SPARK-49248, Spark Connect now achieves much higher API parity with the classic Spark implementation. This means more of your existing Spark code will work seamlessly in 'connect' mode, without requiring extensive rewrites.
This improved compatibility makes migrating existing applications to the connect architecture a much smoother and more predictable process.
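As an illustration (not tied to the specific tickets above), a typical existing pipeline with DataFrame transformations and a Python UDF should run unchanged whether the session is a classic one or a Spark Connect one:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

# Works the same whether `spark` is a classic session or a Spark Connect session.
spark = SparkSession.builder.getOrCreate()

@F.udf(returnType=StringType())
def parity(n):
    return "even" if n % 2 == 0 else "odd"

(spark.range(10)
    .withColumn("parity", parity(F.col("id")))
    .groupBy("parity")
    .count()
    .show())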
Simple trigger switch for 'connect' mode
One of the most user-friendly changes in Spark 4.0 is the simplified process for enabling Spark Connect. Previously, setting up a remote connection required more specific configuration. Now, activating it is as simple as adding a single configuration property.
Thanks to SPARK-50605, you can now switch to 'connect' mode just by setting "spark.api.mode":
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.api.mode", "connect")
    .master("...")
    .getOrCreate()
)
or
$ spark-submit --master "..." --conf spark.api.mode=connect
The default value for this property is 'classic', and Spark behaves that way if you don't set anything, ensuring backward compatibility. It is a small change, but it significantly lowers the barrier to entry for developers looking to leverage the benefits of a remote, interactive Spark environment.
Machine Learning modules
Another significant leap is in Machine Learning (ML) capabilities. As outlined in SPARK-50812, you can now run not only SQL/DataFrame operations but also ML workloads through Spark Connect.
This is a great improvement for data scientists who want to build, train, and manage models on a powerful cluster directly from their local development environment or notebook.
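As a rough sketch of what that can look like, here is a toy example (placeholder endpoint, tiny in-memory dataset, standard pyspark.ml estimator API) of training a model over a Spark Connect session:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# "sc://localhost:15002" is a placeholder for your Spark Connect endpoint.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# A tiny toy dataset; in practice the data would live on the cluster.
df = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.5, 0.5, 1.0), (0.1, 0.9, 0.0)],
    ["f1", "f2", "label"],
)

features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = features.transform(df)

# The estimator/model API is the familiar pyspark.ml one; with Spark 4.0
# the fit() call runs on the remote cluster through Spark Connect.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(train).select("label", "prediction").show()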
Further improvements targeting Spark 4.1 are also in progress under SPARK-51236, so there is more to look forward to.