<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Daniel Westheide</title>
    <description>The latest articles on Forem by Daniel Westheide (@dwestheide).</description>
    <link>https://forem.com/dwestheide</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2851%2Fbig_1KddbUgv.jpg</url>
      <title>Forem: Daniel Westheide</title>
      <link>https://forem.com/dwestheide</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dwestheide"/>
    <language>en</language>
    <item>
      <title>Introducing kontextfrei</title>
      <dc:creator>Daniel Westheide</dc:creator>
      <pubDate>Thu, 09 Nov 2017 13:56:28 +0000</pubDate>
      <link>https://forem.com/dwestheide/introducing-kontextfrei-7kb</link>
      <guid>https://forem.com/dwestheide/introducing-kontextfrei-7kb</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally posted on &lt;a href="http://danielwestheide.com/blog/2017/10/31/introducing-kontextfrei.html"&gt;Daniel Westheide's blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the past 15 months, I have been working on and off on a new library. So far, I have mostly kept quiet about it, because I didn't feel it was ready for a wider audience, even though we had been using it successfully in production for a while. However, I already broke my silence back in April, when I gave a talk about it at this year's ScalarConf in Warsaw, so a blog post explaining what this library does and why I set out to write it in the first place is overdue.&lt;/p&gt;

&lt;p&gt;Last year, I was involved in a project that required my team to implement a few Spark applications. For most of them, the business logic was rather complex, so we tried to implement it in a test-driven way, using property-based tests.&lt;/p&gt;

&lt;h2&gt;The pain of unit-testing Spark applications&lt;/h2&gt;

&lt;p&gt;At first glance, this looks like a great match. When it comes down to it, a Spark application consists of IO stages (reading from and writing to data sources) and transformations of data sets. The latter constitute our business logic and are relatively easy to separate from the IO parts, as they are mostly built from pure functions. Functions like these are a perfect fit for test-driven development as well as for property-based testing.&lt;/p&gt;
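&lt;p&gt;&lt;em&gt;As a rough illustration&lt;/em&gt; (the names here are made up for the sketch, not code from the project): a typical transformation is a pure function, so a property can be checked against many generated inputs without any Spark machinery at all.&lt;/p&gt;

```scala
// Hypothetical sketch: a pure transformation plus a hand-rolled property
// check, standing in for what ScalaCheck would generate for us.
def wordPairs(line: String): Seq[(String, Long)] =
  line.split(" ").toSeq.filter(_.nonEmpty).map(word => (word, 1L))

// Property: every emitted pair carries a count of exactly 1, whatever the input.
val rnd = new scala.util.Random(42)
val lines = Seq.fill(100) {
  Seq.fill(rnd.nextInt(10))(rnd.alphanumeric.take(5).mkString).mkString(" ")
}
assert(lines.forall(line => wordPairs(line).forall(_._2 == 1L)))
println("property held for 100 random lines")
```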

&lt;p&gt;However, all was not great. It may be old news to you if you have been working with Apache Spark for a while, but it turns out that writing real unit tests is not well supported by Spark, and as a result, it can be quite painful. The problem is that in order to create an &lt;code&gt;RDD&lt;/code&gt;, we always need a &lt;code&gt;SparkContext&lt;/code&gt;, and the most lightweight way of getting one is to create a local &lt;code&gt;SparkContext&lt;/code&gt;. Doing so means starting up a server, which takes a few seconds, so testing our properties with lots of different generated input data takes a really long time. We certainly lose the fast feedback loop we are used to from developing web applications, for example.&lt;/p&gt;

&lt;h2&gt;Abstracting over RDDs with kontextfrei&lt;/h2&gt;

&lt;p&gt;Now, we could confine ourselves to only unit-testing the functions that we pass to &lt;code&gt;RDD&lt;/code&gt; operators, so that our unit tests do not have any dependency on Spark and can be verified as quickly as we are used to. However, this leaves quite a lot of business logic uncovered. Instead, at a Scala hackathon last May, I started to experiment with the idea of abstracting over Spark's &lt;code&gt;RDD&lt;/code&gt;, and &lt;em&gt;kontextfrei&lt;/em&gt; was born.&lt;/p&gt;

&lt;p&gt;The idea is the following: by abstracting over &lt;code&gt;RDD&lt;/code&gt;, we can write business logic that has no dependency on the &lt;code&gt;RDD&lt;/code&gt; type. This means that we can also write test properties that are Spark-agnostic. Any Spark-agnostic code like this can either be executed on an &lt;code&gt;RDD&lt;/code&gt; (which you would do in your actual Spark application and in your integration tests), or on a local and fast Scala collection (which is really great for unit tests that you continuously run locally during development).&lt;/p&gt;
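&lt;p&gt;&lt;em&gt;To sketch the underlying pattern&lt;/em&gt; (with made-up names, not kontextfrei's actual API): the business logic only demands a typeclass instance for &lt;code&gt;F[_]&lt;/code&gt;, so the same function runs on a plain Scala collection in tests and could run on an &lt;code&gt;RDD&lt;/code&gt; in production, given an instance for it.&lt;/p&gt;

```scala
import scala.language.higherKinds

// Miniature version of the idea (hypothetical names): a typeclass capturing
// the operations the business logic needs, with no mention of Spark.
trait MiniOps[F[_]] {
  def flatMapF[A, B](fa: F[A])(f: A => Seq[B]): F[B]
}

// Business logic written only against F[_].
def tokenize[F[_]](text: F[String])(implicit ops: MiniOps[F]): F[String] =
  ops.flatMapF(text)(line => line.split(" ").toSeq)

// Instance for a plain Scala collection: fast, SparkContext-free unit tests.
// A production instance would wrap RDD's flatMap instead.
implicit val listOps: MiniOps[List] = new MiniOps[List] {
  def flatMapF[A, B](fa: List[A])(f: A => Seq[B]): List[B] = fa.flatMap(f)
}

println(tokenize(List("hello world", "spark free")))
```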

&lt;h2&gt;Obtaining the library&lt;/h2&gt;

&lt;p&gt;It's probably easier to show how this works than to describe it in words alone, so let's look at a really minimalistic example, the traditional &lt;em&gt;word count&lt;/em&gt;. First, we need to add the necessary dependencies to our SBT build file. &lt;em&gt;kontextfrei&lt;/em&gt; consists of two modules, &lt;code&gt;kontextfrei-core&lt;/code&gt; and &lt;code&gt;kontextfrei-scalatest&lt;/code&gt;. The former is what you need to abstract over &lt;code&gt;RDD&lt;/code&gt; in your main code base; the latter gives you some additional support for writing your RDD-independent tests using ScalaTest with ScalaCheck. Let's add them to our &lt;code&gt;build.sbt&lt;/code&gt; file, together with the usual&lt;br&gt;
Spark dependency you would need anyway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="n"&gt;resolvers&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"dwestheide"&lt;/span&gt; &lt;span class="n"&gt;at&lt;/span&gt; &lt;span class="s"&gt;"https://dl.bintray.com/dwestheide/maven"&lt;/span&gt;
&lt;span class="n"&gt;libraryDependencies&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"com.danielwestheide"&lt;/span&gt; &lt;span class="o"&gt;%%&lt;/span&gt; &lt;span class="s"&gt;"kontextfrei-core-spark-2.2.0"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="s"&gt;"0.6.0"&lt;/span&gt;
&lt;span class="n"&gt;libraryDependencies&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"com.danielwestheide"&lt;/span&gt; &lt;span class="o"&gt;%%&lt;/span&gt; &lt;span class="s"&gt;"kontextfrei-scalatest-spark-2.2.0"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="s"&gt;"0.6.0"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="s"&gt;"test,it"&lt;/span&gt;
&lt;span class="n"&gt;libraryDependencies&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"org.apache.spark"&lt;/span&gt; &lt;span class="o"&gt;%%&lt;/span&gt; &lt;span class="s"&gt;"spark-core"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="s"&gt;"2.2.0"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please note that in this simple example, we create a Spark application that you can execute in a self-contained way. In the real world, you would add &lt;code&gt;spark-core&lt;/code&gt; as a &lt;code&gt;provided&lt;/code&gt; dependency and create an assembly JAR that you pass to &lt;code&gt;spark-submit&lt;/code&gt;.&lt;/p&gt;
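&lt;p&gt;For reference, that production setup would differ only in one line of &lt;code&gt;build.sbt&lt;/code&gt;:&lt;/p&gt;

```scala
// In a real deployment, Spark is supplied by the cluster, so it is marked
// "provided" and left out of the assembly JAR passed to spark-submit:
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"
```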

&lt;h2&gt;Implementing the business logic&lt;/h2&gt;

&lt;p&gt;Now, let's see how we can implement the business logic of our word count application using &lt;em&gt;kontextfrei&lt;/em&gt;. In our example, we define all of our business logic in a trait called &lt;code&gt;WordCount&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;com.danielwestheide.kontextfrei.wordcount&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.danielwestheide.kontextfrei.DCollectionOps&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.danielwestheide.kontextfrei.syntax.SyntaxSupport&lt;/span&gt;

&lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="nc"&gt;WordCount&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;SyntaxSupport&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;counts&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;F&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="k"&gt;_&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DCollectionOps&lt;/span&gt;&lt;span class="o"&gt;](&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;F&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;F&lt;/span&gt;&lt;span class="o"&gt;[(&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;, &lt;span class="kt"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;)]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;flatMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;line&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;split&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;" "&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1L&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;reduceByKey&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;_&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="k"&gt;_&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sortBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;_&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;_2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ascending&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;formatted&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;F&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="k"&gt;_&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;DCollectionOps&lt;/span&gt;&lt;span class="o"&gt;](&lt;/span&gt;&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;F&lt;/span&gt;&lt;span class="o"&gt;[(&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;, &lt;span class="kt"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;)])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;F&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt;
    &lt;span class="nv"&gt;counts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;map&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;case&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="s"&gt;"$word,$count"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first thing you'll notice is that the implementations of &lt;code&gt;counts&lt;/code&gt; and &lt;code&gt;formatted&lt;/code&gt; look exactly the same as they would if you were programming against Spark's &lt;code&gt;RDD&lt;/code&gt; type. You could literally copy and paste &lt;code&gt;RDD&lt;/code&gt;-based code into a program written with &lt;em&gt;kontextfrei&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The second thing you notice is that the method signatures of &lt;code&gt;counts&lt;/code&gt; and &lt;code&gt;formatted&lt;/code&gt; contain a type constructor, declared as &lt;code&gt;F[_]&lt;/code&gt;, which is constrained by a context bound: For any concrete type constructor we pass in here, there must be an instance of kontextfrei's &lt;code&gt;DCollectionOps&lt;/code&gt; typeclass. In our business logic, we do not care what concrete type constructor is used for &lt;code&gt;F&lt;/code&gt;, as long as the operations defined in &lt;code&gt;DCollectionOps&lt;/code&gt; are supported for it. This way, we are liberating our business logic from any dependency on Spark, and specifically on the annoying &lt;code&gt;SparkContext&lt;/code&gt;.&lt;/p&gt;
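&lt;p&gt;In case the context bound syntax is unfamiliar: it is just sugar for an implicit parameter. The following two signatures are equivalent (using a stub typeclass here, purely for the sketch):&lt;/p&gt;

```scala
import scala.language.higherKinds

// Stub standing in for kontextfrei's typeclass, just to show the desugaring.
trait DCollectionOps[F[_]]

// Context-bound form, as used in WordCount ...
def countsA[F[_]: DCollectionOps](text: F[String]): F[String] = text

// ... desugars to an explicit implicit parameter list.
def countsB[F[_]](text: F[String])(implicit ops: DCollectionOps[F]): F[String] = text

// Either compiles only if an instance for the concrete F is in scope.
implicit val listInstance: DCollectionOps[List] = new DCollectionOps[List] {}
println(countsA(List("ok")) == countsB(List("ok")))
```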

&lt;p&gt;In order to be able to use the familiar syntax we know from the &lt;code&gt;RDD&lt;/code&gt; type, we mix in kontextfrei's &lt;code&gt;SyntaxSupport&lt;/code&gt; trait, but you could just as well use an import instead, if that's more to your liking.&lt;/p&gt;

&lt;h2&gt;Plugging our business logic into the Spark application&lt;/h2&gt;

&lt;p&gt;At the end of the day, we want to end up with a runnable Spark application. To achieve that, we must plug our Spark-agnostic business logic together with the Spark-dependent IO parts of our application. Here is what this looks like in our word count example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;com.danielwestheide.kontextfrei.wordcount&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.danielwestheide.kontextfrei.rdd.RDDOpsSupport&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.SparkContext&lt;/span&gt;

&lt;span class="k"&gt;object&lt;/span&gt; &lt;span class="nc"&gt;Main&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;App&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;WordCount&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;RDDOpsSupport&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

  &lt;span class="k"&gt;implicit&lt;/span&gt; &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;sparkContext&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SparkContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SparkContext&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"local[1]"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"word-count"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;inputFilePath&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;args&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;outputFilePath&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;args&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;textFile&lt;/span&gt;   &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;sparkContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;textFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputFilePath&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minPartitions&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;wordCounts&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;counts&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;textFile&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;formatted&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wordCounts&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;saveAsTextFile&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputFilePath&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

  &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nv"&gt;sparkContext&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;stop&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our &lt;code&gt;Main&lt;/code&gt; object mixes in our &lt;code&gt;WordCount&lt;/code&gt; trait as well as &lt;em&gt;kontextfrei&lt;/em&gt;'s &lt;code&gt;RDDOpsSupport&lt;/code&gt;, which proves to the compiler that we have an instance of the &lt;code&gt;DCollectionOps&lt;/code&gt; typeclass for the &lt;code&gt;RDD&lt;/code&gt; type constructor. In order to prove this, we also need an implicit &lt;code&gt;SparkContext&lt;/code&gt;. Again, instead of mixing in this trait, we can also use an import.&lt;/p&gt;
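&lt;p&gt;The shape of that arrangement can be sketched with made-up names: a typeclass instance that itself demands an implicit dependency, just as the &lt;code&gt;RDD&lt;/code&gt; instance demands a &lt;code&gt;SparkContext&lt;/code&gt;:&lt;/p&gt;

```scala
import scala.language.higherKinds

// Hypothetical sketch, not kontextfrei's real API.
trait CollOps[F[_]] { def countF[A](fa: F[A]): Long }

// Stand-in for SparkContext in this sketch.
final case class FakeContext(appName: String)

// Mixing in this trait provides the instance, but only if an implicit
// FakeContext is in scope, mirroring RDDOpsSupport's requirement.
trait FakeOpsSupport {
  implicit def listOpsFor(implicit ctx: FakeContext): CollOps[List] =
    new CollOps[List] { def countF[A](fa: List[A]): Long = fa.size.toLong }
}

object WordCountMain extends FakeOpsSupport {
  implicit val ctx: FakeContext = FakeContext("word-count")
  def run(): Long = implicitly[CollOps[List]].countF(List("a", "b", "c"))
}

println(WordCountMain.run())
```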

&lt;p&gt;Now, our &lt;code&gt;Main&lt;/code&gt; object is all about doing some IO and integrating our business logic into it.&lt;/p&gt;

&lt;h2&gt;Writing Spark-agnostic tests&lt;/h2&gt;

&lt;p&gt;So far so good. We have liberated our business logic from any dependency on Spark, but what do we gain from this? Well, now we are able to write our unit tests in a Spark-agnostic way as well. First, we define a &lt;code&gt;BaseSpec&lt;/code&gt; which inherits from kontextfrei's &lt;code&gt;KontextfreiSpec&lt;/code&gt; and mixes in a few other goodies from &lt;em&gt;kontextfrei-scalatest&lt;/em&gt; and from ScalaTest itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;com.danielwestheide.kontextfrei.wordcount&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.danielwestheide.kontextfrei.scalatest.KontextfreiSpec&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.danielwestheide.kontextfrei.syntax.DistributionSyntaxSupport&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.scalactic.anyvals.PosInt&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.scalatest.prop.GeneratorDrivenPropertyChecks&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.scalatest.&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nc"&gt;MustMatchers&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;PropSpecLike&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="nc"&gt;BaseSpec&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;F&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="k"&gt;_&lt;/span&gt;&lt;span class="o"&gt;]]&lt;/span&gt;
    &lt;span class="nc"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;KontextfreiSpec&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;F&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DistributionSyntaxSupport&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;PropSpecLike&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;GeneratorDrivenPropertyChecks&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;MustMatchers&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

  &lt;span class="k"&gt;implicit&lt;/span&gt; &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;config&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;PropertyCheckConfiguration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="nc"&gt;PropertyCheckConfiguration&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minSuccessful&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PosInt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;BaseSpec&lt;/code&gt;, like our &lt;code&gt;WordCount&lt;/code&gt; trait, takes a type constructor, which it simply passes along to the &lt;code&gt;KontextfreiSpec&lt;/code&gt; trait. We will get back to that one in a minute.&lt;/p&gt;

&lt;p&gt;Our actual test properties can now be implemented for any type constructor &lt;code&gt;F[_]&lt;/code&gt; for which there is an instance of &lt;code&gt;DCollectionOps&lt;/code&gt;. We define them in a trait &lt;code&gt;WordCountProperties&lt;/code&gt;, which also has to be parameterized by a type constructor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;com.danielwestheide.kontextfrei.wordcount&lt;/span&gt;

&lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="nc"&gt;WordCountProperties&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;F&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="k"&gt;_&lt;/span&gt;&lt;span class="o"&gt;]]&lt;/span&gt; &lt;span class="nc"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;BaseSpec&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;F&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;WordCount&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

  &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;collection.immutable._&lt;/span&gt;

  &lt;span class="nf"&gt;property&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sums word counts across lines"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;forAll&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wordA&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt;
      &lt;span class="nf"&gt;whenever&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;wordA&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;nonEmpty&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;wordB&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;wordA&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;reverse&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;wordA&lt;/span&gt;
        &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt;
          &lt;span class="nf"&gt;counts&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Seq&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="s"&gt;"$wordB $wordA $wordB"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wordB&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;distributed&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;collectAsMap&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wordB&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;property&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"does not have duplicate keys"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;forAll&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wordA&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt;
      &lt;span class="nf"&gt;whenever&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;wordA&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;nonEmpty&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;wordB&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;wordA&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;reverse&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;wordA&lt;/span&gt;
        &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt;
          &lt;span class="nf"&gt;counts&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Seq&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="s"&gt;"$wordA $wordB"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="s"&gt;"$wordB $wordA"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;distributed&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
          &lt;span class="nv"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;distinct&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="py"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="py"&gt;toList&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;keys&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;collect&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
            &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toList&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We want to be able to test our Spark-agnostic properties both against fast Scala collections and against &lt;code&gt;RDD&lt;/code&gt;s in a local Spark cluster. To get there, we need to define two test classes, one in the &lt;code&gt;test&lt;/code&gt; sources directory, the other in the &lt;code&gt;it&lt;/code&gt; sources directory. Here is the unit test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;com.danielwestheide.kontextfrei.wordcount&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.danielwestheide.kontextfrei.scalatest.StreamSpec&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WordCountSpec&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;BaseSpec&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;StreamSpec&lt;/span&gt;
  &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;WordCountProperties&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We mix in &lt;code&gt;BaseSpec&lt;/code&gt; and pass it the &lt;code&gt;Stream&lt;/code&gt; type constructor. &lt;code&gt;Stream&lt;/code&gt; has the same shape as &lt;code&gt;RDD&lt;/code&gt;, but it is a Scala collection. The &lt;code&gt;KontextfreiSpec&lt;/code&gt; trait extended by &lt;code&gt;BaseSpec&lt;/code&gt; defines an abstract implicit &lt;code&gt;DCollectionOps&lt;/code&gt; for its type constructor. By mixing in &lt;code&gt;StreamSpec&lt;/code&gt;, we get an instance of &lt;code&gt;DCollectionOps&lt;/code&gt; for &lt;code&gt;Stream&lt;/code&gt;. While implementing our business logic, we can run the &lt;code&gt;WordCountSpec&lt;/code&gt; test and get instantaneous feedback. We can use SBT's triggered execution to run our unit tests upon every detected source change, using &lt;code&gt;~test&lt;/code&gt;, and it will be really fast.&lt;/p&gt;
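&lt;p&gt;A typical workflow then looks like this at the SBT prompt (assuming the integration tests live in the &lt;code&gt;it&lt;/code&gt; configuration, as set up above):&lt;/p&gt;

```
> ~test      // re-runs the fast Stream-based WordCountSpec on every source change
> it:test    // runs WordCountIntegrationSpec against a local SparkContext on demand
```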

&lt;p&gt;In order to make sure that none of the typical bugs that you would only notice in a Spark cluster have sneaked in, we also define an integration test, which tests exactly the same properties:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;com.danielwestheide.kontextfrei.wordcount&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;com.danielwestheide.kontextfrei.scalatest.RDDSpec&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.rdd.RDD&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WordCountIntegrationSpec&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;BaseSpec&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;RDD&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
  &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;RDDSpec&lt;/span&gt;
  &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;WordCountProperties&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;RDD&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This time, we mix in &lt;code&gt;RDDSpec&lt;/code&gt;, because we parameterize &lt;code&gt;BaseSpec&lt;/code&gt; with the &lt;code&gt;RDD&lt;/code&gt; type constructor.&lt;/p&gt;

&lt;h2&gt;Design goals&lt;/h2&gt;

&lt;p&gt;It was an explicit design goal to stick as closely as possible to the existing Spark API. This allows people with existing Spark code bases to switch to &lt;em&gt;kontextfrei&lt;/em&gt; smoothly, or even to migrate only parts of their application without too much hassle – with the benefit of now being able to cover their business logic with the tests it has been missing, without the usual pain.&lt;/p&gt;

&lt;p&gt;An alternative, of course, would have been to build this library on the ever-popular interpreter pattern. To be honest, I wish Spark itself were using this pattern – other libraries like Apache Crunch have successfully shown that it can help tremendously with enabling developers to write tests for the business logic of their applications. If Spark were built on those very principles, there wouldn't be any reason for &lt;em&gt;kontextfrei&lt;/em&gt; to exist at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;kontextfrei&lt;/em&gt; is still a young library, and while we have been using it in production in one project, I do not know of any other adopters. One of its limitations is that it doesn't yet support all operations defined on the &lt;code&gt;RDD&lt;/code&gt; type – but we are getting closer. In addition, I have yet to find a clever way to support broadcast variables and accumulators. And of course, who is using &lt;code&gt;RDD&lt;/code&gt;s anyway in 2017? While I do think that there is still room for &lt;code&gt;RDD&lt;/code&gt;-based Spark applications, I am aware that many people have long moved on to &lt;code&gt;Dataset&lt;/code&gt;s and to Spark Streaming. It would be nice to create a similar typeclass-based abstraction for datasets and for streaming applications, but I haven't had the time to look deeper into what would be necessary to implement either of those.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;kontextfrei&lt;/em&gt; is a Scala library that aims to provide developers with a faster feedback loop when developing Apache Spark applications. To achieve that, it enables you to write the business logic of your Spark application, as well as your test code, against an abstraction over Spark’s RDD.&lt;/p&gt;
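&lt;p&gt;The core idea can be sketched in a few lines – this is an illustration of the approach, not &lt;em&gt;kontextfrei&lt;/em&gt;'s actual API: business logic is written against an abstract type constructor, and the same code runs on &lt;code&gt;Stream&lt;/code&gt; in unit tests and on &lt;code&gt;RDD&lt;/code&gt; in production.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// Illustration of the approach only; the real DCollectionOps
// typeclass mirrors the RDD API much more closely.
trait WordOps[DColl[_]] {
  def flatMapColl[A, B](as: DColl[A])(f: A =&amp;gt; Iterable[B]): DColl[B]
  def countByValue[A](as: DColl[A]): Map[A, Long]
}

// Business logic, oblivious to whether DColl is Stream or RDD.
def wordCounts[DColl[_]](lines: DColl[String])(
    implicit ops: WordOps[DColl]): Map[String, Long] =
  ops.countByValue(ops.flatMapColl(lines)(_.split("\\s+")))

// The instance used in fast unit tests: DColl = Stream.
implicit val streamOps: WordOps[Stream] = new WordOps[Stream] {
  def flatMapColl[A, B](as: Stream[A])(f: A =&amp;gt; Iterable[B]): Stream[B] =
    as.flatMap(f)
  def countByValue[A](as: Stream[A]): Map[A, Long] =
    as.groupBy(identity).map { case (a, occ) =&amp;gt; (a, occ.size.toLong) }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Because the abstraction sticks to the shape of the &lt;code&gt;RDD&lt;/code&gt; API, the &lt;code&gt;wordCounts&lt;/code&gt; logic and its properties can be exercised against &lt;code&gt;Stream&lt;/code&gt; without ever starting a &lt;code&gt;SparkContext&lt;/code&gt;.&lt;/p&gt;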

&lt;p&gt;I would love to hear your thoughts on this approach. Do you think it's worth defining the biggest typeclass ever and reimplementing the &lt;code&gt;RDD&lt;/code&gt; logic for Scala collections for test purposes? Please, if this looks interesting, do try it out. I am always interested in feedback and in contributions of all kinds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dwestheide.github.io/kontextfrei/index.html"&gt;kontextfrei project website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dwestheide/kontextfrei"&gt;kontextfrei GitHub repo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dwestheide/kontextfrei-wordcount"&gt;kontextfrei wordcount example&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>spark</category>
      <category>scala</category>
      <category>showdev</category>
      <category>tdd</category>
    </item>
    <item>
      <title>The Empathic Programmer</title>
      <dc:creator>Daniel Westheide</dc:creator>
      <pubDate>Tue, 07 Feb 2017 08:42:20 +0000</pubDate>
      <link>https://forem.com/dwestheide/the-empathic-programmer</link>
      <guid>https://forem.com/dwestheide/the-empathic-programmer</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally posted on &lt;a href="http://danielwestheide.com/blog/2017/01/16/the-empathic-programmer.html"&gt;Daniel Westheide's blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In 1999, Andrew Hunt and Dave Thomas, in their seminal book, demanded that programmers be &lt;a href="https://www.goodreads.com/book/show/4099.The_Pragmatic_Programmer"&gt;pragmatic&lt;/a&gt;. Ten years later, Chad Fowler, in his excellent book on career development, asked programmers to be &lt;a href="https://www.goodreads.com/book/show/6399113-the-passionate-programmer"&gt;passionate&lt;/a&gt;. Even today, I still consider a lot of the advice in both of these books to be incredibly valuable, especially Fowler's book that helped me a lot, personally.&lt;/p&gt;

&lt;p&gt;Nevertheless, in recent years, I have witnessed again and again that one other quality in programmers is at least as important and that it hasn't even seen a fraction of the attention it deserves. The programmer we should all strive to be is &lt;em&gt;the empathic programmer&lt;/em&gt;. Of course, I am not the only one, let alone the first one, to realize that. For starters, in my bubble, Benjamin Reitzimmer wrote an excellent post about what he considers to be &lt;a href="http://squeakyvessel.com/2015/05/12/mature-developers/"&gt;important qualities of a mature developer&lt;/a&gt; a while ago, and empathy is one of them. I consider a lack of empathy to be the root cause for some of the biggest problems in our industry and in the tech community. In this post, I want to share some observations on how a lack of empathy leads to problems. Consider it a call to strive for more empathy.&lt;/p&gt;

&lt;p&gt;So what is empathy? Here is a &lt;a href="https://www.merriam-webster.com/dictionary/empathy"&gt;definition from Merriam-Webster&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the action of understanding, being aware of, being sensitive to, and vicariously experiencing the feelings, thoughts, and experience of another of either the past or present without having the feelings, thoughts, and experience fully communicated in an objectively explicit manner; also: the capacity for this&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Empathy at the workplace
&lt;/h2&gt;

&lt;p&gt;It shouldn't come as a surprise that the ability to show empathy can come in handy in any kind of job that involves working with other people, including the job as a programmer. This is true even if you work remotely – the other messages you see in your Slack channels are not &lt;em&gt;all&lt;/em&gt; coming from bots. There are actual human beings behind them.&lt;/p&gt;

&lt;p&gt;One of the situations where we often forget to think about that is code reviews. Just writing down what is wrong with a pull request without thinking about tone can easily leave its creator feeling personally offended. April Wensel has some &lt;a href="http://engineering.usertesting.com/2016/02/3-common-code-review-pitfalls/"&gt;good advice&lt;/a&gt; on code reviews. What's crucial is to develop some sensitivity for how your words will be perceived by the receiver, which requires you to put yourself in their shoes, see through their eyes, and reflect on how they will feel. The better you know the person, the easier this is; otherwise, you will have to make some assumptions – but even that is far better than not reflecting at all on how the other person will feel.&lt;/p&gt;

&lt;p&gt;Another workplace situation where I have often seen a lack of empathy is when members of two different teams need to collaborate to solve a problem or get a feature done. In some companies, I have seen an odd, competitive "us versus them" attitude between teams. This phenomenon has been &lt;a href="https://en.wikipedia.org/wiki/Ingroups_and_outgroups"&gt;explored by social and evolutionary psychologists&lt;/a&gt;, and while such behaviour might still be in our nature, that doesn't mean we cannot try to overcome it. A variant of "us versus them" is "developers versus managers". We developers have a hard time understanding why managers do what they do, but frankly, we often don't try very hard. I have often seen developers adopt a very defensive stance towards managers, and of course, the relationship between managers and developers in these cases was rather chilly. Getting to know "the other side" would certainly help us empathize with managers. Understanding why they act in a specific way is absolutely necessary in order to get to a healthy relationship with them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Empathy in the tech community
&lt;/h2&gt;

&lt;p&gt;Empathy is not only important at your workplace, but also very much so when you are interacting with others in our community – be it on mailing lists, at conferences, or when communicating with users of your open source library or with developers of an open source library you are using. In some of these situations, a lack of empathy can strengthen exclusion, ultimately leading to a closed community that is perceived as elitist and arrogant.&lt;/p&gt;

&lt;p&gt;As a developer using an open source library, empathize with the developers of the library before you start complaining about a bug – or, better yet, about a missing feature. Sam Halliday wrote an interesting post called &lt;a href="https://medium.com/@fommil/the-open-source-entitlement-complex-bcb718e2326d#.qtqmmnul7"&gt;The Open Source Entitlement Complex&lt;/a&gt;. It's hard to believe, but apparently, many users of open source libraries have the attitude that the developers of these libraries are some kind of service provider, happily working for free to do exactly what you want. That is not how it works. Just as wording and tone are important in code reviews, try to empathize with the developers who spend their free time on the library you use. Serving you and helping you out because you didn't read the documentation is probably not their highest priority in life, so don't treat them as if it were.&lt;/p&gt;

&lt;p&gt;On the other hand, when presenting your open source library to potential users, consider how these people will feel about that presentation. Does it make them feel respected? Does it make them feel welcome? I am sorry to disappoint you, but I think that a foo bar "for reasonable people" does not have that effect. Personally, I find this to be very condescending and think it will intimidate a lot of people and turn them away. It implies that any other way than yours is &lt;em&gt;not&lt;/em&gt; reasonable, and that, hence, people who have not used your library yet, but some different approach, are unreasonable people. As library authors, let's show some empathy as well towards our potential users. As always in tech, there is no silver bullet, and there are trade-offs. There are probably perfectly good reasons why someone has been using a different library so far, and maybe even after looking at your library, there will still be good reasons not to use yours. Even if you are convinced that your library is so much better, you aren't exactly creating an open and welcoming atmosphere by basically telling people visiting your project page that they are unreasonable for using anything else.&lt;/p&gt;

&lt;p&gt;If you are at a tech conference and you ask women whether they are developers at the very beginning of a conversation, but don't do the same with men, you are probably not doing that out of malice, but because you don't see many women at tech conferences who are actually developers. Nevertheless, to the receiver, this seemingly harmless and neutral question doesn't come across like that at all. She has probably heard it many times, and constantly hearing doubts about whether you are really a programmer doesn't exactly make you feel welcome, or confident. Show some empathy when you talk to other people at tech conferences. Imagine what it would be like to constantly be doubted, for example. If you don't see a need for being inclusive, that's probably because you had no problem being included in the community. This likely means that you are a man, and probably white. Since most people around you are like you, chances are you don't even know any women or other less privileged people who are developers. The problem with being privileged is that you don't notice it. Talk to women at conferences and let them tell you about their experiences. By showing empathy, you can create a more welcoming environment of inclusion and foster diversity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;These are my two cents about empathy, and the lack thereof, in the tech community, and how it relates to inclusion and diversity. Empathy is important not only at the workplace, when interacting with co-workers, but also when we participate in the tech community as conference visitors, open source developers, and users of open source libraries. Only by showing empathy can we create an inclusive and open community. Let's try to be more aware of the effects we have on each other, and act accordingly. Thanks!&lt;/p&gt;

</description>
      <category>diversity</category>
      <category>inclusion</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
