Forem: Bill Schneider

Readability with break statements

Bill Schneider — Tue, 06 Nov 2018 19:52:06 +0000

There is a widespread belief that break and continue statements are a bad programming practice. See this StackOverflow thread for a discussion. They can make code less readable if they make the intent less clear. In some cases, though, I believe they are actually better than the alternatives and can improve readability.

Here's a recent example from working with AWS in Python.

import time

import boto3

ssm = boto3.client("ssm")
response = ssm.send_command(...)
command_id = response["Command"]["CommandId"]
while True:
  invocation = ssm.get_command_invocation(... command_id ...)
  # Any status other than Pending/InProgress means the command has finished.
  if invocation['Status'] not in ['Pending', 'InProgress']:
    break
  time.sleep(5)

This code sends a command to AWS SSM, then polls the status of the invocation in a loop, pausing five seconds between attempts and exiting once the command is no longer pending or in progress.

Note that the only way to terminate this loop is the break statement.

I thought of a few alternatives to avoid a break but actually like them less.

One option is to bootstrap an initial request outside the loop, so you can include the status directly on the while condition:

invocation = ssm.get_command_invocation(... command_id ...)
while invocation['Status'] in ['Pending', 'InProgress']:
  time.sleep(5)
  invocation = ssm.get_command_invocation(... command_id ...)

I don't like this as much because of the duplication.

Another way to address this is by using a flag to indicate completion:

done = False
while not done:
  invocation = ssm.get_command_invocation(... command_id ...)
  done = invocation['Status'] not in ['Pending', 'InProgress']
  time.sleep(5)

No duplication here, but the flag variable is extra clutter, and is not really that much more readable than the break version, because it's not obvious from the while statement itself what condition will cause the loop to terminate. Also, even if the loop completes on the first iteration, you still have to sleep before exiting the loop -- unless you add a break, in which case the flag is useless.

Compared to the other options, the original version with the while True / break feels like the least bad, even if that conclusion is unintuitive.
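
To see the while True / break shape run end to end without AWS, here is a minimal sketch. The wait_for_completion helper, its get_status parameter, and the fake status sequence are assumptions made up for illustration, not part of boto3; in real code get_status would wrap the ssm.get_command_invocation call.

```python
import time

PENDING_STATUSES = ("Pending", "InProgress")


def wait_for_completion(get_status, poll_seconds=5, sleep=time.sleep):
    """Poll get_status() until it reports a status that is not pending."""
    while True:
        status = get_status()
        if status not in PENDING_STATUSES:
            break  # the only exit from the loop
        sleep(poll_seconds)
    return status


# Hypothetical stand-in for the SSM call: two pending polls, then success.
fake_statuses = iter(["Pending", "InProgress", "Success"])
naps = []  # records requested sleeps instead of actually sleeping
final_status = wait_for_completion(lambda: next(fake_statuses), sleep=naps.append)
print(final_status, naps)
```

Passing sleep in as a parameter is only there so the sketch runs instantly; the loop body is otherwise the same as the original.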

Spring Boot listener for AWS SQS with Spring Cloud

Bill Schneider — Tue, 18 Sep 2018 01:36:14 +0000

This originally appeared on my personal blog

I was surprised how little code I needed to get a Spring Boot application listening to an Amazon SQS queue.

I put a gist on GitHub to illustrate.

The key is that when you have the right dependencies in your Maven POM, all you have to do is annotate your listener method with @SqsListener:

@SqsListener("your-queue-name")
public void listen(DataObject message) {
    LOG.info("!!!! received message {} {}", message.getFoo(), message.getBar());
}

The dependency on spring-cloud-starter-aws takes care of initializing everything and scanning for annotated methods.

Command line arguments and authentication

You specify AWS credentials and region through Spring Boot properties. I passed these as command line arguments through Eclipse, where I was debugging locally / not on AWS:

--cloud.aws.region.static=us-east-1 sets my region to US East 1 (Northern VA).
--cloud.aws.credentials.useDefaultAwsCredentialsChain=true tells Spring Boot to use the AWS DefaultAWSCredentialsProviderChain, which will pull credentials from either environment variables or the ~/.aws/credentials file.
I also set the environment variable AWS_PROFILE so the default credentials chain would find my credentials under the correct profile.

If I were running in AWS itself, the EC2 instance metadata could have determined the region automatically, and also provided credentials via the instance profile.

Testing via AWS console

I used the AWS console to send test messages. The main gotcha is that if you are using JSON messages and using Spring to automatically deserialize JSON to your objects via @JsonProperty annotations, you will need to specify the message attribute (header) contentType with value application/json. Otherwise the conversion will fail with an unhelpful error message like "Cannot convert from [java.lang.String] to .... for GenericMessage ...." with no indication why there was a failure.

Alternatively, you can reconfigure the default Spring messaging classes to ignore the contentType header.
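
If you would rather send test messages from a script than from the console, the same contentType attribute has to travel with each message. Here is a minimal Python sketch that builds the arguments for boto3's sqs.send_message call. The build_sqs_message helper, the queue URL, and the foo/bar payload keys are assumptions for illustration; the real JSON property names depend on your @JsonProperty annotations.

```python
import json


def build_sqs_message(queue_url, payload):
    """Build keyword arguments for boto3's sqs.send_message call."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(payload),
        # Without this attribute, the listener may fail to convert the
        # JSON body into the target object.
        "MessageAttributes": {
            "contentType": {
                "DataType": "String",
                "StringValue": "application/json",
            }
        },
    }


# Hypothetical queue URL and payload, for illustration only.
kwargs = build_sqs_message(
    "https://sqs.us-east-1.amazonaws.com/123456789012/your-queue-name",
    {"foo": "hello", "bar": 42},
)
print(json.dumps(kwargs, indent=2))
# With credentials configured: boto3.client("sqs").send_message(**kwargs)
```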

Spark UDFs to migrate from other SQL dialects

Bill Schneider — Mon, 08 Jan 2018 15:20:24 +0000

This article originally appeared on my blog

I found it helpful to create Spark UDFs to make it easier to migrate logic in SQL from another database like SQL Server.

SQL Server defines several string functions like LEN, REPLACE and CHARINDEX, which are not available in Spark by default. Fortunately these are easy to implement in Spark with UDFs:

spark.udf.register("len", (s: String) => s.length())
spark.udf.register("replace", (orig: String, toReplace: String, replaceString: String) => orig.replace(toReplace, replaceString))
spark.udf.register("charindex", (substring: String, str: String, startPos: Int) => str.indexOf(substring, startPos - 1) + 1)

These are all thin wrappers around the JVM String methods that Scala exposes. These will now be available for use in Spark SQL queries:

select len('foo bar') as len_test,
     replace('foo bar baz', 'bar', 'quux') as replace_test,
     charindex('.', '1.2.3', 3) as idx_should_be_4,
     charindex('.', '1.2.3', 0) as idx_should_be_2,
     charindex('@', '1.2.3', 0) as idx_should_be_0

This makes it a little easier to copy-paste queries from SQL Server to Spark, if the syntax is otherwise standard.

The one downside is that Spark UDFs are functions, not methods, and as such do not allow for default argument values. So you would have to explicitly add a 0 as the third argument for CHARINDEX wherever it's missing.
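
The 1-based index arithmetic in charindex is easy to get wrong, so it can help to check it outside Spark. Here is a plain Python mirror of the same logic; it is not a Spark UDF, just a sketch for verifying the three test cases above. The default value for start_pos shows what the registered UDF cannot express.

```python
def charindex(substring, s, start_pos=0):
    """Return the 1-based position of substring in s, or 0 if absent."""
    # Java's indexOf treats a negative start as 0. Python's str.find would
    # count a negative start from the end of the string, so clamp it here.
    start = max(start_pos - 1, 0)
    return s.find(substring, start) + 1


print(charindex(".", "1.2.3", 3))  # 4
print(charindex(".", "1.2.3"))     # 2
print(charindex("@", "1.2.3"))     # 0
```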

EC2 proxy to RDS for a static IP address

Bill Schneider — Fri, 05 Jan 2018 18:58:42 +0000

This post originally appeared on my blog

RDS instances in AWS do not get a static IP address. This is usually a good thing, not a problem. This provides flexibility to preserve availability while the physical RDS host may shift around for resizing, or failing over to a different availability zone (AZ). In either case, clients connect to RDS by hostname, and AWS magically updates the hostname to point at the IP address for the currently active host.

The only time this creates a challenge is when you want to connect to RDS from a private/corporate network and have to update firewall or VPN tunnel configuration to allow connections to RDS. If this isn't an issue for you, you can stop reading this.

The problem is that firewall rules, VPN tunnels, and NAT rules all work on IP addresses, not hostnames. You can't configure your firewall to unblock traffic to an RDS instance if you don't know its IP address.

The workaround I found was to put an EC2 server in front of RDS as a TCP proxy. You can give a static IP address to an EC2 instance with the PrivateIpAddress property of AWS::EC2::Instance, or with an AWS::EC2::EIPAssociation resource for a static publicly-routed IP. Then you use that EC2 instance to forward traffic on to the RDS instance by hostname. The EC2 instance's IP address then becomes the database's static IP for firewall purposes.

There are lots of different ways you can forward traffic from EC2 to RDS. You can pick whichever one best suits you:

socat: e.g., socat TCP-LISTEN:[port],fork,reuseaddr TCP:[hostname]:[port].
- Pros: simple and convenient, easy to install with yum install socat.
- Cons: not widely known; forks a process per connection so not good for high volume
haproxy
- Pros: Robust, scalable
- Cons: AWS Linux packages do not include latest version with runtime DNS resolution; must be built from source, and even then requires some contortions to resolve hostnames at runtime.
nginx
- Pros: you might already be using it
- Cons: overkill for a port forwarder
ssh port forwarding
- Pros: widely understood, ssh/sshd already installed by default
- Cons: requires establishing and authenticating an SSH connection, which is overkill when you only want port forwarding

On security group setup: Put the EC2 instance and RDS instance in two different security groups, and then those security groups can refer to each other. This is a perfect use case for CloudFormation's AWS::EC2::SecurityGroupIngress and AWS::EC2::SecurityGroupEgress resources ("typically to allow security groups to reference each other"). Since you don't know the RDS instance's IP address, you can refer to the RDS instance's security group. The EC2 security group would have an egress rule to the RDS security group, and the RDS security group would have an ingress rule from the EC2 security group.

It's a good idea to otherwise lock down the EC2 security group. The EC2 instance should only allow outbound access to the target RDS instance, and DNS for hostname resolution. The RDS instance should only allow inbound access through the EC2 proxy.

Other CloudFormation tips: Keep the EC2 instance in a stack by itself so it can be rebuilt independently. You can make cross-stack references to the RDS instance from the EC2 stack. Also, you can put a script in the UserData property in EC2 to inject the RDS hostname (from the cross-stack reference) into Upstart config files (/etc/init/your-proxy-service.conf) so your proxy service will start automatically on boot and refer to the correct hostname.
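
To make the forwarding idea concrete, here is a minimal sketch of what a TCP proxy like socat does, written in Python. It is an illustration under assumptions, not a replacement for the tools listed above: the function names are made up, it uses one thread pair per connection (so, like socat's fork mode, it is not meant for high volume), and it has no logging or timeouts.

```python
import socket
import threading


def pipe(src, dst):
    """Copy bytes from src to dst until src reaches end of stream."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass  # either side went away; stop copying
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)  # pass the end-of-stream along
        except OSError:
            pass


def handle(client, target_host, target_port):
    """Relay one client connection to the target and back."""
    try:
        # The hostname is resolved here, once per connection, so new
        # connections follow the DNS record if the target's IP changes.
        upstream = socket.create_connection((target_host, target_port))
    except OSError:
        client.close()
        return
    with client, upstream:
        outbound = threading.Thread(
            target=pipe, args=(client, upstream), daemon=True
        )
        outbound.start()
        pipe(upstream, client)
        outbound.join()


def start_proxy(listen_host, listen_port, target_host, target_port):
    """Start accepting in a background thread; return the listening socket."""
    server = socket.create_server((listen_host, listen_port))

    def accept_loop():
        while True:
            try:
                client, _ = server.accept()
            except OSError:
                return  # listening socket was closed
            threading.Thread(
                target=handle,
                args=(client, target_host, target_port),
                daemon=True,
            ).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return server
```

On the EC2 instance you would call start_proxy with the listen address and port your firewall allows, plus the RDS hostname and port, and keep the process alive. Resolving the hostname per connection is the property that matters here, since it is what lets the proxy follow RDS when its IP address changes.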

Learning Scala for Spark, and the apply method

Bill Schneider — Wed, 27 Dec 2017 14:55:36 +0000

This article originally appeared on my blog

Sometimes in Spark you will see code like

val df1 = ...
val df2 = ...
val df3 = df1.join(df2, df1("col") === df2("col"))

It is a little odd at first to use DataFrame objects like methods.

What's going on here?

In Scala, any object can define an apply method, which allows it to be invoked like a method: obj(foo) is equivalent to obj.apply(foo). DataFrame's apply method is the same as col, so df("col") is equivalent to df.col("col").

This is also related to why you can create instances of case classes without new -- a case class defines a companion object with the same name, and that companion object has an apply method that returns new ClassName().

Personally I haven't learned to like Scala's apply feature, because it's not entirely obvious what obj(foo) is supposed to do. But in this case, it makes sense to have shortcuts like that when I'm thinking of Scala as a DSL for Spark.
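
For readers coming from Python, the closest analogy to Scala's apply is the __call__ method. The sketch below is a toy, not Spark's API: Frame and Column are hypothetical stand-ins that only show how frame("col") can be made to delegate to frame.col("col").

```python
class Column:
    """Toy column reference holding just a name."""

    def __init__(self, name):
        self.name = name


class Frame:
    """Toy stand-in for a DataFrame; not Spark's API."""

    def __init__(self, column_names):
        self.column_names = list(column_names)

    def col(self, name):
        if name not in self.column_names:
            raise KeyError(name)
        return Column(name)

    def __call__(self, name):
        # Python's counterpart to Scala's apply: frame(name) calls this.
        return self.col(name)


df = Frame(["col", "other"])
print(df("col").name == df.col("col").name)  # True
```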

Readability analogy in music

Bill Schneider — Sat, 16 Dec 2017 12:33:01 +0000

This article originally appeared on my blog

In music, you can often write the same note two different ways; for example, B-flat and A-sharp correspond to the same key on a piano keyboard. Which one you use depends on the surrounding context. A chord C/E/G/B-flat is a C dominant 7th and resolves to an F chord. The same chord written as C/E/G/A-sharp is an augmented 6th and resolves to B major. So which way the chord is written tells you something about where it's going next.

The other day, I saw music with an augmented 6th chord written as a dominant 7th, and I found it confusing to look at a sequence of notes like A-natural, A-flat, A-natural. Given the first two notes in that sequence, you usually expect the third note to be G.

So what does this have to do with code?

Readability matters.

With code, you can often get the same end result multiple ways. It's important for your code to look like what it does, so anyone reading it will be able to understand it. Since we spend at least 90% of our time reading code (the Uncle Bob figure), focusing on readability will improve productivity.

Poorly named methods are the equivalent of that A-flat that should have been written as a G-sharp: it will sound (or work) the same, but you're making the reader work harder than they should have to.

Learning Scala for Spark, or, what's up with that triple equals?

Bill Schneider — Mon, 11 Dec 2017 22:48:34 +0000

This article originally appeared on my blog

I began to learn Scala specifically to work with Spark. The sheer number of language features in Scala can be overwhelming, so I find it useful to learn Scala features one by one, in the context of specific use cases. In a sense I'm treating Scala like a DSL for writing Spark jobs.

Let's pick apart a simple fragment of Spark-Scala code: dataFrame.filter($"age" === 21).

There are a few things going on here:

The $"age" creates a Spark Column object referencing the column named age within a DataFrame. The $ operator is defined in an implicit class StringToColumn. Implicit classes are a similar concept to C# extension methods or mixins in other dynamic languages. The $ operator is like a method added on to the StringContext class.
The triple equals operator === is not built into Scala itself; libraries such as Cats and ScalaTest define it as a type-safe equals operator, loosely analogous to strict equality in Javascript. Spark defines === as a method on Column that creates a new Column object representing the comparison of the Column on the left with the object on the right; that new Column evaluates to a boolean for each row. Because double-equals (==) is final in Scala and must return a Boolean, Spark cannot redefine it to return a Column and uses the triple equals instead.
The dataFrame.filter method takes an argument of Column, which defines the comparison to apply to the rows in the DataFrame. Only rows that match the condition will be included in the resulting DataFrame.

Note that the actual comparison is not performed when the above line of code executes! Spark methods like filter and select -- including the Column objects passed in -- are lazy. You can think of a DataFrame like a query builder pattern, where each call builds up a plan for what Spark will do later when an action like show or write is invoked. It's similar in concept to something like IQueryable in LINQ, where foo.Where(row => row.Age == 21) builds up a plan and an expression tree that is later translated to SQL when rows must be fetched, e.g., when ToList() is called.
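
The two ideas above, an overloaded comparison that returns an expression and a filter that only records a plan, can be sketched in a few lines of Python. Everything here (Column, Predicate, Frame, collect) is a hypothetical toy and not Spark's implementation. Note one difference from Scala: Python allows == itself to be overloaded to return a non-boolean, so no triple equals is needed in this sketch.

```python
class Predicate:
    """A recorded comparison, evaluated later against each row."""

    def __init__(self, column_name, value):
        self.column_name = column_name
        self.value = value

    def matches(self, row):
        return row[self.column_name] == self.value


class Column:
    """Toy column reference; comparing it builds a Predicate, not a bool."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        return Predicate(self.name, value)


class Frame:
    """Toy lazy frame: filter records a plan, collect runs it."""

    def __init__(self, rows, predicates=()):
        self.rows = rows
        self.predicates = tuple(predicates)

    def filter(self, predicate):
        # No rows are examined here; the predicate is only recorded.
        return Frame(self.rows, self.predicates + (predicate,))

    def collect(self):
        # The recorded plan runs only now.
        return [
            row for row in self.rows
            if all(p.matches(row) for p in self.predicates)
        ]


people = Frame([{"name": "Ann", "age": 21}, {"name": "Bob", "age": 30}])
condition = Column("age") == 21
twenty_one = people.filter(condition)
print(type(condition).__name__)  # Predicate
print(twenty_one.collect())
```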

Measuring AWS Redshift Query Compile Latency

Bill Schneider — Mon, 18 Sep 2017 15:14:45 +0000

This article originally appeared on my blog

AWS is transparent that Redshift's distributed architecture entails a fixed cost every time a new query is issued. The documentation says the impact "might be especially noticeable when you run one-off (ad hoc) queries."

I went deeper to try to quantify exactly what "noticeable" means.

To isolate query compilation from the effects of data cache hits and misses, I ran a bunch of queries on empty tables so there is no data to load or cache. Each query was slightly modified, by changing the columns or aggregate functions, to trigger a recompilation.

I found that the compile latency scales with the complexity of the query.

Simple query: usually between 1 and 1.5 seconds, with an outlier around 3 seconds. Example of a simple query:

select sum(a1) from foo where a2 = 1;
select sum(a2) from foo where a3 = 1;
-- etc.

More complex query with more conditions, and group-by: usually around 2-3 seconds. Example of a query in this category:

select a8, a9, sum(a1), sum(a2)
from foo
where foo.a3 > 10 and foo.a4 < foo.a5
group by a8, a9;

Even more complex, with joins and group-by: average around 5 seconds, ranging from 3 to 7 seconds. Example query:

select s, s2, count(a6), sum(a7)
from foo
join bar on bar.a = foo.a6
join baz on baz.b = foo.a7
where foo.a3 = 1 and baz.s2 is not null
group by s, s2;
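
The post does not say how the timings were collected, so the following is only one way such a measurement might be scripted. The time_queries and variants helpers are assumptions for illustration; execute stands for whatever runs a statement against the cluster, such as a database cursor's execute method, and is replaced by a no-op here so the sketch runs anywhere.

```python
import time


def variants(template, columns):
    """Produce one query per column so each has different text to compile."""
    return [template.format(col=c) for c in columns]


def time_queries(execute, queries, clock=time.perf_counter):
    """Run each query once and return the elapsed seconds per query."""
    timings = []
    for sql in queries:
        started = clock()
        execute(sql)
        timings.append(clock() - started)
    return timings


queries = variants("select sum({col}) from foo where a2 = 1;", ["a1", "a3", "a4"])
# Stand-in for a real execute call; swap in your own connection's method.
timings = time_queries(lambda sql: None, queries)
for sql, seconds in zip(queries, timings):
    print(f"{seconds:.3f}s  {sql}")
```

Against empty tables, as described above, the elapsed time for each variant would be dominated by compilation rather than data access.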

What does agile development have in common with amateur theater?

Bill Schneider — Mon, 18 Sep 2017 00:52:37 +0000

This article originally appeared on my blog

Working in an agile development environment, I noticed some parallels to my experiences with student theater several decades ago.

In both cases, you never have enough time to get your production / release perfect. In theater, your dates are fixed in advance, and you work within that constraint. Your production has to be "releasable" in the sense that you are expected to perform a whole show start-to-finish, and you have to accept that some fine-tuning just won't get done. You have some flexibility on how elaborate you make your staging, scenery, costumes, etc., and you do the best you can with the resources and the time you have. In software, you commit to a release schedule, and you scope your releases to what you can get done within that schedule. It's better to drop features than to delay the release.

Another common concept is progressive refinement. The idea is to build the big picture in broad strokes first, then come back to fill in details. (Think about how JPEGs look while downloading.) The first thing you do when you start rehearsals is read through the whole script, start-to-finish. No matter what, everyone has to know their lines and music. Then you start adding staging, bits at a time. Early into the rehearsal period you would be able to do a minimalist performance--you wouldn't have full staging or sets, but it would be something. In software, this would be like defining the full product vision, then building out enough critical features to release an early MVP.

On the people side, theater productions tend to have distinct roles. Producers are responsible for publicity and ticket sales, directors and music directors are responsible for making sure performers know where to stand and how to sound, etc. These roughly correspond to software team roles like product manager, development manager, and technical lead -- there is a separation of responsibility between commercial success (producer/product manager) and the day-to-day management of rehearsals/development (directors/managers). In an amateur/student group people usually wear multiple hats, similar to an agile or startup environment--everyone is a stakeholder in overall success and pitches in where needed--but there are clear affinities.

Finally, on people management: it is important to have the right people on the team. In both cases, teams that are excited to work together will feed off each other and can outperform the sum of their parts. On the flip side, someone who looks great on paper might not be a good fit for your organization, and will often disappoint. In both cases, I learned this the hard way.

Opinions on truthiness across languages

Bill Schneider — Thu, 20 Jul 2017 13:48:22 +0000

A version of this article originally appeared on my GitHub pages blog

Different languages have different opinions about what to treat as "truthy" or "falsy" when using a non-boolean object as an expression inside an if statement.

I looked at Python, Groovy, Javascript and Ruby to compare their differences.

Null is always falsy
Zero and empty strings are falsy, except in Ruby
Empty collections (set/list/dict) are falsy in Python and Groovy but not Javascript or Ruby

My observations and personal opinions on language design:

Python treats zero, empty strings and collections all as 'falsy'. Personally, I find this the most intuitive convention.
- Treatment of zero and null as falsy has historical precedent, from C. False and null pointers are both represented as zeros in a register or memory location.
- Treatment of empty strings and collections is a nice convenience, given the number of times I've written conditionals like if (foo != null and !foo.empty()). It's usually the exception that I want to distinguish between null and empty in a conditional. So it's nice that if (foo) handles the common case, then I can write if (foo is not None) when I really do want to distinguish null.
- Treatment of empty string as similar to null feels familiar from my Oracle experience. Also, it's consistent with treatment of an empty collection.
Groovy is inspired by Python and adopts similar conventions for truthiness.
Ruby takes a different opinion that all values are truthy except nil (and false, of course). While it's not my personal preference, it's defensible and self-consistent.
Javascript can reliably be expected to deliver a WTF. Javascript treats zero and empty strings as falsy, but empty collections as truthy. To me, it's hard to understand why strings and collections ought to behave differently; the Python behavior makes much more sense. But wait, it gets even better: check out this link on StackOverflow.
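
The Python rules described above are easy to confirm directly. This short sketch uses only built-ins and prints how each value behaves in a boolean context, then shows the "is not None" check for the case where empty and null need to be told apart.

```python
values = {
    "None": None,
    "zero": 0,
    "empty string": "",
    "empty list": [],
    "empty dict": {},
    "empty set": set(),
    "non-empty list": [0],
}

# bool() applies the same rule an if statement uses.
truthiness = {label: bool(value) for label, value in values.items()}
for label, result in truthiness.items():
    print(f"{label}: {'truthy' if result else 'falsy'}")

foo = []
print(not foo, foo is not None)  # True True: empty, but not None
```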

Balancing early and later project risks

Bill Schneider — Mon, 16 Jan 2017 12:44:21 +0000

One of the things I liked about this post on "Senior Engineers Reduce Risk" is how it called out two different kinds of project risks:

Early in a project lifecycle, the biggest risk is building the wrong thing
Later in the project lifecycle, once you know you're building the right thing, the "-ilities" (scalability, maintainability, etc.) become bigger risks

The author's point is that senior engineers need to help identify and mitigate these risks.

One additional responsibility of a senior engineer, in my opinion, is to understand the tradeoffs between these kinds of risks, and how to balance those tradeoffs. This is tricky because, to paraphrase Yogi Berra, predictions are hard--especially about your future user load or revenue. You can think of this like type 1/type 2 error (false positive/negative) in hypothesis testing:

Type 1 error: premature optimization/generalization/etc. You spend time scaling something that doesn't sell, or designing a generic platform that only gets used once.
Type 2 error: technical debt. By the time you realize you have a scaling problem it's too late, and your users end up unhappy. Or, your lack of CI processes and tests slows down future releases.

The type 1 vs. type 2 metaphor assumes you have constrained resources - an engineering hour spent on scaling is an hour not spent on prototyping to get feedback from users. So reducing one kind of risk will increase the other kind of risk and vice versa.

Given that both kinds of error are bad, what do you do? You have to balance the possible outcomes from these risks, and prioritize based on what's more important to you, and this is context-dependent. A senior engineer should know how to reach out and communicate with business stakeholders to figure out the right balance, telling a good story about the risks that may not be immediately evident to non-technical team members. A senior engineer will have lived through both kinds of errors and can draw from their past experience in their storytelling.

My own personal opinion: after living through projects with both type 1 and type 2 errors, I would rather take type 2 over type 1 most of the time. 37signals sums this up with the mantra "It's a problem when it's a problem". The catch is you have to be disciplined enough to identify and communicate future risks, and have a plan to address them if and when they become issues. It can be OK to defer scaling if and only if it is a deliberate, conscious tradeoff to prioritize something else, so there are no surprises later.

This is also why "debt" is a good metaphor. In personal finance, some kinds of debt are good because they help reach a strategic goal: buying a house, getting an education, starting a business. Other kinds of debt are bad: racking up credit card balances without a plan to pay them off. Similarly, deferring some "-ilities" in pursuit of a higher priority business goal can be a good thing, while ignoring them outright is bad.