<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: jspiliot</title>
    <description>The latest articles on Forem by jspiliot (@jspiliot).</description>
    <link>https://forem.com/jspiliot</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1143858%2F192b2aee-1201-448a-ae6a-12d990ed4794.jpeg</url>
      <title>Forem: jspiliot</title>
      <link>https://forem.com/jspiliot</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jspiliot"/>
    <language>en</language>
    <item>
      <title>Does size matter in Pull Requests: Analysis on 30k Developers</title>
      <dc:creator>jspiliot</dc:creator>
      <pubDate>Fri, 24 Nov 2023 17:12:01 +0000</pubDate>
      <link>https://forem.com/adadot/does-size-matter-in-pull-requests-analysis-on-30k-developers-1aef</link>
      <guid>https://forem.com/adadot/does-size-matter-in-pull-requests-analysis-on-30k-developers-1aef</guid>
<description>&lt;p&gt;At one point or another you might have found yourself putting up a Pull Request for review that was significantly bigger than you expected it to be. And you found yourself wondering:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“How big should it really be? Is there a sweet spot for the size of a review? If we could theoretically always fully control it, how big should we make it?”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You googled around, and you found a lot of resources, sites, and articles like this one, analysing the subject and ending up with something along the lines of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“Too few lines might not offer a comprehensive view of the changes, while an excessively large PR can overwhelm reviewers and make it challenging to identify issues or provide meaningful feedback”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And although you understood the sentiment of the writer, you also understood that the theoretical answer could only be vague, as there is no silver bullet. As always, life is more complicated than that.&lt;/p&gt;

&lt;p&gt;What we are going to do in this article, however, is something different:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;“We will analyze the PRs of ~30k developers to see how the size of PRs correlates with lead time, comments received and change failure, to try and find what statistically is the best size, as well as examine what affects it.”&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;u&gt;Disclaimer&lt;/u&gt;: For anyone who has played around with data, and especially anyone who has done courses or training in data, the above might bring back memories of the phrase &lt;em&gt;“Correlation does not mean causation”&lt;/em&gt;. First of all, hello to you my fellow scholar, and secondly, you are absolutely right. We will try to look at it from various angles, seeing how the correlation varies by company, by developer, by amount of code committed, and by any other dimension that might help us understand which values, for whatever reason, follow relevant patterns. However, these are “only” numbers and correlations; they do not explain the reasons behind them, so any assumptions about causes that we make are more anecdotal than scientifically backed.&lt;/p&gt;

&lt;h2&gt;Methodology&lt;/h2&gt;

&lt;h3&gt;Lead Time&lt;/h3&gt;

&lt;p&gt;In this case, we use as lead time the time between the earliest PR event (either the first commit or the PR being opened) and the moment the PR gets merged in.&lt;/p&gt;
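&lt;p&gt;Expressed in code, this definition might look like the sketch below (the function and argument names are illustrative, not the actual pipeline):&lt;/p&gt;

```python
from datetime import datetime, timedelta

def lead_time(first_commit_at, opened_at, merged_at):
    """Lead time: from the earliest PR event (first commit or PR open) to merge."""
    start = min(t for t in (first_commit_at, opened_at) if t is not None)
    return merged_at - start

# Example: the first commit predates the PR being opened, so it is the start.
lt = lead_time(
    first_commit_at=datetime(2023, 11, 1, 9, 0),
    opened_at=datetime(2023, 11, 1, 12, 0),
    merged_at=datetime(2023, 11, 3, 9, 0),
)  # 2 days
```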

&lt;h3&gt;Data Preparation&lt;/h3&gt;

&lt;p&gt;The following data points are removed as outliers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;PRs with a lead time of more than 6 months&lt;/li&gt;
&lt;li&gt;PRs with a lead time of less than 5 minutes&lt;/li&gt;
&lt;li&gt;File changes of more than 20k lines&lt;/li&gt;
&lt;li&gt;PRs with more than 20k lines changed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After this filtering, we are left with a few hundred thousand merged Pull Requests, which are used to produce the analysis below.&lt;/p&gt;
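&lt;p&gt;As a sketch, assuming a pandas DataFrame with hypothetical &lt;code&gt;lead_time_minutes&lt;/code&gt; and &lt;code&gt;total_line_changes&lt;/code&gt; columns, the filtering amounts to:&lt;/p&gt;

```python
import pandas as pd

SIX_MONTHS_MIN = 6 * 30 * 24 * 60  # approximating a month as 30 days

def remove_outliers(prs: pd.DataFrame) -> pd.DataFrame:
    """Keep PRs with 5 min <= lead time <= 6 months and <= 20k changed lines."""
    return prs[
        (prs["lead_time_minutes"] >= 5)
        & (prs["lead_time_minutes"] <= SIX_MONTHS_MIN)
        & (prs["total_line_changes"] <= 20_000)
    ]
```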

&lt;h3&gt;Algorithm&lt;/h3&gt;

&lt;p&gt;All correlations were computed using the Kendall tau method, a rank correlation which better estimates the strength of non-linear (but monotonic) relationships.&lt;/p&gt;
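&lt;p&gt;With SciPy it is a one-liner; the toy data below (invented for illustration) shows why the rank-based method suits this analysis:&lt;/p&gt;

```python
from scipy.stats import kendalltau

# A perfectly monotonic but non-linear relationship: Kendall's tau is 1.0
# because every pair of points is concordant, whereas Pearson's r would be
# well below 1 for the same data.
sizes = [10, 50, 200, 1000, 5000]   # hypothetical PR sizes (lines changed)
lead_hours = [1, 4, 9, 30, 300]     # hypothetical lead times (hours)
tau, p_value = kendalltau(sizes, lead_hours)
```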

&lt;h2&gt;How does Lead Time relate to PR size?&lt;/h2&gt;

&lt;p&gt;Before we dig deeper: intuitively, we expect the size of a PR to correlate in one way or another with its lead time, but is that actually the case? Running the correlation between the two variables for the whole dataset gives us the correlation matrix below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-2.png%3Fw%3D346" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-2.png%3Fw%3D346" alt="PR size to Lead Time correlation"&gt;&lt;/a&gt;PR size to Lead Time correlation&lt;/p&gt;

&lt;p&gt;From these numbers we can say that there is some correlation between the two variables, but it sits only a little above the threshold of statistical significance, meaning that: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Their correlation is there, but it is not very strong; maybe less than one would have expected.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It seems we’ll have to dig deeper to see why this correlation is so weak. Unfortunately, plotting total line changes against lead time makes things, if anything, less clear: although the trend suggests that PRs with higher lead times were slightly bigger on average, any link between the two is hard to see.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-9.png" alt="Total PR size to Lead Time"&gt;&lt;/a&gt;Total PR size to Lead Time&lt;/p&gt;

&lt;p&gt;Now, if we change this chart a bit by grouping the data points by day and taking the median of the total changes per day, we start to see more clearly how they relate, and a potential explanation for why their correlation is not that high.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-12.png" alt="Mean total PR size to daily Lead Time"&gt;&lt;/a&gt;Mean total PR size to daily Lead Time&lt;/p&gt;

&lt;p&gt;This suggests that at fast lead times PRs are consistently small in lines changed, and as PRs get bigger there is a linear increase in lead time. However, high lead times are produced by PRs of any size, which keeps the overall correlation between the two very low.&lt;/p&gt;
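&lt;p&gt;The daily grouping described above can be sketched with pandas (column names and data are invented for illustration):&lt;/p&gt;

```python
import pandas as pd

prs = pd.DataFrame({
    "lead_time_days": [0, 0, 1, 1, 2],
    "total_line_changes": [40, 60, 120, 180, 90],
})

# Bucket PRs by whole-day lead time and take the median size per day,
# which smooths out the huge per-PR variance seen in the raw scatter plot.
daily_median = prs.groupby("lead_time_days")["total_line_changes"].median()
```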

&lt;h2&gt;What is the best size?&lt;/h2&gt;

&lt;p&gt;To try and answer this question, we first have to ask ourselves what it is that matters to us, i.e. what we are trying to optimize for. That is a question with endless possible answers. For our purposes, we will look for the largest PR size that &lt;strong&gt;statistically&lt;/strong&gt; works well given these 3 wants:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Low lead time (aka be done fast)&lt;/li&gt;
&lt;li&gt;High number of comments (not too big to review properly)&lt;/li&gt;
&lt;li&gt;Low defects/reverts (aka we are not breaking things)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If we plot in a heatmap the probability of a PR getting done within a given number of weeks against its size, we get the below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-11.png" alt="Heatmap of probability of PR of a size (x axis) getting done in a number of weeks (y axis)"&gt;&lt;/a&gt;Heatmap of probability of PR of a size (x axis) getting done in a number of weeks (y axis)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Meaning that a PR of less than 100 lines of code has ~80% chance of getting done within the first week&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
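&lt;p&gt;A probability heatmap of this kind can be derived from a normalized cross-tabulation of size buckets against completion weeks, e.g. (bucket edges and data hypothetical):&lt;/p&gt;

```python
import pandas as pd

prs = pd.DataFrame({
    "total_line_changes": [30, 80, 60, 120, 900, 4000],
    "weeks_to_merge": [1, 1, 1, 2, 3, 5],
})

# Bucket sizes, then normalize each column so that a cell reads as
# P(merged in N weeks | size bucket).
size_bucket = pd.cut(prs["total_line_changes"], bins=[0, 100, 1000, 10_000])
heatmap = pd.crosstab(prs["weeks_to_merge"], size_bucket, normalize="columns")
```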

&lt;p&gt;A similar heatmap for the amount of comments gives us the below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-7.png" alt="Heatmap of probability of a PR of a size (x axis) getting an amount of comments (y axis)"&gt;&lt;/a&gt;Heatmap of probability of a PR of a size (x axis) getting an amount of comments (y axis)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Which means that a PR of 6000 lines of code has roughly the same probability of getting 0 review comments as a PR of less than 50 lines of code.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And finally, doing the same for reverts, depicting the probability of no commits from the PR getting reverted, gives us the below heatmap.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-4.png" alt="Probability of a PR of a size (x axis) not having to be reverted (0 reverts)"&gt;&lt;/a&gt;Probability of a PR of a size (x axis) not having to be reverted (0 reverts)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Which means that larger PRs generally have a higher probability of having some of their code reverted (i.e. being faulty)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If we now plot on the same graph the probability of completing a PR within the 1st week, the probability of getting at least 1 comment, and the probability of not having to revert a commit from that PR, we get the below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-8.png" alt="Probability (y axes) of a PR getting done in a week (blue), to have comments (green) and to be reverted (red) over lines of code"&gt;&lt;/a&gt;Probability (y axes) of a PR getting done in a week (blue), to have comments (green) and to be reverted (red) over lines of code&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Therefore, statistically, below ~400 lines of code per PR gives a good probability of getting some review comments, completing it within the first week and not having issues with the code.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Of course, that is only “statistically” the case. It surely depends on a lot of things. Let’s examine some potential ones.&lt;/p&gt;

&lt;h2&gt;Does it depend on the user?&lt;/h2&gt;

&lt;p&gt;We would expect it to vary per user, but how much it varies, whether that user is the author or the reviewer, is the more interesting question. After removing all users that have one or more of the below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less than 10 merged PRs&lt;/li&gt;
&lt;li&gt;Less than 10 commits&lt;/li&gt;
&lt;li&gt;Less than 100 lines of code changed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And all reviewers that have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less than 10 approved and merged PRs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We perform a correlation analysis between Lead Time and PR size per user. If we then put the results of the analysis on a histogram showing how many users had each correlation value, we get the below charts:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-3.png%3Fw%3D1024" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-3.png%3Fw%3D1024" alt="Histograms of Lead Time to PR size correlation per amount of unique PR authors (left) and PR reviewers (right)"&gt;&lt;/a&gt;Histograms of Lead Time to PR size correlation per amount of unique PR authors (left) and PR reviewers (right)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The correlation between Lead Time and PR size heavily depends on the PR author as well as the PR reviewer&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There is a wide range of reasons why that could happen: level of seniority, company/team process, coding language, review tool, etc. &lt;/p&gt;

&lt;p&gt;Below we plot how the correlation depends on the amount of lines of code a developer has written over the last 6 months. Although instinct might tell us that more lines means a more “experienced” developer, that is not necessarily true: the figure is also affected by multiple factors, such as the amount of meetings, mentoring and collaboration per day (which can vary with seniority), the tasks each developer took up, and so on. &lt;/p&gt;

&lt;p&gt;Nonetheless, we depict it here for anyone who might find it interesting. Also keep in mind that the difference in correlation between a user with many merged PRs and one with few is not very large.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-5.png" alt="Lead Time to PR size correlation value for developers per code committed within the last 6 months"&gt;&lt;/a&gt;Lead Time to PR size correlation value for developers per code committed within the last 6 months&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The more lines one has written the more correlated the PR size is with the lead time. This could also mean that lead time becomes more predictable in this case, and it depends more heavily on the size of the PR and not other parameters (e.g. complexity). However, more analysis would be required to establish that.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Does it depend on the company?&lt;/h2&gt;

&lt;p&gt;We mentioned earlier that there are various potential reasons for a correlation between Lead Time and PR size, and that the strength of that correlation is likely multivariate, one potential cause being company/team processes. If that were the case, we’d expect the correlation to vary by company. &lt;/p&gt;

&lt;p&gt;Taking a small sample of companies and examining the strength of that correlation suggests this is a valid assumption: as we can see here, it varies from 0.1, suggesting the two metrics are not related at all for that specific company, to almost 0.7, suggesting a relatively strong correlation between the two.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-10.png" alt="Lead Time to PR size mean correlation per company (sample)"&gt;&lt;/a&gt;Lead Time to PR size mean correlation per company (sample)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;How much PR size relates to Lead Time seems to depend heavily on the specific company&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Does it change over time?&lt;/h2&gt;

&lt;p&gt;It absolutely does, and massively so! Unfortunately, it’s rather hard to depict that for everyone in a single chart. However, I’m including my own correlation over time, taken from our &lt;a href="https://me.adadot.com/case-analysis" rel="noopener noreferrer"&gt;free analytics platform&lt;/a&gt;, so you can get an idea of how much it can vary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-6.png" alt="Lead Time to PR Size Correlation over time chart for myself"&gt;&lt;/a&gt;Lead Time to PR Size Correlation over time chart for myself&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We examined the correlation between Lead Time and PR size to see whether we can draw some conclusions about the size we should be aiming for. We found that, statistically, there are some generalisations we can make, and we can estimate an optimal size. However, we also concluded that the link between the two depends heavily on the company, the team, and even the individual developer, which in the end seems to suggest that:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Each developer works in unique ways, and only you, if anyone, know what is optimal for you and your team.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, if you would like to check where you or your team/company stand with respect to this correlation between Lead Time and PR size, we created a simple way for developers and teams to get insight into how this correlation changes over time and see where &lt;strong&gt;they&lt;/strong&gt; stand, whether as an individual, a team, or a whole company. If you are curious about it, feel free to check it out.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-13.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-13.png" alt="Example correlation analysis page"&gt;&lt;/a&gt;Example correlation analysis page&lt;/p&gt;

</description>
      <category>programming</category>
      <category>productivity</category>
      <category>codenewbie</category>
      <category>git</category>
    </item>
    <item>
      <title>Serverless at Scale: Lessons From 200 Million Lambda Invocations</title>
      <dc:creator>jspiliot</dc:creator>
      <pubDate>Fri, 10 Nov 2023 11:15:42 +0000</pubDate>
      <link>https://forem.com/adadot/serverless-at-scale-lessons-from-200-million-lambda-invocations-46h3</link>
      <guid>https://forem.com/adadot/serverless-at-scale-lessons-from-200-million-lambda-invocations-46h3</guid>
<description>&lt;p&gt;Serverless computing, with Lambda functions at its heart, has irrevocably changed the way we build and scale applications, more than anything by adding one more question to the list at the beginning of every project: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Should we do this serverless?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And although sometimes the answer is as simple as “None of our other systems are serverless, let’s not start mixing them”, a lot of other times it’s more complicated than that. Partially because:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Serverless architecture promises flexibility, infinite scalability, fast setups, cost efficiency, and abstracting infrastructure allowing us to focus on the code. But does it deliver?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With a technology that makes these promises it’s hard to mute that voice inside us asking questions like: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“what if it’s easier to maintain? Maybe it’s at least easier to spin up? Could it make us go faster? Would it be cheaper? What if we can scale more easily?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Therefore, sooner or later, we will all be intrigued enough to research it further, or implement it in a PoC just to see if it makes sense to go forward. And that’s where it gets tricky!&lt;/p&gt;

&lt;p&gt;At Adadot we ran over 200 million Lambdas in the last year. In this year-long journey to harness the power of Lambda functions, we saw both good sides and sides of them that we didn’t really expect.&lt;/p&gt;

&lt;h1&gt;Our Setup&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Context: At Adadot we are dealing with developer data, which at the end of the day means A LOT of data. So capturing and processing big volumes of data is our focus.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When we started, we were a small team of a couple of software engineers, and we didn’t know how the system requirements would scale over the next year(s). We had very little production experience with serverless architectures, but each of us had a decade with "traditional" architectures. We &lt;em&gt;thought&lt;/em&gt; we at least knew the below about serverless:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It would scale "infinitely", as long as we have enough money to throw at it, so it could save us in a scenario of unexpected traffic growth.&lt;/li&gt;
&lt;li&gt;We would pay almost nothing as long as our traffic stayed very low.&lt;/li&gt;
&lt;li&gt;Lambdas specifically have high warm-up times, and we didn’t want to manage warm-up strategies or use provisioned capacity (I’ll explain this in depth in another article).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, since we had no clue what our traffic would be over the next years, we made an early decision on our architecture’s principles:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fserverless_decision.jpg%3Fw%3D426" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fserverless_decision.jpg%3Fw%3D426" alt=""&gt;&lt;/a&gt;&lt;/p&gt;


&lt;ul&gt;
&lt;li&gt;Any API that is customer-facing, needs fast responses, and doesn’t do heavy operations will be handled by an always-on server.&lt;/li&gt;
&lt;li&gt;Any other function will run in Lambdas.&lt;/li&gt;
&lt;li&gt;Anything else that can be serverless (databases, queues, streams etc.) will be serverless.&lt;/li&gt;
&lt;li&gt;Cost is to be re-evaluated periodically as traffic becomes more predictable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fast forward 2 years, and, especially due to our requirements on what to deliver to end-users, we ended up with a serverless-heavy architecture: only 1 “monolithic” API server and some always-on databases, but also about 200 Lambdas, 100 Step Functions, 100 Kinesis streams, 100 SQS queues, DynamoDB (including streams), Redshift &amp;amp; Neptune clusters, API Gateways, and a whole bunch of AWS networking components.&lt;/p&gt;

&lt;p&gt;And that is how we got to running over 200 million Lambdas in the last year alone. However, as you may have deduced from the above, these were not serving customer-facing APIs, but rather performing various backend tasks, based on different triggers and events, to have all of the data ready, in the right format and with the required analyses done, so our customers could access them.&lt;/p&gt;

&lt;p&gt;So, without further ado, let’s move on to the more interesting part: what we learned from this year.&lt;/p&gt;

&lt;h1&gt;Ease &amp;amp; Speed&lt;/h1&gt;

&lt;p&gt;So are Lambdas easier to spin up? For us the answer has been a constant and resounding “yes” throughout our whole journey. We found that Lambdas need significantly less configuration and setup to spin up and start working, and less initial thinking about instance sizing. We use the Serverless Framework for our Lambdas, so having a Lambda function up and running is as simple as creating the serverless.yml file and running &lt;code&gt;serverless deploy&lt;/code&gt;. We mainly use JS (with TypeScript) and Python, and something like the below is enough to have a function up and running, including bundling, IAM permissions, environment variables, X-Ray tracing, custom layers, and really everything each one needs. So once you’ve written the function, in 5 minutes you are ready to go.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage.png%3Fw%3D1024" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage.png%3Fw%3D1024" alt="serverless.yml for TypeScript and Python"&gt;&lt;/a&gt;serverless.yml for TypeScript and Python&lt;/p&gt;
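&lt;p&gt;For reference, a minimal serverless.yml along those lines might look like the sketch below (service name, handler path, and region are made up; the real file in the screenshot adds IAM permissions, layers, tracing, and more):&lt;/p&gt;

```yaml
service: example-service   # hypothetical service name

provider:
  name: aws
  runtime: nodejs18.x
  region: eu-west-1        # assumption

functions:
  processEvents:           # hypothetical function
    handler: src/handler.main
    memorySize: 1024
    timeout: 900           # seconds; 15 minutes is the Lambda maximum
```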

&lt;p&gt;That being said, we use Terraform to manage the rest of our infrastructure, and we found that managing Lambda functions with Terraform was significantly more complex. This means, however, that we have 2 different ways of managing our infrastructure, and they don’t play that well together: they don’t really communicate resource information, so unfortunately there are cases where the id of a resource needs to be hardcoded, or set as a wildcard, to be used by the other tool.&lt;/p&gt;

&lt;h1&gt;Infinite (or not) Scaling&lt;/h1&gt;

&lt;p&gt;Serverless architectures are usually presented as “infinitely horizontally scaling”, which is, most of the time, too good to be true. And I’m obviously not talking about the practical impossibility of anything reaching infinity, but about a much more tangible limit. In the case of Lambdas, that limit is the maximum amount of unreserved, or reserved, concurrency. These two types of concurrency can be summarized as below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unreserved concurrency is the maximum number of concurrent Lambda executions for an AWS account; once you reach it, AWS throttles you (ouch!)&lt;/li&gt;
&lt;li&gt;Reserved concurrency is a slice of the unreserved concurrency that you have set aside for a specific Lambda function. That amount can only be used by that function, and it is also the maximum that function can use.&lt;/li&gt;
&lt;/ul&gt;
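&lt;p&gt;Reserved concurrency can be set per function through the AWS API; a minimal sketch with boto3 (the function name and amount are hypothetical):&lt;/p&gt;

```python
def reserve_concurrency(function_name, amount, client=None):
    """Reserve `amount` concurrent executions for one function.

    The reservation is both a guarantee and a cap: the function always has
    that capacity available, and can never exceed it.
    """
    if client is None:
        import boto3  # assumes AWS credentials/region are configured
        client = boto3.client("lambda")
    return client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=amount,
    )
```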

&lt;p&gt;So unreserved concurrency is the account-wide maximum. For our region that limit is set to 1000. That might sound like a lot when you start, but you quickly realise it is easy to reach once you have enough Lambdas and you scale up. We found ourselves hitting that limit a lot once our traffic, and therefore the demands on our system, started scaling. &lt;/p&gt;

&lt;p&gt;For us the answer was to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;implement an improved caching logic, so heavy operations are not rerun when they have already been calculated before, and&lt;/li&gt;
&lt;li&gt;spread the calculation loads throughout the day, depending on the end-user’s timezone.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can see the difference these made for us in the graph below (note that we had less than 700 unreserved concurrency because we had reserved the rest).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-1.png" alt="Daily concurrent lambda execution"&gt;&lt;/a&gt;Daily concurrent lambda execution&lt;/p&gt;

&lt;h1&gt;Function Resources&lt;/h1&gt;

&lt;p&gt;When setting up a Lambda function you have to choose the amount of memory for the container, and the CPU allocated to the container scales with that memory setting. The limit is 10GB. Although if we did everything with Lambda we would reach that limit easily, it generally hasn’t been an issue for us. Where the issue lies is that &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“cost of Lambda = memory × execution time”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means these settings massively affect the cost of the Lambdas. All good up to here; however, it gets more complicated if the function load is not constant but depends on some external parameter (e.g. the volume of data for the user). You are then forced to set the Lambda memory to the highest setting, even if that only covers the 99th percentile of your Lambdas’ loads, and overpay 99% of the time. You cannot just say “take as much as you need” (for obvious reasons). So&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the classic old problem of “how big should our server be” is still there, you just don't need to ask “how many of them” as long as they follow the other limits&lt;/p&gt;
&lt;/blockquote&gt;
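&lt;p&gt;Lambda bills compute in GB-seconds, so the formula above can be sketched as follows (the per-GB-second price is an assumption; check current AWS pricing for your region):&lt;/p&gt;

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 price; varies by region

def invocation_cost(memory_mb, duration_s):
    """Compute-only cost of a single invocation: GB-seconds times unit price."""
    return (memory_mb / 1024) * duration_s * PRICE_PER_GB_SECOND

# Sizing for a rare worst case makes every invocation pay the big-memory price.
p99_sized = invocation_cost(10_240, 30)  # 10 GB for 30 s
typical = invocation_cost(512, 30)       # 0.5 GB would have sufficed
```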

&lt;h1&gt;Storage Limits&lt;/h1&gt;

&lt;p&gt;Unless you go down the harder path of using Fargate (which we had to in some cases), Lambdas have a container image size limit of 10GB (uncompressed, including layers), and all of an account’s Lambdas combined are limited to 75GB. That might sound like plenty, and it might indeed be, but it heavily depends on the language you use for your Lambda. If you use a language with some sort of tree-shaking, so that you upload only what you really use (e.g. JS), you are probably covered forever. For other languages (looking at you, Python), on the other hand, it is hardly enough. Once you import pandas you’re at the limit; you can forget having pandas and SciPy in the same Lambda.&lt;/p&gt;

&lt;h1&gt;Time Limits&lt;/h1&gt;

&lt;p&gt;Initially we thought “it’s just like an on-demand server”: it spins up, does what you need, and shuts down again. Yes and no. It spins up and does what you want, but only for up to 15 minutes, then it drops. If a task needs more than 15 minutes to finish, you have to go a different way. And of course, if a function runs relatively fast 99.9% of the time but the remaining 0.1% consistently takes more than 15 minutes, you need to find an alternative or break it into smaller pieces. This is part of why we have about 200 Lambda functions: since we were not really happy with the logging retention of Fargate, we chose to stay mostly on Lambdas and kept breaking them into executable chunks. In our case a lot of the timeouts were also expected and acceptable, due to backoffs on the downstream resources, so our system is resilient to this. Still, our execution duration graph shows how far the average execution time is from the maximum.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-1-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-1-2.png" alt="Max, min and average daily lambda execution durations"&gt;&lt;/a&gt;Max, min and average daily lambda execution durations&lt;/p&gt;
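
&lt;p&gt;The chunking approach can be sketched roughly like this (a simplification with hypothetical names; a real handler would re-invoke itself or re-queue the leftover work):&lt;/p&gt;

```python
import time

TIME_LIMIT_S = 15 * 60      # Lambda's hard execution cap
SAFETY_MARGIN_S = 60        # assumed margin: stop early so we can return cleanly

def process_in_chunks(items, handle):
    """Process items until the time budget runs low; return what's left."""
    start = time.monotonic()
    remaining = list(items)
    while remaining:
        if time.monotonic() - start > TIME_LIMIT_S - SAFETY_MARGIN_S:
            break  # out of budget: the caller re-drives `remaining`
        handle(remaining.pop(0))
    return remaining  # an empty list means the task is done

done = []
leftover = process_in_chunks(range(5), done.append)
```

An empty return means the task finished inside one invocation; anything else gets handed to the next invocation.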

&lt;h1&gt;Monitor and Analyze&lt;/h1&gt;

&lt;p&gt;Proper monitoring and logging are essential for maintaining visibility into your serverless application, especially when functions are triggered not by customer requests but by events. You really should not just fire and forget them, because it’s easy for nothing to work without anyone noticing, unless you go above and beyond on your monitoring and alerting.&lt;/p&gt;

&lt;p&gt;AWS provides tools like CloudWatch, X-Ray, and CloudTrail, which help you gain insights into your Lambda functions' performance, trace requests, and monitor your entire architecture. By setting up alarms and automating responses, you can ensure that your application remains healthy and responsive.&lt;/p&gt;

&lt;p&gt;We combine all of these AWS services to get full coverage. On alert states we notify the team through dedicated Slack channels, and we fight constantly against false positives to keep those channels as quiet as possible when no intervention from the team is required. Because some failures are expected, it is complicated to get this right so that you are only notified about what “matters”.&lt;/p&gt;
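
&lt;p&gt;The “only notify for what matters” idea boils down to comparing the observed failure rate against an expected baseline before paging anyone. A minimal sketch, with assumed threshold values:&lt;/p&gt;

```python
EXPECTED_FAILURE_RATE = 0.02   # assumed baseline: some failures are expected
ALERT_MULTIPLIER = 3           # assumed: only alert well above that baseline

def should_alert(failures, invocations):
    """Notify only when the failure rate is well above the expected one."""
    if invocations == 0:
        return False  # no data, nothing to page about
    return failures / invocations > EXPECTED_FAILURE_RATE * ALERT_MULTIPLIER
```

A run with 5 failures in 1,000 invocations stays quiet, while 100 failures in 1,000 crosses the threshold and notifies the channel.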

&lt;h1&gt;Emphasize Resilience&lt;/h1&gt;

&lt;p&gt;Apart from the classic resiliency strategies for unexpected AWS outages (multi-AZ, multi-region, etc.), in serverless it’s important to have strategies in place from the beginning to mitigate issues in the various places that can go wrong, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal failures due to downstream services&lt;/li&gt;

&lt;li&gt;Internal failures due to bugs (oops!)&lt;/li&gt;

&lt;li&gt;Message payload failures &lt;/li&gt;

&lt;li&gt;Orchestration failures&lt;/li&gt;

&lt;li&gt;AWS throttling errors&lt;/li&gt;

&lt;li&gt;AWS random errors (yes, they do occur)&lt;/li&gt;

&lt;li&gt;Lambda timeouts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means you need to figure out what happens when something fails, and more importantly what happens to the payload, especially if the task still needs to run: how you can find it again and how to re-drive it through the system. Ideally this should be automatic, but what happens if the automatic system fails, or tries multiple times and can’t succeed?&lt;/p&gt;

&lt;p&gt;To this end we make extensive use of Dead-Letter Queues (DLQs) with redrive strategies, plus alerts when a queue fails to drain, which means a message’s processing cannot be completed.&lt;/p&gt;
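
&lt;p&gt;A minimal sketch of the redrive logic, with plain lists standing in for SQS queues and an assumed retry cap (a real implementation would use boto3 or SQS’s built-in redrive support):&lt;/p&gt;

```python
MAX_REDRIVES = 3  # assumed cap before a human has to look at the message

def redrive(dlq, main_queue):
    """Move DLQ messages back to the main queue; flag ones that keep failing."""
    stuck = []
    while dlq:
        msg = dlq.pop(0)
        if msg.get("redrives", 0) >= MAX_REDRIVES:
            stuck.append(msg)  # alert: this message cannot be processed
        else:
            msg["redrives"] = msg.get("redrives", 0) + 1
            main_queue.append(msg)  # give it another chance
    return stuck

dlq = [{"body": "task-1"}, {"body": "task-2", "redrives": 3}]
main_queue = []
stuck = redrive(dlq, main_queue)
```

Messages under the cap go back through the system; anything over it lands in `stuck`, which is exactly the condition worth an alert.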

&lt;h1&gt;Orchestration&lt;/h1&gt;

&lt;p&gt;SQS, SNS, DynamoDB streams, Kinesis streams, API gateways, crons, etc. are all very useful to stream, circulate, and trigger processes, and Lambda functions in particular. But you quickly find yourself in a sea of events that happen whenever, wherever, or just wanting to call one Lambda after another has finished, or to orchestrate various operations and procedures across AWS services. That’s where Step Functions come in. They are state machines that you can build with the visual editor to orchestrate any kind of complicated series of events.&lt;/p&gt;

&lt;p&gt;What’s the catch, you ask? Well, there is a limit of 25,000 history events per standard Step Function execution (there is also the Express type, but I wouldn’t recommend it; I’ll dive into this in another article about Step Functions). This means that if you have a lot of loops (maybe because, like us, you broke your Lambdas into small pieces for the time limit issue), you will end up having to also break your Step Functions into smaller state machines and call one from another. It gets fairly complicated quickly; however, it is still a great tool that lets you orchestrate anything fast and easily.&lt;/p&gt;
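
&lt;p&gt;A back-of-the-envelope way to see when a loop will overflow the 25,000-event history limit (the per-state figures below are assumptions; each state transition records several history events):&lt;/p&gt;

```python
EVENT_LIMIT = 25_000          # history events per standard execution
EVENTS_PER_STATE = 5          # assumed events recorded per state transition
STATES_PER_ITERATION = 4      # assumed number of states inside the loop body

def max_safe_iterations():
    """How many loop iterations fit before the history limit is reached."""
    return EVENT_LIMIT // (EVENTS_PER_STATE * STATES_PER_ITERATION)

# Beyond this many iterations the state machine has to be split
# into smaller state machines that call one another.
print(max_safe_iterations())
```

Under these assumptions a loop gets about 1,250 iterations before the execution history overflows, which is why loop-heavy workflows end up split.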



&lt;h2&gt;Cost&lt;/h2&gt;

&lt;p&gt;As we said, the cost calculation is relatively simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“cost of lambda  = memory * execution time ms”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or, to be more precise:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“cost of lambda  = memory * round_up(execution time ms, 100)”&lt;/p&gt;
&lt;/blockquote&gt;
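
&lt;p&gt;The formula above as code, with an assumed GB-second price (check current AWS pricing for real numbers):&lt;/p&gt;

```python
import math

GB_SECOND_PRICE = 0.0000166667  # assumed price per GB-second, for illustration

def lambda_cost(memory_gb, duration_ms):
    """memory * round_up(execution time ms, 100), converted to GB-seconds."""
    billed_ms = math.ceil(duration_ms / 100) * 100
    return memory_gb * (billed_ms / 1000) * GB_SECOND_PRICE

# a 130 ms run is billed as 200 ms
cost = lambda_cost(1.0, 130)
```

The round-up means a 1 ms run and a 100 ms run cost exactly the same, which matters a lot at hundreds of millions of invocations.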

&lt;p&gt;So with us running 200 million of them, it is time to answer one of the questions you’ve probably been waiting for.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Was it actually more cost efficient than just running a couple of servers?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Well, let’s see what the maths says:&lt;/p&gt;

&lt;p&gt;First of all, let’s calculate how many servers we would have needed over the last year if we had replaced all our Lambdas with a set of EC2 servers of the same specs, taking the average duration per Lambda run that we had.&lt;/p&gt;

&lt;p&gt;In our case these relations are described by the chart below, which plots the number of servers against the number of monthly invocations, where a medium instance is the one matching the average specs of our Lambdas and an xlarge is the one satisfying our most demanding Lambda’s requirements.&lt;/p&gt;

&lt;p&gt;We see that in our case the ~17 million average monthly invocations would come to about 25 on-demand, always-on medium instances, or 6 xlarge ones, which would most definitely have satisfied the actual needs we’ve had over the past year.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-1-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Finsightsadadot.files.wordpress.com%2F2023%2F11%2Fimage-1-1.png" alt="monthly cost per millions of lambda invocations, number of xlarge and number of medium EC2 servers" title="Chart"&gt;&lt;/a&gt;monthly cost per millions of lambda invocations, number of xlarge and number of medium EC2 servers&lt;/p&gt;
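
&lt;p&gt;The rough arithmetic behind that comparison looks something like this (the duration, capacity, and headroom figures are assumptions for illustration, not our actual measurements):&lt;/p&gt;

```python
import math

MONTHLY_INVOCATIONS = 17_000_000  # our average monthly Lambda load
AVG_DURATION_S = 2.0              # assumed average Lambda run time
SLOTS_PER_INSTANCE = 1            # assumed concurrent runs one instance handles
HEADROOM = 2.0                    # assumed overprovisioning factor for peaks

def instances_needed():
    """Always-on instances needed to absorb a month of Lambda compute."""
    compute_s = MONTHLY_INVOCATIONS * AVG_DURATION_S
    capacity_s = 30 * 24 * 3600 * SLOTS_PER_INSTANCE  # seconds per month
    return math.ceil(HEADROOM * compute_s / capacity_s)

print(instances_needed())
```

With these assumptions the answer lands in the same ballpark as the chart: a few dozen always-on medium instances for the whole workload.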

&lt;p&gt;So if in the end it is more expensive than just having a couple of servers running all the time (not even taking dynamic scaling etc. into account), why would anyone do it? Or at least, why did we? We did it because it allowed our team to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spend less time on infrastructure scaling concerns, and focus more on our goals&lt;/li&gt;

&lt;li&gt;Be ready for almost any load at any time; sudden growth was less of a fear, so when it happened we could focus on customer issues, not infrastructure&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So in essence, we were hoping to exchange money for time to focus on other things, and for less stress for our team. Did we succeed? Hard to tell; no one really knows how it would have gone if we had taken the other direction.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In conclusion, let’s see where we stand on the nagging questions we had when we started:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is it easier to spin up: absolutely! &lt;/li&gt;

&lt;li&gt;Is it easier to maintain: Not really, just different.&lt;/li&gt;

&lt;li&gt;Does it make your team go faster: It makes starting a new “service” faster, but overall you will end up spending that time elsewhere, e.g. on monitoring.&lt;/li&gt;

&lt;li&gt;Is it cheaper: When you have no traffic it absolutely is; later on, it quickly isn’t.&lt;/li&gt;

&lt;li&gt;Can it scale more easily: Only for your very first customers; after that you have the same amount of concerns, many of them the same and some slightly different, but it definitely isn’t fire-and-forget scalability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And for us, was it worth it? Probably, especially in the beginning; but as costs increase and traffic becomes more predictable, we might mix in more plain, old servers.&lt;/p&gt;

&lt;p&gt;Overall, we've learned valuable lessons about scaling, resource allocation, storage, execution time, monitoring, resilience, and cost efficiency and how these affect each other. &lt;/p&gt;

&lt;p&gt;We were also reminded multiple times that:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“even if something is "infinitely" scaling, the hardest part is that everything that it interacts with needs to be "equally infinitely" scalable”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which in reality is not easy to achieve.&lt;/p&gt;

&lt;p&gt;Our decision to go serverless aimed to save time, reduce infrastructure concerns, and be ready for any growth at any time. While it has its advantages, the cost-efficiency of serverless depends on the project's specific needs and priorities, and the complexity of use when it comes to actual real-world applications is, as always, more than initially expected or wished for.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>aws</category>
      <category>cloud</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
