<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rob De Feo</title>
    <description>The latest articles on Forem by Rob De Feo (@robdefeo).</description>
    <link>https://forem.com/robdefeo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F200488%2F7a799bbd-0e46-4f5c-a9bc-5b5a23a2c228.jpg</url>
      <title>Forem: Rob De Feo</title>
      <link>https://forem.com/robdefeo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/robdefeo"/>
    <language>en</language>
    <item>
      <title>How to get the most from AWS Activate</title>
      <dc:creator>Rob De Feo</dc:creator>
      <pubDate>Wed, 18 Nov 2020 11:15:26 +0000</pubDate>
      <link>https://forem.com/aws/how-to-get-the-most-from-aws-activate-dmc</link>
      <guid>https://forem.com/aws/how-to-get-the-most-from-aws-activate-dmc</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/GMEnuAHRVnU"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Startups have access to credits through AWS Activate, but that's just the start. Your membership includes credits, support, training, exclusive offers, and more.&lt;/p&gt;

&lt;p&gt;Here are my top suggestions on how to use Activate for Startups to learn, grow, and move fast.&lt;/p&gt;

&lt;p&gt;Sign up for &lt;a href="https://go.aws/3ctbXHY" rel="noopener noreferrer"&gt;Activate&lt;/a&gt; and get access to the Activate Console, up to $100,000 in credits, training, support, and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/c/RobDeFeo?sub_confirmation=1" rel="noopener noreferrer"&gt;Subscribe&lt;/a&gt; to the channel for the latest videos.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>startup</category>
      <category>beginners</category>
    </item>
    <item>
      <title>First look at the AWS Activate Console</title>
      <dc:creator>Rob De Feo</dc:creator>
      <pubDate>Tue, 10 Nov 2020 14:52:12 +0000</pubDate>
      <link>https://forem.com/aws/first-look-at-the-aws-activate-console-1elm</link>
      <guid>https://forem.com/aws/first-look-at-the-aws-activate-console-1elm</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/DWKLCzeKrb0"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Startups building on AWS can apply for AWS Activate, which offers credits, training, support, and more. With the recently launched Activate Console, you can see the status of your Activate membership.&lt;/p&gt;

&lt;p&gt;From the console you can see your credit status, get tailored tech and business content, find exclusive offers, and access a wide range of guidance and support.&lt;/p&gt;

&lt;p&gt;Sign up for Activate and get access to the Activate Console, up to $100,000 in credits, training, support, and more: &lt;a href="https://go.aws/3ctbXHY" rel="noopener noreferrer"&gt;https://go.aws/3ctbXHY&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/c/RobDeFeo?sub_confirmation=1" rel="noopener noreferrer"&gt;Subscribe&lt;/a&gt; to the channel for the latest videos.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>startup</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Experiment and prototype outside of your MVP</title>
      <dc:creator>Rob De Feo</dc:creator>
      <pubDate>Thu, 29 Oct 2020 16:48:44 +0000</pubDate>
      <link>https://forem.com/aws/experiment-and-prototype-outside-of-your-mvp-3kg0</link>
      <guid>https://forem.com/aws/experiment-and-prototype-outside-of-your-mvp-3kg0</guid>
      <description>&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/vM3ZBlpPaZ4"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;You will have many questions when building a startup, and a prototype is, I think, the best way to answer them. Experimenting and trying new things as often as possible is an awesome habit. And that's faster and easier outside of your MVP.&lt;/p&gt;

&lt;p&gt;Your product, and it doesn't matter if it's a beta or v2.58.2, is what your customers use, and they use it because it solves an important problem for them. It needs to work.&lt;/p&gt;

&lt;p&gt;So prototype outside of your main product: it's easier to experiment, you'll fail faster, and you'll have loads of fun trying new things.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/c/RobDeFeo?sub_confirmation=1" rel="noopener noreferrer"&gt;Subscribe&lt;/a&gt; to the channel for the latest videos.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>startup</category>
      <category>beginners</category>
    </item>
    <item>
      <title>SimScale - Legacy Desktop Simulation Software to the Cloud</title>
      <dc:creator>Rob De Feo</dc:creator>
      <pubDate>Thu, 11 Jun 2020 11:21:37 +0000</pubDate>
      <link>https://forem.com/aws/simscale-legacy-desktop-simulation-software-to-the-cloud-3a2j</link>
      <guid>https://forem.com/aws/simscale-legacy-desktop-simulation-software-to-the-cloud-3a2j</guid>
      <description>&lt;p&gt;Computed Aided Engineering (CAE) allow engineers to run Computational Fluid Dynamics, Finite Element Analysis and Thermal simulations. The software is built and maintained over years with many contributors as open source in large C++ codebases. Simulation software was designed for running on a desktop client. SimScale run these in the cloud as part of a modern mircoservice architecture. &lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://open.spotify.com/embed/episode/4JDrsZRibZzkoxoHSwWr7I" width="100%" height="232px"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Anatol Dammer, one of five co-founders, takes us behind the scenes and explains how SimScale have taken large, difficult-to-scale legacy codebases and built a microservice architecture using modern programming languages.&lt;/p&gt;

&lt;p&gt;These are some useful resources related to the episode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud based collaboration - &lt;a href="https://www.simscale.com/blog/2020/03/cae-collaboration-features/" rel="noopener noreferrer"&gt;https://www.simscale.com/blog/2020/03/cae-collaboration-features/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Running desktop applications in the cloud &lt;a href="https://www.simscale.com/blog/2019/09/non-cloud-native-services/" rel="noopener noreferrer"&gt;https://www.simscale.com/blog/2019/09/non-cloud-native-services/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SimScale are hiring &lt;a href="https://www.simscale.com/jobs/" rel="noopener noreferrer"&gt;https://www.simscale.com/jobs/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Listen&lt;/h2&gt;

&lt;p&gt;Learn from some of the smartest people in the business by subscribing to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://open.spotify.com/show/58m8vSzXx1WKirKLTdCQa6" rel="noopener noreferrer"&gt;Spotify&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://podcasts.apple.com/podcast/startup-engineering/id1504093052" rel="noopener noreferrer"&gt;Apple Podcasts&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;  &lt;a href="https://podcasts.google.com/?feed=aHR0cHM6Ly9mZWVkcy5zaW1wbGVjYXN0LmNvbS9iRjFVUXhocA&amp;amp;episode=Y2QwZTQ0MmQtNTA0My00MmNlLTlkMjMtYmEyNzNjODUwYmEx&amp;amp;ved=0CAcQ38oDahcKEwiYsKm5qcLoAhUAAAAAHQAAAAAQAQ" rel="noopener noreferrer"&gt;Google Podcasts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.breaker.audio/startup-engineering" rel="noopener noreferrer"&gt;Breaker&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.stitcher.com/podcast/startup-engineering" rel="noopener noreferrer"&gt;Stitcher&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://feeds.simplecast.com/bF1UQxhp" rel="noopener noreferrer"&gt;RSS feed&lt;/a&gt; to listen on your platform of choice&lt;/li&gt;
&lt;li&gt;  More details can be found on the &lt;a href="https://startupengineering.co/episodes/simscale-legacy-desktop-simulation-software-to-the-cloud" rel="noopener noreferrer"&gt;episode&lt;/a&gt; page&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Transcript&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Welcome back to Startup Engineering, a podcast that goes behind the scenes at startups. I'm Rob De Feo, Startup Advocate at AWS. Together we'll hear from the engineers, CTOs and founders that build the technology and products of some of the world's leading startups. From launch through to achieving mass scale and everything in between, experts share their experiences, lessons learned and best practices. In this episode, our guest Anatol, one of five university graduates who co-founded SimScale, takes us behind the scenes of how they built an online simulation platform for CAD models. Anatol and his team work on anything and everything infrastructure, from data storage and provisioning cloud resources through to cost optimization, privacy and security. Anatol, can you start off by telling us about SimScale and what problem it solves for your customers?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; Yeah, hi Rob. SimScale is there for anyone who builds physical products to do design validation digitally, on a physical level, before they actually prototype or build their product. It could mean anything from an electronics enclosure that you want to test for ventilation efficiency, up to race cars where you want to look at aerodynamics, or full buildings and wind load on full buildings. So anything in terms of fluid flow or structural mechanics that you want to test before you actually build something, that's what you can do with SimScale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; What you're building sounds like it applies to most physical engineering or real-world products. I've seen computer design or CAD software that runs on people's local machines before. Are you building like a CAD for the cloud?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; CAD software is actually one step before us in the workflow. The CAD software is what you would use to actually create a model of the product that you want to build, the same as with simulation software. And all of these softwares have traditionally run on the computer, on the desktop, sometimes with your local cluster for running large computations. But what we've seen in this industry is that CAD softwares have already moved to the browser. We simply expect that this is the next logical step, and we also get this feedback from our users and customers, of course.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; What are the reasons that you're seeing for people wanting to make this shift? So instead of running it locally, they're choosing or needing to run in the cloud.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; There are actually various benefits along different dimensions. One is simply the effort to get started. You don't have to buy a large hardware cluster or workstation to run your large simulations. You can basically just run them in the cloud and have it arbitrarily scalable. You can run as many at the same time as you want. That's one huge advantage already. What we also think is a huge benefit of SimScale is simply the commitment that you have to make. If you buy a desktop software, you usually have to make a large commitment in terms of not just buying hardware, but also training and buying licenses, which are usually annual. You don't pay per use. You have a fixed cost, which is quite large in most cases. Using SimScale is just like any traditional SaaS software. You don't have to install anything. You can pay as you go. We actually measure how much you use on the computational side and only charge you for that. On top of all of this are the typical benefits that cloud applications have, things like collaboration, where you can work together with your colleagues on projects, and easy sharing of results and so on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; It's really quite a different SaaS product to many that are already out there. What I mean is that a lot of what I've seen for SaaS is essentially modeling access to a database with a request-response pattern on top. I'm thinking you must have built something quite different for your customers. Can you talk a little bit about some of the differences between a common SaaS solution and what you've needed to build?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; Yeah, you're exactly right. The area we're in, which is computer aided engineering, is simply a quite complex application domain. There are different kinds of simulations that you can do, structural mechanics, fluid flow and so on. And even within those there's a large array of different kinds of simulation software based on different principles, both in the mathematics and also in how you actually write the software. Where we started was basically using two open-source softwares, OpenFOAM and Code_Aster. From there we've matured and integrated many more simulation softwares. There's also something to be said for becoming more like a platform by integrating more solvers in the future. The basic challenge, as you already alluded to, is that it's quite computationally heavy. You have these large batch jobs which you have to run, which sometimes run for days or even weeks. You have to make sure that they actually produce the right data. But you also want to have some analytics and see if they're actually doing the right thing, so that you don't charge users for something they cannot use in the end. And on the other side, before you can actually run those large jobs, we have a complex workflow to get to the point of specifying those jobs and making sure that you have the right setup and the right parameters to get the right results. And this part is more interactive. There the challenge is that, for example, we have to interact with the geometry file of your product, so basically the 3D description of what you want to build. We have to interact with that in the browser, for example to define certain constraints on the simulation, and that's something which is also tricky, but we can also go into more detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; I'm curious about how you converted your software into a SaaS solution. To make this work, did you have to do something special? Did you have to adapt or rewrite the libraries? Or was it a matter of creating the Dockerfiles and then deploying them as containers?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; Oh yeah, this is actually quite an involved topic, as I'm sure any of our engineers would be happy to talk to you about. This is also the reason why, I mean, if you look at the market, SimScale is basically the market leader and there's not too much competition. The main reason is that this is one of the biggest problems to actually get to work. If you look at all of the libraries and simulation software solvers that are out there, almost all of them are legacy software. Written over decades, with a huge number of contributors, often very scientifically oriented code, I would say, and not necessarily the best software architecture in all cases. Which brings many, many challenges. Even just using them yourself on your desktop is tricky. Just getting one to compile, to actually run, to produce the correct results even after it's compiled, it's tricky. And then you want to actually make a microservice, or more like a modern cloud application, out of this, and we chose a microservice approach for reasons we can also go into.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Definitely want to hear more about why you picked microservices and how you implemented that. But considering you were using legacy software, is that how you started off?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; Yeah, of course we did not. Like most startups, we started with a monolith, because it just seemed easy and we wanted to prototype, get something out the door, and make sure that we had a proof of concept. You know, also to show to investors, because it was clear that we were going to need money and hire more people to actually grow this idea. We started with basically one huge chunk of application, but luckily much, much smaller and well-behaved. This worked well for the first few years, I would even say. It took quite a while to get to a point where this solution was ready for the market and ready for the first external users. But of course this is just something which doesn't scale any more at some point. I think there's actually an argument to be made, in some cases, that monoliths can work out and can be the right approach to certain problems. But especially in our application domain, where you have a complex workflow with different parts which are actually implemented by different teams, it just makes a huge amount of sense to decompose the system and have microservices. It is something that we started, luckily, a few years in, after hiring some experienced developers who knew how to actually do this. This is what we've been doing ever since, and we are quite happy about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; A common pattern that I see is startups will take or build a monolith to start off with, deploy and continuously iterate on top of it, and then break it into microservices. Was there a particular event, action or time? I mean, when did you know you should be building it as microservices rather than a monolith?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; Yes, definitely. The first two years were basically spent building a prototype that, you know, worked and produced results, and that was good. But then we started actually going to users and customers which had real problems. Not just test cases that we were running, and maybe some beta users which were quite experienced themselves. From then on there are certain challenges that you didn't expect, or at least not at this magnitude. For example, people bring all kinds of different geometry files, describing all kinds of different products, having slight defects which you have to fix before a simulation. Especially in this part of our system, this CAD processing or geometry processing of CAD files, we realized we had to much, much improve our system. We hit a roadblock with the existing services that we had written, and it was clear we had to use existing solutions which are market proven and used by all of the big simulation software vendors as well. In the end we realized some of these solutions, well actually all of them, are written in C++. They are mostly used in legacy codebases which are huge C++ applications, which have all kinds of challenges in terms of maintainability and efficiency in development. We spent a lot of time actually looking into this, playing around with integrating C++ codebases. At some point it just took too long, and we sort of sat back and asked ourselves, "OK, we have to be faster." We cannot afford to hire 10 developers just to make these basic integrations work in our system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; I know that working with C++ code, especially these large codebases, can be slow going. In your case, you can't exactly afford to hire your way out of the problem. So what do you do to develop quickly against these hard-to-maintain systems?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; Exactly, exactly. We had a bright idea, right, from one of our senior developers. And I really have to say this: we actually acquired a small team, but we have some excellent people on the staff who often have very creative ideas, which is just something we need a lot in this company and which has proven to work out great. He had the idea that there was this sort of new language at the time called Go, or GoLang, which has some nice properties. First of all, it's a modern language that has garbage collection, it's quite safe to use, and it's actually quite easy to learn, too. Even someone who doesn't have much experience in the language at all, you can usually get them onboarded in a matter of days. That doesn't mean they write the perfect architecture, but they can actually get working. This is one great property: it's a safe, modern language, and especially for web applications it works quite nicely. The huge advantage, or the huge leverage, we saw for us is that it has great integration with C++ APIs. If you look at integrating C++ libraries from Java, for example, it's terrible, right. There are two different frameworks for that and they are both really, really inefficient and it's just not fun at all. This developer had the idea: "why don't we just take the C++ libraries and components that we want to integrate into our system and wrap them in a layer of Go?" There we can easily use the C++ API, but we're not slowed down when integrating into a microservices system, creating API calls, you know, implementing monitoring, handling error codes and so on. This is much, much easier and faster to do in Go. We tried this in a sort of prototype and it just worked extremely well, and we've used it ever since for all external C++ or C codebases we want to integrate. And it's been a game changer.&lt;/p&gt;
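
&lt;p&gt;To make this concrete, here is a minimal sketch of the Go-wraps-C++ facade pattern Anatol describes, using cgo. The C shim (&lt;code&gt;meshlib.h&lt;/code&gt; exposing a &lt;code&gt;mesh_run&lt;/code&gt; function) is a hypothetical stand-in, not SimScale's actual code.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;package meshfacade

// Minimal sketch of wrapping a legacy C/C++ solver in Go via cgo.
// "meshlib.h" and mesh_run() are hypothetical stand-ins for the C shim
// a team would export from their C++ codebase.

/*
#cgo LDFLAGS: -lmeshlib
#include &amp;lt;stdlib.h&amp;gt;
#include "meshlib.h"
*/
import "C"

import (
    "fmt"
    "unsafe"
)

// RunMesh calls the wrapped solver on a geometry file and translates
// its C status code into an idiomatic Go error, so callers never touch
// the C++ layer directly.
func RunMesh(path string) error {
    cpath := C.CString(path)
    defer C.free(unsafe.Pointer(cpath))

    if status := C.mesh_run(cpath); status != 0 {
        return fmt.Errorf("mesh_run failed with status %d", int(status))
    }
    return nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A thin package like this can then be exposed behind an HTTP or queue-driven API, with the monitoring and error handling written in plain Go.&lt;/p&gt;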

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; That's awesome. So you can use a modern language like Go, so you have fast development and all the features that come with that, but you're still able to use the power that's been built over a long period of time into these C++ libraries. Is that the approach that you've broadly taken when it came to breaking down the monolith, as in taking these components and putting a wrapper in front of them?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; When we started using it, it was more like a new development, where we basically had an old service written internally which we had to replace by integrating external libraries. That was where we started; it was the first project we used this for. But now we're also using it basically anywhere we have to write a microservice or a web service which integrates with C++.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Now I imagine what appears to be a set of GoLang wrappers that present these libraries as endpoints. Given the compute requirements, how does your software manage these heavy or long-running jobs?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; We have a complex workflow to get from a CAD file for a product to a simulation and a simulation result; there are actually multiple steps in between. There is a certain part that's interactive. Basically the user uploads a geometry file, a CAD file describing their product. Then you actually have to analyze this, generate metadata, and generate visualization files that you can display in a browser for the user, to make this work well. There's also something we're working on right now where users can actually make small modifications to that file in a browser, which is something like a small CAD system in the browser. That is, I think, going to be another game changer for us and our users. Then you actually have to do many, many things that are interactive. If you have a valve, for example, and you want to simulate the flow through the valve, you basically have to click on the valve and say, "okay, here water, or a different material, comes in at this speed" and so on. There's a large amount of work that's interactive, and this is the part that runs as microservices. For example, CAD handling, geometry handling and so on are microservices and they're running on ECS. A few years ago we looked at how we would run this. When we actually started SimScale, ECS didn't exist yet, so there was basically EC2, there was Elastic Beanstalk, but there was no ECS, and there was no Kubernetes really, or at least not in a usable state. As soon as ECS became available we quickly migrated to it, because before that we were basically still running this monolith. So we started breaking things apart and moving to ECS and microservices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; So you hinted there that the release of ECS was one of the pieces of technology that helped enable you to break down your monolith. Can you talk us through a little bit of why that was?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; Before, we were still running this monolith, and I guess actually multiple things converged. First of all, we realized, both from the maintainability perspective of this monolith, legacy code, and from having many more developers, that we wanted to be able to work efficiently on our codebase. We decided okay, we have to go in this microservice direction. But then we actually spent a bit of time sort of hacking this together ourselves, running microservices on instances ourselves and making this work with our own tooling. Actually, in parts even still outside AWS, because back then we had a part of our services outside AWS, which was also a big mistake and cost us a lot of time to maintain all of this infrastructure. The large batch simulations were already running in AWS, because from the very beginning of the company there was no other way of doing this than EC2. We said okay, multiple things: we have to break apart this monolith a bit more and have more microservices. What we don't want to maintain is infrastructure and logic for releasing all of the microservices anymore. Also, we want to actually run all of our services inside AWS, so not just the large batch jobs but also the interactive services, and run all of this together. Then you basically look around, and before that there was only, as I said, EC2 and I think Elastic Beanstalk, and no ECS. And ECS simply offered exactly the platform that we needed. It's something which takes Docker images, or Docker services; you create clusters, you run them there, and all of the heavy lifting of managing releases, draining connections to old versions of your services, all of this is basically handled by ECS. So we just started migrating everything there.&lt;/p&gt;
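
&lt;p&gt;As a rough illustration of that migration step (placeholder names, not SimScale's actual setup), creating one such service on an ECS cluster with the AWS SDK for Go v2 looks something like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;package main

// Minimal sketch: put a containerized microservice onto ECS and let it
// handle rollouts and connection draining. Cluster, service and task
// definition names are placeholders.

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/ecs"
    "github.com/aws/aws-sdk-go-v2/service/ecs/types"
)

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    client := ecs.NewFromConfig(cfg)

    // ECS keeps DesiredCount copies of the task running and manages
    // deployments when the task definition is updated.
    _, err = client.CreateService(ctx, &amp;amp;ecs.CreateServiceInput{
        Cluster:        aws.String("interactive-services"),
        ServiceName:    aws.String("geometry-processing"),
        TaskDefinition: aws.String("geometry-processing:1"),
        DesiredCount:   aws.Int32(2),
        LaunchType:     types.LaunchTypeEc2,
    })
    if err != nil {
        log.Fatal(err)
    }
}
&lt;/code&gt;&lt;/pre&gt;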

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; You mentioned some batch jobs running, but you didn't exactly say where they run. Are they running inside ECS along with everything else? And did you have to do anything in particular to make that work?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; Oh, yeah. Good point, good point. I didn't get to that yet. The interactive part of the workflow, where people set up simulations and specify all the parameters, is what is running in ECS. What we were not running in ECS but in EC2, from the very beginning of SimScale, were the actual simulation jobs themselves. The first reason is to be able to control the different resource requirements that jobs have. The second one is that it's quite a spiky workload, and we felt that, of course you can also scale ECS clusters up and down, but it makes sense to separate these workloads. They have quite different resource requirements, so it's not so easy to pack these kinds of jobs together on one cluster and have an efficient packing of these jobs. You also still have cases, and this is actually going back to the very beginning of our talk, where these legacy softwares behave somewhat pathologically, right. At least in the beginning; now it's much, much better. But we still sometimes have cases where some of these softwares go into states which heavily affect the instance overall, where they take up all the memory, for example, and start killing things. This is also one reason why we want to separate jobs from different users as much as possible. So yeah, this is all running in EC2 right now.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; I see that layer of separation being very helpful, especially if something goes wrong. A lot of what you've built sounds like it could work with queues. Is that the way that you've architected it? And does this mean it changes the way that you can scale the EC2 instances up and down?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; This is actually also quite interesting, especially because this is something we implemented in literally the first one or two weeks of SimScale; we ran it a long time and it worked well and served us well. But at some point we had to professionalize this a bit more and make it easier and better for users. We did many, many things; maybe we can highlight one or two of them. One thing that proved to be a problem at some point is that in the very first two years of SimScale, the main point was to get things to run at all, to make it really work, to get correct results, to get good results and so on. Once we passed that point, a long time ago already, the bigger goal became to really optimize the user experience. So make sure that our users have a snappy experience, that they don't have to wait a long time to actually get results. One problem that we had there is that these are huge software packages. Even if you dockerize them, sometimes these are multi-gigabyte packages of software that you have to load onto an instance and run. We had complaints from our users, and also our internal analytics showed that sometimes a user clicks "run this job for me" and then it takes minutes to actually start getting results. At some point this became unacceptable, especially for quite small jobs users wanted to run maybe just to test something. We started using analytics, looking at where this time actually goes. What happens? There are certain things which you cannot improve a lot. For example, if you start an EC2 instance it takes a minute or two to start up. This is something that is inherent; I think Amazon does a lot to optimize this and make it as fast as possible, but there's just a certain amount of time that you are always going to have. We started optimizing around that a bit, having a pool of instances available at all times. But then you have to be smart about which kind of instance you put into this pool. You always have inefficiency because different jobs have different requirements. This worked out a bit, but not too great. Then we dug a bit more, and, I mean, there were many iterations in this process, but the optimization we ended up with that had the most impact, I would say, came from seeing that it sometimes takes a long time to actually load all the software onto instances. What we had was a precooked AMI, basically an image which contains all of the software we have to run, and this image was about 80 gigabytes, which is a lot. People are used to modern applications where certain stacks require larger compilation products and so on, but 80 gigabytes to put on an instance to run jobs is a lot. Of course we optimized this, we cut it down, we made it smaller. What we realized in the end is that if you start an EC2 instance, the image going to be running on that instance has to be fetched from S3. Which is nice and good, but it turned out this process is sometimes a bit slow, and this is something which we found to be a huge factor. To address this, we said okay, we tried to optimize it, we just didn't manage to optimize it much. So we created a separate volume which contains all of our software. We keep a pool of these volumes, which is much cheaper than running a pool of instances, which are much more expensive than just keeping EBS volumes available. We maintain this pool, we make sure that it is always up to date and that our software is updated on those volumes. And once we create a job instance, we attach one of those volumes from this pool, and we have all of our software available, pre-warmed and ready to run.&lt;/p&gt;
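
&lt;p&gt;As a sketch of that attach step with the AWS SDK for Go v2 (the IDs and device name are placeholders, not SimScale's code): once the scheduler has a fresh instance and picks a pre-warmed volume from the pool, a single &lt;code&gt;AttachVolume&lt;/code&gt; call wires them together.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;package main

// Minimal sketch of the pre-warmed EBS pool trick: attach a volume
// that already contains the multi-gigabyte software stack to a newly
// started job instance, instead of baking it all into the AMI.

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/ec2"
)

func attachPoolVolume(ctx context.Context, client *ec2.Client, instanceID, volumeID string) error {
    // The instance mounts the volume and can run solvers immediately,
    // rather than pulling tens of gigabytes from S3 at boot.
    _, err := client.AttachVolume(ctx, &amp;amp;ec2.AttachVolumeInput{
        Device:     aws.String("/dev/xvdf"),
        InstanceId: aws.String(instanceID),
        VolumeId:   aws.String(volumeID),
    })
    return err
}

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    client := ec2.NewFromConfig(cfg)

    // In practice the IDs come from the job scheduler and volume pool;
    // these values are hypothetical.
    if err := attachPoolVolume(ctx, client, "i-0123456789abcdef0", "vol-0123456789abcdef0"); err != nil {
        log.Fatal(err)
    }
}
&lt;/code&gt;&lt;/pre&gt;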

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Well, that's good. So you're reducing the amount of time it takes to start an instance, even if it has a large amount of software, because your AMI has a small install footprint, and the EBS volumes have the software installed on them and you just attach them as you go. And because you reduced the time it takes to spin up an instance, you can have a smaller pool of compute available as well. Are you taking it even further? Do you have a mixture of on-demand as well as other pricing models, like spot?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; Totally, totally. In the beginning we only used on-demand instances because we were worried about terminations. But especially because we also have a freemium plan, where we want to optimize at least that for cost, even though we also get value out of it by having great simulation projects which are public, we wanted to optimize some costs here as well. We experimented with using spot instances, and something which is quite nice is what Amazon recently, I mean now it's also been a year I think, started offering: basically capacity-optimized instance pools. What happens is you tell Amazon which kind of resources you want to have in your cluster, and Amazon picks the instance types for you. Using that we were actually able to bring down spot terminations, so we are saving money and have a more stable system.&lt;/p&gt;
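
&lt;p&gt;For illustration, a capacity-optimized spot request through EC2 Fleet looks roughly like this with the AWS SDK for Go v2; the launch template name and capacity numbers are placeholders, a sketch rather than SimScale's configuration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;package main

// Minimal sketch of requesting spot capacity with the
// capacity-optimized allocation strategy: EC2 picks the spot pools
// with the deepest capacity, which lowers interruption rates.

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/ec2"
    "github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    client := ec2.NewFromConfig(cfg)

    _, err = client.CreateFleet(ctx, &amp;amp;ec2.CreateFleetInput{
        Type: types.FleetTypeRequest,
        SpotOptions: &amp;amp;types.SpotOptionsRequest{
            AllocationStrategy: types.SpotAllocationStrategyCapacityOptimized,
        },
        LaunchTemplateConfigs: []types.FleetLaunchTemplateConfigRequest{{
            LaunchTemplateSpecification: &amp;amp;types.FleetLaunchTemplateSpecificationRequest{
                LaunchTemplateName: aws.String("simulation-worker"), // placeholder
                Version:            aws.String("$Latest"),
            },
        }},
        TargetCapacitySpecification: &amp;amp;types.TargetCapacitySpecificationRequest{
            TotalTargetCapacity:       aws.Int32(4),
            DefaultTargetCapacityType: types.DefaultTargetCapacityTypeSpot,
        },
    })
    if err != nil {
        log.Fatal(err)
    }
}
&lt;/code&gt;&lt;/pre&gt;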

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Is it just CPU instances you use, or do some of your simulations require GPUs? And if so, are there any different considerations you need to take when using the spot market?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; A while ago we started using a GPU-based simulation solver, which we use because it's extremely scalable. We can use it to simulate entire buildings, or building blocks in cities, which is great. And Amazon luckily has GPU instances available, which is perfect because they're incredibly expensive to purchase yourself, so that would be completely unfeasible. One challenge that we've seen, though, is that, at least where we operate, of course Amazon has multiple regions which have different amounts of instances available across different availability zones and different instance types. One problem that we saw is that, at least in our region, there are not always enough GPU instances available of all types. So that's something that we're struggling a bit with right now. Of course, one solution would be to say okay, these are batch jobs, we can run them any time; you just wait until demand for GPU instances goes down and then you run them. The thing is, of course, that even though these are large and long-running jobs, our users would like to have results fast. In some cases, if a job takes days, maybe this is not so critical, but in the rest of the cases, especially with this extremely scalable solver where you can actually get complex results in a short amount of time, people would like to have them quickly. What we're looking at here is that it seems like the best solution is to go multi-region. We see whether there are more of these types of instances available in other regions, and then it becomes more about tricky tradeoffs. For example, if we run these jobs in different regions where there's more GPU instance availability, we either have to pay for data transfer, and see if this is worth the cost, or we have to replicate some of our data processing services into other regions, where we have to see if the fixed cost of running those services, or scaling them up and down, offsets the cost of data transfer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; You mentioned earlier having a browser-based editing tool; this sounds like it would have different compute needs from your simulation jobs. Can you talk about some of the differences between the two architectures?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; This is actually a feature we're extremely proud of, because none of our competitors have it yet. Our users sometimes want to make small modifications to their CAD files after they upload them. So we're building a small in-browser CAD editor, not a full-fledged one, because that's quite complex, but a small one to make small modifications. We actually leverage some nice approaches that we've learned over the years to make this work well. One of them is using Go to package C++ applications, to wrap them, kind of, and also using these interactive services that we have built so far at SimScale. What we did is we purchased and licensed the world's best CAD library, or CAD kernel, which is the term used in the industry, and we wrapped it with our own services. In the end we did not want to use ECS, because while ECS works well for many use cases, in this case we have a situation where we have to create sessions for this service on demand. If a user comes to SimScale and uploads a CAD file, we have to run a session of this external library that we licensed; this is simply a limitation of the library, and we have to basically create this session somewhere. We realized this doesn't happen fast enough: if we scale on ECS, actually starting that task takes too long. What we experimented with then is using Fargate, which is kind of built for exactly this. Fargate is built more for these one-off, short tasks, and they often start faster. It actually worked okay. The main reason why we ended up not pursuing it back then, and actually rolled back to using ECS and making optimizations to start these sessions faster, was simply cost. Back then the Fargate pricing model was not a good fit for this. Now this might actually have changed in our favor, because Amazon changed the Fargate pricing model a bit. So this is something we're actually considering going back to.&lt;/p&gt;
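
&lt;p&gt;A minimal sketch of that session-per-task idea on Fargate with the AWS SDK for Go v2 (the cluster, task definition and subnet are placeholder names, not SimScale's):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;package main

// Minimal sketch: launch one short-lived Fargate task per user
// session, with no cluster capacity to pre-provision.

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/ecs"
    "github.com/aws/aws-sdk-go-v2/service/ecs/types"
)

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    client := ecs.NewFromConfig(cfg)

    // Each call starts an isolated session task for one user.
    _, err = client.RunTask(ctx, &amp;amp;ecs.RunTaskInput{
        Cluster:        aws.String("cad-sessions"),
        TaskDefinition: aws.String("cad-kernel-session:1"),
        LaunchType:     types.LaunchTypeFargate,
        NetworkConfiguration: &amp;amp;types.NetworkConfiguration{
            AwsvpcConfiguration: &amp;amp;types.AwsVpcConfiguration{
                Subnets: []string{"subnet-0123456789abcdef0"},
            },
        },
    })
    if err != nil {
        log.Fatal(err)
    }
}
&lt;/code&gt;&lt;/pre&gt;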

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; The pattern I see the most in startups is a variation of the N-tier architecture, which has multiple layers where the presentation, application processing and data management functions are separated. With this architecture, startups can iterate quickly and then scale out their application when needed. Usually you can implement application processing, or business logic, by importing libraries and then building on them in your language of preference. The simulation libraries SimScale use are large legacy codebases written in C++, a language that Anatol knew they could not iterate quickly with. Unable to hire themselves out of the problem, they had to find another solution. So they built what looks to me like a facade layer in GoLang: a thin layer that interacts with the C++ library, exposing the functionality as an API that they control. Hiding much of the complexity of the underlying codebase makes it easier and faster to build on top of. GoLang is a compiled language, it's easy to containerize with short and clear Dockerfiles, and using containerization SimScale could then start to create services and break apart their monolith. I find that breaking out long-running jobs from a monolith as single services is usually the best place to start. Services do one thing, and this makes them smaller and simpler to understand, as they use a fraction of the codebase. They can then be scaled in increments based on the queue length. And since they run in a separate process, even a long-running queue will not affect the rest of your application. Simulation jobs can run for hours or even days; building these services and using queues creates a fault-tolerant architecture, making spot instances a great way to save money. Spot instances are unused EC2 capacity offered at up to a 90 percent discount on on-demand prices. SimScale used these to reduce their costs and even offer a free tier. By being flexible with the instance type, size, availability zone and even region, they can get the biggest discounts and availability on offer. If for whatever reason they can't get enough of the compute they need, they can fall back to using on demand. One pattern that was really interesting to me was how they use EBS volumes to speed up making an instance available. They needed to quickly complete small simulation jobs for customers, but they have software packages that run in the tens of gigabytes. From their testing, they weren't happy with the speed of getting an instance started from an AMI of that size. They did something, though, that I've not seen before, which was to have a pool of EBS volumes ready with the software installed on them. Then, when an instance starts, they can attach a prepared EBS volume to the EC2 instance dynamically. Based on their success, this appears to have balanced out the cost and performance for their needs. Simulation jobs can need large CPU or even GPU instances, and having a warm pool of these available would be expensive. Instead, having ready EBS volumes has improved their startup time and reduced their costs overall. Let's get back to Anatol to hear about what he's learned while building SimScale, the best way to onboard developers, and advice on how to start a startup. You've learned a lot over the last eight years building SimScale. If you could go back and do one thing differently, what would that be?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; Yeah, I mean, this would of course be a total game changer, because we've done so many things, especially through hard work, which you cannot replace with anything else: just experimenting, building things and failing and doing better next time. Almost all startups learn this lesson: start from the beginning with a good continuous integration approach, so that you don't build things and deploy them by hand. It's extremely messy to clean up at some point and is just a huge drag on efficiency. Maybe not starting with a monolith, even though that's a tough choice to make, because microservice approaches have some overhead which is maybe too much in the beginning. At least making the switch at a more sane point, before we had, I think it was 30 or 25, developers working on our code already and being extremely impacted by the inefficiency of our system. And then of course picking the right services from the beginning, for example being smart about picking ECS from a very early point and not running our Docker services ourselves or with home-built scripts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; You mentioned that you actually acquired a group of developers. So when you have developers join your team, what are the best practices that you've found for onboarding them? Are there any resources or particular places that you point them to to get them started?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; There's actually something which is also close to our heart, which was not great in the past and is much, much better now as well. I would not even start talking about resources, to be honest. One of the main things you can do to make a developer's life easy is just having a good architecture in place, something which people understand intuitively, or which isn't arcane and written in a way that you actually have to read a lot of documentation to understand. It's much better if you have a system that is so intuitive and makes so much sense that it just kind of clicks when you're looking at it. That, I would say, is really a huge win. Also, for example, having good tests in place, which both describe how the system works and make it safe to modify and play around with. We also have documentation, for example, which we point people to, introducing them to all the different tools that we use, introducing them to Amazon: which services in Amazon we have, what we use them for, how you interact with them. We actually have a bit of our own tooling built around ECS for managing deployments, managing service versions and so on. Of course, how to use that: the tools themselves are also written in a way to be sort of self-explanatory, but at least some documentation of how to use them is quite important too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Now, if we put you in the shoes of a mentor for a developer that wants to go and start their first startup, what piece of advice would you give them to set them off on the right path?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anatol:&lt;/strong&gt; Interact with excellent people. I simply believe that SimScale couldn't exist if we hadn't, quite early on, found just a great team of people, maintained that standard, and maybe even leveled it up more. Making sure that every addition to the team is someone who brings in knowledge, who brings in creativity, and is also just a pleasure to work with. I still believe the biggest advice, critical to any business, is to make sure that you interact with the right people; smart people are there to help you have creative ideas and find better solutions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Thank you for sharing how you've built, scaled and architected using difficult-to-maintain legacy codebases, and also your lessons learned and best practices. If you're excited about building the next big thing, or you want to learn from the experts that have been there and done that, subscribe to Startup Engineering wherever you get your podcasts. Remember to check out the show notes for useful resources related to this episode, including blog posts by SimScale and Anatol's team and how to get in touch with them. Until the next time, keep on building.&lt;/p&gt;
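
&lt;p&gt;As a closing illustration of the queue-driven worker pattern discussed in this episode, here is a minimal sketch using SQS and the AWS SDK for Go v2: the worker long-polls a queue and only deletes a message once the job succeeds, so a terminated spot instance simply leaves the job for another worker. The queue URL and &lt;code&gt;runJob&lt;/code&gt; are hypothetical placeholders.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;package main

// Minimal sketch of a fault-tolerant simulation worker: jobs live on
// an SQS queue, and a message is only deleted after the job succeeds.

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/sqs"
)

// runJob is a placeholder for invoking the actual solver.
func runJob(body string) error {
    log.Printf("running job: %s", body)
    return nil
}

func main() {
    ctx := context.Background()
    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatal(err)
    }
    client := sqs.NewFromConfig(cfg)
    queueURL := "https://sqs.eu-west-1.amazonaws.com/123456789012/simulation-jobs" // placeholder

    for {
        out, err := client.ReceiveMessage(ctx, &amp;amp;sqs.ReceiveMessageInput{
            QueueUrl:            aws.String(queueURL),
            MaxNumberOfMessages: 1,
            WaitTimeSeconds:     20, // long polling
        })
        if err != nil {
            log.Fatal(err)
        }
        for _, msg := range out.Messages {
            if err := runJob(aws.ToString(msg.Body)); err != nil {
                continue // message becomes visible again and is retried
            }
            client.DeleteMessage(ctx, &amp;amp;sqs.DeleteMessageInput{
                QueueUrl:      aws.String(queueURL),
                ReceiptHandle: msg.ReceiptHandle,
            })
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;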

</description>
      <category>podcast</category>
      <category>aws</category>
      <category>startup</category>
    </item>
    <item>
      <title>Echobox — Scaling Remote Working with Data — Startup Engineering</title>
      <dc:creator>Rob De Feo</dc:creator>
      <pubDate>Mon, 11 May 2020 10:57:28 +0000</pubDate>
      <link>https://forem.com/aws/echobox-scaling-remote-working-with-data-startup-engineering-19de</link>
      <guid>https://forem.com/aws/echobox-scaling-remote-working-with-data-startup-engineering-19de</guid>
      <description>&lt;p&gt;Working remotely is becoming the new norm with more companies adopting long term, yet little analysis has been completed to measure its benefits. Before scaling their team Echobox measured the impact on remote working on performance. Marc Fletcher the CTO of Echobox explains how they analyzed remote working in their team by using data collected on productivity over a 2 year period.&lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://open.spotify.com/embed/episode/543BkKkUe7IdDpn3d2NvJ6" width="100%" height="232px"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Marc Fletcher graduated with a PhD in Physics (Quantum Computing) from the University of Cambridge and has been the CTO at Echobox since 2014. Whilst not jumping out of planes or competing for GB as a professional skydiver, he's particularly passionate about maximising productivity in high performance cross-functional technology teams.&lt;/p&gt;

&lt;p&gt;These are some useful resources related to the episode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://nicolefv.com/" rel="noopener noreferrer"&gt;Nicole Forsgren&lt;/a&gt; - Author of  Accelerate: The Science of Lean Software and DevOps, and is best known for her work measuring the technology process and as the lead investigator on the largest DevOps studies to date.&lt;/li&gt;
&lt;li&gt;Echobox are hiring - &lt;a href="https://careers.echobox.com/" rel="noopener noreferrer"&gt;https://careers.echobox.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Github Organization - &lt;a href="https://github.com/ebx" rel="noopener noreferrer"&gt;https://github.com/ebx&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;AWS resources for remote working - &lt;a href="https://aws.amazon.com/blogs/aws/working-from-home-heres-how-aws-can-help/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/aws/working-from-home-heres-how-aws-can-help/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Halo effect - &lt;a href="https://en.wikipedia.org/wiki/Halo_effect" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Halo_effect&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Listen&lt;/h2&gt;

&lt;p&gt;Learn from some of the smartest people in the business by subscribing to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://open.spotify.com/show/58m8vSzXx1WKirKLTdCQa6" rel="noopener noreferrer"&gt;Spotify&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://podcasts.apple.com/podcast/startup-engineering/id1504093052" rel="noopener noreferrer"&gt;Apple Podcasts&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;  &lt;a href="https://podcasts.google.com/?feed=aHR0cHM6Ly9mZWVkcy5zaW1wbGVjYXN0LmNvbS9iRjFVUXhocA&amp;amp;episode=Y2QwZTQ0MmQtNTA0My00MmNlLTlkMjMtYmEyNzNjODUwYmEx&amp;amp;ved=0CAcQ38oDahcKEwiYsKm5qcLoAhUAAAAAHQAAAAAQAQ" rel="noopener noreferrer"&gt;Google Podcasts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.breaker.audio/startup-engineering" rel="noopener noreferrer"&gt;Breaker&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.stitcher.com/podcast/startup-engineering" rel="noopener noreferrer"&gt;Stitcher&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://feeds.simplecast.com/bF1UQxhp" rel="noopener noreferrer"&gt;RSS feed&lt;/a&gt; to listen on your platform of choice&lt;/li&gt;
&lt;li&gt;  More details can be found on the &lt;a href="https://startupengineering.co/episodes/echobox-scaling-remote-working-with-data" rel="noopener noreferrer"&gt;episode&lt;/a&gt; page&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Transcript&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Welcome back to Startup Engineering, a podcast that goes behind the scenes at startups. I'm Rob De Feo, Startup Advocate at AWS. Together we'll hear from the engineers, CTOs and founders that built the technology and products at some of the world's leading startups. From launch through to achieving mass scale and everything in between, experts share their experiences, lessons learned and best practices. Our guest has a PhD in quantum computing and competes for Great Britain as a skydiver. Marc, the CTO of Echobox since 2014, is particularly passionate about maximizing productivity across technology teams. In this episode, Marc explains how they use data to maximize cross-functional team performance. We know that making data-driven decisions is the best approach, but what do you do when you need to make a decision with far-reaching consequences and data is hard to come by? Marc will help explain how his team at Echobox approached this problem. Could you start by giving us some background and describe the problem that Echobox is solving?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; We're a six-year-old startup. We build products to help automate and optimize digital publishing. The vision of the company is to help automate as much of the publishing space as possible, to allow companies to reinvest all of the extra time in quality research and journalism.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; One of the things that I was quick to realize about Echobox is that you use data everywhere, especially when it comes to making decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; Absolutely. I think trying to take data-driven decisions is very much part of our culture. It's one that encourages us to always go back to the data, to keep learning, to change our decisions when the data changes. That's something we've done from the very beginning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; You make it sound quite easy, but it's really difficult to do at an early stage in a startup. Often there isn't enough data. And then as things are growing, they're growing almost too quickly; you don't have time to look at the data. How do you strike that balance?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; That is a challenge, one I think every startup will recognize. You have very limited resources, not only to deliver business value, but also to evaluate and optimize the processes that you're implementing in order to deliver that business value. There's never a right answer. You can always look back in hindsight and say we wish we'd spent more time focusing on this particular process, because now we've learned a better way. And that's a balance that every startup is constantly trying to find.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Many of the startups I speak to use data to make decisions, particularly on product or where to make investments. You're also using it internally with your processes. Can you talk us through the example of when your team came asking to work more remotely, and how you approached this?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; We try and take a data-driven decision. We obviously want people's working environments and lives to be as fun and as productive as possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Productivity and fun may or may not be connected. Can you talk us through your thought process and how you approached figuring this out? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; For any decision that we're trying to make that impacts the way in which we work, or the processes involved, the first step is to go out and try to learn from other people's mistakes. Most of the time it'll involve a lot of Google searches, to ask the same questions of other companies and see how they went about solving this particular problem, or what optimizations they found. Once we've collated all of the different perspectives that are out there, we try to apply them to our particular use case and ask, "which of these processes do we want to try?", "which do we want to experiment with?" Generally we do that through a very quick show of hands. The ideas that have the most interest we'll try first. That's a process that we are doing continuously. It's a never-ending stream of trying things differently. And over time, the hope is that we end up with a leaner, more efficient process that everybody feels they've been able to contribute to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Yeah, that's great. It keeps people involved, and it sounds iterative. I don't know about you, but when I'm looking around on the Internet I'm always stumbling across great new ideas. Are these the sort of things that you're going to try out?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; The newest and latest ideas are great for inspiration. But one of the things that we're very careful of is what's referred to as the halo effect, which is the tendency people have to focus on the high financial performance of a successful company and then assume that that performance has been achieved as a consequence of all of its attributes, even the negative ones. I've spoken to people who have worked at Google for over 20 years. They will generally argue for what Google does because it works for them. Google has obviously created a fantastic culture. It's a structure and process that works for them. But at the same time, it's far from perfect. When people are looking at these new ideas and concepts, I think it's always important to evaluate them for your industry, for your stage of company, for your own team size; these are very important factors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Remote working has been around for a while in different forms. What were the concerns that you had, if any, when your team wanted to scale out working remotely?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; For us, there hasn't been anything with potentially higher impact than remote working. As a company we have always tried to maximize the flexibility people have when executing their work. We want to be very outcome driven, rather than focused on how you necessarily get to that outcome. From the very beginning of the company we've allowed small amounts of remote working. Around two years ago we started having people in the team asking if they could increase the amount of remote working. You cross a threshold once someone is remote more than they are in the office, and that creates some potentially very large questions as to what the impact of that is going to be on the company. The cost of experimentation there might be very high, because once someone is working fully remotely, it may be that they don't want to come back into the office if you find out that it's actually having negative consequences. There are huge considerations when deciding how do you test that, how do you proceed with that, and how do you address those kinds of requests?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; You spoke about it being difficult to experiment because the cost of experimentation is high. What were the metrics that you were most concerned about, and how did you adapt your approach to be able to measure and mitigate these?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; I think what's unique about the approach Echobox has taken when thinking about remote working in particular is looking at the productivity impact, or the speed of business value delivery, from a company perspective. In the vast majority of cases when people talk about remote working, they are considering the employee's perspective, the benefits to them, and the impacts on team dynamics. For us, it was very much a case of wanting to start from a position of what the company impact is, and then making sure that all of the other considerations were also part of that process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; This is where I think your approach is particularly interesting, as you've looked at it from the perspective of the company, and one of the key concerns you had was how productivity would be affected. Were there any other benefits outside of this that you were hoping to realize if this was successful?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; The ability to hire fantastic developers. We're a cutting-edge company that works with the latest A.I. and machine learning technologies. Being able to hire great people makes a huge difference in our ability to deliver those kinds of technologies in products that people care about. When it comes to hiring, particularly in a city as large as London, the competition is fierce. If as a company we can confidently hire people that are fully remote, that gives us a much, much larger pool of potential candidates to pick from. It's not just that these candidates might be technically more capable. They also come with different experiences and different cultures. These kinds of things also enrich a company beyond just the products that we build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; We've got a good idea of the benefits. What's the ideal way to run an experiment, collect the data, and then roll this out?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; I think most people in this kind of situation would love to run a controlled A/B test where they get half of their workforce to work fully remotely, then compare the results between team members that have been working fully remotely and the ones that have been in the office. Arguably that's one way to get a reasonably definitive answer, assuming of course your sample size is large enough. The problem, and this is particularly acute for small companies like startups, is one, they don't have a large sample size, and two, the risk to them if it does suddenly blow up in their face is enormous. All of a sudden they've committed the company down this path and then they find out the results have been negative. Now they need to try and recover any of the productivity that has been lost. What generally ends up happening is that people only report on the anecdotal or personal experiences of people working remotely. What we've tried to do at Echobox, and I know other studies have done similar things, is a retrospective analysis on the data we already have available: "Okay. Over time, we've had different levels of remote working. How has that impacted various metrics that relate to productivity? And what can that tell us about the risks or the costs of trying this at a larger scale or on an ongoing basis?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; You mentioned the experiences from the perspective of an employee, almost the anecdotal experiences. How do you take these into account when you're running this experiment?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; This is where things start to get really interesting when it comes to remote working, because I think the anecdotal experiences of individuals very rarely line up with the reality of what the data tells you. The vast majority of people that work remotely will argue, "I felt more productive" or "I was more productive". That's a question we can potentially look at the data to answer. Far more subjective, where data isn't going to be as helpful, are feelings of stress, of motivation, the ability to feel creative, not having to commute and travel, the proximity to customers, and potentially the different time zones involved. These are just some of the multitude of factors that someone will personally experience when they experiment with remote working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Let's leave aside for one moment the immeasurable, or at least difficult to measure, and focus on what you were able to measure and how you measured it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; We will have a longer form article available that people can look at, but there are a couple of things we have measured at Echobox to try and give us insights into the productivity impact of remote working. The first one is obviously the proportion of a team member's time spent remote versus in the office. Generally, we would have people self-select a given number of days a week when they would be in the office, and vice versa. What we can then measure and compare against that baseline is the number of deployments that we make to the products. There's some additional research by Nicole Forsgren; she has a great book that talks about engineering productivity. One of the predictive relationships it establishes is between the number of deployments and productivity, so that's been our primary metric for measuring productivity. We also measure team velocity, the number of story points that engineers are completing on average in any given month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; You were using the number of deployments and the number of story points completed and presumably some sort of flag indicating whether they're working remotely or not. What did you have to do to collect this information?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; This is data we were already collecting as part of our normal drive to measure the things that are measurable, which may or may not be useful for current and future use cases. In this particular case, we were just very lucky that we were measuring the number of remote days. When we actually started asking questions about the company impact of remote working, we were able to go back and pull out measures of deployments and story points.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; After you pulled this data, what were the findings? How were story points and deployments affected?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; I would preface this by saying our sample size is small, and I think the predictive relationship between these metrics is still something that is being established in the scientific literature. And just because we see these results doesn't mean that other people won't see different results. I think that's again one of the confounding factors when it comes to understanding the impacts of remote working. But in our particular case, we saw no correlation between the level of remote working and the productivity of our engineers. The relationship is flat. We can't say it went up or down. In our case, it looks like there was no impact.&lt;/p&gt;
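
&lt;p&gt;As a minimal sketch of the kind of retrospective check Marc describes, the correlation between remote-working levels and productivity metrics can be tested in a few lines of Python. The table layout, column names, and values below are illustrative assumptions, not Echobox's actual data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal sketch of a retrospective correlation check between remote
# working and productivity metrics (column names and data are illustrative).
import pandas as pd
from scipy.stats import pearsonr

# One row per engineer per month, built from data already being collected.
df = pd.DataFrame({
    "remote_share": [0.2, 0.4, 0.0, 0.6, 0.8, 0.4],  # fraction of days remote
    "deployments":  [12, 14, 11, 13, 12, 15],        # per engineer per month
    "story_points": [21, 24, 20, 23, 22, 25],
})

for metric in ["deployments", "story_points"]:
    r, p = pearsonr(df["remote_share"], df[metric])
    # A flat relationship shows up as r near zero with a large p-value.
    print(f"{metric}: r={r:.2f}, p={p:.2f}")&lt;/code&gt;&lt;/pre&gt;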

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; This finding is huge. What you were able to establish with some degree of certainty is that between someone working remotely and someone working in the office, there's no significant difference in productivity as you measured it, by story points and deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; Yes, that's what our data seems to be showing, which very much goes against the anecdotal reports of these same team members, who felt they were much more productive when working remotely. There's obviously an interesting thing going on there where people perhaps feel happier and feel more productive, but in reality they are achieving the same amount of work. They might just be happier while they're doing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; There's something interesting going on here: you've got people's perceptions telling them that they are more productive or happier, but the data shows exactly the same performance. When you present this to a team, how do they respond to it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; I think when people look at the data themselves, they can almost integrate it into their own perspectives and understand that from a company's perspective, there may be no advantage or disadvantage to people working remotely. It very much becomes a personal question: how do I personally feel about working remotely? I think for the most part, we found that people's preference lies somewhere between always being in the office and always being fully remote, at least for our engineers, because there is a balance required to get those feelings of being part of a team and the ability to collaborate and create as a group. These kinds of processes are much easier to do in person than over a video call or something like that. At the same time, you have some amount of work that requires focus and a distraction-free environment. There's generally no better place to achieve that than by working out of the office, away from all of the distractions that might take time away from your work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; You mentioned the distraction-free environment that's often found away from the office. At least in my personal experience, I think that depends. For example, if I'm able to choose when I'm working remotely, I can schedule my time in and around things I know about, like school holidays or building work that's been scheduled. If this is something that I do all the time, I'll lose some of that flexibility. Is this something that you observed?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; That is the key thing to take away when it comes to talking about remote working: it all depends. There are so many different underlying factors that ultimately impact somebody's productivity. It's not the remote working that's directly impacting it. It's the contribution of, and the interplay between, all of these underlying factors that really impacts the end result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; If you're doing this for the first time, it's like anything new: it's going to take you some time to learn how to work remotely effectively. Did you notice any productivity difference between when someone first started working remotely and after a while?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; Absolutely. I think when people start working remotely for the first time, there's almost an anticipated dip in productivity while people work out "okay, what is the correct environment?", "Where in my home should I set up my office?", "Do I need a better office chair?", "What software should I be using?", "What headset and microphone is going to give me the clearest and most engaging video calls with my colleagues?" These things take time to optimize, and over time you improve and iterate on them until you end up with an environment that is very much customized to your personal preferences. I think maybe that's where people subjectively feel more productive, and they may well be compared with other environments. But at least in our case, once people have achieved that long-term stabilization of "this is my setup, this is what works for me", we haven't seen an increase in productivity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; It makes sense. It would take someone time to get that setup as functional as an office that's evolved over time to be good for collaborating with colleagues. One of the things I'm wondering about is how relationships change over time. If I was used to working with my colleagues in front of me and now I don't, how do those relationships change? Is that something you're able to measure?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; It would be great if we could measure that right now. Other than the quarterly employee happiness surveys that most companies will be running, there's no measurement for that, and at the scale of our company at the moment, those surveys don't really give us enough resolution to make any statistically significant decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; So this is one metric that you're not yet tracking. But on top of deployments, employee happiness, and story points, are there other metrics that you'd want to track as you scale out remote working?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; We'd love to, being a data driven company. We would love to have very high accuracy on all of the data that we collect. But we're still a relatively small company, and we can't lose sight of the fact that we should be trying to deliver business value as quickly as possible. It is definitely very easy to end up in a position where you're constantly trying to improve your processes to the point that you're not actually delivering any business value. I think people will see this a lot from a technical perspective. People want to spend time removing technical debt from a platform so that it's easier to make future changes and to reduce the cost of the platform. You can spend a huge amount of time on these things and not really move the product forward for your customers in that time. It's always about finding the balance between the right amount of measurement, the right amount of retrospectives, and the right amount of building new things. That's the hard part: deciding how to balance those different things out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Yeah, that's hard. So how do you do it? How do you balance between moving quickly, building new things and making improvements?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; The way in which we balance these requirements at Echobox is to measure everything that we can in the time that we have available in order to track the things that we think we care most about. We tend to limit it to a couple of hours a month per person, tracking, generally speaking, a couple of high level metrics on a monthly basis. I would absolutely agree that having too much data can be a negative thing, because it might encourage you to start anomaly hunting and asking questions about why certain patterns exist that might not actually be that relevant to your high level business metrics. It's better to track a small number of things well, that you can have a high degree of confidence in, so that when you do start asking longer-term or retrospective questions, as we have tried to do with our analysis of remote working, you can similarly have a good degree of confidence in your ultimate conclusions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; If there's something that could be done to improve your measurement, what would that be?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; I think it would be great if there were more well-established measures of productivity, particularly within a technology organization, that allowed you to be more confident in the metrics you are tracking, ideally down to an individual level. In all of the work we have done to try and assess the impact of remote working, we've really only been looking at team level metrics. The reason I think being able to do it at an individual level is the ideal case is that you might find some people within the team are more productive when they're working remotely and other people are perhaps less productive, which on average will obviously balance out to zero. That would be very good evidence to suggest it is important to allow people to make optimizations at the individual level. And really, as a company, that's the situation we've more or less ended up with. We want to be as flexible as possible in how people get their work done. There are many choices that people have to make for themselves. Some people would rather be in the office more, others want to be in the office less. One of the nice things the work we have done so far allows us to do is to have discussions with people about supporting those different use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; What you've mainly been describing is remote working rather than distributed teams, as in there's a central office rather than people spread around the world. Is creating a more distributed working environment something that you're looking to do?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; Right now, as a company, all of our technology focused team members are in the same time zone, or at least the team members that need to jump on a video call regularly are in the same time zone. I imagine that as the company scales and you start having teams across time zones, that's just another complicating factor that you need to manage and find efficient ways of dealing with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; We've mainly been talking about how to measure people's productivity when working from home and inside the office. But how do you look to improve people's productivity when people are working in different locations?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; We certainly do the standard retrospectives when it comes to delivering business value every couple of weeks. When you plan the next iteration of work, everyone is asking themselves, "how did the last couple of weeks go?", "What can we do to improve?" We also have dedicated retrospective discussions once every six or seven weeks that allow people to take a more long term look at the last month or quarter and ask: what changes would we like to propose? What experiments would we like to run in terms of the processes that we execute? That's something that's very much ongoing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Based on some of these learnings that you've gathered with your team, if you were helping someone scale out remote working, what would you advise them to do, or how would you advise them to approach it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; I like to think that if you have a particular goal in mind, whether that's a process or a technical objective, you minimize your risk by iterating towards it. If we look at remote working as a particular example, what that would mean is that it's probably most difficult to go from being in an office full time to being fully remote full time, because it's the most discontinuous in terms of experience and process. My recommendation would be to do one day remote initially, and use that time to optimize your setup. Then if you find that one day is beneficial, you can try two days, and you just keep going until you feel you've found a personal optimum, or until whichever company you work for believes there are particular times you need to be in the office for important meetings, and you just keep iterating. The hope is that over time you make these small, low risk, low cost changes so that you're always moving towards your company's optimum for these processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Now that you've established that there's no gain or loss from working remotely for your company, are you looking to scale out remote working? And are you looking to hire from different pools of talent?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; As a small company, we want to make sure we have an environment that allows for maximum collaboration and creativity, which inevitably means we have a slight preference for people who are in London and can come into the office for some proportion of the week.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Does that mean that you're actively looking at hiring people and potentially even looking at hiring a distributed team?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; Yes, absolutely. Because of the work we have done, we're also just as happy to consider fantastic people from anywhere in the world, because we accept that there might be people that are better outside of London. Even though there might be a small negative impact from not being able to collaborate as freely, the other attributes of these candidates can make up the difference. For us as a company, that's been one of the main conclusions of all of this work: it allows us to be more flexible and to take advantage of the greater opportunity out there without having to worry about a significant negative impact on the company.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Aside from hiring, are there any key learnings or large benefits that you've discovered?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; We very much like the ability to make nuanced decisions. We live in a very complicated, fast changing world where new technology comes along that completely changes the landscape and the way in which people work. We like being able to take these data driven decisions that we review, and that people can comment on and provide suggestions about, all the time. I think that becomes a strength in the long run, because we're always in a position where we can take advantage of opportunities as they come along.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; I think there's a temptation, when you change one thing, in this case the location where you're working, to keep everything else as close to the same as before. I think this misses a big opportunity to change the way people work. An easy change to mention is working asynchronously, as in having a gap between communications. Are there other things that you tried out?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; People shouldn't expect to work the same when they're fully remote as when they're in an office. I think one of the huge benefits of being able to work remotely is the extra flexibility it might mean for a lot of people. They start their day a little bit earlier. They might take more regular breaks, because they have the opportunity to go outside or to the shops, and then they finish a little bit later. From a company's perspective, that's obviously a good thing, because you have people who can help resolve problems outside of normal business hours a bit more easily, even if at the end of the day the number of hours they're working is still the same. It's just perhaps not as compressed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; This is a big change to implement, and in a very different way than implementing a new key feature or even re-architecting a large piece of technology. This is a change that impacts people's lives, and individuals will have different abilities to do this depending on their personal circumstances. So how do you work with your team to make this a success?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; The key thing that everybody in the technology teams at Echobox is used to is experimentation. Nobody who joins our company is ever afraid of trying something different for fear of it not working out, because these are the discussions we will have with everybody after we've run an experiment, to find out how they feel personally and professionally. When it comes to processes like remote working, it's not something we've ever forced on people. People will experiment a little bit, and if they like it, they'll keep doing it. That in and of itself can be a good sign the experiment is a success from one perspective. If people want to continue doing something a certain way, it means it's easier for them or they feel it's more efficient. In that kind of environment nobody ever feels like something is being forced on them just because someone wants it to be a certain way. Everybody's been involved in the experiment, understands the considerations that went into it, and is aware of the results. We've always tried to optimize for people being in different locations, whether that's on the opposite side of the office or because someone has just finished a meeting with a customer on the other side of London. A lot of our processes were already set up to support video calls and working on documents collaboratively over the Internet; that has made sense more or less from day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Marc has looked in-depth at the impact of working remotely, particularly what impact there is on the business. This gives a perspective that's often missing. He identified that scaling out remote working could quickly become a one way door, meaning a decision that's difficult to reverse. When confronted with a choice like this, collecting data, analyzing it, and moving slowly is important. In this case the analysis was made more difficult because there were no established methods for collecting and measuring productivity data when working remotely. Echobox had good existing practices, including a heuristic on how much time to spend collecting data. I really like the idea of spending a few hours per month to measure the most important pieces of data. That acts as a guideline but also as a constraint. Marc and his team had to find the best metrics from the data they had already been collecting. They used the number of story points, deployments, and a flag indicating if someone was working remotely, plus employee happiness scores. Their methodology was sound. Too often people collect data to prove what they already believe, leading to confirmation bias, which is where you prefer data that confirms your existing beliefs. The robust approach is to attempt to disprove a hypothesis, in this case to show there was no effect on productivity when a team works remotely. The result of disproving that working remotely affects productivity might sound underwhelming, but it's actually powerful. It means they can be confident that productivity will not change based on their team working remotely. Being confident in this means they can scale out remote working with no expected impact on productivity. Let's get back to Marc for his learnings, best practices, and advice on how to successfully implement remote working in your startup, or, as an individual, how to be successful when starting to work remotely. You've gone through this process in depth and monitored it closely with data. What are the biggest learnings that you can share based on this?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; Be aware of the different underlying factors that might ultimately influence changes in productivity. You need to consider things like distractions and the focus of your teams, their ability to work creatively and collaborate, and the learnings that are achieved through group discussions. You need to consider people's levels of motivation and the impact of feelings of isolation, somebody's stress, and potentially the cost of training replacements if you have higher turnover. You need to consider the additional time that someone might be able to contribute because they no longer have to commute, your office space costs, and the proximity of your customers, particularly if time zones are relevant. You also need to consider the beliefs and attitudes of your employees. There's some great research out there that suggests someone's success or failure at remote working might be entirely down to their belief that it will work. At that level it becomes almost a self-fulfilling prophecy. When it comes to starting remote work, I would always recommend people over communicate as much as possible. Once you're a couple of weeks in, get some feedback from your colleagues: ask them if they feel you have generally been over communicating, or whether they feel the level is about right. Communication, I think, is one of the key things to get right when working remotely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Ok, great. Do you have any advice for engineers that are about to start remote working in their current roles, or even for junior engineers whose first role is perhaps a remote job?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Marc:&lt;/strong&gt; Things like remote working are particularly challenging for junior team members, not necessarily new team members, but junior team members in particular, because when someone is starting out in their professional career, they haven't yet fully developed the right approaches to focus and motivation, or how to solve problems or get help solving certain problems. My recommendation to anyone that is junior and is starting to experiment with remote working is to make sure you get yourself a mentor, someone in your team that's perhaps more familiar with the processes involved, and have regular unofficial catch ups with them. It might be your manager, in which case there'll be a bit more structure to it. But ask lots of questions; there's never a bad question you can ask. Find out what has worked for other people and what hasn't. And experiment: run those personal experiments to find out the best way for you personally to get the most out of remote working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Thank you Marc for sharing this inside story on how you approached and validated that there was no significant productivity impact from remote working for your team, the lessons learned, and best practices. If you're excited about building the next big thing, or you want to learn from the engineers that have been there and done that, subscribe to Startup Engineering wherever you get your podcasts.&lt;/p&gt;

&lt;p&gt;Remember to check out the show notes for useful resources related to this episode. &lt;/p&gt;

&lt;p&gt;And until the next time. Keep on building.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The AWS Activate Founders tier provides $1,000 in AWS Credits, access to experts, and the resources needed to build, test &amp;amp; deploy. &lt;a href="https://aws.amazon.com/activate/" rel="noopener noreferrer"&gt;Activate your startup today&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>startup</category>
      <category>aws</category>
      <category>podcast</category>
      <category>techtalks</category>
    </item>
    <item>
      <title>Deepset — Scaling machine learning research to enterprise ready services — Startup Engineering</title>
      <dc:creator>Rob De Feo</dc:creator>
      <pubDate>Wed, 08 Apr 2020 09:46:12 +0000</pubDate>
      <link>https://forem.com/aws/deepset-scaling-machine-learning-research-to-enterprise-ready-services-startup-engineering-35he</link>
      <guid>https://forem.com/aws/deepset-scaling-machine-learning-research-to-enterprise-ready-services-startup-engineering-35he</guid>
      <description>&lt;p&gt;Using the latest from machine learning research in enterprise products is hard. Research projects are built to advance research goals. Its not easy to convert papers, code, and scripts in products. They are difficult to maintain and scale.&lt;br&gt;
Malte Pietsch explains their approach to scaling research into production ready enterprise scale applications.&lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://open.spotify.com/embed/episode/3m6tB9pjIhEuahaxPcsUs0" width="100%" height="232px"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/maltepietsch/" rel="noopener noreferrer"&gt;Malte Pietsch&lt;/a&gt; is a Co-Founder of &lt;a href="https://deepset.ai" rel="noopener noreferrer"&gt;deepset&lt;/a&gt;, where he builds NLP solutions for enterprise clients, such as Siemens, Airbus and Springer Nature. He holds a M.Sc. with honors from TU Munich and conducted research at Carnegie Mellon University.&lt;/p&gt;

&lt;p&gt;He is an active open-source contributor, creator of the NLP frameworks FARM &amp;amp; haystack and published the German BERT model. He is particularly interested in transfer learning and its application to question answering / semantic search.&lt;/p&gt;

&lt;p&gt;These are some useful resources related to the episode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deepset - Make sense out of your text data - &lt;a href="https://deepset.ai/" rel="noopener noreferrer"&gt;https://deepset.ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;FARM - Fast &amp;amp; easy transfer learning for NLP. Harvesting language models for the industry - &lt;a href="https://github.com/deepset-ai/FARM" rel="noopener noreferrer"&gt;https://github.com/deepset-ai/FARM&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;HayStack - Transformers at scale for question answering &amp;amp; search - &lt;a href="https://github.com/deepset-ai/haystack" rel="noopener noreferrer"&gt;https://github.com/deepset-ai/haystack&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SageMaker - Machine learning for every developer and data scientist - &lt;a href="https://aws.amazon.com/sagemaker/" rel="noopener noreferrer"&gt;https://aws.amazon.com/sagemaker/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Spot Instances - Managed Spot Training in Amazon SageMaker - &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;ElasticSearch - Fully managed, scalable, and secure Elasticsearch service - &lt;a href="https://aws.amazon.com/elasticsearch-service/" rel="noopener noreferrer"&gt;https://aws.amazon.com/elasticsearch-service/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Automatic mixed precision - Automatic Mixed Precision for Deep Learning - &lt;a href="https://developer.nvidia.com/automatic-mixed-precision" rel="noopener noreferrer"&gt;https://developer.nvidia.com/automatic-mixed-precision&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PyTorch - Open source machine learning framework that accelerates the path from research prototyping to production deployment - &lt;a href="https://pytorch.org/" rel="noopener noreferrer"&gt;https://pytorch.org/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;NumPy - Fundamental package for scientific computing with Python - &lt;a href="https://numpy.org/" rel="noopener noreferrer"&gt;https://numpy.org/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MLFlow - An open source platform for the machine learning lifecycle - &lt;a href="https://mlflow.org/" rel="noopener noreferrer"&gt;https://mlflow.org/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;BERT - Bidirectional Encoder Representations from Transformers - &lt;a href="https://en.wikipedia.org/wiki/BERT_(language_model)" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/BERT_(language_model)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SQuAD - The Stanford Question Answering Dataset - &lt;a href="https://rajpurkar.github.io/SQuAD-explorer/" rel="noopener noreferrer"&gt;https://rajpurkar.github.io/SQuAD-explorer/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Sebastian Ruder - Research scientist at DeepMind - &lt;a href="https://ruder.io/" rel="noopener noreferrer"&gt;https://ruder.io/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Andrew Ng - His machine learning course is the MOOC that had led to the founding of Coursera - &lt;a href="https://www.coursera.org/instructor/andrewng" rel="noopener noreferrer"&gt;https://www.coursera.org/instructor/andrewng&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Listen&lt;/h2&gt;

&lt;p&gt;Learn from the smartest people in the business by subscribing to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://podcasts.apple.com/podcast/startup-engineering/id1504093052" rel="noopener noreferrer"&gt;Apple Podcasts&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://podcasts.google.com/?feed=aHR0cHM6Ly9mZWVkcy5zaW1wbGVjYXN0LmNvbS9iRjFVUXhocA&amp;amp;episode=Y2QwZTQ0MmQtNTA0My00MmNlLTlkMjMtYmEyNzNjODUwYmEx&amp;amp;ved=0CAcQ38oDahcKEwiYsKm5qcLoAhUAAAAAHQAAAAAQAQ" rel="noopener noreferrer"&gt;Google Podcasts&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://open.spotify.com/show/58m8vSzXx1WKirKLTdCQa6" rel="noopener noreferrer"&gt;Spotify&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.breaker.audio/startup-engineering" rel="noopener noreferrer"&gt;Breaker&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.stitcher.com/podcast/startup-engineering" rel="noopener noreferrer"&gt;Stitcher&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;More details can be found on the &lt;a href="https://startupengineering.co/episodes/zego-retrofitting-an-insurance-accounting-ledger" rel="noopener noreferrer"&gt;episode&lt;/a&gt; page&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Transcript&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Welcome back to &lt;a href="http://startupengineering.co/" rel="noopener noreferrer"&gt;Startup Engineering&lt;/a&gt;, a podcast that goes behind the scenes at startups. I'm &lt;a href="https://www.linkedin.com/in/robdefeo/" rel="noopener noreferrer"&gt;Rob De Feo&lt;/a&gt;, Startup Advocate at AWS. Together we'll hear from the engineers, CTOs, and founders that build the technology and products at some of the world's leading startups, from launch through to achieving mass scale and everything in between. Experts share their experiences, lessons learned, and best practices. In this episode, our guest, &lt;a href="https://www.linkedin.com/in/maltepietsch/" rel="noopener noreferrer"&gt;Malte&lt;/a&gt;, a co-founder of &lt;a href="https://deepset.ai" rel="noopener noreferrer"&gt;deepset&lt;/a&gt;, takes us behind the scenes of how they take the latest from machine learning research and use it in enterprise scale products. Malte, can you describe the problem that you're solving for anyone that's not yet had a chance to use deepset?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; We are a startup from Berlin, founded almost two years ago, and we are working on deep learning based natural language processing, mostly focusing on transfer learning these days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; With your focus on transfer learning, what's the big problem you're solving with NLP?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; The biggest problem that used to be out there is the gap between research and actual industry: you usually don't have enough training data in industry. Transfer learning is one way of actually solving that, where you can take pre-trained models and apply them in industry using less data. Our company is right now focused a lot on improving enterprise search engines with the help of NLP. We use transfer learning for the sake of improving search results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Can you describe why transfer learning is so important for you and also for the industry at large?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; Yeah. Back in the day you mostly had one NLP problem. Then you were looking out there asking, "what kind of model architecture helps here?", and collecting some training data for this particular use case. Take question answering, for example, where you have a natural language question as input, and as output you want to get the answer from some kind of text. There you would have to collect a lot of examples, have lots of people annotating this kind of data, and then train your model. Nowadays, with transfer learning, you basically train a model just on raw text data without any annotations, and everybody has this kind of text data lying around. Then you use these models on downstream tasks, where you don't need as much training data anymore. Transfer learning allows building your machine learning models with less training data while giving you better performance. The third effect that we see is a more streamlined development cycle: you can re-use the models across different tasks.&lt;/p&gt;
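
&lt;p&gt;To make the fine-tuning step concrete, here is a minimal sketch using the Hugging Face transformers library (not named in the episode): a model pre-trained on raw text is adapted to a downstream task with only a handful of labeled examples. The model name, example texts, and label count are illustrative assumptions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal sketch of fine-tuning a pre-trained language model on a small
# labeled dataset (model name and data are illustrative assumptions).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# A tiny labeled sample standing in for a downstream task dataset.
texts = ["The turbine housing shows corrosion.", "Invoice approved for payment."]
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # loss is computed against the labels
outputs.loss.backward()
optimizer.step()&lt;/code&gt;&lt;/pre&gt;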

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; So transfer learning is an important development. You're able to take the same model and use it across different domains, which really reduces the amount of data annotation needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; Yeah, exactly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; This is one of the big developments in machine learning. So how are you able to take this research and put it into production, and what products do you use it for?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; We are currently focusing a lot on the search engine problem. We basically started by looking at research and what is out there. More or less two years ago there was a really big jump in performance when you look at research papers, and that was basically the start, when we said, okay, let's focus on this, let's get this into production at enterprise level. That was our journey: taking this research code and bringing it to a stage where we can use it at scale in enterprises.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; There are continuous jumps in research and there's a lot of publicly available information. But transferring that into a product is a difficult task. So how do you go about doing that, and what are some of the problems you found along the way?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; So, first of all, I think the speed of research is crazy. These days in NLP there are new papers published every week and new state of the art models. Just keeping track of all these models is really challenging if your main job is not reading research papers. Then, once you think you're settled on, let's say, a model or what you want to do, I think there are a couple of problems or challenges when you want to bring them into production. One of them is basically pure scale. Usually enterprises have way more data. Speed matters; you have to have it in real time. Sometimes the task is also a bit different. The tasks that you see in the research papers are not exactly the tasks that you need to solve in the real world. In the area of NLP, we have the additional problem of domain data. In most companies that I know, people speak English, but not the plain English that is, for example, used in Wikipedia. There's certain terminology, say, in the legal sector or aerospace, which makes the NLP task a bit different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; And these are the areas where research has been developing very quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; Yes. You can also look at the domain data, where transfer learning helps, but I would say mostly it is the jumps that we saw in just model performance and accuracy. This really enabled a lot of new opportunities for businesses, and it became really interesting to use these models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; When I've looked at research papers or the code samples attached to them, they tend to be focused on the task at hand, as in advancing the research. Can you talk about how you've taken that and brought it to the next step in your product development?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; The key problem with many research papers is that they are really meant for one purpose, and that is training the model for the research task. One problem that comes with that is the code is usually a couple of scripts. It's not really meant for production. I think there's a big temptation to just take these scripts and use them in your proof of concept. What we found, and with a lot of customers in industry, is that there's a risk you end up in a proof of concept trap: you do your POC, but then you can't really bring it into production. Even if you can bring it to production, you risk long term technical debt, where you're building a lot of silos. There's a risk that it causes frustration and disappointment about machine learning if you just take this research code and go forward with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; What's the process you take when you have an idea and then you create a POC? How do you bring that into production?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; I think it's a very common approach for machine learning projects to start with a proof of concept, where you say, "okay, let's be very fast, agile, try a few things out". This is, I think, totally fair, but I saw a recent number that only 16 percent of POCs really make it into production in the end. This is what some people call the POC trap. You start a lot of POCs in a company, but only very few of them make it into production and really create value for the business. There are a couple of reasons, of course, but one that we saw a lot is that you have a model that actually works in the POC stage. People are happy at first and say, "Wow, OK, that kind of works." But then you start talking about "OK, how can we bring this into production?" Then the problems occur: "okay, that's not scalable, or we have to completely rewrite our code, this would take half a year or a year, or this doesn't even scale at all". And then a lot of projects just die.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; So you're acutely aware of the POC trap. But being aware is just one thing; I see a lot of people that are aware of something yet still fall into the same trap. How do you avoid it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; One way of dealing with it is to force yourself, even in the POC stage, to write maintainable, modular code. Not just taking these POC scripts that are published together with the papers, but really having a clean software engineering approach. There are a lot of best practices that have been established in software engineering that also apply to machine learning projects: making code modular, having tests, putting parameters into a config that you can easily share and store. In the long run, that allows reproducible experiments, because in the POC stage you really do a lot of different experiments, and you need some system, a way of organizing things, so that in the end you can easily select the useful experiments and move them into production.&lt;/p&gt;
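
&lt;p&gt;As a rough illustration of the config-driven approach described here (all field names and values are invented for the example, not deepset's actual setup), the parameters of a run can live in one shareable object instead of being scattered through scripts:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A rough sketch of config-driven experiments (field names are illustrative).
# Storing hyperparameters in one place makes runs easy to share, store,
# and reproduce later.
import json
from dataclasses import dataclass, asdict

@dataclass
class ExperimentConfig:
    model_name: str = "bert-base-uncased"
    learning_rate: float = 2e-5
    batch_size: int = 32
    epochs: int = 3
    seed: int = 42

config = ExperimentConfig()

# Persist alongside results so the experiment can be re-run exactly.
with open("experiment_config.json", "w") as f:
    json.dump(asdict(config), f, indent=2)&lt;/code&gt;&lt;/pre&gt;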

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Can you go into detail about one of the examples where you've done that?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; Yeah. One was actually in the area of transfer learning. We created an open source framework called &lt;a href="https://github.com/deepset-ai/FARM" rel="noopener noreferrer"&gt;FARM&lt;/a&gt;. It started with Google publishing code together with the &lt;a href="https://en.wikipedia.org/wiki/BERT_(language_model)" rel="noopener noreferrer"&gt;BERT model&lt;/a&gt;. We had a look at this code and said, "OK, let's take this and let's use it in industry". One step of that was really making it modular. That meant having certain components in the pipeline for pre-processing; in this case it's usually a tokenizer, which separates your text into individual tokens, something like words. Then having processors that can consume different inputs from files, from API requests, and so on. And also a way of sticking these models together; this is not always done, but we find it very useful to also separate the models into smaller pieces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; You're going a step further than just creating modular code. Can you describe what you mean by separating the models?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; In the area of transfer learning, you usually have a language model that you train just on pure text, and that's something like BERT. To actually use these language models you need something called a prediction head: a few layers in your neural network that you stick on top. These layers can then deal with, let's say, a text classification or question answering task. You now have two options: either you treat the whole thing as one model, so you have, let's say, one BERT model for question answering, or you treat it as two components. You have a language model as the core, then you have a prediction head as a second object on top. If you do that, you gain some flexibility, because you can exchange the language model part quite easily when new architectures come out. And that's what we see now on a weekly basis, with many different architectures. You can also experiment in a nice way, where you stick multiple of these prediction heads on top of the model.&lt;/p&gt;
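
&lt;p&gt;A minimal PyTorch sketch of this separation is shown below. The class and parameter names are illustrative, not FARM's actual API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal sketch of separating a language model core from a prediction
# head (names are illustrative, not FARM's actual API).
import torch.nn as nn
from transformers import AutoModel

class ClassificationHead(nn.Module):
    """A small prediction head stacked on top of the encoder output."""
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, encoded):
        # Use the first token's representation as a summary of the sequence.
        return self.classifier(encoded[:, 0, :])

class AdaptiveModel(nn.Module):
    """Language model core plus one or more exchangeable heads."""
    def __init__(self, language_model_name: str, heads: list):
        super().__init__()
        self.language_model = AutoModel.from_pretrained(language_model_name)
        self.heads = nn.ModuleList(heads)

    def forward(self, **batch):
        encoded = self.language_model(**batch).last_hidden_state
        # Each head produces its own task-specific output.
        return [head(encoded) for head in self.heads]

# Swapping architectures only means changing the model name;
# the prediction heads stay untouched.
model = AdaptiveModel("bert-base-uncased",
                      heads=[ClassificationHead(768, num_labels=2)])&lt;/code&gt;&lt;/pre&gt;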

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; What you're describing is creating modular code, not only to help you bring it into production, but also to help you continue experimentation. What's the most important factor when you're building modular code?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; I would say it was both: having this flexibility during experimentation, but then also, later, when you move to production, you want it to be future proof. Let's say there is a new model architecture coming out; you don't want to end up rewriting your whole codebase and testing everything again. You really just want to switch the language model, for example. So I see it as both experimenting and operating in the long run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; How long does it normally take you to take code that you've seen in a research project and make it into production ready code?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; It depends a lot on the task, I would say. In our case, for transfer learning, we were really ambitious and said, "okay, we want to build a framework around it and have it open source, so it's really applicable to many different tasks". The first version took us maybe around two months, and we published it in July last year. We are constantly working on it now to improve and extend it; of course, it's a never ending story. But I would say we got the first model with this kind of framework into production after two months, more or less.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; To build the way that you're describing, there's a bigger burden on the initial effort, but that's going to pay off over time. How do you balance the competing goals in a startup to build software that can scale but also to move quickly?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; Yeah, I think that's the art: handling this trade-off and realizing that there is a trade-off. In my early days, when I worked for another startup, I was really tempted to just take the code and get it out as fast as possible. But then you learn that you end up wasting a lot of time afterwards on debugging, testing, and maintaining your codebase. I don't have a number here, I guess.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Do you have a specific example of when you didn't do this and it gave you really big problems, and that changed the way you thought about how you would build?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; Yeah, debugging in general. I had a model once which was failing in production, and it was failing very silently. It was not throwing error messages, of course, but the performance was just degrading. In the end, we figured out that this was a very nested bug, deep down in a script. We had updated a few other components of the pipeline but didn't think of this script and its configuration. That ended up costing me a lot of debugging time to trace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; It would be great to hear the reasoning why you don't just make the code more maintainable, but actually adapt the models. Can you talk a little bit about why you do this?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; There are a lot of, let's say, standard research tasks that research has focused on, and there's a good reason for it: with a standard task, let's say a leaderboard, maybe a standard dataset to work on, it's easy to compare different models. This makes a lot of sense. But these tasks are not always transferable to the real world or to companies. One example is question answering, where there is a dataset and a task in which you have a question and a very small passage from Wikipedia, and you need to find the answer in this small passage. The dataset is called &lt;a href="https://rajpurkar.github.io/SQuAD-explorer/" rel="noopener noreferrer"&gt;SQuAD&lt;/a&gt;, and it's really the most popular dataset out there. Everybody talks about it, and big companies are in competition to get the most prominent ranks on the leaderboard. But in the real world, I would argue, you will rarely have a case where you need to find an answer within a very small passage, say 100 or 200 words. In the real world, you usually have large collections of documents, thousands of documents lying around on SharePoint or somewhere on a file storage system, and you want to find the information from there. This is what I mean: there is a gap between research tasks and real world tasks, and you need to find a way to transfer the results that were made in research to your real world task. In this case, let's maybe walk through it a bit. It means mainly two things. The first one is really scaling from this task of having passages where you need to find the answer, to large document bases. You could say, "OK, maybe it's just a matter of model speed, so let's optimize the hell out of the model"; there are a lot of best practices to do that. But for this particular problem there's no way to do that. Even with all the best practices out there, you could never gain such a speed that it scales to thousands of documents. What you need to do is become a bit more creative, utilizing what is out there and stitching things together. What we did in this scenario was create a pipeline of two models. We have one model that is very powerful but slow, for example a BERT model, and that's what people use in research. We put another model in front of it, called a retriever, which is very fast but only a heuristic model. This first model can identify, from the thousands of documents, the 20 or 50 most promising ones. These then get fed to our BERT model, and with that we get quite good accuracy and speed at the same time.&lt;/p&gt;
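
&lt;p&gt;A minimal sketch of such a two-stage retriever and reader pipeline is shown below, using Elasticsearch for the fast heuristic stage and a transformers question answering model as the reader. How the stages are wired here is an assumption in the spirit of the description, and the host, index name, and field names are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A minimal sketch of a two-stage retriever/reader pipeline
# (host, index, and field names are illustrative assumptions).
from elasticsearch import Elasticsearch
from transformers import pipeline

es = Elasticsearch("http://localhost:9200")
reader = pipeline("question-answering")  # slow but accurate BERT-style model

def answer(question: str, top_k: int = 20):
    # Stage 1: a fast heuristic retriever (BM25 full-text match) narrows
    # thousands of documents down to the most promising candidates.
    hits = es.search(index="documents",
                     query={"match": {"text": question}},
                     size=top_k)["hits"]["hits"]

    # Stage 2: the expensive reader model extracts answers from only
    # those candidates, keeping total latency to around a second or two.
    candidates = [reader(question=question, context=h["_source"]["text"])
                  for h in hits]
    return max(candidates, key=lambda c: c["score"])&lt;/code&gt;&lt;/pre&gt;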

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; That's a really cool approach. You have one model that looks for candidate documents, then another model takes this much smaller list and gives a more detailed answer. What sort of improvements did you see with this new implementation?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; From impossible to a few seconds. If you're talking about thousands of documents, just applying the BERT model out of the box would, I think, take days to process, even on quite powerful GPU infrastructure. Now we're down to one or two seconds, depending a bit on how accurate your results should be. That's more or less the order of magnitude we're talking about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; That's a really big jump. So what tools and architecture do you use to build and train these models?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; We basically use &lt;a href="https://pytorch.org/" rel="noopener noreferrer"&gt;PyTorch&lt;/a&gt; as the deep learning framework. When we train our models, there are a couple of steps involved. The first one is training models from scratch, if it's really needed. This is a really heavy workload task; we rely on large GPU clusters, and what we currently use for that purpose is our FARM framework, which is open source. It's now quite tightly integrated with &lt;a href="https://aws.amazon.com/sagemaker/" rel="noopener noreferrer"&gt;SageMaker&lt;/a&gt;, so you can train on large GPU instances. Right now we use, most of the time, 4 to 8, sometimes 16, NVIDIA V100 GPUs for the training step. But this is really only needed in a few scenarios. More common for the QA model is to take a pre-trained model that may already be out there and fine-tune it for the question answering task. The setup or architecture behind it is pretty similar, also done with PyTorch on GPUs. You usually don't need that many, and it takes maybe an hour or two on a four-V100 instance, a p3.8xlarge. That's for training the model; then we need to get it ready for inference, where we integrate it into the pipeline that I mentioned: a fast heuristic model in front, with our newly trained model behind it. There we usually have quite a tight integration with ElasticSearch, which is very good at getting these fast heuristic results and scales nicely even across millions of documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Cost can be an important factor when training models. Do you do anything to manage the costs?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; Oh yeah, of course, as a startup that's something we also have to keep an eye on. Especially training these large language models is quite expensive. We worked a lot on the integration with SageMaker particularly for one reason, and that is saving costs using &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html" rel="noopener noreferrer"&gt;spot instances&lt;/a&gt;. You can now use spot instances, and basically the model starts training when instances are available. At some point it might stop; we save all the checkpoints, store all the optimizer state and so on, then resume training once another instance is available. That helped us reduce costs by around 70 percent in the last runs that we measured. That was definitely an interesting feature of SageMaker for us.&lt;/p&gt;
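
&lt;p&gt;&lt;em&gt;Enabling managed spot training in the SageMaker Python SDK comes down to a few extra arguments on the estimator. The sketch below uses placeholder role, bucket and entry point names; checkpoints written inside the container are synced to S3 so an interrupted job can resume.&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Managed spot training with the SageMaker Python SDK. Role, bucket and entry
# point are placeholders. Checkpoints written to /opt/ml/checkpoints inside
# the container are synced to S3, so an interrupted job can resume.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                   # your training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.p3.8xlarge",            # 4x NVIDIA V100
    framework_version="1.6.0",
    py_version="py3",
    use_spot_instances=True,                  # train on spare capacity
    max_run=4 * 3600,                         # seconds of actual training
    max_wait=8 * 3600,                        # includes waiting for capacity
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",
)
estimator.fit({"train": "s3://my-bucket/data/train/"})
&lt;/code&gt;&lt;/pre&gt;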

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; To achieve this, you need a way of being able to stop and resume training. Is this something that was native to the model, or is this part of the process you build in when you're productionizing? Or is there another technique you use?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; Yeah, that's something we had to build in. It's actually not super difficult, but we learned some lessons there. What you need to save is the state of every object in your training pipeline. For us that meant saving the current model, the weights of the neural network, and that is pretty easy to do in PyTorch. What was more tricky was saving the state of the optimizer. In our case we were using learning rate schedules, where the learning rate changes over time depending on the progress of training. This is really something you also need to save and load again when you resume training; it was actually tricky to figure out which states you need to save there in PyTorch. We wanted full reproducibility: we always compare, say, one run without spot training and another run using spot training, and they should completely line up. In the end we figured out you need to set a lot of seeds, not only the regular ones in PyTorch, &lt;a href="https://numpy.org/" rel="noopener noreferrer"&gt;NumPy&lt;/a&gt; and Python's random library, but there are also additional random number generators that you need to seed, and only then can you really have full reproducibility.&lt;/p&gt;
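
&lt;p&gt;&lt;em&gt;A sketch of what that can look like in PyTorch: one helper seeds the common random number generators, and two helpers save and restore the model, optimizer and learning rate scheduler state. The names are illustrative, not deepset's code.&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Saving and restoring every stateful object in the training loop, plus the
# seeds needed for reproducibility. Names are illustrative, not deepset's code.
import random

import numpy as np
import torch

def set_seeds(seed=42):
    random.seed(seed)        # Python's random module
    np.random.seed(seed)     # NumPy
    torch.manual_seed(seed)  # PyTorch, including its CUDA generators

def save_checkpoint(path, model, optimizer, scheduler, step):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),   # e.g. momentum buffers
        "scheduler": scheduler.state_dict(),   # current learning rate state
        "step": step,
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["step"]
&lt;/code&gt;&lt;/pre&gt;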

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; You mentioned FARM previously. Can you describe in a little more detail what it is and what role it plays in your training?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; Yeah, it's basically a framework for transfer learning. It can take one of these published pre-trained models that are out there, the most popular one being BERT, and apply it to your own problem, for example classifying documents or this question answering task that I was talking about. We built this framework in a modular way, because we believe that's how you can maintain the code in the long run, and with a lot of support for running experiments and tracking them with other open source frameworks, e.g. &lt;a href="https://mlflow.org/" rel="noopener noreferrer"&gt;MLflow&lt;/a&gt;. We built it because we needed it, we found it very useful in our work, and then decided to open source it. I think that's everything you need to have a fast POC while avoiding some technical debt: you already have modular, maintainable code, so if you transition to production, it's quite easy to keep it up to date.&lt;/p&gt;
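
&lt;p&gt;&lt;em&gt;The modular idea, one shared pre-trained encoder with swappable task-specific heads, can be sketched in plain PyTorch as below. It echoes FARM's AdaptiveModel concept but is not FARM's actual API; all names and sizes are illustrative.&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# The modular idea behind a transfer learning framework, sketched in plain
# PyTorch: one shared, pre-trained encoder plus swappable task-specific heads.
# This echoes FARM's AdaptiveModel concept but is not FARM's actual API.
import torch
from torch import nn

class ClassificationHead(nn.Module):
    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, encoded):
        # Use the first token's vector as a summary of the whole sequence.
        return self.classifier(encoded[:, 0])

class AdaptiveModel(nn.Module):
    """A shared encoder combined with a task head; swap the head to reuse
    the same pre-trained encoder for a different task."""

    def __init__(self, encoder, head):
        super().__init__()
        self.encoder = encoder
        self.head = head

    def forward(self, embeddings):
        return self.head(self.encoder(embeddings))

# Toy usage: a stand-in encoder so the sketch runs end to end.
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
model = AdaptiveModel(encoder, ClassificationHead(hidden_size=64, num_labels=2))
logits = model(torch.randn(8, 16, 64))   # batch of 8 sequences, 16 tokens each
&lt;/code&gt;&lt;/pre&gt;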

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; When you're scaling up into production, what are the key problems or acceptance criteria that you're working towards?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; I think the most common problem people face is optimizing model speed; you can find a lot of blog articles on how to do that. One thing that was useful for us in the past is &lt;a href="https://developer.nvidia.com/automatic-mixed-precision" rel="noopener noreferrer"&gt;automatic mixed precision&lt;/a&gt; training. The idea is not to use the full precision of a float32 for all your model weights; for some parameters it's enough to have less precision. Automatic mixed precision (AMP) training is a smart way of figuring out which parameters need full precision and which are fine with less. This got us speed improvements in deployment of about 30 to 40 percent, which is quite interesting, and it also saves costs on GPUs.&lt;/p&gt;
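
&lt;p&gt;&lt;em&gt;In PyTorch, AMP comes down to two pieces: autocast, which runs parts of the forward pass in lower precision, and GradScaler, which keeps small gradients from underflowing. A minimal sketch with a toy model, assuming a CUDA GPU is available:&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Automatic mixed precision in PyTorch: autocast runs parts of the forward
# pass in lower precision, GradScaler keeps small gradients from underflowing.
# A toy model stands in for a real network; requires a CUDA GPU.
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

model = nn.Linear(512, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
scaler = GradScaler()

inputs = torch.randn(32, 512, device="cuda")
targets = torch.randint(0, 2, (32,), device="cuda")

optimizer.zero_grad()
with autocast():                      # mixed precision forward pass
    loss = loss_fn(model(inputs), targets)
scaler.scale(loss).backward()         # scale the loss to protect tiny gradients
scaler.step(optimizer)                # unscales gradients, then steps
scaler.update()
&lt;/code&gt;&lt;/pre&gt;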

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; How can I find out more about the open source projects that you're working on?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; You can find them on our website deepset.ai. For us what's most important is our open source projects. There's FARM, which is a framework for transfer learning. What we currently focus a lot on is the second framework, called &lt;a href="https://github.com/deepset-ai/haystack" rel="noopener noreferrer"&gt;Haystack&lt;/a&gt;, to find the needle in the haystack. If you want to do search, if you want to do question answering at scale, that might be worth looking at. That's where we implemented a few of the ideas we just discussed, integrating with &lt;a href="https://aws.amazon.com/elasticsearch-service/" rel="noopener noreferrer"&gt;ElasticSearch&lt;/a&gt; and having these two models in one pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Research projects and products have significantly different implementations, reflecting their different goals. Research projects are designed to continuously test and validate theories and improve on each other's results, yet products are built to solve valuable customer problems. Taking the latest from research and using it in products can help to solve new problems and improve existing performance, but converting research for use on real world problems is an involved process. The good news is it follows many of the engineering principles that exist in software engineering today. Making code modular allows for re-use and improves maintainability, and it has similar benefits in machine learning software development. Another key consideration for startups building machine learning products is cost: using spot instances in SageMaker can help reduce costs significantly, and in deepset's case they saved 70 percent on their training costs. Let's get back to Malte to learn about his learnings, best practices and advice on how to stay up to date in the fast moving world of machine learning. You've now gone through this process multiple times and also built open source projects. If you were to start from zero today, what would you do differently?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; Yeah, even in POCs I would definitely pay more attention to good software engineering practices; this is something I learned along the way. Also to model monitoring when you deploy models. We had a couple of situations where it was difficult to find out whether the model was actually failing or not, and good dashboards and monitoring help a lot there. Then I think when it comes to open source, which is something that's really important to us and in our DNA, I would engage in open source development even earlier. It took me quite a while to become a contributor to other projects and even longer to get our own projects out there. But it's really super rewarding and you learn a lot of things. As a startup, it's super helpful to get early user feedback, to get other contributors on board and also to get some visibility. For us it was very helpful to publish the German BERT model very early on; we got a lot of traction just because of this model, a lot of applications, a lot of talent coming in. That was really key for us. I can just encourage everybody to either engage in existing open source projects or consider open sourcing your own products.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; And if you were to give a single piece of advice to an engineer that wants to build a machine learning product, what would it be?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; If you're working on the NLP side, so working with text, definitely use transfer learning. I think there's no way around it these days, though it's still worth comparing those models to simpler, more traditional ones as a benchmark. But from my experience, you usually do better with transfer learning and transformers. Secondly, think about your long term strategy: don't just implement something hacky, but really build it in a way that can last, that you can monitor and that you can maintain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Machine learning is a fast moving world with a lot of new developments. What are the key resources that you use to keep up to date?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malte:&lt;/strong&gt; I'm a big fan of research papers; I usually read on the weekend, sometimes during breakfast, just to keep a bit updated. There are a couple of great newsletters in this area: &lt;a href="https://ruder.io/" rel="noopener noreferrer"&gt;Sebastian Ruder&lt;/a&gt;, for example, has a very good NLP newsletter. There are a lot of great online courses, from &lt;a href="https://www.coursera.org/instructor/andrewng" rel="noopener noreferrer"&gt;Andrew Ng&lt;/a&gt; and so on. I think conferences are great to keep updated on modelling and engineering practices, but more importantly, to exchange with fellow practitioners, to discuss what they are doing, what went wrong and what their learnings are.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rob:&lt;/strong&gt; Thank you Malte for sharing how you take machine learning research into production, your lessons learned and best practices. If you're excited about building the next big thing or you want to learn from the engineers that have been there and done that, subscribe to Startup Engineering wherever you get your podcasts. &lt;/p&gt;

&lt;p&gt;Remember to check out the show notes for useful resources related to this episode. &lt;/p&gt;

&lt;p&gt;And until the next time, keep on building.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AWS Activate founder tier provides $1,000 in AWS Credits, access to experts and the resources needed to build, test &amp;amp; deploy. &lt;a href="https://aws.amazon.com/activate/" rel="noopener noreferrer"&gt;Activate your startup today&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>podcast</category>
      <category>techtalks</category>
      <category>startup</category>
    </item>
    <item>
      <title>Zego — Retrofitting an insurance accounting ledger — Startup Engineering</title>
      <dc:creator>Rob De Feo</dc:creator>
      <pubDate>Tue, 31 Mar 2020 14:18:42 +0000</pubDate>
      <link>https://forem.com/aws/zego-retrofitting-an-insurance-accounting-ledger-startup-engineering-4in8</link>
      <guid>https://forem.com/aws/zego-retrofitting-an-insurance-accounting-ledger-startup-engineering-4in8</guid>
      <description>&lt;p&gt;Startups need to move quickly, but doing this in a regulated industry is difficult. Zego is a London based insurance technology startup that provides short term insurance for gig economy workers.&lt;/p&gt;

&lt;p&gt;Stuart, a co-founder and staff engineer, explains how Zego balances regulatory requirements and fast iteration.&lt;/p&gt;

&lt;p&gt;&lt;iframe src="https://open.spotify.com/embed/episode/3m6tB9pjIhEuahaxPcsUs0" width="100%" height="232px"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/stuart-kelly-0762a420/" rel="noopener noreferrer"&gt;Stuart&lt;/a&gt; has been a coder since he was 13 and professionally for the last 15 years. He has worked in various London startups since 2011. In 2016 Stuart eventually bit the bullet and co-founded Zego, a micro-insurance startup for the gig economy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.zego.com/" rel="noopener noreferrer"&gt;Zego&lt;/a&gt; is a global insurtech business providing flexible commercial insurance for businesses and professionals.&lt;/p&gt;

&lt;p&gt;These are some useful resources related to the episode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.zego.com/blog/" rel="noopener noreferrer"&gt;Zego Blog&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS Activate founder tier provides $1,000 in AWS Credits, access to experts and the resources needed to build, test &amp;amp; deploy. &lt;a href="https://aws.amazon.com/activate/" rel="noopener noreferrer"&gt;Activate your startup today&lt;/a&gt;. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to &lt;a href="https://aws.amazon.com/getting-started/projects/break-monolith-app-microservices-ecs-docker-ec2/" rel="noopener noreferrer"&gt;break a monolith&lt;/a&gt; application into microservices using ECS and Docker.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Listen&lt;/h2&gt;

&lt;p&gt;Learn from the smartest people in the business by subscribing to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://podcasts.apple.com/podcast/startup-engineering/id1504093052" rel="noopener noreferrer"&gt;Apple Podcasts&lt;/a&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://podcasts.google.com/?feed=aHR0cHM6Ly9mZWVkcy5zaW1wbGVjYXN0LmNvbS9iRjFVUXhocA&amp;amp;episode=Y2QwZTQ0MmQtNTA0My00MmNlLTlkMjMtYmEyNzNjODUwYmEx&amp;amp;ved=0CAcQ38oDahcKEwiYsKm5qcLoAhUAAAAAHQAAAAAQAQ" rel="noopener noreferrer"&gt;Google Podcasts&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://open.spotify.com/show/58m8vSzXx1WKirKLTdCQa6" rel="noopener noreferrer"&gt;Spotify&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.breaker.audio/startup-engineering" rel="noopener noreferrer"&gt;Breaker&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.stitcher.com/podcast/startup-engineering" rel="noopener noreferrer"&gt;Stitcher&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;More details can be found on the &lt;a href="https://startupengineering.co/episodes/zego-retrofitting-an-insurance-accounting-ledger" rel="noopener noreferrer"&gt;episode&lt;/a&gt; page&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Transcript&lt;/h2&gt;

&lt;p&gt;Rob: Welcome back to &lt;a href="http://startupengineering.co/" rel="noopener noreferrer"&gt;startup engineering&lt;/a&gt;, a podcast that goes behind the scenes at startups. I am &lt;a href="https://www.linkedin.com/in/robdefeo" rel="noopener noreferrer"&gt;Rob De Feo&lt;/a&gt;, Startup Advocate at &lt;a href="https://aws.amazon.com/startups/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt;. Together we'll hear from the engineers, CEOs and founders that built the technology and products at some of the world's leading startups, from launch through to achieving mass scale and all the bumps in between. Experts share their experiences, lessons learned and best practices. In this episode, our guest &lt;a href="https://www.linkedin.com/in/stuart-kelly-0762a420/" rel="noopener noreferrer"&gt;Stu&lt;/a&gt;, a co-founder and staff engineer at Zego, takes us behind the scenes of how they built and retrofitted a ledger. Stu, can you describe the problem that &lt;a href="https://www.zego.com/" rel="noopener noreferrer"&gt;Zego&lt;/a&gt; is solving, for the people that have not yet used it?&lt;/p&gt;

&lt;p&gt;Stu: Yes, I'd love to. Zego is building a brand new insurance company from scratch. The way insurance works is not fit for the modern world. People are changing the way they work, people are changing the way they build companies. The gig economy, the new mobility economy, the sharing economy: all of these things have emerged and really grown in the last five years, and traditional insurance companies are hampered by their systems. Hampered by legacy systems and legacy processes that mean they can't keep up and provide the kind of insurance products people need to be able to do their job. Zego is all about empowering people to go out and live their life and work the way they want to work, without being held back by insurance.&lt;/p&gt;

&lt;p&gt;Rob: That's amazing. So you were really quickly able to build this new product for this new market. Can you talk about some of the technology challenges that you faced when doing this?&lt;/p&gt;

&lt;p&gt;Stu: I think probably a couple of the big ones were the fact that we needed to build a whole bunch of software from scratch. Traditional policy management systems, for example, are used to dealing with policies that last a year, or maybe a month at the shortest. We wanted to write policies that lasted 12 minutes. When we spoke to some initial underwriters to try and get them onboard with this product, they said it cost them £8 every time they wrote a policy to their database. For whatever reasons, with all of their legacy systems and all of the licensing that they had, they said "we can't possibly sell a policy for £0.65". So we did have to write our own policy management system from scratch. Insurance is complicated. It's a legal minefield. You need to be very, very careful about what you're doing and make sure that people are covered, because at the end of the day, it's potentially someone's livelihood. It's very, very important to get it right the first time. On top of that came the massive amount of data that we started getting from these work providers: we were finding out who was on shift and what the working patterns were. We had to deal with all of that data, get it into a system, and be able to handle customers working for multiple work providers at different times, at the same time, and on overlapping shifts. When it came to the finance side of things, it also became very, very complex.&lt;/p&gt;

&lt;p&gt;Rob: It sounds like with the approach you took, you didn't necessarily expect all these challenges, especially when you're moving quickly in the beginning. What are the challenges that you're having to face building the next step in your technology?&lt;/p&gt;

&lt;p&gt;Stu: Probably one of the most interesting projects that we've been building recently is actually our accounting ledger. There's a special type of accounting for insurance brokers called Insurance Broker Accounting. Yeah, they didn't get too creative. Essentially, we hold at any time an uncomfortable amount of other people's money, whether that's our customers' money, underwriters' money, or the taxman's money. If you look in our bank accounts, that's not all ours. We can't go spend it on Friday breakfast and software engineering salaries. We needed to know exactly how much of that belonged to each one of our 40,000 customers, how much of that money belongs to insurers, and how much of that money we had to give to the taxman. That in itself is very complex, because it's a rabbit hole.&lt;/p&gt;

&lt;p&gt;We sell in five different countries around Europe, and not a single one of them deals with tax the same way. We thought it was gonna be easy because we started in the UK, and UK taxes are the simplest: a 12% flat tax. The second country was Ireland, where it's a little bit more complex: a flat tax plus a levy that you have to pay. Then we went to Spain: a flat tax plus a levy plus another levy that depends on what kind of vehicle you're insuring. "Okay, cool, we can get there." We went to France: there are five different flat taxes depending on how much of the policy covers a certain type of risk. It just got more and more complex as we went.&lt;/p&gt;

&lt;p&gt;The way most companies end up solving it, for accounting purposes, is to throw an army of accountants at it. In traditional insurance companies and traditional finance companies, a massive amount of the staff ends up being a very large finance department.&lt;/p&gt;

&lt;p&gt;Rob: You're trying to build a technology first company above all else?&lt;/p&gt;

&lt;p&gt;Stu: Exactly. Our &lt;a href="https://www.linkedin.com/in/stensaar" rel="noopener noreferrer"&gt;CEO Sten&lt;/a&gt;. He's tasked us with beating Aviva, who have 120,000 staff and doing it with less than 2,000. That's going to require an awful lot of work, on the technology side to make sure that we can build a super efficient finance department.&lt;/p&gt;

&lt;p&gt;Rob: What did this look like before you had this new solution? What was the process you had to go through to rebuild it?&lt;/p&gt;

&lt;p&gt;Stu: You don't want to hear about our first version. We just kind of kept track of how much money people had put into their account, then overfunded our client money account to make sure that we would always have enough. We didn't have that many customers, so we did throw a few accountants at it to find out exactly how much money we had. Then we would pay our insurers out of our savings account instead. We always had enough money to pay everybody. But as we grew bigger and took on more and more customers, more and more varied products, and more and more different underwriters, we needed to start actually doing this properly.&lt;/p&gt;

&lt;p&gt;Rob: How important is it to be precise with this type of system?&lt;/p&gt;

&lt;p&gt;Stu: When it comes to things like tax and client money, things certainly have to reconcile to the penny. If something is out by one penny, then that is enough for us to stop what we're doing and go in and try and find out why it's out by one penny.&lt;/p&gt;

&lt;p&gt;Rob: It's interesting you're going down to the accuracy of the nearest penny. Is there a specific reason why you're doing that?&lt;/p&gt;

&lt;p&gt;Stu: Probably regulations is the biggest one. In a regulated industry we get audited; we have been audited, I think, three times in three years. We fully expect to be audited on a regular basis, especially because we are doing something new. We are building a whole bunch of new technology, building things that the regulators haven't seen, and building new products that the regulators haven't seen. They want to make sure that we are doing things fairly and that we're treating people fairly, which is also one of our main goals. Our big three company values are that we want to be fair, we want to be simple and we want to be flexible. That means that if you give someone money, you expect to get that money back, and to the penny. For example, if you lent me £10 and I gave you back £9.99, when it's only a penny it might be fine, but you'd still feel a little bit like "well, hang on, it might be fine, but it's still my penny". Combined with that, we want to make sure that if a user asks for their money back, which is their money, they can have it. They get exactly what they're entitled to. And the taxman certainly cares about a penny. It was very much a case of making sure that this stuff worked. That was one of the big problems: we've essentially been retrofitting this ledger onto an accounting system that has been running for two and a half years.&lt;/p&gt;

&lt;p&gt;Rob: Was this an off the shelf solution, or was it something you built from the ground up?&lt;/p&gt;

&lt;p&gt;Stu: No, this is a new ledger that we're building to replace the old charges and credit system that we built back in 2016/17. What that meant was, it was much more accurate. It's much more specific about exactly what the different charges are: who they were for, what account they were supposed to go into, things like that.&lt;/p&gt;

&lt;p&gt;Rob: How are you able to get to that level of detail and accuracy?&lt;/p&gt;

&lt;p&gt;Stu: A lot of work! Mainly because we write our own software from scratch, we were able to integrate it really well into the platform that we're building. We didn't have to integrate with an off the shelf policy management system, and we didn't have to integrate with an off the shelf user management system. It was very easy for us to dig into the bits of the system that required integrating directly into that accountancy system and just put it in there. It's all our software, it's all our code. Where it became tricky was when we dealt with third parties. We don't handle payments ourselves. For example, we use &lt;a href="https://stripe.com/" rel="noopener noreferrer"&gt;Stripe&lt;/a&gt; for payments, we use &lt;a href="https://gocardless.com/" rel="noopener noreferrer"&gt;GoCardless&lt;/a&gt; for bank transfers, and we use financing companies for premium financing. Those all have varying levels of technical integration. Stripe is one of the gold standards for really great developer experience and really great integration. We can get a lot of information from them, which allows us to automate things: when someone disputes a charge on their card, it doesn't have to go through a manual process, that whole system can now be automated. Some of our other partners are less technically efficient, and there it still requires little bits of manual process to get those integrations in place.&lt;/p&gt;
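
&lt;p&gt;&lt;em&gt;As an illustration of that kind of automation, a payment provider webhook can route dispute events straight into an automated flow. The sketch below uses a Django-style view and Stripe's charge.dispute.created event type; the handler function is a hypothetical stand-in, not Zego's integration.&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A webhook that automates dispute handling -- a sketch, not Zego's code.
# Django-style view; open_dispute_case is a hypothetical stand-in, and
# production code would verify the Stripe-Signature header first.
import json

from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt

def open_dispute_case(charge_id, amount):
    # Placeholder: create a case, gather evidence, notify the finance team.
    print(f"dispute opened for charge {charge_id}, amount {amount}")

@csrf_exempt
def stripe_webhook(request):
    event = json.loads(request.body)
    if event["type"] == "charge.dispute.created":
        dispute = event["data"]["object"]
        # Automated instead of manual: kick off the dispute flow immediately.
        open_dispute_case(dispute["charge"], dispute["amount"])
    return HttpResponse(status=200)
&lt;/code&gt;&lt;/pre&gt;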

&lt;p&gt;Rob: You have the requirement to be able to reconcile these thousands of policies in a complete way. What's the most important thing that you need to be aware of? Or the most important piece of information that you're capturing to do this?&lt;/p&gt;

&lt;p&gt;Stu: Most of it is about exposure: exposing as much of the underlying data as we can, exposing as much of the underlying business events that caused a transaction to happen. Money doesn't just move for no reason. If you can understand why a transaction happened, then even if it's something the system can't exactly automate, you can still know why the transaction happened. We know who the parties involved are. We know that money needs to move between these different places: from this account to our insurer account, and from one of our customer accounts through to our account. Or maybe they've used up some of their promotional credit and we have to move some money from our marketing account and marketing budget into their insurer account. Knowing exactly what's supposed to take place is something the system can do. Sometimes knowing exactly how much is supposed to move requires one of our very smart accountants.&lt;/p&gt;

&lt;p&gt;Rob: You've given us a good idea about why the ledger is so important and what it does. Can you explain a little bit about the architecture, and some of the key pieces of technology used to build it?&lt;/p&gt;

&lt;p&gt;Stu: Our main tech stack is built on &lt;a href="https://www.djangoproject.com/" rel="noopener noreferrer"&gt;Django&lt;/a&gt; and &lt;a href="https://www.python.org/" rel="noopener noreferrer"&gt;Python&lt;/a&gt;. We are currently decomposing our initial monolith, which I think is a phase that startups go through. It's quite an exciting one. We went with Python across the stack, and across the entire business we use Python. Data scientists write in Python. We taught all of the business intelligence team Python. All of our application stack is written in Python. It's all hosted on &lt;a href="https://aws.amazon.com/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt;. We went with AWS quite early because we knew that we needed to be cloud based and we knew we needed to scale. I didn't want to be managing Postgres databases at two o'clock in the morning, so we are on &lt;a href="https://aws.amazon.com/rds/postgresql/" rel="noopener noreferrer"&gt;Postgres RDS&lt;/a&gt;. We use all sorts of things now: &lt;a href="https://aws.amazon.com/lambda/" rel="noopener noreferrer"&gt;Lambda&lt;/a&gt;, and &lt;a href="https://aws.amazon.com/sqs/" rel="noopener noreferrer"&gt;SQS&lt;/a&gt; for a lot of the events and job systems that we run.&lt;/p&gt;

&lt;p&gt;Rob: What is your team mostly spending their time working on now?&lt;/p&gt;

&lt;p&gt;Stu: The work we are doing at the moment is to start pulling a bunch of these things apart, really untangling the web of the monolith. Then we can move into a service based world. Our systems engineers are very excited because they'll get to use all sorts of fun new tools like &lt;a href="https://istio.io/" rel="noopener noreferrer"&gt;Istio&lt;/a&gt; and &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;. We are spiking a lot of that at the moment. We are probably going to go down the &lt;a href="https://grpc.io/" rel="noopener noreferrer"&gt;gRPC&lt;/a&gt; route for our services.&lt;/p&gt;

&lt;p&gt;Rob: You're using gRPC to have that quick internal communication between your services?&lt;/p&gt;

&lt;p&gt;Stu: Yeah, exactly. We really like the way that you can enforce that hard contract, really code your API contract into the gRPC layer itself, and then build out stubs. Which will mean that when, and it's not even if, we decide to use languages other than Python, they can also interact really, really well.&lt;/p&gt;

&lt;p&gt;Rob: You spoke a lot about having a monolith. When you build this new ledger, is that something that you built outside the monolith?&lt;/p&gt;

&lt;p&gt;Stu: It's currently still inside. What we are building at the moment, we are building in a way that is going to allow us to pull stuff out of the monolith very easily, whilst we still spike exactly how we're going to do monitoring, tracing, logging, authorization, security, and all of those bits around services. We're building new applications and new things into the monolith, but we are making sure that they stay very self contained. They still talk to other parts of the monolith via a fairly well defined contract. It's just that the contract happens to be running in the same process and on the same machine, rather than over a network call.&lt;/p&gt;
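
&lt;p&gt;&lt;em&gt;A minimal sketch of that pattern in Python: a ledger-style module that other parts of the monolith may only call through one narrow function. The names and the balancing rule are illustrative, not Zego's code; the point is that because callers only know this contract, the module could later be served over gRPC without changing them.&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A self-contained module inside a monolith, exposed through one narrow
# contract -- illustrative names and rules, not Zego's code. Callers only
# know this interface, so the module could later move behind gRPC unchanged.
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class LedgerEntry:
    account: str
    amount: Decimal   # positive for credit, negative for debit

def record_transaction(reason, entries):
    """The only entry point other parts of the monolith may call."""
    total = sum(entry.amount for entry in entries)
    if total != Decimal("0"):
        raise ValueError("ledger entries must balance to zero")
    # Persisting the transaction is omitted in this sketch.
    return entries

# Usage from elsewhere in the monolith: today an in-process call, tomorrow
# potentially a network call behind the same contract.
record_transaction("premium paid", [
    LedgerEntry("customer:123", Decimal("-0.65")),
    LedgerEntry("insurer:abc", Decimal("0.65")),
])
&lt;/code&gt;&lt;/pre&gt;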

&lt;p&gt;Rob: Without architectural boundaries, are you having to be more strict in code reviews to ensure that people aren't breaking this?&lt;/p&gt;

&lt;p&gt;Stu: It takes a lot of developer strictness. Everyone is on the same page. Everyone knows that we would rather spend a little bit of extra time doing this properly to save us a lot of time later trying to untangle it. Especially for something as complex as this ledger.&lt;/p&gt;

&lt;p&gt;Rob: Switching back to the ledger, when you were building this, I guess there were some unexpected events or unexpected problems that you came across. Can you talk about some of the edge cases you have to deal with?&lt;/p&gt;

&lt;p&gt;Stu: I guess one of the interesting bits is that we didn't have it from the start. By putting it in two and a half years in, we needed to go back and reconcile to the beginning of time. That is very difficult, because some data quality issues from late 2016 and early 2017, when we didn't really know what we were doing, meant that as soon as you started to try and reason about exactly why this particular bit of money moved, or why this transaction happened, it became very difficult. We've actually got a new transaction type in the ledger where a transaction is "in suspense". In suspense basically means no one's really quite worked out why this transaction happened. We can tell what happened; we've always been able to know that some money moved between here and there and who it had moved to. But the understanding of why that happened sometimes isn't there. We had to come up with a version where we could say to people "hey, we're still looking into why this one happened", or sometimes just write it off and say "yeah, we know that we gave this person some money at some point in the past". As long as we are above board with it, and so long as most of the time it's us giving other people money, nobody cares too much.&lt;/p&gt;

&lt;p&gt;Rob: That's a really interesting approach and solution to the way that you solved this. Was that because you gave a lot of flexibility to support staff? Or was it because of the early versions of the MVP and the software you built, or something else?&lt;/p&gt;

&lt;p&gt;Stu: A lot of it came down to the flexibility we had early on. We've always been very customer centric and we've always had a really great customer service team, who very early on were given quite open tools to be able to do whatever they needed to do. If we had customers who needed to be given a refund for some reason, they could just go in and give them a refund. Then going back two and a half years later and saying to somebody "this time that you transferred this money from our account to this customer's account, why did you do that?", they are like "this is two and a half years ago, I don't remember". Which is a pretty valid excuse for someone who speaks to hundreds of customers a day. It was a lot of, not human error, but human processes that happened very early on that we didn't have a record of.&lt;/p&gt;

&lt;p&gt;The other one was technical. Early versions of the software, like the quotation engine and pricing engine, didn't round to as many decimal places as they do now, so there would be rounding errors. That's where quite a lot of the one penny errors come from. The early versions of the software might round up or round down, and when you're dealing with taxes in percentages on low numbers, you can quite easily break down a premium that should be £1.00 and, when you add up all of the bits of that breakdown, get to £0.99 instead. It's very easy to see those bits and to understand "hey, I can see that this doesn't reconcile", but to understand exactly who that £0.01 belongs to becomes quite tricky. You then have to go through: have we underpaid our taxman or our insurer? Have we made sure that the customer has paid the right amount and not too much for their insurance? Those are the bits that, when we went back and retrofitted all of the legacy data that we had into this new system, we found quite a lot of.&lt;/p&gt;

&lt;p&gt;Rob: At the moment you built the ledger, you think it's finished. Can you talk us through what happens when you turn it on and you're about to put two and a half years' worth of data through it? What are the mechanics of this? How does that work?&lt;/p&gt;

&lt;p&gt;Stu: The way it's built now, it's all on an event based system. An event happens that causes a transaction and we write those transactions, or it causes a future transaction and we write a pending transaction. That is really easy, and volumes are fairly steady. Then it came time to do that for all of the backfill transactions, as one big job. We had something in the order of 500,000 transactions, and each transaction has something like seven or eight different entries in it. So we're talking about 6 to 6.5 million rows that we needed to calculate and write into the database. We batched it up and we ran it as a job, fully expecting that job to take about six hours. I kicked it off one night and said, "cool, let's come back tomorrow morning and see what this job looks like". We come back the next morning in the office, we look at the logs, and it's about 20% of the way through. Okay, this six hours is probably going to be more like a week. We actually had to stop the job, because what we hadn't built into it was the ability for the job to restart without having to restart from the beginning of time.&lt;/p&gt;

&lt;p&gt;Rob: You built it in such a way that if you got the 1st 10,000 transactions correct, but then there was a problem, you wouldn't have to start again?&lt;/p&gt;

&lt;p&gt;Stu: If the job was to be killed, and this is one of the considerations you need to take into account when running long running jobs on cloud services: if you don't batch them up into lots of little jobs, a server can get killed underneath you. It's one of the things that you take into account when you're building an application. Most requests are relatively short lived and can be retried. If you have one single job that is taking hours and hours at a time, the chances of needing to restart that job grow exponentially. We stopped the job and then we broke it up: instead of one long running job, it would be a couple of million very, very short running jobs. Which is great, because it meant we could stick them on a queue and watch the queue go down. If for whatever reason a server died, got taken away, or somebody was deploying and needed to restart, it wouldn't hold that up.&lt;/p&gt;
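
&lt;p&gt;&lt;em&gt;The restartable version can be as simple as cutting the backlog into small batches and putting each batch on a queue. The sketch below uses boto3 and SQS with a placeholder queue name; the important properties are that each message is small and that workers process it idempotently, so a retried batch doesn't double-write.&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Splitting one long backfill into many small, retryable jobs on a queue.
# Queue name and batch contents are placeholders. A killed worker only
# repeats one small batch, not the whole run.
import json

import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="ledger-backfill")["QueueUrl"]

BATCH_SIZE = 100
transaction_ids = list(range(500_000))   # stand-in for the real backlog

for start in range(0, len(transaction_ids), BATCH_SIZE):
    batch = transaction_ids[start:start + BATCH_SIZE]
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"transaction_ids": batch}),
    )

# Workers consume one message at a time; processing must be idempotent so a
# batch retried after a crash does not write its entries twice.
&lt;/code&gt;&lt;/pre&gt;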

&lt;p&gt;Rob: Now you have the ability to retry things. Then what happens when you run subsequent jobs? Were there other things that you learnt? What did you experiment with?&lt;/p&gt;

&lt;p&gt;Stu: Most of the iterations we went through once we had done the backfill were about finding the bugs. Not only the bits where the bugs were caused by legacy data; those were just a manual process, a very long manual process, involving the engineers and the accounting team actually working out what should have happened and then manually updating transactions and entries. There were also the ones that were continuing to happen: the ones where the rounding errors were coming from deep within parts of the system that were two years old. The option there is either to paper over those cracks in the accounting ledger, or dig into the systems that people hadn't touched for a year and fix them. We went with the second option. If we are going to spend the time, let's spend the time fixing these deeper issues: the tax engine, the quotation engine and all of our pricing factors. Instead of going to 2 or 3 decimal places, some of them now go to 10 decimal places, which means the whole system is more accurate.&lt;/p&gt;
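
&lt;p&gt;&lt;em&gt;A worked example of why the extra decimal places matter, using Python's decimal module and made-up tax rates: rounding each component to two places as you go can lose a penny, while carrying ten places internally and rounding once at the end does not.&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# How per-component rounding loses a penny, and how carrying more decimal
# places fixes it. The tax rates are made up; they sum to exactly 1.
from decimal import ROUND_HALF_UP, Decimal

premium = Decimal("1.00")
rates = [Decimal("0.334"), Decimal("0.334"), Decimal("0.332")]

# Rounding each component to 2 places as you go: the parts no longer add up.
parts_2dp = [(premium * r).quantize(Decimal("0.01"), ROUND_HALF_UP) for r in rates]
print(sum(parts_2dp))                              # 0.99 -- a penny is missing

# Carrying 10 decimal places internally, rounding only once at the end:
parts_10dp = [(premium * r).quantize(Decimal("0.0000000001")) for r in rates]
print(sum(parts_10dp).quantize(Decimal("0.01")))   # 1.00
&lt;/code&gt;&lt;/pre&gt;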

&lt;p&gt;Rob: Each time you fixed a bug or improved the system, you made it correct, but you also had to make it correct in the past. How did you manage that? How do you fix these things?&lt;/p&gt;

&lt;p&gt;Stu: We go back and we fix the past.&lt;/p&gt;

&lt;p&gt;Rob: That means you're running it over all the old data and getting new results for it.&lt;/p&gt;

&lt;p&gt;Stu: Yeah, we ran it over all of the old data. Every time we find one of these bugs, we know how much of the old data is not precise. We can work out which transactions are in suspense or which transactions require a fix. You can look at that and say "OK cool, if we make this fix, not only is it going to fix this going forward, but it's going to fix, you know, 10% of all of the errors in our historical transactions". When I say we changed the past, most of the time changing the past is not going into the database and changing the previous records. A ledger is supposed to be append only. It means creating new transactions that will actually fix, in an accounting sense, the previous ones. If you overcharged someone by £0.01 two years ago, you give them a £0.01 refund today.&lt;/p&gt;
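
&lt;p&gt;&lt;em&gt;In code, that rule looks something like the sketch below: the old row is never edited, a compensating entry dated today is appended instead. The structures are illustrative, not Zego's schema.&lt;/em&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Fixing the past in an append-only ledger: never edit an old row, append a
# compensating transaction instead. Illustrative structures, not Zego's schema.
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

@dataclass(frozen=True)
class Transaction:
    posted_on: date
    description: str
    amount: Decimal   # what the customer was charged

ledger = [
    # The customer was charged 1.00 in 2018; it should have been 0.99.
    Transaction(date(2018, 3, 1), "premium charged", Decimal("1.00")),
]

# Two years later the rounding bug is found. The 2018 row stays untouched;
# a 0.01 refund dated today corrects it in an accounting sense.
ledger.append(
    Transaction(date.today(), "refund: 2018 rounding error", Decimal("-0.01"))
)

net_charged = sum(t.amount for t in ledger)
assert net_charged == Decimal("0.99")
&lt;/code&gt;&lt;/pre&gt;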

&lt;p&gt;Rob: This is an important characteristic of a ledger: you can't go back and change the past, you need to add another transaction to make an alteration. Now the ledger is working, you've been firing transactions at it, and it's all updated. It's more accurate. Does it just work? Or is there more that needs to be done?&lt;/p&gt;

&lt;p&gt;Stu: There's a lot more to do. It's running for probably about 80% of all products at the moment. When we started we had about 500,000 transactions including the backfill. We now have just over 1,000,000 transactions, which means 9,000,000 entries. That doesn't include any of the B2B transactions; they are still handled manually by the accounting team. We sell products to consumers, delivery riders and Uber drivers, but we also sell larger group fleet policies, to people starting kick scooter companies or people who have a fleet of 100 vans and don't need insurance on all 100 vans all year round, because half the year their vans are sitting in a garage and half the year it's Christmas time and they're all out on the road delivering.&lt;/p&gt;

&lt;p&gt;They are also really beneficial for our flexible insurance policies. What we do currently is manually handle all of those transactions. They are lower in volume, but often a little bit more complex, because of the usage data for large fleets even when it's under one transaction. The next step is to get the rest of those in, so that we can completely automate all of the manual processes and the grunt work that our accounting team has to do.&lt;/p&gt;

&lt;p&gt;Rob: With the ledger in place now, and all the data updated with even more accuracy, are you able to run different types of reports and use different tooling on top of it?&lt;/p&gt;

&lt;p&gt;Stu: We have quite a strong data engineering team. What they do is pull data not only from this ledger for analytics, but also from all of our policy management systems, all the way through to all of our website and app analytics, and put it into a data warehouse. They make that available for the business intelligence team. We use a tool called &lt;a href="https://looker.com/" rel="noopener noreferrer"&gt;Looker&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Rob: With Looker being the tool that your business analysts are using, where's the data? What tool is used to get the information out?&lt;/p&gt;

&lt;p&gt;Stu: We have &lt;a href="https://aws.amazon.com/redshift/" rel="noopener noreferrer"&gt;RedShift&lt;/a&gt; for our data warehouse, and that pulls from all of the other data stores. We have a number of &lt;a href="https://aws.amazon.com/rds/" rel="noopener noreferrer"&gt;RDS databases&lt;/a&gt;, we've got a &lt;a href="https://aws.amazon.com/dynamodb/" rel="noopener noreferrer"&gt;DynamoDB database&lt;/a&gt;, and it also pulls from external sources. It pulls everything out of not only our ledger, but also our bank account details, things like &lt;a href="https://www.xero.com/" rel="noopener noreferrer"&gt;Xero&lt;/a&gt; that we use for expenses, Google Analytics, and all of these other places into this one source. That aggregates all of our different data sources in one place and then makes them available, so that you can combine analytics across multiple different data sources.&lt;/p&gt;

&lt;p&gt;Rob: Solving the problem of accounting for every penny, when insurance premiums are measured in minutes rather than years, and then retrofitting years' worth of data and transactions against it, was a difficult challenge. Using engineering resources rather than an army of accountants creates a scalable solution. Zego takes an iterative approach to building software: it has well defined contracts in its monolith, which allows it to move quickly and create code boundaries that could be broken out into microservices in the future. Engineering a ledger to be 100% correct in one pass when running over old data is an impractical approach; checkpointing allowed Zego to iteratively build the ledger and fix bugs without having to start from zero each time. Let's get back to Stu to hear about the learnings, best practices and advice he has to offer.&lt;/p&gt;

&lt;p&gt;Going through this process you have learned a lot. If you were able to start again today knowing everything that you know now, what would you do differently?&lt;/p&gt;

&lt;p&gt;Stu: One of the big ones is making sure that we understand where the deficiencies in our legacy data occur. There are a number of other projects ongoing at the moment to make sure that the application we built when we were six months old, when we had no idea what we were doing or where we were going to be in a few years, actually scales with the existing systems. It's all well and good to come up with a brand new system and a brand new set of data models and go "yeah, this is so much better", but if you have no way of migrating your existing data into those models, then you're gonna be in for a world of hurt. I mean, that's what we found here. There have been a lot of late nights and a lot of manual work gone into ensuring that all of that past data was cleaned up. Some of that we could have done beforehand, and it would have been faster. You spend a lot of computing time calculating things that are wrong, then fixing a bug and having to completely recalculate those things to make them right. Some of that data could have been cleaned up beforehand, and some of it couldn't; some of it we didn't know was wrong until we had done this work. Looking at some of the other projects that we have coming up around user management and CRM, it's really about cleaning up some of that legacy data before we start embarking on the project. Then we are very much in the process of decomposing our monolith into services. When we build stuff, we still need to build onto our existing platform, our existing monolith, and building that in a way that we know is going to be easy to pull apart in six months' time is something that goes into every single bit that we do now. It's one of the main considerations.&lt;/p&gt;

&lt;p&gt;Rob: You were very intentional about the way that you started out with a monolith and it's worked really well for you to get up and running really quickly. Now you have to pull it apart and that's a significant engineering effort. Is there something that you would have done differently?&lt;/p&gt;

&lt;p&gt;Stu: Yes and no. I definitely still would have started with a monolith. I think that going into an industry that you don't have years and years of domain knowledge about, you don't know where those boundaries are going to be drawn. We changed our business model three times in the first six months. We changed our pricing system. We changed so many things very, very early on. If I had been trying to draw very hard boundaries right at the beginning, I would have spent all of our time redrawing boundaries instead of just allowing everything to grow. That's been one of the things that allowed us to succeed and grow quickly. I probably would have started the work that we're doing now, really drawing those boundaries even internally within the monolith, a little bit earlier. We started about halfway through last year: new modules were added with essentially a contract layer on top of them, and anything that needed to use those new modules used them via this contract. I think we probably could have started that maybe 6 to 12 months earlier.&lt;/p&gt;

&lt;p&gt;Rob: There's a lot of discussion around building microservices or a monolith first. If you're building microservices from the beginning, defining the process boundaries is really, really difficult. Is there a methodology to be able to define the process boundaries upfront?&lt;/p&gt;

&lt;p&gt;Stu: Exactly. If anyone ever says that they fully understand where all of the boundaries for their services should lie on day one, then either they are an absolute genius or they're a little bit delusional. I certainly would have been delusional if I had tried to tell anyone that I knew enough about insurance, and enough about a product that was brand new in a fast changing industry in the middle of 2016, because I didn't.&lt;/p&gt;

&lt;p&gt;Rob: Thank you Stu for sharing your best practices, experiences and lessons learned. If you're excited about building the next big thing or you want to learn from the engineers that have been there and done that, subscribe to &lt;a href="http://startupengineering.co/" rel="noopener noreferrer"&gt;startup engineering&lt;/a&gt; wherever you get your podcasts.&lt;/p&gt;

&lt;p&gt;Remember to check out the show notes for useful resources related to this episode.&lt;/p&gt;

&lt;p&gt;Until the next time, keep on building.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>startup</category>
      <category>podcast</category>
      <category>techtalks</category>
    </item>
    <item>
      <title>Startup Engineering Preview</title>
      <dc:creator>Rob De Feo</dc:creator>
      <pubDate>Mon, 30 Mar 2020 16:28:19 +0000</pubDate>
      <link>https://forem.com/aws/startup-engineering-preview-25gi</link>
      <guid>https://forem.com/aws/startup-engineering-preview-25gi</guid>
      <description>&lt;p&gt;&lt;iframe src="https://open.spotify.com/embed/episode/0YczpN2ZDTDRYPk2ZxnH2C" width="100%" height="232px"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Startup Engineering podcast has launched. I go behind the scenes with the engineers, CTOs, and founders that built the technology and products at some of the world's leading startups.&lt;/p&gt;

&lt;p&gt;Each episode we will learn the inside story of how they overcame their biggest technology challenges, from launch through to achieving mass scale, and all the bumps in between. Engineers share their experiences, lessons learnt and best practices.&lt;/p&gt;

&lt;p&gt;If you are excited about building the next big thing or you want to learn from the engineers that have been there and done that, subscribe to &lt;a href="http://startupengineering.co/" rel="noopener noreferrer"&gt;Startup Engineering&lt;/a&gt; wherever you get your podcasts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://podcasts.apple.com/podcast/startup-engineering/id1504093052" rel="noopener noreferrer"&gt;Apple Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://podcasts.google.com/?feed=aHR0cHM6Ly9mZWVkcy5zaW1wbGVjYXN0LmNvbS9iRjFVUXhocA%3D%3D" rel="noopener noreferrer"&gt;Google Podcasts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://open.spotify.com/show/58m8vSzXx1WKirKLTdCQa6" rel="noopener noreferrer"&gt;Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.stitcher.com/podcast/startup-engineering" rel="noopener noreferrer"&gt;Stitcher&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.breaker.audio/startup-engineering" rel="noopener noreferrer"&gt;Breaker&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Until next time, keep on building.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AWS Activate founder tier provides $1,000 in AWS Credits, access to experts and the resources needed to build, test &amp;amp; deploy. &lt;a href="https://aws.amazon.com/activate/" rel="noopener noreferrer"&gt;Activate your startup today&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>startup</category>
      <category>podcast</category>
    </item>
  </channel>
</rss>
