<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nathan Epstein</title>
    <description>The latest articles on Forem by Nathan Epstein (@nathanepstein).</description>
    <link>https://forem.com/nathanepstein</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F28861%2F9d79f972-7c0b-4e22-be28-20e918d1df07.jpeg</url>
      <title>Forem: Nathan Epstein</title>
      <link>https://forem.com/nathanepstein</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nathanepstein"/>
    <language>en</language>
    <item>
      <title>On Server Administration In Data Engineering</title>
      <dc:creator>Nathan Epstein</dc:creator>
      <pubDate>Fri, 03 Apr 2020 12:32:42 +0000</pubDate>
      <link>https://forem.com/nathanepstein/on-server-administration-in-data-engineering-3ihp</link>
      <guid>https://forem.com/nathanepstein/on-server-administration-in-data-engineering-3ihp</guid>
      <description>&lt;h2&gt;TLDR&lt;/h2&gt;

&lt;p&gt;Cloud computing is almost always a good idea, serverless computing is sometimes a good idea, and you probably shouldn't be managing your own machines on premises.&lt;/p&gt;

&lt;h2&gt;Intro Notes&lt;/h2&gt;

&lt;p&gt;It should come as no surprise that data analysis pipelines require compute resources for the various steps they include. Downloading data requires computation, as does reading and transforming data, as does building models for prediction. All of this is to say that we, as the engineers responsible for building such pipelines, need to make informed decisions about the infrastructure we use to execute the various computations associated with the deployment of predictive models. Towards this objective, we have a wide range of options. These include - but are certainly not limited to - running executables on local machines, running individual cloud servers, managing clusters of cloud machines, and delegating computation to anonymous cloud machines. Each of these approaches is an appropriate choice in some context, and it is a valuable exercise to examine their associated tradeoffs. Through this examination, we can build a deeper understanding of how to evaluate infrastructure choices in our own data systems.&lt;/p&gt;

&lt;h2&gt;The Base Case: Local Computing&lt;/h2&gt;

&lt;p&gt;The first and simplest option is to run compute on a local machine. The strengths and weaknesses here are reasonably clear. Running compute on your local machine is certainly the fastest and easiest way to get started: the environment can be heavily customized, and processes can be run on demand without the overhead of SSH or other remote communication methods. But the advantages mostly end there. A single local machine is easy to administer but operationally fragile and inherently unscalable, and almost any production use case will quickly lead to bottlenecks which require more capable server options.&lt;/p&gt;

&lt;h2&gt;Cloud Computing&lt;/h2&gt;

&lt;p&gt;The next option is to run compute on a single cloud machine. This has many of the same advantages as a single local machine. It is similarly straightforward to administer and allows for simple centralization of process and resource management. On top of this, managed cloud computing services afford additional benefits which are essential for many production use cases.&lt;/p&gt;

&lt;p&gt;The foremost of these benefits is resource availability. Using a third party cloud provider lets us delegate responsibility for ensuring that compute resources are provided without disruption. In the case of a self-administered local machine, we are responsible for resolving any issues (software failures, hardware failures, power outages, etc.) which might cause our infrastructure to become unavailable. This is undesirable in that it diverts attention from our core competency and objective - the construction of data pipelines. With a managed cloud, we sidestep this issue. If a machine goes down, a new one is provided, and our infrastructure concerns are limited to the setup of the relevant software environment.&lt;/p&gt;

&lt;p&gt;Another related concern is disaster recovery. On a local machine, we painstakingly construct our software environment to match our computing needs. The various packages, programming languages, and libraries are installed. Versions are selected in order to be internally compatible with each other and with our application needs. Application code is written and arranged according to a deliberate file structure. This machine setup is a meaningful amount of work which, without appropriate tooling, can be quite painful to replicate. So if our locally administered machine is made permanently unavailable - either through a software failure, physical damage to the machine, or via physical depreciation over time - recovery can be an expensive affair. Can we ameliorate this issue with appropriate tooling? Of course. But there isn't really a compelling reason to do so. If we're making use of a managed cloud provider, then any machine replacement will be abstracted away. Physical resources will be replaced by the cloud provider without requiring any attention or thought on our end. &lt;/p&gt;

&lt;p&gt;Additionally, third party cloud providers will typically have telemetry offerings which are quite useful from an operational perspective. This can include monitoring of network IO, CPU usage, and status checks. Being able to monitor these things is valuable for identifying patterns of resource usage and, in turn, determining the necessary machine resources for compute tasks. It's certainly possible to implement this telemetry ourselves - either through custom implementations or the use of open source software - but this is, again, disadvantageous. To the extent that we can delegate responsibility for concerns which are not related to the core objective of building data pipelines, we are generally well served by doing so. &lt;/p&gt;

&lt;p&gt;A common use of this telemetry is resource scaling. We may view our metrics and determine that the compute resources we have are not well matched to the needs of the application. We may have a larger machine than is required and would be just as happy with a less expensive resource. Or perhaps we have identified resource bottlenecks and need to scale up. Making these adjustments is a non-trivial undertaking when managing servers ourselves. Either we need to purchase a new machine or make physical alterations. Both of these require technical expertise which is far removed from the central problem of constructing data analysis pipelines. But with a cloud provider, the transition is as simple as selecting the preferred resource. The physical migration which occurs is abstracted from us.&lt;/p&gt;

&lt;p&gt;Managed cloud providers also offer resource standardization. This means that if we do decide to make a scale adjustment, which entails an alteration of the underlying physical infrastructure (either in the form of a modification or a new machine), we don't have to worry about our software functioning differently. Virtualization is handled by the cloud provider, which affords us the capacity to move our application across different machines without worrying about our environment. Of course, we can use virtualization on a local machine and impose a shared environment on future machines, but this is additional responsibility we'd prefer to delegate.&lt;/p&gt;

&lt;h2&gt;Horizontal Scaling&lt;/h2&gt;

&lt;p&gt;As our compute needs increase, we will likely need to scale horizontally rather than vertically. That is, we may need additional servers rather than larger ones. This is intuitive both because there are limits to the size of a single machine and because costs tend to scale super-linearly: each incremental increase in machine size comes at an increasingly higher price. As a result, it is often more cost effective to distribute compute across many small machines than across a few large ones.&lt;/p&gt;
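
&lt;p&gt;To make this concrete, here is a minimal Python sketch using entirely hypothetical hourly prices (real pricing varies by provider, region, and instance family):&lt;/p&gt;

```python
# Hypothetical hourly prices by machine size (vCPUs -> $/hour).
# Chosen only to illustrate the super-linear shape of vertical scaling:
# each doubling of capacity more than doubles the price.
vertical_pricing = {4: 0.20, 8: 0.45, 16: 1.05, 32: 2.50}

def compare_costs(needed_vcpus, small_size=4):
    """Cost of one big machine vs. a fleet of small ones with equal vCPUs."""
    one_big = vertical_pricing[needed_vcpus]
    fleet = (needed_vcpus // small_size) * vertical_pricing[small_size]
    return {"one_big": one_big, "fleet_of_small": fleet}

costs = compare_costs(32)
# Under these prices, eight 4-vCPU machines cost less per hour
# than a single 32-vCPU machine.
print(costs)
```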

&lt;p&gt;This capacity to scale comes with a complexity cost. Distributed computation requires coordination of resources across the various machines, and the form that this coordination takes will be a function of the compute being done. There are many tools for managing machine groups which warrant their own detailed treatment. Applications involving the composition of several jobs distributed over a cluster may call for orchestration tools like Kubernetes. Distributing analysis of large data sets across many machines can be done with frameworks such as Hadoop and Spark. In many cases, coordination of machines can be handled manually via API calls or other forms of inter-process communication. Whatever the tooling used to manage the complexity of distributed compute, its advantage over single-machine computing is the capacity for arbitrary horizontal scaling.&lt;/p&gt;
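
&lt;p&gt;As a toy illustration of the manual-coordination approach, the sketch below fans work out over a local thread pool; the pool is a stand-in for remote worker machines, which in a real deployment would be reached via API calls:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    # Placeholder analysis step: sum one shard of the data.
    # A real worker might download, transform, or score its shard.
    return sum(partition)

data = list(range(100))
# Split the work across 4 "machines" (here, threads).
partitions = [data[i::4] for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(process_partition, partitions))

# The coordinator combines the partial results.
total = sum(partial_results)
print(total)  # 4950, the same answer as computing on one machine
```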

&lt;p&gt;Of course, we have the option of whether to achieve horizontal scale via local or cloud machines. In the case of local machines, this means procuring the necessary quantity of servers, physically maintaining them, and configuring the appropriate software to coordinate computing among them. The tradeoffs associated with this approach roughly mirror those of running compute on a single local machine. There are potential benefits in the way of customizability, information security, and cost. Conversely, horizontal scaling using a managed cloud provider affords the benefits of flexibility, comparative ease of management, reliability, and pre-built tooling. &lt;/p&gt;

&lt;p&gt;Using managed cloud resources also leads to an important organizational benefit. Because these offerings have a broad user base, there is a comparatively large potential labor supply. That is, there are more hirable individuals with the expertise to manage common cloud infrastructure than there are with the expertise to manage niche deployments.&lt;/p&gt;

&lt;p&gt;As data pipelines become more complex and resource intensive, the need for horizontal scaling typically follows. Certain organizations, particularly very large ones, may have specific needs which warrant the maintenance of physical computing infrastructure. However, many organizations find that the use of a virtual private cloud is the appropriate means of achieving the horizontal scale required by their pipelines. &lt;/p&gt;

&lt;h2&gt;Serverless Computing&lt;/h2&gt;

&lt;p&gt;Another computation framework which has emerged more recently is serverless computing. Of course, there are actual servers handling the compute, but their administration is abstracted from the end user. In the serverless compute model, application code is executed by a cloud service provider using physical machine resources that the provider provisions and administers. The client of the serverless compute is responsible only for specifying the executable and associated metadata (e.g. timing, function inputs, etc.).&lt;/p&gt;
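
&lt;p&gt;A minimal sketch of what such an executable might look like, modeled loosely on the handler signature of AWS Lambda's Python runtime (the event shape here is a hypothetical example, not a fixed schema):&lt;/p&gt;

```python
def handler(event, context=None):
    # The provider invokes this function with each incoming event;
    # we specify only the logic, not the machine it runs on.
    values = event.get("values", [])
    mean = sum(values) / len(values) if values else None
    return {"count": len(values), "mean": mean}

# Locally, we can invoke the handler directly to test the logic.
result = handler({"values": [2.0, 4.0, 6.0]})
print(result)  # {'count': 3, 'mean': 4.0}
```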

&lt;p&gt;As a comparatively nascent space, the options within serverless computing are evolving rapidly. In addition to serverless compute, commercial offerings exist for serverless databases, in which the scaling and management of the database are abstracted from the user by the cloud provider. It seems reasonable to expect that both the variety and quality of such offerings will continue to grow quickly.&lt;/p&gt;

&lt;p&gt;The primary advantage of the serverless framework is the ease of administration. Because this work is abstracted from the client, the need for both effort and expertise on this front is removed. This allows users to focus on the particulars of their application logic and not need to think about the infrastructure which is responsible for the execution. &lt;/p&gt;

&lt;p&gt;An additional advantage is cost. Depending on the usage pattern, serverless compute is often cheaper than having dedicated machines. For systems in which compute is intermittent and machine resources would otherwise sit underutilized for long periods, serverless compute is likely to be a cost effective solution. Existing serverless offerings charge only for the compute time used, so dedicated machines that sit idle carry a high cost relative to their on-demand counterparts.&lt;/p&gt;

&lt;p&gt;Another, related, benefit of serverless compute is the elasticity of resources. Machines are requisitioned by the cloud provider to accommodate the application at runtime so effectively arbitrary changes in scale are possible. If the system has no work to complete, then no physical resources are claimed or paid for. As work is demanded by the system, the appropriate amount of compute resources are acquired for the duration of the tasks. &lt;/p&gt;

&lt;p&gt;There are important tradeoffs to consider when transitioning to a serverless architecture. While the benefits of serverless are significant, it is not the correct choice for all computing contexts.&lt;/p&gt;

&lt;p&gt;First, there are systems for which serverless computing would be meaningfully more expensive. We highlighted that alternation between bursts of compute and periods of idleness is a usage pattern which is handled in a cost effective manner by serverless compute. The inverse is also true. If resource usage is consistently high, then a dedicated machine is likely a cheaper option; perhaps significantly so.&lt;/p&gt;
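
&lt;p&gt;A back-of-the-envelope break-even sketch, using placeholder prices rather than any provider's actual rates:&lt;/p&gt;

```python
# Placeholder prices for illustration only.
DEDICATED_PER_HOUR = 0.10       # always-on machine, billed busy or idle
SERVERLESS_PER_SECOND = 0.0002  # billed only while code actually runs

def monthly_cost(busy_hours_per_day):
    """Return (dedicated, serverless) cost for a 30-day month."""
    dedicated = 30 * 24 * DEDICATED_PER_HOUR
    serverless = 30 * busy_hours_per_day * 3600 * SERVERLESS_PER_SECOND
    return dedicated, serverless

# Intermittent workload (1 busy hour/day): serverless is cheaper.
d_low, s_low = monthly_cost(busy_hours_per_day=1)

# Consistently busy workload (20 busy hours/day): dedicated is cheaper.
d_high, s_high = monthly_cost(busy_hours_per_day=20)
```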

&lt;p&gt;There are also performance costs to serverless compute. Serverless computing is an on-demand model, which means that the resources it uses need to be acquired at runtime. This also applies to the loading of dependencies: rather than being a one-time process on a dedicated machine, it is a recurring process for each run of the application code. This spin-up process comes with a latency cost.&lt;/p&gt;

&lt;p&gt;Another drawback to serverless is the comparative inability to customize the machine on which application code is run. Managed compute services generally provide a particular environment in which your dependencies must be built. While this may not be a major concern for many applications, it may complicate the deployment of applications which have intricate and particular dependencies. The serverless deployment of Docker images, which would serve to ameliorate this issue, can involve additional complexity and is not universally supported by major cloud providers. The prevalence of templated runtimes over fully customizable alternatives presents an additional roadblock for the deployment of applications using less common programming languages.  &lt;/p&gt;

&lt;p&gt;An additional concern is telemetry. A primary feature of serverless computing is that the user experience of server administration is hands off. While this is typically a benefit, there are circumstances in which detailed monitoring of the executing machine - beyond just process logs - is desirable but not available.&lt;/p&gt;

&lt;p&gt;The last major concern is vendor lock. Serverless computing is provided by a managed cloud provider according to vendor specific interfaces. This means that building systems around a serverless architecture entails committing to a particular vendor and accepting that there will be costs associated with changing providers.&lt;/p&gt;

&lt;h2&gt;Concluding Notes&lt;/h2&gt;

&lt;p&gt;Management of compute resources is an essential component of building data pipelines. While there are no universal rules of server administration, it is still important to understand the essential tradeoffs in order to make informed infrastructure decisions. Hopefully, the above is a useful starting point in highlighting the competing concerns at play within your own data pipelines.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>infrastructure</category>
      <category>serverless</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Antifragile Software</title>
      <dc:creator>Nathan Epstein</dc:creator>
      <pubDate>Tue, 08 May 2018 01:24:54 +0000</pubDate>
      <link>https://forem.com/nathanepstein/antifragile-software-3oh3</link>
      <guid>https://forem.com/nathanepstein/antifragile-software-3oh3</guid>
      <description>&lt;p&gt;Software projects famously suffer from unforeseen complexities that slow development and undermine teams' ability to execute on high level objectives. In light of this, the desirability of developing &lt;a href="https://en.wikipedia.org/wiki/Antifragility"&gt;"antifragile"&lt;/a&gt; software projects that grow stronger through this complexity - as opposed to collapsing under it - should be obvious. What follows are a few principles aimed at achieving this. &lt;/p&gt;

&lt;h3&gt;Prefer dependency on software with a long history of use&lt;/h3&gt;

&lt;p&gt;Project requirements are generally complex, incompletely specified, and non-static over time. For the most part, this means that a demonstrated history of usefulness should be weighed more heavily than rationalizations about a technology's value. &lt;/p&gt;

&lt;p&gt;Software that survives through a long period of wide use has a demonstrated ability to handle practical complexity beyond that of newer software. This translates directly to a longer expected shelf-life. From this, we get "Lindy effects" where the longer a technology has been used, the longer it is likely to continue to be in use.&lt;/p&gt;

&lt;p&gt;When these tools were released, smart people spun compelling narratives around why web developers would want to adopt Backbone.js, CoffeeScript, Knockout.js, Meteor, Angular, Aurelia, Haml, and an expansive graveyard of forgotten JavaScript frameworks. And yet, all of these (to varying degrees) have seen their usage wane as they've failed to adequately match the requirements of real software projects.&lt;/p&gt;

&lt;p&gt;By way of contrast, SQL has been around for decades and shows no signs of going away any time soon. Through extensive use, features which could easily be rationalized as weaknesses (a generally insecure text-based API for example) have revealed themselves to be strengths (by allowing non-technical business users to explore data without an engineer). Time, and not narrative, is the judge. &lt;/p&gt;

&lt;p&gt;This seems like an intuitive result. Given even odds, how many people would bet that SQL will be outlasted by a newer and "better" alternative like, say, MongoDB?&lt;/p&gt;

&lt;h3&gt;Prefer dependency on software that is used by its maintainers&lt;/h3&gt;

&lt;p&gt;A proven track record is ideal when choosing software but it's not always an option. If you're picking a JavaScript web framework, for example, essentially all of your options are young projects (except perhaps jQuery which, if it supports the requirements of your project, is a great choice).&lt;/p&gt;

&lt;p&gt;But if you have to go with something new, it's preferable to use software which is used by its authors. This creates a sensitivity to unpredictable challenges and will tend to lead the project to grow stronger with time. If the author is using a project, emergent issues will prompt corresponding feature development. Conversely, an author who doesn't actively use their project will develop it to match a preconceived mental model of reality instead of the real thing.&lt;/p&gt;

&lt;p&gt;Consider the example of AngularJS vs. React. AngularJS grew quickly based largely on the excitement surrounding the fact that it was developed by Google. The rationalization was something like "Google has a lot of money and smart people, so their framework will obviously be great". But Google famously didn't use AngularJS for its own projects; the design turned out to be poorly suited to real projects and was abandoned for a wholesale rewrite in the form of Angular 2.&lt;/p&gt;

&lt;p&gt;By way of contrast, React is used by Facebook (which maintains the framework). The project has been growing rapidly, is healthily maintained, and appears to be the best current bet for stability within the JS ecosystem.&lt;/p&gt;

&lt;p&gt;According to the Stack Overflow Developer Survey (2017 and 2018), the percentage of respondents using Angular dropped from 44.3% to 36.9% while those using React jumped from 19.5% to 27.8%. This indicates a massive migration from Angular to React which shows no signs of slowing.&lt;/p&gt;

&lt;h3&gt;Prefer project owners over issue owners&lt;/h3&gt;

&lt;p&gt;In organizing software development work, prefer assigning people to high-level projects over low-level tasks. The reason for this is similar to the above argument about preferring software used by its authors. In completing small incremental tasks, it's easy to introduce technical debt for temporary expedience; project ownership creates incentives to avoid this kind of practice.&lt;/p&gt;

&lt;p&gt;A developer tasked with specific, contained features has incentives to trade the long term health of the project for ease of development / efficiency in the short term. A project owner will recognize their own exposure to long term issues and be inclined to make technical decisions which support the long term health of the project; new challenges will lead to project growth opportunities - instead of buried time bombs.  &lt;/p&gt;

</description>
    </item>
    <item>
      <title>What is a "10x" Programmer?</title>
      <dc:creator>Nathan Epstein</dc:creator>
      <pubDate>Sat, 05 Aug 2017 18:09:03 +0000</pubDate>
      <link>https://forem.com/nathanepstein/what-is-a-10x-programmer</link>
      <guid>https://forem.com/nathanepstein/what-is-a-10x-programmer</guid>
      <description>&lt;p&gt;A lot of attention is paid to the value of "rockstar" or "10x" programmers in building successful organizations. It's not hard to understand why; the inherently scalable nature of software means that marginal differences in programming work result in large differences in output.&lt;/p&gt;

&lt;p&gt;Whether targeting outliers is a sustainable hiring strategy (it certainly can't be if everyone is doing it) is a widely debated topic. Less often talked about is what actually makes somebody one of these great programmers.&lt;/p&gt;

&lt;p&gt;Without a mental model, it's difficult to either work towards being a great programmer or identify such individuals when hiring. What follows is an (opinionated) attempt to distill some of the qualities that make a great programmer. In particular, this list emphasizes delivering value within an organization (as distinct from the pure craft of programming). The list is loosely sorted in ascending order of difficulty / rarity.&lt;/p&gt;

&lt;h4&gt;1) Strong programmers can write business logic.&lt;/h4&gt;

&lt;p&gt;This means the ability to write working code which yields a solution to a presented problem. Sort an array of values, determine the right data to show a given user, etc. Obviously this is not a binary condition (everyone has different limits) but most development tasks shouldn't stretch the limits of your ability in this regard.&lt;/p&gt;

&lt;p&gt;One big reason that software engineering interview processes are broken is that this is simultaneously the minimum requirement for being able to hold a programming job and the complete extent of what is tested for.&lt;/p&gt;

&lt;h4&gt;2) Strong programmers can write good code.&lt;/h4&gt;

&lt;p&gt;This is different from being able to write application logic. Code is for other people and should be written as such. This means succinct and expressive names, modular classes and functions, and readable logic. It's not enough that a piece of code "works". Good code is straightforward for others to use.&lt;/p&gt;

&lt;h4&gt;3) Strong programmers can design and architect things correctly.&lt;/h4&gt;

&lt;p&gt;There are of course many acceptable solutions to a given design or architecture problem. You can reasonably choose to trade simplicity for performance if the application warrants it. You cannot reasonably choose an approach that is both complex and slow (where a simple or performant solution exists).&lt;/p&gt;

&lt;p&gt;Coming up with a "correct solution" means having the depth of understanding to make appropriate tradeoffs and avoid inflicting a deadweight loss on the software.&lt;/p&gt;

&lt;h4&gt;4) Strong programmers know a lot.&lt;/h4&gt;

&lt;p&gt;This is fairly self-explanatory; software engineering is knowledge intensive work. You can Google syntax but you can't Google fundamental problem solving ability.&lt;/p&gt;

&lt;h4&gt;5) Strong programmers learn a lot.&lt;/h4&gt;

&lt;p&gt;Again, this is pretty self-explanatory. In the course of working as a software engineer, you will encounter situations where you need knowledge you don't have. Being able to acquire that knowledge is important to being effective.&lt;/p&gt;

&lt;h4&gt;6) Strong programmers teach a lot.&lt;/h4&gt;

&lt;p&gt;Yet again, this is straightforward. Teaching is an economical way to deliver value to an organization. It makes others more effective in their work, creates a more attractive work environment, and fosters a culture of knowledge growth.&lt;/p&gt;

&lt;h4&gt;7) Strong programmers can take ownership of projects.&lt;/h4&gt;

&lt;p&gt;I'll define this as being able to take high-level business requirements and deliver a good software solution to the problem. This is less a technical skill than a "human factor".&lt;/p&gt;

&lt;p&gt;Completing a project often involves a mix of programming and other non-technical concerns that are difficult to bake into a standard process. This means identifying what stakeholders want (which is often distinct from what they ask for), forming the correct approach, aligning involved parties, and executing to deliver a strong finished product.&lt;/p&gt;

&lt;p&gt;Being able to assume responsibility for these varied factors is a major separator of people who are good at writing code and people who are able to have major impact in an organization.&lt;/p&gt;

</description>
      <category>devtips</category>
      <category>coding</category>
      <category>development</category>
    </item>
  </channel>
</rss>
