<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Nathan Epstein</title>
    <description>The latest articles on Forem by Nathan Epstein (@nathanepstein).</description>
    <link>https://forem.com/nathanepstein</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F28861%2F9d79f972-7c0b-4e22-be28-20e918d1df07.jpeg</url>
      <title>Forem: Nathan Epstein</title>
      <link>https://forem.com/nathanepstein</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/nathanepstein"/>
    <language>en</language>
    <item>
      <title>On Server Administration In Data Engineering</title>
      <dc:creator>Nathan Epstein</dc:creator>
      <pubDate>Fri, 03 Apr 2020 12:32:42 +0000</pubDate>
      <link>https://forem.com/nathanepstein/on-server-administration-in-data-engineering-3ihp</link>
      <guid>https://forem.com/nathanepstein/on-server-administration-in-data-engineering-3ihp</guid>
      <description>&lt;h2&gt;TLDR&lt;/h2&gt;

&lt;p&gt;Cloud computing is almost always a good idea, serverless computing is sometimes a good idea, and you probably shouldn't be managing your own machines on premises.&lt;/p&gt;

&lt;h2&gt;Intro Notes&lt;/h2&gt;

&lt;p&gt;It should come as no surprise that data analysis pipelines require compute resources for the various steps they include. Downloading data requires computation, as does reading and transforming data, as does building models for prediction. All of this is to say that we, as the engineers responsible for building such pipelines, need to make informed decisions about the infrastructure we use to execute the various computations associated with the deployment of predictive models. Towards this objective, we have a wide range of options. These include - but are certainly not limited to - running executables on local machines, running individual cloud servers, managing clusters of cloud machines, and delegating computation to anonymous cloud machines. Each of these approaches is an appropriate choice in some context, and it is a valuable exercise to examine their associated tradeoffs. Through this examination, we can build a deeper understanding of how to evaluate infrastructure choices in our own data systems.&lt;/p&gt;

&lt;h2&gt;The Base Case: Local Computing&lt;/h2&gt;

&lt;p&gt;The first and simplest option is to run compute on a local machine. The strengths and weaknesses here are reasonably clear. Running compute on your local machine is certainly the fastest and easiest way to get started: the environment can be heavily customized, and processes can be run on demand without the overhead of SSH or other remote communication methods. But the advantages mostly end there. A single local machine is easy to administer but operationally fragile and inherently unscalable, and almost any production use case will quickly lead to bottlenecks which require more capable server options.&lt;/p&gt;

&lt;h2&gt;Cloud Computing&lt;/h2&gt;

&lt;p&gt;The next option is to run compute on a single cloud machine. This has many of the same advantages as a single local machine. It is similarly straightforward to administer and allows for simple centralization of process and resource management. On top of this, managed cloud computing services afford additional benefits which are essential for many production use cases.&lt;/p&gt;

&lt;p&gt;The foremost of these benefits is resource availability. Using a third party cloud provider lets us delegate responsibility for ensuring that compute resources are provided without disruption. In the case of a self-administered local machine, we are responsible for resolving any issues (software failures, hardware failures, power outages, etc.) which might cause our infrastructure to become unavailable. This is undesirable in that it diverts attention from our core competency and objective - the construction of data pipelines. With a managed cloud, we sidestep this issue. If a machine goes down, a new one is provided, and our infrastructure concerns are limited to the setup of the relevant software environment.&lt;/p&gt;

&lt;p&gt;Another related concern is disaster recovery. On a local machine, we painstakingly construct our software environment to match our computing needs. The various packages, programming languages, and libraries are installed. Versions are selected in order to be internally compatible with each other and with our application needs. Application code is written and arranged according to a deliberate file structure. This machine setup is a meaningful amount of work which, without appropriate tooling, can be quite painful to replicate. So if our locally administered machine is made permanently unavailable - either through a software failure, physical damage to the machine, or via physical depreciation over time - recovery can be an expensive affair. Can we ameliorate this issue with appropriate tooling? Of course. But there isn't really a compelling reason to do so. If we're making use of a managed cloud provider, then any machine replacement will be abstracted away. Physical resources will be replaced by the cloud provider without requiring any attention or thought on our end. &lt;/p&gt;

&lt;p&gt;Additionally, third party cloud providers will typically have telemetry offerings which are quite useful from an operational perspective. This can include monitoring of network IO, CPU usage, and status checks. Being able to monitor these things is valuable for identifying patterns of resource usage and, in turn, determining the necessary machine resources for compute tasks. It's certainly possible to implement this telemetry ourselves - either through custom implementations or the use of open source software - but this is, again, disadvantageous. To the extent that we can delegate responsibility for concerns which are not related to the core objective of building data pipelines, we are generally well served by doing so. &lt;/p&gt;

&lt;p&gt;A common use of this telemetry is resource scaling. We may view our metrics and determine that the compute resources we have are not well matched to the needs of the application. We may have a larger machine than is required and would be just as happy with a less expensive resource. Or perhaps we have identified resource bottlenecks and need to scale up. Making these adjustments is a non-trivial undertaking when managing servers ourselves. Either we need to purchase a new machine or make physical alterations. Both of these require technical expertise which is far removed from the central problem of constructing data analysis pipelines. But with a cloud provider, the transition is as simple as selecting the preferred resource. The physical migration which occurs is abstracted from us.&lt;/p&gt;

&lt;p&gt;Managed cloud providers also offer resource standardization. This means that if we do decide to make a scale adjustment, which entails an alteration of the underlying physical infrastructure (either in the form of a modification or a new machine), we don't have to worry about our software functioning differently. Virtualization is handled by the cloud provider, which affords us the capacity to move our application across different machines without worrying about our environment. Of course, we can use virtualization on a local machine and impose a shared environment on future machines, but this is additional responsibility we'd prefer to delegate.&lt;/p&gt;

&lt;h2&gt;Horizontal Scaling&lt;/h2&gt;

&lt;p&gt;As our compute needs increase, we will likely need to scale horizontally rather than vertically. That is, we may need additional servers rather than larger ones. This is intuitive both because there are limits to the size of a single machine and because costs tend to scale super-linearly: each incremental increase in machine size comes at an increasingly higher price. As a result, it is often more cost effective to distribute compute across many small machines than across a few large ones.&lt;/p&gt;
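
&lt;p&gt;To make this concrete, here is a minimal Python sketch using entirely hypothetical hourly prices (real pricing varies by provider, region, and instance family):&lt;/p&gt;

```python
# Hypothetical hourly prices by machine size (vCPUs -> $/hour).
# Chosen only to illustrate the super-linear shape of vertical scaling:
# each doubling of capacity more than doubles the price.
vertical_pricing = {4: 0.20, 8: 0.45, 16: 1.05, 32: 2.50}

def compare_costs(needed_vcpus, small_size=4):
    """Cost of one big machine vs. a fleet of small ones with equal vCPUs."""
    one_big = vertical_pricing[needed_vcpus]
    fleet = (needed_vcpus // small_size) * vertical_pricing[small_size]
    return {"one_big": one_big, "fleet_of_small": fleet}

costs = compare_costs(32)
# Under these prices, eight 4-vCPU machines cost less per hour
# than a single 32-vCPU machine.
print(costs)
```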

&lt;p&gt;This capacity to scale comes with a complexity cost. Distributed computation requires coordination of resources across the various machines, and the form that this coordination takes will be a function of the compute being done. There are many tools for managing machine groups which warrant their own detailed treatment. Applications involving the composition of several jobs distributed over a cluster may call for orchestration tools like Kubernetes. Distributing analysis of large data sets across many machines can be done with frameworks such as Hadoop and Spark. In many cases, coordination of machines can be handled manually via API calls or other forms of inter-process communication. Whatever the tooling used to manage the complexity of distributed compute, its advantage over single-machine computing is the capacity for arbitrary horizontal scaling.&lt;/p&gt;
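
&lt;p&gt;As a toy illustration of the manual-coordination approach, the sketch below fans work out over a local thread pool; the pool is a stand-in for remote worker machines, which in a real deployment would be reached via API calls:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    # Placeholder analysis step: sum one shard of the data.
    # A real worker might download, transform, or score its shard.
    return sum(partition)

data = list(range(100))
# Split the work across 4 "machines" (here, threads).
partitions = [data[i::4] for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(process_partition, partitions))

# The coordinator combines the partial results.
total = sum(partial_results)
print(total)  # 4950, the same answer as computing on one machine
```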

&lt;p&gt;Of course, we have the option of whether to achieve horizontal scale via local or cloud machines. In the case of local machines, this means procuring the necessary quantity of servers, physically maintaining them, and configuring the appropriate software to coordinate computing among them. The tradeoffs associated with this approach roughly mirror those of running compute on a single local machine. There are potential benefits in the way of customizability, information security, and cost. Conversely, horizontal scaling using a managed cloud provider affords the benefits of flexibility, comparative ease of management, reliability, and pre-built tooling. &lt;/p&gt;

&lt;p&gt;Using managed cloud resources also leads to an important organizational benefit. Because these offerings have a broad user base, there is a comparatively large potential labor supply. That is, there are more hirable individuals with the expertise to manage common cloud infrastructure than there are with the expertise to manage niche deployments.&lt;/p&gt;

&lt;p&gt;As data pipelines become more complex and resource intensive, the need for horizontal scaling typically follows. Certain organizations, particularly very large ones, may have specific needs which warrant the maintenance of physical computing infrastructure. However, many organizations find that the use of a virtual private cloud is the appropriate means of achieving the horizontal scale required by their pipelines. &lt;/p&gt;

&lt;h2&gt;Serverless Computing&lt;/h2&gt;

&lt;p&gt;Another computation framework which has emerged more recently is serverless computing. Of course, there are actual servers handling the compute, but their administration is abstracted from the end user. In the serverless compute model, application code is executed by a cloud service provider using physical machine resources that the provider provisions and administers. The client of the serverless compute is responsible only for specifying the executable and associated metadata (e.g. timing, function inputs, etc.).&lt;/p&gt;
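
&lt;p&gt;A minimal sketch of what such an executable might look like, modeled loosely on the handler signature of AWS Lambda's Python runtime (the event shape here is a hypothetical example, not a fixed schema):&lt;/p&gt;

```python
def handler(event, context=None):
    # The provider invokes this function with each incoming event;
    # we specify only the logic, not the machine it runs on.
    values = event.get("values", [])
    mean = sum(values) / len(values) if values else None
    return {"count": len(values), "mean": mean}

# Locally, we can invoke the handler directly to test the logic.
result = handler({"values": [2.0, 4.0, 6.0]})
print(result)  # {'count': 3, 'mean': 4.0}
```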

&lt;p&gt;As a comparatively nascent space, the options within serverless computing are evolving rapidly. In addition to serverless compute, commercial offerings exist for serverless databases, in which the scaling and management of the database are abstracted from the user by the cloud provider. It seems reasonable to expect that both the variety and quality of such offerings will continue to grow quickly.&lt;/p&gt;

&lt;p&gt;The primary advantage of the serverless framework is the ease of administration. Because this work is abstracted from the client, the need for both effort and expertise on this front is removed. This allows users to focus on the particulars of their application logic and not need to think about the infrastructure which is responsible for the execution. &lt;/p&gt;

&lt;p&gt;An additional advantage is cost. Depending on the usage pattern, serverless compute is often cheaper than having dedicated machines. For systems in which compute is intermittent and machine resources would otherwise sit underutilized for long periods, serverless compute is likely to be a cost effective solution. Existing serverless offerings charge only for the compute time used, so dedicated machines that sit idle carry a high cost relative to their on-demand counterparts.&lt;/p&gt;

&lt;p&gt;Another, related, benefit of serverless compute is the elasticity of resources. Machines are requisitioned by the cloud provider to accommodate the application at runtime so effectively arbitrary changes in scale are possible. If the system has no work to complete, then no physical resources are claimed or paid for. As work is demanded by the system, the appropriate amount of compute resources are acquired for the duration of the tasks. &lt;/p&gt;

&lt;p&gt;There are important tradeoffs to consider when transitioning to a serverless architecture. While the benefits of serverless are significant, it is not the correct choice for all computing contexts.&lt;/p&gt;

&lt;p&gt;First, there are systems for which serverless computing would be meaningfully more expensive. We highlighted that alternation between bursts of compute and periods of idleness is a usage pattern which is handled in a cost effective manner by serverless compute. The inverse is also true. If resource usage is consistently high, then a dedicated machine is likely a cheaper option; perhaps significantly so.&lt;/p&gt;
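
&lt;p&gt;A back-of-the-envelope break-even sketch, using placeholder prices rather than any provider's actual rates:&lt;/p&gt;

```python
# Placeholder prices for illustration only.
DEDICATED_PER_HOUR = 0.10       # always-on machine, billed busy or idle
SERVERLESS_PER_SECOND = 0.0002  # billed only while code actually runs

def monthly_cost(busy_hours_per_day):
    """Return (dedicated, serverless) cost for a 30-day month."""
    dedicated = 30 * 24 * DEDICATED_PER_HOUR
    serverless = 30 * busy_hours_per_day * 3600 * SERVERLESS_PER_SECOND
    return dedicated, serverless

# Intermittent workload (1 busy hour/day): serverless is cheaper.
d_low, s_low = monthly_cost(busy_hours_per_day=1)

# Consistently busy workload (20 busy hours/day): dedicated is cheaper.
d_high, s_high = monthly_cost(busy_hours_per_day=20)
```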

&lt;p&gt;There are also performance costs to serverless compute. Serverless computing is an on-demand model, which means that the resources it uses need to be acquired at runtime. This also applies to the loading of dependencies: rather than being a one-time process on a dedicated machine, it is a recurring process for each run of the application code. This spin-up process comes with a latency cost.&lt;/p&gt;

&lt;p&gt;Another drawback to serverless is the comparative inability to customize the machine on which application code is run. Managed compute services generally provide a particular environment in which your dependencies must be built. While this may not be a major concern for many applications, it may complicate the deployment of applications which have intricate and particular dependencies. The serverless deployment of Docker images, which would serve to ameliorate this issue, can involve additional complexity and is not universally supported by major cloud providers. The prevalence of templated runtimes over fully customizable alternatives presents an additional roadblock for the deployment of applications using less common programming languages.  &lt;/p&gt;

&lt;p&gt;An additional concern is telemetry. A primary feature of serverless computing is that the user experience of server administration is hands off. While this is typically a benefit, there are circumstances in which detailed monitoring of the executing machine - beyond just process logs - is desirable but not available.&lt;/p&gt;

&lt;p&gt;The last major concern is vendor lock. Serverless computing is provided by a managed cloud provider according to vendor specific interfaces. This means that building systems around a serverless architecture entails committing to a particular vendor and accepting that there will be costs associated with changing providers.&lt;/p&gt;

&lt;h2&gt;Concluding Notes&lt;/h2&gt;

&lt;p&gt;Management of compute resources is an essential component of building data pipelines. While there are no universal rules of server administration, it is still important to understand the essential tradeoffs in order to make informed infrastructure decisions. Hopefully, the above is a useful starting point in highlighting the competing concerns at play within your own data pipelines.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>infrastructure</category>
      <category>serverless</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Antifragile Software</title>
      <dc:creator>Nathan Epstein</dc:creator>
      <pubDate>Tue, 08 May 2018 01:24:54 +0000</pubDate>
      <link>https://forem.com/nathanepstein/antifragile-software-3oh3</link>
      <guid>https://forem.com/nathanepstein/antifragile-software-3oh3</guid>
      <description>&lt;p&gt;Software projects famously suffer from unforeseen complexities that slow development and undermine teams' ability to execute on high level objectives. In light of this, the desirability of developing &lt;a href="https://en.wikipedia.org/wiki/Antifragility"&gt;"antifragile"&lt;/a&gt; software projects that grow stronger through this complexity - as opposed to collapsing under it - should be obvious. What follows are a few principles aimed at achieving this. &lt;/p&gt;

&lt;h3&gt;Prefer dependency on software with a long history of use&lt;/h3&gt;

&lt;p&gt;Project requirements are generally complex, incompletely specified, and non-static over time. For the most part, this means that a demonstrated history of usefulness should be weighed more heavily than rationalizations about a technology's value. &lt;/p&gt;

&lt;p&gt;Software that survives through a long period of wide use has a demonstrated ability to handle practical complexity beyond that of newer software. This translates directly to a longer expected shelf-life. From this, we get "Lindy effects" where the longer a technology has been used, the longer it is likely to continue to be in use.&lt;/p&gt;

&lt;p&gt;When these tools were released, smart people spun compelling narratives around why web developers would want to adopt Backbone.js, CoffeeScript, Knockout.js, Meteor, Angular, Aurelia, Haml, and an expansive graveyard of forgotten JavaScript frameworks. And yet, all of these (to varying degrees) have seen their usage wane as they've failed to adequately match the requirements of real software projects.&lt;/p&gt;

&lt;p&gt;By way of contrast, SQL has been around for decades and shows no signs of going away any time soon. Through extensive use, features which could easily be rationalized as weaknesses (a generally insecure text-based API for example) have revealed themselves to be strengths (by allowing non-technical business users to explore data without an engineer). Time, and not narrative, is the judge. &lt;/p&gt;

&lt;p&gt;This seems like an intuitive result. Given even odds, how many people would bet that SQL will be outlasted by a newer and "better" alternative like, say, MongoDB?&lt;/p&gt;

&lt;h3&gt;Prefer dependency on software that is used by its maintainers&lt;/h3&gt;

&lt;p&gt;A proven track record is ideal when choosing software but it's not always an option. If you're picking a JavaScript web framework, for example, essentially all of your options are young projects (except perhaps jQuery which, if it supports the requirements of your project, is a great choice).&lt;/p&gt;

&lt;p&gt;But if you have to go with something new, it's preferable to use software which is used by its authors. This creates a sensitivity to unpredictable challenges and will tend to lead the project to grow stronger with time. If the author is using a project, emergent issues will prompt corresponding feature development. Conversely, an author who doesn't actively use their project will develop it to match a preconceived mental model of reality instead of the real thing.&lt;/p&gt;

&lt;p&gt;Consider the example of AngularJS vs. React. AngularJS grew quickly based largely on the excitement surrounding the fact that it was developed by Google. The rationalization was something like "Google has a lot of money and smart people, so their framework will obviously be great". But Google famously didn't use AngularJS for its own projects; the design turned out to be poorly suited to real projects and was abandoned for a wholesale rewrite in the form of Angular 2.&lt;/p&gt;

&lt;p&gt;By way of contrast, React is used by Facebook (which maintains the framework). The project has been growing rapidly, is healthily maintained, and appears to be the best current bet for stability within the JS ecosystem.&lt;/p&gt;

&lt;p&gt;According to the Stack Overflow Developer Survey (2017 and 2018), the percentage of respondents using Angular dropped from 44.3% to 36.9% while those using React jumped from 19.5% to 27.8%. This indicates a massive migration from Angular to React which shows no signs of slowing.&lt;/p&gt;

&lt;h3&gt;Prefer project owners over issue owners&lt;/h3&gt;

&lt;p&gt;In organizing software development work, prefer assigning people to high-level projects over low-level tasks. The reason for this is similar to the above argument about preferring software used by its authors. In completing small incremental tasks, it's easy to introduce technical debt for temporary expedience; project ownership creates incentives to avoid this kind of practice.&lt;/p&gt;

&lt;p&gt;A developer tasked with specific, contained features has incentives to trade the long term health of the project for ease of development / efficiency in the short term. A project owner will recognize their own exposure to long term issues and be inclined to make technical decisions which support the long term health of the project; new challenges will lead to project growth opportunities - instead of buried time bombs.  &lt;/p&gt;

</description>
    </item>
    <item>
      <title>What is a "10x" Programmer?</title>
      <dc:creator>Nathan Epstein</dc:creator>
      <pubDate>Sat, 05 Aug 2017 18:09:03 +0000</pubDate>
      <link>https://forem.com/nathanepstein/what-is-a-10x-programmer</link>
      <guid>https://forem.com/nathanepstein/what-is-a-10x-programmer</guid>
      <description>&lt;p&gt;A lot of attention is paid to the value of "rockstar" or "10x" programmers in building successful organizations. It's not hard to understand why; the inherently scalable nature of software means that marginal differences in programming work result in large differences in output.&lt;/p&gt;

&lt;p&gt;Whether targeting outliers is a sustainable hiring strategy (it certainly can't be if everyone is doing it) is a widely debated topic. Less often talked about is what actually makes somebody one of these great programmers.&lt;/p&gt;

&lt;p&gt;Without a mental model, it's difficult to either work towards being a great programmer or identify such individuals when hiring. What follows is an (opinionated) attempt to distill some of the qualities that make a great programmer. In particular, this list emphasizes delivering value within an organization (as distinct from the pure craft of programming). The list is loosely sorted in ascending order of difficulty / rarity.&lt;/p&gt;

&lt;h4&gt;1) Strong programmers can write business logic.&lt;/h4&gt;

&lt;p&gt;This means the ability to write working code which yields a solution to a presented problem. Sort an array of values, determine the right data to show a given user, etc. Obviously this is not a binary condition (everyone has different limits) but most development tasks shouldn't stretch the limits of your ability in this regard.&lt;/p&gt;

&lt;p&gt;One big reason that software engineering interview processes are broken is that this is simultaneously the minimum requirement for being able to hold a programming job and the complete extent of what is tested for.&lt;/p&gt;

&lt;h4&gt;2) Strong programmers can write good code.&lt;/h4&gt;

&lt;p&gt;This is different from being able to write application logic. Code is for other people and should be written as such. This means succinct and expressive names, modular classes and functions, and readable logic. It's not enough that a piece of code "works". Good code is straightforward for others to use.&lt;/p&gt;

&lt;h4&gt;3) Strong programmers can design and architect things correctly.&lt;/h4&gt;

&lt;p&gt;There are of course many acceptable solutions to a given design or architecture problem. You can reasonably choose to trade simplicity for performance if the application warrants it. You cannot reasonably choose an approach that is both complex and slow (where a simple or performant solution exists).&lt;/p&gt;

&lt;p&gt;Coming up with a "correct solution" means having the depth of understanding to make appropriate tradeoffs and avoid inflicting a deadweight loss on the software.&lt;/p&gt;

&lt;h4&gt;4) Strong programmers know a lot.&lt;/h4&gt;

&lt;p&gt;This is fairly self-explanatory; software engineering is knowledge intensive work. You can Google syntax but you can't Google fundamental problem solving ability.&lt;/p&gt;

&lt;h4&gt;5) Strong programmers learn a lot.&lt;/h4&gt;

&lt;p&gt;Again, this is pretty self-explanatory. In the course of working as a software engineer, you will encounter situations where you need knowledge you don't have. Being able to acquire that knowledge is important to being effective.&lt;/p&gt;

&lt;h4&gt;6) Strong programmers teach a lot.&lt;/h4&gt;

&lt;p&gt;Yet again, this is straightforward. Teaching is an economical way to deliver value to an organization. It makes others more effective in their work, creates a more attractive work environment, and fosters a culture of knowledge growth.&lt;/p&gt;

&lt;h4&gt;7) Strong programmers can take ownership of projects.&lt;/h4&gt;

&lt;p&gt;I'll define this as being able to take high-level business requirements and deliver a good software solution to the problem. This is less a technical skill than a "human factor".&lt;/p&gt;

&lt;p&gt;Completing a project often involves a mix of programming and other non-technical concerns that are difficult to bake into a standard process. This means identifying what stakeholders want (which is often distinct from what they ask for), forming the correct approach, aligning involved parties, and executing to deliver a strong finished product.&lt;/p&gt;

&lt;p&gt;Being able to assume responsibility for these varied factors is a major separator of people who are good at writing code and people who are able to have major impact in an organization.&lt;/p&gt;

</description>
      <category>devtips</category>
      <category>coding</category>
      <category>development</category>
    </item>
  </channel>
</rss>
