<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: BuzzFeed Tech</title>
    <description>The latest articles on Forem by BuzzFeed Tech (@buzzfeedtech).</description>
    <link>https://forem.com/buzzfeedtech</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F1231%2F0a30b25a-68a9-4bcc-b386-6039211617da.png</url>
      <title>Forem: BuzzFeed Tech</title>
      <link>https://forem.com/buzzfeedtech</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/buzzfeedtech"/>
    <language>en</language>
    <item>
      <title>More Data, More Problems: How BuzzFeed Scaled its Data Operation</title>
      <dc:creator>Matt Semanyshyn</dc:creator>
      <pubDate>Wed, 23 Dec 2020 18:19:07 +0000</pubDate>
      <link>https://forem.com/buzzfeedtech/more-data-more-problems-how-buzzfeed-scaled-its-data-operation-164c</link>
      <guid>https://forem.com/buzzfeedtech/more-data-more-problems-how-buzzfeed-scaled-its-data-operation-164c</guid>
      <description>&lt;p&gt;Data has always been integral to BuzzFeed’s success. It allows team members to build data-driven products, evaluate how our content is performing, and ask questions to more deeply understand our audience — all to ultimately inform BuzzFeed’s overall strategy and create the best experience for our users.&lt;/p&gt;

&lt;p&gt;Our data originates from many sources and covers a large footprint, including anonymized first-party tracking, third-party analytics (Google), platform APIs (Facebook, YouTube, Instagram, etc.), and internal applications (content metadata from MySQL databases). Where we have control of this data, we’ve worked hard to improve it at the point of creation. Our first-party tracking, for example, was recently redesigned and reimplemented to employ a modular schema design, ensuring consistency and flexibility across all our products.&lt;/p&gt;

&lt;p&gt;To meet the increasing demands of these data sets, our Data Engineering group invested significantly in our data infrastructure over the last couple of years. We migrated our data into Google’s BigQuery and reworked our ingestion pipeline to import new data into the warehouse in near real-time. With this foundation in place, we are now ingesting tens of thousands of records per second, totaling nearly 2 TB of data per day. This process is fairly unopinionated; by simply specifying a schema, relevant database dumps or event stream log files get ingested into BigQuery without any transformation.&lt;/p&gt;

&lt;p&gt;While the availability of this data in BigQuery in near real-time unlocks a multitude of ways in which it can be leveraged, we quickly realized more data also leads to more problems. In this post we’ll detail these challenges and how we ultimately worked past them to empower our organization to scale its use of data while also simplifying our data infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transforming the Data&lt;/strong&gt;&lt;br&gt;
BigQuery, while effective at storing large data volumes (totaling over 2 Petabytes across all of BuzzFeed’s datasets), requires special consideration when querying it. All BuzzFeed BigQuery queries share a fixed pool of 2,000 slots — units of computational capacity required to execute the query. BigQuery calculates the number of slots required by each query based on its complexity and amount of data scanned. Inefficient or large queries will not only take longer to execute but can also block or slow other concurrent queries because of the number of slots they require. Table JOINs in particular can become computationally expensive because of the way that data needs to be coordinated between slots. As such, BigQuery is most performant when data is denormalized.&lt;/p&gt;
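&lt;p&gt;The payoff of denormalization is easy to see outside of BigQuery too. The toy sketch below uses Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in (the table and column names are invented for illustration, not our actual schema): the JOIN is paid once at build time, and every subsequent read scans a single flat table.&lt;/p&gt;

```python
import sqlite3

# Toy stand-in for the warehouse: a normalized posts/authors pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id TEXT, author_id TEXT)")
conn.execute("CREATE TABLE authors (author_id TEXT, name TEXT)")
conn.execute("INSERT INTO posts VALUES ('p1', 'u1')")
conn.execute("INSERT INTO authors VALUES ('u1', 'Ann')")

# Normalized read: every query pays for the JOIN.
joined = conn.execute(
    "SELECT p.post_id, a.name FROM posts p "
    "JOIN authors a ON p.author_id = a.author_id"
).fetchall()

# Denormalized copy: pay the JOIN once at build time,
# then every read scans a single flat table.
conn.execute(
    "CREATE TABLE posts_denorm AS "
    "SELECT p.post_id, a.name AS author_name "
    "FROM posts p JOIN authors a ON p.author_id = a.author_id"
)
flat = conn.execute("SELECT post_id, author_name FROM posts_denorm").fetchall()

print(joined, flat)
```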

&lt;p&gt;Since our data is imported into BigQuery in its raw form, we needed a way to optimize it into representations that capture common query patterns and transformations. For example, we might aggregate individual page view events into hourly totals, or create a denormalized representation of our core content metadata. To achieve this, we’ve built a “Materialized Views” system.&lt;/p&gt;

&lt;p&gt;On its surface, the system is fairly straightforward: given a SQL query, run it periodically and save its results in a new table that can be queried independently from its source data. On closer inspection, however, you’ll see a much more complex system that tracks dependencies, schedules and triggers full and partial rebuilds of tables, orchestrates rebuild execution to balance job priorities against the fixed BigQuery slot allocation, provides tooling for creation and validation of views, and enforces change management rules to ensure reliability for downstream consumers of the resulting tables.&lt;/p&gt;
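&lt;p&gt;Stripped of that orchestration, the core loop can be sketched in a few lines. This is a toy illustration using Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; in place of BigQuery; the &lt;code&gt;materialize&lt;/code&gt; helper and table names are invented for this sketch, and the real system adds the dependency tracking, scheduling, and validation described above.&lt;/p&gt;

```python
import sqlite3

# Stand-in warehouse holding raw page view events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page_id TEXT, viewed_at TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("a", "2020-12-01 10:05"), ("a", "2020-12-01 10:40"), ("b", "2020-12-01 11:15")],
)

def materialize(conn, view_name, sql):
    """Run the view's SQL and save its results in a standalone table."""
    conn.execute("DROP TABLE IF EXISTS " + view_name)
    conn.execute("CREATE TABLE " + view_name + " AS " + sql)
    conn.commit()

# A "view" that rolls individual events up into hourly totals;
# rerunning materialize() periodically refreshes the table.
materialize(
    conn,
    "hourly_page_views",
    "SELECT page_id, substr(viewed_at, 1, 13) AS hour, COUNT(*) AS views "
    "FROM page_views GROUP BY page_id, hour",
)

rows = conn.execute(
    "SELECT page_id, hour, views FROM hourly_page_views ORDER BY page_id, hour"
).fetchall()
print(rows)
```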

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rexrvE_V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zmjpf3315idfbrek0ljb.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rexrvE_V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zmjpf3315idfbrek0ljb.gif" alt="A Materialized Views Validation Run"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With over 200 views in production, the tables created by the Materialized Views system have become the primary data access point for data in BuzzFeed, supporting over 80% of our reporting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standardizing the Data&lt;/strong&gt;&lt;br&gt;
Given the varied nature of BuzzFeed’s data, understanding what data is available and how it relates can be difficult.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gYFquRxc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/664mk8s21nq23ymsmuiv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gYFquRxc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/664mk8s21nq23ymsmuiv.png" alt="A rough sketch of table relationships"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To lower the barrier of entry for working with this data, we’ve introduced the “BuzzFeed Data Model” (BFDM for short). Built by leveraging the Materialized Views system, BFDM provides a standardized and consistent set of tables designed to support a majority of common business use cases. It considers the entire landscape of our raw data and how the various sources relate to one another to provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Consistency in data granularity &lt;br&gt;
 Regardless of the source, BFDM provides metrics precomputed at hourly, daily, and lifetime granularities (where applicable).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consistency in terminology and naming&lt;br&gt;
 By standardizing naming conventions across BFDM, it is easier to find relevant tables, understand what data is available within, and query across them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Clearer relationships&lt;br&gt;
 Tables are broken out into one of three types: entities, relationships, and metrics. Because each type exposes a common set of fields, any two sets of data can easily be JOINed together.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Centralization of business logic&lt;br&gt;
 (i.e., content categorization, relationships, and grouping rules)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data enrichment, clean-up, and error remediation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
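&lt;p&gt;To make the entity/metrics split concrete, here is a minimal sketch using Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt;. The table names are hypothetical and only illustrate the naming convention, not the real BFDM schema: because both tables expose a common &lt;code&gt;content_id&lt;/code&gt; field, joining an entity table to a metrics table needs no special-case logic.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Entity table: one row per piece of content.
conn.execute("CREATE TABLE content_entity (content_id TEXT, title TEXT)")
# Metrics table: precomputed daily totals, keyed by the same id.
conn.execute("CREATE TABLE content_metrics_daily (content_id TEXT, day TEXT, views INTEGER)")
conn.execute("INSERT INTO content_entity VALUES ('c1', 'A Quiz')")
conn.executemany(
    "INSERT INTO content_metrics_daily VALUES (?, ?, ?)",
    [("c1", "2020-12-01", 120), ("c1", "2020-12-02", 80)],
)

# The shared content_id field makes the JOIN mechanical.
row = conn.execute(
    "SELECT e.title, SUM(m.views) FROM content_entity e "
    "JOIN content_metrics_daily m ON e.content_id = m.content_id "
    "GROUP BY e.title"
).fetchone()
print(row)
```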

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qSNEKpsw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/c3ouzneemu015h3d4b7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qSNEKpsw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/c3ouzneemu015h3d4b7e.png" alt="The New Structure Provided by BFDM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This set of tables makes it easier for teams to work with data, simplifies and optimizes queries, and provides a “sanctioned” source of truth for BuzzFeed’s core metrics. Team members are able to seamlessly query different tables without needing to keep track of a long list of “gotchas” about the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating a Single Source of Truth&lt;/strong&gt;&lt;br&gt;
Through the years, various differing (and sometimes redundant) approaches were introduced to BuzzFeed’s data infrastructure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spark jobs aggregated raw page view events into hourly aggregates to be imported into our data warehouse (Redshift prior to BigQuery)&lt;/li&gt;
&lt;li&gt;Looker Persistent Derived Tables transformed data for its own use&lt;/li&gt;
&lt;li&gt;A Redis-backed API served transformed and aggregated data to some internal dashboards&lt;/li&gt;
&lt;li&gt;A Cassandra-backed API served real-time time-series page view aggregates to other dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not only has each of these legacy pieces increasingly become an operational burden, but the redundancy among them has also allowed for potential inconsistencies.&lt;/p&gt;

&lt;p&gt;While our move to BigQuery introduced one more potential source for inconsistency, it has also provided the key components to allow us to decommission each of the legacy systems in favor of one consolidated approach. Remember, we can now import data into BigQuery in near real-time to be transformed by the Materialized Views system into our standardized source of truth, the BuzzFeed Data Model. With this, the same BFDM tables can be used for ad-hoc queries or within BI tools like Looker. By introducing one more system — an API that runs lightweight queries against BFDM — our internal dashboards can be powered by these same tables as well, guaranteeing consistency across all points of data access. (Not to mention reduced technical debt from each of our decommissioned systems!)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Looking to the future&lt;/strong&gt;&lt;br&gt;
These various efforts have left BuzzFeed on a strong footing to continue leaning into its data-driven culture. However, to continue to succeed into the future, our data-powered approach must be understood, valued, and supported throughout the organization — teams need to use our infrastructure effectively, properly instrument their products with tracking, and help BFDM evolve.&lt;/p&gt;

&lt;p&gt;To help achieve this, the Data Group built out Data Governance processes, resources, and organizational structures:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OKJMRNn6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gm87candoe7cn3z7fo9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OKJMRNn6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gm87candoe7cn3z7fo9k.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Comprised of a set of “Data Stewards” representing each engineering team at BuzzFeed, a “Data Governance Council” disseminates established best practices in a scalable manner, opens up channels of communication to evolve these best practices in a way that properly represents each team’s practical needs, and facilitates knowledge sharing and collaboration across the engineering organization on data initiatives.&lt;/li&gt;
&lt;li&gt;A data review process to be completed at the start of any new user-facing initiative helps ensure that the data needs of the project are considered as a first-rate product requirement.&lt;/li&gt;
&lt;li&gt;A data resource center highlights best practices and centralizes documentation for use across the organization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This work has been a collective effort across BuzzFeed Tech and enables us to explore many new and exciting data-driven initiatives! If you’d like to join us, BuzzFeed Tech is hiring! To browse openings, check out buzzfeed.com/jobs.&lt;/p&gt;

&lt;p&gt;You can also follow us on Twitter &lt;a class="comment-mentioned-user" href="https://dev.to/buzzfeedexp"&gt;@buzzfeedexp&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>bigquery</category>
      <category>database</category>
      <category>datamodel</category>
      <category>data</category>
    </item>
    <item>
      <title>Finding and Fixing Memory Leaks in Python</title>
      <dc:creator>peterkarp</dc:creator>
      <pubDate>Wed, 16 Oct 2019 18:02:22 +0000</pubDate>
      <link>https://forem.com/buzzfeedtech/finding-and-fixing-memory-leaks-in-python-cd1</link>
      <guid>https://forem.com/buzzfeedtech/finding-and-fixing-memory-leaks-in-python-cd1</guid>
      <description>&lt;p&gt;One of the major benefits provided in dynamic interpreted languages such as Python is that they make managing memory easy. Objects such as arrays and strings grow dynamically as needed and their memory is cleared when no longer needed. Since memory management is handled by the language, memory leaks are less common of a problem than in languages like C and C++ where it is left to the programmer to request and free memory.&lt;/p&gt;

&lt;p&gt;The BuzzFeed technology stack includes a microservice architecture that supports over a hundred services, many of which are built with Python. We monitor the services for common system properties such as memory and load. In the case of memory, a well-behaved service will both use and free it over time, performing like this chart of memory used over a three-month period.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fz6p53sg3iqtvw77ldg58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fz6p53sg3iqtvw77ldg58.png" alt="a graph showing expected memory usage over three months"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A microservice that leaks memory over time will exhibit a saw-tooth behavior as memory increases until some point (for example, maximum memory available) where the service is shut down, freeing all the memory and then restarted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuansr3inxlftvxgfjyvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fuansr3inxlftvxgfjyvj.png" alt="a graph showing irregular spikes and dips in memory usage"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sometimes a code review will identify places where underlying operating system resources, such as file handles, are allocated but never freed. These resources are limited; each use allocates a small amount of memory that needs to be freed afterward so others may use the resource. &lt;/p&gt;
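&lt;p&gt;A contrived illustration of the pattern (the helper functions here are invented for this example, not code from any BuzzFeed service): handles that stay referenced are never closed, while a context manager releases the resource deterministically.&lt;/p&gt;

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("hello")

open_handles = []

def leaky_read(path):
    # The handle stays referenced from open_handles, so it is never
    # closed; each call pins an OS file descriptor plus its buffers.
    f = open(path)
    open_handles.append(f)
    return f.read()

def safe_read(path):
    # The context manager closes the handle as soon as the block exits.
    with open(path) as f:
        return f.read()

for _ in range(3):
    leaky_read(path)

print(len(open_handles), [h.closed for h in open_handles])
```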

&lt;p&gt;This post first describes the tools used to identify the source of a memory leak. It then presents a real example of a Python application that leaked memory and shows how these tools were used to track down the leak.&lt;/p&gt;

&lt;h2&gt;Tools&lt;/h2&gt;

&lt;p&gt;If a code review does not turn up any viable suspects, then it is time to turn to tools for tracking down memory leaks. The first tool should provide a way to chart memory usage over time. At BuzzFeed, we use DataDog to monitor microservices performance. Leaks may accumulate slowly over time, several bytes at a time.  In this case, it is necessary to chart the memory growth to see the trend. &lt;/p&gt;

&lt;p&gt;The other tool, &lt;a href="https://docs.python.org/3/library/tracemalloc.html" rel="noopener noreferrer"&gt;tracemalloc&lt;/a&gt;, is part of the Python standard library. Essentially, &lt;code&gt;tracemalloc&lt;/code&gt; is used to take snapshots of Python’s memory. To begin using &lt;code&gt;tracemalloc&lt;/code&gt;, first call &lt;code&gt;tracemalloc.start()&lt;/code&gt; to initialize it, then take a snapshot using:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

import tracemalloc

tracemalloc.start()  # must be called before any snapshot is taken
snapshot = tracemalloc.take_snapshot()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;tracemalloc&lt;/code&gt; can show a sorted list of the top allocations in a snapshot using the &lt;code&gt;statistics()&lt;/code&gt; method. In this snippet, the top 5 allocations, grouped by source filename, are logged.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

import logging

for i, stat in enumerate(snapshot.statistics('filename')[:5], 1):
    logging.info("top_current %s %s", i, stat)



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The output will look similar to this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

1 /usr/local/lib/python3.6/ssl.py:0: size=2569 KiB, count=41, average=62.7 KiB
2 /usr/local/lib/python3.6/tracemalloc.py:0: size=944 KiB, count=15152, average=64 B
3 /usr/local/lib/python3.6/socket.py:0: size=575 KiB, count=4461, average=132 B
4 /usr/local/lib/python3.6/site-packages/tornado/gen.py:0: size=142 KiB, count=500, average=290 B
5 /usr/local/lib/python3.6/mimetypes.py:0: size=130 KiB, count=1686, average=79 B


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This shows, on a per-module basis, the size of the memory allocated, the number of blocks allocated, and the average size of each. &lt;/p&gt;

&lt;p&gt;We take a snapshot at the start of our program and implement a callback that runs every few minutes to take a snapshot of the memory. Comparing two snapshots shows changes in memory allocation. We compare each snapshot to the one taken at the start. By observing any allocation that is increasing over time, we may capture an object that is leaking memory. The &lt;code&gt;compare_to()&lt;/code&gt; method is called on a snapshot to compare it with another snapshot. The &lt;code&gt;'filename'&lt;/code&gt; parameter is used to group all allocations by module. This helps to narrow a search to a module that is leaking memory.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

current = tracemalloc.take_snapshot()
stats = current.compare_to(start, 'filename')
for i, stat in enumerate(stats[:5], 1):
    logging.info("since_start %s %s", i, stat)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The output will look similar to this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

1 /usr/local/lib/python3.6/ssl.py:0: size=2569 KiB (+2569 KiB), count=43 (+43), average=59.8 KiB
2 /usr/local/lib/python3.6/socket.py:0: size=1845 KiB (+1845 KiB), count=13761 (+13761), average=137 B
3 /usr/local/lib/python3.6/tracemalloc.py:0: size=671 KiB (+671 KiB), count=10862 (+10862), average=63 B
4 /usr/local/lib/python3.6/linecache.py:0: size=371 KiB (+371 KiB), count=3639 (+3639), average=104 B
5 /usr/local/lib/python3.6/mimetypes.py:0: size=126 KiB (+126 KiB), count=1679 (+1679), average=77 B


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once a suspect module is identified, it may be possible to find the exact line of code responsible for a memory allocation. &lt;code&gt;tracemalloc&lt;/code&gt; provides a way to view a stack trace for any memory allocation. As with a Python exception traceback, it shows the line and module where an allocation occurred and all the calls that came before. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

traces = current.statistics('traceback')
# inspect the single largest allocation's full traceback
for stat in traces[:1]:
    logging.info("memory_blocks=%s size_kB=%s", stat.count, stat.size / 1024)
    for line in stat.traceback.format():
        logging.info(line)



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

memory_blocks=2725 size_kB=346.0341796875
  File "/usr/local/lib/python3.6/socket.py", line 657
    self._sock = None
  File "/usr/local/lib/python3.6/http/client.py", line 403
    fp.close()
  File "/usr/local/lib/python3.6/http/client.py", line 410
    self._close_conn()
  File "/usr/local/lib/python3.6/site-packages/ddtrace/writer.py", line 166
    result_traces = None
  File "/usr/local/lib/python3.6/threading.py", line 864
  File "/usr/local/lib/python3.6/threading.py", line 916
    self.run()
  File "/usr/local/lib/python3.6/threading.py", line 884
    self._bootstrap_inner()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Reading bottom to top, this shows a trace to a line in the socket module where a memory allocation took place. With this information, it may be possible to finally isolate the cause of the memory leak.&lt;/p&gt;

&lt;p&gt;In this first section, we saw that &lt;code&gt;tracemalloc&lt;/code&gt; takes snapshots of memory and provides statistics about the memory allocation. The next section describes the search for an actual memory leak in one BuzzFeed microservice.&lt;/p&gt;
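&lt;p&gt;As a quick recap, the pieces fit together in a self-contained script like this (the allocation is deliberately artificial so that the diff has an obvious culprit):&lt;/p&gt;

```python
import tracemalloc

tracemalloc.start()
start = tracemalloc.take_snapshot()

# Simulate a leak: roughly 10 MB of objects that stay referenced.
leak = [bytes(10_000) for _ in range(1000)]

current = tracemalloc.take_snapshot()
stats = current.compare_to(start, "filename")

# The biggest diff should point back at this file's allocations.
for stat in stats[:3]:
    print(stat)
```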

&lt;h2&gt;The Search for Our Memory Leak&lt;/h2&gt;

&lt;p&gt;Over several months we observed the classic saw-tooth of an application with a memory leak. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffewwpb118cltm63b4489.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ffewwpb118cltm63b4489.png" alt="a graph showing the classic saw-tooth line of a memory leak, with highlights indicating snapshots taken"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We instrumented the microservice with a call to &lt;code&gt;trace_leak()&lt;/code&gt; to log the statistics found in the tracemalloc snapshots. The code loops forever and sleeps for some delay in each loop.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;trace_leak&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Use spawn_callback to invoke:
        tornado.ioloop.IOLoop.current().spawn_callback(trace_leak, delay=300, top=10, trace=3)
    :param delay: in seconds (int)
    :param top: number of top allocations to list (int)
    :param trace: number of top allocations to trace (int)
    :return: None
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_trace&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tracemalloc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracemalloc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take_snapshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;tornado&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tracemalloc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take_snapshot&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="c1"&gt;# compare current snapshot to starting snapshot
&lt;/span&gt;            &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compare_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# compare current snapshot to previous snapshot
&lt;/span&gt;            &lt;span class="n"&gt;prev_stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compare_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;lineno&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Top Diffs since Start&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Print top diffs: current snapshot - start snapshot       
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stat&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;top_diffs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Top Incremental&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# Print top incremental stats: current snapshot - previous snapshot 
&lt;/span&gt;            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stat&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev_stats&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;top_incremental&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Top Current&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# Print top current stats
&lt;/span&gt;            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stat&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;top_current&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

                &lt;span class="c1"&gt;# get tracebacks (stack trace) for the current snapshot
&lt;/span&gt;            &lt;span class="n"&gt;traces&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;traceback&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;stat&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;traces&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;traceback&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;memory_blocks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size_kB&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# set previous snapshot to current snapshot
&lt;/span&gt;        &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The microservice is built using &lt;a href="https://www.tornadoweb.org/en/stable/" rel="noopener noreferrer"&gt;tornado&lt;/a&gt;, so we schedule the coroutine with &lt;code&gt;spawn_callback()&lt;/code&gt;, passing the parameters &lt;code&gt;delay&lt;/code&gt;, &lt;code&gt;top&lt;/code&gt;, and &lt;code&gt;trace&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;tornado&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ioloop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IOLoop&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;spawn_callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;trace_leak&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
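&lt;p&gt;For readers who aren’t on tornado, the same periodic-snapshot loop can be sketched with nothing but the standard library. This is a hedged, stdlib-only approximation of the coroutine above: the &lt;code&gt;print&lt;/code&gt; calls stand in for our structured &lt;code&gt;logger&lt;/code&gt;, and the &lt;code&gt;iterations&lt;/code&gt; parameter is added here (it is not in the original) so the example terminates.&lt;/p&gt;

```python
import asyncio
import tracemalloc

async def trace_leak(delay, top, iterations=3):
    """Periodically compare memory snapshots, mirroring the tornado coroutine."""
    tracemalloc.start()
    start = tracemalloc.take_snapshot()
    prev = start
    for _ in range(iterations):
        await asyncio.sleep(delay)
        current = tracemalloc.take_snapshot()
        stats = current.compare_to(start, 'filename')    # growth since startup
        prev_stats = current.compare_to(prev, 'lineno')  # growth since last pass
        for i, stat in enumerate(stats[:top], 1):
            print('top_diffs', i, stat)
        for i, stat in enumerate(prev_stats[:top], 1):
            print('top_incremental', i, stat)
        prev = current
    tracemalloc.stop()

asyncio.run(trace_leak(delay=0.1, top=3))
```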

&lt;p&gt;The logs for a single iteration showed allocations occurring in several modules:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

1 /usr/local/lib/python3.6/ssl.py:0: size=2569 KiB (+2569 KiB), count=43 (+43), average=59.8 KiB
2 /usr/local/lib/python3.6/socket.py:0: size=1845 KiB (+1845 KiB), count=13761 (+13761), average=137 B
3 /usr/local/lib/python3.6/tracemalloc.py:0: size=671 KiB (+671 KiB), count=10862 (+10862), average=63 B
4 /usr/local/lib/python3.6/linecache.py:0: size=371 KiB (+371 KiB), count=3639 (+3639), average=104 B
5 /usr/local/lib/python3.6/mimetypes.py:0: size=126 KiB (+126 KiB), count=1679 (+1679), average=77 B


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
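&lt;p&gt;If the &lt;code&gt;tracemalloc.py&lt;/code&gt; entry in a report like this is distracting, snapshots can be filtered before computing statistics using the standard &lt;code&gt;Snapshot.filter_traces()&lt;/code&gt; API. A small sketch (the throwaway &lt;code&gt;data&lt;/code&gt; list is just a stand-in allocation for illustration):&lt;/p&gt;

```python
import tracemalloc

tracemalloc.start()
data = [bytearray(1000) for _ in range(1000)]  # stand-in allocation to report on
snapshot = tracemalloc.take_snapshot()

# Exclude frames from tracemalloc itself and from the import machinery so
# the profiler's own allocations don't clutter the per-file statistics.
snapshot = snapshot.filter_traces((
    tracemalloc.Filter(False, tracemalloc.__file__),
    tracemalloc.Filter(False, '<frozen importlib._bootstrap>'),
))

for stat in snapshot.statistics('filename')[:5]:
    print(stat)
```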

&lt;p&gt;&lt;code&gt;tracemalloc&lt;/code&gt; is not the source of the memory leak! It does, however, require some memory of its own, which is why it shows up here. After running the service for several hours, we use DataDog to filter the logs by module, and a pattern starts to emerge with &lt;code&gt;socket.py&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

/usr/local/lib/python3.6/socket.py:0: size=1840 KiB (+1840 KiB)
/usr/local/lib/python3.6/socket.py:0: size=1840 KiB (+1840 KiB)
/usr/local/lib/python3.6/socket.py:0: size=1841 KiB (+1841 KiB)
#                               Increase here ^
/usr/local/lib/python3.6/socket.py:0: size=1841 KiB (+1841 KiB)
/usr/local/lib/python3.6/socket.py:0: size=1841 KiB (+1841 KiB)
/usr/local/lib/python3.6/socket.py:0: size=1841 KiB (+1841 KiB)
/usr/local/lib/python3.6/socket.py:0: size=1842 KiB (+1842 KiB)
#                               Increase here ^
/usr/local/lib/python3.6/socket.py:0: size=1843 KiB (+1843 KiB)
/usr/local/lib/python3.6/socket.py:0: size=1843 KiB (+1843 KiB)
#                               Increase here ^
/usr/local/lib/python3.6/socket.py:0: size=1844 KiB (+1844 KiB)
#                               Increase here ^
/usr/local/lib/python3.6/socket.py:0: size=1844 KiB (+1844 KiB)
/usr/local/lib/python3.6/socket.py:0: size=1845 KiB (+1845 KiB)
#                               Increase here ^
/usr/local/lib/python3.6/socket.py:0: size=1845 KiB (+1845 KiB)



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The size of the allocation for &lt;code&gt;socket.py&lt;/code&gt; is increasing from 1840 KiB to 1845 KiB. None of the other modules exhibited this clear trend. We next look at the traceback for &lt;code&gt;socket.py&lt;/code&gt;.&lt;/p&gt;
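&lt;p&gt;One detail worth flagging before reading a traceback report: by default &lt;code&gt;tracemalloc&lt;/code&gt; stores only one frame per allocation, which makes the &lt;code&gt;'traceback'&lt;/code&gt; view a single line. Capturing a full stack requires starting tracing with a frame budget, e.g. &lt;code&gt;tracemalloc.start(25)&lt;/code&gt;. A minimal sketch; the &lt;code&gt;leaky&lt;/code&gt; function is a made-up allocation site, not code from our service:&lt;/p&gt;

```python
import tracemalloc

tracemalloc.stop()     # reset in case tracing is already on
tracemalloc.start(25)  # keep up to 25 frames per allocation

def leaky():
    # made-up allocation site: ~150 KiB of bytearray objects
    return [bytearray(100) for _ in range(1000)]

blobs = leaky()
snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics('traceback')[0]  # largest group by size
print('%d memory blocks: %.1f KiB' % (top.count, top.size / 1024))
for line in top.traceback.format():
    print(line)
```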

&lt;h2&gt;
  
  
  We identify a possible cause
&lt;/h2&gt;

&lt;p&gt;We get a stack trace from &lt;code&gt;tracemalloc&lt;/code&gt; for the &lt;code&gt;socket&lt;/code&gt; module.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

  File "/usr/local/lib/python3.6/socket.py", line 657
    self._sock = None
  File "/usr/local/lib/python3.6/http/client.py", line 403
    fp.close()
  File "/usr/local/lib/python3.6/http/client.py", line 410
    self._close_conn()
  File "/usr/local/lib/python3.6/site-packages/ddtrace/writer.py", line 166
    result_traces = None
  File "/usr/local/lib/python3.6/threading.py", line 864
  File "/usr/local/lib/python3.6/threading.py", line 916
    self.run()
  File "/usr/local/lib/python3.6/threading.py", line 884
    self._bootstrap_inner()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Initially, I want to assume that Python and its standard library are solid and not leaking memory. Everything in this trace is part of the Python 3.6 standard library except for one package from DataDog, &lt;code&gt;ddtrace/writer.py&lt;/code&gt;. Given my assumption about the integrity of Python, a third-party package seems like a good place to start investigating further.&lt;/p&gt;

&lt;h2&gt;
  
  
  It's still leaking
&lt;/h2&gt;

&lt;p&gt;We find when &lt;code&gt;ddtrace&lt;/code&gt; was added to our service, do a quick rollback of our requirements, and then start monitoring the memory again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1nrtrzobtce2b9iz6hr6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F1nrtrzobtce2b9iz6hr6.png" alt="a graph showing a not-as-frequent saw-tooth memory leak graph"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Another look at the logs
&lt;/h2&gt;

&lt;p&gt;Over the course of several days, the memory continues to rise: removing the module did not stop the leak, and we still have not found the culprit. So it’s back to the logs to find another suspect.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

1: /usr/local/lib/python3.6/ssl.py:0: size=2568 KiB (+2568 KiB), count=28 (+28), average=91.7 KiB
2: /usr/local/lib/python3.6/tracemalloc.py:0: size=816 KiB (+816 KiB), count=13126 (+13126), average=64 B
3: /usr/local/lib/python3.6/linecache.py:0: size=521 KiB (+521 KiB), count=5150 (+5150), average=104 B
4: /usr/local/lib/python3.6/mimetypes.py:0: size=130 KiB (+130 KiB), count=1699 (+1699), average=78 B
5: /usr/local/lib/python3.6/site-packages/tornado/gen.py:0: size=120 KiB (+120 KiB), count=368 (+368),


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;There is nothing in these logs that looks suspicious on its own. However, &lt;code&gt;ssl.py&lt;/code&gt; is allocating the largest chunk by far: roughly 2.5 MiB of memory. Over time the logs show that this remains constant, neither increasing nor decreasing. Without much else to go on, we start checking the tracebacks for &lt;code&gt;ssl.py&lt;/code&gt;.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&lt;p&gt;File "/usr/local/lib/python3.6/ssl.py", line 645&lt;br&gt;
    return self._sslobj.peer_certificate(binary_form)&lt;br&gt;
  File "/usr/local/lib/python3.6/ssl.py", line 688&lt;br&gt;
  File "/usr/local/lib/python3.6/ssl.py", line 1061&lt;br&gt;
    self._sslobj.do_handshake()&lt;br&gt;
  File "/usr/local/lib/python3.6/site-packages/tornado/iostream.py", line 1310&lt;br&gt;
    self.socket.do_handshake()&lt;br&gt;
  File "/usr/local/lib/python3.6/site-packages/tornado/iostream.py", line 1390&lt;br&gt;
    self._do_ssl_handshake()&lt;br&gt;
  File "/usr/local/lib/python3.6/site-packages/tornado/iostream.py", line 519&lt;br&gt;
    self._handle_read()&lt;br&gt;
  File "/usr/local/lib/python3.6/site-packages/tornado/stack_context.py", line 277&lt;br&gt;
    return fn(*args, **kwargs)&lt;br&gt;
  File "/usr/local/lib/python3.6/site-packages/tornado/ioloop.py", line 888&lt;br&gt;
    handler_func(fd_obj, events)&lt;br&gt;
  File "/app/socialmc.py", line 76&lt;br&gt;
    tornado.ioloop.IOLoop.instance().start()&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  A solid lead
&lt;/h2&gt;

&lt;p&gt;The top of the stack shows a call to &lt;code&gt;peer_certificate()&lt;/code&gt; on line 645 of &lt;code&gt;ssl.py&lt;/code&gt;. Without much else to go on, we make a long-shot Google search for “python memory leak SSL peer_certificate” and get a link to a &lt;a href="https://bugs.python.org/issue29870" rel="noopener noreferrer"&gt;Python bug report&lt;/a&gt;. Fortunately, this bug had already been resolved. Now it was simply a matter of updating our container image from Python 3.6.1 to Python 3.6.4 to pick up the fix and see if it resolved our memory leak.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looks good
&lt;/h2&gt;

&lt;p&gt;After updating the image we monitor the memory again with DataDog. After a fresh deploy around Sept. 9th, the memory now runs flat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F3g19r782nht1z0fcjfsk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F3g19r782nht1z0fcjfsk.png" alt="a graph showing the declining spikes and memory usage flattening out"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Having the right tools for the job can make the difference between solving the problem and not. The search for our memory leak took place over two months. &lt;code&gt;tracemalloc&lt;/code&gt; provides good insight into the memory allocations happening in a Python program; however, it knows nothing about allocations that packages make directly in C/C++, outside of Python’s allocator. In the end, tracking down memory leaks requires patience, persistence, and a bit of detective work.&lt;/p&gt;
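&lt;p&gt;That blind spot is easy to see with &lt;code&gt;tracemalloc.get_traced_memory()&lt;/code&gt;: it reports only allocations routed through Python’s memory allocator, so memory a C/C++ extension grabs directly with &lt;code&gt;malloc&lt;/code&gt; never appears in the totals. A quick stdlib-only illustration (the &lt;code&gt;payload&lt;/code&gt; list is a stand-in for real workload allocations):&lt;/p&gt;

```python
import tracemalloc

tracemalloc.start()
payload = [bytearray(1024) for _ in range(1024)]  # ~1 MiB of Python-level allocations
current, peak = tracemalloc.get_traced_memory()
print('traced current: %.1f KiB, peak: %.1f KiB' % (current / 1024, peak / 1024))
# A C extension calling malloc() directly would not move these numbers at all.
tracemalloc.stop()
```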

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.python.org/3/library/tracemalloc.html" rel="noopener noreferrer"&gt;https://docs.python.org/3/library/tracemalloc.html&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.fugue.co/blog/2017-03-06-diagnosing-and-fixing-memory-leaks-in-python.html" rel="noopener noreferrer"&gt;https://www.fugue.co/blog/2017-03-06-diagnosing-and-fixing-memory-leaks-in-python.html&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post was originally posted on Jan. 17th, 2019, on BF Tech's &lt;a href="https://tech.buzzfeed.com/finding-and-fixing-memory-leaks-in-python-413ce4266e7d" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>memoryleaks</category>
      <category>debug</category>
    </item>
    <item>
      <title>Why we use Micro Frontends at BuzzFeed</title>
      <dc:creator>Ian Feather</dc:creator>
      <pubDate>Thu, 26 Sep 2019 14:12:30 +0000</pubDate>
      <link>https://forem.com/buzzfeedtech/why-we-use-micro-frontends-at-buzzfeed-1k9o</link>
      <guid>https://forem.com/buzzfeedtech/why-we-use-micro-frontends-at-buzzfeed-1k9o</guid>
      <description>&lt;p&gt;The definition of what constitutes a “micro frontend” perhaps hasn’t yet reached consensus. The smart folks at &lt;a href="https://medium.com/dazn-tech/orchestrating-micro-frontends-a5d2674cbf33"&gt;DAZN&lt;/a&gt; consider it to be a series of full pages managed by a client-side orchestrator. Other approaches, such as &lt;a href="https://opencomponents.github.io/"&gt;OpenComponents&lt;/a&gt;, compose single pages out of multiple micro frontends.&lt;/p&gt;

&lt;p&gt;BuzzFeed’s use case fits somewhere in between the two. I wouldn’t say we have a micro frontend architecture; however, we do leverage them for a few parts of the page. We consider something to be a micro frontend if the API returns fully rendered HTML (and assets) but not an &lt;code&gt;&amp;lt;html&amp;gt;&lt;/code&gt; or &lt;code&gt;&amp;lt;body&amp;gt;&lt;/code&gt; element.&lt;/p&gt;
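&lt;p&gt;To make that definition concrete, here is a minimal, hypothetical sketch of a fragment endpoint in Python (the &lt;code&gt;/header&lt;/code&gt; path and the markup are invented for illustration, and a real service would render templates rather than a hard-coded string). It returns fully rendered HTML with no &lt;code&gt;&amp;lt;html&amp;gt;&lt;/code&gt; or &lt;code&gt;&amp;lt;body&amp;gt;&lt;/code&gt; wrapper, ready to be inlined by a consuming page:&lt;/p&gt;

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

HEADER_FRAGMENT = '<header class="site-header"><nav>...</nav></header>'

class FragmentHandler(BaseHTTPRequestHandler):
    """Serve a fully rendered HTML fragment: no <html> or <body> wrapper."""

    def do_GET(self):
        if self.path == '/header':
            body = HEADER_FRAGMENT.encode('utf-8')
            self.send_response(200)
            self.send_header('Content-Type', 'text/html; charset=utf-8')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):  # keep the demo quiet
        pass

# Serve on an ephemeral port and fetch the fragment once, as a consuming
# page's server-side renderer would.
server = HTTPServer(('127.0.0.1', 0), FragmentHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/header' % server.server_address[1]
fragment = urllib.request.urlopen(url).read().decode('utf-8')
print(fragment)
server.shutdown()
```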

&lt;p&gt;We have three micro frontends: the header component, the post content, and our interactive embeds. Each of these adopted the micro frontend approach because they presented real and distinct business problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Micro Frontend #1: The Header
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Why? Component Distribution&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h0QWw2pA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3zsz0rg5l4l09n2vheh7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h0QWw2pA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/3zsz0rg5l4l09n2vheh7.png" alt="An image of the Buzzfeed.com header"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the buzzfeed.com header. It has a light layer of configuration as well as a reasonable amount of code behind it: certainly enough that it merits an abstraction as opposed to duplicating it in all of our services.&lt;/p&gt;

&lt;p&gt;Originally, we made this abstraction and extracted it into an npm package, which services imported as part of their build process. This allowed us to remove duplication as well as have the service bundle the header as part of its own build process (meaning we could easily deduplicate common code and libraries).&lt;/p&gt;

&lt;p&gt;With only two or three services, this technique works really well, but we have more than ten rendering services backing buzzfeed.com. This meant that every time we wanted to make a change to the header, we had to make the following changes more than 10 times:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Update the code in the header&lt;/li&gt;
&lt;li&gt;Make a Pull Request&lt;/li&gt;
&lt;li&gt;Merge and publish to npm&lt;/li&gt;
&lt;li&gt;Update the service package.json&lt;/li&gt;
&lt;li&gt;Make a Pull Request&lt;/li&gt;
&lt;li&gt;Merge and Deploy the service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This became extremely time consuming and led to teams avoiding header changes because of it. Sure, there are ways in which we could have improved this workflow (e.g. using loose semver and just rebuilding the service, automating the update and creation of service PRs) but these still felt like the wrong approach. By moving to a micro frontend pattern, we’re now able to distribute the header instantly to all services and the workflow to update it on all of buzzfeed.com is now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Update the code in the header&lt;/li&gt;
&lt;li&gt;Make a Pull Request&lt;/li&gt;
&lt;li&gt;Deploy the header&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Micro Frontend #2: Post Content (or as we call it: The Subbuzzes)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Why? To maintain a contract with the CMS&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We have a few different “destinations” (e.g., BuzzFeed and BuzzFeed News) for our content yet each one is powered by a single CMS. Each destination is its own service (or multiple services) which connects to our content APIs. This means that we have the ability to render the same content in multiple destinations; however, in practice we choose not to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IjCBi_lS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/0xk6ah7rldohxhkfsu2s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IjCBi_lS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/0xk6ah7rldohxhkfsu2s.png" alt="The same content rendered in three different BuzzFeed destinations."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This also means that we have to maintain a contract between the CMS / Content APIs and the rendering services. To illustrate this it’s easier to focus on an example.&lt;/p&gt;

&lt;p&gt;When an editor wants to add an image to the page, they select the image “subbuzz” in the CMS and upload it. They then have the option to add extensions to that image. One such extension is the ability to mark the image as showing Graphic Content. The intention of adding this extension is that the image would be blurred out and the user would have to opt-in to see it (this is particularly important with sensitive news content). As far as the CMS cares though, all this means is a boolean value stored against an image. Because the CMS depends on the rendering services to add a blurred overlay, we end up with an implicit coupling between the two. If a destination failed to support this feature then users would be exposed to graphic content, and we would have failed to uphold the editors’ intentions.&lt;/p&gt;

&lt;p&gt;So what does this have to do with Micro Frontends?&lt;/p&gt;

&lt;p&gt;We could choose to abstract these subbuzz templates into an npm package and share them across the destinations; however, when we change support for something in the CMS we need the rendering services to be able to reflect this immediately. The CMS is deployed in an un-versioned state, and the content APIs only expose major version numbers. Coupling these with npm packages using semver and deployed via a package would make it harder for them to stay in sync. By moving the subbuzzes behind an HTTP API, we can update the rendering-cms contract across all destinations immediately and guarantee that each destination supports the latest CMS features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Micro Frontend #3: Embeds (Buzz Format Platform)
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Why? Independence from the platform&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Maybe the clearest use case for Micro Frontends: the Embed. We host a ton of embeds (Instagram, Twitter, etc.), including first-party embeds. We call these BFPs, which stands for Buzz Format Platform, and they can be anything from a newsletter signup to a heavily reusable quiz format or a bespoke format supporting an investigative story.&lt;/p&gt;

&lt;p&gt;The entry point for an embed is typically an iframe or a script element, so embeds don’t really qualify as micro frontends on their own. We break that mold (where possible) by rendering them server-side and including the returned DOM directly in the page. We do this so that we can render the embeds in distributed formats (like our BuzzFeed Mobile App or Facebook Instant Articles) as well as expose the content to search engine crawlers.&lt;/p&gt;

&lt;p&gt;BFP provides independence from the platform and gives engineers the feeling of working on a small component without having to consider the wider BuzzFeed ecosystem. This feeling is one we always try to get to when creating developer environments and Micro Frontends certainly provide that opportunity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trade-offs
&lt;/h2&gt;

&lt;p&gt;A micro frontend architecture can give you a great developer experience and a lot of flexibility, but those benefits don’t come for free. You trade them off against:&lt;/p&gt;

&lt;h3&gt;
  
  
  Larger client-side assets or tougher orchestration
&lt;/h3&gt;

&lt;p&gt;We compose our micro frontends in the browser which means there’s no singular build process that can optimize and deduplicate shared dependencies. To achieve this at the browser level, you need to code-split all dependencies and make sure you use the same versions — or build in an orchestration layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Higher risk when releasing updates
&lt;/h3&gt;

&lt;p&gt;Just as we’re able to distribute new changes instantly across many services, we’re also able to distribute bugs and errors. These errors also surface at runtime rather than at build time or in CI. We use this heightened risk as an opportunity to focus more on testing and ensuring that the component contract is maintained.&lt;/p&gt;

&lt;p&gt;There has also been criticism that micro frontends make it tougher to have a cohesive UX, but this is not something we’ve experienced. All of these micro frontends inherit design patterns and smaller components via shared packages.&lt;/p&gt;

&lt;p&gt;Overall the micro frontend pattern has worked well for BuzzFeed Tech in these use cases and has been well tested over the last one to two years. There is definitely an inflection point where having many more of them would require more work to offset the first trade-off, but we don’t think we’re there yet and don’t anticipate being there any time soon — abstracting components to shared packages works well for the majority of our cases. Where it doesn’t, it’s nice to have another architectural pattern to reach for.&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>microfrontends</category>
      <category>javascript</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How BuzzFeed’s Tech Team Helps Journalists Report On Technology With Authority</title>
      <dc:creator>Logan McDonald</dc:creator>
      <pubDate>Wed, 25 Sep 2019 19:26:47 +0000</pubDate>
      <link>https://forem.com/buzzfeedtech/how-buzzfeed-s-tech-team-helps-journalists-report-on-technology-with-authority-3acl</link>
      <guid>https://forem.com/buzzfeedtech/how-buzzfeed-s-tech-team-helps-journalists-report-on-technology-with-authority-3acl</guid>
      <description>&lt;p&gt;&lt;em&gt;Reporter: Is anyone around to take a quick look at some code on codepen and help me understand what it does?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This question — and many, many others like it — arose from an initiative called “The Tech + News Working Group,” a BuzzFeed News collaboration between reporters and nearly every role across the Tech department — including SREs, SWEs, data scientists, and product managers.&lt;/p&gt;

&lt;p&gt;The BuzzFeed Tech team is afforded “7% time,” or half a day each week, to learn something new. For those of us on the Tech side of the working group, that 7% time is dedicated to bringing our technical expertise to the newsroom.&lt;/p&gt;

&lt;p&gt;A primary value of BuzzFeed Tech is fostering an experimental and collaborative workplace; one way that’s come to life is through the Tech + News Working Group. The foundation for the group was laid by former BuzzFeed Tech SRE &lt;a href="https://medium.com/@sricola" rel="noopener noreferrer"&gt;Sri Ray&lt;/a&gt;, who provided countless &lt;a href="https://www.buzzfeednews.com/article/azeenghorayshi/grindr-hiv-status-privacy" rel="noopener noreferrer"&gt;acts of support&lt;/a&gt; to the newsroom throughout his time here. Reporters began reaching out to people on the tech and IT teams to help understand terminology, dig into tips, or verify the security of a file they had received from tipsters. The partnership allowed the Tech team to learn from reporters and consider the wider implications of our work. In turn, as insiders in the tech industry, the newsroom benefitted from our tools and perspective while reporting on technology, security, and business stories.&lt;/p&gt;

&lt;p&gt;Like other award-winning newsrooms, BuzzFeed News has a dedicated &lt;a href="https://github.com/BuzzFeedNews/everything" rel="noopener noreferrer"&gt;data journalism team&lt;/a&gt; that uses data and programming to aid in their reporting. But a year ago, we realized that there is a place for the Tech team to help with reporting, too.&lt;/p&gt;

&lt;p&gt;We decided to expand and formalize our collaboration, with an open Slack channel that includes members from both our Tech and News organizations to support day-to-day consulting on stories and work on larger ongoing projects, along with a new mission statement:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Our mission [for the Tech News Working Group] is to act as an elevating force for the newsroom’s work through guidance and assistance using our technical skills. Our News team at BuzzFeed strives to shine light on the most important issues facing us today. Currently, this means a growing focus on the impact of the internet and technology on a range of topics covered by News. As a team of technical experts, we are uniquely equipped to assist in this realm.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F3390%2F1%2A9n31gtXJT7nb9JtxbZs0nw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F3390%2F1%2A9n31gtXJT7nb9JtxbZs0nw.png" alt="Tubeviewer article"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since then, we’ve worked with the newsroom to develop reporting tools, including a project called Tubeviewer, which took us &lt;a href="https://www.buzzfeednews.com/article/carolineodonovan/down-youtubes-recommendation-rabbithole" rel="noopener noreferrer"&gt;down the YouTube rabbit hole&lt;/a&gt;. Tubeviewer simulates the viewing experience of a real person watching YouTube videos from a given set of search terms: turn it on, walk away, and come back to a full picture of the viewing session. Through it, we found that the Up Next algorithm occasionally pushes users toward hyperpartisan videos that include divisive, conspiratorial, and sometimes hateful content. We’ve used tools like this to audit algorithms across tech platforms. We’ve also developed bots that send digests of patent information to Slack or collect data from filing systems like NYSCEF, and we’ve built patterns for doing network analysis through Wireshark, producing visualizations of the data for articles.&lt;/p&gt;
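&lt;p&gt;The core loop of an audit like Tubeviewer can be sketched in a few lines. The sketch below is illustrative, not the real tool: the recommendation graph and video names are stubbed-out hypotheticals, whereas Tubeviewer drives actual YouTube pages.&lt;/p&gt;

```javascript
// Hedged sketch of the Tubeviewer idea: repeatedly follow the top
// "Up Next" recommendation from a seed video and record the path,
// so the resulting chain can be audited afterwards.
function followUpNext(recommendations, seed, steps) {
  const path = [seed];
  let current = seed;
  for (let i = 0; i !== steps; i += 1) {
    const next = (recommendations[current] || [])[0]; // top recommendation
    if (next === undefined) break; // dead end: no further recommendations
    path.push(next);
    current = next;
  }
  return path;
}

// Hypothetical graph: each video maps to its ranked recommendations.
const graph = {
  "news-clip": ["commentary", "music"],
  "commentary": ["hyperpartisan"],
  "hyperpartisan": ["conspiracy"],
};

console.log(followUpNext(graph, "news-clip", 5));
// → ["news-clip", "commentary", "hyperpartisan", "conspiracy"]
```

&lt;p&gt;In the real audit, the walk runs against live pages and the recorded path is what reporters inspect for drift toward extreme content.&lt;/p&gt;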

&lt;p&gt;We’ve received hundreds of requests ranging from assistance with finding data from a website to explaining how authentication tokens are used:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reporter: working on a story about token based authentication vulnerability in a video game that uses google/fb sign on — and looking for someone to comment on the limits/risks of using token based authentication in general&lt;/p&gt;
&lt;/blockquote&gt;
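&lt;p&gt;One concrete limit the engineers could point to: a JSON Web Token is a bearer credential whose payload is merely base64url-encoded, not encrypted — anyone holding the token can read it and replay it until it expires. The snippet below is illustrative only (the token and its claims are fabricated for the example) and assumes Node.js for &lt;code&gt;Buffer&lt;/code&gt;.&lt;/p&gt;

```javascript
// Decode a JWT payload without any signature check, showing that the
// claims inside a token are readable by whoever holds it.
function decodeJwtPayload(jwt) {
  const payload = jwt.split(".")[1]; // header.payload.signature
  const json = Buffer.from(payload, "base64url").toString("utf8");
  return JSON.parse(json);
}

// Fabricated token (header.payload.signature) for illustration only.
const token = [
  Buffer.from(JSON.stringify({ alg: "HS256", typ: "JWT" })).toString("base64url"),
  Buffer.from(JSON.stringify({ sub: "player-42", exp: 1700000000 })).toString("base64url"),
  "fake-signature",
].join(".");

console.log(decodeJwtPayload(token));
// → { sub: 'player-42', exp: 1700000000 }
```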

&lt;p&gt;Sometimes we get requests for help with something that would require a lot of manual labor, were it not for a quick and easy tech solution:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reporter: this is a weird request&lt;br&gt;
that i apologize for&lt;br&gt;
but can someone count the number of stories under this tag&lt;br&gt;
if there is an easy way to do that…&lt;/p&gt;

&lt;p&gt;Engineer: can do with JS, count the number of item-list classes&lt;/p&gt;
&lt;/blockquote&gt;
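&lt;p&gt;The engineer’s suggestion amounts to counting matching elements. In a browser console that is one line — &lt;code&gt;document.querySelectorAll('.item-list').length&lt;/code&gt; — and the same count can be taken from raw page HTML, as in this sketch (the &lt;code&gt;item-list&lt;/code&gt; class name comes from the quoted exchange; the sample markup is hypothetical):&lt;/p&gt;

```javascript
// Count how many elements in a page's HTML carry the "item-list" class,
// using a word-boundary match so "item-listing" etc. don't count.
function countItemList(html) {
  const matches = html.match(/class="[^"]*\bitem-list\b[^"]*"/g);
  return matches ? matches.length : 0;
}

// Stand-in for the tag page's HTML (class attributes only, tags omitted).
const sample = [
  'class="item-list"',
  'class="item-list featured"',
  'class="sidebar"',
].join(" ");

console.log(countItemList(sample)); // → 2
```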

&lt;p&gt;When we created the group, we didn’t know how popular the channel would be, but these requests are a fun break from our day-to-day “real” work. Sometimes the Tech team gets far more invested in the answer to a reporter’s question than the reporter intended. Once, we spent an embarrassing amount of time trying to determine what software Jessica Simpson used to create the graphic for a tweet (loads of EXIF digging later, we still couldn’t figure it out, and the reporter had completely moved on). In the end, we’ve helped contribute to a number of stories and scoops, and in doing so built a bond of trust between reporters and technologists across our organization.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Reporter: I credit this scoop to asking the group here, so thank you&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The mentality of a good engineer is similar to that of a journalist: both need to ask good questions and persistently track down accurate answers, even when the system is buggy. Over the past year, it has been fun and rewarding to build a stronger bridge between the News and Tech teams at BuzzFeed, and we expect this work to continue. Through the collaboration, both teams thrive: members of the Tech team have the privilege of working hand-in-hand with journalists, while the newsroom is able to report on technology with more authority.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Original post was published on &lt;a href="https://tech.buzzfeed.com/tech-and-news-working-group-7dabaaa38e45" rel="noopener noreferrer"&gt;Tech @ BuzzFeed publication on Medium, September 4, 2019&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>journalism</category>
      <category>datajournalism</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
