<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Quansight Labs</title>
    <description>The latest articles on Forem by Quansight Labs (@quansightlabs).</description>
    <link>https://forem.com/quansightlabs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F3674%2F8a1f0537-cedf-48c3-9f01-1a90a0a6a689.jpg</url>
      <title>Forem: Quansight Labs</title>
      <link>https://forem.com/quansightlabs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/quansightlabs"/>
    <language>en</language>
    <item>
      <title>Is GitHub Actions suitable for running benchmarks?</title>
      <dc:creator>Jaime Rodríguez-Guerra</dc:creator>
      <pubDate>Wed, 18 Aug 2021 14:38:43 +0000</pubDate>
      <link>https://forem.com/quansightlabs/is-github-actions-suitable-for-running-benchmarks-111h</link>
      <guid>https://forem.com/quansightlabs/is-github-actions-suitable-for-running-benchmarks-111h</guid>
      <description>&lt;p&gt;Benchmarking software is a tricky business. For robust results, you need dedicated hardware that only runs the benchmarking suite under controlled conditions. No other processes! No OS updates! Nothing else! Even then, you might find out that CPU throttling, thermal regulation and other issues can introduce noise in your measurements.&lt;/p&gt;

&lt;p&gt;So, how are we even trying to do it on a CI provider like GitHub Actions? Every job runs in a separate VM instance with frequent updates and shared resources. It looks like it would just be a very expensive random number generator.&lt;/p&gt;

&lt;p&gt;Well, it turns out that there &lt;em&gt;is&lt;/em&gt; a sensible way to do it: &lt;strong&gt;relative benchmarking&lt;/strong&gt;. And we know it works because we have been collecting stability data points for several weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Good benchmarking on shared resources?
&lt;/h2&gt;

&lt;p&gt;Speed-critical projects use benchmark suites to track performance over time and detect regressions from commit to commit. For these measurements to be useful, they need to be comparable and hence use the exact same conditions for each data point. Guaranteeing this in absolute terms can be a daunting task, but there are ways around it.&lt;/p&gt;

&lt;p&gt;Instead of going through all the complications involved in renting or acquiring dedicated hardware, setting up credentials and monitoring costs, we hoped we could use the same free cloud resources normally used for CI tests. Ideally, GitHub Actions.&lt;/p&gt;

&lt;p&gt;Let's compare the requirements for good benchmarks and the features provided by CI services. That way we can understand how to work around some of the apparent limitations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;A good benchmark suite...&lt;/th&gt;
&lt;th&gt;CI services...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;runs on the same hardware every time&lt;/td&gt;
&lt;td&gt;provide a different (standardized) machine for each run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;runs on dedicated hardware&lt;/td&gt;
&lt;td&gt;run on shared resources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;runs on a frozen OS configuration&lt;/td&gt;
&lt;td&gt;update their VM images often&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;requires renting or acquiring such hardware&lt;/td&gt;
&lt;td&gt;are free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;requires authentication mechanisms&lt;/td&gt;
&lt;td&gt;implement authentication out of the box&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;requires hardware that is exposed to abuse through public PRs  &lt;/td&gt;
&lt;td&gt;are designed for public PRs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;p&gt;Looking at that table, we see that CI services are attractive because of their zero cost and minimal setup, but they fail to provide some of the essential quality factors for reliable benchmarking. That is, if we want to track performance over time and compare absolute measurements directly.&lt;/p&gt;

&lt;p&gt;What if we only want to detect regressions introduced in a PR? Or in the current commit against the last release? We do not need absolute measurements, just a way to &lt;em&gt;compare&lt;/em&gt; two commits in a reliable way. This is exactly what we meant by &lt;em&gt;relative benchmarking&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The idea is to run the benchmarking suite in the same CI job for the two commits we are comparing. This way, we guarantee we are using the exact same machine and configuration for both. By doing this, we are essentially canceling out the background noise and can estimate the performance &lt;em&gt;ratio&lt;/em&gt; of the two commits.&lt;/p&gt;

&lt;p&gt;There are some gotchas, though. Benchmark suites can take a long time to run, and running them twice makes it even worse. This opens an even larger time window for other CI jobs to pollute our measurements with resource-intensive tasks that happen to be running at the same time. To minimize these effects, benchmarking tools run the tests several times on different schedules and apply some statistics to correct for the inevitable deviations. Or, at least, &lt;em&gt;try&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Relative performance measurements with Airspeed Velocity
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;scikit-image&lt;/code&gt;, the project that commissioned this task, uses &lt;a href="https://asv.readthedocs.io/"&gt;Airspeed Velocity&lt;/a&gt;, or &lt;code&gt;asv&lt;/code&gt;, for their benchmark tests.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;asv&lt;/code&gt;'s main feature is being able to track performance measurements over time in a JSON database and generate beautiful dashboards that can be published to a static site server like GitHub Pages. For an example, look at the reports for &lt;code&gt;pandas&lt;/code&gt; at their &lt;a href="https://pandas.pydata.org/speed/pandas/"&gt;speed.pydata.org&lt;/a&gt; website.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;asv&lt;/code&gt; &lt;em&gt;also&lt;/em&gt; has a special subcommand named &lt;code&gt;continuous&lt;/code&gt;, which might provide the functionality we needed. The help message says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Run a side-by-side comparison of two commits for continuous integration.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This sounds like exactly what we are looking for! Let's break down how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When we run &lt;code&gt;asv continuous A B&lt;/code&gt;, &lt;code&gt;asv&lt;/code&gt; will create at least two&lt;sup&gt;†&lt;/sup&gt; virtual environments (one per commit) and install revisions A and B in those, respectively. If the project involves compiled libraries (as is the case with &lt;code&gt;scikit-image&lt;/code&gt;), this can be a lengthy process!&lt;/li&gt;
&lt;li&gt;The default configuration will run the benchmark in two interleaved passes, with several repeats each. In other words, &lt;code&gt;asv&lt;/code&gt; will run the suite four times (&lt;code&gt;A-&amp;gt;B-&amp;gt;A-&amp;gt;B&lt;/code&gt;)! This is done to account for the unavoidable deviations from ideality caused by co-running processes, as mentioned above.&lt;/li&gt;
&lt;li&gt;The statistics for each commit are gathered and a report table is presented. The ratio for each test is computed and, if it is greater than a certain threshold (&lt;code&gt;1.2&lt;/code&gt; by default), an error is emitted.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;sup&gt;†&lt;/sup&gt; &lt;code&gt;asv&lt;/code&gt; supports the notion of a configuration matrix, so you can test your code under different environments; e.g. NumPy versions, Python interpreters, etc.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For the benchmark suite of &lt;code&gt;scikit-image&lt;/code&gt; as of June/July 2021, this ends up taking up to two hours. This raises two questions we will answer in the following sections:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Although we are trying hard to reduce measurement errors, is that enough? Are these measurements reliable?&lt;/li&gt;
&lt;li&gt;Two hours might be too long. Are there any settings we can tune to reduce the runtime without reducing the accuracy of the measurements?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Are CI services reliable enough for benchmarking?
&lt;/h2&gt;

&lt;p&gt;We saw above that CI services are not designed for this kind of task, but with some workarounds they might be good enough. However, how can we be sure? As data-driven scientists, we say let's run an experiment!&lt;/p&gt;

&lt;h3&gt;
  
  
  The setup
&lt;/h3&gt;

&lt;p&gt;If our experiment is data-driven, we should generate some data first. This is our strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We will benchmark and compare two commits that are exactly the same in terms of tested code.&lt;/li&gt;
&lt;li&gt;Under ideal conditions, we should see that the performance ratio between the two commits is &lt;code&gt;1.0&lt;/code&gt;. In other words, performance should be the same. Of course, these are not ideal conditions, so some kind of error is expected. We just want it to stay reliably under an acceptable threshold.&lt;/li&gt;
&lt;li&gt;We will implement a GitHub Actions workflow that will run every six hours (four times a day) and collect results for a week or more. This will help us do two things:

&lt;ul&gt;
&lt;li&gt;Accumulate sufficient data points to answer our question.&lt;/li&gt;
&lt;li&gt;Account for time factors like different days of the week (e.g. weekday vs weekend), or the time of the day (3am vs 3pm).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;The workflow will upload the benchmark results as artifacts we can download and process locally with a Jupyter Notebook.&lt;/li&gt;
&lt;/ul&gt;
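&lt;p&gt;Before looking at the real data, the intuition behind this strategy can be checked with a toy simulation: "measure" identical code twice under random noise and inspect the resulting ratios. Everything below, the 5% noise level included, is an arbitrary illustration and not part of the actual experiment:&lt;/p&gt;

```python
# Toy simulation: two timings of the same code, each perturbed by up to
# 5% of background noise, should give ratios clustered around 1.0.
import random

random.seed(0)

def noisy_timing(true_time=1.0, noise=0.05):
    # One measurement of the same code, perturbed by co-running processes.
    return true_time * (1 + random.uniform(-noise, noise))

ratios = [noisy_timing() / noisy_timing() for _ in range(500)]
mean_ratio = sum(ratios) / len(ratios)
spread = max(ratios) - min(ratios)
print(f"mean ratio: {mean_ratio:.3f}, spread: {spread:.3f}")
```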

&lt;blockquote&gt;
&lt;p&gt;Check the &lt;a href="https://github.com/jaimergp/scikit-image/blob/main/.github/workflows/benchmarks-cron.yml"&gt;GitHub Actions workflow in this fork&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The results
&lt;/h3&gt;

&lt;p&gt;To measure the stability of the different benchmark runs, we will be looking at the performance ratio of the measurements between the two commits. Since they are identical in terms of tested code, ideally they should all be &lt;code&gt;1.0&lt;/code&gt;. We know this will not happen, but maybe the errors are not that big and stay that way regardless of the time of day or the day of the week.&lt;/p&gt;

&lt;p&gt;After collecting data points for 16 days, these are the results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZgnOXqKz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t3emvt00206nqxp42xyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZgnOXqKz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t3emvt00206nqxp42xyt.png" alt="Reliability of benchmarks in GitHub Actions. This 2D plot shows a 16-day timeseries in the X axis. Each data point in the X axis corresponds to a cloud of 75 measurements (one per benchmark test). The y-axis spread of each cloud corresponds to the performance ratio. Ideal measurements would have a performance ratio of 1.0, since both runs returned the exact same performance. In practice this does not happen and we can observe ratios between 0.5 and 1.5. This plot shows that while there is an observable y-spread, it is small enough to be considered sensitive to performance regressions of more than 50%."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average time taken: 1h55min&lt;/li&gt;
&lt;li&gt;Minimum and maximum ratios observed: 0.51, 1.36&lt;/li&gt;
&lt;li&gt;Mean and standard deviation of ratios: 1.00, 0.05&lt;/li&gt;
&lt;li&gt;Proportion of false positives: 4/108 = 3.7%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the X axis you can see the different runs, sorted by date and time. Days of the week are grouped with colored patches for easier visual analysis. On the Y axis, we plot the performance ratio. Each of those vertical clouds includes 75 points, one per benchmark test. Ideally, they should all fall at &lt;code&gt;y=1&lt;/code&gt;. Of course, not all of them are there, but a surprising number of them are!&lt;/p&gt;

&lt;p&gt;But! We do see some bigger deviations at certain times! How is that possible? Well, that's the error we were talking about! These outliers are the ones that would concern us because they can be interpreted as &lt;strong&gt;false positives&lt;/strong&gt;. In other words: &lt;code&gt;asv&lt;/code&gt; would report a performance regression, when in fact there's none. However, in the observed measurements, the outliers were always within &lt;code&gt;y ∈ (0.5, 1.4)&lt;/code&gt;. That means we can affirm that the method is sensitive enough to detect performance regressions of 50% or more! This is good enough for our project and, in fact, some projects might even be happy with a threshold of &lt;code&gt;2.0&lt;/code&gt;.&lt;/p&gt;
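&lt;p&gt;To make the false-positive numbers above concrete, here is one way such a proportion can be computed. With identical commits, any run whose ratio falls outside the threshold band counts as a false positive; the ratios below are invented for illustration, and the &lt;code&gt;1.2&lt;/code&gt; factor is &lt;code&gt;asv&lt;/code&gt;'s default:&lt;/p&gt;

```python
# Count ratios that stray outside the [1/factor, factor] band. Since both
# commits run the same code, every flagged ratio is a false positive here.
factor = 1.2
ratios = [1.00, 0.97, 1.36, 1.03, 0.51, 1.05, 0.99, 1.24]
false_positives = [r for r in ratios if r > factor or 1 / r > factor]
print(f"{len(false_positives)}/{len(ratios)} = {len(false_positives) / len(ratios):.1%}")
# → 3/8 = 37.5%
```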

&lt;blockquote&gt;
&lt;p&gt;If you are curious about how we automatically downloaded the artifacts, parsed the output, and plotted the performance ratios, check the Jupyter Notebook &lt;a href="https://gist.github.com/jaimergp/aa4f059c14e394c4089b320cb8b51b1a"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Can we make it run faster without losing accuracy?
&lt;/h2&gt;

&lt;p&gt;We just found out that this approach is sensitive enough &lt;em&gt;but&lt;/em&gt; it takes two hours to run. Is it possible to reduce that time without sacrificing sensitivity? We need to remember that &lt;code&gt;asv&lt;/code&gt; is running several passes and repeats to reduce the measurement error, but maybe some of those default counter-measures are not needed. Namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The benchmark runs with &lt;code&gt;--interleave-processes&lt;/code&gt;, but it can be disabled with &lt;code&gt;--no-interleave-processes&lt;/code&gt;. The help message for this flag says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Interleave benchmarks with multiple processes across commits. This can avoid measurement biases from commit ordering, can take longer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;How much longer? Does it help keep error under control? We should measure that.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;By default, all tests are run several times, on different schedules. There are two &lt;a href="https://asv.readthedocs.io/en/stable/benchmarks.html#timing-benchmarks"&gt;benchmark attributes&lt;/a&gt; that govern these settings: &lt;code&gt;processes&lt;/code&gt; and &lt;code&gt;repeat&lt;/code&gt;. &lt;code&gt;processes&lt;/code&gt; defaults to &lt;code&gt;2&lt;/code&gt;, which means that the full suite will be run twice per commit. If we only do one pass (&lt;code&gt;processes=1&lt;/code&gt;), we will cut the running time in half, but will we lose too much accuracy?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
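&lt;p&gt;These attributes live on the benchmarks themselves. A hypothetical &lt;code&gt;asv&lt;/code&gt; timing benchmark overriding them could look like this (the suite and the workload are placeholders, not taken from the &lt;code&gt;scikit-image&lt;/code&gt; suite):&lt;/p&gt;

```python
# benchmarks/bench_example.py -- hypothetical asv timing benchmark.
# `processes` and `repeat` are the asv attributes discussed above.

class SortSuite:
    processes = 1  # one pass per commit instead of the default two
    repeat = 10    # timing repeats within each pass

    def setup(self):
        # Runs before each timing and is excluded from the measurement.
        self.data = list(range(10_000, 0, -1))

    def time_sorted(self):
        sorted(self.data)
```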

&lt;p&gt;To answer both questions, we added more entries to the data collection workflow shown above by &lt;a href="https://github.com/jaimergp/scikit-image/blob/e561996/.github/workflows/benchmarks-cron.yml#L14-L23"&gt;parameterizing the &lt;code&gt;asv&lt;/code&gt; command-line options&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These are the results!&lt;/p&gt;

&lt;h3&gt;
  
  
  No process interleaving
&lt;/h3&gt;

&lt;p&gt;Disabling process interleaving should be faster and maybe the accuracy loss is not that bad. But... how bad? Here are the results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cqDgtK9c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z2uurdca0w25koysu680.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cqDgtK9c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z2uurdca0w25koysu680.png" alt="Reliability of benchmarks in GitHub Actions, no interleaving. With no interleaving, the vertical spread is more evident, with several clouds spreading beyond the desired interval. One particular outlier happened on the first Saturday, with half the cloud below 0.75. This 2D plot shows a 16-day timeseries in the X axis. Each data point in the X axis corresponds to a cloud of 75 measurements (one per benchmark test). The y-axis spread of each cloud corresponds to the performance ratio. Ideal measurements would have a performance ratio of 1.0, since both runs returned the exact same performance. In practice this does not happen."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average time taken: 1h39min&lt;/li&gt;
&lt;li&gt;Minimum and maximum ratios observed: 0.43, 1.50&lt;/li&gt;
&lt;li&gt;Mean and standard deviation of ratios: 0.99, 0.07&lt;/li&gt;
&lt;li&gt;Proportion of false positives: 6/66 = 9.1%&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Single-pass with &lt;code&gt;processes=1&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;In this configuration, we expect a drastic 50% running time reduction, since we will only do one pass per commit, instead of two. However, the accuracy loss might be too dramatic... Let's see!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AJioAvtU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1h5fo00b2wbh8smozloy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AJioAvtU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1h5fo00b2wbh8smozloy.png" alt="Reliability of benchmarks in GitHub Actions, single pass. With a single process, the y-spread is significantly wider. Some data points can be observed even beyond a ratio of 2.5, and the overall visual spread is larger; i.e. the clouds are taller than in the ideal case of having a small, contained cloud at 1.0. This 2D plot shows a 16-day timeseries in the X axis. Each data point in the X axis corresponds to a cloud of 75 measurements (one per benchmark test). The y-axis spread of each cloud corresponds to the performance ratio. Ideal measurements would have a performance ratio of 1.0, since both runs returned the exact same performance. In practice this does not happen."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average time taken: 1h7min&lt;/li&gt;
&lt;li&gt;Minimum and maximum ratios observed: 0.51, 2.76&lt;/li&gt;
&lt;li&gt;Mean and standard deviation of ratios: 1.01, 0.07&lt;/li&gt;
&lt;li&gt;Proportion of false positives: 8/64 = 12.5%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At first sight, those clouds look very spread out! The number of false positives is also larger.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary of the strategies
&lt;/h3&gt;

&lt;p&gt;Let's take a look at the three strategies now. The main columns are &lt;strong&gt;runtime&lt;/strong&gt; and &lt;strong&gt;%FP&lt;/strong&gt; (percentage of false positives). We want the smallest %FP at the shortest runtime.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Runtime   &lt;/th&gt;
&lt;th&gt;%FP&lt;/th&gt;
&lt;th&gt;Min&lt;/th&gt;
&lt;th&gt;Max&lt;/th&gt;
&lt;th&gt;Mean  &lt;/th&gt;
&lt;th&gt;Std&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Default&lt;/td&gt;
&lt;td&gt;1h55&lt;/td&gt;
&lt;td&gt;3.7&lt;/td&gt;
&lt;td&gt;0.51  &lt;/td&gt;
&lt;td&gt;1.36&lt;/td&gt;
&lt;td&gt;1.00&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No interleaving  &lt;/td&gt;
&lt;td&gt;1h39&lt;/td&gt;
&lt;td&gt;9.1  &lt;/td&gt;
&lt;td&gt;0.43&lt;/td&gt;
&lt;td&gt;1.50&lt;/td&gt;
&lt;td&gt;0.99&lt;/td&gt;
&lt;td&gt;0.07&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single pass&lt;/td&gt;
&lt;td&gt;1h07&lt;/td&gt;
&lt;td&gt;12.5&lt;/td&gt;
&lt;td&gt;0.51&lt;/td&gt;
&lt;td&gt;2.76  &lt;/td&gt;
&lt;td&gt;1.01&lt;/td&gt;
&lt;td&gt;0.07&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;p&gt;Unsurprisingly, the default strategy (two passes, interleaved) is the most accurate, but also the most time consuming. Disabling process interleaving reduces the average runtime by 16 minutes, but the false positives more than doubled! Using a single pass brought the runtime down to about an hour, but more than tripled the false positives.&lt;/p&gt;

&lt;p&gt;In short, we will stick to the default strategy for accuracy but need to investigate other ways to make it run in less time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We also considered running several single-pass replicas in parallel using different GitHub Actions jobs. In theory, a false positive could be spotted by comparing the values of the failing tests in the other replicas. Only true positives would reliably appear in all replicas. However, this consumes more CI resources (compilation happens several times) and is noisier from the maintainer perspective, who would need to check all replicas. Not to mention that more replicas increase the chances for more false positives!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Speeding up compilation times
&lt;/h2&gt;

&lt;p&gt;So far we have only looked at speeding up the benchmark suite itself, but we saw earlier that &lt;code&gt;asv&lt;/code&gt; will also spend some time setting up virtual environments and installing the project. Since installing &lt;code&gt;scikit-image&lt;/code&gt; involves compiling some extensions, this can add up to a non-trivial amount of time.&lt;/p&gt;

&lt;p&gt;To accelerate the creation of the virtual environments, which in our case uses &lt;code&gt;conda&lt;/code&gt;, we replaced the &lt;code&gt;conda&lt;/code&gt; calls with a faster implementation called &lt;code&gt;mamba&lt;/code&gt;. We rely on an &lt;code&gt;asv&lt;/code&gt; implementation detail: to find &lt;code&gt;conda&lt;/code&gt;, &lt;code&gt;asv&lt;/code&gt; will first check the value of the &lt;code&gt;CONDA_EXE&lt;/code&gt; environment variable. This is normally set by &lt;code&gt;conda activate &amp;lt;env&amp;gt;&lt;/code&gt;, but we &lt;a href="https://github.com/scikit-image/scikit-image/blob/main/.github/workflows/benchmarks.yml#L77"&gt;overwrite it with the path to &lt;code&gt;mamba&lt;/code&gt;&lt;/a&gt; to have &lt;code&gt;asv&lt;/code&gt; use it instead.&lt;/p&gt;
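&lt;p&gt;A rough sketch of the lookup this trick relies on: prefer &lt;code&gt;CONDA_EXE&lt;/code&gt; when it is set, and fall back to whatever &lt;code&gt;conda&lt;/code&gt; is on &lt;code&gt;PATH&lt;/code&gt; otherwise. The &lt;code&gt;mamba&lt;/code&gt; path below is hypothetical, and the function is a simplification of the behaviour described above, not &lt;code&gt;asv&lt;/code&gt;'s actual code:&lt;/p&gt;

```python
# Simplified model of the CONDA_EXE override described above: whatever the
# variable points at gets used as the "conda" binary. Path is hypothetical.
import os
import shutil

def find_conda_executable():
    return os.environ.get("CONDA_EXE") or shutil.which("conda")

os.environ["CONDA_EXE"] = "/usr/share/miniconda/bin/mamba"
print(find_conda_executable())  # → /usr/share/miniconda/bin/mamba
```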

&lt;p&gt;The second optimization leverages a compiler cache. Since most of the modules will not change in a given PR, we can use &lt;code&gt;ccache&lt;/code&gt; to keep the unchanged build artifacts around. Check the &lt;a href="https://github.com/scikit-image/scikit-image/blob/main/.github/workflows/benchmarks.yml#L36-L56"&gt;workflow file&lt;/a&gt; to see how it can be implemented on GitHub Actions.&lt;/p&gt;

&lt;p&gt;These two changes together brought the average running time down to around 1h 20min. Not bad!&lt;/p&gt;

&lt;h2&gt;
  
  
  Run it on demand!
&lt;/h2&gt;

&lt;p&gt;Benchmarks do not need to run for every single commit pushed to a PR. In fact, that's probably a waste of resources. Instead, we'd like the maintainers to run the benchmarks on demand, whenever needed during the PR lifetime.&lt;/p&gt;

&lt;p&gt;GitHub Actions offers different kinds of events that will trigger a workflow. By default, the &lt;code&gt;on.pull_request&lt;/code&gt; trigger is configured to act on three event types: &lt;code&gt;[opened, reopened, synchronize]&lt;/code&gt;. In practice, this means after every push or after closing and reopening the PR. However, there are &lt;a href="https://docs.github.com/en/actions/reference/events-that-trigger-workflows#pull_request"&gt;more triggers&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Of all the choices, we settled on &lt;code&gt;labeled&lt;/code&gt;. This means that the workflow will be triggered whenever the PR is tagged with a label. To specify which label(s) can trigger the workflow, you can use an &lt;code&gt;if&lt;/code&gt; clause at the &lt;code&gt;job&lt;/code&gt; level, &lt;a href="https://github.com/scikit-image/scikit-image/blob/2fa66e5/.github/workflows/benchmarks.yml#L10"&gt;like this&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Benchmark&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;labeled&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;benchmark&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.event.label.name == 'run-benchmark' &amp;amp;&amp;amp; github.event_name == 'pull_request' }}&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Linux&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-20.04&lt;/span&gt;
    &lt;span class="s"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, the workflow will only be triggered if the label is named &lt;code&gt;run-benchmark&lt;/code&gt;. This works surprisingly well as a manual trigger! It is also restricted to authorized users (triaging permissions or above), so there is no need to fiddle with authentication tokens or similar complications.&lt;/p&gt;

&lt;p&gt;There is one gotcha, though: the checks UI will be appended to the last push event panel, which can be confusing if more commits have been added to the PR since the label was added. Ideally, the checks UI panel would be added next to the &lt;em&gt;@user added the run-benchmark label&lt;/em&gt; message. Maybe in a &lt;a href="https://twitter.com/jaime_rgp/status/1412419232340627467"&gt;future update&lt;/a&gt;?&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;We have seen that GitHub Actions is indeed good enough to perform relative benchmarking in a reliable way as long as several passes are averaged together (default behaviour in &lt;code&gt;asv continuous&lt;/code&gt;). This takes a bit more time but you can speed it up a bit with &lt;code&gt;mamba&lt;/code&gt; and &lt;code&gt;ccache&lt;/code&gt; for compiled libraries. Even in that case, it is probably overkill to run it for every push event, so we are using the &lt;code&gt;on.pull_request.labeled&lt;/code&gt; trigger to let the maintainers decide when to do it on demand.&lt;/p&gt;




&lt;h2&gt;
  
  
  Useful references
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/scikit-image/scikit-image/pull/5424"&gt;scikit-image/scikit-image #5424&lt;/a&gt;: The PR where
all this was implemented.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://gist.github.com/jaimergp/aa4f059c14e394c4089b320cb8b51b1a"&gt;Analysis notebook&lt;/a&gt;: The code used
to analyze the benchmarking data.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://asv.readthedocs.io"&gt;asv.readthedocs.io&lt;/a&gt;: The official documentation for &lt;code&gt;asv&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://wolfv.medium.com/building-an-open-source-continuous-benchmark-system-717839093962"&gt;Building an Open Source, Continuous Benchmark System&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ursalabs.org/blog/announcing-conbench/"&gt;Conbench: Language-independent Continuous Benchmarking Tool, by Ursa Labs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://speed.pypy.org/"&gt;PyPy Speed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://pandas.pydata.org/speed/pandas/"&gt;Pandas Speed&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Acknowledgements
&lt;/h2&gt;

&lt;p&gt;Thanks Gregory Lee, Stéfan van der Walt and Juan Nunez-Iglesias for their enthusiastic and useful feedback in the PR! The plots look that pretty thanks to comments provided by John Lee. Gregory Lee, Gonzalo Peña-Castellanos and Isabela Presedo-Floyd provided super valuable comments and suggestions for this post.&lt;/p&gt;

&lt;p&gt;This work was funded by the Chan Zuckerberg Initiative (CZI) as part of an &lt;a href="https://chanzuckerberg.com/eoss/proposals/gpu-acceleration-rapid-releases-and-biomedical-examples-for-scikit-image/"&gt;Essential Open Source Software for Science&lt;/a&gt; grant.&lt;/p&gt;

</description>
      <category>python</category>
      <category>benchmarking</category>
      <category>github</category>
      <category>ci</category>
    </item>
    <item>
      <title>Putting Out the Fire: Where Do We Start With Accessibility in JupyterLab?</title>
      <dc:creator>Isabela Presedo-Floyd</dc:creator>
      <pubDate>Thu, 27 May 2021 22:11:18 +0000</pubDate>
      <link>https://forem.com/quansightlabs/putting-out-the-fire-where-do-we-start-with-accessibility-in-jupyterlab-25od</link>
      <guid>https://forem.com/quansightlabs/putting-out-the-fire-where-do-we-start-with-accessibility-in-jupyterlab-25od</guid>
      <description>&lt;p&gt;I want to be honest with you, I started asking accessibility questions in JupyterLab spaces while filled with anxiety. Anxiety that I was shouting into the void and no one else would work on accessibility with me. Anxiety that I didn’t have the skills or energy or knowledge to back up what I wanted to do. Anxiety that I was going to do it wrong and make JupyterLab even more inaccessible. Sometimes I still feel that way.&lt;/p&gt;

&lt;p&gt;Here’s the thing. That anxiety, while real and worth acknowledging, doesn’t help the disabled people we constantly fail and exclude when we keep building things inaccessibly. So yes, I want you to know I felt that way and that you might too, but I also want you to remember who we are here for, especially if you are working to support a group you aren’t a part of (as is the case with me). Plus, many of these concerns didn’t end up happening. First, I didn’t end up being alone at all! Each of the people that have joined in have different skills that have helped us tackle issues that I don’t think any of us would’ve been able to on our own. Knowing we are working together also helps keep me accountable because I’d like to be able to show up to our meetings with something to share. As for worrying that I’m doing it all wrong, I suppose that’s still a possibility. Speaking for myself, I’d rather be making mistakes, learning, and iterating than continue to let JupyterLab stay inaccessible indefinitely.&lt;/p&gt;

&lt;p&gt;In a space where considering the needs of disabled people isn’t the standard, accessibility might feel like an insurmountable challenge. For example, when I showed up to our first devoted accessibility meeting, JupyterLab’s accessibility status was like a hazy shape in the distance. I was pretty sure it wasn’t good, but I didn’t know for sure how or why. A few meetings later and a closer look made me realize that haze was actually smoke and I'd walked myself and others directly into a (metaphorical) burning building. But just because it felt like everything was chaos without a good place to start didn't mean that was the truth. In fact, it wasn't. Building software is more about people than about any tool, so let’s consider what our regular team of people on the user-contributor-maintainer spectrum said are the basics of what they care about in JupyterLab.&lt;/p&gt;

&lt;p&gt;Users want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use JupyterLab to read or navigate documents.&lt;/li&gt;
&lt;li&gt;Use JupyterLab to edit and run documents. To edit a document, users need to be able to navigate where they want to edit, so the read-only experience is a prerequisite.&lt;/li&gt;
&lt;li&gt;Know what things they can do in JupyterLab and get help on how to do it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Contributors want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gain enough understanding of JupyterLab to work with it.&lt;/li&gt;
&lt;li&gt;Understand the expectations of their contributions and how to meet them. In this case, they would want to know that they need to think about accessibility and how to consider that.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Maintainers want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure that JupyterLab is both progressing and relatively stable.&lt;/li&gt;
&lt;li&gt;Promote sustainable growth for a project that doesn’t overwrite past efforts. Automation can be helpful because maintainers are usually strapped for time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With the support of a team member with prior experience auditing for accessibility, we pinpointed &lt;a href="https://github.com/jupyterlab/jupyterlab/issues/9399"&gt;specific ways&lt;/a&gt; in which JupyterLab lacked support for accessibility broken up by &lt;a href="https://www.w3.org/TR/WCAG21/"&gt;WCAG 2.1&lt;/a&gt; standards.&lt;/p&gt;

&lt;p&gt;From conversations with these more experienced community members, we found that the issues generally broke down into four categories of work needed (not necessarily in this order):&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Make JupyterLab accessible for a read-only type experience
&lt;/h2&gt;

&lt;p&gt;This is something users need. For our purposes, we’re using read-only to describe what you need to navigate and consume all the content in JupyterLab, from the interface to the documents and processes that live in it. Most of this also falls under WCAG standards, and it covers the first things users need to start working with JupyterLab, since it’s difficult to interact with a space if you can’t get where you want to go.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Make JupyterLab accessible for an interacting/editing experience
&lt;/h2&gt;

&lt;p&gt;This is something users need and is the other half of WCAG standards. Once you can navigate the space, people need to interact by writing, editing, running processes, and so on. While WCAG standards do cover interactive web experiences and are written generally enough to apply to many interface types, their roots in a more standard website experience mean that we also have some grey areas to account for, since JupyterLab can easily include more complex and layered interactions than other web apps. We are supporting this by looking into how other tools with similar uses (like coding) approach these types of accessibility, and we hope to test it in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Accessibility documentation
&lt;/h2&gt;

&lt;p&gt;This is something users and contributors need, and it has two parts. One part is making the documentation itself accessible through WCAG compliance in the docs theme, labeling relevant content, and providing content in different forms. The other is adding documentation specifically for accessibility, such as how to use accessibility features and how accessibility fits into our contribution process.&lt;/p&gt;

&lt;p&gt;Accessibility and documentation both have reputations for falling by the wayside, and we almost got so caught up in applying WCAG standards to the software itself that we continued the pattern. But making an accessible experience is, like any UX problem, not limited to the time spent within that product. Think of it this way: if there is no accessible documentation on how to get started with JupyterLab and use relevant accessibility support, then all the work we’ve done in the software itself won’t be able to serve the very people it is there for.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Adding relevant accessibility tests to the JupyterLab contributing workflow
&lt;/h2&gt;

&lt;p&gt;This is something contributors and maintainers need, though the results also benefit users. As grateful as I am to have a group of people who are taking action to make JupyterLab accessible, it isn’t enough on its own. We aren’t a group that can review every single PR and we may not all be able to devote time to this forever; tests ensure that accessibility remains a priority in the contributing workflow regardless of individual involvement. It also will help prevent current efforts from being overwritten by new contributions.&lt;/p&gt;

&lt;p&gt;Automated accessibility &lt;a href="https://www.w3.org/WAI/test-evaluate/tools/selecting/"&gt;testing has its limits&lt;/a&gt; because you are trying to quantify an experience without getting users involved, but I think it is critical both as a first pass and as a reminder to the community—especially the contributing community—that accessibility is something we are all responsible for. Since accessibility isn’t yet a regular standard for contributions in many projects, feedback from tests might also be an opportunity for people who haven’t worked with accessibility before to start learning more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where we are now
&lt;/h2&gt;

&lt;p&gt;As I’m writing this post, our team is mostly focused on JupyterLab accessibility for WCAG compliance, starting with the read-only type experience. Among many things, JupyterLab is currently missing &lt;a href="https://accessibility.18f.gov/landmarks/"&gt;landmarks&lt;/a&gt; and &lt;a href="https://webaim.org/articles/label-name/"&gt;labels&lt;/a&gt;, which blocks manual accessibility testing to a degree since their absence prevents further navigation and interaction. Starting here means that we are a step closer to users being able to accessibly read content in the interface.&lt;/p&gt;

&lt;p&gt;If you are going to take away one thing from my journey so far, I’d tell you to be consistently brave. Feeling anxious in the face of challenges and accepting areas where you don’t yet have knowledge is normal, but it isn’t a reason to back down. Find the people who will collaborate with you and dive in. When I get lost and don’t know what to do, I find it most helpful to put people first and remember who I am doing this for. Breaking the work into pieces by what users need can help you strategically start putting out fires.&lt;/p&gt;

&lt;p&gt;Focusing on people isn’t just strategy, though. Be on the lookout for my next blog post, where I’ll talk about the disconnect between what accessibility meant to different people in our community and how that shaped the time and way we’ve solved issues in JupyterLab so far.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Interested in getting involved? Join our community via the JupyterLab accessibility meetings listed every other week on the &lt;a href="https://jupyter.readthedocs.io/en/latest/community/content-community.html#jupyter-community-meetings"&gt;Jupyter community calendar&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>a11y</category>
      <category>jupyter</category>
      <category>inclusion</category>
    </item>
    <item>
      <title>Rethinking Jupyter Interactive Documentation</title>
      <dc:creator>Matthias Bussonnier</dc:creator>
      <pubDate>Fri, 07 May 2021 19:45:38 +0000</pubDate>
      <link>https://forem.com/quansightlabs/rethinking-jupyter-interactive-documentation-4okm</link>
      <guid>https://forem.com/quansightlabs/rethinking-jupyter-interactive-documentation-4okm</guid>
<description>&lt;p&gt;Jupyter Notebook’s first release was 8 years ago – under the IPython Notebook name at the time. Even if notebooks were not invented by Jupyter, they were definitely democratized by it. Being web-powered enabled many changes in the data science world. Objects now often expose rich representations: from pandas DataFrames rendered as HTML tables to the more recent &lt;a href="https://github.com/scikit-learn/scikit-learn/pull/14180" rel="noopener noreferrer"&gt;Scikit-learn models&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Today I want to look into a topic that has not evolved much since, and that I believe could use an upgrade: accessing interactive documentation in a Jupyter session, and what it could become. At the end I'll link to my current prototype, if you are adventurous.&lt;/p&gt;
&lt;h1&gt;
  
  
  The current limitation for users
&lt;/h1&gt;

&lt;p&gt;The current documentation of IPython and Jupyter comes in a few forms, but mostly has the same limitation. The typical way to reach for help is the &lt;code&gt;?&lt;/code&gt; operator. Depending on the frontend you are using, it will bring up a pager or a panel that displays some information about the current object.&lt;/p&gt;

&lt;p&gt;It can show some information about the current object (signature, file, sub/super classes) and the raw DocString of the object.&lt;/p&gt;

&lt;p&gt;You can scroll around, but that's about it, whether in the terminal or in notebooks.&lt;/p&gt;
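&lt;p&gt;To make the limitation concrete, here is a minimal sketch – not IPython's actual implementation – of roughly what &lt;code&gt;?&lt;/code&gt; has to work with at runtime: a signature from the &lt;code&gt;inspect&lt;/code&gt; module and the raw, unrendered docstring. The &lt;code&gt;summarize&lt;/code&gt; helper and the &lt;code&gt;linspace_like&lt;/code&gt; function are hypothetical.&lt;/p&gt;

```python
import inspect

def summarize(obj):
    """Collect roughly the fields shown by ``obj?``: signature and raw docstring."""
    try:
        sig = str(inspect.signature(obj))
    except (TypeError, ValueError):
        sig = None  # builtins and some C-level callables expose no signature
    return {"signature": sig, "docstring": inspect.getdoc(obj) or "<no docstring>"}

def linspace_like(start, stop, num=50):
    """Return ``num`` evenly spaced samples between ``start`` and ``stop``."""

# All that is available is a flat signature string and plain, unrendered text:
# no links, no rendered math, no navigation.
print(summarize(linspace_like))
```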

&lt;p&gt;Compare it to the same documentation on the NumPy website.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flabs.quansight.org%2Fimages%2F2021%2F05%2Fnumpy-linspace-compare.png" class="article-body-image-wrapper"&gt;&lt;img alt="numpy.linspace on numpy.org" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flabs.quansight.org%2Fimages%2F2021%2F05%2Fnumpy-linspace-compare.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the left is the documentation for NumPy when visiting &lt;a href="https://numpy.org" rel="noopener noreferrer"&gt;the NumPy website&lt;/a&gt;. Let's call that "rendered documentation". On the right is what you get in JupyterLab or in the IPython or regular Python REPL; let's call that "help documentation", since it is typically reached via &lt;code&gt;identifier?&lt;/code&gt; or &lt;code&gt;help(identifier)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Compared to rendered documentation, the help documentation is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hard to read,&lt;/li&gt;
&lt;li&gt;Lacking navigation,&lt;/li&gt;
&lt;li&gt;Missing interpretation of RST directives,&lt;/li&gt;
&lt;li&gt;Missing inline graphs and rendered math.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is also no access to non-docstring based documentation, &lt;strong&gt;no narrative&lt;/strong&gt;, &lt;strong&gt;no tutorials&lt;/strong&gt;, &lt;strong&gt;no image gallery or examples&lt;/strong&gt;, no search, no syntax highlighting, no way to interact or modify documentation to test effects of parameters.&lt;/p&gt;
&lt;h1&gt;
  
  
  Limitation for authors
&lt;/h1&gt;

&lt;p&gt;Because of these display limitations in Jupyter and IPython, I believe authors are often constrained in how they document functions.&lt;/p&gt;

&lt;p&gt;Syntax in docstrings is often kept simple for readability, so this first version is often preferred:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You can use ``np.einsum('i-&amp;gt;', a)`` ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The longer form, which turns the reference into a link in the rendered documentation, is difficult to read when shown as help documentation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You can use :py:func:`np.einsum('i-&amp;gt;', a) &amp;lt;numpy.einsum&amp;gt;` ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This also leads to long discussions about which syntax to use in advanced areas, like formulas in &lt;a href="https://github.com/sympy/sympy/issues/14964" rel="noopener noreferrer"&gt;Sympy's docstrings&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Many projects have to implement dynamic docstrings; for example to include all the parameters a function or class would pass down using &lt;code&gt;**kwargs&lt;/code&gt; (search the matplotlib source code for &lt;code&gt;_kwdoc&lt;/code&gt; for example, or look at the &lt;code&gt;pandas.DataFrame&lt;/code&gt; implementation).&lt;/p&gt;

&lt;p&gt;This can make it relatively difficult for authors and contributors to properly maintain and provide comprehensive docs.&lt;/p&gt;
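&lt;p&gt;As an illustration of the dynamic-docstring pattern mentioned above – the names here are hypothetical, not matplotlib's or pandas' actual helpers – a project might interpolate a shared block of parameter documentation into several docstrings at import time:&lt;/p&gt;

```python
# Hypothetical sketch of the dynamic-docstring pattern: a shared block of
# parameter descriptions is written once and substituted into each docstring.
_COMMON_PARAMS = """x : array_like
        Input values.
    axis : int, optional
        Axis along which to operate."""

def doc_substitute(func):
    """Interpolate the shared parameter docs into ``func``'s docstring."""
    func.__doc__ = func.__doc__.format(params=_COMMON_PARAMS)
    return func

@doc_substitute
def mean(x, axis=None):
    """Compute the mean.

    Parameters
    ----------
    {params}
    """

@doc_substitute
def total(x, axis=None):
    """Compute the sum.

    Parameters
    ----------
    {params}
    """
```

Help viewers that read the final `__doc__` see the substituted text, but anything that reads the source directly never does, which is part of what makes such documentation hard to maintain.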

&lt;p&gt;I'm not sure I can completely predict all the side effects this has on how library maintainers write docs, but I believe there is also a strong opportunity for a tool to help there. See for example &lt;a href="https://github.com/Carreau/velin" rel="noopener noreferrer"&gt;vélin&lt;/a&gt;, which attempts to automatically reformat and fix common NumPyDoc format mistakes and typos – but that's a subject for a future post.&lt;/p&gt;

&lt;h1&gt;
  
  
Stuck between a rock and a hard place
&lt;/h1&gt;

&lt;p&gt;While Sphinx and related projects are great at offering hosted HTML documentation, extensive usage of those makes interactive documentation harder to consume.&lt;/p&gt;

&lt;p&gt;While it is possible to &lt;a href="https://github.com/spyder-ide/docrepr" rel="noopener noreferrer"&gt;run Sphinx on the fly when rendering docstrings&lt;/a&gt;, most Sphinx features only work when building a full project, with the proper configuration and extensions, and can be computationally intensive. This often makes running Sphinx locally impractical.&lt;/p&gt;

&lt;p&gt;Hosted websites often do not reflect the locally installed versions of the libraries, and they require careful linking, deprecation notes, and narrative around platform- or version-specific features.&lt;/p&gt;

&lt;h1&gt;
  
  
  This is fixable
&lt;/h1&gt;

&lt;p&gt;For the past few months I've been working on rewriting how IPython (and hence Jupyter) can display documentation. It works both in terminal (IPython) and browser context (notebook, JupyterLab, Spyder) with proper rendering, and currently understands most directives; it could be customized to understand any new ones:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flabs.quansight.org%2Fimages%2F2021%2F05%2Fpapyri-1.png" class="article-body-image-wrapper"&gt;&lt;img alt="papyri1" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flabs.quansight.org%2Fimages%2F2021%2F05%2Fpapyri-1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Above is the (terminal) documentation of &lt;code&gt;scipy.polynomial.lagfit&lt;/code&gt;. See how the single backticks are properly understood and refer to known parameters; it detected that &lt;code&gt;`n`&lt;/code&gt; is incorrect, as it should have double backticks. Notice the rendering of the math, even in the terminal.&lt;/p&gt;

&lt;p&gt;Technically, this does not care whether the docstring is written in RST or Markdown, though I still need to implement the Markdown part. I believe some maintainers would be quite happy to use Markdown, whose syntax more users are familiar with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flabs.quansight.org%2Fimages%2F2021%2F05%2Fpapyri-nav.gif" class="article-body-image-wrapper"&gt;&lt;img alt="papyri navigation" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flabs.quansight.org%2Fimages%2F2021%2F05%2Fpapyri-nav.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It supports navigation – here in a terminal – where clicking or pressing enter on a link would bring you to the target page. In the above gif you can see that many tokens of the code example are also automatically type-inferred (thanks &lt;a href="https://github.com/davidhalter/jedi" rel="noopener noreferrer"&gt;Jedi&lt;/a&gt;), and can also be clicked on to navigate to their corresponding page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flabs.quansight.org%2Fimages%2F2021%2F05%2Fpapyri-terminal-fig.png" class="article-body-image-wrapper"&gt;&lt;img alt="papyri terminal-fig" src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flabs.quansight.org%2Fimages%2F2021%2F05%2Fpapyri-terminal-fig.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Images are included, even in the terminal, where they are not inline but replaced by a button that opens them in your preferred viewer (see &lt;code&gt;Open with quicklook&lt;/code&gt; in the above screenshot).&lt;/p&gt;

&lt;h1&gt;
  
  
  The future
&lt;/h1&gt;

&lt;p&gt;I'm working on a number of other features, in particular:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rendering of narrative docs – for which I have a prototype,&lt;/li&gt;
&lt;li&gt;automatic indexing of all the figures and plots –  working but slow right now,&lt;/li&gt;
&lt;li&gt;proper cross-library referencing and indexing without the need for intersphinx.
For example, it is possible from the &lt;code&gt;numpy.linspace&lt;/code&gt; page to see all pages that
reference it, or use &lt;code&gt;numpy.linspace&lt;/code&gt; in their example section
(see previous image).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And many others, like showing a graph of the local references between functions, search, and preference configurability. I think this could also support many other desirable features, like user preferences (hide/show type annotation, deprecated directives, and custom color/syntax highlighting) - though I haven't started working on these. I do have some ideas on how this could be used to provide translations as well.&lt;/p&gt;

&lt;p&gt;Right now it is not as fast and efficient as I would like – though it's faster than running Sphinx on the fly – and it requires some ahead-of-time processing. It still crashes in many places, but it can render most of the documentation of SciPy, NumPy, xarray, IPython and scikit-image.&lt;/p&gt;

&lt;p&gt;I encourage you to think about what features you are missing when using documentation from within Jupyter and let me know. I hope this could become a nice addition to Sphinx when consulting documentation from within Jupyter.&lt;/p&gt;

&lt;p&gt;For now I've submitted a &lt;a href="https://docs.google.com/document/d/1hk-Ww7pUwnoHINNhDeP9UOPvNEemAFe-pohK5dCtZPs/edit?usp=sharing" rel="noopener noreferrer"&gt;Letter of intent to CZI EOSS 4&lt;/a&gt; in an attempt to get some of that work funded to land in IPython, and if you have any interest in contributing or want something like that for your library, feel free to reach out.&lt;/p&gt;

&lt;p&gt;You can find the repository &lt;a href="https://github.com/Carreau/papyri" rel="noopener noreferrer"&gt;on my GitHub account&lt;/a&gt;; it's still in pre-alpha stage. It is still quite unstable, with too many hard-coded values for my taste, and needs some polish to be considered usable in production. I've focused my effort mostly on terminal rendering so far – a Jupyter Notebook or JupyterLab extension would be welcome. So if you are adventurous and like to work on the cutting (or even bleeding) edge, please feel free to try it out and open issues/pull requests.&lt;/p&gt;

&lt;p&gt;It also needs to be better documented (pun intended), I'm hoping to use papyri itself to document papyri; but it needs to be a bit more mature for that.&lt;/p&gt;

&lt;p&gt;Stay tuned for more news, I'll try to explain how it works in more detail in a follow-up post, and discuss some of the advantages (and drawbacks) this project has.&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>datascience</category>
      <category>jupyter</category>
    </item>
    <item>
      <title>Spot the differences: what is new in Spyder 5?</title>
      <dc:creator>Isabela Presedo-Floyd</dc:creator>
      <pubDate>Sat, 17 Apr 2021 15:11:46 +0000</pubDate>
      <link>https://forem.com/quansightlabs/spot-the-differences-what-is-new-in-spyder-5-28cj</link>
      <guid>https://forem.com/quansightlabs/spot-the-differences-what-is-new-in-spyder-5-28cj</guid>
<description>&lt;p&gt;In case you missed it, Spyder 5 was released at the beginning of April! This blog post is a conversation attempting to document the long and complex process of improving Spyder's UI with this release. Portions led by Juanita Gomez (&lt;a class="mentioned-user" href="https://dev.to/juanis2112"&gt;@juanis2112&lt;/a&gt;
) are marked as &lt;strong&gt;Juanita&lt;/strong&gt;, and those led by Isabela Presedo-Floyd are marked as &lt;strong&gt;Isabela&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What did we do?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;[Juanita]&lt;/strong&gt; &lt;a href="https://www.spyder-ide.org/"&gt;Spyder&lt;/a&gt; was created more than 10 years ago, and it has had contributions from a great number of developers who have written code, proposed ideas, opened issues and tested PRs, each building a piece of Spyder on their own. We (the Spyder team) have been lucky to have such a great community of people contributing throughout the years, but this is the first time that we decided to ask for help from a UX/UI expert! Why, you might wonder? Contributions from this great number of people have resulted in inconsistencies across Spyder’s interface which we didn’t stop to analyze until now.&lt;/p&gt;

&lt;p&gt;When Isabela joined Quansight, we realized that we had an opportunity to improve Spyder’s interface with her help. We thought her skill set was everything we needed to make Spyder’s UI better. So we started by reviewing the results of a community survey from a few months earlier and realized that some of the most common feedback from users was related to the interface (very crowded, not consistent, too many colors). This is why we decided to start a joint project with Isabela (who we now consider part of the Spyder team) called &lt;a href="https://github.com/spyder-ide/spyder/releases/tag/v5.0.0"&gt;Spyder 5&lt;/a&gt;!!!&lt;/p&gt;

&lt;p&gt;This version was in development for over a year and was finally released on April 2nd. It has some nice new features that we hope will benefit our users greatly. Most of these are focused on improving Spyder’s interface and usability, which we did thanks to Isabela’s help. The three main UX features implemented in this release were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A brand new color palette designed to bring greater consistency to the UI 
and to make it easier to use.&lt;/li&gt;
&lt;li&gt;The redesign of our toolbars by adjusting the margins and sizes of all the 
buttons to meet accessibility recommendations.&lt;/li&gt;
&lt;li&gt;A new set of icons to ensure a consistent style.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How did we do it?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. First impressions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;[Isabela]&lt;/strong&gt; I find collaboration usually starts with three things: discovering and stating a problem, asking why, and figuring out the best ways to communicate with each other. For me, this is a design problem on its own, especially when starting to work with a new team, as I was with Spyder. For this project, I was asked to audit Spyder for any UX/UI issues and report back. Because I have a habit of pushing every button in an interface, I ended up having a lot of (maybe too much) feedback to pass on. One of the things I remember most about opening Spyder for the first time was having three dialogs pop up immediately. That’s really not the first impression you want to give, and I remember talking to Juanita about that right away. Figuring out how to state problems as simply and clearly as possible to a group of people I didn’t know yet was intimidating and went through several phases.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. From the “nightmare document” to the issue tracker
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;[Juanita]&lt;/strong&gt; The first phase was discussing all the problems that Isabela found in weekly meetings with Carlos, the Spyder maintainer, and Stephanie, another Spyder core developer. I created a Google Drive document (which we ended up calling “The Nightmare document”) in which I collected most of the feedback that Isabela gave us. Then, I grouped this information into categories depending on whether the comments were about the interface in general or about a specific pane. Once we agreed on a relevant problem that we wanted to address, I opened an issue on a new repo we created in the Spyder organization called “&lt;a href="https://github.com/spyder-ide/ux-improvements/issues"&gt;ux-improvements&lt;/a&gt;.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[Isabela]&lt;/strong&gt; In fact, that wasn’t even the first set of documents we were working with; I had a whole table, numbering system, and document I was trying to handle before. But it was Juanita who turned them into GitHub issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Sorting out the nightmare
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;[Juanita]&lt;/strong&gt; Since we ended up with more than 30 issues, we had to start a “triaging phase.” We had to label, triage, organize, and prioritize issues according to “urgency” and importance. This issue tracker became our main tool to keep up with all the plans for the future!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[Isabela]&lt;/strong&gt; Juanita did wonderful work tracking our progress through issues and keeping us all accountable, but we were still left with a long list of issues to triage—long enough that it wasn’t all getting into Spyder 5. To have the greatest impact on Spyder, we started with the issues that influenced Spyder as a whole. Toolbars, icons, and colors are something you will always encounter, from the first impression to the most recent, so it made sense to start thinking about those big-picture issues first.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Digging deeper into the dark hole
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;[Isabela]&lt;/strong&gt; When prioritizing the audit feedback for Spyder 5, each pass seemed to get to a deeper layer of the problem. For example, what started as issues to make &lt;a href="https://github.com/spyder-ide/ux-improvements/issues/2"&gt;tooltips more legible&lt;/a&gt; and improve the variable explorer’s &lt;a href="https://github.com/spyder-ide/ux-improvements/issues/7"&gt;color coding&lt;/a&gt; soon became the realization that we weren’t sure exactly which blue was already being used across much of Spyder’s interface. It got more complicated when we found out how many colors were hard-coded across multiple files or defined by an external project. Eventually, the problem changed from the color contrast of tooltips to an &lt;a href="https://github.com/spyder-ide/ux-improvements/issues/13"&gt;unsustainable approach for managing color&lt;/a&gt; across the two default Spyder themes rooted in a non-Spyder repo. Work at each step did build up into a larger solution, but it’s worth noting that it isn’t what we set out to do in the first place.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. What witchcraft does Isabela do in the background?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;[Juanita]&lt;/strong&gt; One of the most important parts of the process was designing the mock ups for the new ideas that we came up with for the interface, which is definitely not our expertise. So... how did the designs magically appear on our GitHub issues?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[Isabela]&lt;/strong&gt; First things first, it isn’t actually witchcraft even if it looks magical from the outside. How I work depends somewhat on what problem we are trying to solve, so let’s use the design of &lt;a href="https://github.com/spyder-ide/ux-improvements/issues/33#issuecomment-776376943"&gt;custom icons&lt;/a&gt; for Spyder 5 as an example. Once I had a defined list of icons to work on, I needed to spend time making progress on my own. Research on best practices for the relevant area of design is where I started; in this case, I knew we were going to be working with &lt;a href="https://materialdesignicons.com/"&gt;Material Design Icons’&lt;/a&gt; specifications. After that, I did a combination of pen-and-paper sketching and working digitally based on the existing icons in Spyder and Material Design Icons while I kept note of the pros and cons for different directions. I also collected design elements as I built them so that I could make more consistent, accurate designs faster as I worked. For the icons, this included things like letters, rounded corners, and templates for the size and spacing of elements. Finally, I compared options side by side and tried them out in the interface to evaluate what designs were strong enough to bring them to the rest of the team. Then we discussed the options together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zGRM-xUx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yvtvt6071c96nbpofz50.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zGRM-xUx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yvtvt6071c96nbpofz50.png" alt="Spyder 5 icon mockups"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Mock ups Vs Reality
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;[Juanita]&lt;/strong&gt; After many discussions, mock ups, and meetings, decisions were made and we were ready to move on to the implementation phase. A big part of the improvements were made in &lt;a href="https://github.com/ColinDuquesnoy/QDarkStyleSheet/"&gt;QDarkStyleSheet&lt;/a&gt;, where we built the new palette and color system for both the dark and light themes of Spyder. In my opinion, this was the hardest part of the process, since it involved getting familiar with the code first and then trying, again and again, changing lines of code to adjust the color or style of buttons, tabs, toolbars, borders, etc.&lt;/p&gt;

&lt;p&gt;The other problem that I ran into was trying to meet the designs’ specifications. Especially when working with the &lt;a href="https://github.com/spyder-ide/ux-improvements/issues/28"&gt;toolbars&lt;/a&gt;, figuring out the right pixel values for margins and sizes was a challenge. I tried several values before finding one that closely matched the proposed mock up, only to realize later that pixels were not the best unit for the specifications. I ended up using “em” since it was more consistent across operating systems. Isabela, Stephanie and Carlos were part of this process as well. Between the four of us, we managed to implement all the changes we had planned for Spyder 5: the new color palette, the complete redesign of the toolbars, and the new set of icons. It was an arduous task, more than we all expected, but in the end we were all very happy with the results and thankful to Isabela for helping us give Spyder a new face.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the final result?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;[Isabela]&lt;/strong&gt; Individually, the colors, toolbars, and icons may feel like small adjustments, but those are some of the elements that make up most of Spyder. When they are together, those small adjustments set the mood in the interface; they are more noticeable, and rooted in the Spyder UI many people are already familiar with. While the changes may feel obvious when they are new, they are also chosen to create consistent patterns across interactions that can become more comfortable over time. Spyder’s default dark and light modes, for example, used to use a different set of UI elements between modes. Now they both use the same elements and only the colors change. This makes it easier for users to jump into a familiar interface and take what they know from working in one space to another. For contributors, it gives a clearer UI pattern to follow in their own work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before and after (Dark theme)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0n8nll8E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gg8lmbh7p3mqw1y1pd4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0n8nll8E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gg8lmbh7p3mqw1y1pd4l.png" alt="Spyder 5 vs Spyder 4 dark theme"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Before and after (Light theme)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Mfo-sPeF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/emv88xwj3m5hbt13xm8z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Mfo-sPeF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/emv88xwj3m5hbt13xm8z.png" alt="Spyder 5 vs Spyder 4 light theme"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What did we learn? :)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;[Isabela]&lt;/strong&gt; From developing new skills to working as a team for the first time, I think we both took a lot from this process. Here are some lessons that stood out to us.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;[Juanita]&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sometimes it is better to try some of your ideas as you go than to have long discussions about an idea and only implement it at the end. In some cases you end up realizing that things don’t look as good as you thought they would, or that some are not even possible.&lt;/li&gt;
&lt;li&gt;One of the most important parts of the design process is to get yourself in the users’ shoes. At the end, they are the reason why we work to improve things constantly.&lt;/li&gt;
&lt;li&gt;Occasionally, less is more. Simple and consistent is better than crowded and complicated. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;[Isabela]&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don’t be afraid of asking questions even when you think you understand the problem, because every bit of information can be useful to better grasp what hurts or helps users.&lt;/li&gt;
&lt;li&gt;Always take the time to review what you might think is obvious with the rest of the team. It’s easy to forget about what you know when you are working with people who have different skills than you.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>spyder</category>
      <category>ux</category>
      <category>ui</category>
      <category>release</category>
    </item>
    <item>
      <title>A step towards educating with Spyder</title>
      <dc:creator>Juanita Gomez</dc:creator>
      <pubDate>Fri, 16 Apr 2021 16:10:56 +0000</pubDate>
      <link>https://forem.com/quansightlabs/a-step-towards-educating-with-spyder-4jmf</link>
      <guid>https://forem.com/quansightlabs/a-step-towards-educating-with-spyder-4jmf</guid>
      <description>&lt;p&gt;As a community manager in the Spyder team, I have been looking for ways of involving more users in the community and making Spyder useful for a larger number of people. With this, a new idea came: Education.&lt;/p&gt;

&lt;p&gt;For the past months, the team and I have been wondering whether Spyder could also serve as a teaching and learning platform, especially in this era where remote instruction has become necessary. We submitted a proposal to the Essential Open Source Software for Science (EOSS) program of the Chan Zuckerberg Initiative, during its third cycle, with the idea of providing a simple way inside Spyder to create and share interactive tutorials on topics relevant to scientific research. Unfortunately, we didn’t get this funding, but we didn’t let this great idea die.&lt;/p&gt;

&lt;p&gt;We submitted a second proposal to the &lt;a href="https://www.python.org/psf/"&gt;Python Software Foundation&lt;/a&gt;, from which we were awarded $4000. For me, this is the perfect opportunity for us to take the first step towards using Spyder for education.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the project is about
&lt;/h2&gt;

&lt;p&gt;The goal of this project is to create specialized Python online training content that uses Spyder as the main platform to deliver it. The grant will cover the development of three practical workshops:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Python for Financial Data Analysis with Spyder&lt;/li&gt;
&lt;li&gt;Python for Scientific Computing and Visualization with Spyder&lt;/li&gt;
&lt;li&gt;Spyder 5 Plugin Development&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;They will be included as part of &lt;a href="https://docs.spyder-ide.org/current/index.html"&gt;Spyder’s documentation&lt;/a&gt; for remote learning, but they will also be used as hands-on materials for talks and workshops.&lt;/p&gt;

&lt;p&gt;These materials are meant for users to learn how Spyder can accelerate their workflow when working with Python in scientific research and data analysis. The idea is for us to provide a way in which we can help people get the most out of Spyder by applying it in their day-to-day jobs.&lt;/p&gt;

&lt;p&gt;The first two workshops will cover aspects such as data exploration and visualization with Spyder’s variable explorer and plots panes, getting documentation through Spyder’s help pane, writing good quality and efficient code using Spyder’s code analysis and profiler, etc.&lt;/p&gt;

&lt;p&gt;Our last workshop will demonstrate how to create a plugin for Spyder, which, thanks to our new API in &lt;a href="https://github.com/spyder-ide/spyder/releases/tag/v5.0.0"&gt;Spyder 5&lt;/a&gt;, released in April 2021, will allow users to easily customize and extend Spyder’s interface with new menus, toolbars, widgets or panes in order to adapt it to their own needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it is important
&lt;/h2&gt;

&lt;p&gt;This project will help the international community of Spyder users (around 500,000, we estimate) discover new capabilities of Spyder and take advantage of all its resources. It will also provide hands-on materials for potential users, who will be able to adopt Spyder as a tool for their work in financial data analysis, scientific research and Spyder plugin development.&lt;/p&gt;

&lt;p&gt;For the past months, our &lt;a href="https://youtube.com/playlist?list=PLPonohdiDqg9epClEcXoAPUiK0pN5eRoc"&gt;documentation tutorials&lt;/a&gt; have had a great impact on our community, with more than 20,000 views on our YouTube channel. We expect these workshops to be a valuable addition to our documentation and to help us continue building a community around Spyder.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is next?
&lt;/h2&gt;

&lt;p&gt;This project is just the first step towards making Spyder an educational tool. In the future, we hope to develop the infrastructure necessary to support in-IDE tutorials by improving tools like &lt;a href="https://github.com/executablebooks/jupyter-book"&gt;Jupyter Book&lt;/a&gt;, &lt;a href="https://github.com/executablebooks/sphinx-thebe"&gt;sphinx-thebe&lt;/a&gt; and &lt;a href="https://github.com/executablebooks/MyST-Parser"&gt;MyST-Parser&lt;/a&gt;, which will provide better integration for writing educational tutorials.&lt;/p&gt;

&lt;p&gt;The final goal is to enable researchers, educators and experts who don’t necessarily have a software engineering background to easily build scientific programming tutorials and provide them as online learning materials in Spyder. Once the infrastructure is built, we can develop several examples to demonstrate Spyder’s capabilities and teach basic scientific programming concepts applicable to a variety of fields.&lt;/p&gt;

</description>
      <category>spyder</category>
      <category>community</category>
      <category>grant</category>
    </item>
    <item>
      <title>Accessibility: Who's Responsible?</title>
      <dc:creator>Isabela Presedo-Floyd</dc:creator>
      <pubDate>Sat, 27 Mar 2021 02:56:37 +0000</pubDate>
      <link>https://forem.com/quansightlabs/accessibility-who-s-responsible-13ki</link>
      <guid>https://forem.com/quansightlabs/accessibility-who-s-responsible-13ki</guid>
      <description>&lt;p&gt;For the past few months, I've been part of a group of people in the JupyterLab community who've committed to start chipping away at the many accessibility failings of JupyterLab. I find this work is critical, fascinating, and a learning experience for everyone involved. So I'm going to document my personal experience and lessons I've learned in a series of blog posts. Welcome!&lt;/p&gt;

&lt;p&gt;Because this is the first of a series, I want to make sure we start with a good foundation. Let me answer some questions you might be having.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Who are you? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; I'm Isabela, a UX/UI designer at &lt;a href="https://labs.quansight.org/"&gt;Quansight Labs&lt;/a&gt;, who cares about accessibility and is fortunate to work somewhere where that is a respected concern. I also spend time in the Jupyter ecosystem—especially around JupyterLab—though that is not the only open-source community you can find me in. I like to collect gargoyles, my hair is pink, and I love the sunflower emoji 🌻. It's nice to meet you!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is the Jupyter ecosystem and JupyterLab? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; &lt;a href="https://jupyter.org/"&gt;Project Jupyter&lt;/a&gt; is an organization that produces open-source software and open standards. The Jupyter ecosystem is a term used to describe projects that are directly a part of or support Project Jupyter. JupyterLab is one of its primary projects and a staple for the day-to-day work of many students, professionals, researchers, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is accessibility? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A:&lt;/strong&gt; Accessibility is a term used to describe the practice of creating things in a way that makes them usable for people with disabilities. I’m going to be talking mostly about web accessibility since JupyterLab is a web app. If you're asking why you should care about accessibility, please take a moment to read &lt;a href="https://www.w3.org/WAI/fundamentals/accessibility-intro/#context"&gt;why it matters&lt;/a&gt; (hint: there are ethical, legal, and business reasons to care). Inaccessible experiences can have consequences, from people not being able to get information they need to being unable to pursue whole careers that rigidly require the use of inaccessible software (such as JupyterLab).&lt;/p&gt;

&lt;h2&gt;
  
  
  How did we get here?
&lt;/h2&gt;

&lt;p&gt;The Jupyter ecosystem is full of people who care about accessibility. I know this because I've heard people ask about accessibility in community meetings. I know this because I've read discussions about accessibility on GitHub issues and PRs. I know this because the project has a &lt;a href="https://github.com/jupyter/accessibility/"&gt;repository&lt;/a&gt; devoted to organizing community accessibility efforts. If this is the case, then why hasn't JupyterLab already been made more accessible in the past three years it's been deemed "&lt;a href="https://blog.jupyter.org/jupyterlab-is-ready-for-users-5a6f039b8906"&gt;ready for users&lt;/a&gt;"? (I'm intentionally not mentioning other Jupyter projects to limit this post's scope.)&lt;/p&gt;

&lt;p&gt;Because for every time accessibility is brought up, I've also experienced a hesitance around taking action. Even though I’ve never heard it explicitly said, the way I’ve seen these efforts get lost time and time again has come to mean this in my head: “accessibility is someone else’s problem.” But it can’t always be someone else’s problem; at some point there is a person taking ownership of the work.&lt;/p&gt;

&lt;p&gt;So who is responsible for making something accessible? Probably not the users, though feedback can be a helpful step in making change. Certainly not the people that already can’t use the tool because it isn’t accessible. But I, personally, think anyone who is part of making that tool is responsible for building and maintaining its accessibility. Just as any user experience encompasses the whole of a product, an accessible experience does the same. This should be a consideration from within the product, to its support/documentation, to any other interaction. A comprehensive team that thinks to ask questions like, “how would I use this if I could only use my keyboard?” or “would I be able to get the same information if I were colorblind?” is starting to hold itself accountable. Taking responsibility is key to starting and sustaining change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Misconceptions
&lt;/h2&gt;

&lt;p&gt;Here are a few common concerns I’ve heard when people tell me why they can’t or haven’t worked on accessibility. I’m going to paraphrase some replies I've heard when asking about accessibility in many different environments (not only JupyterLab) over the years.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I don’t know anything!&lt;/strong&gt; And that’s fine. You don’t have to be an expert! Fortunately, there are already a lot of resources out on the wide open internet, some even focused on beginners (some of my personal favorites are at &lt;a href="https://www.a11yproject.com/resources"&gt;The A11y Project&lt;/a&gt; and &lt;a href="https://developer.mozilla.org/en-US/docs/Learn/Accessibility/What_is_accessibility"&gt;MDN&lt;/a&gt;). Of course, it’s important to remember that learning will mean that you are likely to make mistakes and need to keep iterating. This isn’t a one-and-done deal. If you do have access to an expert, spending time to build a foundation means they can help you tackle greater obstacles instead of just giving you the basics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I don’t have time for another project!&lt;/strong&gt; Accessibility doesn’t have to be your only focus. JupyterLab sure isn’t the only project I am working on, and it won’t be in the near future. Any progress is better than no progress, and several people doing even a little work can add up faster than you might think. Besides, there’s a good chance you won’t even have to go out of your way to start improving accessibility. Start by asking questions about a project you are already working on. Is there a recommended way to design and build a component? Is information represented in more than one way? Is everything labeled? It’s good practice and more sustainable to consider accessibility as a regular part of your process instead of a special side project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It’s not a good use of my energy to work on something that only affects a few people!&lt;/strong&gt; It’s not just a few people. Read what &lt;a href="https://www.who.int/en/news-room/fact-sheets/detail/disability-and-health"&gt;WHO&lt;/a&gt; and the &lt;a href="https://www.cdc.gov/ncbddd/disabilityandhealth/infographic-disability-impacts-all.html"&gt;CDC&lt;/a&gt; have to say about the number of people with disabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I don’t want to make abled people’s experience different than it already is!&lt;/strong&gt; Depending on what you are doing, the changes might not be active or noticeable unless assistive technologies or accessibility features are being actively used. And in many cases, accessibility features improve the experience for all users and not just those they were designed for (sometimes called the &lt;a href="https://uxdesign.cc/the-curb-cut-effect-universal-design-b4e3d7da73f5"&gt;curb cut effect&lt;/a&gt;). Even if you aren’t convinced, I’d encourage you to ask yourself why creating the user experience you want and making that experience accessible are mutually exclusive. What are people missing out on if they can’t use your product? What are you missing out on if they can’t use your product?&lt;/p&gt;

&lt;h2&gt;
  
  
  What could responsibility be like?
&lt;/h2&gt;

&lt;p&gt;With JupyterLab, it was just a matter of a few people who were willing to say they were tired of waiting and able to spend time both learning what needed to be done and doing it. Speaking for myself, I did not come in as an expert, or with undivided obligations, or even with all the skills needed to make the changes. I think this is important to note because it seems to me that it could just as easily have been other members of the community in my position given similar circumstances.&lt;/p&gt;

&lt;p&gt;Our first step in taking responsibility was setting up a regular time to meet so we could check in and help one another. Then we set reasonable goals and scoped the work: we decided to focus on JupyterLab rather than multiple projects at once, address &lt;a href="https://www.w3.org/TR/WCAG21/"&gt;WCAG 2.1 standards&lt;/a&gt; in parts of JupyterLab we were already working on, and follow up on past work that other community members began. This is just the beginning, but I hope it was a helpful peek into the process we are trying out.&lt;/p&gt;
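&lt;p&gt;To make one of those WCAG 2.1 checks concrete: color contrast can be verified with the relative-luminance and contrast-ratio formulas from the spec. This is a minimal sketch of those formulas, not any project’s actual tooling:&lt;/p&gt;

```python
def _linearize(c: float) -> float:
    """Linearize one sRGB channel (0..1) per the WCAG 2.1 definition."""
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    """Relative luminance of an (r, g, b) color with 0-255 channels."""
    r, g, b = (_linearize(v / 255) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """WCAG contrast ratio (L_lighter + 0.05) / (L_darker + 0.05)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)

# Black on white is the maximum possible contrast, 21:1; WCAG 2.1 level AA
# requires at least 4.5:1 for normal-size text.
print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 1))  # 21.0
```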

&lt;h2&gt;
  
  
  But wait, there's more!
&lt;/h2&gt;

&lt;p&gt;Deciding to make accessibility a priority in Jupyter spaces isn't where this work ends. Join me for the next post in this series, where I'll talk about my not-so-subtle panic at the number of problems to be solved, how to move forward in spite of panic, and the four experience types in JupyterLab that we must address to be truly accessible.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Interested in getting involved? Join our community via the JupyterLab accessibility meetings listed every other week on the &lt;a href="https://jupyter.readthedocs.io/en/latest/community/content-community.html#jupyter-community-meetings"&gt;Jupyter community calendar&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>a11y</category>
      <category>jupyter</category>
      <category>inclusion</category>
    </item>
  </channel>
</rss>
