<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: michael c stewart</title>
    <description>The latest articles on Forem by michael c stewart (@michaelcstewart).</description>
    <link>https://forem.com/michaelcstewart</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F335789%2F511a85cb-bde0-4f8e-9099-08669e4fc91c.png</url>
      <title>Forem: michael c stewart</title>
      <link>https://forem.com/michaelcstewart</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/michaelcstewart"/>
    <language>en</language>
    <item>
      <title>What Even Is Synthetic Data?</title>
      <dc:creator>michael c stewart</dc:creator>
      <pubDate>Tue, 23 Mar 2021 19:15:36 +0000</pubDate>
      <link>https://forem.com/michaelcstewart/what-even-is-synthetic-data-43n3</link>
      <guid>https://forem.com/michaelcstewart/what-even-is-synthetic-data-43n3</guid>
      <description>&lt;p&gt;For Christmas I purchased my daughter a copy of &lt;em&gt;Neural Networks for Babies&lt;/em&gt;. I thought it would be funny and cute, but also practical because my day job is in computer vision technology and sometimes I feel like I have a 9-month-old’s grasp on it all. Ultimately, I was disappointed that the book didn’t explain how neural networks are trained. But perhaps that’s an unfair expectation when a book’s first page is, “This is a ball.”&lt;/p&gt;

&lt;p&gt;Since I’ve had to learn how to discuss training computer vision models (in a way that &lt;em&gt;even I&lt;/em&gt; can comprehend), I thought I’d share here. While I’m at it, I’ll also explain what exactly synthetic training data is and when you might want to train on synthetic datasets. Disclaimer: I am not a machine learning engineer or data scientist, so please understand that—like &lt;em&gt;Neural Networks for Babies&lt;/em&gt;—this high level overview may be imperfect or incomplete.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ebOIacIB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://static.wixstatic.com/media/a6e58d_8382ce50cf90410f9d4293da12844f30%257Emv2.jpg/v1/fill/w_1480%2Ch_1065%2Cal_c%2Cq_90%2Cusm_0.66_1.00_0.01/a6e58d_8382ce50cf90410f9d4293da12844f30%257Emv2.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ebOIacIB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://static.wixstatic.com/media/a6e58d_8382ce50cf90410f9d4293da12844f30%257Emv2.jpg/v1/fill/w_1480%2Ch_1065%2Cal_c%2Cq_90%2Cusm_0.66_1.00_0.01/a6e58d_8382ce50cf90410f9d4293da12844f30%257Emv2.webp" alt="René Magritte’s The Treachery of Images (This is Not a Pipe), 1929. A painting of a pipe, captioned “Ceci n’est pas une pipe.”"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;René Magritte’s most famous work is &lt;em&gt;The Treachery of Images&lt;/em&gt;. It is a painting of a pipe with a caption beneath saying “Ceci n’est pas une pipe,” or “This is not a pipe.” The surrealist work drew the ire of folks who thought, actually pal, that looks like a pipe to me. Magritte, a troll ahead of his time, defended himself, saying “How people reproached me for it! And yet, could you stuff my pipe? No, it’s just a representation, is it not? So if I had written on my picture ‘This is a pipe,’ I’d have been lying!”&lt;/p&gt;

&lt;p&gt;This story illustrates how the human computer — our big old brain — tends to process information. If you are sighted, you glance at something, compare it against a database of things you’ve seen before (and their accompanying labels), and gain an understanding of what exactly it is you’re looking at. It all happens in an instant, and it’s how we’re able to recognize a pipe when we see one.&lt;/p&gt;

&lt;p&gt;But unlike humans, computers won’t organically accumulate a database of labeled reference images as they grow. If you want a computer to be able to perform object detection tasks — say, identifying the location of an object within an image, or predicting an object’s name — you first need to teach that computer what those things look like and what they’re called.&lt;/p&gt;

&lt;p&gt;This is where I will artfully gloss over the deep learning part. Suffice it to say, it’s no longer necessary to build your own object detection model. There are many publicly available algorithms that can accomplish the task: exciting acronyms like SSD, R-CNN, and YOLO, for example. All of these have been patiently explained to me, and I now have, at best, a &lt;em&gt;Neural Networks for Babies&lt;/em&gt; understanding of each.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v_jhJFCg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://static.wixstatic.com/media/a6e58d_57fe8c14744e4230a3ba24671a9e2fb2%257Emv2.jpg/v1/fill/w_1480%2Ch_753%2Cal_c%2Cq_90%2Cusm_0.66_1.00_0.01/a6e58d_57fe8c14744e4230a3ba24671a9e2fb2%257Emv2.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v_jhJFCg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://static.wixstatic.com/media/a6e58d_57fe8c14744e4230a3ba24671a9e2fb2%257Emv2.jpg/v1/fill/w_1480%2Ch_753%2Cal_c%2Cq_90%2Cusm_0.66_1.00_0.01/a6e58d_57fe8c14744e4230a3ba24671a9e2fb2%257Emv2.webp" alt="A page from Neural Networks for Babies, showing an illustrated red starfish, and how a neuron might detect it."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The TL;DR is that these models generally work pretty well. But in order for them to work, they first need to be trained on data. In some cases you can find models — ResNet-101 or MobileNet — that have already been trained on a publicly available dataset like ImageNet. That popular dataset contains over 14 million images, and its widely used 1,000-category subset is what most pretrained models learn from. But if you’re trying to use computer vision to solve a &lt;em&gt;specific&lt;/em&gt; problem and recognize something in particular, chances are you’re going to have to train the model yourself. And for that, you’ll need a custom dataset.&lt;/p&gt;

&lt;p&gt;Do you know what object is not included in the ImageNet dataset? A pipe. If you wanted to build a pipe detector, and had to train your model to recognize a pipe, you’d need a dataset chock full of images of pipes. When you can’t find an existing dataset that meets your needs, you’ve got roughly three options: scrape it, make it, or fake it.&lt;/p&gt;

&lt;p&gt;Scraping a dataset together is exactly what it sounds like: you’re combing Flickr or Google image search for the images you need. At best it’s time-consuming, and at worst it’s ethically murky once you consider image or even likeness rights. (You’d better hope you don’t get &lt;a href="https://www.nytimes.com/2021/02/23/technology/the-best-law-youve-never-heard-of.html"&gt;an Illinoisan in there&lt;/a&gt;.) While you may get a good range of images this way, whether the resulting dataset will actually work for your purposes is another question altogether.&lt;/p&gt;

&lt;p&gt;Some ambitious teams may choose to make their datasets in house. For a pipe, this would be relatively easy. Buy a pipe, and take a ton of photos of it. Wait, actually, you’d need to buy a bunch of pipes or else the computer will only recognize that one specific pipe. And you’d probably need to take photos of those pipes in a bunch of different environments and contexts, in hands, hanging from mouths, etc., lest the computer come to believe pipes can only exist on tables. Okay, see, now this is why people scrape their datasets.&lt;/p&gt;

&lt;p&gt;But wait, why not fake it? Humans can tell the difference between a physical pipe and an image of a pipe — like, we get it Magritte! — but computers will only ever process images. That means a computer doesn’t really know or care whether that image is a real photo you took or a still that you generated using CGI. This sort of training data, custom 3D modeled and generated for a given computer vision problem, is known as synthetic data.&lt;/p&gt;

&lt;p&gt;Look, I know it’s my job to say this, but synthetic data has a ton of upsides. It is flexible and abundant. It solves data scarcity: the situation where pictures of the thing you’re trying to detect are just super hard, or impossible, to come by. There are no privacy issues, since there are never real people in synthetic datasets. And you don’t need to painstakingly label (or, more likely, pay someone else to label) the images you scraped together from Google, since synthetic images are generated with your desired labels at the outset.&lt;/p&gt;
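&lt;p&gt;To make that last point concrete, here’s a minimal sketch in plain Python. The &lt;code&gt;make_synthetic_sample&lt;/code&gt; helper is a made-up stand-in for a real renderer, but it shows why synthetic labels come for free: the generator decides where the object goes, so it already knows the bounding box.&lt;/p&gt;

```python
import random

def make_synthetic_sample(scene_w, scene_h, obj_w, obj_h, rng):
    """Stand-in for a renderer: place one object at a random spot in the
    scene and emit its label. Because the generator chose the placement,
    the bounding box is known exactly -- no human annotator required."""
    x = rng.randint(0, scene_w - obj_w)
    y = rng.randint(0, scene_h - obj_h)
    return {"bbox": [x, y, obj_w, obj_h], "category": "pipe"}

rng = random.Random(0)
# A thousand perfectly labeled "pipe" samples, generated in milliseconds.
dataset = [make_synthetic_sample(640, 480, 120, 80, rng) for _ in range(1000)]
```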

&lt;p&gt;Also, once again, the computer couldn’t care less that you’re feeding it the Impossible Burger equivalent of a hamburger. The computer just needs to learn the characteristics and contexts of a pipe so that it can identify pipes when it sees them in the future. Synthetic data works great for that. While you &lt;em&gt;can&lt;/em&gt; train exclusively on synthetic data, current research suggests the best model performance actually comes from a hybrid approach, using both sourced and synthetic images. Incidentally, that’s kind of how we learn, right? My daughter can say “ball” now (dad brag), and I have to believe her ability to recognize both a real ball and an illustrated ball is because she was trained on both.&lt;/p&gt;
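&lt;p&gt;Mechanically, a hybrid training set is just a blend of the two pools. Here’s one hedged sketch of that mixing step; the function name and the 70/30 split are arbitrary illustrations, not recommendations.&lt;/p&gt;

```python
import random

def mix_datasets(real, synthetic, synthetic_fraction, rng):
    """Draw a training set whose size matches the combined pools, taking
    synthetic_fraction of samples (with replacement) from the synthetic
    pool and the remainder from the real pool, then shuffling."""
    n = len(real) + len(synthetic)
    n_synth = round(n * synthetic_fraction)
    picks = [rng.choice(synthetic) for _ in range(n_synth)]
    picks += [rng.choice(real) for _ in range(n - n_synth)]
    rng.shuffle(picks)
    return picks

rng = random.Random(0)
train = mix_datasets(["real"] * 200, ["synth"] * 800, 0.7, rng)
```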

&lt;p&gt;As for Magritte, as persnickety as he was about it, I think he’s right. It’s not a pipe. It’s just synthetic data.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>beginners</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Five Big Problems With Labeled Data</title>
      <dc:creator>michael c stewart</dc:creator>
      <pubDate>Thu, 11 Feb 2021 20:13:05 +0000</pubDate>
      <link>https://forem.com/michaelcstewart/five-big-problems-with-labeled-data-fec</link>
      <guid>https://forem.com/michaelcstewart/five-big-problems-with-labeled-data-fec</guid>
      <description>&lt;p&gt;The widespread adoption of artificial intelligence and computer vision across industries—from manufacturing, to retail, security, agriculture, healthcare, and beyond—has increased the demand for labeled data exponentially. As the key to training models that will transform the way we work, it makes perfect sense that labeled data and data annotation would suddenly be in such high demand.&lt;/p&gt;

&lt;p&gt;But at &lt;a href="https://www.zumolabs.ai/"&gt;Zumo Labs&lt;/a&gt;, many of our incoming customers have a common pain point: labeled training data is presenting itself as a significant bottleneck. How is it that data wrangling (that is, sourcing labeled data and managing the training data pipeline) can take up to 80% of AI project time by some estimates (per &lt;a href="https://www.cognilytica.com/2020/01/31/data-preparation-labeling-for-ai-2020/"&gt;Cognilytica&lt;/a&gt;)? Even if that figure is exaggerated by a factor of two, perhaps by exasperated engineers who would rather be making headway on their projects, it’s still far too much time. In our work, we’ve pretty concretely identified five big problems with labeled data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PROBLEM 1: Labeled Data Must Be Sourced or Produced&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first issue is perhaps the most glaring—labeled data must be sourced or produced. Where are you getting your labeled data? This will depend on your specific use case, but there are only so many options. If you’re lucky, you may be able to find a publicly available labeled dataset that’s “good enough” for the problem you’re trying to solve. But if you’re like most folks, your model is going to require a custom solution; you’ll need to collect raw data from your own cameras and then label it.&lt;/p&gt;

&lt;p&gt;And what are your options for labeling it? You can label that data in house, which requires you to build out a team of subject matter experts, or you can turn to a third party vendor for this step, such as Amazon Mechanical Turk or a data annotation service. These annotation services are usually limited to simple annotations such as bounding boxes and basic categories, those which are easy for a human labeler to do at scale. If your problem requires nuanced or proprietary labels (a CAD model of an engine with 75+ subcomponents, for example) you will have no choice but to build and train a team in house. That’s what &lt;a href="https://electrek.co/2021/02/08/tesla-looks-hire-data-labelers-feed-autopilot-neural-nets-images-gigafactory-new-york/"&gt;Tesla has done&lt;/a&gt;, for example.&lt;/p&gt;
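&lt;p&gt;For a sense of what those “simple annotations” look like, here’s one bounding-box record in the COCO convention (the ids and values here are illustrative). Anything much richer than this, like the per-subcomponent engine labels above, quickly outgrows what a generalist labeling service can produce at scale.&lt;/p&gt;

```python
# One object annotation in the COCO convention: bbox is [x, y, width, height]
# in pixels, and category_id points into a separate categories table.
annotation = {
    "image_id": 42,
    "category_id": 1,
    "bbox": [120.0, 85.0, 160.0, 60.0],
    "area": 160.0 * 60.0,
    "iscrowd": 0,
}
categories = [{"id": 1, "name": "pipe", "supercategory": "object"}]
```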

&lt;p&gt;An added challenge here is that you can only source (and subsequently label) data that already exists. That means if your cameras are for a piece of hardware that has yet to be manufactured, or you want to detect incredibly rare failures (that you simply don’t have enough examples of to train a robust detection model), you’re out of luck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PROBLEM 2: Quality of Data Labeling Is Lacking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Assuming you’re able to source a representative and sufficiently balanced batch of images, the next big problem is the quality of the data labeling. Precise labels are critical to the performance of a model, but publicly available datasets have repeatedly been shown to contain questionable—and sometimes alarming—labeling issues, as documented in &lt;a href="https://www.theguardian.com/technology/2019/sep/17/imagenet-roulette-asian-racist-slur-selfie"&gt;this article about ImageNet containing slurs&lt;/a&gt;. While labeling services often have internal quality assurance checks in place, a customer is still dependent on the domain knowledge of the labeler and the person responsible for QA.&lt;/p&gt;

&lt;p&gt;Quality goes beyond the accuracy of the labels though. The precision of a bounding box matters in training as well. Human labelers may draw a bounding box to the best of their ability, but they can’t always fairly assess images that present obstacles such as occlusion. Likewise, without the full context of the images contained within the dataset, or even how the dataset will be used, the quality of their labeling may suffer.&lt;/p&gt;
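&lt;p&gt;Box precision is usually scored with intersection-over-union (IoU): a sloppy human-drawn box around a partly occluded object can easily fall below the 0.5 threshold many benchmarks use. A minimal sketch:&lt;/p&gt;

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    1.0 is a pixel-perfect match; looser boxes score lower."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

# A box shifted by half its width overlaps its ground truth by only a third.
score = iou((0, 0, 10, 10), (5, 0, 15, 10))
```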

&lt;p&gt;&lt;strong&gt;PROBLEMS 3 &amp;amp; 4: Data Labeling Is Slow and Costly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because data labeling requires a human in the loop, the twin issues of price and speed become immediately apparent. On price: data labeling is not cheap. Spending a computer vision engineer’s time on labeling rather than on machine learning is an inefficient use of resources. But building out a dedicated labeling team in house is also a huge gamble, especially if your future volume of labeling needs is less than predictable. Meanwhile, the cost of using a third party labeling service scales almost linearly with the volume of images you need labeled, because each additional image means a little more work for a labeler.&lt;/p&gt;
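&lt;p&gt;That linear scaling is worth a back-of-the-envelope check: against any pipeline with a mostly fixed up-front cost, pay-per-image labeling eventually loses. The prices below are purely hypothetical.&lt;/p&gt;

```python
import math

def labeling_cost(n_images, per_image_cost):
    """Third-party labeling: cost grows linearly with dataset size."""
    return n_images * per_image_cost

def break_even(fixed_cost, per_image_cost):
    """Dataset size at which a pipeline with a fixed up-front cost and
    near-zero marginal cost undercuts pay-per-image labeling."""
    return math.ceil(fixed_cost / per_image_cost)

# Hypothetical numbers: $0.25 per box vs. a $5,000 fixed pipeline cost.
n = break_even(5000, 0.25)   # 20,000 images
```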

&lt;p&gt;Training on labeled data can also only happen as fast as you’re able to acquire labeled data, which is dependent on both your ability to capture and share the data in question, and the turnaround time of your chosen labeler. If for whatever reason you need to increase the size of the dataset, say to introduce new edge cases, you must once again run through the full cycle. This introduces a challenging bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PROBLEM 5: Privacy Is Not Guaranteed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally, where datasets including humans are concerned, storing labeled data introduces unnecessary privacy considerations. If you’re working with a publicly available labeled dataset such as MegaFace, you might assume you’re in the clear. But since that dataset consists primarily of images scraped from the internet, &lt;a href="https://www.nytimes.com/interactive/2019/10/11/technology/flickr-facial-recognition.html"&gt;there are real liabilities&lt;/a&gt;. Facebook was sued for violating the Illinois Biometric Information Privacy Act (“BIPA”) after scraping images of users to train its Tag Suggestion feature. BIPA, notoriously one of the most restrictive state laws around biometric privacy, requires written consent before collecting someone’s fingerprints, retina or iris scan, voiceprint, scan of hand, or face geometry. Facebook paid $550 million to settle this suit last year.&lt;/p&gt;

&lt;p&gt;Progress on better artificial intelligence and higher quality computer vision models is hampered by this dependency on sourced and labeled data. We believe there’s a better way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SOLUTION: Synthetic Training Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Synthetic data, in this case computer generated imagery that simulates the specifics of sourced data, is the solution to nearly all of these failings of labeled data. Since the data is simulated and never contains real humans, compliance with privacy laws such as BIPA, GDPR, or CCPA is guaranteed. Over time, it is both cheaper and faster than sourcing and labeling data. The quality is unparalleled, thanks to pixel-perfect annotations and ground truth (which solves for occlusion, creates perfect depth maps, etc.). And perhaps best of all, synthetic data addresses the issue of data scarcity—you do not need to wait to collect a critical volume of real data to get started on a problem.&lt;/p&gt;
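&lt;p&gt;“Pixel perfect” is meant literally: a renderer knows the exact segmentation mask of every object it draws, so labels can be derived rather than hand-drawn. A toy sketch of pulling a tight bounding box out of a ground-truth mask:&lt;/p&gt;

```python
def bbox_from_mask(mask):
    """Tight bounding box (x1, y1, x2, y2) from a ground-truth segmentation
    mask given as rows of 0/1. A renderer emits the mask exactly, so the
    resulting box is pixel-perfect by construction."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    return (min(xs), min(ys), max(xs), max(ys))

mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
box = bbox_from_mask(mask)   # (1, 1, 2, 2)
```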

&lt;p&gt;They say control your controllables. For a long time, dataset creation and annotation was not a realistic controllable for most businesses. Synthetic data changes that. Generate your own synthetic data in house—using Unity, Unreal, Blender, or any 3D modeling software—and take control of your entire machine learning training stack.&lt;/p&gt;

&lt;p&gt;If you want a guided tour of what synthetic data is capable of, please feel free to reach out to me directly. I'm at &lt;a href="mailto:michael@zumolabs.ai"&gt;michael@zumolabs.ai&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>deeplearning</category>
      <category>unity3d</category>
    </item>
  </channel>
</rss>
