<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Hugo</title>
    <description>The latest articles on Forem by Hugo (@hugop).</description>
    <link>https://forem.com/hugop</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F613297%2F94f606de-b32a-4ec8-8aa4-d098dea1f979.png</url>
      <title>Forem: Hugo</title>
      <link>https://forem.com/hugop</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hugop"/>
    <language>en</language>
    <item>
      <title>Why Blender Is the Best Software for the 3D Workflow</title>
      <dc:creator>Hugo</dc:creator>
      <pubDate>Mon, 28 Jun 2021 16:17:02 +0000</pubDate>
      <link>https://forem.com/hugop/why-blender-is-the-best-software-for-the-3d-workflow-3b4l</link>
      <guid>https://forem.com/hugop/why-blender-is-the-best-software-for-the-3d-workflow-3b4l</guid>
      <description>&lt;h1&gt;3D&lt;/h1&gt;

&lt;p&gt;3D, short for the three dimensions of space we live in, is a catch-all term for the varied technologies used to create virtual worlds. 3D’s technology stack can be roughly split into two broad categories: asset creation and asset scripting. Asset creation is the process of creating assets: virtual objects, scenes, and materials. Asset scripting is the process of manipulating those assets and their interactions over the fourth dimension of time. Decades of progress have resulted in sophisticated software tools that make 3D workflows more automated and straightforward, but a significant amount of human expertise and artistic talent is still required.&lt;/p&gt;

&lt;h1&gt;Asset Creation&lt;/h1&gt;

&lt;p&gt;An asset is a digital representation of a 3D object. One type of asset is a mesh: a connected graph of 3D points, called vertices, that defines the surface of an object. Edges interconnect vertices, and a closed loop of edges creates a polygon known as a face. The engineering and manufacturing world creates meshes using computer-aided design (CAD) software such as AutoCAD, SolidWorks, Onshape, and Rhino. The entertainment industry creates meshes using modeling software such as Maya, 3ds Max, and Cinema 4D.&lt;/p&gt;
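
&lt;p&gt;To make that concrete, here is a minimal sketch of what a mesh looks like as data, using Blender’s built-in bpy module to build a single quad from vertices and a face. The object and mesh names are arbitrary, and the script assumes it is run inside Blender.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import bpy

# A mesh is just vertices (3D points), edges, and faces (closed loops of vertices).
verts = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
edges = []              # edges can be derived from the face below
faces = [(0, 1, 2, 3)]  # one quad face referencing the four vertices by index

mesh = bpy.data.meshes.new("quad_mesh")
mesh.from_pydata(verts, edges, faces)
mesh.update()

obj = bpy.data.objects.new("quad_object", mesh)
bpy.context.collection.objects.link(obj)
&lt;/code&gt;&lt;/pre&gt;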

&lt;p&gt;Whereas a mesh describes the shape and form of an object, a material asset describes the texture and appearance of a virtual object. A material may define rules for the reflectivity, specularity, and metalness of the object as a function of lighting conditions. Shader programs use materials to calculate the exact pixel values to render for each face of a mesh polygon. Modeling software usually comes packaged with tools for creating and configuring materials.&lt;/p&gt;
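
&lt;p&gt;As a rough sketch of what configuring a material looks like in practice, the snippet below creates a node-based material with Blender’s bpy and sets a few Principled BSDF inputs; the material name and the specific values are arbitrary, and the script assumes it runs inside Blender with some object selected.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import bpy

mat = bpy.data.materials.new(name="BrushedMetal")
mat.use_nodes = True

# The Principled BSDF node drives how shaders render the surface.
bsdf = mat.node_tree.nodes["Principled BSDF"]
bsdf.inputs["Base Color"].default_value = (0.8, 0.8, 0.85, 1.0)  # RGBA
bsdf.inputs["Metallic"].default_value = 1.0   # fully metallic
bsdf.inputs["Roughness"].default_value = 0.3  # fairly glossy

# Assign the material to the active object, if one is selected.
obj = bpy.context.active_object
if obj is not None:
    obj.data.materials.append(mat)
&lt;/code&gt;&lt;/pre&gt;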

&lt;p&gt;Finally, asset creation encompasses the process of scene composition. Assets can be organized into scenes, which may contain other unique virtual objects such as simulated lights and cameras. Deciding where to place assets, especially lights, is still almost entirely done by hand. Automatic scene composition remains a tremendous challenge in the 3D technology stack.&lt;/p&gt;
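
&lt;p&gt;Here is a small sketch of that kind of manual placement, assuming Blender’s bpy and arbitrary positions: it adds a point light and a camera to the current scene by hand, which is exactly the sort of work that has resisted automation.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import bpy

# Create a point light and place it above the scene.
light_data = bpy.data.lights.new(name="KeyLight", type='POINT')
light_data.energy = 1000.0  # watts
light_obj = bpy.data.objects.new(name="KeyLight", object_data=light_data)
light_obj.location = (4.0, -4.0, 6.0)
bpy.context.collection.objects.link(light_obj)

# Create a camera and aim it roughly at the origin.
cam_data = bpy.data.cameras.new(name="MainCamera")
cam_obj = bpy.data.objects.new(name="MainCamera", object_data=cam_data)
cam_obj.location = (7.0, -7.0, 5.0)
cam_obj.rotation_euler = (1.1, 0.0, 0.785)  # radians, chosen by eye
bpy.context.collection.objects.link(cam_obj)
bpy.context.scene.camera = cam_obj
&lt;/code&gt;&lt;/pre&gt;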

&lt;h1&gt;Asset Scripting&lt;/h1&gt;

&lt;p&gt;The fourth perceivable dimension of our reality is time. Asset scripting is the process of defining the behaviors of assets within scenes over time. One type of asset scripting is animation, which consists of defining sequential mesh deformations that create the illusion of natural movement. Animation is a tedious manual task because an artist must pose the asset frame by frame; expert animators spend decades honing their digital puppeteering skills. Specialized software is often used to automate this task as much as possible, and technologies such as Motion Capture (MoCap) can be used to record the movement of real objects and play those movements back on virtual assets.&lt;/p&gt;
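
&lt;p&gt;For a sense of what scripted animation looks like, the sketch below keyframes an object’s location at two points in time with Blender’s bpy; the frame numbers and positions are made up, and Blender interpolates the frames in between.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import bpy

obj = bpy.context.active_object  # assumes some object is selected

# Keyframe the object at frame 1...
obj.location = (0.0, 0.0, 0.0)
obj.keyframe_insert(data_path="location", frame=1)

# ...and at frame 60; the software interpolates the motion in between.
obj.location = (5.0, 0.0, 2.0)
obj.keyframe_insert(data_path="location", frame=60)
&lt;/code&gt;&lt;/pre&gt;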

&lt;p&gt;Game engines are software tools that allow for more structured and systematic asset scripting, mostly by providing software interfaces (i.e., code) to control the virtual world. Used extensively in the video game industry after which they were named, examples include Unity, Unreal Engine, Godot, and Roblox. These game engines support rule-based spawning, animation, and complex interactions between assets in the virtual world. Programming within game engines is a separate skill set from modeling and animating and is usually done by separate engineers within an organization.&lt;/p&gt;

&lt;h1&gt;Blender&lt;/h1&gt;

&lt;p&gt;Blender is a 3D software tool whose origins date back to 1994 and which has been developed as open source since 2002. It has grown steadily over the decades and has become one of the most popular 3D tools available, with a massive online community of users. Blender’s strength is in its breadth: it provides simple tools for every part of the 3D workflow, rather than specializing in a narrow slice. Organizations such as game studios have traditionally preferred specialization, with separate engineers using separate tools (such as Maya for modeling and Unreal Engine for scripting). However, the convenience of using a single tool, and the myriad advantages of a single engineer being able to see a project through from start to finish, make a strong case for Blender as the ultimate winner in the 3D software tools race.&lt;/p&gt;

&lt;p&gt;Many of the world’s new 3D developers opt to get started and build their expertise in Blender because it is open source and community-driven. This is an example of a common product flywheel: using a growing community of users to improve a product over time. With big industry support from Google, Amazon, and even Epic Games (the maker of Unreal Engine), Blender also has the funding required to improve its tools with this user feedback.&lt;/p&gt;

&lt;p&gt;In addition to supporting the full breadth of the 3D workflow, Blender has the unique strength of using Python as the programming language of choice for asset scripting. Python has emerged as the lingua franca for modern deep learning, in part due to the popularity of open-source frameworks such as TensorFlow, PyTorch, and Scikit-Learn. Successful adoption of synthetic data will require Machine Learning Engineers to perform asset scripting, and these engineers will be much more comfortable in Blender’s Python environment than Unity’s C# or Unreal Engine’s C++ tools.&lt;/p&gt;

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Thanks for getting this far! If you’re interested in 3D and what it can do for synthetic data, check out our &lt;a href="https://github.com/ZumoLabs/zpy"&gt;open-source data development toolkit zpy&lt;/a&gt;. Everything you need to generate and iterate synthetic data for computer vision is available for free. Your feedback, commits, and feature requests are invaluable as we continue to build a more robust set of tools for generating synthetic data. In the meantime, if you need our support with a particularly tricky problem, please &lt;a href="https://www.zumolabs.ai/contact"&gt;reach out&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>blender</category>
      <category>3d</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Will Self-Supervised Visual Transformers Replace Pre-Trained CNNs?</title>
      <dc:creator>Hugo</dc:creator>
      <pubDate>Tue, 01 Jun 2021 14:58:40 +0000</pubDate>
      <link>https://forem.com/zumolabs/will-self-supervised-visual-transformers-replace-pre-trained-cnns-4j7o</link>
      <guid>https://forem.com/zumolabs/will-self-supervised-visual-transformers-replace-pre-trained-cnns-4j7o</guid>
      <description>&lt;p&gt;Pre-trained CNNs are still king when training models for computer vision use cases. However, the emerging popularity of Visual Transformers (ViTs), and the growing consensus about their self-supervised learning capabilities, gives ViTs an unexpected opening to usurp the throne.&lt;/p&gt;

&lt;h1&gt;Pre-Trained CNNs&lt;/h1&gt;

&lt;p&gt;Convolutional Neural Networks work by sliding a pattern, formally known as a kernel or filter, across an image; the map of responses produced by each kernel is called a feature map (Slide 1). This sliding strategy is effective because it acts as a natural form of translation invariance: once a CNN can recognize something in one part of the image, it will recognize it in any part of the image [1]. However, this approach leads to a kind of fragility: the learned kernels are often overfit to a particular texture or object size.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--koHY4O-m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5rravd458nij591e86fz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--koHY4O-m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5rravd458nij591e86fz.png" alt="Slide 1: CNNs and Feature Maps"&gt;&lt;/a&gt;&lt;/p&gt;
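
&lt;p&gt;As a quick, hedged sketch of the sliding-kernel idea in PyTorch (the layer sizes and the 8-pixel shift are arbitrary), the snippet below applies one convolutional layer to a random image-shaped tensor and checks that shifting the input simply shifts the resulting feature maps, away from the image border.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 64, 64)              # a fake RGB image
shifted = torch.roll(x, shifts=8, dims=3)  # the same image shifted 8 pixels to the right

fmap = conv(x)              # shape (1, 16, 64, 64): one feature map per kernel
fmap_shifted = conv(shifted)

# Away from the border, shifting the input just shifts the feature maps.
print(torch.allclose(fmap[..., 16:48], fmap_shifted[..., 24:56], atol=1e-5))  # True
&lt;/code&gt;&lt;/pre&gt;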

&lt;p&gt;Learning useful kernels (and thus useful feature maps) requires a ton of data, so CNNs are usually pre-trained on a large generic dataset like COCO or ImageNet, the latter boasting over one million images across 1,000 categories. A pre-trained CNN can then be fine-tuned to new tasks by cutting off the model head and retraining with a new, often much smaller, dataset (Slide 2).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zW8AZixI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x8hgurjjqfh57dj73cp1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zW8AZixI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x8hgurjjqfh57dj73cp1.png" alt="Slide 2: Pre-training and fine-tuning."&gt;&lt;/a&gt;&lt;/p&gt;
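
&lt;p&gt;For concreteness, here is a minimal sketch of the cut-off-the-head step in PyTorch, assuming a torchvision ResNet-50 pre-trained on ImageNet and a hypothetical two-class downstream task; the actual fine-tuning loop (data, optimizer, epochs) is omitted.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import torch.nn as nn
import torchvision.models as models

# Load a CNN pre-trained on ImageNet.
model = models.resnet50(pretrained=True)

# Freeze the pre-trained feature extractor.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head with a fresh one for our (hypothetical) 2-class task.
num_classes = 2
model.fc = nn.Linear(model.fc.in_features, num_classes)
# Only the new head's parameters will now be updated during fine-tuning.
&lt;/code&gt;&lt;/pre&gt;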

&lt;h1&gt;Transformers&lt;/h1&gt;

&lt;p&gt;Transformers have been popular in natural language processing (NLP) for quite some time. They work through a concept known as "self-attention," which lets the model pay more attention to certain parts of the input than others [2]. In NLP, this allows specific words within a sentence to be identified as more important. There are different types of attention and plenty of nuance for the experts to argue over, but the words "attention" and "focus" are good mental models of how these networks learn.&lt;/p&gt;
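
&lt;p&gt;To make "self-attention" slightly more concrete, here is a minimal sketch of scaled dot-product attention in PyTorch: each token produces queries, keys, and values, and the attention weights decide how much each token focuses on every other one. This ignores multi-head attention and the rest of the Transformer machinery, and the dimensions are arbitrary.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (tokens, dim). Returns a new representation of each token."""
    q = x @ w_q  # queries
    k = x @ w_k  # keys
    v = x @ w_v  # values
    scores = q @ k.T / math.sqrt(k.shape[-1])  # how much each token attends to each other token
    weights = F.softmax(scores, dim=-1)        # each row sums to 1: an "attention budget" per token
    return weights @ v

dim = 16
x = torch.randn(10, dim)  # e.g., 10 word embeddings, or 10 image patches for a ViT
out = self_attention(x, torch.randn(dim, dim), torch.randn(dim, dim), torch.randn(dim, dim))
print(out.shape)  # torch.Size([10, 16])
&lt;/code&gt;&lt;/pre&gt;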

&lt;h1&gt;Self-Supervised ViT&lt;/h1&gt;

&lt;p&gt;Self-supervised training is a little different in that it does not require labels: you don't need to tell the model that the object in an image belongs to the category "cat," for example. Instead, a self-supervised training technique might involve cropping an image, feeding the crops through multiple networks, and then getting them all to agree on which features in the image are essential (Slide 3). One such technique, called DINO [3], has been used to successfully train visual transformers (transformers for visual tasks, e.g., images). The ViTs trained with DINO turned out to be surprisingly effective for classification tasks, reaching 80% top-1 accuracy on ImageNet. Inspecting the self-attention maps of these ViTs also shows that they can very precisely segment out objects in an image (Slide 4).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dnJIPpIN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/on2pdhpm8xcj4ukft1b0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dnJIPpIN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/on2pdhpm8xcj4ukft1b0.png" alt="Slide 3: Self-supervised training."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AxQ9zBAh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/39aros0k882lpnsndrzz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AxQ9zBAh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/39aros0k882lpnsndrzz.png" alt="Slide 4: Self-attention in ViT."&gt;&lt;/a&gt;&lt;/p&gt;
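
&lt;p&gt;The snippet below is a heavily simplified, hypothetical sketch of the student/teacher agreement idea behind DINO, not the paper's actual recipe: two crops of the same image are fed to a student and a teacher network (any two networks with the same architecture, e.g., ViTs), the student is trained to match the teacher's output distribution, and the teacher is updated as a slow moving average of the student. The real method adds multi-crop, output centering, and temperature sharpening.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import torch
import torch.nn.functional as F

def dino_like_step(student, teacher, optimizer, crop_a, crop_b, momentum=0.996):
    # The teacher sees one crop; no gradients flow through it.
    with torch.no_grad():
        target = F.softmax(teacher(crop_a), dim=-1)

    # The student sees a different crop of the same image and must agree with the teacher.
    log_pred = F.log_softmax(student(crop_b), dim=-1)
    loss = -(target * log_pred).sum(dim=-1).mean()  # cross-entropy between the two distributions

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # The teacher drifts slowly toward the student (exponential moving average of weights).
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
    return loss.item()
&lt;/code&gt;&lt;/pre&gt;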

&lt;p&gt;Now, the bold prediction: self-supervised ViTs will eventually replace pre-trained CNNs as the go-to feature encoders for computer vision tasks. There are still unanswered questions, such as whether ViTs will generalize outside the training distribution better than CNNs. But one thing is certain: not requiring labels during training makes it possible to use much larger datasets. Consider the difference in scale between ImageNet and the entire internet of images available to a self-supervised ViT…&lt;/p&gt;

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Thanks for reading our latest paper exploration. If you love computer vision, check out zpy [4], our open-source synthetic data development toolkit. It's everything you need to generate and iterate on synthetic training data for computer vision. Your feedback, commits, and feature requests are invaluable as we continue to build a more robust set of tools for generating synthetic data. Meanwhile, if you could use support with a particularly tricky problem, please reach out.&lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;p&gt;[1] CS231n Convolutional Neural Networks for Visual Recognition - Convolutional Neural Networks (&lt;a href="https://cs231n.github.io/convolutional-networks/"&gt;https://cs231n.github.io/convolutional-networks/&lt;/a&gt;)&lt;br&gt;
[2] Transformer: A Novel Neural Network Architecture for Language Understanding (&lt;a href="https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html"&gt;https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html&lt;/a&gt;)&lt;br&gt;
[3] Emerging Properties in Self-Supervised Vision Transformers (&lt;a href="https://arxiv.org/pdf/2104.14294.pdf"&gt;https://arxiv.org/pdf/2104.14294.pdf&lt;/a&gt;)&lt;br&gt;
[4] zpy (github.com/ZumoLabs/zpy)&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Synthetic Data Experiments: Package Detection</title>
      <dc:creator>Hugo</dc:creator>
      <pubDate>Mon, 24 May 2021 14:34:15 +0000</pubDate>
      <link>https://forem.com/zumolabs/synthetic-data-experiments-package-detection-2mnf</link>
      <guid>https://forem.com/zumolabs/synthetic-data-experiments-package-detection-2mnf</guid>
      <description>&lt;p&gt;Having a package stolen is frustrating. As &lt;a href="https://youtu.be/h4T_LlK1VE4" rel="noopener noreferrer"&gt;Mark Rober has demonstrated&lt;/a&gt;, it can drive people to the edge of madness. But what if you could build your own package detection model using exclusively synthetic data? We’ve outlined the few short steps we took to go from synthetic data generation to a working detector.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhe2lu4xzytq9szdtexwk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhe2lu4xzytq9szdtexwk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Generate Synthetic Data&lt;/h2&gt;

&lt;p&gt;Synthetic data is generated from a simulation or “sim”—typically a scene that has been created from custom or stock 3D models. Sims can run in the cloud in parallel to create virtually infinite training data. I created a sim for package detection using open-source 3D graphics software Blender and &lt;a href="https://github.com/ZumoLabs/zpy" rel="noopener noreferrer"&gt;zpy&lt;/a&gt; [1]. In this sim, assorted 3D packages are spawned while the camera angle and lighting conditions are randomized. The resulting synthetic dataset is visually diverse and perfectly labeled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffi6d6osprxu5u88n2yod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffi6d6osprxu5u88n2yod.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Figure 1: Synthetic images of packages generated from a sim.&lt;/p&gt;
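
&lt;p&gt;As a hedged sketch of what "randomize and render" looks like, the loop below uses plain bpy rather than zpy's actual API; the object name, value ranges, and output path are made up, and it assumes a Blender scene that already contains the packages, a camera, and a light named "Light". Each iteration jitters the camera and light and writes out a new image.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import random
import bpy

scene = bpy.context.scene
camera = scene.camera
light = bpy.data.objects["Light"]  # assumes the scene has a light object named "Light"

for i in range(100):
    # Randomize the camera position and light intensity for every rendered frame.
    camera.location = (random.uniform(-2, 2), random.uniform(-6, -4), random.uniform(1, 3))
    light.data.energy = random.uniform(200, 2000)

    scene.render.filepath = f"/tmp/packages/image_{i:04d}.png"
    bpy.ops.render.render(write_still=True)
&lt;/code&gt;&lt;/pre&gt;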

&lt;h2&gt;Collect the Test Data&lt;/h2&gt;

&lt;p&gt;To test our model trained on the synthetic data, we are going to need to collect some real images. We found some on the internet and manually labeled them using a DIY labeling platform called &lt;a href="https://roboflow.com/" rel="noopener noreferrer"&gt;Roboflow&lt;/a&gt; [2]. Give it a try. After spending an hour drawing bounding boxes on images, take a moment to appreciate that nearly all training data has to be painstakingly manually labeled like that. It’s the sort of tedious work that folks in developing countries wind up being paid pennies for. Talk about a dystopian future…&lt;/p&gt;

&lt;h2&gt;Train the Model&lt;/h2&gt;

&lt;p&gt;Armed with our synthetic training dataset and our real test dataset, we are ready to do some model training. We used a ResNet variant implemented in PyTorch, from the Detectron2 GitHub repo [3]. This network was pre-trained on ImageNet, so we only needed to fine-tune it for a short while on our synthetic dataset before it was capable of making decent predictions. Not bad for such a small dataset (1,000 synthetic images) and such a short training time (30 minutes).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qlw3tehne3zwza7ag7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7qlw3tehne3zwza7ag7r.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Figure 2: Predictions from our neural network trained on synthetic data. False positives shown for context.&lt;/p&gt;
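
&lt;p&gt;For reference, here is a hedged sketch of what that fine-tuning looks like with Detectron2's high-level API. The dataset names, file paths, and iteration count are made up, and this particular config happens to use a COCO-pre-trained Faster R-CNN with a ResNet-50 backbone rather than whichever exact variant we trained.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the synthetic training set and the real test set (COCO-format annotations).
register_coco_instances("packages_synth_train", {}, "synth/annotations.json", "synth/images")
register_coco_instances("packages_real_test", {}, "real/annotations.json", "real/images")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("packages_synth_train",)
cfg.DATASETS.TEST = ("packages_real_test",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # a single "package" class
cfg.SOLVER.MAX_ITER = 1500           # a short fine-tuning run

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
&lt;/code&gt;&lt;/pre&gt;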

&lt;h2&gt;Closing Thoughts&lt;/h2&gt;

&lt;p&gt;These are great results for a first iteration. To improve model performance further, we could increase the size of the dataset, add more variety to the sim, or pick better hyperparameters for our model. Evaluating model performance on real test data and iterating is core to the synthetic data workflow. After all, the coolest thing about synthetic training data is that it’s ultimately dynamic data.&lt;/p&gt;

&lt;p&gt;For your next computer vision project, whether it be a hobby or your job, spare those poor manual data labelers and consider trying out the synthetic approach. We’ve made it easy for you: we’ve released our data development toolkit zpy [1] under an open source license. Now everything you need to generate and iterate synthetic data for computer vision is available for free. Your feedback, commits, and feature requests will be invaluable as we continue to build a more robust set of tools for generating synthetic data. Meanwhile, if you could use hands-on support with a particularly tricky problem, please reach out!&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;p&gt;[1] zpy (github.com/ZumoLabs/zpy)&lt;br&gt;
[2] Roboflow (roboflow.com)&lt;br&gt;
[3] Detectron2 (github.com/facebookresearch/detectron2)&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>What is Neural Rendering?</title>
      <dc:creator>Hugo</dc:creator>
      <pubDate>Tue, 04 May 2021 17:07:42 +0000</pubDate>
      <link>https://forem.com/zumolabs/what-is-neural-rendering-44m0</link>
      <guid>https://forem.com/zumolabs/what-is-neural-rendering-44m0</guid>
      <description>&lt;p&gt;As our world becomes increasingly digitized, the methods by which we render these virtual worlds are rapidly changing. Neural rendering has huge potential to improve many aspects of the rendering pipeline by leveraging generative machine learning techniques. What is neural rendering? In this article we'll introduce the concept, compare it to classical computer graphics, and discuss what it means for the future.&lt;/p&gt;

&lt;h2&gt;Classic Rendering&lt;/h2&gt;

&lt;p&gt;Creating 3D virtual worlds today is a complicated and involved process. Each item, or asset, in a virtual scene is represented by a polygon mesh (Slide 1). This polygon mesh can either be modeled by an artist or scanned into existence: both of these processes are manual and time-consuming. The more detailed we want a specific asset to be, the more polygons its mesh will have.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tv8di9rzm91gt9006oh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8tv8di9rzm91gt9006oh.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The polygon mesh is only the beginning. Each surface in this 3D world also has a corresponding material, which determines the appearance of the mesh. At runtime, the material and mesh of the object are used as inputs to shader programs, which calculate the appearance of the object under given lighting conditions and a specific camera angle (Slide 2). Over the years, many different shader programs have been developed, though the fundamental principle is the same: use the laws of physics to calculate the appearance of an object. This is most evident in the approach known as Ray Tracing, where every light ray is traced from its source down to every surface it bounces on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxoututppz3r1vmhviqqb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxoututppz3r1vmhviqqb.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This render pipeline can create amazing results: every CGI effect in every movie you have seen, and every game you have ever played, uses some form of this "classical computer graphics" pipeline. The main pain point of this pipeline lies in the huge amount of work required to explicitly define every object and every material, and the large computation required to render a realistic or complex scene. Which leads us to the question: what if we didn't have to define every object and calculate every light bounce?&lt;/p&gt;

&lt;h2&gt;Enter Neural Rendering&lt;/h2&gt;

&lt;p&gt;So, what is neural rendering? Though still a very young field, it has grown to encompass a large number of techniques; GANs, for example, are a form of neural rendering. The key concept behind neural rendering approaches is that they are differentiable. A differentiable function is one whose derivative exists at each point in the domain. This is important because machine learning is basically the chain rule with extra steps: a differentiable rendering function can be learned with data, one gradient descent step at a time. Learning a rendering function statistically from data is fundamentally different from the classic rendering methods we described above, which calculate and extrapolate from the known laws of physics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfy5vmw4nlrf7zhmbasd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfy5vmw4nlrf7zhmbasd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
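
&lt;p&gt;To ground the word "differentiable," here is a tiny, hypothetical sketch in PyTorch: a toy "renderer" with a single learnable parameter is optimized by gradient descent to match a target image, which is only possible because gradients flow through the rendering function. The shapes and learning rate are arbitrary.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import torch

target = torch.rand(8, 8)                            # the "photo" we want to reproduce
base = torch.rand(8, 8)                              # a fixed pattern standing in for the scene
brightness = torch.tensor(0.0, requires_grad=True)   # one learnable scene parameter

optimizer = torch.optim.SGD([brightness], lr=0.1)
for step in range(200):
    rendered = brightness * base                     # a toy differentiable "rendering function"
    loss = ((rendered - target) ** 2).mean()         # how far the render is from the photo
    optimizer.zero_grad()
    loss.backward()                                  # the chain rule, with extra steps
    optimizer.step()
&lt;/code&gt;&lt;/pre&gt;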

&lt;p&gt;One of the coolest flavors of neural rendering is novel view synthesis. In this problem, a neural network learns to render a scene from an arbitrary viewpoint. Slides 3 and 4 are figures from two great papers on this topic: one from Google Research [1] and the other from Facebook Reality Labs [2]. Both of these works use a volume rendering technique known as ray marching. Ray marching is when you shoot out a ray from the observer (camera) through a 3D volume in space and ask a function: what is the color and opacity at this particular point in space? Neural rendering takes the next step by using a neural network to approximate this function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d722f8rjbrvyybfh8ns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4d722f8rjbrvyybfh8ns.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;
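
&lt;p&gt;Below is a heavily simplified, hypothetical sketch of the ray-marching idea in PyTorch: sample points along a ray, ask a small network for a color and a density at each point, and alpha-composite the samples front to back. Real NeRF-style systems add positional encoding, view directions, hierarchical sampling, and much more; the network size and ray bounds here are arbitrary.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
import torch
import torch.nn as nn

# A small MLP that maps a 3D point to (r, g, b, density).
field = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 4))

def render_ray(origin, direction, num_samples=64, near=0.1, far=4.0):
    ts = torch.linspace(near, far, num_samples)   # depths along the ray
    points = origin + ts[:, None] * direction     # (num_samples, 3) sample locations
    out = field(points)
    rgb = torch.sigmoid(out[:, :3])               # colors in [0, 1]
    density = torch.relu(out[:, 3])               # non-negative opacity at each point
    delta = (far - near) / num_samples
    alpha = 1.0 - torch.exp(-density * delta)     # opacity of each small segment
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)  # light surviving so far
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)    # final composited pixel color

pixel = render_ray(torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]))
print(pixel)  # an (r, g, b) value for this ray
&lt;/code&gt;&lt;/pre&gt;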

&lt;h2&gt;The Future of Rendering&lt;/h2&gt;

&lt;p&gt;We really just scratched the surface when it comes to neural rendering. If you want to learn more, we recommend this super extensive summary paper [3]. But before we go, what could this mean for the future?&lt;br&gt;
With neural rendering, we no longer need to physically model the scene and simulate the light transport, as this knowledge is now stored implicitly inside the weights of a neural network. This means that it will be possible to render your face while you are wearing a VR headset (Slide 5), without ever having to store or deform a 3D polygon mesh of your face!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5xm1owmo5gsgh0gjgn5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5xm1owmo5gsgh0gjgn5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With neural rendering, the compute required to render an image is also no longer tied to the complexity of the scene (the number of objects, lights, and materials), but rather to the size of the neural network (the time required to perform a forward pass). This opens the door to really high-quality rendering at blazingly fast frame rates.&lt;br&gt;
If you're interested in the intersection of machine learning and 3D, please check out our open source synthetic data toolkit zpy [4]. Your feedback, commits, and feature requests will be invaluable as we continue to build a more robust set of tools for generating synthetic data. Who knows? Perhaps the next great neural rendering model will be trained using data generated with zpy.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;p&gt;[1] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (arxiv.org/pdf/2003.08934.pdf)&lt;br&gt;
[2] Neural Volumes: Learning Dynamic Renderable Volumes from Images (arxiv.org/pdf/1906.07751.pdf)&lt;br&gt;
[3] State of the Art on Neural Rendering (arxiv.org/pdf/2004.03805.pdf)&lt;br&gt;
[4] zpy: an open source synthetic data toolkit (github.com/ZumoLabs/zpy)&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Use Python and Blender to Make More Dynamic Training Data</title>
      <dc:creator>Hugo</dc:creator>
      <pubDate>Mon, 12 Apr 2021 21:25:35 +0000</pubDate>
      <link>https://forem.com/hugop/use-python-and-blender-to-make-more-dynamic-training-data-56jh</link>
      <guid>https://forem.com/hugop/use-python-and-blender-to-make-more-dynamic-training-data-56jh</guid>
      <description>&lt;p&gt;Tools that make synthetic data generation easy are fundamentally changing the way machine learning work is done. Iterating and improving the dataset over the course of a project is more important to project success than iterating the model architecture. That's why we are releasing &lt;a href="https://github.com/ZumoLabs/zpy"&gt;zpy&lt;/a&gt;, an open source synthetic data toolkit. All developers should have the option of working with dynamic data rather than static data.&lt;/p&gt;

&lt;h1&gt;Software 2.0&lt;/h1&gt;

&lt;p&gt;We are undergoing a phase change in the way software programming works [1]. As we replace our collective software stack with deep learning systems, we are going to fundamentally change many of the core abstractions and workflows that have been part of software development for decades.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ExRUQVaa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cx7nqqb701o6tbupfy76.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ExRUQVaa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cx7nqqb701o6tbupfy76.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1: Machine learning introduces a new programming paradigm [2].&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, many deep learning researchers are still stuck in the old software paradigm: they spend the majority of their time and effort designing and iterating the algorithm (“Rules” in Figure 1) while using a static dataset like MNIST or ImageNet. Those of us who make machine learning work in the real world, though, have already come to the realization that the most important part of getting something to work is making a good dataset (“Data” and “Answers” in Figure 1). The data and the labels are really where we should spend the majority of our time and effort.&lt;/p&gt;

&lt;p&gt;Deep learning algorithms are made of the same building blocks: layers of neurons arranged in clever patterns. The exact arrangement of those neurons, and the long list of accompanying tricks and widgets, has been described as alchemy [3]. Researchers spend a huge amount of effort discovering the arrangements that work best, often keeping the dataset static so they can compare these arrangements quantitatively. In the real world, however, engineers often do the opposite: they figure out how to get better data while simply using whatever arrangement is popular at the time.&lt;/p&gt;

&lt;p&gt;This creates a huge need for tools that make it simple to modify, adjust, and create more training data, a need that is being met by the dynamic nature of synthetic data generation. Synthetic data makes it easy to change the annotation style, or to add an additional label which can be used as an extra training loss for the model. It also makes it easy to generate more examples of a specific edge case that may be causing issues in production. Synthetic data generation and iteration should be easy, and should be used in concert with adjustments to the model in order to achieve one’s goals.&lt;/p&gt;

&lt;h1&gt;Open Source&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;“Free software” means software that respects users’ freedom and community. Roughly, it means that the users have the freedom to run, copy, distribute, study, change and improve the software. Thus, “free software” is a matter of liberty, not price. To understand the concept, you should think of “free” as in “free speech,” not as in “free beer”. [4]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;People want to be able to shape and influence the tools they use. The best way to empower them to do that is to build those tools out in the open. The future of data creation, and thus the future of software, will be open core tools that are created in part by the developer community.&lt;/p&gt;

&lt;p&gt;The best argument for this type of development is the growing popularity of the open core model in the software startup scene. Open core is based on the idea that the “core” of the software stack is open source and freely available online. Startups that adopt this paradigm sustain themselves by selling additional services or features on top of this open core. This stands in contrast to the more common SaaS business model, where all software is proprietary and effectively rented out to users.&lt;/p&gt;

&lt;h1&gt;Dynamic Data&lt;/h1&gt;

&lt;p&gt;Dynamic data is the future of training deep learning systems. Open source is the future of programming. That’s why we have decided to release &lt;a href="https://github.com/ZumoLabs/zpy"&gt;our data development toolkit zpy&lt;/a&gt; [5] under an open source license. Now everything you need to generate and iterate synthetic data for computer vision is available for free.&lt;/p&gt;

&lt;p&gt;But this is just the beginning of the phase shift we mentioned earlier. Your feedback, commits, and feature requests will be invaluable as we continue to build a more robust set of tools for generating synthetic data. Meanwhile, if you could use hands-on support with a particularly tricky problem, please &lt;a href="https://www.zumolabs.ai/contact?utm_source=dev.to&amp;amp;utm_medium=post"&gt;reach out&lt;/a&gt;!&lt;/p&gt;

&lt;h1&gt;References&lt;/h1&gt;

&lt;p&gt;[1] &lt;a href="https://youtu.be/y57wwucbXR8"&gt;Building the Software 2.0 Stack&lt;/a&gt;. Video lecture by Andrej Karpathy.&lt;br&gt;
[2] Deep Learning with Python. Book by Francois Chollet.&lt;br&gt;
[3] &lt;a href="https://youtu.be/x7psGHgatGM"&gt;Machine Learning has become Alchemy&lt;/a&gt;. Video Lecture by Ali Rahimi.&lt;br&gt;
[4] &lt;a href="https://www.gnu.org/philosophy/free-sw.html"&gt;“What is free software?”&lt;/a&gt;. Article by the GNU Operating System.&lt;br&gt;
[5] &lt;a href="https://github.com/ZumoLabs/zpy"&gt;zpy&lt;/a&gt;: an open source synthetic data toolkit.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>deeplearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
