In recent years, video conferencing has played an increasingly important role in both work and personal communication for many users. Over the past two years, we’ve improved the Google Meet experience by introducing privacy-preserving machine learning (ML) background features, also known as a “virtual green screen,” that allow users to blur their background or replace it with other images. What is unique about this solution is that it runs directly in the browser, with no additional software to install.

Until now, these ML-powered features relied on CPU inference made possible by leveraging neural network sparsity, a general solution that works across devices from entry-level PCs to high-end workstations. This allows our features to reach the widest possible audience. However, mid-range and high-end devices often have powerful GPUs that go unused for ML inference, and existing functionality allows web browsers to access GPUs via shaders (WebGL).

With the latest update to Google Meet, we now use the power of GPUs to greatly improve the accuracy and performance of these background effects. As we detail in “Efficient heterogeneous video edge segmentation”, these advances are driven by two main components: 1) a new real-time video segmentation model and 2) a new, highly efficient approach to accelerating ML in the browser using WebGL. We use this capability to run fast ML inference via fragment shaders. This combination yields significant gains in accuracy and latency, leading to crisper foreground boundaries.

CPU segmentation vs. HD segmentation in Meet.

Moving to higher quality video segmentation models
To predict finer details, our new segmentation model now works with high-definition (HD) input images instead of lower-resolution images, effectively doubling the resolution of the previous model. To accommodate this, the model must have a higher capacity to extract features with sufficient detail. Roughly speaking, doubling the input resolution quadruples the computational cost during inference.
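
The scaling claim above can be checked with a quick count of multiply-accumulate operations (MACs) for a single convolution layer. This is a minimal sketch: the two resolutions are the ones this post reports for the previous and new models, while the kernel size and channel counts are hypothetical values chosen only for illustration.

```python
def conv2d_macs(height, width, kernel, c_in, c_out):
    """MACs for one stride-1, same-padded convolution layer.

    Every output pixel needs kernel * kernel * c_in multiplications
    for each of the c_out output channels.
    """
    return height * width * kernel * kernel * c_in * c_out


# Hypothetical layer shape (3x3 kernel, 16 -> 32 channels), not the real model.
low_res = conv2d_macs(height=144, width=256, kernel=3, c_in=16, c_out=32)
high_res = conv2d_macs(height=288, width=512, kernel=3, c_in=16, c_out=32)

print(high_res / low_res)  # 4.0
```

Because the cost is proportional to height times width, the factor of four holds for any layer shape, as long as the layer itself is unchanged.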

Running high-resolution model inference on the CPU is not feasible for many devices. A CPU may have a few high-performance cores that let it execute arbitrary complex code efficiently, but it is limited in its ability to perform the parallel computation required for HD segmentation. In contrast, GPUs have many relatively low-performance cores coupled with a wide memory interface, making them uniquely suited to high-resolution convolutional models. Therefore, for mid-range and high-end devices, we adopt a significantly faster pure GPU pipeline that is integrated using WebGL.

This change inspired us to revisit some of the previous design decisions for the model architecture.

  • Backbone: We compared several widely used backbones for on-device networks and found that EfficientNet-Lite is better suited for GPUs because it removes the squeeze-and-excitation block, a component that is inefficient under WebGL (more below).
  • Decoder: We switched to a multilayer perceptron (MLP) decoder consisting of 1×1 convolutions instead of using simple bilinear upsampling or the more expensive squeeze-and-excitation blocks. MLP has been successfully adopted in other segmentation architectures such as DeepLab and PointRend, and is efficient for both CPU and GPU computation.
  • Model size: With our new WebGL inference and GPU-friendly model architecture, we were able to afford a larger model without sacrificing the real-time frame rate required for smooth video segmentation. We explored the width and depth parameters using neural architecture search.
HD Segmentation Model Architecture.

Overall, these changes substantially improve the mean Intersection over Union (IoU) metric by 3%, resulting in less uncertainty and crisper boundaries around hair and fingers.
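
For readers unfamiliar with the metric, the following is a minimal sketch of IoU on binary foreground masks. The tiny masks are made up for illustration, and averaging over mask pairs is an assumption: it is one common convention, and this post does not say which averaging the reported numbers use.

```python
def iou(pred, target):
    """Intersection over Union of two binary masks given as lists of 0/1 rows."""
    pairs = [(p, t) for p_row, t_row in zip(pred, target)
             for p, t in zip(p_row, t_row)]
    intersection = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    # Two empty masks agree perfectly, so treat that case as IoU = 1.
    return intersection / union if union else 1.0


def mean_iou(mask_pairs):
    """Average IoU over (prediction, ground truth) pairs (assumed convention)."""
    return sum(iou(p, t) for p, t in mask_pairs) / len(mask_pairs)


# Hypothetical 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

print(iou(pred, target))                               # 0.5
print(mean_iou([(pred, target), (target, target)]))    # 0.75
```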

We have also released the accompanying model card for this segmentation model, which details our fairness evaluations. Our analysis shows that the model is consistent in its performance across regions, skin tones and genders, with only minor deviations in IoU metrics.

Model           Resolution   Inference    IoU     Latency (ms)
CPU segmenter   256×144      Wasm SIMD    94.0%   8.7
GPU segmenter   512×288      WebGL        96.9%   4.3
Comparison of the previous segmentation model vs. the new HD segmentation model on a MacBook Pro (2018).
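
As a rough reading of the comparison above, the sketch below converts the reported resolutions and latencies into pixels processed per millisecond. It uses only the figures from the comparison; the derived throughput is back-of-the-envelope arithmetic, not a measurement from the original work, and it ignores every cost outside model inference.

```python
# (width, height, latency in ms) as reported in the comparison above.
segmenters = {
    "CPU segmenter": (256, 144, 8.7),
    "GPU segmenter": (512, 288, 4.3),
}


def pixels_per_ms(width, height, latency_ms):
    """Input pixels processed per millisecond of inference."""
    return width * height / latency_ms


throughput = {name: pixels_per_ms(*row) for name, row in segmenters.items()}
speedup = throughput["GPU segmenter"] / throughput["CPU segmenter"]

for name, value in throughput.items():
    print(f"{name}: {value:.0f} pixels/ms")
print(f"GPU/CPU throughput ratio: {speedup:.1f}x")  # roughly 8x
```

In other words, the new pipeline handles four times as many pixels in about half the time.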

Accelerating Web ML with WebGL
A common challenge with web-based inference is that web technologies can incur a performance penalty compared to applications running natively on the device. For GPUs, this penalty is substantial: web-based inference achieves only about 25% of native OpenGL performance. This is because WebGL, the current GPU standard for web-based inference, was designed primarily for rendering images, not for arbitrary ML workloads. In particular, WebGL does not include compute shaders, which allow for general-purpose computation and enable ML workloads in mobile and native applications.

To overcome this challenge, we accelerated low-level neural network kernels with fragment shaders, which typically compute the output properties of a pixel such as color and depth, and then applied novel optimizations inspired by the graphics community. Since ML workloads on GPUs are often bound by memory bandwidth rather than computation, we focused on rendering techniques that would improve memory access, such as multiple render targets (MRT).

MRT is a feature in modern GPUs that allows images to be rendered to multiple output textures (OpenGL objects that represent images) at once. While MRT was originally designed to support advanced graphics rendering such as deferred shading, we found that we could use this feature to drastically reduce the memory bandwidth usage of our fragment shader implementations for critical operations such as convolutions and fully connected layers. We do this by treating the intermediate tensors as multiple OpenGL textures.

In the figure below, we show an example of intermediate tensors, each backed by four underlying GL textures. With MRT, the number of GPU threads, and thus effectively the number of memory requests for weights, is reduced by a factor of four, which saves memory bandwidth. Although this introduces considerable complexity in the code, it helps us reach over 90% of native OpenGL performance, closing the gap with native apps.

Left: Classic Conv2D implementation with 1-to-1 correspondence of tensor and OpenGL texture. Red, yellow, green, and blue boxes indicate different locations within a texture for intermediate tensors A and B. Right: Our implementation of Conv2D with MRT, where intermediate tensors A and B are implemented with a set of 4 GL textures each, depicted as red, yellow, green and blue boxes. Note that this reduces the number of weight requests by a factor of 4.
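
The factor-of-four saving can be illustrated with a small CPU-side simulation. This is a simplified counting model rather than the actual WebGL implementation: it assumes a 1×1 convolution, that each simulated GPU thread fetches the full weight set once, and that the four textures hold the four quadrants of the tensor. The quadrant layout and all function names are hypothetical.

```python
import random


def apply_weights(texel, weights):
    """1x1 convolution at one location: mix input channels into output channels."""
    return [sum(texel[i] * weights[i][o] for i in range(len(weights)))
            for o in range(len(weights[0]))]


def conv1x1_classic(tensor, weights):
    """One texture, one simulated thread per texel; each thread fetches all weights."""
    fetches = 0
    output = []
    for row in tensor:
        out_row = []
        for texel in row:
            fetches += len(weights) * len(weights[0])
            out_row.append(apply_weights(texel, weights))
        output.append(out_row)
    return output, fetches


def split_into_textures(tensor):
    """Split an H x W tensor into 4 quadrant textures (hypothetical layout)."""
    hh, hw = len(tensor) // 2, len(tensor[0]) // 2
    return [[row[qx * hw:(qx + 1) * hw] for row in tensor[qy * hh:(qy + 1) * hh]]
            for qy in range(2) for qx in range(2)]


def conv1x1_mrt(textures, weights):
    """One thread per position writes to all 4 render targets, reusing its weights."""
    hh, hw = len(textures[0]), len(textures[0][0])
    fetches = 0
    outputs = [[[None] * hw for _ in range(hh)] for _ in textures]
    for y in range(hh):
        for x in range(hw):
            fetches += len(weights) * len(weights[0])  # fetched once per thread
            for t, texture in enumerate(textures):
                outputs[t][y][x] = apply_weights(texture[y][x], weights)
    return outputs, fetches


random.seed(0)
H, W, C_IN, C_OUT = 4, 6, 3, 2  # small, even-sized toy tensor
tensor = [[[random.random() for _ in range(C_IN)] for _ in range(W)]
          for _ in range(H)]
weights = [[random.random() for _ in range(C_OUT)] for _ in range(C_IN)]

classic_out, classic_fetches = conv1x1_classic(tensor, weights)
mrt_out, mrt_fetches = conv1x1_mrt(split_into_textures(tensor), weights)

assert split_into_textures(classic_out) == mrt_out  # same result either way
print(classic_fetches // mrt_fetches)  # 4
```

Both paths produce identical outputs; the only difference in this model is that the MRT path launches a quarter of the threads, so the weights are requested a quarter as often.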

We’ve made rapid strides in improving the quality of real-time segmentation models by leveraging GPUs on mid-range and high-end devices for use with Google Meet. We look forward to the opportunities that will be enabled by upcoming technologies such as WebGPU, which brings compute shaders to the web. Beyond GPU inference, we are also working on improving segmentation quality for lower-powered devices with quantized inference via XNNPACK WebAssembly.

Special thanks to those on the Meet team and others who worked on this project, especially Sebastian Jansson, Sami Kaliomaki, Rikard Lundmark, Stefan Reiter, Fabian Bergmark, Ben Wagner, Stefan Holmer, Dan Gunnarsson, Stefan Hulaud, and to all our team members who made this possible: Siargey Pisarchyk, Raman Sarokin, Artsiom Alavatski, Jamie Lin, Tyler Mullen, Gregory Karpiak, Andrei Kulik, Karthik Raveendran, Trent Tolley, and Matthias Grundmann.
