Groundbreaking Computer Vision Research at CVPR 2020

July 9, 2020

6 papers pushing the boundaries of computer vision

Every year, the Conference on Computer Vision and Pattern Recognition (CVPR) celebrates the most innovative research in the field. This highly selective conference is widely regarded as the most prestigious in computer vision. While it usually rotates among locations across the US, the 2020 event was held online as a result of the COVID-19 pandemic.

In this blog post, we take a closer look at 6 groundbreaking research papers that were accepted to CVPR 2020, and that Algolux researchers – led by our CTO Felix Heide – were involved with.

  1. Hardware-in-the-Loop End-to-End Optimization of Camera Image Processing Pipelines
  2. Seeing Around Street Corners: Non-Line-of-Sight Detection and Tracking In-the-Wild Using Doppler Radar
  3. Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather
  4. Defending Against Universal Attacks Through Selective Feature Regeneration
  5. Single-shot Monocular RGB-D Imaging using Uneven Double Refraction
  6. Learning Rank-1 Diffractive Optics for Single-shot High Dynamic Range Imaging

Hardware-in-the-Loop End-to-End Optimization of Camera Image Processing Pipelines

Cameras are the sensor of choice for system developers of safety-critical applications. However, camera development currently relies on expert imaging teams to manually tune camera architectures. This painstaking approach can take months, requires hard-to-find deep expertise, and depends on visual subjectivity. As such, this process does not ensure that the camera provides the optimal output for computer vision algorithms.

The research paper “Hardware-in-the-Loop End-to-End Optimization of Camera Image Processing Pipelines” describes a new approach that applies a stochastic optimization method to the challenging problem of hardware-software co-design of cameras and computer vision algorithms, and compares the results against baseline camera systems. This method addresses a longstanding validation and optimization challenge that handcrafted intermediate image metrics and compartmentalized design have struggled to solve.

The researchers improved end-to-end losses compared to manual adjustment and existing approximation-based approaches. This was done for multiple camera configurations and computer vision models such as object detection, instance segmentation, and panoptic segmentation.

For automotive 2D object detection, the new method outperformed manual expert tuning by 30% mAP. It also outperformed recent methods using ISP approximations by 18% mAP.
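Because the hardware ISP is a black box whose gradients are unavailable, the optimization must rely on loss evaluations alone. The minimal sketch below illustrates the idea with a zeroth-order (perturb-and-accept) search; the parameter names and the toy quadratic loss are hypothetical stand-ins for the real ISP knobs and the capture-to-detection loss.

```python
import random

# Hypothetical ISP parameters (e.g., denoise strength, sharpening, gamma).
# A real pipeline exposes many more knobs; these names are illustrative.
params = {"denoise": 0.5, "sharpen": 0.5, "gamma": 0.5}

def end_to_end_loss(p):
    # Stand-in for: capture -> hardware ISP(p) -> detector -> task loss.
    # A toy quadratic with a known optimum replaces the black box here.
    target = {"denoise": 0.3, "sharpen": 0.7, "gamma": 0.45}
    return sum((p[k] - target[k]) ** 2 for k in p)

def zeroth_order_step(p, sigma=0.05):
    # Perturb every parameter; keep the candidate only if the loss improves.
    cand = {k: min(1.0, max(0.0, v + random.gauss(0, sigma)))
            for k, v in p.items()}
    return cand if end_to_end_loss(cand) < end_to_end_loss(p) else p

random.seed(0)
for _ in range(500):
    params = zeroth_order_step(params)

print(end_to_end_loss(params))
```

The accept-if-better rule guarantees the loss never increases, which is what makes such gradient-free methods workable when each evaluation means a real capture through real hardware.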

Seeing Around Street Corners: Non-Line-of-Sight Detection and Tracking In-the-Wild Using Doppler Radar

Sensors and cameras are making cars safer and safer. Unfortunately, the accuracy of today’s perception systems is still a major challenge. Some researchers, however, are pushing the boundaries of perception to what is outside the direct line of sight. Indeed, Felix Heide and his collaborators have found a way for vehicles to perceive moving objects around street corners.

Conventional sensor systems are currently not good enough because they only record information about directly visible objects. Occluded scene components, on the other hand, are considered lost in the measurement process. Non-line-of-sight (NLOS) methods try to recover such hidden objects from their indirect reflections but current approaches struggle to record these low-signal components outside the lab. Additionally, they do not scale to large-scale outdoor scenes and high-speed motion, typical in automotive scenarios.

In this research paper, the idea is to depart from visible-wavelength approaches and demonstrate detection, classification, and tracking of hidden objects in large-scale dynamic environments using Doppler radars that can be manufactured at low-cost in series production. A Doppler radar is a specialized radar that uses the Doppler effect to produce velocity data about objects at a distance. It bounces a microwave signal off a desired target and analyzes how the object’s motion has altered the frequency of the returned signal.
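The Doppler relationship described above is simple to state numerically. The sketch below converts a measured frequency shift into a radial velocity using the standard two-way radar Doppler formula; the 77 GHz carrier and 5 kHz shift are illustrative values, not figures from the paper.

```python
C = 3.0e8  # speed of light, m/s

def radial_velocity(freq_shift_hz, carrier_hz):
    # Two-way Doppler: the wave is shifted once on the way out and once
    # on reflection, hence the factor of 2 in the denominator.
    return C * freq_shift_hz / (2.0 * carrier_hz)

# A 77 GHz automotive radar observing a 5 kHz Doppler shift:
v = radial_velocity(5e3, 77e9)
print(round(v, 2))  # 9.74 m/s toward the radar
```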

Here, researchers used static building facades or parked vehicles as relay walls to jointly classify, reconstruct, and track occluded objects. To untangle noisy indirect and direct reflections, they leveraged temporal sequences of Doppler velocity and position measurements, fusing them over time in a joint NLOS detection and tracking network.
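Geometrically, an object seen via a relay wall appears as a virtual image behind the wall. Under the simplifying assumption of a purely specular, planar relay surface (a toy version of the geometry, not the paper's full measurement model), recovering the hidden object's position amounts to mirroring the virtual detection back across the wall plane:

```python
def unfold_across_wall(point, wall_point, wall_normal):
    # Mirror a virtual detection across the relay-wall plane to recover
    # the hidden object's position. `wall_normal` must be unit length.
    d = sum((p - w) * n for p, w, n in zip(point, wall_point, wall_normal))
    return tuple(p - 2 * d * n for p, n in zip(point, wall_normal))

# Virtual detection at (2, 3), relay wall through the origin facing +x:
print(unfold_across_wall((2, 3), (0, 0), (1, 0)))  # (-2, 3)
```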

The proposed approach enables collision warnings for pedestrians and cyclists in real-world autonomous driving scenarios before they become visible to direct line-of-sight sensors. This will let cars perceive occluded objects that today's lidar and camera sensors cannot record, for example allowing a self-driving vehicle to see around a dangerous intersection.

Seeing Through Fog Without Seeing Fog: Deep Multimodal Sensor Fusion in Unseen Adverse Weather

Harsh conditions – and especially fog – remain a major challenge for vision systems. The fusion of multimodal sensor streams, such as camera, lidar, and radar measurements, plays a critical role in object detection for autonomous vehicles, which base their decision making on these inputs. Existing methods exploit redundant information in good environmental conditions but fail in adverse weather where the sensory streams can be asymmetrically distorted.

Available datasets do not represent these rare “edge-case” scenarios, and existing fusion architectures cannot handle them. To address this challenge, researchers introduced a multimodal dataset acquired over more than 10,000 km of driving in northern Europe. While it is the first large multimodal dataset in adverse weather, with 100k labels for lidar, camera, radar, and gated NIR sensors, it cannot on its own support supervised training for every distortion, because extreme weather is rare.

To solve this, researchers came up with a deep fusion network for robust fusion without a large corpus of labeled training data covering all asymmetric distortions. Departing from proposal-level fusion, they proposed a single-shot model that adaptively fuses features, driven by measurement entropy.
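The intuition behind entropy-steered fusion is that a degraded sensor stream (say, a lidar return washed out by fog) carries less information, which shows up as lower entropy in its measurements. The toy sketch below, with made-up histograms and a simple entropy-proportional weighting standing in for the paper's learned fusion, illustrates the idea:

```python
import math

def entropy(hist):
    # Shannon entropy of an intensity histogram; a fogged-out stream
    # tends to produce a low-entropy, low-contrast histogram.
    total = sum(hist)
    probs = [h / total for h in hist if h > 0]
    return -sum(p * math.log2(p) for p in probs)

def fuse(features, entropies):
    # Weight each sensor's feature vector by its normalized entropy so
    # that degraded streams contribute less to the fused result.
    z = sum(entropies)
    weights = [e / z for e in entropies]
    return [sum(w * f[i] for w, f in zip(weights, features))
            for i in range(len(features[0]))]

cam_hist = [10, 20, 30, 25, 15]  # contrasty camera image
fog_hist = [95, 5, 0, 0, 0]      # washed-out return in dense fog
print(entropy(cam_hist) > entropy(fog_hist))  # True
```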

Researchers demonstrate that it is possible to learn multimodal fusion for extreme adverse weather conditions from clean data only.

Defending Against Universal Attacks Through Selective Feature Regeneration

Deep neural network (DNN) predictions can be vulnerable to carefully crafted adversarial perturbations. Specifically, image-agnostic (universal adversarial) perturbations added to any image can fool a target network into making erroneous predictions. For instance, as we saw last year, hackers were able to fool Tesla’s Autopilot with nothing more than… stickers.

Departing from existing defense strategies that work mostly in the image domain, researchers introduced a novel defense that operates in the DNN feature domain. This method effectively defends against such universal perturbations.

Their approach identifies pre-trained convolutional features that are most vulnerable to adversarial noise and deploys trainable feature regeneration units. These transform the DNN filter activations into resilient features that are robust to universal perturbations.
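The selection step can be sketched as follows: rank the channels of a layer's activations by a susceptibility score, pass only the top fraction through a regeneration unit, and leave the rest untouched. The susceptibility scores and the trivial "regenerator" below are placeholders for the paper's learned components.

```python
def regenerate(activations, susceptibility, regen_fn, frac=0.5):
    # Rank channels by susceptibility; pass the top `frac` through the
    # regeneration unit and keep all remaining channels unchanged.
    k = int(len(activations) * frac)
    ranked = sorted(range(len(activations)),
                    key=lambda i: susceptibility[i], reverse=True)
    out = list(activations)
    for i in ranked[:k]:
        out[i] = regen_fn(activations[i])
    return out

acts = [1.0, -2.0, 0.5, 3.0]
susc = [0.9, 0.1, 0.8, 0.2]  # channels 0 and 2 are most vulnerable
print(regenerate(acts, susc, lambda a: 0.0))  # [0.0, -2.0, 0.0, 3.0]
```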

By regenerating only the top 50% of adversarially susceptible activations in at most 6 DNN layers, and leaving all remaining activations unchanged, the researchers outperform existing defense strategies across different network architectures by more than 10% in restored accuracy.

Researchers show that without any additional modification, their defense trained on ImageNet with one type of universal attack examples effectively defends against other types of unseen universal attacks.

Single-shot Monocular RGB-D Imaging using Uneven Double Refraction

Cameras that capture color and depth information have become an essential imaging modality for applications in robotics, autonomous driving, and virtual and augmented reality. Existing RGB-D cameras rely on multiple sensors or active illumination with specialized sensors.

In this work, researchers introduce a novel method for monocular single-shot RGB-D imaging. Instead of learning depth from single-image depth cues, they revisit double-refraction imaging using a birefractive medium, measuring depth as the displacement of differently refracted images superimposed in a single capture.
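The depth cue here is analogous to stereo disparity: the displacement between the two refracted copies of the scene shrinks as objects move farther away. The toy sketch below uses stereo-style triangulation to convey the analogy; the true birefringent image formation is more involved, and the focal length and baseline values are illustrative only.

```python
def depth_from_shift(shift_px, focal_px, baseline_mm):
    # Stereo-style triangulation stand-in: the displacement between the
    # ordinary and extraordinary images plays the role of disparity.
    return focal_px * baseline_mm / shift_px

# A 10 px shift with a 1000 px focal length and 5 mm effective baseline:
print(depth_from_shift(10.0, 1000.0, 5.0))  # 500.0 mm
```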

However, existing double-refraction methods cannot be used in real-time applications – such as robotics – because they are orders of magnitude too slow. Additionally, they provide only inaccurate depth due to correspondence ambiguity in double refraction.

It is possible to resolve this ambiguity optically by leveraging the orthogonality of the two linearly polarized rays in double refraction – introducing uneven double refraction by adding a linear polarizer to the birefractive medium.

Doing so made it possible to develop a method that reconstructs sparse depth and color simultaneously in real time.

Learning Rank-1 Diffractive Optics for Single-shot High Dynamic Range Imaging

High-dynamic-range (HDR) imaging is an essential imaging modality for a wide range of applications in uncontrolled environments. This includes autonomous driving, robotics, and mobile phone cameras.

However, existing HDR techniques in commodity devices struggle with dynamic scenes due to multi-shot acquisition and post-processing time, e.g. mobile phone burst photography, making such approaches unsuitable for real-time applications.

In this work, researchers developed a method for snapshot HDR imaging by learning an optical HDR encoding in a single image that maps saturated highlights into neighboring unsaturated areas using a diffractive optical element (DOE).

Their novel rank-1 parameterization of the DOE drastically reduces the optical search space. It also allows them to efficiently encode high-frequency detail. They propose a reconstruction network tailored to this rank-1 parametrization for the recovery of clipped information from the encoded measurements.
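The key saving of a rank-1 parameterization is easy to see: an N x N height map expressed as the outer product of two length-N vectors has 2N free parameters instead of N². A minimal sketch (plain Python, with tiny illustrative vectors rather than a real DOE design):

```python
def rank1_heightmap(u, v):
    # Rank-1 parameterization: the N x M DOE height map is the outer
    # product of two vectors, cutting the search space from N*M free
    # parameters down to N + M.
    return [[ui * vj for vj in v] for ui in u]

H = rank1_heightmap([1, 2], [3, 4, 5])
print(H)  # [[3, 4, 5], [6, 8, 10]]
```

For a megapixel-scale optic, this reduction is what makes a joint search over the optics and the reconstruction network tractable.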

Researchers validated the end-to-end framework in simulation and in real-world experiments, improving PSNR by more than 7 dB over state-of-the-art end-to-end designs.
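To put the 7 dB figure in perspective: PSNR is logarithmic in the mean-squared error, so a +7 dB gain corresponds to roughly a 5x reduction in MSE (10^(7/10) ≈ 5.01). A quick sketch of the standard definition:

```python
import math

def psnr(mse, peak=1.0):
    # Peak signal-to-noise ratio in dB for images with values in
    # [0, peak]; +7 dB corresponds to ~5x lower mean-squared error.
    return 10.0 * math.log10(peak ** 2 / mse)

print(round(psnr(0.01), 2))  # 20.0
```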

Furthermore, they show that their network’s ability to remove streak encodings can also be applied to other types of streaks introduced by grating-like optics.

For instance, front-facing automotive cameras suffer from glare induced by thin lines of dust and dirt remaining on the windshield after wiping. These thin streaks of dust produce glare streaks that vary with the wiping pattern on a curved windshield.

Removing these streaks can improve autonomous driving at night time. Researchers trained their network to remove these types of streaks and demonstrated successful removal.