3D Images and Deep Learning

Dibyendu Biswas
5 min read · Jun 27, 2021

With the development of AR/VR and self-driving cars, 3D vision is becoming more and more important, since it provides much richer information than 2D. A 3D image measures one more dimension: depth.

Representation of 3D data

3D images can be represented in the following ways:

  • A point cloud is a collection of points in three-dimensional space; each point is defined by an (x, y, z) position, and we can also attach other attributes (such as RGB color) to it. Point clouds are the raw form in which lidar data is acquired.
  • A voxel grid is derived from a point cloud: we can regard it as a point cloud quantized onto cells of fixed size.

  • A polygonal mesh is a collection of vertices, edges, and faces that defines an object’s surface in three dimensions. It can capture fine detail in a fairly compact representation.

  • A multi-view representation is a collection of two-dimensional images of a polygon mesh, rendered from different simulated viewpoints.

Issues with various 3D representations

Point Cloud

  • Point clouds are unordered: the same set of points in a different order still represents the same 3D object, so a model must be invariant to the ordering of the points.
  • A CNN cannot be applied directly, because the points do not lie on a regular grid.

Three strategies to deal with this:

  1. Sort the points.
  2. Augment training with multiple permutations of the points.
  3. Use a symmetric function to aggregate the information from each point. A symmetric function, such as + or *, returns the same result regardless of the order of its arguments.

In PointNet, max pooling is used as the symmetric function: for each feature it keeps the strongest response across all points, so the result is invariant to the order of the points.
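As a small illustration, here is a minimal sketch of that idea in PyTorch: a shared per-point MLP followed by max pooling over the points. The layer sizes are illustrative, not taken from the PointNet paper.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP + max pooling over points (order-invariant)."""
    def __init__(self, num_classes=10):
        super().__init__()
        # The same MLP is applied to every point independently.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, points):                 # points: (batch, n_points, 3)
        per_point = self.point_mlp(points)     # (batch, n_points, 256)
        # Max pooling over the point dimension is a symmetric function,
        # so permuting the points does not change the output.
        global_feat, _ = per_point.max(dim=1)  # (batch, 256)
        return self.classifier(global_feat)

# Permuting the points leaves the prediction unchanged.
net = TinyPointNet()
cloud = torch.randn(1, 1024, 3)
perm = cloud[:, torch.randperm(1024), :]
assert torch.allclose(net(cloud), net(perm), atol=1e-5)
```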

Voxel Grid

VoxNet is a deep learning architecture for classifying 3D point clouds using a probabilistic occupancy grid, in which each voxel holds the probability that it is occupied. One advantage of this is that it allows the network to distinguish between voxels that are known to be free and voxels whose occupancy is unknown.
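As a rough illustration of voxelization (not VoxNet’s actual lidar-based occupancy mapping), here is a minimal NumPy sketch that quantizes a point cloud into a fixed-size binary occupancy grid, marking a voxel as occupied if it contains at least one point.

```python
import numpy as np

def voxelize(points, grid_size=32):
    """Quantize an (N, 3) point cloud into a binary occupancy grid."""
    # Normalize points into the unit cube [0, 1).
    mins, maxs = points.min(axis=0), points.max(axis=0)
    scaled = (points - mins) / (maxs - mins + 1e-9)
    # Map each point to a voxel index and mark that voxel as occupied.
    idx = np.clip((scaled * grid_size).astype(int), 0, grid_size - 1)
    grid = np.zeros((grid_size, grid_size, grid_size), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid

cloud = np.random.rand(2048, 3)
occupancy = voxelize(cloud)          # shape (32, 32, 32), ready for a 3D CNN
print(occupancy.sum(), "of", occupancy.size, "voxels occupied")
```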

However, the voxel-grid representation is wasteful, because the underlying data is sparse: the fraction of useful voxels decreases as the resolution increases. Second, if the points describing a complex structure are very close together, they are binned into the same voxel and fine detail is lost. Third, compared with point clouds of sparse environments, voxel grids may cause unnecessarily high memory usage, because they actively spend memory on representing free and unknown space, while a point cloud contains only the known points.

VoxNet is a simple and reasonable approach, but it may not be a good choice for complex datasets.

Polygonal Mesh

A CNN cannot be applied to it directly, because a mesh is an irregular structure of vertices and faces rather than a regular grid.

Multi-View Representation

The multi-view representation is the simplest way to apply a deep learning model to a three-dimensional scene: each rendered view can be processed by an ordinary 2D CNN, and the per-view features combined.
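A minimal PyTorch sketch of that idea, in the spirit of MVCNN-style view pooling: a shared 2D CNN processes every rendered view, and the per-view features are max-pooled before classification. The tiny backbone here is a placeholder, not an architecture from any particular paper.

```python
import torch
import torch.nn as nn

class MultiViewNet(nn.Module):
    """Shared 2D CNN per view + max pooling across views."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(              # placeholder 2D CNN
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, views):                        # views: (batch, n_views, 3, H, W)
        b, v, c, h, w = views.shape
        feats = self.backbone(views.reshape(b * v, c, h, w))   # (b*v, 16)
        feats = feats.reshape(b, v, -1).max(dim=1).values      # pool over views
        return self.classifier(feats)

model = MultiViewNet()
renders = torch.randn(2, 12, 3, 64, 64)   # 12 rendered views per object
logits = model(renders)                   # (2, 10)
```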

3D Object and Deep Learning

Since deep learning networks are trained on many previous samples of a 3D object class, they can extrapolate the shape of a new 3D sample they have never seen before, provided they have seen similar samples from that class.

GAN encoder-decoder networks work in a similar way to generate 3D shapes. We train a GAN to generate fake z-vectors. New shapes can also be obtained by taking the z-vectors of the first and last 3D models, call them z_start and z_end, and computing new z-vectors as a linear combination of the two. Specifically, a number alpha between 0 and 1 is picked and a new vector is calculated as z_new = alpha*z_start + (1-alpha)*z_end.

A small change in the z-vector therefore leads to a small change in the 3D model while preserving the overall structure of the model’s category, so the model can be changed continuously from z_start to z_end. The generator network itself is trained to produce new z-vectors from random input.
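A small sketch of that interpolation step. The 128-dimensional latent size and the decoder mentioned in the comments are illustrative assumptions, not values from a specific model.

```python
import numpy as np

def interpolate_latents(z_start, z_end, steps=10):
    """Linear blends: z_new = alpha * z_start + (1 - alpha) * z_end."""
    return [alpha * z_start + (1.0 - alpha) * z_end
            for alpha in np.linspace(1.0, 0.0, steps)]

z_start = np.random.randn(128)   # latent code of the first 3D model (assumed size)
z_end = np.random.randn(128)     # latent code of the last 3D model
path = interpolate_latents(z_start, z_end)
# Each intermediate z could then be passed through the trained decoder,
# e.g. decoder(z), to reconstruct a shape that morphs from z_start to z_end.
```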

The discriminator receives real z-vectors from the encoder-decoder network along with fake z-vectors from the generator network. Since the decoder knows how to take a z-vector and reconstruct a 3D model from it, and the generator is trained to produce z-vectors that resemble real ones, new 3D models can be generated by combining the two networks.
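A heavily simplified sketch of that setup, assuming a pretrained encoder-decoder is available elsewhere; only the generator and discriminator operating on z-vectors are shown, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

latent_dim, noise_dim = 128, 64

# Generator: random noise -> a fake z-vector that should resemble encoder outputs.
generator = nn.Sequential(
    nn.Linear(noise_dim, 256), nn.ReLU(),
    nn.Linear(256, latent_dim),
)
# Discriminator: a z-vector -> probability that it came from the real encoder.
discriminator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Sigmoid(),
)

bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

# Stand-in for encoder(real 3D models); in practice these would come from the
# pretrained encoder of the encoder-decoder network.
real_z = torch.randn(32, latent_dim)

# Discriminator step: distinguish real z-vectors from generated ones.
fake_z = generator(torch.randn(32, noise_dim))
d_loss = bce(discriminator(real_z), torch.ones(32, 1)) + \
         bce(discriminator(fake_z.detach()), torch.zeros(32, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: try to fool the discriminator.
g_loss = bce(discriminator(fake_z), torch.ones(32, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()

# At inference time, a new shape could be obtained roughly as:
# new_model = decoder(generator(torch.randn(1, noise_dim)))
```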

Issues in 3D Datasets and Metrics

  1. Ground-truth 3D datasets were traditionally quite expensive and difficult to obtain because of divergent approaches for representing 3D structures.
  2. There need to be enough models (usually at least hundreds) in each category, and enough categories, to allow any kind of real-life application of this type of neural network.
  3. For each model, images from different angles, lighting positions, and camera parameters, as well as different scales, alignments, and translations, are required.
  4. There is no single agreed-upon metric for accuracy in 3D reconstruction. IoU checks how much of the volume of a reconstructed 3D shape overlaps with the original 3D shape, relative to the joint volume of both shapes. If the reconstructed shape is translated to a different position in space, the IoU may be zero (because there is no overlap) even if the shapes are identical; see the sketch after this list.
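For voxelized shapes, that IoU can be computed directly from two occupancy grids. A minimal sketch, assuming binary (or thresholded) grids:

```python
import numpy as np

def voxel_iou(pred, target, threshold=0.5):
    """Intersection over union between two (possibly probabilistic) voxel grids."""
    pred_occ = pred > threshold
    target_occ = target > threshold
    intersection = np.logical_and(pred_occ, target_occ).sum()
    union = np.logical_or(pred_occ, target_occ).sum()
    return intersection / union if union > 0 else 1.0

grid = np.random.rand(32, 32, 32)
shifted = np.roll(grid, shift=8, axis=0)   # identical shape, translated in space
print(voxel_iou(grid, grid))     # 1.0
print(voxel_iou(grid, shifted))  # much lower, despite identical geometry
```

As the shifted example shows, an identical shape that has merely been translated scores poorly, which is exactly the weakness described above.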

One way of measuring accuracy is to take 10 silhouette images of the model from viewpoints placed on a dodecahedron, using 10 different dodecahedron orientations per model.

This article has given a brief introduction to the various representations of 3D images, the issues with those representations, the power of deep learning for classifying and generating 3D objects, and the general issues with 3D datasets and evaluation metrics.
