3D Images and Deep Learning

Dibyendu Biswas
5 min read · Jun 27, 2021

With the development of AR/VR and self-driving cars, 3D vision is becoming more and more important, since it provides much richer information than 2D. A 3D image measures one more dimension: depth.

Representation of 3D data

3D images can be represented in the following ways:

  • A point cloud is a collection of points in three-dimensional space; each point is defined by an (x, y, z) position, and we can also attach other attributes (such as RGB color) to it. Point clouds are the raw form in which lidar data is acquired.
  • A voxel grid is derived from a point cloud: we can regard it as a point cloud quantized onto cells of fixed size.

  • A polygonal mesh is a collection of vertices, edges, and faces that defines an object’s surface in three dimensions. It can capture fine detail in a fairly compact representation.

  • A multi-view representation is a collection of two-dimensional images of a polygon mesh, rendered from different simulated viewpoints.

Issues with various 3D representations

Point Cloud

  • Point clouds are unordered: the same set of points in a different order still represents the same 3D object, so a model must be invariant to the ordering of the points.
  • A CNN cannot be applied directly, because the points do not lie on a regular grid.

Three strategies to deal with this:

  1. Sort the points.
  2. Augment training with multiple permutations of the points.
  3. Use a symmetric function to aggregate the information from each point. A symmetric function, such as + or *, returns the same result regardless of the order of its arguments.

In PointNet, max pooling is used as the symmetric function: for each feature it keeps the strongest response across all points, so the result is invariant to the order of the points.
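As a small illustration, here is a minimal sketch of that idea in PyTorch: a shared per-point MLP followed by max pooling over the points. The layer sizes are illustrative, not taken from the PointNet paper.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Shared per-point MLP + max pooling over points (order-invariant)."""
    def __init__(self, num_classes=10):
        super().__init__()
        # The same MLP is applied to every point independently.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, points):                 # points: (batch, n_points, 3)
        per_point = self.point_mlp(points)     # (batch, n_points, 256)
        # Max pooling over the point dimension is a symmetric function,
        # so permuting the points does not change the output.
        global_feat, _ = per_point.max(dim=1)  # (batch, 256)
        return self.classifier(global_feat)

# Permuting the points leaves the prediction unchanged.
net = TinyPointNet()
cloud = torch.randn(1, 1024, 3)
perm = cloud[:, torch.randperm(1024), :]
assert torch.allclose(net(cloud), net(perm), atol=1e-5)
```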

Voxel Grid

VoxNet is a deep learning architecture for classifying 3D point clouds using a probabilistic occupancy grid, in which each voxel holds the probability that it is occupied. One advantage of this is that it allows the network to distinguish between voxels that are known to be free and voxels whose occupancy is unknown.
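As a rough illustration of voxelization (not VoxNet’s actual lidar-based occupancy mapping), here is a minimal NumPy sketch that quantizes a point cloud into a fixed-size binary occupancy grid, marking a voxel as occupied if it contains at least one point.

```python
import numpy as np

def voxelize(points, grid_size=32):
    """Quantize an (N, 3) point cloud into a binary occupancy grid."""
    # Normalize points into the unit cube [0, 1).
    mins, maxs = points.min(axis=0), points.max(axis=0)
    scaled = (points - mins) / (maxs - mins + 1e-9)
    # Map each point to a voxel index and mark that voxel as occupied.
    idx = np.clip((scaled * grid_size).astype(int), 0, grid_size - 1)
    grid = np.zeros((grid_size, grid_size, grid_size), dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid

cloud = np.random.rand(2048, 3)
occupancy = voxelize(cloud)          # shape (32, 32, 32), ready for a 3D CNN
print(occupancy.sum(), "of", occupancy.size, "voxels occupied")
```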

However, the voxel-grid representation is wasteful, because the underlying data is sparse: the fraction of useful voxels decreases as the resolution increases. Second, if the points describing a complex structure are very close together, they are binned into the same voxel and fine detail is lost. Third, compared with point clouds of sparse environments, voxel grids may cause unnecessarily high memory usage, because they actively spend memory on representing free and unknown space, while a point cloud contains only the known points.

VoxNet is a simple and reasonable approach, but it may not be a good choice for complex datasets.

Polygonal Mesh

A CNN cannot be applied to it directly, because a mesh is an irregular structure of vertices and faces rather than a regular grid.

Multi-View Representation

The multi-view representation is the simplest way to apply a deep learning model to a three-dimensional scene: each rendered view can be processed by an ordinary 2D CNN, and the per-view features combined.
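A minimal PyTorch sketch of that idea, in the spirit of MVCNN-style view pooling: a shared 2D CNN processes every rendered view, and the per-view features are max-pooled before classification. The tiny backbone here is a placeholder, not an architecture from any particular paper.

```python
import torch
import torch.nn as nn

class MultiViewNet(nn.Module):
    """Shared 2D CNN per view + max pooling across views."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(              # placeholder 2D CNN
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, views):                        # views: (batch, n_views, 3, H, W)
        b, v, c, h, w = views.shape
        feats = self.backbone(views.reshape(b * v, c, h, w))   # (b*v, 16)
        feats = feats.reshape(b, v, -1).max(dim=1).values      # pool over views
        return self.classifier(feats)

model = MultiViewNet()
renders = torch.randn(2, 12, 3, 64, 64)   # 12 rendered views per object
logits = model(renders)                   # (2, 10)
```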

3D Object and Deep Learning

Since deep learning networks are trained on many previous samples of a 3D object class, they can extrapolate the shape of a new 3D sample they have never seen before, provided they have seen similar samples from that class.

GAN encoder-decoder networks work in a similar way to generate 3D shapes. We train a GAN to generate fake z-vectors. New shapes can also be obtained by taking the z-vectors of the first and last 3D models, call them z_start and z_end, and computing new z-vectors as a linear combination of the two. Specifically, a number alpha between 0 and 1 is picked and a new vector is calculated as z_new = alpha*z_start + (1-alpha)*z_end.

A small change in the z-vector therefore leads to a small change in the 3D model while preserving the overall structure of the model’s category, so the model can be changed continuously from z_start to z_end. The generator network itself is trained to produce new z-vectors from random input.
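A small sketch of that interpolation step. The 128-dimensional latent size and the decoder mentioned in the comments are illustrative assumptions, not values from a specific model.

```python
import numpy as np

def interpolate_latents(z_start, z_end, steps=10):
    """Linear blends: z_new = alpha * z_start + (1 - alpha) * z_end."""
    return [alpha * z_start + (1.0 - alpha) * z_end
            for alpha in np.linspace(1.0, 0.0, steps)]

z_start = np.random.randn(128)   # latent code of the first 3D model (assumed size)
z_end = np.random.randn(128)     # latent code of the last 3D model
path = interpolate_latents(z_start, z_end)
# Each intermediate z could then be passed through the trained decoder,
# e.g. decoder(z), to reconstruct a shape that morphs from z_start to z_end.
```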

The discriminator receives real z-vectors from the encoder-decoder network along with fake z-vectors from the generator network. Since the decoder knows how to take a z-vector and reconstruct a 3D model from it, and the generator is trained to produce z-vectors that resemble real ones, new 3D models can be generated by combining the two networks.
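A heavily simplified sketch of that setup, assuming a pretrained encoder-decoder is available elsewhere; only the generator and discriminator operating on z-vectors are shown, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

latent_dim, noise_dim = 128, 64

# Generator: random noise -> a fake z-vector that should resemble encoder outputs.
generator = nn.Sequential(
    nn.Linear(noise_dim, 256), nn.ReLU(),
    nn.Linear(256, latent_dim),
)
# Discriminator: a z-vector -> probability that it came from the real encoder.
discriminator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Sigmoid(),
)

bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

# Stand-in for encoder(real 3D models); in practice these would come from the
# pretrained encoder of the encoder-decoder network.
real_z = torch.randn(32, latent_dim)

# Discriminator step: distinguish real z-vectors from generated ones.
fake_z = generator(torch.randn(32, noise_dim))
d_loss = bce(discriminator(real_z), torch.ones(32, 1)) + \
         bce(discriminator(fake_z.detach()), torch.zeros(32, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: try to fool the discriminator.
g_loss = bce(discriminator(fake_z), torch.ones(32, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()

# At inference time, a new shape could be obtained roughly as:
# new_model = decoder(generator(torch.randn(1, noise_dim)))
```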

Issues in 3D Datasets and Metrics

  1. Ground-truth 3D datasets were traditionally quite expensive and difficult to obtain because of divergent approaches for representing 3D structures.
  2. There need to be enough models (usually at least hundreds) in each category, and enough categories, to allow any kind of real-life application of this type of neural network.
  3. For each model, images from different angles, lighting positions, and camera parameters, as well as different scales, alignments, and translations, are required.
  4. There is no single agreed-upon metric for accuracy in 3D reconstruction. IoU checks how much of the volume of a reconstructed 3D shape overlaps with the original 3D shape, relative to the joint volume of both shapes. If the reconstructed shape is translated to a different position in space, the IoU may be zero (because there is no overlap) even if the shapes are identical; see the sketch after this list.
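For voxelized shapes, that IoU can be computed directly from two occupancy grids. A minimal sketch, assuming binary (or thresholded) grids:

```python
import numpy as np

def voxel_iou(pred, target, threshold=0.5):
    """Intersection over union between two (possibly probabilistic) voxel grids."""
    pred_occ = pred > threshold
    target_occ = target > threshold
    intersection = np.logical_and(pred_occ, target_occ).sum()
    union = np.logical_or(pred_occ, target_occ).sum()
    return intersection / union if union > 0 else 1.0

grid = np.random.rand(32, 32, 32)
shifted = np.roll(grid, shift=8, axis=0)   # identical shape, translated in space
print(voxel_iou(grid, grid))     # 1.0
print(voxel_iou(grid, shifted))  # much lower, despite identical geometry
```

As the shifted example shows, an identical shape that has merely been translated scores poorly, which is exactly the weakness described above.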

One way of measuring accuracy is to take 10 silhouette images of the model from viewpoints placed on a dodecahedron, using 10 different dodecahedron orientations per model.

This article has given a brief introduction to the various representations of 3D images, the issues with those representations, the power of deep learning for classifying and generating 3D objects, and the general issues with 3D datasets and evaluation metrics.
