A brief overview of single-view 3D reconstruction based on Deep Learning


As a combination of Computer Graphics and Computer Vision, 3D reconstruction has been a classic problem for a long time. In recent years, with the development of Deep Learning, more and more researchers are focusing on 3D reconstruction with Deep Learning again.

Related Work

The number of work focus on 3D reconstruction is increasing in recent years. However, unlike classification or object detection, there is not a specific model for 3D reconstruction. That is, their models use different kinds of input data, and return different types of output.

Type of input:

  • RGB-D images, aka depth image
  • RGB images
    • Easy to collect data
    • Lack depth information(3D information)

Type of output:

  • Voxel model
    • Easy to implement.
    • Low resolution and high memory consuming.
  • Point cloud
  • Mesh
    • Much more practical.
    • Hard to implement: Mesh is not tensor. So it is hard to use mini-batch and GPU when training.

Some interesting Papers

  • VON: Visual Object Networks: Image Generation with Disentangled 3D Representation
    • Link: VON
    • They use three end-to-end steps:
      • shape code ==> 3D shape
      • 3D shape ==> 2.5D sketches.
      • 2.5D sketches ==> natural images(Texture, different viewpoint)
  • ShapeHD: Learning Shape Priors for Single-View 3D Completion and Reconstruction
    • Link: ShapeHD
      • They use three steps:
        • 2D image ==> 2.5D image
        • 2.5D image ==> 3D voxel model
        • 3D voxel model ==> Fine tunning 3D model
  • Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images
    • Link: Pixel2Mesh
    • They use Graph Convolution Neural Net(GCN) to deform a mesh of ellipsoid and finally get a output mesh.
    • The result are not very accurate. And their output can not keep the genus(holes) of the ground truth.
  • AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation
    • Link: AtlasNet
    • They initialize a fixed number of rectangle surfaces. Then they deform these surfaces and finally collage them as the final output.
  • DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation
    • Link: DeepSDF
    • Their work has really good performance. They train a auto-decoder, which takes a latent vector and a 3D coordinates (x, y, z) as input. They sample 500K points near the surface as the feature data, and compute the signed distance from the sampled points to the surface as the label.
    • The initial latent vector is a random vector. They use backprop to update the latent vector during the training procedure.
    • When evaluating, they also sample some points with signed distance from the depth images. Then they “train” on these samples. After that procedure, the model will learn the implicit function of the shape surface.
    • Finally, use marching cube algorithm to reconstruct the triangle mesh.

Limitation of single-view 3D reconstruction from RGB image.

  • The input files don’t include enough information for the ground truth. So it is impossible to reconstruct all the real details of the real object.
  • This procedure is more like a person, who has the knowledge about some object(e.g, cars). And when he sees a picture of a car, he can easily imagine the 3D details of it. The imagination is not accurate, but reasonable.
  • In a word, it highly depends on the prior knowledge, and create reasonable 3D models.


  • It can save a lot of time when reconstructing a 3D model from a image. People only need to do some tunning to the output model.
  • AR: make pictures alive.

Leave a Comment

Your email address will not be published.