Multi-modal cluttered scene analysis in knowledge intensive scenarios

Detection of transparent objects in real-world RGB-D data, and estimation of their pose


The project's algorithmic part is mainly based on the following paper: Recognition and Pose Estimation of Rigid Transparent Objects with a Kinect Sensor

My original proposal contained one more segmentation algorithm, but during the community bonding period we decided to focus mostly on this one, so that second part was largely dropped from my original plan.

New Annotators

TransparentSegmentationAnnotator - Consumes RGB-D data from a Kinect or Xtion sensor (possibly Intel RealSense, but untested) and outputs segments; requires a PlaneAnnotator to be run first;

ContourFittingClassifier - Consumes segments and outputs class descriptions, stamped poses and refined depth data;
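These annotators slot into a RoboSherlock analysis engine like any other. The following is a hypothetical fixedFlow fragment showing the required ordering (PlaneAnnotator before TransparentSegmentationAnnotator); the element names are illustrative, and the actual descriptor in descriptors/analysis_engines/transparent_demo.xml is authoritative:

```xml
<!-- Hypothetical sketch of the pipeline ordering; the real schema in
     descriptors/analysis_engines/transparent_demo.xml may differ.
     PlaneAnnotator must run before TransparentSegmentationAnnotator. -->
<fixedFlow>
  <node>CollectionReader</node>
  <node>ImagePreprocessor</node>
  <node>PlaneAnnotator</node>
  <node>TransparentSegmentationAnnotator</node>
  <node>ContourFittingClassifier</node>
</fixedFlow>
```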

New Annotation types

  • TransparentSegment - marks an image segment that possibly contains a transparent object

The remaining generated annotations are the already existing Cluster, Classification and PoseAnnotation types.

What's done

  • Regions of failed depth perception are correctly detected, merged, refined, rejected on customisable size criteria, marked as possibly related to transparent objects and submitted to the CAS for further processing;
  • Segments marked as transparent (though not limited to them; any segment may be used) that lie on a support plane are fitted with the corresponding CAD meshes. The pose of each object is estimated, written to the CAS and used to repair the point cloud and depth map;
  • Performance-critical parts of the code run in parallel, giving a significant advantage on multicore systems;
  • A ranking structure that can be used to keep track of similar hypotheses of different kinds;
  • The code is tidy and most of the functions are reentrant. All functions contain comments and Doxygen in-code documentation;
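The first bullet can be illustrated with a minimal sketch (this is not the actual RoboSherlock code): invalid depth pixels, encoded as NaN here, are grouped into connected regions, and only regions within a configurable size range are kept. The real annotator additionally merges and refines regions and forwards them to the CAS.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <queue>
#include <vector>

struct Region { std::vector<std::size_t> pixels; };

// depth: row-major w*h depth map; NaN marks failed depth perception.
// Returns 4-connected regions of invalid pixels within [minSize, maxSize].
std::vector<Region> findFailedDepthRegions(const std::vector<float> &depth,
                                           int w, int h,
                                           std::size_t minSize,
                                           std::size_t maxSize) {
  std::vector<char> visited(depth.size(), 0);
  std::vector<Region> out;
  for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x) {
      std::size_t start = static_cast<std::size_t>(y) * w + x;
      if (visited[start] || !std::isnan(depth[start])) continue;
      // BFS flood fill over 4-connected invalid pixels.
      Region r;
      std::queue<std::size_t> q;
      q.push(start);
      visited[start] = 1;
      while (!q.empty()) {
        std::size_t i = q.front(); q.pop();
        r.pixels.push_back(i);
        int cx = static_cast<int>(i % w), cy = static_cast<int>(i / w);
        const int dx[] = {1, -1, 0, 0}, dy[] = {0, 0, 1, -1};
        for (int k = 0; k < 4; ++k) {
          int nx = cx + dx[k], ny = cy + dy[k];
          if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
          std::size_t ni = static_cast<std::size_t>(ny) * w + nx;
          if (!visited[ni] && std::isnan(depth[ni])) {
            visited[ni] = 1;
            q.push(ni);
          }
        }
      }
      // Reject regions outside the configurable size criteria.
      if (r.pixels.size() >= minSize && r.pixels.size() <= maxSize)
        out.push_back(std::move(r));
    }
  return out;
}
```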

What is not working

The segmentation part works well only with structured-light sensors. ToF sensors like the Kinect v2 will not work with this approach; however, I plan to implement support for them outside of GSoC.

The ICP pose refinement works unreliably for now (broken). Earlier, additional data shipped alongside the training meshes was used as the ICP source data, and it worked fairly well (though a bit slowly), but preparing that training data required understanding how the algorithm is implemented, which is not user-friendly. I have therefore implemented automatic extraction of the required data, which takes additional processing time, so I need to find a balance between acceptable performance and convergence stability.

Getting the code

My main development repository on GitHub: here

If you are reading this after the end of the program, chances are the code has already been merged, so you may just check out the master branch of RoboSherlock's upstream repository.

Pull requests: PR#96 and PR#99;

To run a demonstration of the implemented algorithms, clone my fork of the repository into your catkin workspace and check out the transparent_objects branch. You are done.

Compiling the code

Ideally a catkin_make would be enough, but because of ABI incompatibilities when linking against libpcl, you may get an instant segfault when the ContourFittingClassifier is loaded. To address this bug you need PCL recompiled with the -std=c++11 flag enabled; see the official site for compilation instructions.
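For reference, a rebuild along these lines should work; the exact PCL version, branch and CMake options are assumptions on my side, so treat the official PCL compilation instructions as authoritative:

```sh
# Hedged sketch; check the official PCL build instructions for your version.
git clone https://github.com/PointCloudLibrary/pcl.git
cd pcl && mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-std=c++11" ..
make -j$(nproc)
sudo make install
```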


Choose your data source in the camera_config_files section of descriptors/analysis_engines/transparent_demo.xml and/or play your rosbag file. Execute roslaunch robosherlock transparent_demo.launch from your catkin home to launch the demo.

Using the annotators in pipeline

I expect this post may serve as a usage manual, so I'll describe the options to tweak in order to get the best results for your use case.

Data preparation

The only input data required by the contour fitting classifier are 3D meshes. Only the PLY file format is supported for now (Blender can import and export it). The mesh's default pose is assumed bottom-down, with +Y up and -Z forward. If ICP refinement is used, proper vertex normals are required and smooth shading is highly recommended.
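As an illustration, here is a hypothetical minimal ASCII PLY file with the per-vertex normal properties the ICP refinement needs (a single triangle; real training meshes are full closed models, and the actual parser may be stricter about the header):

```
ply
format ascii 1.0
comment Hypothetical minimal mesh: one triangle with per-vertex normals.
element vertex 3
property float x
property float y
property float z
property float nx
property float ny
property float nz
element face 1
property list uchar int vertex_indices
end_header
0 0 0 0 1 0
1 0 0 0 1 0
0 0 -1 0 1 0
3 0 1 2
```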


Future work

As mentioned earlier, I want to implement transparent segmentation for the Kinect v2 sensor.

The pose refinement process is unstable and slow; some tweaking there would be good.


Performance

  • The segments are detected reliably under different lighting conditions, in about 100 ms on a test scene;
  • The position and orientation of successfully detected transparent objects are estimated accurately (given that the training meshes have the correct size and the camera is calibrated); the support plane assumption makes the estimation robust against moderate variations in the training data. Without pose refinement, fitting 4 segments against 3 distinct meshes (36 samples per edge model) takes roughly 650 ms on a test scene.
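The post does not spell out how the 36 edge-model samples per mesh are chosen; one plausible reading is uniformly sampled object orientations about the support-plane normal. A small sketch under that assumption (the function name and representation are mine, not the project's):

```cpp
#include <cmath>
#include <vector>

struct Yaw { double radians; };

// Generate n uniformly spaced rotations about the vertical axis,
// e.g. n = 36 gives 10-degree steps over a full turn.
std::vector<Yaw> sampleYaws(int n) {
  const double pi = std::acos(-1.0);  // portable pi
  std::vector<Yaw> out;
  out.reserve(n);
  for (int i = 0; i < n; ++i)
    out.push_back({2.0 * pi * i / n});
  return out;
}
```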

The processing times were measured on a 3.5 GHz 8-core Ryzen CPU.
