Blog: Paper Reading - Geometric Shape Insertion Task


Teaching Robots Geometric Shape Tasks

Spatial Reasoning in Robotics

What if we could teach robots to do assembly that requires spatial reasoning?

One paper from the Robotics: Perception and Manipulation (RPM) Lab at the Minnesota Robotics Institute tackles this problem with a novel task involving pairs of distinct 3D shapes (e.g., plus, minus, pentagon).

Different shape pairs used for the geometric insertion task

🎯 The Challenge:
Two robot arms must grasp different objects and figure out how to align them perfectly for insertion. When a "+" shaped peg is rotated 90°, the robot must recognize this and plan the correct alignment - pure geometric reasoning.


Model architecture showing the vision encoder and robot state processing pipeline

Their solution:

Their task setup uses 3 RGB cameras: a top view and one for each robot arm.
The camera feeds are downsampled and processed by a vision encoder (ResNet, ViT variants) to produce 3 embeddings.
These embeddings are combined with the robot state representation (a 6D rotation representation computed using forward kinematics and orthogonalized with the Gram-Schmidt process).
The resulting values are then converted into gripper space for each robot arm to follow, using inverse kinematics.
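To make the Gram-Schmidt step concrete, here is a minimal sketch of how six numbers can be turned back into a valid rotation matrix. It assumes the common convention where the 6D representation holds the first two columns of the rotation matrix; the paper's exact layout may differ, and the function names `rotation_from_6d` and `rotation_to_6d` are hypothetical, chosen only for illustration.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    if norm < 1e-12:
        raise ValueError("cannot normalize a near-zero vector")
    return [x / norm for x in v]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rotation_from_6d(rep):
    """Map six numbers (two 3D vectors) to a 3x3 rotation matrix.

    Assumption: rep = [a1 (3 values), a2 (3 values)], where a1 and a2
    are (possibly noisy) versions of the first two matrix columns.
    """
    a1, a2 = list(rep[:3]), list(rep[3:6])
    # Gram-Schmidt: normalize a1, then remove its component from a2.
    b1 = _normalize(a1)
    proj = _dot(b1, a2)
    b2 = _normalize([y - proj * x for x, y in zip(b1, a2)])
    # The third column is the cross product, which gives determinant +1.
    b3 = _cross(b1, b2)
    # Return the matrix as rows, with b1, b2, b3 as its columns.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def rotation_to_6d(matrix):
    """Keep only the first two columns of a 3x3 rotation matrix."""
    return [matrix[i][0] for i in range(3)] + [matrix[i][1] for i in range(3)]
```

For example, `rotation_from_6d([2.0, 0.0, 0.0, 1.0, 3.0, 0.0])` recovers the identity matrix even though neither input vector is unit length or orthogonal to the other. This is why the representation is convenient for learned models: any six numbers that are not degenerate map to a proper rotation.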

Different perturbations applied to test model robustness

🔧 How did sim2real work?
With robotics, the goal is always to build with robustness in mind. To account for real-world conditions, the authors add perturbations along different axes to induce grasp errors. These also include randomising the orientation of the object pair about the Z axis, which enables the model to "actually learn spatial reasoning". This was done across 1,000 demonstrations, all run in a PyBullet simulator, before the model was tested in the real world.
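As a rough illustration of this kind of randomisation, the sketch below samples one set of perturbations per simulated demonstration. Everything specific here is an assumption made for the example: the `EpisodeRandomization` structure, the function names, the uniform distributions, and the ranges (1 cm offsets, 10° tilts) are hypothetical and are not taken from the paper. Applying the samples inside PyBullet is left out.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class EpisodeRandomization:
    object_yaw_rad: float   # rotation of the object pair about the Z axis
    grasp_offset_m: tuple   # (x, y, z) translation error of the grasp
    grasp_tilt_rad: tuple   # (roll, pitch, yaw) rotation error of the grasp


def sample_randomization(rng, max_offset_m=0.01, max_tilt_rad=math.radians(10)):
    """Draw one set of perturbations (hypothetical ranges, uniform sampling)."""
    yaw = rng.uniform(-math.pi, math.pi)
    offset = tuple(rng.uniform(-max_offset_m, max_offset_m) for _ in range(3))
    tilt = tuple(rng.uniform(-max_tilt_rad, max_tilt_rad) for _ in range(3))
    return EpisodeRandomization(yaw, offset, tilt)


def sample_dataset(num_episodes, seed=0):
    """Sample perturbations for a whole set of simulated demonstrations."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    return [sample_randomization(rng) for _ in range(num_episodes)]


episodes = sample_dataset(1000)
```

Seeding the generator keeps the sampled perturbations reproducible, which makes it easier to compare different vision encoders on exactly the same set of grasp errors.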

Their approach achieved an 82.5% success rate in a real-world setup.

📈 What's Next:
The authors suggest that Diffusion Policy and action chunking could push accuracy even higher. Add force feedback, and we might see human-level assembly performance.

Learn More:
Paper: Evaluating Robustness of Visual Representations for Object Assembly Task Requiring Spatio-Geometrical Reasoning
Website