

How Google made Robotic Grasping a reality

Google Robotic Arm


Artificial neural networks can carry out remarkably complex tasks with high accuracy, so many organizations are actively researching how best to apply them in their own fields. The robotics team at Google is one such group: it adapted neural networks to a task called Robotic Grasping.

Robotic Grasping is a task in which a robotic arm grasps an object of interest using only a single-viewpoint RGB image as input. Once an object is grasped, it can be moved to a desired location, which is why such robotic arms have quickly found their place in manufacturing pipelines and production warehouses. This article explains how Google made Robotic Grasping a reality.

Training to grasp objects using an ‘arm farm’

The input to the robotic arm is a single-viewpoint RGB image, so the arm needs to learn hand-eye coordination: choosing motion commands that successfully pick up objects based on what the camera sees. Learning this takes a large number of training attempts, and collecting that data with a single arm is slow, so Google used an arm farm to speed up data collection.

The arm farm collected over a million grasp attempts, amounting to thousands of robot hours. The objective was to make the robotic arms accurate and reliable at the task.

The robotics team soon realized that this rate of data collection was still too slow to cover all edge cases, and that many more training attempts were required. The team therefore shifted to computer simulations, which let them record millions of grasps in hours instead of weeks.

Simulated robotic grasping

Using this training approach, the model reached a whopping grasp success rate of 90% in the simulated environment! However, everything fell apart in the real world, where the model grasped successfully only 23% of the time. The simulated training images didn’t look like real-world images, so the model couldn’t account for lighting, reflections, object textures and the like.

This was certainly not what they were hoping for.

Since the model trained purely in simulation wasn’t up to par in the real world, the team instead used simulated data to improve real-world sample efficiency, that is, to reduce the amount of real-world data needed. This is called Sim-to-Real transfer.

The two approaches used by the team for Sim-to-Real Transfer were:

a. Using Generative Adversarial Networks for pixel-level domain adaptation

Generative Adversarial Networks for pixel-level domain adaptation

Given a training set, Generative Adversarial Networks (GANs) learn to generate new data with the same statistics as the training set. This meant that the Google team could generate realistic-looking versions of simulation images by training on real images.

Using this novel approach, the team modified their simulation images to have properties similar to real images. As the images were being modified at the pixel level, this is termed pixel-level domain adaptation.
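To make the adversarial idea concrete, here is a minimal sketch of the standard GAN objective in plain Python. It is not GraspGAN’s actual loss; the function names and the example probabilities are made up for illustration. Here `d_real` stands for the discriminator’s estimated probability that a real image is real, and `d_fake` for the same estimate on an adapted simulation image.

```python
import math

def discriminator_loss(d_real, d_fake):
    # The discriminator wants d_real close to 1 and d_fake close to 0.
    return -math.log(d_real) - math.log(1.0 - d_fake)

def generator_loss(d_fake):
    # Non-saturating generator loss: the generator wants the
    # discriminator to rate its adapted images as real (d_fake close to 1).
    return -math.log(d_fake)

# Hypothetical discriminator outputs early and late in training.
early = generator_loss(0.1)   # adapted images are easily spotted as fake
late = generator_loss(0.9)    # adapted images mostly pass as real
print(f"generator loss early: {early:.3f}, late: {late:.3f}")
```

The generator loss falls as the adapted simulation images become harder to tell apart from real ones, which is the pressure that makes them look realistic.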

b. Using Domain-Adversarial Neural Networks for feature-level domain adaptation

Domain-Adversarial Neural Networks for feature-level domain adaptation

Domain-adversarial neural networks, or DANNs (Ganin et al., JMLR 2016), learn domain-invariant features end to end by training a model together with an adversarial domain classifier.

This means that the same model is trained on both the simulated and the real data. A similarity loss is attached to an intermediate feature layer, and this loss pushes the feature distribution to be the same across both domains.

DANNs implement the similarity loss as a small neural network, the domain classifier, that tries to predict the domain from the features it receives, while the rest of the model tries to confuse the domain classifier as much as possible.
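The tug-of-war between the domain classifier and the rest of the model can be sketched with a toy scalar model in plain Python. This is an illustrative sketch and not the grasping network: the parameter names (`w`, `u`, `v`, `b`), the squared-error task loss and the weight `lam` are assumptions chosen to keep the gradients easy to write by hand. The key line is the gradient for `w`, where the domain-loss gradient enters with a flipped sign (gradient reversal).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dann_gradients(params, x, y, d, lam):
    """Per-sample gradients for a toy scalar DANN (hypothetical model).

    params: w (feature extractor), u (task head), v and b (domain classifier)
    x: input, y: task target, d: domain label (0 = simulated, 1 = real)
    lam: weight of the reversed domain gradient
    """
    w, u, v, b = params["w"], params["u"], params["v"], params["b"]
    f = w * x                  # shared feature
    y_hat = u * f              # task prediction
    p = sigmoid(v * f + b)     # predicted probability that the sample is real

    # Task loss: 0.5 * (y_hat - y)^2. Domain loss: binary cross-entropy.
    dtask_df = (y_hat - y) * u
    ddomain_df = (p - d) * v

    return {
        # Feature extractor: minimize the task loss, but move AGAINST the
        # domain-loss gradient so the features confuse the classifier.
        "w": (dtask_df - lam * ddomain_df) * x,
        "u": (y_hat - y) * f,
        # Domain classifier: ordinary gradients, it tries to get better.
        "v": (p - d) * f,
        "b": p - d,
    }

params = {"w": 1.0, "u": 1.0, "v": 1.0, "b": 0.0}
grads = dann_gradients(params, x=1.0, y=1.0, d=1, lam=1.0)
print(grads["w"], grads["v"])  # opposite signs in this example
```

In this example the task prediction is already correct, so the only gradient reaching `w` comes from the domain loss, and it points the opposite way from the classifier’s own gradient. In deep learning frameworks, the same effect is usually obtained with a gradient reversal layer placed between the features and the domain classifier.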

GraspGAN: Combining both feature-level and pixel-level methods

Feature-level methods can learn domain-invariant features on data from related domains that aren’t identical, while pixel-level methods can transform data to look like real data, though in practice they do not work perfectly. This is why the robotics team at Google combined both methods in what they called GraspGAN.

The results from GraspGAN are shown in the picture below.

GraspGAN test results

The results show that their method brought the grasp success rate to about 80%. They also show that 188k real-world samples combined with simulated data performed as well as 9.4M real-world samples.
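To put the two sample counts in perspective, the short calculation below compares them. It reads the 188k figure as the number of real-world samples used alongside simulated data, which is an interpretation of the sentence above; the variable names are made up for illustration.

```python
real_with_sim = 188_000    # real-world samples used alongside simulated data (assumed reading)
real_only = 9_400_000      # real-world samples in the real-only comparison

fraction = real_with_sim / real_only
reduction = real_only / real_with_sim
print(f"{fraction:.0%} of the real-world data, a {reduction:.0f}x reduction")
# prints: 2% of the real-world data, a 50x reduction
```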

In Conclusion

If Google hadn’t pushed so hard on this project, Robotic Grasping might have taken much longer to become a reality. Thanks to their efforts, many manufacturing and production companies can now perform the mundane task of moving things around with much greater ease.

Also, if you want to learn more about GraspGAN, the full paper can be found here.
