By Alishba Imran
👋 I'm Alishba, a 17-year-old who's passionate about using hardware & software to tackle problems in deep tech.
I've been working with Hanson Robotics (creators of Sophia the robot) to develop various deep learning techniques as a new way to automate manipulation tasks in humanoids/robots like Sophia. We're developing neural networks for closed-loop grasping, with the ultimate goal of building neural-symbolic AI approaches that improve our understanding of, and ability to grasp, novel objects in unstructured environments.
Before we get into the specifics, let's go over some of the key components of robotic perception. Robotic perception is the ability of robots to learn from sensory data and, based on learned models, to react and make decisions accordingly. It consists of three components: (1) perceiving, (2) comprehending, and (3) reasoning about the surrounding environment.
Key modules of a typical robotic perception system are: sensory data processing (focusing here on visual and range perception); data representations specific for the tasks at hand; algorithms for data analysis and interpretation (using AI/ML methods); and planning and execution of actions for robot-environment interaction.
Sensor data is central to robotic perception. It can come from a single sensor or multiple sensors, usually mounted onboard the robot, or from external cameras mounted in the environment.
Sensor-based environment mapping includes the acquisition of a metric model and its semantic interpretation. This semantic mapping process includes reasoning about volumetric occupancy and occlusions, and identifying and describing objects to enable reasoning and inference about the real-world environment where the robot operates.
Robotic grasping is one of the most fundamental robotic manipulation tasks: before interacting with objects in the world, a robot starts by grasping them.
In order to do this, the robot has to work through many things such as (but not limited to):
These tasks are difficult today for two main reasons:
This is where computer vision and CNN-based approaches come in handy.
The algorithm I used for grasp detection is a Generative Grasping Convolutional Neural Network (GG-CNN), largely inspired by this paper. A GG-CNN takes in depth images of a scene and predicts the pose and quality of a grasp at every pixel, for different grasping tasks and objects.
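To make the per-pixel idea concrete, here is a minimal sketch of such a fully-convolutional network in PyTorch. The layer counts and channel sizes are illustrative stand-ins, not the exact architecture from the paper: the point is that convolutions downsample the depth image and transposed convolutions bring it back to full resolution, with one 1×1 head per output map.

```python
import torch
import torch.nn as nn

class MiniGGCNN(nn.Module):
    """Fully-convolutional sketch: depth image in, four per-pixel grasp maps out."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 9, stride=2, padding=4), nn.ReLU(),   # downsample
            nn.Conv2d(16, 16, 5, stride=2, padding=2), nn.ReLU(),  # downsample
            nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1), nn.ReLU(),  # upsample
            nn.ConvTranspose2d(16, 16, 4, stride=2, padding=1), nn.ReLU(),  # upsample
        )
        # One 1x1 conv head per output image: quality, cos(2*phi), sin(2*phi), width
        self.q_head = nn.Conv2d(16, 1, 1)
        self.cos_head = nn.Conv2d(16, 1, 1)
        self.sin_head = nn.Conv2d(16, 1, 1)
        self.width_head = nn.Conv2d(16, 1, 1)

    def forward(self, depth):
        x = self.features(depth)
        return self.q_head(x), self.cos_head(x), self.sin_head(x), self.width_head(x)

depth = torch.zeros(1, 1, 300, 300)          # a batch of one 300x300 depth image
q, cos2p, sin2p, w = MiniGGCNN()(depth)
print(q.shape)                                # each map matches the input resolution
```

Because every pixel gets a prediction in a single forward pass, there is no need to sample and evaluate individual grasp candidates one by one.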
CNN-based controllers for grasping are preferred because they enable closed-loop grasping. Many such systems learn controllers that map potential control commands to the expected quality of (or distance to) a grasp after the command is executed, which requires sampling many candidate commands at each time step.
What is GG-CNN?
The grasp is executed perpendicular to a planar surface and, given a depth image of the scene, is determined by its pose: the gripper's centre position (x, y, z) in Cartesian coordinates, the gripper's rotation around the z axis, and the required gripper width. A scalar value is also used to represent the chance of grasp success at that pose.
A grasp can be described as g = (p, φ, w, q), where p = (x, y, z) is the gripper's centre position, φ is its rotation around the z axis, w is the required gripper width, and q is the scalar grasp-quality score.
A grasp in image space can be converted to a grasp in world coordinates by applying a sequence of known transforms: from image coordinates into the camera frame using the camera intrinsics, and from the camera frame into the robot frame using the calibrated camera pose.
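This image-to-world conversion can be sketched with the standard pinhole camera model. The intrinsic matrix `K` and the camera-to-robot transform `T_robot_camera` here are assumed to come from calibration; the values below are placeholders for illustration.

```python
import numpy as np

def grasp_image_to_world(u, v, depth, K, T_robot_camera):
    """Back-project a pixel grasp centre (u, v) with measured depth to
    robot-frame coordinates.

    K: 3x3 camera intrinsic matrix.
    T_robot_camera: 4x4 homogeneous transform, camera frame -> robot frame.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pixel + depth -> 3D point in the camera frame (pinhole model)
    x_cam = (u - cx) * depth / fx
    y_cam = (v - cy) * depth / fy
    p_cam = np.array([x_cam, y_cam, depth, 1.0])
    # Camera frame -> robot frame
    return (T_robot_camera @ p_cam)[:3]

K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0,   0.0,   1.0]])      # example intrinsics only
T = np.eye(4)                            # identity: camera frame == robot frame
print(grasp_image_to_world(320, 240, 0.5, K, T))  # -> [0.  0.  0.5]
```

A pixel at the optical centre maps straight down the camera's z axis, which is a quick sanity check on the math.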
Grasp Representation: the network represents the grasp map G as a set of three images, Q, Φ and W, giving the grasp quality, grasp angle and gripper width at every pixel.
To train the network, I used the Cornell Grasping Dataset, which contains 885 RGB-D images of real objects with 5110 human-labelled positive and 2909 negative grasps. This dataset suits our task well because its multiple labelled grasps per image map naturally onto a pixelwise grasp representation.
Depth and RGB images from the Cornell Grasping Dataset with the ground-truth positive grasp rectangles are shown in green. From the ground-truth grasps, the Grasp Quality (QT), Grasp Angle (ΦT) and Grasp Width (WT) images are generated to train the network. The angle is further decomposed into cos(2ΦT ) and sin(2ΦT ) for training.
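At inference time the process runs in reverse: the angle is recovered from the cos(2Φ)/sin(2Φ) pair, and the best grasp is simply the pixel with the highest quality. A small sketch of that decoding step, assuming the four output maps are NumPy arrays:

```python
import numpy as np

def decode_best_grasp(q_img, cos_img, sin_img, width_img):
    """Pick the pixel with the highest grasp quality and recover its angle
    from the cos(2*phi) / sin(2*phi) decomposition."""
    v, u = np.unravel_index(np.argmax(q_img), q_img.shape)
    # The factor of 2 makes the angle unique for a symmetric (antipodal)
    # gripper; halving the recovered angle undoes it.
    phi = 0.5 * np.arctan2(sin_img[v, u], cos_img[v, u])
    return (u, v), phi, width_img[v, u]

# Toy 3x3 maps: the centre pixel is the best grasp, at phi = pi/4
q = np.zeros((3, 3)); q[1, 1] = 1.0
cos2p = np.full((3, 3), np.cos(np.pi / 2))
sin2p = np.full((3, 3), np.sin(np.pi / 2))
w = np.full((3, 3), 0.05)
centre, phi, width = decode_best_grasp(q, cos2p, sin2p, w)
print(centre, phi, width)   # -> (1, 1) ~0.785 0.05
```

Using cos/sin rather than the raw angle also avoids the discontinuity where the angle wraps around, which makes training smoother.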
I've finished training the model and testing/validating it. In current tests, the model achieves an 83% grasp success rate on a set of previously unseen objects, 88% on a set of household objects that are moved during the grasp attempt, and 81% when grasping in dynamic clutter. Over time, I'll keep training the model on new data to improve these numbers.
Reinforcement Learning (RL) is about learning what to do: how to map situations to actions so as to maximize a numerical reward signal. Instead of being told which actions to take, the learner must discover which actions yield the most reward. In our case, the action is grasping an object, and a successful grasp earns the reward.
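The explore-then-exploit idea can be shown with a deliberately tiny toy: tabular Q-learning where each action is a candidate grasp with a fixed success reward. A real grasping environment would replace the fixed rewards with actual grasp attempts; everything else is a stand-in for illustration.

```python
import random

def q_learning(rewards, episodes=500, alpha=0.1, epsilon=0.1, seed=0):
    """Tabular Q-learning on a one-step toy task: each action is a candidate
    grasp and yields a fixed success reward."""
    rng = random.Random(seed)
    q = [0.0] * len(rewards)
    for _ in range(episodes):
        # Epsilon-greedy exploration: occasionally try a random grasp
        if rng.random() < epsilon:
            a = rng.randrange(len(rewards))
        else:
            a = max(range(len(rewards)), key=q.__getitem__)
        # One-step update toward the observed reward
        q[a] += alpha * (rewards[a] - q[a])
    return q

# Three candidate grasps with success rates 0.2, 0.9, 0.5
q = q_learning([0.2, 0.9, 0.5])
best = max(range(3), key=q.__getitem__)
print(best)   # the agent settles on grasp 1, the highest-reward action
```

Even in this stripped-down form, the two ingredients of RL are visible: exploration discovers that grasp 1 pays off, and the value update gradually shifts the policy toward it.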
Using RL, we can get an agent to learn the optimal policy for a sequential decision problem without complete knowledge of the environment. The agent first explores the environment by taking actions, and then updates its policy according to the reward function to maximize the reward. To train the agent, we can use:
To learn more about these, you can read this paper, which tested all of these algorithms in a simulated environment and compared the results.
Key Terms to note here are:
Some of the key limitations of RL right now are that it's very hard to find the optimal reward function, and setting up a grasping experiment in real life is difficult. For this reason, most implementations have been done in simulation, but the results might not translate as well to real life.
A typical simulated task looks like this: the robot must pick up objects in a bin populated with randomized objects, then generalize to unseen test objects, or pick up a specific object from a cluttered bin.
To overcome some of these barriers, we could also use imitation learning, where the learner tries to mimic an expert's actions in order to achieve the best performance. This is easier to facilitate, since a human can perform the actions and the model can learn from them over time. We can use the DAgger algorithm to accomplish these tasks. I'll go into more detail about these in a future blog post.
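The core DAgger loop is short enough to sketch here. The key idea is that the expert labels the states the *learner* actually visits, not just its own demonstrations, so the training data covers the learner's mistakes. The `expert`, `train` and `rollout` functions below are toy stand-ins (a threshold rule, label memorisation, and a fixed state list) chosen only to make the loop runnable.

```python
def dagger(expert_policy, train, rollout, rounds=3):
    """Minimal DAgger loop: roll out the learner, ask the expert to label
    every visited state, aggregate the data, and retrain."""
    states, labels = [], []
    policy = None
    for _ in range(rounds):
        visited = rollout(policy)                          # states the learner reaches
        states.extend(visited)
        labels.extend(expert_policy(s) for s in visited)   # expert corrections
        policy = train(states, labels)                     # retrain on the aggregate
    return policy

def expert(s):                # the demonstrator: "grasp" (1) whenever s > 0
    return 1 if s > 0 else 0

def train(states, labels):    # toy learner: memorise the expert's labels
    table = dict(zip(states, labels))
    return lambda s: table.get(s, 0)

def rollout(policy):          # toy rollout: a real robot would act here
    return [-2, -1, 0, 1, 2]

learned = dagger(expert, train, rollout)
print([learned(s) for s in [-2, -1, 0, 1, 2]])   # -> [0, 0, 0, 1, 1]
```

In a real setup, `train` would fit a neural network and `rollout` would execute the current policy on the robot (or in simulation), with a human supplying the expert labels.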
The neural network would be just one part of the entire system: in the long term, the goal is to integrate it with symbolic reasoning techniques such as rules engines, expert systems or knowledge graphs. This is often referred to as Neuro-Symbolic AI. An example is using neural networks to identify the shape or colour of a particular object, and then applying symbolic reasoning to derive other properties such as its area, volume and so on.
Why should we combine neural networks with symbolic reasoning?
Current deep learning models are too narrow. Given huge amounts of data, they work very well at the task you train them to perform, but they break down if you ask them to adapt to a more general task, and adapting requires enormous amounts of additional data. On the other hand, symbolic AI is really good at doing interesting things with symbols, but actually extracting symbols from the real world is much harder than anticipated.
By combining the two approaches, you end up with a system that has neural pattern recognition allowing it to see, while the symbolic part allows the system to logically reason about symbols, objects, and the relationships between them.
CLEVR is one of the first datasets used to benchmark neuro-symbolic AI: it provides a set of images containing multiple solid objects and poses questions about the content and relationships of those objects (visual question answering). You might be able to train a pure neural network to solve a task like that, but it requires tremendous amounts of data.
To solve this, researchers first use a CNN to process the image and produce a list of objects and their characteristics, such as color, location and size. This is a symbolic representation that rule-based AI models can operate on. An NLP module processes the question and parses it into a list of classical programming functions that can be run on the object list. The result is a rule-based program that expresses and solves the problem in symbolic-AI terms. The model is trained with reinforcement learning, through trial and error, based on the rules of the environment it operates in.
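The symbolic half of that pipeline is easy to sketch. Suppose the CNN has already emitted an object list for a CLEVR-style scene; a parsed question then becomes a chain of ordinary functions run over that list. The scene, functions and "program" below are illustrative, not taken from any actual CLEVR implementation.

```python
# Symbolic object list, as a CNN's perception module might emit it
scene = [
    {"color": "red",  "shape": "cube",   "size": "large"},
    {"color": "blue", "shape": "sphere", "size": "small"},
    {"color": "red",  "shape": "sphere", "size": "small"},
]

# Classical functions a question parser could compose
def filter_color(objs, color):
    return [o for o in objs if o["color"] == color]

def filter_shape(objs, shape):
    return [o for o in objs if o["shape"] == shape]

def count(objs):
    return len(objs)

# "How many red spheres are there?" parsed into a program:
program = [(filter_color, "red"), (filter_shape, "sphere")]
objs = scene
for fn, arg in program:
    objs = fn(objs, arg)
answer = count(objs)
print(answer)   # -> 1
```

Because each step is an explicit function over explicit symbols, the system's answer is interpretable: you can inspect exactly which objects survived each filter.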
To learn more about this, you can read this blog.
How could we do this for grasping?
We can create a map of different applications and capabilities through affordances: the perceived and actual properties of a thing that provide cues to how it can be used. Affordances can serve as symbolic representations for completing a task. We can combine these capabilities with a neural network that makes a grasp prediction and turns it into a symbolic representation that can be reasoned over by calling functions.
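As a sketch of that combination: a hypothetical affordance table supplies the symbolic side, while the network's grasp-quality score supplies the neural side, and a simple rule joins the two. The objects, affordances and threshold below are made-up examples, not part of any existing system.

```python
# Hypothetical affordance table: symbolic properties each object affords
affordances = {
    "mug":   {"graspable", "containable", "pourable"},
    "knife": {"graspable", "cuttable-with"},
    "table": {"supportable"},
}

# Which affordance each task requires (illustrative mapping)
required = {"pick-up": "graspable", "pour": "pourable", "cut": "cuttable-with"}

def can_perform(obj, task, grasp_quality, threshold=0.7):
    """Combine a neural grasp-quality prediction with symbolic affordance
    reasoning: the task is feasible only if the object affords it AND the
    network is confident the object can be grasped."""
    if grasp_quality < threshold:      # all these tasks need a grasp first
        return False
    return required[task] in affordances[obj]

print(can_perform("mug", "pour", grasp_quality=0.9))    # -> True
print(can_perform("knife", "pour", grasp_quality=0.9))  # -> False: no pour affordance
print(can_perform("mug", "pour", grasp_quality=0.3))    # -> False: low grasp confidence
```

The appeal of this split is that the symbolic table can be edited or extended by hand, while the neural score keeps the decision grounded in what the robot actually perceives.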
Over the next few weeks, I'll be working on the following to move this project forward:
If you enjoyed this blog, please do share it: