On Semantic Robot Programming for Goal-Directed Manipulation in Cluttered Scenes

In robotics research, numerous ways of communicating a goal to a robot have been explored. Prominent among them is communication through natural language, specifically through voice commands. Although instructing robots through voice commands seems easy and natural, it is not always convenient for describing manipulation goals. To get a robot to set up a table in a specific manner, describing your desired arrangement through natural language could be quite tedious and could easily lead to ambiguity. A better way to describe such a goal would be to physically demonstrate what a good table setup looks like. This way, the robot knows exactly the kind of arrangement to go for. This is what Zhen Zeng, Zheming Zhou, Zhiqiang Sui and Odest Chadwicke Jenkins set out to accomplish in their work on Semantic Robot Programming.

Semantic Robot Programming (SRP) is a declarative approach to programming robots for manipulation tasks on objects. All you have to do is demonstrate the goal arrangement of the objects to the robot. The robot then autonomously plans a sequence of actions to change the arrangement of the objects from any arbitrary initial placement to the goal arrangement.

The robot achieves this by first using a scene perception technique called DIGEST to construct a scene graph of both the initial object arrangement and the goal arrangement. The scene graphs describe the spatial relationships between objects. For task and motion planning purposes, these relationships are expressed in PDDL, a STRIPS-style planning language. An arrangement that consists of a saucer on a table, a tea cup on the saucer, and a tea spoon in the tea cup could be described in PDDL as: exists(table), on(saucer, table), on(tea cup, saucer), in(tea spoon, tea cup). This relational representation of the initial and goal arrangements is provided to the task planner, which uses a planning algorithm (breadth-first search in this paper) to find a sequence of actions that transforms the initial arrangement into the goal arrangement. The robot then uses a motion planning framework (MoveIt! in this paper) to plan and execute the low-level arm trajectories that carry out each step of the high-level plan, leaving the objects arranged as the goal specifies. It is worth noting that the robot does not strive to precisely match the object positions in the goal demonstration; it only matches the semantic relationships between the objects.
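To make the planning step concrete, here is a minimal sketch, not the authors' code, of breadth-first search over symbolic scene states. The predicate tuples, the "clear object" check, and the containers set are simplifications I am assuming for illustration; the paper formulates the domain in PDDL and hands it to a planner.

```python
from collections import deque

def is_clear(state, obj):
    """An object is clear if nothing is stacked on it or placed inside it."""
    return not any(rel in ("on", "in") and below == obj
                   for rel, _, below in state)

def successors(state, objects, containers):
    """Generate (action, next_state) pairs for simple pick-and-place moves."""
    for obj in objects:
        if not is_clear(state, obj):
            continue  # cannot move an object with something on or in it
        current = [f for f in state if f[0] in ("on", "in") and f[1] == obj]
        for dest in objects | {"table"}:
            if dest == obj:
                continue
            if dest != "table" and not is_clear(state, dest):
                continue  # destination already occupied
            rel = "in" if dest in containers else "on"
            new_fact = (rel, obj, dest)
            if new_fact in state:
                continue  # object is already there
            next_state = (state - frozenset(current)) | {new_fact}
            yield (f"place {obj} {rel} {dest}", frozenset(next_state))

def bfs_plan(initial, goal, objects, containers):
    """Breadth-first search from the initial scene graph to one satisfying the goal."""
    frontier = deque([(initial, [])])
    visited = {initial}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:  # every goal relation holds in this state
            return plan
        for action, nxt in successors(state, objects, containers):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None

# Tea-setting example from the text; object identities are assumed known from perception.
objects = {"saucer", "tea_cup", "tea_spoon"}
containers = {"tea_cup"}  # hypothetical: which objects can contain others
initial = frozenset({("on", "saucer", "table"),
                     ("on", "tea_cup", "table"),
                     ("on", "tea_spoon", "table")})
goal = frozenset({("on", "saucer", "table"),
                  ("on", "tea_cup", "saucer"),
                  ("in", "tea_spoon", "tea_cup")})
print(bfs_plan(initial, goal, objects, containers))
# ['place tea_cup on saucer', 'place tea_spoon in tea_cup']
```

Because the search runs entirely over the symbolic relations, the same plan is produced no matter where on the table the objects actually sit, which is exactly the point of matching semantics rather than positions.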

This briefly describes the entire Semantic Robot Programming framework. Here’s a figure from the Semantic Robot Programming research paper that aptly illustrates the framework.

[Figure: overview of the Semantic Robot Programming framework, reproduced from the paper]

From the brief description of the framework above, it seems to me that all the magic happens in the DIGEST technique. Indeed, I believe a key measure of intelligence is the ability to distill complex phenomena into easily digestible representations. In other words, the ability to extract exactly the kind of information relevant for a particular purpose from a dense hodge-podge of information. And this is exactly what DIGEST does. It distills the complex information contained in pixel space into the compact graph representation the planner needs to do its magic. DIGEST is endowed with an object detector that places bounding boxes around detected objects. It then generates a set of candidate identities for each detected object and builds a collection of scene hypotheses, where each hypothesis is one combination of candidate object identities. A Bayesian filtering approach computes the pose estimates and the likelihood of each hypothesis, and the hypotheses are ranked by these likelihoods. The hypothesis with the greatest likelihood is then chosen as the scene estimate.
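As a rough illustration of the ranking step, here is a toy sketch, with hypothetical detections and scores, of how scene hypotheses could be enumerated and the most likely one selected. In the paper the likelihood comes from pose estimation and comparison against the observed scene; the detection scores below merely stand in for that.

```python
from itertools import product
import math

def hypothesis_likelihood(hypothesis):
    """Stand-in likelihood: sum of log detection scores for the chosen identities.
    In DIGEST this would instead come from the pose-based Bayesian filtering step."""
    return sum(math.log(score) for _, score in hypothesis)

def best_scene_estimate(candidates_per_detection):
    """candidates_per_detection: one list of (identity, score) pairs per bounding box."""
    hypotheses = product(*candidates_per_detection)  # every combination of identities
    return max(hypotheses, key=hypothesis_likelihood)

# Two detections, each with ambiguous identities (hypothetical scores).
detections = [
    [("tea_cup", 0.7), ("mug", 0.3)],
    [("saucer", 0.6), ("plate", 0.4)],
]
print(best_scene_estimate(detections))  # (('tea_cup', 0.7), ('saucer', 0.6))
```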

To determine the spatial relationships between the now-identified objects, DIGEST uses pre-obtained 3D meshes of the objects. It applies simple heuristics that check the alignment of the mesh axes to determine stacking and other spatial relationships between the objects. Using these inferred relationships, it builds the scene graph, expresses it in PDDL, and hands it to the task and motion planner.
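To give a sense of what such a heuristic might look like, here is a simplified sketch that reduces each object to an axis-aligned bounding box and infers an "on" relation when footprints overlap and the surfaces nearly touch. This is my own approximation for illustration; the paper works with the full registered meshes and their axes.

```python
def overlaps_xy(a, b):
    """True if two boxes overlap when projected onto the table (x-y) plane."""
    return (a["min"][0] < b["max"][0] and b["min"][0] < a["max"][0] and
            a["min"][1] < b["max"][1] and b["min"][1] < a["max"][1])

def is_on(top, bottom, tol=0.01):
    """Heuristic: 'top' rests on 'bottom' if their footprints overlap and the
    bottom face of 'top' sits near the top face of 'bottom' (within tol meters)."""
    return overlaps_xy(top, bottom) and abs(top["min"][2] - bottom["max"][2]) < tol

def scene_graph(boxes):
    """Build ('on', a, b) facts from per-object bounding boxes."""
    facts = set()
    for name_a, box_a in boxes.items():
        for name_b, box_b in boxes.items():
            if name_a != name_b and is_on(box_a, box_b):
                facts.add(("on", name_a, name_b))
    return facts

# Hypothetical estimated extents: a tea cup resting on a saucer (meters).
boxes = {
    "saucer":  {"min": (0.00, 0.00, 0.000), "max": (0.15, 0.15, 0.020)},
    "tea_cup": {"min": (0.04, 0.04, 0.020), "max": (0.11, 0.11, 0.100)},
}
print(scene_graph(boxes))  # {('on', 'tea_cup', 'saucer')}
```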

The following are a few questions that arose as I read this research paper.

What happens if an object in the goal scene is not present in the initial scene? Does the robot attempt to match the arrangement of the existing objects as closely as possible to that of the goal scene, or does it not try at all? And what happens if there are extraneous objects in the initial scene? How does SRP know to disregard them or set them aside?

In determining the spatial relationships between objects, DIGEST assumes that the robot possesses a 3D mesh of every object of interest. Could there be a more convenient way to determine object relationships that does not require geometric computation on meshes, or even the existence of meshes at all, since 3D meshes can be non-trivial to obtain? Could a neural net be trained not only to detect individual objects but also to predict the spatial and semantic relationships between them? Image captioning networks already do a form of semantic relationship prediction when they caption images. Could this idea be adapted for SRP?

What would it take to expand the state space of SRP from the tabletop setting described in the paper to, say, an entire kitchen? Would DIGEST be able to handle it, especially since there would now be a huge number of objects, many of which would be irrelevant to the goal task? How do we scale up SRP?

Manually specifying objects of interest and physically demonstrating goal scenes seems tedious for everyday usage of a domestic robot. Could the robot be made to infer the appropriate goal scene from a task description alone? For example, if I command my kitchen robot, Schrute, to clean up the kitchen, Schrute should ideally be able to infer from this task description that the spoon should definitely not be left on the floor after clean up, but that the table should be left with its legs in contact with the floor. What would it take to develop such a common-sense knowledge base of a kitchen? Would these rules all have to be hard-coded, or should they all be learned? Or should the basic rules be hard-coded and the rest left to the robot to learn? And how should the robot use these rules to infer goal scenes?

I find Semantic Robot Programming really interesting because of its potential to significantly advance communication between humans and robots, and I look forward to future improvements of this work.
