03/24/2019
Introduction
RASCAPP [1] is a mobile manipulator robot I built by first stripping an electric wheelchair down to its power base, rewiring its motor leads to a motor driver that could be controlled with ROS, and finally mounting the Baxter robot onto this mobile base. The figure below depicts its final look. RASCAPP was built as part of a research project, headed by Prof. William Messner, which aimed to develop robotic platforms that provide domestic assistance to people rendered immobile by spinal injuries. Over the course of the project, alongside developing other nifty devices to make domestic life easier for our target population, we got RASCAPP to perform kitchen tasks like transferring food into and out of refrigerators, microwaving food, and picking up objects from the ground. The difficulty in getting RASCAPP to perform these tasks lay in developing manipulation strategies that generalize over varying positions of the objects of interest.
Unlike in a factory, where robots perform tasks in a structured, deterministic environment, in a domestic setting objects of interest are not guaranteed to occupy the same space over time. In other words, the state space of objects is non-deterministic. You cannot design a reasonably successful domestic robot system on the assumption that, say, a bottle of water will always be returned to the same spot in the fridge after every use over an extended period. As a result, I employed object pose detection techniques such as sticking AR markers onto objects of interest and using color segmentation to identify objects and estimate their 3-dimensional position and orientation. In this particular work, I get the robot to pour a drink from a bottle into a cup. I describe my methodology and the various techniques I employed to get the robot to perform this task at a 98% success rate.
Methodology
I set a table in front of the robot and get it to pour a drink from a bottle on the table into a cup that is also on the table. I use a combination of hard-coded trajectories and motion planning to ensure that the robot's grippers move along collision-free trajectories throughout the activity. To locate and estimate the pose of the objects (the cup and the bottle), I use a PrimeSense 3D camera. I employ a color-segmentation technique that turns every pixel in the robot's camera frame black unless it falls within a specific color range, in which case the pixel is turned white. In this particular case, the bottle was green and the cup was light blue. I then remove noise from the resulting black-and-white image, find all white contours, and select the largest contour as the object of interest. I draw a bounding box around this contour and take the centroid of the bounding box to be the centroid of the object. Finally, with the pixel coordinates of this centroid, I query the depth-registered point cloud generated by the camera for the 3D position of the point that corresponds to those pixel coordinates. Once I have this 3D position, I broadcast its transformation with respect to the camera frame and publish it to a designated ROS topic as a time-stamped pose message (PoseStamped, from the ROS geometry_msgs package).
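For concreteness, here is a minimal sketch of what the segmentation and centroid step can look like in OpenCV. The HSV thresholds below are placeholders rather than the exact values tuned on the robot, and the depth-registered point-cloud lookup is only indicated in a comment.

```python
import cv2
import numpy as np

def find_object_centroid(bgr_image, hsv_low=(40, 80, 50), hsv_high=(80, 255, 255)):
    """Return the pixel centroid of the largest blob inside the given color range."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    # Pixels inside the color range become white, everything else black.
    mask = cv2.inRange(hsv, np.array(hsv_low), np.array(hsv_high))
    # Morphological opening removes small speckle noise.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    # [-2] picks the contour list in both OpenCV 3 and 4.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
    if not contours:
        return None
    # Keep only the largest contour as the object of interest.
    largest = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(largest)
    # Centroid of the bounding box, in pixel coordinates.
    # These coordinates index the depth-registered point cloud to get the 3D
    # position, which is then published as a geometry_msgs/PoseStamped.
    return (x + w // 2, y + h // 2)
```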
With the robot's arm in zero-gravity mode, I move its gripper into a grasp position and orientation (the goal pose) that maximizes the chance of a successful grasp, and I record that pose. For each grasp attempt, I compute the relational transformation between the bottle's pose and the recorded goal pose, and pass the result to the robot's inverse kinematics solver, which generates the trajectory that brings the gripper to the goal pose given the current position of the bottle. This way, the gripper always holds the bottle at the same region and orientation regardless of where the bottle is, as long as the bottle is within reach of the robot's arm. This solves the problem of varying object position to some degree.
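One way to express this relational transformation is with 4x4 homogeneous transforms, sketched below with numpy. On the robot this was presumably done through ROS tf frames; the matrix names here are illustrative only.

```python
import numpy as np

def relative_transform(T_world_bottle_demo, T_world_gripper_demo):
    """Offset of the recorded grasp pose, expressed in the bottle's frame."""
    return np.linalg.inv(T_world_bottle_demo) @ T_world_gripper_demo

def grasp_goal(T_world_bottle_now, T_bottle_gripper):
    """Goal gripper pose for a newly detected bottle pose."""
    return T_world_bottle_now @ T_bottle_gripper

# Recorded once, with the arm in zero-gravity mode:
#   T_bottle_gripper = relative_transform(T_world_bottle_demo, T_world_gripper_demo)
# For each grasp attempt, grasp_goal(...) is handed to the IK solver.
```

Because the grasp offset is stored relative to the bottle, the same recorded demonstration transfers to any detected bottle position within the arm's workspace.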
Pinto et al. [2] and Levine et al. [3] have developed ways to learn successful pre-grasp poses for varying objects by training a neural network on hundreds of thousands of grasp attempts across a diverse set of objects. While these methods, unlike mine, generalize well to new kinds of objects with different geometries, their grasp success rates were just over 60% for the first and about 80% for the second. Since these success rates are not practical in a domestic setting, I opted for my approach, which, although restricted to bottles with geometries similar to mine, has a 98% success rate.
Using this technique, I got the robot to locate the bottle and pick it up. The pouring action used a variant of the same technique. In order to pour the drink into the cup, I chose a specific spatial relationship between the bottle and the cup. Granted, there are numerous configurations in which one can pour a drink into a cup. Yamaguchi et al. [4] and Rozo et al. [5] have employed various learning strategies to get robots to learn how to pour, and these methods generalize well over different bottle shapes and fluid properties. But they are not trivial to implement and do not have success rates high enough to be practical in a domestic setting. As a result, I settled on a single bottle-cup configuration. I use the relational method described in the previous paragraph to first orient the bottle in the chosen pouring configuration, query the 3D poses of both the gripper and the cup, and compute the relational transformation between the cup's pose and the gripper's goal pose. For each pouring task, I pass this transformation to the robot's inverse kinematics solver to generate the trajectory that pours the drink into the cup.
The focus of this work was not on the specific quantity of drink poured into the cup. Since I assumed the drinks would have roughly the same viscosity as water, keeping the robot's gripper in the pouring pose for a fixed period of time (in this case, 3 seconds) pours approximately the same quantity of drink into the cup on each pouring action.
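The post does not spell out how the tilt itself is executed, so the following is only a sketch of the timed pour, assuming it is done by rotating Baxter's wrist-roll joint and holding for the fixed duration. The arm side, tilt angle, and joint choice are assumptions, and the robot would need to be enabled through baxter_interface beforehand.

```python
import rospy
import baxter_interface

def timed_pour(limb_name='left', tilt=1.6, hold_secs=3.0):
    """Tilt the wrist-roll joint, hold for a fixed time, then return upright."""
    limb = baxter_interface.Limb(limb_name)
    start = limb.joint_angles()                    # current joint positions (dict)
    pour = dict(start)
    pour[limb_name + '_w2'] = start[limb_name + '_w2'] + tilt  # tilt the bottle
    limb.move_to_joint_positions(pour)
    rospy.sleep(hold_secs)                         # hold ~3 s for a consistent amount
    limb.move_to_joint_positions(start)            # return the bottle upright

if __name__ == '__main__':
    rospy.init_node('timed_pour_sketch')
    timed_pour()
```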
Video
Robot pours drink from the bottle into the cup
Results
My approach to the pouring task achieved a 98% success rate across different configurations and arrangements of the bottle and the cup. Out of 100 pouring attempts, only 2 were unsuccessful. In the first failed attempt, the robot knocked the bottle off the table while trying to grasp it. In the second, most of the drink was poured on the table instead of into the cup.
The Bigger Picture: Learning vs Planning
Robotic planning in Task and Motion Planning (TAMP) problems is essentially a search problem: given an initial state and a goal state, the planning agent searches for and returns a sequence of actions or configurations that take the agent from the initial state to the goal state.
The problem with pure planning approaches is that they are typically slow and do not improve with experience. The primary reason they are slow is that they have a complex hybrid search space. Say a robot is trying to cook dinner. It has to decide on discrete decision variables, such as which object to manipulate next, and as the number of objects and the planning horizon grow, this runs into a combinatorial explosion. It also has to decide on continuous variables, such as where the robot should stand in order to pick up a particular object. These continuous choices must satisfy physical and geometric constraints, which are usually checked by calling an external low-level motion planner, and those calls are expensive.
To avoid this computational problem, we might turn to learning. We can train a neural network to map a state to a low-level robot control that maximizes the sum of rewards over a particular time horizon. The premise is that there is some smoothness the function approximator can exploit in the mapping from states and actions. The problem in TAMP is that a small change in the state can result in a large change in the feasible motion, and the same holds for small changes in the actions. Based on this observation, we can foresee that learning would require a lot of data, because all of these fine details have to be learned. On the other hand, if we do manage to train a neural network on as much data as it needs, almost no online computation is required. Planning, by contrast, though computationally expensive, needs no training data.
We can think of a spectrum between pure planning and pure learning, with pure planning at the far left end and pure learning at the far right. Pure planning is difficult because of the complex search space, and pure learning is difficult because it needs an enormous amount of data. A natural compromise is to sit somewhere in the middle and learn to guide planning by predicting constraints that steer the planning procedure. Constraints here are suggestions for a subset of the decision variables. For an object-placement problem, the full set of decision variables could be all the configurations for placing the object down. If we restrict the learner to predicting only the robot's base placement pose, that quantity is much smoother with respect to changes in the state. The remaining decision variables, which must yield collision-free behavior, can be delegated to a low-level motion planner that adapts to changes in the environment. A toy sketch of this division of labor is shown below.
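The sketch below is only an illustration of the idea, not any published system's API: a learned model proposes a handful of base poses (one subset of the decision variables), and the expensive motion planner is consulted only for those proposals instead of the full hybrid search space. Both predict_base_poses and motion_plan are hypothetical stubs.

```python
def predict_base_poses(state, k=5):
    """Learned predictor: return k candidate base poses for the current state."""
    raise NotImplementedError  # e.g. a neural net trained on past planning solutions

def motion_plan(state, base_pose):
    """Low-level planner: return a collision-free arm trajectory, or None."""
    raise NotImplementedError  # e.g. a sampling-based or optimization-based planner

def place_object(state):
    # Instead of searching over every (base pose, arm trajectory) combination,
    # only the predicted base poses are checked; the remaining decision
    # variables are delegated to the low-level motion planner.
    for base_pose in predict_base_poses(state):
        trajectory = motion_plan(state, base_pose)
        if trajectory is not None:
            return base_pose, trajectory
    return None  # fall back to uninformed search if no suggestion is feasible
```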
Other approaches to merging learning and planning, like that of Phillips et al. [6], show how incorporating user demonstrations of constrained manipulation motions can dramatically accelerate planning for constrained objects like drawers and doors. They demonstrate how their approach can be directly incorporated into experience graphs, which are graphs that encode and reuse previous planning experience.
The application of learning to Task and Motion Planning problems is an area of research I am deeply interested in, so I will most likely write more blog posts on new strategies in the robotics literature that attempt to improve TAMP using learning.
References
[1] Alphonsus Adu-Bredu. RASCAPP: Documentation of a Domestic Assistant Mobile Manipulator Robot. URL: https://alphonsusadubredu.com/wp-content/uploads/2019/01/rascapp.pdf
[2] Lerrel Pinto and Abhinav Gupta. Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours. International Conference on Robotics and Automation, 2016. URL: https://arxiv.org/abs/1509.06825
[3] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. CoRR, abs/1603.02199, 2016. URL: http://arxiv.org/abs/1603.02199
[4] Akihiko Yamaguchi, Christopher Atkeson, Scott Niekum and Tsukasa Ogasawara: Learning Pouring Skills from Demonstration and Practice. 2014 IEEE-RAS. URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7041472
[5] Leonel Rozo, Pablo Jimenez, Carme Torras: Force-based robot learning of pouring skills using parametric hidden Markov models. 9th International Workshop on Robot Motion and Control. URL: https://ieeexplore.ieee.org/document/6614613
[6] Mike Phillips, Victor Hwang, Sachin Chitta, Maxim Likhachev: Learning to Plan for Constrained Manipulation from Demonstrations. Autonomous Robots (AURO), 2016. URL: http://www.cs.cmu.edu/~maxim/files/truncatedincsearch_aij16.pdf