Teach your Robot new Tricks with NLP and Deep Learning

Simon Stepputtis
13 min read · Nov 30, 2020

Imagine Alexa or Siri were able to control your robot. What do we need to make this possible? We recently published a paper at NeurIPS 2020, “Language-Conditioned Imitation Learning for Robot Manipulation Tasks”, that addresses this question. In this Medium article, we provide a step-by-step guide that allows you to talk to your robot.

The full source code, dataset, and a pre-trained model are available at our GitHub repository https://github.com/ir-lab/LanguagePolicies.


How to teach robots new skills efficiently is an ongoing problem in robotics. The ever-increasing availability of robots to the general public, in particular, raises the question of how non-experts can teach new tasks to a robot. Users generally have an expectation of what the robot should be doing in a given scenario and what behavior it should produce. An intuitive and natural way of relaying these expectations to the robot is to demonstrate the expected behavior, in other words, imitation learning.

However, good teachers do not only demonstrate the task but also explain it verbally. Using natural language together with the physical demonstration allows human demonstrators to relay the desired motion and give a verbal description or command alongside it. This additional information enables the robot to understand the underlying intention of a task. For example, consider the task of picking up an object from a table. A teacher would move the robot arm to the target location and say “Pick up the green cup”. After training, the robot should be able to execute the demonstrated behavior, even if we say “Raise the green cup”. The following video shows the robot movement after receiving the new instruction:

Execution of two tasks in a sequential manner. First, the language command is analyzed in the given environment to identify the target object, desired action, and described quantity; then, this information is used to generate a suitable task controller that actuates the robot.

A challenging aspect is that the robot needs to learn the dependencies between language, vision, and motion. In the above example, the robot needs to identify the object, locate it in the camera image, and finally produce a suitable motion to get there. This is where things get difficult due to the use of multiple modalities involved in the process.

In our work, we address this fundamental challenge with a single end-to-end framework based on imitation learning. During training, our model learns the correlations between language, vision, and control, allowing it to generate a suitable control policy from a new environment image and a related command at runtime. Language is used to identify the what and where of the desired task, as well as how the task needs to be executed. We provide an intuitive interface to users, allowing them to provide a command using unconstrained natural language. Our deep network then looks at the current context given by the camera image and tries to synthesize the underlying instruction. The resulting control commands are then sent to the robot in order to achieve the described task.

Introducing the Model

In our approach, policy generation is treated as a translation process from language and vision. We consider the problem of learning a policy from a given set of demonstrations in a supervised manner. Each demonstration contains the full robot trajectory, an image of the robot’s surroundings, and a natural language command. The model's goal is to generate a suitable policy π from the image and the verbal command. In our case, the policy is a motor primitive for which our model generates all hyper-parameters, namely the weights W and the current temporal position in the task ϕ at which the motor primitive is evaluated to generate the next robot state rₜ₊₁. While our model is trained end-to-end, it can conceptually be divided into two parts:

System overview of our framework (center). Detailed explanations of the semantic module (right) and the policy translation module (left) can be found below.
  • Semantic Module: The semantic module's goal is to generate a joint embedding e of the visual environment information and the language command. Intuitively, this module’s task is to encode all the necessary information about the desired task. As part of the semantic module, we use GloVe word embeddings to convert the command into a fixed-size language representation. Additionally, image processing is done by Faster-RCNN to generate a set of candidate objects in the environment. Faster-RCNN is fine-tuned to our task's specific needs, i.e., detecting objects that differ in color, size, and shape.
  • Policy Translation Module: The policy translation module takes the previously generated task representation e and translates it into a motion primitive’s hyper-parameters. In this work, we predict the motor primitive’s weights, the current phase ϕ, and the speed at which the robot moves in time Δₜ. This generation is done at each timestep to adapt to potential control discrepancies of the robot.
  • Motor Primitive (part of the Policy Translation): The policy translation module generates the hyper-parameters of the motor primitive. We use 11 Gaussian basis functions for each robot degree of freedom that are equidistantly distributed between phase ϕ ranging from 0 to 1, where 0 indicates the beginning of any trajectory, and 1 indicates the last step of any trajectory. This motor primitive can be evaluated at any scalar position ϕ, being the sum of all basis functions at position ϕ times their respective weight.
Model Overview: Based on the environment perception and the voice command, the system generates a suitable motor primitive specific to the specified task in the given environment. The left half shows the semantic module, whereas the right side shows the policy translation module.

The above animation summarizes how the model works. Our perception module takes an RGB image of the environment and detects all objects within the workspace using a fine-tuned Faster RCNN (fine-tuned on 40,000 training images). In parallel, the operator’s input command is converted into a sentence representation by using GloVe word embeddings. In the attention network, the target object is identified and concatenated with the sentence representation and the current robot state. The result of the semantic section shown on the left-hand side is then translated into a task-specific motor primitive by generating its weights and the current progress in the overall task. The policy generation network is executed recurrently at each timestep to predict the next robot state.
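
To make the motor primitive more concrete, the following is a minimal NumPy sketch of how such a primitive could be evaluated at a phase ϕ. The number of basis functions (11) and the equidistant centers between 0 and 1 come from the description above; the basis width, the seven degrees of freedom, and the function names are assumptions made for illustration and are not taken from the repository.

```python
import numpy as np

NUM_BASIS = 11  # from the article: 11 Gaussian basis functions per degree of freedom
NUM_DOF = 7     # assumption: six joints plus one gripper value


def basis_activations(phase, num_basis=NUM_BASIS, width=0.012):
    """Evaluate Gaussian basis functions with equidistant centers in [0, 1].

    `width` (the variance of each Gaussian) is an assumed value.
    """
    centers = np.linspace(0.0, 1.0, num_basis)
    return np.exp(-((phase - centers) ** 2) / (2.0 * width))


def evaluate_primitive(weights, phase):
    """Weighted sum of all basis functions at `phase`, one value per degree of freedom."""
    return weights @ basis_activations(phase, weights.shape[1])


# Random weights stand in for the weights W predicted by the network.
weights = np.random.default_rng(0).normal(size=(NUM_DOF, NUM_BASIS))
state = evaluate_primitive(weights, phase=0.5)
print(state.shape)  # (7,)
```

Because the primitive is a plain weighted sum, it can be queried at any scalar phase, which is what allows the network to re-predict the weights and phase at every timestep.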

The main implementation of our model can be seen below. The semantic module explained above can be found from line 64 onwards, and the policy translation can be found from line 84 onwards (detailed code for the sub-modules can be found on our GitHub).


Our model is evaluated in a simulated table-top manipulation task. The task's goal is first to pick up a cup and then to pour a specified quantity of the cup’s content into the target bowl. This task is broken down into two distinct actions, picking and pouring, where the second action is issued after the first one has been completed. Overall, we utilize three different cups (red, green, and blue) as well as twenty unique bowls that differ in their combination of color (yellow, red, green, blue, and pink), size (small and large), and shape (round and square). In each sample, a subset of the cups and bowls is present in the environment (limited to a maximum of six objects). Depending on which objects are present in the scene, the task's linguistic difficulty varies based on the number of visual features that need to be used in the command to describe the target object uniquely. The example below shows such a scenario:

To successfully describe the target object, using the color (red), shape (square), or size (small) would not be sufficient since none of them are unique to the target object. Thus, a combination of multiple features is required. A suitable sentence for this example could be “Fill everything into the small red rectangular container.”
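
The idea of needing a combination of features can be sketched in a few lines of Python. The scene below is hypothetical (it is not the scene from the figure), and the function name is our own; the sketch simply searches for the smallest sets of features (color, size, shape) that single out the target among the other objects.

```python
from itertools import combinations

FEATURES = ("color", "size", "shape")


def unique_feature_sets(target, others):
    """Return the smallest feature combinations that single out `target`."""
    for k in range(1, len(FEATURES) + 1):
        hits = [
            combo
            for combo in combinations(FEATURES, k)
            # A combination is unique if no other object matches on all of its features.
            if not any(all(o[f] == target[f] for f in combo) for o in others)
        ]
        if hits:
            return hits
    return []


# Hypothetical scene: the target shares each single feature with another object.
target = {"color": "red", "size": "small", "shape": "square"}
others = [
    {"color": "red", "size": "large", "shape": "round"},
    {"color": "blue", "size": "small", "shape": "square"},
]
print(unique_feature_sets(target, others))
# [('color', 'size'), ('color', 'shape')]
```

In this toy scene no single feature is sufficient, so a command has to mention at least two of them, for example the color together with the size.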


To train our model, we require a large amount of training data. The pre-trained model found on our GitHub was trained on 40,000 demonstrations (20,000 each for picking and pouring). Acquiring a dataset of this size by hand is infeasible, considering the need for annotated voice commands. To address this problem, we recorded 100 full tasks, each containing a picking and a pouring motion. The resulting videos are similar to the demonstration from the previous section. For these 200 videos (each of the 100 demonstrations contains a picking and a pouring motion), we asked five human annotators to provide a verbal command that they think is executed on the robot, without enforcing any constraints on the words or language used. The provided commands were then transcribed into text and converted into templates with replaceable noun phrases and adverbs. With these templates, we can automatically generate synthetic training descriptions by filling these slots with synonyms, given the context of a generated scene.
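
As a rough illustration of this template mechanism, the sketch below fills slots in a template with randomly chosen synonyms. The templates, synonym lists, and function name are hypothetical placeholders and are not the ones collected from our annotators.

```python
import random

# Hypothetical templates and synonym lists, for illustration only.
TEMPLATES = [
    "{verb} the {noun_phrase}",
    "{verb} the {noun_phrase} {adverb}",
]
SYNONYMS = {
    "verb": ["pick up", "raise", "grab"],
    "adverb": ["carefully", "slowly"],
}


def generate_command(noun_phrase, rng):
    """Fill a randomly chosen template with random synonyms for each slot."""
    template = rng.choice(TEMPLATES)
    slots = {name: rng.choice(options) for name, options in SYNONYMS.items()}
    # Slots that the chosen template does not use are ignored by str.format.
    return template.format(noun_phrase=noun_phrase, **slots)


rng = random.Random(0)
for _ in range(3):
    print(generate_command("green cup", rng))
```

The noun phrase would come from the generated scene, so that it describes the target object uniquely among the objects that are present.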

Tutorial: Using the Model

The following section explains how to use our model. To follow along, please first set up the code on our official GitHub.

Reproducing the Paper Results: Training

While we offer a pre-trained model in the downloaded archive, you can train your own model from scratch. To do so, we also provide the full pre-processed dataset of 40,000 training and 4,000 validation samples within the downloaded archive. These data can be found in their respective .tfrecord files. The following code snippet shows the features that are available in the dataset:

To train the model with the default settings on our dataset, you can start training as follows:

python main.py

Below, we will describe how to create your own dataset. Once you have created it, you can easily use it by changing the parameters at the top of the main.py file. You can also change other parameters like the learning rate and the weights for each of the auxiliary losses.

We trained our model on a system with two Intel Xeon E5-2699A v4 CPUs over 200 epochs until convergence. After training, the code will save the latest model as well as the best model in Data/Model/. Additionally, you can find the TensorBoard logs in Data/Log/.

Reproducing the Paper Results: Evaluation

To evaluate our model, CoppeliaSim, PyRep, and ROS 2 are required. The rationale behind using ROS is to separate the model’s inference and robot control from each other, allowing us to utilize available hardware resources better. However, for the evaluation in this tutorial, we will be using CoppeliaSim to simulate the robot.

In the first step, we need to run the service node that provides the neural network. This can be done by running:

python service.py

This will provide an inference service that controls the simulated robot in CoppeliaSim. Please make sure to set the correct path to your model. The default setting will use the pre-trained model provided in the downloaded archive; however, you can change the model used at the top of the service.py file.

Another interesting parameter is USE_DROPOUT, which enables dropout during inference. Instead of running one batch with one sample during each step, we run the same sample 250 times in a single batch while using dropout and average the produced outputs. This allows us to reason about the model's certainty of the produced output, as shown in the figure above. For example, if there is no green bowl in the environment, but the command instructs an interaction with a green object, the prediction of the target’s position has a higher variance than for bowls that are actually present (here, red bowls are present at various locations). Further details can be found in the blog post of Yarin Gal.
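
The following NumPy sketch illustrates the general idea of this dropout-based uncertainty estimate. It is not the code in service.py: the toy linear model, the dropout rate, and all names are assumptions, and only the batch of 250 identical samples and the averaging step are taken from the description above.

```python
import numpy as np

NUM_SAMPLES = 250  # from the article: the same sample is repeated 250 times
DROP_RATE = 0.2    # assumption: the dropout rate of the real model may differ


def toy_model(batch, weights, rng, drop_rate=DROP_RATE):
    """Stand-in for the network: a linear layer with dropout kept active at inference."""
    mask = rng.random(batch.shape) >= drop_rate
    dropped = batch * mask / (1.0 - drop_rate)  # inverted dropout scaling
    return dropped @ weights


def predict_with_uncertainty(sample, weights, rng, num_samples=NUM_SAMPLES):
    """Run one batch of identical samples and return the mean and variance of the outputs."""
    batch = np.repeat(sample[None, :], num_samples, axis=0)
    outputs = toy_model(batch, weights, rng)
    return outputs.mean(axis=0), outputs.var(axis=0)


rng = np.random.default_rng(0)
sample = rng.normal(size=8)
weights = rng.normal(size=(8, 3))
mean, variance = predict_with_uncertainty(sample, weights, rng)
print(mean.shape, variance.shape)  # (3,) (3,)
```

The mean plays the role of the averaged prediction, while a large variance across the 250 stochastic passes hints at low certainty for that output.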

Our model can be evaluated in CoppeliaSim by running val_model_vrep.py. At the top of the file, you can choose between two different evaluations:

  • Automatic: Evaluates the model on 100 full test-tasks (100 for picking and 100 for pouring). Set RUN_ON_TEST_DATA to True
  • Manual: Environments can be generated manually, and commands are typed in. Set RUN_ON_TEST_DATA to False

The automatic evaluation will create a file called val_result.json based on the number of test runs specified in NUM_TESTED_DATA. Without going into details of the file’s structure, it captures what should have been executed and what was actually executed by the robot during inference. A report of the automatic evaluation can be created by running:

python viz_val_vrep.py

This provides information about a particular log file, specified at the top of the file. For comparison, we provide the file ours_full_cl.json, which results from running the automatic evaluation on all 100 tasks. The last line of this output is a LaTeX line that has been placed directly in Table 1 of our paper.

For the manual evaluation, we provide a terminal interface to control the simulation. Unlike the automatic evaluation, no results file is generated, since the ground-truth trajectories are not available. You can interact with the simulation by typing the following commands and pressing enter:

  • q → This will quit the simulation
  • g → This will reset the robot and generate a new, random environment
  • r → This will reset the robot and environment to their respective default states
  • t <command> → Replace <command> with your command to execute the specified task in the given environment. As an example, you could run “t pick up the green cup”, assuming there is a green cup in the scene.

Overall, our model achieves a 95% success rate for picking, 85% success for pouring, and 84% success for executing both tasks sequentially.

The following example highlights why our low-level controller runs as a recurrent network that regenerates the control parameters at every timestep: it gives the model the ability to react to perturbations and discrepancies during robot control. In the video, the robot’s movement is disturbed, but our model is capable of recovering. While our model is not able to handle failure cases, e.g., failing to lift the cup, it can recover from discrepancies that happen during movement as long as there is enough room for a recovering motion.

Collecting Data

You can collect your own data with the code given in our archive. The main entry file for this is collect_data.py in the utils directory:

python utils/collect_data.py

By default, the collected data is stored in the downloaded archive. You will find one JSON file for each sub-task (picking and pouring) in the collected directory. Two files have the same random name if they are from the same task demonstration, differing by being suffixed with _1 and _2, indicating picking and pouring, respectively.
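
If you want to match the two sub-task files of one demonstration, a small helper along the following lines could be used. This is a sketch based only on the naming scheme described above; the example file names, the .json extension, and the helper itself are assumptions and not part of the repository.

```python
from collections import defaultdict
from pathlib import Path

# From the article: suffix _1 marks picking, suffix _2 marks pouring.
SUFFIX_TO_SUBTASK = {"1": "picking", "2": "pouring"}


def group_demonstrations(filenames):
    """Group sub-task files by the shared random name of their demonstration."""
    tasks = defaultdict(dict)
    for name in filenames:
        stem = Path(name).stem  # e.g. "abc123_1"
        task_id, _, suffix = stem.rpartition("_")
        if suffix in SUFFIX_TO_SUBTASK:
            tasks[task_id][SUFFIX_TO_SUBTASK[suffix]] = name
    return dict(tasks)


# Hypothetical file names for illustration.
files = ["abc123_1.json", "abc123_2.json", "xyz789_1.json"]
print(group_demonstrations(files))
```

A demonstration with both a picking and a pouring entry is a complete task; an entry with only one of the two indicates a missing file.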

While you can directly use the collected files for the automatic evaluation described above, training the model requires the data to be processed into TFRecords. To convert the data, please set the parameters at the bottom of the data_processing.py file, found in the utils folder. Most importantly, the default batch size is set to 16, meaning you need at least 16 demonstrations for training and validation each. Please set the max_samples parameter for the validation dataset size to your desired amount and clip the same amount in the parameter min_samples of the training set. This will ensure that the datasets are not overlapping. After that, run:

python utils/data_processing.py

This creates three files in the downloaded archive:

  • normalization_custom.pkl: Contains the normalization for the data, especially for the robot joints. If you want to use this for training or evaluation, please set the path to this file instead of the default normalization.
  • train_custom.tfrecord: Contains the training dataset. This path can be set in main.py to be used during training.
  • validate_custom.tfrecord: Contains the validation dataset. This path can be set in main.py to be used during training.

Custom Code: Model Creation and Inference

If you would like to experiment with the model independently, the minimal code needed to create and load the pre-trained model can be seen below. The paths in the snippet are set up to work with the downloaded archive. Please change them as needed.

Our model expects the following inputs:

  • Verbal Command: A vector of 15 integer word IDs (padded with 0s if necessary), holding the row-index into the GloVe word embedding matrix.
  • Candidate Objects: A 6x5 matrix holding a set of six candidate objects in the environment, as detected by Faster RCNN. Each object is represented by its object class and four values describing their bounding box in the original image.
  • Robot State: A sequence of variable length, indicating the robot states since the current task started. Each robot state contains the six robot joints plus one value for the gripper state.

To generate the candidate objects, Faster RCNN needs to be run on the original input image. Faster RCNN returns a set of detected scene objects, ordered by their detection confidence. We select the six objects with the highest confidence in the below function and store their class and bounding box as the feature vector for the respective object. Assuming you have a model created as indicated above, you can run the following code, where the image should be a 3D NumPy array.
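
As a minimal sketch of the selection step only, assuming the detector output is already available as NumPy arrays of classes, bounding boxes, and confidence scores, the 6x5 candidate matrix could be assembled as follows. The function name, the array layout, and the zero-padding for scenes with fewer than six detections are assumptions; the repository's own helper may differ.

```python
import numpy as np

NUM_CANDIDATES = 6  # from the article: six candidate objects


def select_candidates(classes, boxes, scores, num_candidates=NUM_CANDIDATES):
    """Build a (6, 5) matrix of [class, box(4)] rows for the most confident detections.

    Rows are zero-padded if fewer than six objects were detected (an assumption).
    """
    order = np.argsort(scores)[::-1][:num_candidates]  # highest confidence first
    features = np.zeros((num_candidates, 5), dtype=np.float32)
    features[: len(order), 0] = classes[order]
    features[: len(order), 1:] = boxes[order]
    return features


# Hypothetical detections: three objects with class IDs, boxes, and scores.
classes = np.array([1, 2, 3])
boxes = np.array([[0.1, 0.1, 0.2, 0.2], [0.3, 0.3, 0.5, 0.5], [0.6, 0.6, 0.9, 0.9]])
scores = np.array([0.5, 0.9, 0.7])
candidates = select_candidates(classes, boxes, scores)
print(candidates.shape)  # (6, 5)
```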

In addition to the candidate regions, we also need to pre-process the sentence and convert it into a vector of word IDs. This can be done with the following code snippet:
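
A minimal sketch of such a conversion is given here, assuming a hypothetical vocabulary dictionary that maps each word to its row index in the GloVe embedding matrix. The tokenization and the handling of unknown words (they are simply skipped) are assumptions and may differ from the code in the repository.

```python
import re

MAX_WORDS = 15  # from the article: commands are vectors of 15 word IDs
PAD_ID = 0      # from the article: shorter commands are padded with 0s


def sentence_to_ids(sentence, vocabulary, max_words=MAX_WORDS):
    """Convert a sentence into a fixed-length list of word IDs."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    # Assumption: words missing from the vocabulary are dropped.
    ids = [vocabulary[t] for t in tokens if t in vocabulary][:max_words]
    return ids + [PAD_ID] * (max_words - len(ids))


# Hypothetical vocabulary; the real IDs are row indices of the GloVe matrix.
vocabulary = {"pick": 1, "up": 2, "the": 3, "green": 4, "cup": 5}
print(sentence_to_ids("Pick up the green cup.", vocabulary))
# [1, 2, 3, 4, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```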

During inference, our model creates the following output:

  • Next State rₜ₊₁: A vector of seven values, containing the next configuration of the six robot joints plus one value for the gripper.

Additionally, we output the following auxiliary outputs:

  • Object Attention a: A vector of six values, indicating each candidate object's likelihood of being the described target object.
  • Phase Velocity Δₜ: A scalar value that describes how much the predicted phase ϕ needs to be advanced to step from the current state r to the next state rₜ₊₁.
  • Phase Progression ϕ: A scalar value that predicts how far the current robot state r has progressed in the overall task.
  • Motor Primitive Weights W: A matrix defining the weight for each of the eleven basis functions for every robot degree of freedom.
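
To illustrate how these outputs could be used together, the sketch below queries a policy step by step, feeds the growing state history back in, and stops once the predicted phase would reach 1 (the end of a trajectory). The step function is a hypothetical stand-in for the trained network, and the stopping rule and all names are assumptions made for illustration.

```python
import numpy as np


def rollout(step_fn, initial_state, max_steps=500):
    """Query the policy repeatedly and collect the resulting trajectory.

    `step_fn` receives the state history and returns (next_state, phase, delta).
    """
    states = [np.asarray(initial_state, dtype=np.float32)]
    for _ in range(max_steps):
        next_state, phase, delta = step_fn(np.stack(states))
        states.append(np.asarray(next_state, dtype=np.float32))
        if phase + delta >= 1.0:  # assumed stopping rule: phase 1 marks the last step
            break
    return np.stack(states)


def toy_step(history):
    """Hypothetical stand-in for the network: constant motion, linear phase."""
    phase = (len(history) - 1) / 10.0
    delta = 0.1
    return history[-1] + 0.1, phase, delta


trajectory = rollout(toy_step, np.zeros(7))
print(trajectory.shape)
```

On a real or simulated robot, the loop would send each predicted state to the controller and append the measured robot state to the history instead, which is what lets the model react to perturbations.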

Summary and Conclusion

In this article, we present a method to teach robots new skills from natural language commands. Our approach combines language, vision, and control in an end-to-end imitation learning framework, providing a simple and intuitive interface for human users to interact with robots.

A future direction of this work is the application of our method on a real UR5 robot, allowing users to issue voice commands directly and see the result applied in the real world. However, there are multiple challenges outside of this tutorial’s scope that need to be addressed in order to transfer from simulation to the real world successfully. An interesting point, however, is that the model shown in the above video is actually the same model used in simulation without any fine-tuning of the core model. We only fine-tuned Faster RCNN on 100 real-world images in order to account for the changed images.

Please have a look at our GitHub for more information and consider visiting our lab website!

We are happy to receive any feedback you might have. Please feel free to reach out to us with any questions or suggestions!



Simon Stepputtis

Postdoctoral Fellow at Carnegie Mellon University | Artificial Intelligence, Natural Language Processing, Human-Robot-Collaboration