Teach your Robot new Tricks with NLP and Deep Learning


Execution of two tasks in sequence: first, the language command is analyzed in the given environment to identify the target object, the desired action, and the described quantity; this information is then used to generate a suitable task controller that actuates the robot.

Introducing the Model

System overview of our framework (center). Detailed explanations of the semantic module (right) and the policy translation module (left) can be found below.
  • Semantic Module: The semantic module's goal is to generate a joint embedding e of the visual environment information and the language command. Intuitively, this module's task is to encode all the information necessary to describe the desired task. As part of the semantic module, we use GloVe word embeddings to convert the command into a fixed-size language representation. Additionally, image processing is done by Faster R-CNN to generate a set of candidate objects in the environment. Faster R-CNN is fine-tuned to our task's specific needs, i.e., detecting objects based on their color, size, and shape.
  • Policy Translation Module: The policy translation module takes the previously generated task representation e and translates it into the hyper-parameters of a motor primitive. In this work, we predict the motor primitive's weights, the current phase ϕ, and the speed Δₜ at which the robot moves in time. This generation is done at each timestep to adapt to potential control discrepancies of the robot.
  • Motor Primitive (part of the Policy Translation): The policy translation module generates the hyper-parameters of the motor primitive. We use 11 Gaussian basis functions for each robot degree of freedom, equidistantly distributed along the phase ϕ ranging from 0 to 1, where 0 indicates the beginning of any trajectory and 1 indicates its last step. The motor primitive can be evaluated at any scalar position ϕ as the sum of all basis functions at position ϕ, each multiplied by its respective weight.
Model Overview: Based on the environment perception and the voice command, the system generates a motor primitive suited to the commanded task in the given environment. The left half shows the semantic module, whereas the right side shows the policy translation module.
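As a concrete illustration, the motor primitive described above can be sketched as a weighted sum of Gaussian basis functions over the phase ϕ. This is a minimal sketch: the basis width and the normalization of the basis activations are illustrative assumptions, not values taken from our implementation.

```python
import numpy as np

def basis_activations(phi, n_basis=11, width=0.012):
    """Gaussian basis functions with centers equidistant in phase [0, 1].

    The width and the normalization below are illustrative assumptions."""
    centers = np.linspace(0.0, 1.0, n_basis)
    b = np.exp(-((phi - centers) ** 2) / (2.0 * width))
    return b / b.sum()  # normalize so the activations sum to 1

def evaluate_primitive(phi, weights):
    """Evaluate the primitive at a scalar phase phi.

    weights has shape (n_basis, n_dof); the result is one joint
    configuration: the sum of all basis activations times their weights."""
    return basis_activations(phi, n_basis=weights.shape[0]) @ weights
```

A quick sanity check: with constant weights, the normalized basis makes the primitive return that same constant at every phase.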



Tutorial: Using the Model

Setting up the Code

  • CoppeliaSim: Downloading and installing the player version will be sufficient, as long as you do not want to change the simulation environment itself. Our code was tested with versions 4.0 and 4.1.
  • ROS 2 Eloquent: ROS is used for communication between the simulator and the neural network in our evaluation code. Before running our code, please compile and source the workspace found in the folder ros2 of our repository. This will allow the code to find the needed communication messages.
  • PyRep: We use PyRep to interface with the simulator. Please follow the installation instructions at the respective GitHub. In case you run into issues not being able to find libcoppeliaSim.so, please make sure you set the environment variables correctly as described on their setup page.
  • Orocos KDL (use commit 1ae45bb): The python-wrapper has to match the solver version installed on your system. We strongly suggest installing both components from their GitHub. For Python3, the following GitHub-Issue provides guidance for the installation process. If you run into installation issues regarding SIP, please try installing sip version 4.19.22.
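If PyRep cannot find libcoppeliaSim.so, the environment variables from PyRep's setup instructions look roughly as follows; the installation path is a placeholder you must replace with your own CoppeliaSim location.

```shell
# Placeholder path -- replace with the directory containing libcoppeliaSim.so
export COPPELIASIM_ROOT=/path/to/CoppeliaSim
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$COPPELIASIM_ROOT
export QT_QPA_PLATFORM_PLUGIN_PATH=$COPPELIASIM_ROOT
```

Add these lines to your shell profile (e.g. ~/.bashrc) so they persist across sessions.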
  • Automatic: Evaluates the model on full test tasks (100 for picking and 100 for pouring). Set RUN_ON_TEST_DATA to True.
  • Manual: Environments can be generated manually, and commands are typed in. Set RUN_ON_TEST_DATA to False.
  • q → This will quit the simulation
  • g → This will reset the robot and generate a new, random environment
  • r → This will reset the robot and environment to their respective default states
  • t <command> → Replace <command> with your command to execute the specified task in the given environment. As an example, you could run “t pick up the green cup”, assuming there is a green cup in the scene.
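The interactive commands above amount to a small dispatch loop. The sketch below uses a hypothetical `sim` interface; the method names (`quit`, `generate_random_env`, `reset_default`, `run_task`) are illustrative, not the repository's actual API.

```python
def dispatch(line, sim):
    """Map one typed command to a simulator action.

    `sim` is a hypothetical simulator interface; method names are
    illustrative placeholders for the real evaluation code."""
    if line == "q":
        sim.quit()                  # quit the simulation
    elif line == "g":
        sim.generate_random_env()   # reset robot, generate a new random environment
    elif line == "r":
        sim.reset_default()         # reset robot and environment to defaults
    elif line.startswith("t "):
        sim.run_task(line[2:])      # e.g. "t pick up the green cup"
    else:
        print(f"unknown command: {line!r}")
```

In a real session, this would sit inside a loop reading lines from stdin until "q" is entered.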

Overall, our model achieves a 95% success rate for picking, 85% success for pouring, and 84% success for executing both tasks sequentially.

  • normalization_custom.pkl: Contains the normalization for the data, especially for the robot joints. If you want to use this for training or evaluation, please set the path to this file instead of the default normalization.
  • train_custom.tfrecord: Contains the training dataset. This path can be set in main.py to be used during training.
  • validate_custom.tfrecord: Contains the validation dataset. This path can be set in main.py to be used during training.

Custom Code: Model Creation and Inference

  • Verbal Command: A vector of 15 integer word IDs (padded with 0s if necessary), each indexing a row of the GloVe word-embedding matrix.
  • Candidate Objects: A 6x5 matrix holding a set of six candidate objects in the environment, as detected by Faster R-CNN. Each object is represented by its object class and four values describing its bounding box in the original image.
  • Robot State: A sequence of variable length, indicating the robot states since the current task started. Each robot state contains the six robot joints plus one value for the gripper state.
  • Next State rₜ₊₁: A vector of seven values indicating the next configuration of the robot: its six joint angles plus one value for the gripper.
  • Object Attention a: A vector of six values, indicating each candidate object's likelihood of being the described target object.
  • Phase Velocity Δₜ: A scalar value that describes how much the predicted phase ϕ needs to be advanced to step from the current state r to the next state rₜ₊₁.
  • Phase Progression ϕ: A scalar value that predicts how far the current robot state r has progressed in the overall task.
  • Motor Primitive Weights W: A matrix defining the weight for each of the eleven basis functions for every robot degree of freedom.
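Putting the inputs and outputs above together, one control step can be sketched as follows. The network forward pass is replaced by a placeholder that only checks the documented input shapes and returns dummy outputs with the documented output shapes; the phase update ϕ ← ϕ + Δₜ reflects the description of the phase velocity above.

```python
import numpy as np

def dummy_forward(cmd_ids, candidates, robot_states):
    """Placeholder for the network: validates the documented input shapes
    and returns dummy outputs with the documented output shapes."""
    assert cmd_ids.shape == (15,)             # 15 padded GloVe word IDs
    assert candidates.shape == (6, 5)         # 6 objects x (class + 4 bbox values)
    assert robot_states.ndim == 2 and robot_states.shape[1] == 7  # 6 joints + gripper
    return {
        "next_state": np.zeros(7),            # r_{t+1}: 6 joints + gripper
        "attention": np.full(6, 1.0 / 6.0),   # a: likelihood per candidate object
        "delta_t": 0.02,                      # phase velocity (dummy value)
        "phi": 0.0,                           # phase progression
        "weights": np.zeros((11, 7)),         # W: 11 basis functions x 7 DoF
    }

def control_step(phi, outputs):
    """Advance the phase by the predicted phase velocity, clipped to 1."""
    return min(phi + outputs["delta_t"], 1.0)
```

Running this step in a loop until ϕ reaches 1 mirrors the per-timestep generation described in the policy translation module.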

Summary and Conclusion



Simon Stepputtis

PhD student in Computer Science | Artificial Intelligence, Natural Language Processing, Human-Robot-Collaboration