Title: Decision Transformer on Vision-Language Navigation

Who

Zilai Zeng [zzeng28], Shijie Wang [swang299]

Introduction

Navigation and interaction guided by natural language instructions and ego-centric vision pose a significant challenge for embodied agents. Vision-Language Navigation (VLN) requires making sequential decisions (actions) in dynamic environments by reasoning over natural language instructions and visual observations. Recent methods such as VLN-BERT and Episodic Transformer leverage transformer-based architectures to encode the full episode of navigation history and address the problem with supervised imitation learning. From another perspective, the task can also be formulated as a reinforcement learning (RL) problem. In this project, we propose an offline RL method built on Decision Transformer, a framework that casts RL as a sequence modeling problem. By incorporating reward information, our method can further improve the performance and robustness of decision-making in VLN.

Related Work

  • Episodic Transformer (E.T.) proposed a transformer-based VLN agent consisting of four encoders, one per modality: a language encoder, a visual encoder, an action encoder, and a multi-modal encoder. It shows that transformer architectures outperform recurrent architectures at encoding long episodes and solving compositional tasks. By additionally utilizing synthetic data, it achieves 38.4% and 8.5% task success rates on the seen and unseen test splits, respectively.

  • Decision Transformer introduces a framework that treats offline reinforcement learning as a sequence modeling problem. Using the GPT architecture, it generates actions auto-regressively with causal attention, which mirrors the conditional-generation structure of an MDP trajectory. In this way, Decision Transformer can learn meaningful patterns and extract a strong policy from a limited dataset. It matches or exceeds the performance of state-of-the-art model-free offline RL algorithms without using dynamic programming.
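To make the sequence-modeling view concrete, here is a minimal sketch (our reading of the paper, not the authors' code) of how a trajectory is tokenized: rewards are converted to returns-to-go, then interleaved with states and actions so that a causal transformer can condition each action prediction on the desired return.

```python
def returns_to_go(rewards):
    """Suffix sums of rewards: R_t = r_t + r_{t+1} + ... + r_T."""
    rtg = []
    total = 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return list(reversed(rtg))

def interleave(rewards, states, actions):
    """Order tokens as (R_1, s_1, a_1, R_2, s_2, a_2, ...)."""
    seq = []
    for rtg, s, a in zip(returns_to_go(rewards), states, actions):
        seq.extend([("rtg", rtg), ("state", s), ("action", a)])
    return seq

# Example: a 3-step trajectory with a single terminal reward.
seq = interleave([0.0, 0.0, 1.0], ["s1", "s2", "s3"], ["a1", "a2", "a3"])
```

At inference time, the first return-to-go token is set to the target return one wants the policy to achieve, rather than computed from data.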

Data

ALFRED is a benchmark for learning a mapping from natural language instructions and ego-centric vision to sequences of actions for household tasks. It includes expert demonstrations in interactive visual environments for 25k natural language directives.

ALFRED introduces 7 different task types parameterized by 84 object classes across 120 scenes. Each expert demonstration contains the agent's ego-centric visual observations of the environment and the expert action sequence that completes the task. Each directive has a high-level goal and a set of step-by-step instructions (subgoals); in addition, free-form language directives are collected from 3 different annotators for every demonstration.
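For illustration, one demonstration can be pictured as the record below; the field names are our own shorthand, not ALFRED's actual JSON schema.

```python
# Hypothetical shape of one ALFRED expert demonstration as we use it.
demo = {
    "task_type": "pick_and_place_simple",  # one of the 7 task types
    "scene": "FloorPlan1",                 # one of 120 scenes
    "goal": "Put an apple on the table.",  # high-level goal directive
    "instructions": [                      # step-by-step subgoals
        "Walk to the counter.",
        "Pick up the apple.",
    ],
    "frames": ["frame_000.jpg", "frame_001.jpg"],  # ego-centric views
    "actions": ["MoveAhead", "PickupObject"],      # expert actions
}

# Three annotators per demonstration yield three directive variants.
annotations = [demo["goal"]] * 3  # placeholder for the free-form texts
```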

Methodology

Episodic Transformer provides a simple but powerful way to map natural language instructions and visual observations to action sequences, while Decision Transformer bridges the gap between offline RL and sequence modeling.

Inspired by these two models, we propose a transformer-based model that treats the VLN task as an offline reinforcement learning problem. First, we convert the VLN task into an RL problem with an appropriate reward design. We keep the encoders of Episodic Transformer to jointly encode language, visual observations, and actions: a 2-layer transformer encoder encodes language directives, and a pre-trained Faster R-CNN extracts visual frame features. Following Decision Transformer, we insert the reward sequence as an additional stream in the trajectory, smoothly incorporating reward information with the other modalities. As in both models, we use teacher forcing during training for simplicity and stability.
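A minimal sketch of the reward design we have in mind (our own assumption; ALFRED ships expert demonstrations, not rewards): a small per-step penalty to encourage short paths, a bonus for each newly satisfied goal-condition, and a large bonus at task success. All values are placeholders to be tuned.

```python
# Illustrative reward shaping for relabeling expert trajectories offline.
STEP_PENALTY = -0.01     # discourages long paths
SUBGOAL_REWARD = 1.0     # per newly completed goal-condition
SUCCESS_REWARD = 10.0    # on task success

def step_reward(prev_done_subgoals, done_subgoals, task_success):
    r = STEP_PENALTY
    r += SUBGOAL_REWARD * (done_subgoals - prev_done_subgoals)
    if task_success:
        r += SUCCESS_REWARD
    return r

# Relabel a 3-step trajectory: one subgoal completes at step 2,
# the task succeeds at step 3.
rewards = [step_reward(0, 0, False),
           step_reward(0, 1, False),
           step_reward(1, 1, True)]
```

These rewards are then converted to returns-to-go and fed to the transformer alongside the language, frame, and action tokens.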

Metrics

ALFRED provides several evaluation metrics to help us evaluate from different aspects and granularities:

  • Task Success. 1 if, at the end of the action sequence, the object positions and state changes satisfy all of the task's goal-conditions, and 0 otherwise.
  • Goal-Condition Success. The ratio of goal-conditions completed at the end of an episode to those necessary to have finished a task.
  • Path Weighted Metrics. The metrics above weighted by path length: a successful episode whose path is twice as long as the expert's receives only half credit.
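As we understand ALFRED's definition, the path-weighted version of a score s scales it by L* / max(L*, L), where L* is the expert path length and L the agent's:

```python
def path_weighted(score, expert_len, agent_len):
    """Weight a success score by expert path length over agent path length."""
    return score * expert_len / max(expert_len, agent_len)

# An agent whose path is twice the expert's earns half credit.
assert path_weighted(1, 10, 20) == 0.5
# Matching (or beating) the expert path length keeps full credit.
assert path_weighted(1, 10, 10) == 1.0
```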

Following E.T., we will take the task success rate (the ratio of successful trials to total trials) as our primary metric.

Ethics

This project mainly focuses on the vision-language navigation task, which has a wide range of real-world applications. For example, robots that navigate from human language instructions could assist blind or disabled people, serve as household robots, or work in factory production.

Division of labor

We plan on working equally across the following aspects of the project:

  • Reward Design
  • Data Preprocessing and Augmentation
  • Model Architecture Design and Reward Incorporation

Deliverables
