CS1470 Final Project DevPost

Title

Grounding Language to Non-Markovian Tasks with No Supervision of Task Specifications

Important Links

Who

Andrew Li (ali140)
Ben Schornstein (bschorns)
Celina Ye (cye9)
Rob Scheidegger (rscheide)

Introduction

Say I'm designing an autonomous vehicle. How will the vehicle know how to get from point A to point B? The simple answer is that it uses GPS data from satellites. But is there any way to encode a trajectory from point A to point B without using satellites? The way we as humans accomplish this is by using sequential natural language instructions, such as "turn left at the corner and go straight until the intersection." However, it is not as easy for an autonomous vehicle to find a path in the environment that satisfies the required temporal constraints; these natural language instructions must first be translated into a form that autonomous vehicles can easily interpret: linear temporal logic (LTL). On a broader, more general scale, the question is then: can we create a model that lets robots and other agents determine an LTL expression describing the trajectory from point A to point B using natural language instructions?

Our inspiration for this project comes from this existing research paper. The paper identifies a limitation of existing approaches to mapping natural language to LTL expressions: they require an expensive dataset of LTL expressions paired with English sentences.

The paper's primary goal is then to address this issue by introducing an alternative approach to learning a semantic parsing model that bypasses the requirement of language paired with LTL logical forms: learning from trajectories as a proxy. Trajectories are defined as the relatively well-known landmarks and locations along the path an agent takes through the environment from one point to another. The goal is for the agent to learn a mapping from English to LTL expressions given only pairs of English natural language instructions and trajectories, enabling a robot to understand commands with sequential constraints. In essence, this is a supervised sequence-to-sequence (Seq2Seq) problem.
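To make the idea of a trajectory satisfying an LTL expression concrete, here is a minimal sketch of an LTL checker. It assumes finite-trace semantics and a nested-tuple encoding of formulas, and the landmark names (corner, intersection) are invented for illustration. None of this is taken from the paper's implementation.

```python
# Sketch only: LTL evaluated over *finite* traces, formulas as nested tuples.
# The tuple encoding and the landmark names are illustrative assumptions.

def holds(formula, trace, i=0):
    """Return True if `formula` holds on `trace` starting at position i.

    `trace` is a list of sets; each set contains the propositions
    (e.g. landmark names) that are true at that step.
    """
    if i >= len(trace):
        return False  # simplification: nothing holds past the end of the trace
    op = formula[0]
    if op == "atom":
        return formula[1] in trace[i]
    if op == "not":
        return not holds(formula[1], trace, i)
    if op == "and":
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == "next":
        return holds(formula[1], trace, i + 1)
    if op == "eventually":
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "always":
        return all(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "until":
        # formula[1] must keep holding until formula[2] becomes true
        for j in range(i, len(trace)):
            if holds(formula[2], trace, j):
                return True
            if not holds(formula[1], trace, j):
                return False
        return False
    raise ValueError(f"unknown operator: {op}")


# "Reach the corner, and afterwards reach the intersection":
# eventually(corner and eventually(intersection))
visit_in_order = (
    "eventually",
    ("and", ("atom", "corner"), ("eventually", ("atom", "intersection"))),
)

good = [{"start"}, {"corner"}, {"road"}, {"intersection"}]
bad = [{"start"}, {"intersection"}, {"corner"}]

print(holds(visit_in_order, good))  # True: corner is visited before intersection
print(holds(visit_in_order, bad))   # False: the order is reversed
```

The same two landmarks appear in both traces; only the order differs, which is the kind of temporal constraint that a plain set of goal locations cannot express.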

We are motivated to implement this paper because of its many practical applications. As mentioned earlier, existing approaches require an expensive dataset that is not only hard to obtain but also difficult to maintain and extend. If the alternative strategy of using trajectories as a proxy is successful, it could be used in places where few natural-language instructions are available, such as lesser-known cities, newly developed regions, and under-resourced areas. Being able to translate natural language instructions into LTL expressions is a powerful tool, and we are excited to see the future applications of this alternative approach to language modeling.

Related Work

The paper that we are implementing references 44 other sources, and we have taken a look at some of those sources.

We now expand on and summarize some of those sources, namely:

These papers served as the foundation for the research paper by highlighting issues in the semantic parsing literature. Specifically, these papers explored weakly supervised semantic parsing models that ground to lambda calculus expressions and logical forms, which do not handle temporal order in the way that LTL does.

We are not currently aware of any public implementations of the algorithm developed in this paper. However, we do have pieces of source code that were involved in its original creation (largely dealing with the geographic datasets and some management of the reinforcement learning environment), although these pieces do not constitute the entire process. One of the original reasons we wanted to work with this paper is that its original author lost much of the original source code to a local computer failure, and the code was never published alongside the paper through GitHub or other online means. If we are able to reimplement it sufficiently, our work may actually be useful for future research.

Data: What data are you using (if any)?

The paper evaluates on SAIL, an existing benchmark dataset set in an artificial environment, with 3,000 samples of instructions and trajectories. The paper also uses OpenStreetMap (OSM), a global, open-source map API where users can add landmarks, along with information about those landmarks that is then verified.

Additionally, there is a lot of data for each of the cities we are looking at. For example, the Providence data has over 2,000 entries. This data is given to us in TSV format. However, from a global perspective, the dataset could be considered relatively small, since we only have data for ten cities (Ann-Arbor, Atlanta, Austin, Baltimore, Berkeley, Boston, Cambridge, New-Haven, Philadelphia, and Providence), comprising approximately 20,000 total inputs.

The preprocessing component to this project is not significant, and it is comparable to what we have done in past Deep Learning assignments.
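As a rough sketch of what this preprocessing could look like, the snippet below reads instruction/trajectory pairs from a TSV file and builds a vocabulary. The two-column layout, the special tokens, and the function names load_pairs and build_vocab are assumptions made for illustration; the actual city files may be organized differently.

```python
import csv
import io


def load_pairs(tsv_file):
    """Read (instruction tokens, trajectory landmarks) pairs from a TSV file object.

    ASSUMPTION: each row has two columns, the instruction text followed by a
    space-separated trajectory. The real column layout may differ.
    """
    pairs = []
    for row in csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        instruction, trajectory = row[0], row[1]
        pairs.append((instruction.lower().split(), trajectory.split()))
    return pairs


def build_vocab(token_lists, specials=("<pad>", "<start>", "<stop>", "<unk>")):
    """Map each distinct token to an integer id, reserving ids for special tokens."""
    vocab = {tok: idx for idx, tok in enumerate(specials)}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


# Tiny in-memory stand-in for one city's TSV file (made-up rows).
sample = io.StringIO(
    "Go to the cafe then the bank\tcafe bank\n"
    "\n"
    "Visit the library\tlibrary\n"
)
pairs = load_pairs(sample)
word_vocab = build_vocab(instr for instr, _ in pairs)
print(len(pairs), len(word_vocab))
```

For a real file, the same function could be called on `open(path, newline="", encoding="utf-8")`, where `path` is whichever city file is being loaded.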

Methodology: What is the architecture of your model?

There are two primary (and relatively disjoint) parts of the model that we would ideally like to implement:

Descriptions -> Trajectories: Create a seq2seq model (likely using a transformer architecture) that translates English instructions into an ordered list of trajectory locations that can be used by the second part. A key idea in this paper is that these trajectories serve as an intermediate representation, which should help the model learn LTL faster than going directly from Descriptions -> LTL. We can train this part of the model using the labeled data that we have for each city (descriptions and their associated trajectories). One aspect we are interested in testing is whether it is better to train a single model on all of the training data (all of the cities) or to train a separate model per city (which could improve performance, since each city's locations are specific to that city).
(Descriptions, Trajectories) -> Linear Temporal Logic: There are a couple of ways this could be achieved. Since this is in theory another seq2seq problem, we could use another transformer architecture or, as the paper does, use reinforcement learning in a simulated environment that takes in both the descriptions and the trajectories and gives rewards for choosing, in real time, the paths that best represent the descriptions. Assuming the RL approach is used (since we already know how transformers work), we will use a reward function that rewards the 'robot' not only when the proper trajectory is chosen but also when the transitions between locations happen with the correct timing (hence the temporal aspect of LTL), with a punishment (negative reward) for an incorrect decision.

Although the first part is certainly nontrivial, getting the second part to work with reinforcement learning is by far going to be the most difficult part of this project to implement. We are relatively confident that we can get the first part to work (since it should generally be similar to the machine translation assignments we have done so far, with the additional challenge that we want exceptionally high accuracy), and we are hoping that, through enough trial and tribulation, the second part will come together as well.
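The reward idea for the second part can be sketched as follows. This is only an illustration of rewarding correct, correctly ordered transitions and penalizing incorrect ones; the reward values, the position-by-position comparison, and the function names are our own assumptions, not the reward function from the paper.

```python
# Sketch only: a hand-written reward for comparing a rollout with a reference
# trajectory. The +1 / -1 values and the position-wise comparison are
# illustrative assumptions.

def step_reward(expected, visited, correct=1.0, wrong=-1.0):
    """Reward for one step: did the agent reach the landmark expected at this step?"""
    return correct if visited == expected else wrong


def episode_return(reference, rollout, correct=1.0, wrong=-1.0):
    """Sum step rewards over a rollout, comparing position by position.

    Steps beyond the shorter of the two sequences (skipped or extra landmarks)
    are each penalized with `wrong`.
    """
    total = 0.0
    for expected, visited in zip(reference, rollout):
        total += step_reward(expected, visited, correct, wrong)
    total += wrong * abs(len(reference) - len(rollout))
    return total


reference = ["cafe", "bank", "library"]
print(episode_return(reference, ["cafe", "bank", "library"]))  # 3.0
print(episode_return(reference, ["bank", "cafe", "library"]))  # -1.0
print(episode_return(reference, ["cafe", "bank"]))             # 1.0
```

Because the comparison is positional, visiting the right landmarks in the wrong order is penalized, which is the temporal behavior the reward is meant to encourage.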

Metrics: What constitutes “success?”

The paper measures success based on goal-state accuracy and path accuracy. Goal-state accuracy is computed by checking whether the final location after planning is the correct end location of the trajectory; a higher value is better. Path accuracy is computed as the edit distance between the computed path and the ground-truth trajectory; a lower value is better.
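Both metrics can be sketched in a few lines. The version below assumes that paths are lists of landmark names and that the edit distance is the standard Levenshtein distance with unit costs; the paper may define or normalize the distance differently, and the function names are our own.

```python
# Sketch only: paths are lists of landmark names; unit-cost Levenshtein
# distance is an assumption about how "edit distance" is defined.

def edit_distance(predicted, reference):
    """Levenshtein distance between two landmark sequences."""
    prev = list(range(len(reference) + 1))
    for i, p in enumerate(predicted, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            cost = 0 if p == r else 1
            curr.append(min(
                prev[j] + 1,         # drop p from the prediction
                curr[j - 1] + 1,     # insert r into the prediction
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def goal_state_accuracy(predictions, references):
    """Fraction of predicted paths that end at the reference's final location."""
    hits = sum(
        1 for p, r in zip(predictions, references)
        if p and r and p[-1] == r[-1]
    )
    return hits / len(references)


def mean_path_distance(predictions, references):
    """Average edit distance between predicted and reference paths."""
    total = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    return total / len(references)


preds = [["cafe", "bank", "library"], ["cafe", "bank"]]
refs = [["cafe", "bank", "library"], ["cafe", "bank", "park"]]
print(goal_state_accuracy(preds, refs))  # 0.5
print(mean_path_distance(preds, refs))   # 0.5
```

In this toy example the second prediction stops one landmark short, so it misses the goal state and has an edit distance of 1.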

Our base and target goals are both drawn from the paper: the target goal is to match the accuracy values it reports. The stretch goal would be to achieve a higher goal-state accuracy and a lower path edit distance than the paper.

Ethics:

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

Our dataset comes directly from annotations made by a group of researchers, so there does not seem to be much ethical concern about how the data was sourced.

However, since the descriptions were manually written by only a small group of individuals, they are unlikely to be representative of the different styles of English spoken by different communities around the world (although there was an explicit attempt to bring realistic variation to the annotations, they do not necessarily represent the way everyone actually speaks). Therefore, if a system were ever built around this technology and put into production, it would be important to train it on a larger dataset with annotations from people across the country with different speech patterns, so that their commands are not mistranslated into LTL.

Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?

We imagine that the most important stakeholders for the problem at hand (converting path descriptions to LTL) are those who would use this sort of voice command in practice to instruct a robot how to act in real time.

One scenario we thought of explicitly was a home-help robot that could be used by the elderly and visually impaired to help with tasks around the house, while still retaining a decent degree of autonomy. For instance, a robot equipped with such capabilities could be told "go around the corner into the living room, grab my water, and bring it back to me," and feasibly translate and execute such a task. This would be a great deal of help to those who would not be able to physically execute these tasks themselves, but can still give proper temporal instructions on how they should be done.

That being said, if people came to rely on these sorts of robots, algorithmic mistakes could have far larger consequences. For instance, if someone were to fall, injure themselves, and then depend on the robot to interpret a command to call for help, a failure to interpret the instruction properly could be the difference between life and death. Therefore, as always, it would be important not to rely completely on such technology and to have backup plans in case things go wrong.

Division of labor: Briefly outline who will be responsible for which part(s) of the project.

Andrew: Pre-process and Model Architecture Part 1
Ben: Dataset and Model Architecture Part 2
Celina: Pre-process and Model Architecture Part 1
Rob: Model Architecture Parts 1 and 2

Collectively, we will all be updating and revising the DevPost.

Reflection #2

Introduction:

Say I'm designing an autonomous vehicle. How will the vehicle know how to get from point A to point B? The obvious answer is that it uses GPS data from satellites. But is there any way to encode a trajectory from point A to point B without using satellites? The way we as humans accomplish this is by using sequential natural language instructions, such as "turn left at the corner and go straight until the intersection." These instructions are then translated to linear temporal logic (LTL) expressions. The question then is: can we create a model that determines an LTL expression describing the trajectory from point A to point B using natural language instructions?

We are motivated to implement this paper because of its many practical applications. Namely, being able to translate natural language instructions into LTL expressions is a powerful tool, and we are excited to see what future applications arise from this alternative approach to language modeling.

Challenges: What has been the hardest part of the project you’ve encountered so far?

The hardest part of the project has been deciding on an evaluation metric. When evaluating the performance of the model, we will be comparing the trajectories predicted by the model with the actual trajectories. However, this comparison is vague and loosely defined. That is, when are two trajectories comparable? Do they need to be an exact match, or can there be some leeway? Should the model be penalized for learning a trajectory from point A to point B that takes a detour or makes an unnecessary stop relative to the actual trajectory? If so, by how much? Settling these questions has been the hardest part of the project so far.

Insights: Are there any concrete results you can show at this point?

As of now, our model is able to train on the training dataset, although all we have for feedback is the loss of the model (or the perplexity, which is simply the exponential of the loss and so carries the same information). Although training seems to be working for the initial seq2seq model that maps descriptions to trajectories, our model may be overfitting quite extensively: we see multiple loss values under 0.01, followed by fluctuation in the next epoch, which is something that we need to address. We plan to look into this more by analyzing the accuracy of the model.
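The relationship between loss and perplexity, and the accuracy check we plan to add, can be sketched as below. This assumes the loss is the mean per-token cross-entropy in nats and that padding uses token id 0; both are assumptions about our setup, and the function names are illustrative.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


def token_accuracy(predicted_ids, target_ids, pad_id=0):
    """Fraction of non-padding target tokens that were predicted correctly.

    ASSUMPTION: sequences are lists of integer token ids and `pad_id`
    marks padding positions that should not count toward the score.
    """
    correct = total = 0
    for pred_seq, tgt_seq in zip(predicted_ids, target_ids):
        for p, t in zip(pred_seq, tgt_seq):
            if t == pad_id:
                continue
            total += 1
            correct += int(p == t)
    return correct / total if total else 0.0


print(perplexity(0.01))  # about 1.01: a loss under 0.01 means near-certain predictions
print(token_accuracy([[5, 6, 0], [7, 8, 9]], [[5, 6, 0], [7, 9, 9]]))  # 0.8
```

Computing the same accuracy on held-out data as well as on the training data would show whether a very low training loss reflects genuine learning or overfitting.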

We noticed that this task differs from a traditional machine translation task in that we have an absolute ground truth in the form of the trajectories (whereas natural language often does not have a single ground truth). Therefore, we need to alter the loss we were previously using (which we are currently working on) and use comparison against the ground-truth trajectories as our absolute evaluation metric for tracking our progress.