Portal

  1. Final report
  2. Codebase

Introduction

When humans communicate, they have an intuitive sense of extensive properties of objects, such as volume and weight, embedded in their language. For example, if someone says foxes are similar to animals like dogs, people who have ever seen a dog can estimate the volume or weight of a fox without ever seeing one in real life. Those estimates may not be precise, but they will not be so far off that the imagined weight or volume of a fox matches that of an elephant.

This sense of extensive properties helps us explore and act in unfamiliar situations throughout our lives. For example, we can buy clothes for our friends' children before ever seeing them; we can choose the right truck for an almirah without seeing or lifting it first; and if someone says to meet under the tallest building near Thayer Street, we simply look at the skyline. In these situations, a rough estimate is good enough to navigate the world, make intelligent decisions, and perform various tasks. This leads to the idea behind this project: can we infer these extensive properties from a language model? If a neural network can extract this knowledge from natural language models, then robots and other intelligent agents can perform more informed search and motion planning for not-yet-visible objects from natural language commands alone.

What kind of problem is it?

If we break down the problem statement, we can see that we need to solve three different sub-problems to achieve this goal.

  1. Make or find a reasonable annotated dataset of extensive properties of various objects
  2. Find or build a language model that can establish relationships between the names of various objects
  3. Build a neural network model that connects the relationships between objects in the language model with the physical data of those objects
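The three sub-problems above can be wired together in a minimal sketch. Everything here is illustrative: the embeddings and annotated volumes are made-up stand-ins for a real language model and a real dataset, and the "neural network" is reduced to a linear map for brevity.

```python
import numpy as np

# Sub-problem 2 (stand-in): word embeddings, which in practice would come
# from a pretrained language model such as word2vec, GloVe, or BERT.
embeddings = {
    "dog": np.array([0.9, 0.1, 0.3]),
    "fox": np.array([0.85, 0.15, 0.35]),
    "elephant": np.array([0.1, 0.9, 0.8]),
}

# Sub-problem 1 (stand-in): hypothetical annotated volumes in litres.
volumes = {"dog": 30.0, "elephant": 4000.0}

# Sub-problem 3: fit a map from embedding space to log-volume.
X = np.array([embeddings[w] for w in volumes])
y = np.log(np.array(list(volumes.values())))
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict the volume of an object that has no annotation at all:
# the fox's embedding is close to the dog's, so its estimate should be too.
fox_volume = np.exp(embeddings["fox"] @ w)
```

Because "fox" sits near "dog" in the embedding space, the estimate lands near the dog's volume rather than the elephant's, which is exactly the commonsense behaviour described in the introduction.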

Related Works

[1] Elazar, Y., Mahabal, A., Ramachandran, D., Bedrax-Weiss, T., & Roth, D. (2019). How large are lions? inducing distributions over quantitative attributes. arXiv preprint arXiv:1906.01327.

In this paper, the authors gather quantitative information from a large web corpus. The method is unsupervised and relies on data-mining techniques, such as collecting a measurement whenever it co-occurs with an object within a particular context window. The result is noisy, but the advantage of this simple approach is that it yields a vast amount of useful Distributions over Quantities (DoQ) data.
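The co-occurrence idea can be sketched in a few lines. This is not the paper's actual pipeline, only an illustration of the principle on a toy corpus, with a sentence standing in for the context window and a single hard-coded unit pattern.

```python
import re
from collections import defaultdict

# Toy corpus; the real DoQ pipeline mines a large web corpus instead.
corpus = [
    "The red fox weighs about 5 kg and hunts at night.",
    "An adult fox can weigh 7 kg in winter.",
    "The elephant weighed 4000 kg at the zoo.",
]

# Whenever an object name and a "<number> kg" pattern co-occur in the
# same sentence, record the measurement for that object.
pattern = re.compile(r"(\d+(?:\.\d+)?)\s*kg")
objects = ["fox", "elephant"]
mentions = defaultdict(list)
for sentence in corpus:
    for obj in objects:
        if obj in sentence:
            mentions[obj] += [float(m) for m in pattern.findall(sentence)]
```

Aggregating many such noisy mentions per object is what produces a distribution over quantities rather than a single value.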

[2] Zhang, X., Ramachandran, D., Tenney, I., Elazar, Y., & Roth, D. (2020). Do Language Embeddings Capture Scales?. arXiv preprint arXiv:2010.05345.

In this paper, Zhang et al. used linear probing on different language models to predict scalar magnitudes, created a distribution around that data, and concluded that there is a strong relationship between scalar information and language models. However, the method is not particularly good at capturing the raw numbers on each scale.
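Linear probing itself is simple to sketch: the language model's embeddings are frozen, and only a linear regressor on top of them is trained. The embeddings below are random stand-ins for real frozen model representations, and the masses are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "language model" embeddings for a handful of objects; in the
# paper these come from pretrained models, here they are random stand-ins.
emb = {name: rng.normal(size=8) for name in ["ant", "cat", "horse", "whale"]}
mass_kg = {"ant": 3e-6, "cat": 4.0, "horse": 500.0, "whale": 150000.0}

# Linear probe: regress log10(mass) on the frozen embeddings. Working in
# log space is what makes order-of-magnitude prediction tractable.
X = np.array([emb[n] for n in mass_kg])
y = np.log10(np.array(list(mass_kg.values())))
w, *_ = np.linalg.lstsq(X, y, rcond=None)

pred = X @ w
```

With so few points the probe fits the training scales exactly; what the paper actually measures is how well such a probe generalizes to held-out objects.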

[3] Forbes, M.,& Choi, Y. (2017). Verb physics: Relative physical knowledge of actions and objects. arXiv preprint arXiv:1706.03799.

Forbes and Choi took a different approach to introducing commonsense about extensive properties into language. They inferred relative physical knowledge between object pairs from the verbs connecting them, with good results. The downside is that scalar values cannot be inferred from these relationships.

Data

For Distribution Over Quantities

[1] Google DoQ: https://github.com/google-research-datasets/distribution-over-quantities

This DoQ data is generated from web scraping.

[2] VerbPhysics - https://github.com/uwnlp/verbphysics

This dataset contains shape relations between pairs of objects, based on the verbs describing their interaction.

[3] Shapenet - https://shapenet.org/

This is a collection of 3D models with various annotated quantitative data close to real life.

For Adjective Intensity

[1] https://github.com/acocos/scalar-adj/tree/master/globalorder/data/crowd/gold_rankings

This dataset contains groups of similar adjectives sorted by intensity.

[2] https://github.com/Coral-Lab/scales/blob/master/ordering/ordering_gold_standard.csv

This dataset represents the intensity of adjectives using numerical values.


None of these datasets is actually designed for this task, so I need to de-noise and structure the data properly before using it for any training or evaluation purpose.
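One piece of that clean-up can be sketched directly: dropping implausible measurements before training. The numbers below are hypothetical noisy DoQ-style values for a single object, and the 1.5-IQR rule is just one standard outlier filter; the real pipeline would also need unit normalization and schema restructuring.

```python
import numpy as np

# Hypothetical noisy measurements for one object (litres): a few sane
# values plus two scraping artifacts that are orders of magnitude off.
raw = np.array([28.0, 31.0, 30.0, 29.5, 4000.0, 0.001, 32.0])

# Keep values within 1.5 IQR of the quartiles (Tukey's fences).
q1, q3 = np.percentile(raw, [25, 75])
iqr = q3 - q1
clean = raw[(raw >= q1 - 1.5 * iqr) & (raw <= q3 + 1.5 * iqr)]
```

Filtering per object like this keeps the bulk of the distribution while discarding the co-occurrence mistakes that web-mined data inevitably contains.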

Methodology

For DoQ

[1] Baseline plan - Use existing data to obtain the volume distributions of various objects. The problem is that the existing data is noisy and may not generate good results.

[2] Target plan - Use an existing annotated dataset of extensive properties of various objects to train an existing language model to capture physical attributes.

[3] Stretch goal - Use image processing with deep learning to identify different objects and their true sizes. Associate this data with the language model to infer the physical attributes of unseen objects, and increase or decrease the estimated scale based on the adjective modifier in the language. The model will be similar to this one - https://arxiv.org/pdf/1612.00496.pdf
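The adjective-modifier part of the plan can be sketched as a multiplicative adjustment on a log scale. The intensity values below are invented placeholders; in the target plan they would be learned from the scalar-adjective datasets listed above.

```python
# Hypothetical adjective intensities on a log10 scale; in the real model
# these would be learned from the adjective-intensity datasets.
intensity = {"tiny": -1.0, "small": -0.5, "": 0.0, "big": 0.5, "huge": 1.0}

def scaled_estimate(base_volume, adjective=""):
    """Shift a base volume estimate by the adjective's log10 intensity."""
    return base_volume * 10 ** intensity[adjective]
```

Under these placeholder values, a "huge fox" is estimated an order of magnitude larger than a plain fox, while the relative ordering tiny < small < big < huge is preserved by construction.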

Metrics

Success -

  1. It can learn to estimate the physical properties of unknown objects just from their language representations.
  2. It can maintain the relative relationships of physical attributes between two objects.
  3. It can increase or decrease the predicted shape within reasonable bounds based on the intensity of the adjective associated with the object.

Since the goal of the model is quantitative prediction along with some classification, traditional loss functions should work for this.
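One such loss, sketched here as an assumption rather than a committed design, combines the two success criteria above: a mean-squared error on log-scale values (criterion 1) plus a pairwise hinge term that penalizes predictions which flip the true ordering of two objects (criterion 2).

```python
import numpy as np

def doq_loss(pred_log, true_log, margin=0.0):
    """MSE on log-scale values plus a pairwise hinge term that penalises
    predictions which reverse the true ordering of two objects."""
    mse = np.mean((pred_log - true_log) ** 2)
    rank = 0.0
    n = len(pred_log)
    for i in range(n):
        for j in range(n):
            if true_log[i] > true_log[j]:
                rank += max(0.0, margin - (pred_log[i] - pred_log[j]))
    return mse + rank
```

Working in log space keeps objects of very different magnitudes (ants and elephants) on comparable footing, and the ranking term keeps the model honest even when its absolute estimates drift.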

To evaluate the quality of the DoQ predictions, I can use the relative-size dataset from this paper: Hessam Bagherinezhad, Hannaneh Hajishirzi, Yejin Choi, and Ali Farhadi. 2016. Are elephants bigger than butterflies? Reasoning about sizes of objects. In Thirtieth AAAI Conference on Artificial Intelligence.

To evaluate adjective intensity, predictions can be compared against the normalized-intensity version of this Coral-Lab dataset - https://github.com/Coral-Lab/scales/blob/master/ordering/ordering_gold_standard.csv
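Since the relative-size dataset provides orderings rather than raw values, a natural evaluation metric is pairwise accuracy: the fraction of object pairs whose predicted order matches the gold order. This is a sketch of that metric, not a claim about the exact protocol used by the cited papers.

```python
from itertools import combinations

def pairwise_accuracy(pred, gold):
    """Fraction of object pairs whose predicted ordering of sizes
    (or adjective intensities) matches the gold ordering."""
    pairs = list(combinations(range(len(gold)), 2))
    correct = sum(
        (pred[i] > pred[j]) == (gold[i] > gold[j]) for i, j in pairs
    )
    return correct / len(pairs)
```

The same metric works for both evaluations: sizes against the relative-size dataset and predicted adjective intensities against the Coral-Lab gold ordering.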

Ethics

[1] Why is Deep Learning a good approach to this problem?

Answer - No traditional fixed rule-based model can capture and associate the relationship between the language representations of various objects and their extensive properties. Therefore, we need a flexible model that can generalize over the input space and make good estimates. Deep learning is especially suited to this kind of task because it scales with the quality and availability of data and can make connections between seemingly uncorrelated data.


[2] Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?

Answer - Anyone who uses this shape-estimation model to manipulate real-world situations through an automated robot or system is a stakeholder. If they blindly rely on the model without taking the necessary precautions against its errors, manipulating objects with subpar shape estimates can result in real catastrophe.

Distribution of Labor

This is a one-person project, so all of the work is mine.

