Title:

Entity Recognition in Online Fashion Listings

Who:

  • Adrien Chavarot (achavaro)
  • Walter Zhang (xzhan180)

Introduction:

We are implementing a model from a paper that describes a new approach to extract structured data at the sentence-level. We chose this paper because it engages in the non-trivial task of entity recognition on natural language that may not follow normal grammatical conventions. The dataset we collected is like this too, which is one of the reasons why this paper interests us.

We are performing Named Entity Recognition (NER), a mix of identification and classification.

Related Work:

Named Entity Recognition is a widely-used pre-processing step in many NLP applications, so there is much literature about this. However, most approaches to NER rely on either ‘external resources’ or text normalization, to normalize grammatically and syntactically inconsistent text. However, this paper in particular stands out because it does not rely on any of this.

Most of the other papers make use of a pre-trained BiLSTM network for word embeddings, with a Conditional Random Field at the end.

There has also been some use of sent2vec, and even phonetic representations of words to infer meaning beyond just how they are written

  • List of known code implementations:

    None yet.

(https://medium.com/mysuperai/what-is-named-entity-recognition-ner-and-how-can-i-use-it-2b68cf6f545d)

To summarize the above article, Named Entity Recognition models work in two steps: First, detect a named entity, then categorize the entity. The type of training data is very important to the success of a model at any given task. In the words of the article: “Train your model on Victorian gothic literature, and it will probably struggle to navigate Twitter.” NER is useful to give an overview understanding of any dataset, and filter by certain parameters.

Data:

The dataset that we used is a custom dataset that Adrien obtained through web scraping a popular second-hand clothing website, Grailed. 700k listings are provided.

Entity extraction is interesting in this case because the listings are created by many different users, so there is no standardization in style across the dataset. Therefore, a naïve approach of looking for certain keywords is insufficient.

No significant pre-processing will need to be done on the data itself. Each description contains tags about category, name, color, size, designers, seller location, and condition. These can be used as the ground truth labels.

Methodology:

Our model will follow the architecture described in the paper:

Untitled

We think the hardest part of this model to implement will be the BiLSTM modules, as we have only seen a one-direction LSTM previously.

Metrics:

After training, we will run the model on the testing data and compare the model’s outputs against the ground truth. We will do this with both the data described in the Data section and a standard WNUT dataset used by the paper.

The models in the paper are compared using the F1 score, which is calculated using the number of true positives, false positives, and false negatives. This metric is more suitable than accuracy to measure the performance on a data set whose distribution of labels is uneven, as the false positives and negatives are weighed more. The authors of the paper sought to compare the performance of their model against “state-of-the-art” models on standard datasets for predicting named entities.

Thus, we will also be using the F1 metric to measure the performance of our mode

Our base line will be to implement the layers for processing test input in the figure above and have a working model that can accurately predict some labels. The target will be to achieve a similar score on the data set as described in the paper, and the stretch goal will be to implement the multimodal feature that adds an image analysis component.

Ethics:

  • Why is Deep Learning a good approach to this problem?

Deep learning is a good approach to this problem because it is a problem that can be feasibly solved, but is extremely repetitious and therefore not worth doing by humans. A good approach to see if a deep learning can be good at a specific problem is to see if a human can do it. If a human can’t then it is hard to expect a DL model to. A human could feasibly do this specific task, but an algorithmic approach would not work well here.

  • What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

The dataset will have to be manually inspected. This is because we have a concern with it, which is that all the data points have been created by random internet users. One negation to this concern is that unlike with generative models, which can cause real harm by generating content similar to harmful data they were trained on, entity extraction models do not create any new data, they just allow it to be more efficiently organized and tagged.

Another concern is with how the data was collected. Web scraping is not widely accepted within the web community, and some consider it to be damaging to the website that is being scraped. However, our rationale is that a lot of DL datasets have been created from scraping, and the Grailed company stands to gain from our research on how to more efficiently categorize fashion listings. Therefore, we do not see ourselves working against their interests.

Division of Labor:

Adrien will be responsible for Data Gathering and cleaning.

Adrien and Walter will both share responsibility for the code implementation of the model and the subsequent write-up

Built With

Share this project:

Updates

posted an update

Week 1 Update

Introduction

We are implementing a model from a paper that describes a new approach to extract structured data at the sentence-level. We chose this paper because it engages in the non-trivial task of entity recognition on natural language that may not follow normal grammatical conventions. The dataset we collected is like this too, which is one of the reasons why this paper interests us.

We are performing Named Entity Recognition (NER), a mix of identification and classification.

Challenges

So far, the biggest challenge has been figuring out data pre-processing. Our data is in the form of product json files, with a description, and labels. However, the paper uses data structured in a different manner. Therefore, we have to convert our data to a different format. In order to make use of some of the preprocessing code from hw5, we are currently planning on converting our data to a txt comma separated file.

Insights

The paper we are implementing has two entity extraction models. The first one takes only textual information as input. We have implemented this, but have not been able to train it yet due to data pre-processing challenges. So as far as concrete results, we have coded the class that contains the textual-only model.

Plan

We believe we are generally on track with our project. Data pre-processing can be quite tricky, so we will be planning on dedicating more time to this, however in general we believe that the actual implementation of the model is going well. We do anticipate that we will need to do some hyper-parameter tuning. We don’t think we will just be able to reuse the ones that the paper we are implementing used because our dataset is very different.

Log in or sign up for Devpost to join the conversation.