Adrien Chavarot posted an update — Nov 30, 2022 10:33 AM EST

Week 1 Update

Introduction

We are implementing a model from a paper that describes a new approach to extracting structured data at the sentence level. We chose this paper because it tackles the non-trivial task of entity recognition on natural language that may not follow normal grammatical conventions. The dataset we collected has the same property, which is one of the reasons this paper interests us.

We are performing Named Entity Recognition (NER), which combines identification (locating the spans of text that mention an entity) with classification (assigning each span an entity type).
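
To make the identification and classification split concrete, here is a minimal sketch of how NER output is often encoded as one tag per token. It assumes the common BIO scheme (B = beginning of an entity, I = inside, O = outside); we have not stated which scheme the paper uses, and the function name, tokens, and entity types (COLOR, MATERIAL) are hypothetical.

```python
def bio_tags(tokens, entities):
    """Turn entity spans into one BIO tag per token.

    entities is a list of (start, end, label) token-index spans,
    with end exclusive. Spans are assumed not to overlap.
    """
    tags = ["O"] * len(tokens)  # default: token is outside any entity
    for start, end, label in entities:
        tags[start] = f"B-{label}"  # identification: where the entity begins
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # classification: label carried along the span
    return tags


# Hypothetical product-style text that ignores normal grammar.
tokens = ["navy", "blue", "cotton", "shirt"]
entities = [(0, 2, "COLOR"), (2, 3, "MATERIAL")]
print(list(zip(tokens, bio_tags(tokens, entities))))
```

In this encoding, a model only has to predict one tag per token, and the entity spans and their types can be recovered from the tag sequence.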

Challenges

So far, the biggest challenge has been data pre-processing. Our data consists of product JSON files, each with a description and labels, whereas the paper uses data structured differently, so we have to convert our data to a different format. To reuse some of the preprocessing code from hw5, we currently plan to convert our data to a comma-separated text file.
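
As a rough illustration of that conversion, the sketch below flattens one product record into comma-separated rows. The field names ("description", "labels"), the assumption that labels is a mapping from label name to value, and the output column names are all hypothetical, since the exact JSON layout is not described here. It uses Python's csv module rather than joining strings by hand, because descriptions may themselves contain commas.

```python
import csv
import io
import json


def product_to_rows(product):
    """Flatten one product record into (description, label_name, label_value) rows."""
    description = product.get("description", "")
    labels = product.get("labels", {})  # assumed: dict of label name -> value
    return [(description, name, value) for name, value in labels.items()]


def products_to_csv(products, out_file):
    """Write a header plus one row per (product, label) pair to out_file."""
    writer = csv.writer(out_file)  # quotes fields that contain commas
    writer.writerow(["description", "label_name", "label_value"])
    for product in products:
        writer.writerows(product_to_rows(product))


# Hypothetical record standing in for one product JSON file.
sample = json.loads(
    '{"description": "Red cotton shirt, size M",'
    ' "labels": {"color": "Red", "material": "cotton"}}'
)
buffer = io.StringIO()
products_to_csv([sample], buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, the same function can be given a file opened with open(path, "w", newline="").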

Insights

The paper we are implementing has two entity extraction models. The first takes only textual information as input. We have implemented this model, but have not yet been able to train it because of the data pre-processing challenges. Our concrete result so far is therefore the code for the class that contains the text-only model.

Plan

We believe we are generally on track with our project. Data pre-processing can be quite tricky, so we plan to dedicate more time to it, but in general we believe the actual implementation of the model is going well. We do anticipate needing some hyper-parameter tuning: because our dataset is very different, we don't think we can simply reuse the hyper-parameters from the paper.
