What is DEEP

DEEP is a collaborative platform for qualitative data analysis that supports humanitarian analytical teams in producing actionable insights. Since its inception in the aftermath of the 2015 Nepal earthquake, DEEP has contributed significantly to improving the humanitarian data ecosystem, and today it is the largest repository of annotated humanitarian response documents: 50k+ sources/leads and 400k+ entries/annotations, actively used for 300+ projects by 3.5k+ registered users in 60+ countries. Because of its widespread use and the granular annotations made by expert humanitarian analysts, DEEP is uniquely placed as a data source for developing NLP models.


During crises, rapidly identifying important information in the available data (news, reports, research, etc.) is crucial to understanding the needs of affected populations and to improving evidence-based decision-making. DEEP is well suited to applying Natural Language Processing (NLP) and deep learning to speed up information classification: the models aid and support the manual tagging process, giving the humanitarian community more time to produce analyses and take rapid action to save more lives.

The development and use of AI in the sector focus mainly on detecting disasters from maps, satellite imagery, and similar sources, while the needs of analysis teams working behind desks are often overlooked. These humanitarian workers must understand the situation from a variety of sources, which often consist of very noisy and voluminous data that is difficult to process.

What it does

Applying machine/deep learning to the humanitarian and development sectors has been challenging. Because of the complex topics and ethical considerations involved, similar projects often stop at the proof-of-concept or research-paper stage. DEEP and the NLP capacities developed for it target the specific needs and case studies of the disaster response and humanitarian sectors.

More specifically, DEEP allows users to submit documents and applies several NLP processes to them, such as extraction and classification of text excerpts and Named Entity Recognition (NER). The text snippets are classified according to predefined humanitarian target labels, referred to as analytical frameworks. The unstructured data and the classifications used in the sector pose technical challenges that are not obvious at first. How should NLP classifiers deal with interrelated tags that have a hierarchical structure? How should a dataset be formed out of entries so that classifiers capture positive signals even for very rare tags? And because our tags depend on preceding sentences, how should the DL/ML and UI flow be designed to respond to this?

For this hackathon, we are focusing on multi-label text classification across multiple categories. The following sections describe this in more depth.

How we built it

This section covers the following points:

  • Data
  • Modeling
  • Deployment


Data

Up to now, all the information (of any kind: reports, news, articles, maps, infographics, etc.) uploaded to the platform has been annotated by hand by experts in the humanitarian sector. In the DEEP platform, users create projects, which are usually linked to specific humanitarian crises such as natural disasters or ongoing conflicts in geographic locations where a rapid response is needed. To analyze those situations, users create custom label sets (analysis frameworks) and use them to annotate the information uploaded within their projects. After creating projects and analysis frameworks, users upload documents of any format, select text excerpts that contain important details for the analysis, and annotate them using their label sets. To combine entries from various projects and different analytical frameworks, we defined a generic analytical framework and transformed our labels accordingly. Our generic analytical framework has 8 main multi-label categories.

  • Three Primary Tags: Sectors, 2D Pillars & 2D Sub-pillars, 1D Pillars & 1D Sub-pillars,
  • Five Secondary Tags: Affected Groups, Demographic Groups, Specific Needs Groups, Severity, and Geolocation.

For this hackathon, we focused only on a subset of the above categories, the Primary Tags. Primary Tags contain 75 labels under different subcategories named as follows:

  • Sectors with 11 labels,
  • 2D Pillars with 6 labels,
  • 2D Sub-pillars with 18 labels,
  • 1D Pillars with 7 labels, and
  • 1D Sub-pillars with 33 labels.
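As an illustration of what a multi-label target over these 75 Primary Tag labels can look like, the sketch below builds a multi-hot vector from the subcategory sizes listed above. The subcategory keys, the ordering of the subcategories, and the use of integer indices within each subcategory are assumptions made for the example, not the layout DEEP actually uses.

```python
from typing import Dict, List, Tuple

# Label counts per Primary Tag subcategory, taken from the list above.
# The key names and their order are hypothetical.
SUBCATEGORY_SIZES: Dict[str, int] = {
    "sectors": 11,
    "pillars_2d": 6,
    "subpillars_2d": 18,
    "pillars_1d": 7,
    "subpillars_1d": 33,
}


def build_offsets(sizes: Dict[str, int]) -> Tuple[Dict[str, int], int]:
    """Return the start position of each subcategory and the total label count."""
    offsets: Dict[str, int] = {}
    start = 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start


OFFSETS, NUM_LABELS = build_offsets(SUBCATEGORY_SIZES)


def encode(tags: Dict[str, List[int]]) -> List[int]:
    """Turn {subcategory: [label indices within it]} into a multi-hot vector."""
    vector = [0] * NUM_LABELS
    for name, indices in tags.items():
        for idx in indices:
            if not 0 <= idx < SUBCATEGORY_SIZES[name]:
                raise ValueError(f"index {idx} out of range for {name}")
            vector[OFFSETS[name] + idx] = 1
    return vector


# One excerpt tagged with the first sector and the last 2D pillar.
example = encode({"sectors": [0], "pillars_2d": [5]})
```

Because an excerpt can carry several labels at once, the vector can contain any number of ones, which is what distinguishes this setting from single-label classification.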


Approximately 2/3 of our dataset is in English, while French, Spanish, and Arabic largely made up the rest.

As often happens in sparse multi-label datasets, some labels are underrepresented compared to others. This caused overfitting for our initial baseline models. We describe how we solve these problems in the following sections.


Modeling

The model we developed is based on a pre-trained transformer architecture. The transformer had to fulfill the following criteria:

  • multilingual: it needs to work for different languages
  • good performance: to be useful, the model needs to predict tags accurately
  • fast predictions: the main goal of the modeling is to give live predictions to taggers while they are annotating, so speed is critical and the faster the model, the better
  • only one endpoint for deployment: to optimize costs, we want a single endpoint for all models and predictions. To do this, we create a custom class containing all our models and deploy it. This endpoint scales easily with autoscaling under heavy request load.
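The single-endpoint idea in the last criterion can be sketched as one wrapper object that holds every model and answers a request with all predictions at once. This is a dependency-free illustration only: the class name `CombinedTagger`, the stand-in predictor functions, and the label strings are hypothetical, and the packaging and serving layer of the real deployment is not shown.

```python
from typing import Callable, Dict, List

# A predictor maps one text excerpt to the list of labels it assigns.
Predictor = Callable[[str], List[str]]


class CombinedTagger:
    """Bundles several per-category predictors behind a single predict() call."""

    def __init__(self, predictors: Dict[str, Predictor]) -> None:
        self._predictors = dict(predictors)

    def predict(self, excerpts: List[str]) -> List[Dict[str, List[str]]]:
        # One request in, predictions from every model out.
        return [
            {name: predict(excerpt) for name, predict in self._predictors.items()}
            for excerpt in excerpts
        ]


# Hypothetical stand-ins for trained models; real ones would run a transformer.
def fake_sector_model(text: str) -> List[str]:
    return ["Health"] if "clinic" in text.lower() else []


def fake_pillar_model(text: str) -> List[str]:
    return ["Impact"] if "damaged" in text.lower() else []


tagger = CombinedTagger({"sectors": fake_sector_model, "pillars": fake_pillar_model})
result = tagger.predict(["The clinic was damaged by flooding."])
```

Keeping all models inside one object means only one endpoint has to be provisioned and scaled, at the cost of redeploying the whole bundle when any single model changes.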

We use the transformer microsoft/xtremedistil-l6-h256-uncased as a backbone.

For the sub-pillar tags (and also for the secondary tags), we use a tree-like multi-task model (i.e. a hierarchical classification model), and we fine-tune the last hidden state of the transformer separately for each subtask. The sub-pillars model has 13 subtasks (Humanitarian Conditions, At Risk, Displacement, Covid-19, Humanitarian Access, Impact, Information And Communication, Shock/Event, Capacities & Response, Context, Casualties, Priority Interventions, Priority Needs), each of which has its own final labels to predict. This allows weight sharing across tasks and down the hierarchy, which improves generalization because the tasks are related and the labels within one task's hierarchy are interdependent.
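The shape of such a multi-task model can be illustrated with a deliberately tiny, dependency-free toy: one shared feature extractor whose output is computed once and then fed to a separate head per subtask. Everything here is an assumption made for illustration. `shared_features` is a hashed bag-of-words stand-in for the transformer backbone, the heads are plain logistic layers with untrained zero weights, and the label names are placeholders.

```python
import math
from typing import Dict, List


def shared_features(text: str, dim: int = 8) -> List[float]:
    """Stand-in for the shared backbone: a small hashed bag-of-words vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[sum(ord(ch) for ch in token) % dim] += 1.0
    return vec


class TaskHead:
    """One subtask's own output layer on top of the shared features."""

    def __init__(self, labels: List[str], dim: int = 8) -> None:
        self.labels = labels
        # Zero-initialised parameters; training would update them per subtask.
        self.weights = [[0.0] * dim for _ in labels]
        self.biases = [0.0 for _ in labels]

    def scores(self, features: List[float]) -> Dict[str, float]:
        out: Dict[str, float] = {}
        for label, w, b in zip(self.labels, self.weights, self.biases):
            logit = sum(wi * xi for wi, xi in zip(w, features)) + b
            out[label] = 1.0 / (1.0 + math.exp(-logit))  # sigmoid per label
        return out


class MultiTaskTagger:
    """Shared representation, one head per subtask."""

    def __init__(self, heads: Dict[str, TaskHead]) -> None:
        self.heads = heads

    def predict(self, text: str) -> Dict[str, Dict[str, float]]:
        features = shared_features(text)  # computed once, reused by every head
        return {task: head.scores(features) for task, head in self.heads.items()}


tagger = MultiTaskTagger({
    "Displacement": TaskHead(["label_a", "label_b"]),  # placeholder labels
    "Casualties": TaskHead(["label_c"]),
})
scores = tagger.predict("Families were displaced by the floods")
```

The point of the structure is that the expensive shared part runs once per excerpt, while each subtask only adds a small head of its own.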


Deployment

We deploy all of our models on scalable GPU-backed SageMaker instances. The response time for one text excerpt to pass through all models is below 200 ms. We use the MLFlow Pyfunc base image, which loads the trained models stored on S3, and create the SageMaker endpoints on the ml.g4dn.xlarge instance type. In addition, several models that are not computationally intensive are deployed as Lambda functions.

The Online Testing Environment, a Streamlit-based web application created for testing purposes, is deployed using AWS Elastic Beanstalk and is backed by a t2.small EC2 instance. This web application sends requests to fetch the tag predictions from the models.

We have also deployed an MLFlow server where hyper-parameters and performance metrics from the models are recorded. This helps us track our experiments, compare multiple versions of the models, and select the best-performing one.

We used Terraform (Infrastructure as Code) to write configurations to deploy all the components in our AWS environment.

Challenges we ran into

Through the process of development, we have undoubtedly encountered several challenges. In general, we have overcome these by fostering a transparent and collaborative development environment where new ideas and forward-looking failures are embraced. The unstructured data and the classifications of humanitarian sectors also posed technical challenges that were not initially evident. Our challenges and solutions are listed below:

Working across timezones and cultures

Our team spans the globe, with developers and managers in the US, Switzerland, Turkey, Austria, and Nepal. This diversity has brought great benefit to our team; however, the distribution across timezones has made things difficult. To overcome this, we plan and execute all of our work on common Trello boards and have daily standup calls in the few hours of the day when we’re all online. Overcoming the language barrier has meant that our common working language has been English, and we’ve greatly enjoyed sharing our cultures.

How to combine different projects using separate analysis frameworks into one consistent dataset?

With the help of humanitarian experts and senior data analysts, we, the NLP team, consolidated the different tags/annotations of the various analysis projects into a single generic tag set. However, because similar labels are defined differently in different analysis projects (e.g. the age range that counts as a child), the final dataset had some noise that affected the performance of the models. We further improved the consistency of the generic tags by using active learning and weak supervision. Our development cycle consisted of analyzing the data, applying rule-based processing to it, training models, testing them offline on a held-out validation set, having expert data analysts test them interactively, collecting feedback, and repeating.
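The consolidation step can be pictured as a lookup from (project framework, project label) pairs to generic labels. The sketch below is a minimal illustration under stated assumptions: the framework names, the project labels, and the "Category->Label" naming of generic tags are all invented for the example and do not reproduce the mapping the team actually built.

```python
from typing import Dict, List, Set, Tuple

# (project framework, project label) -> generic label.
# Every name in this table is hypothetical.
TAG_MAP: Dict[Tuple[str, str], str] = {
    ("framework_a", "WASH"): "Sectors->WASH",
    ("framework_b", "Water & Sanitation"): "Sectors->WASH",
    # Two projects that define "child" with different age ranges end up on
    # the same generic label, which is one source of the noise described above.
    ("framework_a", "Children (<18)"): "Demographic Groups->Children",
    ("framework_b", "Children (<15)"): "Demographic Groups->Children",
}


def to_generic(framework: str, labels: List[str]) -> Tuple[Set[str], List[str]]:
    """Map one entry's project labels to generic labels.

    Returns the generic labels plus any labels with no known mapping,
    so that unmapped labels can be reviewed by analysts instead of being lost.
    """
    generic: Set[str] = set()
    unmapped: List[str] = []
    for label in labels:
        target = TAG_MAP.get((framework, label))
        if target is None:
            unmapped.append(label)
        else:
            generic.add(target)
    return generic, unmapped


mapped, leftover = to_generic("framework_b", ["Water & Sanitation", "Unknown tag"])
```

Returning the unmapped labels separately keeps the rule-based step auditable, which fits the analyze, process, train, and review cycle described above.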

How to deal with interrelated tags with hierarchical structure in NLP classifiers?

Our hierarchical model architecture respects the structure of our tag set and produces predictions that are consistent with the hierarchy of our generic tag set. The hierarchical model also improved performance over a flat baseline model.
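The text does not say how consistency with the hierarchy is obtained. One simple way to guarantee it, shown here purely as an assumed post-processing step, is to keep a sub-pillar prediction only when its parent pillar is also predicted. The function name, the score dictionaries, the placeholder sub-pillar names, and the single shared threshold are all illustrative.

```python
from typing import Dict, List


def decode_hierarchy(
    pillar_scores: Dict[str, float],
    subpillar_scores: Dict[str, Dict[str, float]],
    threshold: float = 0.5,
) -> Dict[str, List[str]]:
    """Keep sub-pillars only under pillars that pass the threshold themselves."""
    predictions: Dict[str, List[str]] = {}
    for pillar, score in pillar_scores.items():
        if score < threshold:
            continue  # parent rejected, so its children are dropped too
        children = subpillar_scores.get(pillar, {})
        predictions[pillar] = [
            name for name, child_score in children.items() if child_score >= threshold
        ]
    return predictions


decoded = decode_hierarchy(
    pillar_scores={"Displacement": 0.9, "Casualties": 0.2},
    subpillar_scores={
        "Displacement": {"sub_a": 0.7, "sub_b": 0.1},  # placeholder names
        "Casualties": {"sub_c": 0.95},
    },
)
```

In this example `sub_c` is discarded despite its high score because its parent pillar was not predicted, so the output can never contain a child without its parent.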

How to deal with the multi-lingual nature of our dataset?

Approximately 2/3 of our dataset is in English, while French, Spanish, and Arabic make up most of the rest. To use all the information available to us effectively, we use a multilingual pre-trained language model as a backbone and fine-tune it on our dataset.

XtremeDistilTransformers is a distilled task-agnostic transformer model that leverages task transfer for learning a small universal model that can be applied to arbitrary tasks and languages as outlined in the paper XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation.

How to form a dataset out of entries and train classifiers such that they will be best at capturing positive signals when having very rare tags?

Many of our tags have few positive examples in the dataset, which biases the model toward ignoring the signals that come from such tags. To address this challenge, we use data augmentation, threshold tuning, and focal loss.
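Two of these techniques are compact enough to sketch without any framework. The first function is the standard binary focal loss, -alpha_t * (1 - p_t)^gamma * log(p_t), applied to a single label; the second picks a decision threshold for one label by maximizing F1 on validation data. The default values of `gamma` and `alpha`, the candidate grid, and the choice of F1 as the tuning metric are assumptions for the example, not the settings used in the project.

```python
import math
from typing import Sequence


def binary_focal_loss(prob: float, target: int,
                      gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one binary label, given the predicted probability."""
    eps = 1e-12  # guards against log(0)
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of examples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


def f1_at(threshold: float, probs: Sequence[float], targets: Sequence[int]) -> float:
    """F1 score of one label when predicting positive for prob >= threshold."""
    tp = fp = fn = 0
    for prob, target in zip(probs, targets):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and target == 1:
            tp += 1
        elif predicted == 1 and target == 0:
            fp += 1
        elif predicted == 0 and target == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(
    probs: Sequence[float],
    targets: Sequence[int],
    candidates: Sequence[float] = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
) -> float:
    """Return the candidate threshold with the best F1 (first one wins ties)."""
    return max(candidates, key=lambda th: f1_at(th, probs, targets))


# A rare tag whose positives only reach modest probabilities.
best = tune_threshold(probs=[0.35, 0.3, 0.1, 0.05], targets=[1, 1, 0, 0])
```

For rare tags the model's probabilities for true positives often stay well below 0.5, so a per-label threshold chosen on validation data can recover positives that a fixed 0.5 cutoff would miss.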

Accomplishments that we're proud of

We are both delighted and humbled by the fact that the models we are building will serve the humanitarian sector in situation analysis and decision making. We have accomplished remarkable technological advancements along the way.

  • A large supervised textual humanitarian dataset, which is (to the best of our knowledge) the largest and most accurately annotated of its kind.
  • An assisted tagging system that will speed up the analysis of humanitarian secondary data.
  • A PDF text extractor that uses computer vision, greatly improving the quality of text extracted from PDFs.
  • Participation in hackathons hosted by CERN and Applied Machine Learning Days.
  • DEEP has been used to support critical global humanitarian responses, including the Central/South American refugee crisis and the global COVID response.

What we learned

We have learned and put into practice how to use several AWS products to serve our purpose:

  • SageMaker for training and deploying deep learning models.
  • Elastic Beanstalk to host a testing environment that we use to test our models before deployment.
  • ECS to host a containerized MLFlow server for tracking and logging models and metrics.
  • Lambda to host the DEEPL PDF Extractor tool, which was built in-house.

We have honed our skills of using many deep learning and data science tools including:

  • Streamlit: to quickly build an interface that enables human experts to test the models interactively.
  • MLFlow: to log our experiments and models.
  • DVC: to enable dataset versioning and sharing among the team.
  • PyTorch, PyTorch Lightning, Huggingface, and SageMaker API to train our models.

We have also polished our expertise in testing machine learning models. We perform different kinds of testing.

  • Testing the performance of the models using a held-out test set.
  • Testing the models interactively before deployment with the end-users to get their feedback.
  • Speed-test of the deployed model endpoint.
  • Testing the integration between the model endpoint and the backend of the DEEP.
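The speed test in the list above can be approximated by timing a prediction call per excerpt and comparing the result with a latency budget. The sketch below is illustrative only: `predict` stands for any callable that sends one excerpt to the endpoint, the function names are hypothetical, and the 200 ms default budget is taken from the response time quoted in the deployment section. The clock is injectable so that the check itself can be tested without real waiting.

```python
import time
from typing import Callable, List, Sequence


def measure_latencies_ms(
    predict: Callable[[str], object],
    excerpts: Sequence[str],
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Time one predict() call per excerpt and return latencies in milliseconds."""
    latencies: List[float] = []
    for excerpt in excerpts:
        start = clock()
        predict(excerpt)
        latencies.append((clock() - start) * 1000.0)
    return latencies


def within_budget(latencies_ms: Sequence[float], budget_ms: float = 200.0) -> bool:
    """True when even the slowest call stayed under the budget."""
    if not latencies_ms:
        raise ValueError("no latencies were measured")
    return max(latencies_ms) < budget_ms


# Usage with a do-nothing stand-in for the real endpoint call.
sample = measure_latencies_ms(lambda excerpt: None, ["first excerpt", "second excerpt"])
```

Checking the slowest call rather than the average is a deliberately strict choice here; a percentile such as p95 would be a reasonable alternative when occasional outliers are acceptable.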

What's next for NLP for improved humanitarian response & analysis

We are experimenting with more ideas that will further improve the quality of humanitarian analysis. We are looking at the NLP product as one instance of a broader service, and we are researching ways in which other services can be plugged in to help classify disaster-related documents. Lastly, we are developing a pre-trained language model, similar to DeBERTa, that targets the disaster response sector.

What is Data Friendly Space (DFS)

This project is submitted by the NLP team of Data Friendly Space, a U.S.-based INGO working across six continents to make modern data systems and data science accessible to the humanitarian and development communities. Our mission is to render informed, effective, and targeted aid by supporting the global humanitarian community through responsible, resource-efficient innovation. Since our first project in 2018, we have continued to increase not only our own capacity but also the capacity of our partners to make a lasting impact. We ultimately strive to give humanitarian organizations more time to focus on what matters most, without having to worry about their data.

Built With

  • aws-elastic
  • beanstalk
  • deepl-pdf-extractor
  • ecs
  • gpu-sagemaker
  • huggingface
  • ml.g4dn.xlarge
  • mlflow
  • pyfunc
  • python
  • pytorch
  • pytorch-lightning
  • sagemakerapi