Bloomberg Embedding Challenge

What is this hot mess?

The basic goal of this challenge is to turn the challenge embeddings into actual words that make sense. We are given sample articles along with their respective embeddings, plus sample code that calculates the similarity between two embeddings.

Welcome to the Kitchen!

We took the sample code and imported pandas to read the .csv files that were provided to us for testing. We first set up file reading for the challenge file and for both the cnn and fed samples so we could grab their embeddings. We then compared each set of sample embeddings with the challenge embeddings using cosine similarity and sorted the output so that the most similar pairs come first. The following code is what we ended up with:

from scipy import spatial
import pandas as pd

## Cosine similarity.
def similarity(a, b):
    return 1 - spatial.distance.cosine(a, b)

df = pd.read_csv('federal_samples.csv')
challenge = pd.read_csv('challenge.csv')
output = pd.DataFrame(columns=['challenge', 'sample', 'similarity'])

for i in range(len(challenge.index)):
    # Cosine similarity can be negative, so start below its minimum of -1.
    best_sim = float('-inf')
    best_j = 0
    # Embeddings are stored as strings like "[0.1, 0.2, ...]".
    embd0 = [float(x) for x in challenge['embeddings'][i][1:-1].split(", ")]
    for j in range(len(df.index)):
        embd1 = [float(x) for x in df['embeddings'][j][1:-1].split(", ")]
        sim = similarity(embd0, embd1)  # compute once, reuse below
        if sim > best_sim:
            best_sim = sim
            best_j = j
        # print(f'Challenge: {i} to Sample: {j}     {sim}')
        output.loc[len(output.index)] = [i, j, sim]
    print(f'Max for #{i} is sample: {best_j} with {best_sim}\ntext: {df["text"][best_j]}\n\n')

# Most similar challenge/sample pairs first.
output = output.sort_values(by='similarity', ascending=False)
output.to_csv('out2.csv')

The idea is basically to compare each challenge embedding with every embedding in the given sample and print the sample with the highest similarity.
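The same comparison can also be done in one matrix operation instead of a nested loop. The block below is only a sketch of that idea, not the code we ran: it assumes NumPy is available, the helper names `parse_embedding` and `best_matches` are made up for illustration, and the 2-D vectors are toy data rather than real embeddings.

```python
import ast

import numpy as np


def parse_embedding(cell):
    """Turn a stringified list such as "[0.1, 0.2]" into a float array."""
    return np.asarray(ast.literal_eval(cell), dtype=float)


def best_matches(challenge_vecs, sample_vecs):
    """Return, per challenge, the index and similarity of its closest sample."""
    c = np.asarray(challenge_vecs, dtype=float)
    s = np.asarray(sample_vecs, dtype=float)
    # Normalize rows so that a plain dot product equals cosine similarity.
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    sims = c @ s.T  # shape: (n_challenges, n_samples)
    best = sims.argmax(axis=1)
    return best, sims[np.arange(len(c)), best]


# Toy 2-D vectors, invented purely for illustration.
challenges = [parse_embedding("[1.0, 0.0]"), parse_embedding("[0.0, 2.0]")]
samples = [parse_embedding("[0.0, 1.0]"), parse_embedding("[3.0, 0.0]")]
idx, score = best_matches(challenges, samples)
print(idx, score)
```

With real data, the two lists would be built by applying `parse_embedding` to the `embeddings` column of each .csv file.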

After getting a basic idea of the topic behind each challenge, we computed embeddings for additional related articles that we found ourselves, using the following edited API code:

import requests
import json
from scipy import spatial
import pandas as pd

def get_embedding(text, api_key):
    ## API Definitions
    url = "https://datathon.bindgapi.com/channel"
    headers =  {
        "X-API-Key": api_key,
        "Content-Type":"application/json"
    }
    body = { "inputs": text }
    ## API Call
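    try:
        response = requests.post(url, data=json.dumps(body), headers=headers)
    except requests.RequestException as exc:
        # Network-level failure: report it and give up on this call.
        print(exc)
        return None

    try:
        result = response.json()
        return json.loads(result['results'])
    except (ValueError, KeyError, TypeError):
        # Non-JSON body or unexpected payload shape: show the HTTP status.
        print(response.status_code)
        return None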

def similarity(a, b):
    return 1 - spatial.distance.cosine(a, b)

# Define your API key here 
API_KEY = "IJXH6TU5QL9BFnRJHCl8G99pKBFkTIMt6smwp0cU"
with open("api_input.txt") as f:
    TEXT = f.read()[:7500]

# Call the get_embedding function located in ./assets
embd1 = get_embedding(TEXT, API_KEY)

challenge = pd.read_csv('challenge.csv')
num = 0
embd0 = [float(x) for x in challenge['embeddings'][num][1:-1].split(", ")]
print(f'Challenge: {num}  {similarity(embd0, embd1)}')

Using this script, we found out which of those articles were closest to the topic of each challenge.
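The script above scores one article at a time. A rough sketch of how several articles could be ranked in one go is shown below. It is an illustration under assumptions: `rank_articles` is a hypothetical helper, `embed` stands for any function that maps text to a vector (for example a wrapper around `get_embedding`), and `fake_embed` is a stand-in so the example runs without the API.

```python
import numpy as np


def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank_articles(articles, challenge_vec, embed):
    """Sort (name, similarity) pairs from most to least similar.

    `articles` maps a name to its text; `embed` turns text into a vector.
    """
    scores = {}
    for name, text in articles.items():
        vec = embed(text[:7500])  # same truncation as the script above
        scores[name] = cosine_similarity(challenge_vec, vec)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Stand-in embedder for illustration only: counts two hand-picked letters.
def fake_embed(text):
    return [text.count("a") + 1, text.count("b") + 1]


ranked = rank_articles({"x": "aaaa", "y": "bbbb"}, [5, 1], fake_embed)
print(ranked)
```

Passing the embedder in as a parameter keeps the ranking logic testable without network access or an API key.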

What got burned down

During our similarity tests, we tried both cosine and Euclidean measures. After comparing the results, we found that cosine gave much better matches, so we stuck with cosine for the rest of the work. As for the two samples that were given, the cnn samples were much more similar to the challenge embeddings than the fed samples, so we decided to look only at the cnn samples for the result.
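One general difference between the two measures, which may or may not explain what we saw, is that cosine similarity ignores vector length while Euclidean distance does not. The toy example below (made-up 2-D vectors, NumPy assumed) shows the two measures disagreeing about which vector is the closer match.

```python
import numpy as np


def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


query = [1.0, 1.0]
same_direction = [10.0, 10.0]  # same direction, much larger norm
other_direction = [1.0, 0.0]   # different direction, similar norm

cos_same = cosine_similarity(query, same_direction)
cos_other = cosine_similarity(query, other_direction)
euc_same = euclidean_distance(query, same_direction)
euc_other = euclidean_distance(query, other_direction)

# Cosine ranks same_direction first; Euclidean ranks other_direction first.
print(cos_same, cos_other)
print(euc_same, euc_other)
```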

Results gathered

Our predictions for the challenge problems come from grabbing the top 2 most similar embeddings for each challenge and comparing the words in their sample text.
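We compared the sample texts by reading them, but the word comparison could also be roughed out in code. The sketch below is hypothetical: `shared_words` and the short stop-word list are assumptions made for this example, and the two sentences are invented rather than taken from the sample files.

```python
import re

# Small, hand-picked stop-word list; an assumption for this sketch only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def shared_words(text_a, text_b):
    """Return the non-stop-words that appear in both texts, sorted."""
    def tokens(text):
        return set(re.findall(r"[a-z']+", text.lower())) - STOP_WORDS
    return sorted(tokens(text_a) & tokens(text_b))


# Invented sentences, not taken from the sample files.
first = "The fund filed for bankruptcy after the investment collapsed."
second = "Bankruptcy fears grow as investment losses mount."
print(shared_words(first, second))
```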

Challenge 1 Prediction: Article about international travel discounts between India and London. The top three embeddings all mention different countries and include the word travel.

Challenge 2 Prediction: This article describes failing investments and their ripple effects on large global financial institutions and the money that they manage. These effects may be so damaging that some institutions are filing for bankruptcy. Both of the top embeddings for challenge 2 include the words bankruptcy and investment.

Challenge 3 Prediction: Article that reports the spread of a deadly virus/disease that is causing the lockdown of an area. Both of the top embeddings mention some sort of virus that is causing deaths and lockdowns in separate countries.

Challenge 4 Prediction: Article about armed conflicts between official authorities and rebels/activists. Both of the top embeddings mention some sort of authority (police and military).

Challenge 5 Prediction: Article about the endangerment of nature and people's work towards conservation. One of the articles mentions forest fires and forest fire prevention work, while the other mentions threatened forests and butterfly populations. The article describes the collective efforts of national conservation organizations and large mobile teams of firefighters to combat humans' negative impact on nature.

Built With

blood
sweat
tears

Updates

Mengting Teng started this project — Oct 09, 2022 10:49 AM EDT
