OpenAI CLIP (Contrastive Language-Image Pre-training) is a visual and text embedding model that enables zero-shot image classification and semantic search. It matches the performance of strong supervised ImageNet models without using any of their annotated labels.
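For context, here is what zero-shot classification looks like with the original English-only CLIP, using OpenAI's clip package (the image path and prompt texts are placeholders):

import clip
import torch
from PIL import Image

# Load the original CLIP model and its preprocessing pipeline (CPU for portability)
model, preprocess = clip.load("ViT-B/32", device="cpu")
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
text = clip.tokenize(["a photo of a dog", "a photo of a cat", "a photo of a school"])

with torch.no_grad():
    # Similarity logits between the image and each candidate caption
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(probs)  # the caption with the highest probability is the predicted (zero-shot) class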

However, there is a big downside: like most language models today, it only works for English.

To address this underrepresentation, enable new applications, and improve how well models serve other languages, I propose Multilingual CLIP.

What it does

Multilingual CLIP is a pre-trained model which can be used for multilingual semantic search and zero-shot image classification in 100 languages.

See the multilingual semantic search demo here.

How we built it

Model Architecture

Multilingual CLIP was built on top of the OpenAI CLIP model. I kept the same vision encoder (ResNet-50x4), but replaced the original text encoder (a Transformer) with a multilingual text encoder (XLM-RoBERTa) followed by a configurable number of projection heads, as seen below:

Model Architecture
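As an illustration only, a minimal sketch of the text side of this architecture might look like the following (the class, dimensions, and parameter names are my assumptions, not the repo's actual code):

import torch
import torch.nn as nn
from transformers import XLMRobertaModel

class MultilingualTextEncoder(nn.Module):
    """XLM-RoBERTa backbone followed by a configurable stack of projection layers
    that maps text features into the CLIP image embedding space."""

    def __init__(self, embed_dim: int = 640, num_layers: int = 3):  # 640 = RN50x4 embedding size
        super().__init__()
        self.backbone = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        hidden = self.backbone.config.hidden_size  # 768 for xlm-roberta-base
        dims = [hidden] + [embed_dim] * num_layers
        blocks = []
        for i in range(num_layers):
            blocks.append(nn.Linear(dims[i], dims[i + 1]))
            if i < num_layers - 1:
                blocks.append(nn.GELU())
        self.projection = nn.Sequential(*blocks)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # CLS-style pooling of the first token
        return self.projection(pooled)

The RN50x4 vision encoder from CLIP produces image embeddings of the same dimensionality, so the two towers can be compared directly.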

The model was trained in a distributed fashion on 16 Habana Gaudi accelerators with mixed precision, in two phases (the COCO dataset for phase 1 and Google Conceptual Captions for phase 2). The training pipeline was built with PyTorch, PyTorch Lightning, and Distributed Data Parallel.
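A rough sketch of how such a run can be configured in PyTorch Lightning is shown below. This assumes a Lightning version with Habana HPU support; the actual configuration in the repo may differ:

import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="hpu",   # Habana Gaudi devices (requires Lightning's HPU support)
    devices=8,           # 8 Gaudi accelerators per node
    num_nodes=2,         # two DL1 instances -> 16 accelerators in total
    precision=16,        # mixed precision
    max_epochs=100,      # phase 1 epoch count
)
# Lightning picks a DDP-based parallel strategy for multi-device HPU runs.
# trainer.fit(model, datamodule=data)  # model/data come from the training script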


Three datasets were used to build the model: COCO Captions for training phase 1, Google Conceptual Captions for training phase 2, and the Unsplash dataset for testing and inference.

COCO Captions

COCO (Common Objects in Context) is a large-scale object detection, segmentation, and captioning dataset. The COCO Captions dataset has around 85,000 images with paired captions.

Download the dataset (images and caption annotations) into datasets/coco/, the train script's default dataset dir.
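The repo's own download command is not reproduced here; as a rough equivalent, the public COCO 2014 archives can be fetched like this (the target paths are my assumptions):

import os
import urllib.request
import zipfile

os.makedirs("datasets/coco", exist_ok=True)
for url in [
    "http://images.cocodataset.org/zips/train2014.zip",
    "http://images.cocodataset.org/annotations/annotations_trainval2014.zip",
]:
    archive = os.path.join("datasets/coco", os.path.basename(url))
    urllib.request.urlretrieve(url, archive)   # download the archive
    with zipfile.ZipFile(archive) as zf:
        zf.extractall("datasets/coco")         # unpack images/annotations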


This dataset was used for the first pre-training phase.

Google Conceptual Captions

Conceptual Captions is a dataset consisting of ~3.3 million images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Caption images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles.

Download the dataset's URLs/captions from here and save them to datasets/googlecc/googlecc.tsv. The full dataset has over 3 million images, but you can select a subset by loading the googlecc.tsv file and saving only the number of rows you want (I used 1 million images for training).
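For example, a subset can be selected with a few lines of pandas, assuming the standard Conceptual Captions TSV layout (caption and URL columns, tab-separated, no header):

import pandas as pd

# Load the full caption/URL list and keep only the first 1,000,000 rows
df = pd.read_csv("datasets/googlecc/googlecc.tsv", sep="\t", header=None,
                 names=["caption", "url"])
df.head(1_000_000).to_csv("datasets/googlecc/googlecc.tsv", sep="\t",
                          header=False, index=False)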

Then run the following commands to download each image in the googlecc.tsv file:

npm install
node download_build_googlecc.js

This dataset was used for the second pre-training phase.


Unsplash

This dataset was used as the test set during inference.

Run the download script with python3.8 to download the dataset.


Training phase 1

Training phase 2


Create two Habana instances (AWS EC2 DL1) using the Habana® Deep Learning Base AMI (Ubuntu 20.04)

Create the PyTorch Docker container by running:

docker run --name pytorch -td --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host

Enter the Docker container by running:

docker exec -it pytorch /bin/bash

Set up password-less SSH between all connected servers:

  1. Configure password-less ssh between all nodes: Do the following in all the nodes' docker sessions:

    mkdir ~/.ssh
    cd ~/.ssh
    ssh-keygen -t rsa -b 4096

    Copy the public key (id_rsa.pub) from every node's container to every other node's container's ~/.ssh/authorized_keys (all public keys need to be in all hosts' authorized_keys):

    cat id_rsa.pub > authorized_keys
    vi authorized_keys

    Paste all hosts' public keys into every host's authorized_keys file.

  2. On each system, add all hosts (including itself) to known_hosts; replace $IP1 and $IP2 below with your nodes' local IP addresses:

    ssh-keyscan -p 3022 -H $IP1 >> ~/.ssh/known_hosts
    ssh-keyscan -p 3022 -H $IP2 >> ~/.ssh/known_hosts
  3. Change Docker SSH port to 3022

    sed -i 's/#Port 22/Port 3022/g' /etc/ssh/sshd_config
    sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
    service ssh restart

Allow all TCP traffic between the nodes on AWS

Clone the git repo:

git clone

Create environment:

python3.8 -m venv .env

Activate environment:

source .env/bin/activate

Install requirements:

python3.8 -m pip install -r requirements.txt

Training params

Learning rate: 1e-3

Batch size: 64

Phase 1 - Epochs: 100

Phase 2 - Epochs: 15

Train script arguments

--dataset-num-workers       Number of workers (default: 8)
--dataset-type      Dataset type (coco or googlecc) (default: coco)
--dataset-dir       Dataset dir (default: ./datasets/coco/)
--dataset-subset-size       Load only a subset of the dataset (useful for debugging)
--dataset-train-split       Dataset train split (default: 0.8)
--train-device      Type of device to use (default: hpu)
--distributed-num-nodes         Number of nodes (machines) (default: 2)
--distributed-parallel-devices      Number of parallel devices per node (default: 8)
--distributed-master-address        Master node IP address
--distributed-master-port       Master node port (default: 12345)
--distributed-bucket-cap-mb         DDP bucket cap MB (default: 200)
--checkpoint-dir        Model checkpoint dir (default: ./models)
--checkpoint-save-every-n       Save every n epochs (default: 1)
--checkpoint-load-vision-path       Load vision encoder checkpoint
--checkpoint-load-text-path         Load text encoder checkpoint
--model-visual-name         Which visual model to use (default: RN50x4)
--model-textual-name        Which textual model to use (default: xlm-roberta-base)
--hyperparam-num-layers         Number of projection layers (default: 3)
--hyperparam-lr         Model learning rate (default: 0.001)
--hyperparam-epochs         Max epochs (default: 100)
--hyperparam-precision      Precision (default: 16)
--hyperparam-batch-size         Batch size (default: 64)
--wandb-project         W&B project name (default: clip)
--wandb-enabled         Enable W&B logging (default: True)

Habana Gaudi - 8 accelerators

Phase 1 training
python3.8 --train-device hpu --distributed-parallel-devices 8 --distributed-num-nodes 1
Phase 2 training
python3.8 --train-device hpu --distributed-parallel-devices 8 --distributed-num-nodes 1 --hyperparam-epochs 15 --checkpoint-load-text-path /home/models/text-last.ckpt --checkpoint-load-vision-path /home/models/vision-last.ckpt --checkpoint-dir ./models_phase2

Habana Gaudi - 16 accelerators (multi-server training)

Change the master IP address based on your instances (use local IP, not public IP).

Phase 1 training
NODE_RANK=0 python3.8 --distributed-master-address --train-device hpu --distributed-parallel-devices 8 --distributed-num-nodes 2
NODE_RANK=1 python3.8 --distributed-master-address --train-device hpu --distributed-parallel-devices 8 --distributed-num-nodes 2
Phase 2 training
NODE_RANK=0 python3.8 --distributed-master-address --train-device hpu --distributed-parallel-devices 8 --distributed-num-nodes 2 --hyperparam-epochs 15 --checkpoint-load-text-path /home/models/text-last.ckpt --checkpoint-load-vision-path /home/models/vision-last.ckpt --checkpoint-dir ./models_phase2
NODE_RANK=1 python3.8 --distributed-master-address --train-device hpu --distributed-parallel-devices 8 --distributed-num-nodes 2 --hyperparam-epochs 15 --checkpoint-load-text-path /home/models/text-last.ckpt --checkpoint-load-vision-path /home/models/vision-last.ckpt --checkpoint-dir ./models_phase2

Other devices

If you don't have access to a Habana Gaudi accelerator yet, you can also train on CPU/GPU, although it will be way slower.

To train on CPU, just pass --train-device=cpu and on GPU --train-device=cuda to the script.

Evaluation metrics and results


Demo app

You can test Multilingual CLIP semantic search on this Hugging Face spaces app I have created.


The pre-trained Multilingual CLIP model was made available on the Hugging Face Hub.

Loading pre-trained model from the Hugging Face Hub

from models import create_and_load_from_hub

model = create_and_load_from_hub()

Loading model from local checkpoint

from models import MultiLingualCLIP, load_model

text_checkpoint_path = '/path/to/text model checkpoint'
vision_checkpoint_path = '/path/to/vision model checkpoint'

model = MultiLingualCLIP(num_layers=3)
load_model(model, vision_checkpoint_path, text_checkpoint_path)

Generate embeddings

Run the following (after downloading the Unsplash dataset):

python3.8 ./
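The embedding-generation script is part of the repo; as an illustration of what this step does, a simplified version could look like the following. The preprocessing constants are CLIP's published values, the image directory is a placeholder, and encode_image is a hypothetical method name on the model loaded as shown above:

import glob
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

# CLIP-style preprocessing; RN50x4 uses 288x288 inputs
preprocess = transforms.Compose([
    transforms.Resize(288),
    transforms.CenterCrop(288),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])

embeddings = []
for path in sorted(glob.glob("datasets/unsplash/*.jpg")):   # hypothetical image dir
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        emb = model.encode_image(image)                     # hypothetical encoder call
    embeddings.append(emb.squeeze(0).cpu().numpy())

np.save("images_embeddings.npy", np.stack(embeddings))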

Searching images

import numpy as np
from search import MultiLingualSearch

images_embeddings = np.load('/path/to/images_embeddings')
images_data = [...] # List of image info for each row of the embeddings. For instance, it could be a list of urls, filepaths, ids. They will be returned when calling the search function
semantic_search = MultiLingualSearch(model, images_embeddings, images_data)

results ='विद्यालय में')  # means "at school"
[{"image": "",
  "prob": 0.2461608648300171},
 {"image": "",
  "prob": 0.16881239414215088},
 {"image": "",
  "prob": 0.14744874835014343},
 {"image": "",
  "prob": 0.095176100730896},
 {"image": "",
  "prob": 0.05218643322587013}]

Challenges we ran into

One of the most challenging aspects was adapting CLIP to run on Habana with PyTorch Lightning using Distributed Data Parallel across multiple servers.

It took a lot of trial and error to make everything work, including:

  • Permuting layer params to the RSCK layout used by Habana (see the sketch after this list)
  • Configuring servers to allow SSH access between all nodes on a custom SSH port (for distributed training)
  • Adapting the model for mixed precision
  • Setting up the correct environment for multi-server training
  • Adapting custom checkpointing to save only on the first rank and to permute params back before saving
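As an example of the first point, convolution filters have to be permuted from PyTorch's default KCRS layout to the RSCK layout that Habana expects. The helper below is a minimal sketch modeled loosely on Habana's reference examples; the function name is mine, not the repo's:

import torch
import torch.nn as nn

def permute_params_to_rsck(model: nn.Module) -> None:
    """Permute Conv2d filters from KCRS (out, in, kH, kW) to RSCK (kH, kW, in, out)."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, nn.Conv2d):
                module.weight.data = module.weight.data.permute(2, 3, 1, 0)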

Accomplishments that we're proud of

  • Ported CLIP to Habana accelerators
  • Integrated CLIP training with PyTorch Lightning and Habana
  • Trained the model on single and multiple servers using distributed training
  • Used a variety of datasets for training (Google CC and COCO)
  • Built a custom dataset for testing (Unsplash)

What we learned

  • Habana training
  • Migrating models to Habana
  • Distributed training across single and multiple servers

What's next for Multilingual CLIP - Semantic Image Search in 100 languages

  • Improve performance by tuning hyperparams (batch size, epochs, learning rate, dropout)
  • Improve performance by adding more data
  • Experiment with changing the number of projection layers

Built With

  • distributed-training
  • gaudi
  • habana
  • pytorch
  • pytorch-lightning