Team Git'em HEB Challenge

Inspiration

Online grocery searches often fail to capture user intent — for instance, a query like “healthy snacks for kids” might return chips or soda. This inspired us to build a semantic ranking system that truly understands meaning, enabling smarter product discovery for H-E-B’s customers.

What it does

Our model takes a customer query and ranks products by semantic relevance rather than keyword overlap. It uses Sentence Transformers to generate embeddings for both queries and product descriptions, then compares them using cosine similarity:

\[ \text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} \]

The model then returns the top-\(k\) products with the highest similarity scores, so results follow the intent of the query rather than its exact wording.
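To make the scoring step concrete, here is a minimal sketch of cosine similarity and top-k selection in plain Python. The vectors are made-up three-dimensional stand-ins (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and the function names and product keys are hypothetical, not taken from our code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query_vec, product_vecs, k):
    """Return (product_id, score) pairs for the k highest-scoring products."""
    scored = [
        (pid, cosine_similarity(query_vec, vec))
        for pid, vec in product_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings, invented for illustration only.
products = {
    "apple_slices": [0.9, 0.1, 0.0],
    "potato_chips": [0.1, 0.9, 0.1],
    "cola": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
ranking = top_k(query, products, k=2)
```

Because the score depends only on the angle between vectors, a long product description and a short query can still match closely if they point in a similar direction in embedding space.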

How we built it

Combined product attributes — title, brand, category, description, and ingredients — into a single text representation.
Created synthetic query–product pairs with normalized relevance scores for training.
Fine-tuned the all-MiniLM-L6-v2 model using CosineSimilarityLoss from the Sentence Transformers library.
Encoded all products into dense vector embeddings.
Performed semantic search using util.semantic_search() to compute top-\(k\) rankings.
Exported results to submission.json for evaluation and leaderboard submission.
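The steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than our exact code: the `product_id` key, the `" | "` separator, and the batch size, epoch count, and warmup steps are placeholder values, and the function names are hypothetical. The Sentence Transformers calls are the library's `InputExample`, `CosineSimilarityLoss`, `fit`, `encode`, and `util.semantic_search`.

```python
def build_product_text(product):
    """Join the catalog fields into one string for the encoder, skipping empty ones."""
    fields = ("title", "brand", "category", "description", "ingredients")
    parts = [
        text
        for f in fields
        if (text := str(product.get(f) or "").strip())
    ]
    return " | ".join(parts)  # separator is an assumption


def fine_tune_and_rank(products, train_pairs, queries, top_k=10):
    """Fine-tune all-MiniLM-L6-v2 on (query, product_text, score) triples, then rank.

    Returns {query: [product_id, ...]} ordered by descending cosine similarity.
    """
    # Imported lazily so build_product_text works without the heavy dependencies.
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, SentenceTransformer, losses, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # CosineSimilarityLoss expects float labels; scores are assumed to lie in [0, 1].
    examples = [
        InputExample(texts=[q, p], label=float(s)) for q, p, s in train_pairs
    ]
    loader = DataLoader(examples, shuffle=True, batch_size=16)  # placeholder size
    loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)

    # Encode the whole catalog once, then search it with every query.
    product_ids = [p["product_id"] for p in products]  # hypothetical key name
    corpus = model.encode(
        [build_product_text(p) for p in products], convert_to_tensor=True
    )
    query_emb = model.encode(queries, convert_to_tensor=True)

    # semantic_search returns, per query, a list of {"corpus_id", "score"} dicts.
    hits = util.semantic_search(query_emb, corpus, top_k=top_k)
    return {
        q: [product_ids[h["corpus_id"]] for h in per_query]
        for q, per_query in zip(queries, hits)
    }
```

The returned dictionary would still need to be reshaped into whatever layout submission.json requires, which this sketch does not attempt to reproduce.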

Challenges we ran into

Limited labeled data → we synthesized realistic query–product pairs to train effectively.
Data inconsistency → we cleaned and standardized text fields across the product catalog.
GPU resource limits → we tuned batch size, epochs, and warmup steps to fit within Colab constraints.
Generalization → we worked to keep performance robust across diverse product types and query styles.
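As an example of the kind of cleanup the data inconsistency item refers to, here is a minimal field normalizer using only the standard library. The specific problems it handles (HTML entities, leftover tags, irregular whitespace, mixed Unicode forms) are assumptions about what a product catalog might contain, and `clean_field` is a hypothetical name.

```python
import html
import re
import unicodedata


def clean_field(value):
    """Normalize one catalog text field: decode entities, strip tags, tidy whitespace."""
    if value is None:
        return ""
    text = html.unescape(str(value))            # "&amp;" -> "&", "&nbsp;" -> non-breaking space
    text = re.sub(r"<[^>]+>", " ", text)        # drop leftover HTML tags
    text = unicodedata.normalize("NFKC", text)  # unify width and compatibility forms
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text


cleaned = clean_field("  Organic&nbsp;Apple <b>Slices</b>\n")
```

Running every field through one function like this before building the combined product text keeps the encoder input consistent between training and search.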

Accomplishments that we're proud of

Built a fully functional semantic retrieval pipeline from scratch.
Achieved strong relevance alignment between user queries and product content.
Demonstrated an end-to-end NLP solution combining data preprocessing, model fine-tuning, and ranking generation.
Significantly improved the search experience for complex, intent-driven queries.

What we learned

Deep understanding of transformer-based semantic search and its applications in retail.
Practical skills in embedding models, vector similarity, and ranking optimization.
The importance of data quality and domain-specific fine-tuning for real-world performance.
Exposure to challenges in balancing accuracy, scalability, and compute efficiency.

What's next

Deploy as a live semantic search API for interactive product retrieval.
Experiment with cross-encoder re-ranking for improved precision.
Integrate image embeddings for multimodal (text + image) retrieval.
Explore reinforcement learning from user clicks to further refine relevance scoring.
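For the cross-encoder idea, one possible shape is a second stage that re-scores only the candidates returned by the embedding search. The sketch below is not something we have built: `rerank` and its parameters are hypothetical, and the scorer is passed in as a callable so the function does not depend on any particular model.

```python
def rerank(query, candidates, score_pairs, top_k=10):
    """Re-order first-stage candidates using a pairwise relevance scorer.

    candidates: list of (product_id, product_text) from the embedding search.
    score_pairs: callable taking a list of (query, text) pairs and returning
                 one score per pair, where higher means more relevant.
    """
    if not candidates:
        return []
    pairs = [(query, text) for _, text in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(
        zip(candidates, scores), key=lambda item: item[1], reverse=True
    )
    return [(pid, float(score)) for (pid, _), score in ranked[:top_k]]
```

With Sentence Transformers, the `predict` method of a `CrossEncoder` could be passed as `score_pairs`, since it accepts a list of text pairs and returns one score per pair. Which checkpoint to load is left open here.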