Search4Cure.AI-Diabetes: AI-Powered Research Assistant

🧠 Inspiration

Diabetes is one of the world’s most prevalent and complex chronic diseases. Despite vast amounts of research, discovering new insights remains time-consuming. I was inspired by the idea of building an AI system that bridges the gaps between text, tabular data, and images, and helps researchers find connections faster.

🚀 What it does

Search4Cure.AI-Diabetes is an AI-powered research assistant that enables scientists and clinicians to:

  • 🔍 Search across arXiv, PubMed, CSV files, and images for diabetes-related insights.
  • 🧠 Perform multimodal similarity search using both text and images.
  • 🤖 Use agentic workflows to extract structured answers and recommendations with humans in the loop.
  • 📊 Leverage LLMs, CLIP embeddings, and vector databases to surface meaningful connections in large datasets.
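
As a rough illustration of the similarity search described above, the sketch below ranks stored vectors by cosine similarity against a query vector. It assumes text and images have already been embedded into one shared space, which is what CLIP is designed to provide. The tiny 3-dimensional vectors, the item names, and the `rank_by_cosine` helper are invented for illustration and are not taken from the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_by_cosine(query, items, top_k=3):
    """Return the top_k (name, score) pairs, best match first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real CLIP vectors would be 512-dimensional.
items = {
    "retina_scan.png": [0.9, 0.1, 0.0],
    "insulin_abstract.txt": [0.1, 0.9, 0.1],
    "glucose_chart.png": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]  # stands in for an embedded text query

top = rank_by_cosine(query, items, top_k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

Because the query and the stored items live in the same vector space, a text query can retrieve images and the other way round with the same ranking function.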

🛠️ How I built it

  • Frontend: Built with Streamlit for an intuitive UI allowing query entry, file uploads, and result visualization.

  • Data Sources:

    • Scientific articles from arXiv and PubMed
    • Tabular data via CSV uploads
    • Medical images for embedding-based search
  • Embeddings:

    • sbert_text_embedding (384-dim): Semantic search across documents
    • clip_text_embedding and clip_image_embedding (512-dim): For multimodal alignment
    • gemini_embedding (256-dim): For CSV records and structured text
  • Storage & Retrieval:

    • 🧠 MongoDB Atlas Vector Search for fast, scalable similarity lookups
    • ☁️ Google Cloud Storage (GCS) for storing and retrieving uploaded images
  • Agentic Reasoning:

    • Used modular LLM workflows to simulate human-in-the-loop interactions and structured reasoning chains.
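
To make the retrieval step more concrete, here is a minimal sketch of how a query against one of the embedding fields listed above could be assembled for MongoDB Atlas Vector Search. The field names and dimensions come from the list above; the index name `vector_index`, the projected `title` field, the `numCandidates` multiplier, and the `build_vector_search_pipeline` helper are assumptions made for illustration, not the project's actual code.

```python
# Embedding fields and their dimensions, as listed in this write-up.
EMBEDDING_DIMS = {
    "sbert_text_embedding": 384,
    "clip_text_embedding": 512,
    "clip_image_embedding": 512,
    "gemini_embedding": 256,
}


def build_vector_search_pipeline(field, query_vector, limit=5, index_name="vector_index"):
    """Build an aggregation pipeline with a $vectorSearch stage.

    index_name is a hypothetical Atlas index name; adjust it to the real one.
    """
    if field not in EMBEDDING_DIMS:
        raise KeyError(f"unknown embedding field: {field}")
    expected = EMBEDDING_DIMS[field]
    if len(query_vector) != expected:
        # Guard against mixing embedding models with different dimensions.
        raise ValueError(f"{field} expects {expected} dims, got {len(query_vector)}")
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": field,
                "queryVector": list(query_vector),
                "numCandidates": limit * 20,  # assumed oversampling factor
                "limit": limit,
            }
        },
        # "title" is an assumed document field, shown only as an example.
        {"$project": {"_id": 0, "title": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]


pipeline = build_vector_search_pipeline("sbert_text_embedding", [0.0] * 384, limit=3)
print(pipeline[0]["$vectorSearch"]["path"])
```

With PyMongo, a pipeline like this would be passed to `collection.aggregate(pipeline)`. Checking the vector length before querying is one simple way to catch a query embedded with the wrong model.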

⚠️ Challenges I ran into

  • Ensuring dimensional compatibility across multimodal embeddings (text vs image).
  • Handling varied formats from PubMed XML, arXiv JSON, and user-uploaded CSVs.
  • Implementing efficient chunking and deduplication to reduce noise in search results.
  • Managing API quotas, rate limits, and latency.

🏆 Accomplishments That I'm Proud Of

  • 🛠️ Successfully integrating multiple modalities (text, tabular data, images) into a unified semantic search framework
  • 🤖 Designing a working agentic LLM pipeline to simulate human reasoning in research workflows
  • 🌐 Deploying a full-stack app with real-time similarity search using MongoDB Atlas Vector Search and Google Cloud Storage
  • 💡 Building a tool that can potentially accelerate biomedical research and generate actionable insights

📚 What I Learned

  • 🚀 How to build scalable, multimodal embedding pipelines combining SBERT, CLIP, and Gemini
  • 🧠 Designing agentic AI workflows that simulate scientific exploration with LLMs
  • ☁️ Integrating and deploying services using MongoDB Atlas, GCS, and Streamlit
  • 🧵 Combining usability, performance, and AI reasoning in a single end-to-end application

🔮 What's Next for Search4Cure.AI-Diabetes

  • 🧬 Generalize the assistant to support other diseases such as cancer, Alzheimer’s, and rare genetic disorders
  • 🤝 Enable collaborative research with shared workspaces, annotations, and versioning
  • 📈 Enhance explainability and transparency of AI-generated insights for trust and adoption
  • 🌍 Integrate additional data sources like clinical trials, patents, and medical guidelines
