Inspiration

The deep ocean remains one of the least explored ecosystems on Earth, holding countless unknown species and untapped biological insights. Traditional taxonomy and DNA barcoding rely heavily on incomplete reference databases, which limits discovery and slows ecological understanding. BioLine was born from our desire to create an intelligent, efficient, alignment-free system for analyzing environmental DNA (eDNA), so that scientists can uncover biodiversity even in poorly characterized regions like the deep sea.

What it does

BioLine is an AI-driven bioinformatics pipeline that processes raw eDNA reads (from water or sediment samples) to:

Classify eukaryotic taxa with minimal dependence on reference databases.

Annotate and cluster novel or unclassified sequences using deep learning and unsupervised techniques.

Estimate relative abundance and ecological patterns directly from raw reads.

Reduce compute time relative to traditional BLAST-based or alignment-heavy workflows.

In essence, BioLine bridges the gap between deep-sea genetic data and meaningful biodiversity insights.
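As a rough illustration of the abundance step, relative abundance can be approximated as the fraction of reads that fall into each cluster. This is a minimal sketch, not BioLine's actual code: the function name `relative_abundance` and the example labels are hypothetical, and it assumes cluster labels follow HDBSCAN's convention of marking unassigned reads with `-1`. Read fractions are only a proxy for true organism abundance.

```python
from collections import Counter

NOISE_LABEL = -1  # HDBSCAN's convention for reads not assigned to any cluster


def relative_abundance(cluster_labels):
    """Return {cluster_label: fraction of clustered reads}, ignoring noise reads."""
    counts = Counter(label for label in cluster_labels if label != NOISE_LABEL)
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing was clustered, so no abundance can be reported
    return {label: n / total for label, n in sorted(counts.items())}


# Hypothetical labels for eight reads: two clusters plus two noise reads.
labels = [0, 0, 0, 1, 1, -1, 0, -1]
abundance = relative_abundance(labels)
for label, fraction in abundance.items():
    print(f"cluster {label}: {fraction:.2%}")
```

Noise reads are dropped before normalizing here; reporting them as a separate "unclustered" fraction would be an equally reasonable choice.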

How we built it

We designed a hybrid AI + bioinformatics pipeline using:

Biopython for parsing and preprocessing raw FASTA/FASTQ eDNA reads.

DNABERT (via Hugging Face) to generate contextual embeddings of DNA sequences, so that reads with similar sequence content land close together in embedding space.

UMAP for dimensionality reduction and for visualizing structure in the embeddings.

HDBSCAN for unsupervised clustering, grouping similar sequences into putative taxa even when no reference exists.

BLAST integration for rapid, optional identification when reference hits exist.

Python, Pandas, and Matplotlib for analytics and visualization.

This architecture lets BioLine handle both sequences with reference matches and completely novel ones.
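One preprocessing detail worth sketching: the original DNABERT models take reads as space-separated overlapping k-mers rather than raw strings. The helper below is a minimal sketch under that assumption (k = 6 is one of the published DNABERT variants; the names `to_kmer_sentence` and `is_unambiguous` are hypothetical and not taken from BioLine). Newer models such as DNABERT-2 use a different tokenizer and would not need this step.

```python
VALID_BASES = set("ACGT")


def is_unambiguous(sequence):
    """True if the read is non-empty and contains only A, C, G, T."""
    return bool(sequence) and set(sequence.upper()) <= VALID_BASES


def to_kmer_sentence(sequence, k=6):
    """Convert a DNA read into space-separated overlapping k-mers."""
    seq = sequence.upper()
    if len(seq) < k:
        return ""  # read is too short to yield a single k-mer
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


example = to_kmer_sentence("ATGGCTA", k=3)
print(example)  # -> ATG TGG GGC GCT CTA
```

The resulting string is what would be passed to the model's tokenizer; reads failing `is_unambiguous` could be dropped or handled separately before embedding.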

Challenges we ran into

Handling massive sequencing datasets with limited compute resources.

Ensuring accuracy without heavy reliance on curated databases like SILVA or PR2.

Designing an alignment-free workflow that remains biologically meaningful.

Optimizing UMAP and HDBSCAN parameters to balance speed and precision.
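One common way to cope with large read files on limited hardware is to stream reads in fixed-size batches instead of loading everything into memory. The sketch below illustrates that idea with the standard library only; it is a stand-in for a Biopython-based reader, not BioLine's implementation. It assumes simple four-line FASTQ records, and the names `read_fastq` and `in_batches` are hypothetical.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    lines = (line.strip() for line in handle)
    for header in lines:
        if not header:
            continue  # skip blank lines between or after records
        sequence = next(lines, "")
        next(lines, "")  # '+' separator line, ignored
        quality = next(lines, "")
        read_id = header[1:].partition(" ")[0]  # drop '@' and any description
        yield read_id, sequence, quality


def in_batches(records, batch_size):
    """Group an iterator of records into lists of at most batch_size items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Tiny in-memory example; a real run would pass an open file handle instead.
example = io.StringIO(
    "@read1 sample=A\nACGTAC\n+\nIIIIII\n"
    "@read2\nGGCATT\n+\nIIIIII\n"
    "@read3\nTTAGCA\n+\nIIIIII\n"
)
batches = list(in_batches(read_fastq(example), batch_size=2))
print([len(batch) for batch in batches])  # -> [2, 1]
```

Because both functions are generators, only one batch of reads is held in memory at a time, so each batch can be embedded and written out before the next is read.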

Accomplishments that we're proud of

Built a working AI-driven eDNA pipeline capable of unsupervised discovery.

Demonstrated that DNABERT embeddings can reveal taxonomic relationships without alignment.

Reduced computational time compared to traditional reference-based pipelines.

Contributed a potential tool for deep-sea biodiversity monitoring and conservation.

What we learned

How deep learning models like DNABERT represent DNA in context, and what that could mean for genomics.

The power of unsupervised learning (UMAP + HDBSCAN) in discovering new biological patterns.

The importance of designing scalable, database-independent pipelines for real-world biological data.

Effective team collaboration across AI, biology, and data science domains.

What's next for BioLine

Integrate metagenomic functional annotation to infer ecological roles of unknown taxa.

Deploy BioLine as a web-based platform for oceanographic researchers and labs.

Expand support for multi-marker analysis (e.g., 18S rRNA, COI, 16S).

Leverage GPU-accelerated deep learning for even faster embeddings and clustering.

Partner with marine research institutions to validate BioLine on real deep-sea datasets.
