Inspiration

I used to work at a hiring platform, and we built something similar. LLMs make it a lot easier to hack something together in a weekend.

What it does

I scrape job postings directly from company ATS pages, use an LLM to extract structured metadata and produce embeddings, and then index them in a search engine.

How we built it

  • Search common crawl to find companies' ATS job boards. (I limited myself to Greenhouse, Lever, and Ashby, but I have a list of about a dozen others I want to add.)
  • Scrape job postings from each page or use the API if the ATS provides one
  • Pass the job posting to Mistral with a custom prompt to extract structured metadata (things like skills, experience required, salary, etc. I have about 2 dozen fields currently).
  • The previous step gets expensive fast, and you need a big model to get decent results. So I ran a training set of 1000 jobs through it, then fine-tuned a Mistral 7B model on its outputs to use for the rest of the postings.
  • Embed the job descriptions using the Mistral API
  • Index the job description data, structured metadata, and embeddings in Meilisearch, a full-text search engine that also supports hybrid vector search.

Challenges we ran into

  • Meilisearch's vector search seems to not be ready for production uses. This is the main I wanted to test, so it was nice to at least find this out.
  • Some ATS's don't have APIs and/or have low rate limits

Accomplishments that we're proud of

The fine-tuned model gives impressive results, on par with those of the mistral-large model (based on my 👀 test, so take the with a grain of salt). And it's a fraction of the price, making it suitable to run nightly on updated postings.

What we learned

Start small and get something working before boiling the ocean

What's next for Dope Jobs

¯_(ツ)_/¯ Probably build a UI, actually use the application DB and schedule nightly updates, and clean up my code. Also add data from more ATS's.

Built With

Share this project:

Updates