Inspiration

I noticed many non-profit organizations don't have an AI chatbot on their websites. I think this is because it's expensive for a non-profit to hire an AI developer to build a chatbot for them, or to upskill their in-house or volunteer web developers on AI. I decided to build Donatobot as an AI chatbot that can be used for any non-profit, allowing them more exposure to users who prefer interacting or conversing with a website through questions in natural language, rather than just searching for and reading static text.

What it does

If you look at the code for Donatobot, you see there's no data from the non-profit hard-coded anywhere, yet with the sample deployment you can ask questions about the non-profit and get relevant answers fast from the site's data! You can even select a topic to ask questions about, and the topics you can choose from come straight from the non-profit's website.

How I built it

How did I do this? Well, you need to configure 3 things first: (1) the non-profit's website URL, (2) the tag of the website's navbar (for example div or nav), and (3) the class of that tag. Using these 3 configs, I built a web scraper that reads the site's navbar and collects the pages linked from its menus and submenus.
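
To make the three configs concrete, here is a minimal sketch of navbar link extraction. It is not the Donatobot code: the project uses Beautiful Soup, while this sketch swaps in Python's standard-library html.parser so it runs without dependencies. The class name NavLinkParser, the function scrape_nav_links, the sample HTML, and the example.org URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NavLinkParser(HTMLParser):
    """Collect (link text, absolute URL) pairs found inside one navbar element."""

    def __init__(self, base_url, nav_tag, nav_class):
        super().__init__()
        self.base_url = base_url    # config 1: the site URL
        self.nav_tag = nav_tag      # config 2: the navbar tag, e.g. "nav" or "div"
        self.nav_class = nav_class  # config 3: a class on that tag
        self.depth = 0              # > 0 while we are inside the navbar element
        self.current_href = None
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth > 0:
            if tag == self.nav_tag:
                self.depth += 1  # nested element with the same tag name
            if tag == "a" and attrs.get("href"):
                self.current_href = urljoin(self.base_url, attrs["href"])
                self.current_text = []
        elif tag == self.nav_tag and self.nav_class in (attrs.get("class") or "").split():
            self.depth = 1  # entered the configured navbar

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if self.depth > 0:
            if tag == "a" and self.current_href is not None:
                text = " ".join("".join(self.current_text).split())
                self.links.append((text, self.current_href))
                self.current_href = None
            if tag == self.nav_tag:
                self.depth -= 1


def scrape_nav_links(html, base_url, nav_tag, nav_class):
    parser = NavLinkParser(base_url, nav_tag, nav_class)
    parser.feed(html)
    return parser.links


# Made-up page standing in for a non-profit's home page.
SAMPLE_HTML = """
<html><body>
<a href="/ignored">Outside link</a>
<nav class="main-menu primary">
  <ul>
    <li><a href="/about">About Us</a></li>
    <li><a href="/donate/clothing">Donate <b>Clothing</b></a></li>
  </ul>
</nav>
<a href="/footer">Footer</a>
</body></html>
"""

links = scrape_nav_links(SAMPLE_HTML, "https://example.org", "nav", "main-menu")
for text, url in links:
    print(text, "->", url)
```

Only links inside the configured navbar element are collected, so the link text can double as the topic name and the URL as the page to scrape next.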

The pages are then partitioned (split) into semantic elements, and the elements are chunked. These chunks of text are then encoded into vectors (arrays of floats) using an OpenAI embedding model. Finally, the vectors are loaded into a vector store, Pinecone, along with the chunk of text each vector represents, plus metadata for each chunk such as its topic. The topics are simply the menus parsed from the navbar, and as mentioned, users can optionally select a specific topic to ask about, or just ask about the non-profit in general.
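
The chunk-and-load step could look roughly like the sketch below. It rests on several assumptions: the elements are plain strings (the project partitions pages with unstructured-io), the chunking rule is a simple character budget, and toy_embed is a made-up stand-in for the OpenAI embedding call so the example runs offline. The record layout (id, values, metadata) follows the general shape Pinecone accepts for upserts, but the field names inside metadata are hypothetical.

```python
def chunk_elements(elements, max_chars=70):
    """Merge consecutive elements into chunks of at most max_chars characters.

    An element that is longer than max_chars on its own is kept whole.
    """
    chunks, current = [], ""
    for element in elements:
        candidate = f"{current}\n{element}" if current else element
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = element
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def toy_embed(text, dims=8):
    """Stand-in for a real embedding model: a tiny bag-of-words hash vector."""
    vector = [0.0] * dims
    for word in text.lower().split():
        vector[sum(ord(c) for c in word) % dims] += 1.0
    return vector


def build_records(topic, page_url, chunks, embed):
    """Pair each chunk's vector with its text and topic metadata."""
    return [
        {
            "id": f"{topic}-{i}",
            "values": embed(chunk),
            "metadata": {"topic": topic, "source": page_url, "text": chunk},
        }
        for i, chunk in enumerate(chunks)
    ]


# Made-up elements standing in for one partitioned page.
elements = [
    "We accept clothing and furniture.",
    "Drop-off hours are 9 to 5.",
    "Call ahead for large items.",
]
chunks = chunk_elements(elements, max_chars=70)
records = build_records("Donate", "https://example.org/donate", chunks, toy_embed)
for record in records:
    print(record["id"], repr(record["metadata"]["text"]))
```

Storing the chunk text in the metadata is what lets the chatbot later hand readable context to the language model, since the vector alone cannot be turned back into text.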

When a user asks a question in the Gradio conversational UI, the chatbot encodes the question as an embedding and uses it to retrieve the most relevant chunks from the Pinecone vector store. It then sends the question to OpenAI along with the text of those chunks as context. This is called RAG, retrieval-augmented generation. The question is matched to the relevant data by highest cosine similarity between the question's vector and the vectors in the Pinecone index, filtered on the topic in the vector metadata if the user selected a topic.
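
For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their lengths. The sketch below does the retrieval step in memory so the ranking and the topic filter are easy to see. In Donatobot this matching happens inside Pinecone. The three-dimensional vectors, the record contents, and the function names retrieve and build_prompt are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vector, records, top_k=2, topic=None):
    """Return the top_k records most similar to the query, optionally within one topic."""
    candidates = [
        r for r in records if topic is None or r["metadata"]["topic"] == topic
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query_vector, r["values"]),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, retrieved):
    """Combine the retrieved chunk text with the question for the language model."""
    context = "\n\n".join(r["metadata"]["text"] for r in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hand-written toy vectors; real embeddings have far more dimensions.
records = [
    {"id": "about-0", "values": [1.0, 0.0, 0.0],
     "metadata": {"topic": "About", "text": "Founded in 1990."}},
    {"id": "donate-0", "values": [0.0, 1.0, 0.0],
     "metadata": {"topic": "Donate", "text": "We accept clothing."}},
    {"id": "donate-1", "values": [0.6, 0.8, 0.0],
     "metadata": {"topic": "Donate", "text": "Drop-off hours are 9 to 5."}},
]

query_vector = [0.9, 0.1, 0.0]
general_hits = retrieve(query_vector, records, top_k=2)
donate_hits = retrieve(query_vector, records, top_k=1, topic="Donate")
prompt = build_prompt("When can I drop off items?", donate_hits)
print(prompt)
```

With no topic selected, the search ranks every record. With a topic selected, records from other topics are excluded before ranking, so a strong match on an unrelated page cannot crowd out the answer.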

This approach is totally dynamic and doesn't require hard-coding anything from the non-profit's website. Only the 3 configs (a URL, a tag, and a class) are needed. Whenever the non-profit's website data changes (for example the address or the types of donations accepted), the scraping and loading process can be repeated to keep Donatobot's answers fresh, based on the latest data on the non-profit's site.

What's next for Donatobot

In the next version of Donatobot, I plan to implement an FAQ feature where both questions and answers will be generated in real time based on the non-profit's website data. Again, this will be totally dynamic with nothing hard-coded, making sure that this AI chatbot can continue to be used for any non-profit organization.

Built With

  • beautiful-soup
  • chunking
  • cosine-similarity
  • embedding
  • gradio
  • langchain
  • llm
  • openai
  • pinecone
  • python
  • unstructured-io
  • vector-store