Inspiration

I like to use ChatGPT to ask coding questions because I can use natural language. So I tried using it for questions about companies I invest in or am thinking of investing in. But I found ChatGPT doesn't have the latest info about companies because of its knowledge cutoff date, which is Oct 2023. I noticed they recently added Internet search to the free plan, but only while you're using GPT-4o, and that allowance gets used up fast. Then you're stuck with GPT-4o mini for 3 or more hours, getting old answers from last October, which you'd still have to double-check anyway, as they warn.

So I thought: wouldn't it be really helpful if I built an AI chatbot for this GPT4o: Code & Conquer hackathon that uses the latest financial and corporate data available for all companies that trade on US stock exchanges, of which there are over 10,000? This chatbot, which I named The Smart Statement Agent, would answer your questions based on current-year info, and that info would be trustworthy and reliable, since the bot gets it from the financial statements that companies are required by law to file with the US Securities and Exchange Commission (SEC).

Of course, you can access these financial statements from the SEC directly, at https://www.sec.gov/search-filings. But you need to find the statements you're interested in, then read them or search their contents by keyword to find the answers you're looking for. Why not use The Smart Statement Agent, where you can ask questions in natural language in a chat interface? You can't do that on the SEC website, and you can't do that on Google. The Smart Statement Agent is unique in answering from official financial statements.

What it does

The Smart Statement Agent is an AI chatbot that gives you a dropdown list of all 10,000-plus companies that trade on US stock exchanges. You can type ahead in the dropdown box or select a company directly from the list. Then simply type your question about that company and click the "Ask Question" button.

The Smart Statement Agent stores current-year financial statements from the SEC in a vector database. If it finds the statements for the company you chose, it uses RAG (Retrieval Augmented Generation) with OpenAI to generate your answer from those statements (forms 10-K, 10-Q, 8-K for US firms; 20-F, 6-K for non-US). If the company's statements aren't already in the database, they are retrieved from the SEC in real time, processed for ML, and inserted into the database. This process can take a few minutes, so The Smart Statement Agent will tell you so, and you need to ask your question again later (just click "Ask Question" again).
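To make the lookup described above more concrete, here is a minimal sketch of the retrieval step: filter the stored chunks by company, rank them by cosine similarity to the question's vector, and report a miss when the company has no chunks yet. It is not the project's actual code. The `Chunk` class, the `retrieve` function, the tickers "AAA" and "BBB", and the three-dimensional toy vectors are all assumptions made for illustration, and a plain Python list stands in for the vector database.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    ticker: str          # company metadata used as the filter
    form: str            # filing type, e.g. "10-K"
    text: str            # the text chunk itself
    vector: list[float]  # toy embedding; real embeddings are much longer


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(store, ticker, query_vector, k=3):
    """Return the k chunks for `ticker` closest to the query, or None on a miss."""
    candidates = [c for c in store if c.ticker == ticker]
    if not candidates:
        # Nothing stored for this company yet: the caller would start a load
        # and ask the user to try again in a few minutes.
        return None
    ranked = sorted(
        candidates,
        key=lambda c: cosine_similarity(c.vector, query_vector),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical in-memory store standing in for the vector database.
store = [
    Chunk("AAA", "10-K", "toy chunk: revenue", [0.9, 0.1, 0.0]),
    Chunk("AAA", "10-Q", "toy chunk: risk factors", [0.1, 0.9, 0.0]),
    Chunk("AAA", "8-K", "toy chunk: executive change", [0.5, 0.5, 0.0]),
    Chunk("BBB", "10-K", "toy chunk: other company", [1.0, 0.0, 0.0]),
]

hits = retrieve(store, "AAA", [1.0, 0.0, 0.0], k=2)
miss = retrieve(store, "ZZZ", [1.0, 0.0, 0.0])
```

In a full RAG flow, the text of the returned chunks would be placed into the prompt sent to the language model, and a `None` result would trigger the "give me a few minutes" path.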

This week I loaded the financial statements for all 503 stocks of the S&P 500 (https://en.wikipedia.org/wiki/List_of_S%26P_500_companies) into the vector database, taking up almost 5 GB for the text chunks and corresponding vectors. But The Smart Statement Agent supports over 10,000 companies, and there are many interesting companies not in the S&P 500. So you will likely encounter a company where you need to wait for the chatbot to load its statements. Please be patient when that happens: you're saving the next user interested in that company from waiting :)

How I built it

I built The Smart Statement Agent in Python, my current favorite language, using Gradio for the UI, Beautiful Soup for parsing the financial statement data retrieved from the SEC's EDGAR database and API, NLTK for chunking the parsed text, OpenAI for embedding the text chunks into vectors, and TiDB Vector for storing the chunks and vectors with metadata. For answering questions, I use RAG with LangChain, using metadata to query the vectors from TiDB and passing the closest matches to OpenAI to generate the answer in natural language. Note that TiDB comes with 5 GB of free data storage per cluster, which I almost used up storing the S&P 500 vectors and metadata (text).
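As an illustration of the chunking step in the pipeline above, here is a minimal sketch that splits text into sentences and packs whole sentences into chunks up to a size limit. It is an assumption about how such a step could look, not the project's code: the regular expression is a simple stand-in for a real sentence tokenizer such as NLTK's, and the function names and the `max_chars` parameter are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (stand-in for a proper tokenizer).

    Splits after '.', '!' or '?' when followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences, max_chars=200):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "One two. Three four. Five six."
sentences = split_sentences(sample)
chunks = chunk_sentences(sentences, max_chars=20)
```

Keeping sentences intact means each chunk that gets embedded is a readable unit, at the cost of somewhat uneven chunk sizes.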

I also use FastAPI to implement an internal heartbeat endpoint that keeps the app (chatbot) running so it doesn't time out while a company's financial statements are being retrieved, processed, and loaded into the database. To avoid excessive memory usage, I use a queue and a daemon worker thread for loading the statements. The loading code is pretty robust, as shown by my successful loading of the 2024 statements for the S&P 500 companies so far.
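The queue-and-worker arrangement described above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the project's code: `load_statements` is a hypothetical placeholder for the real fetch, parse, chunk, embed, and insert work, and the `loaded` and `pending` sets stand in for whatever bookkeeping the app actually does.

```python
import queue
import threading

load_queue = queue.Queue()   # tickers waiting to be loaded
pending = set()              # tickers queued or currently loading
loaded = set()               # tickers whose statements are available
lock = threading.Lock()      # guards the two sets


def load_statements(ticker):
    """Hypothetical placeholder for fetch, parse, chunk, embed, and insert."""
    with lock:
        loaded.add(ticker)


def worker():
    """Process one company at a time so only one load is in memory at once."""
    while True:
        ticker = load_queue.get()   # blocks until a ticker is queued
        try:
            load_statements(ticker)
        finally:
            with lock:
                pending.discard(ticker)
            load_queue.task_done()


def request_load(ticker):
    """Queue a load unless the ticker is already loaded or pending.

    Returns True if a new load was queued, False otherwise.
    """
    with lock:
        if ticker in loaded or ticker in pending:
            return False
        pending.add(ticker)
    load_queue.put(ticker)
    return True


# A daemon thread does not keep the process alive when the app shuts down.
threading.Thread(target=worker, daemon=True).start()

first = request_load("AAA")    # queues the load
second = request_load("AAA")   # duplicate request is ignored
load_queue.join()              # wait here only for the demonstration
```

A single worker serializes the loads, which bounds memory use, and the duplicate check means repeated clicks on "Ask Question" do not queue the same company twice.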

What's next for The Smart Statement Agent

In v2 I will enhance The Smart Statement Agent to automatically update the financial statements stored in its vector database when new ones are filed with the SEC. Currently, once the code has added a company's statements, picking up a newer statement means manually deleting that company's statements from the database; the next time that company is selected, the code adds all its statements for the current year again. For the new feature, I'll add a scheduled job that checks for new statements and adds them automatically, as well as an admin page where the admin can request an update for selected companies (like all S&P 500 companies) on demand.
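One way the planned check could work is to compare the filing identifiers already stored for each company against the latest list of filings and load only the difference. The sketch below is purely illustrative of that idea: the `plan_updates` function, the dictionary shapes, and identifiers such as "f1" are assumptions, not part of the current code or of any SEC API.

```python
def plan_updates(stored, latest):
    """Return, per ticker, the filing ids in `latest` that are not in `stored`.

    stored: dict mapping ticker -> set of filing ids already in the database
    latest: dict mapping ticker -> list of filing ids currently available
    Tickers with nothing new are left out of the result.
    """
    updates = {}
    for ticker, filing_ids in latest.items():
        known = stored.get(ticker, set())
        new_ids = [fid for fid in filing_ids if fid not in known]
        if new_ids:
            updates[ticker] = new_ids
    return updates


# Hypothetical identifiers for illustration only.
stored = {"AAA": {"f1", "f2"}, "BBB": {"f9"}}
latest = {"AAA": ["f1", "f2", "f3"], "BBB": ["f9"], "CCC": ["f5"]}
updates = plan_updates(stored, latest)
```

A scheduled job or an admin request could both call the same function and then load only the filings it returns, which avoids deleting and reloading a company's whole year.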

Updates

NOTE: I just found (today, Nov 2nd) that my feature for loading a new company's statements into the vector database doesn't work in a Docker environment. The reason seems to be that the EDGAR API (which is used to retrieve company statements from the SEC) blocks requests from the default Docker IP address. So if you want to use this feature, you need to run my code locally instead of in Docker!

My deployment for this hackathon uses Docker, which is why you will keep getting "Give me a few minutes to get their 2024 financial statements" when you choose a new company. My bad for not testing this in my deployed app earlier; I ASSumed that since it worked locally on my machine (letting me load statements for all S&P 500 stocks), it would work in my deployed Docker container as well :(

However, everything else works in my deployed app, so enjoy :)
