Inspiration
Since moving my social media to distributed Pixelfed and ActivityPub servers, I've been trying to caption all my images to make them more accessible. The @altbot@fuzzies.wtf account provides image captions as a comment on a post, but it stopped working on Pixelfed servers. So I've been captioning my own images with Ollama and the llava:7b model when posting from the desktop. Captioning is harder when I post from my phone, so I built a script that backfills alt text on posts that lack it, applying each generated caption only after I have reviewed it. When I saw Kiro and its spec-driven development, I decided to turn that simple script into a version I can host myself for others to use.
What it does
Vedfolnir is a hosted tool that generates and manages alt text for social media posts on ActivityPub platforms like Pixelfed and Mastodon. It lets users review the AI-generated alt text before it is written back to their posts. The system uses AI (Ollama with the LLaVA vision-language model) to analyze images and generate descriptions, then provides a web interface for human review and approval. Once a description is approved, Vedfolnir updates the original post with the improved alt text.
Key features include:
- AI-powered caption generation with quality assessment and fallback mechanisms.
- Multi-platform support for Pixelfed, Mastodon, and other ActivityPub platforms.
- Comprehensive web interface with real-time progress tracking and batch operations.
- Performance monitoring with Redis session management and MySQL database backend.
- Multiuser support.
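As a rough illustration of the caption-generation step, the sketch below builds a request for a local Ollama server and tidies the returned text. It is not Vedfolnir's actual code: the endpoint URL is Ollama's default local address, while the prompt wording, the character limit, and the function names are assumptions made for this example.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llava:7b"  # the model named in this write-up
MAX_ALT_TEXT_CHARS = 1500  # assumption: adjust to the target platform's limit

DEFAULT_PROMPT = "Describe this image in one or two sentences for use as alt text."


def build_caption_request(image_bytes: bytes, prompt: str = DEFAULT_PROMPT) -> dict:
    """Build the JSON body for a non-streaming Ollama generate call."""
    return {
        "model": MODEL,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def clean_caption(raw: str, limit: int = MAX_ALT_TEXT_CHARS) -> str:
    """Collapse whitespace and trim the caption to the length limit."""
    text = " ".join(raw.split())
    return text[:limit].rstrip()


def generate_caption(image_bytes: bytes) -> str:
    """Send the image to Ollama and return a cleaned caption for human review."""
    body = json.dumps(build_caption_request(image_bytes)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return clean_caption(json.loads(response.read())["response"])
```

The returned caption would still go to the review queue rather than straight to the post, matching the review-then-update flow described above.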
How we built it
I took an existing single-user Python Flask script of 1,500 lines of code and instructed Kiro to create a Spec along the lines of "This is an existing project that I want to make multiuser." Kiro started building the Requirements, Design, and Task documents for the project with the following requirements: "The Vedfolnir is a system designed to enhance accessibility on ActivityPub platforms (Pixelfed and Mastodon) by automatically generating and managing alt text (image descriptions) for posts that lack them. The bot identifies images without alt text, uses AI to generate appropriate descriptions, provides a human review interface, and updates the original posts with approved descriptions. This feature aims to make visual content more accessible to users with visual impairments who rely on screen readers."
Initially, I let Kiro's Specs run as is. That quickly started generating requirements that made the app more professional with good functionality. Testing was hard as Kiro often would write tests and not run them, and of course, the AI-written code needed to be tested. I found that using Playwright scripts with strong instructions in a steering document worked well for testing the web app.
The app now uses a modern, scalable architecture:
- Backend: Python 3.8+ with Flask 2.0+ and the SQLAlchemy ORM for a robust web application framework.
- Database: MySQL/MariaDB with advanced indexing and connection pooling for enterprise performance.
- AI Integration: Ollama with the LLaVA vision-language model for intelligent image analysis and description.
- Session Management: Redis-based sessions with database fallback for high availability.
- Security: Enterprise-grade middleware with comprehensive CSRF protection, input validation, and audit logging.
- Frontend: HTML5, Bootstrap 5, and JavaScript with WebSocket support for real-time updates.
The architecture follows a modular design with organized components in the app/ directory, including core application components, Flask blueprints, service layers, and WebSocket functionality for real-time progress tracking.
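The "Redis-based sessions with database fallback" idea can be sketched as follows. This is a simplified illustration, not the app's implementation: `InMemoryBackend` is a hypothetical stand-in for both Redis and the database table, and `FallbackSessionStore` is a name invented for this example.

```python
from typing import Optional


class InMemoryBackend:
    """Hypothetical stand-in for Redis or a database-backed session table."""

    def __init__(self, available: bool = True) -> None:
        self.available = available
        self._data: dict = {}

    def _check(self) -> None:
        if not self.available:
            raise ConnectionError("backend unavailable")

    def get(self, session_id: str) -> Optional[dict]:
        self._check()
        return self._data.get(session_id)

    def set(self, session_id: str, data: dict) -> None:
        self._check()
        self._data[session_id] = dict(data)


class FallbackSessionStore:
    """Read from the fast primary store, falling back to the durable one."""

    def __init__(self, primary: InMemoryBackend, fallback: InMemoryBackend) -> None:
        self.primary = primary
        self.fallback = fallback

    def save(self, session_id: str, data: dict) -> None:
        # Always write to the durable fallback; the primary write is best effort.
        self.fallback.set(session_id, data)
        try:
            self.primary.set(session_id, data)
        except ConnectionError:
            pass

    def load(self, session_id: str) -> Optional[dict]:
        try:
            found = self.primary.get(session_id)
            if found is not None:
                return found
        except ConnectionError:
            pass  # primary is down, so use the fallback below
        return self.fallback.get(session_id)


# Usage: sessions stay readable even when the primary store goes away.
redis_like = InMemoryBackend()
db_like = InMemoryBackend()
store = FallbackSessionStore(redis_like, db_like)
store.save("s1", {"user_id": 7})
redis_like.available = False
session_after_outage = store.load("s1")
```

Writing to the durable store first is one possible design choice; it trades a slower write for the guarantee that an outage of the fast store never loses a session.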
Challenges we ran into
The biggest challenge was not knowing and defining what technical stack the app would be built on when I started my Spec-based development. Kiro would make good choices on the technical stack that would later lead to problems. For example, SQLite was fine for the database until I started to try to store user session data in it. That led to the transition from SQLite to MySQL/MariaDB and also the addition of Redis caching.
At some point, I implemented app security based on the recommendation of the sample Kiro hook "Security Vulnerability Scanner." The timing was a bad choice: the security work should have waited until the project was near completion. It was difficult to learn CSRF while I was actively making large additions and changes to the code base.
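For readers who have not met CSRF protection before, the core idea is that each form submission must carry a token tied to the user's session. The sketch below shows one generic signed-token pattern using only the standard library. It is an assumption-laden illustration: the write-up does not say which CSRF mechanism the app uses, and the function names and `SECRET_KEY` handling here are hypothetical.

```python
import hashlib
import hmac
import secrets

# Assumption: a per-deployment secret; a real app would load it from config.
SECRET_KEY = secrets.token_bytes(32)


def _sign(session_id: str, nonce: str, key: bytes) -> str:
    """Return an HMAC-SHA256 signature binding the nonce to the session."""
    message = f"{session_id}:{nonce}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def issue_csrf_token(session_id: str, key: bytes = SECRET_KEY) -> str:
    """Create a token to embed in a form rendered for this session."""
    nonce = secrets.token_hex(16)
    return f"{nonce}.{_sign(session_id, nonce, key)}"


def verify_csrf_token(session_id: str, token: str, key: bytes = SECRET_KEY) -> bool:
    """Check that a submitted token was issued for this session."""
    nonce, separator, signature = token.partition(".")
    if not separator:
        return False
    expected = _sign(session_id, nonce, key)
    # Constant-time comparison on bytes avoids timing leaks and type errors.
    return hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8"))
```

A request whose token is missing, malformed, or issued for another session fails verification, which is typically what surfaces as a CSRF error in the browser console.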
Accomplishments that we're proud of
The app is functional and ready to be hosted. I learned a lot about Spec-driven development and will save a lot of time on the next project since I will know what to write into the specs earlier.
What we learned
Debugging the front end by pasting JavaScript console errors (e.g., CSRF failures) into the chat was too time-consuming. I learned to instead use AI-generated Playwright tests with instructions to monitor both the browser console and the logs for errors and to resolve bugs as they were encountered. It was critical to have a strong Playwright steering document.
Simple things would have made testing with Playwright far easier. For example, the Login page had a Login button, but Playwright always tried to click a Submit button. I could have saved a lot of AI time in testing if I had renamed Login to Submit.
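A Playwright check that watches the console while logging in might look roughly like the sketch below. The `/login` route, the field labels, and the function names are assumptions for illustration, not taken from the app. The collector is kept separate from the browser code so it can be exercised without Playwright installed.

```python
from types import SimpleNamespace


class ConsoleErrorCollector:
    """Gather console errors and uncaught page exceptions during a test."""

    def __init__(self) -> None:
        self.errors: list = []

    def on_console(self, msg) -> None:
        # Playwright's ConsoleMessage exposes .type and .text.
        if msg.type == "error":
            self.errors.append(msg.text)

    def on_page_error(self, exc) -> None:
        self.errors.append(str(exc))


def check_login_page(base_url: str, username: str, password: str) -> list:
    """Log in with a real browser and return any console errors seen."""
    # Imported lazily so the collector above works without Playwright.
    from playwright.sync_api import sync_playwright

    collector = ConsoleErrorCollector()
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("console", collector.on_console)
        page.on("pageerror", collector.on_page_error)
        page.goto(f"{base_url}/login")  # hypothetical route
        page.get_by_label("Username").fill(username)  # hypothetical labels
        page.get_by_label("Password").fill(password)
        # Target the button by its visible name rather than assuming "Submit".
        page.get_by_role("button", name="Login").click()
        page.wait_for_load_state("networkidle")
        browser.close()
    return collector.errors


# Offline check of the collector with stand-in messages (no browser needed).
collector = ConsoleErrorCollector()
collector.on_console(SimpleNamespace(type="log", text="page loaded"))
collector.on_console(SimpleNamespace(type="error", text="CSRF token missing"))
```

Selecting the button by role and visible name is one way to sidestep the Login-versus-Submit mismatch without renaming anything in the UI.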
Finally, I learned to review the Kiro Requirements, Design, and Task documents before accepting them. It was important to ensure the Specs would use existing frameworks and code rather than writing new code. Review was also required to make sure tests would be both created and run.
What's next for Alt Text Bot
I need to continue iterating and making sure all functionality works, especially on the administration side. There is a lot of reporting that is mocked up but not fully functional. I need to evaluate the memory use with Ollama running on the same system as the web app. It’s likely that I'll need to move to a third-party AI endpoint. Finally, refactoring is always needed; the code base is up to 280k lines of Python code, a far cry from the 1,500 lines that provided the single-user functionality.
Built With
- apple
- bootstrap
- bootstrap-5
- csrf
- flask
- flask-2.0+
- html5
- llava
- mariadb
- mysql
- ollama
- python
- redis
- sqlalchemy
- websocket