Inspiration

My kids love stories. We read them together all the time, usually with picture books. My kids also like projector-based stories, where a story is read aloud while images are displayed.

Years ago Google had a partnership with Golden Books: you could buy certain books, and Google Home devices would build a whole experience around them, playing sound effects as the story went along. My kids loved that program at the time, but support for it was eventually discontinued and they were left without it. It did inspire me here, though.

What it does

This project acts as a digital storyteller. The user presses a button on a handheld wand to start recording a story idea; the recording is transcribed, run through a prompt, and the device then speaks each sentence, triggers effects matching the mood, and displays images as the story unfolds. It's the combination of multiple elements coming together to build a narrative for the user.

How we built it

I wanted to come up with a unique usage of the gpt-oss-20b LLM. Given how well it adheres to a prompt, I knew I could set it up to follow certain conventions for story creation, emitting mood cues, image prompts, and story text from a brief user idea.
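A system prompt along these lines can steer the model to emit those cues inline. The exact wording below is illustrative, not the prompt used in the project:

```python
# Illustrative system prompt -- not the project's exact wording.
# The [MOOD:...] and [IMAGE:...] tag shapes match the ones parsed later.
SYSTEM_PROMPT = """You are a storyteller for young children.
Write a short story from the user's idea. As you write, embed cues:
  [MOOD:<one of Happy, Calm, Spooky, Exciting>] whenever the tone shifts
  [IMAGE:<a short scene description>] roughly once per paragraph
Keep sentences short so they can be spoken aloud one at a time."""
```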

Initially I built out the various components one at a time to ensure I could get them working well together:

  • A script for faster-whisper, listening on the microphone and then transcribing the recording.
  • Piper TTS, speaking aloud from text.
  • ComfyUI, with the API JSON set up so it can generate an image for a given prompt over its websocket.
  • The Xiao MIDI Synthesizer, playing sounds over serial, plus a test script to drive it from the Jetson using pyserial.
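The first-in-first-out image generation mentioned later can be sketched with a small worker-thread queue. Here `fetch_image` is a stand-in for the real ComfyUI websocket call, returning a fake file path so the queue logic itself can be exercised:

```python
import queue
import threading

def fetch_image(prompt: str) -> str:
    # Stand-in for the real ComfyUI websocket request; it just fabricates
    # a file path so the FIFO ordering below can be demonstrated.
    return f"/tmp/{prompt.replace(' ', '_')}.png"

class ImageQueue:
    """FIFO image generation: prompts go in as the story streams,
    finished images come out in the same order they were submitted."""

    def __init__(self):
        self._prompts = queue.Queue()
        self._results = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        # A single worker drains prompts one at a time, preserving order.
        while True:
            prompt = self._prompts.get()
            self._results.put(fetch_image(prompt))

    def submit(self, prompt: str):
        self._prompts.put(prompt)

    def next_image(self, timeout: float = 30.0) -> str:
        # Blocks until the next image in submission order is ready.
        return self._results.get(timeout=timeout)
```

Because a single worker services the prompt queue, results always come back in submission order, which keeps the displayed image in sync with the story.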

I used PlatformIO for the Xiao, targeting the ESP32-C3 along with Seeed's MIDI library for playing the sounds. I'd done something similar previously with a device that plays music based on color detection, so it wasn't difficult to get up and running quickly there.
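On the Jetson side, talking to the Xiao can be as simple as mapping each mood to a command byte and writing it over serial. The single-byte protocol below is hypothetical (the real firmware's protocol may differ), and the pyserial call is shown but commented out so the mapping itself can run standalone:

```python
# Hypothetical mood-to-command protocol between Jetson and Xiao;
# the actual firmware commands may differ.
MOOD_COMMANDS = {
    "happy": b"H",
    "calm": b"C",
    "spooky": b"S",
    "exciting": b"E",
}

def mood_command(mood: str) -> bytes:
    # Unknown moods fall back to a neutral "calm" cue.
    return MOOD_COMMANDS.get(mood.lower(), b"C")

# With pyserial, sending a cue would look roughly like:
#   import serial
#   port = serial.Serial("/dev/ttyACM0", 115200)
#   port.write(mood_command("Happy"))
```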

The main script waits on the Xiao to trigger a record event from one of its buttons. When the record button is pressed, a chime indicates recording has started; the user speaks toward the device's USB microphone, and pressing the button again indicates they're done. faster-whisper then transcribes the recording, and the transcript is sent to gpt-oss-20b running on an Ollama server. The response is streamed back and parsed into elements such as [MOOD:Happy], [IMAGE:A dog dancing], and regular text. Each is processed as it arrives: the mood does nothing in the moment, image prompts go into a first-in-first-out queue against the ComfyUI websocket, and text is rendered to WAV files with Piper TTS for playback. The story then runs, triggering the elements in order: an image updates the one currently shown, audio is played, and a mood tells the Xiao MIDI Synthesizer which sound to play.
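The tag parsing over the streamed response can be sketched roughly like this. The tag names come from the examples above; buffering handles tags that get split across stream chunks (the project's actual parsing may differ in detail):

```python
import re

# Matches complete [MOOD:...] and [IMAGE:...] tags in the streamed text.
TAG_RE = re.compile(r"\[(MOOD|IMAGE):([^\]]*)\]")

def parse_stream(chunks):
    """Yield ("mood" | "image" | "text", payload) events from streamed
    model output. Partial tags are held in the buffer until the rest of
    the tag arrives in a later chunk."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while (match := TAG_RE.search(buffer)):
            text = buffer[:match.start()].strip()
            if text:
                yield ("text", text)
            yield (match.group(1).lower(), match.group(2).strip())
            buffer = buffer[match.end():]
    # Whatever remains after the stream ends is plain story text.
    tail = buffer.strip()
    if tail:
        yield ("text", tail)
```

A real implementation would also flush text at sentence boundaries so Piper TTS can start speaking before the full response arrives; this sketch only flushes text when a tag or the end of the stream is reached.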

The Xiao additionally has a button to kill the story midway through if the user decides.

Challenges we ran into

I ran into a lot of random challenges along the way. I had problems with USB audio and sound devices that I needed to work out, and the image display from the Jetson wasn't working at first. I also needed to build CTranslate2 from source so faster-whisper could run on the GPU; the version I got from pip was CPU only, and I wanted to make sure transcription stayed quick.

Accomplishments that we're proud of

In the end it works pretty well, and my kids have had fun using it. It does a good job showcasing the whole flow of a storyteller with images and sound. I like the touch of the MIDI elements, with background music and cues for various effects.

What we learned

I learned a bit more about Python development beyond what I've built previously. It's not my forte, and it's something I rely on AI a bit to assist with, but I've been picking it up along the way through projects like this.

What's next for gpt-oss-20b Storyteller

  • Some of the audio can get cut off; the Piper TTS pipeline could use further debugging.
  • Image generation could use more consistency, so that subjects look the same from one image to the next.
  • I wish I'd had time to fine-tune story development, steering the model on both image generation and story beats.

Built With

  • comfyui
  • pipertts
  • platform.io
  • python
  • whisper
  • xiao