Much, perhaps most, of today's useful knowledge and enjoyable experiences are consumed in video or audio form. This is increasingly so as education, business, and entertainment shift towards socially distant media. In general, searching in multimedia content is far less efficient and less accurate than searching in text. Moreover, the typical search experience is biased towards typed text and accommodates neither video and audio formats nor a conversational medium. For all these reasons, many educational and enterprise multimedia resources remain underutilized or are accessed inefficiently by consumers.

The prospect of "learning Alexa with Alexa" was another source of inspiration. See the attached video for the full story behind the inspiration and motivation.

What it does

The skill helps you on the journey of learning Alexa with Alexa. More generally, it helps users explore information in multimedia files. Whether the knowledge you are looking for is inside an instructional video or an audio recording, or typing and reading are not accessible to you, this project lets you find it in a conversationally intuitive and efficient way.

How I built it

The front-end is a familiar Alexa voice user interface enriched with a multi-modal experience built using the Alexa Presentation Language (APL). Users can ask questions about the multimedia files and have a conversation about their content, based on Alexa's answers, which identify the relevant passages. Users can also interact through visuals and touch to select a video from a list and to control the video player. See the attached images and video for more details.

The use of multi-modal technology, APL in particular, was essential for developing the skill. The skill's main feature relies on the APL Video component and, specifically, on its offset attribute. Controlling the video player, conversationally or through touch events, is another essential use of APL in the skill's beyond-voice, multi-modal experience.
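As a rough illustration of the idea, the sketch below builds an APL document whose Video component starts playback at a given offset, plus a directive that pauses or resumes the player. It is not the skill's actual code: the function names, the component id `answerPlayer`, the token, and the APL version are assumptions made for the example.

```python
from typing import Dict

PLAYER_ID = "answerPlayer"  # hypothetical component id


def build_video_directive(video_url: str, offset_ms: int,
                          token: str = "videoToken") -> Dict:
    """Return a RenderDocument directive whose Video starts at offset_ms."""
    document = {
        "type": "APL",
        "version": "1.4",  # assumed version
        "mainTemplate": {
            "items": [
                {
                    "type": "Video",
                    "id": PLAYER_ID,
                    "width": "100%",
                    "height": "100%",
                    "autoplay": True,
                    # 'offset' is given in milliseconds from the start of the media.
                    "source": [{"url": video_url, "offset": offset_ms}],
                }
            ]
        },
    }
    return {
        "type": "Alexa.Presentation.APL.RenderDocument",
        "token": token,
        "document": document,
    }


def build_control_directive(command: str, token: str = "videoToken") -> Dict:
    """Return an ExecuteCommands directive that sends ControlMedia to the player."""
    # Only commands that need no extra 'value' are handled in this sketch.
    if command not in {"play", "pause", "rewind"}:
        raise ValueError(f"unsupported command: {command}")
    return {
        "type": "Alexa.Presentation.APL.ExecuteCommands",
        "token": token,  # must match the token of the rendered document
        "commands": [
            {"type": "ControlMedia", "componentId": PLAYER_ID, "command": command}
        ],
    }
```

An intent handler could add `build_video_directive(url, 95000)` to its response to jump to the passage at 1:35, and later answer a "pause" utterance with `build_control_directive("pause")`.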

The back-end is a standard intent-handling serverless architecture enriched with several AWS artificial intelligence and machine learning services: a Kendra index is queried for relevant information and metadata, which are extracted from the multimedia files using AWS AI services such as Transcribe. See the attached architecture diagram for more details.
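The write-up does not spell out how excerpts and timestamps are derived, so the following is only a sketch of one plausible step: turning a Transcribe result (in the service's standard JSON output layout, with `results.items` holding timed words and untimed punctuation) into passages that each carry a start offset in milliseconds. The function name, the `max_words` limit, and the output keys are hypothetical.

```python
from typing import Dict, List


def transcript_to_passages(transcript: Dict, max_words: int = 40) -> List[Dict]:
    """Group timed words into sentence-like passages with a start offset."""
    passages: List[Dict] = []
    words: List[str] = []
    start_ms = None

    for item in transcript["results"]["items"]:
        content = item["alternatives"][0]["content"]
        if item["type"] == "pronunciation":
            if start_ms is None:
                # Transcribe reports times as strings holding seconds.
                start_ms = int(round(float(item["start_time"]) * 1000))
            words.append(content)
            end_of_passage = len(words) >= max_words
        else:
            # Punctuation items carry no timestamps; attach to the previous word.
            if words:
                words[-1] += content
            end_of_passage = content in {".", "?", "!"}

        if words and end_of_passage:
            passages.append({"offset_ms": start_ms, "text": " ".join(words)})
            words, start_ms = [], None

    if words:  # flush a trailing passage without closing punctuation
        passages.append({"offset_ms": start_ms, "text": " ".join(words)})
    return passages
```

Each passage could then be indexed as a searchable document with `offset_ms` stored as metadata, so that a search hit maps directly to a position in the video.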

Media Quest Architecture

To match the Beyond Voice hackathon's theme, the media files used in the skill are from the video series Zero to Hero: A comprehensive course to building an Alexa Skill. The videos are used by permission, for illustration and educational purposes. (Many thanks to Andrea Muttoni and German Viscuso for granting me permission to use the video content, and to Joe Muoio for facilitating communication. No affiliation or partnership with Amazon derives from using the video files.)

The implementation goes beyond this concrete illustration and can be applied to any media files, regardless of their content. The main idea of the approach is to solve a more general problem: providing a conversationally intuitive and practically convenient way of searching in multimedia files. See also the final section below for further applications.

Challenges I ran into

  • As I was developing the skill using the online simulator, I had to deal with several timing and duration issues in the behavior of Video components in APL (I look forward to enriching the interaction on a real device).
  • Connecting with, and orchestrating, the AWS back-end AI and ML services was also a challenge, in terms of both time and, especially, cost efficiency.
  • Finally, estimating the cost of the skill's resource consumption and designing a sustainable monetization strategy to offset it is an ongoing challenge. Some alternatives are discussed in the last section below.

Accomplishments that I'm proud of

  • Building a fully functional multi-modal skill in less than a month.
  • Going Beyond Voice with both multi-modal visual components and, especially, the insight and convenience provided by the AI components.
  • Designing an approach that does not rely on preexisting external sources for extracting metadata from media files, but instead extracts the excerpts, processes the corresponding timestamps, and generates the search results itself.
  • Implementing a skill that does not search only among video titles and tags, but can also search inside the content of the videos.

What I learned

  • A lot about the Alexa (and AWS) suite of technologies and how to make them work together.
  • A lot from, and about, the Alexa community, and a bit about voice and multi-modal design.

What's next for Conversational Search in Multimedia Content

  1. Test, improve, and enjoy the skill on a real device! Several timing and duration features of APL Video components work only partially in the online simulator, so I plan to iterate on the design of the interaction model with a real device in my hands.
  2. Add more (all?) videos about Alexa development I can find on the internet (and obtain permission for use).
  3. Develop similar skills for other subjects and topics (e.g. AWS, development, ML & AI docs, etc.), and let users set their own search domains and upload their own videos (e.g. all the classes and Zoom lectures for 4th grade).
  4. Extend to entertainment: feature movies and (especially) audio books.
  5. Alternate between searching for topics and having Alexa quiz you on topics from the media content you select.
  6. Explore a sustainable monetization strategy: Kendra is a great service, but its design and costs are tailored to enterprise-level applications. I plan to explore cheaper alternatives for the search functionality (ES, ML-NLP-QA, etc.) that may be better suited to the scale of the skill.
  7. Keep extending the interaction experience with questions about media content: e.g. select only answers where code, diagrams, or specific entities appear on the screen, ask clarifying questions when the answers are below a relevance threshold, present a list of answer timestamps ordered by relevance, etc.
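The relevance-threshold idea from the last item could look roughly like the sketch below. It is an illustration of a planned feature, not existing code: the `Answer` type, the numeric score in the range 0 to 1, the default threshold, and the spoken phrases are all placeholders, and a real search service may report relevance in a different form.

```python
from typing import List, NamedTuple


class Answer(NamedTuple):
    video_id: str
    offset_ms: int
    score: float  # hypothetical relevance score in [0, 1]


def rank_answers(answers: List[Answer], threshold: float = 0.5) -> List[Answer]:
    """Drop answers below the threshold and order the rest by relevance."""
    kept = [a for a in answers if a.score >= threshold]
    return sorted(kept, key=lambda a: a.score, reverse=True)


def respond(answers: List[Answer], threshold: float = 0.5) -> str:
    """Play the best answer, or ask a clarifying question if none is good enough."""
    ranked = rank_answers(answers, threshold)
    if not ranked:
        return "I am not sure I found that. Could you rephrase the question?"
    best = ranked[0]
    minutes, seconds = divmod(best.offset_ms // 1000, 60)
    return f"Playing {best.video_id} from {minutes}:{seconds:02d}."
```

The ordered list returned by `rank_answers` could also back a visual list of answer timestamps that the user selects from by touch.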

Built With

  • ai
  • apl
  • ask
  • kendra
  • lambda
  • ml
  • s3
  • sam
  • transcribe