My son keeps showing things to Alexa and asking, "Do you know what this is?" And no matter how powerful Amazon has made Alexa, the fact is: Alexa can't see and can't read.
What it does
It gives Alexa eyes, and the ability to read things, through your phone's camera.
Step 1. Download an "Alexa Companion" app.
Step 2. Take a picture in it, or browse to an existing one.
Step 3. Ask, "Alexa, open your eyes."
You can even ask "Alexa, open your eyes and read this" in Spanish, Chinese, Japanese, and more. Right now, 12 languages are supported.
How I built it
The following 15 services run in the background to support this:
Lambda functions (2)
Alexa endpoint This Lambda function is called whenever the Alexa skill is invoked. It decides which language translation to invoke based on the user's input.
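A minimal sketch of what this routing might look like, assuming a Python Lambda; the intent and slot names (`ReadThisIntent`'s `language` slot) and the response shape are simplified illustrations, not the actual skill code:

```python
# Hypothetical sketch of the "Alexa endpoint" Lambda: map the language
# name Alexa heard to an Amazon Translate language code.

LANGUAGE_CODES = {
    "arabic": "ar", "chinese": "zh", "czech": "cs", "english": "en",
    "french": "fr", "german": "de", "italian": "it", "japanese": "ja",
    "portuguese": "pt", "russian": "ru", "spanish": "es", "turkish": "tr",
}

def resolve_language(slot_value, default="en"):
    """Return the Translate language code for a spoken language name,
    falling back to English for anything unrecognized."""
    if not slot_value:
        return default
    return LANGUAGE_CODES.get(slot_value.strip().lower(), default)

def lambda_handler(event, context):
    """Entry point invoked by the Alexa skill (request shape simplified).
    The slot name "language" is an illustrative assumption."""
    intent = event.get("request", {}).get("intent", {})
    slot = intent.get("slots", {}).get("language", {}).get("value")
    target = resolve_language(slot)
    # ...the real skill would kick off translation / Polly for `target` here...
    return {
        "version": "1.0",
        "sessionAttributes": {"targetLanguage": target},
        "response": {"outputSpeech": {"type": "PlainText",
                                      "text": f"Reading in {slot or 'English'}"}},
    }
```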
Triggered by S3 bucket Whenever an image taken with the Alexa Companion app is uploaded to the S3 bucket, this function is triggered. It reads the text using Amazon Rekognition and pushes it to DynamoDB.
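Here is a hedged sketch of such a trigger in Python with boto3; `detect_text` and `put_item` are real boto3 calls, but the table name, attribute names, and event handling are illustrative:

```python
def extract_lines(rekognition_response):
    """Pull only LINE-level detections out of a Rekognition detect_text
    response (which also contains WORD-level ones) and join them into
    one readable string."""
    lines = [d["DetectedText"]
             for d in rekognition_response.get("TextDetections", [])
             if d.get("Type") == "LINE"]
    return " ".join(lines)

def handler(event, context):
    """Triggered by an S3 ObjectCreated event. Table and attribute names
    are assumed, not taken from the real project."""
    import boto3  # imported here so the pure helper above has no AWS dependency
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    rekognition = boto3.client("rekognition")
    response = rekognition.detect_text(
        Image={"S3Object": {"Bucket": bucket, "Name": key}})
    text = extract_lines(response)

    table = boto3.resource("dynamodb").Table("MyEyesResults")  # assumed name
    table.put_item(Item={"imageKey": key, "detectedText": text})
    return text
```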
AWS Serverless (1)
Amazon doesn't have a "Login with Amazon" SDK for Xamarin, so I had to hack my way in using the API calls it makes to provide authentication on the web. For that I used AWS Serverless and hosted an API which gets an auth token from Amazon and returns it to the app.
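Under the hood this is a standard OAuth 2.0 authorization-code exchange against Login with Amazon's token endpoint. A rough Python sketch (the hosted API in this project may differ in its details):

```python
def build_token_request(code, client_id, client_secret, redirect_uri):
    """Form-encoded body for Login with Amazon's token endpoint
    (standard OAuth 2.0 authorization-code grant)."""
    return {
        "grant_type": "authorization_code",
        "code": code,
        "client_id": client_id,
        "client_secret": client_secret,
        "redirect_uri": redirect_uri,
    }

LWA_TOKEN_URL = "https://api.amazon.com/auth/o2/token"

def exchange_code(code, client_id, client_secret, redirect_uri):
    """What the hosted API does: trade the auth code for Amazon's token
    JSON and hand it back to the app."""
    import requests  # imported lazily; only needed for the real call
    resp = requests.post(
        LWA_TOKEN_URL,
        data=build_token_request(code, client_id, client_secret, redirect_uri))
    resp.raise_for_status()
    return resp.json()  # contains access_token, refresh_token, expires_in
```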
Login with Amazon (2)
Alexa Link Account The Alexa skill requires an Amazon login to sync with the phone.
Xamarin app login The mobile app authenticates using Login with Amazon. This is required to sync with the Alexa app.
Microsoft Cognitive service instance (1)
The Alexa Companion app uses this service to extract a caption from the captured image, describing its contents in a simple sentence. Amazon Rekognition does not have a caption field; instead it returns a list of labels.
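For reference, a sketch of calling the Describe Image operation of the Computer Vision REST API (v3.2) and keeping the highest-confidence caption; the endpoint and key are placeholders you'd get from your Cognitive Services instance:

```python
def best_caption(describe_response, default="I can't describe this image."):
    """Pick the highest-confidence caption out of a Computer Vision
    'describe' response (shape per the v3.2 REST API)."""
    captions = describe_response.get("description", {}).get("captions", [])
    if not captions:
        return default
    return max(captions, key=lambda c: c.get("confidence", 0))["text"]

def describe_image(image_bytes, endpoint, key):
    """POST the raw image to the Describe Image endpoint and return the
    best one-sentence caption."""
    import requests  # lazy import: only needed for the real call
    resp = requests.post(
        f"{endpoint}/vision/v3.2/describe",
        headers={"Ocp-Apim-Subscription-Key": key,
                 "Content-Type": "application/octet-stream"},
        data=image_bytes)
    resp.raise_for_status()
    return best_caption(resp.json())
```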
Xamarin Forms Android/iOS app (2)
This app is used to take a picture, or browse for one, on your phone/tablet/iPad/etc. The selected picture is then sent to the Microsoft Cognitive service and, in parallel, uploaded to the S3 bucket for further processing.
My Eyes - Alexa skill (1)
My Eyes is the Alexa skill. You can invoke it in the following ways, by saying:
- Alexa, open your eyes
- Alexa, open your eyes and read this
- Alexa, open your eyes and read this in [language]*
* You can use any language from: Arabic, Chinese, Czech, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, Turkish
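For illustration, the language choice can be modeled as a slot in the skill's interaction model; the intent and slot-type names below are hypothetical, and only three of the twelve languages are shown:

```json
{
  "interactionModel": {
    "languageModel": {
      "invocationName": "your eyes",
      "intents": [
        {
          "name": "ReadThisIntent",
          "slots": [
            { "name": "language", "type": "LanguageType" }
          ],
          "samples": [
            "read this",
            "read this in {language}"
          ]
        }
      ],
      "types": [
        {
          "name": "LanguageType",
          "values": [
            { "name": { "value": "spanish" } },
            { "name": { "value": "chinese" } },
            { "name": { "value": "japanese" } }
          ]
        }
      ]
    }
  }
}
```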
Amazon DynamoDB instance (1)
All the information, from the app-linking code to the translated text, is stored in this DB.
Amazon S3 buckets (2 folders)
For image files
All the images uploaded from the Alexa Companion app are temporarily stored in this bucket (folder). An auto-delete policy removes them after one day (the minimum available duration).
For audio files
All the audio files uploaded by the Lambda function, after conversion to a specific language, are temporarily stored in this bucket (folder). An auto-delete policy removes them after one day (the minimum available duration).
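Both auto-delete policies boil down to an S3 lifecycle rule with a one-day expiration. A sketch using boto3; the `images/` and `audio/` prefixes are assumed names for the two folders:

```python
def one_day_expiry_rule(prefix, rule_id):
    """Lifecycle rule that expires objects under `prefix` after one day,
    the minimum expiration S3 allows."""
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": 1},
    }

def apply_lifecycle(bucket_name):
    """Attach one-day expiry rules for the image and audio folders."""
    import boto3  # lazy import: only needed for the real call
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration={"Rules": [
            one_day_expiry_rule("images/", "expire-images"),
            one_day_expiry_rule("audio/", "expire-audio"),
        ]})
```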
Amazon Translate instance (1)
All the text is translated using Amazon Translate.
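A minimal wrapper might look like this; `SourceLanguageCode="auto"` lets Translate detect the source language, and the client is injectable so the function can be exercised without AWS credentials:

```python
def translate_text(text, target_code, client=None):
    """Translate `text` into `target_code` (e.g. "es") with Amazon
    Translate, auto-detecting the source language."""
    if client is None:
        import boto3  # lazy import: only needed for the real call
        client = boto3.client("translate")
    response = client.translate_text(
        Text=text,
        SourceLanguageCode="auto",
        TargetLanguageCode=target_code)
    return response["TranslatedText"]
```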
Amazon Rekognition instance (1)
All the scanned images are sent to Amazon Rekognition to extract text from them.
Amazon Polly instance (1)
Text to speech for any supported language is performed using Amazon Polly.
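A sketch of the Polly step. The voice-per-language table is an assumption based on voices available at the time of writing; notably, Polly has no Czech voice, one instance of the mismatched language support listed under Challenges, so unsupported languages fall back to English here:

```python
# Assumed voice names; these may change as Polly adds voices.
POLLY_VOICES = {
    "en": "Joanna", "es": "Lucia", "fr": "Celine", "de": "Marlene",
    "it": "Carla", "ja": "Mizuki", "pt": "Ines", "ru": "Tatyana",
    "tr": "Filiz", "ar": "Zeina", "zh": "Zhiyu",
}

def pick_voice(language_code):
    """Polly voice for a language code, falling back to English."""
    return POLLY_VOICES.get(language_code, POLLY_VOICES["en"])

def synthesize(text, language_code):
    """Return MP3 bytes from Polly. Alexa's <audio> tag is picky about
    MP3 bit rate and sample rate, which is why an extra conversion step
    (see Challenges) is still needed after this call."""
    import boto3  # lazy import: only needed for the real call
    polly = boto3.client("polly")
    result = polly.synthesize_speech(
        Text=text,
        VoiceId=pick_voice(language_code),
        OutputFormat="mp3",
        SampleRate="24000")
    return result["AudioStream"].read()
```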
Challenges I ran into
Integration, integration and integration
- Linking mobile app with Alexa
- Pushing the Polly-generated file to S3
- Converting the Polly-generated MP3 to a format Alexa understands
- Managing the wait time while results are generated (translation, reading the image, etc.)
- The different sets of languages supported by Amazon Translate, Amazon Rekognition, and Amazon Polly
- Hacking Amazon Login, as it does not provide an SDK for Xamarin Forms apps
Accomplishments that I'm proud of
Watching this whole integration work :)
What I learned
- Dealing with limitations of products
- Dealing with limitations in features provided
- Reinventing the wheel for optimizing performance
- Hacking my way into things not provided out of the box
What's next for My Eyes
An independent device (like a Raspberry Pi) to act as the eyes
Using Alexa as a true Companion while you are driving your car