Leveraging AI to Generate Question-Answer Bots for Website FAQs
For RamHacks 2017, Octo challenged hackers to take a website and generate a question-answer bot with a voice interface for that website's FAQ pages. Technical challenges arise in finding the FAQ pages, parsing the FAQ pages, and generating mappings from question intents to answers.
What it does
On the Android app interface, the user enters a base url, for example, http://ramhacks.vcu.edu/. The app then makes a call to an AWS Lambda function that scrapes the website for pages and returns pages likely to contain an FAQ. For each potential FAQ page, the app attempts to parse the FAQ pages into question-answer pairs. The user can ask a question, via type or via voice, and the app will find and display an answer,leveraging Microsoft Cognitive Services to train a model.
How we built it
The Android app interface was created in Android Studio and contains two activities. The first activity requires the user to input a base url which is run through the Lambda function. The function returns a list of potential FAQ pages under that base url domain. For each of these urls, the app attempts to scrape off question-answer pairs using Microsoft Cognitive Services. The Lambda function is described below.
The scraper.py file contains the Lambda function handler. This scraper conducts breadth-first search through all internal-facing links in the anchor tags of each page. For each page visited, multiple features e.g. how many times FAQ appears in the page, are extracted from the html. The features compose an input vector for a basic neural network which generates a likelihood that the page is an FAQ page. If the probability exceeds 50%, the page is tagged as a potential FAQ page. The page crawling continues until there are no more unique pages to visit or the page visit limit (set by the initial Lambda function call) is exceeded. The final list of FAQ urls is concatenated into a newline-separated string and is returned to the origin of the Lambda function call.
ANN.py contains the basic neural network. Data to train the neural network is found in prepro_training.txt and is processed by DataGeneration.py. For each url in the training set, the HtmlFeatureExtractor.py extracts a set of quantitative features from the page and outputs that feature data into training.txt. The ANN.py can then be trained on the numeric training data in training.txt. The weights are stored and restored via a pickle file.
The second activity of the Android application contains an interface for the user to input a question. This question is evaluated for similarity to other questions and returns the appropriate, if present, scraped FAQ answer.
Challenges we ran into
Originally, we wanted to interface with Amazon Alexa, but our code did not seem to plug in easily. We could have spent all of our time pushing through and figuring out how to use it, but we decided that would not have been an efficient use of our time. We opted to create an Android application with the same voice functionality as Alexa.
Working with AWS also proved to be a difficult task. The neural network was written with the NumPy library, but AWS Lambda does not work well with local installed NumPy library. We had to use a Amazon EC2 micro instance to generate the Lambda function package.
Accomplishments that we're proud of
We were proud of our solution; we did not reinvent the wheel, we kept it functional yet sleek. Getting past our Alexa and AWS barriers were particularly good moments.
What we learned
Some of us learned how to program Android apps. Some of us learned how to use Amazon Web Services. Some of us learned how to create an Android interface.
What's next for OctoBot
- Implementation into Amazon Alexa, Google Home, etc.
- Better model for determining if page is FAQ
- More Android front-end work
- Julian Duque - Link between application and Microsoft Cognitive Services
- Meredith Lee - Application-user interface
- Tony Wang - Amazon Alexa interface through AWS
- David Zhao - AWS Python backend