As college athletes, we have gone through numerous physical examinations that we needed to pass in order to be cleared to play. In each of those examinations, the doctor had to complete a lot of paperwork, adding time and hassle. Doctors often filled out such forms by hand, forcing either them or us to scan the form later. When we saw an opportunity to utilize Microsoft's Bing Speech API as part of our Hacktech project, the three of us knew we had to implement the Automated Physician Assistant (APA). At that point, we realized that the problem was bigger than we could have imagined. Studies have shown that doctors spend two-thirds of their time on paperwork. Doctors are frustrated, and patients suffer the consequences in higher medical costs and longer wait times. It's finally time for the tedious, mind-numbing task of filling out paperwork to be automated.

What it does

APA automatically fills out medical paperwork by listening to the conversation between the doctor or nurse and the patient during a medical examination.

How we built it

We used the Bing Speech API to capture conversations between doctors and patients. We took a specific JavaScript library and its sample webpage as the starting point for our local webpage. On that webpage, we refactored the code to fit APA's needs, allowing users to record conversations and then download them as text files. Users can also upload .wav files for our program to process. These files were then sent to our Python script to fill out the form.

We wrote a Python script to parse the transcribed audio from the Bing Speech API and programmatically create an RTF (Rich Text Format) document that presents the patient's answers from the conversation in a form-like format.
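The RTF-generation step can be sketched roughly as follows. This is an illustrative example, not the project's actual script: the function name, the "label: value" layout, and the answer dictionary are all assumptions for the sake of demonstration.

```python
def write_rtf_form(answers, path):
    """Write a minimal RTF document with one bold "Label: value" line
    per answer. Illustrative sketch only; the real form layout would
    be more elaborate.
    """
    body = ""
    for label, value in answers.items():
        # \b ... \b0 toggles bold for the label; \par ends the line.
        body += r"\b %s:\b0  %s\par" % (label, value) + "\n"
    # A bare-bones RTF wrapper: version header, charset, and a font table.
    rtf = r"{\rtf1\ansi\deff0{\fonttbl{\f0 Times New Roman;}}" + "\n" + body + "}"
    with open(path, "w") as f:
        f.write(rtf)
```

Building the document as plain control words and braces like this avoids any RTF library dependency, at the cost of having to escape special characters by hand.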

Challenges we ran into

To start, we needed to narrow down the modes of the Bing Speech API to pick the specific options necessary for the doctor-patient conversation. We also ran into an issue where the Bing Speech API was separating statements during speaker pauses, whereas we wanted to use a very simple start/stop interface. To fix this, we locally stored the text from all of the previous speaking events, only outputting a file when the user officially ends the conversation by pressing the stop button.
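The accumulate-then-flush idea can be sketched in a few lines. The project did this in browser-side JavaScript around the Bing Speech API's recognition events; the class and method names below are hypothetical stand-ins for those event handlers.

```python
class TranscriptSession:
    """Accumulates per-utterance recognition results until the user
    presses stop, then emits one combined transcript.
    """
    def __init__(self):
        self.segments = []

    def on_final_result(self, text):
        # Called once per recognized utterance; the speech API ends
        # an utterance at each speaker pause.
        self.segments.append(text.strip())

    def stop(self):
        # Pressing the stop button joins everything into the single
        # transcript that gets written out as a file.
        return " ".join(self.segments)
```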

We had some issues parsing the text from the conversation. The text does not have a particularly well-defined structure, and we had to devise a way to extract the relevant information. We decided to scan the text for question-keywords, and then scan forward to find the expected relevant information. For example, “glasses” might be a question-keyword for the interaction “Do you wear contacts or glasses? No, I do not.” In this case, we would search forward to see whether “No” or “Yes” appears first, to determine the patient's response. This is complicated by different types of questions requiring different responses: some expect yes/no, some expect a number, and some, such as blood pressure, expect two numbers. We had to account for such differences and parse accordingly.
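A minimal sketch of this keyword-driven scan might look like the following. The keyword table and answer-type names are assumptions for illustration; the real script's keyword list and matching rules may differ.

```python
import re

# Hypothetical question-keyword table: keyword -> expected answer type.
QUESTION_KEYWORDS = {
    "glasses": "yes_no",
    "smoke": "yes_no",
    "weight": "number",
    "blood pressure": "number_pair",
}

def parse_answer(transcript, keyword, answer_type):
    """Find a question-keyword, then scan forward in the transcript
    for the kind of answer that question expects.
    """
    lowered = transcript.lower()
    start = lowered.find(keyword)
    if start == -1:
        return None
    tail = lowered[start + len(keyword):]
    if answer_type == "yes_no":
        # Whichever of "yes"/"no" appears first is taken as the reply.
        m = re.search(r"\b(yes|no)\b", tail)
        return m.group(1) if m else None
    if answer_type == "number":
        m = re.search(r"\d+(?:\.\d+)?", tail)
        return m.group(0) if m else None
    if answer_type == "number_pair":
        # e.g. blood pressure: take the first two numbers after the keyword.
        nums = re.findall(r"\d+(?:\.\d+)?", tail)
        return tuple(nums[:2]) if len(nums) >= 2 else None
    return None
```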

Another challenge was familiarizing ourselves with RTF enough to create the form from the information gathered by parsing the text. RTF's syntax of braces and backslash control words made even simple layouts tricky to get right.

Accomplishments that we're proud of

We created a fully functional prototype. Our software can listen live to a conversation and/or take an audio file and convert it to a written conversation between physician and patient. From this point, we have all the data we need to accurately complete a generic physical form based on what was said.

In addition, we learned some new technologies during this hack, most notably how to integrate Microsoft cognitive services with a novel idea to create a complete product.

What we learned

Speech recognition can still be improved. For instance, we ran into problems converting spoken medical terminology such as “Tylenol” to written text. We learned how to integrate an API into a webpage, as well as how to process and alter rich-text files. We also dealt with JavaScript scripts and front-end storage issues, learning how to build a simple application with clean, readable code.

What's next for Automated Physician Assistant (APA)

We look forward to expanding our form recognition technologies. With optical character recognition (OCR) technology, we plan on developing a robust mechanism that can take in any form and automatically adapt our form-filling software to that specific form. In addition, we intend to incorporate machine learning methods for sentiment analysis on text and speech to improve our language processing. Long term, we can adapt this technology for additional applications such as employee forms and jury duty recording.
