Herpes Simplex Virus-2 (HSV-2) is the cause of genital herpes, a lifelong, contagious disease characterized by recurring, painful, fluid-filled sores. Transmission occurs through contact with fluids from the sores of an infected person during oral, anal, or vaginal sex, and it can also occur from asymptomatic carriers. HSV-2 is a global public health issue, with an estimated 400 million people infected worldwide and 20 million new cases annually, one third of which occur in Africa (2012). HSV-2 increases the risk of acquiring HIV threefold, profoundly affects the psychological well-being of the individual, and can cause devastating neonatal complications.
The social ramifications of HSV-2 are enormous. The stigma of sexually transmitted diseases (STDs) and the taboo against confiding in others mean that patients are often left on their own, to the detriment of their sexual partners. In Africa, the lack of healthcare professionals further exacerbates this problem. Further, the 2:1 ratio of female to male patients reflects a gender inequality in which women are ill-informed and unaware of their partners' condition or their own. Most importantly, the symptoms of HSV-2 often resemble those of less severe dermatological conditions, such as common candida infections and inflammatory eczema. It is very easy to dismiss genital herpes as one of these milder, non-contagious conditions.
What it does
Our team from Johns Hopkins has developed the humanitarian solution “Foresight” to tackle the taboo issue of STDs. Offered free of charge, Foresight is a cloud-based identification system that lets a patient take a picture of a suspicious skin lesion with a smartphone and diagnose the condition directly in the iOS app. We have trained a computer vision and machine learning algorithm, which the app downloads from the cloud, to differentiate between genital herpes and the less serious eczema and candida infections.
We have a few main goals:
- Remove the taboo involved in treating STDs by empowering individuals to make diagnoses independently through our computer vision and machine learning algorithm
- Alleviate specialist shortages
- Prevent misdiagnosis and inform patients when they should seek care
- Provide snapshots of local communities through the location service, enabling more potent public health interventions
- Protect the sexual relationship between couples by allowing for transparency: diagnose your partner!
How I built it
We first gathered 90 images across 3 categories (30 each) of skin conditions that are common around the genital area: "HSV-2", "Eczema", and "Yeast Infections". We realized that a good way to differentiate between these conditions is their inherent differences in texture, which, although subtle to the human eye, are very perceptible to good algorithms. We take advantage of the Bag of Words model common in web crawling and information retrieval, and apply a similar algorithm, which is written from scratch except for the feature identifier (SIFT). The algorithm is as follows:
Part A) Training the Computer Vision and Machine Learning Algorithm (Python)
- We use a computer vision feature detection algorithm called SIFT to process each image and identify "interesting" points, such as corners and other highly distinctive patches
- We consider each patch around the "interesting" points as textons, or units of characteristic textures
- We build a vocabulary of textons by identifying the SIFT points in all of our training images, and use the machine learning algorithm k-means clustering to narrow these down to a list of 1000 "representative" textons
- For each training image, we build our own descriptor in the form of a vector, where each element is the normalized frequency of the corresponding texton. We further apply tf-idf (term frequency, inverse document frequency) weighting to improve the representational power of each vector. (All of this is programmed by hand.)
- Finally, we save these vectors in memory. When we want to determine which of the 3 categories a test image depicts, we encode the test image into the same tf-idf vector representation and apply a k-nearest neighbors search to find the most likely class. We have found through experimentation that k=4 works well as a trade-off between accuracy and speed.
- We tested this model with a randomly selected subset that is 10% the size of our training set and achieved 89% prediction accuracy!
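A minimal sketch of the encoding and classification steps above, in plain Python. It is an illustration under stated assumptions, not the project's actual code: it assumes SIFT descriptors have already been extracted and that k-means has already produced the cluster centers. The function names, the log(N/df) form of idf, the L2 normalization, and the Euclidean distance are all choices made for this example.

```python
import math
from collections import Counter


def nearest_center(descriptor, centers):
    """Index of the cluster center (texton) closest to one descriptor."""
    best_index, best_dist = 0, float("inf")
    for i, center in enumerate(centers):
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, center))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def texton_histogram(descriptors, centers):
    """Normalized frequency of each texton among an image's descriptors."""
    counts = Counter(nearest_center(d, centers) for d in descriptors)
    total = sum(counts.values()) or 1  # avoid dividing by zero for empty images
    return [counts.get(i, 0) / total for i in range(len(centers))]


def inverse_document_frequency(histograms):
    """Assumed idf form: log(N / number of images that contain the texton)."""
    n_images = len(histograms)
    idf = []
    for j in range(len(histograms[0])):
        df = sum(1 for h in histograms if h[j] > 0)
        idf.append(math.log(n_images / df) if df else 0.0)
    return idf


def tfidf_vector(histogram, idf):
    """Weight each texton frequency by its idf, then L2-normalize."""
    weighted = [tf * w for tf, w in zip(histogram, idf)]
    norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
    return [v / norm for v in weighted]


def knn_classify(query, training_vectors, labels, k=4):
    """Majority vote among the k training vectors nearest to the query."""
    ranked = sorted(
        range(len(training_vectors)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(query, training_vectors[i])),
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

A test image would be encoded with `tfidf_vector(texton_histogram(descriptors, centers), idf)` and passed to `knn_classify` together with the stored training vectors and their labels. The default `k=4` mirrors the value reported above; with an even k, tied votes are possible and would need a tie-breaking rule in practice.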
Part B) Ruby on Rails Backend
- The previous machine learning model can be expressed as an aggregate of 3 files: cluster centers in SIFT space, tf-idf statistics, and classified training vectors in cluster space
- We output the machine learning model as CSV files from Python, and write an injector in Ruby that inserts the trained model into our PostgreSQL database on the backend
- We expose the API such that our mobile iOS app can download our trained model directly through an HTTPS endpoint.
- Beyond storing our machine learning model, our backend also includes a set of API endpoints geared toward public health purposes: each time an individual makes a diagnosis in the iOS app, the backend is updated with that individual's demographic information and diagnosis results. This information is visible on our web frontend.
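As a rough sketch of the export step in Part B, the three parts of the model could be written to CSV with Python's standard `csv` module. The file names, the row layout (label in the first column of the training vectors file), and the helper names `export_model` and `load_training_vectors` are assumptions for illustration; the writeup only states that the model is output as CSV files.

```python
import csv
from pathlib import Path


def export_model(out_dir, centers, idf, training_vectors, labels):
    """Write the three model parts as CSV files (hypothetical file names)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # One row per cluster center in SIFT space.
    with open(out / "cluster_centers.csv", "w", newline="") as f:
        csv.writer(f).writerows(centers)

    # A single row holding one idf weight per texton.
    with open(out / "tfidf_stats.csv", "w", newline="") as f:
        csv.writer(f).writerow(idf)

    # One row per training image: class label, then the tf-idf vector.
    with open(out / "training_vectors.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for label, vector in zip(labels, training_vectors):
            writer.writerow([label, *vector])
    return out


def load_training_vectors(path):
    """Read back the labelled vectors, e.g. to verify the export."""
    labels, vectors = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            labels.append(row[0])
            vectors.append([float(x) for x in row[1:]])
    return labels, vectors
```

Keeping the label and vector in the same row means an injector only has to parse one file to populate the table of classified training vectors.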
Part C) iOS app
- The app takes in demographic information from the user and downloads a copy of the trained machine learning model from our RoR backend a single time
- Once the model has been downloaded, it is possible to make a diagnosis even without internet access
- The user can take an image directly or upload one from the phone library for diagnosis, and a diagnosis is given in several seconds
- When the diagnosis is given, the demographic and diagnostic information is uploaded to the backend
Part D) Web Frontend
- Our frontend leverages the stored community data (pooled from diagnoses made on individual phones), which is accessible via our backend API
- The actual web interface is a portal for public health professionals like epidemiologists to understand the STD trends (as pertaining to our 3 categories) in a certain area. The heat map is live.
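One way the pooled diagnoses could be turned into heat map data is to count cases per grid cell. The sketch below is only an assumption about how such aggregation might look: the record fields `lat`, `lon`, and `diagnosis`, the function name `heatmap_counts`, and the one-degree default cell size are all hypothetical and not taken from our backend.

```python
import math
from collections import Counter


def heatmap_counts(records, category, cell_size_deg=1.0):
    """Count diagnoses of one category per latitude/longitude grid cell.

    Each record is assumed to be a dict with hypothetical keys
    "lat", "lon", and "diagnosis". Cells are identified by integer
    (row, column) indices so they can be used directly as dict keys.
    """
    counts = Counter()
    for record in records:
        if record["diagnosis"] != category:
            continue
        cell = (
            math.floor(record["lat"] / cell_size_deg),
            math.floor(record["lon"] / cell_size_deg),
        )
        counts[cell] += 1
    return counts
```

Binning by cell rather than plotting exact coordinates also avoids exposing the precise location of any individual user.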
Challenges I ran into
It is hard to find current STD prevalence and incidence data outside the United States. Most African countries have limited surveillance data, and the situation is even worse for stigmatized diseases. We relied on the 2012 global HSV-2 prevalence and incidence report from the World Health Organization (WHO). Another issue we faced is the ethics of collecting disease status from users. We were also conflicted about whether we should inform the user's spouse of the result. It is an ethical dilemma between patient confidentiality and beneficence.
Accomplishments that I'm proud of
- We successfully built a cloud-based image recognition system that uses machine learning to distinguish between HSV-2, yeast infection, and eczema skin lesions, with 89% accuracy on a randomly selected test set that is 10% the size of the training set.
- Our mobile app lets users anonymously send their pictures to our cloud database for recognition, avoiding the stigma of STDs among their neighbors.
- From a public health perspective, mapping the demographic distribution of STDs in Africa could help prevent HSV-2 infection and provide more medical advice to eligible patients.
What I learned
We learned much more about HSV-2 on the ground and its ramifications for society. We also learned about ML, computer vision, and other technological solutions available for STD image processing.
What's next for Foresight
Extending our machine learning and computer vision workflow to other diseases, and expanding our reach to other developing countries.