As AI systems increasingly serve as companions, they also act as confidants for people in crisis. What happens when a vulnerable user asks for help and, instead of offering a lifeline, the AI details methods for self-harm? This is a key issue that must be addressed. Existing models do have safety filters, but these often rely on simple keyword blocking and are easy to fool. They reason poorly about intent, so they can be bypassed with clever phrasing, and they miss the nuanced context of a user's distress. This leaves a dangerous gap through which harmful instructions can slip.

SecureLife works in two steps. First, it assesses the user's prompt for self-harm intent. Second, it intercepts the model's response and analyzes it for any descriptive or instructional self-harm content before the response is sent back to the user. If harmful content is detected, SecureLife blocks the response and substitutes a safe, pre-vetted message containing resources such as a crisis hotline number.
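The two-step flow described above can be sketched roughly as follows. This is a minimal illustration, not SecureLife's actual implementation: the names `guard`, `GuardResult`, and `SAFE_MESSAGE` are hypothetical, the two detectors and the model call are passed in as plain functions because the write-up does not say how they are built, and the hotline number is left as a placeholder. The sketch also assumes the response is checked on every turn and that the prompt assessment is simply reported alongside the result.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical pre-vetted fallback text; the real message and the hotline
# number it contains are not given in the write-up, so a placeholder is used.
SAFE_MESSAGE = (
    "It sounds like you are going through a lot right now. "
    "You are not alone, and support is available. "
    "Please reach out to a crisis line: [crisis hotline number]."
)


@dataclass
class GuardResult:
    prompt_flagged: bool    # step 1: self-harm intent detected in the prompt
    response_blocked: bool  # step 2: harmful content detected in the draft
    text: str               # what is actually sent back to the user


def guard(
    prompt: str,
    generate: Callable[[str], str],
    detects_intent: Callable[[str], bool],
    detects_harmful_content: Callable[[str], bool],
) -> GuardResult:
    """Run the two checks around a single model call (assumed design)."""
    # Step 1: assess the user's prompt for self-harm intent.
    prompt_flagged = detects_intent(prompt)

    # Produce the draft response, but do not send it yet.
    draft = generate(prompt)

    # Step 2: analyze the draft for descriptive or instructional
    # self-harm content; substitute the safe message if any is found.
    if detects_harmful_content(draft):
        return GuardResult(prompt_flagged, True, SAFE_MESSAGE)

    return GuardResult(prompt_flagged, False, draft)
```

In practice the two detector arguments would be context-aware classifiers rather than keyword matchers, since keyword blocking is the weakness the project sets out to address. Keeping them as parameters means the same wrapper works with whatever model or classifier is plugged in.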
