Inspiration

I was very interested in creating a chatbot that would have witty humor while still being intelligent enough to give logical answers to questions.

What it does

The idea is simple: a Python CLI takes an input message from the user and uses softmax regression to match it to the best answer stored in a database.
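A minimal sketch of what that CLI loop looks like (get_reply is a hypothetical name for the matching step, sketched further in the Model section below):

def run_cli():
    print("Chatbot ready. Type 'quit' to exit.")
    while True:
        message = input("> ")
        if message.strip().lower() == 'quit':
            break
        print(get_reply(message))  # Softmax-regression match against the stored answers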

Construction and Build

My part of this project was mostly the data science aspect: filtering and manipulating the data from a bz2 file and storing the cleaned data in a database, ready to be used.

Cleaning the data

Creating the Table for Clean Data

CREATE TABLE IF NOT EXISTS parent_reply(parent_id TEXT PRIMARY KEY, comment_id TEXT UNIQUE, parent TEXT, comment TEXT, subreddit TEXT, unix INT, score INT)
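This statement was run from Python with sqlite3; a minimal sketch of the create_table() helper that the main program calls, with the connection setup mirroring the later snippets:

import sqlite3

FILENAME_OF_DATA = '2015-01'  # Month of Reddit comments being processed

connection = sqlite3.connect('{}.db'.format(FILENAME_OF_DATA))
c = connection.cursor()

def create_table():
    # Create the parent_reply table if it does not already exist
    c.execute("""CREATE TABLE IF NOT EXISTS parent_reply(
                 parent_id TEXT PRIMARY KEY, comment_id TEXT UNIQUE, parent TEXT,
                 comment TEXT, subreddit TEXT, unix INT, score INT)""")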

Checking if the data is valid

I needed to make sure that each comment was valid: it hadn't been deleted or removed, and it wasn't too long or too short. This was a simple function to implement.

def isAcceptable(data):
    words = data.split(' ')
    if len(words) > 50 or len(words) < 1:  # Reject comments with too many or too few words
        return False
    elif len(data) > 1000:  # Reject comments that are too long in characters
        return False
    elif data == '[deleted]' or data == '[removed]':  # Reject deleted/removed comments
        return False
    return True

Prioritizing Comments Based on Upvotes

I didn't want the chatbot giving low-quality answers that Reddit users simply throw onto a post; I wanted it to return the reply with the most upvotes. I implemented the logic for this as follows.

def find_comment_score(parentId):
    # Returns the score of the reply currently stored for this parent, if any
    try:
        c.execute("SELECT score FROM parent_reply WHERE parent_id = ? LIMIT 1", (parentId,))
        result = c.fetchone()  # Run query
        if result is not None:
            return result[0]
        return False
    except Exception:
        return False
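find_parent, used in the main program below, isn't shown here; a minimal sketch of what it likely looks like, assuming it simply looks up the text of the parent comment by its ID:

def find_parent(parentId):
    # Return the body of the parent comment if it is already stored, otherwise False
    try:
        c.execute("SELECT comment FROM parent_reply WHERE comment_id = ? LIMIT 1", (parentId,))
        result = c.fetchone()
        if result is not None:
            return result[0]
        return False
    except Exception:
        return False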

Batching database writes in a transaction

Committing every insert individually would be slow, so statements are queued and executed in batches of roughly 1,000 inside a single transaction.

sqlTransaction = []  # Queue of pending SQL statements

def transaction(sql):
    global sqlTransaction
    sqlTransaction.append(sql)
    if len(sqlTransaction) > 1000:
        c.execute('BEGIN TRANSACTION')  # Start the transaction
        for s in sqlTransaction:
            try:
                c.execute(s)
            except Exception:
                pass  # Skip statements that fail (e.g. duplicate keys)
        connection.commit()  # End the transaction
        sqlTransaction = []  # Clear the queued statements
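The sql_insert_* helpers used in the main program each build a statement and hand it to transaction() so the writes are batched; a minimal sketch of one of them (the exact SQL is an assumption):

def sql_insert_has_parent(commentid, parentid, parent, comment, subreddit, time, score):
    # Queue an insert for a comment whose parent text is already known
    try:
        sql = """INSERT INTO parent_reply (parent_id, comment_id, parent, comment, subreddit, unix, score)
                 VALUES ("{}","{}","{}","{}","{}",{},{});""".format(parentid, commentid, parent, comment, subreddit, int(time), score)
        transaction(sql)
    except Exception as e:
        print('Insertion error:', str(e))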

Main Program Logic

The main program logic was simple: for each row in the data, parse it as JSON, clean it, and populate the database.

if __name__ == "__main__":
    create_table()
    row_counter = 0
    paired_rows = 0

    with bz2.open("C:/Users/John/Downloads/RC_{}.bz2".format(FILENAME_OF_DATA)) as f:
        for row in f:
            row_counter += 1
            row = json.loads(row)
            parent_id = row['parent_id']
            body = format_data(row['body'])
            created_utc = row['created_utc']
            score = row['score']
            comment_id = row['name']
            subreddit = row['subreddit']

            parent_data = find_parent(parent_id)

            if isAcceptable(body):
                if score >= 2:
                    existing_comment_score = find_comment_score(parent_id)  # Score of the most popular reply already stored for this parent
                    if existing_comment_score:
                        if score > existing_comment_score:  # If the current comment is more popular than the one stored, replace it
                            sql_insert_replace_comment(comment_id, parent_id, parent_data, body, subreddit, created_utc, score)
                    else:
                        if parent_data:
                            sql_insert_has_parent(comment_id, parent_id, parent_data, body, subreddit, created_utc, score)
                        else:
                            sql_insert_no_parent(comment_id, parent_id, body, subreddit, created_utc, score)
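format_data isn't shown above; a minimal sketch, assuming it only needs to keep each comment on one line and make its quotes safe to embed in the SQL strings:

def format_data(data):
    # Collapse newlines to a placeholder token and normalize double quotes
    data = data.replace('\n', ' newlinechar ').replace('\r', ' newlinechar ').replace('"', "'")
    return data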

Creating a Train/Test Split

The logic for creating the train/test split was implemented with Pandas. I read the data from the database 5,000 rows at a time, writing the first 100,000 rows to the test set and the rest to the training set.

import sqlite3
import pandas as pd

FILENAME_OF_DATA = '2015-01'

# Connect to sql database
connection = sqlite3.connect('{}.db'.format(FILENAME_OF_DATA))
c = connection.cursor()
limit = 5000  # Analyzes 5000 rows at a time
last_unix = 0  # Keeps track of the LAST unix timestamp that we read up to
cur_length = limit  # Ensures that we still have data to read by checking the size of the last chunk
counter = 0
test_done = False
numOfRowsWrittenToTest = 0  # Keeps track of the number of rows written to the test data
while cur_length >= limit:
    query = "SELECT * FROM parent_reply WHERE unix > {} AND parent NOT NULL AND score > 0 ORDER BY unix ASC LIMIT {}".format(last_unix,limit)
    df = pd.read_sql(query, connection)
    if df.empty:
        break
    last_unix = df.tail(1)['unix'].iloc[-1]  # Grab the last UNIX timestamp in this chunk
    cur_length = len(df)  # Number of rows returned in this chunk

    if not test_done: #Writing to Test Data
        with open("test.from", "a", encoding="utf8") as f:
            for content in df['parent'].values:
                f.write(content+'\n')

        with open("test.to", "a", encoding="utf8") as f:
            for content in df['comment'].values:
                f.write(content+'\n')

        numOfRowsWrittenToTest += limit
        if numOfRowsWrittenToTest >= limit * 20:  # Stop writing to test data once 100,000 rows have been written
            test_done = True
    else: #Writing to Train Data
        with open("train.from", "a", encoding="utf8") as f:
            for content in df['parent'].values:
                f.write(content+'\n')

        with open("train.to", "a", encoding="utf8") as f:
            for content in df['comment'].values:
                f.write(content+'\n')

Model

For the model, I applied modifications to an existing model created by Daniel Kukieła, which can be found on his GitHub. Since the model is his intellectual property, I will not post its code here.

Softmax regression was used to determine which stored question best fit the user's message, and the database was then queried for the corresponding comment.
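As a rough illustration only (this is not Daniel's model), softmax regression here amounts to multinomial logistic regression: each stored parent comment is treated as a class, the user's message is vectorized, and the predicted class's parent_id is used to look up the reply in the database. A sketch with scikit-learn, where parents and parent_ids are assumed to have been loaded from the parent_reply table:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# parents: list of stored question texts; parent_ids: their parent_id values (both loaded from the database)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(parents)
model = LogisticRegression(max_iter=1000)  # Multinomial (softmax) logistic regression with the lbfgs solver
model.fit(X, parent_ids)

def get_reply(message):
    # Predict the best-matching parent, then fetch its stored reply (c is the sqlite3 cursor from earlier)
    best_parent = model.predict(vectorizer.transform([message]))[0]
    c.execute("SELECT comment FROM parent_reply WHERE parent_id = ? LIMIT 1", (best_parent,))
    row = c.fetchone()
    return row[0] if row else "Sorry, I don't have a reply for that."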

Extensions and Future Improvements

The model was only trained on about 5.5 GB of data, roughly one month's worth of Reddit comments from 2015; it would perform vastly better if it were trained on the complete ~300 GB dataset provided here.
