CS6200 Project

The project has two folders - data and code

Files in the data folder

corpus.p - cleaned corpus in a pickle file
Format: {'doc_id_1':'doc_content', 'doc_id_2':'doc_content', 'doc_id_3':'doc_content'... }
unigrams.p - all unigrams generated from the corpus, in a pickle file
Format: {'unigram_1':{'doc_id_9':frequency of unigram_1 in doc_id_9}, 'unigram_2':{'doc_id_4':frequency of unigram_2 in doc_id_4, 'doc_id_5':frequency of unigram_2 in doc_id_5}}
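
A minimal sketch of how an index in the unigrams.p format could be built from a corpus in the corpus.p format. The helper names (`load_pickle`, `build_unigram_index`) and the whitespace tokenization are assumptions for illustration, not necessarily what the project code does.

```python
import pickle
from collections import defaultdict


def load_pickle(path):
    """Load any of the .p files described in this README."""
    with open(path, "rb") as f:
        return pickle.load(f)


def build_unigram_index(corpus):
    """Map each unigram to {doc_id: count} for a {doc_id: text} corpus."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        # Assumption: the cleaned corpus can be tokenized on whitespace.
        for token in text.split():
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return dict(index)


# Toy corpus in the same shape as corpus.p (contents are made up).
toy_corpus = {"doc_id_1": "data process data", "doc_id_2": "process control"}
toy_index = build_unigram_index(toy_corpus)
```

With the real data, the corpus would be loaded first, for example `load_pickle("data/corpus.p")` (the path is assumed from the folder layout above).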

Unigrams_tf_table.txt - Term Frequencies of all terms in a text file

unigrams_tf.p - pickle file of all term frequencies
Format: {'term_1': frequency of term_1, 'term_2': frequency of term_2}

Note: the two term-frequency files above hold raw counts of each term over the whole corpus.
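
As a rough illustration of what "counts over the whole corpus" means, the totals can be derived from an index in the unigrams.p format by summing each term's per-document counts. The function name is hypothetical and the toy index is made up.

```python
def collection_term_frequencies(index):
    """Sum per-document counts to get one corpus-wide count per term."""
    return {term: sum(postings.values()) for term, postings in index.items()}


# Toy index in the same shape as unigrams.p (contents are made up).
toy_index = {
    "data": {"doc_id_1": 2},
    "process": {"doc_id_1": 1, "doc_id_2": 1},
}
toy_tf = collection_term_frequencies(toy_index)
```

Here `toy_tf` has the same shape as unigrams_tf.p: one integer per term.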

stopped_corpus.p - corpus with stop words removed
Format: ```{'CACM-CACM-0620': 'ratfact algorithm 78 cacm halstead ca620312 jb',
'CACM-CACM-1461': 'discussion summary operating systems cacm ca660311 jb' ... }```
stemmed_corpus.p - corpus with stemmed words (taken from the provided cacm_stem.txt)
Format: {'CACM-0059': ['survei of progress and trend of develop and us of automat data process in busi and manag control system of the feder govern as of decemb 1957 iii cacm septemb 1959 ca590910 jb '], 'CACM-0060': ['the alpha vector transform of a system of linear constraint cacm septemb 1959 wersan s j ca590909 jb '] ... }
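
Note that the two files store documents differently: stopped_corpus.p maps each id to a string, while stemmed_corpus.p maps each id to a one-element list of strings. The sketch below reads either shape and removes stop words. The helper names and the tiny stop list are assumptions for illustration; the project's actual stop list is not shown here.

```python
def as_text(value):
    """Return a document's text whether it is stored as a string or a list of strings."""
    if isinstance(value, str):
        return value
    return " ".join(part.strip() for part in value)


def remove_stop_words(corpus, stop_words):
    """Return a {doc_id: text} corpus with every stop word dropped."""
    stopped = {}
    for doc_id, value in corpus.items():
        tokens = as_text(value).split()
        stopped[doc_id] = " ".join(t for t in tokens if t not in stop_words)
    return stopped


# Hypothetical stop list and a toy document in the stemmed_corpus.p shape.
toy_stop_words = {"the", "of", "a"}
toy_corpus = {"CACM-0060": ["the alpha vector transform of a system "]}
toy_stopped = remove_stop_words(toy_corpus, toy_stop_words)
```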


code

This folder has four subfolders - one for each retrieval model
