It was a late night during exam period when the idea came to me. I had spent the last three days reading a boring, far too long, and far too unspecific materials science book. I could have saved a lot of time if I had had this tool. At EducatedLobsters, we think that no one should have to suffer through this painful experience.
What it does
Lightspeed Study does the work for you! Instead of digging through materials to find the sensible information, Lightspeed Study summarizes whole papers and gives you the relevant information. But that's just the beginning...
Wouldn't it be awesome to know whether the information in a text can answer a certain question and, what is more, to know the actual answer? Well, Lightspeed Study does that for you too. You give it a text and a set of questions, and the tool answers the questions as well as possible with the given data, or tells you when an answer can't be found.
If you don't want any questions answered, you can still summarize any document you want.
How we built it
First of all, we use a Tesseract-based algorithm to convert .pdf files to .txt. Next, we use a natural language processing pipeline to analyze the input text and learn, sentence by sentence, which are the main concepts of the theory text. This is done by transforming the text into a vector space and analyzing it according to a modified (personalized) version of term frequency * inverse document frequency (the Tf-Idf algorithm) and Latent Semantic Indexing (LSI). When a questions file is introduced too, the meaning of each question is compared with the different parts of the text in order to determine which part of the text, if any, properly answers each of the given questions. With this, we can generate a .txt file including:
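The question-matching step can be sketched roughly as below. This is a minimal illustration using plain Tf-Idf and cosine similarity only; the actual system uses a modified Tf-Idf plus LSI, and the function names and the similarity threshold here are illustrative assumptions, not the real code.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Build plain Tf-Idf vectors (sparse dicts) for tokenized sentences."""
    n = len(sentences)
    df = Counter(term for s in sentences for term in set(s))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for s in sentences:
        tf = Counter(s)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def best_answer(question, sentences, threshold=0.1):
    """Return the sentence most similar to the question, or None
    if nothing in the text clears the threshold (question unanswerable)."""
    vecs = tfidf_vectors(sentences + [question])
    q_vec, sent_vecs = vecs[-1], vecs[:-1]
    scored = [(cosine(q_vec, v), s) for v, s in zip(sent_vecs, sentences)]
    score, best = max(scored)
    return " ".join(best) if score >= threshold else None
```

Returning `None` below the threshold is what lets the tool say "this question can't be answered with the given text" instead of guessing.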
- A brief summary, about one third of the length of the original. This summary is obtained thanks to LexRank, a graph-based method that uses lexical centrality as salience for text summarization.
- The answer to each question, taken from the parts of the text that are considered to best respond to it (under certain criteria). If a question can't be answered, you will be told so in the output document.

In the end, we convert the .txt file to .pdf using the unoconv utility and output it. The whole system is hosted on AWS. We used a .tech domain.
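The summarization step can be sketched as a simplified LexRank: build a similarity graph over the sentences, score each sentence by its centrality in that graph with a PageRank-style power iteration, and keep the top-scoring third. This is a toy sketch under assumed parameters (threshold, damping), not the tuned implementation.

```python
import math
from collections import Counter

def lexrank_scores(sentences, threshold=0.1, damping=0.85, iters=50):
    """Score tokenized sentences by graph centrality, LexRank-style."""
    n = len(sentences)
    df = Counter(t for s in sentences for t in set(s))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # +1 so shared terms still count
    vecs = []
    for s in sentences:
        tf = Counter(s)
        vecs.append({t: tf[t] * idf[t] for t in tf})

    def cos(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    # Binary adjacency: an edge wherever similarity exceeds the threshold.
    adj = [[1.0 if i != j and cos(vecs[i], vecs[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    deg = [max(sum(row), 1.0) for row in adj]

    # Power iteration of the PageRank-style recurrence.
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n +
                  damping * sum(adj[j][i] / deg[j] * scores[j] for j in range(n))
                  for i in range(n)]
    return scores

def summarize(sentences, ratio=1 / 3):
    """Keep the top-scoring third of the sentences, in original order."""
    scores = lexrank_scores(sentences)
    k = max(1, round(len(sentences) * ratio))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [" ".join(sentences[i]) for i in sorted(top)]
```

Sentences that many other sentences resemble end up central in the graph and survive into the summary; isolated asides score low and are dropped.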
Challenges we ran into
Uploading files to the server was a major challenge, and data type conversions weren't always easy. Finding a way to define a semantic relation between words required a lot of prior learning (hours of Google).
Accomplishments that we're proud of
It works way better than expected, and we got drag and drop working.
What we learned
How to use AWS and the NLTK.
The input files have to be called teoria.pdf and preguntes.pdf (although at this stage you can just press Upload! if you don't have a question file).