University searcher

Inspiration

As students, we often face difficult choices. One of the most important is choosing the university where we will study, a decision that shapes our career and adult life and is usually made after much consultation and research. There are many universities, and the right one depends on many factors: ranking, study programs offered, amenities, location, student infrastructure, and so on. The task can seem daunting or even impossible. To help, I created a site that searches the internet for the information you need. I called the site uni d help, as a pun.

What it does

The website combines web scraping and text summarisation to give you as much precise information as it can on any given university.

How we built it

On a wide screen you will see two columns. The right-hand column is built by first finding pages on each subject with the following code:

import bs4
import requests

def ReturnFirstURLs(university, degree, item, country='us'):
    # Let requests URL-encode the query so names containing spaces or '&' survive.
    query = ' '.join([university, country, degree, item])
    headers = {"User-Agent": "Mozilla/5.0"}
    cookies = {"CONSENT": "YES+cb.20210720-07-p0.en+FX+410"}
    request_result = requests.get('https://www.google.com/search', params={'q': query},
                                  headers=headers, cookies=cookies)
    soup = bs4.BeautifulSoup(request_result.text, "html.parser")
    URLs = []
    # href=True skips anchors without an href, which would otherwise fail on .split().
    for anchor in soup.find_all('a', href=True):
        # Google wraps results as "/url?q=<target>&..."; keep only the target.
        URLs.append(anchor['href'].split("/url?q=")[-1].split("&")[0])
    return URLs

This returns every URL found for a given query, for example "Harvard US undergraduate application".
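
The "/url?q=" splitting above can also be sketched with the standard library's urllib.parse, which additionally decodes percent-escapes in the target. This is only an illustration under assumptions: the helper name unwrap_google_href and the sample links are hypothetical and not part of the site's code.

```python
from urllib.parse import urlparse, parse_qs

def unwrap_google_href(href):
    """Return the target of a '/url?q=...' redirect link, or href unchanged."""
    parsed = urlparse(href)
    if parsed.path == "/url":
        # parse_qs maps each query parameter to a list of decoded values.
        targets = parse_qs(parsed.query).get("q")
        if targets:
            return targets[0]
    return href

# Hypothetical sample links in the shape Google uses for result anchors.
print(unwrap_google_href("/url?q=https://college.harvard.edu/admissions&sa=U&ved=abc"))
print(unwrap_google_href("/search?q=harvard"))
```

Links that are not redirects pass through untouched, so the function can be applied to every anchor on the page.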

Next, we pick out the university's official link, using the fact that academic institutions have country-specific domains (.ac.uk in the UK, .edu in the US):

def filterLink(links, country='us'):
    # Return the first link on an academic domain that is not an image, or None.
    for item in links:
        if 'http' not in item:
            continue
        if 'default/files/styles/' in item or '.png' in item:
            continue
        if country.lower() == 'us':
            if '.edu' in item:
                return item
        elif '.ac.uk' in item and 'images' not in item.lower():
            return item
    return None

Note: the extra checks skip image links, which sometimes appear in the results and cannot be summarised.
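
The substring checks above accept any URL that contains ".edu" anywhere, including in the path. A stricter variant, sketched here as an assumption rather than as the site's actual code, compares only the host name. The name is_official_link and the ACADEMIC_SUFFIXES mapping (including the "uk" key) are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical mapping from a country code to its academic domain suffix.
ACADEMIC_SUFFIXES = {"us": ".edu", "uk": ".ac.uk"}

def is_official_link(url, country="us"):
    """True if the URL's host name ends with the country's academic suffix."""
    host = urlparse(url).hostname or ""
    suffix = ACADEMIC_SUFFIXES.get(country.lower())
    return suffix is not None and host.endswith(suffix)

print(is_official_link("https://www.harvard.edu/admissions"))
print(is_official_link("https://example.com/page.edu.html"))
```

Checking the host rather than the whole string avoids matching a commercial page that merely mentions ".edu" in its path.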

Now we summarise the contents:

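def GetText(link, look_at, SENTENCES_COUNT, country, university):
    # HtmlParser, Tokenizer, Stemmer, Summarizer, get_stop_words and LANGUAGE are
    # defined elsewhere; the names appear to come from the sumy library.
    # country and university are accepted but not used here.
    if not link:
        return ''
    url = link.split('%')[0]  # drop everything from the first percent-escape onward
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)

    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)
    # Summarise once and reuse the result instead of running the summarizer twice.
    sentences = summarizer(parser.document, SENTENCES_COUNT)
    if sentences:
        # dict.fromkeys removes duplicate sentences while keeping their order.
        final = list(dict.fromkeys(sentences))
        return final, look_at, url
    return False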
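
To illustrate the general idea of extractive summarisation (selecting the highest-scoring existing sentences rather than writing new ones), here is a minimal word-frequency sketch that uses only the standard library. It is not the algorithm the site uses, which comes from the summarizer library above; the function name summarize, the small stop-word list, and the scoring rule are all assumptions made for the example.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def _tokens(text):
    """Lowercased words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def summarize(text, sentence_count):
    """Return the highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Drop exact duplicates while keeping order, as GetText does.
    sentences = list(dict.fromkeys(sentences))
    freq = Counter(_tokens(text))

    def score(sentence):
        words = _tokens(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / (len(words) or 1)

    top = sorted(sentences, key=score, reverse=True)[:sentence_count]
    return [s for s in sentences if s in top]

sample = "Admissions open in May. Admissions require an essay. The weather is nice."
print(summarize(sample, 2))
```

Sentences that share frequent words with the rest of the page score highest, which is why off-topic lines tend to drop out.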

The second column needs much more code, because each item uses a different algorithm, but here is the function I used for most of the items:

def ScrapGoogle(university, message, num=1):
    url = 'https://www.google.com/search?q=' + university + message
    headers = {"User-Agent": "Mozilla/5.0"}
    cookies = {"CONSENT": "YES+cb.20210720-07-p0.en+FX+410"}
    request_result = requests.get(url, headers=headers, cookies=cookies)
    soup = bs4.BeautifulSoup(request_result.text, "html.parser")
    # Collect every text node; string=True replaces the deprecated text=True.
    texts = soup.find_all(string=True)
    # tag_visible and tag_visible2 (defined elsewhere) decide which nodes count as visible.
    visible = tag_visible if num == 1 else tag_visible2
    return " ".join(t.strip() for t in filter(visible, texts))
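
The helpers tag_visible and tag_visible2 are not shown in this write-up. As a rough idea of what a visibility filter does, the sketch below extracts visible text with the standard library's html.parser instead of BeautifulSoup. The class VisibleText, the function visible_text, and the set of hidden tags are assumptions, not the site's helpers.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collects text that is not inside script, style, head or title."""
    HIDDEN = {"script", "style", "head", "title"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._hidden_depth = 0  # how many hidden tags we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # HTML comments go to handle_comment, so they never reach this method.
        if not self._hidden_depth and data.strip():
            self.parts.append(data.strip())

def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

page = "<html><head><title>T</title></head><body><p>Hello <b>world</b></p><script>var x=1;</script></body></html>"
print(visible_text(page))
```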

Of course, this is not enough: I then use string functions such as split, join and replace to get the output wanted for each item.
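
As a sketch of this kind of post-processing, the function below pulls the value that sits between two labels in the scraped text and normalises its whitespace. The per-item code is not shown here, so the name text_between, the labels, and the sample string are all hypothetical.

```python
def text_between(text, start, end):
    """Return the text between the first `start` marker and the next `end`, or ''."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return ""
    value, found_end, _ = rest.partition(end)
    if not found_end:
        return ""
    # split() followed by join() collapses runs of whitespace into single spaces.
    return " ".join(value.split())

# Hypothetical scraped text with made-up labels and values.
scraped = "Example University   Founded:  1900  Motto: Learn"
print(text_between(scraped, "Founded:", "Motto:"))
```

Returning an empty string when a marker is missing lets the caller fall back to another source instead of showing a half-extracted value.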

Note: the "Near You" item is different as I use the PyYelp API for that one

Challenges we ran into

I ran into many difficulties because I had never worked on a project of this size, and by far the biggest was my lack of experience. It was the first time I had used Flask, Heroku, Jinja, PyYelp, BeautifulSoup, or even an external document like the one used for storing comments. I constantly had to learn new things, which slowed me down throughout the project, but I am proud to have overcome this obstacle.

Updates

Samue1A Aron started this project — Apr 07, 2023 01:39 PM EDT
