Inspiration
As students, we often face difficult choices. One of them is deciding which university to attend, a choice that has a huge impact on our career and adult life. It is an important decision, usually made after much consultation and research. There are many universities, and choosing the right one means weighing many factors: ranking, study programs offered, amenities, location, student infrastructure, and so on. The task can seem daunting or even impossible. To solve this problem, I created a site that searches the internet for the information that would be useful to you. I called the site uni d help, as a pun.
What it does
The website uses a combination of web scraping and text summarisation to give you as much precise information as it can on any given university.
How we built it
On a wide screen, the site shows two columns. The right-hand column is built by first finding pages on each subject with the following code:
import bs4
import requests

def ReturnFirstURLs(university, degree, item, country='us'):
    """Return every link found on the Google results page for the query."""
    URLs = []
    url = 'https://www.google.com/search?q=' + university + '+' + country + '+' + degree + '+' + item
    headers = {"User-Agent": "Mozilla/5.0"}
    cookies = {"CONSENT": "YES+cb.20210720-07-p0.en+FX+410"}
    request_result = requests.get(url, headers=headers, cookies=cookies)
    soup = bs4.BeautifulSoup(request_result.text, "html.parser")
    # href=True skips anchors without an href, which would otherwise fail on .split()
    for anchor in soup.find_all('a', href=True):
        # Google wraps each result as "/url?q=<target>&..."; keep only the target
        URLs.append(anchor['href'].split("/url?q=")[-1].split("&")[0])
    return URLs
This returns all the URLs for a given query. An example query would be: "Harvard US undergraduate application"
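The search URL above is assembled by plain concatenation, so a university name containing spaces or an `&` would produce a malformed query. As a minimal sketch, assuming only the standard library, the query could be encoded with `urllib.parse`, and the `/url?q=` wrapper could be decoded the same way. `build_search_url` and `extract_target` are hypothetical helper names, not functions from the site, and `www.example.edu` is a placeholder address.

```python
from urllib.parse import parse_qs, quote_plus, urlparse


def build_search_url(university, degree, item, country="us"):
    """Join the search terms with spaces and percent-encode them for the query string."""
    query = " ".join([university, country, degree, item])
    return "https://www.google.com/search?q=" + quote_plus(query)


def extract_target(href):
    """Return the destination of a '/url?q=<target>&...' link, or href unchanged."""
    if href.startswith("/url?"):
        targets = parse_qs(urlparse(href).query).get("q")
        if targets:
            return targets[0]
    return href


print(build_search_url("Harvard", "undergraduate", "application"))
print(extract_target("/url?q=https://www.example.edu/admissions&sa=U"))
```

Unlike splitting on `&`, `parse_qs` also decodes percent-encoded characters inside the target address.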
Next, we find the university's official link, using the fact that each country has its own academic domain (.ac.uk in the UK, .edu in the US):
def filterLink(links, country='us'):
    for item in links:
        if 'http' in item:
            if country.lower() == 'us':
                if '.edu' in item and 'default/files/styles/' not in item and '.png' not in item:
                    return item
            else:
                if '.ac.uk' in item and 'default/files/styles/' not in item and '.png' not in item and 'images' not in item.lower():
                    print(item)
                    return item
Note: the image checks are there because the results sometimes include a link to an image, which cannot be read as text.
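The substring checks above would also accept an address that merely mentions `.edu` somewhere in its path or query. A stricter variant, sketched here under stated assumptions, compares the hostname suffix instead. `first_official_link`, the `ACADEMIC_SUFFIXES` table, and the extra image extensions are illustrative additions, and unlike `filterLink` this sketch returns `None` for a country it does not know rather than treating it as the UK.

```python
from urllib.parse import urlparse

# Academic domain suffix per country; only the two named in the text are listed.
ACADEMIC_SUFFIXES = {"us": ".edu", "uk": ".ac.uk"}
# Assumption: a few common image extensions beyond the .png checked originally.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg")


def first_official_link(links, country="us"):
    """Return the first non-image link hosted on the country's academic domain, or None."""
    suffix = ACADEMIC_SUFFIXES.get(country.lower())
    if suffix is None:
        return None
    for link in links:
        parsed = urlparse(link)
        # Skip relative links and anything that is not a web address
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            continue
        # Compare the hostname only, so ".edu" in a path or query does not count
        if not parsed.hostname.endswith(suffix):
            continue
        if parsed.path.lower().endswith(IMAGE_EXTENSIONS):
            continue
        return link
    return None


candidates = [
    "/search?q=x",
    "https://example.com/?ref=harvard.edu",
    "https://www.example.edu/logo.png",
    "https://www.example.edu/admissions",
]
print(first_official_link(candidates))
```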
Now we summarise the contents:
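from sumy.nlp.stemmers import Stemmer
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.html import HtmlParser
from sumy.utils import get_stop_words
# Summarizer (one of sumy's summarizer classes) and the LANGUAGE constant
# are defined elsewhere in the project and are not shown here.

def GetText(link, look_at, SENTENCES_COUNT, country, university):
    if not link:
        return ''
    # Drop everything from the first percent-encoded character onwards
    url = link.split('%')[0]
    print(url)
    parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
    stemmer = Stemmer(LANGUAGE)
    summarizer = Summarizer(stemmer)
    summarizer.stop_words = get_stop_words(LANGUAGE)
    # Summarise once and reuse the result instead of running the summarizer twice
    sentences = summarizer(parser.document, SENTENCES_COUNT)
    if sentences:
        # dict.fromkeys removes duplicate sentences while keeping their order
        final = list(dict.fromkeys(sentences))
        return final, look_at, url
    return False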
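The `dict.fromkeys` step in `GetText` is what removes repeated sentences from the summary. A small sketch of that idiom on plain strings follows; `dedupe_keep_order` is a hypothetical name, and the sample sentences are made up for illustration.

```python
def dedupe_keep_order(items):
    """Drop repeated items, keeping the first occurrence of each in its original position."""
    # Dictionary keys are unique and (since Python 3.7) keep insertion order
    return list(dict.fromkeys(items))


sentences = ["Apply early.", "Fees vary.", "Apply early."]
print(dedupe_keep_order(sentences))  # prints ['Apply early.', 'Fees vary.']
```

A `set` would also remove the duplicates, but it would not preserve the order in which the sentences appeared.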
For the second column, the code is much longer because each item needs a different algorithm, but here is the function I used for most of the items:
def ScrapGoogle(university, message, num=1):
    url = 'https://www.google.com/search?q=' + university + message
    headers = {"User-Agent": "Mozilla/5.0"}
    cookies = {"CONSENT": "YES+cb.20210720-07-p0.en+FX+410"}
    request_result = requests.get(url, headers=headers, cookies=cookies)
    soup = bs4.BeautifulSoup(request_result.text, "html.parser")
    # Every text node on the page, including script and style contents
    texts = soup.find_all(string=True)
    # tag_visible and tag_visible2 are the site's own filters that keep
    # only the text nodes a visitor would actually see (not shown here)
    if num == 1:
        visible_texts = filter(tag_visible, texts)
    else:
        visible_texts = filter(tag_visible2, texts)
    return " ".join(t.strip() for t in visible_texts)
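The `tag_visible` filters are not shown, so the following is only a sketch of the general idea of keeping visible text. It uses the standard library's `html.parser` rather than BeautifulSoup, and the class name, function name, and list of hidden tags are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: tags whose text content a visitor never sees on the page.
HIDDEN_TAGS = {"script", "style", "head", "title", "noscript"}


class VisibleTextParser(HTMLParser):
    """Collect text nodes that are not nested inside a hidden tag."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # how many hidden tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.hidden_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = ("<html><head><title>T</title><style>p{}</style></head>"
          "<body><p>Hello</p><script>var x=1;</script><!-- note --><p>world</p></body></html>")
print(visible_text(sample))  # prints Hello world
```

HTML comments never reach `handle_data`, so they are dropped without any extra check.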
Of course, this is not enough on its own: I then use string functions such as split, join, and replace to get the output wanted for each item.
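The exact post-processing differs per item and is not shown, so this is only a hedged illustration of the kind of split and join work involved. `normalise_spaces` and `text_between` are hypothetical helpers, and the sample text and its markers are invented for the example.

```python
def normalise_spaces(text):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    return " ".join(text.split())


def text_between(text, start, end):
    """Return the text between the first `start` marker and the next `end` marker.

    Returns None when `start` is missing; when `end` is missing, everything
    after `start` is returned.
    """
    if start not in text:
        return None
    after_start = text.split(start, 1)[1]
    return after_start.split(end, 1)[0].strip()


scraped = "About 1,000 results   Founded: 1900 \n Address: 1 Example Street"
clean = normalise_spaces(scraped)
print(text_between(clean, "Founded:", "Address:"))  # prints 1900
```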
Note: the "Near You" item is different, as I use the PyYelp API for that one.
Challenges we ran into
I encountered many difficulties because I had never worked on a project of this magnitude, and by far the biggest was my lack of experience. It was the first time I had used Flask, Heroku, Jinja, PyYelp, BeautifulSoup, or even an external document like the one used for storing comments. I constantly had to learn new things, and this inexperience slowed me down throughout the project, but I am proud to have overcome this obstacle.