I had a graph database in which I stored URLs as node properties. I wanted to extract textual information from those URLs, in other words do some web scraping, but without leaving the Cypher world. It turned out that web scraping procedures in Neo4j can work on existing properties and return their output in a form that further Cypher commands can consume. An example: you have product nodes with product names. You can use the product names to generate eBay search URLs, grab the prices from the resulting HTML pages, create new nodes in your Neo4j database that connect the product nodes to eBay listing pages, and write the price property into your graph. The possibilities go well beyond this: the procedures can also be used as a preprocessing step for your NLP pipelines.
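The first step of that example, turning a product name into a search URL, could look roughly like the sketch below. It is not code from the project: the class name `EbaySearchUrl` is made up for illustration, and the eBay search path and its `_nkw` keyword parameter are assumptions about the site's URL format.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the procedure package.
class EbaySearchUrl {
    // Assumed eBay search endpoint; "_nkw" is assumed to be the keyword parameter.
    private static final String BASE = "https://www.ebay.com/sch/i.html?_nkw=";

    /** Builds a search URL for a product name, URL-encoding the name. */
    static String build(String productName) {
        return BASE + URLEncoder.encode(productName.trim(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Spaces become '+', reserved characters such as '&' are percent-encoded.
        System.out.println(build("usb-c cable 2m"));
    }
}
```

Encoding the name matters because product names often contain spaces, ampersands, or slashes that would otherwise break the query string.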
What it does
It simply uses the jsoup library, so you can use jsoup's selector syntax to pick out the content you need and keep working with it in Cypher.
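To give a feel for the selector syntax, here is a small sketch that runs a CSS selector over an inline HTML snippet with jsoup. It assumes jsoup is on the classpath; the class `SelectorDemo`, the sample HTML, and the `li.item span.price` selector are invented for illustration and do not come from the project or from any real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the procedure package.
class SelectorDemo {
    /** Returns the text of every element that matches the CSS selector. */
    static List<String> selectText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);          // parse the HTML string into a DOM
        List<String> texts = new ArrayList<>();
        for (Element el : doc.select(cssQuery)) {  // select() takes a CSS-style query
            texts.add(el.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Made-up markup standing in for a listing page.
        String html = "<ul>"
                + "<li class=\"item\"><span class=\"price\">$12.50</span></li>"
                + "<li class=\"item\"><span class=\"price\">$8.99</span></li>"
                + "</ul>";
        System.out.println(selectText(html, "li.item span.price"));
    }
}
```

Returning one string per matched element is convenient because a list maps naturally onto rows or list values that further Cypher commands can work with.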
How I built it
I considered a few options, but in the end I decided to use jsoup and implement a wrapper around it. This is a Neo4j procedure package, so it has to be compiled into a JAR file and installed into your Neo4j instance, typically by placing the JAR in the plugins directory and restarting the server.
Challenges I ran into
Timeouts, tricky webpages, and similar things. If you are familiar with web scraping, you know that it is a constantly changing field that brings new challenges every day.
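One common way to soften the timeout problem is to retry a failed fetch a few times with a growing delay. The sketch below shows that idea with the standard library only. It is not how the project handles timeouts; the `Retry` class and its parameters are hypothetical. With jsoup, the task passed in would typically be a fetch such as `Jsoup.connect(url).timeout(millis).get()`.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of the procedure package.
class Retry {
    /**
     * Runs the task, retrying on IOException (which includes socket timeouts)
     * up to maxAttempts times, doubling the delay after each failure.
     */
    static <T> T call(Callable<T> task, int maxAttempts, long initialDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                last = e;                 // remember the failure in case we run out of attempts
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);  // back off before the next attempt
                    delay *= 2;
                }
            }
        }
        throw last;                       // every attempt failed
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated fetch: times out twice, then succeeds.
        String page = call(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("simulated timeout");
            }
            return "<html>ok</html>";
        }, 5, 10L);
        System.out.println(page + " after " + calls[0] + " attempts");
    }
}
```

Only I/O failures are retried here; any other exception propagates immediately, since retrying a bad selector or a malformed URL would not help.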
Accomplishments that I'm proud of
I have never seen anybody else do it this way. I have seen people use external services, or Apache Tika for documents such as PPT or PDF, but not for HTML.
What I learned
CSS queries and similar wild topics.
What's next for neo4jscraperproc
This is really a pretotype. It needs improvements, more features, a proper project setup, and so on: all the things I didn't have time for during this hackathon.