Posted on 24 Jun 2013 in Speaking, BurlingtonJS, JavaScript, Scraping, Node.io, Casper.js, Phantom.js, Node.js
June 19, 2013 was the third meeting of the BurlingtonJS group, the second I attended, and the first at which I was lucky enough to present. I volunteered to speak on web scraping. Web scraping has been around for about as long as browsers and widespread Internet access. It is an interesting topic in the world of JS because of such amazing developments as Node.io and Casper.js (which runs on Phantom.js).
Slidedeck
I gave the presentation via PowerPoint. Since this was my first tech talk, I felt it was best to have everything in one nicely packaged document so I could just have one button to press. Not having to worry about switching between different apps, windows, and tabs ended up really helping with my nerves.
The idea for the main project I feature in this talk came about months ago. I never thought I would present on it, but you can view an intro and the final map.
Dumb scraping featuring cURL and Node.io. Basic examples and discussion.
Smart scraping featuring Casper.js. Scrape the items of an HTML list that have been generated via JavaScript.
Distributed scraping and NPM with Node.io. Scrape the names of links in the search results for 'burlingtonjs.'
Dealer.com scraping and microformats. Scrape Dealer.com websites for location information, accounting for two different HTML templates. This is messy and probably not ideal, but it works. Some of the pages have latitude and longitude embedded in the page; others do not.
What's next? Natural Language Processing via Alchemy API shows promise. It allows you to leverage many great tools without the overhead. Their tools help ease the pain of gathering structured data from unstructured hypertext markup.
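To give a feel for the "two different HTML templates" problem in the Dealer.com example above, here is a minimal sketch in plain Node.js. It is not the code from the talk: the markup, the class names (`latitude`, `longitude`, `street-address`, `locality`, `region`), and the function names `textOfClass` and `extractLocation` are all assumptions for illustration, loosely modeled on how geo and address microformats are commonly marked up. It tries the template with embedded coordinates first and falls back to the address-only template.

```javascript
// Hypothetical sketch: the markup and class names are assumptions,
// not the actual Dealer.com templates or the code from the talk.

// Return the text of the first element whose class attribute contains
// the given class name, or null when no such element is found.
function textOfClass(html, className) {
  const pattern = new RegExp(
    '<[^>]*class="[^"]*\\b' + className + '\\b[^"]*"[^>]*>([^<]*)<',
    'i'
  );
  const match = pattern.exec(html);
  return match ? match[1].trim() : null;
}

// Template A (assumed) embeds coordinates; template B (assumed) only
// carries a postal address that would still need geocoding later.
function extractLocation(html) {
  const lat = textOfClass(html, 'latitude');
  const lng = textOfClass(html, 'longitude');
  if (lat !== null && lng !== null) {
    return {
      source: 'geo',
      latitude: parseFloat(lat),
      longitude: parseFloat(lng)
    };
  }

  const street = textOfClass(html, 'street-address');
  if (street !== null) {
    return {
      source: 'address',
      street: street,
      locality: textOfClass(html, 'locality'),
      region: textOfClass(html, 'region')
    };
  }

  return null; // neither template matched
}
```

A regular expression is a blunt tool for HTML, which fits the "messy and probably not ideal" spirit of the original; a real scraper would more likely query the DOM through Casper.js or Node.io selectors instead.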
Notes
BurlingtonJS meets monthly and features two one-hour talks. For June, Rob (@founddrama) presented first on functional programming. He did a really nice job explaining high-level functional concepts and examples. Check out his write-up and slides for more on functional JavaScript.
Massive thanks to Dealer.com for hosting the June meeting in their amazing offices, and for using microformats on their sites.
This was my first tech talk, so to ease the nerves, I tried to absorb as much information as I could on the topic and on presenting. Posts that helped me prepare: