June 19, 2013 was the third meeting of the BurlingtonJS group, the second I attended, and the first at which I was lucky enough to be presenting. I volunteered to speak on web scraping. Web scraping has been around for about as long as the browser itself, and it has become an interesting topic in the JS world thanks to developments like Node.io and Casper.js (built on Phantom.js).
I gave the presentation via PowerPoint. Since this was my first tech talk, I felt it was best to have everything in one nicely packaged document so I only had one button to press. Not having to worry about switching between different apps, windows, and tabs really helped with my nerves.
The idea for the main project featured in this talk came about months ago. I never thought I would present on it, but you can view an intro and the final map.
Dumb scraping with cURL and Node.io: basic examples and discussion of fetching a page and pulling data straight out of the raw HTML.
Distributed scraping and npm with Node.io: scraping the names of the links in the search results for 'burlingtonjs.'
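The core of that job is just "find every anchor on the page and keep its text." Here is a hedged, framework-free sketch of that step; `linkNames` is a hypothetical helper, and in practice Node.io would wrap a function like this in a Job and fan the input URLs out across worker processes.

```javascript
// Extract the visible text of every <a> element in an HTML string.
function linkNames(html) {
  var names = [];
  var re = /<a\b[^>]*>([\s\S]*?)<\/a>/gi;
  var match;
  while ((match = re.exec(html)) !== null) {
    // Strip any nested tags and collapse whitespace to get clean link text.
    var text = match[1].replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim();
    if (text) names.push(text);
  }
  return names;
}
```

Feeding this the HTML of a search-results page for 'burlingtonjs' yields the list of result titles, which is all the distributed version needs to emit per page.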
Dealer.com scraping and microformats: scraping Dealer.com websites for location information, accounting for two different HTML templates. Some of the pages have latitude and longitude embedded in the markup and some do not, so the approach is messy and probably not ideal, but it works.
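Handling two templates boils down to trying one pattern and falling back to the other. This is a sketch under assumed markup: the geo-microformat `latitude`/`longitude` classes and the `geo.position` meta tag are examples of the kind of embedding involved, not Dealer.com's actual templates.

```javascript
// Try two different ways a page might embed its coordinates.
function extractLocation(html) {
  // Template A: geo microformat, e.g.
  //   <span class="latitude">44.47</span> <span class="longitude">-73.21</span>
  var lat = /class="latitude"[^>]*>([-\d.]+)</.exec(html);
  var lng = /class="longitude"[^>]*>([-\d.]+)</.exec(html);
  if (lat && lng) {
    return { lat: parseFloat(lat[1]), lng: parseFloat(lng[1]) };
  }
  // Template B: geo.position meta tag, e.g.
  //   <meta name="geo.position" content="44.47;-73.21">
  var meta = /name="geo\.position"\s+content="([-\d.]+);([-\d.]+)"/.exec(html);
  if (meta) {
    return { lat: parseFloat(meta[1]), lng: parseFloat(meta[2]) };
  }
  return null; // neither template matched; fall back to geocoding the address
}
```

The `null` branch is where the messiness lives: pages without embedded coordinates need a second pass through a geocoder before they can go on the map.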
What's next? Natural Language Processing via Alchemy API shows promise. It lets you leverage many great tools without the overhead of building or hosting them yourself, and it eases the pain of gathering structured data from unstructured hypertext markup.
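As a rough sketch of what that looks like, AlchemyAPI exposes plain HTTP endpoints you point at a URL. The endpoint and parameter names below are from memory of their documentation, so verify them against the current API reference before relying on this.

```javascript
// Build a request URL for AlchemyAPI's ranked named-entity extraction.
// The apiKey value is a placeholder; you register for your own key.
function entityRequestUrl(apiKey, pageUrl) {
  return 'http://access.alchemyapi.com/calls/url/URLGetRankedNamedEntities' +
    '?apikey=' + encodeURIComponent(apiKey) +
    '&url=' + encodeURIComponent(pageUrl) +
    '&outputMode=json';
}

// Usage sketch: fetch this URL with http.get() and JSON.parse the body
// to get back a ranked list of the entities found on the page.
```

For a scraping pipeline, this means the hard part (turning a blob of markup into names, places, and organizations) becomes a single HTTP call per page.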