Unfurl (spread out from a furled state) takes a URL and some options, fetches the URL, extracts the metadata we care about, and formats the result in a sane way. It supports all major metadata providers, and extending it to work with others should be trivial.
Metadata scraper with support for oEmbed, Twitter Cards and Open Graph Protocol for Node.js :zap: - jacktuck/unfurl
The metascraper library allows you to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks.
Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more. - microlinkhq/metascraper
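The fallback idea both libraries describe (prefer Open Graph, then Twitter Cards, then plain HTML metadata) can be sketched in a few lines. This is a minimal illustration using only Python's standard library, not the actual implementation or API of unfurl or metascraper; the names `MetaCollector` and `extract_metadata` and the exact fallback order are assumptions made for the example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta> tags and the <title> text from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}          # lowercased property/name -> content
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Open Graph uses property="og:..."; Twitter Cards and plain
            # HTML metadata usually use name="...".
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content and key.lower() not in self.meta:
                self.meta[key.lower()] = content.strip()
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return title, description and image, trying each source in turn."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    meta = parser.meta

    def first(*keys):
        # Hypothetical helper: first non-empty value among the given keys.
        for key in keys:
            if meta.get(key):
                return meta[key]
        return None

    return {
        "title": first("og:title", "twitter:title") or parser.title.strip() or None,
        "description": first("og:description", "twitter:description", "description"),
        "image": first("og:image", "twitter:image"),
    }
```

Fetching the page is left out on purpose; any HTTP client can supply the HTML string. Real libraries also handle oEmbed, JSON-LD, Microdata and RDFa, which this sketch ignores.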
"PHP 5's SimpleXML module is one of the biggest reasons to upgrade to 5. If you're parsing RSS feeds or the results of webservice requests it works beautifully and saves a ton of time. The only problem with it is that it'll only load valid XML. I banged my head against it for way too long before coming up with the following:"
"Scrapy is a high-level scraping and web crawling framework for writing spiders to crawl and parse web pages for all kinds of purposes, from information retrieval to monitoring or testing web sites."
"The Tidy extension is new in PHP 5, and is available from PHP version 5.0b3 upward. It is based on the TidyLib library, and allows the developer to validate, repair, and parse HTML, XHTML and XML documents from within PHP."
"This document explains how to do HTML screen scraping. In effect it shows how to treat the Web as a resource by enabling you to retrieve and extract data from HTML Web pages."
"Populicio.us still lost their service because their reliance on del.icio.us fell away, but the lesson here is that screen scraping HTML comes with those risks by nature."
"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like."
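To make the "get some data out of an awful page" idea concrete, here is a small sketch that pulls links out of sloppy markup with Python's standard-library `html.parser`, which tolerates unclosed tags, unquoted attributes and mixed-case tag names. It is an illustration only, not the approach of any tool quoted above; `LinkScraper` and `scrape_links` are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class LinkScraper(HTMLParser):
    """Collects (href, text) pairs, even when <a> tags are never closed."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._flush()   # a previous <a> may never have been closed
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_endtag(self, tag):
        if tag == "a":
            self._flush()

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def _flush(self):
        # Store the pending link with its whitespace collapsed.
        if self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
        self._href = None
        self._text = []

    def close(self):
        super().close()
        self._flush()       # the last link may still be open at end of input


def scrape_links(html):
    scraper = LinkScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.links
```

A usage example on deliberately broken markup:

```python
messy = (
    '<ul><li><a href="/one">First <b>item</a>'
    '<li><a href=two.html>Second\n item'
    '<li><A HREF="/three">Third</ul>'
)
for href, text in scrape_links(messy):
    print(href, text)
```

As the Populicio.us quote warns, scrapers like this depend on the page's current markup and can break whenever the site changes.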