lmorchard - Pebbling Club 🐧🪨

jacktuck/unfurl: Metadata scraper with support for oEmbed, Twitter Cards and Open Graph Protocol for Node.js :zap:

github.com
2024-08-21T23:31:41.545Z
scraping webdev html imported:raindrop

Notes

Highlight: Unfurl (spread out from a furled state) will take a url and some options, fetch the url, extract the metadata we care about and format the result in a sane way. It supports all major metadata providers and expanding it to work for any others should be trivial.
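
To make the "extract the metadata we care about" step concrete, here is a minimal Python sketch that pulls Open Graph and Twitter Card tags out of an HTML string. It is not the unfurl library's API (unfurl is a Node.js package); `MetaTagParser`, `extract_metadata`, and the sample HTML are hypothetical names and data made up for illustration, and the fetch step and oEmbed discovery are left out.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects Open Graph and Twitter Card <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.open_graph = {}
        self.twitter_card = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Open Graph uses property="og:..."; Twitter Cards usually use name="twitter:...".
        key = attr.get("property") or attr.get("name") or ""
        content = attr.get("content")
        if content is None:
            return
        if key.startswith("og:"):
            self.open_graph[key[len("og:"):]] = content
        elif key.startswith("twitter:"):
            self.twitter_card[key[len("twitter:"):]] = content


def extract_metadata(html):
    """Return the metadata found in an HTML string, grouped by provider."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return {"open_graph": parser.open_graph, "twitter_card": parser.twitter_card}


# Made-up sample document; a real caller would fetch the page first.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Example page">
<meta property="og:url" content="https://example.com/">
<meta name="twitter:card" content="summary">
</head><body></body></html>
"""

metadata = extract_metadata(SAMPLE_HTML)
print(metadata)
```

Grouping the result by provider keeps it obvious where each value came from, which helps when two providers disagree.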

microlinkhq/metascraper: Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.

github.com
2024-08-21T23:30:30.031Z
scraping webdev html opengraph imported:raindrop

Notes

Highlight: The metascraper library allows you to easily scrape metadata from an article on the web using Open Graph metadata, regular HTML metadata, and a series of fallbacks.
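
The "series of fallbacks" idea can be sketched in a few lines of Python: try the Open Graph value first, then fall back to plain HTML. This is only an illustration of the technique, not metascraper's actual rule set or API (metascraper is a Node.js package); `TitleSources` and `pick_title` are hypothetical names.

```python
from html.parser import HTMLParser


class TitleSources(HTMLParser):
    """Records candidate titles from og:title and the <title> element."""

    def __init__(self):
        super().__init__()
        self.og_title = None
        self.html_title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("property") == "og:title":
            self.og_title = attr.get("content")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Text may arrive in several chunks, so accumulate it.
            self.html_title = (self.html_title or "") + data


def pick_title(html):
    """Return the first non-empty candidate, in priority order, or None."""
    sources = TitleSources()
    sources.feed(html)
    sources.close()
    for candidate in (sources.og_title, sources.html_title):
        if candidate and candidate.strip():
            return candidate.strip()
    return None


print(pick_title('<head><meta property="og:title" content="OG"><title>Plain</title></head>'))
print(pick_title("<head><title> Plain </title></head>"))
```

The priority order is an assumption here; a real scraper would likely have a longer list of sources per field.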

Using SimpleXML with HTML | drewish.com

drewish.com
2010-04-21T17:41:30.000Z
php simplexml html scraping xml webdev dom imported:pinboard

Notes

"PHP 5's SimpleXML module is one of the biggest reasons to upgrade to 5. If you're parsing RSS feeds or the results of webservice requests it works beautifully and saves a ton of time. The only problem with it is that it'll only load valid XML. I banged my head against it for way too long before coming up with the following:"

Scrapy.org. an open source web scraping framework in Python

scrapy.org
2009-03-18T03:05:49.000Z
python scraping webdev imported:pinboard

Notes

"Scrapy is a high level scraping and web crawling framework for writing spiders to crawl and parse web pages for all kinds of purposes, from information retrieval to monitoring or testing web sites."

Crowbar - SIMILE

simile.mit.edu
2009-01-23T20:14:19.000Z
crowbar xulrunner scraping webdev scrapers js javascript mozilla imported:pinboard

Zend Developer Zone | Tidying up your HTML with PHP 5

devzone.zend.com
2006-08-19T05:25:29.000Z
tidy php scraping scrapers webdev imported:pinboard

Notes

"The Tidy extension is new in PHP 5, and is available from PHP version 5.0b3 upward. It is based on the TidyLib library, and allows the developer to validate, repair, and parse HTML, XHTML and XML documents from within PHP."

Microsummaries - MozillaWiki

wiki.mozilla.org
2006-05-14T22:49:08.000Z
webdev mozilla firefox xsl scraping imported:pinboard

Notes

"Microsummaries are regularly-updated succinct compilations of the most important information on web pages."

HTML Screen Scraping: A How-To Document

www.rexx.com
2006-01-21T18:03:36.000Z
webdev programming scraping imported:pinboard

Notes

"This document explains how to do HTML screen scraping. In effect it shows how to treat the Web as a resource by enabling you to retrieve and extract data from HTML Web pages."

For GRDDL-heads: XSLT+Tidy from Mark Nottingham on 2005-10-19 (semantic-web@w3.org from October 2005)

lists.w3.org
2005-10-20T13:46:16.000Z
xslt grddl scraping microformats imported:pinboard

Notes

"Let the scraping begin..."

Read/Write Web: The danger of running a remix service

www.readwriteweb.com
2005-09-17T17:02:15.000Z
del.icio.us webservices hacks scraping webdev imported:pinboard

Notes

"Populicio.us still lost their service because their reliance on del.icio.us fell away, but the lesson here is that screen scraping HTML comes with those risks by nature."

miscoranda: Link in a Soupstack

miscoranda.com
2005-06-05T20:19:41.000Z
python html scraping imported:pinboard

Notes

"The problem with getting links from HTML is that the HTML you find lying about on the web is often quite broken..."

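As a rough illustration of getting links out of broken HTML, the sketch below uses Python's standard-library `html.parser`, which does not require well-formed input. It is not the approach from the linked post; `LinkCollector` and the sample markup are hypothetical, and the markup is deliberately malformed (unclosed tags, an unquoted attribute).

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, ignoring document structure."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Made-up markup with unclosed <p>, <a>, and <b> tags and an unquoted href.
BROKEN_HTML = '<p><a href="/one">one<p><b><a href=two.html>two</b><a href="/three">'

collector = LinkCollector()
collector.feed(BROKEN_HTML)
collector.close()
links = collector.links
print(links)
```

Because the collector only reacts to start tags, missing end tags and bad nesting do not affect which links it finds.
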
Beautiful Soup

www.crummy.com
2004-06-21T11:25:28.000Z
python scraping imported:pinboard

Notes

"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like."

Pop Goes the Gmail. SMTP/POP server for Gmail!

jaybe.org
2004-06-01T16:44:04.000Z
mail scraping webdev imported:pinboard

