PHP Port of Arc90’s Readability

Update 2011-03-23: Readers may also be interested in how we use PHP Readability at FiveFilters.org: Content Extraction at FiveFilters.org

Last year I ported Arc90’s Readability to use in the Five Filters project. It’s been over a year now and Readability has improved a lot — thanks to Chris Dary and the rest of the team at Arc90.

As part of an update to the Full-Text RSS service I started porting a more recent version (1.6.2) to PHP and the code is now online:

For anyone not familiar, Readability was created for use as a browser addon (a bookmarklet). With one click it transforms web pages for easy reading and strips away clutter. Apple recently incorporated it into Safari Reader.

It’s also very handy for content extraction, which is why I wanted to port it to PHP in the first place. Here’s an example of how to use the PHP port:

require_once 'Readability.php';
header('Content-Type: text/plain; charset=utf-8');

// get latest Medialens alert 
// (change this URL to whatever you'd like to test)
$url = 'http://medialens.org/alerts/index.php';
$html = file_get_contents($url);
 
// PHP Readability works with UTF-8 encoded content. 
// If $html is not UTF-8 encoded, use iconv() or 
// mb_convert_encoding() to convert to UTF-8.

// If we've got Tidy, let's clean up input.
// This step is highly recommended - PHP's default HTML parser
// often does a terrible job and results in strange output.
if (function_exists('tidy_parse_string')) {
	$tidy = tidy_parse_string($html, array(), 'UTF8');
	$tidy->cleanRepair();
	$html = $tidy->value;
}

// give it to Readability
$readability = new Readability($html, $url);

// print debug output? 
// useful to compare against Arc90's original JS version - 
// simply click the bookmarklet with FireBug's 
// console window open
$readability->debug = false;

// convert links to footnotes?
$readability->convertLinksToFootnotes = true;

// process it
$result = $readability->init();

// does it look like we found what we wanted?
if ($result) {
	echo "== Title ===============================\n";
	echo $readability->getTitle()->textContent, "\n\n";

	echo "== Body ===============================\n";
	$content = $readability->getContent()->innerHTML;

	// if we've got Tidy, let's clean it up for output
	if (function_exists('tidy_parse_string')) {
		$tidy = tidy_parse_string($content, 
			array('indent'=>true, 'show-body-only'=>true), 
			'UTF8');
		$tidy->cleanRepair();
		$content = $tidy->value;
	}
	echo $content;
} else {
	echo 'Looks like we couldn\'t find the content.';
}
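
The comments in the example note that PHP Readability expects UTF-8 input. A minimal sketch of that conversion step follows, assuming the mbstring extension is available. The helper name to_utf8() and the list of fallback encodings are illustrative assumptions, not part of PHP Readability.

```php
<?php
// Sketch: normalise fetched HTML to UTF-8 before handing it to Readability.
// $declared is a charset you already know (for example from the HTTP
// Content-Type header). The fallback candidates are an assumption; adjust
// them for the sites you fetch from.
function to_utf8(string $html, ?string $declared = null): string {
    if ($declared !== null && strcasecmp($declared, 'UTF-8') !== 0) {
        // Trust the declared charset when one is given.
        return mb_convert_encoding($html, 'UTF-8', $declared);
    }
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html; // already valid UTF-8, nothing to do
    }
    // No declared charset and not valid UTF-8: guess a single-byte encoding.
    $guess = mb_detect_encoding($html, ['Windows-1252', 'ISO-8859-1'], true);
    return mb_convert_encoding($html, 'UTF-8', $guess ?: 'ISO-8859-1');
}
```

In the example above this would be called as $html = to_utf8($html); right after file_get_contents(). Guessing an encoding is inherently unreliable, so passing the declared charset is preferable whenever it is known.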

Differences between the PHP port and the original

Arc90’s Readability is designed to run in the browser. It works on the DOM tree (the parsed HTML) after the page’s CSS styles have been applied and JavaScript code executed. This PHP port does not run inside a browser. We use PHP’s ability to parse HTML to build our DOM tree, but we cannot rely on CSS or JavaScript support. As such, the results will not always match Arc90’s Readability. (For example, if a web page contains CSS style rules or JavaScript code which hide certain HTML elements from display, Arc90’s Readability will dismiss those from consideration, but our PHP port, unable to understand CSS or JavaScript, will not know any better.)
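
One partial workaround for the hidden-element case is to drop elements that are hidden by an inline style attribute before extraction. The sketch below is not part of PHP Readability. The function name removeInlineHidden() is hypothetical, and the check only catches an explicit inline display:none, so rules that live in stylesheets or are applied by scripts are still missed.

```php
<?php
// Sketch: remove elements whose inline style contains "display:none".
// Returns the number of elements removed.
function removeInlineHidden(DOMDocument $dom): int {
    $xpath = new DOMXPath($dom);
    // Copy the node list into an array so removals do not disturb iteration.
    $candidates = iterator_to_array($xpath->query('//*[@style]'));
    $removed = 0;
    foreach ($candidates as $node) {
        // Ignore whitespace and case, so "Display : None" also matches.
        $style = strtolower(preg_replace('/\s+/', '', $node->getAttribute('style')));
        if (strpos($style, 'display:none') !== false && $node->parentNode) {
            $node->parentNode->removeChild($node);
            $removed++;
        }
    }
    return $removed;
}
```

To use it, parse the page into a DOMDocument, call removeInlineHidden($dom), and serialise with $dom->saveHTML() before passing the result to Readability.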

Another significant difference is that the aim of Arc90’s Readability is to re-present the main content block of a given web page so users can read it more easily in their browsers. Correct identification, clean up, and separation of the content block is only a part of this process. This PHP port is concerned only with this part; it does not include code that relates to presentation in the browser. Arc90 already do that extremely well, and for PDF output there’s FiveFilters.org’s PDF Newspaper.

Finally, this class contains methods that might be useful for developers working on HTML document fragments. So without deviating too much from the original code (which I don’t want to do because it makes debugging and updating more difficult), I’ve tried to make it a little more developer friendly. You should be able to use the methods here on existing DOMElement objects without passing an entire HTML document to be parsed.


46 Comments

  1. Keyvan says:

    There’s also a Python port and a Ruby port available if PHP isn’t your language.

  2. Chris Dary says:

    This is good stuff. I’ve played with it a bit, works great. Nice work, Keyvan!

  3. Keyvan says:

    Thanks Chris! 🙂

  4. ph says:

    Hi,

    I’ve been trying to get this to work and all I’m getting is a blank screen. I turned on “debug:true” and still no output… any ideas?

  5. Keyvan says:

    ph: try adding

    error_reporting(E_ALL);
    ini_set("display_errors", 1);

    to the top of the example file and see if that produces an error message.

  6. Neal says:

    Good work, Keyvan.
    A quick question – do you plan to support multiple pages? i.e. porting findNextPageLink from the original js?

  7. Keyvan says:

    Neal: Thanks. I do plan to port that over too, yes. Hopefully sometime this month.

  8. Fab says:

    This is great work, thanks for sharing!

  9. Chris says:

    There appear to be a few issues with images.

    It seems to be killing divs with images in them, whereas the JS version does not.

    Try this URL for instance: http://www.popsci.com/cars/article/2010-09/giving-traffic-lights-mind-their-own-can-reduce-congestion-study-says

    Any ideas?

  10. Chris says:

    Oh by the way, how about creating a Github repo for this. I’d gladly contribute. I was working on something similar back when they had 1.6.x lol.

  11. Keyvan says:

    Chris: Thanks for the report. I’ll look into it. As for the github, I’ll think about it. 🙂

  12. Mridang says:

    Is this the extraction backend that powers the FiveFilters Full-Text Feeds application? Thank you.

  13. Keyvan says:

    Mridang: yes, it is

  14. Simon says:

    I’m a French speaker. I tried it on a French page and found a bug with accented characters (I know a lot of languages use characters such as À, É, È and so on) on a website not using the normal É. Is there a way to fix it?

  15. Keyvan says:

    Simon: the content should be UTF-8 encoded before you pass it to PHP Readability. If it’s not you will need to convert it. If you’ve got a URL of the page I can take a look.

  16. pat says:

    Hi, I would love this on GitHub as well. I’m willing to contribute what I have.

  17. Edward says:

    This is WONDERFUL! Thanks for creating it. Any updates on additional features?

  18. thinkery says:

    Having problems with relative image paths on websites. Any hints or updates in the making?
    Allowing Github contributions would be cool indeed.

  19. […] fact, I really want more than this – I want to use a port of Arc90′s Readability (there are many, in many different languages) to grab the content from the page I’ve tagged, […]

  20. Keyvan says:

    thinkery: PHP Readability does not automatically convert relative URLs to absolute ones, but it’s not difficult to do. Is that what you’re trying to do?

    Regarding github contributions: the source code has now moved to code.fivefilters.org using Indefero. You can now grab it with git. I hope that makes it easier for those of you who’d like to fork it and modify it. If you do make changes, please share them – I’ll consider incorporating any changes once tested.
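
On the relative URL question above: a rough sketch of converting relative href and src values to absolute ones is shown here. It is not part of PHP Readability. The function names resolve_url() and absolutize() are hypothetical, the resolution logic is a simplified take on RFC 3986 (it ignores any base element in the page, for instance), and it assumes the extracted content is available as a DOMElement.

```php
<?php
// Sketch: resolve $rel against the absolute URL $base (simplified RFC 3986).
function resolve_url(string $base, string $rel): string {
    if ($rel === '') return $base;
    // Already absolute (has a scheme such as http: or mailto:).
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $rel)) return $rel;
    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) return $rel;
    $origin = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    // Protocol-relative: //cdn.example.com/x
    if (strpos($rel, '//') === 0) return $b['scheme'] . ':' . $rel;
    // Query-only or fragment-only references keep the base path.
    if ($rel[0] === '?' || $rel[0] === '#') {
        $query = ($rel[0] === '#' && isset($b['query'])) ? '?' . $b['query'] : '';
        return $origin . ($b['path'] ?? '/') . $query . $rel;
    }
    // Root-relative, or relative to the directory of the base path.
    $path = ($rel[0] === '/') ? $rel : preg_replace('~/[^/]*$~', '/', $b['path'] ?? '/') . $rel;
    // Split off any query/fragment, then collapse "." and ".." segments.
    preg_match('~^([^?#]*)(.*)$~s', $path, $m);
    $segments = [];
    foreach (explode('/', $m[1]) as $seg) {
        if ($seg === '..') array_pop($segments);
        elseif ($seg !== '.') $segments[] = $seg;
    }
    return $origin . '/' . ltrim(implode('/', $segments), '/') . $m[2];
}

// Sketch: rewrite link and image URLs inside an extracted content element.
function absolutize(DOMElement $root, string $base): void {
    foreach (['a' => 'href', 'img' => 'src'] as $tag => $attr) {
        foreach ($root->getElementsByTagName($tag) as $el) {
            if ($el->hasAttribute($attr)) {
                $el->setAttribute($attr, resolve_url($base, $el->getAttribute($attr)));
            }
        }
    }
}
```

Assuming getContent() returns a DOMElement, as the example in the post suggests, something like absolutize($readability->getContent(), $url) after init() should do the job.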

  21. Trendy Bing says:

    Hi,
    This is a great tool. It works like a charm, and I will use it in combination with the Bing search engine. Is there a published algorithm for this kind of extraction? I would like to learn about the different algorithms used for this purpose.

  22. Brad says:

    I’m getting errors on line 293:
    $this->dom->documentElement->appendChild($this->body);
    with poor content such as when the source url has become a 404. Is there any way for this to fail gracefully?

  23. Keyvan says:

    Brad: can you give me an example of the HTML ($html) that produces that error in PHP Readability?
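
Independently of how PHP Readability handles such input, one way to avoid feeding it an error page is to check the HTTP status before parsing. The sketch below is an assumption about how a caller might do this and is not part of the library. The helper name last_http_status() is hypothetical.

```php
<?php
// Sketch: pull the final HTTP status code out of a list of raw header lines,
// such as the $http_response_header array PHP's HTTP wrapper fills in.
// When redirects were followed there are several status lines; the last wins.
function last_http_status(array $headers): ?int {
    $status = null;
    foreach ($headers as $line) {
        if (preg_match('~^HTTP/\S+\s+(\d{3})~', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}
```

After $html = @file_get_contents($url), the caller can pass $http_response_header to this function and skip Readability when the result is not 200 or when $html is false.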

  24. Al says:

    Thanks for the work in doing this, exactly what I’m looking for. However I’m getting the following error when testing using the example provided:

    Warning: tidy_parse_string() [function.tidy-parse-string]: Could not load configuration file ‘UTF8’ in /home/public_html/readability/Readability.php on line 18

    Any ideas what is causing this?

    Cheers.

  25. Keyvan says:

    Al: sorry, there was an error in the example code on this page (the example in the repository should work fine). tidy_parse_string expects the character encoding in the third argument but in the code I’d posted up it was being passed as the second argument. I’ve fixed it now by adding an empty array() as the second argument – please try copying again from the code on this page and let me know if you still get an error.

  26. Al says:

    Thank you! That was it!

    One other thing, I noticed on Gizmodo pages it doesn’t work, any idea why it returns the results it does? Example URL:

    http://gizmodo.com/#!5765852/motorola-atrix-review-a-great-phone-makes-for-a-weak-netbook

    Thanks.

  27. Keyvan says:

    Al: that’s great, thanks for letting me know.

    As for Gizmodo, they, like Twitter, have embraced a crazy new trend which breaks the way most people expect URLs to work. Tim Bray has written more about it here: http://www.tbray.org/ongoing/When/201x/2011/02/09/Hash-Blecch

    Basically, you will have to rewrite these hash bang (#!) URLs into a form which leads to a page with real content. The simple rule is replace ‘#!’ with ‘?_escaped_fragment_=’ so your example would become: http://gizmodo.com/?_escaped_fragment_=5765852/motorola-atrix-review-a-great-phone-makes-for-a-weak-netbook

    The full gory details available here: http://code.google.com/web/ajaxcrawling/docs/specification.html

    Hope that helps.
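
The simple rewrite rule described above can be sketched as a small function. The name rewrite_hashbang() is hypothetical. Using '&' instead of '?' when the URL already has a query string follows the AJAX crawling specification linked above; that specification also describes escaping certain characters in the fragment, which this sketch leaves out.

```php
<?php
// Sketch: turn a hash-bang URL into its _escaped_fragment_ form.
// URLs without "#!" are returned unchanged.
function rewrite_hashbang(string $url): string {
    $pos = strpos($url, '#!');
    if ($pos === false) {
        return $url;
    }
    $base = substr($url, 0, $pos);      // everything before "#!"
    $fragment = substr($url, $pos + 2); // everything after "#!"
    // Append to an existing query string with "&", otherwise start one.
    $separator = (strpos($base, '?') === false) ? '?' : '&';
    return $base . $separator . '_escaped_fragment_=' . $fragment;
}
```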

  28. Al says:

    Ahhh, pesky hashbangs! lol. Ok, thanks for the info. This kinda screws things up for me a bit though since I’m pulling the links in from Facebook and on facebook Gizmodo posts the links as http://gizmodo.com/5765852/motorola-atrix-review-a-great-phone-makes-for-a-weak-netbook but if you put that in the browser it becomes http://gizmodo.com/#!5765852/motorola-atrix-review-a-great-phone-makes-for-a-weak-netbook, so I basically have no way of knowing if they’re using hashbangs until the URL is fully resolved. So I’m kinda stuck..

    Thanks though!

  29. Al says:

    Following up on my last comment, In case anyone runs into issues with URLs redirecting to hashbang links I was able to resolve this by using the following function in my php code:

    function get_redirect_url($url){
    	$redirect_url = null; 
     
    	$url_parts = @parse_url($url);
    	if (!$url_parts) return false;
    	if (!isset($url_parts['host'])) return false; //can't process relative URLs
    	if (!isset($url_parts['path'])) $url_parts['path'] = '/';
     
    	$sock = fsockopen($url_parts['host'], (isset($url_parts['port']) ? (int)$url_parts['port'] : 80), $errno, $errstr, 30);
    	if (!$sock) return false;
     
    	$request = "HEAD " . $url_parts['path'] . (isset($url_parts['query']) ? '?'.$url_parts['query'] : '') . " HTTP/1.1\r\n"; 
    	$request .= 'Host: ' . $url_parts['host'] . "\r\n"; 
    	$request .= "Connection: Close\r\n\r\n"; 
    	fwrite($sock, $request);
    	$response = '';
    	while(!feof($sock)) $response .= fread($sock, 8192);
    	fclose($sock);
     
    	if (preg_match('/^Location: (.+?)$/m', $response, $matches)){
    		if ( substr($matches[1], 0, 1) == "/" )
    			return $url_parts['scheme'] . "://" . $url_parts['host'] . trim($matches[1]);
    		else
    			return trim($matches[1]);
     
    	} else {
    		return false;
    	}
     
    }
    
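    // get_redirect_url() returns false when there is no redirect (or on
    // failure), so only replace $url when a redirect target was found
    $redirect_url = get_redirect_url($url);
    if ($redirect_url) $url = $redirect_url;
    
    // Rewrite any hashbangs
    $url = str_replace("#!","?_escaped_fragment_=",$url);
    $html = file_get_contents($url);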
    
  30. Keyvan says:

    Al: that will work if you expect $url to have exactly one redirect. But if the URL returned by get_redirect_url() has further redirects, you might not catch the hash bang.

    A more robust solution would be to follow redirects one by one, resolving relative URLs and rewriting any hash-bangs you encounter, or a simpler option is to use cURL, let it handle redirects but grab the effective URL (the final URL it fetches) – see http://www.php.net/manual/en/function.curl-getinfo.php – and if that contains a hash bang, rewrite that and fetch it again. Although I haven’t tested to see if cURL preserves the fragment identifier when it returns the effective URL – if it doesn’t, then this solution will be no good.
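
A rough sketch of the cURL approach described above, assuming the cURL extension is installed. The function names are hypothetical, and, as noted, it is untested whether the effective URL reported by cURL keeps the fragment identifier, so the hash-bang check may never fire.

```php
<?php
// Sketch: fetch a URL, letting cURL follow redirects, and report the final
// (effective) URL alongside the body. Returns null if the request fails.
function fetch_with_effective_url(string $url, int $maxRedirects = 5): ?array {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow Location headers
        CURLOPT_MAXREDIRS      => $maxRedirects,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch);
    $effective = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    if ($body === false) {
        return null;
    }
    return ['url' => $effective, 'html' => $body];
}

// Sketch: if the final URL contains a hash-bang, rewrite it and fetch again.
function fetch_resolving_hashbang(string $url): ?array {
    $result = fetch_with_effective_url($url);
    if ($result !== null && strpos($result['url'], '#!') !== false) {
        $rewritten = str_replace('#!', '?_escaped_fragment_=', $result['url']);
        $result = fetch_with_effective_url($rewritten);
    }
    return $result;
}
```

The returned 'url' value is also a sensible second argument for the Readability constructor, since it reflects where the content actually came from.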

  31. Al says:

    Thanks Keyvan – some good points. I’ll work that into my script.

    I have run into another problem with this link: http://gizmodo.com/5767306/apple-will-unveil-ipad-2-on-march-2 (unfortunately another gizmodo link). Really not sure what’s happening here, looks fine with readability bookmarklet, any idea what could be causing this to happen?

    Cheers

  32. Keyvan says:

    Al: I’ll soon be collecting URLs of pages which fail extraction in an effort to improve PHP Readability. I’ve deliberately held off the desire to change the code because at the moment I don’t have a decent test framework in place to allow me to see the impact of the changes on sites other than the one in question. Once I have something in place, I’ll post up here and hopefully get help from anyone interested in improving the PHP Readability code.

    I think with the new readability.com service Arc90 are unlikely to continue developing their open source version, so perhaps a community effort can keep it alive.

    My suggestion to you regarding gizmodo.com, and any other site you think you’ll be extracting from fairly regularly, is to create your own extraction pattern and rely on PHP Readability if that pattern fails (e.g. in the case of a redesign). That’s actually what Flipboard and Instapaper appear to do – see links in the comments here: http://www.corgitoergosum.net/2011/01/17/replicating-flipboard-part-i-site-scraping/comment-page-1/#comment-39
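
The pattern-first, Readability-second idea could look something like the sketch below. Everything here is illustrative: the function name extract_article(), the shape of the $patterns array (host name mapped to an XPath expression), and the fallback callable are assumptions, not how Flipboard, Instapaper or Full-Text RSS actually implement it.

```php
<?php
// Sketch: try a site-specific XPath pattern first; if the site has no
// pattern, or the pattern no longer matches (e.g. after a redesign),
// hand the page to a generic fallback such as PHP Readability.
function extract_article(string $html, string $url, array $patterns, callable $fallback): ?string {
    $host = preg_replace('/^www\./i', '', (string) parse_url($url, PHP_URL_HOST));
    if (isset($patterns[$host])) {
        // Silence libxml warnings about sloppy real-world markup.
        $previous = libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $loaded = $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);
        if ($loaded) {
            $nodes = (new DOMXPath($dom))->query($patterns[$host]);
            if ($nodes !== false && $nodes->length > 0) {
                return $dom->saveHTML($nodes->item(0)); // first match wins
            }
        }
    }
    return $fallback($html, $url);
}
```

The fallback would typically be a closure that creates a Readability object, calls init(), and returns getContent()->innerHTML, or null when init() fails.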

  33. MetLife says:

    Good initiative. Now also available as a plugin for SPIP.

  34. Alan says:

    Just want to say thank you for this code. There are other php readability libraries popping up on git hub but none seem to work as well as this one. Would you be willing to create a repository for this on there too?

  35. […] to now we’ve relied mainly on PHP Readability to automatically identify and extract articles from web pages, and this is still how the majority […]

  36. Nikhil says:

    This is incredibly helpful. Thanks a lot for the port.

  37. […] Tayyar Be?ik, software developer @Nokta If you want to use php for this job PHPReadability http://www.keyva… (more) Sign up for free to read the full text. Login if you already have an account.This answer […]

  38. jaideep says:

    I’m making a Digg-like URL submitter.
    How far can this class help me extract the title, description and images from any submitted URL?
    Thanks

  39. Don says:

    I’m curious if it’s possible to use this to extract content from custom comment tags. For example, if the page I’m looking at contains something like: <!-- MYTAG myname=myvalue -->, is there a clear way to pull out the name and value using Readability?

  40. ragess says:

    Hello Keyvan. I came to your blog after googling ‘how to get full content from rss feed’. Your Readability sounds promising. But I tried your example and it gave so many ‘s, ‘s ‘s etc but not the text, not at all.
    I even used simplepie, but again ‘get_content()’ just gave me excerpts.
    Keyvan could you please help me in how I can get full content from rss feed? Please describe in detail, like which code to put where, if you could.
    Thanks.

  41. Keyvan says:

    MetLife: Thank you!

    Alan: Thanks! Regarding GitHub, please see earlier comment http://keyvan.net/2010/08/php-readability/#comment-322379 – our repository on code.fivefilters.org is accessible via git, anyone can fork it and place it on GitHub. I’m not interested in doing that myself.

    Nikhil: Thanks!

    jaideep: I don’t know. Try it and see. If you want more control over extraction I suggest you check our Full-Text RSS tool: http://fivefilters.org/content-only/

    Don: I wouldn’t use PHP Readability for that. Try regular expressions.

    ragess: If you’re trying to create full-text RSS feeds, I’d suggest you look at our Full-Text RSS tool: http://fivefilters.org/content-only/ You’ll find some documentation here: http://help.fivefilters.org/customer/portal/topics/62602-full-text-rss/articles
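
For Don's comment-tag question, the regular expression route suggested above might look like the following sketch. The function name find_comment_tags() is hypothetical, and it assumes the format from the question: one name=value pair per comment, with no spaces inside the value. It is meant to run on the raw HTML string rather than on Readability's output.

```php
<?php
// Sketch: collect name => value pairs from comments of the form
// <!-- MYTAG name=value --> in a raw HTML string.
function find_comment_tags(string $html, string $tag = 'MYTAG'): array {
    $pattern = '/<!--\s*' . preg_quote($tag, '/') . '\s+([\w-]+)=(\S+?)\s*-->/';
    $found = [];
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $found[$m[1]] = $m[2]; // later duplicates overwrite earlier ones
        }
    }
    return $found;
}
```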

  42. Matt says:

    That is awesome stuff. Just what I needed. Thanks!

  43. Alan says:

    I am having some trouble in an implementation and would appreciate some guidance:

    With:
    $url = 'http://business.financialpost.com/2012/01/02/asias-double-edged-currency-sword/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+FP_TopStories+%28Financial+Post+-+Top+Stories%29';

    I get this as the first part of the output:

    == Title =====================================
    Asia currencies face volatility dilemma | Investing

    == Body ======================================

    <em><strong>By Emily Kaiser, Asia economics correspondent</strong></em>
    SINGAPORE – The roller-coaster ride for Asian currencies, which saw only the yen and yuan post significant gains for the year against the U.S. dollar, is set to continue in 2012.
    While Japan actively sought to stem the yen’s rise — drawing U.S. criticism last week — China intervened to ensure the yuan ended the year at a new high. Both currencies appreciated roughly 5% in 2011 against the dollar.
    The opposite approaches illustrate a dilemma facing Asian policymakers as they try to smooth out foreign exchange rate volatility, which shows no sign of abating in the new year. If the currency is too strong, exports get more expensive. Too weak, and imported inflation spikes and domestic buying power fades.

  44. Alan says:

    @Alan followup…. it was actually showing the html markup with ‘p’ markers but I see that’s actually desired.

    But I do see the following problem: odd letters/symbols show up before the word We and after the word government

    “We believe the government’s

  45. Andy says:

    Keyvan, another heuristic to consider is adding elements with an explicit style=”display:none” to the unlikely candidates list. I ran into some examples where a hidden DIV contained a bunch of text that the user would never see, and modified my copy of the library to throw these out.

  46. Vanina says:

    I can not seem to access the code online. The page redirects to a 404.
    Where can I find the source code?

  47. Keyvan says:

    Alan: Sorry for the late reply. That appears to be a character encoding issue. You need to make sure whatever you give PHP Readability is in UTF-8. And treat its output as UTF-8.

    Andy: Yes, we actually do that in Full-Text RSS. I guess it wasn’t in the original Readability code as those elements probably weren’t being considered when run in the browser.

    Vanina: We moved code.fivefilters.org which unfortunately broke a few URLs. I’ve updated the links on this page, so please try again. Thanks for the report.

  48. Frank says:

    Hey Keyvan,
    Thanks for sharing! Quick question about images.
    In a comment above, you were saying you were going to look into it as they seem to get killed in the process. Have you worked on a fix? 🙂
    I’d like to be able to use them along with text in a small app I’m building.
    Thanks!

  49. Keyvan says:

    Frank: regarding images, we’ve made a few changes to PHP Readability that will go into the release of Full-Text RSS 3.1. The changes should preserve more images and embedded videos. Once we’re ready with that release I’ll update the PHP Readability code linked here.
