PHP Port of Arc90’s Readability

Update 2011-03-23: Readers may also be interested in how we use PHP Readability at FiveFilters.org: Content Extraction at FiveFilters.org

Last year I ported Arc90’s Readability to PHP for use in the Five Filters project. In the year since, Readability has improved a lot, thanks to Chris Dary and the rest of the team at Arc90.

As part of an update to the Full-Text RSS service, I started porting a more recent version (1.6.2) to PHP, and the code is now online.

For anyone not familiar, Readability was created for use as a browser addon (a bookmarklet). With one click it transforms web pages for easy reading and strips away clutter. Apple recently incorporated it into Safari Reader.

It’s also very handy for content extraction, which is why I wanted to port it to PHP in the first place. Here’s an example of how to use the PHP port:

require_once 'Readability.php';
header('Content-Type: text/plain; charset=utf-8');

// get latest Medialens alert 
// (change this URL to whatever you'd like to test)
$url = 'http://medialens.org/alerts/index.php';
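$html = file_get_contents($url);
// file_get_contents() returns false if the page could not be fetched
if ($html === false) {
	exit("Could not fetch $url\n");
}
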
// PHP Readability works with UTF-8 encoded content. 
// If $html is not UTF-8 encoded, use iconv() or 
// mb_convert_encoding() to convert to UTF-8.

// If we've got Tidy, let's clean up input.
// This step is highly recommended - PHP's default HTML parser
// often does a terrible job and results in strange output.
if (function_exists('tidy_parse_string')) {
	$tidy = tidy_parse_string($html, array(), 'UTF8');
	$tidy->cleanRepair();
	$html = $tidy->value;
}

// give it to Readability
$readability = new Readability($html, $url);

// print debug output?
// useful to compare against Arc90's original JS version -
// simply click the bookmarklet with Firebug's
// console window open
$readability->debug = false;

// convert links to footnotes?
$readability->convertLinksToFootnotes = true;

// process it
$result = $readability->init();

// does it look like we found what we wanted?
if ($result) {
	echo "== Title ===============================\n";
	echo $readability->getTitle()->textContent, "\n\n";

	echo "== Body ===============================\n";
	$content = $readability->getContent()->innerHTML;

	// if we've got Tidy, let's clean it up for output
	if (function_exists('tidy_parse_string')) {
		$tidy = tidy_parse_string($content, 
			array('indent'=>true, 'show-body-only'=>true), 
			'UTF8');
		$tidy->cleanRepair();
		$content = $tidy->value;
	}
	echo $content;
} else {
	echo 'Looks like we couldn\'t find the content.';
}
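The example above notes that non-UTF-8 input should be converted before it is handed to Readability. A minimal sketch of that step follows. It assumes the mbstring extension is available and that anything which is not valid UTF-8 is ISO-8859-1; in practice the source charset would come from the HTTP headers or a meta tag. The helper name `toUtf8` is hypothetical and not part of PHP Readability.

```php
<?php
// Hypothetical helper, not part of PHP Readability.
// Returns $html unchanged if it is already valid UTF-8; otherwise
// converts it from $fallback (an assumed source encoding) to UTF-8.
function toUtf8(string $html, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    return mb_convert_encoding($html, 'UTF-8', $fallback);
}

// "\xE9" is "é" in ISO-8859-1 and is not valid UTF-8 on its own.
$latin1 = "caf\xE9";
$utf8 = toUtf8($latin1);
```

In the example above, this would be called on `$html` right after fetching the page and before the Tidy step.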

Differences between the PHP port and the original

Arc90’s Readability is designed to run in the browser. It works on the DOM tree (the parsed HTML) after the page’s CSS styles have been applied and its JavaScript has executed. This PHP port does not run inside a browser. We use PHP’s HTML parser to build our DOM tree, but we cannot rely on CSS or JavaScript support, so the results will not always match Arc90’s Readability. For example, if a web page uses CSS rules or JavaScript to hide certain HTML elements, Arc90’s Readability will dismiss those elements from consideration, but the PHP port, unable to interpret CSS or JavaScript, will treat them like any other content.
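One partial workaround, offered here as an assumption rather than something PHP Readability does itself, is to drop elements hidden through an inline style attribute before extraction. The sketch below uses PHP's DOM extension. The helper name `removeInlineHidden` is hypothetical, and it only catches lowercase inline `display:none` and `visibility:hidden`; rules in stylesheets and changes made by scripts remain invisible to it.

```php
<?php
// Hypothetical pre-processing helper, not part of PHP Readability.
// Removes elements whose inline style attribute hides them and
// returns the number of elements that matched.
function removeInlineHidden(DOMDocument $doc): int
{
    $xpath = new DOMXPath($doc);
    // translate() strips spaces so "display: none" and "display:none" both match.
    $style = 'translate(@style, " ", "")';
    $query = '//*[contains(' . $style . ', "display:none")'
           . ' or contains(' . $style . ', "visibility:hidden")]';

    // Take a snapshot of the matches before modifying the tree.
    $hidden = iterator_to_array($xpath->query($query));
    foreach ($hidden as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }
    return count($hidden);
}

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Visible text</p>'
    . '<div style="display: none">Hidden text</div></body></html>');

$removed = removeInlineHidden($doc);
$cleaned = $doc->saveHTML();
```

The cleaned markup in `$cleaned` could then be passed to the Readability constructor in place of the raw HTML.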

Another significant difference is that Arc90’s Readability aims to re-present the main content block of a web page so users can read it more easily in their browsers. Correctly identifying, cleaning up, and separating the content block is only one part of that process. This PHP port is concerned only with that part. It does not include code related to presentation in the browser (Arc90 already do that extremely well), and for PDF output there’s FiveFilters.org’s PDF Newspaper.

Finally, this class contains methods that might be useful for developers working on HTML document fragments. Without deviating too much from the original code (which would make debugging and updating more difficult), I’ve tried to make it a little more developer-friendly. You should be able to use the methods here on existing DOMElement objects without passing an entire HTML document to be parsed.


3 Comments

  1. Michael says:

    Hello. I’ve tried the class you posted and it’s really great! I’ve tested it on different types of web pages and on most of them it gives awesome results!

    But there’s a type of web page that holds multiple blocks of content of similar size. I turned debug on and it showed me scores of 42 to 52, and I think that instead of grabbing just one of them it would be better to take several.

    I was thinking about having a threshold, say 20-30% of the top candidate’s score, and taking all candidates that fall within it. In my case 20% of 52 is 10.4, so all candidates with (score > 41.6) would be included in the output.

    Before I dive into rewriting it for my needs I wanted to ask this: do you have a version that extracts X top candidates instead of one or the way I described with the threshold? I’ve looked into the grabArticle code and it looks like it won’t be a quick fix to implement something like that.

    Feel free to contact me if you find this idea interesting.

  2. Keyvan says:

    Hi Michael, that’s interesting. I’m not aware of a version that does that. For use on FiveFilters.org, we write custom extraction rules for sites where PHP Readability doesn’t extract what we want. In cases where we want to extract multiple elements, we use XPath to select them.

    I’m afraid I can’t help with what you’re trying to achieve, but it does sound interesting.

  3. Michael says:

    Thanks for taking the time to reply.

    I will check the code again and try to come up with a way to make the changes I need. Happily, it’s very well documented and easy to understand.

    Best regards.
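
The threshold idea from the first comment can be sketched as follows. This is a minimal illustration only, not code from PHP Readability: the function name `candidatesNearTop`, the score array, and its labels are hypothetical, and it assumes all scores are positive.

```php
<?php
// Hypothetical helper, not part of PHP Readability.
// Keeps every candidate whose score lies within $margin (a fraction,
// e.g. 0.2 for 20%) of the top score. Assumes positive scores.
function candidatesNearTop(array $scores, float $margin = 0.2): array
{
    if ($scores === []) {
        return [];
    }
    $top = max($scores);
    $cutoff = $top - $top * $margin; // for a top score of 52: 52 - 10.4 = 41.6

    // array_filter() preserves the original keys.
    return array_filter($scores, function ($score) use ($cutoff) {
        return $score > $cutoff;
    });
}

// Made-up candidate scores in the range mentioned in the comment.
$scores = ['block-a' => 52, 'block-b' => 47, 'block-c' => 42, 'block-d' => 12];
$kept = candidatesNearTop($scores);
```

With these made-up scores the cutoff is 41.6, so the first three blocks are kept and the last is dropped.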