Clean up HTML on paste in CKEditor

We use CKEditor at FiveFilters.org for our PastePad service. The idea is to allow users to paste content that’s not currently publicly available on the web for processing with one of our web tools. This can be content that’s in a Word document, an email, or behind a paywall.

CKEditor can automatically clean up HTML it identifies as coming from MS Word, but there’s no way to force cleanup on all pasted content. By default, HTML cleanup occurs in the following two cases:

  1. User clicks the ‘paste from word’ toolbar icon
  2. User pastes content copied from MS Word itself

In the second case, CKEditor looks for signs of MS Word formatting. It does this by testing whatever you paste against the following regular expression:

/(class=\"?Mso|style=\"[^\"]*\bmso\-|w:WordDocument)/

If there’s a match, it will be cleaned up. Otherwise it will paste as normal.
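
To make the check concrete, here is a minimal sketch that runs the regular expression quoted above against two sample strings. The helper name looksLikeWordHtml and the sample markup are illustrative assumptions, not part of CKEditor.

```javascript
// The detection pattern quoted above, copied as-is.
const WORD_MARKUP = /(class=\"?Mso|style=\"[^\"]*\bmso\-|w:WordDocument)/;

// Hypothetical helper: true when pasted HTML would be treated as Word content.
function looksLikeWordHtml(html) {
  return WORD_MARKUP.test(html);
}

const fromWord = '<p class="MsoNormal">Hello</p>';
const fromWeb = '<p class="intro">Hello</p>';

console.log(looksLikeWordHtml(fromWord)); // true
console.log(looksLikeWordHtml(fromWeb));  // false
```

Only the first sample would be cleaned up; the second would be pasted as normal.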

I want to avoid editing core files, so my solution is simply to ensure that this regular expression always matches pasted content. Here’s what I’ve come up with:

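CKEDITOR.on('instanceReady', function(ev) {
    // Priority 9 runs before the Paste from Word plugin's listener (priority 10).
    ev.editor.on('paste', function(evt) {
        // Prepend a comment that the Word-detection regular expression will match.
        evt.data['html'] = '<!--class="Mso"-->' + evt.data['html'];
    }, null, null, 9);
});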

I haven’t tested extensively, but this appears to work as expected (CKEditor 3.6.2). You can try it out.

The code registers a new listener for the paste event, just as the Paste from Word plugin does. When the listener receives the pasted HTML, it prepends an HTML comment containing one of the strings the Paste from Word plugin looks for. The listener has a priority of 9 so that it runs before the plugin’s own listener (default priority of 10), which triggers the actual cleaning.
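
The effect of the priority value can be illustrated outside the editor. The sketch below uses a tiny stand-in event system (createEmitter is a hypothetical helper, not CKEditor’s implementation) in which lower priority numbers run first, which is the behaviour the explanation above relies on.

```javascript
// Minimal stand-in for an event system with numeric priorities.
// This is NOT CKEditor's implementation; it only illustrates the ordering.
function createEmitter() {
  const listeners = [];
  return {
    on(fn, priority = 10) {
      listeners.push({ fn, priority });
      // Lower numbers run first; ties keep registration order (stable sort).
      listeners.sort((a, b) => a.priority - b.priority);
    },
    fire(data) {
      for (const { fn } of listeners) fn(data);
      return data;
    },
  };
}

const WORD_MARKUP = /(class=\"?Mso|style=\"[^\"]*\bmso\-|w:WordDocument)/;
const paste = createEmitter();

// Stand-in for the Paste from Word plugin, registered first at priority 10.
paste.on((data) => {
  data.cleaned = WORD_MARKUP.test(data.html);
}, 10);

// Our listener, registered later but at priority 9, so it still runs first.
paste.on((data) => {
  data.html = '<!--class="Mso"-->' + data.html;
}, 9);

const result = paste.fire({ html: '<p>Plain web content</p>' });
console.log(result.cleaned); // true
```

Had our listener used priority 10 or higher, the detection step would have run before the marker was added and the content would not have been flagged.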

Note: I posted this solution on StackOverflow as an alternative to another solution, titled “CKEditor – use pastefromword filtering on all pasted content.” StackOverflow recently deleted some of my answers (and hid them from me) so I’m moving the rest of my meagre contributions over to my own blog.

Posted in Code

Push to Kindle e-mail service

Push to Kindle, FiveFilters.org’s web service for sending web articles to your Kindle, can now also be used by e-mail. The email service is aimed at iPad and iPhone users.

Step by step

  1. On your device, load an article you’d like to send to your Kindle
  2. Choose share page
  3. In the list of options presented, select Mail
  4. Enter your Kindle email address but instead of @kindle.com, enter @pushtokindle.com
  5. Send!

Changing the ending to @pushtokindle.com in step 4 ensures our service processes the article first and then sends it to your Kindle account.

The first time you do this, you’ll receive an email from FiveFilters.org asking you to confirm the address you’re sending from. After confirming, you’ll have the opportunity to save your Push to Kindle email address in your contacts list to make future sending easier. (Simply typing ‘kin’ into the To: field should show your Push to Kindle address as an option.)


If you own a 3G Kindle device and you want to make sure you will not be charged by Amazon, please send to @free.pushtokindle.com. (For the time being we are only sending to @free.kindle.com, but this might change in future.)
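
If you want to prepare the address programmatically, for example to pre-fill a contact entry, the rewrite is a simple domain swap. This is only a sketch: the helper name toPushToKindleAddress and the sample address are hypothetical.

```javascript
// Hypothetical helper: rewrite a Kindle address so mail goes through Push to Kindle.
// Pass { free: true } to target the @free.pushtokindle.com variant instead.
function toPushToKindleAddress(kindleAddress, { free = false } = {}) {
  const at = kindleAddress.lastIndexOf('@');
  if (at < 1) {
    throw new Error('Expected an address like name@kindle.com');
  }
  const name = kindleAddress.slice(0, at); // keep the part before the @ unchanged
  const domain = free ? 'free.pushtokindle.com' : 'pushtokindle.com';
  return name + '@' + domain;
}

console.log(toPushToKindleAddress('jane@kindle.com'));
console.log(toPushToKindleAddress('jane@kindle.com', { free: true }));
```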

Why an e-mail service?

We already have a Push to Kindle Android app. It adds ‘Push to Kindle’ as an entry in your device’s share menu, so whenever you want to send a web article to your Kindle, you bring up the share menu and choose Push to Kindle.

We considered doing the same for iOS and other mobile devices, but decided to focus on email for two reasons:

  1. Unlike Android, iOS and Windows Phone operating systems do not yet allow apps to add entries to the share menu.
  2. The share menu on most mobile devices does, however, include e-mail as an option.

Pricing

The first 25 articles processed by our e-mail service are free. After that, you’ll be asked to purchase credits, which allows us to maintain the service.

100 credits cost €1.50 (around £1.20 or $2).

Each article sent uses 1 credit. You will receive an email notice when your credits are low.

Note: credits are linked to the email address you send from, not your Kindle address.
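
As a rough worked example of the pricing above, the sketch below estimates the cost for a given number of articles. It assumes credits can only be bought in whole packs of 100, which the pricing suggests but does not state; the function name estimateCost is hypothetical.

```javascript
const FREE_ARTICLES = 25;   // the first 25 articles are free
const PACK_SIZE = 100;      // credits per purchase
const PACK_PRICE_EUR = 1.5; // price of one pack of credits

// Assumption: credits are sold only in whole packs of 100.
function estimateCost(articles) {
  const paidArticles = Math.max(0, articles - FREE_ARTICLES);
  const packs = Math.ceil(paidArticles / PACK_SIZE);
  return { paidArticles, packs, costEur: packs * PACK_PRICE_EUR };
}

console.log(estimateCost(20));  // { paidArticles: 0, packs: 0, costEur: 0 }
console.log(estimateCost(225)); // { paidArticles: 200, packs: 2, costEur: 3 }
```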

Compared to Amazon’s email service

Amazon’s Send to Kindle email service currently works by accepting documents as attachments to an email message.

Web articles you read online are usually not in a format that can be sent to your Kindle account directly. They need to be cleaned up and converted to a suitable format first. That’s what our Push to Kindle service does. We take care of extracting the content and converting the article to a suitable format for your Kindle. We then send the result as an attachment to your Kindle account.

Bear in mind

We’re working to integrate this service with our sustainer membership. Once that’s done this service will be free for new and existing sustainers.

All articles are currently considered equal: 1 credit = 1 article. In the future this may change. For example, in line with our goal to encourage use of non-corporate sources, we’ll be whitelisting many non-corporate sources so no credits will be used if you process articles from these sources. Conversely, we may deduct more credits for articles originating from corporate sources.

Please consider this an experimental service. Let us know if you experience any issues and we’ll be happy to help. Email help@fivefilters.org.

Posted in General

Push to Kindle supports sending to Duokan

Our Push to Kindle service has been updated to enable delivery to iduokan.com addresses — a system similar to Amazon’s personal documents service but designed to work with the Duokan software.

Note: Sending to @iduokan.com addresses has been enabled on our web app. The Android app has not yet been updated.

Thanks to Daniel ?o?opa for testing.

Posted in General

Full-Text RSS 3.0

Full-Text RSS 3.0 is now available.

What is it?

Full-Text RSS is a free software PHP application to help you extract content from web pages. It can extract content from a standard HTML page and return a 1-item feed or it can transform an existing feed into a full-text feed.

It’s used primarily by news enthusiasts and developers.

It’s used by news enthusiasts who dislike partial web feeds – feeds which require them to read the full story on a different site, rather than their preferred application. Full-Text RSS can convert these feeds to full-text versions, allowing the reader to stay in his/her preferred environment to read the full story.

It’s used by developers building applications which need an article extraction component. It allows developers to retrieve and process only the content they’re interested in.

Demo

Try it out – enter a URL in the form and hit ‘Create Feed’.

What’s new in 3.0

Extraction

Multi-page support
Many web sites now split their articles into a number of pages. In earlier versions of Full-Text RSS we added support for retrieving the single-page view and extracting content from that page. For sites which do not offer such a single-page view, we can now follow the ‘next page’ links and build up the full article page by page.

Multi-page support currently works by specifying a next_page_link in the site config file associated with the website you are extracting from.

Examples:

next_page_link: //a[@id='next-page']
next_page_link: //a[contains(text(), 'Next page')]
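
The page-by-page process can be sketched as a simple loop. This is an illustration in JavaScript, not the PHP code Full-Text RSS uses: fetchPage and findNextLink are hypothetical callbacks, with findNextLink standing in for evaluating the next_page_link expression, and the maxPages limit is an assumption.

```javascript
// Sketch of following "next page" links until none is found.
// fetchPage(url) returns a page object; findNextLink(page) returns a URL or null.
// Both are hypothetical callbacks supplied by the caller.
function collectPages(startUrl, fetchPage, findNextLink, maxPages = 10) {
  const bodies = [];
  const seen = new Set();
  let url = startUrl;
  while (url && !seen.has(url) && bodies.length < maxPages) {
    seen.add(url); // guard against pages that link back to each other
    const page = fetchPage(url);
    bodies.push(page.body);
    url = findNextLink(page);
  }
  return bodies.join('\n');
}

// Usage with an in-memory "site" of three pages.
const site = {
  '/a': { body: 'Part 1', next: '/b' },
  '/b': { body: 'Part 2', next: '/c' },
  '/c': { body: 'Part 3', next: null },
};
const article = collectPages('/a', (u) => site[u], (p) => p.next);
console.log(article);
```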

HTML5 parser: html5lib
By default we still rely on PHP’s fast libxml parser. For sites where this proves problematic, you can now specify html5lib – a PHP implementation of an HTML parser based on the HTML5 spec.

Example:

parser: html5lib

Better AJAX handling
Full-Text RSS does not interpret any JavaScript it comes across when fetching pages. To get at the content, we expect it to be marked up in HTML. Some sites have started relying on the user’s browser and its JavaScript support to load page content. For pages which load content in this way, Google suggests that the publisher also offers the content in plain HTML so Google’s search engine crawlers can access it. Google’s spec contains two possible triggers which will guide Google’s crawlers to the HTML version.

The first trigger appears in the URL. These URLs are often called ‘hashbang’ URLs. Example: https://twitter.com/#!/search-home

The second trigger can appear in the HTML header. Example:

<meta name="fragment" content="!">

When encountered, these triggers will result in a new URL being generated, which Google terms an ‘Ugly URL’. The new URL will contain additional query string parameters to indicate to the server that the plain HTML version is being requested.

Earlier versions of Full-Text RSS looked for the first trigger (‘hashbang’ in the URL) but not the second trigger. Full-Text RSS 3.0 now handles both.
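
As an illustration of the URL rewriting, the sketch below builds the ‘Ugly URL’ using the _escaped_fragment_ query parameter from Google’s AJAX crawling scheme. It is a simplified JavaScript sketch rather than the PHP code in Full-Text RSS: the function name toUglyUrl is hypothetical, and only a handful of special characters are escaped.

```javascript
// Sketch: build the "Ugly URL" for a page using either trigger.
// hasMetaFragment should be true when the page's HTML header contains the
// fragment meta tag (the second trigger). Simplified escaping, for illustration.
function toUglyUrl(url, hasMetaFragment = false) {
  const marker = url.indexOf('#!');
  if (marker === -1) {
    if (!hasMetaFragment) return url; // no trigger: leave the URL alone
    const sep = url.includes('?') ? '&' : '?';
    return url + sep + '_escaped_fragment_=';
  }
  const base = url.slice(0, marker);
  const fragment = url.slice(marker + 2);
  // Percent-encode characters that would otherwise break the query string.
  const escaped = fragment.replace(/[%#&+ ]/g, (ch) =>
    '%' + ch.charCodeAt(0).toString(16).toUpperCase());
  const sep = base.includes('?') ? '&' : '?';
  return base + sep + '_escaped_fragment_=' + escaped;
}

console.log(toUglyUrl('https://twitter.com/#!/search-home'));
// https://twitter.com/?_escaped_fragment_=/search-home
console.log(toUglyUrl('http://example.com/page', true));
// http://example.com/page?_escaped_fragment_=
```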

Site config extraction patterns updated
Site config files are used to fine-tune extraction where autodetection doesn’t always work. There are now over 700 site config files. Many old ones have been updated and new ones added.

We also now look for OpenGraph title and date elements.

Developers

Cross-origin resource sharing (CORS) support
If Full-Text RSS is hosted on a different domain from your application, enabling CORS will allow your application to request JSON results from Full-Text RSS directly from the user’s browser without being blocked by the browser’s same-origin policy.

To enable CORS, look at $options->cors in the config file.

JSONP support
The old way of circumventing the browser’s same origin policy was to use JSONP. You can do this by requesting JSON (&format=json) with an additional callback function (&format=json&callback=functionName).
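
A request URL with these parameters could be assembled as follows. The format and callback parameters come from the text above; the endpoint path, the url parameter name, and the helper name buildJsonRequest are assumptions for illustration, so check them against your own installation.

```javascript
// Assumptions: the endpoint path and the `url` parameter name are illustrative.
// `format` and `callback` are the parameters described in the text above.
function buildJsonRequest(endpoint, articleUrl, callbackName) {
  const request = new URL(endpoint);
  request.searchParams.set('url', articleUrl);
  request.searchParams.set('format', 'json');
  if (callbackName) {
    // Adding a callback name asks for JSONP instead of plain JSON.
    request.searchParams.set('callback', callbackName);
  }
  return request.toString();
}

console.log(buildJsonRequest(
  'http://example.org/full-text-rss/makefulltextfeed.php',
  'http://example.com/article',
  'handleFeed'
));
```

In a browser, the resulting URL would be used as the src of a script element, with handleFeed defined as a global function that receives the JSON result.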

Global site config
The global site config accepts everything a regular site config file does, but it’s applied to all sites, whether or not a specific site config matches.

The global site config file should be named global.txt and placed inside the relevant site_config/ subfolder.

Site config merging
Site config files are used to fine-tune extraction where autodetection doesn’t always work.

Previous versions of Full-Text RSS looked for site config files in the following order:

  1. URL hostname match or wildcard match in the site_config/custom/
  2. URL hostname match or wildcard match in the site_config/standard/
  3. fingerprint match (HTML fragment mapping to hostname) in site_config/custom/
  4. fingerprint match (HTML fragment mapping to hostname) in site_config/standard/

As soon as an entry was matched, we’d process it, return it, and stop looking.

In Full-Text RSS 3.0, we follow the same order, but continue looking even if there’s a match. We build up the site config by appending any new entries we find. In addition, we also look for and combine global site config files:

  1. global rules in site_config/custom/global.txt
  2. global rules in site_config/standard/global.txt

To prevent this behaviour, you can enter autodetect_on_failure: no in the site config file. This will end the chain. The config files before and including this one will be loaded and merged, but no others.
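
The merging behaviour can be sketched like this. It is an illustrative JavaScript model, not the actual PHP implementation: each config is represented as an object mapping a directive name to an array of values, and the function name mergeSiteConfigs and the sample file names are hypothetical.

```javascript
// Sketch of the merge described above, not the actual PHP implementation.
// `configs` lists the matching site configs in lookup order; each one maps
// a directive name to an array of values.
function mergeSiteConfigs(configs) {
  const merged = {};
  for (const config of configs) {
    for (const [directive, values] of Object.entries(config)) {
      if (directive === 'autodetect_on_failure') continue; // control flag, not merged
      merged[directive] = (merged[directive] || []).concat(values);
    }
    // "autodetect_on_failure: no" ends the chain after this file is merged.
    if (config.autodetect_on_failure === false) break;
  }
  return merged;
}

const merged = mergeSiteConfigs([
  { title: ['//h1'], autodetect_on_failure: false }, // e.g. custom/example.com.txt
  { body: ["//div[@id='story']"] },                  // e.g. standard/example.com.txt (not reached)
  { strip: ["//div[@class='ads']"] },                // e.g. a global.txt (not reached)
]);
console.log(merged); // { title: [ '//h1' ] }
```

Without the autodetect_on_failure entry in the first config, all three sets of rules would be combined.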

XSS filtering
We have not enabled XSS filtering by default because we assume the majority of our users do not display the HTML retrieved by Full-Text RSS in a web page without further processing. If you subscribe to our generated feeds in your news reader application, it should, if it’s good software, already filter the resulting HTML for XSS attacks, making it redundant for Full-Text RSS to do the same. Similarly with frameworks/CMS which display feed content – the content should be treated like any other user-submitted content.

If you are writing an application yourself which is processing feeds generated by Full-Text RSS, you can either filter the HTML yourself to remove potential XSS attacks or enable this option. This might be useful if you are processing our generated feeds with JavaScript on the client side – although there’s client-side XSS filtering available too, e.g. JsHtmlSanitizer.

If enabled, we’ll pass retrieved HTML content through htmLawed with the safe flag on and style attributes denied (see htmLawed’s readme).

Note: if enabled this will also remove certain elements you may want to preserve, such as iframes.

Site config editor
Full-Text RSS 3.0 now comes with a site config editor available in the admin area (accessible via the admin/ folder). This lets you find, edit, and test existing site config files, or add new ones.

Note: We suggest you make changes to the site config files using a local installation of Full-Text RSS and upload the results to your server when ready. Site config files are simple text files stored on disk. Cloud hosting environments do not always offer persistent file storage, so changes made to a hosted copy on such environments may be lost.

Debug mode
Debug mode allows you to see what happens behind the scenes when Full-Text RSS is running. This is useful if you want to see things such as:

  • URL redirects
  • Which site config files are loaded
  • Whether the single_page_link and next_page_link expressions match
  • Which XPath expressions end up matching title, body, date, author

Performance

Site config caching in APC
If you run Full-Text RSS in a hosting environment which has APC enabled, it can take advantage of APC’s user cache – a memory cache. If enabled we will store site config files (when requested for the first time) in APC’s user cache – avoiding disk access on subsequent requests. See $options->apc in the config file to enable. Keys in APC are prefixed with ‘sc.’

Note: $options->apc has no effect if APC is unavailable on your server.

Smart cache (experimental)
If you enable caching and APC, you can also try out the experimental smart cache. The intention here is, again, to reduce disk access. With this enabled we will not write Full-Text RSS’s results to disk straight away. Instead, we’ll store the generated cache key in APC’s user cache for 10 minutes. If a subsequent request comes in matching the cache key, we’ll write the result to disk. Requests after that matching the cache key will be loaded from disk. See $options->smart_cache in the config file to enable. Keys in APC are prefixed with ‘cache.’

Note: this has no effect if APC is disabled or unavailable on your server, or if you have caching disabled.
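
The smart cache decision can be modelled in a few lines. The sketch below is a JavaScript illustration of the logic described above, not the PHP implementation: two Maps stand in for APC’s user cache and the disk cache, the clock value is passed in so the example is deterministic, and the names handleRequest and cache.abc are hypothetical.

```javascript
// Sketch of the smart-cache decision, using Maps in place of APC and disk.
// `now` is passed in (milliseconds) so the example needs no real clock.
const TTL_MS = 10 * 60 * 1000; // keys are remembered for 10 minutes
const seenKeys = new Map();    // stands in for APC's user cache
const disk = new Map();        // stands in for the on-disk cache

function handleRequest(key, generate, now) {
  if (disk.has(key)) return { source: 'disk', result: disk.get(key) };
  const result = generate();
  const seenAt = seenKeys.get(key);
  if (seenAt !== undefined && now - seenAt < TTL_MS) {
    disk.set(key, result); // second request inside the window: persist to disk
    return { source: 'generated+saved', result };
  }
  seenKeys.set(key, now);  // first request: only remember the key
  return { source: 'generated', result };
}

const make = () => '<feed/>';
console.log(handleRequest('cache.abc', make, 0).source);      // generated
console.log(handleRequest('cache.abc', make, 60000).source);  // generated+saved
console.log(handleRequest('cache.abc', make, 120000).source); // disk
```

A URL requested only once never touches the disk, which is the saving the feature is after.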

Cloud ready

Host for free on AppFog
AppFog offer users free hosting with 2GB RAM. That’s more than enough to run Full-Text RSS for most users.

To get started:

  1. Create a free account
  2. Install the AppFog command-line client (af)
  3. Change into the Full-Text RSS folder
  4. Type af push
  5. Follow the prompts and you’re done.

Note: if you get a 701 error saying the URL has been taken, edit manifest.yml and comment out the lines starting with name: and url: by inserting a hash sign (#) at the beginning of each line. Save and try again. This time af will prompt you for an application name and URL.

Override config options with environment variables
Most of the config options in the config file can now be overridden with environment variables. When creating environment variables, use the option name prefixed with ‘ftr_’. For example, to override $options->max_entries and limit the maximum to 2, create an environment variable with key ftr_max_entries and value 2.
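
The naming rule can be illustrated with a short sketch. This is a JavaScript model of the idea, not the PHP code in Full-Text RSS: the helper name applyEnvOverrides, the default value of 10, and the number conversion are assumptions.

```javascript
// Hypothetical helper: copy ftr_-prefixed environment variables onto options.
// Only options that already exist are overridden. Values become numbers when
// the existing option is a number (an assumption about type handling).
function applyEnvOverrides(options, env) {
  const result = { ...options };
  for (const name of Object.keys(options)) {
    const value = env['ftr_' + name];
    if (value === undefined) continue;
    result[name] = typeof options[name] === 'number' ? Number(value) : value;
  }
  return result;
}

const defaults = { max_entries: 10 }; // illustrative default, not the shipped value
console.log(applyEnvOverrides(defaults, { ftr_max_entries: '2' })); // { max_entries: 2 }
```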

What didn’t make it

No monitored feeds
One feature which didn’t make this release is the ability to create monitored feeds with PubSubHubbub support. This was specifically to improve the speed with which generated feeds updated within Google Reader’s system. Unfortunately this feature is not yet ready – we’ve not had great results in our tests, so won’t be releasing until we’re happy.

Config options removed
The following config options were removed:

  • $options->restrict
  • $options->message_to_prepend_with_key
  • $options->message_to_append_with_key
  • $options->error_message_with_key
  • $options->alternative_url

No extraction with CSS selector
You can no longer specify what should get extracted with a CSS selector passed in the querystring.

Posted in General

Push to Kindle: some stats

Our Push to Kindle service has become quite popular since we launched. Over 25,000 people currently use our Chrome extension, 7,000 use the Firefox extension and over 2,000 have installed our Android app.

I recently decided to check how much of the content processed by our Push to Kindle service comes from corporate news sources. Here’s what I found:

Rank   Domain              Percentage
#1     nytimes.com         2.62%
#4     guardian.co.uk      1.32%
#15    bbc.co.uk           0.51%
#48    telegraph.co.uk     0.22%
#97    independent.co.uk   0.11%

This is based on data collected over a period of 3 weeks.

I’m glad to see our users do not rely too much on corporate news sources. However, as the main goal of the FiveFilters.org project is to promote independent, non-corporate media, I’ll be thinking about ways to direct people to non-corporate sources of news and analysis in future updates.

For the time being, if a New York Times article is loaded, I’ve added a tab with links to The NYTimes eXaminer (‘An antidote to the “paper of record”’). Similarly, if an article from The Guardian, BBC or Independent is loaded, users will see a tab with links to Medialens.

Posted in General