Making EPUB from WordPress (and other) web collections


As part of Workpackage 3 I have been looking at WordPress as a way of creating scholarly monographs. This post carries on from the last couple, but it’s not really about EPUB or about WordPress, it’s about interoperability and how tools might work together in a Scholarly HTML mode so that people can package and repackage their resources much more reliably and flexibly than they can now.

While exploring WordPress I had a look at the JISC funded KnowledgeBlog project. The team there has released a plugin for WordPress to show a table of contents made up of all the posts in a particular category. It seemed that with a bit of enhancement this could be a useful component of a production workflow for book-like project, particularly for project reports and theses (where they are being written online in content management systems maybe not so common now, but likely to become more common) and for course materials.

Recently I looked at Anthologize, a WordPress-based way of creating ebooks from HTML resources sourced from around the web (I noted a number of limitations which I am sure will be dealt with sooner or later). Anthologize is using a design pattern that I have seen a couple of times with EPUB, converting the multiple parts of a project to an XML format that already has some tools for rendering and using those tools to generate outputs like PDF or EPUB. Asciidoc does this using the DocBook tool-chain and Anthologize uses TEI tools. I will write more on this design pattern and its implications soon. There is another obvious approach; to leave things in HTML and build books from that, for example using Calibre which already has ways to build ebooks from HTML sources. This is an approach which could be added to Anthologize very easily, to complement the TEI approach.

So, I have put together a workflow using Calibre to build EPUBs straight from a blog.

Why would you want to do this? Two main reasons. Firstly, to read a report, thesis or course, or an entire blog on a mobile device. Secondly, to be able to deposit a snapshot of same into a repository.

In this post I will talk about some academic works:

The key to this effort is the KnowledgeBlog table of contents plugin ktoc, with some enhancements I have added to make it easier to harvest web content into a book.

The results are available on a Calibre server I’m running in the Amazon cloud just for the duration of this project. (The server is really intended for local use, the way I am running it behind an Apache reverse proxy it doesn’t seem very happy you may have to refresh a couple of times until it comes good). This is rough. It is certainly not production quality.


These books are created using calibre ‘recipes’: available here. You run them like this:

ebook-convert thesis-demo.recipe .epub --test

If you are just trying this out, to be kind to site owners --test will cause it to only fetch a couple of articles per feed.

I added them to the calibre server like this:

calibredb add --library-path=./books thesis-demo.epub

The projects page at my site has two TOCs for two different projects.

[ktoc cat="jiscPUB" title="Digital Monograph Technical Landscape study #jiscPUB" show_authors="false" orderby="date" toc_author="Peter Sefton"]

[ktoc cat="ScholarlyHTML" title="Scholarly HTML posts" orderby="date" show_authors="false" toc_author="Peter Sefton" ]

I the title is used to create sections in the book, in both cases the post are displayed in date-order and I am not showing the name of the author on the page because that’s not needed when it is all me.

The resulting book has a nested table of contents, seen here in Adobe Digital Editions.

graphics2Illustration 1: A book built from a WordPress page with two table of contents blocks generated from WordPress categories.

Read on for more detail about the process of developing these things and some comments about the problems I encountered working with multiple conflicting WordPress plugins, etc.

The Scholarly HTML way to EPUB

The first thing I tried in this exploration was writing a recipe to make an EPUB book from a Knowledge Blog, for the Ontogenesis project. It is a kind of encyclopaedia of ontology development maintained in a WordPress site with multiple contributors. It worked well, for a demonstration, and did not take long to develop. The Ontogenesis recipe is available here and the resulting book is available on the Calibre server.

But there was a problem.

The second blog I wanted to try it on was my own, so I installed ktoc changed the URL in the recipe and ran it. Nothing. The problem is that Ontogenesis and my blog use different WordPress themes so the structure is different. Recipes have stuff like this in them to locate the parts of a page, such as <p class='details_small'>:

remove_tags_before = dict(name='p', attrs={'class':'details_small'})

remove_tags_after = dict(name='div', attrs={'class':'post_content'})

That’s for Ontogenesis, different rules are needed for other sites. You also need code to find the table of contents amongst all the links on a WordPress page, and deal with pages that might have two or more ktoc-generated tables for different sections of a journal, or parts of a project report.

Anyway, I wrote a different recipe for my site, but as I was doing so I was thinking about how to make this easier. What if:

  • The ktoc plugin output a little more information in its list of posts that made it easy to find no matter what WordPress theme was being used.

  • The actual post part of each page (ie not the navigation, or ads) identified itself as such.

  • The same technique could be extended to other websites in general.

There is already a standard way to do the most important part of this, listing a set of resources that make up an aggregated resource; the Object Reuse and Exchange specification, embedded in HTML using RDFa. ORE in RDFa. Simple.

Well no, it’s not, unfortunately. ORE is complicated and has some very important but hard to grasp abstractions such the difference between an Aggregation, and a Resource Map. An Aggregation is a collection of resources which has a URI, while a resource map describes the relationship between the aggregation and the resources it aggregates. These things are supposed to have different URIs. Now, for a simple task like making a table of contents of WordPress posts machine-readable so you can throw together a book, these abstractions are not really helpful to developers or consumers. But what if there were a simple recipe/microformat what we call a convention in Scholarly HTML to follow, which was ORE compliant and that was also simple to implement at both the server and client end?

What I have been doing over the last couple of days, as I continue this EPUB exploration is try to use the ORE spec in a way that will be easy implement, say in the TOC page, or in Anthologize, while still being ORE compliant. That discussion is ongoing, and will take place in the Google groups for Scholarly HTML and ORE. It is worth pursuing because if we can get it sorted out then with a few very simple additions to the HTML they spit out, any web system can get EPUB export quickly and cheaply by adhering to a narrowly defined profile of ORE subject to the donor service being able to supply reasonable quality HTML. More sophisticated tools that do understand RDFa and ORE will be able to process arbitrary pages that use the Scholarly HTML convention, but developers can choose the simpler convention over a full implementation for some tasks.

The details may change, as I seek advice from experts, but basically, there are two parts to this.

Firstly there’s adding ORE semantics to the ktoc (or any) table of contents. It used to be a plain-old unordered list, with list items in it:

<li><a href="">Automatic
maintenance of multiple inheritance ontologies</a> by Mikel Egana
<li><a href="">Characterising
Representation</a> by Sean Bechhofer and Robert Stevens</li>
<li><a href="">Closing Down
the Open World: Covering Axioms and Closure Axioms</a> by Robert

The list items now explicitly say what is being aggregated. The plain old <li> becomes:

<li  rel=""

(The fact that this is an <li> does not matter, it could be any element.)

And there is a separate URI for the Aggregation and resource map courtesy of different IDs. And the resource map says that it describes the Aggregation as per the ORE spec.

<div id=”AggregationScholarlyHTML">

<div rel="" resource="#AggregationScholarlyHTML" id="ResourceMapScholarlyHTML" about="#ResourceMapScholarlyHTML">

It is verbose, but nobody will have to type this stuff. What I have tried to do here (and it is a work in progress) is to simplify an existing standard which could be applied in any number of ways and boil it down to a simple convention that’s easy to implement but that still honours the more complicated specifications in the background. (Experts this will realise that I have used an RDFa 1.1 approach here, meaning that current RDFa processors will not understand, this is so that we don’t have to deal with namespaces and CURIES which complicate processing for non-native tools.)

Secondly the plugin wraps a <div> element around the content for every post to label it as being scholarly HTML, this is a way of saying that this part of the whole page is the content that makes up the article, thesis chapter or similar. Without a marker like this finding the content is a real challenge where pages are loaded up with all sorts of navigation, decoration and advertisements, it is different on just about every site, and it can change at the whim of the blog owner if they change themes.

<div rel="">

Why not define an even simpler format?

It would be possible to come up with a simple microformat that had nice human readable class attributes or something to mark the parts of a TOC page. I didn’t do that because then people will rightly point out that ORE exists and we would end up with a convention that covered a subset of the existing spec, making it harder for tool makers to cover both and less likely that services will interoperate.

So why not just use general ORE and RDFa?

There are several reasons:

  • Tool support is extremely limited for client and server side processing of full RDFa, for example in supporting the way namespaces are handled in RDFa using CURIES. (Sam Adams has pointed out that it would be a lot easier to debug my code if I did use CURIES and RDFa 1.0 so I followed his advice, did some search and replacing and checked that the work I am doing here is indeed ORE compliant).

  • The ORE spec is suited only for experienced developers with a lot of patience for complexities like the difference between an aggregation and a resource map.

  • RDFa needs to apply to a whole page, with the correct document type and that’s not always possible to do when we’re dealing with systems like WordPress. The convention approach means you can at least produce something that can become proper RDFa if put into the right context.

Why not use RSS/Atom feeds?

Another way to approach this would be to use a feed, in RSS or Atom format. WordPress has good support for feeds there’s one for just about everything. So you can look at all the posts on my website:

or use Tony Hirst’s approach to fetch a singe post from the jiscPUB blog

The nice thing about this single post technique is that it gives you just the content in a content element so there is no screen scraping involved. The problem is that the site has to be set up to provide full HTML versions of all posts in its feeds or you only get a summary. There’s a problem with using feeds on categories too, I believe, in that there is an upper limit to how many posts a WordPress site will serve. The site admin can change that to a larger number but then that will affect subscribers to the general purpose feeds as well. They probably don’t want to see three hundred posts in Google Reader when they sign up to a new blog.

Given that Atom (the best standardised and most modern feed format) is one of the official serialisation formats for ORE it is probably worth revisiting this question later if someone, such as JISC, decides to invest more in this kind of web-to-ebook-compiling application.

What next?

There are some obvious things that could be done to further this work:

  • Set up a more complete and robust book server which builds and rebuilds books from particular sites and distributes them in some way, using Open Publication Distribution System (OPDS) or something like this thing that sends stuff to your Kindle.

  • Write a ‘recipe factory’. With a little more work the ScholarlyHTML recipe can be got to the point where the only required variable is a single page URL everything else can be harvested from the page or over-ridden by the recipe.

  • Combining the above to make a WordPress plugin that can create EPUBs from collections of in-built content (tricky because of the present calibre dependency but it could be re-coded in PHP).

  • Add the same ScholarlyHTML convention for ORE to other web systems such as the plugin and Anthologize. Anthologize is appealing because it allows you to order resources in ‘projects’ and nest them into ‘parts’ rather than being based on simple queries but at the moment it does not actually have a way to publish a project directly to the web.

  • Explore the same technique in the next phase of WorkPackage 3 when I return to looking at word processing tools and examine how cloud replication services like DropBox might help people to manage book-like projects that consist of multiple parts.

Postscript: Lessons and things that need fixing or investiging

I encountered some issues. Some of these are mentioned above but I wanted to list them here as fodder for potential new projects.

  • As with Anthologize, if you use the WordPress RSS importer to bring-in content it does not change the links between posts so they point to the new location. Likewise with importing a WordPress export file.

  • The RSS importer applied to the thesis created hundreds of blank categories.

  • I tried to add my ktoc plugin to a site, but ran into problems. It uses PHP’s simplexml parser which chokes on what I am convinced is perfectly valid XML in unpredictable ways. And the default configuration expects posts to be formatted in a particular way as a list of top-level paragraphs, rather than with nested divs. I will follow this up with the developers.

  • Calibre does a pretty good job of taking HTML and making it into EPUBs but it does have its issues. I will work through these on the relevant forums as time permits.

    • There are some encoding problems with the table of contents in some places. Might be an issue with my coding in the recipes.

    • Unlike other Calibre workflows, such as creating books from raw HTML, ebook-convert adds navigation to each HTML page in the book created by a recipe. This navigation is redundant in an EPUB, but apparently it would require a source code change to get rid of it.

    • It does something complicated to give each book its style information. There are some odd presentation glitches in the samples as a result of Calibre’s algorithms. This requires more investigation.

    • It doesn’t find local links between parts of a book (ie links from one post to another which occur a lot in my work and in Tony’s course), but I have coded around that in the Scholarly HTML recipes.

It will be up to Theo Andrew, the project manager if any of these next steps or issues get any attention during the rest of this project.

Copyright Peter Sefton, 2011-05-25. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <>


This post was written in, using templates and tools provided by the Integrated Content Environment project.

This entry was posted in Uncategorized, Workpackage 3. Bookmark the permalink.

10 Responses to Making EPUB from WordPress (and other) web collections

  1. Abhaya says:

    Hi Peter,

    In building our Blog2Book tool, we went with the ATOM/RSS feed way. Currently we generate PDF formats for printing and for online consumption. We are also working on generating epub format and hope to push it out in few days.

    Just one comment: while the ATOM/RSS standards do not have a way for fetching older content, WordPress supports pagination using which you can fetch older posts also without the site admin needing to adjust the default number of posts per page. However the problem with partial and full content remains.


    • Peter Sefton says:


      Thanks for the tips re feeds.

      Your system did not work at all with which is plain old WordPress.

      I does work with this jiscPUB blog, but the PDF it produces is missing all the links – including the link text. Is there an option to download in a ebook formats such as EPUB or just PDF?

  2. bowerbird says:

    you might be interested in my project:


    my light-markup system takes a utf8-plain-text file as input
    and converts out to .pdf, .html, .epub, and .mobi.

    if you think about it, i do believe you will come to agree that
    having your “master source-file” be a utf8-plain-text file is
    the best way to archive. it has the lowest cost-structure —
    to create, edit, and maintain — and extremely high benefits,
    in the form of existing tools and ease of re-mix and re-use…


    • Peter Sefton says:

      No arguments about UTF-8 for archiving many kinds of documents. Sometimes XML or HTML would be good things to do with the UTF-8 as well. Then there is the MAJOR issue of how to link to and bundle images and other media, data etc.

      Can I ask why you decided to invent yet another lightweight markup format? How is this different from asciidoc, pandoc, markdown, wikicreole et al?

  3. bowerbird says:

    peter said:
    > No arguments about UTF-8 for
    > archiving many kinds of documents.
    > Sometimes XML or HTML would be
    > good things to do with the UTF-8 as well.

    .html is an output format, if/when you want that.
    x.m.l. could be as well, but i don’t see much purpose.

    the important part of having a utf8-plain-text master
    is that it is plain-text, and thus a breeze to work with.

    > Then there is the MAJOR issue of how to link to
    > and bundle images and other media, data etc.

    i believe you will find that i have done much thought
    about those issues… i’ve been doing this a long time.

    > Can I ask why you decided to invent
    > yet another lightweight markup format?
    > How is this different from
    > asciidoc, pandoc, markdown, wikicreole et al?

    this makes me chuckle.

    i started my work before most of those were hatched.
    as a result, i have had a while to think and do research.
    what i will be releasing soon is the first 5% of my work.
    there will be much more to follow.

    but to answer the question directly, my work differs
    from those other forms of light-markup in that it is
    directly focused on books and long-form documents,
    rather than webpages. further, it’s been informed by
    my long experience with desktop-publishing software,
    going back to the earliest version of ventura publisher.
    (which, if you know about these things, you will know
    was based on one of the best frameworks ever used.)


  4. bowerbird says:

    i’ve released the first beta of my converter-tool.



  5. Keith Fahlgren says:

    i started my work before most of those were hatched

    AsciiDoc Version 1.0 (2002-11-25)
    Markdown 1.0.1 (2004-12-17)
    WikiCreole 1.0 (2007-07-04)
    Pandoc 0.45 (2007–12–09)
    First beta jaguarps (2011-06-06)

  6. Pingback: The repository is watching: automated harvesting from replicated filesystems | ptsefton

  7. bowerbird says:


    why are my comments not appearing?


  8. bowerbird says:

    ok, let’s try this again.

    as keith knows, and knows very well,
    my work predates the jaguar release.