Gutenberg books as marked up XML

Project Gutenberg is a fantastic project that has scanned a large collection of books and made them available for free. The problem is that the books are distributed as plain text, which can make automated processing difficult: even extracting the title of a book takes some work (though not much). However, the nice people at the HTML Writers Guild have marked up a large collection of Gutenberg books in XML, using a publicly available DTD.

Possible application: automatically import a given book into a content management system (or a learning management system).
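To make the idea concrete, here is a minimal sketch of how a marked-up book could be read before being handed to a content management system. The element names (`book`, `frontmatter`, `title`, `author`, `chapter`) and the sample document are hypothetical and are not taken from the actual DTD; you would need to substitute the names the real DTD defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the element names are illustrative only
# and do not come from the actual Gutenberg DTD.
SAMPLE = """<book>
  <frontmatter>
    <title>Alice's Adventures in Wonderland</title>
    <author>Lewis Carroll</author>
  </frontmatter>
  <chapter>
    <title>Down the Rabbit-Hole</title>
    <para>Alice was beginning to get very tired...</para>
  </chapter>
</book>"""


def extract_metadata(xml_text):
    """Return the title, author, and chapter titles of a marked-up book."""
    root = ET.fromstring(xml_text)
    return {
        # findtext returns None when the element is missing
        "title": root.findtext("frontmatter/title"),
        "author": root.findtext("frontmatter/author"),
        # one entry per <chapter>, in document order
        "chapters": [c.findtext("title") for c in root.iter("chapter")],
    }


if __name__ == "__main__":
    print(extract_metadata(SAMPLE))
```

With plain text, the same information has to be guessed from the layout of the file; with markup, it is a direct lookup, which is what makes automated import practical.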

You might also want to consider GutenMark as a tool to process Gutenberg books (output to LaTeX and HTML).

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

One thought on “Gutenberg books as marked up XML”

  1. Weekend output on
    Last week I was wondering whether some of the people who read my blog were not subscribed to the web feed of my shared bookmarks on… I told myself that I should occasionally post here too….
