Gutenberg books as marked up XML

Project Gutenberg is a fantastic project where a large collection of books has been scanned and made available for free. The problem has been that the books are distributed as plain text, which can make automated processing awkward: even extracting the title of a book takes some work (though it is an easy problem). However, the nice people at the HTML Writers Guild have marked up a large collection of Gutenberg books in XML, using a publicly available DTD.

Possible application: automatically integrating a given book into a content management system (or a learning management system).
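To give an idea of why the XML version is easier to work with, here is a minimal Python sketch that pulls the title and author out of a marked-up book, which is the kind of metadata a content management system would want. The sample document and the element names (`book`, `titlepage`, `title`, `author`) are assumptions made up for illustration. The real names are whatever the DTD defines, so check it before reusing this.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The element names here are invented for
# illustration; the actual DTD may use different ones.
SAMPLE = """<book>
  <frontmatter>
    <titlepage>
      <title>Alice's Adventures in Wonderland</title>
      <author>Lewis Carroll</author>
    </titlepage>
  </frontmatter>
  <bookbody>
    <chapter><title>Down the Rabbit-Hole</title></chapter>
  </bookbody>
</book>"""


def first_text(root, tag):
    """Return the stripped text of the first <tag> element in document order, or None."""
    element = root.find(f".//{tag}")
    if element is None or element.text is None:
        return None
    return element.text.strip()


def extract_metadata(xml_text, title_tag="title", author_tag="author"):
    """Parse a marked-up book and return its title and author.

    The tag names are parameters because they are assumptions, not
    names taken from the real DTD.
    """
    root = ET.fromstring(xml_text)
    return {
        "title": first_text(root, title_tag),
        "author": first_text(root, author_tag),
    }


if __name__ == "__main__":
    print(extract_metadata(SAMPLE))
```

With the plain text files, the same task would mean guessing which line near the top of the file is the title. With the XML, it is a lookup by element name.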

You might also want to consider GutenMark as a tool to process Gutenberg books (output to LaTeX and HTML).
