Wget doesn’t eat XML

I wanted to retrieve a local copy of my online XML course. I had instructed the technical staff to serve the XHTML files as application/xml; I believe this was to work around the limitations of Internet Explorer. In any case, I stumbled upon a wget bug! Wget won’t treat a document served with the MIME type application/xml as XHTML, and hence it won’t follow the links inside it.

A deeper limitation is that wget doesn’t know XML. This means that it will not follow references to XML stylesheets. Wget also doesn’t know about JavaScript.
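To make the stylesheet limitation concrete: an XML document refers to its stylesheet through an xml-stylesheet processing instruction rather than an HTML link element, so a scanner that only understands HTML never sees the reference. Here is a minimal sketch of how such references could be found, assuming a well-formed document. The helper name `stylesheet_hrefs` and the sample document are made up for illustration.

```python
import re
from xml.dom import minidom


def stylesheet_hrefs(xml_text):
    """Return the href of every top-level xml-stylesheet processing instruction."""
    doc = minidom.parseString(xml_text)
    hrefs = []
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            # The instruction's data looks like attributes but is plain text,
            # so the href has to be pulled out by hand.
            match = re.search(r"""href=["']([^"']*)["']""", node.data)
            if match:
                hrefs.append(match.group(1))
    return hrefs


# Hypothetical sample document, not taken from the actual course.
sample = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="course.xsl"?>
<course><title>INF 6450</title></course>"""

print(stylesheet_hrefs(sample))
```

A mirroring tool would then need to resolve each href against the document's URL and fetch it, which is exactly the step wget skips.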

This meant I had to write my own scripts to recover the course. First, a bash script:

wget -m -r -l inf -v -p http://www.teluq.uquebec.ca/inf6450/index-fr.htm
find -path "*.htm" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.html" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.xhtml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.xml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.xml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p

Notice that the last line appears twice. Don’t do this type of scripting at home. Bad design!

Next, I needed a Python script to extract the URLs (Perl or Ruby would also do):

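#!/usr/bin/env python3
import re
import sys

# The first pattern is an assumption: the original was lost with the markup.
PATTERNS = [r'(?<=<a href=").*?(?=")',
            r"(?<=openwindow\(').*?(?=')",
            r"""(?<=stylesheet href=["']).*?(?=["'])"""]

for filename in sys.argv[1:]:
    # wget mirrors into a directory named after the host, so the path gives the base URL
    base = "http://" + re.search(r"www.*/", filename).group()
    with open(filename, errors="replace") as file:
        for line in file:
            # better hope that we don't have repeated spaces!
            for pattern in PATTERNS:
                for m in re.findall(pattern, line):
                    print(base + m)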

This is a pretty awful hack, but it works!
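Since the files are XHTML, a less fragile route would be to let an XML parser find the links instead of regular expressions. What follows is only a sketch under some assumptions: the document is well-formed, it uses the standard XHTML namespace, and it contains no undefined entities such as `&nbsp;`. The function name `extract_links` and the sample paths are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def extract_links(xhtml_text, base_url):
    """Return absolute URLs for every <a href> in a well-formed XHTML document."""
    root = ET.fromstring(xhtml_text)
    links = []
    for anchor in root.iter(XHTML_NS + "a"):
        href = anchor.get("href")
        if href:
            # Resolve relative links against the URL the page came from.
            links.append(urljoin(base_url, href))
    return links


# Hypothetical sample page; note the repeated space inside the first tag.
sample = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p><a  href="module1/index.xml">Module 1</a></p>
    <p><a href="/inf6450/notes.xml">Notes</a></p>
  </body>
</html>"""

for url in extract_links(sample, "http://www.teluq.uquebec.ca/inf6450/index-fr.htm"):
    print(url)
```

Because the parser handles whitespace and quoting, repeated spaces inside a tag no longer matter, and `urljoin` resolves both relative and site-absolute links. This does not cover the `openwindow(...)` JavaScript calls, which would still need pattern matching.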

Here is a project for the tech-savvy among you: extend wget so that it can parse XML!

One thought on “Wget doesn’t eat XML”

  1. Hello,

    I think wget's refusal to look for links inside the XML is more likely a result of it only parsing text/html documents than a parsing problem.
    You can ask wget to treat all possible MIME types as HTML when checking for links with wget -F.

    with kind regards
    Michael
