Wget doesn’t eat XML
I wanted to retrieve a local copy of my online XML course. I had instructed the technical staff to serve the XHTML files as application/xml; I believe this was to work around the limitations of Internet Explorer. In any case, I stumbled upon a wget bug! Wget won’t treat a document served with the MIME type application/xml as XHTML, and hence it won’t follow the links inside it.
A deeper limitation is that wget doesn’t know XML. This means that it will not follow stylesheet references in XML files. Wget also doesn’t know about JavaScript, so it misses links that are opened from scripts.
This meant I had to write my own scripts to recover the course. First, a bash script:
wget -m -r -l inf -v -p http://www.teluq.uquebec.ca/inf6450/index-fr.htm
find -path "*.htm" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.html" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.xhtml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.xml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
find -path "*.xml" | xargs ./extracturls.py | xargs wget -m -r -l inf -v -p
You see that the last line appears twice. Don’t do this type of scripting at home. Bad design!
Next I need a Python script to extract the URLs I need (Perl or Ruby would also do):
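A less fragile alternative to repeating a line by hand is to keep going until no new URLs turn up. What follows is only a minimal sketch of that idea, not what I actually ran: the function name crawl_until_stable and the two callables fetch and extract_links are hypothetical, and the downloading itself (for instance a call to wget) is left to the caller.

```python
def crawl_until_stable(seeds, fetch, extract_links):
    """Visit every URL reachable from seeds, each one exactly once.

    fetch(url) is a hypothetical callable returning the document text,
    or None if the download failed.
    extract_links(url, text) is a hypothetical callable returning the
    URLs found in that document.
    """
    seen = set()
    pending = list(seeds)
    while pending:
        url = pending.pop()
        if url in seen:
            continue  # already handled, so cycles cannot loop forever
        seen.add(url)
        text = fetch(url)
        if text is None:
            continue  # nothing to scan for links
        for link in extract_links(url, text):
            if link not in seen:
                pending.append(link)
    return seen
```

The loop stops by itself once a pass discovers nothing new, so there is no need to guess how many times to repeat the last step.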
#!/usr/bin/env python
import re, sys

for filename in sys.argv[1:]:
    file = open(filename)
    # print("from", file)
    for line in file:
        # better hope that we don't have repeated spaces!
        # (three more re.findall patterns came first here, but only
        # their opening "(?<=" survives)
        for m in re.findall(r"(?<=openwindow\(').*?(?=')", line) + \
                 re.findall(r"""(?<=stylesheet href=["']).*?(?=["'])""", line):
            # rebuild the URL from the directory wget mirrored the file into
            print("http://" + re.search("www.*/", filename).group() + m)
This is a pretty awful hack, but it works!
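One part of the hack that is easy to tighten is the way the URL is glued together from the file name and the link. As a hedged sketch, assuming wget's mirror layout (a directory named after the host, then the path) and Python 3, urljoin from the standard library can do the resolving, which also copes with "../" and absolute links. The helper names base_url_for and resolve, and the path mod1/page.xml in the usage note, are made up for illustration.

```python
import re
from urllib.parse import urljoin


def base_url_for(filename):
    """Map a mirrored file path back to the URL it was fetched from.

    Assumes the path contains the host name starting with "www.",
    as in the mirror created by wget -m.
    """
    match = re.search(r"www\..*", filename)
    if match is None:
        raise ValueError("no host name found in %r" % filename)
    return "http://" + match.group()


def resolve(filename, link):
    """Resolve a link found in a mirrored file to an absolute URL."""
    return urljoin(base_url_for(filename), link)
```

For example, resolve("./www.teluq.uquebec.ca/inf6450/mod1/page.xml", "../style.css") gives http://www.teluq.uquebec.ca/inf6450/style.css, whereas plain string concatenation would leave the "../" in the middle of the URL.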
Here is a project for the tech savvy among you: extend wget so that it can parse XML!
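To give an idea of what parsing the XML would buy, here is a small sketch in Python rather than in wget's own code. It assumes Python 3 and well-formed input, and it only collects href and src attributes from elements; the function name extract_links is hypothetical. It does not look at processing instructions or at JavaScript, and a document that relies on DTD-defined entities may fail to parse.

```python
import xml.etree.ElementTree as ET


def extract_links(xml_text):
    """Return the href and src attribute values in document order."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well formed
    links = []
    for element in root.iter():
        for name in ("href", "src"):
            value = element.get(name)
            if value:
                links.append(value)
    return links
```

Unlike the regular expressions, this does not care about repeated spaces, attribute order, or which kind of quote was used.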
Hello,
I think wget's refusal to check for links inside the XML is more likely a result of it only parsing text/html documents than of a parsing problem.
You can ask wget to regard all possible mime types as html to check with wget -F
with kind regards
Michael
Comment by Michael Barth — 6/5/2006 @ 5:03