
Mirroring a website with wget

This is a little trick that I always forget, so I’m saving it here for reference to myself (as well as for anyone else). Sometimes it is useful to mirror an entire website. Maybe it is a website that doesn’t work reliably. Maybe you need offline access to that site. There are a couple of reasons. In my case there’s a website with lots of open documentation in PDF format that I’m trying to access, but I don’t want to manually save each one of the tens of PDF files available, so I’d rather just clone the whole website. The following wget command will let you do this.

wget --mirror --convert-links --no-parent [website]

The following flags are used:

  • --mirror: this tells wget to recursively fetch every link in the downloaded document and repeat the process until the entire site has been fetched.
  • --convert-links: this asks wget to change the downloaded HTML files so that absolute links are replaced with relative ones. This makes it easier to browse the downloaded website when there is no Internet connection.
  • --no-parent: links to parent webpages won’t be followed. For instance, if you are downloading http://www.example.com/foo/bar, pages under http://www.example.com/foo/ or http://www.example.com/ won’t be downloaded. Which is cool, because you probably don’t want to store more documents than the ones you actually need. (A full example command follows this list.)
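
Putting it all together: say the documentation lived under http://www.example.com/docs/ (a made-up URL, just for illustration). The command would look like this:

wget --mirror --convert-links --no-parent http://www.example.com/docs/

By default, wget creates a directory named after the host (www.example.com in this case) inside your current directory and stores the mirrored files there.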

Another useful flag is --wait, which comes in handy when you want to wait a few seconds between each download. It is polite not to hammer the server, especially when you are downloading from websites that don’t have powerful servers. You would use it like this:

wget --wait=10 [--every other flag] [your website]

Instead of 10 you can provide your own value if you want to wait more or less. As you would expect, this will increase the time required to download all the assets.
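
To wrap up, here is how the whole thing would look for the same made-up URL as before, waiting ten seconds between each request:

wget --wait=10 --mirror --convert-links --no-parent http://www.example.com/docs/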