mirror of
https://github.com/mirror/wget.git
synced 2024-12-29 14:30:48 +08:00
1396b30055
to add --bind-address, making many necessary alphabetization, coding style, comment, documentation, and naming fixes and additions.
1198 lines
48 KiB
Plaintext
This is Info file wget.info, produced by Makeinfo version 1.68 from the
input file ./wget.texi.

INFO-DIR-SECTION Net Utilities
INFO-DIR-SECTION World Wide Web
START-INFO-DIR-ENTRY
* Wget: (wget).         The non-interactive network downloader.
END-INFO-DIR-ENTRY

   This file documents the GNU Wget utility for downloading network
data.

   Copyright (C) 1996, 1997, 1998, 2000 Free Software Foundation, Inc.

   Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

   Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided also
that the sections entitled "Copying" and "GNU General Public License"
are included exactly as in the original, and provided that the entire
resulting derived work is distributed under the terms of a permission
notice identical to this one.


File: wget.info,  Node: Top,  Next: Overview,  Prev: (dir),  Up: (dir)

Wget 1.5.3+dev
**************

   This manual documents version 1.5.3+dev of GNU Wget, the freely
available utility for network download.

   Copyright (C) 1996, 1997, 1998 Free Software Foundation, Inc.

* Menu:

* Overview::                    Features of Wget.
* Invoking::                    Wget command-line arguments.
* Recursive Retrieval::         Description of recursive retrieval.
* Following Links::             The available methods of chasing links.
* Time-Stamping::               Mirroring according to time-stamps.
* Startup File::                Wget's initialization file.
* Examples::                    Examples of usage.
* Various::                     The stuff that doesn't fit anywhere else.
* Appendices::                  Some useful references.
* Copying::                     You may give out copies of Wget.
* Concept Index::               Topics covered by this manual.


File: wget.info,  Node: Overview,  Next: Invoking,  Prev: Top,  Up: Top

Overview
********

   GNU Wget is a freely available network utility to retrieve files from
the World Wide Web, using HTTP (Hyper Text Transfer Protocol) and FTP
(File Transfer Protocol), the two most widely used Internet protocols.
It has many useful features to make downloading easier, some of them
being:

   * Wget is non-interactive, meaning that it can work in the
     background, while the user is not logged on. This allows you to
     start a retrieval and disconnect from the system, letting Wget
     finish the work. By contrast, most Web browsers require the
     user's constant presence, which can be a great hindrance when
     transferring a lot of data.

   * Wget is capable of descending recursively through the structure of
     HTML documents and FTP directory trees, making a local copy of the
     directory hierarchy similar to the one on the remote server. This
     feature can be used to mirror archives and home pages, or traverse
     the web in search of data, like a WWW robot (*Note Robots::). In
     that spirit, Wget understands the `norobots' convention.

   * File name wildcard matching and recursive mirroring of directories
     are available when retrieving via FTP. Wget can read the
     time-stamp information given by both HTTP and FTP servers, and
     store it locally. Thus Wget can see if the remote file has
     changed since last retrieval, and automatically retrieve the new
     version if it has. This makes Wget suitable for mirroring of FTP
     sites, as well as home pages.

   * Wget works exceedingly well on slow or unstable connections,
     retrying the document until it is fully retrieved, or until a
     user-specified retry count is surpassed. It will try to resume the
     download from the point of interruption, using `REST' with FTP and
     `Range' with HTTP servers that support them.

   * By default, Wget supports proxy servers, which can lighten the
     network load, speed up retrieval and provide access behind
     firewalls. However, if you are behind a firewall that requires
     that you use a socks style gateway, you can get the socks library
     and build Wget with support for socks. Wget also supports
     passive FTP downloading as an option.

   * Built-in features offer mechanisms to tune which links you wish to
     follow (*Note Following Links::).

   * The retrieval is conveniently traced by printing dots, each dot
     representing a fixed amount of data received (1KB by default).
     These representations can be customized to your preferences.

   * Most of the features are fully configurable, either through
     command line options, or via the initialization file `.wgetrc'
     (*Note Startup File::). Wget allows you to define "global"
     startup files (`/usr/local/etc/wgetrc' by default) for site
     settings.

   * Finally, GNU Wget is free software. This means that everyone may
     use it, redistribute it and/or modify it under the terms of the
     GNU General Public License, as published by the Free Software
     Foundation (*Note Copying::).


File: wget.info,  Node: Invoking,  Next: Recursive Retrieval,  Prev: Overview,  Up: Top

Invoking
********

   By default, Wget is very simple to invoke. The basic syntax is:

     wget [OPTION]... [URL]...

   Wget will simply download all the URLs specified on the command
line. URL is a "Uniform Resource Locator", as defined below.

   However, you may wish to change some of the default parameters of
Wget. You can do it in two ways: permanently, by adding the appropriate
command to `.wgetrc' (*Note Startup File::), or by specifying it on the
command line.

* Menu:

* URL Format::
* Option Syntax::
* Basic Startup Options::
* Logging and Input File Options::
* Download Options::
* Directory Options::
* HTTP Options::
* FTP Options::
* Recursive Retrieval Options::
* Recursive Accept/Reject Options::


File: wget.info,  Node: URL Format,  Next: Option Syntax,  Prev: Invoking,  Up: Invoking

URL Format
==========

   "URL" is an acronym for Uniform Resource Locator. A uniform
resource locator is a compact string representation for a resource
available via the Internet. Wget recognizes the URL syntax as per
RFC1738. This is the most widely used form (square brackets denote
optional parts):

     http://host[:port]/directory/file
     ftp://host[:port]/directory/file

   You can also encode your username and password within a URL:

     ftp://user:password@host/path
     http://user:password@host/path

   Either USER or PASSWORD, or both, may be left out. If you leave out
either the HTTP username or password, no authentication will be sent.
If you leave out the FTP username, `anonymous' will be used. If you
leave out the FTP password, your email address will be supplied as a
default password.(1)

   You can encode unsafe characters in a URL as `%xy', `xy' being the
hexadecimal representation of the character's ASCII value. Some common
unsafe characters include `%' (quoted as `%25'), `:' (quoted as `%3A'),
and `@' (quoted as `%40'). Refer to RFC1738 for a comprehensive list
of unsafe characters.

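   As a rough illustration of the `%xy' quoting described above, the
following Python sketch encodes the three characters mentioned. It
relies on the standard library function `urllib.parse.quote'; the helper
name `quote_unsafe' and the sample user name are made up for this
example and are not part of Wget.

```python
from urllib.parse import quote

def quote_unsafe(text):
    """Percent-encode every character that is not an unreserved URL character."""
    # safe="" makes quote() also encode '/', which it would otherwise keep.
    return quote(text, safe="")

print(quote_unsafe("%"))   # %25
print(quote_unsafe(":"))   # %3A
print(quote_unsafe("@"))   # %40

# A hypothetical FTP user name containing '@', embedded in a URL:
print("ftp://" + quote_unsafe("joe@example.com") + "@host/path")
```

   Quoting the `@' inside the user name keeps it from being mistaken
for the separator between the credentials and the host.
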
   Wget also supports the `type' feature for FTP URLs. By default, FTP
documents are retrieved in the binary mode (type `i'), which means that
they are downloaded unchanged. Another useful mode is the `a'
("ASCII") mode, which converts the line delimiters between the
different operating systems, and is thus useful for text files. Here
is an example:

     ftp://host/directory/file;type=a

   Two alternative variants of URL specification are also supported,
for historical (hysterical?) reasons and because of their widespread
use.

   FTP-only syntax (supported by `NcFTP'):
     host:/dir/file

   HTTP-only syntax (introduced by `Netscape'):
     host[:port]/dir/file

   These two alternative forms are deprecated, and may cease being
supported in the future.

   If you do not understand the difference between these notations, or
do not know which one to use, just use the plain ordinary format you use
with your favorite browser, like `Lynx' or `Netscape'.

   ---------- Footnotes ----------

   (1) If you have a `.netrc' file in your home directory, the password
will also be searched for there.


File: wget.info,  Node: Option Syntax,  Next: Basic Startup Options,  Prev: URL Format,  Up: Invoking

Option Syntax
=============

   Since Wget uses GNU getopt to process its arguments, every option
has a long form, and most also have a short form. Long options are
easier to remember, but take longer to type. You may freely mix
different option styles, or specify options after the command-line
arguments. Thus you may write:

     wget -r --tries=10 http://fly.cc.fer.hr/ -o log

   The space between the option accepting an argument and the argument
may be omitted. Instead of `-o log' you can write `-olog'.

   You may put several options that do not require arguments together,
like:

     wget -drc URL

   This is completely equivalent to:

     wget -d -r -c URL

   Since the options can be specified after the arguments, you may
terminate them with `--'. So the following will try to download URL
`-x', reporting failure to `log':

     wget -o log -- -x

   The options that accept comma-separated lists all respect the
convention that specifying an empty list clears its value. This can be
useful to clear the `.wgetrc' settings. For instance, if your `.wgetrc'
sets `exclude_directories' to `/cgi-bin', the following example will
first reset it, and then set it to exclude `/~nobody' and `/~somebody'.
You can also clear the lists in `.wgetrc' (*Note Wgetrc Syntax::).

     wget -X '' -X /~nobody,/~somebody


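   The empty-list convention can be modelled with a few lines of
Python. This is only a sketch of the behavior described above, not
Wget's own code: the function name `resolve_list_option' is invented
here, and it assumes that non-empty values are added to whatever the
list already holds, which is what makes the explicit reset necessary.

```python
def resolve_list_option(initial, occurrences):
    """Apply successive values of a comma-separated list option.

    An empty value clears everything collected so far; any other value
    adds its comma-separated items (assumed behavior, see the text).
    """
    current = list(initial)
    for value in occurrences:
        if value == "":
            current = []                      # empty list: reset the setting
        else:
            current.extend(value.split(","))
    return current

# `.wgetrc' set exclude_directories to /cgi-bin; the command line then
# passes -X '' followed by -X /~nobody,/~somebody.
print(resolve_list_option(["/cgi-bin"], ["", "/~nobody,/~somebody"]))
```

   Under that assumption, leaving out the `-X ''' step would keep
`/cgi-bin' in the resulting list.
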
File: wget.info,  Node: Basic Startup Options,  Next: Logging and Input File Options,  Prev: Option Syntax,  Up: Invoking

Basic Startup Options
=====================

`-V'
`--version'
     Display the version of Wget.

`-h'
`--help'
     Print a help message describing all of Wget's command-line options.

`-b'
`--background'
     Go to background immediately after startup. If no output file is
     specified via the `-o', output is redirected to `wget-log'.

`-e COMMAND'
`--execute COMMAND'
     Execute COMMAND as if it were a part of `.wgetrc' (*Note Startup
     File::). A command thus invoked will be executed *after* the
     commands in `.wgetrc', thus taking precedence over them.


File: wget.info,  Node: Logging and Input File Options,  Next: Download Options,  Prev: Basic Startup Options,  Up: Invoking

Logging and Input File Options
==============================

`-o LOGFILE'
`--output-file=LOGFILE'
     Log all messages to LOGFILE. The messages are normally reported
     to standard error.

`-a LOGFILE'
`--append-output=LOGFILE'
     Append to LOGFILE. This is the same as `-o', only it appends to
     LOGFILE instead of overwriting the old log file. If LOGFILE does
     not exist, a new file is created.

`-d'
`--debug'
     Turn on debug output, meaning various information important to the
     developers of Wget if it does not work properly. Your system
     administrator may have chosen to compile Wget without debug
     support, in which case `-d' will not work. Please note that
     compiling with debug support is always safe--Wget compiled with
     the debug support will *not* print any debug info unless requested
     with `-d'. *Note Reporting Bugs:: for more information on how to
     use `-d' for sending bug reports.

`-q'
`--quiet'
     Turn off Wget's output.

`-v'
`--verbose'
     Turn on verbose output, with all the available data. The default
     output is verbose.

`-nv'
`--non-verbose'
     Non-verbose output--turn off verbose without being completely quiet
     (use `-q' for that), which means that error messages and basic
     information still get printed.

`-i FILE'
`--input-file=FILE'
     Read URLs from FILE, in which case no URLs need to be on the
     command line. If there are URLs both on the command line and in
     an input file, those on the command line will be the first ones to
     be retrieved. The FILE need not be an HTML document (but no harm
     if it is)--it is enough if the URLs are just listed sequentially.

     However, if you specify `--force-html', the document will be
     regarded as `html'. In that case you may have problems with
     relative links, which you can solve either by adding `<base
     href="URL">' to the documents or by specifying `--base=URL' on the
     command line.

`-F'
`--force-html'
     When input is read from a file, force it to be treated as an HTML
     file. This enables you to retrieve relative links from existing
     HTML files on your local disk, by adding `<base href="URL">' to
     HTML, or using the `--base' command-line option.

`-B URL'
`--base=URL'
     When used in conjunction with `-F', prepends URL to relative links
     in the file specified by `-i'.


File: wget.info,  Node: Download Options,  Next: Directory Options,  Prev: Logging and Input File Options,  Up: Invoking

Download Options
================

`--bind-address=ADDRESS'
     When making client TCP/IP connections, `bind()' to ADDRESS on the
     local machine. ADDRESS may be specified as a hostname or IP
     address. This option can be useful if your machine is bound to
     multiple IPs.

`-t NUMBER'
`--tries=NUMBER'
     Set number of retries to NUMBER. Specify 0 or `inf' for infinite
     retrying.

`-O FILE'
`--output-document=FILE'
     The documents will not be written to the appropriate files, but
     all will be concatenated together and written to FILE. If FILE
     already exists, it will be overwritten. If the FILE is `-', the
     documents will be written to standard output. Including this
     option automatically sets the number of tries to 1.

`-nc'
`--no-clobber'
     If a file is downloaded more than once in the same directory,
     Wget's behavior depends on a few options, including `-nc'. In
     certain cases, the local file will be "clobbered", or overwritten,
     upon repeated download. In other cases it will be preserved.

     When running Wget without `-N', `-nc', or `-r', downloading the
     same file in the same directory will result in the original copy
     of `FILE' being preserved and the second copy being named
     `FILE.1'. If that file is downloaded yet again, the third copy
     will be named `FILE.2', and so on. When `-nc' is specified, this
     behavior is suppressed, and Wget will refuse to download newer
     copies of `FILE'. Therefore, "no-clobber" is actually a misnomer
     in this mode--it's not clobbering that's prevented (as the
     numeric suffixes were already preventing clobbering), but rather
     the multiple version saving that's prevented.

     When running Wget with `-r', but without `-N' or `-nc',
     re-downloading a file will result in the new copy simply
     overwriting the old. Adding `-nc' will prevent this behavior,
     instead causing the original version to be preserved and any newer
     copies on the server to be ignored.

     When running Wget with `-N', with or without `-r', the decision as
     to whether or not to download a newer copy of a file depends on
     the local and remote timestamp and size of the file (*Note
     Time-Stamping::). `-nc' may not be specified at the same time as
     `-N'.

     Note that when `-nc' is specified, files with the suffixes `.html'
     or (yuck) `.htm' will be loaded from the local disk and parsed as
     if they had been retrieved from the Web.

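     The numeric-suffix naming described above for plain downloads
     (no `-N' and no `-r') can be sketched in Python as follows. The
     function `local_name' and its `exists' parameter are hypothetical
     and exist only to illustrate the rule; the sketch does not cover
     the `-r' or `-N' cases.

```python
import os

def local_name(filename, no_clobber=False, exists=os.path.exists):
    """Return the name a repeated plain download would be saved under,
    or None when -nc means the download is skipped."""
    if not exists(filename):
        return filename
    if no_clobber:
        return None               # keep the original, do not download again
    n = 1
    while exists(f"{filename}.{n}"):
        n += 1                    # FILE.1, FILE.2, ... until a free name
    return f"{filename}.{n}"

# Pretend these two files are already in the current directory.
present = {"index.html", "index.html.1"}
print(local_name("index.html", exists=present.__contains__))                   # index.html.2
print(local_name("index.html", no_clobber=True, exists=present.__contains__))  # None
```

     In neither case is the existing `index.html' overwritten, which
     is why the text calls "no-clobber" a misnomer in this mode.
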
`-c'
`--continue'
     Continue getting an existing file. This is useful when you want to
     finish up the download started by another program, or a previous
     instance of Wget. Thus you can write:

          wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z

     If there is a file named `ls-lR.Z' in the current directory, Wget
     will assume that it is the first portion of the remote file, and
     will ask the server to continue the retrieval from an offset
     equal to the length of the local file.

     Note that you need not specify this option if all you want is Wget
     to continue retrieving where it left off when the connection is
     lost--Wget does this by default. You need this option only when
     you want to continue retrieval of a file already halfway
     retrieved, saved by another FTP client, or left by Wget being
     killed.

     Without `-c', the previous example would just begin to download the
     remote file to `ls-lR.Z.1'. The `-c' option is also applicable
     for HTTP servers that support the `Range' header.

     Note that if you use `-c' on a file that's already downloaded
     completely, `FILE' will not be changed, nor will a second `FILE.1'
     copy be created.

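     To make the offset rule concrete, here is a small Python sketch
     that derives the restart point from the length of the local file
     and formats it the way the two protocols express it: a `Range'
     request header for HTTP and a `REST' command for FTP. The
     function name `resume_request' is invented for this example; it
     shows the idea only and is not how Wget is implemented.

```python
import os
import tempfile

def resume_request(local_path):
    """Return (offset, http_header, ftp_command) for continuing a download."""
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    http_header = f"Range: bytes={offset}-"   # HTTP: ask for the rest of the file
    ftp_command = f"REST {offset}"            # FTP: restart marker sent before RETR
    return offset, http_header, ftp_command

# Pretend 1000 bytes of the file were already fetched.
with tempfile.NamedTemporaryFile(delete=False) as partial:
    partial.write(b"x" * 1000)
print(resume_request(partial.name))   # (1000, 'Range: bytes=1000-', 'REST 1000')
os.remove(partial.name)
```
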
`--dot-style=STYLE'
     Set the retrieval style to STYLE. Wget traces the retrieval of
     each document by printing dots on the screen, each dot
     representing a fixed amount of retrieved data. Any number of dots
     may be separated in a "cluster", to make counting easier. This
     option allows you to choose one of the pre-defined styles,
     determining the number of bytes represented by a dot, the number
     of dots in a cluster, and the number of dots on the line.

     With the `default' style each dot represents 1K, there are ten dots
     in a cluster and 50 dots in a line. The `binary' style has a more
     "computer"-like orientation--8K dots, 16-dot clusters and 48 dots
     per line (which makes for 384K lines). The `mega' style is
     suitable for downloading very large files--each dot represents 64K
     retrieved, there are eight dots in a cluster, and 48 dots on each
     line (so each line contains 3M). The `micro' style is exactly the
     reverse; it is suitable for downloading small files, with 128-byte
     dots, 8 dots per cluster, and 48 dots (6K) per line.

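     The per-line totals quoted for each style follow directly from
     multiplying the dot size by the number of dots per line. The
     Python sketch below redoes that arithmetic; the table name
     `DOT_STYLES' is made up for the example, and it assumes that "K"
     means 1024 bytes.

```python
# (bytes per dot, dots per cluster, dots per line), as listed above.
# Assumption: 1K = 1024 bytes.
DOT_STYLES = {
    "default": (1024, 10, 50),
    "binary": (8 * 1024, 16, 48),
    "mega": (64 * 1024, 8, 48),
    "micro": (128, 8, 48),
}

def bytes_per_line(style):
    """Amount of data represented by one full line of dots."""
    dot_bytes, _dots_per_cluster, dots_per_line = DOT_STYLES[style]
    return dot_bytes * dots_per_line

for name in DOT_STYLES:
    print(name, bytes_per_line(name) // 1024, "K per line")
```

     This yields 50K, 384K, 3072K (3M) and 6K per line, matching the
     figures given in the text.
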
`-N'
`--timestamping'
     Turn on time-stamping. *Note Time-Stamping:: for details.

`-S'
`--server-response'
     Print the headers sent by HTTP servers and responses sent by FTP
     servers.

`--spider'
     When invoked with this option, Wget will behave as a Web "spider",
     which means that it will not download the pages, just check that
     they are there. You can use it to check your bookmarks, e.g. with:

          wget --spider --force-html -i bookmarks.html

     This feature needs much more work for Wget to get close to the
     functionality of real WWW spiders.

`-T SECONDS'
`--timeout=SECONDS'
     Set the read timeout to SECONDS seconds. Whenever a network read
     is issued, the file descriptor is checked for a timeout, which
     could otherwise leave a pending connection (uninterrupted read).
     The default timeout is 900 seconds (fifteen minutes). Setting
     timeout to 0 will disable checking for timeouts.

     Please do not lower the default timeout value with this option
     unless you know what you are doing.

`-w SECONDS'
`--wait=SECONDS'
     Wait the specified number of seconds between the retrievals. Use
     of this option is recommended, as it lightens the server load by
     making the requests less frequent. Instead of in seconds, the
     time can be specified in minutes using the `m' suffix, in hours
     using the `h' suffix, or in days using the `d' suffix.

     Specifying a large value for this option is useful if the network
     or the destination host is down, so that Wget can wait long enough
     to reasonably expect the network error to be fixed before the
     retry.

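     The suffix rule for `-w' amounts to a simple unit conversion. As
     a sketch only, with the hypothetical helper name `wait_seconds'
     and assuming whole-number values, it could be written in Python
     like this:

```python
def wait_seconds(value):
    """Convert a -w argument such as '30', '5m', '2h' or '1d' to seconds."""
    factors = {"m": 60, "h": 60 * 60, "d": 24 * 60 * 60}
    suffix = value[-1]
    if suffix in factors:
        return int(value[:-1]) * factors[suffix]
    return int(value)             # no suffix: the value is already in seconds

for arg in ("30", "5m", "2h", "1d"):
    print(arg, "->", wait_seconds(arg), "seconds")
```
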
`--waitretry=SECONDS'
     If you don't want Wget to wait between *every* retrieval, but only
     between retries of failed downloads, you can use this option.
     Wget will use "linear backoff", waiting 1 second after the first
     failure on a given file, then waiting 2 seconds after the second
     failure on that file, up to the maximum number of SECONDS you
     specify. Therefore, a value of 10 will actually make Wget wait up
     to (1 + 2 + ... + 10) = 55 seconds per file.

     Note that this option is turned on by default in the global
     `wgetrc' file.

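     The 55-second figure can be checked with a short Python sketch of
     linear backoff. The function `retry_waits' is invented for the
     illustration, and it assumes that once the wait reaches SECONDS
     it stays at that value for any further failures.

```python
def retry_waits(waitretry, failures):
    """Seconds waited after each consecutive failure on one file."""
    # Wait 1s, 2s, 3s, ... but never longer than the --waitretry value.
    return [min(n, waitretry) for n in range(1, failures + 1)]

waits = retry_waits(10, 10)
print(waits)        # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(sum(waits))   # 55
```
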
`-Y on/off'
`--proxy=on/off'
     Turn proxy support on or off. The proxy is on by default if the
     appropriate environment variable is defined.

`-Q QUOTA'
`--quota=QUOTA'
     Specify download quota for automatic retrievals. The value can be
     specified in bytes (default), kilobytes (with `k' suffix), or
     megabytes (with `m' suffix).

     Note that quota will never affect downloading a single file. So
     if you specify `wget -Q10k ftp://wuarchive.wustl.edu/ls-lR.gz',
     all of the `ls-lR.gz' will be downloaded. The same goes even when
     several URLs are specified on the command-line. However, quota is
     respected when retrieving either recursively, or from an input
     file. Thus you may safely type `wget -Q2m -i sites'--download
     will be aborted when the quota is exceeded.

     Setting quota to 0 or to `inf' removes the download quota.


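     The following Python sketch models why a quota never truncates a
     single file: the limit is consulted only between files, so a file
     that has been started is always completed. It covers the
     recursive and input-file cases only. Both function names are
     hypothetical, the file sizes are made up, and the sketch assumes
     that `k' and `m' stand for multiples of 1024.

```python
def parse_quota(value):
    """Convert a -Q argument such as '10k' or '2m' to bytes (0 = no limit)."""
    if value == "inf":
        return 0
    factors = {"k": 1024, "m": 1024 * 1024}   # assumption: binary multiples
    if value[-1] in factors:
        return int(value[:-1]) * factors[value[-1]]
    return int(value)

def files_fetched(sizes, quota):
    """Return the sizes actually downloaded from a list of pending files."""
    fetched, total = [], 0
    for size in sizes:
        if quota and total > quota:
            break                 # quota exceeded: abort the remaining downloads
        fetched.append(size)      # a started file is always fetched in full
        total += size
    return fetched

print(files_fetched([50_000], parse_quota("10k")))
print(files_fetched([1_500_000, 1_000_000, 400_000], parse_quota("2m")))
```

     In the second call the third file is skipped, because the first
     two already add up to more than 2M.
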
File: wget.info,  Node: Directory Options,  Next: HTTP Options,  Prev: Download Options,  Up: Invoking

Directory Options
=================

`-nd'
`--no-directories'
     Do not create a hierarchy of directories when retrieving
     recursively. With this option turned on, all files will get saved
     to the current directory, without clobbering (if a name shows up
     more than once, the filenames will get extensions `.n').

`-x'
`--force-directories'
     The opposite of `-nd'--create a hierarchy of directories, even if
     one would not have been created otherwise. E.g. `wget -x
     http://fly.cc.fer.hr/robots.txt' will save the downloaded file to
     `fly.cc.fer.hr/robots.txt'.

`-nH'
`--no-host-directories'
     Disable generation of host-prefixed directories. By default,
     invoking Wget with `-r http://fly.cc.fer.hr/' will create a
     structure of directories beginning with `fly.cc.fer.hr/'. This
     option disables such behavior.

`--cut-dirs=NUMBER'
     Ignore NUMBER directory components. This is useful for getting a
     fine-grained control over the directory where recursive retrieval
     will be saved.

     Take, for example, the directory at
     `ftp://ftp.xemacs.org/pub/xemacs/'. If you retrieve it with `-r',
     it will be saved locally under `ftp.xemacs.org/pub/xemacs/'.
     While the `-nH' option can remove the `ftp.xemacs.org/' part, you
     are still stuck with `pub/xemacs'. This is where `--cut-dirs'
     comes in handy; it makes Wget not "see" NUMBER remote directory
     components. Here are several examples of how the `--cut-dirs'
     option works.

          No options        -> ftp.xemacs.org/pub/xemacs/
          -nH               -> pub/xemacs/
          -nH --cut-dirs=1  -> xemacs/
          -nH --cut-dirs=2  -> .

          --cut-dirs=1      -> ftp.xemacs.org/xemacs/
          ...

     If you just want to get rid of the directory structure, this
     option is similar to a combination of `-nd' and `-P'. However,
     unlike `-nd', `--cut-dirs' does not lose subdirectories--for
     instance, with `-nH --cut-dirs=1', a `beta/' subdirectory will be
     placed in `xemacs/beta', as one would expect.

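     The mapping in the examples above can be reproduced with a small
     Python function. This is a sketch of the documented results only;
     the name `local_dir' and its parameters are invented here and do
     not correspond to anything inside Wget.

```python
def local_dir(host, remote_dir, no_host=False, cut_dirs=0):
    """Local directory for a recursive retrieval of host/remote_dir."""
    parts = [p for p in remote_dir.split("/") if p]
    parts = parts[cut_dirs:]              # --cut-dirs drops leading components
    if not no_host:
        parts.insert(0, host)             # without -nH the host name comes first
    return ("/".join(parts) + "/") if parts else "."

host, path = "ftp.xemacs.org", "/pub/xemacs/"
print(local_dir(host, path))                            # ftp.xemacs.org/pub/xemacs/
print(local_dir(host, path, no_host=True))              # pub/xemacs/
print(local_dir(host, path, no_host=True, cut_dirs=1))  # xemacs/
print(local_dir(host, path, no_host=True, cut_dirs=2))  # .
print(local_dir(host, path, cut_dirs=1))                # ftp.xemacs.org/xemacs/
```

     Because only leading components are dropped, a remote
     `/pub/xemacs/beta/' with `-nH --cut-dirs=1' still maps to
     `xemacs/beta/'.
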
`-P PREFIX'
`--directory-prefix=PREFIX'
     Set directory prefix to PREFIX. The "directory prefix" is the
     directory where all other files and subdirectories will be saved
     to, i.e. the top of the retrieval tree. The default is `.' (the
     current directory).


File: wget.info,  Node: HTTP Options,  Next: FTP Options,  Prev: Directory Options,  Up: Invoking

HTTP Options
============

`-E'
`--html-extension'
     If a file of type `text/html' is downloaded and the URL does not
     end with the regexp "\.[Hh][Tt][Mm][Ll]?", this option will cause
     the suffix `.html' to be appended to the local filename. This is
     useful, for instance, when you're mirroring a remote site that uses
     `.asp' pages, but you want the mirrored pages to be viewable on
     your stock Apache server. Another good use for this is when you're
     downloading the output of CGIs. A URL like
     `http://site.com/article.cgi?25' will be saved as
     `article.cgi?25.html'.

     Note that filenames changed in this way will be re-downloaded
     every time you re-mirror a site, because Wget can't tell that the
     local `X.html' file corresponds to remote URL `X' (since it
     doesn't yet know that the URL produces output of type `text/html').
     To prevent this re-downloading, you must use `-k' and `-K' so
     that the original version of the file will be saved as `X.orig'
     (*Note Recursive Retrieval Options::).

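     The renaming rule can be expressed with the same regular
     expression in Python. This sketch only illustrates the rule as
     stated above; the function name `local_filename' is made up, and
     the `$' anchor is added to express "does not end with".

```python
import re

# The regexp quoted above, anchored at the end of the name.
HTML_SUFFIX = re.compile(r"\.[Hh][Tt][Mm][Ll]?$")

def local_filename(url_filename, content_type):
    """Append `.html' when -E would, otherwise return the name unchanged."""
    if content_type == "text/html" and not HTML_SUFFIX.search(url_filename):
        return url_filename + ".html"
    return url_filename

print(local_filename("article.cgi?25", "text/html"))   # article.cgi?25.html
print(local_filename("default.asp", "text/html"))      # default.asp.html
print(local_filename("index.htm", "text/html"))        # index.htm
print(local_filename("logo.gif", "image/gif"))         # logo.gif
```
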
`--http-user=USER'
`--http-passwd=PASSWORD'
     Specify the username USER and password PASSWORD on an HTTP server.
     According to the type of the challenge, Wget will encode them
     using either the `basic' (insecure) or the `digest' authentication
     scheme.

     Another way to specify username and password is in the URL itself
     (*Note URL Format::). For more information about security issues
     with Wget, *Note Security Considerations::.

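     To see why the `basic' scheme is called insecure, consider how
     the credentials are encoded: they are joined with a colon and
     passed through Base64, which anyone can reverse. The Python
     sketch below demonstrates this with made-up credentials; the
     function name `basic_authorization' is hypothetical.

```python
import base64

def basic_authorization(user, password):
    """Build the value of an HTTP `Authorization' header for the basic scheme."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token

header = basic_authorization("user", "secret")
print(header)                                          # Basic dXNlcjpzZWNyZXQ=
# Decoding needs no key at all:
print(base64.b64decode(header.split()[1]).decode())    # user:secret
```
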
`-C on/off'
`--cache=on/off'
     When set to off, disable server-side cache. In this case, Wget
     will send the remote server an appropriate directive (`Pragma:
     no-cache') to get the file from the remote service, rather than
     returning the cached version. This is especially useful for
     retrieving and flushing out-of-date documents on proxy servers.

     Caching is allowed by default.

`--ignore-length'
     Unfortunately, some HTTP servers (CGI programs, to be more
     precise) send out bogus `Content-Length' headers, which makes Wget
     go wild, as it thinks not all the document was retrieved. You can
     spot this syndrome if Wget retries getting the same document again
     and again, each time claiming that the (otherwise normal)
     connection has closed on the very same byte.

     With this option, Wget will ignore the `Content-Length' header--as
     if it never existed.

`--header=ADDITIONAL-HEADER'
     Define an ADDITIONAL-HEADER to be passed to the HTTP servers.
     Headers must contain a `:' preceded by one or more non-blank
     characters, and must not contain newlines.

     You may define more than one additional header by specifying
     `--header' more than once.

          wget --header='Accept-Charset: iso-8859-2' \
               --header='Accept-Language: hr'        \
                 http://fly.cc.fer.hr/

     Specification of an empty string as the header value will clear all
     previous user-defined headers.

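     The two rules for `--header' (the validity check and the
     clearing effect of an empty string) can be sketched in Python as
     follows. Both function names are invented for this illustration,
     and the reading of "preceded by one or more non-blank characters"
     as "a non-empty name without whitespace" is an assumption.

```python
def is_valid_header(header):
    """Check the rule above: `:' preceded by non-blank characters, no newlines."""
    if "\n" in header:
        return False
    name, colon, _value = header.partition(":")
    return bool(colon) and bool(name) and not any(c.isspace() for c in name)

def collect_headers(options):
    """Accumulate --header values; an empty string clears the earlier ones."""
    headers = []
    for option in options:
        if option == "":
            headers = []
        elif is_valid_header(option):
            headers.append(option)
        else:
            raise ValueError(f"invalid header: {option!r}")
    return headers

print(collect_headers(["Accept-Charset: iso-8859-2", "Accept-Language: hr"]))
print(collect_headers(["Accept-Charset: iso-8859-2", "", "Accept-Language: hr"]))
```

     The second call prints only the `Accept-Language' header, since
     the empty string discards what came before it.
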
`--proxy-user=USER'
`--proxy-passwd=PASSWORD'
     Specify the username USER and password PASSWORD for authentication
     on a proxy server. Wget will encode them using the `basic'
     authentication scheme.

`--referer=URL'
     Include `Referer: URL' header in HTTP request. Useful for
     retrieving documents with server-side processing that assume they
     are always being retrieved by interactive web browsers and only
     come out properly when Referer is set to one of the pages that
     point to them.

`-s'
`--save-headers'
     Save the headers sent by the HTTP server to the file, preceding the
     actual contents, with an empty line as the separator.

`-U AGENT-STRING'
`--user-agent=AGENT-STRING'
     Identify as AGENT-STRING to the HTTP server.

     The HTTP protocol allows the clients to identify themselves using a
     `User-Agent' header field. This enables distinguishing the WWW
     software, usually for statistical purposes or for tracing of
     protocol violations. Wget normally identifies as `Wget/VERSION',
     VERSION being the current version number of Wget.

     However, some sites have been known to impose the policy of
     tailoring the output according to the `User-Agent'-supplied
     information. While conceptually this is not such a bad idea, it
     has been abused by servers denying information to clients other
     than `Mozilla' or Microsoft `Internet Explorer'. This option
     allows you to change the `User-Agent' line issued by Wget. Use of
     this option is discouraged, unless you really know what you are
     doing.


File: wget.info,  Node: FTP Options,  Next: Recursive Retrieval Options,  Prev: HTTP Options,  Up: Invoking

FTP Options
===========

`--retr-symlinks'
     Usually, when retrieving FTP directories recursively and a symbolic
     link is encountered, the linked-to file is not downloaded.
     Instead, a matching symbolic link is created on the local
     filesystem. The pointed-to file will not be downloaded unless
     this recursive retrieval would have encountered it separately and
     downloaded it anyway.

     When `--retr-symlinks' is specified, however, symbolic links are
     traversed and the pointed-to files are retrieved. At this time,
     this option does not cause Wget to traverse symlinks to
     directories and recurse through them, but in the future it should
     be enhanced to do this.

     Note that when retrieving a file (not a directory) because it was
     specified on the command line, rather than because it was recursed
     to, this option has no effect. Symbolic links are always
     traversed in this case.

`-g on/off'
`--glob=on/off'
     Turn FTP globbing on or off. Globbing means you may use the
     shell-like special characters ("wildcards"), like `*', `?', `['
     and `]' to retrieve more than one file from the same directory at
     once, like:

          wget ftp://gnjilux.cc.fer.hr/*.msg

     By default, globbing will be turned on if the URL contains a
     globbing character. This option may be used to turn globbing on
     or off permanently.

     You may have to quote the URL to protect it from being expanded by
     your shell. Globbing makes Wget look for a directory listing,
     which is system-specific. This is why it currently works only
     with Unix FTP servers (and the ones emulating Unix `ls' output).

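     Conceptually, globbing means fetching the directory listing and
     keeping the names that match the pattern. The Python sketch below
     imitates that with the standard `fnmatch' module. The listing is
     made up, the two helper names are hypothetical, and Wget's own
     matching code may differ in detail.

```python
from fnmatch import fnmatchcase

GLOB_CHARS = "*?[]"

def has_glob(url):
    """True if the URL contains one of the globbing characters listed above."""
    return any(c in url for c in GLOB_CHARS)

def select_files(listing, pattern):
    """Pick the names from a directory listing that match the pattern."""
    return [name for name in listing if fnmatchcase(name, pattern)]

listing = ["01.msg", "02.msg", "README", "notes.txt"]   # hypothetical listing
print(has_glob("ftp://gnjilux.cc.fer.hr/*.msg"))   # True
print(select_files(listing, "*.msg"))              # ['01.msg', '02.msg']
```
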
`--passive-ftp'
     Use the "passive" FTP retrieval scheme, in which the client
     initiates the data connection. This is sometimes required for FTP
     to work behind firewalls.


File: wget.info, Node: Recursive Retrieval Options, Next: Recursive Accept/Reject Options, Prev: FTP Options, Up: Invoking

Recursive Retrieval Options
===========================

`-r'
`--recursive'
Turn on recursive retrieving. *Note Recursive Retrieval:: for more
details.

`-l DEPTH'
`--level=DEPTH'
Specify the maximum recursion depth level DEPTH (*Note Recursive
Retrieval::). The default maximum depth is 5.

`--delete-after'
This option tells Wget to delete every single file it downloads,
*after* having done so. It is useful for pre-fetching popular
pages through a proxy, e.g.:

wget -r -nd --delete-after http://whatever.com/~popular/page/

The `-r' option is to retrieve recursively, and `-nd' to not
create directories.

Note that `--delete-after' deletes files on the local machine. It
does not issue the `DELE' command to remote FTP sites, for
instance. Also note that when `--delete-after' is specified,
`--convert-links' is ignored, so `.orig' files are simply not
created in the first place.

`-k'
`--convert-links'
Convert the non-relative links to relative ones locally. Only the
references to the documents actually downloaded will be converted;
the rest will be left unchanged.

Note that only at the end of the download can Wget know which
links have been downloaded. Because of that, much of the work
done by `-k' will be performed at the end of the downloads.

`-K'
`--backup-converted'
When converting a file, back up the original version with a `.orig'
suffix. Affects the behavior of `-N' (*Note HTTP Time-Stamping
Internals::).

`-m'
`--mirror'
Turn on options suitable for mirroring. This option turns on
recursion and time-stamping, sets infinite recursion depth and
keeps FTP directory listings. It is currently equivalent to `-r
-N -l inf -nr'.

`-nr'
`--dont-remove-listing'
Don't remove the temporary `.listing' files generated by FTP
retrievals. Normally, these files contain the raw directory
listings received from FTP servers. Not removing them can be
useful to access the full remote file list when running a mirror,
or for debugging purposes.
|
||
|
||
`-p'
`--page-requisites'
This option causes Wget to download all the files that are
necessary to properly display a given HTML page. This includes
such things as inlined images, sounds, and referenced stylesheets.

Ordinarily, when downloading a single HTML page, any requisite
documents that may be needed to display it properly are not
downloaded. Using `-r' together with `-l' can help, but since
Wget does not ordinarily distinguish between external and inlined
documents, one is generally left with "leaf documents" that are
missing their requisites.

For instance, say document `1.html' contains an `<IMG>' tag
referencing `1.gif' and an `<A>' tag pointing to external document
`2.html'. Say that `2.html' is the same but that its image is
`2.gif' and it links to `3.html'. Say this continues up to some
arbitrarily high number.

If one executes the command:

wget -r -l 2 http://SITE/1.html

then `1.html', `1.gif', `2.html', `2.gif', and `3.html' will be
downloaded. As you can see, `3.html' is without its requisite
`3.gif' because Wget is simply counting the number of hops (up to
2) away from `1.html' in order to determine where to stop the
recursion. However, with this command:

wget -r -l 2 -p http://SITE/1.html

all the above files *and* `3.html''s requisite `3.gif' will be
downloaded. Similarly,

wget -r -l 1 -p http://SITE/1.html

will cause `1.html', `1.gif', `2.html', and `2.gif' to be
downloaded. One might think that:

wget -r -l 0 -p http://SITE/1.html

would download just `1.html' and `1.gif', but unfortunately this
is not the case, because `-l 0' is equivalent to `-l inf' - that
is, infinite recursion. To download a single HTML page (or a
handful of them, all specified on the command line or in a `-i' URL
input file) and its requisites, simply leave off `-r' and `-l':

wget -p http://SITE/1.html

Note that Wget will behave as if `-r' had been specified, but only
that single page and its requisites will be downloaded. Links
from that page to external documents will not be followed.
Actually, to download a single page and all its requisites (even
if they exist on separate websites), and make sure the lot
displays properly locally, this author likes to use a few options
in addition to `-p':

wget -E -H -k -K -nh -p http://SITE/DOCUMENT

To finish off this topic, it's worth knowing that Wget's idea of an
external document link is any URL specified in an `<A>' tag, an
`<AREA>' tag, or a `<LINK>' tag other than `<LINK
REL="stylesheet">'.
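
The rule in the last paragraph can be written down as a small Python
predicate. This is only a sketch of the rule as stated here, not
Wget's actual code; the function name `is_external_link' and the
dictionary form of the attributes are assumptions made for the
example.

```python
def is_external_link(tag, attributes):
    """Return True if TAG counts as an external document link.

    Follows the rule stated in the text: <A>, <AREA>, and any <LINK>
    other than <LINK REL="stylesheet">.
    """
    tag = tag.lower()
    if tag in ("a", "area"):
        return True
    if tag == "link":
        # Look up REL regardless of the case used in the document.
        rel = ""
        for name, value in attributes.items():
            if name.lower() == "rel":
                rel = (value or "").strip().lower()
        # Simplification: only an exact "stylesheet" value is recognized.
        return rel != "stylesheet"
    # Other tags (e.g. <IMG>) are not external links under this rule.
    return False


print(is_external_link("A", {"HREF": "2.html"}))        # an external link
print(is_external_link("LINK", {"REL": "stylesheet"}))  # a page requisite
```

Under this rule `<LINK REL="home" HREF="/">' is an external link,
while a stylesheet link is treated as a requisite of the page.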
|
||
|
||
|
||
File: wget.info, Node: Recursive Accept/Reject Options, Prev: Recursive Retrieval Options, Up: Invoking

Recursive Accept/Reject Options
===============================

`-A ACCLIST --accept ACCLIST'
`-R REJLIST --reject REJLIST'
Specify comma-separated lists of file name suffixes or patterns to
accept or reject (*Note Types of Files:: for more details).

`-D DOMAIN-LIST'
`--domains=DOMAIN-LIST'
Set domains to be accepted and DNS looked-up, where DOMAIN-LIST is
a comma-separated list. Note that it does *not* turn on `-H'.
This option speeds things up, even if only one host is spanned
(*Note Domain Acceptance::).

`--exclude-domains DOMAIN-LIST'
Exclude the domains given in a comma-separated DOMAIN-LIST from
DNS-lookup (*Note Domain Acceptance::).

`--follow-ftp'
Follow FTP links from HTML documents. Without this option, Wget
will ignore all the FTP links.

`--follow-tags=LIST'
Wget has an internal table of HTML tag / attribute pairs that it
considers when looking for linked documents during a recursive
retrieval. If a user wants only a subset of those tags to be
considered, however, he or she should specify such tags in a
comma-separated LIST with this option.

`-G LIST'
`--ignore-tags=LIST'
This is the opposite of the `--follow-tags' option. To skip
certain HTML tags when recursively looking for documents to
download, specify them in a comma-separated LIST.

In the past, the `-G' option was the best bet for downloading a
single page and its requisites, using a command line like:

wget -Ga,area -H -k -K -nh -r http://SITE/DOCUMENT

However, the author of this option came across a page with tags
like `<LINK REL="home" HREF="/">' and came to the realization that
`-G' was not enough. One can't just tell Wget to ignore `<LINK>',
because then stylesheets will not be downloaded. Now the best bet
for downloading a single page and its requisites is the dedicated
`--page-requisites' option.

`-H'
`--span-hosts'
Enable spanning across hosts when doing recursive retrieving
(*Note All Hosts::).

`-L'
`--relative'
Follow relative links only. Useful for retrieving a specific home
page without any distractions, not even those from the same hosts
(*Note Relative Links::).

`-I LIST'
`--include-directories=LIST'
Specify a comma-separated list of directories you wish to follow
when downloading (*Note Directory-Based Limits:: for more
details). Elements of LIST may contain wildcards.

`-X LIST'
`--exclude-directories=LIST'
Specify a comma-separated list of directories you wish to exclude
from download (*Note Directory-Based Limits:: for more details).
Elements of LIST may contain wildcards.

`-nh'
`--no-host-lookup'
Disable the time-consuming DNS lookup of almost all hosts (*Note
Host Checking::).

`-np'
`--no-parent'
Do not ever ascend to the parent directory when retrieving
recursively. This is a useful option, since it guarantees that
only the files *below* a certain hierarchy will be downloaded.
*Note Directory-Based Limits:: for more details.
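
To make the idea behind `--no-parent' concrete, here is a minimal
Python sketch of a "stays below the starting directory" test. It is
an illustration only: the function names are hypothetical, the
requirement that scheme and host match is an assumption of the sketch,
and Wget's real check may treat edge cases differently.

```python
import posixpath
from urllib.parse import urlsplit


def directory_of(path):
    """Return the directory part of a URL path, with a trailing slash."""
    if path.endswith("/"):
        return path
    return posixpath.dirname(path).rstrip("/") + "/"


def stays_below(start_url, candidate_url):
    """Return True if CANDIDATE_URL lies in or below START_URL's directory."""
    start, cand = urlsplit(start_url), urlsplit(candidate_url)
    # Assumption of this sketch: a different scheme or host never qualifies.
    if (start.scheme, start.netloc) != (cand.scheme, cand.netloc):
        return False
    base = directory_of(start.path)
    # Collapse `..' and `.' so that /a/b/../x.html is seen as /a/x.html.
    resolved = posixpath.normpath(cand.path or "/")
    if cand.path.endswith("/") and resolved != "/":
        resolved += "/"
    return resolved.startswith(base)


print(stays_below("http://SITE/a/b/index.html", "http://SITE/a/b/c/d.html"))
print(stays_below("http://SITE/a/b/index.html", "http://SITE/a/x.html"))
```

The first call stays inside `/a/b/'; the second would ascend to the
parent directory and is rejected by the sketch.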
|
||
|
||
|
||
File: wget.info, Node: Recursive Retrieval, Next: Following Links, Prev: Invoking, Up: Top

Recursive Retrieval
*******************

GNU Wget is capable of traversing parts of the Web (or a single HTTP
or FTP server), depth-first following links and directory structure.
This is called "recursive" retrieving, or "recursion".

With HTTP URLs, Wget retrieves and parses the HTML from the given
URL, retrieving the files the HTML document refers to through
markup like `href' or `src'. If the freshly downloaded file is also
of type `text/html', it will be parsed and followed further.
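
The parsing step described above can be sketched with Python's
standard `html.parser' module. This is an illustration of the idea,
not Wget's parser; the class `LinkCollector' and the function
`extract_links' are hypothetical names, and only the `href' and `src'
attributes mentioned in the text are considered.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(html):
    """Return every href/src reference found in the HTML text."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


page = '<A HREF="2.html">next</A> <IMG SRC="1.gif">'
print(extract_links(page))
```

Each reference found this way is a candidate for retrieval; documents
that turn out to be `text/html' are parsed in the same manner.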
|
||
|
||
The maximum "depth" to which the retrieval may descend is specified
with the `-l' option (the default maximum depth is five layers). *Note
Recursive Retrieval Options::.
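
How the depth limit counts hops can be shown with a toy model in
Python. It assumes the chain of documents used in the description of
`-p' (each `N.html' inlines `N.gif' and links to the next page), and
the function `simulate' is a hypothetical helper written for this
example; it does not reproduce Wget's real traversal code.

```python
def simulate(max_depth, page_requisites=False):
    """List the files fetched from the chain 1.html -> 2.html -> ...

    MAX_DEPTH plays the role of `-l'; PAGE_REQUISITES that of `-p'.
    """
    if max_depth < 1:
        # `-l 0' means infinite recursion in Wget; not modelled here.
        raise ValueError("max_depth must be at least 1 in this sketch")
    fetched = []
    page, depth = 1, 0          # 1.html is zero hops from itself
    while True:
        fetched.append(f"{page}.html")
        within_limit = depth + 1 <= max_depth
        # The inlined image is one hop further away than its page.
        if within_limit or page_requisites:
            fetched.append(f"{page}.gif")
        if not within_limit:
            break
        page, depth = page + 1, depth + 1
    return fetched


print(simulate(2))
print(simulate(2, page_requisites=True))
```

With a limit of 2 the model stops at `3.html' without `3.gif';
enabling the requisites flag adds the missing image, matching the
behavior described for `-r -l 2' and `-r -l 2 -p'.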
|
||
|
||
When retrieving an FTP URL recursively, Wget will retrieve all the
data from the given directory tree (including the subdirectories up to
the specified depth) on the remote server, creating its mirror image
locally. FTP retrieval is also limited by the `depth' parameter.

By default, Wget will create a local directory tree, corresponding to
the one found on the remote server.

Recursive retrieving can find a number of applications, the most
important of which is mirroring. It is also useful for WWW
presentations, and any other opportunities where slow network
connections should be bypassed by storing the files locally.

You should be warned that invoking recursion may cause grave
overloading on your system, because of the fast exchange of data
through the network; all of this may hamper other users' work. The
same goes for the foreign server you are mirroring--the more requests
it gets in a row, the greater its load.

Careless retrieving can also fill your file system uncontrollably,
which can grind the machine to a halt.

The load can be minimized by lowering the maximum recursion level
(`-l') and/or by lowering the number of retries (`-t'). You may also
consider using the `-w' option to slow down your requests to the remote
servers, as well as the numerous options to narrow the number of
followed links (*Note Following Links::).

Recursive retrieval is a good thing when used properly. Please take
all precautions not to wreak havoc through carelessness.
|
||
|
||
|
||
File: wget.info, Node: Following Links, Next: Time-Stamping, Prev: Recursive Retrieval, Up: Top

Following Links
***************

When retrieving recursively, one does not wish to retrieve loads of
unnecessary data. Most of the time, users know exactly what they
want to download, and want Wget to follow only specific links.

For example, if you wish to download the music archive from
`fly.cc.fer.hr', you will not want to download all the home pages that
happen to be referenced by an obscure part of the archive.

Wget possesses several mechanisms that allow you to fine-tune which
links it will follow.

* Menu:

* Relative Links:: Follow relative links only.
* Host Checking:: Follow links on the same host.
* Domain Acceptance:: Check on a list of domains.
* All Hosts:: No host restrictions.
* Types of Files:: Getting only certain files.
* Directory-Based Limits:: Getting only certain directories.
* FTP Links:: Following FTP links.


File: wget.info, Node: Relative Links, Next: Host Checking, Prev: Following Links, Up: Following Links

Relative Links
==============

When only relative links are followed (option `-L'), recursive
retrieving will never span hosts. No time-expensive DNS-lookups will
be performed, and the process will be very fast, with the minimum
strain on the network. This will suit your needs often, especially when
mirroring the output of various `x2html' converters, since they
generally output relative links.
|
||
|
||
|
||
File: wget.info, Node: Host Checking, Next: Domain Acceptance, Prev: Relative Links, Up: Following Links

Host Checking
=============

The drawback of following the relative links solely is that humans
often tend to mix them with absolute links to the very same host, and
the very same page. In this mode (which is the default mode for
following links) all URLs that refer to the same host will be retrieved.

The problem with this option is the aliases of the hosts and
domains. From the names alone, Wget cannot know that `regoc.srce.hr'
and `www.srce.hr' are the same host, or that `fly.cc.fer.hr' is the
same as `fly.cc.etf.hr'. Whenever an absolute link is encountered, the
host is DNS-looked-up with `gethostbyname' to check whether we are maybe
dealing with the same hosts. Although the results of `gethostbyname'
are cached, it is still a great slowdown, e.g. when dealing with large
indices of home pages on different hosts (because each of the hosts
must be DNS-resolved to see whether it just *might* be an alias of the
starting host).

To avoid the overhead you may use `-nh', which will turn off
DNS-resolving and make Wget compare hosts literally. This will make
things run much faster, but also much less reliably (e.g. `www.srce.hr'
and `regoc.srce.hr' will be flagged as different hosts).
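
The two comparison modes can be contrasted in a short Python sketch.
It is a simplified model, not Wget's implementation: `same_host' and
`cached_lookup' are hypothetical names, the cache is Python's
`lru_cache' rather than Wget's own, and the addresses in the usage
lines are made-up documentation addresses, not real DNS data.

```python
import socket
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_lookup(hostname):
    """Resolve HOSTNAME once and remember the answer."""
    return socket.gethostbyname(hostname)


def same_host(a, b, no_host_lookup=False, resolve=cached_lookup):
    """Decide whether two host names refer to the same host."""
    if a.lower() == b.lower():
        return True               # literal match, no lookup needed
    if no_host_lookup:
        return False              # the `-nh' case: names differ, so hosts differ
    try:
        return resolve(a) == resolve(b)
    except OSError:
        return False              # unresolvable names are treated as different


# Made-up address table standing in for DNS, so the example runs offline.
table = {"www.srce.hr": "192.0.2.1", "regoc.srce.hr": "192.0.2.1"}
print(same_host("www.srce.hr", "regoc.srce.hr", resolve=table.__getitem__))
print(same_host("www.srce.hr", "regoc.srce.hr", no_host_lookup=True))
```

With lookups enabled the two aliases compare equal; with literal
comparison they do not, which is the loss of reliability mentioned
above.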
|
||
|
||
Note that modern HTTP servers allow one IP address to host several
"virtual servers", each having its own directory hierarchy. Such
"servers" are distinguished by their hostnames (all of which point to
the same IP address); for this to work, a client must send a `Host'
header, which is what Wget does. However, in that case Wget *must not*
try to divine a host's "real" address, nor try to use the same hostname
for each access, i.e. `-nh' must be turned on.

In other words, the `-nh' option must be used to enable the
retrieval from virtual servers distinguished by their hostnames. As the
number of such server setups grows, the behavior of `-nh' may become the
default in the future.
|
||
|
||
|
||
File: wget.info, Node: Domain Acceptance, Next: All Hosts, Prev: Host Checking, Up: Following Links

Domain Acceptance
=================

With the `-D' option you may specify the domains that will be
followed. Hosts whose domain is not in this list will not be
DNS-resolved. Thus you can specify `-Dmit.edu' just to make sure that
*nothing outside of MIT gets looked up*. This is very important and
useful. It also means that `-D' does *not* imply `-H' (span all
hosts), which must be specified explicitly. Feel free to use this
option since it will speed things up, with almost all the reliability
of checking for all hosts. Thus you could invoke

wget -r -D.hr http://fly.cc.fer.hr/

to make sure that only the hosts in the `.hr' domain get DNS-looked-up
for being equal to `fly.cc.fer.hr'. So `fly.cc.etf.hr' will be checked
(only once!) and found equal, but `www.gnu.ai.mit.edu' will not even be
checked.

Of course, domain acceptance can be used to limit the retrieval to
particular domains with spanning of hosts in them, but then you must
specify `-H' explicitly. E.g.:

wget -r -H -Dmit.edu,stanford.edu http://www.mit.edu/

will start with `http://www.mit.edu/', following links across MIT
and Stanford.

If there are domains you want to exclude specifically, you can do it
with `--exclude-domains', which accepts the same type of arguments as
`-D', but will *exclude* all the listed domains. For example, if you
want to download all the hosts from the `foo.edu' domain, with the
exception of `sunsite.foo.edu', you can do it like this:

wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu http://www.foo.edu/
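
The combined effect of `-D' and `--exclude-domains' can be modelled in
a few lines of Python. This is a sketch under stated assumptions: the
helpers `in_domain' and `accepted' are hypothetical, and matching is
done on whole domain labels, which may not be exactly how Wget compares
host names against the lists.

```python
def in_domain(host, domain):
    """Return True if HOST equals DOMAIN or is a host inside it."""
    host = host.lower()
    domain = domain.lower().lstrip(".")   # accept both `.hr' and `hr'
    return host == domain or host.endswith("." + domain)


def accepted(host, domains=(), exclude_domains=()):
    """Apply an accept list (`-D') and a reject list (`--exclude-domains')."""
    if any(in_domain(host, d) for d in exclude_domains):
        return False
    # An empty accept list places no restriction on the host.
    return not domains or any(in_domain(host, d) for d in domains)


for host in ("www.foo.edu", "sunsite.foo.edu", "www.gnu.ai.mit.edu"):
    print(host, accepted(host, ["foo.edu"], ["sunsite.foo.edu"]))
```

In this model exclusion takes precedence over acceptance, so
`sunsite.foo.edu' is rejected even though it lies inside `foo.edu'.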
|
||
|
||
|
||
File: wget.info, Node: All Hosts, Next: Types of Files, Prev: Domain Acceptance, Up: Following Links

All Hosts
=========

When `-H' is specified without `-D', all hosts are freely spanned.
There are no restrictions whatsoever as to what part of the net Wget
will go to fetch documents, other than maximum retrieval depth. If a
page references `www.yahoo.com', so be it. Such an option is rarely
useful by itself.
|
||
|
||
|
||
File: wget.info, Node: Types of Files, Next: Directory-Based Limits, Prev: All Hosts, Up: Following Links

Types of Files
==============

When downloading material from the web, you will often want to
restrict the retrieval to only certain file types. For example, if you
are interested in downloading GIFs, you will not be overjoyed to get
loads of PostScript documents, and vice versa.

Wget offers two options to deal with this problem. Each option
description lists a short name, a long name, and the equivalent command
in `.wgetrc'.

`-A ACCLIST'
`--accept ACCLIST'
`accept = ACCLIST'
The argument to the `--accept' option is a list of file suffixes or
patterns that Wget will download during recursive retrieval. A
suffix is the ending part of a file name, and consists of "normal"
letters, e.g. `gif' or `.jpg'. A matching pattern contains
shell-like wildcards, e.g. `books*' or `zelazny*196[0-9]*'.

So, specifying `wget -A gif,jpg' will make Wget download only the
files ending with `gif' or `jpg', i.e. GIFs and JPEGs. On the
other hand, `wget -A "zelazny*196[0-9]*"' will download only files
beginning with `zelazny' and containing numbers from 1960 to 1969
anywhere within. Look up the manual of your shell for a
description of how pattern matching works.

Of course, any number of suffixes and patterns can be combined
into a comma-separated list, and given as an argument to `-A'.

`-R REJLIST'
`--reject REJLIST'
`reject = REJLIST'
The `--reject' option works the same way as `--accept', only its
logic is the reverse; Wget will download all files *except* the
ones matching the suffixes (or patterns) in the list.

So, if you want to download a whole page except for the cumbersome
MPEGs and .AU files, you can use `wget -R mpg,mpeg,au'.
Analogously, to download all files except the ones beginning with
`bjork', use `wget -R "bjork*"'. The quotes are to prevent
expansion by the shell.

The `-A' and `-R' options may be combined to achieve even better
fine-tuning of which files to retrieve. E.g. `wget -A "*zelazny*" -R
.ps' will download all the files having `zelazny' as a part of their
name, but *not* the PostScript files.

Note that these two options do not affect the downloading of HTML
files; Wget must load all the HTMLs to know where to go at
all--recursive retrieval would make no sense otherwise.
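
The accept/reject logic of this section can be approximated in Python
as follows. This is a sketch, not Wget's code: the helper names are
hypothetical, an entry is assumed to be a pattern when it contains a
wildcard character and a suffix otherwise, HTML files are recognized
here by their `.html' or `.htm' ending only, and Python's `fnmatch'
stands in for Wget's matcher.

```python
from fnmatch import fnmatchcase

WILDCARDS = "*?[]"


def matches(name, entry):
    """Match NAME against one list entry: a pattern or a plain suffix."""
    if any(ch in entry for ch in WILDCARDS):
        return fnmatchcase(name, entry)
    return name.endswith(entry)


def wanted(name, accept=(), reject=()):
    """Decide whether the file NAME should be downloaded."""
    # HTML files are always loaded, since they carry the links to follow.
    if name.endswith((".html", ".htm")):
        return True
    if accept and not any(matches(name, entry) for entry in accept):
        return False
    return not any(matches(name, entry) for entry in reject)


print(wanted("photo.gif", accept=["gif", "jpg"]))
print(wanted("zelazny.ps", accept=["*zelazny*"], reject=[".ps"]))
```

The second call mirrors the `-A "*zelazny*" -R .ps' example: the name
passes the accept list but is then dropped by the reject list.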
|
||
|