\input texinfo @c -*-texinfo-*-
|
|
|
|
@c %**start of header
|
|
@setfilename wget.info
|
|
@include version.texi
|
|
@set UPDATED Apr 2005
|
|
@settitle GNU Wget @value{VERSION} Manual
|
|
@c Disable the monstrous rectangles beside overfull hbox-es.
|
|
@finalout
|
|
@c Use `odd' to print double-sided.
|
|
@setchapternewpage on
|
|
@c %**end of header
|
|
|
|
@iftex
|
|
@c Remove this if you don't use A4 paper.
|
|
@afourpaper
|
|
@end iftex
|
|
|
|
@c Title for man page. The weird way texi2pod.pl is written requires
|
|
@c the preceding @set.
|
|
@set Wget Wget
|
|
@c man title Wget The non-interactive network downloader.
|
|
|
|
@dircategory Network Applications
|
|
@direntry
|
|
* Wget: (wget). The non-interactive network downloader.
|
|
@end direntry
|
|
|
|
@ifnottex
|
|
This file documents the GNU Wget utility for downloading network
data.
|
|
|
|
@c man begin COPYRIGHT
|
|
Copyright @copyright{} 1996--2005 Free Software Foundation, Inc.
|
|
|
|
Permission is granted to make and distribute verbatim copies of
|
|
this manual provided the copyright notice and this permission notice
|
|
are preserved on all copies.
|
|
|
|
@ignore
|
|
Permission is granted to process this file through TeX and print the
|
|
results, provided the printed document carries a copying permission
|
|
notice identical to this one except for the removal of this paragraph
|
|
(this paragraph not being relevant to the printed manual).
|
|
@end ignore
|
|
Permission is granted to copy, distribute and/or modify this document
|
|
under the terms of the GNU Free Documentation License, Version 1.2 or
|
|
any later version published by the Free Software Foundation; with the
|
|
Invariant Sections being ``GNU General Public License'' and ``GNU Free
|
|
Documentation License'', with no Front-Cover Texts, and with no
|
|
Back-Cover Texts. A copy of the license is included in the section
|
|
entitled ``GNU Free Documentation License''.
|
|
@c man end
|
|
@end ifnottex
|
|
|
|
@titlepage
|
|
@title GNU Wget @value{VERSION}
|
|
@subtitle The non-interactive download utility
|
|
@subtitle Updated for Wget @value{VERSION}, @value{UPDATED}
|
|
@author by Hrvoje Nik@v{s}i@'{c} and others
|
|
|
|
@ignore
|
|
@c man begin AUTHOR
|
|
Originally written by Hrvoje Niksic <hniksic@xemacs.org>.
|
|
@c man end
|
|
@c man begin SEEALSO
|
|
GNU Info entry for @file{wget}.
|
|
@c man end
|
|
@end ignore
|
|
|
|
@page
|
|
@vskip 0pt plus 1filll
|
|
Copyright @copyright{} 1996--2005, Free Software Foundation, Inc.
|
|
|
|
Permission is granted to copy, distribute and/or modify this document
|
|
under the terms of the GNU Free Documentation License, Version 1.2 or
|
|
any later version published by the Free Software Foundation; with the
|
|
Invariant Sections being ``GNU General Public License'' and ``GNU Free
|
|
Documentation License'', with no Front-Cover Texts, and with no
|
|
Back-Cover Texts. A copy of the license is included in the section
|
|
entitled ``GNU Free Documentation License''.
|
|
@end titlepage
|
|
|
|
@ifnottex
|
|
@node Top
|
|
@top Wget @value{VERSION}
|
|
|
|
This manual documents version @value{VERSION} of GNU Wget, the freely
|
|
available utility for network downloads.
|
|
|
|
Copyright @copyright{} 1996--2005 Free Software Foundation, Inc.
|
|
|
|
@menu
|
|
* Overview:: Features of Wget.
|
|
* Invoking:: Wget command-line arguments.
|
|
* Recursive Download:: Downloading interlinked pages.
|
|
* Following Links:: The available methods of chasing links.
|
|
* Time-Stamping:: Mirroring according to time-stamps.
|
|
* Startup File:: Wget's initialization file.
|
|
* Examples:: Examples of usage.
|
|
* Various:: The stuff that doesn't fit anywhere else.
|
|
* Appendices:: Some useful references.
|
|
* Copying:: You may give out copies of Wget and of this manual.
|
|
* Concept Index:: Topics covered by this manual.
|
|
@end menu
|
|
@end ifnottex
|
|
|
|
@node Overview
|
|
@chapter Overview
|
|
@cindex overview
|
|
@cindex features
|
|
|
|
@c man begin DESCRIPTION
|
|
GNU Wget is a free utility for non-interactive download of files from
|
|
the Web. It supports @sc{http}, @sc{https}, and @sc{ftp} protocols, as
|
|
well as retrieval through @sc{http} proxies.
|
|
|
|
@c man end
|
|
This chapter is a partial overview of Wget's features.
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@c man begin DESCRIPTION
|
|
Wget is non-interactive, meaning that it can work in the background,
while the user is not logged on. This allows you to start a retrieval
and disconnect from the system, letting Wget finish the work. By
contrast, most Web browsers require the user's constant presence,
which can be a great hindrance when transferring a lot of data.
|
|
@c man end
|
|
|
|
@item
|
|
@ignore
|
|
@c man begin DESCRIPTION
|
|
|
|
@c man end
|
|
@end ignore
|
|
@c man begin DESCRIPTION
|
|
Wget can follow links in @sc{html} and @sc{xhtml} pages and create local
|
|
versions of remote web sites, fully recreating the directory structure of
|
|
the original site. This is sometimes referred to as ``recursive
|
|
downloading.'' While doing that, Wget respects the Robot Exclusion
|
|
Standard (@file{/robots.txt}). Wget can be instructed to convert the
|
|
links in downloaded @sc{html} files to the local files for offline
|
|
viewing.
|
|
@c man end
|
|
|
|
@item
|
|
File name wildcard matching and recursive mirroring of directories are
|
|
available when retrieving via @sc{ftp}. Wget can read the time-stamp
|
|
information given by both @sc{http} and @sc{ftp} servers, and store it
|
|
locally. Thus Wget can see if the remote file has changed since last
|
|
retrieval, and automatically retrieve the new version if it has. This
|
|
makes Wget suitable for mirroring of @sc{ftp} sites, as well as home
|
|
pages.
|
|
|
|
@item
|
|
@ignore
|
|
@c man begin DESCRIPTION
|
|
|
|
@c man end
|
|
@end ignore
|
|
@c man begin DESCRIPTION
|
|
Wget has been designed for robustness over slow or unstable network
|
|
connections; if a download fails due to a network problem, it will
|
|
keep retrying until the whole file has been retrieved. If the server
|
|
supports regetting, it will instruct the server to continue the
|
|
download from where it left off.
|
|
@c man end
|
|
|
|
@item
|
|
Wget supports proxy servers, which can lighten the network load, speed
up retrieval and provide access behind firewalls. If you are behind a
firewall that requires a socks-style gateway instead, you can get the
socks library and build Wget with support for socks. Wget uses passive
@sc{ftp} downloading by default, with active @sc{ftp} available as an
option.
|
|
|
|
@item
|
|
Wget supports IP version 6, the next generation of IP. IPv6 is
|
|
autodetected at compile-time, and can be disabled at either build or
|
|
run time. Binaries built with IPv6 support work well in both
|
|
IPv4-only and dual family environments.
|
|
|
|
@item
|
|
Built-in features offer mechanisms to tune which links you wish to follow
|
|
(@pxref{Following Links}).
|
|
|
|
@item
|
|
The progress of individual downloads is traced using a progress gauge.
|
|
Interactive downloads are tracked using a ``thermometer''-style gauge,
|
|
whereas non-interactive ones are traced with dots, each dot
|
|
representing a fixed amount of data received (1KB by default). Either
|
|
gauge can be customized to your preferences.
|
|
|
|
@item
|
|
Most of the features are fully configurable, either through command line
|
|
options, or via the initialization file @file{.wgetrc} (@pxref{Startup
|
|
File}). Wget allows you to define @dfn{global} startup files
|
|
(@file{/usr/local/etc/wgetrc} by default) for site settings.
|
|
|
|
@ignore
|
|
@c man begin FILES
|
|
@table @samp
|
|
@item /usr/local/etc/wgetrc
|
|
Default location of the @dfn{global} startup file.
|
|
|
|
@item .wgetrc
|
|
User startup file.
|
|
@end table
|
|
@c man end
|
|
@end ignore
|
|
|
|
@item
|
|
Finally, GNU Wget is free software. This means that everyone may use
|
|
it, redistribute it and/or modify it under the terms of the GNU General
|
|
Public License, as published by the Free Software Foundation
|
|
(@pxref{Copying}).
|
|
@end itemize
|
|
|
|
@node Invoking
|
|
@chapter Invoking
|
|
@cindex invoking
|
|
@cindex command line
|
|
@cindex arguments
|
|
@cindex nohup
|
|
|
|
By default, Wget is very simple to invoke. The basic syntax is:
|
|
|
|
@example
|
|
@c man begin SYNOPSIS
|
|
wget [@var{option}]@dots{} [@var{URL}]@dots{}
|
|
@c man end
|
|
@end example
|
|
|
|
Wget will simply download all the @sc{url}s specified on the command
|
|
line. @var{URL} is a @dfn{Uniform Resource Locator}, as defined below.
|
|
|
|
However, you may wish to change some of the default parameters of
|
|
Wget. You can do it two ways: permanently, adding the appropriate
|
|
command to @file{.wgetrc} (@pxref{Startup File}), or specifying it on
|
|
the command line.
|
|
|
|
@menu
|
|
* URL Format::
|
|
* Option Syntax::
|
|
* Basic Startup Options::
|
|
* Logging and Input File Options::
|
|
* Download Options::
|
|
* Directory Options::
|
|
* HTTP Options::
|
|
* HTTPS (SSL/TLS) Options::
|
|
* FTP Options::
|
|
* Recursive Retrieval Options::
|
|
* Recursive Accept/Reject Options::
|
|
@end menu
|
|
|
|
@node URL Format
|
|
@section URL Format
|
|
@cindex URL
|
|
@cindex URL syntax
|
|
|
|
@dfn{URL} is an acronym for Uniform Resource Locator. A uniform
|
|
resource locator is a compact string representation for a resource
|
|
available via the Internet. Wget recognizes the @sc{url} syntax as per
|
|
@sc{rfc1738}. This is the most widely used form (square brackets denote
|
|
optional parts):
|
|
|
|
@example
|
|
http://host[:port]/directory/file
|
|
ftp://host[:port]/directory/file
|
|
@end example
|
|
|
|
You can also encode your username and password within a @sc{url}:
|
|
|
|
@example
|
|
ftp://user:password@@host/path
|
|
http://user:password@@host/path
|
|
@end example
|
|
|
|
Either @var{user} or @var{password}, or both, may be left out. If you
|
|
leave out either the @sc{http} username or password, no authentication
|
|
will be sent. If you leave out the @sc{ftp} username, @samp{anonymous}
|
|
will be used. If you leave out the @sc{ftp} password, your email
|
|
address will be supplied as a default password.@footnote{If you have a
@file{.netrc} file in your home directory, the password will also be
searched for there.}
|
|
|
|
@strong{Important Note}: if you specify a password-containing @sc{url}
|
|
on the command line, the username and password will be plainly visible
|
|
to all users on the system, by way of @code{ps}. On multi-user systems,
|
|
this is a big security risk. To work around it, use @code{wget -i -}
|
|
and feed the @sc{url}s to Wget's standard input, each on a separate
|
|
line, terminated by @kbd{C-d}.
|
|
|
|
You can encode unsafe characters in a @sc{url} as @samp{%xy}, @code{xy}
|
|
being the hexadecimal representation of the character's @sc{ascii}
|
|
value. Some common unsafe characters include @samp{%} (quoted as
|
|
@samp{%25}), @samp{:} (quoted as @samp{%3A}), and @samp{@@} (quoted as
|
|
@samp{%40}). Refer to @sc{rfc1738} for a comprehensive list of unsafe
|
|
characters.
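
As a rough illustration of the @samp{%xy} quoting described above, the
following Python sketch uses the standard library's
@code{urllib.parse.quote}. It is not part of Wget, and the helper name
@code{quote_unsafe} is made up for this example.

```python
from urllib.parse import quote


def quote_unsafe(text):
    """Percent-encode every character outside the unreserved set."""
    # safe="" makes quote() encode reserved characters such as
    # ':' and '@' as well, instead of leaving '/' untouched.
    return quote(text, safe="")


print(quote_unsafe("%"))   # %25
print(quote_unsafe(":"))   # %3A
print(quote_unsafe("@"))   # %40
```

A password containing @samp{@@} or @samp{:} would be quoted this way
before being embedded in a @sc{url}.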
|
|
|
|
Wget also supports the @code{type} feature for @sc{ftp} @sc{url}s. By
|
|
default, @sc{ftp} documents are retrieved in the binary mode (type
|
|
@samp{i}), which means that they are downloaded unchanged. Another
|
|
useful mode is the @samp{a} (@dfn{ASCII}) mode, which converts the line
|
|
delimiters between the different operating systems, and is thus useful
|
|
for text files. Here is an example:
|
|
|
|
@example
|
|
ftp://host/directory/file;type=a
|
|
@end example
|
|
|
|
Two alternative variants of @sc{url} specification are also supported,
because of historical (hysterical?) reasons and their widespread use.
|
|
|
|
@sc{ftp}-only syntax (supported by @code{NcFTP}):
|
|
@example
|
|
host:/dir/file
|
|
@end example
|
|
|
|
@sc{http}-only syntax (introduced by @code{Netscape}):
|
|
@example
|
|
host[:port]/dir/file
|
|
@end example
|
|
|
|
These two alternative forms are deprecated, and may cease being
|
|
supported in the future.
|
|
|
|
If you do not understand the difference between these notations, or do
|
|
not know which one to use, just use the plain ordinary format you use
|
|
with your favorite browser, like @code{Lynx} or @code{Netscape}.
|
|
|
|
@c man begin OPTIONS
|
|
|
|
@node Option Syntax
|
|
@section Option Syntax
|
|
@cindex option syntax
|
|
@cindex syntax of options
|
|
|
|
Since Wget uses GNU getopt to process command-line arguments, every
|
|
option has a long form along with the short one. Long options are
|
|
more convenient to remember, but take time to type. You may freely
|
|
mix different option styles, or specify options after the command-line
|
|
arguments. Thus you may write:
|
|
|
|
@example
|
|
wget -r --tries=10 http://fly.srk.fer.hr/ -o log
|
|
@end example
|
|
|
|
The space between the option accepting an argument and the argument may
|
|
be omitted. Instead of @samp{-o log} you can write @samp{-olog}.
|
|
|
|
You may put several options that do not require arguments together,
|
|
like:
|
|
|
|
@example
|
|
wget -drc @var{URL}
|
|
@end example
|
|
|
|
This is a complete equivalent of:
|
|
|
|
@example
|
|
wget -d -r -c @var{URL}
|
|
@end example
|
|
|
|
Since the options can be specified after the arguments, you may
|
|
terminate them with @samp{--}. So the following will try to download
|
|
@sc{url} @samp{-x}, reporting failure to @file{log}:
|
|
|
|
@example
|
|
wget -o log -- -x
|
|
@end example
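
The conventions above (options after arguments, clustered flags,
attached option arguments, and @samp{--} as a terminator) can be tried
out with Python's @code{getopt} module, which follows the same GNU
rules. This is only a sketch: the option table is a small hypothetical
subset chosen for the example, not Wget's real one, and Wget's
two-letter forms such as @samp{-nc} are not modelled.

```python
import getopt

SHORT = "drco:t:"                      # 'o' and 't' take an argument
LONG = ["tries=", "output-file="]      # hypothetical subset of long options


def parse(argv):
    """Split argv into (options, non-option arguments), GNU style."""
    return getopt.gnu_getopt(argv, SHORT, LONG)


# Options may follow the URL, and "-olog" is the same as "-o log".
print(parse(["-r", "--tries=10", "http://fly.srk.fer.hr/", "-olog"]))
# Clustered flags expand to separate options.
print(parse(["-drc", "URL"]))
# Everything after "--" is an argument, even if it starts with "-".
print(parse(["-o", "log", "--", "-x"]))
```

Note that Python's @code{gnu_getopt} stops intermixing options and
arguments when the @code{POSIXLY_CORRECT} environment variable is set.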
|
|
|
|
The options that accept comma-separated lists all respect the convention
|
|
that specifying an empty list clears its value. This can be useful to
|
|
clear the @file{.wgetrc} settings. For instance, if your @file{.wgetrc}
|
|
sets @code{exclude_directories} to @file{/cgi-bin}, the following
|
|
example will first reset it, and then set it to exclude @file{/~nobody}
|
|
and @file{/~somebody}. You can also clear the lists in @file{.wgetrc}
|
|
(@pxref{Wgetrc Syntax}).
|
|
|
|
@example
|
|
wget -X '' -X /~nobody,/~somebody
|
|
@end example
|
|
|
|
Most options that do not accept arguments are @dfn{boolean} options,
|
|
so named because their state can be captured with a yes-or-no
|
|
(``boolean'') variable. For example, @samp{--follow-ftp} tells Wget
|
|
to follow FTP links from HTML files and, on the other hand,
|
|
@samp{--no-glob} tells it not to perform file globbing on FTP URLs. A
|
|
boolean option is either @dfn{affirmative} or @dfn{negative}
|
|
(beginning with @samp{--no}). All such options share several
|
|
properties.
|
|
|
|
Unless stated otherwise, it is assumed that the default behavior is
|
|
the opposite of what the option accomplishes. For example, the
|
|
documented existence of @samp{--follow-ftp} assumes that the default
|
|
is to @emph{not} follow FTP links from HTML pages.
|
|
|
|
Affirmative options can be negated by prepending the @samp{--no-} to
|
|
the option name; negative options can be negated by omitting the
|
|
@samp{--no-} prefix. This might seem superfluous---if the default for
|
|
an affirmative option is to not do something, then why provide a way
|
|
to explicitly turn it off? But the startup file may in fact change
|
|
the default. For instance, using @code{follow_ftp = on} in
@file{.wgetrc} makes Wget follow FTP links by default, and
using @samp{--no-follow-ftp} is the only way to restore the factory
default from the command line.
|
|
|
|
@node Basic Startup Options
|
|
@section Basic Startup Options
|
|
|
|
@table @samp
|
|
@item -V
|
|
@itemx --version
|
|
Display the version of Wget.
|
|
|
|
@item -h
|
|
@itemx --help
|
|
Print a help message describing all of Wget's command-line options.
|
|
|
|
@item -b
|
|
@itemx --background
|
|
Go to background immediately after startup. If no output file is
|
|
specified via @samp{-o}, output is redirected to @file{wget-log}.
|
|
|
|
@cindex execute wgetrc command
|
|
@item -e @var{command}
|
|
@itemx --execute @var{command}
|
|
Execute @var{command} as if it were a part of @file{.wgetrc}
|
|
(@pxref{Startup File}). A command thus invoked will be executed
|
|
@emph{after} the commands in @file{.wgetrc}, thus taking precedence over
|
|
them. If you need to specify more than one wgetrc command, use multiple
|
|
instances of @samp{-e}.
|
|
|
|
@end table
|
|
|
|
@node Logging and Input File Options
|
|
@section Logging and Input File Options
|
|
|
|
@table @samp
|
|
@cindex output file
|
|
@cindex log file
|
|
@item -o @var{logfile}
|
|
@itemx --output-file=@var{logfile}
|
|
Log all messages to @var{logfile}. The messages are normally reported
|
|
to standard error.
|
|
|
|
@cindex append to log
|
|
@item -a @var{logfile}
|
|
@itemx --append-output=@var{logfile}
|
|
Append to @var{logfile}. This is the same as @samp{-o}, only it appends
|
|
to @var{logfile} instead of overwriting the old log file. If
|
|
@var{logfile} does not exist, a new file is created.
|
|
|
|
@cindex debug
|
|
@item -d
|
|
@itemx --debug
|
|
Turn on debug output, meaning various information important to the
|
|
developers of Wget if it does not work properly. Your system
|
|
administrator may have chosen to compile Wget without debug support, in
|
|
which case @samp{-d} will not work. Please note that compiling with
|
|
debug support is always safe---Wget compiled with the debug support will
|
|
@emph{not} print any debug info unless requested with @samp{-d}.
|
|
@xref{Reporting Bugs}, for more information on how to use @samp{-d} for
|
|
sending bug reports.
|
|
|
|
@cindex quiet
|
|
@item -q
|
|
@itemx --quiet
|
|
Turn off Wget's output.
|
|
|
|
@cindex verbose
|
|
@item -v
|
|
@itemx --verbose
|
|
Turn on verbose output, with all the available data. The default output
|
|
is verbose.
|
|
|
|
@item -nv
|
|
@itemx --no-verbose
|
|
Turn off verbose without being completely quiet (use @samp{-q} for
|
|
that), which means that error messages and basic information still get
|
|
printed.
|
|
|
|
@cindex input-file
|
|
@item -i @var{file}
|
|
@itemx --input-file=@var{file}
|
|
Read @sc{url}s from @var{file}. If @samp{-} is specified as
|
|
@var{file}, @sc{url}s are read from the standard input. (Use
|
|
@samp{./-} to read from a file literally named @samp{-}.)
|
|
|
|
If this function is used, no @sc{url}s need be present on the command
|
|
line. If there are @sc{url}s both on the command line and in an input
|
|
file, those on the command line will be the first ones to be
|
|
retrieved. The @var{file} need not be an @sc{html} document (but no
|
|
harm if it is)---it is enough if the @sc{url}s are just listed
|
|
sequentially.
|
|
|
|
However, if you specify @samp{--force-html}, the document will be
|
|
regarded as @samp{html}. In that case you may have problems with
|
|
relative links, which you can solve either by adding @code{<base
|
|
href="@var{url}">} to the documents or by specifying
|
|
@samp{--base=@var{url}} on the command line.
|
|
|
|
@cindex force html
|
|
@item -F
|
|
@itemx --force-html
|
|
When input is read from a file, force it to be treated as an @sc{html}
|
|
file. This enables you to retrieve relative links from existing
|
|
@sc{html} files on your local disk, by adding @code{<base
|
|
href="@var{url}">} to @sc{html}, or using the @samp{--base} command-line
|
|
option.
|
|
|
|
@cindex base for relative links in input file
|
|
@item -B @var{URL}
|
|
@itemx --base=@var{URL}
|
|
Prepends @var{URL} to relative links read from the file specified with
|
|
the @samp{-i} option.
|
|
@end table
|
|
|
|
@node Download Options
|
|
@section Download Options
|
|
|
|
@table @samp
|
|
@cindex bind address
|
|
@cindex client IP address
|
|
@cindex IP address, client
|
|
@item --bind-address=@var{ADDRESS}
|
|
When making client TCP/IP connections, bind to @var{ADDRESS} on
|
|
the local machine. @var{ADDRESS} may be specified as a hostname or IP
|
|
address. This option can be useful if your machine is bound to multiple
|
|
IPs.
|
|
|
|
@cindex retries
|
|
@cindex tries
|
|
@cindex number of retries
|
|
@item -t @var{number}
|
|
@itemx --tries=@var{number}
|
|
Set the number of retries to @var{number}. Specify 0 or @samp{inf} for
|
|
infinite retrying. The default is to retry 20 times, with the exception
|
|
of fatal errors like ``connection refused'' or ``not found'' (404),
|
|
which are not retried.
|
|
|
|
@item -O @var{file}
|
|
@itemx --output-document=@var{file}
|
|
The documents will not be written to the appropriate files, but all
|
|
will be concatenated together and written to @var{file}. If @samp{-}
|
|
is used as @var{file}, documents will be printed to standard output,
|
|
disabling link conversion. (Use @samp{./-} to print to a file
|
|
literally named @samp{-}.)
|
|
|
|
Note that a combination with @samp{-k} is only well-defined for
|
|
downloading a single document.
|
|
|
|
@cindex clobbering, file
|
|
@cindex downloading multiple times
|
|
@cindex no-clobber
|
|
@item -nc
|
|
@itemx --no-clobber
|
|
If a file is downloaded more than once in the same directory, Wget's
|
|
behavior depends on a few options, including @samp{-nc}. In certain
|
|
cases, the local file will be @dfn{clobbered}, or overwritten, upon
|
|
repeated download. In other cases it will be preserved.
|
|
|
|
When running Wget without @samp{-N}, @samp{-nc}, or @samp{-r},
|
|
downloading the same file in the same directory will result in the
|
|
original copy of @var{file} being preserved and the second copy being
|
|
named @samp{@var{file}.1}. If that file is downloaded yet again, the
|
|
third copy will be named @samp{@var{file}.2}, and so on. When
|
|
@samp{-nc} is specified, this behavior is suppressed, and Wget will
|
|
refuse to download newer copies of @samp{@var{file}}. Therefore,
|
|
``@code{no-clobber}'' is actually a misnomer in this mode---it's not
|
|
clobbering that's prevented (as the numeric suffixes were already
|
|
preventing clobbering), but rather the multiple version saving that's
|
|
prevented.
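
The naming rule described above can be modelled in a few lines of
Python. This is a sketch of the documented behavior only, not Wget's
actual code; the function names and the @code{exists} parameter are
invented for the example.

```python
import os


def next_free_name(name, exists=os.path.exists):
    """Return name if unused, else the first free name.1, name.2, ..."""
    if not exists(name):
        return name
    suffix = 1
    while exists("%s.%d" % (name, suffix)):
        suffix += 1
    return "%s.%d" % (name, suffix)


def target_for(name, no_clobber, exists=os.path.exists):
    """Pick the local file to write, or None when -nc skips the download."""
    if no_clobber and exists(name):
        return None          # keep the existing copy, fetch nothing
    return next_free_name(name, exists)
```

With @file{index.html} and @file{index.html.1} already present,
@code{target_for} returns @file{index.html.2} without @samp{-nc} and
nothing at all with it.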
|
|
|
|
When running Wget with @samp{-r}, but without @samp{-N} or @samp{-nc},
|
|
re-downloading a file will result in the new copy simply overwriting the
|
|
old. Adding @samp{-nc} will prevent this behavior, instead causing the
|
|
original version to be preserved and any newer copies on the server to
|
|
be ignored.
|
|
|
|
When running Wget with @samp{-N}, with or without @samp{-r}, the
|
|
decision as to whether or not to download a newer copy of a file depends
|
|
on the local and remote timestamp and size of the file
|
|
(@pxref{Time-Stamping}). @samp{-nc} may not be specified at the same
|
|
time as @samp{-N}.
|
|
|
|
Note that when @samp{-nc} is specified, files with the suffixes
|
|
@samp{.html} or @samp{.htm} will be loaded from the local disk and
|
|
parsed as if they had been retrieved from the Web.
|
|
|
|
@cindex continue retrieval
|
|
@cindex incomplete downloads
|
|
@cindex resume download
|
|
@item -c
|
|
@itemx --continue
|
|
Continue getting a partially-downloaded file. This is useful when you
|
|
want to finish up a download started by a previous instance of Wget, or
|
|
by another program. For instance:
|
|
|
|
@example
|
|
wget -c ftp://sunsite.doc.ic.ac.uk/ls-lR.Z
|
|
@end example
|
|
|
|
If there is a file named @file{ls-lR.Z} in the current directory, Wget
|
|
will assume that it is the first portion of the remote file, and will
|
|
ask the server to continue the retrieval from an offset equal to the
|
|
length of the local file.
|
|
|
|
Note that you don't need to specify this option if you just want the
|
|
current invocation of Wget to retry downloading a file should the
|
|
connection be lost midway through. This is the default behavior.
|
|
@samp{-c} only affects resumption of downloads started @emph{prior} to
|
|
this invocation of Wget, and whose local files are still sitting around.
|
|
|
|
Without @samp{-c}, the previous example would just download the remote
|
|
file to @file{ls-lR.Z.1}, leaving the truncated @file{ls-lR.Z} file
|
|
alone.
|
|
|
|
Beginning with Wget 1.7, if you use @samp{-c} on a non-empty file, and
|
|
it turns out that the server does not support continued downloading,
|
|
Wget will refuse to start the download from scratch, which would
|
|
effectively ruin existing contents. If you really want the download to
|
|
start from scratch, remove the file.
|
|
|
|
Also beginning with Wget 1.7, if you use @samp{-c} on a file which is of
|
|
equal size as the one on the server, Wget will refuse to download the
|
|
file and print an explanatory message. The same happens when the file
|
|
is smaller on the server than locally (presumably because it was changed
|
|
on the server since your last download attempt)---because ``continuing''
|
|
is not meaningful, no download occurs.
|
|
|
|
On the other side of the coin, while using @samp{-c}, any file that's
|
|
bigger on the server than locally will be considered an incomplete
|
|
download and only @code{(length(remote) - length(local))} bytes will be
|
|
downloaded and tacked onto the end of the local file. This behavior can
|
|
be desirable in certain cases---for instance, you can use @samp{wget -c}
|
|
to download just the new portion that's been appended to a data
|
|
collection or log file.
|
|
|
|
However, if the file is bigger on the server because it's been
|
|
@emph{changed}, as opposed to just @emph{appended} to, you'll end up
|
|
with a garbled file. Wget has no way of verifying that the local file
|
|
is really a valid prefix of the remote file. You need to be especially
|
|
careful of this when using @samp{-c} in conjunction with @samp{-r},
|
|
since every file will be considered as an ``incomplete download'' candidate.
|
|
|
|
Another instance where you'll get a garbled file if you try to use
|
|
@samp{-c} is if you have a lame @sc{http} proxy that inserts a
|
|
``transfer interrupted'' string into the local file. In the future a
|
|
``rollback'' option may be added to deal with this case.
|
|
|
|
Note that @samp{-c} only works with @sc{ftp} servers and with @sc{http}
|
|
servers that support the @code{Range} header.
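
The size comparison that @samp{-c} performs, as described above, can be
summarized in a short Python sketch. It models only the @sc{http} case
and assumes the server honors the @code{Range} header; the function
names are hypothetical and this is not Wget's implementation.

```python
def continue_plan(local_size, remote_size):
    """Model what -c does with an existing local file.

    Returns ("skip", None) when nothing is downloaded, or
    ("resume", range_value) with the HTTP Range value to request.
    """
    if local_size >= remote_size:
        # Same size, or the server copy shrank: continuing is meaningless.
        return ("skip", None)
    # The server copy is longer: ask only for the missing tail.
    return ("resume", "bytes=%d-" % local_size)


def bytes_to_fetch(local_size, remote_size):
    """Number of bytes appended to the local file by a resumed download."""
    return max(remote_size - local_size, 0)
```

Nothing in this comparison checks that the local file is a true prefix
of the remote one, which is why a changed (rather than appended)
remote file produces a garbled result.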
|
|
|
|
@cindex progress indicator
|
|
@cindex dot style
|
|
@item --progress=@var{type}
|
|
Select the type of the progress indicator you wish to use. Legal
|
|
indicators are ``dot'' and ``bar''.
|
|
|
|
The ``bar'' indicator is used by default. It draws an @sc{ascii} progress
bar graphic (also known as a ``thermometer'' display) indicating the
status of retrieval. If the output is not a TTY, the ``dot'' indicator
will be used by default.
|
|
|
|
Use @samp{--progress=dot} to switch to the ``dot'' display. It traces
|
|
the retrieval by printing dots on the screen, each dot representing a
|
|
fixed amount of downloaded data.
|
|
|
|
When using the dotted retrieval, you may also set the @dfn{style} by
|
|
specifying the type as @samp{dot:@var{style}}. Different styles assign
|
|
different meaning to one dot. With the @code{default} style each dot
|
|
represents 1K, there are ten dots in a cluster and 50 dots in a line.
|
|
The @code{binary} style has a more ``computer''-like orientation---8K
|
|
dots, 16-dot clusters and 48 dots per line (which makes for 384K
|
|
lines). The @code{mega} style is suitable for downloading very large
|
|
files---each dot represents 64K retrieved, there are eight dots in a
|
|
cluster, and 48 dots on each line (so each line contains 3M).
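
The per-line totals quoted above follow directly from the dot size and
the number of dots per line. The Python sketch below recomputes them,
assuming that ``K'' means 1024 bytes; the table and function names are
made up for this example.

```python
KIB = 1024

# (style, bytes per dot, dots per cluster, dots per line), as listed above.
DOT_STYLES = [
    ("default", 1 * KIB, 10, 50),
    ("binary", 8 * KIB, 16, 48),
    ("mega", 64 * KIB, 8, 48),
]


def line_bytes(style):
    """Bytes represented by one full line of dots in the given style."""
    for name, dot_bytes, _cluster, dots_per_line in DOT_STYLES:
        if name == style:
            return dot_bytes * dots_per_line
    raise ValueError("unknown dot style: %s" % style)


for name, _dot, _cluster, _line in DOT_STYLES:
    print("%-8s %5dK per line" % (name, line_bytes(name) // KIB))
```

The @code{binary} style comes out at 384K per line and the @code{mega}
style at 3072K, that is, 3M.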
|
|
|
|
Note that you can set the default style using the @code{progress}
|
|
command in @file{.wgetrc}. That setting may be overridden from the
|
|
command line. The exception is that, when the output is not a TTY, the
|
|
``dot'' progress will be favored over ``bar''. To force the bar output,
|
|
use @samp{--progress=bar:force}.
|
|
|
|
@item -N
|
|
@itemx --timestamping
|
|
Turn on time-stamping. @xref{Time-Stamping}, for details.
|
|
|
|
@cindex server response, print
|
|
@item -S
|
|
@itemx --server-response
|
|
Print the headers sent by @sc{http} servers and responses sent by
|
|
@sc{ftp} servers.
|
|
|
|
@cindex Wget as spider
|
|
@cindex spider
|
|
@item --spider
|
|
When invoked with this option, Wget will behave as a Web @dfn{spider},
|
|
which means that it will not download the pages, just check that they
|
|
are there. For example, you can use Wget to check your bookmarks:
|
|
|
|
@example
|
|
wget --spider --force-html -i bookmarks.html
|
|
@end example
|
|
|
|
This feature needs much more work for Wget to get close to the
|
|
functionality of real web spiders.
|
|
|
|
@cindex timeout
|
|
@item -T @var{seconds}
|
|
@itemx --timeout=@var{seconds}
|
|
Set the network timeout to @var{seconds} seconds. This is equivalent
|
|
to specifying @samp{--dns-timeout}, @samp{--connect-timeout}, and
|
|
@samp{--read-timeout}, all at the same time.
|
|
|
|
When interacting with the network, Wget can check for timeout and
|
|
abort the operation if it takes too long. This prevents anomalies
|
|
like hanging reads and infinite connects. The only timeout enabled by
|
|
default is a 900-second read timeout. Setting a timeout to 0 disables
|
|
it altogether. Unless you know what you are doing, it is best not to
|
|
change the default timeout settings.
|
|
|
|
All timeout-related options accept decimal values, as well as
|
|
subsecond values. For example, @samp{0.1} seconds is a legal (though
|
|
unwise) choice of timeout. Subsecond timeouts are useful for checking
|
|
server response times or for testing network latency.
|
|
|
|
@cindex DNS timeout
|
|
@cindex timeout, DNS
|
|
@item --dns-timeout=@var{seconds}
|
|
Set the DNS lookup timeout to @var{seconds} seconds. DNS lookups that
|
|
don't complete within the specified time will fail. By default, there
|
|
is no timeout on DNS lookups, other than that implemented by system
|
|
libraries.
|
|
|
|
@cindex connect timeout
|
|
@cindex timeout, connect
|
|
@item --connect-timeout=@var{seconds}
|
|
Set the connect timeout to @var{seconds} seconds. TCP connections that
|
|
take longer to establish will be aborted. By default, there is no
|
|
connect timeout, other than that implemented by system libraries.
|
|
|
|
@cindex read timeout
|
|
@cindex timeout, read
|
|
@item --read-timeout=@var{seconds}
|
|
Set the read (and write) timeout to @var{seconds} seconds. The
|
|
``time'' of this timeout refers to @dfn{idle time}: if, at any point in
|
|
the download, no data is received for more than the specified number
|
|
of seconds, reading fails and the download is restarted. This option
|
|
does not directly affect the duration of the entire download.
|
|
|
|
Of course, the remote server may choose to terminate the connection
|
|
sooner than this option requires. The default read timeout is 900
|
|
seconds.
|
|
|
|
@cindex bandwidth, limit
|
|
@cindex rate, limit
|
|
@cindex limit bandwidth
|
|
@item --limit-rate=@var{amount}
|
|
Limit the download speed to @var{amount} bytes per second. Amount may
|
|
be expressed in bytes, kilobytes with the @samp{k} suffix, or megabytes
|
|
with the @samp{m} suffix. For example, @samp{--limit-rate=20k} will
|
|
limit the retrieval rate to 20KB/s. This is useful when, for whatever
|
|
reason, you don't want Wget to consume the entire available bandwidth.
|
|
|
|
This option allows the use of decimal numbers, usually in conjunction
|
|
with power suffixes; for example, @samp{--limit-rate=2.5k} is a legal
|
|
value.
|
|
|
|
Note that Wget implements the limiting by sleeping the appropriate
|
|
amount of time after a network read that took less time than specified
|
|
by the rate. Eventually this strategy causes the TCP transfer to slow
|
|
down to approximately the specified rate. However, it may take some
|
|
time for this balance to be achieved, so don't be surprised if limiting
|
|
the rate doesn't work well with very small files.
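
The sleeping strategy described above can be sketched in Python as
follows. This is an illustration rather than Wget's code: the function
names are invented, and the suffixes are assumed to be binary
multiples (1024 and 1024 * 1024), which the text above does not state.

```python
def parse_rate(text):
    """Convert a --limit-rate value such as "20k" or "2.5k" to bytes/s."""
    # Assumption: k and m are binary multiples (1024 and 1024 * 1024).
    multipliers = (("k", 1024), ("m", 1024 * 1024))
    for suffix, factor in multipliers:
        if text.endswith(suffix):
            return float(text[:-1]) * factor
    return float(text)


def throttle_delay(nbytes, elapsed, limit):
    """Seconds to sleep after reading nbytes in elapsed seconds."""
    expected = nbytes / limit        # time the read should have taken
    return max(expected - elapsed, 0.0)
```

A read that was slower than the limit yields a delay of zero, so the
limiter only ever slows a transfer down. With very small files there
are too few reads for the average to settle.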
|
|
|
|
@cindex pause
|
|
@cindex wait
|
|
@item -w @var{seconds}
|
|
@itemx --wait=@var{seconds}
|
|
Wait the specified number of seconds between the retrievals. Use of
|
|
this option is recommended, as it lightens the server load by making the
|
|
requests less frequent. Instead of in seconds, the time can be
|
|
specified in minutes using the @code{m} suffix, in hours using the @code{h}
|
|
suffix, or in days using the @code{d} suffix.
|
|
|
|
Specifying a large value for this option is useful if the network or the
|
|
destination host is down, so that Wget can wait long enough to
|
|
reasonably expect the network error to be fixed before the retry. The
|
|
waiting interval specified by this option is influenced by
|
|
@code{--random-wait}, which see.
|
|
|
|
@cindex retries, waiting between
|
|
@cindex waiting between retries
|
|
@item --waitretry=@var{seconds}
|
|
If you don't want Wget to wait between @emph{every} retrieval, but only
|
|
between retries of failed downloads, you can use this option. Wget will
|
|
use @dfn{linear backoff}, waiting 1 second after the first failure on a
|
|
given file, then waiting 2 seconds after the second failure on that
|
|
file, up to the maximum number of @var{seconds} you specify. Therefore,
|
|
a value of 10 will actually make Wget wait up to (1 + 2 + ... + 10) = 55
|
|
seconds per file.
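
The total in the example above can be checked with a few lines of shell
arithmetic. This is only a sketch of the sum, not of Wget's internals;
it assumes the file fails at least ten times, and the variable names are
made up for the illustration:

@example
# Sum the linear backoff delays 1 + 2 + ... + max.
# max stands for the value given to --waitretry (assumed to be 10 here).
max=10
total=0
i=1
while [ "$i" -le "$max" ]; do
  total=$((total + i))
  i=$((i + 1))
done
echo "$total"
@end example

With @code{max} set to 10, the script prints 55, matching the figure
given above.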
|
|
|
|
Note that this option is turned on by default in the global
|
|
@file{wgetrc} file.
|
|
|
|
@cindex wait, random
|
|
@cindex random wait
|
|
@item --random-wait
|
|
Some web sites may perform log analysis to identify retrieval programs
|
|
such as Wget by looking for statistically significant similarities in
|
|
the time between requests. This option causes the time between requests
|
|
to vary between 0.5 and 1.5 * @var{wait} seconds, where @var{wait} was
|
|
specified using the @samp{--wait} option, in order to mask Wget's
|
|
presence from such analysis.
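
As a worked example under the description above: with @samp{--wait=10},
the pause between requests would vary between 0.5 * 10 = 5 and
1.5 * 10 = 15 seconds. A minimal invocation, with an arbitrarily chosen
wait value, might look like this:

@example
# Illustrative: base wait of 10 seconds, randomized for each request.
wget --wait=10 --random-wait -r http://fly.srk.fer.hr/
@end example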
|
|
|
|
A 2001 article in a publication devoted to development on a popular
|
|
consumer platform provided code to perform this analysis on the fly.
|
|
Its author suggested blocking at the class C address level to ensure
|
|
automated retrieval programs were blocked despite changing DHCP-supplied
|
|
addresses.
|
|
|
|
The @samp{--random-wait} option was inspired by this ill-advised
|
|
recommendation to block many unrelated users from a web site due to the
|
|
actions of one.
|
|
|
|
@cindex proxy
|
|
@item --no-proxy
|
|
Don't use proxies, even if the appropriate @code{*_proxy} environment
|
|
variable is defined.
|
|
|
|
@xref{Proxies}, for more information about the use of proxies with Wget.
|
|
|
|
@cindex quota
|
|
@item -Q @var{quota}
|
|
@itemx --quota=@var{quota}
|
|
Specify download quota for automatic retrievals. The value can be
|
|
specified in bytes (default), kilobytes (with @samp{k} suffix), or
|
|
megabytes (with @samp{m} suffix).
|
|
|
|
Note that quota will never affect downloading a single file. So if you
|
|
specify @samp{wget -Q10k ftp://wuarchive.wustl.edu/ls-lR.gz}, all of the
|
|
@file{ls-lR.gz} will be downloaded. The same goes even when several
|
|
@sc{url}s are specified on the command line. However, quota is
|
|
respected when retrieving either recursively, or from an input file.
|
|
Thus you may safely type @samp{wget -Q2m -i sites}---download will be
|
|
aborted when the quota is exceeded.
|
|
|
|
Setting quota to 0 or to @samp{inf} removes the download quota.
|
|
|
|
@cindex DNS cache
|
|
@cindex caching of DNS lookups
|
|
@item --no-dns-cache
|
|
Turn off caching of DNS lookups. Normally, Wget remembers the IP
|
|
addresses it looked up from DNS so it doesn't have to repeatedly
|
|
contact the DNS server for the same (typically small) set of hosts it
|
|
retrieves from. This cache exists in memory only; a new Wget run will
|
|
contact DNS again.
|
|
|
|
However, it has been reported that in some situations it is not
|
|
desirable to cache host names, even for the duration of a
|
|
short-running application like Wget. With this option Wget issues a
|
|
new DNS lookup (more precisely, a new call to @code{gethostbyname} or
|
|
@code{getaddrinfo}) each time it makes a new connection. Please note
|
|
that this option will @emph{not} affect caching that might be
|
|
performed by the resolving library or by an external caching layer,
|
|
such as NSCD.
|
|
|
|
If you don't understand exactly what this option does, you probably
|
|
won't need it.
|
|
|
|
@cindex file names, restrict
|
|
@cindex Windows file names
|
|
@item --restrict-file-names=@var{mode}
|
|
Change which characters found in remote URLs may show up in local file
|
|
names generated from those URLs. Characters that are @dfn{restricted}
|
|
by this option are escaped, i.e. replaced with @samp{%HH}, where
|
|
@samp{HH} is the hexadecimal number that corresponds to the restricted
|
|
character.
|
|
|
|
By default, Wget escapes the characters that are not valid as part of
|
|
file names on your operating system, as well as control characters that
|
|
are typically unprintable. This option is useful for changing these
|
|
defaults, either because you are downloading to a non-native partition,
|
|
or because you want to disable escaping of the control characters.
|
|
|
|
When mode is set to ``unix'', Wget escapes the character @samp{/} and
|
|
the control characters in the ranges 0--31 and 128--159. This is the
|
|
default on Unix-like operating systems.
|
|
|
|
When mode is set to ``windows'', Wget escapes the characters @samp{\},
|
|
@samp{|}, @samp{/}, @samp{:}, @samp{?}, @samp{"}, @samp{*}, @samp{<},
|
|
@samp{>}, and the control characters in the ranges 0--31 and 128--159.
|
|
In addition to this, Wget in Windows mode uses @samp{+} instead of
|
|
@samp{:} to separate host and port in local file names, and uses
|
|
@samp{@@} instead of @samp{?} to separate the query portion of the file
|
|
name from the rest. Therefore, a URL that would be saved as
|
|
@samp{www.xemacs.org:4300/search.pl?input=blah} in Unix mode would be
|
|
saved as @samp{www.xemacs.org+4300/search.pl@@input=blah} in Windows
|
|
mode. This mode is the default on Windows.
|
|
|
|
If you append @samp{,nocontrol} to the mode, as in
|
|
@samp{unix,nocontrol}, escaping of the control characters is also
|
|
switched off. You can use @samp{--restrict-file-names=nocontrol} to
|
|
turn off escaping of control characters without affecting the choice of
|
|
the OS to use as file name restriction mode.
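
To illustrate, the following sketch requests Windows-style escaping for
the URL used in the example above, which can be useful when saving to a
non-native partition from a Unix system. The @samp{-x} option is added
here only so that the host directory is created; going by the
description above, the file should end up as
@file{www.xemacs.org+4300/search.pl@@input=blah}:

@example
# Quote the URL so the shell does not interpret the question mark.
wget -x --restrict-file-names=windows \
     'http://www.xemacs.org:4300/search.pl?input=blah'
@end example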
|
|
|
|
@cindex IPv6
|
|
@item -4
|
|
@itemx --inet4-only
|
|
@itemx -6
|
|
@itemx --inet6-only
|
|
Force connecting to IPv4 or IPv6 addresses. With @samp{--inet4-only}
|
|
or @samp{-4}, Wget will only connect to IPv4 hosts, ignoring AAAA
|
|
records in DNS, and refusing to connect to IPv6 addresses specified in
|
|
URLs. Conversely, with @samp{--inet6-only} or @samp{-6}, Wget will
|
|
only connect to IPv6 hosts and ignore A records and IPv4 addresses.
|
|
|
|
Neither option should be needed normally. By default, an IPv6-aware
|
|
Wget will use the address family specified by the host's DNS record.
|
|
If the DNS responds with both IPv4 and IPv6 addresses, Wget will try them
|
|
in sequence until it finds one it can connect to. (Also see
|
|
the @code{--prefer-family} option described below.)
|
|
|
|
These options can be used to deliberately force the use of IPv4 or
|
|
IPv6 address families on dual family systems, usually to aid debugging
|
|
or to deal with broken network configuration. Only one of
|
|
@samp{--inet6-only} and @samp{--inet4-only} may be specified at the
|
|
same time. Neither option is available in Wget compiled without IPv6
|
|
support.
|
|
|
|
@item --prefer-family=IPv4/IPv6/none
|
|
When given a choice of several addresses, connect to the addresses
|
|
with the specified address family first. IPv4 addresses are preferred by
|
|
default.
|
|
|
|
This avoids spurious errors and connect attempts when accessing hosts
|
|
that resolve to both IPv6 and IPv4 addresses from IPv4 networks. For
|
|
example, @samp{www.kame.net} resolves to
|
|
@samp{2001:200:0:8002:203:47ff:fea5:3085} and to
|
|
@samp{203.178.141.194}. When the preferred family is @code{IPv4}, the
|
|
IPv4 address is used first; when the preferred family is @code{IPv6},
|
|
the IPv6 address is used first; if the specified value is @code{none},
|
|
the address order returned by DNS is used without change.
|
|
|
|
Unlike @samp{-4} and @samp{-6}, this option doesn't inhibit access to
|
|
any address family; it only changes the @emph{order} in which the
|
|
addresses are accessed. Also note that the reordering performed by
|
|
this option is @dfn{stable}---it doesn't affect the order of addresses of
|
|
the same family. That is, the relative order of all IPv4 addresses
|
|
and of all IPv6 addresses remains intact in all cases.
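
For instance, to try IPv6 addresses first while still leaving IPv4
addresses available, one might run something like the following against
the dual-family host mentioned above:

@example
# Illustrative: prefer IPv6, but do not exclude IPv4 addresses.
wget --prefer-family=IPv6 http://www.kame.net/
@end example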
|
|
|
|
@item --retry-connrefused
|
|
Consider ``connection refused'' a transient error and try again.
|
|
Normally Wget gives up on a URL when it is unable to connect to the
|
|
site because failure to connect is taken as a sign that the server is
|
|
not running at all and that retries would not help. This option is
|
|
for mirroring unreliable sites whose servers tend to disappear for
|
|
short periods of time.
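
A sketch of how this might be combined with @samp{--waitretry} when
mirroring such a site; the 30-second cap is an arbitrary value chosen
for illustration:

@example
# Illustrative: retry refused connections, backing off up to 30 seconds.
wget --retry-connrefused --waitretry=30 -r http://fly.srk.fer.hr/
@end example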
|
|
|
|
@cindex user
|
|
@cindex password
|
|
@cindex authentication
|
|
@item --user=@var{user}
|
|
@itemx --password=@var{password}
|
|
Specify the username @var{user} and password @var{password} for both
|
|
@sc{ftp} and @sc{http} file retrieval. These parameters can be overridden
|
|
using the @samp{--ftp-user} and @samp{--ftp-password} options for
|
|
@sc{ftp} connections and the @samp{--http-user} and @samp{--http-password}
|
|
options for @sc{http} connections.
|
|
@end table
|
|
|
|
@node Directory Options
|
|
@section Directory Options
|
|
|
|
@table @samp
|
|
@item -nd
|
|
@itemx --no-directories
|
|
Do not create a hierarchy of directories when retrieving recursively.
|
|
With this option turned on, all files will get saved to the current
|
|
directory, without clobbering (if a name shows up more than once, the
|
|
filenames will get extensions @samp{.n}).
|
|
|
|
@item -x
|
|
@itemx --force-directories
|
|
The opposite of @samp{-nd}---create a hierarchy of directories, even if
|
|
one would not have been created otherwise. E.g. @samp{wget -x
|
|
http://fly.srk.fer.hr/robots.txt} will save the downloaded file to
|
|
@file{fly.srk.fer.hr/robots.txt}.
|
|
|
|
@item -nH
|
|
@itemx --no-host-directories
|
|
Disable generation of host-prefixed directories. By default, invoking
|
|
Wget with @samp{-r http://fly.srk.fer.hr/} will create a structure of
|
|
directories beginning with @file{fly.srk.fer.hr/}. This option disables
|
|
such behavior.
|
|
|
|
@item --protocol-directories
|
|
Use the protocol name as a directory component of local file names. For
|
|
example, with this option, @samp{wget -r http://@var{host}} will save to
|
|
@samp{http/@var{host}/...} rather than just to @samp{@var{host}/...}.
|
|
|
|
@cindex cut directories
|
|
@item --cut-dirs=@var{number}
|
|
Ignore @var{number} directory components. This is useful for getting a
|
|
fine-grained control over the directory where recursive retrieval will
|
|
be saved.
|
|
|
|
Take, for example, the directory at
|
|
@samp{ftp://ftp.xemacs.org/pub/xemacs/}. If you retrieve it with
|
|
@samp{-r}, it will be saved locally under
|
|
@file{ftp.xemacs.org/pub/xemacs/}. While the @samp{-nH} option can
|
|
remove the @file{ftp.xemacs.org/} part, you are still stuck with
|
|
@file{pub/xemacs}. This is where @samp{--cut-dirs} comes in handy; it
|
|
makes Wget not ``see'' @var{number} remote directory components. Here
|
|
are several examples of how the @samp{--cut-dirs} option works.
|
|
|
|
@example
|
|
@group
|
|
No options -> ftp.xemacs.org/pub/xemacs/
|
|
-nH -> pub/xemacs/
|
|
-nH --cut-dirs=1 -> xemacs/
|
|
-nH --cut-dirs=2 -> .
|
|
|
|
--cut-dirs=1 -> ftp.xemacs.org/xemacs/
|
|
...
|
|
@end group
|
|
@end example
|
|
|
|
If you just want to get rid of the directory structure, this option is
|
|
similar to a combination of @samp{-nd} and @samp{-P}. However, unlike
|
|
@samp{-nd}, @samp{--cut-dirs} does not lose subdirectories---for
|
|
instance, with @samp{-nH --cut-dirs=1}, a @file{beta/} subdirectory will
|
|
be placed in @file{xemacs/beta}, as one would expect.
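
Putting the pieces together, the following sketch retrieves the
directory from the example above into a local directory of its own. The
name @file{xemacs-mirror} is a hypothetical choice for this
illustration:

@example
# -nH drops ftp.xemacs.org/, --cut-dirs=2 drops pub/xemacs/, and
# -P places the result under the (hypothetical) xemacs-mirror/ directory.
wget -r -nH --cut-dirs=2 -P xemacs-mirror ftp://ftp.xemacs.org/pub/xemacs/
@end example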
|
|
|
|
@cindex directory prefix
|
|
@item -P @var{prefix}
|
|
@itemx --directory-prefix=@var{prefix}
|
|
Set directory prefix to @var{prefix}. The @dfn{directory prefix} is the
|
|
directory where all other files and subdirectories will be saved to,
|
|
i.e. the top of the retrieval tree. The default is @samp{.} (the
|
|
current directory).
|
|
@end table
|
|
|
|
@node HTTP Options
|
|
@section HTTP Options
|
|
|
|
@table @samp
|
|
@cindex .html extension
|
|
@item -E
|
|
@itemx --html-extension
|
|
If a file of type @samp{application/xhtml+xml} or @samp{text/html} is
|
|
downloaded and the URL does not end with the regexp
|
|
@samp{\.[Hh][Tt][Mm][Ll]?}, this option will cause the suffix @samp{.html}
|
|
to be appended to the local filename. This is useful, for instance, when
|
|
you're mirroring a remote site that uses @samp{.asp} pages, but you want
|
|
the mirrored pages to be viewable on your stock Apache server. Another
|
|
good use for this is when you're downloading CGI-generated materials. A URL
|
|
like @samp{http://site.com/article.cgi?25} will be saved as
|
|
@file{article.cgi?25.html}.
|
|
|
|
Note that filenames changed in this way will be re-downloaded every time
|
|
you re-mirror a site, because Wget can't tell that the local
|
|
@file{@var{X}.html} file corresponds to remote URL @samp{@var{X}} (since
|
|
it doesn't yet know that the URL produces output of type
|
|
@samp{text/html} or @samp{application/xhtml+xml}). To prevent this
|
|
re-downloading, you must use @samp{-k} and @samp{-K} so that the original
|
|
version of the file will be saved as @file{@var{X}.orig} (@pxref{Recursive
|
|
Retrieval Options}).
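
A hedged sketch of such a mirroring run, using the placeholder host from
the example above:

@example
# -E appends .html where needed; -k and -K are added, as described
# above, so that the original version is kept as X.orig.
wget -r -E -k -K http://site.com/
@end example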
|
|
|
|
@cindex http user
|
|
@cindex http password
|
|
@cindex authentication
|
|
@item --http-user=@var{user}
|
|
@itemx --http-password=@var{password}
|
|
Specify the username @var{user} and password @var{password} on an
|
|
@sc{http} server. According to the type of the challenge, Wget will
|
|
encode them using either the @code{basic} (insecure) or the
|
|
@code{digest} authentication scheme.
|
|
|
|
Another way to specify username and password is in the @sc{url} itself
|
|
(@pxref{URL Format}). Either method reveals your password to anyone who
|
|
bothers to run @code{ps}. To prevent the passwords from being seen,
|
|
store them in @file{.wgetrc} or @file{.netrc}, and make sure to protect
|
|
those files from other users with @code{chmod}. If the passwords are
|
|
really important, do not leave them lying in those files either---edit
|
|
the files and delete them after Wget has started the download.
|
|
|
|
@iftex
|
|
For more information about security issues with Wget, see @ref{Security
|
|
Considerations}.
|
|
@end iftex
|
|
|
|
@cindex proxy
|
|
@cindex cache
|
|
@item --no-cache
|
|
Disable server-side cache. In this case, Wget will send the remote
|
|
server an appropriate directive (@samp{Pragma: no-cache}) to get the
|
|
file from the remote service, rather than returning the cached version.
|
|
This is especially useful for retrieving and flushing out-of-date
|
|
documents on proxy servers.
|
|
|
|
Caching is allowed by default.
|
|
|
|
@cindex cookies
|
|
@item --no-cookies
|
|
Disable the use of cookies. Cookies are a mechanism for maintaining
|
|
server-side state. The server sends the client a cookie using the
|
|
@code{Set-Cookie} header, and the client responds with the same cookie
|
|
upon further requests. Since cookies allow the server owners to keep
|
|
track of visitors and for sites to exchange this information, some
|
|
consider them a breach of privacy. The default is to use cookies;
|
|
however, @emph{storing} cookies is not on by default.
|
|
|
|
@cindex loading cookies
|
|
@cindex cookies, loading
|
|
@item --load-cookies @var{file}
|
|
Load cookies from @var{file} before the first HTTP retrieval.
|
|
@var{file} is a textual file in the format originally used by Netscape's
|
|
@file{cookies.txt} file.
|
|
|
|
You will typically use this option when mirroring sites that require
|
|
that you be logged in to access some or all of their content. The login
|
|
process typically works by the web server issuing an @sc{http} cookie
|
|
upon receiving and verifying your credentials. The cookie is then
|
|
resent by the browser when accessing that part of the site, and so
|
|
proves your identity.
|
|
|
|
Mirroring such a site requires Wget to send the same cookies your
|
|
browser sends when communicating with the site. This is achieved by
|
|
@samp{--load-cookies}---simply point Wget to the location of the
|
|
@file{cookies.txt} file, and it will send the same cookies your browser
|
|
would send in the same situation. Different browsers keep textual
|
|
cookie files in different locations:
|
|
|
|
@table @asis
|
|
@item Netscape 4.x.
|
|
The cookies are in @file{~/.netscape/cookies.txt}.
|
|
|
|
@item Mozilla and Netscape 6.x.
|
|
Mozilla's cookie file is also named @file{cookies.txt}, located
|
|
somewhere under @file{~/.mozilla}, in the directory of your profile.
|
|
The full path usually ends up looking somewhat like
|
|
@file{~/.mozilla/default/@var{some-weird-string}/cookies.txt}.
|
|
|
|
@item Internet Explorer.
|
|
You can produce a cookie file Wget can use by using the File menu,
|
|
Import and Export, Export Cookies. This has been tested with Internet
|
|
Explorer 5; it is not guaranteed to work with earlier versions.
|
|
|
|
@item Other browsers.
|
|
If you are using a different browser to create your cookies,
|
|
@samp{--load-cookies} will only work if you can locate or produce a
|
|
cookie file in the Netscape format that Wget expects.
|
|
@end table
|
|
|
|
If you cannot use @samp{--load-cookies}, there might still be an
|
|
alternative. If your browser supports a ``cookie manager'', you can use
|
|
it to view the cookies used when accessing the site you're mirroring.
|
|
Write down the name and value of the cookie, and manually instruct Wget
|
|
to send those cookies, bypassing the ``official'' cookie support:
|
|
|
|
@example
|
|
wget --no-cookies --header "Cookie: @var{name}=@var{value}"
|
|
@end example
|
|
|
|
@cindex saving cookies
|
|
@cindex cookies, saving
|
|
@item --save-cookies @var{file}
|
|
Save cookies to @var{file} before exiting. This will not save cookies
|
|
that have expired or that have no expiry time (so-called ``session
|
|
cookies''), but also see @samp{--keep-session-cookies}.
|
|
|
|
@cindex cookies, session
|
|
@cindex session cookies
|
|
@item --keep-session-cookies
|
|
When specified, causes @samp{--save-cookies} to also save session
|
|
cookies. Session cookies are normally not saved because they are
|
|
meant to be kept in memory and forgotten when you exit the browser.
|
|
Saving them is useful on sites that require you to log in or to visit
|
|
the home page before you can access some pages. With this option,
|
|
multiple Wget runs are considered a single browser session as far as
|
|
the site is concerned.
|
|
|
|
Since the cookie file format does not normally carry session cookies,
|
|
Wget marks them with an expiry timestamp of 0. Wget's
|
|
@samp{--load-cookies} recognizes those as session cookies, but it might
|
|
confuse other browsers. Also note that cookies so loaded will be
|
|
treated as other session cookies, which means that if you want
|
|
@samp{--save-cookies} to preserve them again, you must use
|
|
@samp{--keep-session-cookies} again.
|
|
|
|
@cindex Content-Length, ignore
|
|
@cindex ignore length
|
|
@item --ignore-length
|
|
Unfortunately, some @sc{http} servers (@sc{cgi} programs, to be more
|
|
precise) send out bogus @code{Content-Length} headers, which makes Wget
|
|
go wild, as it thinks not all the document was retrieved. You can spot
|
|
this syndrome if Wget retries getting the same document again and again,
|
|
each time claiming that the (otherwise normal) connection has closed on
|
|
the very same byte.
|
|
|
|
With this option, Wget will ignore the @code{Content-Length} header---as
|
|
if it never existed.
|
|
|
|
@cindex header, add
|
|
@item --header=@var{header-line}
|
|
Send @var{header-line} along with the rest of the headers in each
|
|
@sc{http} request. The supplied header is sent as-is, which means it
|
|
must contain a name and a value separated by a colon, and must not contain
|
|
newlines.
|
|
|
|
You may define more than one additional header by specifying
|
|
@samp{--header} more than once.
|
|
|
|
@example
|
|
@group
|
|
wget --header='Accept-Charset: iso-8859-2' \
|
|
--header='Accept-Language: hr' \
|
|
http://fly.srk.fer.hr/
|
|
@end group
|
|
@end example
|
|
|
|
Specification of an empty string as the header value will clear all
|
|
previous user-defined headers.
|
|
|
|
As of Wget 1.10, this option can be used to override headers otherwise
|
|
generated automatically. This example instructs Wget to connect to
|
|
localhost, but to specify @samp{foo.bar} in the @code{Host} header:
|
|
|
|
@example
|
|
wget --header="Host: foo.bar" http://localhost/
|
|
@end example
|
|
|
|
In versions of Wget prior to 1.10 such use of @samp{--header} caused
|
|
sending of duplicate headers.
|
|
|
|
@cindex proxy user
|
|
@cindex proxy password
|
|
@cindex proxy authentication
|
|
@item --proxy-user=@var{user}
|
|
@itemx --proxy-password=@var{password}
|
|
Specify the username @var{user} and password @var{password} for
|
|
authentication on a proxy server. Wget will encode them using the
|
|
@code{basic} authentication scheme.
|
|
|
|
Security considerations similar to those with @samp{--http-password}
|
|
pertain here as well.
|
|
|
|
@cindex http referer
|
|
@cindex referer, http
|
|
@item --referer=@var{url}
|
|
Include the @samp{Referer: @var{url}} header in the HTTP request. Useful for
|
|
retrieving documents with server-side processing that assume they are
|
|
always being retrieved by interactive web browsers and only come out
|
|
properly when Referer is set to one of the pages that point to them.
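
As a rough illustration, the following pretends that the document was
reached by following a link from an index page. Both URLs are
hypothetical placeholders:

@example
# Hypothetical URLs: index.html is assumed to link to download.php.
wget --referer=http://server.com/index.html \
     http://server.com/download.php
@end example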
|
|
|
|
@cindex server response, save
|
|
@item --save-headers
|
|
Save the headers sent by the @sc{http} server to the file, preceding the
|
|
actual contents, with an empty line as the separator.
|
|
|
|
@cindex user-agent
|
|
@item -U @var{agent-string}
|
|
@itemx --user-agent=@var{agent-string}
|
|
Identify as @var{agent-string} to the @sc{http} server.
|
|
|
|
The @sc{http} protocol allows the clients to identify themselves using a
|
|
@code{User-Agent} header field. This enables distinguishing the
|
|
@sc{www} software, usually for statistical purposes or for tracing of
|
|
protocol violations. Wget normally identifies as
|
|
@samp{Wget/@var{version}}, @var{version} being the current version
|
|
number of Wget.
|
|
|
|
However, some sites have been known to impose the policy of tailoring
|
|
the output according to the @code{User-Agent}-supplied information.
|
|
While this is not such a bad idea in theory, it has been abused by
|
|
servers denying information to clients other than (historically)
|
|
Netscape or, more frequently, Microsoft Internet Explorer. This
|
|
option allows you to change the @code{User-Agent} line issued by Wget.
|
|
Use of this option is discouraged, unless you really know what you are
|
|
doing.
|
|
|
|
Specifying an empty user agent with @samp{--user-agent=""} instructs Wget
|
|
not to send the @code{User-Agent} header in @sc{http} requests.
|
|
|
|
@cindex POST
|
|
@item --post-data=@var{string}
|
|
@itemx --post-file=@var{file}
|
|
Use POST as the method for all HTTP requests and send the specified data
|
|
in the request body. @code{--post-data} sends @var{string} as data,
|
|
whereas @code{--post-file} sends the contents of @var{file}. Other than
|
|
that, they work in exactly the same way.
|
|
|
|
Please be aware that Wget needs to know the size of the POST data in
|
|
advance. Therefore the argument to @code{--post-file} must be a regular
|
|
file; specifying a FIFO or something like @file{/dev/stdin} won't work.
|
|
It's not quite clear how to work around this limitation inherent in
|
|
HTTP/1.0. Although HTTP/1.1 introduces @dfn{chunked} transfer that
|
|
doesn't require knowing the request length in advance, a client can't
|
|
use chunked unless it knows it's talking to an HTTP/1.1 server. And it
|
|
can't know that until it receives a response, which in turn requires the
|
|
request to have been completed -- a chicken-and-egg problem.
|
|
|
|
Note: if Wget is redirected after the POST request is completed, it
|
|
will not send the POST data to the redirected URL. This is because
|
|
URLs that process POST often respond with a redirection to a regular
|
|
page, which does not desire or accept POST. It is not completely
|
|
clear that this behavior is optimal; if it doesn't work out, it might
|
|
be changed in the future.
|
|
|
|
This example shows how to log in to a server using POST and then proceed to
|
|
download the desired pages, presumably only accessible to authorized
|
|
users:
|
|
|
|
@example
|
|
@group
|
|
# @r{Log in to the server. This needs to be done only once.}
|
|
wget --save-cookies cookies.txt \
|
|
--post-data 'user=foo&password=bar' \
|
|
http://server.com/auth.php
|
|
|
|
# @r{Now grab the page or pages we care about.}
|
|
wget --load-cookies cookies.txt \
|
|
-p http://server.com/interesting/article.php
|
|
@end group
|
|
@end example
|
|
|
|
If the server is using session cookies to track user authentication,
|
|
the above will not work because @samp{--save-cookies} will not save
|
|
them (and neither will browsers) and the @file{cookies.txt} file will
|
|
be empty. In that case use @samp{--keep-session-cookies} along with
|
|
@samp{--save-cookies} to force saving of session cookies.
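
Under that assumption, the login step from the example above might be
adjusted as follows; the host, form fields, and file names are the same
placeholders as before:

@example
# Log in, saving session cookies as well as persistent ones.
wget --save-cookies cookies.txt --keep-session-cookies \
     --post-data 'user=foo&password=bar' \
     http://server.com/auth.php

# Reuse the saved cookies for the actual retrieval.
wget --load-cookies cookies.txt \
     -p http://server.com/interesting/article.php
@end example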
|
|
@end table
|
|
|
|
@node HTTPS (SSL/TLS) Options
|
|
@section HTTPS (SSL/TLS) Options
|
|
|
|
@cindex SSL
|
|
To support encrypted HTTP (HTTPS) downloads, Wget must be compiled
|
|
with an external SSL library, currently OpenSSL. If Wget is compiled
|
|
without SSL support, none of these options are available.
|
|
|
|
@table @samp
|
|
@cindex SSL protocol, choose
|
|
@item --secure-protocol=@var{protocol}
|
|
Choose the secure protocol to be used. Legal values are @samp{auto},
|
|
@samp{SSLv2}, @samp{SSLv3}, and @samp{TLSv1}. If @samp{auto} is used,
|
|
the SSL library is given the liberty of choosing the appropriate
|
|
protocol automatically, which is achieved by sending an SSLv2 greeting
|
|
and announcing support for SSLv3 and TLSv1. This is the default.
|
|
|
|
Specifying @samp{SSLv2}, @samp{SSLv3}, or @samp{TLSv1} forces the use
|
|
of the corresponding protocol. This is useful when talking to old and
|
|
buggy SSL server implementations that make it hard for OpenSSL to
|
|
choose the correct protocol version. Fortunately, such servers are
|
|
quite rare.
|
|
|
|
@cindex SSL certificate, check
|
|
@item --no-check-certificate
|
|
Don't check the server certificate against the available certificate
|
|
authorities. Also don't require the URL host name to match the common
|
|
name presented by the certificate.
|
|
|
|
As of Wget 1.10, the default is to verify the server's certificate
|
|
against the recognized certificate authorities, breaking the SSL
|
|
handshake and aborting the download if the verification fails.
|
|
Although this provides more secure downloads, it does break
|
|
interoperability with some sites that worked with previous Wget
|
|
versions, particularly those using self-signed, expired, or otherwise
|
|
invalid certificates. This option forces an ``insecure'' mode of
|
|
operation that turns the certificate verification errors into warnings
|
|
and allows you to proceed.
|
|
|
|
If you encounter ``certificate verification'' errors or ones saying
|
|
that ``common name doesn't match requested host name'', you can use
|
|
this option to bypass the verification and proceed with the download.
|
|
@emph{Only use this option if you are otherwise convinced of the
|
|
site's authenticity, or if you really don't care about the validity of
|
|
its certificate.} It is almost always a bad idea not to check the
|
|
certificates when transmitting confidential or important data.
|
|
|
|
@cindex SSL certificate
|
|
@item --certificate=@var{file}
|
|
Use the client certificate stored in @var{file}. This is needed for
|
|
servers that are configured to require certificates from the clients
|
|
that connect to them. Normally a certificate is not required and this
|
|
switch is optional.
|
|
|
|
@cindex SSL certificate type, specify
|
|
@item --certificate-type=@var{type}
|
|
Specify the type of the client certificate. Legal values are
|
|
@samp{PEM} (assumed by default) and @samp{DER}, also known as
|
|
@samp{ASN1}.
|
|
|
|
@item --private-key=@var{file}
|
|
Read the private key from @var{file}. This allows you to provide the
|
|
private key in a file separate from the certificate.
|
|
|
|
@item --private-key-type=@var{type}
|
|
Specify the type of the private key. Accepted values are @samp{PEM}
|
|
(the default) and @samp{DER}.
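
For a server that demands a client certificate, the options above might
be combined roughly as follows. The file names and the host are
hypothetical placeholders:

@example
# client.pem and client.key are hypothetical PEM files.
wget --certificate=client.pem --certificate-type=PEM \
     --private-key=client.key --private-key-type=PEM \
     https://example.com/
@end example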
|
|
|
|
@item --ca-certificate=@var{file}
|
|
Use @var{file} as the file with the bundle of certificate authorities
|
|
(``CA'') to verify the peers. The certificates must be in PEM format.
|
|
|
|
Without this option Wget looks for CA certificates at the
|
|
system-specified locations, chosen at OpenSSL installation time.
|
|
|
|
@cindex SSL certificate authority
|
|
@item --ca-directory=@var{directory}
|
|
Specifies the directory containing CA certificates in PEM format. Each
|
|
file contains one CA certificate, and the file name is based on a hash
|
|
value derived from the certificate. This is achieved by processing a
|
|
certificate directory with the @code{c_rehash} utility supplied with
|
|
OpenSSL. Using @samp{--ca-directory} is more efficient than
|
|
@samp{--ca-certificate} when many certificates are installed because
|
|
it allows Wget to fetch certificates on demand.
|
|
|
|
Without this option Wget looks for CA certificates at the
|
|
system-specified locations, chosen at OpenSSL installation time.
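
As a sketch, a private directory of CA certificates could be prepared
and used like this. The directory name @file{my-ca-certs} and the host
are assumptions made for the example:

@example
# Create the hash-based file names OpenSSL expects
# (my-ca-certs is a hypothetical directory of PEM certificates).
c_rehash my-ca-certs
wget --ca-directory=my-ca-certs https://example.com/
@end example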
|
|
|
|
@cindex entropy, specifying source of
|
|
@cindex randomness, specifying source of
|
|
@item --random-file=@var{file}
|
|
Use @var{file} as the source of random data for seeding the
|
|
pseudo-random number generator on systems without @file{/dev/random}.
|
|
|
|
On such systems the SSL library needs an external source of randomness
|
|
to initialize. Randomness may be provided by EGD (see
|
|
@samp{--egd-file} below) or read from an external source specified by
|
|
the user. If this option is not specified, Wget looks for random data
|
|
in @code{$RANDFILE} or, if that is unset, in @file{$HOME/.rnd}. If
|
|
none of those are available, it is likely that SSL encryption will not
|
|
be usable.
|
|
|
|
If you're getting the ``Could not seed OpenSSL PRNG; disabling SSL.''
|
|
error, you should provide random data using some of the methods
|
|
described above.
|
|
|
|
@cindex EGD
|
|
@item --egd-file=@var{file}
|
|
Use @var{file} as the EGD socket. EGD stands for @dfn{Entropy
|
|
Gathering Daemon}, a user-space program that collects data from
|
|
various unpredictable system sources and makes it available to other
|
|
programs that might need it. Encryption software, such as the SSL
|
|
library, needs sources of non-repeating randomness to seed the random
|
|
number generator used to produce cryptographically strong keys.
|
|
|
|
OpenSSL allows users to specify their own source of entropy using the
|
|
@code{RANDFILE} environment variable. If this variable is unset, or
|
|
if the specified file does not produce enough randomness, OpenSSL will
|
|
read random data from the EGD socket specified using this option.
|
|
|
|
If this option is not specified (and the equivalent startup command is
|
|
not used), EGD is never contacted. EGD is not needed on modern Unix
|
|
systems that support @file{/dev/random}.
|
|
@end table
|
|
|
|
@node FTP Options
|
|
@section FTP Options
|
|
|
|
@table @samp
|
|
@cindex ftp user
|
|
@cindex ftp password
|
|
@cindex ftp authentication
|
|
@item --ftp-user=@var{user}
|
|
@itemx --ftp-password=@var{password}
|
|
Specify the username @var{user} and password @var{password} on an
|
|
@sc{ftp} server. Without this, or the corresponding startup option,
|
|
the password defaults to @samp{-wget@@}, normally used for anonymous
|
|
FTP.
|
|
|
|
Another way to specify username and password is in the @sc{url} itself
|
|
(@pxref{URL Format}). Either method reveals your password to anyone who
|
|
bothers to run @code{ps}. To prevent the passwords from being seen,
|
|
store them in @file{.wgetrc} or @file{.netrc}, and make sure to protect
|
|
those files from other users with @code{chmod}. If the passwords are
|
|
really important, do not leave them lying in those files either---edit
|
|
the files and delete them after Wget has started the download.
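
A minimal sketch of the @file{.netrc} approach follows. The host, user
name and password are placeholders, and the entry assumes the
conventional @file{.netrc} keywords @samp{machine}, @samp{login} and
@samp{password}:

@example
# Placeholder host, user and password; substitute your own.
netrc="$HOME/.netrc"
# Make sure a newly created file is private from the start
umask 077
printf 'machine ftp.example.com login myuser password mypass\n' >> "$netrc"
# Tighten permissions in case the file already existed
chmod 600 "$netrc"
@end example

Wget can then be run against that host without any credentials on its
command line.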
|
|
|
|
@iftex
|
|
For more information about security issues with Wget, see @ref{Security
|
|
Considerations}.
|
|
@end iftex
|
|
|
|
@cindex .listing files, removing
|
|
@item --no-remove-listing
|
|
Don't remove the temporary @file{.listing} files generated by @sc{ftp}
|
|
retrievals. Normally, these files contain the raw directory listings
|
|
received from @sc{ftp} servers. Not removing them can be useful for
|
|
debugging purposes, or when you want to be able to easily check on the
|
|
contents of remote server directories (e.g. to verify that a mirror
|
|
you're running is complete).
|
|
|
|
Note that even though Wget writes to a known filename for this file,
|
|
this is not a security hole in the scenario of a user making
|
|
@file{.listing} a symbolic link to @file{/etc/passwd} or something and
|
|
asking @code{root} to run Wget in his or her directory. Depending on
|
|
the options used, either Wget will refuse to write to @file{.listing},
|
|
making the globbing/recursion/time-stamping operation fail, or the
|
|
symbolic link will be deleted and replaced with the actual
|
|
@file{.listing} file, or the listing will be written to a
|
|
@file{.listing.@var{number}} file.
|
|
|
|
Even though this situation isn't a problem, @code{root} should
|
|
never run Wget in a non-trusted user's directory. A user could do
|
|
something as simple as linking @file{index.html} to @file{/etc/passwd}
|
|
and asking @code{root} to run Wget with @samp{-N} or @samp{-r} so the file
|
|
will be overwritten.
|
|
|
|
@cindex globbing, toggle
|
|
@item --no-glob
|
|
Turn off @sc{ftp} globbing. Globbing refers to the use of shell-like
|
|
special characters (@dfn{wildcards}), like @samp{*}, @samp{?}, @samp{[}
|
|
and @samp{]} to retrieve more than one file from the same directory at
|
|
once, like:
|
|
|
|
@example
|
|
wget ftp://gnjilux.srk.fer.hr/*.msg
|
|
@end example
|
|
|
|
By default, globbing will be turned on if the @sc{url} contains a
|
|
globbing character. This option may be used to turn globbing off
|
|
permanently.
|
|
|
|
You may have to quote the @sc{url} to protect it from being expanded by
|
|
your shell. Globbing makes Wget look for a directory listing, which is
|
|
system-specific. This is why it currently works only with Unix @sc{ftp}
|
|
servers (and the ones emulating Unix @code{ls} output).
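
For instance, assuming a placeholder host, single quotes keep the shell
from touching the pattern:

@example
# The quotes hand the pattern to Wget unexpanded
wget 'ftp://ftp.example.com/pub/*.msg'
# With --no-glob the asterisk is taken as a literal character
wget --no-glob 'ftp://ftp.example.com/pub/odd*name.txt'
@end example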
|
|
|
|
@cindex passive ftp
|
|
@item --no-passive-ftp
|
|
Disable the use of the @dfn{passive} FTP transfer mode. Passive FTP
|
|
mandates that the client connect to the server to establish the data
|
|
connection rather than the other way around.
|
|
|
|
If the machine is connected to the Internet directly, both passive and
|
|
active FTP should work equally well. Behind most firewall and NAT
|
|
configurations passive FTP has a better chance of working. However,
|
|
in some rare firewall configurations, active FTP actually works when
|
|
passive FTP doesn't. If you suspect this to be the case, use this
|
|
option, or set @code{passive_ftp=off} in your init file.
|
|
|
|
@cindex symbolic links, retrieving
|
|
@item --retr-symlinks
|
|
Usually, when retrieving @sc{ftp} directories recursively and a symbolic
|
|
link is encountered, the linked-to file is not downloaded. Instead, a
|
|
matching symbolic link is created on the local filesystem. The
|
|
pointed-to file will not be downloaded unless this recursive retrieval
|
|
would have encountered it separately and downloaded it anyway.
|
|
|
|
When @samp{--retr-symlinks} is specified, however, symbolic links are
|
|
traversed and the pointed-to files are retrieved. At this time, this
|
|
option does not cause Wget to traverse symlinks to directories and
|
|
recurse through them, but in the future it should be enhanced to do
|
|
this.
|
|
|
|
Note that when retrieving a file (not a directory) because it was
|
|
specified on the command-line, rather than because it was recursed to,
|
|
this option has no effect. Symbolic links are always traversed in this
|
|
case.
|
|
|
|
@cindex Keep-Alive, turning off
|
|
@cindex Persistent Connections, disabling
|
|
@item --no-http-keep-alive
|
|
Turn off the ``keep-alive'' feature for HTTP downloads. Normally, Wget
|
|
asks the server to keep the connection open so that, when you download
|
|
more than one document from the same server, they get transferred over
|
|
the same TCP connection. This saves time and at the same time reduces
|
|
the load on the server.
|
|
|
|
This option is useful when, for some reason, persistent (keep-alive)
|
|
connections don't work for you, for example due to a server bug or due
|
|
to the inability of server-side scripts to cope with the connections.
|
|
@end table
|
|
|
|
@node Recursive Retrieval Options
|
|
@section Recursive Retrieval Options
|
|
|
|
@table @samp
|
|
@item -r
|
|
@itemx --recursive
|
|
Turn on recursive retrieving. @xref{Recursive Download}, for more
|
|
details.
|
|
|
|
@item -l @var{depth}
|
|
@itemx --level=@var{depth}
|
|
Specify recursion maximum depth level @var{depth} (@pxref{Recursive
|
|
Download}). The default maximum depth is 5.
|
|
|
|
@cindex proxy filling
|
|
@cindex delete after retrieval
|
|
@cindex filling proxy cache
|
|
@item --delete-after
|
|
This option tells Wget to delete every single file it downloads,
|
|
@emph{after} having done so. It is useful for pre-fetching popular
|
|
pages through a proxy, e.g.:
|
|
|
|
@example
|
|
wget -r -nd --delete-after http://whatever.com/~popular/page/
|
|
@end example
|
|
|
|
The @samp{-r} option is to retrieve recursively, and @samp{-nd} to not
|
|
create directories.
|
|
|
|
Note that @samp{--delete-after} deletes files on the local machine. It
|
|
does not issue the @samp{DELE} command to remote FTP sites, for
|
|
instance. Also note that when @samp{--delete-after} is specified,
|
|
@samp{--convert-links} is ignored, so @samp{.orig} files are simply not
|
|
created in the first place.
|
|
|
|
@cindex conversion of links
|
|
@cindex link conversion
|
|
@item -k
|
|
@itemx --convert-links
|
|
After the download is complete, convert the links in the document to
|
|
make them suitable for local viewing. This affects not only the visible
|
|
hyperlinks, but any part of the document that links to external content,
|
|
such as embedded images, links to style sheets, hyperlinks to non-@sc{html}
|
|
content, etc.
|
|
|
|
Each link will be changed in one of two ways:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
The links to files that have been downloaded by Wget will be changed to
|
|
refer to the file they point to as a relative link.
|
|
|
|
Example: if the downloaded file @file{/foo/doc.html} links to
|
|
@file{/bar/img.gif}, also downloaded, then the link in @file{doc.html}
|
|
will be modified to point to @samp{../bar/img.gif}. This kind of
|
|
transformation works reliably for arbitrary combinations of directories.
|
|
|
|
@item
|
|
The links to files that have not been downloaded by Wget will be changed
|
|
to include host name and absolute path of the location they point to.
|
|
|
|
Example: if the downloaded file @file{/foo/doc.html} links to
|
|
@file{/bar/img.gif} (or to @file{../bar/img.gif}), then the link in
|
|
@file{doc.html} will be modified to point to
|
|
@file{http://@var{hostname}/bar/img.gif}.
|
|
@end itemize
|
|
|
|
Because of this, local browsing works reliably: if a linked file was
|
|
downloaded, the link will refer to its local name; if it was not
|
|
downloaded, the link will refer to its full Internet address rather than
|
|
presenting a broken link. The fact that the former links are converted
|
|
to relative links ensures that you can move the downloaded hierarchy to
|
|
another directory.
|
|
|
|
Note that only at the end of the download can Wget know which links have
|
|
been downloaded. Because of that, the work done by @samp{-k} will be
|
|
performed at the end of all the downloads.
|
|
|
|
@cindex backing up converted files
|
|
@item -K
|
|
@itemx --backup-converted
|
|
When converting a file, back up the original version with a @samp{.orig}
|
|
suffix. Affects the behavior of @samp{-N} (@pxref{HTTP Time-Stamping
|
|
Internals}).
|
|
|
|
@item -m
|
|
@itemx --mirror
|
|
Turn on options suitable for mirroring. This option turns on recursion
|
|
and time-stamping, sets infinite recursion depth and keeps @sc{ftp}
|
|
directory listings. It is currently equivalent to
|
|
@samp{-r -N -l inf --no-remove-listing}.
|
|
|
|
@cindex page requisites
|
|
@cindex required images, downloading
|
|
@item -p
|
|
@itemx --page-requisites
|
|
This option causes Wget to download all the files that are necessary to
|
|
properly display a given @sc{html} page. This includes such things as
|
|
inlined images, sounds, and referenced stylesheets.
|
|
|
|
Ordinarily, when downloading a single @sc{html} page, any requisite documents
|
|
that may be needed to display it properly are not downloaded. Using
|
|
@samp{-r} together with @samp{-l} can help, but since Wget does not
|
|
ordinarily distinguish between external and inlined documents, one is
|
|
generally left with ``leaf documents'' that are missing their
|
|
requisites.
|
|
|
|
For instance, say document @file{1.html} contains an @code{<IMG>} tag
|
|
referencing @file{1.gif} and an @code{<A>} tag pointing to external
|
|
document @file{2.html}. Say that @file{2.html} is similar but that its
|
|
image is @file{2.gif} and it links to @file{3.html}. Say this
|
|
continues up to some arbitrarily high number.
|
|
|
|
If one executes the command:
|
|
|
|
@example
|
|
wget -r -l 2 http://@var{site}/1.html
|
|
@end example
|
|
|
|
then @file{1.html}, @file{1.gif}, @file{2.html}, @file{2.gif}, and
|
|
@file{3.html} will be downloaded. As you can see, @file{3.html} is
|
|
without its requisite @file{3.gif} because Wget is simply counting the
|
|
number of hops (up to 2) away from @file{1.html} in order to determine
|
|
where to stop the recursion. However, with this command:
|
|
|
|
@example
|
|
wget -r -l 2 -p http://@var{site}/1.html
|
|
@end example
|
|
|
|
all the above files @emph{and} @file{3.html}'s requisite @file{3.gif}
|
|
will be downloaded. Similarly,
|
|
|
|
@example
|
|
wget -r -l 1 -p http://@var{site}/1.html
|
|
@end example
|
|
|
|
will cause @file{1.html}, @file{1.gif}, @file{2.html}, and @file{2.gif}
|
|
to be downloaded. One might think that:
|
|
|
|
@example
|
|
wget -r -l 0 -p http://@var{site}/1.html
|
|
@end example
|
|
|
|
would download just @file{1.html} and @file{1.gif}, but unfortunately
|
|
this is not the case, because @samp{-l 0} is equivalent to
|
|
@samp{-l inf}---that is, infinite recursion. To download a single @sc{html}
|
|
page (or a handful of them, all specified on the command-line or in a
|
|
@samp{-i} @sc{url} input file) and its (or their) requisites, simply leave off
|
|
@samp{-r} and @samp{-l}:
|
|
|
|
@example
|
|
wget -p http://@var{site}/1.html
|
|
@end example
|
|
|
|
Note that Wget will behave as if @samp{-r} had been specified, but only
|
|
that single page and its requisites will be downloaded. Links from that
|
|
page to external documents will not be followed. Actually, to download
|
|
a single page and all its requisites (even if they exist on separate
|
|
websites), and make sure the lot displays properly locally, this author
|
|
likes to use a few options in addition to @samp{-p}:
|
|
|
|
@example
|
|
wget -E -H -k -K -p http://@var{site}/@var{document}
|
|
@end example
|
|
|
|
To finish off this topic, it's worth knowing that Wget's idea of an
|
|
external document link is any URL specified in an @code{<A>} tag, an
|
|
@code{<AREA>} tag, or a @code{<LINK>} tag other than @code{<LINK
|
|
REL="stylesheet">}.
|
|
|
|
@cindex @sc{html} comments
|
|
@cindex comments, @sc{html}
|
|
@item --strict-comments
|
|
Turn on strict parsing of @sc{html} comments. The default is to terminate
|
|
comments at the first occurrence of @samp{-->}.
|
|
|
|
According to specifications, @sc{html} comments are expressed as @sc{sgml}
|
|
@dfn{declarations}. A declaration is special markup that begins with
|
|
@samp{<!} and ends with @samp{>}, such as @samp{<!DOCTYPE ...>}, that
|
|
may contain comments between a pair of @samp{--} delimiters. @sc{html}
|
|
comments are ``empty declarations'', @sc{sgml} declarations without any
|
|
non-comment text. Therefore, @samp{<!--foo-->} is a valid comment, and
|
|
so is @samp{<!--one-- --two-->}, but @samp{<!--1--2-->} is not.
|
|
|
|
On the other hand, most @sc{html} writers don't perceive comments as anything
|
|
other than text delimited with @samp{<!--} and @samp{-->}, which is not
|
|
quite the same. For example, something like @samp{<!------------>}
|
|
works as a valid comment as long as the number of dashes is a multiple
|
|
of four (!). If not, the comment technically lasts until the next
|
|
@samp{--}, which may be at the other end of the document. Because of
|
|
this, many popular browsers completely ignore the specification and
|
|
implement what users have come to expect: comments delimited with
|
|
@samp{<!--} and @samp{-->}.
|
|
|
|
Until version 1.9, Wget interpreted comments strictly, which resulted in
|
|
missing links in many web pages that displayed fine in browsers, but had
|
|
the misfortune of containing non-compliant comments. Beginning with
|
|
version 1.9, Wget has joined the ranks of clients that implement
|
|
``naive'' comments, terminating each comment at the first occurrence of
|
|
@samp{-->}.
|
|
|
|
If, for whatever reason, you want strict comment parsing, use this
|
|
option to turn it on.
|
|
@end table
|
|
|
|
@node Recursive Accept/Reject Options
|
|
@section Recursive Accept/Reject Options
|
|
|
|
@table @samp
|
|
@item -A @var{acclist} --accept @var{acclist}
|
|
@itemx -R @var{rejlist} --reject @var{rejlist}
|
|
Specify comma-separated lists of file name suffixes or patterns to
|
|
accept or reject (@pxref{Types of Files}, for more details).
|
|
|
|
@item -D @var{domain-list}
|
|
@itemx --domains=@var{domain-list}
|
|
Set domains to be followed. @var{domain-list} is a comma-separated list
|
|
of domains. Note that it does @emph{not} turn on @samp{-H}.
|
|
|
|
@item --exclude-domains @var{domain-list}
|
|
Specify the domains that are @emph{not} to be followed
|
|
(@pxref{Spanning Hosts}).
|
|
|
|
@cindex follow FTP links
|
|
@item --follow-ftp
|
|
Follow @sc{ftp} links from @sc{html} documents. Without this option,
|
|
Wget will ignore all the @sc{ftp} links.
|
|
|
|
@cindex tag-based recursive pruning
|
|
@item --follow-tags=@var{list}
|
|
Wget has an internal table of @sc{html} tag / attribute pairs that it
|
|
considers when looking for linked documents during a recursive
|
|
retrieval. If a user wants only a subset of those tags to be
|
|
considered, however, he or she should specify such tags in a
|
|
comma-separated @var{list} with this option.
|
|
|
|
@item --ignore-tags=@var{list}
|
|
This is the opposite of the @samp{--follow-tags} option. To skip
|
|
certain @sc{html} tags when recursively looking for documents to download,
|
|
specify them in a comma-separated @var{list}.
|
|
|
|
In the past, this option was the best bet for downloading a single page
|
|
and its requisites, using a command-line like:
|
|
|
|
@example
|
|
wget --ignore-tags=a,area -H -k -K -r http://@var{site}/@var{document}
|
|
@end example
|
|
|
|
However, the author of this option came across a page with tags like
|
|
@code{<LINK REL="home" HREF="/">} and came to the realization that
|
|
specifying tags to ignore was not enough. One can't just tell Wget to
|
|
ignore @code{<LINK>}, because then stylesheets will not be downloaded.
|
|
Now the best bet for downloading a single page and its requisites is the
|
|
dedicated @samp{--page-requisites} option.
|
|
|
|
@item -H
|
|
@itemx --span-hosts
|
|
Enable spanning across hosts when doing recursive retrieving
|
|
(@pxref{Spanning Hosts}).
|
|
|
|
@item -L
|
|
@itemx --relative
|
|
Follow relative links only. Useful for retrieving a specific home page
|
|
without any distractions, not even those from the same hosts
|
|
(@pxref{Relative Links}).
|
|
|
|
@item -I @var{list}
|
|
@itemx --include-directories=@var{list}
|
|
Specify a comma-separated list of directories you wish to follow when
|
|
downloading (@pxref{Directory-Based Limits}, for more details). Elements
|
|
of @var{list} may contain wildcards.
|
|
|
|
@item -X @var{list}
|
|
@itemx --exclude-directories=@var{list}
|
|
Specify a comma-separated list of directories you wish to exclude from
|
|
download (@pxref{Directory-Based Limits}, for more details). Elements of
|
|
@var{list} may contain wildcards.
|
|
|
|
@item -np
|
|
@itemx --no-parent
|
|
Do not ever ascend to the parent directory when retrieving recursively.
|
|
This is a useful option, since it guarantees that only the files
|
|
@emph{below} a certain hierarchy will be downloaded.
|
|
@xref{Directory-Based Limits}, for more details.
|
|
@end table
|
|
|
|
@c man end
|
|
|
|
@node Recursive Download
|
|
@chapter Recursive Download
|
|
@cindex recursion
|
|
@cindex retrieving
|
|
@cindex recursive download
|
|
|
|
GNU Wget is capable of traversing parts of the Web (or a single
|
|
@sc{http} or @sc{ftp} server), following links and directory structure.
|
|
We refer to this as @dfn{recursive retrieval}, or @dfn{recursion}.
|
|
|
|
With @sc{http} @sc{url}s, Wget retrieves and parses the @sc{html} from
|
|
the given @sc{url}, retrieving the files the @sc{html}
|
|
document was referring to, through markup like @code{href}, or
|
|
@code{src}. If the freshly downloaded file is also of type
|
|
@code{text/html} or @code{application/xhtml+xml}, it will be parsed and
|
|
followed further.
|
|
|
|
Recursive retrieval of @sc{http} and @sc{html} content is
|
|
@dfn{breadth-first}. This means that Wget first downloads the requested
|
|
@sc{html} document, then the documents linked from that document, then the
|
|
documents linked by them, and so on. In other words, Wget first
|
|
downloads the documents at depth 1, then those at depth 2, and so on
|
|
until the specified maximum depth.
|
|
|
|
The maximum @dfn{depth} to which the retrieval may descend is specified
|
|
with the @samp{-l} option. The default maximum depth is five layers.
|
|
|
|
When retrieving an @sc{ftp} @sc{url} recursively, Wget will retrieve all
|
|
the data from the given directory tree (including the subdirectories up
|
|
to the specified depth) on the remote server, creating its mirror image
|
|
locally. @sc{ftp} retrieval is also limited by the @code{depth}
|
|
parameter. Unlike @sc{http} recursion, @sc{ftp} recursion is performed
|
|
depth-first.
|
|
|
|
By default, Wget will create a local directory tree, corresponding to
|
|
the one found on the remote server.
|
|
|
|
Recursive retrieving can find a number of applications, the most
|
|
important of which is mirroring. It is also useful for @sc{www}
|
|
presentations, and any other opportunities where slow network
|
|
connections should be bypassed by storing the files locally.
|
|
|
|
You should be warned that recursive downloads can overload the remote
|
|
servers. Because of that, many administrators frown upon them and may
|
|
ban access from your site if they detect very fast downloads of big
|
|
amounts of content. When downloading from Internet servers, consider
|
|
using the @samp{-w} option to introduce a delay between accesses to the
|
|
server. The download will take a while longer, but the server
|
|
administrator will not be alarmed by your rudeness.
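
A polite recursive run might look like the following sketch. The host
and the two-second delay are arbitrary choices:

@example
# Wait 2 seconds between retrievals, and vary the wait randomly
wget -r -np -w 2 --random-wait http://www.example.com/docs/
@end example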
|
|
|
|
Of course, recursive download may cause problems on your machine. If
|
|
left to run unchecked, it can easily fill up the disk. If downloading
|
|
from a local network, it can also take bandwidth on the system, as well as
|
|
consume memory and CPU.
|
|
|
|
Try to specify the criteria that match the kind of download you are
|
|
trying to achieve. If you want to download only one page, use
|
|
@samp{--page-requisites} without any additional recursion. If you want
|
|
to download things under one directory, use @samp{-np} to avoid
|
|
downloading things from other directories. If you want to download all
|
|
the files from one directory, use @samp{-l 1} to make sure the recursion
|
|
depth never exceeds one. @xref{Following Links}, for more information
|
|
about this.
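
The three cases above can be sketched as follows, with a placeholder
host and paths:

@example
# One page plus whatever it needs to display
wget -p http://www.example.com/page.html
# Everything below /docs/, never ascending to the parent
wget -r -np http://www.example.com/docs/
# The files of one directory, with recursion depth limited to one
wget -r -l 1 http://www.example.com/docs/
@end example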
|
|
|
|
Recursive retrieval should be used with care. Don't say you were not
|
|
warned.
|
|
|
|
@node Following Links
|
|
@chapter Following Links
|
|
@cindex links
|
|
@cindex following links
|
|
|
|
When retrieving recursively, one does not wish to retrieve loads of
|
|
unnecessary data. Most of the time users know exactly what
|
|
they want to download, and want Wget to follow only specific links.
|
|
|
|
For example, if you wish to download the music archive from
|
|
@samp{fly.srk.fer.hr}, you will not want to download all the home pages
|
|
that happen to be referenced by an obscure part of the archive.
|
|
|
|
Wget possesses several mechanisms that allow you to fine-tune which
|
|
links it will follow.
|
|
|
|
@menu
|
|
* Spanning Hosts:: (Un)limiting retrieval based on host name.
|
|
* Types of Files:: Getting only certain files.
|
|
* Directory-Based Limits:: Getting only certain directories.
|
|
* Relative Links:: Follow relative links only.
|
|
* FTP Links:: Following FTP links.
|
|
@end menu
|
|
|
|
@node Spanning Hosts
|
|
@section Spanning Hosts
|
|
@cindex spanning hosts
|
|
@cindex hosts, spanning
|
|
|
|
Wget's recursive retrieval normally refuses to visit hosts different
|
|
than the one you specified on the command line. This is a reasonable
|
|
default; without it, every retrieval would have the potential to turn
|
|
your Wget into a small version of Google.
|
|
|
|
However, visiting different hosts, or @dfn{host spanning}, is sometimes
|
|
a useful option. Maybe the images are served from a different server.
|
|
Maybe you're mirroring a site that consists of pages interlinked between
|
|
three servers. Maybe the server has two equivalent names, and the @sc{html}
|
|
pages refer to both interchangeably.
|
|
|
|
@table @asis
|
|
@item Span to any host---@samp{-H}
|
|
|
|
The @samp{-H} option turns on host spanning, thus allowing Wget's
|
|
recursive run to visit any host referenced by a link. Unless sufficient
|
|
recursion-limiting criteria are applied, these foreign hosts will
|
|
typically link to yet more hosts, and so on until Wget ends up sucking
|
|
up much more data than you have intended.
|
|
|
|
@item Limit spanning to certain domains---@samp{-D}
|
|
|
|
The @samp{-D} option allows you to specify the domains that will be
|
|
followed, thus limiting the recursion only to the hosts that belong to
|
|
these domains. Obviously, this makes sense only in conjunction with
|
|
@samp{-H}. A typical example would be downloading the contents of
|
|
@samp{www.server.com}, but allowing downloads from
|
|
@samp{images.server.com}, etc.:
|
|
|
|
@example
|
|
wget -rH -Dserver.com http://www.server.com/
|
|
@end example
|
|
|
|
You can specify more than one address by separating them with a comma,
|
|
e.g. @samp{-Ddomain1.com,domain2.com}.
|
|
|
|
@item Keep download off certain domains---@samp{--exclude-domains}
|
|
|
|
If there are domains you want to exclude specifically, you can do it
|
|
with @samp{--exclude-domains}, which accepts the same type of arguments
|
|
as @samp{-D}, but will @emph{exclude} all the listed domains. For
|
|
example, if you want to download all the hosts from the @samp{foo.edu}
|
|
domain, with the exception of @samp{sunsite.foo.edu}, you can do it like
|
|
this:
|
|
|
|
@example
|
|
wget -rH -Dfoo.edu --exclude-domains sunsite.foo.edu \
|
|
http://www.foo.edu/
|
|
@end example
|
|
|
|
@end table
|
|
|
|
@node Types of Files
|
|
@section Types of Files
|
|
@cindex types of files
|
|
|
|
When downloading material from the web, you will often want to restrict
|
|
the retrieval to only certain file types. For example, if you are
|
|
interested in downloading @sc{gif}s, you will not be overjoyed to get
|
|
loads of PostScript documents, and vice versa.
|
|
|
|
Wget offers two options to deal with this problem. Each option
|
|
description lists a short name, a long name, and the equivalent command
|
|
in @file{.wgetrc}.
|
|
|
|
@cindex accept wildcards
|
|
@cindex accept suffixes
|
|
@cindex wildcards, accept
|
|
@cindex suffixes, accept
|
|
@table @samp
|
|
@item -A @var{acclist}
|
|
@itemx --accept @var{acclist}
|
|
@itemx accept = @var{acclist}
|
|
The argument to @samp{--accept} option is a list of file suffixes or
|
|
patterns that Wget will download during recursive retrieval. A suffix
|
|
is the ending part of a file, and consists of ``normal'' letters,
|
|
e.g. @samp{gif} or @samp{.jpg}. A matching pattern contains shell-like
|
|
wildcards, e.g. @samp{books*} or @samp{zelazny*196[0-9]*}.
|
|
|
|
So, specifying @samp{wget -A gif,jpg} will make Wget download only the
|
|
files ending with @samp{gif} or @samp{jpg}, i.e. @sc{gif}s and
|
|
@sc{jpeg}s. On the other hand, @samp{wget -A "zelazny*196[0-9]*"} will
|
|
download only files beginning with @samp{zelazny} and containing numbers
|
|
from 1960 to 1969 anywhere within. Look up the manual of your shell for
|
|
a description of how pattern matching works.
|
|
|
|
Of course, any number of suffixes and patterns can be combined into a
|
|
comma-separated list, and given as an argument to @samp{-A}.
|
|
|
|
@cindex reject wildcards
|
|
@cindex reject suffixes
|
|
@cindex wildcards, reject
|
|
@cindex suffixes, reject
|
|
@item -R @var{rejlist}
|
|
@itemx --reject @var{rejlist}
|
|
@itemx reject = @var{rejlist}
|
|
The @samp{--reject} option works the same way as @samp{--accept}, only
|
|
its logic is the reverse; Wget will download all files @emph{except} the
|
|
ones matching the suffixes (or patterns) in the list.
|
|
|
|
So, if you want to download a whole page except for the cumbersome
|
|
@sc{mpeg}s and @samp{.au} files, you can use @samp{wget -R mpg,mpeg,au}.
|
|
Analogously, to download all files except the ones beginning with
|
|
@samp{bjork}, use @samp{wget -R "bjork*"}. The quotes are to prevent
|
|
expansion by the shell.
|
|
@end table
|
|
|
|
The @samp{-A} and @samp{-R} options may be combined to achieve even
|
|
better fine-tuning of which files to retrieve. E.g. @samp{wget -A
|
|
"*zelazny*" -R .ps} will download all the files having @samp{zelazny} as
|
|
a part of their name, but @emph{not} the PostScript files.
|
|
|
|
Note that these two options do not affect the downloading of @sc{html}
|
|
files; Wget must load all the @sc{html}s to know where to go at
|
|
all---recursive retrieval would make no sense otherwise.
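
The same selection can be kept in @file{.wgetrc} using the @samp{accept}
and @samp{reject} commands listed above. A sketch mirroring the
@samp{zelazny} example, where no shell quoting is needed because the
shell never sees the patterns:

@example
# ~/.wgetrc
accept = *zelazny*
reject = .ps
@end example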
|
|
|
|
@node Directory-Based Limits
|
|
@section Directory-Based Limits
|
|
@cindex directories
|
|
@cindex directory limits
|
|
|
|
Regardless of other link-following facilities, it is often useful to
|
|
restrict which files are retrieved based on the directories
|
|
those files are placed in. There can be many reasons for this---the
|
|
home pages may be organized in a reasonable directory structure; or some
|
|
directories may contain useless information, e.g. @file{/cgi-bin} or
|
|
@file{/dev} directories.
|
|
|
|
Wget offers three different options to deal with this requirement. Each
|
|
option description lists a short name, a long name, and the equivalent
|
|
command in @file{.wgetrc}.
|
|
|
|
@cindex directories, include
|
|
@cindex include directories
|
|
@cindex accept directories
|
|
@table @samp
|
|
@item -I @var{list}
|
|
@itemx --include-directories=@var{list}
|
|
@itemx include_directories = @var{list}
|
|
The @samp{-I} option accepts a comma-separated list of directories included
|
|
in the retrieval. Any other directories will simply be ignored. The
|
|
directories are absolute paths.
|
|
|
|
So, if you wish to download from @samp{http://host/people/bozo/}
|
|
following only links to bozo's colleagues in the @file{/people}
|
|
directory and the bogus scripts in @file{/cgi-bin}, you can specify:
|
|
|
|
@example
|
|
wget -I /people,/cgi-bin http://host/people/bozo/
|
|
@end example
|
|
|
|
@cindex directories, exclude
|
|
@cindex exclude directories
|
|
@cindex reject directories
|
|
@item -X @var{list}
|
|
@itemx --exclude-directories=@var{list}
|
|
@itemx exclude_directories = @var{list}
|
|
The @samp{-X} option is exactly the reverse of @samp{-I}---this is a list of
|
|
directories @emph{excluded} from the download. E.g. if you do not want
|
|
Wget to download things from the @file{/cgi-bin} directory, specify @samp{-X
|
|
/cgi-bin} on the command line.
|
|
|
|
The same as with @samp{-A}/@samp{-R}, these two options can be combined
|
|
to get a better fine-tuning of downloading subdirectories. E.g. if you
|
|
want to load all the files from the @file{/pub} hierarchy except for
|
|
@file{/pub/worthless}, specify @samp{-I/pub -X/pub/worthless}.
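
Using the @file{.wgetrc} commands listed for these two options, the same
restriction might be written as the following sketch:

@example
# ~/.wgetrc
include_directories = /pub
exclude_directories = /pub/worthless
@end example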
|
|
|
|
@cindex no parent
|
|
@item -np
|
|
@itemx --no-parent
|
|
@itemx no_parent = on
|
|
The simplest, and often very useful way of limiting directories is
|
|
disallowing retrieval of the links that refer to the hierarchy
|
|
@dfn{above} the beginning directory, i.e. disallowing ascent to the
|
|
parent directory/directories.
|
|
|
|
The @samp{--no-parent} option (short @samp{-np}) is useful in this case.
|
|
Using it guarantees that you will never leave the existing hierarchy.
|
|
Suppose you invoke Wget with:
|
|
|
|
@example
|
|
wget -r --no-parent http://somehost/~luzer/my-archive/
|
|
@end example
|
|
|
|
You may rest assured that none of the references to
|
|
@file{/~his-girls-homepage/} or @file{/~luzer/all-my-mpegs/} will be
|
|
followed. Only the archive you are interested in will be downloaded.
|
|
Essentially, @samp{--no-parent} is similar to
|
|
@samp{-I/~luzer/my-archive}, only it handles redirections in a more
|
|
intelligent fashion.
|
|
@end table
|
|
|
|
@node Relative Links
|
|
@section Relative Links
|
|
@cindex relative links
|
|
|
|
When @samp{-L} is turned on, only the relative links are ever followed.
|
|
Relative links are here defined as those that do not refer to the web
|
|
server root. For example, these links are relative:
|
|
|
|
@example
|
|
<a href="foo.gif">
|
|
<a href="foo/bar.gif">
|
|
<a href="../foo/bar.gif">
|
|
@end example
|
|
|
|
These links are not relative:
|
|
|
|
@example
|
|
<a href="/foo.gif">
|
|
<a href="/foo/bar.gif">
|
|
<a href="http://www.server.com/foo/bar.gif">
|
|
@end example
|
|
|
|
Using this option guarantees that recursive retrieval will not span
|
|
hosts, even without @samp{-H}. In simple cases it also allows downloads
|
|
to ``just work'' without having to convert links.
|
|
|
|
This option is probably not very useful and might be removed in a future
|
|
release.
|
|
|
|
@node FTP Links
|
|
@section Following FTP Links
|
|
@cindex following ftp links
|
|
|
|
The rules for @sc{ftp} are somewhat specific, as it is necessary for
|
|
them to be. @sc{ftp} links in @sc{html} documents are often included
|
|
for purposes of reference, and it is often inconvenient to download them
|
|
by default.
|
|
|
|
To have @sc{ftp} links followed from @sc{html} documents, you need to
|
|
specify the @samp{--follow-ftp} option. Having done that, @sc{ftp}
|
|
links will span hosts regardless of @samp{-H} setting. This is logical,
|
|
as @sc{ftp} links rarely point to the same host where the @sc{http}
|
|
server resides. For similar reasons, the @samp{-L} option has no
|
|
effect on such downloads. On the other hand, domain acceptance
|
|
(@samp{-D}) and suffix rules (@samp{-A} and @samp{-R}) apply normally.
|
|
|
|
Also note that followed links to @sc{ftp} directories will not be
|
|
retrieved recursively further.
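
As a sketch of the above, the following invocation starts a recursive
retrieval over @sc{http} and also follows any @sc{ftp} links found in
the retrieved pages. The host name is only a placeholder, and the
presence of @sc{ftp} links on the site is an assumption of the example.

@example
wget -r --follow-ftp http://www.server.com/
@end example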
|
|
|
|
@node Time-Stamping
|
|
@chapter Time-Stamping
|
|
@cindex time-stamping
|
|
@cindex timestamping
|
|
@cindex updating the archives
|
|
@cindex incremental updating
|
|
|
|
One of the most important aspects of mirroring information from the
|
|
Internet is updating your archives.
|
|
|
|
Downloading the whole archive again and again, just to replace a few
|
|
changed files, is expensive, both in terms of wasted bandwidth and money,
|
|
and the time to do the update. This is why all the mirroring tools
|
|
offer the option of incremental updating.
|
|
|
|
Such an updating mechanism means that the remote server is scanned in
|
|
search of @dfn{new} files. Only those new files will be downloaded in
|
|
the place of the old ones.
|
|
|
|
A file is considered new if one of these two conditions is met:
|
|
|
|
@enumerate
|
|
@item
|
|
A file of that name does not already exist locally.
|
|
|
|
@item
|
|
A file of that name does exist, but the remote file was modified more
|
|
recently than the local file.
|
|
@end enumerate
|
|
|
|
To implement this, the program needs to be aware of the time of last
|
|
modification of both local and remote files. We call this information the
|
|
@dfn{time-stamp} of a file.
|
|
|
|
The time-stamping in GNU Wget is turned on using the @samp{--timestamping}
|
|
(@samp{-N}) option, or through the @code{timestamping = on} directive in
|
|
@file{.wgetrc}. With this option, for each file it intends to download,
|
|
Wget will check whether a local file of the same name exists. If it
|
|
does, and the remote file is not newer, Wget will not download it.
|
|
|
|
If the local file does not exist, or the sizes of the files do not
|
|
match, Wget will download the remote file no matter what the time-stamps
|
|
say.
|
|
|
|
@menu
|
|
* Time-Stamping Usage::
|
|
* HTTP Time-Stamping Internals::
|
|
* FTP Time-Stamping Internals::
|
|
@end menu
|
|
|
|
@node Time-Stamping Usage
|
|
@section Time-Stamping Usage
|
|
@cindex time-stamping usage
|
|
@cindex usage, time-stamping
|
|
|
|
The usage of time-stamping is simple. Say you would like to download a
|
|
file so that it keeps its date of modification.
|
|
|
|
@example
|
|
wget -S http://www.gnu.ai.mit.edu/
|
|
@end example
|
|
|
|
A simple @code{ls -l} shows that the time stamp on the local file equals
|
|
the value of the @code{Last-Modified} header, as returned by the server.
|
|
As you can see, the time-stamping info is preserved locally, even
|
|
without @samp{-N} (at least for @sc{http}).
|
|
|
|
Several days later, you would like Wget to check if the remote file has
|
|
changed, and download it if it has.
|
|
|
|
@example
|
|
wget -N http://www.gnu.ai.mit.edu/
|
|
@end example
|
|
|
|
Wget will ask the server for the last-modified date. If the local file
|
|
has the same timestamp as the server, or a newer one, the remote file
|
|
will not be re-fetched. However, if the remote file is more recent,
|
|
Wget will proceed to fetch it.
|
|
|
|
The same goes for @sc{ftp}. For example:
|
|
|
|
@example
|
|
wget "ftp://ftp.ifi.uio.no/pub/emacs/gnus/*"
|
|
@end example
|
|
|
|
(The quotes around that URL are to prevent the shell from trying to
|
|
interpret the @samp{*}.)
|
|
|
|
After download, a local directory listing will show that the timestamps
|
|
match those on the remote server. Reissuing the command with @samp{-N}
|
|
will make Wget re-fetch @emph{only} the files that have been modified
|
|
since the last download.
|
|
|
|
If you wished to mirror the GNU archive every week, you would use a
|
|
command like the following, weekly:
|
|
|
|
@example
|
|
wget --timestamping -r ftp://ftp.gnu.org/pub/gnu/
|
|
@end example
|
|
|
|
Note that time-stamping will only work for files for which the server
|
|
gives a timestamp. For @sc{http}, this depends on getting a
|
|
@code{Last-Modified} header. For @sc{ftp}, this depends on getting a
|
|
directory listing with dates in a format that Wget can parse
|
|
(@pxref{FTP Time-Stamping Internals}).
|
|
|
|
@node HTTP Time-Stamping Internals
|
|
@section HTTP Time-Stamping Internals
|
|
@cindex http time-stamping
|
|
|
|
Time-stamping in @sc{http} is implemented by checking the
|
|
@code{Last-Modified} header. If you wish to retrieve the file
|
|
@file{foo.html} through @sc{http}, Wget will check whether
|
|
@file{foo.html} exists locally. If it doesn't, @file{foo.html} will be
|
|
retrieved unconditionally.
|
|
|
|
If the file does exist locally, Wget will first check its local
|
|
time-stamp (similar to the way @code{ls -l} checks it), and then send a
|
|
@code{HEAD} request to the remote server, demanding the information on
|
|
the remote file.
|
|
|
|
The @code{Last-Modified} header is examined to find which file was
|
|
modified more recently (which makes it ``newer''). If the remote file
|
|
is newer, it will be downloaded; if it is older, Wget will give
|
|
up.@footnote{As an additional check, Wget will look at the
|
|
@code{Content-Length} header, and compare the sizes; if they are not the
|
|
same, the remote file will be downloaded no matter what the time-stamp
|
|
says.}
|
|
|
|
When @samp{--backup-converted} (@samp{-K}) is specified in conjunction
|
|
with @samp{-N}, server file @samp{@var{X}} is compared to local file
|
|
@samp{@var{X}.orig}, if extant, rather than being compared to local file
|
|
@samp{@var{X}}, which will always differ if it's been converted by
|
|
@samp{--convert-links} (@samp{-k}).
|
|
|
|
Arguably, @sc{http} time-stamping should be implemented using the
|
|
@code{If-Modified-Since} request.
|
|
|
|
@node FTP Time-Stamping Internals
|
|
@section FTP Time-Stamping Internals
|
|
@cindex ftp time-stamping
|
|
|
|
In theory, @sc{ftp} time-stamping works much the same as @sc{http}, only
|
|
@sc{ftp} has no headers---time-stamps must be ferreted out of directory
|
|
listings.
|
|
|
|
If an @sc{ftp} download is recursive or uses globbing, Wget will use the
|
|
@sc{ftp} @code{LIST} command to get a file listing for the directory
|
|
containing the desired file(s). It will try to analyze the listing,
|
|
treating it like Unix @code{ls -l} output, extracting the time-stamps.
|
|
The rest is exactly the same as for @sc{http}. Note that when
|
|
retrieving individual files from an @sc{ftp} server without using
|
|
globbing or recursion, listing files will not be downloaded (and thus
|
|
files will not be time-stamped) unless @samp{-N} is specified.
|
|
|
|
The assumption that every directory listing is a Unix-style listing may
|
|
sound extremely constraining, but in practice it is not, as many
|
|
non-Unix @sc{ftp} servers use the Unixoid listing format because most
|
|
(all?) of the clients understand it. Bear in mind that @sc{rfc959}
|
|
defines no standard way to get a file list, let alone the time-stamps.
|
|
We can only hope that a future standard will define this.
|
|
|
|
Another non-standard solution includes the use of the @code{MDTM} command
|
|
that is supported by some @sc{ftp} servers (including the popular
|
|
@code{wu-ftpd}), which returns the exact modification time of the specified file.
|
|
Wget may support this command in the future.
|
|
|
|
@node Startup File
|
|
@chapter Startup File
|
|
@cindex startup file
|
|
@cindex wgetrc
|
|
@cindex .wgetrc
|
|
@cindex startup
|
|
@cindex .netrc
|
|
|
|
Once you know how to change default settings of Wget through command
|
|
line arguments, you may wish to make some of those settings permanent.
|
|
You can do that in a convenient way by creating the Wget startup
|
|
file---@file{.wgetrc}.
|
|
|
|
Although @file{.wgetrc} is the ``main'' initialization file, it is
|
|
convenient to have a special facility for storing passwords. Thus Wget
|
|
reads and interprets the contents of @file{$HOME/.netrc}, if it finds
|
|
it. You can find the @file{.netrc} format in your system manuals.
|
|
|
|
Wget reads @file{.wgetrc} upon startup, recognizing a limited set of
|
|
commands.
|
|
|
|
@menu
|
|
* Wgetrc Location:: Location of various wgetrc files.
|
|
* Wgetrc Syntax:: Syntax of wgetrc.
|
|
* Wgetrc Commands:: List of available commands.
|
|
* Sample Wgetrc:: A wgetrc example.
|
|
@end menu
|
|
|
|
@node Wgetrc Location
|
|
@section Wgetrc Location
|
|
@cindex wgetrc location
|
|
@cindex location of wgetrc
|
|
|
|
When initializing, Wget will look for a @dfn{global} startup file,
|
|
@file{/usr/local/etc/wgetrc} by default (or some prefix other than
|
|
@file{/usr/local}, if Wget was not installed there) and read commands
|
|
from there, if it exists.
|
|
|
|
Then it will look for the user's file. If the environment variable
|
|
@code{WGETRC} is set, Wget will try to load that file. Failing that, no
|
|
further attempts will be made.
|
|
|
|
If @code{WGETRC} is not set, Wget will try to load @file{$HOME/.wgetrc}.
|
|
|
|
The fact that the user's settings are loaded after the system-wide ones
|
|
means that in case of collision the user's wgetrc @emph{overrides} the
|
|
system-wide wgetrc (in @file{/usr/local/etc/wgetrc} by default).
|
|
Fascist admins, away!
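
To illustrate the order of loading, suppose both files set the same
command. The values below are made up for the example, and the global
file is assumed to live at its default location.

@example
# In /usr/local/etc/wgetrc, read first:
tries = 20

# In $HOME/.wgetrc, read second, so this setting wins:
tries = 5
@end example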
|
|
|
|
@node Wgetrc Syntax
|
|
@section Wgetrc Syntax
|
|
@cindex wgetrc syntax
|
|
@cindex syntax of wgetrc
|
|
|
|
The syntax of a wgetrc command is simple:
|
|
|
|
@example
|
|
variable = value
|
|
@end example
|
|
|
|
The @dfn{variable} will also be called @dfn{command}. Valid
|
|
@dfn{values} are different for different commands.
|
|
|
|
The commands are case-insensitive and underscore-insensitive. Thus
|
|
@samp{DIr__PrefiX} is the same as @samp{dirprefix}. Empty lines, lines
|
|
beginning with @samp{#} and lines containing white-space only are
|
|
discarded.
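
A short fragment may make these rules concrete. It is only a sketch;
the directory name is an arbitrary example.

@example
# This line is a comment and is discarded.
tries = 3

# The following three lines are equivalent spellings of one command:
dir_prefix = downloads
dirprefix = downloads
DIr__PrefiX = downloads
@end example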
|
|
|
|
Commands that expect a comma-separated list will clear the list on an
|
|
empty command. So, if you wish to reset the rejection list specified in
|
|
global @file{wgetrc}, you can do it with:
|
|
|
|
@example
|
|
reject =
|
|
@end example
|
|
|
|
@node Wgetrc Commands
|
|
@section Wgetrc Commands
|
|
@cindex wgetrc commands
|
|
|
|
The complete set of commands is listed below. Legal values are listed
|
|
after the @samp{=}. Simple Boolean values can be set or unset using
|
|
@samp{on} and @samp{off} or @samp{1} and @samp{0}.
|
|
|
|
Some commands take pseudo-arbitrary values. @var{address} values can be
|
|
hostnames or dotted-quad IP addresses. @var{n} can be any positive
|
|
integer, or @samp{inf} for infinity, where appropriate. @var{string}
|
|
values can be any non-empty string.
|
|
|
|
Most of these commands have direct command-line equivalents. Also, any
|
|
wgetrc command can be specified on the command line using the
|
|
@samp{--execute} switch (@pxref{Basic Startup Options}).
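
For instance, a wgetrc command can be supplied for a single run without
editing any startup file. The sketch below assumes you want to lower
the number of tries for one download; the @sc{url} is only a
placeholder.

@example
wget --execute "tries = 3" http://fly.srk.fer.hr/
@end example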
|
|
|
|
@table @asis
|
|
@item accept/reject = @var{string}
|
|
Same as @samp{-A}/@samp{-R} (@pxref{Types of Files}).
|
|
|
|
@item add_hostdir = on/off
|
|
Enable/disable host-prefixed file names. @samp{-nH} disables it.
|
|
|
|
@item continue = on/off
|
|
If set to on, force continuation of preexistent partially retrieved
|
|
files. See @samp{-c} before setting it.
|
|
|
|
@item background = on/off
|
|
Enable/disable going to background---the same as @samp{-b} (which
|
|
enables it).
|
|
|
|
@item backup_converted = on/off
|
|
Enable/disable saving pre-converted files with the suffix
|
|
@samp{.orig}---the same as @samp{-K} (which enables it).
|
|
|
|
@c @item backups = @var{number}
|
|
@c #### Document me!
|
|
@c
|
|
@item base = @var{string}
|
|
Consider relative @sc{url}s in @sc{url} input files forced to be
|
|
interpreted as @sc{html} as being relative to @var{string}---the same as
|
|
@samp{--base=@var{string}}.
|
|
|
|
@item bind_address = @var{address}
|
|
Bind to @var{address}, like @samp{--bind-address=@var{address}}.
|
|
|
|
@item ca_certificate = @var{file}
|
|
Set the certificate authority bundle file to @var{file}. The same
|
|
as @samp{--ca-certificate=@var{file}}.
|
|
|
|
@item ca_directory = @var{directory}
|
|
Set the directory used for certificate authorities. The same as
|
|
@samp{--ca-directory=@var{directory}}.
|
|
|
|
@item cache = on/off
|
|
When set to off, disallow server-caching. See the @samp{--no-cache}
|
|
option.
|
|
|
|
@item certificate = @var{file}
|
|
Set the client certificate file name to @var{file}. The same as
|
|
@samp{--certificate=@var{file}}.
|
|
|
|
@item certificate_type = @var{string}
|
|
Specify the type of the client certificate, legal values being
|
|
@samp{PEM} (the default) and @samp{DER} (aka ASN1). The same as
|
|
@samp{--certificate-type=@var{string}}.
|
|
|
|
@item check_certificate = on/off
|
|
If this is set to off, the server certificate is not checked against
|
|
the specified certificate authorities. The default is ``on''. The same as
|
|
@samp{--check-certificate}.
|
|
|
|
@item convert_links = on/off
|
|
Convert non-relative links locally. The same as @samp{-k}.
|
|
|
|
@item cookies = on/off
|
|
When set to off, disallow cookies. See the @samp{--cookies} option.
|
|
|
|
@item connect_timeout = @var{n}
|
|
Set the connect timeout---the same as @samp{--connect-timeout}.
|
|
|
|
@item cut_dirs = @var{n}
|
|
Ignore @var{n} remote directory components. Equivalent to
|
|
@samp{--cut-dirs=@var{n}}.
|
|
|
|
@item debug = on/off
|
|
Debug mode, same as @samp{-d}.
|
|
|
|
@item delete_after = on/off
|
|
Delete after download---the same as @samp{--delete-after}.
|
|
|
|
@item dir_prefix = @var{string}
|
|
Top of directory tree---the same as @samp{-P @var{string}}.
|
|
|
|
@item dirstruct = on/off
|
|
Turning dirstruct on or off---the same as @samp{-x} or @samp{-nd},
|
|
respectively.
|
|
|
|
@item dns_cache = on/off
|
|
Turn DNS caching on/off. Since DNS caching is on by default, this
|
|
option is normally used to turn it off and is equivalent to
|
|
@samp{--no-dns-cache}.
|
|
|
|
@item dns_timeout = @var{n}
|
|
Set the DNS timeout---the same as @samp{--dns-timeout}.
|
|
|
|
@item domains = @var{string}
|
|
Same as @samp{-D} (@pxref{Spanning Hosts}).
|
|
|
|
@item dot_bytes = @var{n}
|
|
Specify the number of bytes ``contained'' in a dot, as seen throughout
|
|
the retrieval (1024 by default). You can postfix the value with
|
|
@samp{k} or @samp{m}, representing kilobytes and megabytes,
|
|
respectively. With dot settings you can tailor the dot retrieval to
|
|
suit your needs, or you can use the predefined @dfn{styles}
|
|
(@pxref{Download Options}).
|
|
|
|
@item dots_in_line = @var{n}
|
|
Specify the number of dots that will be printed in each line throughout
|
|
the retrieval (50 by default).
|
|
|
|
@item dot_spacing = @var{n}
|
|
Specify the number of dots in a single cluster (10 by default).
|
|
|
|
@item egd_file = @var{file}
|
|
Use @var{file} as the EGD socket file name. The same as
|
|
@samp{--egd-file=@var{file}}.
|
|
|
|
@item exclude_directories = @var{string}
|
|
Specify a comma-separated list of directories you wish to exclude from
|
|
download---the same as @samp{-X @var{string}} (@pxref{Directory-Based
|
|
Limits}).
|
|
|
|
@item exclude_domains = @var{string}
|
|
Same as @samp{--exclude-domains=@var{string}} (@pxref{Spanning
|
|
Hosts}).
|
|
|
|
@item follow_ftp = on/off
|
|
Follow @sc{ftp} links from @sc{html} documents---the same as
|
|
@samp{--follow-ftp}.
|
|
|
|
@item follow_tags = @var{string}
|
|
Only follow certain @sc{html} tags when doing a recursive retrieval,
|
|
just like @samp{--follow-tags=@var{string}}.
|
|
|
|
@item force_html = on/off
|
|
If set to on, force the input filename to be regarded as an @sc{html}
|
|
document---the same as @samp{-F}.
|
|
|
|
@item ftp_password = @var{string}
|
|
Set your @sc{ftp} password to @var{string}. Without this setting, the
|
|
password defaults to @samp{-wget@@}, which is a useful default for
|
|
anonymous @sc{ftp} access.
|
|
|
|
This command used to be named @code{passwd} prior to Wget 1.10.
|
|
|
|
@item ftp_proxy = @var{string}
|
|
Use @var{string} as @sc{ftp} proxy, instead of the one specified in
|
|
environment.
|
|
|
|
@item ftp_user = @var{string}
|
|
Set @sc{ftp} user to @var{string}.
|
|
|
|
This command used to be named @code{login} prior to Wget 1.10.
|
|
|
|
@item glob = on/off
|
|
Turn globbing on/off---the same as @samp{--glob} and @samp{--no-glob}.
|
|
|
|
@item header = @var{string}
|
|
Define a header for HTTP downloads, like using
|
|
@samp{--header=@var{string}}.
|
|
|
|
@item html_extension = on/off
|
|
Add a @samp{.html} extension to @samp{text/html} or
|
|
@samp{application/xhtml+xml} files without it, like @samp{-E}.
|
|
|
|
@item http_keep_alive = on/off
|
|
Turn the keep-alive feature on or off (defaults to on). Turning it
|
|
off is equivalent to @samp{--no-http-keep-alive}.
|
|
|
|
@item http_password = @var{string}
|
|
Set @sc{http} password, equivalent to
|
|
@samp{--http-password=@var{string}}.
|
|
|
|
@item http_proxy = @var{string}
|
|
Use @var{string} as @sc{http} proxy, instead of the one specified in
|
|
environment.
|
|
|
|
@item http_user = @var{string}
|
|
Set @sc{http} user to @var{string}, equivalent to
|
|
@samp{--http-user=@var{string}}.
|
|
|
|
@item https_proxy = @var{string}
|
|
Use @var{string} as @sc{https} proxy, instead of the one specified in
|
|
environment.
|
|
|
|
@item ignore_length = on/off
|
|
When set to on, ignore @code{Content-Length} header; the same as
|
|
@samp{--ignore-length}.
|
|
|
|
@item ignore_tags = @var{string}
|
|
Ignore certain @sc{html} tags when doing a recursive retrieval, like
|
|
@samp{--ignore-tags=@var{string}}.
|
|
|
|
@item include_directories = @var{string}
|
|
Specify a comma-separated list of directories you wish to follow when
|
|
downloading---the same as @samp{-I @var{string}}.
|
|
|
|
@item inet4_only = on/off
|
|
Force connecting to IPv4 addresses, off by default. You can put this
|
|
in the global init file to disable Wget's attempts to resolve and
|
|
connect to IPv6 hosts. Available only if Wget was compiled with IPv6
|
|
support. The same as @samp{--inet4-only} or @samp{-4}.
|
|
|
|
@item inet6_only = on/off
|
|
Force connecting to IPv6 addresses, off by default. Available only if
|
|
Wget was compiled with IPv6 support. The same as @samp{--inet6-only}
|
|
or @samp{-6}.
|
|
|
|
@item input = @var{file}
|
|
Read the @sc{url}s from @var{file}, like @samp{-i @var{file}}.
|
|
|
|
@item limit_rate = @var{rate}
|
|
Limit the download speed to no more than @var{rate} bytes per second.
|
|
The same as @samp{--limit-rate=@var{rate}}.
|
|
|
|
@item load_cookies = @var{file}
|
|
Load cookies from @var{file}. See @samp{--load-cookies @var{file}}.
|
|
|
|
@item logfile = @var{file}
|
|
Set logfile to @var{file}, the same as @samp{-o @var{file}}.
|
|
|
|
@item mirror = on/off
|
|
Turn mirroring on/off. The same as @samp{-m}.
|
|
|
|
@item netrc = on/off
|
|
Turn reading netrc on or off.
|
|
|
|
@item noclobber = on/off
|
|
Same as @samp{-nc}.
|
|
|
|
@item no_parent = on/off
|
|
Disallow retrieving outside the directory hierarchy, like
|
|
@samp{--no-parent} (@pxref{Directory-Based Limits}).
|
|
|
|
@item no_proxy = @var{string}
|
|
Use @var{string} as the comma-separated list of domains to avoid in
|
|
proxy loading, instead of the one specified in environment.
|
|
|
|
@item output_document = @var{file}
|
|
Set the output filename---the same as @samp{-O @var{file}}.
|
|
|
|
@item page_requisites = on/off
|
|
Download all ancillary documents necessary for a single @sc{html} page to
|
|
display properly---the same as @samp{-p}.
|
|
|
|
@item passive_ftp = on/off
|
|
Change setting of passive @sc{ftp}, equivalent to the
|
|
@samp{--passive-ftp} option.
|
|
|
|
@item password = @var{string}
|
|
Specify password @var{string} for both @sc{ftp} and @sc{http} file retrieval.
|
|
This command can be overridden using the @samp{ftp_password} and
|
|
@samp{http_password} commands for @sc{ftp} and @sc{http} respectively.
|
|
|
|
@item post_data = @var{string}
|
|
Use POST as the method for all HTTP requests and send @var{string} in
|
|
the request body. The same as @samp{--post-data=@var{string}}.
|
|
|
|
@item post_file = @var{file}
|
|
Use POST as the method for all HTTP requests and send the contents of
|
|
@var{file} in the request body. The same as
|
|
@samp{--post-file=@var{file}}.
|
|
|
|
@item prefer_family = IPv4/IPv6/none
|
|
When given a choice of several addresses, connect to the addresses
|
|
with specified address family first. IPv4 addresses are preferred by
|
|
default. The same as @samp{--prefer-family}, which see for a detailed
|
|
discussion of why this is useful.
|
|
|
|
@item private_key = @var{file}
|
|
Set the private key file to @var{file}. The same as
|
|
@samp{--private-key=@var{file}}.
|
|
|
|
@item private_key_type = @var{string}
|
|
Specify the type of the private key, legal values being @samp{PEM}
|
|
(the default) and @samp{DER} (aka ASN1). The same as
|
|
@samp{--private-key-type=@var{string}}.
|
|
|
|
@item progress = @var{string}
|
|
Set the type of the progress indicator. Legal types are @samp{dot}
|
|
and @samp{bar}. Equivalent to @samp{--progress=@var{string}}.
|
|
|
|
@item protocol_directories = on/off
|
|
When set, use the protocol name as a directory component of local file
|
|
names. The same as @samp{--protocol-directories}.
|
|
|
|
@item proxy_user = @var{string}
|
|
Set proxy authentication user name to @var{string}, like
|
|
@samp{--proxy-user=@var{string}}.
|
|
|
|
@item proxy_password = @var{string}
|
|
Set proxy authentication password to @var{string}, like
|
|
@samp{--proxy-password=@var{string}}.
|
|
|
|
@item quiet = on/off
|
|
Quiet mode---the same as @samp{-q}.
|
|
|
|
@item quota = @var{quota}
|
|
Specify the download quota, which is useful to put in the global
|
|
@file{wgetrc}. When download quota is specified, Wget will stop
|
|
retrieving after the download sum has become greater than quota. The
|
|
quota can be specified in bytes (default), kbytes (@samp{k} appended) or
|
|
mbytes (@samp{m} appended). Thus @samp{quota = 5m} will set the quota
|
|
to 5 megabytes. Note that the user's startup file overrides system
|
|
settings.
|
|
|
|
@item random_file = @var{file}
|
|
Use @var{file} as a source of randomness on systems lacking
|
|
@file{/dev/random}.
|
|
|
|
@item random_wait = on/off
|
|
Turn random between-request wait times on or off. The same as
|
|
@samp{--random-wait}.
|
|
|
|
@item read_timeout = @var{n}
|
|
Set the read (and write) timeout---the same as
|
|
@samp{--read-timeout=@var{n}}.
|
|
|
|
@item reclevel = @var{n}
|
|
Recursion level (depth)---the same as @samp{-l @var{n}}.
|
|
|
|
@item recursive = on/off
|
|
Recursive on/off---the same as @samp{-r}.
|
|
|
|
@item referer = @var{string}
|
|
Set HTTP @samp{Referer:} header just like
|
|
@samp{--referer=@var{string}}. (Note it was the folks who wrote the
|
|
@sc{http} spec who got the spelling of ``referrer'' wrong.)
|
|
|
|
@item relative_only = on/off
|
|
Follow only relative links---the same as @samp{-L} (@pxref{Relative
|
|
Links}).
|
|
|
|
@item remove_listing = on/off
|
|
If set to on, remove @sc{ftp} listings downloaded by Wget. Setting it
|
|
to off is the same as @samp{--no-remove-listing}.
|
|
|
|
@item restrict_file_names = unix/windows
|
|
Restrict the file names generated by Wget from URLs. See
|
|
@samp{--restrict-file-names} for a more detailed description.
|
|
|
|
@item retr_symlinks = on/off
|
|
When set to on, retrieve symbolic links as if they were plain files; the
|
|
same as @samp{--retr-symlinks}.
|
|
|
|
@item retry_connrefused = on/off
|
|
When set to on, consider ``connection refused'' a transient
|
|
error---the same as @samp{--retry-connrefused}.
|
|
|
|
@item robots = on/off
|
|
Specify whether the norobots convention is respected by Wget, ``on'' by
|
|
default. This switch controls both the @file{/robots.txt} and the
|
|
@samp{nofollow} aspect of the spec. @xref{Robot Exclusion}, for more
|
|
details about this. Be sure you know what you are doing before turning
|
|
this off.
|
|
|
|
@item save_cookies = @var{file}
|
|
Save cookies to @var{file}. The same as @samp{--save-cookies
|
|
@var{file}}.
|
|
|
|
@item secure_protocol = @var{string}
|
|
Choose the secure protocol to be used. Legal values are @samp{auto}
|
|
(the default), @samp{SSLv2}, @samp{SSLv3}, and @samp{TLSv1}. The same
|
|
as @samp{--secure-protocol=@var{string}}.
|
|
|
|
@item server_response = on/off
|
|
Choose whether or not to print the @sc{http} and @sc{ftp} server
|
|
responses---the same as @samp{-S}.
|
|
|
|
@item span_hosts = on/off
|
|
Same as @samp{-H}.
|
|
|
|
@item strict_comments = on/off
|
|
Same as @samp{--strict-comments}.
|
|
|
|
@item timeout = @var{n}
|
|
Set all applicable timeout values to @var{n}, the same as @samp{-T
|
|
@var{n}}.
|
|
|
|
@item timestamping = on/off
|
|
Turn timestamping on/off. The same as @samp{-N} (@pxref{Time-Stamping}).
|
|
|
|
@item tries = @var{n}
|
|
Set number of retries per @sc{url}---the same as @samp{-t @var{n}}.
|
|
|
|
@item use_proxy = on/off
|
|
When set to off, don't use proxy even when proxy-related environment
|
|
variables are set. In that case it is the same as using
|
|
@samp{--no-proxy}.
|
|
|
|
@item user = @var{string}
|
|
Specify username @var{string} for both @sc{ftp} and @sc{http} file retrieval.
|
|
This command can be overridden using the @samp{ftp_user} and
|
|
@samp{http_user} commands for @sc{ftp} and @sc{http} respectively.
|
|
|
|
@item verbose = on/off
|
|
Turn verbose on/off---the same as @samp{-v}/@samp{-nv}.
|
|
|
|
@item wait = @var{n}
|
|
Wait @var{n} seconds between retrievals---the same as @samp{-w
|
|
@var{n}}.
|
|
|
|
@item waitretry = @var{n}
|
|
Wait up to @var{n} seconds between retries of failed retrievals
|
|
only---the same as @samp{--waitretry=@var{n}}. Note that this is
|
|
turned on by default in the global @file{wgetrc}.
|
|
@end table
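
As an illustration of how a few of the commands above combine, the
fragment below sketches a conservative setup for repeated retrievals.
All four commands are documented in this section; the values themselves
are arbitrary and chosen only for the example.

@example
# Pause between retrievals, and randomize the pause.
wait = 2
random_wait = on
# Wait up to 10 seconds between retries of failed retrievals.
waitretry = 10
# Stop retrieving once more than 5 megabytes have been downloaded.
quota = 5m
@end example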
|
|
|
|
@node Sample Wgetrc
|
|
@section Sample Wgetrc
|
|
@cindex sample wgetrc
|
|
|
|
This is the sample initialization file, as given in the distribution.
|
|
It is divided into two sections---one for global usage (suitable for the global
|
|
startup file), and one for local usage (suitable for
|
|
@file{$HOME/.wgetrc}). Be careful about the things you change.
|
|
|
|
Note that almost all the lines are commented out. For a command to have
|
|
any effect, you must remove the @samp{#} character at the beginning of
|
|
its line.
|
|
|
|
@example
|
|
@include sample.wgetrc.munged_for_texi_inclusion
|
|
@end example
|
|
|
|
@node Examples
|
|
@chapter Examples
|
|
@cindex examples
|
|
|
|
@c man begin EXAMPLES
|
|
The examples are divided into three sections loosely based on their
|
|
complexity.
|
|
|
|
@menu
|
|
* Simple Usage:: Simple, basic usage of the program.
|
|
* Advanced Usage:: Advanced tips.
|
|
* Very Advanced Usage:: The hairy stuff.
|
|
@end menu
|
|
|
|
@node Simple Usage
|
|
@section Simple Usage
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Say you want to download a @sc{url}. Just type:
|
|
|
|
@example
|
|
wget http://fly.srk.fer.hr/
|
|
@end example
|
|
|
|
@item
|
|
But what will happen if the connection is slow, and the file is lengthy?
|
|
The connection will probably fail before the whole file is retrieved,
|
|
more than once. In this case, Wget will try getting the file until it
|
|
either gets the whole of it, or exceeds the default number of retries
|
|
(this being 20). It is easy to change the number of tries to 45, to
|
|
ensure that the whole file will arrive safely:
|
|
|
|
@example
|
|
wget --tries=45 http://fly.srk.fer.hr/jpg/flyweb.jpg
|
|
@end example
|
|
|
|
@item
|
|
Now let's leave Wget to work in the background, and write its progress
|
|
to log file @file{log}. It is tiring to type @samp{--tries}, so we
|
|
shall use @samp{-t}.
|
|
|
|
@example
|
|
wget -t 45 -o log http://fly.srk.fer.hr/jpg/flyweb.jpg &
|
|
@end example
|
|
|
|
The ampersand at the end of the line makes sure that Wget works in the
|
|
background. To unlimit the number of retries, use @samp{-t inf}.
|
|
|
|
@item
|
|
The usage of @sc{ftp} is just as simple. Wget will take care of login and
|
|
password.
|
|
|
|
@example
|
|
wget ftp://gnjilux.srk.fer.hr/welcome.msg
|
|
@end example
|
|
|
|
@item
|
|
If you specify a directory, Wget will retrieve the directory listing,
|
|
parse it and convert it to @sc{html}. Try:
|
|
|
|
@example
|
|
wget ftp://ftp.gnu.org/pub/gnu/
|
|
links index.html
|
|
@end example
|
|
@end itemize
|
|
|
|
@node Advanced Usage
|
|
@section Advanced Usage
|
|
|
|
@itemize @bullet
|
|
@item
|
|
You have a file that contains the URLs you want to download? Use the
|
|
@samp{-i} switch:
|
|
|
|
@example
|
|
wget -i @var{file}
|
|
@end example
|
|
|
|
If you specify @samp{-} as file name, the @sc{url}s will be read from
|
|
standard input.
|
|
|
|
@item
|
|
Create a five levels deep mirror image of the GNU web site, with the
|
|
same directory structure the original has, with only one try per
|
|
document, saving the log of the activities to @file{gnulog}:
|
|
|
|
@example
|
|
wget -r -t1 http://www.gnu.org/ -o gnulog
|
|
@end example
|
|
|
|
@item
|
|
The same as the above, but convert the links in the @sc{html} files to
|
|
point to local files, so you can view the documents off-line:
|
|
|
|
@example
|
|
wget --convert-links -r -t1 http://www.gnu.org/ -o gnulog
|
|
@end example
|
|
|
|
@item
|
|
Retrieve only one @sc{html} page, but make sure that all the elements needed
|
|
for the page to be displayed, such as inline images and external style
|
|
sheets, are also downloaded. Also make sure the downloaded page
|
|
references the downloaded links.
|
|
|
|
@example
|
|
wget -p --convert-links http://www.server.com/dir/page.html
|
|
@end example
|
|
|
|
The @sc{html} page will be saved to @file{www.server.com/dir/page.html}, and
|
|
the images, stylesheets, etc., somewhere under @file{www.server.com/},
|
|
depending on where they were on the remote server.
|
|
|
|
@item
|
|
The same as the above, but without the @file{www.server.com/} directory.
|
|
In fact, I don't want to have all those random server directories
|
|
anyway---just save @emph{all} those files under a @file{download/}
|
|
subdirectory of the current directory.
|
|
|
|
@example
|
|
wget -p --convert-links -nH -nd -Pdownload \
|
|
http://www.server.com/dir/page.html
|
|
@end example
|
|
|
|
@item
|
|
Retrieve the @file{index.html} of @samp{www.lycos.com}, showing the original
|
|
server headers:
|
|
|
|
@example
|
|
wget -S http://www.lycos.com/
|
|
@end example
|
|
|
|
@item
|
|
Save the server headers with the file, perhaps for post-processing.
|
|
|
|
@example
|
|
wget --save-headers http://www.lycos.com/
|
|
more index.html
|
|
@end example
|
|
|
|
@item
|
|
Retrieve the first two levels of @samp{wuarchive.wustl.edu}, saving them
|
|
to @file{/tmp}.
|
|
|
|
@example
|
|
wget -r -l2 -P/tmp ftp://wuarchive.wustl.edu/
|
|
@end example
|
|
|
|
@item
|
|
You want to download all the @sc{gif}s from a directory on an @sc{http}
|
|
server. You tried @samp{wget http://www.server.com/dir/*.gif}, but that
|
|
didn't work because @sc{http} retrieval does not support globbing. In
|
|
that case, use:
|
|
|
|
@example
|
|
wget -r -l1 --no-parent -A.gif http://www.server.com/dir/
|
|
@end example
|
|
|
|
More verbose, but the effect is the same. @samp{-r -l1} means to
|
|
retrieve recursively (@pxref{Recursive Download}), with maximum depth
|
|
of 1. @samp{--no-parent} means that references to the parent directory
|
|
are ignored (@pxref{Directory-Based Limits}), and @samp{-A.gif} means to
|
|
download only the @sc{gif} files. @samp{-A "*.gif"} would have worked
|
|
too.
|
|
|
|
@item
|
|
Suppose you were in the middle of downloading, when Wget was
|
|
interrupted. Now you do not want to clobber the files already present.
|
|
The command would be:
|
|
|
|
@example
|
|
wget -nc -r http://www.gnu.org/
|
|
@end example
|
|
|
|
@item
|
|
If you want to encode your own username and password to @sc{http} or
|
|
@sc{ftp}, use the appropriate @sc{url} syntax (@pxref{URL Format}).
|
|
|
|
@example
|
|
wget ftp://hniksic:mypassword@@unix.server.com/.emacs
|
|
@end example
|
|
|
|
Note, however, that this usage is not advisable on multi-user systems
|
|
because it reveals your password to anyone who looks at the output of
|
|
@code{ps}.
|
|
|
|
@cindex redirecting output
|
|
@item
|
|
You would like the output documents to go to standard output instead of
|
|
to files?
|
|
|
|
@example
|
|
wget -O - http://jagor.srce.hr/ http://www.srce.hr/
|
|
@end example
|
|
|
|
You can also combine the two options and make pipelines to retrieve the
|
|
documents from remote hotlists:
|
|
|
|
@example
|
|
wget -O - http://cool.list.com/ | wget --force-html -i -
|
|
@end example
|
|
@end itemize
|
|
|
|
@node Very Advanced Usage
|
|
@section Very Advanced Usage
|
|
|
|
@cindex mirroring
|
|
@itemize @bullet
|
|
@item
|
|
If you wish Wget to keep a mirror of a page (or @sc{ftp}
|
|
subdirectories), use @samp{--mirror} (@samp{-m}), which is the shorthand
|
|
for @samp{-r -l inf -N}. You can put Wget in the crontab file asking it
|
|
to recheck a site each Sunday:
|
|
|
|
@example
|
|
crontab
|
|
0 0 * * 0 wget --mirror http://www.gnu.org/ -o /home/me/weeklog
|
|
@end example
|
|
|
|
@item
|
|
In addition to the above, you want the links to be converted for local
|
|
viewing. But, after having read this manual, you know that link
|
|
conversion doesn't play well with timestamping, so you also want Wget to
|
|
back up the original @sc{html} files before the conversion. Wget invocation
|
|
would look like this:
|
|
|
|
@example
|
|
wget --mirror --convert-links --backup-converted \
|
|
http://www.gnu.org/ -o /home/me/weeklog
|
|
@end example
|
|
|
|
@item
|
|
But you've also noticed that local viewing doesn't work all that well
|
|
when @sc{html} files are saved under extensions other than @samp{.html},
|
|
perhaps because they were served as @file{index.cgi}. So you'd like
|
|
Wget to rename all the files served with content-type @samp{text/html}
|
|
or @samp{application/xhtml+xml} to @file{@var{name}.html}.
|
|
|
|
@example
|
|
wget --mirror --convert-links --backup-converted \
|
|
--html-extension -o /home/me/weeklog \
|
|
http://www.gnu.org/
|
|
@end example
|
|
|
|
Or, with less typing:
|
|
|
|
@example
|
|
wget -m -k -K -E http://www.gnu.org/ -o /home/me/weeklog
|
|
@end example
|
|
@end itemize
|
|
@c man end
|
|
|
|
@node Various
|
|
@chapter Various
|
|
@cindex various
|
|
|
|
This chapter contains all the stuff that could not fit anywhere else.
|
|
|
|
@menu
|
|
* Proxies:: Support for proxy servers
|
|
* Distribution:: Getting the latest version.
|
|
* Mailing List:: Wget mailing list for announcements and discussion.
|
|
* Reporting Bugs:: How and where to report bugs.
|
|
* Portability:: The systems Wget works on.
|
|
* Signals:: Signal-handling performed by Wget.
|
|
@end menu
|
|
|
|
@node Proxies
|
|
@section Proxies
|
|
@cindex proxies
|
|
|
|
@dfn{Proxies} are special-purpose @sc{http} servers designed to transfer
|
|
data from remote servers to local clients. One typical use of proxies
|
|
is lightening network load for users behind a slow connection. This is
|
|
achieved by channeling all @sc{http} and @sc{ftp} requests through the
|
|
proxy which caches the transferred data. When a cached resource is
|
|
requested again, the proxy will return the data from its cache. Another use for
|
|
proxies is for companies that separate (for security reasons) their
|
|
internal networks from the rest of the Internet. In order to obtain
|
|
information from the Web, their users connect and retrieve remote data
|
|
using an authorized proxy.
|
|
|
|
Wget supports proxies for both @sc{http} and @sc{ftp} retrievals. The
|
|
standard way to specify proxy location, which Wget recognizes, is using
|
|
the following environment variables:
|
|
|
|
@table @code
|
|
@item http_proxy
|
|
@itemx https_proxy
|
|
If set, the @code{http_proxy} and @code{https_proxy} variables should
|
|
contain the @sc{url}s of the proxies for @sc{http} and @sc{https}
|
|
connections respectively.
|
|
|
|
@item ftp_proxy
|
|
This variable should contain the @sc{url} of the proxy for @sc{ftp}
|
|
connections. It is quite common that @code{http_proxy} and
|
|
@code{ftp_proxy} are set to the same @sc{url}.
|
|
|
|
@item no_proxy
|
|
This variable should contain a comma-separated list of domain extensions
|
|
the proxy should @emph{not} be used for. For instance, if the value of
|
|
@code{no_proxy} is @samp{.mit.edu}, the proxy will not be used to retrieve
|
|
documents from MIT.
|
|
@end table
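
As a minimal sketch, assuming a Bourne-compatible shell and the example
proxy at @samp{proxy.company.com}, port 8001, used later in this section,
the variables might be set like this before invoking Wget (the host name
and the @samp{.company.com} suffix are placeholders, not real values):

@example
# Placeholder proxy location; substitute your own.
http_proxy=http://proxy.company.com:8001/
# Reuse the same proxy for FTP retrievals.
ftp_proxy=$http_proxy
# Hosts under this suffix are contacted directly.
no_proxy=.company.com
export http_proxy ftp_proxy no_proxy
@end example

Wget invocations started from the same shell afterwards should pick up
these settings.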
|
|
|
|
In addition to the environment variables, proxy location and settings
|
|
may be specified from within Wget itself.
|
|
|
|
@table @samp
|
|
@item --no-proxy
|
|
@itemx proxy = on/off
|
|
This option and the corresponding command may be used to suppress the
|
|
use of the proxy, even if the appropriate environment variables are set.
|
|
|
|
@item http_proxy = @var{URL}
|
|
@itemx https_proxy = @var{URL}
|
|
@itemx ftp_proxy = @var{URL}
|
|
@itemx no_proxy = @var{string}
|
|
These startup file variables allow you to override the proxy settings
|
|
specified by the environment.
|
|
@end table
|
|
|
|
Some proxy servers require authorization to enable you to use them. The
|
|
authorization consists of @dfn{username} and @dfn{password}, which must
|
|
be sent by Wget. As with @sc{http} authorization, several
|
|
authentication schemes exist. For proxy authorization only the
|
|
@code{Basic} authentication scheme is currently implemented.
|
|
|
|
You may specify your username and password either through the proxy
|
|
@sc{url} or through the command-line options. Assuming that the
|
|
company's proxy is located at @samp{proxy.company.com} at port 8001, a
|
|
proxy @sc{url} location containing authorization data might look like
|
|
this:
|
|
|
|
@example
|
|
http://hniksic:mypassword@@proxy.company.com:8001/
|
|
@end example
|
|
|
|
Alternatively, you may use the @samp{--proxy-user} and
@samp{--proxy-password} options, and the equivalent @file{.wgetrc}
|
|
settings @code{proxy_user} and @code{proxy_password} to set the proxy
|
|
username and password.
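
For illustration, a @file{.wgetrc} fragment combining the settings
described in this section might look like the following.  The host, port,
user name and password are the placeholder values from the example above,
and the @samp{.company.com} suffix is likewise assumed:

@example
# Hypothetical proxy settings; adjust to your site.
http_proxy = http://proxy.company.com:8001/
ftp_proxy = http://proxy.company.com:8001/
no_proxy = .company.com
proxy_user = hniksic
proxy_password = mypassword
@end example

Since the file then holds an unencrypted password, it is sensible to
make it readable only by its owner.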
|
|
|
|
@node Distribution
|
|
@section Distribution
|
|
@cindex latest version
|
|
|
|
Like all GNU utilities, the latest version of Wget can be found at the
|
|
master GNU archive site ftp.gnu.org, and its mirrors. For example,
|
|
Wget @value{VERSION} can be found at
|
|
@url{ftp://ftp.gnu.org/pub/gnu/wget/wget-@value{VERSION}.tar.gz}
|
|
|
|
@node Mailing List
|
|
@section Mailing List
|
|
@cindex mailing list
|
|
@cindex list
|
|
|
|
There are several Wget-related mailing lists, all hosted by
|
|
SunSITE.dk. The general discussion list is at
|
|
@email{wget@@sunsite.dk}. It is the preferred place for bug reports
|
|
and suggestions, as well as for discussion of development. You are
|
|
invited to subscribe.
|
|
|
|
To subscribe, simply send mail to @email{wget-subscribe@@sunsite.dk}
|
|
and follow the instructions. Unsubscribe by mailing to
|
|
@email{wget-unsubscribe@@sunsite.dk}. The mailing list is archived at
|
|
@url{http://www.mail-archive.com/wget%40sunsite.dk/} and at
|
|
@url{http://news.gmane.org/gmane.comp.web.wget.general}.
|
|
|
|
The second mailing list is at @email{wget-patches@@sunsite.dk}, and is
|
|
used to submit patches for review by Wget developers. A ``patch'' is
|
|
a textual representation of change to source code, readable by both
|
|
humans and programs. The file @file{PATCHES} that comes with Wget
|
|
covers the creation and submitting of patches in detail. Please don't
|
|
send general suggestions or bug reports to @samp{wget-patches}; use it
|
|
only for patch submissions.
|
|
|
|
To subscribe, simply send mail to @email{wget-patches-subscribe@@sunsite.dk}
and follow the instructions. Unsubscribe by mailing to
@email{wget-patches-unsubscribe@@sunsite.dk}. The mailing list is archived at
|
|
@url{http://news.gmane.org/gmane.comp.web.wget.patches}.
|
|
|
|
@node Reporting Bugs
|
|
@section Reporting Bugs
|
|
@cindex bugs
|
|
@cindex reporting bugs
|
|
@cindex bug reports
|
|
|
|
@c man begin BUGS
|
|
You are welcome to send bug reports about GNU Wget to
|
|
@email{bug-wget@@gnu.org}.
|
|
|
|
Before actually submitting a bug report, please try to follow a few
|
|
simple guidelines.
|
|
|
|
@enumerate
|
|
@item
|
|
Please try to ascertain that the behavior you see really is a bug. If
|
|
Wget crashes, it's a bug. If Wget does not behave as documented,
|
|
it's a bug. If things work strangely, but you are not sure about the way
|
|
they are supposed to work, it might well be a bug.
|
|
|
|
@item
|
|
Try to repeat the bug in as simple circumstances as possible. E.g. if
|
|
Wget crashes while downloading @samp{wget -rl0 -kKE -t5 -Y0
|
|
http://yoyodyne.com -o /tmp/log}, you should try to see if the crash is
|
|
repeatable, and if it will occur with a simpler set of options. You might
|
|
even try to start the download at the page where the crash occurred to
|
|
see if that page somehow triggered the crash.
|
|
|
|
Also, while I will probably be interested to know the contents of your
|
|
@file{.wgetrc} file, just dumping it into the debug message is probably
|
|
a bad idea. Instead, you should first try to see if the bug repeats
|
|
with @file{.wgetrc} moved out of the way. Only if it turns out that
|
|
@file{.wgetrc} settings affect the bug, mail me the relevant parts of
|
|
the file.
|
|
|
|
@item
|
|
Please start Wget with @samp{-d} option and send us the resulting
|
|
output (or relevant parts thereof). If Wget was compiled without
|
|
debug support, recompile it---it is @emph{much} easier to trace bugs
|
|
with debug support on.
|
|
|
|
Note: please make sure to remove any potentially sensitive information
|
|
from the debug log before sending it to the bug address. The
|
|
@code{-d} won't go out of its way to collect sensitive information,
|
|
but the log @emph{will} contain a fairly complete transcript of Wget's
|
|
communication with the server, which may include passwords and pieces
|
|
of downloaded data. Since the bug address is publicly archived, you
|
|
may assume that all bug reports are visible to the public.
|
|
|
|
@item
|
|
If Wget has crashed, try to run it in a debugger, e.g. @code{gdb `which
|
|
wget` core} and type @code{where} to get the backtrace. This may not
|
|
work if the system administrator has disabled core files, but it is
|
|
safe to try.
|
|
@end enumerate
|
|
@c man end
|
|
|
|
@node Portability
|
|
@section Portability
|
|
@cindex portability
|
|
@cindex operating systems
|
|
|
|
Like all GNU software, Wget works on the GNU system. However, since it
|
|
uses GNU Autoconf for building and configuring, and mostly avoids using
|
|
``special'' features of any particular Unix, it should compile (and
|
|
work) on all common Unix flavors.
|
|
|
|
Various Wget versions have been compiled and tested under many kinds
|
|
of Unix systems, including GNU/Linux, Solaris, SunOS 4.x, OSF (aka
|
|
Digital Unix or Tru64), Ultrix, *BSD, IRIX, AIX, and others. Some of
|
|
those systems are no longer in widespread use and may not be able to
|
|
support recent versions of Wget. If Wget fails to compile on your
|
|
system, we would like to know about it.
|
|
|
|
Thanks to kind contributors, this version of Wget compiles and works
|
|
on 32-bit Microsoft Windows platforms. It has been compiled
|
|
successfully using MS Visual C++ 6.0, Watcom, Borland C, and GCC
|
|
compilers. Naturally, it lacks some features available on
|
|
Unix, but it should work as a substitute for people stuck with
|
|
Windows. Note that Windows-specific portions of Wget are not
|
|
guaranteed to be supported in the future, although this has been the
|
|
case in practice for many years now. All questions and problems in
|
|
Windows usage should be reported to the Wget mailing list at
|
|
@email{wget@@sunsite.dk} where the volunteers who maintain the
|
|
Windows-related features might look at them.
|
|
|
|
@node Signals
|
|
@section Signals
|
|
@cindex signal handling
|
|
@cindex hangup
|
|
|
|
Since the purpose of Wget is background work, it catches the hangup
|
|
signal (@code{SIGHUP}) and ignores it. If the output was on standard
|
|
output, it will be redirected to a file named @file{wget-log}.
|
|
Otherwise, @code{SIGHUP} is ignored. This is convenient when you wish
|
|
to redirect the output of Wget after having started it.
|
|
|
|
@example
|
|
$ wget http://www.gnus.org/dist/gnus.tar.gz &
|
|
...
|
|
$ kill -HUP %%
|
|
SIGHUP received, redirecting output to `wget-log'.
|
|
@end example
|
|
|
|
Other than that, Wget will not try to interfere with signals in any way.
|
|
@kbd{C-c}, @code{kill -TERM} and @code{kill -KILL} should kill it alike.
|
|
|
|
@node Appendices
|
|
@chapter Appendices
|
|
|
|
This chapter contains some references I consider useful.
|
|
|
|
@menu
|
|
* Robot Exclusion:: Wget's support for RES.
|
|
* Security Considerations:: Security with Wget.
|
|
* Contributors:: People who helped.
|
|
@end menu
|
|
|
|
@node Robot Exclusion
|
|
@section Robot Exclusion
|
|
@cindex robot exclusion
|
|
@cindex robots.txt
|
|
@cindex server maintenance
|
|
|
|
It is extremely easy to make Wget wander aimlessly around a web site,
|
|
sucking all the available data in the process. @samp{wget -r @var{site}},
|
|
and you're set. Great? Not for the server admin.
|
|
|
|
As long as Wget is only retrieving static pages, and doing it at a
|
|
reasonable rate (see the @samp{--wait} option), there's not much of a
|
|
problem. The trouble is that Wget can't tell the difference between the
|
|
smallest static page and the most demanding CGI. A site I know has a
|
|
section handled by a CGI Perl script that converts Info files to @sc{html} on
|
|
the fly. The script is slow, but works well enough for human users
|
|
viewing an occasional Info file. However, when someone's recursive Wget
|
|
download stumbles upon the index page that links to all the Info files
|
|
through the script, the system is brought to its knees without providing
|
|
anything useful to the user (This task of converting Info files could be
|
|
done locally and access to Info documentation for all installed GNU
|
|
software on a system is available from the @code{info} command).
|
|
|
|
To avoid this kind of accident, as well as to preserve privacy for
|
|
documents that need to be protected from well-behaved robots, the
|
|
concept of @dfn{robot exclusion} was invented. The idea is that
|
|
the server administrators and document authors can specify which
|
|
portions of the site they wish to protect from robots and those
|
|
to which they will permit access.
|
|
|
|
The most popular mechanism, and the @i{de facto} standard supported by
|
|
all the major robots, is the ``Robots Exclusion Standard'' (RES) written
|
|
by Martijn Koster et al. in 1994. It specifies the format of a text
|
|
file containing directives that instruct the robots which URL paths to
|
|
avoid. To be found by the robots, the specifications must be placed in
|
|
@file{/robots.txt} in the server root, which the robots are expected to
|
|
download and parse.
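
As an illustration, a small @file{/robots.txt} that asks all robots to
stay out of two directories could look like this (the paths are made up
for the example):

@example
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
@end example

Each @samp{Disallow} line names a URL path prefix that the robots
matched by the preceding @samp{User-agent} line are asked to avoid.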
|
|
|
|
Although Wget is not a web robot in the strictest sense of the word, it
|
|
can download large parts of the site without the user's intervention to
|
|
download an individual page. Because of that, Wget honors RES when
|
|
downloading recursively. For instance, when you issue:
|
|
|
|
@example
|
|
wget -r http://www.server.com/
|
|
@end example
|
|
|
|
First the index of @samp{www.server.com} will be downloaded. If Wget
|
|
finds that it wants to download more documents from that server, it will
|
|
request @samp{http://www.server.com/robots.txt} and, if found, use it
|
|
for further downloads. @file{robots.txt} is loaded only once per
|
|
server.
|
|
|
|
Until version 1.8, Wget supported the first version of the standard,
|
|
written by Martijn Koster in 1994 and available at
|
|
@url{http://www.robotstxt.org/wc/norobots.html}. As of version 1.8,
|
|
Wget has supported the additional directives specified in the internet
|
|
draft @samp{<draft-koster-robots-00.txt>} titled ``A Method for Web
|
|
Robots Control''. The draft, which has as far as I know never made it to
|
|
an @sc{rfc}, is available at
|
|
@url{http://www.robotstxt.org/wc/norobots-rfc.txt}.
|
|
|
|
This manual no longer includes the text of the Robot Exclusion Standard.
|
|
|
|
The second, less known mechanism, enables the author of an individual
|
|
document to specify whether they want the links from the file to be
|
|
followed by a robot. This is achieved using the @code{META} tag, like
|
|
this:
|
|
|
|
@example
|
|
<meta name="robots" content="nofollow">
|
|
@end example
|
|
|
|
This is explained in some detail at
|
|
@url{http://www.robotstxt.org/wc/meta-user.html}. Wget supports this
|
|
method of robot exclusion in addition to the usual @file{/robots.txt}
|
|
exclusion.
|
|
|
|
If you know what you are doing and really really wish to turn off the
|
|
robot exclusion, set the @code{robots} variable to @samp{off} in your
|
|
@file{.wgetrc}. You can achieve the same effect from the command line
|
|
using the @code{-e} switch, e.g. @samp{wget -e robots=off @var{url}...}.
|
|
|
|
@node Security Considerations
|
|
@section Security Considerations
|
|
@cindex security
|
|
|
|
When using Wget, you must be aware that it sends unencrypted passwords
|
|
through the network, which may present a security problem. Here are the
|
|
main issues, and some solutions.
|
|
|
|
@enumerate
|
|
@item
|
|
The passwords on the command line are visible using @code{ps}. The best
|
|
way around it is to use @code{wget -i -} and feed the @sc{url}s to
|
|
Wget's standard input, each on a separate line, terminated by @kbd{C-d}.
|
|
Another workaround is to use @file{.netrc} to store passwords; however,
|
|
storing unencrypted passwords is also considered a security risk.
|
|
|
|
@item
|
|
Using the insecure @dfn{basic} authentication scheme, unencrypted
|
|
passwords are transmitted through the network routers and gateways.
|
|
|
|
@item
|
|
The @sc{ftp} passwords are also in no way encrypted. There is no good
|
|
solution for this at the moment.
|
|
|
|
@item
|
|
Although the ``normal'' output of Wget tries to hide the passwords,
|
|
debugging logs show them, in all forms. This problem is avoided by
|
|
being careful when you send debug logs (yes, even when you send them to
|
|
me).
|
|
@end enumerate
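
A @file{.netrc} entry for the @sc{ftp} example used earlier in this
manual might look like the sketch below.  The host, user name and
password are placeholders:

@example
machine unix.server.com login hniksic password mypassword
@end example

With such an entry in place, @samp{wget ftp://unix.server.com/.emacs}
should no longer need the password on the command line.  As noted above,
the password is still stored unencrypted, so keep the file readable only
by its owner.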
|
|
|
|
@node Contributors
|
|
@section Contributors
|
|
@cindex contributors
|
|
|
|
@iftex
|
|
GNU Wget was written by Hrvoje Nik@v{s}i@'{c} @email{hniksic@@xemacs.org}.
|
|
@end iftex
|
|
@ifnottex
|
|
GNU Wget was written by Hrvoje Niksic @email{hniksic@@xemacs.org}.
|
|
@end ifnottex
|
|
However, its development could never have gone as far as it has, were it
|
|
not for the help of many people, either with bug reports, feature
|
|
proposals, patches, or letters saying ``Thanks!''.
|
|
|
|
Special thanks go to the following people (in no particular order):
|
|
|
|
@itemize @bullet
|
|
@item Mauro Tortonesi---contributed high-quality IPv6 code and many
|
|
other fixes.
|
|
|
|
@item Dan Harkless---contributed a lot of code and documentation of
|
|
extremely high quality, as well as the @code{--page-requisites} and
|
|
related options. He was the principal maintainer for some time and
|
|
released Wget 1.6.
|
|
|
|
@item Ian Abbott---contributed bug fixes, Windows-related fixes, and
|
|
provided a prototype implementation of the breadth-first recursive
|
|
download. Co-maintained Wget during the 1.8 release cycle.
|
|
|
|
@item
|
|
The dotsrc.org crew, in particular Karsten Thygesen---donated system
|
|
resources such as the mailing list, web space, @sc{ftp} space, and
|
|
version control repositories, along with a lot of time to make these
|
|
actually work. Christian Reiniger was of invaluable help with setting
|
|
up Subversion.
|
|
|
|
@item
|
|
Heiko Herold---provided high-quality Windows builds and contributed
|
|
bug and build reports for many years.
|
|
|
|
@item
|
|
Shawn McHorse---bug reports and patches.
|
|
|
|
@item
|
|
Kaveh R. Ghazi---on-the-fly @code{ansi2knr}-ization. Lots of
|
|
portability fixes.
|
|
|
|
@item
|
|
Gordon Matzigkeit---@file{.netrc} support.
|
|
|
|
@item
|
|
@iftex
|
|
Zlatko @v{C}alu@v{s}i@'{c}, Tomislav Vujec and Dra@v{z}en
|
|
Ka@v{c}ar---feature suggestions and ``philosophical'' discussions.
|
|
@end iftex
|
|
@ifnottex
|
|
Zlatko Calusic, Tomislav Vujec and Drazen Kacar---feature suggestions
|
|
and ``philosophical'' discussions.
|
|
@end ifnottex
|
|
|
|
@item
|
|
Darko Budor---initial port to Windows.
|
|
|
|
@item
|
|
Antonio Rosella---help and suggestions, plus the initial Italian
|
|
translation.
|
|
|
|
@item
|
|
@iftex
|
|
Tomislav Petrovi@'{c}, Mario Miko@v{c}evi@'{c}---many bug reports and
|
|
suggestions.
|
|
@end iftex
|
|
@ifnottex
|
|
Tomislav Petrovic, Mario Mikocevic---many bug reports and suggestions.
|
|
@end ifnottex
|
|
|
|
@item
|
|
@iftex
|
|
Fran@,{c}ois Pinard---many thorough bug reports and discussions.
|
|
@end iftex
|
|
@ifnottex
|
|
Francois Pinard---many thorough bug reports and discussions.
|
|
@end ifnottex
|
|
|
|
@item
|
|
Karl Eichwalder---lots of help with internationalization, Makefile
|
|
layout and many other things.
|
|
|
|
@item
|
|
Junio Hamano---donated support for Opie and @sc{http} @code{Digest}
|
|
authentication.
|
|
|
|
@item
|
|
People who provided donations for development---including Brian Gough.
|
|
@end itemize
|
|
|
|
The following people have provided patches, bug/build reports, useful
|
|
suggestions, beta testing services, fan mail and all the other things
|
|
that make maintenance so much fun:
|
|
|
|
Tim Adam,
|
|
Adrian Aichner,
|
|
Martin Baehr,
|
|
Dieter Baron,
|
|
Roger Beeman,
|
|
Dan Berger,
|
|
T. Bharath,
|
|
Christian Biere,
|
|
Paul Bludov,
|
|
Daniel Bodea,
|
|
Mark Boyns,
|
|
John Burden,
|
|
Wanderlei Cavassin,
|
|
Gilles Cedoc,
|
|
Tim Charron,
|
|
Noel Cragg,
|
|
@iftex
|
|
Kristijan @v{C}onka@v{s},
|
|
@end iftex
|
|
@ifnottex
|
|
Kristijan Conkas,
|
|
@end ifnottex
|
|
John Daily,
|
|
Andreas Damm,
|
|
Ahmon Dancy,
|
|
Andrew Davison,
|
|
Bertrand Demiddelaer,
|
|
Andrew Deryabin,
|
|
Ulrich Drepper,
|
|
Marc Duponcheel,
|
|
@iftex
|
|
Damir D@v{z}eko,
|
|
@end iftex
|
|
@ifnottex
|
|
Damir Dzeko,
|
|
@end ifnottex
|
|
Alan Eldridge,
|
|
Hans-Andreas Engel,
|
|
@iftex
|
|
Aleksandar Erkalovi@'{c},
|
|
@end iftex
|
|
@ifnottex
|
|
Aleksandar Erkalovic,
|
|
@end ifnottex
|
|
Andy Eskilsson,
|
|
Christian Fraenkel,
|
|
David Fritz,
|
|
Charles C. Fu,
|
|
FUJISHIMA Satsuki,
|
|
Masashi Fujita,
|
|
Howard Gayle,
|
|
Marcel Gerrits,
|
|
Lemble Gregory,
|
|
Hans Grobler,
|
|
Mathieu Guillaume,
|
|
Aaron Hawley,
|
|
Jochen Hein,
|
|
Karl Heuer,
|
|
HIROSE Masaaki,
|
|
Ulf Harnhammar,
|
|
Gregor Hoffleit,
|
|
Erik Magnus Hulthen,
|
|
Richard Huveneers,
|
|
Jonas Jensen,
|
|
Larry Jones,
|
|
Simon Josefsson,
|
|
@iftex
|
|
Mario Juri@'{c},
|
|
@end iftex
|
|
@ifnottex
|
|
Mario Juric,
|
|
@end ifnottex
|
|
@iftex
|
|
Hack Kampbj@o rn,
|
|
@end iftex
|
|
@ifnottex
|
|
Hack Kampbjorn,
|
|
@end ifnottex
|
|
Const Kaplinsky,
|
|
@iftex
|
|
Goran Kezunovi@'{c},
|
|
@end iftex
|
|
@ifnottex
|
|
Goran Kezunovic,
|
|
@end ifnottex
|
|
Igor Khristophorov,
|
|
Robert Kleine,
|
|
KOJIMA Haime,
|
|
Fila Kolodny,
|
|
Alexander Kourakos,
|
|
Martin Kraemer,
|
|
Sami Krank,
|
|
@tex
|
|
$\Sigma\acute{\iota}\mu o\varsigma\;
|
|
\Xi\varepsilon\nu\iota\tau\acute{\epsilon}\lambda\lambda\eta\varsigma$
|
|
(Simos KSenitellis),
|
|
@end tex
|
|
@ifnottex
|
|
Simos KSenitellis,
|
|
@end ifnottex
|
|
Christian Lackas,
|
|
Hrvoje Lacko,
|
|
Daniel S. Lewart,
|
|
@iftex
|
|
Nicol@'{a}s Lichtmeier,
|
|
@end iftex
|
|
@ifnottex
|
|
Nicolas Lichtmeier,
|
|
@end ifnottex
|
|
Dave Love,
|
|
Alexander V. Lukyanov,
|
|
@iftex
|
|
Thomas Lu@ss{}nig,
|
|
@end iftex
|
|
@ifnottex
|
|
Thomas Lussnig,
|
|
@end ifnottex
|
|
Andre Majorel,
|
|
Aurelien Marchand,
|
|
Matthew J. Mellon,
|
|
Jordan Mendelson,
|
|
Lin Zhe Min,
|
|
Jan Minar,
|
|
Tim Mooney,
|
|
Keith Moore,
|
|
Adam D. Moss,
|
|
Simon Munton,
|
|
Charlie Negyesi,
|
|
R. K. Owen,
|
|
Leonid Petrov,
|
|
Simone Piunno,
|
|
Andrew Pollock,
|
|
Steve Pothier,
|
|
@iftex
|
|
Jan P@v{r}ikryl,
|
|
@end iftex
|
|
@ifnottex
|
|
Jan Prikryl,
|
|
@end ifnottex
|
|
Marin Purgar,
|
|
@iftex
|
|
Csaba R@'{a}duly,
|
|
@end iftex
|
|
@ifnottex
|
|
Csaba Raduly,
|
|
@end ifnottex
|
|
Keith Refson,
|
|
Bill Richardson,
|
|
Tyler Riddle,
|
|
Tobias Ringstrom,
|
|
@c Texinfo doesn't grok @'{@i}, so we have to use TeX itself.
|
|
@tex
|
|
Juan Jos\'{e} Rodr\'{\i}guez,
|
|
@end tex
|
|
@ifnottex
|
|
Juan Jose Rodriguez,
|
|
@end ifnottex
|
|
Maciej W. Rozycki,
|
|
Edward J. Sabol,
|
|
Heinz Salzmann,
|
|
Robert Schmidt,
|
|
Nicolas Schodet,
|
|
Andreas Schwab,
|
|
Steven M. Schweda,
|
|
Chris Seawood,
|
|
Dennis Smit,
|
|
Toomas Soome,
|
|
Tage Stabell-Kulo,
|
|
Philip Stadermann,
|
|
Daniel Stenberg,
|
|
Sven Sternberger,
|
|
Markus Strasser,
|
|
John Summerfield,
|
|
Szakacsits Szabolcs,
|
|
Mike Thomas,
|
|
Philipp Thomas,
|
|
Mauro Tortonesi,
|
|
Dave Turner,
|
|
Gisle Vanem,
|
|
Russell Vincent,
|
|
@iftex
|
|
@v{Z}eljko Vrba,
|
|
@end iftex
|
|
@ifnottex
|
|
Zeljko Vrba,
|
|
@end ifnottex
|
|
Charles G Waldman,
|
|
Douglas E. Wegscheid,
|
|
YAMAZAKI Makoto,
|
|
Jasmin Zainul,
|
|
@iftex
|
|
Bojan @v{Z}drnja,
|
|
@end iftex
|
|
@ifnottex
|
|
Bojan Zdrnja,
|
|
@end ifnottex
|
|
Kristijan Zimmer.
|
|
|
|
Apologies to all whom I accidentally left out, and many thanks to all the
|
|
subscribers of the Wget mailing list.
|
|
|
|
@node Copying
|
|
@chapter Copying
|
|
@cindex copying
|
|
@cindex GPL
|
|
@cindex GFDL
|
|
@cindex free software
|
|
|
|
GNU Wget is licensed under the GNU General Public License (GNU GPL),
|
|
which makes it @dfn{free software}. Please note that ``free'' in ``free
|
|
software'' refers to liberty, not price. As some people like to point
|
|
out, it's the ``free'' of ``free speech'', not the ``free'' of ``free
|
|
beer''.
|
|
|
|
The exact and legally binding distribution terms are spelled out below.
|
|
The GPL guarantees that you have the right (freedom) to run and change
|
|
GNU Wget and distribute it to others, and even---if you want---charge
|
|
money for doing any of those things. With these rights comes the
|
|
obligation to distribute the source code along with the software and to
|
|
grant your recipients the same rights and impose the same restrictions.
|
|
|
|
This licensing model is also known as @dfn{open source} because it,
|
|
among other things, makes sure that all recipients will receive the
|
|
source code along with the program, and be able to improve it. The GNU
|
|
project prefers the term ``free software'' for reasons outlined at
|
|
@url{http://www.gnu.org/philosophy/free-software-for-freedom.html}.
|
|
|
|
The exact license terms are defined by this paragraph and the GNU
|
|
General Public License it refers to:
|
|
|
|
@quotation
|
|
GNU Wget is free software; you can redistribute it and/or modify it
|
|
under the terms of the GNU General Public License as published by the
|
|
Free Software Foundation; either version 2 of the License, or (at your
|
|
option) any later version.
|
|
|
|
GNU Wget is distributed in the hope that it will be useful, but WITHOUT
|
|
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
|
|
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
|
|
for more details.
|
|
|
|
A copy of the GNU General Public License is included as part of this
|
|
manual; if you did not receive it, write to the Free Software
|
|
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
|
|
@end quotation
|
|
|
|
In addition to this, this manual is free in the same sense:
|
|
|
|
@quotation
|
|
Permission is granted to copy, distribute and/or modify this document
|
|
under the terms of the GNU Free Documentation License, Version 1.2 or
|
|
any later version published by the Free Software Foundation; with the
|
|
Invariant Sections being ``GNU General Public License'' and ``GNU Free
|
|
Documentation License'', with no Front-Cover Texts, and with no
|
|
Back-Cover Texts. A copy of the license is included in the section
|
|
entitled ``GNU Free Documentation License''.
|
|
@end quotation
|
|
|
|
@c #### Maybe we should wrap these licenses in ifinfo? Stallman says
|
|
@c that the GFDL needs to be present in the manual, and to me it would
|
|
@c suck to include the license for the manual and not the license for
|
|
@c the program.
|
|
|
|
The full texts of the GNU General Public License and of the GNU Free
|
|
Documentation License are available below.
|
|
|
|
@menu
|
|
* GNU General Public License::
|
|
* GNU Free Documentation License::
|
|
@end menu
|
|
|
|
@include gpl.texi
|
|
|
|
@include fdl.texi
|
|
|
|
@node Concept Index
|
|
@unnumbered Concept Index
|
|
@printindex cp
|
|
|
|
@contents
|
|
|
|
@bye
|