From c09779f75855fb09ded650479882895aad4bb95b Mon Sep 17 00:00:00 2001
From: Micah Cowan <micah@cowan.name>
Date: Mon, 24 Mar 2008 12:26:37 -0700
Subject: [PATCH] Mention various caveats related to accept/reject lists.

---
 doc/ChangeLog |  6 ++++++
 doc/wget.texi | 43 +++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/doc/ChangeLog b/doc/ChangeLog
index 2f25f5a8..3bca181a 100644
--- a/doc/ChangeLog
+++ b/doc/ChangeLog
@@ -1,3 +1,9 @@
+2008-03-24  Micah Cowan  <micah@cowan.name>
+
+	* wget.texi <Types of Files>: Mention various caveats in the
+	behavior of accept/reject lists; deprecate current
+	always-download-HTML feature.
+
 2008-03-17  Micah Cowan  <micah@cowan.name>
 
 	* wget.texi <Directory-Based Limits>: Mention importance of
diff --git a/doc/wget.texi b/doc/wget.texi
index a4407949..47fb8033 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -2125,8 +2125,47 @@ better fine-tuning of which files to retrieve.  E.g. @samp{wget -A
 a part of their name, but @emph{not} the PostScript files.
 
 Note that these two options do not affect the downloading of @sc{html}
-files; Wget must load all the @sc{html}s to know where to go at
-all---recursive retrieval would make no sense otherwise.
+files (as determined by a @samp{.htm} or @samp{.html} filename
+suffix). This behavior may not be desirable for all users, and may be
+changed in future versions of Wget.
+
+Note, too, that query strings (strings at the end of a URL beginning
+with a question mark, @samp{?}) are not included as part of the
+filename for accept/reject rules, even though these will actually
+contribute to the name chosen for the local file. It is expected that
+a future version of Wget will provide an option to allow matching
+against query strings.
+
+Finally, it's worth noting that the accept/reject lists are matched
+@emph{twice} against downloaded files: once against the URL's filename
+portion, to determine if the file should be downloaded in the first
+place; then, after it has been accepted and successfully downloaded,
+the local file's name is also checked against the accept/reject lists
+to see if it should be removed. The rationale was that, since
+@samp{.htm} and @samp{.html} files are always downloaded regardless of
+accept/reject rules, they should be removed @emph{after} being
+downloaded and scanned for links, if they did match the accept/reject
+lists. However, this can lead to unexpected results, since the local
+filenames can differ from the original URL filenames in the following
+ways, all of which can change whether an accept/reject rule matches:
+
+@itemize @bullet
+@item
+If the local file already exists and @samp{--no-directories} was
+specified, a numeric suffix will be appended to the original name.
+@item
+If @samp{--html-extension} was specified, the local filename will have
+@samp{.html} appended to it. If Wget is invoked with @samp{-E -A.php},
+a filename such as @samp{index.php} will be accepted, but upon
+download will be named @samp{index.php.html}, which no longer matches,
+and so the file will be deleted.
+@item
+Query strings do not contribute to URL matching, but are included in
+local filenames, and so @emph{do} contribute to filename matching.
+@end itemize
+
+This behavior, too, is considered less-than-desirable, and may change
+in a future version of Wget.
 
 @node Directory-Based Limits
 @section Directory-Based Limits