- use mmap() to read whole files in core instead of allocating memory
and read'ing it.
- use a new, more general, HTML parser (html-parse.c) and interface to
it from Wget (html-url.c).
- respect <meta name=robots content=nofollow> (easy with the new HTML
parser).
- use hash tables instead of linked lists in places where the lists
were used to facilitate mappings.
- rewrite the code in host.c to be more readable and faster (hash
tables instead of home-grown lists.)
- make convert_links properly convert partial URLs to complete ones
for those URLs that have *not* been downloaded.
- use HTTP persistent connections where available. very
simple-minded, caches the last connection to the server.
Published in <sxshf533d5r.fsf@florida.arsdigita.de>.
files specified on the commandline. Made --convert-links be ignored when
--delete-after is specified. Added note about this fact to --delete-after docs
and made general improvements to them, including the clarification that
--delete-after only deletes local files.
I renamed to "lockable_boolean") in the .wgetrc (currently just passive_ftp).
Wrote documentation for his changes and added the missing "referer" to the
.wgetrc section (making mention of the issue of "referrer" being the correct
spelling).
* ftp.c (ftp_retrieve_list): Use new INFINITE_RECURSION #define.
* html.c: htmlfindurl() now takes final `dash_p_leaf_HTML' parameter.
Wrapped some > 80-column lines. When -p is specified and we're at a
leaf node, do not traverse <A>, <AREA>, or <LINK> tags other than
<LINK REL="stylesheet">.
* html.h (htmlfindurl): Now takes final `dash_p_leaf_HTML' parameter.
* init.c: Added new -p / --page-requisites / page_requisites option.
* main.c (print_help): Clarified that -l inf and -l 0 both allow
infinite recursion. Changed the unhelpful --mirrior description
to simply give the options it's equivalent to. Added new -p option.
(main): Added some comments; handle new -p / --page-requisites.
* options.h (struct options): Added new page_requisites field.
* recur.c: Changed "URL-s" to "URLs" and "HTML-s" to "HTMLs".
Calculate and pass down new `dash_p_leaf_HTML' parameter to
get_urls_html(). Use new INFINITE_RECURSION #define.
* retr.c: Changed "URL-s" to "URLs". get_urls_html() now takes
final `dash_p_leaf_HTML' parameter.
* url.c: get_urls_html() and htmlfindurl() now take final
`dash_p_leaf_HTML' parameter.
* url.h (get_urls_html): Now takes final `dash_p_leaf_HTML' parameter.
* wget.h: Added some comments and new INFINITE_RECURSION #define.
* wget.texi (Recursive Retrieval Options): Documented new -p option.
go through without doing an update first, and I forgot to make the change the
second time. Just changed an erroneous main.c (main) to main.c (print_help).
said that 0 seconds are waited after the first retry, which I believe is
incorrect and does not match what's written elsewhere (e.g. wget.texi). Changed
to 1.
>= width of type" warning on 32-bit architectures. Got rid of it by tricking
the compiler w/ a variable.
* url.c (UNSAFE_CHAR): The macro didn't include all the illegal characters per
RFC1738, namely everything above '~'. It also generated a warning on OSes
where char =~ unsigned char. Fixed.
download a single HTML document and all its constituents.
* po/*.{gmo,po,pot}: Regenerated after adding new options.
* po/hr.po: Hrvoje forgot '\n's on his translations of my altered messages,
causing msgfmt to balk and `make install' to fail.
* wget.texi (Recursive Retrieval Options): In -K description, added a link to
the discussion of interaction with -N.
(Recursive Accept/Reject Options): Did some alphabetizing and added descriptions
of new --follow-tags and -G / --ignore-tags options.
(Following Links): Changed "the loads of" to "loads of".
(Wgetrc Commands): Added descriptions of new follow_tags and ignore_tags
commands.
* html.c (idmatch): Implemented checking of my new --follow-tags and
--ignore-tags options.
* init.c (commands): Added comment reminding people adding new entries doing
allocation to add corresponding freeing in cleanup().
(commands): Added new followtags and ignoretags commands.
(cleanup): Free storage for new followtags and ignoretags.
* main.c: Use of "comma-separated list" was random -- normalized it. Did some
alphabetization. Added comments pointing out "Options without arguments" and
"Options accepting an argument" sections of long_options[]. Added new options
--follow-tags and -G / --ignore-tags. Added comment that Damir's --referer is
currently undocumented. Added comment that Heiko's --waitretry is partially
undocumented (mentioned in --help but not in wget.texi). Moved improperly
sorted 24, 129, and 'G' cases.
* options.h (struct options): Added new fields follow_tags and ignore_tags.
* wget.h: Added "#define EQ 0" so we can say "strcmp(a, b) == EQ".
coded for (downloading StarOffice from Sun's website). He says he doesn't use
wget any more, so he won't be writing a patch that allows downloading that
without breaking anything (such a patch would apparently involve stopping
certain characters in the URL from being escaped).
URLs, gen_page.cgi?page1 and get_page.cgi?page2, they'll both be saved as
get_page.cgi and the second will overwrite the first. Also, parameters to
implicit CGIs, like "http://www.host.com/db/?2000-03-02" cause the URLs to be
printed with trailing garbage characters, and could seg fault. I'm not sure
what Dan had in mind with this patch (no explanatory comments), but I'm removing
it for now. If he can rewrite it so it doesn't break stuff, okay.
Got rid of newly-introduced nested-if warnings in ftp.c and http.c. Fixed
apparently completely untested code in main.c that was trying to provide --wait
/ --waitretry backwards compatibility, but had multiple fundamental bugs.
together, we compare local file X.orig (if extant) against server file X.
Previously -k and -N were worthless in combination because the local converted
files always differed from the server versions.