Bugfix: Fix NULL filename and output_stream in Metalink module

* NEWS: Mention the Metalink "path/file" name format handling
* src/metalink.c (retrieve_from_metalink): Fix NULL filename, set
  filename to the right "path/file" value
* src/metalink.c (retrieve_from_metalink): Fix NULL output_stream, set
  output_stream to filename when it is created by retrieve_url()
* src/metalink.c (retrieve_from_metalink): Add RFC5854 comments about
  proper metalink:file "path/file" name format handling
* doc/metalink.txt: Update document. Remove resolved bugs

If unique_create() cannot create/open the destination file, filename
and output_stream remain NULL. If fopen() is used instead, filename
always remains NULL. Both functions cannot create "path/file" trees.

Setting filename to the right value is sufficient to prevent SIGSEGV
generating from testing a NULL value. This also allows retrieve_url()
to create a "path/file" tree through opt.output_document.

Reading NULL as output_stream, when it shall not be, leads to wrong
results. For instance, a non-NULL output_stream tells when a stream
was interrupted, reading NULL instead means to assume the contrary.

This patch conforms to the RFC5854 specification:
  The Metalink Download Description Format
  4.1.2.1.  The "name" Attribute
  https://tools.ietf.org/html/rfc5854#section-4.1.2.1
This commit is contained in:
Matthew White 2016-07-28 17:10:46 +02:00
parent bcb9bf7ae4
commit 96554861f9
3 changed files with 119 additions and 55 deletions

5
NEWS
View File

@ -9,6 +9,11 @@ Please send GNU Wget bug reports to <bug-wget@gnu.org>.
* Changes in Wget X.Y.Z * Changes in Wget X.Y.Z
* When processing a Metalink file, create the parent directories of a
"path/file" destination file name:
https://tools.ietf.org/html/rfc5854#section-4.1.2.1
https://tools.ietf.org/html/rfc5854#section-4.2.8.3
* On a recursive download, append a .tmp suffix to temporary files * On a recursive download, append a .tmp suffix to temporary files
that will be deleted after being parsed, and create them that will be deleted after being parsed, and create them
readable/writable only by the owner. readable/writable only by the owner.

View File

@ -1,24 +1,26 @@
GNU Wget Metalink module (--input-metalink) GNU Wget Metalink module
Evaluation of "Directory Options" on the command line Evaluation of the Metalink/XML and Metalink/HTTP implementations
1. Introduction 1. Introduction
*************** ***************
This document, and the results contained in it, is focused over the This document, and the results contained in it, is focused over the
testing of the metalink:file "path/file" name format. evaluation of the Metalink/XML and Metalink/HTTP implementations.
The "Directory Options" mentioned here are used on the command line in The "Directory Options" mentioned here are used on the command line in
conjunction with the option '--input-metalink=file': conjunction with the option '--input-metalink=file' for Metalink/XML,
and '--metalink-over-http' for Metalink/HTTP.
$ wget --input-metalink=file <directory options> $ wget --input-metalink=<file> [directory options]
$ wget --metalink-over-http [directory options] <url>
2. Notes 2. Notes
******** ********
Tests containing a metalink:file "/path/file", "./path/file", or Tests for metalink:file names beginning with '/', '~/', './', or '../'
"../path/file" name shall be run manually due to security concerns. (e.g. "/path/file") shall be run manually due to security concerns.
3. Metalink files used as reference 3. Metalink files used as reference
*********************************** ***********************************
@ -47,17 +49,30 @@ EOF
4.1 Implemented safety features 4.1 Implemented safety features
=============================== ===============================
Do not follow relative or absolute paths: "/path/file", "./path/file", Any metalink:file name containing an absolute, relative, or home path
and "../path/file" as metalink:file name formats are all ignored (wget (see '2. Notes') parsed from Metalink/XML files is rejected.
refuses to start). The options --trust-server-names changes nothing.
4.2 Actual behaviour This is a libmetalink's design decision implemented in the function
==================== metalink_check_safe_path(). This feature shall not be modified.
Given a metalink:file "path/file" name, if "path" exists, download All the above conform to the RFC5854 standard.
"path/file", then compute its checksum. If "path" doesn't exist,
download the url's file in the working directory; then the checksum References:
fails: cannot find "path/file". https://tools.ietf.org/html/rfc5854#section-4.1.2.1
https://tools.ietf.org/html/rfc5854#section-4.2.8.3
4.2 File download behaviour
===========================
When a Metalink/XML file is parsed:
1. create the metalink:file "path/file" tree;
2. download the metalink:url file as "path/file";
3. verify the "path/file" checksum.
All the above conform to the RFC5854 standard.
References:
https://tools.ietf.org/html/rfc5854
4.3 Questionable behaviours 4.3 Questionable behaviours
=========================== ===========================
@ -69,69 +84,85 @@ If more metalink:file elements are the same, wget downloads them all.
The download is OK even when metalink:file size is wrong. The download is OK even when metalink:file size is wrong.
5. Directory Options 5. `wget --metalink-over-http`
******************************
5.1 Implemented safety features
===============================
The function url_file_name() is responsible of parsing the url's file
name and mixing in the "Directory Options" wrote on the command line.
The use of libmetalink's metalink_check_safe_path() shouldn't be
necessary (see '4.1 Implemented safety features').
All the above comform to the usual Wget's download behaviour.
References:
wget(1)
5.2 File download behaviour
===========================
When a Metalink/HTTP header is parsed:
1. extract metalink metadata from the header;
2. download the file from the mirror with the highest priority;
3. verify the file's checksum.
All the above comform to the usual Wget's download behaviour and to
the RFC6249 standard.
References:
wget(1)
https://tools.ietf.org/html/rfc6249
6. Directory Options
******************** ********************
'-nd' '-nd'
'--no-directories' '--no-directories'
Used alone has no effect (see `wget --input-metalink=test.meta4`). Do not apply to Metalink/XML files (aka --input-metalink=<file>).
Used in conjunction with --recursive, given "path/file", if "path" Apply to Metalink/HTTP urls as described in the Wget's manual, see
exists, download "path/file" and compute its checksum. If "path" wget(1). The target url is the url wrote on the command line.
doesnt' exist, download the url's file in the working directory,
then the checksum fails: cannot find "path/file".
'-x' '-x'
'--force-directories' '--force-directories'
Given "path/file", if "path" exists, download "path/file", then Do not apply to Metalink/XML files (aka --input-metalink=<file>).
compute its checksum. If "path" doesn't exist, create the url
hierarchy, then the checksum fails: cannot find "path/file". Apply to Metalink/HTTP urls as described in the Wget's manual, see
wget(1). The target url is the url wrote on the command line.
'-nH' '-nH'
'--no-host-directories' '--no-host-directories'
Given "path/file", if "path" exists, download "path/file", then Do not apply to Metalink/XML files (aka --input-metalink=<file>).
compute its checksum. If "path" doesn't exist, download the url's
file in the working directory, then the checksum fails: cannot Apply to Metalink/HTTP urls as described in the Wget's manual, see
find "path/file"; in this context, if --force-directories is wget(1). The target url is the url wrote on the command line.
present, create the url hierarchy omitting the host component.
'--protocol-directories' '--protocol-directories'
Used alone has no effect (see `wget --input-metalink=test.meta4`). Do not apply to Metalink/XML files (aka --input-metalink=<file>).
In conjunction with --force-directories, use the protocol name as Apply to Metalink/HTTP urls as described in the Wget's manual, see
the first directory component (see --force-directories). wget(1). The target url is the url wrote on the command line.
'--cut-dirs=number' '--cut-dirs=number'
Used alone has no effect (see `wget --input-metalink=test.meta4`). Do not apply to Metalink/XML files (aka --input-metalink=<file>).
In conjunction with --force-directories, ignore 'number' directory Apply to Metalink/HTTP urls as described in the Wget's manual, see
components after the domain (see --force-directories). wget(1). The target url is the url wrote on the command line.
'-P prefix' '-P prefix'
'--directory-prefix=prefix' '--directory-prefix=prefix'
This is buggy or non-intuitive. Do not apply to Metalink/XML files (aka --input-metalink=<file>).
Given "path/file", and more metalink:url uris for the same file, Apply to Metalink/HTTP downloads.
if '-P path' is specified, the first url's file is downloaded as
"path/<url_file>", and the second url's file as "path/file". The
first file fails the checksum: cannot find "path/file". The file
"path/file" passes the checksum verification.
Given "path/file", and more metalink:url uris for the same file, The directory prefix is the directory where all other files and
if '-P newp' is specified, all the urls' files are downloaded as subdirectories will be saved to, see wget(1).
"newp/<url_file>. A suffix counter is added to the file names to
not overwrite existing files. Then all the checksums fail: cannot
find "path/file".
Given "path/file", and more metalink:url uris for the same file,
if '-P ../path' is specified, the same things as if '-P ../newp'
or '-P newp' will happen, e.g. "newp/<url_file> and checksums
failures.
[write here more wrong things happening]

View File

@ -170,7 +170,26 @@ retrieve_from_metalink (const metalink_t* metalink)
output_stream_regular = true; output_stream_regular = true;
/* Store the real file name for displaying in messages. */ /*
At this point, if output_stream is NULL, the file
couldn't be created/opened.
This happens when the metalink:file has a "path/file"
name format and its directory tree cannot be created:
* stdio.h (fopen)
* src/utils.c (unique_create)
RFC5854 requires a proper "path/file" format handling,
this can be achieved setting opt.output_document while
output_stream is left to NULL:
* src/http.c (open_output_stream): If output_stream is
NULL, create the opt.output_document "path/file"
*/
if (!filename)
filename = xstrdup (mfile->name);
/* Store the real file name for displaying in messages,
and for proper RFC5854 "path/file" handling. */
opt.output_document = filename; opt.output_document = filename;
opt.metalink_over_http = false; opt.metalink_over_http = false;
@ -178,6 +197,15 @@ retrieve_from_metalink (const metalink_t* metalink)
retr_err = retrieve_url (url, mres->url, NULL, NULL, retr_err = retrieve_url (url, mres->url, NULL, NULL,
NULL, NULL, opt.recursive, iri, false); NULL, NULL, opt.recursive, iri, false);
opt.metalink_over_http = _metalink_http; opt.metalink_over_http = _metalink_http;
/*
Bug: output_stream is NULL, but retrieve_url() somehow
created filename.
Bugfix: point output_stream to filename if it exists.
*/
if (!output_stream && file_exists_p (filename))
output_stream = fopen (filename, "ab");
} }
url_free (url); url_free (url);
iri_free (iri); iri_free (iri);