Merge pull request #11280 from fuowang/181113

fuowang: translation completed
Xingyu.Wang 2018-11-20 17:59:18 +08:00 committed by GitHub
commit 20696cbe22
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 119 additions and 121 deletions

@ -1,121 +0,0 @@
fuowang: translation in progress
Archiving web sites
======
I recently took a deep dive into web site archival for friends who were worried about losing control over the hosting of their work online in the face of poor system administration or hostile removal. This makes web site archival an essential instrument in the toolbox of any system administrator. As it turns out, some sites are much harder to archive than others. This article goes through the process of archiving traditional web sites and shows how it falls short when confronted with the latest fashions in the single-page applications that are bloating the modern web.
### Converting simple sites
The days of handcrafted HTML web sites are long gone. Now web sites are dynamic and built on the fly using the latest JavaScript, PHP, or Python framework. As a result, the sites are more fragile: a database crash, spurious upgrade, or unpatched vulnerability might lose data. In my previous life as a web developer, I had to come to terms with the idea that customers expect web sites to basically work forever. This expectation matches poorly with the "move fast and break things" attitude of web development. Working with the [Drupal][2] content-management system (CMS) was particularly challenging in that regard, as major upgrades deliberately break compatibility with third-party modules, which implies a costly upgrade process that clients could seldom afford. The solution was to archive those sites: take a living, dynamic web site and turn it into plain HTML files that any web server can serve forever. This process is useful for your own dynamic sites but also for third-party sites that are outside of your control and that you might want to safeguard.
For simple or static sites, the venerable [Wget][3] program works well. The incantation to mirror a full web site, however, is byzantine:
```
$ nice wget --mirror --execute robots=off --no-verbose --convert-links \
--backup-converted --page-requisites --adjust-extension \
--base=./ --directory-prefix=./ --span-hosts \
--domains=www.example.com,example.com http://www.example.com/
```
The above downloads the content of the web page, but also crawls everything within the specified domains. Before you run this against your favorite site, consider the impact such a crawl might have on the site. The above command line deliberately ignores `robots.txt` rules, as is now [common practice for archivists][4], and hammers the website as fast as it can. Most crawlers have options to pause between hits and limit bandwidth usage to avoid overwhelming the target site.
The above command will also fetch "page requisites" like style sheets (CSS), images, and scripts. The downloaded page contents are modified so that links point to the local copy as well. Any web server can host the resulting file set, which results in a static copy of the original web site.
That is, when things go well. Anyone who has ever worked with a computer knows that things seldom go according to plan; all sorts of things can make the procedure derail in interesting ways. For example, it was trendy for a while to have calendar blocks in web sites. A CMS would generate those on the fly and make crawlers go into an infinite loop trying to retrieve all of the pages. Crafty archivers can resort to regular expressions (e.g. Wget has a `--reject-regex` option) to ignore problematic resources. Another option, if the administration interface for the web site is accessible, is to disable calendars, login forms, comment forms, and other dynamic areas. Once the site becomes static, those will stop working anyway, so it makes sense to remove such clutter from the original site as well.
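As a rough illustration of the two mitigations mentioned above (throttling the crawl and rejecting problematic URLs), here is a minimal Python sketch that assembles a gentler variant of the Wget command. The helper name `build_mirror_command`, the default delay and rate limit, and the `/calendar/` pattern are all assumptions made for the example; `--wait`, `--limit-rate`, and `--reject-regex` are standard Wget options. Unlike the command shown earlier, this variant leaves out `--execute robots=off` and a few output-related flags for brevity.

```python
import shlex


def build_mirror_command(url, domains, wait_seconds=1, rate_limit="200k",
                         reject_regex=None):
    """Return a Wget argument list for a throttled mirror crawl (hypothetical helper)."""
    command = [
        "wget",
        "--mirror",
        "--convert-links",
        "--backup-converted",
        "--page-requisites",
        "--adjust-extension",
        "--span-hosts",
        "--domains=" + ",".join(domains),
        f"--wait={wait_seconds}",      # pause between requests
        f"--limit-rate={rate_limit}",  # cap the download bandwidth
    ]
    if reject_regex:
        # Skip URLs matching the pattern, e.g. endless calendar pages.
        command.append(f"--reject-regex={reject_regex}")
    command.append(url)
    return command


command = build_mirror_command(
    "http://www.example.com/",
    ["www.example.com", "example.com"],
    reject_regex=r"/calendar/",  # assumed URL layout, adjust per site
)
print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run` directly. Suitable values for the delay and the rate limit depend on the target site.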
### JavaScript doom
Unfortunately, some web sites are built with much more than pure HTML. In single-page sites, for example, the web browser builds the content itself by executing a small JavaScript program. A simple user agent like Wget will struggle to reconstruct a meaningful static copy of those sites as it does not support JavaScript at all. In theory, web sites should be using [progressive enhancement][5] to have content and functionality available without JavaScript but those directives are rarely followed, as anyone using plugins like [NoScript][6] or [uMatrix][7] will confirm.
Traditional archival methods sometimes fail in the dumbest way. When trying to build an offsite backup of a local newspaper ([pamplemousse.ca][8]), I found that WordPress adds query strings (e.g. `?ver=1.12.4`) at the end of JavaScript includes. This confuses content-type detection in the web servers that serve the archive, which rely on the file extension to send the right `Content-Type` header. When such an archive is loaded in a web browser, it fails to load scripts, which breaks dynamic websites.
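The extension problem is easy to reproduce. The sketch below assumes the crawl saved a script under a file name that still carries the query string (the name `jquery.js?ver=1.12.4` is made up for illustration) and shows what a server that looks only at the text after the last dot would see. The helpers `extension_on_disk` and `strip_query` are hypothetical and do not come from any of the tools discussed here.

```python
import posixpath


def extension_on_disk(filename):
    """Extension a server sees if it only looks at the text after the last dot."""
    return posixpath.splitext(filename)[1]


def strip_query(filename):
    """Drop everything from the first '?' onward, e.g. '?ver=1.12.4'."""
    return filename.split("?", 1)[0]


saved = "jquery.js?ver=1.12.4"  # hypothetical name of a file saved by the crawl
print(extension_on_disk(saved))               # .4
print(extension_on_disk(strip_query(saved)))  # .js
```

Renaming the files this way is only half a fix, since every page that references them would have to be rewritten too, which is part of why recording the original headers is the more robust approach.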
As the web moves toward using the browser as a virtual machine to run arbitrary code, archival methods relying on pure HTML parsing need to adapt. The solution for such problems is to record (and replay) the HTTP headers delivered by the server during the crawl, and indeed professional archivists use just such an approach.
### Creating and displaying WARC files
At the [Internet Archive][9], Brewster Kahle and Mike Burner designed the [ARC][10] (for "ARChive") file format in 1996 to provide a way to aggregate the millions of small files produced by their archival efforts. The format was eventually standardized as the WARC ("Web ARChive") [specification][11] that was released as an ISO standard in 2009 and revised in 2017. The standardization effort was led by the [International Internet Preservation Consortium][12] (IIPC), which is an "international organization of libraries and other organizations established to coordinate efforts to preserve internet content for the future", according to Wikipedia; it includes members such as the US Library of Congress and the Internet Archive. The latter uses the WARC format internally in its Java-based [Heritrix crawler][13].
A WARC file aggregates multiple resources like HTTP headers, file contents, and other metadata in a single compressed archive. Conveniently, Wget actually supports the file format through its `--warc-file` family of options. Unfortunately, web browsers cannot render WARC files directly, so a viewer or some conversion is necessary to access the archive. The simplest such viewer I have found is [pywb][14], a Python package that runs a simple webserver to offer a Wayback-Machine-like interface to browse the contents of WARC files. The following set of commands will render a WARC file on `http://localhost:8080/`:
```
$ pip install pywb
$ wb-manager init example
$ wb-manager add example crawl.warc.gz
$ wayback
```
This tool was, incidentally, built by the folks behind the [Webrecorder][15] service, which can use a web browser to save dynamic page contents.
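To make the record structure more concrete, here is a simplified Python sketch that serializes one uncompressed WARC `response` record and parses it back. It is an illustration under stated assumptions, not a replacement for a real WARC library: it handles a single record, skips compression, digests, and the other record types, and the function names, the fixed date, and the sample URL are all invented for the example. What it does show is that the captured HTTP status line and headers, including the original `Content-Type`, travel inside the record next to the body.

```python
import uuid


def build_warc_record(target_uri, http_response, date="2018-10-04T00:00:00Z"):
    """Serialize one uncompressed WARC 'response' record as bytes."""
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A record ends with two CRLF pairs after the content block.
    return head + http_response + b"\r\n\r\n"


def parse_warc_record(record):
    """Split a single record back into (header fields, content block)."""
    head, rest = record.split(b"\r\n\r\n", 1)
    lines = head.decode("utf-8").split("\r\n")
    fields = dict(line.split(": ", 1) for line in lines[1:])  # skip version line
    length = int(fields["Content-Length"])
    return fields, rest[:length]


# The captured HTTP response, status line and headers included.
response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: application/javascript\r\n"
    b"\r\n"
    b"console.log('hello');\n"
)
record = build_warc_record("http://www.example.com/jquery.js?ver=1.12.4", response)
fields, block = parse_warc_record(record)
print(fields["WARC-Type"], fields["WARC-Target-URI"])
print(block.split(b"\r\n", 1)[0].decode())
```

A replay tool can serve the stored headers verbatim, so the query string in the URL no longer affects how the browser interprets the script.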
Unfortunately, pywb has trouble loading WARC files generated by Wget because Wget [followed][16] an [inconsistency in the 1.0 specification][17], which was [fixed in the 1.1 specification][18]. Until Wget or pywb fix those problems, WARC files produced by Wget are not reliable enough for my uses, so I have looked at other alternatives. A crawler that got my attention is simply called [crawl][19]. Here is how it is invoked:
```
$ crawl https://example.com/
```
(It does say "very simple" in the README.) The program does support some command-line options, but most of its defaults are sane: it will fetch page requisites from other domains (unless the `-exclude-related` flag is used), but does not recurse out of the domain. By default, it fires up ten parallel connections to the remote site, a setting that can be changed with the `-c` flag. But, best of all, the resulting WARC files load perfectly in pywb.
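The scoping behavior described above can be sketched as a small decision function. This is not taken from crawl's source: the function names are hypothetical, and comparing exact host names is an assumption about what "the domain" means here. The sketch only illustrates the rule that links are followed within the seed's host, while page requisites may come from elsewhere unless that is switched off.

```python
from urllib.parse import urlsplit


def in_scope(url, seed_url):
    """True if url is on the same host as the seed (assumed scoping rule)."""
    return urlsplit(url).hostname == urlsplit(seed_url).hostname


def should_fetch(url, seed_url, is_page_requisite, exclude_related=False):
    """Decide whether a discovered URL gets fetched.

    Links are followed only within the seed's host; requisites such as
    images or scripts may come from other hosts unless exclude_related is set.
    """
    if in_scope(url, seed_url):
        return True
    return is_page_requisite and not exclude_related


seed = "https://example.com/"
print(should_fetch("https://example.com/about", seed, False))               # True
print(should_fetch("https://other.example.org/page", seed, False))          # False
print(should_fetch("https://cdn.example.net/style.css", seed, True))        # True
print(should_fetch("https://cdn.example.net/style.css", seed, True, True))  # False
```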
### Future work and alternatives
There are plenty more [resources][20] for using WARC files. In particular, there's a Wget drop-in replacement called [Wpull][21] that is specifically designed for archiving web sites. It has experimental support for [PhantomJS][22] and [youtube-dl][23] integration that should allow downloading more complex JavaScript sites and streaming multimedia, respectively. The software is the basis for an elaborate archival tool called [ArchiveBot][24], which is used by the "loose collective of rogue archivists, programmers, writers and loudmouths" at [ArchiveTeam][25] in its struggle to "save the history before it's lost forever". It seems that PhantomJS integration does not work as well as the team wants, so ArchiveTeam also uses a rag-tag bunch of other tools to mirror more complex sites. For example, [snscrape][26] will crawl a social media profile to generate a list of pages to send into ArchiveBot. Another tool the team employs is [crocoite][27], which uses the Chrome browser in headless mode to archive JavaScript-heavy sites.
This article would also not be complete without a nod to the [HTTrack][28] project, the "website copier". Working similarly to Wget, HTTrack creates local copies of remote web sites but unfortunately does not support WARC output. Its interactive aspects might be of more interest to novice users unfamiliar with the command line.
In the same vein, during my research I found a full rewrite of Wget called [Wget2][29] that has support for multi-threaded operation, which might make it faster than its predecessor. It is [missing some features][30] from Wget, however, most notably reject patterns, WARC output, and FTP support but adds RSS, DNS caching, and improved TLS support.
Finally, my personal dream for these kinds of tools would be to have them integrated with my existing bookmark system. I currently keep interesting links in [Wallabag][31], a self-hosted "read it later" service designed as a free-software alternative to [Pocket][32] (now owned by Mozilla). But Wallabag, by design, creates only a "readable" version of the article instead of a full copy. In some cases, the "readable version" is actually [unreadable][33] and Wallabag sometimes [fails to parse the article][34]. Instead, other tools like [bookmark-archiver][35] or [reminiscence][36] save a screenshot of the page along with full HTML but, unfortunately, no WARC file that would allow an even more faithful replay.
The sad truth of my experiences with mirrors and archival is that data dies. Fortunately, amateur archivists have tools at their disposal to keep interesting content alive online. For those who do not want to go through that trouble, the Internet Archive seems to be here to stay and Archive Team is obviously [working on a backup of the Internet Archive itself][37].
--------------------------------------------------------------------------------
via: https://anarc.at/blog/2018-10-04-archiving-web-sites/
Author: [Anarcat][a]
Topic selection: [lujun9972](https://github.com/lujun9972)
Translator: [translator ID](https://github.com/译者ID)
Proofreader: [proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
[a]: https://anarc.at
[1]: https://anarc.at/blog
[2]: https://drupal.org
[3]: https://www.gnu.org/software/wget/
[4]: https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
[5]: https://en.wikipedia.org/wiki/Progressive_enhancement
[6]: https://noscript.net/
[7]: https://github.com/gorhill/uMatrix
[8]: https://pamplemousse.ca/
[9]: https://archive.org
[10]: http://www.archive.org/web/researcher/ArcFileFormat.php
[11]: https://iipc.github.io/warc-specifications/
[12]: https://en.wikipedia.org/wiki/International_Internet_Preservation_Consortium
[13]: https://github.com/internetarchive/heritrix3/wiki
[14]: https://github.com/webrecorder/pywb
[15]: https://webrecorder.io/
[16]: https://github.com/webrecorder/pywb/issues/294
[17]: https://github.com/iipc/warc-specifications/issues/23
[18]: https://github.com/iipc/warc-specifications/pull/24
[19]: https://git.autistici.org/ale/crawl/
[20]: https://archiveteam.org/index.php?title=The_WARC_Ecosystem
[21]: https://github.com/chfoo/wpull
[22]: http://phantomjs.org/
[23]: http://rg3.github.io/youtube-dl/
[24]: https://www.archiveteam.org/index.php?title=ArchiveBot
[25]: https://archiveteam.org/
[26]: https://github.com/JustAnotherArchivist/snscrape
[27]: https://github.com/PromyLOPh/crocoite
[28]: http://www.httrack.com/
[29]: https://gitlab.com/gnuwget/wget2
[30]: https://gitlab.com/gnuwget/wget2/wikis/home
[31]: https://wallabag.org/
[32]: https://getpocket.com/
[33]: https://github.com/wallabag/wallabag/issues/2825
[34]: https://github.com/wallabag/wallabag/issues/2914
[35]: https://pirate.github.io/bookmark-archiver/
[36]: https://github.com/kanishka-linux/reminiscence
[37]: http://iabak.archiveteam.org

@ -0,0 +1,119 @@
Archiving web sites
======
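I recently dug deep into web site archival because some friends were worried about losing control over the hosting of their online work in the face of poor system administration or hostile removal. That makes web site archival an essential tool in any system administrator's toolbox. As it turns out, some sites are much harder to archive than others. This article walks through the process of archiving traditional web sites and shows where that process falls short when faced with the single-page applications that are the latest fashion on the modern web.
### Converting simple sites
The days of hand-built HTML web sites are long gone. Today's sites are dynamic and built on the fly with the latest JavaScript, PHP, or Python framework. As a result, they are more fragile: a database crash, a botched upgrade, or an unpatched vulnerability can all lose data. When I was a web developer, I had to accept that customers expect web sites to basically work forever. That expectation fits poorly with the "move fast and break things" attitude of web development. Working with the [Drupal][2] content-management system (CMS) was especially challenging in this respect, because major upgrades break compatibility with third-party modules, which means a costly upgrade process that clients can seldom afford. The solution was to archive those sites: take a living, dynamic web site and turn it into plain HTML files that any web server can serve forever. This process is useful for your own dynamic sites, and also for third-party sites that you want to protect but do not control.
For simple static sites, the venerable [Wget][3] program does the job. The incantation for mirroring a full web site, however, is byzantine: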
```
$ nice wget --mirror --execute robots=off --no-verbose --convert-links \
--backup-converted --page-requisites --adjust-extension \
--base=./ --directory-prefix=./ --span-hosts \
--domains=www.example.com,example.com http://www.example.com/
```
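The command above downloads the content of the web page, but also crawls everything within the specified domains. Before you run it against your favorite site, consider the impact such a crawl might have on the site. The command deliberately ignores `robots.txt` rules, as is now [common practice for archivists][4], and fetches the site as fast as it can. Most crawlers have options to pause between requests and to limit bandwidth usage, so as to avoid overwhelming the target site.
The command above also fetches "page requisites" (translator's note: all the elements a single page needs) such as style sheets (CSS), images, and scripts. The downloaded page contents are modified so that links point to the local copy as well. Any web server can host the resulting file set, yielding a static copy of the original web site.
That is how it goes when everything works. Anyone who has used a computer knows that things seldom go according to plan; all sorts of things can derail the procedure in interesting ways. For example, it was fashionable for a while to put calendar blocks on web sites. A CMS generates those on the fly, which sends crawlers into an infinite loop as they try to retrieve every page. Crafty archivists can use regular expressions (Wget has a `--reject-regex` option, for example) to ignore problematic resources. Another option, if the site's administration interface is accessible, is to disable calendars, login forms, comment forms, and other dynamic areas. Once the site becomes static, those areas will stop working anyway, so it makes sense to remove such clutter from the original site as well.
### JavaScript doom
Unfortunately, some web sites are built with much more than pure HTML. In single-page sites, for example, the web browser builds the content itself by executing a small JavaScript program. A simple user agent like Wget will struggle to reconstruct a meaningful static copy of such sites, because it does not support JavaScript at all. In theory, web sites should use [progressive enhancement][5] so that content and functionality are available without JavaScript, but those guidelines are rarely followed, as anyone using plugins like [NoScript][6] or [uMatrix][7] can confirm.
Traditional archival methods sometimes fail in the dumbest way. While trying to build a backup of a local newspaper's site ([pamplemousse.ca][8]), I found that WordPress appends query strings (e.g. `?ver=1.12.4`) to its JavaScript includes. This confuses content-type detection in the web servers that serve the archive, because they rely on the file extension to send the right `Content-Type` header. When such an archive is loaded in a web browser, the scripts fail to load, which breaks dynamic web sites.
As the web moves toward using the browser as a virtual machine for running arbitrary code, archival methods that rely on pure HTML parsing need to adapt. The solution is to record (and replay) the HTTP headers delivered by the server during the crawl, and professional archivists do in fact use this approach.
### Creating and displaying WARC files
At the [Internet Archive][9], Brewster Kahle and Mike Burner designed the [ARC][10] (for "ARChive") file format in 1996 to provide a way to aggregate the millions of small files produced by their archival work. The format was eventually standardized as the WARC ("Web ARChive") [specification][11], which was released as an ISO standard in 2009 and revised in 2017. The standardization effort was led by the [International Internet Preservation Consortium][12] (IIPC), which, according to Wikipedia, is an "international organization of libraries and other organizations established to coordinate efforts to preserve internet content for the future"; its members include the US Library of Congress and the Internet Archive. The latter uses the WARC format internally in its Java-based [Heritrix crawler][13] (translator's note: a web crawler program).
A WARC file aggregates multiple resources, such as HTTP headers, file contents, and other metadata, in a single compressed archive. Conveniently, Wget supports the WARC format through its `--warc-file` family of options. Unfortunately, web browsers cannot render WARC files directly, so a viewer or some conversion is needed to access the archive. The simplest viewer I have found is [pywb][14], a Python package that runs a simple web server offering a Wayback-Machine-like interface for browsing the contents of WARC files. The following commands render a WARC file on `http://localhost:8080/`: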
```
$ pip install pywb
$ wb-manager init example
$ wb-manager add example crawl.warc.gz
$ wayback
```
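Incidentally, this tool was built by the people behind the [Webrecorder][15] service, which can use a web browser to save dynamic page contents.
Unfortunately, pywb has trouble loading WARC files generated by Wget, because Wget [followed][16] an [inconsistency in the 1.0 specification][17] that was [fixed in the 1.1 specification][18]. Until Wget or pywb fix those problems, WARC files produced by Wget are not reliable enough for my uses, so I looked at alternatives. One crawler that caught my attention is simply called [crawl][19]. Here is how it is invoked: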
```
$ crawl https://example.com/
```
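(Its README does say "very simple".) The program does support some command-line options, but most of its defaults are sane: it fetches page requisites from other domains (unless the `-exclude-related` flag is used), but does not recurse out of the domain. By default, it opens ten parallel connections to the remote site, a setting that can be changed with the `-c` flag. Best of all, the resulting WARC files load perfectly in pywb.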
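### Future work and alternatives
There are plenty more [resources][20] on using WARC files. In particular, there is a drop-in replacement for Wget called [Wpull][21] that is designed specifically for archiving web sites. It has experimental support for [PhantomJS][22] and [youtube-dl][23] integration, which should allow downloading more complex JavaScript sites and streaming multimedia, respectively. The software is the basis of an elaborate archival tool called [ArchiveBot][24], which is used by the "loose collective of rogue archivists, programmers, writers and loudmouths" at [ArchiveTeam][25] in its effort to "save the history before it's lost forever". The PhantomJS integration does not seem to work as well as the team would like, so ArchiveTeam also uses a rag-tag collection of other tools to mirror more complex sites. For example, [snscrape][26] crawls a social media profile to generate a list of pages to send to ArchiveBot. Another tool the team uses is [crocoite][27], which runs the Chrome browser in headless mode to archive JavaScript-heavy sites.
This article would not be complete without a mention of the [HTTrack][28] project, the "website copier". Working much like Wget, HTTrack creates local copies of remote web sites, but unfortunately it does not support WARC output. Its interactive features may be of more interest to novice users who are unfamiliar with the command line.
Likewise, during my research I found a full rewrite of Wget called [Wget2][29], which supports multi-threaded operation and may therefore be faster than its predecessor. Compared with Wget, it is [missing some features][30], most notably reject patterns, WARC output, and FTP support, but it adds RSS, DNS caching, and improved TLS support.
Finally, my personal dream for these kinds of tools is to have them integrated with my existing bookmark system. I currently keep interesting links in [Wallabag][31], a self-hosted "read it later" service designed as a free-software alternative to [Pocket][32] (now owned by Mozilla). But Wallabag, by design, keeps only a "readable" version of an article rather than a full copy. In some cases, the "readable version" is actually [unreadable][33], and Wallabag sometimes [fails to parse the article][34]. By contrast, tools like [bookmark-archiver][35] or [reminiscence][36] save a screenshot of the page along with the full HTML but, unfortunately, no WARC file that would allow an even more faithful replay.
The sad truth of my experience with mirrors and archival is that data dies. Fortunately, amateur archivists have tools at their disposal to keep interesting content alive online. For those who do not want to go to that trouble, the Internet Archive seems to be here to stay, and Archive Team is evidently [working on a backup of the Internet Archive itself][37].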
--------------------------------------------------------------------------------
via: https://anarc.at/blog/2018-10-04-archiving-web-sites/
Author: [Anarcat][a]
Topic selection: [lujun9972](https://github.com/lujun9972)
Translator: [fuowang](https://github.com/fuowang)
Proofreader: [proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
[a]: https://anarc.at
[1]: https://anarc.at/blog
[2]: https://drupal.org
[3]: https://www.gnu.org/software/wget/
[4]: https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
[5]: https://en.wikipedia.org/wiki/Progressive_enhancement
[6]: https://noscript.net/
[7]: https://github.com/gorhill/uMatrix
[8]: https://pamplemousse.ca/
[9]: https://archive.org
[10]: http://www.archive.org/web/researcher/ArcFileFormat.php
[11]: https://iipc.github.io/warc-specifications/
[12]: https://en.wikipedia.org/wiki/International_Internet_Preservation_Consortium
[13]: https://github.com/internetarchive/heritrix3/wiki
[14]: https://github.com/webrecorder/pywb
[15]: https://webrecorder.io/
[16]: https://github.com/webrecorder/pywb/issues/294
[17]: https://github.com/iipc/warc-specifications/issues/23
[18]: https://github.com/iipc/warc-specifications/pull/24
[19]: https://git.autistici.org/ale/crawl/
[20]: https://archiveteam.org/index.php?title=The_WARC_Ecosystem
[21]: https://github.com/chfoo/wpull
[22]: http://phantomjs.org/
[23]: http://rg3.github.io/youtube-dl/
[24]: https://www.archiveteam.org/index.php?title=ArchiveBot
[25]: https://archiveteam.org/
[26]: https://github.com/JustAnotherArchivist/snscrape
[27]: https://github.com/PromyLOPh/crocoite
[28]: http://www.httrack.com/
[29]: https://gitlab.com/gnuwget/wget2
[30]: https://gitlab.com/gnuwget/wget2/wikis/home
[31]: https://wallabag.org/
[32]: https://getpocket.com/
[33]: https://github.com/wallabag/wallabag/issues/2825
[34]: https://github.com/wallabag/wallabag/issues/2914
[35]: https://pirate.github.io/bookmark-archiver/
[36]: https://github.com/kanishka-linux/reminiscence
[37]: http://iabak.archiveteam.org