[Translation complete]-20180129 Parsing HTML with Python.md

This commit is contained in:
Flowsnow 2018-03-13 10:27:09 +08:00
parent 0a244b1a27
commit 57772c949e
2 changed files with 219 additions and 214 deletions

@@ -1,214 +0,0 @@
translating by Flowsnow
Parsing HTML with Python
======
![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/bus_html_code.png?itok=VjUmGsnl)
Image by : Jason Baker for Opensource.com.
As a long-time member of the documentation team at Scribus, I keep up-to-date with the latest updates of the source so I can help make updates and additions to the documentation. When I recently did a "checkout" using Subversion on a computer I had just upgraded to Fedora 27, I was amazed at how long it took to download the documentation, which consists of HTML pages and associated images. I became concerned that the project's documentation seemed much larger than it should be and suspected that some of the content was "zombie" documentation--HTML files that aren't used anymore and images that have lost all references in the currently used HTML.
I decided to create a project for myself to figure this out. One way to do this is to search for existing image files that aren't used. If I could scan through all the HTML files for image references, then compare that list to the actual image files, chances are I would see a mismatch.
Here is a typical image tag:
```
<img src="images/edit_shapes.png" ALT="Edit examples" ALIGN=left>
```
I'm interested in the part between the first set of quotation marks, after `src=`. After some searching for a solution, I found a Python module called [BeautifulSoup][1]. The tasty part of the script I wrote looks like this:
```
soup = BeautifulSoup(all_text, 'html.parser')
match = soup.findAll("img")
if len(match) > 0:
    for m in match:
        imagelist.append(str(m))
```
We can use this `findAll` method to pluck out the image tags. Here is a tiny piece of the output:
```
<img src="images/pdf-form-ht3.png"/><img src="images/pdf-form-ht4.png"/><img src="images/pdf-form-ht5.png"/><img src="images/pdf-form-ht6.png"/><img align="middle" alt="GSview - Advanced Options Panel" src="images/gsadv1.png" title="GSview - Advanced Options Panel"/><img align="middle" alt="Scribus External Tools Preferences" src="images/gsadv2.png" title="Scribus External Tools Preferences"/>
```
So far, so good. I thought that the next step might be to just carve this down, but when I tried some string methods in the script, it returned errors about this being tags and not strings. I saved the output to a file and went through the process of editing in [KWrite][2]. One nice thing about KWrite is that you can do a "find & replace" using regular expressions (regex), so I could replace `<img` with `\n<img`, which made it easier to see how to carve this down from there. Another nice thing with KWrite is that, if you make an injudicious choice with regex, you can undo it.
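The error about tags versus strings suggests reading the `src` attribute directly rather than slicing the printed tag. The sketch below illustrates that idea with the standard library's `html.parser` module instead of BeautifulSoup, a swapped-in technique chosen only so the example has no third-party dependency. The class name `ImgSrcCollector` and the sample HTML are made up for illustration and are not part of the scripts described here.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs. Self-closing tags such as <img .../> also
        # end up here through the default handle_startendtag.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Made-up sample input modelled on the tags shown above.
sample_html = (
    '<p><img src="images/edit_shapes.png" ALT="Edit examples" ALIGN=left>'
    '<img align="middle" src="images/gsadv1.png"/></p>'
)

collector = ImgSrcCollector()
collector.feed(sample_html)
collector.close()
image_sources = collector.sources
print(image_sources)
```

Because the parser hands over attribute values already separated from the tag, no string carving is needed afterwards.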
But I thought, surely there is something better than this, so I turned to regex, or more specifically the `re` module for Python. The relevant part of this new script looks like this:
```
match = re.findall(r'src="(.*)/>', all_text)
if len(match)>0:
    for m in match:
        imagelist.append(m)
```
And a tiny piece of its output looks like this:
```
images/cmcanvas.png" title="Context Menu for the document canvas" alt="Context Menu for the document canvas" /></td></tr></table><br images/eps-imp1.png" title="EPS preview in a file dialog" alt="EPS preview in a file dialog" images/eps-imp5.png" title="Colors imported from an EPS file" alt="Colors imported from an EPS file" images/eps-imp4.png" title="EPS font substitution" alt="EPS font substitution" images/eps-imp2.png" title="EPS import progress" alt="EPS import progress" images/eps-imp3.png" title="Bitmap conversion failure" alt="Bitmap conversion failure"
```
At first glance, it looks similar to the output above, and has the nice feature of trimming out parts of the image tag, but there are puzzling inclusions of table tags and other content. I think this relates to the regex `src="(.*)/>`, which is termed greedy: `.*` matches as much as it can, so the match runs to the last `/>` on a line rather than stopping at the first one it encounters. I should add that I also tried `src="(.*)"`, which was really no better. Not being a regexpert (just made this up), my searching around for various ideas to improve this didn't help.
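For what it is worth, a common way around greedy matching is either a non-greedy quantifier (`.*?`) or a character class that cannot cross a quotation mark (`[^"]*`). The following is a minimal sketch on made-up sample HTML, not one of the scripts described here, and it assumes every `src` value is wrapped in double quotes.

```python
import re

# Made-up sample line with two image tags inside table cells.
sample_html = (
    '<td><img src="images/cmcanvas.png" title="Context Menu" alt="Context Menu" /></td>'
    '<td><img src="images/eps-imp1.png" title="EPS preview" /></td>'
)

# Greedy: .* runs on to the LAST "/>" in the line, swallowing everything between.
greedy = re.findall(r'src="(.*)/>', sample_html)

# Non-greedy: .*? stops at the first closing quotation mark it can.
lazy = re.findall(r'src="(.*?)"', sample_html)

# Negated class: the value may contain anything except a quotation mark.
negated = re.findall(r'src="([^"]*)"', sample_html)

print(len(greedy))
print(lazy)
print(negated)
```

On this sample the greedy pattern yields a single overlong match, while the other two patterns each yield the two filenames.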
After a series of other things, even trying out `HTML::Parser` with Perl, I finally tried to compare this to the situation of some scripts that I wrote for Scribus that analyze the contents of a text frame, character by character, then take some action. For my purposes, what I finally came up with improves on all these methods and requires no regex or HTML parser at all. Let's go back to that example `img` tag I showed.
```
<img src="images/edit_shapes.png" ALT="Edit examples" ALIGN=left>
```
I decided to home in on the `src=` piece. One way would be to wait for an occurrence of `s`, then see if the next character is `r`, the next `c`, and the next `=`. If so, bingo! Then what follows between two sets of double quotation marks is what I need. The problem with this is the structure it takes to hang onto these. One way of looking at a string of characters representing a line of HTML text would be:
```
for c in all_text:
```
But the logic was just too messy to hang onto the previous `c`, and the one before that, the one before that, and the one before that.
In the end, I decided to focus on the `=` and to use an indexing method whereby I could easily reference any prior or future character in the string. Here is the searching part:
```
index = 3
while index < linelength:
    if (all_text[index] == '='):
        if (all_text[index-3] == 's') and (all_text[index-2] == 'r') and (all_text[index-1] == 'c'):
            imagefound(all_text, imagelist, index)
            index += 1
        else:
            index += 1
    else:
        index += 1
```
I start the search with the fourth character (indexing starts at 0), so I don't get an indexing error down below, and realistically, there will not be an equal sign before the fourth character of a line. The first test is to see if we find `=` as we're marching through the string, and if not, we march on. If we do see one, then we ask if the three previous characters were `s`, `r`, and `c`, in that order. If that happens, we call the function `imagefound`:
```
def imagefound(all_text, imagelist, index):
    end = 0
    index += 2
    newimage = ''
    while end == 0:
        if (all_text[index] != '"'):
            newimage = newimage + all_text[index]
            index += 1
        else:
            newimage = newimage + '\n'
            imagelist.append(newimage)
            end = 1
    return
```
We're sending the function the current index, which represents the `=`. We know the next character will be `"`, so we jump two characters and begin adding characters to a holding string named `newimage`, until we reach the following `"`, at which point we're done. We add the string plus a `newline` character to our list `imagelist` and `return`, keeping in mind there may be more image tags in this remaining string of HTML, so we're right back in the middle of our searching loop.
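The same index-based search can also be sketched with the built-in `str.find`, which does the character-by-character marching internally. This is an alternative illustration, not the script above: the function name `find_src_values` is hypothetical, and unlike `imagefound` it only matches `src=` when a double quotation mark follows immediately, and it stops quietly if a closing quotation mark is missing.

```python
def find_src_values(text):
    """Return the double-quoted values that follow each 'src=' in text."""
    found = []
    marker = 'src="'
    start = text.find(marker)
    while start != -1:
        value_start = start + len(marker)
        value_end = text.find('"', value_start)
        if value_end == -1:
            break  # unterminated quote: stop instead of running off the end
        found.append(text[value_start:value_end])
        # Resume searching just past the closing quotation mark.
        start = text.find(marker, value_end + 1)
    return found


# Made-up sample input built from tags like those shown above.
sample = (
    '<img src="images/edit_shapes.png" ALT="Edit examples" ALIGN=left>'
    '<img src="images/gimpoptions1.png">'
)
src_values = find_src_values(sample)
print(src_values)
```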
Here's what our output looks like now:
```
images/text-frame-link.png
images/text-frame-unlink.png
images/gimpoptions1.png
images/gimpoptions3.png
images/gimpoptions2.png
images/fontpref3.png
images/font-subst.png
images/fontpref2.png
images/fontpref1.png
images/dtp-studio.png
```
Ahhh, much cleaner, and this only took a few seconds to run. I could have jumped seven more index spots to cut out the `images/` part, but I like having it there to make sure I haven't chopped off the first letter of the image filename, and this is so easy to edit out with KWrite--you don't even need regex. After doing that and saving the file, the next step was to run another script I wrote called `sortlist.py`:
```
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# sortlist.py
import os
imagelist = []
# Iterating over the file object yields one line at a time; this replaces
# xreadlines(), which exists only in Python 2.
for line in open('/tmp/imagelist_parse4.txt'):
    imagelist.append(line)
imagelist.sort()
outfile = open('/tmp/imagelist_parse4_sorted.txt', 'w')
outfile.writelines(imagelist)
outfile.close()
```
This pulls in the file contents as a list, sorts it, then saves it as another file. After that I could just do the following:
```
ls /home/gregp/development/Scribus15x/doc/en/images/*.png > '/tmp/actual_images.txt'
```
Then I need to run `sortlist.py` on that file too, since `ls` sorts differently from Python. I could have run a comparison script on these files, but I preferred to do this visually. In the end, I found 42 images that had no HTML reference from the documentation.
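A comparison script of the kind mentioned could be sketched with a set difference. This is only an illustration under assumptions: the function name `unreferenced_images` and the third sample filename are made up, and the lists stand in for the lines of the two text files. Comparing base names sidesteps the fact that one list holds relative `images/...` references and the other holds full paths from `ls`.

```python
import os


def unreferenced_images(referenced_lines, actual_paths):
    """Return sorted base names that exist on disk but are never referenced."""
    referenced = {
        os.path.basename(line.strip()) for line in referenced_lines if line.strip()
    }
    actual = {
        os.path.basename(path.strip()) for path in actual_paths if path.strip()
    }
    return sorted(actual - referenced)


# Stand-ins for the contents of the two files written to /tmp.
referenced_lines = ["images/fontpref1.png\n", "images/fontpref2.png\n"]
actual_paths = [
    "/home/gregp/development/Scribus15x/doc/en/images/fontpref1.png\n",
    "/home/gregp/development/Scribus15x/doc/en/images/fontpref2.png\n",
    "/home/gregp/development/Scribus15x/doc/en/images/unused-example.png\n",
]
orphans = unreferenced_images(referenced_lines, actual_paths)
print(orphans)
```

With sets, neither list needs to be sorted first; sorting is applied only to make the result easy to read.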
Here is my parsing script in its entirety:
```
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# parseimg4.py
import os
def imagefound(all_text, imagelist, index):
    end = 0
    index += 2
    newimage = ''
    while end == 0:
        if (all_text[index] != '"'):
            newimage = newimage + all_text[index]
            index += 1
        else:
            newimage = newimage + '\n'
            imagelist.append(newimage)
            end = 1
    return
htmlnames = []
imagelist = []
tempstring = ''
filenames = os.listdir('/home/gregp/development/Scribus15x/doc/en/')
for name in filenames:
    if name.endswith('.html'):
        htmlnames.append(name)
#print htmlnames
for htmlfile in htmlnames:
    all_text = open('/home/gregp/development/Scribus15x/doc/en/' + htmlfile).read()
    linelength = len(all_text)
    index = 3
    while index < linelength:
        if (all_text[index] == '='):
            if (all_text[index-3] == 's') and (all_text[index-2] == 'r') and (all_text[index-1] == 'c'):
                imagefound(all_text, imagelist, index)
                index += 1
            else:
                index += 1
        else:
            index += 1
outfile = open('/tmp/imagelist_parse4.txt', 'w')
outfile.writelines(imagelist)
outfile.close()
imageno = len(imagelist)
# Parenthesized so the line runs under both Python 2 and Python 3.
print(str(imageno) + " images were found and saved")
Its name, `parseimg4.py`, doesn't really reflect the number of scripts I wrote along the way, with both minor and major rewrites, plus discards and starting over. Notice that I've hardcoded these directory and filenames, but it would be easy enough to generalize, asking for user input for these pieces of information. Also as they were working scripts, I sent the output to `/tmp`, so they disappear once I reboot my system.
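One way to generalize the hardcoded paths is to take them from the command line. The sketch below uses the standard `argparse` module, which is a substitution for prompting the user interactively. The argument names `docdir` and `--output` are hypothetical, and the defaults simply reuse the paths from the script above.

```python
import argparse


def build_parser():
    """Build a parser for the two values that parseimg4.py hardcodes."""
    parser = argparse.ArgumentParser(
        description="List image references found in HTML files."
    )
    parser.add_argument("docdir", help="directory containing the .html files")
    parser.add_argument(
        "-o",
        "--output",
        default="/tmp/imagelist_parse4.txt",
        help="file that receives the list of image references",
    )
    return parser


# An explicit argument list is passed here so the sketch runs as-is;
# a real script would call parse_args() with no arguments.
args = build_parser().parse_args(["/home/gregp/development/Scribus15x/doc/en/"])
print(args.docdir)
print(args.output)
```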
This wasn't the end of the story, since the next question was: What about zombie HTML files? Any of these files that are not used might reference images not picked up by the previous method. We have a `menu.xml` file that serves as the table of contents for the online manual, but I also needed to consider that some files listed in the TOC might reference files not in the TOC, and yes, I did find some.
I'll conclude by saying that this was a simpler task than this image search, and it was greatly helped by the processes I had already developed.
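The zombie-HTML question amounts to asking which pages can be reached from the table of contents by following links. Here is a minimal sketch of that reachability check. It does not parse `menu.xml`, whose structure is not described here; the page names, the `links` mapping, and the function name `reachable_pages` are all invented for illustration.

```python
from collections import deque


def reachable_pages(toc_pages, links):
    """Return every page reachable from the TOC by following links."""
    seen = set(toc_pages)
    queue = deque(toc_pages)
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical data: pages listed in the TOC, and which pages each page links to.
toc_pages = ["index.html", "fonts.html"]
links = {
    "index.html": ["fonts.html"],
    "fonts.html": ["fontpref.html"],  # linked from a TOC page, but not in the TOC
    "orphan.html": ["index.html"],    # nothing links to this page
}
all_pages = set(links) | {"fontpref.html"}
zombies = sorted(all_pages - reachable_pages(toc_pages, links))
print(zombies)
```

A page that is absent from the TOC but linked from a TOC page counts as live, which matches the case noted above of TOC files referencing files outside the TOC.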
### About the author
[![](https://opensource.com/sites/default/files/styles/profile_pictures/public/20150529_gregp.jpg?itok=nv02g6PV)][7] Greg Pittman - Greg is a retired neurologist in Louisville, Kentucky, with a long-standing interest in computers and programming, beginning with Fortran IV in the 1960s. When Linux and open source software came along, it kindled a commitment to learning more, and eventually contributing. He is a member of the Scribus Team. [More about me][8]
--------------------------------------------------------------------------------
via: https://opensource.com/article/18/1/parsing-html-python
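Author: [Greg Pittman][a]
Translator: [Translator ID](https://github.com/译者ID)
Proofreader: [Proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China](https://linux.cn/)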
[a]:https://opensource.com/users/greg-p
[1]:https://www.crummy.com/software/BeautifulSoup/
[2]:https://www.kde.org/applications/utilities/kwrite/
[7]:https://opensource.com/users/greg-p
[8]:https://opensource.com/users/greg-p

@@ -0,0 +1,219 @@
Parsing HTML with Python
======
![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/bus_html_code.png?itok=VjUmGsnl)
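Image by: Jason Baker for Opensource.com.
As a long-time member of the Scribus documentation team, I keep up to date with the latest source updates so that I can update and add to the documentation. When I recently did a "checkout" with Subversion on a computer I had just upgraded to Fedora 27, I was surprised by how long it took to download the documentation, which consists of HTML pages and associated images. I became concerned that the project's documentation seemed much larger than it should be, and I suspected that some of the content was "zombie" documentation: HTML files that are no longer used and images that are no longer referenced by the HTML currently in use.
I decided to create a project for myself to figure this out. One approach is to search for existing image files that are not used. If I could scan all the HTML files for image references and then compare that list with the actual image files, I would probably see a mismatch.
Here is a typical image tag: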
```
<img src="images/edit_shapes.png" ALT="Edit examples" ALIGN=left>
```
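I'm interested in the part between the first set of quotation marks, after `src=`. After looking for a solution, I found a Python module called [BeautifulSoup][1]. The core part of the script I wrote looks like this:
```
soup = BeautifulSoup(all_text, 'html.parser')
match = soup.findAll("img")
if len(match) > 0:
    for m in match:
        imagelist.append(str(m))
```
We can use this `findAll` method to pluck out the image tags. Here is a small piece of the output: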
```
<img src="images/pdf-form-ht3.png"/><img src="images/pdf-form-ht4.png"/><img src="images/pdf-form-ht5.png"/><img src="images/pdf-form-ht6.png"/><img align="middle" alt="GSview - Advanced Options Panel" src="images/gsadv1.png" title="GSview - Advanced Options Panel"/><img align="middle" alt="Scribus External Tools Preferences" src="images/gsadv2.png" title="Scribus External Tools Preferences"/>
```
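So far, so good. I thought the next step would be simply to carve this down, but when I tried some string methods in the script, it returned errors saying that these were tags, not strings. I saved the output to a file and edited it in [KWrite][2]. One nice thing about KWrite is that you can do a "find & replace" using regular expressions (regex), so I could replace `<img` with `\n<img`, which made it easier to see how to carve this down from there. Another nice thing about KWrite is that if you make an injudicious choice with a regex, you can undo it.
But I thought there must be something better than this, so I turned to regex, or more specifically Python's `re` module. The relevant part of this new script looks like this:
```
match = re.findall(r'src="(.*)/>', all_text)
if len(match)>0:
    for m in match:
        imagelist.append(m)
```
A small piece of its output looks like this: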
```
images/cmcanvas.png" title="Context Menu for the document canvas" alt="Context Menu for the document canvas" /></td></tr></table><br images/eps-imp1.png" title="EPS preview in a file dialog" alt="EPS preview in a file dialog" images/eps-imp5.png" title="Colors imported from an EPS file" alt="Colors imported from an EPS file" images/eps-imp4.png" title="EPS font substitution" alt="EPS font substitution" images/eps-imp2.png" title="EPS import progress" alt="EPS import progress" images/eps-imp3.png" title="Bitmap conversion failure" alt="Bitmap conversion failure"
```
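At first glance, it looks similar to the output above, and it has the added benefit of trimming parts of the image tag, but, puzzlingly, table tags and other content are mixed in. I think this comes down to the regular expression `src="(.*)/>`, which is called greedy, meaning it does not necessarily stop at the first instance of `/>` it encounters. I should add that I also tried `src="(.*)"`, which was really no better. Not being a regex expert (a "regexpert", a word I just made up), I searched around for ways to improve this, but nothing helped.
After a series of other attempts, even trying Perl's `HTML::Parser` module, I finally tried comparing this with some scripts I had written for Scribus that analyze the contents of a text frame character by character and then take some action. For my purposes, what I finally came up with improves on all of these methods and requires no regex or HTML parser at all. Let's go back to the example `img` tag I showed earlier.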
```
<img src="images/edit_shapes.png" ALT="Edit examples" ALIGN=left>
```
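I decided to home in on the `src=` piece. One way would be to wait for an `s` to appear, then check whether the next character is `r`, the next is `c`, and the next is `=`. If so, that's a match! Then whatever lies between the following pair of double quotation marks is what I need. The problem with this approach is the structure needed to keep track of these characters. One way to walk through a string of characters representing a line of HTML text would be: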
```
for c in all_text:
```
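But the logic was just too messy to hang on to the previous `c`, the character before that, the one before that, and the one before that.
In the end, I decided to focus on the `=` and to use an indexing method so that I could easily reference any earlier or later character in the string. Here is the searching part:
```
index = 3
while index < linelength:
    if (all_text[index] == '='):
        if (all_text[index-3] == 's') and (all_text[index-2] == 'r') and (all_text[index-1] == 'c'):
            imagefound(all_text, imagelist, index)
            index += 1
        else:
            index += 1
    else:
        index += 1
```
I start the search at the fourth character (indexing starts at 0) so that I don't get an indexing error further down, and realistically there will not be an equal sign before the fourth character of a line. The first test is whether we find `=` as we march through the string; if not, we march on. If we do see one, we check whether the three preceding characters were `s`, `r`, and `c`, in that order. If they all match, we call the function `imagefound`: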
```
def imagefound(all_text, imagelist, index):
    end = 0
    index += 2
    newimage = ''
    while end == 0:
        if (all_text[index] != '"'):
            newimage = newimage + all_text[index]
            index += 1
        else:
            newimage = newimage + '\n'
            imagelist.append(newimage)
            end = 1
    return
```
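We pass the function the current index, which represents the `=`. We know the next character will be `"`, so we jump ahead two characters and begin adding characters to a holding string named `newimage` until we reach the next `"`, at which point we have completed a match. We add the string plus a `newline` character to the list `imagelist` and `return`, keeping in mind that there may be more image tags in the rest of this HTML string, so we go right back into the middle of the search loop.
Here is what our output looks like now: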
```
images/text-frame-link.png
images/text-frame-unlink.png
images/gimpoptions1.png
images/gimpoptions3.png
images/gimpoptions2.png
images/fontpref3.png
images/font-subst.png
images/fontpref2.png
images/fontpref1.png
images/dtp-studio.png
```
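Ahh, much cleaner, and this took only a few seconds to run. I could have moved the index forward seven more places to cut out the `images/` part, but I prefer to keep it so I can be sure I haven't chopped off the first letter of the image filename, and it is easy to edit out with KWrite; you don't even need regex. After doing that and saving the file, the next step was to run another script I wrote, `sortlist.py`:
```
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# sortlist.py
import os
imagelist = []
# Iterating over the file object yields one line at a time; this replaces
# xreadlines(), which exists only in Python 2.
for line in open('/tmp/imagelist_parse4.txt'):
    imagelist.append(line)
imagelist.sort()
outfile = open('/tmp/imagelist_parse4_sorted.txt', 'w')
outfile.writelines(imagelist)
outfile.close()
```
This reads the file contents into a list, sorts it, and then saves it as another file. After that, I could just do the following: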
```
ls /home/gregp/development/Scribus15x/doc/en/images/*.png > '/tmp/actual_images.txt'
```
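Then I need to run `sortlist.py` on that file too, because `ls` sorts differently from Python. I could have run a comparison script on these files, but I preferred to do this visually. In the end, I found 42 images that had no HTML reference from the documentation.
Here is my parsing script in its entirety: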
```
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# parseimg4.py
import os
def imagefound(all_text, imagelist, index):
    end = 0
    index += 2
    newimage = ''
    while end == 0:
        if (all_text[index] != '"'):
            newimage = newimage + all_text[index]
            index += 1
        else:
            newimage = newimage + '\n'
            imagelist.append(newimage)
            end = 1
    return
htmlnames = []
imagelist = []
tempstring = ''
filenames = os.listdir('/home/gregp/development/Scribus15x/doc/en/')
for name in filenames:
    if name.endswith('.html'):
        htmlnames.append(name)
#print htmlnames
for htmlfile in htmlnames:
    all_text = open('/home/gregp/development/Scribus15x/doc/en/' + htmlfile).read()
    linelength = len(all_text)
    index = 3
    while index < linelength:
        if (all_text[index] == '='):
            if (all_text[index-3] == 's') and (all_text[index-2] == 'r') and (all_text[index-1] == 'c'):
                imagefound(all_text, imagelist, index)
                index += 1
            else:
                index += 1
        else:
            index += 1
outfile = open('/tmp/imagelist_parse4.txt', 'w')
outfile.writelines(imagelist)
outfile.close()
imageno = len(imagelist)
# Parenthesized so the line runs under both Python 2 and Python 3.
print(str(imageno) + " images were found and saved")
```
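The script's name, `parseimg4.py`, doesn't really reflect the number of scripts I wrote along the way, counting minor tweaks, major rewrites, and discards followed by starting over. Note that I have hardcoded these directory and file names, but it would be easy to generalize by asking the user to enter this information. Also, because these were working scripts, I sent the output to the `/tmp` directory, so it disappears once I reboot the system.
This wasn't the end of the story, because the next question was: what about zombie HTML files? Any of these unused files might reference images that the previous method did not pick up. We have a `menu.xml` file that serves as the table of contents for the online manual, but I also needed to consider that some files listed in the TOC (translator's note: TOC is short for table of contents) might reference files that are not in the TOC, and yes, I did find some.
I'll conclude by saying that this was a simpler task than the image search, and it was greatly helped by the processes I had already developed.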
### About the author
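[![](https://opensource.com/sites/default/files/styles/profile_pictures/public/20150529_gregp.jpg?itok=nv02g6PV)][7] Greg Pittman - Greg is a retired neurologist in Louisville, Kentucky, with a long-standing interest in computers and programming that began with Fortran IV in the 1960s. When Linux and open source software came along, it kindled a commitment to learning more and, eventually, to contributing. He is a member of the Scribus Team. [More about me][8]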
--------------------------------------------------------------------------------
via: https://opensource.com/article/18/1/parsing-html-python
Author: [Greg Pittman][a]
Translator: [Flowsnow](https://github.com/Flowsnow)
Proofreader: [Proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China](https://linux.cn/)
[a]:https://opensource.com/users/greg-p
[1]:https://www.crummy.com/software/BeautifulSoup/
[2]:https://www.kde.org/applications/utilities/kwrite/
[7]:https://opensource.com/users/greg-p
[8]:https://opensource.com/users/greg-p