Merge pull request #26078 from lkxed/20220616-Analyze-web-pages-with-Python-requests-and-Beautiful-Soup

[Submit translation][tech]: 20220616 Analyze web pages with Python requests and Beautiful Soup.md
This commit is contained in:
六开箱 2022-06-16 16:42:25 +08:00 committed by GitHub
commit bc13c20699
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 146 additions and 146 deletions


@ -1,146 +0,0 @@
[#]: subject: "Analyze web pages with Python requests and Beautiful Soup"
[#]: via: "https://opensource.com/article/22/6/analyze-web-pages-python-requests-beautiful-soup"
[#]: author: "Seth Kenlon https://opensource.com/users/seth"
[#]: collector: "lkxed"
[#]: translator: "lkxed"
[#]: reviewer: " "
[#]: publisher: " "
[#]: url: " "
Analyze web pages with Python requests and Beautiful Soup
======
Follow this Python tutorial to easily extract information about web pages.
![Python programming language logo with question marks][1]
Image by: Opensource.com
Browsing the web probably accounts for much of your day. But it's an awfully manual process, isn't it? You have to open a browser. Go to a website. Click buttons, move a mouse. It's a lot of work. Wouldn't it be nicer to interact with the Internet through code?
You can get data from the Internet using Python with the help of the Python module `requests`:
```
import requests
DATA = "https://opensource.com/article/22/5/document-source-code-doxygen-linux"
PAGE = requests.get(DATA)
print(PAGE.text)
```
In this code sample, you first import the `requests` module. Then you create two variables: the first, called `DATA`, holds the URL you want to download. In later versions of this code, you'll be able to provide a different URL each time you run your application. For now, though, it's easiest to just "hard code" a test URL for demonstration purposes.
The other variable is `PAGE`, which you set to the response of the `requests.get` function when it reads the URL stored in `DATA`. The `requests` module and its `.get` function are pre-programmed to "read" an Internet address (a URL), access the Internet, and download whatever is located at that address.
That's a lot of steps you don't have to figure out on your own, and that's exactly why Python modules exist. Finally, you tell Python to `print` everything that `requests.get` has stored in the `.text` field of the `PAGE` variable.
### Beautiful Soup
If you run the sample code above, you get the contents of the example URL dumped indiscriminately into your terminal. It does that because the only thing your code does with the data that `requests` has gathered is print it. It's more interesting to parse the text.
Python can "read" text with its most basic functions, but parsing text allows you to search for patterns, specific words, HTML tags, and so on. You could parse the text returned by `requests` yourself, but using a specialized module is much easier. For HTML and XML, there's the [Beautiful Soup][2] library.
This code accomplishes the same thing, but it uses Beautiful Soup to parse the downloaded text. Because Beautiful Soup recognizes HTML entities, you can use some of its built-in features to make the output a little easier for the human eye to parse.
For instance, instead of printing raw text at the end of your program, you can run the text through the `.prettify` function of Beautiful Soup:
```
from bs4 import BeautifulSoup
import requests
PAGE = requests.get("https://opensource.com/article/22/5/document-source-code-doxygen-linux")
SOUP = BeautifulSoup(PAGE.text, 'html.parser')
if __name__ == '__main__':
    # Pretty-print the parsed HTML, one tag per line, indented
    print(SOUP.prettify())
```
The output of this version of your program puts every opening HTML tag on its own line, with indentation to show which tags are children of other tags. But Beautiful Soup understands HTML tags in more ways than just printing them nicely.
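For instance, parsed tags can be navigated as attributes of the soup, and a tag's HTML attributes can be read like dictionary keys. Here's a minimal sketch; it uses a hard-coded HTML snippet instead of a live page (an assumption for demonstration, so the output is predictable):

```python
from bs4 import BeautifulSoup

# A small, hard-coded HTML snippet so the results are predictable
HTML = '<html><head><title>Demo</title></head><body><p class="intro">Hello</p></body></html>'
SOUP = BeautifulSoup(HTML, 'html.parser')

print(SOUP.title.string)  # text inside the <title> tag: Demo
print(SOUP.p.name)        # the tag's own name: p
print(SOUP.p['class'])    # attributes read like dictionary keys: ['intro']
```

Note that `class` comes back as a list, because HTML allows a tag to carry several class names at once.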
Instead of printing the whole page, you can single out a specific kind of tag. For instance, try changing the print selector from `print(SOUP.prettify())` to this:
```
print(SOUP.p)
```
This prints just a `<p>` tag. Specifically, it prints just the first `<p>` tag encountered. To print all `<p>` tags, you need a loop.
### Looping
Create a `for` loop to cycle over the entire webpage contained in the `SOUP` variable, using the `find_all` function of Beautiful Soup. It's not unreasonable to want to use your loop for other tags besides just the `<p>` tag, so build it as a custom function, designated by the `def` keyword (for "define") in Python.
```
def loopit():
    for TAG in SOUP.find_all('p'):
        print(TAG)
```
The temporary variable `TAG` is arbitrary. You can use any term, such as `ITEM` or `i` or whatever you want. Each time the loop runs, `TAG` holds one result of the `find_all` search. In this code, it searches for `<p>` tags.
A function doesn't run unless it's explicitly called. You can call your function at the end of your code:
```
if __name__ == '__main__':
    # Call the function defined above
    loopit()
```
Run your code to see all `<p>` tags and each one's contents.
### Getting just the content
You can exclude tags from being printed by specifying that you want just the "string" (programming lingo for "words").
```
def loopit():
    for TAG in SOUP.find_all('p'):
        print(TAG.string)
```
Of course, once you have the text of a webpage, you can parse it further with the standard Python string libraries. For instance, you can get a word count using `len` and `split`:
```
def loopit():
    for TAG in SOUP.find_all('p'):
        if TAG.string is not None:
            print(len(TAG.string.split()))
```
This prints the number of words within each paragraph element, omitting those paragraphs that don't have a string. To get a grand total, use a variable and some basic math:
```
def loopit():
    NUM = 0
    for TAG in SOUP.find_all('p'):
        if TAG.string is not None:
            NUM = NUM + len(TAG.string.split())
    print("Grand total is ", NUM)
```
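One caveat worth knowing: `.string` is `None` for any paragraph that contains nested tags (such as `<em>` or `<a>`), so the code above silently skips those paragraphs. Beautiful Soup's `get_text()` method flattens nested tags instead. A small sketch, using an inline HTML snippet as a stand-in for a downloaded page:

```python
from bs4 import BeautifulSoup

# One paragraph with a nested tag, one without
HTML = '<p>Hello <em>brave</em> new world</p><p>Plain paragraph</p>'
SOUP = BeautifulSoup(HTML, 'html.parser')

FIRST = SOUP.p
print(FIRST.string)  # None, because the paragraph contains a nested <em> tag

NUM = 0
for TAG in SOUP.find_all('p'):
    # get_text() flattens nested tags, so no paragraph is skipped
    NUM = NUM + len(TAG.get_text().split())
print("Grand total is ", NUM)
```

With `.string`, the first paragraph would have been skipped entirely; with `get_text()`, all six words are counted.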
### Python homework
There's a lot more information you can extract with Beautiful Soup and Python. Here are some ideas on how to improve your application:
* [Accept input][3] so you can specify what URL to download and analyze when you launch your application.
* Count the number of images (`<img>` tags) on a page.
* Count the number of images (`<img>` tags) within another tag (for instance, only images that appear in the `<main>` element, or only images following a `</p>` tag).
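The image-counting exercises can be sketched along the same lines as the paragraph loop above. The snippet below uses an inline HTML string (a stand-in for a real page) so the counts are known in advance:

```python
from bs4 import BeautifulSoup

# A stand-in for a downloaded page: one image outside <main>, two inside
HTML = '''
<body>
  <img src="banner.png">
  <main>
    <p>Text</p>
    <img src="a.png">
    <img src="b.png">
  </main>
</body>
'''
SOUP = BeautifulSoup(HTML, 'html.parser')

# All <img> tags on the page
print(len(SOUP.find_all('img')))
# Only <img> tags inside the <main> element
print(len(SOUP.find('main').find_all('img')))
```

Because `find` returns a tag object, you can call `find_all` on that tag to search only within it.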
--------------------------------------------------------------------------------
via: https://opensource.com/article/22/6/analyze-web-pages-python-requests-beautiful-soup
Author: [Seth Kenlon][a]
Topic selection: [lkxed][b]
Translator: [译者ID](https://github.com/译者ID)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
[a]: https://opensource.com/users/seth
[b]: https://github.com/lkxed
[1]: https://opensource.com/sites/default/files/lead-images/python_programming_question.png
[2]: https://beautiful-soup-4.readthedocs.io/en/latest/
[3]: https://opensource.com/article/17/3/python-tricks-artists-interactivity-Python-scripts
