Mirror of https://github.com/LCTT/TranslateProject.git (synced 2025-03-30 02:40:11 +08:00)
Commit 93398177fb

Files touched by this commit include:

* 20200522 A beginner-s guide to web scraping with Python.md
* 20210111 Improve your productivity with this lightweight Linux desktop.md
* 20210115 3 plain text note-taking tools.md
* 20210117 3 steps to achieving Inbox Zero.md
* 20210118 KDE Customization Guide- Here are 11 Ways You Can Change the Look and Feel of Your KDE-Powered Linux Desktop.md
@@ -1,8 +1,8 @@
[#]: collector: (lujun9972)
[#]: translator: (geekpi)
-[#]: reviewer: ( )
-[#]: publisher: ( )
-[#]: url: ( )
+[#]: reviewer: (wxy)
+[#]: publisher: (wxy)
+[#]: url: (https://linux.cn/article-13026-1.html)
[#]: subject: (Homura: A WINE-based Game Launcher for BSD)
[#]: via: (https://itsfoss.com/homura-game-launcher-bsd/)
[#]: author: (John Paul https://itsfoss.com/author/john/)
@@ -18,7 +18,7 @@ BSD 不只是用于服务器。人们也将其用于桌面,并执行通常任

[Homura][2] 是一个能让你在 FreeBSD 上玩 Windows 游戏的工具。它的灵感来自于 [Lutris][3]。它允许你安装和管理多个 Windows 游戏和游戏商店启动器。它主要使用 Wine,但也附带了一些修复和变通的方法来让游戏工作。

-Homura 的创建者 [Alexander Vereeken][4] 说,他创建这个应用是因为“当我开始使用 FreeBSD 时,没有有用的程序在 Wine 中安装游戏或启动器,所以我自己创建了一个”。当时,Wine 是唯一的选择。Linux 版本的 Steam 并不存在。
+Homura 的创建者 [Alexander Vereeken][4] 说,他创建这个应用是因为“当我开始使用 FreeBSD 时,没有能用的可以在 Wine 中安装游戏或启动器的程序,所以我自己创建了一个”。当时,Wine 是唯一的选择。Linux 版本的 Steam 并不存在。

![Homura install list][5]

@@ -32,17 +32,13 @@ Homura 的创建者 [Alexander Vereeken][4] 说,他创建这个应用是因为
* Discord
* 龙歌 OL
* GOG
* 建造乌托邦
* 英雄联盟
* Origin 启动器
* PokeMMO
-* 精灵宝可梦绿铀
+* 精灵宝可梦 绿铀
* RuneScape
* Steam
* 地铁跑酷
* Teamspeak
* 猴岛大冒险 2
@@ -51,9 +47,7 @@ Homura 的创建者 [Alexander Vereeken][4] 说,他创建这个应用是因为
* 战游游戏中心
* Itch.io

-Homura 是以动画 [Madoka Magica][6] 中的一个角色命名的。在创建者转移到 GitLab 之前,它最初托管在 GitHub 上。目前它被托管在 Codeberg 上。希望它能暂时留在那里。
+Homura 是以动画 [Madoka Magica][6] 中的一个角色命名的。在创建者转移到 GitLab 之前,它最初托管在 GitHub 上。目前它被托管在 Codeberg 上。希望它现在还能留在那里。

![Homura][7]

@@ -71,11 +65,11 @@ pkg install games/homura
cd /usr/ports/games/homura/ && make install clean
```

-安装完成后,你可以从菜单中选择 Homura,或在命令行中输入 `Homura` 来运行它。(在终端中名字必须大写,否则将无法运行。)
+安装完成后,你可以从菜单中选择 Homura,或在命令行中输入 `Homura` 来运行它。(在终端中名字必须大写开头,否则将无法运行。)

-如果你通过 Homura 安装 Steam,你需要从 Homura 启动它。如果你从操作系统的菜单中启动它,它目前不会显示。
+如果你通过 Homura 安装 Steam,你需要从 Homura 启动它。如果你想从操作系统的菜单中启动它,它当前不会显示在那里。

-Steam 的库和商店标签由内置的网络浏览器显示。由于某些原因,这在 FreeBSD 上是不行的。但如果你从 Homura 的菜单中启动 Steam,它将使用列表模式,工作没有问题。
+Steam 的库和商店标签由内置的 Web 浏览器显示。由于某些原因,这在 FreeBSD 上是不行的。但如果你从 Homura 的菜单中启动 Steam,它将使用列表模式,工作没有问题。

### 体验

@@ -83,9 +77,9 @@ Steam 的库和商店标签由内置的网络浏览器显示。由于某些原

![Homura Main Menu][8]

-我也尝试安装 GOG 启动器。由于某些原因,它对我不起作用。加载页面会弹出,但什么都不会发生。我打算提交一个问题。我没有测试任何安装程序/启动器,因为我没有使用它们。
+我也尝试安装 GOG 启动器。由于某些原因,它对我不起作用。加载页面会弹出,但什么都不会发生。我打算提交一个议题。我没有测试过任何安装程序/启动器,因为我没有使用它们。

-### 最后想法
+### 总结

并不是所有的东西都能顺利地与 Homura 一起使用,但我可以玩一些我最喜欢的游戏。

@@ -93,8 +87,6 @@ Steam 的库和商店标签由内置的网络浏览器显示。由于某些原

这款应用是一个典型的用户有了需求,并决定去填补它的案例。通过这样做,他让其他人的生活变得更轻松。希望这个应用能让人们更容易地开始使用 FreeBSD 作为他们的操作系统。

如果你觉得这篇文章有趣,请花点时间在社交媒体、Hacker News 或 [Reddit][10] 上分享。

--------------------------------------------------------------------------------

via: https://itsfoss.com/homura-game-launcher-bsd/
@@ -102,7 +94,7 @@ via: https://itsfoss.com/homura-game-launcher-bsd/
作者:[John Paul][a]
选题:[lujun9972][b]
译者:[geekpi](https://github.com/geekpi)
-校对:[校对者ID](https://github.com/校对者ID)
+校对:[wxy](https://github.com/wxy)

本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
@@ -0,0 +1,99 @@
[#]: collector: (lujun9972)
[#]: translator: ( )
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (4 big lessons from my internship with open source)
[#]: via: (https://opensource.com/article/21/1/open-source-intern)
[#]: author: (Jaideep Rao https://opensource.com/users/jaideeprao)

4 big lessons from my internship with open source
|
||||
======
|
||||
Just months after dipping my toes into open source, I learned lessons as
|
||||
a college intern that will help anyone writing and contributing code to
|
||||
a community.
|
||||
![Brain on a computer screen][1]
|
||||
|
||||
If you happened to have a time machine and chose to shoot yourself back a year to ask me what I thought about making open source contributions (of all the things you could've done), you may have guessed that I would just shrug and say something along the lines of, "I don't know, isn't that reserved for all the hard-core devs with the insane GitHub stats and decked-out macros and stuff? I'd have no idea what I was doing, and who would even care what some random college student had to say about their code?" And you'd probably be correct. But this was all before I stumbled onto the incredible opportunity to intern at Red Hat within the OpenShift Engineering division for the better part of 2020.
|
||||
|
||||
I came into the internship like any fresh-faced computer science student, skirting through life writing untested, barely readable but somehow still functional code, and feeling proud of it. But this internship brought the opportunity to get my hands dirty with the open source culture and finally see what all the hype is about.
|
||||
|
||||
I worked within the OpenShift-GitOps team right around the time Red Hat officially adopted [Argo CD][2] into its ecosystem. Subsequently, I—along with the rest of the team—was tasked with making upstream contributions to Argo CD. I decided to throw some of my thoughts about my experience into an article to take you through it.
|
||||
|
||||
### Figuring things out
|
||||
|
||||
Starting out was difficult and disorienting, as you might expect. I think we can all agree that reading code is hard enough when it's written by a colleague on your team. To get up to speed with code written within a different organization, potentially using different technologies, different components, and different coding practices can get overwhelming quickly. On several occasions, I found myself mindlessly sifting through files.
|
||||
|
||||
I soon realized that my first step should have been to try to understand the product as a user, not as a developer. This involved trying to get the software up and running and playing around with it. Fortunately, I had an entire team going through the same thing, so we could help each other get set up and cruise through it.
|
||||
|
||||
This was also when I started to appreciate the power of good documentation and what it can do to simplify your life as a developer. As an added bonus, the good folks in the Argo community were accommodating enough to host a weekly office hour of sorts to ease in all the new contributors. These sessions went a long way toward speeding up the awkward settling-in phase and helping our programmer instincts kick in sooner. (They were also just a great place to be a fly-on-the-wall observer.)
|
||||
|
||||
### Selecting and solving issues
|
||||
|
||||
Skipping forward a little, I began scouring the list of open issues looking for something to sink my teeth into. This can be a messy process. The open source community's heavy dependence on its members' proclivity to participate comes with its fair share of ambiguity and lack of obligation towards efficient communication. This may present itself in several ways, such as an inability to reproduce the described issue locally, insufficient context to understand the issue, or just painfully slow communication with the person who raised the issue. As you go through the open source experience, you might find this to be a recurring theme. However, this experience helped me realize that picking the right issue, understanding its semantics, and reproducing it locally is half the battle.
|
||||
|
||||
When things go right, and you find an interesting issue with decent engagement, it can be quite exciting! The discussions in the comments section can show the different use cases and workarounds people come up with about specific issues in the project. These can be great sources of context, and gathering context is the name of the game—at least until you know what the heck is going on.
|
||||
|
||||
Once I was in the weeds and starting to think about potential solutions to an issue, the thing that jumped out to me the most was how much of a learning curve was associated with each new issue I took on. One of the reasons was that I was picking randomly from whatever unclaimed open issues were filed under the next major release milestone. This meant the issues I took varied a lot. I would wind up going down a different rabbit hole for each, learning 10 new related concepts in the process (even if not all of them made their way into the eventual pull request).
|
||||
|
||||
The same thing was true even when trying to step through the code to track down the source of a bug and coming across all the different components involved. This phase always seemed to be packed with the most amount of learning. As I slowly made my way to the solution, I often needed to fill in some knowledge gaps. Once again, I believe I had the most supportive colleagues anyone could ask for since I could always consult them as needed.
|
||||
|
||||
### Submitting pull requests
|
||||
|
||||
Once a fix or feature is done and tested locally, you're ready to raise your pull request (PR)! This is usually followed by a few rounds of going back and forth with the repository's maintainers as they review your PR and potentially request changes. It always amazes me to think about the amount of time, effort, and deliberation that the seemingly smallest contribution can involve. This isn't always evident from the outside, and your PR might end up looking very different from what you started with.
|
||||
|
||||
I also learned that it is not uncommon for five lines of code changes to be accompanied by 150 lines of tests for those changes. In other words, writing unit tests for your PR can sometimes be just as involved as the fix/feature itself, if not more. After everything is said and done, your PR is finally merged. You can do a quick celebratory dance, but then it's onward to the next one!
|
||||
|
||||
### Learning big lessons
|
||||
|
||||
I learned a lot during my internship that will help me professionally and personally.
|
||||
|
||||
#### Professional lessons
|
||||
|
||||
* Coming into this, most of my coding experience had been centered around personal projects or assignments for school or tasks assigned for my organization. These tend to be very specific to their intended audience and generally aren't very consequential to the wider community. In contrast, open source projects tend to have a much wider reach, so it was interesting to think about the potential scale of my contributions' impact. This made me feel like my work was consequential, and that made it feel worthwhile.
|
||||
* If you're like me, finding random issues and fixing them might stop being as exciting after a while. You might ask, "What is all of this actually leading to?" This is why I think it's important to have a larger picture and a sense of direction in mind. It helps drive your decisions and reminds you what you're working towards. Red Hat's larger objectives and long-term vision for adopting Argo CD provided me with this sense of direction and helped me stay motivated. But this can probably be achieved in multiple ways. Picking and working on issues more strategically, so they allow you to learn an aspect of programming that you're interested in getting better at, could be one of them.
|
||||
|
||||
|
||||
|
||||
#### Personal lessons
|
||||
|
||||
* It's no secret that breaking into something new can be daunting. I, probably along with half the software industry, am no stranger to imposter syndrome. There were times I felt like I wasn't moving as quickly or making as much progress as I would have expected from myself. This was frustrating, but I eventually understood how important it is to be patient with yourself. Especially if you're someone who's just learning the ropes, it may take a while to start making good progress, but it's worth remembering that it's all part of the learning process.
|
||||
* Early on, I tended to compare myself with my more experienced colleagues who were going through issues and getting PRs merged faster than you could finish saying "Argo CD." This didn't help my confidence a whole lot, but I realized that pitting myself (an intern working part time) against industry veterans wasn't really a fair comparison for anyone. The best way to build yourself up is to compare yourself to who you used to be rather than to those around you.
|
||||
|
||||
|
||||
|
||||
#### Other helpful tips I learned
|
||||
|
||||
* Do not hesitate to ask questions on the community forum. Also, try to find out if the project has a Slack, Discord, or Gitter that you can join.
|
||||
* Look at other PRs and discussions to gain context about what's happening in the project and better understand the codebase.
|
||||
  * Try to identify unique log and error messages related to the workflow you're interested in. Searching for these messages directly against the codebase could be a quick way to locate the area you want to focus on, and reverse-engineering the sequence of function calls involved in getting to that point may help you understand everything that takes place along the way. (I found this to be particularly helpful.) A small script sketching this idea follows after this list.
|
||||
* Looking at unit tests can be a good way to understand what a function is supposed to be doing and how it interacts with other functions (input/output formats, etc.).
|
||||
* Looking for issues labeled "good first issue" might be a good place to start. However, there may be many good issues that aren't labeled as such, so it might be worth looking through the issue board outside of that filter.
|
||||
* Always update documentation if you're making feature edits!
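
To make the log-message tip above concrete, here is a rough Python sketch of my own (not something from the internship itself); the repository path, file extensions, and error string are placeholders, and in practice `git grep` does the same job in one command:

```
import os

def find_message(repo_root, message):
    """Walk a checked-out repository and list source files containing a log/error string."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        for name in filenames:
            if not name.endswith((".go", ".py", ".ts", ".yaml")):
                continue  # only look at likely source files
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    if message in handle.read():
                        hits.append(path)
            except OSError:
                pass  # skip unreadable files
    return hits

# Placeholder repository path and message, purely for illustration.
print(find_message("argo-cd", "application not found"))
```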
|
||||
|
||||
|
||||
|
||||
### Closing thoughts
|
||||
|
||||
The open source contribution experience is not a perfect process. As with anything else, it has its downsides. The sense of ambiguity from its open-ended nature and the occasional lack of feedback or communication can be hard to work around. On the other hand, I am pleased to see how much I was able to grow and get exposure to during this time. I found it challenging and gratifying to be a part of a developer team, and I am better for it.
|
||||
|
||||
Getting to work within a different paradigm, becoming a part of the wider developer community, and making new connections were all big pluses. Having your PRs merged is a good feeling as well! I also benefited by spotting [PlayerUnknown's Battlegrounds][3] as an Argo CD user, and I helped improve my friends' PUBG gaming experience by telling them about it.
|
||||
|
||||
If you made it all the way to the end, thanks for reading! I hope this can be helpful to you in starting your own journey into the open source world. Good luck!
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
via: https://opensource.com/article/21/1/open-source-intern
|
||||
|
||||
作者:[Jaideep Rao][a]
|
||||
选题:[lujun9972][b]
|
||||
译者:[译者ID](https://github.com/译者ID)
|
||||
校对:[校对者ID](https://github.com/校对者ID)
|
||||
|
||||
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
|
||||
|
||||
[a]: https://opensource.com/users/jaideeprao
|
||||
[b]: https://github.com/lujun9972
|
||||
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/brain_computer_solve_fix_tool.png?itok=okq8joti (Brain on a computer screen)
|
||||
[2]: https://github.com/argoproj/argo-cd
|
||||
[3]: https://en.wikipedia.org/wiki/PlayerUnknown%27s_Battlegrounds
|
@@ -1,493 +0,0 @@
[#]: collector: (lujun9972)
[#]: translator: (stevenzdg988)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (A beginner's guide to web scraping with Python)
[#]: via: (https://opensource.com/article/20/5/web-scraping-python)
[#]: author: (Julia Piaskowski https://opensource.com/users/julia-piaskowski)

A beginner's guide to web scraping with Python
|
||||
======
|
||||
Get some hands-on experience with essential Python tools to scrape
|
||||
complete HTML sites.
|
||||
![HTML code][1]
|
||||
|
||||
There are plenty of great books to help you learn Python, but who actually reads these A to Z? (Spoiler: not me).
|
||||
|
||||
Many people find instructional books useful, but I do not typically learn by reading a book front to back. I learn by doing a project, struggling, figuring some things out, and then reading another book. So, throw away your book (for now), and let's learn some Python.
|
||||
|
||||
What follows is a guide to my first scraping project in Python. It is very low on assumed knowledge in Python and HTML. This is intended to illustrate how to access web page content with the Python library [requests][2] and parse the content using [BeautifulSoup4][3], as well as JSON and [pandas][4]. I will briefly introduce [Selenium][5], but I will not delve deeply into how to use that library—that topic deserves its own tutorial. Ultimately I hope to show you some tricks and tips to make web scraping less overwhelming.
|
||||
|
||||
### Installing our dependencies
|
||||
|
||||
All the resources from this guide are available at my [GitHub repo][6]. If you need help installing Python 3, check out the tutorials for [Linux][7], [Windows][8], and [Mac][9].
|
||||
|
||||
|
||||
```
|
||||
$ python3 -m venv venv
|
||||
$ source venv/bin/activate
|
||||
$ pip install requests bs4 pandas
|
||||
```
|
||||
|
||||
If you like using JupyterLab, you can run all the code using this [notebook][10]. There are a lot of ways to [install JupyterLab][11], and this is one of them:
|
||||
|
||||
|
||||
```
|
||||
# from the same virtual environment as above, run:
|
||||
$ pip install jupyterlab
|
||||
```
|
||||
|
||||
### Setting a goal for our web scraping project
|
||||
|
||||
Now we have our dependencies installed, but what does it take to scrape a webpage?
|
||||
|
||||
Let's take a step back and be sure to clarify our goal. Here is my list of requirements for a successful web scraping project.
|
||||
|
||||
* We are gathering information that is worth the effort it takes to build a working web scraper.
|
||||
* We are downloading information that can be legally and ethically gathered by a web scraper.
|
||||
* We have some knowledge of how to find the target information in HTML code.
|
||||
* We have the right tools: in this case, it's the libraries **BeautifulSoup** and **requests**.
|
||||
* We know (or are willing to learn) how to parse JSON objects.
|
||||
* We have enough data skills to use **pandas**.
|
||||
|
||||
|
||||
|
||||
A comment on HTML: While HTML is the beast that runs the Internet, what we mostly need to understand is how tags work. A tag is a collection of information sandwiched between angle-bracket enclosed labels. For example, here is a pretend tag, called "pro-tip":
|
||||
|
||||
|
||||
```
|
||||
<pro-tip> All you need to know about html is how tags work </pro-tip>
|
||||
```
|
||||
|
||||
We can access the information in there ("All you need to know…") by calling its tag "pro-tip." How to find and access a tag will be addressed further in this tutorial. For more of a look at HTML basics, check out [this article][12].
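
As a quick preview (a minimal sketch of my own, not code from the article itself), BeautifulSoup can already pull the text out of that made-up tag:

```
from bs4 import BeautifulSoup

snippet = "<pro-tip> All you need to know about html is how tags work </pro-tip>"
soup = BeautifulSoup(snippet, 'html.parser')
print(soup.find('pro-tip').text)  # prints the text sandwiched between the labels
```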
|
||||
|
||||
### What to look for in a web scraping project
|
||||
|
||||
Some goals for gathering data are more suited for web scraping than others. My guidelines for what qualifies as a good project are as follows.
|
||||
|
||||
There is no public API available for the data. It would be much easier to capture structured data through an API, and it would help clarify both the legality and ethics of gathering the data. There needs to be a sizable amount of structured data with a regular, repeatable format to justify this effort. Web scraping can be a pain. BeautifulSoup (bs4) makes this easier, but there is no avoiding the individual idiosyncrasies of websites that will require customization. Identical formatting of the data is not required, but it does make things easier. The more "edge cases" (departures from the norm) present, the more complicated the scraping will be.
|
||||
|
||||
Disclaimer: I have zero legal training; the following is not intended to be formal legal advice.
|
||||
|
||||
On the note of legality, accessing vast troves of information can be intoxicating, but just because it's possible doesn't mean it should be done.
|
||||
|
||||
There is, thankfully, public information that can guide our morals and our web scrapers. Most websites have a [robots.txt][13] file associated with the site, indicating which scraping activities are permitted and which are not. It's largely there for interacting with search engines (the ultimate web scrapers). However, much of the information on websites is considered public information. As such, some consider the robots.txt file as a set of recommendations rather than a legally binding document. The robots.txt file does not address topics such as ethical gathering and usage of the data.
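
If you want to automate part of that check, Python's standard library includes `urllib.robotparser`; the following is a minimal sketch (not something the original article covers), pointed at the same site the article scrapes later:

```
from urllib import robotparser

robots = robotparser.RobotFileParser()
robots.set_url("https://locations.familydollar.com/robots.txt")
robots.read()  # fetch and parse the robots.txt file

# True means the generic user agent "*" is allowed to fetch this path.
print(robots.can_fetch("*", "https://locations.familydollar.com/id/"))
```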
|
||||
|
||||
Questions I ask myself before beginning a scraping project:
|
||||
|
||||
* Am I scraping copyrighted material?
|
||||
* Will my scraping activity compromise individual privacy?
|
||||
* Am I making a large number of requests that may overload or damage a server?
|
||||
* Is it possible the scraping will expose intellectual property I do not own?
|
||||
* Are there terms of service governing use of the website, and am I following those?
|
||||
  * Will my scraping activities diminish the value of the original data? (For example, do I plan to repackage the data as-is and perhaps siphon off website traffic from the original source?)
|
||||
|
||||
|
||||
|
||||
When I scrape a site, I make sure I can answer "no" to all of those questions.
|
||||
|
||||
For a deeper look at the legal concerns, see the 2018 publications [Legality and Ethics of Web Scraping by Krotov and Silva][14] and [Twenty Years of Web Scraping and the Computer Fraud and Abuse Act by Sellars][15].
|
||||
|
||||
### Now it's time to scrape!
|
||||
|
||||
After assessing the above, I came up with a project. My goal was to extract addresses for all Family Dollar stores in Idaho. These stores have an outsized presence in rural areas, so I wanted to understand how many there are in a rather rural state.
|
||||
|
||||
The starting point is the [location page for Family Dollar][16].
|
||||
|
||||
![Family Dollar Idaho locations page][17]
|
||||
|
||||
To begin, let's load up our prerequisites in our Python virtual environment. The code from here is meant to be added to a Python file (_scraper.py_ if you're looking for a name) or be run in a cell in JupyterLab.
|
||||
|
||||
|
||||
```
|
||||
import requests # for making standard html requests
|
||||
from bs4 import BeautifulSoup # magical tool for parsing html data
|
||||
import json # for parsing data
|
||||
from pandas import DataFrame as df # premier library for data organization
|
||||
```
|
||||
|
||||
Next, we request data from our target URL.
|
||||
|
||||
|
||||
```
|
||||
page = requests.get("<https://locations.familydollar.com/id/>")
|
||||
soup = BeautifulSoup(page.text, 'html.parser')
|
||||
```
|
||||
|
||||
BeautifulSoup will take HTML or XML content and transform it into a complex tree of objects. Here are several common object types that we will use.
|
||||
|
||||
* **BeautifulSoup**—the parsed content
|
||||
* **Tag**—a standard HTML tag, the main type of bs4 element you will encounter
|
||||
* **NavigableString**—a string of text within a tag
|
||||
* **Comment**—a special type of NavigableString
|
||||
|
||||
|
||||
|
||||
There is more to consider when we look at **requests.get()** output. I've only used **page.text** to translate the requested page into something readable, but there are other output types:

  * **page.text** for text (most common)
  * **page.content** for byte-by-byte output
  * **page.json()** for JSON objects
  * **page.raw** for the raw socket response (no thank you)
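
For a concrete, hedged illustration of those accessors against the same location page used later in the article (the status-code check is my addition, not a step the original walks through):

```
import requests

page = requests.get("https://locations.familydollar.com/id/")
print(page.status_code)   # 200 indicates the request succeeded
html_text = page.text     # decoded text, which BeautifulSoup expects
raw_bytes = page.content  # undecoded bytes, useful for non-text responses
```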
|
||||
|
||||
|
||||
|
||||
I have only worked on English-only sites using the Latin alphabet. The default encoding settings in **requests** have worked fine for that. However, there is a rich internet world beyond English-only sites. To ensure that **requests** correctly parses the content, you can set the encoding for the text:
|
||||
|
||||
|
||||
```
|
||||
page = requests.get(URL)
|
||||
page.encoding = 'ISO-8859-1'
|
||||
soup = BeautifulSoup(page.text, 'html.parser')
|
||||
```
|
||||
|
||||
Taking a closer look at BeautifulSoup tags, we see:
|
||||
|
||||
* The bs4 element **tag** is capturing an HTML tag
|
||||
* It has both a name and attributes that can be accessed like a dictionary: **tag['someAttribute']**
|
||||
* If a tag has multiple attributes with the same name, only the first instance is accessed.
|
||||
* A tag's children are accessed via **tag.contents**.
|
||||
  * All tag descendants can be accessed with **tag.descendants**.
  * You can always search the full text as a string with **re.compile("your_string")** instead of navigating the HTML tree.
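
To make those bullet points concrete, here is a tiny example of my own on made-up HTML (the class name simply mirrors the one used later in the article):

```
from bs4 import BeautifulSoup

html = '<div class="itemlist"><a href="https://example.com/boise">Boise</a></div>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find(class_="itemlist")
print(type(tag))              # <class 'bs4.element.Tag'>
print(tag['class'])           # attributes behave like a dictionary
print(tag.contents)           # direct children only
print(list(tag.descendants))  # children plus their children (the <a> tag and its text)
print(tag.a['href'])          # https://example.com/boise
```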
|
||||
|
||||
|
||||
|
||||
### Determine how to extract relevant content
|
||||
|
||||
Warning: this process can be frustrating.
|
||||
|
||||
Extraction during web scraping can be a daunting process filled with missteps. I think the best way to approach this is to start with one representative example and then scale up (this principle is true for any programming task). Viewing the page's HTML source code is essential. There are a number of ways to do this.
|
||||
|
||||
You can view the entire source code of a page using Python in your terminal (not recommended). Run this code at your own risk:
|
||||
|
||||
|
||||
```
|
||||
print(soup.prettify())
|
||||
```
|
||||
|
||||
While printing out the entire source code for a page might work for a toy example shown in some tutorials, most modern websites have a massive amount of content on any one of their pages. Even the 404 page is likely to be filled with code for headers, footers, and so on.
|
||||
|
||||
It is usually easiest to browse the source code via **View Page Source** in your favorite browser (right-click, then select "view page source"). That is the most reliable way to find your target content (I will explain why in a moment).
|
||||
|
||||
![Family Dollar page source code][18]
|
||||
|
||||
|
||||
|
||||
In this instance, I need to find my target content—an address, city, state, and zip code—in this vast HTML ocean. Often, a simple search of the page source (**ctrl + F**) will yield the section where my target location is located. Once I can actually see an example of my target content (the address for at least one store), I look for an attribute or tag that sets this content apart from the rest.
|
||||
|
||||
It would appear that first, I need to collect web addresses for different cities in Idaho with Family Dollar stores and visit those websites to get the address information. These web addresses all appear to be enclosed in a **href** tag. Great! I will try searching for that using the **find_all** command:
|
||||
|
||||
|
||||
```
|
||||
dollar_tree_list = soup.find_all('href')
|
||||
dollar_tree_list
|
||||
```
|
||||
|
||||
Searching for **href** did not yield anything, darn. This might have failed because **href** is nested inside the class **itemlist**. For the next attempt, search on the class **itemlist**. Because "class" is a reserved word in Python, **class_** is used instead. The bs4 function **soup.find_all()** turned out to be the Swiss army knife of bs4 functions.
|
||||
|
||||
|
||||
```
|
||||
dollar_tree_list = soup.find_all(class_ = 'itemlist')
|
||||
for i in dollar_tree_list[:2]:
|
||||
print(i)
|
||||
```
|
||||
|
||||
Anecdotally, I found that searching for a specific class was often a successful approach. We can learn more about the object by finding out its type and length.
|
||||
|
||||
|
||||
```
|
||||
type(dollar_tree_list)
|
||||
len(dollar_tree_list)
|
||||
```
|
||||
|
||||
The content from this BeautifulSoup "ResultSet" can be extracted using **.contents**. This is also a good time to create a single representative example.
|
||||
|
||||
|
||||
```
|
||||
example = dollar_tree_list[2] # a representative example
|
||||
example_content = example.contents
|
||||
print(example_content)
|
||||
```
|
||||
|
||||
Use **.attrs** to find what attributes are present in the contents of this object. Note: **.contents** usually returns a list of exactly one item, so the first step is to index that item using the bracket notation.
|
||||
|
||||
|
||||
```
|
||||
example_content = example.contents[0]
|
||||
example_content.attrs
|
||||
```
|
||||
|
||||
Now that I can see that **href** is an attribute, it can be extracted like a dictionary item:
|
||||
|
||||
|
||||
```
|
||||
example_href = example_content['href']
|
||||
print(example_href)
|
||||
```
|
||||
|
||||
### Putting together our web scraper
|
||||
|
||||
All that exploration has given us a path forward. Here's the cleaned-up version of the logic we figured out above.
|
||||
|
||||
|
||||
```
|
||||
city_hrefs = [] # initialise empty list
|
||||
|
||||
for i in dollar_tree_list:
|
||||
cont = i.contents[0]
|
||||
href = cont['href']
|
||||
city_hrefs.append(href)
|
||||
|
||||
# check to be sure all went well
|
||||
for i in city_hrefs[:2]:
|
||||
print(i)
|
||||
```
|
||||
|
||||
The output is a list of URLs of Family Dollar stores in Idaho to scrape.
|
||||
|
||||
That said, I still don't have address information! Now, each city URL needs to be scraped to get this information. So we restart the process, using a single, representative example.
|
||||
|
||||
|
||||
```
|
||||
page2 = requests.get(city_hrefs[2]) # again establish a representative example
|
||||
soup2 = BeautifulSoup(page2.text, 'html.parser')
|
||||
```
|
||||
|
||||
![Family Dollar map and code][19]
|
||||
|
||||
The address information is nested within **type= "application/ld+json"**. After doing a lot of geolocation scraping, I've come to recognize this as a common structure for storing address information. Fortunately, **soup.find_all()** also enables searching on **type**.
|
||||
|
||||
|
||||
```
|
||||
arco = soup2.find_all(type="application/ld+json")
|
||||
print(arco[1])
|
||||
```
|
||||
|
||||
The address information is in the second list member! Finally!
|
||||
|
||||
I extracted the contents (from the second list item) using **.contents** (this is a good default action after filtering the soup). Again, since the output of contents is a list of one, I indexed that list item:
|
||||
|
||||
|
||||
```
|
||||
arco_contents = arco[1].contents[0]
|
||||
arco_contents
|
||||
```
|
||||
|
||||
Wow, looking good. The format presented here is consistent with the JSON format (also, the type did have "**json**" in its name). A JSON object can act like a dictionary with nested dictionaries inside. It's actually a nice format to work with once you become familiar with it (and it's certainly much easier to program than a long series of RegEx commands). Although this structurally looks like a JSON object, it is still a bs4 object and needs a formal programmatic conversion to JSON to be accessed as a JSON object:
|
||||
|
||||
|
||||
```
|
||||
arco_json = json.loads(arco_contents)

type(arco_json)
print(arco_json)
|
||||
```
|
||||
|
||||
In that content is a key called **address** that has the desired address information in the smaller nested dictionary. This can be retrieved thusly:
|
||||
|
||||
|
||||
```
|
||||
arco_address = arco_json['address']
|
||||
arco_address
|
||||
```
|
||||
|
||||
Okay, we're serious this time. Now I can iterate over the list store URLs in Idaho:
|
||||
|
||||
|
||||
```
|
||||
locs_dict = [] # initialise empty list
|
||||
|
||||
for link in city_hrefs:
|
||||
locpage = requests.get(link) # request page info
|
||||
locsoup = BeautifulSoup(locpage.text, 'html.parser')
|
||||
# parse the page's content
|
||||
locinfo = locsoup.find_all(type="application/ld+json")
|
||||
# extract specific element
|
||||
loccont = locinfo[1].contents[0]
|
||||
# get contents from the bs4 element set
|
||||
locjson = json.loads(loccont) # convert to json
|
||||
locaddr = locjson['address'] # get address
|
||||
locs_dict.append(locaddr) # add address to list
|
||||
```
|
||||
|
||||
### Cleaning our web scraping results with pandas
|
||||
|
||||
We have loads of data in a dictionary, but we have some additional crud that will make reusing our data more complex than it needs to be. To do some final data organization steps, we convert to a pandas data frame, drop the unneeded columns ("**@type**" and "**addressCountry**"), and check the top five rows to ensure that everything looks alright.
|
||||
|
||||
|
||||
```
|
||||
locs_df = df.from_records(locs_dict)
|
||||
locs_df.drop(['@type', 'addressCountry'], axis = 1, inplace = True)
|
||||
locs_df.head(n = 5)
|
||||
```
|
||||
|
||||
Make sure to save results!!
|
||||
|
||||
|
||||
```
|
||||
locs_df.to_csv("family_dollar_ID_locations.csv", sep = ",", index = False)
|
||||
```
|
||||
|
||||
We did it! There is a comma-separated list of all the Idaho Family Dollar stores. What a wild ride.
|
||||
|
||||
### A few words on Selenium and data scraping
|
||||
|
||||
[Selenium][5] is a common utility for automatic interaction with a webpage. To explain why it's essential to use at times, let's go through an example using Walgreens' website. **Inspect Element** provides the code for what is displayed in a browser:
|
||||
|
||||
![Walgreens location page and code][20]
|
||||
|
||||
|
||||
|
||||
While **View Page Source** provides the code for what **requests** will obtain:
|
||||
|
||||
![Walgreens source code][21]
|
||||
|
||||
When these two don't agree, there are plugins modifying the source code—so, it should be accessed after the page has loaded in a browser. **requests** cannot do that, but **Selenium** can.
|
||||
|
||||
Selenium requires a web driver to retrieve the content. It actually opens a web browser, and this page content is collected. Selenium is powerful—it can interact with loaded content in many ways (read the documentation). After getting data with **Selenium**, continue to use **BeautifulSoup** as before:
|
||||
|
||||
|
||||
```
|
||||
url = "[https://www.walgreens.com/storelistings/storesbycity.jsp?requestType=locator\&state=ID][22]"
|
||||
driver = webdriver.Firefox(executable_path = 'mypath/geckodriver.exe')
|
||||
driver.get(url)
|
||||
soup_ID = BeautifulSoup(driver.page_source, 'html.parser')
|
||||
store_link_soup = soup_ID.find_all(class_ = 'col-xl-4 col-lg-4 col-md-4')
|
||||
```
|
||||
|
||||
I didn't need Selenium in the case of Family Dollar, but I do keep it on hand for those times when rendered content differs from source code.
|
||||
|
||||
### Wrapping up
|
||||
|
||||
In conclusion, when using web scraping to accomplish a meaningful task:
|
||||
|
||||
* Be patient
|
||||
* Consult the manuals (these are very helpful)
|
||||
|
||||
|
||||
|
||||
If you are curious about the answer:
|
||||
|
||||
![Family Dollar locations map][23]
|
||||
|
||||
There are many many Family Dollar stores in America.
|
||||
|
||||
The complete source code is:
|
||||
|
||||
|
||||
```
|
||||
import requests
|
||||
from bs4 import BeautifulSoup
|
||||
import json
|
||||
from pandas import DataFrame as df
|
||||
|
||||
page = requests.get("<https://www.familydollar.com/locations/>")
|
||||
soup = BeautifulSoup(page.text, 'html.parser')
|
||||
|
||||
# find all state links
|
||||
state_list = soup.find_all(class_ = 'itemlist')
|
||||
|
||||
state_links = []
|
||||
|
||||
for i in state_list:
|
||||
cont = i.contents[0]
|
||||
attr = cont.attrs
|
||||
hrefs = attr['href']
|
||||
state_links.append(hrefs)
|
||||
|
||||
# find all city links
|
||||
city_links = []
|
||||
|
||||
for link in state_links:
|
||||
page = requests.get(link)
|
||||
soup = BeautifulSoup(page.text, 'html.parser')
|
||||
familydollar_list = soup.find_all(class_ = 'itemlist')
|
||||
for store in familydollar_list:
|
||||
cont = store.contents[0]
|
||||
attr = cont.attrs
|
||||
city_hrefs = attr['href']
|
||||
city_links.append(city_hrefs)
|
||||
# to get individual store links
|
||||
store_links = []
|
||||
|
||||
for link in city_links:
|
||||
locpage = requests.get(link)
|
||||
locsoup = BeautifulSoup(locpage.text, 'html.parser')
|
||||
locinfo = locsoup.find_all(type="application/ld+json")
|
||||
for i in locinfo:
|
||||
loccont = i.contents[0]
|
||||
locjson = json.loads(loccont)
|
||||
try:
|
||||
store_url = locjson['url']
|
||||
store_links.append(store_url)
|
||||
except:
|
||||
pass
|
||||
|
||||
# get address and geolocation information
|
||||
stores = []
|
||||
|
||||
for store in store_links:
|
||||
storepage = requests.get(store)
|
||||
storesoup = BeautifulSoup(storepage.text, 'html.parser')
|
||||
storeinfo = storesoup.find_all(type="application/ld+json")
|
||||
for i in storeinfo:
|
||||
storecont = i.contents[0]
|
||||
storejson = json.loads(storecont)
|
||||
try:
|
||||
store_addr = storejson['address']
|
||||
store_addr.update(storejson['geo'])
|
||||
stores.append(store_addr)
|
||||
except:
|
||||
pass
|
||||
|
||||
# final data parsing
|
||||
stores_df = df.from_records(stores)
|
||||
stores_df.drop(['@type', 'addressCountry'], axis = 1, inplace = True)
|
||||
stores_df['Store'] = "Family Dollar"
|
||||
|
||||
stores_df.to_csv("family_dollar_locations.csv", sep = ",", index = False)
|
||||
```
|
||||
|
||||
\--
|
||||
_Author's note: This article is an adaptation of a [talk I gave at PyCascades][24] in Portland, Oregon on February 9, 2020._
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
via: https://opensource.com/article/20/5/web-scraping-python
|
||||
|
||||
作者:[Julia Piaskowski][a]
|
||||
选题:[lujun9972][b]
|
||||
译者:[译者ID](https://github.com/stevenzdg988)
|
||||
校对:[校对者ID](https://github.com/校对者ID)
|
||||
|
||||
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
|
||||
|
||||
[a]: https://opensource.com/users/julia-piaskowski
|
||||
[b]: https://github.com/lujun9972
|
||||
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/bus_html_code.png?itok=VjUmGsnl (HTML code)
|
||||
[2]: https://requests.readthedocs.io/en/master/
|
||||
[3]: https://beautiful-soup-4.readthedocs.io/en/latest/
|
||||
[4]: https://pandas.pydata.org/
|
||||
[5]: https://www.selenium.dev/
|
||||
[6]: https://github.com/jpiaskowski/pycas2020_web_scraping
|
||||
[7]: https://opensource.com/article/20/4/install-python-linux
|
||||
[8]: https://opensource.com/article/19/8/how-install-python-windows
|
||||
[9]: https://opensource.com/article/19/5/python-3-default-mac
|
||||
[10]: https://github.com/jpiaskowski/pycas2020_web_scraping/blob/master/example/Familydollar_location_scrape-all-states.ipynb
|
||||
[11]: https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html
|
||||
[12]: https://opensource.com/article/20/4/build-websites
|
||||
[13]: https://www.contentkingapp.com/academy/robotstxt/
|
||||
[14]: https://www.researchgate.net/publication/324907302_Legality_and_Ethics_of_Web_Scraping
|
||||
[15]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3221625
|
||||
[16]: https://locations.familydollar.com/id/
|
||||
[17]: https://opensource.com/sites/default/files/uploads/familydollar1.png (Family Dollar Idaho locations page)
|
||||
[18]: https://opensource.com/sites/default/files/uploads/familydollar2.png (Family Dollar page source code)
|
||||
[19]: https://opensource.com/sites/default/files/uploads/familydollar3.png (Family Dollar map and code)
|
||||
[20]: https://opensource.com/sites/default/files/uploads/walgreens1.png (Walgreens location page and code)
|
||||
[21]: https://opensource.com/sites/default/files/uploads/walgreens2.png (Walgreens source code)
|
||||
[22]: https://www.walgreens.com/storelistings/storesbycity.jsp?requestType=locator\&state=ID
|
||||
[23]: https://opensource.com/sites/default/files/uploads/family_dollar_locations.png (Family Dollar locations map)
|
||||
[24]: https://2020.pycascades.com/talks/adventures-in-babysitting-webscraping-for-python-and-html-novices/
|
@@ -1,77 +0,0 @@
[#]: collector: (lujun9972)
[#]: translator: (geekpi)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (Improve your productivity with this lightweight Linux desktop)
[#]: via: (https://opensource.com/article/21/1/elementary-linux)
[#]: author: (Kevin Sonney https://opensource.com/users/ksonney)

Improve your productivity with this lightweight Linux desktop
|
||||
======
|
||||
ElementaryOS provides a fast, lightweight, and efficient desktop to keep
|
||||
you productive in the new year.
|
||||
![Business woman on laptop sitting in front of window][1]
|
||||
|
||||
In prior years, this annual series covered individual apps. This year, we are looking at all-in-one solutions in addition to strategies to help in 2021. Welcome to day 1 of 21 Days of Productivity in 2021.
|
||||
|
||||
When looking for tools to be more productive, it is easy to cobble together a working collection of applications that almost, but just don't quite, play nice together. In prior years, we have talked about individual email applications, calendaring applications, note-taking applications, and so on. There are always bumps, though—places where it either takes custom scripts or complicated export/import steps to make a tool work.
|
||||
|
||||
[ElementaryOS][2] is a complete desktop with a beautiful, functional, and productive environment.
|
||||
|
||||
![Linux ElementaryOS Desktop][3]
|
||||
|
||||
ElementaryOS desktop (Kevin Sonney, [CC BY-SA 4.0][4])
|
||||
|
||||
ElementaryOS is a pay-what-you-want open source project based on the popular Ubuntu Linux distribution. The initial setup and installation will be very familiar to anyone who has installed Ubuntu in the past. However, once logged in, the experience can be very different.
|
||||
|
||||
ElementaryOS uses the [Gala window manager][5] and the Pantheon shell. Both were developed specifically for Elementary. Once installed, the desktop is very minimal, and it only provides a small number of lightweight apps. These apps include a web browser, terminal, mail client, and calendaring client. It also has an App Center that allows you to install both free and commercial applications curated by the Elementary team.
|
||||
|
||||
![Linux ElementaryOS Mail and calendar][6]
|
||||
|
||||
ElementaryOS mail and calendar (Kevin Sonney, [CC BY-SA 4.0][4])
|
||||
|
||||
The [mail][7] and [calendar][8] apps will look very familiar since both have been used by other distributions for a while now. Mail started as a fork of [Geary][9], and the calendar is known as [Maya][10] in other places. Setup is very easy for both. The two apps only authenticate with a username/password by default, so users who require two-factor authentication will need to go through some extra steps. The interface is exceptionally lightweight and fast for both.
|
||||
|
||||
![Linux ElementaryOS app center][11]
|
||||
|
||||
The App Center (Kevin Sonney, [CC BY-SA 4.0][4])
|
||||
|
||||
By default, ElementaryOS does not include a to-do list or note-taking application. This situation is where the App Center comes into play. In the App Center, there is a large selection of applications to fill the gaps. Two of the apps really stand out. The first is [Planner][12], a clean, lightweight to-do list manager. It supports multiple lists, scheduled and recurring tasks, projects, and sub-projects. It can also sync with the popular [Todoist][13] online application, but that isn't required.
|
||||
|
||||
The second app from the App Center is [Notes-Up][14], a note-taking app that uses Markdown for rich text. It allows a user to create multiple notes in multiple notebooks and has both a "View" and "Edit" option so that you can preview how the final document will look. Again, the app is fast, lightweight, and very minimalistic, keeping with the overall ElementaryOS look and feel.
|
||||
|
||||
![ElementaryOS Planner and Notes-up][15]
|
||||
|
||||
Planner and Notes-up (Kevin Sonney, [CC BY-SA 4.0][4])
|
||||
|
||||
If you don't like the default apps, Elementary is based on Ubuntu LTS, so the entire Ubuntu app ecosystem is available if you need to use other applications. Overall, though, the ElementaryOS default provides a fast, lightweight, and efficient desktop to keep you productive in the new year.
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
via: https://opensource.com/article/21/1/elementary-linux
|
||||
|
||||
作者:[Kevin Sonney][a]
|
||||
选题:[lujun9972][b]
|
||||
译者:[译者ID](https://github.com/译者ID)
|
||||
校对:[校对者ID](https://github.com/校对者ID)
|
||||
|
||||
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
|
||||
|
||||
[a]: https://opensource.com/users/ksonney
|
||||
[b]: https://github.com/lujun9972
|
||||
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/lenovo-thinkpad-laptop-concentration-focus-windows-office.png?itok=-8E2ihcF (Woman using laptop concentrating)
|
||||
[2]: https://elementary.io/
|
||||
[3]: https://opensource.com/sites/default/files/pictures/elementaryos-desktop.png (ElementaryOS Desktop)
|
||||
[4]: https://creativecommons.org/licenses/by-sa/4.0/
|
||||
[5]: https://github.com/elementary/gala
|
||||
[6]: https://opensource.com/sites/default/files/pictures/elementaryos-mail-calendar.png (Linux ElementaryOS Mail and calendar)
|
||||
[7]: https://github.com/elementary/mail
|
||||
[8]: https://github.com/elementary/calendar
|
||||
[9]: https://wiki.gnome.org/Apps/Geary
|
||||
[10]: https://launchpad.net/maya
|
||||
[11]: https://opensource.com/sites/default/files/pictures/elementaryos-app-center_0.png (Linux ElementaryOS app center)
|
||||
[12]: https://appcenter.elementary.io/com.github.alainm23.planner/
|
||||
[13]: https://todoist.com/
|
||||
[14]: https://github.com/Philip-Scott/Notes-up
|
||||
[15]: https://opensource.com/sites/default/files/pictures/planner-notes-up.png (ElementaryOS Planner and Notes-up)
|
@@ -1,5 +1,5 @@
 [#]: collector: (lujun9972)
-[#]: translator: ( )
+[#]: translator: (geekpi)
 [#]: reviewer: ( )
 [#]: publisher: ( )
 [#]: url: ( )

sources/tech/20210117 3 steps to achieving Inbox Zero.md (new file, 70 lines)

@@ -0,0 +1,70 @@
[#]: collector: (lujun9972)
[#]: translator: ( )
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (3 steps to achieving Inbox Zero)
[#]: via: (https://opensource.com/article/21/1/inbox-zero)
[#]: author: (Kevin Sonney https://opensource.com/users/ksonney)

3 steps to achieving Inbox Zero
|
||||
======
|
||||
Inbox Zero is a concept that helps you manage and prioritize all that
|
||||
overwhelming email.
|
||||
![Selfcare, drinking tea on the porch][1]
|
||||
|
||||
In prior years, this annual series covered individual apps. This year, we are looking at all-in-one solutions in addition to strategies to help in 2021. Welcome to day 7 of 21 Days of Productivity in 2021.
|
||||
|
||||
There are lots of ways people manage email. Most people I have talked to over the past few years fall into one of two categories: The people who keep everything in their Inbox folder, and those who do not. For those that do not, the concept of _Inbox Zero_ comes into play frequently.
|
||||
|
||||
Do you think the role you are responsible for at work -- developer, engineer, [SRE][2], [sysadmin][3], [product manager][4], [community manager][5], etc -- might play into which category someone falls into?
|
||||
|
||||
Inbox Zero is a term coined by Merlin Mann in 2006, and it _looks_ self-explanatory. Inbox Zero is an empty Inbox, right?
|
||||
|
||||
![Mailspring inbox zero][6]
|
||||
|
||||
Mailspring says I did it! (Kevin Sonney, [CC BY-SA 4.0][7])
|
||||
|
||||
If this were the case, Inbox Zero would be really easy to achieve. Just mark everything read and move it out of the Inbox. Who cares if it needs attention, scheduling, or some sort of action?
|
||||
|
||||
That is not the purpose of Inbox Zero. That is declaring [Inbox bankruptcy][8], a term popularized by Lawrence Lessig. Basically, Inbox bankruptcy is deciding there is too much to deal with, saying "to heck with it," and starting over from scratch.
|
||||
|
||||
Inbox Zero is about considering all the pending messages, deciding what to do with them, and then actually doing it. How can this be done quickly? My personal process is to scan everything and move anything that does not require action or my attention to a "Read Later" folder. These are usually messages like mailing lists, blog updates, and social network notifications. I can read them later if I have time or need a mental break from other things.
|
||||
|
||||
![Email folder setup][9]
|
||||
|
||||
My folder setup (Kevin Sonney, [CC BY-SA 4.0][7])
|
||||
|
||||
The next step is to start reading what is left. If the email is a task I can do in under two minutes, I just do it. This includes confirming or declining meetings, adding a conference link to a meeting, or simple queries that I can reply to right away. These all get moved or archived and are then out of the Inbox.
|
||||
|
||||
Finally, there should be a list of messages that require more in-depth thought, scheduling for the future, or lots of focus. These get put on my to-do list or calendar with deadline reminders, so I have dedicated time to work on them. Once these are scheduled, I move them to folders related to their topic. For me, these are tasks like reading comments on one of my podcasts, `git pull` requests, and lots of work-related emails.
|
||||
|
||||
![Inbox replenished][10]
|
||||
|
||||
Oh look, more email (Kevin Sonney, [CC BY-SA 4.0][7])
|
||||
|
||||
Hopefully, at this point, my Inbox is empty or close to it, and I can close my email program and not look at it for a few hours. As we discussed in an earlier post, email is not instant messaging and does not need monitoring 24x7.
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
via: https://opensource.com/article/21/1/inbox-zero
|
||||
|
||||
作者:[Kevin Sonney][a]
|
||||
选题:[lujun9972][b]
|
||||
译者:[译者ID](https://github.com/译者ID)
|
||||
校对:[校对者ID](https://github.com/校对者ID)
|
||||
|
||||
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
|
||||
|
||||
[a]: https://opensource.com/users/ksonney
|
||||
[b]: https://github.com/lujun9972
|
||||
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/coffee_tea_selfcare_wfh_porch_520.png?itok=2qXG0T7u (Selfcare, drinking tea on the porch)
|
||||
[2]: https://opensource.com/article/18/10/sre-startup
|
||||
[3]: https://opensource.com/article/20/12/sysadmin
|
||||
[4]: https://opensource.com/article/20/2/product-management-open-source-company
|
||||
[5]: https://opensource.com/article/20/9/open-source-community-managers
|
||||
[6]: https://opensource.com/sites/default/files/pictures/mailspring-inbox-zero.png (Mailspring inbox zero)
|
||||
[7]: https://creativecommons.org/licenses/by-sa/4.0/
|
||||
[8]: https://en.wikipedia.org/wiki/Email_bankruptcy
|
||||
[9]: https://opensource.com/sites/default/files/pictures/email-folder-setup.png (Email folder setup)
|
||||
[10]: https://opensource.com/sites/default/files/pictures/inbox-replenished.png (Inbox replenished)
|
@@ -0,0 +1,171 @@
[#]: collector: (lujun9972)
[#]: translator: ( )
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (KDE Customization Guide: Here are 11 Ways You Can Change the Look and Feel of Your KDE-Powered Linux Desktop)
[#]: via: (https://itsfoss.com/kde-customization/)
[#]: author: (Dimitrios Savvopoulos https://itsfoss.com/author/dimitrios/)

KDE Customization Guide: Here are 11 Ways You Can Change the Look and Feel of Your KDE-Powered Linux Desktop
|
||||
======
|
||||
|
||||
[KDE Plasma desktop][1] is unarguably the pinnacle of customization, as you can change almost anything you want. You can go to the extent of making it act as a [tiling window manager][2].
|
||||
|
||||
KDE Plasma can confuse a beginner by the degree of customization it offers. As options tend to pile on top of options, the user starts getting lost.
|
||||
|
||||
To address that issue, I’ll show you the key points of KDE Plasma customization that you should be aware of.
|
||||
|
||||
![][3]
|
||||
|
||||
### Customizing KDE Plasma
|
||||
|
||||
I have used [KDE Neon][4] in this tutorial, but you may follow it with any distribution that uses KDE Plasma desktop.
|
||||
|
||||
#### 1\. **Plasma Widgets**
|
||||
|
||||
Desktop widgets can add convenience to the user experience, as you can immediately access important items on the desktop.
|
||||
|
||||
Students and professionals nowadays are working with computers more than ever before, a useful widget can be sticky notes.
|
||||
|
||||
Right-click on the desktop and select “Add Widgets”.
|
||||
|
||||
![][5]
|
||||
|
||||
Choose the widget you like, and simply drag and drop it to the desktop.
|
||||
|
||||
![][6]
|
||||
|
||||
#### 2\. **Desktop wallpaper**
|
||||
|
||||
This one is obvious: change the wallpaper to change the look of your desktop.
|
||||
|
||||
![][7]
|
||||
|
||||
At the wallpaper tab you can change more than just the wallpaper. From the **“Layout”** pulldown menu, you can select if your desktop will have icons or not.
|
||||
|
||||
The **“Folder View”** layout is named from the traditional desktop folder in your home directory, where you can access your desktop files. Thus, the **“Folder View”** option will retain the icons on the desktop.
|
||||
|
||||
If you select the **“Desktop”** layout, it will leave your desktop icon-free and plain. However, you will still be able to access the desktop folder in your home directory.
|
||||
|
||||
![][8]
|
||||
|
||||
In **Wallpaper Type**, you can select whether you want a wallpaper at all and whether it stays still or changes over time; finally, **Positioning** controls how it fits on your screen.
|
||||
|
||||
#### 3\. Mouse Actions
|
||||
|
||||
Each mouse button can be configured to one of the following actions:
|
||||
|
||||
* Switch Desktop
|
||||
* Paste
|
||||
* Switch Window
|
||||
* Standard Menu
|
||||
* Application Launcher
|
||||
* Switch Activity
|
||||
|
||||
|
||||
|
||||
The right-click is set to **Standard Menu**, which is the menu when you right-click on the desktop. The contents of the menu can be changed by clicking on the settings icon next to it.
|
||||
|
||||
![][9]
|
||||
|
||||
#### 4\. Location of your desktop content
|
||||
|
||||
This option is only available if you select the “Folder View” in the wallpaper tab. By default, the content shown on your desktop is what you have at the desktop folder at the home directory. The location tab gives you the option to change the content on your desktop, by selecting a different folder.
|
||||
|
||||
![][10]
|
||||
|
||||
#### 5\. Desktop Icons
|
||||
|
||||
Here you can select how the icons will be arranged (horizontally or vertically), right or left, the sorting criteria and their size. If this is not enough, you have additional aesthetic features to explore.
|
||||
|
||||
![][11]
|
||||
|
||||
#### 6\. Desktop Filters
|
||||
|
||||
Let’s be honest with ourselves! I believe every user ends up with a cluttered desktop at some point. If your desktop becomes messy and you can’t find a file, you can apply a filter by name or type and find what you need. Still, it’s better to make good file housekeeping a habit!
|
||||
|
||||
![][12]
|
||||
|
||||
#### 7\. Application Dashboard
If you like the GNOME 3 application launcher, you may try the KDE Application Dashboard. All you have to do is right-click on the menu icon > Show Alternatives.

![][13]

Click on “Application Dashboard”.

![][14]
#### 8\. Window Manager Theme
As you saw in the [Xfce customization tutorial][15], you can change the window manager theme independently in KDE as well. This way, you can choose one theme for the panel and a different theme for the window manager. If the preinstalled themes are not enough, you can download more.

Inspired by the [MX Linux][16] Xfce edition, I couldn’t resist my favourite “Arc Dark”.

Navigate to Settings > Application Style > Window decorations > Theme.

![][17]
#### 9\. Global theme
As mentioned above, the look and feel of the KDE Plasma panel can be configured from the Settings > Global theme tab. There aren’t many themes preinstalled, but you can download one to suit your taste. The default Breeze Dark is eye candy, though.

![][18]
#### 10\. System Icons
The system icon style can have a significant impact on how the desktop looks. Whatever your choice, you should pick the dark icon version if your global theme is dark. The only difference lies in the icon text contrast, which is inverted relative to the panel colour to keep it readable. You can easily access the icon tab in the system settings.

![][19]
#### 11\. System fonts
System fonts are not in the spotlight of customization, but if you spend half of your day in front of a screen, they can be a factor in eye strain. Users with dyslexia will appreciate the [OpenDyslexic][20] font. My personal choice is the Ubuntu font, which I find not only aesthetically pleasing but also comfortable to look at all day.

You can, of course, [install more fonts on your Linux system][21] by downloading them from external sources.

![][22]
### Conclusion
KDE Plasma is one of the most flexible and customizable desktops available to the Linux community. Whether you are a tinkerer or not, KDE Plasma is a constantly evolving desktop environment with amazing modern features. The best part is that it also runs well on moderate system configurations.

I tried to make this guide beginner-friendly. Of course, there is more advanced customization, like the [window switching animation][23]. If you are aware of some, why not share it with us in the comment section?

--------------------------------------------------------------------------------

via: https://itsfoss.com/kde-customization/

作者:[Dimitrios Savvopoulos][a]
选题:[lujun9972][b]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)

本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出

[a]: https://itsfoss.com/author/dimitrios/
[b]: https://github.com/lujun9972
[1]: https://kde.org/plasma-desktop/
[2]: https://github.com/kwin-scripts/kwin-tiling
[3]: https://i0.wp.com/itsfoss.com/wp-content/uploads/2021/01/kde-neon-neofetch.png?resize=800%2C600&ssl=1
[4]: https://itsfoss.com/kde-neon-review/
[5]: https://i2.wp.com/itsfoss.com/wp-content/uploads/2021/01/16-kde-neon-add-widgets.png?resize=800%2C500&ssl=1
[6]: https://i1.wp.com/itsfoss.com/wp-content/uploads/2021/01/17-kde-neon-widgets.png?resize=800%2C768&ssl=1
[7]: https://i0.wp.com/itsfoss.com/wp-content/uploads/2021/01/1-kde-neon-configure-desktop.png?resize=800%2C500&ssl=1
[8]: https://i2.wp.com/itsfoss.com/wp-content/uploads/2021/01/2-kde-neon-wallpaper.png?resize=800%2C600&ssl=1
[9]: https://i0.wp.com/itsfoss.com/wp-content/uploads/2021/01/3-kde-neon-mouse-actions.png?resize=800%2C600&ssl=1
[10]: https://i2.wp.com/itsfoss.com/wp-content/uploads/2021/01/10-kde-neon-location.png?resize=800%2C650&ssl=1
[11]: https://i0.wp.com/itsfoss.com/wp-content/uploads/2021/01/4-kde-neon-desktop-icons.png?resize=798%2C635&ssl=1
[12]: https://i1.wp.com/itsfoss.com/wp-content/uploads/2021/01/11-kde-neon-desktop-icons-filter.png?resize=800%2C650&ssl=1
[13]: https://i0.wp.com/itsfoss.com/wp-content/uploads/2021/01/5-kde-neon-show-alternatives.png?resize=800%2C500&ssl=1
[14]: https://i0.wp.com/itsfoss.com/wp-content/uploads/2021/01/6-kde-neon-application-dashboard.png?resize=800%2C450&ssl=1
[15]: https://itsfoss.com/customize-xfce/
[16]: https://itsfoss.com/mx-linux-kde-edition/
[17]: https://i1.wp.com/itsfoss.com/wp-content/uploads/2021/01/12-kde-neon-window-manager.png?resize=800%2C512&ssl=1
[18]: https://i1.wp.com/itsfoss.com/wp-content/uploads/2021/01/15-kde-neon-global-theme.png?resize=800%2C524&ssl=1
[19]: https://i0.wp.com/itsfoss.com/wp-content/uploads/2021/01/13-kde-neon-system-icons.png?resize=800%2C524&ssl=1
[20]: https://www.opendyslexic.org/about
[21]: https://itsfoss.com/install-fonts-ubuntu/
[22]: https://i1.wp.com/itsfoss.com/wp-content/uploads/2021/01/14-kde-neon-fonts.png?resize=800%2C524&ssl=1
[23]: https://itsfoss.com/customize-task-switcher-kde/
@ -0,0 +1,493 @@
|
||||
[#]: collector: (lujun9972)
|
||||
[#]: translator: (stevenzdg988)
|
||||
[#]: reviewer: ( )
|
||||
[#]: publisher: ( )
|
||||
[#]: url: ( )
|
||||
[#]: subject: (A beginner's guide to web scraping with Python)
|
||||
[#]: via: (https://opensource.com/article/20/5/web-scraping-python)
|
||||
[#]: author: (Julia Piaskowski https://opensource.com/users/julia-piaskowski)
|
||||
|
||||
利用 Python 爬网站的新手指南
|
||||
======
|
||||
通过基本的 Python 工具,获得爬取完整 HTML 网站的实践经验。
|
||||
|
||||
![HTML代码][1]
|
||||
|
||||
有很多很棒的书可以帮助您学习 Python,但是谁会真正把这些书从头到尾(从 A 到 Z)读完呢?(剧透:反正不是我)。
|
||||
|
||||
许多人觉得教学书籍很有用,但我通常不会从头到尾地读完一本书来学习。我的学习方式是:做一个项目,埋头苦干,弄清楚一些内容,然后再去读另一本书。因此,暂时丢掉书,让我们一起学习 Python。
|
||||
|
||||
接下来是我的第一个 Python 爬取项目的指南,它假定您对 Python 和 HTML 的了解都很有限。本文旨在说明如何使用 Python 的 [requests][2] 库访问网页内容,以及如何使用 [BeautifulSoup4][3]、`json` 和 [pandas][4] 库解析这些内容。我会简要介绍 [Selenium][5] 库,但不会深入讲解如何使用它——那个主题值得单独写一篇教程。最后,我希望向您展示一些技巧和提示,让您在网络爬取过程中遇到问题时不至于不知所措。
|
||||
|
||||
### 安装依赖
|
||||
|
||||
我的 [GitHub 存储库][6] 中提供了本指南的所有资源。如果需要安装 Python3 的帮助,请查看 [Linux][7],[Windows][8] 和 [Mac][9] 的教程。
|
||||
|
||||
|
||||
```
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install requests bs4 pandas
```
|
||||
|
||||
如果您喜欢使用 JupyterLab ,则可以使用 [notebook][10] 运行所有代码。[安装 JupyterLab][11] 有很多方法,这是其中一种:
|
||||
|
||||
|
||||
```
# from the same virtual environment as above, run:
$ pip install jupyterlab
```
|
||||
|
||||
### 为网站抓取项目设定目标
|
||||
|
||||
现在我们已经安装了依赖项,但是爬取网页需要做什么?
|
||||
|
||||
让我们后退一步,确保使目标清晰。下面是成功完成网站爬取项目需求列表。
|
||||
|
||||
* 收集有效的构建网站爬取的信息。
|
||||
* 基于法律和遵循道德规范的收集利用网站爬取工具下载的信息。
|
||||
* 了解如何在 HTML 代码中找到目标信息。
|
||||
* 利用恰当的工具:在此情况下,需要使用 **BeautifulSoup** 库和 **requests** 库。
|
||||
* 知道(或愿意去学习)如何解析 JSON 对象。
|
||||
* 有足够的 **pandas** 数据处理技能。
|
||||
|
||||
|
||||
|
||||
关于 HTML 的注释:HTML 是运行在 Internet 上的“猛兽”,但我们最需要了解的是标签的工作方式。标签是一对由尖括号包围关键词(一般成对出现,其内容在两个标签中间)。比如,这是一个伪标签,称为 “`pro-tip`”:
|
||||
|
||||
|
||||
```
<pro-tip> All you need to know about html is how tags work </pro-tip>
```
|
||||
|
||||
我们可以通过调用标签 “`pro-tip`” 来访问其中的信息("All you need to know…")。本教程将进一步介绍如何查找和访问标签。要进一步了解 HTML 基础知识,请查看 [本文][12]。
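
为了更直观一些,下面给出一段简短的示意性代码(它不是原文的一部分),演示本文稍后会用到的 BeautifulSoup 如何取出这个伪标签中间的文本:

```
from bs4 import BeautifulSoup

html = "<pro-tip> All you need to know about html is how tags work </pro-tip>"
soup = BeautifulSoup(html, "html.parser")

# 按标签名找到这个(伪)标签,并取出两个标签之间的文本
print(soup.find("pro-tip").text.strip())
```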
|
||||
|
||||
### 在网站爬取项目中查找内容
|
||||
|
||||
在什么情况下,利用网站爬取来采集数据才是合适的?下面是我的判断依据。

首先,没有可用的公共数据 API。如果能通过 API 获取结构化数据,那会容易得多,也有助于厘清收集数据的合法性和道德规范。其次,需要大量具有规则、可重复格式的结构化数据。爬取网站可能会很痛苦:`BeautifulSoup`(`bs4`)能让操作更容易,但网站各自的特殊之处不可避免地需要定制处理。并不要求数据的格式完全相同,但格式越统一,事情就越容易。存在的“边际案例”(偏离规范的情况)越多,爬取就越复杂。
|
||||
|
||||
免责声明:我没有参加过法律培训;以下内容不打算作为正式的法律建议。
|
||||
|
||||
关于合法性,能够获取大量有价值的信息可能令人兴奋,但仅仅因为这些信息可以获取,并不意味着就允许你这样做。
|
||||
|
||||
值得庆幸的是,有一些公开信息可以指导我们的道德规范和网站爬取行为。大多数网站都有与之关联的 [robots.txt][13] 文件,指明允许哪些爬取活动、哪些不被允许。它主要用于与搜索引擎(终极的网站抓取工具)进行交互。然而,网站上的许多信息都被视为公开信息。因此,与其把 `robots.txt` 文件看成具有法律约束力的文档,不如把它视为一组建议。另外,`robots.txt` 文件并没有涉及符合道德规范地收集和使用数据这类主题。
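
作为补充(原文没有这段代码),下面是一个用 Python 标准库 `urllib.robotparser` 读取 `robots.txt` 的简单示意。示例中的 URL 沿用了本文后面要爬取的站点,实际能否抓取,仍以目标网站的声明和服务条款为准:

```
from urllib.robotparser import RobotFileParser

# 读取目标站点的 robots.txt
rp = RobotFileParser()
rp.set_url("https://locations.familydollar.com/robots.txt")
rp.read()

# can_fetch() 返回 True,表示 robots.txt 允许指定的 user-agent 抓取该路径
print(rp.can_fetch("*", "https://locations.familydollar.com/id/"))
```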
|
||||
|
||||
在开始爬取项目之前,问自己以下问题:
|
||||
|
||||
* 我是否在爬取版权材料?
|
||||
* 我的爬取活动会危害个人隐私吗?
|
||||
* 我是否发送了大量可能会使服务器超载或损坏的请求?
|
||||
* 爬取是否会暴露我不拥有的知识产权?
|
||||
* 是否有管理使用网站的服务条款,我是否遵循这些条款?
|
||||
* 我的爬取活动会减少原始数据的价值吗?(例如,我是否打算按原样重新打包数据,或者可能从原始来源中抽取(占用)网站流量)?
|
||||
|
||||
|
||||
|
||||
当我爬取一个网站时,请确保可以对所有这些问题回答 “否”。
|
||||
|
||||
要深入了解法律问题,请参阅2018年出版的 [Krotov 和 Silva 撰写的Web爬虫的合法性和道德性][14] 和 [Sellars 的二十年 Web 爬虫和计算机欺诈与滥用法案][15]。
|
||||
|
||||
### 现在开始爬网站
|
||||
|
||||
经过上述评估,我想到了一个项目。 我的目标是爬取爱达荷州所有 Family Dollar 商店的地址。 这些商店在农村地区规模很大,因此我想了解有多少家这样的商店。
|
||||
|
||||
起点是 [Family Dollar 的位置页面][16]
|
||||
|
||||
![爱达荷州 Family Dollar 所在地页面][17]
|
||||
|
||||
首先,让我们在 Python 虚拟环境中加载先决条件。 此处的代码旨在添加到 Python 文件(如果要查找名称,则为 _scraper.py_)或在 JupyterLab 的单元中运行。
|
||||
|
||||
|
||||
```
import requests # for making standard html requests
from bs4 import BeautifulSoup # magical tool for parsing html data
import json # for parsing data
from pandas import DataFrame as df # premier library for data organization
```
|
||||
|
||||
接下来,我们从目标 URL 中请求数据。
|
||||
|
||||
|
||||
```
page = requests.get("https://locations.familydollar.com/id/")
soup = BeautifulSoup(page.text, 'html.parser')
```
|
||||
|
||||
BeautifulSoup 将 HTML 或 XML 内容转换为复杂的树对象。下面是我们将会用到的几种常见对象类型(列表之后附有一个简单示例)。
|
||||
|
||||
* **BeautifulSoup** ——解析的内容
|
||||
* **Tag**——标准 HTML 标记,您将遇到 `bs4`元素的主要类型
|
||||
* **NavigableString**——标签内的文本字符串
|
||||
* **Comment**—— NavigableString 的一种特殊类型
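
下面是一个简单的示意(非原文内容),用一小段虚构的内联 HTML 来展示这几种对象类型:

```
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

demo = BeautifulSoup("<p>Hello <!--一条注释--> world</p>", "html.parser")
tag = demo.p

print(type(demo))                                     # BeautifulSoup:解析后的整份内容
print(isinstance(tag, Tag))                           # True:标准的 HTML 标签
print(isinstance(tag.contents[0], NavigableString))   # True:标签内的文本字符串
print(isinstance(tag.contents[1], Comment))           # True:NavigableString 的一种特殊类型
```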
|
||||
|
||||
|
||||
|
||||
当我们查看 **requests.get()** 的输出时,还有更多需要考虑的问题。我只使用 **page.text** 把请求到的页面转换为可读的内容,但其实还有其他输出类型(参见列表后面的示例):

  * **page.text** 表示文本(最常见)
  * **page.content** 用于逐字节的输出
  * **page.json()** 用于 JSON 对象
  * **page.raw** 用于原始套接字响应(天哪!)
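
下面用一个简短的示意(非原文内容)来区分这几种输出:`text`、`content` 和 `raw` 是响应对象的属性,而 `json()` 是方法,只有在响应体确实是 JSON 时才应该调用它:

```
import requests

page = requests.get("https://locations.familydollar.com/id/")

print(page.status_code)    # HTTP 状态码,例如 200
print(type(page.text))     # <class 'str'>:解码后的文本
print(type(page.content))  # <class 'bytes'>:原始字节
print(page.encoding)       # requests 根据响应头推断出的编码

# page.json() 只适用于返回 JSON 的响应;这个页面是 HTML,对它调用 json() 会抛出异常
```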
|
||||
|
||||
|
||||
|
||||
我只在使用拉丁字母的纯英语网站上操作。 **requests** 中的默认编码设置可以很好地解决这一问题。然而,除了纯英语网站之外,就是更大的互联网世界。为了确保 **requests** 正确解析内容,您可以设置文本的编码:
|
||||
|
||||
|
||||
```
page = requests.get(URL)
page.encoding = 'ISO-8859-1'
soup = BeautifulSoup(page.text, 'html.parser')
```
|
||||
|
||||
仔细研究 BeautifulSoup 标签,我们看到:
|
||||
|
||||
* `bs4` 元素 **tag** 正在捕获 HTML 标记
|
||||
* 它具有名称和属性,可以像字典一样访问:**tag['someAttribute']**
|
||||
* 如果标签具有相同名称的多个属性,则仅访问第一个实例。
|
||||
* 可通过 **tag.contents** 访问子标签。
|
||||
* 所有标签后代都可以通过 **tag.contents** 访问。
|
||||
* 你始终可以使用以下字符串:**re.compile("your_string")** 访问作为字符串的所有内容。
|
||||
|
||||
|
||||
|
||||
### 确定如何提取相应内容
|
||||
|
||||
警告:此过程可能令人沮丧。
|
||||
|
||||
网站爬取过程中的内容提取,可能是一个令人生畏、错漏频出的过程。我认为解决这个问题的最佳方法是从一个有代表性的示例开始,然后再进行扩展(这个原则适用于任何编程任务)。查看页面的 HTML 源代码至关重要,有很多方法可以做到这一点。
|
||||
|
||||
你可以在终端中使用 Python 查看页面的整个源代码(不建议使用)。运行此代码需要您自担风险:
|
||||
|
||||
|
||||
```
print(soup.prettify())
```
|
||||
|
||||
虽然打印出页面的整个源代码可能适用于某些教程中显示的消遣示例,但大多数现代网站的页面上都有大量内容。甚至404页面也可能充满了页眉,页脚等代码。
|
||||
|
||||
通常,在您喜欢的浏览器中通过 **View Page Source** 浏览源代码是最容易的(单击右键,然后选择"view page source"(查看页面源代码))。这是找到目标内容的最可靠方法(稍后我将解释原因)。
|
||||
|
||||
![Family Dollar 页面源代码][18]
|
||||
|
||||
|
||||
|
||||
在这种情况下,我需要在广阔的 HTML 海洋中找到我的目标内容——地址,城市,州和邮政编码。通常,对页面源(**ctrl+ F**)的简单搜索就会产生目标位置所在的位置。一旦我实际看到目标内容的示例(至少一个商店的地址),便会找到将该内容与其他内容区分开的属性或标签。
|
||||
|
||||
首先,我需要在爱达荷州 Family Dollar 商店中收集不同城市的网址,并访问这些网站以获取地址信息。这些网址似乎都包含在 **href** 标记中。太棒了!我将尝试使用 **find_all** 命令进行搜索:
|
||||
|
||||
|
||||
```
dollar_tree_list = soup.find_all('href')
dollar_tree_list
```
|
||||
|
||||
搜索 **href** 不会产生任何结果,该死。这可能已失败,因为 **href** 嵌套在 **itemlist** 类中。对于下一次尝试,请搜索 **item_list**。由于“`class`”是 Python 中的保留字,因此使用 **class_**来作为替代。**soup.find_all()** 原来是 `bs4` 函数的瑞士军刀。
|
||||
|
||||
|
||||
```
dollar_tree_list = soup.find_all(class_ = 'itemlist')
for i in dollar_tree_list[:2]:
    print(i)
```
|
||||
|
||||
有趣的是,我发现搜索某个特定的类往往是一种能成功的方法。通过查看对象的类型和长度,我们可以了解关于它的更多信息。
|
||||
|
||||
|
||||
```
type(dollar_tree_list)
len(dollar_tree_list)
```
|
||||
|
||||
可以使用 **.contents** 从 BeautifulSoup “ResultSet” 中提取内容。这也是创建单个代表性示例的好时机。
|
||||
|
||||
```
example = dollar_tree_list[2] # a representative example
example_content = example.contents
print(example_content)
```
|
||||
|
||||
使用 **.attr** 查找该对象内容中存在的属性。注意:**.contents** 通常会返回一个项目的精确的列表,因此第一步是使用方括号符号为该项目建立索引。
|
||||
|
||||
|
||||
```
example_content = example.contents[0]
example_content.attrs
```
|
||||
|
||||
现在,我可以看到 **href** 是一个属性,可以像字典项一样提取它:
|
||||
|
||||
|
||||
```
example_href = example_content['href']
print(example_href)
```
|
||||
|
||||
### 整合网站抓取工具
|
||||
|
||||
所有的探索为我们提供了前进的方法。这是弄清楚上面逻辑的清理版本。
|
||||
|
||||
|
||||
```
city_hrefs = [] # initialise empty list

for i in dollar_tree_list:
    cont = i.contents[0]
    href = cont['href']
    city_hrefs.append(href)

# check to be sure all went well
for i in city_hrefs[:2]:
    print(i)
```
|
||||
|
||||
输出的内容是一个关于抓取爱达荷州 Family Dollar 商店 URL 的列表
|
||||
|
||||
也就是说,我仍然没有获得地址信息!现在,需要抓取每个城市的 URL 以获得此信息。因此,我们使用一个具有代表性的示例重新开始该过程。
|
||||
|
||||
|
||||
```
page2 = requests.get(city_hrefs[2]) # again establish a representative example
soup2 = BeautifulSoup(page2.text, 'html.parser')
```
|
||||
|
||||
![Family Dollar 地图和代码][19]
|
||||
|
||||
地址信息嵌套在 **type="application/ld+json"** 里。经过大量的地理数据爬取之后,我开始认识到,这是存储地址信息的一种常见结构。幸运的是,**soup.find_all()** 也支持按 **type** 进行搜索。
|
||||
|
||||
|
||||
```
arco = soup2.find_all(type="application/ld+json")
print(arco[1])
```
|
||||
|
||||
地址信息在第二个列表成员中!原来如此!
|
||||
|
||||
使用 **.contents** 提取(从第二个列表项中)内容(这是过滤后的合适的默认操作)。同样,由于输出的内容是一个列表,因此我为该列表项建立了索引:
|
||||
|
||||
|
||||
```
arco_contents = arco[1].contents[0]
arco_contents
```
|
||||
|
||||
喔,看起来不错。此处提供的格式与 JSON 格式一致(而且,该类型的名称中确实包含 “**json**”)。 JSON对象的行为就像是带有嵌套字典的字典。一旦你熟悉利用其去工作,它实际上是一种不错的格式(当然,它比一长串 RegEx 命令更容易编程)。尽管从结构上看起来像一个 JSON 对象,但它仍然是 `bs4` 对象,需要通过编程方式转换为 JSON 对象才能对其进行访问:
|
||||
|
||||
|
||||
```
arco_json = json.loads(arco_contents)

type(arco_json)
print(arco_json)
```
|
||||
|
||||
在该内容中,有一个名为 **address** 的键,我们需要的地址信息就在它对应的一个较小的嵌套字典里。可以这样检索:
|
||||
|
||||
|
||||
```
|
||||
arco_address = arco_json['address']
|
||||
arco_address
|
||||
```
|
||||
|
||||
好吧,请大家注意。现在我可以遍历存储爱达荷州 URL 的列表:
|
||||
|
||||
|
||||
```
locs_dict = [] # initialise empty list

for link in city_hrefs:
    locpage = requests.get(link)   # request page info
    locsoup = BeautifulSoup(locpage.text, 'html.parser')   # parse the page's content
    locinfo = locsoup.find_all(type="application/ld+json")   # extract specific element
    loccont = locinfo[1].contents[0]   # get contents from the bs4 element set
    locjson = json.loads(loccont)  # convert to json
    locaddr = locjson['address'] # get address
    locs_dict.append(locaddr) # add address to list
```
|
||||
|
||||
### 用 **Pandas** 整理我们的网站抓取结果
|
||||
|
||||
我们把大量数据装进了一个字典列表,但是其中还有一些多余的无用项,它们会让之后重用数据变得比实际需要的更复杂。要完成最终的数据整理,我们需要把它转换为 pandas 的 DataFrame,删除不需要的列 “**@type**” 和 “**country**”,并检查前五行,以确保一切正常。
|
||||
|
||||
|
||||
```
locs_df = df.from_records(locs_dict)
locs_df.drop(['@type', 'addressCountry'], axis = 1, inplace = True)
locs_df.head(n = 5)
```
|
||||
|
||||
确保保存结果!!
|
||||
|
||||
|
||||
```
df.to_csv(locs_df, "family_dollar_ID_locations.csv", sep = ",", index = False)
```
|
||||
|
||||
我们做到了!所有爱达荷州 Family Dollar 商店都有一个用逗号分隔的列表。多令人兴奋。
|
||||
|
||||
### Selenium 和数据抓取的一点说明
|
||||
|
||||
[Selenium][5] 是用于与网页自动交互的常用应用。为了解释为什么有时必须使用它,让我们来看一个使用 Walgreens 网站的示例。 **Inspect Element** 为在浏览器显示内容提供代码:
|
||||
|
||||
![Walgreens 位置页面和代码][20]
|
||||
|
||||
|
||||
|
||||
虽然 **View Page Source** 提供了有关 **requests** 将获得什么内容的代码:
|
||||
|
||||
![Walgreens 源代码][21]
|
||||
|
||||
如果这两个不一致,则有一些插件可以修改源代码——因此,应在将页面加载到浏览器后对其进行访问。 **requests** 不能做到这一点,但是 **Selenium** 可以做到。
|
||||
|
||||
Selenium 需要 Web 驱动程序来检索内容。实际上,它会打开 Web 浏览器,并收集此页面的内容。 Selenium 功能强大——它可以通过多种方式与加载的内容进行交互(请阅读文档)。使用 **Selenium** 获取数据后,继续像以前一样使用 **BeautifulSoup**:
|
||||
|
||||
|
||||
```
# 注:原文的这段代码没有给出导入语句,下面两行是补充的(假定使用 Selenium 3 风格的 webdriver API)
from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://www.walgreens.com/storelistings/storesbycity.jsp?requestType=locator&state=ID"
driver = webdriver.Firefox(executable_path = 'mypath/geckodriver.exe')
driver.get(url)
soup_ID = BeautifulSoup(driver.page_source, 'html.parser')
store_link_soup = soup_ID.find_all(class_ = 'col-xl-4 col-lg-4 col-md-4')
```
|
||||
|
||||
对于 Family Dollar 这种情形,我不需要 Selenium,但是当浏览器中呈现的内容与源代码不一致时,我确实会把 Selenium 留作备用。
|
||||
|
||||
### 小结
|
||||
|
||||
总之,使用网站抓取来完成有意义的任务时:
|
||||
|
||||
* 耐心一点
|
||||
* 查阅手册(它们非常有帮助)
|
||||
|
||||
|
||||
|
||||
如果您对答案感到好奇:
|
||||
|
||||
![Family Dollar 位置图][23]
|
||||
|
||||
美国有很多 Family Dollar 商店。
|
||||
|
||||
完整的源代码是:
|
||||
|
||||
|
||||
```
import requests
from bs4 import BeautifulSoup
import json
from pandas import DataFrame as df

page = requests.get("https://www.familydollar.com/locations/")
soup = BeautifulSoup(page.text, 'html.parser')

# find all state links
state_list = soup.find_all(class_ = 'itemlist')

state_links = []

for i in state_list:
    cont = i.contents[0]
    attr = cont.attrs
    hrefs = attr['href']
    state_links.append(hrefs)

# find all city links
city_links = []

for link in state_links:
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'html.parser')
    familydollar_list = soup.find_all(class_ = 'itemlist')
    for store in familydollar_list:
        cont = store.contents[0]
        attr = cont.attrs
        city_hrefs = attr['href']
        city_links.append(city_hrefs)

# to get individual store links
store_links = []

for link in city_links:
    locpage = requests.get(link)
    locsoup = BeautifulSoup(locpage.text, 'html.parser')
    locinfo = locsoup.find_all(type="application/ld+json")
    for i in locinfo:
        loccont = i.contents[0]
        locjson = json.loads(loccont)
        try:
            store_url = locjson['url']
            store_links.append(store_url)
        except:
            pass

# get address and geolocation information
stores = []

for store in store_links:
    storepage = requests.get(store)
    storesoup = BeautifulSoup(storepage.text, 'html.parser')
    storeinfo = storesoup.find_all(type="application/ld+json")
    for i in storeinfo:
        storecont = i.contents[0]
        storejson = json.loads(storecont)
        try:
            store_addr = storejson['address']
            store_addr.update(storejson['geo'])
            stores.append(store_addr)
        except:
            pass

# final data parsing
stores_df = df.from_records(stores)
stores_df.drop(['@type', 'addressCountry'], axis = 1, inplace = True)
stores_df['Store'] = "Family Dollar"

df.to_csv(stores_df, "family_dollar_locations.csv", sep = ",", index = False)
```
|
||||
|
||||
\--
|
||||
_作者注释:本文是2020年2月9日在俄勒冈州波特兰的[我在PyCascades的演讲][24]的改编。_
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
via: https://opensource.com/article/20/5/web-scraping-python
|
||||
|
||||
作者:[Julia Piaskowski][a]
|
||||
选题:[lujun9972][b]
|
||||
译者:[stevenzdg988](https://github.com/stevenzdg988)
|
||||
校对:[校对者ID](https://github.com/校对者ID)
|
||||
|
||||
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
|
||||
|
||||
[a]: https://opensource.com/users/julia-piaskowski
|
||||
[b]: https://github.com/lujun9972
|
||||
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/bus_html_code.png?itok=VjUmGsnl (HTML code)
|
||||
[2]: https://requests.readthedocs.io/en/master/
|
||||
[3]: https://beautiful-soup-4.readthedocs.io/en/latest/
|
||||
[4]: https://pandas.pydata.org/
|
||||
[5]: https://www.selenium.dev/
|
||||
[6]: https://github.com/jpiaskowski/pycas2020_web_scraping
|
||||
[7]: https://opensource.com/article/20/4/install-python-linux
|
||||
[8]: https://opensource.com/article/19/8/how-install-python-windows
|
||||
[9]: https://opensource.com/article/19/5/python-3-default-mac
|
||||
[10]: https://github.com/jpiaskowski/pycas2020_web_scraping/blob/master/example/Familydollar_location_scrape-all-states.ipynb
|
||||
[11]: https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html
|
||||
[12]: https://opensource.com/article/20/4/build-websites
|
||||
[13]: https://www.contentkingapp.com/academy/robotstxt/
|
||||
[14]: https://www.researchgate.net/publication/324907302_Legality_and_Ethics_of_Web_Scraping
|
||||
[15]: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3221625
|
||||
[16]: https://locations.familydollar.com/id/
|
||||
[17]: https://opensource.com/sites/default/files/uploads/familydollar1.png (Family Dollar Idaho locations page)
|
||||
[18]: https://opensource.com/sites/default/files/uploads/familydollar2.png (Family Dollar page source code)
|
||||
[19]: https://opensource.com/sites/default/files/uploads/familydollar3.png (Family Dollar map and code)
|
||||
[20]: https://opensource.com/sites/default/files/uploads/walgreens1.png (Walgreens location page and code)
|
||||
[21]: https://opensource.com/sites/default/files/uploads/walgreens2.png (Walgreens source code)
|
||||
[22]: https://www.walgreens.com/storelistings/storesbycity.jsp?requestType=locator\&state=ID
|
||||
[23]: https://opensource.com/sites/default/files/uploads/family_dollar_locations.png (Family Dollar locations map)
|
||||
[24]: https://2020.pycascades.com/talks/adventures-in-babysitting-webscraping-for-python-and-html-novices/
|
@ -0,0 +1,76 @@
|
||||
[#]: collector: (lujun9972)
|
||||
[#]: translator: (geekpi)
|
||||
[#]: reviewer: ( )
|
||||
[#]: publisher: ( )
|
||||
[#]: url: ( )
|
||||
[#]: subject: (Improve your productivity with this lightweight Linux desktop)
|
||||
[#]: via: (https://opensource.com/article/21/1/elementary-linux)
|
||||
[#]: author: (Kevin Sonney https://opensource.com/users/ksonney)
|
||||
|
||||
使用这个轻量级 Linux 桌面提高你的工作效率
|
||||
======
|
||||
ElementaryOS 提供了一个快速、轻量、高效的桌面,让你在新的一年里保持工作效率。
|
||||
![Business woman on laptop sitting in front of window][1]
|
||||
|
||||
在前几年,这个年度系列涵盖了单个的应用。今年,我们除了关注 2021 年的策略外,还将关注一体化解决方案。欢迎来到 2021 年 21 天生产力的第一天。
|
||||
|
||||
当寻找提高生产力的工具时,很容易拼凑出一堆勉强可用、却配合得不太好的应用。在过去的几年里,我们已经谈到过单独的电子邮件应用、日历应用、记事本应用等等。不过,总会有些磕磕绊绊:要么需要自定义脚本,要么需要繁琐的导出/导入步骤,才能让某个工具正常工作。
|
||||
|
||||
[ElementaryOS][2] 是一个完整的桌面,具有美观、实用、高效的环境。
|
||||
|
||||
![Linux ElementaryOS Desktop][3]
|
||||
|
||||
ElementaryOS 桌面 (Kevin Sonney, [CC BY-SA 4.0][4])
|
||||
|
||||
ElementaryOS 是一个基于流行的 Ubuntu Linux 发行版的“按需付费”开源项目。对于过去安装过 Ubuntu 的人来说,最初的设置和安装会非常熟悉。然而,一旦登录,体验就会很不一样。
|
||||
|
||||
ElementaryOS 使用 [Gala 窗口管理器][5]和 Pantheon shell。这两个都是专门为 Elementary 开发的。安装后,桌面非常简单,它只提供了少量的轻量级应用。这些应用包括网络浏览器、终端、邮件客户端和日历客户端。它还有一个应用中心,允许你安装 Elementary 团队策划的免费和商业应用。
|
||||
|
||||
![Linux ElementaryOS Mail and calendar][6]
|
||||
|
||||
ElementaryOS 的邮件和日历应用(Kevin Sonney, [CC BY-SA 4.0][4])
|
||||
|
||||
[邮件][7]和[日历][8]这两个应用看起来很熟悉,因为这两个应用已经被其他发行版使用了一段时间。邮件是作为 [Geary][9] 的分支开始的,而日历在其他地方则被称为 [Maya][10]。两者的设置非常简单。两款应用默认只用用户名/密码认证,所以需要双因素认证的用户需要一些额外的步骤。两者的界面都异常轻巧快速。
|
||||
|
||||
![Linux ElementaryOS app center][11]
|
||||
|
||||
应用中心(Kevin Sonney,[CC BY-SA 4.0][4])
|
||||
|
||||
ElementaryOS 默认不包含待办事项或记事本应用。这时,应用中心就可以发挥作用了。在应用中心中,有大量的应用可以填补空白。其中有两款应用真的很突出。第一个是 [Planner][12],一款简洁、轻量级的待办事项管理器。它支持多个列表、计划和重复性任务、项目和子项目。它还可以与流行的在线应用 [Todoist][13] 同步,但这不是必需的。
|
||||
|
||||
应用中心的第二款应用是 [Notes-Up][14],这是一款使用 Markdown 写富文本的笔记应用。它允许用户在多个笔记本中创建多个笔记,并且有“查看”和“编辑”选项,这样你就可以预览最终文档的样子。同样,这款应用快速、轻量而且非常简约,与 ElementaryOS 的整体外观和感觉保持一致。
|
||||
|
||||
![ElementaryOS Planner and Notes-up][15]
|
||||
|
||||
Planner 和 Notes-up (Kevin Sonney,[CC BY-SA 4.0][4])
|
||||
|
||||
如果你不喜欢默认应用,Elementary 基于 Ubuntu LTS,因此如果你需要使用其他应用,整个 Ubuntu 应用生态系统都可以使用。不过总的来说,ElementaryOS 默认提供了一个快速、轻量、高效的桌面,让你在新的一年里保持高效。
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
via: https://opensource.com/article/21/1/elementary-linux
|
||||
|
||||
作者:[Kevin Sonney][a]
|
||||
选题:[lujun9972][b]
|
||||
译者:[geekpi](https://github.com/geekpi)
|
||||
校对:[校对者ID](https://github.com/校对者ID)
|
||||
|
||||
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
|
||||
|
||||
[a]: https://opensource.com/users/ksonney
|
||||
[b]: https://github.com/lujun9972
|
||||
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/lenovo-thinkpad-laptop-concentration-focus-windows-office.png?itok=-8E2ihcF (Woman using laptop concentrating)
|
||||
[2]: https://elementary.io/
|
||||
[3]: https://opensource.com/sites/default/files/pictures/elementaryos-desktop.png (ElementaryOS Desktop)
|
||||
[4]: https://creativecommons.org/licenses/by-sa/4.0/
|
||||
[5]: https://github.com/elementary/gala
|
||||
[6]: https://opensource.com/sites/default/files/pictures/elementaryos-mail-calendar.png (Linux ElementaryOS Mail and calendar)
|
||||
[7]: https://github.com/elementary/mail
|
||||
[8]: https://github.com/elementary/calendar
|
||||
[9]: https://wiki.gnome.org/Apps/Geary
|
||||
[10]: https://launchpad.net/maya
|
||||
[11]: https://opensource.com/sites/default/files/pictures/elementaryos-app-center_0.png (Linux ElementaryOS app center)
|
||||
[12]: https://appcenter.elementary.io/com.github.alainm23.planner/
|
||||
[13]: https://todoist.com/
|
||||
[14]: https://github.com/Philip-Scott/Notes-up
|
||||
[15]: https://opensource.com/sites/default/files/pictures/planner-notes-up.png (ElementaryOS Planner and Notes-up)
|