[#]: collector: (lujun9972)
[#]: translator: (HankChow)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (SPEED TEST: x86 vs. ARM for Web Crawling in Python)
[#]: via: (https://blog.dxmtechsupport.com.au/speed-test-x86-vs-arm-for-web-crawling-in-python/)
[#]: author: (James Mawson https://blog.dxmtechsupport.com.au/author/james-mawson/)

SPEED TEST: x86 vs. ARM for Web Crawling in Python
======

![][1]

Can you imagine if your job was to trawl competitor websites and jot prices down by hand, again and again and again? You'd burn your whole office down by lunchtime.

So it's little wonder that web crawlers are huge these days. They can keep track of customer sentiment and trending topics, monitor job openings, real estate transactions, UFC results – all sorts of stuff.

For those of a certain bent, this is fascinating stuff. Which is how I found myself playing around with [Scrapy][2], an open source web crawling framework written in Python.

Being wary of the potential to do something catastrophic to my computer while poking at things I didn't understand, I decided to install it not on my main machine but on a Raspberry Pi.

And wouldn't you know it? It actually didn't run too shabby on the little tacker. Maybe this is a good use case for an ARM server?

Google had no solid answer. The nearest thing I found was [this Drupal hosting drag race][3], which showed an ARM server outperforming a much more expensive x86-based account.

That was definitely interesting. I mean, isn't a web server kind of like a crawler in reverse? But with one operating on a LAMP stack and the other on a Python interpreter, it's hardly the exact same thing.

So what could I do? Only one thing: get some VPS accounts and make them race each other.

### What's the Deal With ARM Processors?

ARM is now the most popular CPU architecture in the world.

But it's generally seen as something you'd opt for to save money and battery life, rather than as a serious workhorse.

It wasn't always that way: this CPU was designed in Cambridge, England, to power the fiendishly expensive [Acorn Archimedes][4]. This was the most powerful desktop computer in the world, and by a long way too: it was multiple times the speed of the fastest 386.

Acorn, like Commodore and Atari, somewhat naively believed that making a great computer company was all about making great computers. Bill Gates had a better idea: he got DOS onto as many x86 machines – of the most widely varying quality and expense – as he could.

Having the best user base made you the obvious platform for third-party developers to write software for; having all the software support made yours the most useful computer.

Even Apple nearly bit the dust. All the $$$$ were in building a better x86 chip, so that was the architecture that ended up being developed for serious computing.

That wasn't the end for ARM, though. Their chips weren't just fast – they could run well without drawing much power or emitting much heat. That made them a preferred technology in set-top boxes, PDAs, digital cameras, MP3 players, and basically anything that either ran on a battery or where you'd just rather avoid the noise of a large fan.

So it was that Acorn spun off ARM, which began an idiosyncratic business model that continues to this day: ARM doesn't actually manufacture any chips, they license their intellectual property to others who do.

Which is more or less how they ended up in so many phones and tablets. When Linux was ported to the architecture, the door opened to other open source technologies, which is how we can run a web crawler on these chips today.

#### ARM in the Server Room

Some big names, like [Microsoft][5] and [Cloudflare][6], have placed heavy bets on the British Bulldog for their infrastructure. But for those of us with more modest budgets, the options are fairly sparse.

In fact, when it comes to cheap and cheerful VPS accounts that you can stick on the credit card for a few bucks a month, for years the only option was [Scaleway][7].

This changed a few months ago when public cloud heavyweight [AWS][8] launched its own ARM processor: the [AWS Graviton][9].

I decided to grab one of each, and race them against the most similar Intel offering from the same provider.

### Looking Under the Hood

So what are we actually racing here? Let's jump right in.

#### Scaleway

Scaleway positions itself as "designed for developers". And you know what? I think that's fair enough: it's definitely been a good little sandbox for developing and prototyping.

The dirt-simple product offering and clean, easy dashboard guide you from home page to bash shell in minutes. That makes it a strong option for small businesses, freelancers and consultants who just want to get straight into a good VPS at a great price to run some crawls.

The ARM account we will be using is their [ARM64-2GB][10], which costs 3 euros a month and has 4 Cavium ThunderX cores. This launched in 2014 as the first server-class ARMv8 processor, but is now looking a bit middle-aged, having been superseded by the younger, prettier ThunderX2.

The x86 account we will be comparing it to is the [1-S][11], which costs a more princely 4 euros a month and has 2 Intel Atom C3995 cores. Intel's Atom range is a low-power, single-threaded system-on-chip design, first built for laptops and then adapted for server use.

These accounts are otherwise fairly similar: they each have 2 gigabytes of memory, 50 gigabytes of SSD storage and 200 Mbit/s of bandwidth. The disk drives possibly differ, but with the crawls we're going to run here, that won't come into play: we're going to be doing everything in memory.

When I can't use a package manager I'm familiar with, I become angry and confused – a bit like a toddler without his security blanket, entirely beyond reasoning or consolation. It's quite horrendous, really. So both of these accounts will use Debian Stretch.

#### Amazon Web Services

In the time it takes you to give Scaleway your credit card details, launch a VPS, add a sudo user and start installing dependencies, you won't even have finished registering your AWS account. You'll still be reading through the product pages trying to figure out what's going on.

There's a serious breadth and depth here aimed at enterprises and others with complicated or specialised needs.

The AWS Graviton we wanna drag race is part of AWS's "Elastic Compute Cloud" or EC2 range. I'll be running it as an on-demand instance, which is the most convenient and expensive way to use EC2. AWS also operates a [spot market][12], where you get the server much cheaper if you can be flexible about when it runs. There's also a [mid-priced option][13] if you want to run it 24/7.

Did I mention that AWS is complicated? Anyhoo…

The two accounts we're comparing are [a1.medium][14] and [t2.small][15]. They both offer 2 GB of RAM. Which begs the question: WTF is a vCPU? Confusingly, it's a different thing on each account.

On the a1.medium account, a vCPU is a single core of the new AWS Graviton chip, built by Annapurna Labs, an Israeli chip maker bought by Amazon in 2015. It's a single-threaded 64-bit ARMv8 core exclusive to AWS, with an on-demand price of 0.0255 US dollars per hour.

Our t2.small account runs on an Intel Xeon – though exactly which Xeon chip it is, I couldn't really figure out. This has two threads per core – though we're not really getting the whole core, or even the whole thread.

Instead we're getting a "baseline performance of 20%, with the ability to burst above that baseline using CPU credits". Which makes sense in principle, though it's completely unclear to me what to actually expect from this. The on-demand price for this account is 0.023 US dollars per hour.

I couldn't find Debian in the image library here, so both of these accounts will run Ubuntu 18.04.

### Beavis and Butthead Do Moz's Top 500

To test these VPS accounts, I need a crawler to run – one that will let the CPU stretch its legs a bit. One way to do this would be to just hammer a few websites with as many requests as possible, as fast as possible, but that's not very polite. What we'll do instead is a broad crawl of many websites at once.

So it's in tribute to my favourite physicist turned filmmaker, Mike Judge, that I wrote beavis.py. This crawls Moz's Top 500 Websites to a depth of 3 pages to count how many times the words "wood" and "ass" occur anywhere within the HTML source.

Not all 500 websites will actually get crawled here – some will be excluded by robots.txt, others will require JavaScript to follow links, and so on. But it's a wide enough crawl to keep the CPU busy.

Python's [global interpreter lock][16] means that beavis.py can only make use of a single CPU thread. To test multi-threaded performance, we're going to have to launch multiple spiders as separate processes.

This is why I wrote butthead.py. Any true fan of the show knows that, as crude as Butthead was, he was always slightly more sophisticated than Beavis.

Splitting the crawl into multiple lists of start pages and allowed domains might slightly impact what gets crawled – fewer external links to other websites in the top 500 will get followed. But every crawl will be different anyway, so we will count how many pages are scraped as well as how long they take.

### Installing Scrapy on an ARM Server

Installing Scrapy is basically the same on each architecture: you install pip and various other dependencies, then install Scrapy from pip.

Installing Scrapy from pip on an ARM device does take noticeably longer, though. I'm guessing this is because it has to compile the binary parts from source.

Once Scrapy was installed, I ran it from the shell to check that it was fetching pages.

On Scaleway's ARM account, there seemed to be a hitch with the service_identity module: it was installed but not working. This issue had come up on the Raspberry Pi as well, but not on the AWS Graviton.

Not to worry, this was easily fixed with the following command:

```
sudo pip3 install service_identity --force --upgrade
```

Then we were off and racing!

### Single-Threaded Crawls

The Scrapy docs say to try to [keep your crawls running between 80-90% CPU usage][17]. In practice, that's hard – at least it is with the script I've written. What tends to happen is that the CPU gets very busy early in the crawl, drops a little bit and then rallies again.

The last part of the crawl, where most of the domains have already been finished, can go on for quite a few minutes. That's frustrating, because at that point it feels more like a measure of how big the last website is than of anything to do with the processor.

So please take this for what it is: not a state of the art benchmarking tool, but a short and slightly balding Australian in his underpants running some scripts and watching what happens.

So let's get down to brass tacks. We'll start with the Scaleway crawls.

| VPS Account | Time | Pages Scraped | Pages/Hour | €/Million Pages |
| ------------------ | ----------- | ------ | ---------- | ------- |
| Scaleway ARM64-2GB | 108m 59.27s | 38,205 | 21,032.623 | 0.28527 |
| Scaleway 1-S | 97m 44.067s | 39,476 | 24,324.648 | 0.33011 |

I kept an eye on the CPU use of both of these crawls using [top][18]. Both crawls hit 100% CPU use at the beginning, but the ThunderX chip was definitely redlining a lot more. That means these figures understate how much faster the Atom core crawls than the ThunderX.

While I was watching CPU use in top, I could also see how much RAM was in use – this increased as the crawl continued. The ARM account used 14.7% at the end of the crawl, while the x86 was at 15%.

Watching the logs of these crawls, I also noticed a lot more pages timing out and going missing when the processor was maxed out. That makes sense – if the CPU's too busy to respond to everything then something's gonna go missing.

That's not such a big deal when you're just racing the things to see which is fastest. But in a real-world situation, with business outcomes at stake in the quality of your data, it's probably worth having a little bit of headroom.

And what about AWS?

| VPS Account | Time | Pages Scraped | Pages/Hour | $/Million Pages |
| ----------- | ------------ | ------ | ---------- | ------- |
| a1.medium | 100m 39.900s | 41,294 | 24,612.725 | 1.03605 |
| t2.small | 78m 53.171s | 41,200 | 31,336.286 | 0.73397 |

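For what it's worth, the cost column looks like it's just the on-demand hourly rate divided by the pages crawled per hour, scaled up to a million pages – a quick sanity check in Python, using the a1.medium figures above:

```
# Cost per million pages = hourly price / (pages per hour) * 1,000,000.
# Checked against the a1.medium row above: 0.0255 USD/hr and 24,612.725 pages/hr.
hourly_rate = 0.0255
pages_per_hour = 24612.725
print(round(hourly_rate / pages_per_hour * 1000000, 5))   # ~1.03605, matching the table
```
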
I've included these results for the sake of comparison with the Scaleway crawls, but these crawls were kind of a bust. Monitoring the CPU use – this time through the AWS dashboard rather than through top – showed that the script wasn't making good use of the available processing power on either account.

This was clearest with the a1.medium account – it hardly even got out of bed. It peaked at about 45% near the beginning and then bounced around between 20% and 30% for the rest.

What's intriguing to me about this is that the exact same script ran much slower on the ARM processor – and that's not because it hit a limit of the Graviton's CPU power. It had oodles of headroom left. Even the Intel Atom core managed to finish ahead of it, and that was maxing out for some of the crawl. The settings were the same in the code, but they were being handled differently on the different architectures.

It's a bit of a black box to me whether that's something inherent to the processor itself, the way the binaries were compiled, or some interaction between the two. I'm going to speculate that we might have seen the same thing on the Scaleway ARM VPS, if we hadn't hit the limit of the CPU core's processing power first.

It was harder to know how the t2.small account was doing. The crawl sat at about 20%, sometimes going as high as 35%. Was that what was meant by "baseline performance of 20%, with the ability to burst to a higher level"? I had no idea. But I could see on the dashboard that I wasn't burning through any CPU credits.

Just to make extra sure, I installed [stress][19] and ran it for a few minutes; sure enough, this thing could do 100% if you pushed it.

Clearly, I was going to need to crank the settings up on both these processors to make them sweat a bit, so I set CONCURRENT_REQUESTS to 5000 and REACTOR_THREADPOOL_MAXSIZE to 120 and ran some more crawls.

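In Scrapy terms, that's just two values in the spider's custom_settings – a sketch of the tweak, with everything else left as it is in the beavis.py listing further down:

```
# The cranked-up broad-crawl settings for the second round of AWS runs.
# All other settings are unchanged from beavis.py below.
custom_settings = {
    'CONCURRENT_REQUESTS': 5000,          # up from 1500
    'REACTOR_THREADPOOL_MAXSIZE': 120,    # up from 60
}
```
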
| VPS Account | Time | Pages Scraped | Pages/Hour | $/Million Pages |
| ------------------------- | ----------- | ------ | ---------- | ------- |
| a1.medium | 46m 13.619s | 40,283 | 52,285.047 | 0.48771 |
| t2.small | 41m 7.619s | 36,241 | 52,871.857 | 0.43501 |
| t2.small (no CPU credits) | 73m 8.133s | 34,298 | 28,137.8891 | 0.81740 |

The a1 instance hit 100% usage about 5 minutes into the crawl, before dropping back to 80% use for another 20 minutes, climbing up to 96% again and then dropping down again as it wrapped things up. That was probably about as well-tuned as I was going to get it.

The t2 instance hit 50% early in the crawl and stayed there until it was nearly done. With 2 threads per core, 50% CPU use is one thread maxed out.

Here we see both accounts produce similar speeds. But the Xeon thread was redlining for most of the crawl, and the Graviton was not. I'm going to chalk this up as a slight win for the Graviton.

But what about once you've burnt through all your CPU credits? That's probably the fairer comparison – to only use them as you earn them. I wanted to test that as well, so I ran stress until all the CPU credits were exhausted and ran the crawl again.

With no credits in the bank, the CPU usage maxed out at 27% and stayed there. So many pages ended up going missing that it actually performed worse than it had on the lower settings.

### Multi-Threaded Crawls

Dividing our crawl up between multiple spiders in separate processes offers a few more options to make use of the available cores.

I first tried dividing everything up between 10 processes and launching them all at once. This turned out to be slower than just dividing them up into 1 process per core.

I got the best result by combining these methods – dividing the crawl up into 10 processes, launching 1 process per core at the start, and then starting the rest as those crawls began to wind down.

To make this even better, you could try to minimise the problem of the last lingering crawler by making sure the longest crawls start first. I actually attempted to do this.

Figuring that the number of links on the home page might be a rough proxy for how large the crawl would be, I built a second spider to count them, and then sorted the domains in descending order of outgoing links. This preprocessing worked well and added a little over a minute.

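That counting spider isn't included with the scripts at the end of this post, but roughly sketched – my reconstruction of the idea rather than the exact script – it looks something like this:

```
# A rough sketch of the link-counting pre-pass: fetch each home page once,
# count the links on it, then sort the domains so the busiest start first.
# Not the exact script used for the timings above.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

linkcounts = []   # (home page URL, number of links found on it)

class LinkCountSpider(scrapy.Spider):
    name = "LinkCounter"
    custom_settings = {'LOG_LEVEL': 'INFO', 'RETRY_ENABLED': False, 'DOWNLOAD_TIMEOUT': 30}

    def parse(self, response):
        links = LinkExtractor().extract_links(response)
        linkcounts.append((response.url, len(links)))

if __name__ == '__main__':
    # In practice this list comes from getdomains() in the scripts below;
    # a couple of literal domains stand in for it here.
    domains = ['moz.com', 'scrapy.org']

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(LinkCountSpider, start_urls=['http://' + d for d in domains])
    process.start()

    # Busiest home pages first - a rough proxy for how big each crawl will be.
    for url, count in sorted(linkcounts, key=lambda pair: pair[1], reverse=True):
        print(count, url)
```
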
It turned out, though, that this blew the crawling time out beyond two hours! Putting all the most link-heavy websites together in the same process wasn't a great idea after all.

You might deal with this effectively by tweaking the number of domains per process as well – or by shuffling the list after it's ordered. That's a bit much for Beavis and Butthead, though.

So I went back to my earlier method, which had worked somewhat well:

| VPS Account | Time | Pages Scraped | Pages/Hour | €/Million Pages |
| ------------------ | ----------- | ------ | ----------- | ------- |
| Scaleway ARM64-2GB | 62m 10.078s | 36,158 | 34,897.0719 | 0.17193 |
| Scaleway 1-S | 60m 56.902s | 36,725 | 36,153.5529 | 0.22128 |

After all that, using more cores did speed up the crawl. But it's hardly a matter of just halving or quartering the time taken.

I'm certain that a more experienced coder could better optimise this to take advantage of all the cores. But, as far as "out of the box" Scrapy performance goes, it seems to be a lot easier to speed up a crawl by using faster threads rather than by throwing more cores at it.

As it is, the Atom has scraped slightly more pages in slightly less time. On a value-for-money metric, you could possibly say that the ThunderX is ahead. Either way, there's not a lot of difference here.

### Everything You Always Wanted to Know About Ass and Wood (But Were Afraid to Ask)

After scraping 38,205 pages, our crawler found 24,170,435 mentions of ass and 54,368 mentions of wood.

![][20]

Considered on its own, this is a respectable amount of wood.

But when you set it against the sheer quantity of ass we're dealing with here, the wood looks minuscule.

### The Verdict

From what's visible to me at the moment, it looks like the CPU architecture you use is actually less important than how old the processor is. The AWS Graviton from 2018 was the winner here in single-threaded performance.

You could of course argue that the Xeon still wins, core for core. But then you're not really going dollar for dollar anymore, or even thread for thread.

The Atom from 2017, on the other hand, comfortably bested the ThunderX from 2014. Though, on the value-for-money metric, the ThunderX might be the clear winner. Then again, if you can run your crawls on Amazon's spot market, the Graviton is still ahead.

All in all, I think this shows that, yes, you can crawl the web with an ARM device, and it can compete on both performance and price.

Whether the difference is significant enough for you to turn what you're doing upside down is a whole other question of course. Certainly, if you're already on the AWS cloud – and your code is portable enough – then it might be worthwhile testing out their a1 instances.

Hopefully we will see more ARM options on the public cloud in the near future.

### The Scripts

This is my first real go at doing anything in either Python or Scrapy, so this might not be great code to learn from. Some of what I've done here – such as using global variables – is definitely a bit kludgey.

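If the globals bother you, Scrapy's built-in stats collector is a tidier home for the tallies. Here's a rough sketch of that approach – a drop-in replacement for the AssWoodPipeline class in the scripts below, not what I actually ran for the benchmarks:

```
# Sketch of a globals-free pipeline: tallies go into Scrapy's stats collector
# and get logged when the spider closes. Register this class in ITEM_PIPELINES
# instead of AssWoodPipeline; the module-level globals and the final prints
# then aren't needed.
class AssWoodStatsPipeline(object):
    def process_item(self, item, spider):
        spider.crawler.stats.inc_value('asswood/ass', item['ass'])
        spider.crawler.stats.inc_value('asswood/wood', item['wood'])
        spider.crawler.stats.inc_value('asswood/pages')
        return item

    def close_spider(self, spider):
        stats = spider.crawler.stats
        spider.logger.info('%s pages, ass %s times, wood %s times',
                           stats.get_value('asswood/pages', 0),
                           stats.get_value('asswood/ass', 0),
                           stats.get_value('asswood/wood', 0))
```
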
Still, I want to be transparent about my methods, so here are my scripts.

To run them, you'll need Scrapy installed and you will need the CSV file of [Moz's top 500 domains][21]. To run butthead.py you will also need [psutil][22].

##### beavis.py

```
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

ass = 0
wood = 0
totalpages = 0

# Read the Moz top-500 CSV and pull the quoted domain name out of each row,
# skipping the header line.
def getdomains():

    moz500file = open('top500.domains.05.18.csv')

    domains = []
    moz500csv = moz500file.readlines()

    del moz500csv[0]

    for csvline in moz500csv:
        leftquote = csvline.find('"')
        rightquote = leftquote + csvline[leftquote + 1:].find('"')
        domains.append(csvline[leftquote + 1:rightquote])

    return domains

# Turn each domain into a start URL.
def getstartpages(domains):

    startpages = []

    for domain in domains:
        startpages.append('http://' + domain)

    return startpages

class AssWoodItem(scrapy.Item):
    ass = scrapy.Field()
    wood = scrapy.Field()
    url = scrapy.Field()

# Collects the per-page counts, then totals them into the module-level
# globals when the spider closes.
class AssWoodPipeline(object):
    def __init__(self):
        self.asswoodstats = []

    def process_item(self, item, spider):
        self.asswoodstats.append((item.get('url'), item.get('ass'), item.get('wood')))

    def close_spider(self, spider):
        asstally, woodtally = 0, 0

        for asswoodcount in self.asswoodstats:
            asstally += asswoodcount[1]
            woodtally += asswoodcount[2]

        global ass, wood, totalpages
        ass = asstally
        wood = woodtally
        totalpages = len(self.asswoodstats)

class BeavisSpider(CrawlSpider):
    name = "Beavis"
    allowed_domains = getdomains()
    start_urls = getstartpages(allowed_domains)
    #start_urls = [ 'http://medium.com' ]

    # Broad-crawl settings: shallow depth, lots of concurrent requests,
    # no retries or cookies.
    custom_settings = {
        'DEPTH_LIMIT': 3,
        'DOWNLOAD_DELAY': 3,
        'CONCURRENT_REQUESTS': 1500,
        'REACTOR_THREADPOOL_MAXSIZE': 60,
        'ITEM_PIPELINES': { '__main__.AssWoodPipeline': 10 },
        'LOG_LEVEL': 'INFO',
        'RETRY_ENABLED': False,
        'DOWNLOAD_TIMEOUT': 30,
        'COOKIES_ENABLED': False,
        'AJAXCRAWL_ENABLED': True
    }

    rules = ( Rule(LinkExtractor(), callback='parse_asswood'), )

    def parse_asswood(self, response):
        # Only count in text responses (skip images, PDFs and the like).
        if isinstance(response, scrapy.http.TextResponse):
            item = AssWoodItem()
            item['ass'] = response.text.casefold().count('ass')
            item['wood'] = response.text.casefold().count('wood')
            item['url'] = response.url
            yield item


if __name__ == '__main__':

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(BeavisSpider)
    process.start()

    print('Uhh, that was, like, ' + str(totalpages) + ' pages crawled.')
    print('Uh huhuhuhuh. It said ass ' + str(ass) + ' times.')
    print('Uh huhuhuhuh. It said wood ' + str(wood) + ' times.')
```

##### butthead.py

```
import scrapy, time, psutil
from scrapy.spiders import CrawlSpider, Rule, Spider
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue, cpu_count

ass = 0
wood = 0
totalpages = 0
linkcounttuples = []

# Read the Moz top-500 CSV and pull the quoted domain name out of each row,
# skipping the header line.
def getdomains():

    moz500file = open('top500.domains.05.18.csv')

    domains = []
    moz500csv = moz500file.readlines()

    del moz500csv[0]

    for csvline in moz500csv:
        leftquote = csvline.find('"')
        rightquote = leftquote + csvline[leftquote + 1:].find('"')
        domains.append(csvline[leftquote + 1:rightquote])

    return domains

# Turn each domain into a start URL.
def getstartpages(domains):

    startpages = []

    for domain in domains:
        startpages.append('http://' + domain)

    return startpages

class AssWoodItem(scrapy.Item):
    ass = scrapy.Field()
    wood = scrapy.Field()
    url = scrapy.Field()

# Same pipeline as in beavis.py: totals the counts into this process's
# globals when its spider closes.
class AssWoodPipeline(object):
    def __init__(self):
        self.asswoodstats = []

    def process_item(self, item, spider):
        self.asswoodstats.append((item.get('url'), item.get('ass'), item.get('wood')))

    def close_spider(self, spider):
        asstally, woodtally = 0, 0

        for asswoodcount in self.asswoodstats:
            asstally += asswoodcount[1]
            woodtally += asswoodcount[2]

        global ass, wood, totalpages
        ass = asstally
        wood = woodtally
        totalpages = len(self.asswoodstats)


class ButtheadSpider(CrawlSpider):
    name = "Butthead"
    custom_settings = {
        'DEPTH_LIMIT': 3,
        'DOWNLOAD_DELAY': 3,
        'CONCURRENT_REQUESTS': 250,
        'REACTOR_THREADPOOL_MAXSIZE': 30,
        'ITEM_PIPELINES': { '__main__.AssWoodPipeline': 10 },
        'LOG_LEVEL': 'INFO',
        'RETRY_ENABLED': False,
        'DOWNLOAD_TIMEOUT': 30,
        'COOKIES_ENABLED': False,
        'AJAXCRAWL_ENABLED': True
    }

    rules = ( Rule(LinkExtractor(), callback='parse_asswood'), )

    def parse_asswood(self, response):
        if isinstance(response, scrapy.http.TextResponse):
            item = AssWoodItem()
            item['ass'] = response.text.casefold().count('ass')
            item['wood'] = response.text.casefold().count('wood')
            item['url'] = response.url
            yield item

# Run one spider over its own slice of domains in a separate process and
# report that process's totals back through the queue.
def startButthead(domainslist, urlslist, asswoodqueue):
    crawlprocess = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    crawlprocess.crawl(ButtheadSpider, allowed_domains = domainslist, start_urls = urlslist)
    crawlprocess.start()
    asswoodqueue.put( (ass, wood, totalpages) )


if __name__ == '__main__':
    asswoodqueue = Queue()
    domains = getdomains()
    startpages = getstartpages(domains)
    processlist = []
    cores = cpu_count()

    # Split the 500 domains into 10 slices of 50, one crawl process per slice.
    for i in range(10):
        domainsublist = domains[i * 50:(i + 1) * 50]
        pagesublist = startpages[i * 50:(i + 1) * 50]
        p = Process(target = startButthead, args = (domainsublist, pagesublist, asswoodqueue))
        processlist.append(p)

    # Start one crawl per core straight away...
    for i in range(cores):
        processlist[i].start()

    time.sleep(180)

    i = cores

    # ...then feed in the remaining crawls one at a time, whenever CPU use
    # drops below two thirds.
    while i != 10:
        time.sleep(60)
        if psutil.cpu_percent() < 66.7:
            processlist[i].start()
            i += 1

    for i in range(10):
        processlist[i].join()

    # Add up the totals reported by each crawl process.
    for i in range(10):
        asswoodtuple = asswoodqueue.get()
        ass += asswoodtuple[0]
        wood += asswoodtuple[1]
        totalpages += asswoodtuple[2]

    print('Uhh, that was, like, ' + str(totalpages) + ' pages crawled.')
    print('Uh huhuhuhuh. It said ass ' + str(ass) + ' times.')
    print('Uh huhuhuhuh. It said wood ' + str(wood) + ' times.')
```

--------------------------------------------------------------------------------

via: https://blog.dxmtechsupport.com.au/speed-test-x86-vs-arm-for-web-crawling-in-python/

Author: [James Mawson][a]
Topic selected by: [lujun9972][b]
Translator: [HankChow](https://github.com/HankChow)
Proofreader: [校对者ID](https://github.com/校对者ID)

This article was translated by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/).

[a]: https://blog.dxmtechsupport.com.au/author/james-mawson/
[b]: https://github.com/lujun9972
[1]: https://blog.dxmtechsupport.com.au/wp-content/uploads/2019/02/quadbike-1024x683.jpg
[2]: https://scrapy.org/
[3]: https://www.info2007.net/blog/2018/review-scaleway-arm-based-cloud-server.html
[4]: https://blog.dxmtechsupport.com.au/playing-badass-acorn-archimedes-games-on-a-raspberry-pi/
[5]: https://www.computerworld.com/article/3178544/microsoft-windows/microsoft-and-arm-look-to-topple-intel-in-servers.html
[6]: https://www.datacenterknowledge.com/design/cloudflare-bets-arm-servers-it-expands-its-data-center-network
[7]: https://www.scaleway.com/
[8]: https://aws.amazon.com/
[9]: https://www.theregister.co.uk/2018/11/27/amazon_aws_graviton_specs/
[10]: https://www.scaleway.com/virtual-cloud-servers/#anchor_arm
[11]: https://www.scaleway.com/virtual-cloud-servers/#anchor_starter
[12]: https://aws.amazon.com/ec2/spot/pricing/
[13]: https://aws.amazon.com/ec2/pricing/reserved-instances/
[14]: https://aws.amazon.com/ec2/instance-types/a1/
[15]: https://aws.amazon.com/ec2/instance-types/t2/
[16]: https://wiki.python.org/moin/GlobalInterpreterLock
[17]: https://docs.scrapy.org/en/latest/topics/broad-crawls.html
[18]: https://linux.die.net/man/1/top
[19]: https://linux.die.net/man/1/stress
[20]: https://blog.dxmtechsupport.com.au/wp-content/uploads/2019/02/Screenshot-from-2019-02-16-17-01-08.png
[21]: https://moz.com/top500
[22]: https://pypi.org/project/psutil/