Merge pull request #4444 from firstadream/master

Translation complete
This commit is contained in:
Xingyu.Wang 2016-09-21 07:27:52 +08:00 committed by GitHub
commit 57fc7f72a4
2 changed files with 80 additions and 82 deletions


@@ -1,82 +0,0 @@
Spark comparison: AWS vs. GCP
===========
>Tianhui Michael Li and Ariel M'ndange-Pfupfu will lead a hands-on online course Oct 10, 12, and 14, 2016: Distributed Computing with Spark for Beginners. Instruction includes building functioning applications from end-to-end and mastering critical tooling around Spark.
There's little doubt that cloud computing will play an important role in data science for the foreseeable future. The flexible, scalable, on-demand computing power available is an important resource, and as a result, there's a lot of competition between the providers of this service. Two of the biggest players in the space are [Amazon Web Services (AWS)][1] and [Google Cloud Platform (GCP)][2].
This article includes a short comparison of distributed Spark workloads in AWS and GCP—both in terms of setup time and operating cost. We ran this experiment with our students at The Data Incubator, [a big data training organization][3] that helps companies hire top-notch data scientists and train their employees on the latest data science skills. Even with the efficiencies built into Spark, the cost and time of distributed workloads can be substantial, and we are always looking for the most efficient technologies so our students are learning the best and fastest tools.
### Submitting Spark jobs to the cloud
Spark is a popular distributed computation engine that incorporates MapReduce-like aggregations into a more flexible, abstract framework. There are APIs for Python and Java, but writing applications in Spark's native Scala is preferable. That makes job submission simple, as you can package your application and all its dependencies into one JAR file.
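As a sketch of what that looks like in practice, the commands below build a single assembly JAR and submit it to a YARN cluster. The project name, the main class com.example.MyJob, and the JAR path are hypothetical, and the build assumes the sbt-assembly plugin is configured.

```
# Build one "fat" JAR containing the app and all its dependencies
# (assumes sbt-assembly is enabled in project/plugins.sbt).
sbt assembly

# Submit the JAR to a YARN-managed cluster.
spark-submit \
  --class com.example.MyJob \
  --master yarn \
  --deploy-mode cluster \
  target/scala-2.11/myjob-assembly-1.0.jar
```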
It's common to use Spark in conjunction with HDFS for distributed data storage, and YARN for cluster management; this makes Spark a perfect fit for AWS's Elastic MapReduce (EMR) clusters and GCP's Dataproc clusters. Both EMR and Dataproc clusters have HDFS and YARN preconfigured, with no extra work required.
### Configuring cloud services
Managing data, clusters, and jobs from the command line is more scalable than using the web interface. For AWS, this means installing and using the command-line interface (CLI). You'll have to set up your credentials beforehand as well as make a separate keypair for the EC2 instances that are used under the hood. You'll also need to set up roles—basically permissions—for both users (making sure they have sufficient rights) and EMR itself (usually running aws emr create-default-roles in the CLI is good enough to get started).
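A minimal AWS bootstrap might look like the following; the key pair name is an example, and your account needs IAM rights to create these resources.

```
# Configure access key, secret key, and default region.
aws configure

# Create a key pair for the EC2 instances behind EMR and save it locally.
aws ec2 create-key-pair --key-name spark-keypair \
  --query 'KeyMaterial' --output text > spark-keypair.pem
chmod 400 spark-keypair.pem

# Create the default IAM roles that EMR needs.
aws emr create-default-roles
```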
For GCP the process is more straightforward. If you install the Google Cloud SDK and sign in with your Google account, you should be able to do most things right off the bat. The thing to remember here is to enable the relevant APIs in the API Manager: Compute Engine, Dataproc, and Cloud Storage JSON.
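The GCP equivalent is shorter. A sketch, assuming a recent SDK; the service names passed to gcloud are assumptions and may vary by version, and the API Manager in the web console works regardless:

```
# Sign in with your Google account and pick a default project.
gcloud init

# Enable the relevant APIs (also possible via the API Manager web UI;
# exact service names are assumptions here).
gcloud services enable compute.googleapis.com dataproc.googleapis.com
```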
Once you have things set up to your liking, the fun part begins! Using commands like aws s3 cp or gsutil cp you can copy your data into the cloud. Once you have buckets set up for your inputs, outputs, and anything else you might need, running your app is as easy as starting up a cluster and submitting the JAR file. Make sure you know where the logs are kept—it can be tricky to track down problems or bugs in a cloud environment.
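For instance, with hypothetical bucket names, a placeholder cluster ID, and the JAR from earlier, the copy-and-submit cycle might look like this:

```
# Copy input data and the application JAR into cloud storage.
aws s3 cp data.xml s3://my-spark-input/data.xml
gsutil cp data.xml gs://my-spark-input/data.xml

# Submit the job to a running EMR cluster (j-XXXXXXXX is a placeholder).
aws emr add-steps --cluster-id j-XXXXXXXX \
  --steps Type=Spark,Name=MyJob,Args=[--class,com.example.MyJob,s3://my-spark-input/myjob-assembly-1.0.jar]

# Submit the same JAR to a Dataproc cluster.
gcloud dataproc jobs submit spark --cluster my-cluster \
  --class com.example.MyJob --jars gs://my-spark-input/myjob-assembly-1.0.jar
```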
### You get what you pay for
When it comes to cost, Google's service is more affordable in several ways. First, the raw cost of purchasing computing power is cheaper. Running a Google Compute Engine machine with 4 vCPUs and 15 GB of RAM will run you $0.20 every hour, or $0.24 with Dataproc. An identically specced AWS instance will cost you $0.336 per hour running EMR.
The second factor to consider is the granularity of the billing. AWS charges by the hour, so you pay the full rate even if your job takes 15 minutes. GCP charges by the minute, with a 10-minute minimum charge. This ends up being a huge difference in cost in a lot of use cases.
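To make that concrete with the prices above: a 15-minute job on a single 4-vCPU machine is billed a full hour on EMR, or $0.336, while Dataproc bills 15 minutes at $0.24 per hour, about $0.06. That is more than a five-fold difference before the job even scales out to a cluster.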
Both services have various other discounts. You can effectively bid on spare cloud capacity with AWS's spot instances or GCP's preemptible instances. These will be cheaper than dedicated, on-demand instances, but they're not guaranteed to be available. Discounted rates are available on GCP if your instances live for long periods of time (25% to 100% of the month). On AWS, paying some of the costs upfront or buying in bulk can save you some money. The bottom line is, if you're a power user and you use cloud computing on a regular or even constant basis, you'll need to delve deeper and perform your own calculations.
Lastly, the costs for new users wanting to try out these services are lower for GCP. Google offers a 60-day free trial with $300 in credit to use however you want. AWS only offers a free tier where certain services are free to a certain point or discounted, so you will end up paying to run Spark jobs. This means that if you want to test out Spark for the first time, you'll have more freedom to do what you want on GCP without worrying about price.
### Performance comparison
We set up a trial to compare the performance and cost of a typical Spark workload. The trial used clusters with one master and five core instances of AWS's m3.xlarge and GCP's n1-standard-4. They differ slightly in specification, but the number of virtual cores and amount of memory is the same. In fact, they behaved almost identically when it came to job execution time.
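For reference, clusters of this shape can be created in one command each; the cluster names are examples and the EMR release label is an assumption from that era:

```
# EMR: one master plus five core m3.xlarge instances (6 in total).
aws emr create-cluster --name spark-trial \
  --release-label emr-4.7.2 --applications Name=Spark \
  --instance-type m3.xlarge --instance-count 6 \
  --ec2-attributes KeyName=spark-keypair --use-default-roles

# Dataproc: one n1-standard-4 master plus five n1-standard-4 workers.
gcloud dataproc clusters create spark-trial \
  --master-machine-type n1-standard-4 \
  --worker-machine-type n1-standard-4 --num-workers 5
```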
The job itself involved parsing, filtering, joining, and aggregating data from the publicly available Stack Exchange Data Dump. We ran the same JAR on a ~50M subset of the data (Cross Validated) and then on the full ~9.5G data set.
![](https://d3ansictanv2wj.cloudfront.net/1400_img_1_AWS_GCP-25ed6069029112a8439d89999796be18.jpg)
>Figure 1. Credit: Michael Li and Ariel M'ndange-Pfupfu.
![](https://d3ansictanv2wj.cloudfront.net/1400_img_2_AWS_GCP-448714718896b21e32f8b47d4657fc8c.jpg)
>Figure 2. Credit: Michael Li and Ariel M'ndange-Pfupfu.
The short job clearly benefited from GCP's by-the-minute billing, being charged only for 10 minutes of cluster time, whereas AWS charged for a full hour. But even the longer job was cheaper on GCP, both because of fractional-hour billing and a lower per-unit time cost for comparable performance. It's also worth noting that storage costs weren't included in this comparison.
### Conclusion
AWS was the first mover in the space, and this shows in the API. Its ecosystem is vast, but its permissions model is a little dated, and its configuration is a little arcane. By contrast, Google is the shiny new entrant in this space and has polished off some of the rough edges. It is missing some features on our wishlist, like an easy way to auto-terminate clusters and detailed billing information broken down by job. Also, for managing tasks programmatically in Python, the API client library isn't as full-featured as AWS's Boto.
If you're new to cloud computing, GCP is easier to get up and running, and the credits make it a tempting platform. Even if you are already used to AWS, the cost savings may make switching attractive, although the one-time cost of migrating could outweigh them.
Ultimately, it's difficult to make sweeping statements about these services because they're not just one entity; they're entire ecosystems of integrated parts, and both have pros and cons. The real winners are the users. As an example, at The Data Incubator, our Ph.D. data science fellows really appreciate the cost reduction as they learn about distributed workloads. And while our big data corporate training clients may be less price sensitive, they appreciate being able to crunch enterprise data faster, while holding price constant. Data scientists can now enjoy the multitude of options available, and the benefits of having a competitive cloud computing market.
--------------------------------------------------------------------------------
via: https://www.oreilly.com/ideas/spark-comparison-aws-vs-gcp?utm_source=dbweekly&utm_medium=email
Author: [Michael Li][a], [Ariel M'Ndange-Pfupfu][b]
Translator: [译者ID](https://github.com/译者ID)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/).
[a]: https://www.oreilly.com/people/76a5b-michael-li
[b]: https://www.oreilly.com/people/Ariel-Mndange-Pfupfu
[1]: https://aws.amazon.com/
[2]: https://cloud.google.com/
[3]: https://www.thedataincubator.com/training.html?utm_source=OReilly&utm_medium=blog&utm_campaign=AWSvsGCP


@@ -0,0 +1,80 @@
AWS vs. GCP: which is better for Spark?
===========
>Tianhui Michael Li and Ariel M'ndange-Pfupfu will lead a hands-on online course on October 10, 12, and 14 this year: Distributed Computing with Spark for Beginners. The course covers building working applications end-to-end and mastering the critical tooling around Spark.
There is little doubt that cloud computing will play a crucial role in data science for the foreseeable future. Elastic, scalable, on-demand computing power is a vital resource, and as a direct result the cloud service providers are fighting hard for the business. Two of the biggest forces in the space are [Amazon Web Services (AWS)][1] and [Google Cloud Platform (GCP)][2].
This article gives a short comparison of Spark workloads on AWS and GCP, in terms of both setup time and operating cost. We ran the experiment with our students at The Data Incubator, [a big data training organization][3] that helps companies hire top-notch data scientists and trains their employees in the latest data science skills. Even with Spark's remarkable built-in efficiencies, the time and cost of a distributed workload can be too large to ignore, so we are always looking for the most efficient technologies so that our students learn the best and fastest tools.
### Submitting Spark jobs to the cloud
Spark is a MapReduce-like parallel computation framework that is more flexible and more abstract than MapReduce itself. Spark offers Python and Java APIs, but applications are best written in its native Scala, which can package an application and all its dependency files into a single JAR file and so makes Spark job submission simple.
Spark is typically paired with HDFS for distributed data storage and with YARN for cluster management, a near-perfect combination that makes Spark an excellent fit for AWS's Elastic MapReduce (EMR) clusters and GCP's Dataproc clusters. Both kinds of cluster come with HDFS and YARN preconfigured, so no extra setup is needed.
Managing data, clusters, and jobs from the command line scales better than going through the web interface. For AWS this means installing the CLI. You must obtain credentials beforehand and create a separate key pair for the EC2 instances used under the hood. Beyond that, you need to set up roles, which are essentially permissions, both for users (making sure they have sufficient rights) and for EMR itself (usually running aws emr create-default-roles is enough to get started).
By comparison, GCP's process is more direct. If you install the Google Cloud SDK and sign in with your Google account, you can use nearly all of GCP's features right away with no other configuration. The only reminder is to enable the APIs for Compute Engine, Dataproc, and Cloud Storage JSON through the API Manager.
Once everything is set up to your liking, the fun part begins! You can copy your data into the cloud with commands like "aws s3 cp" or "gsutil cp", and create buckets for your inputs, outputs, and anything else you need; after that, running an application is as simple as starting a cluster and submitting the JAR file. Do make sure you know where the logs are kept, since tracking down problems or debugging in a cloud environment can be tricky.
### You get what you pay for
When it comes to cost, Google's service is cheaper in several respects. First, the raw cost of buying computing power is lower. A Google Compute Engine machine with 4 vCPUs and 15 GB of RAM costs just $0.20 per hour, or a mere $0.24 per hour when running Dataproc. An identically configured AWS EMR instance costs $0.336 per hour.
Second, there is the billing granularity. AWS bills by the hour: even if you use only 15 minutes, you pay for a full hour. GCP bills by the minute, with a 10-minute minimum. In many user scenarios this difference in billing alone produces a huge gap in cost.
Both services offer several other pricing mechanisms. You can bid for AWS's or GCP's spare cloud capacity through spot instances or preemptible instances. These are cheaper than dedicated, on-demand services, with the downside that available cloud resources are not guaranteed at all times. On GCP, customers who use the service for long stretches (25% to 100% of the month) can get further discounts. On AWS, paying upfront or buying in bulk can save a fair amount. The bottom line is that if you are a power user and cloud computing has become routine for you, it pays to dig deeper and run your own calculations.
Finally, GCP is cheaper for newcomers who want to try the service. It offers a 60-day free trial with $300 in credit to spend freely on any of its cloud services. AWS only offers a free tier in which particular services are free up to a particular level, so running Spark jobs will cost money. This means that for a first taste of Spark, GCP gives you more choices and fewer worries about counting pennies and haggling over price.
### Performance comparison
We ran an experiment to measure the performance and cost of a typical Spark workload, choosing AWS's m3.xlarge and GCP's n1-standard-4, each as a cluster of one master and five core instances. Their specifications differ slightly, but the number of virtual cores and the amount of memory are the same. In practice they performed strikingly alike in Spark job execution time.
The test Spark job parsed, filtered, joined, and aggregated data from the publicly available Stack Exchange Data Dump. Running the same JAR, we first processed a roughly 50M subset of the data (the Cross Validated site) and then scaled up to the full roughly 9.5G data set.
![](https://d3ansictanv2wj.cloudfront.net/1400_img_1_AWS_GCP-25ed6069029112a8439d89999796be18.jpg)
>Figure 1. Credit: Michael Li and Ariel M'ndange-Pfupfu.
![](https://d3ansictanv2wj.cloudfront.net/1400_img_2_AWS_GCP-448714718896b21e32f8b47d4657fc8c.jpg)
>Figure 2. Credit: Michael Li and Ariel M'ndange-Pfupfu.
The results show that the short job had a clear cost advantage on GCP, because GCP bills by the minute and ultimately charged only 10 minutes of fees, while AWS charged for a full hour. But even the long job was considerably cheaper on GCP, thanks both to the billing granularity and to a lower per-unit-time cost. It is also worth noting that storage costs were not included in this comparison.
### Conclusion
AWS was the pioneer of cloud computing, and this shows even in its API. AWS has an enormous ecosystem, but its access model is a little dated and its configuration management somewhat obscure. By contrast, Google is the rising star of cloud computing and has polished off some of the field's rough edges. But GCP lacks a few convenient features, such as a simple way to auto-terminate clusters and a detailed breakdown of billing by job. Moreover, its Python API client library is not as comprehensive as AWS's Boto.
If you are using cloud computing for the first time, GCP's ease of use gives it a special appeal. Even if you are already on AWS, you may find that migrating to GCP is more economical, although the real cost of moving away from AWS may outweigh the gains.
Of course, it is still very hard to sum up the two services comprehensively, because each is not a single entity but a complete ecosystem assembled from many parts, each with its own strengths and weaknesses. The real winners are the users. One example: at The Data Incubator, our Ph.D. data science fellows genuinely feel the cost reduction as they learn. And while our big data corporate training clients may be less price sensitive, they are pleased to be able to process enterprise data faster while the price stays flat. Data scientists can now enjoy a wealth of service options, all benefits won from a fiercely competitive cloud computing market.
--------------------------------------------------------------------------------
via: https://www.oreilly.com/ideas/spark-comparison-aws-vs-gcp?utm_source=dbweekly&utm_medium=email
Author: [Michael Li][a], [Ariel M'Ndange-Pfupfu][b]
Translator: [firstadream](https://github.com/firstadream)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/).
[a]: https://www.oreilly.com/people/76a5b-michael-li
[b]: https://www.oreilly.com/people/Ariel-Mndange-Pfupfu
[1]: https://aws.amazon.com/
[2]: https://cloud.google.com/
[3]: https://www.thedataincubator.com/training.html?utm_source=OReilly&utm_medium=blog&utm_campaign=AWSvsGCP