Topic selection: 20191226 10 articles to become more data science savvy

sources/tech/20191226 10 articles to become more data science savvy.md
DarkSun 2019-12-27 00:54:22 +08:00
parent f8538b44f8
commit be9e44f95c

@ -0,0 +1,123 @@
[#]: collector: (lujun9972)
[#]: translator: ( )
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (10 articles to become more data science savvy)
[#]: via: (https://opensource.com/article/19/12/data-science-resources)
[#]: author: (Lauren Maffeo https://opensource.com/users/lmaffeo)
10 articles to become more data science savvy
======
Boost your data science game in 2020 with Opensource.com's top 10 most-read articles on the topic from 2019.
![Open data brain][1]
When LinkedIn released its third annual [Emerging Jobs report][2], engineers everywhere said, "Amen." More than half the list consists of engineering roles, with new fields like robotics appearing for the first time.
But data science had a strong showing as well. The role shows 37% annual growth, topping that aspect of the Emerging Jobs list for the third year in a row.
Looking at the core skills a data scientist needs—including R, Python, and Apache Spark—it's easy to find overlaps with open source. So, we're not surprised that data science was one of the most popular topics at Opensource.com in 2019.
We saw a need for knowledge about diverse data science topics. And our community of authors delivered answers.
For your reading pleasure, we've listed the top 10 data science articles of 2019. We define "top" as the data science articles that were published in 2019 and earned the most page views, starting with the most popular.
Whether you want to use Kubernetes for batch jobs or query 10 years' worth of GitHub data, these articles will boost your data science game in 2020.
### Why data scientists love Kubernetes
Kubernetes is having more than a moment. That's due in no small part to its versatility. You might already know that Kubernetes helps software developers and system operators deploy applications in Linux containers. But did you know how helpful it can be for data science as well?
In [_Why data scientists love Kubernetes_][3], our most popular data science article in 2019, William Benton and Sophie Watson share how Kubernetes supports the data science workflow. From repeatable batch jobs to debugging ML models, this article shares several ways for data scientists to leverage Kubernetes.
### How to use Spark SQL: A hands-on tutorial
Wondering how to use a cloud service for big data analytics? [_How to use Spark SQL: A hands-on tutorial_][4] uses Spark DataFrames to show how to use relational databases at scale. DJ Sarkar uses a real-world dataset to walk readers through the process of using Spark SQL.
Rich with screenshots and code, Sarkar's tutorial is the ideal sequel to [his first piece][5] on this subject. He shares several ways that you can use Spark to manage structured data obtained from flat files or databases.
### 9 resources for data science projects
The growth of data science in open source—from machine learning to neural networks—has left many engineers wanting to learn more. In [_9 resources for data science projects_][6], Dan Barker shares the books, tools, and online courses he thinks are a must for any engineer who wants to get started.
Barker is especially keen on Cathy O'Neil's book [_Weapons of Math Destruction_][7], which shares how bias creeps into data and how you can stop it. He also shares a range of websites for newbies to explore.
### Getting started with data science using Python
Python's popularity has soared alongside data science techniques, and it is now one of the most popular programming languages. When used with libraries like pandas and Seaborn, Python is an ideal entry point to data science.
In [_Getting started with data science using Python_][8], a follow-up to his [intro to Python][9] article, Seth Kenlon shares how to create a Python virtual environment; install pandas and NumPy; create a sample dataset; and much more. This article is an especially good read if you want to learn more about data visualization.
### How to analyze log data with Python and Apache Spark
Like many articles in our top 10 list, [_How to analyze log data with Python and Apache Spark_][10] is a sequel to [an earlier article][11] on using Python and Apache Spark to wrangle data. Once you've learned how to put your data into a clean, structured format, DJ Sarkar offers this piece to help you analyze that data.
Whether you want to see the top 10 error endpoints or content size statistics, Sarkar shows you how to analyze several types of log data in your [DataFrame][12]. The data that he uses isn't "big data" from a size or volume standpoint. But these techniques can scale for use with larger datasets.
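
To give a feel for the kind of questions described above, here is a minimal sketch in plain Python rather than Spark. It is not Sarkar's code: the records are made up, the field names (`endpoint`, `status`, `size`) are hypothetical, and treating every non-200 response as an "error" is an assumption made for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical, already-structured log records (made-up data):
# requested endpoint, HTTP status code, response size in bytes.
RECORDS = [
    {"endpoint": "/index.html", "status": 200, "size": 5120},
    {"endpoint": "/missing.html", "status": 404, "size": 0},
    {"endpoint": "/images/logo.gif", "status": 200, "size": 1839},
    {"endpoint": "/missing.html", "status": 404, "size": 0},
    {"endpoint": "/old-page.html", "status": 404, "size": 0},
]


def top_error_endpoints(records, n=10):
    """Return the n endpoints with the most non-200 responses.

    Assumption: any status other than 200 counts as an error.
    """
    errors = Counter(r["endpoint"] for r in records if r["status"] != 200)
    return errors.most_common(n)


def content_size_stats(records):
    """Return the minimum, maximum, and mean response size in bytes."""
    sizes = [r["size"] for r in records]
    return {"min": min(sizes), "max": max(sizes), "mean": mean(sizes)}


if __name__ == "__main__":
    print(top_error_endpoints(RECORDS))
    print(content_size_stats(RECORDS))
```

The same two questions map naturally onto a filter, a group-by count, and an aggregate in a DataFrame library, which is what lets the approach scale to larger datasets.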
### How to wrangle log data with Python and Apache Spark
[_How to wrangle log data with Python and Apache Spark_][11], DJ Sarkar's prequel to his piece on analyzing log data, also made our top 10 list. It's no surprise since most organizations use a range of systems and infrastructure that run constantly. Data logs are an ideal way to make sure that everything keeps working effectively.
In this tutorial, Sarkar shows how to use Apache Spark on real-world production logs from NASA. He walks through the process of using Spark to do log analytics at scale on semi-structured log data. This ranges from setting up dependencies to data wrangling.
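
As a rough illustration of what wrangling semi-structured log data involves, the sketch below turns raw text lines into structured records using only the Python standard library, not Spark. It assumes the lines follow the Common Log Format; the sample lines are invented, and the names `LOG_PATTERN` and `parse_log_line` are hypothetical helpers rather than anything from the tutorial.

```python
import re

# Assumed layout (Common Log Format):
#   host ident user [timestamp] "METHOD endpoint protocol" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<endpoint>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means no body was sent; store it as 0 bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Made-up sample lines, including one malformed entry.
SAMPLE_LINES = [
    'host1.example.com - - [01/Aug/1995:00:00:01 -0400] "GET /images/logo.gif HTTP/1.0" 200 1839',
    'host2.example.com - - [01/Aug/1995:00:00:07 -0400] "GET /missing.html HTTP/1.0" 404 -',
    'not a log line',
]

records = [parse_log_line(line) for line in SAMPLE_LINES]
parsed = [r for r in records if r is not None]

if __name__ == "__main__":
    for record in parsed:
        print(record)
    print(f"{len(records) - len(parsed)} line(s) could not be parsed")
```

Counting the lines that fail to parse is a cheap sanity check before any analysis, since a pattern that is slightly too strict can silently drop real data.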
### Querying 10 years of GitHub data with GHTorrent and Libraries.io
Did you know that you can use Kibana or the Elasticsearch API to turn Amazon S3 object-storage data into a searchable Elasticsearch-type cluster? Likewise, did you know about the project that aims to build an offline version of all data available through GitHub APIs?
In [_Querying 10 years of GitHub data with GHTorrent and Libraries.io_][13], Pete Cheslock explores how to access and query GHTorrent data. You can do it using several formats, including CSV and Google BigQuery. Cheslock uses the latter to search indexed GHTorrent data to learn which programming languages and licenses are most popular among GitHub projects and how fast they are growing.
### Predicting NFL play outcomes with Python and data science
Want to increase your machine learning skills in Python? With the NFL playoff season upon us, it's a great time to read [_Predicting NFL play outcomes with Python and data science_][14], which shares some data science tips to predict plays.
Christa Hayes shows how to spot weird values, predict downs and play types, make regression plots, and train models. Once you've read her article on [how to format data for training][15], this one is the ideal next step.
### Analyzing the Stack Overflow Survey with Python and Pandas
Stack Overflow's annual developer survey is a tech behemoth. Nearly 90,000 developers took this year's 20-minute survey and left a lot of data in their wake.
To find certain results, Moshe Zadka used the pandas library to search the survey's [anonymized results][16]. If you want to filter Stack Overflow's dataset for certain details (like seeing how many developers use certain languages or contribute to open source projects), Moshe's [_Analyzing the Stack Overflow Survey with Python and Pandas_][17] tutorial shows you how.
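
The filtering idea can be sketched without pandas, using only Python's `csv` module. This is a stand-in for illustration, not Zadka's approach: the rows are invented, and the column names (`respondent`, `languages`, `open_source`) and the semicolon-separated language field are assumptions rather than the survey's actual schema.

```python
import csv
import io

# Made-up rows; the column names are hypothetical, not the real survey schema.
SAMPLE_CSV = """respondent,languages,open_source
1,Python;JavaScript,Yes
2,Go;Rust,No
3,Python;C++,No
4,JavaScript,Yes
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def uses_language(row, language):
    """Return True if the respondent lists the given language."""
    return language in row["languages"].split(";")


rows = load_rows(SAMPLE_CSV)
python_users = [r for r in rows if uses_language(r, "Python")]
python_and_open_source = [r for r in python_users if r["open_source"] == "Yes"]

if __name__ == "__main__":
    print(f"{len(python_users)} of {len(rows)} respondents use Python")
    print(f"{len(python_and_open_source)} of those contribute to open source")
```

Splitting the multi-value field before testing membership avoids false matches that a plain substring check would produce, such as "Java" matching "JavaScript".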
### 4 Python tools for getting started with astronomy
For readers with their heads in the clouds, NumFOCUS republished some of its blog posts on Opensource.com this year. In [_4 Python tools for getting started with astronomy_][18], Dr. Gina Helfrich shares how you can get involved in astronomy.
Intimidated? Don't be: Dr. Helfrich says Python packages are so advanced that building data-reduction scripts is much easier than ever before. If you want to play with astronomy imaging datasets, this piece will steer you in the right direction.
### What do you want to know about data science?
Data science is an exciting field with countless things to explore. If there's something you want to know about data science, please tell us about it in the comments so we can try to cover it in 2020. Or, if you are so inclined, please share your knowledge with Opensource.com readers by [submitting an article][19] about your favorite data science topic.
--------------------------------------------------------------------------------
via: https://opensource.com/article/19/12/data-science-resources
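Author: [Lauren Maffeo][a]
Topic selection: [lujun9972][b]
Translator: [Translator ID](https://github.com/译者ID)
Proofreader: [Proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)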
[a]: https://opensource.com/users/lmaffeo
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/osdc_520x292_opendata_0613mm.png?itok=UIjD_jhK (Open data brain)
[2]: https://business.linkedin.com/content/dam/me/business/en-us/talent-solutions/emerging-jobs-report/Emerging_Jobs_Report_U.S._FINAL.pdf
[3]: https://opensource.com/article/19/1/why-data-scientists-love-kubernetes
[4]: https://opensource.com/article/19/3/apache-spark-and-dataframes-tutorial
[5]: https://opensource.com/article/19/3/sql-scale-apache-spark-sql-and-dataframes
[6]: https://opensource.com/article/19/2/learn-data-science-ai
[7]: https://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815
[8]: https://opensource.com/article/19/9/get-started-data-science-python
[9]: https://opensource.com/article/17/10/python-101
[10]: https://opensource.com/article/19/5/visualize-log-data-apache-spark
[11]: https://opensource.com/article/19/5/log-data-apache-spark
[12]: https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html#dataframe
[13]: https://opensource.com/article/19/5/chaossearch-github-ghtorrent
[14]: https://opensource.com/article/19/10/predicting-nfl-plays-python
[15]: https://opensource.com/article/19/10/formatting-nfl-data-python
[16]: https://insights.stackoverflow.com/survey
[17]: https://opensource.com/article/19/9/stack-overflow-survey-python-pandas
[18]: https://opensource.com/article/19/10/python-astronomy-open-data
[19]: https://opensource.com/how-submit-article