Mirror of https://github.com/LCTT/TranslateProject.git (synced 2025-01-13 22:30:37 +08:00)

Commit 86f928f586: Merge remote-tracking branch 'LCTT/master'

@@ -5,7 +5,7 @@ set -e
 function help()
 {
 cat <<EOF
-Usage: ${0##*/} [+-s} [published] [translated] [translating] [sources]
+Usage: ${0##*/} [+-s] [published] [translated] [translating] [sources]

 显示已发布、已翻译、正在翻译和待翻译的数量

@@ -27,7 +27,8 @@ shift $(( OPTIND - 1 ))
 OPTIND=1

 declare -A catalog_comment_dict
-catalog_comment_dict=([translated]="待校对" [published]="已发布" [translating]="翻译中" [sources]="待翻译")
+catalog_comment_dict=([sources]="待翻译" [translating]="翻译中" [translated]="待校对" [published]="已发布")
+catalog_color_dict=([sources]="#97CA00" [translating]="#00BCD5" [translated]="#FF9800" [published]="#FF5722")

 function count_files_under_dir()
 {
@@ -60,26 +61,27 @@ for catalog in "$@";do
 esac

 comment=${catalog_comment_dict[${catalog}]}
+color=${catalog_color_dict[${catalog}]}
 if [[ "${show_format}" == "svg" ]];then
 cat <<EOF
-<svg xmlns="http://www.w3.org/2000/svg" width="120" height="20">
+<svg xmlns="http://www.w3.org/2000/svg" width="100" height="20">
 <linearGradient id="b" x2="0" y2="100%">
 <stop offset="0" stop-color="#bbb" stop-opacity=".1" />
 <stop offset="1" stop-opacity=".1" />
 </linearGradient>
 <mask id="a">
-<rect width="132.53125" height="20" rx="3" fill="#fff" />
+<rect width="100" height="20" rx="3" fill="#fff" />
 </mask>
 <g mask="url(#a)">
-<path fill="#555" d="M0 0 h70.53125 v20 H0 z" />
-<path fill="#97CA00" d="M70.53125 0 h62.0 v20 H70.53125 z" />
-<path fill="url(#b)" d="M0 0 h132.53125 v20 H0 z" />
+<path fill="#555" d="M0 0 h60 v20 H0 z" />
+<path fill="${comment}" d="M60 0 h40 v20 H70.53125 z" />
+<path fill="url(#b)" d="M0 0 h100 v20 H0 z" />
 </g>
 <g fill="#fff" font-family="DejaVu Sans" font-size="11">
-<text x="6" y="15" fill="#010101" fill-opacity=".3">${comment}</text>
-<text x="6" y="14">${comment}</text>
-<text x="74.53125" y="15" fill="#010101" fill-opacity=".3">${num}</text>
-<text x="74.53125" y="14">${num}</text>
+<text x="12" y="15" fill="#010101" fill-opacity=".3">${comment}</text>
+<text x="12" y="14">${comment}</text>
+<text x="70" y="15" fill="#010101" fill-opacity=".3">${num}</text>
+<text x="70" y="14">${num}</text>
 </g>
 </svg>
 EOF

@@ -1,138 +0,0 @@
Translating by MjSeven

Getting started with Python for data science
======

![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/GOV_open_data_520x292.jpg?itok=R8rBrlk7)

Whether you're a budding data science enthusiast with a math or computer science background or an expert in an unrelated field, the possibilities data science offers are within your reach. And you don't need expensive, highly specialized enterprise software; the open source tools discussed in this article are all you need to get started.

[Python][1], its machine-learning and data science libraries ([pandas][2], [Keras][3], [TensorFlow][4], [scikit-learn][5], [SciPy][6], [NumPy][7], etc.), and its extensive list of visualization libraries ([Matplotlib][8], [pyplot][9], [Plotly][10], etc.) are excellent FOSS tools for beginners and experts alike. Easy to learn, popular enough to offer community support, and armed with the latest emerging techniques and algorithms developed for data science, these comprise one of the best toolsets you can acquire when starting out.

Many of these Python libraries are built on top of each other (known as dependencies), and the basis is the [NumPy][7] library. Designed specifically for data science, NumPy is often used to store relevant portions of datasets in its ndarray datatype, which is convenient for moving records between relational tables, CSV files, or any other format, and vice versa. It is particularly convenient when scikit-learn functions are applied to multidimensional arrays. SQL is great for querying databases, but for complex and resource-intensive data science operations, storing data in ndarrays boosts efficiency and speed (just make sure you have ample RAM when dealing with large datasets). When you get to using pandas for knowledge extraction and analysis, the almost seamless conversion between the DataFrame datatype in pandas and the ndarray in NumPy creates a powerful combination for extraction and compute-intensive operations, respectively.
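
As a minimal illustration of that round trip (the frame here is a made-up stand-in, not the crime dataset):
```
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> arr = df.to_numpy()        # DataFrame -> ndarray for compute-heavy work
>>> type(arr)
<class 'numpy.ndarray'>
>>> pd.DataFrame(arr, columns=df.columns)  # ...and back to a DataFrame
   a  b
0  1  3
1  2  4
```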

For a quick demonstration, let's fire up the Python shell, load an open dataset on crime statistics from the city of Baltimore into a pandas DataFrame variable, and view a portion of the loaded frame:
```
>>> import pandas as pd
>>> crime_stats = pd.read_csv('BPD_Arrests.csv')
>>> crime_stats.head()
```

![](https://opensource.com/sites/default/files/styles/panopoly_image_original/public/images/life-uploads/crime_stats_chart.jpg?itok=_rPXJYHz)

We can now run most of the queries on this pandas DataFrame that we could run with SQL against a database. For instance, to get all the unique values of the "Description" attribute, the SQL query is:
```
SELECT DISTINCT "Description" FROM crime_stats;
```

This same query written for a pandas DataFrame looks like this:
```
>>> crime_stats['Description'].unique()
['COMMON ASSAULT' 'LARCENY' 'ROBBERY - STREET' 'AGG. ASSAULT'
 'LARCENY FROM AUTO' 'HOMICIDE' 'BURGLARY' 'AUTO THEFT'
 'ROBBERY - RESIDENCE' 'ROBBERY - COMMERCIAL' 'ROBBERY - CARJACKING'
 'ASSAULT BY THREAT' 'SHOOTING' 'RAPE' 'ARSON']
```

which returns a NumPy array (ndarray):
```
>>> type(crime_stats['Description'].unique())
<class 'numpy.ndarray'>
```

Next, let's feed this data into a neural network to see how accurately it can predict the type of weapon used, given data such as the time the crime was committed, the type of crime, and the neighborhood in which it happened:
```
>>> from sklearn.neural_network import MLPClassifier
>>> import numpy as np
>>>
>>> prediction = crime_stats[['Weapon']]
>>> predictors = crime_stats[['CrimeTime', 'CrimeCode', 'Neighborhood']]
>>>
>>> nn_model = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
>>>
>>> # fit(X, y): the features come first, then the target we want to predict
>>> predict_weapon = nn_model.fit(predictors, prediction)
```

Now that the learning model is ready, we can perform several tests to determine its quality and reliability. For starters, let's feed it a set of records held out from training (a portion of the original dataset that was not used to fit the model):
```
>>> predict_weapon.predict(training_set_weapons)
array([4, 4, 4, ..., 0, 4, 4])
```

As you can see, it returns an array, with each number predicting the weapon for each of the records in the set. We see numbers rather than weapon names because most classification algorithms are optimized for numerical data. For categorical data, there are techniques that can reliably convert attributes into numerical representations. In this case, the technique used is label encoding, via the LabelEncoder class in the sklearn preprocessing library: `preprocessing.LabelEncoder()`. It has methods to transform data to and from its numerical representation. In this example, we can use the `inverse_transform` method of a fitted LabelEncoder to see what Weapons 0 and 4 are:
```
>>> preprocessing.LabelEncoder().inverse_transform(encoded_weapons)
array(['HANDS', 'FIREARM', 'HANDS', ..., 'FIREARM', 'FIREARM', 'FIREARM'])
```
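
For context, here is a minimal end-to-end sketch of that encoding round trip (the label set is made up for illustration):
```
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit(['HANDS', 'FIREARM', 'KNIFE'])   # classes are stored sorted
LabelEncoder()
>>> le.transform(['FIREARM', 'HANDS'])      # labels -> numbers
array([0, 1])
>>> le.inverse_transform([0, 1])            # numbers -> labels
array(['FIREARM', 'HANDS'], dtype='<U7')
```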

This is fun to see, but to get an idea of how accurate this model is, let's calculate its accuracy score:
```
>>> nn_model.score(X, y)   # X and y: the features and target labels to score against
0.81999999999999995
```

This shows that our neural network model is ~82% accurate. That result seems impressive, but it is important to check its effectiveness on a different crime dataset. There are other tests, like correlation analysis and confusion matrices, for doing this. Although our model has high accuracy, it is not very useful for general crime datasets because this particular dataset has a disproportionate number of rows that list 'FIREARM' as the weapon used. Unless it is re-trained, our classifier is most likely to predict 'FIREARM', even if the input dataset has a different distribution.

It is important to clean the data and remove outliers and aberrations before we classify it. The better the preprocessing, the better the accuracy of our insights. Also, feeding the model/classifier too much of the same data in pursuit of higher accuracy (generally over ~90%) is a bad idea, because it looks accurate but is not useful due to [overfitting][11].
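
A common guard against overfitting is to score the model only on data it never saw during training. Here is a hedged sketch using scikit-learn's `train_test_split` (the `encoded_*` variable names are illustrative placeholders for the numerically encoded columns):
```
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
...     encoded_predictors, encoded_weapons, test_size=0.2, random_state=1)
>>> nn_model.fit(X_train, y_train)
>>> nn_model.score(X_test, y_test)   # accuracy on unseen data
```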

[Jupyter notebooks][12] are a great interactive alternative to the command line. While the CLI is fine for most things, Jupyter shines when you want to run snippets on the go to generate visualizations. It also formats data better than the terminal.

[This article][13] has a list of some of the best free resources for machine learning, but plenty of additional guidance and tutorials are available. You will also find many open datasets to use, based on your interests and inclinations. As a starting point, the datasets maintained by [Kaggle][14] and those available on state government websites are excellent resources.

Payal Singh will be presenting at SCaLE16x this year, March 8-11 in Pasadena, California. To attend and get 50% off your ticket, [register][15] using promo code **OSDC**.

--------------------------------------------------------------------------------

via: https://opensource.com/article/18/3/getting-started-data-science

作者:[Payal Singh][a]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)

本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出

[a]: https://opensource.com/users/payalsingh
[1]: https://www.python.org/
[2]: https://pandas.pydata.org/
[3]: https://keras.io/
[4]: https://www.tensorflow.org/
[5]: http://scikit-learn.org/stable/
[6]: https://www.scipy.org/
[7]: http://www.numpy.org/
[8]: https://matplotlib.org/
[9]: https://matplotlib.org/api/pyplot_api.html
[10]: https://plot.ly/
[11]: https://www.kdnuggets.com/2014/06/cardinal-sin-data-mining-data-science.html
[12]: http://jupyter.org/
[13]: https://machinelearningmastery.com/best-machine-learning-resources-for-getting-started/
[14]: https://www.kaggle.com/
[15]: https://register.socallinuxexpo.org/reg6/

@@ -1,5 +1,3 @@
-Translating by jlztan
-
 Top 10 Raspberry Pi blogs to follow
 ======

@@ -1,94 +0,0 @@
translating by Flowsnow

Getting started with functional programming in Python using the toolz library
======

The toolz library allows you to manipulate functions, making code easier to understand and test.

![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/programming-code-keyboard-laptop-music-headphones.png?itok=EQZ2WKzy)

In the second of a two-part series, we continue to explore how we can import ideas from functional programming methodology into Python to have the best of both worlds.

In the previous post, we covered [immutable data structures][1]. Those allow us to write "pure" functions, or functions that have no side effects, merely accepting some arguments and returning a result while maintaining decent performance.

In this post, we build on that using the `toolz` library. This library has functions that manipulate other functions, and they work especially well with pure functions. In the functional programming world, these are often referred to as "higher-order functions" since they take functions as arguments and return functions as results.

Let's start with this:

```
def add_one_word(words, word):
    # note: the immutable map's set() takes a key and a value
    return words.set(word, words.get(word, 0) + 1)
```

This function assumes that its first argument is an immutable dict-like object, and it returns a new dict-like object with the relevant place incremented: It's a simple frequency counter.

However, it is useful only if we apply it to a stream of words and reduce. We have access to a reducer in the built-in `functools` module: `functools.reduce(function, stream, initializer)`.
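
Concretely, a direct (uncurried) call would look like this, assuming `pyrsistent` provides the immutable map as in the previous post:
```
import functools
import pyrsistent

counts = functools.reduce(add_one_word, ['a', 'b', 'a'], pyrsistent.m())
# counts is now pmap({'a': 2, 'b': 1})
```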

We want a function that, applied to a stream, will return a frequency count.

We start by using `toolz.curry`:

```
from toolz import curry
add_all_words = curry(functools.reduce, add_one_word)
```

With this version, we will need to supply the initializer. However, we can't just pass `pyrsistent.m()` to the curried `add_all_words`; it would land in the wrong position, since `reduce` expects the stream before the initializer.

```
from toolz import flip
add_all_words_flipped = flip(add_all_words)
```

The `flip` higher-order function returns a function that calls the original, with arguments flipped.

```
get_all_words = add_all_words_flipped(pyrsistent.m())
```

We take advantage of the fact that `flip` curries the function it returns, so supplying just the initializer gives us back a function waiting for a stream, with its starting value baked in: an empty (immutable) dictionary.
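
To see `flip`'s currying in isolation, here is a small sketch with a throwaway function:
```
from toolz import flip

def div(a, b):
    return a / b

flipped_div = flip(div)   # flip returns a curried function
half = flipped_div(2)     # partially apply the (new) first argument
assert half(10) == 5.0    # equivalent to div(10, 2)
```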

Now we can call `get_all_words(word_stream)` and get a frequency dictionary. However, how do we get a word stream? Python files are streams of lines.

```
def to_words(lines):
    for line in lines:
        yield from line.split()
```

After testing each function by itself, we can combine them:

```
import toolz
words_from_file = toolz.compose(get_all_words, to_words)
```
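
A quick usage sketch of the composed pipeline (the file name is hypothetical):
```
with open('war_and_peace.txt') as f:
    frequencies = words_from_file(f)   # an immutable map of word counts
```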

In this case, the composition of just two functions was straightforward to read: apply `to_words` first, then apply `get_all_words` to the result. The prose, it seems, runs in the inverse order of the code.

This matters when we start taking composability seriously. It is sometimes possible to write the code as a sequence of units, test each individually, and finally compose them all. If there are several elements, the ordering of `compose` can get tricky to understand.

The `toolz` library borrows from the Unix command line and offers pipeline-order variants: `compose_left` builds the same function with the steps listed in the order they run, and `pipe` threads a concrete value through such a sequence of functions.

```
words_from_file = toolz.compose_left(to_words, get_all_words)
```

Now it reads more intuitively: pipe the input into `to_words`, and pipe the results into `get_all_words`. On a command line, the equivalent would look like this:

```
$ cat files | to_words | get_all_words
```

The `toolz` library allows us to manipulate functions, slicing, dicing, and composing them to make our code easier to understand and to test.

--------------------------------------------------------------------------------

via: https://opensource.com/article/18/10/functional-programming-python-toolz

作者:[Moshe Zadka][a]
选题:[lujun9972][b]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)

本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出

[a]: https://opensource.com/users/moshez
[b]: https://github.com/lujun9972
[1]: https://opensource.com/article/18/10/functional-programming-python-immutable-data-structures

@@ -0,0 +1,134 @@
Python 数据科学入门
======

![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/GOV_open_data_520x292.jpg?itok=R8rBrlk7)

无论你是一个具有数学或计算机科学背景的数据科学爱好者,还是一个其它领域的专家,数据科学提供的可能性都在你力所能及的范围内,而且你不需要昂贵的、高度专业化的企业软件。本文中讨论的开源工具就是你入门时所需的全部内容。

[Python][1]、其机器学习和数据科学库([pandas][2]、[Keras][3]、[TensorFlow][4]、[scikit-learn][5]、[SciPy][6]、[NumPy][7] 等),以及大量可视化库([Matplotlib][8]、[pyplot][9]、[Plotly][10] 等)对于初学者和专家来说都是优秀的 FOSS(译注:全称为 Free and Open Source Software)工具。它们易于学习,很受欢迎且受到社区支持,并拥有为数据科学开发的最新技术和算法。它们是你在开始学习时可以获得的最佳工具集之一。

许多 Python 库都是建立在彼此之上的(称为依赖项),其基础是 [NumPy][7] 库。NumPy 专门为数据科学设计,经常用于在其 ndarray 数据类型中存储数据集的相关部分。ndarray 是一种方便的数据类型,可以将关系表中的记录以 CSV 或其它任何格式存入其中,反之亦然。将 scikit-learn 函数应用于多维数组时,它特别方便。SQL 非常适合查询数据库,但是对于执行复杂和资源密集型的数据科学操作,在 ndarray 中存储数据可以提高效率和速度(处理大量数据集时请确保有足够的 RAM)。当你使用 pandas 进行知识提取和分析时,pandas 中的 DataFrame 数据类型和 NumPy 中的 ndarray 之间的无缝转换,分别为提取和计算密集型操作创建了一个强大的组合。

为了快速演示,让我们启动 Python shell,并在一个 pandas DataFrame 变量中加载来自巴尔的摩(Baltimore)的犯罪统计数据的开放数据集,然后查看加载后的 frame 的一部分:
```
>>> import pandas as pd
>>> crime_stats = pd.read_csv('BPD_Arrests.csv')
>>> crime_stats.head()
```

![](https://opensource.com/sites/default/files/styles/panopoly_image_original/public/images/life-uploads/crime_stats_chart.jpg?itok=_rPXJYHz)

我们现在可以在这个 pandas DataFrame 上执行大多数 SQL 能在数据库中完成的查询。例如,要获取 “Description” 属性的所有唯一值,SQL 查询是:

```
SELECT DISTINCT "Description" FROM crime_stats;
```

利用 pandas DataFrame 编写相同的查询如下所示:

```
>>> crime_stats['Description'].unique()
['COMMON ASSAULT' 'LARCENY' 'ROBBERY - STREET' 'AGG. ASSAULT'
 'LARCENY FROM AUTO' 'HOMICIDE' 'BURGLARY' 'AUTO THEFT'
 'ROBBERY - RESIDENCE' 'ROBBERY - COMMERCIAL' 'ROBBERY - CARJACKING'
 'ASSAULT BY THREAT' 'SHOOTING' 'RAPE' 'ARSON']
```

它返回的是一个 NumPy 数组(ndarray 类型):

```
>>> type(crime_stats['Description'].unique())
<class 'numpy.ndarray'>
```

接下来,让我们将这些数据输入神经网络,看看在给出犯罪发生的时间、犯罪类型以及发生地点等数据时,它能多准确地预测所使用的武器类型:

```
>>> from sklearn.neural_network import MLPClassifier
>>> import numpy as np
>>>
>>> prediction = crime_stats[['Weapon']]
>>> predictors = crime_stats[['CrimeTime', 'CrimeCode', 'Neighborhood']]
>>>
>>> nn_model = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5,2), random_state=1)
>>>
>>> # fit(X, y):先传入特征,再传入要预测的目标
>>> predict_weapon = nn_model.fit(predictors, prediction)
```

现在学习模型准备就绪,我们可以执行一些测试来确定其质量和可靠性。首先,让我们输入一些在训练时保留的数据(即原始数据集中未用于训练模型的部分):

```
>>> predict_weapon.predict(training_set_weapons)
array([4, 4, 4, ..., 0, 4, 4])
```

如你所见,它返回一个数组,其中每个数字都是对相应记录所用武器的预测。我们之所以看到的是数字而不是武器名称,是因为大多数分类算法都是针对数值数据优化的。对于分类数据,有一些技术可以将属性可靠地转换为数字表示。在这种情况下,使用的技术是标签编码(Label Encoding),即 sklearn 预处理库中的 LabelEncoder 类:`preprocessing.LabelEncoder()`。它能够对数据和其对应的数值表示进行变换和逆变换。在这个例子中,我们可以使用 LabelEncoder() 的 `inverse_transform` 函数来查看武器 0 和 4 是什么:

```
>>> preprocessing.LabelEncoder().inverse_transform(encoded_weapons)
array(['HANDS', 'FIREARM', 'HANDS', ..., 'FIREARM', 'FIREARM', 'FIREARM'])
```

这很有趣,但为了了解这个模型的准确程度,让我们计算一下它的准确率:

```
>>> nn_model.score(X, y)
0.81999999999999995
```

这表明我们的神经网络模型准确度约为 82%。这个结果似乎令人印象深刻,但在将其用于其它犯罪数据集时,检查其有效性非常重要。还有其它测试可以做到这一点,如相关性分析、混淆矩阵等。尽管我们的模型准确率很高,但它对于一般犯罪数据集并不是非常有用,因为这个特定数据集中列出 ‘FIREARM’ 作为所用武器的行数所占比例过高。除非重新训练,否则我们的分类器最有可能预测出 ‘FIREARM’,即使输入数据集有不同的分布。

在对数据进行分类之前,清洗数据并删除异常值和畸形数据非常重要。预处理越好,我们得到的见解的准确性就越高。此外,为了获得更高的准确度(一般超过 ~90%)而向模型/分类器喂入过多同类数据是一个坏主意,因为它看起来准确,但由于[过度拟合][11]而并不实用。

[Jupyter notebooks][12] 相对于命令行来说是一个很好的交互式替代品。虽然 CLI 对大多数事情都很好,但是当你想要运行代码片段以生成可视化时,Jupyter 会很出色。它比终端更好地格式化数据。

[这篇文章][13] 列出了一些最好的机器学习免费资源,但是还有很多其它的指导和教程。根据你的兴趣和爱好,你还会发现许多开放数据集可供使用。作为起点,由 [Kaggle][14] 维护的数据集,以及在州政府网站上提供的数据集都是极好的资源。

(to 校正:最后这句话不知该如何理解)

Payal Singh 将出席今年 3 月 8 日至 11 日在加利福尼亚州(California)的帕萨迪纳(Pasadena)举行的 SCaLE16x。要参加并获得五折门票,请使用优惠码 **OSDC** [注册][15]。

--------------------------------------------------------------------------------

via: https://opensource.com/article/18/3/getting-started-data-science

作者:[Payal Singh][a]
译者:[MjSeven](https://github.com/MjSeven)
校对:[校对者ID](https://github.com/校对者ID)

本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出

[a]: https://opensource.com/users/payalsingh
[1]: https://www.python.org/
[2]: https://pandas.pydata.org/
[3]: https://keras.io/
[4]: https://www.tensorflow.org/
[5]: http://scikit-learn.org/stable/
[6]: https://www.scipy.org/
[7]: http://www.numpy.org/
[8]: https://matplotlib.org/
[9]: https://matplotlib.org/api/pyplot_api.html
[10]: https://plot.ly/
[11]: https://www.kdnuggets.com/2014/06/cardinal-sin-data-mining-data-science.html
[12]: http://jupyter.org/
[13]: https://machinelearningmastery.com/best-machine-learning-resources-for-getting-started/
[14]: https://www.kaggle.com/
[15]: https://register.socallinuxexpo.org/reg6/

@@ -0,0 +1,98 @@
使用 Python 的 toolz 库开始函数式编程
======

toolz 库允许你操作函数,使代码更容易理解和测试。

![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/programming-code-keyboard-laptop-music-headphones.png?itok=EQZ2WKzy)

在这个由两部分组成的系列文章的第二部分中,我们将继续探索如何将函数式编程方法中的好想法引入到 Python 中,以实现两全其美。

在上一篇文章中,我们介绍了[不可变数据结构][1]。这些数据结构使得我们可以编写“纯”函数,或者说是没有副作用的函数,仅仅接受一些参数并返回结果,同时保持良好的性能。

在这篇文章中,我们使用 toolz 库来构建。这个库中的函数可以操作其它函数,并且它们在纯函数上表现得特别好。在函数式编程世界中,它们通常被称为“高阶函数”,因为它们将函数作为参数、将函数作为结果返回。

让我们从这里开始:

```
def add_one_word(words, word):
    # 注意:不可变映射的 set() 接受一个键和一个值
    return words.set(word, words.get(word, 0) + 1)
```

这个函数假设它的第一个参数是一个不可变的类似字典的对象,它返回一个新的类似字典的对象,其中相应位置的计数加一:这是一个简单的频率计数器。

但是,只有将它应用于单词流并做归约时才有用。我们可以使用内置模块 `functools` 中的归约器:`functools.reduce(function, stream, initializer)`。

我们想要一个函数,应用于流后能返回频率计数。

我们首先使用 `toolz.curry` 函数:

```
add_all_words = curry(functools.reduce, add_one_word)
```

使用此版本,我们需要提供初始值。但是,我们不能直接把 `pyrsistent.m()` 传给柯里化后的 `add_all_words`,因为顺序是错误的:`reduce` 先接受流,然后才是初始值。

```
add_all_words_flipped = flip(add_all_words)
```

`flip` 这个高阶函数返回一个调用原始函数的函数,并且翻转参数顺序。

```
get_all_words = add_all_words_flipped(pyrsistent.m())
```

我们利用 `flip` 会对其返回的函数进行柯里化这一特性,只提供初始值,就得到一个等待流参数的函数,其初始值已经固定:一个空的(不可变)字典。

现在我们可以执行 `get_all_words(word_stream)` 来获取频率字典。但是,我们如何获得一个单词流呢?Python 文件是按行的流。

```
def to_words(lines):
    for line in lines:
        yield from line.split()
```

在单独测试每个函数后,我们可以将它们组合在一起:

```
words_from_file = toolz.compose(get_all_words, to_words)
```

在这种情况下,只有两个函数的组合还算容易阅读:先应用 `to_words`,再将 `get_all_words` 应用于其结果。但文字描述的顺序似乎与代码相反。

当我们开始认真对待可组合性时,这一点很重要。有时可以将代码编写为一个单元序列,单独测试每个单元,最后将它们全部组合。如果有多个组合元素,组合的顺序可能就很难理解。

`toolz` 库借用了 Unix 命令行的做法,提供了按执行顺序书写的变体:`compose_left` 以各步骤的执行顺序构建同样的函数,而 `pipe` 则将一个具体的值依次传过这样一串函数。

```
words_from_file = toolz.compose_left(to_words, get_all_words)
```

现在读起来更直观了:将输入传递给 `to_words`,并将结果传递给 `get_all_words`。在命令行上,等效写法如下所示:

```
$ cat files | to_words | get_all_words
```

`toolz` 库允许我们操作函数,对它们进行切分、组合,以使我们的代码更容易理解和测试。

--------------------------------------------------------------------------------

via: https://opensource.com/article/18/10/functional-programming-python-toolz

作者:[Moshe Zadka][a]
选题:[lujun9972][b]
译者:[Flowsnow](https://github.com/Flowsnow)
校对:[校对者ID](https://github.com/校对者ID)

本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出

[a]: https://opensource.com/users/moshez
[b]: https://github.com/lujun9972
[1]: https://opensource.com/article/18/10/functional-programming-python-immutable-data-structures