Merge pull request #17522 from heguangzhi/TP_2020022102

translated
Xingyu.Wang 2020-02-25 08:52:48 +08:00 committed by GitHub
commit 4989e9af6f
2 changed files with 689 additions and 720 deletions

@ -1,720 +0,0 @@
[#]: collector: (lujun9972)
[#]: translator: (heguangzhi)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (Using Python and GNU Octave to plot data)
[#]: via: (https://opensource.com/article/20/2/python-gnu-octave-data-science)
[#]: author: (Cristiano L. Fontana https://opensource.com/users/cristianofontana)
Using Python and GNU Octave to plot data
======
Learn how to do a common data science task with Python and GNU Octave.
![Analytics: Charts and Graphs][1]
Data science is a domain of knowledge that spans programming languages. Some are well-known for solving problems in this space, while others are lesser-known. This article will help you become familiar with doing data science with some popular languages.
### Choosing Python and GNU Octave for data science
Every so often, I try to learn a new programming language. Why? It is mostly a combination of boredom with the old ways and curiosity about the new ways. When I started programming, the only language I knew was C. Life was hard and dangerous in those years, as I had to manually allocate memory, manage pointers, and remember to free memory.
Then a friend suggested I try Python, and life became much easier. Programs became much slower, but I did not have to suffer through writing analysis software. However, I soon realized that each language was more suitable than others for some applications. I later studied some other languages, and each one brought some new bit of enlightenment. Discovering new programming styles let me backport some solutions to other languages, and everything became much more interesting.
To get a feeling for a new programming language (and its documentation), I always start by writing some example programs that perform a task I know well. To that end, I will explain how to write a program in Python and GNU Octave for a particular task you could classify as data science. If you are already familiar with one of the languages, start with that one and go through the other to look for similarities and differences. This is not intended to be an exhaustive comparison of the languages, just a little showcase.
All of the programs are meant to be run on the [command line][2], not with a [graphical user interface][3] (GUI). The full examples are available in the [polyglot_fit repository][4].
### The programming task
The program you will write in this series:
* Reads data from a [CSV file][5]
* Fits the data with a straight line (i.e., _f(x)=m ⋅ x + q_)
* Plots the result to an image file
This is a common situation that many data scientists have encountered. The example data is the first set of [Anscombe's quartet][6], shown in the table below. This is a set of artificially constructed data that gives the same results when fitted with a straight line, but their plots are very different. The data file is a text file with tabs as column separators and a few lines as a header. This task will use only the first set (i.e., the first two columns).
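[**Anscombe's quartet**][6]

| I: x | I: y | II: x | II: y | III: x | III: y | IV: x | IV: y |
|------|------|-------|-------|--------|--------|-------|-------|
| 10.0 | 8.04 | 10.0 | 9.14 | 10.0 | 7.46 | 8.0 | 6.58 |
| 8.0 | 6.95 | 8.0 | 8.14 | 8.0 | 6.77 | 8.0 | 5.76 |
| 13.0 | 7.58 | 13.0 | 8.74 | 13.0 | 12.74 | 8.0 | 7.71 |
| 9.0 | 8.81 | 9.0 | 8.77 | 9.0 | 7.11 | 8.0 | 8.84 |
| 11.0 | 8.33 | 11.0 | 9.26 | 11.0 | 7.81 | 8.0 | 8.47 |
| 14.0 | 9.96 | 14.0 | 8.10 | 14.0 | 8.84 | 8.0 | 7.04 |
| 6.0 | 7.24 | 6.0 | 6.13 | 6.0 | 6.08 | 8.0 | 5.25 |
| 4.0 | 4.26 | 4.0 | 3.10 | 4.0 | 5.39 | 19.0 | 12.50 |
| 12.0 | 10.84 | 12.0 | 9.13 | 12.0 | 8.15 | 8.0 | 5.56 |
| 7.0 | 4.82 | 7.0 | 7.26 | 7.0 | 6.42 | 8.0 | 7.91 |
| 5.0 | 5.68 | 5.0 | 4.74 | 5.0 | 5.73 | 8.0 | 6.89 |
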
### The Python way
[Python][7] is a general-purpose programming language that is among the most popular languages in use today (as evidenced by findings from the [TIOBE index][8], [RedMonk Programming Language Rankings][9], [Popularity of Programming Language Index][10], [State of the Octoverse of GitHub][11], and other sources). It is an [interpreted language][12]; therefore, the source code is read and evaluated by a program that executes the instructions. It has a comprehensive [standard library][13] and is generally very pleasant to use (I have no reference for this last statement; it is just my humble opinion).
#### Installation
To develop with Python, you need the interpreter and a few libraries. The minimum requirements are:
* [NumPy][14] for convenient array and matrix manipulation
* [SciPy][15] for scientific calculations
* [Matplotlib][16] for plotting
Installing them in [Fedora][17] is easy:
```
sudo dnf install python3 python3-numpy python3-scipy python3-matplotlib
```
#### Commenting code
In Python, [comments][18] start with a **#**; the interpreter discards everything from that character to the end of the line:
```
# This is a comment ignored by the interpreter.
```
The [fitting_python.py][19] example uses comments to insert licensing information in the source code, and the first line is a [special comment][20] that enables the script to be executed on the command line:
```
#! /usr/bin/env python3
```
This line informs the command-line interpreter that the script needs to be executed by the program **python3**.
#### Required libraries
Libraries and modules can be imported in Python as an object (as in the first line in the example) with all the functions and members of the library. There is a convenient option to rename them with a custom label by using the **as** specification:
```
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
```
You may also decide to import only a submodule (as in the second and third lines). The syntax has two (more or less) equivalent options: **import module.submodule** and **from module import submodule**.
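As a small illustration of the two import forms, the sketch below uses the standard library's `os.path` submodule. It was picked only because it needs no installation and is not part of the original example. It checks that both forms, and the **as** label, refer to the same module object:

```python
# Two ways to reach the same submodule; os.path is only a
# dependency-free stand-in for the libraries used in the article.
import os.path                 # binds the name "os"; the submodule is os.path
from os import path            # binds the name "path" directly
import os.path as osp          # same module under a custom label

same_object = (os.path is path) and (path is osp)
print(same_object)
```

The practical difference is only in which name ends up in your namespace, which is why the article calls the two options "more or less" equivalent.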
#### Defining variables
Python's variables are declared the first time a value is assigned to them:
```
input_file_name = "anscombe.csv"
delimiter = "\t"
skip_header = 3
column_x = 0
column_y = 1
```
The variable types are inferred from the values assigned to them. Python has no true constants: any name can be rebound to a new value, and values defined in a module are protected only by convention. Idiomatically, variables that should not be modified are named in uppercase.
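A minimal sketch of these rules follows. The name `DEFAULT_DELIMITER` is hypothetical and only illustrates the uppercase convention:

```python
# Names are bound on first assignment; the type belongs to the value.
skip_header = 3
print(type(skip_header).__name__)    # int

skip_header = "three"                # rebinding to another type is allowed
print(type(skip_header).__name__)    # str

# Uppercase marks a value as "do not modify" by convention only;
# the interpreter does not enforce it. (Hypothetical name.)
DEFAULT_DELIMITER = "\t"
```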
#### Printing output
Running the programs through the command line means that the output is just printed on the terminal. Python has the [**print()**][21] function that, by default, prints its argument and adds a newline at the end of the output:
```
print("#### Anscombe's first set with Python ####")
```
It is possible to combine the **print()** function with the [formatting power][22] of the [string class][23] in Python. Strings have the **format** method, which can be used to add some formatted text to the string itself. For instance, it is possible to add a formatted float number:
```
print("Slope: {:f}".format(slope))
```
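To show what the format specification controls, here is a short sketch with a few common variants. The value of `slope` is made up for illustration, and the f-string form is an alternative that the original script does not use:

```python
slope = 0.5000909   # illustrative value, not read from the data file

default_float = "Slope: {:f}".format(slope)      # six decimals by default
three_decimals = "Slope: {:.3f}".format(slope)   # explicit precision
scientific = "Slope: {:.2e}".format(slope)       # scientific notation
f_string = f"Slope: {slope:f}"                   # f-string (Python 3.6+), same spec

print(default_float)     # Slope: 0.500091
print(three_decimals)    # Slope: 0.500
print(scientific)        # Slope: 5.00e-01
```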
#### Reading data
Reading CSV files is very easy with NumPy and the function [**genfromtxt()**][24], which generates a [NumPy array][25]:
```
data = np.genfromtxt(input_file_name, delimiter = delimiter, skip_header = skip_header)
```
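If you want to try **genfromtxt()** without the data file at hand, the following sketch feeds it an in-memory text buffer instead. The three header lines are an assumption about the file layout (the article only says there are "a few lines as a header"), and only the first rows of the first two sets are included:

```python
import io
import numpy as np

# Hypothetical stand-in for the data file: three header lines, then
# tab-separated columns (first rows of sets I and II only).
sample_text = (
    "Anscombe's quartet\n"
    "I\t\tII\t\n"
    "x\ty\tx\ty\n"
    "10.0\t8.04\t10.0\t9.14\n"
    "8.0\t6.95\t8.0\t8.14\n"
    "13.0\t7.58\t13.0\t8.74\n"
)

# genfromtxt() accepts file-like objects as well as file names.
data = np.genfromtxt(io.StringIO(sample_text), delimiter="\t", skip_header=3)
print(data.shape)    # (3, 4)
```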
In Python, a function can have optional arguments with default values, so you can pass only a subset of them by naming the ones you need (here, **delimiter** and **skip_header**). Arrays are very powerful matrix-like objects that can be easily sliced into smaller arrays:
```
x = data[:, column_x]
y = data[:, column_y]
```
The colons select the whole range, and they can also be used to select a subrange. For instance, to select the first two rows of the array, you would use:
```
first_two_rows = data[0:2, :]
```
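Because the stop index of a slice is excluded, it is easy to be off by one. The sketch below uses a small made-up array (not the Anscombe data) to show the difference:

```python
import numpy as np

# Small stand-in array: 4 rows, 3 columns, values 0..11.
data = np.arange(12).reshape(4, 3)

first_column = data[:, 0]        # every row, column 0
first_two_rows = data[0:2, :]    # rows 0 and 1; the stop index 2 is excluded
only_first_row = data[0:1, :]    # a single row, still two-dimensional

print(first_column)              # [0 3 6 9]
print(first_two_rows.shape)      # (2, 3)
print(only_first_row.shape)      # (1, 3)
```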
#### Fitting data
SciPy provides convenient functions for data fitting, such as the [**linregress()**][26] function. This function provides some significant values related to the fit, such as the slope, intercept, and the correlation coefficient of the two datasets:
```
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print("Slope: {:f}".format(slope))
print("Intercept: {:f}".format(intercept))
print("Correlation coefficient: {:f}".format(r_value))
```
Since **linregress()** provides several pieces of information, the result can be saved to several variables at the same time.
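To make the numbers less of a black box, here is a sketch that computes the same three quantities with the textbook closed-form least-squares formulas, using only the standard library. It is meant as an illustration of the math, not as a description of how SciPy implements **linregress()**; the data are copied from the table above:

```python
import math

# First set of Anscombe's quartet, copied from the table in this article.
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squared deviations and of cross-deviations.
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
r_value = s_xy / math.sqrt(s_xx * s_yy)

print("Slope: {:f}".format(slope))
print("Intercept: {:f}".format(intercept))
print("Correlation coefficient: {:f}".format(r_value))
```

The printed values should agree with the results reported in the Results section.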
#### Plotting
The Matplotlib library plots only data points; therefore, you should define the points you want to plot. The **x** and **y** arrays were already defined, so you can directly plot them, but you also need data points that will represent the straight line.
```
fit_x = np.linspace(x.min() - 1, x.max() + 1, 100)
```
The [**linspace()**][27] function conveniently generates a set of equally spaced values between two values. The ordinates can be easily calculated by exploiting the powerful NumPy arrays, which can be used in a formula as if they were ordinary numeric variables:
```
fit_y = slope * fit_x + intercept
```
The formula is applied element-by-element on the array; therefore, the result has the same number of entries as the initial array.
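A tiny sketch with made-up numbers (five points, and a line with slope 2 and intercept 1) shows both the spacing and the element-wise arithmetic:

```python
import numpy as np

fit_x = np.linspace(0.0, 1.0, 5)     # 5 evenly spaced points, endpoints included
fit_y = 2.0 * fit_x + 1.0            # the formula is applied to every element

print(fit_x)    # [0.   0.25 0.5  0.75 1.  ]
print(fit_y)    # [1.  1.5 2.  2.5 3. ]
```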
To create the plot, first, define a [figure object][28] that will contain all the graphics:
```
fig_width = 7 #inch
fig_height = fig_width / 16 * 9 #inch
fig_dpi = 100
fig = plt.figure(figsize = (fig_width, fig_height), dpi = fig_dpi)
```
Several plots can be drawn on a figure; in Matplotlib, the plots are called [axes][29]. This example defines a single axes object to plot the data points:
```
ax = fig.add_subplot(111)
ax.plot(fit_x, fit_y, label = "Fit", linestyle = '-')
ax.plot(x, y, label = "Data", marker = '.', linestyle = '')
ax.legend()
ax.set_xlim(min(x) - 1, max(x) + 1)
ax.set_ylim(min(y) - 1, max(y) + 1)
ax.set_xlabel('x')
ax.set_ylabel('y')
```
Save the figure to a [PNG image file][30] with:
```
fig.savefig('fit_python.png')
```
If you want to display (instead of saving) the plot, call:
```
plt.show()
```
This example references all the objects used in the plotting section: it defines the object **fig** and the object **ax**. This technicality is not necessary, as the **plt** object can be used directly to plot the datasets. The [Matplotlib tutorial][31] shows an interface such as:
```
plt.plot(fit_x, fit_y)
```
Frankly, I do not like this approach because it hides the non-trivial interactions that happen between the various objects. Unfortunately, sometimes the [official examples][32] are a bit confusing because they tend to use different approaches. Referencing graphical objects is not necessary in this simple example, but it becomes important in more complex ones (such as when embedding plots in GUIs).
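The relationship between the two styles can be sketched as follows. The data are made up, the `Agg` backend is selected so that no display is needed, and the figure is written to an in-memory buffer rather than a file; none of these choices come from the original script:

```python
import io
import matplotlib
matplotlib.use("Agg")              # non-interactive backend; no display needed
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
y = [1, 3, 5, 7]

# Explicit style: every call names the object it acts on.
fig, ax = plt.subplots(figsize=(7, 7 / 16 * 9), dpi=100)
ax.plot(x, y, label="Data", marker=".", linestyle="")
ax.legend()

# Implicit style: pyplot looks up the "current" axes behind the scenes,
# which here is the very same object.
current_axes_is_ax = plt.gca() is ax

# Save to an in-memory buffer to keep the sketch free of side effects.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
```

Calls such as `plt.plot()` act on whatever `plt.gca()` returns, which is convenient in short scripts but easy to lose track of once several figures exist.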
#### Results
The output on the command line is:
```
#### Anscombe's first set with Python ####
Slope: 0.500091
Intercept: 3.000091
Correlation coefficient: 0.816421
```
Here is the image Matplotlib generates.
![Plot and fit of the dataset obtained with Python][33]
### The GNU Octave way
The [GNU Octave][34] language is primarily intended for numerical computations. It offers a simple syntax for manipulating vectors and matrices and has some powerful plotting facilities. It is an interpreted language like Python. Since Octave's syntax is [mostly compatible][35] with [MATLAB][36], it is often described as a free alternative to MATLAB. Octave is not listed among the most popular programming languages, but MATLAB is, so Octave is rather popular in a sense. MATLAB predates NumPy, and I have the feeling that NumPy was inspired by MATLAB. While you go through the example, you will see the analogies.
#### Installation
The [fitting_octave.m][37] example only needs the basic Octave package, making the installation in Fedora rather simple:
```
sudo dnf install octave
```
#### Commenting code
In Octave, you can add comments to code with the percent symbol (**%**), and you can also use **#** if MATLAB compatibility is not needed. The option to use **#** allows you to write the same special comment line from the Python example to execute the script directly on the command line.
#### Necessary libraries
Everything used in this example is contained in the basic package, so you do not need to load any new libraries. If you need a library, the [syntax][38] is **pkg load module**. This command adds the module's functions to the list of available functions. In this regard, Python has more flexibility.
#### Defining variables
Variables are defined with pretty much the same syntax as Python:
```
input_file_name = "anscombe.csv";
delimiter = "\t";
skip_header = 3;
column_x = 1;
column_y = 2;
```
Note that the end of the line has a semicolon; this is not necessary, but it suppresses the output of the results of the line. Without a semicolon, the interpreter would print the result of the expression:
```
octave:1> input_file_name = "anscombe.csv"
input_file_name = anscombe.csv
octave:2> sqrt(2)
ans =  1.4142
```
#### Printing output
The powerful function [**printf()**][39] is used to print on the terminal. Unlike in Python, the **printf()** function does not automatically add a newline at the end of the printed string, so you have to add it. The first argument is a string that can contain format information for the other arguments to be passed to the function, such as:
```
printf("Slope: %f\n", slope);
```
In Python, the formatting is built into the string itself, but in Octave, it is specific to the **printf()** function.
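For readers coming from the Python section, the sketch below shows a rough Python counterpart of the **printf()** call, using the `%` operator on strings. This is only a comparison aid; the original Python script uses the **format** method, and the value of `slope` is made up:

```python
import sys

slope = 0.5000909   # illustrative value

# printf-style formatting with the % operator; the newline is explicit,
# as in Octave's printf().
line = "Slope: %f\n" % slope
sys.stdout.write(line)

# print() appends a newline by default; the end parameter controls it.
print("Slope: %f" % slope, end="\n")
```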
#### Reading data
The [**dlmread()**][40] function can read text files structured like CSV files:
```
data = dlmread(input_file_name, delimiter, skip_header, 0);
```
The result is a [matrix][41] object, which is one of the fundamental data types in Octave. Matrices may be sliced with a syntax similar to Python:
```
x = data(:, column_x);
y = data(:, column_y);
```
The fundamental difference is that the indexes start at one instead of zero. Therefore, in the example, the **x** column is column number one.
#### Fitting data
To fit the data with a straight line, you can use the [**polyfit()**][42] function. It fits the input data with a polynomial, so you just need to use a polynomial of order one:
```
p = polyfit(x, y, 1);
slope = p(1);
intercept = p(2);
```
The result is a matrix with the polynomial coefficients, ordered from the highest power to the lowest; therefore, the example selects the first two elements as slope and intercept. To determine the correlation coefficient, use the [**corr()**][43] function:
```
r_value = corr(x, y);
```
Finally, print the results with the **printf()** function:
```
printf("Slope: %f\n", slope);
printf("Intercept: %f\n", intercept);
printf("Correlation coefficient: %f\n", r_value);
```
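The analogy with NumPy extends to the fit itself. As a comparison aid (the original Python script uses **linregress()** instead), NumPy's `polyfit()` and `corrcoef()` can be used in much the same way as Octave's **polyfit()** and **corr()**; the data below are the first set from the table above:

```python
import numpy as np

# First set of Anscombe's quartet, copied from the table in this article.
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

p = np.polyfit(x, y, 1)              # coefficients, highest power first
slope, intercept = p                 # indexes start at zero in Python
r_value = np.corrcoef(x, y)[0, 1]    # off-diagonal entry of the 2x2 matrix

print("Slope: {:f}".format(slope))
print("Intercept: {:f}".format(intercept))
print("Correlation coefficient: {:f}".format(r_value))
```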
#### Plotting
As in the Matplotlib example, you first need to create a dataset that represents the fitted line:
```
fit_x = linspace(min(x) - 1, max(x) + 1, 100);
fit_y = slope * fit_x + intercept;
```
The analogy with NumPy is also evident here, as it uses the [**linspace()**][44] function, which behaves just like NumPy's equivalent.
Again, as with Matplotlib, create a [figure][45] object first, then create an [axes][46] object to hold the plots:
```
fig_width = 7; %inch
fig_height = fig_width / 16 * 9; %inch
fig_dpi = 100;
fig = figure("units", "inches",
             "position", [1, 1, fig_width, fig_height]);
ax = axes("parent", fig);
set(ax, "fontsize", 14);
set(ax, "linewidth", 2);
```
To set properties of the axes object, use the [**set()**][47] function. The interface is rather confusing, though, as the function expects a comma-separated list of property and value pairs. These pairs are just a succession of a string representing the property name and a second object representing the value for that property. There are also other functions to set various properties:
```
xlim(ax, [min(x) - 1, max(x) + 1]);
ylim(ax, [min(y) - 1, max(y) + 1]);
xlabel(ax, 'x');
ylabel(ax, 'y');
```
Plotting is achieved with the [**plot()**][48] function. The default behavior is that each call resets the axes, so you need to use the function [**hold()**][49].
```
hold(ax, "on");
plot(ax, fit_x, fit_y,
     "marker", "none",
     "linestyle", "-",
     "linewidth", 2);
plot(ax, x, y,
     "marker", ".",
     "markersize", 20,
     "linestyle", "none");
hold(ax, "off");
```
The **plot()** function also accepts property and value pairs, as shown above. The [legend][50] must be created separately, and the labels must be stated manually:
```
lg = legend(ax, "Fit", "Data");
set(lg, "location", "northwest");
```
Finally, save the output to a PNG image:
```
image_size = sprintf("-S%f,%f", fig_width * fig_dpi, fig_height * fig_dpi);
image_resolution = sprintf("-r%d", fig_dpi);
print(fig, 'fit_octave.png',
      '-dpng',
      image_size,
      image_resolution);
```
Confusingly, in this case, each option is passed as a single string that contains both the property name and the value. Since Octave strings do not have the formatting facilities of Python strings, you must use the [**sprintf()**][51] function. It behaves just like the **printf()** function, but instead of printing its result, it returns it as a string.
In this example, as in the Python one, the graphical objects are referenced to keep their interactions evident. If Python's documentation in this regard is a little bit confusing, [Octave's documentation][52] is even worse. Most of the examples I found did not care about referencing the objects; instead, they rely on the fact that the plotting commands act on the currently active figure. A global [root graphics object][53] keeps track of the existing figures and axes.
#### Results
The resulting output on the command line is:
```
#### Anscombe's first set with Octave ####
Slope: 0.500091
Intercept: 3.000091
Correlation coefficient: 0.816421
```
And this shows the resulting image generated with Octave.
![Plot and fit of the dataset obtained with Octave][54]
### Next up
Both Python and GNU Octave can plot the same information, though they differ in how they get there. If you're looking to explore other languages to complete similar tasks, I highly recommend looking at [Rosetta Code][55]. It's a marvelous resource to see how to solve the same problems in many languages. 
What language do you like to plot data in? Share your thoughts in the comments.
--------------------------------------------------------------------------------
via: https://opensource.com/article/20/2/python-gnu-octave-data-science
Author: [Cristiano L. Fontana][a]
Topic selection: [lujun9972][b]
Translator: [heguangzhi](https://github.com/heguangzhi)
Proofreader: [Proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China](https://linux.cn/)
[a]: https://opensource.com/users/cristianofontana
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/analytics-graphs-charts.png?itok=sersoqbV (Analytics: Charts and Graphs)
[2]: https://en.wikipedia.org/wiki/Command-line_interface
[3]: https://en.wikipedia.org/wiki/Graphical_user_interface
[4]: https://gitlab.com/cristiano.fontana/polyglot_fit
[5]: https://en.wikipedia.org/wiki/Comma-separated_values
[6]: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
[7]: https://www.python.org/
[8]: https://www.tiobe.com/tiobe-index/
[9]: https://redmonk.com/sogrady/2019/07/18/language-rankings-6-19/
[10]: http://pypl.github.io/PYPL.html
[11]: https://octoverse.github.com/
[12]: https://en.wikipedia.org/wiki/Interpreted_language
[13]: https://docs.python.org/3/library/
[14]: https://numpy.org/
[15]: https://www.scipy.org/
[16]: https://matplotlib.org/
[17]: https://getfedora.org/
[18]: https://en.wikipedia.org/wiki/Comment_(computer_programming)
[19]: https://gitlab.com/cristiano.fontana/polyglot_fit/-/blob/master/fitting_python.py
[20]: https://en.wikipedia.org/wiki/Shebang_(Unix)
[21]: https://docs.python.org/3/library/functions.html#print
[22]: https://docs.python.org/3/library/string.html#string-formatting
[23]: https://docs.python.org/3/library/string.html
[24]: https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
[25]: https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html
[26]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html
[27]: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html
[28]: https://matplotlib.org/api/_as_gen/matplotlib.figure.Figure.html#matplotlib.figure.Figure
[29]: https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes
[30]: https://en.wikipedia.org/wiki/Portable_Network_Graphics
[31]: https://matplotlib.org/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py
[32]: https://matplotlib.org/gallery/index.html
[33]: https://opensource.com/sites/default/files/uploads/fit_python.png (Plot and fit of the dataset obtained with Python)
[34]: https://www.gnu.org/software/octave/
[35]: https://wiki.octave.org/FAQ#Differences_between_Octave_and_Matlab
[36]: https://en.wikipedia.org/wiki/MATLAB
[37]: https://gitlab.com/cristiano.fontana/polyglot_fit/-/blob/master/fitting_octave.m
[38]: https://octave.org/doc/v5.1.0/Using-Packages.html#Using-Packages
[39]: https://octave.org/doc/v5.1.0/Formatted-Output.html#XREFprintf
[40]: https://octave.org/doc/v5.1.0/Simple-File-I_002fO.html#XREFdlmread
[41]: https://octave.org/doc/v5.1.0/Matrices.html
[42]: https://octave.org/doc/v5.1.0/Polynomial-Interpolation.html
[43]: https://octave.org/doc/v5.1.0/Correlation-and-Regression-Analysis.html#XREFcorr
[44]: https://octave.sourceforge.io/octave/function/linspace.html
[45]: https://octave.org/doc/v5.1.0/Multiple-Plot-Windows.html
[46]: https://octave.org/doc/v5.1.0/Graphics-Objects.html#XREFaxes
[47]: https://octave.org/doc/v5.1.0/Graphics-Objects.html#XREFset
[48]: https://octave.org/doc/v5.1.0/Two_002dDimensional-Plots.html#XREFplot
[49]: https://octave.org/doc/v5.1.0/Manipulation-of-Plot-Windows.html#XREFhold
[50]: https://octave.org/doc/v5.1.0/Plot-Annotations.html#XREFlegend
[51]: https://octave.org/doc/v5.1.0/Formatted-Output.html#XREFsprintf
[52]: https://octave.org/doc/v5.1.0/Two_002dDimensional-Plots.html#Two_002dDimensional-Plots
[53]: https://octave.org/doc/v5.1.0/Graphics-Objects.html#XREFgroot
[54]: https://opensource.com/sites/default/files/uploads/fit_octave.png (Plot and fit of the dataset obtained with Octave)
[55]: http://www.rosettacode.org/

@ -0,0 +1,689 @@
[#]: collector: (lujun9972)
[#]: translator: (heguangzhi)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (Using Python and GNU Octave to plot data)
[#]: via: (https://opensource.com/article/20/2/python-gnu-octave-data-science)
[#]: author: (Cristiano L. Fontana https://opensource.com/users/cristianofontana)
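Using Python and GNU Octave to plot data
======
Learn how to do a common data science task with Python and GNU Octave.
![Analytics: Charts and Graphs][1]
Data science is a domain of knowledge that spans programming languages. Some languages are well known for solving problems in this space, while others are lesser known. This article will help you become familiar with doing data science in some popular languages.
### Choosing Python and GNU Octave for data science
Every so often, I try to learn a new programming language. Why? It is mostly a combination of boredom with the old ways and curiosity about the new ways. When I started programming, the only language I knew was C. Life was hard and dangerous in those years, as I had to manually allocate memory, manage pointers, and remember to free memory.
Then a friend suggested I try Python, and life became much easier. Programs became much slower, but I did not have to suffer through writing analysis software. However, I soon realized that each language is more suitable than others for some applications. I later studied some other languages, and each one brought some new bit of enlightenment. Discovering new programming styles let me backport some solutions to other languages, and everything became much more interesting.
To get a feeling for a new programming language (and its documentation), I always start by writing some example programs that perform a task I know well. To that end, I will explain how to write a program in Python and GNU Octave for a particular task you could classify as data science. If you are already familiar with one of the languages, start with that one and go through the other to look for similarities and differences. This is not intended to be an exhaustive comparison of the languages, just a little showcase.
All of the programs are meant to be run on the [command line][2], not with a [graphical user interface][3] (GUI). The full examples are available in the [polyglot_fit repository][4].
### The programming task
The program you will write in this series:
* Reads data from a [CSV file][5]
* Fits the data with a straight line (i.e., _f(x)=m ⋅ x + q_)
* Plots the result to an image file
This is a common situation that many data scientists have encountered. The example data is the first set of [Anscombe's quartet][6], shown in the table below. This is a set of artificially constructed data that gives the same results when fitted with a straight line, but their plots are very different. The data file is a text file with tabs as column separators and a few lines as a header. This task will use only the first set (i.e., the first two columns).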
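
| I: x | I: y | II: x | II: y | III: x | III: y | IV: x | IV: y |
|------|------|-------|-------|--------|--------|-------|-------|
| 10.0 | 8.04 | 10.0 | 9.14 | 10.0 | 7.46 | 8.0 | 6.58 |
| 8.0 | 6.95 | 8.0 | 8.14 | 8.0 | 6.77 | 8.0 | 5.76 |
| 13.0 | 7.58 | 13.0 | 8.74 | 13.0 | 12.74 | 8.0 | 7.71 |
| 9.0 | 8.81 | 9.0 | 8.77 | 9.0 | 7.11 | 8.0 | 8.84 |
| 11.0 | 8.33 | 11.0 | 9.26 | 11.0 | 7.81 | 8.0 | 8.47 |
| 14.0 | 9.96 | 14.0 | 8.10 | 14.0 | 8.84 | 8.0 | 7.04 |
| 6.0 | 7.24 | 6.0 | 6.13 | 6.0 | 6.08 | 8.0 | 5.25 |
| 4.0 | 4.26 | 4.0 | 3.10 | 4.0 | 5.39 | 19.0 | 12.50 |
| 12.0 | 10.84 | 12.0 | 9.13 | 12.0 | 8.15 | 8.0 | 5.56 |
| 7.0 | 4.82 | 7.0 | 7.26 | 7.0 | 6.42 | 8.0 | 7.91 |
| 5.0 | 5.68 | 5.0 | 4.74 | 5.0 | 5.73 | 8.0 | 6.89 |
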
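### The Python way
[Python][7] is a general-purpose programming language that is among the most popular languages in use today (as evidenced by findings from the [TIOBE index][8], [RedMonk Programming Language Rankings][9], [Popularity of Programming Language Index][10], [State of the Octoverse of GitHub][11], and other sources). It is an [interpreted language][12]; therefore, the source code is read and evaluated by a program that executes the instructions. It has a comprehensive [standard library][13] and is generally very pleasant to use (I have no reference for this last statement; it is just my humble opinion).
#### Installation
To develop with Python, you need the interpreter and a few libraries. The minimum requirements are:
* [NumPy][14] for convenient array and matrix manipulation
* [SciPy][15] for scientific calculations
* [Matplotlib][16] for plotting
Installing them in [Fedora][17] is easy:
```
sudo dnf install python3 python3-numpy python3-scipy python3-matplotlib
```
#### Commenting code
In Python, [comments][18] start with a **#**; the interpreter discards everything from that character to the end of the line:
```
# This is a comment ignored by the interpreter.
```
The [fitting_python.py][19] example uses comments to insert licensing information in the source code, and the first line is a [special comment][20] that enables the script to be executed on the command line:
```
#! /usr/bin/env python3
```
This line informs the command-line interpreter that the script needs to be executed by the program **python3**.
#### Required libraries
Libraries and modules can be imported in Python as an object (as in the first line in the example) with all the functions and members of the library. There is a convenient option to rename them with a custom label by using the **as** specification: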
```
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
```
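You may also decide to import only a submodule (as in the second and third lines). The syntax has two (more or less) equivalent options: **import module.submodule** and **from module import submodule**.
#### Defining variables
Python's variables are declared the first time a value is assigned to them: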
```
input_file_name = "anscombe.csv"
delimiter = "\t"
skip_header = 3
column_x = 0
column_y = 1
```
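The variable types are inferred from the values assigned to them. Python has no true constants: any name can be rebound to a new value, and values defined in a module are protected only by convention. Idiomatically, variables that should not be modified are named in uppercase.
#### Printing output
Running the programs through the command line means that the output is just printed on the terminal. Python has the [**print()**][21] function that, by default, prints its argument and adds a newline at the end of the output:
```
print("#### Anscombe's first set with Python ####")
```
It is possible to combine the **print()** function with the [formatting power][22] of the [string class][23] in Python. Strings have the **format** method, which can be used to add some formatted text to the string itself. For instance, it is possible to add a formatted float number:
```
print("Slope: {:f}".format(slope))
```
#### Reading data
Reading CSV files is very easy with NumPy and the function [**genfromtxt()**][24], which generates a [NumPy array][25]:
```
data = np.genfromtxt(input_file_name, delimiter = delimiter, skip_header = skip_header)
```
In Python, a function can have optional arguments with default values, so you can pass only a subset of them by naming the ones you need (here, **delimiter** and **skip_header**). Arrays are very powerful matrix-like objects that can be easily sliced into smaller arrays: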
```
x = data[:, column_x]
y = data[:, column_y]
```
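The colons select the whole range, and they can also be used to select a subrange. For instance, to select the first two rows of the array, you would use:
```
first_two_rows = data[0:2, :]
```
#### Fitting data
SciPy provides convenient functions for data fitting, such as the [**linregress()**][26] function. This function provides some significant values related to the fit, such as the slope, intercept, and the correlation coefficient of the two datasets: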
```
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print("Slope: {:f}".format(slope))
print("Intercept: {:f}".format(intercept))
print("Correlation coefficient: {:f}".format(r_value))
```
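Since **linregress()** provides several pieces of information, the result can be saved to several variables at the same time.
#### Plotting
The Matplotlib library plots only data points; therefore, you should define the points you want to plot. The **x** and **y** arrays were already defined, so you can directly plot them, but you also need data points that will represent the straight line.
```
fit_x = np.linspace(x.min() - 1, x.max() + 1, 100)
```
The [**linspace()**][27] function conveniently generates a set of equally spaced values between two values. The ordinates can be easily calculated by exploiting the powerful NumPy arrays, which can be used in a formula as if they were ordinary numeric variables:
```
fit_y = slope * fit_x + intercept
```
The formula is applied element-by-element on the array; therefore, the result has the same number of entries as the initial array.
To create the plot, first, define a [figure object][28] that will contain all the graphics: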
```
fig_width = 7 #inch
fig_height = fig_width / 16 * 9 #inch
fig_dpi = 100
fig = plt.figure(figsize = (fig_width, fig_height), dpi = fig_dpi)
```
Several plots can be drawn on a figure; in Matplotlib, the plots are called [axes][29]. This example defines a single axes object to plot the data points:
```
ax = fig.add_subplot(111)
ax.plot(fit_x, fit_y, label = "Fit", linestyle = '-')
ax.plot(x, y, label = "Data", marker = '.', linestyle = '')
ax.legend()
ax.set_xlim(min(x) - 1, max(x) + 1)
ax.set_ylim(min(y) - 1, max(y) + 1)
ax.set_xlabel('x')
ax.set_ylabel('y')
```
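Save the figure to a [PNG image file][30] with:
```
fig.savefig('fit_python.png')
```
If you want to display (instead of saving) the plot, call:
```
plt.show()
```
This example references all the objects used in the plotting section: it defines the object **fig** and the object **ax**. This technicality is not necessary, as the **plt** object can be used directly to plot the datasets. The [Matplotlib tutorial][31] shows an interface such as:
```
plt.plot(fit_x, fit_y)
```
Frankly, I do not like this approach because it hides the non-trivial interactions that happen between the various objects. Unfortunately, sometimes the [official examples][32] are a bit confusing because they tend to use different approaches. Referencing graphical objects is not necessary in this simple example, but it becomes important in more complex ones (such as when embedding plots in GUIs).
#### Results
The output on the command line is: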
```
#### Anscombe's first set with Python ####
Slope: 0.500091
Intercept: 3.000091
Correlation coefficient: 0.816421
```
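Here is the image Matplotlib generates.
![Plot and fit of the dataset obtained with Python][33]
### The GNU Octave way
The [GNU Octave][34] language is primarily intended for numerical computations. It offers a simple syntax for manipulating vectors and matrices and has some powerful plotting facilities. It is an interpreted language like Python. Since Octave's syntax is [mostly compatible][35] with [MATLAB][36], it is often described as a free alternative to MATLAB. Octave is not listed among the most popular programming languages, but MATLAB is, so Octave is rather popular in a sense. MATLAB predates NumPy, and I have the feeling that NumPy was inspired by MATLAB. While you go through the example, you will see the analogies.
#### Installation
The [fitting_octave.m][37] example only needs the basic Octave package, making the installation in Fedora rather simple:
```
sudo dnf install octave
```
#### Commenting code
In Octave, you can add comments to code with the percent symbol (**%**), and you can also use **#** if MATLAB compatibility is not needed. The option to use **#** allows you to write the same special comment line from the Python example to execute the script directly on the command line.
#### Necessary libraries
Everything used in this example is contained in the basic package, so you do not need to load any new libraries. If you need a library, the [syntax][38] is **pkg load module**. This command adds the module's functions to the list of available functions. In this regard, Python has more flexibility.
#### Defining variables
Variables are defined with pretty much the same syntax as Python: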
```
input_file_name = "anscombe.csv";
delimiter = "\t";
skip_header = 3;
column_x = 1;
column_y = 2;
```
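Note that the end of the line has a semicolon; this is not necessary, but it suppresses the output of the results of the line. Without a semicolon, the interpreter would print the result of the expression: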
```
octave:1> input_file_name = "anscombe.csv"
input_file_name = anscombe.csv
octave:2> sqrt(2)
ans =  1.4142
```
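#### Printing output
The powerful function [**printf()**][39] is used to print on the terminal. Unlike in Python, the **printf()** function does not automatically add a newline at the end of the printed string, so you have to add it. The first argument is a string that can contain format information for the other arguments to be passed to the function, such as:
```
printf("Slope: %f\n", slope);
```
In Python, the formatting is built into the string itself, but in Octave, it is specific to the **printf()** function.
#### Reading data
The [**dlmread()**][40] function can read text files structured like CSV files:
```
data = dlmread(input_file_name, delimiter, skip_header, 0);
```
The result is a [matrix][41] object, which is one of the fundamental data types in Octave. Matrices may be sliced with a syntax similar to Python:
```
x = data(:, column_x);
y = data(:, column_y);
```
The fundamental difference is that the indexes start at one instead of zero. Therefore, in the example, the **x** column is column number one.
#### Fitting data
To fit the data with a straight line, you can use the [**polyfit()**][42] function. It fits the input data with a polynomial, so you just need to use a polynomial of order one: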
```
p = polyfit(x, y, 1);
slope = p(1);
intercept = p(2);
```
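The result is a matrix with the polynomial coefficients, ordered from the highest power to the lowest; therefore, the example selects the first two elements as slope and intercept. To determine the correlation coefficient, use the [**corr()**][43] function:
```
r_value = corr(x, y);
```
Finally, print the results with the **printf()** function: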
```
printf("Slope: %f\n", slope);
printf("Intercept: %f\n", intercept);
printf("Correlation coefficient: %f\n", r_value);
```
#### Plotting
As in the Matplotlib example, you first need to create a dataset that represents the fitted line:
```
fit_x = linspace(min(x) - 1, max(x) + 1, 100);
fit_y = slope * fit_x + intercept;
```
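The analogy with NumPy is also evident here, as it uses the [**linspace()**][44] function, which behaves just like NumPy's equivalent.
Again, as with Matplotlib, create a [figure][45] object first, then create an [axes][46] object to hold the plots: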
```
fig_width = 7; %inch
fig_height = fig_width / 16 * 9; %inch
fig_dpi = 100;
fig = figure("units", "inches",
             "position", [1, 1, fig_width, fig_height]);
ax = axes("parent", fig);
set(ax, "fontsize", 14);
set(ax, "linewidth", 2);
```
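To set properties of the axes object, use the [**set()**][47] function. The interface is rather confusing, though, as the function expects a comma-separated list of property and value pairs. These pairs are just a succession of a string representing the property name and a second object representing the value for that property. There are also other functions to set various properties: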
```
xlim(ax, [min(x) - 1, max(x) + 1]);
ylim(ax, [min(y) - 1, max(y) + 1]);
xlabel(ax, 'x');
ylabel(ax, 'y');
```
Plotting is achieved with the [**plot()**][48] function. The default behavior is that each call resets the axes, so you need to use the function [**hold()**][49].
```
hold(ax, "on");
plot(ax, fit_x, fit_y,
     "marker", "none",
     "linestyle", "-",
     "linewidth", 2);
plot(ax, x, y,
     "marker", ".",
     "markersize", 20,
     "linestyle", "none");
hold(ax, "off");
```
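The **plot()** function also accepts property and value pairs, as shown above. The [legend][50] must be created separately, and the labels must be stated manually: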
```
lg = legend(ax, "Fit", "Data");
set(lg, "location", "northwest");
```
Finally, save the output to a PNG image:
```
image_size = sprintf("-S%f,%f", fig_width * fig_dpi, fig_height * fig_dpi);
image_resolution = sprintf("-r%d", fig_dpi);
print(fig, 'fit_octave.png',
      '-dpng',
      image_size,
      image_resolution);
```
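Confusingly, in this case, each option is passed as a single string that contains both the property name and the value. Since Octave strings do not have the formatting facilities of Python strings, you must use the [**sprintf()**][51] function. It behaves just like the **printf()** function, but instead of printing its result, it returns it as a string.
In this example, as in the Python one, the graphical objects are referenced to keep their interactions evident. If Python's documentation in this regard is a little bit confusing, [Octave's documentation][52] is even worse. Most of the examples I found did not care about referencing the objects; instead, they rely on the fact that the plotting commands act on the currently active figure. A global [root graphics object][53] keeps track of the existing figures and axes.
#### Results
The resulting output on the command line is: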
```
#### Anscombe's first set with Octave ####
Slope: 0.500091
Intercept: 3.000091
Correlation coefficient: 0.816421
```
And this shows the resulting image generated with Octave.
![Plot and fit of the dataset obtained with Octave][54]
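### Next up
Both Python and GNU Octave can plot the same information, though they differ in how they get there. If you're looking to explore other languages to complete similar tasks, I highly recommend looking at [Rosetta Code][55]. It's a marvelous resource to see how to solve the same problems in many languages.
What language do you like to plot data in? Share your thoughts in the comments.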
--------------------------------------------------------------------------------
via: https://opensource.com/article/20/2/python-gnu-octave-data-science
Author: [Cristiano L. Fontana][a]
Topic selection: [lujun9972][b]
Translator: [heguangzhi](https://github.com/heguangzhi)
Proofreader: [Proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China](https://linux.cn/)
[a]: https://opensource.com/users/cristianofontana
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/analytics-graphs-charts.png?itok=sersoqbV (Analytics: Charts and Graphs)
[2]: https://en.wikipedia.org/wiki/Command-line_interface
[3]: https://en.wikipedia.org/wiki/Graphical_user_interface
[4]: https://gitlab.com/cristiano.fontana/polyglot_fit
[5]: https://en.wikipedia.org/wiki/Comma-separated_values
[6]: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
[7]: https://www.python.org/
[8]: https://www.tiobe.com/tiobe-index/
[9]: https://redmonk.com/sogrady/2019/07/18/language-rankings-6-19/
[10]: http://pypl.github.io/PYPL.html
[11]: https://octoverse.github.com/
[12]: https://en.wikipedia.org/wiki/Interpreted_language
[13]: https://docs.python.org/3/library/
[14]: https://numpy.org/
[15]: https://www.scipy.org/
[16]: https://matplotlib.org/
[17]: https://getfedora.org/
[18]: https://en.wikipedia.org/wiki/Comment_(computer_programming)
[19]: https://gitlab.com/cristiano.fontana/polyglot_fit/-/blob/master/fitting_python.py
[20]: https://en.wikipedia.org/wiki/Shebang_(Unix)
[21]: https://docs.python.org/3/library/functions.html#print
[22]: https://docs.python.org/3/library/string.html#string-formatting
[23]: https://docs.python.org/3/library/string.html
[24]: https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
[25]: https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html
[26]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html
[27]: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html
[28]: https://matplotlib.org/api/_as_gen/matplotlib.figure.Figure.html#matplotlib.figure.Figure
[29]: https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes
[30]: https://en.wikipedia.org/wiki/Portable_Network_Graphics
[31]: https://matplotlib.org/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py
[32]: https://matplotlib.org/gallery/index.html
[33]: https://opensource.com/sites/default/files/uploads/fit_python.png (Plot and fit of the dataset obtained with Python)
[34]: https://www.gnu.org/software/octave/
[35]: https://wiki.octave.org/FAQ#Differences_between_Octave_and_Matlab
[36]: https://en.wikipedia.org/wiki/MATLAB
[37]: https://gitlab.com/cristiano.fontana/polyglot_fit/-/blob/master/fitting_octave.m
[38]: https://octave.org/doc/v5.1.0/Using-Packages.html#Using-Packages
[39]: https://octave.org/doc/v5.1.0/Formatted-Output.html#XREFprintf
[40]: https://octave.org/doc/v5.1.0/Simple-File-I_002fO.html#XREFdlmread
[41]: https://octave.org/doc/v5.1.0/Matrices.html
[42]: https://octave.org/doc/v5.1.0/Polynomial-Interpolation.html
[43]: https://octave.org/doc/v5.1.0/Correlation-and-Regression-Analysis.html#XREFcorr
[44]: https://octave.sourceforge.io/octave/function/linspace.html
[45]: https://octave.org/doc/v5.1.0/Multiple-Plot-Windows.html
[46]: https://octave.org/doc/v5.1.0/Graphics-Objects.html#XREFaxes
[47]: https://octave.org/doc/v5.1.0/Graphics-Objects.html#XREFset
[48]: https://octave.org/doc/v5.1.0/Two_002dDimensional-Plots.html#XREFplot
[49]: https://octave.org/doc/v5.1.0/Manipulation-of-Plot-Windows.html#XREFhold
[50]: https://octave.org/doc/v5.1.0/Plot-Annotations.html#XREFlegend
[51]: https://octave.org/doc/v5.1.0/Formatted-Output.html#XREFsprintf
[52]: https://octave.org/doc/v5.1.0/Two_002dDimensional-Plots.html#Two_002dDimensional-Plots
[53]: https://octave.org/doc/v5.1.0/Graphics-Objects.html#XREFgroot
[54]: https://opensource.com/sites/default/files/uploads/fit_octave.png (Plot and fit of the dataset obtained with Octave)
[55]: http://www.rosettacode.org/