From d4e82db38c29b847f19e2b9118e5ee8a49311780 Mon Sep 17 00:00:00 2001
From: WEIYUE XIE
Date: Tue, 2 Aug 2016 19:15:14 +0800
Subject: [PATCH 1/5] =?UTF-8?q?Update=20and=20rename=20part=202=20-=20Buil?=
 =?UTF-8?q?ding=20a=20data=20science=20portfolio=20-=20Machine=20learning?=
 =?UTF-8?q?=20project.md=20to=20=E7=BF=BB=E8=AF=91=E4=B8=AD=20ideas4u=20pa?=
 =?UTF-8?q?rt=202=20-=20Building=20a=20data=20science=20portfolio=20-=20Ma?=
 =?UTF-8?q?chine=20learning=20project.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

ideas4u 翻译中
---
 ...data science portfolio - Machine learning project.md} | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)
 rename sources/team_test/{part 2 - Building a data science portfolio - Machine learning project.md => 翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md} (91%)

diff --git a/sources/team_test/part 2 - Building a data science portfolio - Machine learning project.md b/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md
similarity index 91%
rename from sources/team_test/part 2 - Building a data science portfolio - Machine learning project.md
rename to sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md
index 2b429aaab8..4a4fe73553 100644
--- a/sources/team_test/part 2 - Building a data science portfolio - Machine learning project.md
+++ b/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md
@@ -1,7 +1,7 @@
-### Understanding the data
+### 理解数据
 Let’s take a quick look at the raw data files. Here are the first few rows of the acquisition data from quarter 1 of 2012:
-
+让我们来简单看一下原始数据文件。下面是2012年1季度采集数据的前几行。
 ```
 100000853384|R|OTHER|4.625|280000|360|02/2012|04/2012|31|31|1|23|801|N|C|SF|1|I|CA|945||FRM|
 100003735682|R|SUNTRUST MORTGAGE INC.|3.99|466000|360|01/2012|03/2012|80|80|2|30|794|N|P|SF|1|P|MD|208||FRM|788
@@ -10,6 +10,7 @@ Let’s take a quick look at the raw data files. Here are the first few rows of
 Here are the first few rows of the performance data from quarter 1 of 2012:
+下面是2012年1季度执行数据的前几行
 ```
 100000853384|03/01/2012|OTHER|4.625||0|360|359|03/2042|41860|0|N||||||||||||||||
 100000853384|04/01/2012||4.625||1|359|358|03/2042|41860|0|N||||||||||||||||
@@ -17,10 +18,14 @@ Here are the first few rows of the performance data from quarter 1 of 2012:
 ```
 Before proceeding too far into coding, it’s useful to take some time and really understand the data. This is more critical in operational projects – because we aren’t interactively exploring the data, it can be harder to spot certain nuances unless we find them upfront. In this case, the first step is to read the materials on the Fannie Mae site:
-
+在开始编码之前,花些时间真正理解数据是值得的。这对于操作项目优为重要,因为我们没有交互式探索数据,将很难察觉到细微的差别除非我们在前期发现他们。在这种情况下,第一个步骤是阅读房利美站点的资料:
 - [Overview][15]
+- [概述][15]
 - [Glossary of useful terms][16]
+- [用用的术语表][16]
 - [FAQs][17]
+- [问答][17]
+- [Columns in the Acquisition and Performance files][18]
 - [Columns in the Acquisition and Performance files][18]
 - [Sample Acquisition data file][19]
 - [Sample Performance data file][20]

From 4110b9b4fa4f9753a42de5723cc832d0fa67d79a Mon Sep 17 00:00:00 2001
From: WEIYUE XIE
Date: Tue, 2 Aug 2016 20:30:46 +0800
Subject: [PATCH 2/5] =?UTF-8?q?Update=20=E7=BF=BB=E8=AF=91=E4=B8=AD=20idea?=
 =?UTF-8?q?s4u=20part=202=20-=20Building=20a=20data=20science=20portfolio?=
 =?UTF-8?q?=20-=20Machine=20learning=20project.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

翻译2
---
 ...ience portfolio - Machine learning project.md | 47 ++++++++-----------
 1 file changed, 20 insertions(+), 27 deletions(-)

diff --git a/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md b/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md
index 4a4fe73553..27c7604922 100644
--- a/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md
+++ b/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md
@@ -1,54 +1,47 @@
 ### 理解数据
-Let’s take a quick look at the raw data files. Here are the first few rows of the acquisition data from quarter 1 of 2012:
-让我们来简单看一下原始数据文件。下面是2012年1季度采集数据的前几行。
+我们来简单看一下原始数据文件。下面是2012年1季度前几行的采集数据。
 ```
 100000853384|R|OTHER|4.625|280000|360|02/2012|04/2012|31|31|1|23|801|N|C|SF|1|I|CA|945||FRM|
 100003735682|R|SUNTRUST MORTGAGE INC.|3.99|466000|360|01/2012|03/2012|80|80|2|30|794|N|P|SF|1|P|MD|208||FRM|788
 100006367485|C|PHH MORTGAGE CORPORATION|4|229000|360|02/2012|04/2012|67|67|2|36|802|N|R|SF|1|P|CA|959||FRM|794
 ```
-Here are the first few rows of the performance data from quarter 1 of 2012:
-
-下面是2012年1季度执行数据的前几行
+下面是2012年1季度的前几行执行数据
 ```
 100000853384|03/01/2012|OTHER|4.625||0|360|359|03/2042|41860|0|N||||||||||||||||
 100000853384|04/01/2012||4.625||1|359|358|03/2042|41860|0|N||||||||||||||||
 100000853384|05/01/2012||4.625||2|358|357|03/2042|41860|0|N||||||||||||||||
 ```
-
-Before proceeding too far into coding, it’s useful to take some time and really understand the data. This is more critical in operational projects – because we aren’t interactively exploring the data, it can be harder to spot certain nuances unless we find them upfront. In this case, the first step is to read the materials on the Fannie Mae site:
 在开始编码之前,花些时间真正理解数据是值得的。这对于操作项目优为重要,因为我们没有交互式探索数据,将很难察觉到细微的差别除非我们在前期发现他们。在这种情况下,第一个步骤是阅读房利美站点的资料:
-- [Overview][15]
 - [概述][15]
-- [Glossary of useful terms][16]
-- [用用的术语表][16]
-- [FAQs][17]
+- [有用的术语表][16]
 - [问答][17]
-- [Columns in the Acquisition and Performance files][18]
-- [Columns in the Acquisition and Performance files][18]
-- [Sample Acquisition data file][19]
-- [Sample Performance data file][20]
-
-After reading through these files, we know some key facts that will help us:
-
-- There’s an Acquisition file and a Performance file for each quarter, starting from the year 2000 to present. There’s a 1 year lag in the data, so the most recent data is from 2015 as of this writing.
-- The files are in text format, with a pipe (|) as a delimiter.
-- The files don’t have headers, but we have a list of what each column is.
-- All together, the files contain data on 22 million loans.
-- Because the Performance files contain information on loans acquired in previous years, there will be more performance data for loans acquired in earlier years (ie loans acquired in 2014 won’t have much performance history).
-
-These small bits of information will save us a ton of time as we figure out how to structure our project and work with the data.
+- [采集和执行文件中的列][18]
+- [采集数据文件样本][19]
+- [执行数据文件样本][20]
+在看完这些文件后后,我们了解到一些能帮助我们的关键点:
+- 从2000年到现在,每季度都有一个采集和执行文件,因数据是滞后一年的,所以到目前为止最新数据是2015年的。
+- 这些文件是文本格式的,采用管道符号“|”进行分割。
+- 这些文件是没有表头的,但我们有文件各列的名称。
+- 所有一起,文件包含2200万个贷款的数据。
+由于执行文件包含过去几年获得的贷款的信息,在早些年获得的贷款将有更多的执行数据(即在2014获得的贷款没有多少历史执行数据)。
+这些小小的信息将会为我们节省很多时间,因为我们知道如何构造我们的项目和利用这些数据。
 ### Structuring the project
-
+### 构造项目
 Before we start downloading and exploring the data, it’s important to think about how we’ll structure the project. When building an end-to-end project, our primary goals are:
-
+在我们开始下载和探索数据之前,先想一想将如何构造项目是很重要的。当建立端到端项目时,我们的主要目标是:
 - Creating a solution that works
+- 创建一个可行解决方案
 - Having a solution that runs quickly and uses minimal resources
+- 有一个快速运行且占用最小资源的解决方案
 - Enabling others to easily extend our work
+- 容易可扩展
 - Making it easy for others to understand our code
+- 容易理解的代码
 - Writing as little code as possible
+- 写尽量少的代码
 In order to achieve these goals, we’ll need to structure our project well. A well structured project follows a few principles:

From 2f0e2db40c5bde9e80db4e50433b0bce719d2485 Mon Sep 17 00:00:00 2001
From: WEIYUE XIE
Date: Tue, 2 Aug 2016 21:09:04 +0800
Subject: [PATCH 3/5] =?UTF-8?q?Update=20=E7=BF=BB=E8=AF=91=E4=B8=AD=20idea?=
 =?UTF-8?q?s4u=20part=202=20-=20Building=20a=20data=20science=20portfolio?=
 =?UTF-8?q?=20-=20Machine=20learning=20project.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

保存进度3
---
 ...ience portfolio - Machine learning project.md | 31 ++++++++-----------
 1 file changed, 13 insertions(+), 18 deletions(-)

diff --git a/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md b/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md
index 27c7604922..370a9853e5 100644
--- a/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md
+++ b/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md
@@ -28,33 +28,28 @@
 - 所有一起,文件包含2200万个贷款的数据。
 由于执行文件包含过去几年获得的贷款的信息,在早些年获得的贷款将有更多的执行数据(即在2014获得的贷款没有多少历史执行数据)。
 这些小小的信息将会为我们节省很多时间,因为我们知道如何构造我们的项目和利用这些数据。
-### Structuring the project
+
 ### 构造项目
-Before we start downloading and exploring the data, it’s important to think about how we’ll structure the project. When building an end-to-end project, our primary goals are:
 在我们开始下载和探索数据之前,先想一想将如何构造项目是很重要的。当建立端到端项目时,我们的主要目标是:
-- Creating a solution that works
 - 创建一个可行解决方案
-- Having a solution that runs quickly and uses minimal resources
 - 有一个快速运行且占用最小资源的解决方案
-- Enabling others to easily extend our work
 - 容易可扩展
-- Making it easy for others to understand our code
-- 容易理解的代码
-- Writing as little code as possible
+- 写容易理解的代码
 - 写尽量少的代码
-In order to achieve these goals, we’ll need to structure our project well. A well structured project follows a few principles:
-
-- Separates data files and code files.
-- Separates raw data from generated data.
-- Has a README.md file that walks people through installing and using the project.
-- Has a requirements.txt file that contains all the packages needed to run the project.
-- Has a single settings.py file that contains any settings that are used in other files.
-    - For example, if you are reading the same file from multiple Python scripts, it’s useful to have them all import settings and get the file name from a centralized place.
-- Has a .gitignore file that prevents large or secret files from being committed.
-- Breaks each step in our task into a separate file that can be executed separately.
+为了实现这些目标,需要对我们的项目进行良好的构造。一个结构良好的项目遵循几个原则:
+- 分离数据文件和代码文件
+- 从原始数据中分离生成的数据。
+- 有一个README.md文件帮助人们安装和使用该项目。
+- 有一个requirements.txt文件列明项目运行所需的所有包。
+- 有一个单独的settings.py 文件列明其它文件中使用的所有的设置
+    - 例如,如果从多个Python脚本读取相同的文件,把它们全部import设置和从一个集中的地方获得文件名是有用的。
+- 有一个.gitignore文件,防止大的或秘密文件被提交。
+- 分解任务中每一步可以单独执行的步骤到单独的文件中。
     - For example, we may have one file for reading in the data, one for creating features, and one for making predictions.
+    - 例如,
 - Stores intermediate values. For example, one script may output a file that the next script can read.
+
     - This enables us to make changes in our data processing flow without recalculating everything.
 Our file structure will look something like this shortly:

From a088e129d23496de91278e9d53b9291b06b83c01 Mon Sep 17 00:00:00 2001
From: WEIYUE XIE
Date: Tue, 2 Aug 2016 21:34:22 +0800
Subject: [PATCH 4/5] =?UTF-8?q?Update=20=E7=BF=BB=E8=AF=91=E4=B8=AD=20idea?=
 =?UTF-8?q?s4u=20part=202=20-=20Building=20a=20data=20science=20portfolio?=
 =?UTF-8?q?=20-=20Machine=20learning=20project.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 ...a science portfolio - Machine learning project.md | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md b/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md
index 370a9853e5..b311cd814d 100644
--- a/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md
+++ b/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md
@@ -24,7 +24,7 @@
 在看完这些文件后后,我们了解到一些能帮助我们的关键点:
 - 从2000年到现在,每季度都有一个采集和执行文件,因数据是滞后一年的,所以到目前为止最新数据是2015年的。
 - 这些文件是文本格式的,采用管道符号“|”进行分割。
-- 这些文件是没有表头的,但我们有文件各列的名称。
+- 这些文件是没有表头的,但我们有文件列明各列的名称。
 - 所有一起,文件包含2200万个贷款的数据。
 由于执行文件包含过去几年获得的贷款的信息,在早些年获得的贷款将有更多的执行数据(即在2014获得的贷款没有多少历史执行数据)。
 这些小小的信息将会为我们节省很多时间,因为我们知道如何构造我们的项目和利用这些数据。
@@ -46,13 +46,13 @@
     - 例如,如果从多个Python脚本读取相同的文件,把它们全部import设置和从一个集中的地方获得文件名是有用的。
 - 有一个.gitignore文件,防止大的或秘密文件被提交。
 - 分解任务中每一步可以单独执行的步骤到单独的文件中。
-    - For example, we may have one file for reading in the data, one for creating features, and one for making predictions.
-    - 例如,
-- Stores intermediate values. For example, one script may output a file that the next script can read.
+    - 例如,我们将有一个文件用于读取数据,一个用于创建特征,一个用于做出预测。
+- 保存中间结果,例如,一个脚本可输出下一个脚本可读取的文件。
-    - This enables us to make changes in our data processing flow without recalculating everything.
+    - 这使我们无需重新计算就可以在数据处理流程中进行更改。
+
-Our file structure will look something like this shortly:
+我们的文件结构大体如下:
 ```
 loan-prediction
@@ -64,8 +64,7 @@ loan-prediction
 ├── settings.py
 ```
-### Creating the initial files
-
+### 创建初始文件
 To start with, we’ll need to create a loan-prediction folder. Inside that folder, we’ll need to make a data folder and a processed folder. The first will store our raw data, and the second will store any intermediate calculated values.
 Next, we’ll make a .gitignore file. A .gitignore file will make sure certain files are ignored by git and not pushed to Github. One good example of such a file is the .DS_Store file created by OSX in every folder. A good starting point for a .gitignore file is here. We’ll also want to ignore the data files because they are very large, and the Fannie Mae terms prevent us from redistributing them, so we should add two lines to the end of our file:

From 108dde60df911aef8edc5c81906557c78d4f578c Mon Sep 17 00:00:00 2001
From: WEIYUE XIE
Date: Tue, 2 Aug 2016 21:40:28 +0800
Subject: [PATCH 5/5] =?UTF-8?q?Update=20and=20rename=20=E7=BF=BB=E8=AF=91?=
 =?UTF-8?q?=E4=B8=AD=20ideas4u=20part=202=20-=20Building=20a=20data=20scie?=
 =?UTF-8?q?nce=20portfolio=20-=20Machine=20learning=20project.md=20to=20pa?=
 =?UTF-8?q?rt=202=20-=20Building=20a=20data=20science=20portfolio=20-=20Ma?=
 =?UTF-8?q?chine=20learning=20project.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

占坑
---
 ...Building a data science portfolio - Machine learning project.md} | 1 +
 1 file changed, 1 insertion(+)
 rename sources/team_test/{翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md => part 2 - Building a data science portfolio - Machine learning project.md} (99%)

diff --git a/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md b/sources/team_test/part 2 - Building a data science portfolio - Machine learning project.md
similarity index 99%
rename from sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md
rename to sources/team_test/part 2 - Building a data science portfolio - Machine learning project.md
index b311cd814d..19e1fed3a7 100644
--- a/sources/team_test/翻译中 ideas4u part 2 - Building a data science portfolio - Machine learning project.md
+++ b/sources/team_test/part 2 - Building a data science portfolio - Machine learning project.md
@@ -1,4 +1,5 @@
+翻译中 by ideas4u
 ### 理解数据
 我们来简单看一下原始数据文件。下面是2012年1季度前几行的采集数据。
 ```