diff --git a/sources/tech/20160602 Building a data science portfolio - Storytelling with data.md b/translated/tech/20160602 Building a data science portfolio - Storytelling with data.md similarity index 83% rename from sources/tech/20160602 Building a data science portfolio - Storytelling with data.md rename to translated/tech/20160602 Building a data science portfolio - Storytelling with data.md index 625b08efdc..281c2dee20 100644 --- a/sources/tech/20160602 Building a data science portfolio - Storytelling with data.md +++ b/translated/tech/20160602 Building a data science portfolio - Storytelling with data.md @@ -1,105 +1,107 @@ -Translated by Yoo-4x -Building a data science portfolio: Storytelling with data +建立一项数据科学组合:讲一个关于数据的故事 ======== ->This is the first in a series of posts on how to build a Data Science Portfolio. If you like this and want to know when the next post in the series is released, you can [subscribe at the bottom of the page][35]. +>这是如何建立科学组合系列文章中的第一篇。如果你喜欢这篇文章并且想知道此系列的下一篇文章何时发表,你可以[在页面底部订阅][35]。 -Data science companies are increasingly looking at portfolios when making hiring decisions. One of the reasons for this is that a portfolio is the best way to judge someone’s real-world skills. The good news for you is that a portfolio is entirely within your control. If you put some work in, you can make a great portfolio that companies are impressed by. +数据科学公司们在采用一个想法时越来越看重组合结果。其中一个原因就是运用组合是分析一个人真实技能的最好方式。对你来说好消息解释组合是完全可以被你掌控的。如果你针对一些事情做了一些工作,你就能的奥一个令那些公司印象深刻的组合结果。 -The first step in making a high-quality portfolio is to know what skills to demonstrate. 
The primary skills that companies want in data scientists, and thus the primary skills they want a portfolio to demonstrate, are: +建立一个高质量组合的第一步就是知道展示什么技能。那些公司们主要希望数据科学工作者拥有的技能,或者说他们主要希望组合所展示的技能是: -* Ability to communicate -* Ability to collaborate with others -* Technical competence -* Ability to reason about data -* Motivation and ability to take initiative +* 表达能力 +* 合作能力 +* 专业技能 +* 解释数据的能力 +* 有目标和有积极性的 -Any good portfolio will be composed of multiple projects, each of which may demonstrate 1-2 of the above points. This is the first post in a series that will cover how to make a well-rounded data science portfolio. In this post, we’ll cover how to make your first project for a data science portfolio, and how to tell an effective story using data. At the end, you’ll have a project that will help demonstrate your ability to communicate, and your ability to reason about data. +任何一个好的组合都由多个工程构成,每一个工程都会展示1-2个上面所说的点。这是涵盖了“如何完成一个完整的科学组合”系列文章的第一篇。在这篇文章中,我们将会涵括如何完成你的第一项数据科学组合工程,并且对此进行有效的解释。在ui后,你将会得到一个帮助展示你表达能力和解释数据能力的工程。 -### Storytelling with data +### 讲述一个关于数据的故事 -Data science is fundamentally about communication. You’ll discover some insight in the data, then figure out an effective way to communicate that insight to others, then sell them on the course of action you propose. One of the most critical skills in data science is being able to tell an effective story using data. An effective story can make your insights much more compelling, and help others understand your ideas. +于数据科学表达是基础。你将会发现数据的内含,并且找出一个高效的方式来向他人表达,之后向他们展示你所开展的课题。数据科学最关键的手法之一就是能够讲述一个关于使用数据的清晰的故事。一个好的故事能够使你得到的结果更加引人注目,并且能是别人理解你的想法。 -A story in the data science context is a narrative around what you found, how you found it, and what it means. An example might be the discovery that your company’s revenue has dropped 20% in the last year. It’s not enough to just state that fact – you’ll have to communicate why revenue dropped, and how to potentially fix it. 
+数据科学中的故事是一个讲述关于你发现了什么,你怎么发现它的,并且它意味着什么的故事。例如假使发现你公司的收入相对去年减少了百分之二十。这并不能够确定原因或者表达为什么收入会减少并且在尝试修复它。 -The main components of storytelling with data are: +讲述关于数据的故事主要包含: -* Understanding and setting the context -* Exploring multiple angles -* Using compelling visualizations -* Using varied data sources -* Having a consistent narrative +* 理解并确定内容 +* 从多角度发觉 +* 使用有趣的表示方法 +* 使用多种数据来源 +* 有一致的叙述 -The best tool to effectively tell a story with data is [Jupyter notebook][34]. If you’re unfamiliar, [here’s][33] a good tutorial. Jupyter notebook allows you to interactively explore data, then share your results on various sites, including Github. Sharing your results is helpful both for collaboration, and so others can extend your analysis. +用来讲述关于数据的故事最有效率的工具就是[Jupyter notebook][34]。如果你不熟悉,[此处][33]有一个好的教程。 Jupyter notebook 允许你交互式的发掘数据,并且将你的结果分享到多个网站,包括Github,分享你的结果有助于合作研究和其他人拓展你的分析。 -We’ll use Jupyter notebook, along with Python libraries like Pandas and matplotlib in this post. +我们将使用Jupyter notebook,Python库和matplotlib在这篇文章中。 -### Choosing a topic for your data science project +### 为你的数据科学工程选择一个主题 -The first step in creating a project is to decide on your topic. You want the topic to be something you’re interested in, and are motivated to explore. It’s very obvious when people are making projects just to make them, and when people are making projects because they’re genuinely interested in exploring the data. It’s worth spending extra time on this step, so ensure that you find something you’re actually interested in. +建立一个工程的第一步就是觉得你的主题。你需要你的主题是你兴趣所在的,并且有动力去挖掘。当人们为了完成一个项目而完成和当人们完成项目是因为有兴趣去进行数据挖掘时的区别是很明显的。这个步骤是值得花费时间的,所以确保你找到了你真正感兴趣的东西。 -A good way to find a topic is to browse different datasets and seeing what looks interesting. Here are some good sites to start with: +一个寻找主题的好的方法就是浏览不同的数据组并且寻找感兴趣的部分。这里有一些作为起点的好的网站: -* [Data.gov][20] – contains government data. -* [/r/datasets][19] – a subreddit that has hundreds of interesting datasets. 
-* [Awesome datasets][18] – a list of datasets, hosted on Github. -* [rs.io][17] – a great blog post with hundreds of interesting datasets. +* [Data.gov][20] - 包含了政府据。 +* [/r/datasets][19] – 一个有着上百个有趣数据组的reddit(reddit是一个类似于贴吧、论坛的网站)。 +* [Awesome datasets][18] – 一个数据组的列表,位于Github上。 +* [rs.io][17] – 一个有着上百个有趣数据组的博客。 -In real-world data science, you often won’t find a nice single dataset that you can browse. You might have to aggregate disparate data sources, or do a good amount of data cleaning. If a topic is very interesting to you, it’s worth doing the same here, so you can show off your skills better. +真实世界中的数据科学,经常无法找到合适的单个数据组。你可能需要合并多个独立的数据源,或者做数量庞大的数据清理。如果主题非常吸引你,这是值得这样做的,并且也能更好的展示你的技能。 -For the purposes of this post, we’ll be using data about New York city public schools, which can be found [here][32]. +关于这篇文章的主题,我们将使用纽约市公立学校的数据,我们可以在[这里][32]找到它。 -### Pick a topic +### 选择主题 -It’s important to be able to take the project from start to finish. In order to do this, it can be helpful to restrict the scope of the project, and make it something we know we can finish. It’s easier to add to a finished project than to complete a project that you just can’t seem to ever get enough motivation to finish. +对于完成项目来说这是十分重要的。因为主题能很好的限制项目的范围,并且使它能够是我们知道它可以被完成。比起一个没有足够动力完成的工程来说添加到一个完成的工程更加容易。 -In this case, we’ll look at the [SAT scores][31] of high schoolers, along with various demographic and other information about them. The SAT, or Scholastic Aptitude Test, is a test that high schoolers take in the US before applying to college. Colleges take the test scores into account when making admissions decisions, so it’s fairly important to do well on. The test is divided into 3 sections, each of which is scored out of 800 points. The total score is out of 2400 (although this has changed back and forth a few times, the scores in this dataset are out of 2400). 
High schools are often ranked by their average SAT scores, and high SAT scores are considered a sign of how good a school district is. +所以,我们将关注高中的[学术评估测试][31],伴随着多种人口统计和它们的其它数据。关于SAT,或者说学习评估测试,是美国高中生进入大学前的测试。大学在做判定时将成绩录入账号,所以高分是十分重要的。考试分为三个阶段,每个阶段总分为800。全部分数为2400(即使这个前后更改了几次,在数据中总分还是2400)。高中经常通过平均STA分数进行排名,并且SAT是评判高中有多好的标准。 -There have been allegations about the SAT being unfair to certain racial groups in the US, so doing this analysis on New York City data will help shed some light on the fairness of the SAT. +因为由关于STA分数对于美国中某些种族群体是不公平的。,所以这个纽约市数据分析能够帮助对SAT的公平性有轻许帮助。 -We have a dataset of SAT scores [here][30], and a dataset that contains information on each high school [here][29]. These will form the base of our project, but we’ll need to add more information to create compelling analysis. +我们有SAT成绩的数据组[这里][30],并且数据组中包含了每所高中的信息[这里][29]。这些将总成我们的工程的基础,但是我们将将如更多的信息来创建有趣的分析。 -### Supplementing the data +### 补充数据 -Once you have a good topic, it’s good to scope out other datasets that can enhance the topic or give you more depth to explore. It’s good to do this upfront, so you have as much data as possible to explore as you’re building your project. Having too little data might mean that you give up on your project too early. +如果你已经有了一个很好的主题,拓展其它可以提升主题或者更深入挖掘数据的的数据组是被推荐的。十分适合在前期做这些工作,那么你将会有尽可能多的数据来构建你的工程。有着越少的数据意味着你太早的放弃了你的工程。 -In this case, there are several related datasets on the same website that cover demographic information and test scores. +在包含人口统计信息和测试成绩的网站上这里有一些相关的数据组。 -Here are the links to all of the datasets we’ll be using: +这些是我们将会用到的所有数据组: -* [SAT scores by school][16] – SAT scores for each high school in New York City. -* [School attendance][15] – attendance information on every school in NYC. -* [Math test results][14] – math test results for every school in NYC. -* [Class size][13] – class size information for each school in NYC. 
+* [SAT scores by school][16] – 纽约市每所高中的STA成绩。 +* [School attendance][15] – 纽约市每所学校的出勤信息。 +* [Math test results][14] – 纽约市每所学校的数学成绩。 +* [Class size][13] - 纽约市每所学校课堂人数信息。 * [AP test results][12] – Advanced Placement exam results for each high school. Passing AP exams can get you college credit in the US. -* [Graduation outcomes][11] – percentage of students who graduated, and other outcome information. -* [Demographics][10] – demographic information for each school. -* [School survey][9] – surveys of parents, teachers, and students at each school. -* [School district maps][8] – contains information on the layout of the school districts, so that we can map them out. +* [AP test results][12] - 高阶位考试,在美国,通过AP测试就能获得大学学分。 +译者注:高阶位考试(AP)是美国和加拿大的一个由大学委员会创建的计划,该计划为高中学生提供大学水平的课程和考试。 美国学院和大学可以授予在考试中获得高分的学生的就学和课程学分。 -All of these datasets are interrelated, and we’ll be able to combine them before we do any analysis. +* [Graduation outcomes][11] – 由百分之几的学生毕业了,和其它去向信息。 +* [Demographics][10] – 每个学校的人口统计信息。 +* [School survey][9] – 学校的家长、教师,学生的问卷。 +* [School district maps][8] – 包含学校的区域布局信息,因此我们能将它们在地图上标出。 -### Getting background information +这些数据组合之间是相互关联的,并且我们能够在开始分析之前进行合并。 -Before diving into analyzing the data, it’s useful to research some background information. In this case, we know a few facts that will be useful: +### 获取背景信息 -* New York City is divided into `5` boroughs, which are essentially distinct regions. -* Schools in New York City are divided into several school district, each of which can contains dozens of schools. -* Not all the schools in all of the datasets are high schools, so we’ll need to do some data cleaning. -* Each school in New York City has a unique code called a `DBN`, or District Borough Number. -* By aggregating data by district, we can use the district mapping data to plot district-by-district differences. 
+在开始分析数据之前,搜索一些背景信息是有必要的。我们知道这些有用的信息: -### Understanding the data +* 纽约市被分为五个不同的辖区 +* 纽约市的学校坐落在学校区域内,每个都学校区域都可能包含数十所学校。 +* 数据组中的并不全是高中,所以我们需要对数据进行一些清理工作。 +* 纽约市的每所学校都有自己单独的编码,被称为‘DBN’,或者区域行政编号。 +* 为了通过区域进行合并数据,我们可以使用地图区域信息来绘制逐区差异。 -In order to really understand the context of the data, you’ll want to spend time exploring and reading about the data. In this case, each link above has a description of the data, along with the relevant columns. It looks like we have data on the SAT scores of high schoolers, along with other datasets that contain demographic and other information. +### 理解数据 -We can run some code to read in the data. We’ll be using [Jupyter notebook][28] to explore the data. The below code will: +为了真正的理解数据信息,你将需要花费时间挖掘和阅读数据。因此,每一个数据链接的描述信息都沿着相关列。假如我们拥有高中SAT成绩信息,包含图像和其它信息的数据组。 -* Loop through each data file we downloaded. -* Read the file into a [Pandas DataFrame][7]. -* Put each DataFrame into a Python dictionary. +我们可以运行一些代码来读取数据。我们将使用[Jupyter notebook][28]来挖掘数据。下面的代码将会执行一下操作: + +* 循环通过我们下载的所有数据文件。 +* 将文件读取到[Pandas DataFrame][7]。 +* 将所有数据框架导入Python数据库中。 In [100]: ``` @@ -115,7 +117,7 @@ for f in files: ``` -Once we’ve read the data in, we can use the [head][27] method on DataFrames to print the first `5` lines of each DataFrame: +一旦我们将数据读入,我们就可以使用数据框架中的[头部][27]方法打印每个数据框架的前五行。 In [103]: ``` @@ -379,21 +381,23 @@ hs_directory ``` -We can start to see some useful patterns in the datasets: +我们可以开始在数据组合中观察有用的部分: * Most of the datasets contain a `DBN` column -* Some fields look interesting for mapping, particularly `Location 1`, which contains coordinates inside a larger string. -* Some of the datasets appear to contain multiple rows for each school (repeated DBN values), which means we’ll have to do some preprocessing. 
+* 大部分数据组包含DBN列。 +* 一些条目看起来在地图上标出会很有趣,特别是`Location 1`,这列对应的信息会多一些。 +* 有些数据组会出现每所学校对应多行数据(DBN数据重复),这意味着我们要进行预处理。 ### Unifying the data +### 统一数据 -In order to work with the data more easily, we’ll need to unify all the individual datasets into a single one. This will enable us to quickly compare columns across datasets. In order to do this, we’ll first need to find a common column to unify them on. Looking at the output above, it appears that `DBN` might be that common column, as it appears in multiple datasets. +为了使工作更简单,我们将需要将全部零散的数据组统一为一个。这将使我们能够快速跨数据组对比数据列。因此,我们需要找到相同的列将他们统一起来。请查看上面的输出数据,当DBN出现在多个数据组中时它很可能成为共同列。 -If we google `DBN New York City Schools`, we end up [here][26], which explains that the `DBN` is a unique code for each school. When exploring datasets, particularly government ones, it’s often necessary to do some detective work to figure out what each column means, or even what each dataset is. +如果我们用google搜索`DBN New York City Schools`, 我们[在此][26]得到了结果。它解释了DBN是每个学校独特的编码。我们将挖掘数据组,特别是政府数据组。这通常需要做一些工作来找出每列的含义,或者每个数据组是的意图。 -The problem now is that two of the datasets, `class_size`, and `hs_directory`, don’t have a `DBN` field. In the `hs_directory` data, it’s just named `dbn`, so we can just rename the column, or copy it over into a new column called `DBN`. In the `class_size` data, we’ll need to try a different approach. +现在这两个数据组的主要的问题是,`class_size`, 和 `hs_directory`数据组, 没有 `DBN` 列。在`hs_directory` 数据中是dbn,那么我们只需重命名即可,或者将它复制到新的名为DBN的列中。在`class_size`数据中,我们将需要尝试不同的方法。 -The `DBN` column looks like this: +DBN列: In [5]: ``` @@ -411,7 +415,7 @@ Out[5]: Name: DBN, dtype: object ``` -If we look at the `class_size` data, here’s what we’d see in the first `5` rows: +如果我们看向`class_size`数据,我们将看到前五行: In [4]: ``` @@ -429,9 +433,9 @@ Out[4]: | 3 | 1 | M | M015 | P.S. 015 Roberto Clemente | 01 | CTT | - | - | - | 17.0 | 1.0 | 17.0 | 17.0 | 17.0 | ATS | NaN | | 4 | 1 | M | M015 | P.S. 
015 Roberto Clemente | 02 | GEN ED | - | - | - | 15.0 | 1.0 | 15.0 | 15.0 | 15.0 | ATS | NaN | -As you can see above, it looks like the `DBN` is actually a combination of `CSD`, `BOROUGH`, and `SCHOOL CODE`. For those unfamiliar with New York City, it is composed of `5` boroughs. Each borough is an organizational unit, and is about the same size as a fairly large US City.`DBN` stands for `District Borough Number`. It looks like `CSD` is the District, `BOROUGH` is the borough, and when combined with the `SCHOOL CODE`, forms the `DBN`. There’s no systematized way to find insights like this in data, and it requires some exploration and playing around to figure out. +正如上面所见,DBN实际上是`CSD`, `BOROUGH`, 和 `SCHOOL CODE` 的组合。对那些不熟悉纽约市的人来说,纽约由五个行政区组成。每个行政区是一个组织团体,并且有着美国城市一样的面积。DBN全称为行政区域编号。看起来就像CSD是区域,BOROUGH是行政区,并且当与SCHOOL CODE合并时就组成了DBN。这里并没有系统的方法寻找像这个数据这样的内在规律,并且这需要一些探索和努力来发现。 -Now that we know how to construct the `DBN`, we can add it into the `class_size` and `hs_directory` datasets: +现在我们已经知道了DBN的组成,那么我们就可以将它加入到class_size和hs_directory数据组中了: In [ ]: ``` @@ -440,16 +444,16 @@ data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"] ``` -### Adding in the surveys +### 加入问卷 -One of the most potentially interesting datasets to look at is the dataset on student, parent, and teacher surveys about the quality of schools. These surveys include information about the perceived safety of each school, academic standards, and more. Before we combine our datasets, let’s add in the survey data. In real-world data science projects, you’ll often come across interesting data when you’re midway through your analysis, and will want to incorporate it. Working with a flexible tool like Jupyter notebook will allow you to quickly add some additional code, and re-run your analysis. 
+最可能值得一看的数据组之一就是学生、家长和老师关于学校质量的问卷了。这些问卷包含了每所学校的安全度,教学水平等。之前我们所合并了数据组,让我们们添加问卷数据。在真实世界的数据科学工程中,你将要经常会在分析过程中碰到有趣的数据,并且希望合并它。使用灵活的工具就像Jupyter notebook 将允许你快速添加一些新的代码,并且重新开始你的分析。 -In this case, we’ll add the survey data into our `data` dictionary, and then combine all the datasets afterwards. The survey data consists of `2` files, one for all schools, and one for school district `75`. We’ll need to write some code to combine them. In the below code, we’ll: +因此,我们将添加问卷数据到我们的data文件夹,并且合并所有之前的数据。问卷数据分为两个文件,一个包含所有的学校,一个包含75号区域的学校。我们将需要写一些代码来合并它们。之后的代码我们将: -* Read in the surveys for all schools using the `windows-1252` file encoding. -* Read in the surveys for district 75 schools using the `windows-1252` file encoding. -* Add a flag that indicates which school district each dataset is for. -* Combine the datasets into one using the [concat][6] method on DataFrames. +* 读取所有学校的问卷,并使用windows-1252作为编码。 +* 使用windows-1252编码读取所有75号区域学校的问卷。 +* 添加标签来表明每个数据组包含哪个区域的学校。 +* 使用数据框架[concat][6]方法合并数据组为一个。 In [66]: ``` @@ -461,7 +465,7 @@ survey = pandas.concat([survey1, survey2], axis=0) ``` -Once we have the surveys combined, there’s an additional complication. We want to minimize the number of columns in our combined dataset so we can easily compare columns and figure out correlations. Unfortunately, the survey data has many columns that aren’t very useful to us: +一旦我们将问卷合并,这里将会有一些混乱。我们希望我们合并的数据组列数最少,那么我们将可以轻易的进行列之间的对比并找出期间的关联。不幸的是,问卷数据有很多列并不是很有用: In [16]: ``` @@ -480,11 +484,11 @@ Out[16]: 5 rows × 2773 columns -We can resolve this issue by looking at the data dictionary file that we downloaded along with the survey data. 
The file tells us the important fields in the data: +我们可以通过查看数据文件夹中伴随问卷数据下载下来的文件来解决这个问题。它告诉我们们数据中重要的部分是哪些: ![](https://www.dataquest.io/blog/images/misc/xj5ud4r.png) -We can then remove any extraneous columns in `survey`: +我们可以去除`survey`数据组中多余的列: In [17]: ``` @@ -501,11 +505,11 @@ Out[17]: (1702, 23) ``` -Making sure you understand what each dataset contains, and what the relevant columns are can save you lots of time and effort later on. +请确保理你已经了解了每个数据组的内容和相关联的列,者能节约你之后大量的时间和精力: -### Condensing datasets +### 精简数据组 -If we take a look at some of the datasets, including `class_size`, we’ll immediately see a problem: +如果我们看向某些数据组,包括`class_size`,我们将立刻发现问题: In [18]: ``` @@ -523,7 +527,7 @@ Out[18]: | 3 | 1 | M | M015 | P.S. 015 Roberto Clemente | 01 | CTT | - | - | - | 17.0 | 1.0 | 17.0 | 17.0 | 17.0 | ATS | NaN | 01M015 | | 4 | 1 | M | M015 | P.S. 015 Roberto Clemente | 02 | GEN ED | - | - | - | 15.0 | 1.0 | 15.0 | 15.0 | 15.0 | ATS | NaN | 01M015 | -There are several rows for each high school (as you can see by the repeated `DBN` and `SCHOOL NAME` fields). However, if we take a look at the `sat_results` dataset, it only has one row per high school: +每所高中都有许多行(正如你所见的重复的`DBN`和`SCHOOL NAME`)。然而,如果我们看向`sat_result`数据组,每所高中只有一行: In [21]: ``` @@ -541,12 +545,12 @@ Out[21]: | 3 | 01M458 | FORSYTH SATELLITE ACADEMY | 7 | 414 | 401 | 359 | | 4 | 01M509 | MARTA VALLE HIGH SCHOOL | 44 | 390 | 433 | 384 | -In order to combine these datasets, we’ll need to find a way to condense datasets like `class_size` to the point where there’s only a single row per high school. If not, there won’t be a way to compare SAT scores to class size. We can accomplish this by first understanding the data better, then by doing some aggregation. With the `class_size`dataset, it looks like `GRADE` and `PROGRAM TYPE` have multiple values for each school. By restricting each field to a single value, we can filter most of the duplicate rows. 
In the below code, we: +为了合并这些数据组,我们将需要找到方法精简数据组到如`class_size`般一行对应一所高中。否则,我们将不能将SAT成绩与班级大小进行比较。我们通过先更好的理解数据,然后做一些合并来完成。`class_size`数据组像`GRADE`和`PROGRAM TYPE`,每个学校有多个数据对应。为了将每个范围内的数据变为一个数据,我们将大部分重复行过滤掉,在下面的代码中我们将会: -* Only select values from `class_size` where the `GRADE` field is `09-12`. -* Only select values from `class_size` where the `PROGRAM TYPE` field is `GEN ED`. -* Group the `class_size` dataset by `DBN`, and take the average of each column. Essentially, we’ll find the average `class_size` values for each school. -* Reset the index, so `DBN` is added back in as a column. +* 只从`class_size`中选择`GRADE`范围为`09-12`的行。 +* 只从`class_size`中选择`PROGRAM TYPE`是`GEN ED`的值。 +* 将`class_size`以`DBN`分组,然后取每列的平均值。重要的是,我们将找到每所学校班级大小平均值。 +* 重置索引,将`DBN`重新加到列中。 In [68]: ``` @@ -559,9 +563,9 @@ data["class_size"] = class_size ``` -### Condensing other datasets +### 精简其它数据组 -Next, we’ll need to condense the `demographics` dataset. The data was collected for multiple years for the same schools, so there are duplicate rows for each school. We’ll only pick rows where the `schoolyear` field is the most recent available: +接下来,我们将需要精简`demographic`数据组。这里有每个学校收集多年的数据,所以这里每所学校有许多重复的行。我们将只选取`schoolyear`最近的可用行: In [69]: ``` @@ -571,7 +575,7 @@ data["demographics"] = demographics ``` -We’ll need to condense the `math_test_results` dataset. This dataset is segmented by `Grade`and by `Year`. We can select only a single grade from a single year: +我们需要精简`math_test_results` 数据组。这个数据组被`Grade`和`Year`划分。我们将只选取一年选取一个年级。 In [70]: ``` @@ -580,7 +584,7 @@ data["math_test_results"] = data["math_test_results"][data["math_test_results"][ ``` -Finally, `graduation` needs to be condensed: +最后,`graduation`需要被精简: In [71]: ``` @@ -589,14 +593,14 @@ data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "To ``` -Data cleaning and exploration is critical before working on the meat of the project. Having a good, consistent dataset will help you do your analysis more quickly. 
+在完成工程的主要部分之前数据清理和挖掘是十分重要的。有一个高质量的,一致的数据组将会使你的分析更加快速。 -### Computing variables +### 计算变量 -Computing variables can help speed up our analysis by enabling us to make comparisons more quickly, and enable us to make comparisons that we otherwise wouldn’t be able to do. The first thing we can do is compute a total SAT score from the individual columns `SAT Math Avg. Score`, `SAT Critical Reading Avg. Score`, and `SAT Writing Avg. Score`. In the below code, we: +计算变量可以通过使我们的比较更加快速来加快分析速度,并且能是我们做到本无法做到的比较。我们能做的第一件事就是从分开的列`SAT Math Avg. Score`, `SAT Critical Reading Avg. Score`, and `SAT Writing Avg. Score`计算SAT成绩: -* Convert each of the SAT score columns from a string to a number. -* Add together all of the columns to get the `sat_score` column, which is the total SAT score. +* 将SAT列数值从字符转转化为数字。 +* 将所有列相加以得到`sat_score`,即SAT成绩。 In [72]: ``` @@ -608,10 +612,10 @@ data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_resu ``` -Next, we’ll need to parse out the coordinate locations of each school, so we can make maps. This will enable us to plot the location of each school. In the below code, we: +接下来,我们将需要进行每所学校一致区域分析,所以我们将制作地图。这将是我们画出每所学校的位置,下面的代码,我们将会: -* Parse latitude and longitude columns from the `Location 1` column. -* Convert `lat` and `lon` to be numeric. +* 从`Location 1`列分析出经度和维度。 +* 转化`lat`(经度)和`lon`(维度)为数字。 In [73]: ``` @@ -623,7 +627,7 @@ for c in ['lat', 'lon']: ``` -Now, we can print out each dataset to see what we have: +现在,我们将输出每个数据组来查看我们有了什么数据: In [74]: ``` @@ -861,18 +865,18 @@ hs_directory ``` -### Combining the datasets +### 合并数据组 -Now that we’ve done all the preliminaries, we can combine the datasets together using the `DBN` column. At the end, we’ll have a dataset with hundreds of columns, from each of the original datasets. When we join them, it’s important to note that some of the datasets are missing high schools that exist in the `sat_results` dataset. 
To resolve this, we’ll need to merge the datasets that have missing rows using the `outer` join strategy, so we don’t lose data. In real-world data analysis, it’s common to have data be missing. Being able to demonstrate the ability to reason about and handle missing data is an important part of building a portfolio. +现在我们已经完成了全部准备工作,我们可以用`DBN`列将数据组合并在一起了。在最后,我们将会从原始数据组得到一个有着上百列的数据组。当我们合并它们。请注意有些数据组中会没有`sat_result`中出现的高中。为了解决这个问题,我们需要使用`outer`方法来合并缺少行的数据组,这样我们就不会丢失数据。在实际分析中,缺少数据是很常见的。能够展示解释和解决数据缺失的能力是科学投资组合的重要部分。 -You can read about different types of joins [here][25]. +你可以在[此][25]阅读关于不同类型的join。 -In the below code, we’ll: +接下来的代码,我们将会: -* Loop through each of the items in the `data` dictionary. -* Print the number of non-unique DBNs in the item. -* Decide on a join strategy – `inner` or `outer`. -* Join the item to the DataFrame `full` using the column `DBN`. +* 循环通过`data`文件夹中的每一个条目。 +* 输出条目中的DBN码。 +* 决定join类别 - `inner`或`outer`。 +* 使用`DBN`列将条目合并到数据框架`full`中。 In [75]: ``` @@ -915,19 +919,19 @@ Out[75]: (374, 174) ``` -### Enjoying this post? Learn data science with Dataquest! +### 喜欢这篇文章?通过数据查询学习数据科学! ##### -* Learn from the comfort of your browser. -* Work with real-life data sets. -* Build a portfolio of projects. +* 从浏览器舒适的学习。 +* 使用实际的数据组。 +* 建立科学组合工程。 -[Start for Free][5] +[开始免费][5] -### Adding in values +### 添加值 -Now that we have our `full` DataFrame, we have almost all the information we’ll need to do our analysis. There are a few missing pieces, though. We may want to correlate the [Advanced Placement][24] exam results with SAT scores, but we’ll need to first convert those columns to numbers, then fill in any missing values: +现在我们有了我们的`full`数据框架,我们几乎拥有分析需要的所有数据。虽然这里有一些缺少的部分。我们可能将[AP][24] 考试结果与 SAT 成绩相关联,但是我们首先需要将这些列转化为数字,然后填充缺失的数据。 In [76]: ``` @@ -940,7 +944,7 @@ full[cols] = full[cols].fillna(value=0) ``` -Then, we’ll need to calculate a `school_dist` column that indicates the school district of the school. 
This will enable us to match up school districts and plot out district-level statistics using the district maps we downloaded earlier: +然后我们将需要计算表示哦学校所在区域的`school_dist`列。这将是我们匹配学校区域并且使用我们之前下载的区域地图画出地区级别的地图。 In [77]: ``` @@ -948,7 +952,7 @@ full["school_dist"] = full["DBN"].apply(lambda x: x[:2]) ``` -Finally, we’ll need to fill in any missing values in `full` with the mean of the column, so we can compute correlations: +最终,我们将需要用列的平均值填充缺失的数据到`full`中。那么我们就可以计算关联了: In [79]: ``` @@ -956,9 +960,9 @@ full = full.fillna(full.mean()) ``` -### Computing correlations +### 计算关联 -A good way to explore a dataset and see what columns are related to the one you care about is to compute correlations. This will tell you which columns are closely related to the column you’re interested in. We can do this via the [corr][23] method on Pandas DataFrames. The closer to `0` the correlation, the weaker the connection. The closer to `1`, the stronger the positive correlation, and the closer to `-1`, the stronger the negative correlation`: +一个好的方法来挖掘数据并查看哪些列与你所关心的问题有联系就是计算关联。这将告诉你哪列与你所关心的列更加有关联。你可以通过Pandas DataFrame 的[corr][23]方法来完成。越接近0则关联越小。越接近1则正相关越强,越接近-1则负关联越强: In [80]: ``` @@ -1032,25 +1036,25 @@ lon -1.315241e-01 Name: sat_score, dtype: float64 ``` -This gives us quite a few insights that we’ll need to explore: +这给了我们一些我们需要探索的内在规律: -* Total enrollment correlates strongly with `sat_score`, which is surprising, because you’d think smaller schools, which focused more on the student, would have higher scores. -* The percentage of females at a school (`female_per`) correlates positively with SAT score, whereas the percentage of males (`male_per`) correlates negatively. -* None of the survey responses correlate highly with SAT scores. -* There is a significant racial inequality in SAT scores (`white_per`, `asian_per`, `black_per`, `hispanic_per`). -* `ell_percent` correlates strongly negatively with SAT scores. 
+* total_enrollment 与 `sat_score`强烈相关,这是令人惊讶的,因为你曾经认为越小的学校越专注与学生就会取得更高的成绩。 +* 女生所占学校的比例(`female_per`) 与SAT成绩呈正相关,而男生所占学生比例(`male_per`)成负相关。 +* 没有问卷与SAT成绩成正相关。 +* SAT成绩由明显的种族不平等(`white_per`, `asian_per`, `black_per`, `hispanic_per`)。 +* `ell_percent` 与SAT成绩明显负相关。 -Each of these items is a potential angle to explore and tell a story about using the data. +每一个条目都是一个潜在的角度来挖掘和讲述一个关于数据的故事。 -### Setting the context +### 设置上下文 -Before we dive into exploring the data, we’ll want to set the context, both for ourselves, and anyone else that reads our analysis. One good way to do this is with exploratory charts or maps. In this case, we’ll map out the positions of the schools, which will help readers understand the problem we’re exploring. +在我们开始数据挖掘之前,我们将希望设置上下文,不仅为了我们自己,也是为了其它阅读我们分析的人。一个好的方法就是建立挖掘图标或者地图。因此,我们将在地图标出所有学校的位置,这将有助于读者理解我们所探索的问题。 -In the below code, we: +在下面的代码中,我们将会: -* Setup a map centered on New York City. -* Add a marker to the map for each high school in the city. -* Display the map. +* 建立纽约市为中心的地图。 +* 为城市里的每所高中添加一个标号。 +* 显示地图。 In [82]: ``` @@ -1068,7 +1072,7 @@ schools_map Out[82]:![](https://www.dataquest.io/blog/images/storytelling/map.png) -This map is helpful, but it’s hard to see where the most schools are in NYC. Instead, we’ll make a heatmap: +这个地图十分有用,但是不容易查看纽约哪里学校最多。因此,我们将用热图来代替它: In [84]: ``` @@ -1081,15 +1085,15 @@ schools_heatmap Out[84]:![](https://www.dataquest.io/blog/images/storytelling/heatmap.png) -### District level mapping +### 区域级别映射 -Heatmaps are good for mapping out gradients, but we’ll want something with more structure to plot out differences in SAT score across the city. School districts are a good way to visualize this information, as each district has its own administration. New York City has several dozen school districts, and each district is a small geographic area. 
+热图能够很好的标出梯度,但是我们将需要更结构化的画出不同城市之间的SAT分数差距。学校地区是一个很好的方式图形化信息,就像每个区域都有自己的管理者。纽约市数十个学校区域,并且每个区域都是一个小的地理区域。 -We can compute SAT score by school district, then plot this out on a map. In the below code, we’ll: +我们可以通过学校区域来计算SAT分数,然后将它们画在地图上。在下面的代码中,我们将会: -* Group `full` by school district. -* Compute the average of each column for each school district. -* Convert the `school_dist` field to remove leading `0`s, so we can match our geograpghic district data. +* 通过学校区域对`full`进行分组。 +* 计算每个学校区域的每列的平均值。 +* 去掉`school_dist`头部的0,然后我们就可以匹配地理数据了。 In [ ]: ``` @@ -1099,7 +1103,7 @@ district_data["school_dist"] = district_data["school_dist"].apply(lambda x: str( ``` -We’ll now we able to plot the average SAT score in each school district. In order to do this, we’ll read in data in [GeoJSON][22] format to get the shapes of each district, then match each district shape with the SAT score using the `school_dist` column, then finally create the plot: +我们现在将可以画出SAT在每个学校区域的平均值了。因此,我们将会读取[GeoJSON][22]中的数据,转化为每个区域的形状,然后通过`school_dist`列对每个区域图形和SAT成绩进行匹配。最终我们将创建一个图形: In [85]: ``` @@ -1124,11 +1128,11 @@ show_district_map("sat_score") Out[85]:![](https://www.dataquest.io/blog/images/storytelling/district_sat.png) -### Exploring enrollment and SAT scores +### 挖掘注册人数与SAT分数 -Now that we’ve set the context by plotting out where the schools are, and SAT score by district, people viewing our analysis have a better idea of the context behind the dataset. Now that we’ve set the stage, we can move into exploring the angles we identified earlier, when we were finding correlations. The first angle to explore is the relationship between the number of students enrolled in a school and SAT score. +现在我们已经依地区画出学校位置和STA成绩确定了上下文,浏览我们分析的人将会对数据的上下文有更好的理解。现在我们已经完成了基础工作,我们可以开始从我们上面寻找关联时所提到的角度分析了。第一个分析角度是学校注册学生人数与SAT成绩。 -We can explore this with a scatter plot that compares total enrollment across all schools to SAT scores across all schools. 
+We can analyze this with a scatter plot of total enrollment against SAT score across all schools.
 In [87]:
 ```
@@ -1145,9 +1149,9 @@ Out[87]:
 ![]()
-As you can see, there’s a cluster at the bottom left with low total enrollment and low SAT scores. Other than this cluster, there appears to only be a slight positive correlation between SAT scores and total enrollment. Graphing out correlations can reveal unexpected patterns.
+As you can see, there is a cluster at the bottom left with low total enrollment and low SAT scores. Apart from this cluster, there appears to be only a slight positive correlation between SAT scores and total enrollment. Plotting correlations can reveal unexpected patterns.
-We can explore this further by getting the names of the schools with low enrollment and low SAT scores:
+We can analyze this further by getting the names of the schools with low enrollment and low SAT scores:
 In [88]:
 ```
@@ -1172,11 +1176,11 @@ Out[88]:
 Name: School Name, dtype: object
 ```
-Some searching on Google shows that most of these schools are for students who are learning English, and are low enrollment as a result. This exploration showed us that it’s not total enrollment that’s correlated to SAT score – it’s whether or not students in the school are learning English as a second language or not.
+Some searching on Google shows that most of these schools serve students who are learning English, which is why their enrollment is low. This exploration shows that it is not total enrollment that correlates with SAT score, but whether the students in a school are learning English as a second language.
-### Exploring English language learners and SAT scores
+### Digging into English language learners and SAT scores
-Now that we know the percentage of English language learners in a school is correlated with lower SAT scores, we can explore the relationship. The `ell_percent` column is the percentage of students in each school who are learning English. We can make a scatterplot of this relationship:
+Now that we know the percentage of English language learners in a school is correlated with lower SAT scores, we can explore the relationship. The `ell_percent` column gives the percentage of students in each school who are learning English. We can make a scatter plot of this relationship:
 In [89]:
 ```
@@ -1191,7 +1195,7 @@ Out[89]:
 ![]()
-It looks like there are a group of schools with a high `ell_percentage` that also have low average SAT scores. We can investigate this at the district level, by figuring out the percentage of English language learners in each district, and seeing it if matches our map of SAT scores by district:
+It looks like there is a group of schools with a high `ell_percent` that also have low average SAT scores. We can investigate this at the district level by working out the percentage of English language learners in each district and checking whether it matches our district-level map of SAT scores:
 In [90]:
 ```
@@ -1202,11 +1206,11 @@ Out[90]:
 ![](https://www.dataquest.io/blog/images/storytelling/district_ell.png)
-As we can see by looking at the two district level maps, districts with a low proportion of ELL learners tend to have high SAT scores, and vice versa.
+As the two district-level maps show, districts with a low proportion of ELL (English language learner) students tend to have high SAT scores, and vice versa.
-### Correlating survey scores and SAT scores
+### Correlating survey scores with SAT scores
-It would be fair to assume that the results of student, parent, and teacher surveys would have a large correlation with SAT scores. It makes sense that schools with high academic expectations, for instance, would tend to have higher SAT scores. To test this theory, lets plot out SAT scores and the various survey metrics:
+It is reasonable to assume that the results of the student, parent, and teacher surveys correlate strongly with SAT scores. For example, it makes sense that schools with high academic expectations would tend to have higher SAT scores. To test this theory, let’s plot SAT scores against the various survey metrics:
 In [91]:
 ```
@@ -1221,11 +1225,11 @@ Out[91]:
 ![]()
-Surprisingly, the two factors that correlate the most are `N_p` and `N_s`, which are the counts of parents and students who responded to the surveys. Both strongly correlate with total enrollment, so are likely biased by the `ell_learners`. The other metric that correlates most is `saf_t_11`. That is how safe students, parents, and teachers perceived the school to be. It makes sense that the safer the school, the more comfortable students feel learning in the environment. However, none of the other factors, like engagement, communication, and academic expectations, correlated with SAT scores. This may indicate that NYC is asking the wrong questions in surveys, or thinking about the wrong factors (if their goal is to improve SAT scores, it may not be).
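+Surprisingly, the two factors that correlate most are `N_p` and `N_s`, the counts of parents and students who responded to the surveys. Both correlate strongly with total enrollment, so they are likely biased by the `ell_learners`. The next most strongly correlated metric is `saf_t_11`, which measures how safe students, parents, and teachers perceive the school to be. It makes sense that the safer a school is, the more comfortable students feel learning there. However, none of the other factors, such as engagement, communication, and academic expectations, correlate with SAT scores. This may indicate that NYC is asking the wrong questions in its surveys, or focusing on the wrong factors (if the goal is to improve SAT scores, which it may not be).
-### Exploring race and SAT scores
+### Digging into race and SAT scores
-One of the other angles to investigate involves race and SAT scores. There was a large correlation differential, and plotting it out will help us understand what’s happening:
+Another angle to investigate is the relationship between race and SAT scores. There was a large difference in correlations, and plotting it out will help us understand what is happening:
 In [92]:
 ```
@@ -1240,7 +1244,7 @@ Out[92]:
 ![]()
-It looks like the higher percentages of white and asian students correlate with higher SAT scores, but higher percentages of black and hispanic students correlate with lower SAT scores. For hispanic students, this may be due to the fact that there are more recent immigrants who are ELL learners. We can map the hispanic percentage by district to eyeball the correlation:
+It looks like higher percentages of white and Asian students correlate with higher SAT scores, while higher percentages of black and Hispanic students correlate with lower SAT scores. For Hispanic students, this may be because more of them are recent immigrants who are English language learners. We can map the Hispanic percentage by district to eyeball the correlation:
 In [93]:
 ```
@@ -1252,11 +1256,11 @@ Out[93]:
 ![](https://www.dataquest.io/blog/images/storytelling/district_hispanic.png)
-It looks like there is some correlation with ELL percentage, but it will be necessary to do some more digging into this and other racial differences in SAT scores.
+It looks like there is some correlation with the ELL percentage, but more digging is needed into this and the other racial differences in SAT scores.
-### Gender differences in SAT scores
+### Differences in SAT scores by gender
-The final angle to explore is the relationship between gender and SAT score. We noted that a higher percentage of females in a school tends to correlate with higher SAT scores. We can visualize this with a bar graph:
+The final angle to explore is the relationship between gender and SAT score. We noted that a higher percentage of female students in a school tends to correlate with higher SAT scores. We can visualize this with a bar chart:
 In [94]:
 ```
@@ -1271,7 +1275,7 @@ Out[94]:
 ![]()
-To dig more into the correlation, we can make a scatterplot of `female_per` and `sat_score`:
+To dig further into the correlation, we can make a scatter plot of `female_per` and `sat_score`:
 In [95]:
 ```
@@ -1286,7 +1290,7 @@ Out[95]:
 ![]()
-It looks like there’s a cluster of schools with a high percentage of females, and very high SAT scores (in the top right). We can get the names of the schools in this cluster:
+It looks like there is a cluster of schools with a high percentage of female students and very high SAT scores (in the top right). We can get the names of the schools in this cluster:
 In [96]:
 ```
@@ -1306,11 +1310,11 @@ Out[96]:
 Name: School Name, dtype: object
 ```
-Searching Google reveals that these are elite schools that focus on the performing arts. These schools tend to have higher percentages of females, and higher SAT scores. This likely accounts for the correlation between higher female percentages and SAT scores, and the inverse correlation between higher male percentages and lower SAT scores.
+Searching Google reveals that these are elite schools focused on the performing arts. These schools tend to have higher percentages of female students and higher SAT scores. This likely explains the correlation between higher female percentages and SAT scores, and the inverse correlation between higher male percentages and SAT scores.
-### AP scores
+### AP exam scores
-So far, we’ve looked at demographic angles. One angle that we have the data to look at is the relationship between more students taking Advanced Placement exams and higher SAT scores. It makes sense that they would be correlated, since students who are high academic achievers tend to do better on the SAT.
+So far, we have looked at demographic angles. Another angle the data lets us examine is the relationship between the share of students taking Advanced Placement exams and SAT scores. It makes sense that the two would be correlated, since high academic achievers tend to do better on the SAT.
 In [98]:
 ```
@@ -1327,7 +1331,7 @@ Out[98]:
 ![]()
-It looks like there is indeed a strong correlation between the two. An interesting cluster of schools is the one at the top right, which has high SAT scores and a high proportion of students that take the AP exams:
+It looks like there is indeed a strong correlation between the two. An interesting cluster of schools sits at the top right, with high SAT scores and a high proportion of students who take AP exams:
 In [99]:
 ```
@@ -1348,24 +1352,24 @@ Out[99]:
 Name: School Name, dtype: object
 ```
-Some Google searching reveals that these are mostly highly selective schools where you need to take a test to get in. It makes sense that these schools would have high proportions of AP test takers.
+Some Google searching reveals that these are mostly highly selective schools that require an entrance test. It makes sense that these schools would have a high proportion of AP test takers.
-### Wrapping up the story
+### Concluding the story
-With data science, the story is never truly finished. By releasing analysis to others, you enable them to extend and shape your analysis in whatever direction interests them. For example, in this post, there are quite a few angles that we explored inmcompletely, and could have dived into more.
+In data science, the story is never truly finished. By releasing your analysis to others, you let them extend and shape it in whatever direction interests them. In this post, for example, there are quite a few angles that we explored only partially and could have dug into further.
-One of the best ways to get started with telling stories using data is to try to extend or replicate the analysis someone else has done. If you decide to take this route, you’re welcome to extend the analysis in this post and see what you can find. If you do this, make sure to comment below so I can take a look.
+One of the best ways to get started telling stories with data is to extend or replicate an analysis someone else has done. If you decide to take this route, you are welcome to extend the analysis in this post and see what you can find. If you do, please comment below so I can take a look.
-### Next steps
+### What to do next
-If you’ve made it this far, you hopefully have a good understanding of how to tell a story with data, and how to build your first data science portfolio piece. Once you’re done with your data science project, it’s a good idea to post it on [Github][21] so others can collaborate with you on it.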
+If you’ve made it this far, you hopefully have a good understanding of how to tell a story with data and how to build your first data science portfolio piece. Once you have finished your data science project, it is a good idea to post it on [Github][21] so that others can collaborate with you on it.
-_If you liked this, you might like to read the other posts in our ‘Build a Data Science Portfolio’ series:_
+_If you liked this post, you might like to read the other posts in our ‘Build a Data Science Portfolio’ series:_
-* _[How to setup up a data science blog][4]._
-* _[Building a machine learning project][3]._
-* _[The key to building a data science portfolio that will get you a job][2]._
-* _[17 places to find datasets for data science projects][1]_
+* _[How to set up a data science blog][4]._
+* _[Building a machine learning project][3]._
+* _[The key to building a data science portfolio that will get you a job][2]._
+* _[17 places to find datasets for data science projects][1]._
 --------------------------------------------------------------------------------
@@ -1374,7 +1378,7 @@ via: https://www.dataquest.io/blog/data-science-portfolio-project/
 Author: [Vik Paruchuri ][a]
-译者:[译者ID](https://github.com/译者ID)
+Translator: [Yoo-4x]
 Proofreader: [校对者ID](https://github.com/校对者ID)