Merge pull request #28868 from Chao-zhi/patch-1

This article was quite difficult for me, so the translation may be inaccurate…
Xingyu.Wang 2023-03-15 09:48:25 +08:00 committed by GitHub
commit f19907a16f
2 changed files with 201 additions and 201 deletions

[#]: subject: "A data scientist's guide to open source community analysis"
[#]: via: "https://opensource.com/article/22/12/data-scientists-guide-open-source-community-analysis"
[#]: author: "Cali Dolfi https://opensource.com/users/cdolfi"
[#]: collector: "lkxed"
[#]: translator: "Chao-zhi"
[#]: reviewer: " "
[#]: publisher: " "
[#]: url: " "
A data scientist's guide to open source community analysis
======
In the golden age of data analysis, open source communities are not exempt from the frenzy around getting some big, fancy numbers onto presentation slides. Such information can bring even more value if you master the art of generating a well-analyzed question with proper execution.
You might expect me, a [data scientist][1], to tell you that data analysis and automation will inform your community decisions. It's actually the opposite. Use data analysis to build on your existing open source community knowledge, incorporate others, and uncover potential biases and perspectives not considered. You might be an expert at implementing community events, while your colleague is a wiz at all things code. As each of you develops visualizations within the context of your own knowledge, you both can benefit from that information.
Let's have a moment of realness. Everyone has a thousand and one things to keep up with, and it feels like there is never enough time in a day to do so. If getting an answer about your community takes hours, you won't do it regularly (or ever). Spending the time to create a fully developed visualization makes it feasible to keep up with different aspects of the communities you care about.
With the ever-increasing pressure of being "data-driven," the treasure trove of information around open source communities can be a blessing and a curse. Using the methodology below, I will show you how to pick the needle out of the data haystack.
### What is your perspective?
When thinking about a metric, one of the first things you must consider is the perspective you want to provide. The following are a few concepts you could establish.
**Informative vs. influencing action:** Is there an area of your community that is not understood? Are you taking that first step in getting there? Are you trying to decide on a particular direction? Are you measuring an existing initiative?
**Exposing areas of improvement vs. highlighting strengths:** There are times when you are trying to hype up your community and show how great it is, especially when trying to demonstrate business impact or advocate for your project. When it comes to informing yourself and the community, you can often get the most value from your metrics by identifying shortcomings. Highlighting strengths is not a bad practice, but there is a time and place. Don't use metrics as a cheerleader inside your community to tell you how great everyone is; instead, share that with outsiders for recognition or promotion.
**Community and business impact:** Numbers and data are the languages of many businesses. That can make it incredibly difficult to advocate for your community and truly show its value. Data can be a way to speak in their language and show what they want to see to get the rest of your messaging across. Another perspective is the impact on open source overall. How does your community impact others and the ecosystem?
These are not always either/or perspectives. Proper framing will help in creating a more deliberate metric.
![Data science and machine learning workflow][2]
People often describe some version of this workflow when talking about general data science or machine learning work. I will focus on the first step, codifying problems and metrics, and briefly mention the second. From a data science perspective, this article can be considered a case study of that step. The step is sometimes overlooked, but the real value of your analysis starts here. You don't just wake up one day knowing exactly what to look at. Begin by understanding what you want to know and what data you have; that is what gets you to the true goal of thoughtful, well-executed data analysis.
### 3 data analysis use cases in open source
Here are three different scenarios you might run into in your open source data analysis journey.
#### Scenario 1: Current data analysis
Suppose you are starting to go down the analysis path, and you already know what you're looking into is generally useful to you/your community. How can you improve? The idea here is to build off "traditional" open source community analysis. Suppose your data indicates you have had 120 total contributors over the project's lifetime. That's a value you can put on a slide, but you can't make decisions from it. Start taking incremental steps from just having a number to having insights. For example, you can break out the sample of total contributors into active versus drifting contributors (contributors who have not contributed in a set amount of time) from the same data.
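As a sketch of that incremental step, the snippet below splits contributors into active and drifting groups based on their last-contribution timestamp. The names, dates, and the 180-day drift window are all hypothetical placeholders; in practice the input would come from your project's contribution data.

```python
from datetime import datetime, timedelta

# Hypothetical sample: each contributor's most recent contribution date.
last_contribution = {
    "alice": datetime(2022, 11, 3),
    "bob": datetime(2021, 2, 14),
    "carol": datetime(2022, 9, 28),
}

def split_active_drifting(last_contribution, as_of, drift_after_days=180):
    """Split contributors by whether they contributed within the window."""
    cutoff = as_of - timedelta(days=drift_after_days)
    active = sorted(c for c, d in last_contribution.items() if d >= cutoff)
    drifting = sorted(c for c, d in last_contribution.items() if d < cutoff)
    return active, drifting

active, drifting = split_active_drifting(
    last_contribution, as_of=datetime(2022, 12, 1)
)
```

The drift window is a judgment call for your community; six months is only one reasonable default.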
#### Scenario 2: Community campaign impact measurement
![Goals and impacts][3]
Consider meetups, conferences, or any community outreach initiative. How do you view your impacts and goals? These two steps actually feed into each other. Once you establish the campaign goals, determine what can be measured to detect the effect. That information helps set the campaign's goals. It's easy to fall into the trap of being vague rather than concrete with plans when a campaign begins.
#### Scenario 3: Form new analysis areas to impact
![New analysis areas][4]
This situation occurs when you work from scratch in data analysis. The previous examples are different parts of this workflow. The workflow is a living cycle; you can always make improvements or extensions. From this concept, the following are the necessary steps you should work through. Later in this article, there will be three different examples of how this approach works in the real world.
#### Step 1: Break down focus areas and perspectives
First, consider a magic eight ball—the toy you can ask anything, shake, and get an answer. Think about your analysis area. If you could get any answer immediately, what would it be?
Next, think about the data. From your magic eight-ball question, what data sources could have anything to do with the question or focus area?
What questions could be answered in the data context to move you closer to your proposed magic eight-ball question? It's important to note that you must consider the assumptions made if you try to bring all the data together.
#### Step 2: Convert a question to a metric
Here is the process for each sub-question from the first step:
- Select the specific data points needed.
- Determine the visualization that delivers the goal of the analysis.
- Hypothesize the impacts of this information.
Next, bring in the community to provide feedback and trigger an iterative development process. The collaborative portion of this can be where the real magic happens. The best ideas often come when bringing a concept to someone that inspires them in a way you or they would not have imagined.
#### Step 3: Analysis in action
This step is where you start working through the implications of the metric or visualization you have created.
The first thing to consider is if this metric follows what is currently known about the community.
- If **yes**: Did any of your assumptions steer the results toward what you expected?
- If **no**: You want to investigate further whether this is potentially a data or calculation issue or if it is just a previously misunderstood part of the community.
Once you have determined that your analysis is stable enough to make inferences from, you can start to implement community initiatives based on the information. As you take in the analysis to determine the next best step, identify specific ways to measure each initiative's success.
Now, observe these community initiatives informed by your metric. Determine if the impact is observable by your priorly established measurement of success. If not, consider the following:
- Are you measuring the right thing?
- Does the initiative strategy need to change?
### Example analysis area: New contributors
#### What is my magic eight-ball question?
- Do people have an experience that establishes them as consistent contributors?
#### What data do I have that goes into the analysis area and magic eight-ball question?
- What contributor activity exists for repos, including timestamps?
Now that you have the information and a magic eight-ball question, break the analysis down into subparts and follow each of them to the end. This idea correlates with steps 2 and 3 above.
**Sub-question 1:** "How are people coming into this project?"
This question aims to see what new contributors are doing first.
**Data:** GitHub data on first contributions over time (issues, PRs, comments, etc.).
![Chart of first time contributions per quarter][5]
**Visualization:** Bar chart with first-time contributions broken down by quarter.
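The quarterly counts behind a chart like this can be derived from a raw event log in a few lines. A minimal sketch, with made-up event tuples and an assumed `(contributor, timestamp)` layout:

```python
from collections import Counter
from datetime import datetime

# Hypothetical event log: one entry per issue, PR, or comment.
events = [
    ("alice", datetime(2022, 1, 10)),
    ("alice", datetime(2022, 4, 2)),
    ("bob",   datetime(2022, 2, 5)),
    ("carol", datetime(2022, 5, 20)),
    ("bob",   datetime(2022, 6, 1)),
]

def first_contributions_per_quarter(events):
    # The earliest event per contributor is their first contribution.
    first = {}
    for who, ts in sorted(events, key=lambda e: e[1]):
        first.setdefault(who, ts)
    # Bucket those first contributions into year-quarters.
    return Counter(f"{ts.year}Q{(ts.month - 1) // 3 + 1}"
                   for ts in first.values())

counts = first_contributions_per_quarter(events)
```

Feeding `counts` into any plotting library gives the bar chart described above.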
**Potential extension:** After you talk with other community members, further examination breaks the information down by quarter and whether the contributor was a repeat or drive-by. You can see what people are doing when they come in and if that tells you anything about whether they will stick around.
![Chart of drive-by contributions per quarter][6]
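One way to sketch that breakdown: from the same kind of event log, tag each first-time contributor as repeat or drive-by depending on whether they ever contributed again. The sample data and one-contribution definition of "drive-by" are assumptions for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical event log: one entry per issue, PR, or comment.
events = [
    ("alice", datetime(2022, 1, 10)),
    ("alice", datetime(2022, 4, 2)),
    ("bob",   datetime(2022, 2, 5)),
    ("carol", datetime(2022, 5, 20)),
    ("bob",   datetime(2022, 6, 1)),
]

def driveby_vs_repeat(events):
    # Total events per contributor: one event ever means "drive-by".
    totals = Counter(who for who, _ in events)
    # Earliest event per contributor marks their first quarter.
    first = {}
    for who, ts in sorted(events, key=lambda e: e[1]):
        first.setdefault(who, ts)
    buckets = defaultdict(Counter)
    for who, ts in first.items():
        quarter = f"{ts.year}Q{(ts.month - 1) // 3 + 1}"
        kind = "repeat" if totals[who] > 1 else "drive-by"
        buckets[quarter][kind] += 1
    return {q: dict(c) for q, c in sorted(buckets.items())}

breakdown = driveby_vs_repeat(events)
```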
**Potential actions informed by this information:**
- Does the current documentation support contributors for the most common initial contribution? Could you support those contributors better, and would that help more of them stay?
- Is there a contribution area that is not common overall but is a good sign of a repeat contributor? Perhaps PRs are a common area for repeat contributors, but most people don't work in that area.
**Action items:**
- Label "good first issues" consistently and link these issues to the contribution docs.
- Add a PR buddy to these.
**Sub-question 2:** "Is our code base really dependent on drive-by contributors?"
**Data:** Contribution data from GitHub.
![Chart of contributor types over time][7]
**Visualization:** Total contributions, broken down into drive-by and repeat contributors.
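A rough sketch of how such a breakdown might be computed: count every contribution per quarter and label it by whether its author is a one-time or repeat contributor overall. The sample events are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical event log: one entry per contribution.
events = [
    ("alice", datetime(2022, 1, 10)),
    ("dave",  datetime(2022, 1, 15)),
    ("alice", datetime(2022, 4, 2)),
    ("erin",  datetime(2022, 4, 9)),
]

def contributions_by_type_per_quarter(events):
    # Contributors with a single event ever count as drive-by.
    totals = Counter(who for who, _ in events)
    out = defaultdict(Counter)
    for who, ts in events:
        quarter = f"{ts.year}Q{(ts.month - 1) // 3 + 1}"
        kind = "repeat" if totals[who] > 1 else "drive-by"
        out[quarter][kind] += 1
    return {q: dict(c) for q, c in sorted(out.items())}

per_quarter = contributions_by_type_per_quarter(events)
```

Comparing the two series per quarter shows how much of the total work rests on drive-by contributors.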
**Potential actions informed by this information:**
- Does this ratio achieve the program's goals? Is a lot of the work done by drive-by contributors? Is this an underutilized resource, and is the project not doing its part to bring them in?
### Analysis: Lessons learned
Numbers and data analysis are not "facts." They can be made to say anything, and your internal skeptic should be very active when working with data. The iterative process is what brings value. You don't want your analysis to be a "yes man." Take a step back and evaluate the assumptions you've made.
If a metric just points you in a direction to investigate, that is a huge win. You can't look at or think of everything. Rabbit holes can be a good thing, and conversation starters can bring you to a new place.
Sometimes exactly what you want to measure is not there, but you might be able to get valuable details. You can't assume that you have all the puzzle pieces to get an exact answer to your original question. If you start to force an answer or solution, you can take yourself down a dangerous path led by assumptions. Leaving room for the direction or goal of analysis to change can lead you to a better place or insight than your original idea.
Data is a tool. It is not the answer, but it can bring together insights and information that would not have been accessible otherwise. The methodology of breaking down what you want to know into manageable chunks and building on that is the most important part.
Open source data analysis is a great example of the care you must take with all data science:
- The nuance of the topic area is the most important.
- The process of working through "what to ask/answer" is often overlooked.
- Knowing what to ask can be the hardest part, and when you come up with something insightful and innovative, it's much more than whatever tool you choose.
If you are a community member with no data science experience looking at where to start, I hope this information shows you how important and valuable you can be to this process. You bring the insights and perspectives of the community. If you are a data scientist or someone implementing the metrics or visualizations, you have to listen to the voices around you, even if you are also an active community member. More information on data science is listed at the end of this article.
### Wrap up
Use the above example as a framework for establishing data analysis of your own open source project. There are many questions to ask of your results, and knowing both the questions and their answers can lead your project in an exciting and fruitful direction.
#### More on data science
Consider the following sources for more information on data science and the technologies that provide it with data:
- [What is data science?][8]
- [What is Python?][9]
- [How to become a data scientist][10]
- [Data scientist: A day in the life][11]
- [What is big data?][12]
- [Whitepaper: Data-intensive intelligent applications in a hybrid cloud blueprint][13]
- [MariaDB and MySQL cheat sheet][14]
- [Latest data science articles][15]
--------------------------------------------------------------------------------
via: https://opensource.com/article/22/12/data-scientists-guide-open-source-community-analysis
Author: [Cali Dolfi][a]
Topic selection: [lkxed][b]
Translator: [Chao-zhi](https://github.com/Chao-zhi)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article was originally translated by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
[a]: https://opensource.com/users/cdolfi
[b]: https://github.com/lkxed
[1]: https://enterprisersproject.com/article/2022/9/data-scientist-day-life?intcmp=7013a000002qLH8AAM
[2]: https://opensource.com/sites/default/files/2022-11/datascience-machinelearning-workflow.jpg
[3]: https://opensource.com/sites/default/files/2022-11/goals-impact.png
[4]: https://opensource.com/sites/default/files/2022-11/new-analysis-areas.png
[5]: https://opensource.com/sites/default/files/2022-11/first-time-contributions-per-quarter.png
[6]: https://opensource.com/sites/default/files/2022-11/driveby-contributions-per-quarter.png
[7]: https://opensource.com/sites/default/files/2022-11/contributor-types-over-time.png
[8]: https://opensource.com/resources/data-science?intcmp=7013a000002CxqkAAC
[9]: https://opensource.com/resources/python?intcmp=7013a000002CxqkAAC
[10]: https://opensource.com/article/17/9/data-scientist?intcmp=7013a000002CxqkAAC
[11]: https://enterprisersproject.com/article/2022/9/data-scientist-day-life?intcmp=7013a000002CxqkAAC
[12]: https://opensource.com/resources/big-data?intcmp=7013a000002CxqkAAC
[13]: https://www.redhat.com/en/resources/data-intensive-applications-hybrid-cloud-blueprint-detail?intcmp=7013a000002CxqkAAC
[14]: https://opensource.com/downloads/mariadb-mysql-cheat-sheet?intcmp=7013a000002CxqkAAC
[15]: https://opensource.com/tags/data-science?intcmp=7013a000002CxqkAAC
