2019-05-17 10:36:11 +08:00


Querying 10 years of GitHub data with GHTorrent and Libraries.io

There is a way to explore GitHub data without any local infrastructure using open source datasets.

I'm always on the lookout for new datasets that we can use to show off the power of my team's work. CHAOS SEARCH turns your Amazon S3 object storage data into a fully searchable Elasticsearch-like cluster. With the Elasticsearch API or tools like Kibana, you can then query whatever data you find.

I was excited when I found the GHTorrent project to explore. GHTorrent aims to build an offline version of all data available through the GitHub APIs. If datasets are your thing, this is a project worth checking out, or even supporting by donating one of your GitHub API keys.

Accessing GHTorrent data

There are many ways to gain access to and use GHTorrent's data, which is available in NDJSON format. This project does a great job making the data available in multiple forms, including CSV for restoring into a MySQL database, MongoDB dumps of all objects, and Google Big Query (free) for exporting data directly into Google's object storage. There is one caveat: the dataset is nearly complete from 2008 to 2017 but less complete from 2017 to today. That will limit our ability to query with certainty, but it is still an exciting amount of information.

I chose Google Big Query to avoid running any database myself, so I was quickly able to download a full corpus of data including users and projects. CHAOS SEARCH can natively analyze the NDJSON format, so after uploading the data to Amazon S3 I was able to index it in just a few minutes. The CHAOS SEARCH platform doesn't require users to set up index schemas or define mappings for their data, so it discovered all of the fields (strings, integers, and so on) itself.
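To make the format concrete: NDJSON is simply one JSON object per line, which is why field names and value types can be inferred without a predefined schema. The sketch below is a toy Python illustration of that idea, not a description of how CHAOS SEARCH works internally; the sample records and the `infer_fields` helper are made up for the example.

```python
import json

# Hypothetical sample: two NDJSON records, one JSON object per line.
SAMPLE_NDJSON = """\
{"id": 1, "name": "example-repo", "language": "Go", "forked": false}
{"id": 2, "name": "another-repo", "language": null, "stars": 12}
"""


def infer_fields(ndjson_text):
    """Map each field name to the set of Python type names seen for it."""
    fields = {}
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        for key, value in record.items():
            fields.setdefault(key, set()).add(type(value).__name__)
    return fields


if __name__ == "__main__":
    for name, types in sorted(infer_fields(SAMPLE_NDJSON).items()):
        print(name, sorted(types))
```

Fields that are missing or null in some records simply show up with fewer or additional type names, which is the kind of irregularity a schema-free indexer has to cope with.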

With my data fully indexed and ready for search and aggregation, I wanted to dive in and see what insights I could find, like which software languages are the most popular for GitHub projects.

(A note on formatting: this is a valid JSON query that we won't format correctly here to avoid scroll fatigue. To properly format it, you can copy it locally and send to a command-line utility like jq.)
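If jq is not at hand, Python's standard `json` module can do the same expansion. The sketch below uses a shortened stand-in query rather than one of the full queries from this article.

```python
import json

# A shortened stand-in for the single-line queries in this article.
compact = '{"aggs":{"2":{"terms":{"field":"Repository License","size":10}}},"size":0}'


def pretty(query_text):
    """Parse a compact JSON string and return it indented for reading."""
    return json.dumps(json.loads(query_text), indent=2)


if __name__ == "__main__":
    print(pretty(compact))
```

From a shell, `python -m json.tool query.json` gives a similar result for a query saved to a file (the file name here is just an example).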

`{"aggs":{"2":{"date_histogram":{"field":"root.created_at","interval":"1M","time_zone":"America/New_York","min_doc_count":1}}},"size":0,"_source":{"excludes":[]},"stored_fields":["*"],"script_fields":{},"docvalue_fields":["root.created_at","root.updated_at"],"query":{"bool":{"must":[],"filter":[{"match_all":{}}],"should":[],"must_not":[{"match_phrase":{"root.language":{"query":""}}}]}}}`

This result is of little surprise to anyone who's followed the state of open source languages over recent years.

Which software languages are the most popular on GitHub.

JavaScript is still the reigning champion, and while some believe JavaScript is on its way out, it remains the 800-pound gorilla and is likely to stay that way for some time. Java faces similar rumors, yet this data shows that it remains a major part of the open source ecosystem.

Given the popularity of projects like Docker and Kubernetes, you might be wondering, “What about Go (Golang)?” This is a good time for a reminder that the GitHub dataset discussed here contains some gaps, most significantly after 2017, which is about when I saw Golang projects popping up everywhere. I hope to repeat this search with a complete GitHub dataset and see if it changes the rankings at all.
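A ranking of this kind is usually expressed as a terms aggregation, the same shape the license queries later in this article use. The sketch below is an assumption about how such a query could be built, not the exact query behind the chart; it assumes the language lives in `root.language`, the field the queries above filter on, and `top_terms_query` is a name invented for the example.

```python
import json


def top_terms_query(field, size=10):
    """Build an Elasticsearch-style terms aggregation over one field.

    Documents whose field is an empty string are excluded, mirroring the
    must_not clause used in the article's queries.
    """
    return {
        "size": 0,  # return only aggregation buckets, no individual hits
        "aggs": {
            "2": {
                "terms": {
                    "field": field,
                    "size": size,
                    "order": {"_count": "desc"},
                }
            }
        },
        "query": {
            "bool": {
                "filter": [{"match_all": {}}],
                "must_not": [{"match_phrase": {field: {"query": ""}}}],
            }
        },
    }


if __name__ == "__main__":
    # "root.language" is assumed from the filter clause in the queries above.
    print(json.dumps(top_terms_query("root.language")))
```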

Now let's explore the rate of project creation. (Reminder: this is valid JSON consolidated for readability.)

`{"aggs":{"2":{"date_histogram":{"field":"root.created_at","interval":"1M","time_zone":"America/New_York","min_doc_count":1}}},"size":0,"_source":{"excludes":[]},"stored_fields":["*"],"script_fields":{},"docvalue_fields":["root.created_at","root.updated_at"],"query":{"bool":{"must":[],"filter":[{"match_all":{}}],"should":[],"must_not":[{"match_phrase":{"root.language":{"query":""}}}]}}}`

The rate at which new projects are created is impressive as well, with tremendous growth starting around 2012:

The rate at which new projects are created on GitHub.
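Conceptually, a `date_histogram` aggregation with a one-month interval groups documents by the month of `root.created_at` and counts each group. Here is a minimal local sketch of the same bucketing, using made-up timestamps and ignoring the time zone handling that the real query requests:

```python
from collections import Counter
from datetime import datetime

# Made-up creation timestamps standing in for root.created_at values.
SAMPLE_CREATED_AT = [
    "2012-01-15T10:00:00",
    "2012-01-20T18:30:00",
    "2012-02-03T09:12:00",
    "2013-02-03T09:12:00",
]


def monthly_counts(timestamps):
    """Count timestamps per calendar month, keyed as 'YYYY-MM'."""
    buckets = Counter()
    for text in timestamps:
        created = datetime.fromisoformat(text)
        buckets[created.strftime("%Y-%m")] += 1
    # Sort by month so the result reads as a time series.
    return dict(sorted(buckets.items()))


if __name__ == "__main__":
    print(monthly_counts(SAMPLE_CREATED_AT))
```

Months with no projects are simply absent from the result, which matches the effect of `"min_doc_count": 1` in the query.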

Now that I knew the rate of projects created as well as the most popular languages used to create these projects, I wanted to find out what open source licenses these projects chose. Unfortunately, this data doesn't exist in the GitHub projects dataset, but the fantastic team over at Tidelift publishes a detailed list of GitHub projects, licenses used, and other details regarding the state of open source software in their Libraries.io data. Ingesting this dataset into CHAOS SEARCH took just minutes, letting me see which open source software licenses are the most popular on GitHub:

(Reminder: this is valid JSON consolidated for readability.)

`{"aggs":{"2":{"terms":{"field":"Repository License","size":10,"order":{"_count":"desc"}}}},"size":0,"_source":{"excludes":[]},"stored_fields":["*"],"script_fields":{},"docvalue_fields":["Created Timestamp","Last synced Timestamp","Latest Release Publish Timestamp","Updated Timestamp"],"query":{"bool":{"must":[],"filter":[{"match_all":{}}],"should":[],"must_not":[{"match_phrase":{"Repository License":{"query":""}}}]}}}`

The results show some significant outliers:

Which open source software licenses are the most popular on GitHub.

As you can see, the MIT and Apache 2.0 licenses by far outweigh most of the other open source licenses used for these projects, while various BSD and GPL licenses follow far behind. I can't say that I'm surprised by these results given GitHub's open model. I would guess that users, not companies, create most projects and that they use the MIT license to make it simple for other people to use, share, and contribute. That Apache 2.0 licensing is right behind also makes sense, given just how many companies want to ensure their trademarks are respected and have an open source component to their businesses.

Now that I had identified the most popular licenses, I was curious to see the least used ones. By adjusting my last query, I reversed the top 10 into the bottom 10 and was able to find just two projects using the University of Illinois/NCSA Open Source License. I had never heard of this license before, but it's pretty close to Apache 2.0. It's interesting to see just how many different software licenses are in use across all GitHub projects.

The University of Illinois/NCSA open source license.
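The adjustment described above, turning the top 10 into the bottom 10, presumably amounts to flipping the sort order of the terms aggregation. Here is a sketch of that change, assuming the only difference is `_count` sorted ascending instead of descending:

```python
import copy
import json

# The license query from this article, trimmed to the parts that matter here.
TOP_LICENSES = json.loads(
    '{"aggs":{"2":{"terms":{"field":"Repository License","size":10,'
    '"order":{"_count":"desc"}}}},"size":0}'
)


def reversed_order(query, agg_name="2"):
    """Return a copy of the query with its terms aggregation sorted the other way."""
    flipped = copy.deepcopy(query)  # leave the caller's query untouched
    order = flipped["aggs"][agg_name]["terms"]["order"]
    order["_count"] = "asc" if order["_count"] == "desc" else "desc"
    return flipped


if __name__ == "__main__":
    print(json.dumps(reversed_order(TOP_LICENSES)))
```

In stock Elasticsearch, ordering terms by ascending `_count` is documented as potentially inaccurate on sharded indices; whether that caveat applies to the platform used here is not something this article states.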


After that, I dove into a specific language (JavaScript) to see the most popular license used there. (Reminder: this is valid JSON consolidated for readability.)

`{"aggs":{"2":{"terms":{"field":"Repository License","size":10,"order":{"_count":"desc"}}}},"size":0,"_source":{"excludes":[]},"stored_fields":["*"],"script_fields":{},"docvalue_fields":["Created Timestamp","Last synced Timestamp","Latest Release Publish Timestamp","Updated Timestamp"],"query":{"bool":{"must":[{"match_phrase":{"Repository Language":{"query":"JavaScript"}}}],"filter":[{"match_all":{}}],"should":[],"must_not":[{"match_phrase":{"Repository License":{"query":""}}}]}}}`

There were some surprises in this output.

The most popular open source licenses used for GitHub JavaScript projects.

Even though the default license for NPM modules when created with `npm init` is the one from Internet Systems Consortium (ISC), you can see that a considerable number of these projects use MIT as well as Apache 2.0 for their open source license.

Since the Libraries.io dataset is rich in open source project content, and since the GHTorrent data is missing the last few years' data (and thus missing any details about Golang projects), I decided to run a similar query to see how Golang projects license their code.

(Reminder: this is valid JSON consolidated for readability.)

`{"aggs":{"2":{"terms":{"field":"Repository License","size":10,"order":{"_count":"desc"}}}},"size":0,"_source":{"excludes":[]},"stored_fields":["*"],"script_fields":{},"docvalue_fields":["Created Timestamp","Last synced Timestamp","Latest Release Publish Timestamp","Updated Timestamp"],"query":{"bool":{"must":[{"match_phrase":{"Repository Language":{"query":"Go"}}}],"filter":[{"match_all":{}}],"should":[],"must_not":[{"match_phrase":{"Repository License":{"query":""}}}]}}}`

The results were quite different from those for JavaScript.

How Golang projects license their GitHub code.

Golang offers a stunning reversal from JavaScript: nearly three times as many Golang projects are licensed with Apache 2.0 as with MIT. While it's hard to explain precisely why this is the case, over the last few years there's been massive growth in Golang, especially among companies building projects and software offerings, both open source and commercial.

As we learned above, many of these companies want to enforce their trademarks, so the move to the Apache 2.0 license makes sense.
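The JavaScript and Go queries above differ only in the phrase matched against `Repository Language`. A small helper makes that explicit. `license_query_for` is a name invented for this sketch, and it omits the `_source`, `stored_fields`, `script_fields`, and `docvalue_fields` settings on the assumption that they do not affect the aggregation when `size` is 0.

```python
import json


def license_query_for(language, size=10):
    """Top licenses among repositories whose language matches the given phrase."""
    return {
        "size": 0,
        "aggs": {
            "2": {
                "terms": {
                    "field": "Repository License",
                    "size": size,
                    "order": {"_count": "desc"},
                }
            }
        },
        "query": {
            "bool": {
                # Restrict to one language...
                "must": [
                    {"match_phrase": {"Repository Language": {"query": language}}}
                ],
                "filter": [{"match_all": {}}],
                # ...and skip repositories with no license recorded.
                "must_not": [
                    {"match_phrase": {"Repository License": {"query": ""}}}
                ],
            }
        },
    }


if __name__ == "__main__":
    for lang in ("JavaScript", "Go"):
        print(json.dumps(license_query_for(lang)))
```

Building both queries from one function keeps the comparison honest: any difference in the results comes from the language filter alone.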

Conclusion

In the end, I found some interesting results by diving into the GitHub users and projects data dump. Some of these I definitely would have guessed, but a few results were surprises to me as well, especially the outliers like the rarely used NCSA license.

All in all, you can see how quickly and easily the CHAOS SEARCH platform lets us find complicated answers to interesting questions. I dove into this dataset and received deep analytics without having to run any databases myself, and even stored the data inexpensively on Amazon S3, so there's little maintenance involved. Now I can ask any other questions regarding the data anytime I want.

What other questions are you asking your data, and what data sets do you use? Let me know in the comments or on Twitter @petecheslock.

A version of this article was originally posted on CHAOS SEARCH.



via: https://opensource.com/article/19/5/chaossearch-github-ghtorrent

Author: Pete Cheslock  Topic selection: lujun9972  Translator: (translator ID)  Proofreader: (proofreader ID)

This article was originally compiled by LCTT and is proudly presented by Linux China.