Merge pull request #9654 from geekpi/master

translated
geekpi 2018-08-02 08:59:53 +08:00 committed by GitHub
commit 0dc8b14f18
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 40 additions and 42 deletions

@@ -1,42 +0,0 @@
translating---geekpi
Textricator: Data extraction made simple
======
![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/document_free_access_cut_security.png?itok=ocvCv8G2)
You probably know the feeling: You ask for data and get a positive response, only to open the email and find a whole bunch of PDFs attached. Data, interrupted.
We understand your frustration, and we've done something about it: Introducing [Textricator][1], our first open source product.
We're Measures for Justice, a criminal justice research and transparency organization. Our mission is to provide data transparency for the entire justice system, from arrest to post-conviction. We do this by producing a series of up to 32 performance measures covering the entire criminal justice system, county by county. We get our data in many ways (all legal, of course), and while many state and county agencies are data-savvy, giving us quality, formatted data in CSVs, the data is often bundled inside software with no simple way to get it out. PDF reports are the best they can offer.
Developers Joe Hale and Stephen Byrne have spent the past two years developing Textricator to extract tens of thousands of pages of data for our internal use. Textricator can process just about any text-based PDF format—not just tables, but complex reports with wrapping text and detail sections generated from tools like Crystal Reports. Simply tell Textricator the attributes of the fields you want to collect, and it chomps through the document, collecting and writing out your records.
Not a software engineer? Textricator doesn't require programming skills; rather, the user describes the structure of the PDF and Textricator handles the rest. Most users run it via the command line; however, a browser-based GUI is available.
We evaluated other great open source solutions like [Tabula][2], but they just couldn't handle the structure of some of the PDFs we needed to scrape. “Textricator is both flexible and powerful and has cut the time we spend to process large datasets from days to hours,” says Andrew Branch, director of technology.
At MFJ, we're committed to transparency and knowledge-sharing, which includes making our software available to anyone, especially those trying to free and share data publicly. Textricator is available on [GitHub][3] and released under [GNU Affero General Public License Version 3][4].
You can see the results of our work, including data processed via Textricator, on our free [online data portal][5]. Textricator is an essential part of our process and we hope civic tech and government organizations alike can unlock more data with this new tool.
If you use Textricator, let us know how it helped solve your data problem. Want to improve it? Submit a pull request.
--------------------------------------------------------------------------------
via: https://opensource.com/article/18/7/textricator
Author: [Stephen Byrne][a]
Topic selection: [lujun9972](https://github.com/lujun9972)
Translator: [Translator ID](https://github.com/译者ID)
Proofreader: [Proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China](https://linux.cn/)
[a]:
[1]:https://textricator.mfj.io/
[2]:https://tabula.technology/
[3]:https://github.com/measuresforjustice/textricator
[4]:https://www.gnu.org/licenses/agpl-3.0.en.html
[5]:https://www.measuresforjustice.org/portal/

@@ -0,0 +1,40 @@
Textricator: Data extraction made simple
======
![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/document_free_access_cut_security.png?itok=ocvCv8G2)
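You probably know the feeling: you ask for data and get a positive response, only to open the email and find a whole bunch of PDFs attached. Data, interrupted.
We understand your frustration, and we've done something about it: introducing [Textricator][1], our first open source product.
We're Measures for Justice, a criminal justice research and transparency organization. Our mission is to provide data transparency for the entire justice system, from arrest to post-conviction. We do this by producing a series of up to 32 performance measures covering the entire criminal justice system, county by county. We get our data in many ways (all legal, of course), and while many state and county agencies are data-savvy, giving us quality, formatted data in CSVs, the data is often bundled inside software with no simple way to get it out. PDF reports are the best they can offer.
Developers Joe Hale and Stephen Byrne have spent the past two years developing Textricator to extract tens of thousands of pages of data for our internal use. Textricator can process just about any text-based PDF format: not just tables, but complex reports with wrapping text and detail sections generated from tools like Crystal Reports. Simply tell Textricator the attributes of the fields you want to collect, and it works through the document, collecting and writing out your records.
Not a software engineer? Textricator doesn't require programming skills; rather, the user describes the structure of the PDF and Textricator handles the rest. Most users run it via the command line; however, a browser-based GUI is available.
We evaluated other great open source solutions like [Tabula][2], but they just couldn't handle the structure of some of the PDFs we needed to scrape. “Textricator is both flexible and powerful and has cut the time we spend to process large datasets from days to hours,” says Andrew Branch, director of technology.
At MFJ, we're committed to transparency and knowledge-sharing, which includes making our software available to anyone, especially those trying to free and share data publicly. Textricator is available on [GitHub][3] and released under [GNU Affero General Public License Version 3][4].
You can see the results of our work, including data processed via Textricator, on our free [online data portal][5]. Textricator is an essential part of our process, and we hope civic tech and government organizations alike can unlock more data with this new tool.
If you use Textricator, let us know how it helped solve your data problem. Want to improve it? Submit a pull request.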
--------------------------------------------------------------------------------
via: https://opensource.com/article/18/7/textricator
Author: [Stephen Byrne][a]
Topic selection: [lujun9972](https://github.com/lujun9972)
Translator: [geekpi](https://github.com/geekpi)
Proofreader: [Proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China](https://linux.cn/)
[a]:
[1]:https://textricator.mfj.io/
[2]:https://tabula.technology/
[3]:https://github.com/measuresforjustice/textricator
[4]:https://www.gnu.org/licenses/agpl-3.0.en.html
[5]:https://www.measuresforjustice.org/portal/