TranslateProject/translated/tech/20190322 12 open source tools for natural language processing.md
2020-02-11 12:58:47 +08:00

11 KiB
Raw Blame History

12种自然语言处理的开源工具

看看可以用在你自己NLP应用中的十几个工具吧。

Chat bubbles

Natural language processing (NLP), the technology that powers all the chatbots, voice assistants, predictive text, and other speech/text applications that permeate our lives, has evolved significantly in the last few years. There are a wide variety of open source NLP tools out there, so I decided to survey the landscape to help you plan your next voice- or text-based application.

For this review, I focused on tools that use languages I'm familiar with, even though I'm not familiar with all the tools. (I didn't find a great selection of tools in the languages I'm not familiar with anyway.) That said, I excluded tools in three languages I am familiar with, for various reasons.

The most obvious language I didn't include might be R, but most of the libraries I found hadn't been updated in over a year. That doesn't always mean they aren't being maintained well, but I think they should be getting updates more often to compete with other tools in the same space. I also chose languages and tools that are most likely to be used in production scenarios (rather than academia and research), and I have mostly used R as a research and discovery tool.

I was also surprised to see that the Scala libraries are fairly stagnant. It has been a couple of years since I last used Scala, when it was pretty popular. Most of the libraries haven't been updated since that time—or they've only had a few updates.

Finally, I excluded C++. This is mostly because it's been many years since I last wrote in C++, and the organizations I've worked in have not used C++ for NLP or any data science work.

Python tools

Python工具

Natural Language Toolkit (NLTK)

Natural Language Toolkit (NLTK)是我调研的所有工具中功能最完善的一个。它完美地实现了自然语言处理中多数功能组件,比如分类,令牌化,词干化,标注,分词和语义推理。每一种方法都有多种不同的实现方式,所以你可以选择具体的算法和方式去使用它。同时,它也支持不同语言。然而,它将所有的数据都表示为字符串的形式,对于一些简单的数据结构来说可能很方便,但是如果要使用一些高级的功能来说就可能有点困难。它的使用文档有点复杂,但也有很多其他人编写的使用文档,比如a great book。和其他的工具比起来,这个工具库的运行速度有点慢。但总的来说,这个工具包非常不错,可以用于需要具体算法组合的实验,探索和实际应用当中。

SpaCy

SpaCy是NLTK的主要竞争者。在大多数情况下都比NLTK的速度更快但是SpaCy对自然语言处理的功能组件只有单一实现。SpaCy把所有的东西都表示为一个对象而不是字符串这样就能够为构建应用简化接口。这也方便它能够集成多种框架和数据科学的工具使得你更容易理解你的文本数据。然而SpaCy不像NLTK那样支持多种语言。它对每个接口都有一些简单的选项和文档包括用于语言处理和分析各种组件的多种神经网络模型。总的来说如果创造一个新的应用的生产过程中不需要使用特定的算法的话这是一个很不错的工具。

TextBlob

TextBlob是NLTK的一个扩展库。你可以通过TextBlob用一种更简单的方式来使用NLTK的功能TextBlob也包括了Pattern库中的功能。如果你刚刚开始学习这将会是一个不错的工具可以用于生产对性能要求不太高的应用。TextBlob适用于任何场景但是对小型项目会更加合适。

Textacy

这个工具是我用过的名字最好听的。读"Textacy" 时先发出"ex"再发出"cy"。它不仅仅是名字好同时它本身也是一个很不错的工具。它使用SpaCy作为它自然语言处理核心功能但它在处理过程的前后做了很多工作。如果你想要使用SpaCy你可以先使用Textacy从而不用去多写额外的附加代码你就可以处理不同种类的数据。

PyTorch-NLP

PyTorch-NLP才出现短短的一年但它已经有一个庞大的社区了。它适用于快速原型开发。当公司或者研究人员推出很多其他工具去完成新奇的处理任务比如图像转换它就会被更新。PyTorch的目标用户是研究人员但它也能用于原型开发或在最开始的生产任务中使用最好的算法。基于此基础上的创建的库也是值得研究的。

Node tools

Retext

Retextunified collective的一部分。Unified是一个接口能够集成不同的工具和插件以便他们能够高效的工作。Retext是unified工具集三个中的一个另外的两个分别是用于markdown编辑的Remark和用于HTML处理的Rehype。这是一个非常有趣的想法我很高兴看到这个社区的发展。Retext没有暴露过多的底层技术更多的是使用插件去完成你在NLP任务中想要做的事情。拼写检查固定排版情绪检测和可读性分析都可以用简单的插件来完成。如果你不想了解底层处理技术又想完成你的任务的话这个工具和社区是一个不错的选择。

Compromise

如果你在找拥有最高级的功能和最复杂的系统的工具的话,Compromise不是你的选择。 然而如果你想要一个性能好应用广泛还能在客户端运行的工具的话Compromise值得一试。实际上它的名字是准确的因为作者更关注更具体功能的小软件包而在功能性和准确性上做出了牺牲这些功能得益于用户对使用环境的理解。

Natural

Natural包含了一般自然语言处理库所具有的大多数功能。它主要是处理英文文本,但也包括一些其他语言,它的社区也支持额外的语言。它能够进行令牌化,词干化,分类,语音处理,词频-逆文档频率计算(TF-IDF)WordNet字符相似度计算和一些变换。它和NLTK有的一比因为它想要把所有东西都包含在一个包里头使用方便但是可能不太适合专注的研究。总的来说这是一个不错的功能齐全的库目前仍在开发但可能需要对底层实现有更多的了解才能完更有效。

Nlp.js

Nlp.js是在其他几个NLP库上开发的包括Franc和Brain.js。 is built on top of several other NLP libraries, including Franc and Brain.js. It provides a nice interface into many components of NLP, like classification, sentiment analysis, stemming, named entity recognition, and natural language generation. It also supports quite a few languages, which is helpful if you plan to work in something other than English. Overall, this is a great general tool with a simplified interface into several other great tools. This will likely take you a long way in your applications before you need something more powerful or more flexible.

Java tools

OpenNLP

OpenNLP is hosted by the Apache Foundation, so it's easy to integrate it into other Apache projects, like Apache Flink, Apache NiFi, and Apache Spark. It is a general NLP tool that covers all the common processing components of NLP, and it can be used from the command line or within an application as a library. It also has wide support for multiple languages. Overall, OpenNLP is a powerful tool with a lot of features and ready for production workloads if you're using Java.

StanfordNLP

Stanford CoreNLP is a set of tools that provides statistical NLP, deep learning NLP, and rule-based NLP functionality. Many other programming language bindings have been created so this tool can be used outside of Java. It is a very powerful tool created by an elite research institution, but it may not be the best thing for production workloads. This tool is dual-licensed with a special license for commercial purposes. Overall, this is a great tool for research and experimentation, but it may incur additional costs in a production system. The Python implementation might also interest many readers more than the Java version. Also, one of the best Machine Learning courses is taught by a Stanford professor on Coursera. Check it out along with other great resources.

CogCompNLP

CogCompNLP, developed by the University of Illinois, also has a Python library with similar functionality. It can be used to process text, either locally or on remote systems, which can remove a tremendous burden from your local device. It provides processing functions such as tokenization, part-of-speech tagging, chunking, named-entity tagging, lemmatization, dependency and constituency parsing, and semantic role labeling. Overall, this is a great tool for research, and it has a lot of components that you can explore. I'm not sure it's great for production workloads, but it's worth trying if you plan to use Java.


What are your favorite open source tools and libraries for NLP? Please share in the comments—especially if there's one I didn't include.


via: https://opensource.com/article/19/3/natural-language-processing-tools

作者:Dan Barker (Community Moderator) 选题:lujun9972 译者:zxp 校对:校对者ID

本文由 LCTT 原创编译,Linux中国 荣誉推出