diff --git a/sources/tech/20200807 An advanced guide to NLP analysis with Python and NLTK.md b/sources/tech/20200807 An advanced guide to NLP analysis with Python and NLTK.md deleted file mode 100644 index 802560422a..0000000000 --- a/sources/tech/20200807 An advanced guide to NLP analysis with Python and NLTK.md +++ /dev/null @@ -1,542 +0,0 @@ -[#]: collector: (lujun9972) -[#]: translator: (tanloong) -[#]: reviewer: ( ) -[#]: publisher: ( ) -[#]: url: ( ) -[#]: subject: (An advanced guide to NLP analysis with Python and NLTK) -[#]: via: (https://opensource.com/article/20/8/nlp-python-nltk) -[#]: author: (Girish Managoli https://opensource.com/users/gammay) - -An advanced guide to NLP analysis with Python and NLTK -====== -Get deeper into the foundational concepts behind natural language -processing. -![Brain on a computer screen][1] - -In my [previous article][2], I introduced natural language processing (NLP) and the Natural Language Toolkit ([NLTK][3]), the NLP toolkit created at the University of Pennsylvania. I demonstrated how to parse text and define stopwords in Python and introduced the concept of a corpus, a dataset of text that aids in text processing with out-of-the-box data. In this article, I'll continue utilizing datasets to compare and analyze natural language. - -The fundamental building blocks covered in this article are: - - * WordNet and synsets - * Similarity comparison - * Tree and treebank - * Named entity recognition - - - -### WordNet and synsets - -[WordNet][4] is a large lexical database corpus in NLTK. WordNet maintains cognitive synonyms (commonly called synsets) of words correlated by nouns, verbs, adjectives, adverbs, synonyms, antonyms, and more. - -WordNet is a very useful tool for text analysis. It is available for many languages (Chinese, English, Japanese, Russian, Spanish, and more), under many licenses (ranging from open source to commercial). 
The first WordNet was created by Princeton University for English under an MIT-like license.
-
-A word is typically associated with multiple synsets based on its meanings and parts of speech. Each synset usually provides these attributes:
-
-**Attribute** | **Definition** | **Example**
----|---|---
-Name | Name of the synset | Example: The word "code" has five synsets with names `code.n.01`, `code.n.02`, `code.n.03`, `code.v.01`, `code.v.02`
-POS | Part of speech of the word for this synset | The word "code" has three synsets in noun form and two in verb form
-Definition | Definition of the word (in POS) | One of the definitions of "code" in noun form is: "(computer science) the symbolic arrangement of data or instructions in a computer program"
-Examples | Examples of the word's use | One of the examples of "code": "We should encode the message for security reasons"
-Lemmas | Other word synsets this word+POS is related to (not strictly synonyms, but can be considered so); lemmas are related to other lemmas, not to words directly | Lemmas of `code.v.02` (as in "convert ordinary language into code") are `code.v.02.encipher`, `code.v.02.cipher`, `code.v.02.cypher`, `code.v.02.encrypt`, `code.v.02.inscribe`, `code.v.02.write_in_code`
-Antonyms | Opposites | The antonym of lemma `encode.v.01.encode` is `decode.v.01.decode`
-Hypernym | A broad category that other words fall under | A hypernym of `code.v.01` (as in "Code the pieces with numbers so that you can identify them later") is `tag.v.01`
-Meronym | A word that is part of (or subordinate to) a broad category | A meronym of "computer" is "chip"
-Holonym | A word that denotes the whole of which this word is a part (the converse of meronym) | A holonym of "window" is "computer screen"
-
-There are several other attributes, which you can find in the `nltk/corpus/reader/wordnet.py` source file under your Python installation's `Lib/site-packages` directory.
-
-Some code may help this make more sense.
-
-This helper function:
-
-```
-from nltk.corpus import wordnet
-
-def synset_info(synset):
-    print("Name", synset.name())
-    print("POS:", synset.pos())
-    print("Definition:", synset.definition())
-    print("Examples:", synset.examples())
-    print("Lemmas:", synset.lemmas())
-    print("Antonyms:", [lemma.antonyms() for lemma in synset.lemmas() if len(lemma.antonyms()) > 0])
-    print("Hypernyms:", synset.hypernyms())
-    print("Instance Hypernyms:", synset.instance_hypernyms())
-    print("Part Holonyms:", synset.part_holonyms())
-    print("Part Meronyms:", synset.part_meronyms())
-    print()
-
-synsets = wordnet.synsets('code')
-print(len(synsets), "synsets:")
-for synset in synsets:
-    synset_info(synset)
-```
-
-shows this:
-
-```
-5 synsets:
-Name code.n.01
-POS: n
-Definition: a set of rules or principles or laws (especially written ones)
-Examples: []
-Lemmas: [Lemma('code.n.01.code'), Lemma('code.n.01.codification')]
-Antonyms: []
-Hypernyms: [Synset('written_communication.n.01')]
-Instance Hypernyms: []
-Part Holonyms: []
-Part Meronyms: []
-
-...
-
-Name code.n.03
-POS: n
-Definition: (computer science) the symbolic arrangement of data or instructions in a computer program or the set of such instructions
-Examples: []
-Lemmas: [Lemma('code.n.03.code'), Lemma('code.n.03.computer_code')]
-Antonyms: []
-Hypernyms: [Synset('coding_system.n.01')]
-Instance Hypernyms: []
-Part Holonyms: []
-Part Meronyms: []
-
-...
-
-Name code.v.02
-POS: v
-Definition: convert ordinary language into code
-Examples: ['We should encode the message for security reasons']
-Lemmas: [Lemma('code.v.02.code'), Lemma('code.v.02.encipher'), Lemma('code.v.02.cipher'), Lemma('code.v.02.cypher'), Lemma('code.v.02.encrypt'), Lemma('code.v.02.inscribe'), Lemma('code.v.02.write_in_code')]
-Antonyms: []
-Hypernyms: [Synset('encode.v.01')]
-Instance Hypernyms: []
-Part Holonyms: []
-Part Meronyms: []
-```
-
-Synsets and lemmas follow a tree structure you can visualize:
-
-```
-from pprint import pprint
-
-def hypernyms(synset):
-    return synset.hypernyms()
-
-synsets = wordnet.synsets('code')
-for synset in synsets:
-    print(synset.name() + " tree:")
-    pprint(synset.tree(rel=hypernyms))
-    print()
-```
-
-This shows:
-
-```
-code.n.01 tree:
-[Synset('code.n.01'),
- [Synset('written_communication.n.01'),
-   ...
-
-code.n.02 tree:
-[Synset('code.n.02'),
- [Synset('coding_system.n.01'),
-   ...
-
-code.n.03 tree:
-[Synset('code.n.03'),
-   ...
-
-code.v.01 tree:
-[Synset('code.v.01'),
- [Synset('tag.v.01'),
-   ...
-
-code.v.02 tree:
-[Synset('code.v.02'),
- [Synset('encode.v.01'),
-   ...
-```
-
-WordNet does not cover all words and their information (there are about 170,000 words in English today and about 155,000 in the latest version of WordNet), but it's a good starting point. After you learn the concepts of this building block, if you find it inadequate for your needs, you can migrate to another. Or, you can build your own WordNet!
-
-#### Try it yourself
-
-Using the Python libraries, download Wikipedia's page on [open source][5] and list the synsets and lemmas of all the words.
-
-### Similarity comparison
-
-Similarity comparison is a building block that identifies similarities between two pieces of text. It has many applications in search engines, chatbots, and more.
-
-For example, are the words "football" and "soccer" related?
-
-
-```
-syn1 = wordnet.synsets('football')
-syn2 = wordnet.synsets('soccer')
-
-# A word may have multiple synsets, so we need to compare each synset of word1 with each synset of word2
-for s1 in syn1:
-    for s2 in syn2:
-        print("Path similarity of: ")
-        print(s1, '(', s1.pos(), ')', '[', s1.definition(), ']')
-        print(s2, '(', s2.pos(), ')', '[', s2.definition(), ']')
-        print("   is", s1.path_similarity(s2))
-        print()
-```
-
-This shows:
-
-```
-Path similarity of:
-Synset('football.n.01') ( n ) [ any of various games played with a ball (round or oval) in which two teams try to kick or carry or propel the ball into each other's goal ]
-Synset('soccer.n.01') ( n ) [ a football game in which two teams of 11 players try to kick or head a ball into the opponents' goal ]
-   is 0.5
-
-Path similarity of:
-Synset('football.n.02') ( n ) [ the inflated oblong ball used in playing American football ]
-Synset('soccer.n.01') ( n ) [ a football game in which two teams of 11 players try to kick or head a ball into the opponents' goal ]
-   is 0.05
-```
-
-The highest path similarity score of the words is 0.5, indicating they are closely related.
-
-What about "code" and "bug"? Similarity scores for these words used in computer science are:
-
-```
-Path similarity of:
-Synset('code.n.01') ( n ) [ a set of rules or principles or laws (especially written ones) ]
-Synset('bug.n.02') ( n ) [ a fault or defect in a computer program, system, or machine ]
-   is 0.1111111111111111
-...
-Path similarity of:
-Synset('code.n.02') ( n ) [ a coding system used for transmitting messages requiring brevity or secrecy ]
-Synset('bug.n.02') ( n ) [ a fault or defect in a computer program, system, or machine ]
-   is 0.09090909090909091
-...
-Path similarity of: -Synset('code.n.03') ( n ) [ (computer science) the symbolic arrangement of data or instructions in a computer program or the set of such instructions ] -Synset('bug.n.02') ( n ) [ a fault or defect in a computer program, system, or machine ] -   is 0.09090909090909091 -``` - -These are the highest similarity scores, which indicates they are related. - -NLTK provides several similarity scorers, such as: - - * path_similarity - * lch_similarity - * wup_similarity - * res_similarity - * jcn_similarity - * lin_similarity - - - -See the Similarity section of the [WordNet Interface][6] page to determine the appropriate one for your application. - -#### Try it yourself - -Using Python libraries, start from the Wikipedia [Category: Lists of computer terms][7] page and prepare a list of terminologies, then see how the words correlate. - -### Tree and treebank - -With NLTK, you can represent a text's structure in tree form to help with text analysis. - -Here is an example: - -A simple text pre-processed and part-of-speech (POS)-tagged: - - -``` -import nltk - -text = "I love open source" -# Tokenize to words -words = nltk.tokenize.word_tokenize(text) -# POS tag the words -words_tagged = nltk.pos_tag(words) -``` - -You must define a grammar to convert the text to a tree structure. This example uses a simple grammar based on the [Penn Treebank tags][8]. - - -``` -# A simple grammar to create tree -grammar = "NP: {<JJ><NN>}" -``` - -Next, use the grammar to create a tree: - - -``` -# Create tree -parser = nltk.RegexpParser(grammar) -tree = parser.parse(words_tagged) -pprint(tree) -``` - -This produces: - - -``` -`Tree('S', [('I', 'PRP'), ('love', 'VBP'), Tree('NP', [('open', 'JJ'), ('source', 'NN')])])` -``` - -You can see it better graphically. - - -``` -`tree.draw()` -``` - -![NLTK Tree][9] - -(Girish Managoli, [CC BY-SA 4.0][10]) - -This structure helps explain the text's meaning correctly. 
As an example, identify the [subject][11] in this text: - - -``` -subject_tags = ["NN", "NNS", "NP", "NNP", "NNPS", "PRP", "PRP$"] -def subject(sentence_tree): -    for tagged_word in sentence_tree: -        # A crude logic for this case -  first word with these tags is considered subject -        if tagged_word[1] in subject_tags: -            return tagged_word[0] - -print("Subject:", subject(tree)) -``` - -It shows "I" is the subject: - - -``` -`Subject: I` -``` - -This is a basic text analysis building block that is applicable to larger applications. For example, when a user says, "Book a flight for my mom, Jane, to NY from London on January 1st," a chatbot using this block can interpret the request as: - -**Action**: Book -**What**: Flight -**Traveler**: Jane -**From**: London -**To**: New York -**Date**: 1 Jan (of the next year) - -A treebank refers to a corpus with pre-tagged trees. Open source, conditional free-for-use, and commercial treebanks are available for many languages. The most commonly used one for English is Penn Treebank, extracted from the _Wall Street Journal_, a subset of which is included in NLTK. Some ways of using a treebank: - - -``` -words = nltk.corpus.treebank.words() -print(len(words), "words:") -print(words) - -tagged_sents = nltk.corpus.treebank.tagged_sents() -print(len(tagged_sents), "sentences:") -print(tagged_sents) - -[/code] [code] - -100676 words: -['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', ...] -3914 sentences: -[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ...] -``` - -See tags in a sentence: - - -``` -sent0 = tagged_sents[0] -pprint(sent0) - -[/code] [code] - -[('Pierre', 'NNP'), - ('Vinken', 'NNP'), - (',', ','), - ('61', 'CD'), - ('years', 'NNS'), -... 
-``` - -Create a grammar to convert this to a tree: - - -``` -grammar = ''' -    Subject: {<NNP><NNP>} -    SubjectInfo: {<CD><NNS><JJ>} -    Action: {<MD><VB>} -    Object: {<DT><NN>} -    Stopwords: {<IN><DT>} -    ObjectInfo: {<JJ><NN>} -    When: {<NNP><CD>} -''' -parser = nltk.RegexpParser(grammar) -tree = parser.parse(sent0) -print(tree) - -[/code] [code] - -(S -  (Subject Pierre/NNP Vinken/NNP) -  ,/, -  (SubjectInfo 61/CD years/NNS old/JJ) -  ,/, -  (Action will/MD join/VB) -  (Object the/DT board/NN) -  as/IN -  a/DT -  (ObjectInfo nonexecutive/JJ director/NN) -  (Subject Nov./NNP) -  29/CD -  ./.) -``` - -See it graphically: - - -``` -`tree.draw()` -``` - -![NLP Treebank image][12] - -(Girish Managoli, [CC BY-SA 4.0][10]) - -The concept of trees and treebanks is a powerful building block for text analysis. - -#### Try it yourself - -Using the Python libraries, download Wikipedia's page on [open source][5] and represent the text in a presentable view. - -### Named entity recognition - -Text, whether spoken or written, contains important data. One of text processing's primary goals is extracting this key data. This is needed in almost all applications, such as an airline chatbot that books tickets or a question-answering bot. NLTK provides a named entity recognition feature for this. - -Here's a code example: - - -``` -`sentence = 'Peterson first suggested the name "open source" at Palo Alto, California'` -``` - -See if name and place are recognized in this sentence. 
Pre-process as usual: - - -``` -import nltk - -words = nltk.word_tokenize(sentence) -pos_tagged = nltk.pos_tag(words) -``` - -Run the named-entity tagger: - - -``` -ne_tagged = nltk.ne_chunk(pos_tagged) -print("NE tagged text:") -print(ne_tagged) -print() - -[/code] [code] - -NE tagged text: -(S -  (PERSON Peterson/NNP) -  first/RB -  suggested/VBD -  the/DT -  name/NN -  ``/`` -  open/JJ -  source/NN -  ''/'' -  at/IN -  (FACILITY Palo/NNP Alto/NNP) -  ,/, -  (GPE California/NNP)) -``` - -Name tags were added; extract only the named entities from this tree: - - -``` -print("Recognized named entities:") -for ne in ne_tagged: -    if hasattr(ne, "label"): -        print(ne.label(), ne[0:]) - -[/code] [code] - -Recognized named entities: -PERSON [('Peterson', 'NNP')] -FACILITY [('Palo', 'NNP'), ('Alto', 'NNP')] -GPE [('California', 'NNP')] -``` - -See it graphically: - - -``` -`ne_tagged.draw()` -``` - -![NLTK Treebank tree][13] - -(Girish Managoli, [CC BY-SA 4.0][10]) - -NLTK's built-in named-entity tagger, using PENN's [Automatic Content Extraction][14] (ACE) program, detects common entities such as ORGANIZATION, PERSON, LOCATION, FACILITY, and GPE (geopolitical entity). - -NLTK can use other taggers, such as the [Stanford Named Entity Recognizer][15]. This trained tagger is built in Java, but NLTK provides an interface to work with it (See [nltk.parse.stanford][16] or [nltk.tag.stanford][17]). - -#### Try it yourself - -Using the Python libraries, download Wikipedia's page on [open source][5] and identify people who had an influence on open source and where and when they contributed. - -### Advanced exercise - -If you're ready for it, try building this superstructure using the building blocks discussed in these articles. 
- -Using Python libraries, download Wikipedia's [Category: Computer science page][18] and: - - * Identify the most-occurring unigrams, bigrams, and trigrams and publish it as a list of keywords or technologies that students and engineers need to be aware of in this domain. - * Show the names, technologies, dates, and places that matter in this field graphically. This can be a nice infographic. - * Create a search engine. Does your search engine perform better than Wikipedia's search? - - - -### What's next? - -NLP is a quintessential pillar in application building. NLTK is a classic, rich, and powerful kit that provides the bricks and mortar to build practically appealing, purposeful applications for the real world. - -In this series of articles, I explained what NLP makes possible using NLTK as an example. NLP and NLTK have a lot more to offer. This series is an inception point to help get you started. - -If your needs grow beyond NLTK's capabilities, you could train new models or add capabilities to it. New NLP libraries that build on NLTK are coming up, and machine learning is being used extensively in language processing. 
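The n-gram counting in the advanced exercise above can be sketched in pure Python. This is a minimal sketch, assuming a short stand-in sentence instead of the downloaded Wikipedia text, and a hand-rolled helper that mirrors what `nltk.util.ngrams` returns:

```python
from collections import Counter

def ngrams(tokens, n):
    # Yield n-grams as tuples, mirroring what nltk.util.ngrams returns
    return zip(*(tokens[i:] for i in range(n)))

# Illustrative stand-in for the downloaded Wikipedia text
text = ("open source software is software with source code "
        "that anyone can inspect modify and enhance")
tokens = text.lower().split()

for n, label in ((1, "unigrams"), (2, "bigrams"), (3, "trigrams")):
    counts = Counter(ngrams(tokens, n))
    print("Top", label + ":", counts.most_common(3))
```

With NLTK installed, `nltk.word_tokenize` gives better tokens than `str.split`, and `nltk.FreqDist` can replace `Counter`.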
- --------------------------------------------------------------------------------- - -via: https://opensource.com/article/20/8/nlp-python-nltk - -作者:[Girish Managoli][a] -选题:[lujun9972][b] -译者:[译者ID](https://github.com/译者ID) -校对:[校对者ID](https://github.com/校对者ID) - -本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出 - -[a]: https://opensource.com/users/gammay -[b]: https://github.com/lujun9972 -[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/brain_computer_solve_fix_tool.png?itok=okq8joti (Brain on a computer screen) -[2]: https://opensource.com/article/20/8/intro-python-nltk -[3]: http://www.nltk.org/ -[4]: https://en.wikipedia.org/wiki/WordNet -[5]: https://en.wikipedia.org/wiki/Open_source -[6]: https://www.nltk.org/howto/wordnet.html -[7]: https://en.wikipedia.org/wiki/Category:Lists_of_computer_terms -[8]: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html -[9]: https://opensource.com/sites/default/files/uploads/nltk-tree.jpg (NLTK Tree) -[10]: https://creativecommons.org/licenses/by-sa/4.0/ -[11]: https://en.wikipedia.org/wiki/Subject_(grammar) -[12]: https://opensource.com/sites/default/files/uploads/nltk-treebank.jpg (NLP Treebank image) -[13]: https://opensource.com/sites/default/files/uploads/nltk-treebank-2a.jpg (NLTK Treebank tree) -[14]: https://www.ldc.upenn.edu/collaborations/past-projects/ace -[15]: https://nlp.stanford.edu/software/CRF-NER.html -[16]: https://www.nltk.org/_modules/nltk/parse/stanford.html -[17]: https://www.nltk.org/_modules/nltk/tag/stanford.html -[18]: https://en.wikipedia.org/wiki/Category:Computer_science diff --git a/translated/tech/20200807 An advanced guide to NLP analysis with Python and NLTK.md b/translated/tech/20200807 An advanced guide to NLP analysis with Python and NLTK.md new file mode 100644 index 0000000000..eef477a2c1 --- /dev/null +++ b/translated/tech/20200807 An advanced guide to NLP analysis with Python 
and NLTK.md
@@ -0,0 +1,572 @@
+[#]: collector: (lujun9972)
+[#]: translator: (tanloong)
+[#]: reviewer: ( )
+[#]: publisher: ( )
+[#]: url: ( )
+[#]: subject: (An advanced guide to NLP analysis with Python and NLTK)
+[#]: via: (https://opensource.com/article/20/8/nlp-python-nltk)
+[#]: author: (Girish Managoli https://opensource.com/users/gammay)
+
+用 Python 和 NLTK 进行 NLP 分析的高级教程
+======
+进一步学习自然语言处理的基本概念
+![Brain on a computer screen][1]
+
+在[之前的文章][2]里,我介绍了自然语言处理NLP和宾夕法尼亚大学研发的自然语言处理工具包 ([NLTK][3])。我演示了用 Python 解析文本和定义停用词的方法,并介绍了语料库的概念。语料库是由文本构成的数据集,通过提供现成的文本数据来辅助文本处理。在这篇文章里,我将继续用各种语料库对文本进行对比和分析。
+
+这篇文章主要包括以下部分:
+
+ * 词网WordNet和同义词集synset
+ * 相似度比较Similarity comparison
+ * 树Tree和树库treebank
+ * 命名实体识别Named entity recognition
+
+
+### WordNet 和同义词集synsets
+
+[WordNet][4] 是 NLTK 里的一个大型词典数据库。WordNet 包含各单词的诸多认知同义词cognitive synonyms (一组认知同义词cognitive synonyms常被称作一个 synset)。
+
+WordNet 是文本分析的一个很有用的工具。它有面向多种语言的版本 (汉语、英语、日语、俄语和西班牙语等),也使用多种许可证 (从开源许可证到商业许可证都有)。初代版本的 WordNet 由普林斯顿大学研发,面向英语,使用类 MIT 许可证MIT-like license。
+
+因为一个词可能有多个意义或多个词性,所以可能与多个 synset 相关联。每个 synset 通常提供下列属性:
+
+|**属性** | **定义** | **例子**|
+|---|---|---|
+|名称Name| 此 synset 的名称 | 单词 code 有 5 个 synset,名称分别是 `code.n.01`、 `code.n.02`、 `code.n.03`、`code.v.01` 和 `code.v.02`|
+|词性POS| 此 synset 的词性 | 单词 code 有 3 个名词词性的 synset 和 2 个动词词性的 synset|
+|定义Definition| 该词作对应词性时的定义 | 名词 code 的一个定义是: (计算机科学computer science)数据或计算机程序指令的象征性排列symbolic arrangement|
+|例子Examples| 使用该词的例子 | code 一词的例子:为了安全,我们应该给信息编码。We should encode the message for security reasons|
+|词元Lemmas| 与该词 (及其词性) 相关联的其他 synset (包括那些不一定严格地是该词的同义词,但可以大体看作同义词的);词元lemma直接与其他词元lemma相关联,而不是直接与单词word相关联| `code.v.02` 的词元lemma是`code.v.02.encipher`、`code.v.02.cipher`、`code.v.02.cypher`、`code.v.02.encrypt`、`code.v.02.inscribe` 和 `code.v.02.write_in_code`|
+|反义词Antonyms| 意思相反的词 | 词元lemma`encode.v.01.encode` 的反义词antonym是 `decode.v.01.decode`|
+|上义词Hypernym|该词所属的一个范畴更大的词 | `code.v.01` 的一个上义词hypernym是 `tag.v.01`|
+|分项词Meronym| 属于该词组成部分的词 | 计算机computer的一个分项词meronym是芯片chip|
+|总项词Holonym| 该词作为组成部分所属的整体 | window的一个总项词holonym是电脑屏幕computer screen|
+
+synset 还有一些其他属性,你可以在 `<你的 Python 安装路径>/Lib/site-packages` 下的 `nltk/corpus/reader/wordnet.py` 源文件里找到它们。
+
+下面的代码或许可以帮助理解。
+
+这个函数:
+
+```
+from nltk.corpus import wordnet
+
+
+def synset_info(synset):
+    print("Name", synset.name())
+    print("POS:", synset.pos())
+    print("Definition:", synset.definition())
+    print("Examples:", synset.examples())
+    print("Lemmas:", synset.lemmas())
+    print("Antonyms:", [lemma.antonyms() for lemma in synset.lemmas() if len(lemma.antonyms()) > 0])
+    print("Hypernyms:", synset.hypernyms())
+    print("Instance Hypernyms:", synset.instance_hypernyms())
+    print("Part Holonyms:", synset.part_holonyms())
+    print("Part Meronyms:", synset.part_meronyms())
+    print()
+
+
+synsets = wordnet.synsets('code')
+print(len(synsets), "synsets:")
+for synset in synsets:
+    synset_info(synset)
+```
+
+将会显示:
+
+```
+5 synsets:
+Name code.n.01
+POS: n
+Definition: a set of rules or principles or laws (especially written ones)
+Examples: []
+Lemmas: [Lemma('code.n.01.code'), Lemma('code.n.01.codification')]
+Antonyms: []
+Hypernyms: [Synset('written_communication.n.01')]
+Instance Hypernyms: []
+Part Holonyms: []
+Part Meronyms: []
+
+...
+
+Name code.n.03
+POS: n
+Definition: (computer science) the symbolic arrangement of data or instructions in a computer program or the set of such instructions
+Examples: []
+Lemmas: [Lemma('code.n.03.code'), Lemma('code.n.03.computer_code')]
+Antonyms: []
+Hypernyms: [Synset('coding_system.n.01')]
+Instance Hypernyms: []
+Part Holonyms: []
+Part Meronyms: []
+
+...
+
+Name code.v.02
+POS: v
+Definition: convert ordinary language into code
+Examples: ['We should encode the message for security reasons']
+Lemmas: [Lemma('code.v.02.code'), Lemma('code.v.02.encipher'), Lemma('code.v.02.cipher'), Lemma('code.v.02.cypher'), Lemma('code.v.02.encrypt'), Lemma('code.v.02.inscribe'), Lemma('code.v.02.write_in_code')]
+Antonyms: []
+Hypernyms: [Synset('encode.v.01')]
+Instance Hypernyms: []
+Part Holonyms: []
+Part Meronyms: []
+```
+
+同义词集synset和词元lemma在 WordNet 里是按照树状结构组织起来的,下面的代码会给出直观的展现:
+
+```
+from pprint import pprint
+
+def hypernyms(synset):
+    return synset.hypernyms()
+
+synsets = wordnet.synsets('code')
+for synset in synsets:
+    print(synset.name() + " tree:")
+    pprint(synset.tree(rel=hypernyms))
+    print()
+```
+
+输出如下:
+
+```
+code.n.01 tree:
+[Synset('code.n.01'),
+ [Synset('written_communication.n.01'),
+   ...
+
+code.n.02 tree:
+[Synset('code.n.02'),
+ [Synset('coding_system.n.01'),
+   ...
+
+code.n.03 tree:
+[Synset('code.n.03'),
+   ...
+
+code.v.01 tree:
+[Synset('code.v.01'),
+ [Synset('tag.v.01'),
+   ...
+
+code.v.02 tree:
+[Synset('code.v.02'),
+ [Synset('encode.v.01'),
+   ...
+```
+
+WordNet 并没有涵盖所有的单词和其信息 (现今英语有约 170,000 个单词,最新版的 WordNet 涵盖了约 155,000 个),但它开了个好头。掌握了 WordNet 的各个概念后,如果你觉得它词汇少,不能满足你的需要,可以转而使用其他工具。或者,你也可以打造自己的“词网”WordNet!
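顺着“打造自己的词网”这个思路,下面是一个极简的纯 Python 示意 (不依赖 NLTK;其中的词条、定义和上义词关系只是为演示而假设的数据,并非真实 WordNet 内容),演示 synset 式的数据结构大概长什么样:

```python
# 用嵌套字典模拟 synset:名称 -> 词性、定义、词元、上义词
# (词条内容为演示用的假设数据)
tiny_wordnet = {
    "code.n.03": {
        "pos": "n",
        "definition": "(computer science) instructions in a computer program",
        "lemmas": ["code", "computer_code"],
        "hypernym": "coding_system.n.01",
    },
    "coding_system.n.01": {
        "pos": "n",
        "definition": "a system of signals used to represent letters or numbers",
        "lemmas": ["coding_system"],
        "hypernym": None,  # 根节点没有上义词
    },
}

def hypernym_chain(name):
    # 沿上义词关系向上走,返回从该 synset 到根的路径
    chain = []
    while name is not None:
        chain.append(name)
        name = tiny_wordnet[name]["hypernym"]
    return chain

print(hypernym_chain("code.n.03"))  # ['code.n.03', 'coding_system.n.01']
```

真实的 WordNet 还要处理一词多个上义词、按词性索引等情况,这里只为说明数据结构的基本思路。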
+ +#### 自主尝试 + +使用 Python 库,下载维基百科的 [open source][5] 页面,并列出该页面所有单词的同义词集synsets 词元lemmas。 + +### 相似度比较 + +相似度比较的目的是识别出两篇文本的相似度,在搜索引擎、聊天机器人等方面有很多应用。 + +比如,相似度比较可以识别 football 和 soccer 是否有相似性。 + +``` +syn1 = wordnet.synsets('football') +syn2 = wordnet.synsets('soccer') + +# A word may have multiple synsets, so need to compare each synset of word1 with synset of word2 +# 一个单词可能有多个 synset,需要把 word1 的每个 synset 和 word2 的每个 synset 分别比较 +for s1 in syn1: +    for s2 in syn2: +        print("Path similarity of: ") +        print(s1, '(', s1.pos(), ')', '[', s1.definition(), ']') +        print(s2, '(', s2.pos(), ')', '[', s2.definition(), ']') +        print("   is", s1.path_similarity(s2)) +        print() + +[/code] [code] + +Path similarity of: +Synset('football.n.01') ( n ) [ any of various games played with a ball (round or oval) in which two teams try to kick or carry or propel the ball into each other's goal ] +Synset('soccer.n.01') ( n ) [ a football game in which two teams of 11 players try to kick or head a ball into the opponents' goal ] +   is 0.5 + +Path similarity of: +Synset('football.n.02') ( n ) [ the inflated oblong ball used in playing American football ] +Synset('soccer.n.01') ( n ) [ a football game in which two teams of 11 players try to kick or head a ball into the opponents' goal ] +   is 0.05 +``` + +两个词各个 synset 之间路径相似度path similarity最大的是 0.5,表明它们关联性很大 (路径相似度指两个词的意义在上下义关系的词汇分类结构hypernym/hypnoym taxonomy中的最短距离)。 + +那么 code 和 bug 呢?这两个计算机领域的词的相似度是: + +``` +Path similarity of: +Synset('code.n.01') ( n ) [ a set of rules or principles or laws (especially written ones) ] +Synset('bug.n.02') ( n ) [ a fault or defect in a computer program, system, or machine ] +   is 0.1111111111111111 +... +Path similarity of: +Synset('code.n.02') ( n ) [ a coding system used for transmitting messages requiring brevity or secrecy ] +Synset('bug.n.02') ( n ) [ a fault or defect in a computer program, system, or machine ] +   is 0.09090909090909091 +... 
+Path similarity of:
+Synset('code.n.03') ( n ) [ (computer science) the symbolic arrangement of data or instructions in a computer program or the set of such instructions ]
+Synset('bug.n.02') ( n ) [ a fault or defect in a computer program, system, or machine ]
+   is 0.09090909090909091
+```
+
+这些是这两个词各 synset 之间路径相似度path similarity的最大值,表明这两个词是有关联性的。
+
+NLTK 提供多种相似度计分器similarity scorers,比如:
+
+ * path_similarity
+ * lch_similarity
+ * wup_similarity
+ * res_similarity
+ * jcn_similarity
+ * lin_similarity
+
+要进一步了解这些相似度计分器similarity scorers,请查看 [WordNet Interface][6] 的 Similarity 部分。
+
+#### 自主尝试
+
+使用 Python 库,从维基百科的 [Category: Lists of computer terms][7] 生成一个术语列表,然后计算各术语之间的相似度。
+
+### 树tree和树库treebank
+
+使用 NLTK,你可以把文本表示成树状结构以便进行分析。
+
+这里有一个例子:
+
+这是一份简短的文本,对其做预处理和词性标注:
+
+```
+import nltk
+
+text = "I love open source"
+# Tokenize to words
+words = nltk.tokenize.word_tokenize(text)
+# POS tag the words
+words_tagged = nltk.pos_tag(words)
+```
+
+要把文本转换成树状结构,你必须定义一个语法grammar。这个例子里用的是一个基于 [Penn Treebank tags][8] 的简单语法:
+
+```
+# A simple grammar to create tree
+grammar = "NP: {<JJ><NN>}"
+```
+
+然后用这个语法grammar创建一棵树tree:
+
+```
+# Create tree
+parser = nltk.RegexpParser(grammar)
+tree = parser.parse(words_tagged)
+pprint(tree)
+```
+
+运行上面的代码,将得到:
+
+```
+Tree('S', [('I', 'PRP'), ('love', 'VBP'), Tree('NP', [('open', 'JJ'), ('source', 'NN')])])
+```
+
+你也可以图形化地显示结果。
+
+```
+tree.draw()
+```
+
+![NLTK Tree][9]
+
+(Girish Managoli, [CC BY-SA 4.0][10])
+
+这个树状结构有助于准确解读文本的意思。比如,用它可以找到文本的主语 ([subject][11]):
+
+```
+subject_tags = ["NN", "NNS", "NP", "NNP", "NNPS", "PRP", "PRP$"]
+def subject(sentence_tree):
+    for tagged_word in sentence_tree:
+        # A crude logic for this case -  first word with these tags is considered subject
+        if tagged_word[1] in subject_tags:
+            return tagged_word[0]
+
+print("Subject:", subject(tree))
+``` + +结果显示主语是 I: + +``` +Subject: I +``` + +这是一个比较基础的文本分析步骤,可以用到更广泛的应用场景中。 比如,在聊天机器人方面,如果用户告诉机器人:“给我妈妈 Jane 预订一张机票,1 月 1 号伦敦飞纽约的“,机器人可以用这种分析方法解读这个指令: + +**动作**: 预订 +**动作的对象**: 机票 +**乘客**: Jane +**出发地**: 伦敦 +**目的地**: 纽约 +**日期**: (明年) 1 月 1 号 + +树库treebank指由许多预先标注好的tree构成的语料库。现在已经有面向多种语言的树库,既有开源的,也有限定条件下才能免费使用的,以及商用的。其中使用最广泛的是面向英语的宾州树库。宾州树库取材于 _华尔街日报_ Wall Street Journal。NLTK 也包含了宾州树库作为一个子语料库。下面是一些使用树库treebank的方法: + +``` +words = nltk.corpus.treebank.words() +print(len(words), "words:") +print(words) + +tagged_sents = nltk.corpus.treebank.tagged_sents() +print(len(tagged_sents), "sentences:") +print(tagged_sents) + +[/code] [code] + +100676 words: +['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', ...] +3914 sentences: +[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ...] +``` + +查看一个句子里的各个标签tags: + +``` +sent0 = tagged_sents[0] +pprint(sent0) + +[/code] [code] + +[('Pierre', 'NNP'), + ('Vinken', 'NNP'), + (',', ','), + ('61', 'CD'), + ('years', 'NNS'), +... +``` + +定义一个语法grammar来把这个句子转换成树状结构: + +``` +grammar = ''' +    Subject: {<NNP><NNP>} +    SubjectInfo: {<CD><NNS><JJ>} +    Action: {<MD><VB>} +    Object: {<DT><NN>} +    Stopwords: {<IN><DT>} +    ObjectInfo: {<JJ><NN>} +    When: {<NNP><CD>} +''' +parser = nltk.RegexpParser(grammar) +tree = parser.parse(sent0) +print(tree) + +[/code] [code] + +(S +  (Subject Pierre/NNP Vinken/NNP) +  ,/, +  (SubjectInfo 61/CD years/NNS old/JJ) +  ,/, +  (Action will/MD join/VB) +  (Object the/DT board/NN) +  as/IN +  a/DT +  (ObjectInfo nonexecutive/JJ director/NN) +  (Subject Nov./NNP) +  29/CD +  ./.) 
+```
+
+图形化地显示:
+
+```
+tree.draw()
+```
+
+![NLP Treebank image][12]
+
+(Girish Managoli, [CC BY-SA 4.0][10])
+
+树tree和树库treebank的概念是文本分析的一个强大的组成部分。
+
+#### 自主尝试
+
+使用 Python 库,下载维基百科的 [open source][5] 页面,将得到的文本以图形化的树状结构展现出来。
+
+### 命名实体识别Named entity recognition
+
+无论口语还是书面语都包含着重要数据。文本处理的主要目标之一,就是提取出这些关键数据。几乎所有应用都需要这一能力,比如航空公司的订票机器人或者问答机器人。NLTK 为此提供了一个命名实体识别named entity recognition的功能。
+
+这里有一个代码示例:
+
+```
+sentence = 'Peterson first suggested the name "open source" at Palo Alto, California'
+```
+
+验证这个句子里的人名name和地名place有没有被识别出来。照例先预处理:
+
+```
+import nltk
+
+words = nltk.word_tokenize(sentence)
+pos_tagged = nltk.pos_tag(words)
+```
+
+运行命名实体标注器named-entity tagger:
+
+```
+ne_tagged = nltk.ne_chunk(pos_tagged)
+print("NE tagged text:")
+print(ne_tagged)
+print()
+```
+
+输出如下:
+
+```
+NE tagged text:
+(S
+  (PERSON Peterson/NNP)
+  first/RB
+  suggested/VBD
+  the/DT
+  name/NN
+  ``/``
+  open/JJ
+  source/NN
+  ''/''
+  at/IN
+  (FACILITY Palo/NNP Alto/NNP)
+  ,/,
+  (GPE California/NNP))
+```
+
+上面的结果里,命名实体被识别出来并做了标注;只提取这个树tree里的命名实体:
+
+```
+print("Recognized named entities:")
+for ne in ne_tagged:
+    if hasattr(ne, "label"):
+        print(ne.label(), ne[0:])
+```
+
+输出如下:
+
+```
+Recognized named entities:
+PERSON [('Peterson', 'NNP')]
+FACILITY [('Palo', 'NNP'), ('Alto', 'NNP')]
+GPE [('California', 'NNP')]
+```
+
+图形化地显示:
+
+```
+ne_tagged.draw()
+```
+
+![NLTK Treebank tree][13]
+
+(Girish
Managoli, [CC BY-SA 4.0][10])
+
+NLTK 内置的命名实体标注器named-entity tagger,使用的是宾夕法尼亚大学的 [Automatic Content Extraction][14] (ACE) 程序。该标注器能够识别组织机构ORGANIZATION、人名PERSON、地名LOCATION、设施FACILITY和地缘政治实体geopolitical entity (GPE) 等常见实体entities。
+
+NLTK 也可以使用其他标注器tagger,比如 [Stanford Named Entity Recognizer][15]。这个经过训练的标注器用 Java 写成,但 NLTK 提供了一个使用它的接口 (详情请查看 [nltk.parse.stanford][16] 或 [nltk.tag.stanford][17])。
+
+#### 自主尝试
+
+使用 Python 库,下载维基百科的 [open source][5] 页面,并识别出对开源open source有影响力的人的名字,以及他们为开源open source做贡献的时间和地点。
+
+### 高级实践
+
+如果你准备好了,尝试用这篇文章以及此前的文章介绍的知识构建一个超级结构superstructure。
+
+使用 Python 库,下载维基百科的 [Category: Computer science page][18],然后:
+
+ * 找出其中频率最高的单词unigrams、二元搭配bigrams和三元搭配trigrams,将它们作为一个关键词keywords列表或者技术technologies列表。相关领域的学生或者工程师需要了解这样一份列表里的内容。
+ * 图形化地显示这个领域里重要的人名、技术、日期和地点。这会是一份很棒的信息图。
+ * 构建一个搜索引擎search engine。你的搜索引擎search engine性能能够超过维基百科吗?
+
+
+### 接下来可以做什么?
+
+自然语言处理NLP是应用构建application building的一个典型支柱。NLTK 是经典、丰富且强大的工具集,提供了为现实世界构建有吸引力、目标明确的应用所需的砖瓦bricks and mortar。
+
+在这个系列的文章里,我用 NLTK 作为例子,展示了自然语言处理可以做什么。自然语言处理和 NLTK 还有太多东西值得探索,这个系列的文章只是帮助你探索它们的切入点。
+
+如果你的需求慢慢增长到 NLTK 已经满足不了了,你可以训练新的模型或者向 NLTK 添加新的功能。基于 NLTK 构建的新的自然语言处理库NLP libraries正在不断涌现,机器学习也正被广泛用于自然语言处理。
+
+--------------------------------------------------------------------------------
+
+via: https://opensource.com/article/20/8/nlp-python-nltk
+
+作者:[Girish Managoli][a]
+选题:[lujun9972][b]
+译者:[tanloong](https://github.com/tanloong)
+校对:[校对者ID](https://github.com/校对者ID)
+
+本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
+
+[a]: https://opensource.com/users/gammay
+[b]: https://github.com/lujun9972
+[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/brain_computer_solve_fix_tool.png?itok=okq8joti (Brain on a computer screen)
+[2]: https://opensource.com/article/20/8/intro-python-nltk
+[3]: http://www.nltk.org/
+[4]: https://en.wikipedia.org/wiki/WordNet
+[5]: https://en.wikipedia.org/wiki/Open_source
+[6]: https://www.nltk.org/howto/wordnet.html
+[7]: https://en.wikipedia.org/wiki/Category:Lists_of_computer_terms +[8]: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html +[9]: https://opensource.com/sites/default/files/uploads/nltk-tree.jpg (NLTK Tree) +[10]: https://creativecommons.org/licenses/by-sa/4.0/ +[11]: https://en.wikipedia.org/wiki/Subject_(grammar) +[12]: https://opensource.com/sites/default/files/uploads/nltk-treebank.jpg (NLP Treebank image) +[13]: https://opensource.com/sites/default/files/uploads/nltk-treebank-2a.jpg (NLTK Treebank tree) +[14]: https://www.ldc.upenn.edu/collaborations/past-projects/ace +[15]: https://nlp.stanford.edu/software/CRF-NER.html +[16]: https://www.nltk.org/_modules/nltk/parse/stanford.html +[17]: https://www.nltk.org/_modules/nltk/tag/stanford.html +[18]: https://en.wikipedia.org/wiki/Category:Computer_science