Merge pull request #13607 from MjSeven/sentiment_analysis

Translation complete.
I'm a bit apprehensive about proofreading this one; it looks rather difficult.
Xingyu.Wang 2019-05-12 22:36:02 +08:00 committed by GitHub
commit 4533cf1b5c
2 changed files with 115 additions and 117 deletions


@@ -1,117 +0,0 @@
[#]: collector: (lujun9972)
[#]: translator: (MjSeven)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (Getting started with social media sentiment analysis in Python)
[#]: via: (https://opensource.com/article/19/4/social-media-sentiment-analysis-python)
[#]: author: (Michael McCune https://opensource.com/users/elmiko/users/jschlessman)
Getting started with social media sentiment analysis in Python
======
Learn the basics of natural language processing and explore two useful
Python packages.
![Raspberry Pi and Python][1]
Natural language processing (NLP) is a type of machine learning that addresses the correlation between spoken/written languages and computer-aided analysis of those languages. We experience numerous innovations from NLP in our daily lives, from writing assistance and suggestions to real-time speech translation and interpretation.
This article examines one specific area of NLP: sentiment analysis, with an emphasis on determining the positive, negative, or neutral nature of the input language. This part will explain the background behind NLP and sentiment analysis and explore two open source Python packages. [Part 2][2] will demonstrate how to begin building your own scalable sentiment analysis services.
When learning sentiment analysis, it is helpful to have an understanding of NLP in general. This article won't dig into the mathematical guts; rather, our goal is to clarify key concepts in NLP that are crucial to incorporating these methods into your solutions in practical ways.
### Natural language and text data
A reasonable place to begin is defining: "What is natural language?" It is the means by which we, as humans, communicate with one another. The primary modalities for communication are verbal and text. We can take this a step further and focus solely on text communication; after all, living in an age of pervasive Siri, Alexa, etc., we know speech is a group of computations away from text.
### Data landscape and challenges
Limiting ourselves to textual data, what can we say about language and text? First, language, particularly English, is fraught with exceptions to rules, plurality of meanings, and contextual differences that can confuse even a human interpreter, let alone a computational one. In elementary school, we learn articles of speech and punctuation, and from speaking our native language, we acquire intuition about which words have less significance when searching for meaning. Examples of the latter would be articles of speech such as "a," "the," and "or," which in NLP are referred to as _stop words_, since traditionally an NLP algorithm's search for meaning stops when reaching one of these words in a sequence.
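As a rough illustration (and not part of the tooling used later in this series), here is a minimal stop-word filter in plain Python. The tiny stop-word list is only an example; real NLP libraries ship far more complete lists.

```python
# Minimal stop-word filtering sketch; the stop-word list is illustrative only.
STOP_WORDS = {"a", "an", "the", "or", "and", "of", "to", "in", "is"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop common stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(remove_stop_words("The rain in Spain stays mainly in the plain"))
# ['rain', 'spain', 'stays', 'mainly', 'plain']
```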
Since our goal is to automate the classification of text as belonging to a sentiment class, we need a way to work with text data in a computational fashion. Therefore, we must consider how to represent text data to a machine. As we know, the rules for utilizing and interpreting language are complicated, and the size and structure of input text can vary greatly. We'll need to transform the text data into numeric data, the form of choice for machines and math. This transformation falls under the area of _feature extraction_.
Upon extracting numeric representations of input text data, one refinement might be, given an input body of text, to determine a set of quantitative statistics for the articles of speech listed above and perhaps classify documents based on them. For example, a glut of adverbs might make a copywriter bristle, or excessive use of stop words might be helpful in identifying term papers with content padding. Admittedly, this may not have much bearing on our goal of sentiment analysis.
### Bag of words
When you assess a text statement as positive or negative, what are some contextual clues you use to assess its polarity (i.e., whether the text has positive, negative, or neutral sentiment)? One way is connotative adjectives: something called "disgusting" is viewed as negative, but if the same thing were called "beautiful," you would judge it as positive. Colloquialisms, by definition, give a sense of familiarity and often positivity, whereas curse words could be a sign of hostility. Text data can also include emojis, which carry inherent sentiments.
Understanding the polarity influence of individual words provides a basis for the [_bag-of-words_][3] (BoW) model of text. It considers a set of words or vocabulary and extracts measures about the presence of those words in the input text. The vocabulary is formed by considering text where the polarity is known, referred to as _labeled training data_. Features are extracted from this set of labeled data, then the relationships between the features are analyzed and labels are associated with the data.
The name "bag of words" illustrates what it utilizes: namely, individual words without consideration of spatial locality or context. A vocabulary typically is built from all words appearing in the training set, which tends to be pruned afterward. Stop words, if not cleaned prior to training, are removed due to their high frequency and low contextual utility. Rarely used words can also be removed, given the lack of information they provide for general input cases.
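To make the counting idea concrete, here is a minimal bag-of-words sketch using only the Python standard library; a production pipeline would normally lean on an established library rather than hand-rolled code like this.

```python
from collections import Counter

def bag_of_words(documents):
    """Build a shared vocabulary and one integer count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    counts = [Counter(doc) for doc in tokenized]
    vectors = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, vectors

vocab, vectors = bag_of_words(["good good movie", "bad movie"])
print(vocab)    # ['bad', 'good', 'movie']
print(vectors)  # [[0, 2, 1], [1, 0, 1]]
```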
It is important to note, however, that you can (and should) go further and consider the appearance of words beyond their use in an individual instance of training data, or what is called [_term frequency_][4] (TF). You should also consider the counts of a word through all instances of input data; typically the infrequency of words among all documents is notable, which is called the [_inverse document frequency_][5] (IDF). These metrics are bound to be mentioned in other articles and software packages on this subject, so having an awareness of them can only help.
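The exact weighting formulas vary between implementations; the sketch below uses one common variant (raw term counts and a logarithmic IDF) purely to show how the two quantities combine.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute a TF-IDF weight for every (document, term) pair.

    Uses raw term frequency and log(N / document frequency),
    one of several common weighting variants.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(word for doc in tokenized for word in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({
            word: tf[word] * math.log(n_docs / doc_freq[word]) for word in tf
        })
    return weights

weights = tf_idf(["the cat sat", "the dog sat", "the dog barked"])
print(weights[0])  # approximately {'the': 0.0, 'cat': 1.099, 'sat': 0.405}
```

Note how "the," which appears in every document, ends up with a weight of zero, which is exactly the intuition behind IDF.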
BoW is useful in a number of document classification applications; however, in the case of sentiment analysis, things can be gamed when the lack of contextual awareness is leveraged. Consider the following sentences:
* We are not enjoying this war.
* I loathe rainy days, good thing today is sunny.
* This is not a matter of life and death.
The sentiment of these phrases is questionable for human interpreters, and by strictly focusing on instances of individual vocabulary words, it's difficult for a machine interpreter as well.
Groupings of words, called _n-grams_, can also be considered in NLP. A bigram considers groups of two adjacent words instead of (or in addition to) the single BoW. This should alleviate situations such as "not enjoying" above, but it will remain open to gaming due to its loss of contextual awareness. Furthermore, in the second sentence above, the sentiment context of the second half of the sentence could be perceived as negating the first half. Thus, spatial locality of contextual clues also can be lost in this approach. Complicating matters from a pragmatic perspective is the sparsity of features extracted from a given input text. For a thorough and large vocabulary, a count is maintained for each word, which can be considered an integer vector. Most documents will have a large number of zero counts in their vectors, which adds unnecessary space and time complexity to operations. While a number of clever approaches have been proposed for reducing this complexity, it remains an issue.
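For illustration, extracting bigrams (or any n-grams) from a sentence takes only a few lines of Python:

```python
def ngrams(text, n=2):
    """Return the n-grams (tuples of n adjacent words) found in the text."""
    words = text.lower().split()
    return list(zip(*(words[i:] for i in range(n))))

print(ngrams("we are not enjoying this war"))
# [('we', 'are'), ('are', 'not'), ('not', 'enjoying'),
#  ('enjoying', 'this'), ('this', 'war')]
```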
### Word embeddings
Word embeddings are a distributed representation that allows words with a similar meaning to have a similar representation. This is based on using a real-valued vector to represent words in connection with the company they keep, as it were. The focus is on the manner that words are used, as opposed to simply their existence. In addition, a huge pragmatic benefit of word embeddings is their focus on dense vectors; by moving away from a word-counting model with commensurate amounts of zero-valued vector elements, word embeddings provide a more efficient computational paradigm with respect to both time and storage.
Following are two prominent word embedding approaches.
#### Word2vec
The first of these word embeddings, [Word2vec][6], was developed at Google. You'll probably see this embedding method mentioned as you go deeper in your study of NLP and sentiment analysis. It utilizes either a _continuous bag of words_ (CBOW) or a _continuous skip-gram_ model. In CBOW, a word's context is learned during training based on the words surrounding it. Continuous skip-gram learns the words that tend to surround a given word. Although this is more than what you'll probably need to tackle, if you're ever faced with having to generate your own word embeddings, the author of Word2vec advocates the CBOW method for speed and assessment of frequent words, while the skip-gram approach is better suited for embeddings where rare words are more important.
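If you ever do need to train an embedding yourself, libraries such as [Gensim](https://radimrehurek.com/gensim/) (not otherwise used in this series) expose both training modes. A minimal sketch, assuming Gensim 4.x and a toy corpus, might look like this:

```python
# Minimal sketch assuming Gensim 4.x; real training needs far more text.
from gensim.models import Word2Vec

sentences = [
    ["the", "movie", "was", "wonderful"],
    ["the", "movie", "was", "awful"],
    ["what", "a", "wonderful", "day"],
]

# sg=0 selects CBOW; sg=1 would select the continuous skip-gram model.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["movie"].shape)                 # (50,): one dense vector per word
print(model.wv.most_similar("movie", topn=2))  # nearest neighbors in the toy space
```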
#### GloVe
The second word embedding, [_Global Vectors for Word Representation_][7] (GloVe), was developed at Stanford. It's an extension to the Word2vec method that attempts to combine the information gained through classical global text statistical feature extraction with the local contextual information determined by Word2vec. In practice, GloVe has outperformed Word2vec for some applications, while falling short of Word2vec's performance in others. Ultimately, the targeted dataset for your word embedding will dictate which method is optimal; as such, it's good to know the existence and high-level mechanics of each, as you'll likely come across them.
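Pretrained GloVe vectors are distributed as plain-text files, with one word followed by its vector components per line, so loading them requires nothing beyond the standard library. The sketch below assumes you have already downloaded a file such as `glove.6B.50d.txt` from the Stanford NLP project.

```python
def load_glove(path):
    """Parse a GloVe text file into a {word: [float, ...]} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = [float(v) for v in values]
    return embeddings

# Assumes glove.6B.50d.txt has already been downloaded locally.
vectors = load_glove("glove.6B.50d.txt")
print(len(vectors["python"]))  # 50 components per word in this particular file
```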
#### Creating and using word embeddings
Finally, it's useful to know how to obtain word embeddings; in part 2, you'll see that we are standing on the shoulders of giants, as it were, by leveraging the substantial work of others in the community. This is one method of acquiring a word embedding: namely, using an existing trained and proven model. Indeed, myriad models exist for English and other languages, and it's possible that one does what your application needs out of the box!
If not, the opposite end of the spectrum in terms of development effort is training your own standalone model without consideration of your application. In essence, you would acquire substantial amounts of labeled training data and likely use one of the approaches above to train a model. Even then, you are still only at the point of acquiring understanding of your input-text data; you then need to develop a model specific for your application (e.g., analyzing sentiment valence in software version-control messages) which, in turn, requires its own time and effort.
You also could train a word embedding on data specific to your application; while this could reduce time and effort, the word embedding would be application-specific, which would reduce reusability.
### Available tooling options
You may wonder how you'll ever get to a point of having a solution for your problem, given the intensive time and computing power needed. Indeed, the complexities of developing solid models can be daunting; however, there is good news: there are already many proven models, tools, and software libraries available that may provide much of what you need. We will focus on [Python][8], which conveniently has a plethora of tooling in place for these applications.
#### SpaCy
[SpaCy][9] provides a number of language models for parsing input text data and extracting features. It is highly optimized and touted as the fastest library of its kind. Best of all, it's open source! SpaCy performs tokenization, parts-of-speech classification, and dependency annotation. It contains word embedding models for performing this and other feature extraction operations for over 46 languages. You will see how it can be used for text analysis and feature extraction in the second article in this series.
#### vaderSentiment
The [vaderSentiment][10] package provides a measure of positive, negative, and neutral sentiment. As the [original paper][11]'s title ("VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text") indicates, the models were developed and tuned specifically for social media text data. VADER was trained on a thorough set of human-labeled data, which included common emoticons, UTF-8 encoded emojis, and colloquial terms and abbreviations (e.g., meh, lol, sux).
For given input text data, vaderSentiment returns a 3-tuple of polarity score percentages. It also provides a single scoring measure, referred to as _vaderSentiment's compound metric_. This is a real-valued measurement within the range **[-1, 1]** wherein sentiment is considered positive for values greater than **0.05**, negative for values less than **-0.05**, and neutral otherwise.
In [part 2][2], you will learn how to use these tools to add sentiment analysis capabilities to your designs.
--------------------------------------------------------------------------------
via: https://opensource.com/article/19/4/social-media-sentiment-analysis-python
Author: [Michael McCune][a]
Topic selection: [lujun9972][b]
Translator: [译者ID](https://github.com/译者ID)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article was originally translated by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
[a]: https://opensource.com/users/elmiko/users/jschlessman
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/getting_started_with_python.png?itok=MFEKm3gl (Raspberry Pi and Python)
[2]: https://opensource.com/article/19/4/social-media-sentiment-analysis-python-part-2
[3]: https://en.wikipedia.org/wiki/Bag-of-words_model
[4]: https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency
[5]: https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency
[6]: https://en.wikipedia.org/wiki/Word2vec
[7]: https://en.wikipedia.org/wiki/GloVe_(machine_learning)
[8]: https://www.python.org/
[9]: https://pypi.org/project/spacy/
[10]: https://pypi.org/project/vaderSentiment/
[11]: http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf


@@ -0,0 +1,115 @@
[#]: collector: (lujun9972)
[#]: translator: (MjSeven)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (Getting started with social media sentiment analysis in Python)
[#]: via: (https://opensource.com/article/19/4/social-media-sentiment-analysis-python)
[#]: author: (Michael McCune https://opensource.com/users/elmiko/users/jschlessman)
Getting started with social media sentiment analysis in Python
======
Learn the basics of natural language processing and explore two useful Python packages.
![Raspberry Pi and Python][1]
Natural language processing (NLP) is a type of machine learning that addresses the correlation between spoken/written languages and computer-aided analysis of those languages. We experience numerous innovations from NLP in our daily lives, from writing assistance and suggestions to real-time speech translation and interpretation.
This article examines one specific area of NLP: sentiment analysis, with an emphasis on determining the positive, negative, or neutral nature of the input language. This part will explain the background behind NLP and sentiment analysis and explore two open source Python packages. [Part 2][2] will demonstrate how to begin building your own scalable sentiment analysis services.
When learning sentiment analysis, it is helpful to have an understanding of NLP in general. This article won't dig into the mathematical guts; rather, our goal is to clarify key concepts in NLP that are crucial to incorporating these methods into your solutions in practical ways.
### Natural language and text data
A reasonable place to begin is defining: "What is natural language?" It is the means by which we, as humans, communicate with one another. The primary modalities for communication are verbal and text. We can take this a step further and focus solely on text communication; after all, living in an age of pervasive Siri, Alexa, etc., we know speech is a group of computations away from text.
### Data landscape and challenges
Limiting ourselves to textual data, what can we say about language and text? First, language, particularly English, is fraught with exceptions to rules, plurality of meanings, and contextual differences that can confuse even a human interpreter, let alone a computational one. In elementary school, we learn articles of speech and punctuation, and from speaking our native language, we acquire intuition about which words have less significance when searching for meaning. Examples of the latter would be articles of speech such as "a," "the," and "or," which in NLP are referred to as _stop words_, since traditionally an NLP algorithm's search for meaning stops when reaching one of these words in a sequence.
Since our goal is to automate the classification of text as belonging to a sentiment class, we need a way to work with text data in a computational fashion. Therefore, we must consider how to represent text data to a machine. As we know, the rules for utilizing and interpreting language are complicated, and the size and structure of input text can vary greatly. We'll need to transform the text data into numeric data, the form of choice for machines and math. This transformation falls under the area of _feature extraction_.
Upon extracting numeric representations of input text data, one refinement might be, given an input body of text, to determine a set of quantitative statistics for the articles of speech listed above and perhaps classify documents based on them. For example, a glut of adverbs might make a copywriter bristle, or excessive use of stop words might be helpful in identifying term papers with content padding. Admittedly, this may not have much bearing on our goal of sentiment analysis.
### Bag of words
When you assess a text statement as positive or negative, what are some contextual clues you use to assess its polarity (i.e., whether the text has positive, negative, or neutral sentiment)? One way is connotative adjectives: something called "disgusting" is viewed as negative, but if the same thing were called "beautiful," you would judge it as positive. Colloquialisms, by definition, give a sense of familiarity and often positivity, whereas curse words could be a sign of hostility. Text data can also include emojis, which carry inherent sentiments.
Understanding the polarity influence of individual words provides a basis for the [_bag-of-words_][3] (BoW) model of text. It considers a set of words or vocabulary and extracts measures about the presence of those words in the input text. The vocabulary is formed by considering text where the polarity is known, referred to as _labeled training data_. Features are extracted from this set of labeled data, then the relationships between the features are analyzed and labels are associated with the data.
The name "bag of words" illustrates what it utilizes: namely, individual words without consideration of spatial locality or context. A vocabulary typically is built from all words appearing in the training set, which tends to be pruned afterward. Stop words, if not cleaned prior to training, are removed due to their high frequency and low contextual utility. Rarely used words can also be removed, given the lack of information they provide for general input cases.
It is important to note, however, that you can (and should) go further and consider the appearance of words beyond their use in an individual instance of training data, or what is called [_term frequency_][4] (TF). You should also consider the counts of a word through all instances of input data; typically the infrequency of words among all documents is notable, which is called the [_inverse document frequency_][5] (IDF). These metrics are bound to be mentioned in other articles and software packages on this subject, so having an awareness of them can only help.
BoW is useful in a number of document classification applications; however, in the case of sentiment analysis, things can be gamed when the lack of contextual awareness is leveraged. Consider the following sentences:
* We are not enjoying this war.
* I loathe rainy days, good thing today is sunny.
* This is not a matter of life and death.
The sentiment of these phrases is questionable for human interpreters, and by strictly focusing on instances of individual vocabulary words, it's difficult for a machine interpreter as well.
Groupings of words, called _n-grams_, can also be considered in NLP. A bigram considers groups of two adjacent words instead of (or in addition to) the single BoW. This should alleviate situations such as "not enjoying" above, but it will remain open to gaming due to its loss of contextual awareness. Furthermore, in the second sentence above, the sentiment context of the second half of the sentence could be perceived as negating the first half. Thus, spatial locality of contextual clues also can be lost in this approach. Complicating matters from a pragmatic perspective is the sparsity of features extracted from a given input text. For a thorough and large vocabulary, a count is maintained for each word, which can be considered an integer vector. Most documents will have a large number of zero counts in their vectors, which adds unnecessary space and time complexity to operations. While a number of clever approaches have been proposed for reducing this complexity, it remains an issue.
### Word embeddings
Word embeddings are a distributed representation that allows words with a similar meaning to have a similar representation. This is based on using a real-valued vector to represent words in connection with the company they keep, as it were. The focus is on the manner that words are used, as opposed to simply their existence. In addition, a huge pragmatic benefit of word embeddings is their focus on dense vectors; by moving away from a word-counting model with commensurate amounts of zero-valued vector elements, word embeddings provide a more efficient computational paradigm with respect to both time and storage.
Following are two prominent word embedding approaches.
#### Word2vec
The first of these word embeddings, [Word2vec][6], was developed at Google. You'll probably see this embedding method mentioned as you go deeper in your study of NLP and sentiment analysis. It utilizes either a _continuous bag of words_ (CBOW) or a _continuous skip-gram_ model. In CBOW, a word's context is learned during training based on the words surrounding it. Continuous skip-gram learns the words that tend to surround a given word. Although this is more than what you'll probably need to tackle, if you're ever faced with having to generate your own word embeddings, the author of Word2vec advocates the CBOW method for speed and assessment of frequent words, while the skip-gram approach is better suited for embeddings where rare words are more important.
#### GloVe
The second word embedding, [_Global Vectors for Word Representation_][7] (GloVe), was developed at Stanford. It's an extension to the Word2vec method that attempts to combine the information gained through classical global text statistical feature extraction with the local contextual information determined by Word2vec. In practice, GloVe has outperformed Word2vec for some applications, while falling short of Word2vec's performance in others. Ultimately, the targeted dataset for your word embedding will dictate which method is optimal; as such, it's good to know the existence and high-level mechanics of each, as you'll likely come across them.
#### Creating and using word embeddings
Finally, it's useful to know how to obtain word embeddings; in part 2, you'll see that we are standing on the shoulders of giants, as it were, by leveraging the substantial work of others in the community. This is one method of acquiring a word embedding: namely, using an existing trained and proven model. Indeed, myriad models exist for English and other languages, and it's possible that one does what your application needs out of the box!
If not, the opposite end of the spectrum in terms of development effort is training your own standalone model without consideration of your application. In essence, you would acquire substantial amounts of labeled training data and likely use one of the approaches above to train a model. Even then, you are still only at the point of acquiring understanding of your input-text data; you then need to develop a model specific for your application (e.g., analyzing sentiment valence in software version-control messages) which, in turn, requires its own time and effort.
You also could train a word embedding on data specific to your application; while this could reduce time and effort, the word embedding would be application-specific, which would reduce reusability.
### Available tooling options
You may wonder how you'll ever get to a point of having a solution for your problem, given the intensive time and computing power needed. Indeed, the complexities of developing solid models can be daunting; however, there is good news: there are already many proven models, tools, and software libraries available that may provide much of what you need. We will focus on [Python][8], which conveniently has a plethora of tooling in place for these applications.
#### SpaCy
[SpaCy][9] provides a number of language models for parsing input text data and extracting features. It is highly optimized and touted as the fastest library of its kind. Best of all, it's open source! SpaCy performs tokenization, parts-of-speech classification, and dependency annotation. It contains word embedding models for performing this and other feature extraction operations for over 46 languages. You will see how it can be used for text analysis and feature extraction in the second article in this series.
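As a quick taste, a minimal spaCy session might look like the following; it assumes the package and its small English model have been installed with `pip install spacy` and `python -m spacy download en_core_web_sm`.

```python
import spacy

# Load the small English pipeline; assumes en_core_web_sm has been downloaded.
nlp = spacy.load("en_core_web_sm")

doc = nlp("SpaCy parses input text and extracts features.")
for token in doc:
    # Print each token's text, part-of-speech tag, and dependency label.
    print(token.text, token.pos_, token.dep_)
```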
#### vaderSentiment
The [vaderSentiment][10] package provides a measure of positive, negative, and neutral sentiment. As the [original paper][11]'s title ("VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text") indicates, the models were developed and tuned specifically for social media text data. VADER was trained on a thorough set of human-labeled data, which included common emoticons, UTF-8 encoded emojis, and colloquial terms and abbreviations (e.g., meh, lol, sux).
For given input text data, vaderSentiment returns a 3-tuple of polarity score percentages. It also provides a single scoring measure, referred to as _vaderSentiment's compound metric_. This is a real-valued measurement within the range **[-1, 1]** wherein sentiment is considered positive for values greater than **0.05**, negative for values less than **-0.05**, and neutral otherwise.
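A minimal usage sketch follows. The exact numbers you get back depend on the input text, but the returned dictionary always contains the `neg`, `neu`, `pos`, and `compound` fields.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("I love this! It works great :)")

# scores is a dict of the form {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
print(scores)

# Apply the thresholds described above to the compound metric.
if scores["compound"] > 0.05:
    print("positive")
elif scores["compound"] < -0.05:
    print("negative")
else:
    print("neutral")
```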
In [part 2][2], you will learn how to use these tools to add sentiment analysis capabilities to your designs.
--------------------------------------------------------------------------------
via: https://opensource.com/article/19/4/social-media-sentiment-analysis-python
Author: [Michael McCune][a]
Topic selection: [lujun9972][b]
Translator: [MjSeven](https://github.com/MjSeven)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article was originally translated by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
[a]: https://opensource.com/users/elmiko/users/jschlessman
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/getting_started_with_python.png?itok=MFEKm3gl (Raspberry Pi and Python)
[2]: https://opensource.com/article/19/4/social-media-sentiment-analysis-python-part-2
[3]: https://en.wikipedia.org/wiki/Bag-of-words_model
[4]: https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency
[5]: https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency
[6]: https://en.wikipedia.org/wiki/Word2vec
[7]: https://en.wikipedia.org/wiki/GloVe_(machine_learning)
[8]: https://www.python.org/
[9]: https://pypi.org/project/spacy/
[10]: https://pypi.org/project/vaderSentiment/
[11]: http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf