This commit is contained in:
Xingyu Wang 2020-04-13 23:53:20 +08:00
parent e232ab3c4a
commit c2e05718dd
2 changed files with 137 additions and 136 deletions

View File

@ -1,136 +0,0 @@
[#]: collector: (lujun9972)
[#]: translator: (wxy)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (How I'm using AI to translate 'wash your hands' in 500 languages)
[#]: via: (https://opensource.com/article/20/4/ai-translation)
[#]: author: (Daniel Whitenack https://opensource.com/users/datadan)
How I'm using AI to translate 'wash your hands' in 500 languages
======
By using both human and machine-generated translations, key health
phrases can be translated into local languages spoken all over the
world.
![Two diverse hands holding a globe][1]
You might not know, but there are currently [7,117 languages spoken in the world][2]. Not dialects, but living languages! However, much of the world's digital media is available in only a couple dozen languages, and translation platforms like Google Translate only support around 100 languages. This reality means that there are billions of people around the world that are marginalized due to a lack of timely access to information. The current coronavirus (COVID-19) pandemic has made this painfully clear, and it has stressed the need for immediate, rapid translation of health-related phrases (like "wash your hands" or "keep your distance") into the long tail of languages.
To this end, I applied state-of-the-art AI techniques to construct something close to the phrase "wash your hands" in 544 languages and counting (my GPUs are still running). Multilingual Unsupervised and Supervised Embeddings (MUSE) methods are used to train cross-lingual word embeddings between each of 544 languages and English. These embeddings then allow for the extraction of a phrase similar to the target phrase from existing documents.
I performed this work in collaboration with my colleagues at SIL International, who have gathered even more human translations of the phrase. The combination of these human translations and some of my machine translations can be searched on [this Ethnologue guide page][3] (machine-generated phrases are indicated with a little robot icon), and more translations will be added as they are generated/gathered.
### Leveraging existing corpora
SIL International has done linguistic work in over 2000 languages and is currently managing over 1600 language projects. Thus, as I approached this particular problem, I knew that we had likely already translated the phrase "wash your hands" and/or similar phrases many times into hundreds of languages, and that guess paid off in spades. I was able to quickly gather documents (mostly completed shell book templates, educational materials, and Bibles) from our archives in over 900 languages. Each of these documents has an English parallel, which necessarily includes the phrase "wash your hands" and/or similar phrases like "wash your face." Moreover, each of these documents is very high quality and translated and checked in cooperation with the local language communities.
That is quite the multilingual data set. However, there are two problems to overcome. First, this data included thousands of samples for most languages, which is in contrast to the millions used to train machine translation models. Second, even if the documents include the phrase "wash your hands" in the target language, we don't know the exact location of the phrase within the surrounding text.
We could certainly exploit some of the latest tricks in [machine translation for low resource languages][4], but it would take some time to tune automated methods for rapidly adapting translation models in each language pair. Moreover, many of the languages we are targeting have no existing baseline with which we could compare evaluation metrics, e.g., [BLEU score][5]. Given the pressing concerns about the Coronavirus pandemic, we wanted to move a bit faster than that (although we plan to return to this problem in the future).
I opted to try and construct the phrase "wash your hands" by finding the phrase itself or components of the phrase (like "wash your" or "your hands") in existing documents. To find these, I trained cross-lingual embedding for each {English, Target Language} pair using [Multilingual Unsupervised and Supervised Embedding (MUSE)][6] from Facebook Research. MUSE takes monolingual word embeddings as input (I used [_fasttext_][7] to generate these) and learns a mapping from the English to the target embedding space using adversarial methods. The output of this process is cross-lingual word embeddings.
![Using fasttext along with MUSE to perform cross-language embedding][8]
Once the cross-lingual embeddings are generated, we can get to finding the phrase components in the target language documents. As it turns out, the phrase "wash your face" was most clearly used throughout the documents along with instances of "hands," "wash your," etc. in isolation. For each of the languages, I search through n-grams in areas where I expected the phrase to appear (based on its usage in the English parallel). N-grams were vectorized using the cross-lingual embedding and compared with vectorized versions of the English phrases using various distance metrics. The n-grams that were "closest" to the English phrases in the embedding space were determined to be the target language matches.
Finally, component phrases matching their English counterparts were combined to generate the phrase "wash your hands" in the target language. This combination utilizes the cross-lingual embedding again to make sure that the components are combined in an appropriate manner. For example, if we matched the phrase "wash your feet" in the target language, the n-gram corresponding to "feet" must be replaced with the n-gram corresponding to "hands." Here's an example for Belize Kriol English:
![][9]
There were, of course, some assumptions that were made during this matching process, and it is entirely possible that this procedure does not produce grammatically correct predictions. For example, I assumed that in most languages, the word for "hands" and the word for "feet" are both one token long (with tokens being separated by spaces and punctuation). This is certainly not always the case. This could create a bad word salad something like "and wash the and hand you" or similar. Hopefully, we can overcome some of these limitations and extend the system in the future, but, for now, we chose to reinforce the idea with graphics.
We adapted the World Health Organization's hand washing instructions into a template PNG image. We then took our translated and generated phrases and rendered them into the hand washing image using a combination of Bash and Go scripts. In this way, the idea of proper hand washing is emphasized in both text and imagery (just in case our generated translations are awkward).
![][10]
### Results
Thus far, I've been able to train cross-lingual embeddings for 544 languages. I used the above-discussed method to try and construct "wash your hands" for all of these languages. Because I don't have aligned data for many of the language pairs, I used separate holdout documents also containing components of "wash your hands" to help validate the tokens in the constructed phrase. This gives us some confidence in the translations that we publicly release (at least that they contain information indicating washing and/or hands). In addition, I compared the method with language pairs that are also supported by Google Translate and/or have available human translations. Here's a sample of the translations with language stats from [the Ethnologue][11]:
#### Language: Italian [ita]
Location: Italy
Population: 68,000,000
Our system: "làvati la mani"
Google Translate: "Lavati le mani"
#### Language: Bulgarian [bul]
Location: Bulgaria
Population: 8,000,000
Our system: "умий ръцете"
Google Translate: "Измий си ръцете"
#### Language: Dutch [nld]
Location: Netherlands
Population: 24,000,000
Our system: "wast uw handen"
Google Translate: "Was je handen"
#### Language: Pijin [pis]
Location: Solomon Islands
Population: 550,000
Our system: "wasim han"
Google Translate: Not supported
#### Language: Tikar [tik]
Location: Cameroon
Population: 110,000
Our system: "ɓɔsi fyàʼ"
Google Translate: Not supported
#### Language: Waffa [waj]
Location: Papua New Guinea
Population: 1,300
Our system: "yaakuuvaitana nnikiiyauvaa fini"
Google Translate: Not supported
The constructed phrases are similar to reference translations or appear to be alternative ways of saying "wash your hands." For example, in Bulgarian, I predict "умий ръцете," and Google Translate predicts "Измий си ръцете." However, if I back-translate my prediction using Google Translate, I still get "wash your hands." There is some uncertainty where I can't compare to reference translations (e.g., Pijin [pis] from the Solomon Islands) or human-annotated spans, but I can still validate that the word for wash (wasim) and the word for hands (han) are used in other reference documents that are necessarily talking about washing, or hands, respectively. About 15% of the translations could be validated using this method, and I hope to validate more as I gather reference dictionaries.
Note, I used at most about 7,000 sentences in each language to get the above translations, even for high-resource languages like Italian. I also did not rely on aligned sentences between the language pairs. Despite this very data-scarce, unsupervised scenario, I was still able to obtain phrases similar to that of Google Translate for languages supported by both systems. This demonstrates the potential utility of this sort of "hybrid" approach (unsupervised alignment of word embeddings + rule-based matching) for translating short phrases into languages where very little data exists.
Note—I'm definitely not saying that this is a solution to the problem of information spread about Coronavirus and other health-related issues. There are still a lot of things to explore and formally evaluate here, and we are working on that. In many cases, this approach won't be able to help construct important informational material in hundreds of languages. However, I think that we should all be trying to develop creative solutions to problems related to the current crisis. Maybe this is one piece of a very large puzzle.
You can view the complete list of validated translations plus human translations on [this Ethnologue guide page][3]. In addition, a more thorough description and analysis of the system in paper form is forthcoming. We welcome feedback from the public on the translations to help fine-tune the system and, most of all, to make sure that health information gets out to marginalized language communities around the world.
### Create your own hand washing posters
We have open sourced [the code used to render complex scripts and generate the hand washing posters][12]. This methodology should be able to handle almost all languages and scripts. You can add your own translation of "wash your hands" to a poster to help spread the word or tailor the translations for your own local context. Be sure to share your generated posters on social media with the hashtag #WashYourHands.
### Develop your AI skills
There are so many exciting AI problems out there that can make a huge impact in the world. If you want to solve problems like the one above with AI or if you think your business might need to start leveraging AI for other things (supply chain optimization, recommendation, customer service automation, etc.), don't miss the [_AI Classroom_ training event this May][13]. _AI Classroom_ is an immersive, three-day virtual training event for anyone with at least some programming experience and foundational understanding of mathematics. The training provides a practical baseline for realistic AI development using Python and open source frameworks like TensorFlow and PyTorch. After completing the course, participants will have the confidence to start developing and deploying their own AI solutions.
_This article was republished with permission from <https://datadan.io/blog/wash-your-hands>_
--------------------------------------------------------------------------------
via: https://opensource.com/article/20/4/ai-translation
作者:[Daniel Whitenack][a]
选题:[lujun9972][b]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://opensource.com/users/datadan
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/world_hands_diversity.png?itok=zm4EDxgE (Two diverse hands holding a globe)
[2]: https://www.ethnologue.com/guides/how-many-languages
[3]: https://www.ethnologue.com/guides/health
[4]: https://datadan.io/blog/resources-for-low-resource-machine-translation
[5]: https://en.wikipedia.org/wiki/BLEU
[6]: https://github.com/facebookresearch/MUSE
[7]: https://fasttext.cc/
[8]: https://opensource.com/sites/default/files/uploads/ai-language-translation-wash-your-hands-opensourcedotcom.gif (Using fasttext along with MUSE to perform cross-language embedding)
[9]: https://opensource.com/sites/default/files/uploads/ai-language-translation-wash-your-hands-opensourcedotcom2.gif
[10]: https://opensource.com/sites/default/files/uploads/ai-language-translation-wash-your-hands-opensourcedotcom3.gif
[11]: https://www.ethnologue.com/
[12]: https://github.com/sil-ai/wash-your-hands
[13]: https://datadan.io/

View File

@ -0,0 +1,137 @@
[#]: collector: (lujun9972)
[#]: translator: (wxy)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (How I'm using AI to translate 'wash your hands' in 500 languages)
[#]: via: (https://opensource.com/article/20/4/ai-translation)
[#]: author: (Daniel Whitenack https://opensource.com/users/datadan)
我是如何用 AI 把 “请洗手”翻译成 500 种语言的?
======
> 通过使用人类和机器生成的翻译,可以将关键的健康短语翻译成世界各地的当地语言。
![Two diverse hands holding a globe][1]
你可能不知道,目前世界上有 [7117 种语言][2],不是方言,而是在用的语言! 然而,世界上大部分的数字媒体只采用几十种语言,而像谷歌翻译这样的翻译平台只支持 100 种左右的语言。这样的现实意味着由于缺乏及时获取信息的机会全世界有数十亿人被边缘化。当前的冠状病毒COVID-19大流行已经让人痛苦地意识到了这一点凸显了将健康相关的短语如“<ruby>请洗手<rt>wash your hands</rt></ruby>”或“保持距离”等)即时、快速翻译成小众语言的必要性。
为此我应用了最先进的AI技术用 544 种语言构建出了与“请洗手”相近的短语并进行了统计(我的 GPU 还在运行)。<ruby>多语言无监督和监督嵌入<rt>Multilingual Unsupervised and Supervised Embeddings</rt></ruby>MUSE方法被用来训练这 544 种语言和英语之间的跨语言单词嵌入。然后,这些嵌入方法可以从现有文档中提取出与目标短语相似的短语。
我与 SIL 国际公司的同事们合作完成了这项工作,他们收集了该短语的更多的人工翻译结果。这些人工翻译结果和我的一些机器翻译结果的组合可以在[这个民族语指南页面][3]上搜索到(机器生成的短语用一个小的机器人图标表示),更多的翻译将在生成/收集到的时候加入。
### 利用现有的语料库
SIL 国际公司已经完成了 2000 多种语言的语言工作,目前管理着 1600 多个语言项目。因此,当我解决这个特殊的问题时,我知道我们很可能已经多次将“请洗手”和/或类似的短语翻译成了数百种语言,而这一猜测得到了回报。我很快就从我们的档案库中收集到了超过 900 种语言的文件(主要是完成的贝壳书模板、教材和圣经)。这些文件中的每一份文件都有一个英文的对应语,其中必然包括“请洗手”和/或类似“请洗脸”这样的短语。此外,这些文件的质量都很高,并与当地语言社区合作进行了翻译和检查。
这是相当多语言的数据集。然而,有两个问题需要克服。首先,这个数据包含了大多数语言的数千种样本,这与训练机器翻译模型所使用的数百万个样本形成了鲜明对比。其次,即使文档中包含目标语言中的“请洗手”这个短语,我们也不知道这个短语在周围文本中的确切位置。
我们当然可以利用[低资源语言的机器翻译][4]中的一些最新技巧,但要在每一对语言中快速调整翻译模型的自动化方法,还需要一些时间。此外,我们所针对的许多语言都没有现成的可以与之比较评估指标的基线,例如[BLEU 评分][5]。考虑到对冠状病毒大流行的迫切担忧,我们希望比这更快一点(尽管我们计划在将来再来解决这个问题)。
我选择通过在现有的文档中寻找短语本身或短语的组件(如“请洗”或“你的手”)来尝试构建“请洗手”这个短语。为了找到这些成分,我使用 Facebook Research 的 [多语言无监督和监督嵌入MUSE][6]对每个 {英语、目标语言} 对进行<ruby>跨语言<rt>cross-lingual</rt></ruby>嵌入训练。MUSE 以<ruby>单语言<rt>monolingual</rt></ruby>的单词嵌入作为输入(我使用 [fasttext][7] 来生成这些词),并使用对抗性方法学习从英语到目标嵌入空间的映射。这个过程的输出是<ruby>跨语言<rt>cross-lingual</rt></ruby>的单词嵌入。
![Using fasttext along with MUSE to perform cross-language embedding][8]
一旦产生了跨语言嵌入,我们就可以开始在目标语言文档中寻找短语成分。结果发现,整个文档中清楚地使用了“请洗脸”这个短语以及单独的“手”、“请洗”等词。对于每一种语言,我都通过 n-grams 搜索我预期该短语会出现的地方(根据其在英语平行语中的用法)。使用跨语言嵌入法对 n-grams 进行了矢量化,并使用各种距离指标与英语短语的矢量化版本进行了比较。在嵌入空间中,与英文短语“最接近”的 n-grams 被确定为目标语言匹配。
最后,将与英语对应的成分短语进行组合,生成目标语言中的“请洗手”短语。这种组合方式再次利用跨语言嵌入,确保组件短语的组合方式合适。例如,如果我们在目标语言中匹配“请洗脚”这个短语,就必须将“脚”对应的 n-gram 替换成“手”对应的 n-gram。下面是<ruby>伯利兹·克里奥尔<rt>Belize Kriol</rt></ruby>英语的一个例子:
![][9]
当然,在这个匹配过程中,有一些假设,这个过程完全有可能不能产生语法上正确的预测。例如,我假设在大多数语言中,“手”的单词和“脚”的单词都是一个字元长的(字元由空格和标点符号隔开)。当然,情况肯定不一定是这样。这可能会造成类似于“和洗的和手你”或类似的不好的词条。希望我们可以克服其中的一些局限性,并在未来扩展这个系统,但是,现在,我们选择用图形来强化这个想法。
我们将世界卫生组织的洗手说明改编成了一个 PNG 图片模板。然后,我们把我们翻译和生成的短语,用 Bash 和 Go 脚本的组合将其渲染到洗手图像中。这样,在文字和图像中都强调了正确洗手的理念(只是为了防止我们生成的翻译有问题)。
![][10]
### 结果
到目前为止,我已经能够训练出 544 种语言的跨语言嵌入。我使用上述讨论过的方法尝试为所有这些语言构建“请洗手”这个短语。因为我没有许多语言对的对齐数据,所以我使用了同样包含“请洗手”成分的单独的保留文档来帮助验证构造短语中的字元。这让我们对公开发布的翻译版本有了一些信心(至少它们包含了表示“洗”和/或“手”的信息)。此外,我还将该方法与谷歌翻译支持的和/或有可用的人工翻译的语言对进行了比较。以下是来自 [Ethnologue][11] 带有语言统计的翻译样本。
**语言:意大利语 [Ita]**
- 地点:意大利
- 人口: 68,000,000
- 我们的系统: làvati la mani
- 谷歌翻译: Lavati le mani
**语言:保加利亚语 [bul]**
- 地点:保加利亚
- 人口8,000,000
- 我们的系统:умий ръцете
- 谷歌翻译:Измий си ръцете
**语言: 荷兰语 [nld]**
- 地点:荷兰
- 人口24,000,000,000
- 我们的系统wast uw handen
- 谷歌翻译Was je handen
**语言: Pijin [pis]**
- 地点:所罗门群岛
- 人口: 550,000
- 我们的系统wasim han
- 谷歌翻译:不支持
**语言Tikar [tik]**
- 地点:喀麦隆
- 人口110,000
- 我们的系统ɓɔsi fyàʼ
- 谷歌翻译:不支持
**语言Waffa [waj]**
- 地点:巴布亚新几内亚
- 人口1,300
- 我们的系统yaakuuvaitana nnikiiyauvaa fini
- 谷歌翻译:不支持
构建的短语类似于参考翻译,或者似乎是“请洗手”的另一种说法。例如,在保加利亚语中,我预测为“умий ръцете”,而谷歌翻译预测为“Измий си ръцете”。然而,如果我用谷歌翻译回译我的预测,我还是会得到“请洗手”。有一些不确定的地方,我无法与参考译文(例如,所罗门群岛的 Pijin [pis]或人类注释的跨度进行比较但我仍然可以验证“洗”wasim和“手”han分别用在其他必定是谈论洗或手的参考文件中。大约有 15% 的译文可以用这个方法验证,我希望在收集参考文献字典的过程中,能验证更多的译文。
请注意,我最多使用了每种语言中大约 7000 个句子来得到上述译文,即使是意大利语这样的高资源语言也是如此。我也没有依赖语言对之间的对齐句子。尽管在这种数据非常稀缺、无监督的情况下,但对于两个系统都支持的语言,我仍然能够获得类似于谷歌翻译的短语。这证明了这种“混合”方法(无监督的单词嵌入+基于规则的匹配)在将短语翻译成数据非常少的语言中的潜在用途。
注意:我绝对不是说这是解决冠状病毒和其他健康相关的信息传播问题的解决方案。这里仍有很多东西需要探索和正式评估,我们正在为此努力。在很多情况下,这种方法无法帮助构建数百种语言的重要信息资料。但是,我认为,我们都应该尝试着为当前危机的相关问题制定创造性的解决方案。也许这只是一个非常大的拼图中的一块。
你可以在[这个民族语言指南][3]上查看经过验证的译文加上人工翻译的完整列表。此外,我们即将以论文的形式对这一系统进行更深入的描述和分析。我们欢迎公众对翻译进行反馈,以帮助系统进行微调,最重要的是,确保将健康信息传递给世界各地的边缘化语言社区。
### 制作自己的洗手海报
我们已经开源了[用于渲染复合的脚本和生成洗手海报的代码][12]。这种方法应该能够处理几乎所有的语言和脚本。你可以在海报中添加你自己的“请洗手”的翻译,以帮助传播,或者根据自己的本地语境进行翻译。请务必在社交媒体上以 #WashYourHands 为标签分享你生成的海报。
### 培养你的 AI 技能
有很多令人兴奋的 AI 问题,可以给世界带来巨大的影响。如果你想用人工智能解决像上面提到的问题,或者你认为你的企业可能需要开始利用人工智能来做其他事情(供应链优化、推荐、客户服务自动化等),那么不要错过今年 5 月的[AI 课堂培训活动][13]。*AI 课堂*是一个沉浸式的、为期三天的虚拟培训活动,适合至少有一定编程经验和数学基础知识的人参加。该培训提供了使用 Python 和开源框架(如 TensorFlow 和 PyTorch进行现实的 AI 开发的实用基础知识。完成课程后,学员将有信心开始开发和部署自己的 AI 解决方案。
本文经许可转载自 https://datadan.io/blog/wash-your-hands
--------------------------------------------------------------------------------
via: https://opensource.com/article/20/4/ai-translation
作者:[Daniel Whitenack][a]
选题:[lujun9972][b]
译者:[wxy](https://github.com/wxy)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://opensource.com/users/datadan
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/world_hands_diversity.png?itok=zm4EDxgE (Two diverse hands holding a globe)
[2]: https://www.ethnologue.com/guides/how-many-languages
[3]: https://www.ethnologue.com/guides/health
[4]: https://datadan.io/blog/resources-for-low-resource-machine-translation
[5]: https://en.wikipedia.org/wiki/BLEU
[6]: https://github.com/facebookresearch/MUSE
[7]: https://fasttext.cc/
[8]: https://opensource.com/sites/default/files/uploads/ai-language-translation-wash-your-hands-opensourcedotcom.gif (Using fasttext along with MUSE to perform cross-language embedding)
[9]: https://opensource.com/sites/default/files/uploads/ai-language-translation-wash-your-hands-opensourcedotcom2.gif
[10]: https://opensource.com/sites/default/files/uploads/ai-language-translation-wash-your-hands-opensourcedotcom3.gif
[11]: https://www.ethnologue.com/
[12]: https://github.com/sil-ai/wash-your-hands
[13]: https://datadan.io/