The NLP group of the AI Lab at Ehime University studies the following research topics.
- HPSG Parsing
- Grammar Theory
- Machine Learning
- Deep Learning
- Distributed Representations
- Symbol Grounding
- POS tagging
- Textual Entailment Recognition
- Sentiment Analysis
- Machine Translation
- Automatic Summarization
- Grammatical Error Correction
This group studies parsing technologies for Head-driven Phrase Structure Grammar (HPSG). HPSG is a linguistic theory of syntactic phrase structure, and is known as one of the most sophisticated grammar theories, alongside Generative Grammar, Tree-Adjoining Grammar (TAG), Lexical Functional Grammar (LFG) and Combinatory Categorial Grammar (CCG). HPSG is a lexicalized grammar, which explains linguistic phenomena elegantly with a small number of grammar rules (phrase structure rules and lexical rules) and a large number of complex lexical entries. In HPSG, grammar rules, lexical entries and parse trees are defined with feature structures, which are labeled graph structures. Unification of feature structures enforces linguistic constraints and builds phrase structures. This group studies how to develop HPSG grammars and how to make HPSG parsing both efficient and accurate.
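As a rough illustration of the unification operation mentioned above, the sketch below unifies feature structures encoded as nested Python dicts. This is a toy model only: real HPSG feature structures are typed graphs with structure sharing, and this is not the group's implementation.

```python
def unify(fs1, fs2):
    """Unify two feature structures (nested dicts).

    Returns the unified structure, or None when two atomic values
    clash, i.e. when the constraints are inconsistent.
    """
    if not isinstance(fs1, dict) or not isinstance(fs2, dict):
        return fs1 if fs1 == fs2 else None  # atomic values must match
    result = dict(fs1)
    for feat, val in fs2.items():
        if feat in result:
            sub = unify(result[feat], val)
            if sub is None:
                return None  # inconsistent constraint: unification fails
            result[feat] = sub
        else:
            result[feat] = val  # new constraint is simply added
    return result

# A verb demanding a 3rd-person-singular subject unifies with a
# singular nominative subject, but fails against a plural one:
subj = {"AGR": {"PER": "3", "NUM": "sg"}}
print(unify(subj, {"AGR": {"NUM": "sg"}, "CASE": "nom"}))  # succeeds
print(unify(subj, {"AGR": {"NUM": "pl"}}))                 # None
```

Unification is thus both a well-formedness check and a way of accumulating constraints, which is exactly how HPSG builds phrase structures.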
The following are our research topics on HPSG parsing.
- Corpus-oriented grammar development for HPSG: HPSG can provide precise syntactic analyses of sentences, but it was once considered a useless grammar theory because manually developed HPSG grammars could not analyze real-world sentences, such as newspaper text. This is mainly because manually developed lexical entries were not sufficient to cover real-world linguistic phenomena. We proposed a new grammar development methodology, which we call “corpus-oriented grammar development.” In our methodology, lexical entries are inductively acquired from the Penn Treebank, the best-known corpus annotated with CFG parse trees. By converting the CFG parse trees in the Penn Treebank into HPSG parse trees, consistent lexical entries can be acquired from the resulting HPSG parse trees. The developed grammar is called an “HPSG treebank grammar,” and it is robust enough to analyze newspaper text and biomedical papers.
- HPSG parsing with a supertagger: This group studies probabilistic HPSG models with a supertagger for fast and accurate parsing. A supertagger is a probabilistic model that selects lexical entries using word and POS n-gram features. Supertagging has recently become well known to drastically improve parsing accuracy and speed in CCG, HPSG and CDG parsing. We proposed a conditional random field model for HPSG in which the supertagger is introduced as its reference distribution. This model properly incorporates the supertagging probabilities into the probabilistic model of parse trees, and achieved both a speed-up (around 2.4 times faster) and higher accuracy (up from 86.3% to 88.8%).
- Deterministic shift-reduce parsing for HPSG: Many parsing techniques assume the use of a packed parse forest for efficient and accurate parsing. However, they suffer from an inherent problem that derives from the locality restriction of the packed parse forest. Deterministic parsing is one solution: it achieves simple and fast parsing without the mechanisms of the packed parse forest by accurately choosing search paths. We proposed a new deterministic shift-reduce parsing algorithm and its variants for unification-based grammars. Deterministic parsing cannot simply be applied to unification-based grammars because parsing often fails due to their hard constraints. Our algorithm therefore uses default unification, which almost always succeeds by overwriting inconsistent constraints in the grammar.
- Japanese HPSG parsing with case analysis: Japanese dependency parsing has already been studied by several NLP groups. However, Japanese dependency parsing generally gives us only word dependencies without case labels. This is insufficient for other applications because postpositional particles in Japanese are ambiguous, and surface cases cannot be identified from word dependencies alone. For example, suppose the sentence “太郎(Taro) は(wa-PP) 花子(Hanako) が(ga-PP) 好きだ(love)” is given. We can tell that “太郎(Taro)” and “花子(Hanako)” depend on “好きだ(love),” but the agent and object of “好きだ(love)” are ambiguous: the sentence can be interpreted as both “Taro loves Hanako” and “Hanako loves Taro.” Identifying the cases of a predicate’s arguments is called case analysis. Currently, we are investigating Japanese HPSG parsing with case analysis.
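The shift-reduce strategy above can be illustrated in miniature with a greedy parser over atomic categories. The rules and sentence below are hypothetical; the actual work uses a trained classifier to choose actions and default unification over feature structures rather than a fixed rule table.

```python
# Toy grammar: (left child, right child) -> parent category.
RULES = {("DET", "N"): "NP", ("V", "NP"): "VP", ("NP", "VP"): "S"}

def shift_reduce(tagged):
    """Greedy deterministic shift-reduce: reduce the top two stack
    items whenever a rule licenses it, otherwise shift the next word.
    Returns the final tree, or None on a dead end (no backtracking)."""
    stack, buffer = [], list(tagged)
    while buffer or len(stack) > 1:
        if len(stack) >= 2 and (stack[-2][0], stack[-1][0]) in RULES:
            right, left = stack.pop(), stack.pop()
            stack.append((RULES[(left[0], right[0])], [left, right]))
        elif buffer:
            word, tag = buffer.pop(0)
            stack.append((tag, word))
        else:
            return None  # parsing fails: hard constraints block all actions
    return stack[0]

tree = shift_reduce([("the", "DET"), ("dog", "N"), ("barked", "VP")])
print(tree)  # an S-rooted tree, built without a packed parse forest
```

In a unification-based grammar the reduce step is a unification that can fail; replacing it with default unification (which overwrites the inconsistent constraints) is what keeps the deterministic parser from hitting the `None` branch.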
This group also studies grammar theories. The following are our research topics on grammar theory.
- Left-corner transform for Combinatory Categorial Grammar (CCG) trees: Generally, parsing requires two-dimensional tables for analyzing ambiguous packed parse trees, which consumes memory and time. CCG has special grammar rules that can delay rule applications and consequently change the shape of parse trees. We are investigating the left-corner transform of CCG trees, which converts arbitrary CCG trees into linear chains. We expect this to enable fast and memory-efficient incremental CCG parsing.
- Argument cluster coordination in HPSG: A sentence such as “He gave a teacher an apple and a policeman a flower” is generally difficult to analyze because “gave” seems to subcategorize for the coordinated arguments “a teacher an apple” and “a policeman a flower.” The coordinated arguments are called an argument cluster, and this linguistic phenomenon is called argument cluster coordination. We are investigating argument cluster coordination in HPSG.
In the 1990s, large manually annotated corpora were developed in the field of NLP, and the parameters of NLP systems came to be learned automatically from these corpora using techniques developed in the field of machine learning. Currently, many NLP researchers directly investigate machine learning techniques to develop high-precision NLP systems, rather than using machine learning as a black-box tool. Machine learning is a fundamental technology for much research in artificial intelligence.
The following are our research topics on machine learning.
- Online feature selection based on ℓ1-regularized logistic regression: Finding good features is one of the most important concerns in many fields for improving the prediction performance of classifiers. Online grafting is one solution for finding useful features in an extremely large feature set: given a sequence of features, it selects or discards each feature one at a time. Online grafting is preferable in that it selects features incrementally, and it is defined as an optimization problem based on ℓ1-regularized logistic regression. However, its learning is inefficient because of frequent parameter optimization. We proposed two methods that improve the efficiency of online grafting by testing multiple features simultaneously. Experiments showed that our methods significantly improve the efficiency of online grafting; although they are approximations, the deterioration in prediction performance was negligibly small.
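The inclusion test at the heart of grafting can be sketched as follows: a feature whose weight is currently zero enters the model only if the magnitude of the log-likelihood gradient with respect to that weight exceeds the ℓ1 penalty λ. The NumPy sketch below is illustrative only: the variable names are our own, and the re-optimization step performed after each inclusion is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grafting_test(X_sel, w, x_cand, y, lam):
    """Grafting inclusion test for one candidate feature column x_cand.

    X_sel, w: the currently selected feature columns and their weights.
    Passes iff |dL/dw_cand| at w_cand = 0 exceeds the l1 penalty lam.
    """
    margin = X_sel @ w                  # current model scores (zeros if empty)
    residual = y - sigmoid(margin)      # per-example prediction error
    grad = x_cand @ residual / len(y)   # log-likelihood gradient at zero
    return abs(grad) > lam

# An informative feature passes the test; pure noise rarely does.
rng = np.random.default_rng(0)
x_signal = rng.normal(size=200)
y = (x_signal + 0.1 * rng.normal(size=200) > 0).astype(float)
passes = grafting_test(np.zeros((200, 0)), np.zeros(0), x_signal, y, lam=0.05)
print(passes)  # True: the feature would be added to the model
```

Testing many candidate columns at once against this criterion, instead of one per optimization round, is the kind of batching that the proposed approximations exploit.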
Deep Learning for Natural Language Processing
As is well known, deep learning is intensively studied in AI all over the world. Deep learning is a sub-field of machine learning that studies learning technologies for deep neural networks. Deep learning attracts many researchers because it has achieved significant results in image recognition, owing to efficient online learning, abstraction by pre-training and generalization by dropout. Deep neural networks can learn features automatically. For example, in face recognition, primitive factors such as lines are learned in the first layer, face parts are learned in the middle layers, and whole faces are learned in the final layer. Deep learning is therefore considered to have a high ability for abstraction. We hope that deep learning can also contribute to learning abstract language representations and enable high-precision natural language processing.
The following are our research topics on deep learning.
Since word2vec was released to the public, many NLP researchers have started studying the learning of distributed representations. Distributed representations are dense, low-dimensional real vectors that represent language elements such as words and phrases. word2vec is a tool for acquiring distributed representations of words. It attracts many NLP researchers because it enables the calculation of word-word relations by vector addition and subtraction; for example, (vector for “king”) − (vector for “man”) + (vector for “woman”) is close to (vector for “queen”).
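The analogy above is plain vector arithmetic plus a nearest-neighbor search under cosine similarity. The sketch below uses hand-made toy vectors purely for illustration; real word2vec embeddings are learned from large corpora and typically have hundreds of dimensions.

```python
import numpy as np

# Hand-made toy "embeddings" (illustration only, not learned vectors).
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.3]),
    "queen": np.array([0.9, 0.1, 0.8, 0.3]),
    "man":   np.array([0.1, 0.9, 0.1, 0.2]),
    "woman": np.array([0.1, 0.1, 0.9, 0.2]),
}

def nearest(vec, exclude):
    """Word whose embedding has the highest cosine similarity to vec,
    excluding the query words (standard practice in analogy tests)."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(vec, emb[w]))

target = emb["king"] - emb["man"] + emb["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

With real word2vec models, tools such as gensim expose the same computation through their most-similar queries over learned vectors.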
The following are our research topics on distributed representations.
- Distributed representations for verb-object pairs by using word2vec
Ordinary natural language processing presupposes models of relations between symbols only, and the relations between symbols and the real world were ignored for a long time. This problem is called the symbol grounding problem, and it has recently been studied by many researchers in the fields of image recognition and natural language processing. In particular, caption generation from images and image retrieval by natural language queries are intensively studied.
The following are our research topics on symbol grounding.
- Symbol grounding for natural language processing
Part-Of-Speech (POS) Tagging
Analyzing the parts of speech (POS) in a sentence is called POS tagging. For example, given the sentence “I have a pen.”, a POS tagger outputs “I/PRONOUN have/VERB a/ARTICLE pen/NOUN ./PERIOD”. POS tagging is considered one of the most fundamental and important tasks in NLP. This group is also investigating POS tagging.
- Active learning and Self-learning
- Virtual sample generation
- Dirichlet prior smoothing
- Multi-task learning
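The tagged output shown above can be reproduced with a minimal lexicon-lookup tagger. This is only a toy baseline with a hand-made lexicon; practical taggers, including those studied in the topics above, use learned sequence models that disambiguate words with more than one possible tag.

```python
# Toy lexicon mapping lowercased tokens to tags (hand-made example).
LEXICON = {"i": "PRONOUN", "have": "VERB", "a": "ARTICLE",
           "pen": "NOUN", ".": "PERIOD"}

def pos_tag(sentence):
    """Tag each token by lexicon lookup, defaulting to NOUN for
    unknown words (a common crude fallback)."""
    tokens = sentence.replace(".", " .").split()
    return " ".join(f"{tok}/{LEXICON.get(tok.lower(), 'NOUN')}"
                    for tok in tokens)

print(pos_tag("I have a pen."))
# I/PRONOUN have/VERB a/ARTICLE pen/NOUN ./PERIOD
```

The hard part, which this sketch sidesteps, is ambiguity: a word like “have” can be a main verb or an auxiliary, which is why real taggers condition on context.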
Textual Entailment Recognition
Given two sentences, deciding whether an entailment relation holds between them is called textual entailment recognition. This task is similar to proving logical entailment relations, e.g., “(P⇒Q)∧(Q⇒R)” entails “P⇒R”. In textual entailment recognition, texts are given instead of logical formulas. Many NLP researchers consider that accurate textual entailment recognition could contribute to achieving intelligent QA systems. This group has studied Markov logic networks for textual entailment recognition.
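The propositional example above can be checked mechanically: a premise entails a conclusion iff no truth assignment makes the premise true and the conclusion false. A small truth-table sketch (this is the purely logical case; textual entailment over natural language is far harder, which is why probabilistic frameworks such as Markov logic networks are used):

```python
from itertools import product

def implies(a, b):
    """Material implication: a => b."""
    return (not a) or b

def entails(premise, conclusion, n_vars=3):
    """True iff every assignment satisfying the premise also
    satisfies the conclusion (exhaustive truth-table check)."""
    return all(conclusion(*v)
               for v in product([True, False], repeat=n_vars)
               if premise(*v))

premise = lambda p, q, r: implies(p, q) and implies(q, r)  # (P=>Q) & (Q=>R)
conclusion = lambda p, q, r: implies(p, r)                 # P=>R
print(entails(premise, conclusion))  # True
```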
Analyzing opinions about commercial products is called sentiment analysis. Given a review of a product, a sentiment analyzer infers the review’s rating of the product (e.g., good/bad, 1 to 5 stars). This group is also investigating sentiment analysis, using Rakuten data.
Machine translation is a sub-field of NLP that investigates techniques for automatically translating one language into another. It is studied by many research groups and is one of the biggest research fields in NLP. This group is investigating Japanese-to-English and English-to-Japanese machine translation. Japanese is typologically very different from European languages, and therefore machine translation between Japanese and European languages is more difficult than translation between two European languages. The quality of machine translation from English to Japanese has become as good as that between European languages, thanks to recent studies on pre-ordering English words using syntactic parsing. However, the quality of machine translation from Japanese to English is still low. This group investigates the following research topics for machine translation.
- Pre-ordering for Japanese-to-English machine translation
- Domain adaptation using sampling
Automatic summarization is the task of automatically generating summaries of documents. Given a document and a maximum summary length, the goal of automatic summarization is to produce a summary that contains as much of the information in the original document as possible within the length limit. It is known that automatic summarization methods based on Integer Linear Programming (ILP), which use the maximum summary length as a constraint in the ILP, achieve high accuracy. Our group is investigating ILP-based automatic summarization methods that do not use a given maximum length.
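The optimization behind such methods can be sketched in miniature: select the subset of sentences that maximizes an information score subject to a length budget. The brute-force search below is a stand-in for the ILP formulation (a real system would hand the same objective and constraint to an ILP solver, and its scoring function would be far richer than the distinct-word count assumed here):

```python
from itertools import combinations

def summarize(sentences, max_len):
    """Pick the subset of sentences covering the most distinct words
    while the total word count stays within max_len (brute force)."""
    best, best_cov = [], -1
    for k in range(len(sentences) + 1):
        for subset in combinations(sentences, k):
            if sum(len(s.split()) for s in subset) <= max_len:
                cov = len({w.lower() for s in subset for w in s.split()})
                if cov > best_cov:
                    best, best_cov = list(subset), cov
    return best

doc = ["HPSG parsing is studied here",
       "The group studies HPSG parsing",
       "Summaries must fit a length budget"]
print(summarize(doc, max_len=10))  # picks the first two sentences
```

Dropping the `max_len` constraint, as in the research direction above, means the model itself must decide how much content is enough, which changes the structure of the optimization problem.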
Grammatical Error Correction
Grammatical error correction is an NLP task in which systems aim to automatically find and correct grammatical errors in a document. Recently, annotated corpora for grammatical error correction, such as the Konan JIEM Learner Corpus and the NUCLE corpus, have become available, and many NLP groups are intensively investigating grammatical error correction. Our group is also investigating grammatical error correction using logistic regression.
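As a rough sketch of how logistic regression applies to error correction, the toy example below learns to choose between the articles “a” and “an” from a single context feature. Everything here is a hand-made assumption for illustration (the task, the feature, and the training data); it is not the group's actual system, which targets learner errors with richer features.

```python
import numpy as np

# Toy training data: (following word, correct article).
train = [("apple", "an"), ("pen", "a"), ("orange", "an"),
         ("book", "a"), ("egg", "an"), ("dog", "a")]

def features(word):
    # [bias, following-word-starts-with-a-vowel-letter]
    return np.array([1.0, float(word[0].lower() in "aeiou")])

X = np.array([features(w) for w, _ in train])
y = np.array([1.0 if art == "an" else 0.0 for _, art in train])

# Batch gradient ascent on the logistic log-likelihood.
w = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w += 0.5 * X.T @ (y - p) / len(y)

def correct_article(word):
    """Predict the article to use before `word`."""
    p = 1.0 / (1.0 + np.exp(-features(word) @ w))
    return "an" if p > 0.5 else "a"

print(correct_article("umbrella"), correct_article("pencil"))
```

A practical system frames each error type (articles, prepositions, agreement, and so on) as such a classification problem, with many contextual features instead of one.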