SPBA: Statistical Principle-Based Approach for Biomedical Named Entity Recognition
Named entity recognition (NER) is an essential prerequisite for many natural language processing tasks, such as relation extraction. Current machine learning models for NER are mostly non-interpretable, which makes it difficult to improve their performance through error analysis. It is also non-trivial for them to incorporate fine-grained linguistic knowledge, such as long-distance constraints. Furthermore, their performance often deteriorates significantly when they are applied to a domain they were not trained on. Biomedical named entity recognition (BNER) is one of the hardest NER problems. Given the increasing interest in applying text mining techniques to healthcare-related literature, a more robust approach to NER is highly desirable.
We introduce the statistical principle-based approach (SPBA) to BNER. SPBA is a novel machine learning approach that can leverage fine-grained linguistic knowledge, represented as human-readable semantic concepts and patterns. SPBA automatically groups similar patterns into clusters and picks a representative pattern from each cluster as a principle. The alignment matching algorithm in SPBA is flexible: it allows insertions and deletions, and the inserted words and concepts are weighted using logistic regression according to whether they give rise to true or false positive instances.
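To make the idea of alignment matching with insertions and deletions concrete, here is a minimal sketch in Python. It is not the SPBA implementation: the concept labels ([GENE], [DISEASE]), the function name align_score, and all scores are hypothetical, and the per-word insertion weights are hand-set stand-ins for weights that SPBA would learn with logistic regression.

```python
from typing import Dict, List

NEG_INF = float("-inf")


def align_score(
    pattern: List[str],
    tokens: List[str],
    insert_weight: Dict[str, float],
    match_score: float = 1.0,      # assumed reward for a matched slot
    default_insert: float = -0.5,  # assumed cost of an unseen inserted word
    delete_penalty: float = -1.0,  # assumed cost of skipping a pattern slot
) -> float:
    """Best score for aligning every pattern slot with every token.

    Insertion: a token that the pattern does not contain.
    Deletion: a pattern slot that the text does not realize.
    """
    n, m = len(pattern), len(tokens)
    # dp[i][j] is the best score aligning pattern[:i] with tokens[:j].
    dp = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = dp[i][j]
            if cur == NEG_INF:
                continue
            if i < n and j < m and pattern[i] == tokens[j]:
                # Slot and token agree: consume both.
                dp[i + 1][j + 1] = max(dp[i + 1][j + 1], cur + match_score)
            if j < m:
                # Insertion: consume a token only, using its own weight.
                w = insert_weight.get(tokens[j], default_insert)
                dp[i][j + 1] = max(dp[i][j + 1], cur + w)
            if i < n:
                # Deletion: skip a pattern slot.
                dp[i + 1][j] = max(dp[i + 1][j], cur + delete_penalty)
    return dp[n][m]


# Hypothetical principle and hand-set insertion weights.
principle = ["[GENE]", "mutation", "in", "[DISEASE]"]
weights = {"point": -0.1, "no": -2.0}

exact = align_score(principle, ["[GENE]", "mutation", "in", "[DISEASE]"], weights)
inserted = align_score(
    principle, ["[GENE]", "point", "mutation", "in", "[DISEASE]"], weights
)
deleted = align_score(principle, ["[GENE]", "mutation", "[DISEASE]"], weights)
```

In this sketch, an inserted word with a mild weight (such as "point") barely lowers the score, while one that tends to signal false positives (such as "no") lowers it sharply, which is the role the abstract assigns to the learned insertion weights.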
We evaluate SPBA on benchmark corpora, including corpora both related and unrelated to the training set (the latter serve as a robustness check). Results demonstrate that SPBA is effective, more portable, and outperforms most state-of-the-art approaches. In addition, the learned principles are human-comprehensible, and error correction is more intuitive than with state-of-the-art BNER approaches.
About Dr. Wen-Lian HSU
Dr. Wen-Lian Hsu (F’06) is currently the Director and a Distinguished Research Fellow of the Institute of Information Science, Academia Sinica, Taiwan. He received a B.S. in Mathematics from National Taiwan University and a Ph.D. in operations research from Cornell University in 1980. He was a tenured associate professor at Northwestern University before joining the Institute of Information Science at Academia Sinica as a research fellow in 1989. Dr. Hsu’s earlier contributions were in graph algorithms, and he has applied similar techniques to tackle computational problems in biology and natural language. In 1993, he developed a Chinese input software, GOING, which has since revolutionized Chinese input on computers. Dr. Hsu is particularly interested in applying natural language processing techniques to understanding DNA sequences, to protein sequences, structures, and functions, and to biological literature mining. Dr. Hsu received the Outstanding Research Award from the National Science Council in 1991, 1994, and 1996, the first K. T. Li Research Breakthrough Award in 1999, and the Teco Award in 2008, and was elected an IEEE Fellow in 2006. He was the president of the Artificial Intelligence Society in Taiwan from 2001 to 2002 and the president of the Computational Linguistic Society of Taiwan from 2011 to 2012.