Wednesday 28 November 2007

Performance and Error Analysis in an Open-Domain Question Answering System

I have been reading ``Performance Issues and Error Analysis in an Open-Domain Question Answering System'' by Dan Moldovan, Marius Pasca, Sanda Harabagiu and Mihai Surdeanu (2003). The paper analyses the performance of question answering systems and, in doing so, introduces some interesting related concepts and findings.

Moldovan et al. notes that open-domain question answering systems, which aim to return brief answers in response to natural language questions, represent an advanced application of natural language processing. There are now metrics that allow the overall performance of various systems to be tracked, but up to now there have only been general evaluations and little in-depth error analysis of question answering systems at different levels.

Moldovan et al. notes that the modules of a question answering system are mostly chained serially, so the whole system is affected by its weakest link. Analysing question answering systems at the level of individual modules is therefore likely to be insightful.
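
To make the weakest-link point concrete, here is a rough back-of-the-envelope sketch (my own illustration, not from the paper; the module names and success rates are entirely made up):

```python
# Hedged illustration, not from the paper: if the modules of a QA system
# are chained serially and every module must succeed for the final answer
# to be correct, the end-to-end precision is roughly the product of the
# per-module success rates, so the weakest module dominates.
# All numbers below are made up for the sake of the example.
module_success = {
    "question processing": 0.90,
    "passage retrieval": 0.85,
    "answer extraction": 0.60,  # hypothetical weakest link
}

overall = 1.0
for name, rate in module_success.items():
    overall *= rate

print("approximate end-to-end precision: %.3f" % overall)  # ~0.459
# Raising the 0.60 module to 0.80 lifts the product to ~0.612, whereas the
# same absolute improvement to an already-strong module gains far less.
```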

In general, question answering systems consist of three phases: question processing, document and passage retrieval, and answer processing.
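
As a rough sketch of how those three phases fit together, here is a toy pipeline in Python; the function names, the keyword heuristic and the single-document example are my own placeholders, not the architecture of the system studied in the paper:

```python
# A toy version of the generic three-phase QA pipeline described above.
# Function names and heuristics are placeholders, not the paper's system.

def process_question(question):
    """Phase 1: derive keywords (a real system would also derive an answer type)."""
    keywords = [w.strip("?.,").lower() for w in question.split()]
    return {"keywords": [k for k in keywords if len(k) > 3]}

def retrieve_passages(query, documents):
    """Phase 2: keep passages that mention at least one query keyword."""
    return [d for d in documents if any(k in d.lower() for k in query["keywords"])]

def extract_answer(query, passages):
    """Phase 3: pick a candidate answer (trivially, the first passage)."""
    return passages[0] if passages else None

docs = ["Perth is the capital of Western Australia."]
query = process_question("Where is Perth?")
print(extract_answer(query, retrieve_passages(query, docs)))
# -> "Perth is the capital of Western Australia."
```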

Moldovan et al. notes that the ``performance of a question answering system is tightly coupled with the complexity of questions asked and the difficulty of answer extraction.'' For example, many systems did well against questions such as ``Where is Perth?'' but poorly against ``What is the difference between AM radio stations and FM radio stations?''

Moldovan et al. provides a useful scheme for classifying question answering systems based on ``(a) linguistic and knowledge resources, (b) natural language processing involved, (c) document processing, (d) reasoning methods, (e) whether systems assumed answers are explicitly stated in a document and (f) whether answer fusion is necessary.''

Class 1 systems are capable of processing factual questions.

Class 2 systems have simple reasoning mechanisms. For example, in order to answer the question ``How did Socrates die?'' the system needs to link the question with ``drinking poisoned wine.''

Class 3 systems are capable of answer fusion from different documents. For example, ``What management successions occurred at IBM in the past year?''

Class 4 systems are interactive. These systems are able to use past interactions to discern an appropriate current context. These types of systems are potentially very relevant to my area of research.

Class 5 systems are capable of analogical reasoning and are able to answer speculative questions. For example, ``Is the airline industry in trouble?'' These systems are able to handle situations where the answers are probably not explicitly stated in documents.
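
For my own reference, the five classes can be laid out as a small lookup table; the descriptions below are my paraphrase of the summary above, and the examples are the ones quoted:

```python
# The five classes restated as a small lookup table; descriptions are my
# paraphrase of the summary above, and the examples are the ones quoted.
QA_CLASSES = {
    1: ("factual questions, answer stated directly in a document",
        "Where is Perth?"),
    2: ("simple reasoning needed to link question and answer",
        "How did Socrates die?"),
    3: ("answer fusion across different documents",
        "What management successions occurred at IBM in the past year?"),
    4: ("interactive: past exchanges establish the current context",
        "follow-up questions in a dialogue"),
    5: ("analogical or speculative reasoning, answer not explicit anywhere",
        "Is the airline industry in trouble?"),
}

for number, (description, example) in sorted(QA_CLASSES.items()):
    print("Class %d: %s (e.g. %s)" % (number, description, example))
```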

[I intuitively challenge this paper because each module is different, each module has a different level of impact, and there are specific ones I have a hunch we should concentrate on. I guess this paper seeks to provide more evidence following up on my hunch.]

A large section of the paper examines different types of questions and answers and how different modules of the system were turned on and off to measure their effect. Much of it is interesting but not immediately relevant to my search for practical papers. One section worth noting here concerns the impact of natural language processing on question answering.

The use of natural language processing modules and resources entails a trade-off between answer processing complexity and answer accuracy. Natural language processing resources include parsers, WordNet, answer type hierarchies, named entity recognizers, question and answer semantic transformations and so forth.

The system without any NLP techniques had an answer precision of 0.028. With all NLP techniques enabled except expected answer type derivation, the answer precision was 0.150. With answer type derivation also enabled, the answer precision was 0.468. With all NLP techniques and feedback systems enabled, the answer precision was 0.572.
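
Laid out side by side, the jumps between configurations are striking; the snippet below simply reorganises the figures quoted above (the configuration labels are my shorthand):

```python
# The ablation figures quoted above, arranged so the jumps are easy to see.
# Precision values are as reported in this summary; labels are my shorthand.
ablation = [
    ("no NLP techniques", 0.028),
    ("all NLP except expected answer type derivation", 0.150),
    ("plus expected answer type derivation", 0.468),
    ("plus feedback systems", 0.572),
]

previous = None
for label, precision in ablation:
    if previous is None:
        print("%-48s %.3f" % (label, precision))
    else:
        print("%-48s %.3f  (x%.1f over previous)" % (label, precision, precision / previous))
    previous = precision
```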

Top performers at TREC-8, TREC-9 and TREC-2001 achieved precisions of 0.555, 0.580 and 0.570 respectively. The precision has not varied much in spite of the questions becoming harder, and Moldovan et al. holds that this is because of the increased use of NLP techniques.

Moldovan et al. continues with a discussion of how, as the difficulty of questions rises, so too does the requirement for lexico-semantic information. For example, given the question ``What was the name of the US helicopter pilot shot down over North Korea?'' a system with the proper resources can derive that the expected answer type is Person, enabling it to answer the question in a more correct context.
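
A minimal sketch of what expected answer type derivation might look like is given below. It assumes a crude pattern-and-lexicon approach; the actual system draws on an answer type hierarchy and WordNet, so the cue lists here are purely hypothetical:

```python
# A toy sketch of expected answer type derivation. The real system uses an
# answer type hierarchy and WordNet; the cue lists below are hypothetical.
WH_CUES = [
    ("who", "PERSON"),
    ("where", "LOCATION"),
    ("when", "DATE"),
    ("how many", "QUANTITY"),
]

# Hypothetical head-noun cues for "what ..." style questions.
HEAD_NOUN_TYPES = {"pilot": "PERSON", "city": "LOCATION", "year": "DATE"}

def expected_answer_type(question):
    q = question.lower()
    for cue, answer_type in WH_CUES:
        if q.startswith(cue):
            return answer_type
    for noun, answer_type in HEAD_NOUN_TYPES.items():
        if noun in q:
            return answer_type
    return "UNKNOWN"

print(expected_answer_type(
    "What was the name of the US helicopter pilot shot down over North Korea?"))
# -> PERSON, because "pilot" maps to PERSON in the toy lexicon
```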

Moldovan et al. concludes that ``the overall performance of question answering systems is directly related to the depth of NLP resources'' and that the bottlenecks in the question answering system studied are in the expected answer type derivation module and in the keyword expansion module. Moldovan et al. adds that question answering systems ``perform better when the relevant passages and the candidate answers are clearly defined in the questions'' and that ``the main problem is the lack of powerful schemes and algorithms for modeling complex questions in order to derive as much information as possible, and for a well-guided search through thousands of text documents.''

This paper indicates that the bottlenecks are where they were expected, that is, in the semantic capability of question answering systems. This supports the validity of pursuing research in the direction of semantics, which includes looking at areas such as context and interpretation.
