Wednesday, 5 December 2007

Question-answering system based on concepts and statistics

I've been reading "Question-answering system based on concepts and statistics" by Lin Hongfei, Yang Zhihao and Zhao Jing. It was published by Springer-Verlag in 2007. It has a number of neat ideas and way too much maths for me.

The paper covers a lot of practical matters to make it relevant to me in this stage of research. It covers an implementation mechanism for a question answering system. The system takes into account answer types, conceptual expansions, latent semantic indexing, matching algorithms, and a frequently asked question corpus.

Lin et al. begin with a brief history of QA systems. The first QA system was developed in 1993 at MIT. The second was called START. The third was MURAX by Julian Kupiec. These were followed by FAQ Finder. These systems were based on templates, linguistic knowledge, statistics, or some combination.

The next advancement came with the commencement of a QA track in 1998 at the well-respected TREC conference. [I remember reading a while ago that TREC had finished, but I'm not sure.] A significant portion of QA research has been coming from China.

QA systems developed in the past with knowledge representation have been limited by the range of their knowledge bases and rules. QA systems have also been developed with statistical, syntactic and semantic techniques.

[I am not sure what Lin et al. mean by concepts.]

The QA system presented works by determining the question focuses and answer classes. Depending on the question focuses, appropriate conceptual expansions are made.

Latent semantic indexing is used to retrieve appropriate passages. Sentence similarity match algorithms are used to link questions and answer sentences. The answer sentences are stored in a FAQ database.

[I question the potential level of evolution of such a system. This research is on a different level to what I am seeking. I am seeking to bestow a more basic understanding capability whereas this system works from an arbitrary perspective to make use of patterns and similarity. Ideally I would seek to obtain a working QA system and then work on implementing context concepts.]

The first stage is question analysis.

Question and answer type extraction rules are determined using language segmentation, part-of-speech tagging and statistical analysis.

Conceptual expansion is based on a semantic tree whose hierarchy reflects level of abstraction. Step one is synonymous expansion for the primary focus. Step two is approximate expansion for the secondary focus, using concepts on the same tree level or one level higher. Step three is additional expansion for the additional focus, using concepts on the same level, one level higher and one level lower.

[There is some mathematical notation to formalise the expansion steps here that is beyond my current math level.]
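As I understand the three expansion steps, they could be sketched roughly like this. The tree structure, the example concepts and the level arithmetic are all my own invention for illustration, not taken from the paper:

```python
# A toy concept tree: each node records its parent, its level of
# abstraction (0 = most abstract) and its synonyms. All invented.
TREE = {
    "animal": {"parent": None,     "level": 0, "synonyms": ["creature"]},
    "dog":    {"parent": "animal", "level": 1, "synonyms": ["canine"]},
    "cat":    {"parent": "animal", "level": 1, "synonyms": ["feline"]},
    "puppy":  {"parent": "dog",    "level": 2, "synonyms": ["pup"]},
}

def synonymous_expansion(concept):
    """Step one: synonyms of the primary focus."""
    return TREE[concept]["synonyms"]

def approximate_expansion(concept):
    """Step two: concepts on the same level or one level higher."""
    lvl = TREE[concept]["level"]
    return [c for c, n in TREE.items()
            if c != concept and n["level"] in (lvl, lvl - 1)]

def additional_expansion(concept):
    """Step three: same level, one level higher and one level lower."""
    lvl = TREE[concept]["level"]
    return [c for c, n in TREE.items()
            if c != concept and n["level"] in (lvl - 1, lvl, lvl + 1)]
```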

Each focus is given a weight and the weight calculation formula is presented. The question is then represented as a feature vector which contains the weights of the primary and secondary focuses.
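A minimal sketch of how a question might become a feature vector of focus weights. The frequency-style weighting and the `PRIMARY_BOOST` constant are placeholders of my own, not the paper's actual weight formula:

```python
PRIMARY_BOOST = 2.0  # assumed: primary focuses weigh more than secondary

def question_vector(primary_focuses, secondary_focuses, vocabulary):
    """Represent a question as a vector of focus weights over a vocabulary."""
    weights = {}
    for f in primary_focuses:
        weights[f] = weights.get(f, 0.0) + PRIMARY_BOOST
    for f in secondary_focuses:
        weights[f] = weights.get(f, 0.0) + 1.0
    return [weights.get(term, 0.0) for term in vocabulary]
```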

The second stage is passage matching.

Lin et al. break down passages into units for retrieval. Each unit contains additional information to help preserve relationships between units. Reduction into units has the added benefit of quickening the extraction process. The different types of units are text, passage, sentence and word.
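My guess at what a retrieval unit might carry: its text, its level in the text/passage/sentence/word hierarchy, and enough context (a parent id and position) to preserve the relationships that partitioning would otherwise lose. The field names are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Unit:
    text: str
    level: str                      # "text" | "passage" | "sentence" | "word"
    parent_id: Optional[int] = None  # link back to the enclosing unit
    position: int = 0                # order within the parent

def split_into_sentences(passage_text, passage_id):
    """Break a passage unit into sentence units that remember their parent."""
    sentences = [s.strip() for s in passage_text.split(".") if s.strip()]
    return [Unit(s, "sentence", parent_id=passage_id, position=i)
            for i, s in enumerate(sentences)]
```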

Latent Semantic Indexing is used here, though I don't understand to what purpose, and I intuitively challenge the notion that semantics are actually used. From previous discussions I think the end result is a set of vectors, with similarity calculated as the angle between them.
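That intuition matches the textbook version of LSI: take the SVD of a term-by-document matrix, keep the top latent dimensions, and compare documents by the cosine of the angle between their reduced vectors. The tiny matrix below is invented for illustration; this is the generic technique, not the paper's specific setup:

```python
import numpy as np

# rows = terms, columns = documents (raw counts); values are made up
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                    # keep the top-k latent dimensions
docs = (np.diag(s[:k]) @ Vt[:k, :]).T    # each row: one document in LSI space

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_01 = cosine(docs[0], docs[1])        # similarity of documents 0 and 1
```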

The third stage is answer extraction. Lin et al. divide this stage into two sub-stages: match algorithm and threshold selection, followed by answer extraction based on question similarity.

In match
algorithm and threshold selection, the factors considered are keyword density, question type limitation and focus constraints. The field type of sentence units is used to filter out significant portions.
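One plausible way those factors could combine, sketched as code. The scoring scheme, the `TYPE_OK` table and the threshold value are all my own assumptions, not the paper's algorithm:

```python
# Map question words to the field type an answer sentence should have.
TYPE_OK = {"who": "person", "where": "location", "when": "date"}

def keyword_density(sentence_words, keywords):
    """Fraction of words in the sentence unit that are question keywords."""
    if not sentence_words:
        return 0.0
    hits = sum(1 for w in sentence_words if w in keywords)
    return hits / len(sentence_words)

def passes_filter(sentence_words, field_type, question_word, keywords,
                  min_density=0.2):
    """Keep a sentence unit only if its field type matches the question
    type and its keyword density clears an (assumed) threshold."""
    if TYPE_OK.get(question_word) != field_type:
        return False  # question-type limitation
    return keyword_density(sentence_words, keywords) >= min_density
```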

[There are additional details described that I don't quite get yet.]

In calculating the similarities between passages and questions, the factors to be considered are focus frequency in passages and focus weights.

[Some more nifty mathematics is presented to convey how similarity is calculated.]
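One reading of those two factors is a weighted sum: score a passage by summing, over each focus, its frequency in the passage times its weight. This is a guess at the shape of the formula, not the paper's version:

```python
def passage_score(passage_words, focus_weights):
    """Score a passage: sum of (focus frequency in passage) x (focus weight)."""
    score = 0.0
    for focus, weight in focus_weights.items():
        frequency = passage_words.count(focus)
        score += frequency * weight
    return score
```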

Answer extraction based on question similarity uses the idea that new questions are often similar to older questions that are already answered. This task involves calculating the similarity between the two question sentences. Sentence similarity is based on syntax, semantics and pragmatics.

Here Lin et al. note that pragmatic similarity is "comparatively complicated and often provides unsatisfactory results."

The similarity search takes into account morphological matching, word-order matching and conceptual expansion.

[Mathematical formulas are presented for morphological similarity, word-order similarity and general similarity.]
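Since the paper's exact formulas aren't reproduced here, these are common textbook stand-ins: morphological similarity as word overlap (a Dice coefficient), word-order similarity from the number of pairwise inversions among the shared words, and a blending weight `lam` that is my own assumption:

```python
def morphological_sim(a, b):
    """Dice overlap of the two sentences' word sets."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))

def word_order_sim(a, b):
    """1 minus the fraction of pairwise inversions among shared words."""
    shared = [w for w in a if w in b]
    order_in_b = [b.index(w) for w in shared]
    pairs = [(i, j) for i in range(len(order_in_b))
             for j in range(i + 1, len(order_in_b))]
    if not pairs:
        return 1.0
    inversions = sum(1 for i, j in pairs if order_in_b[i] > order_in_b[j])
    return 1.0 - inversions / len(pairs)

def general_sim(a, b, lam=0.8):
    """Blend the two measures; lam is an assumed weighting."""
    return lam * morphological_sim(a, b) + (1 - lam) * word_order_sim(a, b)
```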

Depending on the results, an appropriate threshold is determined and the matching questions are extracted from the FAQ database.

Lin et al. go on to describe the system. The FAQ database uses word segmentation, has had stop words removed and is indexed by Latent Semantic Indexing. The accuracy of the system is measured using mean reciprocal rank.
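Mean reciprocal rank is the standard TREC QA measure (which I take the paper's metric to be): for each question, score 1/rank of the first correct answer, or 0 if none was returned, then average over all questions:

```python
def mean_reciprocal_rank(first_correct_ranks):
    """Ranks are 1-based; None means no correct answer was returned."""
    scores = [0.0 if r is None else 1.0 / r for r in first_correct_ranks]
    return sum(scores) / len(scores)
```

For example, if the correct answer was ranked first for one question, second for another, and missing for a third, the MRR is (1 + 0.5 + 0) / 3 = 0.5.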

In conclusion Lin et al. note that passage retrieval is "based on statistics, and relies on the similarities of questions" and a FAQ database. Retrieval units are assigned additional attributes to reduce the information lost from partitioning. The system's accuracy can be improved by increasing the number of FAQ entries.

Lin et al. note that the limitations of the system are:

"grammar support and semantic analysis. Although the statistical approach is easier to implement, it fails to well refine questions.... The important task for the future is to fuse the shallow parsing technology and the passage retrieval method based on statistics...."

[I would add that it is also important to look at deep parsing. All in all, this is an example of a practical paper that has a significant amount of implementation detail. Now it would be great to find a paper that covers practical implementation details for deep parsing and semantics.]

[Chat bots are a type of question answer system.]
