information retrieval - Question Answering with Lucene
For a toy project, I want to implement an automated question answering system using Lucene, and I'm trying to figure out a reasonable way to implement it. The basic operation is as follows:
1) The user enters a question.
2) The system identifies the keywords in the question.
3) The keywords are searched for in a large knowledge base, and the matching sentences are shown as answers.
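To make the three steps concrete, here is a minimal sketch in plain Python, without Lucene. The stopword list, the regex-based sentence splitter, and the overlap-count ranking are all assumptions made for illustration; a real system would lean on Lucene's analyzers and scoring instead.

```python
import re

# Hypothetical, tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "do", "does", "how", "what",
             "i", "to", "of", "in", "on", "for", "and", "or", "it"}

def keywords(question):
    """Step 2: lowercase the question and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return [t for t in tokens if t not in STOPWORDS]

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def answer(question, corpus, top_n=3):
    """Step 3: rank sentences by how many distinct keywords they contain."""
    kws = set(keywords(question))
    scored = []
    for sentence in split_sentences(corpus):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        score = len(kws & words)
        if score > 0:
            scored.append((score, sentence))
    # Stable sort: ties keep their original corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top_n]]
```

For example, calling `answer("How do I reset the device?", manual_text)` would return up to three sentences from `manual_text` that mention "reset" or "device", best match first.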
My knowledge base (i.e., corpus) is not structured. It is just a large, continuous text (say, a user manual without any chapters). By that I mean the only structure is that sentences and paragraphs are identified.
I plan to treat each sentence or paragraph as a separate document. To present the answer in context, I may consider keeping one sentence/paragraph before and after the indexed one as a payload. I would like to know if that makes sense. Also, I'm wondering if there are other tried and well-known approaches for these kinds of systems. For example, another approach that comes to mind is to index large chunks of the corpus as documents with token positions, and then process the vicinity of the found keywords to construct answers.
I would appreciate direct recommendations based on experience or intuition, as well as tutorials or introductory materials on question-answering systems with Lucene in mind.
Thanks.
Indexing each sentence as a document will give you some problems. You've pointed out one: you would need to store the surrounding text as payloads. This means that you'll need to store each sentence three times (before, during, and after), and you'll have to handle the payload manually.
If you want to go the route of each sentence being a document, I would recommend coming up with an ID for each sentence and storing it as a separate field. Then you can display [id-1, id, id+1] in each result.
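A rough sketch of the ID idea, in plain Python rather than Lucene: each sentence gets its position in the corpus as an ID, and a hit is displayed together with its neighbours. The function names and the `window` parameter are hypothetical; in Lucene the ID would be a stored field on each document, and the neighbours would be fetched by ID.

```python
def build_index(sentences):
    """Assign each sentence a sequential id (its position in the corpus)."""
    return {i: s for i, s in enumerate(sentences)}

def with_context(index, hit_id, window=1):
    """Return the hit plus `window` sentences on either side, in order.

    Ids that fall outside the corpus (before the first or after the
    last sentence) are skipped.
    """
    ids = range(hit_id - window, hit_id + window + 1)
    return [index[i] for i in ids if i in index]
```

This way each sentence is stored only once, and the context is assembled at display time.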
The bigger question, though, is: how should you break up the text into documents? Identifying semantically related areas seems difficult, so doing it by sentence/paragraph might be the only way to go. A better way would be to find which text is the header of a section, and then put everything in that section into one document.
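If section headers can be spotted at all, the split might look something like the sketch below. The header test is a made-up heuristic (a short line with no sentence-ending punctuation) and will misfire on some texts; it is only meant to show the shape of the approach.

```python
def looks_like_header(line):
    """Hypothetical heuristic: short, non-empty, no trailing punctuation."""
    stripped = line.strip()
    return 0 < len(stripped) <= 60 and stripped[-1] not in ".!?:,;"

def split_sections(text):
    """Group lines into (header, body) pairs.

    Text that appears before the first header gets a header of None.
    """
    sections = []
    header, body = None, []
    for line in text.splitlines():
        if looks_like_header(line):
            # Close the previous section before starting a new one.
            if header is not None or body:
                sections.append((header, " ".join(body)))
            header, body = line.strip(), []
        elif line.strip():
            body.append(line.strip())
    if header is not None or body:
        sections.append((header, " ".join(body)))
    return sections
```

Each `(header, body)` pair would then become one document, with the header possibly indexed as its own field.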
You might also want to use the index (if your corpus has one). The terms there could be boosted, as they're presumably more important.
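As an illustration of the boosting idea, here is a small scoring sketch in plain Python. The `index_terms` set stands for the terms taken from the manual's back-of-book index, and the boost value of 2.0 is an arbitrary assumption; in Lucene itself you would more likely use query-time boosts than hand-rolled scoring.

```python
import re

def score(sentence, keywords, index_terms, boost=2.0):
    """Score a sentence against the question keywords.

    Each matched keyword counts 1.0, or `boost` if it also appears
    in the corpus's index (assumed to mark more important terms).
    """
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    total = 0.0
    for kw in set(keywords):
        if kw in words:
            total += boost if kw in index_terms else 1.0
    return total
```

With this, a sentence matching one index term outranks a sentence matching one ordinary keyword.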