Wednesday, April 7, 2010

DEVELOPMENT OF AN AUTO-SUMMARIZATION TOOL -- IT PROJECT

Objective/Vision

Auto-summarization has applications such as summarizing search-engine results and providing briefs of long documents that lack an abstract. Summarizers fall into two categories: linguistic and statistical. Linguistic summarizers use knowledge about the language to summarize a document. Statistical summarizers find the important sentences using statistical methods and normally do not use any linguistic information.

User of the System

Uses the tool to generate summaries of electronic documents

The tool relies on statistical techniques

It handles document types such as plain text, HTML and Word documents

The techniques involve finding the frequency of words, scoring the sentences and ranking the sentences
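Before any statistics can be computed, the non-plain-text formats listed above have to be reduced to plain text. For HTML, a minimal sketch using only Python's standard library might look like the following. The class name `TextExtractor`, the function `html_to_text` and the choice of block-level tags are assumptions made for illustration; Word documents would need a separate reader and are not covered here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}
    # Tags after which a word break is forced (illustrative, not exhaustive).
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

The output of `html_to_text` can then be passed to the same sentence and word separators used for plain-text input.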


Functional Requirements

i. Study auto-summarization techniques, concentrating on summarizers based on statistical techniques

ii. Collect the list of stop-words from an Internet site

iii. Come up with algorithms for the different functional components listed in the previous section. Heuristic methods could be used to come up with a modification of an existing algorithm

iv. Implement the pre-processor/sentence separator/word separator/word frequency calculator. These do not require much work on the algorithm side and existing algorithms will do fine.

v. Implement the scoring and ranking component

vi. Test it with some documents and tune the algorithms, if needed

vii. Benchmark the tool against some tools available on the Internet
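The pre-processing components named in step iv (sentence separator, word separator, word frequency calculator) can be sketched roughly as below. This is only an illustration: the function names, the regular expressions and the tiny `STOP_WORDS` set are assumptions, and a real stop-word list would be collected as described in step ii. The sentence splitter is deliberately naive and will mis-split abbreviations such as "e.g.".

```python
import re
from collections import Counter

# Illustrative stop-word set only; a full list would be loaded from a file.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it"}


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_words(sentence):
    """Lower-case the sentence and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def word_frequencies(sentences):
    """Count every non-stop-word across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(w for w in split_words(sentence) if w not in STOP_WORDS)
    return counts
```

For example, `word_frequencies(split_sentences(text))` yields the document-wide word counts that the scoring component can consume.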


Optional Features

The scoring algorithm determines the score of each sentence. Several possibilities exist. The score can be made proportional to the sum of the frequencies of the words that make up the sentence. The score can also be made inversely proportional to the number of sentences in the document in which the words of the sentence appear. Many other heuristic rules of this kind can be applied to score the sentences.
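The two scoring heuristics described above could be sketched as follows. This is one possible reading, not a prescribed formula: `score_by_frequency` sums document-wide word counts, and `score_by_rarity` gives each distinct word a weight of one over the number of sentences containing it. The function names and the tokenizer are assumptions, and stop-word filtering is omitted to keep the example short.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lower-case the sentence and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def score_by_frequency(sentences):
    """Score = sum of document-wide frequencies of the sentence's words."""
    freq = Counter(w for s in sentences for w in tokenize(s))
    return [sum(freq[w] for w in tokenize(s)) for s in sentences]


def score_by_rarity(sentences):
    """Score = sum over distinct words of 1 / (sentences containing the word)."""
    sent_freq = Counter(w for s in sentences for w in set(tokenize(s)))
    return [sum(1.0 / sent_freq[w] for w in set(tokenize(s))) for s in sentences]
```

Both heuristics tend to favor long sentences because they sum over words, so dividing by the sentence length is a common adjustment worth trying during tuning.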

The sentences will be ranked according to their scores. Other criteria, such as the position of a sentence in the document, can also be used to control the ranking. For example, consecutive sentences would not be placed together even if both have high scores.
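A ranking step along these lines might be sketched as below. The handling of consecutive sentences is an assumption: here a sentence that sits directly next to an already ranked one is pushed to the back of the list rather than dropped, and ties in score are broken in favor of the earlier sentence. The name `rank_sentences` and the `avoid_adjacent` flag are hypothetical.

```python
def rank_sentences(scores, avoid_adjacent=True):
    """Return sentence indices ordered from most to least preferred."""
    # Highest score first; earlier position wins a tie.
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    if not avoid_adjacent:
        return order

    chosen, deferred = [], []
    for i in order:
        # Defer a sentence whose direct neighbor has already been chosen.
        if any(abs(i - j) == 1 for j in chosen):
            deferred.append(i)
        else:
            chosen.append(i)
    return chosen + deferred
```

Deferring instead of discarding keeps every sentence available, so a long summary can still be filled when the user asks for more sentences than there are non-adjacent candidates.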

Based on the summary size requested by the user, the sentences will be picked from the ranked list and concatenated. The resulting summary file could be stored with a name like _summary.txt.
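The selection step can be illustrated with a short sketch. Several details here are assumptions rather than part of the specification: the summary size is taken as a number of sentences, the picked sentences are put back into document order before concatenation, and the output name is built by appending `_summary.txt` to the source file's base name. The function names are hypothetical.

```python
from pathlib import Path


def build_summary(sentences, ranked_indices, num_sentences):
    """Take the top-ranked sentences and join them in document order."""
    picked = sorted(ranked_indices[:max(0, num_sentences)])
    return " ".join(sentences[i] for i in picked)


def summary_path(source_path):
    """Assumed naming scheme: 'report.html' becomes 'report_summary.txt'."""
    source = Path(source_path)
    return source.with_name(source.stem + "_summary.txt")


def save_summary(summary, source_path):
    """Write the summary next to the source document and return its path."""
    target = summary_path(source_path)
    target.write_text(summary, encoding="utf-8")
    return target
```

Restoring document order usually reads better than emitting sentences in rank order, but either choice satisfies the requirement as stated.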

Other Important issues

After the algorithms are finalized, the system is integrated so that it can be tested through a GUI or a command-line interface

The tool should be tested with documents of different sizes and content
