This has applications such as summarizing search-engine results and providing briefs of large documents that lack an abstract. Summarizers fall into two categories: linguistic and statistical. Linguistic summarizers use knowledge about the language to summarize a document. Statistical summarizers operate by finding the important sentences using statistical methods, and normally do not use any linguistic information.
User of the System
Used to generate summaries of electronic documents
Using statistical techniques
To handle document types such as plain text, HTML and Word documents
Techniques involve finding the frequency of words, scoring the sentences, ranking the sentences
Functional Requirements
i. Study auto-summarizing techniques, concentrating on summarizers based on statistical techniques
ii. Collect the list of stop-words from an Internet site
iii. Come up with algorithms for the different functional components listed in the previous section. Heuristic methods could be used to modify an existing algorithm
iv. Implement the pre-processor/sentence separator/word separator/word frequency calculator. These do not require much work on the algorithm side and existing algorithms will do fine.
v. Implement the scoring and ranking component
vi. Test it with some documents and tune the algorithms, if needed
vii. Benchmark your tool against some tools available on the Internet
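The pre-processing components in item iv can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`split_sentences`, `split_words`, `word_frequencies`) are hypothetical, the stop-word set is a tiny placeholder for the fuller list mentioned in item ii, and the sentence separator is a naive punctuation-based split that will mishandle abbreviations.

```python
import re
from collections import Counter

# Placeholder stop-word list (an assumption); a real run would load a fuller list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}

def split_sentences(text):
    """Naive sentence separator: split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_words(sentence):
    """Word separator: lowercase alphabetic tokens with stop-words removed."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in tokens if w not in STOP_WORDS]

def word_frequencies(sentences):
    """Word frequency calculator: count each non-stop-word across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(split_words(sentence))
    return counts

text = "The cat sat. The cat ran! Is it a dog?"
sentences = split_sentences(text)
freqs = word_frequencies(sentences)
```

Here `sentences` holds the three sentences of `text`, and `freqs` maps each remaining word to its count, for example `freqs["cat"]` is 2 while "the" is dropped as a stop-word.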
Optional Features
This algorithm determines the score of each sentence. Several possibilities exist. The score can be made proportional to the sum of the frequencies of the words comprising the sentence. The score can also be made inversely proportional to the number of sentences in the document in which the words of the sentence appear. Many other such heuristic rules can be applied to score the sentences.
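The two scoring heuristics described above might look like the following sketch. It assumes the sentences have already been split into lists of words with stop-words removed. The function names are hypothetical, and dividing each word's frequency by the number of sentences containing it is just one possible reading of the "inversely proportional" rule.

```python
from collections import Counter

def frequency_scores(tokenized):
    """Score each sentence as the sum of the document-wide frequencies of its words."""
    freqs = Counter(w for words in tokenized for w in words)
    return [sum(freqs[w] for w in words) for words in tokenized]

def dampened_scores(tokenized):
    """Like frequency_scores, but each word's frequency is divided by the
    number of sentences in which that word appears (an assumed interpretation)."""
    freqs = Counter(w for words in tokenized for w in words)
    sentence_counts = Counter(w for words in tokenized for w in set(words))
    return [
        sum(freqs[w] / sentence_counts[w] for w in words)
        for words in tokenized
    ]

# Each inner list is one sentence, already tokenized.
tokenized = [["cat", "sat", "mat"], ["cat", "cat", "ran"], ["dog", "barked"]]
plain = frequency_scores(tokenized)      # [5, 7, 2]
dampened = dampened_scores(tokenized)    # [3.5, 4.0, 2.0]
```

In this toy input "cat" occurs three times across two sentences, so it contributes 3 to the plain score per occurrence but only 1.5 to the dampened score.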
The sentences are ranked according to their scores. Other criteria, such as the position of a sentence in the document, can be used to control the ranking. For example, even when their scores are high, we would not put consecutive sentences together.
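One way to combine score-based ranking with the position rule is sketched below. The policy shown is an assumption chosen for illustration: a sentence that sits next to an already ranked, higher-scoring sentence is pushed to the back of the ranking rather than discarded. The name `rank_sentences` is hypothetical.

```python
def rank_sentences(scores):
    """Return sentence indices ranked by score, deferring any sentence that is
    adjacent in the document to a sentence already ranked ahead of it."""
    by_score = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked, deferred = [], []
    for i in by_score:
        if any(abs(i - j) == 1 for j in ranked):
            deferred.append(i)   # neighbour of a better sentence: demote it
        else:
            ranked.append(i)
    return ranked + deferred

scores = [5, 9, 8, 1, 7]
order = rank_sentences(scores)   # [1, 4, 2, 0, 3]
```

Sentence 2 has the second-highest score but is adjacent to sentence 1, so sentence 4 moves ahead of it in the ranking.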
Based on the user input on the size of the summary, the sentences will be picked from the ranked list and concatenated. The resulting summary file could be stored with a name like
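The selection step can be sketched as follows, assuming the summary size is given as a number of sentences and that the chosen sentences are put back in document order before being concatenated (both are assumptions, as is the name `build_summary`). Writing the result to a file is left to the caller.

```python
def build_summary(sentences, ranked, size):
    """Pick the top `size` sentences from the ranked index list and join them
    in their original document order."""
    chosen = sorted(ranked[:max(size, 0)])
    return " ".join(sentences[i] for i in chosen)

sentences = ["A first point.", "A second point.", "A third point.", "A fourth point."]
ranked = [2, 0, 3, 1]          # indices into `sentences`, best first
summary = build_summary(sentences, ranked, 2)
```

With a requested size of 2, the two best-ranked sentences (indices 2 and 0) are selected and emitted as "A first point. A third point."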
Other Important Issues
After the algorithms are finalized, the system is integrated so that it can be tested through a GUI or a command-line interface.
The tool should be tested with documents of different sizes and content.
Please suggest some algorithms for text summarization.