ABSTRACT
We introduce OCELOT, a prototype system for automatically generating the “gist” of a web page by summarizing it. Although most text summarization research to date has focused on news articles, web pages differ from them in both structure and content. Instead of coherent text with a well-defined discourse structure, they are more often a chaotic jumble of phrases, links, graphics and formatting commands. Such text provides little foothold for extractive summarization techniques, which attempt to generate a summary of a document by excerpting a contiguous, coherent span of text from it. This paper builds upon recent work in non-extractive summarization, producing the gist of a web page by “translating” it into a more concise representation rather than attempting to extract a text span verbatim. OCELOT uses probabilistic models to guide it in selecting and ordering words into a gist. This paper describes a technique for learning these models automatically from a collection of human-summarized web pages.
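The abstract does not spell out the form of the probabilistic models, so the following is only a minimal sketch of the general idea of selecting and then ordering gist words. It assumes a word-to-word "translation" table for selection and a bigram model for ordering. The tables `TRANSLATION` and `BIGRAM`, the constant `FLOOR`, the function names, and all probabilities are hypothetical values invented for illustration; in a real system they would be learned from human-summarized pages rather than written by hand.

```python
import math
from collections import Counter
from itertools import permutations

# Hypothetical toy parameters (not from the paper).
# TRANSLATION[w][g]: probability that page word w yields gist word g.
TRANSLATION = {
    "guitars": {"guitar": 0.7, "music": 0.3},
    "chords": {"chords": 0.6, "guitar": 0.2, "music": 0.2},
    "lessons": {"lessons": 0.8, "music": 0.2},
}
# BIGRAM[(prev, cur)]: probability that cur follows prev in a gist.
BIGRAM = {
    ("<s>", "guitar"): 0.4,
    ("<s>", "music"): 0.3,
    ("guitar", "music"): 0.5,
    ("music", "lessons"): 0.4,
}
FLOOR = 1e-4  # assumed probability for unseen bigrams


def select_words(page_words, k):
    """Score each candidate gist word g by sum_w p(w | page) * t(g | w)."""
    counts = Counter(w for w in page_words if w in TRANSLATION)
    total = sum(counts.values())
    scores = Counter()
    for w, c in counts.items():
        for g, p in TRANSLATION[w].items():
            scores[g] += p * c / total
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


def order_words(words):
    """Pick the ordering with the highest bigram log-probability.

    Exhaustive search over permutations; only feasible for short gists.
    """
    def log_prob(seq):
        prev, total = "<s>", 0.0
        for w in seq:
            total += math.log(BIGRAM.get((prev, w), FLOOR))
            prev = w
        return total

    return list(max(permutations(words), key=log_prob))


def gist(page_words, k=3):
    """Select k gist words, then order them into a short phrase."""
    return " ".join(order_words(select_words(page_words, k)))


page = ["guitars", "guitars", "chords", "lessons", "free", "click"]
print(gist(page))  # prints "guitar music lessons"
```

Note that "guitar" and "music" appear in the output even though neither occurs verbatim on the toy page, which is the sense in which such a gist is non-extractive. The exhaustive ordering step is a simplification chosen for clarity; a practical system would need a more efficient search.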
Index Terms
- OCELOT: a system for summarizing Web pages