DOI: 10.1145/345508.345565

OCELOT: a system for summarizing Web pages

Published: 01 July 2000

ABSTRACT

We introduce OCELOT, a prototype system for automatically generating the “gist” of a web page by summarizing it. Although most text summarization research to date has focused on news articles, web pages are quite different in both structure and content. Instead of coherent text with a well-defined discourse structure, they are more often a chaotic jumble of phrases, links, graphics and formatting commands. Such text provides little foothold for extractive summarization techniques, which attempt to generate a summary of a document by excerpting a contiguous, coherent span of text from it. This paper builds upon recent work in non-extractive summarization, producing the gist of a web page by “translating” it into a more concise representation rather than attempting to extract a text span verbatim. OCELOT uses probabilistic models to guide it in selecting and ordering words into a gist. This paper describes a technique for learning these models automatically from a collection of human-summarized web pages.
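
To make the idea of "selecting and ordering words" concrete, the following is a minimal sketch, not OCELOT's actual models or code. It assumes two hypothetical, hand-written probability tables: `WORD_GIVEN_DOC_WORD`, which says how strongly a word on the page suggests a word in the gist, and `BIGRAM`, which scores word order. All names, values, and the brute-force search are illustrative assumptions; the abstract says such models are learned from human-summarized pages, and that learning step is not shown here.

```python
import itertools
import math
from collections import Counter

# Hypothetical toy parameters (assumptions for illustration only).
# A real system would estimate these from (web page, human summary) pairs.
WORD_GIVEN_DOC_WORD = {
    "music": {"music": 0.6},
    "mp3": {"music": 0.5},
    "download": {"downloads": 0.7},
    "free": {"free": 0.8},
}
BIGRAM = {
    ("<s>", "free"): 0.5,
    ("free", "music"): 0.4,
    ("music", "downloads"): 0.3,
}
FLOOR = 1e-4  # probability assigned to bigrams missing from the table


def select_words(doc_tokens, k):
    """Pick the k gist words most strongly suggested by the page's words."""
    scores = Counter()
    for doc_word, count in Counter(doc_tokens).items():
        for gist_word, p in WORD_GIVEN_DOC_WORD.get(doc_word, {}).items():
            scores[gist_word] += count * p
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def sequence_log_prob(words):
    """Log-probability of a word sequence under the toy bigram table."""
    prev, total = "<s>", 0.0
    for word in words:
        total += math.log(BIGRAM.get((prev, word), FLOOR))
        prev = word
    return total


def order_words(words):
    """Try every ordering (feasible only for short gists) and keep the best."""
    return list(max(itertools.permutations(words), key=sequence_log_prob))


def gist(doc_tokens, k=3):
    return order_words(select_words(doc_tokens, k))


page = ["free", "mp3", "music", "download", "click", "here", "mp3"]
print(gist(page))
```

Note that "downloads" appears in the output even though the page only contains "download": the gist is generated from the models rather than excerpted from the page, which is the sense in which the approach is non-extractive. Exhaustive search over orderings is used here only because the toy gist is three words long.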

Published in

SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
July 2000
396 pages
ISBN: 1581132263
DOI: 10.1145/345508

Copyright © 2000 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%
