OCELOT: a system for summarizing Web pages

Authors:
Adam L. Berger

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA

Vibhu O. Mittal

Just Research, 4616 Henry Street, Pittsburgh, PA


SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, July 2000, Pages 144–151, https://doi.org/10.1145/345508.345565

Published: 01 July 2000

ABSTRACT

We introduce OCELOT, a prototype system for automatically generating the “gist” of a web page by summarizing it. Although most text summarization research to date has focused on news articles, web pages are quite different in both structure and content. Instead of coherent text with a well-defined discourse structure, they are more often a chaotic jumble of phrases, links, graphics and formatting commands. Such text provides little foothold for extractive summarization techniques, which attempt to generate a summary of a document by excerpting a contiguous, coherent span of text from it. This paper builds upon recent work in non-extractive summarization, producing the gist of a web page by “translating” it into a more concise representation rather than attempting to extract a text span verbatim. OCELOT uses probabilistic models to guide it in selecting and ordering words into a gist. This paper describes a technique for learning these models automatically from a collection of human-summarized web pages.
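The idea of selecting and ordering words with models learned from human-summarized pages can be illustrated with a toy sketch. The abstract does not specify OCELOT's actual models or training procedure, so everything below is an assumption made for illustration: word selection uses simple co-occurrence relative frequencies between page words and gist words, ordering uses a greedy search over an add-one smoothed bigram model of gists, and the names `train`, `select`, `order`, and `PAIRS` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (web page text, human-written gist) pairs.
PAIRS = [
    ("cheap flights hotels book flights online deals", "travel booking site"),
    ("hotels deals book rooms online", "travel booking site"),
    ("python tutorial code examples learn python", "programming tutorial site"),
]

def train(pairs):
    """Collect toy selection and ordering statistics from (page, gist) pairs."""
    cooc = defaultdict(Counter)    # cooc[d][g]: page word d co-occurred with gist word g
    bigram = defaultdict(Counter)  # bigram[a][b]: gist word b directly followed a
    for page, gist in pairs:
        page_words = page.lower().split()
        gist_words = gist.lower().split()
        for d in set(page_words):
            for g in set(gist_words):
                cooc[d][g] += 1
        # "<s>" marks the start of a gist.
        for a, b in zip(["<s>"] + gist_words, gist_words):
            bigram[a][b] += 1
    return cooc, bigram

def select(page, cooc, k):
    """Score gist word g by the sum over page tokens d of p(g | d); keep the top k."""
    scores = Counter()
    for d in page.lower().split():
        row = cooc.get(d, Counter())
        total = sum(row.values())
        for g, count in row.items():
            scores[g] += count / total
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [g for g, _ in ranked[:k]]

def order(words, bigram):
    """Greedily append the unused word most likely to follow the previous one."""
    remaining = sorted(words)
    vocab_size = len(remaining)
    prev, out = "<s>", []
    while remaining:
        row = bigram.get(prev, Counter())
        denom = sum(row.values()) + vocab_size
        # Add-one smoothing so unseen bigrams keep a nonzero probability.
        best = max(remaining, key=lambda w: (row[w] + 1) / denom)
        out.append(best)
        remaining.remove(best)
        prev = best
    return out

cooc, bigram = train(PAIRS)
gist = order(select("book cheap hotels online", cooc, k=3), bigram)
print(" ".join(gist))  # prints "travel booking site"
```

Note that the generated gist contains words that never appear in the input page, which is the point of a non-extractive approach. A realistic system would need far more training data, proper smoothing, and a search over word sequences rather than a greedy pass.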

Index Terms

1. Information systems
  1. Information retrieval
    1. Document representation
  2. Information storage systems
2. Theory of computation
  1. Semantics and reasoning
    1. Program reasoning
      1. Abstraction

Published in
SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
July 2000
396 pages
ISBN: 1581132263
DOI: 10.1145/345508
Chairmen:
Emmanuel Yannakoudakis, Athens Univ. of Economics and Business, Greece
Nicholas J. Belkin, Rutgers Univ.
Mun-Kew Leong, Kent Ridge Digital Labs
Peter Ingwersen, Royal School of Library and Information Science
Copyright © 2000 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%