- Topic Modeling
- Natural Language Processing Techniques
- Advanced Text Analysis Techniques
- Information Retrieval and Search Behavior
- Data Visualization and Analytics
- Semantic Web and Ontologies
- Biomedical Text Mining and Ontologies
- Web Data Mining and Analysis
- Speech and dialogue systems
- Text Readability and Simplification
- Usability and User Interface Design
- Software Engineering Research
- Data Management and Algorithms
- Video Analysis and Summarization
- Advanced Database Systems and Queries
- Expert finding and Q&A systems
- Multimedia Communication and Technology
- Online Learning and Analytics
- Text and Document Classification Technologies
- Wikis in Education and Collaboration
- Digital Humanities and Scholarship
- Image Retrieval and Classification Techniques
- Data Mining Algorithms and Applications
- Big Data and Business Intelligence
- Interactive and Immersive Displays
University of California, Berkeley
2015-2024
Berkeley College
2014-2024
Allen Institute
2020-2023
University of Washington
2020-2023
Northwestern University
2019-2023
Massachusetts Institute of Technology
2023
University of Pennsylvania
2022
Microsoft Research (United Kingdom)
2021
University of Minnesota
2021
Seoul National University
2021
My first exposure to Support Vector Machines came this spring when I heard Sue Dumais present impressive results on text categorization using this analysis technique. This issue's collection of essays should help familiarize our readers with this interesting new racehorse in the Machine Learning stable. Bernhard Scholkopf, in an introductory overview, points out that a particular advantage of SVMs over other learning algorithms is that it can be analyzed theoretically using concepts from computational learning theory, and at...
We describe a method for the automatic acquisition of the hyponymy lexical relation from unrestricted text. Two goals motivate the approach: (i) avoidance of the need for pre-encoded knowledge and (ii) applicability across a wide range of text. We identify a set of lexico-syntactic patterns that are easily recognizable, that occur frequently and across text genre boundaries, and that indisputably indicate the lexical relation of interest. We describe a method for discovering these patterns and suggest that other lexical relations will also be acquirable in this way. A subset of the acquisition algorithm is implemented and the results are used to augment...
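To make the idea of a lexico-syntactic pattern concrete, here is a minimal sketch of one such pattern, "X such as A, B, and C", written as a regular expression. It is only an illustration under strong assumptions: every noun phrase is treated as a single word, no parsing or part-of-speech tagging is done, and the names `SUCH_AS` and `extract_hyponyms` are hypothetical, not taken from the paper's implementation.

```python
import re

# Assumption: each noun phrase is a single word. A real system would use
# a noun-phrase chunker rather than \w+.
SUCH_AS = re.compile(
    r"(\w+),? such as "                 # candidate hypernym
    r"(\w+(?:, (?!and\b|or\b)\w+)*"     # first hyponym plus comma-separated ones
    r"(?:,? (?:and|or) \w+)?)"          # optional final "and X" / "or X"
)


def extract_hyponyms(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        # Split the list on ", and ", " and ", " or ", or plain ", ".
        items = re.split(r",? (?:and|or) |, ", match.group(2))
        pairs.extend((item, hypernym) for item in items if item)
    return pairs


if __name__ == "__main__":
    # Invented example sentence, used only to exercise the pattern.
    print(extract_hyponyms("She studies languages such as French, Spanish, and Italian."))
```

A pattern like this trades recall for precision: it fires rarely, but when it does, the relation it signals is usually the intended one.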
To build systems shielding users from fraudulent (or phishing) websites, designers need to know which attack strategies work and why. This paper provides the first empirical evidence about which malicious strategies are successful at deceiving general users. We first analyzed a large set of captured phishing attacks and developed a set of hypotheses about why these strategies might work. We then assessed these hypotheses with a usability study in which 22 participants were shown 20 web sites and asked to determine which ones were fraudulent. We found that 23% of the participants did not look at browser-based cues...
There are currently two dominant interface types for searching and browsing large image collections: keyword-based search, and searching by overall similarity to sample images. We present an alternative based on enabling users to navigate along conceptual dimensions that describe the images. The interface makes use of hierarchical faceted metadata and dynamically generated query previews. A usability study, in which 32 art history students explored a collection of 35,000 fine arts images, compares this approach to a standard search...
The possibilities for data mining from large text collections are virtually untapped. Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically. Perhaps for this reason, there has been little work in text data mining to date, and most people who have talked about it have either conflated it with information access or have not made use of text directly to discover heretofore unknown information.
Reexamining the cluster hypothesis: scatter/gather on retrieval results. Authors: Marti A. Hearst (Xerox Palo Alto Research Center, 3333 Coyote Hill Rd, Palo Alto, CA), Jan O. Pedersen. SIGIR '96: Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, August 1996, Pages 76–84, https://doi.org/10.1145/243199.243216 (online 18 August 1996)
The field of information retrieval has traditionally focused on textbases consisting of titles and abstracts. As a consequence, many underlying assumptions must be altered for retrieval from full-length text collections. This paper argues for making use of text structure when retrieving from full text documents, and presents a visualization paradigm, called TileBars, that demonstrates the usefulness of explicit term distribution information in Boolean-type queries. TileBars simultaneously and compactly indicate relative document length, query...
This paper describes TextTiling, an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts. The algorithm uses domain-independent lexical frequency and distribution information to recognize the interactions of multiple simultaneous themes. Two fully-implemented versions of the algorithm are described and shown to produce segmentation that corresponds well to human judgments of the major subtopic boundaries of thirteen lengthy texts.
The P_k evaluation metric, initially proposed by Beeferman, Berger, and Lafferty (1997), is becoming the standard measure for assessing text segmentation algorithms. However, a theoretical analysis of the metric finds several problems: the metric penalizes false negatives more heavily than false positives, overpenalizes near misses, and is affected by variation in segment size distribution. We propose a simple modification to the P_k metric that remedies these problems. This new metric, called WindowDiff, moves a fixed-sized window across...
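The window-based comparison can be sketched in a few lines. This is a minimal illustration, not the authors' reference implementation: it assumes each segmentation is a list of 0/1 flags (1 meaning a boundary follows that unit), it counts a window as an error whenever the reference and hypothesis contain different numbers of boundaries inside it, and the exact windowing convention (windows of `k` flags, `len - k + 1` windows) is an assumption. The function name `window_diff` is hypothetical.

```python
def window_diff(reference, hypothesis, k):
    """Fraction of width-k windows whose boundary counts disagree.

    reference, hypothesis: equal-length sequences of 0/1 boundary flags.
    k: window width in units (assumed to be chosen by the caller).
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same number of units")
    if not 0 < k <= len(reference):
        raise ValueError("k must be between 1 and the number of units")

    n_windows = len(reference) - k + 1
    mismatches = 0
    for start in range(n_windows):
        ref_count = sum(reference[start:start + k])
        hyp_count = sum(hypothesis[start:start + k])
        # Penalize the window once if the boundary counts differ at all.
        if ref_count != hyp_count:
            mismatches += 1
    return mismatches / n_windows


if __name__ == "__main__":
    reference = [0, 0, 1, 0, 0, 0, 1, 0]
    near_miss = [0, 1, 0, 0, 0, 0, 1, 0]   # first boundary off by one unit
    full_miss = [0, 0, 0, 0, 0, 0, 0, 0]   # no boundaries predicted
    print(window_diff(reference, near_miss, k=2))
    print(window_diff(reference, full_miss, k=2))
```

Lower scores are better; an identical segmentation scores 0. In this toy input the near miss is penalized less than the segmentation that finds no boundaries at all.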
Clustering versus faceted categories for information exploration. Author: Marti A. Hearst (University of California, Berkeley). Communications of the ACM, Volume 49, Issue 4, April 2006, pp. 59–61, https://doi.org/10.1145/1121949.1121983 (published 01 April 2006)
In the summarization domain, a key requirement for summaries is to be factually consistent with the input document. Previous work has found that natural language inference (NLI) models do not perform competitively when applied to inconsistency detection. In this work, we revisit the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence-level), and inconsistency detection (document level). We provide a highly effective and light-weight method called SummaCConv that enables...
We argue that the advent of large volumes of full-length text, as opposed to short texts like abstracts and newswire, should be accompanied by corresponding new approaches to information access. Toward this end, we discuss the merits of imposing structure on full-length text documents; that is, a partition of the text into coherent multi-paragraph units that represent the pattern of subtopics that comprise the text. Using this structure, we can make a distinction between the main topics, which occur throughout the length of the text, and the subtopics, which are of only limited extent. We discuss why...
Designing a search system and interface may best be served (and executed) by scrutinizing usability studies.
A crucial step toward the goal of automatic extraction of propositional information from natural language text is the identification of semantic relations between constituents in sentences. We examine the problem of distinguishing among seven relation types that can occur between the entities "treatment" and "disease" in bioscience text, and the problem of identifying such entities. We compare five generative graphical models and a neural network, using lexical, syntactic, and semantic features, finding that the latter help achieve high classification accuracy.
We describe a new animation technique for supporting interactive exploration of a graph. We use the well-known radial tree layout method, in which the view is determined by the selection of a focus node. Our main contribution is a method for animating the transition to a new layout when a new focus node is selected. In order to keep the transition easy to follow, the animation linearly interpolates the polar coordinates of the nodes, while enforcing ordering and orientation constraints. We apply this technique to visualizations of social networks and of the Gnutella file-sharing network, and discuss the results from our informal...
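The core of the transition, linear interpolation in polar rather than Cartesian coordinates, can be sketched as follows. This is a simplified illustration under stated assumptions: each node position is an `(r, theta)` pair with theta in radians, the ordering and orientation constraints mentioned in the abstract are not modeled, and the function names are hypothetical.

```python
import math


def interpolate_polar(start, end, t):
    """Linearly interpolate radius and angle for t in [0, 1]."""
    r0, theta0 = start
    r1, theta1 = end
    return (r0 + (r1 - r0) * t, theta0 + (theta1 - theta0) * t)


def to_cartesian(polar):
    """Convert an (r, theta) pair to (x, y) screen-space coordinates."""
    r, theta = polar
    return (r * math.cos(theta), r * math.sin(theta))


def animation_frames(start, end, steps):
    """Return steps + 1 Cartesian positions from start to end, inclusive."""
    return [
        to_cartesian(interpolate_polar(start, end, i / steps))
        for i in range(steps + 1)
    ]


if __name__ == "__main__":
    # A node moving from ring 1 at angle 0 to ring 3 at angle pi.
    for x, y in animation_frames((1.0, 0.0), (3.0, math.pi), steps=4):
        print(f"{x:.3f}, {y:.3f}")
```

Interpolating in polar space makes the node sweep along an arc around the center instead of cutting straight across the layout, which is what keeps nodes on or between their rings during the transition.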
A quantitative analysis of a large collection of expert-rated web sites reveals that page-level metrics can accurately predict if a site will be highly rated. The analysis also provides empirical evidence that important metrics, including page composition, page formatting, and overall page characteristics, differ among web site categories such as education, community, living, and finance. These results provide an empirical foundation for web site design guidelines and also suggest which metrics can be most important for evaluation via user studies.
Cat-a-Cone: an interactive interface for specifying searches and viewing retrieval results using a large category hierarchy. Authors: Marti A. Hearst (Xerox Palo Alto Research Center, 3333 Coyote Hill Rd, Palo Alto, CA), Chandu Karadi (School of Medicine, M121, Stanford University, Stanford, CA). SIGIR '97: Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, July 1997, Pages...