- Topic Modeling
- Natural Language Processing Techniques
- Web Data Mining and Analysis
- Information Retrieval and Search Behavior
- Advanced Text Analysis Techniques
- Complex Network Analysis Techniques
- Semantic Web and Ontologies
- Genomics and Rare Diseases
- Spam and Phishing Detection
- Speech and Dialogue Systems
- Expert Finding and Q&A Systems
- Recommender Systems and Techniques
- Genetic Associations and Epidemiology
- Sentiment Analysis and Opinion Mining
- Biomedical Text Mining and Ontologies
- Cancer Genomics and Diagnostics
- Genomic Variations and Chromosomal Abnormalities
- Data Management and Algorithms
- Advanced Database Systems and Queries
- Digital Marketing and Social Media
- Algorithms and Data Compression
- Genetics, Bioinformatics, and Biomedical Research
- Digital Humanities and Scholarship
- Open Education and E-Learning
- Statistics Education and Methodologies
Twitter (United States)
2012-2021
Color (United States)
2016-2020
Yahoo (United Kingdom)
2008-2012
Yahoo (United States)
2009-2010
University of Amsterdam
2003-2007
University of Maryland, College Park
2003
The quality of user-generated content varies drastically from excellent to abuse and spam. As the availability of such content increases, the task of identifying high-quality content in sites based on user contributions (social media sites) becomes increasingly important. Social media in general exhibit a rich variety of information sources: in addition to the content itself, there is a wide array of non-content information available, such as links between items and explicit quality ratings from members of the community. In this paper we investigate methods for exploiting such community feedback...
This paper describes AutoTag, a tool which suggests tags for weblog posts using collaborative filtering methods. An evaluation of AutoTag on a large collection of blog posts shows good accuracy; coupled with the blogger's final quality control, AutoTag assists both in simplifying the tagging process and improving its quality.
In web search, recency ranking refers to ranking documents by relevance which takes freshness into account. In this paper, we propose a retrieval system which automatically detects and responds to recency sensitive queries. The system detects recency sensitive queries using a high precision classifier. The system responds to recency sensitive queries by using a machine learned ranking model trained for such queries. We use multiple recency features to provide temporal evidence which effectively represents document recency. Furthermore, we propose several training methodologies important for training recency sensitive rankers. Finally, we develop new evaluation metrics for recency sensitive queries. Our experiments...
Inherited susceptibility to common, complex diseases may be caused by rare, pathogenic variants ("monogenic") or by the cumulative effect of numerous common variants ("polygenic"). Comprehensive genome interpretation should enable assessment for both monogenic and polygenic components of inherited risk. The traditional approach requires two distinct genetic testing technologies: high coverage sequencing of known genes to detect monogenic variants, and a genome-wide genotyping array followed by imputation to calculate genome-wide polygenic scores (GPSs). We...
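The abstract above is cut off before it says how the polygenic scores are computed. As a minimal sketch, assuming the commonly used additive form (a weighted sum of effect-allele dosages), a score could be calculated as below. The function name, the variant identifiers, and the weights are all hypothetical and are not taken from the paper.

```python
from typing import Mapping


def polygenic_score(dosages: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Additive score: sum of (per-variant weight * effect-allele dosage).

    Assumption: dosages are in [0, 2] (genotyped or imputed) and each weight
    is a per-allele effect size. Neither detail is stated in the abstract.
    """
    score = 0.0
    for variant_id, weight in weights.items():
        # In this sketch a variant missing from the genotype data contributes 0.
        dosage = dosages.get(variant_id, 0.0)
        score += weight * dosage
    return score


# Hypothetical variants and weights, for illustration only.
example_weights = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 1.0}
example_dosages = {"rs_a": 2.0, "rs_b": 1.0}
example_score = polygenic_score(example_dosages, example_weights)
```

Treating a missing variant as dosage 0 is a simplification; a real pipeline would more likely impute the dosage or flag the gap.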
The reasoning tasks that can be performed with semantic web service descriptions depend on the quality of the domain ontologies used to create these descriptions. However, building such domain ontologies is a time consuming and difficult task. We describe an automatic extraction method that learns domain ontologies for web service descriptions from textual documentations attached to web services. We conducted our experiments in the field of bioinformatics by learning an ontology from the documentation of the web services used in myGrid, a project that supports biology experiments on the Grid. Based on the evaluation of the extracted ontology in the context...
We describe a method for discovering irregularities in temporal mood patterns appearing in a large corpus of blog posts, and labeling them with a natural language explanation. Simple techniques based on comparing corpus frequencies, coupled with large quantities of data, are shown to be effective for identifying the events underlying changes in global moods.
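The idea of comparing term frequencies between a period of interest and a background corpus can be sketched as follows. This is an illustrative assumption, not the paper's scoring function: it ranks terms by the ratio of add-one-smoothed relative frequencies, and the function name and toy tokens are hypothetical.

```python
from collections import Counter


def overused_terms(period_tokens, background_tokens, top_n=3):
    """Rank terms by how much more frequent they are in the period than in the background."""
    period = Counter(period_tokens)
    background = Counter(background_tokens)
    period_total = sum(period.values())
    background_total = sum(background.values())
    vocab_size = len(set(period) | set(background))

    scores = {}
    for term, count in period.items():
        # Add-one smoothing keeps unseen background terms from dividing by zero.
        p_period = (count + 1) / (period_total + vocab_size)
        p_background = (background[term] + 1) / (background_total + vocab_size)
        scores[term] = p_period / p_background

    # Highest ratio first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Toy data: "storm" is frequent in the spike period and absent from the background.
spike_tokens = ["storm", "storm", "storm", "today", "work"]
usual_tokens = ["today", "work", "today", "work", "lunch", "today"]
ranked = overused_terms(spike_tokens, usual_tokens)
```

On this toy input the term that appears only in the spike period ranks first, which is the kind of signal that could be used to label an unusual mood change.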
We present the architecture behind Twitter's real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time "twist": after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of "big data". We tell the story of how our system was built twice: our first implementation was built on a typical Hadoop-based...
The real-time nature of Twitter means that term distributions in tweets and in search queries change rapidly: the most frequent terms in one hour may look very different from those in the next. Informally, we call this phenomenon "churn". Our interest in analyzing churn stems from the perspective of real-time search. How do we "correctly" compute term statistics, considering that the underlying distributions change rapidly? In this paper, we present an analysis of tweet and query churn on Twitter, as a first step to answering this question. Analyses reveal interesting insights on the temporal...
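One simple way to make "churn" concrete is to measure how much of the current hour's top-k term list was absent from the previous hour's. The sketch below uses that definition as an assumption; the abstract does not state which measures the paper uses, and the function names and toy data are hypothetical.

```python
from collections import Counter


def top_terms(tokens, k):
    """Return the k most frequent terms, ties broken alphabetically."""
    counts = Counter(tokens)
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:k]


def churn(previous_tokens, current_tokens, k=3):
    """Fraction of the current top-k terms that were not in the previous top-k."""
    current_top = top_terms(current_tokens, k)
    if not current_top:
        return 0.0  # No terms in the current window, so nothing changed.
    previous_top = set(top_terms(previous_tokens, k))
    new_terms = [t for t in current_top if t not in previous_top]
    return len(new_terms) / len(current_top)


# Toy hourly windows: a breaking event displaces most of the previous top terms.
hour_1 = ["election", "election", "debate", "debate", "vote", "weather"]
hour_2 = ["earthquake", "earthquake", "earthquake", "tsunami", "tsunami", "election"]
hourly_churn = churn(hour_1, hour_2, k=3)
```

A value near 1 means the top of the distribution was almost entirely replaced between windows, and a value of 0 means it did not change.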
We describe a system for automating call-center analysis and monitoring. Our system integrates transcription of incoming calls with analysis of their content; for the analysis, we introduce a novel method of estimating the domain-specific importance of conversation fragments, based on divergence of corpus statistics. Combining this method with Information Retrieval approaches, we provide knowledge-mining tools both for the call-center agents and for administrators of the center.
User browsing information, particularly non-search-related activity, reveals important contextual information on the preferences and intents of Web users. In this article, we demonstrate the importance of mining general Web user behavior data to improve ranking and other Web-search experience, with an emphasis on analyzing individual user sessions for creating aggregate models. In this context, we introduce ClickRank, an efficient, scalable algorithm for estimating Webpage and Website importance from general Web user-behavior data. We lay out the theoretical...
Advances in genome sequencing have led to a tremendous increase in the discovery of novel missense variants, but evidence for determining clinical significance can be limited or conflicting. Here, we present Learning from Evidence to Assess Pathogenicity (LEAP), a machine learning model that utilizes a variety of feature categories to classify variants, and achieves high performance in multiple genes and different health conditions. Feature categories include functional predictions, splice predictions, population frequencies, conservation scores,...
Next generation sequencing (NGS) has become a common technology for clinical genetic tests. The quality of NGS calls varies widely and is influenced by features like reference sequence characteristics, read depth, and mapping accuracy. With recent advances in NGS technology and software tools, the majority of variants called using NGS alone are in fact accurate and reliable. However, a small subset of difficult-to-call variants that still require orthogonal confirmation exists. For this reason, many clinical laboratories confirm NGS results...