Jimmy Lin

ORCID: 0000-0002-0661-7189
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Information Retrieval and Search Behavior
  • Web Data Mining and Analysis
  • Multimodal Machine Learning Applications
  • Advanced Text Analysis Techniques
  • Semantic Web and Ontologies
  • Advanced Database Systems and Queries
  • Advanced Image and Video Retrieval Techniques
  • Data Quality and Management
  • Domain Adaptation and Few-Shot Learning
  • Data Management and Algorithms
  • Biomedical Text Mining and Ontologies
  • Cloud Computing and Resource Management
  • Algorithms and Data Compression
  • Explainable Artificial Intelligence (XAI)
  • Expert finding and Q&A systems
  • Speech and dialogue systems
  • Text and Document Classification Technologies
  • Caching and Content Delivery
  • Complex Network Analysis Techniques
  • Speech Recognition and Synthesis
  • Machine Learning and Data Classification
  • Scientific Computing and Data Management
  • Recommender Systems and Techniques

University of Waterloo
2015-2024

University of Kassel
2024

Universidade Estadual de Campinas (UNICAMP)
2024

Leipzig University
2024

Tsinghua University
2024

University of Washington
2023

University of Toronto
2019-2022

Pennsylvania State University
2018-2022

Kaiser Permanente South San Francisco Medical Center
2022

Twitter (United States)
2011-2021

As DNA sequencing outpaces improvements in computer speed, there is a critical need to accelerate tasks like alignment and SNP calling. Crossbow is a cloud-computing software tool that combines the aligner Bowtie and the SNP caller SOAPsnp. Executing in parallel using Hadoop, Crossbow analyzes data comprising 38-fold coverage of a human genome in three hours using a 320-CPU cluster rented from a cloud computing service for about $85. Crossbow is available at http://bowtie-bio.sourceforge.net/crossbow/.

10.1186/gb-2009-10-11-r134 article EN cc-by Genome biology 2009-11-20

WTF ("Who to Follow") is Twitter's user recommendation service, which is responsible for creating millions of connections daily between users based on shared interests, common connections, and other related factors. This paper provides an architectural overview and shares lessons we learned in building and running the service over the past few years. Particularly noteworthy was our design decision to process the entire Twitter graph in memory on a single server, which significantly reduced architectural complexity and allowed us to develop...

10.1145/2488388.2488433 article EN 2013-05-13

This half-day tutorial introduces participants to data-intensive text processing with the MapReduce programming model [1], using the open-source Hadoop implementation. The focus will be on scalability and the tradeoffs associated with distributed processing of large datasets. Content will include general discussions about algorithm design, presentation of illustrative algorithms, case studies in HLT applications, as well as practical advice on writing programs and running them on clusters.

10.3115/1620950.1620951 article EN 2009-01-01
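The MapReduce model covered in the tutorial can be illustrated with a minimal single-process word-count sketch. This is the classic teaching example, not Hadoop itself: a real job distributes the map, shuffle, and reduce phases across a cluster, and all names below are illustrative.

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

The scalability tradeoffs the tutorial discusses live in the shuffle step, which in a real cluster moves all intermediate data over the network.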

Modeling sentence similarity is complicated by the ambiguity and variability of linguistic expression. To cope with these challenges, we propose a model for comparing sentences that uses a multiplicity of perspectives. We first model each sentence using a convolutional neural network that extracts features at multiple levels of granularity and uses multiple types of pooling. We then compare our sentence representations at several granularities using multiple similarity metrics. We apply our model to three tasks, including the Microsoft Research paraphrase identification task and two SemEval semantic...

10.18653/v1/d15-1181 article EN cc-by Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing 2015-01-01
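A rough illustration of comparing sentence representations under multiple pooling "perspectives" follows. The toy word vectors stand in for the paper's learned convolutional features, and the function names are assumptions, not the paper's API; the real model also uses multiple similarity metrics beyond cosine.

```python
import math

def pool(vectors, how):
    """Pool a list of equal-length word vectors component-wise."""
    dims = range(len(vectors[0]))
    if how == "max":
        return [max(v[d] for v in vectors) for d in dims]
    if how == "min":
        return [min(v[d] for v in vectors) for d in dims]
    return [sum(v[d] for v in vectors) / len(vectors) for d in dims]  # mean

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def multi_perspective_similarity(sent_a, sent_b):
    """Compare two sentences under several pooling 'perspectives'."""
    return {how: cosine(pool(sent_a, how), pool(sent_b, how))
            for how in ("max", "min", "mean")}

# Toy 3-dimensional embeddings standing in for learned CNN feature maps.
a = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
b = [[0.9, 0.2, 0.0], [0.7, 0.1, 0.2]]
scores = multi_perspective_similarity(a, b)
```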

In the natural language processing literature, neural networks are becoming increasingly deeper and more complex. The recent poster child of this trend is the deep language representation model, which includes BERT, ELMo, and GPT. These developments have led to the conviction that previous-generation, shallower neural networks for language understanding are obsolete. In this paper, however, we demonstrate that rudimentary, lightweight neural networks can still be made competitive without architecture changes, external training data, or additional input features. We propose...

10.48550/arxiv.1903.12136 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Software toolkits play an essential role in information retrieval research. Most open-source toolkits developed by academics are designed to facilitate the evaluation of retrieval models over standard test collections. Efforts are generally directed toward better ranking models, and less attention is usually given to scalability and other operational considerations. On the other hand, Lucene has become the de facto platform in industry for building search applications (outside a small number of companies that deploy custom infrastructure). Compared...

10.1145/3077136.3080721 article EN Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval 2017-07-28

The combination of recent developments in question-answering research and the availability of unparalleled resources developed specifically for automatic semantic processing of text in the medical domain provides a unique opportunity to explore complex question answering in the domain of clinical medicine. This article presents a system designed to satisfy the information needs of physicians practicing evidence-based medicine. We have developed a series of knowledge extractors, which employ a combination of knowledge-based and statistical techniques, for automatically identifying...

10.1162/coli.2007.33.1.63 article EN cc-by-nc-nd Computational Linguistics 2007-03-01

In this paper, we provide a characterization of the topological features of the Twitter follow graph, analyzing properties such as degree distributions, connected components, shortest path lengths, clustering coefficients, and assortativity. For each of these properties, we compare and contrast with available data from other social networks. These analyses provide a set of authoritative statistics that the community can reference. In addition, we use them to investigate an often-posed question: Is Twitter a social network or an information network? The...

10.1145/2567948.2576939 article EN 2014-04-07
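Some of the properties above can be computed directly from an edge list. The sketch below computes in-/out-degree counts plus edge reciprocity, a common probe of the social-network-vs-information-network question (mutual following suggests social ties); the sample graph is invented for illustration.

```python
def degree_stats(edges):
    """In-/out-degree counts for a directed follow graph given as (src, dst) pairs."""
    indeg, outdeg = {}, {}
    for src, dst in edges:
        outdeg[src] = outdeg.get(src, 0) + 1
        indeg[dst] = indeg.get(dst, 0) + 1
    return indeg, outdeg

def reciprocity(edges):
    """Fraction of directed edges whose reverse edge also exists."""
    edge_set = set(edges)
    mutual = sum(1 for src, dst in edge_set if (dst, src) in edge_set)
    return mutual / len(edge_set)

follows = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "a"), ("d", "b")]
indeg, outdeg = degree_stats(follows)
r = reciprocity(follows)  # 2 of the 5 edges are reciprocated -> 0.4
```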

Cloud computing is a computing platform that resides in a large data center and is able to dynamically provide servers with the ability to address a wide range of needs, from scientific research to e-commerce. The provision of computing resources as if they were a utility such as electricity, while potentially revolutionary as a computing service, presents many major problems of information policy, including issues of privacy, security, reliability, access, and regulation. This article explores the nature and potential of cloud computing, the policy issues raised,...

10.1080/19331680802425479 article EN Journal of Information Technology & Politics 2008-10-27

This work proposes the use of a pretrained sequence-to-sequence model for document ranking. Our approach is fundamentally different from the commonly adopted classification-based formulation based on encoder-only transformer architectures such as BERT. We show how a sequence-to-sequence model can be trained to generate relevance labels as "target tokens", and how the underlying logits of these target tokens can be interpreted as relevance probabilities for ranking. Experimental results on the MS MARCO passage ranking task show that our approach is superior to strong encoder-only models. On three other...

10.18653/v1/2020.findings-emnlp.63 article EN cc-by 2020-01-01
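The core scoring trick, interpreting the logits of the two target tokens as relevance probabilities, reduces to a two-way softmax. A minimal sketch follows; the logit values are invented, and in the actual model the logits come from the decoder's first generation step.

```python
import math

def relevance_probability(logit_true, logit_false):
    """Softmax over the two target-token logits; probability of the 'relevant' token."""
    m = max(logit_true, logit_false)      # subtract the max for numerical stability
    e_true = math.exp(logit_true - m)
    e_false = math.exp(logit_false - m)
    return e_true / (e_true + e_false)

# Hypothetical logits a decoder might assign to the two target tokens.
p = relevance_probability(2.0, -1.0)
```

Documents are then ranked by this probability, exactly as one would rank by a classifier's score.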

Textual similarity measurement is a challenging problem, as it requires understanding the semantics of input sentences. Most previous neural network models use coarse-grained sentence modeling, which has difficulty capturing fine-grained word-level information for semantic comparisons. As an alternative, we propose to explicitly model pairwise word interactions and present a novel similarity focus mechanism to identify important correspondences for better similarity measurement. Our ideas are implemented in a novel neural network architecture that...

10.18653/v1/n16-1108 article EN cc-by Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2016-01-01
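A bare-bones sketch of pairwise word interactions follows, with a crude keep-the-best-match stand-in for the paper's learned focus mechanism. The toy vectors are invented, and the real model uses several similarity measures and a trained network rather than a simple row-max.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def interaction_matrix(sent_a, sent_b):
    """Cosine similarity between every word pair across the two sentences."""
    return [[cosine(u, v) for v in sent_b] for u in sent_a]

def focus(matrix):
    """Crude stand-in for the focus mechanism: keep each word's best match."""
    return [max(row) for row in matrix]

# Toy 2-dimensional word embeddings for two short "sentences".
a = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0, 0.0], [0.7, 0.7]]
m = interaction_matrix(a, b)
best = focus(m)
```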

Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, Jimmy Lin. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). 2019.

10.18653/v1/n19-4013 preprint EN 2019-01-01

We present simple BERT-based models for relation extraction and semantic role labeling. In recent years, state-of-the-art performance has been achieved using neural models by incorporating lexical and syntactic features such as part-of-speech tags and dependency trees. In this paper, extensive experiments on benchmark datasets for these two tasks show that, without using any external features, a simple BERT-based model can achieve state-of-the-art performance. To our knowledge, we are the first to successfully apply BERT in this manner. Our models provide strong baselines...

10.48550/arxiv.1904.05255 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications. However, they are also notorious for being slow in inference, which makes them difficult to deploy in real-time applications. We propose a simple but effective method, DeeBERT, to accelerate BERT inference. Our approach allows samples to exit earlier without passing through the entire model. Experiments show that DeeBERT is able to save up to ~40% inference time with minimal degradation in model quality. Further analyses...

10.18653/v1/2020.acl-main.204 article EN cc-by 2020-01-01
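The early-exit idea can be sketched with stub per-layer classifiers and an entropy threshold (DeeBERT attaches an "off-ramp" classifier after each transformer layer and exits when prediction entropy is low). The stub classifiers and the threshold value below are invented for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy of a distribution; low entropy = confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def early_exit_predict(sample, classifiers, threshold=0.3):
    """Run per-layer classifiers in order; exit as soon as one is confident enough."""
    for layer, clf in enumerate(classifiers):
        probs = clf(sample)
        if entropy(probs) < threshold:
            return probs, layer          # exit early, skipping deeper layers
    return probs, len(classifiers) - 1   # fell through to the final classifier

# Stub classifiers standing in for the per-layer "off-ramp" heads.
classifiers = [
    lambda x: [0.6, 0.4],    # layer 0: uncertain (high entropy)
    lambda x: [0.95, 0.05],  # layer 1: confident (low entropy) -> exit here
    lambda x: [0.99, 0.01],  # layer 2: never reached for this sample
]
probs, exit_layer = early_exit_predict("some input", classifiers)
```

The inference-time saving comes from the layers that are never executed for easy samples.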

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. It aims to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular,...

10.1145/3404835.3463238 article EN Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021-07-11

One technique to improve the retrieval effectiveness of a search engine is to expand documents with terms that are related to or representative of the documents' content. From the perspective of a question answering system, this might comprise questions the document can potentially answer. Following this observation, we propose a simple method that predicts which queries will be issued for a given document and then expands the document with those predictions, using a vanilla sequence-to-sequence model trained on datasets consisting of pairs of queries and relevant...

10.48550/arxiv.1904.08375 preprint EN other-oa arXiv (Cornell University) 2019-01-01
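Once queries have been predicted, the expansion step itself is simple concatenation before indexing. The sketch below uses a stub predictor standing in for the trained sequence-to-sequence model; the example document and query are invented.

```python
def expand_document(doc, predict_queries):
    """Append predicted queries to the document text before indexing."""
    return doc + " " + " ".join(predict_queries(doc))

# Stub predictor; the paper trains a sequence-to-sequence model for this step.
def toy_predictor(doc):
    return ["what do cardiac muscles do"] if "heart" in doc else []

doc = "the heart pumps blood through the body"
expanded = expand_document(doc, toy_predictor)
# A term-matching retriever can now match "cardiac", which the original text lacked.
```

The effectiveness gain comes from bridging exactly this kind of vocabulary mismatch between queries and documents.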

The success of data-driven solutions to difficult problems, along with the dropping costs of storing and processing massive amounts of data, has led to growing interest in large-scale machine learning. This paper presents a case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform. We begin with an overview of this platform, which handles "traditional" data warehousing and business intelligence tasks for the organization. The core of this work lies in recent Pig extensions...

10.1145/2213836.2213958 article EN 2012-05-20

This work tackles the perennial problem of reproducible baselines in information retrieval research, focusing on bag-of-words ranking models. Although academic researchers have a long history of building and sharing retrieval systems, these systems are primarily designed to facilitate the publication of research papers. As such, they are often incomplete, inflexible, poorly documented, difficult to use, and slow, particularly in the context of modern web-scale collections. Furthermore, the growing complexity of software ecosystems...

10.1145/3239571 article EN Journal of Data and Information Quality 2018-10-29

We present, to our knowledge, the first application of BERT to document classification. A few characteristics of the task might lead one to think that BERT is not the most appropriate model: syntactic structures matter less for content categories, documents can often be longer than typical BERT input, and documents often have multiple labels. Nevertheless, we show that a straightforward classification model using BERT is able to achieve the state of the art across four popular datasets. To address the computational expense associated with BERT inference, we distill...

10.48550/arxiv.1904.08398 preprint EN other-oa arXiv (Cornell University) 2019-01-01
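The distillation mentioned at the end of the abstract typically minimizes a divergence between temperature-softened teacher and student output distributions. The sketch below shows that generic form, not the paper's exact objective; the logits and temperature are invented values.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature softens the distribution."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher (soft labels) and student outputs."""
    p = softmax(teacher_logits, temperature)  # teacher's soft labels
    q = softmax(student_logits, temperature)  # student's predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student whose logits are far from the teacher's incurs a larger loss.
loss_far = distillation_loss([4.0, 1.0, 0.5], [0.1, 2.0, 1.0])
loss_near = distillation_loss([4.0, 1.0, 0.5], [3.9, 1.1, 0.4])
```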

We explore the application of deep residual learning and dilated convolutions to the keyword spotting task, using the recently-released Google Speech Commands Dataset as our benchmark. Our best residual network (ResNet) implementation significantly outperforms Google's previous convolutional neural networks in terms of accuracy. By varying model depth and width, we can achieve compact models that also outperform previous small-footprint variants. To our knowledge, we are the first to examine these approaches for keyword spotting, and our results...

10.1109/icassp.2018.8462688 article EN 2018-04-01

A vital step towards the widespread adoption of neural retrieval models is their resource efficiency throughout the training, indexing, and query workflows. The IR community has made great advancements in training effective dual-encoder dense retrieval (DR) models recently. A dense text retrieval model uses a single vector representation per query and passage to score a match, which enables low-latency first-stage retrieval with a nearest neighbor search. Increasingly common, these approaches require enormous compute power, as they either conduct negative sampling...

10.1145/3404835.3462891 article EN Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021-07-11
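The single-vector scoring that enables nearest neighbor search can be sketched as exact brute-force inner-product search. Production systems replace the linear scan with approximate nearest neighbor indexes; the vectors below are toy values standing in for dual-encoder outputs.

```python
def dot(a, b):
    """Inner product between a query vector and a passage vector."""
    return sum(x * y for x, y in zip(a, b))

def brute_force_search(query_vec, passage_vecs, k=2):
    """Exact top-k retrieval by inner product; ANN indexes approximate this scan."""
    scored = [(dot(query_vec, v), i) for i, v in enumerate(passage_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Toy single-vector passage representations, as a dual encoder would produce.
passages = [[0.1, 0.9], [0.8, 0.3], [0.9, 0.1]]
query = [1.0, 0.2]
top = brute_force_search(query, passages)
```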

The advent of deep neural networks pre-trained via language modeling tasks has spurred a number of successful applications in natural language processing. This work explores one such popular model, BERT, in the context of document ranking. We propose two variants, called monoBERT and duoBERT, that formulate the ranking problem as pointwise and pairwise classification, respectively. These models are arranged in a multi-stage ranking architecture to form an end-to-end search system. One major advantage of this design is the ability to trade...

10.48550/arxiv.1910.14424 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Long Short-Term Memory networks (LSTMs) have been applied to daily discharge prediction with remarkable success. Many practical scenarios, however, require predictions at more granular timescales. For instance, accurate prediction of short but extreme flood peaks can make a life-saving difference, yet such peaks may escape the coarse temporal resolution of daily predictions. Naively training an LSTM on hourly data, however, entails very long input sequences that make learning hard and computationally expensive. In this study, we...

10.5194/hess-25-2045-2021 article EN cc-by Hydrology and earth system sciences 2021-04-19