- Topic Modeling
- Natural Language Processing Techniques
- Biomedical Text Mining and Ontologies
- Advanced Text Analysis Techniques
- Semantic Web and Ontologies
- Speech and Dialogue Systems
- Text Readability and Simplification
- Speech Recognition and Synthesis
- Advanced Graph Neural Networks
- Handwritten Text Recognition Techniques
- Text and Document Classification Technologies
- Genomics and Phylogenetic Studies
- Protist Diversity and Phylogeny
- RNA and protein synthesis mechanisms
- Expert finding and Q&A systems
- Multimodal Machine Learning Applications
- Authorship Attribution and Profiling
- Pharmacogenetics and Drug Metabolism
- Advanced Image and Video Retrieval Techniques
- Spam and Phishing Detection
- Time Series Analysis and Forecasting
- Neural Networks and Applications
- Hate Speech and Cyberbullying Detection
- Sentiment Analysis and Opinion Mining
- Privacy, Security, and Data Protection
Google (United States)
2021-2023
University of California, Santa Barbara
2023
University of Rochester
2023
Allen Institute
2018-2020
Allen Institute for Artificial Intelligence
2017-2019
Northwestern University
2018
Carnegie Mellon University
2012-2017
Laboratoire d'Informatique de Paris-Nord
2017
Johns Hopkins University
2017
The University of Tokyo
2017
Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust, practical, publicly available models. This paper describes scispaCy, a new tool for practical biomedical/scientific text processing, which heavily leverages the spaCy library. We detail the performance of two packages of models released in scispaCy and demonstrate their robustness on several...
Pre-trained word embeddings learned from unlabeled text have become a standard component of neural network architectures for NLP tasks. However, in most cases, the recurrent network that operates on word-level representations to produce context-sensitive representations is trained on relatively little labeled data. In this paper, we demonstrate a general semi-supervised approach for adding pre-trained context embeddings from bidirectional language models to NLP systems and apply it to sequence labeling tasks. We evaluate our model on two datasets for named entity...
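The augmentation described above can be sketched in a few lines: each token's input is its word embedding concatenated with hidden states from pre-trained forward and backward language models. This is a toy illustration with hand-made vectors, not the paper's architecture; all dimensions and values are hypothetical.

```python
# Sketch: augment word embeddings with pre-trained bidirectional LM states.
# All vectors below are toy, hand-made values (hypothetical).

def augment_with_lm(word_embs, fwd_lm_states, bwd_lm_states):
    """Concatenate [word_emb; forward LM state; backward LM state] per token."""
    assert len(word_embs) == len(fwd_lm_states) == len(bwd_lm_states)
    return [w + f + b for w, f, b in zip(word_embs, fwd_lm_states, bwd_lm_states)]

# Toy sentence of 2 tokens: word embeddings of dim 3, LM states of dim 2 each.
word_embs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
fwd = [[1.0, 1.1], [1.2, 1.3]]
bwd = [[2.0, 2.1], [2.2, 2.3]]

augmented = augment_with_lm(word_embs, fwd, bwd)
print(len(augmented), len(augmented[0]))  # 2 tokens, each of dim 3+2+2 = 7
```

The sequence labeler (e.g. a CRF or BiLSTM tagger) would then consume these augmented vectors in place of plain word embeddings.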
We describe DyNet, a toolkit for implementing neural network models based on dynamic declaration of network structure. In the static declaration strategy that is used in toolkits like Theano, CNTK, and TensorFlow, the user first defines a computation graph (a symbolic representation of the computation), and then examples are fed into an engine that executes this computation and computes its derivatives. In DyNet's dynamic declaration strategy, computation graph construction is mostly transparent, being implicitly constructed by executing procedural code that computes the network outputs, leaving the user free to use different...
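The dynamic-declaration idea can be illustrated with a minimal toy autodiff sketch (this is not DyNet's actual API): the graph is built implicitly by running ordinary procedural code, so a different input can yield a different graph structure.

```python
# Toy dynamic computation graph: nodes are created as ordinary code runs.

class Node:
    def __init__(self, value, parents=(), grad_fn=None):
        self.value = value
        self.parents = parents
        self.grad_fn = grad_fn  # maps upstream grad -> grads for parents
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), lambda g: (g, g))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    lambda g: (g * other.value, g * self.value))

def backward(root):
    # Simple traversal; assumes each intermediate node feeds one consumer
    # (true for the graph built below), which keeps this demo short.
    root.grad = 1.0
    stack = [root]
    while stack:
        node = stack.pop()
        if node.grad_fn is None:
            continue
        for parent, g in zip(node.parents, node.grad_fn(node.grad)):
            parent.grad += g
            stack.append(parent)

# The graph's shape depends on the input length -- declared per example.
def score(xs, w):
    total = Node(0.0)
    for x in xs:
        total = total + Node(x) * w
    return total

w = Node(2.0)
y = score([1.0, 2.0, 3.0], w)
backward(y)
print(y.value, w.grad)  # 12.0 and d(sum_i x_i * w)/dw = 1+2+3 = 6.0
```

The point mirrored from the abstract: there is no separate "define graph, then feed data" phase; executing `score` on each example *is* the graph construction.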
Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew Peters, Joanna Power, Sam Skjonsberg, Lucy Lu Wang, Chris Wilhelm, Zheng Yuan, Madeleine van Zuylen, Oren Etzioni. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry...
We introduce new methods for estimating and evaluating embeddings of words in more than fifty languages in a single shared embedding space. Our estimation methods, multiCluster and multiCCA, use dictionaries and monolingual data; they do not require parallel data. Our evaluation method, multiQVEC-CCA, is shown to correlate better than previous ones with two downstream tasks (text categorization and parsing). We also describe a web portal that will facilitate further research in this area, along with open-source releases of all our methods.
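The clustering step behind a multiCluster-style method can be sketched as follows: translation dictionary entries link words across languages into clusters, and every word in a cluster would share one embedding slot. This is a rough sketch of the idea under assumed details, using union-find over toy dictionary pairs, not the paper's implementation.

```python
# Union-find over bilingual dictionary pairs -> multilingual word clusters.

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[ra] = rb

# Toy translation dictionary entries as (lang:word, lang:word) pairs.
pairs = [("en:dog", "fr:chien"), ("fr:chien", "es:perro"), ("en:cat", "fr:chat")]

words = {w for p in pairs for w in p}
parent = {w: w for w in words}
for a, b in pairs:
    union(parent, a, b)

clusters = {}
for w in words:
    clusters.setdefault(find(parent, w), set()).add(w)

# Each cluster would then receive a single shared embedding vector.
sizes = sorted(len(c) for c in clusters.values())
print(sizes)  # [2, 3]: {cat, chat} and {dog, chien, perro}
```

Transitivity through the dictionary (dog–chien, chien–perro) is what places words from languages with no direct dictionary entry into the same cluster.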
We train one multilingual model for dependency parsing and use it to parse sentences in several languages. The parsing model uses (i) multilingual word clusters and embeddings; (ii) token-level language information; and (iii) language-specific features (fine-grained POS tags). This input representation enables the parser not only to parse effectively in multiple languages, but also to generalize across languages based on linguistic universals and typological similarities, making it more effective to learn from limited annotations. Our parser’s...
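The three-part input representation named above can be sketched as a simple vector concatenation; the vectors, language inventory, and tag set below are toy stand-ins, not the parser's actual features.

```python
# Sketch: token input = word embedding (+) language one-hot (+) POS one-hot.

LANGS = ["en", "fr", "de"]          # hypothetical language inventory
POS = ["NOUN", "VERB", "ADJ"]       # hypothetical fine-grained tag set

def one_hot(value, vocab):
    return [1.0 if v == value else 0.0 for v in vocab]

def token_input(word_emb, lang, fine_pos):
    # (i) multilingual word embedding, (ii) token-level language info,
    # (iii) language-specific fine-grained POS feature
    return word_emb + one_hot(lang, LANGS) + one_hot(fine_pos, POS)

vec = token_input([0.5, -0.5], "fr", "NOUN")
print(vec)  # [0.5, -0.5, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]
```

Because the word embedding lives in a shared multilingual space while the language indicator lets the model specialize, one parameter set can both share structure across languages and keep language-specific behavior.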
Arman Cohan, Waleed Ammar, Madeleine van Zuylen, Field Cady. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
Chandra Bhagavatula, Sergey Feldman, Russell Power, Waleed Ammar. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018.
Dongyeop Kang, Waleed Ammar, Bhavana Dalvi, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard Hovy, Roy Schwartz. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018.
Non-textual components such as charts, diagrams and tables provide key information in many scientific documents, but the lack of large labeled datasets has impeded the development of data-driven methods for figure extraction. In this paper, we induce high-quality training labels for the task of figure extraction in a large number of scientific documents, with no human intervention. To accomplish this, we leverage the auxiliary data provided in two large web collections of scientific documents (arXiv and PubMed) to locate figures and their associated captions in the rasterized PDF. We share the resulting...
Online discussion forums, known as forums for short, are conversational social cyberspaces constituting rich repositories of content and an important source of collaborative knowledge. However, most of this knowledge is buried inside the forum infrastructure, and its extraction is both complex and difficult. The ability to automatically rate postings in online discussions based on the value of their contribution enhances users' ability to find knowledge within this content. Several key applications have utilized the collective intelligence of contributions made by users. In a large...
Chu-Cheng Lin, Waleed Ammar, Chris Dyer, Lori Levin. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2015.
This paper describes our submission for the ScienceIE shared task (SemEval-2017 Task 10) on entity and relation extraction from scientific papers. Our model is based on the end-to-end relation extraction model of Miwa and Bansal (2016) with several enhancements such as semi-supervised learning via neural language models, character-level encoding, gazetteers extracted from existing knowledge bases, and model ensembles. Our official submission ranked first in the end-to-end entity and relation extraction scenario (scenario 1) and second in the relation-only scenario (scenario 3).
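One of the enhancements named above, gazetteer features, can be sketched as a longest-match lookup of token spans against a list of known terms. The gazetteer entries and the flag scheme below are illustrative assumptions, not the submission's actual feature set.

```python
# Sketch: flag tokens that fall inside any gazetteer match.

GAZETTEER = {("support", "vector", "machine"), ("neural", "network")}  # toy entries
MAX_LEN = max(len(entry) for entry in GAZETTEER)

def gazetteer_flags(tokens):
    """Return one boolean per token: True if inside any gazetteer match."""
    flags = [False] * len(tokens)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + MAX_LEN, len(tokens) + 1)):
            if tuple(t.lower() for t in tokens[i:j]) in GAZETTEER:
                for k in range(i, j):
                    flags[k] = True
    return flags

tokens = "We train a neural network classifier".split()
print(gazetteer_flags(tokens))  # [False, False, False, True, True, False]
```

In a tagger these flags would typically be appended to each token's feature vector alongside character-level and language-model features.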
Ontology alignment is the task of identifying semantically equivalent entities from two given ontologies. Different ontologies have different representations of the same entity, resulting in a need to de-duplicate entities when merging ontologies. We propose a method for enriching an ontology with external definition and context information, and use this additional information for ontology alignment. We develop a neural architecture capable of encoding the additional information when available, and show that the addition of external data results in an F1-score of 0.69 on the Ontology Alignment Evaluation Initiative...
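The enrichment idea can be sketched without the neural encoder: attach each entity's definition text to its name and compare the enriched representations. The bag-of-words cosine below, and the example entities, are simplifying assumptions standing in for the paper's learned encoder.

```python
# Sketch: align ontology entities by similarity of name + definition text.
import math
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    denom = (math.sqrt(sum(v * v for v in a.values()))
             * math.sqrt(sum(v * v for v in b.values())))
    return num / denom if denom else 0.0

def enrich(name, definition):
    # Entity representation = bag of words over name plus definition.
    return bow(name + " " + definition)

e1 = enrich("myocardial infarction", "necrosis of heart muscle from ischemia")
e2 = enrich("heart attack", "death of heart muscle due to loss of blood supply")
e3 = enrich("femur", "long bone of the thigh")

print(cosine(e1, e2) > cosine(e1, e3))  # True: definitions share vocabulary
```

The names "myocardial infarction" and "heart attack" share no tokens, so the match here comes entirely from the definitions, which is the motivation for the enrichment.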
Iz Beltagy, Kyle Lo, Waleed Ammar. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
Linguistic borrowing is the phenomenon of transferring linguistic constructions (lexical, phonological, morphological, and syntactic) from a "donor" language to a "recipient" language as a result of contacts between communities speaking different languages. Borrowed words are found in all languages, and, in contrast to cognate relationships, borrowing relationships may exist across unrelated languages (for example, about 40% of Swahili's vocabulary is borrowed from Arabic). In this paper, we develop a model of morpho-phonological...
Type-level word embeddings use the same set of parameters to represent all instances of a word regardless of its context, ignoring the inherent lexical ambiguity in language. Instead, we embed semantic concepts (or synsets) as defined in WordNet, and represent a word token in a particular context by estimating a distribution over relevant semantic concepts. We use the new, context-sensitive embeddings in a model for predicting prepositional phrase (PP) attachments and jointly learn the concept embeddings and model parameters. We show that using context-sensitive embeddings improves the accuracy of PP attachment by 5.4% absolute points,...
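The token representation described above can be sketched as an expectation over concept embeddings: sum over senses of p(sense | context) times the sense's embedding. The sense inventory, probabilities, and vectors below are toy values, and the context model that produces the distribution is assumed, not shown.

```python
# Sketch: token embedding = sum_s p(s | context) * emb(s) over WordNet-style senses.

def token_embedding(sense_probs, sense_embs):
    """Weighted sum of sense embeddings under a context-dependent distribution."""
    dim = len(next(iter(sense_embs.values())))
    vec = [0.0] * dim
    for sense, p in sense_probs.items():
        for i, v in enumerate(sense_embs[sense]):
            vec[i] += p * v
    return vec

# Toy 2-d embeddings for two hypothetical senses of "bank".
sense_embs = {"bank.n.01": [1.0, 0.0],   # river bank
              "bank.n.02": [0.0, 1.0]}   # financial institution

# In "deposit money at the bank", context would favor the financial sense:
probs = {"bank.n.01": 0.1, "bank.n.02": 0.9}

vec = token_embedding(probs, sense_embs)
print(vec)  # [0.1, 0.9]: pulled toward the financial-sense embedding
```

Two tokens of the same word thus get different vectors whenever their contexts induce different sense distributions, which is exactly what type-level embeddings cannot do.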
We describe the CMU submission for the 2014 shared task on language identification in code-switched data. We participated in all four language pairs: Spanish‐English, Mandarin‐English, Nepali‐English, and Modern Standard Arabic‐Arabic dialects. After describing our CRF-based baseline system, we discuss three extensions for learning from unlabeled data: semi-supervised learning, word embeddings, and word lists.
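A CRF baseline for token-level language identification rests on per-token feature extraction; the sketch below shows the kind of surface features such a system might use. The feature names and the example are illustrative, not the submission's actual feature set.

```python
# Sketch: per-token features for CRF-based language ID in code-switched text.

def token_features(tokens, i):
    t = tokens[i]
    return {
        "lower=" + t.lower(): 1.0,          # word identity
        "suffix3=" + t.lower()[-3:]: 1.0,   # character suffix
        "is_upper": float(t.isupper()),
        "is_title": float(t.istitle()),
        # neighboring-word context, useful at code-switch boundaries:
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"): 1.0,
    }

tokens = "yo quiero ice cream".split()
feats = token_features(tokens, 2)
print("prev=quiero" in feats, feats["suffix3=ice"])  # True 1.0
```

A CRF then scores whole label sequences over these features, so a Spanish neighbor ("quiero") can lower the score of labeling "ice" as Spanish even when the word itself is ambiguous.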
Rahul Goel, Waleed Ammar, Aditya Gupta, Siddharth Vashishtha, Motoki Sano, Faiz Surani, Max Chang, HyunJeong Choe, David Greene, Chuan He, Rattima Nitisaroj, Anna Trukhina, Shachi Paul, Pararth Shah, Rushin Shah, Zhou Yu. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.