- Topic Modeling
- Natural Language Processing Techniques
- Hate Speech and Cyberbullying Detection
- Recommender Systems and Techniques
- Housing Market and Economics
- Spam and Phishing Detection
- Multimodal Machine Learning Applications
- Customer churn and segmentation
- Data Mining Algorithms and Applications
- Advanced Malware Detection Techniques
- Domain Adaptation and Few-Shot Learning
- Human Mobility and Location-Based Analysis
- Data Quality and Management
- Data Stream Mining Techniques
- Text Readability and Simplification
- Software Engineering Research
- Geochemistry and Geologic Mapping
- FinTech, Crowdfunding, Digital Finance
- Web Data Mining and Analysis
- Microfinance and Financial Inclusion
- Biomedical Text Mining and Ontologies
- Machine Learning in Healthcare
- Adversarial Robustness in Machine Learning
- Contact Mechanics and Variational Inequalities
- Music and Audio Processing
Group Sense (China)
2023-2024
Yahoo (United States)
2019-2022
New Jersey Institute of Technology
2015-2020
Worcester Polytechnic Institute
2020
Yahoo (Spain)
2020
Twitter (United States)
2020
Institute of Rock and Soil Mechanics
2017
University of Delaware
2014-2015
Infectious agents are the third highest human cancer risk factor and may have a greater role in the origin and/or progression of cancers and related pathogenesis. Thus, knowing the specific viruses and microbial agents associated with a cancer type may provide insights into cause, diagnosis and treatment. We utilized a pan-pathogen array technology to identify the microbial signatures associated with triple negative breast cancer (TNBC). This technology detects low copy number and fragmented genomes extracted from formalin-fixed paraffin embedded archival tissues. The...
Online peer-to-peer (P2P) lending is expected to benefit both investors and borrowers due to its low transaction cost and the elimination of expensive intermediaries. From the lenders' perspective, maximizing the return on investment is an ultimate goal during the decision-making procedure. In this paper, we explore and address a fundamental problem underlying such a goal: how to represent the two competing risks, charge-off and prepayment, in funded loans. We propose to model these potential risks simultaneously, which remains largely...
We consider the problem of project success prediction on crowdfunding platforms. Although the information in a project profile can be of different modalities such as text, images, and metadata, most existing approaches leverage only the text-dominated modality. Nowadays rich visual images have been utilized in more and more project profiles for attracting backers, yet little work has been conducted to evaluate their effects on success prediction. Moreover, meta information has been exploited in many existing approaches for improving prediction accuracy. However, such meta information is usually limited to the dynamics after...
We present our HABERTOR model for detecting hatespeech in large scale user-generated content. Inspired by the recent success of the BERT model, we propose several modifications to BERT to enhance the performance on the downstream hatespeech classification task. HABERTOR inherits BERT's architecture, but is different in four aspects: (i) it generates its own vocabularies and is pre-trained from scratch using the largest scale hatespeech dataset; (ii) it consists of Quaternion-based factorized components, resulting in a much smaller number of parameters, faster training...
It is widely acknowledged that the value of a house is a mixture of a large number of characteristics. House price prediction thus presents a unique set of challenges in practice. While a large body of work is dedicated to this task, its performance and applications have been limited by the shortage of long time span transaction data, the absence of real-world settings and the insufficiency of housing features. To this end, a time-aware latent hierarchical model is introduced to capture the underlying spatiotemporal interactions behind the evolution of house prices. The...
Much of named entity recognition (NER) research focuses on developing dataset-specific models based on data from the domain of interest and a limited set of related entity types. This is frustrating as each new dataset requires a new model to be trained and stored. In this work, we present a "versatile" model, the Prompting-based Unified NER system (PUnifiedNER), that works with data from different domains and can recognise up to 37 entity types simultaneously, and theoretically it could be as many as possible. By using prompt learning, PUnifiedNER...
Bang An, Jie Lyu, Zhenyi Wang, Chunyuan Li, Changwei Hu, Fei Tan, Ruiyi Zhang, Yifan Hu, Changyou Chen. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
User intended actions are widely seen in many areas. Forecasting these actions and taking proactive measures to optimize business outcomes is a crucial step towards sustaining steady business growth. In this work, we focus on predicting attrition, which is one of the typical user intended actions. Conventional attrition predictive modeling strategies suffer from a few inherent drawbacks. To overcome these limitations, we propose a novel end-to-end learning scheme to keep track of the evolution of attrition patterns for predictive modeling. It integrates user activity logs,...
In this work, we present a new language pre-training model TNT (Text Normalization based pre-training of Transformers) for content moderation. Inspired by the masking strategy and text normalization, TNT is developed to learn language representation by training transformers to reconstruct text from four operation types typically seen in text manipulation: substitution, transposition, deletion, and insertion. Furthermore, the normalization involves the prediction of both operation types and token labels, enabling TNT to learn from more challenging tasks than the standard task of masked word...
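The four operation types named in the abstract above can be illustrated with a minimal sketch. This is not the TNT implementation: operating at the token level, the function names `apply_operation` and `corrupt`, and the `filler` vocabulary are all assumptions made for illustration. The sketch only shows how clean text might be turned into (noisy text, operation label) pairs of the kind a reconstruction objective could train on.

```python
import random

OPERATIONS = ("substitution", "transposition", "deletion", "insertion")


def apply_operation(tokens, op, index, filler=None):
    """Return a corrupted copy of `tokens` after applying one operation at `index`."""
    out = list(tokens)
    if op == "substitution":
        out[index] = filler  # replace the token at index
    elif op == "transposition":
        out[index], out[index + 1] = out[index + 1], out[index]  # swap neighbours
    elif op == "deletion":
        del out[index]  # drop the token at index
    elif op == "insertion":
        out.insert(index, filler)  # add a new token before index
    else:
        raise ValueError(f"unknown operation: {op}")
    return out


def corrupt(tokens, vocab, rng):
    """Pick one random operation and position; return (corrupted, op, index)."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    op = rng.choice(OPERATIONS)
    # Transposition needs a right-hand neighbour, so it cannot start at the last token.
    upper = len(tokens) - 1 if op == "transposition" else len(tokens)
    index = rng.randrange(upper)
    filler = rng.choice(vocab) if op in ("substitution", "insertion") else None
    return apply_operation(tokens, op, index, filler), op, index


if __name__ == "__main__":
    clean = ["this", "is", "a", "clean", "sentence"]
    noisy, op, index = corrupt(clean, vocab=["xx", "yy"], rng=random.Random(0))
    print(op, index, noisy)
```

In this sketch the operation label and position returned by `corrupt` play the role of prediction targets alongside the original tokens; how the actual model encodes those targets is not stated in the abstract.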
Current methods for prompt learning in zero-shot scenarios widely rely on a development set with sufficient human-annotated data to select the best-performing prompt template a posteriori. This is not ideal because in a real-world zero-shot scenario of practical relevance, no labelled data is available. Thus, we propose a simple yet effective method for screening reasonable prompt templates in zero-shot text classification: Perplexity Selection (Perplection). We hypothesize that language discrepancy can be used to measure the efficacy of prompt templates, and...
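As a rough illustration of screening templates by perplexity without labelled data, the sketch below scores each candidate template by the mean perplexity of the filled-in text and keeps the lowest-scoring one. Several things here are assumptions rather than details from the abstract: the "lowest perplexity wins" rule, the `{text}` placeholder convention, the names `UnigramLM` and `select_template`, and above all the toy add-one-smoothed unigram model, which merely stands in for a real pre-trained language model.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy add-one-smoothed unigram model; a stand-in for a real pre-trained LM."""

    def __init__(self, corpus):
        tokens = [tok for line in corpus for tok in line.lower().split()]
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.vocab_size = len(self.counts) + 1  # +1 reserves mass for unseen tokens

    def log_prob(self, token):
        return math.log((self.counts[token] + 1) / (self.total + self.vocab_size))

    def perplexity(self, text):
        tokens = text.lower().split()
        avg_neg_log_prob = -sum(self.log_prob(t) for t in tokens) / len(tokens)
        return math.exp(avg_neg_log_prob)


def select_template(lm, templates, texts):
    """Return the template with the lowest mean perplexity over the filled-in texts."""

    def mean_ppl(template):
        return sum(lm.perplexity(template.format(text=t)) for t in texts) / len(texts)

    return min(templates, key=mean_ppl)


if __name__ == "__main__":
    lm = UnigramLM(["the movie was great", "the movie was bad", "it was a great film"])
    candidates = ["{text} it was great", "{text} zxqv wvut"]
    print(select_template(lm, candidates, texts=["the movie"]))
```

No labels are consulted at any point; only unlabelled input texts and the language model's own scores are used, which is the property the abstract emphasises.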
The affiliated school district of a real estate property is often a crucial concern. How to automate the identification of residential homes located in a favorable educational environment, however, is largely unexplored until now. The availability of heterogeneous real estate-related data offers a great opportunity for this task. Nevertheless, it is such heterogeneity that poses significant challenges to their amalgamation in a unified fashion. To this end, we develop the G-LRMM model to integrate digital price, textual comments, and...
The neural attention mechanism plays an important role in many natural language processing applications. In particular, the use of multi-head attention extends single-head attention by allowing a model to jointly attend to information from different perspectives. Without explicit constraining, however, multi-head attention may suffer from attention collapse, an issue that makes different heads extract similar attentive features, thus limiting the model's representation power. In this paper, for the first time, we provide a novel understanding of multi-head attention from a Bayesian perspective. Based on...
Current methods for prompt learning in zero-shot scenarios widely rely on a development set with sufficient human-annotated data to select the best-performing prompt template a posteriori. This is not ideal because in a real-world zero-shot scenario of practical relevance, no labelled data is available. Thus, we propose a simple yet effective method for screening reasonable prompt templates in zero-shot text classification: Perplexity Selection (Perplection). We hypothesize that language discrepancy can be used to measure the efficacy of prompt templates,...
Aligned Large Language Models (LLMs) showcase remarkable versatility, capable of handling diverse real-world tasks. Meanwhile, aligned LLMs are also expected to exhibit speciality, excelling in specific applications. However, fine-tuning with extra data, a common practice to gain speciality, often leads to catastrophic forgetting (CF) of previously acquired versatility, hindering the model's performance across diverse tasks. In response to this challenge, we propose CoFiTune, a coarse to fine framework in an attempt to strike the balance between speciality...
We consider the problem of Named Entity Recognition (NER) on biomedical scientific literature, and more specifically genomic variant recognition in this work. Significant success has been achieved for NER on canonical tasks in recent years where large data sets are generally available. However, it remains a challenging problem in many domain-specific areas, especially the domains where only small gold annotations can be obtained. In addition, genomic variant entities exhibit diverse linguistic heterogeneity, differing much...
Content understanding, with many potential industrial applications, is spurring interest by researchers in many areas of artificial intelligence. We propose to revisit the content understanding problem in digital marketing from three novel perspectives. First, we explore how user experience is delivered by divergent key multimedia elements. Second, we treat content understanding as elucidating their causal implications in driving user responses. Third, we understand the content based on observational audience visit logs. To approach this problem, we measure and...
In this work, we report our efforts in advancing Chinese Word Segmentation for the purpose of rapid deployment in different applications. The pre-trained language model (PLM) based segmentation methods have achieved state-of-the-art (SOTA) performance, whereas this paradigm also poses challenges in deployment. These include the balance between performance and cost, segmentation ambiguity due to domain diversity and vague word boundaries, and multi-grained segmentation. In this context, we propose a simple yet effective approach, namely...