Fei Tan

ORCID: 0000-0002-3232-1912
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Hate Speech and Cyberbullying Detection
  • Recommender Systems and Techniques
  • Housing Market and Economics
  • Spam and Phishing Detection
  • Multimodal Machine Learning Applications
  • Customer Churn and Segmentation
  • Data Mining Algorithms and Applications
  • Advanced Malware Detection Techniques
  • Domain Adaptation and Few-Shot Learning
  • Human Mobility and Location-Based Analysis
  • Data Quality and Management
  • Data Stream Mining Techniques
  • Text Readability and Simplification
  • Software Engineering Research
  • Geochemistry and Geologic Mapping
  • FinTech, Crowdfunding, Digital Finance
  • Web Data Mining and Analysis
  • Microfinance and Financial Inclusion
  • Biomedical Text Mining and Ontologies
  • Machine Learning in Healthcare
  • Adversarial Robustness in Machine Learning
  • Contact Mechanics and Variational Inequalities
  • Music and Audio Processing

Group Sense (China)
2023-2024

Yahoo (United States)
2019-2022

New Jersey Institute of Technology
2015-2020

Worcester Polytechnic Institute
2020

Yahoo (Spain)
2020

Twitter (United States)
2020

Institute of Rock and Soil Mechanics
2017

University of Delaware
2014-2015

Infectious agents are the third highest human cancer risk factor and may have a greater role in the origin and/or progression of cancers and related pathogenesis. Thus, knowing the specific viruses and microbial agents associated with a cancer type may provide insights into cause, diagnosis, and treatment. We utilized a pan-pathogen array technology to identify the microbial signatures associated with triple negative breast cancer (TNBC). This technology detects low copy number and fragmented genomes extracted from formalin-fixed paraffin embedded archival tissues. The...

10.1038/srep15162 article EN cc-by Scientific Reports 2015-10-15

Online peer-to-peer (P2P) lending is expected to benefit both investors and borrowers due to their low transaction cost and the elimination of expensive intermediaries. From the lenders' perspective, maximizing return on investment is an ultimate goal during the decision-making procedure. In this paper, we explore and address a fundamental problem underlying such a goal: how to represent the two competing risks, charge-off and prepayment, in funded loans. We propose to model both potential risks simultaneously, which remains largely...

10.1109/tnnls.2018.2870573 article EN IEEE Transactions on Neural Networks and Learning Systems 2018-10-10

We consider the problem of project success prediction on crowdfunding platforms. Although the information in a project profile can be of different modalities such as text, images, and metadata, most existing approaches leverage only the text-dominated modality. Nowadays rich visual images have been utilized in more and more project profiles for attracting backers, yet little work has been conducted to evaluate their effects towards success prediction. Moreover, meta information has been exploited in many existing approaches for improving prediction accuracy. However, it is usually limited to the dynamics after...

10.24963/ijcai.2019/299 article EN 2019-07-28

We present our HABERTOR model for detecting hatespeech in large scale user-generated content. Inspired by the recent success of the BERT model, we propose several modifications to BERT to enhance the performance on the downstream hatespeech classification task. HABERTOR inherits BERT's architecture, but is different in four aspects: (i) it generates its own vocabularies and is pre-trained from scratch using the largest dataset; (ii) it consists of Quaternion-based factorized components, resulting in a much smaller number of parameters, faster training...

10.18653/v1/2020.emnlp-main.606 article EN cc-by 2020-01-01

It is widely acknowledged that the value of a house is the mixture of a large number of characteristics. House price prediction thus presents a unique set of challenges in practice. While a large body of works are dedicated to this task, their performance and applications have been limited by the shortage of long time span transaction data, the absence of real-world settings, and the insufficiency of housing features. To this end, a time-aware latent hierarchical model is introduced to capture the underlying spatiotemporal interactions behind the evolution of house prices. The...

10.1109/icdm.2017.147 article EN 2017 IEEE International Conference on Data Mining (ICDM) 2017-11-01

Much of named entity recognition (NER) research focuses on developing dataset-specific models based on data from the domain of interest, and a limited set of related entity types. This is frustrating as each new dataset requires a new model to be trained and stored. In this work, we present a "versatile" model, the Prompting-based Unified NER system (PUnifiedNER), that works with data from different domains and can recognise up to 37 entity types simultaneously, and theoretically it could be as many as possible. By using prompt learning, PUnifiedNER...

10.1609/aaai.v37i11.26564 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2023-06-26

Bang An, Jie Lyu, Zhenyi Wang, Chunyuan Li, Changwei Hu, Fei Tan, Ruiyi Zhang, Yifan Hu, Changyou Chen. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.

10.18653/v1/2020.emnlp-main.17 article EN cc-by 2020-01-01

User intended actions are widely seen in many areas. Forecasting these actions and taking proactive measures to optimize business outcome is a crucial step towards sustaining the steady growth. In this work, we focus on predicting attrition, which is one of the typical user intended actions. Conventional attrition predictive modeling strategies suffer a few inherent drawbacks. To overcome these limitations, we propose a novel end-to-end learning scheme to keep track of the evolution of attrition patterns for the predictive modeling. It integrates user activity logs,...

10.1109/icdm.2018.00064 article EN 2018 IEEE International Conference on Data Mining (ICDM) 2018-11-01

In this work, we present a new language pre-training model TNT (Text Normalization based Pre-training of Transformers) for content moderation. Inspired by the masking strategy and text normalization, TNT is developed to learn language representation by training transformers to reconstruct text from four operation types typically seen in text manipulation: substitution, transposition, deletion, and insertion. Furthermore, the normalization involves the prediction of both operation types and token labels, enabling TNT to learn from more challenging tasks than the standard task of masked word...

10.18653/v1/2020.emnlp-main.383 article EN cc-by 2020-01-01
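
The four operation types named in the abstract above can be illustrated with a minimal sketch. It assumes character-level edits and uses hypothetical helper names (`substitute`, `transpose`, `delete`, `insert`, `corrupt`); it is not the paper's implementation, which the truncated abstract does not specify. The sketch only shows how a corrupted input and its operation label might be produced as a training pair.

```python
import random

# The four manipulation types listed in the abstract.
OPERATIONS = ("substitution", "transposition", "deletion", "insertion")


def substitute(text, i, ch):
    """Replace the character at index i with ch."""
    return text[:i] + ch + text[i + 1:]


def transpose(text, i):
    """Swap the characters at indices i and i + 1."""
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def delete(text, i):
    """Remove the character at index i."""
    return text[:i] + text[i + 1:]


def insert(text, i, ch):
    """Insert ch before index i."""
    return text[:i] + ch + text[i:]


def corrupt(text, rng, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Apply one randomly chosen operation (hypothetical helper).

    Returns (corrupted_text, operation_label); a model would be trained
    to recover the original text and, by assumption, the label.
    """
    if len(text) < 2:
        raise ValueError("text must have at least two characters")
    op = rng.choice(OPERATIONS)
    ch = rng.choice(alphabet)
    if op == "substitution":
        return substitute(text, rng.randrange(len(text)), ch), op
    if op == "transposition":
        return transpose(text, rng.randrange(len(text) - 1)), op
    if op == "deletion":
        return delete(text, rng.randrange(len(text))), op
    return insert(text, rng.randrange(len(text) + 1), ch), op
```

Calling `corrupt("moderation", random.Random(0))` yields one corrupted string and the name of the operation that produced it.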

Current methods for prompt learning in zero-shot scenarios widely rely on a development set with sufficient human-annotated data to select the best-performing prompt template a posteriori. This is not ideal because in a real-world zero-shot scenario of practical relevance, no labelled data is available. Thus, we propose a simple yet effective method for screening reasonable prompt templates in zero-shot text classification: Perplexity Selection (Perplection). We hypothesize that language discrepancy can be used to measure the efficacy of prompt templates, and...

10.18653/v1/2023.acl-long.128 article EN cc-by 2023-01-01
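
The idea of screening templates by perplexity, as described in the abstract above, can be sketched as follows. This is a toy illustration under stated assumptions: an add-one-smoothed unigram model stands in for the pre-trained language model, lower perplexity is assumed to be preferred, and all names (`train_unigram`, `perplexity`, `select_template`) are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Return a probability function for an add-one-smoothed unigram model.

    This toy model is a stand-in for a pre-trained language model.
    """
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens

    def prob(token):
        return (counts.get(token, 0) + 1) / (total + vocab_size)

    return prob


def perplexity(tokens, prob):
    """Perplexity = exp of the negative mean log-probability per token."""
    log_sum = sum(math.log(prob(t)) for t in tokens)
    return math.exp(-log_sum / len(tokens))


def select_template(templates, text, prob):
    """Pick the template whose filled-in prompt has the lowest perplexity.

    No labelled data is used; only the language model scores the prompts.
    """
    def score(template):
        filled = template.format(text=text).lower().split()
        return perplexity(filled, prob)

    return min(templates, key=score)


corpus = "the movie was great . the film was terrible . it was a great movie .".split()
prob = train_unigram(corpus)
templates = ["{text} it was great", "{text} zxq blorp great"]
best = select_template(templates, "the movie", prob)
```

Here the first template wins because every token in its filled-in prompt is familiar to the toy model, while the second contains two unseen tokens.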

The affiliated school district of a real estate property is often a crucial concern. How to automate the identification of residential homes located in a favorable educational environment, however, is largely unexplored until now. The availability of heterogeneous real estate-related data offers a great opportunity for this task. Nevertheless, it is such heterogeneity that poses significant challenges to their amalgamation in a unified fashion. To this end, we develop the G-LRMM model to integrate digital price, textual comments, and...

10.1109/icdm.2016.0164 article EN 2016-12-01

The neural attention mechanism plays an important role in many natural language processing applications. In particular, the use of multi-head attention extends single-head attention by allowing a model to jointly attend to information from different perspectives. Without explicit constraining, however, multi-head attention may suffer from attention collapse, an issue that makes different heads extract similar attentive features, thus limiting the model's representation power. In this paper, for the first time, we provide a novel understanding of multi-head attention from a Bayesian perspective. Based on...

10.48550/arxiv.2009.09364 preprint EN other-oa arXiv (Cornell University) 2020-01-01

10.1007/s10618-018-00612-0 article EN Data Mining and Knowledge Discovery 2019-02-08

Current methods for prompt learning in zero-shot scenarios widely rely on a development set with sufficient human-annotated data to select the best-performing prompt template a posteriori. This is not ideal because in a real-world zero-shot scenario of practical relevance, no labelled data is available. Thus, we propose a simple yet effective method for screening reasonable prompt templates in zero-shot text classification: Perplexity Selection (Perplection). We hypothesize that language discrepancy can be used to measure the efficacy of prompt templates,...

10.48550/arxiv.2209.15206 preprint EN public-domain arXiv (Cornell University) 2022-01-01

Aligned Large Language Models (LLMs) showcase remarkable versatility, capable of handling diverse real-world tasks. Meanwhile, aligned LLMs are also expected to exhibit speciality, excelling in specific applications. However, fine-tuning with extra data, a common practice to gain speciality, often leads to catastrophic forgetting (CF) of previously acquired versatility, hindering the model's performance across diverse tasks. In response to this challenge, we propose CoFiTune, a coarse to fine framework in an attempt to strike the balance between speciality...

10.48550/arxiv.2404.10306 preprint EN arXiv (Cornell University) 2024-04-16

10.18653/v1/2024.emnlp-main.903 article EN Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing 2024-01-01

We consider the problem of Named Entity Recognition (NER) on biomedical scientific literature, and more specifically the genomic variants recognition in this work. Significant success has been achieved for NER on canonical tasks in recent years where large data sets are generally available. However, it remains a challenging problem in many domain-specific areas, especially the domains where only small gold annotations can be obtained. In addition, genomic variant entities exhibit diverse linguistic heterogeneity, differing much...

10.1609/aaai.v34i01.5399 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2020-04-03

Content understanding, with many potential industrial applications, is spurring interest by researchers in many areas of artificial intelligence. We propose to revisit the content understanding problem in digital marketing from three novel perspectives. First, we explore the way how user experience is delivered by divergent key multimedia elements. Second, we treat [...] as [...] and elucidate their causal implications in driving responses. Third, we understand [...] based on observational audience visit logs. To approach this problem, we measure and...

10.1109/icdm.2019.00168 article EN 2019 IEEE International Conference on Data Mining (ICDM) 2019-11-01

In this work, we report our efforts in advancing Chinese Word Segmentation for the purpose of rapid deployment in different applications. The pre-trained language model (PLM) based segmentation methods have achieved state-of-the-art (SOTA) performance, whereas this paradigm also poses challenges in the deployment. It includes the balance between performance and cost, segmentation ambiguity due to domain diversity and vague words boundary, and multi-grained segmentation. In this context, we propose a simple yet effective approach, namely...

10.18653/v1/2023.acl-industry.1 article EN cc-by 2023-01-01