- Topic Modeling
- Natural Language Processing Techniques
- Advanced Graph Neural Networks
- Text and Document Classification Technologies
- Advanced Text Analysis Techniques
- Semantic Web and Ontologies
- Sentiment Analysis and Opinion Mining
- Complex Network Analysis Techniques
- Multimodal Machine Learning Applications
- Recommender Systems and Techniques
- Data Quality and Management
- Speech and Dialogue Systems
- Face and Expression Recognition
- Domain Adaptation and Few-Shot Learning
- Bayesian Modeling and Causal Inference
- Privacy-Preserving Technologies in Data
- Image Retrieval and Classification Techniques
- Advanced Image and Video Retrieval Techniques
- Text Readability and Simplification
- Human Pose and Action Recognition
- Video Analysis and Summarization
- Explainable Artificial Intelligence (XAI)
- Data Visualization and Analytics
- Hate Speech and Cyberbullying Detection
- Advanced Clustering Algorithms Research
Tsinghua University, 2006-2024
University of Hong Kong, 2013-2024
Peng Cheng Laboratory, 2020-2024
Hong Kong University of Science and Technology, 2013-2024
Zhejiang University of Finance and Economics, 2024
Bar-Ilan University, 2023
Tencent (China), 2019-2022
Association for Computing Machinery, 2019-2021
West Virginia University, 2015-2018
Peking University, 2017
Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of the data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach, sparsifying the matrix, with another, the Nyström method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors...
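The nearest-neighbor sparsification mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `knn_sparsify`, the Gaussian similarity, the `sigma` parameter, and the dictionary-based sparse storage are all assumptions chosen for illustration.

```python
import math


def knn_sparsify(points, t, sigma=1.0):
    """Build a symmetric sparse similarity matrix as a dict {(i, j): s}.

    Assumption: similarity is a Gaussian kernel on Euclidean distance.
    For each point only its t most similar neighbors are retained, and
    every retained entry is mirrored so the result stays symmetric.
    """
    n = len(points)
    sparse = {}
    for i in range(n):
        sims = []
        for j in range(n):
            if i == j:
                continue
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            sims.append((math.exp(-d2 / (2 * sigma ** 2)), j))
        # Keep the t most similar neighbors (ties broken by index).
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        for s, j in sims[:t]:
            sparse[(i, j)] = s
            sparse[(j, i)] = s
    return sparse


# Two well-separated pairs of points: each point keeps only its partner.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
S = knn_sparsify(points, t=1)
```

This toy version still computes all pairwise similarities before discarding most of them; it only shows which entries survive, not how to avoid the quadratic cost on large data sets.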
Heterogeneous Information Network (HIN) is a natural and general representation of data in modern large commercial recommender systems which involve heterogeneous types of data. HIN based recommenders face two problems: how to represent the high-level semantics of recommendations and how to fuse the heterogeneous information to make recommendations. In this paper, we solve the two problems by first introducing the concept of meta-graph to HIN-based recommendation, and then solving the information fusion problem with "matrix factorization (MF) + factorization machine (FM)"...
Text classification to a hierarchical taxonomy of topics is a common and practical problem. Traditional approaches simply use bag-of-words and have achieved good results. However, when there are a lot of labels with different topical granularities, bag-of-words representation may not be enough. Deep learning models have been proven effective at automatically learning different levels of representations for image data. It is interesting to study what the best way to represent texts is. In this paper, we propose a graph-CNN based deep learning model to first convert...
Understanding how topics evolve in text data is an important and challenging task. Although much work has been devoted to topic analysis, the study of topic evolution has largely been limited to individual topics. In this paper, we introduce TextFlow, a seamless integration of visualization and topic mining techniques, for analyzing various evolution patterns that emerge from multiple topics. We first extend an existing analysis technique to extract three-level features: the topic evolution trend, the critical event, and the keyword correlation. Then a coherent visualization that consists of three...
Nedjma Ousidhoum, Zizheng Lin, Hongming Zhang, Yangqiu Song, Dit-Yan Yeung. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
With the explosive growth of Android malware and due to the severity of its damages to smart phone users, the detection of Android malware has become increasingly important in cybersecurity. The increasing sophistication of Android malware calls for new defensive techniques that are capable against novel threats and harder to evade. In this paper, to detect Android malware, instead of using Application Programming Interface (API) calls only, we further analyze the different relationships between them and create higher-level semantics which require more effort for attackers to evade...
Network embedding has been proven to be helpful for many real-world problems. In this paper, we present a scalable multiplex network embedding model to represent information of multi-type relations in a unified embedding space. To combine information of different types of relations while maintaining their distinctive properties, for each node, we propose one high-dimensional common embedding and a lower-dimensional additional embedding for each type of relation. Then multiple relations can be learned jointly based on a unified network embedding model. We conduct experiments on two tasks: link prediction and node classification...
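A minimal sketch of how a shared high-dimensional embedding and a lower-dimensional relation-specific embedding might be combined into one vector is given below. The abstract does not state the combination rule, so the projection matrix, the `weight` parameter, and the function name `relation_embedding` are assumptions for illustration only.

```python
def relation_embedding(common, additional, projection, weight=1.0):
    """Combine a common embedding with a relation-specific one.

    common:     list of length d (shared across all relation types)
    additional: list of length s < d (specific to one relation type)
    projection: d x s matrix (list of rows) lifting the additional
                embedding into the common d-dimensional space
    weight:     hypothetical scalar controlling the relation-specific share
    """
    projected = [
        sum(row[k] * additional[k] for k in range(len(additional)))
        for row in projection
    ]
    return [c + weight * p for c, p in zip(common, projected)]


common = [1.0, 2.0, 3.0]                            # d = 3
additional = [0.5, -1.0]                            # s = 2
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 x 2
vec = relation_embedding(common, additional, projection)
```

Under this assumed design, each relation type stores only a small additional vector and a projection, while most of the capacity sits in the shared common embedding.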
Despite the successes in capturing continuous distributions, the application of generative adversarial networks (GANs) to discrete settings, like natural language tasks, is rather restricted. The fundamental reason is the difficulty of back-propagation through discrete random variables combined with the inherent instability of the GAN training objective. To address these problems, we propose Maximum-Likelihood Augmented Discrete Generative Adversarial Networks. Instead of directly optimizing the GAN objective, we derive a novel and...
Recurrent neural networks (RNNs) have been successfully applied to various natural language processing (NLP) tasks and achieved better results than conventional methods. However, the lack of understanding of the mechanisms behind their effectiveness limits further improvements on their architectures. In this paper, we present a visual analytics method for comparing RNN models for NLP tasks. We propose a technique to explain the function of individual hidden state units based on their expected response to input texts. We then...
With the rapid progress of large language models (LLMs), many downstream NLP tasks can be well solved given appropriate prompts. Though model developers and researchers work hard on dialog safety to avoid generating harmful content from LLMs, it is still challenging to steer AI-generated content (AIGC) for the human good. As powerful LLMs are devouring existing text data from various domains (e.g., GPT-3 is trained on 45TB of texts), it is natural to doubt whether private information is included in the training data and what privacy threats...
Most text mining tasks, including clustering and topic detection, are based on statistical methods that treat text as bags of words. Semantics in the text is largely ignored in the mining process, and mining results often have low interpretability. One particular challenge faced by such approaches lies in short text understanding, as short texts lack enough content from which statistical conclusions can be drawn easily. In this paper, we improve text understanding by using a probabilistic knowledgebase that is as rich as our mental world in terms of the concepts (of worldly facts) it...
In this paper, we present a novel exploratory visual analytic system called TIARA (Text Insight via Automated Responsive Analytics), which combines text analytics and interactive visualization to help users explore and analyze large collections of text. Given a collection of documents, TIARA first uses topic analysis techniques to summarize the documents into a set of topics, each of which is represented by a set of keywords. In addition to extracting topics, TIARA derives time-sensitive keywords to depict the content evolution of each topic over time. To understand...
Gang Chen, Yangqiu Song, Fei Wang, and Changshui Zhang. Semi-supervised Multi-label Learning by Solving a Sylvester Equation. Proceedings of the 2008 SIAM International Conference on Data Mining (SDM), pp. 410-419. DOI: https://doi.org/10.1137/1.9781611972788.37. Abstract: Multi-label learning refers to problems where an instance can be assigned to more than one category....
Transfer learning, which leverages knowledge from source domains to enhance learning ability in a target domain, has been proven effective in various applications. One major limitation of transfer learning is that the source and target domains should be directly related. If there is little overlap between the two domains, performing knowledge transfer between these domains will not be effective. Inspired by human transitive inference and learning ability, whereby two seemingly unrelated concepts can be connected by a string of intermediate bridges using auxiliary concepts, in this paper we study...
Taxonomies, especially the ones in specific domains, are becoming indispensable to a growing number of applications. State-of-the-art approaches assume there exists a text corpus to accurately characterize the domain of interest, and that a taxonomy can be derived from the text corpus using information extraction techniques. In reality, neither assumption is valid, especially for highly focused or fast-changing domains. In this paper, we study a challenging problem: deriving a taxonomy from a set of keyword phrases. A solution can benefit many real life...
We are building an interactive visual text analysis tool that aids users in analyzing large collections of text. Unlike existing work in visual text analytics, which focuses either on developing sophisticated text analytic techniques or inventing novel text visualization metaphors, ours tightly integrates state-of-the-art text analytics with interactive visualization to maximize the value of both. In this article, we present our work from two aspects. We first introduce an enhanced, LDA-based topic analysis technique that automatically derives a set of topics to summarize...
Lack of labeled training data is a major bottleneck for neural network based aspect and opinion term extraction on product reviews. To alleviate this problem, we first propose an algorithm to automatically mine extraction rules from existing training examples based on dependency parsing results. The mined rules are then applied to label a large amount of auxiliary data. Finally, we study training procedures to train a neural model which can learn from both the data automatically labeled by the rules and a small amount of data accurately annotated by humans. Experimental results show that although the mined rules themselves do not perform...
Question answering (QA) has become a popular way for humans to access billion-scale knowledge bases. Unlike web search, QA over a knowledge base gives out accurate and concise results, provided that natural language questions can be understood and mapped precisely to structured queries over the knowledge base. The challenge, however, is that a human can ask one question in many different ways. Previous approaches have natural limits due to their representations: rule based approaches only understand a small set of "canned" questions, while keyword based or synonym...
In this paper, we systematically study the problem of dataless hierarchical text classification. Unlike standard text classification schemes that rely on supervised training, dataless classification depends on understanding the labels of the sought after categories and requires no labeled training data. Given a collection of text documents and a set of labels, we show that understanding the labels can be used to accurately categorize the documents. This is done by embedding both labels and documents in a semantic space that allows one to compute meaningful semantic similarity between a document and a potential label. We show that this scheme can be used to support...
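The idea of placing labels and documents in one semantic space and picking the most similar label can be sketched as follows. This is a toy illustration, not the paper's method: the hand-made two-dimensional `WORD_VECS` table, the averaging in `embed`, and the function names are all hypothetical stand-ins for whatever semantic representation is actually used.

```python
import math

# Hypothetical toy word vectors; real systems would use a learned or
# knowledge-based semantic representation instead.
WORD_VECS = {
    "game": (1.0, 0.0), "team": (0.9, 0.1), "sports": (1.0, 0.1),
    "election": (0.0, 1.0), "vote": (0.1, 0.9), "politics": (0.1, 1.0),
}


def embed(text):
    """Average the vectors of known words; unknown words are ignored."""
    words = [w for w in text.lower().split() if w in WORD_VECS]
    if not words:
        return (0.0, 0.0)
    return tuple(
        sum(WORD_VECS[w][d] for w in words) / len(words) for d in range(2)
    )


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(doc, labels):
    """Pick the label whose embedding is closest to the document's."""
    doc_vec = embed(doc)
    return max(labels, key=lambda label: cosine(doc_vec, embed(label)))


labels = ["sports", "politics"]
first = classify("the team won the game", labels)
second = classify("people vote in the election", labels)
```

No labeled training documents appear anywhere in the sketch: the only supervision is the label names themselves, which is the point of the dataless setting.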
Understanding human’s language requires complex world knowledge. However, existing large-scale knowledge graphs mainly focus on knowledge about entities while ignoring knowledge about activities, states, or events, which are used to describe how entities or things act in the real world. To fill this gap, we develop ASER (activities, states, events, and their relations), a large-scale eventuality knowledge graph extracted from more than 11-billion-token unstructured textual data. ASER contains 15 relation types belonging to five categories, 194-million unique eventualities,...
Heterogeneous information network (HIN) embedding has gained increasing interest recently. However, current random-walk based HIN embedding methods have paid little attention to the higher-order Markov chain nature of meta-path guided random walks, especially to the stationarity issue. In this paper, we systematically formalize the meta-path guided random walk as a higher-order Markov chain process, and present a heterogeneous personalized spacey random walk to efficiently and effectively attain the expected stationary distribution among nodes. Then we propose a generalized scalable...
Word embeddings have attracted much attention recently. Different from alphabetic writing systems, Chinese characters are often composed of subcharacter components which are also semantically informative. In this work, we propose an approach to jointly embed Chinese words as well as their characters and fine-grained subcharacter components. We use three likelihoods to evaluate whether the context words, characters, and components can predict the current target word, and collected 13,253 subcharacter components to demonstrate that existing approaches to decomposing Chinese characters are not enough. Evaluation on...