- Topic Modeling
- Complex Network Analysis Techniques
- Advanced Graph Neural Networks
- Natural Language Processing Techniques
- Recommender Systems and Techniques
- Opinion Dynamics and Social Influence
- Semantic Web and Ontologies
- Web Data Mining and Analysis
- Expert Finding and Q&A Systems
- Advanced Text Analysis Techniques
- Data Quality and Management
- Text and Document Classification Technologies
- Multimodal Machine Learning Applications
- X-ray Diffraction in Crystallography
- Crystallization and Solubility Studies
- Domain Adaptation and Few-Shot Learning
- Service-Oriented Architecture and Web Services
- Mobile Crowdsensing and Crowdsourcing
- Biomedical Text Mining and Ontologies
- Human Mobility and Location-Based Analysis
- Data Mining Algorithms and Applications
- Spam and Phishing Detection
- Caching and Content Delivery
- Online Learning and Analytics
- Sentiment Analysis and Opinion Mining
Tsinghua University
2016-2025
Sichuan University of Science and Engineering
2025
Hubei University of Technology
2024
Xuzhou Construction Machinery Group (China)
2024
Nanjing Tech University
2024
Hunan University of Traditional Chinese Medicine
2018-2024
Southwest Forestry University
2024
Central Conservatory of Music
2024
University of Science and Technology of China
2024
Institute of Psychology, Chinese Academy of Sciences
2021-2023
This paper addresses several key issues in the ArnetMiner system, which aims at extracting and mining academic social networks. Specifically, the system focuses on: 1) Extracting researcher profiles automatically from the Web; 2) Integrating the publication data into the network from existing digital libraries; 3) Modeling the entire academic network; and 4) Providing search services for the academic network. So far, 448,470 researcher profiles have been extracted using a unified tagging approach. We integrate publications from online Web databases and propose...
We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult...
On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems such as long time horizons, imperfect information, and complex, continuous state-action spaces, all challenges which will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools...
Deep supervised learning has achieved great success in the last decade. However, its dependence on manual labels and its vulnerability to attacks have driven people to explore a better solution. As an alternative, self-supervised learning has attracted many researchers for its soaring performance on representation learning in the last several years. Self-supervised representation learning leverages the input data itself as supervision and benefits almost all types of downstream tasks. In this survey, we take a look into new self-supervised learning methods for representation in computer vision, natural...
Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success and become a milestone in the field of artificial intelligence (AI). Owing to sophisticated pre-training objectives and huge model parameters, large-scale PTMs can effectively capture knowledge from massive labeled and unlabeled data. By storing knowledge into huge parameters and fine-tuning on specific tasks, the rich knowledge implicitly encoded in huge parameters can benefit a variety of downstream tasks, which has been extensively demonstrated via experimental...
Graph representation learning has emerged as a powerful technique for addressing real-world problems. Various downstream graph learning tasks have benefited from its recent developments, such as node classification, similarity search, and graph classification. However, prior arts on graph representation learning focus on domain-specific problems and train a dedicated model for each graph dataset, which is usually non-transferable to out-of-domain data. Inspired by the recent advances in pre-training from natural language processing and computer vision, we design...
Since the invention of word2vec, the skip-gram model has significantly advanced the research of network embedding, such as the recent emergence of the DeepWalk, LINE, PTE, and node2vec approaches. In this work, we show that all of the aforementioned models with negative sampling can be unified into the matrix factorization framework with closed forms. Our analysis and proofs reveal that: (1) DeepWalk empirically produces a low-rank transformation of a network's normalized Laplacian matrix; (2) LINE, in theory, is a special case of DeepWalk when the size...
Social and information networking activities such as on Facebook, Twitter, WeChat, and Weibo have become an indispensable part of our everyday life, where we can easily access friends' behaviors and are in turn influenced by them. Consequently, an effective social influence prediction for each user is critical for a variety of applications such as online recommendation and advertising.
Social status, defined as the relative rank or position that an individual holds in a social hierarchy, is known to be among the most important motivating forces in social behaviors. In this paper, we consider the notion of status from the perspective of a position or title held by a person in an enterprise. We study the intersection of social status and social networks in an enterprise, and whether enterprise communication logs can help reveal how social interactions and status manifest themselves in social networks. To that end, we use two enterprise datasets with three communication channels --- voice call, short message, and email --- to demonstrate...
Network embedding (or graph embedding) has been widely used in many real-world applications. However, existing methods mainly focus on networks with single-typed nodes/edges and cannot scale well to handle large networks. Many real-world networks consist of billions of nodes and edges of multiple types, and each node is associated with different attributes. In this paper, we formalize the problem of embedding learning for the Attributed Multiplex Heterogeneous Network and propose a unified framework to address this problem. The framework supports both transductive and inductive...
We show that information about social relationships can be used to improve user-level sentiment analysis. The main motivation behind our approach is that users that are somehow "connected" may be more likely to hold similar opinions; therefore, relationship information can complement what we can extract about a user's viewpoints from their utterances. Employing Twitter as a source for our experimental data, and working within a semi-supervised framework, we propose models that are induced either from the Twitter follower/followee network or from the network in Twitter formed by users referring...
With the prevalence of pre-trained language models (PLMs) and the pre-training–fine-tuning paradigm, it has been continuously shown that larger models tend to yield better performance. However, as PLMs scale up, fine-tuning and storing all the parameters is prohibitively costly and eventually becomes practically infeasible. This necessitates a new branch of research focusing on the parameter-efficient adaptation of PLMs, which optimizes a small portion of the model parameters while keeping the rest fixed, drastically cutting down...
It is well known that different types of social ties have essentially different influence on people. However, users in online social networks rarely categorize their contacts into "family", "colleagues", or "classmates". While a bulk of research has focused on inferring particular types of relationships in a specific social network, few publications systematically study the generalization of the problem of inferring social ties over multiple heterogeneous networks. In this work, we develop a framework for classifying the type of social relationships by learning across heterogeneous networks. The framework incorporates social theories...
Self-supervised learning (SSL) has been extensively explored in recent years. Particularly, generative SSL has seen emerging success in natural language processing and other fields, such as the wide adoption of BERT and GPT. Despite this, contrastive learning---which heavily relies on structural data augmentation and complicated training strategies---has been the dominant approach in graph SSL, while the progress of generative SSL on graphs, especially graph autoencoders (GAEs), has thus far not reached the potential promised in other fields. In this paper, we...
Interdisciplinary collaborations have generated huge impact to society. However, it is often hard for researchers to establish such cross-domain collaborations. What are the patterns of cross-domain collaborations? How do those collaborations form? Can we predict this type
More often than not, people are active in more than one social network. Identifying users from multiple heterogeneous social networks and integrating the different networks is a fundamental issue in many applications. The existing methods tackle this problem by estimating pairwise similarity between users in two networks. However, those methods suffer from potential inconsistency of matchings
Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem. We also demonstrate the finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves the state-of-the-art FID on...
Retweeting is an important action (behavior) on Twitter, indicating the behavior that users re-post microblogs of their friends. While much work has been conducted for mining textual content that users generate or analyzing the social network structure, few publications systematically study the underlying mechanism of the retweeting behaviors. In this paper, we perform an interesting analysis for this problem on Twitter. We have found that almost 25.5% of the tweets posted by users are actually retweeted from friends' blog spaces. Our investigation...
Influence is a complex and subtle force that governs the dynamics of social networks as well as the behaviors of involved users. Understanding influence can benefit various applications such as viral marketing, recommendation, and information retrieval. However, most existing works on social influence analysis have focused on verifying the existence of social influence. Few works systematically investigate how to mine the strength of direct and indirect influence between nodes in heterogeneous networks.
Prompt tuning, which only tunes continuous prompts with a frozen language model, substantially reduces per-task storage and memory usage at training. However, in the context of NLU, prior work reveals that prompt tuning does not perform well for normal-sized pretrained models. We also find that existing methods of prompt tuning cannot handle hard sequence labeling tasks, indicating a lack of universality. We present a novel empirical finding that properly optimized prompt tuning can be universally effective across a wide range of model scales and NLU...
Prompting a pretrained language model with natural language patterns has been proved effective for natural language understanding (NLU). However, our preliminary study reveals that manual discrete prompts often lead to unstable performance: for example, changing a single word in the prompt might result in a substantial performance drop. We propose a novel method, P-Tuning, that employs trainable continuous prompt embeddings in concatenation with discrete prompts. Empirically, P-Tuning not only stabilizes training by minimizing the gap between various discrete prompts, but also improves...
Despite years of research, the name ambiguity problem remains largely unresolved. Outstanding issues include how to capture all information for name disambiguation in a unified approach, and how to determine the number of people K in the disambiguation process. In this paper, we formalize the problem in a unified probabilistic framework, which incorporates both attributes and relationships. Specifically, we define a disambiguation objective function for the problem and propose a two-step parameter estimation algorithm. We also investigate a dynamic approach for estimating the number of people K. Experiments show that our...
Demographics are widely used in marketing to characterize different types of customers. However, in practice, demographic information such as age, gender, and location is usually unavailable due to privacy and other reasons. In this paper, we aim to harness the power of big data to automatically infer users' demographics based on their daily mobile communication patterns. Our study is based on a real-world large mobile network of more than 7,000,000 users and over 1,000,000,000 communication records (CALL and SMS). We discover several interesting social...