- Recommender Systems and Techniques
- Multimodal Machine Learning Applications
- Advanced Bandit Algorithms Research
- Topic Modeling
- Advanced Graph Neural Networks
- Human Pose and Action Recognition
- Advanced Image and Video Retrieval Techniques
- IoT and Edge/Fog Computing
- Image Retrieval and Classification Techniques
- Privacy-Preserving Technologies in Data
- Video Analysis and Summarization
- Domain Adaptation and Few-Shot Learning
- Expert finding and Q&A systems
- Music and Audio Processing
- Generative Adversarial Networks and Image Synthesis
- Image and Video Quality Assessment
- Stochastic Gradient Optimization Techniques
- Caching and Content Delivery
- Music Technology and Sound Studies
- Image Processing and 3D Reconstruction
- Semantic Web and Ontologies
- Natural Language Processing Techniques
- Cryptography and Data Security
- Speech and Audio Processing
- Human Mobility and Location-Based Analysis
Zhejiang University, 2020-2024
Communication University of China, 2024
Alibaba Group (China), 2024
Chinese University of Hong Kong, 2013-2018
Wuhan University, 2018
California Institute of Technology, 2007
Chengdu University of Information Technology, 2005
Influenced by the great success of deep learning via cloud computing and the rapid development of edge chips, research in artificial intelligence (AI) has shifted to both computing paradigms, i.e., cloud computing and edge computing. In recent years, we have witnessed significant progress in developing more advanced AI models on cloud servers that surpass traditional deep learning models, owing to model innovations (e.g., Transformers, Pretrained families), the explosion of training data, and soaring computing capabilities. However, edge computing, especially edge and cloud collaborative computing, are still in its...
Device Model Generalization (DMG) is a practical yet under-investigated research topic for on-device machine learning applications. It aims to improve the generalization ability of pre-trained models when deployed on resource-constrained devices, such as improving the performance of pre-trained cloud models on smart mobiles. While quite a lot of works have investigated the data distribution shift across clouds and devices, most of them focus on model fine-tuning on personalized data for individual devices to facilitate DMG. Despite their promise, these...
Recent research on video moment retrieval has mostly focused on enhancing accuracy, efficiency, and robustness, all of which largely rely on the abundance of high-quality annotations. While precise frame-level annotations are time-consuming and costly, little attention has been paid to the labeling process. In this work, we explore a new interactive manner to stimulate the process of human-in-the-loop annotation in the video moment retrieval task. The key challenge is to select “ambiguous” frames and videos for binary annotations to facilitate...
A common assumption behind most of the recent research on network rate allocation is that traffic flows are elastic, which means that their utility functions are concave and continuous and that there is no hard limit on the rate allocated to each flow. These critical assumptions lead to the tractability of the analytic models for rate allocation based on network utility maximization, but also limit the applicability of the resulting rate allocation protocols. This paper focuses on inelastic flows and removes these restrictive and often invalid assumptions. First, we consider nonconcave utility functions, which turn utility maximization into...
In this paper, we propose to investigate the problem of out-of-domain visio-linguistic pretraining, where the pretraining data distribution differs from that of the downstream data on which the pretrained model will be fine-tuned. Existing methods for this problem are purely likelihood-based, leading to spurious correlations and hurting the generalization ability when transferred to out-of-domain downstream tasks. By spurious correlation, we mean that the conditional probability of one token (object or word) given another one can be high (due to dataset biases) without robust (causal)...
Effectively representing users lies at the core of modern recommender systems. Since users' interests naturally exhibit multiple aspects, it is of increasing interest to develop multi-interest frameworks for recommendation, rather than represent each user with an overall embedding. Despite their effectiveness, existing methods solely exploit the encoder (the forward flow) to represent multiple aspects of interests. However, without explicit regularization, the interest embeddings may not be distinct from each other nor semantically reflect...
Large Language Models (LLMs) for Recommendation (LLM4Rec) is a promising research direction that has demonstrated exceptional performance in this field. However, its inability to capture real-time user preferences greatly limits the practical application of LLM4Rec because (i) LLMs are costly to train and infer frequently, and (ii) LLMs struggle to access real-time data (their large number of parameters poses an obstacle to deployment on devices). Fortunately, small recommendation models (SRMs) can effectively supplement...
Learning user representations based on historical behaviors lies at the core of modern recommender systems. Recent advances in sequential recommenders have convincingly demonstrated high capability in extracting effective user representations from the given behavior sequences. Despite significant progress, we argue that solely modeling the observational behavior sequences may end up with a brittle and unstable system due to the noisy and sparse nature of the user interactions logged. In this paper, we propose to learn accurate and robust user representations, which...
Video Object Grounding (VOG) is the problem of associating spatial object regions in the video to a descriptive natural language query. This is a challenging vision-language task that necessitates constructing the correct cross-modal correspondence and modeling the appropriate spatio-temporal context of the query video and caption, thereby localizing the specific objects accurately. In this paper, we tackle this task by a novel framework called HiErarchical spatio-tempoRal reasOning (HERO)...
Large Language Models (LLMs) have demonstrated strong performance across various reasoning tasks, yet building a single model that consistently excels across all domains remains challenging. This paper addresses this problem by exploring strategies to integrate multiple domain-specialized models into an efficient pivot model. We propose two fusion strategies to combine the strengths of multiple LLMs: (1) a pairwise, multi-step fusion approach that sequentially distills each source model into the pivot model, followed by a weight merging step to integrate the distilled models into the final...
Deep neural networks have become foundational to advancements in multiple domains, including recommendation systems, natural language processing, and so on. Despite their successes, these models often contain incompatible parameters that can be underutilized or detrimental to model performance, particularly when faced with specific, varying data distributions. Existing research excels in removing such parameters or merging the outputs of multiple different pretrained models. However, the former focuses on efficiency rather...
In recommender systems, modeling user-item behaviors is essential for user representation learning. Existing sequential recommenders consider the sequential correlations between historically interacted items for capturing users' historical preferences. However, since users' preferences are by nature time-evolving and diversified, solely modeling the historical preference (without being aware of the time-evolving trends of preferences) can be inferior for recommending complementary or fresh items and thus hurt the effectiveness of recommender systems. In this paper, we bridge the gap between the past...
In e-commerce, a growing number of user-generated videos are used for product promotion. How to generate video descriptions that narrate the user-preferred product characteristics depicted in the video is vital for successful promotion. Traditional video captioning methods, which focus on routinely describing what exists and happens in a video, are not amenable to product-oriented video captioning. To address this problem, we propose a product-oriented video captioner framework, abbreviated as Poet. Poet firstly represents the videos as product-oriented spatial-temporal graphs. Then, based...
In recommender systems, users' behavior data are driven by the interactions of user-item latent factors. To improve recommendation effectiveness and robustness, recent advances focus on latent factor disentanglement via variational inference. Despite significant progress, uncovering the underlying interactions, i.e., dependencies of latent factors, remains largely neglected by the literature. To bridge the gap, we investigate the joint disentanglement of user-item latent factors and the dependencies between them, namely latent structure learning. We propose to analyze the problem from the causal...
Text-based image captioning (TextCap) requires simultaneous comprehension of visual content and reading the text of images to generate a natural language description. Although this task can teach machines to further understand the complex human environment, given that text is omnipresent in our daily surroundings, it poses additional challenges beyond normal captioning. A text-based image intuitively contains abundant and complex multimodal relational content, that is, image details can be described diversely from multiple views rather than a single caption....
Modern online platforms are increasingly employing recommendation systems to address information overload and improve user engagement. There is an evolving paradigm in this research field in which recommendation network learning occurs both on the cloud and on edges, with knowledge transfer in between (i.e., edge-cloud collaboration). Recent works push this field further by enabling edge-specific context-aware adaptivity, where model parameters are updated in real-time based on incoming on-edge data. However, we argue that frequent data exchanges...
Existing video-audio understanding models are trained and evaluated in an intra-domain setting, facing performance degeneration in real-world applications where multiple domains and distribution shifts naturally exist. The key to video-audio domain generalization (VADG) lies in alleviating spurious correlations over multi-modal features. To achieve this goal, we resort to causal theory and attribute such correlation to confounders affecting both video-audio features and labels. We propose a DeVADG framework that conducts uni-modal...
Tackling the pervasive issue of data sparsity in recommender systems, we present an insightful investigation into the burgeoning area of non-overlapping cross-domain recommendation, a technique that facilitates the transfer of interaction knowledge across domains without necessitating inter-domain user/item correspondence. Existing approaches have predominantly depended on auxiliary information, such as user reviews and item tags, to establish inter-domain connectivity, but these resources may become inaccessible due...
In e-commerce, consumer-generated videos, which in general deliver consumers' individual preferences for the different aspects of certain products, are massive in volume. To recommend these videos to potential consumers more effectively, diverse and catchy video titles are critical. However, consumer-generated videos seldom accompany appropriate titles. To bridge this gap, we integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end...
Recommendation performance usually exhibits a long-tail distribution over users: a small portion of head users enjoy much more accurate recommendation services than the others. We reveal two sources of this performance heterogeneity problem: the uneven distribution of historical interactions (a natural source); and the biased training of recommender models (a model source). As addressing this problem cannot sacrifice the overall performance, a wise choice is to eliminate the model bias while maintaining the natural heterogeneity. The key to debiased training lies in eliminating the effect...
Cloud storage has gained remarkable success in recent years, with an increasing number of consumers and enterprises outsourcing their data to the cloud. To assure the availability and integrity of the outsourced data, several protocols have been proposed to audit cloud storage. Despite the formally guaranteed security, the constructions employed heavy cryptographic operations as well as advanced concepts (e.g., bilinear maps over elliptic curves and digital signatures), and thus are too inefficient to admit wide applicability...
Waterfall Recommender System (RS), a popular form of RS in mobile applications, is a stream of recommended items consisting of successive pages that can be browsed by scrolling. In waterfall RS, when a user finishes browsing a page, the edge (e.g., mobile phones) would send a request to the cloud server to get a new page of recommendations, known as the paging request mechanism. RSs typically put a large number of items into one page to reduce excessive resource consumption from numerous paging requests, which, however, would diminish RSs' ability to timely renew...