- Multimodal Machine Learning Applications
- Human Pose and Action Recognition
- Domain Adaptation and Few-Shot Learning
- Video Analysis and Summarization
- Generative Adversarial Networks and Image Synthesis
- Air Quality and Health Impacts
- Microplastics and Plastic Pollution
- Recycling and Waste Management Techniques
- Allergic Rhinitis and Sensitization
- Non-Invasive Vital Sign Monitoring
- China's Socioeconomic Reforms and Governance
- Chinese History and Philosophy
- Asthma and Respiratory Diseases
- China's Ethnic Minorities and Relations
- Sperm and Testicular Function
- Military and Defense Studies
- ECG Monitoring and Analysis
- Cancer-Related Molecular Mechanisms Research
- Mast Cells and Histamine
- Climate Change and Health Impacts
- Hydrocarbon exploration and reservoir analysis
- EEG and Brain-Computer Interfaces
- COVID-19 diagnosis using AI
- NMR spectroscopy and applications
- Face Recognition and Analysis
Hangzhou Normal University
2023-2024
ShangHai JiAi Genetics & IVF Institute
2023-2024
Shanghai Artificial Intelligence Laboratory
2022-2024
University College London
2024
East China Normal University
2020-2023
Beijing Academy of Artificial Intelligence
2023
Beijing Normal University
2022
Shanghai Jiao Tong University
2022
China University of Petroleum, Beijing
2018
South China University of Technology
2017
Scale is the primary factor for building a powerful foundation model that could well generalize to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both data and core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens....
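The dual masking idea can be sketched in a few lines: the encoder sees only a small visible subset of tokens, while the decoder reconstructs a second, disjoint masked subset instead of every token. The function name and keep ratios below are illustrative assumptions, not VideoMAE's actual API.

```python
import numpy as np

def dual_mask(num_tokens: int, encoder_keep: float, decoder_keep: float, seed: int = 0):
    """Split token indices into an encoder-visible subset and a disjoint
    decoder-target subset, in the spirit of a dual-masking scheme.
    Names and ratios here are illustrative, not the paper's exact design."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_tokens)          # random token ordering
    n_enc = int(num_tokens * encoder_keep)      # tokens the encoder processes
    n_dec = int(num_tokens * decoder_keep)      # masked tokens the decoder reconstructs
    enc_idx = perm[:n_enc]
    dec_idx = perm[n_enc:n_enc + n_dec]         # disjoint from enc_idx by construction
    return enc_idx, dec_idx

# e.g. 1568 tube tokens, encoder sees 10%, decoder reconstructs 50%
enc, dec = dual_mask(1568, encoder_keep=0.1, decoder_keep=0.5)
```

Because the decoder only attends to a fraction of the masked positions, both encoder and decoder compute scale with their own subset sizes rather than with the full token count.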
The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised learning. Specifically, InternVideo efficiently explores masked video modeling and video-language...
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain. Although VideoMAE has trained a robust ViT from limited data, its low-level reconstruction poses convergence difficulties and conflicts with high-level cross-modal alignment. This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods. To...
Falls are a very dangerous situation, especially among elderly people, because they may lead to fractures, concussion, and other injuries. Without timely rescue, falls may even endanger their lives. The existing optical sensor-based fall monitoring systems have some disadvantages, such as a limited monitoring range and the inconvenience of carrying them for users. Furthermore, a detection system based only on an accelerometer often mistakenly classifies activities of daily living (ADL) as falls, leading to low accuracy in...
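The false-positive problem described above is easy to see in the simplest accelerometer-only rule, which flags a fall whenever the acceleration signal-magnitude vector crosses a fixed threshold. The function name and the threshold value below are hypothetical, for illustration only.

```python
import numpy as np

def naive_fall_detector(acc: np.ndarray, threshold_g: float = 2.5) -> bool:
    """Flag a fall when the per-sample acceleration magnitude (in g)
    exceeds a fixed threshold. acc has shape [samples, 3] (x, y, z).
    Single-sensor rules like this also fire on vigorous ADL, such as
    sitting down hard, which is exactly the false-positive issue above."""
    smv = np.linalg.norm(acc, axis=1)   # signal-magnitude vector per sample
    return bool(smv.max() > threshold_g)
```

A quiet standing signal stays near 1 g and is not flagged, while any brief impact spike above the threshold is, whether or not it was actually a fall; this is why multi-sensor fusion is typically preferred.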
Learning discriminative spatiotemporal representation is the key problem of video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term dependency with self-attention. Unfortunately, they exhibit limitations in tackling local redundancy, due to the blind global comparison among tokens. UniFormer has successfully alleviated this issue by unifying convolution and self-attention as a relation aggregator in the transformer format. However, this model requires a tiresome...
This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded latent diffusion models, comprising a base T2V model, a temporal...
The prolific performances of Vision Transformers (ViTs) in image tasks have prompted research into adapting the ViTs for video tasks. However, the substantial gap between image and video impedes the spatiotemporal learning of these image-pretrained models. Though video-specialized models like UniFormer can transfer to the video domain more seamlessly, their unique architectures require prolonged image pretraining, limiting scalability. Given the emergence of powerful open-source image ViTs, we propose unlocking their potential for video understanding with...
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions comprising 4.1B words in total. Our core contribution is to develop a scalable approach to autonomously build a high-quality video-text dataset with large language models (LLM), thereby showcasing its efficacy in...
We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our approach employs a progressive training paradigm that unifies the different self- or weakly-supervised learning frameworks of masked video token reconstruction, cross-modal contrastive learning, and next token prediction. Different training stages would guide our model to capture different levels of structure and semantic information through different pretext tasks. At the data...
Despite increasing alarms over the health impacts of microplastics (MPs) due to their detection in human organs and feces, precise exposure evaluations remain scarce. To comprehend the risks, there is a distinct need to prioritize quantitative estimates of the MP exposome, particularly at environmentally realistic levels. Here we used a method rooted in real-world measurements and activity patterns to determine the daily intake of MPs through inhalation and from ground dust/soil ingestion. We found that nearly 80% of this intake comes...
In this report, we present our champion solutions to five tracks at the Ego4D challenge. We leverage our developed InternVideo, a video foundation model, for these tasks, including Moment Queries, Natural Language Queries, Future Hand Prediction, State Change Object Detection, and Short-term Object Interaction Anticipation. InternVideo-Ego4D is an effective paradigm to adapt the strong foundation model to downstream ego-centric video understanding tasks with simple head designs. On these tracks, its performance comprehensively surpasses the baseline methods...
On 7 September 2010, a Chinese fishing boat collided with two Japanese Coast Guard vessels near the disputed Diaoyu/Senkaku Islands, within Japan's claimed exclusive economic zone. The Japanese Coast Guard detained the crew and ship but thereafter let them go, only to hand over the Chinese captain to prosecutors for obstructing its execution of duties. The Chinese government immediately protested to Japan and demanded the release of the captain. The situation calmed down in late September when the captain was set free, though Japan refused to apologize or give compensation. When diplomatic...
Abstract Background Allergic rhinitis is a common health concern that affects quality of life. This study aims to examine the online search trends of allergic rhinitis in China before and after the COVID-19 epidemic and to explore the association between daily air quality and search volumes in Beijing. Methods We extracted data for allergic rhinitis-related keywords from the Baidu Index database from January 23, 2017 to June 2022. We analyzed and compared the temporal distribution of search behaviors across different themes before and after the pandemic in mainland China, using the Baidu Search Index (BSI). We also obtained the Air Quality Index (AQI)...
Purpose To investigate information-seeking behavior related to urticaria before and during the COVID-19 pandemic in China. Methods Search query data for urticaria-related terms were retrieved using the Baidu Index database from October 23, 2017 to April 2022, and daily vaccination doses were obtained from the website of the Chinese Center for Disease Control and Prevention. Among the 23 eligible search terms, four themes were generated, covering the classification, symptoms, etiology, and treatment of urticaria, respectively. The Baidu Search Index (BSI) value of each term was extracted to analyze...
Building video-language foundation models is costly and difficult due to the redundant nature of video data and the lack of high-quality video-language datasets. In this paper, we propose an efficient framework to harvest video foundation models from image ones. Our method is intuitively simple: we randomly drop input video patches and mask out input text during the post-pretraining procedure. The patch dropping boosts training efficiency significantly, and the text masking enforces the learning of cross-modal fusion. We conduct extensive experiments to validate the effectiveness of our method on a wide range...
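The two operations described above, random patch dropping and text masking, can be sketched as follows. The helper names, ratios, and mask token below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def drop_patches(patches: np.ndarray, keep_ratio: float, rng) -> np.ndarray:
    """Randomly keep a subset of video patch tokens (shape [num_patches, dim]).
    Dropping tokens shrinks the sequence the encoder must process."""
    n_keep = max(1, int(len(patches) * keep_ratio))
    idx = rng.choice(len(patches), size=n_keep, replace=False)
    return patches[np.sort(idx)]            # keep original temporal/spatial order

def mask_text(tokens: list, mask_ratio: float, rng, mask_token: str = "[MASK]") -> list:
    """Replace a random subset of text tokens with a mask token, forcing the
    model to rely on the video stream to recover them (cross-modal fusion)."""
    n_mask = int(len(tokens) * mask_ratio)
    masked = set(rng.choice(len(tokens), size=n_mask, replace=False).tolist())
    return [mask_token if i in masked else t for i, t in enumerate(tokens)]
```

With, say, a 30% patch keep ratio, the vision encoder touches less than a third of the original tokens per clip, which is where the training-efficiency gain in the paragraph above comes from.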