- Advanced Image and Video Retrieval Techniques
- Image Retrieval and Classification Techniques
- Video Analysis and Summarization
- Multimodal Machine Learning Applications
- Speech Recognition and Synthesis
- Human Pose and Action Recognition
- Video Surveillance and Tracking Methods
- Topic Modeling
- Domain Adaptation and Few-Shot Learning
- Natural Language Processing Techniques
- Speech and Audio Processing
- Music and Audio Processing
- 3D Surveying and Cultural Heritage
- Anomaly Detection Techniques and Applications
- Face Recognition and Analysis
- Complex Network Analysis Techniques
- 3D Shape Modeling and Analysis
- Advanced Vision and Imaging
- Remote-Sensing Image Classification
- Advanced Neural Network Applications
- Recommender Systems and Techniques
- Geographic Information Systems Studies
- Advanced Text Analysis Techniques
- Data Management and Algorithms
- Computer Graphics and Visualization Techniques
Apple (United States)
2023-2025
Wuxi Fourth People's Hospital
2024
XinHua Hospital
2024
Shanghai Jiao Tong University
2024
Jiangnan University
2024
University of Massachusetts Amherst
2019-2022
Google (United States)
2019-2022
Guangzhou Experimental Station
2022
Carnegie Mellon University
2020
Amherst College
2020
The ability to learn from noisy labels is very useful in many visual recognition tasks, as a vast amount of data with noisy labels is relatively easy to obtain. Traditionally, label noise has been treated as statistical outliers, and techniques such as importance re-weighting and bootstrapping have been proposed to alleviate the problem. According to our observation, real-world noisy labels exhibit multimode characteristics as the true labels, rather than behaving like independent random outliers. In this work, we propose a unified distillation...
This paper considers the person verification problem in modern surveillance and video retrieval systems. The problem is to identify whether a pair of face or human body images is about the same person, even if the person has not been seen before. Traditional methods usually look for a distance (or similarity) measure between images (e.g., by metric learning algorithms), and make decisions based on a fixed threshold. We show that this is nevertheless insufficient and sub-optimal for the verification problem. This paper proposes to learn a decision function for verification that can be viewed as a joint model...
Most research efforts on image classification so far have been focused on medium-scale datasets, which are often defined as datasets that can fit into the memory of a desktop (typically 4G~48G). There are two main reasons for the limited effort on large-scale image classification. First, until the emergence of the ImageNet dataset, there was almost no publicly available large-scale benchmark data for image classification. This is mostly because class labels are expensive to obtain. Second, large-scale classification is hard because it poses more challenges than its medium-scale counterparts. A key challenge is how...
We present a novel generative model for simultaneously recognizing and segmenting object and scene classes. Our model is inspired by the traditional bag of words representation of texts and images as well as a number of related generative models, including probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA). A major drawback of the pLSA and LDA models is the assumption that each patch in the image is independently generated given its corresponding latent topic. While such a representation provides an efficient computational method, it lacks the power...
This paper studies the problem of discovering and comparing geographical topics from GPS-associated documents. GPS-associated documents have become popular with the pervasiveness of location-acquisition technologies. For example, in Flickr, geo-tagged photos are associated with tags and GPS locations. In Twitter, the locations of tweets can be identified by the GPS of smart phones. Many interesting concepts, including cultures, scenes, and product sales, correspond to specialized geographical distributions. In this paper, we are interested in two questions: (1) how...
Composing fashion outfits involves deep understanding of fashion standards while incorporating creativity for choosing multiple fashion items (e.g., Jewelry, Bag, Pants, Dress). In fashion websites, popular or high-quality fashion outfits are usually designed by fashion experts and followed by large audiences. In this paper, we propose a machine learning system to compose fashion outfits automatically. The core of the proposed automatic composition system is to score fashion outfit candidates based on the appearances and meta-data. We propose to leverage outfit popularity on fashion oriented websites to supervise...
Attribute-based representation has shown great promise for visual recognition due to its intuitive interpretation and cross-category generalization property. However, human efforts are usually involved in the attribute designing process, making the representation costly to obtain. In this paper, we propose a novel formulation to automatically design discriminative "category-level attributes", which can be efficiently encoded by a compact category-attribute matrix. The formulation allows us to achieve critical criteria...
With the recent popularity of animated GIFs on social media, there is a need for ways to index them with rich meta-data. To advance research on animated GIF understanding, we collected a new dataset, Tumblr GIF (TGIF), with 100K animated GIFs from Tumblr and 120K natural language descriptions obtained via crowdsourcing. The motivation for this work is to develop a testbed for image sequence description systems, where the task is to generate natural language descriptions for animated GIFs or video clips. To ensure a high quality dataset, we developed a series of novel quality controls to validate free-form text input from crowd-workers. We show...
This work addresses the unsupervised adaptation of an existing object detector to a new target domain. We assume that a large number of unlabeled videos from this domain are readily available. We automatically obtain labels on the target data by using high-confidence detections from the existing detector, augmented with hard (misclassified) examples acquired by exploiting temporal cues using a tracker. These automatically-obtained labels are then used for re-training the original model. A modified knowledge distillation loss is proposed, and we...
We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA)...
Non-invasive methods of detecting cancer by circulating exosomes are challenged by inefficient purification and identification. This study hereby proposed an automated centrifugal microfluidic disc system combined with functionalized membranes (Exo-CMDS) to isolate and enrich exosomes, which will then be processed by a novel aptamer fluorescence system (Exo-AFS) in order to detect the exosome surface proteins in an effective manner. Exo-CMDS features highly qualified yields with an optimal exosomal concentration of 5.1 × 10^9...
Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks, facilitating the synthesis of visual content in an unconditional or input-conditional manner. The core idea behind them is learning to reverse the process of gradually adding noise to images, allowing them to generate high-quality samples from a complex distribution. In this survey, we provide an exhaustive overview of existing methods using diffusion models for image editing, covering both theoretical and practical aspects in the field. We delve...
In recent years, many research works have been carried out to recognize human actions from video clips. To learn an effective action classifier, most of the previous approaches rely on enough training labels. When being required to recognize the action in a different dataset, these approaches have to re-train the model using new labels. However, labeling video sequences is a very tedious and time-consuming task, especially when detailed spatial locations and time durations are required. In this paper, we propose an adaptive action detection approach which reduces...
This paper studies the problem of recognizing gender from full body images. This problem has not been addressed before, partly because of the variant nature of human bodies and clothing that can bring tough difficulties. However, gender recognition has high application potential, e.g. security surveillance and customer statistics collection in restaurants, supermarkets, and even building entrances. In this paper, we build a system for recognizing gender from full body images, taken from frontal or back views. Our contributions are three-fold. First, to handle the variety...
In this paper, we investigate the detection of semantic human actions in complex scenes. Unlike conventional action recognition in well-controlled environments, action detection in complex scenes suffers from cluttered backgrounds, heavy crowds, occluded bodies, and spatial-temporal boundary ambiguities caused by imperfect tracking. Conventional algorithms are likely to fail with such ambiguities. In this work, the candidate regions of an action are treated as a bag of instances. Then a novel multiple-instance learning framework, named SMILE-SVM...
We introduce the novel problem of automatically generating animated GIFs from video. GIFs are short looping videos with no sound, and a perfect combination between image and video that really captures our attention. GIFs tell a story, express emotion, turn events into humorous moments, and are the new wave of photojournalism. We pose the question: Can we automate the entirely manual and elaborate process of GIF creation by leveraging the plethora of user generated GIF content? We propose a Robust Deep RankNet that, given a video, generates a ranked list of its segments...
Recent insights on language and vision with neural networks have been successfully applied to simple single-image visual question answering. However, to tackle real-life question answering problems on multimedia collections such as personal photos, we have to look at whole collections with sequences of photos or videos. When answering questions from a large collection, a natural problem is to identify snippets to support the answer. In this paper, we describe a novel neural network called Focal Visual-Text Attention (FVTA) for collective reasoning in...
Sentiment analysis is crucial for extracting social signals from social media content. Due to the huge variation in social media, the performance of sentiment classifiers using a single modality (visual or textual) still lags behind satisfaction. In this paper, we propose a new framework that integrates textual and visual information for robust sentiment analysis. Different from previous work, we believe visual and textual information should be treated jointly in a structural fashion. Our system first builds a semantic tree structure based on sentence parsing, aimed at...
Sarcasm is a peculiar form of sentiment expression, where the surface sentiment differs from the implied sentiment. The detection of sarcasm in social media platforms has been applied in the past mainly to textual utterances where lexical indicators (such as interjections and intensifiers), linguistic markers, and contextual information (such as user profiles, or past conversations) were used to detect the sarcastic tone. However, modern social media platforms allow users to create multimodal messages where audiovisual content is integrated with the text, making the analysis of a mode in isolation...
In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos. We use the auditory spectrogram as a spectral representation of speech, together with its corresponding sound generation method, resulting in more natural sounding reconstructed speech. Our proposed network consists of an autoencoder to extract bottleneck features from the auditory spectrogram, which is then used as the target for our main lip reading network comprising CNN, LSTM and fully connected layers. Our experiments show that the autoencoder is able to reconstruct the original auditory spectrogram with...
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across...
Since the emergence of extensive multimedia data, feature fusion has become more and more important for image and video retrieval, indexing and annotation. Existing feature fusion techniques simply concatenate a pair of different features or use canonical correlation analysis based methods for joint dimensionality reduction in the feature space. However, how to fuse multiple features in a generalized way is still an open problem. In this paper, we reformulate multiple feature fusion as a general subspace learning problem. The objective of the framework is to find a linear subspace in which the cumulative pairwise...
Social multimedia hosting and sharing websites, such as Flickr, Facebook, YouTube, Picasa, ImageShack and Photobucket, are increasingly popular around the globe. A major trend in current studies on social multimedia is using social media sites as a source of a huge amount of labeled data for solving large scale computer science problems in computer vision, data mining and multimedia. In this paper, we take a new path to explore the global trends and sentiments that can be drawn by analyzing the sharing patterns of uploaded and downloaded social multimedia. In a sense, each time an image or...
This work aims to build a system to suggest tourist destinations based on visual matching and minimal user input. A user can provide either a photo of the desired scenery or a keyword describing the place of interest, and the system will look into its database for places that share the visual characteristics. To that end, we first cluster a large-scale geotagged web photo collection into groups by location and then find the representative images for each group. Tourist destination recommendations are produced by comparing the query against the representative images or tags under the premise of "if you like...
This article studies the problem of latent community topic analysis in text-associated graphs. With the development of social media, a lot of user-generated content is available with user networks. Along with rich information in networks, user graphs can be extended with text information associated with nodes. Topic modeling is a classic problem in text mining and it is interesting to discover the latent topics in text-associated graphs. Different from traditional topic modeling methods considering links, we incorporate community discovery into topic analysis in text-associated graphs to guarantee the topical coherence in the communities so that users in the same community are closely linked...