Liangliang Cao

ORCID: 0000-0003-0900-1512
Research Areas
  • Advanced Image and Video Retrieval Techniques
  • Image Retrieval and Classification Techniques
  • Video Analysis and Summarization
  • Multimodal Machine Learning Applications
  • Speech Recognition and Synthesis
  • Human Pose and Action Recognition
  • Video Surveillance and Tracking Methods
  • Topic Modeling
  • Domain Adaptation and Few-Shot Learning
  • Natural Language Processing Techniques
  • Speech and Audio Processing
  • Music and Audio Processing
  • 3D Surveying and Cultural Heritage
  • Anomaly Detection Techniques and Applications
  • Face Recognition and Analysis
  • Complex Network Analysis Techniques
  • 3D Shape Modeling and Analysis
  • Advanced Vision and Imaging
  • Remote-Sensing Image Classification
  • Advanced Neural Network Applications
  • Recommender Systems and Techniques
  • Geographic Information Systems Studies
  • Advanced Text Analysis Techniques
  • Data Management and Algorithms
  • Computer Graphics and Visualization Techniques

Apple (United States)
2023-2025

Wuxi Fourth People's Hospital
2024

XinHua Hospital
2024

Shanghai Jiao Tong University
2024

Jiangnan University
2024

University of Massachusetts Amherst
2019-2022

Google (United States)
2019-2022

Guangzhou Experimental Station
2022

Carnegie Mellon University
2020

Amherst College
2020

The ability of learning from noisy labels is very useful in many visual recognition tasks, as a vast amount of data with noisy labels is relatively easy to obtain. Traditionally, label noise has been treated as statistical outliers, and techniques such as importance re-weighting and bootstrapping have been proposed to alleviate the problem. According to our observation, real-world noisy labels exhibit multimode characteristics as the true labels, rather than behaving like independent random outliers. In this work, we propose a unified distillation...

10.1109/iccv.2017.211 article EN 2017-10-01

This paper considers the person verification problem in modern surveillance and video retrieval systems. The problem is to identify whether a pair of face or human body images is about the same person, even if the person has not been seen before. Traditional methods usually look for a distance (or similarity) measure between images (e.g., by metric learning algorithms), and make decisions based on a fixed threshold. We show that this is nevertheless insufficient and sub-optimal for the verification problem. This paper proposes to learn a decision function for verification that can be viewed as a joint model...

10.1109/cvpr.2013.463 article EN 2013 IEEE Conference on Computer Vision and Pattern Recognition 2013-06-01

Most research efforts on image classification so far have been focused on medium-scale datasets, which are often defined as datasets that can fit into the memory of a desktop (typically 4G~48G). There are two main reasons for the limited effort on large-scale image classification. First, until the emergence of the ImageNet dataset, there was almost no publicly available large-scale benchmark data for image classification. This is mostly because class labels are expensive to obtain. Second, large-scale classification is hard because it poses more challenges than its medium-scale counterparts. A key challenge is how...

10.1109/cvpr.2011.5995477 article EN 2011-06-01

We present a novel generative model for simultaneously recognizing and segmenting object and scene classes. Our model is inspired by the traditional bag of words representation of texts and images as well as a number of related generative models, including probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA). A major drawback of the pLSA and LDA models is the assumption that each patch in the image is independently generated given its corresponding latent topic. While such a representation provides an efficient computational method, it lacks the power...

10.1109/iccv.2007.4408965 article EN 2007-01-01

This paper studies the problem of discovering and comparing geographical topics from GPS-associated documents. GPS-associated documents have become popular with the pervasiveness of location-acquisition technologies. For example, in Flickr, geo-tagged photos are associated with tags and GPS locations. In Twitter, the locations of tweets can be identified by the GPS of smart phones. Many interesting concepts, including cultures, scenes, and product sales, correspond to specialized geographical distributions. In this paper, we are interested in two questions: (1) how...

10.1145/1963405.1963443 article EN 2011-03-28

Composing fashion outfits involves a deep understanding of fashion standards while incorporating creativity for choosing multiple fashion items (e.g., Jewelry, Bag, Pants, Dress). In fashion websites, popular or high-quality fashion outfits are usually designed by fashion experts and followed by large audiences. In this paper, we propose a machine learning system to compose fashion outfits automatically. The core of the proposed automatic composition system is to score fashion outfit candidates based on their appearances and meta-data. We propose to leverage outfit popularity on fashion oriented websites to supervise...

10.1109/tmm.2017.2690144 article EN IEEE Transactions on Multimedia 2017-03-30

Attribute-based representation has shown great promise for visual recognition due to its intuitive interpretation and cross-category generalization property. However, human efforts are usually involved in the attribute designing process, making the representation costly to obtain. In this paper, we propose a novel formulation to automatically design discriminative "category-level attributes", which can be efficiently encoded by a compact category-attribute matrix. The formulation allows us to achieve critical criteria...

10.1109/cvpr.2013.105 article EN 2013 IEEE Conference on Computer Vision and Pattern Recognition 2013-06-01

With the recent popularity of animated GIFs on social media, there is a need for ways to index them with rich meta-data. To advance research on animated GIF understanding, we collected a new dataset, Tumblr GIF (TGIF), with 100K animated GIFs from Tumblr and 120K natural language descriptions obtained via crowdsourcing. The motivation for this work is to develop a testbed for image sequence description systems, where the task is to generate natural language descriptions for animated GIFs or video clips. To ensure a high quality dataset, we developed a series of novel quality controls to validate free-form text input from crowd-workers. We show...

10.1109/cvpr.2016.502 article EN 2016-06-01

This work addresses the unsupervised adaptation of an existing object detector to a new target domain. We assume that a large number of unlabeled videos from this domain are readily available. We automatically obtain labels on the target data by using high-confidence detections from the existing detector, augmented with hard (misclassified) examples acquired by exploiting temporal cues using a tracker. These automatically-obtained labels are then used for re-training the original model. A modified knowledge distillation loss is proposed, and we...

10.1109/cvpr.2019.00087 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA)...

10.1109/jstsp.2022.3182537 article EN IEEE Journal of Selected Topics in Signal Processing 2022-06-13

Non-invasive methods of detecting cancer by circulating exosomes are challenged by inefficient purification and identification. This study hereby proposes an automated centrifugal microfluidic disc system combined with functionalized membranes (Exo-CMDS) to isolate and enrich exosomes, which will then be processed by a novel aptamer fluorescence system (Exo-AFS) in order to detect the exosome surface proteins in an effective manner. Exo-CMDS features highly qualified yields with an optimal exosomal concentration of 5.1 × 10^9...

10.1016/j.bios.2022.114487 article EN cc-by Biosensors and Bioelectronics 2022-06-18

Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks, facilitating the synthesis of visual content in an unconditional or input-conditional manner. The core idea behind them is learning to reverse the process of gradually adding noise to images, allowing them to generate high-quality samples from a complex distribution. In this survey, we provide an exhaustive overview of existing methods using diffusion models for image editing, covering both theoretical and practical aspects in the field. We delve...

10.1109/tpami.2025.3541625 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2025-02-13

In recent years, many research works have been carried out to recognize human actions from video clips. To learn an effective action classifier, most of the previous approaches rely on enough training labels. When required to recognize actions in a different dataset, these approaches have to re-train the model using new labels. However, labeling video sequences is a very tedious and time-consuming task, especially when detailed spatial locations and time durations are required. In this paper, we propose an adaptive action detection approach which reduces...

10.1109/cvpr.2010.5539875 article EN 2010-06-01

This paper studies the problem of recognizing gender from full body images. This problem has not been addressed before, partly because of the variant nature of human bodies and clothing, which can bring tough difficulties. However, gender recognition has high application potential, e.g. security surveillance and customer statistics collection in restaurants, supermarkets, and even building entrances. In this paper, we build a system for recognizing gender from full body images, taken from frontal or back views. Our contributions are three-fold. First, to handle the variety...

10.1145/1459359.1459470 article EN Proceedings of the 16th ACM International Conference on Multimedia 2008-10-26

In this paper, we investigate the detection of semantic human actions in complex scenes. Unlike conventional action recognition in well-controlled environments, action detection in complex scenes suffers from cluttered backgrounds, heavy crowds, occluded bodies, and spatial-temporal boundary ambiguities caused by imperfect human detection and tracking. Conventional algorithms are likely to fail with such spatial-temporal ambiguities. In this work, the candidate regions of an action are treated as a bag of instances. Then a novel multiple-instance learning framework, named SMILE-SVM...

10.1109/iccv.2009.5459153 article EN 2009-09-01

We introduce the novel problem of automatically generating animated GIFs from video. GIFs are short looping videos with no sound, and a perfect combination between image and video that really captures our attention. GIFs tell a story, express emotion, turn events into humorous moments, and are the new wave of photojournalism. We pose the question: Can we automate the entirely manual and elaborate process of GIF creation by leveraging the plethora of user generated GIF content? We propose a Robust Deep RankNet that, given a video, generates a ranked list of its segments...

10.1109/cvpr.2016.114 article EN 2016-06-01

Recent insights on language and vision with neural networks have been successfully applied to simple single-image visual question answering. However, to tackle real-life question answering problems on multimedia collections such as personal photos, we have to look at whole collections with sequences of photos or videos. When answering questions from a large collection, a natural problem is to identify snippets to support the answer. In this paper, we describe a novel neural network called Focal Visual-Text Attention network (FVTA) for collective reasoning in...

10.1109/cvpr.2018.00642 article EN 2018-06-01

Sentiment analysis is crucial for extracting social signals from social media content. Due to the huge variation in social media, the performance of sentiment classifiers using a single modality (visual or textual) still lags behind satisfaction. In this paper, we propose a new framework that integrates textual and visual information for robust sentiment analysis. Different from previous work, we believe visual and textual information should be treated jointly in a structural fashion. Our system first builds a semantic tree structure based on sentence parsing, aimed at...

10.1145/2964284.2964288 article EN Proceedings of the 24th ACM International Conference on Multimedia 2016-09-29

Sarcasm is a peculiar form of sentiment expression, where the surface sentiment differs from the implied sentiment. The detection of sarcasm in social media platforms has been applied in the past mainly to textual utterances, where lexical indicators (such as interjections and intensifiers), linguistic markers, and contextual information (such as user profiles, or past conversations) were used to detect the sarcastic tone. However, modern social media platforms allow users to create multimodal messages where audiovisual content is integrated with the text, making the analysis of a mode in isolation...

10.1145/2964284.2964321 preprint EN Proceedings of the 24th ACM International Conference on Multimedia 2016-09-29

In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos. We use the auditory spectrogram as a spectral representation of speech, together with its corresponding sound generation method, resulting in more natural sounding reconstructed speech. Our proposed network consists of an autoencoder to extract bottleneck features from the auditory spectrogram, which are then used as the target for our main lip reading network comprising CNN, LSTM and fully connected layers. Our experiments show that the autoencoder is able to reconstruct the original auditory spectrogram with...

10.1109/icassp.2018.8461856 article EN 2018-04-01

We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across...

10.48550/arxiv.2310.07704 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Since the emergence of extensive multimedia data, feature fusion has become more and more important for image and video retrieval, indexing and annotation. Existing feature fusion techniques simply concatenate a pair of different features or use canonical correlation analysis based methods for joint dimensionality reduction in the feature space. However, how to fuse multiple features in a generalized way is still an open problem. In this paper, we reformulate multiple feature fusion as a general subspace learning problem. The objective of the framework is to find a general linear subspace in which the cumulative pairwise...

10.1145/1386352.1386373 article EN 2008-07-07

Social multimedia hosting and sharing websites, such as Flickr, Facebook, Youtube, Picasa, ImageShack and Photobucket, are increasingly popular around the globe. A major trend in current studies on social multimedia is using social media sites as a source of huge amounts of labeled data for solving large scale computer science problems in computer vision, data mining and multimedia. In this paper, we take a new path to explore the global trends and sentiments that can be drawn by analyzing the sharing patterns of uploaded and downloaded social multimedia. In a sense, each time an image or...

10.1145/1873951.1874196 article EN Proceedings of the 18th ACM International Conference on Multimedia 2010-10-25

This work aims to build a system to suggest tourist destinations based on visual matching and minimal user input. A user can provide either a photo of the desired scenery or a keyword describing the place of interest, and the system will look into its database for places that share the visual characteristics. To that end, we first cluster a large-scale geotagged web photo collection into groups by location and then find the representative images for each group. Tourist destination recommendations are produced by comparing the query against the representative tags or images under the premise of "if you like...

10.1109/icassp.2010.5495905 article EN IEEE International Conference on Acoustics Speech and Signal Processing 2010-01-01

This article studies the problem of latent community topic analysis in text-associated graphs. With the development of social media, a lot of user-generated content is available with user networks. Along with the rich information in networks, user graphs can be extended with text information associated with nodes. Topic modeling is a classic problem in text mining, and it is interesting to discover the latent topics in text-associated graphs. Different from traditional topic modeling methods considering links, we incorporate community discovery into topic analysis in text-associated graphs to guarantee the topical coherence in the communities so that users in the same community are closely linked...

10.1145/2337542.2337548 article EN ACM Transactions on Intelligent Systems and Technology 2012-09-01