- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Handwritten Text Recognition Techniques
- Domain Adaptation and Few-Shot Learning
- Natural Language Processing Techniques
- Image Retrieval and Classification Techniques
- Topic Modeling
- Human Pose and Action Recognition
- Video Analysis and Summarization
- Islamic Thought and Society Studies
- Psychology of Moral and Emotional Judgment
- Generative Adversarial Networks and Image Synthesis
- Families in Therapy and Culture
- Speech Recognition and Synthesis
- Values and Moral Education
- Image Processing and 3D Reconstruction
- Mathematics, Computing, and Information Processing
- Social and Intergroup Psychology
- Turkish Literature and Culture
- Vehicle License Plate Recognition
- Cultural Differences and Values
- Face Recognition and Perception
- Advanced Neural Network Applications
- Religion, Spirituality, and Psychology
- Evolutionary Psychology and Human Behavior
Computer Vision Center
2019-2023
Universitat Autònoma de Barcelona
2018-2023
Işık University
2020
Istanbul Bilgi University
2017-2020
Fatih University
2020
Bahçeşehir University
2020
Istanbul University
2020
Rogers (United States)
2020
Barcelona Supercomputing Center
2019
Artifex University
2019
Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account both for reasoning...
Current image captioning systems perform at a merely descriptive level, essentially enumerating the objects in the scene and their relations. Humans, on the contrary, interpret images by integrating several sources of prior knowledge of the world. In this work, we aim to take a step closer to producing captions that offer a plausible interpretation of the scene, by integrating such contextual information into the captioning pipeline. For this, we focus on the captioning of images used to illustrate news articles. We propose a novel captioning method that is able to leverage contextual information provided by the text of news articles associated...
We propose a novel multimodal architecture for Scene Text Visual Question Answering (STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to reason over different modalities. Thus, we first investigate the impact of each modality, and reveal the importance of the language module, especially when enriched with layout information. Accounting for this, we propose a single objective pre-training scheme that requires only text and spatial cues. We show that applying this pre-training scheme on scanned documents has certain advantages over using...
In this paper, we propose a Text-Degradation Invariant Auto Encoder (Text-DIAE), a self-supervised model designed to tackle two tasks, text recognition (handwritten or scene-text) and document image enhancement. We start by employing a transformer-based architecture that incorporates three pretext tasks as learning objectives to be optimized during pre-training without the usage of labelled data. Each of these objectives is specifically tailored for the final downstream tasks. We conduct several ablation experiments that confirm...
Many sets of human facial photographs produced in Western cultures are available for scientific research. We report here on the development of a face database of Turkish undergraduate student targets. High-resolution standardized photographs were taken and supported by the following materials: (a) basic demographic and appearance-related information, (b) two types of landmark configurations (for Webmorph and geometric morphometrics (GM)), (c) facial width-to-height ratio (fWHR) measurement, (d) information on photography parameters,...
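As a rough illustration of item (c), fWHR is commonly computed as the bizygomatic (cheekbone-to-cheekbone) width divided by the upper-face height, often measured from the brow to the upper lip. The sketch below assumes that convention; the `Point` type, the landmark names, and the pixel coordinates are hypothetical and are not taken from the database or its measurement protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A 2D landmark in image pixel coordinates (hypothetical helper type)."""
    x: float
    y: float


def fwhr(left_zygion: Point, right_zygion: Point, brow: Point, upper_lip: Point) -> float:
    """Facial width-to-height ratio under the assumed convention:
    horizontal cheekbone-to-cheekbone distance divided by the
    vertical brow-to-upper-lip distance."""
    width = abs(right_zygion.x - left_zygion.x)
    height = abs(upper_lip.y - brow.y)
    if height == 0:
        raise ValueError("brow and upper lip landmarks have the same height")
    return width / height


# Made-up landmark coordinates, for illustration only.
example_ratio = fwhr(
    left_zygion=Point(100.0, 200.0),
    right_zygion=Point(240.0, 210.0),
    brow=Point(170.0, 150.0),
    upper_lip=Point(170.0, 225.0),
)
```

Other landmark choices for the upper-face height exist, so ratios computed under different conventions are not directly comparable.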
Explaining an image with missing or non-existent objects is known as object bias (hallucination) in image captioning. This behaviour is quite common in state-of-the-art captioning models and is not desirable to humans. To decrease object hallucination in captioning, we propose three simple yet efficient training augmentation methods for sentences, which require no new training data or increase in model size. By extensive analysis, we show that the proposed methods can significantly diminish our models’ object bias on hallucination metrics. Moreover,...
This paper presents the final results of the ICDAR 2019 Scene Text Visual Question Answering competition (ST-VQA). ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system up to date, namely the incorporation of scene text to answer questions asked about an image. The competition introduces a new dataset comprising 23,038 images annotated with 31,791 question / answer pairs where the answer is always grounded on text instances present in the image. The images are taken from 7 different public computer vision datasets, covering a wide range of scenarios. The competition was...
Text contained in an image carries high-level semantics that can be exploited to achieve richer image understanding. In particular, the mere presence of text provides strong guiding content that should be employed to tackle a diversity of computer vision tasks such as image retrieval, fine-grained classification, and visual question answering. In this paper, we address the problem of fine-grained classification and image retrieval by leveraging textual information along with visual cues to comprehend the existing intrinsic relation between the two modalities. The...
Scene text instances found in natural images carry explicit semantic information that can provide important cues to solve a wide array of computer vision problems. In this paper, we focus on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval. First, we obtain the text instances from images by employing a text reading system. Then, we combine textual features with salient image regions to exploit the complementary information carried by the two sources. Specifically, we employ a Graph Convolutional Network...
The task of image-text matching aims to map representations from different modalities into a common joint visual-textual embedding. However, the most widely used datasets for this task, MSCOCO and Flickr30K, are actually image captioning datasets that offer a very limited set of relationships between images and sentences in their ground-truth annotations. This limited ground truth information forces us to use evaluation metrics based on binary relevance: given a sentence query we consider only one image as relevant. However, many other...
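To make the binary-relevance evaluation described above concrete, the sketch below computes Recall@K for sentence-to-image retrieval when each query has exactly one relevant image. It is a minimal illustration, not the authors' evaluation code; the function name `recall_at_k` and the similarity scores are assumptions made up for the example.

```python
from typing import Sequence


def recall_at_k(scores: Sequence[Sequence[float]], relevant: Sequence[int], k: int) -> float:
    """Fraction of queries whose single relevant image appears in the top k.

    scores[q][i] is the similarity between sentence query q and image i;
    relevant[q] is the index of the only image counted as relevant for q.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(scores) != len(relevant) or not scores:
        raise ValueError("scores and relevant must be non-empty and equally long")
    hits = 0
    for query_scores, target in zip(scores, relevant):
        # Rank image indices from most to least similar to the query.
        ranked = sorted(range(len(query_scores)), key=lambda i: query_scores[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Made-up similarities: 3 sentence queries scored against 3 images.
toy_scores = [
    [0.9, 0.1, 0.3],  # relevant image 0 is ranked first
    [0.2, 0.4, 0.8],  # relevant image 1 is ranked second
    [0.5, 0.6, 0.1],  # relevant image 2 is ranked third
]
toy_relevant = [0, 1, 2]
r_at_1 = recall_at_k(toy_scores, toy_relevant, k=1)
r_at_2 = recall_at_k(toy_scores, toy_relevant, k=2)
```

Under this scheme an image that fits the query but is not the annotated one counts as a miss, which is the limitation the abstract points to.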
Consonant with a functional view of moral emotions, we argue that morality is best analyzed within relationships rather than in individuals, and use Fiske's (1992) theory of relational models (RMs: communal sharing [CS], authority ranking [AR], equality matching [EM], and market pricing [MP]) to predict that violations in different RMs will arouse different intensities of other-blaming emotions (anger, contempt and disgust) in both observers and victims, together with different intensities of self-blaming emotions (shame and guilt) in perpetrators, and that these patterns of emotion...
This paper explores the possibilities of image style transfer applied to text while maintaining the original transcriptions. Results on different text domains (scene text, machine printed text and handwritten text) and cross-modal results demonstrate that this is feasible, and open different research lines. Furthermore, two architectures for selective style transfer, which means transferring style to only the desired image pixels, are proposed. Finally, scene text style transfer is evaluated as a data augmentation technique to expand scene text detection datasets, resulting in a boost...
Low resource Handwritten Text Recognition (HTR) is a hard problem due to the scarce annotated data and the very limited linguistic information (dictionaries and language models). This is the case, for example, of historical ciphered manuscripts, which are usually written with invented alphabets to hide the message contents. Thus, in this paper we address this problem through a data generation technique based on Bayesian Program Learning (BPL). Contrary to traditional generation approaches, which require a huge amount of annotated images, our method is able to generate human-like...
Humans exploit prior knowledge to describe images, and are able to adapt their explanation to the specific contextual information given, even to the extent of inventing plausible explanations when contextual information and images do not match. In this work, we propose the novel task of captioning Wikipedia images by integrating contextual knowledge. Specifically, we produce models that jointly reason over Wikipedia articles, Wikimedia images and their associated descriptions to produce contextualized captions. The same Wikimedia image can be used to illustrate different articles, and the produced caption needs to be adapted...
Pretraining has proven successful in Document Intelligence tasks, where a deluge of documents is used to pretrain the models, only later to be finetuned on downstream tasks. One of the problems of the pretraining approaches is the inconsistent usage of pretraining data with different OCR engines, leading to incomparable results between models. In other words, it is not obvious whether the performance gain is coming from the diverse amount of data and distinct OCR engines or from the proposed models. To remedy the problem, we make public the OCR annotations for IDL documents using a commercial OCR engine given...
Abstract: Shweder et al. (1997) rejected Kohlberg's (1971) assumptions that morality is universal and that justice is the most important virtue, and posited cultural diversity by proposing the "three basic ethics of morality," which are valued to different degrees in different cultures. Walker and Pitts (1998), in turn, state that one shortcoming of current morality research is that the natural conceptualizations of ordinary people have not been studied. The aim of this study relates to how morality is conceptualized in our society...