- Natural Language Processing Techniques
- Topic Modeling
- Advanced Text Analysis Techniques
- Data Quality and Management
- Semantic Web and Ontologies
- Privacy-Preserving Technologies in Data
- Data Management and Algorithms
- Human Mobility and Location-Based Analysis
- Spam and Phishing Detection
- Generative Adversarial Networks and Image Synthesis
- Advanced Image Processing Techniques
- Misinformation and Its Impacts
- Mobile Crowdsensing and Crowdsourcing
- Medical Image Segmentation Techniques
- Vehicular Ad Hoc Networks (VANETs)
- Web Data Mining and Analysis
- Privacy, Security, and Data Protection
- Internet Traffic Analysis and Secure E-voting
- Image and Object Detection Techniques
- Data Stream Mining Techniques
- Algorithms and Data Compression
- Face and Expression Recognition
- Face recognition and analysis
- Time Series Analysis and Forecasting
- Text and Document Classification Technologies
Hanyang University
2022-2025
Seoul National University
2017-2021
Despite the significant advances in neural machine translation, performance remains subpar for low-resource language pairs. Ensembling multiple systems is a widely adopted technique to enhance performance, often accomplished by combining probability distributions. However, previous approaches face the challenge of high computational costs for training multiple models. Furthermore, for black-box models, averaging token-level probabilities at each decoding step is not feasible. To address the problems of multi-model...
Cardinality estimation of LIKE predicate queries plays an important role in query optimization for database systems. Traditional approaches generally use a summary of the text data with some statistical assumptions. Recently, deep learning models for cardinality estimation have been investigated. To provide more accurate estimates and reduce the maximum errors, we propose a deep learning model that utilizes an extended N-gram table and a conditional regression header. We next investigate how to efficiently generate training data. Our LEADER (LikE...
Geospatial data provides many benefits for personalized services. However, since geospatial data contains sensitive information about personal activities, collecting raw data carries a potential risk of leaking private information from the data collectors. Recently, local differential privacy (LDP), which protects users' privacy without trusting the data collector, has been adopted to preserve privacy in many real applications. In this paper, we investigate the problem of collecting the locations of individual users under LDP, and propose a perturbation mechanism designed carefully...
Geospatial data provides many benefits for personalized services. However, since geospatial data contains sensitive information about personal activities, collecting raw data carries a potential risk of leaking private information from the data collectors. Recently, local differential privacy (LDP), which protects users' privacy without trusting the data collector, has been adopted to preserve privacy in many real applications. However, most existing LDP algorithms focus on obtaining aggregated values such as the mean and histogram of the collected data. In this paper, we...
Cardinality estimation of an approximate substring query is an important problem in database systems. Traditional approaches build a summary from the text data and estimate the cardinality using the summary with some statistical assumptions. Since deep learning models can learn underlying complex data patterns effectively, they have been successfully applied and shown to outperform traditional methods for cardinality estimation of queries. However, since deep learning models have not yet been applied to approximate substring queries, we investigate a deep learning approach for such queries. Although the accuracy...
Relation extraction (RE) has been extensively studied due to its importance in real-world applications such as knowledge base construction and question answering. Most existing works train models on either distantly supervised data or human-annotated data. To take advantage of the high accuracy of human annotation and the low cost of distant supervision, we propose a dual supervision framework which effectively utilizes both types of data. However, simply combining the two types of data to train an RE model may decrease the prediction accuracy since...
Log data from mobile devices generally contain a series of events with temporal information, including time intervals which consist of the start and finish times. However, the problem of releasing differentially private time interval datasets has not been tackled yet. A time interval dataset can be represented by a two-dimensional (2D) histogram. Most methods to publish differentially private 2D histograms partition the data into rectangular spaces to reduce the aggregated noise error for range queries. However, the existing algorithms suffer from structural error when applied...
Despite the rapid growth in model architecture, the scarcity of large parallel corpora remains the main bottleneck in Neural Machine Translation. Data augmentation is a technique that enhances the performance of data-hungry models by generating synthetic data instead of collecting new data. We explore prompt-based data augmentation approaches that leverage large-scale language models such as ChatGPT. To create a synthetic parallel corpus, we compare three methods using different prompts. We employ two assessment metrics to measure the diversity of the generated synthetic data. This...
Named Entity Recognition (NER) frequently suffers from the problem of insufficient labeled data, particularly in fine-grained NER scenarios. Although K-shot learning techniques can be applied, their performance tends to saturate when the number of annotations exceeds several tens of labels. To overcome this problem, we utilize existing coarse-grained datasets that offer a large number of annotations. A straightforward approach to address this problem is pre-finetuning, which employs coarse-grained data for representation learning. However,...
Document-level relation extraction (RE) has recently received a lot of attention. However, existing models for document-level RE have similar structures to the models for sentence-level RE. Thus, they still do not consider some unique characteristics of the new problem setting. For example, in Wikipedia, there is a title for each page and it usually represents the topic entity that is mainly described on the page. In many cases, the topic entity is omitted in the text. Thus, existing models often fail to find the relations with the topic entity. To tackle the problem, we propose Topic-aware...
End-to-end automatic speech recognition (E2E ASR) systems have significantly improved through training on extensive datasets. Despite these advancements, they still struggle to accurately recognize domain-specific words, such as proper nouns and technical terminologies. To address this problem, we propose a method to utilize the state-of-the-art Whisper without modifying its architecture, preserving its generalization performance while enabling it to leverage descriptions effectively. Moreover, two...
Pluralistic Image Inpainting (PII) offers multiple plausible solutions for restoring missing parts of images and has been successfully applied to various applications including image editing and object removal. Recently, VQGAN-based methods have been proposed and shown to significantly improve the structural integrity of the generated images. Nevertheless, the state-of-the-art VQGAN-based model PUT faces a critical challenge: degradation of detail quality in the output images due to feature quantization. Feature quantization restricts...
Existing works for truth discovery in categorical data usually assume that claimed values are mutually exclusive and only one among them is correct. However, many claimed values are not mutually exclusive even for functional predicates due to their hierarchical structures. Thus, we need to consider the hierarchical structure to effectively estimate the trustworthiness of the sources and infer the truths. We propose a probabilistic model to utilize the hierarchical structures and an inference algorithm to find the truths. In addition, in knowledge fusion, the step of automatically extracting information from...
Named Entity Recognition (NER) frequently suffers from the problem of insufficient labeled data, particularly in fine-grained NER scenarios. Although $K$-shot learning techniques can be applied, their performance tends to saturate when the number of annotations exceeds several tens of labels. To overcome this problem, we utilize existing coarse-grained datasets that offer a large number of annotations. A straightforward approach to address this problem is pre-finetuning, which employs coarse-grained data for representation learning. However,...
Over the past few years, various domain-specific pretrained language models (PLMs) have been proposed and have outperformed general-domain PLMs in specialized areas such as the biomedical, scientific, and clinical domains. In addition, financial PLMs have been studied because of the high economic impact of financial data analysis. However, we found that financial PLMs were not pretrained on sufficiently diverse financial data. This lack of diverse training data leads to subpar generalization performance, resulting in general-purpose PLMs, including BERT, often outperforming financial PLMs on many...