NFDI4DS | UHH-SEMS - Publication Details

A Single Model Ensemble Framework for Neural Machine Translation using Pivot Translation

OPENALEX - Publications

Seok Jin Oh Keonwoong Noh Woohwan Jung

Despite the significant advances in neural machine translation, performance remains subpar for low-resource language pairs. Ensembling multiple systems is a widely adopted technique to enhance performance, often accomplished by combining probability distributions. However, previous approaches face the challenge of high computational costs for training multiple models. Furthermore, for black-box models, averaging token-level probabilities at each decoding step is not feasible. To address the problems of multi-model...

10.48550/arxiv.2502.01182 preprint EN arXiv (Cornell University) 2025-02-03

Cardinality Estimation of LIKE Predicate Queries using Deep Learning

Suyong Kwon Kyuseok Shim Woohwan Jung

Cardinality estimation of LIKE predicate queries has an important role in the query optimization of database systems. Traditional approaches generally use a summary of the text data with some statistical assumptions. Recently, deep learning models for cardinality estimation have been investigated. To provide more accurate estimates and reduce maximum errors, we propose a deep learning model that utilizes an extended N-gram table and a conditional regression header. We next investigate how to efficiently generate training data. Our LEADER (LikE...

10.1145/3709670 article EN other-oa Proceedings of the ACM on Management of Data 2025-02-10

Improving Messenger Phishing Detection Using Heterogeneous Phishing Data

Seung Hwan Oh Kyubo Noh S. Kim Dezhi An Woohwan Jung

10.1109/iceic64972.2025.10879662 article EN 2025 International Conference on Electronics, Information, and Communication (ICEIC) 2025-01-19

Data Augmentation for Messenger Phishing Detection Using Large Language Models

Kyubo Noh Seung Hwan Oh S. Kim Dezhi An Woohwan Jung

10.1109/iceic64972.2025.10879623 article EN 2025 International Conference on Electronics, Information, and Communication (ICEIC) 2025-01-19

Semi-supervised Learning for Photovoltaic Cell Defect Detection Using Module and Cell-Level Labels

Nayoung Gil Kil Houm Park Do-Won Jeong Woohwan Jung

10.1109/iceic64972.2025.10879677 article EN 2025 International Conference on Electronics, Information, and Communication (ICEIC) 2025-01-19

Improving Detail in Pluralistic Image Inpainting with Feature Dequantization

Kil Houm Park Woohwan Jung

10.1109/wacv61041.2025.00076 article EN 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025-02-26

Collecting Geospatial Data Under Local Differential Privacy With Improving Frequency Estimation

Daeyoung Hong Woohwan Jung Kyuseok Shim

Geospatial data provides a lot of benefits for personalized services. However, since the geospatial data contains sensitive information about personal activities, collecting raw data has the potential risk of leaking private information from collectors. Recently, local differential privacy (LDP), which protects users without trusting the collector, has been adopted to preserve privacy in many real applications. In this paper, we investigate the problem of collecting the locations of individual users under LDP, and propose a perturbation mechanism designed carefully...

10.1109/tkde.2022.3181049 article EN IEEE Transactions on Knowledge and Data Engineering 2022-01-01

Collecting Geospatial Data with Local Differential Privacy for Personalized Services

Daeyoung Hong Woohwan Jung Kyuseok Shim

Geospatial data provides a lot of benefits for personalized services. However, since the geospatial data contains sensitive information about personal activities, collecting raw data has the potential risk of leaking private information from collectors. Recently, local differential privacy (LDP), which protects users without trusting the collector, has been adopted to preserve privacy in many real applications. However, most existing LDP algorithms focus on obtaining aggregated values such as the mean and histogram from the collected data. In this paper, we...

10.1109/icde51399.2021.00230 article EN 2021 IEEE 37th International Conference on Data Engineering (ICDE) 2021-04-01

Cardinality estimation of approximate substring queries using deep learning

Suyong Kwon Woohwan Jung Kyuseok Shim

Cardinality estimation of an approximate substring query is an important problem in database systems. Traditional approaches build a summary from the text data and estimate the cardinality using the summary with some statistical assumptions. Since deep learning models can learn underlying complex data patterns effectively, they have been successfully applied and shown to outperform traditional methods for cardinality estimation of queries. However, since they have not yet been applied to approximate substring queries, we investigate a deep learning approach for such queries. Although accuracy...

10.14778/3551793.3551859 article EN Proceedings of the VLDB Endowment 2022-07-01

Dual Supervision Framework for Relation Extraction with Distant Supervision and Human Annotation

Woohwan Jung Kyuseok Shim

Relation extraction (RE) has been extensively studied due to its importance in real-world applications such as knowledge base construction and question answering. Most of the existing works train the models on either distantly supervised data or human-annotated data. To take advantage of the high accuracy of human annotation and the cheap cost of distant supervision, we propose the dual supervision framework, which effectively utilizes both types of data. However, simply combining the two types of data to train a RE model may decrease the prediction accuracy since...

10.18653/v1/2020.coling-main.564 article EN cc-by Proceedings of the 28th International Conference on Computational Linguistics 2020-01-01

Integration of graphs from different data sources using crowdsourcing

Younghoon Kim Woohwan Jung Kyuseok Shim

10.1016/j.ins.2017.01.006 article EN Information Sciences 2017-01-05

TIDY: Publishing a Time Interval Dataset With Differential Privacy

Woohwan Jung Suyong Kwon Kyuseok Shim

Log data from mobile devices generally contain a series of events with temporal information including time intervals, which consist of the start and finish times. However, the problem of releasing differentially private time interval datasets has not been tackled yet. A time interval dataset can be represented by a two-dimensional (2D) histogram. Most methods to publish 2D histograms partition the histograms into rectangular spaces to reduce the aggregated noise error for range queries. However, existing algorithms suffer from structural error when applied...

10.1109/tkde.2019.2952351 article EN IEEE Transactions on Knowledge and Data Engineering 2019-11-08

Data Augmentation for Neural Machine Translation using Generative Language Model

Seok Jin Oh Su Ah Lee Woohwan Jung

Despite the rapid growth in model architecture, the scarcity of large parallel corpora remains the main bottleneck in Neural Machine Translation. Data augmentation is a technique that enhances the performance of data-hungry models by generating synthetic data instead of collecting new ones. We explore prompt-based data augmentation approaches that leverage large-scale language models such as ChatGPT. To create a synthetic parallel corpus, we compare 3 methods using different prompts. We employ two assessment metrics to measure the diversity of the generated synthetic data. This...

10.48550/arxiv.2307.16833 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Enhancing Low-resource Fine-grained Named Entity Recognition by Leveraging Coarse-grained Datasets

Su Ah Lee Seok Jin Oh Woohwan Jung

Named Entity Recognition (NER) frequently suffers from the problem of insufficient labeled data, particularly in fine-grained NER scenarios. Although K-shot learning techniques can be applied, their performance tends to saturate when the number of annotations exceeds several tens of labels. To overcome this problem, we utilize existing coarse-grained datasets that offer a large number of annotations. A straightforward approach to address this problem is pre-finetuning, which employs coarse-grained data for representation learning. However,...

10.18653/v1/2023.emnlp-main.197 article EN cc-by Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 2023-01-01

T-REX

Woohwan Jung Kyuseok Shim

Document-level relation extraction (RE) has recently received a lot of attention. However, existing models for document-level RE have similar structures to the models for sentence-level RE. Thus, they still do not consider some unique characteristics of the new problem setting. For example, in Wikipedia, there is a title for each page, and it usually represents the topic entity that is mainly described on the page. In many cases, the topic entity is omitted in the text. Thus, existing RE models often fail to find the relations with the omitted topic entity. To tackle the problem, we propose a Topic-aware...

10.1145/3340531.3412133 article EN 2020-10-19

Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions

Ji-Won Suh Inyoung Na Woohwan Jung

End-to-end automatic speech recognition (E2E ASR) systems have significantly improved through training on extensive datasets. Despite these advancements, they still struggle to accurately recognize domain-specific words, such as proper nouns and technical terminologies. To address this problem, we propose a method to utilize the state-of-the-art Whisper without modifying its architecture, preserving its generalization performance while enabling it to leverage descriptions effectively. Moreover, two...

10.21437/interspeech.2024-377 article EN Interspeech 2024 2024-09-01

Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions

Ji-Won Suh Inyoung Na Woohwan Jung

End-to-end automatic speech recognition (E2E ASR) systems have significantly improved through training on extensive datasets. Despite these advancements, they still struggle to accurately recognize domain-specific words, such as proper nouns and technical terminologies. To address this problem, we propose a method to utilize the state-of-the-art Whisper without modifying its architecture, preserving its generalization performance while enabling it to leverage descriptions effectively. Moreover, two...

10.48550/arxiv.2407.17874 preprint EN arXiv (Cornell University) 2024-07-25

Improving Detail in Pluralistic Image Inpainting with Feature Dequantization

Kil Houm Park Woohwan Jung

Pluralistic Image Inpainting (PII) offers multiple plausible solutions for restoring missing parts of images and has been successfully applied to various applications including image editing and object removal. Recently, VQGAN-based methods have been proposed and have shown that they significantly improve the structural integrity in the generated images. Nevertheless, the state-of-the-art VQGAN-based model PUT faces a critical challenge: degradation of detail quality in output images due to feature quantization. Feature quantization restricts...

10.48550/arxiv.2412.01046 preprint EN arXiv (Cornell University) 2024-12-01

Crowdsourced Truth Discovery in the Presence of Hierarchies for Knowledge Fusion

Woohwan Jung Young-Hoon Kim Kyuseok Shim

Existing works for truth discovery in categorical data usually assume that claimed values are mutually exclusive and only one among them is correct. However, many claimed values are not mutually exclusive even for functional predicates due to their hierarchical structures. Thus, we need to consider the hierarchical structure to effectively estimate the trustworthiness of the sources and infer the truths. We propose a probabilistic model to utilize the hierarchical structures and an inference algorithm to find the truths. In addition, in knowledge fusion, the step of automatically extracting information from...

10.48550/arxiv.1904.10217 preprint EN cc-by arXiv (Cornell University) 2019-01-01

Enhancing Low-resource Fine-grained Named Entity Recognition by Leveraging Coarse-grained Datasets

Su Ah Lee Seok Jin Oh Woohwan Jung

Named Entity Recognition (NER) frequently suffers from the problem of insufficient labeled data, particularly in fine-grained NER scenarios. Although $K$-shot learning techniques can be applied, their performance tends to saturate when the number of annotations exceeds several tens of labels. To overcome this problem, we utilize existing coarse-grained datasets that offer a large number of annotations. A straightforward approach to address this problem is pre-finetuning, which employs coarse-grained data for representation learning. However,...

10.48550/arxiv.2310.11715 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models

Jae Young Choe Keonwoong Noh Nayeon Kim Seyun Ahn Woohwan Jung

Over the past few years, various domain-specific pretrained language models (PLMs) have been proposed and have outperformed general-domain PLMs in specialized areas such as biomedical, scientific, and clinical domains. In addition, financial PLMs have been studied because of the high economic impact of financial data analysis. However, we found that financial PLMs were not pretrained on sufficiently diverse financial data. This lack of diverse training data leads to a subpar generalization performance, resulting in general-purpose PLMs, including BERT, often outperforming financial PLMs on many...

10.48550/arxiv.2310.13312 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Exploring the Impact of Corpus Diversity on Financial Pretrained Language Models

Jae Young Choe Keonwoong Noh Nayeon Kim Seyun Ahn Woohwan Jung

Over the past few years, various domain-specific pretrained language models (PLMs) have been proposed and have outperformed general-domain PLMs in specialized areas such as biomedical, scientific, and clinical domains. In addition, financial PLMs have been studied because of the high economic impact of financial data analysis. However, we found that financial PLMs were not pretrained on sufficiently diverse financial data. This lack of diverse training data leads to a subpar generalization performance, resulting in general-purpose PLMs, including BERT, often outperforming financial PLMs on many...

10.18653/v1/2023.findings-emnlp.138 article EN cc-by 2023-01-01