Woohwan Jung

ORCID: 0000-0003-4561-2214
Research Areas
  • Natural Language Processing Techniques
  • Topic Modeling
  • Advanced Text Analysis Techniques
  • Data Quality and Management
  • Semantic Web and Ontologies
  • Privacy-Preserving Technologies in Data
  • Data Management and Algorithms
  • Human Mobility and Location-Based Analysis
  • Spam and Phishing Detection
  • Generative Adversarial Networks and Image Synthesis
  • Advanced Image Processing Techniques
  • Misinformation and Its Impacts
  • Mobile Crowdsensing and Crowdsourcing
  • Medical Image Segmentation Techniques
  • Vehicular Ad Hoc Networks (VANETs)
  • Web Data Mining and Analysis
  • Privacy, Security, and Data Protection
  • Internet Traffic Analysis and Secure E-voting
  • Image and Object Detection Techniques
  • Data Stream Mining Techniques
  • Algorithms and Data Compression
  • Face and Expression Recognition
  • Face recognition and analysis
  • Time Series Analysis and Forecasting
  • Text and Document Classification Technologies

Hanyang University
2022-2025

Seoul National University
2017-2021

Despite the significant advances in neural machine translation, performance remains subpar for low-resource language pairs. Ensembling multiple systems is a widely adopted technique to enhance performance, often accomplished by combining probability distributions. However, previous approaches face the challenge of high computational costs for training models. Furthermore, for black-box models, averaging token-level probabilities at each decoding step is not feasible. To address the problems of multi-model...

10.48550/arxiv.2502.01182 preprint EN arXiv (Cornell University) 2025-02-03

Cardinality estimation of LIKE predicate queries has an important role in the query optimization of database systems. Traditional approaches generally use a summary of the text data with some statistical assumptions. Recently, deep learning models for cardinality estimation have been investigated. To provide more accurate estimates and reduce the maximum errors, we propose a model that utilizes an extended N-gram table and a conditional regression header. We next investigate how to efficiently generate training data. Our LEADER (LikE...

10.1145/3709670 article EN other-oa Proceedings of the ACM on Management of Data 2025-02-10

10.1109/iceic64972.2025.10879662 article EN 2025 International Conference on Electronics, Information, and Communication (ICEIC) 2025-01-19

10.1109/iceic64972.2025.10879623 article EN 2025 International Conference on Electronics, Information, and Communication (ICEIC) 2025-01-19

10.1109/iceic64972.2025.10879677 article EN 2025 International Conference on Electronics, Information, and Communication (ICEIC) 2025-01-19

10.1109/wacv61041.2025.00076 article EN 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025-02-26

Geospatial data provides a lot of benefits for personalized services. However, since geospatial data contains sensitive information about personal activities, collecting raw data has the potential risk of leaking private information from the data collectors. Recently, local differential privacy (LDP), which protects the privacy of users without trusting the data collector, has been adopted to preserve privacy in many real applications. In this paper, we investigate the problem of collecting the locations of individual users under LDP, and propose a perturbation mechanism designed carefully...

10.1109/tkde.2022.3181049 article EN IEEE Transactions on Knowledge and Data Engineering 2022-01-01

Geospatial data provides a lot of benefits for personalized services. However, since geospatial data contains sensitive information about personal activities, collecting raw data has the potential risk of leaking private information from the data collectors. Recently, local differential privacy (LDP), which protects the privacy of users without trusting the data collector, has been adopted to preserve privacy in many real applications. Most existing LDP algorithms focus on obtaining aggregated values such as the mean and histogram of the collected data. In this paper, we...

10.1109/icde51399.2021.00230 article EN 2021 IEEE 37th International Conference on Data Engineering (ICDE) 2021-04-01

Cardinality estimation of an approximate substring query is an important problem in database systems. Traditional approaches build a summary from the text data and estimate the cardinality using the summary with some statistical assumptions. Since deep learning models can learn the underlying complex patterns effectively, they have been successfully applied and shown to outperform traditional methods for cardinality estimations of queries. However, since they are not yet applied to approximate substring queries, we investigate a deep learning approach for such queries. Although accuracy...

10.14778/3551793.3551859 article EN Proceedings of the VLDB Endowment 2022-07-01

Relation extraction (RE) has been extensively studied due to its importance in real-world applications such as knowledge base construction and question answering. Most of the existing works train models on either distantly supervised data or human-annotated data. To take advantage of the high accuracy of human annotation and the cheap cost of distant supervision, we propose the dual supervision framework, which effectively utilizes both types of data. However, simply combining the two types of data to train a RE model may decrease the prediction accuracy since...

10.18653/v1/2020.coling-main.564 article EN cc-by Proceedings of the 28th International Conference on Computational Linguistics 2020-01-01

Log data from mobile devices generally contain a series of events with temporal information, including time intervals that consist of the start and finish times. However, the problem of releasing differentially private time interval datasets has not been tackled yet. A time interval dataset can be represented by a two-dimensional (2D) histogram. Most methods to publish 2D histograms partition the histogram into rectangular spaces to reduce the aggregated noise error for range queries. The existing algorithms suffer from structural error when applied...

10.1109/tkde.2019.2952351 article EN IEEE Transactions on Knowledge and Data Engineering 2019-11-08

Despite the rapid growth in model architecture, the scarcity of large parallel corpora remains the main bottleneck in Neural Machine Translation. Data augmentation is a technique that enhances the performance of data-hungry models by generating synthetic data instead of collecting new ones. We explore prompt-based data augmentation approaches that leverage large-scale language models such as ChatGPT. To create a synthetic parallel corpus, we compare 3 methods using different prompts. We employ two assessment metrics to measure the diversity of the generated synthetic data. This...

10.48550/arxiv.2307.16833 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Named Entity Recognition (NER) frequently suffers from the problem of insufficient labeled data, particularly in fine-grained NER scenarios. Although K-shot learning techniques can be applied, their performance tends to saturate when the number of annotations exceeds several tens of labels. To overcome this problem, we utilize existing coarse-grained datasets that offer a large number of annotations. A straightforward approach to address this problem is pre-finetuning, which employs coarse-grained data for representation learning. However,...

10.18653/v1/2023.emnlp-main.197 article EN cc-by Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 2023-01-01

Document-level relation extraction (RE) has recently received a lot of attention. However, existing models for document-level RE have similar structures to the models for sentence-level RE. Thus, they still do not consider some unique characteristics of the new problem setting. For example, in Wikipedia, there is a title for each page and it usually represents the topic entity that is mainly described on the page. In many cases, the topic entity is omitted in the text. Thus, existing models often fail to find relations with the topic entity. To tackle the problem, we propose Topic-aware...

10.1145/3340531.3412133 article EN 2020-10-19

End-to-end automatic speech recognition (E2E ASR) systems have significantly improved through training on extensive datasets. Despite these advancements, they still struggle to accurately recognize domain-specific words, such as proper nouns and technical terminologies. To address this problem, we propose a method to utilize the state-of-the-art Whisper without modifying its architecture, preserving its generalization performance while enabling it to leverage descriptions effectively. Moreover, two...

10.21437/interspeech.2024-377 article EN Interspeech 2024 2024-09-01

End-to-end automatic speech recognition (E2E ASR) systems have significantly improved through training on extensive datasets. Despite these advancements, they still struggle to accurately recognize domain-specific words, such as proper nouns and technical terminologies. To address this problem, we propose a method to utilize the state-of-the-art Whisper without modifying its architecture, preserving its generalization performance while enabling it to leverage descriptions effectively. Moreover, two...

10.48550/arxiv.2407.17874 preprint EN arXiv (Cornell University) 2024-07-25

Pluralistic Image Inpainting (PII) offers multiple plausible solutions for restoring missing parts of images and has been successfully applied to various applications including image editing and object removal. Recently, VQGAN-based methods have been proposed and have shown that they significantly improve the structural integrity in the generated images. Nevertheless, the state-of-the-art model PUT faces a critical challenge: degradation of detail quality in the output due to feature quantization. Feature quantization restricts...

10.48550/arxiv.2412.01046 preprint EN arXiv (Cornell University) 2024-12-01

Existing works for truth discovery in categorical data usually assume that claimed values are mutually exclusive and only one among them is correct. However, many claimed values are not mutually exclusive even for functional predicates due to their hierarchical structures. Thus, we need to consider the hierarchical structure to effectively estimate the trustworthiness of the sources and infer the truths. We propose a probabilistic model to utilize the hierarchical structures and an inference algorithm to find the truths. In addition, in knowledge fusion, the step of automatically extracting information from...

10.48550/arxiv.1904.10217 preprint EN cc-by arXiv (Cornell University) 2019-01-01

Named Entity Recognition (NER) frequently suffers from the problem of insufficient labeled data, particularly in fine-grained NER scenarios. Although $K$-shot learning techniques can be applied, their performance tends to saturate when the number of annotations exceeds several tens of labels. To overcome this problem, we utilize existing coarse-grained datasets that offer a large number of annotations. A straightforward approach to address this problem is pre-finetuning, which employs coarse-grained data for representation learning. However,...

10.48550/arxiv.2310.11715 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Over the past few years, various domain-specific pretrained language models (PLMs) have been proposed and outperformed general-domain PLMs in specialized areas such as biomedical, scientific, and clinical domains. In addition, financial PLMs have been studied because of the high economic impact of financial data analysis. However, we found that financial PLMs were not pretrained on sufficiently diverse financial data. This lack of diverse training data leads to a subpar generalization performance, resulting in general-purpose PLMs, including BERT, often outperforming many...

10.48550/arxiv.2310.13312 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Over the past few years, various domain-specific pretrained language models (PLMs) have been proposed and outperformed general-domain PLMs in specialized areas such as biomedical, scientific, and clinical domains. In addition, financial PLMs have been studied because of the high economic impact of financial data analysis. However, we found that financial PLMs were not pretrained on sufficiently diverse financial data. This lack of diverse training data leads to a subpar generalization performance, resulting in general-purpose PLMs, including BERT, often outperforming many...

10.18653/v1/2023.findings-emnlp.138 article EN cc-by 2023-01-01