Jinhyuk Lee

ORCID: 0000-0003-4972-239X
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Multimodal Machine Learning Applications
  • Protein Structure and Dynamics
  • Biomedical Text Mining and Ontologies
  • Computational Drug Discovery Methods
  • RNA and protein synthesis mechanisms
  • Lipid Membrane Structure and Behavior
  • Enzyme Structure and Function
  • Expert finding and Q&A systems
  • Information Retrieval and Search Behavior
  • Force Microscopy Techniques and Applications
  • Text and Document Classification Technologies
  • Conferences and Exhibitions Management
  • Economic theories and models
  • DNA and Nucleic Acid Chemistry
  • Domain Adaptation and Few-Shot Learning
  • melanin and skin pigmentation
  • Interpreting and Communication in Healthcare
  • Artificial Intelligence in Healthcare and Education
  • Advanced Text Analysis Techniques
  • scientometrics and bibliometrics research
  • Consumer Market Behavior and Pricing
  • Molecular spectroscopy and chirality
  • Machine Learning in Bioinformatics

Korea University
2016-2024

Princeton University
2021-2023

Kyungpook National University
2023

Google (United States)
2023

Korea Research Institute of Bioscience and Biotechnology
2013-2022

Korea University of Science and Technology
2013-2022

Korea Institute of Oriental Medicine
2021

Icahn School of Medicine at Mount Sinai
2021

Bar-Ilan University
2021

University of Washington
2019-2020

Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how...

10.1093/bioinformatics/btz682 article EN Bioinformatics 2019-09-05

In biomedical natural language processing, named entity recognition (NER) and named entity normalization (NEN) are key tasks that enable the automatic extraction of biomedical entities (e.g. diseases and drugs) from the ever-growing biomedical literature. In this article, we present BERN2 (Advanced Biomedical Entity Recognition and Normalization), a tool that improves the previous neural network-based NER tool by employing a multi-task NER model and neural network-based NEN models to achieve much faster and more accurate inference. We hope that our tool can help annotate large-scale biomedical texts for...

10.1093/bioinformatics/btac598 article EN Bioinformatics 2022-08-31

Existing open-domain question answering (QA) models are not suitable for real-time usage because they need to process several long documents on-demand for every input query, which is computationally prohibitive. In this paper, we introduce query-agnostic indexable representations of document phrases that can drastically speed up open-domain QA. In particular, our dense-sparse phrase encoding effectively captures syntactic, semantic, and lexical information of the phrases and eliminates the pipeline filtering of context documents....

10.18653/v1/p19-1436 preprint EN 2019-01-01

The amount of biomedical literature is vast and growing quickly, and accurate text mining techniques could help researchers to efficiently extract useful information from the literature. However, existing named entity recognition models used by text mining tools such as tmTool and ezTag are not effective enough, and cannot accurately discover new entities. Also, the traditional text mining tools do not consider overlapping entities, which are frequently observed in multi-type named entity recognition results. We propose a neural biomedical named entity recognition and multi-type normalization tool called BERN. BERN...

10.1109/access.2019.2920708 article EN cc-by-nc-nd IEEE Access 2019-01-01

Biomedical named entities often play important roles in many biomedical text mining tools. However, due to the incompleteness of provided synonyms and numerous variations in their surface forms, normalization of biomedical entities is very challenging. In this paper, we focus on learning representations of biomedical entities solely based on the synonyms of entities. To learn from the incomplete synonyms, we use a model-based candidate selection and maximize the marginal likelihood of the synonyms present in top candidates. Our model-based candidates are iteratively updated to contain more difficult...

10.18653/v1/2020.acl-main.335 article EN cc-by 2020-01-01

Finding biomedical named entities is one of the most essential tasks in biomedical text mining. Recently, deep learning-based approaches have been applied to biomedical named entity recognition (BioNER) and showed promising results. However, as deep learning approaches need an abundant amount of training data, a lack of data can hinder performance. BioNER datasets are scarce resources and each dataset covers only a small subset of entity types. Furthermore, many bio entities are polysemous, which is one of the major obstacles in named entity recognition. To address the lack of data and the entity type misclassification problem,...

10.1186/s12859-019-2813-6 article EN cc-by BMC Bioinformatics 2019-05-01

Recently, open-domain question answering (QA) has been combined with machine comprehension models to find answers in a large knowledge source. As open-domain QA requires retrieving relevant documents from text corpora to answer questions, its performance largely depends on the performance of document retrievers. However, since traditional information retrieval systems are not effective in obtaining documents with a high probability of containing answers, they lower the performance of QA systems. Simply extracting more documents increases the number of irrelevant documents, which...

10.18653/v1/d18-1053 article EN cc-by Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing 2018-01-01

Jinhyuk Lee, Mujeen Sung, Jaewoo Kang, Danqi Chen. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.

10.18653/v1/2021.acl-long.518 article EN cc-by 2021-01-01

Open-domain question answering has exploded in popularity recently due to the success of dense retrieval models, which have surpassed sparse models using only a few supervised training examples. However, in this paper, we demonstrate that current dense models are not yet the holy grail of retrieval. We first construct EntityQuestions, a set of simple, entity-rich questions based on facts from Wikidata (e.g., "Where was Arve Furset born?"), and observe that dense retrievers drastically under-perform sparse methods. We investigate this issue...

10.18653/v1/2021.emnlp-main.496 article EN cc-by Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021-01-01

Pre-trained language models (LMs) have become ubiquitous in solving various natural language processing (NLP) tasks. There has been increasing interest in what knowledge these LMs contain and how we can extract that knowledge, treating LMs as knowledge bases (KBs). While there has been much work on probing LMs in the general domain, there has been little attention to whether these powerful LMs can be used as domain-specific KBs. To this end, we create the BioLAMA benchmark, which is comprised of 49K biomedical factual knowledge triples for probing biomedical LMs. We find that biomedical LMs with recently proposed...

10.18653/v1/2021.emnlp-main.388 article EN cc-by Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021-01-01

Abstract. Motivation: Traditional drug discovery approaches identify a target for a disease and find a compound that binds to the target. In this approach, structures of compounds are considered as the most important features because it is assumed that similar structures will bind to the same target. Therefore, structural analogs of the drugs that bind to the target are selected as drug candidates. However, even though compounds are not structural analogs, they may achieve the desired response. A new drug discovery method based on drug response, which can complement the structure-based methods, is needed. Results: We implemented...

10.1093/bioinformatics/btz411 article EN Bioinformatics 2019-05-16

The recent outbreak of the novel coronavirus is wreaking havoc on the world and researchers are struggling to effectively combat it. One reason why the fight is difficult is the lack of information and knowledge. In this work, we outline our effort to contribute to shrinking this knowledge vacuum by creating covidAsk, a question answering (QA) system that combines biomedical text mining and QA techniques to provide answers to questions in real-time. Our system also leverages information retrieval (IR) approaches to provide entity-level answers that are complementary to QA models....

10.18653/v1/2020.nlpcovid19-2.1 article EN cc-by 2020-01-01

Many extractive question answering models are trained to predict start and end positions of answers. The choice of predicting answers as positions is mainly due to its simplicity and effectiveness. In this study, we hypothesize that when the distribution of the answer positions is highly skewed in the training set (e.g., answers lie only in the k-th sentence of each passage), QA models predicting answers as positions can learn spurious positional cues and fail to give answers in different positions. We first illustrate this position bias in popular extractive QA models such as BiDAF and BERT and thoroughly examine how position bias propagates through each layer...

10.18653/v1/2020.emnlp-main.84 article EN cc-by 2020-01-01

To explore the microscopic forces governing helix tilting in membranes, we have calculated the potential of mean force (PMF) as a function of tilt angle (τ) of WALP19, a transmembrane model peptide, in a dimyristoylphosphatidylcholine membrane. The PMF shows a wide range of thermally accessible tilt angles (5° to 22°) with a minimum at τ = 12.5°. The free energy decomposition reveals that...

10.1103/physrevlett.100.018103 article EN Physical Review Letters 2008-01-08

Ab initio protein structure prediction is a challenging problem that requires both an accurate energetic representation of a protein structure and an efficient conformational sampling method for successful protein modeling. In this article, we present an ab initio structure prediction method which combines a recently suggested novel way of fragment assembly, dynamic fragment assembly (DFA), and the conformational space annealing (CSA) algorithm. In DFA, model structures are scored by continuous functions constructed based on short- and long-range structural restraint information from a fragment library. Here, DFA...

10.1002/prot.23059 article EN Proteins: Structure, Function, and Bioinformatics 2011-04-20

Open-domain question answering can be formulated as a phrase retrieval problem, in which we can expect huge scalability and speed benefit but often suffer from low accuracy due to the limitation of existing phrase representation models. In this paper, we aim to improve the quality of each phrase embedding by augmenting it with a contextualized sparse representation (Sparc). Unlike previous sparse vectors that are term-frequency-based (e.g., tf-idf) or directly learned (only few thousand dimensions), we leverage rectified self-attention to indirectly...

10.18653/v1/2020.acl-main.85 article EN cc-by 2020-01-01

Scientific novelty drives the efforts to invent new vaccines and solutions during the pandemic. First-time collaboration and international collaboration are two pivotal channels to expand teams' search activities for a broader scope of resources required to address the global challenge, which might facilitate the generation of novel ideas. Our analysis of 98,981 coronavirus papers suggests that scientific novelty, measured by the BioBERT model that is pretrained on 29 million PubMed articles, and first-time collaboration increased after the outbreak of COVID-19, and international collaboration witnessed...

10.1002/asi.24612 article EN Journal of the Association for Information Science and Technology 2021-12-25

Dense retrieval methods have shown great promise over sparse retrieval methods in a range of NLP problems. Among them, dense phrase retrieval (the most fine-grained retrieval unit) is appealing because phrases can be directly used as the output for question answering and slot filling tasks. In this work, we follow the intuition that retrieving phrases naturally entails retrieving larger text blocks and study whether phrase retrieval can serve as the basis for coarse-level retrieval including passages and documents. We first observe that a dense phrase-retrieval system, without any retraining, already...

10.18653/v1/2021.emnlp-main.297 article EN cc-by Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021-01-01

We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from large language models (LLMs) into a retriever. Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each query, and relabeling the positive and hard negative passages using the same LLM. The effectiveness of our approach is demonstrated by the compactness...

10.48550/arxiv.2403.20327 preprint EN arXiv (Cornell University) 2024-03-29

10.1016/j.engappai.2025.110655 article EN Engineering Applications of Artificial Intelligence 2025-04-08

Personal names tend to have many variations differing from country to country. Though there exists a large amount of personal names on the Web, nationality prediction solely based on names has not been fully studied due to its difficulties in extracting subtle character level features. We propose a recurrent neural network based model which predicts nationalities of each name using automatic feature extraction. Evaluation on Olympic record data shows that our model achieves greater accuracy than previous approaches in nationality prediction tasks. We also...

10.24963/ijcai.2017/289 article EN 2017-07-28

To explore the role of hydrogen bonding and helix−lipid interactions in transmembrane helix association, we have calculated the potential of mean force (PMF) as a function of helix−helix distance between two pVNVV peptides, a transmembrane model peptide based on the GCN4 leucine-zipper, in a dimyristoylphosphatidylcholine (DMPC) membrane. The name pVNVV represents the interfacial residues in the heptad repeat of the dimer. The free energy decomposition reveals that the total PMF consists of two competing contributions from helix−helix and helix−lipid interactions. The direct, favorable helix−helix interactions arise...

10.1021/ja711239h article EN Journal of the American Chemical Society 2008-04-19

We evaluate the pKa of dihydrofolate (H2F) at the N5 position in three ternary complexes with Escherichia coli dihydrofolate reductase (ecDHFR), namely ecDHFR(NADP+:H2F) in the closed form (1), and the Michaelis complexes ecDHFR(NADPH:H2F) in the closed (2) and occluded (3) forms, by performing free energy perturbation with molecular dynamics simulations (FEP/MD). Our simulations suggest that the pKa in the Michaelis complex is modulated by the Met20 loop fluctuations, providing the largest pKa shift in substates with a "tightly closed" loop conformation; in the "partially closed/open" substates, the pKa is similar to...

10.1110/ps.062724307 article EN Protein Science 2007-05-02