Ehsaneddin Asgari

ORCID: 0000-0002-6518-7238
Research Areas
  • Topic Modeling
  • Natural Language Processing Techniques
  • Genomics and Phylogenetic Studies
  • Machine Learning in Bioinformatics
  • Sentiment Analysis and Opinion Mining
  • Advanced Text Analysis Techniques
  • Gene expression and cancer classification
  • Authorship Attribution and Profiling
  • vaccines and immunoinformatics approaches
  • Hate Speech and Cyberbullying Detection
  • Healthcare Systems and Practices
  • Social Sciences and Governance
  • Biochemical and Structural Characterization
  • Computational Drug Discovery Methods
  • RNA and protein synthesis mechanisms
  • Antibiotic Resistance in Bacteria
  • Adversarial Robustness in Machine Learning
  • Speech and dialogue systems
  • Gut microbiota and health
  • Bacterial Identification and Susceptibility Testing
  • Bioinformatics and Genomic Networks
  • Language and cultural evolution
  • Formal Methods in Verification
  • Logic, Reasoning, and Knowledge
  • Algorithms and Data Compression

University of California, Berkeley
2015-2024

Helmholtz Centre for Infection Research
2018-2024

Berkeley College
2023-2024

Sharif University of Technology
2022-2023

Data:Lab Munich (Germany)
2022-2023

Volkswagen Group (United States)
2023

German Center for Infection Research
2022

Volkswagen Group (Germany)
2020-2022

University of California System
2017-2022

Technische Universität Braunschweig
2022

We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general, with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction,...

10.1371/journal.pone.0141287 article EN cc-by PLoS ONE 2015-11-10
Naihui Zhou Yuxiang Jiang Timothy Bergquist Alexandra Lee Balint Z. Kacsoh and 95 more Alex W. Crocker Kimberley A. Lewis George P. Georghiou Huy Nguyen Md-Nafiz Hamid L. Taylor Davis Tunca Doğan Volkan Atalay Ahmet Süreyya Rifaioğlu Alperen Dalkıran Rengül Çetin-Atalay Chengxin Zhang Rebecca L. Hurto Peter L. Freddolino Yang Zhang Prajwal Bhat Fran Supek José M. Fernández Branislava Gemović Vladimir Perović Radoslav Davidović Neven Šumonja Nevena Veljković Ehsaneddin Asgari Mohammad R. K. Mofrad Giuseppe Profiti Castrense Savojardo Pier Luigi Martelli Rita Casadio Florian Boecker Heiko Schoof Indika Kahanda Natalie Thurlby Alice C. McHardy Alexandre Renaux Rabie Saidi Julian Gough Alex A. Freitas Magdalena Antczak Fábio Fabris Mark N. Wass Jie Hou Jianlin Cheng Zheng Wang Alfonso E. Romero Alberto Paccanaro Haixuan Yang Tatyana Goldberg Chenguang Zhao Liisa Holm Petri Törönen Alan Medlar Elaine Zosa Itamar Borukhov Ilya B. Novikov Angela D. Wilkins Olivier Lichtarge Po-Han Chi Wei-Cheng Tseng Michal Linial Peter W. Rose Christophe Dessimoz Vedrana Vidulin Sašo Džeroski Ian Sillitoe Sayoni Das Jonathan Lees David T. Jones Cen Wan Domenico Cozzetto Rui Fa Mateo Torres Alex Warwick Vesztrocy José Manuel Rodrı́guez Michael L. Tress Marco Frasca Marco Notaro Giuliano Grossi Alessandro Petrini Matteo Ré Giorgio Valentini Marco Mesiti Daniel B. Roche Jonas Reeb David W. Ritchie Sabeur Aridhi Seyed Ziaeddin Alborzi Marie‐Dominique Devignes Da Chen Emily Koo Richard Bonneau Vladimir Gligorijević Meet Barot Hai Fang Stefano Toppo Enrico Lavezzo

Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous rounds, both in terms of the volume of data analyzed and the types of analysis performed. In a novel and major new development, predictions and assessment goals drove some of the experimental assays, resulting in functional annotations for...

10.1186/s13059-019-1835-8 article EN cc-by Genome biology 2019-11-19

Article, 12 February 2020, Open Access, Transparent process. Predicting antimicrobial resistance in Pseudomonas aeruginosa with machine learning-enabled molecular diagnostics. Ariane Khaledi (Department of Molecular Bacteriology, Helmholtz Centre for Infection Research, Braunschweig, Germany; Bacteriology Group, TWINCORE-Centre for Experimental and Clinical Infection Research, Hannover), Aaron Weimann (orcid.org/0000-0003-4597-2471; Computational Biology, German Center for Infection Research (DZIF)), Monika...

10.15252/emmm.201910264 article EN cc-by EMBO Molecular Medicine 2020-02-12

The advent of rapid whole-genome sequencing has created new opportunities for computational prediction of antimicrobial resistance (AMR) phenotypes from genomic data. Both rule-based and machine learning (ML) approaches have been explored for this task, but systematic benchmarking is still needed. Here, we evaluated four state-of-the-art ML methods (Kover, PhenotypeSeeker, Seq2Geno2Pheno and Aytan-Aktug), an ML baseline and the rule-based ResFinder by training and testing each of them across 78 species–antibiotic...

10.1093/bib/bbae206 article EN cc-by-nc Briefings in Bioinformatics 2024-03-27

Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities of different environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent falls in sequencing costs have increased the demand for simple, efficient and accurate methods for rapid detection or diagnosis with proved applications in medicine, agriculture...

10.1093/bioinformatics/bty296 article EN cc-by-nc Bioinformatics 2018-04-15

In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. The PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific...

10.1038/s41598-019-38746-w article EN cc-by Scientific Reports 2019-03-05
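
The byte-pair encoding idea that the abstract above builds on can be illustrated with a short sketch. This is plain BPE merge learning applied to amino-acid strings, not the authors' PPE: it omits the sampling framework the abstract mentions, and the function names (`learn_bpe_merges`, `merge_pair`, `segment`), the toy sequences, and the tie-breaking rule are assumptions made for illustration only.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent pair."""
    corpus = [list(seq) for seq in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Assumption: ties are broken by the lexicographically larger pair.
        best_pair, best_count = max(
            pair_counts.items(), key=lambda item: (item[1], item[0])
        )
        if best_count < 2:
            break  # nothing left that occurs more than once
        merges.append(best_pair)
        corpus = [merge_pair(symbols, best_pair) for symbols in corpus]
    return merges


def segment(sequence, merges):
    """Segment a new sequence by replaying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical toy sequences, not real Swiss-Prot entries.
toy_merges = learn_bpe_merges(["MKVLMKVA", "MKVLAA"], num_merges=2)
toy_segments = segment("MKVA", toy_merges)
```

Replaying merges in the order they were learned gives exactly one segmentation per sequence; a probabilistic variant would need an additional sampling step over alternative merge sets, which this sketch does not attempt.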

We present the Touché23-ValueEval Dataset for Identifying Human Values behind Arguments. To investigate approaches for the automated detection of human values behind arguments, we collected 9324 arguments from 6 diverse sources, covering religious texts, political discussions, free-text arguments, newspaper editorials, and online democracy platforms. Each argument was annotated by 3 crowdworkers for 54 values. The dataset extends the Webis-ArgValues-22. In comparison to the previous dataset, the effectiveness of a 1-Baseline...

10.48550/arxiv.2301.13771 preprint EN cc-by arXiv (Cornell University) 2023-01-01

The coronavirus SARS-CoV-2 is the causative agent for the disease COVID-19. To capture the IgA, IgG, and IgM antibody response of patients infected with SARS-CoV-2 at individual epitope resolution, we constructed planar microarrays of 648 overlapping peptides that cover the four major structural proteins S(pike), N(ucleocapsid), M(embrane), and E(nvelope). The arrays were incubated with sera of 67 SARS-CoV-2 positive and 22 negative control samples. Specific responses to SARS-CoV-2 were detectable, and nine peptides were associated with a more severe course of the disease. A random forest...

10.1080/22221751.2022.2057874 article EN cc-by Emerging Microbes & Infections 2022-03-23

In this study, we introduce a solution to the SemEval 2024 Task 10 on subtask 1, dedicated to Emotion Recognition in Conversation (ERC) in code-mixed Hindi-English conversations. ERC in code-mixed conversations presents unique challenges, as existing models are typically trained on monolingual datasets and may not perform well on code-mixed data. To address this, we propose a series of models that incorporate both the previous and future context of the current utterance, as well as the sequential information of the conversation. To facilitate the processing of code-mixed data, we developed...

10.48550/arxiv.2501.11166 preprint EN arXiv (Cornell University) 2025-01-19

The SemEval-2024 Task 3 presents two subtasks focusing on emotion-cause pair extraction within conversational contexts. Subtask 1 revolves around the extraction of textual emotion-cause pairs, where causes are defined and annotated as textual spans within the conversation. Conversely, Subtask 2 extends the analysis to encompass multimodal cues, including language, audio, and vision, acknowledging instances where causes may not be exclusively represented in the textual data. Our proposed model for emotion-cause analysis is meticulously structured into three core segments: (i) embedding extraction,...

10.48550/arxiv.2501.11170 preprint EN arXiv (Cornell University) 2025-01-19

Tokenization is fundamental to Natural Language Processing (NLP), directly impacting model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs), it often disregards morpheme boundaries, leading to suboptimal segmentation, particularly in morphologically rich languages. We introduce MorphBPE, a morphology-aware extension of BPE that integrates linguistic structure into subword tokenization while preserving statistical efficiency. Additionally, we propose two...

10.48550/arxiv.2502.00894 preprint EN arXiv (Cornell University) 2025-02-02

Large Language Models (LLMs) struggle with hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information, enhancing factual and updated grounding. Recent advances in multimodal learning have led to the development of Multimodal RAG, incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning...

10.48550/arxiv.2502.08826 preprint EN arXiv (Cornell University) 2025-02-12
Naihui Zhou Yuxiang Jiang Timothy Bergquist Alexandra Lee Balint Z. Kacsoh and 95 more Alex W. Crocker Kimberley A. Lewis George P. Georghiou Huy Nguyen Md-Nafiz Hamid L. Taylor Davis Tunca Doğan Volkan Atalay Ahmet Süreyya Rifaioğlu Alperen Dalkıran Rengül Çetin-Atalay Chengxin Zhang Rebecca L. Hurto Peter L. Freddolino Yang Zhang Prajwal Bhat Fran Supek José M. Fernández Branislava Gemović Vladimir Perović Radoslav Davidović Neven Šumonja Nevena Veljković Ehsaneddin Asgari Mohammad RK Mofrad Giuseppe Profiti Castrense Savojardo Pier Luigi Martelli Rita Casadio Florian Boecker Indika Kahanda Natalie Thurlby Alice C. McHardy Alexandre Renaux Rabie Saidi Julian Gough Alex A. Freitas Magdalena Antczak Fábio Fabris Mark N. Wass Jie Hou Jianlin Cheng Zheng Wang Alfonso E. Romero Alberto Paccanaro Haixuan Yang Tatyana Goldberg Chenguang Zhao Liisa Holm Petri Törönen Alan Medlar Elaine Zosa Itamar Borukhov Ilya B. Novikov Angela D. Wilkins Olivier Lichtarge Po-Han Chi Wei-Cheng Tseng Michal Linial Peter W. Rose Christophe Dessimoz Vedrana Vidulin Sašo Džeroski Ian Sillitoe Sayoni Das Jonathan Lees David T. Jones Cen Wan Domenico Cozzetto Rui Fa Mateo Torres Alex Warwick Vesztrocy José Manuel Rodrı́guez Michael L. Tress Marco Frasca Marco Notaro Giuliano Grossi Alessandro Petrini Matteo Ré Giorgio Valentini Marco Mesiti Daniel B. Roche Jonas Reeb David W. Ritchie Sabeur Aridhi Seyed Ziaeddin Alborzi Marie‐Dominique Devignes Da Chen Emily Koo Richard Bonneau Vladimir Gligorijević Meet Barot Hai Fang Stefano Toppo Enrico Lavezzo

The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Here we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous rounds, both in terms of the volume of data analyzed and the types of analysis performed. In a novel and major new development, predictions and assessment goals drove some of the experimental assays, resulting in functional annotations for more than 1000...

10.1101/653105 preprint EN cc-by bioRxiv (Cold Spring Harbor Laboratory) 2019-05-29

We present SuperPivot, an analysis method for low-resource languages that occur in a superparallel corpus, i.e., in a corpus that contains an order of magnitude more languages than parallel corpora currently in use. We show that SuperPivot performs well for the crosslingual analysis of the linguistic phenomenon of tense. We produce analysis results for more than 1000 languages, conducting, to the best of our knowledge, the largest computational study performed to date. We extend existing methodology for leveraging parallel corpora for typological analysis by overcoming a limiting assumption of earlier work: we only require that a feature...

10.18653/v1/d17-1011 article EN cc-by Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing 2017-01-01

B-cell epitopes (BCEs) play a pivotal role in the development of peptide vaccines, immuno-diagnostic reagents and antibody production, and thus in infectious disease prevention and diagnostics in general. Experimental methods used to determine BCEs are costly and time-consuming. Therefore, it is essential to develop computational methods for the rapid identification of BCEs. Although several computational methods have been developed for this task, generalizability is still a major concern, where cross-testing of the classifiers trained and tested on different datasets...

10.1093/bioinformatics/btab467 article EN cc-by-nc Bioinformatics 2021-06-25

Author(s): Asgari, Ehsaneddin; Mofrad, Mohammad RK | Abstract: We introduce a new measure of distance between languages based on word embedding, called word embedding language divergence (WELD). WELD is defined as the divergence between the unified similarity distributions of words between languages. Using such a measure, we perform language comparison for fifty natural languages and twelve genetic languages. Our natural language dataset is a collection of sentence-aligned parallel corpora from bible translations spanning a variety of language families. Although we use parallel corpora, which guarantees having the...

10.18653/v1/w16-1208 article EN 2016-01-01

Motivation: Here we investigate deep learning-based prediction of protein secondary structure from the protein primary sequence. We study the function of different features in this task, including one-hot vectors, biophysical features, protein sequence embedding (ProtVec), deep contextualized embedding (known as ELMo), and the Position Specific Scoring Matrix (PSSM). In addition to the role of features, we evaluate various deep learning architectures including the following models/mechanisms and certain combinations: Bidirectional Long Short-Term Memory (BiLSTM),...

10.1101/705426 preprint EN cc-by-nc-nd bioRxiv (Cold Spring Harbor Laboratory) 2019-07-18

This paper describes EmbLexChange, a system introduced by the “Life-Language” team for SemEval-2020 Task 1, on unsupervised detection of lexical-semantic changes. EmbLexChange is defined as the divergence between the embedding based profiles of word w (calculated with respect to a set of reference words) in the source and the target domains (source and target domains can be simply two time frames t_1 and t_2). The underlying assumption is that the lexical-semantic change of word w would affect its co-occurring words and subsequently alters the neighborhoods in the embedding spaces. We show that using...

10.18653/v1/2020.semeval-1.24 article EN cc-by 2020-01-01
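
The profile-divergence idea in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the EmbLexChange system itself: the abstract does not say how a profile is normalized or which divergence is used, so the softmax over cosine similarities and the symmetrized KL divergence below are assumptions, as are all function names and the toy vectors. Input vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def profile(word_vec, reference_vecs):
    """Probability profile of a word over reference words.

    Assumption: similarities are turned into a distribution with a softmax.
    """
    sims = [cosine(word_vec, ref) for ref in reference_vecs]
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q); softmax profiles are strictly positive, so this is finite."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def lexical_change_score(word_src, refs_src, word_tgt, refs_tgt):
    """Symmetrized KL between the word's source and target profiles.

    The reference words must be listed in the same order in both domains.
    """
    p = profile(word_src, refs_src)
    q = profile(word_tgt, refs_tgt)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Hypothetical 2-d embeddings: two reference words and one target word.
refs = [[1.0, 0.0], [0.0, 1.0]]
stable_score = lexical_change_score([1.0, 0.0], refs, [1.0, 0.0], refs)
shifted_score = lexical_change_score([1.0, 0.0], refs, [0.0, 1.0], refs)
```

Comparing profiles rather than raw vectors means the two embedding spaces do not have to be aligned with each other, because each profile is computed within its own space.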

Evaluating Large Language Models (LLMs) is challenging due to their generative nature, necessitating precise evaluation methodologies. Additionally, non-English LLM evaluation lags behind English, resulting in the absence or weakness of LLMs for many languages. In response to this necessity, we introduce Khayyam Challenge (also known as PersianMMLU), a meticulously curated collection comprising 20,192 four-choice questions sourced from 38 diverse tasks extracted from Persian examinations, spanning a wide...

10.48550/arxiv.2404.06644 preprint EN arXiv (Cornell University) 2024-04-09

Generating coherent and comprehensive responses remains a significant challenge for Question-Answering (QA) systems when working with short answers, especially for low-resourced languages like Farsi. We present a novel approach to expand these short answers into complete, fluent responses, addressing the critical issue of limited Farsi resources and models. Our methodology employs a two-stage process: first, we develop a dataset using rule-based techniques on Farsi text, followed by a BERT-based ranking system to ensure fluency...

10.20944/preprints202410.1684.v1 preprint EN 2024-10-22

Summary: Identifying distinctive taxa for micro-biome-related diseases is considered key to the establishment of diagnosis and therapy options in precision medicine and imposes high demands on the accuracy of micro-biome analysis techniques. We propose an alignment- and reference-free subsequence based 16S rRNA data analysis, as a new paradigm for micro-biome phenotype and biomarker detection. Our method, called DiTaxa, substitutes standard operational taxonomic unit (OTU)-clustering by segmenting 16S rRNA reads into the most...

10.1093/bioinformatics/bty954 article EN Bioinformatics 2018-11-29

Recently, pretrained representations have gained attention in various machine learning applications. These methods involve considerable computational costs for training the model, hence motivating alternative approaches for representation learning. We introduce TripletProt, a new approach for protein representation learning based on Siamese neural networks. Representation learning of biological entities which capture essential features can alleviate most of the challenges associated with supervised learning in bioinformatics. The most important...

10.1109/tcbb.2021.3108718 article EN cc-by IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021-08-30