Simin Fan

ORCID: 0000-0002-1490-9413
Research Areas
  • Topic Modeling
  • Pancreatic function and diabetes
  • Online Learning and Analytics
  • Pancreatic and Hepatic Oncology Research
  • Natural Language Processing Techniques
  • Biosensors and Analytical Detection
  • Artificial Intelligence in Healthcare and Education
  • Anomaly Detection Techniques and Applications
  • Education and Critical Thinking Development
  • Advanced biosensing and bioanalysis techniques
  • Advanced Nanomaterials in Catalysis
  • Speech Recognition and Synthesis
  • Traditional Chinese Medicine Studies
  • Machine Learning in Healthcare
  • Digital Marketing and Social Media
  • Human Pose and Action Recognition
  • Microbial Natural Products and Biosynthesis
  • Genomics and Phylogenetic Studies
  • Genetic Associations and Epidemiology
  • Text Readability and Simplification
  • Diet, Metabolism, and Disease
  • Genetic Syndromes and Imprinting
  • Statistical Methods and Inference
  • Intelligent Tutoring Systems and Adaptive Learning
  • Metabolomics and Mass Spectrometry Studies

First Affiliated Hospital of Guangzhou University of Chinese Medicine
2025

Henan Normal University
2024

Zhuhai Institute of Advanced Technology
2024

University of Chinese Academy of Sciences
2024

University of Michigan
2021-2024

École Polytechnique Fédérale de Lausanne
2024

Xinjiang Institute of Ecology and Geography
2024

Chinese Academy of Sciences
2024

Guangzhou University of Chinese Medicine
2023

Michigan United
2023

Although reading assignments are prevalent, methods to encourage students to actively read are limited. We propose ReadingQuizMaker, a system that supports instructors in conveniently designing high-quality questions to help students comprehend readings. ReadingQuizMaker adapts to instructors' natural workflows of creating questions while providing NLP-based, process-oriented support. The system enables instructors to decide when and which NLP models to use, to select the inputs to the models, and to edit the outcomes. In an evaluation study, instructors found the resulting questions to be comparable to their...

10.1145/3544548.3580957 article EN 2023-04-19
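For orientation, a minimal sketch of the kind of NLP step such a system orchestrates, assuming a generic Hugging Face text2text checkpoint rather than the paper's actual models; the instructor supplies the passage and answer focus and then edits the drafted question.

```python
# A minimal sketch (not the paper's system) of the NLP step ReadingQuizMaker
# orchestrates: the instructor chooses the input passage and an answer focus,
# a seq2seq model drafts a question, and the instructor edits the output.
from transformers import pipeline

# Checkpoint choice is an assumption for illustration only.
qg = pipeline("text2text-generation", model="google/flan-t5-base")

passage = (
    "Photosynthesis converts light energy into chemical energy, "
    "producing glucose and oxygen from carbon dioxide and water."
)
answer_focus = "glucose and oxygen"

prompt = (
    f"Write a reading-comprehension question about the passage whose answer is "
    f"'{answer_focus}'.\nPassage: {passage}"
)
draft = qg(prompt, max_new_tokens=48)[0]["generated_text"]
print("Draft question for instructor review:", draft)
```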

AI assistants, such as ChatGPT, are being increasingly used by students in higher education institutions. While these tools provide opportunities for improved teaching and education, they also pose significant challenges for assessment and learning outcomes. We conceptualize these challenges through the lens of vulnerability: the potential for university assessments and learning outcomes to be impacted by student use of generative AI. We investigate the scale of this vulnerability by measuring the degree to which AI assistants can complete assessment questions in standard...

10.1073/pnas.2414955121 article EN cc-by Proceedings of the National Academy of Sciences 2024-11-26
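A toy illustration of the measurement idea only, not the study's protocol or data: query an assistant on each assessment question and report the fraction answered correctly. `ask_assistant`, the question format, and the grading rule below are placeholders.

```python
# Hypothetical sketch: ask an AI assistant each assessment question and record
# the share it answers correctly. `ask_assistant` stands in for any chat-model
# API; questions and answer keys are toy placeholders.
from typing import Callable

def assistant_pass_rate(questions: list[dict], ask_assistant: Callable[[str], str]) -> float:
    """Fraction of multiple-choice questions the assistant answers correctly."""
    correct = 0
    for q in questions:
        prompt = q["question"] + "\nOptions: " + ", ".join(q["options"]) + "\nAnswer with one option."
        reply = ask_assistant(prompt)
        if q["answer"].lower() in reply.lower():
            correct += 1
    return correct / len(questions)

if __name__ == "__main__":
    toy_exam = [
        {"question": "2 + 2 = ?", "options": ["3", "4", "5"], "answer": "4"},
        {"question": "Capital of France?", "options": ["Paris", "Rome"], "answer": "Paris"},
    ]
    # Dummy assistant that always picks the first option, for demonstration.
    print(assistant_pass_rate(toy_exam, lambda p: p.split("Options: ")[1].split(",")[0]))
```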

Large language and multimodal models (LLMs and LMMs) will transform access to medical knowledge and clinical decision support. However, the current leading systems fall short of this promise: they are either limited in scale, which restricts their capabilities; closed-source, which limits the extensions and scrutiny that can be applied to them; or not sufficiently adapted to clinical settings, which inhibits their practical use. In this work, we democratize large-scale medical AI by developing MEDITRON: a suite of open-source LLMs and LMMs with 7B...

10.21203/rs.3.rs-4139743/v1 preprint EN cc-by Research Square (Research Square) 2024-04-03

NLP-powered automatic question generation (QG) techniques carry great pedagogical potential for saving educators' time and benefiting student learning. Yet, QG systems have not been widely adopted in classrooms to date. In this work, we aim to pinpoint the key impediments and investigate how to improve the usability of QG for educational purposes by understanding how instructors construct questions and identifying touch points at which to enhance the underlying NLP models. We perform an in-depth need-finding study with 11 instructors across 7...

10.18653/v1/2022.naacl-main.22 article EN cc-by Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2022-01-01

This study addresses the critical need for regional tourism integration and sustainable development by identifying cooperation opportunities among tourist attractions within a region. We introduce a novel methodology that combines association rule mining with complex network analysis and utilizes search index data as a dynamic, contemporary source to reveal cooperative patterns among attractions. Our approach delineates a potential destination ecosystem, categorizing attractions into three distinct communities: core,...

10.1371/journal.pone.0298035 article EN cc-by PLoS ONE 2024-02-07
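A compact sketch of the described pipeline under assumed inputs: co-occurrence "transactions" derived from search-index records, pairwise association rules filtered by lift, and community detection on the resulting network (greedy modularity here as a stand-in for the paper's analysis).

```python
# Minimal sketch under assumed inputs: each "transaction" is a set of attractions
# that co-occur in search-index records. Pairwise support/confidence/lift are
# computed directly, then strongly associated pairs form a network whose
# communities suggest cooperation groups.
from itertools import combinations
import networkx as nx

transactions = [
    {"Lake Park", "Old Town"},
    {"Lake Park", "Old Town", "Museum"},
    {"Museum", "Old Town"},
    {"Lake Park", "Botanic Garden"},
]
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

G = nx.Graph()
items = set().union(*transactions)
for a, b in combinations(sorted(items), 2):
    s_ab = support({a, b})
    if s_ab == 0:
        continue
    lift = s_ab / (support({a}) * support({b}))
    confidence = s_ab / support({a})
    if lift > 1.0:                      # keep only positively associated pairs
        G.add_edge(a, b, weight=lift, confidence=confidence)

# Community detection on the rule network (greedy modularity as a stand-in).
communities = nx.algorithms.community.greedy_modularity_communities(G, weight="weight")
print([sorted(c) for c in communities])
```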

A hallmark of type 2 diabetes (T2D), a major cause of worldwide morbidity and mortality, is dysfunction of insulin-producing pancreatic islet β cells (1–3). T2D genome-wide association studies (GWAS) have identified hundreds of signals, mostly in the non-coding genome and overlapping β cell regulatory elements, but translating these into biological mechanisms has been challenging (4–6). To identify early disease-driving events, we performed single-cell spatial proteomics, sorted-cell transcriptomics, and assessed...

10.1101/2021.12.16.466282 preprint EN bioRxiv (Cold Spring Harbor Laboratory) 2021-12-17

The mechanism underlying traditional Chinese medicine (TCM) compatibility is difficult to understand. This study combined lipidomics with an efficacy-oriented approach to explore the mechanisms of Qi Ge decoction (QG) for improving lipid metabolism in hyperlipidemic rats. QG was divided into three groups according to the efficacy group strategy: the Huangqi-Gegen (HG), Chenpi (CP), and QG groups. Hyperlipidemic rats were treated with QG, HG, CP, or atorvastatin for 3 weeks. Mass spectral data from widely targeted lipidomics were used to evaluate...

10.1002/bmc.5595 article EN Biomedical Chromatography 2023-02-03

Due to their small size and special chemical features, small open reading frame (smORF)-encoded peptides (SEPs) are often neglected. However, they may play critical roles in regulating gene expression, enzyme activity, and metabolite production. Studies on bacterial microproteins have mainly focused on pathogenic bacteria, which makes it important to systematically investigate SEPs in streptomycetes, rich sources of bioactive secondary metabolites. Our study is the first to perform a global identification of smORFs...

10.1128/msystems.00245-23 article EN cc-by mSystems 2023-09-15

Recent research on the grokking phenomenon has illuminated the intricacies of neural networks' training dynamics and their generalization behaviors. Grokking refers to a sharp rise in a network's test-set accuracy that occurs long after an extended overfitting phase, during which the network perfectly fits the training set. While existing work primarily focuses on shallow networks such as a 2-layer MLP or a 1-layer Transformer, we explore grokking in deep networks (e.g., a 12-layer MLP). We empirically replicate the phenomenon and find that a deep MLP can be more susceptible than its...

10.48550/arxiv.2405.19454 preprint EN arXiv (Cornell University) 2024-05-29
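For context, a toy version of the classic grokking setup (modular addition with strong weight decay); the architecture and hyperparameters below are illustrative assumptions, not the paper's 12-layer configuration.

```python
# Toy grokking setup: an MLP trained on modular addition with large weight
# decay; test accuracy is expected to jump long after train accuracy saturates.
import torch, torch.nn as nn

p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
split = len(pairs) // 2
train_idx, test_idx = perm[:split], perm[split:]

emb = nn.Embedding(p, 64)
mlp = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, p))
params = list(emb.parameters()) + list(mlp.parameters())
opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=1.0)  # strong weight decay is key

def accuracy(idx):
    with torch.no_grad():
        x = emb(pairs[idx]).flatten(1)          # concatenate the two operand embeddings
        return (mlp(x).argmax(-1) == labels[idx]).float().mean().item()

for step in range(1, 50001):
    x = emb(pairs[train_idx]).flatten(1)
    loss = nn.functional.cross_entropy(mlp(x), labels[train_idx])
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 5000 == 0:
        print(step, "train", accuracy(train_idx), "test", accuracy(test_idx))
```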

The inability of surgical biopsy to monitor the dynamic evolution of cancer cells hampers its capacity to reflect real-time tumor heterogeneity. Circulating tumor cells (CTCs), as a crucial target in liquid biopsy, offer a novel approach for the accurate monitoring of tumors. However, their rarity and the complex phenotypes resulting from epithelial-mesenchymal transition pose challenges for conventional methods such as CellSearch and immunohistochemistry, which have insufficient ability for simultaneous phenotyping...

10.21203/rs.3.rs-4911090/v1 preprint EN cc-by Research Square (Research Square) 2024-10-11

Specialist language models (LMs) focus on a specific task or domain on which they often outperform generalist LMs of the same size. However, the specialist data needed to pretrain these models is only available in limited amounts for most tasks. In this work, we build specialist models from large generalist training sets instead. We adjust the training distribution with guidance from limited domain-specific data. We explore several approaches, with clustered importance sampling standing out. This method clusters the generalist dataset and samples from the clusters based on their frequencies in the smaller...

10.48550/arxiv.2410.03735 preprint EN arXiv (Cornell University) 2024-09-30
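A minimal sketch of clustered importance sampling as summarized above, with random vectors standing in for document embeddings: cluster the generalist corpus, estimate cluster frequencies in the small specialist set, and resample generalist documents by the frequency ratio.

```python
# Sketch: cluster a large generalist corpus, measure how often each cluster
# appears in the small specialist set, and resample the generalist data toward
# those frequencies. Embeddings are random placeholders; any text embedder
# could supply them.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
generalist_emb = rng.normal(size=(10000, 32))   # stand-in for generalist document embeddings
specialist_emb = rng.normal(size=(200, 32))     # stand-in for limited specialist documents

k = 16
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(generalist_emb)
gen_clusters = km.labels_
spec_clusters = km.predict(specialist_emb)

gen_freq = np.bincount(gen_clusters, minlength=k) / len(gen_clusters)
spec_freq = np.bincount(spec_clusters, minlength=k) / len(spec_clusters)

# Per-document sampling weight: specialist frequency over generalist frequency
# of the document's cluster (importance ratio), normalized over the corpus.
weights = spec_freq[gen_clusters] / np.maximum(gen_freq[gen_clusters], 1e-12)
weights /= weights.sum()

resampled = rng.choice(len(generalist_emb), size=5000, replace=True, p=weights)
print("resampled pool size:", len(resampled))
```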

Influence functions provide a principled method to assess the contribution of individual training samples to a specific target. Yet, their high computational costs limit their applications on large-scale models and datasets. Existing methods proposed for influence function approximation have significantly reduced the computational overheads. However, they mostly suffer from inaccurate estimation due to the lack of strong convergence guarantees in the algorithm. The family of hyperpower methods is well-known for rigorous matrix inverse approximation,...

10.48550/arxiv.2410.05090 preprint EN arXiv (Cornell University) 2024-10-07
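To make the connection concrete, a small sketch (not the paper's algorithm) of how a hyperpower-style iteration can serve influence estimation: a Newton-Schulz iteration, the order-2 member of the hyperpower family, approximates the inverse of a damped Hessian, which then scores a training gradient against a target gradient. A small explicit Hessian is used purely for illustration.

```python
# Approximate the inverse Hessian with a Newton-Schulz-type iteration (order-2
# hyperpower method), then compute an influence-style score
#   influence(z, z') = -grad_target^T H^{-1} grad_train.
import numpy as np

def newton_schulz_inverse(H, iters=30):
    """Iterative inverse approximation: X_{k+1} = X_k (2I - H X_k)."""
    n = H.shape[0]
    X = H.T / (np.linalg.norm(H, 1) * np.linalg.norm(H, np.inf))  # standard safe initialization
    I = np.eye(n)
    for _ in range(iters):
        X = X @ (2 * I - H @ X)
    return X

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 20))
H = A @ A.T + 0.1 * np.eye(20)        # SPD stand-in for a damped Hessian

H_inv = newton_schulz_inverse(H)
print("inverse error:", np.linalg.norm(H_inv @ H - np.eye(20)))

g_train = rng.normal(size=20)          # gradient of one training example
g_target = rng.normal(size=20)         # gradient of the target/validation loss
influence = -g_target @ H_inv @ g_train
print("approximate influence score:", influence)
```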

The composition of training data mixtures is critical for effectively training large language models (LLMs), as it directly impacts their performance on downstream tasks. Our goal is to identify an optimal data mixture to specialize an LLM for a specific task with access to only a few examples. Traditional approaches to this problem include ad-hoc reweighting methods, importance sampling, and gradient alignment techniques. This paper focuses on gradient alignment and introduces Dynamic Gradient Alignment (DGA), a scalable online gradient alignment algorithm. DGA...

10.48550/arxiv.2410.02498 preprint EN arXiv (Cornell University) 2024-10-03
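An illustrative single step of the gradient-alignment principle that DGA builds on, run on a toy linear model; the update rule and normalization here are assumptions for illustration, not the paper's online algorithm.

```python
# Weight each domain by how well its gradient aligns with the gradient of a
# handful of target-task examples, then renormalize the mixture. One
# re-weighting step on a toy linear model.
import torch

torch.manual_seed(0)
d, n_domains = 16, 4
w = torch.zeros(d, requires_grad=True)              # toy model parameters

def grad_on(X, y):
    loss = ((X @ w - y) ** 2).mean()
    g, = torch.autograd.grad(loss, w)
    return g

# Toy data: a few target examples and one batch per domain.
X_target = torch.randn(8, d);  y_target = torch.randn(8)
domains = [(torch.randn(64, d), torch.randn(64)) for _ in range(n_domains)]

g_target = grad_on(X_target, y_target)
alignments = torch.stack([torch.dot(grad_on(Xd, yd), g_target) for Xd, yd in domains])

# Softmax over (scaled) alignments turns scores into a sampling mixture.
mix = torch.softmax(alignments / alignments.abs().max(), dim=0)
print("domain weights:", mix.tolist())
```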

NLP-powered automatic question generation (QG) techniques carry great pedagogical potential for saving educators' time and benefiting student learning. Yet, QG systems have not been widely adopted in classrooms to date. In this work, we aim to pinpoint the key impediments and investigate how to improve the usability of QG for educational purposes by understanding how instructors construct questions and identifying touch points at which to enhance the underlying NLP models. We perform an in-depth need-finding study with 11 instructors across 7...

10.48550/arxiv.2205.00355 preprint EN other-oa arXiv (Cornell University) 2022-01-01

The coverage and composition of the pretraining data significantly impact the generalization ability of Large Language Models (LLMs). Despite its importance, recent LLMs still rely on heuristics and trial and error to increase or reduce the influence of data-domains. We propose DOmain reweighting with Generalization Estimation (DoGE), which optimizes the probability of sampling from each domain (the domain weights) in a principled way. Our approach is a two-stage process consisting of (i) training a proxy model to obtain domain weights...

10.48550/arxiv.2310.15393 preprint EN other-oa arXiv (Cornell University) 2023-01-01
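A toy two-stage sketch of the DoGE idea under simplifying assumptions (linear proxy, round-robin domain updates): stage (i) accumulates each domain's gradient alignment with a held-out generalization set into domain weights, which stage (ii) would then use to sample pretraining data for the full-size model.

```python
# Stage (i): train a small proxy model and credit each domain by how much its
# gradient aligns with the held-out objective; the resulting weights would
# drive data sampling for the full model in stage (ii). Toy linear proxy only.
import torch

torch.manual_seed(0)
d, n_domains, steps = 16, 4, 200
theta = torch.zeros(d, requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.05)

domains = [(torch.randn(256, d), torch.randn(256)) for _ in range(n_domains)]
X_val, y_val = torch.randn(64, d), torch.randn(64)      # held-out generalization set

scores = torch.zeros(n_domains)
for step in range(steps):
    k = step % n_domains
    Xd, yd = domains[k]
    loss = ((Xd @ theta - yd) ** 2).mean()
    g_dom, = torch.autograd.grad(loss, theta)

    val_loss = ((X_val @ theta - y_val) ** 2).mean()
    g_val, = torch.autograd.grad(val_loss, theta)
    scores[k] += torch.dot(g_dom, g_val).item()          # credit for reducing held-out loss

    opt.zero_grad(); theta.grad = g_dom; opt.step()      # proxy update on this domain

domain_weights = torch.softmax(scores / steps, dim=0)
print("stage-(i) domain weights for full-model sampling:", domain_weights.tolist())
```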

Automatic data selection and curriculum design for training large language models is challenging, with only a few existing methods showing improvements over standard training. Furthermore, current schemes focus on domain-level selection, overlooking the more fine-grained contributions of each individual data point. It is difficult to apply traditional datapoint selection methods to large language models: most online batch selection methods perform forward or backward passes twice, which introduces considerable extra costs for large-scale models. To...

10.48550/arxiv.2310.15389 preprint EN other-oa arXiv (Cornell University) 2023-01-01
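For contrast with the overhead the abstract describes, a generic online batch-selection baseline (not the paper's method): an extra scoring pass ranks candidate examples by loss and keeps only the hardest for the update.

```python
# Generic online batch selection: score a large candidate batch by per-example
# loss and keep only the top fraction for the gradient step. The extra scoring
# pass is exactly the kind of overhead that becomes costly at LLM scale.
import torch, torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(32, 8)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(reduction="none")      # keep per-example losses

def selected_step(X, y, keep_ratio=0.25):
    with torch.no_grad():                             # extra forward pass for scoring
        scores = loss_fn(model(X), y)
    k = max(1, int(keep_ratio * len(X)))
    idx = scores.topk(k).indices                      # hardest examples in the candidate batch
    loss = loss_fn(model(X[idx]), y[idx]).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

X = torch.randn(512, 32)
y = torch.randint(0, 8, (512,))
print("loss on selected sub-batch:", selected_step(X, y))
```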