- Topic Modeling
- Natural Language Processing Techniques
- Online Learning and Analytics
- Multimodal Machine Learning Applications
- Advanced Graph Neural Networks
- Data Quality and Management
- Text Readability and Simplification
- Intelligent Tutoring Systems and Adaptive Learning
- Semantic Web and Ontologies
- Software Engineering Research
- Speech and Dialogue Systems
- Human Pose and Action Recognition
- Advanced Text Analysis Techniques
- Video Analysis and Summarization
- Educational Technology and Assessment
- Text and Document Classification Technologies
- Explainable Artificial Intelligence (XAI)
- Advanced Computational Techniques and Applications
- Machine Learning and Data Classification
- Data Stream Mining Techniques
- Business Process Modeling and Analysis
- Software System Performance and Reliability
- Data Mining Algorithms and Applications
- Computational and Text Analysis Methods
- Biomedical Text Mining and Ontologies
Tsinghua University
2018-2025
Beijing Academy of Artificial Intelligence
2019-2023
Beihang University
2023
Renmin University of China
2022
Tencent (China)
2021
Jifan Yu, Gan Luo, Tong Xiao, Qingyang Zhong, Yuquan Wang, Wenzheng Feng, Junyi Luo, Chenyu Wang, Lei Hou, Juanzi Li, Zhiyuan Liu, Jie Tang. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
Recent works on knowledge base question answering (KBQA) retrieve subgraphs for easier reasoning. The desired subgraph is crucial, as a small one may exclude the answer while a large one might introduce more noise. However, existing retrieval is either heuristic or interwoven with the reasoning, causing reasoning over partial subgraphs, which increases reasoning bias when intermediate supervision is missing. This paper proposes a trainable subgraph retriever (SR) decoupled from the subsequent reasoning process, which enables a plug-and-play framework to...
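To make the decoupled-retrieval idea concrete, here is a minimal sketch of greedy top-k subgraph expansion with a trained relation scorer. This is not the paper's actual SR model: `score` stands in for a hypothetical learned question-relation relevance function, and `graph` and `seeds` are assumed inputs. Keeping only the k best edges per hop is the size-versus-noise trade-off the abstract describes.

```python
def expand_subgraph(question_vec, graph, seeds, score, k=5, hops=2):
    """Greedy top-k expansion sketch for decoupled subgraph retrieval.

    graph: dict mapping an entity to its outgoing (relation, tail) edges.
    score: hypothetical learned relevance of a relation to the question.
    """
    subgraph, frontier = [], list(seeds)
    for _ in range(hops):
        # Gather all candidate edges reachable from the current frontier.
        edges = [(h, r, t) for h in frontier for r, t in graph.get(h, [])]
        # Keep only the k edges whose relations best match the question.
        edges.sort(key=lambda e: score(question_vec, e[1]), reverse=True)
        kept = edges[:k]
        subgraph.extend(kept)
        frontier = [t for _, _, t in kept]
    return subgraph
```

A larger k lowers the risk of excluding the answer but admits more noisy edges, which is exactly the tension the paper's trainable retriever is meant to manage.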
Shulin Cao, Jiaxin Shi, Zijun Yao, Xin Lv, Jifan Yu, Lei Hou, Juanzi Li, Zhiyuan Liu, Jinghui Xiao. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
Current large language models (LLMs) often struggle to produce accurate responses on the first attempt for complex reasoning tasks like code generation. Prior research tackles this challenge by generating multiple candidate solutions and validating them with LLM-generated unit tests. The execution results of the tests serve as reward signals to identify correct solutions. As LLMs always confidently make mistakes, these tests are not reliable, thereby diminishing the quality of the reward signals. Motivated by the observation...
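The validation scheme the abstract critiques can be sketched in a few lines. This is a toy illustration, not the paper's method: `candidates` and `tests` are assumed to be strings of LLM-generated Python code and assert statements, and a real system would sandbox the `exec` calls rather than run untrusted code directly.

```python
def rank_by_generated_tests(candidates, tests):
    """Rank candidate programs by how many generated unit tests they pass.

    The pass count is the (noisy) reward signal: if the LLM-written tests
    are themselves wrong, a correct candidate can be ranked below a
    confidently wrong one -- the unreliability the abstract points out.
    """
    def passes(code, test):
        env = {}
        try:
            exec(code, env)   # define the candidate solution
            exec(test, env)   # run one generated assertion against it
            return True
        except Exception:
            return False

    return sorted(candidates,
                  key=lambda c: sum(passes(c, t) for t in tests),
                  reverse=True)
```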
Current inference scaling methods, such as Self-consistency and Best-of-N, have proven effective in improving the accuracy of LLMs on complex reasoning tasks. However, these methods rely heavily on the quality of candidate responses and are unable to produce correct answers when all candidates are incorrect. In this paper, we propose a novel inference scaling strategy, CoT-based Synthesizer, which leverages CoT reasoning to synthesize superior answers by analyzing complementary information from multiple candidate responses, even when all of them are flawed. To enable...
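The Self-consistency baseline named in the abstract reduces to majority voting over sampled answers; a minimal sketch makes its failure mode concrete. The sample answers below are hypothetical.

```python
from collections import Counter

def self_consistency(sampled_answers):
    """Majority vote over final answers from several sampled CoT runs.

    If no sampled answer is correct, voting cannot recover the right
    one -- the gap the proposed CoT-based Synthesizer targets by
    recombining complementary information across flawed candidates.
    """
    return Counter(sampled_answers).most_common(1)[0][0]

# Hypothetical run: five samples, three agree.
print(self_consistency(["42", "41", "42", "42", "13"]))  # -> "42"
```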
Metacognitive education plays a crucial role in cultivating students' self-regulation and reflective thinking, providing essential support for those with learning difficulties through academic advising. Simulating students with insufficient learning capabilities using large language models offers a promising approach to refining pedagogical methods without ethical concerns. However, existing simulations often fail to authentically represent students' learning struggles and face challenges in evaluation due to the lack of reliable metrics...
Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and a corresponding output ranging from 0 to 10,000 words. Moreover, to achieve outputs that maintain...
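As a rough illustration, one record of such an SFT dataset might serialize like the dictionary below. The field names are guesses; only the ingredients named in the abstract (multiple input images, an instruction, a long output) come from the source.

```python
# Hypothetical shape of a single long-output SFT example.
example = {
    "images": ["figure_1.png", "figure_2.png"],  # multiple input images
    "instruction": "Write a detailed, coherent report on these figures.",
    "output": "<long-form answer, anywhere from 0 to 10,000 words>",
}
```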
Self-supervised entity alignment (EA) aims to link equivalent entities across different knowledge graphs (KGs) without the use of pre-aligned pairs. The current state-of-the-art (SOTA) self-supervised EA approach draws inspiration from contrastive learning, originally designed in computer vision based on instance discrimination and contrastive loss, and suffers from two shortcomings. Firstly, it puts unidirectional emphasis on pushing sampled negative entities far away rather than pulling positively aligned pairs close, as...
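For readers unfamiliar with the contrastive loss in question, here is a minimal InfoNCE-style loss for one anchor, written with plain Python lists. This is a generic textbook formulation, not the paper's model: the gradient through the denominator pushes negatives away, while only the numerator pulls the aligned (positive) pair together, which is the asymmetry the abstract's first criticism concerns.

```python
import math

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE loss for one anchor embedding (toy, list-of-floats version)."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    pos = math.exp(dot(anchor, positive) / tau)          # pull pair close
    neg = sum(math.exp(dot(anchor, n) / tau) for n in negatives)  # push apart
    return -math.log(pos / (pos + neg))

# Hypothetical toy embeddings.
print(info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]]))
```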
We present GLM-Dialog, a large-scale language model (LLM) with 10B parameters capable of knowledge-grounded conversation in Chinese, using a search engine to access Internet knowledge. GLM-Dialog offers a series of applicable techniques for exploiting various external knowledge, including both helpful and noisy knowledge, enabling the creation of robust dialogue LLMs with limited proper datasets. To evaluate GLM-Dialog more fairly, we also propose a novel evaluation method that allows humans to converse with multiple deployed bots...
The prosperity of massive open online courses provides fodder for plentiful research efforts on adaptive learning. However, current open-access educational datasets are still far from sufficient to meet the research needs on various topics. Existing released datasets often cover only small-scale data and lack fine-grained knowledge concepts; they are even difficult to curate and supplement due to platform limitations. In this work, we construct MOOCCubeX, a large, knowledge-centered repository consisting of 4,216 courses, 230,263...
Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model's ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets; however, we see two main issues within previous benchmarking pipelines, namely testing leakage and evaluation automation. In this paper, we propose a novel benchmarking framework, Language-Model-as-an-Examiner, where the LM serves as a knowledgeable...
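The examiner loop can be sketched as below. `examiner` and `candidate` are hypothetical callables wrapping LLM APIs; the framework's actual prompts, follow-up questioning, and scoring rubric are not reproduced here.

```python
def lm_as_examiner(examiner, candidate, topics, n_questions=3):
    """Sketch of an LM-as-examiner evaluation loop.

    The examiner LM drafts questions on topics it knows well (mitigating
    testing leakage from fixed datasets), the candidate model answers,
    and the examiner grades each answer (automating evaluation).
    """
    results = []
    for topic in topics:
        for _ in range(n_questions):
            q = examiner(f"Ask one factual question about: {topic}")
            a = candidate(q)
            grade = examiner(f"Question: {q}\nAnswer: {a}\n"
                             "Score this answer from 0 to 10.")
            results.append((q, a, grade))
    return results
```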
The unprecedented performance of large language models (LLMs) necessitates improvements in evaluations. Rather than merely exploring the breadth of LLM abilities, we believe meticulous and thoughtful designs are essential to thorough, unbiased, and applicable evaluations. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related...
Despite the recent emergence of video captioning models, how to generate vivid, fine-grained descriptions based on background knowledge (i.e., long and informative commentary about domain-specific scenes with appropriate reasoning) is still far from being solved, which, however, has great applications such as automatic sports narrative. Based on soccer game videos and synchronized commentary data, we present GOAL, a benchmark of over 8.9k clips, 22k sentences, and 42k knowledge triples, proposing a challenging new task setting...
We introduce TableLLM, a robust large language model (LLM) with 13 billion parameters, purpose-built for proficiently handling tabular data manipulation tasks, whether they are embedded within documents or spreadsheets, catering to real-world office scenarios. We propose a distant supervision method for training, which comprises a reasoning process extension strategy, aiding in training LLMs to understand reasoning patterns more effectively, as well as cross-way validation, ensuring the quality of the automatically...
While there is abundant research on evaluating ChatGPT on natural language understanding and generation tasks, few studies have investigated how ChatGPT's behavior changes over time. In this paper, we collect a coarse-to-fine temporal dataset called ChatLog, consisting of two parts that update monthly and daily: ChatLog-Monthly is a dataset of 38,730 question-answer pairs collected every month, including questions from both reasoning and classification tasks. ChatLog-Daily, on the other hand, consists of...
As Massive Open Online Courses (MOOCs) become increasingly popular, it is promising to automatically provide extracurricular knowledge for MOOC users. Suffering from semantic drifts and a lack of guidance, existing methods cannot effectively expand course concepts in complex environments. In this paper, we first build a novel boundary during the search for new concepts via an external knowledge base, and then utilize heterogeneous features to verify the high-quality results. In addition, to involve human efforts in our model, we design an...
Large-scale pre-trained language models (PLMs) have shown promising advances on various downstream tasks, among which dialogue is one of the most concerned. However, there remain challenges for individual developers to create a knowledge-grounded dialogue system upon such big models, because of the expensive cost of collecting the knowledge resources needed to support it, as well as tuning these large models for the task. To tackle these obstacles, we propose XDAI, a system that is equipped with prompt-aware tuning-free PLM exploitation and supported by ready-to-use...
Student modeling, the task of inferring a student's learning characteristics through their interactions with coursework, is a fundamental issue in intelligent education. Although recent attempts from knowledge tracing and cognitive diagnosis propose several promising directions for improving the usability and effectiveness of current models, existing public datasets are still insufficient to meet the need for these potential solutions due to their ignorance of complete exercising contexts, fine-grained concepts, and cognitive labels. In...
Teaching assistants have played essential roles in the long history of education. However, few MOOC platforms provide human or virtual teaching assistants to support learning for massive online students, due to the complexity of real-world education scenarios and the lack of training data. In this paper, we present a virtual teaching assistant, LittleMu, which requires minimal labeled data and provides question answering and chit-chat services. Consisting of two interactive modules, heterogeneous retrieval and language model prompting, LittleMu first integrates...
Recent advancements in pretraining have demonstrated that modern Large Language Models (LLMs) possess the capability to effectively learn arithmetic operations. However, despite acknowledging the significance of digit order in arithmetic computation, current methodologies predominantly rely on sequential, step-by-step approaches for teaching LLMs arithmetic, resulting in a conclusion where obtaining better performance involves fine-grained step-by-step generation. Diverging from this conventional path, our work introduces...
Applying large language models (LLMs) for academic API usage shows promise in reducing researchers' information-seeking efforts. However, current LLM API-using methods struggle with the complex API coupling commonly encountered in academic queries. To address this, we introduce SoAy, a solution-based methodology for academic information seeking. It uses code with a solution as the reasoning method, where a solution is a pre-constructed API calling sequence. The addition of the solution reduces the difficulty for the model to understand the complex relationships between APIs. Code improves...
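A hypothetical "solution" in the SoAy sense might look like the fixed call chain below, which the LLM would select and fill in rather than planning raw API calls itself. The function names are illustrative stand-ins, not the real academic-API schema.

```python
# One pre-constructed API calling sequence for a coupled academic query.
def papers_of_coauthors(search_person, get_coauthors, get_papers, name):
    """Answer 'what have X's co-authors published?' via a fixed chain."""
    person = search_person(name)             # 1. resolve the researcher
    coauthors = get_coauthors(person["id"])  # 2. traverse the coupling
    return [paper                            # 3. aggregate their papers
            for c in coauthors
            for paper in get_papers(c["id"])]
```

Encoding the inter-API dependency (person -> co-authors -> papers) in code up front is what spares the model from having to rediscover it at query time.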
Large language models (LLMs) have been employed in various intelligent educational tasks to assist teaching. While preliminary explorations have focused on independent LLM-empowered agents for specific tasks, the potential of LLMs within a multi-agent collaborative framework to simulate a classroom with real user participation remains unexplored. In this work, we propose SimClass, a multi-agent classroom simulation framework involving user participation. We recognize representative class roles and introduce a novel class control mechanism for automatic...