- Topic Modeling
- Artificial Intelligence in Law
- Legal Education and Practice Innovations
- Natural Language Processing Techniques
- Domain Adaptation and Few-Shot Learning
- Law, Economics, and Judicial Systems
- Adversarial Robustness in Machine Learning
- Comparative and International Law Studies
Stanford University
2021-2025
While self-supervised learning has made rapid advances in natural language processing, it remains unclear when researchers should engage in resource-intensive domain-specific pretraining (domain pretraining). The law, puzzlingly, has yielded few documented instances of substantial gains to domain pretraining in spite of the fact that legal language is widely seen to be unique. We hypothesize that these existing results stem from the fact that existing legal NLP tasks are too easy and fail to meet conditions for when domain pretraining can help. To address this, we first present CaseHOLD...
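As a rough illustration of the kind of task CaseHOLD poses (choosing the correct holding statement for a cited case), the sketch below scores candidate holdings with a standard multiple-choice head from the Hugging Face transformers library. The checkpoint, citing context, and candidate holdings are placeholders, not the paper's exact data or setup, and the untrained head will predict essentially at random until fine-tuned.

```python
# Minimal sketch: scoring holding candidates as a multiple-choice problem.
# Checkpoint and example text are placeholders, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

checkpoint = "bert-base-uncased"  # assumption: any BERT-style encoder works here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMultipleChoice.from_pretrained(checkpoint)

citing_context = "The court applied the holding of Smith v. Jones, <HOLDING>, to the facts."
candidate_holdings = [
    "holding that contracts require mutual assent",
    "holding that negligence requires a duty of care",
    "holding that hearsay is generally inadmissible",
]

# Pair the citing context with each candidate and batch them as one choice set.
encoded = tokenizer(
    [citing_context] * len(candidate_holdings),
    candidate_holdings,
    truncation=True,
    padding=True,
    return_tensors="pt",
)
inputs = {k: v.unsqueeze(0) for k, v in encoded.items()}  # shape: (1, num_choices, seq_len)

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_choices)
print("Predicted holding index:", logits.argmax(dim=-1).item())
```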
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English...
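The core idea of taxonomizing scenarios and metrics and then evaluating a model on the resulting grid can be sketched in a few lines. The scenario names, metric functions, and model stub below are illustrative placeholders, not HELM's actual taxonomy or implementation.

```python
# Illustrative sketch of a scenario x metric evaluation grid (not HELM's actual code).
from typing import Callable, Dict, List, Tuple

# Placeholder scenarios: (prompt, reference answer) pairs grouped by use case.
scenarios: Dict[str, List[Tuple[str, str]]] = {
    "question_answering": [("What is the capital of France?", "Paris")],
    "summarization": [("Summarize: The cat sat on the mat.", "A cat sat on a mat.")],
}

# Placeholder metrics mapping (prediction, reference) -> score in [0, 1].
metrics: Dict[str, Callable[[str, str], float]] = {
    "exact_match": lambda pred, ref: float(pred.strip() == ref.strip()),
    "length_ratio": lambda pred, ref: min(len(pred), len(ref)) / max(len(pred), len(ref), 1),
}

def dummy_model(prompt: str) -> str:
    """Stand-in for a real LM call."""
    return "Paris" if "France" in prompt else "A cat sat on a mat."

# Evaluate the model on every (scenario, metric) cell and report averages.
for scenario_name, examples in scenarios.items():
    for metric_name, metric_fn in metrics.items():
        scores = [metric_fn(dummy_model(prompt), ref) for prompt, ref in examples]
        avg = sum(scores) / len(scores)
        print(f"{scenario_name:>20s} | {metric_name:<12s} | {avg:.2f}")
```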
One concern with the rise of large language models lies with their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and failed to take context into account. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material. First, we gather and make available the Pile of Law, a 256GB (and growing) dataset of open-source...
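Since the corpus is far too large to download casually, a streaming pass over one subcorpus is the natural way to inspect it; a minimal sketch follows. The repository id, subset name, and record fields are assumptions about how the dataset is laid out on the Hugging Face Hub and may differ from what is actually published.

```python
# Sketch: streaming a small slice of one Pile of Law subcorpus without a full download.
# Repository id, subset name, and the "text" field are assumptions about the layout.
from datasets import load_dataset

stream = load_dataset(
    "pile-of-law/pile-of-law",   # assumed Hugging Face repo id
    "r_legaladvice",             # assumed subset name (one of many subcorpora)
    split="train",
    streaming=True,              # iterate lazily instead of materializing 256GB
)

for i, record in enumerate(stream):
    # Each record is expected to carry raw document text plus metadata.
    print(record.get("text", "")[:200])
    if i >= 2:
        break
```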
The use of words to convey a speaker's intent is traditionally distinguished from the `mention' of words for quoting what someone said, or pointing out properties of a word. Here we show that computationally modeling this use-mention distinction is crucial for dealing with counterspeech online. Counterspeech that refutes problematic content often mentions harmful language but is not harmful itself (e.g., calling a vaccine dangerous is not the same as expressing disapproval of someone for calling vaccines dangerous). We show that even recent language models fail at distinguishing...
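One simple way to probe whether an off-the-shelf model separates use from mention is to frame it as classification over sentences that contain the same harmful claim. The example sentences and label phrasing below are illustrative only and are not the paper's evaluation setup.

```python
# Sketch: probing the use-mention distinction with zero-shot classification.
# Example sentences and label wording are illustrative, not the paper's setup.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sentences = [
    "Vaccines are dangerous and you should avoid them.",           # "use": endorses the claim
    "It is harmful when people say that vaccines are dangerous.",  # "mention": counterspeech
]
labels = [
    "the speaker endorses the harmful claim",
    "the speaker quotes or rejects the harmful claim",
]

for sentence in sentences:
    result = classifier(sentence, candidate_labels=labels)
    print(sentence, "->", result["labels"][0])
```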
Instruction tuning is an important step in making language models useful for direct user interaction. However, many legal tasks remain out of reach for most open LLMs and there do not yet exist any large scale instruction datasets for the domain. This critically limits research in this application area. In this work, we curate LawInstruct, a large legal instruction dataset covering 17 jurisdictions, 24 languages and a total of 12M examples. We present evidence that domain-specific pretraining and instruction tuning improve performance on LegalBench, including...
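At its core, instruction tuning on such a dataset amounts to formatting (instruction, response) pairs into prompts and continuing causal-LM training on them. The sketch below shows one gradient step with a small placeholder checkpoint and two made-up examples; it is not the LawInstruct data, prompt template, or training recipe.

```python
# Sketch: one gradient step of instruction tuning on (instruction, response) pairs.
# Checkpoint, prompt template, and the two examples are placeholders, not LawInstruct.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "gpt2"  # assumption: any causal LM checkpoint works for the sketch
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(checkpoint)

examples = [
    {"instruction": "Summarize the holding of the following opinion.",
     "response": "The court held that the contract was unenforceable."},
    {"instruction": "Is this clause a limitation of liability? Answer yes or no.",
     "response": "Yes."},
]

# Concatenate instruction and response into a single training string per example.
texts = [
    f"### Instruction:\n{ex['instruction']}\n### Response:\n{ex['response']}"
    for ex in examples
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Standard causal-LM objective: labels are the input ids, with padding masked out.
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
print("instruction-tuning loss:", loss.item())
```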