Making Language Models Better Reasoners with Step-Aware Verifier
DOI: 10.18653/v1/2023.acl-long.291
Publication Date: 2023-08-05
AUTHORS (7)
ABSTRACT
Few-shot learning is a challenging task that requires language models to generalize from limited examples. Large language models like GPT-3 and PaLM have made impressive progress in this area, but they still face difficulties in reasoning tasks such as GSM8K, a benchmark for arithmetic problems. To improve their reasoning skills, previous work has proposed guiding the language model with prompts that elicit a series of reasoning steps before giving the final answer, achieving a significant improvement on GSM8K from 17.9% to 58.1% in problem-solving rate. In this paper, we present DIVERSE (Diverse Verifier on Reasoning Step), a novel approach that further enhances the reasoning capability of language models. DIVERSE has three main components: first, it generates diverse prompts to explore different reasoning paths for the same question; second, it uses a verifier to filter out incorrect answers based on a weighted voting scheme; and third, it verifies each reasoning step individually instead of the whole chain. We evaluate DIVERSE on the latest language model code-davinci-002 and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks (e.g., GSM8K 74.4% → 83.2%).
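The verifier-weighted voting the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `weighted_vote` helper and the example verifier scores are assumptions, standing in for the scores a trained verifier would assign to each sampled reasoning path.

```python
from collections import defaultdict

def weighted_vote(candidates):
    """Pick the final answer by summing verifier scores per distinct answer,
    rather than counting each sampled path equally (plain majority vote)."""
    scores = defaultdict(float)
    for answer, score in candidates:
        scores[answer] += score
    return max(scores, key=scores.get)

# Five hypothetical reasoning paths sampled for one question,
# each paired with an illustrative verifier score.
paths = [("18", 0.9), ("18", 0.7), ("20", 0.8), ("18", 0.2), ("20", 0.3)]
print(weighted_vote(paths))  # "18" wins: 1.8 total vs 1.1 for "20"
```

Note that under an unweighted majority vote the answer "18" would also win here (3 paths vs 2), but the weighting lets a few high-confidence paths outvote many low-confidence ones when the counts disagree.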