Improving Math Problem-Solving in Language Models with a Self-Critique Pipeline

Our self-critique pipeline enhances both the language and math abilities of large language models (LLMs) simultaneously, achieving state-of-the-art performance on the MathUserEval benchmark. We introduce the math critique model, which assesses mathematical responses against reference answers and assigns scores between 1 and 10. The direct preference optimization (DPO) stage further improves the model by directly comparing correct and incorrect answers to the same question, sampled from the model itself. We collected data from various sources, including public data sets and middle school and university exam questions. Limitations include difficulty with questions involving drawings and precision calculations, which we aim to address in future work.

Introduction

Large language models (LLMs) have gained popularity in recent years due to their ability to perform language-related tasks. However, improving their mathematical problem-solving abilities has been a challenge. In this article, we propose the self-critique pipeline, which enhances LLMs’ mathematical and linguistic abilities simultaneously. We introduce a math critique model for feedback during the alignment process through stages like rejective fine-tuning (RFT) and direct preference optimization (DPO). We also analyze the factors influencing the enhancement of a model’s mathematical abilities and provide suggestions for future development directions.

Key Takeaways
– The self-critique pipeline enhances LLMs’ mathematical and linguistic abilities.
– Math critique model provides feedback on mathematical accuracy.
– Rejective fine-tuning and direct preference optimization are two stages of the self-critique pipeline.
– The MathUserEval benchmark evaluates LLMs on complex real-world mathematical problems.

Math Critique: A General Critic for Math

Math critique is a model that evaluates a mathematical response against the question and a reference answer, producing a score between 1 and 10 together with an explanatory analysis. It classifies responses into four categories: entirely incorrect, partially correct but with errors in the process, mostly correct but with some flaws in the process, and completely correct, corresponding to score ranges of 1 to 2, 3 to 5, 6 to 8, and 9 to 10, respectively. We use two evaluation methods with math critique: average score evaluation and hard split evaluation.
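The four categories and their score ranges can be sketched as a simple mapping; the `hard_split` pass threshold below is an assumption for illustration, not a value stated in the text.

```python
def critique_category(score: int) -> str:
    """Map a 1-10 math critique score to its response category."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    if score <= 2:
        return "entirely incorrect"
    if score <= 5:
        return "partially correct but with errors in the process"
    if score <= 8:
        return "mostly correct but with some flaws in the process"
    return "completely correct"

def hard_split(score: int, threshold: int = 6) -> bool:
    """Hard split evaluation: treat a response as a pass above a cutoff.

    The threshold of 6 (i.e., "mostly correct" or better) is a
    hypothetical choice for this sketch.
    """
    return score >= threshold
```

Average score evaluation would then simply average the raw 1-10 scores, while hard split evaluation reports the fraction of responses passing the cutoff.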

Quote
"Math critique enhances accuracy and interpretability."

Stage Two: Direct Preference Optimization

Direct preference optimization (DPO) is a method for improving LLM performance. It is attractive for its simple data flow and its stability and speed during training. DPO directly compares correct and incorrect answers to the same question, sampled from the model after RFT. During training, we include the SFT loss on the DPO positive examples as an approximate regularization term. Our DPO data filtering process is similar to that of critique RFT; the main difference is how the DPO training pairs are constructed.
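A minimal sketch of the objective described above: the standard DPO loss on a (chosen, rejected) pair plus an SFT (negative log-likelihood) term on the positive example. The hyperparameters `beta` and `sft_coef` are hypothetical, and the inputs are sequence-level log-probabilities rather than a full training loop.

```python
import math

def dpo_loss_with_sft(logp_w: float, logp_l: float,
                      ref_logp_w: float, ref_logp_l: float,
                      beta: float = 0.1, sft_coef: float = 1.0) -> float:
    """DPO loss with an added SFT term on the positive example.

    logp_w / logp_l: log-probabilities of the chosen (correct) and
    rejected (incorrect) answers under the policy being trained.
    ref_logp_w / ref_logp_l: the same quantities under the frozen
    reference model. beta and sft_coef are assumed hyperparameters.
    """
    # Implicit reward margin between chosen and rejected answers.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): the standard DPO preference loss.
    dpo = -math.log(1.0 / (1.0 + math.exp(-margin)))
    # NLL of the preferred answer, used as an approximate regularizer.
    sft = -logp_w
    return dpo + sft_coef * sft
```

When the policy matches the reference, the margin is zero and the preference term reduces to log 2, so training pressure comes entirely from widening the chosen/rejected gap plus the SFT anchor.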

Table: DPO Data Filtering Process
– Sampling: select the pair with the largest difference in math critique scores between a correct and an incorrect answer.
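The pair-selection rule above can be sketched as follows. The `pass_score` cutoff separating correct from incorrect answers is an assumption for this illustration, as is the data layout (a mapping from each question to its sampled answers with critique scores).

```python
def build_dpo_pairs(samples: dict, pass_score: int = 6) -> list:
    """For each question, pick the (correct, incorrect) answer pair with
    the largest gap in math critique scores.

    samples: {question: [(answer_text, critique_score), ...]}
    pass_score: assumed cutoff separating correct from incorrect answers.
    Returns a list of (question, chosen_answer, rejected_answer) triples.
    """
    pairs = []
    for question, answers in samples.items():
        correct = [a for a in answers if a[1] >= pass_score]
        incorrect = [a for a in answers if a[1] < pass_score]
        if not correct or not incorrect:
            continue  # no usable (correct, incorrect) pair for this question
        # Maximize the score gap: highest-scored correct vs lowest-scored incorrect.
        best_pos = max(correct, key=lambda a: a[1])
        worst_neg = min(incorrect, key=lambda a: a[1])
        pairs.append((question, best_pos[0], worst_neg[0]))
    return pairs
```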

Data Collection

We gathered data from various sources, including public data sets and middle school and university exam questions. For English data, we used prompts from the GSM8K and MATH training sets along with their corresponding responses. We also used publicly available middle school and university exam questions. For evaluation, we performed greedy inference once on all data sets and report results across academic data sets and benchmarks. Our model outperformed others on various academic data sets and demonstrated strong mathematical reasoning and cross-linguistic generalization.

Data Sources
– Public data sets (GSM8K and MATH)
– Middle school and university exam questions
– MathUserEval data set
– Chinese academic data sets (Ape210K and CMATH)
– Hungarian national exam

Limitations and Future Work

One limitation of our model is its difficulty with questions involving drawings and precision calculations. To address this, we plan to explore integrating multimodal input and output components, as well as external tools for precise calculation.
