Boost LLM Math Skills with Self-Evaluation in ChatGLM-Math

The math critique model improves the evaluation accuracy of mathematical responses by integrating reference answers and defining scoring rules and intervals, which lets it make more precise judgments than traditional reward models. On the MathUserEval test set, the model scored 4.23, ahead of the baselines and other existing models. Its main limitations are questions that require graphic thinking and complex computations that demand precise calculation; these could be addressed with multimodal input and output components and with external tools or code interpreters, respectively. #MathCritiqueModel

Key Takeaways 🚀

  • The math critique model enhances the evaluation accuracy of mathematical responses by integrating reference answers and defining scoring rules and intervals.
  • The model offers more precise judgments compared to traditional reward models by incorporating reference answers.
  • The self-critique pipeline utilizing rejection sampling and direct preference optimization methods enhances the model’s capabilities and accuracy in evaluating mathematical responses.
  • The model outperforms competitors on several datasets, and the math critique model's judgments, measured against human annotations, are more accurate than those of models like GPT-3.5-Turbo.
  • Exploring multimodal input and output components for graphic thinking abilities and utilizing external tools or code interpreters for precise calculations can further enhance the math critique model’s performance.

Introduction 📝

In this paper, we present the math critique model, a novel approach that enhances the evaluation accuracy of mathematical responses by integrating reference answers and defining scoring rules and intervals. By incorporating reference answers, our model offers more precise judgments compared to traditional reward models, ultimately improving the evaluation process.

The Math Critique Model 🧮

The math critique model takes a question, a reference answer, and a model answer as input, and returns a critique and a score based on the correctness of both the result and the process. To develop and train it, we followed a systematic approach: we filtered a dataset from the training data, annotated 10K entries using CritiqueLLM and ORM, and divided the data into a training set of 5K entries and a test set of 800 entries.
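As a rough illustration of that interface, the sketch below packs the three inputs into a grading prompt and pulls a numeric score out of the returned critique text. The prompt wording, the 1-10 scale, the `Score: N` output convention, and every name used here (`CritiqueInput`, `build_critique_prompt`, `parse_score`) are assumptions made for the example, not the template or scoring intervals actually used.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CritiqueInput:
    """The three inputs the critique model receives."""
    question: str
    reference_answer: str
    model_answer: str


def build_critique_prompt(item: CritiqueInput) -> str:
    """Assemble a grading prompt (hypothetical wording and scale)."""
    return (
        "Grade the model answer against the reference answer.\n"
        "Judge both the final result and the reasoning process.\n"
        "End with a line of the form 'Score: N', where N is 1-10.\n\n"
        f"Question: {item.question}\n"
        f"Reference answer: {item.reference_answer}\n"
        f"Model answer: {item.model_answer}\n"
    )


def parse_score(critique: str) -> Optional[int]:
    """Return the last 'Score: N' value, or None if missing or outside 1-10."""
    matches = re.findall(r"Score:\s*(\d+)", critique)
    if not matches:
        return None
    score = int(matches[-1])
    return score if 1 <= score <= 10 else None
```

Returning `None` for a missing or out-of-range score keeps malformed critiques from silently entering later filtering steps.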

The Self-Critique Pipeline 🔍

We then implemented the self-critique pipeline, which uses rejection sampling and direct preference optimization to refine the model iteratively. This process allowed us to enhance the model’s capabilities and accuracy in evaluating mathematical responses. Through an ablation study, we also analyzed the impact of different data compositions and boosting methods on the model’s performance. We found that integrating real-life scenario data and math-specific DPO data significantly enhanced the model’s mathematical capabilities, particularly on medium-difficulty questions.
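The rejection sampling step can be pictured roughly as follows: draw several candidate answers per question, score each with the critique model, and keep the best one only if it clears a quality bar. This is a minimal sketch; `generate` and `critique_score` are hypothetical callables standing in for the policy model and the math critique model, and the defaults for `num_samples` and `min_score` are illustrative, not values from the source.

```python
from typing import Callable, List, Optional, Tuple


def rejection_sample(
    question: str,
    reference: str,
    generate: Callable[[str], str],
    critique_score: Callable[[str, str, str], float],
    num_samples: int = 8,
    min_score: float = 8.0,
) -> Optional[Tuple[str, float]]:
    """Sample answers and return the best (answer, score), or None if none is good enough."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    scored: List[Tuple[str, float]] = []
    for _ in range(num_samples):
        answer = generate(question)
        scored.append((answer, critique_score(question, reference, answer)))
    # Keep only the highest-scoring candidate, and only if it clears the bar
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= min_score else None
```

Questions for which the function returns `None` contribute no fine-tuning example in this sketch; they could instead be resampled or set aside.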

Direct Preference Optimization (DPO) Training 🎓

Subsequently, we used the math critique model to sample contrast data and proceeded with direct preference optimization (DPO) training. We introduced a cross-entropy loss as a regularization term in DPO training, selected pairs with significant differences in their math critique scores, and trained the model for 500 steps with a batch size of 64. On the MathUserEval test set, the model achieved a score of 4.23, ahead of the baselines and other existing models, and it also outperformed competitors on various other datasets.
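To make the two ingredients concrete, here is a small sketch of pair selection by score gap and of a per-pair DPO loss with an added cross-entropy term on the preferred answer. It uses the standard DPO objective; how the regularizer is weighted and normalized is an assumption, as are the names `select_pair` and `dpo_with_ce_loss` and the defaults for `min_gap`, `beta`, and `ce_weight`.

```python
import math
from typing import List, Optional, Tuple


def select_pair(
    scored_answers: List[Tuple[str, float]], min_gap: float = 4.0
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) if the best and worst scores differ by at least min_gap."""
    if len(scored_answers) < 2:
        return None
    best = max(scored_answers, key=lambda p: p[1])
    worst = min(scored_answers, key=lambda p: p[1])
    if best[1] - worst[1] < min_gap:
        return None
    return best[0], worst[0]


def dpo_with_ce_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    chosen_num_tokens: int,
    beta: float = 0.1,       # assumed value
    ce_weight: float = 1.0,  # assumed value
) -> float:
    """DPO loss for one pair plus a cross-entropy term on the chosen answer.

    The *_logp arguments are summed token log-probabilities of each answer.
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)), written as softplus(-margin) for numerical stability
    dpo = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    # Token-averaged negative log-likelihood of the chosen answer
    ce = -policy_chosen_logp / chosen_num_tokens
    return dpo + ce_weight * ce
```

The cross-entropy term keeps the likelihood of the preferred answer from drifting down while the preference margin grows, which is the usual motivation for this kind of regularizer.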

Empirical Experiments and Out-of-Distribution Tests 🧪

Additionally, empirical experiments and out-of-distribution tests validated the effectiveness of the math critique model: measured against human annotations, its judgments were more accurate than those of models like GPT-3.5-Turbo across various evaluations.
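A comparison of this kind typically rests on two numbers: how often the judge's correct/incorrect verdict matches the human label, and how well its scores correlate with human scores. The sketch below computes both; the pass `threshold` of 6.0 and the function names are assumptions for illustration, not the evaluation protocol from the source.

```python
import math
from typing import Sequence


def judgment_accuracy(
    model_scores: Sequence[float],
    human_labels: Sequence[bool],
    threshold: float = 6.0,  # assumed cut-off between "incorrect" and "correct"
) -> float:
    """Fraction of answers where 'score >= threshold' matches the human label."""
    if not model_scores or len(model_scores) != len(human_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum((s >= threshold) == h for s, h in zip(model_scores, human_labels))
    return hits / len(model_scores)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between model scores and human scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Running the same two functions over the scores produced by each judge on a shared, human-annotated test set gives directly comparable numbers.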

Future Directions 🔮

Lastly, we identified two limitations: the model struggles with questions that require graphic thinking, and with complex computations that require precise calculation. To address these, we proposed exploring multimodal input and output components for the former and using external tools or code interpreters for the latter. These future directions aim to further enhance the math critique model’s performance and applicability in evaluating mathematical responses accurately and comprehensively.
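A tiny example of why delegating arithmetic to a code interpreter helps: exact rational arithmetic avoids the rounding errors that accumulate when numbers are handled approximately. The helper `exact_sum` is hypothetical and only illustrates the idea of routing a calculation to a tool; it is not part of the described system.

```python
from fractions import Fraction
from typing import Iterable


def exact_sum(terms: Iterable[str]) -> Fraction:
    """Add decimal or fractional strings exactly, with no floating-point rounding."""
    return sum((Fraction(t) for t in terms), Fraction(0))


# Floating-point addition is only approximate here
float_result = 0.1 + 0.2
# Exact arithmetic gives precisely 3/10
exact_result = exact_sum(["0.1", "0.2"])
```

Here `exact_result` equals `Fraction(3, 10)`, while `float_result` is not equal to `0.3`.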

Conclusion 💡

The math critique model offers a novel approach to enhance the evaluation accuracy of mathematical responses by integrating reference answers and defining scoring rules and intervals. The self-critique pipeline, which uses rejection sampling and direct preference optimization, enhances the model’s capabilities and accuracy in evaluating mathematical responses. The model outperforms competitors on various datasets, and the math critique model's judgments, measured against human annotations, are more accurate than those of models like GPT-3.5-Turbo. Exploring multimodal input and output components for graphic thinking abilities and using external tools or code interpreters for precise calculations can further enhance its performance.
