Collected from "Navigate through Enigmatic Labyrinth: A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future"
Method | Year | Setting | Model | GSM8K | SVAMP | ASDiv | AQuA | CSQA | StrategyQA | LastLetterConcat | CoinFlip | Paper |
---|---|---|---|---|---|---|---|---|---|---|---|---|
I-O Prompting | 2020 | fewshot | text-davinci-002 | 19.70 | 69.90 | 74.00 | 29.50 | 79.50 | 65.90 | 5.80 | 49.00 | Language Models are Few-Shot Learners |
Fewshot CoT | 2022 | fewshot | text-davinci-002 | 63.10 | 76.40 | 80.40 | 45.30 | 73.50 | 65.40 | 77.50 | 99.60 | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |
PoT | 2022 | fewshot | text-davinci-002 | 80.00 | 89.10 | - | 58.60 | - | - | - | - | Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks |
Complex CoT | 2023 | fewshot | text-davinci-002 | 72.60 | - | - | - | - | 77.00 | - | - | Complexity-Based Prompting for Multi-step Reasoning |
Automate-CoT | 2023 | fewshot | text-davinci-002 | 49.70 | 73.30 | 74.20 | 37.90 | 76.10 | 67.90 | 58.90 | - | Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data |
Fewshot CoT | 2022 | fewshot | text-davinci-003 | 66.83 | 69.06 | - | 29.13 | - | - | - | - | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |
PHP | 2023 | fewshot | text-davinci-003 | 79.00 | 84.70 | - | 58.60 | - | - | - | - | Progressive-Hint Prompting Improves Reasoning in Large Language Models |
Self-Consistency | 2023 | fewshot | text-davinci-003 | 67.93 | 83.11 | - | 55.12 | - | - | - | - | Self-Consistency Improves Chain of Thought Reasoning in Language Models |
Active Prompt | 2023 | fewshot | text-davinci-003 | 65.60 | 80.50 | 79.80 | 48.00 | 78.90 | 74.20 | 71.20 | - | Active Prompting with Chain-of-Thought for Large Language Models |
Synthetic Prompt | 2023 | fewshot | text-davinci-003 | 73.90 | 81.80 | 80.70 | - | - | - | - | - | Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models |
FOBAR | 2023 | fewshot | text-davinci-003 | 79.50 | 86.00 | - | 58.66 | - | - | - | - | Forward-Backward Reasoning in Large Language Models for Verification |
Boosted Prompting | 2023 | fewshot | text-davinci-003 | 71.60 | - | - | 55.10 | - | - | - | - | Boosted Prompt Ensembles for Large Language Models |
Fewshot CoT | 2022 | fewshot | code-davinci-002 | 60.10 | 75.80 | 80.10 | 39.80 | 79.00 | 73.40 | 70.40 | 99.00 | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |
Self-Consistency | 2023 | fewshot | code-davinci-002 | 78.00 | 86.80 | 87.80 | 52.00 | 81.50 | 79.80 | 73.40 | 99.50 | Self-Consistency Improves Chain of Thought Reasoning in Language Models |
PAL | 2023 | fewshot | code-davinci-002 | 72.00 | 79.40 | 79.60 | - | - | - | - | - | PAL: Program-aided Language Models |
Resprompt | 2023 | fewshot | code-davinci-002 | 66.60 | - | - | 45.30 | - | - | - | - | Resprompt: Residual Connection Prompting Advances Multi-Step Reasoning in Large Language Models |
DIVERSE | 2023 | fewshot | code-davinci-002 | 82.30 | 87.00 | 88.70 | - | 79.90 | 78.60 | - | - | Making Language Models Better Reasoners with Step-Aware Verifier |
Least-to-Most | 2023 | fewshot | code-davinci-002 | 68.01 | - | - | - | - | - | 94.00 | - | Least-to-Most Prompting Enables Complex Reasoning in Large Language Models |
Boosted Prompting | 2023 | fewshot | code-davinci-002 | 83.30 | 88.60 | - | 61.70 | - | - | - | - | Boosted Prompt Ensembles for Large Language Models |
Fewshot CoT | 2022 | fewshot | gpt-3.5-turbo | 76.50 | 81.90 | - | 54.30 | 78.00 | 63.70 | 73.20 | 99.00 | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |
Self-Consistency | 2023 | fewshot | gpt-3.5-turbo | 81.90 | 86.40 | - | 62.60 | - | - | - | - | Self-Consistency Improves Chain of Thought Reasoning in Language Models |
Meta-CoT | 2023 | fewshot | gpt-3.5-turbo | 75.10 | 88.60 | - | 54.70 | 72.40 | 64.50 | 77.20 | 100.00 | Meta-CoT: Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with Large Language Models |
Verify CoT | 2023 | fewshot | gpt-3.5-turbo | 86.00 | - | - | 69.50 | - | - | 92.60 | - | Deductive Verification of Chain-of-Thought Reasoning |
Active Prompting | 2023 | fewshot | gpt-3.5-turbo | 81.80 | 82.50 | 87.90 | 55.30 | - | - | - | - | Active Prompting with Chain-of-Thought for Large Language Models |
RCoT | 2023 | fewshot | gpt-3.5-turbo | 84.60 | 84.90 | 89.30 | 57.10 | - | - | - | - | RCoT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought |
FOBAR | 2023 | fewshot | gpt-3.5-turbo | 87.40 | 87.40 | - | 57.50 | - | - | - | - | Forward-Backward Reasoning in Large Language Models for Verification |
Memory-of-Thought | 2023 | fewshot | gpt-3.5-turbo | - | - | - | 54.10 | - | - | - | - | MoT: Pre-thinking and Recalling Enable ChatGPT to Self-Improve with Memory-of-Thoughts |
Adaptive-Consistency | 2023 | fewshot | gpt-3.5-turbo | 82.70 | 85.00 | 83.00 | - | - | 67.90 | - | - | Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning with LLMs |
Boosted Prompting | 2023 | fewshot | gpt-3.5-turbo | 87.10 | - | - | 72.80 | - | - | - | - | Boosted Prompt Ensembles for Large Language Models |
Zeroshot CoT | 2022 | zeroshot | text-davinci-002 | 40.50 | 63.70 | - | 31.90 | 64.00 | 52.30 | 57.60 | 87.80 | Large Language Models are Zero-Shot Reasoners |
PoT | 2022 | zeroshot | text-davinci-002 | 57.00 | 70.80 | - | 43.90 | - | - | - | - | Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks |
AutoCoT | 2023 | zeroshot | text-davinci-002 | 47.90 | 69.50 | - | 36.50 | 74.40 | 65.40 | 59.70 | 99.90 | Automatic Chain of Thought Prompting in Large Language Models |
COSP | 2023 | zeroshot | code-davinci-001 | 8.70 | - | - | - | 55.40 | 52.80 | - | - | Better Zero-Shot Reasoning with Self-Adaptive Prompting |
Plan-and-Solve | 2023 | zeroshot | text-davinci-003 | 58.20 | 72.00 | - | 42.50 | 65.20 | 63.80 | 64.80 | 96.80 | Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models |
Agent-Instruct | 2023 | zeroshot | gpt-3.5-turbo | 73.40 | 80.80 | - | 57.90 | 74.10 | 69.00 | 99.80 | 95.20 | Agent Instructs Large Language Models to be General Zero-Shot Reasoners |
Self-Refine | 2023 | zeroshot | gpt-3.5-turbo | 64.10 | - | - | - | - | - | - | - | Self-Refine: Iterative Refinement with Self-Feedback |
RCoT | 2023 | zeroshot | gpt-3.5-turbo | 82.00 | 79.60 | 86.00 | 55.50 | - | - | - | - | RCoT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought |
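Several rows above (Self-Consistency, Adaptive-Consistency, DIVERSE) benchmark sampling-and-voting variants of chain-of-thought. As a minimal sketch of the core idea behind Self-Consistency, the snippet below samples several reasoning chains and majority-votes over their final answers. The `sample_fn` callable and `make_toy_sampler` helper are hypothetical stand-ins for a stochastic LLM call (temperature > 0), not any real API; the voting logic is the part the table's methods share.

```python
from collections import Counter

def self_consistency(sample_fn, question, n_samples=5):
    """Sample n reasoning chains and majority-vote on their final answers.

    sample_fn(question) -> (chain_of_thought, answer) is a placeholder for
    a stochastic LLM call; only the answers enter the vote.
    """
    answers = [sample_fn(question)[1] for _ in range(n_samples)]
    # Marginalize out the reasoning paths: keep the most frequent answer.
    return Counter(answers).most_common(1)[0][0]

def make_toy_sampler(outputs):
    """Hypothetical stand-in: cycle through pre-canned (chain, answer) pairs
    to mimic stochastic decoding without any model dependency."""
    it = iter(outputs)
    return lambda question: next(it)

sampler = make_toy_sampler([
    ("3 + 4 = 7, times 2 is 14", "14"),
    ("3 + 4 is 7; doubled gives 14", "14"),
    ("3 * 2 = 6, plus 4 is 10", "10"),  # a faulty chain, outvoted below
])
print(self_consistency(sampler, "What is (3 + 4) * 2?", n_samples=3))  # 14
```

Methods such as Adaptive-Consistency differ mainly in when sampling stops (e.g. once one answer's lead is large enough) rather than in the vote itself.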