HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs

Tin Nguyen1,*, Logan Bolton1,*, Mohammad Reza Taesiri2, Trung Bui3, Anh Totti Nguyen1

*Equal contribution

1Auburn University, 2University of Alberta, 3Adobe Research

HoT Teaser - MATH500 Example

CoT and HoT (ours) responses to a MATH500 question from the ReasoningTrap benchmark, both generated by Gemini. Left: With CoT prompting, the LLM misses the provided facts $a=10, b=c=d=0$, resulting in an incorrect answer. Right: With HoT prompting, the model highlights the key constraint $b=c=d=0$ and plugs it into the expression $ab^2c^3d^4$ to produce the correct answer of 0.

HoT Teaser - Puzzle Example

Left: After being finetuned via SFT on CoT examples, Qwen-2.5-1.5B answers an adversarial question from PuzzleTrivial incorrectly because it does not factor in the key fact that the lions are permanently infertile. Right: In contrast, the HoT-finetuned counterpart highlights the facts and answers correctly, using the fact that the lions would never reproduce.

📜 Abstract

An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response that mixes factual and non-factual statements is hard for humans to verify and to accurately base their decisions on. To combat this problem, we propose Highlighted Chain of Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, the LLM first re-formats the question to add XML tags highlighting key facts, and then generates a response with highlights over the facts referenced from the input. Interestingly, in few-shot settings, HoT outperforms vanilla chain of thought prompting (CoT) (Wei et al., 2022) on a wide range of 22 tasks, from arithmetic and reading comprehension to logical reasoning. We also test how much the highlights help users detect when LLMs are correct. As expected, they help time-limited human participants recognize more accurately and efficiently when LLMs are correct. However, interestingly, when LLMs are wrong, HoTs tend to fool users into believing that an answer is correct.

Key Takeaways

Recipe to make these highlights?

To prompt LLMs to generate HoTs, we first create 8-shot demonstration examples (CoT demonstrations augmented with XML tags) that show LLMs how to insert tags and answer questions. Second, we add a short, explicit HoT instruction that asks LLMs to insert tags into the question and then answer it.
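
To make this recipe concrete, here is a minimal sketch in Python of how such a prompt could be assembled. The demonstration text, tag names (e.g., <fact1>), and instruction wording are illustrative placeholders rather than the exact prompts used in the paper.

```python
# Minimal sketch of assembling a HoT prompt (illustrative only).
# The instruction wording, demonstrations, and tag names are placeholders,
# not the exact prompts from the paper.

HOT_INSTRUCTION = (
    "Re-format the question by wrapping the key facts in XML tags such as "
    "<fact1>...</fact1>, then answer it, re-using the same tags around the "
    "facts you reference in your reasoning."
)

# Few-shot demonstrations: CoT examples augmented with XML fact tags.
DEMONSTRATIONS = [
    {
        "question": "A train travels <fact1>60 miles per hour</fact1> for "
                    "<fact2>2 hours</fact2>. How far does it go?",
        "answer": "It travels <fact1>60 miles per hour</fact1> x "
                  "<fact2>2 hours</fact2> = 120 miles. The answer is 120.",
    },
    # ... 7 more demonstrations in the 8-shot setting ...
]

def build_hot_prompt(new_question: str) -> str:
    """Concatenate the instruction, the tagged demonstrations, and the new question."""
    parts = [HOT_INSTRUCTION, ""]
    for demo in DEMONSTRATIONS:
        parts += [f"Q: {demo['question']}", f"A: {demo['answer']}", ""]
    parts += [f"Q: {new_question}", "A:"]
    return "\n".join(parts)

# Example usage:
# print(build_hot_prompt("If a = 10 and b = c = d = 0, what is a*b^2*c^3*d^4?"))
```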

LLMs generate HoT responses by wrapping XML tags around the information that the model determines to be the most important. Regex and CSS are then used to add color highlighting so the user can easily follow the response.
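
Below is a minimal sketch of this rendering step, assuming the response wraps facts in tags of the form <fact1>...</fact1>; the tag names and color palette are assumptions, and the project's actual rendering code may differ.

```python
import re

# Assumed palette: each fact number gets its own highlight color.
PALETTE = ["#ffd54f", "#80deea", "#a5d6a7", "#f48fb1", "#ce93d8"]

def highlight(tagged_text: str) -> str:
    """Replace <factN>...</factN> pairs with <span> elements styled by CSS colors."""
    def repl(match: re.Match) -> str:
        idx = int(match.group(1))                   # the fact number, e.g. 1 for <fact1>
        color = PALETTE[(idx - 1) % len(PALETTE)]   # same fact number -> same color
        return f'<span style="background-color:{color}">{match.group(2)}</span>'

    # Match a tag pair like <fact3> ... </fact3> and capture its number and contents.
    return re.sub(r"<fact(\d+)>(.*?)</fact\1>", repl, tagged_text, flags=re.DOTALL)

# Example: both occurrences of fact1 receive the same highlight color.
print(highlight("Since <fact1>b = c = d = 0</fact1>, the product "
                "<fact2>a*b^2*c^3*d^4</fact2> equals 0 because <fact1>b = c = d = 0</fact1>."))
```

Re-using one color per fact number is what lets readers visually match each highlighted statement in the answer back to the corresponding span in the question.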

Benchmark Improvements

We evaluate HoT on 22 tasks across arithmetic, question-answering, and logical reasoning datasets using Gemini-1.5-Pro, Gemini-1.5-Flash, Llama-3.1-70B, Llama-3.1-405B, and GPT-4o.

HoT outperforms CoT across 3 ReasoningTrap tasks, with significant gains on MATH500 (Conditioned Math) (+18.00; see the first teaser example above) and PuzzleTrivial (+12.50; see the second teaser example above).

HoT consistently improves accuracy over CoT across arithmetic tasks. Notably, HoT achieves the largest performance gains on AQUA (+14.64 for Gemini-1.5-Flash) and r-GSM (+12.73 for Gemini-1.5-Pro).

HoT demonstrates consistent accuracy improvements over CoT across QA tasks (StrategyQA, SpartQA, Date) and Reading Comprehension tasks (DROP (Break) and DROP (Census)). The largest gains are observed on StrategyQA (+15.07 for Llama-3.1-70B) and SpartQA (+11.88 for Gemini-1.5-Flash).

HoT outperforms CoT across the logical reasoning BBH subsets, with notable gains on Causal Judgment (+15.5 for GPT-4o) and the Five Object tasks (+6.00 for Gemini-1.5-Pro).

HoT helps users read long-context questions more easily

LLMs are great at answering a wide variety of questions, but it can be annoying to read through the huge blocks of text they tend to generate. Humans frequently add color highlighting to their writing to make it easier to read, so why not let LLMs do the same thing?

Which of these two responses is easier to read?
(Left) CoT: you have to parse through all the irrelevant context to find the part of the conversation you care about. (Right) HoT: you can almost instantly scan the LLM response to see exactly where it drew its answer from.

How does this affect the user experience?

We find that HoT helps time-limited human participants to more accurately and efficiently recognize when LLMs are correct. However, when LLMs are wrong, HoT responses tend to fool users into believing that an answer is actually correct.

Prompt   Avg Time (seconds)   Verification Accuracy, Correct LLM Responses ✓   Verification Accuracy, Incorrect LLM Responses ✗
HoT      47.26                84.48% ± 20.28%                                  54.83% ± 30.13%
CoT      62.38                78.82% ± 28.26%                                  72.21% ± 21.99%

HoT reduces hallucinations compared to CoT

HoT prompting makes LLMs hallucinate consistently less over a diverse set of tasks. The table reports SelfCheckGPT hallucination scores; lower is better.

An example question and answer from StrategyQA: CoT hallucinates the fact that the "Vietnam War lasted from 1955 to 1975," whereas HoT detects the fact stated within the question, the "War in Vietnam (1945-46)."

HoT outperforms other advanced prompting techniques

Over 5 runs across 22 benchmarks, HoT consistently outperforms CoT, Repeating Questions (RQ), ComplexCoT, and even CoT + Self-Consistency (SC). HoT and HoT + SC also outperform their respective counterparts (ComplexCoT and CoT + SC), showing that HoT can complement these methods.

On average over 5 runs and 3 datasets, HoT alone is still the best-performing method compared to all other advanced prompting methods: CoT, LtM, CoVE, Self-Refine, and ToT. Under the other prompting techniques, we observe that LLMs often miss critical facts (e.g., overlooking temporal indicators like "yesterday" in Date), causing incorrect answers. In contrast, LLMs tend to focus better on key facts under HoT prompting.