HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs

Tin Nguyen, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen

An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate non-factual statements. A response that mixes factual and non-factual statements is hard for humans to verify and to accurately base decisions on. To combat this problem, we propose Highlighted Chain of Thought Prompting (HoT), a technique for prompting LLMs to generate responses with XML tags that ground facts to those provided in the query. That is, given an input question, the LLM first re-formats the question to add XML tags highlighting key facts, and then generates a response with highlights over the facts referenced from the input. Interestingly, in fewshot settings, HoT outperforms vanilla chain of thought prompting (CoT) (Wei et al., 2022) on 17 tasks spanning arithmetic, reading comprehension, and logical reasoning. We also test how much highlights help users detect when LLMs are correct. As expected, they help time-limited human participants recognize correct LLM answers more accurately and efficiently. However, interestingly, when LLMs are wrong, HoT responses tend to fool users into believing that an answer is correct.

A HoT response from Gemini 1.5 Pro to a question from GSM8K.

Introduction

Modern LLMs can bold, italicize, or underline their text. Why not highlight as well? We propose Highlighted Chain of Thought (HoT), a technique for prompting LLMs to generate highlights in their responses that link specific information from the user query to the corresponding parts of the LLM response. This method improves both the user experience and answer accuracy compared to vanilla Chain of Thought prompting.

Key Takeaways

LLMs generate HoT responses by wrapping XML tags around the information that the model determines is the most important. Regex and CSS are then used to add color highlighting for the user to easily understand the response.

Benchmark Improvements

We evaluate HoT on 17 tasks across arithmetic, question-answering, and logical reasoning datasets using Gemini-1.5-Pro, Gemini-1.5-Flash, Llama-3.1-70B, Llama-3.1-405B, and GPT-4o.

HoT consistently improves accuracy over CoT across arithmetic tasks. Notably, HoT achieves the largest performance gains on AQUA (+14.64 points for Gemini-1.5-Flash) and r-GSM (+12.73 points for Gemini-1.5-Pro).

HoT demonstrates consistent accuracy improvements over CoT across QA tasks (StrategyQA, SpartQA, Date) and Reading Comprehension tasks (DROP (Break) and DROP (Census)). The largest gains are observed on StrategyQA (+15.07 points for Llama-3.1-70B) and SpartQA (+11.88 points for Gemini-1.5-Flash).

HoT outperforms CoT across the logical reasoning BBH subsets, with notable gains on Causal Judgment (+15.5 points for GPT-4o) and Five Object tasks (+6.00 points for Gemini-1.5-Pro).

What's the Point?

LLMs are great at answering a wide variety of questions, but it can be annoying to read through the huge blocks of text they tend to generate. We frequently add color highlighting to our own writing to make it easier to read, so why not let LLMs do the same thing?

HoT responses are quicker to read through than CoT responses.

Consider the previous example. Which of these two responses is easier to read? For the response on the left, you have to parse through all the irrelevant context to find the part of the conversation you care about. For the HoT response, you can almost instantly scan the answer to see exactly where the model drew it from.

How do you make these highlights?

Given an input question, the LLM first repeats the original question with XML tags wrapped around the key facts needed to answer it. Then, the model generates an answer that draws on the tagged information in the question. The model learns what types of information to tag from the fewshot examples given in the HoT instruction prompt.
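To make this flow concrete, here is a hedged Python sketch of how such a prompt might be assembled. The instruction wording, the single exemplar, and the name build_hot_prompt are our own placeholders rather than the authors' actual prompt; the real instruction and fewshot examples are in the paper and on its GitHub.

```python
# A hedged sketch of assembling a HoT prompt: instruction + fewshot exemplars + new question.
# The instruction text and exemplar below are placeholders of our own, not the authors' prompt.
HOT_INSTRUCTION = (
    "Re-write the question, wrapping the key facts in <fact1>, <fact2>, ... tags. "
    "Then answer step by step, wrapping every reference to those facts in the same tags."
)

# One hypothetical fewshot exemplar: a tagged, re-formatted question plus a tagged answer.
EXEMPLARS = [
    (
        "Reformatted Question: If you take <fact1>1 step right</fact1> and then "
        "<fact2>3 steps left</fact2>, where do you end up?",
        "Answer: Moving <fact1>1 step right</fact1> and then <fact2>3 steps left</fact2> "
        "puts you 2 steps left of where you started. The answer is 2 steps left.",
    ),
]

def build_hot_prompt(question: str) -> str:
    """Join the instruction, the fewshot exemplars, and the new question into one prompt."""
    shots = "\n\n".join(f"{q}\n{a}" for q, a in EXEMPLARS)
    return f"{HOT_INSTRUCTION}\n\n{shots}\n\nQuestion: {question}\nReformatted Question:"

print(build_hot_prompt("If you take 4 steps left and then 1 step right, where do you end up?"))
```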

For HoT responses, the same information that is highlighted in the question is then highlighted in the answer.

For example, if the model wrapped "1 step right" with the fact1 tag in the reformatted question, it knows to then wrap any references to "1 step right" in the answer with the same fact1 tag. Then, using regex and CSS, we can strip out these XML tags and assign a highlight color to each unique tag (see the first image under Key Takeaways for an example).
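Concretely, that post-processing step might look like the minimal Python sketch below. The sample tagged answer and the color palette are our own illustrative choices; only the <factN>...</factN> convention and the regex-plus-CSS idea come from the description above.

```python
import re
from itertools import cycle

# Sample HoT-style answer; illustrative only, not an actual model output.
hot_answer = (
    "Taking <fact1>1 step right</fact1> and then <fact2>3 steps left</fact2> "
    "leaves you 2 steps left of where you started."
)

# Assign one background color per unique tag (colors chosen arbitrarily here).
palette = cycle(["#ffd54f", "#aed581", "#81d4fa", "#f48fb1", "#ce93d8"])
colors = {}

def tag_to_span(match):
    """Replace <factN>text</factN> with an HTML span carrying that tag's highlight color."""
    tag, text = match.group(1), match.group(2)
    if tag not in colors:
        colors[tag] = next(palette)
    return f'<span style="background-color: {colors[tag]}">{text}</span>'

# Strip the XML tags and swap in CSS-styled spans, one color per unique tag.
html = re.sub(r"<(fact\d+)>(.*?)</\1>", tag_to_span, hot_answer)
print(html)
```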

How does this affect the user experience?

We conduct a study with 63 users to evaluate how well HoT helps humans verify the correctness of LLM answers. Users are randomly assigned to see exclusively HoT or exclusively CoT responses and then have to predict whether the given answer is correct or incorrect. We find that HoT helps time-limited human participants recognize correct LLM answers more accurately and efficiently. However, when LLMs are wrong, HoT responses tend to fool users into believing that an answer is actually correct.

| Prompt | Avg Time (seconds) | Verification Accuracy for Correct LLM Responses ✓ | Verification Accuracy for Incorrect LLM Responses ✗ |
|---|---|---|---|
| HoT | 47.26 | 84.48% ± 20.28% | 54.83% ± 30.13% |
| CoT | 62.38 | 78.82% ± 28.26% | 72.21% ± 21.99% |

Discussion

Why does HoT work?

Part of HoT's increase in benchmark accuracy can likely be attributed to the LLM repeating the input question in its response. Several papers (Xu et al., 2024; Mekala et al., 2024) have shown that having the LLM repeat the input question before generating a response can improve performance. However, in our full paper we show that HoT outperforms simply repeating the question.

We theorize that generating extra tokens (in this case, XML tags) around key facts helps the LLM focus its attention on the most important parts of the context. It is also possible that the XML tags help the LLM recall specific information from earlier in its context window with fewer hallucinations. However, more experimentation is needed to verify these claims.

What about Reasoning Models?

So if HoT can help LLMs reason more effectively, how do reasoning models respond to this prompting strategy? Given the relatively high cost of running DeepSeek R1 and the low rate limits of Gemini-2.0-Flash-Thinking, we were only able to run evaluations on a subset of benchmarks. However, we see no benefit to using HoT with these reasoning models.

Given that HoT relies on fewshot examples, the negative results for DeepSeek R1 align with the warnings from the creators of R1 that fewshot examples can actually hurt its performance (DeepSeek-AI, 2025). Interestingly, the thinking tokens for R1 contain XML tags only about 10% of the time, regardless of whether its final answer actually includes XML tags. Gemini-2.0-Flash-Thinking does not provide thinking tokens over the API, so we are not able to analyze its internal Chain-of-Thought.

If reasoning models were trained to use XML tags in their thinking tokens, HoT could become a useful tool for grounding facts over long contexts. However, current reasoning models do not benefit from HoT.

Limitations

HoT relies on fewshot examples to demonstrate the desired output format to the LLM. On our GitHub, we have gathered plenty of fewshot examples that can be applied to most domains. However, if you have a niche task that you want to use HoT on, you must first construct fewshot examples for it (whether manually or with LLM-generated examples). Future work could alleviate this issue with a finetuned model that produces HoT responses by default.
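As one hedged illustration of that construction step, the sketch below only builds the meta-prompt you might send to an LLM to draft a tagged exemplar for a new domain; the wording and the sample question are hypothetical, and the returned exemplar would still need to be checked by hand.

```python
# A hedged sketch of drafting a new fewshot exemplar for a niche domain with an LLM.
# The prompt wording and the sample question are hypothetical; verify the tagged exemplar
# the model returns before adding it to your HoT prompt.
def draft_exemplar_request(question: str) -> str:
    """Build a meta-prompt asking an LLM to produce a tagged question and a tagged answer."""
    return (
        "Re-write the question below, wrapping each key fact in <fact1>, <fact2>, ... tags, "
        "then write a step-by-step answer that re-uses the same tags whenever it cites a fact.\n\n"
        f"Question: {question}"
    )

# Hypothetical niche-domain question; send the resulting string to your LLM of choice.
print(draft_exemplar_request(
    "A tank holds 120 liters and drains 4 liters per hour. How much remains after 6 hours?"
))
```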

Conclusion

We present Highlighted Chain of Thought, a novel prompting approach that enables LLMs to directly reference text from the input question in their responses. Our experiments show that, on average, HoT improves accuracy on arithmetic, question-answering, and logical reasoning tasks by +1.6, +2.58, and +2.53 percentage points over CoT, while also enabling users to verify correct LLM answers 24% faster than with CoT.