
We can’t fully trust large language model (LLM) outputs, in part because LLMs don’t always produce reliable confidence estimates. One could inspect the model’s token likelihoods, but even those are inaccessible for many black-box models. We show here that it is possible to train a lightweight external model to infer an LLM’s internal confidence from only the prompt and the LLM’s answers (a purely black-box setting).
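To make the setup concrete, here is a minimal sketch of one way such an external confidence model could look. Everything below is illustrative and not the method from this work: we assume access only to (prompt, answer) text pairs with correctness labels, extract a couple of hand-crafted surface features (answer length, hedging-word rate), and fit a tiny logistic regression whose output serves as a confidence score.

```python
import math

# Illustrative hedging vocabulary; a real system would learn features rather
# than hard-code them.
HEDGE_WORDS = {"maybe", "possibly", "might", "unsure", "probably"}

def features(prompt: str, answer: str):
    """Black-box features computed from text alone (no model internals)."""
    words = answer.lower().split()
    hedges = sum(w.strip(".,") in HEDGE_WORDS for w in words)
    return [
        1.0,                            # bias term
        len(words) / 50.0,              # normalized answer length
        hedges / max(len(words), 1),    # fraction of hedging words
    ]

def train(pairs, labels, lr=0.5, epochs=200):
    """Fit logistic regression by simple gradient ascent on the log-likelihood."""
    w = [0.0] * 3
    for _ in range(epochs):
        for (prompt, answer), y in zip(pairs, labels):
            x = features(prompt, answer)
            z = sum(wi * xi for wi, xi in zip(w, x))
            pred = 1.0 / (1.0 + math.exp(-z))
            for i in range(len(w)):
                w[i] += lr * (y - pred) * x[i]
    return w

def confidence(w, prompt: str, answer: str) -> float:
    """Predicted probability that the LLM's answer is correct."""
    x = features(prompt, answer)
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

In this toy version, answers full of hedging words receive lower confidence than direct ones, which mimics the intuition that an answer's surface form carries a signal about the model's internal uncertainty.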