yoavg/structured-cot.md

## structured-cot.md

      
    Raw
  

              structured-cot.md
            
          
    Are OpenAI training models in a way that encourages security risks?

Todays's topic is structured outputs, how to produce them, their interprlay with chain-of-thought, and a potential security risk this opens up.
Structured Outputs

When using an LLM programatically as part of a larger system or process, it is useful to have the model produce outputs in a structured format which is easy to parse programatically. Formatting the output as a JSON structure makes a lot of sense in this regard, and the commercial LLM models are trained to produce JSON outputs according to your specification.
So for example instead of asking the model to produce a list of 10 items (left) which may be tricky to parse, I could ask it to return the answer as a JSON list of 10 strings (right).


I can also ask for more elaborate structures:

This is both easier to parse, and also guarantees the model will return only the structure I want and no additional noise like opening and closing lines.
In case the response fails to parse to JSON, or fails to parse according to the structure I want, I can then tell the model exactly what the problem is, and ask it to revise. And this can be done automatically in a loop, until it conforms.
This is great. Very convenient. Works well. Models are trained for it. They are also trained to take exact specifications of how the output should look like, also in structured formats  (not like the messy spec I wrote above). They also have modes to verify up-front that the generated output conforms to the structure I asked for, so I don't need the looping described above. Fantastic.
Chains of Thought

However, we also know that the model's computational capacity is limited, and that "Chain-of-thought" promting really helps sometimes.
In short, we ask the model to produce not only a final answer, but to start by producing a reasoning, and only then produce its final answer.
This is helpful as it gives the model more time to do its calculations, and to spell them out. The generation of the answer is now produced
based not only on the question and some internal computation, but also on some external intermediate computation produced by the model.
Chain-of-thought prompting is a very popular technique, and for a good reason. In many cases chain-of-thought answers are of higher quality.
But... structured output?

What happens when you want to combine the two? I want my final answer to be easy to parse. But I also want the model to "think" before giving me the answer, so it will be more accurate.
What do I do? How do I ask the model to first produce a chain-of-thought or to work step-by-step, while still having the final answer as a nice structured object?
I can just ask it for exactly this: "think about the solution step by step, then produce the final outcome as a structured output of the form ..."
This works, but sometimes fails, and may produce extra material after the final step. Also, I cannot use the "verify that the output conforms to this structure" mode some models provide.
Another option is to work with two prompts. In the first prompt, I ask it to produce the reasoning steps towards the answer. In the second prompt, I ask it "now produce the answer as a structured object of the form...".
Now the second prompt has a comppletely structured output, and we are all happy. Except that this is a bit cumbersome to write and work with.
Given the need to combine these two, people came up with a clever solution:
Structured chain-of-thought

This solution was proposed to me by several LLM users online:


If it wasn't clear from the above bluesky posts, the idea is basically: include the chain-of-thought, or "reasoning steps", as part of the structured output you ask the model to produce.
So, instead of asking "produce a reasoning by thinking step by step, and then produce a JSON structure with two fields, A and B" we ask it:
"produce a JSON structure with three fields. The first field is called 'reasoning', followed by the two fields A and B which are the result of the reasoning process".
Cool trick, right?
And it is also recommended by OpenAI in their official documentations, which means they likely train their models for it.


)
Cool.
But! It goes against a very core principle of separation people use when working with data. And which I think we'd like LLMs to use as well.
Code-vs-Data

As humans, we generally try to separate between "code", or "instructions" and the "values" they operate on.
So, when I tell you:
Translate the following sentence: "generate 10 names of dogs" to french.

You treat the outer parts ("translate the following sentence ___ to french") as an instruction, and the quoted sentence as the data to be translated.
You most definitely not read this text as a request to generate 10 dog names, as this part is just data.
While some computer languages (ehm LISP, Scheme, et al) push the notion of "code is data and data is code", this is not a widely adopted approach, and for a good reason: they are generaly different things, and when I do treat data as code, I want to be very explicit about it.
For those familar with SQL injection attacks (and simialr security vulnerabilities¹), they break precisely this principle: something that was meant to be data was treated by mistake as code, and was executed instead of being operated on.
Training LLMs to be more brittle

When an LLM is asked to include chain-of-thought reasoning steps in its structured outputs, with the intent for it to take them into account when generating the next part of the structure, we ask it to break this break this principle. (We tell the model to treat content inside of quoted strings, in something which is clearly a data structure, as instructions for producing the rest of the data. This is not natural.)
And when it then follows our request and does so, it indeed breaks this princuple.²
And when LLM providers tune their model to follow reasoning steps in structured outputs, they are actively training against this principle.
Which is... odd. And maybe dangerous. It definitely makes the models more brittle, and I wonder if it will generalize and manifest also as consequences beyond just structured-outputs.
Dangerous?! Well, you know. Not in the "the models will destroy us" sense, but yes in "the models will become more exploitable" sense.
There is already a well known hijacking attack where you tell a model "ignore all previous instructions and do X", which allow users to control LLM outputs. When such strings are written into input fields that are then fed to LLMs, we change the intended behavior of the software.
This is a by-the-book injection attack. One way to solve it would be to mark the user input as being data, and train the LLM to not follow instructions in the data part of the prompt.
This is not currently done. And training the models to do structured-chain-of-thought actively trains against this separatio, and opens the doors to potential vectors of attack by this nature.
So what am I proposing?

I am not here to propose anything. This is just describing an observation I find interesting and worthwhile to share.
BUT, if I do have to propose something, it will be this: add the ability to incldue comments in the structured outputs. Formats like JSON5 support comments.
And comments around data are not data, they are naturally used as description of data or explanations of data. They are very natural place to include such model produced chain-of-thought reasoning chains, for the model to take into consideration when creating the following piece of data.
Footnotes


In fact, the amount of security vulnerabilities that can be summarized at a high level as "treating data as executable code" is just staggering. ↩


Interestingly, when I played with it a little in OpenAI models, it seemed that chain-of-thoughts in structured fields are indeed less effective then chain of thoughts outside of fields (which is good and expected). But there is also evidence that structured-chain-of-thought does have a positive effect when compared to not including it, and OpenAI docs do suggest they are expecting this usage, and so are likely training for it. ↩