OpenAI's Structured Outputs for RAG and Data Extraction

Last week, OpenAI released a very interesting new feature: Structured Outputs.

When you are building applications that integrate large language models with other systems or use LLMs to call functions and execute tasks, having predictable outputs becomes crucial. And this is where OpenAI's new feature comes into play.

Previously, the way to solve this was either to use JSON mode with the output schema defined in the model prompt, or use function calling with a specific function definition and schema, assuming your LLM provider supported these capabilities.

These methods generally worked well but none of them guaranteed that the model's response would conform to the defined schema, and it could be challenging to make it work for complex schemas. This made LLM-based applications more unpredictable and error-prone, and made complex integrations and automation systems a big challenge.

The new Structured Outputs feature changes this completely. Using a technique called constrained decoding, where the model's output is constrained to only the tokens that are valid according to the defined schema, OpenAI's team has managed to make structured outputs score a perfect 100% when following complex JSON schemas.

In this blog post we are going to learn more about this topic and explore practical examples of how to use structured outputs for Retrieval-Augmented Generation (RAG) and data extraction. We will also explore the Pydantic library, the OpenAI Python SDK and its helpful parsing functions for handling structured outputs. And finally, we will look at an essential feature of LLM and RAG apps in production: streaming.

#Structured Outputs with Function Calling for RAG

One of the ways to use OpenAI's structured outputs is via function calling, which allows us to connect models to external tools and systems. Function calling was already available before, but now by setting strict: true in the function definition we can enable structured outputs and ensure that the model response adheres exactly to the function schema. Let's now explore how we can apply this to a RAG application.

In a previous post, we introduced Retrieval-Augmented Generation (RAG) as a very powerful technique to combine the power of LLMs with external knowledge retrieval. The core idea of RAG is to take a user query, retrieve relevant information to that query from a knowledge base (using semantic search with vector embeddings or more traditional keyword-based search methods), and then use the retrieved information as augmented context for the LLM to generate an answer to the user query.

In the basic RAG system we built, we performed information retrieval for each user question in isolation without considering the conversation history. But we can build a more powerful and intelligent system using function calling. By providing the LLM with a knowledge base query tool, we can let the LLM decide when and how to use the tool.

This is more powerful for two different reasons:

The LLM can selectively decide when to use the retrieval tool depending on the user message and the prompt instructions (for example, it would not make sense to retrieve information for a greetings or unrelated messages).
The LLM can intelligently rephrase the user question when necessary using the conversation history. This is critical for follow-up questions.

Let's now see a practical example of how we could use the new function calling with structured outputs feature in a Retrieval-Augmented Generation (RAG) app.

To define schemas for structured outputs, we are going to use Pydantic, a great Python library for data parsing and validation. We can easily define our knowledge base query tool with the following Pydantic model:

from pydantic import BaseModel, Field

class QueryKnowledgeBaseTool(BaseModel):
    """Query the knowledge base to answer the user questions."""
    query_input: str = Field(description='The natural language query input. The query input should be clear and standalone.')

    def __call__(self):
        chunks = search_knowledge_base(self.query_input)
        return '\n\n---\n\n'.join(chunks) + '\n\n---'

In this case, the tool definition is very simple and takes a single parameter: the query input string. The model also includes a __call__ method to make the tool callable and perform the search using the query input. This is not strictly necessary, but it is very convenient to have it all encapsulated in the same class.

OpenAI's Python SDK and its parsing helpers accept Pydantic models as input, so generating the required JSON schema to include the tool in an OpenAI's API call is as simple as:

from openai import pydantic_function_tool

pydantic_function_tool(QueryKnowledgeBaseTool)

And it returns the following schema:

{
  "type": "function",
  "function": {
    "name": "QueryKnowledgeBaseTool",
    "strict": true,
    "parameters": {
      "description": "Query the knowledge base to answer the user questions.",
      "properties": {
        "query_input": {
          "description": "The natural language query input. The query input should be clear and standalone.",
          "title": "Query Input",
          "type": "string"
        }
      },
      "required": [
        "query_input"
      ],
      "title": "QueryKnowledgeBaseTool",
      "type": "object",
      "additionalProperties": false
    },
    "description": "Query the knowledge base to answer the user questions."
  }
}

Notice how it's using the very concise Pydantic model to generate the JSON schema with:

The class name QueryKnowledgeBaseTool as the function name.
The docstring as the function description.
The query_input field as the function parameter, including its description.
The new strict: true option to enable the structured outputs feature.

The function and parameters descriptions are very important as they provide additional information to the LLM about what the function does and how to call it.

We can now use OpenAI's GPT-4o-mini model with this tool to answer user questions about a company using a knowledge base:

from openai import OpenAI

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

SYSTEM_PROMPT = """
You are an AI assistant for TechPro Solutions, a company specializing in IT services and solutions.
Your role is to answer user questions about the company's services and offerings accurately and professionally.
To ensure you provide the most up-to-date and accurate information, always use the QueryKnowledgeBaseTool to retrieve relevant information before answering user queries. You are a reliable assistant and your answers must always be based on truth.
"""

messages = [
    {'role': 'system', 'content': SYSTEM_PROMPT},
    {'role': 'user', 'content': 'What services do you provide?'}
]

response = client.beta.chat.completions.parse(
    model='gpt-4o-mini',
    messages=messages,
    tools=[pydantic_function_tool(QueryKnowledgeBaseTool)],
)
content = response.choices[0].message.content
tool_call = response.choices[0].message.tool_calls[0]

print(content)
# None

print(tool_call.function.name)
# 'QueryKnowledgeBaseTool'

print(tool_call.function.parsed_arguments)
# query_input='What services does TechPro Solutions provide?'

As you can see above, the LLM correctly identifies that it needs to call the QueryKnowledgeBaseTool to retrieve the relevant information to answer the user query. So instead of returning a text response in the message content property as usual, it returns a tool_calls property with the function to call and the arguments to pass. Note also how it rephrases the user query to include the company name and make it clearer.

Additionally, the client.beta.chat.completions.parse() method provided by the SDK (a wrapper over the usual client.chat.completions.create() method) automatically parses the response tool call into an instance of our Pydantic model. This is possible because we are using the structured outputs feature and passing the Pydantic model as input.

As a result, the tool_call.function.parsed_arguments property contains an actual instance of QueryKnowledgeBaseTool, with the query_input parameter filled in. And thanks to the __call__ method defined earlier, we can simply call the tool to query the knowledge base with the provided query input:

# kb_tool is an instance of QueryKnowledgeBaseTool
kb_tool = tool_call.function.parsed_arguments

print(kb_tool.query_input)
# 'What services does TechPro Solutions provide?'

# The __call__ method allows us to call the tool to perform the query
kb_result = kb_tool()

And we can run the final RAG step by adding a message with the tool response and calling the model so that it can answer the user query with the retrieved information:

messages.append(
    {'role': 'tool', 'tool_call_id': tool_calls[0].id, 'content': kb_result}
)
response = client.beta.chat.completions.parse(
    model='gpt-4o-mini',
    messages=messages
)
content = response.choices[0].message.content
print(content)

In this final step we do not need to provide tools to the model, as we just want it to give an answer to the user query using the retrieved information in the tool message. And it will output the answer in the message content property.

#Structured Outputs with JSON responses for Data Extraction

The second way to use structured outputs is by providing a JSON schema to the new response_format API parameter. This is useful when you want to return the LLM response in a structured JSON format, which was previously done using JSON mode.

Let's take as an example the invoice processing automation system we built in a previous post, where we used GPT-4o's vision capabilities to extract relevant data from invoice images following a specific schema. In that system, we defined the schema in the model prompt and used JSON mode to extract the data according to the schema. We then validated the data with a Pydantic model.

With structured outputs, we can now pass the same Pydantic model directly in the response_format parameter to define the output schema and automatically parse the response, just like we did in the function calling example. And unlike JSON mode, we are guaranteed that the model response will conform strictly to the schema.

We can do all of that in a single call with the same client.beta.chat.completions.parse() method we used before:

messages = [
    {'role': 'system', 'content': INVOICE_EXTRACTION_SYSTEM_PROMPT},
    {'role': 'user', 'content': [
		    {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{invoice_img}'}}
		]}
]

response = client.beta.chat.completions.parse(
    model='gpt-4o-mini',
    messages=messages,
    response_format=Invoice,
)

message = completion.choices[0].message
if message.parsed:
    print(message.parsed)

The message.parsed property will therefore return an Invoice instance with its fields populated with the extracted data.

#Streaming Structured Outputs

In production LLM and RAG applications, streaming is essential to provide the best user experience. By streaming the model's output, we can start displaying results to users immediately, rather than waiting for the response to be fully generated. This significantly improves the perceived speed of the applications and provides a more responsive and interactive experience.

Let's return to our previous RAG example and add streaming using the client.beta.chat.completions.stream() method provided by the OpenAI Python SDK:

client = AsyncOpenAI(api_key=os.environ['OPENAI_API_KEY'])

SYSTEM_PROMPT = """
You are an AI assistant for TechPro Solutions, a company specializing in IT services and solutions.
Your role is to answer user questions about the company's services and offerings accurately and professionally.
To ensure you provide the most up-to-date and accurate information, always use the QueryKnowledgeBaseTool to retrieve relevant information before answering user queries. You are a reliable assistant and your answers must always be based on truth.
"""

messages = [
    {'role': 'system', 'content': SYSTEM_PROMPT},
    {'role': 'user', 'content': 'What services do you provide?'}
]

content = None
tool_calls = []

async with client.beta.chat.completions.stream(
    model='gpt-4o-mini',
    messages=messages,
    tools=[pydantic_function_tool(QueryKnowledgeBaseTool)],
) as stream:
    async for event in stream:
        if event.type == 'content.delta':
            print(event.delta, flush=True, end='')
        elif event.type == 'content.done':
            content = event.content
        elif event.type == 'tool_calls.function.arguments.done':
            tool_calls.append({'name': event.name, 'parsed_arguments': event.parsed_arguments})

print(content)
# None

print(tool_calls[0]['name'])
# QueryKnowledgeBaseTool

print(tool_calls[0]['parsed_arguments'])
# query_input='list of services provided by TechPro Solutions'

This method simplifies the parsing and handling of streaming responses, providing different events to track the generation progress:

content.delta: Allows you to view and print new content as it's generated, providing immediate feedback to the user.
content.done: Signals when content generation is complete, giving you access to the full content.
tool_calls.function.arguments.done: Provides parsed tool calls as soon as their arguments are complete. And as before, the parsed_arguments property contains an instance of our QueryKnowledgeBaseTool Pydantic model.

In this specific example, the LLM decides to call the knowledge base query tool instead of generating a text answer. As a result, you won't see streaming content deltas. However, if you call the model again appending a message with the tool result (as we did in the non-streaming version), you will see the final answer streamed chunk by chunk.

There are other useful events like chunk (which provides every single response chunk as it arrives) and tool_calls.function.arguments.delta (which provides parts of a tool call's arguments as they are generated). You can learn more about the parsing methods and the streaming events in the OpenAI Python SDK repository.

For convenience, the SDK also provides a get_final_completion method that returns the final accumulated response once it has been fully generated. This allows us to easily obtain the same final results as in the non-streaming version:

response = await stream.get_final_completion()

content = response.choices[0].message.content
tool_call = response.choices[0].message.tool_calls[0]

print(content)
# None

print(tool_call.function.name)
# QueryKnowledgeBaseTool

print(tool_call.function.parsed_arguments)
# query_input='What services does TechPro Solutions provide?'

As we have seen in this post, combining the new structured outputs feature with function calling, JSON responses and streaming can significantly improve RAG applications and LLM-based systems in general.

This overview also serves as an introduction to some key concepts and tools we'll use in a future blog post, where we'll build a production-ready RAG chatbot application. Stay tuned!