logo

Automating Invoice Processing with GPT-4o

Extract invoice data with AI vision, analyze financials and generate an Excel report in Python

July 29, 2024

Running a business very often means dealing with a mountain of administrative tasks, paperwork, invoices and countless chores that eat up an insane amount of hours that you could be otherwise dedicating to improving and growing the business. But what is still perhaps not widely known is the incredible potential of the new AI models combined with some simple programming to automate away a lot of these tasks.

In this post, I will demonstrate this idea by building a simple demo app in Python that uses the vision capabilities of OpenAI's most advanced model, GPT-4o, to automate the extraction of data from multiple invoices simultaneously, use that data to perform basic financial summaries, and then create an Excel report with all the data collected and the generated summaries.

While we will look at the code and the technical details, don't worry if you are not a programmer. The concepts behind this app are relevant to anyone interested in automating and optimizing their business operations. And these techniques are not just useful for invoice processing but can be generalized to all kinds of documents and paperwork. By the end of this post, you'll have new insights on how AI can transform routine administrative tasks, potentially saving your business time and reducing errors.

#Overview of the Invoice Processing App

Before we look at the code, let's take a step back and see what the app is actually doing. This will give you a clearer picture of how AI can transform a manual process into an automated workflow.

The app goes through three main stages:

  1. Data extraction.
  2. Financial analysis.
  3. Report generation.

The DATA EXTRACTION stage is where the magic happens. OpenAI's GPT-4o is a very advanced multimodal model, which means that it can process text, audio and images in real-time. This is incredibly powerful and the possibilities are endless. If you haven't already, check out some of the examples in OpenAI's website. This means we can feed it images and ask the model to describe them, analyze them or extract information from them (such as all the relevant details in an invoice).

And not only that, you can also force the model to return the information in a structured format like JSON using the JSON mode. A simple text response is great when building a text-based assistant, but if you want to build a more complex application that integrates the model with other systems and your own code, it's essential to have the output of the model in a structured format that machines understand and that is much easier to handle with code.

Let's look at an example of an invoice and the extracted output:

Invoice example

{
"invoice_number": "DT-45678",
"invoice_date": "2024-01-07",
"invoice_type": "incoming",
"issuer": {
"name": "DevTools Pro",
"address": "789 Coder's Court, Austin, TX 78701",
"phone": "+1 (512) 555-3456",
"email": "accounts@devtoolspro.com"
},
"recipient": {
"name": "TechNova Solutions, Inc.",
"address": "1250 Charleston Road, Mountain View, CA 94043, United States",
"phone": "+1 (650) 555-0123",
"email": "info@technovasolutions.com"
},
"invoice_items": [
{
"description": "Annual Team License Renewal (25 users)",
"total": 8500.00
},
{
"description": "Premium Support Package",
"total": 2000.00
}
],
"subtotal": 10500.00,
"tax_rate": 8.25,
"tax": 866.25,
"total": 11366.25,
"terms": "Payment is due within 30 days"
}

As you can see, it does a perfect extraction job. It's even able to extract the invoice number that appears as vertical text on the left side. And how do we instruct the model to perform the data extraction and output the data with this particular structure? With the following prompt:

You are an AI assistant specialized in extracting information from invoice images for the company "TechNova Solutions, Inc.".
Your task is to analyze the given invoice image and extract the relevant information according to the specified JSON structure below.
If a field is not present in the invoice or cannot be determined leave it as null.
If the invoice date format is ambiguous, assume it's the international date format dd-mm-yyyy.

Provide the extracted information in this JSON structure:

{
"invoice_number": "string",
"invoice_date": "YYYY-MM-DD",
"invoice_type": "incoming | outgoing",
"issuer": {
"name": "string",
"address": "string",
"phone": "string",
"email": "string"
},
"recipient": {
"name": "string",
"address": "string",
"phone": "string",
"email": "string"
},
"invoice_items": [
{
"description": "string",
"total": "number"
}
],
"subtotal": "number",
"tax_rate": "number (percentage)",
"tax": "number",
"total": "number",
"terms": "string"
}

For the "invoice_type" field:

  • Set it to "incoming" if the company "TechNova Solutions, Inc." is the recipient of the invoice.
  • Set it to "outgoing" if the company "TechNova Solutions, Inc." is the issuer of the invoice.
  • Set it to null if the company "TechNova Solutions, Inc." does not appear as issuer or recipient.

You can try it yourself in ChatGPT with your own invoices. Notice also how we are specifying the name of the company so that the model is able to categorize the invoice as “incoming” (invoiced to us) or “outgoing” (invoiced to a client). This is very useful for the financial analysis we are doing next.

In the ANALYSIS STAGE, the application uses the extracted data to calculate totals like invoiced revenues (from the outgoing invoices), invoiced expenses (from the incoming invoices), net income, taxes and other metrics. The app also performs a month-by-month breakdown, calculating the same metrics but on a monthly basis. This step turns a pile of individual invoices into a basic financial picture of your business.

Finally, in the REPORT GENERATION stage, the application takes all this processed data and creates a properly-formatted Excel report with two sheets. The 'Invoices' sheet shows the extracted invoice data:

Excel report: Invoices sheet

Excel report: Invoices sheet

And the 'Summary' sheet displays the financial summaries and an automatically generated chart:

Excel report: Summary sheet

The whole process, from feeding in the invoice PDFs to getting the final Excel report, takes about 10-15 seconds. Imagine doing all of this manually!

And what about other kinds of documents and paperwork? The beauty of this approach is its flexibility. With some tweaks, you could adapt this system to process receipts, purchase orders, shipping manifests, employee timesheets, customer feedback forms, and so on. The AI can be instructed to look for different types of data depending on the document type.

In the next sections, we will take a look at the Python code used to implement each of these 3 steps. You can find all the code for the app in this GitHub repo.

#Data Extraction with a Vision Model

This is the main function that performs the data extraction:

from tqdm.asyncio import tqdm_asyncio

async def extract_invoices_data(invoices_dir):
invoices_filenames = []
tasks = []
print('Extracting the invoices data')
for filename in sorted(os.listdir(invoices_dir)):
if filename.endswith('.pdf'):
print(filename)
invoices_filenames.append(filename)
pdf_path = os.path.join(invoices_dir, filename)
base64_images = pdf_to_base64_images(pdf_path)
task = extract_invoice_data(base64_images[0])
tasks.append(task)
invoices_json = await tqdm_asyncio.gather(*tasks)
return invoices_filenames, invoices_json

It loads all the PDF files in the given invoices directory, converts the PDFs to Base64 encoded images (the format that GPT-4o accepts) and then extracts the invoice data in JSON format using the model.

We are using Python's asyncio library to run the code asynchronously. This allows us to extract data from all invoices simultaneously using asyncio.gather, rather than processing them sequentially one after the other. By calling the AI model for all invoices in parallel, we significantly reduce the overall processing time. This makes a huge difference when processing a large number of invoices.

Note that we are using tqdm's asyncio wrapper, instead of asyncio.gather directly, in order to display a progress bar during the data extraction.

This is the function that converts the PDF into Base64 encoded images, using the pdf2image library:

from pdf2image import convert_from_path

def pdf_to_base64_images(pdf_path):
imgs = convert_from_path(pdf_path, fmt='png')
base64_imgs = []
for image in imgs:
buffered = BytesIO()
image.save(buffered, format='PNG')
img_str = base64.b64encode(buffered.getvalue()).decode()
base64_imgs.append(img_str)

return base64_imgs

And this is the function that calls the GPT-4o model asynchronously to extract the data for each invoice:

from openai import AsyncOpenAI

openai_client = AsyncOpenAI(api_key=settings.OPENAI_API_KEY)

async def extract_invoice_data(base64_img):
response = await openai_client.chat.completions.create(
model='gpt-4o',
response_format={'type': 'json_object'},
messages=[
{'role': 'system', 'content': SYSTEM_PROMPT.format(company=settings.COMPANY_NAME)},
{'role': 'user', 'content': [
{'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{base64_img}'}}
]}
],
temperature=0.1,
)
return response.choices[0].message.content

When calling the model, we are using the same prompt shown earlier. Notice how we are using Python's format function to add the company name (defined in the application settings) to the prompt. Notice also the use of the response_format={'type': 'json_object'} option to enable JSON mode. You can learn more about these parameters in OpenAI's documentation.

#Invoice Data Validation and DataFrame Creation

A crucial step when generating structured output with an AI model is ensuring that the data is correct and in the right format. In the Python ecosystem, there is an amazing library called Pydantic just for that purpose. Pydantic is widely used for data parsing and validation in backend applications, APIs development (it's a key pillar of FastAPI) and data science. It has very powerful functionalities and it's really fast.

Pydantic uses Python's type hints to define data schemas. Let's take a look at the schema we are using to validate the invoices data according to the structure specified in the prompt:

from pydantic import BaseModel, ConfigDict

class InvoiceType(str, Enum):
incoming = 'incoming'
outgoing = 'outgoing'

class InvoiceItem(BaseModel):
description: str
total: float

class Company(BaseModel):
name: str
address: str
phone: str | None = None
email: str | None = None

class Invoice(BaseModel):
model_config = ConfigDict(use_enum_values=True)

invoice_number: str
invoice_date: date
invoice_type: InvoiceType | None = None
issuer: Company
recipient: Company
invoice_items: list[InvoiceItem]
subtotal: float
tax_rate: float = 0
tax: float = 0
total: float
terms: str | None = None

If you are new to this syntax, I recommend checking out Pydantic's documentation for a deeper dive. It's also worth being familiar with Python's type hints and FastAPI's documentation includes a great introduction to them.

Essentially, the code above is a simple way to define in Python the same structure we instructed GPT-4o to follow when extracting the invoice data. Many of the fields are defined as required (for instance, invoice_number: str), while there are others that are optional and include a default value (like phone: str | None = None).

Once you have a Pydantic model defined, like Invoice above, you can perform data validation with a single line of code:

validated_invoice = Invoice(**invoice_data)

After validating the invoice data, the next step is to transform it into a format that makes it easier to analyze, manipulate and eventually display the results in an Excel report. This is where Pandas comes in. Pandas is a powerful data manipulation library for Python, widely used in data science and analytics.

Converting the invoice data into a Pandas DataFrame has two advantages:

  1. It gives access to a wide range of functions for data aggregation, filtering and computation, which we will use for our financial analysis and summaries.
  2. It represents data in a table-like structure, which is perfect to generate our final Excel report.

Now that we have introduced the main elements of this section, let's take a look at the actual function that parses and validates the invoice data, and then transforms it into a DataFrame:

import json
import pandas as pd

def build_invoices_dataframe(invoices_filenames, invoices_json):
flat_invoices = []
for filename, invoice_json in zip(invoices_filenames, invoices_json):
try:
invoice_data = json.loads(invoice_json)
validated_invoice = Invoice(**invoice_data)
flat_invoice = flatten_invoice_structure(validated_invoice)
flat_invoices.append(flat_invoice)
except json.JSONDecodeError as e:
print(f'JSON parsing error in invoice {filename}: {e}')
except ValidationError as e:
print(f'Validation error in invoice {filename}: {e}')
except Exception as e:
print(f'Unexpected error in invoice {filename}: {e}')

invoices_df = pd.DataFrame(flat_invoices)
invoices_df['Invoice Date'] = pd.to_datetime(invoices_df['Invoice Date'])
invoices_df.insert(2, 'Year-Month', invoices_df['Invoice Date'].dt.to_period('M'))
return invoices_df

A brief summary of what the function is doing:

  1. JSON parsing: Loads the JSON string output from GPT-4o into a Python object.
  2. Data validation: Validates the invoice data according to our Pydantic Invoice model.
  3. Error handling: Catches any parsing or validation errors, ensuring only valid invoices are processed.
  4. Data flattening: Transforms the nested invoice structure into a flat, tabular format using flatten_invoice_structure (you can find this helper function in the repository).
  5. DataFrame creation: Creates a Pandas DataFrame from the flattened invoice data.
  6. Date processing: Converts the 'Invoice Date' field to a datetime format and adds a 'Year-Month' field for the monthly analysis.

#Simple Financial Analysis with Pandas

We can synthesize the whole workflow as: extract → validate → flatten → analyze → report. We have already completed the first three steps. Let's now focus on the function that analyzes the data and generates the financial summaries of the invoices:

def generate_financial_summary(invoices_df):
total_s = pd.Series(dtype='object')

# Total
total_s['Period'] = f"{invoices_df['Invoice Date'].min().strftime('%Y-%m')} to {invoices_df['Invoice Date'].max().strftime('%Y-%m')}"
total_s['Invoices'] = len(invoices_df)

revenue = invoices_df[invoices_df['Invoice Type'] == 'outgoing']['Total'].sum()
expenses = invoices_df[invoices_df['Invoice Type'] == 'incoming']['Total'].sum()
total_s['Revenue'] = revenue
total_s['Expenses'] = expenses
total_s['Net Income'] = revenue - expenses

total_tax_collected = invoices_df[invoices_df['Invoice Type'] == 'outgoing']['Tax'].sum()
total_tax_paid = invoices_df[invoices_df['Invoice Type'] == 'incoming']['Tax'].sum()
total_s['Tax Collected'] = total_tax_collected
total_s['Tax Paid'] = total_tax_paid
total_s['Net Tax'] = total_tax_collected - total_tax_paid

# Monthly
monthly_df = invoices_df.groupby('Year-Month').agg(**{
'Invoices': ('Invoice Number', 'count'),
'Revenue': ('Total', lambda x: x[invoices_df['Invoice Type'] == 'outgoing'].sum()),
'Expenses': ('Total', lambda x: x[invoices_df['Invoice Type'] == 'incoming'].sum()),
'Tax Collected': ('Tax', lambda x: x[invoices_df['Invoice Type'] == 'outgoing'].sum()),
'Tax Paid': ('Tax', lambda x: x[invoices_df['Invoice Type'] == 'incoming'].sum())
})
monthly_df = monthly_df.reset_index()
monthly_df['Year-Month'] = monthly_df['Year-Month'].astype(str)
monthly_df.insert(4, 'Net Income', monthly_df['Revenue'] - monthly_df['Expenses'])
monthly_df['Net Tax'] = monthly_df['Tax Collected'] - monthly_df['Tax Paid']

return total_s, monthly_df.T

If you have never used Pandas before, the code might look confusing, but what it's doing is actually quite simple. The function generates two main outputs: a total summary and a monthly breakdown of our invoice data.

For the total summary, we use a one-dimensional Pandas Series, perfect for storing a single row of summary data. This includes the total period covered, number of invoices, invoiced revenue, invoiced expenses, net income, and tax information. To calculate these metrics, we need to sum up the relevant data from the invoice DataFrame, distinguishing between incoming and outgoing invoices.

For the monthly summary, we use a two-dimensional Pandas DataFrame (just like for the invoice data), which is ideal to represent multiple rows (months) with multiple columns (financial metrics). We use Pandas' groupby function (which is very similar to the SQL GROUP BY statement) to aggregate the invoice data by month, calculating the same metrics as in the total summary but on a month-by-month basis.

The power of Pandas is obvious here. With only a few lines of code, we are able to perform complex calculations across the whole dataset of invoices, transforming raw invoice data into financial insights. And now we are ready to display it all in an Excel report.

#Excel Report Creation

The final step in the invoice processing pipeline is creating an Excel report. Using Pandas and the xlsxwriter library, we create an Excel workbook with two sheets: the 'Invoices' sheet contains all the raw invoice data, and the 'Summary' sheet displays the financial analysis calculated in the previous step.

This is how the function looks like:

def create_excel_report(invoices_df, total_s, monthly_df, filepath):
with pd.ExcelWriter(filepath, engine='xlsxwriter', datetime_format='YYYY-MM-DD') as writer:
# Write invoice data sheet
invoices_df.to_excel(writer, sheet_name='Invoices', index=False)

# Create summary sheet
workbook = writer.book
worksheet = workbook.add_worksheet('Summary')

# Define formats
title_format = workbook.add_format({'bold': True, 'font_size': 14})
money_format = workbook.add_format({'num_format': '#,##0.00'})

# Write total summary
worksheet.write('A1', 'TOTAL', title_format)
total_s.to_frame().to_excel(writer, sheet_name='Summary', startrow=1, startcol=0, header=False)

# Write monthly summary
worksheet.write('A12', 'MONTHLY', title_format)
monthly_df.to_excel(writer, sheet_name='Summary', startrow=12, startcol=0, header=False)

# Money formatting
money_rows = list(range(3, 9)) + list(range(14, 20))
for row in money_rows:
worksheet.set_row(row, None, money_format)

# Create chart
chart = workbook.add_chart({'type': 'line'})

num_cols = monthly_df.shape[1]
for i, metric in enumerate(['Revenue', 'Expenses', 'Net Income']):
chart.add_series({
'name': metric,
'categories': ['Summary', 12, 1, 12, num_cols],
'values': ['Summary', 14 + i, 1, 14 + i, num_cols],
})

chart.set_x_axis({'name': 'Month'})
chart.set_y_axis({'name': 'Amount', 'major_gridlines': {'visible': False}})

worksheet.insert_chart('A23', chart)

As you can see above, displaying a Pandas DataFrame in an Excel sheet is as simple as: invoices_df.to_excel(writer, sheet_name='Invoices', index=False). And if we want to display the data as columns and starting at a specific cell (like we do in the 'Summary' sheet) we can do it like this: monthly_df.to_excel(writer, sheet_name='Summary', startrow=12, startcol=0, header=False).

We also apply some basic formatting to make the report more readable, such as bold titles and proper number formatting for monetary values. And at the end, we generate a line chart displaying the invoiced revenues, expenses and net income over time.

It is a very basic financial analysis and report, but you can use it as a blueprint to do much more. Depending on your business needs, you could easily extend this code to include more complex financial analyses, additional charts, conditional formatting and other worksheets with more detailed breakdowns.

#Putting It All Together

The only missing part now is putting together the whole invoice processing pipeline using the functions created in the previous sections. And we need to call it using asyncio because it contains asynchronous code. This is how it looks like:

async def process_invoices():
invoices_filenames, invoices_json = await extract_invoices_data(settings.INVOICES_DIR)
invoices_df = build_invoices_dataframe(invoices_filenames, invoices_json)
total_s, monthly_df = generate_financial_summary(invoices_df)
create_excel_report(invoices_df, total_s, monthly_df, settings.REPORT_FILEPATH)

def main():
asyncio.run(process_invoices())

This simple demo shows the immense potential of AI-powered automation to transform tedious business tasks. In not many lines of code, we have automated a task that would probably require hours of manual work.

Remember that you can access all the code here, and you can apply and extend it to your own business processes. What will you automate next?