Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) are changing how we analyze and interact with data. But to unlock their full potential, especially in finance, we need to make financial reports machine-readable. Global filings—like annual reports, quarterly earnings, press releases, and shareholder notices—are packed with valuable insights. Unfortunately, they often come in formats like PDFs or HTML that aren’t AI-friendly out of the box.
By converting these unstructured documents into structured, searchable data, we enable LLMs to more accurately retrieve facts and generate insights. This post explains how to build that end-to-end pipeline—from collecting raw documents to making them usable for RAG systems. We’ll also explore the technical challenges along the way and share best practices so data engineers and AI developers can avoid reinventing the wheel.
The first challenge is collecting financial reports from all relevant sources. Public companies around the world publish disclosures through various channels—regulatory websites, stock exchange announcements, and/or companies' investor relations pages.
For example:
- In the U.S., filings are published on the SEC's EDGAR system.
- In Canada, they appear on SEDAR+.
- In many other markets, disclosures are posted through stock exchange portals or on each company's own investor relations page.
You’ll need to build connectors or scrapers (or use APIs) for each of these platforms. The formats vary widely—some provide HTML pages, others offer PDFs, plain text, or even structured XBRL data. Your pipeline needs to handle all these formats reliably.
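As a concrete starting point, here is a minimal sketch of a connector for one such source. It uses the SEC EDGAR submissions endpoint (a public JSON API); the CIK and User-Agent values are placeholders, and other sources, such as exchange portals or investor relations pages, will each need their own connector or scraper.

```python
import requests

# Placeholder CIK (zero-padded to 10 digits) and User-Agent; the SEC asks that
# requests identify your application and a contact address.
CIK = "0000320193"
HEADERS = {"User-Agent": "example-app contact@example.com"}

resp = requests.get(f"https://data.sec.gov/submissions/CIK{CIK}.json", headers=HEADERS)
resp.raise_for_status()
recent = resp.json()["filings"]["recent"]

# The response holds parallel arrays describing each recent filing.
for form, filed, accession in zip(recent["form"], recent["filingDate"], recent["accessionNumber"]):
    print(form, filed, accession)
```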
Once documents are fetched, they need to be grouped and stored systematically. A typical approach is to organize filings by company, date, and report type. For instance, all of Company X’s filings can be stored under its identifier, with sub-folders or tags for annual reports vs. interim reports, etc. Consistent organization ensures you don’t lose track of a document and can easily retrieve the latest filings (e.g. fetching a Q2 2025 report as soon as it’s released). Reliability is crucial – missing a filing or pulling incomplete data can skew analysis. Many teams use document databases or cloud storage to manage the raw files, and may also pre-store extracted text for quick search.
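A simple, deterministic naming convention goes a long way here. The sketch below shows one possible layout; the folder structure and identifiers are illustrative conventions, not a standard.

```python
from datetime import date
from pathlib import Path

def storage_key(company_id: str, report_type: str, period_end: date, filename: str) -> Path:
    """Build a predictable location like raw/COMPANYX/annual-report/2025-06-30/report.pdf."""
    return Path("raw") / company_id / report_type / period_end.isoformat() / filename

print(storage_key("COMPANYX", "annual-report", date(2025, 6, 30), "report.pdf"))
```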
Not all filings are alike. A 10-K (U.S. annual report) is very different from a press release or a shareholder letter. To parse and analyze documents effectively, it’s essential to first classify them by type.
Some examples:
- Annual reports (e.g., the 10-K in the U.S.)
- Quarterly or interim reports (e.g., the 10-Q)
- Earnings press releases and current reports (e.g., the 8-K)
- Shareholder letters and meeting notices
Proper classification allows you to apply specific parsing rules and tailor AI queries. For example, you might look for a "Risk Factors" section in an annual report—but not in a press release. It also improves search: analysts can filter results by report type to answer targeted questions (e.g., only review annual reports for a five-year performance trend).
Some sources provide helpful metadata—for instance, EDGAR tags each filing with a form type like 10-K or 8-K. But if you’re scraping PDFs from an IR site, you might need to infer the type from the document’s title or contents (e.g., “Q1 2024 Earnings” on the cover page).
You can start with simple keyword rules, then move to more advanced NLP models for trickier cases. It’s worth the effort—correct classification makes your entire RAG system smarter by guiding it to the most relevant sources of information.
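A keyword-rule baseline can be as simple as the sketch below. The patterns and labels are illustrative; when the source already supplies a form type (as EDGAR does), use that directly instead.

```python
import re

# Illustrative rules: each pattern maps a title or first-page snippet to a label.
RULES = [
    (r"\b10-K\b|annual report", "annual_report"),
    (r"\b10-Q\b|quarterly report|interim report", "interim_report"),
    (r"earnings|press release", "press_release"),
    (r"proxy|shareholder", "shareholder_notice"),
]

def classify(text: str) -> str:
    head = text[:2000]  # the title or cover page is usually enough
    for pattern, label in RULES:
        if re.search(pattern, head, flags=re.IGNORECASE):
            return label
    return "unknown"

print(classify("Q1 2024 Earnings Press Release"))  # -> press_release
```

When keyword rules start to misfire (for example, a press release that quotes its own annual report), that is usually the point to move to a trained text classifier.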
With documents in hand and properly labeled, the next major challenge is converting their content into a machine-readable format. Financial filings are primarily intended for human consumption, not for automated data extraction. This makes parsing especially difficult—PDFs in particular pose serious problems. Their layouts often include multiple columns, complex tables, footnotes, headers and footers, and embedded images, all of which can disrupt straightforward text extraction. In many cases, what appears to be text is actually just an image of scanned pages, leaving nothing for a text parser to work with.
Even when actual text is present, simply using tools like PDFMiner or PyPDF often produces disordered output: lines are jumbled, context is lost, and structural cues like columns or spacing are ignored. Important formatting—such as bold or italic text, headings, bullet points, and table layouts—is stripped away, removing the visual signals that add semantic meaning.
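You can see the problem by running a plain extraction yourself. This minimal sketch uses the pypdf library on a placeholder file path; the output is a flat stream of text with no reading order, styling, or table structure preserved.

```python
from pypdf import PdfReader  # pip install pypdf

reader = PdfReader("filings/companyx/2024-annual-report.pdf")  # placeholder path
raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(raw_text[:500])  # columns, tables, and headings arrive as undifferentiated text
```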
To retain as much structure and meaning as possible, a better approach is to convert PDFs into a rich, structured text format like Markdown or HTML. Markdown is especially useful due to its simplicity and compatibility with language models—it supports headings, lists, tables, and hyperlinks with a minimal syntax. By converting a report into Markdown, you preserve its logical flow and hierarchical structure. Section titles become headers (e.g., `## Risk Factors`), bullet points remain intact, and tables can be represented in Markdown syntax. This formatting not only improves readability but also provides natural chunking boundaries for indexing and downstream AI processing.
Achieving this level of structure, however, is not trivial. Standard extraction libraries can help retrieve raw text, but they often require enhancement to detect layout elements and reconstruct structure. Modern tools increasingly use machine learning or vision models to interpret the visual layout of a page—some, like Google’s Gemini or other vision-enabled LLMs, can convert pages directly into Markdown by analyzing them the way a human would. In practice, a robust pipeline might combine several techniques: start with text extraction, then apply heuristics or ML models to identify headings, detect tables, and reformat the content accordingly. It’s also important to clean the output at this stage—for example, by removing repetitive headers and footers, and merging hyphenated words split across lines.
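As one small example of such a heuristic, the sketch below promotes likely section titles (all-caps lines or "Item 1A."-style markers) to Markdown headings. Real pipelines usually combine this with font-size or layout signals, or replace it with an ML layout model; the regular expression here is only illustrative.

```python
import re

def lines_to_markdown(lines: list[str]) -> str:
    """Promote probable section titles to Markdown headings (rough heuristic)."""
    out = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            out.append("")
        elif re.match(r"^(ITEM\s+\d+[A-Z]?\.|[A-Z][A-Z &'\-]{4,})$", stripped):
            out.append(f"## {stripped.title()}")  # short all-caps line becomes a heading
        else:
            out.append(stripped)
    return "\n".join(out)

print(lines_to_markdown(["RISK FACTORS", "", "We face intense competition in all markets."]))
```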
A particularly thorny issue arises with scanned documents and embedded images, which are common in international or older filings. These require Optical Character Recognition (OCR) to convert images of text into actual text data. OCR tools like Tesseract or commercial APIs can perform this conversion, but they typically ignore layout, resulting in large blocks of unstructured text. For example, OCR might read a multi-column page straight across, blending two columns together, or misinterpret a table’s layout. To address this, OCR must be followed by post-processing to detect and preserve structure—such as identifying columns by analyzing text position gaps or reconstructing table cells from raw OCR output.
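A minimal OCR pass, before any structure recovery, might look like the sketch below. It assumes pdf2image (which needs Poppler installed) and pytesseract (which needs the Tesseract binary); the file path is a placeholder, and column or table reconstruction would still have to follow.

```python
from pdf2image import convert_from_path  # pip install pdf2image
import pytesseract                        # pip install pytesseract

# Render each scanned page to an image, then OCR it to plain text.
pages = convert_from_path("filings/companyx/scanned-annual-report.pdf", dpi=300)
ocr_text = "\n\n".join(pytesseract.image_to_string(page) for page in pages)
print(ocr_text[:500])  # plain text only: layout must be rebuilt in post-processing
```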
OCR also introduces the risk of misread characters, especially in documents with poor scan quality or decorative fonts. For example, it might confuse an “8” with a “3.” To improve accuracy, it’s helpful to cross-verify critical numbers—like totals or key metrics—against expected values as a form of error checking.
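A lightweight version of that check is to reconcile extracted line items against the extracted total, as in this sketch (the figures are made up for illustration):

```python
# Values as OCR'd from a revenue table; a mismatch suggests a misread digit.
line_items = {"Product revenue": 1240.0, "Service revenue": 310.0}
reported_total = 1550.0

if abs(sum(line_items.values()) - reported_total) > 0.5:
    print("Possible OCR misread: line items do not sum to the reported total")
else:
    print("Totals reconcile")
```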
Visual content presents yet another challenge. Annual reports often include charts, graphs, or infographics that convey key information non-textually. While this content can't be directly parsed as text, it shouldn’t be discarded. One strategy is to insert image placeholders or generate brief textual descriptions that summarize what the image conveys. For instance, a chart showing revenue versus profit might be described in Markdown with a caption like: “Chart: Revenue increased 12% year-over-year while profit remained flat.” These captions can be created manually for high-value visuals or generated using image captioning models. At minimum, preserving figure titles and labels as text provides context for downstream AI systems. Fully extracting data from charts remains complex and may require specialized tools—only worthwhile if there's significant value in the visual content.
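Whatever the source of the caption, keeping the placeholder format consistent makes these figures easy to locate later. A trivial helper might look like this; the blockquote convention is just one option.

```python
def figure_placeholder(title: str, caption: str = "") -> str:
    """Emit a Markdown placeholder for a figure that cannot be parsed as text."""
    text = f"> **Figure:** {title}"
    if caption:
        text += f" - {caption}"  # caption may be hand-written or model-generated
    return text

print(figure_placeholder("Revenue vs. profit, FY2020 to FY2024",
                         "Revenue increased 12% year-over-year while profit remained flat."))
```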
Once you have the filings converted into structured, clean text, the final preparation step is chunking the documents for indexing. Chunking means splitting a long document into smaller, self-contained pieces (chunks) that are easier for an AI to handle. RAG pipelines typically work by retrieving relevant chunks in response to a query, rather than feeding an entire 200-page report to the model. Effective chunking both improves search relevance and ensures each chunk fits within the context window of an LLM. The key is to split on logical boundaries so that each chunk covers a coherent topic or section of the report.
A good approach is to leverage the document’s structure when chunking. Since our parsing preserved headings and sections, we can start by chunking at major section breaks. For example, a 10-K could be initially divided into chunks for “Business Overview,” “Risk Factors,” “Financial Statements,” “Management’s Discussion & Analysis,” etc., using the headings in the Markdown as guides. Within very large sections, further subdivide into smaller chunks – perhaps by subheadings or paragraphs. A rule of thumb is that each chunk should represent a standalone idea or data set. For instance, each risk factor bullet could be one chunk, and each financial table (along with its accompanying text) could be another chunk. This granularity ensures that when a query is about, say, debt covenants, the retrieval step can pinpoint the specific risk factor paragraph about debt covenants, rather than pulling an entire 30-page section.
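A sketch of that heading-driven splitting, with a crude size cap for oversized sections, might look like this. The 4,000-character limit is an arbitrary stand-in for whatever budget your embedding model or LLM context imposes.

```python
import re

def chunk_markdown(markdown: str, max_chars: int = 4000) -> list[str]:
    """Split a Markdown filing at headings, then cap each chunk's length."""
    sections = re.split(r"\n(?=#{1,3} )", markdown)  # keep every heading with its body
    chunks = []
    for section in sections:
        current = ""
        for para in section.split("\n\n"):
            if current and len(current) + len(para) > max_chars:
                chunks.append(current.strip())
                current = ""
            current += para + "\n\n"
        if current.strip():
            chunks.append(current.strip())
    return chunks
```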
During chunking, it’s also wise to clean and enrich the text a bit more. This might include removing any residual artifacts (like page numbers). Additionally, adding metadata to each chunk greatly boosts the usefulness of your index. Metadata could include the company name, the filing type (e.g. “10-K” or “Annual Report”), the year or quarter, and the section title it came from. With such tags, a RAG system can later filter or prioritize results. For example, if a user query asks for 2023 revenue, the retriever can focus on chunks tagged “2023” and sections related to financial statements, improving both speed and accuracy. Metadata also helps track provenance: if an answer is drawn from a chunk, you’ll know exactly which document and section it originated from – crucial for explainability.
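In practice each chunk becomes a small record rather than a bare string. The field names below are illustrative; align them with whatever your vector store or search index supports for filtering.

```python
chunk_record = {
    "text": "Revenue for fiscal 2023 increased 8% to ...",   # made-up snippet
    "company": "Company X",
    "filing_type": "10-K",
    "fiscal_year": 2023,
    "section": "Management's Discussion & Analysis",
    "source_document": "companyx-10k-2023.pdf",               # provenance for citations
}
```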
Finally, decide on an appropriate chunk size. Chunks might be defined by a maximum number of tokens or sentences (to fit model limits), but avoid breaking in the middle of a sentence or table. Some experimentation is usually needed – overly large chunks may dilute relevance, while overly small chunks may lose context. A common strategy is overlapping chunks (a sliding window, sketched below) to preserve context, but with structured financial filings, sticking to section or paragraph boundaries often suffices since the documents are formally written. By the end of this step, you’ll have a repository of indexed, vectorized chunks ready for retrieval. Your vector database or search index will treat each chunk (with its metadata) as a searchable entry, and the heavy lifting of preparing the data for AI consumption is complete.
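For completeness, here is what that sliding-window strategy looks like. The window and overlap sizes are illustrative, and the sketch assumes you already have a token list from your tokenizer of choice.

```python
def sliding_window(tokens: list[str], size: int = 800, overlap: int = 100) -> list[list[str]]:
    """Overlapping (sliding-window) chunks over a pre-tokenized document."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```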
Transforming global financial reports into machine-readable documents is a complex but rewarding process. We started with raw filings scattered across countries and platforms and ended with a structured knowledge base that an AI assistant can easily draw from. By accessing all relevant sources, classifying documents, preserving structure (headings, tables, images) via Markdown conversion, and intelligently chunking the text, you enable RAG systems to deliver accurate, context-rich answers about a company’s performance or risks. The value added by this pipeline is huge – what was once hundreds of pages of opaque PDF content becomes an interactive database for analysis, comparison, and questioning. Technical teams, especially data engineers and AI practitioners, gain the ability to integrate financial filings into their workflows for tasks like automated Q&A, trend analysis over time, or feeding models that detect insights across many reports.
That said, building and maintaining this pipeline in-house is non-trivial. It demands continuous effort: monitoring for parsing errors, updating logic for new document formats, scaling to thousands of filings, and ensuring accuracy at each step. Many organizations discover that a lot of engineering time goes into just keeping the data pipeline running reliably. If this sounds daunting, one alternative is to leverage specialized services that have done the heavy lifting already. For example, Captide offers an API that encapsulates this entire process – from fetching original filings across the US, Canada, Europe, East Asia, Australia, etc., to returning a machine-readable version with structured text and data. By using an API like this, teams can skip reinventing the wheel and immediately access clean, indexed filings for their RAG applications.