
Extracting and Standardizing Non-GAAP Metrics from 8-Ks Programmatically

Adjusted EBITDA, Free Cash Flow, and Core Earnings are central to modern financial analysis. These non-GAAP metrics often drive valuation models, comparables analysis, and investment theses—especially when GAAP figures don’t fully reflect a company’s underlying performance. Yet accessing these metrics remains a significant challenge. They’re frequently embedded deep within earnings releases and 8-K filings, presented in unstructured and inconsistent formats that vary not only across companies but also between reporting periods for the same issuer. Tables may appear as images, poorly formatted text, or irregular layouts. Line item labels change, order varies, and no standard definition exists for terms like “Adjusted EBITDA.”

Traditional parsing techniques—such as regular expressions, templates, or table extraction tools—struggle to keep up. The result is a slow, manual, and error-prone process that resists scale and automation.

Captide’s Solution

The Captide API enables prompt-driven, retrieval-augmented generation (RAG) workflows directly on SEC filings and other unstructured company disclosures. By leveraging advanced agentic AI behind an API with access to one of the largest datasets of financial disclosures, Captide can return structured, schema-consistent JSON with the exact metrics required (e.g., all lines for the Net Income to Adjusted EBITDA reconciliation, always numeric, with explicit sign conventions).

This case study demonstrates an automated, end-to-end workflow that fetches the earnings press release 8-Ks from multiple public companies, extracts the reconciliation from Net Income to Adjusted EBITDA as a clean JSON object, and iteratively standardizes and aligns metric line items—even as they change across filings—so results are immediately suitable for analytics, visualization, or downstream modeling.

Step 1: Extract Reconciliation Metrics with Captide

The first step in our workflow is to programmatically retrieve reconciliation line items from Net Income to Adjusted EBITDA directly from 8-K filings. Using Captide’s retrieval-augmented generation API, we can prompt the model to return these metrics as a clean, structured JSON object—regardless of how inconsistently they're presented in the underlying documents. Captide retrieves only information actually present in the filings and exposes audit traces for quality assurance.

We begin by filtering for recent 8-K filings (post-2022) that are classified as Item 2.02 (Results of Operations and Financial Condition) — the documents most likely to include reconciliations and earnings tables. The fetch_documents function handles this filtering and returns metadata for qualifying filings across multiple tickers.

Once we’ve gathered the relevant source links, the fetch_metrics_with_prompt function sends a structured prompt to Captide’s agent-query endpoint. This prompt instructs the model to return only a numeric JSON reconciliation from Net Income to Adjusted EBITDA, omitting non-numeric commentary or inconsistent formatting.

The response is streamed as server-sent events (SSE) and parsed into a usable Python dictionary via the parse_sse_response function.

This step automates what would otherwise be hours of manual work—parsing filings, normalizing line items, and extracting financial data hidden deep in unstructured disclosures.

import os, re, json, requests, pandas as pd
from typing import Dict, List

CAPTIDE_API_KEY = os.getenv("CAPTIDE_API_KEY", "YOUR_CAPTIDE_API_KEY")
HEADERS = {
    "X-API-Key": CAPTIDE_API_KEY,
    "Content-Type": "application/json",
    "Accept": "application/json"
}

TICKERS = ["SNAP", "PLTR", "UBER"]

BASE_PROMPT = (
    "Return a single valid JSON object with double-quoted keys and numeric values (in thousands of dollars). The object "
    "must represent the reconciliation from Net Income to Adjusted EBITDA, including all reported line items. Use "
    "positive values for additions to Net Income and negative values for subtractions. Do not include words like 'add' "
    "or 'less' in the keys. Output only the JSON object—no commentary or extra text."
)

def is_valid_fiscal_period(fp: str) -> bool:
    m = re.match(r"Q([1-4]) (\d{4})", fp)
    return bool(m and int(m.group(2)) > 2022)

def is_valid_document(doc: Dict) -> bool:
    if doc["sourceType"] == "8-K":
        return "2.02" in doc.get("additionalKwargs", {}).get("item", "")
    return True

def fetch_documents(ticker: str) -> List[Dict]:
    url = f"https://rest-api.captide.co/api/v1/companies/ticker/{ticker}/documents"
    resp = requests.get(url, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    docs = resp.json()
    return [
        {"ticker": doc["ticker"],
         "fiscalPeriod": doc["fiscalPeriod"],
         "sourceLink": doc["sourceLink"]}
        for doc in docs
        if doc["sourceType"] == "8-K"
        and "fiscalPeriod" in doc
        and is_valid_fiscal_period(doc["fiscalPeriod"])
        and is_valid_document(doc)
    ]

def parse_sse_response(sse_text: str) -> Dict:
    try:
        lines = [l[6:] for l in sse_text.splitlines() if l.startswith("data: ")]
        for l in lines:
            obj = json.loads(l)
            if obj.get("type") == "full_answer":
                content = re.sub(r"\s*\[#\w+\]", "", obj["content"])
                m = re.search(r"\{.*\}", content, re.DOTALL)
                return json.loads(m.group(0)) if m else {}
    except Exception:
        pass
    return {}

def fetch_metrics_with_prompt(source_links: List[str], prompt: str) -> Dict:
    payload = {"query": prompt, "sourceLink": source_links}
    r = requests.post(
        "https://rest-api.captide.co/api/v1/rag/agent-query-stream",
        json=payload, headers=HEADERS, timeout=120
    )
    r.raise_for_status()
    return parse_sse_response(r.text)
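The parsing step can be exercised in isolation before touching the live endpoint. Below is a minimal sketch that feeds parse_sse_response (redefined here so the snippet runs standalone) a synthetic event stream in the shape described above; the payload and citation tag are illustrative, not actual API output.

```python
import json, re
from typing import Dict

def parse_sse_response(sse_text: str) -> Dict:
    # Same parser as above: keep "data: " lines, find the full_answer
    # event, strip citation tags like [#abc123], and extract the first
    # JSON object embedded in the content.
    try:
        lines = [l[6:] for l in sse_text.splitlines() if l.startswith("data: ")]
        for l in lines:
            obj = json.loads(l)
            if obj.get("type") == "full_answer":
                content = re.sub(r"\s*\[#\w+\]", "", obj["content"])
                m = re.search(r"\{.*\}", content, re.DOTALL)
                return json.loads(m.group(0)) if m else {}
    except Exception:
        pass
    return {}

# Synthetic server-sent event (illustrative payload, not real API output)
event = {
    "type": "full_answer",
    "content": 'Reconciliation [#a1b2]: {"Net Income (Loss)": 100, "Adjusted EBITDA": 120}',
}
sse_text = "data: " + json.dumps(event)

parsed = parse_sse_response(sse_text)
print(parsed)  # {'Net Income (Loss)': 100, 'Adjusted EBITDA': 120}
```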

Below is a sample response for Palantir Technologies' Q1 2023 reconciliation.

{
  "Net Income (Loss)": 16802,
  "Net Income (Loss) to Non-Controlling Interests": 2349,
  "Interest Income": -20853,
  "Interest Expense": 1275,
  "Other (Income) Expense Net": 2861,
  "Income Tax (Benefit) Expense": 1681,
  "Depreciation and Amortization": 8320,
  "Stock-Based Compensation": 114714,
  "Payroll Tax on Stock-Based Compensation": 6285,
  "Adjusted EBITDA": 133434
}
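Because the prompt enforces numeric values and explicit sign conventions, each response can be sanity-checked mechanically: every line item other than the total should sum to Adjusted EBITDA. A quick check against the Palantir sample above:

```python
recon = {
    "Net Income (Loss)": 16802,
    "Net Income (Loss) to Non-Controlling Interests": 2349,
    "Interest Income": -20853,
    "Interest Expense": 1275,
    "Other (Income) Expense Net": 2861,
    "Income Tax (Benefit) Expense": 1681,
    "Depreciation and Amortization": 8320,
    "Stock-Based Compensation": 114714,
    "Payroll Tax on Stock-Based Compensation": 6285,
    "Adjusted EBITDA": 133434,
}

# Sum everything except the total itself; signs are already normalized
total = sum(v for k, v in recon.items() if k != "Adjusted EBITDA")
assert total == recon["Adjusted EBITDA"]  # 133434
```

Responses that fail this check can be flagged for review via the audit traces.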

Step 2: Normalize and Merge Extracted Metrics

Because reconciliation line items vary not only between companies but also across different reporting periods, the next step is to normalize this structure into a stable schema that can support time-series analysis and consistent aggregation.

The key challenge is that issuers often introduce, rename, or reorder line items in their Adjusted EBITDA reconciliations. A rigid schema would either lose valuable information or require constant manual updates. To solve this, we use a dynamic approach that incrementally learns and aligns the line item ordering.

The build_prompt function augments our original prompt with positional guidance derived from previously seen reconciliations. If prior periods contained line items in a particular order, we ask the model to maintain that order where possible—while still allowing new line items to be inserted in a sensible position.

The merge_key_lists function handles this logic. It iteratively compares the current period’s line items with a “master” list accumulated across filings. When new items appear, it infers the appropriate insertion point by checking for nearby keys that already exist in the master list. This approach preserves semantic continuity without enforcing rigid templates.

By the end of this step, we’ve established a schema-consistent, order-aware list of reconciliation line items—robust enough to absorb quarterly variations while maintaining structure for downstream processing.

def build_prompt(prev_keys: List[str]) -> str:
    if not prev_keys:
        return BASE_PROMPT
    joined = ", ".join(f'"{k}"' for k in prev_keys)
    return (
        BASE_PROMPT +
        f" Use the following keys in this order if they appear: [{joined}]. "
        "If the document contains additional reconciliation line items, insert "
        "them at the correct position relative to the list above."
    )

def merge_key_lists(master: List[str], this_quarter: List[str]) -> List[str]:
    for i, k in enumerate(this_quarter):
        if k in master:
            continue
        insert_pos = None
        for j in range(i - 1, -1, -1):
            prev_key = this_quarter[j]
            if prev_key in master:
                insert_pos = master.index(prev_key) + 1
                break
        if insert_pos is None:
            for j in range(i + 1, len(this_quarter)):
                nxt_key = this_quarter[j]
                if nxt_key in master:
                    insert_pos = master.index(nxt_key)
                    break
        if insert_pos is None:
            insert_pos = len(master)
        master.insert(insert_pos, k)
    return master
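To see the merge in action, suppose a new "Restructuring Charges" line appears mid-series, as it does for SNAP in the results shown later. The function is repeated here so the example runs standalone; the line items are illustrative.

```python
from typing import List

def merge_key_lists(master: List[str], this_quarter: List[str]) -> List[str]:
    # Same logic as above: anchor each unseen key to its nearest
    # already-known neighbor in the master list.
    for i, k in enumerate(this_quarter):
        if k in master:
            continue
        insert_pos = None
        for j in range(i - 1, -1, -1):
            if this_quarter[j] in master:
                insert_pos = master.index(this_quarter[j]) + 1
                break
        if insert_pos is None:
            for j in range(i + 1, len(this_quarter)):
                if this_quarter[j] in master:
                    insert_pos = master.index(this_quarter[j])
                    break
        if insert_pos is None:
            insert_pos = len(master)
        master.insert(insert_pos, k)
    return master

master = ["Net Income (Loss)", "Stock-Based Compensation", "Adjusted EBITDA"]
q3 = ["Net Income (Loss)", "Stock-Based Compensation",
      "Restructuring Charges", "Adjusted EBITDA"]
merged = merge_key_lists(master, q3)
print(merged)
# ['Net Income (Loss)', 'Stock-Based Compensation',
#  'Restructuring Charges', 'Adjusted EBITDA']
```

The new key lands between its known neighbors, so the master ordering stays stable across quarters.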

Step 3: Batch Process All Tickers

With the logic in place for extraction and normalization, we can now scale our workflow across multiple tickers—automating the collection of structured, schema-aligned reconciliation metrics at scale.

The run_one_ticker function orchestrates the full process for a single company. It fetches all qualifying 8-K filings, sorts them chronologically using the fiscal_sort_key, and processes each in turn. As each quarter is parsed, the line items are normalized and merged into a cumulative schema using merge_key_lists. This ensures that earlier schema decisions inform how future filings are interpreted and structured.

The output is a time-indexed dictionary of structured reconciliation data per ticker, along with a final, consolidated ordering of all metric keys encountered.

To parallelize processing across the entire ticker list, we use a ThreadPoolExecutor. Each ticker is processed in its own thread, maximizing throughput and minimizing latency—especially useful when dealing with network-bound operations like API calls.

The final per_ticker_output dictionary contains fully normalized, JSON-formatted reconciliation data for each company, ready for direct use in analytics, dashboards, or financial models.

from concurrent.futures import ThreadPoolExecutor, as_completed

def fiscal_sort_key(fp: str) -> tuple[int, int]:
    m = re.match(r"Q([1-4]) (\d{4})", fp)
    if not m:
        return (9999, 9)
    q, yr = int(m.group(1)), int(m.group(2))
    return (yr, q)

def run_one_ticker(ticker: str) -> Dict[str, Dict[str, float]]:
    docs = fetch_documents(ticker)
    docs.sort(key=lambda d: fiscal_sort_key(d["fiscalPeriod"]))

    key_order: List[str] = []
    results: Dict[str, Dict[str, float]] = {}

    for doc in docs:
        prompt = build_prompt(key_order)
        data = fetch_metrics_with_prompt([doc["sourceLink"]], prompt)
        if not data:
            continue
        results[doc["fiscalPeriod"]] = data
        key_order = merge_key_lists(key_order, list(data.keys()))

    return {"keys": key_order, "data": results}

per_ticker_output = {}
with ThreadPoolExecutor(max_workers=len(TICKERS)) as pool:
    futures = {pool.submit(run_one_ticker, t): t for t in TICKERS}
    for fut in as_completed(futures):
        ticker = futures[fut]
        per_ticker_output[ticker] = fut.result()

Step 4: Structure Data for Analysis

With normalized metrics collected across companies and time periods, the final step is to convert this data into a format suitable for inspection, visualization, or direct analysis.

Each company’s output—originally a nested dictionary of fiscal periods and corresponding reconciliation metrics—is transformed into a tidy, column-aligned pandas.DataFrame. The key_order ensures that line items appear in a consistent and meaningful sequence across all periods, regardless of how they were presented in the original filings.

The result is a clean, rectangular table for each ticker, where rows represent standardized reconciliation line items (e.g., Depreciation and Amortization, Stock-Based Compensation), and columns represent fiscal quarters. This structure makes it trivial to:

  • Perform time-series analysis
  • Visualize trends in adjustments to EBITDA
  • Compare line-item behavior across issuers

tables = {}
for ticker, payload in per_ticker_output.items():
    key_order = payload["keys"]
    series_by_q = payload["data"]
    df = pd.DataFrame(series_by_q).reindex(key_order)
    df.index.name = "Line item"
    tables[ticker] = df

for t, frame in tables.items():
    print(f"\n📊  {t}")
    print(frame)

Below are the results we get for Snap Inc. and Palantir Technologies Inc.

EBITDA Reconciliation for SNAP
Line item					Q1 2023	Q2 2023	Q3 2023	Q4 2023	Q1 2024	Q2 2024	Q3 2024	Q4 2024
Net Income (Loss)				-328674	-377308	-368256	-248247	-305090	-248620	-153247	9101
Interest Income					-37948	-43144	-43839	-43463	-39898	-36462	-38533	-38573
Interest Expense				5885	5343	5521	5275	4743	5113	5883	5813
Other (Income) Expense Net			-11372	-1323	20662	34447	81	20792	4355	-8382
Income Tax (Benefit) Expense			6845	12093	5849	3275	6932	5202	8332	5164
Depreciation and Amortization			35220	39688	41209	43882	38098	37930	38850	39581
Stock-Based Compensation			314931	317943	353846	333063	254715	258946	260229	257731
Payroll Tax on Stock-Based Compensation		15926	8229	6463	8706	15970	10133	6093	5572
Restructuring Charges				0	0	18639	22211	70108	1943	0	0
Adjusted EBITDA					813	-38479	40094	159149	45659	54977	131962	276007

EBITDA Reconciliation for PLTR
Line item					Q1 2023	Q2 2023	Q3 2023	Q4 2023	Q1 2024	Q2 2024	Q3 2024	Q4 2024	Q1 2025
Net Income (Loss)				16802	28127	71505	93391	105530	134126	143525	79009	214031
Net Income (Loss) to Non-Controlling Interests	2349	-255	1934	3522	541	1444	5816	-2073	3686
Interest Income					-20853	-30310	-36864	-44545	-43352	-46593	-52120	-54727	-50441
Interest Expense				1275	1317	742	136	0	0	0	0	0
Other (Income) Expense Net			2861	9024	-3864	3956	13507	11173	8110	-14768	3173
Income Tax (Benefit) Expense			1681	2171	6530	9334	4655	5189	7809	3602	5599
Depreciation and Amortization			8320	8399	8663	7972	8438	8056	8087	7006	6622
Stock-Based Compensation			114714	114201	114380	132608	125651	141764	142425	281798	155339
Payroll Tax on Stock-Based Compensation		6285	10760	8909	10953	19926	6464	19950	79681	59323
Adjusted EBITDA					133434	143434	171935	217327	234896	261623	283602	379528	397332
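With each ticker reduced to a rectangular frame, trend analysis becomes a one-liner. As an example, here is a year-over-year change in SNAP's Adjusted EBITDA, with values copied from the table above (in thousands of dollars):

```python
import pandas as pd

# SNAP Adjusted EBITDA by quarter, reproduced from the table above ($ thousands)
adj_ebitda = pd.Series({
    "Q1 2023": 813,    "Q2 2023": -38479, "Q3 2023": 40094,  "Q4 2023": 159149,
    "Q1 2024": 45659,  "Q2 2024": 54977,  "Q3 2024": 131962, "Q4 2024": 276007,
})

# Compare each quarter with the same quarter one year earlier
yoy = adj_ebitda.diff(4).dropna()
print(yoy)  # e.g., Q1 2024 improved by 44,846 over Q1 2023
```

The same slicing works across the full `tables` dict, since every frame shares the standardized line-item index.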

Conclusion

What used to be a tedious, manual task—digging through 8-Ks for elusive non-GAAP metrics—can now be fully automated with precision and scale. By combining Captide’s retrieval-augmented generation capabilities with a structured prompt strategy and dynamic schema normalization, we’ve shown how even the most inconsistent financial disclosures can be transformed into clean, analysis-ready data.

This approach doesn’t just save time—it unlocks new possibilities. With standardized, machine-readable versions of Adjusted EBITDA reconciliations across companies and quarters, analysts can now:

  • Perform true apples-to-apples comparisons
  • Build time-series models with confidence in data consistency
  • Rapidly respond to earnings releases with automated pipelines
  • Enrich dashboards and investment screens with unstructured disclosures

Crucially, this framework is generalizable. The same method can be applied to extract and normalize other elusive financial metrics: Free Cash Flow definitions, Core Earnings breakdowns, or custom KPIs buried in footnotes and management commentary.

In a world increasingly driven by unstructured data, tools like Captide don’t just streamline workflows—they change the very nature of what's feasible in financial analysis. What was once hidden is now structured. What was once manual is now instant. And what was once brittle is now intelligent.

May 28, 2025