Back to blog

How to Mine Data from SEC Filings Using the Captide API

In financial analysis, we often rely on data providers for clean, structured metrics — but that convenience comes at a cost: limited coverage. Many metrics that companies report in earnings releases, SEC filings (such as 10-Ks and 10-Qs), earnings slides, or earnings calls never make it into standardized datasets. Sometimes it’s because the metric is non-GAAP, inconsistently labeled, or disclosed deep in a report. Other times, it’s simply because the unstructured nature of some of the documents makes it difficult for traditional tools to capture all the relevant data.

Yet these “overlooked” metrics can be critical. Investors, analysts, and risk managers often cite them in earnings calls, valuation models, or stress tests. If you want to track such metrics across a peer group or over time, you’re left to manually dig through financial disclosures — or build and maintain brittle scrapers that break whenever disclosure formats change.

Captide offers a better way. The Captide API uses large language models (LLMs) to extract structured data directly from SEC filings via natural language queries. That means you can systematically collect the exact metrics you care about — even if they aren’t in any vendor feed.

In this post, we’ll walk through a working example of this idea in action. Specifically, we will:

  • Pull recent 10-Ks, 10-Qs, and selected 8-Ks for a group of major U.S. bank tickers using Captide’s document API
  • Query those filings for four example metrics: Tangible Book Value, Return on Tangible Common Equity, Efficiency Ratio, and Adjusted/Core Earnings
  • Parse and structure the extracted data into a format ready for analysis or visualization

These four metrics are just examples — you could swap in any KPI that companies disclose. The key takeaway is that with Captide, you can programmatically mine SEC filings for non-standard but material financial data at scale — something traditional pipelines often miss.

Step 1: Fetching and Filtering SEC Filings

We start by using Captide’s /documents endpoint (which fetches filings from US listed companies) to pull filings for a list of bank tickers. We retrieve the metadata for each company’s recent filings, then filter the list to include only relevant document types and timeframes. For example, we keep only 10-Ks, 10-Qs, and 8-Ks (and for 8-Ks, only those containing earnings results), and we restrict to filings from 2023 onwards. This ensures each document we process is both recent and pertinent.

import re
import os
import requests
from typing import Dict, List, Tuple
from collections import defaultdict

# Constants
CAPTIDE_API_KEY = os.getenv("CAPTIDE_API_KEY", "YOUR_CAPTIDE_API_KEY")
TICKERS = ['BAC', 'C', 'GS']
HEADERS = {
    "X-API-Key": CAPTIDE_API_KEY,
    "Content-Type": "application/json",
    "Accept": "application/json",
}

def is_valid_fiscal_period(fp: str) -> bool:
    match = re.match(r"Q([1-4]) (\d{4})", fp)
    return bool(match and int(match.group(2)) > 2022)

def is_valid_document(doc: Dict) -> bool:
    if doc["sourceType"] == "8-K":
        return "2.02" in doc.get("additionalKwargs", {}).get("item", "")
    return True

def fetch_documents(ticker: str) -> Tuple[str, List[Dict]]:
    url = f"https://rest-api.captide.co/api/v1/companies/ticker/{ticker}/documents"
    try:
        response = requests.get(url, headers=HEADERS, timeout=60)
        docs = response.json()
        valid_docs = [
            {
                "ticker": doc["ticker"],
                "fiscalPeriod": doc["fiscalPeriod"],
                "sourceLink": doc["sourceLink"],
                "date": doc["date"]
            }
            for doc in docs
            if doc["sourceType"] in {"10-K", "10-Q", "8-K"}
            and "fiscalPeriod" in doc
            and is_valid_fiscal_period(doc["fiscalPeriod"])
            and is_valid_document(doc)
        ]
        return ticker, valid_docs
    except Exception as e:
        return ticker, []

# Fetch and store all document metadata
from concurrent.futures import ThreadPoolExecutor, as_completed

all_docs = []
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(fetch_documents, ticker) for ticker in TICKERS]
    for future in as_completed(futures):
        _, docs = future.result()
        all_docs.extend(docs)

# Group by (ticker, fiscalPeriod)
grouped_docs = defaultdict(list)
for doc in all_docs:
    grouped_docs[(doc["ticker"], doc["fiscalPeriod"])].append(doc["sourceLink"])

The grouped_docs dictionary holds the final output of this step as a mapping of each (ticker, fiscalPeriod) pair to the list of filtered document URLs relevant to that quarter. This structure enables efficient lookup and sequential processing of filings, ensuring that for each reporting period, all pertinent 10-K, 10-Q, and qualifying 8-K documents are readily accessible.

Here’s an example of what that structure looks like:

{
  "('BAC', 'Q1 2025')": [
    "https://rest-api.captide.co/api/v1/document?sourceType=10-Q&documentId=b7d7a0a2-aa0b-4472-917c-ebb19e2c77cf"
  ],
  "('BAC', 'Q4 2024')": [
    "https://rest-api.captide.co/api/v1/document?sourceType=10-K&documentId=bc20b70e-79d5-4beb-8b6a-e70f26841641",
    "https://rest-api.captide.co/api/v1/document?sourceType=8-K&documentId=11a3d090-1bb1-4c2c-bbf4-42b29875c07e",
    "https://rest-api.captide.co/api/v1/document?sourceType=8-K&documentId=25182680-0d36-40ba-a6f0-42c73677e680",
    "https://rest-api.captide.co/api/v1/document?sourceType=8-K&documentId=64526a7f-e5ec-4117-85d1-fc02d438c78d"
  ],
  "('BAC', 'Q3 2024')": [
    "https://rest-api.captide.co/api/v1/document?sourceType=10-Q&documentId=2f39b969-eb79-4ba7-a291-3b205fa3359c",
    "https://rest-api.captide.co/api/v1/document?sourceType=8-K&documentId=bbab0b48-a188-437f-8c89-5950198f792f",
    "https://rest-api.captide.co/api/v1/document?sourceType=8-K&documentId=7f9e0cb4-13b9-4508-a32e-3eb14372944f",
    "https://rest-api.captide.co/api/v1/document?sourceType=8-K&documentId=cb93556c-670e-4333-9676-b495a6871063"
  ],
  ...
}

Step 2: Querying the Filings for Financial Metrics

Next, we use Captide’s agent-query-stream endpoint to ask a specific question of set of documents (grouped by ticker and fiscal period) — namely: “What is the Tangible Book Value, Return on Tangible Common Equity, Efficiency Ratio, and Adjusted/Core Earnings?” This endpoint employs an LLM-based agent under the hood to find those answers within the selected documents. We format our query in a strict JSON-only prompt (see the QUERY_TEMPLATE defined above) to ensure the API returns the data as a structured JSON object.

For each set of documents, we send the query to the API. The responses come back as a server-sent event stream, so we include a helper function to parse the stream and extract the final JSON answer. To speed things up, we run all the queries in parallel using a thread pool. This allows us to efficiently collect the results for every filing in our list.

import json

def parse_sse_response(sse_text: str) -> Dict:
    try:
        lines = [line.strip()[6:] for line in sse_text.splitlines() if line.startswith("data: ")]
        for line in lines:
            obj = json.loads(line)
            if obj.get("type") == "full_answer":
                content = obj["content"].encode().decode("unicode_escape")
                content = re.sub(r"\s*\[#\w+\]", "", content)
                match = re.search(r"\{.*\}", content, re.DOTALL)
                if match:
                    return json.loads(match.group(0))
    return {}

QUERY_TEMPLATE = (
    "Provide ONLY a valid JSON object with double-quoted keys and numeric values, representing these metrics: "
    "'Tangible Book Value', 'Return on Tangible Common Equity', 'Efficiency Ratio', and 'Adjusted/Core Earnings'. "
    "Provide the values in millions of dollars or as decimal representation if they are percentages. "
    "DO NOT include any commentary, labels, or additional text. Only return one JSON object in strict JSON format."
)

def fetch_metrics(ticker: str, fiscal_period: str, source_links: List[str]) -> Dict:
    payload = {"query": QUERY_TEMPLATE, "sourceLink": source_links}
    try:
        response = requests.post(
            "https://rest-api.captide.co/api/v1/rag/agent-query-stream",
            json=payload, headers=HEADERS, timeout=120,
        )
        data = parse_sse_response(response.text)
        return {"ticker": ticker, "fiscalPeriod": fiscal_period, "data": data}
    except Exception as e:
        return {"ticker": ticker, "fiscalPeriod": fiscal_period, "data": None}

# Run metrics extraction
all_results = []
with ThreadPoolExecutor(max_workers=50) as executor:
    futures = [
        executor.submit(fetch_metrics, ticker, period, links)
        for (ticker, period), links in grouped_docs.items()
    ]
    for future in as_completed(futures):
        result = future.result()
        if result["data"]:
            all_results.append(result)

For each company and fiscal period, the final output includes a dictionary containing the ticker, the fiscal period, and the extracted metrics in structured form. Below is an example of what one such response looks like:

[
  {
    "ticker": "GS",
    "fiscalPeriod": "Q1 2025",
    "data": {
        "Tangible Book Value": 102407,
        "Return on Tangible Common Equity": 0.180,
        "Efficiency Ratio": 0.606,
        "Adjusted/Core Earnings": 4583
    }
  },
  {
    "ticker": "GS",
    "fiscalPeriod": "Q4 2024",
    "data": {
      "Tangible Book Value": 102403,
      "Return on Tangible Common Equity": 0.135,
      "Efficiency Ratio": 0.631,
      "Adjusted/Core Earnings": 13525
    }
  },
  ...
]

Captide extracts financial metrics directly from the source documents and provides built-in traceability tools to ensure full auditability. Each response can be linked back to the exact location in the original filing. Below are Captide’s trace-back sources for Tangible Book Value and Efficiency Ratio, as derived from Goldman Sachs’ Q1 2025 10-Q filing. These references show precisely where the model identified and justified each value within the document.

Step 3: Structuring and Displaying the Data

Finally, we convert the results into a tabular format using pandas, making it easier to analyze or visualize. We organize the data by ticker and metric, then create a pandas DataFrame for each ticker with metrics as rows and fiscal periods as columns. To ensure the time series is ordered correctly, we sort the fiscal-period columns chronologically (using a custom sort function to handle labels like "Q4 2023"). At the end, we can print or plot these tables to compare the metrics across periods for each company.

import pandas as pd

# Group results: ticker → metric → period → value
grouped_data = defaultdict(lambda: defaultdict(dict))
for result in all_results:
    ticker, period = result["ticker"], result["fiscalPeriod"]
    for metric, value in result["data"].items():
        grouped_data[ticker][metric][period] = value

def sort_fiscal_periods(periods: List[str]) -> List[str]:
    def sort_key(p: str):
        m = re.match(r"Q([1-4]) (\d{4})", p)
        return (int(m.group(2)), int(m.group(1))) if m else (9999, 9)
    return sorted(periods, key=sort_key)

# Create and display DataFrames
ticker_tables = {}
for ticker, metrics in grouped_data.items():
    df = pd.DataFrame(metrics).T
    df.columns.name = "Fiscal Period"
    df = df[sort_fiscal_periods(df.columns.tolist())]
    ticker_tables[ticker] = df

# Example: print all tables
for ticker, df in ticker_tables.items():
    print(f"\ Financials for {ticker}")
    print(df)

With the extracted data now organized into clean, structured tables, we can easily compare key financial metrics across periods for each institution. Below are the results for Goldman Sachs, Bank of America, and Citigroup — showing how their Tangible Book Value, Return on Tangible Common Equity, Efficiency Ratio, and Adjusted/Core Earnings have evolved over the last several quarters. This output illustrates the power of programmatically mining SEC filings: metrics that previously required manual extraction are now accessible at scale, ready for quantitative analysis or visualization.

Financials for GSCE
Fiscal Period				Q2 2023	Q3 2023	Q4 2023	Q1 2024	Q2 2024	Q3 2024	Q4 2024	Q1 2025
Tangible Book Value			97927	98820	98609	100425	99825	101113	102043	102407
Return on Tangible Common Equity	0.044	0.077	0.081	0.159	0.116	0.111	0.135	0.180
Efficiency Ratio			0.784	0.744	0.746	0.609	0.670	0.655	0.631	0.606
Adjusted/Core Earnings			1071	1882	7907	3931	2891	2780	13525	4583

Financials for BAC
Fiscal Period				Q2 2023	Q3 2023	Q4 2023	Q1 2024	Q2 2024	Q3 2024	Q4 2024	Q1 2025
Tangible Book Value			184755	188516	193105	195026	197229	201858	202311	205008
Return on Tangible Common Equity	0.1550	0.1550	0.135	0.127	0.136	0.128	0.1260	0.1394
Efficiency Ratio			0.6365	0.6293	0.670	0.670	0.640	0.650	0.6557	0.6493
Adjusted/Core Earnings			7102	7270	29300	7200	6900	6886	25503	6990

Financials for C-PN
Fiscal Period				Q2 2023	Q3 2023	Q4 2023	Q1 2024	Q2 2024	Q3 2024	Q4 2024	Q1 2025
Tangible Book Value			164335	166319	164025	165307	166989	169588	167698	170941
Return on Tangible Common Equity	0.064	0.077	0.049	0.076	0.072	0.070	0.070	0.091
Efficiency Ratio			0.698	0.671	0.718	0.673	0.663	0.652	0.665	0.622
Adjusted/Core Earnings			3007	3332	9228	3371	6067	2961	12835	3795

Final Thoughts

This workflow demonstrates how you can use Captide’s API to turn a massive stream of SEC EDGAR filings into structured, analyzable data with just a few lines of Python code. Whether you’re an analyst, investor, or researcher, this approach dramatically reduces the time and effort needed to extract insights from regulatory disclosures.

May 24, 2025
Want more?

Automate insights and data extraction from SEC filings with Captide