In financial analysis, we often rely on data providers for clean, structured metrics — but that convenience comes at a cost: limited coverage. Many metrics that companies report in earnings releases, SEC filings (such as 10-Ks and 10-Qs), earnings slides, or earnings calls never make it into standardized datasets. Sometimes it’s because the metric is non-GAAP, inconsistently labeled, or disclosed deep in a report. Other times, it’s simply because the unstructured nature of some of the documents makes it difficult for traditional tools to capture all the relevant data.
Yet these “overlooked” metrics can be critical. Investors, analysts, and risk managers often cite them in earnings calls, valuation models, or stress tests. If you want to track such metrics across a peer group or over time, you’re left to manually dig through financial disclosures — or build and maintain brittle scrapers that break whenever disclosure formats change.
Captide offers a better way. The Captide API uses large language models (LLMs) to extract structured data directly from SEC filings via natural language queries. That means you can systematically collect the exact metrics you care about — even if they aren’t in any vendor feed.
In this post, we’ll walk through a working example of this idea in action. Specifically, we will:
- pull recent filings (10-Ks, 10-Qs, and earnings-related 8-Ks) for a set of bank tickers using Captide’s /documents endpoint;
- query each set of filings for four metrics: Tangible Book Value, Return on Tangible Common Equity, Efficiency Ratio, and Adjusted/Core Earnings;
- assemble the answers into clean, per-company time-series tables with pandas.
These four metrics are just examples — you could swap in any KPI that companies disclose. The key takeaway is that with Captide, you can programmatically mine SEC filings for non-standard but material financial data at scale — something traditional pipelines often miss.
We start by using Captide’s /documents endpoint (which fetches filings from US-listed companies) to pull filings for a list of bank tickers. We retrieve the metadata for each company’s recent filings, then filter the list to include only relevant document types and timeframes: we keep only 10-Ks, 10-Qs, and 8-Ks (and for 8-Ks, only those containing earnings results), and we restrict to filings from 2023 onwards. This ensures each document we process is both recent and pertinent.
import re
import os
import requests
from typing import Dict, List, Tuple
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed

# Constants
CAPTIDE_API_KEY = os.getenv("CAPTIDE_API_KEY", "YOUR_CAPTIDE_API_KEY")
TICKERS = ['BAC', 'C', 'GS']
HEADERS = {
    "X-API-Key": CAPTIDE_API_KEY,
    "Content-Type": "application/json",
    "Accept": "application/json",
}

def is_valid_fiscal_period(fp: str) -> bool:
    # Keep only quarters from 2023 onwards (labels look like "Q1 2023")
    match = re.match(r"Q([1-4]) (\d{4})", fp)
    return bool(match and int(match.group(2)) > 2022)

def is_valid_document(doc: Dict) -> bool:
    # For 8-Ks, keep only those reporting results of operations (Item 2.02)
    if doc["sourceType"] == "8-K":
        return "2.02" in doc.get("additionalKwargs", {}).get("item", "")
    return True

def fetch_documents(ticker: str) -> Tuple[str, List[Dict]]:
    url = f"https://rest-api.captide.co/api/v1/companies/ticker/{ticker}/documents"
    try:
        response = requests.get(url, headers=HEADERS, timeout=60)
        response.raise_for_status()
        docs = response.json()
        valid_docs = [
            {
                "ticker": doc["ticker"],
                "fiscalPeriod": doc["fiscalPeriod"],
                "sourceLink": doc["sourceLink"],
                "date": doc["date"],
            }
            for doc in docs
            if doc["sourceType"] in {"10-K", "10-Q", "8-K"}
            and "fiscalPeriod" in doc
            and is_valid_fiscal_period(doc["fiscalPeriod"])
            and is_valid_document(doc)
        ]
        return ticker, valid_docs
    except Exception:
        return ticker, []

# Fetch and store all document metadata in parallel
all_docs = []
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(fetch_documents, ticker) for ticker in TICKERS]
    for future in as_completed(futures):
        _, docs = future.result()
        all_docs.extend(docs)

# Group source links by (ticker, fiscalPeriod)
grouped_docs = defaultdict(list)
for doc in all_docs:
    grouped_docs[(doc["ticker"], doc["fiscalPeriod"])].append(doc["sourceLink"])
The grouped_docs dictionary holds the final output of this step: a mapping from each (ticker, fiscalPeriod) pair to the list of filtered document URLs relevant to that quarter. This structure enables efficient lookup and sequential processing of filings, ensuring that for each reporting period, all pertinent 10-K, 10-Q, and qualifying 8-K documents are readily accessible.
Here’s an example of what that structure looks like:
{
"('BAC', 'Q1 2025')": [
"https://rest-api.captide.co/api/v1/document?sourceType=10-Q&documentId=b7d7a0a2-aa0b-4472-917c-ebb19e2c77cf"
],
"('BAC', 'Q4 2024')": [
"https://rest-api.captide.co/api/v1/document?sourceType=10-K&documentId=bc20b70e-79d5-4beb-8b6a-e70f26841641",
"https://rest-api.captide.co/api/v1/document?sourceType=8-K&documentId=11a3d090-1bb1-4c2c-bbf4-42b29875c07e",
"https://rest-api.captide.co/api/v1/document?sourceType=8-K&documentId=25182680-0d36-40ba-a6f0-42c73677e680",
"https://rest-api.captide.co/api/v1/document?sourceType=8-K&documentId=64526a7f-e5ec-4117-85d1-fc02d438c78d"
],
"('BAC', 'Q3 2024')": [
"https://rest-api.captide.co/api/v1/document?sourceType=10-Q&documentId=2f39b969-eb79-4ba7-a291-3b205fa3359c",
"https://rest-api.captide.co/api/v1/document?sourceType=8-K&documentId=bbab0b48-a188-437f-8c89-5950198f792f",
"https://rest-api.captide.co/api/v1/document?sourceType=8-K&documentId=7f9e0cb4-13b9-4508-a32e-3eb14372944f",
"https://rest-api.captide.co/api/v1/document?sourceType=8-K&documentId=cb93556c-670e-4333-9676-b495a6871063"
],
...
}
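Before sending any queries, it can help to sanity-check the grouping. The snippet below is a small sketch (not part of the original workflow) using hypothetical placeholder URLs in place of real document links, just to show how the structure can be summarized:

```python
# Hypothetical sample mirroring the grouped_docs structure built above
grouped_docs = {
    ("BAC", "Q1 2025"): ["doc-url-1"],
    ("BAC", "Q4 2024"): ["doc-url-2", "doc-url-3", "doc-url-4", "doc-url-5"],
}

# Count how many source documents each (ticker, period) group will send per query
counts = {key: len(links) for key, links in grouped_docs.items()}
for (ticker, period), n in sorted(counts.items()):
    print(f"{ticker} {period}: {n} document(s)")
```

A group with an unexpectedly low count (for example, a quarter with no qualifying 8-K) is easier to spot here than after the extraction step.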
Next, we use Captide’s agent-query-stream endpoint to ask a specific question of each set of documents (grouped by ticker and fiscal period), namely: “What is the Tangible Book Value, Return on Tangible Common Equity, Efficiency Ratio, and Adjusted/Core Earnings?” This endpoint employs an LLM-based agent under the hood to find those answers within the selected documents. We format our query as a strict JSON-only prompt (see the QUERY_TEMPLATE defined below) to ensure the API returns the data as a structured JSON object.
For each set of documents, we send the query to the API. The responses come back as a server-sent event stream, so we include a helper function to parse the stream and extract the final JSON answer. To speed things up, we run all the queries in parallel using a thread pool. This allows us to efficiently collect the results for every filing in our list.
import json

def parse_sse_response(sse_text: str) -> Dict:
    # SSE payloads arrive as lines prefixed with "data: "; strip the prefix
    lines = [line.strip()[6:] for line in sse_text.splitlines() if line.startswith("data: ")]
    for line in lines:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if obj.get("type") == "full_answer":
            content = obj["content"].encode().decode("unicode_escape")
            content = re.sub(r"\s*\[#\w+\]", "", content)  # drop citation markers like [#abc123]
            match = re.search(r"\{.*\}", content, re.DOTALL)
            if match:
                return json.loads(match.group(0))
    return {}

QUERY_TEMPLATE = (
    "Provide ONLY a valid JSON object with double-quoted keys and numeric values, representing these metrics: "
    "'Tangible Book Value', 'Return on Tangible Common Equity', 'Efficiency Ratio', and 'Adjusted/Core Earnings'. "
    "Provide the values in millions of dollars or as decimal representation if they are percentages. "
    "DO NOT include any commentary, labels, or additional text. Only return one JSON object in strict JSON format."
)

def fetch_metrics(ticker: str, fiscal_period: str, source_links: List[str]) -> Dict:
    payload = {"query": QUERY_TEMPLATE, "sourceLink": source_links}
    try:
        response = requests.post(
            "https://rest-api.captide.co/api/v1/rag/agent-query-stream",
            json=payload, headers=HEADERS, timeout=120,
        )
        response.raise_for_status()
        data = parse_sse_response(response.text)
        return {"ticker": ticker, "fiscalPeriod": fiscal_period, "data": data}
    except Exception:
        return {"ticker": ticker, "fiscalPeriod": fiscal_period, "data": None}

# Run metrics extraction in parallel across all (ticker, period) groups
all_results = []
with ThreadPoolExecutor(max_workers=50) as executor:
    futures = [
        executor.submit(fetch_metrics, ticker, period, links)
        for (ticker, period), links in grouped_docs.items()
    ]
    for future in as_completed(futures):
        result = future.result()
        if result["data"]:
            all_results.append(result)
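To make the stream-parsing step concrete, here is a self-contained illustration against a synthetic SSE payload. The event shapes below are assumptions inferred from the parsing logic above, not documented Captide output:

```python
import json
import re

# Synthetic server-sent-event text mimicking the shape the parser expects:
# intermediate "chunk" events, then a "full_answer" event carrying JSON
sse_text = (
    'data: {"type": "chunk", "content": "partial"}\n'
    'data: {"type": "full_answer", "content": "{\\"Efficiency Ratio\\": 0.606} [#ref1]"}\n'
)

result = {}
for line in sse_text.splitlines():
    if not line.startswith("data: "):
        continue
    obj = json.loads(line[6:])
    if obj.get("type") == "full_answer":
        content = re.sub(r"\s*\[#\w+\]", "", obj["content"])  # strip citation markers like [#ref1]
        match = re.search(r"\{.*\}", content, re.DOTALL)
        if match:
            result = json.loads(match.group(0))

print(result)  # {'Efficiency Ratio': 0.606}
```

The citation markers (`[#ref1]` here) are what make the regex cleanup necessary: the answer text interleaves trace-back references with the JSON object itself.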
For each company and fiscal period, the final output includes a dictionary containing the ticker, the fiscal period, and the extracted metrics in structured form. Below is an example of what one such response looks like:
[
{
"ticker": "GS",
"fiscalPeriod": "Q1 2025",
"data": {
"Tangible Book Value": 102407,
"Return on Tangible Common Equity": 0.180,
"Efficiency Ratio": 0.606,
"Adjusted/Core Earnings": 4583
}
},
{
"ticker": "GS",
"fiscalPeriod": "Q4 2024",
"data": {
"Tangible Book Value": 102403,
"Return on Tangible Common Equity": 0.135,
"Efficiency Ratio": 0.631,
"Adjusted/Core Earnings": 13525
}
},
...
]
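Because LLM output can occasionally omit a field, a light validation pass over the results is cheap insurance. The sketch below uses hypothetical records shaped like entries of all_results:

```python
EXPECTED_METRICS = {
    "Tangible Book Value",
    "Return on Tangible Common Equity",
    "Efficiency Ratio",
    "Adjusted/Core Earnings",
}

# Hypothetical records shaped like entries of all_results
all_results = [
    {"ticker": "GS", "fiscalPeriod": "Q1 2025",
     "data": {"Tangible Book Value": 102407, "Return on Tangible Common Equity": 0.180,
              "Efficiency Ratio": 0.606, "Adjusted/Core Earnings": 4583}},
    {"ticker": "GS", "fiscalPeriod": "Q4 2024",
     "data": {"Tangible Book Value": 102403}},  # missing three metrics
]

# Flag (ticker, period) pairs whose answer did not cover all four metrics
incomplete = [
    (r["ticker"], r["fiscalPeriod"])
    for r in all_results
    if set(r["data"]) != EXPECTED_METRICS
]
print(incomplete)  # [('GS', 'Q4 2024')]
```

Flagged pairs can then be re-queried or inspected manually before they enter the tables below.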
Captide extracts financial metrics directly from the source documents and provides built-in traceability tools to ensure full auditability. Each response can be linked back to the exact location in the original filing. Below are Captide’s trace-back sources for “Tangible Book Value” and “Efficiency Ratio”, as derived from Goldman Sachs’ Q1 2025 10-Q filing. These references show precisely where the model identified and justified each value within the document.
Finally, we convert the results into a tabular format using pandas, making it easier to analyze or visualize. We organize the data by ticker and metric, then create a pandas DataFrame for each ticker with metrics as rows and fiscal periods as columns. To ensure the time series is ordered correctly, we sort the fiscal-period columns chronologically (using a custom sort function to handle labels like "Q4 2023"). At the end, we can print or plot these tables to compare the metrics across periods for each company.
import pandas as pd

# Group results: ticker → metric → period → value
grouped_data = defaultdict(lambda: defaultdict(dict))
for result in all_results:
    ticker, period = result["ticker"], result["fiscalPeriod"]
    for metric, value in result["data"].items():
        grouped_data[ticker][metric][period] = value

def sort_fiscal_periods(periods: List[str]) -> List[str]:
    # Sort labels like "Q4 2023" chronologically: by year, then quarter
    def sort_key(p: str):
        m = re.match(r"Q([1-4]) (\d{4})", p)
        return (int(m.group(2)), int(m.group(1))) if m else (9999, 9)
    return sorted(periods, key=sort_key)

# Create one DataFrame per ticker: metrics as rows, fiscal periods as columns
ticker_tables = {}
for ticker, metrics in grouped_data.items():
    df = pd.DataFrame(metrics).T
    df.columns.name = "Fiscal Period"
    df = df[sort_fiscal_periods(df.columns.tolist())]
    ticker_tables[ticker] = df

# Example: print all tables
for ticker, df in ticker_tables.items():
    print(f"\nFinancials for {ticker}")
    print(df)
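Once the per-ticker tables exist, downstream calculations are one-liners. As a sketch, here is a quarter-over-quarter change computed on a small table in the same layout (metrics as rows, sorted periods as columns), seeded with the Goldman Sachs figures shown below:

```python
import pandas as pd

# Small table shaped like ticker_tables[ticker], using a subset of the GS values
df = pd.DataFrame(
    {"Q3 2024": [101113, 0.111], "Q4 2024": [102043, 0.135], "Q1 2025": [102407, 0.180]},
    index=["Tangible Book Value", "Return on Tangible Common Equity"],
)

# Quarter-over-quarter fractional change along the period axis
qoq = df.pct_change(axis=1).round(4)
print(qoq)
```

The first period comes back as NaN (there is no prior quarter to compare against), which is usually the behavior you want for a time series that starts mid-history.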
With the extracted data now organized into clean, structured tables, we can easily compare key financial metrics across periods for each institution. Below are the results for Goldman Sachs, Bank of America, and Citigroup — showing how their Tangible Book Value, Return on Tangible Common Equity, Efficiency Ratio, and Adjusted/Core Earnings have evolved over the last several quarters. This output illustrates the power of programmatically mining SEC filings: metrics that previously required manual extraction are now accessible at scale, ready for quantitative analysis or visualization.
Financials for GS
Fiscal Period Q2 2023 Q3 2023 Q4 2023 Q1 2024 Q2 2024 Q3 2024 Q4 2024 Q1 2025
Tangible Book Value 97927 98820 98609 100425 99825 101113 102043 102407
Return on Tangible Common Equity 0.044 0.077 0.081 0.159 0.116 0.111 0.135 0.180
Efficiency Ratio 0.784 0.744 0.746 0.609 0.670 0.655 0.631 0.606
Adjusted/Core Earnings 1071 1882 7907 3931 2891 2780 13525 4583
Financials for BAC
Fiscal Period Q2 2023 Q3 2023 Q4 2023 Q1 2024 Q2 2024 Q3 2024 Q4 2024 Q1 2025
Tangible Book Value 184755 188516 193105 195026 197229 201858 202311 205008
Return on Tangible Common Equity 0.1550 0.1550 0.135 0.127 0.136 0.128 0.1260 0.1394
Efficiency Ratio 0.6365 0.6293 0.670 0.670 0.640 0.650 0.6557 0.6493
Adjusted/Core Earnings 7102 7270 29300 7200 6900 6886 25503 6990
Financials for C
Fiscal Period Q2 2023 Q3 2023 Q4 2023 Q1 2024 Q2 2024 Q3 2024 Q4 2024 Q1 2025
Tangible Book Value 164335 166319 164025 165307 166989 169588 167698 170941
Return on Tangible Common Equity 0.064 0.077 0.049 0.076 0.072 0.070 0.070 0.091
Efficiency Ratio 0.698 0.671 0.718 0.673 0.663 0.652 0.665 0.622
Adjusted/Core Earnings 3007 3332 9228 3371 6067 2961 12835 3795
This workflow demonstrates how you can use Captide’s API to turn a massive stream of SEC EDGAR filings into structured, analyzable data with just a few lines of Python code. Whether you’re an analyst, investor, or researcher, this approach dramatically reduces the time and effort needed to extract insights from regulatory disclosures.