Building a Dynamic Industry Classification System with Captide’s API

In financial analysis, how we classify companies fundamentally shapes how we compare peers, identify trends, and manage portfolios. Traditional systems like GICS or NAICS assign each company to a single sector and industry code. While these frameworks provide structure, they often fall short in a modern context. Businesses increasingly span multiple lines—Amazon, for instance, operates in retail, cloud infrastructure, and media. Forcing such companies into a single category oversimplifies their actual scope and risk profile.

Moreover, these classifications are slow to adapt. New sectors—like AI infrastructure or decentralized finance—often remain unrecognized until they’ve matured. This rigidity can obscure meaningful relationships or risk exposures that fall outside traditional labels.

Dynamic Industry Classification (DIC) offers an alternative: a data-driven, adaptive way to group companies based on what they actually do, not just how they’ve been historically categorized. DIC systems use company-level data—such as product offerings, revenue segments, or language from SEC filings—to create multi-dimensional clusters. A firm can belong to several groups at once, reflecting its full business footprint. These clusters evolve naturally as companies update their strategies, markets, or technologies.

For example, an AI-enabled DIC system might group a cloud storage provider and a cybersecurity firm together due to their shared focus on enterprise infrastructure—even if one is officially tagged as “Technology” and the other as “Software Services.” Recent studies show that these data-driven clusters often align more closely with financial behaviors like return correlations than static industry codes do.

Why does this matter for financial engineers and data scientists? Because it offers a more accurate and flexible foundation for analysis. Whether you’re identifying peers, building factor models, or designing diversified portfolios, using a dynamic view of industries helps reveal risks and opportunities that traditional approaches can miss. It also supports richer modeling: instead of assigning a firm to just one industry, you can quantify its exposure to multiple business areas.

How Captide fits in: Building a DIC system requires high-quality, up-to-date business descriptions at scale—something that’s hard to get from PDFs or manual scraping. Captide’s API solves this by providing structured access to data from SEC filings (10-Ks, 10-Qs, etc.). Powered by large language models, Captide can extract specific, relevant business descriptions with a simple API call—no manual parsing needed.

This gives you the raw input for a DIC system: timely, detailed narratives of what companies actually say they do.

In this post, we’ll walk through a practical implementation of a Dynamic Industry Classification pipeline using Captide’s API and Python. Specifically, we’ll:

Retrieve product and service descriptions from SEC filings using Captide.
Transform these descriptions into vector embeddings using a pre-trained language model.
Cluster the embeddings to identify groups of similar companies.
Label each cluster using GPT to generate interpretable names.
Summarize each company’s distribution across these dynamic clusters in a table.

By the end, you’ll see how a handful of companies can be grouped in a way that more accurately reflects the current market landscape—without relying on rigid, outdated labels.

All code examples will be provided. You’ll need a Captide API key, an OpenAI API key, and the Python packages requests, sentence-transformers, scikit-learn, pandas, openai, and tqdm.

Implementing Dynamic Industry Classification with Captide: Step-by-Step

Let’s walk through the implementation of a Dynamic Industry Classification (DIC) system using a four-step pipeline. We begin by extracting structured product and service data from SEC filings, then convert this text into embeddings, cluster the embeddings and assign an interpretable label to each cluster, and finally summarize how each company is distributed across the clusters.

Step 1: Extract Product and Service Descriptions from SEC Filings Using Captide

The first step is to collect data that accurately reflects what each company actually does. For this, we focus on the official descriptions of products and services found in regulatory filings. Captide’s API streamlines this process. It allows users to submit natural language queries and receive precise, source-backed responses extracted directly from filings. This eliminates the need for manual PDF parsing or keyword scraping.

In the example below, we query Captide’s /rag/agent-query-stream endpoint. The query asks: “List all the products or services of the company in a dictionary where each key is the name of the product and the value is a brief description.” We apply this query to a list of companies, restricting the search to 10-K, 10-Q, and 8-K filings from the last 90 days. While our simplified approach uses only these latest reports, a more advanced implementation could iterate over historical filings to track how business activities evolve over time.

The following code demonstrates how to loop over multiple tickers, collect their product/service descriptions using Captide, and build a dataset suitable for embedding and clustering:

import os
import re
import json
import datetime as dt
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

CAPTIDE_API_KEY = os.getenv("CAPTIDE_API_KEY", "YOUR_CAPTIDE_API_KEY")

TICKERS = ['AAPL', 'ABBV', 'ABT', 'ACN', 'ADBE', 'AIG', 'AMD', 'AMGN', 'AMT', 'AMZN', 'AVGO', 'AXP', 'BA', 'BAC', 'BK', 'BKNG', 'BLK', 'BMY', 'BRK.B', 'C', 'CAT', 'CHTR', 'CL', 'CMCSA', 'COF', 'COP', 'COST', 'CRM', 'CSCO', 'CVS', 'CVX', 'DE', 'DHR', 'DIS', 'DUK', 'EMR', 'FDX', 'GD', 'GE', 'GILD', 'GM', 'GOOG', 'GS', 'HD', 'HON', 'IBM', 'INTC', 'INTU', 'ISRG', 'JNJ', 'JPM', 'KO', 'LIN', 'LLY', 'LMT', 'LOW', 'MA', 'MCD', 'MDLZ', 'MDT', 'MET', 'META', 'MMM', 'MO', 'MRK', 'MS', 'MSFT', 'NEE', 'NFLX', 'NKE', 'NOW', 'NVDA', 'ORCL', 'PEP', 'PFE', 'PG', 'PLTR', 'PM', 'PYPL', 'QCOM', 'RTX', 'SBUX', 'SCHW', 'SO', 'SPG', 'T', 'TGT', 'TMO', 'TMUS', 'TSLA', 'TXN', 'UNH', 'UNP', 'UPS', 'USB', 'V', 'VZ', 'WFC', 'WMT', 'XOM']

LOOKBACK_DAYS = 90
START_DATE = (dt.date.today() - dt.timedelta(days=LOOKBACK_DAYS)).isoformat()
END_DATE = dt.date.today().isoformat()

HEADERS = {
    "X-API-Key": CAPTIDE_API_KEY,
    "Content-Type": "application/json",
    "Accept": "application/json",
}

QUERY_TEMPLATE = (
    "List all the products or services of the company in a dictionary where each key is the name "
    "of the product and the value is a brief description. In the description don't include company "
    "or brand names, just a description of the products or services offered. Don't include any "
    "introductory text or outro in the response, just the dictionary."
)

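def _extract_dict(text: str):
    """Pull the product dictionary out of the raw response text, or return None."""
    # Match the JSON string that follows "content", allowing escaped quotes inside it.
    m = re.search(r'"type":"full_answer","content":"((?:[^"\\]|\\.)*)"', text, re.DOTALL)
    if not m:
        return None
    try:
        # Decode JSON string escapes; unlike "unicode_escape", this keeps non-ASCII text intact.
        cleaned = json.loads(f'"{m.group(1)}"', strict=False)
        # Strip inline [#id] markers from the answer text.
        cleaned = re.sub(r"\s*\[#\w+\]", "", cleaned)
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

def fetch_products_one(ticker: str):
    """Query Captide for one ticker; return (ticker, product dict or None)."""
    payload = {
        "query": QUERY_TEMPLATE,
        "tickers": [ticker],
        "sourceType": ["10-K", "10-Q", "8-K"],
        "startDate": START_DATE,
        "endDate": END_DATE,
    }
    try:
        resp = requests.post(
            "https://rest-api.captide.co/api/v1/rag/agent-query-stream",
            json=payload,
            headers=HEADERS,
            timeout=300,
        )
        resp.raise_for_status()  # treat HTTP errors as failures instead of parsing an error body
    except requests.RequestException as err:
        print(f"{ticker}: request failed ({err})")
        return ticker, None
    return ticker, _extract_dict(resp.text)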

all_products = {}
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(fetch_products_one, t) for t in TICKERS]
    for fut in as_completed(futures):
        tic, prod = fut.result()
        if prod:
            all_products[tic] = prod

At this stage, we’ve collected product and service descriptions for each company in our target list. Captide’s query mechanism locates and extracts relevant portions of SEC filings—specifically, sections that describe what the company offers in terms of goods and services. The result is a structured set of textual descriptions for each company, which forms the foundation for our dynamic classification. These descriptions capture how each company presents its business activities—critical context for identifying semantic similarities across firms.
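As noted earlier, the same query could be repeated over historical periods to see how a company's offerings change. The sketch below only generates the date windows. `date_windows` is a hypothetical helper, not part of Captide's API, and it assumes each pair would be passed as the start and end date of one request.

```python
import datetime as dt

def date_windows(end: dt.date, n_windows: int, days: int = 90):
    """Return n_windows consecutive (start, end) ISO date pairs, newest first.

    Hypothetical helper: each pair is meant to be used as the start and
    end date of one request, so that one query covers one period.
    """
    windows = []
    for _ in range(n_windows):
        start = end - dt.timedelta(days=days)
        windows.append((start.isoformat(), end.isoformat()))
        end = start  # the next (older) window ends where this one starts
    return windows

windows = date_windows(dt.date(2025, 5, 23), n_windows=4)
for start_date, end_date in windows:
    print(start_date, "->", end_date)
```

Adjacent windows share a boundary date, so a filing dated exactly on a boundary could be returned twice. Deduplicating by filing is left out of this sketch.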

Here’s a sample output dictionary for Apple:

{
    'iPhone': 'A line of smartphones featuring advanced camera systems, high-performance processors, and a range of models from entry-level to professional, designed for communication, productivity, and entertainment.',
    'Mac': 'A family of personal computers, including laptops and desktops, powered by proprietary silicon chips, offering high performance for professional and personal use.',
    'iPad': 'A series of tablets designed for portability and versatility, supporting productivity, creativity, and entertainment, with models ranging from entry-level to high-performance.',
    'Wearables, Home and Accessories': 'A category including smartwatches, wireless earbuds, spatial computing devices, and other personal and home accessories that support health, fitness, audio, and immersive experiences.',
    'Apple Watch': 'A smartwatch offering health and fitness tracking, notifications, and connectivity features, with advanced health monitoring capabilities.',
    'AirPods': 'Wireless earbuds providing high-quality audio, active noise cancellation, and features such as hearing health monitoring and open-ear design.',
    'Apple Vision Pro': 'A spatial computing device enabling immersive experiences, spatial video, and productivity applications through advanced display and sensor technologies.',
    'Services': 'A suite of digital services including app distribution, cloud storage, digital payments, music and video streaming, news, books, fitness, gaming, and advertising.',
    'App Store': 'A digital marketplace for downloading and purchasing applications and games for mobile and desktop devices.',
    'Cloud Services': 'Online storage and synchronization solutions for files, photos, and device backups.',
    'Apple Pay': 'A digital payment platform enabling secure and private transactions in stores, online, and within apps.',
    'Music Streaming': 'A subscription-based service offering access to a large catalog of music and curated playlists.',
    'Video Streaming': 'A subscription service providing original movies, series, and exclusive video content.',
    'News': 'A digital news aggregation and subscription service offering access to a wide range of publications and magazines.',
    'Books': 'A digital bookstore and reading platform for purchasing and reading e-books and audiobooks.',
    'Fitness+': 'A subscription service offering guided workouts and fitness content.',
    'Arcade': 'A subscription-based gaming service providing access to a curated collection of games.',
    'Advertising': 'A platform for digital advertising across various devices and services.',
    'Apple Intelligence': 'A personal intelligence system integrating generative artificial intelligence features such as writing tools, image creation, visual intelligence, and privacy-focused on-device and cloud processing.',
    'Private Cloud Compute': 'A privacy-focused cloud infrastructure for processing generative AI tasks securely.'
}
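Because the dictionary comes back from a language-model-backed query, its shape may occasionally differ from the example above. A minimal sketch of a validation pass follows, assuming we only want entries whose name and description are both non-empty strings. `clean_products` is a hypothetical helper introduced here for illustration.

```python
def clean_products(raw):
    """Keep only entries with a string name and a non-empty string description."""
    if not isinstance(raw, dict):
        return {}
    cleaned = {}
    for name, desc in raw.items():
        if isinstance(name, str) and isinstance(desc, str) and desc.strip():
            # Collapse runs of whitespace so embeddings see tidy text.
            cleaned[name.strip()] = " ".join(desc.split())
    return cleaned

# Made-up input showing the cases the helper filters out.
sample = {
    "Phones": "  A line of   smartphones. ",
    "Empty": "",
    "Nested": {"unexpected": "structure"},
}
print(clean_products(sample))
```

Running a pass like this over each company's dictionary before embedding keeps malformed entries from reaching the clustering step.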

Step 2: Generate Embeddings Using a Sentence-Transformers Model

With a structured set of product and service descriptions in hand, the next step is to convert these texts into numerical embeddings—dense vector representations that capture the semantic meaning of each description. Embeddings are essential for comparing textual content: descriptions that refer to similar business activities (e.g., cloud infrastructure, digital payments, or logistics) will be positioned close to each other in the embedding space. This enables us to identify clusters of companies based on shared operational themes, even if the language they use differs.

To generate these embeddings, we’ll use a pre-trained model from the Sentence Transformers library. For this example, we’ll use the "all-MiniLM-L6-v2" model, which offers a strong balance between speed and accuracy for general-purpose sentence embedding tasks.

Below is the code to generate embeddings for the product and service descriptions we've collected:

from sentence_transformers import SentenceTransformer

EMBED_MODEL = "all-MiniLM-L6-v2"

texts, meta = [], []

for ticker, prod_dict in all_products.items():
    for name, desc in prod_dict.items():
        texts.append(desc)
        meta.append({"ticker": ticker, "product": name})

embedder = SentenceTransformer(EMBED_MODEL)
embeddings = embedder.encode(texts, batch_size=256, show_progress_bar=True)

With just a few lines of code, we’ve converted each product or service description into a dense vector (384 dimensions for all-MiniLM-L6-v2). If we began with, for example, 1,000 textual snippets, we now have 1,000 corresponding embeddings, each capturing the semantic content of a specific business activity.

These embeddings exist in a vector space where semantic similarity translates to spatial proximity. In other words, descriptions related to similar concepts—such as cloud computing, logistics, or consumer electronics—will naturally form clusters. This structure enables us to identify meaningful groupings of companies based on the actual language they use to describe their operations.
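To make "spatial proximity" concrete, here is a small sketch of cosine similarity, a common way to compare embedding vectors. The three-dimensional vectors are made up for illustration; real embeddings from the model have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (invented values), standing in for real embeddings.
cloud_storage = [0.9, 0.1, 0.0]
cloud_compute = [0.8, 0.2, 0.1]
soft_drinks = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(cloud_storage, cloud_compute)
sim_unrelated = cosine_similarity(cloud_storage, soft_drinks)
print(f"related: {sim_related:.2f}, unrelated: {sim_unrelated:.2f}")
```

The two cloud-related vectors score much higher than the unrelated pair, which is the property the clustering step relies on.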

Step 3: Cluster Embeddings and Assign Descriptive Labels Using GPT

With our product and service embeddings prepared, the next step is to identify patterns by grouping similar vectors together. This unsupervised clustering process allows us to uncover latent industry groupings that emerge directly from the data—free from any predefined categories.

We’ll use the K-means algorithm to partition the embedding space into k clusters. Each cluster represents a group of descriptions with similar semantic content, ideally corresponding to a coherent business domain or functional area. For this demonstration, we’ll set k=30, though this number can be adjusted based on dataset size, diversity, or downstream analytical needs.
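For intuition about what K-means does, the toy sketch below implements its two alternating steps (assign each point to the nearest centroid, then move each centroid to the mean of its points) in plain Python. It is only an illustration on made-up 2-D points with a naive initialization; the pipeline itself uses scikit-learn's implementation.

```python
def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm on small tuples of floats (toy illustration)."""
    centroids = [list(p) for p in points[:k]]  # naive init: the first k points
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        for i, p in enumerate(points):
            dists = [sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids]
            assignments[i] = dists.index(min(dists))
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Two visibly separated groups of made-up points.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

With real embeddings the groups are rarely this clean, which is why the choice of k and the initialization matter more in practice.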

After clustering, we’re left with numbered clusters (e.g., Cluster 0 to Cluster 29). These numeric labels are arbitrary and lack interpretability. To make the output more useful, we’ll generate descriptive names for each cluster using OpenAI’s GPT model.

For each cluster, we sample a few representative descriptions and prompt GPT to synthesize a concise, human-readable label that captures the common theme. This step leverages the model’s strength in summarization and abstraction, allowing us to convert raw groupings into meaningful industry labels.

The following code performs K-means clustering on the embeddings and then queries the OpenAI API to name each cluster:

import json
import os
import random
from collections import defaultdict

from openai import OpenAI
from sklearn.cluster import KMeans
from tqdm import tqdm

N_CLUSTERS = 30
SAMPLE_SIZE = 10  # descriptions shown to the model per cluster
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)

# n_init="auto" requires scikit-learn 1.2 or newer
km = KMeans(n_clusters=N_CLUSTERS, n_init="auto", random_state=42)
labels = km.fit_predict(embeddings)

# Group every description under its cluster ID
cluster_to_desc = defaultdict(list)
for idx, cid in enumerate(labels):
    cluster_to_desc[cid].append({
        "ticker": meta[idx]["ticker"],
        "product": meta[idx]["product"],
        "description": texts[idx],
    })

rng = random.Random(42)  # fixed seed so the sampled prompts are reproducible
cluster_names = {}
for cid, desc_list in tqdm(cluster_to_desc.items(), desc="Clusters"):
    # Randomly sample descriptions; send only the text, not tickers or product names
    sampled = rng.sample(desc_list, min(SAMPLE_SIZE, len(desc_list)))
    sample_texts = [d["description"] for d in sampled]
    bullets = "\n".join(f"- {t}" for t in sample_texts)
    system = "You are a market-structure analyst. Name the common theme."
    prompt = ("Return ONLY valid JSON: {\"label\": string, \"confidence\": int (0-100)}\n\n" + bullets)
    try:
        resp = client.chat.completions.create(
            model="gpt-4o",
            temperature=0.2,
            max_tokens=50,
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
            ],
        )
        label = json.loads(resp.choices[0].message.content)["label"].strip()
    except Exception:
        # API errors, malformed JSON, or a missing "label" key all fall through to the fallback
        label = ""
    # Fall back to a generic identifier such as "Cluster 7" when no usable label came back
    cluster_names[cid] = label or f"Cluster {cid}"

In this code block, several key steps transform raw clustering output into interpretable industry categories:

KMeans Clustering: We apply the KMeans algorithm to group similar embeddings together. Each cluster is composed of descriptions that share underlying semantic content. For example, descriptions related to cloud computing might fall into one cluster, while those focused on retail, advertising, or hardware could form others—depending entirely on the structure present in the data.
Cluster Assignment: The output labels is an array indicating the cluster assignment for each individual description. This allows us to group descriptions according to their respective cluster IDs.
Cluster Summarization Using GPT: For each cluster, we randomly sample a few representative descriptions (sample_texts). These are used to construct a prompt for OpenAI’s GPT model, asking it to return a concise, human-readable industry category name that captures the essence of the sampled descriptions.
For instance, if a cluster contains entries like “AWS cloud platform services” and “Azure cloud computing offerings”, GPT might return a name such as “Cloud Infrastructure”.
Naming and Fallback Logic: We store the results in a dictionary, cluster_names, which maps cluster IDs to descriptive labels. To maintain robustness, we implement a fallback: if GPT returns an empty string, the cluster is labeled with a generic identifier such as “Cluster 7”.
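The naming and fallback logic described above can be isolated into a small function, which makes it easy to test without calling the API. This is a sketch under the assumption that the model's reply is a JSON string with a "label" field. `parse_label` is a hypothetical helper name.

```python
import json

def parse_label(raw, cluster_id):
    """Return the model's label, or a generic 'Cluster N' name if it is unusable."""
    fallback = f"Cluster {cluster_id}"
    try:
        data = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return fallback  # reply was missing or not valid JSON
    label = data.get("label") if isinstance(data, dict) else None
    if not isinstance(label, str) or not label.strip():
        return fallback  # label missing, empty, or not a string
    return label.strip()

print(parse_label('{"label": "Cloud Infrastructure", "confidence": 90}', 0))
print(parse_label('{"label": ""}', 7))
print(parse_label("not json", 3))
```

Keeping the parsing separate from the API call also makes it straightforward to log which clusters fell back to a generic name.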

Once this step is complete, we have a set of meaningful industry labels derived directly from the data. Sample output might look like:

Cluster 0: Cloud Infrastructure  
Cluster 1: Online Retail  
Cluster 2: Digital Advertising  
... (and so on for all 30 clusters)

The resulting labels represent dynamically generated industry categories—not imposed from a predefined taxonomy, but derived directly from the data. This approach surfaces business themes that are actually present in companies’ own descriptions, offering a more adaptive and accurate view of the market landscape. Assigning descriptive names to the clusters adds an essential layer of interpretability.

This classification can be versioned and stored over time, enabling longitudinal analysis—for example, tracking how a company’s product mix shifts across clusters in subsequent quarters. Additionally, the number of clusters can remain flexible, allowing new themes or sub-industries to emerge naturally as the market evolves.
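One simple way to quantify such a shift is to compare a company's cluster shares between two stored snapshots. The sketch below uses total variation distance and made-up shares. It assumes the cluster labels are comparable across snapshots, which is not automatic: cluster IDs from two independent K-means runs do not line up, so labels would need to be matched or the centroids reused.

```python
def distribution_shift(before, after):
    """Total variation distance between two {label: share} dicts.

    0.0 means the same mix; 1.0 means no overlap at all.
    """
    labels = set(before) | set(after)
    return 0.5 * sum(abs(before.get(l, 0.0) - after.get(l, 0.0)) for l in labels)

# Hypothetical snapshots for one company in two consecutive quarters.
q1 = {"Devices and Hardware": 0.50, "Digital Services": 0.50}
q2 = {"Devices and Hardware": 0.25, "Digital Services": 0.50, "AI Infrastructure": 0.25}

shift = distribution_shift(q1, q2)
print(f"shift between snapshots: {shift:.2f}")
```

A threshold on this value could serve as a simple trigger for reviewing a company whose mix has changed.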

Step 4: Summarize Cluster Distribution at the Company Level

In the final step, we shift from product-level analysis to a company-level view. The goal is to understand how each company is represented across the dynamically generated industry clusters.

To summarize the distribution, we construct a pandas DataFrame listing each company along with the cluster assignments of its product and service descriptions. We then aggregate the results to show how many entries from each company fall into each cluster.

While our example treats each description equally, a more advanced version could weight entries by financial relevance—such as segment revenue or operating income—for a more economically meaningful representation.
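A weighted variant could look like the sketch below, where each description carries a weight such as segment revenue. The figures are invented, and mapping individual product descriptions to reported segment revenue is a separate problem that this sketch does not address.

```python
from collections import defaultdict

def weighted_exposure(items):
    """items: list of (cluster_label, weight). Returns {label: share}; shares sum to 1."""
    totals = defaultdict(float)
    for label, weight in items:
        totals[label] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: w / grand_total for label, w in totals.items()}

# Hypothetical (cluster, revenue) pairs for one company; the numbers are made up.
items = [
    ("Online Retail", 60.0),
    ("Cloud Infrastructure", 30.0),
    ("Digital Media", 10.0),
    ("Online Retail", 100.0),
]
exposure = weighted_exposure(items)
print(exposure)
```

Compared with simple counts, this keeps a long list of minor products from outweighing one large business line.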

Here’s the code to build and display the summary DataFrame:

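import pandas as pd
from collections import Counter, defaultdict

# Number of descriptions per company (the denominator for the percentages)
totals = Counter(m["ticker"] for m in meta)

# Count how many of each company's descriptions fall into each cluster
per_company_cluster = defaultdict(Counter)
for i, cid in enumerate(labels):
    per_company_cluster[meta[i]["ticker"]][cid] += 1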

rows = []
for tic in sorted(per_company_cluster):
    for cid, cnt in per_company_cluster[tic].items():
        pct = round(100 * cnt / totals[tic], 2)
        rows.append(
            {
                "Ticker": tic,
                "ClusterID": cid,
                "ClusterName": cluster_names[cid],
                "Products": cnt,
                "% of Ticker Products": pct,
            }
        )

summary_df = pd.DataFrame(rows).sort_values(["Ticker", "% of Ticker Products"], ascending=[True, False])

# Display the first rows of the summary
print(summary_df.head(20).to_string(index=False))

Each company now has a multi-dimensional profile that reflects its presence across several dynamically defined industry clusters. For example, Apple has three product descriptions categorized under the “Devices and Hardware Products” cluster while also showing representation in “Software & Digital Content” and “AI & Cloud Infrastructure,” capturing its broader ecosystem of services and emerging technologies.

Similarly, Amazon exhibits a strong presence in both the “Online Retail” and “Cloud Infrastructure” clusters, consistent with its dual role as a global e-commerce leader and a dominant cloud services provider through AWS. It may also appear in “Digital Media” due to offerings like Prime Video.

This more nuanced view aligns closely with how these companies actually operate, in contrast to traditional classification systems like GICS, which might label Amazon simply as an “Internet Retail” company.

An excerpt of the summary for Apple (ClusterID column omitted):

Ticker  ClusterName                    Products  % of Ticker Products
AAPL    Software & Digital Content     11        55.0
AAPL    Wearables & Accessories        4         20.0
AAPL    Devices and Hardware Products  3         15.0
AAPL    AI & Cloud Infrastructure      2         10.0
...

Conclusion

In this post, we introduced the concept of Dynamic Industry Classification (DIC) and demonstrated how to build a simple yet powerful DIC system using Captide’s API, combined with modern natural language processing tools.

We began by using Captide to extract product and service descriptions from real SEC filings—providing structured access to rich, unstructured data that reflects what companies actually do. These descriptions were transformed into vector embeddings using a sentence-transformer model, enabling us to cluster them into naturally emerging business categories. With GPT, we assigned intuitive labels to these clusters, creating a new, adaptive industry taxonomy on the fly. Finally, we summarized each company’s distribution across these clusters, revealing business complexity that traditional classifications often obscure.

This approach illustrates how AI can unlock deeper insights in financial analysis. With relatively little code, we replaced rigid, static industry labels with a data-driven, multidimensional view of how companies operate.

Captide’s API played a central role—providing clean, on-demand access to the information buried in filings, which served as the foundation of our classification. In a real-world scenario, this framework could be expanded to cover hundreds or thousands of companies and updated continuously as new filings arrive—offering analysts a living, evolving map of the business landscape.

Potential Improvements and Next Steps

Our implementation is a basic prototype of a dynamic industry classification. There are many ways to improve and extend this approach:

Enhance Accuracy with Better Models: We used a general-purpose MiniLM model for embeddings. Accuracy could improve with domain-specific models (e.g., fine-tuned on financial text) or more powerful models for capturing nuance in company descriptions. Similarly, other clustering techniques might fit the data better than K-means: density-based methods such as DBSCAN infer the number of clusters from the data, and spectral clustering can capture non-spherical cluster shapes.
Hierarchical Clustering: In industry analysis, there’s often a hierarchy (sectors, industries, sub-industries). We could apply hierarchical clustering to build a tree of clusters – for example, first split into broad sectors, then subdivide those into finer groups. This would yield a dynamic taxonomy rather than a single flat layer of clusters.
Multi-label Classification and Soft Clusters: Instead of hard assignment of each description to one cluster, we could allow overlap or probabilities. For instance, using topic modeling or soft clustering, a product description might belong 30% to one cluster and 70% to another. This reflects reality when a product spans multiple domains. It also aligns with the idea of giving companies a distribution across sectors (as some research suggests using embedding-based methods to get probability distributions over industries for each company).
Expand Data Sources: We focused on SEC filings (10-Ks, 10-Qs, and 8-Ks) for product information. Additional sources could enrich the classification. Earnings call transcripts, investor presentations, or even news articles could provide more context on what companies are doing. Captide’s platform ingests those as well. More data would help capture emerging trends (e.g., if many companies start mentioning “AI initiatives” in press releases, a new cluster might form around that theme).
Automate and Scale: To turn this into a production-ready tool, one would add automation around data refresh (fetch new filings as they come), periodic re-clustering, and monitoring of cluster stability over time. We’d also want to evaluate the clusters qualitatively (do they make intuitive sense?) and quantitatively (perhaps by checking if companies in the same cluster show correlated stock performance or other fundamentals). This could become an evolving system that alerts analysts to changes – for example, if a company suddenly gets a new cluster assignment due to a strategic shift or acquisition.
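As an illustration of the soft-cluster idea from the list above, the sketch below turns distances to cluster centroids into membership weights with a softmax. This is one simple option among several and not the method used in this post; established alternatives include fuzzy c-means and Gaussian mixture models. The centroids and the `temperature` parameter are made-up values for the example.

```python
import math

def soft_assign(vector, centroids, temperature=1.0):
    """Softmax over negative squared distances: closer centroids get more weight."""
    dists = [sum((x - c) ** 2 for x, c in zip(vector, cen)) for cen in centroids]
    scores = [-d / temperature for d in dists]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 2-D centroids; a real pipeline would use the fitted K-means centers.
centroids = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
weights = soft_assign((0.4, 0.0), centroids)
print([round(w, 3) for w in weights])
```

A point lying between the first two centroids gets split between them, while the distant third centroid receives almost no weight. Averaging these weights over a company's descriptions would give a soft company-level profile.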

By implementing these improvements, a dynamic industry classification system can become even more powerful. The end goal is to have a living industry map that reflects the real structure of the business world, providing financial professionals with deeper insights than static labels ever could. As we’ve shown, using Captide’s API alongside modern NLP techniques is an effective way to start building that vision. With continued refinement, DIC could play a valuable role in fundamental analysis, portfolio construction, and uncovering the next big industry trends.

May 23, 2025
