Part one of a two-part series on how C3 AI can turn complex ESG data into forward-thinking, actionable strategy

Companies are facing increasing pressure from stakeholders — investors, customers, partners, regulators, and the public — to measure and disclose performance on topics across environmental, social, and governance (ESG) programs. But aligning corporate ESG strategy with those stakeholder expectations and priorities is a challenge.

To do this, once a year, companies undergo what’s known as a materiality assessment, but this exercise can only go so far. It’s limited by mountains of hard-to-sort data, constantly shifting viewpoints, and, typically, constrained resources.

It is incredibly difficult for sustainability teams to pursue initiatives across the wide breadth of ESG topics, including greenhouse gas emissions, workforce diversity, and business ethics. They must therefore focus on the topics that are most relevant and material to their business.

How AI Can Help

AI is a natural solution to keep track of changes in stakeholder priorities over time and widen the aperture of stakeholder signals. One technique, natural language processing (NLP), is particularly well suited to address this problem because it can be used to digest stakeholder documents such as reports, press releases, and engagement guidelines and identify ESG sentiment.

Even with NLP, building an AI application that ingests and processes data from ESG documents in a standardized way is a difficult task. There are many questions to ask when designing the technical features of this application, but they can all boil down to one: How do we measure the extent to which a document discusses a certain ESG topic?

This was a critical question that the C3 AI data science team needed to answer while building the C3 AI ESG application. The outcome: A five-step process that quantifies ESG stakeholder materiality.


  1. Data Ingestion: Web collection to capture and store all relevant stakeholder documents.
  2. Data Preparation: Document cleansing and parsing to digest input documents into NLP-ready paragraphs.
  3. Paragraph-Level Analysis Pipeline: Identifying ESG topics and scoring materiality in paragraphs with NLP.
  4. Aggregate Materiality: Machine-learning (ML) aggregation of ESG materiality scores across documents and stakeholders, all weighted over time.
  5. User Insights: Exposing actionable AI recommendations with visuals and alerts.

Although data ingestion and preparation are critical to the success of the NLP model, the novel piece of this technology is the paragraph-level analysis pipeline, shown as step three in the workflow above. This is where the machine learning NLP pipeline determines whether, and to what extent, a paragraph discusses an ESG topic.

This pipeline is an ensemble model that combines four components. Each component makes its own decision about the content of a paragraph, and together those decisions produce a final prediction that answers the question: does this paragraph discuss a specific ESG topic?

That decision is then followed by a post-processing step to ensure the model considers how ESG topics overlap and connect. For example, say the model is analyzing a report that mentions two ESG topics: greenhouse gas emissions and biodiversity. Those topics are not only related to each other, but also fall under the umbrella of more comprehensive ESG topics, including climate. There is a chance that climate is never mentioned in the report even though GHG emissions and biodiversity are discussed under this umbrella topic. In that case, the post-processing step raises the score given to the topic of climate so that it is weighted fairly in the model. In short, the post-processing step captures how the topics relate hierarchically in whichever document is being analyzed.

Now, before diving into the ensemble model components, let’s first discuss the underlying NLP models and how their mechanisms informed our engineering of the paragraph-level pipeline.

How C3 AI ESG Uses Large Language Models

The NLP approach for ESG stakeholder materiality leverages two publicly available large language models (LLMs): a general BERT model and a domain-specific BERT model. These LLMs are used to generate text embeddings. Text embeddings are numerical representations of text, and similar embeddings represent text content with similar meanings. For a given ESG topic, the embeddings represent the topic definition and semantically related words, and the model leverages them to identify when and to what extent the topic is discussed in a paragraph.

Text embeddings from both LLMs are leveraged in the NLP pipeline:

  1. General Embedding: Produced by a general BERT model that is effective at capturing semantic similarity between sentences and short paragraphs.
  2. Domain Embedding: Produced by a domain BERT model that is particularly strong at capturing the ESG-specific content of the text.

Building the Unique Ensemble Model

The ML pipeline takes paragraph text data as inputs and produces a prediction that indicates whether an ESG topic is discussed in the paragraph. It also produces a score representing how similar the paragraph is to a “centroid” (domain embedding) that defines the ESG topic. Both outputs are produced on a per topic per paragraph basis because each paragraph can discuss multiple ESG topics.

The pipeline operates at the paragraph level because of the token length limitations of LLMs, and because a paragraph is the shortest component of a document at which the presence of an ESG topic can be measured without losing context.

Building such a complex ensemble NLP pipeline is enabled by the C3 AI Platform architecture for composing ML pipelines. The pipeline is designed to match the performance of human ESG experts by leveraging four component models. The key term–based prediction uses expert knowledge, defined through ESG topic key terms, to perform well in low-data environments. The centroid-based prediction uses labeled data to continuously enhance model performance over time. Together, the ensemble pipeline applies LLMs to produce predictions 100x faster than expert labelers and is extensible to new ESG issues as they arise.

The paragraph-level prediction is produced by an ML model that combines the strengths of rule-based and NLP-based methods. By leveraging predictions from four component models, this ensemble model achieves higher overall model performance.

The Ensemble Prediction Model


The Ensemble Prediction Model Components

1. Key Term Search: The key term search model searches each paragraph for a matching set of configured key terms for the ESG topic. If and only if a matching set of key terms is found, then the model predicts that the paragraph discusses the topic.

In order to make key term search successful, a text processing pipeline pre-processes both the paragraph text and key terms, including punctuation removal, extra blank space removal, lowercasing, and lemmatization. The processing pipeline ensures that small differences in the way that a term is denoted do not impact whether a matching set of key terms is identified.

The key term search logic is configured per ESG topic and consists of a Boolean logical expression that combines key term groups. An illustrative example for Customer Privacy is below:

Boolean Expression: g0 | (g1 & g2)

Term Definitions:

g0: private data, personally identifiable information, PII, privacy, GDPR

g1: leak, breach, hack

g2: accounts, users, customers

In this case, a paragraph is given a positive prediction if it contains any of the key terms in g0, or at least one key term in g1 and at least one key term in g2.
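The Customer Privacy rule above can be sketched in a few lines of Python. This is a minimal illustration, not the application's implementation: the function names are hypothetical, the Boolean expression is hard-coded rather than parsed from configuration, and `toy_lemma` is a crude suffix stripper standing in for a real lemmatizer. The same normalization is applied to the key terms and to the paragraph so that small differences in form do not block a match.

```python
import string

# Key term groups copied from the Customer Privacy example above.
TERM_GROUPS = {
    "g0": ["private data", "personally identifiable information", "PII", "privacy", "GDPR"],
    "g1": ["leak", "breach", "hack"],
    "g2": ["accounts", "users", "customers"],
}

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def toy_lemma(token):
    """Crude suffix stripping; an assumed stand-in for a real lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace, lemmatize."""
    tokens = text.lower().translate(_PUNCT_TO_SPACE).split()
    return " ".join(toy_lemma(t) for t in tokens)


def group_hits(paragraph, groups):
    """Return, per group, whether any of its key terms appears in the paragraph."""
    padded = f" {normalize(paragraph)} "  # padding enforces whole-word matches
    return {
        name: any(f" {normalize(term)} " in padded for term in terms)
        for name, terms in groups.items()
    }


def customer_privacy_rule(hits):
    """Boolean expression g0 | (g1 & g2) from the example above."""
    return hits["g0"] or (hits["g1"] and hits["g2"])


def key_term_search(paragraph):
    """Positive prediction if and only if the configured expression is satisfied."""
    return customer_privacy_rule(group_hits(paragraph, TERM_GROUPS))
```

With this sketch, a paragraph such as "The company disclosed a breach affecting 2 million customers." matches through g1 and g2, while "The pipeline had a leak near the river." matches g1 only and is rejected.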

2. Nearest Neighbors with Key Term Search: The nearest neighbors with key term search model leverages a weakly supervised nearest neighbors model to consider inter-word semantics and filter out paragraphs that key term search predicts as positive but that in fact discuss unrelated content. For example, “Biodiversity” could be configured to provide a positive key term search prediction if the key term “ecosystem” is in the text. However, some paragraphs could contain “business ecosystem” or “developer ecosystem” and thus are not relevant to biodiversity. In these cases, the nearest neighbors with key term search model leverages paragraph embeddings to provide context and yields a negative prediction.

During model training, all training paragraphs are given “weak labels” that are defined as the results of the key term search pipeline. These weak labels and the corresponding general embeddings of those paragraphs are saved in the model, which is why the approach is considered weakly supervised. During inference, the model identifies the paragraphs in the training set with the nearest general embeddings to the input paragraph. If a high enough proportion of these identified training set paragraphs have positive weak labels, the input paragraph is predicted to discuss the ESG topic.
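The inference step described above can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the two-dimensional vectors stand in for general BERT embeddings, cosine similarity is used as the distance measure, and `k` and `min_positive_share` are hypothetical parameters for the "high enough proportion" rule.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weak_knn_predict(query, train_embeddings, weak_labels, k=3, min_positive_share=0.5):
    """Predict positive if enough of the k nearest training paragraphs
    carry a positive weak label (i.e. a key term search hit)."""
    ranked = sorted(
        range(len(train_embeddings)),
        key=lambda i: cosine(query, train_embeddings[i]),
        reverse=True,
    )
    neighbors = ranked[:k]
    if not neighbors:
        return False
    share = sum(weak_labels[i] for i in neighbors) / len(neighbors)
    return share >= min_positive_share


# Toy training set for "Biodiversity": (embedding, weak label from key term search).
TRAIN = [
    ([0.90, 0.10], True),   # natural ecosystem paragraph
    ([0.95, 0.05], True),   # natural ecosystem paragraph
    ([0.80, 0.20], True),   # natural ecosystem paragraph
    ([0.10, 0.90], False),  # business paragraph, no key term
    ([0.05, 0.95], False),  # business paragraph, no key term
    ([0.20, 0.80], True),   # "business ecosystem": key term hit, off topic
    ([0.15, 0.85], False),  # business paragraph, no key term
]
EMBEDDINGS = [e for e, _ in TRAIN]
WEAK_LABELS = [label for _, label in TRAIN]

nature_query = [0.85, 0.15]     # resembles the natural ecosystem paragraphs
developer_query = [0.10, 0.90]  # "developer ecosystem": resembles business text
```

In this toy data, the "developer ecosystem" query sits among business paragraphs that mostly lack the key term, so its neighbors carry negative weak labels and the prediction is negative even though key term search alone would have matched.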

To recap key term–based prediction: if both component models (key term search and nearest neighbors with key term search) yield positive predictions, then the ensemble model overall predicts that the paragraph discusses the ESG topic. The next section discusses the other half of the ensemble model: Centroid-Based Prediction.

3. Embedding Similarity: The embedding similarity model is a supervised model that uses domain embeddings to determine if a paragraph meets the definition of the ESG topic. It computes a prediction based on the similarity between the domain embedding of the target paragraph and the domain embedding of a “centroid” that describes the ESG topic. This centroid is often a topic definition, or the definition and a sentence that lists key terms related to the topic. The threshold for a positive prediction is independent for each topic and is learned during model training to maximize the F1 score on the training set.
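To make the per-topic threshold concrete, here is a minimal sketch of choosing the cutoff that maximizes F1 on a training set. The similarity scores and labels are made-up values for a single topic, and `learn_threshold` is a hypothetical helper; the sketch simply tries each observed score as a candidate threshold.

```python
def f1_score(predictions, labels):
    """F1 for boolean predictions against boolean ground truth labels."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def learn_threshold(scores, labels):
    """Return (threshold, f1) for the candidate cutoff with the highest training F1.

    A paragraph is predicted positive when its similarity score is >= threshold.
    """
    best_threshold, best_f1 = None, -1.0
    for candidate in sorted(set(scores)):
        predictions = [s >= candidate for s in scores]
        f1 = f1_score(predictions, labels)
        if f1 > best_f1:
            best_threshold, best_f1 = candidate, f1
    return best_threshold, best_f1


# Hypothetical paragraph-to-centroid similarity scores and labels for one topic.
SCORES = [0.91, 0.84, 0.77, 0.62, 0.55, 0.40]
LABELS = [True, True, False, True, False, False]

threshold, f1 = learn_threshold(SCORES, LABELS)
```

For this toy data the selected threshold is 0.62, which keeps all three positive paragraphs at the cost of one false positive. Because each topic has its own score distribution, the search is repeated per topic.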

4. Nearest Neighbors with Training Labels: The nearest neighbors with training labels approach is similar to nearest neighbors with key term search, except that it leverages a supervised approach and uses domain embeddings instead of general embeddings. The model utilizes a supervised multilabel k-nearest neighbors model and is trained using domain embeddings and ground truth labels from the training set. At inference, the model uses Bayesian inference to generate a prediction for the target paragraph based on its nearest neighbors in domain embedding space. It yields a positive prediction if the nearest neighbors have positive labels. This supervised nearest neighbors model performs well only when sufficient training data is available for the ESG topic.
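One well-known way to pair k-nearest neighbors with Bayesian inference is the ML-kNN style of estimating, from the training set, how likely a paragraph is to be positive given how many of its neighbors are positive. The sketch below shows that idea for a single topic. It is an assumption about how such a model could work, not a description of the application's model: the two-dimensional vectors stand in for domain embeddings, Euclidean distance is assumed, and `k` and `smooth` are hypothetical parameters.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def neighbor_labels(point, embeddings, labels, k, skip=None):
    """Labels of the k training paragraphs closest to `point` (optionally skipping one)."""
    order = sorted(
        (i for i in range(len(embeddings)) if i != skip),
        key=lambda i: euclidean(point, embeddings[i]),
    )
    return [labels[i] for i in order[:k]]


def fit(embeddings, labels, k=3, smooth=1.0):
    """Estimate the prior and the neighbor-count likelihoods for one topic."""
    n = len(labels)
    prior_pos = (smooth + sum(labels)) / (2 * smooth + n)
    # counts[y][j]: training paragraphs with label y that have j positive neighbors.
    counts = {True: [0] * (k + 1), False: [0] * (k + 1)}
    for i, label in enumerate(labels):
        j = sum(neighbor_labels(embeddings[i], embeddings, labels, k, skip=i))
        counts[label][j] += 1
    likelihood = {
        y: [(smooth + c) / (smooth * (k + 1) + sum(counts[y])) for c in counts[y]]
        for y in (True, False)
    }
    return {"embeddings": embeddings, "labels": labels, "k": k,
            "prior_pos": prior_pos, "likelihood": likelihood}


def predict(model, query):
    """Positive if the posterior for 'discusses the topic' exceeds the alternative."""
    j = sum(neighbor_labels(query, model["embeddings"], model["labels"], model["k"]))
    pos = model["prior_pos"] * model["likelihood"][True][j]
    neg = (1 - model["prior_pos"]) * model["likelihood"][False][j]
    return pos > neg


# Toy labeled training set: four on-topic and four off-topic paragraphs.
EMB = [[0.90, 0.10], [0.80, 0.20], [0.85, 0.10], [0.95, 0.20],
       [0.10, 0.90], [0.20, 0.80], [0.10, 0.80], [0.20, 0.95]]
LAB = [True, True, True, True, False, False, False, False]
model = fit(EMB, LAB, k=3)
```

The smoothing term keeps the estimates finite when a neighbor count was never observed in training, which also hints at why this component needs a reasonable amount of labeled data per topic before its estimates are reliable.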


The ensemble model yields a positive prediction using the following Boolean expression based on the component model predictions: (1 & 2) | (3 & 4). In other words, the ensemble prediction is true if the predictions for 1 and 2 are both true, if the predictions for 3 and 4 are both true, or if all four are true.

The key term–based prediction indicates that a paragraph discusses a topic if it contains the key terms of the topic and has a general embedding similar to those of other paragraphs that contain the key terms. The centroid-based prediction indicates that a paragraph discusses a topic if its domain embedding is both similar to the domain embedding of the topic definition and similar to positively labeled training paragraphs.
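Expressed as code, the combination of the four component predictions is a single Boolean expression. The function and parameter names below are hypothetical; only the logic comes from the description above.

```python
def ensemble_prediction(key_term, weak_knn, embedding_similarity, supervised_knn):
    """Combine the four component predictions for one topic and one paragraph."""
    key_term_based = key_term and weak_knn                    # components 1 and 2
    centroid_based = embedding_similarity and supervised_knn  # components 3 and 4
    return key_term_based or centroid_based
```

Either half can carry the prediction on its own, so a topic with few labeled examples can still be detected through the key term half, while a topic with rich training data can be detected even when no configured key term appears.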

Ensuring Deduplication with Post-Processing

Some ESG topics are, by definition, supersets of others. Once the ensemble model processes all input paragraphs, a post-processing pipeline double-checks the predictions for parent and child ESG topics. The pipeline handles these cases by predicting a “parent topic” as present in a paragraph if any of its “child topics” are discussed in the paragraph. The pipeline calculates the embedding similarity score for a parent topic as the maximum of its own similarity score and the maximum similarity score of its children.

For example, “Climate” is defined as a parent ESG topic of “Greenhouse Gas Emissions.” If a paragraph is predicted to discuss “Greenhouse Gas Emissions” but not “Climate,” the post-processing pipeline will give it a positive prediction for both and will set the embedding similarity score for “Climate” to be the maximum of the similarity scores for “Climate” and “Greenhouse Gas Emissions.”
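The roll-up rule can be sketched as follows for a single paragraph. The hierarchy, the scores, and the function name are illustrative assumptions, and the sketch handles only one level of parent and child topics.

```python
# Hypothetical one-level topic hierarchy.
HIERARCHY = {"Climate": ["Greenhouse Gas Emissions", "Biodiversity"]}


def roll_up(predictions, scores, hierarchy):
    """Propagate child predictions and similarity scores to their parent topics.

    `predictions` maps topic -> bool and `scores` maps topic -> similarity score,
    both for one paragraph. Returns updated copies; the inputs are not modified.
    """
    predictions = dict(predictions)
    scores = dict(scores)
    for parent, children in hierarchy.items():
        # The parent is present if any child is present.
        if any(predictions.get(child, False) for child in children):
            predictions[parent] = True
        # The parent score is the max of its own score and its children's scores.
        child_scores = [scores[child] for child in children if child in scores]
        if child_scores:
            scores[parent] = max(scores[parent], max(child_scores))
    return predictions, scores


paragraph_predictions = {
    "Climate": False,
    "Greenhouse Gas Emissions": True,
    "Biodiversity": False,
}
paragraph_scores = {
    "Climate": 0.41,
    "Greenhouse Gas Emissions": 0.78,
    "Biodiversity": 0.30,
}

new_predictions, new_scores = roll_up(paragraph_predictions, paragraph_scores, HIERARCHY)
```

Here the paragraph ends up positive for “Climate” with a score of 0.78, inherited from “Greenhouse Gas Emissions,” while the child topic results are left unchanged.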

The Result: A Powerful AI Application for ESG Leaders

The C3 AI ESG application leverages NLP to surface insights on stakeholder ESG materiality for an organization, removing the guesswork and the high volume of manual effort otherwise required. To quantify stakeholder materiality, the AI-enabled application uses a novel data science approach to continuously monitor and evaluate discussion of key ESG topics in published stakeholder documents. It persists the results in a time series so that changes in stakeholder priorities can be evaluated over time.

To dive further into other C3 AI ESG capabilities beyond stakeholder materiality, download the application data sheet or check out the product page.

In the second half of this series, we will explore how C3 AI ESG uses the information extracted from stakeholder documents to surface insights, and how ESG leaders can use those insights to develop strategic goals and actionable plans.


About the Authors

Robert Young (author) is a Manager in the Data Science team at C3 AI, where he develops machine learning and optimization solutions for sustainability, supply chain, and predictive maintenance problems. Prior to C3 AI, he built AI systems for smart buildings. He holds an MS in Engineering from Stanford University.

Jessica Matthys (editor) is a Product Manager at C3 AI, working on the C3 AI Sustainability Suite. Prior to C3 AI, Jessica worked in energy and sustainability at Tesla and Accenture. She has an MBA from the Kellogg School of Management at Northwestern University and a BSE in Mechanical Engineering from Duke University.

Thank you to the C3 AI ESG Data Science team, including Hang Le and Suvansh Dutta, for their contributions to this blog.