As part of our research collaboration with the Value Reporting Foundation to contribute to the revision of their Human Capital standards, Rights CoLab teamed up with the Data for Good Scholars (DfG) Program of Columbia University’s Data Science Institute in December 2019. The goal is to use data science — natural language processing (NLP) and other machine learning methods — to find evidence of the financial materiality of labor-related human rights risks.

In consultation with a group of international labor experts that Rights CoLab has convened (the “Expert Group”), as well as data scientists specializing in human rights or sustainability standards, we identified two workstreams: the “Extension Workstream” and the “Addition Workstream.” Each workstream addresses a criterion of financial materiality as defined in the SASB Standards Conceptual Framework. For the Extension Workstream, we seek evidence to justify the extension of the existing Human Capital disclosure topics to a broader range of industry standards that currently lack that topic. We focus on identifying investor interest in a given topic for those industries not currently containing that indicator. For the Addition Workstream, we seek to identify new relationships between labor rights and SASB criteria for financial materiality across a broad range of SASB industry standards to support the case for the development of new metrics that better reflect the human rights risks that companies face. As such, these metrics do not currently appear anywhere on the SASB materiality map.

During the Fall 2021 semester, the Data for Good Scholars project team built upon the methods developed over the course of the project to deliver our first findings. We applied the methods, described in previous updates,[1] to two topics, “Diversity, Equity & Inclusion” (DEI) and “Labor Conditions in the Supply Chain” (LCSC), using two data sources: Form 10-Ks (annual financial filings of U.S. corporations) and FactSet’s Truvalue SASB Spotlight Events dataset (TVL). In this update, we describe the data sources and our methods, including our reasoning around the treatment of terms for natural language processing (NLP). In subsequent updates, we will present the findings of our work to generate evidence of the financial materiality of DEI and LCSC across all industries. Our objective is to find supporting evidence for extending existing DEI and LCSC metrics to industry standards that do not currently contain that metric and for defining new metrics for new business risks.

Isha Shah (Coordinator), Raiha Khan, Sabrina Jade Shih, Jay Trevino, and Junyi Zhang comprised the Fall 2021 data team. The DfG work is led by Rights CoLab Co-Founders Joanne Bauer and Paul Rissman, who together with the Expert Group guide the project. DfG Program Coordinator Dr. Ipek Ensari supervises the data science methods selection and implementation. Meagan Barrera, a consultant, conducted a literature review to augment our list of terms.

We hope this report will be useful to data scientists working on sustainability disclosure standards, who may wish to replicate some aspect of this work. Our team is working on a public GitHub repository for our analyses as notebooks/scripts, which will be open to contribution. We welcome comments on our decisions and suggestions for revisions. Suggestions may be sent to

Data Sources

Form 10-Ks

Form 10-Ks are annually mandated corporate financial filings to the U.S. Securities and Exchange Commission (SEC) in which companies are required to report their “risk factors” (Item 1A). Form 10-Ks provide just one measure of investor interest, yet a valuable one, in that the filings represent a corporate view of the risks that investors find most salient. Even if companies are conservative and disclose no more than the necessary minimum, we can consider the disclosures in our Form 10-K dataset as unambiguous risks.

The 10-K dataset we built — based on a labeled Form 10-K dataset that SASB had created and shared with us — contains approximately 39,000 Form 10-Ks, representing all those filed by every US-domiciled company from 2013 to 2020.

FactSet’s Truvalue Spotlight Events Dataset

FactSet’s Truvalue SASB Spotlight Events (TVL) is a dataset of events from international news, government agencies, and NGOs tagged by companies according to SASB industry categories and the relevant SASB General Issue Category (GIC) or topic. The dataset is well-curated — each event rarely has more than one article associated with it, and the selected article provides good coverage of the event.

TVL was an improvement over GDELT, another dataset we tested, since the TVL data is much better tagged for our purposes. GDELT, a project of Google Jigsaw, is a dataset of articles scraped from news sources in over 100 languages every 15 minutes. As such, it is undoubtedly more comprehensive than TVL. Yet working with GDELT required extensive cleaning and tagging: first extracting entity names from the articles, then using this incomplete extraction to map the content of each article to SASB GICs and industries, and finally culling the database to account for over-coverage of certain types of events and companies over others. This process led to poorer data quality and lower accuracy overall.

Since our aim is to reveal financially material company practices for the two SASB General Issue Categories, “Employee Engagement, Diversity & Inclusion” and “Supply Chain Management,” we utilized the corresponding Truvalue Spotlight Events dataset. This dataset contains 7,416 articles categorized under “Employee Engagement, Diversity & Inclusion,” and 14,344 articles categorized as “Supply Chain Management.” Because our interest is “Labor Conditions in Supply Chains,” a topic within “Supply Chain Management” and not also environmental management, for example, we used a heuristic method to limit this TVL dataset to labor issues, which we describe below. The articles for both datasets are from January 2016 to September 2021 and include events from 4,988 international sources in 13 different languages.[2]


Categorization and Terminology Changes from the SASB Standard

“Employee Engagement, Diversity & Inclusion” is the label for this GIC in the current published SASB standards. During the Human Capital standards revisions process, SASB researchers switched to using the term “Diversity, Equity & Inclusion,” to align with the terminology now widely used by market actors.

In this report, we will use the acronym DEI to refer to this GIC. It is noteworthy that the DEI theme appears as a GIC in the current SASB standard, whereas “Labor Conditions in Supply Chains,” which we refer to in this report with the acronym LCSC, is a topic within the Supply Chain Management GIC. In order to simplify, in this report we refer to both DEI and LCSC as our “topics” of interest.

The change in terminology from “Employee Engagement, Diversity & Inclusion” to “Diversity, Equity & Inclusion” has implications for natural language processing since we source our keywords directly from the standards. Therefore, any change in the GICs or topics is likely to require a change in our keyword dictionaries, as we discuss below.

The Heuristic Method and Expansion of Keyword Dictionaries

Our heuristic approach uses keywords that we have categorized as either “practices” or “outcomes.” Practice terms refer to company actions – including policies, programs, initiatives, and mechanisms as they relate to our two research topics, DEI and LCSC. Outcomes terms refer to the material risks and opportunities that may be linked to those practices.

We develop a keyword dictionary through iterative cultivation, involving multiple rounds of culling as well as ongoing integration of additional sources. We started with a list of terms suggested by Rights CoLab and the Expert Group. As described in the March 2021 update,[3] we used an n-gram analysis[4] on the labeled datasets that Value Reporting Foundation’s data team provided to us for two SASB dimensions: “Human Capital” and “Business Models & Innovation.”
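To make the n-gram step concrete, the following is a minimal sketch of counting contiguous word sequences across a set of texts. It is not the project's code: the function name `ngram_counts`, the tokenization rule, and the two sample sentences are assumptions made for illustration.

```python
import re
from collections import Counter


def ngram_counts(texts, n=2):
    """Count contiguous n-word sequences across a list of texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep alphabetic tokens only (a simplifying assumption).
        tokens = re.findall(r"[a-z]+", text.lower())
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Invented sentences, for illustration only.
sample_texts = [
    "The supplier was accused of forced labor at its factory.",
    "Auditors found forced labor and unpaid wages in the supply chain.",
]
bigram_counts = ngram_counts(sample_texts, n=2)
```

Frequent n-grams from labeled text would then be reviewed by hand as candidate dictionary terms; frequency alone does not establish relevance.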

During the Fall 2021 semester, we worked to build out our keyword dictionaries. Our dictionaries are currently structured as practice terms and outcome terms, with practice terms categorized into those pertaining to DEI and those pertaining to LCSC, as well as more specific, distinct sub-categories within those two topics. In this update, we demonstrate how categorizing practices and outcomes makes our findings interpretable, enabling us to draw conclusions about the financial materiality of practices related to distinct or overlapping sub-categories of practices and risks/opportunities.

Using this heuristic approach, we gathered evidence of the materiality of company practices using methods appropriate to our two datasets. With Form 10-Ks, we identify occurrences of practice keywords in their paragraphs and categorize them by industry as evidence that companies in particular industries deem them to be “risk factors.” Our TVL co-occurrence corpus comprises co-occurrences of practice keywords and outcome keywords across events that pertain to particular companies; we use it to identify industry-by-industry indications of material practices not currently well represented on the SASB materiality map.
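The co-occurrence idea can be sketched as follows. This is an illustrative outline under stated assumptions rather than the team's implementation: the tiny term lists reuse example terms mentioned in this update, while the event records, field names, and the `cooccurrence_counts` helper are hypothetical.

```python
import re
from collections import Counter
from itertools import product

# Tiny illustrative subsets; the real dictionaries are far larger.
PRACTICE_TERMS = ("hiring", "promotion")
OUTCOME_TERMS = ("lawsuit", "allegation")


def terms_in(text, terms):
    """Return the terms that appear in the text as whole words."""
    return [t for t in terms if re.search(rf"\b{re.escape(t)}\b", text, re.IGNORECASE)]


def cooccurrence_counts(events):
    """Count (industry, practice, outcome) co-occurrences across events."""
    counts = Counter()
    for event in events:
        practices = terms_in(event["text"], PRACTICE_TERMS)
        outcomes = terms_in(event["text"], OUTCOME_TERMS)
        for practice, outcome in product(practices, outcomes):
            counts[(event["industry"], practice, outcome)] += 1
    return counts


# Invented event records with hypothetical field names.
sample_events = [
    {"industry": "Hotels & Lodging", "text": "The chain changed its hiring process after a lawsuit."},
    {"industry": "Hotels & Lodging", "text": "An allegation of bias in promotion decisions surfaced."},
    {"industry": "Hotels & Lodging", "text": "Quarterly revenue rose."},
]
counts = cooccurrence_counts(sample_events)
```

An event with no practice term or no outcome term contributes nothing, which is why the third record is ignored.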

Expansion of Terms: Diversity, Equity & Inclusion

Development of DEI Context Terms

In the current SASB standards there are seven accounting metrics within the DEI category. It is necessary to build our term dictionaries for one metric at a time since our heuristics depend upon word specificity. We chose to start with the following metric because it is most prevalent across the 12 industries that have an EDI standard, with 9 industries requiring disclosure on this metric:

Percentage of gender and racial/ethnic group representation for executive management, non-executive management, professionals, technical staff, and all other employees.

We use this metric — which we dub our “gender/race representation” metric — to build our DEI term dictionaries.

When we started to apply the corporate practices/outcomes heuristic method for DEI to our data sources this semester, we found that practice terms alone were not specific enough to ensure that the co-occurrences that our methods produced capture text related to DEI issues. For example, the co-occurrence of the practice term “promotion” and the risk term “allegation” could refer to a DEI issue related to discrimination in promotion, but there is no way to know more conclusively without manually checking. Therefore, we developed DEI context terms that we could also search the corpus for, to gain more accuracy in identifying DEI events in TVL news articles and Form 10-Ks. After including DEI context terms in our heuristic method, as described below, our accuracy in identifying relevant articles rose from approximately 50% to close to 90%.

Since workplace DEI concerns often center on the discriminatory treatment of employees or contractors with specific racial, gender, or other demographic characteristics, we developed a list of DEI terms that could serve as context markers. For example, if any terms with the root word “race” or terms like “BIPOC,” “people of color/colour,” or “blackface” are found in the text, it is likely that the TVL article or Form 10-K paragraph is discussing an event with a racial element. These terms comprise our “race” category of DEI context terms. We similarly developed categories for terms describing migrant status, education/skill level (see Figure 1, below), criminal history, and more, all of which cover several DEI considerations of the hiring process. We collected single DEI context terms as well, such as “pregnant” and “veteran,” which are sufficient to capture the general context of that issue.

Figure 1: Sample DEI context terms categorized as “education/skill level” (The terms in this category help to flag articles/10-K paragraphs where education and skill level are mentioned alongside a practice, such as hiring or developing programs to support employees furthering education)


We classified certain DEI context terms into categories; in some cases, we classified a term into multiple categories. For example, we placed “working mother” in “gender-M/F,” “familial status,” and “working mother.” It is possible that a worker experiences unfair treatment as a “working mother” due to gender inequality (the disparity in treatment between working mothers and working fathers) or due to familial status (the disparity in treatment between working mothers and working women without children) or due to a combination of both. By tracking one term in multiple categories, we can capture the many contexts in which the term appears and understand their prevalence in one demographic category in relation to another. This method can be particularly helpful when we move to generating support for metrics development.
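A small sketch of how context terms might be matched and tallied by category is shown here. The regular expressions (including the approximation of the root word “race”), the handling of plurals, and the helper names are assumptions for illustration; only the example terms and the three categories for “working mother” come from the text above.

```python
import re
from collections import Counter

# (pattern, categories) pairs. One term may belong to several categories.
CONTEXT_TERMS = [
    (r"\brac(?:e|es|ial|ially|ism|ist)\b", ("race",)),  # rough stand-in for root "race"
    (r"\bbipoc\b", ("race",)),
    (r"\bpeople of colou?r\b", ("race",)),
    (r"\bblackface\b", ("race",)),
    (r"\bworking mothers?\b", ("gender-M/F", "familial status", "working mother")),
    (r"\bpregnant\b", ("pregnant",)),
    (r"\bveterans?\b", ("veteran",)),
]


def context_categories(text):
    """Return the set of DEI context categories signalled by the text."""
    found = set()
    for pattern, categories in CONTEXT_TERMS:
        if re.search(pattern, text, re.IGNORECASE):
            found.update(categories)
    return found


def category_prevalence(texts):
    """Count, per category, how many texts signal that category."""
    counts = Counter()
    for text in texts:
        counts.update(context_categories(text))
    return counts


# Invented sentences, for illustration only.
prevalence = category_prevalence([
    "A working mother alleged she was passed over for promotion.",
    "Staff objected to blackface imagery in the campaign.",
])
```

Because each text contributes at most once per category, the tallies can be compared across categories without one long article dominating the counts.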

“Discrimination” and “Harassment”: DEI Standalone Context Terms

We identified two DEI context terms unrelated to a specific demographic: “discrimination” and “harassment.” As such, we made these DEI standalone context terms to reflect the fact that they are not exclusive to one demographic, but occur across gender, LGBTQ+, age, etc. At the same time, we reasoned that they cannot be neatly categorized as either practice terms or risk terms.

First, “discrimination” and “harassment” are not practices that a company would intentionally implement or openly promote. Although any company event involving “discrimination” or “harassment” can be assumed to be financially material, to demonstrate how “discrimination” and “harassment” lead to financial loss in a given event, we seek a co-occurrence of either of these two terms with other risk terms.

Second, treating these context terms as standalone may help to explain why a DEI practice (related to a particular demographic) leads to a financially material risk for a company. For example, a 2020 TVL article entitled,

captures the context in which this lawsuit takes place. Within the title, we observe two DEI context terms: “age,” which is demographic-specific, and “discrimination,” a standalone term. The tagline of the article reveals more information:

This sentence shows a co-occurrence between the practice term “hiring” and the risk terms “lawsuit,” “consent decree,” and “injunctive relief.” However, without the context term “age discrimination,” the relationship between “hiring” and “lawsuit” would not be captured. By adding the demographic context term “age” to the standalone term “discrimination,” we can infer the relationship among the terms, such that,

“hiring” + “age discrimination” –> “lawsuit”

where “age discrimination” is a DEI consequence that results from the company’s hiring practice, which leads to financially material legal risks.
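The inference above can be expressed as a short rule. The sketch below is an assumption-laden illustration, not the project's code: the term tuples contain only the terms named in this example, the test sentence is invented (it is not the TVL article), and `infer_relationship` is a hypothetical helper.

```python
import re

PRACTICE_TERMS = ("hiring",)
DEMOGRAPHIC_TERMS = ("age",)
STANDALONE_TERMS = ("discrimination", "harassment")
RISK_TERMS = ("lawsuit", "consent decree", "injunctive relief")


def find_terms(terms, text):
    """Return the terms that appear in the text as whole words or phrases."""
    return [t for t in terms if re.search(rf"\b{re.escape(t)}\b", text, re.IGNORECASE)]


def infer_relationship(text):
    """Link practice -> DEI context -> risk when all three are present."""
    practices = find_terms(PRACTICE_TERMS, text)
    risks = find_terms(RISK_TERMS, text)
    standalone = find_terms(STANDALONE_TERMS, text)
    if not (practices and risks and standalone):
        return None
    # Prefer a demographic-specific phrase such as "age discrimination"
    # when a demographic term directly precedes the standalone term.
    compound = [
        f"{d} {s}"
        for d in DEMOGRAPHIC_TERMS
        for s in standalone
        if re.search(rf"\b{re.escape(d)}\s+{re.escape(s)}\b", text, re.IGNORECASE)
    ]
    return {"practice": practices, "context": compound or standalone, "risk": risks}


# Invented sentence, for illustration only.
result = infer_relationship("The retailer settled a lawsuit alleging age discrimination in hiring.")
```

Without a standalone context term the function returns `None`, mirroring the point that “hiring” and “lawsuit” alone do not establish a DEI relationship.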

Capturing the Financial Materiality of DEI Practices in TVL and 10-Ks

Table 1 presents the criteria used to capture financial materiality of DEI practices in TVL and 10-Ks:


Table 1: Criteria for Capturing Financially Material DEI Practices in TVL Articles and Form 10-Ks

TVL articles
  • Practice terms: At least one DEI practice term
  • Risk terms: At least one company risk term
  • Context terms: At least one DEI demographic context term OR at least one DEI standalone term OR at least 2 out of 5 of the following terms: “diversity,” “equity,” “inclusion,” “DEI,” “DE&I”
  • Worker-related terms: At least one occurrence of a word whose root word is “work” or “employ”

Form 10-Ks
  • Practice terms: At least one DEI practice term
  • Risk terms: At least one company risk term only from the “worker protest” or “modern slavery” groups of risk terms
  • Context terms: At least one DEI demographic context term OR at least one DEI standalone term OR at least 2 out of 5 of the following terms: “diversity,” “equity,” “inclusion,” “DEI,” “DE&I”
  • Worker-related terms: At least one occurrence of one of the following terms: “worker,” “employee,” “manager,” or “director”


We deem a TVL report to be related to the topic of DEI and to be financially material when we find a co-occurrence between at least one of each of the following: a practice term, a risk term, any word whose root word is either “work” or “employ,” and an appearance of DEI context. We count an article as DEI-relevant when at least one DEI context term, demographic or standalone, is found in the text or when at least two of five of the following terms are found in the text: “diversity,” “equity,” “inclusion,” “DEI,” and “DE&I.” With the latter case, we also seek to capture content that discusses DEI practices and risks broadly, without reference to a particular marginalized demographic or harm. In iteratively sampling and applying these indicators to look for articles demonstrating financial materiality of DEI practices, we found that an article or 10-K paragraph has a strong chance of discussing DEI if it contains at least two of these five terms. The threshold of at least two terms (like DEI demographic and standalone context terms) is set to account for the variety of ways terms like “diversity” and “equity” may appear individually in articles unrelated to DEI, such as “diversified investment portfolio” or “financial equities.” However, when more than one of these terms is found in a text, the probability that the terms are being used in relation to DEI increases dramatically.

We initially developed our DEI keyword lists by following the terminology of the SASB Standards “Employee Engagement, Diversity & Inclusion” (EDI) GIC. Accordingly, our DEI context terms were the same as those mentioned above, except for the words “engagement” and “equity.” Yet, as noted above, SASB researchers later shifted to the more widely used terminology, “Diversity, Equity & Inclusion” (DEI). Since we used EDI to build our context terms, this shift meant that we had to revise our heuristic accordingly, which we did by switching “engagement” out for “equity” in the DEI context terms.

A Form 10-K paragraph is deemed DEI-relevant and financially material using the same criteria as above. Yet our Form 10-K corpus does not require us to use our entire dictionary of outcome terms, since the corpus represents company reporting of financially material risks. We are interested in capturing DEI issues not just at company headquarters, but also within a company’s value chain, bearing in mind that women, racial and ethnic minorities, migrants, and other members of a diverse workforce are those most vulnerable to discrimination, sexual assault, workplace harms, and other forms of abuse. Therefore, for Form 10-K searches, we restricted our risk terms to those within our worker protest (e.g., “strike,” “protest”) and modern slavery (e.g., “forced labor,” “debt bondage”) categories. Also, for 10-Ks, since the root words “work” or “employ,” are too broad for the 10-K corpus, we searched instead for paragraphs with at least one of the following terms: “worker,” “employee,” “manager,” or “director.”
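The two rule sets in Table 1 can be sketched as one function with two configurations. This is a hedged illustration, not the project's implementation: the term tuples are small subsets drawn from examples in this update, the plural handling for the 10-K worker terms is an assumption, and all function names are hypothetical.

```python
import re

BROAD_DEI_TERMS = ("diversity", "equity", "inclusion", "dei", "de&i")

# Illustrative subsets only; the real dictionaries are much larger.
PRACTICE_TERMS = ("hiring", "promotion", "training")
CONTEXT_TERMS = ("discrimination", "harassment", "blackface", "veteran")
ALL_RISK_TERMS = ("lawsuit", "allegation", "investigation",
                  "strike", "protest", "forced labor", "debt bondage")
PROTEST_AND_SLAVERY_RISK_TERMS = ("strike", "protest", "forced labor", "debt bondage")

# TVL: any word built on the roots "work" or "employ".
TVL_WORKER = re.compile(r"\b(?:work|employ)\w*", re.IGNORECASE)
# Form 10-K: four specific terms (optional plural "s" is an assumption).
TENK_WORKER = re.compile(r"\b(?:worker|employee|manager|director)s?\b", re.IGNORECASE)


def has_term(term, text):
    """True if the term appears and is not embedded in a longer word."""
    return re.search(rf"(?<!\w){re.escape(term)}(?!\w)", text, re.IGNORECASE) is not None


def any_term(terms, text):
    return any(has_term(t, text) for t in terms)


def dei_context_present(text):
    """One context term, or at least two of the five broad DEI terms."""
    broad_hits = sum(has_term(t, text) for t in BROAD_DEI_TERMS)
    return any_term(CONTEXT_TERMS, text) or broad_hits >= 2


def flag(text, risk_terms, worker_pattern):
    return (any_term(PRACTICE_TERMS, text)
            and any_term(risk_terms, text)
            and dei_context_present(text)
            and worker_pattern.search(text) is not None)


def flag_tvl_article(text):
    return flag(text, ALL_RISK_TERMS, TVL_WORKER)


def flag_10k_paragraph(text):
    return flag(text, PROTEST_AND_SLAVERY_RISK_TERMS, TENK_WORKER)
```

Under these assumptions, an invented sentence such as "Workers filed a lawsuit alleging discrimination in hiring." is flagged by the TVL rule but not by the 10-K rule, because “lawsuit” falls outside the worker protest and modern slavery groups. A text that mentions only “equity” among the five broad terms, with no other context term, is not flagged.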

Used in combination, these methods can provide valuable information. For example, in a 2020 TVL article, “Prada reaches settlement with NYC over blackface imagery,” our heuristic method flagged the following terms within the text:

In this example, while the co-occurrence of the terms “blackface,” “training,” and “investigation” is sufficient to flag the article, searching for co-occurrences across these terms, including at least two of the three terms “diversity,” “equity,” and “inclusion,” reveals specific company initiatives (Prada’s Council) that the company developed to reduce risk or in response to a risk.

Expansion of Terms: Labor Conditions in Supply Chains

To expand the keyword list for practices and outcomes for labor conditions in supply chains (LCSC), we manually reviewed a selection of NGO reports and academic papers. We also added terms that reflect positive practices and opportunities to our initial keyword list of negative practice and risk terms and labeled them accordingly. In addition, we recategorized as “risks” certain terms that we had previously labeled as “practices,” such as “modern slavery” and “forced labor.” A sample of the resulting keyword dictionary from our notebook appears in Figure 2, below.

Figure 2: Sample practice terms listed in their revised categories


We further expanded the new keyword dictionary for labor supply chain outcome terms using an unsupervised n-gram analysis on a subset of news articles from our TVL dataset. This subset contained articles mentioning an LCSC practice term from our new dictionary plus an outcome term from our previous list, and also revealed several new outcome terms related to operational costs and legal risk. Along with this method, we continued to review NGO reports and found terms that pick up a strong signal for risks of reputational damage and modern slavery.

Using these expanded dictionaries of practice and outcome terms, we applied our heuristic method to TVL articles categorized under the Supply Chain Management GIC to detect co-occurrences of these terms as evidence of the financial materiality of LCSC practices.

Criteria to Identify Financially Material Practices in 10-Ks

Consistent with our previous work with Form 10-Ks, we identify paragraphs of 10-Ks in which a labor practice term occurs. Since there is no clear-cut way to determine whether a 10-K paragraph is discussing a company’s supply chain, however, we defined a set of terms that demonstrate a supply chain context, and created a rule that a supply chain context term must be present in order to deem a 10-K paragraph to demonstrate financial materiality. Our supply chain context terms are designed to capture raw materials sourcing (e.g., “raw materials,” “sourcing”), supplier relationships (e.g., “supplier,” “manufacturer”), and supply chain settings (e.g., “factory,” “mining,” “workshop”).

We also looked for outcome terms related to worker protest and modern slavery within paragraphs containing a supply chain context term, since these outcome terms describe risks that most likely occur due to a company’s negative labor practice in relation to its suppliers. If a company mentions any such financially impactful risks in their 10-K, then we can assume that they are actively trying to avoid negative labor practices that can result in such risks.
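As a rough illustration of the supply chain context rule, the sketch below keeps a 10-K paragraph only when a supply chain context term is present alongside a labor practice term or a worker protest/modern slavery risk term. The term tuples reuse examples from this update; the plural handling, the return format, and the `screen_paragraph` name are assumptions.

```python
import re

SUPPLY_CHAIN_CONTEXT = ("raw materials", "sourcing", "supplier", "manufacturer",
                        "factory", "mining", "workshop")
LABOR_PRACTICE_TERMS = ("forced overtime", "pricing pressure")
RISK_TERMS = ("strike", "protest", "forced labor", "debt bondage")


def hits(terms, text):
    """Return terms found as whole words, allowing a trailing plural 's'."""
    return [t for t in terms if re.search(rf"\b{re.escape(t)}s?\b", text, re.IGNORECASE)]


def screen_paragraph(paragraph):
    """Return matched terms, or None if the paragraph fails the rule."""
    context = hits(SUPPLY_CHAIN_CONTEXT, paragraph)
    if not context:
        return None  # no supply chain setting, so the paragraph is skipped
    practices = hits(LABOR_PRACTICE_TERMS, paragraph)
    risks = hits(RISK_TERMS, paragraph)
    if not (practices or risks):
        return None
    return {"context": context, "practices": practices, "risks": risks}


# Invented paragraph, for illustration only.
example = screen_paragraph("Forced overtime at a supplier factory could disrupt production.")
```

Returning the matched terms rather than a bare boolean makes later manual review easier, since a reviewer can see why a paragraph was kept.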


Figure 3: Sample risk terms related to worker protest and modern slavery


Validating Keywords

After augmenting our DEI and LCSC dictionaries of practice/outcome keywords, we applied the terms to samples of a hundred 10-K paragraphs and a hundred TVL news articles. We iteratively adjusted the keyword dictionaries in three ways:

  1. We added context terms for certain practice/risk terms to ensure that we select relevant content from our data sources. This process, as applied to the DEI GIC, is explained in detail above. For the LCSC topic, when applying “pricing pressure” as an LCSC practice term, for example, we added context terms like “supplier” and “manufacturer” to filter paragraphs/articles down to those specifically discussing the impacts of pricing pressures on suppliers.
  2. We deprioritized practice/outcome terms that returned too many false positive 10-K paragraphs or news articles even with additional context terms. We discovered false positives by running our heuristic method on samples of 10-K paragraphs and news articles and manually reviewing the results to see which ones were flagged for mentioning terms outside of our expected context. For example, we found that the practice term “risk assessment” and the risk term “fraud” returned inordinate false positives.
  3. To cast a wider net across our datasets for those practices/outcomes, we added variations of existing terms. For example, to strengthen our heuristic for the practice term “forced overtime,” we added “illegal overtime” and “mandatory overtime,” as well as “long hours” and “off the clock.”

Using the resulting keyword lists for DEI and LCSC, we achieved 90% accuracy in returns of paragraphs/articles with a discussion of financially material practices.
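One way to organize the manual review described in this section is to record, for each sampled paragraph or article, which terms triggered the flag and whether a reviewer judged it relevant. The sketch below is hypothetical: the review records are invented, and the 0.5 cutoff is an arbitrary assumption, not a threshold used by the project.

```python
from collections import defaultdict


def term_precision(reviewed):
    """reviewed: iterable of (matched_terms, is_relevant) pairs from manual review."""
    flagged = defaultdict(int)
    relevant = defaultdict(int)
    for terms, is_relevant in reviewed:
        for term in set(terms):
            flagged[term] += 1
            relevant[term] += int(is_relevant)
    return {term: relevant[term] / flagged[term] for term in flagged}


def terms_to_deprioritize(reviewed, min_precision=0.5):
    """Terms whose flagged items were mostly judged irrelevant."""
    return sorted(t for t, p in term_precision(reviewed).items() if p < min_precision)


def overall_accuracy(reviewed):
    """Share of flagged items that reviewers judged relevant."""
    reviewed = list(reviewed)
    return sum(1 for _, is_relevant in reviewed if is_relevant) / len(reviewed)


# Invented review records, for illustration only.
reviewed_sample = [
    (["forced overtime"], True),
    (["forced overtime", "fraud"], True),
    (["fraud"], False),
    (["fraud"], False),
    (["risk assessment"], False),
]
low_precision_terms = terms_to_deprioritize(reviewed_sample)
```

Re-running the same tally after each dictionary revision gives a simple way to check whether added context terms or term variations improved the share of relevant results.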


Criteria to Measure the Prevalence of Financially Material DEI/LCSC by Industry

In our Spring/Summer 2021 work, we had applied the heuristic method to our Form 10-K data set using a small number of diversity-related and inclusion-related terms (see Appendix A of our September 2021 update). We counted the mentions of these terms along two measures, ubiquity and intensity:

  • Ubiquity refers to how widespread mentions of our (Spring/Summer 2021) diversity terms are within industries, and we measured it by the share of all Form 10-Ks within an industry that have any mention of the terms.
  • Intensity refers to the frequency of mentions of (Spring/Summer 2021) diversity terms by companies within an industry; we calculated intensity by taking an average of the number of paragraphs per Form 10-K that mention the terms across an industry.
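The two Spring/Summer 2021 measures can be sketched for a single industry as follows. The input format and function name are assumptions, and the sketch assumes intensity is averaged over all of the industry's Form 10-Ks, including those with no mentions; the text does not specify this.

```python
def ubiquity_and_intensity(paragraph_hits_per_filing):
    """Compute both measures for one industry.

    paragraph_hits_per_filing: one integer per Form 10-K, giving the number
    of paragraphs in that filing that mention any of the diversity terms.
    """
    n_filings = len(paragraph_hits_per_filing)
    if n_filings == 0:
        raise ValueError("the industry has no Form 10-Ks")
    # Ubiquity: share of filings with at least one mention.
    ubiquity = sum(1 for hits in paragraph_hits_per_filing if hits > 0) / n_filings
    # Intensity: average number of mentioning paragraphs per filing
    # (assumed here to include filings with zero mentions).
    intensity = sum(paragraph_hits_per_filing) / n_filings
    return ubiquity, intensity


# Invented counts for four filings in one industry.
ubiquity, intensity = ubiquity_and_intensity([0, 2, 4, 0])
```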

During Fall 2021, we revised these measures to better explain the pervasiveness of DEI and LCSC issues within each industry in a way that can also be applied to our TVL dataset.

To make our measures comparable across different data sources, we revised our ubiquity measure to account for the count of companies within a particular industry linked to a financially-material DEI/LCSC issue. We now present ubiquity as a measure of the share of companies out of all companies for a particular industry that are represented in a data source, based on our criteria. For example, our TVL corpus contains reports (published between January 2016 and September 2021) for a total of 48 unique companies in the Hotels & Lodging industry. Our DEI heuristic method flags TVL articles for 6 of the 48 companies, so we can conclude that the ubiquity of financially material DEI events, sourced from TVL, in the Hotels & Lodging industry is 12.5%. On the other hand, when we compare that number with the corresponding ubiquity of financially material DEI issues in the Hotels & Lodging industry in our Form 10-Ks corpus, we find that out of 14 total companies in the industry for which we have Form 10-Ks, 8 have at least 1 mention of a financially material DEI issue in any of their Form 10-Ks over the years. From this, we can conclude that the ubiquity of financially material DEI events in the Hotels & Lodging industry, according to Form 10-Ks, is 57.14%. As such, we can use the ubiquity measure to understand the prevalence of financially material DEI issues in Hotels & Lodging in one data source and to compare its discussion between both data sources.
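The revised, company-level ubiquity measure reduces to a simple ratio, which the sketch below reproduces for the Hotels & Lodging figures quoted above. The function name and the placeholder company identifiers are assumptions; only the counts (6 of 48 and 8 of 14) come from the text.

```python
def company_ubiquity(flagged_companies, industry_companies):
    """Share of an industry's companies (in one data source) that were flagged."""
    industry = set(industry_companies)
    if not industry:
        raise ValueError("the industry has no companies in this data source")
    return len(set(flagged_companies) & industry) / len(industry)


# Placeholder identifiers; only the counts are taken from the text.
tvl_companies = [f"tvl_company_{i}" for i in range(48)]
tvl_flagged = tvl_companies[:6]
tenk_companies = [f"10k_company_{i}" for i in range(14)]
tenk_flagged = tenk_companies[:8]

tvl_ubiquity = company_ubiquity(tvl_flagged, tvl_companies)      # 6 / 48
tenk_ubiquity = company_ubiquity(tenk_flagged, tenk_companies)   # 8 / 14
```

Expressed as percentages, these are 12.5% and 57.14% (rounded to two decimals), matching the figures above. Because the denominator is the set of companies present in each source, the two values are comparable even though the sources cover different numbers of companies.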

We retired the intensity metric because it does not carry over to our TVL corpus of third-party reporting: counting paragraphs with DEI mentions is not a useful way to describe results for news articles.

Next Steps

At its December 2021 meeting, the SASB Standards Board approved an approach[5] for defining financially material DEI topics across industry standards. Based on a review of academic literature, consultations with market actors, and other evidence, SASB researchers identified four “channels of business relevance”: “Talent Attraction & Retention,” “Product Design, Marketing & Delivery,” “Community Relations,” and “Innovation & Risk Recognition.”

Source: SASB


Our DEI keyword dictionary provides substantial coverage of all channels of business relevance, but this framework provides a guide on how to go further. One method we will use is word embeddings, which leverage NLP and machine learning to generate learned representations of text. We will use them to uncover new keyword candidates and to eventually train models to automatically label financially material content on the topics of our datasets.[6]

Having focused our fall work on building out and categorizing our DEI and LCSC term dictionaries, we improved our accuracy and better captured the multiple facets of gender, race, and other protected group representation for the DEI topic and for supply chain labor practices/outcomes for the LCSC topic. We can now apply the heuristic method and term dictionaries to other datasets.

Whereas financial filings provide the company perspective on material issues, proxy statements and earnings calls will allow us to gain better insight into investor perspectives. Since investor interest is a key determinant of financial materiality, these data sources can potentially turn up evidence for decision-useful metrics across the four channels of DEI business relevance.

We will also endeavor to support VRF’s ongoing efforts to globalize the standard. The need to globalize becomes even more acute as the Value Reporting Foundation looks ahead to its merger in mid-2022 with the Climate Disclosure Standards Board to form the International Sustainability Standards Board (ISSB) under the IFRS Foundation. We will continue to use the TVL global corpus of third-party reports, and incorporate new data sources, such as financial filings in other jurisdictions besides the U.S. and foreign company filings in the U.S. (Form 20-Fs) to provide a global perspective of how companies consider the risks and opportunities of DEI and LCSC practices.



[1] September 2021 Update on Automating Research to Identify Financially Material Human Rights Topics

[2] TVL utilizes native speakers, translation services, and rigorous post-processing data quality checks to translate foreign language content into English. See Factset, “Data Methodology”

[3] March 2021 Update on Automating Research to Identify Financially Material Human Rights Topics

[4] An n-gram is a contiguous sequence of a specified number of terms; n-gram analysis examines how often such sequences occur in a text.

[5] Value Reporting Foundation, “Diversity, Equity & Inclusion Proposal Approach”

[6] Read about our first attempt with word embeddings for this project at “Supervised Learning on Form 10-Ks” in our March 2021 report.

