Our Data Science Methodology
About the Research
This research collaboration with the Sustainability Accounting Standards Board (SASB) aims to develop and define a strengthened set of disclosure standards that investors can use to persuade companies to improve labor rights for their workforce – direct employees, alternative workers, and workers in a company’s supply chain. The work is supported by the Moving the Market Initiative of Humanity United, UBS Optimus Foundation, and Freedom Fund.
The project has two components: 1) a data science project, and 2) an Expert Group. The data science project — carried out with the Data for Good Scholars (DfG) Program of the Data Science Institute of Columbia University — uses natural language processing, machine learning, and other data science techniques to identify new relationships between labor-related human rights risks and financial materiality. The Expert Group provides subject matter expertise to the data project, and makes recommendations on how to incorporate relevant findings into revised SASB standards to benefit workers while reducing reputational and legal risks for companies.
Two Data Science Projects
In May 2020, the DfG team began meeting with the Expert Group, which provided input on gaps in the human capital management standards. Based on these discussions, as well as conversations with other experts working at the intersection of data science and modern slavery, the DfG team and Rights CoLab created a data science plan with two work streams:
- The Extension Project will expand the topics that currently appear in some SASB industry standards to sectors where they do not yet appear. On the SASB materiality map these are represented as white spaces.
- The Addition Project will identify new relationships between labor rights metrics and SASB criteria for financial materiality within a broad range of SASB industry standards. These metrics do not currently appear anywhere on the SASB materiality map.
Both projects use machine learning to analyze text data for evidence of investor interest in, or the financial impact of, corporate labor practices, in line with SASB’s Conceptual Framework. By tracking how the relationships between key terms change over time, the projects seek out “emerging materiality.”
For a discussion of the gaps in SASB that this work seeks to fill, see “Our Data Science Plan for Improved SASB Standards.”
The Extension Project
The research underlying the current SASB standards took place from 2013 to 2016 and formed the basis for determining which topics are financially material for which industries. The DfG Extension Project team, trained in machine learning and natural language processing, is now building a classifier that recognizes relevant text in order to examine the industries where labor-related topics were not identified as material during that time. The team will compare how often these topics appear in industries for which the standard does not currently recommend disclosures with how often they appear in industries for which it does, and will use the results to argue for extending the standard to those industries. Once in place, this approach can be applied to all SASB sustainability categories.
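The frequency comparison described above can be sketched roughly as follows. This is a minimal illustration, not the team's implementation: the function name `labor_share_by_industry`, the industry names, and the boolean flags (which stand in for a classifier's predictions) are all hypothetical, and the sample values are made up.

```python
from collections import defaultdict

def labor_share_by_industry(paragraphs):
    """Return the share of paragraphs flagged as labor-related per industry.

    `paragraphs` is an iterable of (industry, is_labor_related) pairs, where
    the boolean stands in for a classifier's prediction (hypothetical input).
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for industry, is_labor_related in paragraphs:
        totals[industry] += 1
        if is_labor_related:
            flagged[industry] += 1
    return {industry: flagged[industry] / totals[industry] for industry in totals}

# Toy, made-up predictions for two industries (not real data).
sample = [
    ("covered_industry", True),
    ("covered_industry", False),
    ("uncovered_industry", True),
    ("uncovered_industry", True),
    ("uncovered_industry", False),
    ("uncovered_industry", False),
]
shares = labor_share_by_industry(sample)
```

Comparing shares rather than raw counts keeps industries with many filings from dominating the comparison.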
The research focuses primarily on two SASB general issue categories (GICs) — labor practices and supply chain management.
Labor Practices (found in the Human Capital Management dimension of the SASB standards) “addresses the company’s ability to uphold commonly accepted labor standards in the workplace, including compliance with labor laws and internationally accepted norms and standards. This includes, but is not limited to, ensuring basic human rights related to child labor, forced or bonded labor, exploitative labor, fair wages, and overtime pay, and other basic worker rights. It also includes minimum wage policies and provision of benefits, which may influence how a workforce is attracted, retained, and motivated. The category further addresses a company’s relationship with organized labor and freedom of association.”
Supply Chain Management (found in the Business Models & Innovation dimension of the SASB standards) “addresses the management of ESG risks with a company’s supply chain… and issues associated with environmental and social externalities created by suppliers through operational activities. Such issues include, but are not limited to, environmental responsibility, human rights, labor practices, ethics, and corruption. Management may involve screening, selection, monitoring and engagement with suppliers’ environmental and social impacts.”
SASB determines materiality through evidence of investor interest and financial impact from a mix of sources. It is important, therefore, to expand the array of text sources to which the classifier can reliably be applied. Towards this end, the classifier can also be applied to the following data sources:
- 10-Ks, including industries where labor practices were not deemed material in SASB’s 2018 standards (see below)
- Proxy statements, no action requests, and proxy advice
- Quarterly earnings calls
- Sell-side reports
SASB’s agreement to share its labeled data for 10-Ks for its Human Capital Management dimension was a big assist to the Extension Project. Starting with this data set, the team uses machine learning to train a classifier to recognize the same categories in 10-Ks across all industries. Once this classifier is built, it can be transferred onto an unlabeled dataset, such as proxy statements. (See below.)
Beyond the value of having SASB’s 10-K labeled data set in hand, there are other good reasons for starting with 10-Ks:
- 10-Ks are all in a single format on EDGAR, and it is easy to retrieve a wide sample of filings, including across industries and going back in time.
- 10-Ks are a strong indicator of financial impact/risk since they represent the company’s own judgement of risk.
- SASB itself deems 10-Ks to be a highly valuable measure of investor interest; therefore, SASB is likely to be persuaded by the evidence of 10-Ks.
By comparison, earnings calls over-represent the investors least interested in sustainability (short-term oriented hedge funds), while data about filing shareholder resolutions (proxy statements and no-action requests) over-represent the investors most interested in sustainability.
Transferring knowledge from one textual domain to another is not always seamless and idiosyncrasies must be handled. The language signifying a Human Capital Management topic in shareholder resolutions, for example, might be different from that in 10-Ks, requiring care and effort to make a successful adaptation. The aim, therefore, will be to work with the minimum number of data sources required to meet the SASB threshold for evidence of materiality.
Using an initial set of keywords and phrases compiled by the Expert Group, additional positive training cases can be obtained. This list may be expanded via word embeddings to “seed” the categories for different data sets. In building a classifier for labor practices, DfG relies on Expert Group subject matter experts to develop heuristics to identify “positive” and “negative” cases. For example, if a shareholder resolution has two or more of the words “diversity,” “minority,” “underrepresented,” etc., then it pertains to the “diversity and inclusion” category. Word embeddings can verify if any other words are commonly used in the same contexts, which then could be added to the heuristics.
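The "two or more words" heuristic from the example above might look like the sketch below. The three seed terms come from the text; the function name `matches_category`, the `min_hits` parameter, and the sample resolution are hypothetical. It assumes that "two or more" means two distinct terms and that matching is on exact lowercase words.

```python
import re

# Seed terms taken from the example in the text; a real list would come
# from the Expert Group and could be longer.
DIVERSITY_TERMS = {"diversity", "minority", "underrepresented"}

def matches_category(text, terms, min_hits=2):
    """Return True if at least `min_hits` distinct seed terms appear in `text`."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return len(tokens & terms) >= min_hits

# Made-up resolution text for illustration.
resolution = "Shareholders request a report on board diversity and minority representation."
is_diversity = matches_category(resolution, DIVERSITY_TERMS)
```

Because matching is exact, a variant such as "minorities" would not count as a hit. Expanding the seed list through word embeddings, as described above, is one way to catch such variants.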
In addition to supervised learning using the labeled 10-K data set, beginning in March 2020, we conducted unsupervised clustering on proxy statements to identify only those sections relevant to SASB standards, reducing the volume of documents requiring human analysis. Once classified as relevant, these statements can be further sorted into specific topics, using either 1) further unsupervised clustering; or 2) a transfer learning method, where SASB 10-K labeled data is applied to proxy statements.
Unsupervised clustering maps relevant paragraphs and clusters to two of SASB’s GICs: Human Capital Management and Supply Chain Management. If clusters contain content outside these two GICs, we consider whether they might represent new GICs.
After pre-processing, we used a K-means clustering algorithm and an LDA-mallet model to cluster proxy statement paragraphs. Having determined that the K-means model was more useful, we proceeded with that method.
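To illustrate the K-means step, here is a minimal standard-library sketch of the algorithm applied to paragraphs. It is not the team's pipeline: the text does not specify the features, the number of clusters, or the initialization, so the bag-of-words vectors, the two clusters, the fixed initial centroids, and the toy paragraphs are all assumptions made for a small, reproducible example. In practice a library implementation such as scikit-learn's `KMeans` would be the more likely choice.

```python
import math
import re
from collections import Counter

def vectorize(paragraphs):
    """Turn paragraphs into L2-normalized bag-of-words vectors over a shared vocabulary."""
    counts = [Counter(re.findall(r"[a-z]+", p.lower())) for p in paragraphs]
    vocab = sorted(set().union(*counts))
    vectors = []
    for c in counts:
        vec = [float(c[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def kmeans(vectors, initial_centroids, iterations=10):
    """Lloyd's algorithm: assign each vector to its nearest centroid, then recompute centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        for i, v in enumerate(vectors):
            distances = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            labels[i] = distances.index(min(distances))
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up paragraphs: two about labor practices, two about supply chains.
paragraphs = [
    "workers wages overtime union",
    "supplier audit supplier monitoring",
    "union wages workers benefits",
    "supplier monitoring audit code",
]
vectors = vectorize(paragraphs)
# Fixed initial centroids (the first two paragraphs) keep the example deterministic.
labels = kmeans(vectors, initial_centroids=[vectors[0], vectors[1]])
```

With this toy input, the labor-related paragraphs end up in one cluster and the supply chain paragraphs in the other.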
Transfer learning uses a neural network-based model trained on the SASB 10-K labeled dataset and applied to the proxy statement dataset. We aim to develop similar labels for each paragraph of a proxy statement.
The approximate category labels from each method provide a starting point for labeling proxy statements. SASB can then apply a broader range of techniques to parse data. Since both methods are promising and may yield different insights, team members will pursue them in parallel.
A detailed report of the progress through August 2020 is available here.
The Addition Project
The DfG Addition Project team — trained in statistics and natural language processing — analyzes relationships between corporate practice and financial impact/investor interest to identify human rights-related metrics to enhance SASB standards. To prove the methodology, we began this project with a narrow focus on two topic areas: 1) workforce structure (both direct and supply chain) and 2) recruitment fees (directly related to modern slavery). Once the method is honed, it can be applied to other topics. Again, the Expert Group will provide keywords and phrases.
Workforce structure was selected as a starting topic for the following reasons:
- It addresses both the direct workforce of a company and workers in its supply chain. In doing so, it overcomes current SASB typology limits that place these workers in separate dimensions (i.e., Human Capital Management for the direct workforce and Business Models & Innovation for workers in supply chains).
- Expert Group members have identified this topic as a gap in the current standards.
- Good metrics related to workforce structure are already available; companies report these in the Workforce Disclosure Initiative (e.g., WDI 3.5), Bloomberg Social Metrics, and the Global Reporting Initiative.
- The Human Capital Management Coalition — an influential group of institutional investors also engaging SASB on its human capital management project — is interested in this issue.
- SASB has flagged the topic in its Human Capital Management Preliminary Framework.
The DfG Addition Project team began by defining an outcome measure of corporate practice risk: using natural language processing, it measures how frequently key terms that signal risk, such as “protest,” “lawsuit,” “impoundment,” “divestment,” and “boycott,” are used in relation to specific practices. This method relies primarily on structured and unstructured data sources drawn from news and social media.
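One simple way to operationalize "used in relation to practices" is to count risk terms only in sentences that also mention a practice. The sketch below assumes that same-sentence co-occurrence is an acceptable proxy, which the text does not state. The risk terms come from the text, while the practice phrase, the function name `risk_term_counts`, and the sample news sentences are hypothetical.

```python
import re
from collections import Counter

RISK_TERMS = {"protest", "lawsuit", "impoundment", "divestment", "boycott"}  # from the text
PRACTICE_TERMS = {"recruitment fees"}  # hypothetical practice phrase

def risk_term_counts(documents, practice_terms, risk_terms):
    """Count risk terms in sentences that also mention a practice term."""
    counts = Counter()
    for doc in documents:
        # Naive sentence split on whitespace that follows ., ! or ?
        for sentence in re.split(r"(?<=[.!?])\s+", doc.lower()):
            if any(p in sentence for p in practice_terms):
                words = re.findall(r"[a-z]+", sentence)
                counts.update(w for w in words if w in risk_terms)
    return counts

# Made-up news snippets for illustration only.
news = [
    "Workers filed a lawsuit over recruitment fees. A boycott followed elsewhere.",
    "Activists staged a protest against recruitment fees charged by labor brokers.",
]
counts = risk_term_counts(news, PRACTICE_TERMS, RISK_TERMS)
```

Here "boycott" is not counted because it appears in a sentence that does not mention the practice. A wider window, such as the whole paragraph, would trade precision for recall.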
The Addition Project uses word embeddings irrespective of practice or risk measure, both to identify “noun chunks,” such as “forced labor” or “migrant workers,” and to better identify similar words. We are relying on the Expert Group to develop heuristics.
The final stage of the Addition Project research could involve event studies to demonstrate the financial materiality of the risks.
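The text does not say how such event studies would be designed. As a hedged illustration of one common approach, the sketch below fits a market model (stock return = alpha + beta × market return) on an estimation window and sums the abnormal returns over an event window. The function names and all return values are made up for the example.

```python
def fit_market_model(stock_returns, market_returns):
    """Least-squares fit of stock = alpha + beta * market over an estimation window."""
    n = len(stock_returns)
    mean_s = sum(stock_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum((m - mean_m) * (s - mean_s) for s, m in zip(stock_returns, market_returns))
    var = sum((m - mean_m) ** 2 for m in market_returns)
    beta = cov / var
    alpha = mean_s - beta * mean_m
    return alpha, beta

def cumulative_abnormal_return(alpha, beta, stock_returns, market_returns):
    """Sum of (actual - expected) returns over the event window."""
    return sum(s - (alpha + beta * m) for s, m in zip(stock_returns, market_returns))

# Made-up daily returns. The stock moves exactly twice as much as the market
# in the estimation window, so the fit gives alpha near 0 and beta near 2.
est_market = [0.01, -0.02, 0.03, 0.00]
est_stock = [0.02, -0.04, 0.06, 0.00]
alpha, beta = fit_market_model(est_stock, est_market)

# Hypothetical two-day event window following a labor-related news event.
car = cumulative_abnormal_return(alpha, beta, [-0.05, -0.01], [0.01, 0.00])
```

A negative cumulative abnormal return around the event would be consistent with a financial impact, although a real study would also need significance tests across many events.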
Anticipated next steps for each project during Fall 2020 are as follows:
The Extension Project
The classifier will be continuously refined, including additional cleaning and pre-processing steps on the training data, testing different classification methods, and tuning parameters. Once the classifier can reliably detect labor-related disclosures in 10-Ks, we will be able to determine their relative frequency in different industries, including those in which the topic was originally not deemed material. In the process of developing the classifier, the team can also investigate increases in labor-related terms over time, providing a starting point for similar analysis on proxy statements.
The classifier will then be adapted to different sources of text, including earnings calls, news reports, discussion boards, and so on. In its first iteration, the classifier can detect text as relevant or not relevant to human capital management or supply chain management. In its second, more refined iteration, it will be able to classify text as pertaining to a specific topic within these domains.
The Addition Project
The team will continue to define the outcomes and obtain measurements for the selected corporate practice and risks: namely, workforce structure and recruitment fees or an alternative modern slavery topic.
 SASB is updating the Conceptual Framework. Starting in September 2020, the “Exposure Draft” was opened for public comment: https://www.sasb.org/standard-setting-process/conceptual-framework/
 Labeled data are data that have been annotated with meaningful outcome measures, often category tags.
 Text classifiers analyze text and assign the input to one or more predefined categories.
 With respect to conference calls, the most interesting data will come from the past two years, since interest in sustainability topics has grown.
 Word embeddings map words or phrases to a space in which similar entities are close to each other. One use is to input a single word and discover words commonly found in context with it.
 Unsupervised learning is “modeling the underlying or hidden structure or distribution in the data in order to learn more about the data. Unsupervised learning is where you only have input data and no corresponding output variables.” See https://www.educba.com/supervised-learning-vs-unsupervised-learning/
 SASB Human Capital Management Project Preliminary Framework: Executive Summary, June 2020, p. 8 (unpublished).