Research Documentation

MEASURE Index Research Methodology

Version: 1.0
Date: November 2, 2025
Project: Corporate DEI Posture Analysis & Database System

Executive Summary

This methodology document outlines the comprehensive, systematic approach used to research, analyze, and catalog Diversity, Equity, and Inclusion (DEI) postures for S&P 500 companies. The system combines automated web research, AI-powered contextual analysis, quantitative risk scoring, and structured database storage to create an authoritative, scalable, and queryable repository of corporate DEI intelligence.

Key Components:

  • Automated Research Pipeline: Multi-source web research using Anthropic's Batch API with integrated web search
  • Structured Data Schema: JSON-based research format with comprehensive provenance tracking
  • AI Contextualization: Claude-powered analytical insights and strategic implications
  • Risk Quantification: Algorithmic scoring based on legal controversies and negative events
  • Relational Database: PostgreSQL-backed normalized schema with 15 tables and analytical views
  • Quality Controls: Multi-layer validation, source reliability scoring, and data quality flagging

Research Objectives

Primary Goals:

  1. Comprehensive Coverage: Document DEI posture for all S&P 500 companies
  2. Multi-Dimensional Analysis: Capture leadership, policies, controversies, commitments, and transparency
  3. Evidence-Based Research: Require minimum 5 credible sources per company with full citations
  4. Temporal Tracking: Focus on current state (2024-2025) while capturing historical trajectory
  5. Actionable Intelligence: Provide strategic insights for stakeholders (investors, advocates, researchers)

Research Scope:

Each company analysis includes:

  • DEI Posture & Status: Current commitment level and strategic positioning
  • Leadership Structure: Chief Diversity Officer (CDO) role, reporting lines, C-suite representation
  • Transparency & Reporting: Publication practices, disclosure frequency, report availability
  • Recent Events: Policy changes, announcements, awards, and milestones (2024-2025)
  • Legal Controversies: Ongoing lawsuits, settlements, regulatory actions, employee protests
  • Commitments & Initiatives: Industry coalitions, pledges, programs, and philanthropic efforts
  • Supplier Diversity: Program existence, status, and spending disclosure
  • Comparative Context: Industry peer positioning and relative maturity

Data Collection Methodology

Phase 1: Source Identification & Web Research

Approach: Anthropic Batch API with integrated web search capabilities

1.1 Input Data Sources

  • Primary Dataset: S&P 500 company list (sp500.csv)
  • Fields Captured: Ticker symbol, company name, GICS sector/sub-industry, headquarters location, CIK number, founding year

1.2 Research Process

Tools: batch_api_launcher.py and submit_batch.py

  1. Request Generation:
    • Creates JSONL batch file with one research task per company
    • Each request includes comprehensive research prompt with 6 search objectives
    • Model: claude-sonnet-4-5-20250929 with 16,000 max tokens
    • Web search enabled: Up to 15 search queries per company
  2. Search Strategy (per company):
    • DEI Posture Searches: Corporate DEI pages, ESG reports, annual reports
    • Leadership Searches: CDO identification, LinkedIn profiles, press releases
    • Transparency Searches: Standalone DEI reports, sustainability reports, 10-K/proxy statements
    • Recent Events: News articles, press releases, policy announcements (2024-2025)
    • Legal Controversies: Discrimination lawsuits, court dockets, SEC filings, regulatory actions
    • Commitments: CEO Action pledge, industry coalitions, supplier diversity programs
  3. Source Requirements:
    • Minimum: 5 unique, credible sources per company
    • Reliability Hierarchy:
      • 5 (Highest): SEC filings, company websites, proxy statements
      • 4: Major news outlets (WSJ, NYT, Reuters, Bloomberg)
      • 3: Trade press, industry publications
      • 2: Think tanks, research organizations
      • 1: Blogs, social media
  4. Batch Processing:
    • Submitted to Anthropic Batch API for parallel processing
    • Processing window: Up to 24 hours
    • Cost efficiency: 50% discount vs. real-time API
    • Results downloaded as JSONL with one research JSON per company
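The request-generation step above can be sketched as follows. This is an illustrative reconstruction, not the actual `batch_api_launcher.py`: the prompt wording, function names, and `write_jsonl` helper are assumptions, while the model ID, token limit, and 15-search cap come from the documented configuration. The `web_search_20250305` tool type reflects Anthropic's server-side web search tool as of this writing.

```python
import json

MODEL = "claude-sonnet-4-5-20250929"

def build_request(ticker: str, company: str, sector: str) -> dict:
    """Assemble one Batch API request for a single company."""
    prompt = (
        f"Research the DEI posture of {company} ({ticker}, {sector}). "
        "Cover posture, leadership, transparency, recent events (2024-2025), "
        "legal controversies, and commitments. Cite at least 5 credible sources."
    )
    return {
        "custom_id": ticker,  # lets each result be matched back to its company
        "params": {
            "model": MODEL,
            "max_tokens": 16000,
            "tools": [{
                "type": "web_search_20250305",  # server-side web search tool
                "name": "web_search",
                "max_uses": 15,                 # cap of 15 searches per company
            }],
            "messages": [{"role": "user", "content": prompt}],
        },
    }

def write_jsonl(companies: list[dict], path: str) -> int:
    """Write one JSON request per line; returns the number of requests."""
    with open(path, "w") as f:
        for c in companies:
            req = build_request(c["ticker"], c["name"], c["sector"])
            f.write(json.dumps(req) + "\n")
    return len(companies)
```

One request object per line keeps the batch file streamable, and `custom_id` is what makes the asynchronous results joinable back to the input roster.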

Phase 2: Research Validation & Storage

Tool: download_batch_results.py

  • Downloads completed batch results from Anthropic API
  • Parses JSONL responses and extracts research JSON from AI output
  • Validates against schema requirements
  • Saves individual research files: research_data/{TICKER}_research.json
  • Logs successes, failures, and validation errors
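The extraction step in `download_batch_results.py` might look like the sketch below. The exact layout of a batch result record is an assumption (one JSON object per JSONL line, with the model's message nested under `result`); the real tool also performs schema validation and logging, which are elided here.

```python
import json

def extract_research(jsonl_text: str) -> dict[str, dict]:
    """Map ticker -> parsed research JSON; malformed lines are skipped."""
    results: dict[str, dict] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        try:
            record = json.loads(line)
            ticker = record["custom_id"]
            # Assumed layout: the research JSON sits in the final text
            # block of the model's message.
            text = record["result"]["message"]["content"][-1]["text"]
            results[ticker] = json.loads(text)
        except (KeyError, IndexError, json.JSONDecodeError):
            continue  # logged as a validation failure in the real tool
    return results
```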

Research Schema & Standards

All research outputs follow a standardized JSON schema (v1.1):

{
  "schema_version": "1.1",
  "company_identifier": {
    "ticker_symbol": "AAPL",
    "company_name": "Apple Inc.",
    "industry": "Information Technology"
  },
  "research_snapshot": {
    "captured_at": "ISO-8601 timestamp"
  },
  "data_sources": [...],
  "findings": {
    "dei_posture": {...},
    "cdo_role": {...},
    "reporting": {...},
    "events": [...],
    "controversies": [...],
    "commitments": [...]
  }
}

Key Schema Features:

  • Provenance Tracking: Every finding includes provenance_ids array linking to source IDs
  • Verbatim Quotes: Each section supports quotes array (max 240 chars per quote)
  • Controlled Vocabularies: ENUMs for status fields, event types, controversy types, etc.
  • Data Quality Metadata: Flags for incomplete data, conflicting sources, verification needs
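The provenance-tracking requirement implies a referential check: every `provenance_ids` entry in the findings must resolve to a declared source ID. A minimal sketch of such a check, assuming sources carry an `id` field (the traversal logic is illustrative, not the production validator):

```python
def check_provenance(research: dict) -> list[str]:
    """Return a list of dangling provenance references in the findings."""
    known = {s["id"] for s in research.get("data_sources", []) if "id" in s}
    dangling: list[str] = []

    def walk(node, path):
        if isinstance(node, dict):
            for pid in node.get("provenance_ids", []):
                if pid not in known:
                    dangling.append(f"{path}: unknown source id {pid!r}")
            for key, value in node.items():
                walk(value, f"{path}.{key}")
        elif isinstance(node, list):
            for i, value in enumerate(node):
                walk(value, f"{path}[{i}]")

    walk(research.get("findings", {}), "findings")
    return dangling
```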

AI-Powered Profile Generation

Tool: create_company_profile.py

Phase 3: Profile Creation & AI Contextualization

Each research JSON is transformed into an enriched company profile through:

AI Analysis Components:

  1. Executive Summary (150-250 words) - Comprehensive overview of DEI posture
  2. Key Insights (5-7 bullet points, 20-40 words each) - Critical observations
  3. Trend Analysis (100-150 words) - Momentum assessment and trajectory
  4. Comparative Context (100-150 words) - Industry peer comparison
  5. Strategic Implications (3-5 points, 25-50 words each) - Business risks and opportunities
  6. Commitment Strength Rating (1-10 scale)
  7. Transparency Rating (1-10 scale)
  8. Overall Recommendation - Leading, Strong, Moderate, Weak, or Concerning

AI Prompt Design:

  • Provides research findings, risk assessment, and editorial summaries
  • Requests structured JSON output
  • Model: claude-sonnet-4-5-20250929 (4,000 max tokens)
  • Temperature: Standard (balanced creativity and accuracy)
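A sketch of how `create_company_profile.py` might assemble the contextualization prompt. The section keys mirror the eight analysis components listed above; the prompt wording and helper name are illustrative assumptions.

```python
import json

PROFILE_SECTIONS = [
    "executive_summary", "key_insights", "trend_analysis",
    "comparative_context", "strategic_implications",
    "commitment_strength_rating", "transparency_rating",
    "overall_recommendation",
]

def build_profile_prompt(research: dict, risk: dict) -> str:
    """Compose a prompt requesting structured JSON with all eight sections."""
    return (
        "You are analyzing a company's DEI posture. Using the research "
        "findings and risk assessment below, return a JSON object with "
        f"exactly these keys: {', '.join(PROFILE_SECTIONS)}.\n\n"
        f"RESEARCH:\n{json.dumps(research)}\n\n"
        f"RISK:\n{json.dumps(risk)}"
    )
```

Requesting an explicit key list makes the model's JSON output directly parseable into the profile structure without post-hoc mapping.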

Risk Assessment Framework

Quantitative Risk Scoring

Methodology: Event-based accumulation with weighted severity

Risk Factors & Weights:

  • Ongoing lawsuit (15 points): active legal exposure, unresolved risk
  • Settled case (5 points): historical liability, reputational impact
  • Negative event (10 points): public criticism, policy reversals, protests
  • High-impact event (5 points): major announcements, significant controversies

Calculation:

Risk Score = MIN(100, Σ(factor_count × factor_weight))

Risk Level Thresholds:

  • 0-24: Low risk
  • 25-49: Medium risk
  • 50-100: High risk
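The scoring rule and thresholds above translate directly into code. The weights, cap, and band boundaries come from the documented framework; only the function and dictionary-key names are invented for illustration.

```python
# Documented weights: ongoing lawsuit 15, settled case 5,
# negative event 10, high-impact event 5.
WEIGHTS = {
    "ongoing_lawsuit": 15,
    "settled_case": 5,
    "negative_event": 10,
    "high_impact_event": 5,
}

def risk_score(counts: dict[str, int]) -> int:
    """Risk Score = MIN(100, sum(factor_count * factor_weight))."""
    total = sum(WEIGHTS[k] * counts.get(k, 0) for k in WEIGHTS)
    return min(100, total)

def risk_level(score: int) -> str:
    """Map a 0-100 score onto the documented low/medium/high bands."""
    if score <= 24:
        return "low"
    if score <= 49:
        return "medium"
    return "high"
```

For example, two ongoing lawsuits and one negative event give 2×15 + 10 = 40 points, landing in the medium band.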

Risk Score Bias Considerations

Known Biases:

  1. Litigation bias: Larger companies face more lawsuits (exposure ≠ worse behavior)
  2. Transparency paradox: Companies disclosing issues may score higher risk than opaque peers
  3. Media bias: High-profile companies receive more coverage (detection bias)

Mitigation:

  • Risk scores normalized by company size (optional view)
  • Transparency rating as separate dimension (rewarding disclosure)
  • Context fields capture case details beyond binary counts

Database Architecture

Tools: database/schema.sql, database/import_profiles.py

Phase 4: PostgreSQL Database Import

Database Design Principles:

  1. Normalization: 3NF-compliant schema to eliminate redundancy
  2. Versioning: Support multiple profile snapshots per company
  3. Referential Integrity: Foreign key constraints across all relationships
  4. Performance: Comprehensive indexing on foreign keys, dates, ratings, and filter columns
  5. Type Safety: PostgreSQL ENUM types for controlled vocabularies

15 Core Tables:

  1. companies
  2. profiles
  3. risk_assessments
  4. dei_postures
  5. cdo_roles
  6. reporting_practices
  7. events
  8. controversies
  9. commitments
  10. supplier_diversity
  11. ai_contexts
  12. ai_key_insights
  13. ai_strategic_implications
  14. data_sources
  15. data_quality_flags

3 Analytical Views:

  • v_latest_company_profiles: Denormalized view of latest profile with key metrics
  • v_risk_by_industry: Risk aggregation by industry sector
  • v_cdo_stats: CDO prevalence and C-suite representation statistics

Quality Assurance

Multi-Layer Validation

1. Schema Validation

  • JSON Schema Draft 7 specification
  • Required fields enforcement
  • Type checking (strings, integers, booleans, dates)
  • Pattern validation (ticker symbols, CIK numbers, URLs, dates)
  • Enum validation for controlled vocabularies
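The production pipeline uses a full JSON Schema (Draft 7) validator; the hand-rolled sketch below illustrates only the required-field and pattern layers with the standard library, checking the `company_identifier` block from the schema excerpt above. The ticker regex is an assumption about the accepted format.

```python
import re

# Assumed ticker pattern: 1-5 capitals, optional class suffix (e.g. BRK.B).
TICKER_RE = re.compile(r"^[A-Z]{1,5}(\.[A-Z])?$")

def validate_identifier(block: dict) -> list[str]:
    """Return a list of validation errors; empty means the block passes."""
    errors: list[str] = []
    for field in ("ticker_symbol", "company_name", "industry"):
        if field not in block:
            errors.append(f"missing required field: {field}")
    ticker = block.get("ticker_symbol", "")
    if ticker and not TICKER_RE.match(ticker):
        errors.append(f"bad ticker format: {ticker!r}")
    return errors
```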

2. Source Reliability Scoring (5-Point Scale)

  • 5: Primary corporate sources (SEC filings, corporate websites, proxy statements)
  • 4: Major news outlets (WSJ, NYT, Reuters, Bloomberg)
  • 3: Trade press and industry publications
  • 2: Think tanks, research organizations, advocacy groups
  • 1: Blogs, social media, unverified sources

Requirement: Minimum 5 sources per company, preference for scores 3+

3. Data Quality Flags

  • incomplete_data: Source count < 5
  • conflicting_sources: AI detects contradictory information
  • outdated_information: No recent data (2024-2025)
  • verification_needed: Array of specific claims requiring fact-check
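Two of these flags are mechanically derivable from a research record, as sketched below. Field names mirror the schema excerpt earlier in this document; the derivation logic is an assumption (the real pipeline also uses AI detection for conflicting sources, which cannot be reduced to a rule like this).

```python
def quality_flags(research: dict) -> dict:
    """Derive the rule-based subset of data quality flags."""
    sources = research.get("data_sources", [])
    events = research.get("findings", {}).get("events", [])
    # "Recent" per the methodology means dated 2024 or 2025.
    recent = any(
        str(e.get("date", "")).startswith(("2024", "2025")) for e in events
    )
    return {
        "incomplete_data": len(sources) < 5,
        "outdated_information": not recent,
    }
```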

4. Provenance Tracking

Every finding requires:

  • Source ID references (provenance_ids array)
  • Verbatim quotes where applicable
  • Date of information capture
  • Source reliability score

Automation Pipeline

End-to-End Workflow

MEASURE INDEX PIPELINE

1. INPUT: sp500.csv
   └── 503 companies (as of 2025)

2. BATCH REQUEST GENERATION
   ├── Tool: batch_api_launcher.py
   └── Output: batch_api_requests.jsonl

3. BATCH SUBMISSION
   ├── Tool: submit_batch.py
   ├── API: Anthropic Batch API
   └── Model: claude-sonnet-4-5-20250929

4. RESULTS DOWNLOAD
   ├── Tool: download_batch_results.py
   └── Output: research_data/{TICKER}_research.json

5. PROFILE GENERATION
   ├── Tool: create_company_profile.py
   └── Output: company_profiles/{TICKER}_profile.json

6. DATABASE IMPORT
   ├── Tool: database/import_profiles.py
   ├── Target: PostgreSQL (Supabase)
   └── Schema: 15 tables, 3 views

7. QUERY & ANALYSIS
   └── Applications: Dashboards, APIs, research tools

Limitations & Considerations

Methodological Limitations

1. Point-in-Time Analysis

  • Research captures a snapshot at a single point in time
  • Corporate policies and leadership can change rapidly
  • Recommendation: Periodic re-research (quarterly or annually)

2. Source Availability Bias

  • Larger companies have more public information
  • Smaller S&P 500 companies may have less DEI disclosure
  • Private companies or recent IPOs may lack historical data

3. AI Analysis Subjectivity

  • AI-generated insights reflect model interpretation
  • Ratings (commitment strength, transparency) are qualitative assessments
  • Recommendation: Human review for high-stakes decisions

4. Risk Score Simplification

  • Algorithm weights may not capture the nuance of specific cases
  • An ongoing lawsuit may be frivolous or meritorious; the score does not differentiate
  • Settled cases may involve admissions of wrongdoing or no-fault settlements

Ethical Considerations

Fairness & Representation

  • Analysis focuses on publicly available information (transparency bias)
  • Companies with less disclosure are not necessarily less committed
  • Risk scores reflect legal exposure, not inherent company values

Use Case Appropriateness

  • Intended for research, analysis, and advocacy
  • Not suitable as sole basis for investment decisions
  • Human judgment required for contextual interpretation

Document Prepared By: MEASURE Index Research Team

Last Updated: November 2, 2025

Version 1.0