MEASURE Index Research Methodology
Executive Summary
This methodology document outlines the comprehensive, systematic approach used to research, analyze, and catalog Diversity, Equity, and Inclusion (DEI) postures for S&P 500 companies. The system combines automated web research, AI-powered contextual analysis, quantitative risk scoring, and structured database storage to create an authoritative, scalable, and queryable repository of corporate DEI intelligence.
Key Components:
- Automated Research Pipeline: Multi-source web research using Anthropic's Batch API with integrated web search
- Structured Data Schema: JSON-based research format with comprehensive provenance tracking
- AI Contextualization: Claude-powered analytical insights and strategic implications
- Risk Quantification: Algorithmic scoring based on legal controversies and negative events
- Relational Database: PostgreSQL-backed normalized schema with 15 tables and analytical views
- Quality Controls: Multi-layer validation, source reliability scoring, and data quality flagging
Research Objectives
Primary Goals:
- Comprehensive Coverage: Document DEI posture for all S&P 500 companies
- Multi-Dimensional Analysis: Capture leadership, policies, controversies, commitments, and transparency
- Evidence-Based Research: Require minimum 5 credible sources per company with full citations
- Temporal Tracking: Focus on current state (2024-2025) while capturing historical trajectory
- Actionable Intelligence: Provide strategic insights for stakeholders (investors, advocates, researchers)
Research Scope:
Each company analysis includes:
- DEI Posture & Status: Current commitment level and strategic positioning
- Leadership Structure: Chief Diversity Officer (CDO) role, reporting lines, C-suite representation
- Transparency & Reporting: Publication practices, disclosure frequency, report availability
- Recent Events: Policy changes, announcements, awards, and milestones (2024-2025)
- Legal Controversies: Ongoing lawsuits, settlements, regulatory actions, employee protests
- Commitments & Initiatives: Industry coalitions, pledges, programs, and philanthropic efforts
- Supplier Diversity: Program existence, status, and spending disclosure
- Comparative Context: Industry peer positioning and relative maturity
Data Collection Methodology
Phase 1: Source Identification & Web Research
Approach: Anthropic Batch API with integrated web search capabilities
1.1 Input Data Sources
- Primary Dataset: S&P 500 company list (sp500.csv)
- Fields Captured: Ticker symbol, company name, GICS sector/sub-industry, headquarters location, CIK number, founding year
1.2 Research Process
Tools: batch_api_launcher.py and submit_batch.py
- Request Generation:
- Creates JSONL batch file with one research task per company
- Each request includes comprehensive research prompt with 6 search objectives
- Model: claude-sonnet-4-5-20250929 with 16,000 max tokens
- Web search enabled: Up to 15 search queries per company
- Search Strategy (per company):
- DEI Posture Searches: Corporate DEI pages, ESG reports, annual reports
- Leadership Searches: CDO identification, LinkedIn profiles, press releases
- Transparency Searches: Standalone DEI reports, sustainability reports, 10-K/proxy statements
- Recent Events: News articles, press releases, policy announcements (2024-2025)
- Legal Controversies: Discrimination lawsuits, court dockets, SEC filings, regulatory actions
- Commitments: CEO Action pledge, industry coalitions, supplier diversity programs
- Source Requirements:
- Minimum: 5 unique, credible sources per company
- Reliability Hierarchy:
- 5 (Highest): SEC filings, company websites, proxy statements
- 4: Major news outlets (WSJ, NYT, Reuters, Bloomberg)
- 3: Trade press, industry publications
- 2: Think tanks, research organizations
- 1: Blogs, social media
- Batch Processing:
- Submitted to Anthropic Batch API for parallel processing
- Processing window: Up to 24 hours
- Cost efficiency: 50% discount vs. real-time API
- Results downloaded as JSONL with one research JSON per company
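The request-generation step above can be sketched as follows. This is illustrative, not the actual batch_api_launcher.py: the exact Batch API request shape, the web-search tool type string, and the prompt wording are assumptions based on Anthropic's documented formats.

```python
import json

def build_request(ticker: str, name: str, sector: str) -> dict:
    """Build one Batch API research request for a single company (sketch)."""
    prompt = (
        f"Research the DEI posture of {name} ({ticker}), sector: {sector}. "
        "Cover posture, CDO role, reporting practices, 2024-2025 events, "
        "legal controversies, and commitments. Cite at least 5 credible sources."
    )
    return {
        "custom_id": ticker,  # used later to match results back to companies
        "params": {
            "model": "claude-sonnet-4-5-20250929",
            "max_tokens": 16000,
            "messages": [{"role": "user", "content": prompt}],
            # Assumed web-search tool spec; capped at 15 searches per company
            "tools": [
                {"type": "web_search_20250305", "name": "web_search", "max_uses": 15}
            ],
        },
    }

def write_jsonl(companies: list[dict], path: str) -> int:
    """Write one request per line (JSONL), one line per company."""
    with open(path, "w") as f:
        for c in companies:
            f.write(json.dumps(build_request(c["ticker"], c["name"], c["sector"])) + "\n")
    return len(companies)
```

The `custom_id` field is what allows results, which the Batch API may return in any order, to be re-associated with their companies on download.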
Phase 2: Research Validation & Storage
Tool: download_batch_results.py
- Downloads completed batch results from Anthropic API
- Parses JSONL responses and extracts research JSON from AI output
- Validates against schema requirements
- Saves individual research files: research_data/{TICKER}_research.json
- Logs successes, failures, and validation errors
Research Schema & Standards
All research outputs follow a standardized JSON schema (v1.1):
{
  "schema_version": "1.1",
  "company_identifier": {
    "ticker_symbol": "AAPL",
    "company_name": "Apple Inc.",
    "industry": "Information Technology"
  },
  "research_snapshot": {
    "captured_at": "ISO-8601 timestamp"
  },
  "data_sources": [...],
  "findings": {
    "dei_posture": {...},
    "cdo_role": {...},
    "reporting": {...},
    "events": [...],
    "controversies": [...],
    "commitments": [...]
  }
}
Key Schema Features:
- Provenance Tracking: Every finding includes provenance_ids array linking to source IDs
- Verbatim Quotes: Each section supports quotes array (max 240 chars per quote)
- Controlled Vocabularies: ENUMs for status fields, event types, controversy types, etc.
- Data Quality Metadata: Flags for incomplete data, conflicting sources, verification needs
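The provenance and quote rules above can be made concrete with an illustrative finding (the field values here are invented, not from the dataset):

```python
# One hypothetical "events" finding showing how provenance_ids link back to
# entries in data_sources, and how verbatim quotes are capped at 240 chars.

MAX_QUOTE_LEN = 240

finding = {
    "event_type": "policy_announcement",   # example ENUM value
    "date": "2025-03-14",                  # hypothetical date
    "summary": "Company reaffirmed its supplier diversity spending target.",
    "provenance_ids": ["src-2", "src-5"],  # must match ids in data_sources
    "quotes": [
        "We remain committed to our supplier diversity program.",
    ],
}

def quotes_valid(f: dict) -> bool:
    """Check that every quote respects the 240-character cap."""
    return all(len(q) <= MAX_QUOTE_LEN for q in f.get("quotes", []))
```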
AI-Powered Profile Generation
Tool: create_company_profile.py
Phase 3: Profile Creation & AI Contextualization
Each research JSON is transformed into an enriched company profile through:
AI Analysis Components:
- Executive Summary (150-250 words) - Comprehensive overview of DEI posture
- Key Insights (5-7 bullet points, 20-40 words each) - Critical observations
- Trend Analysis (100-150 words) - Momentum assessment and trajectory
- Comparative Context (100-150 words) - Industry peer comparison
- Strategic Implications (3-5 points, 25-50 words each) - Business risks and opportunities
- Commitment Strength Rating (1-10 scale)
- Transparency Rating (1-10 scale)
- Overall Recommendation - Leading, Strong, Moderate, Weak, or Concerning
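The document does not specify how the five-level Overall Recommendation is derived from the two 1-10 ratings; one plausible mapping, with thresholds invented purely for illustration, might look like:

```python
# Hypothetical sketch only: the real create_company_profile.py derivation is
# not documented here, and these cutoffs are assumptions.

def overall_recommendation(commitment: int, transparency: int) -> str:
    avg = (commitment + transparency) / 2
    if avg >= 8.5:
        return "Leading"
    if avg >= 7:
        return "Strong"
    if avg >= 5:
        return "Moderate"
    if avg >= 3:
        return "Weak"
    return "Concerning"
```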
AI Prompt Design:
- Provides research findings, risk assessment, and editorial summaries
- Requests structured JSON output
- Model: claude-sonnet-4-5-20250929 (4,000 max tokens)
- Temperature: Standard (balanced creativity and accuracy)
Risk Assessment Framework
Quantitative Risk Scoring
Methodology: Event-based accumulation with weighted severity
Risk Factors & Weights:
| Factor | Weight | Rationale |
|---|---|---|
| Ongoing Lawsuit | 15 points | Active legal exposure, unresolved risk |
| Settled Case | 5 points | Historical liability, reputational impact |
| Negative Event | 10 points | Public criticism, policy reversals, protests |
| High-Impact Event | 5 points | Major announcements, significant controversies |
Calculation:
Risk Score = MIN(100, Σ(factor_count × factor_weight))
Risk Level Thresholds:
- 0-24: Low risk
- 25-49: Medium risk
- 50-100: High risk
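The formula and thresholds above translate directly into code. This sketch uses the documented weights; the dictionary key names are assumptions.

```python
# Weights from the Risk Factors table above.
WEIGHTS = {
    "ongoing_lawsuit": 15,
    "settled_case": 5,
    "negative_event": 10,
    "high_impact_event": 5,
}

def risk_score(counts: dict) -> int:
    """Risk Score = MIN(100, sum(factor_count * factor_weight))."""
    total = sum(counts.get(factor, 0) * w for factor, w in WEIGHTS.items())
    return min(100, total)

def risk_level(score: int) -> str:
    """Map a 0-100 score onto the documented thresholds."""
    if score <= 24:
        return "Low"
    if score <= 49:
        return "Medium"
    return "High"
```

For example, two ongoing lawsuits (30) plus one negative event (10) yield a score of 40, i.e. Medium risk; the cap means ten ongoing lawsuits still score 100.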
Risk Score Bias Considerations
Known Biases:
- Litigation bias: Larger companies face more lawsuits (exposure ≠ worse behavior)
- Transparency paradox: Companies disclosing issues may score higher risk than opaque peers
- Media bias: High-profile companies receive more coverage (detection bias)
Mitigation:
- Risk scores normalized by company size (optional view)
- Transparency rating as separate dimension (rewarding disclosure)
- Context fields capture case details beyond binary counts
Database Architecture
Tools: database/schema.sql, database/import_profiles.py
Phase 4: PostgreSQL Database Import
Database Design Principles:
- Normalization: 3NF-compliant schema to eliminate redundancy
- Versioning: Support multiple profile snapshots per company
- Referential Integrity: Foreign key constraints across all relationships
- Performance: Comprehensive indexing on foreign keys, dates, ratings, and filter columns
- Type Safety: PostgreSQL ENUM types for controlled vocabularies
15 Core Tables:
- companies
- profiles
- risk_assessments
- dei_postures
- cdo_roles
- reporting_practices
- events
- controversies
- commitments
- supplier_diversity
- ai_contexts
- ai_key_insights
- ai_strategic_implications
- data_sources
- data_quality_flags
3 Analytical Views:
- v_latest_company_profiles: Denormalized view of latest profile with key metrics
- v_risk_by_industry: Risk aggregation by industry sector
- v_cdo_stats: CDO prevalence and C-suite representation statistics
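The real views live in PostgreSQL (database/schema.sql); the self-contained SQLite sketch below illustrates the kind of aggregation v_risk_by_industry performs. Table and column names are simplified assumptions, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (ticker TEXT PRIMARY KEY, industry TEXT);
    CREATE TABLE risk_assessments (ticker TEXT, risk_score INTEGER);
    -- Simplified stand-in for v_risk_by_industry
    CREATE VIEW v_risk_by_industry AS
        SELECT c.industry,
               COUNT(*)          AS company_count,
               AVG(r.risk_score) AS avg_risk_score,
               MAX(r.risk_score) AS max_risk_score
        FROM companies c
        JOIN risk_assessments r ON r.ticker = c.ticker
        GROUP BY c.industry;
""")
conn.executemany("INSERT INTO companies VALUES (?, ?)",
                 [("AAPL", "Information Technology"),
                  ("MSFT", "Information Technology"),
                  ("JPM", "Financials")])
conn.executemany("INSERT INTO risk_assessments VALUES (?, ?)",
                 [("AAPL", 20), ("MSFT", 40), ("JPM", 55)])

rows = conn.execute(
    "SELECT industry, company_count, avg_risk_score FROM v_risk_by_industry "
    "ORDER BY industry").fetchall()
```

Querying the view rather than the base tables keeps dashboards and APIs decoupled from the normalized 15-table layout.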
Quality Assurance
Multi-Layer Validation
1. Schema Validation
- JSON Schema Draft 7 specification
- Required fields enforcement
- Type checking (strings, integers, booleans, dates)
- Pattern validation (ticker symbols, CIK numbers, URLs, dates)
- Enum validation for controlled vocabularies
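A minimal stdlib sketch of this validation layer follows; the real pipeline uses a full JSON Schema Draft 7 validator. The regex patterns are reasonable assumptions (e.g. some S&P 500 tickers include a dot, as in BRK.B).

```python
import re

TICKER_RE = re.compile(r"^[A-Z]{1,5}(\.[A-Z])?$")  # assumed ticker pattern

def validate_research(doc: dict) -> list[str]:
    """Return a list of validation errors (empty list means valid)."""
    errors = []
    # Required top-level fields
    for field in ("schema_version", "company_identifier", "research_snapshot",
                  "data_sources", "findings"):
        if field not in doc:
            errors.append(f"missing required field: {field}")
    # Pattern check on the ticker symbol
    ticker = doc.get("company_identifier", {}).get("ticker_symbol", "")
    if not TICKER_RE.match(ticker):
        errors.append(f"invalid ticker: {ticker!r}")
    # Minimum-source requirement
    if len(doc.get("data_sources", [])) < 5:
        errors.append("fewer than 5 sources")
    return errors
```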
2. Source Reliability Scoring (5-Point Scale)
- 5: Primary corporate sources (SEC filings, corporate websites, proxy statements)
- 4: Major news outlets (WSJ, NYT, Reuters, Bloomberg)
- 3: Trade press and industry publications
- 2: Think tanks, research organizations, advocacy groups
- 1: Blogs, social media, unverified sources
Requirement: Minimum 5 sources per company, preference for scores 3+
3. Data Quality Flags
- incomplete_data: Source count < 5
- conflicting_sources: AI detects contradictory information
- outdated_information: No recent data (2024-2025)
- verification_needed: Array of specific claims requiring fact-check
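The first three flags can be derived mechanically, as in the sketch below; conflicting_sources is AI-detected and passed through, and verification_needed is a claim list populated upstream, so both are omitted or taken as inputs here. Field names on the source records are assumptions.

```python
from datetime import date

def quality_flags(doc: dict, conflicting: bool = False,
                  today: date = date(2025, 11, 2)) -> dict:
    """Compute the mechanical data-quality flags for one research document."""
    sources = doc.get("data_sources", [])
    # "Recent" means the current or previous calendar year (2024-2025 here).
    recent_years = {str(today.year), str(today.year - 1)}
    has_recent = any(
        str(s.get("published", ""))[:4] in recent_years for s in sources
    )
    return {
        "incomplete_data": len(sources) < 5,
        "conflicting_sources": conflicting,   # supplied by the AI pass
        "outdated_information": not has_recent,
    }
```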
4. Provenance Tracking
Every finding requires:
- Source ID references (provenance_ids array)
- Verbatim quotes where applicable
- Date of information capture
- Source reliability score
Automation Pipeline
End-to-End Workflow
MEASURE INDEX PIPELINE
1. INPUT: sp500.csv
└── 503 companies (as of 2025)
2. BATCH REQUEST GENERATION
├── Tool: batch_api_launcher.py
└── Output: batch_api_requests.jsonl
3. BATCH SUBMISSION
├── Tool: submit_batch.py
├── API: Anthropic Batch API
└── Model: claude-sonnet-4-5-20250929
4. RESULTS DOWNLOAD
├── Tool: download_batch_results.py
└── Output: research_data/{TICKER}_research.json
5. PROFILE GENERATION
├── Tool: create_company_profile.py
└── Output: company_profiles/{TICKER}_profile.json
6. DATABASE IMPORT
├── Tool: database/import_profiles.py
├── Target: PostgreSQL (Supabase)
└── Schema: 15 tables, 3 views
7. QUERY & ANALYSIS
└── Applications: Dashboards, APIs, research tools
Limitations & Considerations
Methodological Limitations
1. Point-in-Time Analysis
- Research captures a snapshot as of the collection date
- Corporate policies and leadership can change rapidly
- Recommendation: Periodic re-research (quarterly or annually)
2. Source Availability Bias
- Larger companies have more public information
- Smaller S&P 500 companies may have less DEI disclosure
- Private companies or recent IPOs may lack historical data
3. AI Analysis Subjectivity
- AI-generated insights reflect model interpretation
- Ratings (commitment strength, transparency) are qualitative assessments
- Recommendation: Human review for high-stakes decisions
4. Risk Score Simplification
- Algorithm weights may not capture the nuance of specific cases
- An ongoing lawsuit may be frivolous or meritorious (the score does not differentiate)
- Settled cases may involve admissions of liability or no-fault settlements
Ethical Considerations
Fairness & Representation
- Analysis focuses on publicly available information (transparency bias)
- Companies with less disclosure not necessarily less committed
- Risk scores reflect legal exposure, not inherent company values
Use Case Appropriateness
- Intended for research, analysis, and advocacy
- Not suitable as sole basis for investment decisions
- Human judgment required for contextual interpretation
Document Prepared By: MEASURE Index Research Team
Last Updated: November 2, 2025
Version 1.0