Hawaii Chyron Dataset

Evaluating AI tools on Hawaiian language text extraction from broadcast video

Role: Lead Researcher
Status: IASA Journal (in review)
Affiliation: Brandeis University

Overview

This project investigates the ability of current AI tools — including OCR systems and vision-language models — to accurately extract Hawaiian language text from chyrons (on-screen text overlays) in archival broadcast video. Hawaiian presents unique challenges for text recognition due to its use of diacritical marks (the ʻokina and kahakō) that are critical for meaning but often missed or misinterpreted by standard OCR and VLM systems.
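At the Unicode level, the two marks are easy to pin down: the ʻokina is U+02BB (MODIFIER LETTER TURNED COMMA), a full consonant letter rather than an apostrophe, and the kahakō is a macron that under NFD decomposition separates into a base vowel plus U+0304 (COMBINING MACRON). A minimal Python sketch (the example words are illustrative, not drawn from the dataset):

```python
import unicodedata

# The ʻokina is its own letter, U+02BB, not ASCII ' or U+2019.
okina = "Hawaiʻi"[5]
print(unicodedata.name(okina))  # MODIFIER LETTER TURNED COMMA

# A kahakō vowel such as ō decomposes under NFD into o + U+0304,
# which is why normalization form matters when comparing OCR output.
print(unicodedata.normalize("NFD", "ō") == "o\u0304")  # True
```

OCR systems that fold output to ASCII, or that confuse the ʻokina with a straight apostrophe, silently destroy exactly these code points.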

Motivation

Broadcast archives from Hawaii contain rich cultural and linguistic content, but automated processing tools are typically trained on English text and struggle with the specific orthographic features of Hawaiian. This work highlights the gap between AI tool performance on majority languages and low-resource or indigenous languages, with implications for archival accessibility and cultural preservation.

Dataset & Evaluation

The dataset consists of annotated chyron images from Hawaiian broadcast video, with ground-truth transcriptions including correct diacritical marks. We evaluate multiple OCR engines and VLMs on their ability to faithfully reproduce Hawaiian text, measuring both character-level accuracy and diacritical mark preservation.

Technical Stack

OCR · Vision-Language Models · Low-Resource Languages · Python · Dataset Design · Cultural Heritage

Publications & Presentations

Understanding On-Screen Text: Do AI Tools Struggle with Hawaiian Chyrons?

IASA Journal — In review

2025

Hawaii Chyron Dataset & Archival AI

IASA Conference, Honolulu — Presentation

2025

Hawaii Chyron Dataset & Archival AI

Fantastic Futures Conference — Presentation