Hawaii Chyron Dataset
Evaluating AI tools on Hawaiian language text extraction from broadcast video
Overview
This project investigates the ability of current AI tools — including OCR systems and vision-language models (VLMs) — to accurately extract Hawaiian language text from chyrons (on-screen text overlays) in archival broadcast video. Hawaiian presents unique challenges for text recognition because of its diacritical marks: the ʻokina, a glottal-stop consonant written with a distinct letter, and the kahakō, a macron marking vowel length. Both are critical for meaning but are often missed or misinterpreted by standard OCR and VLM systems.
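A minimal sketch of why these marks are tricky at the Unicode level (the specific code points are standard Unicode facts, not part of this dataset's spec): the ʻokina is its own letter (U+02BB, not an apostrophe), and kahakō vowels can be encoded precomposed or as a base vowel plus a combining macron, so text must be normalized before OCR output is compared to ground truth.

```python
import unicodedata

# The ʻokina is U+02BB MODIFIER LETTER TURNED COMMA — a consonant letter,
# not punctuation, and not the ASCII apostrophe that OCR often emits instead.
okina = "\u02bb"
print(unicodedata.name(okina))  # MODIFIER LETTER TURNED COMMA

# Kahakō vowels can be precomposed (ō, U+014D) or decomposed (o + U+0304).
# The two encodings render identically but compare unequal as raw strings.
precomposed = "kahak\u014d"      # "kahakō" using the precomposed ō
decomposed = "kahako\u0304"      # "kahakō" using o + combining macron
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed

# A common OCR failure mode: substituting an apostrophe for the ʻokina.
assert "'okina" != "\u02bbokina"  # U+0027 vs. U+02BB
```

Normalizing both hypothesis and reference to NFC before scoring keeps encoding differences from being counted as recognition errors.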
Motivation
Broadcast archives from Hawaii contain rich cultural and linguistic content, but automated processing tools are typically trained on English text and struggle with the specific orthographic features of Hawaiian. This work highlights the performance gap between AI tools on majority languages and on low-resource or indigenous languages, with implications for archival accessibility and cultural preservation.
Dataset & Evaluation
The dataset consists of annotated chyron images from Hawaiian broadcast video, with ground-truth transcriptions including correct diacritical marks. We evaluate multiple OCR engines and VLMs on their ability to faithfully reproduce Hawaiian text, measuring both character-level accuracy and diacritical mark preservation.
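The two metrics named above can be sketched as follows. This is an illustrative implementation, not the project's actual evaluation code: character-level accuracy via a character error rate (edit distance over reference length), and diacritic preservation as the fraction of ʻokina and kahakō characters in the ground truth that survive in the system output. The function names and the multiset-counting approach are assumptions for the sketch.

```python
import unicodedata
from collections import Counter

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate, computed on NFC-normalized text so precomposed
    and decomposed kahakō vowels compare equal."""
    hyp = unicodedata.normalize("NFC", hypothesis)
    ref = unicodedata.normalize("NFC", reference)
    return levenshtein(hyp, ref) / max(len(ref), 1)

# ʻokina plus the five kahakō vowels in both cases (NFC precomposed forms).
DIACRITIC_CHARS = set("\u02bb" "āēīōū" "ĀĒĪŌŪ")

def diacritic_recall(hypothesis: str, reference: str) -> float:
    """Fraction of diacritic-bearing characters in the reference that the
    hypothesis reproduces, counted as a multiset intersection."""
    hyp = Counter(c for c in unicodedata.normalize("NFC", hypothesis)
                  if c in DIACRITIC_CHARS)
    ref = Counter(c for c in unicodedata.normalize("NFC", reference)
                  if c in DIACRITIC_CHARS)
    total = sum(ref.values())
    if total == 0:
        return 1.0  # nothing to preserve
    return sum(min(n, hyp[c]) for c, n in ref.items()) / total

# Example: an OCR system that drops the ʻokina from "Hawaiʻi".
print(cer("Hawaii", "Hawai\u02bbi"))              # one edit over seven chars
print(diacritic_recall("Hawaii", "Hawai\u02bbi")) # the one diacritic is lost
```

Scoring diacritic preservation separately from overall CER matters here because a system can post near-perfect character accuracy on Hawaiian text while losing every ʻokina and kahakō, which is exactly the failure mode this dataset is designed to expose.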
Technical Stack
Publications & Presentations
Understanding On-Screen Text: Do AI Tools Struggle with Hawaiian Chyrons?
IASA Journal — In review
Hawaii Chyron Dataset & Archival AI
IASA Conference, Honolulu — Presentation