TextExtractor

Naive text extraction (ISO 32000-1 §9.4).

Walks content-stream operations for text-showing operators:

Tj : show a string
TJ : show an array of strings and spacing adjustments
' : move to the next line and show a string
" : set word and character spacing, move to the next line, and show a string

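The operator dispatch listed above can be sketched roughly as follows. This is only an illustration: Op, Str, Num, Arr and shownStrings are hypothetical stand-ins, not kitepdf's actual Operation or operand types, whose shape is not documented on this page.

```kotlin
// Hypothetical stand-ins for content-stream operands and operations.
// kitepdf's real Operation type may look quite different.
sealed interface Operand
class Str(val bytes: ByteArray) : Operand
class Num(val value: Double) : Operand
class Arr(val items: List<Operand>) : Operand

class Op(val operator: String, val operands: List<Operand>)

// Returns the raw byte strings shown by one operation, in order.
// Non-text-showing operators yield an empty list.
fun shownStrings(op: Op): List<ByteArray> = when (op.operator) {
    // Tj and ' take one string; " takes two numbers and then a string.
    "Tj", "'", "\"" -> op.operands.filterIsInstance<Str>().map { it.bytes }
    // TJ takes a single array mixing strings and numeric spacing adjustments;
    // this sketch keeps the strings and ignores the numbers.
    "TJ" -> (op.operands.firstOrNull() as? Arr)?.items.orEmpty()
        .filterIsInstance<Str>().map { it.bytes }
    else -> emptyList()
}
```

Dropping the numeric entries of a TJ array is a simplification: large negative adjustments often stand in for word gaps, so a more careful extractor might emit a space for them.
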
Output decoding: when no font or encoding is known, the shown bytes are treated as PDFDocEncoding (close to Latin-1). Strings that start with a UTF-16BE BOM are detected via PdfString.asText(). Spec-compliant text extraction requires resolving each font's /Encoding and /ToUnicode CMap; that is a session-2 deliverable. The current output is a useful first approximation for documents that use standard fonts with WinAnsi or PDFDocEncoding.
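
A minimal sketch of this fallback decoding, assuming a JVM target. decodeShownBytes is a hypothetical helper, not part of kitepdf; the real logic lives in PdfString.asText(), whose implementation is not shown here. ISO-8859-1 is used as a stand-in for PDFDocEncoding, which matches it for printable ASCII but differs at some other code points.

```kotlin
// Hypothetical helper illustrating the fallback; not kitepdf's actual code.
fun decodeShownBytes(bytes: ByteArray): String {
    // A leading 0xFE 0xFF byte-order mark signals UTF-16BE text.
    val hasBom = bytes.size >= 2 &&
        bytes[0] == 0xFE.toByte() &&
        bytes[1] == 0xFF.toByte()
    return if (hasBom) {
        // Skip the two BOM bytes and decode the rest as UTF-16BE.
        String(bytes, 2, bytes.size - 2, Charsets.UTF_16BE)
    } else {
        // Approximation: Latin-1 instead of a full PDFDocEncoding table.
        String(bytes, Charsets.ISO_8859_1)
    }
}
```

A faithful implementation would replace the Latin-1 branch with a PDFDocEncoding lookup table and, once fonts are resolved, with the font's own /Encoding or /ToUnicode mapping.
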

Line breaks are inserted heuristically at the BT/ET, Td, TD, T*, and Tm operators, because PDF text positioning is geometric, not line-based.
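
One way such a heuristic could look is sketched below. The exact rules TextExtractor applies are not documented on this page, so the choices here are assumptions: BT, ET, T* and Tm always count as a line boundary, while Td and TD count only when their vertical offset is non-zero. Event, Show, Position, startsNewLine and joinWithLineBreaks are hypothetical names used only for this example.

```kotlin
// Hypothetical event model: either decoded text or a positioning operator.
sealed interface Event
data class Show(val text: String) : Event
data class Position(val operator: String, val numbers: List<Double> = emptyList()) : Event

// Assumed rule set; kitepdf's real heuristics may differ.
fun startsNewLine(op: Position): Boolean = when (op.operator) {
    "BT", "ET", "T*", "Tm" -> true
    // Td/TD operands are (tx, ty); only a vertical move suggests a new line.
    "Td", "TD" -> (op.numbers.getOrNull(1) ?: 0.0) != 0.0
    else -> false
}

fun joinWithLineBreaks(events: List<Event>): String {
    val sb = StringBuilder()
    for (e in events) {
        when (e) {
            is Show -> sb.append(e.text)
            is Position -> {
                // Avoid leading breaks and runs of consecutive breaks.
                if (startsNewLine(e) && sb.isNotEmpty() && !sb.endsWith('\n')) {
                    sb.append('\n')
                }
            }
        }
    }
    return sb.toString().trimEnd('\n')
}
```

For example, the sequence BT, Show("Hello"), Td 0 -14, Show("World"), ET yields two lines, whereas a purely horizontal Td 50 0 between the two strings leaves them on the same line.
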

Functions

extract

fun extract(page: PdfPage): String

fun extract(ops: List<Operation>): String