TextExtractor

Naive text extraction (ISO 32000-1 §9.4).

Walks content-stream operations for text-showing operators:

  • Tj : show a string

  • TJ : show an array of strings and numeric position adjustments

  • ' : move to the next line and show a string

  • " : set word and character spacing, move to the next line, and show a string
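As a rough illustration of the dispatch over these four operators, here is a minimal sketch. The Operand, Str, Num, Arr, and Operation types and the showText function are hypothetical stand-ins, not this library's actual classes, and strings are assumed to be already decoded.

```kotlin
// Hypothetical operand model; the real classes in this library may differ.
sealed interface Operand
data class Str(val text: String) : Operand
data class Num(val value: Double) : Operand
data class Arr(val items: List<Operand>) : Operand

data class Operation(val operator: String, val operands: List<Operand>)

// Collects the text shown by Tj, TJ, ' and " and ignores every other operator.
fun showText(ops: List<Operation>): String {
    val out = StringBuilder()
    for (op in ops) {
        when (op.operator) {
            // Tj: a single string operand
            "Tj" -> (op.operands.firstOrNull() as? Str)?.let { out.append(it.text) }
            // TJ: one array operand; numeric adjustments are skipped, strings are kept
            "TJ" -> (op.operands.firstOrNull() as? Arr)?.items
                ?.filterIsInstance<Str>()
                ?.forEach { out.append(it.text) }
            // ': move to the next line, then show the string
            "'" -> {
                out.append('\n')
                (op.operands.firstOrNull() as? Str)?.let { out.append(it.text) }
            }
            // ": operands are word spacing, character spacing, string; the string comes last
            "\"" -> {
                out.append('\n')
                (op.operands.lastOrNull() as? Str)?.let { out.append(it.text) }
            }
        }
    }
    return out.toString()
}
```

Dropping the numeric entries of a TJ array is the simplest choice; a less naive extractor might treat large negative adjustments as word gaps.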

Output decoding: when no font or encoding is known, the shown bytes are treated as PDFDocEncoding (close to Latin-1). Strings that start with a UTF-16BE BOM are detected via PdfString.asText(). Spec-compliant text extraction requires resolving each font's /Encoding and /ToUnicode CMap; that is a session-2 deliverable. The current output is a useful first approximation for documents that use standard fonts with WinAnsi/PDFDocEncoding.
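The fallback described above might look roughly like the sketch below. decodeShownBytes is a hypothetical helper, not part of this API (the source only states that PdfString.asText() performs the BOM check), and Latin-1 is used as a stand-in for PDFDocEncoding, which differs from it for some code points.

```kotlin
// Hypothetical helper illustrating the fallback; not the library's actual implementation.
fun decodeShownBytes(bytes: ByteArray): String {
    // A UTF-16BE string starts with the byte-order mark FE FF
    val hasUtf16BeBom = bytes.size >= 2 &&
        bytes[0] == 0xFE.toByte() && bytes[1] == 0xFF.toByte()
    return if (hasUtf16BeBom) {
        // Skip the two BOM bytes and decode the remainder as UTF-16BE
        bytes.copyOfRange(2, bytes.size).toString(Charsets.UTF_16BE)
    } else {
        // Assumption: Latin-1 approximates PDFDocEncoding when no font encoding is known
        bytes.toString(Charsets.ISO_8859_1)
    }
}
```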

Line breaks are inserted heuristically at BT/ET, Td, TD, T*, and Tm, because PDF text positioning is geometric, not line-based.
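One possible shape for such a heuristic is sketched below. The startsNewLine function and its rule that Td/TD break a line only when the vertical offset is non-zero are assumptions made for illustration; the extractor's actual rules are not documented here.

```kotlin
// Hypothetical rule set; the extractor's real heuristics may differ.
fun startsNewLine(operator: String, operands: List<Double> = emptyList()): Boolean =
    when (operator) {
        // Text-object boundaries, explicit next-line, and matrix resets always break
        "BT", "ET", "T*", "Tm" -> true
        // Td and TD take (tx, ty); a zero ty is assumed to stay on the same line.
        // If ty is missing, fall back to breaking the line.
        "Td", "TD" -> operands.getOrNull(1)?.let { it != 0.0 } ?: true
        else -> false
    }
```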

Functions
