TextExtractor
Naive text extraction (ISO 32000-1 §9.4).
Walks content-stream operations for text-showing operators:
Tj: show a string
TJ: show an array of strings/spacing adjustments
': move to next line and show
": set spacing, move to next line, and show
Output decoding: when no font or encoding is known, the shown bytes are treated as PDFDocEncoding (close to Latin-1). Strings that start with a UTF-16BE BOM are detected via PdfString.asText(). Spec-compliant text extraction requires resolving each font's /Encoding and /ToUnicode CMap; that is a session-2 deliverable. The current output is a useful first approximation for documents using standard fonts with WinAnsi/PDFDocEncoding.
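The fallback decoding described above can be sketched roughly as follows. This is an illustration only: decodeShownBytes is a hypothetical helper, not the actual PdfString.asText() implementation, and it maps bytes straight to Latin-1 code points, which only approximates PDFDocEncoding (the two differ for some byte values, for example in the 0x80-0x9F range).

```typescript
// Hypothetical helper (not the library's PdfString.asText()): decodes the raw
// bytes of a shown string when no font encoding is known.
function decodeShownBytes(bytes: Uint8Array): string {
  let out = "";

  // UTF-16BE text strings start with the byte order mark 0xFE 0xFF.
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    // Combine each big-endian byte pair into one UTF-16 code unit.
    for (let i = 2; i + 1 < bytes.length; i += 2) {
      out += String.fromCharCode((bytes[i] << 8) | bytes[i + 1]);
    }
    return out;
  }

  // Fallback: treat each byte as the code point of the same value (Latin-1).
  // Assumption: this is a stand-in for a real PDFDocEncoding table.
  for (let i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}
```

A full PDFDocEncoding table would replace the Latin-1 fallback loop without changing the BOM check.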
Line breaks are inserted heuristically at BT/ET, Td, TD, T*, and Tm, because PDF text positioning is geometric, not line-based.
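A minimal sketch of the operator walk with this line-break heuristic might look like the following. The Operation shape and the extractText function are assumptions made for illustration, not the real TextExtractor API, and string operands are assumed to be already decoded to text.

```typescript
// Hypothetical operation shape; the real content-stream model may differ.
type Operand = string | number | Array<string | number>;
interface Operation {
  operator: string;
  operands: Operand[];
}

// Operators that trigger a line break under the heuristic described above.
const LINE_BREAK_OPS = new Set(["BT", "ET", "Td", "TD", "T*", "Tm"]);

function extractText(ops: Operation[]): string {
  const parts: string[] = [];

  // Emit a newline, but never at the very start and never twice in a row.
  const newline = (): void => {
    if (parts.length > 0 && parts[parts.length - 1] !== "\n") parts.push("\n");
  };
  // Append an operand only if it is a string.
  const show = (operand: Operand | undefined): void => {
    if (typeof operand === "string") parts.push(operand);
  };

  for (const { operator, operands } of ops) {
    if (LINE_BREAK_OPS.has(operator)) {
      newline();
      continue;
    }
    switch (operator) {
      case "Tj": // show a string
        show(operands[0]);
        break;
      case "TJ": { // array of strings and numeric spacing adjustments
        const items = operands[0];
        if (Array.isArray(items)) {
          // Keep the strings, ignore the spacing numbers.
          for (const item of items) show(item);
        }
        break;
      }
      case "'": // move to next line and show
        newline();
        show(operands[0]);
        break;
      case '"': // operands are word spacing, char spacing, string
        newline();
        show(operands[2]);
        break;
    }
  }
  return parts.join("");
}
```

Ignoring the numeric entries in TJ arrays keeps the sketch simple; a more careful extractor could treat large negative adjustments as word gaps.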