PDF text extraction
Syntax
$input.pdf2txt()
Description
Extracts the text from a given PDF document. For a scanned PDF, the text is extracted using OCR software, so the result may be inaccurate or incorrect.
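The fallback from plain extraction to OCR can be pictured with a minimal Python sketch. This is not the function's actual implementation: the helper name `extract_with_ocr_fallback`, the two callables passed to it, and the rule "no extractable text means the PDF is scanned" are all assumptions made for illustration.

```python
from typing import Callable


def extract_with_ocr_fallback(
    pdf_bytes: bytes,
    extract_text: Callable[[bytes], str],
    run_ocr: Callable[[bytes], str],
) -> str:
    """Return the embedded text if there is any, otherwise fall back to OCR.

    `extract_text` and `run_ocr` are hypothetical callables standing in for
    the plain text extraction step and the OCR step.
    """
    text = extract_text(pdf_bytes)
    # Assumption: a PDF that yields only whitespace is treated as scanned.
    if text.strip():
        return text
    # OCR output may be inaccurate, so callers should treat it with care.
    return run_ocr(pdf_bytes)
```

Passing the two steps in as callables keeps the sketch independent of any particular tool and makes the decision rule easy to swap for a different heuristic.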
The function uses external tools that may need to be installed on the application server: `pdf2txt` for simple text extraction, `pdftoppm` for image generation in preparation for OCR, and `tesseract` for the OCR process of scanned documents.
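Whether these tools are available on a server can be checked with a short script. The following Python sketch assumes that the executables are on the `PATH` under exactly the names listed above, which may not hold for every installation; the helper name `missing_tools` is hypothetical.

```python
import shutil

# Tool names as listed in this section; the actual executable names on a
# given server may differ (assumption).
REQUIRED_TOOLS = ("pdf2txt", "pdftoppm", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling `missing_tools()` on the application server returns an empty list when all three tools are found, and otherwise the names that still need to be installed.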
Parameters
Name | Type | Description | Mandatory | Default |
---|---|---|---|---|
input | Binary data in PDF format | The document to be converted. | yes | - |
Return value
Type: String
The text extracted from the PDF.