Enhancement
Two new functions: pdf2txt extracts text from PDF files, and parseJson parses JSON structures from text.
The necessary external tools can be installed on a Debian system with:
apt install poppler-utils
apt install python3-pdfminer
apt install tesseract-ocr
Test
- /com.top_logic.demo/src/test/java/test/com/top_logic/demo/scripted/tlscript/TestParseJson.script.xml
The pdf2txt function requires external tools that must be installed on the application server: pdf2txt, pdftoppm, and tesseract.
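Before evaluating pdf2txt expressions, it may help to check that the three tools are reachable on the server's PATH. The following Python sketch is not part of the product. The helper name missing_tools is hypothetical, and the sketch assumes the executables are named pdf2txt, pdftoppm (the page renderer shipped with poppler-utils), and tesseract.

```python
import shutil

# Executable names assumed from the tool list above; adjust them if your
# distribution installs the tools under different names.
REQUIRED_TOOLS = ("pdf2txt", "pdftoppm", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools are installed.")
```

An empty result means all three executables were found; it does not verify their versions or that they run correctly.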
If these tools are installed, the following expression, for example, can be evaluated:
"JVBERi0xLjUKJeLjz9MKNCAwIG9iago8PC9GaWx0ZXIvRmxhdGVEZWNvZGUvTGVuZ3RoIDE4Mj4+c3RyZWFtCnicdZAxC8IwEIX39ytu1OVMQpKmjoIOgmIh4CAOQq2lWqVF8O+bVCu0IDe9y73v8q7BwkOQ05JTRT7H0iNDA8Ha0QuK1uG1ghS0weEoKIdWrDQlacrWUQ1jFKe217deK8dRfYYH4jdZYo87JMVqL2PuwDmEjleWKMInYwVMiDNbSTKcWPJFwMe+JK3ZRrpkE2LWmGxP9Xk+9VUM3HnUP0/i2NjOs2sf1fn6JPn1xUNleAMdukKVCmVuZHN0cmVhbQplbmRvYmoKMSAwIG9iago8PC9Db250ZW50cyA0IDAgUi9UeXBlL1BhZ2UvUmVzb3VyY2VzPDwvRm9udDw8L0YxIDIgMCBSL0YyIDMgMCBSPj4+Pi9QYXJlbnQgNSAwIFIvTWVkaWFCb3hbMCAwIDU5NS4yIDg0MS45Ml0+PgplbmRvYmoKMiAwIG9iago8PC9TdWJ0eXBlL1R5cGUxL1R5cGUvRm9udC9CYXNlRm9udC9IZWx2ZXRpY2EtQm9sZC9FbmNvZGluZy9XaW5BbnNpRW5jb2Rpbmc+PgplbmRvYmoKMyAwIG9iago8PC9TdWJ0eXBlL1R5cGUxL1R5cGUvRm9udC9CYXNlRm9udC9IZWx2ZXRpY2EvRW5jb2RpbmcvV2luQW5zaUVuY29kaW5nPj4KZW5kb2JqCjUgMCBvYmoKPDwvS2lkc1sxIDAgUl0vVHlwZS9QYWdlcy9Db3VudCAxPj4KZW5kb2JqCjYgMCBvYmoKPDwvVHlwZS9DYXRhbG9nL1BhZ2VzIDUgMCBSPj4KZW5kb2JqCjcgMCBvYmoKPDwvTW9kRGF0ZShEOjIwMjMxMTIyMTYwOTMwKzAxJzAwJykvQ3JlYXRpb25EYXRlKEQ6MjAyMzExMjIxNjA5MzArMDEnMDAnKS9Qcm9kdWNlcihPcGVuUERGIDEbase64Decode().pdf2txt()
The expected result is "Name: Project 1".
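The string literal in the expression above appears to be a Base64-encoded PDF file, which is presumably why base64Decode() is applied before pdf2txt(). As a quick illustration outside TL-Script, this Python sketch decodes only the first 12 characters of that literal and shows that they form a PDF header. The variable names are illustrative.

```python
import base64

# First 12 characters of the Base64 literal from the expression above.
prefix = "JVBERi0xLjUK"

# 12 Base64 characters decode to exactly 9 bytes.
header = base64.b64decode(prefix)

print(header)  # b'%PDF-1.5\n'
```

The same check on any other literal is a cheap way to confirm that the input really is a PDF before looking for problems in the external tools.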
A simple OCR test can be performed with this expression:
"JVBERi0xLjYKJcOkw7zDtsOfCjIgMCBvYmoKPDwvTGVuZ3RoIDMgMCBSL0ZpbHRlci9GbGF0ZURlY29kZT4+CnN0cmVhbQp4nD2MIRKAMAwEfV9xGhGa0NDEY5AoHsAggSmG70OnM9jb243EeEJBRKQoBnUlyQpLTDYy7j2sHc6woEBESDGqkZmDWShDjIb/VNqotdZIVVJzfHBsR+jnI2G6sHzJF1HQGe8KZW5kc3RyZWFtCmVuZG9iagoKMyAwIG9iago5OQplbmRvYmoKCjQgMCBvYmoKPDwvVHlwZS9YT2JqZWN0L1N1YnR5cGUvSW1hZ2UvV2lkdGggMTk5IC9IZWlnaHQgNTAgL0JpdHNQZXJDb21wb25lbnQgOCAvQ29sb3JTcGFjZS9EZXZpY2VSR0IvRmlsdGVyL0RDVERlY29kZS9MZW5ndGggMzcxNz4+CnN0cmVhbQr/2P/gABBKRklGAAEBAQBgAGAAAP/bAEMAAwICAwICAwMDAwQDAwQFCAUFBAQFCgcHBggMCgwMCwoLCw0OEhANDhEOCwsQFhARExQVFRUMDxcYFhQYEhQVFP/bAEMBAwQEBQQFCQUFCRQNCw0UFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFP/CABEIADIAxwMBIgACEQEDEQH/xAAbAAEAAwEBAQEAAAAAAAAAAAAABQYHBAMCCP/EABoBAAMBAQEBAAAAAAAAAAAAAAADBAIBBQb/2gAMAwEAAhADEAAAAf1SU7FNxKYFzG5jM5Gf2L2jJN/lB3AAAAAzBV2nlK0q6qjJZZOBsTG9kx+P6OwU7W8vR6U1Cwd0XfxSsRL7knaLK2NkkHoWQbI2DN4XliIfqb1zeHOyP4tMdHdz38GmZlufvqMxYk+h8yXpK1/PzQs+dVizsPVe0ArFf0cuyD5LO2iuVfS2HwU6NhqUvLMvr/LanO5/Yp5xnhVLk0ivUzVGK4H2mG4wZMAAAAAbase64Decode().pdf2txt()
The expected result is "Hello world!".
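How pdf2txt combines the external tools internally is not described here. One plausible reading of the tool list is that embedded text is extracted with pdf2txt, while pages without a text layer are rendered with pdftoppm and passed to tesseract. The Python sketch below illustrates that assumed workflow on the command line. The function names, the 300 dpi resolution, and the fallback order are assumptions, not the actual implementation.

```python
import glob
import os
import subprocess
import tempfile


def text_command(pdf_path):
    """Command that prints the embedded text of a PDF (pdfminer's pdf2txt)."""
    return ["pdf2txt", pdf_path]


def render_command(pdf_path, image_prefix, dpi=300):
    """Command that renders each page to <image_prefix>-<n>.png (poppler)."""
    return ["pdftoppm", "-png", "-r", str(dpi), pdf_path, image_prefix]


def ocr_command(image_path):
    """Command that prints the text recognized in one image (Tesseract)."""
    return ["tesseract", image_path, "stdout"]


def run(command):
    """Run a command and return its standard output as text."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


def extract_text(pdf_path):
    """Try the text layer first; fall back to OCR if it is empty (assumed order)."""
    text = run(text_command(pdf_path)).strip()
    if text:
        return text
    with tempfile.TemporaryDirectory() as workdir:
        run(render_command(pdf_path, os.path.join(workdir, "page")))
        pages = sorted(glob.glob(os.path.join(workdir, "page-*.png")))
        return "\n".join(run(ocr_command(page)).strip() for page in pages)
```

Running the three commands by hand on a failing PDF can help narrow down which tool is missing or misconfigured on the application server.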