Improvement
Two new functions, pdf2txt and parseJson, for extracting text from PDF files and for parsing JSON structures from text.
The required external tools can be installed on a Debian system with:
apt install poppler-utils
apt install python3-pdfminer
apt install tesseract-ocr
Test
- /com.top_logic.demo/src/test/java/test/com/top_logic/demo/scripted/tlscript/TestParseJson.script.xml
For pdf2txt: the following external tools must be installed on the application server: pdf2txt, pdftoppm, and tesseract.
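Before evaluating the expressions, it can help to confirm that the tools are actually reachable on the server. The following is a minimal sketch, not part of the change itself: the helper name missing_tools is hypothetical, and the binary names are assumed to be the ones shipped by the Debian packages poppler-utils, python3-pdfminer and tesseract-ocr. It also assumes the application server process uses the same PATH as the shell running the check.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every tool from the
# argument list that cannot be found on the current PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Binary names assumed from the Debian packages named in this note.
missing=$(missing_tools pdf2txt pdftoppm tesseract)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are joined onto one line.
    echo "Missing tools:" $missing
else
    echo "All required tools found."
fi
```

If the script reports a missing tool, installing the corresponding package from the list of apt commands should resolve it.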
Once these tools are installed, an expression such as the following can be evaluated:
"JVBERi0xLjUKJeLjz9MKNCAwIG9iago8PC9GaWx0ZXIvRmxhdGVEZWNvZGUvTGVuZ3RoIDE4Mj4+c3RyZWFtCnicdZAxC8IwEIX39ytu1OVMQpKmjoIOgmIh4CAOQq2lWqVF8O+bVCu0IDe9y73v8q7BwkOQ05JTRT7H0iNDA8Ha0QuK1uG1ghS0weEoKIdWrDQlacrWUQ1jFKe217deK8dRfYYH4jdZYo87JMVqL2PuwDmEjleWKMInYwVMiDNbSTKcWPJFwMe+JK3ZRrpkE2LWmGxP9Xk+9VUM3HnUP0/i2NjOs2sf1fn6JPn1xUNleAMdukKVCmVuZHN0cmVhbQplbmRvYmoKMSAwIG9iago8PC9Db250ZW50cyA0IDAgUi9UeXBlL1BhZ2UvUmVzb3VyY2VzPDwvRm9udDw8L0YxIDIgMCBSL0YyIDMgMCBSPj4+Pi9QYXJlbnQgNSAwIFIvTWVkaWFCb3hbMCAwIDU5NS4yIDg0MS45Ml0+PgplbmRvYmoKMiAwIG9iago8PC9TdWJ0eXBlL1R5cGUxL1R5cGUvRm9udC9CYXNlRm9udC9IZWx2ZXRpY2EtQm9sZC9FbmNvZGluZy9XaW5BbnNpRW5jb2Rpbmc+PgplbmRvYmoKMyAwIG9iago8PC9TdWJ0eXBlL1R5cGUxL1R5cGUvRm9udC9CYXNlRm9udC9IZWx2ZXRpY2EvRW5jb2RpbmcvV2luQW5zaUVuY29kaW5nPj4KZW5kb2JqCjUgMCBvYmoKPDwvS2lkc1sxIDAgUl0vVHlwZS9QYWdlcy9Db3VudCAxPj4KZW5kb2JqCjYgMCBvYmoKPDwvVHlwZS9DYXRhbG9nL1BhZ2VzIDUgMCBSPj4KZW5kb2JqCjcgMCBvYmoKPDwvTW9kRGF0ZShEOjIwMjMxMTIyMTYwOTMwKzAxJzAwJykvQ3JlYXRpb25EYXRlKEQ6MjAyMzExMjIxNjA5MzArMDEnMDAnKS9Qcm9kdWNlcihPcGVuUERGIDEuMy4zMCk+PgplbmRvYmoKeHJlZgowIDgKMDAwMDAwMDAwMCA2NTUzNSBmIAowMDAwMDAwMjY0IDAwMDAwIG4gCjAwMDAwMDAzOTAgMDAwMDAgbiAKMDAwMDAwMDQ4MyAwMDAwMCBuIAowMDAwMDAwMDE1IDAwMDAwIG4gCjAwMDAwMDA1NzEgMDAwMDAgbiAKMDAwMDAwMDYyMiAwMDAwMCBuIAowMDAwMDAwNjY3IDAwMDAwIG4gCnRyYWlsZXIKPDwvSW5mbyA3IDAgUi9JRCBbPGI1ODI0Y2U0NDg2YTJkNjI2YjcxZWIyMmYxY2RiZTQ1PjxjN2JmYjVjMmMyZTQzNjlkN2NmMGU2MTc3YWMyYjhkND5dL1Jvb3QgNiAwIFIvU2l6ZSA4Pj4Kc3RhcnR4cmVmCjc4MwolJUVPRgo=".base64Decode().pdf2txt()
The expected result is "Name: Projekt 1".
A simple OCR test can be run with this expression:
"JVBERi0xLjYKJcOkw7zDtsOfCjIgMCBvYmoKPDwvTGVuZ3RoIDMgMCBSL0ZpbHRlci9GbGF0ZURlY29kZT4+CnN0cmVhbQp4nD2MIRKAMAwEfV9xGhGa0NDEY5AoHsAggSmG70OnM9jb243EeEJBRKQoBnUlyQpLTDYy7j2sHc6woEBESDGqkZmDWShDjIb/VNqotdZIVVJzfHBsR+jnI2G6sHzJF1HQGe8KZW5kc3RyZWFtCmVuZG9iagoKMyAwIG9iago5OQplbmRvYmoKCjQgMCBvYmoKPDwvVHlwZS9YT2JqZWN0L1N1YnR5cGUvSW1hZ2UvV2lkdGggMTk5IC9IZWlnaHQgNTAgL0JpdHNQZXJDb21wb25lbnQgOCAvQ29sb3JTcGFjZS9EZXZpY2VSR0IvRmlsdGVyL0RDVERlY29kZS9MZW5ndGggMzcxNz4+CnN0cmVhbQr/2P/gABBKRklGAAEBAQBgAGAAAP/bAEMAAwICAwICAwMDAwQDAwQFCAUFBAQFCgcHBggMCgwMCwoLCw0OEhANDhEOCwsQFhARExQVFRUMDxcYFhQYEhQVFP/bAEMBAwQEBQQFCQUFCRQNCw0UFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFP/CABEIADIAxwMBIgACEQEDEQH/xAAbAAEAAwEBAQEAAAAAAAAAAAAABQYHBAMCCP/EABoBAAMBAQEBAAAAAAAAAAAAAAADBAIBBQb/2gAMAwEAAhADEAAAAf1SU7FNxKYFzG5jM5Gf2L2jJN/lB3AAAAAzBV2nlK0q6qjJZZOBsTG9kx+P6OwU7W8vR6U1Cwd0XfxSsRL7knaLK2NkkHoWQbI2DN4XliIfqb1zeHOyP4tMdHdz38GmZlufvqMxYk+h8yXpK1/PzQs+dVizsPVe0ArFf0cuyD5LO2iuVfS2HwU6NhqUvLMvr/LanO5/Yp5xnhVLk0ivUzVGK4H2mG4wZMAAAAAAAAAAAAAAAAAf/8QAKBAAAgMBAAECAwkAAAAAAAAABAUCAwYBABARBxY2EhMVJjA1N0BQ/9oACAEBAAEFAv7Tt/JQZ5ToJW6P03Bt4Y+zKtDQLJdmt/SDNvnufNiZcChkzmDm0rHrZb6bq+AzH58S+JT6Ge5btTTW5hLXKXfEDv2hN59NwPgszoVL5/RnHZNp1sezrxrG8qC7Q3X7N6eRLQtWxzFwUQ3ys9U1tCRLFLeXQP5CfuTLWunVtgkzDnvicfz2zvpto8m1/CQ/F9MKN+QsGJ23cCs75v48qC3f007pkVicwZUUiBnxhvfDSuZnUWBSXIs/LjnSpZ8B2m+Jrgl1NUh8Wt/bge+3xD+8it3+77+Wq667swmpqHV+jVDW2K8rQVVvHOdEeRExVIxLhPQ6ErQUcTpklaYS/BhzuUJRktHj/OD6CJa2oxckTVIwnWcEd+L8WIGU6UQdhUVcooZZcZiwbohHdPyAPPwxPSYqUrK04P8Ap//EADQRAAEDAwAHBgILAAAAAAAAAAECAwQAERIFExQhMVFhECIyQXGhstEGFSAwM0JSY4GRwf/aAAgBAwEBPwGPHclOapoXPy30zHckZFseEXPp2NaNW7s9lfikgdLG1LTgop5fZjRVSQ4QfAnKozBlPoYTxUQP7pyI622Xbd0Kxv17Po/b6wTlwsr4TUFUEtSdmSoHVniQeXICr6PgpZbe8CkgkYXvfzyvf5VD4aNt+tXxCgdkiOSm0ArLhFyL2H8860w2hp9CkJxKkpURyJpMRhxa3inc+kBHRRH+KHvQZbYujAFbLV7W/MTvvzxBqOhEww5DyBkpdjusFAW8vao8lcky8gNzauAA8xyqM5sUqEwygYrwJNgSSTv39KkvjY3GP3CfbsZfcjr1jRsaafcZyDZtkLH0prScxlrUIcOPKkzJCGw0lfdBuPXnTek5jTinkOHJXHrTrq31lxw3Jra38UI
z3I3jpQmyA/tQWc+dL0hKdeS+td1J4dKZlPR1lxpViaj6UmxUatlwgUZLymiyVd297def3/8A/8QALREAAgEDAgIHCQAAAAAAAAAAAQIAAxESITETQQQQIzNRYcEgMDJDYnGBobH/2gAIAQIBAT8BZwgu0Zwtr8+pqwXL6YDcX9l3wt5x2wUseUFRScefV0vuj+P7Koq5JmRvO1qliu9/H0lT5329J3jhCdLTo5LKQeRnEYALf4d5kW1vox/UYmnxEU8o6BMLeMccRKjMdReIvaBvLqZQ4sYVDbw0abHIjWGmpNyIaNMgKRtAoUWE4a3JtvOGmOFtIKSBcQNIyKwsRGoU3N2EwW+Vtff/AP/EAD0QAAIBAwEFAwcHDQAAAAAAAAECAwAEERIFEyExQRAUMiIjUWFxcrEVNEJzgcHRMDNAUFJ0gpGSk6Gy8P/aAAgBAQAGPwL9KsoBCJO8NjOrGOyTZm5ACJr3mrtstzI0eubDaeop5IZDG+pRqHtq1ZjkmJST9n5O6t2lYwrDkJ06dk8sEhjkGnDD20L0jfOsWrBPOobopuy48Oc9uyZZDhFYkmvnR/tN+FTT27a4mhODjHor5K2awiZV1SzH6IqCa5u++2cj6X1cxWzj6Z/uqT30+NQ3L+GOBT/iu+997lG/5uNfRVxszaAHe4eIYfSFMAdJI4Gr2C6laWaCXGX54qa2aVjatlEXpkVsyxtpmjDHVLp9FHZezHEWgedmPSop7m57/Zs2l88xQurR8M5XDeo1b3Mm1mdGw5j9VXv1H4VFsnZpCTMNTyn6IqVrjaHeoDjWrc+dP+71Z+727HVhkFzwr5rF/TVwkahF3PIfZV1BfM0azDVGQ2ONcWlP8dbOHRZ/uqT30+NLu+OIUbh7KtDGw8lAh9RFXEsPlRxR6WYensupTwiuoCw96tnbUx5xLjev7pNXu0R5UUY3cZrakUx0tPxTPXjmtxkGWVgFWrWN/Em6B/lVr9Uvwq8z1g/CnafyVuIsIx/71Vce1f8AYUI5pBFG9vpLnpwq2ihkE0aIAHXr22s7yMhtzkBevY+09428ddGjpS79cOvhkXmKila8uJt2wZVduFbifOOYI5ivk6Rnmi9L86a3SR5Yyc+cpnhmmtVbmkZ4VurZMZ8THmeyETMyGM5BX4U9m3CJk08Old3iJYZyWbmaUzArKvhkTmKW4kkku5V8JmPKu7SO0a6g2VqOIcQihaivdbwzpjinWglwmSvhccxWmS9uZI/2CeFGwbIi06cjnSWsRLKvVv1p/8QAKBABAAIBAwMDAwUAAAAAAAAAAQARITFBURBhcYGR8DChsUBQwdHx/9oACAEBAAE/If1QGlm+D2z07h5U6Yqu/XE2tzY0nLO+aco+B4N2n03UHW4N9KPmGpLIygWbpX8wwN1TB69cRi6ujHQU5+xK3bGB/wAOmEotxIiFoD0YZCaJzjiEgWLa8LlcR7Fz/c94h9UHZrWZ8gBdl1+JeQC3SXZ7SqJudWXI+lxKjdFqghyWHwxKuAyibulF1PvvScCoFgcbFMKr1qOY7/tHub36j4MK30n+ZhES6OjWNZUBHCi/Rh9Ed3NIBC3iHaGSAjz90AI+QGb0FNxK+HJwx846Kb2y1LPxXrL9wj76oinUe18+1wcTl9h/gilbg051l4wQ3ZMp8DxgXFWVe8jFNs21PjzCaUiX9KPQWlxEdHWV36tLqYV5dCU90gqY/qXd6oVLT2UoRsiY8tnl5JhfaXfOInxdMqOPEbUN5/E+RXkL3emFCs7TfKHX5C4MPX8qrILqjqpZMUWHsjXQgd48x0EAXVoqIxXZ6qN0ztjgPGxMCWzhengJhWkvt3yK/un/2gAMAwEAAgADAAAAEOPPOPfPPPPOevLB9e+fuz8dLj/LrjzD3PPjnnL/ADzzzzzzzzzzzzz/xAAiEQEBAAICAgICAwAAAAAAAAABEQAhMUFRYRCBcZEgMKH/2gAIAQMBAT8QGVIsoaCtsOB
wawNuCCC755NG/g9Ad62Cn7esd9tJ+mfxN8UV7BCHvePGAheBU3izpyo4Fk543ZPiPtGczfMH99U6Xoq8d49GiFM7DIR0QkSOA8LLNiQimKABBV3OoYLwETQqgdDzOrg/TWxE1nhh1g+Flg0yh4DTom8GjSQKIKCaVSG++MlUkOE7oL+XNROR3BBFK0RJMBE3qMlHPF9c/BKFEvpEefIpgcEnsXJ/mRTb9GXmUUvqYygirx4Ox/GCZcvW3lJF8a1ix95Vq5z9VHyWqfZcTjNVDFXn9+OMeLFWuHEJD9YlUyKdjyPSOCgnYaYvijPqY4VdejM+0/v/AP/EACYRAQEAAQQABQQDAAAAAAAAAAERACExUWEQQXGBsSCRodEwwfD/2gAIAQIBAT8Qt6H70xMPdD18NQmwXulyBy+kWU3TD20F+2E61FnXhWlvfgwFATsJzyuBtEQNEnlon7zdXD5YleAWDK+3GKRRAeQxGFWvYE+RvtjHQht8hCcVMKQQo1qLfP8AOAWdTur5POV80AVAA007xfRz8+EZ0ycmxp65M/tiUlSPZxhsZs6wkcDOUND3i1x0xdKt+8FXB/WS1X/ffAiNAl64/n//xAAmEAEBAAIBBAIBBAMAAAAAAAABEQAhMUFRYYEQcZEwscHwQFCh/9oACAEBAAE/EP8AKYiRpSigLq8fCDVAEoJ6I8r0+UN1FBpa8XD8ECwABfJirCRqjK+/0+58aAkO+38/EO3xwhL3FPedFW1E3QM/GXbSuHjQX8fLMRkSBtht9fGg1rCOpfQJ+MmsId3BOoah9p5xTsh3btS8QFPqdc4EzfSnP6Lsy1cD8sgPLgdxOAhRSOmc8vPXNaILml+xL/DEvGzFkA8jvG/W6IQfR+/ByrrL7G2PebpQbjHwJWGctvuGwTZOPvL8eAbqCKE8YLNN+LWnqYKzZInHXOzP6zvnpRpJu4enG3yHXA29nTuNDX/DDUqD0L/jAISA+n5AiWtBdUz+ofxkPG5h4TL47Ou4BKJp7zC6IkHl6cuFZakcBG3CabCnaMpI1myqZgL52KAQ6NL9I4J7l97wPNPf4q+7utcfdONCreccL7h7wHAwbSkj7wboZs1AFwRsYoy2B4uDXyTEUekT48HIMVJV4/hxqoEJcBXV2Psd8tQso4K4WaW010S3DzvVGJCa38vo8IVRlF6dPhn/AGkBVGW6dcKLXyV6U5PDrItnYhtByUNZpfuYDwmWubetKiJw1qcTDKqEKgJONDGd6FFeSPB4NGQorV5ZNvg4PhTVgEs2Q6YOt6MR9CoNBBOKIJ5MjrcIQ9YTWAtyRJ2vU8ONmUCFnCDTPOLmsoruGiTecbt7wgvnWPIQ4dQH2Kbjm5W1tLzDcYU4YdscjQk4HBozWJ9GsVXlEHeXUGGrlV99v9p//9kKZW5kc3RyZWFtCmVuZG9iagoKNiAwIG9iago8PAo+PgplbmRvYmoKCjcgMCBvYmoKPDwvRm9udCA2IDAgUgovWE9iamVjdDw8L0ltNCA0IDAgUj4+Ci9Qcm9jU2V0Wy9QREYvVGV4dC9JbWFnZUMvSW1hZ2VJL0ltYWdlQl0KPj4KZW5kb2JqCgoxIDAgb2JqCjw8L1R5cGUvUGFnZS9QYXJlbnQgNSAwIFIvUmVzb3VyY2VzIDcgMCBSL01lZGlhQm94WzAgMCA1OTUuMzAzOTM3MDA3ODc0IDg0MS44ODk3NjM3Nzk1MjhdL0dyb3VwPDwvUy9UcmFuc3BhcmVuY3kvQ1MvRGV2aWNlUkdCL0kgdHJ1ZT4+L0NvbnRlbnRzIDIgMCBSPj4KZW5kb2JqCgo1IDAgb2JqCjw8L1R5cGUvUGFnZXMKL1Jlc291cmNlcyA3IDAgUgovTWVkaWFCb3hbIDAgMCA1OTUuMzAzOTM3MDA3ODc0IDg0MS44ODk3NjM3Nzk1MjggXQovS2lkc1sgMSAwIFIgXQovQ291bnQgMT4+CmVuZG9iagoKOCAwIG9
iago8PC9UeXBlL0NhdGFsb2cvUGFnZXMgNSAwIFIKL09wZW5BY3Rpb25bMSAwIFIgL1hZWiBudWxsIG51bGwgMF0KL0xhbmcoZGUtREUpCj4+CmVuZG9iagoKOSAwIG9iago8PC9DcmVhdG9yPEZFRkYwMDU3MDA3MjAwNjkwMDc0MDA2NTAwNzI+Ci9Qcm9kdWNlcjxGRUZGMDA0QzAwNjkwMDYyMDA3MjAwNjUwMDRGMDA2NjAwNjYwMDY5MDA2MzAwNjUwMDIwMDAzNzAwMkUwMDMzPgovQ3JlYXRpb25EYXRlKEQ6MjAyNDAyMDYxNTAyMjErMDEnMDAnKT4+CmVuZG9iagoKeHJlZgowIDEwCjAwMDAwMDAwMDAgNjU1MzUgZiAKMDAwMDAwNDIwMiAwMDAwMCBuIAowMDAwMDAwMDE5IDAwMDAwIG4gCjAwMDAwMDAxODkgMDAwMDAgbiAKMDAwMDAwMDIwOCAwMDAwMCBuIAowMDAwMDA0MzcwIDAwMDAwIG4gCjAwMDAwMDQwODMgMDAwMDAgbiAKMDAwMDAwNDEwNSAwMDAwMCBuIAowMDAwMDA0NDk0IDAwMDAwIG4gCjAwMDAwMDQ1OTAgMDAwMDAgbiAKdHJhaWxlcgo8PC9TaXplIDEwL1Jvb3QgOCAwIFIKL0luZm8gOSAwIFIKL0lEIFsgPEUzNTc4RjUyMzk0MkExQkRCRjU0NTM1QzY1Qjk4M0MwPgo8RTM1NzhGNTIzOTQyQTFCREJGNTQ1MzVDNjVCOTgzQzA+IF0KL0RvY0NoZWNrc3VtIC9ERkEzOUZCOTEyNUM3RjU4MjJEMzc3MjMwRjNENjM3MAo+PgpzdGFydHhyZWYKNDc2NAolJUVPRgo=".base64Decode().pdf2txt()
The expected result is "Hello world!".
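The string literals in both expressions are Base64-encoded PDF files. If an expression fails after copying, one quick plausibility check is whether the literal still decodes to data that starts with the PDF magic bytes. The sketch below assumes GNU coreutils (base64 -d, head -c), as found on Debian. The function name is_pdf_base64 is hypothetical. It inspects only the beginning of the data, so it cannot detect truncation further down.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the Base64 text on stdin decodes
# to data that begins with the PDF magic bytes "%PDF-".
is_pdf_base64() {
    [ "$(base64 -d 2>/dev/null | head -c 5)" = "%PDF-" ]
}

# The first twelve characters of the first test literal are enough
# for this check; they decode to the PDF header line.
if printf '%s' 'JVBERi0xLjUK' | is_pdf_base64; then
    echo "looks like a PDF"
else
    echo "not a PDF header"
fi
```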