#233416 After installing tesseract-lang, tesseract only works after a reinstall

pedrohqb
opened 2 months ago
HOMEBREW_VERSION: 4.6.3
ORIGIN: https://github.com/Homebrew/brew
HEAD: a0d01bc7c410bdb55794f4858c29e9c79e0e485c
Last commit: 2 days ago
Branch: stable
Core tap JSON: 13 Aug 22:12 UTC
HOMEBREW_PREFIX: /home/linuxbrew/.linuxbrew
HOMEBREW_CASK_OPTS: []
HOMEBREW_DISPLAY: :0
HOMEBREW_EDITOR: /usr/bin/nano
HOMEBREW_MAKE_JOBS: 20
SUDO_ASKPASS: /usr/libexec/openssh/gnome-ssh-askpass
Homebrew Ruby: 3.4.5 => /var/home/linuxbrew/.linuxbrew/Homebrew/Library/Homebrew/vendor/portable-ruby/3.4.5/bin/ruby
CPU: 20-core 64-bit alderlake
Clang: N/A
Git: 2.50.1 => /bin/git
Curl: 8.9.1 => /bin/curl
Kernel: Linux 6.14.11-200.fc41.x86_64 x86_64 GNU/Linux
OS: Bluefin (Version: gts-41.20250810 / FROM Fedora Silverblue 41) (Deinonychus)
Host glibc: 2.40
/usr/bin/gcc: 14.3.1
/usr/bin/ruby: N/A
glibc: N/A
gcc@11: N/A
gcc: 15.1.0
xorg: N/A

Verification

  • My brew doctor output says "Your system is ready to brew." and I am still able to reproduce my issue.
  • I ran brew update and am still able to reproduce my issue.
  • I have resolved all warnings from brew doctor and that did not fix my problem.
  • I searched for recent similar issues at https://github.com/Homebrew/homebrew-core/issues?q=is%3Aissue and found no duplicates.
  • My issue is not about a failure to build a formula from source.

What were you trying to do (and why)?

I installed ocrmypdf and its two main dependencies, i.e., tesseract and tesseract-lang. However, after installing tesseract-lang, the original languages that come with tesseract (such as eng) disappeared and tesseract stopped working. I was able to work around it by reinstalling tesseract.

What happened (include all command output)?

Running "tesseract --list-langs" produces this output, in which eng is missing:

❯ tesseract --list-langs
List of available languages in "/home/linuxbrew/.linuxbrew/share/tessdata/" (160):
afr amh ara asm aze aze_cyrl bel ben bod bos bre bul cat ceb ces chi_sim chi_sim_vert chi_tra chi_tra_vert chr cos cym dan deu div dzo ell enm epo equ est eus fao fas fil fin fra frk frm fry gla gle glg grc guj hat heb hin hrv hun hye iku ind isl ita ita_old jav jpn jpn_vert kan kat kat_old kaz khm kir kmr kor kor_vert lao lat lav lit ltz mal mar mkd mlt mon mri msa mya nep nld nor oci ori pan pol por pus que ron rus san script/Arabic script/Armenian script/Bengali script/Canadian_Aboriginal script/Cherokee script/Cyrillic script/Devanagari script/Ethiopic script/Fraktur script/Georgian script/Greek script/Gujarati script/Gurmukhi script/HanS script/HanS_vert script/HanT script/HanT_vert script/Hangul script/Hangul_vert script/Hebrew script/Japanese script/Japanese_vert script/Kannada script/Khmer script/Lao script/Latin script/Malayalam script/Myanmar script/Oriya script/Sinhala script/Syriac script/Tamil script/Telugu script/Thaana script/Thai script/Tibetan script/Vietnamese sin slk slv snd spa spa_old sqi srp srp_latn sun swa swe syr tam tat tel tgk tha tir ton tur uig ukr urd uzb uzb_cyrl vie yid yor
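The check for missing languages can be scripted. The following is only a minimal sketch: the helper name `missing_langs` is hypothetical, and it assumes the listing arrives on standard input with one language per line, as tesseract prints it in a terminal. The names to look for (eng, osd, snum) are the ones that appear in the listing after the reinstall but not in the one above.

```shell
# Hypothetical helper: read a `tesseract --list-langs` listing on stdin and
# print every expected language (passed as arguments) that is absent from it.
missing_langs() {
  listing=$(cat)
  for lang in "$@"; do
    # -x matches whole lines only, so "chi_sim" does not match "chi_sim_vert";
    # -F treats the name as a fixed string rather than a regular expression.
    printf '%s\n' "$listing" | grep -qxF -- "$lang" || printf '%s\n' "$lang"
  done
}
```

Usage would look like `tesseract --list-langs | missing_langs eng osd snum`. Empty output means all three are present. If the listing is written to standard error on a given build, appending `2>&1` to the tesseract command should capture it.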

Also, it will fail when trying to ocr a pdf:

❯ ocrmypdf -l por Summa\ Theologica\ -\ Wikipedia.pdf Summa\ Theologica\ -\ Wikipedia-ocr.pdf --force-ocr
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 27/27 0:00:00
Start processing 20 pages concurrently ocr.py:96
1 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
2 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
3 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
4 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
5 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
6 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
7 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
8 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
9 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
10 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
11 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
12 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
13 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
14 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
15 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
16 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
17 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
18 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
19 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
20 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
5 [tesseract] read_params_file: Can't open hocr tesseract.py:257
5 [tesseract] read_params_file: Can't open txt tesseract.py:257
21 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
4 [tesseract] read_params_file: Can't open hocr tesseract.py:257
4 [tesseract] read_params_file: Can't open txt tesseract.py:257
3 [tesseract] read_params_file: Can't open hocr tesseract.py:257
3 [tesseract] read_params_file: Can't open txt tesseract.py:257
6 [tesseract] read_params_file: Can't open hocr tesseract.py:257
6 [tesseract] read_params_file: Can't open txt tesseract.py:257
9 [tesseract] read_params_file: Can't open hocr tesseract.py:257
9 [tesseract] read_params_file: Can't open txt tesseract.py:257
14 [tesseract] read_params_file: Can't open hocr tesseract.py:257
14 [tesseract] read_params_file: Can't open txt tesseract.py:257
18 [tesseract] read_params_file: Can't open hocr tesseract.py:257
18 [tesseract] read_params_file: Can't open txt tesseract.py:257
20 [tesseract] read_params_file: Can't open hocr tesseract.py:257
20 [tesseract] read_params_file: Can't open txt tesseract.py:257
8 [tesseract] read_params_file: Can't open hocr tesseract.py:257
8 [tesseract] read_params_file: Can't open txt tesseract.py:257
19 [tesseract] read_params_file: Can't open hocr tesseract.py:257
19 [tesseract] read_params_file: Can't open txt tesseract.py:257
1 [tesseract] read_params_file: Can't open hocr tesseract.py:257
1 [tesseract] read_params_file: Can't open txt tesseract.py:257
17 [tesseract] read_params_file: Can't open hocr tesseract.py:257
17 [tesseract] read_params_file: Can't open txt tesseract.py:257
7 [tesseract] read_params_file: Can't open hocr tesseract.py:257
7 [tesseract] read_params_file: Can't open txt tesseract.py:257
11 [tesseract] read_params_file: Can't open hocr tesseract.py:257
11 [tesseract] read_params_file: Can't open txt tesseract.py:257
10 [tesseract] read_params_file: Can't open hocr tesseract.py:257
10 [tesseract] read_params_file: Can't open txt tesseract.py:257
15 [tesseract] read_params_file: Can't open hocr tesseract.py:257
15 [tesseract] read_params_file: Can't open txt tesseract.py:257
12 [tesseract] read_params_file: Can't open hocr tesseract.py:257
12 [tesseract] read_params_file: Can't open txt tesseract.py:257
16 [tesseract] read_params_file: Can't open hocr tesseract.py:257
16 [tesseract] read_params_file: Can't open txt tesseract.py:257
13 [tesseract] read_params_file: Can't open hocr tesseract.py:257
13 [tesseract] read_params_file: Can't open txt tesseract.py:257
2 [tesseract] read_params_file: Can't open hocr tesseract.py:257
2 [tesseract] read_params_file: Can't open txt tesseract.py:257
2 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:241
21 [tesseract] read_params_file: Can't open hocr tesseract.py:257
21 [tesseract] read_params_file: Can't open txt tesseract.py:257
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/27 -:--:--
An exception occurred while executing the pipeline _common.py:296
Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/Cellar/ocrmypdf/16.10.4/libexec/lib/python3.13/site-packages/ocrmypdf/_pipelines/_common.py", line 261, in cli_exception_handler
    return fn(options, plugin_manager)
  File "/home/linuxbrew/.linuxbrew/Cellar/ocrmypdf/16.10.4/libexec/lib/python3.13/site-packages/ocrmypdf/_pipelines/ocr.py", line 181, in _run_pipeline
    optimize_messages = exec_concurrent(context, executor)
  File "/home/linuxbrew/.linuxbrew/Cellar/ocrmypdf/16.10.4/libexec/lib/python3.13/site-packages/ocrmypdf/_pipelines/ocr.py", line 117, in exec_concurrent
    executor(
    ~~~~~~~~^
        use_threads=options.use_threads,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<10 lines>...
        task_finished=update_page,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/linuxbrew/.linuxbrew/Cellar/ocrmypdf/16.10.4/libexec/lib/python3.13/site-packages/ocrmypdf/_concurrent.py", line 78, in call
    self._execute(
    ~~~~~~~~~~~~~^
        use_threads=use_threads,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    ...<5 lines>...
        task_finished=task_finished,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/linuxbrew/.linuxbrew/Cellar/ocrmypdf/16.10.4/libexec/lib/python3.13/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute
    result = future.result()
  File "/home/linuxbrew/.linuxbrew/opt/python@3.13/lib/python3.13/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/linuxbrew/.linuxbrew/opt/python@3.13/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/linuxbrew/.linuxbrew/opt/python@3.13/lib/python3.13/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/linuxbrew/.linuxbrew/Cellar/ocrmypdf/16.10.4/libexec/lib/python3.13/site-packages/ocrmypdf/_pipelines/ocr.py", line 81, in _exec_page_sync
    ocr_out, text_out = _image_to_ocr_text(page_context, ocr_image_out)
                        ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/linuxbrew/.linuxbrew/Cellar/ocrmypdf/16.10.4/libexec/lib/python3.13/site-packages/ocrmypdf/_pipelines/ocr.py", line 63, in _image_to_ocr_text
    ocr_out = render_hocr_page(hocr_out, page_context)
  File "/home/linuxbrew/.linuxbrew/Cellar/ocrmypdf/16.10.4/libexec/lib/python3.13/site-packages/ocrmypdf/_pipeline.py", line 774, in render_hocr_page
    if hocr.stat().st_size == 0:
       ~~~~~~~~~^^
  File "/home/linuxbrew/.linuxbrew/opt/python@3.13/lib/python3.13/pathlib/_local.py", line 515, in stat
    return os.stat(self, follow_symlinks=follow_symlinks)
           ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ocrmypdf.io.3qq18bqv/000005_ocr_hocr.hocr'
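
The repeated "read_params_file: Can't open hocr" and "Can't open txt" messages suggest that tesseract could not find its hocr and txt config files, which would also explain why no .hocr output file was written. As a rough diagnostic sketch only: the helper name `missing_configs` is hypothetical, and the assumption that these config files sit in a `configs` subdirectory of the tessdata directory is mine and is not confirmed anywhere in this report.

```shell
# Hypothetical helper: print each named tesseract config file that is absent.
# $1 is the tessdata directory; the remaining arguments are config names.
# ASSUMPTION: config files are stored under "<tessdata>/configs/".
missing_configs() {
  tessdata_dir=$1
  shift
  for cfg in "$@"; do
    # -f follows symlinks, so a symlinked config file still counts as present
    [ -f "$tessdata_dir/configs/$cfg" ] || printf '%s\n' "$cfg"
  done
}
```

On this setup the call would be `missing_configs "$(brew --prefix)/share/tessdata" hocr txt`, run once before and once after `brew reinstall tesseract`. If the names are printed only before the reinstall, that would point at the config files being lost together with eng when tesseract-lang was installed.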

What did you expect to happen?

After reinstalling tesseract, you get the correct output, which includes eng:

❯ tesseract --list-langs
List of available languages in "/home/linuxbrew/.linuxbrew/share/tessdata/" (163):
afr amh ara asm aze aze_cyrl bel ben bod bos bre bul cat ceb ces chi_sim chi_sim_vert chi_tra chi_tra_vert chr cos cym dan deu div dzo ell eng enm epo equ est eus fao fas fil fin fra frk frm fry gla gle glg grc guj hat heb hin hrv hun hye iku ind isl ita ita_old jav jpn jpn_vert kan kat kat_old kaz khm kir kmr kor kor_vert lao lat lav lit ltz mal mar mkd mlt mon mri msa mya nep nld nor oci ori osd pan pol por pus que ron rus san script/Arabic script/Armenian script/Bengali script/Canadian_Aboriginal script/Cherokee script/Cyrillic script/Devanagari script/Ethiopic script/Fraktur script/Georgian script/Greek script/Gujarati script/Gurmukhi script/HanS script/HanS_vert script/HanT script/HanT_vert script/Hangul script/Hangul_vert script/Hebrew script/Japanese script/Japanese_vert script/Kannada script/Khmer script/Lao script/Latin script/Malayalam script/Myanmar script/Oriya script/Sinhala script/Syriac script/Tamil script/Telugu script/Thaana script/Thai script/Tibetan script/Vietnamese sin slk slv snd snum spa spa_old sqi srp srp_latn sun swa swe syr tam tat tel tgk tha tir ton tur uig ukr urd uzb uzb_cyrl vie yid yor

Also, running OCR on a PDF now works:

❯ ocrmypdf -l por Summa\ Theologica\ -\ Wikipedia.pdf Summa\ Theologica\ -\ Wikipedia-ocr.pdf --force-ocr
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 27/27 0:00:00
Start processing 20 pages concurrently ocr.py:96
1 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
2 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
3 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
4 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
5 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
6 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
7 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
8 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
9 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
10 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
11 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
12 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
13 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
14 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
15 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
16 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
17 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
18 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
19 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
20 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
21 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
22 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
23 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
24 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
25 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
26 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
2 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:241
27 page already has text! - rasterizing text and running OCR anyway _pipeline.py:331
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 27/27 0:00:00
Postprocessing... ocr.py:144
PDF/A conversion ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 27/27 0:00:00
Linearizing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Deflating JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
JBIG2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
Image optimization did not improve the file - optimizations will not be used optimize.py:735
Image optimization ratio: 1.00 savings: -0.1% _pipeline.py:1002
Total file size ratio: 0.09 savings: -1059.6% _pipeline.py:1005
Output file is a PDF/A-2B (as expected) _common.py:474
The output file size is 11.41× larger than the input file. _validation.py:358
Possible reasons for this include:
--force-ocr was issued, causing transcoding.
PDF/A conversion was enabled. (Try --output-type pdf.)

Step-by-step reproduction instructions (by running brew commands)

1. To install: brew install ocrmypdf tesseract-lang
2. The workaround, after installing ocrmypdf and tesseract-lang: brew reinstall tesseract