Why you should not use OCR or Acrobat-converted Word files directly for translation

Posted on June 15, 2020

[vc_row][vc_column width="2/3"][vc_column_text]

A Word document produced by OCR conversion, or a PDF saved directly to .docx format using Acrobat, might look fine at first glance. But will it be suitable for translation? Probably not. These files need to be properly prepared and reviewed before they are used in any CAT tool, to ensure that the translation will be successful. File preparation can be a relatively easy process or a lengthy, time-consuming job, but either way it is important to carry out. The time invested depends entirely on the type of PDF you are dealing with. We usually distinguish between two types of PDF: editable PDFs (created from an editable source file, such as a Word document or a design application file) and scanned PDFs, which are non-editable and usually of poor quality.

In this guide, we will show you some examples of what PDFs can look like when they are converted in different ways, and the issues that arise in the process. We want to stress the importance of proper file preparation before converted files are used for translation.

[/vc_column_text][/vc_column][vc_column width="1/3"][/vc_column][/vc_row][vc_row][vc_column width="2/3"][vc_empty_space height="12px"][vc_custom_heading text="Poor quality, non-editable PDF (scanned)" font_container="tag:h3|text_align:left" google_fonts="font_family:Roboto%3A100%2C100italic%2C300%2C300italic%2Cregular%2Citalic%2C500%2C500italic%2C700%2C700italic%2C900%2C900italic|font_style:900%20bold%20regular%3A900%3Anormal"][vc_column_text]

These PDFs can be identified by their blurry (sometimes illegible) text and poor image quality.

Converting these PDFs fully with OCR, or saving them directly to .docx format through Acrobat, will most likely result in a document with erroneous characters, because the OCR cannot read the blurry source correctly. The OCR essentially tries to replicate the layout as closely as possible, without regard for the “garbage” this introduces. Typical problems are text that ends up in text boxes or as images instead of in the normal text flow, numerous section breaks where they are not needed, dashes placed instead of actual hyphenation, stray spaces that split words in two, and paragraph marks in the wrong places. These cause segmentation issues and misinterpretation of sentences and ideas, which badly affects the quality of the translation. If the file contains images or backgrounds, they might be displayed incorrectly as well. The OCR also tends to create an endless number of text styles, often one for each paragraph. In the end, all of this makes the document layout difficult to control, which in turn makes post-translation DTP a challenging task.

For these reasons, we never recommend that our clients use OCR files directly for translation without proper preparation and correction beforehand, since doing so will not make the translation easy. The text needs to be reviewed and corrected by manually retyping the parts that are corrupted or that ended up as images. This is tedious, but necessary in order to translate the whole document.
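The segmentation problems described above can be partly flagged before a human review. The following is a minimal Python sketch, assuming the paragraphs of the converted file have already been extracted as plain strings. The function names and the heuristics are our own illustrative assumptions, not a feature of any OCR or CAT tool, and every hit still needs to be checked by a person.

```python
import re

# Punctuation that normally closes a sentence. A paragraph ending in anything
# else is treated as a possible mid-sentence break (a heuristic, not a rule).
SENTENCE_END = (".", "!", "?", ":", ";", ")", '"', "\u201d")

# A word fragment, a hyphen, whitespace, then a lowercase continuation,
# e.g. "trans- lation". Genuine constructions such as "pre- and post-editing"
# will also match, so every hit needs a human decision.
HYPHEN_SPLIT = re.compile(r"\b[^\W\d_]+-\s+[a-z]+")


def find_suspect_breaks(paragraphs):
    """Return indexes of paragraphs that look cut off mid-sentence."""
    suspects = []
    for i, text in enumerate(paragraphs[:-1]):
        current = text.rstrip()
        following = paragraphs[i + 1].lstrip()
        if not current or not following:
            continue  # empty paragraphs are a separate clean-up task
        ends_open = not current.endswith(SENTENCE_END)
        continues_lower = following[0].islower()
        if ends_open and continues_lower:
            suspects.append(i)
    return suspects


def find_split_hyphens(text):
    """Return fragments where a hyphen plus a space appears to split a word."""
    return HYPHEN_SPLIT.findall(text)
```

A paragraph that ends without closing punctuation and is followed by one starting in lowercase is a likely candidate for merging before the file goes into a CAT tool. Headings and list items will produce false positives, which is why the sketch only reports positions and does not change the text.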

[/vc_column_text][vc_custom_heading text="Examples - Poor quality, non-editable PDF (scanned)" font_container="tag:h6|text_align:left|color:%23f47e5a" google_fonts="font_family:Roboto%3A100%2C100italic%2C300%2C300italic%2Cregular%2Citalic%2C500%2C500italic%2C700%2C700italic%2C900%2C900italic|font_style:900%20bold%20regular%3A900%3Anormal"][vc_single_image image="32594" img_size="full"][vc_single_image image="32574" img_size="full"][vc_single_image image="32595" img_size="full"][vc_single_image image="32596" img_size="full"][vc_custom_heading text="Editable PDF" font_container="tag:h3|text_align:left" google_fonts="font_family:Roboto%3A100%2C100italic%2C300%2C300italic%2Cregular%2Citalic%2C500%2C500italic%2C700%2C700italic%2C900%2C900italic|font_style:900%20bold%20regular%3A900%3Anormal"][vc_column_text]

Editable PDFs are those created from, for example, a Word or InDesign file, in which text and images are of excellent quality. They are editable because you can highlight the text in the PDF and also edit it directly in Acrobat if needed. This is something you cannot do in a scanned version.

If you run an editable PDF through OCR or save it directly to .docx format through Acrobat, the outcome might look quite decent. But you should not settle for this result: the preparation and review stage is still important for these files. Luckily, they are more pleasant to work with, since the text normally does not end up corrupted with numerous incorrect characters, so the review is significantly less time-consuming. The converted file might still contain unnecessary section breaks, text in text boxes, wrongly segmented paragraphs and many different text styles, and these need to be addressed.

Also important: if the source file includes many images, these will also cause layout problems in the converted file, since they will most probably be floating all over the document. They need to be placed correctly in the layout to ensure smoother post-translation DTP.
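To get a rough idea of how much clean-up a converted file needs, the structural items mentioned above can be counted directly in the .docx, which is a ZIP archive containing XML. The sketch below uses only the Python standard library. The function name `audit_docx` and the choice of what to count are our own assumptions for illustration; it reads only the main document part, so headers, footers and footnotes are not included.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML elements (w:p, w:sectPr, ...).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _tag(name):
    """Build the namespaced tag name that ElementTree expects."""
    return f"{{{W_NS}}}{name}"


def audit_docx(path_or_file):
    """Count sections, text boxes, paragraph styles and paragraphs in a .docx."""
    with zipfile.ZipFile(path_or_file) as docx:
        root = ET.fromstring(docx.read("word/document.xml"))

    # Every distinct style name referenced by a paragraph in the body.
    styles = {el.get(_tag("val")) for el in root.iter(_tag("pStyle"))}

    return {
        # One w:sectPr per section, so section breaks = sections - 1.
        "sections": sum(1 for _ in root.iter(_tag("sectPr"))),
        # w:txbxContent wraps the text inside a text box. Word may store a
        # fallback copy of each box, so this count can be doubled.
        "text_boxes": sum(1 for _ in root.iter(_tag("txbxContent"))),
        "paragraph_styles": len(styles),
        # Includes paragraphs that sit inside text boxes.
        "paragraphs": sum(1 for _ in root.iter(_tag("p"))),
    }
```

Calling `audit_docx("converted.docx")` on a hypothetical converted file returns a small dictionary of counts. A short document with dozens of sections or nearly as many paragraph styles as paragraphs is a sign that the conversion needs thorough preparation before translation.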

[/vc_column_text][vc_custom_heading text="Examples - Editable PDF" font_container="tag:h6|text_align:left|color:%2387d0d6" google_fonts="font_family:Roboto%3A100%2C100italic%2C300%2C300italic%2Cregular%2Citalic%2C500%2C500italic%2C700%2C700italic%2C900%2C900italic|font_style:900%20bold%20regular%3A900%3Anormal"][vc_single_image image="32584" img_size="full"][vc_single_image image="32585" img_size="full"][vc_custom_heading text="Summary" font_container="tag:h3|text_align:left" google_fonts="font_family:Roboto%3A100%2C100italic%2C300%2C300italic%2Cregular%2Citalic%2C500%2C500italic%2C700%2C700italic%2C900%2C900italic|font_style:900%20bold%20regular%3A900%3Anormal"][vc_column_text]

We have shown you the different outcomes of converting PDFs of different quality, the issues this process causes, and why the files need to be reviewed and prepared in order to achieve a successful translation, regardless of the type of PDF.

Scanned PDFs are the most time-consuming to prepare, especially if they consist of many pages. Why? Because the quality of these PDFs is normally very poor, which tends to result in converted files full of erroneous characters that need very close attention to be spotted and then corrected manually. If the document layout includes images or backgrounds, these will most probably not be replicated correctly (if at all) in the conversion and will need to be inserted manually into the layout afterwards.

Editable PDFs will give you a much better conversion result. But remember that these might still contain items that need to be dealt with, such as incorrectly placed images, many unnecessary section breaks, text contained in text boxes, etc. You should not skip the preparation process for these files if you want a successful translation!

Don't hesitate to contact us if you need any help. We have many years of experience recreating Word files from PDFs.

[/vc_column_text][/vc_column][vc_column width="1/3"][/vc_column][/vc_row]



Let us know about your next project.

Information: info@ttsnordika.com

Address: 5 Norte 475, Viña del mar, Chile
