Merge pull request #51 from pzembrod/manuals-ocr
Manuals OCR
pzembrod authored Aug 4, 2024
2 parents 4660d45 + 826e21f commit f3b9b41
Showing 7 changed files with 44,434 additions and 0 deletions.
122 changes: 122 additions & 0 deletions doc/About.md
@@ -0,0 +1,122 @@
# About the Scanned Manuals

This directory's main content is scanned versions of the original German
manuals of VolksForth from the 80s. There were 4 main flavours of VolksForth
and accordingly 4 manuals: C64/C16/Plus4, Atari ST, CP/M and MSDOS.

The manuals for C64/C16/Plus4, Atari ST and MSDOS have recently been rescanned
and OCR-ed. Of the CP/M manual we have an older scan and an almost complete
[Org Mode](https://orgmode.org/) transcript. A partial Org Mode transcript also
exists for the MSDOS manual.

Based on the text versions of the different manuals (transcripts and
sidecar files from `ocrmypdf`), a translation into an English manual has been
started in the 6502/C64/doc directory for the C64 3.9.6 release. Eventually
this is intended to result in a unified manual for all versions.

Note: The mix of Org Mode and Markdown in documents here stems from the
different preferences or past habits of different contributors.

## VolksForth CBM 3.80 Manual

The [doc/cbm/](cbm) directory contains the German manual for the C64/C16/Plus4
VolksForth version 3.80.

* [vf-cbm-380-manual-de.pdf](cbm/vf-cbm-380-manual-de.pdf) is the scanned and
OCR-ed PDF.
* [vf-cbm-380-manual-de.sidecar.txt](cbm/vf-cbm-380-manual-de.sidecar.txt)
is the sidecar text output generated by
[OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF)'s option `--sidecar`.
* [raw-scans/](cbm/raw-scans) contains the raw PDF files as produced by the
scanner from the paper original.

## VolksForth Atari ST 3.80 Manual

The [doc/atari-st/](atari-st) directory contains the German manual for the
Atari ST VolksForth version 3.80.

* [vf-atari-st-380-manual-de.pdf](atari-st/vf-atari-st-380-manual-de.pdf) is
the scanned and OCR-ed PDF.
* [vf-atari-st-380-manual-de.sidecar.txt](atari-st/vf-atari-st-380-manual-de.sidecar.txt)
is the sidecar text output generated by
[OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF)'s option `--sidecar`.
* [raw-scans/](atari-st/raw-scans) contains the raw PDF files as produced by
the scanner from the paper original.
* [LIESMICH.TXT](atari-st/LIESMICH.TXT) is an overview, in German,
of VolksForth and of the files that come with the Atari ST version.
Note: The .SCR files are Forth screen files, i.e. sources, and they have
since been renamed to .FB (for Forth Block source).
* [README.TXT](atari-st/README.TXT) is the same, in English.
* [CHANGES.ORG](atari-st/CHANGES.ORG) is a change log, in German, between
versions 3.7 and 3.80.

## VolksForth CP/M 3.80 Manual

The [doc/cpm/](cpm) directory contains the German manual for the CP/M
VolksForth version 3.80. Note that the CP/M VolksForth was shipped with the
C64/C16/Plus4 manual, and the CP/M manual only describes the CP/M VolksForth's
differences compared to the C64 etc. version.

* [VolksForth-3.80-CPM.pdf](cpm/VolksForth-3.80-CPM.pdf) is the scanned
and OCR-ed PDF.
* [readme.org](cpm/readme.org) is a transcript of the scanned PDF. Note that
the order of the chapters differs slightly between scan and transcript.

## VolksForth MSDOS 3.81 Manual

The [doc/msdos/](msdos) directory contains the German manual for the MSDOS
VolksForth version 3.81.

* [vf-msdos-381-manual-de.pdf](msdos/vf-msdos-381-manual-de.pdf) is the scanned
and OCR-ed PDF.
* [vf-msdos-381-manual-de.sidecar.txt](msdos/vf-msdos-381-manual-de.sidecar.txt)
is the sidecar text output generated by
[OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF)'s option `--sidecar`.
* [raw-scans/](msdos/raw-scans) contains the raw PDF files as produced by the
scanner from the paper original.
* [LIESMICH.TXT](msdos/LIESMICH.TXT) is a partial transcript of the scanned
PDF.
* [README.TXT](msdos/README.TXT) is the beginning of a cross-platform overview
of VolksForth, in English.

## Scanning and OCR notes

For the record, this is the procedure used to create the 3 newly scanned PDFs:

The scans were made from 3 printed manual copies in mint condition; the manuals
are in A5 format.

The scanner used is an HP Color LaserJet MFP M477fdn, which has a document
feeder with two-sided scanning ability and a fixed A4 scanning size.
Since a full VolksForth manual exceeds the capacity of the feeder,
each manual was split into 3 batches; the resulting A4 PDFs are now sitting
in the `raw-scans/` directories.

The raw scans `scan0000.pdf` to `scan0002.pdf` were concatenated and cropped
using the Linux GUI tool `pdfarranger` (version 1.4.2). Steps:
* Drag & drop all files from `raw-scans/` into `pdfarranger` window.
* Press ctrl-A to select all pages.
* Edit -> Crop
* Set lower margin to 29% (1 - 1 / sqrt(2)).
* Set left and right margin to 14.5% (29% / 2).
* Click "OK".
* Edit -> Edit Properties
* Set Creator to "Forth Gesellschaft e.V." (in other PDF viewers this is
displayed as the Author property).
* Save as "newly-cropped.pdf"
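
The margin values in the steps above follow from the ISO 216 paper sizes: an A5
page is an A4 page scaled by 1/sqrt(2) in each dimension, so an A5 sheet scanned
onto an A4 frame leaves a fraction of 1 - 1/sqrt(2) of the frame unused. A
minimal sketch of that arithmetic, assuming `awk` is available and that the
sheet sits at the top of the frame and is centred horizontally; the function
name `crop_margin_percent` is made up for illustration:

```shell
# Hypothetical helper: prints the lower margin and the left/right margin,
# in percent of the A4 frame, for an A5 page scanned onto A4.
crop_margin_percent() {
  awk 'BEGIN {
    unused = (1 - 1 / sqrt(2)) * 100      # share of the frame not covered by A5
    printf "%.1f %.1f\n", unused, unused / 2
  }'
}

crop_margin_percent   # prints "29.3 14.6"
```

The values entered in `pdfarranger` (29% and 14.5%) are these numbers rounded
down slightly.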

The final searchable PDF was created from the intermediate `newly-cropped.pdf`
by adding an OCR text layer using OCRmyPDF:

```
ocrmypdf -l deu -d -c -i newly-cropped.pdf vf-<version>-manual-de.pdf --sidecar vf-<version>-manual-de.sidecar.txt
```

The sidecar file contains the OCR-ed text added into the text layer and is
expected to be useful as input for a machine-aided translation of the manual
into English.
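
For readers unfamiliar with the short options, the same command can be spelled
with OCRmyPDF's long option names (`-l` is `--language`, `-d` is `--deskew`,
`-c` is `--clean`, `-i` is `--clean-final`). The sketch below only prints the
command for one manual rather than running it; the helper name `ocr_command`
and the example argument `cbm-380` are assumptions for illustration, standing in
for the `<version>` placeholder above.

```shell
# Hypothetical helper: prints (does not run) the OCR command for one manual.
# $1 stands in for the <version> placeholder, e.g. "cbm-380".
ocr_command() {
  version="$1"
  printf 'ocrmypdf --language deu --deskew --clean --clean-final %s %s --sidecar %s\n' \
    "newly-cropped.pdf" \
    "vf-${version}-manual-de.pdf" \
    "vf-${version}-manual-de.sidecar.txt"
}

ocr_command cbm-380
```

Printing first makes it easy to check the file names before a long OCR run; the
printed line can then be pasted into a shell in the directory that holds
`newly-cropped.pdf`.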

A note about PDF versions: The raw scans are PDF-1.4; `pdfarranger` outputs
PDF-1.3, which seems to cause problems (error 14) when opening the files with
Adobe Acrobat. `ocrmypdf` produces PDF/A-2b, which does not seem to cause these
problems.
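
The PDF version of a file can be checked without a viewer, because every PDF
starts with a header of the form `%PDF-x.y`. A minimal sketch, assuming `head`
and `sed` are available; the function name `pdf_version` is made up for
illustration:

```shell
# Hypothetical helper: prints the version from the "%PDF-x.y" header of a file.
# Prints nothing if the file does not start with such a header.
pdf_version() {
  head -c 8 "$1" | sed -n 's/^%PDF-\([0-9]\.[0-9]\).*/\1/p'
}

# Example call (file name taken from the steps above):
#   pdf_version newly-cropped.pdf
```
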
Binary file modified doc/atari-st/vf-atari-st-380-manual-de.pdf
Binary file not shown.