Merge pull request #51 from pzembrod/manuals-ocr
Manuals OCR
# About the Scanned Manuals

This directory's main content is scanned versions of the original German
manuals of VolksForth from the 80s. There were 4 main flavours of VolksForth
and accordingly 4 manuals: C64/C16/Plus4, Atari ST, CP/M and MSDOS.

The manuals for C64/C16/Plus4, Atari ST and MSDOS have recently been rescanned
and OCR-ed. Of the CP/M manual we have an older scan and an almost complete
[Org Mode](https://orgmode.org/) transcript. A partial Org Mode transcript also
exists of the MSDOS manual.

Based on the different text versions of the manuals (transcripts, sidecar
files from `ocrmypdf`), a translation into an English manual is being
started in the 6502/C64/doc directory for the C64 3.9.6 release. Eventually
this is intended to result in a unified manual for all versions.

Note: The mix of Org Mode and Markdown in documents here stems from different
preferences or past habits of different contributors.

## VolksForth CBM 3.80 Manual

The [doc/cbm/](cbm) directory contains the German manual for the C64/C16/Plus4
VolksForth version 3.80.

* [vf-cbm-380-manual-de.pdf](cbm/vf-cbm-380-manual-de.pdf) is the scanned and
  OCR-ed PDF.
* [vf-cbm-380-manual-de.sidecar.txt](cbm/vf-cbm-380-manual-de.sidecar.txt)
  is the sidecar text output generated by
  [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF)'s option `--sidecar`.
* [raw-scans/](cbm/raw-scans) contains the raw PDF files as produced by the
  scanner from the paper original.

## VolksForth Atari ST 3.80 Manual

The [doc/atari-st/](atari-st) directory contains the German manual for the
Atari ST VolksForth version 3.80.

* [vf-atari-st-380-manual-de.pdf](atari-st/vf-atari-st-380-manual-de.pdf) is
  the scanned and OCR-ed PDF.
* [vf-atari-st-380-manual-de.sidecar.txt](atari-st/vf-atari-st-380-manual-de.sidecar.txt)
  is the sidecar text output generated by
  [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF)'s option `--sidecar`.
* [raw-scans/](atari-st/raw-scans) contains the raw PDF files as produced by
  the scanner from the paper original.
* [LIESMICH.TXT](atari-st/LIESMICH.TXT) is an overview, in German,
  of VolksForth and of the files that come with the Atari ST version.
  Note: The .SCR files are Forth screen files, i.e. sources, and they have
  since been renamed to .FB (for Forth Block source).
* [README.TXT](atari-st/README.TXT) is the same, in English.
* [CHANGES.ORG](atari-st/CHANGES.ORG) is a change log, in German, between
  versions 3.7 and 3.80.

## VolksForth CP/M 3.80 Manual

The [doc/cpm/](cpm) directory contains the German manual for the CP/M
VolksForth version 3.80. Note that the CP/M VolksForth was shipped with the
C64/C16/Plus4 manual, and the CP/M manual only describes the CP/M VolksForth's
differences compared to the C64 etc. version.

* [VolksForth-3.80-CPM.pdf](cpm/VolksForth-3.80-CPM.pdf) is the scanned
  and OCR-ed PDF.
* [readme.org](cpm/readme.org) is a transcript of the scanned PDF. Note that
  the order of the chapters differs slightly between scan and transcript.

## VolksForth MSDOS 3.81 Manual

The [doc/msdos/](msdos) directory contains the German manual for the MSDOS
VolksForth version 3.81.

* [vf-msdos-381-manual-de.pdf](msdos/vf-msdos-381-manual-de.pdf) is the scanned
  and OCR-ed PDF.
* [vf-msdos-381-manual-de.sidecar.txt](msdos/vf-msdos-381-manual-de.sidecar.txt)
  is the sidecar text output generated by
  [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF)'s option `--sidecar`.
* [raw-scans/](msdos/raw-scans) contains the raw PDF files as produced by the
  scanner from the paper original.
* [LIESMICH.TXT](msdos/LIESMICH.TXT) is a partial transcript of the scanned
  PDF.
* [README.TXT](msdos/README.TXT) is the beginning of a cross-platform overview
  of VolksForth, in English.

## Scanning and OCR notes

For the record, this is the procedure used to create the 3 newly scanned PDFs:

The scans were made from 3 printed manual copies in mint condition; the manuals
are in A5 format.

The scanner used is an HP Color LaserJet MFP M477fdn, which has a document
feeder with two-sided scanning ability and a fixed A4 scanning size.
Since a full VolksForth manual exceeds the capacity of the feeder,
each manual was split into 3 batches; the resulting A4 PDFs are now stored
in the `raw-scans/` directories.

The raw scans `scan0000.pdf` to `scan0002.pdf` were concatenated and cropped
using the Linux GUI tool `pdfarranger` (version 1.4.2). Steps:
* Drag & drop all files from `raw-scans/` into the `pdfarranger` window.
* Press ctrl-A to select all pages.
* Edit -> Crop
  * Set lower margin to 29% (1 - 1 / sqrt(2)).
  * Set left and right margin to 14.5% (29% / 2).
  * Click "OK".
* Edit -> Edit Properties
  * Set Creator to "Forth Gesellschaft e.V." (in other PDF viewers this is
    displayed as the Author property).
* Save as "newly-cropped.pdf"
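
The crop percentages in the steps above follow from the ISO paper sizes: each A-series size is the next larger one scaled by 1 / sqrt(2), so an A5 page fills about 71% of an A4 scan area in each dimension. The following is a minimal sketch of that arithmetic, not part of the original procedure. The function name is hypothetical, and it assumes the feeder centers the page horizontally and aligns it to the top edge, which is what the margin settings imply.

```python
import math


def a5_on_a4_crop_margins():
    """Return crop margins (as fractions of the A4 page) for an A5 sheet.

    Assumption: the A5 sheet is centered horizontally and aligned to the
    top edge of the A4 scan area.
    """
    scale = 1 / math.sqrt(2)   # A5 side length relative to A4
    unused = 1 - scale         # fraction of each A4 dimension left empty
    return {
        "top": 0.0,
        "bottom": unused,      # all unused height is below the page
        "left": unused / 2,    # unused width is split evenly
        "right": unused / 2,
    }


if __name__ == "__main__":
    for side, fraction in a5_on_a4_crop_margins().items():
        print(f"{side}: {fraction * 100:.1f}%")
```

The exact side margin is slightly above 14.6%; the 14.5% used in the steps is simply the rounded 29% divided by 2.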

The final searchable PDF was created from the intermediate `newly-cropped.pdf`
by adding an OCR text layer using OCRmyPDF:

```
ocrmypdf -l deu -d -c -i newly-cropped.pdf vf-<version>-manual-de.pdf --sidecar vf-<version>-manual-de.sidecar.txt
```
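
Since the same command is run once per manual, it can be generated rather than typed. The sketch below only builds and prints the command lines; it does not run OCRmyPDF. The function name is hypothetical, and the version strings are inferred from the PDF file names listed in the sections above.

```python
import shlex


def build_ocrmypdf_command(version, cropped_pdf="newly-cropped.pdf"):
    """Build the ocrmypdf argument list for one manual.

    `version` is the middle part of the file name, e.g. "cbm-380".
    The flags mirror the command above: German language (-l deu),
    deskew (-d), clean (-c) and clean-final (-i).
    """
    output_pdf = f"vf-{version}-manual-de.pdf"
    sidecar_txt = f"vf-{version}-manual-de.sidecar.txt"
    return [
        "ocrmypdf", "-l", "deu", "-d", "-c", "-i",
        cropped_pdf, output_pdf,
        "--sidecar", sidecar_txt,
    ]


if __name__ == "__main__":
    # Version strings inferred from the file names in this directory.
    for version in ("cbm-380", "atari-st-380", "msdos-381"):
        print(shlex.join(build_ocrmypdf_command(version)))
```

To actually run a command, the returned list could be passed to `subprocess.run(..., check=True)` from within the directory that holds `newly-cropped.pdf`.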

The sidecar file contains the OCR-ed text added into the text layer and is
expected to be useful as input for a machine-aided translation of the manual
into English.
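
For translation work it may help to process the sidecar text page by page. The sketch below assumes that pages in the sidecar file are separated by form feed characters (`\f`); verify this against the actual files before relying on it. The function name and the sample text are hypothetical.

```python
def split_sidecar_pages(text):
    """Split sidecar text into a list of per-page strings.

    Assumption: pages are separated by form feed characters. Empty pages
    are kept so that list indices still correspond to PDF page numbers.
    """
    return [page.strip() for page in text.split("\f")]


if __name__ == "__main__":
    # Hypothetical sample; a real run would read one of the
    # vf-<version>-manual-de.sidecar.txt files instead.
    sample = "Seite eins\n\f\fSeite drei\n"
    for number, page in enumerate(split_sidecar_pages(sample), start=1):
        print(number, repr(page))
```

Keeping empty entries for blank pages means a translated page can be traced back to the scanned page with the same number.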

A note about PDF versions: The raw scans are PDF-1.4. `pdfarranger` outputs
PDF-1.3, which seems to cause problems (error 14) when opening files with
Adobe Acrobat. `ocrmypdf` produces PDF/A-2b, which does not seem to cause these
problems.