
Greek Betacode to Unicode Transformations

Lisa Cerrato edited this page Sep 14, 2021 · 13 revisions

Background

Ancient Greek betacode is a way of representing polytonic Greek in the Roman alphabet (for an illustration, see the betacode graphic on the Perseus site).

We are now working to convert all text in the Perseus Digital Library that is still in betacode to UTF-8. This process is not without problems.

Scripts

Most of the files currently in the canonical repos have been converted from Greek betacode to UTF-8 with one of the following scripts:

NB: for those unfamiliar with how these files work, java -jar invokes the transform, e.g.:

java -jar /your/path/to/file/tei-conversion-tools/jar/tei.transformer.lang_grc.jar /your/path/to/file/canonical-greekLit/data/tlg0525/tlg001/tlg0525.tlg001.perseus-eng2.xml

This produces a new file with "unicode" appended to the filename.

Update (2018)

We discovered a previously undocumented issue with combining diacriticals (illustrated by changing P4 display settings): a)/|smatos was converted to ᾁσματος rather than the expected ᾄσματος (smooth breathing and acute).

A similar issue produced δ̓ (delta plus a combining mark) instead of δ᾽ (delta plus a spacing apostrophe).
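Inspecting the raw codepoints makes such discrepancies visible regardless of how a font renders them. A minimal Python sketch (the `codepoints` helper here is purely illustrative, not part of any conversion script):

```python
import unicodedata

def codepoints(s):
    """List (U+XXXX, official Unicode name) for every character in a string."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "UNKNOWN")) for c in s]

# The precomposed character the converter emitted (rough breathing + iota subscript):
print(codepoints("\u1f81"))           # ᾁ
# Delta followed by a combining comma above vs. a spacing koronis:
print(codepoints("\u03b4\u0313"))     # δ̓
print(codepoints("\u03b4\u1fbd"))     # δ᾽
```

Running this on suspect words shows at a glance whether a breathing was attached as a combining mark, baked into a precomposed letter, or replaced by a spacing character.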

Problems

Some of these scripts introduce problems of their own; others expose discrepancies in the original encoding. These problems are why not all of the text has been converted to UTF-8. They most commonly manifest as:

  1. Apostrophes have been mistakenly conflated with smooth breathing marks.

  2. Errors in converting parentheses.

An approach to solving them must include:

  1. an attempt to correct the underlying betacode.

  2. an attempt to correct versions that are already in UTF-8, while preserving later modifications to the markup.

For more on finding legacy betacode versions, see Finding-Data-(including-legacy-versions).

For more on these problems and various thoughts on how to go about solving them, see below.

More on the problems

At some point, someone will need to write logic and code that fixes some of the problems introduced in the process of encoding and converting.

Before tackling this in depth, please read the following email exchange between Dr. Giuseppe Celano (@gcelano), Frederik Baumgardt (@fbaumgardt), and Bridget Almas (@balmas):

@gcelano:

"You are looking for a character within a text, which means that you are looking for a specific Unicode codepoint. Please do not rely on how a character is displayed (glyph), because with apostrophes and similar marks you very easily get confused. So what I would recommend is to use a function in your programming language that accepts Unicode codepoints.

The cleanest way to do this, in my opinion, would be to analyze all non-letter characters contained in a text (this can easily be done with a function accepting regular expressions/category escapes): this list is usually not long and allows you to identify any "strange" character belonging to "punctuation". You can then decide whether to keep it or change it.

I have an XQuery for Greek which returns all characters that do not belong to a set specified by me (which basically includes all letters of the Greek alphabet I manually checked using the Unicode charts). What you get is usually a very short list of those characters that are notoriously problematic: apostrophe, semicolon/middle dot, and so on. On the basis of this you can decide if they are ok or you need to change something.

Making these characters uniform is really very important. A note on the apostrophe: if I remember correctly, in many texts it is encoded as Modifier Letter Apostrophe U+02BC, which may be a felicitous choice because this codepoint does not belong to Unicode punctuation, and so when you tokenize a text using category escapes the character is correctly not separated from the preceding characters."
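The audit Giuseppe describes can be sketched with the Python standard library alone, using Unicode general categories rather than regex escapes. Everything here (function name, sample string) is illustrative, not code from the project:

```python
import unicodedata
from collections import Counter

def nonletter_inventory(text):
    """Count every character whose Unicode general category is not a letter
    (and is not whitespace), so odd punctuation stands out for review."""
    counts = Counter(
        c for c in text
        if not unicodedata.category(c).startswith("L") and not c.isspace()
    )
    return {
        f"U+{ord(c):04X} {unicodedata.name(c, 'UNKNOWN')}": n
        for c, n in counts.most_common()
    }

# Note: U+02BC has general category Lm (a *letter*), so it will not appear
# in this inventory -- exactly the property that makes it tokenize well.
sample = "\u03c4\u03bf\u1fe6\u03c4\u02bc \u1f10\u03c3\u03c4\u03af\u03bd\u0387 \u03b4\u1fbd."
for desc, n in nonletter_inventory(sample).items():
    print(n, desc)
```

Run over a whole text, the resulting list is short, and each entry can be accepted or flagged for normalization.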

@fbaumgardt:

"The solution that Giuseppe and I devised in Leipzig was to identify the segments in Betacode, transform them both with correct and incorrect rules and replace the incorrect Unicode segments with the correct ones. This, I’d like to stress, was a solution to fix the texts that you have already processed further and where a more fundamental approach would result in loss of your work.

In general I suggested normalizing the Betacode before transforming it, because the root of the problem was that there appear to be dialects in circulation that cause data corruption during the transformation. During the Hackathon, Hugh and others were in agreement, if I recall correctly.

The specific issues that we have noted so far were due to a distinction between wide and regular characters that the transformer ignored. I don't remember the exact Unicode code points, but they should be easy to find."
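One concrete dialect difference that normalization could smooth over: classic TLG betacode writes Greek letters as capitals (LO/GOS), while later usage is lowercase (lo/gos). A minimal, hypothetical normalization pass, offered as a sketch rather than the rule set the Leipzig solution actually used:

```python
def normalize_betacode(s):
    r"""Lowercase betacode letters; the "*" capital marker and the diacritic
    symbols ) ( / \ = | + are not alphabetic, so they pass through untouched."""
    return "".join(c.lower() if c.isalpha() else c for c in s)

normalize_betacode("*LO/GOS")  # -> "*lo/gos"
```

A real normalizer would also have to reconcile the wide/regular character distinction Frederik mentions once the offending code points are identified.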

@balmas:

"A few bits of additional background information which may or may not be useful.

  1. the decision to use 02BC for apostrophe was an explicit one, guided by Helma Dik and others. This is in the Alpheios beta-to-unicode XSLT transforms, but I don't know if the Java-based EpiDoc transformer used by P4 and others was ever updated. I thought I had at least done it in the Perseus version but I would have to check. Not sure it matters at this point.

  2. As far as normalization of the BetaCode is concerned, I agree wholeheartedly with that approach. That said, I'm not sure it will be possible to normalize the Betacode 100% accurately for this particular problem, as it's not only a question of different dialects but also of incorrect data entry choices. When I was first working on the transformations of the texts, I asked Helma Dik if there were any rules we could follow to determine whether something should be transformed as an apostrophe or a smooth breathing. In particular, I asked:

  3. At the time, after conferring with Greg (@gregorycrane) and Helma, as there was a push to get the data up on GitHub in Unicode, we decided that it would be best to go ahead with the transform as it was and not try to correct for the data entry errors at the same time. But a further complicating problem with the Perseus texts in GitHub is that they weren't all converted in the same way. The ones that went through the Alpheios XSLT transform should at least use the right Unicode chars (e.g. including 02BC for the apostrophe) but didn't try to correct for data entry errors where a breathing was used for an apostrophe and vice versa. I don't really know what happened after that, but since my initial transform missed some of the betacode altogether, and that was processed separately using a different converter, I guess we probably have an even greater mixture of possibilities there now, unfortunately."

@gcelano:

"I would add that another problem can be, if I remember correctly the discussion I had with @fbaumgardt, the fact that the apostrophe can also stand for a closing parenthesis, which I guess in the Betacode was encoded as ")".

Since there seems to be uncertainty even about the fact that the apostrophe and the breathing mark have always been encoded with the same Unicode codepoint, what I was suggesting, i.e., preliminary checking the texts for all punctuation marks, might be useful to get a sense of the situation.

In any case, relying on what @balmas writes, I would distinguish the following cases:

  1. The apostrophe mark stands for a smooth breathing mark:

This sign is almost always associated with the initial vowel of a word (when the vowel is a capital letter, the order of ")" is reversed). Particular cases are a breathing on rho, and a breathing on a vowel inside a word, indicating crasis.

  2. The apostrophe mark stands for the apostrophe:

The mark is at the end of a word or, more rarely, at the beginning. It is likely that in these cases a 100% accurate algorithm is difficult to write, especially if the entire word is just a vowel.

  3. The apostrophe mark stands for a closing parenthesis.

This case is maybe the most urgent one to solve, because the error is clearly visible. If the opening parenthesis has been encoded as a rough breathing, I think that only those cases can be safely identified where such a rough breathing is at the beginning of a word which does not start with a vowel (I do not remember other cases where the rough breathing/opening parenthesis might appear).
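The three cases above can be turned into a first-pass heuristic. The sketch below is purely illustrative (the name and rules are ours) and deliberately skips the special cases flagged in the discussion: breathing on rho, crasis, and single-vowel words, which it will misclassify as breathings:

```python
BETACODE_VOWELS = set("aehiouw")  # betacode vowels, lowercase convention

def classify_right_paren(word, idx):
    """Guess what the ')' at word[idx] stands for in a betacode token."""
    before = word[idx - 1] if idx > 0 else ""
    after = word[idx + 1] if idx + 1 < len(word) else ""
    # Case 1: smooth breathing on a word-initial vowel (a)/|smatos).
    if idx == 1 and word[0].lower() in BETACODE_VOWELS:
        return "smooth breathing"
    # Capitals reverse the order: the breathing precedes the vowel (*)a...).
    if before == "*" and after.lower() in BETACODE_VOWELS:
        return "smooth breathing"
    # Case 2: elision apostrophe at the very end of the word (d) for d').
    if idx == len(word) - 1:
        return "apostrophe"
    # Case 3: anything else is suspect -- possibly a real closing parenthesis.
    return "closing parenthesis"
```

Any token the heuristic cannot decide confidently would still need to go to a human review queue.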

To recap, these might be the priorities:

  1. verify that at least the apostrophe/smooth breathing are encoded with the same character across texts (if we decide that the situation is now such that we can no longer distinguish between them, since they were mistakenly conflated). Of course, if possible, it would be better to distinguish between them.

  2. Try to see whether most parenthesis errors can be solved.

  3. Later, see if something can be done to distinguish between apostrophe and smooth breathing (was the confusion due to the fact that in Betacode they are encoded in the same way, i.e., as ")"?)"
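Priority 1 (checking that the apostrophe/smooth breathing is encoded consistently across texts) can be approached with a census of the apostrophe-like codepoints mentioned in this discussion. The function and the suspect list below are our illustration, not an exhaustive inventory:

```python
from collections import Counter

# Codepoints that have been used interchangeably for the Greek apostrophe /
# spacing smooth breathing (list assembled for illustration).
SUSPECTS = {
    "\u0027": "APOSTROPHE",
    "\u02bc": "MODIFIER LETTER APOSTROPHE",
    "\u2019": "RIGHT SINGLE QUOTATION MARK",
    "\u1fbd": "GREEK KORONIS",
    "\u1fbf": "GREEK PSILI",
    "\u0313": "COMBINING COMMA ABOVE",
}

def apostrophe_census(text):
    """Count how often each suspect codepoint occurs in a text."""
    counts = Counter(c for c in text if c in SUSPECTS)
    return {f"U+{ord(c):04X} {SUSPECTS[c]}": n for c, n in counts.items()}
```

A census per file would show immediately which texts already use U+02BC consistently and which mix several of these characters.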
