
PW IAST corrections #419

Open
funderburkjim opened this issue Jul 22, 2018 · 37 comments


@funderburkjim
Contributor

In the PW dictionary, a relatively small number of words appear in IAST spellings; for example:
image

Some of these have spelling errors in the Cologne digitization:
image

This issue is devoted to correcting such spelling errors.

@gasyoun
Member

gasyoun commented Jul 22, 2018

Some of these have spelling errors

This is a result of manual checking, right?

@funderburkjim
Contributor Author

<is> tag.

These cases are identified in the current digitization by the <is> tag. Thomas originally coded these words because, as the print example shows, they appear with wide letter spacing. Thomas's
original coding was converted to the current <is> xml-type tag: <is>Agastiya</is>.

the cases of <is> tag

There are 4858 distinct text instances of the <is> tag.
We want to find spelling errors.
It is expected that many of these 4858 instances are spelled correctly. One way to make a separation
into cases which are probably correctly spelled and cases which possibly are incorrectly spelled is to
make use of a list of known correctly spelled words. For this purpose, we are using the headwords of MW (193,000 distinct such headwords).

After converting the IAST words to lower case, and then transcoding from IAST to slp1, we can compare
to the list of MW headwords. The result is that
3273 of the words are recognized as MW headwords (therefore probably correctly spelled)
1585 of the words are not so recognized, and therefore need further examination.
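
The comparison just described can be sketched roughly as follows. This is only an illustration, not the program actually used: the function names, the transliteration table, and the tiny headword set in the usage note are assumptions made for the example.

```python
import unicodedata

# IAST -> SLP1 table for lower-case input. Letters not listed here
# (a, i, u, e, o, k, g, ...) are spelled the same in both schemes.
IAST_TO_SLP1 = {
    "ai": "E", "au": "O",
    "kh": "K", "gh": "G", "ch": "C", "jh": "J",
    "ṭh": "W", "ḍh": "Q", "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "ā": "A", "ī": "I", "ū": "U", "ṛ": "f", "ṝ": "F", "ḷ": "x", "ḹ": "X",
    "ṃ": "M", "ḥ": "H", "ṅ": "N", "ñ": "Y", "ṭ": "w", "ḍ": "q", "ṇ": "R",
    "ś": "S", "ṣ": "z",
}
# Normalize the keys so that lookups agree with the normalized input.
_TABLE = {unicodedata.normalize("NFC", k): v for k, v in IAST_TO_SLP1.items()}


def iast_to_slp1(word):
    """Lower-case an IAST word and transcode it to SLP1.

    Two-letter sequences (ai, au, kh, ...) are tried before single letters,
    so a hiatus such as a+i would be misread as the diphthong ai.
    """
    text = unicodedata.normalize("NFC", word).lower()
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in _TABLE:
            out.append(_TABLE[pair])
            i += 2
        else:
            out.append(_TABLE.get(text[i], text[i]))
            i += 1
    return "".join(out)


def partition(iast_words, mw_headwords):
    """Split words into (recognized, not recognized) lists of (IAST, SLP1) pairs."""
    known, unknown = [], []
    for w in iast_words:
        slp1 = iast_to_slp1(w)
        (known if slp1 in mw_headwords else unknown).append((w, slp1))
    return known, unknown
```

For instance, `partition(["Indra", "Agastiya"], {"indra", "agastya"})` puts Indra in the first list and Agastiya in the second, since agastiya is not in the (toy) headword set.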

These two lists are in this gist

  • pwis_notmw.txt
  • pwis_mw.txt

Each line shows

  • the IAST spelling
  • the number of instances
  • the slp1 spelling

There is also an html file for the nonmw list. This contains a link to PW basic display for each PW headword where the questionable IAST spelling occurs.

@funderburkjim
Contributor Author

Suggestion for correction

Make a local copy of the pwis_notmw.txt file, and also of the pwis_notmw.html file.

Indicate corrections in the pwis_notmw.txt file by adding a 4th field with the correct spelling in SLP1 form.

A post-processing program can convert the SLP1 correction back to IAST. It is probably easier (for @drdhaval2785, at least) to enter the correction in SLP1 than to type the diacritics required in many of
the IAST spellings.

Then submit back to me the corrected file. I'll convert these to standard 'updateByLine' old/new corrections for PW, and install the corrections.

Don't worry about whether the correction is a typo or print error. Probably almost all are typos.

@drdhaval2785
Contributor

One more possibility to reduce the list.

  1. Unique German (or French?) tendency to use 'k' instead of 'c'.

E.g. pracetas - praketas
paYcagavya - paNkagavya
etc.

If we replace k with c and the result is found in the MW headword list, it can be listed as auto-corrected.

More observations to reduce list will be enumerated as and when I encounter such tendencies which are manageable programmatically

@drdhaval2785
Contributor

screenshot_20180723-142317_samsung internet

@funderburkjim
Contributor Author

Autocorrection 1

pwis_notmw1.txt has been added to the gist.

This contains the same list of 1585 words as in pwis_notmw.txt , but with 179 autocorrections.
The autocorrections are generated by the rules:

  • k -> c . As suggested above.
    • NOTE: the description of PWG IAST for palatals is also applicable to PW.
      One point there is that k' (k-acute) is used for 'c'. In the original AS coding, this k' would have been
      written as 'k4'; if the typist missed the accent, it would be just k.
  • g -> j. g' was PW's IAST for 'j'.
  • n -> R. (R is slp1) -- common to miss an underdot in IAST
  • vant -> vat, and mant -> mat: different convention in PWG than MW
  • ending 'ar' -> 'f' (slp1). different convention in PWG than MW

These rules were applied to slp1 spelling of each of the 1585; if one of the rules resulted in a new
spelling which matched an MW headword, then this was indicated in the output (pwis_notmw1.txt) by

  • adding the new slp1 spelling as a fourth field (this is the autocorrection)
  • Putting (Auto) as a fifth field, to distinguish it as an autocorrection.
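
A rough sketch of how such rules could be applied is given below. It is not the program that produced pwis_notmw1.txt. The names are hypothetical, and several details are assumptions: that single-letter rules are tried on one occurrence at a time as well as on all occurrences, that vant/mant are replaced wherever they occur, and that the fields of each line are separated by spaces.

```python
# Single-letter substitutions, applied to the SLP1 spelling.
CHAR_RULES = [("k", "c"), ("g", "j"), ("n", "R")]
# Stem conventions that differ between PWG/PW and MW.
STEM_RULES = [("vant", "vat"), ("mant", "mat")]


def rule_candidates(slp1):
    """Return the alternative spellings produced by the rules, in rule order."""
    cands = []
    for old, new in CHAR_RULES:
        positions = [i for i, ch in enumerate(slp1) if ch == old]
        for i in positions:  # change one occurrence at a time
            cands.append(slp1[:i] + new + slp1[i + 1:])
        if len(positions) > 1:  # and all occurrences together
            cands.append(slp1.replace(old, new))
    for old, new in STEM_RULES:
        if old in slp1:
            cands.append(slp1.replace(old, new))
    if slp1.endswith("ar"):  # final 'ar' -> vocalic r ('f' in SLP1)
        cands.append(slp1[:-2] + "f")
    return cands


def autocorrect(slp1, mw_headwords):
    """First rule-generated spelling that is an MW headword, else None."""
    for cand in rule_candidates(slp1):
        if cand in mw_headwords:
            return cand
    return None


def annotate(line, mw_headwords):
    """Append the autocorrection and the (Auto) marker as fields 4 and 5."""
    iast, count, slp1 = line.split()
    fix = autocorrect(slp1, mw_headwords)
    return line if fix is None else f"{line} {fix} (Auto)"
```

With a headword set containing pracetas, `annotate("Praketas 1 praketas", ...)` yields `Praketas 1 praketas pracetas (Auto)`; a line with no rule-based match is returned unchanged.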

@drdhaval2785 This should help a bit, by autocorrecting 11% of the cases. You could download the
pwis_notmw1.txt and work from it.

@gasyoun
Member

gasyoun commented Jul 24, 2018

'k4'; if the typist missed the accent, it would be just k

That explains a lot.

11% of the cases

Well done, well done.

Dhaval, thanks again for being back. This one still remains the major dictionary. Not widely used in India, because people tend to forget German, but the most academic one up to now.

@funderburkjim
Contributor Author

Autocorrection 2

This is based on an idea in the article How to Write a Spelling Corrector by Peter Norvig.

Consider example of yAjNavalkya, in slp1 spelling.

The idea is

  1. Find candidate spellings which are an 'edit distance' of 1 from the original spelling.
    (i.e., by replacing one character, removing one character, or inserting one character)
    There are 1127 such spellings, mostly nonsense: aAjNavalkya, yaAjNavalkya, yjNavalkya, etc.
  2. Check each of these spellings against the list of known MW headwords.
    • Declare success (and mark as (Auto1)) if there is exactly 1 known spelling among the candidates
    • Declare possible success (and mark as (Auto1X)) if there is more than 1 known spelling among the candidates
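
The two steps can be sketched as below. This follows the description above (replace, remove, insert; no transpositions) rather than the actual program, and the 49-letter SLP1 alphabet is an assumption. With that alphabet the candidate list for yAjNavalkya has 11 + 11×48 + 12×49 = 1127 entries when duplicates are counted, which agrees with the figure quoted.

```python
# Assumed SLP1 alphabet: 14 vowels, M and H, 25 stops/nasals, 4 semivowels,
# 3 sibilants and h (49 letters in all).
SLP1_ALPHABET = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"


def edits1(word, alphabet=SLP1_ALPHABET):
    """All spellings one edit away from `word` (duplicates are kept)."""
    deletes = [word[:i] + word[i + 1:] for i in range(len(word))]
    replaces = [word[:i] + c + word[i + 1:]
                for i in range(len(word)) for c in alphabet if c != word[i]]
    inserts = [word[:i] + c + word[i:]
               for i in range(len(word) + 1) for c in alphabet]
    return deletes + replaces + inserts


def classify(word, mw_headwords):
    """Return (matching headwords, marker) for an unrecognized spelling."""
    hits = sorted(set(edits1(word)) & set(mw_headwords))
    if len(hits) == 1:
        return hits, "(Auto1)"
    if len(hits) > 1:
        return hits, "(Auto1X)"
    return hits, ""
```

For example, against a headword set containing agastIya and agastya, `classify("agastiya", ...)` returns both spellings with the marker (Auto1X).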

Results:

See pwis_notmw2.txt

  • We started with a list of 1585
  • 179 were previously autocorrected and marked (Auto), as described in previous comments
  • 580 of the remaining are now autocorrected and marked (Auto1). Probably most of these corrections
    are right.
  • 517 of the remaining have multiple autocorrections, and are marked (Auto1X). Probably one of
    the possible corrections is right.
  • 310 are completely unmarked thus far.

@gasyoun
Member

gasyoun commented Jul 31, 2018

N 1 n X,O,x,o,F,f,nI,nO,nA,nE,an,in,nU,ni,na,nf,nu,A,E,I,U,a,e,nF,i,no,u (Auto1X)

How can one help here?

Visṇu 1 visRu visru,vizRu (Auto1X)
Vrṣṇi 1 vrzRi vArzRi,vfzRi (Auto1X)
Yogint 1 yogint yogin,yoginI,yoginy (Auto1X)

The method is simple. The results - promising. What's the wanted output format?

@funderburkjim
Contributor Author

LevAuto

An additional step of autosuggestion was carried out on the remaining 300+ items of pwis_notmw2.txt that have no suggestions by the previous steps.

An example will illustrate the conceptually simple process:
One of these 300+ is *Maṅguśrī 1 maNguSrI. Now consider the unknown spelling maNguSrI in light of all MW headwords, and find the headword or headwords which are closest in spelling to maNguSrI. Here, the closest headwords are those with minimal Levenshtein edit distance. Thus
we must go through a process of examining the edit distance of each of the (approximately 200,000) MW headwords from the word maNguSrI, and choose those headwords with the smallest possible
edit distance from maNguSrI. This list is used for the suggestion. In this case, the answer
turns out to be the headwords aNgurI,maNgura,maDuSrI,maYjuSrI. In this case, the suggestion list
contains what is almost surely the right spelling correction maYjuSrI.
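
The brute-force version of this search is easy to write down, as in the sketch below. It is a plain dynamic-programming Levenshtein distance computed against every headword, not the Levenshtein automaton used for the actual results, and the function names are made up for the example.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and replace."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # replace or keep
        prev = cur
    return prev[-1]


def closest_headwords(word, headwords):
    """Return (minimal distance, sorted headwords at that distance)."""
    best, hits = None, []
    for hw in headwords:
        d = levenshtein(word, hw)
        if best is None or d < best:
            best, hits = d, [hw]
        elif d == best:
            hits.append(hw)
    return best, sorted(hits)
```

Each of the four suggested headwords is at distance 2 from maNguSrI (for maYjuSrI: N to Y and g to j). The cost is one distance computation per headword for every unknown word, which is what makes the naive approach slow on a lexicon of this size.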

The results are shown in pwis_notmw3.txt.
The 300+ suggestions generated by this minimal edit distance technique are marked with (LevAuto).

While this technique is conceptually simple, it is computationally complex. In fact, the notmw3 LevAuto suggestions were generated by applying a Levenshtein Automaton built on top of the Pynini Python library developed by Kyle Gorman. The details of my application are in this pynini-learn repository.

As mentioned there, the current implementation does not appear efficient enough to be very useful with such a large 'lexicon' as the 200,000 MW headword list. Gorman held out the possibility of a more efficient algorithm in this comment.

funderburkjim added a commit to sanskrit-lexicon/sanskrit-lexicon.github.io that referenced this issue Aug 31, 2018
@funderburkjim
Contributor Author

How can one help here?
The method is simple. The results - promising. What's the wanted output format?

There is now a file in the PWK repository where corrections can be entered: pwis_notmw3_correctionform.txt. Here is a link to a brief readme.

@drdhaval2785 If you already have corrections in some other format, I'll be glad to transfer them
to the pwis_notmw3_correctionform.txt file.

@gasyoun Does this procedure satisfy your needs ?

@drdhaval2785
Contributor

Corrections still are not prepared / installed.
Saw Metron. Agastiya's. in webpage today.

@funderburkjim
Contributor Author

funderburkjim commented Dec 19, 2020

Example

The readme at https://github.com/sanskrit-lexicon/PWK/tree/master/pw_iast gives some background on
anticipated usage. The objective is to replace various suspicious spellings in the PW
dictionary with modern IAST spellings.

Here's how I would proceed to deal with the 'Agastiya' example.

open pwis_notmw3_correctionform.txt

link = https://github.com/sanskrit-lexicon/PWK/blob/master/pw_iast/pwis_notmw3_correctionform.txt
Find 'Agastiya':

Case 0023: Agastiya 4 agastiya : Corrected_SLP1=
; Suggestion method: (Auto1X)   Corrected by: 
; Suggestions: agastIya,agastya

open pwis_notmw.html in browser.

Link is https://sanskrit-lexicon.github.io/PWK/pwis_notmw.html

and find 'Agastiya':

Agastiya | 4 | agastiya | OrvaSeya kalaSaBU kumBaBU kumBasaMBava

The 4 words ('OrvaSeya', etc.) are SLP1 spellings of the headwords under which the suspicious word Agastiya appears.

Examine instances

First, look up OrvaSeya in PW dictionary using one of the displays
image

Examine scanned image to see what print actually is:

image

Decide modern IAST spelling:

I would decide 'Agastya' --- PW's IAST typically uses 'j' when modern IAST uses 'y'.

Examine other uses:

kalaSaBU 
image

kumBaBU 
image

kumBasaMBava
image

Choose Answer

All the cases are the same: print has 'Agastja'; the modern form is 'Agastya'. The current pw.txt digitization has
'Agastiya'.
The solution is to change it to 'Agastya'.

Fill in Correctionform for Case 23

Edit pwis_notmw3_correctionform.txt (https://github.com/sanskrit-lexicon/PWK/blob/master/pw_iast/pwis_notmw3_correctionform.txt)

  • Fill in 'Corrected_SLP1' to 'Agastya'
  • Fill in 'Corrected by:' to funderburkjim (my Github user name)
Case 0023: Agastiya 4 agastiya : Corrected_SLP1= Agastya
; Suggestion method: (Auto1X)   Corrected by: funderburkjim
; Suggestions: agastIya,agastya
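
For the later installation step, the form could be read back with something like the sketch below. It assumes every case has exactly the three-line layout shown in this example; the regular expressions and function name are hypothetical and have not been checked against the whole file.

```python
import re

CASE_RE = re.compile(
    r"^Case (?P<case>\d+): (?P<iast>\S+) (?P<count>\d+) (?P<slp1>\S+) : "
    r"Corrected_SLP1=\s*(?P<corrected>\S*)\s*$"
)
METHOD_RE = re.compile(
    r"^; Suggestion method:\s*(?P<method>\S*)\s*Corrected by:\s*(?P<by>\S*)"
)


def parse_cases(text):
    """Return one dict per 'Case NNNN:' block found in the form text."""
    cases = []
    for line in text.splitlines():
        m = CASE_RE.match(line)
        if m:
            cases.append(dict(m.groupdict(), method="", corrected_by="",
                              suggestions=[]))
            continue
        if not cases:
            continue
        m = METHOD_RE.match(line)
        if m:
            cases[-1]["method"] = m.group("method")
            cases[-1]["corrected_by"] = m.group("by")
        elif line.startswith("; Suggestions:"):
            items = line.split(":", 1)[1].strip()
            cases[-1]["suggestions"] = [s for s in items.split(",") if s]
    return cases
```

Cases whose `corrected` field is still empty are the ones not yet filled in, so `[c for c in parse_cases(text) if not c["corrected"]]` gives the remaining work.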

Commit the change (commit message = 'Case 23').

@funderburkjim
Contributor Author

installing corrections

Filling in the correction form does not install the corrections to pw.txt. Installation would be a separate step done
by either @drdhaval2785 or @funderburkjim .

This is a slow process, but looks reliable.

There are 1585 cases.

The end result would be improvement to modern IAST

@gasyoun
Member

gasyoun commented Dec 19, 2020

I would decide 'Agastya' --- PW's IAST typically uses 'j' when modern IAST uses 'y'.

Exactly.

All the cases are the same: print has 'Agastja', Modern form is 'Agastiya'. Current pw.txt digitization has
'Agastiya'.

Modern form is 'Agastya', and not 'Agastiya' only.

gasyoun added the IAST label Dec 19, 2020
@funderburkjim
Contributor Author

Wonder if @SergeA would have interest in working on this?

@funderburkjim
Contributor Author

Modern form is 'Agastya'
👍
Have corrected comment.

@Andhrabharati

Andhrabharati commented Aug 16, 2021

If you already have corrections in some other format, I'll be glad to transfer them
to the pwis_notmw3_correctionform.txt file.

Wonder if @SergeA would have interest in working on this?

Can I poke-in my nose in this, if @funderburkjim is willing to work on it, if given in 'some other format'?
[It's hardly ~2 days' work for me.]

@Andhrabharati

This is one of the many cases that are "counter" to what was replied by @drdhaval2785 and @gasyoun against my posting somewhere [that I do not get @funderburkjim's response for months together, while others get almost 'immediately'], that my posts are "heavy-meals" and not easily chewable/digestible as are all others' postings.

Here, I just wrote a single sentence, and yet to get some/any response from Jim (for almost 2 years now)!

@funderburkjim
Contributor Author

funderburkjim commented Jun 28, 2023

@Andhrabharati Obviously, I lost track of your question here.

Please provide a couple of examples of what you mean by some other format.

When the current work with you on the Grassmann dictionary is complete, I will examine the feasibility of working with you on this.

@Andhrabharati

Andhrabharati commented Jun 28, 2023

Obviously, my resolutions would be non-slp1 but in plain iast.

I see that some of these are abbr.s with a dot after the ending is-tag. How to deal with them, something like <ab><is>Z.</is></ab>?

@Andhrabharati

Andhrabharati commented Jun 28, 2023

Now that I saw many later posts at this forum, I think the regular

old: yyy
new: zzz

would be the way for me to take it up, wrt the latest pw.txt lines (at csl-orig); probably with just the is-tagged word [yyy/zzz] (as at times the line could be quite long).

@Andhrabharati

Andhrabharati commented Jun 28, 2023

Just had a "look" inside the pw.txt for the <is>-strings and noticed ~10k instances of <is>…</is> strings inside the italics {%…%}; whereas the print has vast majority of them (if not all) in normal-face (font) [& wide-spaced].
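
A count of this kind can be reproduced with a short script along the following lines. It is only a sketch: it assumes that {%…%} spans neither nest nor run across line breaks, and the function name is hypothetical.

```python
import re

ITALIC_RE = re.compile(r"\{%(.*?)%\}")  # italic span {%...%}
IS_RE = re.compile(r"<is>(.*?)</is>")   # wide-spaced IAST word


def is_inside_italics(line):
    """Return the contents of every <is>...</is> found inside {%...%}."""
    found = []
    for italic in ITALIC_RE.finditer(line):
        found.extend(IS_RE.findall(italic.group(1)))
    return found
```

Summing `len(is_inside_italics(line))` over the lines of pw.txt would give the number of such instances under these assumptions.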

Also many more "bad"-tagging/marking of various types are seen.

This calls for a full overhaul of the data, and I get reminded of the earlier reaction of Thomas, if I say anything more!!
[I had stopped working on pw after the <ls> marking those days, seeing Thomas's reaction on my post.]

I see not much worth taking up correction of just the <is>-tagged iast portion.
But, is Jim ready/willing now to take up a collaborative work to "bring" a good-shape to pw.txt?

@Andhrabharati

Andhrabharati commented Jun 29, 2023

Did a quick checking for these "notmw" words in MW, and found quite many to be present in MW!

Just a few small changes/additions in the "search pattern" would eliminate all such ones from the list!!

@Andhrabharati

When tried looking for some random words in the list, noticed that the CDSL pwk scan pages are not so clear, as compared to my copy. [Probably, it could be a reason for the typo errors.]

Probably, these scans could be replaced, for the benefit of any and everyone.
[This point was discussed elsewhere earlier wherein I mentioned having good scans of the vol.s 2-7, and now I've all the 7 volumes in my possession.]

@funderburkjim
Contributor Author

@Andhrabharati Am ready to begin working with you in this issue related to <is> tag in PW.

  1. You identify a problem with markup re 'is' tag: <is>X</is> within italics {%Y%}.
  2. You suggest the need for some 'new' rules for auto-correction.

I think we should restrict this study to is-tag, if possible (so that this good issue can be solved).

What do you need from me?

@Andhrabharati

I had progressed much ahead from this IAST part in pw in the past couple of days, @funderburkjim !!

I shall post my work in due course of time, for your perusal.

If you are willing, pl. give me the links to the conversion scripts to Devanagari and iast of the full file, to be run from a command line locally.

I am about to start proofing full headwords, as some errors were noted while working on this pw.txt
[BTW, I had already marked grouped entries all throughout the file, just like in GRA and MW.]

@Andhrabharati

My present work is covering all kinds of markups and listing the abbr. and ls entries.

@funderburkjim
Contributor Author

funderburkjim commented Jul 17, 2023

links to the conversion scripts to Devanagari and iast of the full file, to be run from a command line locally.

I could provide a python script that would run as follows

python pw_convert.py slp1,iast pw.txt pw_iast.txt
python pw_convert.py iast,slp1 pw_iast.txt pw_slp1.txt

This would convert the metaline  (k1 and k2) and all the {#X#} from slp1 to iast, and back.
And similarly for 'deva' instead of 'iast'
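
To make the idea concrete, the core of the slp1-to-iast direction for the {#X#} spans might look like the sketch below. This is not the pw_convert.py script itself: the function names are hypothetical, the metalines are left alone, and accents or other special characters inside {#X#} are simply passed through.

```python
import re
import unicodedata

# SLP1 -> IAST for the letters that differ; all others are unchanged.
SLP1_TO_IAST = {
    "A": "ā", "I": "ī", "U": "ū", "f": "ṛ", "F": "ṝ", "x": "ḷ", "X": "ḹ",
    "E": "ai", "O": "au", "M": "ṃ", "H": "ḥ",
    "K": "kh", "G": "gh", "N": "ṅ", "C": "ch", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "T": "th", "D": "dh", "P": "ph", "B": "bh",
    "S": "ś", "z": "ṣ",
}

SANSKRIT_RE = re.compile(r"\{#(.*?)#\}")


def slp1_to_iast(text):
    """Transcode an SLP1 string to IAST, one letter at a time."""
    out = "".join(SLP1_TO_IAST.get(ch, ch) for ch in text)
    return unicodedata.normalize("NFC", out)


def convert_line(line):
    """Transcode only the {#...#} spans of one line of pw.txt."""
    return SANSKRIT_RE.sub(
        lambda m: "{#" + slp1_to_iast(m.group(1)) + "#}", line)
```

Because SLP1 uses exactly one character per sound, this direction is a simple per-character lookup; the reverse direction needs longest-match handling of digraphs such as kh or ai.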

Is this what you request?

@Andhrabharati

Yes, exactly.

@Andhrabharati

Andhrabharati commented Jul 17, 2023

Probably, you could leave the metalines as is, as I am not going to touch that portion.

My reading will be limited to the header and body portions alone.

The metalines would have to be generated from the header portion, as done in case of GRA.

@funderburkjim
Contributor Author

@Andhrabharati. Further discussions found in sanskrit-lexicon/PWK#95.

We can leave this #419 issue open until the work in the PWK repository is completed.

@Andhrabharati

Andhrabharati commented Jul 18, 2023

Here are some small pieces from my work, wrt the <is> elements--

* marked items (diff. in CSL & AB texts):

image

abbr. type items:

I see that some of these are abbr.s with a dot after the ending is-tag. How to deal with them, something like <ab><is>Z.</is></ab>?

<is n="Adhyātmarāmāyaṇa">Adhyātmar.</is>
<is n="Adhyāya">Adhy.</is>
<is n="Agni">A.</is>
<is n="Āgnīdhra">Ā.</is>
<is n="Agniṣṭoma">A.</is>
<is n="Aṅga">A.</is>
<is n="Apsaras">A.</is>
<is n="Arka">A.</is>
<is n="Āśvalāyana">A.</is>
<is n="Atharvan">Ath.</is>
<is n="Avanti">Av.</is>
<is n="Ayodhyā">A.</is>
<is n="Bālāhaka">B.</is>
<is n="Bhaṇḍīratha">Bh.</is>
<is n="Bhārgava">B.</is>
<is n="Bhūliṅgā">Bh.</is>
<is n="Brahman">B.</is>
<is n="Brahman">Br.</is>
<is n="Brahmaṇācchaṃsin">Br.</is>
<is n="Cakora">C.</is>
<is n="Camasa">C.</is>
<is n="Dhanvantari">Dh.</is>
<is n="Dūrvā">D.</is>
<is n="Dvārakaukas">Dv.</is>
<is n="Dvāravatī">Dv.</is>
<is n="Gandharva">G.</is>
<is n="Gaṇeśa">G.</is>
<is n="Gārgya">G.</is>
<is n="Himālaya">H.</is>
<is n="Indra">I.</is>
<is n="Jagatī">J.</is>
<is n="Jamadagni">J.</is>
<is n="Kālī">K.</is>
<is n="Kānyakubja">Kānyak.</is>
<is n="Kārikā">K.</is>
<is n="Karṇāṭa">K.</is>
<is n="Kāśi">K.</is>
<is n="Kāśmīra">K.</is>
<is n="Kosala">K.</is>
<is n="Kuḍava">K.</is>
<is n="Likhita">L.</is>
<is n="Makara">M.</is>
<is n="Manu">M.</is>
<is n="Marut">M.</is>
<is n="Mathurā">M.</is>
<is n="Nairañjanā">N.</is>
<is n="Nalikā">N.</is>
<is n="Narmadā">N.</is>
<is n="Nīlakaṇṭha">Nīlak.</is>
<is n="Pañcālā">P.</is>
<is n="Paphaka">P.</is>
<is n="Pavamāna Stotra">P. St.</is>
<is n="Puronuvākyā">P.</is>
<is n="Pūru">P.</is>
<is n="Pūṣan">P.</is>
<is n="Rāma">R.</is>
<is n="Rāmāyaṇa">R.</is>
<is n="Revatī">R.</is>
<is n="Śākaṭāyana">Śāk.</is>
<is n="Sāman">S.</is>
<is n="Śaṅkha">Ś.</is>
<is n="Sarasvatī">S.</is>
<is n="Savitar">S.</is>
<is n="Sāyaṇa">Sāy.</is>
<is n="Soma">S.</is>
<is n="Śūdra">Ś.</is>
<is n="Sumantra">S.</is>
<is n="Tārkṣya">T.</is>
<is n="Udgātar">U.</is>
<is n="Udumbara">U.</is>
<is n="Vaṅkara">V.</is>
<is n="Vāsudeva">Vās.</is>
<is n="Vāyu">V.</is>
<is n="Veda">V.</is>
<is n="Vidura">V.</is>
<is n="Viśvarūpa">V.</is>
<is n="Yajus">Y.</is>
<is n="Yayāti">Y.</is>
<is n="Yuvanāśva">Yuv.</is>
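
The list shows that one abbreviation often stands for several names (A. alone covers Agni, Agniṣṭoma, Aṅga and others), so the n attribute is what disambiguates each instance. A small sketch for tabulating this from text marked up as above; the function name is hypothetical and the pattern assumes the attribute is always written exactly as n="...":

```python
import re
from collections import defaultdict

IS_N_RE = re.compile(r'<is n="([^"]*)">([^<]*)</is>')


def expansions_by_abbreviation(text):
    """Map each abbreviation to the distinct full forms given in n="..."."""
    table = defaultdict(list)
    for full, abbr in IS_N_RE.findall(text):
        if full not in table[abbr]:
            table[abbr].append(full)
    return dict(table)
```

Abbreviations that map to more than one full form are the ones where the tooltip text cannot be derived from the abbreviation alone.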

@funderburkjim
Contributor Author

@Andhrabharati I do not find any of your examples of .</is>
Can you provide the line-numbers of your examples?

@funderburkjim
Contributor Author

I should have made previous comment in sanskrit-lexicon/PWK#95

@Andhrabharati

Pl. see my initial post above --

it isn't .</is> but is </is>. in the CDSL file and I had brought the dot inside the is-tagging (and of course, there were some typos as well that I had corrected).

@funderburkjim
Contributor Author

How to deal with them, something like <ab><is>Z.</is></ab>?

@Andhrabharati
In your pw versions discussed at sanskrit-lexicon/PWK#95,
you introduce markup such as <is n="TOOLTIP">Z.</is> (also <ab n="TOOLTIP">Z.</ab>, etc.),
and the displays provide html so TOOLTIP is available to users.

This seems to answer the above question.
