PW IAST corrections #419

funderburkjim · 2018-07-22T21:12:25Z

In the PW dictionary, a relatively small number of words appear in IAST spellings; for example

Some of these have spelling errors in the Cologne digitization:

This issue is devoted to correcting such spelling errors.

gasyoun · 2018-07-22T21:21:34Z

Some of these have spelling errors

This is a result of manual checking, right?

funderburkjim · 2018-07-22T21:32:56Z

<is> tag.

These cases are identified in the current digitization by the <is> tag. The reason Thomas originally coded these words is that, as the print example shows, they appear with wide letter spacing. Thomas's
original coding was converted to the current <is> xml-type tag: <is>Agastiya</is>.

the cases of `<is>` tag

There are 4858 distinct text instances of the <is> tag.
We want to find spelling errors.
It is expected that many of these 4858 instances are spelled correctly. One way to make a separation
into cases which are probably correctly spelled and cases which possibly are incorrectly spelled is to
make use of a list of known correctly spelled words. For this purpose, we are using the headwords of MW (193,000 distinct such headwords).

After converting the IAST words to lower case, and then transcoding from IAST to slp1, we can compare
to the list of MW headwords. The result is that
3273 of the words are recognized as MW headwords (therefore probably correctly spelled)
1585 of the words are not so recognized, and therefore need further examination.

These two lists are in this gist

pwis_notmw.txt
pwis_mw.txt

Each line shows

the IAST spelling
the number of instances
the slp1 spelling

There is also an html file for the nonmw list. This contains a link to PW basic display for each PW headword where the questionable IAST spelling occurs.
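The comparison step described above can be sketched roughly as follows. This is only an illustration: the transliteration table is a simplified IAST-to-SLP1 mapping, the function names `iast_to_slp1` and `split_by_headwords` are hypothetical, and the headword set is a toy stand-in for the real MW list. It also assumes that "ai" and "au" are always diphthongs.

```python
import unicodedata

# Simplified IAST -> SLP1 table (assumption: standard IAST, no accents).
# Plain ASCII letters that are identical in both schemes pass through unchanged.
_PAIRS = {
    "ā": "A", "ī": "I", "ū": "U", "ṛ": "f", "ṝ": "F", "ḷ": "x", "ḹ": "X",
    "ai": "E", "au": "O", "ṃ": "M", "ḥ": "H",
    "kh": "K", "gh": "G", "ṅ": "N", "ch": "C", "jh": "J", "ñ": "Y",
    "ṭh": "W", "ṭ": "w", "ḍh": "Q", "ḍ": "q", "ṇ": "R",
    "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "ś": "S", "ṣ": "z",
}
IAST_TO_SLP1 = {unicodedata.normalize("NFC", k): v for k, v in _PAIRS.items()}


def iast_to_slp1(word):
    """Lower-case an IAST word and transcode it to SLP1 (longest match first)."""
    text = unicodedata.normalize("NFC", word).lower()
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in IAST_TO_SLP1:
            out.append(IAST_TO_SLP1[pair])
            i += 2
        else:
            out.append(IAST_TO_SLP1.get(text[i], text[i]))
            i += 1
    return "".join(out)


def split_by_headwords(iast_words, mw_headwords):
    """Separate words whose SLP1 form is a known headword from the rest."""
    known, unknown = [], []
    for word in iast_words:
        slp1 = iast_to_slp1(word)
        (known if slp1 in mw_headwords else unknown).append((word, slp1))
    return known, unknown
```

With the real data, `mw_headwords` would be the set of MW headwords in SLP1, and the two returned lists would correspond to pwis_mw.txt and pwis_notmw.txt.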

funderburkjim · 2018-07-22T21:38:11Z

Suggestion for correction

Make a local copy of the pwis_notmw.txt file, and also of the pwis_notmw.html file.

Indicate corrections in the pwis_notmw.txt file by adding a 4th field with the correct spelling in SLP1 form.

A post-processing program can convert the SLP1 correction back to IAST. It is probably easier (for @drdhaval2785, at least) to enter the correction in SLP1 than to type the diacritics required in many of
the IAST spellings.

Then submit back to me the corrected file. I'll convert these to standard 'updateByLine' old/new corrections for PW, and install the corrections.

Don't worry about whether the correction is a typo or print error. Probably almost all are typos.
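As a rough illustration of the post-processing idea, the sketch below reads the proposed 4th field and converts it from SLP1 back to IAST. It assumes whitespace-separated fields and a standard SLP1 table; `slp1_to_iast` and `corrected_iast` are hypothetical names, and capitalizing the first letter is an assumption based on the capitalized forms in the list.

```python
import unicodedata

# Standard SLP1 -> IAST table (each SLP1 character maps to one IAST string).
SLP1_TO_IAST = {
    "a": "a", "A": "ā", "i": "i", "I": "ī", "u": "u", "U": "ū",
    "f": "ṛ", "F": "ṝ", "x": "ḷ", "X": "ḹ",
    "e": "e", "E": "ai", "o": "o", "O": "au", "M": "ṃ", "H": "ḥ",
    "k": "k", "K": "kh", "g": "g", "G": "gh", "N": "ṅ",
    "c": "c", "C": "ch", "j": "j", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "t": "t", "T": "th", "d": "d", "D": "dh", "n": "n",
    "p": "p", "P": "ph", "b": "b", "B": "bh", "m": "m",
    "y": "y", "r": "r", "l": "l", "v": "v",
    "S": "ś", "z": "ṣ", "s": "s", "h": "h",
}


def slp1_to_iast(word, capitalize=True):
    """Transcode an SLP1 word to IAST, optionally capitalizing the first letter."""
    iast = "".join(SLP1_TO_IAST.get(ch, ch) for ch in word)
    iast = unicodedata.normalize("NFC", iast)
    return iast[:1].upper() + iast[1:] if capitalize else iast


def corrected_iast(line):
    """Return the IAST form of the 4th field, or None if no correction was entered."""
    fields = line.split()
    if len(fields) < 4:
        return None
    return slp1_to_iast(fields[3])
```

For example, a line edited to `Visṇu 1 visRu vizRu` would yield the IAST correction for the SLP1 form `vizRu`.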

drdhaval2785 · 2018-07-23T08:56:46Z

One more possibility to reduce the list.

Unique german (or french?) tendency to use 'k' instead of 'c'.

E.g. pracetas - praketas
paYcagavya - paNkagavya
etc.

If we make replacement from k to c and find the word in MW headword list, it can be listed as auto corrected.

More observations to reduce list will be enumerated as and when I encounter such tendencies which are manageable programmatically

drdhaval2785 · 2018-07-23T08:57:44Z

funderburkjim · 2018-07-23T22:44:09Z

Autocorrection 1

pwis_notmw1.txt has been added to the gist.

This contains the same list of 1585 words as in pwis_notmw.txt , but with 179 autocorrections.
The autocorrections are generated by the rules:

k -> c . As suggested above.
- NOTE: the description of PWG iast for palatals is also applicable to PW.
  One of these is that k' (k-acute) is used for 'c'; In the original AS coding, this k' would have been
  written as 'k4'; if the typist missed the accent, it would be just k.
g -> j. g' was PW's IAST for 'j'.
n -> R. (R is slp1) -- common to miss an underdot in IAST
vant -> vat, and mant -> mat: different convention in PWG than MW
ending 'ar' -> 'f' (slp1). different convention in PWG than MW

These rules were applied to slp1 spelling of each of the 1585; if one of the rules resulted in a new
spelling which matched an MW headword, then this was indicated in the output (pwis_notmw1.txt) by

adding the new slp1 spelling as a fourth field (this is the autocorrection)
Putting (Auto) as a fifth field, to distinguish it as an autocorrection.

@drdhaval2785 This should help a bit, by autocorrecting 11% of the cases. You could download the
pwis_notmw1.txt and work from it.
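The rule-based step could look something like the sketch below. It is not the program that produced pwis_notmw1.txt: the thread does not say whether a rule is applied to one occurrence or to all occurrences of a letter, so this version tries both, and the names `rule_candidates` and `autocorrect` are illustrative.

```python
# Substitution rules from the list above, applied to SLP1 spellings.
RULES = [
    ("k", "c"),       # missed acute on k'
    ("g", "j"),       # missed acute on g'
    ("n", "R"),       # missed underdot
    ("vant", "vat"),  # PWG vs MW stem convention
    ("mant", "mat"),  # PWG vs MW stem convention
]


def single_replacements(word, old, new):
    """Yield every spelling obtained by replacing one occurrence of old with new."""
    start = word.find(old)
    while start != -1:
        yield word[:start] + new + word[start + len(old):]
        start = word.find(old, start + 1)


def rule_candidates(slp1):
    """Collect candidate spellings produced by any single rule."""
    candidates = set()
    for old, new in RULES:
        candidates.update(single_replacements(slp1, old, new))
        if old in slp1:
            candidates.add(slp1.replace(old, new))  # all occurrences at once
    if slp1.endswith("ar"):
        candidates.add(slp1[:-2] + "f")  # final 'ar' -> vocalic r
    candidates.discard(slp1)
    return candidates


def autocorrect(slp1, headwords):
    """Return the rule-generated candidates that are known headwords."""
    return sorted(c for c in rule_candidates(slp1) if c in headwords)
```

A non-empty result would be written as the 4th field, with (Auto) as the 5th.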

gasyoun · 2018-07-24T04:49:07Z

'k4'; if the typist missed the accent, it would be just k

That explains a lot.

11% of the cases

Well done, well done.

Dhaval, thanks again for being back. This one still remains the major dictionary. Not widely used in India, because people tend to forget German, but the most academic one up to now.

funderburkjim · 2018-07-30T23:12:27Z

Autocorrection 2

This is based on an idea in article How to Write a Spelling Corrector by Peter Norvig.

Consider example of yAjNavalkya, in slp1 spelling.

The idea is

Find candidate spellings which are an 'edit distance' of 1 from the original spelling.
(i.e., by replacing one character, removing one character, or inserting one character)
There are 1127 such spellings, mostly nonsense: aAjNavalkya, yaAjNavalkya, yjNavalkya, etc.
check each of these spellings against list of known MW headwords.
- Declare success (and mark as (Auto1)) if there is exactly 1 known spelling among the candidates
- Declare possible success (and mark as (Auto1X)) if there are more than 1 known spellings among the candidates
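A minimal sketch of this idea, adapted from the `edits1` function in Norvig's article but without transpositions, since only replacement, removal and insertion are listed above. The 49-character SLP1 alphabet and the name `suggest` are assumptions for illustration; the headword set in practice would be the MW list.

```python
# Assumed SLP1 alphabet: 14 vowels, anusvara and visarga, 33 consonants.
SLP1_ALPHABET = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"


def edits1(word):
    """All distinct spellings one delete, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + c + right[1:]
                for left, right in splits if right
                for c in SLP1_ALPHABET]
    inserts = [left + c + right for left, right in splits for c in SLP1_ALPHABET]
    return set(deletes + replaces + inserts) - {word}


def suggest(word, headwords):
    """Return (known candidates, marker) following the Auto1 / Auto1X convention."""
    found = sorted(edits1(word) & headwords)
    if len(found) == 1:
        marker = "(Auto1)"
    elif found:
        marker = "(Auto1X)"
    else:
        marker = ""
    return found, marker
```

For instance, `suggest("agastiya", headwords)` would return both `agastIya` and `agastya` with the marker (Auto1X) if both are in the headword set.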

Results:

See pwis_notmw2.txt

We started with a list of 1585
179 were previously autocorrected and marked (Auto), as described in previous comments
580 of the remaining are now autocorrected and marked (Auto1). Probably most of these corrections
are right.
517 of the remaining have multiple autocorrections, and are marked (Auto1X). Probably one of
the possible corrections is right.
310 are completely unmarked thus far.

gasyoun · 2018-07-31T06:28:34Z

N 1 n X,O,x,o,F,f,nI,nO,nA,nE,an,in,nU,ni,na,nf,nu,A,E,I,U,a,e,nF,i,no,u (Auto1X)

How can one help here?

Visṇu 1 visRu visru,vizRu (Auto1X)
Vrṣṇi 1 vrzRi vArzRi,vfzRi (Auto1X)
Yogint 1 yogint yogin,yoginI,yoginy (Auto1X)

The method is simple. The results - promising. What's the wanted output format?

funderburkjim · 2018-08-31T20:13:54Z

LevAuto

An additional step of autosuggestion was carried out on the remaining 300+ items of pwis_notmw2.txt that have no suggestions by the previous steps.

An example will illustrate the conceptually simple process:
One of these 300+ is *Maṅguśrī 1 maNguSrI. Now consider the unknown spelling maNguSrI in light of all MW headwords, and find the headword or headwords which are closest in spelling to maNguSrI. Here, the closest headwords are those with minimal Levenshtein edit distance. Thus
we must go through a process of examining the edit distance of each of the (approximately 200,000) MW headwords from the word maNguSrI, and choose those headwords with the smallest possible
edit distance from maNguSrI. This list is used for the suggestion. In this case, the answer
turns out to be the headwords aNgurI,maNgura,maDuSrI,maYjuSrI. In this case, the suggestion list
contains what is almost surely the right spelling correction maYjuSrI.

The results are shown in pwis_notmw3.txt.
The 300+ suggestions generated by this minimal edit distance technique are marked with (LevAuto).

While this technique is conceptually simple, it is computationally complex. In fact, the notmw3 LevAuto suggestions were generated by applying a Levenshtein Automaton built on top of the Pynini python library developed by Kyle Gorman. The details of my application are in this pynini-learn repository.

As mentioned there, the current implementation does not appear efficient enough to be very useful with such a large 'lexicon' as the 200,000 MW headword list. Gorman held out the possibility of a more efficient algorithm in this comment.
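For comparison, the conceptually simple version of the search can be written as a brute-force scan. This sketch is not the Levenshtein automaton and does not use Pynini; it computes the edit distance to every headword with the usual dynamic-programming recurrence. The names `levenshtein` and `closest_headwords` are illustrative.

```python
def levenshtein(a, b):
    """Classic edit distance (insert, delete, replace), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # replace or match
        prev = cur
    return prev[-1]


def closest_headwords(word, headwords):
    """Return (minimal distance, sorted headwords at that distance)."""
    best, winners = None, []
    for hw in headwords:
        d = levenshtein(word, hw)
        if best is None or d < best:
            best, winners = d, [hw]
        elif d == best:
            winners.append(hw)
    return best, sorted(winners)
```

The scan is linear in the size of the lexicon for each query word, which is manageable for a few hundred words but is exactly the cost a precompiled automaton is meant to avoid.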

funderburkjim · 2018-08-31T22:07:27Z

How can one help here?
The method is simple. The results - promising. What's the wanted output format?

There is now a file in the PWK repository where corrections can be entered: pwis_notmw3_correctionform.txt . Here is link to brief readme.

@drdhaval2785 If you already have corrections in some other format, I'll be glad to transfer them
to the pwis_notmw3_correctionform.txt file.

@gasyoun Does this procedure satisfy your needs ?

drdhaval2785 · 2020-12-18T05:45:32Z

Corrections still are not prepared / installed.
Saw Metron. Agastiya's. in webpage today.

funderburkjim · 2020-12-19T21:55:13Z

Example

The readme at https://github.com/sanskrit-lexicon/PWK/tree/master/pw_iast gives some background on
anticipated usage. The objective is to change to modern IAST spellings various suspicious spellings in PW
dictionary.

Here's how I would proceed to deal with the 'Agastiya' example.

open pwis_notmw3_correctionform.txt

link = https://github.com/sanskrit-lexicon/PWK/blob/master/pw_iast/pwis_notmw3_correctionform.txt
Find 'Agastiya':

Case 0023: Agastiya 4 agastiya : Corrected_SLP1=
; Suggestion method: (Auto1X)   Corrected by: 
; Suggestions: agastIya,agastya

open pwis_notmw.html in browser.

Link is https://sanskrit-lexicon.github.io/PWK/pwis_notmw.html

and find 'Agastiya':

Agastiya | 4 | agastiya | OrvaSeya kalaSaBU kumBaBU kumBasaMBava

The 4 words 'OrvaSeya' are SLP1 spellings of headwords where the suspicious word Agastiya appears.

Examine instances

First, look up OrvaSeya in PW dictionary using one of the displays

Examine scanned image to see what print actually is:

Decide modern IAST spelling:

I would decide 'Agastya' --- PW's IAST typically uses 'j' when modern IAST uses 'y'.

Examine other uses:

kalaSaBU

kumBaBU

kumBasaMBava

Choose Answer

All the cases are the same: print has 'Agastja', Modern form is ~~'Agastiya'~~ 'Agastya'. Current pw.txt digitization has
'Agastiya'.
Solution is to change to 'Agastya'

Fill in Correctionform for Case 23

Edit pwis_notmw3_correctionform.txt (link = https://github.com/sanskrit-lexicon/PWK/blob/master/pw_iast/pwis_notmw3_correctionform.txt)

Fill in 'Corrected_SLP1' to 'Agastya'
Fill in 'Corrected by:' to funderburkjim (my Github user name)

Case 0023: Agastiya 4 agastiya : Corrected_SLP1= Agastya
; Suggestion method: (Auto1X)   Corrected by: funderburkjim
; Suggestions: agastIya,agastya

Commit the change (commit message = 'Case 23').
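To track progress through the form, the three-line records could be parsed along the following lines. The regular expressions are inferred only from the Case 0023 example above, so the real file may contain variants they miss (for instance, IAST spellings containing a space); `parse_form` is a hypothetical helper, not part of the PWK repository.

```python
import re

# Patterns inferred from the single example record shown above.
CASE_RE = re.compile(r"^Case (\d+): (\S+) (\d+) (\S+) : Corrected_SLP1=\s*(\S*)")
META_RE = re.compile(r"^;\s*Suggestion method:\s*(\(\w+\))?\s*Corrected by:\s*(\S*)")
SUGG_RE = re.compile(r"^;\s*Suggestions:\s*(.*)$")


def parse_form(text):
    """Parse correction-form text into a list of dicts, one per case."""
    cases = []
    for raw in text.splitlines():
        line = raw.strip()
        m = CASE_RE.match(line)
        if m:
            cases.append({
                "case": int(m.group(1)), "iast": m.group(2),
                "count": int(m.group(3)), "slp1": m.group(4),
                "corrected": m.group(5),
                "method": "", "by": "", "suggestions": [],
            })
            continue
        if not cases:
            continue
        m = META_RE.match(line)
        if m:
            cases[-1]["method"] = m.group(1) or ""
            cases[-1]["by"] = m.group(2)
            continue
        m = SUGG_RE.match(line)
        if m:
            cases[-1]["suggestions"] = [s for s in m.group(1).split(",") if s]
    return cases


def pending(cases):
    """Cases whose Corrected_SLP1 field is still empty."""
    return [c for c in cases if not c["corrected"]]
```

Running `pending(parse_form(text))` on the whole file would list the cases that still need a decision.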

funderburkjim · 2020-12-19T22:00:29Z

installing corrections

Filling in the correction form does not install the corrections to pw.txt. Installation would be a separate step done
by either @drdhaval2785 or @funderburkjim .

This is a slow process, but looks reliable.

There are 1585 cases.

The end result would be improvement to modern IAST

gasyoun · 2020-12-19T22:04:34Z

I would decide 'Agastya' --- PW's IAST typically uses 'j' when modern IAST uses 'y'.

Exactly.

All the cases are the same: print has 'Agastja', Modern form is 'Agastiya'. Current pw.txt digitization has
'Agastiya'.

Modern form is 'Agastya', and not 'Agastiya' only.

funderburkjim · 2020-12-19T23:08:59Z

Wonder if @SergeA would have interest in working on this?

funderburkjim · 2020-12-19T23:09:24Z

Modern form is 'Agastya'
👍
Have corrected comment.

Andhrabharati · 2021-08-16T06:46:27Z

If you already have corrections in some other format, I'll be glad to transfer them
to the pwis_notmw3_correctionform.txt file.

Wonder if @SergeA would have interest in working on this?

Can I poke-in my nose in this, if @funderburkjim is willing to work on it, if given in 'some other format'?
[It's hardly ~2 days' work for me.]

Andhrabharati · 2023-06-27T10:17:00Z

This is one of the many cases that are "counter" to what was replied by @drdhaval2785 and @gasyoun against my posting somewhere [that I do not get @funderburkjim's response for months together, while others get almost 'immediately'], that my posts are "heavy-meals" and not easily chewable/digestible as are all others' postings.

Here, I just wrote a single sentence, and yet to get some/any response from Jim (for almost 2 years now)!

funderburkjim · 2023-06-28T02:30:57Z

@Andhrabharati Obviously, I lost track of your question here.

Please provide a couple of examples of what you mean by some other format.

When the current work with you on Grassman dictionary is complete, I will examine the feasibility of working with you on this.

Andhrabharati · 2023-06-28T02:53:42Z

Obviously, my resolutions would be non-slp1 but in plain iast.

I see that some of these are abbr.s with a dot after the ending is-tag. How to deal with them, something like <ab><is>Z.</is></ab>?

Andhrabharati · 2023-06-28T03:01:25Z

Now that I saw many later posts at this forum, I think the regular

old: yyy
new: zzz

would be the way for me to take it up, wrt the latest pw.txt lines (at csl-orig); probably with just the is-tagged word [yyy/zzz] (as at times the line could be quite longer).
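A small sketch of how such old/new records, restricted to the is-tagged word, might be generated from a list of pw.txt lines. The `; line N` header is a made-up convention for this example and is not the actual updateByLine format; `change_records` is a hypothetical name.

```python
def change_records(lines, fixes):
    """Build old/new records for every <is>wrong</is> found in the given lines.

    lines: list of text lines (1-based numbering is used in the output)
    fixes: dict mapping a wrong IAST spelling to its correction
    """
    records = []
    for lnum, line in enumerate(lines, 1):
        for wrong, right in fixes.items():
            old = f"<is>{wrong}</is>"
            if old in line:
                new = f"<is>{right}</is>"
                # Header format is an assumption made for this sketch.
                records.append(f"; line {lnum}\nold: {old}\nnew: {new}")
    return records
```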

Andhrabharati · 2023-06-28T11:57:59Z

Just had a "look" inside the pw.txt for the <is>-strings and noticed ~10k instances of <is>…</is> strings inside the italics {%…%}; whereas the print has vast majority of them (if not all) in normal-face (font) [& wide-spaced].
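A count of this kind can be reproduced with a short script. The sketch below assumes that italic spans are written `{%…%}` without nesting and that is-tags may carry attributes; `is_tags_in_italics` is an illustrative name.

```python
import re

ITALIC_SPAN = re.compile(r"\{%(.*?)%\}", re.S)          # {%...%}, non-greedy
IS_ELEMENT = re.compile(r"<is\b[^>]*>(.*?)</is>", re.S)  # <is>...</is>, with or without attributes


def is_tags_in_italics(text):
    """Return the contents of every <is> element that sits inside an italic span."""
    found = []
    for span in ITALIC_SPAN.findall(text):
        found.extend(IS_ELEMENT.findall(span))
    return found
```

Calling `len(is_tags_in_italics(open("pw.txt", encoding="utf-8").read()))` would give the number of such instances, assuming a local copy of pw.txt.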

Also many more "bad"-tagging/marking of various types are seen.

This calls for a full overhaul of the data, and I get reminded of the earlier reaction of Thomas, if I say anything more!!
[I had stopped working on pw after the <ls> marking those days, seeing Thomas's reaction on my post.]

I see not much worth taking up correction of just the <is>-tagged iast portion.
But, is Jim ready/willing now to take up a collaborative work to "bring" a good-shape to pw.txt?

Andhrabharati · 2023-06-29T09:49:57Z

Did a quick checking for these "notmw" words in MW, and found quite many to be present in MW!

Just a few small changes/additions in the "search pattern" would eliminate all such ones from the list!!

Andhrabharati · 2023-06-29T12:54:08Z

When I tried looking for some random words in the list, I noticed that the CDSL pwk scan pages are not as clear as my copy. [Probably, this could be a reason for the typo errors.]

Probably, these scans could be replaced, for the benefit of any and everyone.
[This point was discussed elsewhere earlier wherein I mentioned having good scans of the vol.s 2-7, and now I've all the 7 volumes in my possession.]

funderburkjim · 2023-07-17T17:34:28Z

@Andhrabharati Am ready to begin working with you in this issue related to <is> tag in PW.

You identify a problem with markup re 'is' tag: <is>X</is> within italics {%Y%}.
You suggest the need for some 'new' rules for auto-correction.

I think we should restrict this study to is-tag, if possible (so that this good issue can be solved).

What do you need from me?

Andhrabharati · 2023-07-17T17:52:02Z

I had progressed much ahead from this IAST part in pw in the past couple of days, @funderburkjim !!

I shall post my work in due course of time, for your perusal.

If you are willing, pl. give me the links to the conversion scripts to Devanagari and iast of the full file, to be run from a command line locally.

I am about to start proofing full headwords, as some errors were noted while working on this pw.txt
[BTW, I had already marked grouped entries all throughout the file, just like in GRA and MW.]

Andhrabharati · 2023-07-17T17:56:03Z

My present work is covering all kinds of markups and listing the abbr. and ls entries.

funderburkjim · 2023-07-17T18:10:41Z

links to the conversion scripts to Devanagari and iast of the full file, to be run from a command line locally.

I could provide a python script that would run as follows

python pw_convert.py slp1,iast pw.txt pw_iast.txt
python pw_convert.py iast,slp1 pw_iast.txt pw_slp1.txt

This would convert the metaline  (k1 and k2) and all the {#X#} from slp1 to iast, and back.
And similarly for 'deva' instead of 'iast'
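The core of such a conversion, restricted to the `{#X#}` spans, might look like the sketch below. This is not the proposed pw_convert.py, which had not been written at this point in the thread: the function name `convert_marked` is hypothetical, the transcoder is passed in as a parameter, and metalines are left untouched.

```python
import re

SANSKRIT_SPAN = re.compile(r"\{#(.*?)#\}")  # Sanskrit text is marked as {#...#}


def convert_marked(line, transcode):
    """Apply transcode() to the text inside every {#...#} span of a line.

    transcode: any function str -> str, e.g. an SLP1-to-IAST transcoder.
    Everything outside the spans is returned unchanged.
    """
    return SANSKRIT_SPAN.sub(lambda m: "{#" + transcode(m.group(1)) + "#}", line)


def convert_file(path_in, path_out, transcode):
    """Convert a whole file line by line (UTF-8 assumed)."""
    with open(path_in, encoding="utf-8") as fin, \
         open(path_out, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(convert_marked(line, transcode))
```

Passing the inverse transcoder to the same function would convert the file back.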

Is this what you request?

Andhrabharati · 2023-07-17T18:18:11Z

Yes, exactly.

Andhrabharati · 2023-07-17T18:22:22Z

Probably, you could leave the metalines as is, as I am not going to touch that portion.

My reading will be limited to the header and body portions alone.

The metalines would have to be generated from the header portion, as done in case of GRA.

funderburkjim · 2023-07-18T02:46:50Z

@Andhrabharati. Further discussions found in sanskrit-lexicon/PWK#95.

We can leave this #419 issue open until the work in the PWK repository is completed.

Andhrabharati · 2023-07-18T13:08:06Z

Here are some small pieces from my work, wrt the <is> elements--

* marked items (diff. in CSL & AB texts):

abbr. type items:

I see that some of these are abbr.s with a dot after the ending is-tag. How to deal with them, something like <ab><is>Z.</is></ab>?

<is n="Adhyātmarāmāyaṇa">Adhyātmar.</is>
<is n="Adhyāya">Adhy.</is>
<is n="Agni">A.</is>
<is n="Āgnīdhra">Ā.</is>
<is n="Agniṣṭoma">A.</is>
<is n="Aṅga">A.</is>
<is n="Apsaras">A.</is>
<is n="Arka">A.</is>
<is n="Āśvalāyana">A.</is>
<is n="Atharvan">Ath.</is>
<is n="Avanti">Av.</is>
<is n="Ayodhyā">A.</is>
<is n="Bālāhaka">B.</is>
<is n="Bhaṇḍīratha">Bh.</is>
<is n="Bhārgava">B.</is>
<is n="Bhūliṅgā">Bh.</is>
<is n="Brahman">B.</is>
<is n="Brahman">Br.</is>
<is n="Brahmaṇācchaṃsin">Br.</is>
<is n="Cakora">C.</is>
<is n="Camasa">C.</is>
<is n="Dhanvantari">Dh.</is>
<is n="Dūrvā">D.</is>
<is n="Dvārakaukas">Dv.</is>
<is n="Dvāravatī">Dv.</is>
<is n="Gandharva">G.</is>
<is n="Gaṇeśa">G.</is>
<is n="Gārgya">G.</is>
<is n="Himālaya">H.</is>
<is n="Indra">I.</is>
<is n="Jagatī">J.</is>
<is n="Jamadagni">J.</is>
<is n="Kālī">K.</is>
<is n="Kānyakubja">Kānyak.</is>
<is n="Kārikā">K.</is>
<is n="Karṇāṭa">K.</is>
<is n="Kāśi">K.</is>
<is n="Kāśmīra">K.</is>
<is n="Kosala">K.</is>
<is n="Kuḍava">K.</is>
<is n="Likhita">L.</is>
<is n="Makara">M.</is>
<is n="Manu">M.</is>
<is n="Marut">M.</is>
<is n="Mathurā">M.</is>
<is n="Nairañjanā">N.</is>
<is n="Nalikā">N.</is>
<is n="Narmadā">N.</is>
<is n="Nīlakaṇṭha">Nīlak.</is>
<is n="Pañcālā">P.</is>
<is n="Paphaka">P.</is>
<is n="Pavamāna Stotra">P. St.</is>
<is n="Puronuvākyā">P.</is>
<is n="Pūru">P.</is>
<is n="Pūṣan">P.</is>
<is n="Rāma">R.</is>
<is n="Rāmāyaṇa">R.</is>
<is n="Revatī">R.</is>
<is n="Śākaṭāyana">Śāk.</is>
<is n="Sāman">S.</is>
<is n="Śaṅkha">Ś.</is>
<is n="Sarasvatī">S.</is>
<is n="Savitar">S.</is>
<is n="Sāyaṇa">Sāy.</is>
<is n="Soma">S.</is>
<is n="Śūdra">Ś.</is>
<is n="Sumantra">S.</is>
<is n="Tārkṣya">T.</is>
<is n="Udgātar">U.</is>
<is n="Udumbara">U.</is>
<is n="Vaṅkara">V.</is>
<is n="Vāsudeva">Vās.</is>
<is n="Vāyu">V.</is>
<is n="Veda">V.</is>
<is n="Vidura">V.</is>
<is n="Viśvarūpa">V.</is>
<is n="Yajus">Y.</is>
<is n="Yayāti">Y.</is>
<is n="Yuvanāśva">Yuv.</is>

funderburkjim · 2023-07-18T16:54:14Z

@Andhrabharati I do not find any of your examples of .</is>
Can you provide the line-numbers of your examples?

funderburkjim · 2023-07-18T16:56:39Z

I should have made previous comment in sanskrit-lexicon/PWK#95

Andhrabharati · 2023-07-18T17:02:52Z

Pl. see my initial post above --

it isn't .</is> but is </is>. in the CDSL file and I had brought the dot inside the is-tagging (and of course, there were some typos as well that I had corrected).

funderburkjim · 2023-07-25T23:06:43Z

How to deal with them, something like <ab><is>Z.</is></ab>?

@Andhrabharati
In your pw versions discussed at sanskrit-lexicon/PWK#95,
you introduce markup such as <is n="TOOLTIP">Z.</is> (also <ab n="TOOLTIP">Z.</ab>, etc.),
and the displays provide html so TOOLTIP is available to users.

This seems to answer the above question.

funderburkjim mentioned this issue Jul 22, 2018

Sanskrit coding conventions sanskrit-lexicon/COLOGNE#227

Closed

funderburkjim added a commit to sanskrit-lexicon/sanskrit-lexicon.github.io that referenced this issue Aug 31, 2018

PW IAST corrections. ref: sanskrit-lexicon/CORRECTIONS#419

90aa476

gasyoun added the Correction submission label Jan 22, 2019

gasyoun added the IAST modern IAST label Dec 19, 2020

drdhaval2785 mentioned this issue Dec 20, 2020

todo list in 2021 (in descending order of importance) sanskrit-lexicon/COLOGNE#325

Open

funderburkjim mentioned this issue Jul 17, 2023

Fresh Look, starting with <is> tag sanskrit-lexicon/PWK#95

Closed
