PW IAST corrections #419
This is a result of manual checking, right? |
These cases are identified in the current digitization by the cases of
|
Suggestion for correction
Make a local copy of the pwis_notmw.txt file, and also of the pwis_notmw.html file. Indicate corrections in the pwis_notmw.txt file by adding a 4th field with the correct spelling in SLP1 form.
Then submit back to me the corrected file. I'll convert these to standard 'updateByLine' old/new corrections for PW, and install the corrections. Don't worry about whether the correction is a typo or print error. Probably almost all are typos. |
One more possibility to reduce the list.
E.g. pracetas - praketas. If we replace k with c and the resulting word is found in the MW headword list, it can be listed as autocorrected. I will enumerate further observations that reduce the list as and when I encounter tendencies that can be handled programmatically. |
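The k-to-c idea above can be sketched roughly as follows. This is only an illustration, not the script actually used: the function names and the tiny `mw_headwords` set are hypothetical stand-ins, and the real MW headword list would have to be loaded from a file.

```python
def k_to_c_candidates(word):
    """Yield each spelling obtained by replacing one 'k' in word with 'c'."""
    for i, ch in enumerate(word):
        if ch == "k":
            yield word[:i] + "c" + word[i + 1:]


def autocorrect_k_to_c(word, mw_headwords):
    """Return the k->c variants of word that occur in mw_headwords, sorted."""
    return sorted(c for c in set(k_to_c_candidates(word)) if c in mw_headwords)


# Hypothetical stand-in for the MW headword list (SLP1 spellings).
mw_headwords = {"pracetas", "agastya"}

suggestions = autocorrect_k_to_c("praketas", mw_headwords)
print(suggestions)
```

A word with several k's produces one candidate per position, so the check stays cheap even over the full list of suspicious words.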
Autocorrection 1
pwis_notmw1.txt has been added to the gist. This contains the same list of 1585 words as in pwis_notmw.txt, but with 179 autocorrections.
These rules were applied to slp1 spelling of each of the 1585; if one of the rules resulted in a new
@drdhaval2785 This should help a bit, by autocorrecting 11% of the cases. You could download the |
That explains a lot.
Well done, well done. Dhaval, thanks again for being back. This one still remains the major dictionary. Not widely used in India, because people tend to forget German, but the most academic one up to now. |
Autocorrection 2
This is based on an idea in the article How to Write a Spelling Corrector by Peter Norvig. Consider the example of yAjNavalkya, in slp1 spelling. The idea is
Results: See pwis_notmw2.txt
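For readers unfamiliar with the Norvig article, its core step (generate every spelling one edit away, then keep those found in a lexicon) might look like the sketch below. It is an assumption that the autocorrection here followed this shape; `SLP1_LETTERS`, `suggest` and the one-word `lexicon` are illustrative names and data, not taken from the actual program.

```python
# Common SLP1 letters (vowels, anusvara/visarga, consonants); an assumption
# about which alphabet to use for replacements and insertions.
SLP1_LETTERS = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"


def edits1(word, alphabet=SLP1_LETTERS):
    """Return the set of all strings one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, lexicon):
    """Return the one-edit variants of word that are in lexicon, sorted."""
    return sorted(edits1(word) & lexicon)


# Hypothetical one-entry lexicon standing in for the MW headword list.
lexicon = {"yAjYavalkya"}
print(suggest("yAjNavalkya", lexicon))
```

The number of candidates grows with word length and alphabet size, but checking each against a set of headwords is a constant-time lookup.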
|
How can one help here?
The method is simple. The results - promising. What's the wanted output format? |
LevAuto
An additional step of autosuggestion was carried out on the remaining 300+ items of pwis_notmw2.txt that have no suggestions by the previous steps. An example will illustrate the conceptually simple process:
The results are shown in pwis_notmw3.txt.
While this technique is conceptually simple, it is computationally complex. In fact, the notmw3 LevAuto suggestions were generated by applying a Levenshtein Automaton built on top of the Pynini Python library developed by Kyle Gorman. The details of my application are in this pynini-learn repository. As mentioned there, the current implementation does not appear efficient enough to be very useful with such a large 'lexicon' as the 200,000 MW headword list. Gorman held out the possibility of a more efficient algorithm in this comment. |
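To make the "conceptually simple" part concrete, here is a brute-force sketch that compares a word against every lexicon entry by plain Levenshtein distance. It is explicitly not the Pynini automaton described above, only a swapped-in textbook dynamic-programming version; `nearest`, `max_distance` and the sample lexicon are hypothetical.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def nearest(word, lexicon, max_distance=2):
    """Return lexicon entries at the smallest distance <= max_distance."""
    scored = [(levenshtein(word, entry), entry) for entry in lexicon]
    scored = [(d, entry) for d, entry in scored if d <= max_distance]
    if not scored:
        return []
    best = min(d for d, _ in scored)
    return sorted(entry for d, entry in scored if d == best)


# Hypothetical sample standing in for the MW headword list.
lexicon = {"agastya", "agasti", "kalaSa"}
print(nearest("agastiya", lexicon))
```

Scanning the whole lexicon once per word is exactly the cost that an automaton-based approach is meant to avoid, which is why this version only suits small experiments.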
There is now a file in the PWK repository where corrections can be entered: pwis_notmw3_correctionform.txt. Here is a link to a brief readme. @drdhaval2785 If you already have corrections in some other format, I'll be glad to transfer them. @gasyoun Does this procedure satisfy your needs? |
Corrections still are not prepared / installed. |
Example
The readme at https://github.com/sanskrit-lexicon/PWK/tree/master/pw_iast gives some background on
Here's how I would proceed to deal with the 'Agastiya' example.
open pwis_notmw3_correctionform.txt
link = https://github.com/sanskrit-lexicon/PWK/blob/master/pw_iast/pwis_notmw3_correctionform.txt
open pwis_notmw.html in browser.
Link is https://sanskrit-lexicon.github.io/PWK/pwis_notmw.html and find 'Agastiya':
Agastiya | 4 | agastiya | OrvaSeya kalaSaBU kumBaBU kumBasaMBava
The 4 words 'OrvaSeya' are SLP1 spellings of headwords where the suspicious word
Examine instances
First, look up OrvaSeya in PW dictionary using one of the displays
Examine scanned image to see what print actually is:
Decide modern IAST spelling:
I would decide 'Agastya' --- PW's IAST typically uses 'j' when modern IAST uses 'y'.
Examine other uses:
Choose Answer
All the cases are the same: print has 'Agastja', Modern form is
Fill in Correctionform for Case 23
Edit [pwis_notmw3_correctionform.txt](https://github.com/sanskrit-lexicon/PWK/blob/master/pw_iast/pwis_notmw3_correctionform.txt)
Commit the change (commit message = 'Case 23'). |
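When working through many cases, it may help to read the 'Agastiya' record quoted above programmatically. The sketch below assumes the four pipe-separated fields are: the suspicious IAST word, the number of instances, its SLP1 form, and the SLP1 headwords where it occurs. Those field meanings, and the names `NotMwRecord` and `parse_record`, are my assumptions rather than a documented format.

```python
from typing import List, NamedTuple


class NotMwRecord(NamedTuple):
    word: str             # suspicious word as displayed (assumed field)
    count: int            # number of instances (assumed field)
    slp1: str             # SLP1 spelling of the word (assumed field)
    headwords: List[str]  # SLP1 headwords containing the word (assumed field)


def parse_record(line):
    """Split one 'a | n | b | h1 h2 ...' line into a NotMwRecord."""
    fields = [field.strip() for field in line.split("|")]
    word, count, slp1, headwords = fields[:4]
    return NotMwRecord(word, int(count), slp1, headwords.split())


record = parse_record(
    "Agastiya | 4 | agastiya | OrvaSeya kalaSaBU kumBaBU kumBasaMBava")
print(record.word, record.count, record.headwords)
```

Comparing `record.count` with `len(record.headwords)` gives a quick sanity check that a line was split as intended.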
Installing corrections
Filling in the correction form does not install the corrections to pw.txt. Installation would be a separate step done
This is a slow process, but looks reliable. There are 1585 cases. The end result would be improvement to modern IAST |
Exactly.
Modern form is 'Agastya', and not 'Agastiya' only. |
Wonder if @SergeA would have interest in working on this? |
|
Can I poke-in my nose in this, if @funderburkjim is willing to work on it, if given in 'some other format'? |
This is one of the many cases that are "counter" to what was replied by @drdhaval2785 and @gasyoun against my posting somewhere [that I do not get @funderburkjim's response for months together, while others get almost 'immediately'], that my posts are "heavy-meals" and not easily chewable/digestible as are all others' postings. Here, I just wrote a single sentence, and yet to get some/any response from Jim (for almost 2 years now)! |
@Andhrabharati Obviously, I lost track of your question here. Please provide a couple of examples of what you mean by
When the current work with you on the Grassmann dictionary is complete, I will examine the feasibility of working with you on this. |
Obviously, my resolutions would be non-slp1 but in plain iast. I see that some of these are abbr.s with a dot after the ending is-tag. How to deal with them, something like |
Now that I have seen many later posts at this forum, I think the regular old: yyy would be the way for me to take it up, wrt the latest pw.txt lines (at csl-orig); probably with just the is-tagged word [yyy/zzz] (as at times the line could be quite long). |
Just had a "look" inside the pw.txt for the Also many more "bad"-tagging/marking of various types are seen. This calls for a full overhaul of the data, and I get reminded of the earlier reaction of Thomas, if I say anything more!! I see not much worth taking up correction of just the |
Did a quick checking for these "notmw" words in MW, and found quite many to be present in MW! Just a few small changes/additions in the "search pattern" would eliminate all such ones from the list!! |
When tried looking for some random words in the list, noticed that the CDSL pwk scan pages are not so clear, as compared to my copy. [Probably, it could be a reason for the typo errors.] Probably, these scans could be replaced, for the benefit of any and everyone. |
@Andhrabharati Am ready to begin working with you in this issue related to
I think we should restrict this study to is-tag, if possible (so that this good issue can be solved). What do you need from me? |
I had progressed much ahead from this IAST part in pw in the past couple of days, @funderburkjim !! I shall post my work in due course of time, for your perusal. If you are willing, pl. give me the links to the conversion scripts to Devanagari and iast of the full file, to be run from a command line locally. I am about to start proofing full headwords, as some errors were noted while working on this pw.txt |
My present work is covering all kinds of markups and listing the abbr. and ls entries. |
I could provide a python script that would run as follows
Is this what you request? |
Yes, exactly. |
Probably, you could leave the metalines as is, as I am not going to touch that portion. My reading will be limited to the header and body portions alone. The metalines would have to be generated from the header portion, as done in case of GRA. |
@Andhrabharati. Further discussions found in sanskrit-lexicon/PWK#95. We can leave this #419 issue open until the work in PWK repository completed. |
Here are some small pieces from my work, wrt the
|
@Andhrabharati I do not find any of your examples of |
I should have made previous comment in sanskrit-lexicon/PWK#95 |
Pl. see my initial post above -- it isn't |
@Andhrabharati This seems to answer the above question. |
In the PW dictionary, a relatively small number of words appear in IAST spellings; for example
Some of these have spelling errors in the Cologne digitization:
This issue is devoted to correcting such spelling errors.