[INDOLOGY] Upgrade to online Koln Bohtlink-Roth dictionary? #72

Open
gasyoun opened this issue Aug 14, 2024 · 16 comments

Comments

@gasyoun
Member

gasyoun commented Aug 14, 2024

@funderburkjim @Andhrabharati the work is starting to get noticed! So I have a question: can we get a batch OCR of the scans on our end with https://ocr.sanskritdictionary.com, with a little help from @martingluckman?

"Does anyone know if this is an ongoing project to make all the references in the B-R Grosse Worterbuch live (i.e. point to the actual page of the work referenced). and if this project also extends to other of the Koln on-line dictionaries." - what is the plan and at what URL as of now? What is already covered? Even I miss part of the changelog.

Harry Spier via INDOLOGY
indology@list.indology.info

Dear list members,
I just looked up cint in the Koln online Bohtlink-Roth Grosses Worterbuch.
https://www.sanskrit-lexicon.uni-koeln.de/scans/PWGScan/2020/web/webtc2/index.php 
and I noticed that, among the references given for the different formations of cint listed there, those to the Mahabharata, the Harivamsa, the Ramayana, and the Kathasaritsagara (but not those to other works) are live: just by clicking on a reference you can download a PDF of the image of the actual page of the work referenced. Very impressive! I had not noticed that before.

Does anyone know if this is an ongoing project to make all the references in the B-R Grosse Worterbuch live (i.e. point to the actual page of the work referenced), and if this project also extends to other of the Koln on-line dictionaries?

What makes this especially useful is that the images are good enough to put into "Sanskrit OCR" (https://ocr.sanskritdictionary.com/) and get almost flawless digitization.
@Andhrabharati

Andhrabharati commented Aug 14, 2024

I don't think that's worth spending our time on at CDSL; getting the links to the scan pages is itself a big task, and Jim has taken it up with some support from my side.

That OCRing is best left to the interested people (if any!!).

I really doubt if anyone would venture the task and complete even a single book; people are just "making use" of the texts provided by open sources like GRETIL, Sanskritdocuments etc. (with whatever quality/drawbacks that they possess). No further improvement, nor any independent work!!

@Andhrabharati

And I recall that not even a single step has been taken (at your end, @gasyoun) towards "getting" the text out of the front pages of the CDSL works [which is a very practical & achievable task], which was talked about a few years back!!

@gasyoun
Member Author

gasyoun commented Aug 14, 2024

> I don't think that's worth spending our time on at CDSL

Disagree. Would want to discuss it with @martingluckman at a later stage.

@funderburkjim
Contributor

Encouraging to see that this feature of links to references is noticed by Gluckman.

@gasyoun I agree with AB that OCRing (getting the text out of) the Documentation Frontmatter scans would be an upgrade to that section at Cologne.

@Andhrabharati

> Encouraging to see that this feature of links to references is noticed by Gluckman.

@funderburkjim

What Marcis said is that Harry Spier had noticed the linking feature, not Gluckman (whom Marcis wants to approach for helping in OCRing the full-works!)

@funderburkjim
Contributor

> if this is an ongoing project to make all the references in the B-R Grosse Worterbuch live

Yes, at least for the 'major' PWG references.

I think I've put all of the 'link targets' here: https://github.com/orgs/sanskrit-lexicon-scans/repositories

This repo also contains copies of the scanned images for the dictionaries.

So someone interested in OCRing any of the link targets could clone one of these repos to get images of the individual pages.
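For anyone who wants to try this, here is a minimal sketch of collecting the page images from a local clone of one of those repos. The set of file extensions and the idea that every scan sits somewhere under the clone directory are assumptions, not something stated in this thread; `list_page_images` is a hypothetical helper name.

```python
from pathlib import Path

# Extensions treated as page scans. This set is an assumption;
# adjust it to whatever formats the cloned repo actually contains.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".pdf"}


def list_page_images(clone_dir):
    """Return every scan file under clone_dir, sorted by path.

    Files inside the .git directory are skipped.
    """
    root = Path(clone_dir)
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file()
        and p.suffix.lower() in IMAGE_SUFFIXES
        and ".git" not in p.relative_to(root).parts
    )
```

The sorted list can then be handed to whatever OCR tool is chosen, one page at a time.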

@funderburkjim
Contributor

In fact, these GitHub repos are also used by the CDSL displays (e.g. of PWG) to 'serve' the images.

@Andhrabharati

The OCRing alone can be done in practically no time these days (courtesy of Google); but the next phase, i.e. proofing the OCRed text to match the print, is the REAL task.

@Andhrabharati

Andhrabharati commented Sep 7, 2024

> > if this is an ongoing project to make all the references in the B-R Grosse Worterbuch live

> Yes, at least for the 'major' PWG references.

Isn't it worth doing this, while we are at it, for all the works that exceed a count of 10k references?

And @funderburkjim should update the lsextract_pwg file (which seems to have been last updated on 13th Jan. 2023) again, which will have further members (extending the list that I mentioned at the KSS issue) joining the 10k+ club!
--------------------------------------
PS. If the Skt. lexicons are also to be covered, I can prepare 'the index files' for those as well (taking Jim's indexing for AK. as "done").

@Andhrabharati

And also link the Indische Sprüche (1st ed.) scans, though the 2nd ed. has already been linked as a digital text.

@Andhrabharati

Andhrabharati commented Sep 7, 2024

> > I don't think that's worth spending our time on at CDSL

> Disagree. Would want to discuss it with @martingluckman at a later stage.

I am sure Jim cannot spend any time for this, and I WILL NOT (though I can do the proofing also, iff I take up the work); so you are welcome to get it done by any interested party, @gasyoun !!

@gasyoun
Member Author

gasyoun commented Sep 7, 2024

@Andhrabharati I'm speaking of a dirty, unproofed OCR

> proofing the OCRed text to match the print is the REAL task.

@Andhrabharati

Andhrabharati commented Sep 7, 2024

A simple script will do it, @gasyoun!

[And quite many of them are floating across the net.]
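As a rough illustration of what such a script might look like, here is a minimal sketch of a batch runner for a "dirty" (unproofed) OCR pass. The thread does not name a specific script or engine, so the OCR step is deliberately left as a caller-supplied function; `batch_ocr` and `ocr_page` are hypothetical names.

```python
from pathlib import Path


def batch_ocr(image_paths, out_dir, ocr_page):
    """Run ocr_page on each image and save the raw text as <stem>.txt.

    ocr_page is any callable that takes a Path and returns a string.
    The text is written exactly as returned, with no proofing.
    Images sharing a file stem would overwrite each other's output.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for image in map(Path, image_paths):
        text = ocr_page(image)
        target = out / (image.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Keeping the engine behind a plain callable means the same loop works whether the text comes from a local OCR library or from a web service, which is a choice this sketch leaves open.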

@Andhrabharati

Andhrabharati commented Sep 8, 2024

Looks like Suśruta, 1835-6 is the only other candidate coming into the 10k+ club!

Once this "bound book" is split into its two constituent volumes [Vol. 1 (1835): 378 pp. and Vol. 2 (1836): 562 pp., leaving aside the 4 front "title" pages in each volume], there is no need for any indexing for this work, as the references are simply in (volume, page, line) form.

Very easy for Jim, just like in the case of the Verz. d. Oxf. H.!!
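To show why no index file would be needed, here is a small sketch of the arithmetic that maps a (volume, page) reference to a position in the bound scan. It assumes the bound scan is laid out as 4 title pages, the 378 pages of Vol. 1, 4 title pages, then the 562 pages of Vol. 2, with nothing else in between; that layout and the name `bound_scan_index` are assumptions for illustration, not a description of the actual files.

```python
FRONT_PAGES = 4                    # title pages before each volume (from the thread)
VOLUME_PAGES = {1: 378, 2: 562}    # page counts per volume (from the thread)


def bound_scan_index(volume, page):
    """Return the 1-based position of (volume, page) in the bound scan.

    Assumes the layout: title pages, volume 1, title pages, volume 2.
    """
    if volume not in VOLUME_PAGES:
        raise ValueError(f"unknown volume: {volume}")
    if not 1 <= page <= VOLUME_PAGES[volume]:
        raise ValueError(f"page {page} out of range for volume {volume}")
    index = 0
    for v in sorted(VOLUME_PAGES):
        index += FRONT_PAGES       # skip this volume's title pages
        if v == volume:
            return index + page
        index += VOLUME_PAGES[v]   # skip the whole earlier volume


print(bound_scan_index(1, 1))      # 5
print(bound_scan_index(2, 1))      # 387
```

The line number in a reference plays no part in locating the page, so it is ignored here.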

@gasyoun
Member Author

gasyoun commented Sep 9, 2024

> [And quite many of them are floating across the net.]

Never seen one @Andhrabharati

> Looks like Suśruta, 1835-6 is the only other candidate coming into the 10k+ club!

Where are the others?

@Andhrabharati

Andhrabharati commented Sep 9, 2024

> > [And quite many of them are floating across the net.]

> Never seen one @Andhrabharati

Well, not everyone needs to know everything!
You may just use the places like wikisource, ocr.sanskritdictionary.com, ambuda.org etc.

> > Looks like Suśruta, 1835-6 is the only other candidate coming into the 10k+ club!

> Where are the others?

You mean the list of names? Look at my post above!
If it is about the scans, they will come when Jim starts working on them!

3 participants