Remove gratuitous spaces in ls #77

Closed
funderburkjim opened this issue Sep 18, 2024 · 15 comments

Comments

@funderburkjim
Contributor

This is in response to this comment.

I

@funderburkjim
Contributor Author

Will change the spacing between a comma and a digit within the scope of an <ls>...</ls> element, as preferred in the comment above.

Most lines of pwg.txt have a space at the end, which serves no purpose currently; so removing the ending space.

These two changes modify over half of the million-plus lines of pwg.txt.
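
A minimal sketch of the two changes, for illustration only. It assumes that "change" means dropping the space (as the issue title suggests), that an <ls> element opens and closes on the same line, and that the opening tag may carry attributes. The function names `tighten_ls` and `clean_line` are hypothetical and are not taken from the actual cleanup scripts.

```python
import re

# Assumed tag shape: <ls>...</ls>, optionally with attributes such as <ls n="...">.
LS_SPAN = re.compile(r"<ls\b[^>]*>.*?</ls>")
# One or more spaces between a comma and a following digit.
COMMA_SPACE_DIGIT = re.compile(r",[ ]+(?=\d)")


def tighten_ls(line: str) -> str:
    """Drop spaces between a comma and a digit, but only inside <ls> spans."""
    return LS_SPAN.sub(lambda m: COMMA_SPACE_DIGIT.sub(",", m.group(0)), line)


def clean_line(line: str) -> str:
    """Apply both changes to one line (given without its newline)."""
    return tighten_ls(line).rstrip(" ")
```

Text outside an <ls> span is left alone, so a comma followed by a space and a digit in the running text is not touched.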

@Andhrabharati

Andhrabharati commented Sep 18, 2024

Excellent; I would also request Jim to remove the blank lines within the body matter (mostly preceding the <ls or <div breaks). [A blank line should only be present between two entries (or at the header H-lines), as a NORM.]

There are about 13k of such blank lines on the whole.

Also there are some cases [3 blanks (2 places), 2 blanks (95 places)], where multiple blank-lines are before the new entry meta-line.
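
Counts like these could be reproduced with a small scan, sketched here under assumptions: an entry starts at a metaline beginning with `<L>` and ends at a line beginning with `<LEND>` (the exact spelling of the end marker is assumed), and `blank_line_stats` is a hypothetical helper name.

```python
def blank_line_stats(lines):
    """Return (blank lines inside entries, lengths of blank runs > 1 before a metaline)."""
    inside = False        # True while between a metaline and its end marker
    blanks_inside = 0     # blank lines found within entry bodies
    run = 0               # current run of blank lines outside entries
    long_runs = []        # runs of 2+ blank lines directly before a metaline
    for line in lines:
        text = line.rstrip("\n")
        if text.startswith("<L>"):
            if run > 1:
                long_runs.append(run)
            run = 0
            inside = True
        elif text.startswith("<LEND>"):
            inside = False
        elif text.strip() == "":
            if inside:
                blanks_inside += 1
            else:
                run += 1
        elif not inside:
            # A non-blank line outside an entry (for example an H-line) resets the run.
            run = 0
    return blanks_inside, long_runs
```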

@Andhrabharati

Most lines of pwg.txt have a space at the end, which serves no purpose currently; so removing the ending space.

These are all at the (artificial) line-splitting introduced at <ls (~500k), <div (~100k) and <is (94) tags!

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Sep 20, 2024
funderburkjim added a commit that referenced this issue Sep 20, 2024
funderburkjim added a commit that referenced this issue Sep 20, 2024
@funderburkjim
Contributor Author

This cleanup is completed. See the work directory.

Summary:

  • the two changes mentioned above
  • remove blank lines within an entry (16000 of these)
  • remove [Pagev-xxxx] lines outside entries (3232 of these)
  • remove extra blank lines outside of entries (138 of these)
  • a handful of misc. changes

The lines outside of entries can be summarized as:

  • H-line, blank-line precede first entry
  • no line after LEND line of last entry
  • A single empty line between LEND and metaline for 122661 entries
  • For 74 entries, H-lines with blank line as per file analyze_between.txt

grep -E '^' temp_pwg_5.txt | wc -l
122736 entries
= (+ 1 122661 74)
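
The same tally can be sketched in Python. This assumes that entries are counted by their metalines, which begin with `<L>` in the examples elsewhere in this thread; `count_entries` is a hypothetical helper, not part of the repository.

```python
def count_entries(lines):
    """Count metalines, assumed to be the lines that start with <L>."""
    return sum(1 for line in lines if line.startswith("<L>"))


# Arithmetic from the summary above: the first entry, plus the entries preceded
# by a single empty line, plus the entries preceded by H-lines.
first_entry = 1
single_blank_before = 122661
h_lines_before = 74
total = first_entry + single_blank_before + h_lines_before
```

Run against the file, `count_entries(open("temp_pwg_5.txt", encoding="utf-8"))` should agree with `total` if the assumptions hold.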

@Andhrabharati

Andhrabharati commented Sep 20, 2024

@funderburkjim

  • remove [Pagev-xxxx] lines outside entries (3232 of these)

There still remain 3 instances where a [Pagev-xxxx] line precedes the LEND line; these are of the same nature as the above criterion.
These lines could also be removed.

  • a handful of misc. changes

Also found some misc. cases that need correction:
>[^<]*>: [36 instances]
^[^<]*>: [3 instances]

And the 85 cases of "more than a single space together", [ ]{2,}, could also be corrected.
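
A rough way to re-run these checks, as a sketch only. The first two patterns are the ones quoted above; the third looks for runs of two or more spaces. The dictionary keys and the `count_hits` helper are hypothetical names, and the counts here are per match, which may differ from a per-line count.

```python
import re

CHECKS = {
    # A ">" followed by another ">" with no "<" in between.
    "gt_without_lt": re.compile(r">[^<]*>"),
    # A ">" that occurs before any "<" on the line.
    "leading_gt": re.compile(r"^[^<]*>"),
    # Two or more consecutive spaces.
    "multi_space": re.compile(r" {2,}"),
}


def count_hits(lines):
    """Count non-overlapping matches of each check across all lines."""
    counts = {name: 0 for name in CHECKS}
    for line in lines:
        text = line.rstrip("\n")
        for name, pattern in CHECKS.items():
            counts[name] += len(pattern.findall(text))
    return counts
```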

@Andhrabharati

I am glad that Jim has put his "first-step" in getting the digital text closer to the PWG print.

@Andhrabharati

Andhrabharati commented Sep 20, 2024

I see that I had split two entries, 63080 and 79329, into two parts each, making them look more "meaningful" in my version.

CDSL: <L>63080<pc>5-0966<k1>atrijAta<k2>atrijAta<h>12
AB: <L>63080<pc>5-0966<k1>atrijAta<k2>atrijAta<h>1
<L>63080.1<pc>5-0966<k1>atrijAta<k2>atrijAta<h>2

image
------------------------------------------
CDSL: <L>79329<pc>5-1624<k1>pradoza<k2>pradoza<h>23
AB: <L>79329<pc>5-1624<k1>pradoza<k2>pradoza<h>2
<L>79329.1<pc>5-1624<k1>pradoza<k2>pradoza<h>3

image

Jim probably would not object to doing the same in the cdsl file.
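
For comparing metalines like the ones above field by field, a small parser can help. This is a sketch that assumes a metaline is a flat sequence of `<tag>value` pairs with no `<` inside a value; `parse_metaline` is a hypothetical name.

```python
import re

# Each field is "<tag>" followed by its value, which runs up to the next "<".
FIELD = re.compile(r"<(\w+)>([^<]*)")


def parse_metaline(metaline: str) -> dict:
    """Split a metaline such as <L>...<pc>...<k1>...<k2>...<h>... into a dict."""
    return {tag: value for tag, value in FIELD.findall(metaline)}
```

With this, the CDSL metaline for 63080 yields `h` equal to "12", while the two AB metalines yield "1" and "2".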

@funderburkjim funderburkjim reopened this Sep 20, 2024
@Andhrabharati

Dear Jim,

Now that you've reopened this issue, you might consider this point as well!

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Sep 20, 2024
funderburkjim added a commit that referenced this issue Sep 20, 2024
@funderburkjim
Contributor Author

AB additional changes

Per above suggestions. Jim notes in issue77 readme at 09-20-2024 Reopened. Further corrections from AB.

Jim will consider the point about hyphens, em-dashes, and en-dashes next.

@Andhrabharati

Thank you, Jim; now this issue can be closed again!

funderburkjim added a commit to sanskrit-lexicon/csl-orig that referenced this issue Sep 20, 2024
funderburkjim added a commit that referenced this issue Sep 20, 2024
@funderburkjim
Contributor Author

hyphen to em-dash changes made.
github and cologne revised.

Many improvements made to cdsl pwg, thanks to @Andhrabharati's suggestions and his follow-up on some 3.5-year-old suggestions from @gasyoun.

@Andhrabharati I suspect many of these changes have the side-effect of decreasing the diff between your version(s) and the cdsl version of pwg. Perhaps there are other 'discrete' (well-defined) changes that cdsl could tackle now? If so these can be taken up, in new issues or in existing issues which have been neglected by cdsl.

@Andhrabharati

Andhrabharati commented Sep 21, 2024

Many improvements made to cdsl pwg, thanks to @Andhrabharati suggestions

@Andhrabharati I suspect many of these changes have the side-effect of decreasing the diff between your version(s) and the cdsl version of pwg.

Perhaps there are other 'discrete' (well-defined) changes that cdsl could tackle now? If so these can be taken up, in new issues or in existing issues which have been neglected by cdsl.

Thank you Jim, for appreciating my work.
It is not a side-effect (of decreasing the diff between the AB and CDSL versions), but a deliberate step towards matching PWG's "intended" (undocumented) theme and presentation.

And yes, there are quite a few refinements in my version that could (and should) be carried into the cdsl version.
But that would be a "long journey" (maybe taking 4-6 months) when done in a step-by-step fashion.
Maybe we can do it in parts, taking breaks in between to attend to other tasks as well.

If you are serious and willing, I can make and post an initial version (with sufficient details of working) for you to start with.

@funderburkjim
Contributor Author

Systematic attention to pwg is a worthwhile goal.

Incorporation of digitizations of the missing VN (#76) should be done first.

Then posting of your initial version.

Agree?

@Andhrabharati

Andhrabharati commented Sep 24, 2024

And probably after the AES "revision", which is more or less very straightforward and easy work (from my version).

@gasyoun
Member

gasyoun commented Sep 24, 2024

And probably after the AES "revision", which is more or less a very straight-forward and easy work (from my version).

Makes sense -- just to finish it.

3 participants