chess.pgn.read_headers inserts empty header entries related to newlines and empty movetext #1087

Open
MatijaSi opened this issue Jun 7, 2024 · 6 comments

@MatijaSi

MatijaSi commented Jun 7, 2024

I am trying to parse a large PGN file (7,000,000 games) using read_headers. However, it scanned only 84,039 games before stopping as if it had reached the end of the file, with no error message.

I managed to narrow it down to this test case:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]


{ Both Chinese players were late to the board for game two and were
defaulted }


0-1


[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]


1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io

import chess.pgn

f = io.StringIO(testcase)

while True:
    headers = chess.pgn.read_headers(f)
    print(headers)

    if not headers:
        break

Which prints:

Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0')
Headers()
@MatijaSi
Author

MatijaSi commented Jun 7, 2024

Investigating a bit further, there seems to be some issue related to newlines between games.

For example:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]

{ Both Chinese players were late to the board for game two and were
defaulted }

0-1

[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io

import chess.pgn

f = io.StringIO(testcase)

games = []
while True:
    headers = chess.pgn.read_headers(f)
    games.append(headers)

    if headers is None:
        break

This leaves games as follows (note the empty Headers() between the two "real" games):

[Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0'),
 Headers(),
 Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Naiditsch,A', Black='Svidler,P', Result='1/2-1/2', WhiteElo='2689', BlackElo='2754', ECO='C97', Opening='Ruy Lopez', Variation='closed, Chigorin defence', EventDate='2009.11.21', HashCode='312a25cf', TotalPlyCount='121'),
 None]

Whereas the test case from the original issue:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]


{ Both Chinese players were late to the board for game two and were
defaulted }


0-1


[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]


1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io

import chess.pgn

f = io.StringIO(testcase)

games = []
while True:
    headers = chess.pgn.read_headers(f)
    games.append(headers)

    if headers is None:
        break

This leaves games as follows (again with several empty entries):

[Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0'),
 Headers(),
 Headers(),
 Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Naiditsch,A', Black='Svidler,P', Result='1/2-1/2', WhiteElo='2689', BlackElo='2754', ECO='C97', Opening='Ruy Lopez', Variation='closed, Chigorin defence', EventDate='2009.11.21', HashCode='312a25cf', TotalPlyCount='121'),
 Headers(),
 None]

So the code in my original report is slightly wrong: it checks whether headers is falsy:

if not headers:
    break

instead of comparing them to None:

if headers is None:
    break

However, this is probably still a bug in the library, since an empty line probably shouldn't count as an empty game. Additionally, it is somehow related to the movetext being empty, since providing movetext gives a different result:

testcase = """
[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Li Chao2"]
[Black "Gashimov,V"]
[Result "0-1"]
[WhiteElo "2596"]
[BlackElo "2758"]
[EventDate "2009.11.21"]
[HashCode "00000000"]
[TotalPlyCount "0"]


1. e4 e5 0-1


[Event "World Cup"]
[Site "Khanty-Mansiysk RUS"]
[Date "2009.11.29"]
[Round "3.4"]
[White "Naiditsch,A"]
[Black "Svidler,P"]
[Result "1/2-1/2"]
[WhiteElo "2689"]
[BlackElo "2754"]
[ECO "C97"]
[Opening "Ruy Lopez"]
[Variation "closed, Chigorin defence"]
[EventDate "2009.11.21"]
[HashCode "312a25cf"]
[TotalPlyCount "121"]


1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 7. Bb3 O-O 8.
c3 d6 9. h3 Na5 10. Bc2 c5 11. d4 Qc7 12. d5 c4 13. Nbd2 Nb7 14. Nf1 Nc5
15. Kh1 Bd7 16. Ng3 Ne8 17. Nh2 Bh4 18. Rf1 Qd8 19. Nf3 Bxg3 20. fxg3 f5
21. Ng5 h6 22. Ne6 Bxe6 23. dxe6 fxe4 24. Rxf8+ Kxf8 25. Qh5 Qf6 26. Be3
Nd3 27. Qg4 Nc7 28. Qxe4 Qxe6 29. Qh7 Qg8 30. Bxd3 cxd3 31. Qxd3 Qe6 32.
Qh7 Qg8 33. Rf1+ Ke7 34. Qg6 Rf8 35. Rd1 Rf6 36. Qg4 g5 37. Kh2 Qxa2 38.
Qc8 Ne8 39. Bb6 Kf8 40. Bd8 Rf7 41. h4 gxh4 42. Bxh4 Qc4 43. Qxa6 Kg7 44.
Qa8 Qg4 45. Ra1 Qd7 46. Qb8 Qc6 47. b4 Kg6 48. Ra8 Ng7 49. Ra2 Qd5 50. Ra6
Rd7 51. Rb6 Nf5 52. Rxb5 Qf7 53. Rb6 Kh7 54. Qa8 Nxh4 55. gxh4 Qf4+ 56. Kh1
Qxh4+ 57. Kg1 Rg7 58. Qf3 Qe1+ 59. Kh2 Qh4+ 60. Kg1 Qe1+ 61. Kh2 1/2-1/2
"""

import io

import chess.pgn

f = io.StringIO(testcase)

games = []
while True:
    headers = chess.pgn.read_headers(f)
    games.append(headers)

    if headers is None:
        break

This leads to the following (note that there is now no Headers() between the games, but one extra entry still gets appended at the end):

[Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Li Chao2', Black='Gashimov,V', Result='0-1', WhiteElo='2596', BlackElo='2758', EventDate='2009.11.21', HashCode='00000000', TotalPlyCount='0'),
 Headers(Event='World Cup', Site='Khanty-Mansiysk RUS', Date='2009.11.29', Round='3.4', White='Naiditsch,A', Black='Svidler,P', Result='1/2-1/2', WhiteElo='2689', BlackElo='2754', ECO='C97', Opening='Ruy Lopez', Variation='closed, Chigorin defence', EventDate='2009.11.21', HashCode='312a25cf', TotalPlyCount='121'),
 Headers(),
 None]
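
A possible workaround for the behaviour reported above, as a minimal sketch. It assumes, as the outputs above suggest, that end of input is signalled by None and that the spurious entries are empty header mappings. The helper name iter_nonempty_headers is hypothetical. The reader function is passed in as a parameter so that the sketch runs without python-chess; the stub only replays the sequence of return values reported for the original test case.

```python
import io
from typing import Callable, Iterator, Mapping, Optional, TextIO

HeaderMap = Mapping[str, str]


def iter_nonempty_headers(
    handle: TextIO,
    read_headers: Callable[[TextIO], Optional[HeaderMap]],
) -> Iterator[HeaderMap]:
    """Yield header mappings until the reader returns None, skipping empty ones."""
    while True:
        headers = read_headers(handle)
        if headers is None:  # None, not falsiness, marks the end of input
            return
        if len(headers) == 0:  # spurious empty entry, skip it
            continue
        yield headers


# Stub reader (an assumption for this demo): replays full, empty, empty,
# full, empty, None, mirroring the output reported for the original test case.
replies = iter([{"White": "Li Chao2"}, {}, {}, {"White": "Naiditsch,A"}, {}, None])
result = list(iter_nonempty_headers(io.StringIO(""), lambda handle: next(replies)))
print(result)
```

With python-chess installed, the call would presumably be iter_nonempty_headers(f, chess.pgn.read_headers). Note that this filter would also drop a game that genuinely has no headers at all.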

MatijaSi changed the title from "chess.pgn.read_headers stops reading headers after game with empty movetext" to "chess.pgn.read_headers inserts empty header entries related to newlines and empty movetext" on Jun 7, 2024
niklasf added this to the v1.11.0 milestone on Jul 19, 2024
@tage64

tage64 commented Aug 21, 2024

I have also had this problem.

If I put a blank line between the games, it works. So:

Example 1, BAD: only the first game is parsed when there is no blank line between the games:

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]

*
[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]

*

Example 2, GOOD: both games are parsed when there is a blank line between the games:

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]

*

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]

*

I think that both examples should work.

@niklasf
Owner

niklasf commented Sep 27, 2024

This is tricky to deal with for chess.pgn.read_game() with its current interface: It reads the file line by line, without being able to look ahead. And so with the parser at <-

[Header "A"]


1. e4

<-
[Header "B"]

a decision has to be made:

  • Guess that the game contains consecutive empty lines (not allowed!) and will continue. In this example, it would incorrectly consume the first header of the second game, which is bad.
  • Guess that the game is terminated by consecutive empty lines and is just missing a result marker like * or 1-0 (not allowed!). This terminates the game too early in your examples, which is bad.

Currently the parser always does the latter. This is not a bug, because the PGN is invalid anyway, but maybe some heuristics can be added to better deal with it.

Robustly handling all of this would require changing the API, so that the parser can look ahead one line, without necessarily consuming it. Pushing this back to 2.x, for that reason.
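
To illustrate what "look ahead one line without consuming it" means, here is a minimal sketch using tell() and seek(). This is not how python-chess works, and it only applies to seekable handles such as regular files or io.StringIO, which is a limitation a general API cannot assume away. The helper name peek_line is hypothetical.

```python
import io
from typing import Optional, TextIO


def peek_line(handle: TextIO) -> Optional[str]:
    """Return the next line without consuming it, or None at end of input.

    Works only on seekable handles; tell() fails on pipes and similar streams.
    readline() is used because iterating a text file with next() disables tell().
    """
    position = handle.tell()
    line = handle.readline()
    handle.seek(position)  # rewind so the caller's position is unchanged
    return line if line else None


handle = io.StringIO('1. e4\n\n[Header "B"]\n')
handle.readline()  # consume "1. e4"
handle.readline()  # consume the blank line
upcoming = peek_line(handle)
# A tag line coming up suggests the current game has ended.
next_game_starts = upcoming is not None and upcoming.startswith("[")
print(repr(upcoming), next_game_starts)
```

Because the peeked line is still in the stream, a later call that starts reading the next game would see its first header intact.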

niklasf changed the milestone from v1.11.0 to v2.0.0 on Sep 27, 2024
@MatijaSi
Author

MatijaSi commented Oct 16, 2024

Hey niklasf, here we actually do have a result marker; what is missing is the movetext. I guess a minimal test case would be:

[Header "A"]

{ Comment }

0-1

[Header "B"]

1. e4

1-0

I haven't tried it, though, since I don't have Python on this computer.

@MarkZH
Contributor

MarkZH commented Oct 16, 2024

I've written a class that can assist with looking ahead in a PGN without necessarily consuming the line. There are two methods that can be used to address the lookahead difficulties.

  • iterator.putback(line) puts line at the front of the iterator so that it is returned on the next loop. For example, if the PGN reader comes to a line with header information while scanning a game (if line.startswith("["):), then the current game can be finalized and the header line returned to the iterator to start the next game (iterator.putback(line)).
  • iterator.lookahead() returns the next line while preserving it for the next loop. Similarly to the above, if iterator.lookahead().startswith("["): can be used to detect the end of a game that is missing its game termination marker. Since lookahead() returns None at the end of the input, check for None before calling startswith.

Here's the code with some usage example below. Let me know if this could be useful.

from typing import Iterable, Iterator, Optional

class PreviewIterator:
    """Iterator wrapper that can hold back one line for the next call to next()."""

    def __init__(self, source: Iterable[str]) -> None:
        self.source = iter(source)
        self.putback_line: Optional[str] = None

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        # Serve the held-back line first, if there is one.
        if self.putback_line is not None:
            line = self.putback_line
            self.putback_line = None
            return line
        else:
            return next(self.source)

    def putback(self, line: str) -> None:
        # Only one line is stored; a second putback overwrites the first.
        self.putback_line = line

    def lookahead(self) -> Optional[str]:
        # Return the next line without consuming it, or None at the end.
        try:
            line = next(self)
        except StopIteration:
            return None

        self.putback(line)
        return line


lines = ["first", "second", "third repeat", "fourth"]
line_iterator = PreviewIterator(lines)
for line in line_iterator:
    print(line)
    if line.endswith("repeat"):
        line_iterator.putback(line.removesuffix("repeat"))

print("")

line_iterator_2 = PreviewIterator(lines)
for line in line_iterator_2:
    print(line)
    look_ahead = line_iterator_2.lookahead()
    if look_ahead and look_ahead.endswith("repeat"):
        print("+++")

Output:

first
second
third repeat
third
fourth

first
second
+++
third repeat
fourth

@niklasf
Owner

niklasf commented Oct 16, 2024

Yes. I think for 2.x I'd like to replace the stateless chess.pgn.read_game(f: file) -> Optional[Game] with something like a stateful

class PgnReader:
    def __init__(self, f: file): ...
    def read_game(self) -> Optional[Game]: ...

that can internally use a PreviewIterator like you suggested, or save information for the next game, if needed.
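
A rough sketch of what such a stateful reader could look like, restricted to headers for brevity. Everything here is hypothetical and not part of python-chess: the class name HeaderReader, the pending attribute, and the simplified tag regex (which ignores escaped quotes). The idea is that a tag line met after movetext has started belongs to the next game, so it is saved rather than lost.

```python
import io
import re
from typing import Dict, Optional, TextIO

# Simplified tag-pair pattern (assumption: no escaped quotes in values).
TAG_LINE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


class HeaderReader:
    """Hypothetical stateful reader that keeps one line for the next game."""

    def __init__(self, handle: TextIO) -> None:
        self.handle = handle
        self.pending: Optional[str] = None  # line read too far, kept for later

    def _next_line(self) -> Optional[str]:
        if self.pending is not None:
            line, self.pending = self.pending, None
            return line
        line = self.handle.readline()
        return line if line else None

    def read_headers(self) -> Optional[Dict[str, str]]:
        headers: Dict[str, str] = {}
        in_movetext = False
        while True:
            line = self._next_line()
            if line is None:
                # End of input: report a game only if something was read.
                return headers if (headers or in_movetext) else None
            match = TAG_LINE.match(line)
            if match and in_movetext:
                self.pending = line  # first header of the next game
                return headers
            if match:
                headers[match.group(1)] = match.group(2)
            elif line.strip():
                in_movetext = True  # blank lines never end a game here


sample = '[Header "A"]\n\n{ Comment }\n\n0-1\n\n[Header "B"]\n\n1. e4\n\n1-0\n'
reader = HeaderReader(io.StringIO(sample))
results = [reader.read_headers() for _ in range(3)]
print(results)
```

On the minimal test case from earlier in the thread this yields one entry per game followed by None, regardless of how many blank lines separate the sections. A known limitation of the sketch is that a line inside a multi-line comment that happens to look like a tag pair would be mistaken for the start of a new game.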
