This repository has been archived by the owner on Dec 7, 2018. It is now read-only.

finediff is eating my words when showing the comparisons. #20

Open
newpen opened this issue Jan 9, 2015 · 14 comments

Comments


newpen commented Jan 9, 2015

After reading #18, I can now use finediff for Chinese successfully, but sometimes it drops part of my text!

Example:


(before)
1根據相對論,信息的傳播速度有限,因此在某些情况下,例如在發生宇宙膨胀時,距离我们非常遥远的區域中我們將只能收到一小部分区域的信息,其他部分的信息将永远无法传播到我们的區域。可以被我們觀測到的时空稱為“可觀測宇宙”、“可見宇宙”或“我們的宇宙”。應該強調,這是由於時空本身的结構造成的,與我們所用的觀測設備没有關係。 T'尸F.


(after)
根據相對論,信息的傳播速度有限,因此在某些情况下,例如在發生宇宙膨胀時,距离我们非常遥远的區中我們將只能收到一小部分区域的信息,其他部分的信息将永远无法传播到我们的區域。可以被我們觀測到的時

2空稱為“可觀測宇宙”、“可見宇宙”或“我們的宇宙”。應該強調,這是由於時空本身的结構造成的,與我們所用的觀測設備没有關係。


Using FineDiff::renderDiffToHTMLFromOpcodes($a, $opcodes), the result will just be:
根據相對論,信息的傳播速度有限,因此在某些情况下,例如在發生宇宙膨胀時,距离我们非常遥远的區域空稱為“可觀測宇宙”、“可見宇宙”或“我們的宇宙”。應該強調,這是由於時空本身的结構造成的,與我們所用的觀測設備没有關係。 T'尸F.


The whole "中我們將只能收到一小部分区域的信息,其他部分的信息将永远无法传播到我们的區域。可以被我們觀測到的時

2" is missing! I don't know where to look to solve the problem. Please give me some guidance. Thanks!

Owner

gorhill commented Jan 9, 2015

Probably one of the HTML tags ends up being inserted in the middle of a multibyte character.

FineDiff works on a binary, byte-by-byte basis; it doesn't know about characters. It happens to display fine for ASCII characters because each of them is a single byte. renderToTextFromOpcodes($from, $opcodes) should work fine, except that you will have to render to HTML yourself.

Not sure, but you might be able to find where an HTML tag splits a character and shift the tag back or forth (depending on whether it is the opening or closing tag) to a proper character boundary.
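
To make the byte-versus-character point concrete, here is a minimal sketch. It assumes the text is UTF-8 and that the mbstring extension is available; the two-character sample string is only an illustration, not the reporter's actual data.

```php
<?php
// Assumption: UTF-8 input. "時空" is two characters, three bytes each.
$text = "時空";

$byteLength = strlen($text);              // counts bytes
$charLength = mb_strlen($text, 'UTF-8');  // counts characters

// Cutting at byte offset 4 lands inside the second character,
// so neither half is valid UTF-8 any more.
$left  = substr($text, 0, 4);
$right = substr($text, 4);
$leftIsValid  = mb_check_encoding($left, 'UTF-8');
$rightIsValid = mb_check_encoding($right, 'UTF-8');

// Cutting at byte offset 3 falls exactly between the two characters.
$cleanIsValid = mb_check_encoding(substr($text, 0, 3), 'UTF-8');
```

A tag inserted at a cut like the first one leaves invalid byte sequences on both sides of it, which is one plausible way for a browser to end up showing a damaged result.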

Author

newpen commented Jan 12, 2015

Where can I try to set the character boundary? I tried to look into the code, but it was a bit too difficult for me to follow... Thanks in advance!

Owner

gorhill commented Jan 12, 2015

Create your own rendering handler:

public static function renderFromOpcodes($from, $opcodes, $callback);

See the code. Each time your callback is called, you may want to check whether the start and end of the segment fall on valid Unicode character boundaries, and if not, look around to fetch the missing preceding or following bytes. Frankly, it's just an untested idea, but if I had time, that's what I would look into.

Changing the FineDiff code is not an option: it's designed entirely to work on bytes, and these bytes could be anything. FineDiff doesn't care about their meaning.
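
As a rough sketch of the "inspect each segment in your own callback" idea: the callback below only records segments that are not valid UTF-8 on their own. The four-argument signature is an assumption inferred from the `call_user_func($callback, 'c', $from, $from_offset, $n, '')` calls quoted elsewhere in this thread, and the hand-written calls stand in for FineDiff::renderFromOpcodes() so the example runs without the library.

```php
<?php
// Hypothetical callback; the signature ($opcode, $from, $from_offset, $from_len)
// is an assumption, not taken from the FineDiff documentation.
$badSegments = [];
$inspectSegment = function ($opcode, $from, $from_offset, $from_len) use (&$badSegments) {
    $segment = substr($from, $from_offset, $from_len);
    if (!mb_check_encoding($segment, 'UTF-8')) {
        // Keep the raw bytes as hex so they can be inspected safely.
        $badSegments[] = [$opcode, $from_offset, $from_len, bin2hex($segment)];
    }
};

// Stand-in for the library: three hand-written segments over a 6-byte string.
$from = "時空";
call_user_func($inspectSegment, 'c', $from, 0, 4); // ends inside the second character
call_user_func($inspectSegment, 'd', $from, 4, 2); // starts inside the second character
call_user_func($inspectSegment, 'c', $from, 0, 3); // clean boundary, not recorded
```

Logging the hex bytes rather than the rendered HTML follows the advice to look at the binary string instead of the broken output.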

Author

newpen commented Jan 13, 2015

OK thanks, but the thing is that it is missing more than one character (sometimes a few sentences), so if I shift back and forth, most likely I would just get one more character back, which doesn't really help much...

Owner

gorhill commented Jan 13, 2015

it is missing more than one character (sometimes a few sentences)

Probably because you are looking at the broken HTML result. Look at the binary string internally, not the broken rendered HTML. Putting an HTML renderer in there was my biggest mistake. I should not have created this helper method, because FineDiff is completely binary and doesn't care what the data is. Originally it was used just to save storage, by saving only what changed.

Many users of the library think it is meant to render diffs visually on screen; that wasn't my original intention at all.

Isn't it true that if you use renderToTextFromOpcodes($from, $opcodes), the output will be as expected?

Edit: Out of curiosity, what granularity do you use?

Author

newpen commented Jan 13, 2015

I played with it for a bit. The following code seems to work in my case, but I'm not sure about other cases...

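        if ( $opcode === 'c' ) { // copy n characters from source
            // Look at the 3 characters before the end of the segment and step
            // back while that window holds only single-byte characters.
            // Note: mb_substr() counts characters, but $from_offset and $n are byte counts.
            $shift = 0;
            $char = mb_substr($from, $from_offset+$n-3, 3);
            while (strlen($char) == strlen(utf8_decode($char))) {
                $shift++;
                $char = mb_substr($from, $from_offset+$n-3-$shift, 3);
            }

            call_user_func($callback, 'c', $from, $from_offset, $n - $shift, '');
            $from_offset += $n;
        }
        else if ( $opcode === 'd' ) { // delete n characters from source
            // Same check at the start of the segment: step back while the
            // 3-character window holds only single-byte characters.
            $shift = 0;
            $char = mb_substr($from, $from_offset, 3);
            while (strlen($char) == strlen(utf8_decode($char))) {
                $shift++;
                $char = mb_substr($from, $from_offset-$shift, 3);
            }

            call_user_func($callback, 'd', $from, $from_offset - $shift, $n + $shift, '');
            $from_offset += $n;
        }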

Owner

gorhill commented Jan 13, 2015

Are Chinese characters always 3 bytes long? (including whitespace, etc.)

Author

newpen commented Jan 13, 2015

No... they can be 1-4 bytes... I guess I may have to refine it a bit to cover more cases... But how about insertion? I can't get it right using the same technique...
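
A quick way to see the 1-4 byte range, assuming UTF-8 and PHP 7 or later (for the `\u{...}` escapes). The sample characters are chosen here for illustration and do not come from the reporter's text.

```php
<?php
// Assumption: the source file and all strings are UTF-8.
$samples = [
    "A",          // ASCII letter
    " ",          // ASCII space
    "\u{00E9}",   // Latin small e with acute
    "時",         // CJK character in the Basic Multilingual Plane
    ",",         // full-width comma, common in Chinese text
    "\u{20000}",  // CJK Extension B character outside the BMP
];

$lengths = [];
foreach ($samples as $char) {
    $lengths[] = strlen($char); // strlen() counts bytes, not characters
}
// $lengths is [1, 1, 2, 3, 3, 4]
```

So a fixed 3-byte window holds for common Chinese characters and full-width punctuation, but not for ASCII spaces or digits mixed into the text, nor for rarer characters outside the BMP.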

Owner

gorhill commented Jan 13, 2015

Alright, looking at the UTF-8 encoding, finding the beginning of a character seems pretty easy: if bits 7-6 of a byte are binary 10 (i.e. (byte & 0xC0) === 0x80), it is a continuation byte, so go back one byte and check again. As soon as (byte & 0xC0) !== 0x80, you have the beginning of your character.

Now use the distance from the beginning of the character to the passed $from_offset to correct the start and the end -- i.e. $from_offset - $distance and $n + $distance. Do this for each segment regardless of whether it is an insert, delete, or copy. I believe this should work fine, with much less overhead than what you have above.

It has been a while since I wrote PHP, so I would have to check the PHP reference again... I forgot... Can we check a single byte in a string using array notation? If so, this becomes very easy.

You could do all this without changing FineDiff, just by providing your own callback to renderFromOpcodes($from, $opcodes, $callback). It's up to you.
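
A minimal sketch of that boundary correction, assuming valid UTF-8 input. The function names are hypothetical, and this variant snaps the end of the segment back as well as the start (rather than only adding the distance to $n), so that consecutive segments still tile the string without overlap.

```php
<?php
// Hypothetical helper: move $offset back until it no longer points at a
// UTF-8 continuation byte (bit pattern 10xxxxxx).
function utf8_snap_back(string $bytes, int $offset): int
{
    $length = strlen($bytes);
    if ($offset >= $length) {
        return $length; // the end of the string is always a boundary
    }
    while ($offset > 0 && (ord($bytes[$offset]) & 0xC0) === 0x80) {
        $offset--;
    }
    return $offset;
}

// Hypothetical helper: snap both ends of a byte segment to character
// boundaries. Returns [offset, length] in bytes.
function utf8_snap_segment(string $bytes, int $offset, int $n): array
{
    $start = utf8_snap_back($bytes, $offset);
    $end   = utf8_snap_back($bytes, $offset + $n);
    return [$start, $end - $start];
}

$from = "時空"; // 6 bytes: two 3-byte characters

// A segment of 4 bytes from offset 0 ends inside the second character.
list($start1, $len1) = utf8_snap_segment($from, 0, 4);
$first = substr($from, $start1, $len1);

// The following segment (offset 4, 2 bytes) starts inside that character.
list($start2, $len2) = utf8_snap_segment($from, 4, 2);
$second = substr($from, $start2, $len2);
```

With both ends snapped the same way, the partial character moves wholesale into the later segment. Whether that is the right side to assign it to for a given opcode is a design choice this sketch does not settle.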
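
On the array-notation question: PHP strings can be indexed by byte offset, and ord() turns the resulting one-byte string into an integer. A small sketch, with an illustrative sample string:

```php
<?php
$text = "時a"; // bytes: E6 99 82 61

$firstByte  = ord($text[0]); // lead byte of the 3-byte character
$secondByte = ord($text[1]); // continuation byte
$lastByte   = ord($text[3]); // ASCII "a"

// The parentheses matter: in PHP, === binds more tightly than &,
// so "$byte & 0xC0 === 0x80" would not test what is intended.
$firstIsContinuation  = ($firstByte & 0xC0) === 0x80;
$secondIsContinuation = ($secondByte & 0xC0) === 0x80;
$lastIsContinuation   = ($lastByte & 0xC0) === 0x80;
```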

Edit: fixed mistakes

Author

newpen commented Jan 13, 2015

Thanks! I also just googled it and discovered this fork
https://github.com/xrstf/PHP-FineDiff

It is working well with my Chinese characters!

Owner

gorhill commented Jan 13, 2015

Just be aware that you won't get the same kind of performance, however, as there is no equivalent of strspn/strcspn among the mb_ functions.
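
For context, a small sketch of what strspn/strcspn do and why they are byte-oriented. Both scan a string in a single native call, which is presumably where the speed comes from; the mask is treated as a set of bytes, so a multibyte character in the mask can match a different character that merely shares a byte with it. The sample strings are illustrative.

```php
<?php
// Length of the initial run containing no space: the first "word".
$wordLength = strcspn("abc def", " ");

// Length of the initial run consisting only of spaces.
$gapLength = strspn("   x", " ");

// Byte-level pitfall: U+7A7A and U+7A7F are different characters, but their
// UTF-8 encodings both start with the bytes E7 A9.
$mask    = "\u{7A7A}"; // E7 A9 BA
$subject = "\u{7A7F}"; // E7 A9 BF
$falseMatch = strcspn($subject, $mask); // stops at byte 0 although the characters differ
```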

Owner

gorhill commented Jan 14, 2015

I've worked a bit on this today; I wanted to test the idea above about nudging the boundary back and forth. The idea works, though it's all in the details. It's not perfect yet, but I have figured out how to make it work perfectly. I don't know when this will be ready.

Author

newpen commented Jan 14, 2015

Cool! Thanks! I realize the fork isn't as efficient as this one, but it can still serve as a temporary solution. Looking forward to your updates!

@erfanatp

See this https://github.com/xrstf/PHP-FineDiff
I used it for Farsi/Persian language and it works perfectly.
