Adds Portuguese #18

tgnm · 2016-01-20T00:46:44Z

Adds support for Portuguese.

ploeh · 2016-01-20T08:24:14Z

Thank you for your interest in contributing to Numsense. FTR, I'm currently travelling, and have only limited time to evaluate pull requests. In due time, I'll work through my backlog, but it may take weeks 😳 😓

ploeh · 2016-01-24T16:33:45Z

Numsense.UnitTests/PortugueseExamples.fs

+[<InlineData(
+    "doismilcentoequarentaesetemilhõesquatrocentoseoitentaetrêsmilseiscentosequarentaesete",
+    System.Int32.MaxValue)>]
+let ``tryParsePortuguese returns correct result`` (portuguese, expected) =


Bonus points for not repeating my mistake of naming the test function tryOfPortuguese... etc. 👍

ploeh · 2016-01-24T19:24:33Z

This is looking promising, but may need a bit more work. For instance, using the Devil's Advocate review technique, I can make a few changes to the code without failing any tests.

In Ploeh.Numsense.Portuguese.toPortugueseImp, I can remove the match for negative numbers:

    match x with
//    |  x when x < 0 -> sprintf "menos %s" (toPortugueseImp -x)
    |  0 -> "zero"
    |  1 -> "um"

And, at the same time, I can change the return expression for tryParsePortugueseImp to this:

    match canonicalized with
    | _ -> conv 0 canonicalized

If I do that, all tests still pass.
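As a rough illustration of the kind of example-based check that would defeat both of those changes, here is a minimal sketch. The names `toToyPortuguese` and `tryParseToyPortuguese` are hypothetical stand-ins that cover only -3..3; they are not the functions in this pull request.

```fsharp
// Toy stand-ins, NOT the Numsense implementation: they cover only -3..3,
// which is enough to show what the missing test cases would pin down.
let units = [ (0, "zero"); (1, "um"); (2, "dois"); (3, "três") ]

// Throws for magnitudes outside 0..3; acceptable for a toy.
let rec toToyPortuguese (x : int) : string =
    if x < 0 then sprintf "menos %s" (toToyPortuguese (-x))
    else units |> List.find (fun (n, _) -> n = x) |> snd

let rec tryParseToyPortuguese (s : string) : int option =
    let s = s.Trim().ToLowerInvariant()
    if s.StartsWith("menos ", System.StringComparison.Ordinal) then
        tryParseToyPortuguese (s.Substring(6)) |> Option.map (fun n -> -n)
    else
        units |> List.tryFind (fun (_, w) -> w = s) |> Option.map fst

// Example-based checks that fail as soon as either "menos" branch is removed.
let negativeCasesHold =
    [ (-1, "menos um"); (-3, "menos três") ]
    |> List.forall (fun (n, text) ->
        toToyPortuguese n = text && tryParseToyPortuguese text = Some n)

printfn "%b" negativeCasesHold
```

With the negative branch of either toy function commented out, `negativeCasesHold` becomes false (or the lookup throws), which is the signal the current test suite is missing.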

Also, did you mean to add support for object-oriented languages later? If not, I can easily do that.

tgnm · 2016-01-24T21:59:38Z

@ploeh I will have a closer look at the code coverage. By the looks of it there's no test currently covering negative number parsing.
Regarding the object-oriented language interface, I totally forgot about it...
Thanks for reviewing this!

ploeh · 2016-01-26T10:19:26Z

> there's no test currently covering negative number parsing

In the current pull request: correct.

FYI, existing languages (English, Polish, Dutch, Danish) have coverage of negative numbers in NumeralProperties.fs.

tgnm · 2016-01-26T22:31:55Z

@ploeh done!

ploeh · 2016-01-31T16:27:54Z

It builds and runs all tests without warnings on my machine, so from a technical perspective I think we're good to go 👍

As I don't read Portuguese, I'd like someone who does to review that part of it, if at all possible. I don't expect any errors, but it's always good to have a second pair of eyes 😄

ploeh · 2016-01-31T16:29:17Z

I solicited a review on Twitter: https://twitter.com/ploeh/status/693833260472860672

Please retweet and spread the word 😄

albertocsm · 2016-01-31T22:01:45Z

Regarding Portuguese semantics, LGTM

mrinaldi · 2016-02-01T00:06:21Z

How strict/loose should the parser be?

The parser doesn't work with spaces, e.g. dez mil (10,000). However, it does parse dezmil.
Strictly speaking, not using a space is not correct at all.

Another issue I noticed, with English included, is when I try to parse something like vintevinte (English: twentytwenty) it parses as 40.

Brazilian Portuguese (and I believe European Portuguese) has its weird rules like:
mil e quinhentos (1,500) is correct, mil e quinhentos e um (1,501) is not, it should be mil quinhentos e um.

Should we try to parse as loosely as possible? How about parsing vintevinte or twentytwenty: should it throw, return None, or just return 40?

ploeh · 2016-02-01T12:37:12Z

> How strict/loose should the parser be?

My default API design philosophy is to follow Postel's law, which would make parsing functions Tolerant Readers. Thus, if the parsing function can parse something unambiguously, even if it's not strictly correct, it should do so. It shouldn't preclude it from being able to parse correct values as well, though.

The most important property of the system is that it can round-trip: given any number, toPortuguese will produce a string that, when fed into tryParsePortuguese, yields the original number. This is captured in the property ``tryParsePortuguese is the inverse of toPortuguese``.

This property isn't an isomorphism, though, because the opposite doesn't hold. Because of Postel's law, there are (acceptable) inputs into tryParsePortuguese that can't be created by toPortuguese.
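To make the distinction concrete, here is a minimal sketch of a round-trip check over a toy converter. `toToy` and `tryParseToy` are hypothetical and cover only 1..9 and 20..99; they are not the Numsense functions, and the real property lives in the test suite.

```fsharp
// Toy stand-ins covering 1..9 and 20..99 only; NOT the Numsense code.
let units =
    [ (1, "um"); (2, "dois"); (3, "três"); (4, "quatro"); (5, "cinco");
      (6, "seis"); (7, "sete"); (8, "oito"); (9, "nove") ]

let tens =
    [ (20, "vinte"); (30, "trinta"); (40, "quarenta"); (50, "cinquenta");
      (60, "sessenta"); (70, "setenta"); (80, "oitenta"); (90, "noventa") ]

// Produces exactly one spelling per number.
let toToy (n : int) : string =
    let word table k = table |> List.find (fun (v, _) -> v = k) |> snd
    if n < 10 then word units n
    elif n % 10 = 0 then word tens n
    else sprintf "%s e %s" (word tens (n - n % 10)) (word units (n % 10))

// Tolerant reader: ignores case, spaces, hyphens and an optional "e".
let tryParseToy (s : string) : int option =
    let canonical = s.Trim().ToLowerInvariant().Replace(" ", "").Replace("-", "")
    let tryUnit text =
        units |> List.tryFind (fun (_, w) -> w = text) |> Option.map fst
    let startsWith (prefix : string) (text : string) =
        text.StartsWith(prefix, System.StringComparison.Ordinal)
    match tens |> List.tryFind (fun (_, w) -> startsWith w canonical) with
    | None -> tryUnit canonical
    | Some (t, w) ->
        match canonical.Substring(w.Length) with
        | "" -> Some t
        | rest ->
            let rest = if startsWith "e" rest then rest.Substring(1) else rest
            tryUnit rest |> Option.map (fun u -> t + u)

// Round-trip: every number in the toy domain survives toToy >> tryParseToy.
let domain = [ 1 .. 9 ] @ [ 20 .. 99 ]
let roundTrips = domain |> List.forall (fun n -> tryParseToy (toToy n) = Some n)

printfn "%b" roundTrips                 // true
printfn "%A" (tryParseToy "vinte-um")   // Some 21
printfn "%s" (toToy 21)                 // vinte e um
```

The last two lines show why this is not an isomorphism: "vinte-um" parses to 21, but 21 never converts back to "vinte-um".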

> The parser doesn't work with spaces, e.g. dez mil (10,000). However, it does parse dezmil.
> Strictly speaking, not using a space is not correct at all.

In the context of the above, are you thereby saying that the values produced by toPortuguese are incorrect?

> Should we try to parse as loosely as possible? How about parsing vintevinte or twentytwenty: should it throw, return None, or just return 40?

I thought about that when I wrote the English and Danish parsers, but I decided to leave that behaviour undefined at the time. I must admit that I don't know what the English or Danish parsers do in that situation, but I think that 40 should be considered a defect.

The question is whether something like twentytwenty is unambiguous? I can't easily come up with counter-examples, so perhaps it is...

mrinaldi · 2016-02-01T13:17:40Z

> The most important property of the system is that it can round-trip: given any number, toPortuguese will produce a string that, when fed into tryParsePortuguese, yields the original number.

How about valid numbers not returned from toPortuguese, but written by a person?
Take vinte e um for instance. It's the correct way to write 21 but it can't be parsed.

> In the context of the above, are you thereby saying that the values produced by toPortuguese are incorrect?

That depends on what you expect the output to be.
If you expect it to be correct according to the rules of the language, it's not correct.
On the other hand, if it just needs to be understandable, it's OK.

I mean, there are missing prepositions, so it looks weird to the eyes. I, for one, would never use this in an enterprise system. However, any person who reads Portuguese understands the number.

Despite that, I think it'd be better to replace the - separating the words with spaces. Since in Portuguese you'd never use - to separate the words in a number, that would make it a little bit closer to what is correct.

> I thought about that when I wrote the English and Danish parsers, but I decided to leave that behaviour undefined at the time. I must admit that I don't know what the English or Danish parsers do in that situation, but I think that 40 should be considered a defect.
>
> The question is whether something like twentytwenty is unambiguous? I can't easily come up with counter-examples, so perhaps it is...

I can confirm the English parser returns 40 when you input twentytwenty. Of course this happens with any combination you input, e.g. eleven-nine returns 20; two-two returns 4.
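For what it's worth, the behaviour is what you would expect from a parser that simply adds up every word it recognises. The sketch below is hypothetical: the `Token` type, the four-word vocabulary and the hyphen-splitting are assumptions for illustration, not how the Numsense parsers are written. It contrasts a naive additive parser with one that rejects illegal shapes.

```fsharp
// Hypothetical toy model; the real parsers work on canonicalized strings.
type Token =
    | Ones of int
    | Teen of int
    | Tens of int

let tokenize (word : string) : Token option =
    match word with
    | "two" -> Some (Ones 2)
    | "nine" -> Some (Ones 9)
    | "eleven" -> Some (Teen 11)
    | "twenty" -> Some (Tens 20)
    | _ -> None

let value token =
    match token with
    | Ones n | Teen n | Tens n -> n

// Splits on hyphens; None if any word is unknown.
let tryTokens (s : string) : Token list option =
    let parsed = s.Split('-') |> Array.toList |> List.map tokenize
    if List.forall Option.isSome parsed then Some (List.choose id parsed) else None

// Naive: add up whatever was recognised, regardless of order or shape.
let tryParseNaive (s : string) : int option =
    tryTokens s |> Option.map (List.sumBy value)

// Stricter: below one hundred, accept only a single word,
// or a tens word followed by a ones word.
let tryParseStrict (s : string) : int option =
    tryTokens s
    |> Option.bind (fun tokens ->
        match tokens with
        | [ t ] -> Some (value t)
        | [ Tens t; Ones u ] -> Some (t + u)
        | _ -> None)

printfn "%A" (tryParseNaive "twenty-twenty")   // Some 40
printfn "%A" (tryParseStrict "twenty-twenty")  // None
printfn "%A" (tryParseStrict "twenty-two")     // Some 22
```

Even in this tiny model the strict version has to enumerate the legal shapes explicitly, which hints at how the work grows once hundreds, thousands and language-specific rules come into play.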

tgnm · 2016-02-02T10:00:23Z

@mrinaldi tryParsePortuguese is able to parse numbers with prepositions like vinte e um. See: https://github.com/ploeh/Numsense/pull/18/files#diff-174c06e8bbf694ed1b0c333135715987R46

The reason why I did not add support for prepositions in ToPortuguese is because I wanted to keep the output similar to what existed for other languages (all languages use the dash to separate numbers). This is indeed something that looks a bit odd but I presumed it was intended.

mrinaldi · 2016-02-02T11:05:04Z

> @mrinaldi tryParsePortuguese is able to parse numbers with prepositions like vinte e um. See: https://github.com/ploeh/Numsense/pull/18/files#diff-174c06e8bbf694ed1b0c333135715987R46

I was actually referring to the output of toPortuguese: https://github.com/ploeh/Numsense/pull/18/files#diff-174c06e8bbf694ed1b0c333135715987R143

> The reason why I did not add support for prepositions in ToPortuguese is because I wanted to keep the output similar to what existed for other languages (all languages use the dash to separate numbers).

I see your point. However, in English, it's normal to use hyphens in numbers - although I'm not sure if you can use the hyphen for anything but tens and units.

I don't speak the other languages though, thus I can't say the same for them.

> This is indeed something that looks a bit odd but I presumed it was intended.

Maybe @ploeh can tell us that.

ploeh · 2016-02-02T13:09:07Z

The idea was to produce values that are correct within the rules of a particular language. These are expected to vary widely from language to language.

This discussion (about Spanish) may shed more light on the subject.

In short, if hyphens aren't legal in Portuguese numerals, they shouldn't be there.

kerams · 2016-02-02T18:34:12Z

> I can confirm the English parser returns 40 when you input twentytwenty. Of course this happens with any combination you input, e.g. eleven-nine returns 20; two-two returns 4.

That's how most (if not all) parsers in the library currently behave. Returning None is definitely the proper thing to do in my opinion, but it's easier said than done. The solution would be neither pretty nor trivial. And this is only the tip of the iceberg; consider languages with a rich set of grammatical cases and/or genders that allow numerals to take various shapes and forms.

At the same time, I wonder how much of a problem @ploeh thinks this is. Of course you don't want to get junk if the input is meaningful, but detecting invalid input of almost any kind is perhaps too much to ask.

tgnm · 2016-02-03T09:53:36Z

Thanks guys. I'm going to fix the output of toPortuguese so we don't use hyphens. The parser will remain as is.
As an example, for the number 21, toPortuguese currently returns "vinte-um", but after the fix the output will be "vinte e um".
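The separator between the thousands and the rest is the less obvious part. The sketch below encodes one possible reading of the rule @mrinaldi described (mil e quinhentos, but mil quinhentos e um). The function names are hypothetical, and the "below 100" clause is an assumption generalised from those two examples rather than something stated in this thread.

```fsharp
// Assumed rule, generalised from the two examples in this thread:
// after "mil", use " e " when the remainder is below 100 or an exact
// multiple of 100; otherwise use a plain space.
let separatorAfterThousand (remainder : int) : string =
    if remainder < 100 || remainder % 100 = 0 then " e " else " "

// Hypothetical helper: joins "mil" with the already-converted remainder text.
let joinThousand (remainderText : string) (remainder : int) : string =
    sprintf "mil%s%s" (separatorAfterThousand remainder) remainderText

printfn "%s" (joinThousand "quinhentos" 500)       // mil e quinhentos
printfn "%s" (joinThousand "quinhentos e um" 501)  // mil quinhentos e um
```

Keeping the decision in one small function should make it easy to correct if a native speaker points out further cases.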

ploeh · 2016-02-03T12:20:34Z

> I wonder how much of a problem @ploeh thinks this is

Me too 😉

You may not have seen this, but I've hinted at this before: it's not that I currently have a particular purpose for Numsense. Originally, I wrote the English and Danish implementations during my Christmas holiday because I thought it was fun.

Then I thought it'd be a great exercise for other people who wished to get their feet wet with an easy F# exercise. It's also a good exercise for people not familiar with contributing to open source.

I try to be as friendly here as possible, but I also try to keep it realistic. This means that I run it as I run all my other open source projects. You'll get the same amount of feedback here, with the same quality bar. Hopefully, it's worth everyone's time.

Looping back to the question about the importance of addressing corner cases when parsing: I do think it's important that Numsense behaves correctly within a language's rules. This is also the reason I ask other people to review the linguistic aspects of it.

I consider it a defect in the English parser that twentytwenty parses into 40, so I've created an issue for it.

tgnm · 2016-02-20T21:01:29Z

@mrinaldi @kerams @ploeh all done. Please let me know what you think.

tgnm · 2016-02-25T21:32:25Z

@ploeh just merged master to resolve merge conflicts.

ploeh · 2016-02-27T17:31:59Z

Thank you for your contribution! It's now live as Numsense 0.12.0.

tgnm added 7 commits January 10, 2016 22:10

Adds Portuguese files.

066c94a

Fixes typo in test name.

cc33d9d

TryParsePortuguese now passes all tests.

3414e4e

Implemented more tests.

d614b82

Tests now pass. Yay

f904ef8

All Portuguese tests pass now. yay again

1b3544f

Fixes incorrect spaces.

ea07378

ploeh reviewed Jan 24, 2016

tgnm added 3 commits January 26, 2016 12:47

Merges upstream/master into master.

4757228

Adds support for object oriented languages.

9257893

Adds more tests for Portuguese.

5adbc2e

Adds Portuguese numeral tests.

f0fd719

tgnm added 7 commits February 6, 2016 19:38

Merge upstream/master into add-portuguese

670260c

Fixes silly merge leftover.

d486044

Portuguese implementation now matches the expected separators between different classes of numbers.

4a83569

Merged master into add-portuguese.

3b2c54e

Fixes incorrect indentation in test.

65f2a41

Adds comment to test and removes unnecessary whitespace.

fe0a03e

Cleans up reference in project file and incorrect indentation in test.

c23bfb8

Merged master into add-portuguese

0e0a3ee

ploeh merged commit 0e0a3ee into ploeh:master Feb 27, 2016
