This repository has been archived by the owner on Aug 13, 2020. It is now read-only.

Adds Portuguese #18

Merged
merged 19 commits into from
Feb 27, 2016

Conversation

tgnm
Contributor

@tgnm tgnm commented Jan 20, 2016

Adds support for Portuguese.

@ploeh
Owner

ploeh commented Jan 20, 2016

Thank you for your interest in contributing to Numsense. FTR, I'm currently travelling, and have only limited time to evaluate pull requests. In due time, I'll work through my backlog, but it may take weeks 😳 😓

[<InlineData(
"doismilcentoequarentaesetemilhõesquatrocentoseoitentaetrêsmilseiscentosequarentaesete",
System.Int32.MaxValue)>]
let ``tryParsePortuguese returns correct result`` (portuguese, expected) =
Owner

Bonus points for not repeating my mistake of naming the test function tryOfPortuguese... etc. 👍

@ploeh
Owner

ploeh commented Jan 24, 2016

This is looking promising, but may need a bit more work. For instance, using the Devil's Advocate review technique, I can make a few changes to the code without failing any tests.

In Ploeh.Numsense.Portuguese.toPortugueseImp, I can remove the match for negative numbers:

    match x with
//    |  x when x < 0 -> sprintf "menos %s" (toPortugueseImp -x)
    |  0 -> "zero"
    |  1 -> "um"

And, at the same time, I can change the return expression for tryParsePortugueseImp to this:

    match canonicalized with
    | _ -> conv 0 canonicalized

If I do that, all tests still pass.
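
To illustrate why such a mutation can survive, here is a minimal sketch. It assumes a toy converter that only covers -9..9; `toToy`, `tryParseToy`, and `tryParseMutated` are hypothetical stand-ins, not the functions in this pull request. A round-trip check over non-negative inputs cannot tell the full parser from the mutated one, while the same check over a range that includes negatives can.

```fsharp
// Hypothetical toy converter covering only -9..9; not Numsense's implementation.
let units =
    [| "zero"; "um"; "dois"; "três"; "quatro"; "cinco"; "seis"; "sete"; "oito"; "nove" |]

let rec toToy (x : int) : string =
    if x < 0 then sprintf "menos %s" (toToy (-x))
    else units.[x]

// Full toy parser: handles the "menos " prefix.
let tryParseToy (s : string) : int option =
    let parseUnit (t : string) = Array.tryFindIndex (fun u -> u = t) units
    if s.StartsWith("menos ", System.StringComparison.Ordinal) then
        parseUnit (s.Substring(6)) |> Option.map (fun i -> -i)
    else
        parseUnit s

// Devil's Advocate mutation: the negative case has been removed.
let tryParseMutated (s : string) : int option =
    Array.tryFindIndex (fun u -> u = s) units

// Round-trip check over a given list of inputs.
let roundTrips (parse : string -> int option) (xs : int list) : bool =
    xs |> List.forall (fun x -> parse (toToy x) = Some x)

let nonNegativeOnly = [ 0 .. 9 ]
let withNegatives = [ -9 .. 9 ]

printfn "%b" (roundTrips tryParseMutated nonNegativeOnly) // true: the mutation survives
printfn "%b" (roundTrips tryParseMutated withNegatives)   // false: the mutation is caught
printfn "%b" (roundTrips tryParseToy withNegatives)       // true
```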

Also, did you mean to add support for object-oriented languages later? If not, I can easily do that.

@tgnm
Contributor Author

tgnm commented Jan 24, 2016

@ploeh I will have a closer look at the code coverage. By the looks of it there's no test currently covering negative number parsing.
Regarding the object-oriented language interface, I totally forgot about it...
Thanks for reviewing this!

@ploeh
Owner

ploeh commented Jan 26, 2016

> there's no test currently covering negative number parsing

In the current pull request: correct.

FYI, existing languages (English, Polish, Dutch, Danish) have coverage of negative numbers in NumeralProperties.fs.

@tgnm
Contributor Author

tgnm commented Jan 26, 2016

@ploeh done!

@ploeh
Owner

ploeh commented Jan 31, 2016

It builds and runs all tests without warnings on my machine, so from a technical perspective I think we're good to go 👍

As I don't read Portuguese, I'd like someone who does to review that part of it, if at all possible. I don't expect any errors, but it's always good to have a second pair of eyes 😄

@ploeh
Owner

ploeh commented Jan 31, 2016

I solicited a review on Twitter: https://twitter.com/ploeh/status/693833260472860672

Please retweet and spread the word 😄

@albertocsm

Regarding Portuguese semantics, LGTM

@mrinaldi

mrinaldi commented Feb 1, 2016

How strict/loose should the parser be?

The parser doesn't work with spaces, i.e. dez mil (10,000). However, it does parse dezmil.
Strictly speaking, not using a space is not correct at all.

Another issue I noticed, which also affects English: parsing something like vintevinte (English: twentytwenty) returns 40.

Brazilian Portuguese (and I believe European Portuguese) has its weird rules like:
mil e quinhentos (1,500) is correct, but mil e quinhentos e um (1,501) is not; it should be mil quinhentos e um.

Should we try to parse as loosely as possible? How about parsing vintevinte or twentytwenty: should it throw, return None, or just return 40?

@ploeh
Owner

ploeh commented Feb 1, 2016

> How strict/loose should the parser be?

My default API design philosophy is to follow Postel's law, which would make parsing functions Tolerant Readers. Thus, if the parsing function can parse something unambiguously, even if it's not strictly correct, it should do so. It shouldn't preclude it from being able to parse correct values as well, though.

The most important property of the system is that it can round-trip: given any number, toPortuguese will produce a string that, when fed into tryParsePortuguese, yields the original number. This property is captured in ``tryParsePortuguese is the inverse of toPortuguese``.

This property isn't an isomorphism, though, because the opposite doesn't hold. Because of Postel's law, there are (acceptable) inputs into tryParsePortuguese that can't be created by toPortuguese.
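
A minimal sketch of that asymmetry, assuming a toy converter for 0..9 only (`toWord` and `tryParseWord` are hypothetical names, not Numsense's API): every value produced by the writer parses back to the original number, yet the tolerant parser also accepts inputs the writer never produces.

```fsharp
// Hypothetical toy converter for 0..9; not Numsense's implementation.
let words =
    [| "zero"; "um"; "dois"; "três"; "quatro"; "cinco"; "seis"; "sete"; "oito"; "nove" |]

let toWord (x : int) : string = words.[x]

// Tolerant reader: ignores case and surrounding whitespace.
let tryParseWord (s : string) : int option =
    let canonicalized = s.Trim().ToLowerInvariant()
    Array.tryFindIndex (fun w -> w = canonicalized) words

// Round-trip: parsing the writer's output gives back the original number.
let roundTripHolds =
    [ 0 .. 9 ] |> List.forall (fun x -> tryParseWord (toWord x) = Some x)

// The opposite direction does not hold: " UM " is accepted,
// but writing the parsed value back produces "um", not " UM ".
let accepted = " UM "
let reproduced = tryParseWord accepted |> Option.map toWord

printfn "%b" roundTripHolds               // true
printfn "%A" reproduced                   // Some "um"
printfn "%b" (reproduced = Some accepted) // false
```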

> The parser doesn't work with spaces, i.e. dez mil (10,000). However, it does parse dezmil.
> Strictly speaking, not using a space is not correct at all.

In context of the above, are you thereby saying that the values produced by toPortuguese are incorrect?

> Should we try to parse as loosely as possible? How about parsing vintevinte or twentytwenty: should it throw, return None, or just return 40?

I thought about that when I wrote the English and Danish parsers, but I decided to leave that behaviour undefined at the time. I must admit that I don't know what the English or Danish parsers do in that situation, but I think that 40 should be considered a defect.

The question is whether something like twentytwenty is unambiguous? I can't easily come up with counter-examples, so perhaps it is...

@mrinaldi

mrinaldi commented Feb 1, 2016

> The most important property of the system is that it can round-trip: given any number, toPortuguese will produce a string that, when fed into tryParsePortuguese, yields the original number.

How about valid numbers not returned from toPortuguese, but written by a person?
Take vinte e um for instance. It's the correct way to write 21 but it can't be parsed.

> In context of the above, are you thereby saying that the values produced by toPortuguese are incorrect?

That depends on what you expect the output to be.
If you expect it to be correct according to the rules of the language, it's not correct.
On the other hand, if it just needs to be understandable, it's OK.

I mean, there are missing prepositions, so it looks weird to the eye. I, for one, would never use this in an enterprise system. However, any person who reads Portuguese understands the number.

Despite that, I think it'd be better to replace the - separating the words with spaces. Since in Portuguese you'd never use - to separate the words in a number, that would bring it a little closer to what is correct.

> I thought about that when I wrote the English and Danish parsers, but I decided to leave that behaviour undefined at the time. I must admit that I don't know what the English or Danish parsers do in that situation, but I think that 40 should be considered a defect.
>
> The question is whether something like twentytwenty is unambiguous? I can't easily come up with counter-examples, so perhaps it is...

I can confirm the English parser returns 40 when you input twentytwenty. Of course this happens with any combination you input, i.e. eleven-nine returns 20; two-two returns 4.
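
A rough sketch of how such results can arise, under the assumption that a parser strips known words off the front of the input and adds their values to an accumulator. The token table and both functions are hypothetical and cover only a handful of English words; this is not the library's actual parser. The second function shows one possible stricter rule: a tens word may be followed by a single units word, and nothing may follow anything else.

```fsharp
// Hypothetical token table with just enough English words for the examples.
let tokens = [ "twenty", 20; "eleven", 11; "nine", 9; "two", 2 ]

let canonicalize (s : string) =
    s.Replace("-", "").Replace(" ", "").ToLowerInvariant()

// Find a known word at the start of the input, if any.
let tryToken (s : string) =
    tokens
    |> List.tryFind (fun (w, _) -> s.StartsWith(w, System.StringComparison.Ordinal))

// Naive accumulator: any sequence of known words is summed.
let rec naive acc (s : string) =
    if s = "" then Some acc
    else
        tryToken s
        |> Option.bind (fun (w, v) -> naive (acc + v) (s.Substring(w.Length)))

let tryParseNaive (s : string) =
    match canonicalize s with
    | "" -> None
    | c -> naive 0 c

// Stricter variant: track what may legally come next.
type Expect = Anything | UnitsOnly | Nothing

let rec strict expect acc (s : string) =
    if s = "" then Some acc
    else
        tryToken s
        |> Option.bind (fun (w, v) ->
            let rest = s.Substring(w.Length)
            match expect with
            | Anything when v >= 20 -> strict UnitsOnly (acc + v) rest
            | Anything -> strict Nothing (acc + v) rest
            | UnitsOnly when v < 10 -> strict Nothing (acc + v) rest
            | _ -> None)

let tryParseStrict (s : string) =
    match canonicalize s with
    | "" -> None
    | c -> strict Anything 0 c

printfn "%A" (tryParseNaive "twentytwenty")  // Some 40
printfn "%A" (tryParseNaive "eleven-nine")   // Some 20
printfn "%A" (tryParseStrict "twentytwenty") // None
printfn "%A" (tryParseStrict "twenty-two")   // Some 22
```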

@tgnm
Contributor Author

tgnm commented Feb 2, 2016

@mrinaldi tryParsePortuguese is able to parse numbers with prepositions like vinte e um. See: https://github.com/ploeh/Numsense/pull/18/files#diff-174c06e8bbf694ed1b0c333135715987R46

The reason why I did not add support for prepositions in ToPortuguese is because I wanted to keep the output similar to what existed for other languages (all languages use the dash to separate numbers). This is indeed something that looks a bit odd but I presumed it was intended.

@mrinaldi

mrinaldi commented Feb 2, 2016

> @mrinaldi tryParsePortuguese is able to parse numbers with prepositions like vinte e um. See: https://github.com/ploeh/Numsense/pull/18/files#diff-174c06e8bbf694ed1b0c333135715987R46

I was actually referring to the output of toPortuguese: https://github.com/ploeh/Numsense/pull/18/files#diff-174c06e8bbf694ed1b0c333135715987R143

> The reason why I did not add support for prepositions in ToPortuguese is because I wanted to keep the output similar to what existed for other languages (all languages use the dash to separate numbers).

I see your point. However, in English, it's normal to use hyphens in numbers - although I'm not sure if you can use the hyphen for anything but tens and units.

I don't speak the other languages though, thus I can't say the same for them.

> This is indeed something that looks a bit odd but I presumed it was intended.

Maybe @ploeh can tell us that.

@ploeh
Owner

ploeh commented Feb 2, 2016

The idea was to produce values that are correct within the rules of a particular language. These are expected to vary widely from language to language.

This discussion (about Spanish) may shed more light on the subject.

In short, if hyphens aren't legal in Portuguese numerals, they shouldn't be there.

@kerams

kerams commented Feb 2, 2016

> I can confirm the English parser returns 40 when you input twentytwenty. Of course this happens with any combination you input, i.e. eleven-nine returns 20; two-two returns 4

That's how most (if not all) parsers in the library currently behave. Returning None is definitely the proper thing to do in my opinion, but it's easier said than done. The solution would be neither pretty nor trivial. And this is only the tip of the iceberg; consider languages with a rich set of grammatical cases and/or genders that allow numerals to take various shapes and forms.

At the same time, I wonder how much of a problem @ploeh thinks this is. Of course you don't want to get junk if the input is meaningful, but detecting invalid input of almost any kind is perhaps too much to ask.

@tgnm
Contributor Author

tgnm commented Feb 3, 2016

Thanks guys. I'm going to fix the output of toPortuguese so we don't use hyphens. The parser will remain as is.
As an example, for the number 21, toPortuguese currently returns "vinte-um", but after the fix the output will be "vinte e um".
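
For illustration only, a minimal sketch of joining tens and units with " e " instead of a hyphen. It assumes a hypothetical helper, `tryToPortugueseTens`, that covers 20..99 and nothing else; it is not the code in this pull request.

```fsharp
// Hypothetical sketch covering 20..99 only; not the code in this pull request.
let unitWords =
    [| ""; "um"; "dois"; "três"; "quatro"; "cinco"; "seis"; "sete"; "oito"; "nove" |]

let tenWords =
    [| ""; ""; "vinte"; "trinta"; "quarenta"; "cinquenta"; "sessenta"; "setenta"; "oitenta"; "noventa" |]

let tryToPortugueseTens (x : int) : string option =
    if x < 20 || x > 99 then None
    else
        match x / 10, x % 10 with
        | t, 0 -> Some tenWords.[t]                                  // round tens: no conjunction
        | t, u -> Some (sprintf "%s e %s" tenWords.[t] unitWords.[u]) // tens "e" units

printfn "%A" (tryToPortugueseTens 21) // Some "vinte e um"
printfn "%A" (tryToPortugueseTens 40) // Some "quarenta"
```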

@ploeh
Owner

ploeh commented Feb 3, 2016

> I wonder how much of a problem @ploeh thinks this is

Me too 😉

You may not have seen this, but I've hinted at this before: it's not that I currently have a particular purpose for Numsense. Originally, I wrote the English and Danish implementations during my Christmas holiday because I thought it was fun.

Then I thought it'd be a great exercise for other people who wished to get their feet wet with an easy F# exercise. It's also a good exercise for people not familiar with contributing to open source.

I try to be as friendly here as possible, but I also try to keep it realistic. This means that I run it as I run all my other open source projects. You'll get the same amount of feedback here, with the same quality bar. Hopefully, it's worth everyone's time.

Looping back to the question about the importance of addressing corner cases when parsing: I do think it's important that Numsense behaves correctly within a language's rules. This is also the reason I ask other people to review the linguistic aspects of it.

I consider it a defect in the English parser that twentytwenty parses into 40, so I've created an issue for it.

@tgnm
Contributor Author

tgnm commented Feb 20, 2016

@mrinaldi @kerams @ploeh all done. Please let me know what you think.

@tgnm
Contributor Author

tgnm commented Feb 25, 2016

@ploeh just merged master to resolve merge conflicts.

@ploeh ploeh merged commit 0e0a3ee into ploeh:master Feb 27, 2016
@ploeh
Owner

ploeh commented Feb 27, 2016

Thank you for your contribution! It's now live as Numsense 0.12.0.
