Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent Coding Style #60

Open
frink opened this issue May 5, 2019 · 40 comments
Open

Consistent Coding Style #60

frink opened this issue May 5, 2019 · 40 comments
Labels
API design Discussions about API choices

Comments

@frink
Copy link

frink commented May 5, 2019

I realize this may seem like grasping at straws but spending tons of time programing these things become quite an annoyance. Both Google and Github searches show clearly that ARR instead of ARY is the preferred abbreviation for array. While this would not be a big concern in a vacuum, in combining several libraries remembering that Facil says ARY while all other libs say ARR is frustrating. For these reasons I'm suggesting that for the next major version we should standardize this abbreviation for everyone and grep all ARY/ary replacing with ARR/arr. I am willing to take on this task if we decide to do it...

The biggest concern is that this will impact backwards compatibility. So it needs to be planned instead of just sending a pull request in a vacuum.

@boazsegev
Copy link
Owner

Hi @frink ,

Thank you for opening this issue. I think no small detail is too small for consideration.

Both Google and Github searches show clearly that ARR instead of ARY is the preferred abbreviation for array.

I was unaware of that and would love to get some more information before making a decision.

I was following the Ruby naming conventions, probably since the first iteration for the library was designed to be implemented as a Ruby extension.

I know that PHP uses the whole word (array) for array functions, as does JavaScript (i.e., a = [1,2,3]; Array.isArray(a);).

And although I am aware that arr is used as a variable name in many cases (as do a and ary), I do not know that it's used as an API namespace for Arrays.

Again, I really don't mind fixing a bad naming choice for the 0.8.x release cycle. I'm already planning some breaking changes that should homogenize the API for the core type... I just prefer to make a good choice.

@boazsegev boazsegev added API design Discussions about API choices waiting for response labels May 7, 2019
@frink
Copy link
Author

frink commented May 8, 2019

Interesting. I've never seen the Ruby way before... and I've been programing for 20+ years. Learn something new every day. But you are right. Further research does show a competing standard. Here are cliffnotes from my deeper dive...

I did a quick search from comparing top result via Google and looking if they are relevant results. For each language I search, "arr" array $lang and "ary" array $lang. I wasn't so much concerned with search volume as I was with search accuracy as to what Google recognizes. (Ary is proper name and gives false positives, also blockarray has something called ARY, etc...)

Bash, C#, C++, Go, PHP, Python and JavaScript prefer arr. Ruby definitely prefers ary. Lua, Java, Erlang and Perl seemed to slightly prefer arr but this may be my own prejudice - really they are too close to call. But I guess the real question here is which way C prefers...

I grepped (via github) a random smattering of common C project .c files only...

glibc - arr: 29 files - ary: 0 files
linux - arr: 365 files - ary: 10 files
git - arr: 3 files - ary: 2 files
redis - arr: 4 files - ary: 0 files
darling - arr: 6 files - ary 0 files
stb - arr: 8 files - ary: 0 files
ffmpeg - arr: 22 files - ary: 0 files
mpv - arr: 5 files - ary: 5 files
busybox - arr: 3 files - ary: 1 files
php - arr: 90 files - ary: 1 files
ruby - arr: 38 files - ary: 139 files
lua - arr: 3 files - ary: 0 files
cpython - arr: 3 files - ary: files

Most cases where ary was mentioned was the phrase N-ary...

After all this it occurred to me to search all of Github code directly. I searched for "arr" "ary" "arr array" and "ary array". Here are the results:

Without searching for array:
arr - 7,185,118 C files from 53,602,575 total code files
ary - 956,676 C files from 6,213,739 total code files

Searching for array too:
arr array - 4,187,965 C files from 34,566,459 total code files
ary array 675,610 C files from 3,202,635 total code files

I conclude from this that arr is definitely the preferred abbreviation in C. Also it is the clear leader in other web languages such as PHP, JavaScript, Python and Go.

Bottom line: Should we rename things?
It's up to you. If it were me I probably would.

@boazsegev
Copy link
Owner

@frink ,

Thank you very much for the detailed response. I think your information provides a very important perspective and I am almost convinced to rename the array API.

I'm not writing this library only for myself and I agree that the naming should be intuitive for the majority of programmers rather than the minority - even if, personally, I feel ary is more "correct" (similar to apt for apartment).

In other words, a name change is probably in order.

The only thing I am uncomfortable with is the fact that arr abbreviations are usually used by programmers for variable naming, it isn't used as much for API naming.

As far as API naming goes, it seems that JavaScript, PHP, node.js, Erlang, Objective-C (NSArray) and Swift prefer to keep the whole of the name (array / Array) rather than abbreviate.

C++ and Rust uses vector (or Vec), which I simply refuse to adopt - only people who never studied physics could imagine vector as a collection of elements.

Perhaps facil.io should adopt a full name approach, as in fiobj_array instead of fiobj_ary or fiobj_arr... I will need to sleep on it for a bit.

I would appreciate any input from anyone reading this discussion.

Kindly,
Bo.

@pr0con
Copy link

pr0con commented May 8, 2019 via email

@frink
Copy link
Author

frink commented May 9, 2019

If array is not abbreviated then it will not disrupt either group who are used to typing ary or arr.

"The best solution to a persistent, apparently non-solvable problem is to make the problem itself obsolete. Go around it. Cease to need it to be solved..." - Seth Godin

As I said before, the only reason I mention this is because someone who is using Facil regularly is likely to get annoyed if they are used to using arr instead of ary. I found the annoyance large enough that year of Facil would get me frustrated enough to fork it just so I didn't have to go outside what is comfortable to me.

But I realize I'm not the whole world and I want to be fair to all. @pr0con provides the best solution that avoids personal taste in view of pedantic completeness. This is probably the right choice for a framework which caters to all and cannot afford to get into a debate of personal taste...

@frink
Copy link
Author

frink commented May 26, 2019

I've got some freetime over the weekend if you want me to build this out. Just want to know which branch to do the PR for...

@boazsegev
Copy link
Owner

Actually, the name change is mostly done ☺️

I moved all the dynamic types into a macro based header library in thecore-type-updates-(pre0.8.x) branch.

I kept the ARY shortcut in the macros defining the dynamic array type, but I renamed the FIOBJ dynamic arrays to fiobj_array. I still need to finish writing the new documentation, but It's a simple thing to choose your own type name using the FIO_ARY_NAME macro when defining an array type.

After I finish updating the new core FIOBJ types and API, I need to finish the documentation and port the existing features to the new API.

You're welcome to pitch in 👍🏻

...

However, I got sidetracked trying to re-write the atof (strtod) function since I discovered an annoying issue with the standard library shipped with the OSX compiler (clang)...

The standard strtod seems to call strlen somewhere internally, which means the buffer is read until a NUL terminator character is detected... which is both suboptimal and risky. I just need to fix my math. It also effects JSON parsers (since each floating point number invokes strlen for the whole JSON).

@frink
Copy link
Author

frink commented May 27, 2019

Give me some clear direction and I'll jump in! :-D
Is there a project board with work items to be claimed?

@frink
Copy link
Author

frink commented May 27, 2019

Wow! That's some pretty interesting use of macros. You've turned the whole thing into an implementation of MACRO+MACRO... Very interesting. Avoid anymore debate about naming. Please everybody...

@boazsegev
Copy link
Owner

Is there a project board with work items to be claimed?

That's a great idea!

I've started a Project with an initial to-do list.

It's still a bit on the general side, but I guess details will be added as we go.

Wow! That's some pretty interesting use of macros. You've turned the whole thing into an implementation of MACRO+MACRO... Very interesting.

Thanks... I think.

I know it can be harder to browse through the source-code this way, but it's make it so much more flexible. This approach is more modular and it makes it super simple to create different types with different properties using a few MACROs and an include statement.

Give me some clear direction and I'll jump in! :-D

I have much to learn about directing a workflow. I'm used to working alone or taking instructions (and I'm not even very good at taking instructions)... So feel free to pick something you would enjoy working on rather than something that feel like a grind.

If I could pick one big focus: I would love it if you reviewed the API and complained about any inconsistencies or poor naming choices.

I don't want to create the API again if I can avoid it, so it would be great to have another pair of eyes making sure the API is intuitive and easy to use.

@frink
Copy link
Author

frink commented May 27, 2019

Okay I'll dig into the API. Then probably look at the inclusion of BearSSL next so we can make the whole thing embeddable as a static library. That should give us most of the exciting pieces of what's been scoped out.

@frink
Copy link
Author

frink commented Jun 2, 2019

Wanted to report my progress in reviewing naming and ensuring we never need to argue about names again. It's turned into quite a bit more work than we originally envisioned. The big aha is understanding the reason WHY to use array as opposed to arr or ary and how that directs all other usage. This lead to changing the macros to be abbreviated and more descriptive.

While I expect that we will probably go back and use macro naming conventions everywhere else, I found it easier to expand everything first for the sake of readability. It's really night and day difference. Ideally, I expect all usage of abbreviations to be expanded to full words. This will allow the most standard readability of the code. Then we can use macro naming to abbreviate according to the wishes of those who include this in their own codebase.

Here is my current progress:

  • ary -> array - This is done in both function names and macros. While this doesn't HAVE TO be done, it makes the code a LOT more readable. For a framework library, I feel this is the right way to go to please the most people. A few will likely complain about requiring more typing to write code. But it will preserve readability which is probably the higher priority for maintaining code built on facil.io...

  • str -> string - While this is not strictly needed, string is the only other data type that was abbreviated. This enforces expected semantics across the library which should make things both more readable and memorable for developers. I found that this just makes sense in most cases. I have done this in both Macros and function names.

  • buf, ptr, capa -> buffer, pointer, capacity - I found that once array and string were expanded it seemed strange not to expand these suffixes as well. This is also a choice for the sake of semantic correctness and easy memory of the functions. By avoiding abbreviations altogether it takes less work to remember the proper spelling of functions.

  • x2y -> xxxx_to_yyy - Once string and array were expanded the x2y functions didn't look right. (e.g. string2i) So I have expanded these forms as well so as to keep with the chosen semantics of expansion instead of abbreviation. This also means expanding out the destination types. To this end I have expanded i to integer, f to float, and bool to boolean.

There is still a bit of work to do before I'm ready to call this viable.

I'm still in the conversion phase of the source code. I've NOT tried to compile yet. I've decided to wait on that until I'm done playing origami with names.

Everything is in a separate branch based off of the core-types-update(pre0.8.x) branch you mentioned.

Here are some points of contention:

  • length - I feel like the core structures (fiobj for example) should probably go ahead and provide length spelled out. But there may be some that don't follow this nomenclature. I'm still wrestling with this one. Within the source code it is inconsistent whether len or length is used.

  • xxx2() function - I feel like this should probably be spelled out as well rather than numeric but I've not come up with a good suffix yet.

  • variables - In order to make the source more readable it would probably be the right thing to expand variables as well. However, there is also an argument for not doing so. For this reason, I've left variable names untouched for the time being.

  • pr, ch, sub - For the time being I've left these variable names. However, I think that they do provide problems with readability of the code. These are much more delicate but to make the framework scannable I think it's worth sifting through.

  • When we get closer to release I'll put together a script to upgrade a source code tree to the new names to avoid complaints from users trying to upgrade. That way upgrading to 0.8.x doesn't break things so bad...

Digging into this it's been interesting thinking through all different viewpoints. I wasn't intending a huge renaming of everything when I got started. But, the argument for using array over arr or ary leads us down the path of fully semantic naming conventions. I decided to try this and I'm surprised how much more readable the code becomes spelling things out. That's why I've continued down this path. I truly believe this will give our users the most readable code.

WIll probably take me another week before I'm ready to issue a pull request.

But we're getting close...

@frink
Copy link
Author

frink commented Jun 2, 2019

Oh one more thing I'm trying to understand is when to use multiple underscores in functions...

e.g. xxx____yyy()

Why are we using so many underscores?

@boazsegev
Copy link
Owner

Wow! Amazing work!

Here are my 2¢ for some of the renaming concerns. I'm writing out of order:

My biggest concern is consistency. My second concern is ease of use - which balances ease of maintainability with fast and easy coding.

Since I wrote the code over a number of years, my coding style shifted while writing different sections. This is something that hurts readability and I would love to fix.

Internal Functions

when to use multiple underscores in functions... e.g. xxx____yyy()

Three underscores are used for internal functions.

i.e., fio_typename___foo is reserved for internal helper functions that are called by the API but could break without warning. These functions are undocumented and are not part of the public API. Changes to these functions won't be reflected in semantic versioning.

The three underscores are used to mark these internal functions. When less or more than three underscores are used, this is an inconsistency and it should be fixed.

Re: Full Expansion?

The way I understood it, the core logic behind choosing array was to avoid contention regarding the abbreviated names and conform to common practices (in C and other languages).

I doubt str or ptr are contentious and both of them conform to common practices.

However, I was reading through some of the changes and loved the readability. In many cases, such as POINTER and STRING macros, I think full expansion works beautifully! Thanks!

Then again, when it comes to function names, I feel str, ptr and len are good enough. The readability difference isn't as meaningful.

a2b - expansion of the 2

I'm somewhat of a lazy typist and I think a2b is 5 key-strokes away from a_to_b (counting the required shift key).

This is especially true for the FIOBJ dynamic types, where these functions are often used (fiobj2i, fiobj2cstr, etc').

On the other hand, fio_string_to_u64 is definitely easier to read than fio_str2u64 and I love it!

It also clearly marks the distinction between the str type and an operational function-style macro that accepts string data.

Reading through other languages, such as Ruby I found some .to_X functions, some Type(X) (constructor) functions, but not many a2b style functions.

IMHO, I prefer the a2b style. I agree that both approaches (a2b and a_to_b) make sense. Perhaps more input on this would help.

a2b - suffix expansion

Hmm... I think this makes sense. to a degree. fiobj_string2number vs. fiobj_str2num or fiobj_str2i... I don't know.

Personally, I think the middle offers a balanced approach.

len, capa, subscription and other property / variable names?

I'm not a native English speaker. For me, spelling length is easy, capacity requires attention and by the time I spell subscription I forgot what I was trying to code.

I want the framework to be accessible to non-English speakers. I also want it to be consistent.

However, it's also important to remember that type property names are mostly considered opaque. These properties should be accessed by get/set functions, not altered directly.

Properties / Function names

Most IDEs will offer auto-completion for function names, so spelling less of an an issue. At the same time, there is such a thing as a name that's too long.

As you pointed out, len and length aren't consistent, and this is a must fix. IMHO, I preffer len, but both will do.

Other property names, such as capacity or capa are practically the same for me. I don't mind either way.

IMHO, full-expansion of property names could be preferred for the accessor functions, in which case the preference should probably be reflected in an object's internal structure (but doesn't have to be).

Variable names

Variable names are another matter altogether.

The names in the header file should be fully descriptive (this is part or the documentation),

However, within the implementation, shorthand makes a lot of sense. In the foo_subscribe, a variable names s is totally fine. Even if the abbreviation is ambiguous in a global scope, it's quite obvious in the context of subscribe that this would be the subscription.

Some abbreviations make it easier to skim through the code, since it allows logic to fit in single lines.

There was a time in my life that I subscribed to the convention where r is used for the function's return value. This was a bad habit and should probably get phased out as we go.

xxx2() Functions

The numerical approach makes sense to me, even if it's less descriptive. I think it's intuitive to balance x_new2 against x_free2.

Although I agree that some xxx2 functions could be more descriptive, these are functions that performs the same (or a similar enough) function, so I could think of a descriptive addition.

Side-Note: I noticed you took the time to update some of the new2(capa) style functions. They will be deprecated since new2 now functions the same as new for all FIOBJ types (on no-FIOBJ types, it adds reference counting).

Final thoughts

the argument for using array over arr or ary leads us down the path of fully semantic naming conventions. I decided to try this and I'm surprised how much more readable the code becomes spelling things out.

I wish we could have discussed this before you did all that work. I believe in balancing full-semantic naming approaches against other considerations.

I really love most of the work you did. While str vs. string doesn't seem like a big-impact change, the fiobj_string_duplicate function name is too long (22 characters). It could cause the code to overflow across line boundaries, making it, overall, less readable.

Also, duplicate feels heavier than the function might accord... dup is both shorthand and also expresses the fact that the function is simply an atomic ++ operation.

@frink
Copy link
Author

frink commented Jun 3, 2019

Thanks. Lot of thought in your reply. Appreciate the discussion.

I wish we could have discussed this before you did all that work. I believe in balancing full-semantic naming approaches against other considerations.

Don't worry. It wasn't a lot of time. I'm an avid grep and vim user.

I realized showing full expansion in an entire files would help the discussion. It's a process to standardized everything. At this point it's easy to go back and change whatever we want. So my time is not wasted if we decide to do something different. Also, I stopped before going down a long rabbit hole like len/length noted above...

My biggest concern is consistency. My second concern is ease of use - which balances ease of maintainability with fast and easy coding.

These are my two of my top concerns as well. I would also add readability and maintainability as others that vy for attention. It is EXTREMELY difficult to create a consistently coding style that is both readable and easy to remember and therefore use...

I propose this axiom: Consistently easy to remember so we maintain readable meaning.

  • Consistently: Using the same style, words, and grammar
  • Easy: Without an overabundance of rules or regulations
  • To Remember: Without consulting documentation constantly
  • So We Maintain: With continual forethought for future growth
  • Readable: Can be understood and debugged by humans
  • Meaning: The code itself provides implied usage from context

Consistent Abbreviation Problems

In truth there is no such thing as consistent abbreviation in code:

  • arr, str, buf, cap, dup, num, int, bin, hex, oct, sub, pat, ref, log, sin, cos, chan, bool, and len are all first syllables or first letters...
  • ptr, pnt, flt, bln, bfr, lng, ptn, chnl, and ary are all removing non-essential letters to leave a shortened pronounceable word...
  • float doesn't have an abbreviations in almost any programing language. With so many bad options, (fpt, flt, flo, flp, flpt, and flot) float almost always wins...
  • Some use single letters (c, a, i, f, b, p, s) while others use double letters (st, ar, in, fl, bl, pt, ch, fn) in Hungarian Notation for abbreviations. Some don't use Hungarian Notation at all.
  • Most coders have a thing for three letter abbreviations but the promptly break that rule for bool, chan, float, func, hash and many others that don't fit the mold nicely.

There are multiple problems here. The abc2xzy() function are one problem but also the naming of data types in general. We shouldn't use fio_string one place and then have a converter function fiobj_arr2str() another place or we dilute the meaning of both str and fio_string...

Ultimate Flexibility

In theory, we could create a MACRO for each use case allowing a person to rename all exposed functions in a consistent manner. We're already quite a ways down that road already. We would need a MACRO for a_to_b function and we would also need a MACRO for each type that could be reused in these functions. Essentially we would end up with a glossary that is user defined.

This is definitely the most flexible approach. But is it the best? I'm not sure...

The downside to being this flexible is that it makes building on non-standard implementations problematic for the greater ecosystem. But since Facil is C code this is less problems and obviously there isn't really an ecosystem surrounding the framework at present. So I'm inclined to think that providing a flexible API may actually strengthen the library...

However, documentation is another problem. We need to figure out how to provide re-written documentation for custom named APIs. I'm not sure of an easy way to do this. But I think we need to figure that out before we release the MACRO concept that you've been working on...

International Non-English Users

You make a good case for shortened wording because they are always easier to remember than long words for non-native speakers. I personally speak six languages at some level of proficiency so I can readily empathize with your point towards brevity. However, I also think that anchoring to a single language gives more definition than simply memorising characters. Most programming jobs originate from companies in English speaking countries, so English is the natural choice as an anchor language.

abc2xyz vs abc_to_xyz

The abc2xyz syntax is ancient as the C language and pretty much everyone understands its usage. I like this syntax personally, but at the same time, I also like the differentiation between fiobj_string and cstr. I feel like a framework may need to be more verbose because it's data types are NOT standard C. This puts a goo argument for using to syntax even though it is extremely verbose. So the coder in me really wants the abc2xyz but the software architect's gut says that to will create less bugs precisely because it is verbose.

What is most important?

Differentiating data types is probably more important than fast data entry since IDEs can fill in the blanks for us. But I should note that I'm biased because I'm a fast typist myself. 22 characters doesn't seem too long for a function call. The Linux Kernel has much longer calls sometimes... By making the API use narrative English we create a self documenting code that tells the story of what the computer is doing. I like that aspect a LOT.

Line Boundaries

This is why abbreviations were initially used. But in an era where even small laptops have 1080p displays the concern of wrapping code at 80 characters is probably not something we should care about anymore. (I know, some purists will shoot me now...) If we want to honor 80 character boundaries we need to abbreviate everything to 3 chars. Otherwise, we need to realize that screens are much bigger than 640x480 these days and lighten up on the whole thing.

Standard Suffixes

If we are going to use abbreviated suffixes for _len I think we should also do the same for other common use cases. _dup, _cap, _ptr are obvious choices. I think using 3-4 letter abbreviations makes sense. (With the caveat that we should not abbreviate float - because nobody does.)

Conclusion

I think this really has to be an all or nothing choice. The reasons for abbreviating data types are the same reasons for abbreviating function names. We can however, have separate rules in MACROs vs code. I very much believe we should stick with verbose long macros because they DO force use to think about their meaning more closely. Using the abc2xyz form ONLY makes sense if abbreviating. Otherwise, it needs to be expanded to feel like a consistent style.

Having rambled a bit exposing my thoughts I propose:

  1. We expand the names of fio data types (only affects array and string as others are full words already)
  2. We abbreviate to three letter suffixes in for _len, _dup, and _cap (readable and short)
  3. We shorten all uses of length to len (short consistency)
  4. We abbreviate common words ptr, ref, buf, sub, chan (readable but short)
  5. We use f_xyz_ prefix for all internal variables referencing fio data types (so internal code becomes self documenting)
  6. We use the _to_ wording for conversion of types (I know we like the other way better but I think this will create more readable code)
  7. We create comment headers for every function with Doxygen style definitions of their usage (This way we can use Doxdown via CI/CD to generate markdown documentation)

I really feel the more we can use the code as the documentation the better off this API will be...

Let me know if the above conclusion makes sense. I'll update my branch based on these thoughts...

@boazsegev
Copy link
Owner

TL;DR;

  1. We expand the names of fio data types (only affects array and string as others are full words already)

I agree with the fiobj_array, but I don't think fiobj_string is any different than fiobj_str. I don't see the added verbosity adding anything to the readability.

However, where string MACROS are concerned, I totally agree that FIO_STRING should be preferred. In fact, this could help differentiate between the FIOBJ String type (fiobj_str), the C String type (cstr) and string related operations.

  1. We abbreviate to three letter suffixes in for len, dup, and cap (readable and short)

I obviously agree with len and dup... I wonder, why cap vs. capa?

I'm not against this change, I just haven't seen cap as much as I've seen capa. Is it more common in C?

  1. We shorten all uses of length to len (short consistency)

Yes. Definitely.

  1. We abbreviate common words ptr, ref, buf, sub, chan (readable but short)

I think these are, except ptr and buf (and, maybe ref), context specific abbreviations.

For example, sub could be subtract and could be subscription.

Where API is concerned, I think channel and subscriptions are too context specific to be comfortably abbreviated unless the context is present (i.e., fio_atomic_sub isn't ambiguous, but fio_sub is).

  1. We use f_xyz_ prefix for all internal variables referencing fio data types (so internal code becomes self documenting)

Could you give an example for this? I'm not sure I understand the idea.

Also, what is the added benefit in adding an f_ prefix to code that is already limited in scope to the fio "namespace"?

  1. We use the _to_ wording for conversion of types (I know we like the other way better but I think this will create more readable code)

I'll write about this a little later. In short, I believe the abc2xyz will make more developers happy and that this happiness will result in better code that is more maintainable.

As an added optimization, it will also move all type conversions to a different placement in the auto-complete list.

For example, writing fiobj2 will expose all the type conversions available for FIOBJ types and writing fiobj_ will exclude conversion functions from the auto-complete list. This is a big optimization when using auto-complete in any IDE.

  1. We create comment headers for every function with Doxygen style definitions of their usage (This way we can use Doxdown via CI/CD to generate markdown documentation)

That would be great. I already tried my best to author Doxygen style comments (although I'm not fluent in the @return style keywords and such).

However, I truly dislike the Doxygen output. I find it very hard to locate the information I need using Doxygen documentation. I would much rather see the output re-formatted into Markdown and made available using the facil.io website.

Reasoning and Discussion

My biggest concern is consistency. My second concern is ease of use - which balances ease of maintainability with fast and easy coding.

These are my two of my top concerns as well. I would also add readability and maintainability as others that vy for attention.

True. I always think of readability and maintainability as integral parts of ease of use. I agree with this core outlook.

In theory, we could create a MACRO for each use case allowing a person to rename all exposed functions in a consistent manner. We're already quite a ways down that road already.

I think this will introduce a number of issues as the framework and the ecosystem evolves.

This type of flexibility means that questions on SO will be harder to answer, tutorials harder to write, etc'. It also means different projects have different APIs, which can't be always comfortable.

documentation is another problem. We need to figure out how to provide re-written documentation for custom named APIs.

Yes, this is an issue.

I tried to solve this when authoring the core-library documentation. However, this precludes automation for the documentation. Yet another reason for the main API to avoid MACROs for function names.

Most programming jobs originate from companies in English speaking countries, so English is the natural choice as an anchor language.

I agree, English is the natural choice.

However, making things a little easier for non-native speakers, by adopting well established abbreviations where appropriate, wouldn't hurt.

The abc2xyz syntax is ancient as the C language and pretty much everyone understands its usage. ... So the coder in me really wants the abc2xyz but the software architect's gut says that to will create less bugs precisely because it is verbose.

IMHO, it's more important to make the coder in you happy. Happy coders write better code.

In other words, I believe that happiness trumps verbosity where maintenance, readability and coding are concerned.

Also see my note above regarding auto-complete optimization.

But in an era where even small laptops have 1080p displays the concern of wrapping code at 80 characters is probably not something we should care about anymore.

IMHO, we should. As long as there are framework users that care about line length, we should care about line length as well.

Besides: I don't want to get into lengthly arguments about having multiple files open side by side and all that. The technology makes line-lengths far less of an issue... but biology still means that moving the eyes sideways will tire the eye muscles than moving the eyes up and down. This means that short lines are always better for long working hours and our health.

If we are going to use abbreviated suffixes for len I think we should also do the same for other common use cases. dup, cap, ptr are obvious choices. I think using 3-4 letter abbreviations makes sense. (With the caveat that we should not abbreviate float - because nobody does.)

I think 3-4 letters is a reasonable abbreviation length.

However, I want to raise the question of float for two special case functions: fio_atof and fiobj2f:

Unlike their names, and much like the standard library's atof, both function return a (floating point) double. I wonder, when being explicit about float, could these functions be considered misleading?

@frink
Copy link
Author

frink commented Jun 4, 2019

I agree with the fiobj_array, but I don't think fiobj_string is any different than fiobj_str. I don't see the added verbosity adding anything to the readability.

It's about grammaring consistency... str is not a word. String is. arr and ary are not words. Array is. We should be consistent in expansion.

I wonder, why cap vs. capa?

This is also a suggestion based on name consistency... All other suffixes are three letters. Also, cap is the actual English abbreviation for capacity...

Could you give an example for this? I'm not sure I understand the idea.

a variable named f_str_data as opposed to just data.
This is purely for human readability of library source in tracing and debugging...

Also, what is the added benefit in adding an f_ prefix to code that is already limited in scope to the fio "namespace"?

Because cstr and other native types are used regularly and it's not easy to imply from explicit context without a lot of vertical scrolling in some cases. This will make code more readable...

However, making things a little easier for non-native speakers, by adopting well established abbreviations where appropriate, wouldn't hurt.

Agreed. This is where the MACROs come in...

For example, writing fiobj2 will expose all the type conversions available for FIOBJ types and writing fiobj_ will exclude conversion functions from the auto-complete list. This is a big optimization when using auto-complete in any IDE.

Does this currently work like that or is this something that you are suggesting to change...?

I truly dislike the Doxygen output. I find it very hard to locate the information I need using Doxygen documentation. I would much rather see the output re-formatted into Markdown and made available using the facil.io website.

Doxydown is much better. Generates markdown like you would expect exactly for the use case you mentioned...

I think this will introduce a number of issues as the framework and the ecosystem evolves.

Agreed. While the MACRO thing is cool it needs to be limited in scope.

This type of flexibility means that questions on SO will be harder to answer, tutorials harder to write, etc'. It also means different projects have different APIs, which can't be always comfortable.

This is my fear in going down that road as well.

Besides: I don't want to get into lengthly arguments about having multiple files open side by side and all that.

After reading this I agree. Line length is an issue then. 80 char?

However, I want to raise the question of float for two special case functions: fio_atof and fiobj2f

I'll have to dig into those points more directly before addressing...

Bottom Lines

  • We're really close to nailing this thing down.
  • I'm a little iffy after reading the whole MACRO debate on why/if we need MACRO naming at all.
  • I really think grammatical consistency is important. Unless we want to revisit the arr/ary debate to make all fio types three letters I think we do have to expand string and array to be consistent.

I really need to go back and familiarize myself with the unmodified source again before I dig deeper into this. I've got too many concurrent version stuck in my head. I think the next step is to generate a list of functions and make sure they are to everyone's liking. After that, we can come back to variable naming etc. But that seems like a good baby step.

As always. THanks for the input...

@boazsegev
Copy link
Owner

boazsegev commented Jun 4, 2019

I would like to thank you for this discussion.

I feel that this is really helping me understand naming choices, where before I was either making my choices unaware, copying other people's choices, or making the wrong choices.

Consistency

I agree with the fiobj_array, but I don't think fiobj_string is any different than fiobj_str. I don't see the added verbosity adding anything to the readability.

It' s about grammaring consistency... str is not a word. String is. arr and ary are not words. Array is. We should be consistent in expansion.

I think we understand consistency in a different way.

For example: I have just discovered that the Array API uses set and get while the Map API uses insert and find. This is inconsistent and I should fix this before version 0.8.0 is beta released.

Another example: both array and hash types use count to return the number of elements. Array could have used length (or len) and still be grammatically "correct", but it wouldn't have been consistent.

When I think of consistency, I think that:

  1. Similar actions should have similar names; and

  2. The names "speak the same language". i.e.:

    1. I can expect all names to start with their "namespace" prefix (fio_ for core functions, fiobj_ for FIOBJ functions, etc').
    2. I can expect an (optional) middle / type name (i.e., array, resulting in fiobj_array_)
    3. I can expect all names to end with their action / name and an optional variation indicator (i.e., fio_str2u16 vs. fio_str2u32).

What I'm not thinking that the results for the naming decisions should be consistently the same.

I suspect that you're taking consistency to a meta-level, where we consistently make naming decisions based on the same naming logic / results.

IMHO, we should consider each naming decision separately unless it had been considered before.

cap, capa and Common Practices

I know you're the one that stated this first, but I would like to return this concern to the forefront.

The fact that len is a three letter abbreviation doesn't mean we should use cap unless cap is the common practice (which I thought to be capa).

As far as the English language is concerned, you might be right that cap is a better choice... but is it the common practice in programming languages?

In the C code base for Ruby, capa is preferred. The same appears to be true in the Linux kernel where cap is used for "capability", "Component Authentication Protocol", "capsule" and other context specific shorthands.

After we decide on an abbreviation (which I believe to be needed) we should apply this consistently to all places where capacity appears (strings, arrays, hash maps, etc').

MACROs and a Single API

However, making things a little easier for non-native speakers, by adopting well established abbreviations where appropriate, wouldn't hurt.

Agreed. This is where the MACROs come in...

Maybe I don't understand you. Are you suggesting that functions with long names (i.e., fiobj_string_capacity) will offer helper abbreviated MACROs (fiobj_str_capa)?

I really appreciate the ideas and the discussion, however, it will be very hard to convince me that the API should be supplemented by shorthand macros.

Honestly, I suspect I misunderstood, because my experience shows that this will produce two APIs and make code even more difficult to read.

More likely than not, the shorthand will be used in the code and the long version will be reported in trace reports, making things hard to debug.

MACRO naming for core types

The core types in the fio-stl.h library are used within the core library in different ways.

For example, linked lists are used for timers and state callbacks, arrays are used for publishing metadata and cluster clients (workers), and hash maps / sets (still un-ported to the new API) are used for pub/sub engines and uuid links.

Using MACROs for type authoring allows us to have a number of List / Array / Hash types, each managing a different collection type and providing type-safety for a different use-case.

By using MACROs the code is more DRY. I don't need to repeatedly author lists, arrays or hash maps for different types.

In addition, these same MACROs will be used to create the FIOBJ type system under version 0.8.0, making the API more consistent.

Auto-Complete Optimizations

Yes, the new abc2xyz style (i.e., fiobj2cstr) was adopted because it added this auto-complete optimization which I found very useful.

The previous style (fiobj_obj2X) didn't offer this optimization and cluttered the auto-complete list.

I think this optimization will make developers happy. I know I'm enjoying it.

Doxygen

Yay 🎉👍

I guess I need to learn how to use the tool. It just always seemed to spit out a specific document structure in HTML whenever I used it.

Automating the documentation would be great!

f_xyz_ Prefixes

Could you give an example for this? I'm not sure I understand the idea.

a variable named f_str_data as opposed to just data.

This is purely for human readability of library source in tracing and debugging...

Honestly, I still don't understand the usefulness or the use case.

Is this your example(?):

typedef struct {
 char * f_str_data; /* should be f_str_buf ? */
 size_t f_str_len;
 size_t f_str_capa; /* or f_str_cap */
} fio_str_info_s;

This is an anti-DRY pattern which I never understood and always found distracting. The f_str prefix in this example simply repeats known information, which actually reduces readability IMHO.

How does this help in tracing and debugging?

boazsegev added a commit that referenced this issue Jun 4, 2019
@frink
Copy link
Author

frink commented Jun 4, 2019

Abbreviation Consistency

What I'm not thinking that the results for the naming decisions should be consistently the same.

I feel strongly that we should be consistent in naming either through abbreviating or expanding to full words or at least to keep lengths the same. It's a big pet peeve of many senior coders I know. To be clear, I'm fine with using array if we expand str to string. I'm find using str if we abbreviate array to the more standard arr. But both need the same rules of abbreviation and expansion. This may not annoy non-English speakers but it WILL drive English speaking programmers nuts if the abbreviations are not relatively consistent...

After we decide on an abbreviation (which I believe to be needed) we should apply this consistently to all places where capacity appears (strings, arrays, hash maps, etc').

This is the consistency that I'm pushing for... (In fact, how the the discussion got started ;-)

MACROs and a Single API

Maybe I don't understand you. Are you suggesting that functions with long names (i.e., fiobj_string_capacity) will offer helper abbreviated MACROs (fiobj_str_capa)?

I was suggesting that the purpose of the FIOBJ_XXX_NAME seem to be used specifically to help people choose their own abbreviations. You've explained a little further on that Macros have a slightly different use case.

The core types in the fio-stl.h library are used within the core library in different ways.

Thanks for the explanation. Makes sense.

Auto-Complete Optimization

Yes, the new abc2xyz style (i.e., fiobj2cstr) was adopted because it added this auto-complete optimization which I found very useful.

Which branch is this in? I've got too many versions of code in my head! hehehe...

Doxydown vs Doxygen

Doxydown a separate Perl script that is MUCH smaller than the Doxygen monstrocity. (Doxygen really is a monster!!) I've got a highly modified branch laying around that adds full Lua support and a few other languages. I'd like to rewrite the tool at some point in C with configuration via some external .rc file but I've not gotten that far yet...

f_xyz_ prefix

Is this your example?

No. I'm thinking specifically of cases like the HTTP1 code where a lot of 'fiobj_strings' are used and then some cstr and an fiobj_array. I found the code almost meaningless as I read through it and had to keep scrolling back to see what the variable type was on almost every line...

Let's table this one for now. I'll illustrate it for you once we have the abbreviation discussion finished...

The Big Decision:

I think we really need to finish the abbreviation discussion and come to a conclusion. There are a couple ways to go to keep code conversational and familiar.

  • _shorthand - Use abbreviated types and actions (e.g. fiobj_str_dup / fiobj_str2cstr)

    • This means functions will be short but may not be readable as common English
    • Core types more than 4 letters should be abbreviated to 3 letter abbreviations
    • Actions suffixes (_dup, _len, _capa) should be abbreviated to 3 or 4 letters
    • Conversions should use the 2 conjunction
  • _conversational - Use words for types and actions (e.g. fiobj_string_duplicate, fiobj_string_to_cstr)

    • This means function names will be longer but the code will be more conversationally readable
    • Core types should probably be named between 3 and 7 letters to avoid insane lengths
    • Action suffixes (_duplicate, _length, _capacity) should seek to stay below 9 characters
    • Conversions should use the to conjunction to keep things readable

I suggest that we choose one of these styles completely and conform everything to this consistency. There are pros and cons to both styles. Here's my thoughts:

  • shorthand coding style

    • standard C libraries and small C projects tend to use this style
    • helps solo coders quickly write small projects
    • focused on getting work done and may not invite junior coders to read deeper
    • often requires extensive comments to be considered code complete
    • easier for code to be misunderstood when scanned
    • better for those coding in text editors
    • generally requires more work for English speakers to memorize
    • APIs that use this style require more extensive documentation
  • conversational coding style

    • GUI toolkits and C++ tend to use variations of this style
    • helps wrangle massive projects with many junior coders
    • focused on explaining the code as completely self documenting with few comments
    • often requires excessive typing to write much code (autocomplete is your friend)
    • easier for code to be too long and wrap lines
    • better for those coding in IDEs
    • generally requires more work for non-English speakers to memorize
    • APIs that use this style are naturally self documenting

We really need to decide which direction we go and then build out the rules for our naming style based on this. Personally, I've used both styles and in this project I can argue both sides of the issue very strongly.

I provided an eloquent argument above for why I think expanding may be better here. But it really comes down to understanding the variety of use cases that OUR USERS present. Facil will probably be used most for high workload, heavy throughput, micro services. It may also be used as a backing library for custom scriptable web application servers. While Facil could be used for large applications, it is more likely that it will be in the underbelly of a scriptable framework that the large application uses.

Personally, I hope to create a database-first framework for web application requiring zero custom API code. (All business logic in stored procs) For my usage either style works - so I really don't care which direction we go. But it does need some cleanup and consistency...

@pr0con
Copy link

pr0con commented Jun 4, 2019 via email

@frink
Copy link
Author

frink commented Jun 4, 2019

Thanks @pr0con.

Most people don't realize how much code style plays into the utility of an API and they let their egos get in the way of everything. I once worked on a project where they had no coding style, most database tables were named one thing while the business unit used a different name. There was no verb consistency and it was over 250k sloc. Talk about a nightmare!!!

That's when I started caring about naming conventions and consistency.

What do you think? Are you an abbreviation or expanded word type person?

@frink frink changed the title Array abbreviation... Consistent Coding Style Jun 4, 2019
@boazsegev
Copy link
Owner

@frink ,

I want to think more about this, but I'll write a short response to a few things I hope you consider:

I feel strongly that we should be consistent in naming either through abbreviating or expanding to full words or at least to keep lengths the same. It's a big pet peeve of many senior coders I know.

This all-or-nothing approach isn't based on a logical argument. I don't see the harm in working with both approaches in order to get the best of both worlds according to context.

For example, str in the auto-complete works the same as string - the IDE will show the abbreviation (part of the reason I thought array shouldn't be abbreviated).

[Auto-Complete] Which branch is this in? I've got too many versions of code in my head! hehehe...

Grab the fio-stl.h header from the 0.8.0 branch and use it's FIOBJ implementation (new API vs. old API).

#define FIO_FIOBJ 1
#define FIO_LOG 1
#include "fio-stl.h"

// ...

@pr0con
Copy link

pr0con commented Jun 4, 2019 via email

@boazsegev
Copy link
Owner

@pr0con - I'd love to read your 2¢.

@pr0con
Copy link

pr0con commented Jun 4, 2019 via email

@frink
Copy link
Author

frink commented Jun 5, 2019

@boazsegev Thanks.

This all-or-nothing approach isn't based on a logical argument.

Correct. My approach is not based on logic in the pedagogic sense but rather on linguistic grammar consistency and personal anecdote. I believe that we SHOULD use the same naming conventions and grammar in all cases within any library. But, adherence to the same internal standard need not be a strictly all-or-nothing approach. There are several points of contention which needed to be decided separately. These decisions inform the style guide and present definition of code code with relation to the library used. Here are a few points that need to be addressed separately.

  1. LIbrary data type naming... - This feeds most other decisions. If core names are abbreviated everything that refers to them SHOULD also be abbreviated. If expanded references SHOULD also be expanded. The main thing is not whether abbreviations are used but whether the choice to abbreviate is consistent. Another thing to consider is that some abbreviations may be mistaken for standard C data types. This is not a big deal for coding because the compiler will scream if you do something terribly wrong. But it does make coming back to code after a year slightly less intuitive...

  2. Postfixed verbs and attributes... - If these are abbreviated the library tends to FEEL more like shorthand code. This can be either a good or bad thing depending on whether you are trying to create a simple program or something deep and complex. The more complex you get the more you need to remind yourself what you are doing. This means that complex programs MUST include both comments and verbose names in order to keep things straight. If you spell out the postfixes the function names can get very long. But the resulting code tells the story of what is happening in plain English even without comments...

  3. Type Conversion... If you use abc2xyz it looks really weird if everything else is expanded. (fiobj_string2number vs fiobj_string_to_number) But, if you abbreviate then abc_to_xyz looks out of place. (fiobj_str_to_num vs 'fiobj_str2num). One could use abc_to_xyzfor internal andabc2xyz for external conversions (fiobj_string_to_numberandfiobj_string2f) - but that is likely to be a bit confusing and feel inconsistent. This is why there is a general unspoken rule is expansion uses abc_to_xyzand abbreviation usesabc2xyz`.

  4. Internal variable names... - While this may seem like overkill, making sure that the inner working of a library are readable is the surest way to get users involved with further development. A common problem with frameworks is naming internal variables to determine which vars are library types and which are standard C types without going back to the declarations. There is also the argument over abbreviating variable names. (My general rule is to abbreviate internal variables but spell out function names...)

  5. Exposed API functions... Differentiating exposed and internal functions is important. While the function naming may be generally the same, many libraries abbreviate the inner workings but are more verbose externally. I've seen that style several time but never the other way round where the API is abbreviated but the internals are expanded. Could happen, I guess. But it's more common for people to feel that since they are familiar with the internals they shouldn't have to type so much. Usually, this decision is reversed over time...

For example, str in the auto-complete works the same as string - the IDE will show the abbreviation (part of the reason I thought array shouldn't be abbreviated).

So I'm confused. Do you want to expand string now? I thought you wanted to expand array but NOT string.


AS AN ASIDE: I was talking to one of my mentors tonight who suggested that any seasoned coder should be a fast typist and therefore not concerned about the amount of time it takes to type something but rather the amount of time it takes to understand something a year later when reading it again. His impression, after coding for 40+ years, is that it is always better to have long names because it makes maintenance easier. Since he trained me, I'm really not surprised to learn we have the same opinion....

I'm standing by... Let me know what you conclude!

@frink
Copy link
Author

frink commented Jun 5, 2019

@pr0con Thanks.

... But the way the brain works(in my case) well knowns across languages such as str ary & len read fine to for type defs.

ary still throws me personally, I'm too used to arr. (That's how this whole thread got started...)

However, if str DOES NOT mean the same as str in standard C, we SHOULD differentiate so as not to confuse anyone... (e.g. fiobj_string and 'cstr' should be obviously separate types...)

@boazsegev
Copy link
Owner

boazsegev commented Jun 5, 2019

Hi @fink ,

Thank you for your time and for your considerably knowledgeable input.

I'm fairly convinced that verbosity is more beneficial most of the time.

However, too much verbosity is makes code harder to read, harder to author and harder to debug using common debugging tools.

For this reason, I won't subscribe to the all-or-nothing approach.

My decision is to go with a mostly verbose API, where:

  • Common practice abbreviations and auto-complete optimizations are preferred when readability isn't significantly affected.

  • Function names should be as succinct as possible.

  • MACROs may be more verbose than function names when appropriate.

    Since MACROs don't show up in debugging / tracing tools, abbreviating their names isn't always necessary, especially if the MACROs aren't function-like in nature.

If you think I missed anything in my considerations, please let me know.

Reasoning

Some abbreviations help readability

IMHO, when well established abbreviations are used (i.e., len or str), the extra letters don't add anything to the code's legibility.

In fact, I think developers expect these abbreviations and breaking this expectation distracts them. This is especially true where common practices are concerned.

Consider the following example, and notice how the verbosity actually requires more attention:

// verbose
for(int temporary_counter = 0; temporary_counter < 10;
    temporary_counter = temporary_counter + 1)  {
}
// common shorthand
for(int i = 0; i < 10; ++i)  {
}

The API is also used by the other developer tools (IDE etc')

The API we choose is used mostly for reading, but also for coding, debugging and optimizing.

When debugging, optimizing or coding, variable and function names are used by these tools that often have limited space for function / variable names.

In addition, the auto-complete feature in many IDEs is often used as a good-enough replacement for API documentation when general (but not specific) knowledge of the API exists.

For this reason, function prefixes have an additional function. Spacing characters have a strong functionality. Using _ vs. 2 eliminates irrelevant results.

For example, fiobj2cstr will not hurt readability, but the prefix fiobj2 will either exclude or include all generic conversion functions for the FIOBJ dynamic types.

Implications

I believe the following abbreviations don't affect readability even when out of context: str, cstr, buf, len, capa, ptr, num, hash, tmp.

The following abbreviations may appear in context: i, f, s, sub, ch, ptrn, msg.

These lists aren't complete, but I believe they provide a good enough explanation.

fiobj_str vs. cstr

As noted by @fink , there should be a clear distinction between facil.io types and language types. Also, the fiobj_ prefix might not be enough (it's too similar to the fio_ prefix).

For this reason, when referencing C strings, the cstr notation should be preferred (indicating fio_str_info_s probably needs to be renamed as fio_cstr_s).

fiobj_array

Arrays don't have a non-contentious common practice and most languages / libraries avoid abbreviations of the name where type name / API is concerned.

The most common abbreviation (arr) is common among computer science graduates / students, but might be read as ambiguous or unclear by self-taught developers.

fiobj2 vs. fiobj_to_

Conversion functions can be grouped together (for auto-complete optimization) without hurting readability, since the abc2xyz notation is not significantly different in readability than abc_to_xyz.

In addition, conversion functions create a distinct context, allowing the postfix to be abbreviated where context allows.

i.e., fiobj2num, fiobj2cstr, fiobj2float are all as readable as fiobj_to_num, fiobj_to_cstr, fiobj_to_float.

Internal code

Internal code should be updated to both reflect the new API and the new approach, improving it's readability and maintainability.

The 0.8.0 beta can't be released without finishing the new API and porting existing code. However, unclear internal code may be phased out and made more verbose with time.

For this reason, porting should be given priority for now.

Priorities in porting the existing code base to the new API / approach

Updates should be performed from the ground up, starting with the simple type core library (fio-stl.h), moving on to the core IO / reactor library (fio.h and fio.c) and from there on unfolding to each extension in turn (fio_tls, fiobj extensions, http, websockets and redis).

In order to allow for concurrent work, small PRs are preferred over larger PRs. At this point, simply make sure fio-stl.h compiles and passes tests.

Consider using an external project / C file that compiles and runs the following code successfully without compiling the rest of the facil.io library:

#define FIO_TEST_CSTL 1
#include "fio-stl.h"
int main(void) {
  fio_test_dynamic_types();
}

@frink
Copy link
Author

frink commented Jun 5, 2019

@boazsegev - Thanks for clarifications.

If I understand correctly, the goal is to abbreviate without obfuscation.

I'm concerned about going a file at a time with renaming. It is much safer to globally replace word at a time so that all references stay intact. In the above proposal there are not many that need to change so I should be able to get the code to conform to that spec quickly. Anything that affects deprecated APIs will still be changed because I am doing global search and replace. I'll branch off of 0.8.x for this.

Because of the nature of renaming it is better in this case only to do a monolithic commit since git considers any change to a line a complete line change. But most of the style described above has already been implemented and just needs some tidying up. Give me until the end of the week and I should have the abbreviations translated consistently.

If you do any coding work on the 0.8.x between now and then I will review and merge your changes into my branch so that it doesn't stop development. Then I can go back through file by file and do small PRs fine tuning internal variables per function.

Sounds like we got a plan. Thanks!


The most common abbreviation (arr) is common among computer science graduates / students, but might be read as ambiguous or unclear by self-taught developers.

I am a self-taught developer and have preferred to hired mostly self-taught developers over my 20+ year career. Yet I never encountered ary until this year as I started to work with Ruby. Every one of the coders I worked with used arr before working with me. I would not expect self-trained coders to start with Facil or any C library for that matter. If they started with Ruby they may have a problem with arr but every other entry language prefers arr... (See my previous survey above)

So my conclusion is that arr is ONLY ambiguous for self-taught coders who started with Ruby and moved IMMEDIATELY to Facil without any other coding experience. - Personally, I think that is a small enough minority as to be excluded from major concern...

@boazsegev
Copy link
Owner

the goal is to abbreviate without obfuscation.

That's one way to put it :)

On the other hand, readability is the higher concern, so missing possible abbreviations isn't a big deal.

I'm just thinking pragmatically and I believe some abbreviations have become a written dev language rather then an abbreviation.

I think about it as expanding the dictionary so deeply ingrained "abbreviations" (such as str) are considered as a synonym rather than an abbreviation :)

I'm concerned about going a file at a time with renaming. It is much safer to globally replace word at a time so that all references stay intact.

That's a good point.

The counter argument is that by updating the files "manually" (grep-ing one file at a time), we can comb through their exposed API and make adjustments as necessary.

This would also help us build a porting guide / script with a list of changes.

Also, finalizing the fio-stl.h API will make all later transitions easier, since we will get to iron out the details.

If they started with Ruby they may have a problem with arr but every other entry language prefers arr.

Actually, I started with C++ (where I would name these as v, for vector), moved to PHP, then Ruby, JavaScript and C.

For me arr still feels foreign, more like an abbreviation than a word. ary feels like a word (albeit misspelled) and array is the obvious English spelling.

I like array, but I'll leave this decision in your hands. I'm happy with both (except the MACROs, where ARRAY should be used).

If you do any coding work on the 0.8.x between now and then I will review and merge your changes into my branch so that it doesn't stop development.

Yes, please do.

I've made a few changes, changing small letter MACROs (uncapitalized) into inline functions and improving consistency for the abc2xyz syntax.

In addition, one of my latest commits updates the fio_str_info_s type to fio_cstr_s. I would love your opinion on that change, since I'm referencing cstr as a full information structure (containing len, capa and data / ptr) rather then only a pointer.

Perhaps we could start with finalizing fio-stl.h (which is almost done) and then perform a global grep on the core IO library and the extensions?

I'll hold off from committing any further updates.

@boazsegev
Copy link
Owner

Hi @frink ,

FYI: the FIOBJ data types were re-written from the ground up.

Currently, the 0.8.0 branch bypasses the new FIOBJ types, but the old implementation should be totally replaced with the implementation in the fio-stl.h (which needs to be re-extended using the fiobj_data type and the mustache templating extension).

This would make it a little hard to perform an "inplace" grep based replacement, as some API approaches were deprecated (for example, fiobj_ary_new2(capa) doesn't exist).

Anyway, I'll postpone any further changes for a little while. I need to write a quartet anyway, so I'll focus on my music.

Kindly,
Bo.

@frink
Copy link
Author

frink commented Jun 6, 2019

I think about it as expanding the dictionary so deeply ingrained "abbreviations" (such as str) are considered as a synonym rather than an abbreviation :)

Hehe. You're funny. Obviously not a native English speaker... ;-P

Currently, the 0.8.0 branch bypasses the new FIOBJ types, but the old implementation should be totally replaced with the implementation in the fio-stl.h (which needs to be re-extended using the fiobj_data type and the mustache templating extension).

I don't quite understand your vision there yet. But I expect that I just need to spend a little bit more time in the code. Now that we have a direction, that's where I'm headed...

@boazsegev
Copy link
Owner

My Vision for 0.8.0

I don't quite understand your vision there yet. But I expect that I just need to spend a little bit more time in the code. Now that we have a direction, that's where I'm headed...

I believe C is a beautiful language and many more developers would enjoy it if a few more basic tools were made available to help beginners.

And I know facil.io has these tools and could be used to make developers happy, not just when writing network applications, but when coding any application.

My vision for 0.8.0 is to separate the IO core library (fio.h and fio.c) from the "generic" core library (fio-stl.h), unify the API and remove code duplication.

It came about because I kept copying pieces from facil.io for non-network projects. I think one of the pieces I kept reusing most was the CLI (command line interface) API, but the simple types (lists, arrays, hash maps and strings) and the memory allocator were also high on the list.

This made me think "I should really separate the non-IO library from the IO library", which would also make it easier for other developers to enjoy facil.io.

Why the FIOBJ changes?

I though about it as a small STL, much like C++ has, only for C, and drew the line at JSON support (since every language today could need that).

JSON support meant that the dynamic types (FIOBJ) had to move back into the core library.

This was a wonderful opportunity to minimize code duplication that occurred between the FIOBJ types and the core (MACRO based) types and implement much of what I've learned when I extracted the FIOBJ dependency from the core library in version 0.7.0.

It was a good time to improve the pointer tagging system as well, improving type identification speeds significantly and minimizing some memory allocations.

However, some FIOBJ types / extensions are more network oriented, such as the fiobj_data type that abstracts away the data storage for binary data or the mustache template support.

I decided these extensions should be kept out of the core library, even if it means I will have to port or re-write some of their code to manage the new way FIOBJ objects can be extended (the new pointer tagging and virtual function structure).

The reason was simple - these extensions weren't required to support JSON.

So I rewrote the FIOBJ type system using only the Macro based types (Arrays, Hash Maps, Strings), added the special code required to support numbers in a way that minimizes memory allocations. This took 1,437 lines (including comments) and now we have a FIOBJ type-system that's about 20% faster (good) and sometimes recursive (which is less abuse resistant).

Now core FIOBJ types (without the extensions) and JSON parsing / formatting can be used in non-network programs with ease, copying a single file (fio-stl.h) into any project

That's why the FIOBJ changes.

@frink
Copy link
Author

frink commented Jun 7, 2019

That's why the FIOBJ changes.

Thanks. Makes sense.

... sometimes recursive (which is less abuse resistant).

I've never heard anyone suggest recursion invited abuse. What abuse are you referencing?

... some FIOBJ types / extensions are more network oriented, such as the fiobj_data type that abstracts away the data storage for binary data or the mustache template support.

Most implementations of Mustache (including two I've written in Lua and PHP) just use a JSON type representation natively with the addition of juggling context and partials separately. Is that how the fiobj_data is used?

@boazsegev
Copy link
Owner

I've never heard anyone suggest recursion invited abuse. What abuse are you referencing?

Incorrect use will result in stack explosion. i.e., freeing a deeply nested data structure will recursively free each object, including array / hash objects. This might result in a stack explosion.

Since JSON and HTTP form data are the only unsafe input source (currently), they should be protected against malicious data that will absue the nesting level.

The HTTP form parser is limited to floating 32 levels of nesting (32 levels starting at entry point, which might shift according to application usage).

The JSON implementation protects against this by limiting nesting input and output. The current limits should be updated since they aren't consistent.

The output nesting level is limited to FIOBJ_JSON_MAX_NESTING (currently 28), and the JSON input is limited to JSON_MAX_DEPTH (currently 31, and can't be more than 32).

These are low limits, but they are still higher than test number 18 in the JSON testing suite (which expects a JSON parser failure after 21 nested levels).

Is that how the fiobj_data is used?

The fiobj_data type is used for incoming HTTP uploads and form data, so large data isn't stored in-memory (but uses a temporary file), while small incoming data lives in-memory for higher performance.

When fiobj2cstr is called for a fiobj_data object, it degrades to a string, temporarily dumping the data to the RAM.

Most implementations of Mustache (including two I've written in Lua and PHP) just use a JSON type representation natively with the addition of juggling context and partials separately.

The fiobj_mustache_build function takes a FIOBJ object (usually a Hash) and builds a template using the data in the FIOBJ object.

Since JSON can be easily parsed into FIOBJ objects, this approach offers more flexibility than requiring a JSON input.

@frink
Copy link
Author

frink commented Jun 7, 2019

I get it now.

How bad does Facil blow up if nesting is too deep?

@boazsegev
Copy link
Owner

BTW, how's the progress on the renaming?

How bad does Facil blow up if nesting is too deep?

Depends.

There are generally two approaches:

  1. Quiet failure (the operation fails).

  2. Crash the process.

The general rule is that external input processing should result in quiet failures, as to prevent malicious attacks, while internal causes (programming errors) should crash as early as possible.

Specifically:

  1. JSON and HTTP nesting level errors result inn a quiet failure.

    1. HTTP stops processing the data (so some FIOBJ data exists, but it is limited in scope).
    2. JSON input fails to parse (no FIOBJ objects).
    3. JSON output results in partial output (all possible levels are processed, while levels that are too deep are ignored.
  2. Stack explosion / overflow (fiobj_free for a programmatically initiated nesting level error) should result in the OS killing the process.

  3. Recursive nesting (placing arr1 in arr2, placing arr2 in arr1, and then trying to free / iterate) will probably result in an eternal CPU spin until the stack explodes (the OS will kill the process) or all available memory is consumed (fiobj_each2 doesn't use recursion at the moment).

    I might need to make a few changes so this possible abuse / error is handles more smoothly. However, facil.io doesn't check against this error since the test is performance and resource intensive and the error can only be caused by a programming approach (error / intentional).

    At first I had a compilation flag that enables the recursive nesting test (making recursive nesting fail quietly), but I could never properly test that code branch, so I eventually removed it.

@frink
Copy link
Author

frink commented Jun 7, 2019

BTW, how's the progress on the renaming?

I"m behind. I've got a holiday coming up that will render me offline until Tuesday. Got about an hour and a half left on the whole thing. So Tuesday night US time Wednesday morning Russian time is when I'll likely report things finished.

I might need to make a few changes so this possible abuse / error is handles more smoothly.

Definitely for freeing reciprocal reference nesting. You may need to use some sort of reference counting to make sure you don't continue in recursion. In theory, you could do a two step process. One that would unset the linkage between the parent and child while storing a link in a new array temporarily. Then looping through the newly created array and freeing everything. (This I've seen reciprocal references how it is usually handled in higher level languages. Should work in C too.)

@boazsegev
Copy link
Owner

I've got a holiday coming up that will render me offline until Tuesday. Got about an hour and a half left on the whole thing...

Happy holiday. Have fun.

I guess I'll be waiting, since I've got many small updates to do in the fio-stl.h file and I don't want to disrupt your work.

Definitely for freeing reciprocal reference nesting. You may need to use some sort of reference counting to make sure you don't continue in recursion...

One of the changes I want to perform in the fio-stl.h files is to make sure that Arrays and Hash Maps appear as empty while they are freeing their own data.

I don't think I'll add other guards until I actually find a practical use-case for recursive nesting that isn't an error.

@boazsegev
Copy link
Owner

@fink , thank you for your help and input.

Since no PR arrived, I'm assuming you got bogged down with actual work (the one that puts food on the table)...? If so, I'll take over from here. If you have anything you want to add, let me know.

Thanks again!
Bo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API design Discussions about API choices
Projects
None yet
Development

No branches or pull requests

3 participants