This README catalogues common source code identifier naming structures, best practices, and semantics derived from research. The goal of this document is to serve as a resource for researchers, students, and developers who want to learn what is scientifically known about naming identifiers. We are currently looking into other types of identifier characteristics that should be included in this document. This is a living document: we will expand it as we discover more patterns and characteristics through our own, and possibly others', research. Check back periodically for more information!
This document is broken down into the following sections:
- Linguistic Terminology used throughout the document.
- Part-of-speech Tagset used throughout the document.
- Common Naming Structures derived by analyzing identifier names and deriving part-of-speech sequences called grammar patterns. This section discusses common identifier naming patterns and their meaning.
- Linguistic Antipatterns, which are recurring, detrimental practices in the naming, documentation, and/or choice of identifier. In this section we provide the antipattern name, a definition, and an example.
- Naming Styles, which are practices that dictate how identifiers should be lexically formed. The three most common naming styles (camelCase, under_score, and PascalCase) are pivotal to developer comprehension.
First, you should be familiar with some simple linguistic concepts.
Linguistic-terminology | Definition |
---|---|
Head-noun | The right-most noun in a noun phrase is typically referred to as a head-noun. This noun is the word that most closely embodies the concept representing the in-memory entity that the identifier describes. |
Noun-adjunct | A noun-adjunct is a noun acting as (i.e., being used as) an adjective. Noun-adjuncts are found in certain types of compound words which, in English, are often groups of two or more words separated by a hyphen. For example, in the word employee-name, 'employee' is a noun-adjunct and 'name' is a noun (or, more specifically, a head-noun). |
Hypernym | A word with a broad meaning that more specific words fall under; a superordinate. For example, color is a hypernym of red. (Definition from Oxford Languages.) |
Hyponym | A word of more specific meaning than a general or superordinate term applicable to it. For example, spoon is a hyponym of cutlery. (Definition from Oxford Languages.) |
The tagset that we use is a subset of the Penn Treebank tagset. Each of our annotations, along with examples, can be found below. Further examples and definitions can be found in the paper [1].
Abbreviation | Expanded Form | Examples |
---|---|---|
N | noun | Disneyland, shoe, faucet, mother, bedroom |
DT | determiner | the, this, that, these, those, which |
CJ | conjunction | and, for, nor, but, or, yet, so |
P | preposition | behind, in front of, at, under, beside, above, beneath, despite |
NPL | noun plural | streets, cities, cars, people, lists, items, elements |
NM | noun modifier (adjective) | red, cold, hot, scary, beautiful, happy, faster, small |
NM | noun modifier (noun-adjunct italicized) | *employee*Name, *file*Path, *font*Size, *user*Id |
V | verb | run, jump, drive, spin |
VM | verb modifier (adverb) | very, loudly, seriously, impatiently, badly |
PR | pronoun | she, he, her, him, it, we, us, they, them, I, me, you |
D | digit | 1, 2, 10, 4.12, 0xAF |
PRE | preamble (e.g., Hungarian) | Gimp, GLEW, GL, G, p_, m_, b_ |
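To connect the tagset with the head-noun definition given earlier, here is a minimal sketch that locates the head-noun of an already-tagged identifier by scanning for the right-most word tagged N or NPL. The class name `HeadNounFinder`, its method, and the tagging of `employeeName` as NM N are illustrative assumptions, not part of any published tool; the sketch also assumes the identifier has already been split into words and tagged by hand or by some tagger.

```java
/** Hypothetical helper: finds the head-noun in a hand-tagged identifier. */
final class HeadNounFinder {

    /**
     * Returns the right-most word whose tag is N or NPL, or null if the
     * identifier contains no noun. words[i] is tagged with tags[i].
     */
    static String headNoun(String[] words, String[] tags) {
        for (int i = words.length - 1; i >= 0; i--) {
            if (tags[i].equals("N") || tags[i].equals("NPL")) {
                return words[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // employeeName, assumed tagging: employee/NM Name/N
        String[] words = {"employee", "Name"};
        String[] tags = {"NM", "N"};
        System.out.println(headNoun(words, tags)); // prints "Name"
    }
}
```

In this example the noun-adjunct 'employee' is skipped because it carries the NM tag, which matches the way the table distinguishes noun modifiers from nouns.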
The grammar patterns below represent different naming structures found in source code; they are represented by sequences of part-of-speech tags. The patterns we present are all empirically derived from a manually tagged sample of 1,335 identifiers. Refer to Newman et al. [1] for more information. The manually tagged dataset is freely available here.
We present each pattern, a definition for the pattern, and examples of the pattern below. We use regular expression syntax, where the * symbol means "zero or more" while the + symbol means "one or more" of the token.
Grammar Pattern Sequence | Definition |
---|---|
NM* N | Noun Phrase: Zero or more noun-modifiers appear to the left of a head-noun. Noun-modifiers that appear before the head-noun act as a way to specialize our understanding of the head-noun by taking the general concept the head-noun represents and reducing it to a more concise, specific concept. For example, in the identifier 'issueDescription' the head-noun is 'Description', which is the general concept. The noun-adjunct, 'issue', specializes our understanding of the 'Description' by specifying what kind of 'Description' we are talking about. It is good practice to be careful in the choice, and number, of noun-modifiers to use before the head-noun. A good identifier will include only enough noun-modifiers to concisely define the concept represented by the head-noun. This is the most common naming pattern for identifiers that are not function names.
|
NM* NPL | Plural noun phrase: This is identical to Noun Phrase (NM* N), except the head-noun is plural. The plural is often purposeful in that the head-noun's plurality expresses the multiplicity of the data. That is, these identifiers (when they are not function names) are more likely to have a collection data type [1]. Some naming conventions (e.g., the Java naming standard) generally consider it good practice to match the plurality of the identifier with whether its type represents a singular or collection object. Identifiers that follow this pattern are usually not function names.
|
V NM* (N|NPL) | Verb Phrase: The addition of a verb to a noun phrase creates a verb phrase. The verb in a verb phrase is an action being applied to (or with) the concept embodied by the noun phrase that follows. In some cases, instead of being an action, the verb is an existential quantifier. In this case, the identifier's data type is probably (interpretable as) Boolean. These are typically either function identifiers or identifiers with a Boolean type.
|
P NM* (N|NPL) | Prepositional phrase: A noun or verb phrase with a leading preposition is a prepositional phrase. The preposition in a prepositional phrase typically explains how the entity (or entities) represented by the accompanying noun or verb phrase are related in terms of order, space, time (e.g., on_enter), ownership, causality, or representation (e.g., to_string). In the case of this specific grammar pattern, there is often an unspecified verb on the left-hand side of the preposition. The unspecified verb is usually an action such as GET, CONVERT (e.g., to string), or EXECUTE (e.g., on enter). Developers understand the implied action because of experience or domain knowledge; for example, they understand the implied verb in event-driven functions beginning with the preposition 'on'. There may also be a noun phrase to the left of the preposition. We discuss these in another grammar pattern below. This pattern is used in many types of identifiers, whether they are function names or otherwise.
|
NM* N P NM* (N|NPL) | Prepositional phrase with leading noun phrase: Sometimes a noun phrase is explicitly present on both the left and right of the preposition. When the left-hand-side noun phrase is specified, there is an explicit relationship between the left- and right-hand-side noun phrases. This relationship is expressed through the preposition. The preposition helps us understand how the entity (or entities) represented by both noun phrases are related in terms of order, space, time (e.g., generated_token_on_creation), ownership (e.g., scroll_id_for_node), causality, or representation (e.g., url_from_json, query_timeout_in_milliseconds). This pattern is used in many types of identifiers, whether they are function names or otherwise.
|
V P NM* (N|NPL) | Prepositional phrase with leading verb: Same as the prepositional phrase pattern, but the leading verb, or verb phrase, is specified this time. As before, the preposition helps us understand how the entity (or entities) represented by the verb and noun phrases are related in terms of order, space, time, ownership, causality (e.g., destroy_with_parent), or representation (e.g., save_as_quadratic_png, tessellate_to_mesh, convert_to_php_namespace). The usage of this pattern is similar to when the verb is implicit. There may still be an implicit noun phrase to the right of the verb and to the left of the preposition. This pattern is used in many types of identifiers, whether they are function names or otherwise.
|
V* DT NM* (N|NPL) | Noun phrase with leading determiner: The addition of a determiner tells us how much of the population, which is specified by the noun phrase, is represented, or acted on, by the identifier. Typically, the determiner will tell us that we are interested in ALL, ANY, ONE, A, THE, SEVERAL, etc., of the population of objects specified by the noun phrase. If there is a leading verb, the verb specifies an action to take on the population or it represents existential quantification (e.g., matchesAnyParentCategories). This pattern is used in many types of identifiers, whether they are function names or otherwise.
|
V+ | Verb sequence: One or more verbs with no noun phrase. Because these are missing a noun phrase to act upon (in contrast to the Verb Phrase pattern above), a larger population of these are likely generic functions like Sort (though more data/research is needed), which can act upon many different types of data and have different behaviors depending on the data being acted upon. The noun phrase that this action (i.e., the verb) is applied to is implicit. That is, it is not present in the identifier name. Instead, the noun phrase is implied by the program context (e.g., it is represented by a this-pointer) or it is present in the function parameters. In some cases, these are Boolean-type variables that may be missing an existential quantifier (e.g., add 'is' before 'parsing' to make it explicit). These are typically function names or identifiers with a Boolean type.
|
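Because the grammar patterns above are written in regular expression syntax, a sequence of part-of-speech tags can be checked against them mechanically. The following is a minimal sketch under stated assumptions, not the tooling used in [1]: the class `GrammarPatternMatcher`, the pattern names, and the underscore encoding of tags are hypothetical, and the tag sequences given for the example identifiers are our own illustrative taggings. Each tag in a pattern is rewritten as a group of the form `(?:TAG_)` so that multi-letter tags such as NM and NPL are treated as single tokens.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical classifier: maps a tag sequence to a grammar pattern name. */
final class GrammarPatternMatcher {

    // Pattern name -> compiled regex, kept in the order of the table above.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();

    static {
        register("Noun Phrase", "NM* N");
        register("Plural Noun Phrase", "NM* NPL");
        register("Verb Phrase", "V NM* (N|NPL)");
        register("Prepositional Phrase", "P NM* (N|NPL)");
        register("Prepositional Phrase with Leading Noun Phrase", "NM* N P NM* (N|NPL)");
        register("Prepositional Phrase with Leading Verb", "V P NM* (N|NPL)");
        register("Noun Phrase with Leading Determiner", "V* DT NM* (N|NPL)");
        register("Verb Sequence", "V+");
    }

    /** Turns e.g. "NM* N" into the regex "(?:NM_)*(?:N_)". */
    private static void register(String name, String grammarPattern) {
        String regex = grammarPattern
                .replaceAll("[A-Z]+", "(?:$0_)") // one group per tag token
                .replace(" ", "");               // spaces only separate tokens
        PATTERNS.put(name, Pattern.compile(regex));
    }

    /** Classifies a space-separated tag sequence such as "V NM N". */
    static String classify(String tagSequence) {
        // Encode "V NM N" as "V_NM_N_" so every tag ends with an underscore.
        String encoded = tagSequence.trim().replaceAll("\\s+", "_") + "_";
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(encoded).matches()) {
                return entry.getKey();
            }
        }
        return "Unknown";
    }

    public static void main(String[] args) {
        // Tag sequences below are assumed taggings of identifiers from the text.
        System.out.println(classify("NM N"));        // issueDescription -> Noun Phrase
        System.out.println(classify("P N"));         // to_string -> Prepositional Phrase
        System.out.println(classify("V P N"));       // destroy_with_parent -> Prepositional Phrase with Leading Verb
        System.out.println(classify("V DT NM NPL")); // matchesAnyParentCategories -> Noun Phrase with Leading Determiner
    }
}
```

Keeping the patterns as the same strings that appear in the table makes it easy to compare the code with the definitions; the trade-off is that the translation step assumes tags consist only of uppercase letters.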
Linguistic Antipatterns (LAs) in software systems are recurring, detrimental practices in the naming, documentation, and/or choice of identifier in the implementation of an entity, which impair program understanding. They were first discussed by Arnaoudova et al. [2]. They typically take the form of an identifier name that incorrectly describes the behavior of the entity that it represents, or an entity that betrays the behavior conveyed linguistically by its corresponding identifier.
Name | Definition and Example |
---|---|
Get more than accessor | A getter that performs actions other than returning the corresponding attribute. Example: method getImageData, which always returns a new object.
ImageData getImageData() {
    final Point size = this.getSize();
    this.imageData = new ImageData(size.x, size.y, 8);
    return this.imageData;
}
|
Is returns more than a Boolean | The name of a method is a predicate suggesting a true/false return value. However, the return type is not Boolean but a more complex type, which allows a wider range of values without documenting them. Example: method isValid with return type int.
public int isValid() {
    final long currentTime = System.currentTimeMillis();
    if (currentTime <= this.expires) {
        // The delay has not passed yet -
        // assuming source is valid.
        return SourceValidity.VALID;
    }
    // The delay has passed, prepare for the next interval.
    this.expires = currentTime + this.delay;
    return this.delegate.isValid();
}
|
Set method returns | A set method having a return type different than void without proper documentation of the return type/values. Example: method setBreadth has a non-void return type.
public Dimension setBreadth(final Dimension target, final int source) {
    if (this.orientation == Orientation.VERTICAL) {
        return new Dimension(source, (int) target.getHeight());
    } else {
        return new Dimension((int) target.getWidth(), source);
    }
}
|
Expecting but not getting single instance | The name of a method indicates that a single object is returned but the return type is a collection. Example: method getExpansion, which ends with a head-noun that is singular, but returns a List object.
/**
 * Returns the expansion state for a tree.
 *
 * @return the expansion state for a tree
 */
public List getExpansion() {
    return this.fExpansion;
}
|
Not implemented condition | The comments of a method suggest a conditional behavior that is not implemented in the code. When the implementation is default this should be documented. Example: method getChildren has a comment which indicates there should be a conditional within its body.
/**
 * Returns the children of this object. When this object is
 * displayed in a tree, the returned objects will be this
 * element's children. Returns an empty array if this object
 * has no children.
 *
 * @param object The object to get the children for.
 */
public Object[] getChildren(final Object o) {
    return new Object[0];
}
|
Validation method does not confirm | A validation method (e.g., name starting with "validate", "check", "ensure") does not confirm the validation, i.e., the method neither provides a return value informing whether the validation was successful, nor documents how to proceed to understand. Example: method checkCollision returns void despite indicating that it is designed to perform validation.
public void checkCollision(final String before,
        final String after) {
    final boolean collision = before != null
            && before.equals(this._shortName) || after != null
            && after.equals(this._shortName);
    if (collision) {
        if (this._longName == null) {
            this._longName = this.getLongName();
        }
        this._displayName = this._longName;
    }
}
|
Get method does not return | The name suggests that the method returns something (e.g., name starts with "get" or "return") but the return type is void. The documentation should explain where the resulting data is stored and how to obtain it. Example: method getMethodBodies has a void return type but its name indicates that it is a getter method.
protected void getMethodBodies(
        final CompilationUnitDeclaration unit,
        final int place) {
    //[Removed some code for conciseness]
    this.parser.scanner
        .setSourceBuffer(
            unit.compilationResult.compilationUnit
                .getContents());
    if (unit.types != null) {
        for (int i = unit.types.length; --i >= 0;) {
            unit.types[i].parseMethod(this.parser, unit);
        }
    }
}
|
Not answered question | The name of a method is in the form of a predicate whereas the return type is not Boolean. Example: method isValid with a void return type.
public void isValid(final Object[] selection,
        final StatusInfo res) {
    // only single selection
    if (selection.length == 1
            && selection[0] instanceof IFile) {
        res.setOK();
    } else {
        res.setError(""); //$NON-NLS-1$
    }
}
|
Transform method does not return | The name of a method suggests the transformation of an object but there is no return value and it is not clear from the documentation where the result is stored. Example: method javaToNative has a void return type but indicates that it performs a transformation (i.e., type conversion).
public void javaToNative(final Object object,
        final TransferData transferData) {
    final byte[] check =
        LocalSelectionTransfer.TYPE_NAME.getBytes();
    super.javaToNative(check, transferData);
}
|
Expecting but not getting a collection | The name of a method suggests that a collection should be returned but a single object or nothing is returned. Example: method getStats with a Boolean return type, which makes it difficult to understand the reason behind the plurality of the method name.
public boolean getStats() {
    return SAXParserBase._stats;
}
|
Method name and return type are opposite | The intent of the method suggested by its name is in contradiction with what it returns. Example: method disable with return type ControlEnableState. The words "disable" and "enable" have opposite meanings.
public static ControlEnableState disable(Control w) {
    return new ControlEnableState(w);
}
|
Method signature and comment are opposite | The documentation of a method is in contradiction with its declaration. Example: method isNavigateForwardEnabled is in contradiction with its comment documenting "a back navigation", as "forward" and "back" are antonyms.
/**
 * Returns true if this listener has a target for a
 * back navigation. Only one listener needs to return
 * true for the back button to be enabled.
 */
public boolean isNavigateForwardEnabled() {
    boolean enabled = false;
    if (this._isForwardEnabled == 1) {
        enabled = true;
    } else {
        if (this._isForwardEnabled != 0) {
            enabled = this.navigateForward(false) != null;
        }
    }
    return enabled;
}
|
Says one but contains many | The name of an attribute suggests a single instance, while its type suggests that the attribute stores a collection of objects. Example: attribute _target that is of type Vector. It is unclear whether a change affects one or multiple instances in the collection.
Vector _target;
|
Name suggests boolean but type is not | The name of an attribute suggests that its value is true or false, but its declaring type is not Boolean. Example: attribute isReached that is of type int[] where the declared type and values are not documented.
int[] isReached;
|
Says many but contains one | The name of an attribute suggests multiple instances, but its type suggests a single one. Example: attribute _stats that is of type Boolean. Documenting such inconsistencies avoids additional comprehension effort to understand the purpose of the attribute.
private static boolean _stats = true;
|
Attribute name and type are opposite | The name of an attribute is in contradiction with its type as they contain antonyms. Example: attribute start that is of type MAssociationEnd . The use of antonyms can induce wrong assumptions.
MAssociationEnd start = null;
|
Attribute signature and comment are opposite | The declaration of an attribute is in contradiction with its documentation. Example: attribute INCLUDE_NAME_DEFAULT whose comment documents an "exclude pattern". Whether the pattern is included or excluded is thus unclear.
/**
 * Configuration default exclude pattern,
 * ie .*\/@href|.*\/@action|frame/@src
 */
public final static String INCLUDE_NAME_DEFAULT
    = ".*/@href=|.*/@action=|frame/@src=";
|
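Some of the method-level antipatterns above depend only on a method's name and return type, so they can be flagged automatically. The sketch below is one hedged way to do that with Java reflection; it is not the detector from [2]. The class `AntipatternChecker`, the helper `startsWithWord`, and the nested `Sample` class are hypothetical names introduced for illustration. It assumes that an "is" method returning void counts as "Not answered question" and any other non-Boolean "is" method as "Is returns more than a Boolean", following the examples in the table. Antipatterns that involve comments or documentation cannot be detected this way.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical checker for a few name/return-type antipatterns. */
final class AntipatternChecker {

    /** Returns sorted findings of the form "methodName: antipattern name". */
    static List<String> check(Class<?> type) {
        List<String> findings = new ArrayList<>();
        for (Method method : type.getDeclaredMethods()) {
            if (method.isSynthetic()) {
                continue; // skip compiler-generated methods
            }
            String name = method.getName();
            Class<?> returnType = method.getReturnType();
            boolean returnsBoolean =
                    returnType == boolean.class || returnType == Boolean.class;
            boolean returnsVoid = returnType == void.class;

            if (startsWithWord(name, "is") && !returnsBoolean) {
                findings.add(name + ": " + (returnsVoid
                        ? "Not answered question"
                        : "Is returns more than a Boolean"));
            } else if (startsWithWord(name, "get") && returnsVoid) {
                findings.add(name + ": Get method does not return");
            } else if (startsWithWord(name, "set") && !returnsVoid) {
                findings.add(name + ": Set method returns");
            }
        }
        Collections.sort(findings); // reflection order is unspecified
        return findings;
    }

    /** True if name begins with prefix followed by a new word (e.g., isValid, not island). */
    private static boolean startsWithWord(String name, String prefix) {
        return name.startsWith(prefix)
                && name.length() > prefix.length()
                && !Character.isLowerCase(name.charAt(prefix.length()));
    }

    /** Hypothetical class with deliberately misleading method names. */
    static class Sample {
        int isValid() { return 0; }
        void getItems() { }
        Sample setName(String name) { return this; }
        boolean isEmpty() { return true; }
    }

    public static void main(String[] args) {
        for (String finding : check(Sample.class)) {
            System.out.println(finding);
        }
        // getItems: Get method does not return
        // isValid: Is returns more than a Boolean
        // setName: Set method returns
    }
}
```

A check like this only reports candidates. A set method that returns a value may be acceptable when the return value is documented, which is why the definitions above emphasize documentation.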
Naming style concerns the lexical structure of an identifier name. The three most common naming styles are camelCase, under_score, and PascalCase. Prior research [3] found that camelCase and under_score do not significantly differ in terms of improving or degrading the comprehension abilities of developers as long as the developer had training or experience using the given style. It is worth noting that this same paper found that camelCase has a slight edge in terms of comprehension for shorter identifier names. This observation is supported by [4] and [5]. The importance of naming style was further emphasized in a study of developer opinions on identifier naming practices [6].
Because there has been no data to suggest that one naming style is better than the others, it is most important that development projects pick a naming style and remain consistent in the usage of that naming style throughout the code.
Naming Style | Definition | Example |
---|---|---|
camelCase | The first letter of each word in an identifier, except the first word, is capitalized | getFullName() |
under_score | An under_score is placed between each word in the identifier | call_with_default() |
PascalCase | The first letter of each word in an identifier, including the first word, is capitalized. | NewObject() |
kebab-case | This is a variant of under_score, used in languages that allow dashes (-) in identifier names, such as Lisp and Forth | employee-name |
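Since consistency matters more than the specific style chosen, it can be useful to split an identifier into its words and re-join them in a project's preferred style. The following is a minimal sketch of that idea; the class `NamingStyles` and its methods are hypothetical. It assumes word boundaries occur at underscores, dashes, and transitions from a lowercase letter or digit to an uppercase letter, so runs of capitals such as acronyms are not split further.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical converter between the naming styles in the table above. */
final class NamingStyles {

    /** Splits an identifier into lowercase words. */
    static List<String> splitWords(String identifier) {
        List<String> words = new ArrayList<>();
        // Split on '_' or '-' runs, or between a lowercase/digit and an uppercase letter.
        for (String part : identifier.split("[_-]+|(?<=[a-z0-9])(?=[A-Z])")) {
            if (!part.isEmpty()) {
                words.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    private static String capitalize(String word) {
        return Character.toUpperCase(word.charAt(0)) + word.substring(1);
    }

    static String toPascalCase(String identifier) {
        StringBuilder result = new StringBuilder();
        for (String word : splitWords(identifier)) {
            result.append(capitalize(word));
        }
        return result.toString();
    }

    static String toCamelCase(String identifier) {
        String pascal = toPascalCase(identifier);
        return pascal.isEmpty()
                ? pascal
                : Character.toLowerCase(pascal.charAt(0)) + pascal.substring(1);
    }

    static String toUnderScore(String identifier) {
        return String.join("_", splitWords(identifier));
    }

    static String toKebabCase(String identifier) {
        return String.join("-", splitWords(identifier));
    }

    public static void main(String[] args) {
        System.out.println(toUnderScore("getFullName"));      // get_full_name
        System.out.println(toCamelCase("call_with_default")); // callWithDefault
        System.out.println(toPascalCase("employee-name"));    // EmployeeName
        System.out.println(toKebabCase("NewObject"));         // new-object
    }
}
```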
1. Christian D. Newman, Reem S. Alsuhaibani, Michael J. Decker, Anthony Peruma, Dishant Kaushik, Mohamed Wiem Mkaouer, Emily Hill, On the generation, structure, and semantics of grammar patterns in source code identifiers, Journal of Systems and Software, 2020, 110740, ISSN 0164-1212, https://doi.org/10.1016/j.jss.2020.110740. (http://www.sciencedirect.com/science/article/pii/S0164121220301680)
2. Arnaoudova, V., Di Penta, M. & Antoniol, G. Linguistic antipatterns: what they are and how developers perceive them. Empir Software Eng., Vol 21, 104–158 (2016). https://doi.org/10.1007/s10664-014-9350-8
3. Binkley, D., Davis, M., Lawrie, D. et al. The impact of identifier style on effort and comprehension. Empir Software Eng 18, 219–276 (2013). https://doi.org/10.1007/s10664-012-9201-4
4. D. Binkley, M. Davis, D. Lawrie and C. Morrell, "To camelcase or under_score," 2009 IEEE 17th International Conference on Program Comprehension, 2009, pp. 158-167, doi: https://doi.org/10.1109/ICPC.2009.5090039.
5. B. Sharif and J. I. Maletic, "An Eye Tracking Study on camelCase and under_score Identifier Styles," 2010 IEEE 18th International Conference on Program Comprehension, 2010, pp. 196-205, doi: https://doi.org/10.1109/ICPC.2010.41.
6. R. S. Alsuhaibani, C. D. Newman, M. J. Decker, M. L. Collard and J. I. Maletic, "On the Naming of Methods: A Survey of Professional Developers," 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 2021, pp. 587-599, doi: https://doi.org/10.1109/ICSE43902.2021.00061.
This material is based in part upon work supported by the National Science Foundation under Grant No. 1850412.
This page is currently supported by the SCANL lab. If other research labs join this effort, we will list their webpages here as well.
If you are interested in correcting something in this document, open an issue! If you would like to add to it or otherwise contribute, or if you are just interested in our research and want to ask questions, please email: scanl.lab@gmail.com