Multiple values in one field

John Wieczorek edited this page May 17, 2018 · 5 revisions

How to deal with multiple values in one Darwin Core field

At times we have multiple values for a Darwin Core term in a single record.

Some examples of this would be:

  • More than one stateProvince or county (e.g., “Queensland/New South Wales”, stateProvinces in Australia; “Pondera;Toole”, counties in Montana, USA).
  • More than one sex value (e.g., when the occurrence refers to more than one individual, or a lot, we can have things like “1 female, 2 males”). Update: Based on community discussions (http://lists.tdwg.org/pipermail/tdwg-content/2015-February/003450.html), multiple values for sex are not recommended.
  • More than one life stage (e.g., when the occurrence refers to more than one individual, or a lot, we can have things like “adult and juvenile”). Update: Based on community discussions (http://lists.tdwg.org/pipermail/tdwg-content/2015-February/003450.html), multiple values for lifeStage are not recommended.
  • More than one measurement (e.g., “total length = 140cm, snout-vent length = 125cm”).

In cases such as those, we face a problem when deciding how to use the Darwin Core terms, or fields:

  • How do we use the corresponding fields?
      • Do we capture only one value?
      • Do we capture all values in those fields?
      • If so, how? Should we follow any particular format?

These questions are not trivial, and the answers are not simple, nor are they fixed and homogeneous for every field, as we will see.

What’s published out there

When we look at the published datasets, the most common way of encoding multiple values in a Darwin Core field is to separate them with a comma ‘ , ’. Less common options are ‘ ; ’, ‘ or ’, and sometimes ‘ and ’. However, the fact that it is usually done this way does not make it a good idea.

The county example:

Let’s take an example for the county field. Suppose the verbatim geographic data captured on a label is:

“US, Montana, Pondera-Toole county, along the Interstate 15”

Commonly, in the published datasets we would find the following:

dwc:country: United States

dwc:stateProvince: Montana

dwc:county: Pondera, Toole                     ----> note that these are two distinct counties

dwc:municipality:                              ----> usually this field would be found left empty

dwc:locality: Along the Interstate 15

What are the downsides of how data is currently published?

Note that if we record the county as described above, that is, if we populate it with multiple values, the dataset becomes ambiguous. For instance, a user looking only at the county field could wonder whether “Pondera, Toole” means that the location was in one or the other of the counties listed, or in both (and therefore necessarily along their shared border). Sometimes this ambiguity is resolved by capturing further information in the locality, verbatimLocality, or locationRemarks fields. However, if one looks at the county value alone, the ambiguity cannot be resolved. Furthermore, strictly speaking, the value in the county field is not a county, and therefore the semantics of the term are being abused. Is there a way to make it unambiguous?
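To get a feel for how often this happens in a dataset, a rough check like the one below can flag county values that look multi-valued. This is only a sketch: the function name looks_multivalued and the list of separators are assumptions made for illustration (taken from the separators mentioned on this page), not anything defined by Darwin Core. Hyphens are deliberately left out because many legitimate place names contain them.

```python
import re

# Separators seen in published data according to this page (",", ";", "or",
# "and"), plus "/" and "|". This list is an assumption, not part of Darwin Core.
SEPARATOR_PATTERN = re.compile(r"[,;/|]|\bor\b|\band\b", re.IGNORECASE)


def looks_multivalued(value):
    """Return True if the value contains something that looks like a list separator."""
    return bool(SEPARATOR_PATTERN.search(value.strip()))


# "Lewis and Clark" is a single county in Montana, included to show a false positive.
for county in ["Pondera, Toole", "Pondera;Toole", "Pondera or Toole",
               "Toole", "Lewis and Clark"]:
    print(county, "->", looks_multivalued(county))
```

Note that “Lewis and Clark” is flagged even though it is one county, so a check like this can only point a person at values worth reviewing. It cannot decide what the value means.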

What does the Darwin Core standard have to say about it?

Let’s take a look at the Darwin Core definition of county:

The full, unabbreviated name of the next smaller administrative region than stateProvince (county, shire, department, etc.) in which the Location occurs.

This definition is silent when it comes to having multiple counties, except that it uses the singular “administrative region”.

So, if we are to be faithful to the intention of the county field, that is, to capture a standard full name for the one administrative level below stateProvince, multiple counties should not go there.

Ok, understood, but… Then... what do we do with the county information?

In this case, if we were to follow the Darwin Core standard strictly, we should probably capture the county information in the locality field, added to whatever else was already there, and be explicit about whether the multiplicity is due to uncertainty (e.g., "Pondera or Toole") or to the location being on the border between the two (e.g., "Pondera/Toole county line").

In our example case, a best practice would then be to populate the fields as follows:

dwc:country: United States

dwc:stateProvince: Montana

dwc:county:                                                     ----> this field would be left empty

dwc:municipality:                                               ----> this field would be left empty

dwc:locality: Pondera/Toole county line, Along the Interstate 15
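The rearrangement above could be sketched in code as follows. This is an illustration only: the function relocate_counties and the relation labels "border" and "uncertain" are hypothetical names invented here, and someone who knows the data still has to decide which relation applies.

```python
def relocate_counties(record, counties, relation):
    """Return a copy of record with the multi-county information moved into locality.

    relation is "border" (on the line between the counties) or "uncertain"
    (in one of them, but we do not know which). Both labels are hypothetical.
    """
    if relation == "border":
        phrase = "/".join(counties) + " county line"
    elif relation == "uncertain":
        phrase = " or ".join(counties) + " county"
    else:
        raise ValueError(f"unknown relation: {relation!r}")

    updated = dict(record)
    updated["county"] = ""  # no single county applies, so the field is left empty
    existing = updated.get("locality", "")
    # Put the county phrase in front of the existing locality text, if any.
    updated["locality"] = f"{phrase}, {existing}" if existing else phrase
    return updated


record = {
    "country": "United States",
    "stateProvince": "Montana",
    "county": "Pondera, Toole",
    "locality": "Along the Interstate 15",
}
fixed = relocate_counties(record, ["Pondera", "Toole"], "border")
print(fixed["locality"])  # Pondera/Toole county line, Along the Interstate 15
```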

Other examples:

Let’s take a look at what the Darwin Core standard has to say in some other cases.

Here is a list of the separators Darwin Core recommends, or at least shows in its examples, for populating different terms when there are multiple values:

a. Separate values with “ | ”. This is explicitly recommended as best practice in Darwin Core for some fields.

Examples: dwc:higherGeography, dwc:typeStatus, dwc:identifiedBy, dwc:recordedBy, dwc:preparations, etc.
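For the fields that do accept lists, writing and reading the “ | ” separator could look like the sketch below. The helper names join_values and split_values are made up for this example, and the collector names are placeholders rather than real data.

```python
SEPARATOR = " | "  # space, vertical bar, space


def join_values(values):
    """Join a list of values into one field value using ' | '."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    for v in cleaned:
        if "|" in v:
            # A bar inside a value would make the list impossible to split back.
            raise ValueError(f"value contains the separator character: {v!r}")
    return SEPARATOR.join(cleaned)


def split_values(field):
    """Split a ' | ' separated field value back into a list, tolerating stray spaces."""
    return [part.strip() for part in field.split("|") if part.strip()]


recorded_by = join_values(["A. Collector", "B. Collector"])  # placeholder names
print(recorded_by)                # A. Collector | B. Collector
print(split_values(recorded_by))  # ['A. Collector', 'B. Collector']
```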

b. Separate values with “ , ”. This is not actually recommended by the standard, but cases could be found in the examples that Darwin Core provides with some term definitions. Update: Based on community discussions (http://lists.tdwg.org/pipermail/tdwg-content/2015-February/003450.html), multiple values for sex are not recommended and the example has been removed.

Example: dwc:sex (e.g., “8 males, 4 females”).

c. No separation. This is not actually recommended by the standard, but cases could be found in the examples that Darwin Core provides with some term definitions. Update: Based on community discussions (http://lists.tdwg.org/pipermail/tdwg-content/2015-February/003450.html), multiple values for lifeStage are not recommended and the example has been removed.

Example: dwc:lifeStage (e.g., “2 adults 4 juveniles”).

d. Use JSON format. This is explicitly recommended as best practice in Darwin Core for the dwc:dynamicProperties field.

Example: { "tragusLengthInMeters":0.014, "weightInGrams":120 }
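Rather than typing the JSON by hand, it is safer to let a JSON library produce it, so that quoting and escaping are always valid. A minimal sketch using Python's standard json module follows; the helper name to_dynamic_properties is an assumption made for this example.

```python
import json


def to_dynamic_properties(measurements):
    """Serialize a dict of measurements to a JSON string for dwc:dynamicProperties."""
    return json.dumps(measurements, ensure_ascii=False)


value = to_dynamic_properties({"tragusLengthInMeters": 0.014, "weightInGrams": 120})
print(value)  # {"tragusLengthInMeters": 0.014, "weightInGrams": 120}

# Anyone consuming the data can parse the field back into key/value pairs.
restored = json.loads(value)
print(restored["weightInGrams"])  # 120
```

Putting the unit in the key name, as the example above does, keeps each value a plain number that can be parsed without guessing.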

CONCLUSIONS:

And so…? What do we do…??

Best practice would be to:

1. Follow, first, the strict definitions of the Darwin Core terms. Do not put multiple values in terms that have singular meanings. Do not mismatch values with the semantics of the terms.

2. Follow the best practices recommended by the standard. Use ' | ' as the separator for values in a list. Use JSON for dynamicProperties.

3. Be consistent.

4. Please participate with your questions and comments. :)