
Map The Collections

Mike Caprio edited this page Mar 23, 2018 · 41 revisions

Georeference Collections Specimens to Visualize Expeditions and Ecosystems in Space and Time

SPECIAL NOTE: This is a major challenge with extensive scope. It likely requires a minimum of three teams working in parallel and in cooperation with each other to produce a full proof of concept and a data transformation and visualization pipeline.

Background

Museum collections are the backbone of their respective institutions, and the American Museum of Natural History is no exception. Of the 33 million specimens and artifacts housed in the Museum, approximately 24 million (comprising about 500,000 species) are part of the collections of the Division of Invertebrate Zoology. The staff in the Division of Invertebrate Zoology study and archive the living non-vertebrate animals, which make up 95% of all animal species. Most of the specimens are terrestrial arthropods, but there are also large collections of marine and freshwater invertebrates.

A drawer full of urchins from the Invertebrate collections

Digitizing museum collections is a challenge that every GLAM (galleries, libraries, archives, and museums) institution faces today, and there are no easy answers. Entering all the necessary data and metadata for millions of specimens is the work of many, many years, but it starts with gathering all the data currently available and building a foundation of good, clean records in a common collections database system. AMNH uses Axiell EMu as a specimen/object record database, and our goal is to enter all our specimens so we can publish available data to partners and consortia, extend our loan management capabilities, and ultimately be able to reconstruct historical ecosystems and relate them to other environmental changes over time.


Solutions

This challenge involves cleaning and updating taxonomic names, as well as cleaning and/or parsing locality data (where specimens were collected) for approximately 132,000 marine specimen records in the Division of Invertebrate Zoology database. It is primarily about transforming the data in the Mollusca, Crustacea and Other Invertebrate Phyla databases, populating georeferenced coordinates for as many localities as possible for all the Museum's marine animals, and then visualizing ecological communities over the last 150 years (combining invertebrate data with data from the marine mammal, sea snake, sea bird, and fish databases).

Here are the possible solution paths we would like to take:

  • Data Cleansing and Transformation: The digitized invertebrate data need a lot of cleanup before they can be entered into EMu. There are misspellings across many columns, and data are sometimes placed in the wrong columns; we do not know exactly how dirty the data are, only that they are very dirty. It would make sense to programmatically compare data such as taxonomic and species names against known species in online dictionaries and resources. Output for Axiell EMu can be Excel or CSV files with the headers found in the "clean" files in the repository. Can we make tools that help automate and manage this process as an ongoing task in digitizing collections? More notes on data cleansing below:

    1. In the clean dataset, almost everything has been parsed and cleaned
    2. All date fields should follow the same format: YYYY-Month-DD, e.g. 1921-January-30
    3. All verbatim columns are READ-ONLY and must never be modified
    4. Add a column prefixed with NEW_ for any part of a taxonomic name you can find
    5. Try to find the right class and order for a family name
    6. The clean spreadsheets are NOT authoritative for taxa for Mollusca, but are for Crustacea and Other Invert Phyla
  • Georeferencing: Locality data also need cleaning; many fields contain incorrect data (countries appearing in county fields, data misplaced or imported into the wrong fields). The group of fields that represent location may need to be joined into a single string, then parsed and broken down into the correct fields. Can you clean and transform the location data into latitude/longitude coordinates that we can place on a map and import back into the collections database? More notes on georeferencing below:

    1. Decimal degrees are the best format for lat/long
    2. All verbatim columns are READ-ONLY and must never be modified
    3. Try combining all locality data together in one concatenated verbatim field, then parse it and compare it to gazetteers
    4. The clean spreadsheets are authoritative for localities for ALL collections subsets (Mollusca, Crustacea and Other Invert Phyla)
  • Ecosystem mapping: We want to start creating a platform that will allow us to visualize the AMNH collections as "ecosystem snapshots" that we can explore through both space and time. Like the RC-Pangea project created at Hack the Dinos, we want to click on a region on a globe and see what species the Museum has collected from that region, when they were collected, and by whom. We've been inspired by the Welikia Project, which was able to recreate Manhattan island in the 1600s. This platform could someday turn into an interactive way to model expeditions, or could be used to show things like ocean temperatures and include other habitat modelling data to make that ecological snapshot even richer. How can we "bin" the localities of an ecosystem of species, for example "The Caribbean"? Can you create a prototype that will take the cleaned taxonomic data and georeferenced locality data and put them on an interactive map?
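
As one possible starting point for the date and lat/long notes in the cleansing and georeferencing items above, here is a minimal Python sketch. The list of accepted input date formats, the coordinate pattern, and the function names are assumptions made for illustration; they are not taken from the Museum's data or from EMu, and real verbatim values will need many more cases. Both functions only read their input and return a new value, so the verbatim columns are never modified.

```python
import re
from datetime import datetime

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

# Assumed input formats; month/day order for slash dates is a guess and
# should be checked against the actual verbatim data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%B %d, %Y")

# Assumed coordinate style: degrees with a degree sign, optional minutes and
# seconds, then a hemisphere letter, e.g. 40°46'53"N.
DMS_PATTERN = re.compile(
    r"""(?P<deg>\d+(?:\.\d+)?)\s*°\s*
        (?:(?P<min>\d+(?:\.\d+)?)\s*['′]\s*)?
        (?:(?P<sec>\d+(?:\.\d+)?)\s*["″]\s*)?
        (?P<hem>[NSEW])""",
    re.VERBOSE | re.IGNORECASE,
)


def normalize_date(verbatim):
    """Return the date as YYYY-Month-DD, or None if no assumed format matches."""
    text = verbatim.strip()
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        return f"{parsed.year:04d}-{MONTHS[parsed.month - 1]}-{parsed.day:02d}"
    return None


def to_decimal_degrees(verbatim):
    """Convert a degrees/minutes/seconds string to signed decimal degrees.

    Returns None when the whole string does not match the assumed pattern,
    so unparsed values can be reviewed by hand instead of guessed.
    """
    match = DMS_PATTERN.fullmatch(verbatim.strip())
    if match is None:
        return None
    value = (float(match.group("deg"))
             + float(match.group("min") or 0) / 60
             + float(match.group("sec") or 0) / 3600)
    if match.group("hem").upper() in ("S", "W"):
        value = -value  # south and west are negative in decimal degrees
    return round(value, 6)
```

Returning None rather than a best guess keeps unparsed values visible, which seems safer while nobody yet knows how dirty the data are.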

Data points should display the following information: specimen name, taxonomic hierarchy, catalog number, locality region, IRN, year/month collected, name of first collector
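
One way to prototype the "binning" question and the data point fields listed above is to assign each georeferenced specimen to a named bounding box and emit a GeoJSON point that most web map libraries can display. This is only a sketch under stated assumptions: the record keys, the function names, and the rough box used for "The Caribbean" are hypothetical placeholders, and the example record is made up rather than taken from the collections.

```python
import json

# Hypothetical regions as (min_lat, max_lat, min_lon, max_lon).
# The box below is a rough illustration, not an authoritative boundary.
REGIONS = {
    "The Caribbean": (9.0, 27.0, -90.0, -59.0),
}


def assign_region(lat, lon, regions=REGIONS):
    """Return the first named region whose bounding box contains the point."""
    for name, (min_lat, max_lat, min_lon, max_lon) in regions.items():
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return name
    return None


def to_feature(record):
    """Build a GeoJSON Feature carrying the fields a data point should show."""
    lat, lon = record["latitude"], record["longitude"]
    return {
        "type": "Feature",
        # GeoJSON stores coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {
            "specimen_name": record["specimen_name"],
            "taxonomic_hierarchy": record["taxonomic_hierarchy"],
            "catalog_number": record["catalog_number"],
            "locality_region": assign_region(lat, lon),
            "irn": record["irn"],
            "collected": record["collected"],  # year/month collected
            "first_collector": record["first_collector"],
        },
    }


# Made-up placeholder record, for illustration only.
EXAMPLE_RECORD = {
    "specimen_name": "Example specimen",
    "taxonomic_hierarchy": ["Phylum", "Class", "Order", "Family"],
    "catalog_number": "EXAMPLE-0001",
    "irn": 0,
    "collected": "1921-January",
    "first_collector": "Example Collector",
    "latitude": 18.0,
    "longitude": -66.0,
}

if __name__ == "__main__":
    print(json.dumps(to_feature(EXAMPLE_RECORD), indent=2))
```

Bounding boxes are the simplest possible bins; a real prototype would more likely test points against polygons for marine regions, but the output shape for the map could stay the same.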


Resources

Be sure to check the Online Resources and Data Sets page to see if there might be any general purpose code or utilities you might use. Also note that datasets from the Division of Vertebrate Zoology are subject to these terms and conditions.

Data Cleansing and Transformation:

Georeferencing:

Ecosystem mapping:


Challenge owner: Christine Johnson