Commit

Remove 'figure' from figure references, autoref already does it
ks905383 committed Aug 22, 2024
1 parent 4fc5172 commit 410aa2c
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions joss_paper/paper.md
@@ -29,13 +29,13 @@ bibliography: paper.bib
# Summary
Scientific data is often stored on grids or rasters: gridded weather observations, interpolated pollution data, night-time lights, or other remote sensing products all approximate the continuous real world for ease of calculation, standardization, or technical limitations. However, living things don't live on grids, and rarely act or observe data on grids either. Instead, demographic or agricultural data is often collected on the county or city level, birds fly along complex migratory corridors, and rain- and watersheds follow valleys and mountains; in other words, along areas that can be described using geographic polygons.

-When these raster and polygon worlds collide, as they often do in social or natural science research, data must often be aggregated between them (e.g., @auffhammer_using_2013). This aggregation must, however, be done with care. Consider a researcher who needs to aggregate temperature data from a gridded reanalysis product onto Los Angeles County, at which level they observe population or mortality statistics (Figure \autoref{fig1}). The simplest way to aggregate data would be to average across every grid cell that partially overlaps with the county. However, given the complex topography of the region, a grid cell only slightly overlapping with the county, or only overlapping with the sparsely populated mountains of the county, would be unhelpful if studying the relationship between temperature and society.
+When these raster and polygon worlds collide, as they often do in social or natural science research, data must often be aggregated between them (e.g., @auffhammer_using_2013). This aggregation must, however, be done with care. Consider a researcher who needs to aggregate temperature data from a gridded reanalysis product onto Los Angeles County, at which level they observe population or mortality statistics (\autoref{fig1}). The simplest way to aggregate data would be to average across every grid cell that partially overlaps with the county. However, given the complex topography of the region, a grid cell only slightly overlapping with the county, or only overlapping with the sparsely populated mountains of the county, would be unhelpful if studying the relationship between temperature and society.

![Illustration of `xagg` workflow. Variables stored on a geographic grid (in this case 2-meter daily temperature from ERA5 reanalysis; @hersbach_era5_2020), a set of geographic polygons (in this case US county borders, focusing on Los Angeles County as an example), and an optional second weight on a geographic grid (in this case LandScan Day Population; @rose_landscan_2017) are inputted (panels a., c.). `xagg` calculates the relative overlap between each ERA5 grid cell and each county (panel b.). `xagg` regrids the population grid to the ERA5 grid (panel d.), and produces a set of final grid cell weights composed of both the area overlap and the population density (panel e.). For each county, these weights are used to calculate weighted averages of daily temperature (panel f.), which can then be outputted in multiple formats for further analysis.\label{fig1}](xagg_joss_figure1.pdf)

Therefore, an ideal aggregation would weight not only by the area overlap between grid cells and polygons, but also optionally by other densities of relevant variables - population, area planted, etc. [@auffhammer_using_2013].

-`xagg` fulfills this need, by providing a simple interface for aggregating raster data stored in `xarray` [@hoyer_xarray_2017] `Datasets` or `DataArrays` onto polygons stored in `geopandas` [@bossche_geopandasgeopandas_2024] `geodataframes`, weighted by the fractional area overlap between the raster grid and the polygon, and optionally additionally weighted by a secondary gridded variable (see Figure \autoref{fig1} for a sample workflow). Fractional area weights are generated by constructing polygons for each grid cell and using `geopandas`' `gpd.overlay()` function to calculate the overlaps between input polygons and grid cells. Aggregated data is then returned as an `xarray` `Dataset`, a `pandas` `DataFrame`, or a `geopandas` `GeoDataFrame`, depending on the user's needs.
+`xagg` fulfills this need, by providing a simple interface for aggregating raster data stored in `xarray` [@hoyer_xarray_2017] `Datasets` or `DataArrays` onto polygons stored in `geopandas` [@bossche_geopandasgeopandas_2024] `geodataframes`, weighted by the fractional area overlap between the raster grid and the polygon, and optionally additionally weighted by a secondary gridded variable (see \autoref{fig1} for a sample workflow). Fractional area weights are generated by constructing polygons for each grid cell and using `geopandas`' `gpd.overlay()` function to calculate the overlaps between input polygons and grid cells. Aggregated data is then returned as an `xarray` `Dataset`, a `pandas` `DataFrame`, or a `geopandas` `GeoDataFrame`, depending on the user's needs.
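
The weighting idea described above can be illustrated with a minimal, dependency-free sketch. It is not `xagg`'s implementation: the function name `weighted_mean`, the sample numbers, and the choice to combine the two weights as a simple product are all assumptions made here for illustration. It only shows how an area-overlap weight and an optional secondary weight (such as population) might change a polygon-level average.

```python
def weighted_mean(values, area_overlap, secondary=None):
    """Average grid-cell values for one polygon.

    values       -- one value per grid cell touching the polygon
    area_overlap -- fraction of the polygon covered by each grid cell
    secondary    -- optional extra weight per grid cell (e.g. population)

    Hypothetical helper: the two weights are multiplied together,
    which is an assumption of this sketch, not a statement about xagg.
    """
    if secondary is None:
        secondary = [1.0] * len(values)
    weights = [a * s for a, s in zip(area_overlap, secondary)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero; the average is undefined")
    return sum(v * w for v, w in zip(values, weights)) / total


# Three made-up grid cells overlapping one made-up county.
temps = [20.0, 30.0, 10.0]         # temperature in each grid cell
overlap = [0.5, 0.25, 0.25]        # share of the county in each cell
population = [100.0, 300.0, 0.0]   # people in each cell (third is empty)

area_only = weighted_mean(temps, overlap)
area_and_pop = weighted_mean(temps, overlap, population)

print(area_only)     # 20.0
print(area_and_pop)  # 26.0
```

In this toy case the unpopulated third cell drops out of the population-weighted average entirely, which mirrors the motivation given above for the sparsely populated mountains of Los Angeles County.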


# Statement of need