
Memory Error while generating tiles using render_tiles in tilling.ipynb example #1039

Open
JuluriSaiKiran opened this issue Jan 5, 2022 · 8 comments


JuluriSaiKiran commented Jan 5, 2022

Hi,

I'm trying to generate tiles using datashader's render_tiles (the Datashader tiling example).

My dataset is 8M records.

During the aggregation step I hit a MemoryError:

MemoryError: Unable to allocate 64.0 MiB for an array with shape (4096, 4096) and data type uint32

It is raised inside my rasterize_func, at cvs.points(df, 'x', 'y').

I get this error at zoom level 12 and above; up to zoom 11 I can successfully generate tile sets with the same code snippet.

I also tried Dask dataframes to improve processing.

What is the best way to encounter this problem?

@JuluriSaiKiran JuluriSaiKiran changed the title Memory Error in rasterize_func while csv.points in tilling.ipynb example Memory Error while generating tiles using render_tiles in tilling.ipynb example Jan 5, 2022
Member

jbednar commented Jan 5, 2022

Sounds like you already found a good way to encounter this problem, but I imagine you want to avoid this problem! :-) I would think that a dask dataframe without calling .persist() would avoid memory issues like this, at the cost of being slower than using a persisted Dask dataframe. But if you're using a non-persisted Dask dataframe and still running into issues, it's probably that we are keeping too many tiles in memory somehow, which I'd guess we could avoid. We'd need a reproducible example with runnable code and info about how much memory you have, and I can't promise when we'd be able to look into that, but I'd guess that it would be solvable if we did have time to look at it.

Author

JuluriSaiKiran commented Jan 5, 2022

Yes, I'm not persisting the Dask dataframe. I'm just converting a pandas dataframe to Dask with:

df = dd.from_pandas(df_p, npartitions=mp.cpu_count())

This is the error traceback:

calculating statistics for level 13
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
~\AppData\Local\Temp\3/ipykernel_6992/3644162636.py in <module>
     13     return df.loc[df['x'].between(*x_range) & df['y'].between(*y_range)]
     14 
---> 15 results = render_tiles(full_extent_of_data,
     16                        range(13,14),
     17                        load_data_func=load_data_func,

D:\sai\tiles.py in render_tiles(full_extent, levels, load_data_func, rasterize_func, shader_func, post_render_func, output_path, color_ranging_strategy)
     69     for level in levels:
     70         print('calculating statistics for level {}'.format(level))
---> 71         super_tiles, span = calculate_zoom_level_stats(list(gen_super_tiles(full_extent, level)),
     72                                                        load_data_func, rasterize_func,
     73                                                        color_ranging_strategy=color_ranging_strategy)

D:\sai\tiles.py in calculate_zoom_level_stats(super_tiles, load_data_func, rasterize_func, color_ranging_strategy)
     44             # print("SUPER TILE--")
     45             # print(super_tile)
---> 46             agg = _get_super_tile_min_max(super_tile, load_data_func, rasterize_func)
     47             # print("AGG---")
     48             # print(agg)

D:\sai\tiles.py in _get_super_tile_min_max(tile_info, load_data_func, rasterize_func)
     29     tile_size = tile_info['tile_size']
     30     df = load_data_func(tile_info['x_range'], tile_info['y_range'])
---> 31     agg = rasterize_func(df, x_range=tile_info['x_range'],
     32                          y_range=tile_info['y_range'],
     33                          height=tile_size, width=tile_size)

~\AppData\Local\Temp\3/ipykernel_6992/3644162636.py in rasterize_func(df, x_range, y_range, height, width)
      6                     plot_height=height, plot_width=width)
      7 #     print("rasterize_func Canvas")
----> 8     agg = cvs.points(df, 'x', 'y')
      9     return agg
     10 def load_data_func(x_range, y_range):

~\AppData\Roaming\Python\Python38\site-packages\datashader\core.py in points(self, source, x, y, agg, geometry)
    212             glyph = MultiPointGeometry(geometry)
    213 
--> 214         return bypixel(source, self, glyph, agg)
    215 
    216     def line(self, source, x=None, y=None, agg=None, axis=0, geometry=None,

~\AppData\Roaming\Python\Python38\site-packages\datashader\core.py in bypixel(source, canvas, glyph, agg)
   1211     with np.warnings.catch_warnings():
   1212         np.warnings.filterwarnings('ignore', r'All-NaN (slice|axis) encountered')
-> 1213         return bypixel.pipeline(source, schema, canvas, glyph, agg)
   1214 
   1215 

~\AppData\Roaming\Python\Python38\site-packages\datashader\utils.py in __call__(self, head, *rest, **kwargs)
    107         typ = type(head)
    108         if typ in lk:
--> 109             return lk[typ](head, *rest, **kwargs)
    110         for cls in getmro(typ)[1:]:
    111             if cls in lk:

~\AppData\Roaming\Python\Python38\site-packages\datashader\data_libraries\dask.py in dask_pipeline(df, schema, canvas, glyph, summary, cuda)
     27 
     28     if isinstance(dsk, da.Array):
---> 29         return da.compute(dsk, scheduler=scheduler)[0]
     30 
     31     keys = df.__dask_keys__()

d:\installed\anaconda3\envs\rasterio\lib\site-packages\dask\base.py in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    569         postcomputes.append(x.__dask_postcompute__())
    570 
--> 571     results = schedule(dsk, keys, **kwargs)
    572     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    573 

d:\installed\anaconda3\envs\rasterio\lib\site-packages\dask\threaded.py in get(dsk, result, cache, num_workers, pool, **kwargs)
     77             pool = MultiprocessingPoolExecutor(pool)
     78 
---> 79     results = get_async(
     80         pool.submit,
     81         pool._max_workers,

d:\installed\anaconda3\envs\rasterio\lib\site-packages\dask\local.py in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
    505                             _execute_task(task, data)  # Re-execute locally
    506                         else:
--> 507                             raise_exception(exc, tb)
    508                     res, worker_id = loads(res_info)
    509                     state["cache"][key] = res

d:\installed\anaconda3\envs\rasterio\lib\site-packages\dask\local.py in reraise(exc, tb)
    313     if exc.__traceback__ is not tb:
    314         raise exc.with_traceback(tb)
--> 315     raise exc
    316 
    317 

d:\installed\anaconda3\envs\rasterio\lib\site-packages\dask\local.py in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
    218     try:
    219         task, data = loads(task_info)
--> 220         result = _execute_task(task, data)
    221         id = get_id()
    222         result = dumps((result, id))

d:\installed\anaconda3\envs\rasterio\lib\site-packages\dask\core.py in _execute_task(arg, cache, dsk)
    117         # temporaries by their reference count and can execute certain
    118         # operations in-place.
--> 119         return func(*(_execute_task(a, cache) for a in args))
    120     elif not ishashable(arg):
    121         return arg

~\AppData\Roaming\Python\Python38\site-packages\datashader\data_libraries\dask.py in wrapped_combine(x, axis, keepdims)
    138             #            isinstance(item[0], np.ndarray)
    139             #            for item in x)
--> 140             return combine(x)
    141         elif isinstance(x, tuple):
    142             # tuple with single ndarray

~\AppData\Roaming\Python\Python38\site-packages\datashader\compiler.py in combine(base_tuples)
    160     def combine(base_tuples):
    161         bases = tuple(np.stack(bs) for bs in zip(*base_tuples))
--> 162         return tuple(f(*get(inds, bases)) for (f, inds) in calls)
    163 
    164     return combine

~\AppData\Roaming\Python\Python38\site-packages\datashader\compiler.py in <genexpr>(.0)
    160     def combine(base_tuples):
    161         bases = tuple(np.stack(bs) for bs in zip(*base_tuples))
--> 162         return tuple(f(*get(inds, bases)) for (f, inds) in calls)
    163 
    164     return combine

~\AppData\Roaming\Python\Python38\site-packages\datashader\reductions.py in _combine(aggs)
    306     @staticmethod
    307     def _combine(aggs):
--> 308         return aggs.sum(axis=0, dtype='u4')
    309 
    310 

~\AppData\Roaming\Python\Python38\site-packages\numpy\core\_methods.py in _sum(a, axis, dtype, out, keepdims, initial, where)
     45 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
     46          initial=_NoValue, where=True):
---> 47     return umr_sum(a, axis, dtype, out, keepdims, initial, where)
     48 
     49 def _prod(a, axis=None, dtype=None, out=None, keepdims=False,

MemoryError: Unable to allocate 64.0 MiB for an array with shape (4096, 4096) and data type uint32

What specs (RAM, CPU cores) would you suggest to make this work? I also have 80M records to work with.

Member

jbednar commented Jan 17, 2022

That's not a reproducible example; we'd need something that fully specifies what code to run and how to get the data it needs (typically by synthesizing it). In any case, I don't know precisely where the bottleneck is, so I can't speculate how much memory would be needed, other than "more" :-/.
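A minimal synthetic stand-in for such a reproducible example might look like the following (the variable names `df_p` and `n`, the seed, and the Web Mercator extent are all illustrative assumptions, not taken from the original code):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real 8M-point dataset (shrink n to taste);
# the Web Mercator extent used for x/y is just an illustrative choice.
n = 1_000_000
rng = np.random.default_rng(42)
df_p = pd.DataFrame({
    'x': rng.uniform(-20037508.34, 20037508.34, n),
    'y': rng.uniform(-20037508.34, 20037508.34, n),
})
print(df_p.shape)  # (1000000, 2)
```

Pairing a generator like this with the exact render_tiles call and the machine's total RAM would give maintainers everything needed to reproduce the failure.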

@hokieg3n1us (Contributor)

For a count aggregation, each super tile (the xarray DataArray returned from the rasterize_func) is roughly 64 MiB in size. If you're doing a categorical aggregation, multiply this by the number of categories.

The rendering process currently keeps the results for every super tile in memory until they're sliced into the individual TMS tiles. For zoom level 12 with a global dataset, there are 65,792 supertiles, so roughly 4 TB of memory usage. There's currently a pull request #1024 that adds a local_cache_path option so these intermediate aggregates can be stored to disk, making the tile-rendering process bounded by disk space/IO instead of RAM.

You can get the number of super tiles at a given zoom level for your dataset with the line below, and multiply it by 64 MiB to estimate memory consumption.

len(list(gen_super_tiles((x_min, y_min, x_max, y_max), 12)))
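As a back-of-envelope check of these numbers (my own arithmetic, not datashader's gen_super_tiles: it assumes 256 px tiles, 4096 px supertiles, 4-byte uint32 pixels, and a full 2^z × 2^z global tile grid, so the exact supertile count from gen_super_tiles can differ slightly depending on extent handling):

```python
import math

def estimate_supertile_memory(zoom, tile_px=256, supertile_px=4096,
                              bytes_per_px=4):
    """Rough supertile count and peak bytes if all supertiles stay in RAM."""
    tiles_per_side = 2 ** zoom
    supertiles_per_side = math.ceil(tiles_per_side * tile_px / supertile_px)
    n_supertiles = supertiles_per_side ** 2
    total_bytes = n_supertiles * supertile_px ** 2 * bytes_per_px
    return n_supertiles, total_bytes

n, b = estimate_supertile_memory(12)
print(n, b / 2**40)  # 65536 supertiles, 4.0 TiB for a count aggregation
```

This reproduces the ~4 TB figure for zoom 12, and shows why zoom 11 (a quarter of the supertiles) still squeaked by on the reporter's machine while zoom 12 did not.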

Member

jbednar commented Jan 19, 2022

Thanks, @hokieg3n1us ; that's super useful!

@brendancol (Collaborator)

@hokieg3n1us that clearly needs to be refactored if all supertiles are held in memory. I'm happy to take a look and suggest changes. We are also doing tile rendering in mapshader, though it's still experimental.

@hokieg3n1us (Contributor)

@brendancol I know where to make the change and can make it in my branch; it's an easy enough fix. The only trade-off is that rasterize_func gets called twice for each super_tile: once when calculating the zoom-level statistics to get the span, and a second time later when rendering the sub-tiles.

@hokieg3n1us (Contributor)

@brendancol @jbednar I made the change on my branch for #1024 and pushed it. I included some notes in my latest commit: anyone using the render_tiles functionality should be careful to tune the Dask scheduler. Specifically, the Dask Bag default of 'processes' should be avoided if you persist your input DataFrame, since it will be copied into each worker process during the load_data_func. And if you're using the MBTiles feature, you'll want to tune num_workers to prevent locking of the SQLite file.
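For reference, a minimal sketch of pinning the scheduler with Dask's config API (the 'threads' choice is just the setting the note above argues for when the input frame is persisted in the main process; the num_workers usage is a general Dask pattern, not specific to the #1024 branch):

```python
import dask

# With a persisted in-memory DataFrame, the 'processes' scheduler would
# serialize and copy it into every worker process; the threaded
# scheduler shares the same memory instead.
dask.config.set(scheduler="threads")
print(dask.config.get("scheduler"))  # → threads

# Worker count can also be bounded per compute call, e.g.:
# result = some_dask_object.compute(num_workers=4)
```

Bounding num_workers this way is also how one would throttle concurrent writers against a single SQLite/MBTiles file.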

@MridulS MridulS added this to the wishlist milestone Feb 7, 2022