Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace pySmartDL and ThreadPoolExecutor with aiohttp #279

Open
spwoodcock opened this issue Jul 30, 2024 · 20 comments
Open

Replace pySmartDL and ThreadPoolExecutor with aiohttp #279

spwoodcock opened this issue Jul 30, 2024 · 20 comments
Assignees
Labels
enhancement New feature or request

Comments

@spwoodcock
Copy link
Member

Is your feature request related to a problem? Please describe.

  • pySmartDL is very verbose.
  • Tiles are kb scale, so we don't really care about download progress.
  • Instead we just need to know how many tile downloads failed at the end.
  • Using aiohttp would be cleaner than ThreadPoolExecutor.

Describe the solution you'd like

  • Replace pySmartDL and ThreadPoolExecutor with an aiohttp implementation for tile downloads.
@spwoodcock spwoodcock added the enhancement New feature or request label Jul 30, 2024
@joemaren
Copy link
Contributor

joemaren commented Aug 9, 2024

Will you require aiohttp to run multiple threads like pySmartDL and ThreadPoolExecutor or just asynchronously?

@spwoodcock
Copy link
Member Author

Just asynchronously using the thread pool I think 👍

aiohttp has a limit for the connection pool size for downloads, with the default being 100

@RonaldRonnie
Copy link
Contributor

RonaldRonnie commented Aug 21, 2024

Does it mean that the focus is not to relying on multi-threading??? also, I think the slicing logic that splits xforms into chunks based on the number of CPU cores can remain , having looked at the code. Right? Also i would like to work on this given the procedures to test results after. I request to be assigned to this @spwoodcock .

@rsavoye
Copy link
Collaborator

rsavoye commented Aug 21, 2024

Yes, the number of threads is based on the number of cores. I think you'd want to keep the thread pool even if you use aiohttp. For FMTM you're downloading small amounts of tiles so don't see the performance problems. I'm commonly downloading imagery for a huge area. And often the server with the imagery is bandwidth limited, so the more concurrent threads the better. I used SmartDL cause it was easy, but I don't really see a problem with it. When downloading many, many tiles, but running from the command line, I do like to monitor the progress, which may run for hours... sometimes a few days, and it's the only way I can make sure it's still working. I don't have any issues if you want to replace it with aiohttp, most of what it is doing is network access. But please keep the thread pool unless aiohttp has something similar. A quick check, it appears to be single threaded, so I think more discussion is needed before starting on anything.

@spwoodcock
Copy link
Member Author

spwoodcock commented Aug 22, 2024

When it comes to IO-bound (network) tasks like downloading files, asyncio is much more efficient than threading.

We can't download all the files at once, as this chokes the network, so the OS manages connections for us.

The download process is the same between the two but threading creates a blocking socket (that will sit idle until the OS returns a response), while asyncio creates a non-blocking socket.

For asyncio, when a coroutine initiates some IO, the given socket is added to the selector. Then when that selector notices the socket has some actionable change, the coroutine waiting for it is continued.

Asyncio is there to basically minimise this unnecessary idle time and only action tasks when the OS sees they are ready to be completed. It does all of this on a single thread and is more resource efficient.

It's a minor benefit when running standalone, but when running on an ASGI webserver like FastAPI, as we can pass this to a single background worker thread to run until complete (instead of using up threads that could be used by other async tasks).

@spwoodcock
Copy link
Member Author

The logic to slice up the tasks and get the CPU cores can be removed entirely. This can be handled by the aoihttp connection pool automatically

@joemaren
Copy link
Contributor

I feel the same way. Keeping thread pool defeats the purpose of using aiohttp. It creates more complexity. aiohttp also manages multiple connections with a pool size which defaults to 100.

@RonaldRonnie
Copy link
Contributor

RonaldRonnie commented Aug 22, 2024

I believe transitioning to aiohttp without the thread pool is the better approach due to its simplicity and efficiency. It aligns well with modern async practices, especially in environments where async tasks need to be managed effectively. However, I recognize that long-term monitoring and handling bandwidth limitations might require additional considerations. These could include more extensive testing and fine-tuning aiohttp’s connection pool settings to meet those specific needs.

That said, I also acknowledge @rsavoye’s valid points regarding large-scale downloads. In cases where monitoring progress over extended periods is essential, or where maximizing bandwidth on limited servers is a concern, using multiple threads may still offer advantages. However, these edge cases can often be addressed by tuning aiohttp settings (e.g., adjusting connection pool size) and implementing custom progress logging.

In summary, with proper configuration and some minor custom additions, aiohttp can be optimized to handle even more demanding scenarios without needing to revert to threading.

@joemaren
Copy link
Contributor

I think @spwoodcock mentioned that the tiles to be downloaded are in kilobytes, so I don't see the concern for large scale downloads. Failed downloads can be retried. Yes, custom progress logging can still be implemented

@rsavoye
Copy link
Collaborator

rsavoye commented Aug 22, 2024

If aiohttp has support for it's own pool, I don't have a problem with the change. I just want to have multiple network connections to the remote server so I can download multiple tiles in parallel since I'm often downloading many, many thousands of tiles and don't want it to take forever... I agree that for python, threading is more for computational tasks (like conflation), and asyncio is more for I/O bound tasks. Also I still like the verbosity when running in a terminal, so that should be optional.

@rsavoye
Copy link
Collaborator

rsavoye commented Aug 22, 2024

Just as a note, I wrote basemapperr originally to make large basemaps of imagery and topo maps for OsmAnd that cover an entire state, like all of Colorado. I use this for navigating in rural areas where often the OSM data isn't very good.

@spwoodcock
Copy link
Member Author

There is no limit to the size of the area we attempt to download with either approach.

Updating to aoihttp simply makes the downloads more resource efficient.

It may even be marginally faster.

@rsavoye
Copy link
Collaborator

rsavoye commented Aug 22, 2024

I can test it with a large AOI when there is a PR to review.

@RonaldRonnie
Copy link
Contributor

Can i go ahead and work on this then we do testing after the PR

@spwoodcock
Copy link
Member Author

For sure! Go ahead 😁🙏

@RonaldRonnie
Copy link
Contributor

Am I allowed to use the Caching and Resume Support.

@spwoodcock
Copy link
Member Author

I'm not sure Resume functionality will provide much benefit as the files are so small.

Caching isn't actually necessary either, as the files on disk are essentially our cache.

The flow should be like this:

  • Check check for existing directory.

  • If present, then check the subdirs for required files recursively.

  • Create a list of missing files / files to download.

  • Use the list for the aoihttp pooled downloading.

@RonaldRonnie
Copy link
Contributor

Hi @spwoodcock , is it okay to implement asyncio.gather to run multiple asycnronous download tasks concurrrent??

@joemaren
Copy link
Contributor

You don't need to use asyncio.gather()
Use asyncio.create_task() to create a task object and append to it.

@spwoodcock
Copy link
Member Author

spwoodcock commented Aug 28, 2024

async def aiohttp_get(session: aiohttp.ClientSession, url: str):
    async with session.request("GET", url) as res:
        data = await res.read()
        return data

async with aiohttp.ClientSession() as session:
    await asyncio.gather(*(aiohttp_get(session, url) for url in urls))

I haven't tested the code above, but hopefully it gives an idea how to solve.

Using asyncio.gather is good 👍

Be sure to create the ClientSession once, outside of the async function that is pooled using gather (so the same connection pool is reused and not recreated each download)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Development

No branches or pull requests

4 participants