Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve image handling #40

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Conversation

alanta
Copy link
Owner

@alanta alanta commented Nov 10, 2022

Motivation

Checklist

  • Code follows coding conventions held in this repo
  • Automated tests have been added
  • Tests are passing
  • Docs have been updated (if applicable)
  • Temporary settings (e.g. variables used during development and testing) have been reverted to defaults

How to test

If manual testing is required, what are the steps?

@alanta alanta force-pushed the feature/improve-image-handling branch from be35b23 to 74378bd Compare November 10, 2022 14:52
@alanta
Copy link
Owner Author

alanta commented Nov 10, 2022

@daveaglick I'd really appreciate your input on this PR.

What I'm trying to do is optimize the way images are downloaded from the Kontent API. Ideally I'd like to prevent excessive downloading by caching the images to disk (currently caching in memory for preview mode).
I noticed Statiq is caching some files already, is there a mechanism in Statiq I can hook into as well?

I'm also running into issues with the ReadWeb module from Statiq Framework that is spawning a task for each image, which seems to choke up either the API or the application when downloading hundreds of images, resulting in a timeout. So I've batched the download and that seems to work. I'd be happy to contribute a PR to (optionally) limit parallelism for the ReadWeb module.

@alanta alanta self-assigned this Nov 10, 2022
@daveaglick
Copy link

I'm here! Well...sort of. You probably know as well as anyone how flaky my available time has been this year.

In any case, I did get a chance to take a look at the stuff here.

I noticed Statiq is caching some files already, is there a mechanism in Statiq I can hook into as well?

There isn't really a generic file-based caching mechanism. That said, Statiq does try to track the files it writes, and if the input to a module is the same as a previous run, and the file on disk hasn't changed, it can be a little clever. That works mostly for things like Razor and Markdown processing though - it doesn't really work that well for stuff from the web because we have no way of knowing if anything has changed.

That said, you can obviously be a little smarter for this scenario than Statiq can be for a more generic feature, so some disk caching probably makes sense. There's a ConcurrentCache<TKey, TValue> class in Statiq.Common that might help: https://www.statiq.dev/api/statiq.common/concurrentcache_2/. It attempts to provide a slightly smarter concurrent dictionary you can use for storing cached data and (optionally) resetting in between executions.

One thing you probably do want to do is use IFileSystem.CachePath for the path where you're storing any cached files. That will ensure Statiq doesn't overwrite them or anything when it shouldn't and stay consistent with other cached information. The convention would probably be to create a subfolder under the CachePath directory for your own cached files. There are also some IReadOnlyFileSystem extensions for working with the cache path that can help resolve full file paths to the cache directory at runtime.

Here's a good example of some custom caching in action for the Razor module: https://github.com/statiqdev/Statiq.Framework/blob/1a8bd58e6f1a4e63f0cae2870db78126bc511928/src/extensions/Statiq.Razor/RazorService.cs#L56

I'm also running into issues with the ReadWeb module from Statiq Framework that is spawning a task for each image

Yeah, that's probably an unfortunate design. It currently uses ParallelSelectAsync which is essentially a call to Task.WhenAll(), so while the task manager does have an upper task limit and some queuing built in, it's probably nowhere near reasonable for throttling web requests.

I'm getting a little bit of deja vu because I feel like I've tackled this problem a few times already. One approach off the top of my head is at: https://github.com/daveaglick/discoverdotnet/blob/master/FoundationManager.cs - that queues up web requests using a semaphore.

Anything inbox that does a better job of batching/queuing web requests would be a great addition - and I trust you, so I wouldn't need to review it very closely which is an added bonus :)

@alanta alanta force-pushed the feature/improve-image-handling branch from 747a424 to 395a8b4 Compare February 28, 2024 20:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants