
Batch processing is running slower #939

Open

lydiascarf opened this issue Apr 18, 2023 · 0 comments

Comments

@lydiascarf
Contributor

lydiascarf commented Apr 18, 2023

Notes from Klaas:

  • From everything I can see in terms of resource usage, we're actually massively underutilizing the i3.2xlarge instances. We're allocating only half of the instance's memory to the task to begin with, and then less than half of that ends up getting used (a sketch for pulling the memory metrics follows this list).
  • There are a few steps in the analysis that run in parallel, but much of it runs single-threaded. The parallel steps are big ones, like scoring neighborhood_ways segments, but still, most of the analysis time is spent on single-threaded tasks like indexing the big tables and calculating the accessibility scores for different destination types (a sketch for parallelizing the index builds is below).
  • The i3 instances are somewhat old, so given the above, I think there's a good chance that Fargate would speed things up, because single-threaded performance would be better on the newer CPUs in the fleet. I tried to figure out which CPUs Fargate actually uses, and the answer appears to be "it depends", and it's not configurable. But I found this Stack Overflow post where someone gathered their own statistics. The first processor listed there, the Xeon E5-2686 v4 @ 2.30GHz, is actually the processor used by the i3 instances, but it's the slowest one on the list, and that list is from a year ago, so more of those may have been cycled out of the fleet by now (a sketch for gathering our own numbers is below).
  • If you scroll down to the "Container" section on the job detail page, you can see the parameters the job was run with. Which is handy, because the PFB_SHPFILE_URL value will tell you what city the job is for (the same parameters can be pulled via the Batch API; sketch below). The failed one I linked above was for Helsinki.
  • I started watching this one, for Houston, yesterday evening because it had been running for days. It has actually finished since then, for a total runtime of just over 5 days. So it doesn't seem to be the case that the huge ones can't succeed, though I don't know if that's what she was actually saying. But yeah, I think we should separate the time question from the failures question, and for the latter we should focus on diagnosing, for individual failed jobs, what actually brought them down (a sketch for pulling failure reasons in bulk closes out this list).
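
On the memory point, here's a minimal sketch for checking utilization from CloudWatch, assuming the Batch compute environment's ECS cluster reports the standard AWS/ECS cluster-level metrics; the cluster name is a placeholder, not our actual value:

```python
# Sketch: pull cluster-level memory utilization for the Batch compute
# environment. Assumes the ECS cluster publishes standard AWS/ECS metrics;
# the cluster name below is a placeholder.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="MemoryUtilization",
    Dimensions=[{"Name": "ClusterName", "Value": "pfb-analysis-cluster"}],  # placeholder
    StartTime=end - timedelta(days=1),
    EndTime=end,
    Period=3600,  # hourly datapoints
    Statistics=["Average", "Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"avg={point['Average']:.1f}%", f"max={point['Maximum']:.1f}%")
```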
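On the single-threaded index builds: Postgres 11+ can build B-tree indexes with parallel workers, so if the database settings are within our control, something along these lines might claw back some of that time. A sketch, with the DSN, column, and index names as placeholders (neighborhood_ways is the real table mentioned above):

```python
# Sketch: allow Postgres to parallelize index builds on the big tables.
# Assumes Postgres 11+; the DSN, column, and index names are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=pfb_analysis")  # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    # Allow parallel workers for maintenance commands (CREATE INDEX, VACUUM).
    cur.execute("SET max_parallel_maintenance_workers = 4")
    # Give the build enough memory to actually make use of those workers.
    cur.execute("SET maintenance_work_mem = '4GB'")
    # Parallel builds apply to B-tree indexes; placeholder column name.
    cur.execute(
        "CREATE INDEX IF NOT EXISTS idx_ways_road_id "
        "ON neighborhood_ways (road_id)"
    )

conn.close()
```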
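On the Fargate CPU lottery: since the processor mix isn't configurable or documented, we could gather our own statistics the way that Stack Overflow poster did, by logging the CPU model at the start of each task. A small Linux-only sketch:

```python
# Sketch: log which CPU model a Fargate task actually landed on, so we can
# gather our own statistics like the Stack Overflow post above.
import re


def cpu_model() -> str:
    """Return the CPU model name from /proc/cpuinfo (Linux only)."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            match = re.match(r"model name\s*:\s*(.+)", line)
            if match:
                return match.group(1)
    return "unknown"


if __name__ == "__main__":
    # Printed to stdout, so it lands in the task's CloudWatch log stream.
    print(f"CPU model: {cpu_model()}")
```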
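On reading job parameters: the same "Container" details are available from the Batch API, so there's no need to click through the console per job. A sketch with a placeholder job ID:

```python
# Sketch: pull the parameters a Batch job ran with. The job ID is a
# placeholder; PFB_SHPFILE_URL is the variable that identifies the city.
import boto3

batch = boto3.client("batch")

job = batch.describe_jobs(jobs=["example-job-id"])["jobs"][0]  # placeholder ID
env = {var["name"]: var["value"] for var in job["container"]["environment"]}

print(job["jobName"], job["status"])
print("city shapefile:", env.get("PFB_SHPFILE_URL"))
```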
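And for the failures question, a sketch for pulling the recorded failure reason and exit code for every failed job in a queue (the queue name is a placeholder, and list_jobs paginates, which this ignores for brevity):

```python
# Sketch: list failed jobs in a queue and print what Batch recorded about
# why each one died. The queue name is a placeholder.
import boto3

batch = boto3.client("batch")

failed = batch.list_jobs(jobQueue="pfb-analysis-queue", jobStatus="FAILED")

for summary in failed["jobSummaryList"]:
    detail = batch.describe_jobs(jobs=[summary["jobId"]])["jobs"][0]
    container = detail.get("container", {})
    print(
        detail["jobName"],
        "statusReason:", detail.get("statusReason"),
        "exitCode:", container.get("exitCode"),
        "containerReason:", container.get("reason"),
    )
```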