
Batch processing is running slower #939

Open

lydiascarf opened this issue Apr 18, 2023 · 0 comments

Comments

@lydiascarf
Contributor

lydiascarf commented Apr 18, 2023

Notes from Klaas:

  • From everything I can see in terms of resource usage, we're actually massively underutilizing the i3.2xlarge instances. We're allocating only half of the instance's memory to the task to begin with, and then less than half of that ends up getting used (a sketch for pulling the memory metrics follows this list).
  • There are a few steps in the analysis that run in parallel, but much of it runs single-threaded. The parallel steps are big ones, like scoring neighborhood_ways segments, but still, most of the analysis time is spent on single-threaded tasks like indexing the big tables and calculating the accessibility scores for different destination types (a sketch for parallelizing the index builds is below).
  • The i3 instances are somewhat old, so given the above, I think there's a good chance that Fargate would speed things up, because single-threaded performance would be better on the newer CPUs in the fleet. I tried to figure out which CPUs Fargate actually uses, and the answer appears to be "it depends", and it's not configurable. But I found this Stack Overflow post where someone gathered their own statistics. The first processor listed there, the Xeon E5-2686 v4 @ 2.30GHz, is actually the processor used by the i3 instances, but it's the slowest one on the list, and that list is from a year ago, so more of those may have been cycled out of the fleet by now (a sketch for gathering our own numbers is below).
  • If you scroll down to the "Container" section on the job detail page, you can see the parameters the job was run with. Which is handy, because the PFB_SHPFILE_URL value will tell you what city the job is for (the same parameters can be pulled via the Batch API; sketch below). The failed one I linked above was for Helsinki.
  • I started watching this one, for Houston, yesterday evening because it had been running for days. It has actually finished since then, for a total runtime of just over 5 days. So it doesn't seem to be the case that the huge ones can't succeed, though I don't know if that's what she was actually saying. But yeah, I think we should separate the time question from the failures question, and for the latter we should focus on diagnosing, for individual failed jobs, what actually brought them down (a sketch for pulling failure reasons in bulk closes out this list).
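
On the memory point, here's a minimal sketch for checking utilization from CloudWatch, assuming the Batch compute environment's ECS cluster reports the standard AWS/ECS cluster-level metrics; the cluster name is a placeholder, not our actual value:

```python
# Sketch: pull cluster-level memory utilization for the Batch compute
# environment. Assumes the ECS cluster publishes standard AWS/ECS metrics;
# the cluster name below is a placeholder.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="MemoryUtilization",
    Dimensions=[{"Name": "ClusterName", "Value": "pfb-analysis-cluster"}],  # placeholder
    StartTime=end - timedelta(days=1),
    EndTime=end,
    Period=3600,  # hourly datapoints
    Statistics=["Average", "Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"avg={point['Average']:.1f}%", f"max={point['Maximum']:.1f}%")
```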
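On the single-threaded index builds: Postgres 11+ can build B-tree indexes with parallel workers, so if the database settings are within our control, something along these lines might claw back some of that time. A sketch, with the DSN, column, and index names as placeholders (neighborhood_ways is the real table mentioned above):

```python
# Sketch: allow Postgres to parallelize index builds on the big tables.
# Assumes Postgres 11+; the DSN, column, and index names are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=pfb_analysis")  # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    # Allow parallel workers for maintenance commands (CREATE INDEX, VACUUM).
    cur.execute("SET max_parallel_maintenance_workers = 4")
    # Give the build enough memory to actually make use of those workers.
    cur.execute("SET maintenance_work_mem = '4GB'")
    # Parallel builds apply to B-tree indexes; placeholder column name.
    cur.execute(
        "CREATE INDEX IF NOT EXISTS idx_ways_road_id "
        "ON neighborhood_ways (road_id)"
    )

conn.close()
```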
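On the Fargate CPU lottery: since the processor mix isn't configurable or documented, we could gather our own statistics the way that Stack Overflow poster did, by logging the CPU model at the start of each task. A small Linux-only sketch:

```python
# Sketch: log which CPU model a Fargate task actually landed on, so we can
# gather our own statistics like the Stack Overflow post above.
import re


def cpu_model() -> str:
    """Return the CPU model name from /proc/cpuinfo (Linux only)."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            match = re.match(r"model name\s*:\s*(.+)", line)
            if match:
                return match.group(1)
    return "unknown"


if __name__ == "__main__":
    # Printed to stdout, so it lands in the task's CloudWatch log stream.
    print(f"CPU model: {cpu_model()}")
```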
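On reading job parameters: the same "Container" details are available from the Batch API, so there's no need to click through the console per job. A sketch with a placeholder job ID:

```python
# Sketch: pull the parameters a Batch job ran with. The job ID is a
# placeholder; PFB_SHPFILE_URL is the variable that identifies the city.
import boto3

batch = boto3.client("batch")

job = batch.describe_jobs(jobs=["example-job-id"])["jobs"][0]  # placeholder ID
env = {var["name"]: var["value"] for var in job["container"]["environment"]}

print(job["jobName"], job["status"])
print("city shapefile:", env.get("PFB_SHPFILE_URL"))
```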
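And for the failures question, a sketch for pulling the recorded failure reason and exit code for every failed job in a queue (the queue name is a placeholder, and list_jobs paginates, which this ignores for brevity):

```python
# Sketch: list failed jobs in a queue and print what Batch recorded about
# why each one died. The queue name is a placeholder.
import boto3

batch = boto3.client("batch")

failed = batch.list_jobs(jobQueue="pfb-analysis-queue", jobStatus="FAILED")

for summary in failed["jobSummaryList"]:
    detail = batch.describe_jobs(jobs=[summary["jobId"]])["jobs"][0]
    container = detail.get("container", {})
    print(
        detail["jobName"],
        "statusReason:", detail.get("statusReason"),
        "exitCode:", container.get("exitCode"),
        "containerReason:", container.get("reason"),
    )
```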