Extending DC to be Multithreaded #520

Open
arya-hemanshu opened this issue Mar 12, 2018 · 3 comments

Comments

@arya-hemanshu
Contributor

arya-hemanshu commented Mar 12, 2018

Description

DC currently executes tasks sequentially. Using the London Cycle Traffic Air Quality recipe as an example throughout, this issue describes the current approach and a possible approach to making DC multithreaded.

The recipe executes 8 Tasks in total when running, explained below:
Task 1 -> Download LocalAuthority Data from OaImporter
Task 2 -> Download TrafficCounts Data from TrafficCountImporter
Task 3 -> Download airQuality Data from LAQNImporter
Task 4 -> Geographic Aggregation of NO2 40 ug/m3 and BicycleFraction in Fields
Task 5 -> Taking mean of NO2 40 ug/m3 using LatestValueField
Task 6 -> Calculating BicycleFraction by dividing the sum of CountPedalCycles by the sum of CountCarsTaxis
Task 7 -> Adding CountPedalCycles using LatestValueField
Task 8 -> Adding CountCarsTaxis using LatestValueField

Current Approach

In the current approach DC executes one task at a time, so the order of execution would be:
Task 1, Task 2, Task 3, Task 5, Task 7, Task 8, Task 6, Task 4 (one at a time)

Proposed Approach

We could execute certain tasks in parallel, because not every task depends on the others.
We could create a Dependency Graph, e.g.
Tasks ------> Dependencies
Task 1------> 0
Task 2------> 0
Task 3------> 0
Task 4------> Task 5, 6
Task 5------> Task 1, 2, 3
Task 6------> Task 7, 8
Task 7------> Task 1, 2, 3
Task 8------> Task 1, 2, 3
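
To illustrate, the table above can be written as a plain mapping and grouped into "waves" of tasks that have no unmet dependencies. This is only a sketch in Python, not DC code; the names `DEPENDENCIES` and `execution_waves` are made up for the example.

```python
# Dependency graph from the table above: task number -> tasks it waits for.
# (Hypothetical representation, not taken from the DC code base.)
DEPENDENCIES = {
    1: set(),
    2: set(),
    3: set(),
    4: {5, 6},
    5: {1, 2, 3},
    6: {7, 8},
    7: {1, 2, 3},
    8: {1, 2, 3},
}


def execution_waves(dependencies):
    """Group tasks into waves; every task in a wave has no unmet dependencies."""
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    waves = []
    while remaining:
        ready = sorted(task for task, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        for task in ready:
            del remaining[task]
        # Update the graph: finished tasks no longer block anything.
        for deps in remaining.values():
            deps.difference_update(ready)
    return waves


print(execution_waves(DEPENDENCIES))
# prints [[1, 2, 3], [5, 7, 8], [6], [4]]
```

Tasks inside one wave are independent of each other, so each wave is a candidate for parallel execution.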

We then run in parallel only those tasks that have 0 remaining dependencies, and keep updating the Dependency Graph as tasks complete, e.g.
Tasks 1, 2 and 3 can run in parallel. Once they are done, we update the Dependency Graph and clear the dependencies of Tasks 5, 7 and 8.
Tasks 5, 7 and 8 then run in parallel. Once they are done, we update the Dependency Graph again, which clears the dependencies of Task 6 and removes Task 5 from the dependencies of Task 4.
Note that Task 4 still can't run in parallel with Task 6, because it has Task 6 as a dependency, so Tasks 6 and 4 run sequentially.
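
A rough sketch of how such a scheduler might look, written in Python with the standard `concurrent.futures` thread pool rather than in DC's own code. `run_in_parallel` and `run_task` are hypothetical names, and `run_task` is only a placeholder for the real importer or field work. Unlike the strict wave-by-wave description above, this version starts each task as soon as its own dependencies have finished, which gives the same result for this recipe.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Same dependency graph as in the table above (task -> tasks it waits for).
DEPENDENCIES = {
    1: set(), 2: set(), 3: set(),
    4: {5, 6},
    5: {1, 2, 3},
    6: {7, 8},
    7: {1, 2, 3},
    8: {1, 2, 3},
}


def run_task(task):
    # Placeholder for real work (an importer download or a field computation).
    return task


def run_in_parallel(dependencies, run_task, max_workers=4):
    """Run every task once all of its dependencies have finished.

    Returns the tasks in the order in which they completed.
    """
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    in_flight = {}  # future -> task number
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or in_flight:
            # Submit everything that currently has 0 dependencies.
            ready = [task for task, deps in remaining.items() if not deps]
            for task in ready:
                del remaining[task]
                in_flight[pool.submit(run_task, task)] = task
            if not in_flight:
                raise ValueError("dependency cycle detected")
            # Block until at least one running task completes.
            done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
            for future in done:
                task = in_flight.pop(future)
                future.result()  # re-raises if the task failed
                finished.append(task)
                # Update the graph: this task no longer blocks anything.
                for deps in remaining.values():
                    deps.discard(task)
    return finished


order = run_in_parallel(DEPENDENCIES, run_task)
print(order)  # completion order varies between runs; Task 4 is always last
```

The main thread is the only one that touches the dependency graph, so the graph itself needs no locking; only the task bodies run on worker threads.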

Making DC multi-threaded could significantly improve run times.

Error log

None

@lorenaqendro
Contributor

If that's ok, I would like to work on this when I'm back.

@arya-hemanshu
Contributor Author

Great, the task is all yours

@sassalley
Contributor

Maybe related, but would this enable the export of completed results if a build were to fail halfway? E.g. if a build failed because some resources were not available (e.g. Server returned HTTP response code: 429 for URL). It would be nice to get some of what had been analysed even if it failed halfway.

3 participants