
PDL::Core::dog doesn't scale #421

Open
mohawk2 opened this issue Mar 27, 2023 · 18 comments

mohawk2 commented Mar 27, 2023

For this, time perl -MPDL -e 'zeroes(X)->dog' is being discussed. This is prompted by an observation from @djerius.

X=1e4,6 takes no appreciable time. X=6,1e4 takes around 5s. The current implementation of dog does a Perl map slicing over the highest dimension. This reveals two problems:

  • the current machinery of starting up an ndarray operation seems to be very expensive (as noted long ago by @run4flat) and needs profiling to seek improvements
  • it's not trivial to make dog be a PP operation, because we still don't have a notation for a Par to be expressed in a way that would help, nor do we have a way to return many ndarrays

A partial solution to the second point would be to pass in an OtherPar as the input (with no Pars at all), and an OtherPar [o] pdl *retval[] where retval_count gets set automatically as with any OtherPar that's a [], but used for output. The typemap would turn that into an AVREF populated with new ndarray SVs. The input version of that would allow a PP cat operation as well.
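
To make that concrete, here is a rough, non-working sketch of what such a pp_def signature might look like; the [o] pdl *retval[] notation and the automatic retval_count are the hypothetical pieces proposed above, not existing PP features:

# A rough, non-working sketch of the pp_def signature this proposal imagines.
# The "[o] pdl *retval[]" notation and the automatic retval_count do not exist
# yet in PDL::PP; they are the hypothetical pieces described above.
pp_def('dog',
    Pars      => '',                              # no Pars at all
    OtherPars => 'pdl *src; [o] pdl *retval[]',   # retval_count would be set automatically
    Code      => q{
        /* hypothetical: populate retval[0 .. retval_count-1] with one slice
           of src per position along its highest dimension */
    },
);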


mohawk2 commented Mar 27, 2023

For comparison, within perldl, with_time { my @x = 1..1e7; $_++ for @x } takes about 750ms, and without the ++ operation, it's about 400ms just for the allocation.


djerius commented Mar 27, 2023

Here's a plot of user time vs N, where the code run is zeroes(6,N)->dog.

This was run on an AMD Epyc 7763 with 128 cores and 638 GB of RAM, using 100 parallel processes. Noise might be due to clocking of cores at different speeds (3244 MHz vs. 2445 MHz).

Code to generate the data:
{ for t in $(seq 1000 1000 1000000); do
    echo /bin/time --format "'%E %U %S'" \
      perl -MPDL -e "'print $t, q{ };zeroes(6,$t)->dog'"
  done; } \
  | parallel --jobs 100 --joblog joblog |& tee timelog
Plotting code is here: plot-times.txt

(plot "times": user time [hours] vs N)


djerius commented Mar 27, 2023

I've attached a zip file with NYTProf output for N = 1_000, 10_000, and 100_000.

profile.zip.txt


mohawk2 commented Mar 27, 2023

> I've attached a zip file with NYTProf output for N = 1_000, 10_000, and 100_000.

My analysis of the 1e5 results: for 100k calls, _slice_int took 126s, and DESTROY took 349s!


mohawk2 commented Mar 27, 2023

> Here's a plot of user time vs N, where the code run is zeroes(6,N)->dog.
>
> This was run on an AMD Epyc 7763 with 128 cores and 638 GB of RAM, using 100 parallel processes. Noise might be due to clocking of cores at different speeds (3244 MHz vs. 2445 MHz).

This is really cool - how did you make it?


djerius commented Mar 27, 2023

> This is really cool - how did you make it?

For future reference, it's been added to the plot comment above.


mohawk2 commented Mar 27, 2023

> This is really cool - how did you make it?
>
> For future reference, it's been added to the plot comment above.

Thank you. That will be at least twice as valuable if you also include the rest, specifically how you turned those times into the plot shown.


djerius commented Mar 27, 2023

> Thank you. That will be at least twice as valuable if you also include the rest, specifically how you turned those times into the plot shown.

Check out the plot-times.txt file attached to the plot comment above.


mohawk2 commented Mar 27, 2023

Thanks! This is the code, to avoid having to download things:

#! /usr/bin/env perl

use PDL;
use Path::Tiny;
use PDL::Graphics::PGPLOT;

my ( @n, @e, @u, @s );
for ( path('timelog')->lines ) {

    chomp;
    # each timelog line is: N elapsed-time user-time system-time
    my ( $n, $e, $u, $s ) = split /\h+/;

    # convert possibly colon-separated times (e.g. M:SS or H:MM:SS) to hours
    # by weighting the fields with powers of 60
    for my $time ( $e, $u, $s ) {
        my @c = reverse split /:/, $time;
        $time = (pdl(@c) * 60**sequence(0+@c))->dsumover / 3600;
    }

    push @n, $n;
    push @e, $e;
    push @u, $u;
    push @s, $s;
}

my $n = PDL->new( @n );
my $e = PDL->new( @e );
my $u = PDL->new( @u );
my $s = PDL->new( @s );

# points( $n, $e, { symbol => 2, symbolsize=>3});
# label_axes( 'N', 'Elapsed Time [hours]' );

points( $n, $u, { symbol => 2, symbolsize=>3});
label_axes( 'N', 'User Time [hours]' );

# points( $n, $s, { symbol => 2, symbolsize=>3});
# label_axes( 'N', 'System Time [hours]' );


djerius commented Mar 28, 2023

Updated the plot in the above comment, just for kicks.


djerius commented Mar 29, 2023

> Updated the plot in the above comment, just for kicks.

Final update of the plot. Now the relationship looks really odd at higher N.

Maybe there was process contention on the machine after all? I'll run it again with fewer processes and see if it changes.


mohawk2 commented Mar 29, 2023

Multiple clear-ish lines are very confusing!


djerius commented Mar 29, 2023

Rerunning as:

seq 1000 1000 1000000                                                   \
| parallel --shuf -t --jobs 25 --joblog joblog                          \
  taskset --cpu-list '{%}'                                              \
  /bin/time --format "'%E %U %S'"                                       \
  perl -MPDL -e "'print STDERR {}, q{ }; zeroes(6,{})->dog'" '2>&1'     \
| tee timelog         

With results (as plotted via the code in the comment above):

(plot "plot-times-taskset": user time vs N, taskset run)


djerius commented Mar 31, 2023

@mohawk2 new results finished; see the plot in the comment with the revised code. Much cleaner.


mohawk2 commented Mar 31, 2023

Thank you. It does look like it might be O(n^2). Feel like making a log graph to see? :-)
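
For reference, here is a minimal sketch of such a check, assuming the $n and $u ndarrays from the plotting script above are already populated; a straight line of slope ~2 on these axes would point at O(n^2):

# log-log version of the user-time plot; assumes $n and $u from the script above
points( $n->log10, $u->log10, { symbol => 2, symbolsize => 3 } );
label_axes( 'log10(N)', 'log10(User Time [hours])' );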


mohawk2 commented Apr 1, 2023

To be comparable, with the current code:

pdl> with_time { @x = map sequence(10), 1..1e5 }
1957.02 ms
pdl> with_time { pdl(\@x) }
283.285 ms
pdl> with_time { cat(@x) }
1073.88 ms
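
(A hypothetical next step in the same session, to tie this back to dog itself, would be to time it on the concatenated result; $big and @y are just illustrative names:)

pdl> $big = cat(@x)
pdl> with_time { @y = $big->dog }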


mohawk2 commented Mar 31, 2024

Using cachegrind as noted in #451, 62% of the time for zeroes(6,1e4)->dog was taken in pdl__removetrans_children (called in DESTROY). Simply adding a count of how many children each ndarray has, so that the linked-list walking can terminate early, brought the typical time for that down from about 800ms to about 450ms.

mohawk2 added commits that referenced this issue Sep 13, 2024

mohawk2 commented Sep 13, 2024

Without the C implementation of dog linked above, pdl__removetrans_children on zeroes(3,3e4)->dog takes 79% of the run time. With the C implementation it's ~87%: a comparable absolute cost, the share being higher because the C implementation has eliminated some fat elsewhere.

Within pdl__removetrans_children, an impressive 66% of total run time is on the first of these lines, 17% on the second, and 3% on the last:

PDL_START_CHILDLOOP(it)
    if (PDL_CHILDLOOP_THISCHILD(it) != trans) continue;
    /* stuff */
PDL_END_CHILDLOOP(it)

This will be because PDL uses this linked list to track an ndarray's child transforms:

typedef struct pdl_trans_children {
        pdl_trans *trans[PDL_NCHILDREN];
        struct pdl_trans_children *next;
} pdl_trans_children;

There's one inside the struct pdl, so it can't be changed in size (ABI), but the next pointer could instead point at a single block that gets malloced (and grown) when the number of children exceeds PDL_NCHILDREN; the member itself keeps the same size, so there's no ABI break. A trans_children_blocksize would be added to the end of struct pdl (again no ABI break) to track when reallocation is needed, along with a trans_children_firstnull (updated, whenever it's used, by searching for the next NULL slot), and each pdl_trans would need to store its offset into that block somewhere (on the end of the current dims block until CORE21).
