
PDL::Core::dog doesn't scale #421

Open
mohawk2 opened this issue Mar 27, 2023 · 18 comments

mohawk2 commented Mar 27, 2023

For this, time perl -MPDL -e 'zeroes(X)->dog' is being discussed. This is prompted by an observation from @djerius.

X=1e4,6 takes no appreciable time. X=6,1e4 takes around 5s. The current implementation of dog does a Perl map slicing over the highest dimension. This reveals two problems:

  • the current machinery of starting up an ndarray operation seems to be very expensive (as noted long ago by @run4flat) and needs profiling to seek improvements
  • it's not trivial to make dog be a PP operation, because we still don't have a notation for a Par to be expressed in a way that would help, nor do we have a way to return many ndarrays

A partial solution to the second point would be to pass in an OtherPar as the input (with no Pars at all), and an OtherPar [o] pdl *retval[] where retval_count gets set automatically as with any OtherPar that's a [], but used for output. The typemap would turn that into an AVREF populated with new ndarray SVs. The input version of that would allow a PP cat operation as well.
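
To make that concrete, here is a rough, non-working sketch of what such a pp_def signature might look like; the [o] pdl *retval[] notation and the automatic retval_count are the hypothetical pieces proposed above, not existing PP features:

# A rough, non-working sketch of the pp_def signature this proposal imagines.
# The "[o] pdl *retval[]" notation and the automatic retval_count do not exist
# yet in PDL::PP; they are the hypothetical pieces described above.
pp_def('dog',
    Pars      => '',                              # no Pars at all
    OtherPars => 'pdl *src; [o] pdl *retval[]',   # retval_count would be set automatically
    Code      => q{
        /* hypothetical: populate retval[0 .. retval_count-1] with one slice
           of src per position along its highest dimension */
    },
);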


mohawk2 commented Mar 27, 2023

For comparison, within perldl, with_time { my @x = 1..1e7; $_++ for @x } takes about 750ms, and without the ++ operation, it's about 400ms just for the allocation.


djerius commented Mar 27, 2023

Here's a plot of user time vs N, where the code run is zeroes(6,N)->dog.

This was run on an AMD Epyc 7763 with 128 cores and 638 GB of RAM, using 100 parallel processes. Noise might be due to clocking of cores at different speeds (3244 MHz vs. 2445 MHz).

Code to generate the data:
{ for t in $(seq 1000 1000 1000000); do
    echo /bin/time --format "'%E %U %S'" \
      perl -MPDL -e "'print $t, q{ };zeroes(6,$t)->dog'"
  done; } \
  | parallel --jobs 100 --joblog joblog |& tee timelog
Plotting code is here: plot-times.txt

(plot "times": user time [hours] vs N)


djerius commented Mar 27, 2023

I've attached a zip file with NYTProf output for N = 1_000, 10_000, and 100_000.

profile.zip.txt


mohawk2 commented Mar 27, 2023

> I've attached a zip file with NYTProf output for N = 1_000, 10_000, and 100_000.

My analysis of the 1e5 results: for 100k calls, _slice_int took 126s, and DESTROY took 349s!


mohawk2 commented Mar 27, 2023

> Here's a plot of user time vs N, where the code run is zeroes(6,N)->dog.
>
> This was run on an AMD Epyc 7763 with 128 cores and 638 GB of RAM, using 100 parallel processes. Noise might be due to clocking of cores at different speeds (3244 MHz vs. 2445 MHz).

This is really cool - how did you make it?


djerius commented Mar 27, 2023

> This is really cool - how did you make it?

For future reference, it's been added to the plot comment above.


mohawk2 commented Mar 27, 2023

> This is really cool - how did you make it?
>
> For future reference, it's been added to the plot comment above.

Thank you. That will be at least twice as valuable if you also include the rest, specifically how you turned those times into the plot shown.


djerius commented Mar 27, 2023

> Thank you. That will be at least twice as valuable if you also include the rest, specifically how you turned those times into the plot shown.

Check out the plot-times.txt file attached to the plot comment above.


mohawk2 commented Mar 27, 2023

Thanks! This is the code, to avoid having to download things:

#! /usr/bin/env perl

use PDL;
use Path::Tiny;
use PDL::Graphics::PGPLOT;

my ( @n, @e, @u, @s );
for ( path('timelog')->lines ) {

    chomp;
    # each timelog line is: N elapsed-time user-time system-time
    my ( $n, $e, $u, $s ) = split /\h+/;

    # convert possibly colon-separated times (e.g. M:SS or H:MM:SS) to hours
    # by weighting the fields with powers of 60
    for my $time ( $e, $u, $s ) {
        my @c = reverse split /:/, $time;
        $time = (pdl(@c) * 60**sequence(0+@c))->dsumover / 3600;
    }

    push @n, $n;
    push @e, $e;
    push @u, $u;
    push @s, $s;
}

my $n = PDL->new( @n );
my $e = PDL->new( @e );
my $u = PDL->new( @u );
my $s = PDL->new( @s );

# points( $n, $e, { symbol => 2, symbolsize=>3});
# label_axes( 'N', 'Elapsed Time [hours]' );

points( $n, $u, { symbol => 2, symbolsize=>3});
label_axes( 'N', 'User Time [hours]' );

# points( $n, $s, { symbol => 2, symbolsize=>3});
# label_axes( 'N', 'System Time [hours]' );


djerius commented Mar 28, 2023

Updated the plot in the above comment, just for kicks.


djerius commented Mar 29, 2023

> Updated the plot in the above comment, just for kicks.

Final update of the plot. Now the relationship looks really odd at higher N.

Maybe there was process contention on the machine after all? I'll run it again with fewer processes and see if it changes.


mohawk2 commented Mar 29, 2023

Multiple clear-ish lines are very confusing!


djerius commented Mar 29, 2023

Rerunning as:

seq 1000 1000 1000000                                                   \
| parallel --shuf -t --jobs 25 --joblog joblog                          \
  taskset --cpu-list '{%}'                                              \
  /bin/time --format "'%E %U %S'"                                       \
  perl -MPDL -e "'print STDERR {}, q{ }; zeroes(6,{})->dog'" '2>&1'     \
| tee timelog         

With results (as plotted via the code in the comment above):

(plot "plot-times-taskset": user time vs N, taskset run)


djerius commented Mar 31, 2023

@mohawk2 new results finished; see the plot in the comment with the revised code. Much cleaner.


mohawk2 commented Mar 31, 2023

Thank you. It does look like it might be O(n^2). Feel like making a log graph to see? :-)
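
For reference, here is a minimal sketch of such a check, assuming the $n and $u ndarrays from the plotting script above are already populated; a straight line of slope ~2 on these axes would point at O(n^2):

# log-log version of the user-time plot; assumes $n and $u from the script above
points( $n->log10, $u->log10, { symbol => 2, symbolsize => 3 } );
label_axes( 'log10(N)', 'log10(User Time [hours])' );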


mohawk2 commented Apr 1, 2023

To be comparable, with the current code:

pdl> with_time { @x = map sequence(10), 1..1e5 }
1957.02 ms
pdl> with_time { pdl(\@x) }
283.285 ms
pdl> with_time { cat(@x) }
1073.88 ms
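
(A hypothetical next step in the same session, to tie this back to dog itself, would be to time it on the concatenated result; $big and @y are just illustrative names:)

pdl> $big = cat(@x)
pdl> with_time { @y = $big->dog }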


mohawk2 commented Mar 31, 2024

Using cachegrind as noted in #451, 62% of the time for zeroes(6,1e4)->dog was taken in pdl__removetrans_children (called in DESTROY). Simply adding a count of how many children each ndarray has, so that the linked-list walking can terminate early, brought the typical time for that down from about 800ms to about 450ms.

mohawk2 added commits that referenced this issue Sep 13, 2024

mohawk2 commented Sep 13, 2024

Without the C implementation of dog linked above, pdl__removetrans_children on zeroes(3,3e4)->dog takes 79% of the run time. With the C implementation it's ~87%: a comparable absolute cost, the share being higher because the C implementation has eliminated some fat elsewhere.

Within pdl__removetrans_children, an impressive 66% of total run time is on the first of these lines, 17% on the second, and 3% on the last:

PDL_START_CHILDLOOP(it)
    if (PDL_CHILDLOOP_THISCHILD(it) != trans) continue;
    /* stuff */
PDL_END_CHILDLOOP(it)

This will be because PDL uses this linked list to track an ndarray's child transforms:

typedef struct pdl_trans_children {
        pdl_trans *trans[PDL_NCHILDREN];
        struct pdl_trans_children *next;
} pdl_trans_children;

There's one inside the struct pdl, so it can't be changed in size (ABI), but the next pointer could instead point at a single block that gets malloced (and grown) when the number of children exceeds PDL_NCHILDREN; the member itself keeps the same size, so there's no ABI break. A trans_children_blocksize would be added to the end of struct pdl (again no ABI break) to track when reallocation is needed, along with a trans_children_firstnull (updated, whenever it's used, by searching for the next NULL slot), and each pdl_trans would need to store its offset into that block somewhere (on the end of the current dims block until CORE21).
