
Improve DMatrix creation performance in python #10407

Open
wants to merge 1 commit into master

Conversation

arieleiz

The xgboost Python package serializes numpy arrays as JSON. This can take a considerable amount of time in production workloads. This patch optimizes the specific case where the numpy array is already in "C" contiguous 32-bit floating point format and can be loaded directly without the JSON layer. This can improve performance by up to 35% in some cases, as can be seen from the microbenchmark added in xgboost/tests/python/microbench_numpy.py:

Rows     | Cols     | Threads      | Contiguous (s)  | Non-contig. (s) | Ratio
---------+----------+--------------+-----------------+-----------------+--------------
   15000 |      100 |            0 |         0.01686 |         0.01988 |        84.8%
   15000 |      100 |            1 |         0.02897 |         0.04424 |        65.5%
   15000 |      100 |            2 |         0.02579 |          0.0392 |        65.8%
   15000 |      100 |           10 |         0.01581 |         0.02058 |        76.8%
---------+----------+--------------+-----------------+-----------------+--------------
       2 |     2000 |            0 |        0.001055 |        0.001205 |        87.6%
       2 |     2000 |            1 |       0.0004465 |       0.0005689 |        78.5%
       2 |     2000 |            2 |       0.0004609 |        0.000615 |        74.9%
       2 |     2000 |           10 |       0.0005087 |       0.0005623 |        90.5%
---------+----------+--------------+-----------------+-----------------+--------------

The pull request contains updated tests as well.
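
For illustration, here is a caller-side sketch of the case the patch targets: an array that is already C-contiguous float32, so the proposed fast path can ingest it without the JSON layer. The shapes and nthread value are arbitrary; np.ascontiguousarray and the flag check are plain NumPy, not part of this PR.

```python
import numpy as np
import xgboost as xgb

# Build an array that already satisfies the fast-path preconditions:
# C-contiguous layout and 32-bit floats. np.ascontiguousarray is a no-op
# when the input already has that layout, so it is cheap to call defensively.
rng = np.random.default_rng(0)
X = np.ascontiguousarray(rng.random((2, 2000)), dtype=np.float32)
assert X.flags["C_CONTIGUOUS"] and X.dtype == np.float32

# DMatrix construction is the call whose per-invocation overhead is measured above.
dtrain = xgb.DMatrix(X, nthread=1)
```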

@hcho3
Collaborator

hcho3 commented Jun 10, 2024

@arieleiz The current interface uses NumPy's __array_interface__, which should be equivalent to passing the pointer handle. See https://numpy.org/doc/stable/reference/arrays.interface.html. The content of the matrix is not being copied or serialized; only the memory address gets copied. I'm not sure where the 35% difference is coming from.

@trivialfis Are you aware of the performance implications of the use of __array_interface__? Or it might be that the JSON parser is introducing significant overhead.
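
For reference, a minimal illustration of what __array_interface__ exposes: a small dict of metadata plus the raw buffer address (plain NumPy; the pointer value in the comment is just an example).

```python
import numpy as np

X = np.zeros((2, 3), dtype=np.float32)

# Only metadata plus the buffer address are exposed; the element data
# itself is never copied or serialized at this step.
print(X.__array_interface__)
# e.g. {'data': (139876543210496, False), 'strides': None,
#       'descr': [('', '<f4')], 'typestr': '<f4', 'shape': (2, 3), 'version': 3}
```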

@arieleiz
Author

arieleiz commented Jun 10, 2024

Hi @hcho3 !

You are of course correct; I did not describe the issue correctly, and the attached microbenchmark has side effects that make the results incorrect.

After fixing the microbenchmark so that the data layout does not change and the comparison is apples-to-apples:
a. there is no significant change when the data size is very large (not our production use case);
b. for smaller data (2 rows of 1500 cols, simulating what we have in production), we see a consistent 22% improvement with 1 thread and a 50% improvement with 2 threads.

Analyzing (b) with a combined Python+native profiler, we see that the improvement comes directly from _from_numpy_array(); digging deeper, it comes from the use of DenseAdapterBatch vs. ArrayAdapterBatch.
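
A minimal sketch of the Python-side half of such a measurement (cProfile only sees the Python frames; the DenseAdapterBatch/ArrayAdapterBatch split needs a native profiler such as perf; the repeat count and array shape below are illustrative):

```python
import cProfile
import pstats

import numpy as np
import xgboost as xgb

X = np.ascontiguousarray(np.random.rand(2, 1500), dtype=np.float32)

# Profile repeated DMatrix construction; cumulative time under
# DMatrix.__init__ and its callees (e.g. _from_numpy_array) shows
# where the per-call overhead goes on the Python side.
with cProfile.Profile() as prof:
    for _ in range(10_000):
        xgb.DMatrix(X, nthread=1)

pstats.Stats(prof).sort_stats("cumulative").print_stats(15)
```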

Another small difference is, as you suggested, due to the fact that JSON is used neither for the array interface nor for the arguments (missing/nthread/data_split_mode).

If you are OK with the change in general, I'll update the commit message and fix the microbenchmark.

@hcho3
Collaborator

hcho3 commented Jun 10, 2024

Yes, please fix the benchmark. I will defer to @trivialfis to decide whether it's worth having a separate code path to optimize for a specific use case (small matrices).

@arieleiz
Author

@hcho3

Here are the updated numbers, comparing apples to apples (each test is repeated 65536//rows times so the test durations are non-trivial).
I've updated the code to apply the optimization only when the total data size is <= 32768 floats. A sketch of the benchmark loop follows the table.

Threads  | Rows     | Cols     | Current (sec)   | Optimized (sec) | Ratio
---------+----------+----------+-----------------+-----------------+--------------
       1 |        1 |     1000 |       0.0001921 |       0.0001703 |        88.6%
       1 |        4 |     1000 |       0.0001689 |       0.0001437 |        85.1%
       1 |       16 |     1000 |       0.0002639 |       0.0002457 |        93.1%
       1 |       64 |     1000 |       0.0006843 |       0.0006719 |        98.2%
       1 |      256 |     1000 |        0.002611 |        0.002655 |       101.7%
       1 |     1024 |     1000 |           0.013 |          0.0126 |        97.0%
       1 |     4096 |     1000 |         0.06081 |          0.0593 |        97.5%
       1 |    16384 |     1000 |          0.2981 |          0.2974 |        99.8%
       2 |        1 |     1000 |       0.0001415 |       0.0001196 |        84.6%
       2 |        4 |     1000 |       0.0002155 |       0.0002003 |        93.0%
       2 |       16 |     1000 |       0.0002137 |        0.000196 |        91.7%
       2 |       64 |     1000 |       0.0005054 |       0.0004855 |        96.1%
       2 |      256 |     1000 |        0.001613 |        0.001687 |       104.6%
       2 |     1024 |     1000 |        0.007743 |        0.008194 |       105.8%
       2 |     4096 |     1000 |         0.03791 |         0.03783 |        99.8%
       2 |    16384 |     1000 |          0.2077 |          0.2037 |        98.1%
       4 |        1 |     1000 |       0.0001374 |       0.0001237 |        90.0%
       4 |        4 |     1000 |       0.0001985 |       0.0001621 |        81.7%
       4 |       16 |     1000 |       0.0002266 |       0.0001988 |        87.7%
       4 |       64 |     1000 |       0.0005175 |       0.0004775 |        92.3%
       4 |      256 |     1000 |         0.00166 |        0.001594 |        96.0%
       4 |     1024 |     1000 |        0.008257 |        0.008097 |        98.1%
       4 |     4096 |     1000 |         0.03492 |          0.0354 |       101.4%
       4 |    16384 |     1000 |          0.1896 |          0.1897 |       100.0%
       8 |        1 |     1000 |       0.0001471 |       0.0001254 |        85.3%
       8 |        4 |     1000 |       0.0003609 |        0.000326 |        90.4%
       8 |       16 |     1000 |       0.0002651 |       0.0002217 |        83.6%
       8 |       64 |     1000 |       0.0003504 |       0.0003064 |        87.5%
       8 |      256 |     1000 |       0.0008264 |       0.0008729 |       105.6%
       8 |     1024 |     1000 |        0.003367 |        0.003127 |        92.9%
       8 |     4096 |     1000 |         0.01932 |         0.01799 |        93.1%
       8 |    16384 |     1000 |          0.1245 |          0.1208 |        97.0%
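
A sketch of the timing loop implied by the description above (this is not the actual microbench_numpy.py from the PR; it only mirrors the stated methodology, including the 65536//rows repeat rule):

```python
import time

import numpy as np
import xgboost as xgb


def bench(rows: int, cols: int, nthread: int) -> float:
    """Average DMatrix construction time (seconds) for one configuration."""
    X = np.ascontiguousarray(np.random.rand(rows, cols), dtype=np.float32)
    repeats = max(1, 65536 // rows)  # keep the total duration non-trivial
    start = time.perf_counter()
    for _ in range(repeats):
        xgb.DMatrix(X, nthread=nthread)
    return (time.perf_counter() - start) / repeats


for threads in (1, 2, 4, 8):
    for rows in (1, 4, 16, 64, 256, 1024, 4096, 16384):
        print(f"{threads:>8} | {rows:>8} | {1000:>8} | {bench(rows, 1000, threads):.4g}")
```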

The xgboost Python package serializes numpy arrays as JSON.
This has non-trivial overhead for small datasets.

This patch optimizes the specific case where the numpy array is already in
"C" contiguous 32-bit floating point format and has rows*cols <= 32768,
and loads it directly without the JSON layer. Benchmark results (see the
table above) come from xgboost/tests/python/microbench_numpy.py.

@trivialfis
Member

trivialfis commented Jun 11, 2024

I don't mind the diverging code path in general. Many users have asked for a more robust implementation of streaming-based prediction. The question for me is whether the code is divergent enough. For instance, there are dedicated inference libraries with special optimizations for faster inference with tree models in general, like the FIL project in cuML. They work on both random forests and boosted trees, and potentially many other types of tree-based models.

Does XGBoost want to compete in that space? If so, we might start building a set of APIs specifically for such prediction use cases, which can be optimized to the teeth. To provide some examples, we can bypass the JSON object, bypass the type dispatching, bypass the memory allocation in the predictor, specialize for dense data, specialize for balanced trees, etc. All are low-hanging fruit.

If not, then we can leave the predictor focused on batch-based data.

@trivialfis
Member

trivialfis commented Jun 11, 2024

Lastly, we have in-place prediction: Booster.inplace_predict, which doesn't require the construction of a DMatrix.
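
For reference, a minimal sketch of that path (the training data here is made up just to obtain a Booster; inplace_predict consumes the NumPy array directly):

```python
import numpy as np
import xgboost as xgb

# Train a small model on synthetic data just to obtain a Booster.
X = np.random.rand(500, 100).astype(np.float32)
y = np.random.rand(500)
booster = xgb.train({"tree_method": "hist"}, xgb.DMatrix(X, label=y), num_boost_round=10)

# In-place prediction: the array is consumed directly,
# with no intermediate DMatrix construction.
preds = booster.inplace_predict(X[:2])
```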
