From 0e86f621bbf32c6b5a72fa95afd1f74d6fa50aba Mon Sep 17 00:00:00 2001
From: Bradley Dice
Date: Thu, 5 Sep 2024 17:10:27 -0500
Subject: [PATCH] Add performance tips to cudf.pandas FAQ. (#16693)

This PR adds a section with performance tips to the `cudf.pandas` FAQ. I based this section on some common user questions about performance, to make it clearer that `cudf.pandas` is designed for optimal performance with large data sizes and provide some alternatives for common needs where `cudf` or `cudf.pandas` aren't the best fit.

See these links for examples:
- https://github.com/rapidsai/cudf/issues/14548#issuecomment-1838529130
- https://github.com/rapidsai/cudf/issues/16065
- https://stackoverflow.com/questions/78626099/cudf-is-very-slow

Authors:
  - Bradley Dice (https://github.com/bdice)
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: https://github.com/rapidsai/cudf/pull/16693
---
 docs/cudf/source/cudf_pandas/faq.md | 38 ++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/docs/cudf/source/cudf_pandas/faq.md b/docs/cudf/source/cudf_pandas/faq.md
index cdf32216619..fa5d203f52c 100644
--- a/docs/cudf/source/cudf_pandas/faq.md
+++ b/docs/cudf/source/cudf_pandas/faq.md
@@ -32,7 +32,7 @@ pandas. You can learn more about these edge cases in

 We also run nightly tests that track interactions between `cudf.pandas`
 and other third party libraries. See
-[Third-Party Library Compatibility](#does-it-work-with-third-party-libraries).
+[Third-Party Library Compatibility](#does-cudf-pandas-work-with-third-party-libraries).

 ## How can I tell if `cudf.pandas` is active?

@@ -69,7 +69,38 @@ performance, try to use only functionality that can run entirely on GPU.
 This helps reduce the number of memory transfers needed to fallback to
 CPU.

-## Does it work with third-party libraries?
+## How can I improve performance of my workflow with `cudf.pandas`?
+
+Most workflows will see significant performance improvements with
+`cudf.pandas`. However, sometimes things can be slower than expected.
+First, it's important to note that GPUs are good at parallel processing
+of large amounts of data. Small data sizes may be slower on GPU than
+CPU, because of the cost of data transfers. cuDF achieves the highest
+performance with many rows of data. As a _very rough_ rule of thumb,
+`cudf.pandas` shines on workflows with more than 10,000 - 100,000 rows
+of data, depending on the algorithms, data types, and other factors.
+Datasets that are several gigabytes in size and/or have millions of
+rows are a great fit for `cudf.pandas`.
+
+Here are some more tips to improve workflow performance:
+
+- Reshape data so it is long rather than wide (more rows, fewer
+  columns). This improves cuDF's ability to execute in parallel on the
+  entire GPU!
+- Avoid element-wise iteration and mutation. If you can, use pandas
+  functions to manipulate an entire column at once rather than writing
+  raw `for` loops that compute and assign.
+- If your data is really an n-dimensional array with lots of columns
+  where you aim to do lots of math (like adding matrices),
+  [CuPy](https://cupy.dev/) or [NumPy](https://numpy.org/) may be a
+  better choice than pandas or `cudf.pandas`. Array libraries are built
+  for different use cases than DataFrame libraries, and will get optimal
+  performance from using contiguous memory for multidimensional array
+  storage. Use the `.values` method to convert a DataFrame or Series to
+  an array.
+
+(does-cudf-pandas-work-with-third-party-libraries)=
+## Does `cudf.pandas` work with third-party libraries?

 `cudf.pandas` is tested with numerous popular third-party libraries.
 `cudf.pandas` will not only work but will accelerate pandas operations
@@ -97,7 +128,7 @@ common interactions with the following Python libraries:
 Please review the section on [Known Limitations](#are-there-any-known-limitations)
 for details about what is expected not to work (and why).

-## Can I use this with Dask or PySpark?
+## Can I use `cudf.pandas` with Dask or PySpark?

 `cudf.pandas` is not designed for distributed or out-of-core computing
 (OOC) workflows today. If you are looking for accelerated OOC and
@@ -111,6 +142,7 @@ cuDF (learn more in
 [this blog](https://medium.com/rapids-ai/easy-cpu-gpu-arrays-and-dataframes-run-your-dask-code-where-youd-like-e349d92351d)) and the [RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/)
 provides a similar configuration-based plugin for Spark.

+(are-there-any-known-limitations)=
 ## Are there any known limitations?

 There are a few known limitations that you should be aware of:
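
Note on the tips added by this patch: the three bullet points in the new FAQ section (long rather than wide data, column-wise operations instead of loops, and array libraries for matrix-style math) can be illustrated with a minimal sketch. This is not part of the patch. It uses plain pandas and NumPy so it runs without a GPU, and the data, column names, and helper function names (`total_with_loop`, `total_vectorized`) are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: 1,000 rows and three numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1_000, 3)), columns=["a", "b", "c"])


def total_with_loop(frame):
    """Element-wise iteration: computes one value per loop step (the pattern to avoid)."""
    out = []
    for i in range(len(frame)):
        out.append(frame["a"].iloc[i] + frame["b"].iloc[i])
    return pd.Series(out, index=frame.index)


def total_vectorized(frame):
    """Column-wise form: one operation over the entire column."""
    return frame["a"] + frame["b"]


# Long rather than wide: melt() stacks the three value columns into rows,
# giving 3,000 rows and 2 columns instead of 1,000 rows and 3 columns.
long_df = df.melt(var_name="column", value_name="value")

# Matrix-style math: convert to an array and use an array library.
# `.values` is accessed as an attribute; with plain pandas it is a NumPy array.
matrix = df.values
gram = matrix.T @ matrix  # 3 x 3 matrix product
```

Both functions return the same values; the vectorized form hands the library a whole column at once, which is the shape of work the new FAQ section says `cudf.pandas` handles best.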