Add performance tips to cudf.pandas FAQ. (#16693)
This PR adds a section with performance tips to the `cudf.pandas` FAQ.

I based this section on some common user questions about performance, to make it clearer that `cudf.pandas` is designed for optimal performance with large data sizes, and to provide some alternatives for common needs where `cudf` or `cudf.pandas` isn't the best fit. See these links for examples:

- #14548 (comment)
- #16065
- https://stackoverflow.com/questions/78626099/cudf-is-very-slow

Authors:
  - Bradley Dice (https://github.com/bdice)
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #16693
bdice authored Sep 5, 2024
1 parent 0cc059f commit 0e86f62
Showing 1 changed file with 35 additions and 3 deletions.
38 changes: 35 additions & 3 deletions docs/cudf/source/cudf_pandas/faq.md
@@ -32,7 +32,7 @@ pandas. You can learn more about these edge cases in

We also run nightly tests that track interactions between
`cudf.pandas` and other third party libraries. See
[Third-Party Library Compatibility](#does-it-work-with-third-party-libraries).
[Third-Party Library Compatibility](#does-cudf-pandas-work-with-third-party-libraries).

## How can I tell if `cudf.pandas` is active?

@@ -69,7 +69,38 @@ performance, try to use only functionality that can run entirely on GPU.
This helps reduce the number of memory transfers needed to fallback to
CPU.

## Does it work with third-party libraries?
## How can I improve performance of my workflow with `cudf.pandas`?

Most workflows will see significant performance improvements with
`cudf.pandas`. However, sometimes things can be slower than expected.
First, it's important to note that GPUs are good at parallel processing
of large amounts of data. Small data sizes may be slower on GPU than
CPU, because of the cost of data transfers. cuDF achieves the highest
performance with many rows of data. As a _very rough_ rule of thumb,
`cudf.pandas` shines on workflows with more than 10,000 - 100,000 rows
of data, depending on the algorithms, data types, and other factors.
Datasets that are several gigabytes in size and/or have millions of
rows are a great fit for `cudf.pandas`.
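
As an illustration (not part of this PR's diff), a workflow in this sweet spot might look like the sketch below. The pandas code itself is unchanged; `cudf.pandas` is enabled up front, e.g. with `%load_ext cudf.pandas` in a notebook or `python -m cudf.pandas script.py` on the command line. The column names and sizes here are made up for the example.

```python
# Sketch: cudf.pandas accelerates unmodified pandas code. Enable it
# before importing pandas (e.g. `%load_ext cudf.pandas` in a notebook,
# or `python -m cudf.pandas script.py`); the code below is plain pandas
# and runs identically on CPU.
import numpy as np
import pandas as pd

# A couple of million rows -- the regime where GPU execution pays off.
n = 2_000_000
df = pd.DataFrame({
    "key": np.random.randint(0, 1_000, n),
    "value": np.random.rand(n),
})

# Whole-column operations like this groupby-aggregation can run
# entirely on the GPU when cudf.pandas is active.
result = df.groupby("key")["value"].mean()
print(len(result))  # 1000 groups
```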

Here are some more tips to improve workflow performance:

- Reshape data so it is long rather than wide (more rows, fewer
columns). This improves cuDF's ability to execute in parallel on the
entire GPU!
- Avoid element-wise iteration and mutation. If you can, use pandas
functions to manipulate an entire column at once rather than writing
raw `for` loops that compute and assign.
- If your data is really an n-dimensional array with lots of columns
where you aim to do lots of math (like adding matrices),
[CuPy](https://cupy.dev/) or [NumPy](https://numpy.org/) may be a
better choice than pandas or `cudf.pandas`. Array libraries are built
for different use cases than DataFrame libraries, and will get optimal
performance from using contiguous memory for multidimensional array
storage. Use the `.values` attribute to convert a DataFrame or Series
to an array.
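
The loop-avoidance and array-conversion tips above can be sketched as follows (plain pandas/NumPy, with made-up column names; under `cudf.pandas`, the array produced would typically be consumed by a GPU array library such as CuPy instead):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5.0), "b": np.arange(5.0, 10.0)})

# Slow pattern: element-wise iteration with per-row assignment.
# total = []
# for _, row in df.iterrows():
#     total.append(row["a"] + row["b"])

# Fast pattern: operate on whole columns at once.
df["total"] = df["a"] + df["b"]

# If the data is really a 2-D numeric array destined for heavy math,
# convert it and continue in an array library (NumPy here; CuPy offers
# the same interface on GPU).
arr = df[["a", "b"]].values          # NumPy array, shape (5, 2)
squared_norms = (arr ** 2).sum(axis=1)
print(df["total"].tolist())          # [5.0, 7.0, 9.0, 11.0, 13.0]
```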

(does-cudf-pandas-work-with-third-party-libraries)=
## Does `cudf.pandas` work with third-party libraries?

`cudf.pandas` is tested with numerous popular third-party libraries.
`cudf.pandas` will not only work but will accelerate pandas operations
@@ -97,7 +128,7 @@ common interactions with the following Python libraries:
Please review the section on [Known Limitations](#are-there-any-known-limitations)
for details about what is expected not to work (and why).

## Can I use this with Dask or PySpark?
## Can I use `cudf.pandas` with Dask or PySpark?

`cudf.pandas` is not designed for distributed or out-of-core computing
(OOC) workflows today. If you are looking for accelerated OOC and
@@ -111,6 +142,7 @@ cuDF (learn more in [this
blog](https://medium.com/rapids-ai/easy-cpu-gpu-arrays-and-dataframes-run-your-dask-code-where-youd-like-e349d92351d)) and the [RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/)
provides a similar configuration-based plugin for Spark.

(are-there-any-known-limitations)=
## Are there any known limitations?

There are a few known limitations that you should be aware of:
