Add performance tips to cudf.pandas FAQ. (#16693)
This PR adds a section with performance tips to the `cudf.pandas` FAQ.

I based this section on some common user questions about performance, to make it clearer that `cudf.pandas` is designed for optimal performance with large data sizes, and to provide some alternatives for common needs where `cudf` or `cudf.pandas` isn't the best fit. See these links for examples:

- #14548 (comment)
- #16065
- https://stackoverflow.com/questions/78626099/cudf-is-very-slow

Authors:
  - Bradley Dice (https://github.com/bdice)
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #16693
bdice authored Sep 5, 2024
1 parent 0cc059f commit 0e86f62
Showing 1 changed file with 35 additions and 3 deletions.
38 changes: 35 additions & 3 deletions docs/cudf/source/cudf_pandas/faq.md
@@ -32,7 +32,7 @@ pandas. You can learn more about these edge cases in

We also run nightly tests that track interactions between
`cudf.pandas` and other third party libraries. See
[Third-Party Library Compatibility](#does-it-work-with-third-party-libraries).
[Third-Party Library Compatibility](#does-cudf-pandas-work-with-third-party-libraries).

## How can I tell if `cudf.pandas` is active?

@@ -69,7 +69,38 @@ performance, try to use only functionality that can run entirely on GPU.
This helps reduce the number of memory transfers needed to fallback to
CPU.

## Does it work with third-party libraries?
## How can I improve performance of my workflow with `cudf.pandas`?

Most workflows will see significant performance improvements with
`cudf.pandas`. However, sometimes things can be slower than expected.
First, it's important to note that GPUs are good at parallel processing
of large amounts of data. Small data sizes may be slower on GPU than
CPU, because of the cost of data transfers. cuDF achieves the highest
performance with many rows of data. As a _very rough_ rule of thumb,
`cudf.pandas` shines on workflows with more than 10,000 - 100,000 rows
of data, depending on the algorithms, data types, and other factors.
Datasets that are several gigabytes in size and/or have millions of
rows are a great fit for `cudf.pandas`.
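
As an illustration (not part of this PR's diff), a workflow in this sweet spot might look like the sketch below. The pandas code itself is unchanged; `cudf.pandas` is enabled up front, e.g. with `%load_ext cudf.pandas` in a notebook or `python -m cudf.pandas script.py` on the command line. The column names and sizes here are made up for the example.

```python
# Sketch: cudf.pandas accelerates unmodified pandas code. Enable it
# before importing pandas (e.g. `%load_ext cudf.pandas` in a notebook,
# or `python -m cudf.pandas script.py`); the code below is plain pandas
# and runs identically on CPU.
import numpy as np
import pandas as pd

# A couple of million rows -- the regime where GPU execution pays off.
n = 2_000_000
df = pd.DataFrame({
    "key": np.random.randint(0, 1_000, n),
    "value": np.random.rand(n),
})

# Whole-column operations like this groupby-aggregation can run
# entirely on the GPU when cudf.pandas is active.
result = df.groupby("key")["value"].mean()
print(len(result))  # 1000 groups
```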

Here are some more tips to improve workflow performance:

- Reshape data so it is long rather than wide (more rows, fewer
columns). This improves cuDF's ability to execute in parallel on the
entire GPU!
- Avoid element-wise iteration and mutation. If you can, use pandas
functions to manipulate an entire column at once rather than writing
raw `for` loops that compute and assign.
- If your data is really an n-dimensional array with lots of columns
where you aim to do lots of math (like adding matrices),
[CuPy](https://cupy.dev/) or [NumPy](https://numpy.org/) may be a
better choice than pandas or `cudf.pandas`. Array libraries are built
for different use cases than DataFrame libraries, and will get optimal
performance from using contiguous memory for multidimensional array
storage. Use the `.values` attribute to convert a DataFrame or Series
to an array.
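
The loop-avoidance and array-conversion tips above can be sketched as follows (plain pandas/NumPy, with made-up column names; under `cudf.pandas`, the array produced would typically be consumed by a GPU array library such as CuPy instead):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5.0), "b": np.arange(5.0, 10.0)})

# Slow pattern: element-wise iteration with per-row assignment.
# total = []
# for _, row in df.iterrows():
#     total.append(row["a"] + row["b"])

# Fast pattern: operate on whole columns at once.
df["total"] = df["a"] + df["b"]

# If the data is really a 2-D numeric array destined for heavy math,
# convert it and continue in an array library (NumPy here; CuPy offers
# the same interface on GPU).
arr = df[["a", "b"]].values          # NumPy array, shape (5, 2)
squared_norms = (arr ** 2).sum(axis=1)
print(df["total"].tolist())          # [5.0, 7.0, 9.0, 11.0, 13.0]
```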

(does-cudf-pandas-work-with-third-party-libraries)=
## Does `cudf.pandas` work with third-party libraries?

`cudf.pandas` is tested with numerous popular third-party libraries.
`cudf.pandas` will not only work but will accelerate pandas operations
@@ -97,7 +128,7 @@ common interactions with the following Python libraries:
Please review the section on [Known Limitations](#are-there-any-known-limitations)
for details about what is expected not to work (and why).

## Can I use this with Dask or PySpark?
## Can I use `cudf.pandas` with Dask or PySpark?

`cudf.pandas` is not designed for distributed or out-of-core computing
(OOC) workflows today. If you are looking for accelerated OOC and
@@ -111,6 +142,7 @@ cuDF (learn more in [this
blog](https://medium.com/rapids-ai/easy-cpu-gpu-arrays-and-dataframes-run-your-dask-code-where-youd-like-e349d92351d)) and the [RAPIDS Accelerator for Apache Spark](https://nvidia.github.io/spark-rapids/)
provides a similar configuration-based plugin for Spark.

(are-there-any-known-limitations)=
## Are there any known limitations?

There are a few known limitations that you should be aware of:
