Skip to content

Commit

Permalink
docs: add faqs (#289)
Browse files Browse the repository at this point in the history
* docs: add faqs

* docs: update faqs

* docs: update mkdocs.yml
  • Loading branch information
miriamspsantos authored Jul 24, 2023
1 parent edd34a9 commit d642bd9
Show file tree
Hide file tree
Showing 3 changed files with 89 additions and 1 deletion.
3 changes: 3 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -130,5 +130,8 @@ We are open to collaboration! If you want to start contributing you only need to
## Support
For support in using this library, please join our Discord server. Our Discord community is very friendly and great about quickly answering questions about the use and development of the library. [Click here to join our Discord community!](https://tiny.ydata.ai/dcai-ydata-synthetic)

## FAQs
Have a question? Check out the [Frequently Asked Questions](https://ydata.ai/resources/10-most-asked-questions-on-ydata-synthetic) about `ydata-synthetic`. If you feel something is missing, feel free to [book a beary informal chat with us](https://meetings.hubspot.com/fabiana-clemente).

## License
[MIT License](https://github.com/ydataai/ydata-synthetic/blob/master/LICENSE)
84 changes: 84 additions & 0 deletions docs/examples/faqs.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
# Frequently Asked Questions

## How to get accurate data from my synthetic data generation processes?
Depending on your use case, the downstream application of your synthetic data, and the characteristics of your original data, you will need to adjust your synthetisation process accordingly. That often involves performing a thorough data preparation and fitting your generation models appropriately.

!!! tip

For a use-case oriented UI experience, try [YData Fabric](https://ydata.ai/ydata-fabric-free-trial). From an interactive and complete data profiling to an efficient synthetization, your data preparation process will be seamlessly adjusted to your data characteristics.

## How can I run the Streamlit app?

To try `ydata-synthetic` using the streamlit app, you need to install it using the `[]` notation that encodes the extras that the package incorporates. In this case, you can simply create your virtual environment and install `ydata-synthetic` as:

```bash
pip install ydata-synthetic[streamlit]
```

Note that Jupyter or Colab Notebooks are not yet supported, so you need to work it out in your Python environment. Once the package is installed, you can use the following snippet to start the app:

```python
from ydata_synthetic import streamlit_app

streamlit_app.run()
```

And that's it! After running the command, the console will output the URL from which you can access the app!

!!! example
For a step-by-step installation guide, [check this 5-min video](https://www.youtube.com/watch?v=jj9X1_cKRwI&t=2s) that will help you get started!


## What is the best way to evaluate the quality of my synthetic data?
The most appropriate metrics to evaluate the quality of your synthetic data are also dependent on the goal for which synthetic data will be used. Nevertheless, we may define three essential pillars for synthetic data quality: privacy, fidelity, and utility:

- Privacy refers to the ability of synthetic data to withhold any personal, private, or sensitive information, avoiding connections being drawn to the original data and preventing data leakage;

- Fidelity concerns the ability of the new data to preserve the properties of the original data (in other words, it refers to "how faithful, how precise" is the synthetic data in comparison to real data);

- Finally, utility relates to the downstream application where the synthetic data will be used: if the synthetization process is successful, the same insights should be derived from the new data as from the original data.

For each of these components, several specific statistical measures can be evaluated.

!!! abstract

To learn more about how to define specific trade-offs between privacy, fidelity, and utility, check out this white paper on [Synthetic Data Quality Metrics](https://ydata.ai/synthetic-data-quality-metrics).


## How to generate synthetic data in Google Colab and Python Environments?
Most issues with installations are usually associated with unsupported Python versions or misalignment between python environments and package requirements.

Let’s see how you can get both right:

### Python Versions
Note that `ydata-synthetic` currently requires Python >=3.9, < 3.11 so if you're trying to run our code in Google Colab, then you need to [update your Google Colab’s Python version](https://stackoverflow.com/questions/68657341/how-can-i-update-google-colabs-python-version/68658479#68658479) accordingly. The same goes for your development environment.

### Virtual Environments
A lot of troubleshooting arises due to misalignments between environments and package requirements.
Virtual Environments isolate your installations from the "global" environment so that you don't have to worry about conflicts.

Using conda, creating a new environment is as easy as running this on your shell:

```
conda create --name synth-env python==3.9 pip
conda activate synth-env
pip install ydata-synthetic
```

Now you can open up your Python editor or Jupyter Lab and use the synth-env as your development environment, without having to worry about conflicting versions or packages between projects!


## Does TimeGAN replicate my full sequence of data?
No. This is an unrealistic expectation because the TimeGAN architecture is not meant to replicate the long-term behavior of your data.

TimeGAN works with the concept of "windows": it learns to map the data distribution of short-term frames of time, within the time windows you provide. It also considers that those windows are independent of each other, so it cannot return a temporal pattern most people expect.

That's not supported by this architecture itself, but there are others that allow for both short-term and long-term synthesization, as those available in [YData Fabric](https://ydata.ai/products/synthetic_data).

!!! abstract

Learn more about how YData's Time-Series Synthetic Data Generation compare to TimeGAN in [this dedicated post](https://ydata.ai/resources/the-best-generative-ai-model-for-time-series-synthetic-data-generation).


# Additional Support
Couldn't find what you need? Reach out to our [dedicated team](https://meetings.hubspot.com/fabiana-clemente) for a quick and *syn-ple* assistance!
3 changes: 2 additions & 1 deletion mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ nav:
- Overview: 'index.md'
- Installation: 'getting-started/installation.md'
- Quickstart: 'getting-started/quickstart.md'
- Examples:
- Synthetic Data Generation:
- Generate Tabular Data:
- CGAN: "examples/cgan_example.md"
- WGAN: "examples/wgan_example.md"
Expand All @@ -21,6 +21,7 @@ nav:
- CWGAN-GP: "examples/cwgangp_example.md"
- Generate Time-Series Data:
- TimeGAN: "examples/timegan_example.md"
- Frequently Asked Questions: "examples/faqs.md"
- Integrations:
- Great Expectations: "integrations/gx_integration.md"
- Support:
Expand Down

0 comments on commit d642bd9

Please sign in to comment.