- Platform Overview
- Data Science Workflow
- End-to-End Use-Case Applications
- Jupyter Notebook Basics
- Additional Resources
- Support
The Iguazio Data Science Platform ("the platform") is a fully integrated and secure data science platform as a service (PaaS), which simplifies development, accelerates performance, facilitates collaboration, and addresses operational challenges. The platform incorporates the following components:
- A data science workbench that includes Jupyter Notebook, integrated analytics engines, and Python packages
- Real-time dashboards based on Grafana
- Managed data and machine-learning (ML) services over a scalable Kubernetes cluster
- A real-time serverless functions framework — Nuclio
- An extremely fast and secure data layer that supports SQL, NoSQL, time-series databases, files (simple objects), and streaming
- Integration with third-party data sources such as Amazon S3, HDFS, SQL databases, and streaming or messaging protocols
The platform uses Kubernetes (k8s) as the baseline cluster manager, and deploys various application microservices on top of Kubernetes to address different data science tasks. Most of the provided services support scale-out and GPU acceleration and have secure, low-latency access to the platform's shared data store and file system, enabling high performance and scalability with maximum resource efficiency.
The platform makes extensive use of Nuclio serverless functions to automate various tasks — such as data collection, extract-transform-load (ETL) processes, model serving, and batch jobs. A Nuclio function packages your code together with all the resource definitions and configuration required to run it. The functions auto-scale and can be versioned. The platform supports various methods for generating Nuclio functions — using the graphical dashboard, Docker, Git, or Jupyter Notebook — as demonstrated in the platform tutorials.
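To make this concrete, the following is a minimal sketch of a Python Nuclio handler of the kind these functions are built around; the logic is purely illustrative, and the function's triggers and resources are configured separately at deployment time:

```python
# Minimal Nuclio handler sketch (illustrative only).
# Every Python Nuclio function exposes a handler that receives a context
# object (logger, user data) and the triggering event (body, headers, path).
def handler(context, event):
    context.logger.info("Received event on path: " + str(event.path))
    # A real function would collect data, run an ETL step, or serve a model
    # prediction here; this sketch simply reports the size of the request body.
    return "Processed {} bytes".format(len(event.body or b""))
```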
For a more in-depth introduction to the platform, see the following resources:
- Components, Services, and Development Ecosystem
- Introduction video
- Unique data-layer architecture
- Creating and deploying Nuclio functions with Python and Jupyter Notebook
A good place to start your development is with the platform tutorial Jupyter notebooks.
- The getting-started directory contains information and code examples to help you quickly get started using the platform.
- The demos directory contains full end-to-end use-case application demos.
The Iguazio Data Science Platform provides a complete data science workflow in a single ready-to-use platform that includes all the required building blocks for creating data science applications from research to production:
- Collect, explore, and label data from various real-time or offline sources
- Run ML training and validation at scale over multiple CPUs and GPUs
- Deploy models and applications into production with serverless functions
- Log, monitor, and visualize all your data and services
There are many ways to collect and ingest data from various sources into the platform:
- Streaming data in real time from sources such as Kafka, Kinesis, Azure Event Hubs, or Google Pub/Sub.
- Loading data directly from external databases using an event-driven or periodic/scheduled implementation. See the explanation and examples in the read-external-db tutorial.
- Loading files (objects), in any format (for example, CSV, Parquet, JSON, or a binary image), from internal or external sources such as Amazon S3 or Hadoop. See, for example, the file-access tutorial.
- Importing time-series telemetry data using a Prometheus-compatible scraping API.
- Ingesting (writing) data directly into the system using RESTful AWS-like simple-object, streaming, or NoSQL APIs. See the platform's Web-API References.
- Scraping or reading data from external sources — such as Twitter, weather services, or stock-trading data services — using serverless functions. See, for example, the stocks demo use-case application.
For more information and examples of data collection and ingestion with the platform, see the collect-n-explore tutorial Jupyter notebook.
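As a small illustrative sketch of direct (write) ingestion, the following notebook cell uses the V3IO Frames library to write a pandas DataFrame to a platform NoSQL table; the Frames service address, container, and table path are assumptions and may differ in your deployment:

```python
import pandas as pd
import v3io_frames as v3f

# Connect to the platform's Frames service (the address and container here
# are assumptions; check your deployment's Frames configuration and credentials)
client = v3f.Client("framesd:8081", container="users")

# A small example DataFrame, indexed by the column that serves as the table key
df = pd.DataFrame({"symbol": ["AAPL", "GOOG"], "price": [170.5, 135.2]})
df = df.set_index("symbol")

# Write the DataFrame to a hypothetical NoSQL table path
client.write("kv", table="examples/stocks", dfs=df)
```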
The platform includes a wide range of integrated open-source data query and exploration tools, including the following:
- Apache Spark data-processing engine — including the Spark SQL and Datasets, MLlib, R, and GraphX libraries — with real-time access to the platform's NoSQL data store and file system. See the platform's Spark APIs reference and the examples in the spark-sql-analytics tutorial.
- Presto distributed SQL query engine, which can be used to run interactive SQL queries over platform NoSQL tables or other object (file) data sources. See the platform's Presto reference.
- pandas Python analysis library, including structured DataFrames.
- Dask parallel-computing Python library, including scaled pandas DataFrames.
- V3IO Frames — Iguazio's open-source data-access library, which provides a unified high-performance API for accessing NoSQL, stream, and time-series data in the platform's data store and features native integration with pandas and NVIDIA RAPIDS. See, for example, the frames tutorial.
- Built-in support for ML and scientific-computing packages such as scikit-learn, Matplotlib (Pyplot), NumPy, PyTorch, and TensorFlow.
All these tools are integrated with the platform's Jupyter Notebook service, allowing users to access the same data from Jupyter through different interfaces with minimal configuration overhead. Users can easily install additional Python packages by using the Conda binary package and environment manager and the pip Python package installer, which are both available as part of the Jupyter Notebook service. This design, coupled with the platform's unified data model, enables users to store and access data using different formats — such as NoSQL ("key/value"), time series, stream data, and files (simple objects) — and leverage different tools and APIs for accessing and manipulating the data, all from a single development environment (namely, Jupyter Notebook).
Note: You can deploy and manage application services, such as Spark and Jupyter Notebook, from the Services page of the platform dashboard.
For more information and examples of data exploration with the platform, see the collect-n-explore tutorial Jupyter notebook.
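For example, a table such as the one written in the ingestion sketch above can be read back into a pandas DataFrame with V3IO Frames and explored with regular pandas operations; the client settings and table path are the same assumptions as before:

```python
import v3io_frames as v3f

# Connect to the Frames service (address and container are assumptions)
client = v3f.Client("framesd:8081", container="users")

# Read the hypothetical NoSQL table back into a pandas DataFrame
df = client.read("kv", table="examples/stocks")

# From here on it's regular pandas: filter, aggregate, or plot the data
print(df.describe())
print(df[df["price"] > 150])
```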
You can develop and test data science models in the platform's Jupyter Notebook service or in your preferred external editor. When your model is ready, you can train it in Jupyter Notebook or by using scalable cluster resources such as Nuclio functions, Dask, Spark ML, or Kubernetes jobs. You can find model-training examples in the platform's tutorial Jupyter notebooks:
- The NetOps demo tutorial demonstrates predictive infrastructure-monitoring using scikit-learn.
- The image-classification demo tutorial demonstrates image recognition using TensorFlow and Keras.
If you're a beginner, you might find the following ML guide useful — Machine Learning Algorithms In Layman's Terms.
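As a minimal, self-contained illustration of in-notebook training (separate from the demo datasets above), the following cell trains and validates a scikit-learn model on a bundled sample dataset; for training at scale you would replace it with a Dask, Spark ML, or Kubernetes job as described above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small sample dataset and split it into training and validation sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train a simple model; in a real workflow this is where scalable cluster
# resources (multiple CPUs and GPUs) come into play
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Validation accuracy:", accuracy_score(y_test, model.predict(X_test)))
```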
The platform allows you to easily deploy your models to production in a reproducible way by using the open-source Nuclio serverless framework. You provide Nuclio with code or Jupyter notebooks, resource definitions (such as CPU, memory, and GPU), environment variables, package or software dependencies, data links, and trigger information. Nuclio uses this information to automatically build the code, generate custom container images, and connect them to the relevant compute or data resources. The functions can be triggered by a wide variety of event sources, including the most commonly used streaming and messaging protocols, HTTP APIs, scheduled (cron) tasks, and batch jobs.
Nuclio functions can be created from the platform dashboard or by using standard code IDEs, and can be deployed on your platform cluster. A convenient way to develop and deploy Nuclio functions is by using Jupyter Notebook and Python tools. For detailed information about Nuclio, visit the Nuclio web site and see the product documentation.
Note: Nuclio functions aren't limited to model serving: they can automate data collection, serve custom APIs, build real-time feature vectors, drive triggers, and more.
For an overview of Nuclio and how to develop, document, and deploy serverless Python Nuclio functions from Jupyter Notebook, see the nuclio-jupyter documentation. You can also find examples in the platform tutorial Jupyter notebooks; for example, the NetOps demo tutorial demonstrates how to deploy a network-operations model as a function.
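The rough sketch below shows what such a deployment can look like from a notebook cell, assuming the nuclio-jupyter package's deploy_file helper; the notebook, function, and project names are hypothetical, and the exact arguments may differ between package versions, so consult the nuclio-jupyter documentation for the authoritative API:

```python
# Hypothetical deployment sketch using the nuclio-jupyter package.
import nuclio

# Deploy a notebook that contains a Nuclio handler as a serverless function.
# The notebook file, function name, and project name below are placeholders.
addr = nuclio.deploy_file(
    "my-model-server.ipynb",  # notebook containing the handler code
    name="model-server",      # hypothetical function name
    project="examples",       # hypothetical project name
)
print("Function deployed; invocation address:", addr)
```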
Data in the platform — including collected data, internal or external telemetry and logs, and program-output data — can be analyzed and visualized in different ways simultaneously. The platform supports multiple standard data analytics and visualization tools, including SQL, Prometheus, Grafana, and pandas. For example, you can plot or chart data within Jupyter Notebook using Matplotlib; use your favorite BI visualization tools, such as Tableau, to query data in the platform over a Java Database Connectivity (JDBC) connection; or build real-time dashboards in Grafana.
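For instance, a minimal notebook cell that charts a pandas DataFrame with Matplotlib might look as follows; the data here is synthetic and purely illustrative:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic time-series data standing in for platform query results
ts = pd.DataFrame(
    {"cpu_utilization": [42, 55, 61, 58, 73, 69]},
    index=pd.date_range("2019-01-01", periods=6, freq="H"),
)

# Plot directly from the DataFrame inside the notebook
ax = ts.plot(title="CPU utilization over time")
ax.set_xlabel("Time")
ax.set_ylabel("Percent")
plt.show()
```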
The data analytics and visualization tools and services generate telemetry and log data that can be stored using the platform's time-series database (TSDB) service or by using external tools such as Elasticsearch. Platform users can easily instrument code and functions to collect various statistics or logs, and explore the collected data in real time.
The Grafana open-source analytics and monitoring framework is natively integrated into the platform, allowing users to create dashboards that provide access to platform NoSQL tables and time-series databases from different dashboard widgets. You can also create Grafana dashboards programmatically (for example, from Jupyter Notebook) using wizard scripts. For information on how to create Grafana dashboards to monitor and visualize data in the platform, see Adding a Custom Grafana Dashboard.
Iguazio provides full end-to-end use-case applications that demonstrate how to use the Iguazio Data Science Platform and related tools to address data science requirements for different industries and implementations. The applications are provided in the demos directory of the platform's tutorial Jupyter notebooks and cover the following use cases; for more detailed descriptions, see the demos README (notebook / Markdown):
- Smart stock trading (stocks) — the application reads stock-exchange data from an internet service into a time-series database (TSDB); uses Twitter to analyze the market sentiment on specific stocks, in real time; and saves the data to a platform NoSQL table that is used for generating reports and analyzing and visualizing the data on a Grafana dashboard.
- Predictive infrastructure monitoring (netops) — the application builds, trains, and deploys a machine-learning model for analyzing and predicting failure in network devices as part of a network operations (NetOps) flow. The goal is to identify anomalies for device metrics — such as CPU, memory consumption, or temperature — which can signify an upcoming issue or failure.
- Image recognition (image-classification) — the application builds and trains an ML model that identifies (recognizes) and classifies images by using Keras, TensorFlow, and scikit-learn.
- Natural language processing (NLP) (nlp) — the application processes natural-language textual data — including spelling correction and sentiment analysis — and generates a Nuclio serverless function that translates any given text string to another (configurable) language.
- Stream enrichment (stream-enrich) — the application demonstrates a typical stream-based data-engineering pipeline, which is required in many real-world scenarios: data is streamed from an event streaming engine; the data is enriched, in real time, using data from a NoSQL table; the enriched data is saved to an output data stream and then consumed from this stream.
The platform's Jupyter Notebook service displays the JupyterLab UI, which consists of a collapsible left sidebar, a main work area (on the right), and a top menu bar. For details, see the JupyterLab documentation.
The main work area (on the right) contains tabs of documents and activities — for creating, viewing, editing, and running interactive notebooks, shell terminals, or consoles, as well as viewing and editing other common file types.
To create a new notebook or terminal, select the New Launcher option (+ icon) from the top action toolbar in the left sidebar.
The top menu bar exposes available top-level actions, such as exporting a notebook in a different format.
The left-sidebar menu contains commonly used tabs, including a File Browser (directory icon) for browsing files.
The root file-browser directory of the platform's Jupyter Notebook service contains the following files and directories:
- A v3io directory, which displays the contents of the v3io platform cluster data mount for browsing the contents of the cluster's data containers. You can also browse the contents of the data containers from the Data page of the platform dashboard.
- The contents of the running-user home directory — users/<running user>. This directory contains the platform's tutorial Jupyter notebooks:
- welcome.ipynb / README.md — the current document, which provides a short introduction to the platform and how to use it to implement a full data science workflow.
- getting-started — a directory containing getting-started tutorials that explain and demonstrate how to perform different platform operations using the platform APIs and integrated tools.
- demos — a directory containing end-to-end application use-case demos.
For information about the predefined data containers and how to reference data in these containers, see Platform Data Containers in the collect-n-explore tutorial notebook.
A virtual environment is a named, isolated, working copy of Python that maintains its own files, directories, and paths so that you can work with specific versions of libraries or Python itself without affecting other Python projects. Virtual environments make it easy to cleanly separate projects and avoid problems with different dependencies and version requirements across components. See the virtual-env tutorial notebook for step-by-step instructions for using conda to create your own Python virtual environments, which will appear as custom kernels in Jupyter Notebook.
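The commands below give a rough outline of that flow, as run from a Jupyter terminal or notebook cell; the environment name and package list are placeholders, and the virtual-env tutorial remains the authoritative reference:

```python
# Illustrative only; the virtual-env tutorial documents the exact steps.
# Create a named conda environment with its own Python interpreter and packages
# (including ipykernel, which is needed to expose it as a Jupyter kernel).
!conda create -n my-env python=3.7 ipykernel scikit-learn -y

# Registering the new environment as a Jupyter kernel, so that it shows up in
# the notebook launcher, is covered in the virtual-env tutorial notebook.
```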
You can use the provided igz-tutorials-get.sh script to update the tutorial notebooks to the latest stable version available on GitHub. For details, see the update-tutorials.ipynb notebook.
- References
- Components, Services, and Development Ecosystem
- Iguazio sample data-set public Amazon S3 bucket
- 10 Minutes to pandas
- JupyterLab Tutorial
- Machine Learning Algorithms In Layman's Terms
- Registry of Open Data on AWS
The Iguazio support team will be happy to assist with any questions.