vastdb
is a Python-based SDK designed for interacting with a VAST Database and the VAST Catalog, enabling schema and table management, efficient ingest, query, and modification of columnar data.
For more details about the VAST Database, see this whitepaper.
- Linux client with Python 3.9 or later, and network access to the VAST Cluster
- Virtual IP pool configured with DNS service
- S3 access & secret keys on the VAST cluster
- Tabular identity policy with the proper permissions
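Before installing, it can help to sanity-check the S3 keys and the network path with a plain boto3 call. This is a minimal sketch, assuming boto3 is installed and reusing the example endpoint and key variables from the quickstart below:

import boto3

# hypothetical check: any S3 client works against the cluster's VIP pool endpoint
s3 = boto3.client('s3',
                  endpoint_url='http://vip-pool.v123-xy.VastENG.lab',
                  aws_access_key_id=AWS_ACCESS_KEY_ID,
                  aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
print(s3.list_buckets()['Buckets'])  # should list the buckets visible to these keys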
The VAST DB Python SDK requires VAST Cluster release 5.0.0-sp10 or later.
If your VAST Cluster is running an older release, please contact customer.support@vastdata.com.
pip install vastdb
See the Release Notes for the SDK.
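To confirm which SDK version is installed, you can query the package metadata (standard library only):

from importlib.metadata import version

print(version("vastdb"))  # e.g. the version listed in the Release Notes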
Create schemas and tables, and run basic inserts and selects:
import pyarrow as pa
import vastdb
session = vastdb.connect(
    endpoint='http://vip-pool.v123-xy.VastENG.lab',
    access=AWS_ACCESS_KEY_ID,
    secret=AWS_SECRET_ACCESS_KEY)
with session.transaction() as tx:
    bucket = tx.bucket("bucket-name")
    schema = bucket.create_schema("schema-name")
    print(bucket.schemas())

    columns = pa.schema([
        ('c1', pa.int16()),
        ('c2', pa.float32()),
        ('c3', pa.utf8()),
    ])
    table = schema.create_table("table-name", columns)
    print(schema.tables())
    print(table.columns())

    arrow_table = pa.table(schema=columns, data=[
        [111, 222, 333],
        [0.5, 1.5, 2.5],
        ['a', 'bb', 'ccc'],
    ])
    table.insert(arrow_table)

    # run `SELECT * FROM t`
    reader = table.select()  # returns a `pyarrow.RecordBatchReader`
    result = reader.read_all()  # build a PyArrow Table from the `pyarrow.RecordBatch` objects read from VAST
    assert result == arrow_table
# the transaction is automatically committed when exiting the context
For configuration examples, see here.
Note: the transaction must remain open while the returned pyarrow.RecordBatchReader generator is being used.
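If the results need to outlive the transaction, materialize them before the context exits. A minimal sketch, reusing the table from above:

with session.transaction() as tx:
    table = tx.bucket("bucket-name").schema("schema-name").table("table-name")
    result = table.select().read_all()  # fully materialized PyArrow Table
# `result` stays valid here; an unread `pyarrow.RecordBatchReader` would not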
The SDK supports predicate and projection pushdown:
from ibis import _

# SELECT c1 FROM t WHERE (c2 > 2) AND (c3 IS NULL)
table.select(columns=['c1'],
             predicate=(_.c2 > 2) & _.c3.isnull())

# SELECT c2, c3 FROM t WHERE (c2 BETWEEN 0 AND 1) OR (c2 > 10)
table.select(columns=['c2', 'c3'],
             predicate=(_.c2.between(0, 1) | (_.c2 > 10)))

# SELECT * FROM t WHERE c3 LIKE '%substring%'
table.select(predicate=_.c3.contains('substring'))
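As with any select(), a pushed-down query returns a pyarrow.RecordBatchReader, so the filtered result can be materialized directly, for example into pandas:

reader = table.select(columns=['c1'], predicate=(_.c2 > 2) & _.c3.isnull())
df = reader.read_all().to_pandas()  # only `c1` values of the matching rows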
You can efficiently create tables from Parquet files (without copying them via the client):
import tempfile
import pyarrow.parquet as pq
from vastdb import util

# `s3` is a boto3 S3 client configured with the cluster endpoint and keys above
with tempfile.NamedTemporaryFile() as f:
    pq.write_table(arrow_table, f.name)
    s3.put_object(Bucket='bucket-name', Key='staging/file.parquet', Body=f)

with session.transaction() as tx:
    schema = tx.bucket('bucket-name').schema('schema-name')
    table = util.create_table_from_files(
        schema=schema, table_name='imported-table',
        parquet_files=['/bucket-name/staging/file.parquet'])
Import multiple files concurrently into a table (using multiple CNodes' cores):
with session.transaction() as tx:
    schema = tx.bucket('bucket-name').schema('schema-name')
    table = util.create_table_from_files(
        schema=schema, table_name='large-imported-table',
        parquet_files=[f'/bucket-name/staging/file{i}.parquet' for i in range(10)])
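For completeness, here is a sketch of staging those ten files via S3 first, assuming the `s3` client and `arrow_table` from the previous example (the file names are arbitrary):

for i in range(10):
    with tempfile.NamedTemporaryFile() as f:
        pq.write_table(arrow_table, f.name)
        s3.put_object(Bucket='bucket-name', Key=f'staging/file{i}.parquet', Body=f)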
Create, list, and drop semi-sorted projections:
p = table.create_projection('proj', sorted=['c3'], unsorted=['c1'])
print(table.projections())
print(p.get_stats())
p.drop()
You can access the VAST Database using snapshots:
with session.transaction() as tx:
    snaps = tx.bucket("bucket-name").list_snapshots()
    batches = snaps[0].schema('schema-name').table('table-name').select()
Table.select() returns a stream of PyArrow record batches, which can be directly exported into a Parquet file:
import contextlib
import pyarrow.parquet as pq

batches = table.select()
with contextlib.closing(pq.ParquetWriter('/path/to/file.parquet', batches.schema)) as writer:
    for batch in batches:
        writer.write_batch(batch)
Use DuckDB to post-process the resulting stream of PyArrow record batches:
from ibis import _
import duckdb

conn = duckdb.connect()
with session.transaction() as tx:
    table = tx.bucket("bucket-name").schema("schema-name").table("table-name")
    batches = table.select(columns=['c1'], predicate=(_.c2 > 2))
    print(conn.execute("SELECT sum(c1) FROM batches").arrow())
Note: the transaction must be active while the DuckDB query is executing and fetching results using the Python SDK.
The VAST Catalog can be queried as a regular table:
import pyarrow as pa
import vastdb

session = vastdb.connect(
    endpoint='http://vip-pool.v123-xy.VastENG.lab',
    access=AWS_ACCESS_KEY_ID,
    secret=AWS_SECRET_ACCESS_KEY)

with session.transaction() as tx:
    table = tx.catalog().select(['element_type']).read_all()
    df = table.to_pandas()

    total_elements = len(df)
    print(f"Total elements in the catalog: {total_elements}")

    file_count = (df['element_type'] == 'FILE').sum()
    print(f"Number of files/objects: {file_count}")

    distinct_elements = df['element_type'].unique()
    print("Distinct element types on the system:")
    print(distinct_elements)
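Since the catalog behaves like a regular table, predicate pushdown should apply to it as well. A hedged sketch, assuming the ibis `_` helper from above and the `element_type` column used in the previous example:

from ibis import _

with session.transaction() as tx:
    # count only FILE elements without fetching the whole column to the client
    files = tx.catalog().select(['element_type'],
                                predicate=(_.element_type == 'FILE')).read_all()
    print(f"Number of files/objects: {files.num_rows}")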
See these blog posts for more examples:
- https://vastdata.com/blog/the-vast-catalog-in-action-part-1
- https://vastdata.com/blog/the-vast-catalog-in-action-part-2
See also the full VAST DB Python SDK documentation.