CASSGO-11 Feature Request: Support Vector Type #1734

Open
7flash opened this issue Jan 12, 2024 · 15 comments · May be fixed by #1828

Comments

@7flash

7flash commented Jan 12, 2024

Cassandra has introduced the vector type, and the Python driver already supports it. I hope it can be added to gocql as well, which would resolve this issue: datastax/gocql-astra#17 (comment)

@jfleming-ic
Contributor

We're also keen to see vector support in gocql, and I think our customers would love to see it as well. Any news on this feature?

@nkev

nkev commented Feb 28, 2024

Are there plans to implement the many new features in Cassandra 5?

@martin-sucha
Contributor

@nkev Personally, I don't plan to work on Cassandra 5 support. If anyone else wants to, feel free. See a more detailed response in the mailing list.

@nkev

nkev commented Feb 29, 2024

@martin-sucha Thanks for the update. Let's hope a gopher (or a few) with a deep understanding of C* puts their hand up.

@tengu-alt
Contributor

Hello! I will try to handle it.

@tengu-alt
Contributor

tengu-alt commented May 22, 2024

> Hello! I will try to handle it.

While implementing vector type support I found several issues:

  • Cassandra implements the vector type as a custom type rather than a native collection type.
  • Serialization of SELECT results works differently because of restrictions on vector element length that the official Cassandra documentation does not mention. These restrictions also cause errors when I select values whose length exceeds what Cassandra allows. I also reproduced this via cqlsh:
    create table example.vectors(id text, words vector <text, 3 >, PRIMARY KEY(id )) ;
    INSERT INTO vectors (id, words ) VALUES ('id', ['AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB','2','1']);
    cqlsh:example> SELECT * FROM vectors ;
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/protocol.py", line 767, in recv_results_rows
    self.parsed_rows = [decode_row(row) for row in rows]
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/protocol.py", line 767, in <listcomp>
    self.parsed_rows = [decode_row(row) for row in rows]
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/protocol.py", line 764, in decode_row
    return tuple(decode_val(val, col_md, col_desc) for val, col_md, col_desc in zip(row, column_metadata, col_descs))
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/protocol.py", line 764, in <genexpr>
    return tuple(decode_val(val, col_md, col_desc) for val, col_md, col_desc in zip(row, column_metadata, col_descs))
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/protocol.py", line 761, in decode_val
    return col_type.from_binary(raw_bytes, protocol_version)
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/cqltypes.py", line 315, in from_binary
    return cls.deserialize(byts, protocol_version)
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/cqltypes.py", line 1445, in deserialize
    return [cls.subtype.deserialize(byts[idx:idx + 4], protocol_version) for idx in indexes]
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/cqltypes.py", line 1445, in <listcomp>
    return [cls.subtype.deserialize(byts[idx:idx + 4], protocol_version) for idx in indexes]
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/cqltypes.py", line 769, in deserialize
    return byts.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/protocol.py", line 772, in recv_results_rows
    decode_val(val, col_md, col_desc)
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/protocol.py", line 761, in decode_val
    return col_type.from_binary(raw_bytes, protocol_version)
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/cqltypes.py", line 315, in from_binary
    return cls.deserialize(byts, protocol_version)
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/cqltypes.py", line 1445, in deserialize
    return [cls.subtype.deserialize(byts[idx:idx + 4], protocol_version) for idx in indexes]
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/cqltypes.py", line 1445, in <listcomp>
    return [cls.subtype.deserialize(byts[idx:idx + 4], protocol_version) for idx in indexes]
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/cqltypes.py", line 769, in deserialize
    return byts.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/cassandra/bin/../pylib/cqlshlib/cqlshmain.py", line 990, in perform_simple_statement
    result = future.result()
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/cluster.py", line 4920, in result
    raise self._final_exception
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/connection.py", line 1229, in process_msg
    response = decoder(header.version, self.user_type_map, stream_id,
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/protocol.py", line 1208, in decode_message
    msg = msg_class.recv_body(body, protocol_version, user_type_map, result_metadata, cls.column_encryption_policy)
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/protocol.py", line 745, in recv_body
    msg.recv(f, protocol_version, user_type_map, result_metadata, column_encryption_policy)
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/protocol.py", line 731, in recv
    self.recv_results_rows(f, protocol_version, user_type_map, result_metadata, column_encryption_policy)
  File "/opt/cassandra/bin/../lib/cassandra-driver-internal-only-3.28.0.zip/cassandra-driver-3.28.0/cassandra/protocol.py", line 774, in recv_results_rows
    raise DriverException('Failed decoding result column "%s" of type %s: %s' % (col_md[2],
cassandra.DriverException: Failed decoding result column "words" of type org.apache.cassandra.db.marshal.VectorType<text, 3>: 'utf-8' codec can't decode byte 0x81 in position 0: invalid start byte
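
The traceback above shows the driver slicing the vector value into fixed 4-byte chunks (`byts[idx:idx + 4]`) even though the elements are `text`. One possible reading, which is an assumption and not something this thread confirms, is that variable-length vector elements are each prefixed with an unsigned variable-length integer (vint) holding the element size. Under that assumption, a leading `0x81` would be the first byte of a two-byte length of 256 or more, not the start of a UTF-8 string. The sketch below decodes such a prefix; `decodeUnsignedVInt` is a hypothetical helper name and is not part of gocql.

```go
package main

import (
	"errors"
	"math/bits"
)

// decodeUnsignedVInt reads one unsigned vint from the start of data.
// Assumed layout: the number of leading 1 bits in the first byte gives the
// number of extra bytes; the remaining bits of the first byte are the most
// significant bits of the value, followed by the extra bytes in big-endian
// order. It returns the value and the number of bytes consumed.
func decodeUnsignedVInt(data []byte) (value uint64, n int, err error) {
	if len(data) == 0 {
		return 0, 0, errors.New("vint: empty input")
	}
	first := data[0]
	// Counting leading zeros of the complement counts leading ones.
	extra := bits.LeadingZeros8(^first)
	if len(data) < 1+extra {
		return 0, 0, errors.New("vint: truncated input")
	}
	// Mask off the length-marker bits of the first byte.
	value = uint64(first & (byte(0xFF) >> uint(extra)))
	for _, b := range data[1 : 1+extra] {
		value = value<<8 | uint64(b)
	}
	return value, 1 + extra, nil
}
```

With this layout, the bytes `0x81 0x2C` decode to a length of 300 using two bytes, while a single byte `0x05` decodes to 5.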

@martin-sucha
Contributor

> Cassandra implements the vector type as a custom type rather than a native collection type.

Could this be because gocql uses protocol v4, which does not have native support for the vector type, while protocol v5 does?

@martin-sucha
Contributor

Please open a Cassandra issue about the length issue.

@tengu-alt
Contributor

> > Cassandra implements the vector type as a custom type rather than a native collection type.
>
> Could this be because gocql uses protocol v4, which does not have native support for the vector type, while protocol v5 does?

Exactly!
I will put this on hold until protocol v5 support is available.

@rcosnita

rcosnita commented Sep 3, 2024

Hello everyone,

For people who are using version 4 of the protocol but need to write vectors from their code base, here is a possible idea:

package warm

import (
	"fmt"
	"math"

	"github.com/gocql/gocql"
)

// encInt encodes v as 4 bytes in big-endian order.
func encInt(v int32) []byte {
	return []byte{byte(v >> 24), byte(v >> 16), byte(v >> 8), byte(v)}
}

// encFloat encodes v as its IEEE 754 bit pattern in big-endian order.
func encFloat(v float32) []byte {
	return encInt(int32(math.Float32bits(v)))
}

// Float32Vector wraps a float32 slice with a fixed number of dimensions
// so that it can be bound as a query value.
type Float32Vector struct {
	value      []float32
	dimensions int
}

// MarshalCQL concatenates the encoded elements without any length prefix.
func (m *Float32Vector) MarshalCQL(info gocql.TypeInfo) ([]byte, error) {
	if len(m.value) != m.dimensions {
		return nil, fmt.Errorf("float32vector expects size %d but received size %d",
			m.dimensions, len(m.value))
	}

	results := make([]byte, 0, 4*len(m.value))
	for _, part := range m.value {
		results = append(results, encFloat(part)...)
	}

	return results, nil
}

// UnmarshalCQL is intentionally left unimplemented: this type only supports writes.
func (m *Float32Vector) UnmarshalCQL(info gocql.TypeInfo, data []byte) error {
	panic("unmarshalling vector is not fully implemented")
}

// Value returns the underlying elements.
func (m *Float32Vector) Value() []float32 {
	return m.value
}

// ensureVectorDimension returns values unchanged when it already has the
// requested length. Otherwise it returns a new slice of exactly that length,
// truncating extra elements or padding the tail with zeros.
func ensureVectorDimension(values []float32, dimensions int) []float32 {
	if len(values) == dimensions {
		return values
	}

	// make zero-fills the slice, and copy stops at the shorter length.
	result := make([]float32, dimensions)
	copy(result, values)

	return result
}

// FromVectorFloat32 builds a Float32Vector, padding or truncating value to dimensions.
func FromVectorFloat32(value []float32, dimensions int) *Float32Vector {
	return &Float32Vector{
		value:      ensureVectorDimension(value, dimensions),
		dimensions: dimensions,
	}
}

You can opt to skip the padding completely and return an error if that feels more natural for your use case. The encoding functions are extracted from the existing gocql codebase.
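
The workaround above leaves reads unimplemented. As a rough sketch of the reverse direction, assuming the server returns a float vector in the same layout the marshaller writes (4 big-endian bytes per element, no length prefixes), decoding could look like the following. `decodeFloat32Vector` is a hypothetical helper name, not a gocql function, and it has not been checked against a live cluster.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// decodeFloat32Vector interprets data as exactly `dimensions` big-endian
// IEEE 754 float32 values laid out back to back. This mirrors the encoding
// used by the marshalling workaround; the wire format for reads is an
// assumption here.
func decodeFloat32Vector(data []byte, dimensions int) ([]float32, error) {
	if len(data) != 4*dimensions {
		return nil, fmt.Errorf("float32 vector of %d dimensions needs %d bytes, got %d",
			dimensions, 4*dimensions, len(data))
	}
	out := make([]float32, dimensions)
	for i := range out {
		// Each element occupies 4 consecutive bytes.
		raw := binary.BigEndian.Uint32(data[4*i:])
		out[i] = math.Float32frombits(raw)
	}
	return out, nil
}
```

A helper like this could back an `UnmarshalCQL` implementation in place of the panic, provided the dimension count is known to the receiving value.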

@lukasz-antoniak
Member

@tengu-alt, did you make progress on vector support? I cannot find a relevant open PR. Do you mind if I submit my nearly completed PoC? Based on your comment from 22 May, I think you are encoding the length of vector elements incorrectly.

@tengu-alt
Contributor

> @tengu-alt, did you make progress on vector support? I cannot find a relevant open PR. Do you mind if I submit my nearly completed PoC? Based on your comment from 22 May, I think you are encoding the length of vector elements incorrectly.

I am currently implementing vector type support. I think the error I mentioned is caused by cqlsh behaving incorrectly.

@joao-r-reis
Contributor

> > @tengu-alt, did you make progress on vector support? I cannot find a relevant open PR. Do you mind if I submit my nearly completed PoC? Based on your comment from 22 May, I think you are encoding the length of vector elements incorrectly.
>
> I am currently implementing vector type support. I think the error I mentioned is caused by cqlsh behaving incorrectly.

It looks like @lukasz-antoniak already has a functioning prototype. Depending on how far along your own work is, it might be more efficient to have him open a PR with his.

@tengu-alt
Contributor

> > > @tengu-alt, did you make progress on vector support? I cannot find a relevant open PR. Do you mind if I submit my nearly completed PoC? Based on your comment from 22 May, I think you are encoding the length of vector elements incorrectly.
> >
> > I am currently implementing vector type support. I think the error I mentioned is caused by cqlsh behaving incorrectly.
>
> It looks like @lukasz-antoniak already has a functioning prototype. Depending on how far along your own work is, it might be more efficient to have him open a PR with his.

Sounds great! I would be glad to see @lukasz-antoniak's PR. So far I have implemented vector unmarshalling for nearly all data types supported by the driver (vector data uses a different serialization). Marshalling and test coverage remain to be done.
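
To illustrate what "a different serialization" could mean for variable-length element types such as `text`, here is a minimal sketch. It assumes each element is written as an unsigned vint length followed by the raw bytes, which this thread does not confirm. The names `encodeUnsignedVInt` and `encodeTextVector` are hypothetical and do not come from gocql or from PR #1828.

```go
package main

// encodeUnsignedVInt writes v using the assumed vint layout: the count of
// leading 1 bits in the first byte equals the number of extra bytes, and
// the value is stored big-endian across the remaining bits.
func encodeUnsignedVInt(v uint64) []byte {
	// Each total size of n bytes (n <= 8) carries 7*n value bits;
	// a 9-byte encoding carries the full 64 bits.
	size := 1
	for size < 9 && v >= 1<<(7*uint(size)) {
		size++
	}
	extra := size - 1
	out := make([]byte, size)
	// Fill the trailing bytes from least to most significant.
	for i := size - 1; i > 0; i-- {
		out[i] = byte(v)
		v >>= 8
	}
	// The first byte holds the remaining high bits plus the length marker.
	out[0] = byte(v) | ^(byte(0xFF) >> uint(extra))
	return out
}

// encodeTextVector serializes each element as vint(length) followed by the
// element's bytes. The caller is responsible for passing exactly as many
// elements as the vector column declares.
func encodeTextVector(elems []string) []byte {
	var out []byte
	for _, e := range elems {
		out = append(out, encodeUnsignedVInt(uint64(len(e)))...)
		out = append(out, e...)
	}
	return out
}
```

Under this layout a 300-byte element is preceded by the two bytes `0x81 0x2C`, while short elements such as `"2"` get a single length byte of `0x01`.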

@lukasz-antoniak lukasz-antoniak linked a pull request Oct 10, 2024 that will close this issue
@lukasz-antoniak
Member

I have opened #1828. Will continue with more unit and integration tests.

@joao-r-reis joao-r-reis changed the title Feature Request: Support Vector Type CASSGO-11 Feature Request: Support Vector Type Oct 30, 2024