Skip to content

Commit

Permalink
Merge branch 'master' of github.com:ClickHouse/ClickHouse into export…
Browse files Browse the repository at this point in the history
…-logs-in-ci-performance
  • Loading branch information
alexey-milovidov committed Aug 17, 2023
2 parents e357702 + ff04660 commit 64c8298
Show file tree
Hide file tree
Showing 206 changed files with 2,137 additions and 769 deletions.
34 changes: 33 additions & 1 deletion docker/test/fuzzer/run-fuzzer.sh
Original file line number Diff line number Diff line change
Expand Up @@ -122,6 +122,23 @@ EOL
<core_path>$PWD</core_path>
</clickhouse>
EOL

# Setup a cluster for logs export to ClickHouse Cloud
# Note: these variables are provided to the Docker run command by the Python script in tests/ci
if [ -n "${CLICKHOUSE_CI_LOGS_HOST}" ]
then
echo "
remote_servers:
system_logs_export:
shard:
replica:
secure: 1
user: ci
host: '${CLICKHOUSE_CI_LOGS_HOST}'
port: 9440
password: '${CLICKHOUSE_CI_LOGS_PASSWORD}'
" > db/config.d/system_logs_export.yaml
fi
}

function filter_exists_and_template
Expand Down Expand Up @@ -223,7 +240,22 @@ quit
done
clickhouse-client --query "select 1" # This checks that the server is responding
kill -0 $server_pid # This checks that it is our server that is started and not some other one
echo Server started and responded
echo 'Server started and responded'

# Initialize export of system logs to ClickHouse Cloud
if [ -n "${CLICKHOUSE_CI_LOGS_HOST}" ]
then
export EXTRA_COLUMNS_EXPRESSION="$PR_TO_TEST AS pull_request_number, '$SHA_TO_TEST' AS commit_sha, '$CHECK_START_TIME' AS check_start_time, '$CHECK_NAME' AS check_name, '$INSTANCE_TYPE' AS instance_type"
# TODO: Check if the password will appear in the logs.
export CONNECTION_PARAMETERS="--secure --user ci --host ${CLICKHOUSE_CI_LOGS_HOST} --password ${CLICKHOUSE_CI_LOGS_PASSWORD}"

/setup_export_logs.sh

# Unset variables after use
export CONNECTION_PARAMETERS=''
export CLICKHOUSE_CI_LOGS_HOST=''
export CLICKHOUSE_CI_LOGS_PASSWORD=''
fi

# SC2012: Use find instead of ls to better handle non-alphanumeric filenames. They are all alphanumeric.
# SC2046: Quote this to prevent word splitting. Actually I need word splitting.
Expand Down
1 change: 1 addition & 0 deletions docker/test/install/deb/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ ENV \
# install systemd packages
RUN apt-get update && \
apt-get install -y --no-install-recommends \
sudo \
systemd \
&& \
apt-get clean && \
Expand Down
3 changes: 3 additions & 0 deletions docs/_includes/install/universal.sh
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,9 @@ then
elif [ "${ARCH}" = "riscv64" ]
then
DIR="riscv64"
elif [ "${ARCH}" = "s390x" ]
then
DIR="s390x"
fi
elif [ "${OS}" = "FreeBSD" ]
then
Expand Down
2 changes: 1 addition & 1 deletion docs/en/engines/database-engines/materialized-mysql.md
Original file line number Diff line number Diff line change
Expand Up @@ -190,7 +190,7 @@ These are the schema conversion manipulations you can do with table overrides fo
* Modify [column TTL](/docs/en/engines/table-engines/mergetree-family/mergetree.md/#mergetree-column-ttl).
* Modify [column compression codec](/docs/en/sql-reference/statements/create/table.md/#codecs).
* Add [ALIAS columns](/docs/en/sql-reference/statements/create/table.md/#alias).
* Add [skipping indexes](/docs/en/engines/table-engines/mergetree-family/mergetree.md/#table_engine-mergetree-data_skipping-indexes)
* Add [skipping indexes](/docs/en/engines/table-engines/mergetree-family/mergetree.md/#table_engine-mergetree-data_skipping-indexes). Note that you need to enable `use_skip_indexes_if_final` setting to make them work (MaterializedMySQL is using `SELECT ... FINAL` by default)
* Add [projections](/docs/en/engines/table-engines/mergetree-family/mergetree.md/#projections). Note that projection optimizations are
disabled when using `SELECT ... FINAL` (which MaterializedMySQL does by default), so their utility is limited here.
`INDEX ... TYPE hypothesis` as [described in the v21.12 blog post]](https://clickhouse.com/blog/en/2021/clickhouse-v21.12-released/)
Expand Down
48 changes: 21 additions & 27 deletions docs/en/engines/table-engines/mergetree-family/annindexes.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Approximate Nearest Neighbor Search Indexes [experimental] {#table_engines-ANNIndex}
# Approximate Nearest Neighbor Search Indexes [experimental]

Nearest neighborhood search is the problem of finding the M closest points for a given point in an N-dimensional vector space. The most
straightforward approach to solve this problem is a brute force search where the distance between all points in the vector space and the
Expand All @@ -17,7 +17,7 @@ In terms of SQL, the nearest neighborhood problem can be expressed as follows:

``` sql
SELECT *
FROM table
FROM table_with_ann_index
ORDER BY Distance(vectors, Point)
LIMIT N
```
Expand All @@ -32,7 +32,7 @@ An alternative formulation of the nearest neighborhood search problem looks as f

``` sql
SELECT *
FROM table
FROM table_with_ann_index
WHERE Distance(vectors, Point) < MaxDistance
LIMIT N
```
Expand All @@ -45,12 +45,12 @@ With brute force search, both queries are expensive (linear in the number of poi
`Point` must be computed. To speed this process up, Approximate Nearest Neighbor Search Indexes (ANN indexes) store a compact representation
of the search space (using clustering, search trees, etc.) which allows to compute an approximate answer much quicker (in sub-linear time).

# Creating and Using ANN Indexes
# Creating and Using ANN Indexes {#creating_using_ann_indexes}

Syntax to create an ANN index over an [Array](../../../sql-reference/data-types/array.md) column:

```sql
CREATE TABLE table
CREATE TABLE table_with_ann_index
(
`id` Int64,
`vectors` Array(Float32),
Expand All @@ -63,7 +63,7 @@ ORDER BY id;
Syntax to create an ANN index over a [Tuple](../../../sql-reference/data-types/tuple.md) column:

```sql
CREATE TABLE table
CREATE TABLE table_with_ann_index
(
`id` Int64,
`vectors` Tuple(Float32[, Float32[, ...]]),
Expand All @@ -83,7 +83,7 @@ ANN indexes support two types of queries:

``` sql
SELECT *
FROM table
FROM table_with_ann_index
[WHERE ...]
ORDER BY Distance(vectors, Point)
LIMIT N
Expand All @@ -93,7 +93,7 @@ ANN indexes support two types of queries:

``` sql
SELECT *
FROM table
FROM table_with_ann_index
WHERE Distance(vectors, Point) < MaxDistance
LIMIT N
```
Expand All @@ -103,7 +103,7 @@ To avoid writing out large vectors, you can use [query
parameters](/docs/en/interfaces/cli.md#queries-with-parameters-cli-queries-with-parameters), e.g.

```bash
clickhouse-client --param_vec='hello' --query="SELECT * FROM table WHERE L2Distance(vectors, {vec: Array(Float32)}) < 1.0"
clickhouse-client --param_vec='hello' --query="SELECT * FROM table_with_ann_index WHERE L2Distance(vectors, {vec: Array(Float32)}) < 1.0"
```
:::

Expand Down Expand Up @@ -138,7 +138,7 @@ back to a smaller `GRANULARITY` values only in case of problems like excessive m
was specified for ANN indexes, the default value is 100 million.
# Available ANN Indexes
# Available ANN Indexes {#available_ann_indexes}
- [Annoy](/docs/en/engines/table-engines/mergetree-family/annindexes.md#annoy-annoy)
Expand All @@ -165,7 +165,7 @@ space in random linear surfaces (lines in 2D, planes in 3D etc.).
Syntax to create an Annoy index over an [Array](../../../sql-reference/data-types/array.md) column:
```sql
CREATE TABLE table
CREATE TABLE table_with_annoy_index
(
id Int64,
vectors Array(Float32),
Expand All @@ -178,7 +178,7 @@ ORDER BY id;
Syntax to create an ANN index over a [Tuple](../../../sql-reference/data-types/tuple.md) column:
```sql
CREATE TABLE table
CREATE TABLE table_with_annoy_index
(
id Int64,
vectors Tuple(Float32[, Float32[, ...]]),
Expand All @@ -188,23 +188,17 @@ ENGINE = MergeTree
ORDER BY id;
```
Annoy currently supports `L2Distance` and `cosineDistance` as distance function `Distance`. If no distance function was specified during
index creation, `L2Distance` is used as default. Parameter `NumTrees` is the number of trees which the algorithm creates (default if not
specified: 100). Higher values of `NumTree` mean more accurate search results but slower index creation / query times (approximately
linearly) as well as larger index sizes.
Annoy currently supports two distance functions:
- `L2Distance`, also called Euclidean distance, is the length of a line segment between two points in Euclidean space
([Wikipedia](https://en.wikipedia.org/wiki/Euclidean_distance)).
- `cosineDistance`, also called cosine similarity, is the cosine of the angle between two (non-zero) vectors
([Wikipedia](https://en.wikipedia.org/wiki/Cosine_similarity)).
`L2Distance` is also called Euclidean distance, the Euclidean distance between two points in Euclidean space is the length of a line segment between the two points.
For example: If we have point P(p1,p2), Q(q1,q2), their distance will be d(p,q)
![L2Distance](https://en.wikipedia.org/wiki/Euclidean_distance#/media/File:Euclidean_distance_2d.svg)
For normalized data, `L2Distance` is usually a better choice, otherwise `cosineDistance` is recommended to compensate for scale. If no
distance function was specified during index creation, `L2Distance` is used as default.
`cosineDistance` also called cosine similarity is a measure of similarity between two non-zero vectors defined in an inner product space. Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths.
![cosineDistance](https://www.tyrrell4innovation.ca/wp-content/uploads/2021/06/rsz_jenny_du_miword.png)
The Euclidean distance corresponds to the L2-norm of a difference between vectors. The cosine similarity is proportional to the dot product of two vectors and inversely proportional to the product of their magnitudes.
![compare](https://www.researchgate.net/publication/320914786/figure/fig2/AS:558221849841664@1510101868614/The-difference-between-Euclidean-distance-and-cosine-similarity.png)
In one sentence: cosine similarity care only about the angle between them, but do not care about the "distance" we normally think.
![L2 distance](https://www.baeldung.com/wp-content/uploads/sites/4/2020/06/4-1.png)
![cosineDistance](https://www.baeldung.com/wp-content/uploads/sites/4/2020/06/5.png)
Parameter `NumTrees` is the number of trees which the algorithm creates (default if not specified: 100). Higher values of `NumTree` mean
more accurate search results but slower index creation / query times (approximately linearly) as well as larger index sizes.
:::note
Indexes over columns of type `Array` will generally work faster than indexes on `Tuple` columns. All arrays **must** have same length. Use
Expand Down
Loading

0 comments on commit 64c8298

Please sign in to comment.