Shard Query Manual
Shard-Query is an MPP (massively parallel processing) query engine implemented with PHP and Gearman. It is designed to run queries over partitioned or sharded data sets. Shard-Query is targeted mainly at big data problems and OLAP queries in general. It is especially effective at querying very large tables, such as log tables, and it works very well with star schemas.
Please use the latest MySQL 5.6 release for best results. If you use an earlier version, please use only LIST (COLUMNS) or RANGE (COLUMNS) partitioning, as HASH and KEY partitioning are not supported before 5.6.
The proxy normally listens on 4040. The web interface, if set up, listens on port 80 and is normally available at http://hostname/shard-query
If you are setting up a new grid (group of machines), do not create any tables until all machines (nodes) are set up.
When using the Amazon marketplace AMI, the first step to configuring Shard-Query is connecting to the server with your web browser, which will launch a configuration wizard.
The proxy will listen on 4040 but will not accept any queries until the initial setup is completed. The AMI includes a GUI listening on port 80 that you should use for setup. If you are using the Amazon Marketplace AMI, only a single node is supported. You must partition your tables for improved performance.
The Amazon AMI comes with the proxy pre-installed. When connecting to the proxy, use the username/password that you registered in the web setup.
A GRID, or "Group of Redundant Independent Databases", is a collection of nodes that represents one large "virtual schema". Each node in the grid is uniquely identified by a hostname, a port and a schema name. That schema name refers to the physical schema where data will be stored, and should not be confused with the virtual schema.
Shard-Query organizes multiple physical schemata into one "virtual schema". You choose a name for the virtual schema when setting up Shard-Query. Shard-Query supports multiple different virtual schemata, each with a unique sharding setup. Keep in mind that it is not normally possible to join between two different virtual schemata unless a FEDERATED or CONNECT engine table pointing at the proxy is created.
Shard-Query requires that one node be designated for the configuration repository. This may be one of the nodes used for storage (see storage nodes below) or it can be a dedicated node. If HA is desired for Shard-Query, then this node, like all nodes, should be made redundant using available technology such as MySQL replication, MHA, Galera replication, DRBD/corosync/pacemaker, etc.
The configuration repository is a set of MySQL tables that hold the configuration and set-up information for Shard-Query. The repository is the first component to be configured when setting up Shard-Query.
Shard-Query stores data in storage nodes. All storage nodes must contain identical schema in order for Shard-Query to work properly. Storage nodes contain both sharded and unsharded tables. Tables which contain the shard key are sharded, and the data in those tables is distributed around the grid. Tables that do not contain the shard key are duplicated on all nodes.
Shard-Query behaves like READ-COMMITTED with respect to each query. Multi-statement transactions are not supported. XA transactions are not supported either. Shard-Query behaves as if auto-commit is in effect.
Shard-Query works best with data marts utilizing a star schema. Shard-Query also works well with schema that feature large tables, such as machine generated logs or sensor data. The thing to keep in mind when creating a schema for Shard-Query is the shard column, or "shard key". This is the column in your data set that is used to divide the data over individual shards. The shard key must be present in every table you want to shard.
MySQL and many other databases tend to have scalability challenges as the size of the data set increases. There are two major reasons for this. First, indexes tend to get very big and very slow as the database grows; second, each query is usually executed by only a single thread.
Sharding addresses these issues in two ways. First and foremost, sharding splits the largest table or tables into smaller tables on many nodes. There is normally one column in the data set that the data is split on, such as customer_id or a date field. In any case, an even data distribution is desired, because sharded data sets can be operated on in parallel.
Both sharding and partitioning are very important for Shard-Query. In many cases Shard-Query can add parallelism to queries over partitioned tables, in addition to the parallelism it gains across shards.
Like sharding, partitioning divides data into smaller "buckets" called partitions. Shard-Query can access partitions in parallel, increasing the performance of queries. This parallel query ability makes Shard-Query extremely powerful, but you must partition your tables in order to benefit from it. You can use HASH or KEY partitioning to simplify this. For best performance **it is important to partition your tables**, which allows Shard-Query to access each table in parallel.
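As a sketch (the table and column names here are hypothetical), a large log table could be partitioned with KEY partitioning so that Shard-Query can scan its partitions in parallel:

```sql
-- Hypothetical example: a log table split into 16 partitions by KEY.
-- Shard-Query can then issue one query per partition in parallel.
CREATE TABLE log_events (
  event_id    BIGINT NOT NULL,
  customer_id INT NOT NULL,
  event_time  DATETIME NOT NULL,
  message     TEXT,
  PRIMARY KEY (event_id)
) ENGINE=InnoDB
PARTITION BY KEY (event_id) PARTITIONS 16;
```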
When setting up Shard-Query, a "shard key" is selected. When this column is present in a table, the table is "sharded": the placement of each row depends on the value of the "shard key" column in that row, and no single server holds an entire copy of a sharded table. Because the "shard key" determines on which server a row is stored, rows in different tables that share the same "shard key" value are stored on the same server. This means that you can join sharded tables together, as long as you include the shard key in the join.
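For illustration (the tables here are hypothetical), if customer_id is the shard key, two sharded tables can be joined because matching rows live on the same shard, as long as the join includes the shard key:

```sql
-- orders and order_items are both sharded on customer_id, so each shard
-- holds matching rows and can execute this join locally.
SELECT o.order_id, SUM(i.price) AS order_total
FROM orders o
JOIN order_items i
  ON i.customer_id = o.customer_id   -- the join includes the shard key
 AND i.order_id    = o.order_id
GROUP BY o.order_id;
```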
Shard-Query supports un-sharded tables. Un-sharded tables are tables which do not contain the "shard key". Insertions into un-sharded tables go to all shards. When a query does not reference any sharded tables, it is sent to only a single shard so that results are not duplicated.
Once you pick a shard column/shard key you must also choose a mapper. Shard-Query includes two shard "mappers", described below. The directory mapper is the recommended mapper!
Hash based sharding features very fast row to shard lookups (this is called mapping) because no network round trip is required to calculate the lookup. The downsides are significant though, because the number of shards is fixed. This can easily lead to data distribution issues unless you are sure that the shard key will evenly distribute at all times.
Directory based sharding uses a database table on a directory database server to map rows to shards. This can make the system scale more easily, as it allows rows to be migrated between shards. In addition, the number of shards is not fixed and can be easily grown. Directory mapping can be a problem if the directory must be very large. Directory mapping is slower than hash mapping because a round trip and database lookups must be performed. When examining more than trivial amounts of data, the overhead is negligible.
Shard-Query uses one of the above mappers to determine to which shards a query should be sent, based on the WHERE clause used in the query. When the WHERE clause includes a condition on the shard column, it may be possible to reduce the number of shards to which the query is sent.
Shard-Query automatically rewrites queries so that they return proper results when the query is executed over multiple shards. For example, when the COUNT aggregate function is used, it is executed as the SUM of the COUNT from each of the queried shards. Shard-Query elects a 'coordinator node' for each SELECT query, which is used for aggregating the results.
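The COUNT rewrite can be sketched as follows (a simplified illustration, not Shard-Query's exact internal SQL; fact is a hypothetical table):

```sql
-- Query as submitted to Shard-Query:
SELECT COUNT(*) FROM fact;

-- Conceptually, each shard executes a partial aggregate:
SELECT COUNT(*) AS cnt FROM fact;

-- and the coordinator node combines the partial results:
SELECT SUM(cnt) FROM gathered_results;
```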
For INSERT statements, Shard-Query will examine each row and send it to the appropriate shard.
UPDATE statements, DELETE statements and DDL are sent to all nodes. SHOW statements are only currently supported through the proxy interface.
When loading data or inserting rows, Shard-Query examines the data being inserted and sends the row to the appropriate shard. The loader is massively parallel and will load delimited files in chunks in parallel.
When setting up Shard-Query you have the option to specify a shared_path. If you want to load in parallel, set shared_path to the path of the shared filesystem (CIFS, NFS, etc.) where you will place the files to be loaded.
You use the MySQL "LOAD DATA INFILE" command to load data into Shard-Query. By default (that is, when the LOCAL keyword is not used), the load is performed in multiple background jobs. There are two phases to loading a file. The first phase, or "scanning phase", can take some time, especially if the file is very large or if it is remote (such as on S3). The scanning phase examines portions of the input file to determine where each input "chunk" to the loader should end. This is necessary because the loader loads the file in multiple chunks, and chunks must end on an end-of-line boundary.
After the file is scanned and the background jobs are created, the jobs are executed by Gearman workers in the background. There is a program in bin/ called update_job_table which can be executed to see how far along the load is. It accepts two options: --verbose prints out extra information, and --wait makes the tool run in a loop, displaying the status on each iteration until all jobs are completed.
Important: Do not specify a path to the file! The shared_path that has been configured will be searched for the file.
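For example (the file and table names are hypothetical), note that only the bare filename is given, since the configured shared_path is searched for it:

```sql
-- No directory component: the file is located under shared_path
-- and loaded in parallel chunks by background Gearman jobs.
LOAD DATA INFILE 'fact_20240101.txt'
INTO TABLE fact
FIELDS TERMINATED BY ',';
```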
Use the LOCAL keyword, i.e. LOAD DATA LOCAL INFILE '/path/to/file'. When you use the LOCAL keyword, a single-threaded load is performed and the file is opened locally instead of searching the shared_path. Another option is to place the files in S3 and load them from there.
Loading from S3 is now supported as well. You must set the aws_access_key and aws_secret_key when setting up Shard-Query. Loading from S3 is the most convenient loading mechanism, particularly if you are running in the Amazon cloud. To load from S3, you must specify "s3://bucket/file" as the path when loading the file.
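For example (the bucket and table names are hypothetical):

```sql
-- Requires aws_access_key and aws_secret_key to have been configured.
LOAD DATA INFILE 's3://my-bucket/fact_20240101.txt'
INTO TABLE fact
FIELDS TERMINATED BY ',';
```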
Both INSERT ... SELECT and CREATE TABLE ... AS SELECT (CTAS) are supported by Shard-Query.
When using CTAS, do not add columns that do not exist in the result set, and make sure all columns are aliased properly.
There are three ways to change the Shard-Query configuration.
- Place configuration options in a .ini file and use setup_virtual_schema.php - note that you can do this with an existing schema and the configuration database will be updated appropriately
- Update the schemata_config table in the Shard-Query configuration repo/primary node
- Use the GUI
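A minimal sketch of a .ini file follows. The option names shown here are illustrative assumptions; consult bootstrap.ini.example in the distribution for the authoritative option names.

```ini
; Illustrative sketch only -- see bootstrap.ini.example for real option names.
[default]
schema_name=my_virtual_schema   ; the name of the virtual schema
column=customer_id              ; the shard column / shard key
column_datatype=integer         ; integer or string
mapper=directory                ; the directory mapper is recommended
inlist=*                        ; parallelize IN lists on all columns
```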
- The name of the virtual schema
- The shard column/shard key
- The type of the column being sharded on: integer or string
The inlist option allows you to add parallelism to IN lists. This can be enabled for all columns (inlist=*), or just a certain column (inlist=certain_column). For example, assuming that inlist=certain_column, then the query "SELECT count(*) from some_table where certain_column IN (1,2,3)" will be executed in parallel as three separate queries "SELECT count(*) from some_table where certain_column = 1;", "SELECT count(*) from some_table where certain_column = 2;" and so on.
The between option works similarly to the inlist option. It will convert BETWEEN to IN where possible, and execute in parallel.
If an IN list with more than inlist_merge_threshold items is encountered, it is broken up into multiple IN lists of at most inlist_merge_size items each.
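As an illustration (hypothetical table, assuming inlist_merge_threshold=4 and inlist_merge_size=2):

```sql
-- Original query: the IN list exceeds inlist_merge_threshold.
SELECT COUNT(*) FROM fact WHERE c IN (1,2,3,4,5,6);

-- Conceptually split into IN lists of at most inlist_merge_size items,
-- executed in parallel and combined by the coordinator:
SELECT COUNT(*) FROM fact WHERE c IN (1,2);
SELECT COUNT(*) FROM fact WHERE c IN (3,4);
SELECT COUNT(*) FROM fact WHERE c IN (5,6);
```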
Shard-Query includes a MySQL Proxy-compatible Lua script for transparent access to Shard-Query. You must install the Gearman UDF into the server against which you run the proxy (usually the primary node). You must set up the Gearman UDF with gearman_set_servers() before connecting to the proxy. Simply specify the Lua script as an option to MySQL Proxy.
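A sketch of the UDF setup step (the function name is taken from the text above; the host and port are illustrative and should match your gearmand instance):

```sql
-- Run once against the server the proxy points at, before connecting
-- through the proxy.
SELECT gearman_set_servers('127.0.0.1:7001');
```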
If you are using a pre-installed AWS AMI, a GUI will be available after initial node setup. If you installed Shard-Query manually, the installation instructions (see INSTALL in the source) include information about installing the GUI. Once you put the appropriate files in place (including the .conf file), you can access Shard-Query at http://hostname/shard-query/
Look at bin/run_query to see how to directly integrate Shard-Query in your own app. If you simply want to execute queries, consider using the proxy.
provision.sh contains complete setup steps from a base CentOS 6.4 system with the "development tools" installed.
- At least one server for data storage
- A database schema for storing the Shard-Query configuration
- MySQL 5.1+ (5.6 STRONGLY suggested!)
- Apache
- PHP, PEAR
- gearmand (get it from epel, or compile from gearman.org)
- php-mbstring, php-mysql
1. Install gearmand
2. Install MySQL on each storage node (one or more databases)
3. Install apache or apache2 php php-mysql php-pear
4. Download and extract swanhart-tools. Move shard-query to /usr/share/ (so you have /usr/share/shard-query/include, etc..)
5. Install the config repo
* cd /usr/share/shard_query
* php install_config_repo.php
> Make sure that the account you use to set up the repo was created WITH GRANT OPTION.
> That is really important. The same goes for the account used to set up the virtual schema: it must be able to
> delegate grants to the shard_query user (or whatever user you select in the setup).
6. copy bootstrap.ini.example to bootstrap.ini and edit it appropriately
7. php setup_virtual_schema.php --ini=bootstrap.ini
8. start gearman:
* gearmand -d -p 7001 -L 127.0.0.1
9. test a worker:
* cd bin/
* ./worker
> If no errors appear within 10 to 15 seconds, hit ctrl-c.
10. start workers
* ./start_workers 6 (where 6 = number of workers you want to start)