CLI toolbelt for Datashare.
        /      \
     \  \  ,,  /  /
      '-.`\()/`.-'
     .--_'(  )'_--.
    / /` /`""`\ `\ \
     |  |  ><  |  |
     \  \      /  /
         '.__.'
Usage: tarentula [OPTIONS] COMMAND [ARGS]...
Options:
--syslog-address TEXT Syslog address (default: localhost)
--syslog-port INTEGER Syslog port (default: 514)
--syslog-facility TEXT Syslog facility (default: local7)
--stdout-loglevel TEXT Change the default log level for the stdout error handler (default: ERROR)
--help Show this message and exit
--version Show the installed version of Tarentula
Commands:
aggregate
count
clean-tags-by-query
download
export-by-query
list-metadata
tagging
tagging-by-query
You can install Datashare Tarentula with your favorite package manager:
pip3 install --user tarentula
Or alternatively with Docker:
docker run icij/datashare-tarentula
Datashare Tarentula comes with basic commands to interact with a Datashare instance (running locally or on a remote server). Primarily focused on bulk actions, it provides both a CLI and a Python API.
To learn more about how to use Datashare Tarentula with a list of examples, please refer to the Cookbook.
A command to just count the number of files matching a query.
Usage: tarentula count [OPTIONS]
Options:
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT You can additionally pass the Elasticsearch
URL in order to use scrolling capabilities of
Elasticsearch (useful when dealing with a
lot of results)
--query TEXT The query string to filter documents
--cookies TEXT Key/value pair to add a cookie to each
request to the API. You can separate them
with semicolons: key1=val1;key2=val2;...
--apikey TEXT Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--type [Document|NamedEntity] Type of indexed documents to count
--help Show this message and exit
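The --cookies value is a single string of key1=val1;key2=val2 pairs. As a minimal sketch (the helper name below is illustrative, not part of Tarentula), such a string can be built from a Python dict:

```python
# Build the semicolon-separated cookie string expected by --cookies
# (key1=val1;key2=val2;...). Helper name is illustrative, not Tarentula API.
def format_cookies(cookies: dict) -> str:
    return ";".join(f"{key}={value}" for key, value in cookies.items())

cookie_string = format_cookies({"session_id": "abc123", "XSRF-TOKEN": "xyz"})
print(cookie_string)  # session_id=abc123;XSRF-TOKEN=xyz
```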
A command that uses the Elasticsearch update-by-query
feature to batch untag documents directly in the index.
Usage: tarentula clean-tags-by-query [OPTIONS]
Options:
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT Elasticsearch URL which is used to perform
update by query
--cookies TEXT Key/value pair to add a cookie to each
request to the API. You can
separatesemicolons: key1=val1;key2=val2;...
--apikey TEXT Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--wait-for-completion / --no-wait-for-completion
Create an Elasticsearch task to perform the
update asynchronously
--query TEXT Give a JSON query to filter documents that
will have their tags cleaned. It can be
a file with @path/to/file. Defaults to all.
--help Show this message and exit
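The --query option of this command takes a JSON query, inline or from a file via @path/to/file. A sketch of writing such a file; the query body is a standard Elasticsearch match query, and the field name is an assumption for the example:

```python
import json

# Standard Elasticsearch query body; the filtered field (contentType)
# is an example, not something Tarentula mandates.
query = {"query": {"match": {"contentType": "message/rfc822"}}}

with open("query.json", "w") as f:
    json.dump(query, f)

# Then: tarentula clean-tags-by-query --query @query.json
```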
A command to download all files matching a query.
Usage: tarentula download [OPTIONS]
Options:
--apikey TEXT Datashare authentication apikey
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT You can additionally pass the Elasticsearch
URL in order to use scrolling capabilities of
Elasticsearch (useful when dealing with a
lot of results)
--query TEXT The query string to filter documents
--destination-directory TEXT Directory where documents will be downloaded
--throttle INTEGER Request throttling (in ms)
--cookies TEXT Key/value pair to add a cookie to each
request to the API. You can separate them
with semicolons: key1=val1;key2=val2;...
--path-format TEXT Downloaded document path template
--scroll TEXT Scroll duration
--source TEXT A comma-separated list of fields to include
in the downloaded document from the index
-f, --from INTEGER Passed to the search; it will skip the
first n documents
-l, --limit INTEGER Limit the total results to return
--sort-by TEXT Field to use to sort results
--order-by [asc|desc] Order to use to sort results
--once / --not-once Download file only once
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar
Display a progressbar
--raw-file / --no-raw-file Download raw file from Datashare
--type [Document|NamedEntity] Type of indexed documents to download
--help Show this message and exit.
A command to export all files matching a query.
Usage: tarentula export-by-query [OPTIONS]
Options:
--apikey TEXT Datashare authentication apikey
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT You can additionally pass the Elasticsearch
URL in order to use scrolling capabilities of
Elasticsearch (useful when dealing with a
lot of results)
--query TEXT The query string to filter documents
--output-file TEXT Path to the CSV file
--throttle INTEGER Request throttling (in ms)
--cookies TEXT Key/value pair to add a cookie to each
request to the API. You can separate them
with semicolons: key1=val1;key2=val2;...
--scroll TEXT Scroll duration
--source TEXT A comma-separated list of fields to include
in the export
--sort-by TEXT Field to use to sort results
--order-by [asc|desc] Order to use to sort results
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar
Display a progressbar
--type [Document|NamedEntity] Type of indexed documents to export
-f, --from INTEGER Passed to the search; it will skip the
first n documents
-l, --limit INTEGER Limit the total results to return
--size INTEGER Size of the scroll request that powers the
operation.
--query-field / --no-query-field
Add the query to the export CSV
--help Show this message and exit.
A command to batch tag documents with a CSV file.
Usage: tarentula tagging [OPTIONS] CSV_PATH
Options:
--datashare-url TEXT Datashare URL (default: http://localhost:8080)
--datashare-project TEXT Datashare project (default: local-datashare)
--throttle INTEGER Request throttling in ms (default: 0)
--cookies TEXT Key/value pair to add a cookie to each request to the API. You can separate them with semicolons: key1=val1;key2=val2;... (default: empty string)
--apikey TEXT Datashare authentication apikey (default: none)
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar Display a progressbar
--help Show this message and exit
Tagging with a documentId and routing:
tag,documentId,routing
Actinopodidae,l7VnZZEzg2fr960NWWEG,l7VnZZEzg2fr960NWWEG
Antrodiaetidae,DWLOskax28jPQ2CjFrCo
Atracidae,6VE7cVlWszkUd94XeuSd,vZJQpKQYhcI577gJR0aN
Atypidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi
Barychelidae,DbhveTJEwQfJL5Gn3Zgi,DbhveTJEwQfJL5Gn3Zgi
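A CSV like the one above can be produced with Python's standard csv module. A minimal sketch using the sample rows (when a document has no routing, the column can simply be left empty):

```python
import csv

# Sample (tag, documentId, routing) rows from the table above.
rows = [
    ("Actinopodidae", "l7VnZZEzg2fr960NWWEG", "l7VnZZEzg2fr960NWWEG"),
    ("Antrodiaetidae", "DWLOskax28jPQ2CjFrCo", ""),  # no routing
]

with open("tags.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["tag", "documentId", "routing"])  # header row
    writer.writerows(rows)

# Then: tarentula tagging tags.csv
```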
Tagging with a documentUrl:
tag,documentUrl
Mecicobothriidae,http://localhost:8080/#/d/local-datashare/DbhveTJEwQfJL5Gn3Zgi/DbhveTJEwQfJL5Gn3Zgi
Microstigmatidae,http://localhost:8080/#/d/local-datashare/iuL6GUBpO7nKyfSSFaS0/iuL6GUBpO7nKyfSSFaS0
Migidae,http://localhost:8080/#/d/local-datashare/BmovvXBisWtyyx6o9cuG/BmovvXBisWtyyx6o9cuG
Nemesiidae,http://localhost:8080/#/d/local-datashare/vZJQpKQYhcI577gJR0aN/vZJQpKQYhcI577gJR0aN
Paratropididae,http://localhost:8080/#/d/local-datashare/vYl1C4bsWphUKvXEBDhM/vYl1C4bsWphUKvXEBDhM
Porrhothelidae,http://localhost:8080/#/d/local-datashare/fgCt6JLfHSl160fnsjRp/fgCt6JLfHSl160fnsjRp
Theraphosidae,http://localhost:8080/#/d/local-datashare/WvwVvNjEDQJXkwHISQIu/WvwVvNjEDQJXkwHISQIu
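The documentUrl values above follow Datashare's /#/d/&lt;project&gt;/&lt;documentId&gt;/&lt;routing&gt; pattern, so the identifiers can be recovered from a URL. A sketch (the helper name is mine, not part of Tarentula):

```python
def parse_document_url(url: str):
    # The fragment after /#/d/ holds <project>/<documentId>/<routing>.
    fragment = url.split("/#/d/", 1)[1]
    project, document_id, routing = fragment.split("/")
    return project, document_id, routing

url = "http://localhost:8080/#/d/local-datashare/DbhveTJEwQfJL5Gn3Zgi/DbhveTJEwQfJL5Gn3Zgi"
print(parse_document_url(url))
# ('local-datashare', 'DbhveTJEwQfJL5Gn3Zgi', 'DbhveTJEwQfJL5Gn3Zgi')
```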
A command that uses the Elasticsearch update-by-query
feature to batch tag documents directly in the index.
To see an example of input file, refer to this JSON.
Usage: tarentula tagging-by-query [OPTIONS] JSON_PATH
Options:
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT Elasticsearch URL which is used to perform
update by query
--throttle INTEGER Request throttling (in ms)
--cookies TEXT Key/value pair to add a cookie to each
request to the API. You can separate them
with semicolons: key1=val1;key2=val2;...
--apikey TEXT Datashare authentication apikey
--traceback / --no-traceback Display a traceback in case of error
--progressbar / --no-progressbar Display a progressbar
--wait-for-completion / --no-wait-for-completion
Create an Elasticsearch task to perform the
update asynchronously
--help Show this message and exit
You can list the metadata from the mapping, optionally counting the number of occurrences of each field in the index with the --count parameter. Counting the fields is disabled by default.
It includes a --filter_by parameter to narrow the retrieved metadata properties to a specific set of documents. For instance, it can be used to get only email-related properties with: --filter_by "contentType=message/rfc822"
$ tarentula list-metadata --help
Usage: tarentula list-metadata [OPTIONS]
Options:
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT You can additionally pass the Elasticsearch
URL in order to use scrolling capabilities of
Elasticsearch (useful when dealing with a lot
of results)
--type [Document|NamedEntity] Type of indexed documents to get metadata
--filter_by TEXT Filter documents by comma-separated pairs
of field names and values joined by "=".
Example: "contentType=message/rfc822"
--count / --no-count Whether to count the number of documents
for each property found
--help Show this message and exit.
You can run aggregations on the data; the Elasticsearch aggregations API is partially exposed through this command. The possibilities are:
- count: groups documents by the distinct values of a given field and counts the documents in each group.
- nunique: returns the number of unique values of a given field.
- date_histogram: returns monthly or yearly grouped counts for a given date field.
- sum: returns the sum of the values of a numeric field.
- min: returns the minimum value of a numeric field.
- max: returns the maximum value of a numeric field.
- avg: returns the average of the values of a numeric field.
- stats: returns a set of statistics for a given numeric field.
- string_stats: returns a set of string statistics for a given string field.
$ tarentula aggregate --help
Usage: tarentula aggregate [OPTIONS]
Options:
--apikey TEXT Datashare authentication apikey
--datashare-url TEXT Datashare URL
--datashare-project TEXT Datashare project
--elasticsearch-url TEXT You can additionally pass the Elasticsearch
URL in order to use scrolling capabilities of
Elasticsearch (useful when dealing with a
lot of results)
--query TEXT The query string to filter documents
--cookies TEXT Key/value pair to add a cookie to each
request to the API. You can separate them
with semicolons: key1=val1;key2=val2;...
--traceback / --no-traceback Display a traceback in case of error
--type [Document|NamedEntity] Type of indexed documents to aggregate
--group_by TEXT Field to use to aggregate results
--operation_field TEXT Field to run the operation on
--run [count|nunique|date_histogram|sum|stats|string_stats|min|max|avg]
Operation to run
--calendar_interval [year|month]
Calendar interval for date histogram
aggregation
--help Show this message and exit.
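These operations correspond to standard Elasticsearch aggregations (terms for count, cardinality for nunique, date_histogram, and the metric aggregations sum/min/max/avg/stats/string_stats). As a hedged sketch, a count-style request body would look like the following; the aggregation name and the exact shape Tarentula sends are assumptions:

```python
import json

# Roughly equivalent to `tarentula aggregate --run count --group_by contentType`:
# a `terms` aggregation with no document hits returned (size 0).
body = {
    "size": 0,
    "aggs": {
        "group_by_field": {
            "terms": {"field": "contentType"}
        }
    },
}
print(json.dumps(body, indent=2))
```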
When running Elasticsearch changes on big datasets, they can take a very long time. As we were curling Elasticsearch to see if a task was still running well, we added a small utility to follow the changes. It draws a live graph of a provided Elasticsearch indicator with a specified filter.
It uses matplotlib and python3-tk.
If you see the following message:
$ graph_es
graph_realtime.py:32: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure
Then you have to install tkinter, i.e. python3-tk for Debian/Ubuntu.
The command has the options below:
$ graph_es --help
Usage: graph_es [OPTIONS]
Options:
--query TEXT Give a JSON query to filter documents. It can be
a file with @path/to/file. Default to all.
--index TEXT Elasticsearch index (default local-datashare)
--refresh-interval INTEGER Graph refresh interval in seconds (default 5s)
--field TEXT Field value to display over time (default "hits.total")
--elasticsearch-url TEXT Elasticsearch URL which is used to perform
update by query (default http://elasticsearch:9200)
Tarentula supports several sources for configuring its behavior, including INI files and command-line options.
The configuration file is searched for in the following order (the first file found is used, all others are ignored):
1. TARENTULA_CONFIG (environment variable, if set)
2. tarentula.ini (in the current directory)
3. ~/.tarentula.ini (in the home directory)
4. /etc/tarentula/tarentula.ini
It should follow this format (all values below are optional):
[DEFAULT]
apikey = SECRETHALONOPROCTIDAE
datashare_url = http://here:8080
datashare_project = local-datashare
[logger]
syslog_address = 127.0.0.0
syslog_port = 514
syslog_facility = local7
stdout_loglevel = INFO
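The file is standard INI syntax, so it can be read with Python's configparser; a quick sketch checking the sample values above (note that values in [DEFAULT] are inherited by the other sections):

```python
import configparser

# The sample configuration from the section above.
sample = """
[DEFAULT]
apikey = SECRETHALONOPROCTIDAE
datashare_url = http://here:8080
datashare_project = local-datashare

[logger]
syslog_port = 514
stdout_loglevel = INFO
"""

config = configparser.ConfigParser()
config.read_string(sample)
print(config.get("DEFAULT", "datashare_url"))   # http://here:8080
print(config.get("logger", "stdout_loglevel"))  # INFO
```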
To test this tool, you must have Datashare and Elasticsearch running on your development machine.
After you have installed Datashare, run it with a test project/user:
datashare -p test-datashare -u test
In a separate terminal, install the development dependencies:
make install
Finally, run the tests:
make test
The releasing process uses bumpversion to manage versions of this package, PyPI to publish the Python package, and Docker Hub for the Docker image.
make [patch|minor|major]
To be able to do this, you will need to be a maintainer of the PyPI project.
make distribute
To build and upload a new image to the Docker repository, you will need to be part of the ICIJ organization on Docker Hub:
make docker-publish
Note: Datashare Tarentula is a multi-platform build. You might need to set up your environment for multi-platform builds using the make docker-setup-multiarch command. Read more in the Docker documentation.
Push the release commit and tags:
git push origin master --tags