User scripts, bug fixes, docker image
simon987 committed Nov 13, 2019
1 parent 6931d32 commit ebfd7e0
Showing 21 changed files with 488 additions and 62 deletions.
6 changes: 3 additions & 3 deletions CMakeLists.txt
@@ -122,10 +122,10 @@ if (WITH_SIST2)

target_compile_options(sist2
PRIVATE
-Ofast
# -Ofast
# -march=native
-fno-stack-protector
-fomit-frame-pointer
# -fno-stack-protector
# -fomit-frame-pointer
)

TARGET_LINK_LIBRARIES(
9 changes: 9 additions & 0 deletions Docker/Dockerfile
@@ -0,0 +1,9 @@
FROM ubuntu:19.10
MAINTAINER simon987 <me@simon987.net>

RUN apt update
RUN apt install -y libglib2.0-0 libcurl4 libmagic1 libharfbuzz-bin libopenjp2-7

ADD sist2 /root/sist2

ENTRYPOINT ["/root/sist2"]
8 changes: 8 additions & 0 deletions Docker/build.sh
@@ -0,0 +1,8 @@
cp ../sist2 .

version=$(./sist2 --version)

echo "Version ${version}"
docker build . -t simon987/sist2:${version} -t simon987/sist2:latest
docker push simon987/sist2:${version}
docker push simon987/sist2:latest
37 changes: 33 additions & 4 deletions README.md
@@ -14,18 +14,21 @@ sist2 (Simple incremental search tool)
* Extracts text from common file types\*
* Generates thumbnails\*
* Incremental scanning
* Automatic tagging from file attributes via [user scripts](scripting/README.md)


\* See [format support](#format-support)

## Getting Started

1. Have an [Elasticsearch](https://www.elastic.co/downloads/elasticsearch) instance running
1. Download the [latest sist2 release](https://github.com/simon987/sist2/releases)
1.
1. Download the [latest sist2 release](https://github.com/simon987/sist2/releases) *
1. *(or)* `docker pull simon987/sist2:latest`


*Windows users*: `sist2` runs under [WSL](https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux)

*Mac users*: See [#1](https://github.com/simon987/sist2/issues/1)
\* *Windows users*: **sist2** runs under [WSL](https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux)
\* *Mac users*: See [#1](https://github.com/simon987/sist2/issues/1)


## Example usage
@@ -52,6 +55,32 @@ sist2 index --print ./my_idx > raw_documents.ndjson
sist2 web --bind 0.0.0.0 --port 4321 ./my_idx1 ./my_idx2 ./my_idx3
```

### Use sist2 with docker

**scan**
```bash
docker run -it \
-v /path/to/files/:/files \
-v $PWD/out/:/out \
simon987/sist2 scan -t 4 /files -o /out/my_idx1
```
**index**
```bash
docker run -it --network host \
-v $PWD/out/:/out \
simon987/sist2 index /out/my_idx1
```

**web**
```bash
docker run --rm --network host -d --name sist2 \
-v $PWD/out/my_idx:/idx \
-v $PWD/my/files:/files \
simon987/sist2 web --bind 0.0.0.0 /idx
docker stop sist2
```


## Format support

File type | Library | Content | Thumbnail | Metadata
3 changes: 3 additions & 0 deletions schema/mappings.json
@@ -80,6 +80,9 @@
"analyzer": "my_nGram"
}
}
},
"tag": {
"type": "keyword"
}
}
}
117 changes: 117 additions & 0 deletions scripting/README.md
@@ -0,0 +1,117 @@
## User scripts

*This document is under construction; a more in-depth guide is coming soon.*

During the `index` step, you can use the `--script-file <script>` option to
modify documents or add user tags. This option is mainly used to
implement automatic tagging based on file attributes.

The scripting language
([Painless Scripting Language](https://www.elastic.co/guide/en/elasticsearch/painless/7.4/index.html))
is very similar to Java, but if you're somewhat familiar with regex you
should be able to write user scripts without any programming
experience.

This is the base structure of the documents we're working with:
```json
{
"_id": "e171405c-fdb5-4feb-bb32-82637bc32084",
"_index": "sist2",
"_type": "_doc",
"_source": {
"index": "206b3050-e821-421a-891d-12fcf6c2db0d",
"mime": "application/json",
"size": 1799,
"mtime": 1545443685,
"extension": "md",
"name": "README",
"path": "sist2/scripting",
"content": "..."
}
}
```
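
Judging from `execute_update_script` in this commit, the script is posted to Elasticsearch's `_update_by_query` endpoint together with a `term` query on the `index` field shown above, so it should only touch documents that belong to the index being processed. The sketch below builds an equivalent request body by hand. `json_escape` and `build_update_body` are hypothetical helper names (sist2 itself builds the body with cJSON), and the 4096-byte scratch buffer is an arbitrary limit chosen for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: escape `in` for use inside a JSON string literal.
 * Returns the escaped length, or -1 if `out` is too small. */
int json_escape(const char *in, char *out, size_t out_len) {
    size_t o = 0;
    assert(in != NULL && out != NULL && out_len > 0);

    for (const unsigned char *p = (const unsigned char *) in; *p != '\0'; p++) {
        char buf[8];
        int n;
        switch (*p) {
            case '"':  n = snprintf(buf, sizeof(buf), "\\\""); break;
            case '\\': n = snprintf(buf, sizeof(buf), "\\\\"); break;
            case '\n': n = snprintf(buf, sizeof(buf), "\\n");  break;
            case '\r': n = snprintf(buf, sizeof(buf), "\\r");  break;
            case '\t': n = snprintf(buf, sizeof(buf), "\\t");  break;
            default:
                if (*p < 0x20) {
                    /* Other control characters use the \uXXXX form. */
                    n = snprintf(buf, sizeof(buf), "\\u%04x", (unsigned) *p);
                } else {
                    buf[0] = (char) *p;
                    n = 1;
                }
        }
        if (o + (size_t) n >= out_len) {
            return -1; /* not enough room for this character plus the NUL */
        }
        memcpy(out + o, buf, (size_t) n);
        o += (size_t) n;
    }
    out[o] = '\0';
    return (int) o;
}

/* Hypothetical helper: write the _update_by_query body into `out`.
 * `index_id` is assumed to be a UUID, so it is not escaped.
 * Returns the body length, or -1 if a buffer is too small. */
int build_update_body(const char *script, const char *index_id,
                      char *out, size_t out_len) {
    char escaped[4096];
    if (json_escape(script, escaped, sizeof(escaped)) < 0) {
        return -1;
    }
    int n = snprintf(out, out_len,
                     "{\"script\":{\"lang\":\"painless\",\"source\":\"%s\"},"
                     "\"query\":{\"term\":{\"index\":\"%s\"}}}",
                     escaped, index_id);
    return (n < 0 || (size_t) n >= out_len) ? -1 : n;
}
```

Printing the resulting body for a small script is a quick way to check quoting before pointing `--script-file` at a real index.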

**Example script**

This script checks whether the `genre` attribute exists and, if it does,
adds the `genre.<genre>` tag.
```Java
ArrayList tags = ctx._source.tag = new ArrayList();

if (ctx._source?.genre != null) {
tags.add("genre." + ctx._source.genre.toLowerCase());
}
```

You can use `.` to create a hierarchical tag tree:

![scripting/genre_example](genre_example.png)


To use regular expressions, add this line to `/etc/elasticsearch/elasticsearch.yml`:
```yaml
script.painless.regex.enabled: true
```
Or, if you're using Docker, add `-e "script.painless.regex.enabled=true"`.

### Examples

If `(20XX)` is in the file name, add the `year.<year>` tag:
```Java
ArrayList tags = ctx._source.tag = new ArrayList();
Matcher m = /[\(\.+](20[0-9]{2})[\)\.+]/.matcher(ctx._source.name);
if (m.find()) {
tags.add("year." + m.group(1));
}
```
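
To try the year pattern on a few file names before running it against an index, a roughly equivalent check can be written with POSIX `regex.h`. This is a sketch that assumes a POSIX system. `extract_year` is a hypothetical name, and the pattern is the extended-regex spelling of the one above (backslashes are not needed inside POSIX bracket expressions).

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the first "20XX" year that is wrapped in
 * '(', ')', '.' or '+' into `year`. Returns 1 on a match, 0 otherwise. */
int extract_year(const char *name, char year[5]) {
    regex_t re;
    regmatch_t m[2];
    assert(name != NULL && year != NULL);

    /* Same idea as /[\(\.+](20[0-9]{2})[\)\.+]/ in the Painless script. */
    if (regcomp(&re, "[(.+](20[0-9]{2})[).+]", REG_EXTENDED) != 0) {
        return 0;
    }

    int found = regexec(&re, name, 2, m, 0) == 0;
    if (found) {
        /* m[1] is the capture group, which is always 4 characters long. */
        memcpy(year, name + m[1].rm_so, 4);
        year[4] = '\0';
    }
    regfree(&re);
    return found;
}
```

With this sketch, `extract_year("Some Movie (2019)", year)` returns 1 and leaves `"2019"` in `year`, while a name such as `"Some Movie 2019"` does not match because the year is not enclosed by one of the expected characters.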

Use the default *Calibre* folder structure to infer the author:
```Java
ArrayList tags = ctx._source.tag = new ArrayList();
// We expect the book path to look like this:
// /path/to/Calibre Library/Author/Title/Title - Author.pdf
if (ctx._source.name.contains("-") && ctx._source.extension == "pdf") {
String[] names = ctx._source.name.splitOnToken('-');
tags.add("author." + names[1].strip());
}
```
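
The script above takes the second `-`-separated token of the file name as the author. The C sketch below mirrors that logic so the behaviour can be checked on sample names. `second_token` is a hypothetical name, and `isspace` only approximates the whitespace handling of `strip()`.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the trimmed text between the first and the
 * second '-' (or the end of the string) into `out`.
 * Returns 1 on success, 0 if there is no '-' or `out` is too small. */
int second_token(const char *name, char *out, size_t out_len) {
    assert(name != NULL && out != NULL && out_len > 0);

    const char *start = strchr(name, '-');
    if (start == NULL) {
        return 0;
    }
    start++; /* skip the separator itself */

    const char *end = strchr(start, '-');
    if (end == NULL) {
        end = start + strlen(start);
    }

    /* Trim surrounding whitespace, like strip() in the script. */
    while (start < end && isspace((unsigned char) *start)) {
        start++;
    }
    while (end > start && isspace((unsigned char) end[-1])) {
        end--;
    }

    size_t len = (size_t) (end - start);
    if (len >= out_len) {
        return 0;
    }
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

For `"Dune - Frank Herbert"` this yields `"Frank Herbert"`. For a title that itself contains a hyphen, such as `"Spider-Man - Stan Lee"`, it yields `"Man"`, so taking the second token gives the wrong author for such titles.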

If the file name matches the pattern `AAAA-000 fName1 lName1, <fName2 lName2>...`, add the `actress.<actress>` and
`studio.<studio>` tags:
```Java
ArrayList tags = ctx._source.tag = new ArrayList();
Matcher m = /([A-Z]{4})-[0-9]{3} (.*)/.matcher(ctx._source.name);
if (m.find()) {
tags.add("studio." + m.group(1));
// Take the matched group (.*), and add a tag for
// each name, separated by comma
for (String name : m.group(2).splitOnToken(',')) {
tags.add("actress." + name.strip());
}
}
```

Use the name of the last folder (`/path/to/<studio>/file.mp4`) as the `studio.<studio>` tag:
```Java
ArrayList tags = ctx._source.tag = new ArrayList();
if (ctx._source.path != "") {
String[] names = ctx._source.path.splitOnToken('/');
tags.add("studio." + names[names.length-1]);
}
```
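
The same "last path component" lookup can be sketched in C with `strrchr`, which may help when checking what tag a given `path` value would produce. `last_folder` is a hypothetical name, and the sketch assumes `path` has no trailing slash, as in the sample document earlier in this README.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the last '/'-separated
 * component of `path`, or NULL when `path` is empty. */
const char *last_folder(const char *path) {
    assert(path != NULL);
    if (path[0] == '\0') {
        return NULL; /* mirrors the `path != ""` guard in the script */
    }
    const char *slash = strrchr(path, '/');
    return slash != NULL ? slash + 1 : path;
}
```

For the sample document's `path` of `"sist2/scripting"`, this returns `"scripting"`, which corresponds to a `studio.scripting` tag under the script above.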

Binary file added scripting/genre_example.png
33 changes: 27 additions & 6 deletions src/cli.c
@@ -2,8 +2,8 @@

#define DEFAULT_OUTPUT "index.sist2/"
#define DEFAULT_CONTENT_SIZE 4096
#define DEFAULT_QUALITY 15
#define DEFAULT_SIZE 200
#define DEFAULT_QUALITY 5
#define DEFAULT_SIZE 500
#define DEFAULT_REWRITE_URL ""

#define DEFAULT_ES_URL "http://localhost:9200"
@@ -25,7 +25,7 @@ int scan_args_validate(scan_args_t *args, int argc, const char **argv) {

char *abs_path = abspath(argv[1]);
if (abs_path == NULL) {
fprintf(stderr, "File not found: %s", argv[1]);
fprintf(stderr, "File not found: %s\n", argv[1]);
return 1;
} else {
args->path = abs_path;
@@ -34,7 +34,7 @@ int scan_args_validate(scan_args_t *args, int argc, const char **argv) {
if (args->incremental != NULL) {
abs_path = abspath(args->incremental);
if (abs_path == NULL) {
fprintf(stderr, "File not found: %s", args->incremental);
fprintf(stderr, "File not found: %s\n", args->incremental);
return 1;
}
}
@@ -100,7 +100,7 @@ int index_args_validate(index_args_t *args, int argc, const char **argv) {

char *index_path = abspath(argv[1]);
if (index_path == NULL) {
fprintf(stderr, "File not found: %s", argv[1]);
fprintf(stderr, "File not found: %s\n", argv[1]);
return 1;
} else {
args->index_path = argv[1];
@@ -109,6 +109,27 @@ int index_args_validate(index_args_t *args, int argc, const char **argv) {
if (args->es_url == NULL) {
args->es_url = DEFAULT_ES_URL;
}

if (args->script_path != NULL) {
struct stat info;
int res = stat(args->script_path, &info);

if (res == -1) {
fprintf(stderr, "Error opening script file '%s': %s\n", args->script_path, strerror(errno));
return 1;
}

int fd = open(args->script_path, O_RDONLY);
if (fd == -1) {
fprintf(stderr, "Error opening script file '%s': %s\n", args->script_path, strerror(errno));
return 1;
}

args->script = malloc(info.st_size + 1);
read(fd, args->script, info.st_size);
*(args->script + info.st_size) = '\0';
close(fd);
}
return 0;
}
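
The script-loading code above calls `malloc` and `read` without checking their results, so a failed allocation or a short read would go unnoticed. A more defensive variant might look like the sketch below. `read_script_file` is a hypothetical name, and the sketch uses stdio instead of `open`/`read` only to keep it short.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read the whole file at `path` into a newly
 * allocated, NUL-terminated buffer. Returns NULL on any error.
 * The caller owns the result and must free() it. */
char *read_script_file(const char *path) {
    assert(path != NULL);

    FILE *f = fopen(path, "rb");
    if (f == NULL) {
        return NULL;
    }

    size_t cap = 4096, len = 0, n;
    char *buf = malloc(cap);
    if (buf == NULL) {
        fclose(f);
        return NULL;
    }

    /* Always leave one byte free for the terminating NUL. */
    while ((n = fread(buf + len, 1, cap - len - 1, f)) > 0) {
        len += n;
        if (cap - len - 1 == 0) {
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                fclose(f);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }

    if (ferror(f)) {
        free(buf);
        fclose(f);
        return NULL;
    }
    fclose(f);

    buf[len] = '\0';
    return buf;
}
```

Reading until `fread` returns 0 avoids relying on the size reported by `stat`, which can differ from the number of bytes actually read.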

@@ -137,7 +158,7 @@ int web_args_validate(web_args_t *args, int argc, const char **argv) {
for (int i = 0; i < args->index_count; i++) {
char *abs_path = abspath(args->indices[i]);
if (abs_path == NULL) {
fprintf(stderr, "File not found: %s", abs_path);
fprintf(stderr, "File not found: %s\n", args->indices[i]);
return 1;
}
}
2 changes: 2 additions & 0 deletions src/cli.h
@@ -22,6 +22,8 @@ int scan_args_validate(scan_args_t *args, int argc, const char **argv);
typedef struct index_args {
char *es_url;
const char *index_path;
const char *script_path;
char *script;
int print;
int force_reset;
} index_args_t;
50 changes: 46 additions & 4 deletions src/index/elastic.c
@@ -6,7 +6,6 @@
#include <stdio.h>
#include <string.h>
#include <cJSON/cJSON.h>
#include <src/ctx.h>

#include "static_generated.c"

@@ -54,6 +53,40 @@ void index_json(cJSON *document, const char uuid_str[UUID_STR_LEN]) {
elastic_index_line(bulk_line);
}

void execute_update_script(const char *script, const char index_id[UUID_STR_LEN]) {

cJSON *body = cJSON_CreateObject();
cJSON *script_obj = cJSON_AddObjectToObject(body, "script");
cJSON_AddStringToObject(script_obj, "lang", "painless");
cJSON_AddStringToObject(script_obj, "source", script);

cJSON *query = cJSON_AddObjectToObject(body, "query");
cJSON *term_obj = cJSON_AddObjectToObject(query, "term");
cJSON_AddStringToObject(term_obj, "index", index_id);

char * str = cJSON_Print(body);

char bulk_url[4096];
snprintf(bulk_url, 4096, "%s/sist2/_update_by_query?pretty", Indexer->es_url);
response_t *r = web_post(bulk_url, str, "Content-Type: application/json");
printf("Executed user script <%d>\n", r->status_code);
cJSON *resp = cJSON_Parse(r->body);

cJSON_free(str);
cJSON_Delete(body);
free_response(r);

cJSON *error = cJSON_GetObjectItem(resp, "error");
if (error != NULL) {
char *error_str = cJSON_Print(error);

fprintf(stderr, "User script error: \n%s\n", error_str);
cJSON_free(error_str);
}

cJSON_Delete(resp);
}

void elastic_flush() {

if (Indexer == NULL) {
@@ -115,6 +148,7 @@ void elastic_flush() {
cJSON_Delete(ret_json);

free_response(r);
free(buf);
}

void elastic_index_line(es_bulk_line_t *line) {
@@ -140,8 +174,7 @@ void elastic_index_line(es_bulk_line_t *line) {

es_indexer_t *create_indexer(const char *url) {

size_t url_len = strlen(url);
char *es_url = malloc(url_len);
char *es_url = malloc(strlen(url) + 1);
strcpy(es_url, url);

es_indexer_t *indexer = malloc(sizeof(es_indexer_t));
@@ -154,7 +187,7 @@ es_indexer_t *create_indexer(const char *url) {
return indexer;
}

void destroy_indexer() {
void destroy_indexer(char * script, char index_id[UUID_STR_LEN]) {

char url[4096];

@@ -163,6 +196,15 @@ void destroy_indexer() {
printf("Refresh index <%d>\n", r->status_code);
free_response(r);

if (script != NULL) {
execute_update_script(script, index_id);
}

snprintf(url, sizeof(url), "%s/sist2/_refresh", IndexCtx.es_url);
r = web_post(url, "", NULL);
printf("Refresh index <%d>\n", r->status_code);
free_response(r);

snprintf(url, sizeof(url), "%s/sist2/_forcemerge", IndexCtx.es_url);
r = web_post(url, "", NULL);
printf("Merge index <%d>\n", r->status_code);