refactor: use jsonl and "JSON Lines" terminology
It is best to avoid ambiguity
about whether the tools work with JSON or JSON Lines documents.

This breaks the JSON import workflow:

- `tools/dir2json` is renamed to `tools/dir2jsonl`.
- `tools/import json` becomes `tools/import jsonl`.

Convert the readme to SemBr.
dbohdan committed Nov 12, 2024
1 parent 0503738 commit e67fb59
Showing 5 changed files with 83 additions and 88 deletions.
109 changes: 52 additions & 57 deletions README.md
@@ -4,33 +4,32 @@

A very small standalone full-text search HTTP/SCGI server.

![A screenshot of what the unofficial tinyfts search service for the
Tcler's Wiki looked like](screenshot.png)
![A screenshot of what the unofficial tinyfts search service for the Tcler's Wiki looked like](screenshot.png)


## Contents

* [Dependencies](#dependencies)
* [Usage](#usage)
* [Query syntax](#query-syntax)
* [Setup](#setup)
* [Operating notes](#operating-notes)
* [License](#license)
- [Dependencies](#dependencies)
- [Usage](#usage)
- [Query syntax](#query-syntax)
- [Setup](#setup)
- [Operating notes](#operating-notes)
- [License](#license)


## Dependencies

### Server

* Tcl 8.6
* tclsqlite3 with [FTS5](https://sqlite.org/fts5.html)
- Tcl 8.6
- tclsqlite3 with [FTS5](https://sqlite.org/fts5.html)

### Building, tools, and tests

The above and
* Tcllib
* kill(1), make(1), sqlite3(1)
* tDOM and file(1) to run `tools/dir2json`
- Tcllib
- kill(1), make(1), sqlite3(1)
- tDOM and file(1) to run `tools/dir2jsonl`

On recent Debian and Ubuntu install the dependencies with

@@ -73,7 +72,7 @@ Options:
The basic usage is

```sh
tools/import json example.jsonl example.sqlite3
tools/import jsonl example.jsonl example.sqlite3
# Local server
./tinyfts --db-file example.sqlite3 --local 8080
# Server available over the network
@@ -84,68 +83,66 @@ tools/import json example.jsonl example.sqlite3

### Default or "web"

The default full-text search query syntax in tinyfts resembles that of a Web
search engine. It can handle the following types of expressions.
The default full-text search query syntax in tinyfts resembles that of a Web search engine.
It can handle the following types of expressions.

* `foo` — search for the word *foo*.
* `"foo bar"` — search for the phrase *foo bar*.
* `foo AND bar`, `foo OR bar`, `NOT foo` — search for both *foo* and *bar*, at
least one of *foo* and *bar*, documents without *foo* respectively.
*foo AND bar* is identical to *foo bar*. The operators *AND*, *OR*, and *NOT*
must be in all caps.
* `-foo`, `-"foo bar"` — the same as `NOT foo`, `NOT "foo bar"`.
- `foo` — search for the word *foo*.
- `"foo bar"` — search for the phrase *foo bar*.
- `foo AND bar`, `foo OR bar`, `NOT foo` — search for both *foo* and *bar*,
at least one of *foo* and *bar*,
documents without *foo* respectively.
*foo AND bar* is identical to *foo bar*.
The operators *AND*, *OR*, and *NOT* must be in all caps.
- `-foo`, `-"foo bar"` — the same as `NOT foo`, `NOT "foo bar"`.
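
The rules in the list above can be illustrated with a small tokenizer. This is a minimal sketch in Python, not the tinyfts implementation (which is written in Tcl); the function name `tokenize` and the token kinds `word`, `phrase`, and `operator` are hypothetical. Like the crude tokenizer in tinyfts, it does not understand escaped double quotes.

```python
import re

# Hypothetical helper, not taken from tinyfts. An optional leading "-"
# followed by either a double-quoted phrase or a run of non-space characters.
TOKEN_RE = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')

# Operators are only recognized in all caps, as the list above requires.
OPERATORS = {"AND", "OR", "NOT"}


def tokenize(query: str) -> list[tuple[str, str]]:
    """Split a web-style query into (kind, text) tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(query):
        negated, phrase, word = match.groups()
        if negated:
            # "-foo" is treated the same as "NOT foo".
            tokens.append(("operator", "NOT"))
        if phrase is not None:
            tokens.append(("phrase", phrase))
        elif word in OPERATORS:
            tokens.append(("operator", word))
        else:
            tokens.append(("word", word))
    return tokens
```

With this sketch, `tokenize('-"foo bar"')` and `tokenize('NOT "foo bar"')` produce the same token list, and a lowercase `and` is treated as an ordinary word.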

### FTS5

You can allow your users to write full
[FTS5 queries](https://www.sqlite.org/fts5.html#full_text_query_syntax)
with the command line option `--query-syntax fts5`. FTS5 queries are more
powerful but expose the technical details of the underlying database. (For
example, the column names.) Users who are unfamiliar with the FTS5 syntax
will find it surprising and run into errors because they did not quote a word
that has a special meaning.
with the command line option `--query-syntax fts5`.
FTS5 queries are more powerful but expose the technical details of the underlying database.
(For example, the column names.)
Users who are unfamiliar with the FTS5 syntax will find it surprising and run into errors because they did not quote a word that has a special meaning.


## Setup

Tinyfts searches the contents of an SQLite database table with a particular
schema. The bundled import tool `tools/import` can import serialized data
(text files with one JSON object or Tcl dictionary per line) and wiki pages
from a [Wikit](https://wiki.tcl-lang.org/page/Wikit)/Nikit database into
a tinyfts database.
Tinyfts searches the contents of an SQLite database table with a particular schema.
The bundled import tool `tools/import` can import serialized data
(text files with one [JSON object](https://jsonlines.org/) or Tcl dictionary per line)
and wiki pages from a [Wikit](https://wiki.tcl-lang.org/page/Wikit)/Nikit database to a tinyfts database.
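
As an illustration of the serialized input format, the following Python sketch writes documents as JSON Lines, one JSON object per line. The field names `url`, `title`, `timestamp`, and `content` are taken from the test data in this repository; whether every field is required is an assumption, and the helper name `write_json_lines` is hypothetical.

```python
import io
import json
import time


def write_json_lines(documents, out):
    """Write each document as one JSON object per line."""
    for doc in documents:
        # json.dumps escapes newlines inside strings, so every object
        # stays on a single line.
        out.write(json.dumps(doc, ensure_ascii=False) + "\n")


# Sample data for illustration only.
documents = [
    {
        "url": "https://fts.example.com/foo",
        "title": "Foo",
        "timestamp": int(time.time()),
        "content": "First line.\nSecond line.",
    },
]

buffer = io.StringIO()
write_json_lines(documents, buffer)
```

Output written this way to a file such as `example.jsonl` is the kind of input that `tools/import jsonl example.jsonl example.sqlite3` expects.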

### Example

This example shows how to set up search for a backup copy of the
[Tcler's Wiki](https://wiki.tcl-lang.org/page/About+the+WIki). The
instructions should work on most Linux distributions and FreeBSD with the
dependencies and Git installed.
[Tcler's Wiki](https://wiki.tcl-lang.org/page/About+the+WIki).
The instructions should work on most Linux distributions and FreeBSD with the dependencies and Git installed.

1\. Go to <https://sourceforge.net/project/showfiles.php?group_id=211498>.
Download and extract the last Wikit database snapshot of the Tcler's Wiki.
Currently that is `wikit-20141112.zip`. Let's assume you have extracted the
database file to `~/Downloads/wikit.tkd`.
Download and extract the last Wikit database snapshot of the Tcler's Wiki.
Currently that is `wikit-20141112.zip`.
Let's assume you have extracted the database file to `~/Downloads/wikit.tkd`.

2\. Download, build, and test tinyfts. In this example we use Git to get the
latest development version.
2\. Download, build, and test tinyfts.
In this example we use Git to get the latest development version.

```sh
git clone https://github.com/dbohdan/tinyfts
cd tinyfts
make
```

3\. Create a tinyfts search database from the Tcler's Wiki database. The
repository includes an import tool that supports Wikit databases. Depending
on your hardware, this may take up to several minutes with an input database
size in the hundreds of megabytes.
3\. Create a tinyfts search database from the Tcler's Wiki database.
The repository includes an import tool that supports Wikit databases.
Depending on your hardware, this may take up to several minutes with an input database size in the hundreds of megabytes.

```sh
./tools/import wikit ~/Downloads/wikit.tkd /tmp/fts.sqlite3
```

4\. Start tinyfts on <http://localhost:8080>. The server URL should open
automatically in your browser. Try searching.
4\. Start tinyfts on <http://localhost:8080>.
The server URL should open automatically in your browser.
Try searching.

```sh
./tinyfts --db-file /tmp/fts.sqlite3 --title 'tinyfts demo' --local 8080
@@ -154,17 +151,15 @@ automatically in your browser. Try searching.

## Operating notes

* If you put tinyfts behind a reverse proxy, remember to start it with the
command line option `--behind-reverse-proxy true`. It is necessary for
correct client IP address detection, which rate limiting depends on. Do
**not** enable `--behind-reverse-proxy` if tinyfts is not behind a reverse
proxy. It will let clients spoof their IP with the header `X-Real-IP` or
`X-Forwarded-For` and evade rate limiting themselves and rate limit others.
- If you put tinyfts behind a reverse proxy, remember to start it with the command line option `--behind-reverse-proxy true`.
It is necessary for
correct client IP address detection, which rate limiting depends on.
Do **not** enable `--behind-reverse-proxy` if tinyfts is not behind a reverse proxy.
It will let clients spoof their IP with the header `X-Real-IP` or `X-Forwarded-For` and evade rate limiting themselves and rate limit others.
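
The reason for this warning can be sketched in a few lines of Python. This is not the tinyfts code; the function name `client_ip`, the precedence of `X-Real-IP` over `X-Forwarded-For`, and the choice of the first `X-Forwarded-For` entry are all assumptions made for illustration. It also assumes the header names have already been normalized to this exact spelling.

```python
def client_ip(remote_addr: str, headers: dict[str, str],
              behind_reverse_proxy: bool) -> str:
    """Pick the address that rate limiting should be keyed on."""
    if not behind_reverse_proxy:
        # Without a trusted proxy, the headers are client-controlled,
        # so only the socket address can be trusted.
        return remote_addr

    real_ip = headers.get("X-Real-IP")
    if real_ip:
        return real_ip.strip()

    forwarded = headers.get("X-Forwarded-For")
    if forwarded:
        # Assumption: the first entry is the original client.
        return forwarded.split(",")[0].strip()

    return remote_addr
```

If the headers were trusted while clients connect directly, any client could send an arbitrary `X-Real-IP` value and be rate limited under someone else's address, or under a fresh address for every request.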


## License

MIT. [Wapp](https://wapp.tcl.tk/) is copyright (c) 2017-2022 D. Richard Hipp
and is distributed under the Simplified BSD License.
[Tacit](https://github.com/yegor256/tacit) is copyright (c) 2015-2020
Yegor Bugayenko and is distributed under the MIT license.
MIT.
[Wapp](https://wapp.tcl.tk/) is copyright (c) 2017-2022 D. Richard Hipp and is distributed under the Simplified BSD License.
[Tacit](https://github.com/yegor256/tacit) is copyright (c) 2015-2020 Yegor Bugayenko and is distributed under the MIT license.
28 changes: 14 additions & 14 deletions tests/tests.tcl
@@ -30,7 +30,7 @@ package require textutil

cd [file dirname [info script]]/..

set td(json-sample) [string map [list \n\n \n \n {}] {
set td(json-lines-sample) [string map [list \n\n \n \n {}] {
{
"url": "https://fts.example.com/foo",
"title": "Foo",
@@ -67,7 +67,7 @@ set td(json-sample) [string map [list \n\n \n \n {}] {
}
}]

set td(tcl-sample) [join [lmap line [split $td(json-sample) \n] {
set td(tcl-sample) [join [lmap line [split $td(json-lines-sample) \n] {
json::json2dict $line
}] \n]

@@ -147,20 +147,20 @@ tcltest::test tools-import-1.1.3 {Tcl import} -cleanup $td(cleanup) -body {
} -match glob -result https://fts.example.com/foo\nhttps://fts.example.com/bar


tcltest::test tools-import-1.2.1 {JSON import} -body {
tclsh tools/import json - $td(dbFile) << $td(json-sample)
tcltest::test tools-import-1.2.1 {JSON Lines import} -body {
tclsh tools/import jsonl - $td(dbFile) << $td(json-lines-sample)
} -cleanup $td(cleanup) -result {}

tcltest::test tools-import-1.2.2 {JSON import} -cleanup $td(cleanup) -body {
tclsh tools/import json - $td(dbFile) --table blah \
<< $td(json-sample)
tcltest::test tools-import-1.2.2 {JSON Lines import} -cleanup $td(cleanup) -body {
tclsh tools/import jsonl - $td(dbFile) --table blah \
<< $td(json-lines-sample)

exec sqlite3 $td(dbFile) .schema
} -match glob -result {*CREATE VIRTUAL TABLE "blah"*USING fts5*}

tcltest::test tools-import-1.2.3 {JSON import} -cleanup $td(cleanup) -body {
tclsh tools/import json - $td(dbFile) --url-prefix http://example.com/ \
<< $td(json-sample)
tcltest::test tools-import-1.2.3 {JSON Lines import} -cleanup $td(cleanup) -body {
tclsh tools/import jsonl - $td(dbFile) --url-prefix http://example.com/ \
<< $td(json-lines-sample)

exec sqlite3 $td(dbFile) {SELECT url FROM tinyfts LIMIT 2}
} -match glob -result https://fts.example.com/foo\nhttps://fts.example.com/bar
@@ -179,14 +179,14 @@ tcltest::test tools-import-2.3 {} -cleanup $td(cleanup) -body {
} -returnCodes 1 -match glob -result *


tcltest::test tools-dir2json-1.1 {Normal dir} -body {
tclsh tools/dir2json x/ tests/dir1/
tcltest::test tools-dir2jsonl-1.1 {Normal dir} -body {
tclsh tools/dir2jsonl x/ tests/dir1/
} -match regexp -result \
{{"url":"x/bar.html","timestamp":\d+,"title":"bar.html","content":"Bar."}
{"url":"x/foo.txt","timestamp":\d+,"title":"foo.txt","content":"Foo."}}

tcltest::test tools-dir2json-2.1 {Bad HTML} -body {
tclsh tools/dir2json x/ tests/dir2/ 2>@1
tcltest::test tools-dir2jsonl-2.1 {Bad HTML} -body {
tclsh tools/dir2jsonl x/ tests/dir2/ 2>@1
} -match glob -result \
{*can't parse HTML*Missing ">"*"content":"<HTML><HEA"\}}

2 changes: 1 addition & 1 deletion tinyfts-dev.tcl
@@ -519,7 +519,7 @@ proc translate-query::web query {
set not {}
set translated {}

# A crude query tokenizer. Doesn't understand escaped double quotes.
# A crude query tokenizer. Doesn't understand escaped double quotes.
set start 0
while {[regexp -indices \
-start $start \
18 changes: 9 additions & 9 deletions tools/dir2json → tools/dir2jsonl
@@ -1,6 +1,6 @@
#! /usr/bin/env tclsh
# Print text files in a directory as tinyfts JSON Lines. This program uses
# file(1) to determine what is a text file. It attempts to extract text from
# Print text files in a directory as tinyfts JSON Lines. This program uses
# file(1) to determine what is a text file. It attempts to extract text from
# HTML using tDOM.
# ==============================================================================
# Copyright (c) 2022, 2024 D. Bohdan
@@ -47,8 +47,8 @@ proc main {url-prefix dir args} {

for {set i 0} {$i < $len} {incr i $::batchSize} {
# Do not use "--files-from -", since it doesn't seem able to handle
# filenames with newlines. Run file(1) with batches of files to avoid
# exceeding the maximum argument length. A possible multiuser privacy
# filenames with newlines. Run file(1) with batches of files to avoid
# exceeding the maximum argument length. A possible multiuser privacy
# concern: the filenames are visible in the process list.
set output [exec file \
--mime-type \
@@ -60,7 +60,7 @@ proc main {url-prefix dir args} {

dict for {file type} $types {
try {
print-file-as-json ${url-prefix} $file $type
print-file-as-json-lines ${url-prefix} $file $type
} on error e {
puts stderr [list $file $e]
set status 1
@@ -72,16 +72,16 @@ proc main {url-prefix dir args} {
}


proc print-file-as-json {url-prefix file type} {
proc print-file-as-json-lines {url-prefix file type} {
if {![string match text/* $type]} return

set path [string range $file 2 end]
set text [fileutil::cat $file]

if {$type eq {text/html}} {
try {
# tDOM 0.9.2 can modify text passed to [dom parse -html]. This is
# a violation of Tcl semantics. For example, "<HTML><HEA"
# tDOM 0.9.2 can modify text passed to [dom parse -html]. This is
# a violation of Tcl semantics. For example, "<HTML><HEA"
# becomes "<html><hea" after a failed attempt to parse it.
set copy [string range $text 0 end]
set doc [dom parse -html $copy]
@@ -102,7 +102,7 @@ proc print-file-as-json {url-prefix file type} {


proc usage {} {
puts stderr {usage: dir2json url-prefix dir [glob-pattern ...]}
puts stderr {usage: dir2jsonl url-prefix dir [glob-pattern ...]}
puts stderr "\nConvert text files in dir to tinyfts JSON Lines and print\
them to stdout. What is a text file is determined using file(1). The\
names of the files must match at least one glob pattern if any glob\
14 changes: 7 additions & 7 deletions tools/import
@@ -66,8 +66,8 @@ proc import::serialized {srcPath destPath config} {
}]

try {
set jsonMode [expr {
[dict get $config command] eq {json}
set jsonLinesMode [expr {
[dict get $config command] eq {jsonl}
}]

sqlite3 dest $destPath
@@ -89,7 +89,7 @@ proc import::serialized {srcPath destPath config} {
if {[string is space $line]} continue

set dict [expr {
$jsonMode
$jsonLinesMode
? [json::json2dict $line]
: $line
}]
@@ -162,8 +162,8 @@ proc import::parse-options {command options} {
}


proc import::json {src dest args} {
serialized $src $dest [parse-options json $args]
proc import::jsonl {src dest args} {
serialized $src $dest [parse-options jsonl $args]
}


@@ -183,15 +183,15 @@ proc import::wikit {src dest args} {


namespace eval import {
namespace export json nikit tcl wikit
namespace export jsonl nikit tcl wikit
namespace ensemble create
}


proc import::usage {} {
variable defaults

puts stderr "usage: import (json|tcl|nikit|wikit) src dest\
puts stderr "usage: import (jsonl|tcl|nikit|wikit) src dest\
\[--table [dict get $defaults table]\]\
\[--url-prefix [dict get $defaults url-prefix]\]"
puts stderr "\nImport src (a file path or \"-\" for stdin) of one of\
