
.TH RELIQ 1 reliq\-VERSION

.SH NAME
reliq \- simple HTML searching tool

.SH SYNOPSIS
.B reliq
.RI [ OPTION .\|.\|.]\&
.I PATTERNS
.RI [ FILE .\|.\|.]\&
.br

.SH DESCRIPTION
.B reliq
searches for
.I PATTERNS
in each
.IR FILE .
Typically
.I PATTERNS
should be quoted when reliq is used in a shell command.
.PP
When no
.I FILE
is specified, input is read from standard input.

.SH OPTIONS
.SS "Generic Program Information"
.TP
.B \-h
Output a usage message and exit.
.TP
.BR \-v
Output the version number of
.B reliq
and exit.

.SS "Pattern Syntax"
.TP
.BR \-F
Enter a fast, low-memory mode in which the pattern must be a chained expression, that is, one made only of \fBTAG\fRs separated by ';'.

.SS "Matching Control"
.TP
.BI \-f " FILE"
Obtain pattern from
.IR FILE .
An empty file contains zero patterns, and therefore matches nothing.

.SS "General Output Control"
.TP
.BI \-o " FILE"
Write output to
.IR FILE
instead of standard output.

.TP
.BI \-e " FILE"
Write error messages to
.IR FILE
instead of standard error.

.SS "File and Directory Selection"
.TP
.BR \-r
Read all files under each directory, recursively.
.TP
.BR \-H
Follow symbolic links.
.TP
.BR \-R
Read all files under each directory, recursively.
Follow all symbolic links, unlike
.BR \-r .
.SS "Other Options"
.TP
.B \-l
Set
.IR PATTERN
to '| "%n%Ua - desc(%c) lvl(%L) size(%s) pos(%I)\\n"'.

.SH SYNTAX
.SS PATTERN
Consists of text that may be enclosed in single (') or double (") quotes.

It can be preceded by option characters that specify how matching is done, terminated by a '>' character. The default options are "tWfncs".

After the '>' character a \fBRANGE\fR may be specified, which is matched against the number of characters in the matched string. In that case the text may be omitted.

If the unquoted text is a single '*' character, matching of text is omitted, leaving only matching by range.

.nf
\&
.in +4m
\fBTEXT\fR
\fB"TEXT"\fR
\fB'TEXT'\fR
\fB>'TEXT'\fR
\fB>TEXT\fR             - match TEXT, same as above
\fB[a-z]>TEXT\fR        - match TEXT depending on options
\fB[a-z]>[RANGE]TEXT\fR
\fB*\fR                 - match to everything
\fB>*\fR                - same as above
\fB>[RANGE]"TEXT"\fR
\fB>[RANGE]\fR          - match length of pattern
\fB>[RANGE]*\fR         - same as above
.in
\&

Matching options:

.nf
\&
.in +4m
\fBt\fR - Match \fIt\fRrimmed
\fBu\fR - Match \fIu\fRntrimmed

\fBc\fR - Be \fIc\fRase sensitive
\fBi\fR - Be case \fIi\fRnsensitive

\fBv\fR - In\fIv\fRert matching
\fBn\fR - Match \fIn\fRormally, without inversion

\fBa\fR - Match if it occurs at \fIa\fRny position
\fBf\fR - Match if it \fIf\fRully matches
\fBb\fR - Match if it matches at the \fIb\fReginning
\fBe\fR - Match if it matches at the \fIe\fRnd

\fBw\fR - Match it to any of the \fIw\fRords in it
\fBW\fR - Match it to the \fIW\fRhole thing

\fBs\fR - Use \fIs\fRtring matching
\fBB\fR - Use \fIB\fRRE matching
\fBE\fR - Use \fIE\fRRE matching
.in
\&
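.fi
.PP
A few combinations, read directly off the option list above (illustrative patterns, not taken from a recorded run; options that are not given are assumed to keep their defaults):
.nf
\&
.in +4m
\fBi>div\fR          - "div" compared case insensitively
\fBb>"https://"\fR   - text beginning with "https://"
\fBe>".png"\fR       - text ending with ".png"
\fBia>"login"\fR     - "login" at any position, case insensitively
\fBE>"h[1-6]"\fR     - text fully matching the ERE h[1-6]
\fB>[1:]\fR          - text of at least one character
.in
\&
.fi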

.SS ATTRIBUTE_NAME
A \fBPATTERN\fR matching the attribute's name. Prefixing it with '+' or '-' ensures that it is treated as an attribute, but in most cases the prefix can be omitted. '-' negates the match. A \fB[RANGE]\fR placed before the \fBPATTERN\fR (after the prefix, if any) matches the attribute's position in the tag.

The attribute names 'class' and 'id' can be abbreviated to '.' and '#' (see below). This sets the flags of \fBATTRIBUTE_NAME\fR to 'uWsfi' and those of the value \fBPATTERN\fR to 'uwsf'.
.nf
\&
.in +4m
\fBPATTERN\fR          - existing attribute
\fB+PATTERN\fR         - existing attribute
\fB-PATTERN\fR         - non existing attribute
\fB.PATTERN\fR         - same as i>class=w>PATTERN
\fB#PATTERN\fR         - same as i>id=w>PATTERN
\fB[RANGE]PATTERN\fR   - existing attribute at position determined by RANGE
\fB-[RANGE]PATTERN\fR
\fB+.PATTERN\fR
\fB-[RANGE]#PATTERN\fR
.in
\&
.SS ATTRIBUTE
Consists of \fBATTRIBUTE_NAME\fR followed by '=' and a \fBPATTERN\fR for the attribute's value. Specifying only \fBATTRIBUTE_NAME\fR, without a value, matches the attribute regardless of its value.

.nf
\&
.in +4m
\fBATTRIBUTE_NAME\fR - ignore value of attribute
\fBATTRIBUTE_NAME=PATTERN\fR
.in
\&
.SS RANGE
Is always enclosed in square brackets. Consists of groups of up to four numbers separated by ':'; any number of groups can be joined with ','. Empty values are filled in with defaults. Putting '-' before either of the first two values (even if the value itself is not specified) makes it be subtracted from the maximal value. A '!' before the first value inverts the match.

By default the minimal matching value is 0 (exceptions are documented where they apply).

Specifying only one value matches that value alone.

Specifying two values matches the range between them, both ends included.

Specifying three values additionally restricts matching to values whose remainder modulo the third value is 0. The fourth value is an offset added to the value before the modulo is calculated.

.nf
\&
.in +4m
\fB[x1,!x2,x3,x4]\fR        - match to one of the values that is not x2
\fB[x1:y1,x2:y2,!x3:y4]\fR  - match to one of the ranges that is not in x3:y4
\fB[-]\fR                   - match to last value subtracted by 0
\fB[-x]\fR                  - match to last value subtracted by x
\fB[:-y]\fR                 - match to range from 0 to y'th value from end
\fB[::w]\fR                 - match to values from which modulo of w is equal to 0
\fB[x:y:w]\fR               - match to range from x to y from which modulo of w is equal to 0
\fB[x:y:w:z]\fR             - match to range from x to y with value increased by z from which modulo of w is equal to 0
\fB[::2:1]\fR               - match to odd values (assuming that the minimal value is 0)
.in
\&
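.fi
.PP
As a worked illustration of the rules above, assume the values being matched run from 0 to 9 (so the maximal value is 9) and that both ends of a range are included; the results below follow from the definitions in this section rather than from a recorded run:
.nf
\&
.in +4m
\fB[3]\fR        - matches 3
\fB[2:5]\fR      - matches 2, 3, 4, 5
\fB[-1]\fR       - matches 8
\fB[:-7]\fR      - matches 0, 1, 2
\fB[::3]\fR      - matches 0, 3, 6, 9
\fB[2:7:3]\fR    - matches 3, 6
\fB[::3:1]\fR    - matches 2, 5, 8
\fB[!2:7]\fR     - matches 0, 1, 8, 9
\fB[0,4:5]\fR    - matches 0, 4, 5
.in
\&
.fi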
.SS HOOK
Begins with the name of a function followed by '@' and ends with an argument, which can be a \fBRANGE\fR, \fBEXPRESSION\fR, \fBPATTERN\fR or nothing.

The hook name can be preceded by a '+' or '-' character. If it is '-', the hook's match is inverted.

.nf
\&
.in +4m
\fBNAME@PATTERN\fR
\fBNAME@[RANGE]\fR
\fBNAME@"EXPRESSION"\fR
\fBNAME@\fR
\fB+NAME@[RANGE]\fR
\fB-NAME@"EXPRESSION"\fR
.in
\&

List of implemented hooks:

.TP
.BR m ",  " match " " \fI"PATTERN"\fR
Get tags with insides that match \fIPATTERN\fR, by default pattern flags set to "uWcas".
.TP
.BR a ",  " attributes " " \fI[RANGE]\fR
Get tags with attributes that are within the \fIRANGE\fR.
.TP
.BR L ",  " level " " \fI[RANGE]\fR
Get tags that are on level within \fIRANGE\fR.
.TP
.BR l ",  " levelrelative " " \fI[RANGE]\fR
Get tags that are on level relative to parent within the \fIRANGE\fR.
.TP
.BR c ",  " count " " \fI[RANGE]\fR
Get tags with descendant count that is within the \fIRANGE\fR.
.TP
.BR C ",  " childmatch " " \fI"EXPRESSION"\fR
Get tags in which chained \fIEXPRESSION\fR (see \fB-F\fR option) matches at least one of its children.
.TP
.BR e ",  " endmatch " " \fI"PATTERN"\fR
Get tags with insides of ending tag that match \fIPATTERN\fR, by default pattern flags set to 'uWcnfs'.
.TP
.BR P ",  " position " " \fI[RANGE]\fR
Get tags with position within \fIRANGE\fR.
.TP
.BR p ",  " positionrelative " " \fI[RANGE]\fR
Get tags with position relative to parent within \fIRANGE\fR.
.TP
.BR I ",  " index " " \fI[RANGE]\fR
Get tags with starting index of tag in file that is within \fIRANGE\fR.
.PP

Access hooks specify what nodes will be matched:

.TP
.BR full
Matches node itself and everything below it (set by default).
.TP
.BR self
Matches only node itself (similar to \fBl@[0]\fR).
.TP
.BR child
Matches nodes that are only one level higher than self (similar to \fBl@[1]\fR).
.TP
.BR desc ",    " descendant
Matches nodes that are lower than self (similar to \fBl@[1:]\fR).
.TP
.BR ancestor
Matches nodes that are ancestors of self, relative level of 0 matches to parent.
.TP
.BR parent
Matches node that is a parent of self.
.TP
.BR rparent ", " relative_parent
Matches node that matched self in script e.g. \fB'TAG1; TAG2; * rparent@'\fR will return \fBTAG1\fR.
.TP
.BR sibl ",    " sibling
Matches siblings of self.
.TP
.BR spre ",    " sibling_preceding
Matches preceding siblings of self.
.TP
.BR ssub ",    " sibling_subsequent
Matches subsequent siblings of self.
.TP
.BR fsibl ",   " full_sibling
Matches siblings of self and nodes below them.
.TP
.BR fspre ",   " full_sibling_preceding
Matches preceding siblings of self and nodes below them.
.TP
.BR fssub ",   " full_sibling_subsequent
Matches subsequent siblings of self and nodes below them.

.SS TAG
Each \fBTAG\fR must begin with a \fBPATTERN\fR for the HTML tag name, which can be followed by any number of \fBATTRIBUTE\fRs and \fBHOOK\fRs.

A \fBRANGE\fR separated by spaces matches the position of results relative to their parent nodes or, if specified before the tag \fBPATTERN\fR, the absolute position among all results.

.nf
\&
.in +4m
\fBPATTERN\fR
\fBPATTERN ATTRIBUTE... HOOK... [RANGE]\fR - match RANGE to results relative to parent nodes
\fB[RANGE] PATTERN\fR - match RANGE to results
.in
\&

\fBTAG\fR, \fBATTRIBUTE\fRs and \fBHOOK\fRs can be grouped in '(' ')' brackets. The ')' has to be preceded by a space, otherwise it will be treated as part of an argument.

.nf
\&
.in +4
    \fB(... )\fR - correct
    \fB( ... )\fR - correct
    \fB( ...)\fR - incorrect
    \fB( ( ... )(... ))\fR - correct
.in
\&

If groups are 'touching' each other (')' directly followed by '(', with no space in between), they match if at least one of them matches. Groups cannot contain position or access hook definitions. If a \fBTAG\fR pattern is not specified before any groups, then all of the first groups will specify it.

.nf
\&
.in +4m
    \fBTAG ( ATTRIB1 HOOK )( ATTRIB2 ( ATTRIB3 ATTRIB4 )( ATTRIB5 ) )\fR - TAG having either ATTRIB1 with HOOK or ATTRIB2 which has ATTRIB3 and ATTRIB4 or ATTRIB5
    \fBTAG ( ATTRIB1 ) ( ATTRIB2 )\fR - TAG having both ATTRIB1 and ATTRIB2. Since groups have space in between they are not 'touching'.
    \fB( TAG1 HOOK )( TAG2 ) ATTRIB\fR - either TAG1 with HOOK or TAG2 both having ATTRIB.
.in
\&


.SS TAG_FORMAT
It has to be specified in '"' or '\\'' quotes.

If format is not specified it will be set to "%C\\n".

\fIi\fR, \fIt\fR, \fIT\fR, \fIC\fR, \fIa\fR, \fIv\fR, \fIS\fR, \fIE\fR, \fIe\fR directives can be preceded with '\fIU\fR' and '\fID\fR' characters that change their output e.g. '%Ui', '%(href)DUa', '%1UDa'. '\fIU\fR' causes output to be untrimmed (by default they are trimmed), '\fID\fR' decodes html escape codes.

Prints output according to \fBFORMAT\fR interpreting '\e' escapes and `%' directives. The escapes and directives are:
.RS
.IP \ea
Alarm bell.
.IP \eb
Backspace.
.IP \ef
Form feed.
.IP \en
Newline.
.IP \er
Carriage return.
.IP \et
Horizontal tab.
.IP \ev
Vertical tab.
.IP \e0
ASCII NUL.
.IP \e0\fIXXX\fR
Byte in octal.
.IP \ex\fIXX\fR
Byte in hexadecimal.
.IP \eu\fIXXXX\fR
UTF-8 character.
.IP \eU\fIXXXXXXXX\fR
UTF-8 character.
.IP \e\e
A literal backslash (`\e').
.IP %%
A literal percent sign.
.IP %n
Tag's name.
.IP %i
Tag's insides.
.IP %t
Tag's text.
.IP %T
Tag's text, recursive.
.IP %l
Tag's level relative to parent.
.IP %L
Tag's level.
.IP %s
Tag's size.
.IP %c
Tag's count of descendants.
.IP %C
Contents of tag.
.IP %a
All of the tag's attributes.
.IP %v
Values of tag's attributes separated with '"'.
.IP %\fIk\fPv
Value of tag's attribute, where \fIk\fP is its position counted from zero.
.IP %(\fIk\fP)v
Value of tag's attribute, where \fIk\fP is its name.
.IP %S
Starting tag.
.IP %E
Ending tag.
.IP %e
Contents of ending tag.
.IP %P
Position of tag.
.IP %p
Position of tag relative to parent.
.IP %I
Index of tag's starting position in file.

.SS FUNCTION
Begins with a name followed by arguments separated by whitespace.

A \fBFUNCTION\fR can take up to 4 arguments, each enclosed either in [] brackets or in "" or '' quotes.

.nf
\&
.in +4m
\fBNAME\fR - function with no arguments
\fBNAME [list] "text1" "text2"\fR - function with first argument as a list, and second and third as text
.in
\&

List of implemented functions:

.B line \fI[SELECTED]\fR \fI"DELIM"\fR
.IP
Return selected lines. Lines are numbered from 1, which is therefore the minimal value in \fISELECTED\fR.

Lines are separated by \fIDELIM\fR (by default '\\n').
.TP

.B trim \fI"DELIM"\fR
.IP
Trim whitespace at the beginning and end of the whole input.

Input can be split by \fIDELIM\fR and each part trimmed separately.
.TP

.B sort \fI"FLAGS"\fR \fI"DELIM"\fR
.IP
Sort input delimited by \fIDELIM\fR (by default '\\n').

Flags:
    r - reverse the results of comparison
    u - omit repeated lines
.TP

.B uniq \fI"DELIM"\fR
.IP
Filter out repeating lines from input delimited by \fIDELIM\fR (by default '\\n').
.TP

.B echo \fI"TEXT1"\fR \fI"TEXT2"\fR
.IP
Print \fITEXT1\fR before the input and \fITEXT2\fR after.
.TP

.B tr \fI"STR1"\fR \fI"STR2"\fR \fI"FLAGS"\fR
.IP
Translate characters in \fISTR1\fR to \fISTR2\fR.

If \fISTR2\fR is not specified characters in \fISTR1\fR will be deleted.

special STR syntax:
    \fBCHAR1-CHAR2\fR - all characters from CHAR1 to CHAR2 in ascending order
    \fB[CHAR*REPEAT]\fR - REPEAT copies of CHAR
    \fB[:space:]\fR - support for common character types, written in lower case

Flags:
    s - replace repeating sequences of characters with only one
    c - use the complement of \fISTR1\fR
.TP

.B cut \fI[SELECTED]\fR \fI"DELIMS"\fR \fI"FLAGS"\fR \fI"DELIM"\fR
.IP
Return selected parts from input delimited by \fIDELIM\fR (by default '\\n').
The minimal value in \fISELECTED\fR is 1.

\fIDELIMS\fR specifies the field delimiters and makes the selection apply to fields; if none are specified, selection is based on bytes.

\fIDELIMS\fR have the same syntax as \fBtr\fR \fISTR\fR.

Flags:
    s - do not return lines with no delimiters
    c - complement \fISELECTED\fR
    z - sets \fIDELIM\fR to '\\0'
.TP

.B sed \fI"SCRIPT"\fR \fI"FLAGS"\fR \fI"DELIM"\fR
.IP
Implementation of \fBsed\fR(1).

Lines are delimited by \fIDELIM\fR (by default '\\n').

Flags:
    n - suppress automatic printing of pattern space
    z - set \fIDELIM\fR to '\\0'
    E - use extended regexp

Deviations from standard:
    \fBi\fR \fBc\fR \fBa\fR commands do nothing even though they take arguments
    \fBl\fR \fBr\fR \fBR\fR \fBQ\fR \fBw\fR \fBW\fR are not implemented
.TP

.B rev \fI"DELIM"\fR
.IP
Reverse order of characters in every line delimited by \fIDELIM\fR (by default '\\n').
.TP

.B tac \fI"DELIM"\fR
.IP
Reverse order of input lines delimited by \fIDELIM\fR (by default '\\n').
.TP

.B wc \fI"FLAGS"\fR \fI"DELIM"\fR
.IP
Print count of lines, words, bytes.

Input is delimited by \fIDELIM\fR (by default '\\n').

Flags:
    l - lines count
    w - words count
    c - bytes count
    L - size of the longest line

If multiple values are returned each will be separated with '\\t'.

If no flags are given then flags are set to "lwc".
.TP

.B decode
.IP
Decode html escape codes.

.SS FORMAT
Consists of \fBFUNCTION\fRs separated by whitespace. The output of the tag is passed to the first \fBFUNCTION\fR, whose output is passed to the next, and so on until the last one, which prints it.

If specified after '|', \fBFORMAT\fR will be executed separately for each matched tag.

If specified after '/', \fBFORMAT\fR will be executed for the whole output.

.nf
\&
.in +4m
\fBFUNCTION FUNCTION...\fR
.in
\&

.SS NODE
Consists of \fBTAG\fRs and \fBEXPRESSION\fRs separated by ';', which passes the result of each one on to the next.

Output \fBFORMAT\fR can be specified after '|' and '/' characters, everything after it will be taken as \fBFORMAT\fR.

.nf
\&
.in +4m
\fBTAG1; TAG2; NODE\fR - matches results of TAG1 by TAG2 and by NODE
\fBNODE1; NODE2\fR - process the results of NODE1 by NODE2
.in
\&

.SS EXPRESSION
Consists of \fBNODE\fRs separated by ',' and grouped in '{' '}' brackets (which accumulate their output and increase their level).
.nf
\&
.in +4m
\fBNODE1, NODE2\fR - two expressions
\fBEXPRESSION1; { EXPRESSION2; {EXPRESSION3, EXPRESSION4}, EXPRESSION5}; EXPRESSION6\fR
.in
\&

.SS OUTPUT_FORMAT

The output is shaped by the output \fBFORMAT\fR, which can be specified only for the last \fBEXPRESSION\fRs, or for \fBEXPRESSION\fRs that are children of groups having a \fBFORMAT\fR.

Output \fBFORMAT\fR can be specified after '|' and '/' characters, everything after it will be taken as \fBFORMAT\fR.

If the first thing in \fBFORMAT\fR specified after '|' character of a node is a \fI"TEXT"\fR it will be used as \fBTAG_FORMAT\fR.

\fBFORMAT\fR after the '|' character is executed on each matched element, and after '/' on the whole result.

Groups with format after '|' will execute their \fBEXPRESSION\fRs for each element in input independently, contrary to normal behavior where the child processes all the input at once before going to the next.

.nf
\&
.in +4m
\fBNODE1; NODE2 | TAG_FORMAT FORMAT1 / FORMAT2\fR - matches NODE2 to results of NODE1, then prints them one by one with TAG_FORMAT and processes them with FORMAT1, then processes everything with FORMAT2
\fBNODE1 | FORMAT, NODE2 / FORMAT\fR
\fBNODE1; { NODE1 | FORMAT, NODE2 / FORMAT }\fR
\fBNODE1; { NODE1 | FORMAT, NODE2 / FORMAT } / FORMAT\fR - process results of GROUP
\fBNODE1; { NODE1 | FORMAT, NODE2 / FORMAT } | FORMAT / FORMAT\fR - process results of GROUP one by one, and then as a whole
.in
\&
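.fi
.PP
A sketch of how these pieces combine, using only functions documented under \fBFUNCTION\fR; the file name index.html is a placeholder and the commands are illustrative rather than recorded runs:
.nf
\&
.in +4m
\fB$ reliq 'a href | "%(href)v\en" / sort "u"' index.html\fR
\fB$ reliq 'a href | "%(href)v\en" / sort "u" wc "l"' index.html\fR
\fB$ reliq 'img src | "%(src)v\en" tr "A-Z" "a-z"' index.html\fR
.in
\&
.fi
.PP
The first command prints the href value of every matched tag on its own line and then sorts the whole output, omitting repeated lines. The second passes the sorted output on to \fBwc\fR, so that only the count of distinct lines is printed. The third applies \fBtr\fR to each matched tag separately, because it follows '|' rather than '/'.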

.SS OUTPUT_FIELD

Accumulates output and prints it in json format.

Is placed before an \fBEXPRESSION\fR; it starts with a '.' character followed by a name matching [A-Za-z0-9_-]+.

If field doesn't have a name it will be a protected field i.e. if the \fBEXPRESSION\fR matches nothing a newline will be printed.

To specify type of field the name has to be followed by '.' and type name.

Only the first letter of type matters (e.g. \fI.n\fR and \fI.number\fR are equivalent), but type name can consist only of alphanumeric characters.

List of types:
    \fI.s\fR             string

    \fI.b\fR             boolean value, returns true only if input starts with 'y', 'Y', 't', 'T', otherwise returns false

    \fI.n\fR             number, return the first found floating point number, if none found returns 0

    \fI.i\fR             integer, return the first found integer number, if none found returns 0

    \fI.u\fR             unsigned integer, return the first non negative integer number, if none found returns 0

    \fI.a\fR             array, of strings, delimited by '\\n'

    \fI.a("\\t")\fR       array of strings, delimited by the first character in the string, i.e. '\\t'

    \fI.a.type\fR        array of type, delimited by '\\n'

    \fI.a("-").type\fR   array of type, delimited by the first character in the string (only one can be specified)

Examples:

    if a field doesn't return to another field it is globally available, and even if \fBdiv .author\fR is not found the fields will be printed.
    \fB.title h2, div .author; { .image img, .bolded b }\fR
    \fI{"title":"...","image":"...","bolded":"..."}\fR

    if field has fields as an input it is an object.
    \fB.title h2, .author div .author; { .image img, .bolded b }\fR
    \fI{"title":"...","author":{"image":"...","bolded":"..."}}\fR

    if field has fields as an input and expressions without fields, first it prints out result of expressions, and then prints the object.
    \fBdiv .author; { .image img, .bolded b, a }\fR
    \fI<a objects>\fR
    \fI{"author":{"image":"...","bolded":"..."}}\fR

    since it has no field to return to as an array it creates incorrect json objects with repeating fields.
    \fBdiv .author; { .images img, .bolded b } |\fR
    \fI{"images":"...","bolded":"...","images":"...","bolded":"..."}\fR

    blocks ended with '|' character are treated as arrays. If such block has no input, it returns "[]".
    \fB.authors div .author; { .images img, .bolded b } |\fR
    \fI{"authors":[{"images":"...","bolded":"..."}]}\fR
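.fi
.PP
A further sketch combining a field type with output formats (the field names are hypothetical, "..." stands for matched text, and the output shape is inferred from the examples above rather than recorded from a run):
.nf
\&
.in +4m
\fB.title h1 | "%i", .links.a a href | "%(href)v\en"\fR
\fI{"title":"...","links":["...","..."]}\fR
.in
\&
.fi
.PP
Here \fI.links.a\fR declares an array of strings, so the newline written after each href value serves as the array delimiter.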

.SS COMMENTS

Basic C style comments are implemented, although they have to be preceded by whitespace.

.SH NOTES

UTF-8 \fI\eu\fR and \fI\eU\fR escape directives don't work in any place where a delimiter is to be specified, i.e. array separator in fields, in translation tables of \fBtr\fR, in \fIy\fR \fBsed\fR command, in delimiters of \fBsed\fR, \fBtr\fR, \fBline\fR, \fBtrim\fR, \fBuniq\fR, \fBsort\fR, \fBecho\fR, \fBcut\fR, \fBrev\fR, \fBtac\fR, \fBwc\fR commands.

.SH EXAMPLES
Get tags 'a' with attribute 'href' at position 0 of value ending with '.org' or '.com', from result of matching tags 'div' with attribute 'id', and without attribute 'class', from file 'index.html'.
.nf
\&
.in +4m
.B $ reliq 'div id \-class; a [0]href=E>".*\\\\.(org|com)"' index.html
.in
\&
Get tags that don't have any tags inside them.
.nf
\&
.in +4m
.B $ reliq '* c@[0]' index.html
.in
\&
Get empty tags.
.nf
\&
.in +4m
.B $ reliq '* m@>[0]' index.html
.in
\&
Get hyperlinks from level greater or equal to 6 (counting from zero).
.nf
\&
.in +4m
.B $ reliq 'a href l@[6:] | \(dq%(href)v\\\\n\(dq' index.html
.in
\&
Get all URLs from 'a' and 'img' tags, where image sources end with '.png'.
.nf
\&
.in +4m
.B $ reliq 'img src=e>.png | \(dq%(src)v\\\\n\(dq, a href | \(dq%(href)v\\\\n\(dq' index.html
.in
\&
Get all URLs in a 'div' with class 'index' or in a 'ul', both having a 'title' attribute.
.nf
\&
.in +4m
.B $ reliq '( div .index )( ul ) title; a href | \(dq%(href)v\\\\n\(dq' index.html
.in
\&

.SH EXIT STATUS
.sp
\fB0\fP
.RS 4
success
.RE
\fB5\fP
.RS 4
system call failure
.RE
\fB10\fP
.RS 4
mangled html
.RE
\fB15\fP
.RS 4
incorrect script
.RE

.SH AUTHOR
Dominik Stanisław Suchora <suchora.dominik7@gmail.com>