Description

A method combining predictive pattern mining (itemset, sequence) using a linear model with L1 regularization and statistical hypothesis test considering selection bias.
Note that this code is created based on SafePatternPruning(https://github.com/takeuchi-lab/SafePatternPruning).

Reference paper

Nakagawa, K., Suzumura, S., Karasuyama, M., Tsuda, K., & Takeuchi, I. (2016, August). Safe pattern pruning: An efficient approach for predictive pattern mining. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 1785-1794). ACM.[pdf]
Suzumura, S., Nakagawa, K., Umezu, Y., Tsuda, K., & Takeuchi, I. (2017, August). Selective inference for sparse high-order interaction models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 3338-3347). JMLR. org.[pdf]

Verified Environmental

gcc version 6.4.0
GNU Make 3.81
Eigen 3.3.4(Install yourself)

Files

.
├── README.md
└── src
    ├── Eigen/
    ├── crossvalidation.cc
    ├── crossvalidation.h
    ├── database.cc
    ├── database.h
    ├── lassospp.cc
    ├── lassospp.h
    ├── learnerspp.h
    ├── makefile
    ├── modelselector.h
    ├── prefixspan.cc
    ├── prefixspan.h
    ├── selectiveInference.cc
    ├── selectiveInference.h
    └── train.cc

Setup

$ cd src/ && make

Usage

$ ./train [-options] [filename]

Option

    -t : learning regularization parameter index (besides CV) (default:min lambda)
    -m : min support (default 1)
    -L : max pattern length  (default 10)
    -T : number of regularization path (default 100)
    -r : min regularization path ratio (default 2.0)
    -i : max iteration (default 1000000)
    -f : frequency of convergence check (default 50)
    -e : tolerance (default 1e-6)
    -F : output filename (default output/result.csv)
    -p : max event interval (default 0|-1:none)
    -c : is cross validation (default 0:false|1:true)
        -k : k-fold (default:10)
        -a : average (default:10)
    -P : is cross validation for performance evaluation (default 0:false|1:true)
    -I : importance sampling(default 1:true|0:false)
    -R : sampling(importance sampling , default 1e+8)
    -C : is CloSpan (default 0:false|1:true)
    -s : is support frequency (default 0:false|1:true)

Example

./train -L 30 -F output/result.csv data/test.txt

Demo

Input Data

As shown in the following example, input data should have a obvective variable(real number) at the beginning and a sequence of symbols behind it. Note that separate each column with space, and the symbol must a positive integer.

Itemset

-0.0340430505 (19_39_75)
-0.0462595143 (9_10_30_36_38_53)
-0.0532870357 (2_3_17_22_23_38_51_73)
-0.0756281423 (2_85)
-0.0922903300 (1_11_31_32_36_58_59_62_65_70_77_95_99)
-0.1064798239 (4_15_26_28_55_69_76_91_93_99)
-0.1144663632 (10_40_44_67_78_80_87_96)
-0.1179265562 (20_25_35_37_43_64_68_82_83_87_89_91_99)
-0.1425129899 (29_38_54_57_66_75_83_98)
-0.1432412355 (17)
-0.1665651004 (11_20_59_72_95)
-0.1734397456 (8_27_37_42_44_49_55_58_76_80_83_91_94)
︙

For example, the line

-1.15 (1_2_3)

indicates the objective variable is "-1.15" and the itemset "(1,2,3)".

Sequence

1.1532180855 1 2 1 2 2 1 2 1 2 1 2 3 3 4 1 3 4 3 4 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 2 3 4 4 1
0.4647166093 1 2 1 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
-0.1800180228 1 2 1 2 3 1 2 1 2 1 2 4 3 4 1 1 2 1 2 1 2 1 2 1 2 1 2 3 3 4 3 4 3 4 3 4 3 4 3 4 3
-0.6434264106 1 2 1 2 4 1 2 1 2 1 2 1 3 4 3 4 4 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
-0.1432412355 3 4 3 4 4 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 2 3 4 3 4 3 4 3 4 3 4 3 4 4
3.1031693637 1 2 1 2 1 2 3 3 4 3 3 4 3 4 3 1 2 14 3 4 3 4 3 4 3 4 4
-0.3882494977 1 2 1 2 3 1 2 1 2 1 2 3 1 2 1 2 1 2 1 2 1 2 1 2 3 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 2 3 4 3 4 3 4 3 4 3 4 3 4 3 4 4
1.5801588357 1 2 1 2 2 3 4 3 3 4 3 4 4 1 2 1 2 1 2 1 2 1 2 1 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 3 4 3 4 3 4 3 4 3 4 3
0.9166487651 1 2 1 2 1 2 2 3 4 2 1 2 1 2 1 2 1 2 1 2 1 2 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 3 3 4 3 4 3 4 3 4 4
-1.1884857828 1 2 1 2 4 3 4 2 3 4 3 4 2 1 2 1 2 1 2 1 2 1 2 1 2 1 3 4 3 4 3 4 3 4 3 4 3 4 3
︙

For example, the line

-1.15 1 2 1 2 3

indicates the objective variable is "-1.15" and the sequence "1 2 1 2 3".

Output

For the output result, the total support, the lambda when that the pattern was added, the weight and p-value including that pattern are output in CSV for each extracted pattern as follows.

Itemset

pattern,supportSum,addλ,w,p-value
(0),25,53,-0.251998,0.16729
(0_23),4,90,-0.533815,0.0327464
(0_31_65),2,64,0.568254,0.256419
(0_42),4,47,1.44459,0.376632
(0_49),2,43,-0.696259,0.316504
︙

Sequence

pattern,supportSum,addλ,w,p-value
1 1 2 1 2 1 2 1 2 1,50,21,0.245609,0.167073
1 2 1 2 1 2 1 2 1 2,233,1,0.586735,1.09544e-26
1 2 1 2 2,128,28,-0.0590423,4.185e-05
2 1 2 1 2 1 2 1 2 1,233,1,0.607691,9.99603e-28
︙

License

GNU General Public License v3.0

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
data		data
src		src
LISENCE.txt		LISENCE.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Description

Reference paper

Verified Environmental

Files

Setup

Usage

Option

Example

Demo

Input Data

Itemset

Sequence

Output

Itemset

Sequence

License

About

Releases

Packages

Languages

takeuchi-lab/SelectiveInference-for-SafePatternPruning

Folders and files

Latest commit

History

Repository files navigation

Description

Reference paper

Verified Environmental

Files

Setup

Usage

Option

Example

Demo

Input Data

Itemset

Sequence

Output

Itemset

Sequence

License

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages