Locations:
DATA:
page rank and diff output:
/home/j/anick/patent-classifier/ontology/roles/data/polarity/bio_2003
input data for pr (e.g. a.tf files)
/home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv_tas_cond
(pr output for wt_type="cp")
/home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A27-molecular-biology/data/tv_tas
CODE:
run_pr.py runs page rank
def run_prs(db, year, tv_dir, size=200, wt_type="mi") => term_file, feature_file, diff_file
This computes page rank for pos/neg/both using a full graph with higher teleportation rates for seed set terms
def run_pos_neg_pr(db, year, tv_dir, size=200, wt_type="cp", pos_file, neg_file)
This uses selected feature subsets in the graph, rather than all links. It does not give higher teleportation to
seed set terms.
polarity.py do_diff computes the diff_file
This contains various metrics differentiating pos and neg terms and features
select_feats: takes a feature diff file and selects a subset of pos/neg features => pos and neg feats files
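To make the idea of "higher teleportation rates for seed set terms" concrete, here is a minimal, dependency-free sketch of personalized PageRank by power iteration. It is not the code in run_pr.py (which may well rely on networkx); the function name, the seed_boost parameter, and the toy graph are all hypothetical and only illustrate the technique.

```python
def personalized_pagerank(edges, seeds, alpha=0.85, seed_boost=10.0, iters=50):
    """Power-iteration PageRank where seed nodes get a larger teleport share.

    edges: dict mapping node -> {neighbor: weight}
    seeds: set of nodes to favor (hypothetical seed terms)
    """
    nodes = set(edges)
    for nbrs in edges.values():
        nodes.update(nbrs)
    nodes = sorted(nodes)
    # Teleport distribution: seed nodes weigh seed_boost, all others 1.
    raw = dict((n, seed_boost if n in seeds else 1.0) for n in nodes)
    total = sum(raw.values())
    teleport = dict((n, w / total) for n, w in raw.items())
    rank = dict(teleport)
    for _ in range(iters):
        nxt = dict((n, (1.0 - alpha) * teleport[n]) for n in nodes)
        for n in nodes:
            out = edges.get(n, {})
            out_sum = sum(out.values())
            if out_sum == 0:
                # Dangling node: spread its mass by the teleport distribution.
                for m in nodes:
                    nxt[m] += alpha * rank[n] * teleport[m]
            else:
                for m, w in out.items():
                    nxt[m] += alpha * rank[n] * w / out_sum
        rank = nxt
    return rank


# Toy term/feature graph (made-up data, links in both directions).
graph = {
    "cost": {"prev_V=reduce": 1.0, "prev_V=increase": 1.0},
    "speed": {"prev_V=increase": 1.0},
    "prev_V=reduce": {"cost": 1.0},
    "prev_V=increase": {"cost": 1.0, "speed": 1.0},
}
pos_rank = personalized_pagerank(graph, seeds=set(["speed"]))
neg_rank = personalized_pagerank(graph, seeds=set(["cost"]))
```

Comparing pos_rank and neg_rank node by node gives the kind of pos/neg difference that a diff file could record; which metrics do_diff actually computes is not shown here.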
The data (term_features) is created by
run_term_features_multi.sh
run_term_features.sh
term_features.sh
term_features.py
This runs over phr_feats files and creates a single line for each term/desired_feature in a file and then sums up term/feature combinations
over documents, resulting in a doc count for every combination. It either runs over title and abstract (ta) or title, abstract, summary (tas).
The latter may miss sections like DESC.
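As a rough illustration of the counting step described above (one line per term/feature occurrence, then a document count per combination), the following sketch assumes the occurrences are already available as (doc_id, term, feature) tuples. The function name and the sample rows are hypothetical; term_features.py itself reads phr_feats files and may work differently.

```python
from collections import Counter


def count_pair_docs(doc_rows):
    """doc_rows: iterable of (doc_id, term, feature) tuples, one per occurrence.

    Returns a Counter mapping (term, feature) -> number of distinct documents.
    """
    seen = set(doc_rows)  # collapse repeated occurrences within a document
    return Counter((term, feat) for _doc, term, feat in seen)


rows = [
    ("d1", "cost", "prev_V=reduce"),
    ("d1", "cost", "prev_V=reduce"),  # same doc again: counted once
    ("d2", "cost", "prev_V=reduce"),
    ("d2", "cost", "prev_V=increase"),
]
pair_docs = count_pair_docs(rows)
```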
Note that es_np_index.py also populates from the phr_feats file and may be an alternative source for the chunk data, features and frequencies.
Gitit code/data
/home/j/llc/gititkeh/PageRank/annotation
Annotation data transferred into tab separated unix file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/annot
polarity.annot.computers.gitit
working here: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv_tas_cond
python networkx package is installed on sarpedon and pasiphae for python2.7
Gitit's page rank code in /home/j/llc/gititkeh/PageRank
Gitit data in /home/j/llc/gititkeh/malletex/health_data
The files are in /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers_test_pa/data/tv
2002.a.pn.tcs has the training terms and their labels
computer domain data is at /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv
code to create .tf, .feats, .terms, and .cs is in tf.py
.tf: term, feature, pair_freq, pair_prob, prob_fgt
prob_fgt = pair_freq/term_freq where
pair_freq is the # docs in which feature/term cooccur
term_freq = # docs in which term occurs
.terms: term term_freq term_instance_freq term_prob
.feats: feature, feat_freq, feat_instance_freq, feat_prob
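A small sketch of the prob_fgt definition above, assuming the document counts are held in plain dictionaries. The function name tf_rows and the counts are made up for illustration; tf.py computes additional fields (such as pair_prob) that are not reproduced here.

```python
def tf_rows(pair_freq, term_freq):
    """Build (term, feature, pair_freq, prob_fgt) rows.

    pair_freq: dict (term, feature) -> # docs in which the pair co-occurs
    term_freq: dict term -> # docs in which the term occurs
    """
    rows = []
    for (term, feat), pf in sorted(pair_freq.items()):
        prob_fgt = float(pf) / term_freq[term]  # P(feature | term)
        rows.append((term, feat, pf, prob_fgt))
    return rows


# Hypothetical counts: "cost" occurs in 200 docs.
pair_freq = {("cost", "prev_V=reduce"): 50, ("cost", "prev_V=increase"): 10}
term_freq = {"cost": 200}
rows = tf_rows(pair_freq, term_freq)
```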
4/4/15 Rerunning tv files using canonicalization
# first make a copy of the old directory
cd /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/
mv tv tv_20150404_uncanon
mkdir tv
# create 2002 computer data
[anick@sarpedon roles]$ python tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/term_features/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/ 2002 2002
To get a set of candidate attribute terms, extract those which occur with a feature prev_Npr and value including "_of". Sort by dispersion of the feature across different terms.
cat 2002.tf | grep prev_Npr | grep _of | sed -e 's/^.*=//' | sed -e 's/_of.*$//' | sort | uniq -c | sortnr -k1 > 2002.tf.attr_of.uc
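For readers who prefer Python to the shell pipeline, here is an approximate equivalent. It assumes each .tf line contains the feature as prev_Npr=<noun>_of and that the last "=" on the line belongs to that feature, which is the same assumption the sed expressions make. The function name and sample lines are hypothetical.

```python
import re
from collections import Counter


def attr_of_counts(tf_lines):
    """Count how many term/feature lines each 'X_of' head noun appears in."""
    counts = Counter()
    for line in tf_lines:
        line = line.rstrip("\n")
        if "prev_Npr" not in line or "_of" not in line:
            continue
        head = re.sub(r"^.*=", "", line)    # drop everything through the last '='
        head = re.sub(r"_of.*$", "", head)  # drop '_of' and the rest of the line
        counts[head] += 1
    # Most widely dispersed candidates first, like `uniq -c | sort -nr`.
    return counts.most_common()


sample = [
    "system\tprev_Npr=cost_of\t12\t0.1\t0.2\n",
    "device\tprev_Npr=cost_of\t3\t0.1\t0.1\n",
    "device\tprev_Npr=size_of\t4\t0.1\t0.1\n",
    "device\tprev_V=reduce\t9\t0.1\t0.1\n",
]
candidates = attr_of_counts(sample)
```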
///In tf.py, TODO: canonicalize and filter .terms and .feats
DONE
------
in python2.7
import role
>>> role.run_tf_steps("ln-us-A21-computers", 2002, 2002, "act", ["tc", "tcs", "fc", "uc", "prob"])
[run_tf_steps]tv_root: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/, fcat_file: /home/j/anick/patent-classifier/ontology/roles/seed.act.en.dat, cat_list: ['a', 'c', 't']
[tv_filepath]file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/
Mon Apr 6 20:09:13 2015 0 Starting run_tf_steps for years: 2002 2002
Mon Apr 6 20:09:13 2015 0 Starting tc step
[run_tf_steps]Creating .tc, .tfc
[run_tf2tfc]Processing dir: 2002
[tv_filepath]file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/2002.tf
[tv_filepath]file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/2002.act.tc
[tv_filepath]file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/2002.act.tfc
[run_tf2tfc]Completed: 2002.tc
Mon Apr 6 20:34:01 2015 1487 Completing tc step
Mon Apr 6 20:34:01 2015 0 Starting tcs step
[run_tf_steps]Creating .tcs
[run_tc2tcs]Processing dir: 2002
[tv_filepath]file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/2002.act.tc
[tv_filepath]file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/2002.act.tcs
Mon Apr 6 20:34:25 2015 24 Completing tcs step
Mon Apr 6 20:34:25 2015 0 Starting fc step
[run_tf_steps]Creating .fc
[run_tcs2fc]Processing dir: 2002
[tv_filepath]file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/2002.act.tcs
[tv_filepath]file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/2002.tf
[tv_filepath]file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/2002.act.fc
Mon Apr 6 20:48:24 2015 838 Completing fc step
Mon Apr 6 20:48:24 2015 0 Starting uc step
[run_tf_steps]Creating .fc_uc
[run_fc2fcuc.sh] SUBSET is [.], cat_type is [act], filestr_before_year is [/home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/], filestr_after_year is [.act]
[run_fc2fcuc.sh]input_file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/2002.act.fc, output_file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/2002.act.fc_uc
Mon Apr 6 20:48:31 2015 7 Completing uc step
Mon Apr 6 20:48:31 2015 0 Starting prob step
[run_tf_steps]Creating .fc_prob, fc_cat_prob and .fc_kl
[run_fcuc2fcprob]Processing dir: 2002
[fcuc2fcprob]cat_list: ['a', 'c', 't']
[tv_filepath]file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/2002.act.fc_uc
[tv_filepath]file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/2002.act.fc_prob
[tv_filepath]file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/2002.act.cat_prob
[tv_filepath]file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/2002.act.tcs
[tv_filepath]file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/2002.act.fc_kl
[fcuc2fcprob]category: a, cum_fgc_prob (should total to 1.0): 1.000000
[fcuc2fcprob]category: c, cum_fgc_prob (should total to 1.0): 1.000000
[fcuc2fcprob]category: t, cum_fgc_prob (should total to 1.0): 1.000000
Mon Apr 6 20:53:41 2015 310 Completing prob step
[run_tf_steps]Completed
Mon Apr 6 20:53:41 2015 2668 [run_tf_steps]Completed
------------------------
from nbayes.py
# (3) nbayes.run_steps("ln-us-A21-computers", 2002, ["nb", "ds", "cf"])
# (4) nbayes.run_filter_tf_file("ln-us-A21-computers", 2002, "0.0") # create a.tf, needed for running polarity
# (5) role.run_tf_steps("ln-us-A21-computers", 2002, 2002, "pn", ["tc", "tcs", "fc", "uc", "prob"], "a")
# (6) nbayes.run_steps("ln-us-A21-computers", 2002, ["nb", "ds", "cf"], cat_type="pn", subset="a")
To see attrs sorted by freq:
cat 2002.act.cat.w0.0 | grep ' a ' | sortnr -k3 | cut -f1,2,3 | more
Note that "cost" is categorized as positive, along with many specializations. Often there are conflicting features (increase/decrease).
cat 2002.a.pn.cat.w0.1 | cut -f1,3,4,7,8,9 | grep ' p ' | sortnr -k2 | grep cost | more
cost 1918 p -11763.1400271 -14084.0394336 prev_V=increase^866 prev_V=incur^124 prev_V=raise^23 prev_V=allow_for^5 prev_V=prevent^1 prev_V=experience^2 prev_V=desire^1 prev_J=substantial^58 prev_V=assess^8 prev_Npr=lack_of^1 prev_J=potential^8 prev_Npr=%_of^12 prev_V=generate^7 prev_V=satisfy^2 prev_V=avoid^58 prev_V=concern^2 prev_V=minimize^176 prev_V=decrease^120 prev_V=support^8 prev_J=considerable^54 prev_V=suffer_from^15 prev_Npr=advantage_of^8 prev_V=relate_to^12 prev_V=eliminate^39 prev_V=lower^148 prev_V=cause^41 prev_V=establish^8 prev_V=facilitate^1 prev_V=realize^6 prev_V=contribute_to^19 prev_V=suffer^5 prev_J=excessive^27 prev_V=lead_to^28 prev_V=suppress^2 prev_V=reflect^6 prev_V=introduce^17
manufacturing cost 80 p -349.705180981 -667.265848847 prev_V=raise^2 prev_Npr=%_of^3 prev_V=decrease^3 prev_V=cause^2 prev_V=lower^3 prev_V=increase^60 prev_V=incur^1 prev_V=minimize^5 prev_V=suppress^1
system cost 39 p -152.696877136 -336.058240038 prev_V=minimize^4 prev_V=increase^32 prev_Npr=%_of^1 prev_V=raise^2
? Do we get most of the increase cost occurrences within the background section?
As seen below, in the abstract we get:
reduce cost: 101
increase cost: 8
in the summary, we get:
reduce cost: 4
increase cost: 1463
This suggests that the abstract is more likely to reflect the "positive review" than the patent as a whole.
>>> r = es_np_query.qmamf(l_query_must=[["spv", "reduce"], ["sp", "cost ]"], ["section", "ABSTRACT"] ],l_fields=["spv", "cphr", "section"], query_type="count", index_name="i_cs_2002")
>>> r
{u'count': 101, u'_shards': {u'successful': 5, u'failed': 0, u'total': 5}}
>>> r = es_np_query.qmamf(l_query_must=[["spv", "increase"], ["sp", "cost ]"], ["section", "ABSTRACT"] ],l_fields=["spv", "cphr", "section"], query_type="count", index_name="i_cs_2002")
>>> r
{u'count': 8, u'_shards': {u'successful': 5, u'failed': 0, u'total': 5}}
>>> r = es_np_query.qmamf(l_query_must=[["spv", "increase"], ["sp", "cost ]"], ["section", "SUMMARY"] ],l_fields=["spv", "cphr", "section"], query_type="count", index_name="i_cs_2002")
>>> r
{u'count': 1463, u'_shards': {u'successful': 5, u'failed': 0, u'total': 5}}
>>> r = es_np_query.qmamf(l_query_must=[["spv", "reduce"], ["sp", "cost ]"], ["section", "SUMMARY"] ],l_fields=["spv", "cphr", "section"], query_type="count", index_name="i_cs_2002")
>>> 4
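To put the four counts on a common scale, the short calculation below computes the share of "reduce cost" among the reduce/increase hits in each section. The counts are the ones reported in these notes; the helper name is just for illustration.

```python
counts = {
    "ABSTRACT": {"reduce": 101, "increase": 8},
    "SUMMARY": {"reduce": 4, "increase": 1463},
}


def reduce_share(section):
    """Fraction of reduce/increase 'cost' hits that are 'reduce'."""
    c = counts[section]
    return float(c["reduce"]) / (c["reduce"] + c["increase"])


abstract_share = reduce_share("ABSTRACT")  # 101 / 109
summary_share = reduce_share("SUMMARY")    # 4 / 1467
```

Roughly 93% of the abstract hits are "reduce cost", against well under 1% of the summary hits.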
# create new directories for title-abstract data only
# the ta parameter causes output to be written to /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/term_features_ta/2002
sh run_term_features.sh ln-us-A21-computers 2002 2002 ta
ls -1 | wc -l
45431
# I moved the tv directory
cd /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data
mv tv tv_20150407_tas
mkdir tv
# I reran the Bayes analysis using just abstract data. The results (in /tv) are better but much, much smaller.
# It might be possible to combine abstracts across many years to get enough data.
#I also created canonical seed sets in fr_code dir (/roles):
seed.pn.en.canon.dat
seed.act.en.canon.dat
Conversion was done using: canon_seed_set.py
-----------------------------------------
Running all steps on bio domain
# Move the existing tv directory and create a new empty one
cd /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data
mv tv tv_20150407_tas
mkdir tv
# populate the term features dir with abstract only data (ta) or all (tas)
cd /home/j/anick/patent-classifier/ontology/roles
sh run_term_features.sh ln-us-A27-molecular-biology 2002 2002 tas
# in bash, populate the tv directory and run nbayes for act and pn classification
# first make a copy of the old directory
cd /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A27-molecular-biology/data/
mv tv tv_21050408_uncanon
mkdir tv
NOTE: after populating the tv directory, move the files to another location before rerunning using a different
term_features set (ta vs. tas)
# note: (using term_features) for title/abstract and summary/background sections
python tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A27-molecular-biology/data/term_features/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A27-molecular-biology/data/tv/ 2002 2002
(this is slow)
# version for title/abstract only (using term_features_ta)
python tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A27-molecular-biology/data/term_features_ta/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A27-molecular-biology/data/tv/ 2002 2002
# in python, do nbayes for act and pn over the abstract data
import role
import nbayes
# (2) role.run_tf_steps("ln-us-A27-molecular-biology", 2002, 2002, "act", ["tc", "tcs", "fc", "uc", "prob"])
# (3) nbayes.run_steps("ln-us-A27-molecular-biology", 2002, ["nb", "ds", "cf"])
# (4) nbayes.run_filter_tf_file("ln-us-A27-molecular-biology", 2002, "0.0") # create a.tf, needed for running polarity
# (5) role.run_tf_steps("ln-us-A27-molecular-biology", 2002, 2002, "pn", ["tc", "tcs", "fc", "uc", "prob"], "a")
# (6) nbayes.run_steps("ln-us-A27-molecular-biology", 2002, ["nb", "ds", "cf"], cat_type="pn", subset="a")
4/9/15 Fixed a bug in canon.py to make sure unicode characters were detected in illegal char regex. So any data prior to
this date might contain some illegal terms (e.g. 3 dash equals sign or R sign)
!! potential problem with doing canonicalization. "increased" as past participle modifier may be used as a negative, whereas
other forms may have invention as subject and hence be positive.
e.g. inflammatory response, cell death (negatives)
# 4/13/15 creating polarity functions for pagerank in polarity.py
#####################################
# Move the existing tv directory and create a new empty one (not necessary for new domain)
cd /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A23-semiconductors/data
mv tv_20141125_uncanon tv_20141125_uncanon_tas
mkdir tv
# populate the term features dir with abstract only data (ta) or all (tas)
cd /home/j/anick/patent-classifier/ontology/roles
#sh run_term_features.sh ln-us-A23-semiconductors 2002 2002 tas
not necessary since tas output exists for all years
# the "ta" option will write output to term_features_ta subdirectory
#sh run_term_features.sh ln-us-A23-semiconductors 2002 2002 ta
///
# in bash, populate the tv directory and run nbayes for act and pn classification
NOTE: after populating the tv directory, move the files to another location before rerunning using a different
term_features set (ta vs. tas)
mv tv tv_20140415_canon_tas
# note: (using term_features) for title/abstract and summary/background sections
# currently term_features
#for tas:
python2.7 tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A23-semiconductors/data/term_features/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A23-semiconductors/data/tv/ 2002 2002
(this is slow)
# for tas
python2.7 tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A23-semiconductors/data/term_features/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A23-semiconductors/data/tv/ 2002 2002
# version for title/abstract only (using term_features_ta)
python tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A23-semiconductors/data/term_features_ta/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A23-semiconductors/data/tv/ 2002 2002
# in python, do nbayes for act and pn over the abstract data
import role
import nbayes
# (2) role.run_tf_steps("ln-us-A23-semiconductors", 2002, 2002, "act", ["tc", "tcs", "fc", "uc", "prob"])
# (3) nbayes.run_steps("ln-us-A23-semiconductors", 2002, ["nb", "ds", "cf"])
# (4) nbayes.run_filter_tf_file("ln-us-A23-semiconductors", 2002, "0.0") # create a.tf, needed for running polarity
# (5) role.run_tf_steps("ln-us-A23-semiconductors", 2002, 2002, "pn", ["tc", "tcs", "fc", "uc", "prob"], "a")
# (6) nbayes.run_steps("ln-us-A23-semiconductors", 2002, ["nb", "ds", "cf"], cat_type="pn", subset="a")
4/16/15 completed NBayes analysis of semiconductors 2002
term_features contains tas data
term_features_ta contains ta data
tv_20140415_canon_ta contains ta files
tv_20140415_canon_tas contains tas files
------------------------------------------
4/18/15 Adding MI to .tf file
///NOTE: verify that term_features dir for 2002 semiconductors is full tas, not just ta data
Rerun tf.py on all domains/subsets
/home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv_ta
# for tas
python2.7 tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A23-semiconductors/data/term_features_tas/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A23-semiconductors/data/tv_tas_mi/ 2002 2002
# version for title/abstract only (using term_features_ta)
python tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A23-semiconductors/data/term_features_ta/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A23-semiconductors/data/tv_ta_mi/ 2002 2002
# for tas
python2.7 tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/term_features/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv_tas_mi/ 2002 2002
# version for title/abstract only (using term_features_ta)
python tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/term_features_ta/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv_ta_mi/ 2002 2002
# for tas
python2.7 tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A27-molecular-biology/data/term_features/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A27-molecular-biology/data/tv_tas_mi/ 2002 2002
# version for title/abstract only (using term_features_ta)
python tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A27-molecular-biology/data/term_features_ta/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A27-molecular-biology/data/tv_ta_mi/ 2002 2002
Creating a file with
term, feature, term_freq, npmi
removing last_word, prev_J features and any terms with freq = 1
cd /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A23-semiconductors/data/tv_ta_mi
cat 2002.tf | cut -f1,2,7,9 | grep -v ' 1 ' | grep -v "last_word" | grep -v "prev_J" | sortnr -k4 > 2002.tf.npmi
cat 2002.tf.npmi | wc -l
195773
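For reference, the usual definition of normalized pointwise mutual information can be sketched as follows. This is the textbook formula computed from document counts; whether tf.py computes its npmi field in exactly this way (same counts, same log base, same edge-case handling) is an assumption, and the function and argument names are hypothetical.

```python
import math


def npmi(pair_freq, term_freq, feat_freq, num_docs):
    """Normalized PMI in [-1, 1] from document counts."""
    p_xy = float(pair_freq) / num_docs
    p_x = float(term_freq) / num_docs
    p_y = float(feat_freq) / num_docs
    if p_xy == 0.0:
        return -1.0  # never co-occur
    if p_xy == 1.0:
        return 1.0   # avoids dividing by log(1) = 0
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


always_together = npmi(10, 10, 10, 100)  # term and feature only occur together
independent = npmi(25, 50, 50, 100)      # p_xy equals p_x * p_y
never_together = npmi(0, 10, 10, 100)
```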
Uses of polarity: determine sentiment regarding systems/inventions outside of patents (e.g. papers), where there will be criticism of prior work.
Knowing the default polarity of attributes can determine the author's sentiment towards a technology. "It will reduce bandwidth" = negative.
"It will reduce memory requirements" = positive.
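The two examples above can be read as "direction of change times default polarity of the attribute". A toy sketch of that rule follows; the two lookup tables and the function name are invented for illustration and are not data or code from this project.

```python
# Hypothetical default polarities: +1 = more is better, -1 = less is better.
ATTR_POLARITY = {"bandwidth": +1, "memory requirements": -1}
# Hypothetical verb directions: -1 = decreases the attribute, +1 = increases it.
VERB_DIRECTION = {"reduce": -1, "increase": +1}


def statement_sentiment(verb, attribute):
    """Sentiment of 'It will <verb> <attribute>' under the toy tables above."""
    score = VERB_DIRECTION[verb] * ATTR_POLARITY[attribute]
    return "positive" if score > 0 else "negative"


s1 = statement_sentiment("reduce", "bandwidth")
s2 = statement_sentiment("reduce", "memory requirements")
```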
The following steps assume that the directory is called tv. This means we have to mv tv_ta (and then tv_tas) to tv before running and then
move them back before running the other. This is needed because we are rerunning the first <year>.tf file to include mutual info fields, which
are needed when we create the <year>.a.tf file.
So first recreate .tf file, then move the directory into tv and run the steps below.
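The move-in, run, move-back routine described above is easy to get wrong by hand. One possible way to wrap it is a small context manager, sketched below. The name as_tv is hypothetical and nothing like it exists in the project code; the demo runs on a temporary directory rather than the real data tree.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def as_tv(data_dir, labeled_name):
    """Temporarily rename data_dir/labeled_name to data_dir/tv, then restore."""
    src = os.path.join(data_dir, labeled_name)
    tv = os.path.join(data_dir, "tv")
    if os.path.exists(tv):
        raise RuntimeError("tv already exists; move it aside first")
    os.rename(src, tv)
    try:
        yield tv
    finally:
        os.rename(tv, src)  # always move the directory back


# Demo on a throwaway directory tree.
data_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(data_dir, "tv_ta"))
with as_tv(data_dir, "tv_ta") as tv_path:
    inside = os.path.isdir(tv_path)  # steps that expect .../tv would run here
restored = os.path.isdir(os.path.join(data_dir, "tv_ta"))
```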
cd /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A23-semiconductors/data
mv tv_ta tv
# (4) nbayes.run_filter_tf_file("ln-us-A23-semiconductors", 2002, "0.0") # create a.tf, needed for running polarity
The following are not needed for running pagerank:
# (5) role.run_tf_steps("ln-us-A23-semiconductors", 2002, 2002, "pn", ["tc", "tcs", "fc", "uc", "prob"], "a")
# (6) nbayes.run_steps("ln-us-A23-semiconductors", 2002, ["nb", "ds", "cf"], cat_type="pn", subset="a")
# now run pagerank using the npmi field
run_pr.run_prs("ln-us-A23-semiconductors", 2002, "tv", size=0, wt_type="mi")
#Note we get warnings for 0 out-degree (for those with 0 wt)
We need to rerun tf.py for tf file only.
move the tf and other files into tv
rerun nbayes.run_filter_tf_file
run_pr.run_prs("ln-us-A23-semiconductors", 2002, "tv", size=0, wt_type="mi")
diff fields 8,9 seem to create good pos/neg sorts
cat 2002.t.diff.mi.0 | cut -f1,8 | sortnr -k2 | more
------------------------------------------
computer domain
# for ta
# version for title/abstract only (using term_features_ta)
python tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/term_features_ta/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/ 2002 2002
# to create the tf.a file, we also need the corresponding 2002.act.cat.w0.0
cd /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv_ta_cond
cp 2002.act.cat.w0.0 ../tv
nbayes.run_filter_tf_file("ln-us-A21-computers", 2002, "0.0")
run_pr.run_prs("ln-us-A21-computers", 2002, "tv", size=0, wt_type="mi")
# for tas
python2.7 tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/term_features/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv/ 2002 2002
# to create the tf.a file, we also need the corresponding 2002.act.cat.w0.0
cd /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv_tas_cond
cp 2002.act.cat.w0.0 ../tv
nbayes.run_filter_tf_file("ln-us-A21-computers", 2002, "0.0")
run_pr.run_prs("ln-us-A21-computers", 2002, "tv", size=0, wt_type="mi")
run_pr.run_prs("ln-us-A21-computers", 2002, "tv", size=0, wt_type="mi")
# move tv data to a labeled subdirectory
cd /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data
mv tv tv_tas_mi
bio domain
# for tas
python2.7 tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A27-molecular-biology/data/term_features/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A27-molecular-biology/data/tv/ 2002 2002
# act file already existed
cd /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A27-molecular-biology/data/tv_tas
cp 2002.act.cat.w0.0 ../tv
nbayes.run_filter_tf_file("ln-us-A27-molecular-biology", 2002, "0.0")
run_pr.run_prs("ln-us-A27-molecular-biology", 2002, "tv", size=0, wt_type="mi")
///
# for ta
# version for title/abstract only (using term_features_ta)
python tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A27-molecular-biology/data/term_features_ta/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A27-molecular-biology/data/tv/ 2002 2002
communications domain
# for tas
python2.7 tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A22-communications/data/term_features/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A22-communications/data/tv/ 2002 2002
///
# for ta
# version for title/abstract only (using term_features_ta)
python tf.py /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A22-communications/data/term_features_ta/ /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A22-communications/data/tv/ 2002 2002
# Where is the data for bio domain?
############
cs domain
comparing using ta vs tas
For ta feature selection yields many more positive features: pos: 455, neg: 91
tas gives pos: 152, neg: 216
This suggests that abstracts behave differently from summary/background, both in terms of size and content.
We may want to do feature selection on entire patent and final ranking on abstracts.
cat 2002.t.fdiff.cp.0 | sortnr -k5 | more <= k5 appears to do a good job of separating pos and neg terms
However, the terms are the same in both graphs but the features are not. Can we really rely on pr prob from 2
different graphs?
Note the different feature sets for each domain.
Todo: compare polarity of phrases with same head term.
Try generating seed features over subsets of the data. NN bigrams and unigrams. We'll need to compute
probs specially for each case, though!
Hypothesis is that single word terms may be more ambiguous than bigrams, hence less apt to produce discriminating features.
# 6/15/2015
I rewrote the page rank pipeline to use the index to remove np's with adjectives and extract unigrams and bigrams.
# cat 2002.a.tf | cut -f1 | sort | uniq | grep -v " " > 2002.a.tf.f1.unigram
# cat 2002.a.tf | cut -f1 | sort | uniq | grep '^[^ ]* [^ ]*$' > 2002.a.tf.f1.bigram
# es_np_query.tfi_health()
This creates
2002.a.tf.f1.bigram.mtf
2002.a.tf.f1.unigram.mtf
which contain term-feature cond probs for the terms under 4 circumstances (inst/doc_id, all/abstract)
Then import run_pr into python to run
run_pr.prs_health()
This creates 4 subdirectories under the health directory, one for each circumstance.
Note that processing pagerank for abstracts gives a number of warnings:
UserWarning: zero out-degree for node power operation ...
This is because the data was generated off the full patent, which contains term/feature pairs not in
the abstracts.
TODO next: generate round 2 features and rerun pr.
compare round 1 features for the different cases.
computers: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-A21-computers/data/tv_tas_cond/doc_all
health: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-14-health/data/tv/doc_all
Compare the results of top/bottom 100 k5 ranked terms given new seed set.
To generate new seed sets:
polarity.select_feats("ln-us-14-health/data/tv/doc_all", "2002.bigram.t.diff", 2)
This has been replaced as of June 20.
To generate new seed sets, use a bash script
[anick@sarpedon roles]$ sh gen_pr_seedsets.sh /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-14-health/data/tv doc_abs 50
[anick@sarpedon roles]$ sh gen_pr_seedsets.sh /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-14-health/data/tv doc_all 50
Then in python:
reload or import run_pr
>>> run_pr.prs_health_xs(50)
where the arg corresponds to the number of features selected in the gen_pr_seedsets.sh call.
How I generated evaluation sets:
Use the diff data generated from abstracts using top 50 pos and neg features ("_xs50") as extended features for pagerank
personalization. Take top 200 terms sorted by page_rank from pos and neg personalized graphs (columns 2 and 3 in the diff file).
Combine the results, removing duplicates to create:
unigrams: 257
bigrams: 344
Note that the output will include polar terms as well as conflicting and highly connected neutral terms.
Sorting alphabetically effectively randomizes the order of terms.
cd /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-14-health/data/tv/doc_abs
cat 2002_xs50.bigram.t.diff | sortnr -k3 | cut -f1 | head -200 > 2002_xs50.bigram.t.diff.k3.200
cat 2002_xs50.bigram.t.diff | sortnr -k2 | cut -f1 | head -200 > 2002_xs50.bigram.t.diff.k2.200
cat 2002_xs50.bigram.t.diff.k3.200 2002_xs50.bigram.t.diff.k2.200 | sort | uniq > 2002_xs50.bigram.t.diff.k1_2.200
cat 2002_xs50.unigram.t.diff | sortnr -k3 | cut -f1 | head -200 > 2002_xs50.unigram.t.diff.k3.200
cat 2002_xs50.unigram.t.diff | sortnr -k2 | cut -f1 | head -200 > 2002_xs50.unigram.t.diff.k2.200
cat 2002_xs50.unigram.t.diff.k3.200 2002_xs50.unigram.t.diff.k2.200 | sort | uniq > 2002_xs50.unigram.t.diff.k1_2.200
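The selection procedure (top 200 terms by each of the two page rank columns, merged, de-duplicated, and sorted alphabetically) could be expressed in Python roughly as follows. The sketch assumes the diff file has been parsed into tuples whose first three fields are term, column 2, and column 3, with column 2 taken to be the pos score and column 3 the neg score; the function names and sample rows are hypothetical.

```python
def top_terms(rows, col, n):
    """Terms of the n rows with the highest value in column index col."""
    ranked = sorted(rows, key=lambda r: float(r[col]), reverse=True)
    return [r[0] for r in ranked[:n]]


def evaluation_terms(rows, n=200):
    pos = top_terms(rows, 1, n)  # column 2 of the diff file
    neg = top_terms(rows, 2, n)  # column 3 of the diff file
    # Union removes duplicates; alphabetical order hides the ranking.
    return sorted(set(pos) | set(neg))


rows = [
    ("a", 0.50, 0.10),
    ("b", 0.10, 0.60),
    ("c", 0.40, 0.50),
    ("d", 0.05, 0.05),
]
eval_set = evaluation_terms(rows, n=2)
```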
Format the files for annotation
# add tab column separator at the front
cat 2002_xs50.bigram.t.diff.k1_2.200 | sed -e 's/^/ /' > 2002_xs50.bigram.t.diff.k1_2.200.dja
cat 2002_xs50.unigram.t.diff.k1_2.200 | sed -e 's/^/ /' > 2002_xs50.unigram.t.diff.k1_2.200.dja
# copy to uploads directory for transfer to mac
cp 2002_xs50.bigram.t.diff.k1_2.200.dja ~/uploads
cp 2002_xs50.unigram.t.diff.k1_2.200.dja ~/uploads
copy files onto mac to email to Dave:
scp anick@sarpedon.cs.brandeis.edu:uploads/2002_xs50.bigram.t.diff.k1_2.200.dja .
scp anick@sarpedon.cs.brandeis.edu:uploads/2002_xs50.unigram.t.diff.k1_2.200.dja .
Interesting questions:
How to choose the right features if we use pos-neg diff from pagerank?
Does it help to train on unigrams and bigrams separately or should they be trained together?
Compare Mallet NBayes with features pruned by pagerank vs. no pruning.
Should model be limited to abstract data? Does it improve precision?
How does the classification of a term change with adjectival modifiers? (Add them from prev_J and JNN pos_sig)
How does mallet infogain compare to pagerank top n polar features?
To run mallet, we need tcs file and tf file
run_pr.run_make_tcs(1.0, 2, "unigram","i")
For tcs, use /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-14-health/data/tv/doc_all/2002.a.tf.f1.bigram.mtf.1.2.tcs
For tf, use /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-14-health/data/tv
cat 2002.a.tf.f1.bigram.mtf | sed -e 's/ /_/' | sed -e 's/ / /g' | cut -f1,2,5 > 2002.a.tf.f1.bigram.mtf.doc_all
Which includes all features, with counts based on co-occurrence in entire body, counting doc_freq.
///todo
cat 2002.a.tf.f1.unigram.mtf | sed -e 's/ /_/' | sed -e 's/ / /g' | cut -f1,2,5 > 2002.a.tf.f1.unigram.mtf.doc_all
# remove term/feat counts with 0
cat 2002.a.tf.f1.unigram.mtf.doc_all | grep -v ' [0]$' > 2002.a.tf.f1.unigram.mtf.doc_all.no0
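A Python approximation of the two shell steps above (build the doc_all file, then drop zero counts) is sketched below. It assumes the .mtf file is tab-separated with the term in column 1, the feature in column 2, and the full-patent document count in column 5; that layout is inferred from the cut command, not confirmed. Unlike the sed call, it replaces every space in the term, not only the first. The function name and sample lines are hypothetical.

```python
def mtf_to_doc_all(lines):
    """Return 'term feature count' strings, skipping pairs whose count is 0."""
    out = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        term = cols[0].replace(" ", "_")  # "heart attack" -> "heart_attack"
        feat, count = cols[1], cols[4]
        if count == "0":
            continue
        out.append(" ".join([term, feat, count]))
    return out


sample = [
    "heart attack\tprev_V=cause\t3\t5\t2\n",
    "heart rate\tprev_V=measure\t1\t1\t0\n",
]
doc_all = mtf_to_doc_all(sample)
```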
Next, limit the tf data to only a subset of features as produced by pagerank diff pos-neg.
To filter terms to those with highest pos and neg diff scores (k5):
in /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-14-health/data/tv/doc_all
cat 2002.bigram.f.diff | sortnr -k5 | cut -f1 | head -100 > 2002.bigram.f.diff.k5.p.100
cat 2002.bigram.f.diff | sortnr -k5 | tail -100 | sort -n -k5 | cut -f1 > 2002.bigram.f.diff.k5.n.100
These give us the top 100 pos and neg features, in order.
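For reference, the top-n selection can be sketched in Python. It assumes the .f.diff file is tab-separated with the feature in column 1 and the pos-neg diff score in column 5 (hence score_col=4, zero-based); read_rows and top_polar_features are hypothetical names, not functions from the repository.

```python
def read_rows(path):
    """Read a tab-separated file into a list of field lists."""
    with open(path) as f:
        return [line.rstrip("\n").split("\t") for line in f if line.strip()]


def top_polar_features(rows, n=100, score_col=4):
    """Return (pos, neg): the n highest-scoring and n lowest-scoring features.

    pos is ordered from most positive down; neg from most negative up,
    mirroring the head/tail pipelines above. Assumes n > 0.
    """
    ranked = sorted(rows, key=lambda r: float(r[score_col]), reverse=True)
    pos = [r[0] for r in ranked[:n]]
    neg = [r[0] for r in reversed(ranked[-n:])]
    return pos, neg
```

With fewer than 2n rows the two lists will overlap, so n should be small relative to the number of features.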
Now we can create reduced feature sets with 25 and 50 pn features each:
in /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-14-health/data/tv/doc_all
# first attempt: the stray pipe before the redirect means the feature list is never written
cat 2002.bigram.f.diff | sortnr -k5 | tail -100 | sortnr -k5 | cut -f1 | > 2002.bigram.f.diff.k5.n.100
more 2002.bigram.f.diff.k5.n.100
# second attempt: re-sorting with sortnr leaves the least negative features first
cat 2002.bigram.f.diff | sortnr -k5 | tail -100 | sortnr -k5 | cut -f1 > 2002.bigram.f.diff.k5.n.100
more 2002.bigram.f.diff.k5.n.100
# final version: sort -n puts the most negative features first
cat 2002.bigram.f.diff | sortnr -k5 | tail -100 | sort -n -k5 | cut -f1 > 2002.bigram.f.diff.k5.n.100
more 2002.bigram.f.diff.k5.n.100
# 25 pos + 25 neg features
head -25 2002.bigram.f.diff.k5.n.100 > 2002.bigram.f.diff.k5.n.25
head -25 2002.bigram.f.diff.k5.p.100 > 2002.bigram.f.diff.k5.p.25
cat 2002.bigram.f.diff.k5.n.25 2002.bigram.f.diff.k5.p.25 > 2002.bigram.f.diff.k5.pn.25
# 50 pos + 50 neg features
head -50 2002.bigram.f.diff.k5.p.100 > 2002.bigram.f.diff.k5.p.50
head -50 2002.bigram.f.diff.k5.n.100 > 2002.bigram.f.diff.k5.n.50
cat 2002.bigram.f.diff.k5.n.50 2002.bigram.f.diff.k5.p.50 > 2002.bigram.f.diff.k5.pn.50
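The head/cat steps for building the pn sets reduce to a small helper. This is only a sketch; make_pn_set is a hypothetical name, and the inputs are assumed to be the ranked p.100 and n.100 lists already loaded as Python lists.

```python
def make_pn_set(neg_ranked, pos_ranked, k):
    """Return the top k negative features followed by the top k positive ones,
    in the same order as `cat ...n.K ...p.K`."""
    return neg_ranked[:k] + pos_ranked[:k]


def make_pn_sets(neg_ranked, pos_ranked, sizes=(25, 50)):
    """Build one pn set per requested size, keyed by size."""
    return {k: make_pn_set(neg_ranked, pos_ranked, k) for k in sizes}
```

Other sizes (for example 10 or 100) can be produced by passing a different sizes tuple.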
### 6/25/15 Building i_health_abs index on pareia
I am out of index space on sarpedon, so we have created 1TB of index space on pareia.
From pareia:
python2.7
>>> import es_np_index
>>> es_np_index.np_populate("i_health_abs", "health_abs", "ln-us-14-health", 1997, 1998, 5000, True, True, 0, True)
But we don't have 1998 phr_feats data!
[es_np_index.py] Bulk loaded sublist 111
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "es_np_index.py", line 593, in np_populate
for l_bulk_elements in bulk_generator:
File "es_np_index.py", line 358, in gen_bulk_lists
s_file_list = open(filelist_file)
IOError: [Errno 2] No such file or directory: '/home/j/corpuswork/fuse/FUSEData/corpora/ln-us-14-health/subcorpora/1998/config/files.txt'
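To avoid failing partway through a bulk load, the year range could be checked before calling np_populate. The sketch below is not part of es_np_index.py; missing_years is a hypothetical helper, the path layout is taken from the traceback above, and treating the end year as inclusive is an assumption based on the load attempting 1998.

```python
import os


def missing_years(corpus_root, corpus, start_year, end_year, exists=os.path.isfile):
    """Return the years in [start_year, end_year] that lack config/files.txt.

    `exists` is injectable so the check can be exercised without touching disk.
    """
    missing = []
    for year in range(start_year, end_year + 1):
        path = os.path.join(corpus_root, corpus, "subcorpora",
                            str(year), "config", "files.txt")
        if not exists(path):
            missing.append(year)
    return missing
```

For the run above this would be called as missing_years("/home/j/corpuswork/fuse/FUSEData/corpora", "ln-us-14-health", 1997, 1998), and the load would only start if the result is empty.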
Building molecular biology instead:
>>> es_np_index.np_populate("i_bio_abs", "bio_abs", "ln-us-A27-molecular-biology", 1997, 2007, 5000, True, True, 0, True)
Looking over David's annotations:
cd /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-14-health/data/eval
# distribution of labels
cat 2002_xs50.unigram.t.diff.k1_2.200.dja.txt | cut -f1 | sort | uniq -c | sort -nr
61 x
49 ug
41 n
39 p
35 pc
24 nc
6 u
2 ?
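The same label distribution can be computed in Python, which is handy if the counts need to feed into the evaluation code. This sketch assumes the annotated file is tab-separated with the label in the first column; label_distribution is a hypothetical name.

```python
from collections import Counter


def label_distribution(lines):
    """Count first-column labels and return (label, count) pairs,
    most frequent first, like `cut -f1 | sort | uniq -c | sort -nr`."""
    counts = Counter(line.rstrip("\n").split("\t")[0] for line in lines)
    return counts.most_common()
```

Passing an open file handle for the .dja.txt file as lines should give the same counts as the shell pipeline; the order of labels with equal counts may differ.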
Confidence can be based on:
num_tf_occurrences, ratio_polar_all_occurrences, ratio_pos_neg_occurrences
Create the following python function:
Given tf file,
tcs file,
optional_file_containing_list_of_features_to_filter_for,
#infogain_features_to_use ( 0 to indicate using the optional feature file),
file_output_name_prefix (e.g. "NB.50")
create a mallet classifier and then classify the terms in the tf file, putting the output in files using the file_output_name_prefix
as the filename prefix.
Run this function with 50 mallet infogain features and the pagerank 25 set.
tf_file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-14-health/data/tv/2002.a.tf.f1.unigram.mtf.doc_all.no0
tcs_file: /home/j/anick/patent-classifier/ontology/roles/data/patents/ln-us-14-health/data/tv/doc_all/2002.a.tf.f1.unigram.mtf.ssi.1.0.2.tcs
output_prefix: "NB.IG50.ssi.uni"
Then send me the location of the function (make sure it is world/read/write/execute on the cs cluster machine) so that I can run it on other data. I have a lot of things to test.
Thanks,
Peter
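One piece of the requested function, restricting the tf data to an optional feature list, can be sketched independently of Mallet. The sketch assumes the tf file is tab-separated with columns term, feature, count (as produced by the cut -f1,2,5 step); filter_tf_by_features is a hypothetical name, and the Mallet training and classification steps are deliberately left out.

```python
def filter_tf_by_features(tf_lines, features):
    """Yield only the tf lines whose feature (second column) is in `features`.

    Assumes tab-separated lines of the form: term, feature, count.
    """
    keep = set(features)
    for line in tf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1] in keep:
            yield line
```

The features argument would come from a file such as 2002.bigram.f.diff.k5.pn.25, one feature per line with trailing newlines stripped.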
### 6/26/15 Loaded abstracts from 1997-2007 into es index on pareia
[es_np_index.py] Bulk loaded sublist 1135
Thu Jun 25 22:23:35 2015 0 [es_np_index.py]Completed make_bulk_lists for years: 1997 2007. Number of lines: 151569
[gen_bulk_lists]151569 lines from 5935 files written to index i_bio_abs
[es_np_index.py] Bulk loaded sublist 1136
[se_np_index.py]np_populate completed at Thu Jun 25 22:23:35 2015
(elapsed time in hr:min:sec: 5:27:13.046177)
Bug to fix for evaluation:
>>> polarity.run_polareval()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "polarity.py", line 233, in run_polareval
pe.compare(output_dir)
File "polarity.py", line 280, in compare
for (term, gold_value) in self.d_gold_unigram2label:
ValueError: too many values to unpack
/ probably because one of the lines has a blank field value.
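Another possible cause is worth ruling out. If d_gold_unigram2label is a dict (the name suggests so, but that is an assumption), iterating over it directly yields only the keys, and unpacking a key string into (term, gold_value) raises this same ValueError for any term that is not exactly two characters long. A minimal reproduction, with made-up data:

```python
def unpack_keys(d):
    """Mimics the failing loop: iterating a dict yields keys only,
    so each key string is unpacked character by character."""
    return [(term, gold_value) for (term, gold_value) in d]


def unpack_items(d):
    """Iterates over (key, value) pairs, which is what the loop intends."""
    return [(term, gold_value) for (term, gold_value) in d.items()]


if __name__ == "__main__":
    gold = {"aspirin": "p"}  # made-up example data
    try:
        unpack_keys(gold)
    except ValueError as err:
        print("keys-only iteration fails:", err)
    print(unpack_items(gold))
```

If that is the case, changing the loop to iterate over self.d_gold_unigram2label.items() would be the fix. If the attribute is instead a list of split lines, a line with extra or blank fields remains the more likely explanation.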