Model building #28

himesh257 · 2022-08-09T17:33:24Z

Added PURE data setup and nervaluate scripts
Restructured folders
Took care of Dependabot alerts caused by previous changes to the master branch

CVxTz · 2022-08-11T19:28:38Z

model_dev/entity_extraction/PURE/PURE_data_setup.py

+sorted_by_direction = sorted(final_sent_arr, key=lambda d: d['right_to_left'], reverse=True) 
+
+# getting the index after which no sentence will have a right_to_left relationship
+right_to_left_boundary = get_right_to_left_boundary(sorted_by_direction)


@himesh257 I am not sure about this part. Would using the stratify parameter of sklearn's split solve your problem instead of doing all this?

@CVxTz the only reason I did it manually is that I wanted to make sure enough left_to_right and right_to_left relationships were present in the training data. I tried relying completely on train_test_split, but the samples weren't evenly distributed between the two directions. We can talk more about it later if you'd like.

Also, the entity training model (with which we got 0.75 strict F1 for "base") uses the file PURE_ent_data_setup.py, which relies on train_test_split. Thanks!

You should try the stratify parameter; it will keep the class proportions the same across train, test and val.

It would also be better to use the same split for the entity and relation models, what do you think?

One thing that confuses me is how to use stratify when we don't have a "y". @CVxTz do you know how it would work in this case?

And yes, PURE will first predict the entities and then relationships. For simplicity and speed, I am populating the entities so I can skip that step and directly run the relationship model for now. So all in all, it will eventually have the same split

@himesh257 You can create an indicator about the most common relationship direction in the sentence and use this as the stratify value.
Can you share the resulting files from your different splits somewhere?
Thanks
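
One way to read this suggestion, as a minimal sketch: derive a per-sentence label from the direction information and pass it as `stratify`. It assumes each item in `final_sent_arr` is a dict whose `right_to_left` value is truthy when the sentence has a right-to-left relationship. The helper names `direction_indicator` and `stratified_split` are hypothetical and not part of the PR.

```python
def direction_indicator(sentence):
    """Label a sentence by relationship direction.

    Assumption: 'right_to_left' is a truthy flag on each sentence dict.
    """
    return "right_to_left" if sentence.get("right_to_left") else "left_to_right"


def stratified_split(sentences, test_size=0.1, seed=0):
    """Split so both directions keep roughly the same share in train and test."""
    # Imported here so the indicator helper stays usable without scikit-learn.
    from sklearn.model_selection import train_test_split

    # The derived labels play the role of "y" for stratification only.
    labels = [direction_indicator(s) for s in sentences]
    return train_test_split(
        sentences, test_size=test_size, stratify=labels, random_state=seed
    )
```

It would be called as `train, test = stratified_split(final_sent_arr)`. Stratification needs at least two sentences per label, so a very rare direction may still need manual handling.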

@CVxTz I am not sure what you mean by that, can you fix the script if it's not a big change? I slacked you the train and test splits

Realized that stratify isn’t needed anymore, updated scripts

eugen-vusak · 2022-09-25T13:22:32Z

model_dev/entity_extraction/PURE/PURE_data_setup.py

+import os
+
+file_name_answers = "gold_standard"
+file_path_answers = "/Users/ash/Desktop/PURE/PURE/new_data/"+file_name_answers+".jsonl"


Hardcoded data paths are bad for reproducibility.

To make this reproducible, it would probably be a good idea to give the user a script that downloads the dataset into a known directory, and then use paths relative to that directory. README.md should list the steps needed to run the experiment, e.g.

run download_data.py and then data_setup.py and then ...

Also, pathlib.Path should be used for paths so that they work on all machines.
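
A minimal sketch of what this could look like, assuming the dataset ends up in a `new_data` folder relative to where the script is run. The constant `DATA_DIR` and the helper `answers_path` are hypothetical names used only for illustration.

```python
from pathlib import Path

# Assumption: the download step saves the dataset into this relative folder.
DATA_DIR = Path("new_data")
FILE_NAME_ANSWERS = "gold_standard"


def answers_path(data_dir=DATA_DIR):
    """Build the path to the gold-standard file without a machine-specific prefix."""
    return Path(data_dir) / f"{FILE_NAME_ANSWERS}.jsonl"
```

Because the path is built with `/` on `Path` objects, the same code works with both POSIX and Windows separators.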

eugen-vusak · 2022-09-25T13:23:25Z

model_dev/entity_extraction/PURE/PURE_ent_data_setup.py

+
+train, test = train_test_split(final_sent_arr, test_size=0.1)
+#val, test = train_test_split(test, test_size=0.5)
+data_folder = "data_ent1"


This should be defined at the top of the Python file with the rest of the constants.

eugen-vusak · 2022-09-25T13:24:22Z

model_dev/entity_extraction/PURE/PURE_ent_data_setup.py

+data_folder = "data_ent1"
+
+#shutil.rmtree(data_folder)
+os.makedirs(data_folder, exist_ok=True)


Again, pathlib is preferred for handling paths, and also over the os module for things like creating directories.
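
As a rough sketch of the pathlib equivalent, assuming the script writes one JSON object per line as the commented-out validation loop does. The helper `write_split` is a hypothetical name, not something in the PR.

```python
import json
from pathlib import Path

DATA_FOLDER = Path("data_ent1")


def write_split(items, name, data_folder=DATA_FOLDER):
    """Create the output folder if needed and write one JSON object per line."""
    data_folder = Path(data_folder)
    # Equivalent to os.makedirs(data_folder, exist_ok=True)
    data_folder.mkdir(parents=True, exist_ok=True)
    out_path = data_folder / f"{name}.json"
    with out_path.open("w") as file:
        for item in items:
            file.write(f"{json.dumps(item)}\n")
    return out_path
```

With this, `write_split(train, "train")` and `write_split(test, "test")` would replace the separate `os.makedirs` call and the hand-formatted path strings.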

eugen-vusak · 2022-09-25T13:25:19Z

model_dev/entity_extraction/PURE/PURE_ent_data_setup.py

+# for item in val:
+# file.write("%s\n" % json.dumps(item))
+
+with open('./{}/train.json'.format(data_folder), 'w') as file:


Suggested change

-with open('./{}/train.json'.format(data_folder), 'w') as file:
+with open(f'./{data_folder}/train.json', 'w') as file:

f-strings should be used over .format

Himesh Buch and others added 12 commits February 24, 2022 17:41

modified spacy model to work with latest spacy version

5279fc4

initial model training code done

2d3347a

models ready to be experimented on a larger dataset

75b92ef

updates

21acaa2

updated rule based approach

3401b02

rel extraction classifier

a7c8168

added svm classifier

f973383

xgboost classifier

bdb341d

updated with PURE

2fb4810

folders restructred

3925201

removing unused docs folder

afb6673

PURE and nervaluate scripts

51c9068

himesh257 requested review from CVxTz and rodriguesk August 9, 2022 17:33

CVxTz reviewed Aug 11, 2022

added additonal scrips, modified PURE data setups

f9bd4f2

himesh257 self-assigned this Sep 17, 2022

himesh257 requested a review from eugen-vusak September 24, 2022 17:19

eugen-vusak approved these changes Sep 25, 2022

updated data and spacy training script

380803b
