Is your feature request related to a problem? Please describe.
At the end of the data efficiency project, we need to measure whether it succeeded by profiling the new version against the old one and comparing run times.
Depends on
This should be one of the last issues to solve in the data efficiency project (naturally, if we find problems, we should create additional issues to address them).
Describe the solution you'd like
At the beginning of the project, we prepared a profiling task. Now we only need to re-run the profiling code with two different versions of Forte.
The following are the details of the initial profiling task:
Metrics
Dataset: Ontonotes from Github or Official Link
The size of the entire dataset is 459.5 MB. The *_conll files contain data in a tabular structure. Our OntoNotes reader takes in *_conll files and parses them into data packs. You may refer here for the specific column name descriptions. We use a subset of the data to test performance.
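To illustrate what "tabular structure" means for the reading stage, here is a minimal sketch of grouping rows into sentences. It is not the Forte reader. It assumes whitespace-separated columns, blank lines between sentences, and `#`-prefixed document marker lines (as in CoNLL-2012 style files); the function name `read_conll_sentences` and the sample rows are hypothetical.

```python
def read_conll_sentences(lines):
    """Group whitespace-separated rows into sentences.

    Assumptions (not taken from the Forte reader): a blank line ends a
    sentence, and lines starting with '#' are document markers to skip.
    """
    sentence = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("#"):
            continue  # e.g. "#begin document ..." / "#end document"
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split())
    if sentence:
        yield sentence


# Made-up rows for illustration; real files have more columns.
sample = [
    "#begin document (demo); part 000",
    "demo 0 0 Hello UH",
    "demo 0 1 . .",
    "",
    "demo 0 0 Bye UH",
    "#end document",
]
sentences = list(read_conll_sentences(sample))
print(len(sentences), "sentences")
```

Each sentence comes back as a list of rows, and each row as a list of column strings, so column meanings can be attached later from the column descriptions.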
Statistics:
data packs: 1491
sentences: 31746
tokens: 610114
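These counts make it possible to report throughput alongside raw run time. The sketch below assumes the counts describe the subset that is actually profiled; the `throughput` helper and the `DATASET_STATS` name are hypothetical.

```python
# Counts copied from the statistics above; assumed to describe the profiled subset.
DATASET_STATS = {"data packs": 1491, "sentences": 31746, "tokens": 610114}


def throughput(elapsed_seconds, stats=DATASET_STATS):
    """Convert an elapsed time in seconds into units processed per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return {unit: count / elapsed_seconds for unit, count in stats.items()}


# The elapsed time here is a placeholder, not a measurement.
for unit, rate in throughput(2.0).items():
    print(f"{unit}: {rate:.1f} per second")
```

Reporting rates per data pack, sentence, and token keeps results comparable if the subset changes later.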
Comparison
Perform the task on Forte version 0.2.0 and on a version >0.3.0, then compare the timings.
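One way to present the comparison is a per-stage speedup table. The sketch below assumes each run produces a dictionary mapping stage name to seconds; `compare_timings` is a hypothetical helper and the numbers are placeholders, not measured results.

```python
def compare_timings(old, new):
    """Return {stage: (old_seconds, new_seconds, speedup)} for shared stages."""
    report = {}
    for stage, old_seconds in old.items():
        new_seconds = new.get(stage)
        if new_seconds is None or new_seconds <= 0:
            continue  # stage missing in the new run, or timing unusable
        report[stage] = (old_seconds, new_seconds, old_seconds / new_seconds)
    return report


# Placeholder numbers for illustration only; not measured results.
old_run = {"read": 4.0, "serialize": 2.0}
new_run = {"read": 2.0, "serialize": 1.0}

for stage, (old_s, new_s, speedup) in compare_timings(old_run, new_run).items():
    print(f"{stage}: {old_s:.2f}s -> {new_s:.2f}s ({speedup:.2f}x)")
```

A speedup above 1.0 means the newer version is faster for that stage.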
Profiling Task
We expect the profiling task to have the following stages:
Load data from the source files into data packs.
Write the data packs to disk (serialize).
Load the data from disk (deserialize) back into data packs for use.
Data preparation: subword tokenization, ...
Query data from the data packs to mimic preprocessing for training (POS, NER, links…).
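The stages above can be timed separately with a small harness. This is a minimal sketch, not the project's profiling script: the stage callables are hypothetical stand-ins that do trivial work, and in a real run each would wrap the corresponding Forte reader, serialization, or query call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, results):
    """Add the wall-clock duration of the enclosed block to results[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        results[stage] = results.get(stage, 0.0) + elapsed


def run_profile(stages):
    """Run (name, callable) pairs in order, feeding each output to the next."""
    results = {}
    data = None
    for name, fn in stages:
        with timed(name, results):
            data = fn(data)
    return results


# Hypothetical stand-ins for the five stages; none of these call Forte.
stages = [
    ("read", lambda _: [f"pack-{i}" for i in range(3)]),
    ("serialize", lambda packs: [p.encode() for p in packs]),
    ("deserialize", lambda blobs: [b.decode() for b in blobs]),
    ("prepare", lambda packs: [p.split("-") for p in packs]),
    ("query", lambda items: sum(len(item) for item in items)),
]

timings = run_profile(stages)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f}s")
```

Running the same harness under each Forte version yields per-stage timings that show where any change in speed comes from.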
Additional context
This is part of the data efficiency project
This task won't necessarily add new code to the repo, but we can include the profiling script in