Profiling new data pack speed #805

hunterhector · 2022-05-25T01:06:54Z

Is your feature request related to a problem? Please describe.
At the end of the data efficiency project, we need to measure whether it succeeded by profiling run time under the new data pack implementation against the old one.

Depends on
This should be one of the last issues to solve in the data efficiency project (if we find problems, we should create additional issues to address them).

Describe the solution you'd like
At the beginning of the project, we prepared a profiling task. Now we only need to re-run the profiling code with 2 different versions of Forte.

The following are the details of the initial profiling task:

Metrics
Dataset: Ontonotes from Github or Official Link

The size of the entire dataset is 459.5 MB. The *_conll files contain data in a tabular structure. Our ontonotes reader takes in *_conll files and parses them into data packs. You may refer here for the specific column name descriptions. We use a subset of the data to test performance.

Statistics:
data packs: 1491
sentences: 31746
tokens: 610114

Comparison
Perform the task on version 0.2.0 and >0.3.0
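
A minimal sketch of how the two runs could be compared, assuming each version's run produces a mapping from stage name to elapsed seconds. The function name `speedups`, the stage names, and the numbers are hypothetical placeholders for illustration, not measurements from either Forte version.

```python
from typing import Dict


def speedups(old: Dict[str, float], new: Dict[str, float]) -> Dict[str, float]:
    """Return the old/new time ratio per stage.

    A ratio above 1.0 means the new version is faster for that stage.
    Stages missing from either run, or with a zero new time, are skipped.
    """
    shared = [stage for stage in old if stage in new]
    return {stage: old[stage] / new[stage] for stage in shared if new[stage] > 0}


# Placeholder numbers for illustration only, not real measurements.
old_run = {"serialize": 4.0, "deserialize": 6.0}
new_run = {"serialize": 2.0, "deserialize": 8.0}

for stage, ratio in speedups(old_run, new_run).items():
    print(f"{stage}: {ratio:.2f}x")
```

Reporting a per-stage ratio rather than a single total makes it easier to see which stage a regression comes from, which is useful if follow-up issues need to be filed.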

Profiling Task
We expect the profiling task to have the following stages:

Load data from the source files into data packs.
Write the data packs to disk (serialize).
Load the data from disk (deserialize) back into data packs for use.
Data preparation: subword tokenization, ...
Query data from the data packs to mimic preprocessing for training (POS, NER, links…).
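
The stages above could be timed with a small harness like the following. This is only a sketch: the stage names and the `_placeholder` function are hypothetical stand-ins, and no Forte calls are shown. The real reader, serialization, and query code would replace the placeholders.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, List, Tuple


@contextmanager
def timed(stage: str, timings: Dict[str, float]) -> Iterator[None]:
    """Accumulate the wall-clock seconds spent inside the block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start


def run_profile(stages: List[Tuple[str, Callable[[], None]]]) -> Dict[str, float]:
    """Run each (name, callable) stage in order and return seconds per stage."""
    timings: Dict[str, float] = {}
    for name, stage_fn in stages:
        with timed(name, timings):
            stage_fn()
    return timings


def _placeholder() -> None:
    """Stand-in for real work; replace with the actual profiling code."""
    sum(range(10_000))


# Hypothetical stage names mirroring the five stages listed above.
STAGES = [
    ("read_source", _placeholder),
    ("serialize", _placeholder),
    ("deserialize", _placeholder),
    ("prepare", _placeholder),
    ("query", _placeholder),
]

if __name__ == "__main__":
    for name, seconds in run_profile(STAGES).items():
        print(f"{name:>12}: {seconds:.4f} s")
```

Running the same script unchanged under each installed Forte version keeps the two sets of timings directly comparable.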

Additional context

This is part of the data efficiency project
This task won't necessarily add new code to the repo, but we can include the profiling script in

hunterhector added the data_efficiency label May 25, 2022

hunterhector added this to the 0.4 milestone May 25, 2022

hunterhector modified the milestones: 0.3 interface clearance, 0.3 stable version May 25, 2022

J007X self-assigned this Oct 17, 2022

J007X mentioned this issue Nov 15, 2022

Performance improvement for new Datapack changes: Reduce excessive calls related to DataPack and SortedList (based on profiling analysis) #904

Closed

This was referenced Dec 5, 2022

Implementation 805 #905

Open

Performance improvement (for new Datapack changes): Improve class loading performance (based on profiling analysis on new use of get_class) #908

Closed

J007X mentioned this issue Mar 1, 2023

_is_annotation_tid() in data_store exceptions throwing causing (significant) slowing down in typical usage scenarios (such as NLP) #923

Closed
