Profiling new data pack speed #805

Open
hunterhector opened this issue May 25, 2022 · 0 comments

Is your feature request related to a problem? Please describe.
At the end of the data efficiency project, we need to measure whether it succeeded by profiling the running time of the new implementation against the old one.

Depends on
This should be one of the last issues to resolve in the data efficiency project. (If the profiling reveals problems, we should create additional issues to address them.)

Describe the solution you'd like
At the beginning of the project, we prepared a profiling task. Now we only need to re-run the profiling code with two different versions of Forte.

The following are the details of the initial profiling task:

Metrics
Dataset: OntoNotes, from GitHub or the official link

The size of the entire dataset is 459.5 MB. The *_conll files contain data in a tabular structure. Our OntoNotes reader takes the *_conll files as input and parses them into data packs. You may refer here for the specific column name descriptions. We use a subset of the data to test performance.

Statistics:
data packs: 1491
sentences: 31746
tokens: 610114

Comparison
Run the profiling task on version 0.2.0 and on a version >0.3.0.
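
To make the comparison concrete, one option is to record wall-clock seconds per stage under each version and report the old/new ratio. The following is a minimal sketch under that assumption; the function name `speedup_report` and the dictionary layout are hypothetical and not part of Forte.

```python
def speedup_report(old_times, new_times):
    """Return {stage: old_seconds / new_seconds} for stages present in both runs.

    Both arguments are assumed to be dicts mapping a stage name to
    wall-clock seconds. A ratio above 1.0 means the new version is faster.
    """
    report = {}
    for stage, old_seconds in old_times.items():
        new_seconds = new_times.get(stage)
        # Skip stages that were not measured in the new run, or that have
        # a non-positive time (which would make the ratio meaningless).
        if new_seconds is None or new_seconds <= 0:
            continue
        report[stage] = old_seconds / new_seconds
    return report
```

Reporting a ratio per stage, rather than a single total, should make it easier to see which part of the pipeline a regression or improvement comes from.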

Profiling Task
We expect the profiling task to have the following stages:

  • Load data from source files into data packs.
  • Write data packs to disk (serialize).
  • Load data from disk (deserialize) into data packs for use.
  • Data preparation: subword tokenization, ...
  • Query data from data packs to mimic preprocessing for training (POS, NER, links…).
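
The stages above could be timed with a small harness like the one below. This is only a sketch, not the existing profiling script: `StageTimer` and `throughput` are hypothetical helpers, and the Forte calls for each stage are deliberately left out, since they differ between the two versions being compared.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named stage (hypothetical helper)."""

    def __init__(self):
        self.seconds = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add to any earlier measurement so a stage can be entered
            # several times, e.g. once per data pack.
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed


def throughput(count, seconds):
    """Items processed per second, e.g. tokens/s for a stage."""
    return count / seconds if seconds > 0 else float("inf")
```

Each stage would be wrapped as `with timer.stage("serialize"): ...`, with the version-specific Forte code inside the block. Dividing the dataset statistics (for example, the 610114 tokens) by a stage's time gives a throughput figure that stays comparable if the subset changes.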

Additional context

  • This is part of the data efficiency project
  • This task won't necessarily add new code to the repo, but we can include the profiling script in