TaskIO (TIO)

A minimal framework for handling a task's input and output processing. It is heavily inspired by Google's SeqIO, but is not written with tf.data. For the time being, it uses HuggingFace's Dataset framework as the backbone.

Install

To install, run:

git clone https://github.com/gabeorlanski/taskio.git
cd taskio
pip install -r requirements.txt
pip install -e .

Basic Guide

Each Task is made up of four key elements:

  1. A SPLIT_MAPPING that maps a split name (e.g. train, validation) to the value used to load that split, such as a file path.
  2. A tokenizer for automatically encoding and decoding the inputs.
  3. Two lists of callables, preprocessors and postprocessors, for preprocessing and postprocessing respectively. Each callable must take a single dictionary argument. (More advanced behavior can be achieved with functools.partial; see the sketch after this list.)
  4. A list of metric_fns callables. Each function must have the signature predictions: List[str], targets: List[str].
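For illustration, here is a minimal sketch of what a preprocessor and a metric function might look like. The names add_prefix and exact_match are hypothetical, and the dictionary return type of the metric is an assumption, not something TaskIO prescribes:

from functools import partial
from typing import Dict, List


def add_prefix(sample: Dict, prefix: str) -> Dict:
    # Preprocessors take and return a single dict.
    sample['input_sequence'] = prefix + sample['input_sequence']
    return sample


def exact_match(predictions: List[str], targets: List[str]) -> Dict:
    # Metric functions take parallel lists of strings.
    correct = sum(p == t for p, t in zip(predictions, targets))
    return {'exact_match': correct / len(targets)}


# partial turns the two-argument preprocessor into the required
# single-dict callable.
preprocessors = [partial(add_prefix, prefix='Translate: ')]
metric_fns = [exact_match]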

To create your own task, you must first subclass the Task class:

from typing import Dict

from datasets import Dataset, load_dataset

from tio import Task


@Task.register('example')
class ExampleTask(Task):
    SPLIT_MAPPING = {
        "train"     : "path to the train file",
        "validation": "path to the validation file"
    }

    @staticmethod
    def map_to_standard_entries(sample: Dict) -> Dict:
        # Map the raw columns to the standard input_sequence/target columns.
        sample['input_sequence'] = sample['input']
        sample['target'] = sample['output']
        return sample

    def dataset_load_fn(self, split: str) -> Dataset:
        # This is only an example and will not work as-is.
        return load_dataset(self.SPLIT_MAPPING[split])

The first step is to register your task in the Task registry (inspired by AllenNLP's Registrable). Then you must set the SPLIT_MAPPING and override two functions:

  1. map_to_standard_entries: When preprocessing and postprocessing, the Task class expects two columns, input_sequence and target. This function maps the raw input to those columns.
  2. dataset_load_fn: Loads the dataset for a given split.

To actually use the task and get the dataset:

from tio import Task

task = Task.get_task(
    name='example',
    tokenizer=tokenizer,
    preprocessors=preprocessors,
    postprocessors=postprocessors,
    metric_fns=metric_fns
)
tokenized_dataset = task.get_split("train")

...

# Decode the raw predicted tokens and the gold labels back to text,
# then compute metrics with the task's metric functions.
metrics = task.evaluate(
    **task.postprocess_raw_tokens(predictions, tokenized_dataset['labels'])
)
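The tokenizer and predictions names above are assumed to be defined elsewhere. As a hedged sketch of where they might come from, assuming a HuggingFace seq2seq model and that get_split produces an input_ids column (both assumptions, not part of TaskIO itself):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Any HuggingFace tokenizer should work; 't5-small' is only an example.
tokenizer = AutoTokenizer.from_pretrained('t5-small')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-small')

# Generate one prediction per example; the raw token ids are what
# task.postprocess_raw_tokens expects above.
predictions = [
    model.generate(torch.tensor([ex['input_ids']]))[0]
    for ex in tokenized_dataset
]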

TODO: Make this less clunky
