Extending Renard

Creating new steps

Usually, a step must implement at least four methods:

PipelineStep.__init__(): is used to pass options at step init time. These options should be valid for a whole collection of texts, and should not be specific to a single text.
PipelineStep.__call__(): is called at pipeline run time.
PipelineStep.needs(): declares the set of information this step needs from the pipeline state. Each returned string should be an attribute of PipelineState.
PipelineStep.production(): declares the set of information this step produces. As in PipelineStep.needs(), each returned string should be an attribute of PipelineState.

Here is an example of a basic tokenization step:

from typing import Dict, Any, Set
from renard.pipeline.core import PipelineStep

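class BasicTokenizerStep(PipelineStep):
    """Split the input text into tokens on single spaces."""

    def __init__(self):
        # this step takes no options
        pass

    def __call__(self, text: str, **kwargs) -> Dict[str, Any]:
        # the returned key is the one declared in production()
        return {"tokens": text.split(" ")}

    def needs(self) -> Set[str]:
        # information read from the pipeline state
        return {"text"}

    def production(self) -> Set[str]:
        # information written to the pipeline state
        return {"tokens"}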
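
To make the relationship between needs(), production() and __call__() concrete, here is a minimal sketch that runs without Renard installed. It is not how the Renard pipeline is implemented: the run_step helper and the use of a plain dict as the state are assumptions made for illustration, and the step is redefined without the PipelineStep base class so that the snippet is self-contained.

```python
from typing import Any, Dict, Set


class BasicTokenizerStep:
    # Same methods as the example above. The PipelineStep base class is
    # left out on purpose so that the sketch runs without renard.
    def __call__(self, text: str, **kwargs) -> Dict[str, Any]:
        return {"tokens": text.split(" ")}

    def needs(self) -> Set[str]:
        return {"text"}

    def production(self) -> Set[str]:
        return {"tokens"}


def run_step(step, state: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper (not part of Renard): run one step on a dict state."""
    # Every declared need must already be present in the state.
    missing = step.needs() - set(state)
    if missing:
        raise ValueError(f"missing inputs: {sorted(missing)}")

    # Pass the needed entries to the step as keyword arguments.
    out = step(**{name: state[name] for name in step.needs()})

    # The step should only return what it declared in production().
    undeclared = set(out) - step.production()
    if undeclared:
        raise ValueError(f"undeclared outputs: {sorted(undeclared)}")

    # Return a new state holding both the old and the new entries.
    return {**state, **out}


state = run_step(BasicTokenizerStep(), {"text": "Renard extracts character networks"})
print(state["tokens"])  # ['Renard', 'extracts', 'character', 'networks']
```

The point of the sketch is the contract: what needs() returns names the keyword arguments of __call__(), and what production() returns names the keys of the returned dictionary.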

Additionally, the following methods can be overridden:

PipelineStep.optional_needs(): specifies optional dependencies the same way as PipelineStep.needs().
PipelineStep._pipeline_init_(): is used for pipeline-wide arguments, such as language settings. This method is called by the pipeline at pipeline run time.
PipelineStep.supported_langs(): declares the set of supported languages as a set of ISO 639-3 codes (or the special value "any"). Defaults to {"eng"}.
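
As an illustration of these optional methods, the sketch below overrides supported_langs() and _pipeline_init_() in a tokenizer that behaves differently for French. Several things here are assumptions rather than Renard's API: PipelineStepStub is a hypothetical stand-in for PipelineStep so the snippet runs without Renard, the _pipeline_init_(lang, **kwargs) signature is guessed from the description above, and ElisionAwareTokenizerStep and its lowercase option are invented for the example.

```python
import re
from typing import Any, Dict, Set


class PipelineStepStub:
    """Hypothetical stand-in for PipelineStep (not the real base class)."""

    def _pipeline_init_(self, lang: str, **kwargs) -> None:
        # Assumed signature: the pipeline passes its language, plus any
        # other pipeline-wide arguments as keywords.
        self.lang = lang

    def supported_langs(self) -> Set[str]:
        return {"eng"}


class ElisionAwareTokenizerStep(PipelineStepStub):
    def __init__(self, lowercase: bool = False):
        # Step-level option: it makes sense for any text, so it belongs here.
        self.lowercase = lowercase

    def _pipeline_init_(self, lang: str, **kwargs) -> None:
        super()._pipeline_init_(lang, **kwargs)
        # Pipeline-wide setting: for French, split elided words
        # ("l'homme" becomes "l'" and "homme").
        self._split_elisions = lang == "fra"

    def __call__(self, text: str, **kwargs) -> Dict[str, Any]:
        if self.lowercase:
            text = text.lower()
        if self._split_elisions:
            # Either a word ending in an apostrophe, or a run of non-spaces.
            tokens = re.findall(r"\w+'|\S+", text)
        else:
            tokens = text.split()
        return {"tokens": tokens}

    def needs(self) -> Set[str]:
        return {"text"}

    def production(self) -> Set[str]:
        return {"tokens"}

    def supported_langs(self) -> Set[str]:
        # ISO 639-3 codes for English and French.
        return {"eng", "fra"}


step = ElisionAwareTokenizerStep(lowercase=True)
step._pipeline_init_("fra")  # in Renard, the pipeline makes this call
print(step(text="L'homme marche")["tokens"])  # ["l'", 'homme', 'marche']
```

Keeping the language out of __init__() lets the same step instance be reused in pipelines configured for different languages, which is the split between step-level and pipeline-wide arguments described above.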