Reference

Core

Pipeline

class renard.pipeline.core.Pipeline(steps, lang='eng', progress_report='tqdm', warn=True)

A flexible NLP pipeline

Parameters
  • steps (List[PipelineStep]) –

  • lang (str) –

  • progress_report (Optional[Literal[‘tqdm’]]) –

  • warn (bool) –

PipelineParameter

all the possible parameters of the whole pipeline that are shared between steps

alias of Literal[‘lang’, ‘progress_reporter’, ‘character_ner_tag’]

__call__(text=None, ignored_steps=None, **kwargs)

Run the pipeline sequentially.

Parameters
  • ignored_steps (Optional[List[str]]) – a list of step productions. All steps with a production in ignored_steps will be ignored.

  • text (Optional[str]) –

Return type

PipelineState

Returns

the output of the last step of the pipeline
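For illustration, here is a minimal end-to-end sketch (the step choices and parameters are illustrative, but the imports follow the module paths documented in this reference):

    from renard.pipeline.core import Pipeline
    from renard.pipeline.tokenization import NLTKTokenizer
    from renard.pipeline.ner import NLTKNamedEntityRecognizer
    from renard.pipeline.character_unification import NaiveCharacterUnifier
    from renard.pipeline.graph_extraction import CoOccurrencesGraphExtractor

    pipeline = Pipeline(
        [
            NLTKTokenizer(),
            NLTKNamedEntityRecognizer(),
            NaiveCharacterUnifier(min_appearances=2),
            CoOccurrencesGraphExtractor(co_occurrences_dist=(3, "sentences")),
        ],
        lang="eng",
    )

    with open("novel.txt") as f:  # any input text file
        text = f.read()
    state = pipeline(text)  # a PipelineState holding the extracted network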

__init__(steps, lang='eng', progress_report='tqdm', warn=True)
Parameters
  • steps (List[PipelineStep]) – a list of PipelineStep that will be executed in order

  • progress_report (Optional[Literal[‘tqdm’]]) – if 'tqdm', report the pipeline progress using tqdm. If None, does not report progress.

  • lang (str) – ISO 639-3 language code

  • warn (bool) –

_non_ignored_steps(ignored_steps)

Get steps that are not ignored.

Parameters

ignored_steps (Optional[List[str]]) – a list of step productions. Steps with a production in ignored_steps won't be returned.

Return type

List[PipelineStep]

_pipeline_init_steps_(ignored_steps=None)

Initialise steps with global pipeline parameters.

Parameters

ignored_steps (Optional[List[str]]) – a list of step productions. All steps with a production in ignored_steps will be ignored.

check_valid(*args, ignored_steps=None)

Check that the current pipeline can be run, which is possible if all steps' needs are satisfied

Parameters
  • args – list of additional attributes to add to the starting pipeline state.

  • ignored_steps (Optional[List[str]]) – a list of step productions. All steps with a production in ignored_steps will be ignored.

Return type

Tuple[bool, List[str]]

Returns

a tuple: (True, [warnings]) if the pipeline is valid, (False, [errors]) otherwise

rerun_from(state, from_step, ignored_steps=None)

Recompute steps, starting from from_step (included). Previous steps' results are not recomputed.

Note

steps are not re-inited using _pipeline_init_steps_().

Parameters
  • state (PipelineState) – the previously computed state

  • from_step (Union[str, Type[PipelineStep]]) –

    first step to recompute from. Either:

    • str: in that case, the name of a step production ('tokens', 'corefs'…)

    • Type[PipelineStep]: in that case, the class of a step

  • ignored_steps (Optional[List[str]]) – a list of step productions. All steps with a production in ignored_steps will be ignored.

Return type

PipelineState

Returns

the output of the last step of the pipeline
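For example (a sketch; assumes state was produced by a previous run of this pipeline):

    # recompute from the step producing 'characters' onward, reusing
    # earlier results (tokens, entities, ...) stored in `state`
    new_state = pipeline.rerun_from(state, "characters")

    # equivalently, identify the step by its class
    from renard.pipeline.character_unification import NaiveCharacterUnifier
    new_state = pipeline.rerun_from(state, NaiveCharacterUnifier)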

Pipeline State

class renard.pipeline.core.PipelineState(text, dynamic_blocks=None, tokens=None, char2token=None, sentences=None, quotes=None, speakers=None, sentences_polarities=None, entities=None, corefs=None, characters=None, character_network=None)

The state of a pipeline, annotated during a Pipeline run

Parameters
  • text (Optional[str]) –

  • dynamic_blocks (Optional[List[Tuple[int, int]]]) –

  • tokens (Optional[List[str]]) –

  • char2token (Optional[List[int]]) –

  • sentences (Optional[List[List[str]]]) –

  • quotes (Optional[List[Quote]]) –

  • speakers (Optional[List[Optional[Character]]]) –

  • sentences_polarities (Optional[List[float]]) –

  • entities (Optional[List[NEREntity]]) –

  • corefs (Optional[List[List[Mention]]]) –

  • characters (Optional[List[Character]]) –

  • character_network (Union[List[Graph], Graph, None]) –

__eq__(other)

Return self==value.

__hash__ = None
__init__(text, dynamic_blocks=None, tokens=None, char2token=None, sentences=None, quotes=None, speakers=None, sentences_polarities=None, entities=None, corefs=None, characters=None, character_network=None)
Parameters
  • text (Optional[str]) –

  • dynamic_blocks (Optional[List[Tuple[int, int]]]) –

  • tokens (Optional[List[str]]) –

  • char2token (Optional[List[int]]) –

  • sentences (Optional[List[List[str]]]) –

  • quotes (Optional[List[Quote]]) –

  • speakers (Optional[List[Optional[Character]]]) –

  • sentences_polarities (Optional[List[float]]) –

  • entities (Optional[List[NEREntity]]) –

  • corefs (Optional[List[List[Mention]]]) –

  • characters (Optional[List[Character]]) –

  • character_network (Union[List[Graph], Graph, None]) –

__repr__()

Return repr(self).

char2token: Optional[List[int]] = None

mapping from each text character to the index of its corresponding token

character_network: Optional[Union[List[networkx.classes.graph.Graph], networkx.classes.graph.Graph]] = None

character network (or list of networks in the case of a dynamic network)

characters: Optional[List[renard.pipeline.character_unification.Character]] = None

detected characters

corefs: Optional[List[List[renard.pipeline.core.Mention]]] = None

coreference chains

dynamic_blocks: Optional[List[Tuple[int, int]]] = None

text split into blocks. When dynamic blocks are given, the final network is dynamic, and split according to these blocks.

entities: Optional[List[renard.pipeline.ner.NEREntity]] = None

NER entities

export_graph_to_gexf(path, name_style='most_frequent')

Export the characters graph to Gephi's gexf format

Parameters
  • path (str) – export file path

  • name_style (Union[Literal[‘longest’, ‘shortest’, ‘most_frequent’], Callable[[Character], str]]) – see graph_with_names() for more details

get_character(name, partial_match=True)

Try to get a character by one of its names.

Note

Several characters may match the given name, but only the first one is returned.

Note

Comparison is case-insensitive.

Parameters
  • name (str) – One of the names of the searched character.

  • partial_match (bool) – when True, will also return a character if the given name is only part of one of its names. Otherwise, only a character with exactly the given name will be returned.

Return type

Optional[Character]

Returns

a Character, or None if no character was found.
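For example (a sketch; the names are illustrative):

    character = state.get_character("Elizabeth")
    if character is not None:
        print(character.names)

    # exact-name lookup only, no partial matching
    character = state.get_character("Elizabeth Bennet", partial_match=False)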

plot_graph(name_style='most_frequent', fig=None, cumulative=False, graph_start_idx=1, stable_layout=False, layout=None, node_kwargs=None, edge_kwargs=None, label_kwargs=None)

Plot self.character_network using reasonable default parameters

Note

when plotting a dynamic graph, a slider attribute is added to fig when it is given, in order to keep a reference to the slider.

Parameters
  • name_style (Union[Literal[‘longest’, ‘shortest’, ‘most_frequent’], Callable[[Character], str]]) – see graph_with_names() for more details

  • fig (Optional[Figure]) – if specified, this matplotlib figure will be used for plotting

  • cumulative (bool) – if True and self.character_network is dynamic, plot a cumulative graph instead of a sequential one

  • graph_start_idx (int) – When self.character_network is dynamic, index of the first graph to plot, starting at 1 (not 0, since the graph slider starts at 1)

  • stable_layout (bool) – if self.character_network is dynamic and this parameter is True, characters will keep the same position in space at each timestep. Characters’ positions are based on the final cumulative graph layout.

  • layout (Union[Dict[Character, Tuple[float, float]], Dict[Character, ndarray], None]) – pre-computed graph layout

  • node_kwargs (Union[Dict[str, Any], List[Dict[str, Any]], None]) – passed to nx.draw_networkx_nodes()

  • edge_kwargs (Union[Dict[str, Any], List[Dict[str, Any]], None]) – passed to nx.draw_networkx_edges()

  • label_kwargs (Union[Dict[str, Any], List[Dict[str, Any]], None]) – passed to nx.draw_networkx_labels()
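A plotting sketch (assumes matplotlib, which the fig parameter already implies):

    import matplotlib.pyplot as plt

    fig = plt.figure()
    state.plot_graph(name_style="most_frequent", fig=fig)
    plt.show()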

plot_graph_to_file(path, name_style='most_frequent', layout=None, fig=None, node_kwargs=None, edge_kwargs=None, label_kwargs=None)

Plot self.character_network using reasonable default parameters, and save the produced figure to a file

Parameters
  • name_style (Union[Literal[‘longest’, ‘shortest’, ‘most_frequent’], Callable[[Character], str]]) – see graph_with_names() for more details

  • layout (Union[Dict[Character, Tuple[float, float]], Dict[Character, ndarray], None]) – pre-computed graph layout

  • fig (Optional[Figure]) – if specified, this matplotlib figure will be used for plotting

  • node_kwargs (Optional[Dict[str, Any]]) – passed to nx.draw_networkx_nodes()

  • edge_kwargs (Optional[Dict[str, Any]]) – passed to nx.draw_networkx_edges()

  • label_kwargs (Optional[Dict[str, Any]]) – passed to nx.draw_networkx_labels()

  • path (str) –

plot_graphs_to_dir(directory, name_style='most_frequent', cumulative=False, stable_layout=False, layout=None, node_kwargs=None, edge_kwargs=None, label_kwargs=None)

Plot self.character_network using reasonable default parameters, and save the produced figures in the specified directory.

Parameters
  • name_style (Union[Literal[‘longest’, ‘shortest’, ‘most_frequent’], Callable[[Character], str]]) – see graph_with_names() for more details

  • cumulative (bool) – if True plot a cumulative graph instead of a sequential one

  • stable_layout (bool) – If this parameter is True, characters will keep the same position in space at each timestep. Characters’ positions are based on the final cumulative graph layout.

  • layout (Union[Dict[Character, Tuple[float, float]], Dict[Character, ndarray], None]) – pre-computed graph layout

  • node_kwargs (Optional[List[Dict[str, Any]]]) – passed to nx.draw_networkx_nodes()

  • edge_kwargs (Optional[List[Dict[str, Any]]]) – passed to nx.draw_networkx_edges()

  • label_kwargs (Optional[List[Dict[str, Any]]]) – passed to nx.draw_networkx_labels()

  • directory (str) –

quotes: Optional[List[renard.pipeline.quote_detection.Quote]] = None

quotes

sentences: Optional[List[List[str]]] = None

text split into sentences, each sentence being a list of tokens

sentences_polarities: Optional[List[float]] = None

polarity of each sentence

speakers: Optional[List[Optional[renard.pipeline.character_unification.Character]]] = None

quote speakers

text: Optional[str]

input text

tokens: Optional[List[str]] = None

text split into tokens

Pipeline Steps

class renard.pipeline.core.PipelineStep

An abstract pipeline step

Note

The __call__, needs and production methods _must_ be overridden by derived classes.

Note

The optional_needs and supported_langs methods can be overridden by derived classes.
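A sketch of a custom step honoring this contract (the step itself is hypothetical):

    from typing import Any, Dict, Set

    from renard.pipeline.core import PipelineStep

    class SentenceCounter(PipelineStep):
        """A toy step counting sentences (illustrative only)."""

        def __call__(self, sentences, **kwargs) -> Dict[str, Any]:
            # the returned dict is merged into the pipeline state
            return {"sentences_count": len(sentences)}

        def needs(self) -> Set[str]:
            return {"sentences"}

        def production(self) -> Set[str]:
            return {"sentences_count"}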

__call__(text, **kwargs)

Call self as a function.

Parameters

text (str) –

Return type

Dict[str, Any]

__init__()

Initialize the PipelineStep with a given configuration.

_pipeline_init_(lang, progress_reporter, **kwargs)

Set the step configuration that is common to the whole pipeline.

Parameters
  • lang (str) – the lang of the whole pipeline

  • progress_reporter (ProgressReporter) –

  • kwargs – additional pipeline parameters.

Return type

Optional[Dict[Literal[‘lang’, ‘progress_reporter’, ‘character_ner_tag’], Any]]

Returns

a step can return a dictionary of pipeline params if it wishes to modify some of them.

needs()
Return type

Set[str]

Returns

a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

optional_needs()
Return type

Set[str]

Returns

a set of state attributes optionally needed by this PipelineStep. This method can be overridden by derived classes.

production()
Return type

Set[str]

Returns

a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()
Return type

Union[Set[str], Literal[‘any’]]

Returns

a set of supported languages, as ISO 639-3 codes, or the string 'any'

Preprocessing

class renard.pipeline.preprocessing.CustomSubstitutionPreprocessor(substition_rules)

A preprocessor allowing regex-based substitution

Parameters

substition_rules (List[Tuple[str, str]]) –

__call__(text, **kwargs)
Parameters

text (str) –

Return type

Dict[str, Any]

__init__(substition_rules)
Parameters

substition_rules (List[Tuple[str, str]]) – A list of rules, each rule being of the form (match, substitution).
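For example (a sketch; the rules are illustrative regexes):

    from renard.pipeline.preprocessing import CustomSubstitutionPreprocessor

    # each rule is (match, substitution), match being a regex
    preprocessor = CustomSubstitutionPreprocessor(
        substition_rules=[(r"Mr\.", "Mister"), (r"\s+", " ")]
    )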

needs()
Return type

Set[str]

Returns

a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

production()
Return type

Set[str]

Returns

a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()
Return type

Union[Set[str], Literal[‘any’]]

Returns

a set of supported languages, as ISO 639-3 codes, or the string 'any'

Tokenization

NLTKTokenizer

class renard.pipeline.tokenization.NLTKTokenizer

An NLTK-based tokenizer

__call__(text, **kwargs)

Call self as a function.

Parameters

text (str) –

Return type

Dict[str, Any]

__init__()

Initialize the PipelineStep with a given configuration.

_pipeline_init_(lang, **kwargs)

Set the step configuration that is common to the whole pipeline.

Parameters
  • lang (str) – the lang of the whole pipeline

  • progress_reporter

  • kwargs – additional pipeline parameters.

Returns

a step can return a dictionary of pipeline params if it wishes to modify some of them.

needs()
Return type

Set[str]

Returns

a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

production()
Return type

Set[str]

Returns

a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()
Return type

Union[Set[str], Literal[‘any’]]

Returns

a set of supported languages, as ISO 639-3 codes, or the string 'any'

Named Entity Recognition

class renard.pipeline.ner.NEREntity(tokens, start_idx, end_idx, tag)
Parameters
  • tokens (List[str]) –

  • start_idx (int) –

  • end_idx (int) –

  • tag (str) –

__eq__(other)

Return self==value.

__hash__()

Return hash(self).

Return type

int

__init__(tokens, start_idx, end_idx, tag)
Parameters
  • tokens (List[str]) –

  • start_idx (int) –

  • end_idx (int) –

  • tag (str) –

__repr__()

Return repr(self).

shifted(shift)

Note

This method is implemented here to avoid type issues. Since Mention.shifted() cannot be annotated as returning Self, this method annotates the correct return type when using NEREntity.shifted().

Parameters

shift (int) –

Return type

NEREntity

tag: str

NER class (without the BIO prefix, i.e. PER and not B-PER)

BertNamedEntityRecognizer

class renard.pipeline.ner.BertNamedEntityRecognizer(model=None, batch_size=4, device='auto', tokenizer=None, context_retriever=None)

An entity recognizer based on BERT

Parameters
  • model (Union[PreTrainedModel, str, None]) –

  • batch_size (int) –

  • device (Literal[‘cpu’, ‘cuda’, ‘auto’]) –

  • tokenizer (Optional[PreTrainedTokenizerFast]) –

  • context_retriever (Optional[NERContextRetriever]) –

__call__(tokens, sentences, **kwargs)
Parameters
  • text

  • tokens (List[str]) –

  • sentences (List[List[str]]) –

Return type

Dict[str, Any]

__init__(model=None, batch_size=4, device='auto', tokenizer=None, context_retriever=None)
Parameters
  • model (Union[PreTrainedModel, str, None]) –

    Either:

    • None: the model will be chosen automatically according to the pipeline's lang

    • str: a huggingface model ID

    • a PreTrainedModel: a custom pre-trained BERT model. If specified, a tokenizer must be passed as well.

  • batch_size (int) – batch size at inference

  • device (Literal[‘cpu’, ‘cuda’, ‘auto’]) – computation device

  • tokenizer (Optional[PreTrainedTokenizerFast]) – a custom tokenizer

  • context_retriever (Optional[NERContextRetriever]) – if specified, use context_retriever to retrieve relevant global context at run time, generally trading runtime for NER performance.
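Instantiation sketches (the huggingface model ID below is an illustrative assumption, not the pipeline's default):

    from renard.pipeline.ner import BertNamedEntityRecognizer

    # let the pipeline choose a model from its lang
    ner = BertNamedEntityRecognizer()

    # or pin a specific huggingface model ID
    ner = BertNamedEntityRecognizer(
        model="dslim/bert-base-NER", batch_size=8, device="cuda"
    )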

_pipeline_init_(lang, **kwargs)

Set the step configuration that is common to the whole pipeline.

Parameters
  • lang (str) – the lang of the whole pipeline

  • progress_reporter

  • kwargs – additional pipeline parameters.

Returns

a step can return a dictionary of pipeline params if it wishes to modify some of them.

batch_labels(batchs, batch_i, wp_labels, tokens, context_mask)

Align labels to tokens rather than wordpiece tokens.

Parameters
  • batchs (BatchEncoding) – huggingface batch

  • batch_i (int) – batch index

  • wp_labels (List[str]) – wordpiece aligned labels

  • tokens (List[str]) – original tokens

  • context_mask (Tensor) –

Return type

List[str]

needs()
Return type

Set[str]

Returns

a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

production()
Return type

Set[str]

Returns

a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()
Return type

Union[Set[str], Literal[‘any’]]

Returns

a set of supported languages, as ISO 639-3 codes, or the string 'any'

NLTKNamedEntityRecognizer

class renard.pipeline.ner.NLTKNamedEntityRecognizer

An entity recognizer based on NLTK

__call__(tokens, **kwargs)
Parameters
  • text

  • tokens (List[str]) –

Return type

Dict[str, Any]

__init__()
Parameters

language – ISO 639-2 3-letter language code

needs()
Return type

Set[str]

Returns

a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

production()
Return type

Set[str]

Returns

a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()
Return type

Union[Set[str], Literal[‘any’]]

Returns

a set of supported languages, as ISO 639-3 codes, or the string 'any'

Coreference Resolution

A coreference resolver returns a list of coreference chains, each chain being a list of Mention.

class renard.pipeline.core.Mention(tokens, start_idx, end_idx)
Parameters
  • tokens (List[str]) –

  • start_idx (int) –

  • end_idx (int) –

__eq__(other)

Return self==value.

Parameters

other (Mention) –

Return type

bool

__hash__()

Return hash(self).

Return type

int

__init__(tokens, start_idx, end_idx)
Parameters
  • tokens (List[str]) –

  • start_idx (int) –

  • end_idx (int) –

__repr__()

Return repr(self).

BertCoreferenceResolver

class renard.pipeline.corefs.BertCoreferenceResolver(model=None, hugginface_model_id=None, batch_size=1, device='auto', tokenizer=None, block_size=512, hierarchical_merging=False)

A coreference resolver using BERT. Loosely based on ‘End-to-end Neural Coreference Resolution’ (Lee et al., 2017) and ‘BERT for coreference resolution’ (Joshi et al., 2019).

Parameters
  • model (Optional[BertForCoreferenceResolution]) –

  • hugginface_model_id (Optional[str]) –

  • batch_size (int) –

  • device (Literal[‘auto’, ‘cuda’, ‘cpu’]) –

  • tokenizer (Optional[PreTrainedTokenizerFast]) –

  • block_size (int) –

  • hierarchical_merging (bool) –

__call__(tokens, **kwargs)

Call self as a function.

Parameters

tokens (List[str]) –

Return type

Dict[str, Any]

__init__(model=None, hugginface_model_id=None, batch_size=1, device='auto', tokenizer=None, block_size=512, hierarchical_merging=False)

Note

In the future, only mentions_per_tokens, antecedents_nb and max_span_size shall be read directly from the model’s config.

Parameters
  • huggingface_model_id – a custom huggingface model ID. This allows bypassing the lang pipeline parameter, which normally chooses a huggingface model automatically.

  • batch_size (int) – batch size at inference

  • device (Literal[‘auto’, ‘cuda’, ‘cpu’]) – computation device

  • block_size (int) – size of blocks to pass to the coreference model

  • hierarchical_merging (bool) – if True, attempts to use tibert’s hierarchical merging feature. In that case, blocks of size block_size are merged to perform inference on the whole document.

  • model (Optional[BertForCoreferenceResolution]) –

  • hugginface_model_id (Optional[str]) –

  • tokenizer (Optional[PreTrainedTokenizerFast]) –
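An instantiation sketch for long documents (parameter values are illustrative):

    from renard.pipeline.corefs import BertCoreferenceResolver

    # run inference on 512-token blocks, then merge them hierarchically
    resolver = BertCoreferenceResolver(block_size=512, hierarchical_merging=True)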

_pipeline_init_(lang, **kwargs)

Set the step configuration that is common to the whole pipeline.

Parameters
  • lang (str) – the lang of the whole pipeline

  • progress_reporter

  • kwargs – additional pipeline parameters.

Returns

a step can return a dictionary of pipeline params if it wishes to modify some of them.

needs()
Return type

Set[str]

Returns

a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

production()
Return type

Set[str]

Returns

a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()
Return type

Union[Set[str], Literal[‘any’]]

Returns

a set of supported languages, as ISO 639-3 codes, or the string 'any'

SpacyCorefereeCoreferenceResolver

class renard.pipeline.corefs.SpacyCorefereeCoreferenceResolver(max_chunk_size=10000)

A coreference resolver using spacy's coreferee.

Note

  • This step requires installing Renard's ‘spacy’ extra

  • While this step automatically installs the needed spacy models, it still requires a manual installation of the coreferee model: python -m coreferee install en

Parameters

max_chunk_size (Optional[int]) –

__call__(text, tokens, dynamic_blocks_tokens=None, **kwargs)

Call self as a function.

Parameters
  • text (str) –

  • tokens (List[str]) –

  • dynamic_blocks_tokens (Optional[List[List[str]]]) –

Return type

Dict[str, Any]

__init__(max_chunk_size=10000)
Parameters

max_chunk_size (Optional[int]) – coreference chunk size, in tokens

static _coreferee_get_mention_tokens(coref_model, mention_heads, doc)

Coreferee only returns mention heads for each mention, not the whole span. This hack (described at the end of part 2 of the coreferee README: https://github.com/richardpaulhudson/coreferee#2-interacting-with-the-data-model) gets the whole span as a list of spacy tokens.

Parameters
  • coref_model (CorefereeBroker) –

  • mention_heads (Mention) –

  • doc (Doc) –

Return type

List[Token]

_pipeline_init_(lang, progress_reporter)

Set the step configuration that is common to the whole pipeline.

Parameters
  • lang (str) – the lang of the whole pipeline

  • progress_reporter (ProgressReporter) –

  • kwargs – additional pipeline parameters.

Returns

a step can return a dictionary of pipeline params if it wishes to modify some of them.

static _spacy_try_infer_spaces(tokens)

Try to infer, for each token, if there is a space between this token and the next.

Parameters

tokens (List[str]) –

Return type

List[bool]

needs()
Return type

Set[str]

Returns

a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

optional_needs()
Return type

Set[str]

Returns

a set of state attributes optionally needed by this PipelineStep. This method can be overridden by derived classes.

production()
Return type

Set[str]

Returns

a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

Quote Detection

QuoteDetector

class renard.pipeline.quote_detection.QuoteDetector(quote_pairs=None)

Extract quotes using simple rules.

Parameters

quote_pairs (Optional[List[Tuple[str, str]]]) –

__call__(tokens, **kwargs)

Call self as a function.

Parameters

tokens (List[str]) –

Return type

Dict[str, Any]

__init__(quote_pairs=None)
Parameters

quote_pairs (Optional[List[Tuple[str, str]]]) – if None, defaults to QuoteDetector.DEFAULT_QUOTE_PAIRS
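For example (a sketch):

    from renard.pipeline.quote_detection import QuoteDetector

    # default quote pairs
    detector = QuoteDetector()

    # or restrict detection to curly quotes only
    detector = QuoteDetector(quote_pairs=[("“", "”")])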

needs()

tokens

Return type

Set[str]

production()

quotes

Return type

Set[str]

supported_langs()

any

Return type

Union[Set[str], Literal[‘any’]]

Sentiment Analysis

NLTKSentimentAnalyzer

class renard.pipeline.sentiment_analysis.NLTKSentimentAnalyzer

A sentiment analyzer based on NLTK’s Vader.

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

__call__(sentences, **kwargs)

Call self as a function.

Parameters

sentences (List[List[str]]) –

Return type

Dict[str, Any]

__init__()

Initialize the PipelineStep with a given configuration.

needs()
Return type

Set[str]

Returns

a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

production()
Return type

Set[str]

Returns

a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

Characters Unification

class renard.pipeline.character_unification.Character(names, mentions, gender=Gender.UNKNOWN)
Parameters
  • names (FrozenSet[str]) –

  • mentions (List[Mention]) –

  • gender (Gender) –

__delattr__(name)

Implement delattr(self, name).

__eq__(other)

Return self==value.

__hash__()

Return hash(self).

Return type

int

__init__(names, mentions, gender=Gender.UNKNOWN)
Parameters
  • names (FrozenSet[str]) –

  • mentions (List[Mention]) –

  • gender (Gender) –

__repr__()

Return repr(self).

Return type

str

__setattr__(name, value)

Implement setattr(self, name, value).

NaiveCharacterUnifier

class renard.pipeline.character_unification.NaiveCharacterUnifier(min_appearances=0)

A basic character unifier using NER

Parameters

min_appearances (int) –

__call__(text, entities, corefs=None, **kwargs)
Parameters
  • text (str) –

  • tokens

  • entities (List[NEREntity]) –

  • corefs (Optional[List[List[Mention]]]) –

Return type

Dict[str, Any]

__init__(min_appearances=0)
Parameters

min_appearances (int) – minimum number of appearances of a character for it to be valid

_pipeline_init_(lang, character_ner_tag, **kwargs)

Set the step configuration that is common to the whole pipeline.

Parameters
  • lang (str) – the lang of the whole pipeline

  • progress_reporter

  • kwargs – additional pipeline parameters.

  • character_ner_tag (str) –

Returns

a step can return a dictionary of pipeline params if it wishes to modify some of them.

needs()
Return type

Set[str]

Returns

a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

optional_needs()
Return type

Set[str]

Returns

a set of state attributes optionally needed by this PipelineStep. This method can be overridden by derived classes.

production()
Return type

Set[str]

Returns

a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()
Return type

Union[Set[str], Literal[‘any’]]

Returns

a set of supported languages, as ISO 639-3 codes, or the string 'any'

GraphRulesCharacterUnifier

class renard.pipeline.character_unification.GraphRulesCharacterUnifier(min_appearances=0, additional_hypocorisms=None, link_corefs_mentions=False, ignore_lone_titles=None)

Unify characters by creating a graph where mentions are linked when they refer to the same character, and then merging this graph's nodes.

Note

This algorithm is inspired by Vala et al., 2015.

Parameters
  • min_appearances (int) –

  • additional_hypocorisms (Optional[List[Tuple[str, List[str]]]]) –

  • link_corefs_mentions (bool) –

  • ignore_lone_titles (Optional[Set[str]]) –

__call__(entities, corefs=None, **kwargs)

Call self as a function.

Parameters
  • entities (List[NEREntity]) –

  • corefs (Optional[List[List[Mention]]]) –

  • kwargs (dict) –

Return type

Dict[str, Any]

__init__(min_appearances=0, additional_hypocorisms=None, link_corefs_mentions=False, ignore_lone_titles=None)
Parameters
  • min_appearances (int) – minimum number of appearances of a character for it to be considered valid.

  • additional_hypocorisms (Optional[List[Tuple[str, List[str]]]]) – a list of additional hypocorisms. Each hypocorism is a tuple whose first element is a name and whose second element is a list of nicknames associated with it

  • link_corefs_mentions (bool) – if True, will also use coreference resolution to link names together. This is disabled by default since a coreference model can extract a lot of spurious links. However, linking by coref is sometimes the only way to resolve a character alias.

  • ignore_lone_titles (Optional[Set[str]]) – a set of titles to ignore when they stand on their own. This avoids extracting false positive characters such as ‘Mr.’ or ‘Miss’.
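For example (a sketch; the names are illustrative):

    from renard.pipeline.character_unification import GraphRulesCharacterUnifier

    unifier = GraphRulesCharacterUnifier(
        min_appearances=2,
        additional_hypocorisms=[("Elizabeth", ["Lizzy", "Eliza"])],
        ignore_lone_titles={"Mr.", "Miss"},
    )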

_pipeline_init_(lang, character_ner_tag, **kwargs)

Set the step configuration that is common to the whole pipeline.

Parameters
  • lang (str) – the lang of the whole pipeline

  • progress_reporter

  • kwargs – additional pipeline parameters.

  • character_ner_tag (str) –

Returns

a step can return a dictionary of pipeline params if it wishes to modify some of them.

infer_name_gender(name, corefs, hname_constants)

Try to infer a name’s gender

Parameters
  • name (str) –

  • corefs (Optional[List[List[Mention]]]) –

  • hname_constants (Constants) – HumanName constants

Return type

Gender

Check if two names are related after removing their titles

Parameters
  • name1 (str) –

  • name2 (str) –

  • hname_constants (Constants) –

Return type

bool

needs()
Return type

Set[str]

Returns

a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

optional_needs()
Return type

Set[str]

Returns

a set of state attributes optionally needed by this PipelineStep. This method can be overridden by derived classes.

production()
Return type

Set[str]

Returns

a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()
Return type

Union[Set[str], Literal[‘any’]]

Returns

a set of supported languages, as ISO 639-3 codes, or the string 'any'

Speaker Attribution

class renard.pipeline.speaker_attribution.BertSpeakerDetector(model=None, batch_size=4, device='auto', tokenizer=None)

Detect quote speakers in text

Parameters
  • model (Union[PreTrainedModel, str, None]) –

  • batch_size (int) –

  • device (Literal[‘cpu’, ‘cuda’, ‘auto’]) –

  • tokenizer (Optional[PreTrainedTokenizerFast]) –

__call__(tokens, quotes, characters, **kwargs)

Call self as a function.

Parameters
  • tokens (List[str]) –

  • quotes (List[Quote]) –

  • characters (List[Character]) –

Return type

Dict[str, Any]

__init__(model=None, batch_size=4, device='auto', tokenizer=None)

Initialize the PipelineStep with a given configuration.

Parameters
  • model (Union[PreTrainedModel, str, None]) –

  • batch_size (int) –

  • device (Literal[‘cpu’, ‘cuda’, ‘auto’]) –

  • tokenizer (Optional[PreTrainedTokenizerFast]) –

_pipeline_init_(lang, **kwargs)

Set the step configuration that is common to the whole pipeline.

Parameters
  • lang (str) – the lang of the whole pipeline

  • progress_reporter

  • kwargs – additional pipeline parameters.

Returns

a step can return a dictionary of pipeline params if it wishes to modify some of them.

needs()

quotes, tokens, characters

Return type

Set[str]

production()

speakers

Return type

Set[str]

Graph Extraction

CoOccurrencesGraphExtractor

class renard.pipeline.graph_extraction.CoOccurrencesGraphExtractor(co_occurrences_dist=None, dynamic=False, dynamic_window=None, dynamic_overlap=0, additional_ner_classes=None)

A simple character graph extractor using co-occurrences

Parameters
  • co_occurrences_dist (Union[int, Tuple[int, Literal[‘tokens’, ‘sentences’]], None]) –

  • dynamic (bool) –

  • dynamic_window (Optional[int]) –

  • dynamic_overlap (int) –

  • additional_ner_classes (Optional[List[str]]) –

__call__(characters, sentences, char2token=None, dynamic_blocks=None, sentences_polarities=None, entities=None, co_occurrences_blocks=None, **kwargs)

Extract a co-occurrence character network.

Parameters
  • co_occurrences_blocks (Optional[Tuple[List[Tuple[int, int]], Literal[‘characters’, ‘tokens’]]]) – custom blocks where co-occurrences should be recorded. For example, this can be used to perform chapter level co-occurrences.

  • characters (Set[Character]) –

  • sentences (List[List[str]]) –

  • char2token (Optional[List[int]]) –

  • dynamic_blocks (Optional[Tuple[List[Tuple[int, int]], Literal[‘characters’, ‘tokens’]]]) –

  • sentences_polarities (Optional[List[float]]) –

  • entities (Optional[List[NEREntity]]) –

Return type

Dict[str, Any]

Returns

a dict with key 'character_network' and a nx.Graph or a list of nx.Graph as value.

__init__(co_occurrences_dist=None, dynamic=False, dynamic_window=None, dynamic_overlap=0, additional_ner_classes=None)
Parameters
  • co_occurrences_dist (Union[int, Tuple[int, Literal[‘tokens’, ‘sentences’]], None]) –

max accepted distance between two character appearances to form a co-occurrence interaction.

    • if an int is given, the distance is in number of tokens

    • if a tuple is given, the first element of the tuple is a distance while the second is a unit. Examples: (1, "sentences"), (3, "tokens").

  • dynamic (bool) –

    • if False (the default), a static nx.Graph is extracted

    • if True, several nx.Graph are extracted. In that case, dynamic_window and dynamic_overlap can be specified. If dynamic_window is not specified, this step expects the text to be cut into ‘dynamic blocks’, and a graph will be extracted for each block. In that case, dynamic_blocks must be passed to the pipeline as a List[str] at runtime.

  • dynamic_window (Optional[int]) – dynamic window, in number of interactions. A dynamic window of n means that each returned graph will be formed by n interactions.

  • dynamic_overlap (int) – overlap, in number of interactions.

  • additional_ner_classes (Optional[List[str]]) – if specified, will include entities other than characters in the final graph. No attempt will be made at unifying the entities (for example, “New York” will be distinct from “New York City”).
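Configuration sketches (values are illustrative):

    from renard.pipeline.graph_extraction import CoOccurrencesGraphExtractor

    # static graph, with co-occurrences within 32 tokens
    extractor = CoOccurrencesGraphExtractor(co_occurrences_dist=32)

    # dynamic graph: one graph per window of 20 interactions, overlapping by 5
    extractor = CoOccurrencesGraphExtractor(
        co_occurrences_dist=(1, "sentences"),
        dynamic=True,
        dynamic_window=20,
        dynamic_overlap=5,
    )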

_create_co_occurrences_blocks(sentences, mentions)

Create co-occurrences blocks using self.co_occurrences_dist. All entities within a block are considered as co-occurring.

Parameters
  • sentences (List[List[str]]) –

  • mentions (List[Tuple[Any, NEREntity]]) –

Return type

Tuple[List[Tuple[int, int]], Literal[‘characters’, ‘tokens’]]

_extract_dynamic_graph(mentions, window, overlap, dynamic_blocks, sentences, sentences_polarities, co_occurrences_blocks)

Note

only one of window or dynamic_blocks should be specified

Parameters
  • mentions (List[Tuple[Any, NEREntity]]) – A list of entity mentions, ordered by appearance, each of the form (KEY, MENTION). KEY determines the uniqueness of the entity.

  • window (Optional[int]) – dynamic window, in tokens.

  • overlap (int) – window overlap

  • dynamic_blocks (Optional[Tuple[List[Tuple[int, int]], Literal[‘characters’, ‘tokens’]]]) – boundaries of each dynamic block

  • co_occurrences_blocks (Optional[Tuple[List[Tuple[int, int]], Literal[‘characters’, ‘tokens’]]]) – boundaries of each co-occurrence block

  • sentences (List[List[str]]) –

  • sentences_polarities (Optional[List[float]]) –

Return type

List[Graph]

_extract_graph(mentions, sentences, sentences_polarities, co_occurrences_blocks)
Parameters
  • mentions (List[Tuple[Any, NEREntity]]) – A list of entity mentions, ordered by appearance, each of the form (KEY, MENTION). KEY determines the uniqueness of the entity.

  • sentences (List[List[str]]) – if specified, sentences_polarities must be specified as well.

  • sentences_polarities (Optional[List[float]]) – if specified, sentences must be specified as well. In that case, edges are annotated with the ‘polarity’ attribute, indicating the polarity of the relationship between two characters. Polarity between two interactions is computed as the strongest sentence polarity between those two mentions.

  • co_occurrences_blocks (Optional[Tuple[List[Tuple[int, int]], Literal[‘characters’, ‘tokens’]]]) – only unit ‘tokens’ is accepted.

Return type

Graph

needs()
Return type

Set[str]

Returns

a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

optional_needs()
Return type

Set[str]

Returns

a set of state attributes optionally needed by this PipelineStep. This method can be overridden by derived classes.

production()
Return type

Set[str]

Returns

a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()
Return type

Union[Set[str], Literal[‘any’]]

Returns

a set of supported languages, as ISO 639-3 codes, or the string 'any'

ConversationalGraphExtractor

class renard.pipeline.graph_extraction.ConversationalGraphExtractor(graph_type, conversation_dist=None, ignore_self_mention=True)

A graph extractor using conversations between characters, or character mentions.

Note

Does not support dynamic networks yet.

Parameters
  • graph_type (Literal[‘conversation’, ‘mention’]) –

  • conversation_dist (Union[int, Tuple[int, Literal[‘tokens’, ‘sentences’]], None]) –

  • ignore_self_mention (bool) –

__call__(sentences, quotes, speakers, characters, **kwargs)

Call self as a function.

Parameters
  • sentences (List[List[str]]) –

  • quotes (List[Quote]) –

  • speakers (List[Optional[Character]]) –

  • characters (Set[Character]) –

Return type

Dict[str, Any]

__init__(graph_type, conversation_dist=None, ignore_self_mention=True)
Parameters
  • graph_type (Literal[‘conversation’, ‘mention’]) – either ‘conversation’ or ‘mention’. ‘conversation’ extracts an undirected graph with interactions being extracted from the conversations occurring between characters. ‘mention’ extracts a directed graph where interactions are character mentions of one another in quoted speech.

  • conversation_dist (Union[int, Tuple[int, Literal[‘tokens’, ‘sentences’]], None]) – must be supplied if graph_type is ‘conversation’. The maximum distance between two quotations for them to be considered as interacting.

  • ignore_self_mention (bool) – if True, self-mentions are ignored when graph_type == ‘mention’
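For example (a sketch; distances are illustrative):

    from renard.pipeline.graph_extraction import ConversationalGraphExtractor

    # undirected graph: quotes at most 3 sentences apart are interactions
    extractor = ConversationalGraphExtractor(
        graph_type="conversation", conversation_dist=(3, "sentences")
    )

    # directed graph of character mentions in quoted speech
    extractor = ConversationalGraphExtractor(graph_type="mention")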

needs()

sentences, quotes, speakers, characters

Return type

Set[str]

production()

character_network

Return type

Set[str]

Stanford CoreNLP Pipeline

class renard.pipeline.stanford_corenlp.StanfordCoreNLPPipeline(annotate_corefs=False, corefs_algorithm='statistical', corenlp_custom_properties=None, server_timeout=9999999, **server_kwargs)

A full NLP pipeline using Stanford CoreNLP

Note

The Stanford CoreNLP pipeline requires the stanza library. You can install it with poetry using poetry install -E stanza.

Warning

RAM usage might be high for coreference resolution, as it processes the entire novel! If CoreNLP terminates with an out-of-memory error, you can try allocating more memory to the server using server_kwargs (example: {"memory": "8G"}).

Parameters
  • annotate_corefs (bool) –

  • corefs_algorithm (Literal[‘deterministic’, ‘statistical’, ‘neural’]) –

  • corenlp_custom_properties (Optional[Dict[str, Any]]) –

  • server_timeout (int) –

__call__(text, **kwargs)

Call self as a function.

Parameters

text (str) –

Return type

Dict[str, Any]

__init__(annotate_corefs=False, corefs_algorithm='statistical', corenlp_custom_properties=None, server_timeout=9999999, **server_kwargs)
Parameters
  • annotate_corefs (bool) – True if coreferences must be annotated, False otherwise. This parameter is not yet implemented.

  • corefs_algorithm (Literal[‘deterministic’, ‘statistical’, ‘neural’]) – one of {"deterministic", "statistical", "neural"}

  • corenlp_custom_properties (Optional[Dict[str, Any]]) – custom properties dictionary to pass to the CoreNLP server. Note that some properties are already set when calling the server, so not all properties are supported: it is intended as a last-resort escape hatch. In particular, do not set 'ner.applyFineGrained'. If you need to set the coreference algorithm used, see corefs_algorithm.

  • server_timeout (int) – CoreNLP server timeout in ms

  • server_kwargs – extra args for the Stanford CoreNLP server. be_quiet and max_char_length are not supported. See https://stanfordnlp.github.io/stanza/client_properties.html#corenlp-server-start-options-server for a list of possible args.
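An instantiation sketch (the memory value comes from the warning above; other values mirror the signature's defaults):

    from renard.pipeline.stanford_corenlp import StanfordCoreNLPPipeline

    corenlp_pipeline = StanfordCoreNLPPipeline(
        annotate_corefs=False,
        server_timeout=9999999,
        memory="8G",  # forwarded to the CoreNLP server via server_kwargs
    )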

needs()
Return type

Set[str]

Returns

a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

production()
Return type

Set[str]

Returns

a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

renard.pipeline.stanford_corenlp.corenlp_annotations_bio_tags(annotations)

Return a list of BIO tags extracted from Stanford CoreNLP annotations

Note

only PERSON, LOCATION, ORGANIZATION and MISC entities are considered; other types of entities are discarded (see https://stanfordnlp.github.io/CoreNLP/ner.html#description for a list of usual CoreNLP types).

Note

Weirdly, CoreNLP will annotate pronouns as entities. Only tokens having an NNP POS tag are kept by this function.

Parameters

annotations (Document) – stanford coreNLP text annotations

Return type

List[str]

Returns

a list of BIO tags.

Resources

Hypocorism

class renard.resources.hypocorisms.HypocorismGazetteer(lang='eng')

A hypocorism (nickname) gazetteer

Note

the data used for this gazetteer comes from https://github.com/carltonnorthern/nickname-and-diminutive-names-lookup and is licensed under the Apache 2.0 License

Parameters

lang (str) –

__init__(lang='eng')
Parameters

lang (str) – gazetteer language. Must be in HypocorismGazetteer.supported_langs.

_add_hypocorism_(name, nicknames)

Add a name associated with several nicknames

Parameters
  • name (str) –

  • nicknames (List[str]) – nicknames to associate with the given name

Check if one name is a hypocorism of the other (or if both names are equal)

Parameters
  • name1 (str) –

  • name2 (str) –

Return type

bool

get_nicknames(name)

Return all possible nicknames for the given name

Parameters

name (str) –

Return type

Set[str]

get_possible_names(nickname)

Return all names that can correspond to the given nickname

Parameters

nickname (str) –

Return type

Set[str]
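A usage sketch (returned values depend on the gazetteer data):

    from renard.resources.hypocorisms import HypocorismGazetteer

    gazetteer = HypocorismGazetteer(lang="eng")
    nicknames = gazetteer.get_nicknames("William")  # e.g. may contain "Bill"
    names = gazetteer.get_possible_names("Bill")    # e.g. may contain "William"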

Utils

renard.utils.BlockBounds

A BlockBounds delimits blocks in either raw text (“characters”) or tokenized text (“tokens”). It has the following form:

([(block start, block end), …], unit)

see block_bounds() to easily create BlockBounds

alias of Tuple[List[Tuple[int, int]], Literal[‘characters’, ‘tokens’]]

renard.utils.batch_index_select(input, dim, index)

Batched version of torch.index_select(). Inspired by https://discuss.pytorch.org/t/batched-index-select/9115/8

Parameters
  • input (Tensor) – a torch tensor of shape (B, *) where * is any number of additional dimensions.

  • dim (int) – the dimension in which to index

  • index (Tensor) – index tensor of shape (B, I)

Return type

Tensor

Returns

a tensor which indexes input along dimension dim using index. This tensor has the same shape as input, except in dimension dim, where it has dimension I.

renard.utils.block_bounds(blocks)

Return the boundaries of a series of blocks.

Parameters

blocks (Union[List[str], List[List[str]]]) – either a list of raw texts or a list of tokenized texts.

Return type

Tuple[List[Tuple[int, int]], Literal[‘characters’, ‘tokens’]]

Returns

A BlockBounds with the correct unit.
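For example (a sketch; the offsets shown assume blocks are laid out consecutively):

    from renard.utils import block_bounds

    # raw-text blocks -> bounds in 'characters'
    bounds = block_bounds(["First chapter.", "Second chapter."])
    # plausibly ([(0, 14), (14, 29)], "characters")

    # tokenized blocks -> bounds in 'tokens'
    bounds = block_bounds([["First", "chapter", "."], ["Second", "chapter", "."]])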

renard.utils.charbb2tokenbb(char_bb, char2token)

Convert a BlockBounds in characters to a BlockBounds in tokens.

Parameters
  • char_bb (Tuple[List[Tuple[int, int]], Literal[‘characters’, ‘tokens’]]) – block bounds, in ‘characters’.

  • char2token (List[int]) – a list with char2token[i] being the index of token corresponding to character i.

Return type

Tuple[List[Tuple[int, int]], Literal[‘characters’, ‘tokens’]]

Returns

a BlockBounds, in ‘tokens’.

renard.utils.search_pattern(seq, pattern)

Search for a pattern in a sequence

Parameters
  • seq (Iterable[TypeVar(R)]) – sequence in which to search

  • pattern (List[TypeVar(R)]) – searched pattern

Return type

List[int]

Returns

a list of pattern start indices
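For example (a sketch):

    from renard.utils import search_pattern

    starts = search_pattern([1, 2, 3, 1, 2, 4], [1, 2])
    # expected: [0, 3]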

renard.utils.spans(seq, max_len)

Cut the input sequence into all possible spans up to a maximum length

Note

spans are ordered from the smallest to the biggest, from the beginning of seq to the end of seq.

Parameters
  • seq (Collection[TypeVar(T)]) –

  • max_len (int) –

Return type

List[Tuple[TypeVar(T)]]

Returns

all possible spans of seq, of length at most max_len

Graph utils

renard.graph_utils.cumulative_graph(graphs)

Turn a dynamic graph into a cumulative graph, weight-wise

Parameters

graphs (List[Graph]) – A list of sequential graphs

Return type

List[Graph]

renard.graph_utils.dynamic_graph_to_gephi_graph(graphs)

Convert a dynamic graph to a Gephi-compatible dynamic graph. The resulting graph can be exported using nx.write_gexf() and will be read correctly by Gephi.

Note

Because of a limitation in networkx, the dynamic weight attribute is stored as dweight instead of weight.

Parameters

graphs (List[Graph]) – a dynamic graph

Return type

Graph

Returns

A dynamic Gephi-compatible graph

renard.graph_utils.graph_edges_attributes(G)

Compute the set of all edge attributes of a graph

Parameters

G (Graph) –

Return type

Set[str]

renard.graph_utils.graph_with_names(G, name_style='most_frequent')

Relabel a character graph, using a single name for each node

Parameters
  • name_style (Union[Literal[‘longest’, ‘shortest’, ‘most_frequent’], Callable[[Character], str]]) – characters name style in the resulting graph. Either a string ('longest', 'shortest' or 'most_frequent') or a custom function associating a character to its name

  • G (Graph) –

Return type

Graph

renard.graph_utils.layout_with_names(G, layout, name_style='most_frequent')
Parameters
  • G (Graph) – a graph of Character

  • layout (Union[Dict[Character, Tuple[float, float]], Dict[Character, ndarray]]) –

  • name_style (Union[Literal[‘longest’, ‘shortest’, ‘most_frequent’], Callable[[Character], str]]) –

Return type

dict

Plot utils

renard.plot_utils.plot_nx_graph_reasonably(G, ax=None, layout=None, node_kwargs=None, edge_kwargs=None, label_kwargs=None)

Try to plot a nx.Graph with ‘reasonable’ parameters

Parameters
  • G (Graph) – the graph to draw

  • ax – matplotlib axes

  • layout (Optional[dict]) – if given, this graph layout will be applied. Otherwise, use layout_nx_graph_reasonably().

  • node_kwargs (Optional[Dict[str, Any]]) – passed to nx.draw_networkx_nodes()

  • edge_kwargs (Optional[Dict[str, Any]]) – passed to nx.draw_networkx_edges()

  • label_kwargs (Optional[Dict[str, Any]]) – passed to nx.draw_networkx_labels()

NER utils

class renard.ner_utils.DataCollatorForTokenClassificationWithBatchEncoding(tokenizer, pad_to_multiple_of=None)

Same as transformers.DataCollatorForTokenClassification, except it correctly returns a BatchEncoding object with a correct encodings attribute.

Don’t know why this is not the default?

Parameters
  • tokenizer (PreTrainedTokenizerFast) –

  • pad_to_multiple_of (Optional[int]) –

__call__(features)

Call self as a function.

Parameters

features (List[dict]) –

Return type

Union[dict, BatchEncoding]

__init__(tokenizer, pad_to_multiple_of=None)
Parameters
  • tokenizer (PreTrainedTokenizerFast) –

  • pad_to_multiple_of (Optional[int]) –

class renard.ner_utils.NERDataset(elements, tokenizer, context_mask=None)
Variables

_context_mask – for each element, a mask indicating which tokens are part of the context (1 for context, 0 for text on which to perform inference). The mask allows discarding predictions made for the context at inference time, even though the context is still passed as input to the model.

Parameters
  • elements (List[List[str]]) –

  • tokenizer (PreTrainedTokenizerFast) –

  • context_mask (Optional[List[List[int]]]) –

__init__(elements, tokenizer, context_mask=None)
Parameters
  • elements (List[List[str]]) –

  • tokenizer (PreTrainedTokenizerFast) –

  • context_mask (Optional[List[List[int]]]) –

renard.ner_utils._tokenize_and_align_labels(examples, tokenizer, label_all_tokens=True)

Adapted from https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=vc0BSBLIIrJQ

Parameters
  • examples – an object with keys ‘tokens’ and ‘labels’

  • tokenizer (PreTrainedTokenizerFast) –

  • label_all_tokens (bool) –

renard.ner_utils.hgdataset_from_conll2002(path, tag_conversion_map=None, separator='\\t', **kwargs)

Load a CoNLL-2002 file as a Huggingface Dataset.

Parameters
  • path (str) – path to the CoNLL-2002 formatted file

  • tag_conversion_map (Optional[Dict[str, str]]) – conversion map for tags found in the input file. Example: {'B': 'B-PER', 'I': 'I-PER'}

  • separator (str) – separator between token and BIO tags

  • kwargs – additional kwargs for open (such as encoding or newline).

Return type

Dataset

Returns

a datasets.Dataset with features ‘tokens’ and ‘labels’.

renard.ner_utils.load_conll2002_bio(path, tag_conversion_map=None, separator='\\t', **kwargs)

Load a file in CoNLL-2002 BIO format. Sentences are expected to be separated by blank lines. Tags should be in the CoNLL-2002 format (such as ‘B-PER I-PER’); if this is not the case, see the tag_conversion_map argument.

Parameters
  • path (str) – path to the CoNLL-2002 formatted file

  • separator (str) – separator between token and BIO tags

  • tag_conversion_map (Optional[Dict[str, str]]) – conversion map for tags found in the input file. Example: {'B': 'B-PER', 'I': 'I-PER'}

  • kwargs – additional kwargs for open (such as encoding or newline).

Return type

Tuple[List[List[str]], List[str], List[NEREntity]]

Returns

(sentences, tokens, entities)
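A usage sketch (path and tag conversion are illustrative):

    from renard.ner_utils import load_conll2002_bio

    # each line of the file is of the form "token<TAB>tag", with sentences
    # separated by blank lines
    sentences, tokens, entities = load_conll2002_bio(
        "dataset.conll",
        tag_conversion_map={"B": "B-PER", "I": "I-PER"},
        encoding="utf-8",
    )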

renard.ner_utils.ner_entities(tokens, bio_tags, resolve_inconsistencies=True)

Extract NER entities from a list of BIO tags

Parameters
  • tokens (List[str]) – a list of tokens

  • bio_tags (List[str]) – a list of BIO tags. In particular, BIO tags should be in the CoNLL-2002 form (such as ‘B-PER I-PER’)

  • resolve_inconsistencies (bool) –

Return type

List[NEREntity]

Returns

A list of NER entities, in order of appearance
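For example (a sketch):

    from renard.ner_utils import ner_entities

    tokens = ["John", "Smith", "went", "to", "Paris"]
    tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
    entities = ner_entities(tokens, tags)
    # expected: two entities, "John Smith" (PER) and "Paris" (LOC)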