Reference

Core

Pipeline

class renard.pipeline.core.Pipeline(steps, lang='eng', progress_report='tqdm', warn=True)

A flexible NLP pipeline

Parameters:

steps (List[PipelineStep])
lang (str)
progress_report (Optional[Literal['tqdm']])
warn (bool)

PipelineParameter

All the parameters of the whole pipeline that are shared between steps.

alias of Literal['lang', 'progress_reporter', 'character_ner_tag']

__call__(text=None, ignored_steps=None, **kwargs)

Run the pipeline sequentially.

Parameters:

ignored_steps (Optional[List[str]]) – a list of step productions. Any step with a production in ignored_steps is ignored.
text (Optional[str])

Return type:

PipelineState

Returns:

the output of the last step of the pipeline

__init__(steps, lang='eng', progress_report='tqdm', warn=True)

Parameters:

steps (List[PipelineStep]) – a list of PipelineStep instances, executed in order
progress_report (Optional[Literal['tqdm']]) – if 'tqdm', report the pipeline progress using tqdm. If None, do not report progress.
lang (str) – ISO 639-3 language code
warn (bool)

_non_ignored_steps(ignored_steps)

Get steps that are not ignored.

Parameters:: ignored_steps (Optional[List[str]]) – a list of step productions. Steps with a production in ignored_steps won't be returned.
Return type:: List[PipelineStep]
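
The filtering rule described here can be illustrated with a minimal sketch. It is not Renard's implementation: ToyStep and non_ignored_steps are hypothetical names, and the sketch assumes that a step is dropped as soon as any one of its productions is listed in ignored_steps.

```python
from typing import List, NamedTuple, Optional, Set


class ToyStep(NamedTuple):
    """Hypothetical stand-in for a step; only its production matters here."""
    name: str
    production: Set[str]


def non_ignored_steps(steps: List[ToyStep], ignored_steps: Optional[List[str]]) -> List[ToyStep]:
    """Drop every step producing at least one attribute listed in ignored_steps."""
    ignored = set(ignored_steps or [])
    return [step for step in steps if not (step.production & ignored)]


steps = [
    ToyStep("tokenizer", {"tokens", "sentences"}),
    ToyStep("quote_detector", {"quotes"}),
    ToyStep("ner", {"entities"}),
]
kept = non_ignored_steps(steps, ["quotes"])
print([step.name for step in kept])
```

Passing None (the default) keeps every step, since the set of ignored productions is then empty.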

_pipeline_init_steps_(ignored_steps=None)

Initialise steps with global pipeline parameters.

Parameters:: ignored_steps (Optional[List[str]]) – a list of step productions. Any step with a production in ignored_steps is ignored.

check_valid(*args, ignored_steps=None)

Check that the current pipeline can be run, which is possible if the needs of every step are satisfied.

Parameters:

args – list of additional attributes to add to the starting pipeline state.
ignored_steps (Optional[List[str]]) – a list of step productions. Any step with a production in ignored_steps is ignored.

Return type:

Tuple[bool, List[str]]

Returns:

a tuple: (True, [warnings]) if the pipeline is valid, (False, [errors]) otherwise
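
The validity check can be pictured as a walk over the steps that tracks which state attributes are available so far. The sketch below is only an illustration under stated assumptions, not the library's code: ToyStep and toy_check_valid are hypothetical, text is assumed to be the only attribute present at the start (plus any extra names passed as args), and unmet optional needs are assumed to produce warnings rather than errors.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class ToyStep:
    """Hypothetical stand-in for a pipeline step (not Renard's class)."""
    name: str
    needs: Set[str]
    production: Set[str]
    optional_needs: Set[str] = field(default_factory=set)


def toy_check_valid(steps: List[ToyStep], *args: str) -> Tuple[bool, List[str]]:
    """Return (True, warnings) if every step's needs are met, else (False, errors)."""
    available = {"text", *args}  # attributes present in the starting state
    warnings: List[str] = []
    for step in steps:
        missing = step.needs - available
        if missing:
            return False, [f"{step.name} is missing needs: {sorted(missing)}"]
        missing_optional = step.optional_needs - available
        if missing_optional:
            warnings.append(f"{step.name} lacks optional needs: {sorted(missing_optional)}")
        available |= step.production  # later steps can rely on these attributes
    return True, warnings


tokenizer = ToyStep("tokenizer", {"text"}, {"tokens", "sentences"})
ner = ToyStep("ner", {"tokens"}, {"entities"})
unifier = ToyStep("unifier", {"entities"}, {"characters"}, {"corefs"})

ok, messages = toy_check_valid([tokenizer, ner, unifier])
broken, errors = toy_check_valid([ner, tokenizer])
```

In the second call the steps are in the wrong order, so the NER step finds no tokens and the check fails.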

rerun_from(state, from_step, ignored_steps=None)

Recompute steps, starting from from_step (included). Results of previous steps are not recomputed.

Note

Steps are not re-initialised using _pipeline_init_steps_().

Parameters:

state (PipelineState) – the previously computed state
from_step (Union[str, Type[PipelineStep]]) –
first step to recompute from. Either:
- str : in that case, the name of a step production ('tokens', 'corefs'…)
- Type[PipelineStep] : in that case, the class of a step
ignored_steps (Optional[List[str]]) – a list of step productions. Any step with a production in ignored_steps is ignored.

Return type:

PipelineState

Returns:

the output of the last step of the pipeline
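
The idea of recomputing from a given step while keeping earlier results can be sketched as follows. This is a simplified illustration, not Renard's implementation: the state is a plain dict, each hypothetical step is a pair (productions, function), and from_step is assumed to be the name of a production.

```python
from typing import Any, Callable, Dict, List, Set, Tuple

# Each hypothetical step is (set of productions, function from state to new attributes).
ToyStep = Tuple[Set[str], Callable[[Dict[str, Any]], Dict[str, Any]]]


def toy_rerun_from(steps: List[ToyStep], state: Dict[str, Any], from_step: str) -> Dict[str, Any]:
    """Recompute steps starting at the first one producing from_step (included)."""
    start = next(i for i, (production, _) in enumerate(steps) if from_step in production)
    new_state = dict(state)  # results of earlier steps are kept as they are
    for _, run in steps[start:]:
        new_state.update(run(new_state))
    return new_state


calls: List[str] = []


def tokenize(state: Dict[str, Any]) -> Dict[str, Any]:
    calls.append("tokenize")
    return {"tokens": state["text"].split()}


def count(state: Dict[str, Any]) -> Dict[str, Any]:
    calls.append("count")
    return {"n_tokens": len(state["tokens"])}


steps: List[ToyStep] = [({"tokens"}, tokenize), ({"n_tokens"}, count)]
# A previously computed state whose tokens were corrected by hand.
previous = {"text": "a b c", "tokens": ["a", "b", "c", "d"]}
result = toy_rerun_from(steps, previous, "n_tokens")
```

Only the counting step runs again, so the hand-edited tokens survive and the count reflects them.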

Pipeline State

class renard.pipeline.core.PipelineState(text, dynamic_blocks=None, tokens=None, char2token=None, sentences=None, quotes=None, speakers=None, sentences_polarities=None, entities=None, corefs=None, characters=None, character_network=None)

The state of a pipeline, annotated over a Pipeline's lifetime

Parameters:

text (Optional[str])
dynamic_blocks (Optional[List[Tuple[int, int]]])
tokens (Optional[List[str]])
char2token (Optional[List[int]])
sentences (Optional[List[List[str]]])
quotes (Optional[List[Quote]])
speakers (Optional[List[Optional[Character]]])
sentences_polarities (Optional[List[float]])
entities (Optional[List[NEREntity]])
corefs (Optional[List[List[Mention]]])
characters (Optional[List[Character]])
character_network (Union[List[Graph], Graph, None])

__eq__(other): Return self==value.

__hash__ = None

__init__(text, dynamic_blocks=None, tokens=None, char2token=None, sentences=None, quotes=None, speakers=None, sentences_polarities=None, entities=None, corefs=None, characters=None, character_network=None)

Parameters:

text (Optional[str])
dynamic_blocks (Optional[List[Tuple[int, int]]])
tokens (Optional[List[str]])
char2token (Optional[List[int]])
sentences (Optional[List[List[str]]])
quotes (Optional[List[Quote]])
speakers (Optional[List[Optional[Character]]])
sentences_polarities (Optional[List[float]])
entities (Optional[List[NEREntity]])
corefs (Optional[List[List[Mention]]])
characters (Optional[List[Character]])
character_network (Union[List[Graph], Graph, None])

__repr__(): Return repr(self).

char2token: Optional[List[int]] = None: mapping from each text character (by index) to its corresponding token index

character_network: Union[List[Graph], Graph, None] = None: character network (or list of networks in the case of a dynamic network)

characters: Optional[List[Character]] = None: detected characters

corefs: Optional[List[List[Mention]]] = None: coreference chains

dynamic_blocks: Optional[List[Tuple[int, int]]] = None: text split into blocks of texts. When dynamic blocks are given, the final network is dynamic, and split according to blocks.

entities: Optional[List[NEREntity]] = None: NER entities

export_graph_to_gexf(path, name_style='most_frequent')

Export the character graph to Gephi’s GEXF format

Parameters:

path (str) – export file path
name_style (Union[Literal['longest', 'shortest', 'most_frequent'], Callable[[Character], str]]) – see graph_with_names() for more details

get_character(name, partial_match=True)

Try to get a character by one of its names.

Note

Several characters may match the given name, but only the first one is returned.

Note

Comparison is case-insensitive.

Parameters:

name (str) – one of the names of the searched character.
partial_match (bool) – when True, will also return a character if the given name is only part of one of its names. Otherwise, only a character with exactly the given name will be returned.

Return type:

Optional[Character]

Returns:

a Character, or None if no character was found.
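
The lookup behaviour described above (case-insensitive, first match wins, optional partial matching) can be sketched as follows. This is an illustration only: characters are reduced to frozensets of names, and toy_get_character is a hypothetical helper rather than the library method.

```python
from typing import FrozenSet, List, Optional


def toy_get_character(
    characters: List[FrozenSet[str]], name: str, partial_match: bool = True
) -> Optional[FrozenSet[str]]:
    """Return the first set of names matching `name`, ignoring case."""
    wanted = name.lower()
    for names in characters:
        lowered = [n.lower() for n in names]
        if wanted in lowered:
            return names
        # With partial matching, a substring of any name is enough.
        if partial_match and any(wanted in n for n in lowered):
            return names
    return None


characters = [
    frozenset({"Elizabeth Bennet", "Lizzy"}),
    frozenset({"Mr. Darcy", "Fitzwilliam Darcy"}),
]
partial = toy_get_character(characters, "darcy")
exact_only = toy_get_character(characters, "darcy", partial_match=False)
```

Here "darcy" is only part of a name, so it is found with partial matching and missed without it.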

plot_graph(name_style='most_frequent', fig=None, cumulative=False, graph_start_idx=1, stable_layout=False, layout=None, node_kwargs=None, edge_kwargs=None, label_kwargs=None, tight_layout=False, legend=False)

Plot self.character_network using reasonable default parameters

Note

When plotting a dynamic graph and fig is given, a slider attribute is added to fig in order to keep a reference to the slider.

Parameters:

name_style (Union[Literal['longest', 'shortest', 'most_frequent'], Callable[[Character], str]]) – see graph_with_names() for more details
fig (Optional[Figure]) – if specified, this matplotlib figure will be used for plotting
cumulative (bool) – if True and self.character_network is dynamic, plot a cumulative graph instead of a sequential one
graph_start_idx (int) – When self.character_network is dynamic, index of the first graph to plot, starting at 1 (not 0, since the graph slider starts at 1)
stable_layout (bool) – if self.character_network is dynamic and this parameter is True, characters will keep the same position in space at each timestep. Characters’ positions are based on the final cumulative graph layout.
layout (Union[Dict[Character, Tuple[float, float]], Dict[Character, ndarray], None]) – pre-computed graph layout
node_kwargs (Union[Dict[str, Any], List[Dict[str, Any]], None]) – passed to nx.draw_networkx_nodes()
edge_kwargs (Union[Dict[str, Any], List[Dict[str, Any]], None]) – passed to nx.draw_networkx_edges()
label_kwargs (Union[Dict[str, Any], List[Dict[str, Any]], None]) – passed to nx.draw_networkx_labels()
tight_layout (bool) – if True, will use matplotlib’s tight_layout
legend (bool) – passed to plot_nx_graph_reasonably()

plot_graph_to_file(path, name_style='most_frequent', layout=None, fig=None, node_kwargs=None, edge_kwargs=None, label_kwargs=None, tight_layout=False, legend=False)

Plot self.character_network using reasonable parameters, and save the produced figure to a file

Parameters:

name_style (Union[Literal['longest', 'shortest', 'most_frequent'], Callable[[Character], str]]) – see graph_with_names() for more details
layout (Union[Dict[Character, Tuple[float, float]], Dict[Character, ndarray], None]) – pre-computed graph layout
fig (Optional[Figure]) – if specified, this matplotlib figure will be used for plotting
node_kwargs (Optional[Dict[str, Any]]) – passed to nx.draw_networkx_nodes()
edge_kwargs (Optional[Dict[str, Any]]) – passed to nx.draw_networkx_edges()
label_kwargs (Optional[Dict[str, Any]]) – passed to nx.draw_networkx_labels()
tight_layout (bool) – if True, will use matplotlib’s tight_layout
legend (bool) – passed to plot_nx_graph_reasonably()
path (str)

plot_graphs_to_dir(directory, name_style='most_frequent', cumulative=False, stable_layout=False, layout=None, node_kwargs=None, edge_kwargs=None, label_kwargs=None, legend=False)

Plot self.character_network using reasonable default parameters, and save the produced figures in the specified directory.

Parameters:

name_style (Union[Literal['longest', 'shortest', 'most_frequent'], Callable[[Character], str]]) – see graph_with_names() for more details
cumulative (bool) – if True plot a cumulative graph instead of a sequential one
stable_layout (bool) – If this parameter is True, characters will keep the same position in space at each timestep. Characters’ positions are based on the final cumulative graph layout.
layout (Union[Dict[Character, Tuple[float, float]], Dict[Character, ndarray], None]) – pre-computed graph layout
node_kwargs (Optional[List[Dict[str, Any]]]) – passed to nx.draw_networkx_nodes()
edge_kwargs (Optional[List[Dict[str, Any]]]) – passed to nx.draw_networkx_edges()
label_kwargs (Optional[List[Dict[str, Any]]]) – passed to nx.draw_networkx_labels()
legend (bool) – passed to plot_nx_graph_reasonably()
directory (str)

quotes: Optional[List[Quote]] = None: quotes

sentences: Optional[List[List[str]]] = None: text split into sentences, each sentence being a list of tokens

sentences_polarities: Optional[List[float]] = None: polarity of each sentence

speakers: Optional[List[Optional[Character]]] = None: quotes speakers

text: Optional[str]: input text

tokens: Optional[List[str]] = None: text split into tokens

Pipeline Steps

class renard.pipeline.core.PipelineStep

An abstract pipeline step

Note

The __call__, needs and production methods must be overridden by derived classes.

Note

The optional_needs and supported_langs methods can be overridden by derived classes.

__call__(text, **kwargs)

Call self as a function.

Parameters:: text (str)
Return type:: Dict[str, Any]

__init__(): Initialize the PipelineStep with a given configuration.

_pipeline_init_(lang, progress_reporter, **kwargs)

Set the step configuration that is common to the whole pipeline.

Parameters:

lang (str) – the lang of the whole pipeline
progress_reporter (ProgressReporter)
kwargs – additional pipeline parameters.

Return type:

Optional[Dict[Literal['lang', 'progress_reporter', 'character_ner_tag'], Any]]

Returns:

a step can return a dictionary of pipeline params if it wishes to modify some of them.

needs()

Return type:: Set[str]
Returns:: a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

optional_needs()

Return type:: Set[str]
Returns:: a set of state attributes optionally needed by this PipelineStep. This method can be overridden by derived classes.

production()

Return type:: Set[str]
Returns:: a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()

Return type:: Union[Set[str], Literal['any']]
Returns:: a set of supported languages, as ISO 639-3 codes, or the string 'any'
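
To make the contract of a step concrete, here is a minimal sketch of a custom step. To stay self-contained it does not import Renard: ToyPipelineStep is a hypothetical stand-in that mirrors only the interface documented above, and its default supported_langs value is an assumption. In real use one would derive from renard.pipeline.core.PipelineStep instead.

```python
from typing import Any, Dict, Literal, Set, Union


class ToyPipelineStep:
    """Hypothetical stand-in mirroring the interface documented above."""

    def __call__(self, text: str, **kwargs) -> Dict[str, Any]:
        raise NotImplementedError

    def needs(self) -> Set[str]:
        raise NotImplementedError

    def production(self) -> Set[str]:
        raise NotImplementedError

    def optional_needs(self) -> Set[str]:
        return set()

    def supported_langs(self) -> Union[Set[str], Literal["any"]]:
        return {"eng"}  # assumed default, for illustration only


class WhitespaceTokenizer(ToyPipelineStep):
    """Illustrative step: splits the text on whitespace."""

    def __call__(self, text: str, **kwargs) -> Dict[str, Any]:
        # The returned keys are the state attributes this step produces.
        return {"tokens": text.split()}

    def needs(self) -> Set[str]:
        return {"text"}

    def production(self) -> Set[str]:
        return {"tokens"}

    def supported_langs(self) -> Union[Set[str], Literal["any"]]:
        return "any"  # whitespace splitting does not depend on the language


step = WhitespaceTokenizer()
out = step("Call me Ishmael.")
```

The keys of the returned dictionary match the set returned by production(), which is what lets later steps declare them in needs().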

Preprocessing

class renard.pipeline.preprocessing.CustomSubstitutionPreprocessor(substition_rules)

A preprocessor allowing regex-based substitution

Parameters:: substition_rules (List[Tuple[str, str]])

__call__(text, **kwargs)

Parameters:: text (str)
Return type:: Dict[str, Any]

__init__(substition_rules)

Parameters:: substition_rules (List[Tuple[str, str]]) – A list of rules, each rule being of the form (match, substitution).
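
As an illustration of what a list of (match, substitution) rules can do, the sketch below applies each rule with Python's re.sub. Applying the rules in list order with re.sub is an assumption of this example, and apply_substitution_rules is a hypothetical helper, not part of the library.

```python
import re
from typing import List, Tuple


def apply_substitution_rules(text: str, rules: List[Tuple[str, str]]) -> str:
    """Apply each (match, substitution) regex rule, in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


rules = [
    (r"\s+", " "),               # collapse runs of whitespace
    (r"M\. (\w+)", r"Mr. \1"),   # expand an abbreviated title, keeping the name
]
cleaned = apply_substitution_rules("M.  Dupont   arrived.", rules)
print(cleaned)
```

Because rules run in order, the second rule can rely on the whitespace having already been normalised by the first.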

needs()

Return type:: Set[str]
Returns:: a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

production()

Return type:: Set[str]
Returns:: a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()

Return type:: Union[Set[str], Literal['any']]
Returns:: a set of supported languages, as ISO 639-3 codes, or the string 'any'

Tokenization

NLTKTokenizer

class renard.pipeline.tokenization.NLTKTokenizer

An NLTK-based tokenizer

__call__(text, **kwargs)

Call self as a function.

Parameters:: text (str)
Return type:: Dict[str, Any]

__init__(): Initialize the PipelineStep with a given configuration.

_pipeline_init_(lang, **kwargs)

Set the step configuration that is common to the whole pipeline.

Parameters:

lang (str) – the lang of the whole pipeline
progress_reporter
kwargs – additional pipeline parameters.

Returns:

a step can return a dictionary of pipeline params if it wishes to modify some of them.

needs()

Return type:: Set[str]
Returns:: a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

production()

Return type:: Set[str]
Returns:: a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()

Return type:: Union[Set[str], Literal['any']]
Returns:: a set of supported languages, as ISO 639-3 codes, or the string 'any'

Named Entity Recognition

class renard.pipeline.ner.NEREntity(tokens, start_idx, end_idx, tag)

Parameters:

tokens (List[str])
start_idx (int)
end_idx (int)
tag (str)

__eq__(other): Return self==value.

__hash__()

Return hash(self).

Return type:: int

__init__(tokens, start_idx, end_idx, tag)

Parameters:

tokens (List[str])
start_idx (int)
end_idx (int)
tag (str)

__repr__(): Return repr(self).

shifted(shift)

Note

This method is implemented here to avoid type issues. Since Mention.shifted() cannot be annotated as returning Self, this method annotates the correct return type when using NEREntity.shifted().

Parameters:: shift (int)
Return type:: NEREntity

tag: str: NER class (without BIO prefix as in PER and not B-PER)

BertNamedEntityRecognizer

class renard.pipeline.ner.BertNamedEntityRecognizer(model=None, batch_size=4, device='auto', tokenizer=None, context_retriever=None)

An entity recognizer based on BERT

Parameters:

model (Union[PreTrainedModel, str, None])
batch_size (int)
device (Literal['cpu', 'cuda', 'auto'])
tokenizer (Optional[PreTrainedTokenizerFast])
context_retriever (Optional[NERContextRetriever])

__call__(tokens, sentences, **kwargs)

Parameters:

text
tokens (List[str])
sentences (List[List[str]])

Return type:

Dict[str, Any]

__init__(model=None, batch_size=4, device='auto', tokenizer=None, context_retriever=None)

Parameters:

model (Union[PreTrainedModel, str, None]) –
Either:
- None: the model is chosen according to the lang of the pipeline
- str: a Hugging Face model ID
- a PreTrainedModel: a custom pre-trained BERT model. If specified, a tokenizer must be passed as well.
batch_size (int) – batch size at inference
device (Literal['cpu', 'cuda', 'auto']) – computation device
tokenizer (Optional[PreTrainedTokenizerFast]) – a custom tokenizer
context_retriever (Optional[NERContextRetriever]) – if specified, use context_retriever to retrieve relevant global context at run time, generally trading runtime for NER performance.

_pipeline_init_(lang, **kwargs)

Set the step configuration that is common to the whole pipeline.

Parameters:

lang (str) – the lang of the whole pipeline
progress_reporter
kwargs – additional pipeline parameters.

Returns:

a step can return a dictionary of pipeline params if it wishes to modify some of them.

batch_labels(batchs, batch_i, wp_labels, tokens, ctxmask)

Align labels to tokens rather than wordpiece tokens.

Parameters:

batchs (BatchEncoding) – huggingface batch
batch_i (int) – batch index
wp_labels (List[str]) – wordpiece aligned labels
tokens (List[str]) – original tokens
ctxmask (Tensor)

Return type:

List[str]
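
Aligning wordpiece labels back to the original tokens can be sketched as below. This is a simplified illustration, not the method's actual logic: it assumes a list mapping each wordpiece to its token index (with Hugging Face fast tokenizers, BatchEncoding.word_ids() provides such a mapping, hard-coded here), keeps the label of the first wordpiece of each token, and ignores the context mask entirely. align_labels is a hypothetical helper.

```python
from typing import List, Optional


def align_labels(word_ids: List[Optional[int]], wp_labels: List[str], n_tokens: int) -> List[str]:
    """Keep, for each token, the label of its first wordpiece."""
    labels = ["O"] * n_tokens
    seen = set()
    for word_id, label in zip(word_ids, wp_labels):
        # Skip special tokens (None) and continuation pieces of a seen token.
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        labels[word_id] = label
    return labels


tokens = ["Gandalf", "smiled"]
# Wordpieces, for illustration: [CLS] Gan ##dal ##f smiled [SEP]
word_ids = [None, 0, 0, 0, 1, None]
wp_labels = ["O", "B-PER", "I-PER", "I-PER", "O", "O"]
token_labels = align_labels(word_ids, wp_labels, len(tokens))
```

The result has one label per original token, whatever the number of wordpieces each token was split into.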

needs()

Return type:: Set[str]
Returns:: a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

production()

Return type:: Set[str]
Returns:: a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()

Return type:: Union[Set[str], Literal['any']]
Returns:: a set of supported languages, as ISO 639-3 codes, or the string 'any'

NLTKNamedEntityRecognizer

class renard.pipeline.ner.NLTKNamedEntityRecognizer

An entity recognizer based on NLTK

__call__(tokens, **kwargs)

Parameters:

text
tokens (List[str])

Return type:

Dict[str, Any]

__init__()

Parameters:: language – iso 639-2 3-letter language code

_pipeline_init_(lang, **kwargs)

Set the step configuration that is common to the whole pipeline.

Parameters:

lang (str) – the lang of the whole pipeline
progress_reporter
kwargs – additional pipeline parameters.

Returns:

a step can return a dictionary of pipeline params if it wishes to modify some of them.

needs()

Return type:: Set[str]
Returns:: a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

production()

Return type:: Set[str]
Returns:: a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()

Return type:: Union[Set[str], Literal['any']]
Returns:: a set of supported languages, as ISO 639-3 codes, or the string 'any'

Coreference Resolution

A coreference resolver returns a list of coreference chains, each chain being a list of Mention.

class renard.pipeline.core.Mention(tokens, start_idx, end_idx)

Parameters:

tokens (List[str])
start_idx (int)
end_idx (int)

__eq__(other)

Return self==value.

Parameters:: other (Mention)
Return type:: bool

__hash__()

Return hash(self).

Return type:: int

__init__(tokens, start_idx, end_idx)

Parameters:

tokens (List[str])
start_idx (int)
end_idx (int)

__repr__(): Return repr(self).

BertCoreferenceResolver

class renard.pipeline.corefs.BertCoreferenceResolver(model=None, huggingface_model_id=None, batch_size=1, device='auto', tokenizer=None, block_size=512, hierarchical_merging=False)

A coreference resolver using BERT. Loosely based on ‘End-to-end Neural Coreference Resolution’ (Lee et al., 2017) and ‘BERT for coreference resolution’ (Joshi et al., 2019).

Parameters:

model (Optional[BertForCoreferenceResolution])
huggingface_model_id (Optional[str])
batch_size (int)
device (Literal['auto', 'cuda', 'cpu'])
tokenizer (Optional[PreTrainedTokenizerFast])
block_size (int)
hierarchical_merging (bool)

__call__(tokens, **kwargs)

Call self as a function.

Parameters:: tokens (List[str])
Return type:: Dict[str, Any]

__init__(model=None, huggingface_model_id=None, batch_size=1, device='auto', tokenizer=None, block_size=512, hierarchical_merging=False)

Note

In the future, only mentions_per_tokens, antecedents_nb and max_span_size shall be read directly from the model’s config.

Parameters:

huggingface_model_id (Optional[str]) – a custom Hugging Face model ID. This bypasses the lang pipeline parameter, which normally chooses a Hugging Face model automatically.
batch_size (int) – batch size at inference
device (Literal['auto', 'cuda', 'cpu']) – computation device
block_size (int) – size of blocks to pass to the coreference model
hierarchical_merging (bool) – if True, attempts to use tibert’s hierarchical merging feature. In that case, blocks of size block_size are merged to perform inference on the whole document.
model (Optional[BertForCoreferenceResolution])
tokenizer (Optional[PreTrainedTokenizerFast])

_pipeline_init_(lang, **kwargs)

Set the step configuration that is common to the whole pipeline.

Parameters:

lang (str) – the lang of the whole pipeline
progress_reporter
kwargs – additional pipeline parameters.

Returns:

a step can return a dictionary of pipeline params if it wishes to modify some of them.

needs()

Return type:: Set[str]
Returns:: a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

production()

Return type:: Set[str]
Returns:: a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()

Return type:: Union[Set[str], Literal['any']]
Returns:: a set of supported languages, as ISO 639-3 codes, or the string 'any'

SpacyCorefereeCoreferenceResolver

class renard.pipeline.corefs.SpacyCorefereeCoreferenceResolver(max_chunk_size=10000)

A coreference resolver using spaCy’s coreferee.

Note

This step requires installing Renard’s ‘spacy’ extra.
While this step automatically installs the needed spaCy models, it still needs a manual installation of the coreferee model: python -m coreferee install en

Parameters:: max_chunk_size (Optional[int])

__call__(text, tokens, dynamic_blocks_tokens=None, **kwargs)

Call self as a function.

Parameters:

text (str)
tokens (List[str])
dynamic_blocks_tokens (Optional[List[List[str]]])

Return type:

Dict[str, Any]

__init__(max_chunk_size=10000)

Parameters:

chunk_size – coreference chunk size, in tokens
max_chunk_size (Optional[int])

static _coreferee_get_mention_tokens(coref_model, mention_heads, doc)

Coreferee only returns the heads of mentions, not their whole spans. This hack (defined at the end of part 2 of the coreferee README, https://github.com/richardpaulhudson/coreferee#2-interacting-with-the-data-model) gets the whole span as a list of spaCy tokens.

Parameters:

coref_model (CorefereeBroker)
mention_heads (Mention)
doc (Doc)

Return type:

List[Token]

_pipeline_init_(lang, progress_reporter)

Set the step configuration that is common to the whole pipeline.

Parameters:

lang (str) – the lang of the whole pipeline
progress_reporter (ProgressReporter)
kwargs – additional pipeline parameters.

Returns:

a step can return a dictionary of pipeline params if it wishes to modify some of them.

static _spacy_try_infer_spaces(tokens)

Try to infer, for each token, if there is a space between this token and the next.

Parameters:: tokens (List[str])
Return type:: List[bool]

needs()

Return type:: Set[str]
Returns:: a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

optional_needs()

Return type:: Set[str]
Returns:: a set of state attributes optionally needed by this PipelineStep. This method can be overridden by derived classes.

production()

Return type:: Set[str]
Returns:: a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

Quote Detection

QuoteDetector

class renard.pipeline.quote_detection.QuoteDetector(quote_pairs=None)

Extract quotes using simple rules.

Parameters:: quote_pairs (Optional[List[Tuple[str, str]]])

__call__(tokens, **kwargs)

Call self as a function.

Parameters:: tokens (List[str])
Return type:: Dict[str, Any]

__init__(quote_pairs=None)

Parameters:: quote_pairs (Optional[List[Tuple[str, str]]]) – if None, default to QuoteDetector.DEFAULT_QUOTE_PAIRS
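
Rule-based quote extraction driven by pairs of opening and closing marks can be sketched as follows. This is an illustration under assumptions, not the detector's actual rules: QUOTE_PAIRS is a hypothetical list (the library's DEFAULT_QUOTE_PAIRS may differ), find_quotes is a hypothetical helper, and nested quotes are not handled.

```python
from typing import List, Optional, Tuple

# Hypothetical pairs of (opening, closing) marks, for illustration only.
QUOTE_PAIRS: List[Tuple[str, str]] = [('"', '"'), ("“", "”"), ("«", "»")]


def find_quotes(
    tokens: List[str], quote_pairs: List[Tuple[str, str]] = QUOTE_PAIRS
) -> List[Tuple[int, int]]:
    """Return (start, end) token spans of quotes, marks included, end exclusive."""
    closing_for = dict(quote_pairs)
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None
    expected_close: Optional[str] = None
    for i, token in enumerate(tokens):
        if start is None:
            if token in closing_for:  # an opening mark starts a quote
                start, expected_close = i, closing_for[token]
        elif token == expected_close:  # only the matching mark closes it
            spans.append((start, i + 1))
            start, expected_close = None, None
    return spans


tokens = ["“", "Come", "in", ",", "”", "she", "said", ".", '"', "No", '"']
quotes = find_quotes(tokens)
```

An opening mark with no matching closing mark yields no span, which keeps stray apostrophes or unbalanced marks from swallowing the rest of the text.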

needs()

tokens

Return type:: Set[str]

production()

quotes

Return type:: Set[str]

supported_langs()

any

Return type:: Union[Set[str], Literal['any']]

Sentiment Analysis

NLTKSentimentAnalyzer

class renard.pipeline.sentiment_analysis.NLTKSentimentAnalyzer

A sentiment analyzer based on NLTK’s Vader.

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

__call__(sentences, **kwargs)

Call self as a function.

Parameters:: sentences (List[List[str]])
Return type:: Dict[str, Any]

__init__(): Initialize the PipelineStep with a given configuration.

needs()

Return type:: Set[str]
Returns:: a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

production()

Return type:: Set[str]
Returns:: a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

Characters Unification

class renard.pipeline.character_unification.Character(names, mentions, gender=Gender.UNKNOWN)

Parameters:

names (FrozenSet[str])
mentions (List[Mention])
gender (Gender)

__delattr__(name): Implement delattr(self, name).

__eq__(other): Return self==value.

__hash__()

Return hash(self).

Return type:: int

__init__(names, mentions, gender=Gender.UNKNOWN)

Parameters:

names (FrozenSet[str])
mentions (List[Mention])
gender (Gender)

__repr__()

Return repr(self).

Return type:: str

__setattr__(name, value): Implement setattr(self, name, value).

NaiveCharacterUnifier

class renard.pipeline.character_unification.NaiveCharacterUnifier(min_appearances=0)

A basic character unifier using NER

Parameters:: min_appearances (int)

__call__(text, entities, corefs=None, **kwargs)

Parameters:

text (str)
tokens
entities (List[NEREntity])
corefs (Optional[List[List[Mention]]])

Return type:

Dict[str, Any]

__init__(min_appearances=0)

Parameters:: min_appearances (int) – minimum number of appearances of a character for it to be valid

_pipeline_init_(lang, character_ner_tag, **kwargs)

Set the step configuration that is common to the whole pipeline.

Parameters:

lang (str) – the lang of the whole pipeline
progress_reporter
kwargs – additional pipeline parameters.
character_ner_tag (str)

Returns:

a step can return a dictionary of pipeline params if it wishes to modify some of them.

needs()

Return type:: Set[str]
Returns:: a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

optional_needs()

Return type:: Set[str]
Returns:: a set of state attributes optionally needed by this PipelineStep. This method can be overridden by derived classes.

production()

Return type:: Set[str]
Returns:: a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()

Return type:: Union[Set[str], Literal['any']]
Returns:: a set of supported languages, as ISO 639-3 codes, or the string 'any'

GraphRulesCharacterUnifier

class renard.pipeline.character_unification.GraphRulesCharacterUnifier(min_appearances=0, additional_hypocorisms=None, link_corefs_mentions=False, ignore_lone_titles=None, ignore_leading_determiner=False)

Unify characters by creating a graph where mentions are linked when they refer to the same character, and then merging the nodes of this graph.

Note

This algorithm is inspired by Vala et al., 2015.

Parameters:

min_appearances (int)
additional_hypocorisms (Optional[List[Tuple[str, List[str]]]])
link_corefs_mentions (bool)
ignore_lone_titles (Optional[Set[str]])
ignore_leading_determiner (bool)

__call__(entities, corefs=None, **kwargs)

Call self as a function.

Parameters:

entities (List[NEREntity])
corefs (Optional[List[List[Mention]]])
kwargs (dict)

Return type:

Dict[str, Any]

__init__(min_appearances=0, additional_hypocorisms=None, link_corefs_mentions=False, ignore_lone_titles=None, ignore_leading_determiner=False)

Parameters:

min_appearances (int) – minimum number of appearances of a character for it to be considered valid.
additional_hypocorisms (Optional[List[Tuple[str, List[str]]]]) – a list of additional hypocorisms. Each hypocorism is a tuple where the first element is a name, and the second element is a list of nicknames associated with it
link_corefs_mentions (bool) – if True, will also use coreference resolution to link names together. This is disabled by default since a coreference model can extract a lot of spurious links. However, linking by coref is sometimes the only way to resolve a character alias.
ignore_lone_titles (Optional[Set[str]]) – a set of titles to ignore when they stand on their own. This avoids extracting false-positive characters such as ‘Mr.’ or ‘Miss’.
ignore_leading_determiner (bool) – if True, will ignore the leading determiner when applying unification rules. This is useful if the NER model used in the pipeline adds leading determiners as part of entities.
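
The general idea of linking names that refer to the same character and then merging linked names can be sketched with a tiny union-find. This is a toy illustration, not the library's algorithm: TITLES, HYPOCORISMS and the linking rule in related() are hypothetical and far cruder than real unification rules (for instance, a shared surname alone would wrongly merge relatives).

```python
from typing import Dict, FrozenSet, List, Set, Tuple

TITLES = {"mr.", "mrs.", "miss"}  # hypothetical title list
HYPOCORISMS: List[Tuple[str, List[str]]] = [("Elizabeth", ["Lizzy", "Eliza"])]


def strip_title(name: str) -> str:
    """Remove title words from a name."""
    return " ".join(w for w in name.split() if w.lower() not in TITLES)


def related(a: str, b: str) -> bool:
    """Toy linking rule: shared word after title removal, or known nickname."""
    words_a, words_b = set(strip_title(a).split()), set(strip_title(b).split())
    if words_a & words_b:
        return True
    for full, nicknames in HYPOCORISMS:
        group = {full, *nicknames}
        if words_a & group and words_b & group:
            return True
    return False


def unify(names: List[str]) -> Set[FrozenSet[str]]:
    """Link related names, then merge each connected component."""
    parent: Dict[str, str] = {n: n for n in names}

    def find(n: str) -> str:
        while parent[n] != n:
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if related(a, b):
                parent[find(a)] = find(b)  # merge the two components
    groups: Dict[str, Set[str]] = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return {frozenset(g) for g in groups.values()}


characters = unify(["Elizabeth Bennet", "Lizzy", "Mr. Darcy", "Darcy"])
```

Each resulting frozenset plays the role of a character's set of names.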

_pipeline_init_(lang, character_ner_tag, **kwargs)

Set the step configuration that is common to the whole pipeline.

Parameters:

lang (str) – the lang of the whole pipeline
progress_reporter
kwargs – additional pipeline parameters.
character_ner_tag (str)

Returns:

a step can return a dictionary of pipeline params if it wishes to modify some of them.

infer_name_gender(name, corefs, hname_constants)

Try to infer a name’s gender

Parameters:

name (str)
corefs (Optional[List[List[Mention]]])
hname_constants (Constants) – HumanName constants

Return type:

Gender

names_are_related_after_title_removal(name1, name2, hname_constants)

Check if two names are related after removing their titles

Parameters:

name1 (str)
name2 (str)
hname_constants (Constants)

Return type:

bool

needs()

Return type:: Set[str]
Returns:: a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

optional_needs()

Return type:: Set[str]
Returns:: a set of state attributes optionally needed by this PipelineStep. This method can be overridden by derived classes.

production()

Return type:: Set[str]
Returns:: a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()

Return type:: Union[Set[str], Literal['any']]
Returns:: a set of supported languages, as ISO 639-3 codes, or the string 'any'

Speaker Attribution

class renard.pipeline.speaker_attribution.BertSpeakerDetector(model=None, batch_size=4, device='auto', tokenizer=None)

Detect the speaker of each quote in the text

Parameters:

model (Union[PreTrainedModel, str, None])
batch_size (int)
device (Literal['cpu', 'cuda', 'auto'])
tokenizer (Optional[PreTrainedTokenizerFast])

__call__(tokens, quotes, characters, **kwargs)

Call self as a function.

Parameters:

tokens (List[str])
quotes (List[Quote])
characters (List[Character])

Return type:

Dict[str, Any]

__init__(model=None, batch_size=4, device='auto', tokenizer=None)

Initialize the PipelineStep with a given configuration.

Parameters:

model (Union[PreTrainedModel, str, None])
batch_size (int)
device (Literal['cpu', 'cuda', 'auto'])
tokenizer (Optional[PreTrainedTokenizerFast])

_pipeline_init_(lang, **kwargs)

Set the step configuration that is common to the whole pipeline.

Parameters:

lang (str) – the lang of the whole pipeline
progress_reporter
kwargs – additional pipeline parameters.

Returns:

a step can return a dictionary of pipeline params if it wishes to modify some of them.

needs()

quotes, tokens, characters

Return type:: Set[str]

production()

speakers

Return type:: Set[str]

Graph Extraction

CoOccurrencesGraphExtractor

class renard.pipeline.graph_extraction.CoOccurrencesGraphExtractor(co_occurrences_dist=None, dynamic=False, dynamic_window=None, dynamic_overlap=0, additional_ner_classes=None)

A simple character graph extractor using co-occurrences

Parameters:

co_occurrences_dist (Union[int, Tuple[int, Literal['tokens', 'sentences']], None])
dynamic (bool)
dynamic_window (Optional[int])
dynamic_overlap (int)
additional_ner_classes (Optional[List[str]])

__call__(characters, sentences, char2token=None, dynamic_blocks=None, sentences_polarities=None, entities=None, co_occurrences_blocks=None, **kwargs)

Extract a co-occurrence character network.

Parameters:

co_occurrences_blocks (Optional[Tuple[List[Tuple[int, int]], Literal['characters', 'tokens']]]) – custom blocks where co-occurrences should be recorded. For example, this can be used to perform chapter level co-occurrences.
characters (Set[Character])
sentences (List[List[str]])
char2token (Optional[List[int]])
dynamic_blocks (Optional[Tuple[List[Tuple[int, int]], Literal['characters', 'tokens']]])
sentences_polarities (Optional[List[float]])
entities (Optional[List[NEREntity]])

Return type:

Dict[str, Any]

Returns:

a dict with key 'character_network' and a nx.Graph or a list of nx.Graph as value.

__init__(co_occurrences_dist=None, dynamic=False, dynamic_window=None, dynamic_overlap=0, additional_ner_classes=None)

Parameters:

co_occurrences_dist (Union[int, Tuple[int, Literal['tokens', 'sentences']], None]) –
max accepted distance between two character appearances to form a co-occurrence interaction.
- if an int is given, the distance is in number of tokens
- if a tuple is given, the first element of the tuple is a distance while the second is a unit. Examples: (1, "sentences"), (3, "tokens").
dynamic (bool) –
- if False (the default), a static nx.Graph is extracted
- if True, several nx.Graph are extracted. In that case, dynamic_window and dynamic_overlap can be specified. If dynamic_window is not specified, this step expects the text to be cut into ‘dynamic blocks’, and a graph will be extracted for each block. In that case, dynamic_blocks must be passed to the pipeline as a List[str] at runtime.
dynamic_window (Optional[int]) – dynamic window, in number of interactions. a dynamic window of n means that each returned graph will be formed by n interactions.
dynamic_overlap (int) – overlap, in number of interactions.
additional_ner_classes (Optional[List[str]]) – if specified, will include entities other than characters in the final graph. No attempt will be made at unifying the entities (for example, “New York” will be distinct from “New York City”).
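
As a rough illustration of co_occurrences_dist when the distance is expressed in tokens, the sketch below records a co-occurrence whenever two mentions of different characters are at most dist tokens apart. It is a standalone approximation and not Renard's implementation: the co_occurrences helper and the (name, token index) mention format are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple


def co_occurrences(mentions: List[Tuple[str, int]], dist: int) -> Counter:
    """Count pairs of distinct characters mentioned within `dist` tokens.

    `mentions` is a hypothetical list of (character name, token index).
    """
    counts: Counter = Counter()
    for (name1, pos1), (name2, pos2) in combinations(mentions, 2):
        if name1 != name2 and abs(pos1 - pos2) <= dist:
            # sort the names so that the pair is undirected
            counts[tuple(sorted((name1, name2)))] += 1
    return counts


mentions = [("Alice", 0), ("Bob", 3), ("Carol", 20), ("Alice", 22)]
edges = co_occurrences(mentions, dist=5)
```

Here Alice and Bob co-occur once (3 tokens apart), and Alice and Carol co-occur once (2 tokens apart); Bob and Carol are too far apart to be linked.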

_create_co_occurrences_blocks(sentences, mentions)

Create co-occurrences blocks using self.co_occurrences_dist. All entities within a block are considered as co-occurring.

Parameters:

sentences (List[List[str]])
mentions (List[Tuple[Any, NEREntity]])

Return type:

Tuple[List[Tuple[int, int]], Literal['characters', 'tokens']]

_extract_dynamic_graph(mentions, window, overlap, dynamic_blocks, sentences, sentences_polarities, co_occurrences_blocks)

Note

only one of window or dynamic_blocks should be specified

Parameters:

mentions (List[Tuple[Any, NEREntity]]) – A list of entity mentions, ordered by appearance, each of the form (KEY, MENTION). KEY determines the uniqueness of the entity.
window (Optional[int]) – dynamic window, in tokens.
overlap (int) – window overlap
dynamic_blocks (Optional[Tuple[List[Tuple[int, int]], Literal['characters', 'tokens']]]) – boundaries of each dynamic block
co_occurrences_blocks (Optional[Tuple[List[Tuple[int, int]], Literal['characters', 'tokens']]]) – boundaries of each co-occurrences block
sentences (List[List[str]])
sentences_polarities (Optional[List[float]])

Return type:

List[Graph]
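
The dynamic_window and dynamic_overlap parameters of CoOccurrencesGraphExtractor are described as counts of interactions. A minimal sketch of that windowing idea, assuming interactions are simply items of a list, could look like the following. The interaction_windows helper is hypothetical and only illustrates the slicing, not how Renard builds each graph.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def interaction_windows(
    items: Sequence[T], window: int, overlap: int = 0
) -> List[List[T]]:
    """Cut `items` into windows of `window` elements sharing `overlap` elements."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if len(items) == 0:
        return []
    step = window - overlap
    # do not start a window that would only repeat already-covered items
    last_start = max(len(items) - overlap, 1)
    return [list(items[i : i + window]) for i in range(0, last_start, step)]


interactions = ["AB", "AC", "BC", "AB", "CD", "AD", "BD"]
windows = interaction_windows(interactions, window=3, overlap=1)
```

With a window of 3 and an overlap of 1, each window starts on the last interaction of the previous one. In this sketch the last window may be shorter than the others.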

_extract_graph(mentions, sentences, sentences_polarities, co_occurrences_blocks)

Parameters:

mentions (List[Tuple[Any, NEREntity]]) – A list of entity mentions, ordered by appearance, each of the form (KEY, MENTION). KEY determines the uniqueness of the entity.
sentences (List[List[str]]) – if specified, sentences_polarities must be specified as well.
sentences_polarities (Optional[List[float]]) – if specified, sentences must be specified as well. In that case, edges are annotated with the 'polarity' attribute, indicating the polarity of the relationship between two characters. Polarity between two interactions is computed as the strongest sentence polarity between those two mentions.
co_occurrences_blocks (Optional[Tuple[List[Tuple[int, int]], Literal['characters', 'tokens']]]) – only unit ‘tokens’ is accepted.

Return type:

Graph

needs()

Return type:: Set[str]
Returns:: a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

optional_needs()

Return type:: Set[str]
Returns:: a set of state attributes optionally needed by this PipelineStep. This method can be overridden by derived classes.

production()

Return type:: Set[str]
Returns:: a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

supported_langs()

Return type:: Union[Set[str], Literal['any']]
Returns:: a set of supported languages, as ISO 639-3 codes, or the string 'any'

ConversationalGraphExtractor

class renard.pipeline.graph_extraction.ConversationalGraphExtractor(graph_type, conversation_dist=None, ignore_self_mention=True)

A graph extractor using conversation between characters or mentions.

Note

Does not support dynamic networks yet.

Parameters:

graph_type (Literal['conversation', 'mention'])
conversation_dist (Union[int, Tuple[int, Literal['tokens', 'sentences']], None])
ignore_self_mention (bool)

__call__(sentences, quotes, speakers, characters, **kwargs)

Call self as a function.

Parameters:

sentences (List[List[str]])
quotes (List[Quote])
speakers (List[Optional[Character]])
characters (Set[Character])

Return type:

Dict[str, Any]

__init__(graph_type, conversation_dist=None, ignore_self_mention=True)

Parameters:

graph_type (Literal['conversation', 'mention']) – either ‘conversation’ or ‘mention’. ‘conversation’ extracts an undirected graph with interactions being extracted from the conversations occurring between characters. ‘mention’ extracts a directed graph where interactions are character mentions of one another in quoted speech.
conversation_dist (Union[int, Tuple[int, Literal['tokens', 'sentences']], None]) – must be supplied if graph_type is ‘conversation’. The distance between two quotations for them to be considered as interacting.
ignore_self_mention (bool) – if True, self mentions are ignored for graph_type=='mention'
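
To make the ‘mention’ graph type more concrete, here is a small standalone sketch of directed mention edges: each quote adds an edge from its speaker to every character named in it. The mention_edges helper, the use of plain strings for characters, and the exact-token matching are all assumptions made for illustration; they are not taken from Renard's code.

```python
from collections import Counter
from typing import List, Optional


def mention_edges(
    quotes: List[List[str]],
    speakers: List[Optional[str]],
    character_names: List[str],
    ignore_self_mention: bool = True,
) -> Counter:
    """Count directed (speaker, mentioned character) edges."""
    edges: Counter = Counter()
    for quote_tokens, speaker in zip(quotes, speakers):
        if speaker is None:
            # quote without an attributed speaker: nothing to link
            continue
        for name in character_names:
            if name not in quote_tokens:
                continue
            if ignore_self_mention and name == speaker:
                continue
            edges[(speaker, name)] += 1
    return edges


quotes = [["Where", "is", "Bob", "?"], ["Bob", "is", "here", "."], ["Hello", "Alice"]]
speakers = ["Alice", "Bob", None]
edges = mention_edges(quotes, speakers, ["Alice", "Bob"])
```

Only the edge from Alice to Bob is kept: Bob's mention of himself is ignored, and the last quote has no speaker.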

needs()

sentences, quotes, speakers, characters

Return type:: Set[str]

production()

character_network

Return type:: Set[str]

Stanford CoreNLP Pipeline

class renard.pipeline.stanford_corenlp.StanfordCoreNLPPipeline(annotate_corefs=False, corefs_algorithm='statistical', corenlp_custom_properties=None, server_timeout=9999999, **server_kwargs)

A full NLP pipeline using Stanford CoreNLP

Note

The Stanford CoreNLP pipeline requires the stanza library. You can install it with uv using uv pip install stanza.

Warning

RAM usage might be high for coreference resolution, as it uses the entire novel! If CoreNLP terminates with an out-of-memory error, you can try allocating more memory for the server by using server_kwargs (example: {"memory": "8G"}).

Parameters:

annotate_corefs (bool)
corefs_algorithm (Literal['deterministic', 'statistical', 'neural'])
corenlp_custom_properties (Optional[Dict[str, Any]])
server_timeout (int)

__call__(text, **kwargs)

Call self as a function.

Parameters:: text (str)
Return type:: Dict[str, Any]

__init__(annotate_corefs=False, corefs_algorithm='statistical', corenlp_custom_properties=None, server_timeout=9999999, **server_kwargs)

Parameters:

annotate_corefs (bool) – True if coreferences must be annotated, False otherwise. This parameter is not yet implemented.
corefs_algorithm (Literal['deterministic', 'statistical', 'neural']) – one of {"deterministic", "statistical", "neural"}
corenlp_custom_properties (Optional[Dict[str, Any]]) – custom properties dictionary to pass to the CoreNLP server. Note that some properties are already set when calling the server, so not all properties are supported: this parameter is intended as a last-resort escape hatch. In particular, do not set 'ner.applyFineGrained'. If you need to set the coreference algorithm used, see corefs_algorithm.
server_timeout (int) – CoreNLP server timeout in ms
server_kwargs – extra args for the Stanford CoreNLP server. be_quiet and max_char_length are not supported. See https://stanfordnlp.github.io/stanza/client_properties.html#corenlp-server-start-options-server for a list of possible args.

needs()

Return type:: Set[str]
Returns:: a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.

production()

Return type:: Set[str]
Returns:: a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.

renard.pipeline.stanford_corenlp.corenlp_annotations_bio_tags(annotations)

Returns an array of BIO tags extracted from Stanford CoreNLP annotations

Note

Only PERSON, LOCATION, ORGANIZATION and MISC entities are considered. Other types of entities are discarded. See https://stanfordnlp.github.io/CoreNLP/ner.html#description for a list of usual CoreNLP types.

Note

Weirdly, CoreNLP will annotate pronouns as entities. Only tokens having an NNP POS tag are kept by this function.

Parameters:: annotations (Document) – stanford coreNLP text annotations
Return type:: List[str]
Returns:: an array of bio tags.

Resources

Hypocorism

class renard.resources.hypocorisms.HypocorismGazetteer(lang='eng')

A hypocorism (nickname) gazetteer

Note

The data used for this gazetteer comes from https://github.com/carltonnorthern/nickname-and-diminutive-names-lookup and is licensed under the Apache 2.0 License

Parameters:: lang (str)

__init__(lang='eng')

Parameters:: lang (str) – gazetteer language. Must be in HypocorismGazetteer.supported_langs.

_add_hypocorism_(name, nicknames)

Add a name associated with several nicknames

Parameters:

name (str)
nicknames (List[str]) – nicknames to associate to the given name

are_related(name1, name2)

Check if one name is a hypocorism of the other (or if both names are equal)

Parameters:

name1 (str)
name2 (str)

Return type:

bool

get_nicknames(name)

Return all possible nicknames for the given name

Parameters:: name (str)
Return type:: Set[str]

get_possible_names(nickname)

Return all names that can correspond to the given nickname

Parameters:: nickname (str)
Return type:: Set[str]
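
The three lookups above (get_nicknames, get_possible_names and are_related) can be pictured with a toy two-way index. The TinyGazetteer class below is a hypothetical stand-in written for this example, with lowercase normalisation as an added assumption; it does not load the real nickname data and is not the HypocorismGazetteer implementation.

```python
from collections import defaultdict
from typing import Dict, List, Set


class TinyGazetteer:
    """Toy name <-> nickname index (illustration only)."""

    def __init__(self) -> None:
        self.name_to_nicknames: Dict[str, Set[str]] = defaultdict(set)
        self.nickname_to_names: Dict[str, Set[str]] = defaultdict(set)

    def add(self, name: str, nicknames: List[str]) -> None:
        for nickname in nicknames:
            self.name_to_nicknames[name.lower()].add(nickname.lower())
            self.nickname_to_names[nickname.lower()].add(name.lower())

    def get_nicknames(self, name: str) -> Set[str]:
        return set(self.name_to_nicknames.get(name.lower(), set()))

    def get_possible_names(self, nickname: str) -> Set[str]:
        return set(self.nickname_to_names.get(nickname.lower(), set()))

    def are_related(self, name1: str, name2: str) -> bool:
        n1, n2 = name1.lower(), name2.lower()
        # related if equal, or if one is a known nickname of the other
        return n1 == n2 or n2 in self.get_nicknames(n1) or n1 in self.get_nicknames(n2)


gazetteer = TinyGazetteer()
gazetteer.add("Elizabeth", ["Liz", "Beth"])
gazetteer.add("Lisa", ["Liz"])
```

In this toy index, "Liz" maps back to both "elizabeth" and "lisa", while "Liz" and "Beth" are not considered related to each other since neither is a nickname of the other.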

Utils

renard.utils.BlockBounds

A BlockBounds delimits blocks in either raw text (“characters”) or tokenized text (“tokens”). It has the following form:

([(block start, block end), …], unit)

see block_bounds() to easily create BlockBounds

alias of Tuple[List[Tuple[int, int]], Literal[‘characters’, ‘tokens’]]

renard.utils.batch_index_select(input, dim, index)

Batched version of torch.index_select(). Inspired by https://discuss.pytorch.org/t/batched-index-select/9115/8

Parameters:

input (Tensor) – a torch tensor of shape (B, *) where * is any number of additional dimensions.
dim (int) – the dimension in which to index
index (Tensor) – index tensor of shape (B, I)

Return type:

Tensor

Returns:

a tensor which indexes input along dimension dim using index. This tensor has the same shape as input, except in dimension dim, where it has dimension I.

renard.utils.block_bounds(blocks)

Return the boundaries of a series of blocks.

Parameters:: blocks (Union[List[str], List[List[str]]]) – either a list of raw texts or a list of tokenized texts.
Return type:: Tuple[List[Tuple[int, int]], Literal['characters', 'tokens']]
Returns:: A BlockBounds with the correct unit.
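
A minimal sketch of how such boundaries could be computed is given below. It assumes that blocks are contiguous (no separator between them) and that each end index is exclusive; neither assumption is stated in this reference, and sketch_block_bounds is a hypothetical name rather than the library function.

```python
from typing import List, Literal, Tuple, Union

BlockBounds = Tuple[List[Tuple[int, int]], Literal["characters", "tokens"]]


def sketch_block_bounds(blocks: Union[List[str], List[List[str]]]) -> BlockBounds:
    """Compute (start, end) bounds of consecutive blocks, end exclusive."""
    # raw texts are measured in characters, tokenized texts in tokens
    is_raw = all(isinstance(block, str) for block in blocks)
    bounds: List[Tuple[int, int]] = []
    start = 0
    for block in blocks:
        end = start + len(block)
        bounds.append((start, end))
        start = end
    return (bounds, "characters" if is_raw else "tokens")


raw_bounds = sketch_block_bounds(["Chapter one.", "Chapter two!!"])
token_bounds = sketch_block_bounds([["a", "b"], ["c", "d", "e"]])
```

The first call yields bounds in ‘characters’ and the second in ‘tokens’, matching the two units of the BlockBounds alias.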

renard.utils.charbb2tokenbb(char_bb, char2token)

Convert a BlockBounds in characters to a BlockBounds in tokens.

Parameters:

char_bb (Tuple[List[Tuple[int, int]], Literal['characters', 'tokens']]) – block bounds, in ‘characters’.
char2token (List[int]) – a list with char2token[i] being the index of token corresponding to character i.

Return type:

Tuple[List[Tuple[int, int]], Literal['characters', 'tokens']]

Returns:

a BlockBounds, in ‘tokens’.
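
The conversion can be sketched as a lookup of each boundary in char2token. The version below assumes non-empty blocks with exclusive end indices, and assumes that whitespace characters are mapped to a neighbouring token; char_bounds_to_token_bounds is a hypothetical helper, not the library function itself.

```python
from typing import List, Literal, Tuple

BlockBounds = Tuple[List[Tuple[int, int]], Literal["characters", "tokens"]]


def char_bounds_to_token_bounds(
    char_bb: BlockBounds, char2token: List[int]
) -> BlockBounds:
    """Map character-level bounds to token-level bounds (end exclusive)."""
    bounds, unit = char_bb
    if unit != "characters":
        raise ValueError("expected block bounds in 'characters'")
    token_bounds = [
        # token of the first character, and one past the token of the last one
        (char2token[start], char2token[end - 1] + 1)
        for start, end in bounds
    ]
    return (token_bounds, "tokens")


# "Hi Bob": 'H', 'i' and the space map to token 0, "Bob" maps to token 1
char2token = [0, 0, 0, 1, 1, 1]
token_bb = char_bounds_to_token_bounds(([(0, 2), (3, 6)], "characters"), char2token)
```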

renard.utils.search_pattern(seq, pattern)

Search for a pattern in a sequence

Parameters:

seq (Iterable[TypeVar(R)]) – sequence in which to search
pattern (List[TypeVar(R)]) – searched pattern

Return type:

List[int]

Returns:

a list of the start indices of the pattern occurrences
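
For intuition, a naive version of this kind of search can be written in a few lines. The find_pattern function below is a hypothetical stand-in that compares every slice of the sequence to the pattern; the behaviour of the real function on edge cases (empty pattern, overlapping matches) is not documented here, so the choices made in this sketch are assumptions.

```python
from typing import Iterable, List, TypeVar

R = TypeVar("R")


def find_pattern(seq: Iterable[R], pattern: List[R]) -> List[int]:
    """Return the start index of every occurrence of `pattern` in `seq`."""
    items = list(seq)
    size = len(pattern)
    if size == 0:
        # assumption: an empty pattern matches nowhere
        return []
    return [
        i for i in range(len(items) - size + 1) if items[i : i + size] == pattern
    ]


tokens = ["the", "cat", "sat", "on", "the", "cat"]
starts = find_pattern(tokens, ["the", "cat"])
```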

renard.utils.spans(seq, max_len)

Cut the input sequence into all possible spans up to a maximum length

Note

spans are ordered from the smallest to the biggest, from the beginning of seq to the end of seq.

Parameters:

seq (Collection[TypeVar(T)])
max_len (int)

Return type:

List[Tuple[TypeVar(T)]]
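
The ordering described in the note (smallest spans first, then left to right) can be reproduced with two nested loops. The all_spans function below is a hypothetical re-implementation written from that description only, so it may differ from the library function in details.

```python
from typing import Collection, List, Tuple, TypeVar

T = TypeVar("T")


def all_spans(seq: Collection[T], max_len: int) -> List[Tuple[T, ...]]:
    """List every contiguous span of `seq` with length 1..max_len."""
    items = tuple(seq)
    result: List[Tuple[T, ...]] = []
    # outer loop: span length, from the smallest to the biggest
    for length in range(1, min(max_len, len(items)) + 1):
        # inner loop: start position, from the beginning to the end
        for start in range(len(items) - length + 1):
            result.append(items[start : start + length])
    return result


result = all_spans(["a", "b", "c"], max_len=2)
```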

Graph utils

renard.graph_utils.cumulative_graph(graphs)

Turn a dynamic graph into a cumulative graph, weight-wise

Parameters:: graphs (List[Graph]) – A list of sequential graphs
Return type:: List[Graph]
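
The idea of a weight-wise cumulative graph can be shown without networkx by representing each graph as a dictionary of edge weights, which is an assumption made purely to keep this sketch dependency-free. The cumulative_weights helper is hypothetical: it only illustrates that the graph at step i carries the summed weights of steps 0 to i.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def cumulative_weights(graphs: List[Dict[Edge, float]]) -> List[Dict[Edge, float]]:
    """Turn per-step edge weights into running totals."""
    totals: Dict[Edge, float] = {}
    cumulative: List[Dict[Edge, float]] = []
    for graph in graphs:
        for edge, weight in graph.items():
            totals[edge] = totals.get(edge, 0) + weight
        # copy so that later steps do not modify earlier snapshots
        cumulative.append(dict(totals))
    return cumulative


steps = [{("A", "B"): 1}, {("A", "B"): 2, ("B", "C"): 1}]
cumulated = cumulative_weights(steps)
```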

renard.graph_utils.dynamic_graph_to_gephi_graph(graphs)

Convert a dynamic graph to a Gephi-compatible dynamic graph. The resulting graph can be exported using G.write_gexf() and will be read correctly by Gephi.

Note

Because of a limitation in networkx, the dynamic weight attribute is stored as dweight instead of weight.

Parameters:: graphs (List[Graph]) – a dynamic graph
Return type:: Graph
Returns:: A dynamic Gephi-compatible graph

renard.graph_utils.graph_edges_attributes(G)

Compute the set of all edge attributes of a graph

Parameters:: G (Graph)
Return type:: Set[str]

renard.graph_utils.graph_with_names(G, name_style='most_frequent')

Relabel a character graph, using a single name for each node

Parameters:

name_style (Union[Literal['longest', 'shortest', 'most_frequent'], Callable[[Character], str]]) – character name style in the resulting graph. Either a string ('longest', 'shortest' or 'most_frequent') or a custom function associating a character to its name
G (Graph)

Return type:

Graph
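
The three string values of name_style can be read as simple selection rules over the forms under which a character appears. The sketch below represents a character as a plain list of mention strings, which is an assumption for illustration; pick_name is a hypothetical helper and not part of renard.graph_utils.

```python
from collections import Counter
from typing import Callable, List, Union


def pick_name(
    mentions: List[str],
    name_style: Union[str, Callable[[List[str]], str]] = "most_frequent",
) -> str:
    """Choose a single display name among a character's mentions."""
    if callable(name_style):
        # custom function: the caller decides how to name the character
        return name_style(mentions)
    if name_style == "longest":
        return max(mentions, key=len)
    if name_style == "shortest":
        return min(mentions, key=len)
    if name_style == "most_frequent":
        return Counter(mentions).most_common(1)[0][0]
    raise ValueError(f"unknown name_style: {name_style!r}")


mentions = ["Liz", "Elizabeth Bennet", "Liz", "Elizabeth"]
default_name = pick_name(mentions)
longest_name = pick_name(mentions, "longest")
```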

renard.graph_utils.layout_with_names(G, layout, name_style='most_frequent')

Parameters:

G (Graph) – a graph of Character
layout (Union[Dict[Character, Tuple[float, float]], Dict[Character, ndarray]])
name_style (Union[Literal['longest', 'shortest', 'most_frequent'], Callable[[Character], str]])

Return type:

dict

Plot utils

renard.plot_utils.plot_nx_graph_reasonably(G, ax=None, layout=None, node_kwargs=None, edge_kwargs=None, label_kwargs=None, legend=False)

Try to plot a nx.Graph with ‘reasonable’ parameters

Parameters:

G (Graph) – the graph to draw
ax – matplotlib axes
layout (Optional[dict]) – if given, this graph layout will be applied. Otherwise, use layout_nx_graph_reasonably().
node_kwargs (Optional[Dict[str, Any]]) – passed to nx.draw_networkx_nodes()
edge_kwargs (Optional[Dict[str, Any]]) – passed to nx.draw_networkx_edges()
label_kwargs (Optional[Dict[str, Any]]) – passed to nx.draw_networkx_labels()
legend (bool) – if True, will try to plot an additional legend.

NER utils

class renard.ner_utils.DataCollatorForTokenClassificationWithBatchEncoding(tokenizer, pad_to_multiple_of=None)

Same as transformers.DataCollatorForTokenClassification, except it correctly returns a BatchEncoding object with correct encodings attribute.

It is unclear why this is not the default behaviour.

Parameters:

tokenizer (PreTrainedTokenizerFast)
pad_to_multiple_of (Optional[int])

__call__(features)

Call self as a function.

Parameters:: features (List[dict])
Return type:: Union[dict, BatchEncoding]

__init__(tokenizer, pad_to_multiple_of=None)

Parameters:

tokenizer (PreTrainedTokenizerFast)
pad_to_multiple_of (Optional[int])

class renard.ner_utils.NERDataset(elements, tokenizer, context_mask=None)

Variables:

_context_mask – for each element, a mask indicating which tokens are part of the context (0 for context, 1 for text on which to perform inference). The mask makes it possible to discard predictions made for context at inference time, even though the context can still be passed as input to the model.

Parameters:

elements (List[List[str]])
tokenizer (PreTrainedTokenizerFast)
context_mask (Optional[List[List[int]]])
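
The role of the context mask can be illustrated on a single element: predictions aligned with a 0 in the mask are dropped, and those aligned with a 1 are kept. The drop_context_predictions helper below is hypothetical and works on plain lists, whereas the dataset itself deals with tokenizer encodings.

```python
from typing import List


def drop_context_predictions(
    predictions: List[str], context_mask: List[int]
) -> List[str]:
    """Keep only predictions made on non-context tokens (mask value 1)."""
    if len(predictions) != len(context_mask):
        raise ValueError("predictions and context_mask must have the same length")
    return [pred for pred, keep in zip(predictions, context_mask) if keep == 1]


# first and last tokens are context given to the model, but not evaluated
predictions = ["O", "B-PER", "I-PER", "O", "O"]
context_mask = [0, 1, 1, 1, 0]
kept = drop_context_predictions(predictions, context_mask)
```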

__init__(elements, tokenizer, context_mask=None)

Parameters:

elements (List[List[str]])
tokenizer (PreTrainedTokenizerFast)
context_mask (Optional[List[List[int]]])

renard.ner_utils._tokenize_and_align_labels(examples, tokenizer, label_all_tokens=True)

Adapted from https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=vc0BSBLIIrJQ

Parameters:

examples – an object with keys ‘tokens’ and ‘labels’
tokenizer (PreTrainedTokenizerFast)
label_all_tokens (bool)

renard.ner_utils.hgdataset_from_conll2002(path, tag_conversion_map=None, separator='\\t', max_sent_len=None, **kwargs)

Load a CoNLL-2002 file as a Huggingface Dataset.

Parameters:

path (str) – passed to load_conll2002_bio()
tag_conversion_map (Optional[Dict[str, str]]) – passed to load_conll2002_bio()
separator (str) – passed to load_conll2002_bio()
max_sent_len (Optional[int]) – passed to load_conll2002_bio()
kwargs – additional kwargs for open()

Return type:

Dataset

Returns:

a datasets.Dataset with features ‘tokens’ and ‘labels’.

renard.ner_utils.load_conll2002_bio(path, tag_conversion_map=None, separator='\\t', max_sent_len=None, **kwargs)

Load a file under CoNLL-2002 BIO format. Sentences are expected to be separated by end of lines. Tags should be in the CoNLL-2002 format (such as ‘B-PER I-PER’). If this is not the case, see the tag_conversion_map argument.

Parameters:

path (str) – path to the CoNLL-2002 formatted file
separator (str) – separator between token and BIO tags
tag_conversion_map (Optional[Dict[str, str]]) – conversion map for tags found in the input file. Example : {'B': 'B-PER', 'I': 'I-PER'}
max_sent_len (Optional[int]) – if specified, maximum length, in tokens, of sentences.
kwargs – additional kwargs for open() (such as encoding or newline).

Return type:

Tuple[List[List[str]], List[str], List[NEREntity]]

Returns:

(sentences, tokens, entities)
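
The effect of separator and tag_conversion_map can be sketched on a few in-memory lines. The parse_bio_lines helper below is hypothetical: it only splits each line into a token and a tag and applies the conversion map, and it does not reproduce sentence splitting, max_sent_len, or entity extraction.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def parse_bio_lines(
    lines: Iterable[str],
    separator: str = "\t",
    tag_conversion_map: Optional[Dict[str, str]] = None,
) -> Tuple[List[str], List[str]]:
    """Read 'token<separator>tag' lines and convert tags if requested."""
    mapping = tag_conversion_map or {}
    tokens: List[str] = []
    tags: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # blank lines are simply skipped in this sketch
            continue
        parts = line.split(separator)
        # first column is the token, last column is the tag
        tokens.append(parts[0])
        tags.append(mapping.get(parts[-1], parts[-1]))
    return tokens, tags


lines = ["Alice\tB", "Liddell\tI", "smiled\tO"]
tokens, tags = parse_bio_lines(lines, tag_conversion_map={"B": "B-PER", "I": "I-PER"})
```

Tags that are absent from the conversion map, such as ‘O’ here, are left unchanged.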

renard.ner_utils.ner_entities(tokens, bio_tags, resolve_inconsistencies=True)

Extract NER entities from a list of BIO tags

Parameters:

tokens (List[str]) – a list of tokens
bio_tags (List[str]) – a list of BIO tags. In particular, BIO tags should be in the CoNLL-2002 form (such as ‘B-PER I-PER’)
resolve_inconsistencies (bool)

Return type:

List[NEREntity]

Returns:

A list of NER entities, in order of appearance
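
As an illustration of BIO decoding, the sketch below groups consecutive B-/I- tags into (label, start, end) spans. It is not the library's implementation: the bio_spans name, the exclusive end index, and the choice to treat an I- tag without a matching B- tag as the start of a new entity are all assumptions, and the real resolve_inconsistencies behaviour may differ.

```python
from typing import List, Optional, Tuple


def bio_spans(tokens: List[str], bio_tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (label, start, end) spans, end exclusive."""
    if len(tokens) != len(bio_tags):
        raise ValueError("tokens and bio_tags must have the same length")
    entities: List[Tuple[str, int, int]] = []
    start = 0
    label: Optional[str] = None
    for i, tag in enumerate(bio_tags):
        if tag.startswith("I-") and label == tag[2:]:
            # continuation of the entity in progress
            continue
        if label is not None:
            # any other tag closes the entity in progress
            entities.append((label, start, i))
            label = None
        if tag.startswith("B-") or tag.startswith("I-"):
            # assumption: an orphan I- tag opens a new entity
            start, label = i, tag[2:]
    if label is not None:
        entities.append((label, start, len(bio_tags)))
    return entities


tokens = ["Alice", "Liddell", "met", "Bob", "in", "Oxford"]
tags = ["B-PER", "I-PER", "O", "B-PER", "O", "B-LOC"]
entities = bio_spans(tokens, tags)
```

Each span can then be turned back into text with tokens[start:end], for example tokens[0:2] for the first entity.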