Reference
Core
Pipeline
- class renard.pipeline.core.Pipeline(steps, lang='eng', progress_report='tqdm', warn=True)
A flexible NLP pipeline
- Parameters
steps (List[PipelineStep]) –
lang (str) –
progress_report (Optional[Literal['tqdm']]) –
warn (bool) –
- PipelineParameter
All the possible parameters of the whole pipeline that are shared between steps.
alias of Literal['lang', 'progress_reporter', 'character_ner_tag']
- __call__(text=None, ignored_steps=None, **kwargs)
Run the pipeline sequentially.
- Parameters
ignored_steps (Optional[List[str]]) – a list of step productions. All steps with a production in ignored_steps will be ignored.
text (Optional[str]) –
- Return type
- Returns
the output of the last step of the pipeline
- __init__(steps, lang='eng', progress_report='tqdm', warn=True)
- Parameters
steps (List[PipelineStep]) – a tuple of PipelineStep, that will be executed in order
progress_report (Optional[Literal['tqdm']]) – if 'tqdm', report the pipeline progress using tqdm. If None, does not report progress.
lang (str) – ISO 639-3 language code
warn (bool) –
- _non_ignored_steps(ignored_steps)
Get steps that are not ignored.
- Parameters
ignored_steps (Optional[List[str]]) – a list of step productions. All steps with a production in ignored_steps won't be returned.
- Return type
List[PipelineStep]
- _pipeline_init_steps_(ignored_steps=None)
Initialise steps with global pipeline parameters.
- Parameters
ignored_steps (Optional[List[str]]) – a list of step productions. All steps with a production in ignored_steps will be ignored.
- check_valid(*args, ignored_steps=None)
Check that the current pipeline can be run, which is possible if the needs of all steps are satisfied.
- Parameters
args – list of additional attributes to add to the starting pipeline state.
ignored_steps (Optional[List[str]]) – a list of step productions. All steps with a production in ignored_steps will be ignored.
- Return type
Tuple[bool, List[str]]
- Returns
a tuple: (True, [warnings]) if the pipeline is valid, (False, [errors]) otherwise
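The validity check described above can be pictured with a small stand-alone sketch. It does not use Renard itself: StepSketch and check_valid_sketch are hypothetical names, the starting state is assumed to contain only 'text' plus the extra attributes, and the real method's warnings are not modelled (an empty list is returned when the pipeline is valid).

```python
from typing import List, Set, Tuple


class StepSketch:
    """Hypothetical stand-in for a pipeline step (not Renard's PipelineStep)."""

    def __init__(self, name: str, needs: Set[str], production: Set[str]):
        self.name = name
        self._needs = needs
        self._production = production

    def needs(self) -> Set[str]:
        return self._needs

    def production(self) -> Set[str]:
        return self._production


def check_valid_sketch(steps: List[StepSketch], *args: str) -> Tuple[bool, List[str]]:
    """Return (True, []) if every step's needs are met by the state built so far."""
    available = {"text", *args}  # assumed starting state: raw text plus extras
    errors = []
    for step in steps:
        missing = step.needs() - available
        if missing:
            errors.append(f"{step.name} is missing {sorted(missing)}")
        # Whatever the step produces becomes available to later steps.
        available |= step.production()
    return (len(errors) == 0, errors)


steps = [
    StepSketch("tokenizer", {"text"}, {"tokens", "sentences"}),
    StepSketch("ner", {"tokens"}, {"entities"}),
    StepSketch("unifier", {"entities"}, {"characters"}),
]
ok, messages = check_valid_sketch(steps)
# Dropping the tokenizer leaves the 'tokens' need of the NER step unsatisfied.
broken_ok, broken_messages = check_valid_sketch(steps[1:])
```

Passing "tokens" as an extra argument, as in check_valid_sketch(steps[1:], "tokens"), makes the shortened pipeline valid again, which mirrors the role of args in the documented method.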
- rerun_from(state, from_step, ignored_steps=None)
Recompute steps, starting from from_step (included). Previous steps results are not recomputed.
Note
steps are not re-inited using _pipeline_init_steps_().
- Parameters
state (PipelineState) – the previously computed state
from_step (Union[str, Type[PipelineStep]]) – first step to recompute from. Either:
str: in that case, the name of a step production ('tokens', 'corefs'…)
Type[PipelineStep]: in that case, the class of a step
ignored_steps (Optional[List[str]]) – a list of step productions. All steps with a production in ignored_steps will be ignored.
- Return type
- Returns
the output of the last step of the pipeline
Pipeline State
- class renard.pipeline.core.PipelineState(text, dynamic_blocks=None, tokens=None, char2token=None, sentences=None, quotes=None, speakers=None, sentences_polarities=None, entities=None, corefs=None, characters=None, character_network=None)
The state of a pipeline, annotated during a Pipeline's lifetime
- Parameters
text (Optional[str]) –
dynamic_blocks (Optional[List[Tuple[int, int]]]) –
tokens (Optional[List[str]]) –
char2token (Optional[List[int]]) –
sentences (Optional[List[List[str]]]) –
quotes (Optional[List[Quote]]) –
speakers (Optional[List[Optional[Character]]]) –
sentences_polarities (Optional[List[float]]) –
entities (Optional[List[NEREntity]]) –
corefs (Optional[List[List[Mention]]]) –
characters (Optional[List[Character]]) –
character_network (Union[List[Graph], Graph, None]) –
- __eq__(other)
Return self==value.
- __hash__ = None
- __init__(text, dynamic_blocks=None, tokens=None, char2token=None, sentences=None, quotes=None, speakers=None, sentences_polarities=None, entities=None, corefs=None, characters=None, character_network=None)
- Parameters
text (Optional[str]) –
dynamic_blocks (Optional[List[Tuple[int, int]]]) –
tokens (Optional[List[str]]) –
char2token (Optional[List[int]]) –
sentences (Optional[List[List[str]]]) –
quotes (Optional[List[Quote]]) –
speakers (Optional[List[Optional[Character]]]) –
sentences_polarities (Optional[List[float]]) –
entities (Optional[List[NEREntity]]) –
corefs (Optional[List[List[Mention]]]) –
characters (Optional[List[Character]]) –
character_network (Union[List[Graph], Graph, None]) –
- __repr__()
Return repr(self).
- char2token: Optional[List[int]] = None
mapping from a character to its corresponding token
- character_network: Optional[Union[List[networkx.classes.graph.Graph], networkx.classes.graph.Graph]] = None
character network (or list of networks in the case of a dynamic network)
- characters: Optional[List[renard.pipeline.character_unification.Character]] = None
detected characters
- corefs: Optional[List[List[renard.pipeline.core.Mention]]] = None
coreference chains
- dynamic_blocks: Optional[List[Tuple[int, int]]] = None
text split into blocks of texts. When dynamic blocks are given, the final network is dynamic, and split according to blocks.
- entities: Optional[List[renard.pipeline.ner.NEREntity]] = None
NER entities
- export_graph_to_gexf(path, name_style='most_frequent')
Export characters graph to Gephi’s gexf format
- Parameters
path (str) – export file path
name_style (Union[Literal['longest', 'shortest', 'most_frequent'], Callable[[Character], str]]) – see graph_with_names() for more details
- get_character(name, partial_match=True)
Try to get a character by one of its names.
Note
Several characters may match the given name, but only the first one is returned.
Note
Comparison is case-insensitive.
- Parameters
name (str) – one of the names of the searched character.
partial_match (bool) – when True, will also return a character if the given name is only part of one of its names. Otherwise, only a character with the given name will be returned.
- Return type
Optional[Character]
- Returns
a Character, or None if no character was found.
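The lookup behaviour described above (case-insensitive, first match wins, optional partial matching) can be sketched without Renard. In this sketch a character is reduced to a frozenset of names, get_character_sketch is a hypothetical helper, and "partial match" is assumed to mean substring containment, which may differ from the library's actual rule.

```python
from typing import FrozenSet, List, Optional


def get_character_sketch(
    characters: List[FrozenSet[str]], name: str, partial_match: bool = True
) -> Optional[FrozenSet[str]]:
    """Return the first character (here, a set of names) matching name."""
    wanted = name.lower()
    for names in characters:
        for candidate in names:
            candidate = candidate.lower()  # comparison is case-insensitive
            if candidate == wanted or (partial_match and wanted in candidate):
                return names  # only the first matching character is returned
    return None


characters = [
    frozenset({"Elizabeth Bennet", "Lizzy"}),
    frozenset({"Mr. Darcy", "Darcy"}),
]
exact = get_character_sketch(characters, "darcy", partial_match=False)
partial = get_character_sketch(characters, "bennet")
strict_miss = get_character_sketch(characters, "bennet", partial_match=False)
```

Here "bennet" only matches when partial matching is enabled, since no character carries that exact name.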
- plot_graph(name_style='most_frequent', fig=None, cumulative=False, graph_start_idx=1, stable_layout=False, layout=None, node_kwargs=None, edge_kwargs=None, label_kwargs=None)
Plot self.character_network using reasonable default parameters
Note
when plotting a dynamic graph, a slider attribute is added to fig when it is given, in order to keep a reference to the slider.
- Parameters
name_style (Union[Literal['longest', 'shortest', 'most_frequent'], Callable[[Character], str]]) – see graph_with_names() for more details
fig (Optional[Figure]) – if specified, this matplotlib figure will be used for plotting
cumulative (bool) – if True and self.character_network is dynamic, plot a cumulative graph instead of a sequential one
graph_start_idx (int) – when self.character_network is dynamic, index of the first graph to plot, starting at 1 (not 0, since the graph slider starts at 1)
stable_layout (bool) – if self.character_network is dynamic and this parameter is True, characters will keep the same position in space at each timestep. Characters' positions are based on the final cumulative graph layout.
layout (Union[Dict[Character, Tuple[float, float]], Dict[Character, ndarray], None]) – pre-computed graph layout
node_kwargs (Union[Dict[str, Any], List[Dict[str, Any]], None]) – passed to nx.draw_networkx_nodes()
edge_kwargs (Union[Dict[str, Any], List[Dict[str, Any]], None]) – passed to nx.draw_networkx_edges()
label_kwargs (Union[Dict[str, Any], List[Dict[str, Any]], None]) – passed to nx.draw_networkx_labels()
- plot_graph_to_file(path, name_style='most_frequent', layout=None, fig=None, node_kwargs=None, edge_kwargs=None, label_kwargs=None)
Plot self.character_network using reasonable parameters, and save the produced figure to a file
- Parameters
name_style (Union[Literal['longest', 'shortest', 'most_frequent'], Callable[[Character], str]]) – see graph_with_names() for more details
layout (Union[Dict[Character, Tuple[float, float]], Dict[Character, ndarray], None]) – pre-computed graph layout
fig (Optional[Figure]) – if specified, this matplotlib figure will be used for plotting
node_kwargs (Optional[Dict[str, Any]]) – passed to nx.draw_networkx_nodes()
edge_kwargs (Optional[Dict[str, Any]]) – passed to nx.draw_networkx_edges()
label_kwargs (Optional[Dict[str, Any]]) – passed to nx.draw_networkx_labels()
path (str) –
- plot_graphs_to_dir(directory, name_style='most_frequent', cumulative=False, stable_layout=False, layout=None, node_kwargs=None, edge_kwargs=None, label_kwargs=None)
Plot self.character_network using reasonable default parameters, and save the produced figures in the specified directory.
- Parameters
name_style (Union[Literal['longest', 'shortest', 'most_frequent'], Callable[[Character], str]]) – see graph_with_names() for more details
cumulative (bool) – if True, plot a cumulative graph instead of a sequential one
stable_layout (bool) – if this parameter is True, characters will keep the same position in space at each timestep. Characters' positions are based on the final cumulative graph layout.
layout (Union[Dict[Character, Tuple[float, float]], Dict[Character, ndarray], None]) – pre-computed graph layout
node_kwargs (Optional[List[Dict[str, Any]]]) – passed to nx.draw_networkx_nodes()
edge_kwargs (Optional[List[Dict[str, Any]]]) – passed to nx.draw_networkx_edges()
label_kwargs (Optional[List[Dict[str, Any]]]) – passed to nx.draw_networkx_labels()
directory (str) –
- quotes: Optional[List[renard.pipeline.quote_detection.Quote]] = None
quotes
- sentences: Optional[List[List[str]]] = None
text split into sentences, each sentence being a list of tokens
- sentences_polarities: Optional[List[float]] = None
polarity of each sentence
- speakers: Optional[List[Optional[renard.pipeline.character_unification.Character]]] = None
quotes speakers
- text: Optional[str]
input text
- tokens: Optional[List[str]] = None
text split into tokens
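The char2token attribute listed above maps a character of the text to its token. As a rough illustration only, the sketch below builds such a mapping from a text and its tokens. build_char2token is a hypothetical helper, and the use of -1 for characters that fall between tokens is an assumption of this sketch, not something the reference states.

```python
from typing import List


def build_char2token(text: str, tokens: List[str]) -> List[int]:
    """Map every character index of text to a token index (-1 between tokens)."""
    char2token = [-1] * len(text)
    cursor = 0
    for token_i, token in enumerate(tokens):
        # Find the next occurrence of the token at or after the cursor.
        start = text.index(token, cursor)  # raises ValueError if not found
        for char_i in range(start, start + len(token)):
            char2token[char_i] = token_i
        cursor = start + len(token)
    return char2token


text = "Hi, Ann!"
tokens = ["Hi", ",", "Ann", "!"]
mapping = build_char2token(text, tokens)
```

With such a list, character-level boundaries (for example chapter offsets) can be converted to token-level boundaries by a simple index lookup.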
Pipeline Steps
- class renard.pipeline.core.PipelineStep
An abstract pipeline step
Note
The __call__, needs and production methods must be overridden by derived classes.
Note
The optional_needs and supported_langs methods can be overridden by derived classes.
- __call__(text, **kwargs)
Call self as a function.
- Parameters
text (str) –
- Return type
Dict[str, Any]
- __init__()
Initialize the PipelineStep with a given configuration.
- _pipeline_init_(lang, progress_reporter, **kwargs)
Set the step configuration that is common to the whole pipeline.
- Parameters
lang (str) – the lang of the whole pipeline
progress_reporter (ProgressReporter) –
kwargs – additional pipeline parameters.
- Return type
Optional[Dict[Literal['lang', 'progress_reporter', 'character_ner_tag'], Any]]
- Returns
a step can return a dictionary of pipeline params if it wishes to modify some of these.
- needs()
- Return type
Set[str]
- Returns
a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.
- optional_needs()
- Return type
Set[str]
- Returns
a set of state attributes optionally needed by this PipelineStep. This method can be overridden by derived classes.
- production()
- Return type
Set[str]
- Returns
a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.
- supported_langs()
- Return type
Union[Set[str], Literal['any']]
- Returns
a set of supported languages, as ISO 639-3 codes, or the string 'any'
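The contract described above (a step declares what it needs, declares what it produces, and returns a dictionary from __call__) can be sketched with a small stand-in. StepBase, WhitespaceTokenizer, TokenCounter, run_sequentially and the 'token_count' attribute are all hypothetical and only mirror the documented interface. They are not Renard classes, and the way outputs are merged into the state is an assumption.

```python
from typing import Any, Dict, List, Set


class StepBase:
    """Hypothetical stand-in mirroring the documented PipelineStep interface."""

    def __call__(self, text: str, **kwargs) -> Dict[str, Any]:
        raise NotImplementedError

    def needs(self) -> Set[str]:
        raise NotImplementedError

    def production(self) -> Set[str]:
        raise NotImplementedError

    def optional_needs(self) -> Set[str]:
        return set()


class WhitespaceTokenizer(StepBase):
    """Toy step: produces 'tokens' from 'text'."""

    def __call__(self, text: str, **kwargs) -> Dict[str, Any]:
        return {"tokens": text.split()}

    def needs(self) -> Set[str]:
        return {"text"}

    def production(self) -> Set[str]:
        return {"tokens"}


class TokenCounter(StepBase):
    """Toy step: consumes 'tokens' and produces a hypothetical 'token_count'."""

    def __call__(self, text: str, tokens: List[str], **kwargs) -> Dict[str, Any]:
        return {"token_count": len(tokens)}

    def needs(self) -> Set[str]:
        return {"tokens"}

    def production(self) -> Set[str]:
        return {"token_count"}


def run_sequentially(steps: List[StepBase], text: str) -> Dict[str, Any]:
    """Call each step with the state so far and merge its output into the state."""
    state: Dict[str, Any] = {"text": text}
    for step in steps:
        state.update(step(**state))
    return state


state = run_sequentially([WhitespaceTokenizer(), TokenCounter()], "The fox runs")
```

Each step receives the accumulated state as keyword arguments, which is why __call__ accepts **kwargs: a step simply ignores the attributes it does not use.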
Preprocessing
- class renard.pipeline.preprocessing.CustomSubstitutionPreprocessor(substition_rules)
A preprocessor allowing regex-based substitution
- Parameters
substition_rules (List[Tuple[str, str]]) –
- __call__(text, **kwargs)
- Parameters
text (str) –
- Return type
Dict[str, Any]
- __init__(substition_rules)
- Parameters
substition_rules (List[Tuple[str, str]]) – A list of rules, each rule being of the form (match, substitution).
- needs()
- Return type
Set[str]
- Returns
a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.
- production()
- Return type
Set[str]
- Returns
a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.
- supported_langs()
- Return type
Union[Set[str], Literal['any']]
- Returns
a set of supported languages, as ISO 639-3 codes, or the string 'any'
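The (match, substitution) rules accepted by CustomSubstitutionPreprocessor can be pictured with Python's standard re module. This is a minimal sketch that assumes the rules are applied in order with re.sub. The helper apply_substitution_rules and the two example rules are hypothetical and are not taken from Renard.

```python
import re
from typing import List, Tuple


def apply_substitution_rules(text: str, rules: List[Tuple[str, str]]) -> str:
    """Apply each (match, substitution) rule in order with re.sub."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


rules = [
    (r"\s+", " "),              # collapse runs of whitespace
    (r"M\. (\w+)", r"Mr. \1"),  # hypothetical rule: expand an abbreviation
]
cleaned = apply_substitution_rules("M.  Darcy   smiled.", rules)
```

Rule order matters in this sketch: the second rule only matches because the first one has already reduced the double space to a single space.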
Tokenization
NLTKTokenizer
- class renard.pipeline.tokenization.NLTKTokenizer
An NLTK-based tokenizer
- __call__(text, **kwargs)
Call self as a function.
- Parameters
text (str) –
- Return type
Dict[str, Any]
- __init__()
Initialize the PipelineStep with a given configuration.
- _pipeline_init_(lang, **kwargs)
Set the step configuration that is common to the whole pipeline.
- Parameters
lang (str) – the lang of the whole pipeline
progress_reporter –
kwargs – additional pipeline parameters.
- Returns
a step can return a dictionary of pipeline params if it wishes to modify some of these.
- needs()
- Return type
Set[str]
- Returns
a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.
- production()
- Return type
Set[str]
- Returns
a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.
- supported_langs()
- Return type
Union[Set[str], Literal['any']]
- Returns
a set of supported languages, as ISO 639-3 codes, or the string 'any'
Named Entity Recognition
- class renard.pipeline.ner.NEREntity(tokens, start_idx, end_idx, tag)
- Parameters
tokens (List[str]) –
start_idx (int) –
end_idx (int) –
tag (str) –
- __eq__(other)
Return self==value.
- __hash__()
Return hash(self).
- Return type
int
- __init__(tokens, start_idx, end_idx, tag)
- Parameters
tokens (List[str]) –
start_idx (int) –
end_idx (int) –
tag (str) –
- __repr__()
Return repr(self).
- shifted(shift)
Note
This method is implemented here to avoid type issues. Since Mention.shifted() cannot be annotated as returning Self, this method annotates the correct return type when using NEREntity.shifted().
- Parameters
shift (int) –
- Return type
- tag: str
NER class (without BIO prefix, as in PER and not B-PER)
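To illustrate what "without BIO prefix" means for the tag attribute, the sketch below groups a BIO-tagged token sequence into (tokens, start_idx, end_idx, tag) tuples shaped like the NEREntity fields. bio_to_spans is a hypothetical helper, and treating end_idx as exclusive is an assumption of this sketch.

```python
from typing import List, Tuple


def bio_to_spans(
    tokens: List[str], tags: List[str]
) -> List[Tuple[List[str], int, int, str]]:
    """Group BIO tags into (tokens, start_idx, end_idx, tag) tuples.

    end_idx is exclusive here, which is an assumption of this sketch.
    """
    spans = []
    start = None
    current = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and current == tag[2:]
        if start is not None and not continues:
            # The open entity ends just before position i.
            spans.append((tokens[start:i], start, i, current))
            start, current = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # Open a new entity and keep its class without the B-/I- prefix.
            start, current = i, tag[2:]
    return spans


tokens = ["Elizabeth", "Bennet", "met", "Darcy"]
tags = ["B-PER", "I-PER", "O", "B-PER"]
entities = bio_to_spans(tokens, tags)
```

Both resulting tuples carry the tag "PER" rather than "B-PER" or "I-PER".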
BertNamedEntityRecognizer
- class renard.pipeline.ner.BertNamedEntityRecognizer(model=None, batch_size=4, device='auto', tokenizer=None, context_retriever=None)
An entity recognizer based on BERT
- Parameters
model (Union[PreTrainedModel, str, None]) –
batch_size (int) –
device (Literal['cpu', 'cuda', 'auto']) –
tokenizer (Optional[PreTrainedTokenizerFast]) –
context_retriever (Optional[NERContextRetriever]) –
- __call__(tokens, sentences, **kwargs)
- Parameters
text –
tokens (List[str]) –
sentences (List[List[str]]) –
- Return type
Dict[str, Any]
- __init__(model=None, batch_size=4, device='auto', tokenizer=None, context_retriever=None)
- Parameters
model (Union[PreTrainedModel, str, None]) – Either:
None: the model will be chosen according to the lang of the pipeline
str: a Hugging Face model ID
a PreTrainedModel: a custom pre-trained BERT model. If specified, a tokenizer must be passed as well.
batch_size (int) – batch size at inference
device (Literal['cpu', 'cuda', 'auto']) – computation device
tokenizer (Optional[PreTrainedTokenizerFast]) – a custom tokenizer
context_retriever (Optional[NERContextRetriever]) – if specified, use context_retriever to retrieve relevant global context at run time, generally trading runtime for NER performance.
- _pipeline_init_(lang, **kwargs)
Set the step configuration that is common to the whole pipeline.
- Parameters
lang (str) – the lang of the whole pipeline
progress_reporter –
kwargs – additional pipeline parameters.
- Returns
a step can return a dictionary of pipeline params if it wishes to modify some of these.
- batch_labels(batchs, batch_i, wp_labels, tokens, context_mask)
Align labels to tokens rather than wordpiece tokens.
- Parameters
batchs (BatchEncoding) – Hugging Face batch
batch_i (int) – batch index
wp_labels (List[str]) – wordpiece aligned labels
tokens (List[str]) – original tokens
context_mask (Tensor) –
- Return type
List[str]
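Aligning labels to tokens rather than wordpieces, as batch_labels is described as doing, is commonly done by keeping the label of the first wordpiece of each token. The sketch below shows that idea on plain lists. align_labels_to_tokens is a hypothetical helper, the word_ids list imitates what a Hugging Face BatchEncoding exposes through word_ids(), and the context_mask parameter is not modelled, so this may differ from the library's actual logic.

```python
from typing import List, Optional


def align_labels_to_tokens(
    word_ids: List[Optional[int]], wp_labels: List[str], n_tokens: int
) -> List[str]:
    """Keep the label of the first wordpiece of each original token."""
    labels = ["O"] * n_tokens
    seen = set()
    for word_id, label in zip(word_ids, wp_labels):
        # Skip special tokens (None) and non-initial wordpieces of a token.
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        labels[word_id] = label
    return labels


# The second token is assumed to be split into three wordpieces (word id 1).
word_ids = [None, 0, 1, 1, 1, 2, None]
wp_labels = ["O", "O", "B-LOC", "I-LOC", "I-LOC", "O", "O"]
token_labels = align_labels_to_tokens(word_ids, wp_labels, n_tokens=3)
```

The result has one label per original token, so it can be zipped directly with the token list.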
- needs()
- Return type
Set[str]
- Returns
a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.
- production()
- Return type
Set[str]
- Returns
a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.
- supported_langs()
- Return type
Union[Set[str], Literal['any']]
- Returns
a set of supported languages, as ISO 639-3 codes, or the string 'any'
NLTKNamedEntityRecognizer
- class renard.pipeline.ner.NLTKNamedEntityRecognizer
An entity recognizer based on NLTK
- __call__(tokens, **kwargs)
- Parameters
text –
tokens (List[str]) –
- Return type
Dict[str, Any]
- __init__()
- Parameters
language – ISO 639-2 3-letter language code
- needs()
- Return type
Set[str]
- Returns
a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.
- production()
- Return type
Set[str]
- Returns
a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.
- supported_langs()
- Return type
Union[Set[str], Literal['any']]
- Returns
a set of supported languages, as ISO 639-3 codes, or the string 'any'
Coreference Resolution
A coreference resolver returns a list of coreference chains, each chain being a list of Mention.
- class renard.pipeline.core.Mention(tokens, start_idx, end_idx)
- Parameters
tokens (List[str]) –
start_idx (int) –
end_idx (int) –
- __hash__()
Return hash(self).
- Return type
int
- __init__(tokens, start_idx, end_idx)
- Parameters
tokens (List[str]) –
start_idx (int) –
end_idx (int) –
- __repr__()
Return repr(self).
BertCoreferenceResolver
- class renard.pipeline.corefs.BertCoreferenceResolver(model=None, hugginface_model_id=None, batch_size=1, device='auto', tokenizer=None, block_size=512, hierarchical_merging=False)
A coreference resolver using BERT. Loosely based on 'End-to-end Neural Coreference Resolution' (Lee et al. 2017) and 'BERT for coreference resolution' (Joshi et al. 2019).
- Parameters
model (Optional[BertForCoreferenceResolution]) –
hugginface_model_id (Optional[str]) –
batch_size (int) –
device (Literal['auto', 'cuda', 'cpu']) –
tokenizer (Optional[PreTrainedTokenizerFast]) –
block_size (int) –
hierarchical_merging (bool) –
- __call__(tokens, **kwargs)
Call self as a function.
- Parameters
tokens (List[str]) –
- Return type
Dict[str, Any]
- __init__(model=None, hugginface_model_id=None, batch_size=1, device='auto', tokenizer=None, block_size=512, hierarchical_merging=False)
Note
In the future, only mentions_per_tokens, antecedents_nb and max_span_size shall be read directly from the model's config.
- Parameters
hugginface_model_id (Optional[str]) – a custom Hugging Face model ID. This bypasses the lang pipeline parameter, which normally chooses a Hugging Face model automatically.
batch_size (int) – batch size at inference
device (Literal['auto', 'cuda', 'cpu']) – computation device
block_size (int) – size of blocks to pass to the coreference model
hierarchical_merging (bool) – if True, attempts to use tibert's hierarchical merging feature. In that case, blocks of size block_size are merged to perform inference on the whole document.
model (Optional[BertForCoreferenceResolution]) –
tokenizer (Optional[PreTrainedTokenizerFast]) –
- _pipeline_init_(lang, **kwargs)
Set the step configuration that is common to the whole pipeline.
- Parameters
lang (str) – the lang of the whole pipeline
progress_reporter –
kwargs – additional pipeline parameters.
- Returns
a step can return a dictionary of pipeline params if it wishes to modify some of these.
- needs()
- Return type
Set[str]
- Returns
a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.
- production()
- Return type
Set[str]
- Returns
a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.
- supported_langs()
- Return type
Union[Set[str], Literal['any']]
- Returns
a set of supported languages, as ISO 639-3 codes, or the string 'any'
SpacyCorefereeCoreferenceResolver
- class renard.pipeline.corefs.SpacyCorefereeCoreferenceResolver(max_chunk_size=10000)
A coreference resolver using spaCy's coreferee.
Note
This step requires installing Renard's 'spacy' extra.
While this step automatically installs the needed spaCy models, it still needs a manual installation of the coreferee model:
python -m coreferee install en
- Parameters
max_chunk_size (Optional[int]) –
- __call__(text, tokens, dynamic_blocks_tokens=None, **kwargs)
Call self as a function.
- Parameters
text (str) –
tokens (List[str]) –
dynamic_blocks_tokens (Optional[List[List[str]]]) –
- Return type
Dict[str, Any]
- __init__(max_chunk_size=10000)
- Parameters
max_chunk_size (Optional[int]) – coreference chunk size, in tokens
- static _coreferee_get_mention_tokens(coref_model, mention_heads, doc)
Coreferee only returns mention heads for mentions, and not the whole span. This hack (defined in the coreferee README at the end of part 2: https://github.com/richardpaulhudson/coreferee#2-interacting-with-the-data-model) gets the whole span as a list of spaCy tokens.
- Parameters
coref_model (CorefereeBroker) –
mention_heads (Mention) –
doc (Doc) –
- Return type
List[Token]
- _pipeline_init_(lang, progress_reporter)
Set the step configuration that is common to the whole pipeline.
- Parameters
lang (str) – the lang of the whole pipeline
progress_reporter (ProgressReporter) –
kwargs – additional pipeline parameters.
- Returns
a step can return a dictionary of pipeline params if it wishes to modify some of these.
- static _spacy_try_infer_spaces(tokens)
Try to infer, for each token, if there is a space between this token and the next.
- Parameters
tokens (List[str]) –
- Return type
List[bool]
- needs()
- Return type
Set[str]
- Returns
a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.
- optional_needs()
- Return type
Set[str]
- Returns
a set of state attributes optionally needed by this PipelineStep. This method can be overridden by derived classes.
- production()
- Return type
Set[str]
- Returns
a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.
Quote Detection
QuoteDetector
- class renard.pipeline.quote_detection.QuoteDetector(quote_pairs=None)
Extract quotes using simple rules.
- Parameters
quote_pairs (Optional[List[Tuple[str, str]]]) –
- __call__(tokens, **kwargs)
Call self as a function.
- Parameters
tokens (List[str]) –
- Return type
Dict[str, Any]
- __init__(quote_pairs=None)
- Parameters
quote_pairs (Optional[List[Tuple[str, str]]]) – if None, defaults to QuoteDetector.DEFAULT_QUOTE_PAIRS
- needs()
tokens
- Return type
Set[str]
- production()
quotes
- Return type
Set[str]
- supported_langs()
any
- Return type
Union[Set[str], Literal['any']]
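Rule-based quote extraction with pairs of opening and closing marks, as QuoteDetector is described as doing, could look like the sketch below. detect_quotes is a hypothetical helper that works on plain tuples instead of Renard's Quote objects, the example pairs are illustrative and are not the contents of DEFAULT_QUOTE_PAIRS, and nested quotes are not handled.

```python
from typing import List, Tuple


def detect_quotes(
    tokens: List[str], quote_pairs: List[Tuple[str, str]]
) -> List[Tuple[int, int, List[str]]]:
    """Return (start, end, tokens) for each quoted span, end being exclusive."""
    closers = dict(quote_pairs)  # opening mark -> matching closing mark
    quotes = []
    start = None
    for i, token in enumerate(tokens):
        if start is None and token in closers:
            start = i  # an opening mark starts a new quote
        elif start is not None and token == closers[tokens[start]]:
            quotes.append((start, i + 1, tokens[start : i + 1]))
            start = None
    return quotes


pairs = [("“", "”"), ("«", "»")]  # hypothetical pairs, not Renard's defaults
tokens = ["“", "Indeed", "”", ",", "she", "said", "."]
found = detect_quotes(tokens, pairs)
```

Because the closing mark is looked up from the opening one, the same function also works for symmetric pairs such as ('"', '"').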
Sentiment Analysis
NLTKSentimentAnalyzer
- class renard.pipeline.sentiment_analysis.NLTKSentimentAnalyzer
A sentiment analyzer based on NLTK’s Vader.
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
- __call__(sentences, **kwargs)
Call self as a function.
- Parameters
sentences (List[List[str]]) –
- Return type
Dict[str, Any]
- __init__()
Initialize the PipelineStep with a given configuration.
- needs()
- Return type
Set[str]
- Returns
a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.
- production()
- Return type
Set[str]
- Returns
a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.
Characters Unification
- class renard.pipeline.character_unification.Character(names, mentions, gender=Gender.UNKNOWN)
- Parameters
names (FrozenSet[str]) –
mentions (List[Mention]) –
gender (Gender) –
- __delattr__(name)
Implement delattr(self, name).
- __eq__(other)
Return self==value.
- __hash__()
Return hash(self).
- Return type
int
- __init__(names, mentions, gender=Gender.UNKNOWN)
- Parameters
names (FrozenSet[str]) –
mentions (List[Mention]) –
gender (Gender) –
- __repr__()
Return repr(self).
- Return type
str
- __setattr__(name, value)
Implement setattr(self, name, value).
NaiveCharacterUnifier
- class renard.pipeline.character_unification.NaiveCharacterUnifier(min_appearances=0)
A basic character unifier using NER
- Parameters
min_appearances (int) –
- __call__(text, entities, corefs=None, **kwargs)
- __init__(min_appearances=0)
- Parameters
min_appearances (int) – minimum number of appearances of a character for it to be valid
- _pipeline_init_(lang, character_ner_tag, **kwargs)
Set the step configuration that is common to the whole pipeline.
- Parameters
lang (str) – the lang of the whole pipeline
progress_reporter –
kwargs – additional pipeline parameters.
character_ner_tag (str) –
- Returns
a step can return a dictionary of pipeline params if it wishes to modify some of these.
- needs()
- Return type
Set[str]
- Returns
a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.
- optional_needs()
- Return type
Set[str]
- Returns
a set of state attributes optionally needed by this PipelineStep. This method can be overridden by derived classes.
- production()
- Return type
Set[str]
- Returns
a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.
- supported_langs()
- Return type
Union[Set[str], Literal['any']]
- Returns
a set of supported languages, as ISO 639-3 codes, or the string 'any'
GraphRulesCharacterUnifier
- class renard.pipeline.character_unification.GraphRulesCharacterUnifier(min_appearances=0, additional_hypocorisms=None, link_corefs_mentions=False, ignore_lone_titles=None)
Unify characters by creating a graph where mentions are linked when they refer to the same character, and then merging this graph nodes.
Note
This algorithm is inspired by Vala et al., 2015.
- Parameters
min_appearances (int) –
additional_hypocorisms (Optional[List[Tuple[str, List[str]]]]) –
link_corefs_mentions (bool) –
ignore_lone_titles (Optional[Set[str]]) –
- __call__(entities, corefs=None, **kwargs)
Call self as a function.
- __init__(min_appearances=0, additional_hypocorisms=None, link_corefs_mentions=False, ignore_lone_titles=None)
- Parameters
min_appearances (int) – minimum number of appearances of a character for it to be considered valid.
additional_hypocorisms (Optional[List[Tuple[str, List[str]]]]) – a list of additional hypocorisms. Each hypocorism is a tuple where the first element is a name, and the second element is a list of nicknames associated with it
link_corefs_mentions (bool) – if True, will also use coreference resolution to link names between them. This is disabled by default since a coreference model can extract a lot of spurious links. However, linking by coref is sometimes the only way to resolve a character alias.
ignore_lone_titles (Optional[Set[str]]) – a set of titles to ignore when they stand on their own. This avoids extracting false positive characters such as 'Mr.' or 'Miss'.
- _pipeline_init_(lang, character_ner_tag, **kwargs)
Set the step configuration that is common to the whole pipeline.
- Parameters
lang (str) – the lang of the whole pipeline
progress_reporter –
kwargs – additional pipeline parameters.
character_ner_tag (str) –
- Returns
a step can return a dictionary of pipeline params if it wishes to modify some of these.
- infer_name_gender(name, corefs, hname_constants)
Try to infer a name’s gender
- Parameters
name (str) –
corefs (Optional[List[List[Mention]]]) –
hname_constants (Constants) – HumanName constants
- Return type
Gender
Check if two names are related after removing their titles
- Parameters
name1 (str) –
name2 (str) –
hname_constants (Constants) –
- Return type
bool
- needs()
- Return type
Set[str]
- Returns
a set of state attributes needed by this PipelineStep. This method must be overridden by derived classes.
- optional_needs()
- Return type
Set[str]
- Returns
a set of state attributes optionally needed by this PipelineStep. This method can be overridden by derived classes.
- production()
- Return type
Set[str]
- Returns
a set of state attributes produced by this PipelineStep. This method must be overridden by derived classes.
- supported_langs()
- Return type
Union[Set[str], Literal['any']]
- Returns
a set of supported languages, as ISO 639-3 codes, or the string 'any'
Speaker Attribution
- class renard.pipeline.speaker_attribution.BertSpeakerDetector(model=None, batch_size=4, device='auto', tokenizer=None)
Detect quote speaker in text
- Parameters
model (Union[PreTrainedModel, str, None]) –
batch_size (int) –
device (Literal['cpu', 'cuda', 'auto']) –
tokenizer (Optional[PreTrainedTokenizerFast]) –
- __call__(tokens, quotes, characters, **kwargs)
Call self as a function.
- Parameters
tokens (List[str]) –
quotes (List[Quote]) –
characters (List[Character]) –
- Return type
Dict[str, Any]
- __init__(model=None, batch_size=4, device='auto', tokenizer=None)
Initialize the PipelineStep with a given configuration.
- Parameters
model (Union[PreTrainedModel, str, None]) –
batch_size (int) –
device (Literal['cpu', 'cuda', 'auto']) –
tokenizer (Optional[PreTrainedTokenizerFast]) –
- _pipeline_init_(lang, **kwargs)
Set the step configuration that is common to the whole pipeline.
- Parameters
lang (str) – the lang of the whole pipeline
progress_reporter –
kwargs – additional pipeline parameters.
- Returns
a step can return a dictionary of pipeline params if it wishes to modify some of these.
- needs()
quotes, tokens, characters
- Return type
Set[str]
- production()
speaker
- Return type
Set[str]
Graph Extraction
CoOccurrencesGraphExtractor
- class renard.pipeline.graph_extraction.CoOccurrencesGraphExtractor(co_occurrences_dist=None, dynamic=False, dynamic_window=None, dynamic_overlap=0, additional_ner_classes=None)
A simple character graph extractor using co-occurrences
- Parameters
co_occurrences_dist (Union[int, Tuple[int, Literal['tokens', 'sentences']], None]) –
dynamic (bool) –
dynamic_window (Optional[int]) –
dynamic_overlap (int) –
additional_ner_classes (Optional[List[str]]) –
- __call__(characters, sentences, char2token=None, dynamic_blocks=None, sentences_polarities=None, entities=None, co_occurrences_blocks=None, **kwargs)
Extract a co-occurrence character network.
- Parameters
co_occurrences_blocks (Optional[Tuple[List[Tuple[int, int]], Literal['characters', 'tokens']]]) – custom blocks where co-occurrences should be recorded. For example, this can be used to perform chapter level co-occurrences.
characters (Set[Character]) –
sentences (List[List[str]]) –
char2token (Optional[List[int]]) –
dynamic_blocks (Optional[Tuple[List[Tuple[int, int]], Literal['characters', 'tokens']]]) –
sentences_polarities (Optional[List[float]]) –
entities (Optional[List[NEREntity]]) –
- Return type
Dict[str, Any]
- Returns
a dict with key 'character_network' and a nx.Graph or a list of nx.Graph as value.
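The idea of recording co-occurrences inside custom blocks (for example one block per chapter) can be sketched with the standard library alone. count_block_co_occurrences is a hypothetical helper. Mentions are reduced to (name, token_index) pairs, blocks are assumed to be token boundaries with an exclusive end, and the result is a plain dictionary of pair counts rather than a networkx graph.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def count_block_co_occurrences(
    mentions: List[Tuple[str, int]], blocks: List[Tuple[int, int]]
) -> Dict[Tuple[str, str], int]:
    """Count, per pair of names, the blocks in which both names are mentioned.

    mentions are (name, token_index) pairs; blocks are (start, end) token
    boundaries with an exclusive end (an assumption of this sketch).
    """
    counts: Counter = Counter()
    for start, end in blocks:
        # Every pair of distinct names present in the block co-occurs once.
        present = sorted({name for name, idx in mentions if start <= idx < end})
        counts.update(combinations(present, 2))
    return dict(counts)


mentions = [
    ("Darcy", 2),
    ("Elizabeth", 5),
    ("Darcy", 12),
    ("Bingley", 14),
    ("Elizabeth", 18),
]
blocks = [(0, 10), (10, 20)]  # two hypothetical chapters, in tokens
weights = count_block_co_occurrences(mentions, blocks)
```

The resulting counts can serve as edge weights: Darcy and Elizabeth appear together in both blocks, so their pair gets a weight of 2.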
- __init__(co_occurrences_dist=None, dynamic=False, dynamic_window=None, dynamic_overlap=0, additional_ner_classes=None)
- Parameters
co_occurrences_dist (
Union
[int
,Tuple
[int
,Literal
[‘tokens’, ‘sentences’]],None
]) – maximum accepted distance between two character appearances to form a co-occurrence interaction.
if an
int
is given, the distance is in number of tokens
if a
tuple
is given, the first element of the tuple is a distance while the second is a unit. Examples: (1, "sentences")
, (3, "tokens")
.
dynamic (
bool
) –if
False
(the default), a static nx.Graph
is extracted
if
True
, several nx.Graph
are extracted. In that case, dynamic_window
and dynamic_overlap can be specified. If dynamic_window
is not specified, this step is expecting the text to be cut into ‘dynamic blocks’, and a graph will be extracted for each block. In that case,dynamic_blocks
must be passed to the pipeline as aList[str]
at runtime.
dynamic_window (
Optional
[int
]) – dynamic window, in number of interactions. a dynamic window of n means that each returned graph will be formed by n interactions.dynamic_overlap (
int
) – overlap, in number of interactions.additional_ner_classes (
Optional
[List
[str
]]) – if specified, will include entities other than characters in the final graph. No attempt will be made at unifying the entities (for example, “New York” will be distinct from “New York City”).
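The token-distance rule described for co_occurrences_dist can be illustrated with a small standalone sketch. This is not the library's implementation: the helper name count_co_occurrences, the (name, token index) mention format and the inclusive distance comparison are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple


def count_co_occurrences(mentions: List[Tuple[str, int]], max_dist: int) -> Counter:
    """Count pairs of distinct characters mentioned within max_dist tokens.

    mentions holds (character name, token index) pairs (hypothetical format).
    """
    counts: Counter = Counter()
    for (name1, pos1), (name2, pos2) in combinations(mentions, 2):
        # assumption: the distance bound is inclusive
        if name1 != name2 and abs(pos1 - pos2) <= max_dist:
            # sort the pair so that (A, B) and (B, A) share one key
            counts[tuple(sorted((name1, name2)))] += 1
    return counts


mentions = [("Alice", 0), ("Bob", 3), ("Carol", 20), ("Alice", 22)]
print(count_co_occurrences(mentions, max_dist=5))
```

With a distance of 5 tokens, Alice and Bob are linked through the mentions at positions 0 and 3, and Alice and Carol through positions 20 and 22.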
- _create_co_occurrences_blocks(sentences, mentions)
Create co-occurrences blocks using
self.co_occurrences_dist
. All entities within a block are considered as co-occurring.- Parameters
sentences (
List
[List
[str
]]) –mentions (
List
[Tuple
[Any
,NEREntity
]]) –
- Return type
Tuple
[List
[Tuple
[int
,int
]],Literal
[‘characters’, ‘tokens’]]
- _extract_dynamic_graph(mentions, window, overlap, dynamic_blocks, sentences, sentences_polarities, co_occurrences_blocks)
Note
only one of
window
ordynamic_blocks_tokens
should be specified- Parameters
mentions (
List
[Tuple
[Any
,NEREntity
]]) – A list of entity mentions, ordered by appearance, each of the form (KEY, MENTION). KEY determines the uniqueness of the entity.
window (
Optional
[int
]) – dynamic window, in tokens.overlap (
int
) – window overlapdynamic_blocks (
Optional
[Tuple
[List
[Tuple
[int
,int
]],Literal
[‘characters’, ‘tokens’]]]) – boundaries of each dynamic blockco_occurrences_blocks (
Optional
[Tuple
[List
[Tuple
[int
,int
]],Literal
[‘characters’, ‘tokens’]]]) – boundaries of each co-occurrences block
sentences (
List
[List
[str
]]) –sentences_polarities (
Optional
[List
[float
]]) –
- Return type
List
[Graph
]
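How a window and an overlap can cut a sequence into successive slices is sketched below. It is only an illustration: the helper sliding_windows is hypothetical, the step between windows is assumed to be window - overlap, and the last slice is allowed to be shorter than the window. The library may handle these details differently.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(items: Sequence[T], window: int, overlap: int) -> List[List[T]]:
    """Cut items into slices of size window sharing overlap items with the previous slice."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("expected window > 0 and 0 <= overlap < window")
    # assumption: consecutive windows start (window - overlap) items apart
    step = window - overlap
    return [list(items[i:i + window]) for i in range(0, len(items), step)]


# each slice would give one graph of the dynamic network
print(sliding_windows(list(range(7)), window=3, overlap=1))
```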
- _extract_graph(mentions, sentences, sentences_polarities, co_occurrences_blocks)
- Parameters
mentions (
List
[Tuple
[Any
,NEREntity
]]) – A list of entity mentions, ordered by appearance, each of the form (KEY, MENTION). KEY determines the uniqueness of the entity.
sentences (
List
[List
[str
]]) – if specified,sentences_polarities
must be specified as well.sentences_polarities (
Optional
[List
[float
]]) – if specified,sentences
must be specified as well. In that case, edges are annotated with the 'polarity'
attribute, indicating the polarity of the relationship between two characters. Polarity between two interactions is computed as the strongest sentence polarity between those two mentions.co_occurrences_blocks (
Optional
[Tuple
[List
[Tuple
[int
,int
]],Literal
[‘characters’, ‘tokens’]]]) – only unit ‘tokens’ is accepted.
- Return type
Graph
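One possible reading of "strongest sentence polarity between those two mentions" is sketched here. The helper strongest_polarity is hypothetical, and two points are assumptions: "strongest" is taken to mean the largest absolute value, and the sentences of both mentions are included in the range.

```python
from typing import List


def strongest_polarity(sentences_polarities: List[float], sent1: int, sent2: int) -> float:
    """Return the polarity with the largest magnitude between two sentence indices."""
    start, end = sorted((sent1, sent2))
    # assumption: both boundary sentences are part of the range
    return max(sentences_polarities[start:end + 1], key=abs)


polarities = [0.1, -0.8, 0.3, 0.5]
print(strongest_polarity(polarities, 0, 2))  # -0.8
```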
- needs()
- Return type
Set
[str
]- Returns
a set of state attributes needed by this
PipelineStep
. This method must be overridden by derived classes.
- optional_needs()
- Return type
Set
[str
]- Returns
a set of state attributes optionally needed by this
PipelineStep
. This method can be overridden by derived classes.
- production()
- Return type
Set
[str
]- Returns
a set of state attributes produced by this
PipelineStep
. This method must be overridden by derived classes.
- supported_langs()
- Return type
Union
[Set
[str
],Literal
[‘any’]]- Returns
a list of supported languages, as ISO 639-3 codes, or the string
'any'
ConversationalGraphExtractor
- class renard.pipeline.graph_extraction.ConversationalGraphExtractor(graph_type, conversation_dist=None, ignore_self_mention=True)
A graph extractor using conversation between characters or mentions.
Note
Does not support dynamic networks yet.
- Parameters
graph_type (
Literal
[‘conversation’, ‘mention’]) –conversation_dist (
Union
[int
,Tuple
[int
,Literal
[‘tokens’, ‘sentences’]],None
]) –ignore_self_mention (
bool
) –
- __call__(sentences, quotes, speakers, characters, **kwargs)
Call self as a function.
- __init__(graph_type, conversation_dist=None, ignore_self_mention=True)
- Parameters
graph_type (
Literal
[‘conversation’, ‘mention’]) – either ‘conversation’ or ‘mention’. ‘conversation’ extracts an undirected graph with interactions being extracted from the conversations occurring between characters. ‘mention’ extracts a directed graph where interactions are character mentions of one another in quoted speech.conversation_dist (
Union
[int
,Tuple
[int
,Literal
[‘tokens’, ‘sentences’]],None
]) – must be supplied if graph_type is ‘conversation’. The distance between two quotations for them to be considered as interacting.
ignore_self_mention (
bool
) – ifTrue
, self mentions are ignored for graph_type=='mention'
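The 'mention' graph type can be pictured as directed edges going from a speaker to each character named in their quote. The sketch below is a simplified illustration, not the library's code: the helper mention_edges, the (speaker, quote tokens) input format and the single-token character names are assumptions.

```python
from collections import Counter
from typing import List, Set, Tuple


def mention_edges(
    quotes: List[Tuple[str, List[str]]],
    characters: Set[str],
    ignore_self_mention: bool = True,
) -> Counter:
    """Count directed (speaker, mentioned character) pairs found in quoted speech."""
    edges: Counter = Counter()
    for speaker, tokens in quotes:
        for token in tokens:
            if token not in characters:
                continue
            if ignore_self_mention and token == speaker:
                continue
            edges[(speaker, token)] += 1
    return edges


quotes = [("Alice", ["Bob", "is", "late"]), ("Bob", ["Bob", "blames", "Alice"])]
print(mention_edges(quotes, {"Alice", "Bob"}))
```

Passing ignore_self_mention=False keeps the edge from Bob to himself in this example.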
- needs()
sentences, quotes, speakers, characters
- Return type
Set
[str
]
- production()
character_network
- Return type
Set
[str
]
Stanford CoreNLP Pipeline
- class renard.pipeline.stanford_corenlp.StanfordCoreNLPPipeline(annotate_corefs=False, corefs_algorithm='statistical', corenlp_custom_properties=None, server_timeout=9999999, **server_kwargs)
A full NLP pipeline using Stanford CoreNLP
Note
The Stanford CoreNLP pipeline requires the
stanza
library. You can install it with poetry usingpoetry install -E stanza
.Warning
RAM usage might be high for coreference resolution, as it uses the entire novel! If CoreNLP terminates with an out of memory error, you can try allocating more memory for the server by using
server_kwargs
(example :{"memory": "8G"}
).- Parameters
annotate_corefs (
bool
) –corefs_algorithm (
Literal
[‘deterministic’, ‘statistical’, ‘neural’]) –corenlp_custom_properties (
Optional
[Dict
[str
,Any
]]) –server_timeout (
int
) –
- __call__(text, **kwargs)
Call self as a function.
- Parameters
text (
str
) –- Return type
Dict
[str
,Any
]
- __init__(annotate_corefs=False, corefs_algorithm='statistical', corenlp_custom_properties=None, server_timeout=9999999, **server_kwargs)
- Parameters
annotate_corefs (
bool
) –True
if coreferences must be annotated,False
otherwise. This parameter is not yet implemented.corefs_algorithm (
Literal
[‘deterministic’, ‘statistical’, ‘neural’]) – one of{"deterministic", "statistical", "neural"}
corenlp_custom_properties (
Optional
[Dict
[str
,Any
]]) – custom properties dictionary to pass to the CoreNLP server. Note that some properties are already set when calling the server, so not all properties are supported : it is intended as a last resort escape hatch. In particular, do not set'ner.applyFineGrained'
. If you need to set the coreference algorithm used, seecorefs_algorithm
.server_timeout (
int
) – CoreNLP server timeout in msserver_kwargs – extra args for stanford CoreNLP server. be_quiet and max_char_length are not supported. See here for a list of possible args : https://stanfordnlp.github.io/stanza/client_properties.html#corenlp-server-start-options-server
- needs()
- Return type
Set
[str
]- Returns
a set of state attributes needed by this
PipelineStep
. This method must be overridden by derived classes.
- production()
- Return type
Set
[str
]- Returns
a set of state attributes produced by this
PipelineStep
. This method must be overridden by derived classes.
- renard.pipeline.stanford_corenlp.corenlp_annotations_bio_tags(annotations)
Returns an array of BIO tags extracted from Stanford CoreNLP annotations
Note
only PERSON, LOCATION, ORGANIZATION and MISC entities are considered. Other types of entities are discarded (see https://stanfordnlp.github.io/CoreNLP/ner.html#description for a list of the usual CoreNLP types).
Note
Weirdly, CoreNLP will annotate pronouns as entities. Only tokens having an NNP POS tag are kept by this function.
- Parameters
annotations (
Document
) – stanford coreNLP text annotations- Return type
List
[str
]- Returns
an array of bio tags.
Resources
Hypocorism
- class renard.resources.hypocorisms.HypocorismGazetteer(lang='eng')
A hypocorism (nickname) gazetteer
Note
The data used for this gazetteer comes from https://github.com/carltonnorthern/nickname-and-diminutive-names-lookup and is licensed under the Apache 2.0 License
- Parameters
lang (
str
) –
- __init__(lang='eng')
- Parameters
lang (
str
) – gazetteer language. Must be inHypocorismGazetteer.supported_langs
.
- _add_hypocorism_(name, nicknames)
Add a name associated with several nicknames
- Parameters
name (
str
) –nicknames (
List
[str
]) – nicknames to associate to the given name
Check if one name is a hypocorism of the other (or if both names are equal)
- Parameters
name1 (
str
) –name2 (
str
) –
- Return type
bool
- get_nicknames(name)
Return all possible nicknames for the given name
- Parameters
name (
str
) –- Return type
Set
[str
]
- get_possible_names(nickname)
Return all names that can correspond to the given nickname
- Parameters
nickname (
str
) –- Return type
Set
[str
]
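The lookups described above can be mimicked with two dictionaries, one per direction. The class below is a toy stand-in written for illustration only: ToyGazetteer, its add and related methods, and the lowercasing of names are assumptions, and it does not load the nickname dataset used by HypocorismGazetteer.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ToyGazetteer:
    """A tiny in-memory nickname lookup (illustrative, not the library class)."""

    def __init__(self) -> None:
        self.name_to_nicknames: Dict[str, Set[str]] = defaultdict(set)
        self.nickname_to_names: Dict[str, Set[str]] = defaultdict(set)

    def add(self, name: str, nicknames: List[str]) -> None:
        # assumption: names are compared case-insensitively
        for nickname in nicknames:
            self.name_to_nicknames[name.lower()].add(nickname.lower())
            self.nickname_to_names[nickname.lower()].add(name.lower())

    def get_nicknames(self, name: str) -> Set[str]:
        return set(self.name_to_nicknames.get(name.lower(), set()))

    def get_possible_names(self, nickname: str) -> Set[str]:
        return set(self.nickname_to_names.get(nickname.lower(), set()))

    def related(self, name1: str, name2: str) -> bool:
        """True if the names are equal or one is a nickname of the other."""
        n1, n2 = name1.lower(), name2.lower()
        return n1 == n2 or n2 in self.get_nicknames(n1) or n1 in self.get_nicknames(n2)


gazetteer = ToyGazetteer()
gazetteer.add("Elizabeth", ["Liz", "Beth"])
print(gazetteer.get_nicknames("Elizabeth"), gazetteer.related("Liz", "Elizabeth"))
```

In this sketch two nicknames of the same name ("Liz" and "Beth") are not considered related to each other; whether the real gazetteer treats them as related is not stated in this reference.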
Utils
- renard.utils.BlockBounds
A BlockBounds delimits blocks in either raw text (“characters”) or tokenized text (“tokens”). It has the following form:
([(block start, block end), …], unit)
see
block_bounds()
to easily create BlockBounds
alias of
Tuple
[List
[Tuple
[int
,int
]],Literal
[‘characters’, ‘tokens’]]
- renard.utils.batch_index_select(input, dim, index)
Batched version of
torch.index_select()
. Inspired by https://discuss.pytorch.org/t/batched-index-select/9115/8- Parameters
input (
Tensor
) – a torch tensor of shape(B, *)
where*
is any number of additional dimensions.dim (
int
) – the dimension in which to indexindex (
Tensor
) – index tensor of shape(B, I)
- Return type
Tensor
- Returns
a tensor which indexes
input
along dimensiondim
usingindex
. This tensor has the same shape asinput
, except in dimensiondim
, where it has dimensionI
.
- renard.utils.block_bounds(blocks)
Return the boundaries of a series of blocks.
- Parameters
blocks (
Union
[List
[str
],List
[List
[str
]]]) – either a list of raw texts or a list of tokenized texts.- Return type
Tuple
[List
[Tuple
[int
,int
]],Literal
[‘characters’, ‘tokens’]]- Returns
A BlockBounds with the correct unit.
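To make the BlockBounds shape concrete, here is a minimal sketch of how bounds could be derived from a list of blocks. It is not the library function: sketch_block_bounds is a hypothetical name, and the sketch assumes that blocks are contiguous (no separator between them) and that each end index is exclusive.

```python
from typing import List, Literal, Tuple, Union

BlockBounds = Tuple[List[Tuple[int, int]], Literal["characters", "tokens"]]


def sketch_block_bounds(blocks: Union[List[str], List[List[str]]]) -> BlockBounds:
    """Compute (start, end) offsets of each block, in characters or in tokens."""
    # raw texts are measured in characters, tokenized texts in tokens
    unit = "characters" if blocks and isinstance(blocks[0], str) else "tokens"
    bounds = []
    start = 0
    for block in blocks:
        end = start + len(block)  # assumption: end is exclusive
        bounds.append((start, end))
        start = end
    return (bounds, unit)


print(sketch_block_bounds(["abc", "de"]))
print(sketch_block_bounds([["a", "b"], ["c"]]))
```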
- renard.utils.charbb2tokenbb(char_bb, char2token)
Convert a BlockBounds in characters to a BlockBounds in tokens.
- Parameters
char_bb (
Tuple
[List
[Tuple
[int
,int
]],Literal
[‘characters’, ‘tokens’]]) – block bounds, in ‘characters’.char2token (
List
[int
]) – a list withchar2token[i]
being the index of token corresponding to characteri
.
- Return type
Tuple
[List
[Tuple
[int
,int
]],Literal
[‘characters’, ‘tokens’]]- Returns
a BlockBounds, in ‘tokens’.
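The conversion can be sketched with the char2token mapping described above. The helper sketch_charbb2tokenbb is hypothetical, and it assumes exclusive end indices and non-empty blocks; the library's handling of boundaries may differ.

```python
from typing import List, Tuple


def sketch_charbb2tokenbb(
    char_bb: Tuple[List[Tuple[int, int]], str], char2token: List[int]
) -> Tuple[List[Tuple[int, int]], str]:
    """Map character-level block bounds to token-level block bounds."""
    bounds, unit = char_bb
    if unit != "characters":
        raise ValueError("expected block bounds in 'characters'")
    token_bounds = [
        # assumption: ends are exclusive, so the last character of a block is at end - 1
        (char2token[start], char2token[end - 1] + 1)
        for start, end in bounds
    ]
    return (token_bounds, "tokens")


# "Hi you": characters 0-2 map to token 0 (the space included), characters 3-5 to token 1
char2token = [0, 0, 0, 1, 1, 1]
print(sketch_charbb2tokenbb(([(0, 3), (3, 6)], "characters"), char2token))
```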
- renard.utils.search_pattern(seq, pattern)
Search a pattern in sequence
- Parameters
seq (
Iterable
[TypeVar
(R
)]) – sequence in which to searchpattern (
List
[TypeVar
(R
)]) – searched pattern
- Return type
List
[int
]- Returns
a list of the start indices of the pattern's occurrences
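A naive version of this search is sketched below for illustration. sketch_search_pattern is a hypothetical name; the sketch reports overlapping matches and returns an empty list for an empty pattern, which are assumptions about behaviour that this reference does not specify.

```python
from typing import Iterable, List, TypeVar

R = TypeVar("R")


def sketch_search_pattern(seq: Iterable[R], pattern: List[R]) -> List[int]:
    """Return the start index of every occurrence of pattern in seq."""
    items = list(seq)
    n = len(pattern)
    if n == 0:
        return []
    # compare each window of length n with the pattern
    return [i for i in range(len(items) - n + 1) if items[i:i + n] == list(pattern)]


print(sketch_search_pattern(["a", "b", "a", "b", "a"], ["a", "b"]))  # [0, 2]
```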
- renard.utils.spans(seq, max_len)
Cut the input sequence into all possible spans up to a maximum length
Note
spans are ordered from the smallest to the biggest, from the beginning of seq to the end of seq.
- Parameters
seq (
Collection
[TypeVar
(T
)]) –max_len (
int
) –
- Return type
List
[Tuple
[TypeVar
(T
)]]- Returns
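The ordering described in the note (shortest spans first, each length scanned from the beginning of seq to its end) can be reproduced with two nested loops. This is a sketch under that reading, with a hypothetical name, and not the library's code.

```python
from typing import Collection, List, Tuple, TypeVar

T = TypeVar("T")


def sketch_spans(seq: Collection[T], max_len: int) -> List[Tuple[T, ...]]:
    """List every contiguous span of seq of length 1 to max_len."""
    items = list(seq)
    result = []
    for length in range(1, min(max_len, len(items)) + 1):
        for start in range(len(items) - length + 1):
            result.append(tuple(items[start:start + length]))
    return result


print(sketch_spans("abc", 2))
# [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c')]
```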
Graph utils
- renard.graph_utils.cumulative_graph(graphs)
Turn a dynamic graph into a cumulative graph, weight-wise
- Parameters
graphs (
List
[Graph
]) – A list of sequential graphs- Return type
List
[Graph
]
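The idea of accumulating weights over a sequence of graphs is sketched below. To stay dependency-free, plain dictionaries mapping an edge to its weight stand in for nx.Graph objects, and cumulative_weights is a hypothetical helper; the actual function works on networkx graphs.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def cumulative_weights(graphs: List[Dict[Edge, float]]) -> List[Dict[Edge, float]]:
    """For each time step, sum the weight of every edge seen so far."""
    cumulative = []
    running: Dict[Edge, float] = {}
    for graph in graphs:
        for edge, weight in graph.items():
            running[edge] = running.get(edge, 0) + weight
        # copy, so that later steps do not modify earlier results
        cumulative.append(dict(running))
    return cumulative


graphs = [{("A", "B"): 1}, {("A", "B"): 2, ("B", "C"): 1}]
print(cumulative_weights(graphs))
# [{('A', 'B'): 1}, {('A', 'B'): 3, ('B', 'C'): 1}]
```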
- renard.graph_utils.dynamic_graph_to_gephi_graph(graphs)
Convert a dynamic graph to a Gephi-compatible dynamic graph. The resulting graph can be exported using
G.write_gexf()
and will be read correctly by Gephi.Note
Because of a limitation in networkx, the dynamic weight attribute is stored as
dweight
instead ofweight
.- Parameters
graphs (
List
[Graph
]) – a dynamic graph- Return type
Graph
- Returns
A dynamic Gephi-compatible graph
- renard.graph_utils.graph_edges_attributes(G)
Compute the set of all edge attributes of a graph
- Parameters
G (
Graph
) –- Return type
Set
[str
]
- renard.graph_utils.graph_with_names(G, name_style='most_frequent')
Relabel a characters graph, using a single name for each node
- Parameters
name_style (
Union
[Literal
[‘longest’, ‘shortest’, ‘most_frequent’],Callable
[[Character
],str
]]) – character name style in the resulting graph. Either a string ('longest', 'shortest' or 'most_frequent') or a custom function associating a character to its name
G (
Graph
) –
- Return type
Graph
- renard.graph_utils.layout_with_names(G, layout, name_style='most_frequent')
- Parameters
G (
Graph
) – a graph ofCharacter
layout (
Union
[Dict
[Character,Tuple
[float
,float
]],Dict
[Character,ndarray
]]) –name_style (
Union
[Literal
[‘longest’, ‘shortest’, ‘most_frequent’],Callable
[[Character
],str
]]) –
- Return type
dict
Plot utils
- renard.plot_utils.plot_nx_graph_reasonably(G, ax=None, layout=None, node_kwargs=None, edge_kwargs=None, label_kwargs=None)
Try to plot a
nx.Graph
with ‘reasonable’ parameters- Parameters
G (
Graph
) – the graph to drawax – matplotlib axes
layout (
Optional
[dict
]) – if given, this graph layout will be applied. Otherwise, uselayout_nx_graph_reasonably()
.node_kwargs (
Optional
[Dict
[str
,Any
]]) – passed tonx.draw_networkx_nodes()
edge_kwargs (
Optional
[Dict
[str
,Any
]]) – passed to nx.draw_networkx_edges()
label_kwargs (
Optional
[Dict
[str
,Any
]]) – passed tonx.draw_networkx_labels()
NER utils
- class renard.ner_utils.DataCollatorForTokenClassificationWithBatchEncoding(tokenizer, pad_to_multiple_of=None)
Same as
transformers.DataCollatorForTokenClassification
, except it correctly returns aBatchEncoding
object with correctencodings
attribute. Don’t know why this is not the default?
- Parameters
tokenizer (
PreTrainedTokenizerFast
) –pad_to_multiple_of (
Optional
[int
]) –
- __call__(features)
Call self as a function.
- Parameters
features (
List
[dict
]) –- Return type
Union
[dict
,BatchEncoding
]
- __init__(tokenizer, pad_to_multiple_of=None)
- Parameters
tokenizer (
PreTrainedTokenizerFast
) –pad_to_multiple_of (
Optional
[int
]) –
- class renard.ner_utils.NERDataset(elements, tokenizer, context_mask=None)
- Variables
_context_mask – for each element, a mask indicating which tokens are part of the context (1 for context, 0 for text on which to perform inference). The mask allows to discard predictions made for context at inference time, even though the context can still be passed as input to the model.
- Parameters
elements (
List
[List
[str
]]) –tokenizer (
PreTrainedTokenizerFast
) –context_mask (
Optional
[List
[List
[int
]]]) –
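The role of the context mask at inference time can be shown on a single element. The helper drop_context_predictions is hypothetical and only illustrates the convention stated above (1 for context, 0 for text on which to perform inference); it is not part of NERDataset.

```python
from typing import List


def drop_context_predictions(predictions: List[str], context_mask: List[int]) -> List[str]:
    """Keep only the predictions made on non-context tokens (mask value 0)."""
    if len(predictions) != len(context_mask):
        raise ValueError("predictions and context_mask must have the same length")
    return [pred for pred, mask in zip(predictions, context_mask) if mask == 0]


# the first and last tokens were only given to the model as context
print(drop_context_predictions(["O", "B-PER", "I-PER", "O"], [1, 0, 0, 1]))
# ['B-PER', 'I-PER']
```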
- __init__(elements, tokenizer, context_mask=None)
- Parameters
elements (
List
[List
[str
]]) –tokenizer (
PreTrainedTokenizerFast
) –context_mask (
Optional
[List
[List
[int
]]]) –
- renard.ner_utils._tokenize_and_align_labels(examples, tokenizer, label_all_tokens=True)
- Parameters
examples – an object with keys ‘tokens’ and ‘labels’
tokenizer (
PreTrainedTokenizerFast
) –label_all_tokens (
bool
) –
- renard.ner_utils.hgdataset_from_conll2002(path, tag_conversion_map=None, separator='\\t', **kwargs)
Load a CoNLL-2002 file as a Huggingface Dataset.
- Parameters
path (
str
) – passed toload_conll2002_bio()
tag_conversion_map (
Optional
[Dict
[str
,str
]]) – passed toload_conll2002_bio()
separator (
str
) – passed toload_conll2002_bio()
kwargs – passed to
load_conll2002_bio()
- Return type
Dataset
- Returns
a
datasets.Dataset
with features ‘tokens’ and ‘labels’.
- renard.ner_utils.load_conll2002_bio(path, tag_conversion_map=None, separator='\\t', **kwargs)
Load a file in CoNLL-2002 BIO format. Sentences are expected to be separated by end of lines. Tags should be in the CoNLL-2002 format (such as ‘B-PER I-PER’). If this is not the case, see the
tag_conversion_map
argument.- Parameters
path (
str
) – path to the CoNLL-2002 formatted fileseparator (
str
) – separator between token and BIO tagstag_conversion_map (
Optional
[Dict
[str
,str
]]) – conversion map for tags found in the input file. Example :{'B': 'B-PER', 'I': 'I-PER'}
kwargs – additional kwargs for
open
(such asencoding
ornewline
).
- Return type
Tuple
[List
[List
[str
]],List
[str
],List
[NEREntity
]]- Returns
(sentences, tokens, entities)
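The file layout and the effect of tag_conversion_map can be illustrated with a small parser working on a string. This is a sketch and not load_conll2002_bio itself: parse_conll_bio is a hypothetical helper, it assumes one "token, separator, tag" pair per line with blank lines between sentences, and it returns (sentences, tags) rather than the (sentences, tokens, entities) tuple documented above.

```python
from typing import Dict, List, Optional, Tuple


def parse_conll_bio(
    text: str, separator: str = "\t", tag_conversion_map: Optional[Dict[str, str]] = None
) -> Tuple[List[List[str]], List[List[str]]]:
    """Parse 'token<separator>tag' lines into sentences and their tags."""
    conversion = tag_conversion_map or {}
    sentences: List[List[str]] = []
    tags: List[List[str]] = []
    cur_tokens: List[str] = []
    cur_tags: List[str] = []
    for line in text.splitlines() + [""]:  # the final "" flushes the last sentence
        line = line.strip()
        if not line:
            if cur_tokens:
                sentences.append(cur_tokens)
                tags.append(cur_tags)
                cur_tokens, cur_tags = [], []
            continue
        parts = line.split(separator)
        cur_tokens.append(parts[0])
        # convert non-standard tags (for example 'B') to CoNLL-2002 tags ('B-PER')
        cur_tags.append(conversion.get(parts[-1], parts[-1]))
    return sentences, tags


text = "John\tB\nSmith\tI\nsleeps\tO\n\nHe\tO\n"
print(parse_conll_bio(text, tag_conversion_map={"B": "B-PER", "I": "I-PER"}))
```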
- renard.ner_utils.ner_entities(tokens, bio_tags, resolve_inconsistencies=True)
Extract NER entities from a list of BIO tags
- Parameters
tokens (
List
[str
]) – a list of tokensbio_tags (
List
[str
]) – a list of BIO tags. In particular, BIO tags should be in the CoNLL-2002 form (such as ‘B-PER I-PER’)resolve_inconsistencies (
bool
) –
- Return type
List
[NEREntity
]- Returns
A list of NER entities, in order of appearance
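Decoding BIO tags into entity spans can be sketched as a single pass over the tags. The helper bio_to_spans is hypothetical and returns plain tuples instead of NEREntity objects. Two choices are assumptions: end indices are inclusive, and an I- tag that does not continue an entity of the same type starts a new one, which is only one possible way of resolving inconsistencies.

```python
from typing import List, Optional, Tuple

Span = Tuple[List[str], int, int, str]  # (tokens, start index, end index, tag)


def bio_to_spans(tokens: List[str], bio_tags: List[str]) -> List[Span]:
    """Group BIO-tagged tokens into entity spans, in order of appearance."""
    entities: List[Span] = []
    start: Optional[int] = None
    tag: Optional[str] = None
    # the trailing "O" is a sentinel that closes an entity ending the sequence
    for i, bio in enumerate(list(bio_tags) + ["O"]):
        inside = bio.startswith("I-") and tag == bio[2:]
        if start is not None and not inside:
            entities.append((tokens[start:i], start, i - 1, tag))
            start, tag = None, None
        if bio != "O" and start is None:
            start, tag = i, bio[2:]
    return entities


tokens = ["John", "Smith", "visited", "Paris"]
print(bio_to_spans(tokens, ["B-PER", "I-PER", "O", "B-LOC"]))
# [(['John', 'Smith'], 0, 1, 'PER'), (['Paris'], 3, 3, 'LOC')]
```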