The Pipeline

Renard’s central concept is the Pipeline. A Pipeline is a list of PipelineStep objects that are run sequentially to extract a character graph from a document. Here is a simple example:

from renard.pipeline import Pipeline
from renard.pipeline.tokenization import NLTKTokenizer
from renard.pipeline.ner import NLTKNamedEntityRecognizer
from renard.pipeline.character_unification import GraphRulesCharacterUnifier
from renard.pipeline.graph_extraction import CoOccurrencesGraphExtractor

with open("./my_doc.txt") as f:
    text = f.read()

pipeline = Pipeline(
    [
        NLTKTokenizer(),
        NLTKNamedEntityRecognizer(),
        GraphRulesCharacterUnifier(min_appearances=10),
        CoOccurrencesGraphExtractor(co_occurrences_dist=25)
    ]
)

out = pipeline(text)

Each step of a pipeline may require information from previous steps before running. It is therefore possible to create intractable pipelines, in which a step’s requirements are not satisfied. To make these issues easier to troubleshoot, a Pipeline checks its validity at run time and throws an exception with a helpful message if it is intractable.
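The kind of validity check described above can be pictured with a small standalone sketch. It does not use Renard and is not Renard's implementation: the StepSpec class, the check_steps function and the step names are hypothetical, and the only assumption is that each step declares what it needs and what it produces.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class StepSpec:
    """Hypothetical description of a step: what it needs and what it produces."""

    name: str
    needs: Set[str] = field(default_factory=set)
    produces: Set[str] = field(default_factory=set)


def check_steps(steps: List[StepSpec], available: Set[str]) -> List[str]:
    """Return one error message per step whose needs are not satisfied."""
    available = set(available)  # copy, so the caller's set is left untouched
    errors = []
    for i, step in enumerate(steps, start=1):
        missing = step.needs - available
        if missing:
            errors.append(f"step {i} ({step.name}) is missing: {sorted(missing)}")
        # whatever the step produces becomes available to later steps
        available |= step.produces
    return errors


steps = [
    StepSpec("ner", needs={"tokens"}, produces={"entities"}),
    StepSpec("unifier", needs={"tokens", "entities"}, produces={"characters"}),
]

# only the raw text is available: both steps lack 'tokens'
print(check_steps(steps, {"text"}))
# tokens supplied up front: no errors
print(check_steps(steps, {"text", "tokens"}))
```

Collecting every error before reporting, rather than stopping at the first one, is a design choice of this sketch: it lets a single message list all the unsatisfied steps at once.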

You can also specify the result of certain steps manually when calling the pipeline, if you already have those results or want to compute them yourself:

from renard.pipeline import Pipeline
from renard.pipeline.ner import NLTKNamedEntityRecognizer
from renard.pipeline.character_unification import GraphRulesCharacterUnifier
from renard.pipeline.graph_extraction import CoOccurrencesGraphExtractor

with open("./my_doc.txt") as f:
    text = f.read()

# note that this pipeline doesn't have any tokenizer
pipeline = Pipeline(
    [
        NLTKNamedEntityRecognizer(),
        GraphRulesCharacterUnifier(min_appearances=10),
        CoOccurrencesGraphExtractor(co_occurrences_dist=25)
    ]
)

# tokens are passed at call time; 'my_tokenization_function' stands for your
# own tokenizer and is not defined here
out = pipeline(text, tokens=my_tokenization_function(text))

In that case, the tokens requirement is fulfilled at run time. If you don’t pass the parameter, Renard will throw the following exception:

>>> ValueError: ["step 1 (NLTKNamedEntityRecognizer) has unsatisfied needs. needs: {'tokens'}. available: {'text'}). missing: {'tokens'}."]

For simplicity, one can use one of the preconfigured pipelines:

from renard.pipeline.preconfigured import bert_pipeline

with open("./my_doc.txt") as f:
    text = f.read()

pipeline = bert_pipeline(
    graph_extractor_kwargs={"co_occurrences_dist": (1, "sentences")}
)
out = pipeline(text)
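The co_occurrences_dist setting used in the examples above can be pictured as a distance threshold between character mentions. The following is a minimal standalone sketch of that idea, not Renard's CoOccurrencesGraphExtractor: the co_occurrences function is hypothetical, mentions are assumed to be (character name, token index) pairs, and treating the distance as inclusive is an assumption.

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple

Mention = Tuple[str, int]  # (character name, token index)


def co_occurrences(mentions: List[Mention], max_dist: int) -> Counter:
    """Count pairs of distinct characters mentioned within max_dist tokens."""
    edges: Counter = Counter()
    for (name_a, pos_a), (name_b, pos_b) in combinations(mentions, 2):
        if name_a != name_b and abs(pos_a - pos_b) <= max_dist:
            # sort the pair so that (a, b) and (b, a) are the same edge
            edges[tuple(sorted((name_a, name_b)))] += 1
    return edges


mentions = [("Alice", 3), ("Bob", 10), ("Alice", 40), ("Carol", 50), ("Bob", 120)]
edges = co_occurrences(mentions, max_dist=25)
print(edges)
```

With these toy mentions, only Alice/Bob (7 tokens apart) and Alice/Carol (10 tokens apart) fall within the threshold, so each of those edges gets a weight of 1.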

Pipeline Output: the Pipeline State

The PipelineState represents a state that is propagated and annotated during the execution of a Pipeline. It is the final value returned when running a pipeline with Pipeline.__call__(). As such, one can use it to do different things. For example, one can access the extracted character network as a networkx graph:

>>> out.character_network
<networkx.classes.graph.Graph object at 0x7fd9e9115900>

One can also access the output of each PipelineStep.

A few matplotlib-based plot functions are provided for convenience (PipelineState.plot_graph(), PipelineState.plot_graph_to_file()):

>>> import matplotlib.pyplot as plt
>>> out.plot_graph()
>>> plt.show()

These functions should be seen as exploration and debugging tools rather than fully-fledged visualisation platforms. If you want a fully-featured visualisation tool, you can export your graph to Gephi’s gexf format:

>>> out.export_graph_to_gexf("./graph.gexf")

Available Steps: An Overview

Below is an overview of the different steps that can make up a pipeline. Note that StanfordCoreNLPPipeline is a special case: it groups several steps into one.

Preprocessing

CustomSubstitutionPreprocessor makes regex-based substitutions in the text.
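As an illustration of what regex-based substitution preprocessing can look like, here is a minimal sketch built on Python's standard re module. It does not reproduce the constructor or behaviour of CustomSubstitutionPreprocessor: the apply_substitutions function and the rule format are assumptions made for the example.

```python
import re
from typing import List, Tuple


def apply_substitutions(text: str, rules: List[Tuple[str, str]]) -> str:
    """Apply each (pattern, replacement) rule to the text, in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


rules = [
    (r"[ \t]+", " "),          # collapse runs of spaces and tabs
    (r"M\.\s+", "Monsieur "),  # expand an abbreviation
]

cleaned = apply_substitutions("M.  Dupont   arrived.", rules)
print(cleaned)  # Monsieur Dupont arrived.
```

Rules are applied in order, so a later rule sees the output of the earlier ones.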

Tokenization

Tokenization is the task of cutting text into tokens. It is usually the first task applied to a text. Two tokenizers are available:

Named Entity Recognition

Named entity recognition (NER) detects occurrences of entities in the text. Three modules are available:

Coreference Resolution

Quote Detection

Sentiment Analysis

Characters Extraction

Characters extraction (or alias resolution) extracts characters from the occurrences detected by NER. This is done by assigning each mention to a unique character.
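To make the idea of assigning each mention to a unique character concrete, here is a deliberately naive sketch. It is far simpler than a real unifier and is not how Renard's steps work: the unify_mentions function and its word-containment rule are assumptions chosen for illustration only.

```python
from typing import Dict, List


def unify_mentions(mentions: List[str]) -> Dict[str, List[str]]:
    """Map each canonical name to the mentions assigned to it.

    Naive rule (an assumption for illustration): a mention belongs to the
    first known character whose name contains all of the mention's words.
    """
    characters: Dict[str, List[str]] = {}
    # process longer mentions first so that full names become canonical names
    for mention in sorted(set(mentions), key=lambda m: (-len(m.split()), m)):
        words = set(mention.split())
        for name in characters:
            if words <= set(name.split()):
                characters[name].append(mention)
                break
        else:
            # no existing character matches: the mention starts a new one
            characters[mention] = [mention]
    return characters


mentions = ["Elizabeth Bennet", "Elizabeth", "Darcy", "Fitzwilliam Darcy", "Elizabeth"]
characters = unify_mentions(mentions)
print(characters)
```

Such a rule breaks down quickly on real novels (two characters sharing a family name, for instance), which is why dedicated unification steps rely on richer rules.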

Speaker Attribution

Graph Extraction

Dynamic Graphs

Renard can also extract dynamic graphs: graphs that evolve through time. In Renard, such graphs are represented by a list of networkx.Graph objects.

from renard.pipeline import Pipeline
from renard.pipeline.tokenization import NLTKTokenizer
from renard.pipeline.ner import NLTKNamedEntityRecognizer
from renard.pipeline.character_unification import GraphRulesCharacterUnifier
from renard.pipeline.graph_extraction import CoOccurrencesGraphExtractor

with open("./my_doc.txt") as f:
    text = f.read()

pipeline = Pipeline(
    [
        NLTKTokenizer(),
        NLTKNamedEntityRecognizer(),
        GraphRulesCharacterUnifier(min_appearances=10),
        CoOccurrencesGraphExtractor(
            co_occurrences_dist=25,
            dynamic=True,     # note the 'dynamic'
            dynamic_window=20 # and the 'dynamic_window' argument
        )
    ]
)

out = pipeline(text)

When executing the above block of code, the output attribute character_network will be a list of networkx graphs:

>>> out.character_network
[<networkx.classes.graph.Graph object at 0x7fd9e9115900>]

See CoOccurrencesGraphExtractor for more details on the usage of the dynamic and dynamic_window arguments.

Plot and export functions also work with dynamic graphs. PipelineState.plot_graph() lets you visualize the dynamic graph using a slider, and PipelineState.plot_graphs_to_dir() saves plots of the dynamic graph to a directory. Meanwhile, PipelineState.export_graph_to_gexf() exports the dynamic graph to the Gephi format.

Custom Segmentation

The dynamic_window parameter of CoOccurrencesGraphExtractor determines the segmentation of the dynamic networks, in number of interactions. In the example above, a new graph is created every 20 interactions.
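Segmenting by number of interactions can be sketched as follows. This is a standalone illustration rather than Renard's code: the windowed_graphs function is hypothetical, each graph is reduced to a Counter of weighted edges, and the windows are assumed to be consecutive and non-overlapping, with a shorter final window when the count does not divide evenly.

```python
from collections import Counter
from typing import List, Tuple

Interaction = Tuple[str, str]


def windowed_graphs(interactions: List[Interaction], window: int) -> List[Counter]:
    """Build one weighted edge list per consecutive window of interactions."""
    graphs = []
    for start in range(0, len(interactions), window):
        chunk = interactions[start:start + window]
        # sort each pair so that (a, b) and (b, a) count as the same edge
        graphs.append(Counter(tuple(sorted(pair)) for pair in chunk))
    return graphs


interactions = [("A", "B"), ("B", "A"), ("A", "C"), ("B", "C"), ("A", "B")]
graphs = windowed_graphs(interactions, window=2)
print(len(graphs))  # 3
print(graphs[0])    # Counter({('A', 'B'): 2})
```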

While one can rely on the arguments of the pipeline's graph extractor to determine the dynamic window, Renard also lets you specify a custom segmentation of a text with the dynamic_blocks argument. You can cut your text however you want and pass the resulting block boundaries through this argument when running the pipeline:

from renard.pipeline import Pipeline
from renard.pipeline.tokenization import NLTKTokenizer
from renard.pipeline.ner import NLTKNamedEntityRecognizer
from renard.pipeline.character_unification import GraphRulesCharacterUnifier
from renard.pipeline.graph_extraction import CoOccurrencesGraphExtractor
from renard.utils import block_bounds

with open("./my_doc.txt") as f:
    text = f.read()

# let's suppose the 'cut_into_chapters' function cuts the text into chapters.
chapters = cut_into_chapters(text)

pipeline = Pipeline(
    [
        NLTKTokenizer(),
        NLTKNamedEntityRecognizer(),
        GraphRulesCharacterUnifier(),
        CoOccurrencesGraphExtractor(co_occurrences_dist=25, dynamic=True)
    ]
)

# the 'block_bounds' function automatically extracts the boundaries of your
# blocks of text.
out = pipeline(text, dynamic_blocks=block_bounds(chapters))
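The cut_into_chapters function in the example above is left to the reader. One possible version is sketched below, assuming that each chapter starts with a line beginning with the word CHAPTER. The char_bounds helper is hypothetical as well: it only illustrates the kind of boundaries a segmentation yields, and the value returned by Renard's block_bounds may be structured differently.

```python
import re
from typing import List, Tuple


def cut_into_chapters(text: str) -> List[str]:
    """Split the text before each line that starts with 'CHAPTER'."""
    # the lookahead keeps each heading attached to its own chapter
    parts = re.split(r"(?m)^(?=CHAPTER\b)", text)
    return [part for part in parts if part]


def char_bounds(blocks: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of consecutive blocks."""
    bounds = []
    start = 0
    for block in blocks:
        end = start + len(block)
        bounds.append((start, end))
        start = end
    return bounds


text = "CHAPTER 1\nfoo\nCHAPTER 2\nbar\n"
chapters = cut_into_chapters(text)
print(chapters)               # ['CHAPTER 1\nfoo\n', 'CHAPTER 2\nbar\n']
print(char_bounds(chapters))  # [(0, 14), (14, 28)]
```

Because the split pattern consumes no characters, joining the chapters gives back the original text, so the offsets remain valid positions in it.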

Multilingual Support

Renard supports multiple languages. By default, a Pipeline is configured for English, but you can create a pipeline for any language as long as all of its steps support it. To configure a pipeline for another language, pass the ISO 639-3 code of the language you want:

from renard.pipeline import Pipeline
from renard.pipeline.tokenization import NLTKTokenizer
from renard.pipeline.ner import BertNamedEntityRecognizer
from renard.pipeline.character_unification import GraphRulesCharacterUnifier
from renard.pipeline.graph_extraction import CoOccurrencesGraphExtractor

with open("./my_doc_in_french.txt") as f:
    text = f.read()

pipeline = Pipeline(
    [
        NLTKTokenizer(),
        BertNamedEntityRecognizer(),
        GraphRulesCharacterUnifier(min_appearances=10),
        CoOccurrencesGraphExtractor(co_occurrences_dist=25)
    ],
    lang="fra"  # ISO 639-3 language code for French
)

out = pipeline(text)

This pipeline is valid because NLTKTokenizer, BertNamedEntityRecognizer and GraphRulesCharacterUnifier all support French, and CoOccurrencesGraphExtractor works for any language. If that pipeline were invalid, Renard would display an error message explaining why. Renard can perform this language check because each step explicitly indicates which languages it supports by overriding the PipelineStep.supported_langs() method. This method returns the set of languages supported by a step as ISO 639-3 codes. The special string "any" indicates that the step works regardless of language. If the method is not overridden, the default is English support only.
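The language check can be sketched without Renard. In the example below, the Step class and its subclasses are hypothetical stand-ins (not Renard's PipelineStep), and unsupported_steps is an assumed helper; only the supported_langs() contract described above (a set of ISO 639-3 codes, or the string "any", with English as the default) is taken from the text.

```python
from typing import List, Set, Union


class Step:
    """Hypothetical stand-in for a pipeline step (not Renard's PipelineStep)."""

    def supported_langs(self) -> Union[Set[str], str]:
        # mirrors the documented default: English only
        return {"eng"}


class FrenchAwareTokenizer(Step):
    def supported_langs(self) -> Union[Set[str], str]:
        return {"eng", "fra"}


class LanguageAgnosticExtractor(Step):
    def supported_langs(self) -> Union[Set[str], str]:
        return "any"


class EnglishOnlyStep(Step):
    pass  # inherits the English-only default


def unsupported_steps(steps: List[Step], lang: str) -> List[str]:
    """Return the class names of the steps that do not support lang."""
    bad = []
    for step in steps:
        langs = step.supported_langs()
        if langs != "any" and lang not in langs:
            bad.append(type(step).__name__)
    return bad


print(unsupported_steps([FrenchAwareTokenizer(), LanguageAgnosticExtractor()], "fra"))
print(unsupported_steps([FrenchAwareTokenizer(), EnglishOnlyStep()], "fra"))
```

The first call prints an empty list, since every step supports French; the second reports EnglishOnlyStep, which falls back to the English-only default.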