inout module

class inout.InoutHandler(init_data_fields: list = None)

Bases: object

data_to_xml()
extract_email_info(file: Path) dict

Function to extract the textual content and other metadata from a single email file.

Parameters:

file (Path) – The path to the email file.

Returns:

Dictionary containing email text and metadata.

Return type:

dict
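
A minimal usage sketch (the file path is illustrative):

    from pathlib import Path

    from inout import InoutHandler

    handler = InoutHandler()
    info = handler.extract_email_info(Path("mail/example.eml"))
    print(sorted(info.keys()))  # email text and metadata fields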

get_email_list()

Function that returns an iterator of email_list.

Returns:

Iterator of self.email_list.

Return type:

iter

get_html_text(text_check: str) str

Clean up a string if it contains html content.

Parameters:

text_check (str) – The string that may contain html content.

Returns:

The (potentially) cleaned up string.

Return type:

str

list_of_files(directory_name: str, file_types: list = ['.eml', '.html'])

Method to create a list of Path objects (files) that are present in a directory.

Parameters:
  • directory_name (str) – The directory where the files are located.

  • file_types (list, optional) – The list of file types to be processed. Defaults to [“.eml”, “.html”].

load_csv(infile: str, col_names: list = ['message'], unmatched_keyword: str = 'unmatched')

Load the email list from a csv file. The col_names should map to the init_data_fields in order; see the sketch after the parameter list.
  • If col_names is empty, the email dict is created with the init_data_fields as keys.

  • If there are fewer col_names than init_data_fields, the remaining fields in the email dict are set to None.

  • If there are more col_names than init_data_fields, the extra col_names are used as keys in the email dict.

Parameters:
  • infile (str) – The path of the file to be read.

  • col_names (list) – The list of column names that map the init_data_fields. Defaults to [“message”].

  • unmatched_keyword (str) – The keyword for marking unmatched columns.
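
A minimal sketch of the mapping rules above, assuming an emails.csv file whose first column holds the message text:

    from inout import InoutHandler

    # One column name for three init_data_fields: "message" maps onto
    # "content" by order; "date" and "subject" are set to None.
    handler = InoutHandler(init_data_fields=["content", "date", "subject"])
    handler.load_csv("emails.csv", col_names=["message"])
    for email in handler.get_email_list():
        print(email["content"], email["date"])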

process_emails()

Function that processes all emails in the directory and saves their contents in email_list.

validate_data(email_dict: dict)

Check if all fields in init_data_fields are present. If not, set them to None.

write_csv(outfile: str)

Write the email list containing all dictionaries to csv.

Parameters:

outfile (str) – The path of the file to be written.

write_file(text: str, outfile: str) None

Write the extracted string to a text file.

Parameters:
  • text (str) – The string to be written to the file.

  • outfile (str) – The file to be written.

lang_detector module

class lang_detector.LangDetector(trans_loader: TransformerLoader = None)

Bases: object

constrain_langid(lang_set=[])

Constrain langid to a given set of languages. By default, no constraint is applied.

contains_only_emails(text: str) bool

Check if a given text contains only email(s).

Parameters:

text (str) – The text to check.

Returns:

True if the text contains only email(s), False otherwise.

Return type:

bool

contains_only_links(text: str) bool

Check if a given text contains only links.

Parameters:

text (str) – The text to check.

Returns:

True if the text contains only links, False otherwise.

Return type:

bool

contains_only_numbers(text: str) bool

Check if a given text contains only numbers.

Parameters:

text (str) – The text to check.

Returns:

True if the text is only numbers, False otherwise.

Return type:

bool

contains_only_punctuations(text: str) bool

Check if a given text contains only punctuation.

Parameters:

text (str) – The text to check.

Returns:

True if the text is only punctuation, False otherwise.

Return type:

bool

detect_lang_sentences(sentences: list[str], lang_lib='langid', pipeline_info: dict = None) IntervalTree

Detect languages of a list of sentences using a specified language library.

Parameters:
  • sentences (list[str]) – The sentences to detect the languages of.

  • lang_lib (str) – The lang_lib to use for detection. Options are “langid”, “langdetect” and “trans”.

  • pipeline_info (dict, optional) – The pipeline information, used for detecting with “trans” option.

Returns:

An interval tree with the detected languages and their spans.

Return type:

IntervalTree
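
A minimal usage sketch (the exact payload of each interval's data field depends on the implementation):

    from lang_detector import LangDetector

    ld = LangDetector()
    tree = ld.detect_lang_sentences(
        ["Bonjour à tous.", "This is an English sentence."], lang_lib="langid"
    )
    for interval in sorted(tree):
        # each interval spans a range of sentences with one detected language
        print(interval.begin, interval.end, interval.data)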

detect_with_langdetect(sentence: str) list[tuple[str, float]]

Detect language of a given text using the langdetect library. Recommended for single-language detection.

Parameters:

sentence (str) – The text to detect the language of.

Returns:

The possible languages and their probabilities.

Return type:

list(str, float)

detect_with_langid(sentence: str) list[tuple[str, float]]

Detect language of a given text using the langid library. Recommended for single-language detection.

Parameters:

sentence (str) – The text to detect the language of.

Returns:

The detected language and its probability.

Return type:

list(str, float)

detect_with_transformers(sentence: str, pipeline_info: dict = None) list[tuple[str, float]]

Detect language of a given text using the transformers library.

Parameters:
  • sentence (str) – The text to detect the language of.

  • pipeline_info (dict, optional) – The pipeline information

Returns:

The possible languages and their probabilities.

Return type:

list(str, float)

determine_langdetect()

Enforce consistent results for langdetect.

get_detections(text: str, lang_lib='langid', pipeline_info: dict = None) list[tuple[str, float]]

Get detections for a given text using a specified lang_lib or model.

Parameters:
  • text (str) – The text to detect the language of.

  • lang_lib (str) – The lang_lib to use for detection. Options are “langid”, “langdetect” and “trans”. The default is “langid”.

  • pipeline_info (dict, optional) – The pipeline information, used for detecting with “trans” option.

Returns:

A list of detected languages and their probabilities.

Return type:

list(str, float)
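
A minimal usage sketch with the default langid backend:

    from lang_detector import LangDetector

    ld = LangDetector()
    detections = ld.get_detections("Ceci est une phrase française.")
    lang, prob = detections[0]
    print(lang, prob)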

init_transformers(pipeline_info: dict = None)

Initialize transformers for language detection.

strip_punctuations(text: str) str

Strip punctuation from a given text.

Parameters:

text (str) – The text to strip punctuation from.

Returns:

The text with punctuation stripped.

Return type:

str

main module

main.get_input_handler(in_path: str, in_type: str = 'dir', col_names: list = ['message'], init_data_fields: list = ['content', 'date', 'attachment', 'attachement type', 'subject'], unmatched_keyword: str = 'unmatched', file_types: list = ['.eml', '.html']) InoutHandler

Get input handler for a file or directory.

Parameters:
  • in_path (str) – The path to the input data.

  • in_type (str, optional) – The type of input data. Defaults to “dir”. Possible values are [“dir”, “csv”].

  • col_names (list, optional) – The list of column names that map the init_data_fields.

  • init_data_fields (list, optional) – The list of fields that should be present in the data dict.

  • unmatched_keyword (str, optional) – The keyword for marking unmatched columns in csv files. Defaults to “unmatched”.

  • file_types (list, optional) – The list of file types to be processed in the directory.

Returns:

The input handler object.

Return type:

InoutHandler
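
A minimal usage sketch for a directory of email files (the path is illustrative; whether process_emails() must be called explicitly is an assumption):

    from main import get_input_handler

    inout_hl = get_input_handler("data/emails", in_type="dir")
    inout_hl.process_emails()  # assumed to be a separate, explicit step
    for email in inout_hl.get_email_list():
        print(email.get("subject"))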

main.get_workflow_settings(setting_path: str = 'default', new_settings: dict = {}, updated_setting_dir: str = None, save_updated_settings: bool = True) dict

Get the workflow settings. If setting_path is “default”, return the default settings; otherwise, read the settings from the file at that path. If new_settings is provided, it overwrites the default or loaded settings.

Parameters:
  • setting_path (str) – Path to the workflow settings file. Defaults to “default”.

  • new_settings (dict) – New settings to overwrite the existing settings. Defaults to {}.

  • updated_setting_dir (str) – Directory to save the updated settings file. Defaults to None.

  • save_updated_settings (bool) – Whether to save the updated settings to a file.

Returns:

The workflow settings.

Return type:

dict
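
A minimal sketch; the key used in new_settings is hypothetical and must match a key in the default settings:

    from main import get_workflow_settings

    settings = get_workflow_settings(
        new_settings={"default_lang": "fr"},  # hypothetical key
        save_updated_settings=False,
    )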

main.is_valid_settings(workflow_setting: dict) bool

Check if the workflow settings are valid.

Parameters:

workflow_setting (dict) – The workflow settings.

Returns:

True if the settings are valid, False otherwise.

Return type:

bool

main.process_data(email_list: Iterator[list[dict]], workflow_settings: dict)

Process the input data in this order (see the end-to-end sketch after the parameter list):
  • detect language (optional)

  • detect date time (optional)

  • pseudonymize email addresses (optional)

  • pseudonymize named entities

  • pseudonymize numbers (optional)

Parameters:
  • email_list (Iterator[list[dict]]) – The list of dictionaries of input data. The “content” field in each dictionary contains the main content.

  • workflow_settings (dict) – The workflow settings.
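
An end-to-end sketch combining the helpers in this module (paths are illustrative, and the explicit process_emails() call is an assumption):

    from main import (
        get_input_handler,
        get_workflow_settings,
        process_data,
        write_output_data,
    )

    inout_hl = get_input_handler("data/emails", in_type="dir")
    inout_hl.process_emails()  # assumed explicit step
    settings = get_workflow_settings(save_updated_settings=False)
    process_data(inout_hl.get_email_list(), settings)
    write_output_data(inout_hl, "output/pseudonymized.csv", overwrite=True)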

main.save_settings_to_file(workflow_settings: dict, dir_path: str = None)

Save the workflow settings to a file. If dir_path is None, save to the current directory.

Parameters:
  • workflow_settings (dict) – The workflow settings.

  • dir_path (str, optional) – The path to save the settings file. Defaults to None.

main.write_output_data(inout_hl: InoutHandler, out_path: str, overwrite: bool = False)

Write the output data to a file.

Parameters:
  • inout_hl (InoutHandler) – The input handler object containing the data.

  • out_path (str) – The path to the output file.

  • overwrite (bool, optional) – Flag to overwrite the output file if it exists. Defaults to False.

parse module

class parse.Pseudonymize(pseudo_first_names: dict, trans_loader: TransformerLoader = None, spacy_loader: SpacyLoader = None)

Bases: object

choose_per_pseudonym(name, lang='fr')

Chooses a pseudonym for a PER named entity based on previously used pseudonyms. If the name has previously been replaced, the same pseudonym is used again. If not, a new pseudonym is taken from the list of available pseudonyms.

Parameters:
  • name (str) – Word of the named entity.

  • lang (str, optional) – Language to choose pseudonyms from. Defaults to “fr”.

Returns:

Chosen pseudonym.

Return type:

str

concatenate(sentences)

Concatenates a list of sentences into a coherent text.

Parameters:

sentences (list[str]) – List containing all sentences to concatenate.

Returns:

Concatenated text.

Return type:

str

get_ner(sentence, pipeline_info: dict = None)

Retrieves named entities in a string using the transformers model.

Parameters:
  • sentence (str) – Input text to search for named entities.

  • pipeline_info (dict, optional) – Transformers pipeline info. Defaults to None.

Returns:

List of named entities retrieved from transformers model.

Return type:

list[dict]

get_sentences(input_text, language, model='default')

Splits a text into sentences using spacy.

Parameters:
  • input_text (str) – Text to split into sentences.

  • language (str) – Language for spacy initialization.

  • model (str, optional) – Model of the spacy instance. Defaults to “default”.

Returns:

List of sentences.

Return type:

list[str]

init_spacy(language: str, model='default')

Initializes spacy model.

Parameters:
  • language (str) – Language of the desired spacy model.

  • model (str, optional) – Model specifier. Defaults to “default”.

init_transformers(pipeline_info: dict = None)

Initializes transformers NER model.

Parameters:

pipeline_info (dict, optional) – Transformers pipeline info.

pseudonymize(text, language='de', model='default', pipeline_info: dict = None, detected_dates: list[str] = None, pseudo_emailaddresses=True, pseudo_ne=True, pseudo_numbers=True)

Function that handles the pseudonymization of an email and all its steps.

Parameters:
  • text (str) – Text to pseudonymize.

  • language (str, optional) – Language of the email. Defaults to “de”.

  • model (str, optional) – Model to use for NER. Defaults to “default”.

  • pipeline_info (dict, optional) – Pipeline information for NER. Defaults to None.

  • detected_dates (list[str], optional) – Detected dates in the email. Defaults to None.

  • pseudo_emailaddresses (bool, optional) – Whether to pseudonymize email addresses. Defaults to True.

  • pseudo_ne (bool, optional) – Whether to pseudonymize named entities. Defaults to True.

  • pseudo_numbers (bool, optional) – Whether to pseudonymize numbers. Defaults to True.

Returns:

Pseudonymized text

Return type:

str
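
A minimal usage sketch; the structure of pseudo_first_names (a mapping from language code to a list of replacement names) and the explicit init calls are assumptions:

    from parse import Pseudonymize

    pseudo = Pseudonymize(
        pseudo_first_names={"de": ["Mika", "Kim"], "fr": ["Camille"]}  # assumed structure
    )
    pseudo.init_spacy("de")      # assumed to be required up front
    pseudo.init_transformers()
    result = pseudo.pseudonymize(
        "Hans schrieb am 12.03.2024 an anna@example.com.",
        language="de",
        detected_dates=["12.03.2024"],  # dates passed here are not replaced
    )
    pseudo.reset()  # clear stored pseudonyms before the next email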

pseudonymize_email_addresses(sentence)

Replaces words containing @ in a string with a placeholder.

Parameters:

sentence (str) – Sentence to search for emails.

Returns:

Text with emails replaced by placeholder.

Return type:

str

pseudonymize_ne(ner, sentence, lang='fr', sent_idx=0)

Pseudonymizes all named entities found in a string. Named entities categorized as persons are replaced with a pseudonym. Named entities categorized as locations, organizations or miscellaneous are replaced by a placeholder. Used pseudonyms are saved for each entity for reuse in case of multiple occurrences.

Parameters:
  • ner (list[dict]) – List of named entities found by the transformers model.

  • sentence (str) – Input String to replace all named entities in.

  • lang (str, optional) – Language to choose pseudonyms from. Defaults to “fr”.

  • sent_idx (int, optional) – Index of the sentence in the email. Defaults to 0.

Returns:

Pseudonymized sentence as list.

Return type:

list[str]

pseudonymize_numbers(sentence, detected_dates: list[str] = None)

Replaces numbers that are not dates in a sentence with placeholder.

Parameters:
  • sentence (str) – Sentence to search for numbers.

  • detected_dates (list[str], optional) – List of detected dates, which will not be replaced. Defaults to None.

Returns:

Text with non-date numbers replaced by placeholder.

Return type:

str

pseudonymize_with_updated_ne(sentences, ne_sent_dict: dict[list[dict]], language='de', detected_dates: list[str] = None, pseudo_emailaddresses=True, pseudo_ne=True, pseudo_numbers=True)

Pseudonymizes the email with updated named entities. This function is used when the named entities have been updated in the email and need to be pseudonymized again.

Parameters:
  • sentences (list[str]) – List of sentences to pseudonymize.

  • ne_sent_dict (dict[list[dict]]) – Dictionary containing named entities for each sentence.

  • language (str, optional) – Language of the email. Defaults to “de”.

  • detected_dates (list[str], optional) – Detected dates in the email. Defaults to None.

  • pseudo_emailaddresses (bool, optional) – Whether to pseudonymize email addresses. Defaults to True.

  • pseudo_ne (bool, optional) – Whether to pseudonymize named entities. Defaults to True.

  • pseudo_numbers (bool, optional) – Whether to pseudonymize numbers. Defaults to True.

Returns:

Pseudonymized text

Return type:

str

reset()

Clears the named entity list for processing a new email.

time_detector module

class time_detector.TimeDetector(strict_parsing='non-strict', spacy_loader: SpacyLoader = None)

Bases: object

add_merged_datetime(merged_datetime: list, new_item: tuple) list

Add a new item to the merged datetime list.

Parameters:
  • merged_datetime (list) – The list of merged datetime.

  • new_item (tuple) – The new item to add. It contains the date string, the datetime object, the start index and the end index.

Returns:

The updated list of merged datetime.

Return type:

list

add_pattern(pattern: list[dict], mode: str) None

Add a new pattern to the matcher if it’s a non-empty list of dictionaries and not already present.

Parameters:
  • pattern (list[dict]) – The pattern to add to the matcher.

  • mode (str) – The mode of the pattern, either “strict” or “non-strict”.
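
A minimal sketch in spaCy Matcher pattern syntax (the pattern itself is illustrative):

    from time_detector import TimeDetector

    td = TimeDetector(strict_parsing="strict")
    # match a standalone four-digit token, e.g. a year
    td.add_pattern([{"SHAPE": "dddd"}], mode="strict")
    td.remove_pattern([{"SHAPE": "dddd"}], mode="strict")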

extract_date_time(doc: Doc, language: str, model='default') list

Extract dates from a given text.

Parameters:
  • doc (Doc) – The spacy doc object.

  • language (str) – The language of the text.

  • model (str, optional) – The model to use for the spacy instance. Defaults to “default”.

Returns:

A list of extracted dates.

Return type:

list

extract_date_time_multi_words(doc: Doc, language: str, model='default') tuple[list, list]

Extract time from a given text when it is multiple words. E.g. 12 mars 2025, 17. April 2024

Parameters:
  • doc (Doc) – The spacy doc object.

  • language (str) – The language of the text.

  • model (str, optional) – The model to use for the spacy instance. Defaults to “default”.

Returns:

A list of extracted dates and marks of locations in the doc.

Return type:

tuple[list, list]

extract_date_time_single_word(doc: Doc, marked_locations: list) list

Extract time from a given text when it is only one word. E.g. 2009/02/17, 17:23

Parameters:
  • doc (Doc) – The spacy doc object.

  • marked_locations (list) – A list of marked locations of dates in multiple word format.

Returns:

A list of extracted dates.

Return type:

list

filter_non_numbers(date_time: list[str, datetime, int, int]) list[str, datetime, int, int]

Filter out the date time phrases that do not contain numbers.

Parameters:

date_time (list[(str, datetime, int, int)]) – The original list.

Returns:

The filtered list.

Return type:

list[(str, datetime, int, int)]

find_dates(text: str) list[datetime]

Find dates in a given text.

Parameters:

text (str) – The text to find dates in.

Returns:

A list of dates found in the text.

Return type:

list[datetime]

get_date_time(text: str, language: str, model='default') list[str, datetime, int, int]

Get the date and time from a given text.

Parameters:
  • text (str) – The text to get the date and time from.

  • language (str) – The language of the text.

  • model (str, optional) – The model to use for the spacy instance. Defaults to “default”.

Returns:

A list of tuples containing the date string, the datetime object, the start index and the end index.

Return type:

list[(str, datetime, int, int)]
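
A minimal usage sketch:

    from time_detector import TimeDetector

    td = TimeDetector()
    text = "Das Treffen findet am 17. April 2024 um 14:00 statt."
    for date_str, dt_obj, start, end in td.get_date_time(text, language="de"):
        print(date_str, dt_obj, start, end)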

init_strict_patterns() None

Add strict patterns to the matcher for strict parsing cases.

is_time_mergeable(first_token: object, second_token: object, doc: Doc) bool

Check if two time tokens can be merged. Returns True if the tokens are next to each other in the token list, if the word between them is one of [“at”, “um”, “à”, “a las”, “,”, “.”, “-”], or if the words between them are [“.,”]; returns False otherwise.

Parameters:
  • first_token (object) – The first spaCy token or span.

  • second_token (object) – The second spaCy token or span.

  • doc (Doc) – The spacy doc object.

Returns:

True if the tokens can be merged, False otherwise.

Return type:

bool

merge_date_time(extracted_datetime: list, doc: Doc) list[str, datetime, int, int]

Merge the extracted date and time if they are mergeable.

Parameters:
  • extracted_datetime (list) – The extracted date and time.

  • doc (Doc) – The spacy doc object.

Returns:

A list of tuples containing the date string, the datetime object, the start index and the end index.

Return type:

list[(str, datetime, int, int)]

parse_time(text: str) datetime

Parse the time from text format to datetime format.

Parameters:

text (str) – The text to parse the time from.

Returns:

The datetime object of the time parsed.

Return type:

datetime

remove_pattern(pattern: list, mode: str) None

Remove pattern from the matcher if it’s present.

Parameters:
  • pattern (list) – The pattern to remove from the matcher.

  • mode (str) – The mode of the pattern, either “strict” or “non-strict”.

search_dates(text: str, langs=['es', 'fr']) list[str, datetime]

Search for dates in a given text.

Parameters:
  • text (str) – The text to search for dates in.

  • langs (list, optional) – The languages to consider when searching. Defaults to [“es”, “fr”].

Returns:

A list of tuples containing the date string and the datetime object.

Return type:

list[(str, datetime)]

unite_overlapping_words(multi_word_date_time: list, marked_locations: list, doc: Doc) tuple[list, list]

Unite overlapping words between two items in the matched multi-word date time.

Parameters:
  • multi_word_date_time (list) – A list of multi-word date time.

  • marked_locations (list) – A list of marked locations of dates in multiple word format.

  • doc (Doc) – The spacy doc object.

Returns:

A list of updated multi-word date time and a list of marked locations.

Return type:

tuple[list, list]

utils module

class utils.SpacyLoader

Bases: object

get_default_model(language: str)

init_spacy(language: str, model='default')

class utils.TransformerLoader

Bases: object

init_transformers(feature: str, pipeline_info: dict = None)

utils.check_dir(path: Path) bool

Check if a directory exists at a given path.

Parameters:

path (pathlib.Path) – The path to check.

Returns:

True if the directory exists, False otherwise.

Return type:

bool

utils.clean_up_content(content: str) tuple[str, list]

Clean up the content of an email.

Parameters:

content (str) – The content of the email.

Returns:

The cleaned up content and a list of cleaned up sentences.

Return type:

tuple[str, list]

utils.get_spacy_instance(spacy_loader: SpacyLoader, language: str, model: str = 'default')

Get the spacy instance for a given language and model.

Parameters:
  • spacy_loader (SpacyLoader) – The spacy loader.

  • language (str) – The language of the spacy instance.

  • model (str) – The model of the spacy instance, defaults to “default”.

Returns:

The spacy instance.

Return type:

spacy.Language
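
A minimal usage sketch (assumes the default German spaCy model is installed):

    from utils import SpacyLoader, get_spacy_instance

    loader = SpacyLoader()
    nlp = get_spacy_instance(loader, language="de")
    doc = nlp("Ein kurzer Beispielsatz.")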

utils.get_trans_instance(trans_loader: TransformerLoader, feature: str, pipeline_info: dict = None)

Get the transformer instance for a given feature.

Parameters:
  • trans_loader (TransformerLoader) – The transformer loader.

  • feature (str) – The feature to get the transformer instance.

  • pipeline_info (dict) – The setting info for the transformer, defaults to None.

Returns:

The transformer instance.

Return type:

pipeline

utils.make_dir(path: Path) None

Make a directory at a given path.

Parameters:

path (pathlib.Path) – The path to make a directory at.