Demonstration notebook for the mailcom package

Scientific Software Center, University of Heidelberg, May 2025

The mailcom package is used to anonymize/pseudonymize textual data, i.e. email content. It takes an eml, html or csv file as input and extracts information about attachments (their number and type) as well as the content of the email body and subject line. The email body and subject line are then parsed through `spaCy <https://spacy.io/>`__ and divided into sentences. The sentences are fed to a `transformers <https://huggingface.co/docs/transformers/en/index>`__ named entity recognition (NER) pipeline, and person names, locations, organizations and miscellaneous entities are detected in the inference task. Names are replaced by pseudonyms, while locations, organizations and miscellaneous entities are replaced by [location], [organization] and [misc]. The text is further parsed using string methods to replace any numbers with [number] and email addresses with [email]. The processed text and metadata can then be written to an xml file or into a pandas dataframe.

mailcom can automatically detect the (dominant) text language and also has the capability of preserving dates in the text (so that only numbers are replaced, but not patterns that match dates).

Please note that 100% accuracy is not possible with this task. Any output needs to be further checked by a human to ensure the text has been anonymized completely.

The current set-up is for Romance languages, however other language models can also be loaded into the spaCy pipeline. The transformers pipeline uses the xlm-roberta-large-finetuned-conll03-english model revision number 18f95e9 by default, but other models can also be passed via the config file (see below).

Before using the mailcom package, please install it into your conda environment using

pip install mailcom

After that, select the appropriate kernel for your Jupyter notebook and execute the cells below to import the package. The package is currently under active development and any function calls are subject to change.

You may also run this on Google Colab via the link provided in the repository.

[ ]:
# if running on google colab
# flake8-noqa-cell

if "google.colab" in str(get_ipython()):
    %pip install git+https://github.com/ssciwr/mailcom.git -qqq
    # mount Google Drive to access files
    from google.colab import drive

    drive.mount("/content/drive")
[ ]:
import mailcom
import pandas as pd
from IPython.display import display, HTML
import pprint

pp = pprint.PrettyPrinter(indent=4)

Processed text visualization

The cells below define functionality to display the result at the end and highlight all named entities found in the text. It is used for demonstration purposes in this notebook.

[ ]:
# a dictionary matching colors to the different entity types
colors = {
    "LOC": "green",
    "ORG": "blue",
    "MISC": "yellow",
    "PER": "red"
}

The highlight function below is used to visualize what will be replaced in the text, but only after email addresses in the input text have been pseudonymized (i.e. replaced with [email]).

[ ]:
# function for displaying the result using HTML
def highlight_ne_sent(text, ne_list):
    if not ne_list:
        return text

    # create a list of all entities with their positions
    entities = []
    seen_words = set()
    for ne in ne_list:
        # avoid substituting the same entity multiple times
        if ne["word"] not in seen_words and ne["entity_group"] in colors:
            seen_words.add(ne["word"])
            entities.append((ne, colors[ne["entity_group"]]))
    # the position-based slicing below requires ascending start indices
    entities.sort(key=lambda item: item[0]["start"])

    # replace entities with highlighted spans
    text_chunks = []
    last_idx = 0
    for entity, color in entities:
        ent_word = entity["word"]
        s_idx = entity["start"]
        e_idx = entity["end"]
        # add text before the entity
        text_chunks.append(text[last_idx:s_idx].replace("<", "&lt;").replace(">", "&gt;"))
        # add the entity with a span
        # assume that the entity does not have any HTML tags
        replacement = f"<span style=\"background-color:{color}\">{ent_word}</span>"
        text_chunks.append(replacement)
        last_idx = e_idx
    # add the remaining text
    text_chunks.append(text[last_idx:].replace("<", "&lt;").replace(">", "&gt;"))
    # join all text chunks
    result = "".join(text_chunks)

    return result

Configuring your text processing pipeline

All settings for the whole text processing are stored in the file mailcom/default_settings.json. You can customize them by:

  • Modifying mailcom/default_settings.json directly, or

  • Creating a new configuration file, or

  • Updating specific fields when loading the configuration.

The function mailcom.get_workflow_settings() is used to load the workflow settings. It can also save the updated settings to a directory provided as a keyword argument.

[ ]:
# get workflow settings from a configuration file
setting_path = "../../../mailcom/default_settings.json"
workflow_settings = mailcom.get_workflow_settings(setting_path=setting_path)

# update some fields while loading the settings
new_settings = {"default_lang": "es"}
# save the updated configuration to a file for reproducibility purposes
new_settings_dir = "../../../mailcom/"
workflow_settings = mailcom.get_workflow_settings(new_settings=new_settings,
                                                  updated_setting_dir=new_settings_dir,
                                                  save_updated_settings=True)

In the last example of the cell above, the updated settings are saved to a file. If updated_setting_dir is not provided, the file is saved in the current directory. To skip saving, set save_updated_settings to False.
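To make the loading behaviour concrete, here is a simplified sketch of what such a settings loader might do. This is illustrative only and not mailcom's actual implementation; the helper name load_settings_sketch and the output file name are hypothetical.

```python
import json
from pathlib import Path


def load_settings_sketch(setting_path=None, new_settings=None,
                         updated_setting_dir=".", save_updated_settings=False):
    """Illustrative settings loader: read JSON defaults, overlay user
    updates, and optionally persist the merged result."""
    settings = {}
    if setting_path is not None:
        settings = json.loads(Path(setting_path).read_text())
    if new_settings:
        settings.update(new_settings)  # shallow merge of user overrides
    if save_updated_settings:
        out_file = Path(updated_setting_dir) / "updated_settings.json"
        out_file.write_text(json.dumps(settings, indent=2))
    return settings
```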

For this demo, we will use the default workflow settings:

[ ]:
# get default workflow settings
workflow_settings = mailcom.get_workflow_settings()

The configuration options are (defaults in brackets):

  • default_lang ([“fr”], “es”, “pt”): the default language of the textual data

  • pseudo_emailaddresses ([true], false): replace email addresses by [email]

  • pseudo_ne ([true], false): replace named entities by pseudonyms

  • pseudo_numbers ([true], false): replace numbers by [number]

  • ner_pipeline ([null], or a valid transformers model name, revision number, pipeline task and aggregation strategy): the transformers pipeline to use for the NER

  • spacy_model ([“default”], or a valid spaCy model): which spaCy model to use for the sentence splitting (see below)

These keywords set the options for the main processes of the mailcom package. The default language can be used for text that is always in the same language, that is, each eml/html file or row of the csv contains data in the same language. If this is the case, processing is much faster. If not, the language of the text can be detected on-the-fly with options specified below. In this case, leave the default language empty, i.e. set it to an empty string ("").

The keywords pseudo_emailaddresses and pseudo_numbers are by default set to true, which triggers the replacement of email addresses such as email@gmail.com by [email], and numbers such as 69120 by [number].
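As an illustration of this kind of replacement, a minimal regex-based sketch could look like the following. This is not mailcom's actual implementation; the patterns are simplified for demonstration.

```python
import re


def replace_emails_and_numbers(text):
    """Illustrative sketch: replace email addresses first, then any
    remaining digit runs, mirroring the order described above."""
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[email]", text)
    text = re.sub(r"\d+", "[number]", text)
    return text


print(replace_emails_and_numbers("Contact email@gmail.com at 69120"))
```

Replacing email addresses before numbers matters: otherwise digits inside an address would be substituted first and the address pattern would no longer match.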

Setting pseudo_ne triggers the replacement of recognized entities by a pseudonym or placeholder. A person’s name such as “Michael” is replaced by “James”, a location like “Paris” is replaced by [location], an organization such as “GitHub” is replaced by [organization], and other entities like “iPhone 15” are replaced by [misc].

These three options for replacing identifying information can be toggled separately, but are all set to true by default.

An example for the transformers pipeline is this, with the default options:

"ner": {
    "task": "token-classification",
    "model": "xlm-roberta-large-finetuned-conll03-english",
    "revision": "18f95e9",
    "aggregation_strategy": "simple"
}

The task is token-classification, which is NER (for a description of the available tasks, see here). The default model is Hugging Face’s default model for this task and the default revision number as of January 2025. The aggregation strategy determines how the tokens are aggregated after the pipeline; with simple the text is basically reconstructed as it was, and the start and end position of each recognized entity is reported accordingly. The options task and aggregation_strategy are not likely to be changed by the user; however, you may want to use a different model and revision number, which is possible using the ner_pipeline keyword.

The keyword spacy_model sets the model to use for the sentence splitting and pattern recognition. It is important that the initial text is split into sentences with high accuracy, since this directly affects the subsequent NER accuracy. If the keyword is set to default, the model that spaCy uses as default for the given language is used. Some of the default models are:

"es": "es_core_news_md"
"fr": "fr_core_news_md"
"de": "de_core_news_md"
"pt": "pt_core_news_md"

Other models can directly be passed using this keyword, see the spaCy reference. To extend the available languages in mailcom, this list needs to be extended. Please also note that not all spaCy models have pipelines with the necessary components.
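A hypothetical helper resolving the spaCy model name from the language code, following the defaults listed above, might look like this (resolve_spacy_model is not part of the mailcom API):

```python
# default models per language, as listed above
DEFAULT_SPACY_MODELS = {
    "es": "es_core_news_md",
    "fr": "fr_core_news_md",
    "de": "de_core_news_md",
    "pt": "pt_core_news_md",
}


def resolve_spacy_model(lang, spacy_model="default"):
    """Return the spaCy model to load: a user-supplied model name wins,
    otherwise fall back to the per-language default."""
    if spacy_model != "default":
        return spacy_model
    return DEFAULT_SPACY_MODELS[lang]
```

The resolved name would then be passed to spacy.load(); extending the language coverage amounts to extending the default mapping.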

mailcom has additional capabilities that can be used to enhance the text processing (defaults in brackets):

  • lang_detection_lib ([“langid”], “langdetect”, “trans”): automatically detect the language of each text using the specified library

  • lang_pipeline ([null], {“task”: “text-classification”}, for others see here): the pipeline to use for the language detection; only valid for transformers language detection

  • datetime_detection ([true], false): detect dates and retain them in the text

  • time_parsing ([“strict”], “non-strict”): the pattern matching used to detect date/time patterns in the text (see below)

The first keyword in this table, lang_detection_lib, enables dynamic detection of the language. While this increases the processing time, it is crucial for correct sentence splitting when multiple languages are present in the data. In principle, the language can be determined for each sentence; but the general use of this capability is language detection per eml/html file/row in the csv file. Please note that the default language must not be set for this option to be triggered (default_lang="")! Three different libraries are available for language detection, `langid <https://github.com/saffsd/langid.py>`__, `langdetect <https://github.com/Mimino666/langdetect>`__, `transformers <https://huggingface.co/papluca/xlm-roberta-base-language-detection>`__, that all lead to a similar performance on our test set. With the language detected dynamically, the spaCy model for sentence splitting is also set dynamically based on the detected language for each file/row; this should be combined with the default option for the spaCy model in order to work correctly.
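For illustration, a settings fragment enabling on-the-fly language detection might look like the following. This is a sketch based on the keywords above; it could then be passed to mailcom.get_workflow_settings(new_settings=new_settings) as shown earlier.

```python
# hypothetical settings fragment enabling per-text language detection;
# keyword names are taken from the tables above
new_settings = {
    "default_lang": "",              # must stay empty to trigger detection
    "lang_detection_lib": "langid",  # or "langdetect", "trans"
    "spacy_model": "default",        # spaCy model then follows the detected language
}
```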

Using the keyword datetime_detection, mailcom can detect patterns that match dates, such as “09 février 2009” or “April 17th 2024” for "non-strict" parsing. These patterns can then be protected from the replacement of numbers, which would result in (for these examples) “[number] février [number]” or “April [number]th [number]”. This feature could be important in texts in which chronology is not easy to follow, or where it is important to retain any information about time in the data.

When time_parsing is set to "strict", only precise date-time formats such as “17. April 2024 um 16:58:57” or “17.04.2024 17:33:23” are detected, without the more flexible pattern matching rules that would catch “April 17th 2024”. This option could be useful for identifying forwarded dates within email bodies.
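To illustrate the difference, here is a sketch of a "strict" pattern covering only the numeric form above; mailcom's actual parsing rules are more extensive.

```python
import re

# illustrative "strict" date-time pattern (numeric form only),
# matching e.g. "17.04.2024 17:33:23"
STRICT_DATETIME = re.compile(r"\b\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}\b")


def find_strict_datetimes(text):
    """Return all strictly formatted date-time strings found in the text."""
    return STRICT_DATETIME.findall(text)
```

The flexible wording “April 17th 2024” does not match this pattern, which is exactly the distinction between the two parsing modes.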

The input data can be provided as eml or html files, or as a csv file. For reading a csv file, more information about the column names needs to be provided. This is explained in the demo notebook (click here to Open In Colab).

First and last names are replaced by pseudonyms. To make the pseudonymized text read more smoothly, names that are common for a specific language can be chosen; but in principle any names can be set for any language using the pseudo_first_names keyword. The default option is:

pseudo_first_names = {
        "es": [
            "José",
            "Angel",
            "Alex",
            "Ariel",
            "Cruz",
            "Fran",
            "Arlo",
            "Adri",
            "Marce",
            "Mati"
        ],
        "fr": [
            "Claude",
            "Dominique",
            "Claude",
            "Camille",
            "Charlie",
            "Florence",
            "Francis",
            "Maxime",
            "Remy",
            "Cécile"
        ],
        "de": ["Mika"]
    }
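The pseudonymization of names can be thought of as a consistent mapping: each distinct detected name is assigned the next pseudonym from the pool, so repeated mentions of the same name receive the same replacement. A minimal sketch (assign_pseudonyms is a hypothetical helper, not the mailcom API):

```python
def assign_pseudonyms(names, pseudo_pool):
    """Map each distinct detected name to a pseudonym from the pool,
    reusing the same pseudonym for repeated mentions."""
    mapping = {}
    for name in names:
        if name not in mapping:
            # cycle through the pool if there are more names than pseudonyms
            mapping[name] = pseudo_pool[len(mapping) % len(pseudo_pool)]
    return mapping
```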

Reading input data

We currently support two types of input data: (1) a csv file and (2) a directory of eml and/or html files.

Each row of the csv file, or each eml or html file, will be stored in a dictionary with pre-defined keys: content, date, attachment, attachement type and subject. Dictionaries of eml and html files have an additional key named file_name.

Of these pre-defined keys, only ``content`` will be processed through the pipeline, all other information is retained as is.

Reading from a csv file

When loading a csv file as input, a list of columns in the file to map to the above pre-defined keys should be provided, in the correct order.

  1. Example of correct matching:

  • pre-defined keys init_data_fields = content, date, attachment, attachement type, subject

  • matching columns col_names = message, time, attachement, attachement_type, subject

  2. If there are fewer columns in the csv than pre-defined keys, the remaining pre-defined keys will be set to None in the processing, for instance:

  • pre-defined keys init_data_fields = content, date, attachment

  • matching columns col_names = message, time

The input data dictionary for each row in this case is saved like this:

row_data = {
    "content": row["message"],
    "date": row["time"],
    "attachment": None
}
  3. If there are more columns than pre-defined keys, the extra columns are stored in the data dictionary without modification.

  • pre-defined keys init_data_fields = content, date

  • matching columns col_names = message, time, summary

The data dictionary for each row in this case is:

row_data = {
    "content": row["message"],
    "date": row["time"],
    "summary": row["summary"]
}
  4. If a column name intended to match a pre-defined key is misspelled, a string label is stored for that key instead. This label is specified by the csv_col_unmatched_keyword in the configuration file. By default, this keyword is set to "unmatched", but it can be updated by modifying the configuration file or passing a new value for this keyword.

  • pre-defined keys init_data_fields = content, date

  • matching columns col_names = message, tiem_with_typo

Assuming that column tiem_with_typo does not exist in the csv file, the data dictionary for each row in this case is:

row_data = {
    "content": row["message"],
    "date": "unmatched"
}
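The four matching rules above can be sketched as follows (illustrative only; map_row is a hypothetical helper, not part of the mailcom API):

```python
def map_row(row, col_names, init_data_fields, unmatched_keyword="unmatched"):
    """Map csv columns to pre-defined keys, positionally and in order.

    Rules sketched: matched columns are stored under their key; keys
    without a column become None; extra columns keep their own name;
    misspelled/missing columns store the unmatched keyword.
    """
    row_data = {}
    for i, col in enumerate(col_names):
        # extra columns (beyond the pre-defined keys) keep their column name
        key = init_data_fields[i] if i < len(init_data_fields) else col
        row_data[key] = row[col] if col in row else unmatched_keyword
    # pre-defined keys without a matching column are set to None
    for key in init_data_fields[len(col_names):]:
        row_data[key] = None
    return row_data
```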

The examples below demonstrate the input options and the resulting behaviour of mailcom when processing csv files.

[ ]:
# path to your csv file - change this to your own file
input_csv = "../../../data/in/sample_data.csv"
# the columns of the csv that should be passed through the processing pipeline/retained in the pipeline
matching_columns = ["message", "date", "attachment", "attachement_type", "subject"]
# the predefined keys that should be used to match these columns, in the correct order
pre_defined_keys = ["content", "date", "attachment", "attachement_type", "subject"]
# what to call any columns that are not matched to pre-defined keys
unmatched_keyword = "unmatched"
# or get the unmatched keyword from the workflow settings
unmatched_keyword = workflow_settings.get("csv_col_unmatched_keyword")

input_handler = mailcom.get_input_handler(in_path=input_csv, in_type="csv",
                                          col_names=matching_columns,
                                          init_data_fields=pre_defined_keys,
                                          unmatched_keyword=unmatched_keyword)

In the cell above, the message column from the csv file is mapped to the content key in the email dictionary.

[ ]:
pp.pprint(input_handler.email_list[0])
[ ]:
# path to your csv file - change this to your own file
input_csv = "../../../data/in/sample_data.csv"
# the columns of the csv that should be passed through the processing pipeline/retained in the pipeline
matching_columns = ["message", "date"]
# the predefined keys that should be used to match these columns, in the correct order
pre_defined_keys = ["content", "date"]
# what to call any columns that are not matched to pre-defined keys
unmatched_keyword = "unmatched"
# or get the unmatched keyword from the workflow settings
unmatched_keyword = workflow_settings.get("csv_col_unmatched_keyword")

input_handler = mailcom.get_input_handler(in_path=input_csv, in_type="csv",
                                          col_names=matching_columns,
                                          init_data_fields=pre_defined_keys,
                                          unmatched_keyword=unmatched_keyword)
[ ]:
pp.pprint(input_handler.email_list[0])

Here, we have asked the input handler only to match two of the columns, so the other columns are discarded.

Reading eml/html files from a directory

Below, the input files are loaded from the given input_dir directory into an input handler. You can provide relative or absolute paths to the directory that contains your eml or html files. All files of the eml or html file type in that directory will be considered input files.

[ ]:
# import files from input_dir - change this to your own directory
input_dir = "../../../data/in/"
input_handler = mailcom.get_input_handler(in_path=input_dir, in_type="dir")

The data is then loaded into the same dictionary structure used for the csv input file, with the addition of a file_name key.

[ ]:
pp.pprint(input_handler.email_list[0])

Processing of the data

In the cell below, the emails are looped over and the email content is processed. Depending on the settings, each “content” goes through the following steps:

  1. language detection (optional)

  2. date/time detection (optional)

  3. email address pseudonymization (optional)

  4. named entity pseudonymization

  5. number pseudonymization (optional)

For steps 3-5, the email content is divided into sentences, which are then pseudonymized. The modified sentences are recombined into a text and stored in the email dictionary under the key "pseudo_content".

[ ]:
# process the input data
mailcom.process_data(input_handler.get_email_list(), workflow_settings)

Now that all emails have been pseudonymized, the named entities found in the input text can be highlighted as follows:

[ ]:
# loop over mails and display the highlights
for email in input_handler.get_email_list():
    # get NE for each sentence in the email
    ne_sent_dict = {}
    for sent_idx, ne in zip(email["ne_sent"], email["ne_list"]):
        if str(sent_idx) not in ne_sent_dict:
            ne_sent_dict[str(sent_idx)] = []
        ne_sent_dict[str(sent_idx)].append(ne)

    # display original text and highlight found and replaced NEs
    html_content = []
    for sent_idx, sentence in enumerate(email["sentences_after_email"]):
        ne_list = ne_sent_dict.get(str(sent_idx), [])
        highlighted_html = highlight_ne_sent(sentence, ne_list)
        html_content.append(highlighted_html)
    display(HTML(" ".join(html_content)))

After this, the output can be written to a file or processed further. The output is a list of dictionaries, each containing the metadata of the email and the pseudonymized content. In the below cell, the output is saved in a pandas dataframe.

[ ]:
# write output to pandas df
df = pd.DataFrame(input_handler.get_email_list())
df.head(5)

The meanings of the added columns are:

cleaned_content - the text cleaned of extra newlines and of leading and trailing whitespace;

lang - the language used to parse the emails (depends on your settings in the configuration file);

detected_datetime - the dates that were detected;

pseudo_content - the pseudonymized content of the processed text;

ne_list - the list of recognized named entities and their properties;

ne_sent - indices of sentences containing named entities;

sentences - the list of sentences of the text data, as detected by spaCy;

sentences_after_email - the list of sentences after replacing email addresses by [email].

The output can be saved as a csv file as well.

[ ]:
# set overwrite to True to overwrite the existing file
mailcom.write_output_data(input_handler, "../../../data/out/out_demo.csv", overwrite=True)