Demonstration notebook for the mailcom package
Scientific Software Center, University of Heidelberg, May 2025
The mailcom package is used to anonymize/pseudonymize textual data, i.e. email content. It takes an eml, html or csv file as input and extracts information about attachments (their number and type) as well as the content of the email body and subject line. The email body and subject line are then parsed through spaCy (https://spacy.io/) and divided into sentences. The sentences are fed to a transformers (https://huggingface.co/docs/transformers/en/index) named entity recognition (NER) pipeline, which detects person names, places, organizations and miscellaneous entities in the inference task. Names are replaced by pseudonyms, while locations, organizations and miscellaneous entities are replaced by [location], [organization] and [misc]. The text is further parsed using string methods to replace any numbers with [number] and email addresses with [email]. The processed text and metadata can then be written to an xml file or into a pandas dataframe.
mailcom can automatically detect the (dominant) text language and can also preserve dates in the text (so that only numbers are replaced, but not patterns that match dates).
Please note that 100% accuracy is not possible with this task. Any output needs to be further checked by a human to ensure the text has been anonymized completely.
The current set-up is for Romance languages; however, other language models can also be loaded into the spaCy pipeline. The transformers pipeline uses the xlm-roberta-large-finetuned-conll03-english model, revision number 18f95e9, by default, but other models can also be passed via the config file (see below).
Before using the mailcom package, please install it into your conda environment using
pip install mailcom
After that, select the appropriate kernel for your Jupyter notebook and execute the cells below to import the package. The package is currently under active development and any function calls are subject to changes.
You may also run this notebook on Google Colab via the link provided in the repository.
[ ]:
# if running on google colab
# flake8-noqa-cell
if "google.colab" in str(get_ipython()):
    %pip install git+https://github.com/ssciwr/mailcom.git -qqq
    # mount Google Drive to access files
    from google.colab import drive
    drive.mount("/content/drive")
[ ]:
import mailcom
import pandas as pd
from IPython.display import display, HTML
import pprint
pp = pprint.PrettyPrinter(indent=4)
Processed text visualization
The cells below define functionality used to display the result in the end, and highlight all named entities found in the text. It is used for demonstration purposes in this demo notebook.
[ ]:
# a dictionary matching colors to the different entity types
colors = {
    "LOC": "green",
    "ORG": "blue",
    "MISC": "yellow",
    "PER": "red"
}
The highlight function below is used to visualize what will be replaced in the text, but only after email addresses in the input text have been pseudonymized (i.e. replaced with [email]).
[ ]:
# function for displaying the result using HTML
def highlight_ne_sent(text, ne_list):
    if not ne_list:
        return text
    # create a list of all entities with their positions
    entities = []
    for ne in ne_list:
        # avoid substituting the same entity multiple times
        if ne["word"] not in entities and ne["entity_group"] in colors:
            entities.append((ne, colors.get(ne["entity_group"])))
    # replace entities with highlighted spans
    text_chunks = []
    last_idx = 0
    for entity, color in entities:
        ent_word = entity["word"]
        s_idx = entity["start"]
        e_idx = entity["end"]
        # add text before the entity, escaping angle brackets for HTML
        text_chunks.append(text[last_idx:s_idx].replace("<", "&lt;").replace(">", "&gt;"))
        # add the entity with a span
        # assume that the entity does not have any HTML tags
        replacement = f"<span style=\"background-color:{color}\">{ent_word}</span>"
        text_chunks.append(replacement)
        last_idx = e_idx
    # add the remaining text, escaping angle brackets for HTML
    text_chunks.append(text[last_idx:].replace("<", "&lt;").replace(">", "&gt;"))
    # join all text chunks
    result = "".join(text_chunks)
    return result
Configuring your text processing pipeline
All settings for the text processing are stored in the file mailcom/default_settings.json. You can customize them by:
Modifying mailcom/default_settings.json directly, or
Creating a new configuration file, or
Updating specific fields when loading the configuration.
The function mailcom.get_workflow_settings() is used to load the workflow settings. It can also store the updated settings in a directory provided as a keyword argument.
[ ]:
# get workflow settings from a configuration file
setting_path = "../../../mailcom/default_settings.json"
workflow_settings = mailcom.get_workflow_settings(setting_path=setting_path)
# update some fields while loading the settings
new_settings = {"default_lang": "es"}
# save the updated configuration to a file for reproducibility purposes
new_settings_dir = "../../../mailcom/"
workflow_settings = mailcom.get_workflow_settings(new_settings=new_settings,
                                                  updated_setting_dir=new_settings_dir,
                                                  save_updated_settings=True)
In the last call in the cell above, the updated settings are saved to a file. If updated_setting_dir is not provided, the file is saved in the current directory. To skip saving, set save_updated_settings to False.
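The load, update and save behaviour described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the function merge_settings, the DEFAULTS dictionary and the file name updated_settings.json are hypothetical and are not part of the mailcom API.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical defaults for illustration; the real settings file has more keys.
DEFAULTS = {"default_lang": "fr", "pseudo_numbers": True}


def merge_settings(defaults, new_settings=None, out_dir=None, save=False):
    """Return a copy of defaults updated with new_settings; optionally save it."""
    merged = {**defaults, **(new_settings or {})}
    if save:
        # fall back to the current directory if no output directory is given
        target = Path(out_dir or ".") / "updated_settings.json"
        target.write_text(json.dumps(merged, indent=4), encoding="utf-8")
    return merged


with tempfile.TemporaryDirectory() as tmp:
    settings = merge_settings(DEFAULTS, {"default_lang": "es"}, tmp, save=True)
    saved_path = Path(tmp) / "updated_settings.json"
    saved = json.loads(saved_path.read_text(encoding="utf-8"))
```

Saving the merged settings next to the results makes it possible to reproduce a run later with exactly the same configuration.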
For this demo, we will use the default workflow settings:
[ ]:
# get default workflow settings
workflow_settings = mailcom.get_workflow_settings()
The configuration options are:
keyword | options [default in brackets] | explanation |
---|---|---|
default_lang | ["fr"], "es", "pt" | default language of the textual data |
pseudo_emailaddresses | [true], false | replace email addresses by [email] |
pseudo_ne | [true], false | replace named entities by pseudonyms |
pseudo_numbers | [true], false | replace numbers by [number] |
ner_pipeline | [null], valid transformers model name, revision number, pipeline task and aggregation strategy | the transformers pipeline to use for the NER |
spacy_model | ["default"], valid spaCy model | which spaCy model to use for the sentence splitting (see below) |
These keywords set the options for the main processes of the mailcom package. The default language can be used for text that is always in the same language, that is, each eml/html file or row of the csv contains data in the same language. If this is the case, processing is much faster. If not, the language of the text can be detected on the fly with the options specified below. In this case, leave the default language empty, i.e. set it to the empty string "".
The keywords pseudo_emailaddresses and pseudo_numbers are set to true by default, which triggers the replacement of email addresses such as email@gmail.com by [email], and numbers such as 69120 by [number].
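A minimal sketch of these two replacements using regular expressions is given here. The patterns EMAIL_RE and NUMBER_RE and both helper functions are assumptions made for illustration; the string methods that mailcom actually uses may differ.

```python
import re

# Simplified, hypothetical patterns; real email addresses can be more complex.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_RE = re.compile(r"\d+")


def replace_emails(text):
    """Replace anything that looks like an email address by [email]."""
    return EMAIL_RE.sub("[email]", text)


def replace_numbers(text):
    """Replace every run of digits by [number]."""
    return NUMBER_RE.sub("[number]", text)


sample = "Write to email@gmail.com, postcode 69120."
# emails first, so that digits inside addresses are not replaced separately
result = replace_numbers(replace_emails(sample))
print(result)
```

Replacing email addresses before numbers mirrors the order of the processing steps described later in this notebook.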
The keyword pseudo_ne triggers the replacement of recognized entities by a pseudonym or placeholder. A person's name, e.g. “Michael”, is replaced by “James”, a location like “Paris” is replaced by [location], an organization such as “GitHub” is replaced by [organization], and other entities like “iPhone 15” are replaced by [misc].
All three options related to the replacement of identifying information can be triggered separately, but are set to true by default.
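The entity replacement can be illustrated with a short sketch. It assumes that each entity is a dictionary with the keys entity_group, word, start and end, as used by the highlighting function in this notebook; the function pseudonymize_entities and the fixed pseudonym are hypothetical and only serve the example.

```python
# Placeholders for the non-person entity groups, as described in the text.
PLACEHOLDERS = {"LOC": "[location]", "ORG": "[organization]", "MISC": "[misc]"}


def pseudonymize_entities(text, entities, pseudonym="James"):
    """Replace each entity span by a pseudonym (PER) or a placeholder."""
    # work from the last entity to the first so that earlier offsets stay valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["entity_group"] == "PER":
            replacement = pseudonym
        else:
            replacement = PLACEHOLDERS.get(ent["entity_group"])
        if replacement is None:
            continue  # unknown entity group, leave the text untouched
        text = text[: ent["start"]] + replacement + text[ent["end"]:]
    return text


sentence = "Michael works at GitHub in Paris."
entities = [
    {"entity_group": "PER", "word": "Michael", "start": 0, "end": 7},
    {"entity_group": "ORG", "word": "GitHub", "start": 17, "end": 23},
    {"entity_group": "LOC", "word": "Paris", "start": 27, "end": 32},
]
result = pseudonymize_entities(sentence, entities)
print(result)
```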
An example for the transformers pipeline is this, with the default options:
"ner": {
    "task": "token-classification",
    "model": "xlm-roberta-large-finetuned-conll03-english",
    "revision": "18f95e9",
    "aggregation_strategy": "simple"
}
The task is token-classification, which is NER (for a description of the available tasks, see the transformers documentation). The default model is Hugging Face's default model for this task, with the default revision number as of January 2025. The aggregation strategy determines how the tokens are aggregated after the pipeline; with simple, the text is essentially reconstructed as it was, and the start and end positions of each recognized named entity are given accordingly. The options task and aggregation_strategy are unlikely to be changed by the user; however, you may want to use a different model and revision number, which is possible using the ner_pipeline keyword.
The keyword spacy_model sets the model to use for the sentencizing and pattern recognition. It is important that the initial text is split into sentences with high accuracy, since this directly affects the subsequent NER accuracy. If the keyword is set to default, the model that spaCy uses as default for the given language is used. Some of the default models are:
"es": "es_core_news_md"
"fr": "fr_core_news_md"
"de": "de_core_news_md"
"pt": "pt_core_news_md"
Other models can be passed directly using this keyword; see the spaCy reference. To extend the languages available in mailcom, this list needs to be extended. Please also note that not all spaCy models have pipelines with the necessary components.
mailcom has additional capabilities that can be used to enhance the text processing:
keyword | options [default in brackets] | explanation |
---|---|---|
lang_detection_lib | langid, langdetect, transformers (see below) | automatically detect the language of each text using the specified library |
 | [null], {"task": "text-classification"}, for others see the transformers documentation | the pipeline to use for the language detection, only valid for transformers language detection |
datetime_detection | [true], false | detect dates and retain them in the text |
time_parsing | ["strict"], "non-strict" | the pattern matching used to detect date/time patterns in the text (see below) |
The first keyword in this table, lang_detection_lib, enables dynamic detection of the language. While this increases the processing time, it is crucial for correct sentence splitting when multiple languages are present in the data. In principle, the language can be determined for each sentence, but this capability is generally used for language detection per eml/html file or per row of the csv file. Please note that the default language must not be set for this option to be triggered (default_lang="")! Three different libraries are available for language detection: langid (https://github.com/saffsd/langid.py), langdetect (https://github.com/Mimino666/langdetect) and transformers (https://huggingface.co/papluca/xlm-roberta-base-language-detection). All three perform similarly on our test set. With the language detected dynamically, the spaCy model for sentence splitting is also set dynamically based on the detected language for each file/row; this should be combined with the default option for the spaCy model in order to work correctly.
Using the keyword datetime_detection, mailcom can detect patterns that match dates, such as “09 février 2009” or “April 17th 2024” for "non-strict" parsing. These patterns can then be protected from the replacement of numbers, which would otherwise result in (for these examples) “[number] février [number]” or “April [number]th [number]”. This feature can be important in texts in which the chronology is not easy to follow, or where it is important to retain any information about time in the data.
When time_parsing is set to "strict", only precise date-time formats such as “17. April 2024 um 16:58:57” or “17.04.2024 17:33:23” are detected; the more flexible pattern matching rules that would match “April 17th 2024” are not used. This option could be useful for identifying forwarded dates within email bodies.
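To illustrate how detected dates can be protected from the number replacement, here is a minimal sketch. The pattern STRICT_DATE_RE is a hypothetical stand-in that only covers the “17.04.2024 17:33:23” style; it is not the pattern set used by mailcom.

```python
import re

# Hypothetical strict pattern: "DD.MM.YYYY", optionally followed by "HH:MM:SS".
STRICT_DATE_RE = re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}(?: \d{2}:\d{2}:\d{2})?")
NUMBER_RE = re.compile(r"\d+")


def replace_numbers_keep_dates(text):
    """Replace digit runs by [number], except inside detected date spans."""
    chunks = []
    last = 0
    for match in STRICT_DATE_RE.finditer(text):
        # replace numbers in the text before the date, then keep the date as is
        chunks.append(NUMBER_RE.sub("[number]", text[last:match.start()]))
        chunks.append(match.group())
        last = match.end()
    # replace numbers in the remaining text after the last date
    chunks.append(NUMBER_RE.sub("[number]", text[last:]))
    return "".join(chunks)


sample = "Sent 17.04.2024 17:33:23 from room 12."
result = replace_numbers_keep_dates(sample)
print(result)
```

With this strict sketch, a looser pattern such as “April 17th 2024” is not recognized as a date and its numbers are replaced.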
The input data can be provided as eml or html files, or as a csv file. For reading a csv file, more information about the column names needs to be provided. This is explained in the demo notebook.
First and last names are replaced by pseudonyms. To make the pseudonymized text read more smoothly, names that are common for a specific language can be chosen, but in principle any names can be set for any language using the pseudo_first_names keyword. The default option is:
pseudo_first_names = {
    "es": [
        "José",
        "Angel",
        "Alex",
        "Ariel",
        "Cruz",
        "Fran",
        "Arlo",
        "Adri",
        "Marce",
        "Mati"
    ],
    "fr": [
        "Claude",
        "Dominique",
        "Claude",
        "Camille",
        "Charlie",
        "Florence",
        "Francis",
        "Maxime",
        "Remy",
        "Cécile"
    ],
    "de": ["Mika"]
}
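One possible way to use such a dictionary is sketched below: every distinct real name gets a pseudonym from the list of the given language, and repeated names reuse the same pseudonym. The function assign_pseudonyms, the shortened name lists and the fallback to "fr" are assumptions for illustration only; mailcom may select pseudonyms differently.

```python
# Shortened, illustrative name lists in the same shape as pseudo_first_names.
PSEUDO_FIRST_NAMES = {
    "es": ["José", "Angel"],
    "fr": ["Claude", "Dominique"],
    "de": ["Mika"],
}


def assign_pseudonyms(names, lang, pseudo_first_names, fallback_lang="fr"):
    """Map each distinct real name to a pseudonym for the given language."""
    # assumption: fall back to another language if none is configured for lang
    candidates = pseudo_first_names.get(lang) or pseudo_first_names[fallback_lang]
    mapping = {}
    for name in names:
        if name not in mapping:
            # cycle through the candidates; they repeat once the list is used up
            mapping[name] = candidates[len(mapping) % len(candidates)]
    return mapping


mapping = assign_pseudonyms(["Alice", "Bob", "Alice", "Carol"], "es", PSEUDO_FIRST_NAMES)
print(mapping)
```

Note that with a short list, different real names can end up with the same pseudonym, which is one reason to provide enough names per language.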
Reading input data
We currently support two types of input data: (1) a csv file and (2) a directory of eml and/or html files.
Each row of the csv file, eml file, or html file will be stored in a dictionary with pre-defined keys: content, date, attachment, attachement type and subject. Dictionaries of eml and html files have an additional key named file_name.
Of these pre-defined keys, only content will be processed through the pipeline; all other information is retained as is.
Reading from a csv file
When loading a csv file as input, a list of the columns in the file that map to the above pre-defined keys should be provided, in the correct order.
Example of correct matching:
pre-defined keys init_data_fields = content, date, attachment, attachement type, subject
matching columns col_names = message, time, attachement, attachement_type, subject
If there are fewer columns in the csv than pre-defined keys, the remaining pre-defined keys will be set to None in the processing, for instance:
pre-defined keys init_data_fields = content, date, attachment
matching columns col_names = message, time
The input data dictionary for each row in this case is saved like this:
row_data = {
    "content": row["message"],
    "date": row["time"],
    "attachment": None
}
If there are more columns than pre-defined keys, the extra columns are stored in the data dictionary without modification.
pre-defined keys init_data_fields = content, date
matching columns col_names = message, time, summary
The data dictionary for each row in this case is:
row_data = {
    "content": row["message"],
    "date": row["time"],
    "summary": row["summary"]
}
If a column name intended to match a pre-defined key is misspelled, a string label is stored for that key instead. This label is specified by the csv_col_unmatched_keyword in the configuration file. By default, this keyword is set to "unmatched", but it can be updated by the user by modifying the configuration file or passing a new value to this keyword.
pre-defined keys init_data_fields = content, date
matching columns col_names = message, tiem_with_typo
Assuming that the column tiem_with_typo does not exist in the csv file, the data dictionary for each row in this case is:
row_data = {
    "content": row["message"],
    "date": "unmatched"
}
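The three matching rules above can be summarized in a small sketch that works on a single row given as a dictionary. The function build_row_data is hypothetical and only mirrors the behaviour described in this section; it is not the mailcom implementation.

```python
def build_row_data(row, col_names, init_data_fields, unmatched_keyword="unmatched"):
    """Map the columns of one csv row to the pre-defined keys, in order."""
    row_data = {}
    for idx, key in enumerate(init_data_fields):
        if idx >= len(col_names):
            # fewer columns than pre-defined keys: remaining keys are None
            row_data[key] = None
        elif col_names[idx] in row:
            row_data[key] = row[col_names[idx]]
        else:
            # the column does not exist in the csv, e.g. because of a typo
            row_data[key] = unmatched_keyword
    for col in col_names[len(init_data_fields):]:
        # more columns than pre-defined keys: keep extra columns unmodified
        if col in row:
            row_data[col] = row[col]
    return row_data


row = {"message": "Hola", "time": "2024-04-17", "summary": "greeting"}
fewer = build_row_data(row, ["message", "time"], ["content", "date", "attachment"])
extra = build_row_data(row, ["message", "time", "summary"], ["content", "date"])
typo = build_row_data(row, ["message", "tiem_with_typo"], ["content", "date"])
```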
The examples below demonstrate the input options and the resulting behaviour of mailcom when processing csv files.
[ ]:
# path to your csv file - change this to your own file
input_csv = "../../../data/in/sample_data.csv"
# the columns of the csv that should be passed through the processing pipeline/retained in the pipeline
matching_columns = ["message", "date", "attachment", "attachement_type", "subject"]
# the predefined keys that should be used to match these columns, in the correct order
pre_defined_keys = ["content", "date", "attachment", "attachement_type", "subject"]
# what to call any columns that are not matched to pre-defined keys
unmatched_keyword = "unmatched"
# or get the unmatched keyword from the workflow settings
unmatched_keyword = workflow_settings.get("csv_col_unmatched_keyword")
input_handler = mailcom.get_input_handler(in_path=input_csv, in_type="csv",
                                          col_names=matching_columns,
                                          init_data_fields=pre_defined_keys,
                                          unmatched_keyword=unmatched_keyword)
In the cell above, the message column from the csv file is mapped to the content key in the email dictionary.
[ ]:
pp.pprint(input_handler.email_list[0])
[ ]:
# path to your csv file - change this to your own file
input_csv = "../../../data/in/sample_data.csv"
# the columns of the csv that should be passed through the processing pipeline/retained in the pipeline
matching_columns = ["message", "date"]
# the predefined keys that should be used to match these columns, in the correct order
pre_defined_keys = ["content", "date"]
# what to call any columns that are not matched to pre-defined keys
unmatched_keyword = "unmatched"
# or get the unmatched keyword from the workflow settings
unmatched_keyword = workflow_settings.get("csv_col_unmatched_keyword")
input_handler = mailcom.get_input_handler(in_path=input_csv, in_type="csv",
                                          col_names=matching_columns,
                                          init_data_fields=pre_defined_keys,
                                          unmatched_keyword=unmatched_keyword)
[ ]:
pp.pprint(input_handler.email_list[0])
Here, we have asked the input handler only to match two of the columns, so the other columns are discarded.
Reading eml/html files from a directory
Below, the input files are loaded from the given input_dir directory into an input handler. You can provide relative or absolute paths to the directory that contains your eml or html files. All files of the eml or html file type in that directory will be considered input files.
[ ]:
# import files from input_dir - change this to your own directory
input_dir = "../../../data/in/"
input_handler = mailcom.get_input_handler(in_path=input_dir, in_type="dir")
The data is then loaded into the same dictionary structure used for the csv input file, with the addition of a file_name key.
[ ]:
pp.pprint(input_handler.email_list[0])
Processing of the data
In the cell below, the emails are looped over and the email content is processed. Depending on the settings, each “content” goes through the following steps:
1. language detection (optional)
2. date-time detection (optional)
3. email address pseudonymization (optional)
4. named entity pseudonymization
5. number pseudonymization (optional)
For steps 3-5, the email content is divided into sentences, which are then pseudonymized. The modified sentences are recombined into a text and stored in the email dictionary under the key "pseudo_content".
[ ]:
# process the input data
mailcom.process_data(input_handler.get_email_list(), workflow_settings)
Once all the emails have been pseudonymized, the named entities found in the input text can be highlighted as follows:
[ ]:
# loop over mails and display the highlights
for email in input_handler.get_email_list():
    # get NE for each sentence in the email
    ne_sent_dict = {}
    for sent_idx, ne in zip(email["ne_sent"], email["ne_list"]):
        if str(sent_idx) not in ne_sent_dict:
            ne_sent_dict[str(sent_idx)] = []
        ne_sent_dict[str(sent_idx)].append(ne)
    # display original text and highlight found and replaced NEs
    html_content = []
    for sent_idx, sentence in enumerate(email["sentences_after_email"]):
        ne_list = ne_sent_dict.get(str(sent_idx), [])
        highlighted_html = highlight_ne_sent(sentence, ne_list)
        html_content.append(highlighted_html)
    display(HTML(" ".join(html_content)))
After this, the output can be written to a file or processed further. The output is a list of dictionaries, each containing the metadata of the email and the pseudonymized content. In the cell below, the output is saved in a pandas dataframe.
[ ]:
# write output to pandas df
df = pd.DataFrame(input_handler.get_email_list())
df.head(5)
The meanings of the added columns are:
cleaned_content - the text with extra newlines and leading and trailing whitespace removed;
lang - the language used to parse the emails (depends on your settings in the configuration file);
detected_datetime - the dates that were detected;
pseudo_content - the pseudonymized content of the processed text;
ne_list - the list of recognized named entities and their properties;
ne_sent - indices of sentences containing named entities;
sentences - a list of sentences of the text data, as detected by spaCy;
sentences_after_email - the list of sentences after replacing email addresses by [email].
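The relation between ne_list and ne_sent can be made explicit with a small sketch. It assumes, as the highlighting cell above does, that both lists run in parallel, so that the i-th entry of ne_sent is the sentence index of the i-th entity in ne_list. The email dictionary below is made-up example data and group_entities_by_sentence is a hypothetical helper.

```python
from collections import defaultdict

# Made-up example data in the shape of one processed email dictionary.
email = {
    "sentences": ["Michael lives in Paris.", "No entities here.", "He uses GitHub."],
    "ne_list": [
        {"entity_group": "PER", "word": "Michael", "start": 0, "end": 7},
        {"entity_group": "LOC", "word": "Paris", "start": 17, "end": 22},
        {"entity_group": "ORG", "word": "GitHub", "start": 8, "end": 14},
    ],
    "ne_sent": [0, 0, 2],
}


def group_entities_by_sentence(ne_sent, ne_list):
    """Collect the entities of each sentence under its sentence index."""
    grouped = defaultdict(list)
    for sent_idx, ne in zip(ne_sent, ne_list):
        grouped[sent_idx].append(ne)
    return dict(grouped)


grouped = group_entities_by_sentence(email["ne_sent"], email["ne_list"])
for sent_idx, sentence in enumerate(email["sentences"]):
    words = [ne["word"] for ne in grouped.get(sent_idx, [])]
    print(sent_idx, sentence, words)
```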
The output can be saved as a csv file as well.
[ ]:
# set overwrite to True to overwrite the existing file
mailcom.write_output_data(input_handler, "../../../data/out/out_demo.csv", overwrite=True)