Documentation, manual, and code for the de-identification and anonymization of MCEC email data, guided by the HIPAA Safe Harbor method and FERPA regulations.
- MCEC De-Identification Tool (MCEC DeID)
This repository is part of the ecosystem of the Multilingual College Email Corpus Project. It contains source code and documentation for the de-identification and anonymization of MCEC research data. Please note that we are pacing the public release of our source code and documentation, so you may find some sections of this documentation incomplete. If you have any questions, please email the research team at:
mcec.team [at] gmail
The source code and documentation are described below. Please follow the links there for the full content.
The src directory inside this repository hosts the code used to automatically aid the de-identification and anonymization of MCEC data, along with its documentation. The MCEC Team uses this code to process email text files and to help detect Safe Harbor and FERPA-defined identifiers, but it is not the only resource for de-identification and anonymization. The main de-identification and anonymization processes are carried out manually by the MCEC Team in various stages. For more information about the de-identification and anonymization processes as well as the MCEC data collection workflow, please refer to section (C) below.
There are two main programs organized as Python modules:
- RAD-ID: A rule-based algorithm designed to be the first pass for data de-identification. It deals mostly with direct identifiers.
- AnonyM: A machine-learning algorithm used to ensure that manual de-identification was successful. This is meant to be the first check after data has been anonymized. The second check is a manual check.
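To give a rough idea of what a rule-based first pass over direct identifiers can look like, here is a minimal sketch. It is not the RAD-ID implementation: the patterns, the tag names (`email-address`, `phone-number`), and the function `first_pass` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns and tag names, for illustration only.
# They are NOT the rules or tags used by RAD-ID.
PATTERNS = {
    "email-address": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone-number": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def first_pass(text: str) -> str:
    """Replace every match of each pattern with a [[tag]] placeholder."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[[{tag}]]", text)
    return text


print(first_pass("Write to susan@example.edu or call 520-555-0123."))
```

A pass like this only catches identifiers with a predictable shape, which is one reason the manual stages described above remain the main part of the process.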
For the most part, email format changes are kept to a minimum, except in the case of email footers, which often contain more personal information than the entire body of the email (Su, Romero Diaz & Jia, Forthcoming).
Please note that all of our software is written in Python (3.9+). If you wish to contribute, please read the Contribute section in this document.
This is the manual exclusive to the de-identification and anonymization process of the MCEC Project and should not be confused with the MCEC Annotation Manual, which is currently under development. For examples of corpus annotation manuals, please see the ICAME Corpus manuals.
The MCEC Data Stewardship Documentation outlines how the MCEC Team manages data quality, integrity, accessibility, and security from collection to publication. It also contains the workflows that integrate the MCEC-DeID automated checks (RAD-ID and AnonyM) with the manual checks for data de-identification and anonymization. The last section of the MCEC Data Stewardship Documentation contains a list of resources related to everything contained in this repository.
A language corpus is an organized collection of text, usually around a specific topic. Language corpora (plural of corpus) are used by researchers around the world to understand different language systems and human communication in general, as well as for technological applications. The goal of the Multilingual College Email Corpus (MCEC) is to help increase our understanding of communication between students and instructors. Many researchers have studied the characteristics of academic email language; however, their data has remained private. As a result, findings are not comparable across studies, and the data cannot be used for future diachronic studies (Romero Diaz, forthcoming). For these reasons, the MCEC aims to be an open-access, annotated corpus, as are many publicly available language corpora. This means releasing human-created written data, which, unlike video or audio, does not in principle carry readily available identifying information. However, the contents of such data may contain identifiable information, which is why de-identification and anonymization are necessary.
The release of written human data is permitted under University and federal regulations with certain restrictions:
"Entities releasing data should apply a consistent de-identification strategy to all of their data releases of a similar type (e.g., tabular and individual level data) and similar sensitivity level. It is advised that organizations document their data reporting rules in the documents describing their data reporting policies and privacy protection practices, such as a Data Governance Manual. (See PTAC’s Data Governance and Stewardship brief at https://studentprivacy.ed.gov/resources/issue-brief-data-governance-and-stewardship for more information on best practices in data governance.)" (Data De-identification: An Overview of Basic Terms, Privacy Technical Assistance Center, 2021-02-09)
The MCEC Project is an IRB-approved research project at the University of Arizona. As such, there are measures in place to exclude personally identifiable information (PII) from any published research data. Two of those measures are data de-identification and data anonymization.
The Directors of the MCEC Project have determined that data must first be de-identified by PIs and co-PIs for the internal use of the MCEC Research Team, who will then follow the MCEC-defined processes to anonymize the data and prepare it for public release.
Both the MCEC Research Team and the University of Arizona are committed to protecting participants' privacy and confidentiality.
Yes.
Not only does the University of Arizona Human Subjects Protection Program require researchers to keep a record of how participant data is processed and protected, but there are also U.S. regulations that require this documentation. For researchers:
"The importance of documentation for which values in health data correspond to PHI, as well as the systems that manage PHI, for the de-identification process cannot be overstated. Esoteric notation, such as acronyms whose meaning are known to only a select few employees of a covered entity, and incomplete description may lead those overseeing a de-identification procedure to unnecessarily redact information or to fail to redact when necessary. When sufficient documentation is provided, it is straightforward to redact the appropriate fields." Preparation for De-identification, HIPAA, 2021/02/07
The MCEC Team is committed to protecting all personal data and FERPA-protected school records, even if this means lowering the usability of the public version of the MCEC corpus. Our participants' privacy comes first.
As with most academic language corpora, the MCEC keeps two versions of the relevant language data:
A) An internal version that is only available to the research team. This version is de-identified, and only the MCEC co-directors Damian Romero and Hanyu Jia have access to the de-identification table (see our MCEC Keys (Codes) documentation).
B) A second version, which will be public and anonymized. This anonymized data will be published via the Research Data Repository of the University of Arizona (also known as ReDATA). There is currently no time frame for when the MCEC data will be published.
There are a few reasons why the MCEC project does not anonymize participant data immediately upon collection. First, we wish to give our participants the opportunity to retract their data from any unpublished materials. Second, the University of Arizona Human Subjects Protection Program requires researchers to retain participant records for a number of years (see the Data Security and Records Retention document). Finally, in special cases, certain U.S. government officials may require us to disclose relevant records, for example if they are needed for law enforcement purposes. For more information, please visit the UArizona HSPP page as well as the Office of the Registrar's FERPA page.
If you have any questions about the MCEC Project's data collection, de-identification and anonymization process, or making email text publicly available, the best way to get started is to consult our Publishing Email Data document.
The Multilingual College Email Corpus works under a repository-wide batch de-identification modality, which means that the whole MCEC dataset is de-identified soon after new data is obtained (usually each academic term). The de-identification process consists mainly of redacting identifiers (direct and indirect) from the data and replacing them with tags. The redaction process is supplemented by the suppression of information whenever the Research Team deems it necessary. The MCEC Team reserves the right to exclude any research data from publication, for example if it contains confidential information or if the relevant referents cannot be satisfactorily anonymized.
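The batch modality described above can be pictured with a short sketch: every raw file in a folder is read, passed through a redaction step, and written out under a neutral name. This is an illustration under stated assumptions, not the MCEC pipeline. The function `deidentify_batch`, its parameters, and the output naming scheme are hypothetical, and `transform` stands in for whatever combination of automated and manual redaction is actually applied.

```python
from pathlib import Path
from typing import Callable


def deidentify_batch(
    raw_dir: Path,
    out_dir: Path,
    transform: Callable[[str], str],
) -> dict[str, str]:
    """Apply `transform` to every .txt file in raw_dir and write the result
    to out_dir under a neutral, numbered file name.

    Returns a mapping from original to new file names. Such a mapping acts
    as a key, so it would need the same protection as the raw data.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    renamed: dict[str, str] = {}
    # Sort the inputs so that numbering is reproducible between runs.
    for index, source in enumerate(sorted(raw_dir.glob("*.txt")), start=1):
        target = out_dir / f"de-identified-file-name_{index}.txt"
        text = source.read_text(encoding="utf-8")
        target.write_text(transform(text), encoding="utf-8")
        renamed[source.name] = target.name
    return renamed
```

Processing the whole folder in one call mirrors the idea that each new term's data is de-identified together rather than file by file.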
We base our de-identification methodology on two main resources: the Health Insurance Portability and Accountability Act (HIPAA), and the Family Educational Rights and Privacy Act (FERPA). We use these two resources because they are well-known academic standards on data privacy.
While both HIPAA and FERPA define what data counts as personally identifiable information (PII), HIPAA provides a clear de-identification method known as the Safe Harbor method. Additionally, FERPA regulates students' education records, which are a large part of the MCEC project's records since student academic emails are considered education records by the University of Arizona.
To learn more about identifiers, please refer to our Terms and Definitions document.
As mentioned above, the MCEC Project uses the Safe Harbor method to de-identify the data. This is done by substituting direct and indirect identifiers with tags. See the fictional example below:
<<Document name: "Susan-to-Kirke-1.txt">>
Dear Professor Kirke,
My classmate Peter and I have finished our English homework. Where can we turn it in?
Sincerely,
Susan
After applying the Safe Harbor method using an algorithm followed by manual human de-identification, the example above is transformed into:
<<Document name: "de-identified-file-name_1.txt">>
Dear [[last-name[instructor]]],
My classmate [[first-name[student][2]]] and I have finished our [[subject-name]] homework. Where can we turn it in?
Sincerely,
[[first-name[student][1]]]
While the data in the second example is certainly more secure, it is still not ready to be made publicly available as part of the MCEC. For that, a final automated (machine-learning) check and a manual human check are carried out. Then, the de-identified-file-name is anonymized so that the email cannot be linked back to the original student record system or to other emails produced by the same student:
<<Document name: "anonymized-file-name_1.txt">>
Dear [[last-name[instructor]]],
My classmate [[first-name[student][2]]] and I have finished our [[subject-name]] homework. Where can we turn it in?
Sincerely,
[[first-name[student][1]]]
In the example above, notice that the file name has changed from <<de-identified-file-name.txt>> to <<anonymized-file-name.txt>>.
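One way to obtain file names that cannot be linked to a student or to other emails is to draw them at random. The sketch below assumes that approach. The source does not state how anonymized names are generated, so the function names, the `mcec` prefix, and the use of Python's `secrets` module are all hypothetical choices.

```python
import secrets


def anonymized_name(prefix: str = "mcec", n_bytes: int = 8) -> str:
    """Return a random file name that carries no trace of the original."""
    return f"{prefix}-{secrets.token_hex(n_bytes)}.txt"


def rename_all(file_names: list[str]) -> dict[str, str]:
    """Map each de-identified file name to a fresh, unique random name."""
    mapping: dict[str, str] = {}
    used: set[str] = set()
    for old in file_names:
        new = anonymized_name()
        while new in used:  # collisions are very unlikely, but stay safe
            new = anonymized_name()
        used.add(new)
        mapping[old] = new
    return mapping
```

Because the names are random rather than derived from the old names, two emails by the same student receive unrelated names. Note that the returned mapping itself links old names to new ones, so whether it is kept, restricted, or destroyed is a policy decision this sketch does not address.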
The whole process/workflow from data collection to anonymization is described in our Publishing Email Data documentation.
For a definition of what a tag is in the context of the MCEC DeID as well as the types of tags used by the MCEC Team and the tagging style manual, please read the Tags document.
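As a rough illustration of how tags replace identifiers, the sketch below reproduces the fictional Susan and Kirke example from this document. It assumes that a reviewer has already listed the identifiers in an email together with their tags. The lookup table `names` and the function `tag_names` are hypothetical and do not describe the actual MCEC tooling.

```python
import re


def tag_names(text: str, names: dict[str, str]) -> str:
    """Replace each known identifier with its tag, matching whole words only."""
    for name, tag in names.items():
        pattern = rf"\b{re.escape(name)}\b"
        # A function replacement inserts the tag literally, with no
        # backslash or group-reference processing.
        text = re.sub(pattern, lambda _match, t=tag: t, text)
    return text


email = (
    "Dear Professor Kirke,\n"
    "My classmate Peter and I have finished our English homework. "
    "Where can we turn it in?\n"
    "Sincerely,\n"
    "Susan"
)

# Hypothetical lookup table for this one fictional email.
names = {
    "Professor Kirke": "[[last-name[instructor]]]",
    "Peter": "[[first-name[student][2]]]",
    "English": "[[subject-name]]",
    "Susan": "[[first-name[student][1]]]",
}

print(tag_names(email, names))
```

The numeric index inside the student tags keeps different people distinguishable within one email without revealing who they are.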
The following documents map MCEC identifiers to their corresponding tags:
MCEC-specific tags (incorporates the FERPA-specific tags)
Sub-tags (both for HIPAA and MCEC-specific)
.
├── AUTHORS.md
├── LICENSE
├── README.md
├── bin <- Binary code (not tracked by git)
├── config <- Configuration files
├── data
│ ├── external <- Data from third party sources [ENRON corpus, etc]
│ ├── interim <- Data examples made of fictitious emails
│ ├── processed <- (not tracked by git) The final de-identified and anonymized data
│ └── raw <- (not tracked by git) Original data to be de-identified
├── docs <- Documentation
├── notebooks <- Ipython or R notebooks
├── reports <- For a manuscript source, e.g., LaTeX, Markdown, etc., or any project reports
│ └── figures <- Figures for manuscripts or reports
└── src <- Source code for this project
    ├── data <- Scripts and programs to process data
    ├── external <- Any external source code, e.g., pull other git projects, or external libraries
    ├── models <- Source code for your own model, e.g., word-vectors
    ├── tools <- Any helper scripts go here
    └── visualization <- Scripts for visualizing results
v0.0.1:
- Start MCEC DeID Repository
- Project organization
- Basic documentation
- LICENSE
- README.md
If you or your department is interested in being part of the MCEC project, please contact the MCEC team (mcec.team [at] gmail), and someone will put you in direct contact with our Outreach Coordinator, Miss Wei Xu.
The code inside this project is licensed under the MIT License - see the LICENSE file for details.
The documentation is currently not licensed. Any use of the documentation contained in this repository is currently disallowed until this temporary notice has been erased. If you are seeing this, then you do not have permission to use or reproduce the documentation in this repository in any way.
The project structure for this repository was generated using the Reproducible Science cookiecutter boilerplate by Mario Krapp (2016).