Background
Personally identifiable information (PII) is sensitive information that can be used to identify or locate an individual. Protecting PII in the data science world is important for maintaining the privacy and security of individuals, and essential for compliance with data privacy laws and regulations. The growing number of regulations makes it difficult to begin building an application without first knowing all of the protocols to follow.
Several methods can be used to anonymize PII, including masking, hashing, and encryption and decryption. Each method has its own strengths and limitations. The appropriate method to use depends on the specific requirements and constraints of the dataset and the use case.
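As a rough illustration of the first two approaches, the sketch below implements masking (irreversible character substitution) and hashing (deterministic one-way transformation) using only Python's standard library. The function names are illustrative, not the reference kit's actual utility functions.

```python
import hashlib

def mask_value(value: str, keep_last: int = 2, mask_char: str = "*") -> str:
    """Masking: irreversibly hide all but the last few characters."""
    if len(value) <= keep_last:
        return mask_char * len(value)
    return mask_char * (len(value) - keep_last) + value[-keep_last:]

def hash_value(value: str, salt: str = "") -> str:
    """Hashing: a deterministic one-way transform. Equal inputs map to
    equal outputs, so joins across tables still work, but the original
    value cannot be recovered from the digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

print(mask_value("555-867-5309"))       # → **********09
print(hash_value("alice@example.com"))  # 64-character hex digest
```

Masking suits display or logging scenarios where the value is never needed again; hashing suits analytics where records must still be linkable.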
Data scientists can minimize privacy challenges in the design and development stage, well before production. Speeding up the data pipeline and the extract, transform, and load (ETL) process is critical to scaling AI solutions.
Solution
In collaboration with Accenture*, Intel developed this AI data protection reference kit. This kit may help customers develop PII anonymization utility functions, which include methods for masking, hashing, and encrypting and decrypting the PII in large datasets (such as names, IP addresses, and phone numbers).
To anonymize data fields that contain names, a random-name-generator recurrent neural network (RNN) model, stored in pickled format, generates realistic synthetic names. A pretrained BERT model is used for named-entity recognition (NER) in free-flowing text, and the identified entities are then masked using the available obfuscation methods. This reference implementation considers the following entity tags when masking PII datasets:
- PER: Person name
- LOC: Location name
- ORG: Organization name
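The masking step can be sketched as follows. To stay self-contained, the example hard-codes entity spans in the `{entity_group, start, end}` shape that a Hugging Face token-classification pipeline emits; in the actual kit, those spans would come from the pretrained BERT NER model.

```python
def mask_entities(text, entities, tags=("PER", "LOC", "ORG"), mask_char="*"):
    """Replace each recognized entity span with mask characters.
    Spans are applied right-to-left so earlier offsets stay valid."""
    out = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["entity_group"] in tags:
            out = (out[:ent["start"]]
                   + mask_char * (ent["end"] - ent["start"])
                   + out[ent["end"]:])
    return out

text = "Alice Smith works for Intel in Santa Clara."
# Hard-coded stand-in for real BERT NER output:
entities = [
    {"entity_group": "PER", "start": 0, "end": 11},
    {"entity_group": "ORG", "start": 22, "end": 27},
    {"entity_group": "LOC", "start": 31, "end": 42},
]
print(mask_entities(text, entities))
# → *********** works for ***** in ***********.
```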
End-to-End Flow Using Intel® AI Software Products
This reference kit includes:
- Training data
- An open source, trained model
- Libraries
- User guides
- Intel® AI software products
At a Glance
- Industry: Cross-industry
- Task: Mask PII using random generation functions for strings and numeric characters
- Dataset: Random dataset generator script to produce random PII
- Type of Learning: Deep learning
- Models: Pretrained BERT model, RNN
- Output: A .csv file with anonymized PII and, when encryption is used, a JSON file to aid the decryption process
- Intel AI Software Products:
- Intel® Extension for PyTorch* v1.13.0
- Intel® Distribution for Python* (specifically the optimizations for NumPy and SciPy)
- Intel® Distribution of Modin*
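The encryption path above (anonymized values in one file, key material in a separate JSON file that enables decryption) can be illustrated with a minimal stdlib-only sketch. The XOR one-time-pad scheme below is purely illustrative; a real deployment would use a vetted library such as `cryptography`, and the function names are not the kit's actual utilities.

```python
import base64
import json
import secrets

def encrypt_value(value: str) -> tuple[str, str]:
    """Illustrative reversible scheme: XOR the bytes with a random
    one-time key, returning (ciphertext, key) as base64 strings."""
    data = value.encode("utf-8")
    key = secrets.token_bytes(len(data))
    cipher = bytes(d ^ k for d, k in zip(data, key))
    return base64.b64encode(cipher).decode(), base64.b64encode(key).decode()

def decrypt_value(cipher_b64: str, key_b64: str) -> str:
    cipher, key = base64.b64decode(cipher_b64), base64.b64decode(key_b64)
    return bytes(c ^ k for c, k in zip(cipher, key)).decode("utf-8")

record = {"phone": "555-867-5309"}
encrypted, keys = {}, {}
for field, value in record.items():
    encrypted[field], keys[field] = encrypt_value(value)

# Encrypted values go to the anonymized output; the key material goes
# to a separate JSON file that enables later decryption.
key_json = json.dumps(keys)
restored = decrypt_value(encrypted["phone"], json.loads(key_json)["phone"])
print(restored)  # → 555-867-5309
```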
Technology
Optimized with Intel AI Software Products for Better Performance
The AI structured data generation models were optimized by Intel Extension for PyTorch, Intel Distribution of Modin, and Intel Distribution for Python (specifically the optimizations for NumPy and SciPy).
Intel Extension for PyTorch, Intel Distribution of Modin, and Intel Distribution for Python allow you to reuse your model development code, with minimal changes, for both training and inference.
Performance benchmark tests were run on Microsoft Azure* Standard_D8_v5 instances with 3rd generation Intel® Xeon® processors to evaluate the optimized solution.
Benefits
Protecting PII in the data science world is important for maintaining the privacy and security of individuals, and essential for compliance with data privacy laws and regulations.
This reference kit provides PII anonymization utility functions, which include methods for masking, hashing, and encrypting and decrypting the PII in large datasets (such as names, IP addresses, and phone numbers).
With Intel® oneAPI toolkits, little to no code change is required to attain the performance boost.
Related Reference Kits
Additional Resources