Business Results

  • Up to 63% faster inference of a BERT-based NER model

  • Up to 135% faster performance of the read_csv() API

  • With Intel® Distribution of Modin*, the ability to process terabytes of data on a single workstation, scale from that workstation to the cloud using the same code, and focus more on data analysis and less on learning new APIs

Background

Personally identifiable information (PII) is sensitive information that can be used to identify or locate an individual. Protecting PII in the data science world is important to maintain the privacy and security of individuals, and essential for compliance with data privacy laws and regulations. The increasing number of regulations is making it difficult to start building an application without knowing all the protocols to follow.

Several methods can be used to anonymize PII, including masking, hashing, and encryption and decryption. Each method has its own strengths and limitations. The appropriate method to use depends on the specific requirements and constraints of the dataset and the use case.
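As a rough illustration of the difference between two of these methods, the sketch below masks a value by swapping each letter or digit for a random one of the same class, and hashes a value with a salted SHA-256 digest. It uses only the Python standard library. The names `mask_value` and `hash_value`, the salt, and the sample values are hypothetical and are not taken from the reference kit.

```python
import hashlib
import random
import string


def mask_value(value: str, rng: random.Random) -> str:
    """Replace each letter or digit with a random one of the same class.

    Punctuation and spacing are kept, so separators such as the dots in an
    IP address or the dashes in a phone number survive.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)
    return "".join(out)


def hash_value(value: str, salt: bytes) -> str:
    """One-way pseudonym: the same input and salt always give the same digest."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


# Hypothetical sample values, for illustration only.
rng = random.Random()
masked_phone = mask_value("555-867-5309", rng)
hashed_ip = hash_value("192.168.0.1", salt=b"example-salt")
```

Masking keeps the shape of the data but discards the original value. Hashing is deterministic, so equal inputs can still be joined or counted, but it cannot be reversed. When the original value must be recoverable, encryption is the method to use.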

Data scientists can minimize privacy challenges in the design and development stage, well before production. Speeding up the data pipeline, including extract, transform, and load (ETL), is critical to scaling AI solutions.

Solution

In collaboration with Accenture*, Intel developed this AI data protection reference kit. This kit may help customers develop PII anonymization utility functions, which include methods for masking, hashing, and encrypting and decrypting PII (such as names, IP addresses, and phone numbers) in large datasets.

To anonymize data fields that contain names, a random-name-generator recurrent neural network (RNN) model, stored in a pickled format, generates realistic synthetic names. A pretrained BERT model is used for named-entity recognition (NER) in free-flowing text. The identified entities are then masked using the available obfuscation methods. This reference implementation considers the following tags when masking PII:

  • PER: Person name
  • LOC: Location name
  • ORG: Organization name
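
The masking step for these tags can be sketched as follows. This is a minimal example, not the kit's implementation. It assumes the NER model returns entities as dictionaries with `entity_group`, `start`, and `end` keys, a shape modeled on common token-classification output. The function `mask_entities`, the sample sentence, and the hand-written entity list are hypothetical stand-ins for real model output.

```python
from typing import Dict, List

# Tags treated as PII in this sketch (taken from the list above).
PII_TAGS = {"PER", "LOC", "ORG"}


def mask_entities(text: str, entities: List[Dict], mask_char: str = "*") -> str:
    """Overwrite every non-space character inside a PER/LOC/ORG span."""
    chars = list(text)
    for ent in entities:
        if ent["entity_group"] in PII_TAGS:
            for i in range(ent["start"], ent["end"]):
                if not chars[i].isspace():
                    chars[i] = mask_char
    return "".join(chars)


# Hand-written entities standing in for NER model output (assumption).
text = "Alice works at Acme in Paris."
entities = [
    {"entity_group": "PER", "start": 0, "end": 5},
    {"entity_group": "ORG", "start": 15, "end": 19},
    {"entity_group": "LOC", "start": 23, "end": 28},
]
masked = mask_entities(text, entities)
print(masked)  # ***** works at **** in *****.
```

Fixed-character masking is the simplest choice. The same spans could instead be replaced with synthetic values, such as a generated name for a PER span.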
     

End-to-End Flow Using Intel® AI Software Products


This reference kit includes:

  • Training data
  • An open source, trained model
  • Libraries
  • User guides
  • Intel® AI software products

At a Glance

  • Industry: Cross-industry
  • Task: Mask PII using random generation functions for strings and numeric characters
  • Dataset: Random dataset generator script to produce random PII
  • Type of Learning: Deep learning
  • Models: Pretrained BERT model, RNN
  • Output: A .csv file with anonymized PII and, only when encryption is used, a JSON file to aid the decryption process
  • Intel AI Software Products:
    • Intel® Extension for PyTorch* v1.13.0
    • Intel® Distribution for Python* (specifically the optimizations for NumPy and SciPy)
    • Intel® Distribution of Modin*

Technology

Optimized with Intel AI Software Products for Better Performance

The AI structured data generation models were optimized with Intel Extension for PyTorch, Intel Distribution of Modin, and Intel Distribution for Python (specifically the optimizations for NumPy and SciPy).

Intel Extension for PyTorch, Intel Distribution of Modin, and Intel Distribution for Python allow you to reuse your model development code with minimal code changes for training and inference.

To optimize the solution, performance benchmark tests were run on a Microsoft Azure* Standard_D8_v5 instance with 3rd generation Intel® Xeon® processors.

Benefits

Protecting PII in the data science world is important for maintaining the privacy and security of individuals and is essential for compliance with data privacy laws and regulations.

This reference kit provides PII anonymization utility functions, which include methods for masking, hashing, and encrypting and decrypting PII (such as names, IP addresses, and phone numbers) in large datasets.

With Intel® oneAPI toolkits, little to no code change is required to attain the performance boost.
