ADP - Machine Learning Automation

During my internship at ADP, I developed a Natural Language Processing (NLP) solution using a BERT model to automatically extract and organize information from bulk-imported documents. This project was one of ADP’s first machine learning-based solutions integrated into their product suite, showcasing my ability to innovate and deliver impactful solutions.

Key Elements

Natural Language Processing

Transformers

Tokenization and Preparation of the Dataset

Training & Testing BERT Model

Scaling Containerized Training with Kubernetes

Statistical Analysis of the Model's Performance

ADP Illustration 1
ADP Illustration 2

What I Learned

Using BERT (Bidirectional Encoder Representations from Transformers), an NLP model, we can extract key information from bulk-imported documents and use it to categorize and sort the documents uploaded by ADP clients.

Once fine-tuned for NER (Named Entity Recognition), BERT can tag the different types of entities in a text sequence, which lets us extract the fields we want. To do so, we have to label our own dataset.
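To make the labeling step concrete, here is a minimal sketch of BIO tagging, a common scheme for NER datasets. The sentence and the PERSON and DATE labels are hypothetical examples, not the label set used at ADP.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start_word, end_word_exclusive, label).
Span = Tuple[int, int, str]


def spans_to_bio(num_words: int, spans: List[Span]) -> List[str]:
    """Turn labeled word spans into one BIO tag per word."""
    tags = ["O"] * num_words
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining words of the entity
    return tags


def bio_to_entities(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO tags back into (label, text) pairs."""
    entities: List[Tuple[str, str]] = []
    label, buffer = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("I-") and tag[2:] == label:
            buffer.append(word)         # entity continues
            continue
        if label is not None:           # close the entity in progress
            entities.append((label, " ".join(buffer)))
        if tag == "O":
            label, buffer = None, []
        else:                           # a new entity starts here
            label, buffer = tag[2:], [word]
    if label is not None:
        entities.append((label, " ".join(buffer)))
    return entities


# Hypothetical example sentence and annotations.
words = "Invoice issued to Jane Doe on 2021-06-30".split()
spans = [(3, 5, "PERSON"), (6, 7, "DATE")]
tags = spans_to_bio(len(words), spans)
entities = bio_to_entities(words, tags)
print(tags)
print(entities)
```

The model learns to predict one tag per token, and the second function shows how predicted tags can be turned back into the extracted values.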

Tokenization is used to prepare the training and testing datasets.
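BERT uses WordPiece tokenization, which splits words into subword pieces, so word-level NER labels have to be aligned to those pieces. Below is a simplified sketch of both steps. The vocabulary and label ids are hypothetical, and a real tokenizer also handles punctuation and normalization. Using -100 for positions to skip follows a common convention in PyTorch-based training code.

```python
from typing import List, Set, Tuple

IGNORE_INDEX = -100  # label for positions the loss should skip

# Tiny hypothetical vocabulary; "##" marks a piece that continues a word.
VOCAB: Set[str] = {"[UNK]", "pay", "##roll", "state", "##ment", "for", "jane", "doe"}


def wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:               # no piece matches: whole word is unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def tokenize_and_align(
    words: List[str], labels: List[int], vocab: Set[str]
) -> Tuple[List[str], List[int]]:
    """Tokenize words and give each word's label to its first piece only."""
    tokens, aligned = ["[CLS]"], [IGNORE_INDEX]
    for word, label in zip(words, labels):
        pieces = wordpiece(word.lower(), vocab)
        tokens.extend(pieces)
        aligned.extend([label] + [IGNORE_INDEX] * (len(pieces) - 1))
    tokens.append("[SEP]")
    aligned.append(IGNORE_INDEX)
    return tokens, aligned


# Hypothetical label ids: 0 = O, 1 = B-PERSON, 2 = I-PERSON.
words = ["Payroll", "statement", "for", "Jane", "Doe"]
labels = [0, 0, 0, 1, 2]
tokens, aligned = tokenize_and_align(words, labels, VOCAB)
print(tokens)
print(aligned)
```

Labeling only the first piece of each word is one common choice; another is to repeat the word's label on every piece.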

We fine-tune the pre-trained model on our custom labeled dataset, building on the MLM (Masked Language Modeling) objective BERT was pre-trained with, to produce an NER-capable model with high enough accuracy.
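For context, MLM is the objective BERT is pre-trained with: about 15% of tokens are selected, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged, while the model learns to predict the originals. The following is a minimal sketch of that masking step only, using a hypothetical vocabulary. It is not the training code used in the project.

```python
import random
from typing import List, Optional, Tuple

SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}

# Hypothetical vocabulary used when swapping in a random token.
VOCAB = ["pay", "##roll", "state", "##ment", "for", "jane", "doe"]


def mask_tokens(
    tokens: List[str],
    vocab: List[str],
    rng: random.Random,
    select_prob: float = 0.15,
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (masked inputs, labels); a label is None where nothing is predicted."""
    inputs = list(tokens)
    labels: List[Optional[str]] = []
    for i, token in enumerate(tokens):
        if token in SPECIAL or rng.random() >= select_prob:
            labels.append(None)         # position is not part of the MLM loss
            continue
        labels.append(token)            # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"        # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


tokens = ["[CLS]", "pay", "##roll", "for", "jane", "doe", "[SEP]"]
inputs, labels = mask_tokens(tokens, VOCAB, random.Random(0))
print(inputs)
print(labels)
```

Fine-tuning for NER then replaces this objective with per-token tag prediction on the labeled dataset.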

In our case, accuracy reached ~93%.
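How the ~93% figure was computed is not detailed here, so the following is only a sketch of two common ways to score an NER model: per-tag accuracy, and entity-level precision, recall, and F1, where an entity counts as correct only if its span and label both match. The gold and predicted tags are made-up examples.

```python
from typing import Dict, List, Set, Tuple


def extract_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, label) spans from a BIO tag sequence."""
    spans: Set[Tuple[int, int, str]] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last entity
        if tag.startswith("I-") and tag[2:] == label:
            continue                        # entity continues
        if label is not None:
            spans.add((start, i, label))    # close the entity in progress
        if tag == "O":
            start, label = None, None
        else:
            start, label = i, tag[2:]
    return spans


def evaluate(gold: List[str], pred: List[str]) -> Dict[str, float]:
    """Per-tag accuracy plus entity-level precision, recall, and F1."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up example: the model misses the second word of the PERSON entity.
gold = ["O", "B-PERSON", "I-PERSON", "O", "B-DATE"]
pred = ["O", "B-PERSON", "O", "O", "B-DATE"]
scores = evaluate(gold, pred)
print(scores)
```

Per-tag accuracy can look high when most tokens are "O", which is why entity-level scores are often reported alongside it.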