Using Machine Learning to Better Classify Email

By Gurinder Walia

The first time an email classifier helped me was when I was dealing with all the newsletters I had signed up for to get 10% off a new pair of jeans. I can still remember the joy of opening my inbox one morning, expecting to delete 20 newsletters while hunting for the important email buried among them. Yet when I logged on, they were nowhere to be found, tucked away after being tagged by an algorithm! We’re hoping to provide a little bit of this joy to our clients who tag emails, so we started an experiment to better understand machine learning and its application to SEDNA.

Improve Your Workflow Using Tags

Organizing email is an incredibly important part of anyone’s communication workflow. Whether we are looking at an email for the first time or returning to it after some time, it’s important to immediately understand its content and purpose so that it can be acted on accordingly. One method that our clients use to organize their email is category tags. With these, any team member can read an email and let the rest of the team know what it is about, providing context for how it fits into their workflow. We want to save our clients even more time, and are working on a way to use machine learning to intelligently help them tag their emails.

Email Classification Is Not a New Problem

There are many examples online of using machine learning to classify emails into categories. While these classifiers perform a task similar to what we were trying to accomplish, there is a distinct difference: these public models use universally known categories (such as personal or promotional, or topics such as politics, economics, and sports). Our clients create and use custom tags specifically for their workflows, so it became apparent that we would have to create individual machine learning models for each of them as well.

Data, Data, and More Data

There’s a phrase in computer science: “garbage in, garbage out”. It means that if a program’s inputs are faulty, its outputs will be as well. This phrase is perhaps even more important in the field of machine learning: if we train our model on poor data, it will be a poor model. Therefore, it was vital that we create excellent client-specific datasets in order to create great models. Our application is built on a relational database, so securely extracting the data was a matter of writing a query for the information we needed. This dataset included, amongst other “features”, the email contents themselves. Because our goal was to build our models with scikit-learn, we needed to turn this qualitative information into numbers. We did this using a “bag-of-words” vector model, which represents each email as a vector of word counts.
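
As a rough illustration of the bag-of-words step, the sketch below runs scikit-learn’s CountVectorizer over three made-up emails. The sample texts are invented for this example, and the default vectorizer settings are an assumption rather than the configuration we used.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up email bodies, for illustration only.
emails = [
    "Vessel arrival delayed by two days",
    "Invoice attached for vessel charter",
    "Please confirm the invoice amount",
]

# Each email becomes one row of word counts; each column is one vocabulary word.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# vocabulary_ maps each (lowercased) word to its column index.
print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

Word order is discarded in this representation: only how often each word appears in an email is kept.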

Model Creation

With the bag-of-words representation in place, we researched different scikit-learn estimators for multi-class classification. There is a fine line between trying too many models (so that one of them looks good on our data purely by chance) and not trying enough to find our best possible model. Ultimately, we tried between 20 and 30 models, which were various combinations of LinearSVC, GaussianNB, and RandomTree and their associated hyper-parameters. To try the different hyper-parameters and score our models via cross-validation, we used scikit-learn’s GridSearchCV class.
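
A hyper-parameter search of this kind can be sketched as follows. This is a minimal example under stated assumptions: the emails, the tag names, the candidate values of C, and the two-fold split are all invented for illustration, and only LinearSVC is searched.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented emails and tags, for illustration only.
emails = [
    "invoice attached for last month",
    "payment overdue on invoice",
    "please settle the outstanding invoice",
    "invoice amount is incorrect",
    "vessel arrival delayed at port",
    "vessel departed port this morning",
    "updated vessel schedule for next week",
    "port congestion delays vessel berthing",
    "meeting moved to friday afternoon",
    "can we reschedule the meeting",
    "agenda for the team meeting",
    "meeting notes from yesterday",
]
tags = ["billing"] * 4 + ["operations"] * 4 + ["internal"] * 4

# Vectorize and classify in one pipeline so each CV fold builds its own vocabulary.
pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("clf", LinearSVC()),
])

# Candidate hyper-parameters (assumed values, not the ones from our experiment).
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# cv=2 only because the toy dataset is tiny; real data would allow more folds.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(emails, tags)

print(search.best_params_)
print(search.best_score_)
print(search.predict(["question about an invoice"]))
```

Putting the vectorizer inside the pipeline keeps words from the held-out fold out of the training vocabulary, which makes the cross-validation score a fairer estimate.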

In machine learning, we have to be careful not to “overfit” our models to the data we are using to train them (our “training data”). An analogy that has helped us understand this phenomenon is studying for a test in school. Students often take practice tests, and it’s important that they use the practice tests to learn the material, rather than just memorizing the answers. They want to perform well on the real test, not just know the answers to the practice ones! For our machine learning models, our training data is the practice test, and the new emails the models will see in production are the real test.

To prevent this overfitting (see Figure 1), we use a variety of techniques. GridSearchCV allows us to perform cross-validation, which holds back some of our training data to mimic a “real test” after the models have been trained. With some of our models we also used regularization. Without getting into the specifics of the math, regularization penalizes model complexity, which keeps the model simpler. There are also forms of regularization that perform feature selection (in our case, finding which parts of the email are relevant to predicting its tag). More can be read about regularization in machine learning here.

Figure 1: Blue Line: an overfitted model, Black Line: What the model should be (Source)
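
To make the feature-selection effect of regularization concrete, here is a small sketch comparing L2 and L1 penalties on LinearSVC. The synthetic two-class dataset stands in for vectorized emails, and the value of C is an assumption chosen only to make the effect visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for a vectorized email dataset:
# 50 features, of which only 5 actually carry signal.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=5, n_redundant=0,
    n_classes=2, random_state=0,
)

# In LinearSVC, a smaller C means stronger regularization.
l2_model = LinearSVC(penalty="l2", C=0.05, dual=False).fit(X, y)
l1_model = LinearSVC(penalty="l1", C=0.05, dual=False).fit(X, y)

# Count the features each model still gives a non-zero weight.
n_l2 = int(np.count_nonzero(l2_model.coef_))
n_l1 = int(np.count_nonzero(l1_model.coef_))
print("non-zero weights with L2:", n_l2)
print("non-zero weights with L1:", n_l1)
```

An L1 penalty tends to push the weights of uninformative features to exactly zero, which is why it can double as a feature selector; an L2 penalty shrinks weights but usually leaves them non-zero.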

LinearSVC proved to be the best model, predicting the correct tag 80% of the time for the first client we tested on.

Model Deployment

Knowing that we wanted to deploy our model on AWS’s API Gateway and Lambda, we needed a tool that would make the deployment easy. Other Lambda functions in our code base already used Serverless. Serverless allowed us to deploy the model to Lambda and set up the trigger events coming from API Gateway in one step. Using API Gateway as the trigger also came with the advantage of IAM-based authorization on our API, securing our models and data.
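
For readers unfamiliar with this setup, a Lambda function behind an API Gateway proxy integration might look roughly like the sketch below. It is not our production handler: the `text` request field, the `tag` response field, and the tiny model trained at import time are all hypothetical stand-ins so that the example runs on its own.

```python
import json

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Stand-in model trained at import time so the sketch is self-contained.
# A real function would more likely load a pre-trained, serialized model.
MODEL = make_pipeline(CountVectorizer(), LinearSVC())
MODEL.fit(
    ["invoice attached", "payment overdue on invoice",
     "vessel delayed at port", "vessel departed port"],
    ["billing", "billing", "operations", "operations"],
)


def handler(event, context):
    """Lambda entry point for an API Gateway proxy event."""
    try:
        # With proxy integration, the request body arrives as a JSON string.
        payload = json.loads(event.get("body") or "{}")
        text = payload["text"]  # hypothetical request field
        if not isinstance(text, str):
            raise TypeError("text must be a string")
    except (ValueError, KeyError, TypeError):
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "expected JSON with a 'text' field"}),
        }

    tag = MODEL.predict([text])[0]
    return {"statusCode": 200, "body": json.dumps({"tag": str(tag)})}
```

Keeping the model in a module-level variable means it is set up once per Lambda container rather than on every request, which matters when each call needs to be fast.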


Accurate email classification is an important mechanism our clients use to handle their communications efficiently. By intelligently suggesting tags to our users, we can save them the headache of first thinking about how to classify an email and then finding the correct tag. Using machine learning, and machine learning best practices, we can create client-specific models that apply the tags our clients create, not only to our training data but to all future incoming emails. Even more important, as our clients receive more emails, the models will only get better.

Next Steps

This experiment showed that the probability of returning an accurate result was strong, but the operational overhead was too high. The only thing preventing actual use in SEDNA at this time was the expense of returning category tags on every message load: calculating the tags increased message load time by over one second. We are eager to explore new ways of applying this strategy that offset these costs and bring results more in line with our standard of delivering sub-second responses. This work does, however, lay the foundation for building towards a better future where machine learning can improve how clients tag their emails.