Adversarial Validation Simplified

Param Saraf
2 min readFeb 26, 2021

This technique comes in handy in a lot of situations, particularly when a model's performance on test data doesn't match its performance on validation data.

The usual cause is a difference in distribution between the train and test data, often because categorical features change over time. For example, a product category may reach end-of-life and be replaced with a new one that is not present in your training data.
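Before running the full technique, a quick sanity check can reveal this kind of drift directly. A minimal sketch (the column name and values are illustrative, not from the article) that finds categories present in the test set but absent from training:

```python
import pandas as pd

# Hypothetical data: "category" values drift over time, so the test set
# contains a product category the training data has never seen.
train = pd.DataFrame({"category": ["laptop_v1", "phone_v1", "phone_v1"]})
test = pd.DataFrame({"category": ["laptop_v1", "phone_v2"]})

# Categories present in test but absent from train -- rows with these
# values cannot be modeled well from the training distribution alone.
unseen = set(test["category"]) - set(train["category"])
print(unseen)  # {'phone_v2'}
```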

To detect this, we combine the test and train data, create a new feature Is_Test (marking all test rows as 1 and all train rows as 0), and fit a model with this new target. The most important features will be the ones whose distribution changed drastically between train and test, since those are what allow the model to distinguish the two sets. One way to handle such a feature is to remove time-specific attributes from the category. For example, if you had Chrome_v_2.0 as a category in the train data and Chrome_v_3.0 in the test data, you can strip the version info from the category.
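The steps above can be sketched as follows. Column names, the synthetic data, and the choice of a random forest are all assumptions for illustration; any classifier that exposes feature importances would do:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical train/test frames with a numeric feature and a
# time-versioned category (mirroring the Chrome_v_2.0 example).
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "price": rng.normal(100, 10, 500),
    "browser": rng.choice(["Chrome_v_2.0", "Firefox_v_1.0"], 500),
})
test = pd.DataFrame({
    "price": rng.normal(100, 10, 500),
    "browser": rng.choice(["Chrome_v_3.0", "Firefox_v_1.0"], 500),
})

# Strip version info so a category is comparable across time:
# "Chrome_v_2.0" and "Chrome_v_3.0" both become "Chrome".
for df in (train, test):
    df["browser"] = df["browser"].str.split("_v_").str[0]

# Combine the sets and label test rows 1, train rows 0 (Is_Test).
combined = pd.concat(
    [train.assign(Is_Test=0), test.assign(Is_Test=1)],
    ignore_index=True,
)
X = pd.get_dummies(combined.drop(columns="Is_Test"))
y = combined["Is_Test"]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The most important features are the ones that let the model tell
# test apart from train, i.e. the ones whose distribution shifted.
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

After the version strings are stripped, the two sets become hard to tell apart, so no feature should dominate the importances.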

After this cleaning, score your train data with the adversarial model and sort the predicted probabilities of being test-like in decreasing order. Select the top 20% of train rows as your validation data, and then use that split when training your actual model.
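A minimal sketch of that selection step, assuming a small synthetic dataset and a logistic regression standing in for the adversarial classifier fit on the combined data (all names and values here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data where the test distribution is shifted upward.
rng = np.random.default_rng(1)
train = pd.DataFrame({"price": rng.normal(100, 10, 200)})
test = pd.DataFrame({"price": rng.normal(110, 10, 200)})

# Adversarial classifier: predict Is_Test on the combined data.
combined = pd.concat(
    [train.assign(Is_Test=0), test.assign(Is_Test=1)],
    ignore_index=True,
)
clf = LogisticRegression().fit(combined[["price"]], combined["Is_Test"])

# Score each TRAIN row by its predicted probability of being test-like,
# sort in decreasing order, and keep the top 20% as the validation set.
p_test_like = clf.predict_proba(train[["price"]])[:, 1]
n_val = int(len(train) * 0.2)
val_idx = np.argsort(p_test_like)[::-1][:n_val]
validation = train.iloc[val_idx]
print(len(validation))  # 40
```

The resulting validation set is the slice of training data that most resembles the test distribution, so a score on it should track test performance more closely than a random split would.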

I suggest two good references on the topic:

1. For understanding
2. For implementation
