[TOOL] CM Duplicate Detector

Post by Xeno

CM Duplicate Detector, Version: 1.0

Tool/Pre-Data Editor

Created By

This tool finds potential duplicate players and staff in a CM database and exports them as a spreadsheet.

Attention! Always take a back-up copy of all Data files before using tools, so that if you get an error later you can still restore from the back-up Data.

Installation Instructions
You MUST first install the MS Visual C++ 2017 Redistributable x64 (you must install the 64-bit aka x64 version): https://aka.ms/vs/15/release/vc_redist.x64.exe

Download CM_Duplicate_Detector.zip from the link below and extract it to a folder of your choice.
Make sure Run As Administrator is ticked under Properties > Compatibility for the CM Duplicate Detector.exe file.
Double-click on the CM Duplicate Detector.exe file to run the tool.

The Duplicate Detector undertakes three passes of the database as follows (using pseudo-code to try to explain it):
Pass 1: (First Name AND Second Name are equal) AND (Year of Birth OR Club Contracted are equal)
Pass 2: (Common Names are equal) AND (Year of Birth OR Club Contracted are equal)
Pass 3: First Name AND Second Name are near identical when compared using Levenshtein distance.
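For anyone unfamiliar with Levenshtein distance, here is a rough Python sketch of the kind of comparison Pass 3 describes. This is not the tool's actual code: the function names are made up for illustration, and the max_distance cut-off of 1 is purely an assumption, since the post does not say what threshold the tool uses for "near identical".

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def names_near_identical(first_a, second_a, first_b, second_b, max_distance=1):
    """Hypothetical Pass 3 style check: both names must be within
    max_distance edits. The threshold is an assumption, not the tool's value."""
    return (
        levenshtein(first_a.lower(), first_b.lower()) <= max_distance
        and levenshtein(second_a.lower(), second_b.lower()) <= max_distance
    )
```

With this cut-off, "Stephen Smith" and "Stephan Smyth" would be flagged, whereas "Nick Butt" and "Nicholas Butt" would not, which is the sort of case the synonym files are there to cover.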

Once you have extracted the zip file to a folder, you will find three .txt files alongside the application, named "first_name.txt", "second_name.txt" and "common_name.txt". This is where you can enter the synonym data. In other words, you can list alternate spellings/versions of names so that they are treated as duplicates when searching the database. I have added some examples from my EHM Editor to the first and second name synonym text files but have not included any examples in the common name file. Hopefully it is self-evident from looking at the first/second name text files how these work.
As an example, you will see in "first_name.txt" that Nick, Nicky and Nicholas are listed as synonyms of one another. This means that "Nicky Butt", "Nicholas Butt" and "Nick Butt" would all be treated as duplicates.
The synonym data is used for Passes 1 and 2. The synonym data is not used for Pass 3.
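To illustrate the idea behind synonyms, here is a small Python sketch of how a group of alternate spellings can be collapsed to one key before names are compared. It is only an illustration under assumptions: the helper names are hypothetical, the groups are given as plain Python lists, and it makes no attempt to parse the .txt files, whose layout is not described here.

```python
def build_synonym_map(groups):
    """Map every name in a synonym group to one shared key
    (the first name of the group, case-folded)."""
    canonical = {}
    for group in groups:
        key = group[0].casefold()
        for name in group:
            canonical[name.casefold()] = key
    return canonical


def same_name(a, b, canonical):
    """Treat two names as equal if they are identical or share a synonym group."""
    a_key = canonical.get(a.casefold(), a.casefold())
    b_key = canonical.get(b.casefold(), b.casefold())
    return a_key == b_key


# Example group taken from the post; the list format is an assumption.
first_names = build_synonym_map([["Nick", "Nicky", "Nicholas"]])
print(same_name("Nicky", "Nicholas", first_names))
print(same_name("Nick", "Paul", first_names))
```

Names that are not in any group simply fall back to a plain case-insensitive comparison.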

Once you have added any additional synonym data to the .txt files (this is of course optional), load the Duplicate Detector.
You will see there is an option to set "Year of Birth Tolerance". This determines how close two persons' years of birth have to be in order to be considered identical for the purposes of Pass 1 and Pass 2. If you set it to zero then years of birth have to be exactly the same. If you set it to, for example, 2 then the two years of birth simply need to be within 2 years of each other (e.g. 1980 would be considered identical to 1982 as it is within two years). Consequently, the larger the Tolerance value, the wider the net will be cast - but this also increases the chances of false positives.
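As a quick sketch of how the Tolerance value plays into Pass 1, the Python below combines the name check with the "Year of Birth OR Club Contracted" condition. The dictionary keys and function names are hypothetical and chosen only for this example; the tool's internal data layout is not shown in this post, and synonym handling is left out to keep it short.

```python
def years_match(year_a, year_b, tolerance):
    """True when the two years of birth are within `tolerance` years of each other."""
    return abs(year_a - year_b) <= tolerance


def pass1_match(person_a, person_b, tolerance):
    """Pass 1 as described in the post: names equal AND (year of birth OR club equal).
    The field names used here are assumptions for illustration."""
    names_equal = (
        person_a["first_name"] == person_b["first_name"]
        and person_a["second_name"] == person_b["second_name"]
    )
    year_ok = years_match(person_a["year_of_birth"], person_b["year_of_birth"], tolerance)
    club_ok = person_a["club"] == person_b["club"]
    return names_equal and (year_ok or club_ok)


a = {"first_name": "Nick", "second_name": "Butt", "year_of_birth": 1980, "club": "Club A"}
b = {"first_name": "Nick", "second_name": "Butt", "year_of_birth": 1982, "club": "Club B"}
print(pass1_match(a, b, tolerance=2))  # within two years, so treated as a match
print(pass1_match(a, b, tolerance=0))  # years differ and clubs differ, so no match
```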
After you have set your Tolerance setting (I suggest trying a value of 1 or 2), click on the "Open Database" button and select your database. The Duplicate Detector will then start scanning the database. The process takes around 30-40 seconds with a database of around 150,000 people.
Once the process is complete, you will find a "Duplicate Report.csv" file which sets out the results. Note that results from Pass 3 can sometimes appear more than once in the spreadsheet. I will think about making Pass 3 optional in the future.

FAQs and Technical Support
Have you followed the instructions above to the letter and still come across an issue? If so, then do not worry!
We have worked hard to put together a FAQ specifically for this type of download. Click on the link below and you'll be taken to our 'Technical Support' area, where you will hopefully find a solution to your problem.
If no solution has been provided there, please use this thread to report your issue, and be as detailed as possible so that we can help you. Once a solution has been found, we will update the FAQ with your issue for others who may come across it in the future!

Download Link