Data Cleansing with Power*MatchMaker

Power*MatchMaker is a data cleansing tool that SQL Power has released as open source, along with the Power*Architect (a data modeling tool). Since there are not many open source tools in the data cleansing field, I was curious and installed it to see how it works.

The installation was very simple. The software can be downloaded from the Power*MatchMaker download page in different versions depending on the OS. I tried the Windows version, which installs in a couple of minutes. It is important not to forget the prerequisite: Java Runtime 5. Once it is installed, the best way to see how it works is to follow the tutorial included in the tool's help. I also recommend watching the demo available from the same MatchMaker page.

The software is very simple to operate: you create a repository in a database separate from the one you are going to work on, connect via JDBC, and can then create three different types of project: deduplication, data cleansing and cross-referencing. That is the theory, because the cross-referencing functionality is not implemented yet and cannot be used. The data cleansing project adds nothing new, because all of its functionality is a subset of what the deduplication project offers, so creating a project of the latter type lets us see everything. As for deduplication, the process is organized into several steps:

1. Defining the transformation processes for the source data and the fields to compare

You can define several comparison processes, applying different operators to the source data to obtain values that are meaningful for the comparison, and you also define exactly what you want to compare. The interface for these actions is very intuitive and visual: everything is defined in one place. The catch is that the comparison operators are quite simple. Although there are phonetic comparison operators, fuzzy-logic functions for comparing similar words are missing, as is the ability to work with a similarity percentage per field and per record. In the end, either the results are identical for everything that has been defined or they are not a match. The only thing you can do is assign a priority and a color to each comparison process so that you can tell them apart visually afterwards. Specific operators for addresses or other 'standard' data are also missing, although there is an operator that validates an address against Google Maps. I have not managed to get it to work, but it is something to explore more calmly. You can also define translation dictionaries for words, which is very useful when comparing names or addresses, for example.
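
To make the idea of a similarity percentage and a translation dictionary more concrete, here is a minimal Python sketch. It is not how MatchMaker works internally; the dictionary contents, the function names and the 85% threshold are assumptions made up for the illustration, and the score comes from the standard library's difflib.

```python
from difflib import SequenceMatcher

# Hypothetical translation dictionary: maps word variants to one canonical form.
TRANSLATIONS = {"st": "street", "st.": "street", "ave": "avenue", "bob": "robert"}


def normalize(text):
    """Lowercase the text and replace each word found in the dictionary."""
    words = text.lower().split()
    return " ".join(TRANSLATIONS.get(word, word) for word in words)


def similarity(a, b):
    """Return a 0-100 similarity score between the two normalized strings."""
    return 100.0 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(a, b, threshold=85.0):
    """Treat two values as a match when their score reaches the threshold."""
    return similarity(a, b) >= threshold
```

With a score like this, "Jon Smith" and "John Smith" can be accepted as candidates even though they are not identical, which is the kind of behavior I miss in the built-in operators.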

2. Running the comparison

Nothing to highlight here. It works well with few records; it should be tested with large tables to assess its performance.

3. Validating the matches

This part is also very good. The tool displays the matches found in a very visual way, using the color defined for each process, and lets you see the differences between records, discard matches, decide which is the master record (the one whose data is kept after the merge), and choose what is going to be merged and how. By default the master record's data is chosen, unless the field is null; you can also concatenate the data, or take the maximum, the minimum or the sum of all values. If you want, you can let the tool automatically choose which record becomes the master and merge all the records for which a match has been found. The tool is very good for working with a limited number of records that a person can review before the merge, but it lacks a bit of 'intelligence' for dealing with large numbers of records and merging them without manual intervention. It should also be possible to choose the master data at the field level rather than at the record level, taking the best data from each field to build the best master record.
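
The merge options described above can be sketched in a few lines of Python. This is only an illustration of the rules (master value unless null, concatenate, maximum, minimum, sum); the function names, the rule names and the idea of passing one rule per field are my own assumptions, not MatchMaker's API.

```python
def merge_field(master_value, duplicate_values, rule="master"):
    """Merge one field of a master record with the same field of its duplicates."""
    # Drop nulls, keeping the master's value first when it is present.
    values = [v for v in [master_value, *duplicate_values] if v is not None]
    if not values:
        return None
    if rule == "master":
        # Master's value, or the first non-null duplicate if the master is null.
        return values[0]
    if rule == "concat":
        return " ".join(str(v) for v in values)
    if rule == "max":
        return max(values)
    if rule == "min":
        return min(values)
    if rule == "sum":
        return sum(values)
    raise ValueError(f"unknown rule: {rule}")


def merge_records(master, duplicates, rules):
    """Build the merged record, applying a per-field rule (default: master)."""
    return {
        field: merge_field(
            master[field],
            [dup.get(field) for dup in duplicates],
            rules.get(field, "master"),
        )
        for field in master
    }
```

For example, merging a master with a null phone number into a duplicate that has one keeps the master's name but fills in the phone from the duplicate.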

4. Merging the records

It works correctly: it performs the merge and keeps a log of the identifiers it merges in a results table. Just be careful, because it works directly on the source table and deletes the records that were marked as duplicates.
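
Since the merge deletes rows from the source table, it is worth keeping a copy before running it. The following Python sketch with sqlite3 shows the general idea of copying the table, logging which identifiers were merged and then deleting the duplicates. The table layout, the `id` column, the `merge_log` table and the function name are all hypothetical; MatchMaker's own result table will look different.

```python
import sqlite3


def merge_with_backup(conn, table, master_id, duplicate_ids):
    """Copy the table, log the merge, then delete the duplicate rows."""
    cur = conn.cursor()
    # Table names cannot be bound as parameters, so `table` must be trusted input.
    cur.execute(f"CREATE TABLE IF NOT EXISTS {table}_backup AS SELECT * FROM {table}")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS merge_log (master_id INTEGER, duplicate_id INTEGER)"
    )
    for dup_id in duplicate_ids:
        # Record which duplicate was merged into which master before deleting it.
        cur.execute("INSERT INTO merge_log VALUES (?, ?)", (master_id, dup_id))
        cur.execute(f"DELETE FROM {table} WHERE id = ?", (dup_id,))
    conn.commit()
```

The backup copy is taken only the first time the function runs for a given table, so it preserves the state from before any merge.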

Conclusions

In short, it is a very useful tool for cleansing processes, especially if the amount of data to review is not very large. The whole process can be carried out without major complications, and the tool greatly eases the comparison between candidates and the choice of the records that remain as masters. It has several aspects to improve, but it surely will not stay at this version, especially now that it is open source.