Match Type NULL_OR_BLANK causing zingg.block.Block NPE #818
Can you please share the error message? |
Here is the stacktrace:
|
Thanks. Is the match type for the id field dont_use or dont use? |
Sorry, it is DONT_USE |
Ok, thanks. Changing the field type of phone from numerical to string seems to be causing this, as the unmarked and/or marked data from earlier rounds would be a number and is now a string. How much training data do you have? Is it possible to start from scratch on a new model? Or, if you want, you could change the training data under the model folder through pyspark. Hope that helps. |
I have essentially started from scratch each time. We currently aren't setting this up for incremental. I have essentially cleared the directory each time and am creating a new database on Databricks. EDIT: Just to clarify, this has been done several times, so anytime I have made any changes to the model, I start over. Again, I started from a new directory and am still getting the same error. |
Ok, then this may be a bug in the code that is triggered by certain values in the data. Is it possible for you to share a test case and your config for us to reproduce this issue at our end? |
Certainly. Besides the config, what exactly do you need from me? Sample set of data? I am using the Databricks Solution Accelerator for this. |
Yes, a sample dataset and config/python code should be good enough to get started on reproducing this. @vikasgupta78 fyi |
Sorry for the delay. Since this is personal data, I am having to generate mock data with the same fields and then running that through to make sure the error still exists. |
Here is the mock dataset: Here is the field definition:
The code is the same as the Databricks Solution Accelerator, with the exception that I removed the loading of the incremental in 00.1 and copied the attached dataset into both downloads and initial. If following that document, the failure is during 01_Initial for the step 'Get Data (Run Once Per Cycle)', or sometimes it will get through that and fail during the next step 'Perform Labeling (Run Repeatedly Until All Candidate Pairs Labeled)'. With NULL_OR_BLANK, I have never gotten past two iterations of those two steps. Without it, I was able to run and label repeatedly until I had enough matches to proceed. Thanks, |
Thanks a lot @TXAggie2000, will take a look today. |
One question @TXAggie2000 - have you tried with zingg 0.4.0 ? |
I have not. I had followed the Solution Accelerator, which uses 0.3.3 and had success de-duping a few different datasets; I only started having issues when adding the extra match type. |
I see. I cannot locate the NULL_OR_BLANK type in 0.3.3. I would suggest trying 0.4.0 to see if this problem persists. |
Understood. I will try it with 0.4.0 and I will let you know! Thanks, |
Tried the same code with 0.4.0 and am now getting the error:
|
Can you please share the steps you used to install 0.4.0 and also the spark/java version you are using? |
@vikasgupta78 - I had modified the notebooks (config/setup) in the solution accelerator to download that version. I did notice that I was a minor Spark version off, so I am re-testing with Databricks runtime version 14.3 LTS. |
Cool. Please use dbr 14.2 and spark 3.5.0 with Zingg 0.4.0. |
Okay, I ran it on DBR 14.2 with Spark 3.5.0 and Zingg 0.4.0 and still have the same error:
Here is the code for the findTrainingData where it is failing:
config['job']['initial']['findTrainingData'] = zingg_initial_findTrainingData
Could this code have changed from 0.3.3 to 0.4.0? Keep in mind that if I download 0.3.3 and remove the NULL_OR_BLANK, everything runs as expected. I just update 8 lines of code to switch. |
You seem to be using an older notebook, please try https://github.com/zinggAI/zingg-vikas/blob/0.4.0/examples/databricks/FebrlExample.ipynb |
@vikasgupta78 - Thank you! I will review and test over the weekend and let you know by Monday! |
@vikasgupta78 - I got through one round of training/labeling. On the second pass at training the data, I got the following exception:
|
This is my field definitions. Didn't see any examples in the docs for multiple match types, but it said it takes an array:
|
Did you change the definition after a round of training / labelling? |
Can you try with phone = FieldDefinition("phone", "string", MatchType.FUZZY, MatchType.NULL_OR_BLANK)? |
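To make the suggested call shape concrete, here is a minimal pure-Python mock of a field definition that accepts multiple match types as varargs. This is an illustration only, not the real zingg.client API; the class and enum here are stand-ins for the real Zingg objects.

```python
from enum import Enum

# Illustrative stand-ins only -- not the real Zingg classes.
class MatchType(Enum):
    FUZZY = "FUZZY"
    NULL_OR_BLANK = "NULL_OR_BLANK"
    DONT_USE = "DONT_USE"

class FieldDefinition:
    # Match types are passed as varargs, mirroring the call shape
    # suggested above: FieldDefinition("phone", "string", FUZZY, NULL_OR_BLANK)
    def __init__(self, name, data_type, *match_types):
        self.name = name
        self.data_type = data_type
        self.match_types = list(match_types)

phone = FieldDefinition("phone", "string",
                        MatchType.FUZZY, MatchType.NULL_OR_BLANK)
print([m.value for m in phone.match_types])  # -> ['FUZZY', 'NULL_OR_BLANK']
```

The point of the varargs form is that both match types travel with the single field definition, rather than being concatenated into one string.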
I would be happy to try it out locally if there is a test data you could share |
|
If you want to book time and the issue is still not resolved, please use the link on the docs @TXAggie2000 |
Thanks, everyone. I am struggling to get decent results with it; it seems the results are either not using the training data or not ignoring null or blank values. I've redone it several times. I will continue to train, but I am seeing a lot of records in a single cluster that matched across only one column. |
I tried to replicate/fix the issue:
I am attaching the python file I used (renamed to txt) and final output I got. Let me know how we can help further. |
Also worth mentioning that in the final few rounds the model started converging: Zingg's predictions were in line with the labels I gave. |
@TXAggie2000 did you get a chance to look into the results? |
I am rerunning again. I had switched the email and phone order because those are more important to determine a match. There could be two different people within a company with the same email domain and phone number that we would consider to be a match. That would make those two fields more significant in that case, correct? |
@TXAggie2000 From what you are describing, does that mean you won't consider first name and last name in such cases? If in some cases you treat fields like first name and last name as evidence of a match but in other cases you don't, it will not work out; the model has to be consistent. Fields you don't want to consider should be treated as don't-match. E.g., do you consider [email protected] vikas gupta as a match with [email protected] sonal goyal? If you want to consider the domain as a company, it is better to split it into a separate field. In summary, be consistent in your training, otherwise you will not get good matches. |
I had some issues I corrected, and stuck with your suggestion. I went through 4 training cycles over the course of the day (findTrainingData took about an hour each cycle). When done, I had 3 matches and 83 labeled as not matching. With the ~40 records of matching manual training data, trainMatch failed with not enough training data. After an additional round, it increased to 103 not matching. trainMatch did not error out, but the results were still subpar. For example, one cluster had 41 matches for John Smith where the phone numbers and emails were all different. Another, where first name and last name were Mike, had 282 matches but none of the phone numbers or emails matched. I am not sure if I just need to spend several days training the model to get different results, or if there is a way to weight these fields differently, or at least equally. |
Did any of the pairs you marked as a match in training have different phone numbers and emails but the same name? If you did this for any of them, it would result in Zingg also learning it the same way. Please run --phase generateDocs and send me the files so that I can check if there is a problem with the training data. |
The phase generateDocs ran with no issues, but there is no docs directory in my model directory. I ran the following block:
It's interesting, because if I run the following:
I can see the results, but if I print DOCS_DIR: /mnt/data/raw/zingg/models/contacts/docs/, this location does not exist.
The model.html shows: Unmarked 0/144, Marked 144/144 (8 Matches, 103 Non-Matches, 33 Unsure). No sign of the training data... |
Training samples won't show up in generateDocs; its purpose is to review label data. Regarding DOCS_DIR: if you look at https://github.com/zinggAI/zingg-vikas/blob/0.4.0/examples/databricks/FebrlExample.ipynb, you need to assign DOCS_DIR = zinggDir + "/" + modelId + "/docs/". Please share model.html so that I can review it. |
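For clarity, the path assignment quoted from the notebook evaluates as follows. The zinggDir and modelId values below are taken from the paths printed later in this thread, not from the notebook itself.

```python
# Values inferred from the paths shown in this thread:
# DOCS_DIR ends up as /mnt/data/raw/zingg/models/contacts/docs/
zinggDir = "/mnt/data/raw/zingg/models"
modelId = "contacts"

# The assignment from the FebrlExample notebook:
DOCS_DIR = zinggDir + "/" + modelId + "/docs/"
print(DOCS_DIR)  # -> /mnt/data/raw/zingg/models/contacts/docs/
```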
Yes, that example is what I ran and showed what the path was when I ran print(DOCS_DIR), so you could see the path was set
|
What's your zinggDir? |
And what are you getting when you do dbutils.fs.ls('file:'+DOCS_DIR)? |
DOCS_DIR: /mnt/data/raw/zingg/models/contacts/docs/ dbutils.fs.ls('file:'+DOCS_DIR): java.io.FileNotFoundException: No such file or directory /mnt/data/raw/zingg/models/contacts/docs |
and if you do dbutils.fs.ls('file:'+ zinggDir) |
[FileInfo(path='file:/mnt/data/raw/zingg/models/contacts/', name='contacts/', size=4096, modificationTime=1715090590510)] |
Also, if you are able to see model.html as you said above, can you please share it? |
I believe the issue with it saving is because the notebook is referencing a mounted directory. If I run the following, it creates it in the mounted directory:
|
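Since several comments above chase whether the docs directory exists locally or only on the mount, a small helper like the following (a sketch using plain os.path; the function name is hypothetical, and on Databricks the 'file:' vs DBFS-mount distinction is exactly what is at play) can help pin down where output actually landed:

```python
import os

def find_docs_dir(candidates):
    """Return the first candidate path that exists as a local directory,
    else None. Illustrative helper for locating a missing docs/ folder
    when a notebook may be writing to a mounted path instead."""
    for p in candidates:
        if os.path.isdir(p):
            return p
    return None

# Example: check both the expected local path and a mount-prefixed variant.
print(find_docs_dir([
    "/mnt/data/raw/zingg/models/contacts/docs",
    "/dbfs/mnt/data/raw/zingg/models/contacts/docs",
]))
```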
How can I send this to you privately |
I did see that on one of my matches, one record does not have a phone number and the other does not have an email. This should have been marked as uncertain. Is there a way to correct that? |
Also, at what point is the training data used? trainMatch? |
you can send it 1-1 on slack |
It is used in findTrainingData and trainMatch. |
Use the updateLabel phase. |
It appears this phase requires input, so I am guessing I will need to build a notebook widget for the cluster id and get that input into this phase, since we are running a notebook and not the command line. I did not see an example of this in the source code, unless I missed it. |
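One way this could be wired up, as a sketch: read the cluster id from a notebook widget (e.g. dbutils.widgets.get("cluster_id"), Databricks-only) and feed it into a CLI-style argument list for the phase. The helper below is hypothetical; only the --phase and --conf flags correspond to Zingg's actual command line.

```python
# Hypothetical helper: build the argument list for running a Zingg phase.
# In a Databricks notebook, cluster_id would typically come from a widget:
#   cluster_id = dbutils.widgets.get("cluster_id")   # Databricks-only call
def build_phase_args(phase, conf_path, extra=None):
    args = ["--phase", phase, "--conf", conf_path]
    if extra:
        args.extend(extra)
    return args

print(build_phase_args("updateLabel", "/dbfs/zingg/config.json"))
```

The idea is simply to separate collecting interactive input (the widget) from launching the phase, so the same notebook cell can be re-run with different cluster ids.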
zinggai.slack.com |
I wasn't able to log in to slack. Thought I had an account at one point, but apparently not. I also wasn't able to get the --updateLabel phase to work in a notebook, so I just started over. Still not getting good results so I will keep training. |
Just ensure that you are consistent in training and also with training samples. Prefer to use actual data for training samples instead of handcrafted data. |
I am using 0.3.3 to train and dedupe a very simple dataset. The initial results matched too many incorrect values due to null fields. I went back and added NULL_OR_BLANK to the field definition and now I can't even get through training without failure. Here is the current field definition:
fieldDefinition = [
    {'fieldName':'id',        'matchType':'DONT USE',            'fields':'id',        'dataType':'"integer"'},
    {'fieldName':'email',     'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'email',     'dataType':'"string"'},
    {'fieldName':'firstname', 'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'firstname', 'dataType':'"string"'},
    {'fieldName':'lastname',  'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'lastname',  'dataType':'"string"'},
    {'fieldName':'phone',     'matchType':'FUZZY,NULL_OR_BLANK', 'fields':'phone',     'dataType':'"string"'}
]
The 'phone' field was initially NUMERIC, but adding NULL_OR_BLANK to it caused a failure. The above would sometimes get through a single round of training and labeling, but I was never able to train/label enough data before a failure occurred.
All we want to do is have null values not count as a match. How do I proceed?
Thanks,
Scott
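The semantics being asked for here can be sketched in a few lines. This is a minimal illustration of what "null values should not count as a match" means for a single field comparison, not Zingg's actual implementation of NULL_OR_BLANK:

```python
def blank_aware_equal(a, b):
    """Compare two field values, treating null/blank as 'no evidence'
    (None) instead of a mismatch -- the behavior NULL_OR_BLANK is meant
    to provide. Illustrative sketch only, not Zingg code."""
    def is_blank(v):
        return v is None or (isinstance(v, str) and v.strip() == "")
    if is_blank(a) or is_blank(b):
        return None   # no evidence either way: neither match nor non-match
    return a == b     # both values present: normal comparison

print(blank_aware_equal("555-1234", ""))          # -> None (ignored)
print(blank_aware_equal("555-1234", "555-1234"))  # -> True
```

The key design point is the three-valued result: a blank field contributes nothing to the match decision, whereas a plain equality check would count it as a disagreement and pull otherwise-identical records apart.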