Faker unstructured #148
Conversation
This already gets us very far for the unstructured text!! Super nice. Indeed, the text generated by faker does not look very realistic, but it is multilingual (which is really great).
How do you envision users implementing this? With `unstructured: True` in the var_spec?
To add: json validation (because it errors on that now)
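A sketch of what that spec syntax could look like (hypothetical, not implemented in this PR; the test script further down uses the explicit distribution class instead):

```python
# Hypothetical var_spec syntax under discussion, not part of this PR:
var_spec = {
    "freetext_col": {"unstructured": True},  # opt a column in to unstructured text
}
```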
metasynth/distribution/faker.py
Outdated
```python
@classmethod
def _fit(cls, values):
    """Select the appropriate faker function and locale."""
    detector = LanguageDetectorBuilder.from_all_languages().with_low_accuracy_mode().build()
```
Is this something we can do statically and not on-the-fly for each column?
It seems that the loading is cached, so I don't think it's a big performance problem. If it is we can look at it later.
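For reference, a minimal standalone sketch of the lingua-py detection being used here (assuming the lingua package; detect_language_of returns None when no language can be identified):

```python
from lingua import LanguageDetectorBuilder

# Building the detector loads the language models; per the discussion
# above, this loading appears to be cached, so building one detector per
# column should not be a large cost after the first build.
detector = (
    LanguageDetectorBuilder.from_all_languages()
    .with_low_accuracy_mode()  # faster, less accurate; fine for column-level detection
    .build()
)

print(detector.detect_language_of("Mijn opa loopt op sokken."))  # expected: Language.DUTCH
print(detector.detect_language_of("x123"))                       # expected: None (no language)
```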
Here is a nice test script by the way, in case you did not have this yet:

```python
import polars as pl

from metasynth import MetaFrame
from metasynth.distribution.faker import UnstructuredTextDistribution  # should be exported by distribution

df = pl.DataFrame({
    "nltxt": ["Ik ben een kleine eenhoorn.", "Mijn opa loopt op sokken.", "Wie gaat weg?"],
    "entxt": ["I'm a small unicorn.", "My grandfather walks in socks.", "Who is leaving?"],
    "detxt": ["Ich bin ein kleiner Einhorn.", "Mein Opa läuft auf Socken.", "Wer geht weg?"],
})

var_spec = {
    "nltxt": {"distribution": UnstructuredTextDistribution},
    "entxt": {"distribution": UnstructuredTextDistribution},
    "detxt": {"distribution": UnstructuredTextDistribution},
}

mf = MetaFrame.fit_dataframe(df, spec=var_spec)
mf.synthesize(10)
```
I would like to have some heuristic that compares the regex to the unstructured text, but manually it can be set in the var_spec, yes. Okay, I'll develop this further then, if it works well enough!
Might be good to write some documentation on this at some point as well! @vankesteren I don't get any validation errors though?
Ah, I tested it now; there's no more validation error on my side either.
Almost done already! Very nice
metasynth/distribution/faker.py
Outdated
```python
lang = self.detect_language(series)
if lang is None:
    # no language detected: return a sentinel "worst" score, so the
    # regex distribution wins the fit comparison
    return 9999999
# a language was detected: best possible score, so unstructured text is chosen
return -1
```
Is this what makes the choice between regex and unstructured? Can we make it more explicit somehow? This logic is now hidden in the detect_language function right?
Yes, if lingua detects a language -> UnstructuredText, otherwise -> Regex. The logic is in the lingua package, so I don't really know how to make it more explicit? I can add a comment?
PS: I should test it with the pilot data first.
OK, did that now; basically everything is replaced by the unstructured text by default. Is this what we want? I don't think so. It might be better to have to explicitly set unstructured: True or so in the var_spec.
@vankesteren It shouldn't replace all columns:

```python
df = pl.DataFrame({
    "nltxt": ["Ik ben een kleine eenhoorn.", "Mijn opa loopt op sokken.", "Wie gaat weg?"],
    "entxt": ["I'm a small unicorn.", "My grandfather walks in socks.", "Who is leaving?"],
    "detxt": ["Ich bin ein kleiner Einhorn.", "Mein Opa läuft auf Socken.", "Wer geht weg?"],
    "struct": ["x123", "x523", "x631"],
})
```

What kind of columns are you talking about? Mostly categorical variables that are not labeled as such? I do think we should auto-detect, but obviously the constraints can be more stringent.
I think this is a bit related to the regex distribution as well: detecting how well the regex does will help the detection of the unstructured text too.
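One possible shape for such a heuristic (purely illustrative, names hypothetical): score a fitted regex by the fraction of values it fully matches, and only fall back to freetext when coverage is poor and a language is detected.

```python
import re

def regex_coverage(values, pattern: str) -> float:
    """Fraction of values fully matched by the fitted regex (hypothetical helper)."""
    compiled = re.compile(pattern)
    return sum(compiled.fullmatch(v) is not None for v in values) / len(values)

# e.g. regex_coverage(["x123", "x523", "x631"], r"x\d{3}") == 1.0,
# so a structured column like this would keep its regex distribution.
```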
What about criteria for unstructured, something like:
In the end it would be nice to integrate it using AIC (or a derivative). If we have one word at a 1/10000 chance, then we could simply have L = 10000**-N_words. I don't know about values that cannot be fit with the regex, but I could imagine simply giving them a small value of 10^-6 or something. Otherwise we can compute the probability of the regexes reasonably easily. Some of the problems might also be fixed if we have a string Multinoulli distribution?
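A back-of-the-envelope sketch of that likelihood comparison (all names and constants here are hypothetical, assuming a vocabulary of 10,000 equally likely words):

```python
import math

VOCAB_SIZE = 10_000    # assumed word vocabulary, giving 1/10000 per word
UNMATCHED_PROB = 1e-6  # assumed small probability for values the regex cannot fit

def loglik_freetext(n_words: int) -> float:
    # L = VOCAB_SIZE ** -n_words, computed in log space to avoid underflow
    return -n_words * math.log(VOCAB_SIZE)

def loglik_regex(values, regex_prob) -> float:
    # regex_prob(value) returns the regex's probability for a matching
    # value, or None when the value cannot be fit by the regex
    total = 0.0
    for value in values:
        p = regex_prob(value)
        total += math.log(p if p is not None else UNMATCHED_PROB)
    return total

# The candidate with the higher log-likelihood (equivalently, the lower
# AIC when parameter counts are comparable) would be selected.
```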
Force-pushed from 5b487c4 to 48ec04d
@vankesteren I have updated the branch so that it shouldn't do unstructured text by default anymore. I have also renamed it to "freetext", let me know what you think!
Great, checked. Feel free to merge.
Creating this PR as a possibility for unstructured text. It uses lingua-py to detect the language and uses the lorem provider from faker to create sentences/words of a certain length. Obviously, it won't look much like the original text, but on the other hand, it might also be much easier to accept from a privacy standpoint.

Current state of the PR is unfinished (hence a draft).
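For context, a minimal sketch of the faker lorem usage described above (the locale string and word counts are assumptions; faker falls back to its default lorem word list for locales without a localized one):

```python
from faker import Faker

# A detected language would be mapped to a faker locale such as "nl_NL".
fake = Faker("nl_NL")

print(fake.sentence(nb_words=6))  # one lorem sentence of roughly six words
print(fake.words(nb=3))           # a list of three lorem words
```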