Faker unstructured #148

Merged
merged 9 commits into from
Sep 1, 2023

Conversation

qubixes
Member

@qubixes qubixes commented Aug 15, 2023

Creating this PR as a possibility for unstructured text. It uses lingua-py to detect the language and the lorem provider from Faker to create sentences/words of a certain length. Obviously, it won't look much like the original text, but on the other hand, it might be much easier to accept from a privacy standpoint.

Current state of the PR is unfinished (hence a draft).
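
To illustrate the idea of length-matched filler text, here is a minimal sketch using only the standard library. It is not the PR's implementation: the `FILLER_WORDS` table is a hypothetical stand-in for Faker's locale-aware lorem provider, the function name `fake_like` is made up for this example, and the locale is passed in by hand rather than detected with lingua-py.

```python
import random

# Hypothetical stand-in vocabulary; the PR relies on Faker's lorem provider instead.
FILLER_WORDS = {
    "en": ["time", "house", "water", "green", "walk", "river", "small", "light"],
    "nl": ["tijd", "huis", "water", "groen", "lopen", "rivier", "klein", "licht"],
}


def fake_like(text, locale="en", rng=None):
    """Return filler text with the same number of words as `text`."""
    rng = rng or random.Random()
    n_words = max(1, len(text.split()))
    words = [rng.choice(FILLER_WORDS[locale]) for _ in range(n_words)]
    return " ".join(words).capitalize() + "."


# The output keeps the word count of the input but none of its content.
print(fake_like("Mijn opa loopt op sokken.", locale="nl", rng=random.Random(0)))
```

Only the word count of the original survives, which is the privacy argument made above.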

Member

@vankesteren vankesteren left a comment

This already gets us very far for the unstructured text!! Super nice. Indeed, the text generated by faker does not look very realistic, but it is multilingual (which is really great).

how do you envision users implementing this? with var_spec unstructured: True?

To add: json validation (because it errors on that now)

@classmethod
def _fit(cls, values):
    """Select the appropriate faker function and locale."""
    detector = LanguageDetectorBuilder.from_all_languages().with_low_accuracy_mode().build()
Member

Is this something we can do statically and not on-the-fly for each column?

Member Author

It seems that the loading is cached, so I don't think it's a big performance problem. If it is, we can look at it later.
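
If the per-column build ever did turn out to be slow, one common way to make it effectively static is to memoize the construction. The sketch below shows that pattern with `functools.lru_cache`; `get_detector` and `BUILD_CALLS` are hypothetical names, and the `object()` placeholder stands in for the builder chain shown in the diff above.

```python
from functools import lru_cache

BUILD_CALLS = 0  # counts how often the expensive build actually runs


@lru_cache(maxsize=None)
def get_detector():
    """Build the (placeholder) detector once and reuse it for every column."""
    global BUILD_CALLS
    BUILD_CALLS += 1
    # Placeholder for an expensive construction such as a language detector.
    return object()


first = get_detector()   # triggers the build
second = get_detector()  # served from the cache, no second build
```

Each `_fit` call could then ask for `get_detector()` instead of constructing a new detector.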

@vankesteren
Member

vankesteren commented Aug 16, 2023

Here is a nice test script by the way, in case you did not have this yet:

import polars as pl
from metasynth import MetaFrame
from metasynth.distribution.faker import UnstructuredTextDistribution # should be exported by distribution

# Three free-text columns, one per language (Dutch, English, German).
df = pl.DataFrame({
    "nltxt": ["Ik ben een kleine eenhoorn.", "Mijn opa loopt op sokken.", "Wie gaat weg?"],
    "entxt": ["I'm a small unicorn.", "My grandfather walks in socks.", "Who is leaving?"],
    "detxt": ["Ich bin ein kleiner Einhorn.", "Mein Opa läuft auf Socken.", "Wer geht weg?"]
})

# Force the unstructured text distribution for every column.
var_spec = {
    "nltxt": {"distribution": UnstructuredTextDistribution},
    "entxt": {"distribution": UnstructuredTextDistribution},
    "detxt": {"distribution": UnstructuredTextDistribution}
}

# Fit the metadata and generate 10 synthetic rows.
mf = MetaFrame.fit_dataframe(df, spec=var_spec)
mf.synthesize(10)

@qubixes
Member Author

qubixes commented Aug 16, 2023

> This already gets us very far for the unstructured text!! Super nice. Indeed, the text generated by faker does not look very realistic, but it is multilingual (which is really great).
>
> how do you envision users implementing this? with var_spec unstructured: True?
>
> To add: json validation (because it errors on that now)

I would like to have some heuristic that compares the regex to the unstructured text, but yes, it can be set manually in the var_spec. Okay, I'll develop this further then, if it works well enough!

@qubixes qubixes marked this pull request as ready for review August 16, 2023 11:16
@qubixes
Member Author

qubixes commented Aug 16, 2023

Might be good to write some documentation on this at some point as well!

@vankesteren I don't get any validation errors though?

@vankesteren
Member

Ah I tested it now, there's no more validation error on my side either

Member

@vankesteren vankesteren left a comment

Almost done already! Very nice

lang = self.detect_language(series)
if lang is None:
    return 9999999
return -1
Member

Is this what makes the choice between regex and unstructured? Can we make it more explicit somehow? This logic is now hidden in the detect_language function right?

Member Author

Yes, if lingua detects a language -> UnstructuredText, otherwise -> Regex. The logic is in the lingua package, so I don't really know how to make it more explicit? I can add a comment?
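
One way to make that choice more visible is to spell out the comparison between candidates. The sketch below is a toy model, not MetaSynth code: it assumes the candidate with the lowest score wins, uses a hypothetical `regex_score` of 0.0 because the real regex score is not shown in this thread, and replaces lingua with a `toy_detector` that only looks for spaces.

```python
def freetext_score(values, detect_language):
    """Mimic the snippet above: a huge score means 'do not pick free text'."""
    lang = detect_language(values)
    if lang is None:
        return 9999999
    return -1


def pick_distribution(values, detect_language, regex_score=0.0):
    """Choose the candidate with the lowest score (assumed: lower is better)."""
    scores = {
        "freetext": freetext_score(values, detect_language),
        "regex": regex_score,  # hypothetical value, for illustration only
    }
    return min(scores, key=scores.get)


def toy_detector(values):
    """Toy stand-in for lingua: report a language only if a value has a space."""
    return "EN" if any(" " in v for v in values) else None


print(pick_distribution(["Who is leaving?", "I am small."], toy_detector))  # freetext
print(pick_distribution(["x123", "x523"], toy_detector))  # regex
```

Under these assumptions, a detected language always beats the regex and a missing language never does, which is the all-or-nothing behaviour being questioned here.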

@vankesteren
Member

PS: I should test it with the pilot data first.

@vankesteren
Member

OK did that now, basically everything is replaced by the unstructured text by default. Is this what we want? I don't think so.

It's maybe better to explicitly have to set unstructured: True or so in the var_spec.

@qubixes
Member Author

qubixes commented Aug 16, 2023

@vankesteren It shouldn't replace all columns:

df = pl.DataFrame({
    "nltxt": ["Ik ben een kleine eenhoorn.", "Mijn opa loopt op sokken.", "Wie gaat weg?"], 
    "entxt": ["I'm a small unicorn.", "My grandfather walks in socks.", "Who is leaving?"],
    "detxt": ["Ich bin ein kleiner Einhorn.", "Mein Opa läuft auf Socken.", "Wer geht weg?"],
    "struct": ["x123", "x523", "x631"],
})

What kind of columns are you talking about? Categorical variables that are not labeled as such mostly?

I do think we should auto-detect, but obviously the constraints can be more stringent.

@qubixes
Member Author

qubixes commented Aug 16, 2023

I think this is a bit related to the regex distribution as well: measuring how well the regex fits will also help with detecting unstructured text.

@qubixes
Member Author

qubixes commented Aug 16, 2023

What about criteria for unstructured, something like:

  • Number of words/row > 1
  • Variable number of words/row > 90%
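
A rough sketch of how such criteria could be checked on a column of strings. The second criterion is ambiguous as written, so this is only one possible reading: more than 90% of the rows must contain more than one word, and the word count must vary between rows. The function name and the threshold parameter are hypothetical.

```python
def looks_like_free_text(values, min_multiword_fraction=0.9):
    """Heuristic check on a list of strings; thresholds are illustrative."""
    word_counts = [len(v.split()) for v in values if v]
    if not word_counts:
        return False
    # Share of rows that contain more than one word.
    multiword = sum(c > 1 for c in word_counts) / len(word_counts)
    # Whether the number of words differs between rows.
    varies = len(set(word_counts)) > 1
    return multiword > min_multiword_fraction and varies


print(looks_like_free_text(
    ["I'm a small unicorn.", "My grandfather walks in socks.", "Who is leaving?"]
))  # True
print(looks_like_free_text(["x123", "x523", "x631"]))  # False
```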

@qubixes
Member Author

qubixes commented Aug 17, 2023

In the end it would be nice to integrate it using AIC (or a derivative). If we assume each word has a 1/10000 chance, then we could simply have L = 10000**-N_words. I don't know about values that cannot be fit with the regex, but I could imagine simply assigning them a small probability, 10^-6 or something. Otherwise we can compute the probability of the regexes reasonably easily.
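
A small numeric sketch of that idea, assuming the standard definition AIC = 2k - 2 ln L. The function names are hypothetical, and the parameter count `n_params=1` is an arbitrary placeholder since the thread does not say how many parameters each model would have.

```python
import math


def freetext_log_likelihood(values, vocab_size=10_000):
    """ln L for L = vocab_size ** -N_words, with N_words the total word count."""
    n_words = sum(len(v.split()) for v in values)
    return -n_words * math.log(vocab_size)


def fallback_log_likelihood(values, p_unfit=1e-6):
    """Give every value the same small probability when the regex cannot fit it."""
    return len(values) * math.log(p_unfit)


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


values = ["Who is leaving?", "I'm a small unicorn."]
print(aic(freetext_log_likelihood(values), n_params=1))
print(aic(fallback_log_likelihood(values), n_params=1))
```

Working in log space avoids underflow, since `10000 ** -N_words` becomes extremely small for longer columns.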

Some of the problems might also be fixed if we have a string Multinoulli distribution?

@qubixes
Member Author

qubixes commented Aug 30, 2023

@vankesteren I have updated the branch so that it shouldn't do unstructured text by default anymore. I have also renamed it to "freetext", let me know what you think!

@vankesteren
Member

Great, checked. Feel free to merge.

@qubixes qubixes merged commit 0a49617 into main Sep 1, 2023
6 checks passed
@qubixes qubixes deleted the faker-unstructured branch September 27, 2023 12:12