Faker unstructured #148

qubixes · 2023-08-15T14:35:43Z

Creating this PR as a possibility for unstructured text. It uses lingua-py to detect the language and uses the lorem provider from faker to create sentences/words of a certain length. Obviously, it won't look much like the original text, but on the other hand, it might also be much easier to accept from a privacy standpoint.

Current state of the PR is unfinished (hence a draft).

vankesteren

This already gets us very far for the unstructured text!! Super nice. Indeed, the text generated by faker does not look very realistic, but it is multilingual (which is really great).

how do you envision users implementing this? with var_spec unstructured: True?

To add: json validation (because it errors on that now)

vankesteren · 2023-08-16T07:47:13Z

metasynth/distribution/faker.py

+    @classmethod
+    def _fit(cls, values):
+        """Select the appropriate faker function and locale."""
+        detector = LanguageDetectorBuilder.from_all_languages().with_low_accuracy_mode().build()


Is this something we can do statically and not on-the-fly for each column?

It seems that the loading is cached, so I don't think it's a big performance problem. If it is we can look at it later.
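
If rebuilding the detector per column ever did turn out to be costly, one option would be to build it once and reuse it. Below is a minimal sketch of that idea using `functools.lru_cache`. `ExpensiveDetector`, `get_detector`, and `fit_column` are hypothetical stand-ins for illustration, not lingua or metasynth API.

```python
from functools import lru_cache


class ExpensiveDetector:
    """Hypothetical stand-in for a detector that is costly to build."""

    build_count = 0  # counts how often the constructor actually runs

    def __init__(self):
        ExpensiveDetector.build_count += 1


@lru_cache(maxsize=None)
def get_detector():
    """Build the detector on first use and return the same object afterwards."""
    return ExpensiveDetector()


def fit_column(values):
    """Hypothetical per-column fit that reuses the shared detector."""
    detector = get_detector()
    return detector, len(values)


# Fitting several columns triggers only one build.
for column in (["a b", "c d"], ["e f"], ["g"]):
    fit_column(column)
```

Since the loading already appears to be cached inside the library, this would only matter if profiling showed the builder call itself to be slow.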

vankesteren · 2023-08-16T08:19:12Z

Here is a nice test script by the way, in case you did not have this yet:

import polars as pl
from metasynth import MetaFrame
from metasynth.distribution.faker import UnstructuredTextDistribution # should be exported by distribution

df = pl.DataFrame({
    "nltxt": ["Ik ben een kleine eenhoorn.", "Mijn opa loopt op sokken.", "Wie gaat weg?"], 
    "entxt": ["I'm a small unicorn.", "My grandfather walks in socks.", "Who is leaving?"],
    "detxt": ["Ich bin ein kleiner Einhorn.", "Mein Opa läuft auf Socken.", "Wer geht weg?"]
})

var_spec = {
    "nltxt": {"distribution": UnstructuredTextDistribution},
    "entxt": {"distribution": UnstructuredTextDistribution},
    "detxt": {"distribution": UnstructuredTextDistribution}
}

# Fit the metadata on the example frame, then draw 10 synthetic rows.
mf = MetaFrame.fit_dataframe(df, spec=var_spec)
mf.synthesize(10)

qubixes · 2023-08-16T09:30:20Z

> This already gets us very far for the unstructured text!! Super nice. Indeed, the text generated by faker does not look very realistic, but it is multilingual (which is really great).
>
> how do you envision users implementing this? with var_spec unstructured: True?
>
> To add: json validation (because it errors on that now)

I would like to have some heuristic that compares the regex to the unstructured text, but yes, it can be set manually in the var_spec. Okay, I'll develop this further then, if it works well enough!

qubixes · 2023-08-16T11:17:58Z

Might be good to write some documentation on this at some point as well!

@vankesteren I don't get any validation errors though?

vankesteren · 2023-08-16T11:33:00Z

Ah I tested it now, there's no more validation error on my side either

vankesteren

Almost done already! Very nice

vankesteren · 2023-08-16T11:39:33Z

metasynth/distribution/faker.py

+        lang = self.detect_language(series)
+        if lang is None:
+            return 9999999
+        return -1


Is this what makes the choice between regex and unstructured? Can we make it more explicit somehow? This logic is now hidden in the detect_language function right?

Yes, if lingua detects a language -> UnstructuredText, otherwise -> Regex. The logic is in the lingua package, so I don't really know how to make it more explicit? I can add a comment?
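
One way to make the choice more visible than a bare `9999999` / `-1` would be to name the two outcomes. The following is only a sketch under that assumption: `PREFER_FREETEXT`, `NEVER_SELECT`, `freetext_score`, and `toy_detector` are hypothetical names, and the toy detector merely stands in for lingua.

```python
import math

# Lower scores win the model selection; both names are illustrative only.
PREFER_FREETEXT = -1.0
NEVER_SELECT = math.inf


def freetext_score(values, detect_language):
    """Return a score that states the regex-vs-free-text decision explicitly."""
    language = detect_language(values)
    if language is None:
        # No language found: rule free text out so the regex fit is chosen.
        return NEVER_SELECT
    # A language was found: make free text beat the regex fit.
    return PREFER_FREETEXT


def toy_detector(values):
    """Toy stand-in for a language detector: requires a space in every value."""
    return "xx" if all(" " in value for value in values) else None
```

With named constants the reader can see at the call site that a missing language means "fall back to regex", without having to look inside `detect_language`.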

vankesteren · 2023-08-16T11:40:38Z

PS: I should test it with the pilot data first.

vankesteren · 2023-08-16T11:44:45Z

OK did that now, basically everything is replaced by the unstructured text by default. Is this what we want? I don't think so.

It's maybe better to explicitly have to set unstructured: True or so in the var_spec.

qubixes · 2023-08-16T11:55:03Z

@vankesteren It shouldn't replace all columns:

df = pl.DataFrame({
    "nltxt": ["Ik ben een kleine eenhoorn.", "Mijn opa loopt op sokken.", "Wie gaat weg?"], 
    "entxt": ["I'm a small unicorn.", "My grandfather walks in socks.", "Who is leaving?"],
    "detxt": ["Ich bin ein kleiner Einhorn.", "Mein Opa läuft auf Socken.", "Wer geht weg?"],
    "struct": ["x123", "x523", "x631"],
})

What kind of columns are you talking about? Categorical variables that are not labeled as such mostly?

I do think we should auto-detect, but obviously the constraints can be more stringent.

qubixes · 2023-08-16T11:56:15Z

I think this is a bit related to the regex distribution as well: detecting how well the regex does will also help the detection of the unstructured text.

qubixes · 2023-08-16T12:01:54Z

What about criteria for unstructured, something like:

Number of words/row > 1
Variable number of words/row > 90%
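
The two criteria above could be sketched roughly as follows. This is one possible reading, not the implemented rule: it assumes the first criterion means the mean word count per row exceeds 1, and the second means more than 90% of rows contain more than one word. `looks_like_free_text` and its thresholds are hypothetical.

```python
def words_per_row(values):
    """Count whitespace-separated words in every value."""
    return [len(value.split()) for value in values]


def looks_like_free_text(values, min_mean_words=1.0, min_multiword_fraction=0.9):
    """Apply both criteria; thresholds are assumptions taken from the comment."""
    counts = words_per_row(values)
    if not counts:
        return False
    mean_words = sum(counts) / len(counts)
    multiword_fraction = sum(count > 1 for count in counts) / len(counts)
    return mean_words > min_mean_words and multiword_fraction > min_multiword_fraction


sentences = ["Ik ben een kleine eenhoorn.", "Mijn opa loopt op sokken.", "Wie gaat weg?"]
codes = ["x123", "x523", "x631"]
print(looks_like_free_text(sentences), looks_like_free_text(codes))
```

Unlabeled categorical columns with single-word values would fail both checks under this reading, which is the case raised earlier in the thread.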

qubixes · 2023-08-17T11:27:47Z

In the end it would be nice to integrate it using AIC (or a derivative). If each word has a 1/10000 chance, then we could simply have L = 10000**-N_words. I don't know about values that cannot be fit with the regex, but I could imagine simply assigning them a small probability of 10^-6 or something. Otherwise we can compute the probability of the regexes reasonably easily.
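
As a rough illustration of the AIC idea, the sketch below compares a free-text likelihood of 10000**-N_words with a regex likelihood that falls back to 10^-6 for values the regex cannot produce. The function names, the per-value regex probabilities, and the parameter count of 1 for both models are assumptions made for this example only.

```python
import math

WORD_PROBABILITY = 1 / 10_000   # chance assigned to any single word
UNFITTABLE_PROBABILITY = 1e-6   # assumed fallback for values the regex cannot produce


def freetext_log_likelihood(values):
    """log L = N_words * log(1/10000), with words counted over all values."""
    n_words = sum(len(value.split()) for value in values)
    return n_words * math.log(WORD_PROBABILITY)


def regex_log_likelihood(value_probabilities):
    """Sum log-probabilities; None marks a value the regex cannot produce."""
    return sum(
        math.log(UNFITTABLE_PROBABILITY if p is None else p)
        for p in value_probabilities
    )


def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2 * n_parameters - 2 * log_likelihood


codes = ["x123", "x523", "x631"]
# Assumed regex "x" followed by three digits: each value has probability 1/1000.
regex_aic = aic(regex_log_likelihood([1e-3] * len(codes)), n_parameters=1)
freetext_aic = aic(freetext_log_likelihood(codes), n_parameters=1)
print(regex_aic < freetext_aic)
```

Under these assumptions the regex wins for short structured codes, while multi-word sentences that the regex cannot generate are penalised by the 10^-6 fallback.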

Some of the problems might also be fixed if we have a string Multinoulli distribution?

qubixes · 2023-08-30T13:47:40Z

@vankesteren I have updated the branch so that it shouldn't do unstructured text by default anymore. I have also renamed it to "freetext", let me know what you think!

vankesteren · 2023-09-01T12:52:00Z

Great, checked. Feel free to merge.

vankesteren requested changes Aug 16, 2023

qubixes marked this pull request as ready for review August 16, 2023 11:16

qubixes requested a review from vankesteren August 16, 2023 11:16

vankesteren requested changes Aug 16, 2023

qubixes added 9 commits August 30, 2023 15:32

Add unstructured text based on faker

4098c2b

Fix dependencies

3cc17a6

Improve unstructured text

3480d25

Improve documentation / automatic detection

1dc2b25

Fix pylint

7c43cb4

Fix pytest

3e9c8ff

Lower usage of unstructured text

1389e0f

Change name to free text

340e8b5

Rename again

48ec04d

qubixes force-pushed the faker-unstructured branch from 5b487c4 to 48ec04d Compare August 30, 2023 13:38

qubixes merged commit 0a49617 into main Sep 1, 2023
6 checks passed

qubixes deleted the faker-unstructured branch September 27, 2023 12:12
