KEN embeddings integration #578
-
Thanks for assembling this note.
Ideally, we could read a filtered subset straight from the remote file, along the lines of:

```python
import pyarrow.parquet as pq

# Hypothetical: fetch only the rows matching the filter, not the whole file.
pq.read_table("https://example.parquet", filters=[("hour", "==", 23)])
```

Otherwise, the user would need to download the entire file to disk (4 GB), which is doable but we'd like to avoid that.
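Something close to this is actually possible by pairing pyarrow's dataset API with fsspec's HTTP filesystem. A minimal sketch, assuming the server supports HTTP range requests and aiohttp is installed (the URL is a placeholder):

```python
import fsspec
import pyarrow.dataset as ds

# Open the remote parquet file through fsspec's HTTP filesystem; pyarrow
# pushes the filter down so that only matching row groups are fetched.
fs = fsspec.filesystem("https")
dataset = ds.dataset(
    "https://example.com/all_embeddings.parquet", format="parquet", filesystem=fs
)
table = dataset.to_table(filter=ds.field("hour") == 23)
```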
-
> The embedding file `all_embeddings.parquet` could be saved on the GitHub repository (dropping third-party dependencies like Figshare) with Git LFS.

Hosting the data on git is a bad idea: we will hit our data size limits really fast.

> instead provided as part of an optional dependency (`pip install skrub[ken]`), which would download the file.

I'm not very enthusiastic about using pip to install data. First, it cannot easily be called programmatically. Second, not everybody uses pip (some people use conda, poetry...).

> ```python
> >>> emb = get_embeddings("https://example.com/my_files/my_embeddings.parquet")  # Works fine!
> ```

I'm fine with having embeddings other than KEN. But 1) the difficulty is to create them, and 2) we might need a bit more than a single parquet file: with KEN, for instance, the types are very, very useful.
-
Hey @GaelVaroquaux, what alternative to pip for downloading the dataset would you recommend? My understanding is that a user currently has to go to the website to download it, but maybe that's the design we want?
-
> Hey @GaelVaroquaux, what alternative to pip for downloading the dataset would you recommend?

What's wrong with what we have?

> My understanding is that a user currently has to go to the website to download it,

No: we have functions that download them.
-
Hi, thanks for the remarks!

Maybe not the right term, but I mean that the file is downloaded the first time the embeddings are needed (on the first call of `get_ken_embeddings`).

On GitHub, this is indeed an issue: https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage

Good point!

True, I was thinking primarily of sub-sections of the main KEN embeddings, like the ones we created (games, movies, etc.). For example, people who often work with data on football players (I actually know a couple 😄) could create them by following a guide we can write later on, and re-use them afterwards, without having to deal with the massive original embeddings.

Taking these remarks into account, here's an updated proposal: we keep the download-on-first-call mechanism we currently have, and the signature of `get_ken_embeddings` becomes:

```python
def get_ken_embeddings(
    *,
    source: Optional[Union[Path, str]] = None,
    filter: str = "",
    pca_components: Optional[int] = None,
    suffix: str = "",
): ...
```
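For illustration, a call with this signature could look like the following (the URL and filter string are made-up placeholders):

```python
>>> emb = get_ken_embeddings(
...     source="https://example.com/my_files/my_embeddings.parquet",
...     filter="Type == 'football_player'",
...     pca_components=30,
... )
```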
-
I don't think that what we are discussing is a priority: in the big picture, the downloading mechanism of KEN is not a major weak point of skrub. @Vincent-Maladiere raised a comment during our Monday meeting that showed that he did not have a good understanding of what is in KEN. This is unrelated to the downloading. I think the way we should address it is with an online UI (ideally reusing an existing online service) that gives a better understanding of what's in KEN, plus a few lines in the docs / examples.
-
Yes, I think Gaël is right, @LilianBoulard: we should keep your ideas about optimizing the download in mind, but the most important part is the discovery of the categories.
-
What are your thoughts about using a simple service like Streamlit or HuggingFace Gradio / Spaces to create a tiny app? @GaelVaroquaux
Cool, but IMHO not a priority. My priority would be to get the data-assembly features in skrub.
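For scale, such a discovery app can indeed be tiny. A minimal Streamlit sketch, assuming the metadata lives in `all_embeddings.parquet` with `Entity` and `Type` columns (both assumptions):

```python
import pandas as pd
import streamlit as st

st.title("Explore the KEN embeddings")

# Load only the metadata columns, not the embedding vectors themselves.
df = pd.read_parquet("all_embeddings.parquet", columns=["Entity", "Type"])

# Let the user pick an entity type and preview the matching entities.
entity_type = st.selectbox("Entity type", sorted(df["Type"].unique()))
st.dataframe(df[df["Type"] == entity_type].head(100))
```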
-
Maybe the easiest solution for now is simply to correct the KEN example accordingly.
Tell me if this would make it more understandable, @Vincent-Maladiere.
-
Sounds cool to me too!
Could you please submit a PR with the code? I tend to disagree and think a fully-fledged example would be better (as it is easier to reference and search for), but maybe the code is short enough that it could be part of the function docstring!
To be honest, I don't even think they are useful enough for us to maintain. I think we should only have the main table and a comprehensible example of how to efficiently filter the embeddings. The goal is indeed not to host the user-created sub-tables.
-
> Could you please submit a PR with the code? I tend to disagree and think a fully-fledged example would be better (as it is easier to reference and search for)

I worry about adding too many examples: as the package grows, people will end up overwhelmed.

Having it in the function docstring is definitely cool. We also need to make the functionality discoverable: we should add a sentence on the functionality, with a link to the function, wherever relevant.
-
Hello, here's an update on the proposal:
**Current implementation**

**Problems with the current implementation**

**Pseudo-code of the suggested change**

```python
import shutil
import urllib.request

import pandas as pd
import pyarrow.parquet as pq
from sklearn.decomposition import IncrementalPCA


def get_ken_embeddings(url, destination, search=None, pca_components=None):
    # Download the file, then move (or copy) it into place.
    temp_file, _ = urllib.request.urlretrieve(url)
    shutil.move(temp_file, destination)

    kwargs = {}
    if search:
        kwargs["filters"] = search
    if pca_components:
        ipca = IncrementalPCA(n_components=pca_components)
        try:
            # Happy path: the filtered table fits in memory.
            df = pq.read_table(destination, **kwargs).to_pandas()
            return ipca.fit_transform(df)
        except MemoryError:
            # Fallback: fit the PCA incrementally, one batch at a time.
            # `make_batches` is a helper (still to be written) that yields
            # filtered batches read from the parquet file.
            for batch in make_batches(destination, **kwargs):
                ipca.partial_fit(batch)
            return pd.concat(
                pd.DataFrame(ipca.transform(batch))
                for batch in make_batches(destination, **kwargs)
            )
    else:
        return pq.read_table(destination, **kwargs).to_pandas()
```
-
Following the discussion we had during today's brainstorm, an interesting question is how to best integrate the KEN embeddings into the library.

**Current state**

Currently, the embeddings are lazily downloaded when the fetching function is first called.

A remark @Vincent-Maladiere and I had: when first looking at the KEN datasets, the limited set of subjects available (games, movies, etc.), few of which have common real-world use cases, tends to put users off.

**Ideas**
As mentioned by @Vincent-Maladiere, one cool feature of pyarrow is that we can read only the portion we are interested in. That means that even if the total embeddings are on disk (a few gigabytes), if we only need a subset, we don't need to load it all into memory.
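Concretely, with pyarrow's predicate push-down we can load just the rows and columns we care about. A short sketch, assuming the file is named `all_embeddings.parquet` with `Entity` and `Type` columns (the filter value is also made up):

```python
import pyarrow.parquet as pq

# Reads only the row groups that match the filter, instead of the
# whole multi-gigabyte file.
table = pq.read_table(
    "all_embeddings.parquet",
    columns=["Entity", "Type"],
    filters=[("Type", "==", "wikicat_video_game")],
)
df = table.to_pandas()
```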
**Proposal**

- The embedding file `all_embeddings.parquet` could be saved on the GitHub repository (dropping third-party dependencies like Figshare) with Git LFS.
- It would not be downloaded by default when installing skrub (`pip install skrub`), but instead provided as part of an optional dependency (`pip install skrub[ken]`), which would download the file.
- It would then be possible for users to create subsets of the embeddings from the code used to create the sub-tables we currently have, save them to disk, and/or upload them to any online service.
- To use these embeddings, users would have to specify the URI of the file, which is then read with parquet.
Example with pseudo-code:
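```python
>>> emb = get_embeddings("https://example.com/my_files/my_embeddings.parquet")  # Works fine!
```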
Please let us know what you think!