KEN embeddings integration #578
-
Thanks for assembling this note.
Ideally, we could read a filtered subset straight from the remote file, along the lines of:

```python
import pyarrow.parquet as pq

# Hypothetical: fetch only the rows matching the filter, not the whole file.
pq.read_table("https://example.parquet", filters=[("hour", "==", 23)])
```

Otherwise, the user would need to download the entire file to disk (4 GB), which is doable but we'd like to avoid that.
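Something close to this is actually possible by pairing pyarrow's dataset API with fsspec's HTTP filesystem. A minimal sketch, assuming the server supports HTTP range requests and aiohttp is installed (the URL is a placeholder):

```python
import fsspec
import pyarrow.dataset as ds

# Open the remote parquet file through fsspec's HTTP filesystem; pyarrow
# pushes the filter down so that only matching row groups are fetched.
fs = fsspec.filesystem("https")
dataset = ds.dataset(
    "https://example.com/all_embeddings.parquet", format="parquet", filesystem=fs
)
table = dataset.to_table(filter=ds.field("hour") == 23)
```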
-
> The embedding file `all_embeddings.parquet` could be saved on the GitHub repository (dropping third-party dependencies like Figshare) with Git LFS.

Hosting the data on git is a bad idea: we will hit our data size limits really fast.

> instead provided as part of an optional dependency (`pip install skrub[ken]`), which would download the file.

I'm not very enthusiastic about using pip to install data. First, it cannot easily be called programmatically. Second, not everybody uses pip (some people use conda, poetry...).

> ```python
> >>> emb = get_embeddings("https://example.com/my_files/my_embeddings.parquet")  # Works fine!
> ```

I'm fine with having embeddings other than KEN. But 1) the difficulty is to create them, and 2) we might need a bit more than a single parquet file: with KEN, for instance, the types are very, very useful.
-
Hey @GaelVaroquaux, what alternative to pip for downloading the dataset would you recommend? My understanding is that a user currently has to go to the website to download it, but maybe that's the design we want?
-
> Hey @GaelVaroquaux, what alternative to pip for downloading the dataset would you recommend?

What's wrong with what we have?

> My understanding is that a user currently has to go to the website to download it,

No: we have functions that download them.
-
Hi, thanks for the remarks!

Maybe not the right term, but I mean that the file is downloaded the first time the embeddings are needed (on the first call of `get_ken_embeddings`).

On GitHub, this is indeed an issue: https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage

Good point!

True, I was thinking primarily of sub-sections of the main KEN embeddings, like the ones we created (games, movies, etc.). For example, people who often work with data on football players (I actually know a couple 😄) could create them by following a guide we can write later on, and re-use them afterwards, without having to deal with the massive original embeddings.

Taking these remarks into account, here's an updated proposal: we keep the download-on-first-call mechanism we currently have, and the signature of `get_ken_embeddings` becomes:

```python
def get_ken_embeddings(
    *,
    source: Optional[Union[Path, str]] = None,
    filter: str = "",
    pca_components: Optional[int] = None,
    suffix: str = "",
): ...
```
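For illustration, a call with this signature could look like the following (the URL and filter string are made-up placeholders):

```python
>>> emb = get_ken_embeddings(
...     source="https://example.com/my_files/my_embeddings.parquet",
...     filter="Type == 'football_player'",
...     pca_components=30,
... )
```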
-
I don't think that what we are discussing is a priority: in the big picture, the downloading mechanism of KEN is not a major weak point of skrub. @Vincent-Maladiere raised a comment during our Monday meeting that showed that he did not have a good understanding of what is in KEN. This is unrelated to the downloading. I think the way we should address it is with an online UI (ideally reusing an existing online service) that gives a better understanding of what's in KEN, plus a few lines in the docs / examples.
-
Yes, I think Gaël is right, @LilianBoulard: we should keep your ideas about optimizing the download in mind, but the most important part is the discovery of the categories.
-
What are your thoughts about using a simple service like Streamlit or HuggingFace Gradio / Spaces to create a tiny app? @GaelVaroquaux
Cool, but IMHO not a priority. My priority would be to get the data-assembly features in skrub.
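For scale, such a discovery app can indeed be tiny. A minimal Streamlit sketch, assuming the metadata lives in `all_embeddings.parquet` with `Entity` and `Type` columns (both assumptions):

```python
import pandas as pd
import streamlit as st

st.title("Explore the KEN embeddings")

# Load only the metadata columns, not the embedding vectors themselves.
df = pd.read_parquet("all_embeddings.parquet", columns=["Entity", "Type"])

# Let the user pick an entity type and preview the matching entities.
entity_type = st.selectbox("Entity type", sorted(df["Type"].unique()))
st.dataframe(df[df["Type"] == entity_type].head(100))
```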
-
Maybe the easiest solution for now is simply to correct the KEN example accordingly.
Tell me if this would make it more understandable, @Vincent-Maladiere.
-
Sounds cool to me too!
Could you please submit a PR with the code? I tend to disagree and think a fully-fledged example would be better (as it is easier to reference and search for), but maybe the code is short enough that it could be part of the function docstring!
To be honest, I don't even think they are useful enough for us to maintain. I think we should only have the main table and a comprehensible example of how to efficiently filter the embeddings. The goal is indeed not to host the user-created sub-tables.
-
> Could you please submit a PR with the code? I tend to disagree and think a fully-fledged example would be better (as it is easier to reference and search for)

I worry about adding too many examples: as the package grows, people will end up overwhelmed.

Having it in the function docstring is definitely cool. We also need to make the functionality discoverable: we should add a sentence on the functionality, with a link to the function, wherever relevant.
-
Hello, here's an update on the proposal:
**Current implementation**

**Problems with the current implementation**

**Pseudo-code of the suggested change**

```python
import shutil
import urllib.request

import pandas as pd
import pyarrow.parquet as pq
from sklearn.decomposition import IncrementalPCA


def get_ken_embeddings(url, destination, search=None, pca_components=None):
    # Download the file, then move (or copy) it into place.
    temp_file, _ = urllib.request.urlretrieve(url)
    shutil.move(temp_file, destination)

    kwargs = {}
    if search:
        kwargs["filters"] = search
    if pca_components:
        ipca = IncrementalPCA(n_components=pca_components)
        try:
            # Happy path: the filtered table fits in memory.
            df = pq.read_table(destination, **kwargs).to_pandas()
            return ipca.fit_transform(df)
        except MemoryError:
            # Fallback: fit the PCA incrementally, one batch at a time.
            # `make_batches` is a helper (still to be written) that yields
            # filtered batches read from the parquet file.
            for batch in make_batches(destination, **kwargs):
                ipca.partial_fit(batch)
            return pd.concat(
                pd.DataFrame(ipca.transform(batch))
                for batch in make_batches(destination, **kwargs)
            )
    else:
        return pq.read_table(destination, **kwargs).to_pandas()
```
-
Following the discussion we had during today's brainstorm, an interesting question is how to best integrate the KEN embeddings into the library.

**Current state**

Currently, the embeddings are lazily downloaded when the fetching function is first called.

A remark @Vincent-Maladiere and I had: when first looking at the KEN datasets, the limited set of subjects available (games, movies, etc.), few of which have common real-world use cases, tends to put users off.

**Ideas**
As mentioned by @Vincent-Maladiere, one cool feature of pyarrow is that we can read only the portion we are interested in. That means that even if the total embeddings are on disk (a few gigabytes), if we only need a subset, we don't need to load it all into memory.
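Concretely, with pyarrow's predicate push-down we can load just the rows and columns we care about. A short sketch, assuming the file is named `all_embeddings.parquet` with `Entity` and `Type` columns (the filter value is also made up):

```python
import pyarrow.parquet as pq

# Reads only the row groups that match the filter, instead of the
# whole multi-gigabyte file.
table = pq.read_table(
    "all_embeddings.parquet",
    columns=["Entity", "Type"],
    filters=[("Type", "==", "wikicat_video_game")],
)
df = table.to_pandas()
```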
**Proposal**

- The embedding file `all_embeddings.parquet` could be saved on the GitHub repository (dropping third-party dependencies like Figshare) with Git LFS.
- It would not be downloaded by default when installing skrub (`pip install skrub`), but instead provided as part of an optional dependency (`pip install skrub[ken]`), which would download the file.
- It would then be possible for users to create subsets of the embeddings from the code used to create the sub-tables we currently have, save them to disk, and/or upload them to any online service.
- To use these embeddings, users would have to specify the URI of the file, which is then read with parquet.
Example with pseudo-code:
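```python
>>> emb = get_embeddings("https://example.com/my_files/my_embeddings.parquet")  # Works fine!
```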
Please let us know what you think!