
Simpler PenguinDataset #71

Closed
ma595 opened this issue Jul 4, 2024 · 5 comments · Fixed by #73
ma595 commented Jul 4, 2024

To make the data-reading aspect a little easier to understand, we intend to embed a simpler version of the PenguinDataset directly in the notebook. The existing src/ml_workshop/_penguins.py will remain untouched and can still be used as before.

Thought process around the 'simpler' class (discussion between @jatkinson1000 and @ma595)

Load the pandas DataFrame in the notebook.

Put the definition of the PenguinDataset in the notebook:

  • Remove x_tfms and y_tfms
  • Hardcode one_hot, tensor, fp32 into __getitem__

Propagate the change to:

  • solution notebook
  • colab
  • exercise 2 onwards? Consider leaving these as they are.

ma595 commented Jul 4, 2024

Below is a first attempt at this:

  • For clarity I removed all docstrings. Whether we should do this in the notebook is up for discussion.
  • I also pass a transform boolean, which is a bit hacky.
  • I wonder whether it's possible to 'hide' some of the code.
from typing import Optional, List, Dict, Tuple, Any

# import pytorch functions necessary for transformations:
from torch import tensor, float32, eye

from torch.utils.data import Dataset

from pandas import DataFrame

from palmerpenguins import load_penguins


class PenguinDataset(Dataset):
    def __init__(
        self,
        input_keys: List[str],
        target_keys: List[str],
        train: bool,
        transform: Optional[bool] = False,
    ):
        """Build ``PenguinDataset``."""
        self.input_keys = input_keys
        self.target_keys = target_keys

        self.full_df = _load_penguin_data()
        self.split = _split_data(self.full_df)["train" if train is True else "valid"]

        self.transform = transform

    def __len__(self) -> int:
        return len(self.split)
    
    def __getitem__(self, idx: int) -> Tuple[Any, Any]:
        # get the row index (idx) from the dataframe and 
        # select relevant column features (provided as input_keys)
        feats = self.split.iloc[idx][self.input_keys]

        # this gives a 'species' i.e. one of ('Gentoo',), ('Chinstrap',), or ('Adelie',)  
        tgts = self.split.iloc[idx][self.target_keys]

        if self.transform: 
            # Put the features in a more suitable format for PyTorch 
            
            # Exercise #1: convert the feats (a pandas Series) to a PyTorch tensor
            feats = tensor(feats.to_numpy(), dtype=float32)

            # Exercise #2: convert the target species to a 'one-hot' vector
            target_names = sorted(self.full_df.species.unique())
            tgts = eye(len(target_names))[target_names.index(tgts.iloc[0])]
                

        return (feats, tgts)
    
def _load_penguin_data() -> DataFrame:

    data = load_penguins()
    data = (
        data.loc[~data.isna().any(axis=1)]
        .sort_values(by=sorted(data.keys()))
        .reset_index(drop=True)
    )
    # Transform the sex field into a float, with male represented by 1.0, female by 0.0
    data.sex = (data.sex == "male").astype(float)
    return data

def _split_data(penguin_df: DataFrame) -> Dict[str, DataFrame]:
    valid_df = penguin_df.groupby(by=["species", "sex"]).sample(
        n=10,
        random_state=123,
    )

    # The training items are simply the items *not* in the valid split
    train_df = penguin_df.loc[~penguin_df.index.isin(valid_df.index)]

    return {"train": train_df, "valid": valid_df}
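
As an aside, the eye-row trick used for the one-hot exercise can be illustrated without torch at all: row i of an identity matrix is the one-hot vector for class i. A stdlib-only sketch of the same idiom (the `identity` and `one_hot` helpers are made up for illustration; `identity` stands in for `torch.eye`):

```python
def identity(n):
    """Build an n x n identity matrix as nested lists (stands in for torch.eye)."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def one_hot(label, labels):
    """Select the identity row matching the label's position in the sorted labels."""
    names = sorted(labels)
    return identity(len(names))[names.index(label)]


species = ["Adelie", "Chinstrap", "Gentoo"]
print(one_hot("Chinstrap", species))  # [0.0, 1.0, 0.0]
```

Sorting the labels first is what makes the encoding stable: the same species always maps to the same row, regardless of the order rows appear in the dataframe.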

@ma595 ma595 self-assigned this Jul 4, 2024
@jatkinson1000

Another 2c:

Could you get rid of the transform flag and instead hard-code the processing? I can't see that anyone will ever use it without transform=True.
You could have the default code in there to start, run it to show the obvious issues, then replace the code for feats and tgts? That would remove the 'hackiness' and potential confusion.

The _load_penguin_data function could just be subsumed into the __init__ method of the PenguinDataset?

Same for _split_data?
I only just saw how Jim did this: running the split on the fly, but fixing the seed so it always produces the same split.
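
The fixed-seed point generalises: with the same seed, a sampler always draws the same items, so an on-the-fly train/valid split is reproducible across runs. A minimal stdlib sketch of that idea (`sample_split` is a hypothetical helper; `random.Random(seed)` stands in for pandas' `random_state`):

```python
import random


def sample_split(items, n_valid, seed=123):
    """Deterministically pick a validation subset; everything else is training."""
    rng = random.Random(seed)  # fixed seed => identical draw every run
    valid = set(rng.sample(items, n_valid))
    train = [item for item in items if item not in valid]
    return train, sorted(valid)


items = list(range(20))
split_a = sample_split(items, 5)
split_b = sample_split(items, 5)
print(split_a == split_b)  # True: same seed, same split
```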


ma595 commented Jul 4, 2024

Thanks @jatkinson1000, I'll make those changes.


ma595 commented Jul 4, 2024

from typing import List, Tuple, Any

# import pytorch functions necessary for transformations:
from torch import tensor, float32, eye

from torch.utils.data import Dataset

from pandas import DataFrame

from palmerpenguins import load_penguins


class PenguinDataset(Dataset):
    def __init__(
        self,
        input_keys: List[str],
        target_keys: List[str],
        train: bool,
    ):
        """Build ``PenguinDataset``."""
        self.input_keys = input_keys
        self.target_keys = target_keys

        data = load_penguins()
        data = (
            data.loc[~data.isna().any(axis=1)]
            .sort_values(by=sorted(data.keys()))
            .reset_index(drop=True)
        )
        # Transform the sex field into a float, with male represented by 1.0, female by 0.0
        data.sex = (data.sex == "male").astype(float)
        self.full_df = data

        valid_df = self.full_df.groupby(by=["species", "sex"]).sample(
            n=10,
            random_state=123,
        )
        # The training items are simply the items *not* in the valid split
        train_df = self.full_df.loc[~self.full_df.index.isin(valid_df.index)]

        self.split = {"train": train_df, "valid": valid_df}["train" if train else "valid"]


    def __len__(self) -> int:
        return len(self.split)
    
    def __getitem__(self, idx: int) -> Tuple[Any, Any]:
        # get the row index (idx) from the dataframe and 
        # select relevant column features (provided as input_keys)
        feats = self.split.iloc[idx][self.input_keys]

        # this gives a 'species' i.e. one of ('Gentoo',), ('Chinstrap',), or ('Adelie',)  
        tgts = self.split.iloc[idx][self.target_keys]

        # Exercise #1: convert the feats (a pandas Series) to a PyTorch tensor
        feats = tensor(feats.to_numpy(), dtype=float32)

        # Exercise #2: convert the target species to a 'one-hot' vector
        target_names = sorted(self.full_df.species.unique())
        tgts = eye(len(target_names))[target_names.index(tgts.iloc[0])]

        return feats, tgts
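
For students new to torch, the class above works because a map-style Dataset only needs `__len__` and `__getitem__`; any object implementing those supports `len()`, indexing, and plain iteration. A stdlib-only sketch of the same protocol (`SquareDataset` is a made-up stand-in, not part of the workshop code):

```python
class SquareDataset:
    """Minimal stand-in for a torch-style Dataset: just __len__ and __getitem__."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        if idx >= self.n:
            raise IndexError(idx)  # lets plain for-loops and list() terminate
        return (idx, idx * idx)  # a (feature, target) pair, like (feats, tgts)


ds = SquareDataset(4)
print(len(ds))   # 4
print(ds[2])     # (2, 4)
print(list(ds))  # [(0, 0), (1, 1), (2, 4), (3, 9)]
```

A torch DataLoader drives exactly this interface: it asks for `len(dataset)`, then fetches items by index and collates the returned pairs into batches.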

@jatkinson1000

If the above all works I think that is a lot more transparent to students what is going on, and directly introduces the Torch Dataset. Nice one.
