
Add support for NoiseBench #3512

Merged: 14 commits into master from add_noisebench on Dec 19, 2024
Conversation

@elenamer (Collaborator) commented Jul 24, 2024:

This PR adds support for NoiseBench, a subset of CoNLL-03 for English with 7 different label sets, including 6 different types of label noise.

Example usage (initialize the dataset with a crowdsourced version of the NER labels):

corpus = NER_NOISEBENCH(noise="crowd")

Potential issues:

To create the dataset, we need the CleanCoNLL labels as gold labels. In this implementation, that required running an external script: https://github.com/flairNLP/CleanCoNLL/blob/main/create_cleanconll_from_conll03.sh

EDIT: The implementation was modified to use a CLEANCONLL object, so this issue is no longer relevant.

    **corpusargs: The arguments propagated to :meth:`flair.datasets.ColumnCorpus.__init__`.
    """
    if noise not in ["clean", "crowd", "crowdbest", "expert", "distant", "weak", "llm"]:
        raise Exception("Please choose a valid version")
@fkdosilovic (Contributor):

It is recommended to raise ValueError for wrong and unexpected values passed to the argument.

Also, an error message would be more helpful if it listed valid values. Something like:

if noise not in VALID_NOISE_VALUES:
    raise ValueError(f"Unsupported value for noise argument. Got {noise}, expected one of {VALID_NOISE_VALUES}!")
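The reviewer's suggestion can be sketched as a self-contained check. Note that `VALID_NOISE_VALUES` and `check_noise` are hypothetical names used for illustration; the PR itself hard-codes the list inline inside `__init__`:

```python
# Hypothetical module-level constant listing the supported noise variants
# (values taken from the PR's inline list).
VALID_NOISE_VALUES = ("clean", "crowd", "crowdbest", "expert", "distant", "weak", "llm")


def check_noise(noise: str) -> None:
    """Raise ValueError (rather than a bare Exception) for an unsupported variant,
    naming the offending value and the accepted ones in the message."""
    if noise not in VALID_NOISE_VALUES:
        raise ValueError(
            f"Unsupported value for noise argument. Got {noise}, "
            f"expected one of {VALID_NOISE_VALUES}!"
        )
```

Listing the accepted values in the message saves the user a trip to the docs when they mistype a variant name.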


self._set_path(base_path)
@fkdosilovic (Contributor):

I would remove this function and just set the base_path here in __init__:

if base_path:
    self.base_path = Path(base_path)
else:
    self.base_path = flair.cache_root / "datasets" / "noisebench"

self.base_path = flair.cache_root / "datasets" / "noisebench" if not base_path else Path(base_path)

@staticmethod
def read_column_file(filename):
@fkdosilovic (Contributor):

You are adding a new public method to NER_NOISEBENCH. I would avoid this. Also, you rely on this method only from non-public methods of the class (except generate_data_files, which I would also make non-public).

My suggestion is to rename the method to _read_column_file and add type annotations for filename argument.

return all_x

@staticmethod
def save_to_column_file(filename, list):
@fkdosilovic (Contributor) commented Oct 31, 2024:

  1. Rename to _save_to_column_file
  2. Add type annotations to arguments
  3. Avoid using the name of built-in modules/types/etc. for argument and variable names. Here, you are using list as the name for the argument. Seems like a more appropriate name is sentences (which is of type list).


def generate_data_files(self, filename):

with open(os.path.join(self.base_path, "annotations_only", "index.txt")) as index_file:
@fkdosilovic (Contributor):

You can avoid this os.path.join since base_path is of type Path; just use / to concatenate path components. Also, we should not rely on implicit defaults: it is better to state explicitly that we read the file as utf-8:

with open(self.base_path / "annotations_only" / "index.txt", "r", encoding="utf-8") as fp:
    # ...

with open(os.path.join(self.base_path, "annotations_only", "index.txt")) as index_file:
    token_indices = index_file.readlines()

all_clean_sentences = self.read_column_file(os.path.join(self.cleanconll_base_path, "cleanconll.train"))
@fkdosilovic (Contributor):

Same here, use self.cleanconll_base_path / "cleanconll.train" instead of os.path.join.

)

# copy test set
all_clean_test_sentences = self.read_column_file(os.path.join(self.cleanconll_base_path, "cleanconll.test"))
@fkdosilovic (Contributor):

Same here, use self.cleanconll_base_path / "cleanconll.test" instead of os.path.join.

        new_s.append([token[0], token[4]])
    test_sentences.append(new_s)

self.save_to_column_file(os.path.join(self.base_path, "clean.test"), test_sentences)
@fkdosilovic (Contributor):

Same here, use self.base_path / "clean.test" instead of os.path.join.

new_s = []
for token in s:
    new_s.append([token[0], token[4]])
test_sentences.append(new_s)
@fkdosilovic (Contributor):

This can be refactored to a more idiomatic approach with list comprehensions:

for sentence in all_clean_test_sentences:
    new_sentence = [[tokens[0], tokens[4]] for tokens in sentence]
    test_sentences.append(new_sentence)

for token, label in zip(clean_sentence, sentence):
    label[0] = token[0]  # token[0] -> text, token[1] -> BIO label
if self.SAVE_TRAINDEV_FILE:
    self.save_to_column_file(os.path.join(self.base_path, f"{corpus}.traindev"), noisy_labels)
@fkdosilovic (Contributor):

Same here, use self.base_path / f"{corpus}.traindev" instead of os.path.join.


noisy_labels = self.read_column_file(os.path.join(self.base_path, "annotations_only", f"{corpus}.traindev"))
# print(noisy_labels)
# print(token_indices)
@fkdosilovic (Contributor):

Remove these debugging prints.

def _merge_tokens_labels(self, corpus, all_clean_sentences, token_indices):
    # generate NoiseBench dataset variants, given CleanCoNLL, noisy label files and index file

    noisy_labels = self.read_column_file(os.path.join(self.base_path, "annotations_only", f"{corpus}.traindev"))
@fkdosilovic (Contributor):

Same here, use self.base_path / "annotations_only" / f"{corpus}.traindev" instead of os.path.join.

point = []
for line in lines:
    if "\t" in line.strip():
        stripped_line = line.strip().split("\t") if "\t" in line.strip() else line.strip().split(" ")
@fkdosilovic (Contributor):

The conditional expression stripped_line = line.strip().split("\t") if "\t" in line.strip() else ... will always take the "\t" branch, because of the enclosing condition if "\t" in line.strip() in line 5076. Refactor to:

if "\t" in line.strip():
    stripped_line = line.strip().split("\t")
else:
    stripped_line = line.strip().split(" ")

def save_to_column_file(filename, list):
    with open(filename, "w") as f:
        for sentence in list:
            for token in sentence:
@fkdosilovic (Contributor):

Rename the iteration variable to tokens to convey that this is a container of elements, not a single token.

@alanakbik (Collaborator):

@fkdosilovic thanks a lot for your review, this is very helpful! Please note that this is still a draft PR and one major change is still planned.

@elenamer (Collaborator, Author):

@fkdosilovic Thanks a lot for the review; it was very helpful, and I incorporated most of your feedback!

Also, the CLEANCONLL corpus is now part of Flair, so I modified this PR to use it instead of downloading and running an external .sh script.

@elenamer elenamer marked this pull request as ready for review December 16, 2024 10:08
@alanakbik (Collaborator):

@elenamer thanks for adding this!

@alanakbik alanakbik merged commit 016079e into master Dec 19, 2024
2 checks passed
@alanakbik alanakbik deleted the add_noisebench branch December 19, 2024 12:48