
Issues regarding the usage of the ProtBERT tokenizer on multi-FASTA files #34

Open
QuadratJunges opened this issue Sep 20, 2024 · 1 comment

Comments

@QuadratJunges

While trying to tokenize and embed sequences from a multi-FASTA file, the ProtBERT tokenizer and model always produce the same numpy arrays, even though the input sequences differ significantly. Has anyone else faced this problem? Looking forward to some input.

The submitted sequence list looks like this:

['IISACLAGEKCRYTGDGFDYPALRKLVEEGKAIPVCPEVLGGLSVPRDPNEIIGGNGFDVLDGKAKVLTNRGVDTTAAFVKGAAEVLAIAQKKGARVAVLKERSPSCGSTMIYDGTFSGRRIPGCGCTAALLVKEGIRVFSEEN', 'RLLLIDGNSIAFRSFFALQNSLSRFTNADGLHTNAIYGFNKMLDIILDNVNPTDALVAFDAGKTTFRTKMYTNYKGGRAKTPSELTEQMPYLRDLLTGYGIKSYEL...]

The output arrays look like this:

[array([[-0.09921601, 0.05850809, -0.0922595 , ..., -0.00792149,
-0.04542159, 0.07880748]], dtype=float32), array([[-0.09921601, 0.05850809, -0.0922595 , ..., -0.00792149,
-0.04542159, 0.07880748]], dtype=float32), array([[-0.09921601, 0.05850809, -0.0922595 , ..., -0.00792149,
-0.04542159, 0.07880748]], dtype=float32), array([[-0.09921601, 0.05850809, -0.0922595 , ..., -0.00792149,
-0.04542159, 0.07880748]], dtype=float32), array([[-0.09921601, 0.05850809, -0.0922595 , ..., -0.00792149,
-0.04542159, 0.07880748]], dtype=float32)]

The function I'm currently using looks like the following:

import torch

def EMBED_SEQUENCE(QUERY_SEQUENCES, TOKENIZER, MODEL):
    EMBEDDINGS = []
    MODEL.eval()
    for SEQ in QUERY_SEQUENCES:
        # Tokenize one sequence and mean-pool the last hidden state over tokens
        INPUTS = TOKENIZER(SEQ, return_tensors="pt", padding=True, truncation=True, max_length=1024)
        with torch.no_grad():
            OUTPUTS = MODEL(**INPUTS)
        EMBEDDING = OUTPUTS.last_hidden_state.mean(dim=1).cpu().numpy()
        EMBEDDINGS.append(EMBEDDING)
    return EMBEDDINGS

QUERIES = EMBED_SEQUENCE(QUERY_SEQUENCES, TOKENIZER, MODEL)
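
A quick way to check whether the identical embeddings come from identical tokenizer output is to print the token IDs per sequence. A minimal sketch, assuming the same TOKENIZER and QUERY_SEQUENCES as above:

# Inspect the token IDs produced for each sequence. If differing sequences
# map to the same IDs (e.g. a single [UNK] token), the pooled embeddings
# will be identical as well.
for SEQ in QUERY_SEQUENCES[:3]:
    IDS = TOKENIZER(SEQ, truncation=True, max_length=1024)["input_ids"]
    print(len(IDS), IDS[:10])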

@QuadratJunges
Author

I should add that I'm loading the prot_bert tokenizer and model as follows:

from transformers import BertTokenizer, BertModel
TOKENIZER = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False, clean_up_tokenization_spaces=True)
MODEL = BertModel.from_pretrained("Rostlab/prot_bert")
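
For reference, the Rostlab/prot_bert model card feeds sequences with single spaces between the amino acids and maps rare residues (U, Z, O, B) to X before tokenizing. A minimal preprocessing sketch under that assumption (FORMAT_SEQUENCE and RAW_SEQUENCES are placeholder names, not from the code above):

import re

def FORMAT_SEQUENCE(RAW_SEQ):
    # Insert a space between residues and map rare amino acids to X,
    # matching the input format shown on the Rostlab/prot_bert model card.
    return re.sub(r"[UZOB]", "X", " ".join(RAW_SEQ))

QUERY_SEQUENCES = [FORMAT_SEQUENCE(S) for S in RAW_SEQUENCES]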
