
Issues regarding the usage of the ProtBERT tokenizer on multi-FASTA files #34

Open
QuadratJunges opened this issue Sep 20, 2024 · 1 comment

Comments

@QuadratJunges

While trying to tokenize and embed sequences from a multi-FASTA file, the ProtBERT tokenizer and model always produce the same numpy arrays, even though the input sequences differ significantly. Has anyone else faced this problem? Looking forward to some input.

The submitted sequence list looks like this:

['IISACLAGEKCRYTGDGFDYPALRKLVEEGKAIPVCPEVLGGLSVPRDPNEIIGGNGFDVLDGKAKVLTNRGVDTTAAFVKGAAEVLAIAQKKGARVAVLKERSPSCGSTMIYDGTFSGRRIPGCGCTAALLVKEGIRVFSEEN', 'RLLLIDGNSIAFRSFFALQNSLSRFTNADGLHTNAIYGFNKMLDIILDNVNPTDALVAFDAGKTTFRTKMYTNYKGGRAKTPSELTEQMPYLRDLLTGYGIKSYEL...]

The output arrays look like this:

[array([[-0.09921601, 0.05850809, -0.0922595 , ..., -0.00792149,
-0.04542159, 0.07880748]], dtype=float32), array([[-0.09921601, 0.05850809, -0.0922595 , ..., -0.00792149,
-0.04542159, 0.07880748]], dtype=float32), array([[-0.09921601, 0.05850809, -0.0922595 , ..., -0.00792149,
-0.04542159, 0.07880748]], dtype=float32), array([[-0.09921601, 0.05850809, -0.0922595 , ..., -0.00792149,
-0.04542159, 0.07880748]], dtype=float32), array([[-0.09921601, 0.05850809, -0.0922595 , ..., -0.00792149,
-0.04542159, 0.07880748]], dtype=float32)]

The function I'm currently using looks like the following:

import torch

def EMBED_SEQUENCE(QUERY_SEQUENCES, TOKENIZER, MODEL):
    EMBEDDINGS = []
    MODEL.eval()
    for SEQ in QUERY_SEQUENCES:
        # Tokenize one sequence and mean-pool the last hidden state over tokens
        INPUTS = TOKENIZER(SEQ, return_tensors="pt", padding=True, truncation=True, max_length=1024)
        with torch.no_grad():
            OUTPUTS = MODEL(**INPUTS)
        EMBEDDING = OUTPUTS.last_hidden_state.mean(dim=1).cpu().numpy()
        EMBEDDINGS.append(EMBEDDING)
    return EMBEDDINGS

QUERIES = EMBED_SEQUENCE(QUERY_SEQUENCES, TOKENIZER, MODEL)
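
A quick way to check whether the identical embeddings come from identical tokenizer output is to print the token IDs per sequence. A minimal sketch, assuming the same TOKENIZER and QUERY_SEQUENCES as above:

# Inspect the token IDs produced for each sequence. If differing sequences
# map to the same IDs (e.g. a single [UNK] token), the pooled embeddings
# will be identical as well.
for SEQ in QUERY_SEQUENCES[:3]:
    IDS = TOKENIZER(SEQ, truncation=True, max_length=1024)["input_ids"]
    print(len(IDS), IDS[:10])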

@QuadratJunges
Author

I should add that I'm loading the prot_bert tokenizer and model as follows:

from transformers import BertTokenizer, BertModel
TOKENIZER = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False, clean_up_tokenization_spaces=True)
MODEL = BertModel.from_pretrained("Rostlab/prot_bert")
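
For reference, the Rostlab/prot_bert model card feeds sequences with single spaces between the amino acids and maps rare residues (U, Z, O, B) to X before tokenizing. A minimal preprocessing sketch under that assumption (FORMAT_SEQUENCE and RAW_SEQUENCES are placeholder names, not from the code above):

import re

def FORMAT_SEQUENCE(RAW_SEQ):
    # Insert a space between residues and map rare amino acids to X,
    # matching the input format shown on the Rostlab/prot_bert model card.
    return re.sub(r"[UZOB]", "X", " ".join(RAW_SEQ))

QUERY_SEQUENCES = [FORMAT_SEQUENCE(S) for S in RAW_SEQUENCES]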
