Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance #2626

Open
hexylena opened this issue Jul 2, 2021 · 7 comments
Open

Performance #2626

hexylena opened this issue Jul 2, 2021 · 7 comments

Comments

@hexylena
Copy link
Collaborator

hexylena commented Jul 2, 2021

@abretaud and I are working to debug an issue where the slowness of findAllOrganisms (>30s) is killing the training we're giving.

This route should be fast. like <2 seconds fast. I've replaced it with a flask app that talks directly to the DB and does all of the joins and filtering on the DB side which seems to be MUCH more efficient.

Here's the flask app which just replaces that one route.

from flask import Flask
import codecs
from flask import jsonify
from functools import wraps
from flask import render_template
from flask import request
from flask_sqlalchemy import SQLAlchemy
import time

global CACHED_RESULT
global CACHED_TIME
CACHED_RESULT = None
CACHED_TIME = 0

app = Flask(__name__)

app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://...:5432/apollo"
app.config['SQLALCHEMY_TRACK_MODIFICATIONS'] = False
db = SQLAlchemy(app)

QUERY = """
SELECT
    organism.common_name,
    organism.blatdb,
    organism.metadata,
    organism.obsolete,
    organism.directory,
    organism.public_mode,
    organism.valid,
    organism.genome_fasta_index,
    organism.genus,
    organism.species,
    organism.id,
    organism.non_default_translation_table,
    organism.genome_fasta,
    false AS currentorganism,
    sum(
        CASE
        WHEN feature.class
        IN (
                'org.bbop.apollo.RepeatRegion',
                'org.bbop.apollo.Terminator',
                'org.bbop.apollo.TransposableElement',
                'org.bbop.apollo.Gene',
                'org.bbop.apollo.Pseudogene',
                'org.bbop.apollo.PseudogenicRegion',
                'org.bbop.apollo.ProcessedPseudogene',
                'org.bbop.apollo.Deletion',
                'org.bbop.apollo.Insertion',
                'org.bbop.apollo.Substitution',
                'org.bbop.apollo.SNV',
                'org.bbop.apollo.SNP',
                'org.bbop.apollo.MNV',
                'org.bbop.apollo.MNP',
                'org.bbop.apollo.Indel'
            )
        THEN 1
        ELSE 0
        END
    ) AS annotationcount,
    count(distinct sequence.id) AS sequences
FROM
    organism
    LEFT OUTER JOIN sequence ON organism.id = sequence.organism_id
    LEFT OUTER JOIN feature_location ON
            sequence.id = feature_location.sequence_id
    LEFT OUTER JOIN feature ON
            feature.id = feature_location.feature_id
GROUP BY
    organism.common_name,
    organism.blatdb,
    organism.metadata,
    organism.obsolete,
    organism.directory,
    organism.public_mode,
    organism.valid,
    organism.genome_fasta_index,
    organism.genus,
    organism.species,
    organism.id,
    organism.non_default_translation_table,
    organism.genome_fasta
    ;
"""

columns =  [
    "commonName", "blatdb", "metadata" , "obsolete", "directory",
    "publicMode", "valid", "genomeFastaIndex", "genus", "species", "id",
    "nonDefaultTranslationTable", "genomeFasta", "currentOrganism",
    "annotationCount", "sequences"
]

def _fetch():
    roles = db.engine.execute(QUERY)
    out = []
    for role in roles:
        out.append(dict(zip(columns, role)))
    return out


@app.route("/get", methods=["GET", "POST"])
def doit():
    global CACHED_TIME
    global CACHED_RESULT
    now = time.time()
    if now - CACHED_TIME > 30:
        CACHED_RESULT = _fetch()
        CACHED_TIME = now

    return jsonify(CACHED_RESULT)

I'm running this service and we're just proxying that one route through our own version:

location /apollo/organism/findAllOrganisms {
   proxy_pass http://127.0.0.1:4321/get;
}

I think there are a couple parts to the issue:

  • lack of any indexes, not even on sequence.id, feature.id, etc.
  • doing operations in groovy rather than doing them in the DB, resulting in fetching more data and processing more slowly than the DB can.

A key point for me is that I really don't think apollo needs a graph database. I think it just needs some time spent understanding how to most effectively use SQL (I'm happy to offer my expertise there.)

@abretaud
Copy link
Collaborator

abretaud commented Jul 2, 2021

On my side I've made some profiling: most of the time is spent in this for loop, it takes ~0.2s per organism on my test setup => ~10sec for 40 orgs => you can easily hit a timeout if you have many orgs

abretaud added a commit to galaxy-genome-annotation/apolpi that referenced this issue Aug 27, 2021
@abretaud
Copy link
Collaborator

@hexylena I took the liberty to dockerize your code there: https://github.com/galaxy-genome-annotation/apolpi
I hope/guess it's ok for you (licensing too?)

@hexylena
Copy link
Collaborator Author

Ahhh awesome @abretaud that'll make it easier to deploy.

yeah license is fine :) (Normally I'd do agpl3 to force folks to contribute back their changes, but, in this case I don't think it matters)

@abretaud
Copy link
Collaborator

Cool thanks, used on apololo.genouest.org and bipaa.genouest.org/apollo now

@hexylena hexylena changed the title Performance on /organism/findAllOrganisms Performance Sep 3, 2021
@hexylena

This comment has been minimized.

@abretaud
Copy link
Collaborator

abretaud commented Sep 3, 2021

Yep I noticed it's slow too but I don't know why, maybe it's doing things on the data dir!?

@hexylena
Copy link
Collaborator Author

hexylena commented Sep 6, 2021

It's odd, the API responds quickly, it was just through the UI. Anyway

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants