python-review-zoo

sys.argv usage

Consider the following code sample:

def main():
    if len(sys.argv) < 4:
        raise Exception("Cannot work with less than 3 arguments")
    # TODO check if files exists?

    files = Files(
        ancestor=sys.argv[1],
        mine=sys.argv[2],
        yours=sys.argv[3]
    )
    dry_run = "-d" in sys.argv

    merger = Merger(files=files)
    merger.merge(dry_run=dry_run)


if __name__ == "__main__":
    main()

In such a case, it is strongly suggested to use argparse. Unless you are referencing sys.argv[1] in a one-off script, always use argparse to accept command-line arguments.
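A sketch of what the argparse version of the snippet above could look like. The positional names are taken from the snippet; the long `--dry-run` option name is an assumption:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical argparse equivalent of the manual sys.argv handling above.
    # argparse validates the argument count and generates --help for free.
    parser = argparse.ArgumentParser(description="Three-way merge")
    parser.add_argument("ancestor")
    parser.add_argument("mine")
    parser.add_argument("yours")
    parser.add_argument("-d", "--dry-run", action="store_true")
    return parser.parse_args(argv)
```

Missing or extra arguments now produce a usage message instead of a bare Exception, and `args.dry_run` replaces the `"-d" in sys.argv` scan.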

usage of mkstemp

Consider the following class:

class Temporal(object):
    def __init__(self, files_contents):
        self.locations = {}
        for version, content in files_contents.items():
            fd, file_name = tempfile.mkstemp()
            self.locations[version] = file_name
            with open(fd, 'w') as opened:
                opened.write(content)

    def remove_all(self):
        for _, file_name in self.locations.items():
            os.remove(file_name)

From the Python docs: mkstemp() and mkdtemp() are lower-level functions which require manual cleanup.

The fd returned by tempfile.mkstemp() is an OS-level file descriptor that must be closed (see https://docs.python.org/3/library/os.html#os.close) or it leaks. Here open(fd, 'w') adopts the descriptor and closes it when the with block exits, but if the descriptor is discarded, or an exception fires before open() runs, it stays open. Cross the limit of open file descriptors on your system (commonly 1024 per process) and you'll start getting random errors.

Also, you don't provide any prefix or suffix to mkstemp. If your program dies unexpectedly, a file will be left in /tmp and it will be hard to distinguish it from other files that might be needed for something. Please provide a prefix.
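A sketch of the same write with a prefix and suffix supplied (the `merger-` prefix and the helper name are illustrative):

```python
import os
import tempfile

def write_temp(content, prefix="merger-", suffix=".txt"):
    # mkstemp returns an OS-level fd plus a path; open(fd, ...) adopts the
    # descriptor and closes it when the with block exits, so nothing leaks.
    # The prefix makes leftover files in /tmp identifiable after a crash.
    fd, file_name = tempfile.mkstemp(prefix=prefix, suffix=suffix)
    with open(fd, "w") as opened:
        opened.write(content)
    return file_name
```

The caller is still responsible for os.remove()-ing the file, as in remove_all() above.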

NotImplementedError

Consider the following class:

class BaseMerger(object):
    def __init__(self, files):
        self._files = files

    def merge(self):
        raise NotImplementedError("Implemented in subclass")

In such a case, it is strongly suggested to use the abc module (ABCMeta / abc.ABC). This way, if you forget to implement a method in one of the subclasses, you will get the error when the object is constructed, not much later when the missing method is finally called.
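A sketch of the same class using abc. BrokenMerger is a hypothetical subclass that forgot to implement merge(); defining it still succeeds, but constructing it raises TypeError immediately:

```python
import abc

class BaseMerger(abc.ABC):
    def __init__(self, files):
        self._files = files

    @abc.abstractmethod
    def merge(self):
        ...

class BrokenMerger(BaseMerger):
    pass  # forgot to implement merge() — instantiation will fail
```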

variable passed directly to %

see here: https://www.python.org/dev/peps/pep-0498/#rationale
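A minimal illustration of the hazard the PEP describes: when the right-hand operand of % happens to be a tuple, % tries to unpack it as multiple format arguments instead of formatting it as one value.

```python
value = (1, 2)

# "%s" % value  -> TypeError: not all arguments converted during string formatting
# Safe alternatives:
wrapped = "%s" % (value,)   # wrap the variable in a one-element tuple
fstring = f"{value}"        # or use an f-string (PEP 498)
```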

invalid import order

see here: http://isort.readthedocs.io/en/latest/#how-does-isort-work

catching too much

Consider the following method:

def download_page_with_retries(self, url):
    retries_num = 0
    while retries_num <= self.retries_limit:
        try:
            proxy = self.get_proxy()
            html_page, http_code = self.download_page(url, proxy)
            if http_code == 200:
                return html_page
        except requests.exceptions.RequestException:
            self.notice_proxy_error(proxy)
        retries_num += 1
    ...

If self.get_proxy() one day gets a new implementation that sometimes raises requests.exceptions.RequestException, it will be silently caught here and an innocent proxy will get downranked (worse: proxy may not even be bound yet when the except block runs). Better to move that line out of the try block. Generally, keep as little as possible inside try to avoid catching something accidentally.
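A sketch of the same loop with get_proxy() hoisted out of the try block. To keep the example self-contained, the collaborators are injected as callables rather than methods on self; the names mirror the snippet above:

```python
def download_with_retries(get_proxy, download_page, notice_proxy_error,
                          url, retries_limit=3, download_error=Exception):
    # Only download_page's failures are caught; get_proxy() runs outside
    # the try block, so its errors propagate instead of being swallowed.
    retries_num = 0
    while retries_num <= retries_limit:
        proxy = get_proxy()
        try:
            html_page, http_code = download_page(url, proxy)
            if http_code == 200:
                return html_page
        except download_error:
            notice_proxy_error(proxy)
        retries_num += 1
    return None
```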

formatting logs

Consider the following line:

logging.info("Spawned thread: {}".format(thread.name))

The potentially expensive format() call should only run if the configured log level actually allows the statement to be emitted. logging.info("Spawned thread: %s", thread.name) applies % only when necessary.

django default model ordering

Technically we may want to use it one day, but so far, every single time we've seen it used, it was excessive: it unnecessarily sorted where sorting was not needed, often missing an index.

Specifically: suppose you have an indexed column a with model ordering enabled on it, and another indexed column b that you filter on. Postgres can then either filter by b and sort in memory, or use the index order on a and do a full scan while filtering on b. That trade-off is fine if you really need the ordering, but often when you write Something.objects.filter(b=...) you don't actually need .order_by('a'), and expecting everyone to remember to call an empty .order_by() to disable the sorting inherited from the model is not a good idea.

django order_by annotation phenomenon

>>> list(SomeModel.objects.values('code').annotate(points=Count('date')).values_list('code', 'points'))
[('some_code', 1), ('some_code', 1), ('some_code', 1)]

but:

>>> list(SomeModel.objects.order_by().values('code').annotate(points=Count('date')).values_list('code', 'points'))
[('some_code', 3)]

(.values(...).annotate(...) is a Django idiom for GROUP BY, and Count('date') counts rows with NOT NULL date column)

what is happening is inside SomeModel.Meta:

ordering = ['other', 'date']

and Django automatically adds those fields to the GROUP BY so that it can ORDER BY them.

What is even worse: other is a models.ForeignKey(OtherModel) field and OtherModel.Meta has ordering = ['name'], so Django automatically JOINs the tables to fetch other__name and orders by that field as well.

Therefore:

  • clear the ordering with an empty .order_by() before the .values(...).annotate(...) GROUP BY pattern
  • maybe don't use Meta.ordering at all?

raising Exception with a message

Consider the following method:

def get_proxy(self):
    with self.lock:
        if len(self.proxies) > 0:
            return random.sample(self.proxies, 1)[0]
        else:
            raise Exception("Lack of working proxies")

The only way to catch this is to except Exception, compare the string, and re-raise if we caught the wrong thing. Please don't force the users of this class to do that. Instead, define your own exception hierarchy:

class YourModuleException(Exception):
    pass

class NoViableProxies(YourModuleException):
    pass

then the user can catch all of your exceptions or just the one and handle it appropriately.
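For illustration, here is how a caller might use that hierarchy. handle() is a hypothetical helper that exists only to demonstrate the except clauses:

```python
class YourModuleException(Exception):
    pass

class NoViableProxies(YourModuleException):
    pass

def handle(exc):
    # Callers match on the exception type, never on its message string.
    try:
        raise exc
    except NoViableProxies:
        return "retry later"          # the specific case
    except YourModuleException:
        return "module error"         # any other error from this module
```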

formatting multiline statements

Consider the following code snippet:

if response_html is None: 
    logging.info(
        "Aborting search for keyword: %s. Cannot download this page",
        job.start_url
    )

When the log template line is later extended to accept another parameter, the job.start_url line has to be deleted and re-added with a , after it. This is not terrible by itself, but it makes it harder for the reviewer to see what the change is about. With a trailing comma in place, the diff would show only the template change plus one added line. Therefore, end multi-line function calls with a trailing comma. The same applies to lists, sets, dicts and so on, when they are created across multiple lines.

usage of itertools chain

Consider the following code snippets:

for app_node in itertools.chain(*self._app_nodes.values()):
    app_node.destroy()
for app_nodes in self._app_nodes.values():
    for app_node in app_nodes:
        app_node.destroy()

The 2-nested loop is not so bad. The itertools.chain version with * unpacking is correct (although the unpacking materializes all the value lists up front), but which version would you rather support at 3:43 AM, when a production system is crashing and burning and a monitoring operator has woken you up to fix it? I know I would go with the nested loop; it doesn't make me think as much as the first one.
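If a flattened iteration is still preferred, itertools.chain.from_iterable avoids the eager * unpacking while staying lazy. app_nodes below is a stand-in for self._app_nodes:

```python
import itertools

app_nodes = {"web": ["w1", "w2"], "db": ["d1"]}

# from_iterable pulls one sub-list at a time, without first collecting
# all of .values() into a single argument tuple as chain(*...) does.
flattened = list(itertools.chain.from_iterable(app_nodes.values()))
```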

manual memoization

class A:
    def __init__(self):
        self._cached_value = None
    @property
    def b(self):
        if self._cached_value is None:
            print('generating')
            self._cached_value = self._helper_method()
        return self._cached_value

vs

import functools

class A:
    @functools.cached_property
    def b(self):
        print('generating')
        return self._helper_method()

(actually the helper method could be inlined here, making the code even cleaner)

This is how it works:

>>> a = A()
>>> a.b
generating
123
>>> a.b
123
>>> a.b
123
>>> 

While it may use slightly more memory than the first version, it makes the code much cleaner.

how to efficiently check if a loop was entered or not

http://python-notes.curiousefficiency.org/en/latest/python_concepts/break_else.html#but-how-do-i-check-if-my-loop-never-ran-at-all

and

https://github.com/bleachbit/bleachbit/commit/a5cbfbcec7705c35bd96792bb06d70fea125985c

quite a few well-described anti-patterns

https://docs.quantifiedcode.com/python-anti-patterns/correctness/index.html

Requests timeout

Remember to set a timeout on all requests in production code.

https://requests.readthedocs.io/en/latest/user/quickstart/#timeouts

https://requests.readthedocs.io/en/latest/user/advanced/#timeouts
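A minimal sketch (the URL parameter and the timeout values are arbitrary). Per the linked docs, the tuple form sets separate connect and read timeouts; without timeout=, requests can block indefinitely:

```python
import requests

def fetch(url):
    # (connect timeout, read timeout) in seconds — illustrative values.
    return requests.get(url, timeout=(3.05, 27))
```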

usage of @staticmethod

Guido van Rossum, Python's BDFL, wrote this about static methods:

They're basically an accident -- back in the Python 2.2 days when I was inventing new-style classes and descriptors, I meant to implement class methods but at first I didn't understand them and accidentally implemented static methods first. Then it was too late to remove them and only provide class methods.

Generally, whenever there is a temptation to use @staticmethod, just use @classmethod instead (and if it doesn't work, speak up on #default, someone will help you).

Usage of __subclasses__

This language feature is an accident: https://mail.python.org/pipermail/python-list/2003-August/210297.html

It's dangerous. Suppose you subclass an interface in an incompatible way, e.g. you add a mandatory function argument, while a (possibly indirect) parent uses __subclasses__() in a classmethod to build something like a registry: the parent will crash. This is really funny when the parent adds that classmethod in a version released after you created your subclass; it then crashes your deployment even though both sides follow best practices such as SemVer and appropriate version pinning.

Instead, explicitly use a decorator to register. We usually use https://class-registry.readthedocs.io/en/latest/iterating_over_registries.html
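A minimal, dictionary-based stand-in for the linked class-registry package (all names here are illustrative), showing the explicit opt-in pattern:

```python
MERGER_REGISTRY = {}

def register(name):
    # Explicit registration: a class joins the registry only when decorated,
    # so incompatible or unrelated subclasses are never picked up implicitly
    # the way __subclasses__() would pick them up.
    def decorator(cls):
        MERGER_REGISTRY[name] = cls
        return cls
    return decorator

@register("three_way")
class ThreeWayMerger:
    pass
```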