parallel_bulk does not seem to respect chunk_size + parallel_bulk memory leak? #1101

Open
nicolaipre opened this issue Feb 5, 2020 · 7 comments


@nicolaipre

  • Elasticsearch server version: 7.5.2
  • Elasticsearch Python client (elasticsearch-py) version: 7.5.1

I am trying to parse files containing millions of lines, and I am using the helpers.parallel_bulk function for indexing data.

However, it seems that parallel_bulk does not respect the chunk_size parameter, and instead fills up my memory with all the data before it starts insertion.

Code excerpt (full script can be found here):

    # Generate data for Elasticsearch to index
    def generator(self):
        yields = 0
        with open(self.file_name, encoding="utf-8") as fp:
            for line in fp:
                match = self.regex.search( line.strip() )

                if match:
                    if not self.dry_run: # expensive to have dry run check here?
                        entry = match.groupdict()
                        entry['leak'] = self.leak_name # use _type instead? save space... / bad for search performance?

                        doc = dict()
                        doc['_index']  = self.index
                        doc['_type']   = self.doc_type # or should this be self.leak_name maybe ?
                        doc['_id']     = hashlib.md5( json.dumps(entry, sort_keys=True).encode("utf-8") ).hexdigest() # Create md5sum based on the entry, and add it to the document we want to index # expensive operation
                        doc['_source'] = entry
                        #pprint(doc)
                        yields += 1
                        print('Yields: ', yields)
                        yield doc

                    self.good += 1
                else:
                    self.bad += 1
                    self.ignored.write("%s" % line)

                status = "[Good: {}, bad: {}]".format(self.good, self.bad)
                Utils.progressbar( (self.good+self.bad), self.total, suffix=status)



    # Initialize
    def run(self):

        # Very good documentation here: https://bluesock.org/~willkg/blog/dev/elasticsearch_part1_index.html
        # and here: https://elasticsearch-py.readthedocs.io/en/master/helpers.html
        # Create Elasticsearch connection
        es = Elasticsearch(
            host               = self.host,
            port               = self.port,
            timeout            = 10,
            request_timeout    = 10,
            max_retries        = 10,
            retry_on_timeout   = True
        )
        es.cluster.health(wait_for_status='yellow')

        try:
            for success, info in helpers.parallel_bulk(
                client             = es,
                actions            = self.generator(),
                #thread_count       = 2,
                chunk_size         = self.chunk_size#,
                #max_chunk_bytes    = 1 * 1024 * 1024,
                #queue_size         = 2,
                #raise_on_exception = True
            ):
                #print(success)
                print('INSERTED!')
                #es.cluster.health(wait_for_status='yellow') # During insertion, wait for cluster to be in good state?
                if not success:
                    #print('[INDEXER]: A document failed:', info)
                    self.failedfile.write("A document failed: %s" % info) # TODO: replace stuff like this with logger

        except Exception as error:
            raise error

The "yields" variable counts all the yields done in the generator loop. If I specify a chunk_size of 500, it was of my understanding that the parallel_bulk function should start indexing once the chunk_size is reached? Instead it continues without inserting until all input is completely read (or at around 400.000 chunks). I have confirmed this by printing on success.

Perhaps I am missing something here, or is this expected behavior?

@nicolaipre
Author

nicolaipre commented Feb 5, 2020

I found the problem with the chunk_size parameter not being respected: I had forgotten to cast the chunk_size input to an integer, so it was passed as a string. Unfortunately I did not get an error about this.
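
A minimal sketch of that fix, assuming chunk_size comes in from the command line (the argument name and default here are hypothetical):

    # Hypothetical CLI parsing: casting chunk_size to int up front prevents a
    # string from being passed through to parallel_bulk unnoticed.
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--chunk-size", type=int, default=500)  # type=int does the cast
    args = parser.parse_args()

    # later: helpers.parallel_bulk(client=es, actions=self.generator(), chunk_size=args.chunk_size)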

As for the memory issue: it works fine with streaming_bulk and stays stable at around 50 MB of memory usage.

With parallel_bulk, however, memory still appears to leak or fill up.

@nicolaipre nicolaipre changed the title parallel_bulk does not seem to respect chunk_size parallel_bulk does not seem to respect chunk_size + parallel_bulk memory leak? Feb 5, 2020
@AntonFriberg
Contributor

I have had similar memory issues with parallel_bulk, and they were also resolved by using streaming_bulk inside my own process pool instead (a minimal sketch of that pattern follows below).

#1077
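
A minimal sketch of this workaround, assuming the input has already been split into shard files; the file paths, index name, and document shape are made up for illustration:

    # Sketch: run streaming_bulk in separate worker processes instead of
    # relying on parallel_bulk's internal ThreadPool.
    from multiprocessing import Pool

    from elasticsearch import Elasticsearch, helpers


    def index_shard(path):
        # Each worker creates its own client; clients are not shared across processes.
        es = Elasticsearch(host="localhost", port=9200)

        def actions():
            with open(path, encoding="utf-8") as fp:
                for line in fp:
                    yield {"_index": "my-index", "_source": {"line": line.strip()}}

        ok = 0
        # streaming_bulk pulls actions lazily, chunk by chunk, so memory stays flat.
        for success, info in helpers.streaming_bulk(es, actions(), chunk_size=500):
            ok += int(success)
        return path, ok


    if __name__ == "__main__":
        shards = ["shard-00.txt", "shard-01.txt", "shard-02.txt"]  # hypothetical paths
        with Pool(processes=3) as pool:
            for path, ok in pool.imap_unordered(index_shard, shards):
                print(path, "indexed", ok, "documents")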

@seanzhang422

Same issue here with parallel_bulk: the memory usage kept going up for all Python processes running parallel_bulk. Each process reached about 2 GB (I have 10 child processes) and the system ran out of memory. I had to switch back to streaming_bulk and increase the worker count. After that, each process only consumed 88 MB and stayed stable.

@sethmlarson
Contributor

It does look like all the documents are loaded into memory before being distributed to the workers. It would be better to use memory channels to distribute tasks.

@bzijlema

bzijlema commented Oct 14, 2020

I have the same problem of memory filling up when using parallel_bulk. I see in the source code that parallel_bulk uses multiprocessing.pool.ThreadPool. In some experiments I found that the problem does not occur when using multiprocessing.Pool instead, so it might be a good idea to switch parallel_bulk to multiprocessing.Pool, especially since ThreadPool is not documented.
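
A rough sketch of that idea: chunk the actions in the parent process and dispatch each chunk to a multiprocessing.Pool worker that calls helpers.bulk. This is only an approximation of the approach, not the library's implementation, and the connection settings and helper names are assumptions:

    # Sketch: dispatch pre-built chunks of actions to process-pool workers.
    from itertools import islice
    from multiprocessing import Pool

    from elasticsearch import Elasticsearch, helpers

    _es = None  # per-process client, created by the pool initializer


    def _init_worker():
        global _es
        _es = Elasticsearch(host="localhost", port=9200)


    def _bulk_chunk(chunk):
        # helpers.bulk with stats_only=True returns (successes, failures)
        return helpers.bulk(_es, chunk, stats_only=True)


    def chunked(actions, size):
        # Group an iterable of actions into lists of at most `size` items.
        it = iter(actions)
        while True:
            chunk = list(islice(it, size))
            if not chunk:
                return
            yield chunk


    def parallel_index(actions, chunk_size=500, processes=4):
        with Pool(processes=processes, initializer=_init_worker) as pool:
            for ok, failed in pool.imap(_bulk_chunk, chunked(actions, chunk_size)):
                yield ok, failed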

@mathijsfr

mathijsfr commented Nov 11, 2022

Could you give a sample implementation of how you solved this issue? Using streaming_bulk instead of parallel_bulk does not seem to solve my memory issues. I am trying to index 60 million documents.

@ShikharGuptaX

ShikharGuptaX commented Oct 30, 2023

I had an issue with parallel_bulk leaking memory when it was called over and over again. I was hitting this on Python 3.11.0 but not on Python 3.11.1. I believe it was due to gh-99205, which was fixed in Python 3.11.1: creating a new ThreadPool on every parallel_bulk call was triggering that bug.
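
A hedged repro sketch of that pattern (index name and document shape are made up): calling parallel_bulk in a loop creates a fresh internal ThreadPool per call, which is what reportedly leaked on Python 3.11.0 and not on 3.11.1:

    # Sketch: repeated parallel_bulk calls, each of which spins up and tears
    # down its own ThreadPool internally.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(host="localhost", port=9200)


    def batch(n):
        for i in range(1000):
            yield {"_index": "my-index", "_source": {"batch": n, "i": i}}


    for n in range(10_000):
        for ok, info in helpers.parallel_bulk(es, batch(n), chunk_size=500):
            if not ok:
                print("failed:", info)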
