parallel_bulk does not seem to respect chunk_size + parallel_bulk memory leak? #1101

Open
nicolaipre opened this issue Feb 5, 2020 · 7 comments


@nicolaipre

  • Elasticsearch server version: 7.5.2
  • Elasticsearch Python client (elasticsearch-py) version: 7.5.1

I am trying to parse files containing millions of lines, and I am using the helpers.parallel_bulk function for indexing data.

However, it seems that parallel_bulk does not respect the chunk_size parameter, and instead fills up my memory with all the data before it starts insertion.

Code excerpt (full script can be found here):

    # Generate data for Elasticsearch to index
    def generator(self):
        yields = 0
        with open(self.file_name, encoding="utf-8") as fp:
            for line in fp:
                match = self.regex.search( line.strip() )

                if match:
                    if not self.dry_run: # expensive to have dry run check here?
                        entry = match.groupdict()
                        entry['leak'] = self.leak_name # use _type instead? save space... / bad for search performance?

                        doc = dict()
                        doc['_index']  = self.index
                        doc['_type']   = self.doc_type # or should this be self.leak_name maybe ?
                        doc['_id']     = hashlib.md5( json.dumps(entry, sort_keys=True).encode("utf-8") ).hexdigest() # Create md5sum based on the entry, and add it to the document we want to index # expensive operation
                        doc['_source'] = entry
                        #pprint(doc)
                        yields += 1
                        print('Yields: ', yields)
                        yield doc

                    self.good += 1
                else:
                    self.bad += 1
                    self.ignored.write("%s" % line)

                status = "[Good: {}, bad: {}]".format(self.good, self.bad)
                Utils.progressbar( (self.good+self.bad), self.total, suffix=status)



    # Initialize
    def run(self):

        # Very good documentation here: https://bluesock.org/~willkg/blog/dev/elasticsearch_part1_index.html
        # and here: https://elasticsearch-py.readthedocs.io/en/master/helpers.html
        # Create Elasticsearch connection
        es = Elasticsearch(
            host               = self.host,
            port               = self.port,
            timeout            = 10,
            request_timeout    = 10,
            max_retries        = 10,
            retry_on_timeout   = True
        )
        es.cluster.health(wait_for_status='yellow')

        try:
            for success, info in helpers.parallel_bulk(
                client             = es,
                actions            = self.generator(),
                #thread_count       = 2,
                chunk_size         = self.chunk_size#,
                #max_chunk_bytes    = 1 * 1024 * 1024,
                #queue_size         = 2,
                #raise_on_exception = True
            ):
                #print(success)
                print('INSERTED!')
                #es.cluster.health(wait_for_status='yellow') # During insertion, wait for cluster to be in good state?
                if not success:
                    #print('[INDEXER]: A document failed:', info)
                    self.failedfile.write("A document failed: %s" % info) # TODO: replace stuff like this with logger

        except Exception as error:
            raise error

The "yields" variable counts all the yields done in the generator loop. If I specify a chunk_size of 500, it was of my understanding that the parallel_bulk function should start indexing once the chunk_size is reached? Instead it continues without inserting until all input is completely read (or at around 400.000 chunks). I have confirmed this by printing on success.

Perhaps I am missing something here, or is this expected behavior?

@nicolaipre
Author

nicolaipre commented Feb 5, 2020

I found the problem with the chunk_size parameter not being respected: I had forgotten to cast the chunk_size input to an integer, so it was passed as a string. Unfortunately I did not get an error about this.
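
A minimal sketch of that fix, assuming chunk_size comes in from the command line (the argument name and default here are hypothetical):

    # Hypothetical CLI parsing: casting chunk_size to int up front prevents a
    # string from being passed through to parallel_bulk unnoticed.
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--chunk-size", type=int, default=500)  # type=int does the cast
    args = parser.parse_args()

    # later: helpers.parallel_bulk(client=es, actions=self.generator(), chunk_size=args.chunk_size)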

As for the memory issue: it works fine with streaming_bulk and stays stable at around 50 MB of memory usage.

With parallel_bulk, however, memory still appears to leak or fill up.

@nicolaipre nicolaipre changed the title parallel_bulk does not seem to respect chunk_size parallel_bulk does not seem to respect chunk_size + parallel_bulk memory leak? Feb 5, 2020
@AntonFriberg
Contributor

I have had similar memory issues with parallel_bulk, and they were also resolved by using streaming_bulk inside my own process pool instead (a minimal sketch of that pattern follows below).

#1077
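
A minimal sketch of this workaround, assuming the input has already been split into shard files; the file paths, index name, and document shape are made up for illustration:

    # Sketch: run streaming_bulk in separate worker processes instead of
    # relying on parallel_bulk's internal ThreadPool.
    from multiprocessing import Pool

    from elasticsearch import Elasticsearch, helpers


    def index_shard(path):
        # Each worker creates its own client; clients are not shared across processes.
        es = Elasticsearch(host="localhost", port=9200)

        def actions():
            with open(path, encoding="utf-8") as fp:
                for line in fp:
                    yield {"_index": "my-index", "_source": {"line": line.strip()}}

        ok = 0
        # streaming_bulk pulls actions lazily, chunk by chunk, so memory stays flat.
        for success, info in helpers.streaming_bulk(es, actions(), chunk_size=500):
            ok += int(success)
        return path, ok


    if __name__ == "__main__":
        shards = ["shard-00.txt", "shard-01.txt", "shard-02.txt"]  # hypothetical paths
        with Pool(processes=3) as pool:
            for path, ok in pool.imap_unordered(index_shard, shards):
                print(path, "indexed", ok, "documents")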

@seanzhang422

Same issue here with parallel_bulk: the memory usage kept going up for all Python processes running parallel_bulk. Each process reached about 2 GB (I have 10 child processes) and the system ran out of memory. I had to switch back to streaming_bulk and increase the worker count. After that, each process only consumed 88 MB and stayed stable.

@sethmlarson
Contributor

It does look like all the documents are loaded into memory before being distributed to the workers. It would be better to use memory channels to distribute tasks.

@bzijlema

bzijlema commented Oct 14, 2020

I have the same problem of memory filling up when using parallel_bulk. I see in the source code that parallel_bulk uses multiprocessing.pool.ThreadPool. In some experiments I found that the problem does not occur when using multiprocessing.Pool instead, so it might be a good idea to switch parallel_bulk to multiprocessing.Pool, especially since ThreadPool is not documented.
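
A rough sketch of that idea: chunk the actions in the parent process and dispatch each chunk to a multiprocessing.Pool worker that calls helpers.bulk. This is only an approximation of the approach, not the library's implementation, and the connection settings and helper names are assumptions:

    # Sketch: dispatch pre-built chunks of actions to process-pool workers.
    from itertools import islice
    from multiprocessing import Pool

    from elasticsearch import Elasticsearch, helpers

    _es = None  # per-process client, created by the pool initializer


    def _init_worker():
        global _es
        _es = Elasticsearch(host="localhost", port=9200)


    def _bulk_chunk(chunk):
        # helpers.bulk with stats_only=True returns (successes, failures)
        return helpers.bulk(_es, chunk, stats_only=True)


    def chunked(actions, size):
        # Group an iterable of actions into lists of at most `size` items.
        it = iter(actions)
        while True:
            chunk = list(islice(it, size))
            if not chunk:
                return
            yield chunk


    def parallel_index(actions, chunk_size=500, processes=4):
        with Pool(processes=processes, initializer=_init_worker) as pool:
            for ok, failed in pool.imap(_bulk_chunk, chunked(actions, chunk_size)):
                yield ok, failed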

@mathijsfr

mathijsfr commented Nov 11, 2022

Could you give a sample implementation of how you solved this issue? Using streaming_bulk instead of parallel_bulk does not seem to solve my memory issues. I am trying to index 60 million documents.

@ShikharGuptaX

ShikharGuptaX commented Oct 30, 2023

I had an issue with parallel_bulk leaking memory when it was called over and over again. I was hitting this on Python 3.11.0 but not on Python 3.11.1. I believe it was due to gh-99205, which was fixed in Python 3.11.1: creating a new ThreadPool on every parallel_bulk call was triggering that bug.
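
A hedged repro sketch of that pattern (index name and document shape are made up): calling parallel_bulk in a loop creates a fresh internal ThreadPool per call, which is what reportedly leaked on Python 3.11.0 and not on 3.11.1:

    # Sketch: repeated parallel_bulk calls, each of which spins up and tears
    # down its own ThreadPool internally.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(host="localhost", port=9200)


    def batch(n):
        for i in range(1000):
            yield {"_index": "my-index", "_source": {"batch": n, "i": i}}


    for n in range(10_000):
        for ok, info in helpers.parallel_bulk(es, batch(n), chunk_size=500):
            if not ok:
                print("failed:", info)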
