
[BUG] reviews_all doesn't download all reviews of an app with large amount of reviews #209

Open
Jl-wei opened this issue Mar 1, 2024 · 59 comments

@Jl-wei

Jl-wei commented Mar 1, 2024

Library version
1.2.6

Describe the bug
I cannot download all the reviews of an app with a large number of reviews. The number of downloaded reviews is always a multiple of 199.

Code

result = reviews_all("com.google.android.apps.fitness")
print(len(result))
# get 995

Expected behavior
Expect to download all the reviews with reviews_all; there should be at least 20k.

Additional context
No

@Jl-wei Jl-wei changed the title [BUG] Cannot download all reviews of an app with large amount of reviews [BUG] reviews_all doesn't download all reviews of an app with large amount of reviews Mar 1, 2024
@funnan

funnan commented Mar 2, 2024

I'm seeing the same issue even when I set the number of reviews (25,000 in my case). I'm only getting back about 500, and the output number changes each time I run it.

@Jl-wei

Jl-wei commented Mar 2, 2024

I'm seeing the same issue even when I set the number of reviews (25,000 in my case). I'm only getting back about 500, and the output number changes each time I run it.

Me too, and I found that the output number is always a multiple of 199. It seems that Google Play randomly blocks retrieval of the next page of reviews.

@adilosa

adilosa commented Mar 5, 2024

This is probably a dupe of #208.

The error seems to be the Play service intermittently returning an error inside a 200 success response, which then fails to parse as the JSON the library expects. It seems to contain this ...store.error.PlayDataError message:

)]}'

[["wrb.fr","UsvDTd",null,null,null,[5,null,[["type.googleapis.com/wireless.android.finsky.boq.web.data.store.error.PlayDataError",[1]]]],"generic"],["di",45],["af.httprm",45,"-6355766929392607683",2]]
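For illustration (my own sketch, not library code), that blob can be parsed by stripping Google's `)]}'` anti-XSSI guard line, after which the error marker is easy to detect:

```python
import json

# The raw response quoted above, as a Python string. Google prefixes these
# batch responses with a ")]}'" guard line that must be removed before the
# remainder parses as JSON.
raw = """)]}'

[["wrb.fr","UsvDTd",null,null,null,[5,null,[["type.googleapis.com/wireless.android.finsky.boq.web.data.store.error.PlayDataError",[1]]]],"generic"],["di",45],["af.httprm",45,"-6355766929392607683",2]]"""

payload = json.loads(raw.split("\n\n", 1)[1])  # drop the guard line
print("PlayDataError" in str(payload))  # True
```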

The error happens frequently but not reliably. Scraping in chunks of 200 reviews, basically every request has a decent chance of crashing, so a run usually collects 200-1000 total reviews before it craps out.

Currently, the library swallows this exception silently and quits. Handling this error lets the scraping continue as normal.

We monkey-patched around it like this and seem to have gotten back to workable scraping:

import json
from typing import Optional

import google_play_scraper
from google_play_scraper.constants.regex import Regex
from google_play_scraper.constants.request import Formats
from google_play_scraper.utils.request import post

def _fetch_review_items(
    url: str,
    app_id: str,
    sort: int,
    count: int,
    filter_score_with: Optional[int],
    pagination_token: Optional[str],
):
    dom = post(
        url,
        Formats.Reviews.build_body(
            app_id,
            sort,
            count,
            "null" if filter_score_with is None else filter_score_with,
            pagination_token,
        ),
        {"content-type": "application/x-www-form-urlencoded"},
    )

    # MOD error handling
    if "error.PlayDataError" in dom:
        return _fetch_review_items(url, app_id, sort, count, filter_score_with, pagination_token)
    # ENDMOD

    match = json.loads(Regex.REVIEWS.findall(dom)[0])

    return json.loads(match[0][2])[0], json.loads(match[0][2])[-1][-1]


google_play_scraper.features.reviews._fetch_review_items = _fetch_review_items

@funnan

funnan commented Mar 5, 2024

Still not able to get more than a few hundred reviews.

@paulolacombe

@funnan, the monkey patch @adilosa posted worked well for me.

@Shivam-170103

Hey @adilosa @funnan @paulolacombe, can you please explain how to implement this fix? I am trying to scrape reviews using reviews_all in Google Colab but it won't work for me. It would be great if you could help!

@paulolacombe

Hey @Shivam-170103, you need to use the code @adilosa provided to replace the corresponding lines in the reviews.py file in your environment. Let me know if that helps, as I am not that familiar with Google Colab.

@terrichiachia

Thanks @adilosa and @paulolacombe,
your posts worked for me :)

@lucasbral

lucasbral commented Mar 7, 2024

I don't know why, but even after applying @adilosa's solution the number of reviews returned is still very low.

[screenshot: low review count]

@ej-white

ej-white commented Mar 9, 2024

Hello! I tried the monkey patch suggested by @adilosa, scraping a big app like eBay.

Instead of getting 8 or 10 reviews, I did end up getting 199, but I am expecting thousands of reviews (that's how it used to be several weeks ago).

Any updates on getting this fixed? Cheers, and thank you

@sfischerw

Same for me TT: the number of reviews scraped has plummeted since around 15 Feb, and @adilosa's patch does not change my numbers by much.
Is there something else I can try?

@funnan

funnan commented Mar 11, 2024

This mod did not work for me either. I tried a different approach that worked for me:

In reviews.py:

        try:
            review_items, token = _fetch_review_items(
                url,
                app_id,
                sort,
                _fetch_count,
                filter_score_with,
                filter_device_with,
                token,
            )
        except (TypeError, IndexError):
            #funnan MOD start
            token = continuation_token.token
            continue
            #MOD end
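One caveat with retrying the same token unconditionally: if Play keeps returning the error, the loop never terminates. A generic bounded-retry sketch (my own variant, not part of the library or the mod above) that caps the attempts:

```python
from typing import Callable, List, Tuple

def fetch_with_retries(fetch: Callable[[], Tuple[List[dict], str]], max_tries: int = 5):
    """Retry `fetch` on the same exceptions the except clause above catches."""
    last_error: Exception = RuntimeError("max_tries must be >= 1")
    for _ in range(max_tries):
        try:
            return fetch()
        except (TypeError, IndexError) as e:
            last_error = e
    raise last_error

# Stubbed fetch that fails twice, then succeeds:
attempts = {"n": 0}
def flaky_fetch():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IndexError("simulated PlayDataError parse failure")
    return (["review"], "next-token")

print(fetch_with_retries(flaky_fetch))  # (['review'], 'next-token')
```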

@sfischerw

sfischerw commented Mar 17, 2024

@funnan, thanks for sharing it!
It does not fix the issue for me; I still only retrieve 200-300 reviews for an app like eBay.
And every run still yields a different number of reviews.

@ej-white

@funnan Thank you! I tried that and seemed to get a few more reviews, but not the full count. But I'm not sure if I implemented the patch correctly.

What I did was copy the entire features/reviews.py into a new file (my_reviews.py), update the try/except block with your change, and patch it like this:

import google_play_scraper
from my_reviews import reviews  # <- patched version

google_play_scraper.features.reviews = reviews

# Then call google_play_scraper.reviews(app, count=1000, ...)

Is this how to apply your patch? If not, could you provide an example of the correct way? Thanks so much

@Bigsy

Bigsy commented Mar 20, 2024

Neither mod works for me: the first doesn't change anything, and funnan's just loops forever and never returns.

@MemeRunner

I'm having the same issue and am trying to use the workaround posted by @adilosa (thx!).

However, it gives me a pagination token error:

TypeError: Formats._Reviews.build_body() missing 1 required positional argument: 'pagination_token'

Can someone please tell me what this should be set to? I've tried None, 0, 100, 200, and 2000 as values for 'pagination_token', but always get the same TypeError.

This is how I have the variables defined:

google_play_scraper.reviews._fetch_review_items = _fetch_review_items

# Set values for 'url', 'app_id', 'sort', 'count', 'filter_score_with', and 'pagination_token'
url = 'https://play.google.com/store/getreviews'
app_id = 'com.doctorondemand.android.patient'
sort = 1  # 1 for most relevant, 2 for newest
count = 20  # Number of reviews to fetch
filter_score_with = None
pagination_token = 100

# Example call to the function with provided values
_fetch_review_items(url, app_id, sort, count, filter_score_with, pagination_token)

Greatly appreciate any input.
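The TypeError above is consistent with the installed library's Formats.Reviews.build_body having grown an extra filter_device_with parameter (a later comment in this thread reports the same), so the six-argument _fetch_review_items passes one payload argument too few. A toy stub (the signature is my assumption, not the library's actual code) reproduces the symptom:

```python
# Hypothetical stand-in for the newer build_body signature.
def build_body(app_id, sort, count, filter_score_with, filter_device_with, pagination_token):
    return (app_id, sort, count, filter_score_with, filter_device_with, pagination_token)

# The old five-value call pattern now comes up one argument short:
try:
    build_body("com.doctorondemand.android.patient", 1, 20, "null", None)
except TypeError as e:
    print(e)  # build_body() missing 1 required positional argument: 'pagination_token'

# Passing a filter_device_with value as well satisfies the signature:
body = build_body("com.doctorondemand.android.patient", 1, 20, "null", "null", None)
print(len(body))  # 6
```

For the first page, pagination_token is typically None; tokens for subsequent pages come from the previous response, not from a number you pick.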

@funnan

funnan commented Mar 20, 2024

Here's my code (I fix the number of reviews I need and break the loop once that number is crossed):

from google_play_scraper import Sort, reviews
import pandas as pd
from datetime import datetime
from tqdm import tqdm
from google.colab import files  # for files.download below (Colab only)

# Fetch reviews using google_play_scraper. Replace with your app id!
app_id = 'com.XXX'

# Fetch reviews
result = []
continuation_token = None
reviews_count = 25000  # change count here

with tqdm(total=reviews_count, position=0, leave=True) as pbar:
    while len(result) < reviews_count:
        new_result, continuation_token = reviews(
            app_id,
            continuation_token=continuation_token,
            lang='en',
            country='us',
            sort=Sort.NEWEST,
            filter_score_with=None,
            count=150
        )
        if not new_result:
            break
        result.extend(new_result)
        pbar.update(len(new_result))

# Create a DataFrame from the reviews & Download the file
df = pd.DataFrame(result)

today = str(datetime.now().strftime("%m-%d-%Y_%H%M%S"))
df.to_csv(f'reviews-{app_id}_{today}.csv', index=False)
print(len(df))
files.download(f'reviews-{app_id}_{today}.csv')

and in reviews.py I added the mod as my original comment.
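The accumulate-until-target pattern in the loop above can be exercised offline with a stubbed reviews function (the stub below is purely illustrative; the real function takes the same shape of arguments but hits the network):

```python
# Fake review pool: 500 reviews served in pages, mimicking (reviews, token) pairs.
POOL = [{"reviewId": i} for i in range(500)]

def fake_reviews(app_id, continuation_token=None, count=150, **kwargs):
    start = continuation_token or 0
    page = POOL[start:start + count]
    next_token = start + count if start + count < len(POOL) else None
    return page, next_token

collected = []
token = None
target = 400  # stop once at least this many reviews are collected

while len(collected) < target:
    page, token = fake_reviews("com.example", continuation_token=token, count=150)
    if not page:
        break
    collected.extend(page)

print(len(collected))  # 450: the third page of 150 is the first to cross 400
```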

@Mayumiwandi

Here's my code (I fix the number of reviews I need and break the loop once that number is crossed): […]

and in reviews.py I added the mod as in my original comment.

I have tried your code, and it worked for me running on Colab.

@ej-white

@funnan Thank you, that works!

@ej-white

@JoMingyu Any chance we could get @funnan's fix added to the code and merged?

It works for me and others; I can once again scrape tens of thousands of reviews. Based on this discussion, this issue seems to be affecting many people! Cheers

@HuDHuD0x1

Here's my code (I fix the number of reviews I need and break the loop once that number is crossed): […]

and in reviews.py I added the mod as in my original comment.

Thanks bro, worked for me as well.

@myownhoney

myownhoney commented Apr 1, 2024

Unfortunately, it is still not working for me. I suspect that Google has put some limits on crawling.

from google_play_scraper import Sort, reviews
import pandas as pd
from datetime import datetime
from tqdm import tqdm
import time

app_id = 'com.zhiliaoapp.musically'


result = []
continuation_token = None
reviews_count = 5000

with tqdm(total=reviews_count, position=0, leave=True) as pbar:
    while len(result) < reviews_count:
        new_result, continuation_token = reviews(
            app_id,
            continuation_token=continuation_token,
            lang='en',
            country='us',
            sort=Sort.NEWEST,
            filter_score_with=None,
            count=199
        )
        if not new_result:
            break
        result.extend(new_result)
        pbar.update(len(new_result))

df = pd.DataFrame(result)

today = str(datetime.now().strftime("%m-%d-%Y_%H%M%S"))
print(len(df))

The progress bar stops after displaying the following:
8%|▊ | 398/5000 [00:00<00:03, 1302.69it/s]398
Sometimes it retrieves more data, like 995, but most of the time only 199 or 398 reviews come back.

@AndreasKarasenko

@myownhoney did you edit the reviews.py file using the fix from @funnan?
I just tested it on v1.2.6 with this app id: "com.ingka.ikea.app", and apart from hanging at 10950 reviews it works.

@myownhoney

@myownhoney did you edit the reviews.py file using the fix from @funnan? I just tested it on v1.2.6 with this app id: "com.ingka.ikea.app", and apart from hanging at 10950 reviews it works.

it works now :) Cheers!

@RamaDNA

RamaDNA commented Apr 3, 2024

@AndreasKarasenko @myownhoney can you show me your code, please? It still does not work for me.

@myownhoney

myownhoney commented Apr 3, 2024

@AndreasKarasenko @myownhoney can you show me your code, please? It still does not work for me.

My code is in the previous comment. Have you tried editing reviews.py?
If you're working on Colab, I strongly suggest you run this code before running your scrape code:

import json
from time import sleep
from typing import List, Optional, Tuple

from google_play_scraper import Sort
from google_play_scraper.constants.element import ElementSpecs
from google_play_scraper.constants.regex import Regex
from google_play_scraper.constants.request import Formats
from google_play_scraper.utils.request import post

MAX_COUNT_EACH_FETCH = 199


class _ContinuationToken:
    __slots__ = (
        "token",
        "lang",
        "country",
        "sort",
        "count",
        "filter_score_with",
        "filter_device_with",
    )

    def __init__(
        self, token, lang, country, sort, count, filter_score_with, filter_device_with
    ):
        self.token = token
        self.lang = lang
        self.country = country
        self.sort = sort
        self.count = count
        self.filter_score_with = filter_score_with
        self.filter_device_with = filter_device_with


def _fetch_review_items(
    url: str,
    app_id: str,
    sort: int,
    count: int,
    filter_score_with: Optional[int],
    filter_device_with: Optional[int],
    pagination_token: Optional[str],
):
    dom = post(
        url,
        Formats.Reviews.build_body(
            app_id,
            sort,
            count,
            "null" if filter_score_with is None else filter_score_with,
            "null" if filter_device_with is None else filter_device_with,
            pagination_token,
        ),
        {"content-type": "application/x-www-form-urlencoded"},
    )
    match = json.loads(Regex.REVIEWS.findall(dom)[0])

    return json.loads(match[0][2])[0], json.loads(match[0][2])[-2][-1]


def reviews(
    app_id: str,
    lang: str = "en",
    country: str = "us",
    sort: Sort = Sort.NEWEST,
    count: int = 100,
    filter_score_with: int = None,
    filter_device_with: int = None,
    continuation_token: _ContinuationToken = None,
) -> Tuple[List[dict], _ContinuationToken]:
    sort = sort.value

    if continuation_token is not None:
        token = continuation_token.token

        if token is None:
            return (
                [],
                continuation_token,
            )

        lang = continuation_token.lang
        country = continuation_token.country
        sort = continuation_token.sort
        count = continuation_token.count
        filter_score_with = continuation_token.filter_score_with
        filter_device_with = continuation_token.filter_device_with
    else:
        token = None

    url = Formats.Reviews.build(lang=lang, country=country)

    _fetch_count = count

    result = []

    while True:
        if _fetch_count == 0:
            break

        if _fetch_count > MAX_COUNT_EACH_FETCH:
            _fetch_count = MAX_COUNT_EACH_FETCH

        try:
            review_items, token = _fetch_review_items(
                url,
                app_id,
                sort,
                _fetch_count,
                filter_score_with,
                filter_device_with,
                token,
            )
        except (TypeError, IndexError):
            #funnan MOD start
            token = continuation_token.token
            continue
            #MOD end

        for review in review_items:
            result.append(
                {
                    k: spec.extract_content(review)
                    for k, spec in ElementSpecs.Review.items()
                }
            )

        _fetch_count = count - len(result)

        if isinstance(token, list):
            token = None
            break

    return (
        result,
        _ContinuationToken(
            token, lang, country, sort, count, filter_score_with, filter_device_with
        ),
    )


def reviews_all(app_id: str, sleep_milliseconds: int = 0, **kwargs) -> list:
    kwargs.pop("count", None)
    kwargs.pop("continuation_token", None)

    continuation_token = None

    result = []

    while True:
        _result, continuation_token = reviews(
            app_id,
            count=MAX_COUNT_EACH_FETCH,
            continuation_token=continuation_token,
            **kwargs
        )

        result += _result

        if continuation_token.token is None:
            break

        if sleep_milliseconds:
            sleep(sleep_milliseconds / 1000)

    return result


@HuDHuD0x1

HuDHuD0x1 commented Apr 3, 2024

@AndreasKarasenko @myownhoney can you show me your code, please? It still does not work for me.

My code is in the previous comment. Have you tried editing reviews.py? If you're working on Colab, I strongly suggest you run this code before running your scrape code: […]

If we use this code before running our script, is it compulsory to edit reviews.py first? Or is running this code enough? The @funnan patch worked for me on Jupyter.

@RamaDNA

RamaDNA commented Apr 4, 2024

Yeah, just run the first one, then the second one.
Thanks for sharing; it does not work for me.

@gianlucascoccia

I dug a bit into the code, starting from @adilosa's solution.
I found two issues that prevented it from working:

  1. When the API fails silently, it now returns a "play.gateway.proto.PlayGatewayError" rather than an "error.PlayDataError".

  2. _fetch_review_items now needs a filter_device_with parameter too.

After applying the required changes, this is the new patch for reviews.py:

# NOTE: json, Optional, Regex, Formats, and post are already imported in reviews.py
def _fetch_review_items(
    url: str,
    app_id: str,
    sort: int,
    count: int,
    filter_score_with: Optional[int],
    filter_device_with: Optional[int],
    pagination_token: Optional[str],
):
    dom = post(
        url,
        Formats.Reviews.build_body(
            app_id,
            sort,
            count,
            "null" if filter_score_with is None else filter_score_with,
            "null" if filter_device_with is None else filter_device_with,
            pagination_token,
        ),
        {"content-type": "application/x-www-form-urlencoded"},
    )

    # PATCH START
    if ("error.PlayDataError" in dom) or (".PlayGatewayError" in dom): # <--- Keeping both for robustness
        return _fetch_review_items(url, app_id, sort, count, filter_score_with, filter_device_with, pagination_token)
    # PATCH END

    match = json.loads(Regex.REVIEWS.findall(dom)[0])

    return json.loads(match[0][2])[0], json.loads(match[0][2])[-2][-1]

With these changes it appears to be working.
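One caveat (my observation, not from the patch author): the recursive retry has no depth cap, so a persistently failing endpoint would recurse until Python's recursion limit. The same retry can be expressed as a bounded loop; a generic sketch with stubbed responses:

```python
# Retry fetching a DOM string until it no longer matches an error predicate,
# up to max_tries attempts (max_tries is my own addition, not a library knob).
def fetch_dom_with_retry(fetch_dom, is_error, max_tries=5):
    dom = fetch_dom()
    for _ in range(max_tries - 1):
        if not is_error(dom):
            break
        dom = fetch_dom()
    if is_error(dom):
        raise RuntimeError("Play kept returning an error response")
    return dom

# Stubbed responses: two errors, then a good payload.
responses = iter(["...PlayGatewayError...", "...PlayGatewayError...", '{"ok": true}'])
dom = fetch_dom_with_retry(
    lambda: next(responses),
    lambda d: ("error.PlayDataError" in d) or (".PlayGatewayError" in d),
)
print(dom)  # {"ok": true}
```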

@sheldon0711

sheldon0711 commented Apr 4, 2024

I dug a bit into the code, starting from @adilosa's solution. I found two issues that prevented it from working: […]

With these changes it appears to be working.

Thanks for the fix! I still have trouble getting it to work. Since filter_device_with is needed, do we need to add it to the other functions that use _fetch_review_items as well?

I also tried your patch without the filter_device_with parameter, and somehow it works. I'm wondering if there is any problem with that?

@iniandrew

iniandrew commented Apr 5, 2024

Thank you so much @adilosa, your code worked well. For those experiencing the same problem, here's what I did:

  • I updated the file named reviews.py (I use the PyCharm IDE and JupyterLab; it's located in .venv\Lib\site-packages\google_play_scraper\features\reviews.py)

Add this code above the match variable in the _fetch_review_items function:

    # MOD error handling
    if "error.PlayDataError" in dom:
        return _fetch_review_items(url, app_id, sort, count, filter_score_with, pagination_token)
    # ENDMOD

before:

def _fetch_review_items(
    url: str,
    app_id: str,
    sort: int,
    count: int,
    filter_score_with: Optional[int],
    pagination_token: Optional[str],
):
    dom = post(
        url,
        Formats.Reviews.build_body(
            app_id,
            sort,
            count,
            "null" if filter_score_with is None else filter_score_with,
            pagination_token,
        ),
        {"content-type": "application/x-www-form-urlencoded"},
    )

    match = json.loads(Regex.REVIEWS.findall(dom)[0])

    return json.loads(match[0][2])[0], json.loads(match[0][2])[-1][-1]

after:

def _fetch_review_items(
    url: str,
    app_id: str,
    sort: int,
    count: int,
    filter_score_with: Optional[int],
    pagination_token: Optional[str],
):
    dom = post(
        url,
        Formats.Reviews.build_body(
            app_id,
            sort,
            count,
            "null" if filter_score_with is None else filter_score_with,
            pagination_token,
        ),
        {"content-type": "application/x-www-form-urlencoded"},
    )

    # MOD error handling
    if "error.PlayDataError" in dom:
        return _fetch_review_items(url, app_id, sort, count, filter_score_with, pagination_token)
    # ENDMOD

    match = json.loads(Regex.REVIEWS.findall(dom)[0])

    return json.loads(match[0][2])[0], json.loads(match[0][2])[-1][-1]

Then I run the code from the documentation:

from google_play_scraper import Sort, reviews

result, continuation_token = reviews(
    'app-id', # replace this with the application id you want to scrape
    lang='id', # defaults to 'en'
    country='id', # defaults to 'us'
    sort=Sort.MOST_RELEVANT, # defaults to Sort.NEWEST
    count=20000, # defaults to 100
    filter_score_with=None # defaults to None(means all score)
)

# If you pass `continuation_token` as an argument to the reviews function at this point,
# it will crawl the items after 3 review items.

result, _ = reviews(
    'app-id', # replace this with the application id you want to scrape
    continuation_token=continuation_token # defaults to None(load from the beginning)
)

result:

[screenshot: scraped review count]

@gianlucascoccia

Thanks for the fix! I still have trouble getting it to work. Since filter_device_with is needed, do we need to add it to the other functions that use _fetch_review_items as well?

I also tried your patch without the filter_device_with parameter, and somehow it works. I'm wondering if there is any problem with that?

In my experience it is not necessary to add the parameter to other parts of the code, but I am only using the reviews_all method

Also, it seems that different people are getting different error messages (perhaps depending on their location?), so it really depends on what behaviour the program has on your side

@DanielGusman

Hi everybody.
I've been studying Python for a week, and I also ran into this problem. I used the last proposed method, but unfortunately I can't scrape more than 600 reviews. Tell me what I'm doing wrong.
I also added a small piece of code to export the data to Excel.

from typing import Optional
import json

from google_play_scraper.constants.regex import Regex
from google_play_scraper.constants.request import Formats
from google_play_scraper.utils.request import post


def _fetch_review_items(
    url: str,
    app_id: str,
    sort: int,
    count: int,
    filter_score_with: Optional[int],
    pagination_token: Optional[str],
):
    dom = post(
        url,
        Formats.Reviews.build_body(
            app_id,
            sort,
            count,
            "null" if filter_score_with is None else filter_score_with,
            pagination_token,
        ),
        {"content-type": "application/x-www-form-urlencoded"},
    )

    # MOD error handling
    if "error.PlayDataError" in dom:
        return _fetch_review_items(url, app_id, sort, count, filter_score_with, pagination_token)
    # ENDMOD

    match = json.loads(Regex.REVIEWS.findall(dom)[0])

    return json.loads(match[0][2])[0], json.loads(match[0][2])[-1][-1]

import pandas as pd
from google_play_scraper import Sort, reviews

result, continuation_token = reviews(
    'eu.livesport.FlashScore_com', # replace this with the application id you want to scrape
    lang='en', # defaults to 'en'
    country='US', # defaults to 'us'
    sort=Sort.MOST_RELEVANT, # defaults to Sort.NEWEST
    count=20000, # defaults to 100
    filter_score_with=None # defaults to None (means all scores)
)

df = pd.DataFrame(result)
df.to_excel('FS_en.xlsx')
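One caveat with the retry in the mod above: it recurses unconditionally, so if Google keeps returning PlayDataError the call never terminates. A bounded variant could look like the sketch below; `fetch_dom` and the other names are illustrative stand-ins for the `post(...)` call, not part of google-play-scraper:

```python
# Bounded retry around a fetch that sometimes returns a PlayDataError page.
# fetch_dom stands in for the post(...) call; names here are illustrative.
def fetch_with_bounded_retry(fetch_dom, max_attempts=5):
    for _ in range(max_attempts):
        dom = fetch_dom()
        if "error.PlayDataError" not in dom:
            return dom  # a good page; parse it as usual
    raise RuntimeError(f"PlayDataError persisted after {max_attempts} attempts")
```

This keeps the retry behaviour but fails loudly instead of looping forever when the error is persistent.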

@asornbor
Copy link

asornbor commented Apr 5, 2024

Hi everybody. I've been studying Python for a week. I also encountered this problem. I used the last proposed method, but, unfortunately, I can't scrape more than 600 reviews. Tell me what I'm doing wrong. I also added a small piece of code to unloading data to Excel.


Hey, I was having a similar issue yesterday. You have to make sure you don't re-import reviews after you apply the fix to _fetch_review_items, or it will revert back to the broken form. Try importing at the beginning, then applying the fix, then running the call, and that should work!

@DanielGusman
Copy link

Thank you very much for the advice!

@DanielGusman
Copy link

Today I tried to collect reviews all day, but to no avail. I tried all the methods from the thread, without success.
The last method I tried returns an empty Excel table for some reason.
Please tell me what I did wrong.

1. I updated the file called reviews.py (using PyCharm):

# MOD error handling
if "error.PlayDataError" in dom:
    return _fetch_review_items(url, app_id, sort, count, filter_score_with, pagination_token)
# ENDMOD

2. Then I pasted the code:

from google_play_scraper import Sort, reviews

result, continuation_token = reviews(
    'app-id', # replace this with the application id you want to scrape
    lang='id', # defaults to 'en'
    country='id', # defaults to 'us'
    sort=Sort.MOST_RELEVANT, # defaults to Sort.NEWEST
    count=20000, # defaults to 100
    filter_score_with=None # defaults to None (means all scores)
)

# If you pass continuation_token as an argument to the reviews function at this point,
# it will crawl the items after 3 review items.

result, _ = reviews(
    'app-id', # replace this with the application id you want to scrape
    continuation_token=continuation_token # defaults to None (load from the beginning)
)

I would be grateful for any advice.

@asornbor
Copy link

asornbor commented Apr 6, 2024

Today I tried to collect reviews all day but to no avail. I tried all the methods from the thread, but without success. The last method I tried, but for some reason it returns an empty Excel table. Please tell me what I did wrong.


Fill in your app_id and try running this:

import json
from time import sleep
from typing import List, Optional, Tuple

from google_play_scraper import Sort
from google_play_scraper.constants.element import ElementSpecs
from google_play_scraper.constants.regex import Regex
from google_play_scraper.constants.request import Formats
from google_play_scraper.utils.request import post

import pandas as pd
from datetime import datetime
from tqdm import tqdm

app_id = ''

MAX_COUNT_EACH_FETCH = 199


class _ContinuationToken:
    __slots__ = (
        "token",
        "lang",
        "country",
        "sort",
        "count",
        "filter_score_with",
        "filter_device_with",
    )

    def __init__(
        self, token, lang, country, sort, count, filter_score_with, filter_device_with
    ):
        self.token = token
        self.lang = lang
        self.country = country
        self.sort = sort
        self.count = count
        self.filter_score_with = filter_score_with
        self.filter_device_with = filter_device_with


def _fetch_review_items(
    url: str,
    app_id: str,
    sort: int,
    count: int,
    filter_score_with: Optional[int],
    filter_device_with: Optional[int],
    pagination_token: Optional[str],
):
    dom = post(
        url,
        Formats.Reviews.build_body(
            app_id,
            sort,
            count,
            "null" if filter_score_with is None else filter_score_with,
            "null" if filter_device_with is None else filter_device_with,
            pagination_token,
        ),
        {"content-type": "application/x-www-form-urlencoded"},
    )
    match = json.loads(Regex.REVIEWS.findall(dom)[0])

    return json.loads(match[0][2])[0], json.loads(match[0][2])[-2][-1]


def reviews(
    app_id: str,
    lang: str = "en",
    country: str = "us",
    sort: Sort = Sort.MOST_RELEVANT,
    count: int = 100,
    filter_score_with: int = None,
    filter_device_with: int = None,
    continuation_token: _ContinuationToken = None,
) -> Tuple[List[dict], _ContinuationToken]:
    sort = sort.value

    if continuation_token is not None:
        token = continuation_token.token

        if token is None:
            return (
                [],
                continuation_token,
            )

        lang = continuation_token.lang
        country = continuation_token.country
        sort = continuation_token.sort
        count = continuation_token.count
        filter_score_with = continuation_token.filter_score_with
        filter_device_with = continuation_token.filter_device_with
    else:
        token = None

    url = Formats.Reviews.build(lang=lang, country=country)

    _fetch_count = count

    result = []

    while True:
        if _fetch_count == 0:
            break

        if _fetch_count > MAX_COUNT_EACH_FETCH:
            _fetch_count = MAX_COUNT_EACH_FETCH

        try:
            review_items, token = _fetch_review_items(
                url,
                app_id,
                sort,
                _fetch_count,
                filter_score_with,
                filter_device_with,
                token,
            )
        except (TypeError, IndexError):
            # funnan MOD start: retry the same request; on the first call
            # continuation_token is None, so fall back to token = None
            token = continuation_token.token if continuation_token is not None else None
            continue
            # MOD end

        for review in review_items:
            result.append(
                {
                    k: spec.extract_content(review)
                    for k, spec in ElementSpecs.Review.items()
                }
            )

        _fetch_count = count - len(result)

        if isinstance(token, list):
            token = None
            break

    return (
        result,
        _ContinuationToken(
            token, lang, country, sort, count, filter_score_with, filter_device_with
        ),
    )


def reviews_all(app_id: str, sleep_milliseconds: int = 0, **kwargs) -> list:
    kwargs.pop("count", None)
    kwargs.pop("continuation_token", None)

    continuation_token = None

    result = []

    while True:
        _result, continuation_token = reviews(
            app_id,
            count=MAX_COUNT_EACH_FETCH,
            continuation_token=continuation_token,
            **kwargs
        )

        result += _result

        if continuation_token.token is None:
            break

        if sleep_milliseconds:
            sleep(sleep_milliseconds / 1000)

    return result
result = []
continuation_token = None
reviews_count = 20000

with tqdm(total=reviews_count, position=0, leave=True) as pbar:
    while len(result) < reviews_count:
        new_result, continuation_token = reviews(
            app_id,
            continuation_token=continuation_token,
            lang='en',
            country='us',
            sort=Sort.MOST_RELEVANT,
            filter_score_with=None,
            count=199
        )
        if not new_result:
            break
        result.extend(new_result)
        pbar.update(len(new_result))
df = pd.DataFrame(result)

today = str(datetime.now().strftime("%m-%d-%Y_%H%M%S"))
print(len(df))

@petskratt
Copy link

petskratt commented Apr 7, 2024

While debugging a Node.js sister project, I was able to fix a similar problem by ensuring cookie persistence from the first request onward. For testing purposes you can grab the NID cookie from your browser and send it with each script request (e.g. where the headers are added, like {"content-type": "application/x-www-form-urlencoded", "cookie": "NID=[cookie value from browser];"}).
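A rough sketch of what cookie persistence could look like in Python, using only the standard library; the URL and body here are placeholders for whatever the paging request needs (in google-play-scraper they are built by `Formats.Reviews.build`/`build_body`), so treat this as an illustration rather than a drop-in replacement for the library's `post` helper:

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, Request, build_opener

# An opener with a cookie jar: the NID cookie set by the first response is
# automatically re-sent on every following paging request.
cookie_jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(cookie_jar))

def post_with_cookies(url: str, body: str) -> str:
    request = Request(
        url,
        data=body.encode("utf-8"),
        headers={"content-type": "application/x-www-form-urlencoded"},
    )
    with opener.open(request) as response:
        return response.read().decode("utf-8")
```

Reusing one opener for the whole session is the key point; constructing a fresh connection per request discards the cookie and may land on a different worker each time.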

@DKZPT
Copy link

DKZPT commented Apr 7, 2024

I have the same issue; I can only get 398 comments/reviews using this simple code:

from google_play_scraper import app, Sort, reviews_all
import pandas as pd

def scrape_google_play_reviews(app_id, sort_by=Sort.NEWEST, count=2000):
    # Fetch reviews
    result = reviews_all(
        app_id,
        sleep_milliseconds=1000,  # Don't use sleep if you're not making many requests
        lang='pt',  # Language in which you want to fetch reviews
        country='pt',  # Country to which the reviews are targeted
        sort=sort_by,  # Sorting method
        count=count  # Note: reviews_all actually pops and ignores the count argument
    )

    # Convert to DataFrame
    reviews_df = pd.DataFrame(result)

    # Save to CSV
    reviews_df.to_csv(f'{app_id}_reviews.csv', index=False)

    print(f"Saved {len(reviews_df)} reviews for app ID {app_id} to CSV.")

# Example usage
app_id_example = ''  # Replace with the app ID you're interested in
scrape_google_play_reviews(app_id_example, sort_by=Sort.NEWEST, count=2000)

If anyone finds a fix, let us know.

@DanielGusman
Copy link


Thanks for the help. I'll try to run the code today.

@JoMingyu
Copy link
Owner

JoMingyu commented Apr 9, 2024

Hello guys. Sorry for not paying attention to the library. I've read all the discussions, and I've confirmed that the small modifications from @funnan are valid for most cases.

Unfortunately, Google Play is conducting various experiments in various countries, including A/B testing of the UI and data structure. Therefore, although most cases can be solved with the method proposed by @funnan, some calls, for example the following, generate infinite loops.

reviews_all(
    "com.poleposition.AOSheroking",
    sort=Sort.MOST_RELEVANT,
    country="kr",
    lang="ko",
)

So @funnan's and all of your suggestions are really good, but they can cause infinite-loop-like problems in edge cases, so I need to research this some more.

Fundamentally, this library is unofficial, and Google Play does not allow crawling in its robots.txt. Therefore, I think it might have been better not to support complex features like reviews_all in the first place.

I think it would be good for everyone to write their own reviews_all function according to their situation. I'm sorry I couldn't bring you good news.
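As a starting point for rolling your own, a generic paging loop with two safeguards (a hard cap on items and a retry budget for empty pages) could be sketched as below; `fetch_page` is a hypothetical callable you would wrap around `reviews`, not part of the library:

```python
def paginate_with_cap(fetch_page, max_items=5000, max_retries=5):
    """fetch_page(token) -> (items, next_token); next_token is None on the last page."""
    result, token, retries = [], None, 0
    while len(result) < max_items and retries < max_retries:
        items, token = fetch_page(token)
        if not items:
            retries += 1  # empty page: retry a few times, then give up
            continue
        retries = 0
        result.extend(items)
        if token is None:
            break  # genuine last page
    return result[:max_items]
```

The cap and the retry budget are exactly what guard against the infinite-loop edge case described above: the loop always terminates, whether Google keeps sending empty pages or keeps sending tokens.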

@adilosa
Copy link

adilosa commented Apr 10, 2024

We also observed that the API response from Google randomly didn't include the token on some calls, meaning the loop would end as if it were the last page. We simply retried the request a few times and usually got a continuation token eventually!
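That retry idea can be sketched as a small wrapper (an illustration, not library code): re-request the same page a few times while the response comes back without a continuation token, and only then accept it as the last page:

```python
def fetch_page_with_token_retry(fetch_page, token, attempts=3):
    """fetch_page(token) -> (items, next_token); retry while next_token is missing."""
    for _ in range(attempts):
        items, next_token = fetch_page(token)
        if next_token is not None:
            return items, next_token  # token present: trust this response
    return items, None  # still no token after retries: treat as the last page
```

The trade-off is a few wasted requests on the genuine last page, in exchange for not silently truncating the crawl when a token is dropped at random.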

@petskratt
Copy link

@adilosa @JoMingyu please try capturing the NID cookie from the first response and sending it with all following paging requests; with multiple retries you are just hoping to hit the same worker behind the load balancer. Check my fix & test in the Node.js project: facundoolano/google-play-scraper#677

@gianlucascoccia
Copy link

We also observed that the API response from Google randomly didn't include the token on some calls, meaning the loop would end as if it was the last page. We simply retried the request a few times and usually get a continuation token eventually!

I also observed different error messages from other users, I believe Google's API is currently not working 100% correctly.

@JoMingyu
Copy link
Owner

@adilosa @JoMingyu pls try capturing NID cookie from first response and sending it with all following paging requests - with multiple attempts you just hope to hit the same worker behind LB. Check my fix & test in Node.js project facundoolano/google-play-scraper#677

I'll give it a try. That makes sense. I'm sorry, but I don't have a lot of time to spend on it. However, I'll do my best to work on it.

@DKZPT
Copy link

DKZPT commented Apr 10, 2024

I'm no expert on this, but I found something weird.

If I change the "country" and "language" I get more reviews; maybe something changed on Google's side?!

I wrote this code and I get more reviews than when I fix the country and language:

from google_play_scraper import Sort, reviews
import pandas as pd
from datetime import datetime
from tqdm import tqdm
import time

# App ID to fetch reviews for
app_id = 'xxx.com'

# Lists of languages and countries to fetch reviews in
languages = ['en', 'pt']
countries = ['us', 'pt', 'br']

# Initialize results list
result = []

# Number of reviews to attempt to fetch per language-country combination
reviews_count_per_combination = 10000  

for country in countries:
    for lang in languages:
        continuation_token = None
        fetched_reviews = 0
        with tqdm(total=reviews_count_per_combination, desc=f"Fetching reviews in {lang}-{country}", position=0, leave=True) as pbar:
            while fetched_reviews < reviews_count_per_combination:
                new_result, continuation_token = reviews(
                    app_id,
                    continuation_token=continuation_token,
                    lang=lang,
                    country=country,
                    sort=Sort.NEWEST,
                    filter_score_with=None,
                    count=min(200, reviews_count_per_combination - fetched_reviews)  # I changed this to 400-500 and got more than 200
                )
                if not new_result:
                    break
                result.extend(new_result)
                fetched_reviews += len(new_result)
                pbar.update(len(new_result))

# Convert aggregated results to DataFrame
df = pd.DataFrame(result)

# Save the DataFrame to a CSV file
today = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
filename = f'reviews_{app_id}_{today}.csv'
df.to_csv(filename, index=False)

print(f"Saved {len(df)} reviews from multiple languages and countries to {filename}")

Any suggestions?

@gianlucascoccia
Copy link

If i change the "country" and "language" i get more reviews, maybe something changed on google side?!

This is expected; the API returns reviews from a single country and in a single language (the defaults are US and English).

@rifson
Copy link

rifson commented Apr 12, 2024

@JoMingyu could it be that it scrapes only until a batch can't be divided by 200, and that is what is making this happen?
I have tried scraping a couple of times, but I seem to get just under 200 on the first try (or maybe on the third), so it stops early.
From what I can see here: https://pypi.org/project/google-play-scraper/
it says: "http requests are generated as long as the number of app reviews is divided by 200."
Right now, trying this app, I get 198 reviews on the first request. Going by the description above, this should then stop the HTTP requests:

from google_play_scraper import Sort, reviews_all

results = reviews_all(
    'com.lego.legobuildinginstructions',
    sleep_milliseconds=0, # defaults to 0
    lang='en', # defaults to 'en'
    country='us', # defaults to 'us'
)

@DanielGusman
Copy link

Hi all! I found this article on this issue. Unfortunately, I don't yet have enough knowledge to run it.
https://www.scrapehero.com/scrape-google-play-store-reviews/

https://github.com/scrapehero-code/google-play-review-scraper/blob/main/scraper.py

@DKZPT
Copy link

DKZPT commented Apr 13, 2024

Hi all! I found this article on this issue. Unfortunately, I don't yet have enough knowledge to run it. https://www.scrapehero.com/scrape-google-play-store-reviews/

https://github.com/scrapehero-code/google-play-review-scraper/blob/main/scraper.py

Not a good solution.

@singgihsaputro
Copy link

I used

Here is my code (I fixed the number of reviews I need and break the loop once that number has been exceeded):

from google_play_scraper import Sort, reviews
import pandas as pd
from datetime import datetime
from tqdm import tqdm
from google.colab import files  # needed for files.download below (Colab only)

# Fetch reviews using google_play_scraper, Replace with ur app-id!
app_id = 'com.XXX'

# Fetch reviews
result = []
continuation_token = None
reviews_count = 25000  # change count here

with tqdm(total=reviews_count, position=0, leave=True) as pbar:
    while len(result) < reviews_count:
        new_result, continuation_token = reviews(
            app_id,
            continuation_token=continuation_token,
            lang='en',
            country='us',
            sort=Sort.NEWEST,
            filter_score_with=None,
            count=150
        )
        if not new_result:
            break
        result.extend(new_result)
        pbar.update(len(new_result))

# Create a DataFrame from the reviews & Download the file
df = pd.DataFrame(result)

today = str(datetime.now().strftime("%m-%d-%Y_%H%M%S"))
df.to_csv(f'reviews-{app_id}_{today}.csv', index=False)
print(len(df))
files.download(f'reviews-{app_id}_{today}.csv')

And in reviews.py I added the mod from my original comment.


I have tried your code, and it worked for me running on Colab.

Thanks man, I've tried this; somehow it's not working with the latest version, but the old version works fine. It worked with version 0.1.2.

@dekwahdimas
Copy link

Thank you so much to all contributors on this thread and @asornbor for the summary. It worked for me using Google Colab and the latest version of google-play-scraper (1.2.6). I just needed to import the other missing libraries:

import json
from time import sleep
from typing import List, Optional, Tuple


@mylovelycodes
Copy link

To resolve this issue, simply retry the operation without exiting when a PlayGatewayError is returned.

@JoMingyu
Copy link
Owner

JoMingyu commented Jun 7, 2024

Could you guys try on 1.2.7? I released #216 for now.

@wanghaisheng
Copy link

@JoMingyu

    results = reviews_all(package, sleep_milliseconds=0, lang='en', country=country, sort=Sort.MOST_RELEVANT)

https://play.google.com/store/apps/details?id=redmasiva.bibliachat&hl=en_IN&pli=1

only 124 items

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests