exclude bot traffic from tracking stats #4452

Closed
FuhuXia opened this issue Sep 6, 2023 · 6 comments

FuhuXia commented Sep 6, 2023

The tracking update job was changed from nightly to weekly because of its long processing time. We suspect that bot crawling traffic is the cause. Bot traffic makes the popularity count and tracking stats less meaningful, and because bots crawl each and every dataset, the tracking update job takes unnecessarily long to process. Now that the user-agent ticket is done, we can differentiate bot traffic from regular user visits, so we should try to exclude bot traffic from tracking stats.

Sketch

  • Research the effect of excluding bot traffic from tracking stats. Does it reduce the unique dataset visits by 30%, 50%, or 75%?
  • Research an existing list of user-agents to be considered bots, or gather our own.
  • Exclude bot traffic from tracking stats, either in a custom extension or by pushing the feature upstream.
  • Change the tracking update job from weekly back to nightly.
@FuhuXia FuhuXia added the bug Software defect or bug label Sep 6, 2023
@hkdctol hkdctol moved this to 📔 Product Backlog in data.gov team board Sep 7, 2023
@hkdctol hkdctol moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Sep 14, 2023
@FuhuXia FuhuXia moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Sep 25, 2023
@FuhuXia FuhuXia self-assigned this Sep 25, 2023

FuhuXia commented Sep 29, 2023

Inspected 19 days of CloudFront logs from 2023-09-01 to 2023-09-19:

Raw file size: 15 GB
Total requests: 19 million (19,435,508)
Requests made by all bots: 9 million (8,556,836)
Requests made by biggest bot Googlebot: 4 million (4,159,174)

Top 10 bots' requests:

Googlebot\/: 4159174
PetalBot: 1684301
YisouSpider: 275729
[wW]get: 237275
Bytespider: 230629
GPTBot: 229068
python-requests: 200007
axios: 197848
Y!J: 183735
bingbot: 179556
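
A tally like the one above could be produced roughly as follows. This is a minimal sketch, not the script actually used for this analysis. It assumes CloudFront standard logs (tab-separated, with a `#Fields:` header that names a URL-encoded `cs(User-Agent)` column) and treats each bot name in the list as a regular expression. The function name `count_bot_requests` is hypothetical.

```python
import re
from collections import Counter
from urllib.parse import unquote

# Patterns copied from the top-10 list above, treated as regular expressions.
BOT_PATTERNS = [
    r"Googlebot\/", r"PetalBot", r"YisouSpider", r"[wW]get", r"Bytespider",
    r"GPTBot", r"python-requests", r"axios", r"Y!J", r"bingbot",
]
COMPILED = [(pattern, re.compile(pattern)) for pattern in BOT_PATTERNS]


def count_bot_requests(lines):
    """Count requests per bot pattern in CloudFront standard log lines.

    Assumption: the log has a '#Fields:' header listing column names,
    and the user agent is in the 'cs(User-Agent)' column, URL-encoded.
    """
    counts = Counter()
    ua_index = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
            ua_index = fields.index("cs(User-Agent)")
            continue
        if not line or line.startswith("#") or ua_index is None:
            continue
        columns = line.split("\t")
        if ua_index >= len(columns):
            continue  # skip malformed rows
        user_agent = unquote(columns[ua_index])
        # Attribute each request to the first matching pattern only.
        for pattern, regex in COMPILED:
            if regex.search(user_agent):
                counts[pattern] += 1
                break
    return counts
```

Calling `count_bot_requests(open(path))` for each log file and summing the resulting counters would give per-bot totals for the whole period.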

Not all bots parse JavaScript and make requests to /_tracking. For those that do:

Bot traffic to /_tracking

Googlebot\/: 804965
Y!J: 55168
Yeti: 10050
Bytespider: 9645
Applebot: 3236
HeadlessChrome: 522
Baiduspider: 260
YisouSpider: 250
Chrome-Lighthouse: 171
Cincraw: 149
Google-Read-Aloud: 125
yandex\.com\/bots: 104
bingbot: 48
heritrix: 38
SeekportBot: 30
PetalBot: 27
Google-Safety: 6
Google-InspectionTool: 6
Blackboard: 5
HubSpot: 4
Ahrefs(Bot|SiteAudit): 3
proximic: 3
Dataprovider.com: 2
facebookexternalhit: 2
Google-Structured-Data-Testing-Tool: 1
BingPreview\/: 1
archive.org_bot: 1
AdsBot-Google([^-]|$): 1
SkypeUriPreview: 1

The top five bots account for 99.8% of all bot tracking data, which means all we need to do is exclude five bots.
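
The 99.8% figure can be checked directly from the /_tracking counts listed above. The snippet below is only a worked check of that arithmetic; the helper name `top_n_share` is made up for illustration.

```python
# Requests to /_tracking per bot, copied from the list above.
TRACKING_HITS = {
    r"Googlebot\/": 804965, "Y!J": 55168, "Yeti": 10050, "Bytespider": 9645,
    "Applebot": 3236, "HeadlessChrome": 522, "Baiduspider": 260,
    "YisouSpider": 250, "Chrome-Lighthouse": 171, "Cincraw": 149,
    "Google-Read-Aloud": 125, r"yandex\.com\/bots": 104, "bingbot": 48,
    "heritrix": 38, "SeekportBot": 30, "PetalBot": 27, "Google-Safety": 6,
    "Google-InspectionTool": 6, "Blackboard": 5, "HubSpot": 4,
    "Ahrefs(Bot|SiteAudit)": 3, "proximic": 3, "Dataprovider.com": 2,
    "facebookexternalhit": 2, "Google-Structured-Data-Testing-Tool": 1,
    r"BingPreview\/": 1, "archive.org_bot": 1, "AdsBot-Google([^-]|$)": 1,
    "SkypeUriPreview": 1,
}


def top_n_share(hits, n):
    """Return the percentage of all hits made by the n busiest bots."""
    ranked = sorted(hits.values(), reverse=True)
    return 100 * sum(ranked[:n]) / sum(ranked)


print(f"{top_n_share(TRACKING_HITS, 5):.1f}%")  # prints 99.8%
```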

The difference between human traffic and bot traffic is:

Humans focus on popular datasets, while bots' interests are widely spread, as shown in this data:

For all datasets (roughly represented by "GET /dataset/.+") visited in this period:
91% of them were never visited by a human.
Googlebot visited 70% of them.

For the most popular datasets, such as electric-vehicle-population-data and fdic-failed-bank-list:
Human visits account for 99%.
Bot visits account for 1%.
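
For illustration, the "never visited by a human" share could be computed along these lines. This is a sketch under assumptions: the input is taken to be `(path, is_bot)` pairs already extracted from the logs for requests matching "GET /dataset/.+", and the function name `never_visited_by_humans` is hypothetical.

```python
from collections import Counter


def never_visited_by_humans(requests):
    """Return the fraction of visited dataset paths with zero human hits.

    `requests` is an iterable of (path, is_bot) pairs; how a request is
    classified as a bot is assumed to happen upstream (e.g. by user agent).
    """
    human_hits = Counter()
    seen = set()
    for path, is_bot in requests:
        seen.add(path)
        if not is_bot:
            human_hits[path] += 1
    if not seen:
        return 0.0
    bot_only = [path for path in seen if human_hits[path] == 0]
    return len(bot_only) / len(seen)
```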

Conclusion:

If we exclude traffic from the top 5 bots from the tracking data, we will reduce the workload of the tracking-update script by 80-90%, while seeing a drop of about 1% in the popularity count of the most visited datasets.


FuhuXia commented Oct 2, 2023

The PR above has stopped traffic from the top 6 bots from reaching the "/_tracking" endpoint. Preliminary analysis suggests that this change could lead to an 80-90% reduction in execution time for the tracking-update script. Note that these estimates are based on several assumptions. To assess the actual impact, we will run a few nightly tracking-update tasks three days from now and check the real-world effects. Hopefully a 12-hour weekly job can become a 2-hour nightly job.
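
As a rough illustration of the idea (not the approach the PR actually takes, which this thread does not describe), bot requests to "/_tracking" could be short-circuited in a WSGI middleware. The class name `SkipBotTracking` is hypothetical, and the pattern below uses the top five bots from the /_tracking table purely as an example; the exact six bots blocked by the PR are not listed here.

```python
import re

# Illustrative pattern only: the five busiest bots on /_tracking above.
BLOCKED_BOTS = re.compile(r"Googlebot\/|Y!J|Yeti|Bytespider|Applebot")


class SkipBotTracking:
    """WSGI middleware that drops /_tracking requests from known bots."""

    def __init__(self, app, pattern=BLOCKED_BOTS):
        self.app = app
        self.pattern = pattern

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if path == "/_tracking" and self.pattern.search(user_agent):
            # Answer with an empty response so nothing is recorded.
            start_response("204 No Content", [("Content-Length", "0")])
            return [b""]
        # Everything else is passed through to the wrapped application.
        return self.app(environ, start_response)
```

Bots can still crawl dataset pages in this sketch; only their tracking hits are dropped, so they never enter the tracking data that the update job has to process.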

@FuhuXia FuhuXia moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Oct 3, 2023
@FuhuXia FuhuXia moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Oct 3, 2023

FuhuXia commented Oct 5, 2023

PR deployed on 02 Oct 2023 20:50:02 GMT.

One manual tracking_update run executed:

2023-10-05T10:22:06.10-0400 [APP/TASK/58121c57/0] OUT 2023-10-05 14:22:06,106 INFO [ckanext.geodatagov] 105470 package indexes to be rebuilt starting from 2023-09-29 00:00:00

We will run another one tomorrow and do some calculations to get a good estimate of the workload of the nightly job.


FuhuXia commented Oct 6, 2023

Another manual tracking_update run:

2023-10-06 17:15:57,123 INFO  [ckanext.geodatagov] 8337 package indexes to be rebuilt starting from 2023-10-03 00:00:00

So the nightly job will be indexing about 8k datasets, which can be done in 40 minutes.


FuhuXia commented Oct 6, 2023

This PR changes tracking-update from a weekly job back to a nightly job. Before bot traffic was excluded, a tracking update took 4-5 hours as a nightly job or 10-12 hours as a weekly job. With this issue resolved, we are going back to a nightly job that finishes in under 1 hour.

@FuhuXia FuhuXia closed this as completed Oct 6, 2023

FuhuXia commented Oct 10, 2023

The nightly jobs for the past few days are looking good, finishing in 40-50 minutes.

@hkdctol hkdctol moved this from ✔ Done to 🗄 Closed in data.gov team board Oct 16, 2023
@FuhuXia FuhuXia mentioned this issue Jan 3, 2024