exclude bot traffic from tracking stats #4452
Comments
Inspected 19 days of CloudFront logs, from 2023-09-01 to 2023-09-19.
Raw file size: 15 GB
Top 10 bots' requests:
Not all bots parse javascript and made requests to
The top five bots account for 99.8% of total bot tracking data, which means all we need to do is exclude those 5 bots.
The difference between human traffic and bot traffic: humans focus on popular datasets, while bots' interests are widely spread, as shown in this data:
For all datasets (roughly represented by "GET /dataset/.+") visited in this period:
For those most popular datasets such as
Conclusion: If we exclude the top 5 bots' traffic from the tracking data, we will reduce the workload of the tracking-update script by 80-90%, while seeing a 1% drop in the popularity count of the top visited datasets.
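As a rough illustration of the kind of log analysis described above, here is a minimal Python sketch. It assumes CloudFront standard logs (tab-separated rows, with a `#Fields:` header naming the columns), and it uses a hypothetical `looks_like_bot` keyword heuristic; neither the function names nor the heuristic come from the actual analysis.

```python
from collections import Counter
from urllib.parse import unquote


def looks_like_bot(user_agent):
    # Hypothetical heuristic (an assumption, not the method used in this issue):
    # many crawlers identify themselves with one of these words.
    ua = user_agent.lower()
    return any(word in ua for word in ("bot", "crawler", "spider"))


def count_bot_requests(lines, path="/_tracking"):
    """Count bot requests to `path`, keyed by user agent."""
    fields = None
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # Column names come from the log header, so no positions are hardcoded.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or fields is None:
            continue
        row = dict(zip(fields, line.split("\t")))
        # CloudFront URL-encodes the user agent value.
        user_agent = unquote(row.get("cs(User-Agent)", ""))
        if row.get("cs-uri-stem") == path and looks_like_bot(user_agent):
            counts[user_agent] += 1
    return counts


def top_share(counts, n=5):
    """Fraction of all counted requests that come from the n busiest user agents."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(n)) / total
```

Passing the lines of a decompressed log file to `count_bot_requests` and then calling `top_share(counts, 5)` gives the share of bot tracking requests attributable to the five busiest bots.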
The PR above has stopped traffic from the top 6 bots from accessing the "/_tracking" endpoint. The preliminary analysis suggests that this change could reduce the execution time of the tracking-update script by 80-90%. Note that these estimates are based on several assumptions. To assess the actual impact, we will run a few nightly tracking-update tasks in three days and check the real-world effects. Hopefully a 12-hour weekly job can become a 2-hour nightly job.
PR deployed on
One manual tracking_update run executed:
Will run another one tomorrow and do some calculations to get a good estimate of the nightly job's workload.
Another manual tracking_update run:
So the nightly job will be indexing 8k datasets. It can be done in 40 mins.
This PR changes tracking-update from a weekly job to a nightly job. Before bot traffic was excluded, a tracking update took 4-5 hours as a nightly job or 10-12 hours as a weekly job. With this issue resolved, we are going back to a nightly job that finishes in under 1 hour.
The nightly jobs for the past few days are looking good, finishing in 40-50 mins.
The tracking update job was changed from nightly to weekly due to its long processing time. We suspect that bot crawling traffic is the cause. Bot crawling traffic makes the popularity count and tracking stats less meaningful, and because bots crawl each and every dataset, the tracking update job takes unnecessarily long to process. Now that the user-agent ticket lets us differentiate bot traffic from regular user visits, we should try to exclude bot traffic from tracking stats.
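One possible way to act on the user-agent signal is sketched below in Python. The bot names in `BOT_PATTERNS` and the `should_track` helper are hypothetical placeholders for illustration only; the real list of bots and the place where the filter runs (application code, web server, or CDN) are not specified in this issue.

```python
import re

# Hypothetical placeholders: replace with the user-agent substrings of the
# bots that dominate the tracking data.
BOT_PATTERNS = ("examplebot", "samplecrawler")

_BOT_RE = re.compile("|".join(re.escape(p) for p in BOT_PATTERNS), re.IGNORECASE)


def is_excluded_bot(user_agent):
    """Return True when the user agent matches one of the listed bots."""
    # A missing or empty user agent is treated as non-bot in this sketch.
    return bool(user_agent) and _BOT_RE.search(user_agent) is not None


def should_track(user_agent):
    """Decide whether a hit should be recorded in the tracking stats."""
    return not is_excluded_bot(user_agent)
```

A request handler or log filter could call `should_track(user_agent)` before recording a hit, so that only traffic from the listed bots is dropped and everything else is counted as before.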
Sketch