Enable a site to set an optional section name #17
Comments
Thanks for opening the issue. What is the browser supposed to do differently when identifying topics in a section as compared to the hostname? If a section is available, should it also include the full URL or page content or something?
The only thing different would be to group the topics for those pages under hostname+section instead of just hostname. The section should not need any other content besides what the classifier normally uses -- it's just a site-provided "hint" that this sub-division of the site may have more informative topics if it is treated as its own section. See #2 -- if a whole site gets miscategorized because the classifier is "confused" by multiple topics on the same domain, the maintainer of the site can use sections to split off groups of pages that should be treated separately by the classifier, and get more accurate topics identified on the remaining pages.
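To make the hostname+section grouping concrete, here is a minimal sketch. It is not part of the proposal or of any browser implementation: the `#` separator and the names `grouping_key` and `group_visits` are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import urlsplit


def grouping_key(url: str, section: Optional[str] = None) -> str:
    """Return the key under which a page visit would be grouped.

    Assumption: the key is "hostname#section" when the site supplies a
    section, and the bare hostname otherwise. The thread does not define
    a concrete format.
    """
    host = urlsplit(url).hostname or ""
    return f"{host}#{section}" if section else host


def group_visits(visits: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, List[str]]:
    """Group (url, section) pairs by their grouping key."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url, section in visits:
        groups[grouping_key(url, section)].append(url)
    return dict(groups)
```

Note that the section name only affects which pages end up in the same group; the path component of the URL is never inspected, in line with the point that the classifier cannot make assumptions from the first pathname component.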
But all that the classifier uses as input is the hostname. There is no page-level distinction on a site. E.g., https://example.com/a/ and https://example.com/b/ will have the same topics since they have the same hostname. Sections won't change anything.
Not specific to a section name but in the same line of thought; it'd be interesting to see if various lines of intent can be determined by Topics. For example, I've been in financial services for some time, where the vast majority of visits to a website can be for service-based activities. I can foresee that some users may be assigned a 'Financial services' type Topic without a genuine interest in the area, if they're casual users of the internet who mostly rely on mobile apps for their day-to-day needs. Certain parts of a website should be 'excluded' from forming how a Topic is calculated, otherwise advertisers are going to find Topics useless for certain industries which have a high service-based component. Not always, but sometimes.
@jkarlin Yes, the classifier would need to use the section in addition to the hostname. (It can't make assumptions based on the first pathname component, but can use the section because that is supplied by the site.)
@pugzor Yes, another example is that a new site with just some basic info, a signup form, and a (long) privacy policy could end up being classified under a bunch of boring privacy law topics instead of the true topic. Putting the legal docs in their own section would make the site's top-level topics better reflect the text from the homepage.
Effectively, sections are custom cluster names and we need a way to differentiate clusters. The downside, as @jkarlin pointed out in (#8 (comment)), is that custom names expose new state. Ultimately though, these clusters are discarded in favor of predefined topics... we could skip the intermediate step, I think? Different strategy, ~same outcome, PTAL: Site-seeded topics. The gist is that we can directly assign pages against predefined clusters (the topics themselves). This also allows sites to have some input into the output of the classifier.
A large site might have multiple contributors (such as columnists or videographers) and not know in advance which contributor is planning to cover which topics. In that case, a site could assign a section name based on the contributor name, or column or channel title, and let the classifier figure out the right topics (or, if there are not enough pages in that section to classify, use the top-level topics for the site). It may turn out that large sites with many topics would need to use both site-seeded topics (#50) and sections for the exchange of information to be fair enough to incentivize topic-specific sites to participate.
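The fallback described above (use the section's topics when it has enough pages, otherwise the site's top-level topics) might look roughly like this. Everything here is hypothetical: the threshold value, the `hostname#section` key format, and the `resolve_topics` name are assumptions, and the thread does not say how many pages would count as "enough".

```python
from typing import Dict, List, Optional

# Assumed threshold; the discussion does not specify a number.
MIN_PAGES_PER_SECTION = 5


def resolve_topics(
    host: str,
    section: Optional[str],
    page_counts: Dict[str, int],
    topics_by_key: Dict[str, List[str]],
) -> List[str]:
    """Pick section-level topics when available, else site-level topics."""
    key = f"{host}#{section}" if section else host
    has_enough_pages = page_counts.get(key, 0) >= MIN_PAGES_PER_SECTION
    if section and has_enough_pages and key in topics_by_key:
        return topics_by_key[key]
    # Too few pages in the section (or no section at all): fall back to
    # whatever topics were derived for the hostname as a whole.
    return topics_by_key.get(host, [])
```

For a contributor-named section with only a couple of pages, this returns the hostname's topics until the section grows past the assumed threshold.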
There is a schema.org property for articles that could be used for this, articleSection, on https://schema.org/Article. However, I imagine it would be VERY easy for malicious publishers to create fraudulent, generated-content site sections around it.
Yes, the section name would not be used at all for classification. It's just an identifier. If I name a section of my blog "luxury SUV test drives Mountain View" but it's all about my cat, the ML would classify those pages as "cat".
I think that definitely helps conceptually prevent some misrepresentation, but as we've seen in the wild it is very easy to run a set of articles through a scrambler, spit out something that is almost the same as a ripped-off article, and place it into a page with the associated tagging. I don't think it really solves the problem... though arguably it's very easy for a fraudster to spin up their own domain for each topic as well, so I'm not sure there's really a solution that has to do with sections. This is more of a general concern that the ML generating the topics will have to handle some other way.
@AramZS Yes, the scrambled or plagiarized article problem is general and not really tied to sections or even topics. (Kind of like brand safety -- it shows that none of this stuff works very well if adtech firms do a bad job of checking which sites they're willing to work with) |
Allow callers to specify a section name that the classifier can use to develop a topics list, to improve personalization for users of large, multi-topic sites. Callers could populate the section name in Topics API calls using the existing schema.org articleSection property already in use.
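As a rough illustration of where a section name could come from, the sketch below reads `articleSection` out of a page's JSON-LD metadata. It covers only the metadata lookup; how the value would be passed into a Topics API call is not defined in this issue, and the function name `article_section` is an assumption.

```python
import json
from typing import Optional


def article_section(json_ld: str) -> Optional[str]:
    """Return the first non-empty schema.org articleSection value, if any.

    Accepts the text of a JSON-LD script block holding either a single
    object or a list of objects. Returns None on malformed input.
    """
    try:
        data = json.loads(json_ld)
    except json.JSONDecodeError:
        return None
    items = data if isinstance(data, list) else [data]
    for item in items:
        if not isinstance(item, dict):
            continue
        value = item.get("articleSection")
        # articleSection may appear as a single string or a list of strings.
        if isinstance(value, list):
            value = value[0] if value else None
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None
```

Per the rest of this proposal, the returned string would serve only as an identifier for grouping pages, not as input to classification.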
If the topic list is per-hostname, a user of a large general-interest site may receive inadequate personalization compared to a user of multiple niche sites with only a few topics per site.
A section can be any subdivision of a site, including a "channel," "group," or "space."
This is separate from the question of allowing publishers to specify individual topics. The publisher-provided "section" is just an identifier applied to a subset of pages on that site, and the actual topics for pages in that section would still have to be determined by the classifier.