Enable a site to set an optional section name #17

dmarti · 2022-01-26T18:42:44Z

Allow callers to specify a section name that the classifier can use to develop a topics list, to improve personalization for users of large, multi-topic sites. Callers could populate the section name in Topics API calls using the existing schema.org articleSection property already in use.

If the topic list is per-hostname, a user of a large general-interest site may receive inadequate personalization compared to a user of multiple niche sites with only a few topics per site.

A section can be any subdivision of a site, including a "channel," "group," or "space."

This is separate from the question of allowing publishers to specify individual topics. The publisher-provided "section" is just an identifier applied to a subset of pages on that site, and the actual topics for pages in that section would still have to be determined by the classifier.

jkarlin · 2022-01-26T18:54:56Z

Thanks for opening the issue. What is the browser supposed to do differently when identifying topics in a section as compared to the hostname? If a section is available should it also include the full URL or page content or something?

dmarti · 2022-01-26T22:51:12Z

The only thing different would be to group the topics for those pages under hostname+section instead of just hostname.

The section should not need any other content besides what the classifier normally uses -- it's just a site-provided "hint" that this sub-division of the site may have more informative topics if it is treated as its own section.

See #2 -- if a whole site gets miscategorized because the classifier is "confused" by multiple topics on the same domain, the maintainer of the site can use sections to split off groups of pages that should be treated separately by the classifier, and get more accurate topics identified on the remaining pages.
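A minimal sketch of the grouping described above, assuming pages are keyed by hostname+section when the site supplies a section and by hostname alone otherwise. The names `grouping_key` and `group_pages` and the sample section labels are hypothetical; nothing here is part of the Topics API.

```python
from collections import defaultdict
from urllib.parse import urlparse


def grouping_key(url, section=None):
    """Return the (hostname, section) key a page would be grouped under."""
    hostname = urlparse(url).hostname
    # An empty or missing section falls back to hostname-only grouping.
    return (hostname, section or None)


def group_pages(pages):
    """Group (url, section) pairs by their hostname+section key."""
    groups = defaultdict(list)
    for url, section in pages:
        groups[grouping_key(url, section)].append(url)
    return dict(groups)


# Hypothetical pages: the site, not the URL path, supplies the section.
pages = [
    ("https://example.com/a/1", "sports"),
    ("https://example.com/a/2", "sports"),
    ("https://example.com/b/1", "legal"),
    ("https://example.com/about", None),
]
groups = group_pages(pages)
```

Each group would then go to the classifier separately, so the "legal" pages would no longer influence the topics inferred for the rest of example.com.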

jkarlin · 2022-01-26T22:57:09Z

But all that the classifier uses as input is the hostname. There is no page-level distinction on a site. e.g., https://example.com/a/ and https://example.com/b/ will have the same topics since they have the same hostname. Sections won't change anything.

pugzor · 2022-01-27T00:53:50Z

Not specific to a section name but in the same line of thought; it'd be interesting to see if various lines of intent can be determined by Topics.

For example, I've been in financial services for some time, where the vast majority of visitors to a website can be for service-based activities. I can foresee that some users may be assigned a 'Financial services' type Topic if they're casual users of the internet who mostly rely on mobile apps for most of their day-to-day needs, without a genuine interest in the area. Certain parts of a website should be 'excluded' from forming how a Topic is calculated, otherwise advertisers are going to find Topics useless for certain industries which have a high service-based component. Not always, but sometimes.

dmarti · 2022-01-27T01:09:50Z

@jkarlin Yes, the classifier would need to use the section in addition to the hostname. (It can't make assumptions based on the first pathname component, but it can use the section because that is supplied by the site.)

@pugzor Yes, another example is that a new site with just some basic info, a signup form, and a (long) privacy policy could end up being classified under a bunch of boring privacy law topics instead of the true topic. Putting the legal docs in their own section would make the site's top-level topics better reflect the text from the homepage.

igrigorik · 2022-02-09T18:00:44Z

Whatever topic is returned, will continue to be returned for any caller on that site for the remainder of the three weeks. When a site provides a section name, results will be the same across the entire site, not just within a section. (s)

Effectively, sections are custom cluster names and we need a way to differentiate clusters. The downside, as @jkarlin pointed out in (#8 (comment)), is that custom names expose new state. Ultimately though, these clusters are discarded in favor of predefined topics, so we could skip the intermediate step, I think?

Different strategy, ~same outcome, PTAL: Site-seeded topics

The gist is that we can directly assign pages against predefined clusters (the topics themselves). This also allows sites to have some input into the output of the classifier.

dmarti · 2022-02-11T18:58:28Z

A large site might have multiple contributors (such as columnists or videographers) and not know in advance which contributor is planning to cover which topics.

In that case, a site could assign a section name based on the contributor name, or the column or channel title, and let the classifier figure out the right topics (or, if there are not enough pages in that section to classify, use the top-level topics for the site).
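The fallback in that last parenthetical could look roughly like the sketch below. The threshold `MIN_PAGES_PER_SECTION`, the function `resolve_topics`, and the topic labels are all assumptions made for illustration; the thread does not say how many pages count as "enough."

```python
# Hypothetical threshold: the proposal does not define "enough pages".
MIN_PAGES_PER_SECTION = 5


def resolve_topics(section_page_count, section_topics, site_topics,
                   min_pages=MIN_PAGES_PER_SECTION):
    """Use section-level topics when the section is large enough to classify,
    otherwise fall back to the site's top-level topics."""
    if section_page_count >= min_pages and section_topics:
        return section_topics
    return site_topics


site_topics = ["News"]
busy_column = resolve_topics(40, ["Cycling"], site_topics)  # section is used
new_column = resolve_topics(2, ["Cycling"], site_topics)    # falls back to site
```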

It may turn out that large sites with many topics would need to use both site-seeded topics (#50) and sections for the exchange of information to be fair enough to incentivize topic-specific sites to participate.

AramZS · 2022-02-15T16:48:39Z

There is a schema.org property for articles that could be used for this - articleSection - on https://schema.org/Article
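For illustration, here is one way a section name might be read from a page's JSON-LD metadata. This is only a sketch: `article_section` is a hypothetical helper, it matches only the exact type `Article` (not subtypes such as `NewsArticle`, and not `@graph` wrappers), and how a browser would actually obtain the value is not specified anywhere in this thread.

```python
import json


def article_section(json_ld_text):
    """Return the first articleSection found in a JSON-LD string, or None."""
    data = json.loads(json_ld_text)
    # A JSON-LD script may hold one object or a list of objects.
    items = data if isinstance(data, list) else [data]
    for item in items:
        if isinstance(item, dict) and item.get("@type") == "Article":
            section = item.get("articleSection")
            # The property may be a single string or a list of strings.
            if isinstance(section, list):
                return section[0] if section else None
            return section
    return None


example = (
    '{"@context": "https://schema.org", "@type": "Article", '
    '"headline": "Test drive", "articleSection": "Autos"}'
)
section = article_section(example)
```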

However, I imagine it would be VERY easy for malicious publishers to create fraudulent site sections filled with generated content around this.

dmarti · 2022-02-15T16:51:38Z

Yes, the section name would not be used at all for classification. It's just an identifier. If I name a section of my blog "luxury SUV test drives Mountain View" but it's all about my cat, the ML would classify those pages as "cat".

AramZS · 2022-02-15T17:04:37Z

I think that definitely helps conceptually prevent some misrepresentation, but as we've seen in the wild, it is very easy to run a set of articles through a scrambler, spit out something that is almost the same as a ripped-off article, and place it into a page with the associated tagging. I don't think it really solves the problem... though arguably it's very easy for a fraudster to spin up their own domain for each topic as well, so I'm not sure there's really a solution that has to do with sections. It's more that this is a general concern that the ML generating the topics will have to handle some other way.

dmarti · 2022-02-15T17:52:25Z

@AramZS Yes, the scrambled or plagiarized article problem is general and not really tied to sections or even topics. (Kind of like brand safety -- it shows that none of this stuff works very well if adtech firms do a bad job of checking which sites they're willing to work with.)

This was referenced Jan 26, 2022

Improve personalization for users of large sites #8

Open

Should sites be able to set their own topics via response headers? #1

Open

dmarti mentioned this issue Jan 26, 2022

What should happen if a site disagrees with the topics assigned to it by the browser? #2

Closed

dmarti mentioned this issue Feb 22, 2022

Classifier corpus #22

Open

dmarti mentioned this issue Mar 17, 2022

Provide Topics API for not adding current page's topics #54

Closed
