Schema definition #4

gpavlov2016 · 2024-10-23T21:05:42Z

To jumpstart the conversation, what other mandatory fields do we need besides the following:

Paper Title
Author(s)
Code repository

gpavlov2016 · 2024-10-25T19:45:57Z

Max Vasiliev commented on Slack:

I was wondering what we mean by mandatory? A paper title and its authors always go together, but a code repository may or may not be associated with a given paper title/author, or with any paper at all.
Suppose a famous paper spawns hundreds of GitHub projects: how do we show that? Or suppose the paper authors' repo is broken and dead, while another has a thriving community and ends up spinning off into various companies and other projects. We want to show that and acknowledge the original paper and authors, while also accurately describing the connection to, and the actors within, the open-source world. This matters especially because the paper may go on to be referenced by other papers in the academic world, diverging from developments on the software side.
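
One way to picture the concern above is to treat the paper-to-repository connection as a many-to-many link that carries its own attributes, rather than as a single field on the paper. The following is a minimal sketch only: `LinkKind`, `PaperRepoLink`, `repos_for`, and the example URLs are hypothetical names invented for illustration, not part of any agreed schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LinkKind(Enum):
    OFFICIAL = "official"    # repository published by the paper's authors
    COMMUNITY = "community"  # independent implementation or fork


@dataclass(frozen=True)
class PaperRepoLink:
    """One edge between a paper and a repository (hypothetical shape)."""
    paper_title: str
    repo_url: str
    kind: LinkKind
    active: Optional[bool] = None  # None means "not yet assessed"


def repos_for(links, paper_title):
    """Return every link recorded for a paper, whatever its kind."""
    return [link for link in links if link.paper_title == paper_title]


# Placeholder data: a dead official repo next to a thriving community one.
links = [
    PaperRepoLink("Example Paper", "https://example.org/authors/repo",
                  LinkKind.OFFICIAL, active=False),
    PaperRepoLink("Example Paper", "https://example.org/community/repo",
                  LinkKind.COMMUNITY, active=True),
]
```

With this shape, a paper with no repository simply has no links, and a paper with hundreds of repositories has hundreds of links, each labeled by kind.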

gpavlov2016 · 2024-10-25T19:49:12Z

The project is about mapping research papers to code repositories; if I got that wrong, please correct me. Hence, a code repository seems to be a requirement when submitting a new entry to the database.

jring-o · 2024-10-26T11:11:10Z

What this issue seems to be addressing is the "Papers" node, and what the necessary attributes and relationships are for that node. This is fantastic. I'm going to provide a bit more context on the larger project, then address minimum attributes and relationships for papers.

I am using the term "minimum" here because there are no "mandatory" attributes or relationships. Some papers might have all attributes (an author, title, code, etc.), while others might only have one or two. Both can exist in the graph.

Context

MOSS is about mapping an ecosystem, not only papers. Papers are one aspect of the ecosystem.

Four core questions related to this thread are:

What papers use what software to produce discoveries and/or patents?
What other projects do those papers cite?
Who builds those projects?
Who supports those people?

With this in mind, we might use these questions to guide us:

What are the minimum nodes needed to tell this story?
What are the minimum fields/attributes to map to those nodes?
What are the minimum relationships to map between those nodes?

The minimums we have identified so far are laid out here:

https://docs.google.com/document/d/1NEWtI7hqQA74jk9Geg8bwKVS3qTzV9hWAfEMQg_Y1gM/edit?tab=t.0

High level, the core nodes are:

People
Projects
Papers
Organizations

The core attributes and relationships can be viewed in the doc.

So the question becomes:

Is this model a good starting point? What are we missing? For example, in the "projects" node, I don't think we yet have a "depends on" relationship for mapping dependencies.

Paper Attributes

Here are the current attributes and relationships for papers:

Attributes

doi
title
description
url
published Date
has Public Data
has Public Code

Relationships

Cites another paper
Cites a project
Mentions a project
Is related to a domain

We capture authors through a relationship stemming from the "people" node.

So, the question becomes:

Are these good starting attributes and relationships for papers? What is missing?
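
As a rough illustration of "minimum, not mandatory", the paper attributes listed above could be sketched as a record in which every field is optional. This is an assumption-laden sketch, not the project's actual model: the class name `Paper`, the snake_case field names, and the `known_attributes` helper are all hypothetical.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import Optional


@dataclass
class Paper:
    """Paper node; every attribute may be missing (hypothetical field names)."""
    doi: Optional[str] = None
    title: Optional[str] = None
    description: Optional[str] = None
    url: Optional[str] = None
    published_date: Optional[date] = None
    has_public_data: Optional[bool] = None
    has_public_code: Optional[bool] = None

    def known_attributes(self):
        """List the names of the attributes that have actually been filled in."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


# A paper known only by its title is still a valid node.
sparse = Paper(title="Some paper with no DOI yet")
print(sparse.known_attributes())
```

Keeping `None` distinct from `False` lets `has_public_code=False` mean "checked, none found" while `None` means "not yet checked".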

pheochromo · 2024-10-29T12:04:42Z

One concern I see is that mapping outward from paper space reaches only a subset of all projects unless we add some extra work, interpretation, confidence scores, etc.

Great if (:Paper)-[:OFFICIAL_CODE]->(:Project) exists (the paper links to the authors' repo).
But if the goal is to eventually quantify the aggregate effective (:Paper)-{[:CONTRIBUTED_CODE]}->(:Project) to rank paper impact, you'd need to either resolve (:Project)-[:DEPENDS_ON]->(:Project), i.e. build a dependency scraper, or start classifying and hand-waving.
For example, do all projects calling the Hugging Face transformers library automatically connect to, and count towards, the score of "Attention Is All You Need"? How tractable is this?

jring-o, it's not just missing from the schema, it's a key piece of the work we'd need to do.
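
To make the tractability question a bit more concrete, here is a minimal sketch of what "resolving DEPENDS_ON" could mean once the edges exist: collect every project that depends, directly or indirectly, on a paper's official code. It assumes the dependency edges have already been scraped into a plain dictionary; the function name `transitive_dependents` and the input shape are hypothetical.

```python
from collections import defaultdict, deque


def transitive_dependents(depends_on, root):
    """Return every project that depends on `root`, directly or indirectly.

    `depends_on` maps a project name to the set of projects it depends on
    (an assumed input shape, e.g. the output of a dependency scraper).
    """
    # Invert the edges so we can ask "who depends on X?".
    dependents = defaultdict(set)
    for project, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(project)

    seen = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            # The `seen` set keeps the walk finite even if the graph has cycles.
            if parent != root and parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen
```

The size of the returned set is one crude candidate for an impact count; whether every transitive dependent should count equally is exactly the open question raised above.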

In the other direction, what exactly would (:Paper)-[:CITES]->(:Project) show?
If we have (:Paper)-[:OFFICIAL_CODE]->(:Project), we can scrape the official implementation's dependencies upstream.
But then what about (:Paper)-[:MENTIONED]->(:Project)? No one even mentions LaTeX :')

Ex.

Mamba paper: https://arxiv.org/abs/2312.00752 (Dec 1 2023, cited by 84)
links to 2 repos: https://github.com/radarFudan/mamba and https://github.com/state-spaces/mamba (13k stars, 34 contributors, git init Dec 3, 2023); the former is a fork of the latter, which has far more activity.
Both repos also link a more recent paper by the duo, Mamba2: https://arxiv.org/abs/2405.21060 (May 31 2024, cited by.. 1?)**
whose official code is also incorporated in the repo (still called mamba!)

** OpenAlex has only the preprint of https://arxiv.org/abs/2405.21060, with 1 citation. But this work is already in active use and being further built upon.
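
The paper and repository links described in this example can be written down as plain (source, relationship, target) triples, which shows how one repository ends up attached to two papers. This is only a sketch: the relationship names `LINKS_TO` and `FORK_OF`, the `arxiv:`/`github:` identifier prefixes, and the helper `papers_touching` are assumptions made for illustration, not schema decisions.

```python
MAMBA = "arxiv:2312.00752"
MAMBA2 = "arxiv:2405.21060"
UPSTREAM = "github:state-spaces/mamba"
FORK = "github:radarFudan/mamba"

# Edges as described in the example above (relationship names are assumed).
triples = [
    (MAMBA, "LINKS_TO", UPSTREAM),
    (MAMBA, "LINKS_TO", FORK),
    (FORK, "FORK_OF", UPSTREAM),
    (UPSTREAM, "LINKS_TO", MAMBA2),
    (FORK, "LINKS_TO", MAMBA2),
]


def papers_touching(repo, edges):
    """Return papers connected to `repo` by LINKS_TO in either direction."""
    papers = set()
    for source, relation, target in edges:
        if relation != "LINKS_TO":
            continue
        if target == repo and source.startswith("arxiv:"):
            papers.add(source)
        if source == repo and target.startswith("arxiv:"):
            papers.add(target)
    return papers


print(papers_touching(UPSTREAM, triples))
```

Both papers come back for either repository, so any per-paper score derived from repository activity has to decide how to split credit between them.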

HF transformers (134k stars, 2813 contributors) added Mamba support Mar 5 2024 https://github.com/huggingface/transformers/commits/main/docs/source/en/model_doc/mamba.md under transformers.MambaForCausalLM
and Mamba2 support ~Aug 6 2024 https://github.com/huggingface/transformers/commits/main/docs/source/en/model_doc/mamba2.md under transformers.Mamba2ForCausalLM,
while llama.cpp (67k stars, 921 contributors) added Mamba2 Aug 21 2024 ("llama : initial Mamba-2 support", ggerganov/llama.cpp#9126)

I think this shows integration, but how much are those models actually being used? Can we estimate based on code class names? Forum discussions?
https://github.com/search?q=MambaForCausalLM&type=code

I also imagine the number of definitive connections between papers and projects is relatively small compared to the total number of papers and projects. That is, most papers won't have official code. Are we focusing on those that do? Maybe building off something like inclusion in HF transformers/TensorFlow/Keras?

How do we feel about cycles? 😅
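
If cycles are allowed in the graph, it may at least help to be able to find them. Below is a minimal sketch of directed-cycle detection using a depth-first search with three node states; the function name `find_cycle` and the adjacency-list input are hypothetical, and the recursion would need to be made iterative for very large graphs.

```python
def find_cycle(graph):
    """Return one directed cycle as a list of nodes, or None if there is none.

    `graph` maps each node to a list of the nodes it points to.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in graph.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # `nxt` is already on the current path: we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# A repo that links a paper whose code lives in that same repo.
example = {"paper": ["repo"], "repo": ["paper2"], "paper2": ["repo"]}
print(find_cycle(example))
```

A traversal that scores impact can then either refuse cyclic input or, more simply, track visited nodes so that each node is counted once.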
