Softcite is a project to improve the visibility of research software. We produce datasets, software, and papers.
(archived) Softcite Dataset v1
Go from a folder of PDFs to extracted full text in XML, annotated with software mentions.
- Softcite Mention Extractor Example Notebook
- Softcite Mention Extractor client
- Softcite Mention Extractor Server
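The folder-of-PDFs to annotated-XML workflow can be sketched as a simple batch loop. This is a minimal illustration only, not the actual client: `annotate_folder` and the `annotate` callable are hypothetical names, and the call to a running extractor server is deliberately left to the caller, since the server's endpoints are not described here.

```python
from pathlib import Path
from typing import Callable, Dict, List


def annotate_folder(
    pdf_dir: Path,
    out_dir: Path,
    annotate: Callable[[bytes], str],
) -> Dict[str, List[str]]:
    """Run `annotate` on each PDF in pdf_dir and write one XML file per PDF.

    `annotate` is a hypothetical stand-in for whatever sends PDF bytes to a
    mention extractor and returns annotated XML as a string.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    report: Dict[str, List[str]] = {"written": [], "failed": []}

    # Non-recursive on purpose: avoids name clashes between subfolders.
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        try:
            xml_text = annotate(pdf_path.read_bytes())
        except Exception:
            # Record the failure and keep going so one bad PDF
            # does not stop the whole batch.
            report["failed"].append(pdf_path.name)
            continue

        out_path = out_dir / (pdf_path.stem + ".xml")
        out_path.write_text(xml_text, encoding="utf-8")
        report["written"].append(out_path.name)

    return report
```

Passing the extractor in as a callable keeps the batch logic independent of any particular server or client version; in practice the notebook and client listed above are the supported route.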
We have infrastructure for building a website that lets users browse a database created from software extractions.
A demonstration is available, populated with a small set of extractions.
Du, C., Cohoon, J., Lopez, P., & Howison, J. (2022). Understanding progress in software citation: a study of software citation in the CORD-19 corpus. PeerJ Computer Science, 8, e1022.
Lopez, P., Du, C., Cohoon, J., Ram, K., & Howison, J. (2021). Mining Software Entities in Scientific Literature: Document-level NER for an Extremely Imbalance and Large-scale Task. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, 3986–3995.
Du, C., Cohoon, J., Lopez, P., & Howison, J. (2021). Softcite dataset: A dataset of software mentions in biomedical and economic research publications. Journal of the Association for Information Science and Technology, 72(7), 870–884.
Bassinet, A., Bracco, L., L’Hôte, A., Jeangirard, E., Lopez, P., & Romary, L. (2023). Monitoring the production and the openness of research data and software in France: Large-scale Machine-Learning analysis of scientific PDF.
Nesbitt, A., Veytsman, B., Mietchen, D., Brown, E. M., Howison, J., Pimentel, J. F., Hébert-Dufresne, L., & Druskat, S. (2024). Biomedical Open Source Software: Crucial Packages and Hidden Heroes. arXiv,