Publications

Research on software and scientific knowledge.

Publications on research software, scientific knowledge graphs, ontology infrastructure, and semantic program analysis.

7 publications

Oct 26, 2025

Conference paper · ISWC 2025 Companion Volume, Nara, Japan

Attributes, Taxonomies and Semantic alignment for Automated Research Software Classification

Research software (RS) plays a critical role in computational science, yet remains poorly categorized and difficult to discover or reuse. This research explores RS classification by investigating how textual and metadata attributes can be leveraged to develop scalable, interpretable classification methodologies. Existing taxonomies are evaluated through alignment with scientific knowledge graphs to identify redundancies and structural gaps. Labeled datasets are constructed by linking publications to software repositories, and RS attributes, such as README files, abstracts, and source code features, are benchmarked using multiple machine learning models and embedding strategies. A methodology that integrates semantic enrichment and transformer-based models is proposed for robust RS classification. Preliminary findings highlight the informativeness of publication abstracts for classification tasks and expose limitations in current community-defined taxonomies.

Jenifer Tabita Ciuciu-Kiss

Oct 13, 2025

Workshop paper · Sci-K 2025 at ISWC, CEUR Workshop Proceedings 4065

Are Scientific Annotations Consistently Represented across Science Knowledge Graphs?

Scientific Knowledge Graphs (SKGs) are increasingly used to annotate and interlink research outputs. However, little is known about how consistently they annotate the same publication. This paper presents a comparative analysis of category annotations across four major SKGs (ORKG, OpenAlex, OpenAIRE, and Papers with Code) using a manually curated gold-standard dataset of 70 AI-related papers. We examine differences in annotation coverage, granularity, and semantic alignment, highlighting frequent inconsistencies such as label mismatches, overly generic terms, and coverage gaps. Our analysis reveals that manual curation offers high-quality but sparse annotations, while automated systems achieve broader coverage at the cost of precision. This work contributes insights into the reliability of SKG metadata and outlines pathways for improving interoperability and annotation practices.

Jenifer Tabita Ciuciu-Kiss, Daniel Garijo

Jun 16, 2025

Workshop paper · Natural Scientific Language Processing at ESWC 2025

A study of the categories used in ‘Papers with Code’

An increasing number of machine learning developers share research software online to support their scientific investigations. In order to improve software findability, the scientific community has developed domain-specific taxonomies. However, are these taxonomies appropriate for software classification? This paper explores this question through a case study on Papers with Code, a popular platform where authors share their publications together with their software implementations. We define and apply a comparative framework with state-of-the-art text similarity techniques (TF-IDF, Sentence-BERT, CLIP), and we assess the level of overlap between different software categories defined in the platform, based on the methods descriptions contained in them. Our results show significant category overlap, which may limit the effectiveness of classification algorithms. While community-defined categories provide a useful foundation, they may require refinement, such as subcategories or refined definitions, to better capture interdisciplinary methods and improve classification accuracy.

Jenifer Tabita Ciuciu-Kiss, Daniel Garijo

May 8, 2025

Conference paper · Companion Proceedings of the ACM Web Conference 2025

Breaking Accessibility Barriers: An Ontology Proxy with Failure Recovery and Time Travel Capabilities

This paper introduces a novel concept and implementation of an ontology proxy designed to seamlessly enhance accessibility and reliability of the Web of Ontologies by addressing challenges such as link rot, evolution inconsistencies, and communication failures. The proxy features a time travel mode, powered by DBpedia Archivo, that provides access to archived and versioned snapshots of ontologies. This enables failure recovery and the emulation of a consistent state in time, supporting reproducible research and enhancing the FAIRness of ontologies and associated (meta)data in a plug-and-play manner. Initial evaluations show significant improvements in ontology retrieval success rates, underscoring the proxy's potential as a viable interface for breaking accessibility barriers.

Johannes Frey, Jenifer Tabita Ciuciu-Kiss, Natanael Arndt

Jun 30, 2022

Thesis · Universidad Politécnica de Madrid

A methodology for research software classification

An increasing number of scientific publications rely on computational experiments to deliver their intermediate and final results. Software developed for this purpose is known as research software and ranges from simple transformation or visualization scripts to complex computational pipelines. Research software is critical for reproducibility, and therefore developers and researchers deposit their contributions in online repositories, such as GitHub. However, these repositories do not provide a feature that helps users find similar software. Therefore, there is a need for an approach to automatically characterize research software according to common functionality or domain. This work proposes a flexible methodology to classify research software with similar functionality. We understand software with ‘same functionality’ as software repositories that belong to the same category, as agreed by the scientific community or external vocabularies. Our proposed methodology provides the means to classify new categories without the need to retrain previous classifiers. Our approach focuses on three main research questions: 1) Can we identify common categories to group software from different domains? 2) Can we develop a flexible methodology for classifying these categories? 3) Can we define a methodology to incorporate new categories without having to retrain our classifiers for existing categories? In order to address our research questions, we explore and compare against state-of-the-art techniques for software classification. We focused first on specific areas with existing annotated data (such as open platforms for machine learning), where papers and code have been made available by the community. We tested our methodology with lists of domain-specific software tools crowdsourced by the community.
A key step of our methodology is to find out how to automatically incorporate new labelled datasets, which are costly to produce, and how to prepare data for successful classification of software projects based on their available documentation. Our approach was evaluated using a separate test set containing multi-labeled test samples. The result on the training set using cross-validation is an F1 score of 92%. The result on the test set is 76%. Considering that state-of-the-art approaches achieved an F1 score of only 36%, this is an improvement of 40 percentage points. Once the methodology achieved a reasonable performance level, its results were implemented in an existing framework for software metadata extraction. Thanks to our approach, the extractor is able to group similar software together.

Jenifer Tabita Ciuciu-Kiss

Aug 4, 2021

Journal article · Acta Cybernetica 25

Towards Version Controlling in RefactorErl

Static source code analysis tools operate on an intermediate representation of the source code, usually a tree or a graph. These representations need to be updated as the source code changes from version to version. However, developers might be interested in the changes themselves or might need information about previous versions, so the tools are required to keep multiple analysed versions of the source code. RefactorErl is an open-source static analysis and transformation tool for Erlang that uses a graph representation to store and manipulate the source code. The aim of our research was to create an extension of the Semantic Program Graph of RefactorErl that is able to store different versions of the source code in a single graph. The new method resulted in a 30% decrease in memory footprint compared to the available workaround solutions.

Jenifer Tabita Ciuciu-Kiss, Melinda Tóth, István Bozó
