Introduction

The rapid growth of science and the steadily increasing volume of scientific publications have made efficient tools for organizing and analyzing bibliography a necessity. The Bibliography Graph is an innovative approach to the systematic recording and study of scientific knowledge. Rather than merely collecting documents, it provides a framework for addressing critical research questions and for understanding the dynamic relationships between publications, authors, and references.

Purpose of the Bibliography Graph

The main purpose of the graph is to improve the research process by providing a dynamic and expandable system for organizing bibliography. Despite significant progress in scientific search engines, data mining, and digital libraries, gaps remain in managing and interconnecting information. The graph seeks to bridge these gaps by offering a more effective way of mapping scientific production and identifying the relationships within it.

Structure of the Graph

The Bibliography Graph is based on a property graph model with directed edges. Entities such as documents and authors are represented as nodes, while the relationships between them are expressed as edges. Each document is a node with attributes such as title, year, place of publication, and content. Authors are recorded and linked to their works, forming a network of collaborations and scholarly interactions. References are edges that connect a citing document to the document it cites, revealing the chain of continuity and influence in science. In this way, the graph is more than a simple database: it serves as a dynamic tool that uncovers patterns, trends, and research networks.
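To make the node and edge model concrete, the following is a minimal in-memory sketch. It is not the actual implementation of the Bibliography Graph: the class name `BibliographyGraph`, the edge labels `AUTHORED` and `CITES`, and the sample titles and names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str                                  # "Document" or "Author" (assumed labels)
    props: dict = field(default_factory=dict)   # free-form attributes, e.g. title, year


@dataclass
class Edge:
    source: str                                 # id of the node the edge starts from
    target: str                                 # id of the node the edge points to
    label: str                                  # "AUTHORED" or "CITES" (assumed labels)


class BibliographyGraph:
    """Hypothetical minimal property graph: nodes with attributes, labeled edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = Node(node_id, label, props)

    def add_edge(self, source, target, label):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append(Edge(source, target, label))

    def cited_by(self, doc_id):
        """Documents that reference doc_id (incoming CITES edges)."""
        return sorted(e.source for e in self.edges
                      if e.label == "CITES" and e.target == doc_id)

    def coauthors(self, author_id):
        """Authors who share at least one document with author_id."""
        docs = {e.target for e in self.edges
                if e.label == "AUTHORED" and e.source == author_id}
        return sorted({e.source for e in self.edges
                       if e.label == "AUTHORED"
                       and e.target in docs
                       and e.source != author_id})


# Made-up sample data: two documents, two authors, one reference.
graph = BibliographyGraph()
graph.add_node("d1", "Document", title="Paper A", year=2015)
graph.add_node("d2", "Document", title="Paper B", year=2018)
graph.add_node("a1", "Author", name="A. Author")
graph.add_node("a2", "Author", name="B. Author")
graph.add_edge("a1", "d1", "AUTHORED")
graph.add_edge("a1", "d2", "AUTHORED")
graph.add_edge("a2", "d2", "AUTHORED")
graph.add_edge("d2", "d1", "CITES")   # Paper B references Paper A
```

Queries such as `graph.cited_by("d1")` or `graph.coauthors("a1")` then follow edges rather than scanning document records, which is the kind of relationship-centered lookup the graph structure is meant to support.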

Data Extraction

Data extraction plays a crucial role in the development and use of the graph. While some publishers provide well-organized metadata, many scientific articles are made available with incomplete information. Moreover, publications collected through web crawling usually lack structured metadata. To address this issue, a system was developed that can predict structured data from raw PDF files. This process automatically extracts key information such as the article’s title, the list of authors, the bibliographic references, and the place and year of publication. This capability enhances the reliability and completeness of the graph, contributing to a more effective mapping of scientific production.
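The shape of the extracted record can be illustrated with a deliberately simple sketch. It is a rule-based stand-in, not the predictive system described above: it assumes the PDF has already been converted to plain text, that the first line is the title, that the second line lists the authors, and that the reference list follows a line reading "References". The function name `extract_metadata` and the sample text are hypothetical.

```python
import re

# Four-digit years from 1900 to 2099.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_metadata(text):
    """Heuristic extraction from plain text (assumed layout, see lead-in)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # Split the body from the reference list at a "References" heading, if any.
    ref_start = next(
        (i for i, ln in enumerate(lines) if ln.lower() == "references"),
        len(lines),
    )
    body, refs = lines[:ref_start], lines[ref_start + 1:]

    title = body[0] if body else None

    # Assume authors are on the second line, separated by commas or "and".
    authors = []
    if len(body) > 1:
        authors = [a.strip() for a in re.split(r",|\band\b", body[1]) if a.strip()]

    # Take the first year that appears after the title and author lines.
    year_match = YEAR_RE.search(" ".join(body[2:]))

    return {
        "title": title,
        "authors": authors,
        "year": int(year_match.group()) if year_match else None,
        "references": refs,
    }


# Made-up sample text standing in for the content of a PDF.
sample = """A Sample Title
Jane Doe, John Roe and Ann Poe
Proceedings of an Example Venue, 2019
Some body text.
References
[1] First cited work, 2001.
[2] Second cited work, 2010.
"""
record = extract_metadata(sample)
```

Real articles rarely follow such a fixed layout, which is why a system that predicts these fields is needed in place of fixed rules like these.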

Research Challenges

Despite the advantages of the Bibliography Graph, its implementation faces several challenges. One major issue is author disambiguation: names alone cannot serve as unique identifiers, because many different individuals may share the same name. Resolving this requires additional information such as institutional affiliations or unique researcher identifiers like ORCID. Another important challenge is concept matching. Popular terms often appear in several different forms, so semantic alignment is needed to avoid duplication or misinterpretation. A further problem is the extraction of figures and tables, which contain essential information in many scientific papers. The lack of large, labeled datasets has limited the development of advanced methods for extracting such visual content, although recent progress in artificial intelligence holds promise for overcoming this obstacle.
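The first two challenges can be illustrated with a small sketch. It is one possible simplification, not the method used for the Bibliography Graph: author records are grouped by ORCID when one is present and otherwise by normalized name plus affiliation, and concept terms are matched only after normalizing case, hyphens, and whitespace. The function names, the record fields, and the ORCID value (a placeholder, not a real identifier) are assumptions.

```python
from collections import defaultdict


def author_key(record):
    """Prefer a unique identifier; fall back to normalized name plus affiliation."""
    if record.get("orcid"):
        return ("orcid", record["orcid"])
    name = " ".join(record["name"].lower().split())
    affiliation = " ".join(record.get("affiliation", "").lower().split())
    return ("name+affiliation", name, affiliation)


def group_authors(records):
    """Group records that are assumed to describe the same person."""
    groups = defaultdict(list)
    for rec in records:
        groups[author_key(rec)].append(rec)
    return list(groups.values())


def normalize_concept(term):
    """Collapse case, hyphens, and extra whitespace so surface variants match."""
    return " ".join(term.lower().replace("-", " ").split())


# Made-up records; the ORCID below is a placeholder value.
records = [
    {"name": "J. Smith", "affiliation": "University A", "orcid": "0000-0000-0000-0001"},
    {"name": "J. Smith", "affiliation": "University B"},
    {"name": "j.  smith", "affiliation": "university b"},
    {"name": "John Smith", "affiliation": "University C", "orcid": "0000-0000-0000-0001"},
]
groups = group_authors(records)
```

Here the two records sharing an ORCID are merged even though the name spelling and affiliation differ, while the two records without an ORCID are merged only because name and affiliation agree after normalization. A record lacking an ORCID is never merged with one that has it, and string normalization cannot align true synonyms, which is why semantic alignment remains an open problem.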

Conclusions and Perspectives

The Bibliography Graph emerges as a powerful tool for organizing and understanding scientific knowledge. Its contribution is not confined to data collection and cataloging but extends to revealing the structure of science itself through connections and interactions. Future directions involve improving the quality and enriching the content of the graph in order to provide more accurate and comprehensive information. At the same time, bringing together experts from different fields is considered essential for achieving a multidimensional understanding of the literature. The use of machine learning and artificial intelligence methods will create new opportunities for semantic analysis, while the development of collaborative platforms will enable researchers to actively contribute to the formation of a shared and integrated knowledge network.