Extracting licence information from web resources with a Large Language Model

Daga, Enrico; Carvalho, Jason and Morales Tirado, Alba (2024). Extracting licence information from web resources with a Large Language Model. In: Workshop at ESWC on Semantic Technologies for Scientific, Technical and Legal Data (Dessì, Rima; Dessì, Danilo; Osborne, Francesco and Aras, Hidir eds.).


Data catalogues play an increasing role in supporting information sharing and reuse on the Web. However, evaluating the reusability of Web resources requires an understanding of the related licence and terms of use. Recent methods for licence representation and reasoning allow Web resources to be explored according to their permissions, obligations, and duties. Licence annotations should therefore be linked to those representations so that users can filter and explore datasets according to their licensing requirements. However, populating data catalogues with licence information is a tedious and error-prone task. In this paper, we explore the suitability of a Large Language Model (LLM) to support the automatic extraction, annotation, and linking of licence information from the reference Web pages of data catalogue items. The approach is evaluated for its capacity to automatically find relevant pages starting from a main Web page, extract data about copyright and licensing, and link licence descriptions to a knowledge graph of licences expressed in RDF/ODRL. We apply our method to extend the coverage of licence annotations in a data catalogue in the music domain.
