Jonas R. (er/ihm) (@jonasjrichter.mastodon.pnpde.social.ap.brid.gy)

Improving the interconnection between Wikidata and the CERL Thesaurus

Camillo Carlo Pellizzari di San Girolamo (Scuola Normale Superiore, Pisa)

Wikidata is a collaboratively edited knowledge base containing data about 120 million entities; since its creation in 2012 it has been edited over 2.4 billion times by over 700,000 users (statistics from WikiScan as of 2026-05-18). Wikidata is one of the projects supported by the Wikimedia Foundation, the most famous of which is the oldest one, Wikipedia, which turned 25 last January. Wikipedia is written in more than 300 languages, mostly by different communities, and aims to provide human-readable knowledge in the form of encyclopaedia articles. Wikidata has a complementary aim: providing human- and machine-readable structured data in as many languages as possible. It is a multilingual project in which users from all linguistic backgrounds contribute to the construction and improvement of the same data.

Wikidata collects not only factual data, i.e. data describing concepts, persons, places, events, etc., but also links to other databases, which readers can use as sources for the factual data or simply as further reading to better understand the topic. All the data in Wikidata can be queried via a SPARQL endpoint: SPARQL queries are a powerful tool for cross-referencing data in order to infer new knowledge, for creating effective visualizations through maps and graphs, and for comparing Wikidata’s data with other databases that also provide SPARQL endpoints, so as to find concordances and (more interestingly) discordances requiring investigation.

Among Wikidata’s properties (currently more than 13,000), more than 9,000 are external identifiers; 230 of these external identifiers link to library authority files (list). Nearly 6.5 million items contain at least one of these properties (list); the most linked authority files (cf.
complete list) are VIAF (4.6 M items), GND (2.9 M), ISNI (2.5 M), LCNAF (1.7 M), IDREF (1.0 M), and NKC (0.9 M).[1]

The interaction between Wikidata and authority files can be seen as mutually beneficial. For Wikidata, authority files are crucial for the clear identification of entities, especially in cases of confusion between homonyms or near-homonyms, and as references, especially in the items regarding persons. For authority files, Wikidata can be used to extract data, both factual data and links to other resources, and to gain greater visibility. Finally, comparing data between Wikidata and authority files is a good opportunity to find mistakes on both sides, thus improving overall quality. Wikidata makes this comparison very easy through SPARQL queries, especially when the authority files also offer a SPARQL endpoint, as GND and IDREF do; otherwise, the data extracted from Wikidata queries can be downloaded and compared locally with data extracted in other ways from authority files that do not (yet) have a SPARQL endpoint.

The CERL Thesaurus has been linked from Wikidata since the creation of its property, P1871, in 2015. Most of the links have been added through Mix’n’match, which, together with OpenRefine, is the main tool used for reconciling data with Wikidata. However, owing to a problem in 2016 whose cause remains unclear, a few thousand wrong matches were also added, and they remained mostly unnoticed for about a decade. In March 2026 I decided to undertake a systematic cleanup of the CERL IDs present in Wikidata, in order to finally resolve this issue.
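The local comparison mentioned above, for authority files without a SPARQL endpoint, could look roughly like the following. This is only a minimal sketch and not the procedure actually used: it assumes that both sides have already been exported as simple `{identifier: value}` mappings (for instance, a birth year keyed by authority ID), and the function name, the keys, and the sample data are all hypothetical.

```python
from typing import Dict, List, NamedTuple


class Discordance(NamedTuple):
    """One identifier whose value differs between the two exports."""
    authority_id: str
    wikidata_value: str
    authority_value: str


def compare_exports(
    wikidata: Dict[str, str], authority: Dict[str, str]
) -> Dict[str, List]:
    """Compare two {authority_id: value} mappings exported from each side."""
    shared = sorted(wikidata.keys() & authority.keys())
    return {
        # Same identifier, same value: the two sources agree.
        "concordant": [i for i in shared if wikidata[i] == authority[i]],
        # Same identifier, different value: needs a check against the sources.
        "discordant": [
            Discordance(i, wikidata[i], authority[i])
            for i in shared
            if wikidata[i] != authority[i]
        ],
        # Identifiers present on one side only.
        "only_in_wikidata": sorted(wikidata.keys() - authority.keys()),
        "only_in_authority": sorted(authority.keys() - wikidata.keys()),
    }


# Hypothetical sample data: identifiers and years are invented for illustration.
wikidata_export = {"id-1": "1501", "id-2": "1540", "id-3": "1602"}
authority_export = {"id-1": "1501", "id-2": "1545", "id-4": "1700"}
report = compare_exports(wikidata_export, authority_export)
```

The discordant entries are the interesting output: each one is a candidate mistake on one side or the other, to be resolved by checking the sources rather than by trusting either database by default.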
The cleanup included the following steps:

1) I created two lists to be checked manually: the items containing two or more CERL Thesaurus IDs (list) and the items containing a deprecated CERL Thesaurus ID (list). In Wikidata, an ID is deprecated when it is considered invalid for some reason, usually because it refers to two or more persons or because it contains very little identifying data. These lists are periodically updated by a bot, and the update can be triggered manually if needed.

2) To filter out some false positives from the first list, I edited 7,172 items containing both a CERL _nomen personae_ and a CERL _nomen imprimatoris_ ID, adding the appropriate “object has role” qualifiers (cf. example edit and list of edits); I repeated the same operation on 9,284 more items found after the operation in point 4 below (cf. list of edits).

3) I manually checked both lists created in point 1; having found both matching mistakes in Wikidata and issues in the CERL Thesaurus entries themselves, I contacted Drs. Marian Lefferts, who quickly confirmed that she was interested in receiving error reports from these lists. Thus, in cooperation with Valentina Piccinin (CT Editor) and Elena Liventsova (Data Conversion Group, DCG), we started a manual cleanup of both lists: I first fixed issues in Wikidata, and then together we fixed issues in the CERL Thesaurus (example of CERL duplication in Wikidata). The cleanup is still ongoing.

4) In early May I used the matches between CERL Thesaurus entries and GND entries, which I had obtained from DCG, to do a major import of 414,217 CERL Thesaurus IDs into Wikidata, matching them on the basis of the GND. I was thus able not only to increase the number of CERL Thesaurus IDs in Wikidata (example edit) from 282,214 to 450,633 (an increase of about 60%), but also to confirm 245,798 existing matches (example edit) and to find and remove 1,787 old, wrong matches (cf. example edit and list of edits).
Although a few wrong matches are probably still present in Wikidata, most of them have been fixed through this operation.

5) Finally, I used the list of currently non-redirected CERL Thesaurus IDs, also obtained from Elena at DCG, to do a further cleanup in Wikidata: 16,229 IDs absent from this list, mostly redirected and in a few cases invalid, were deleted (cf. example edit and list of edits). This operation allowed me to remove nearly all the remaining false positives from the first list.

This entire cleanup operation, documented in its various phases on the talk page of the property, is an example of how easily and quickly large-scale adjustments can be implemented in Wikidata, thanks to mass editing tools such as QuickStatements and QuickStatements 3.0. After its completion, further interactions between the data of CERL Thesaurus entries and the data of the Wikidata items linking to them will be possible, and much easier thanks to the removal of the majority of the wrong matches. I list here the main ones:

* first of all, the checks will continue on the two aforementioned lists, making it possible to reduce duplication in the CERL Thesaurus entries on the basis of Wikidata items linking to two or more of them;
* comparing the authority records of the Bibliothèque nationale de France (BNF), the Servizio bibliotecario nazionale in Italy (SBN), the Dutch Name Authority file (NTA) etc.
linked by CERL Thesaurus entries and by the corresponding Wikidata items could reveal further inconsistencies;
* dates (of birth, of death and of activity) in the CERL Thesaurus and in Wikidata could be compared to find differences and progressively fix them after thorough checks on the sources;
* the CERL Thesaurus could use Wikidata to enrich its entries with links to other authority files, or to other types of resources already linked in Wikidata, such as biographical dictionaries and archival databases; an example of this kind of enrichment through Wikidata’s data is the AuthorityBox described by Stefano Bargioni in this paper and this interview.

Similar forms of cooperation have already proven effective for many other authority files, both on the small scale of error reports for issues affecting specific authority records and on the larger scale of improving data quality through massive imports, either to Wikidata (as in the case of NKC and GND) or from Wikidata (as in the case of SBN). I have written about these cooperations in a recently published paper, and detailed documentation is also available in Wikidata (for GND massive imports, cf. the talk page of the property; for SBN, cf. the project page) and in the scientific literature (for NKC, cf. this paper; for SBN, cf. the most recent papers: this one regarding the data imported from Wikidata into SBN, and this one regarding the strategies for increasing the number of matched authority records in Wikidata).

I am very grateful to Dr. Marian Lefferts, Valentina Piccinin, and Elena Liventsova for the cooperation we have started. I hope it can continue and also become a useful example for other authority files of the possibilities offered by matching authority records with Wikidata and comparing their data with it.
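To make the identifier-based import described in step 4 more concrete, here is a minimal sketch of how matches obtained through a shared GND ID might be sorted into new, confirmed, and suspect ones, and then turned into batch commands. It is an illustration under stated assumptions, not the code actually used: the three input mappings, the function names, and all sample IDs are hypothetical; the rule for flagging suspect matches is my own simplification; and the output assumes the tab-separated QuickStatements V1 command syntax (string values in double quotes, a leading `-` to remove a statement).

```python
from typing import Dict, List, Set, Tuple

CERL_PROPERTY = "P1871"  # CERL Thesaurus ID property, as named in the text


def classify_matches(
    cerl_to_gnd: Dict[str, str],
    gnd_to_item: Dict[str, str],
    item_to_cerl: Dict[str, Set[str]],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split GND-based matches into (to_add, confirmed) lists of (QID, CERL ID)."""
    to_add, confirmed = [], []
    for cerl_id, gnd_id in sorted(cerl_to_gnd.items()):
        qid = gnd_to_item.get(gnd_id)
        if qid is None:
            continue  # this GND ID is not (yet) present in Wikidata
        if cerl_id in item_to_cerl.get(qid, set()):
            confirmed.append((qid, cerl_id))  # link already there: confirmed
        else:
            to_add.append((qid, cerl_id))  # link missing: candidate import
    return to_add, confirmed


def find_suspect_matches(
    cerl_to_gnd: Dict[str, str],
    gnd_to_item: Dict[str, str],
    item_to_cerl: Dict[str, Set[str]],
) -> List[Tuple[str, str, str]]:
    """Flag existing links whose GND points to a different item (assumed rule)."""
    suspects = []
    for qid in sorted(item_to_cerl):
        for cerl_id in sorted(item_to_cerl[qid]):
            gnd_id = cerl_to_gnd.get(cerl_id)
            target = gnd_to_item.get(gnd_id) if gnd_id else None
            if target is not None and target != qid:
                suspects.append((qid, cerl_id, target))
    return suspects


def quickstatements_lines(
    to_add: List[Tuple[str, str]], to_remove: List[Tuple[str, str]]
) -> List[str]:
    """Render additions and removals as tab-separated V1-style commands."""
    lines = [f'{qid}\t{CERL_PROPERTY}\t"{cid}"' for qid, cid in to_add]
    lines += [f'-{qid}\t{CERL_PROPERTY}\t"{cid}"' for qid, cid in to_remove]
    return lines


# Hypothetical sample data: none of these IDs refer to real records.
cerl_to_gnd = {"cnp00000001": "G1", "cnp00000002": "G2",
               "cnp00000003": "G3", "cnp00000004": "G9"}
gnd_to_item = {"G1": "Q101", "G2": "Q102", "G3": "Q103"}
item_to_cerl = {"Q101": {"cnp00000001"}, "Q102": set(), "Q104": {"cnp00000003"}}

to_add, confirmed = classify_matches(cerl_to_gnd, gnd_to_item, item_to_cerl)
suspects = find_suspect_matches(cerl_to_gnd, gnd_to_item, item_to_cerl)
commands = quickstatements_lines(to_add, [(q, c) for q, c, _ in suspects])
```

In practice the suspect list would be reviewed by hand before any removal, since a disagreement can equally point to a wrong GND link on either side. As a cross-check on the figures reported for the import, the 414,217 imported IDs minus the 245,798 confirmed matches gives 168,419, which is exactly the difference between 450,633 and 282,214.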
Wikidata is useful not only in terms of visibility and data enrichment, but also for improving the quality of preexisting authority data, through the creation of lists of inconsistencies and the possibility of checking them in cooperation with the Wikidata community, in order ultimately to improve the entire ecosystem of authority files on an international scale.

* * *

[1] VIAF – Virtual International Authority File; GND – Gemeinsame Normdatei; ISNI – International Standard Name Identifier; LCNAF – Library of Congress Name Authority File; IDREF – Identifiants et Référentiels; NKC – Online Catalogue of the National Library of the Czech Republic

https://cerlblog.wordpress.com/2026/06/01/improving-the-interconnection-between-wikidata-and-the-cerl-thesaurus/