Semantic control for the Cybersecurity domain: investigation on the representativeness of a domain-specific terminology referring to lexical variation
Date
2021-05-12
Publisher
Università della Calabria
Abstract
The underlying idea of this PhD research project is to develop a model that
guarantees the terminological coverage of a semantic resource, such as a thesaurus, and
its representativeness threshold with respect to semantic variation over time within
a highly specialized domain, such as Cybersecurity. By building an Italian thesaurus
for the Cybersecurity domain, this project aims to offer organizations
as complete a knowledge representation as possible of the field of Information and Communications
Technology (ICT) security. The development of an Italian
thesaurus for the Cybersecurity knowledge domain is part of the activities included
in the main project “Cybersecurity Observatory” run by the Institute of Informatics
and Telematics (IIT) of the National Research Council (CNR), located in Pisa (Italy).
The thesis describes the steps followed to construct the Italian Cybersecurity
thesaurus and to assess a multi-domain methodology for setting a semantic
representativeness threshold with reference to the qualitative richness of terms within a specialized
domain and to the variation over time of the information related to that domain. The
main phases described are: (1) a presentation of the principal reasons
for building a semantic tool, such as a thesaurus, as a means of semantic control for
a specific domain; (2) a description of the steps involved in corpus creation
and terminological extraction through specific Natural Language Processing
(NLP) tasks and the configuration of linguistic patterns within the software employed; (3)
the way a bilingual thesaurus and a bilingual ontology were built by creating
parallel and comparable corpora; (4) a presentation of a model that maps existing
English-language standards on Cybersecurity to all the head terms contained in the Italian source corpus,
using Python scripts, in order to evaluate which candidate terms should
be chosen for inclusion in the thesaurus; (5) a description of the work done in
migrating the terms and their relationships from the Italian thesaurus on Cybersecurity
to an ontology system; (6) the keyphrase extraction phase, performed on the source
documents with document-oriented algorithms such as Multipartite Rank or TopicRank,
which was carried out to obtain a targeted clustering of the domain and as an
aid in the process of semantic abstraction needed to better systematize the structure
of the thesaurus's main entry categories; (7) the exploration of new methodologies, i.e., distributional
semantics, term variation, pattern-based detection schemes or inference from
the Web Ontology Language (OWL) properties, to deduce the technical information
included in the source corpus with the goal of automatically generating the semantic
network of connections between the representative terms of the Cybersecurity domain
in a thesaurus system; (8) a future perspective, accompanied by evolving practical
examples, of creating an additional database to populate the Cybersecurity source corpus
with content from social media. Twitter is one of the preferred platforms
from which to retrieve information about the domain: this new information flow should
give the semantic resources set up for Cybersecurity knowledge organization an increased
level of terminological density to be analyzed in order to improve semantic
coverage.
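Phase (4) of the abstract, the mapping of English-language standards to the Italian head terms, can be illustrated with a minimal Python sketch. It is not the script used in the thesis: the function names, the assumption that each Italian head term already comes with an English equivalent, and the sample terms are all hypothetical, and exact matching after simple normalization stands in for whatever matching strategy the thesis actually adopts.

```python
from typing import Dict, Set, Tuple


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so surface variants compare equal."""
    return " ".join(term.lower().split())


def map_candidates_to_standard(
    candidates: Dict[str, str], standard_terms: Set[str]
) -> Tuple[Dict[str, str], float]:
    """Return candidates whose English equivalent occurs in the standard,
    together with the share of candidates that were matched."""
    standard = {normalize(t) for t in standard_terms}
    matched = {
        it_term: en_term
        for it_term, en_term in candidates.items()
        if normalize(en_term) in standard
    }
    coverage = len(matched) / len(candidates) if candidates else 0.0
    return matched, coverage


# Hypothetical data, for illustration only:
# Italian head term -> assumed English equivalent.
candidates = {
    "attacco informatico": "cyber attack",
    "controllo degli accessi": "Access  Control",
    "posta indesiderata": "junk mail",
}
# Hypothetical glossary entries taken from an English-language standard.
standard_terms = {"Cyber Attack", "access control", "firewall"}

matched, coverage = map_candidates_to_standard(candidates, standard_terms)
print(sorted(matched))     # Italian terms attested in the standard
print(f"{coverage:.2f}")   # share of candidates covered by the standard
```

In this sketch, a candidate term attested in a standard would be a natural choice for inclusion in the thesaurus, while unmatched candidates would be left for manual review.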
Description
Università della Calabria. Dipartimento di Ingegneria Informatica, Modellistica, Elettronica e Sistemistica. PhD program in Information and Communication Technologies. Cycle XXXIII
Keywords
Cybersecurity, Thesauri, Ontologies, Distributional semantics, Representativeness