Doctoral Theses (Tesi di Dottorato)

Permanent URI for this community: https://lisa.unical.it/handle/10955/10

  • Semantic control for the Cybersecurity domain: investigation on the representativeness of a domain-specific terminology referring to lexical variation
    (Università della Calabria, 2021-05-12) Lanza, Claudia; Guarasci, Roberto; Crupi, Felice
    The underlying idea of this PhD research project is to develop a model that guarantees the terminological coverage of a semantic resource, such as a thesaurus, and sets its representativeness threshold with respect to semantic variation over time within a highly specialized domain such as Cybersecurity. By building an Italian thesaurus for the Cybersecurity domain, the project aims to offer organizations as complete a knowledge representation as possible of Information and Communications Technology (ICT) security. The development of the thesaurus is part of the activities of the main project “Cybersecurity Observatory”, run by the Institute of Informatics and Telematics (IIT) of the National Research Council (CNR) in Pisa, Italy. The thesis describes the steps followed to construct the Italian Cybersecurity thesaurus and to assess a multi-domain methodology for fixing a semantic representativeness threshold, with reference to the qualitative richness of terms within a specialized domain and to how the information related to that domain varies over time.
The main phases described are: (1) a presentation of the principal reasons for building a semantic tool, such as a thesaurus, as a means of semantic control for a specific domain; (2) a description of the steps involved in creating the corpus and extracting terminology through specific Natural Language Processing (NLP) tasks and the configuration of linguistic patterns in the software employed; (3) the way a bilingual thesaurus and a bilingual ontology were built by creating parallel and comparable corpora; (4) a model for mapping existing English-language Cybersecurity standards to all the head terms in the Italian source corpus through Python scripts, in order to evaluate which candidate terms should be included in the thesaurus; (5) a description of the work done to migrate the terms and their relationships from the Italian Cybersecurity thesaurus to an ontology system; (6) keyphrase extraction from the source documents with document-oriented algorithms, i.e., MultipartiteRank or TopicRank, carried out to obtain a targeted clustering of the domain and as an aid to the semantic abstraction needed to better systematize the structure of the thesaurus's main entry categories; (7) the exploration of new methodologies, i.e., distributional semantics, term variation, pattern-based detection schemes, or inference from Web Ontology Language (OWL) properties, to deduce the technical information contained in the source corpus, with the goal of automatically generating the semantic network of connections between the representative terms of the Cybersecurity domain in a thesaurus system; (8) a future perspective, accompanied by practical examples, on creating an additional database to populate the Cybersecurity source corpus from social media.
Twitter is one of the preferred web portals from which to retrieve information about the domain: this new information flow should give the semantic resources set up for Cybersecurity knowledge organization a higher terminological density to analyze, thereby improving their semantic coverage.
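Phase (4) of the abstract above mentions Python scripts that map Italian head terms to English-language Cybersecurity standards. The thesis scripts are not reproduced in this record, so the following is only a minimal sketch of what such a mapping could look like. The function names, the translation table, and the sample terms are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Set


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so surface variants compare equal."""
    return " ".join(term.lower().split())


def map_candidates(
    candidates: List[str],
    translations: Dict[str, str],
    standard_terms: Set[str],
) -> Dict[str, bool]:
    """Report, for each Italian candidate term, whether its English
    equivalent appears in the reference standard vocabulary."""
    lookup = {normalize(k): v for k, v in translations.items()}
    reference = {normalize(t) for t in standard_terms}
    result: Dict[str, bool] = {}
    for term in candidates:
        english = lookup.get(normalize(term))
        result[term] = english is not None and normalize(english) in reference
    return result


def coverage(mapping: Dict[str, bool]) -> float:
    """Share of candidate terms that are matched by the standard."""
    if not mapping:
        return 0.0
    return sum(mapping.values()) / len(mapping)


# Hypothetical sample data, not taken from the thesis.
candidates = ["attacco informatico", "Firewall", "chiave pubblica", "posta indesiderata"]
translations = {
    "attacco informatico": "cyber attack",
    "firewall": "firewall",
    "chiave pubblica": "public key",
    "posta indesiderata": "spam",
}
standard_terms = {"Cyber Attack", "Firewall", "Public Key"}

mapping = map_candidates(candidates, translations, standard_terms)
print(mapping)
print(f"coverage: {coverage(mapping):.2f}")
```

In this sketch a candidate counts as covered only on an exact match after normalization. A real pipeline would likely also need lemmatization and handling of lexical variants, which is the kind of variation the thesis investigates.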
  • Ontology-driven information extraction
    (2017-07-20) Adrian, Weronika Teresa; Leone, Nicola; Manna, Marco
    Information Extraction (IE) consists in obtaining structured information from unstructured and semi-structured sources. Existing solutions use advanced methods from the fields of Natural Language Processing and Artificial Intelligence, but they usually aim at solving sub-problems of IE, such as entity recognition, relation extraction, or co-reference resolution. In practice, however, it is often necessary to build on the results of several tasks and combine them in an intelligent way. Moreover, Information Extraction now faces new challenges related to large-scale collections of documents in complex formats beyond plain text. An apparent limitation of existing work is the lack of a uniform representation of document analysis from multiple perspectives, such as semantic annotation of text, structural analysis of the document layout, and processing of the integrated knowledge. Recent proposals for ontology-based Information Extraction do not fully exploit the possibilities of ontologies, using them only as a reference model for a single extraction method, such as semantic annotation, or to define the target schema of the extraction process. In this thesis, we address the problem of Information Extraction from homogeneous collections of documents, i.e., sets of files that share some common properties with respect to content or layout. We observe that interleaving semantic and structural analysis can benefit the results of the IE process, and we propose an ontology-driven approach that integrates and extends existing solutions. The contributions of this thesis are both theoretical and practical. On the theoretical side, we propose a model and a process of Semantic Information Extraction that integrates techniques from semantic annotation of text, document layout analysis, object-oriented modeling, and rule-based reasoning.
We adapt existing solutions to enable their integration under a common ontological view and advance the state of the art in semantic annotation and document layout analysis. In particular, we propose a novel method for automatic lexicon generation for semantic annotators, and an original approach to layout analysis based on common label identification and structure recognition. We design and implement a framework named KnowRex that realizes the proposed methodology and integrates the solutions developed.
  • Ontologies and Semantic Interoperability in Distributed Systems
    (2014-03-05) Pirrò, Giuseppe; Talia, Domenico