Dipartimento di Ingegneria Informatica, Modellistica, Elettronica e Sistemistica - Doctoral Theses
Permanent URI for this collection: https://lisa.unical.it/handle/10955/31
This collection gathers the doctoral theses of the Dipartimento di Ingegneria Informatica, Modellistica, Elettronica e Sistemistica of the Università della Calabria.
Item: Ensemble of deep learning prediction models for data analytics (Università della Calabria, 2021-06-21)
Zicari, Paolo; Fortino, Giancarlo; Folino, Gianluigi
The abundance of available unstructured or raw text requires the automatic extraction of information for different tasks. One of the most relevant, Text Classification, extracts this information by assigning informative labels from a pre-defined set to raw texts. Deep Learning (DL) offers promising solutions to the automatic text classification problem. Despite the great potential of DL-based text classifiers, current solutions are exposed to a number of challenging issues that frequently occur when text categorization is used in real-life applications. First of all, a large amount of labelled data is usually necessary to train a deep model adequately, while labelling texts is time-consuming, expensive, and very often requires specific knowledge. Moreover, configuring the structure and hyper-parameters of a Deep Neural Network (DNN) architecture is a difficult task, which entails long and careful design and tuning activities to make the DNN perform well. In typical scenarios, classes are also often imbalanced. These issues entail a high risk of eventually obtaining a DNN-based classifier that overfits the training data and relies on non-general, biased and unreliable classification patterns. In addition, the black-box nature of a DNN model makes it hard to reason about which features of a data instance drove the model to its classification decision. The work in this thesis, starting from the general problem of text classification, focuses on some challenging aspects associated with using an ensemble of deep learning methods to classify raw texts.
In more detail, this work focuses on the analysis, exploration, study and testing of algorithms and learning models to be employed in novel Ensemble Deep Learning (EDL) techniques for classification and explanation tasks. It also investigates semi-supervised strategies based on pseudo-labelling for improving classifier prediction performance when labelled data are scarce. To this end, this thesis proposes a complete framework based on the paradigm of ensembles of deep learning algorithms. The proposed framework is designed as an instrument for exploring, validating and testing the proposed deep ensemble techniques in the context of real-life applications. It covers the entire classification process, including preprocessing, learning model building, explanation of the results, self-training for scarce labelled data, human-in-the-loop validation and model refinement. Even though the methods proposed in this work could be used in any field of interest, the problem of extracting information from raw text was specialised for two specific application contexts: automatic customer support ticket classification and fake news detection. The first application scenario deals with the need of the Customer Care Department of most companies to answer customer requests submitted as tickets through several common channels such as email, short text messages, social posts, etc. Ticket classification is necessary for automatic answer generation and for routing to the appropriate human operator. The need to limit the spread of misinformation, driven by the rapid growth of information dissemination and sharing on social media, has raised the issue of distinguishing true news from fake news, with the added challenge of processing long texts such as news articles. For this reason, the second scenario deals with the critical problem of discerning fake news from the vast amount of information circulating on the Web.
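The idea of self-training with pseudo-labels, as mentioned in this abstract, can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the keyword-count models are toy stand-ins for deep classifiers, and the names `train_keyword_model`, `train_ensemble`, `ensemble_vote` and `self_train`, as well as the 0.8 agreement threshold, are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[str, str]                  # (text, label)
Model = Callable[[str], Optional[str]]     # returns a label, or None to abstain


def train_keyword_model(data: Sequence[Example]) -> Model:
    """Toy stand-in for a deep classifier: counts which labels each word co-occurs with."""
    word_labels = defaultdict(Counter)
    for text, label in data:
        for word in text.lower().split():
            word_labels[word][label] += 1

    def predict(text: str) -> Optional[str]:
        scores = Counter()
        for word in text.lower().split():
            if word in word_labels:
                scores.update(word_labels[word])
        return scores.most_common(1)[0][0] if scores else None

    return predict


def train_ensemble(data: Sequence[Example], n_models: int = 3) -> List[Model]:
    """Train each model on the data with a different slice left out (hypothetical scheme)."""
    return [
        train_keyword_model([ex for i, ex in enumerate(data) if i % n_models != k])
        for k in range(n_models)
    ]


def ensemble_vote(models: Sequence[Model], text: str) -> Tuple[Optional[str], float]:
    """Majority vote; agreement is the share of all models backing the winning label."""
    votes = Counter(p for p in (m(text) for m in models) if p is not None)
    if not votes:
        return None, 0.0
    label, count = votes.most_common(1)[0]
    return label, count / len(models)


def self_train(labelled, unlabelled, threshold=0.8, max_rounds=5):
    """Move confidently predicted texts from the unlabelled pool to the training set."""
    labelled, pool = list(labelled), list(unlabelled)
    models = train_ensemble(labelled)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for text in pool:
            label, agreement = ensemble_vote(models, text)
            if label is not None and agreement >= threshold:
                confident.append((text, label))   # accept the pseudo-label
            else:
                remaining.append(text)
        if not confident:                         # nothing new to learn from
            break
        labelled.extend(confident)
        pool = remaining
        models = train_ensemble(labelled)         # retrain on the enlarged set
    return models, labelled, pool


if __name__ == "__main__":
    seed = [("refund my payment", "billing"), ("app crashes on login", "technical")]
    models, grown, left = self_train(seed, ["refund please", "login crashes"])
    print(grown, left)
```

Only texts on which the ensemble agrees strongly are pseudo-labelled; the rest stay in the unlabelled pool, which limits the risk of reinforcing wrong labels.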
In these research areas, the ensemble paradigm has been adopted only recently; thus, identifying the possible advantages of applying this technique is challenging. Experimental tests conducted on real data collected by two Customer Relationship Management (CRM) systems have proven the framework's effectiveness in different ticket categorisation tasks and the practical value of the associated explanations. In addition, experiments conducted on two fake news datasets have proven the effectiveness of the proposed semi-supervised, self-training, ensemble-based strategy for improving performance when only a few labelled data are available.

Item: Ensemble learning techniques for cyber security applications (2017-07-13)
Pisani, Francesco Sergio; Crupi, Felice; Folino, Gianluigi
Cyber security involves protecting information and systems from major cyber threats; frequently, high-level techniques such as data mining are used to fight cybercriminals efficiently, mitigate the effects of their actions, or prevent them. In particular, classification can be used efficiently in many cyber security applications, e.g. in intrusion detection systems, in the analysis of user behavior, in risk and attack analysis, etc. However, the complexity and diversity of modern systems have opened up a wide range of new issues that are difficult to address. In fact, security software has to deal with missing data, privacy limitations and heterogeneous sources. Therefore, it is very unlikely that a single classification algorithm will perform well on all types of data, especially in the presence of changes and under real-time and scalability constraints. To this end, this thesis proposes a framework based on the ensemble paradigm to cope with these problems. Ensemble learning is a paradigm in which multiple learners are trained for the same task by a learning algorithm, and their predictions are combined to deal with new, unseen instances.
The ensemble method helps to reduce the variance of the error, the bias, and the dependence on a single dataset; furthermore, it can be built incrementally and lends itself to distributed implementations. It is also particularly suitable for distributed intrusion detection, because it allows a network profile to be built by combining different classifiers that together provide complementary information. However, building the ensemble can be computationally expensive because, when new data arrive, the training phase must be restarted. For this reason, the framework uses Genetic Programming to evolve a function for combining the classifiers that compose the ensemble, an approach with some attractive characteristics. First, the models composing the ensemble can be trained on only a portion of the training set, and then they can be combined and used without any extra training phase. Moreover, the models can be specialized for a single class, and they can be designed to handle the difficult problems of unbalanced classes and missing data. In case of changes in the data, the function can be recomputed incrementally with moderate computational effort and, in a streaming environment, drift strategies can be used to update the models. In addition, all the phases of the algorithm are distributed and can exploit the advantages of running on parallel/distributed architectures to cope with real-time constraints. The framework is oriented and specialized towards cyber security applications. For this reason, the algorithm is designed to work with missing data, unbalanced classes, models specialized for particular tasks, and models working with streaming data. Two typical scenarios in the cyber security domain are provided, and experiments are conducted on artificial and real datasets to test the effectiveness of the approach. The first scenario deals with user behavior.
The actions taken by users could lead to data breaches, and the resulting damage can be very costly. The second scenario deals with intrusion detection systems. In this research area, the ensemble paradigm is a very new technique, and researchers still need to fully understand the advantages of this solution.
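The abstract above describes combining already-trained classifiers through a function that needs no further training. The sketch below illustrates that idea under strong simplifications: it replaces Genetic Programming with a plain random search over small expression trees, and the primitive set (`avg`, `max`, `min`, `median`), the tree encoding and all function names are assumptions for illustration, not the framework described in the thesis.

```python
import random
from statistics import mean, median
from typing import List, Sequence, Tuple, Union

# outputs[i][c] = probability that base classifier i assigns to class c.
Outputs = Sequence[Sequence[float]]
# A tree is a leaf (index of a base classifier) or (primitive name, [subtrees]).
Tree = Union[int, Tuple[str, list]]

# Non-trainable combiner primitives, applied class by class.
PRIMITIVES = {"avg": mean, "max": max, "min": min, "median": median}


def evaluate(tree: Tree, outputs: Outputs) -> List[float]:
    """Return one combined score per class for a single instance."""
    if isinstance(tree, int):
        return list(outputs[tree])
    name, children = tree
    child_scores = [evaluate(child, outputs) for child in children]
    return [PRIMITIVES[name](column) for column in zip(*child_scores)]


def random_tree(rng: random.Random, n_models: int, depth: int = 2) -> Tree:
    """Sample a random combiner expression of bounded depth."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randrange(n_models)
    name = rng.choice(sorted(PRIMITIVES))
    arity = rng.randint(2, 3)
    return (name, [random_tree(rng, n_models, depth - 1) for _ in range(arity)])


def accuracy(tree: Tree, dataset) -> float:
    """dataset: list of (outputs, true_class_index) pairs from a validation set."""
    hits = 0
    for outputs, truth in dataset:
        scores = evaluate(tree, outputs)
        hits += scores.index(max(scores)) == truth
    return hits / len(dataset)


def search_combiner(dataset, n_models: int, n_candidates: int = 200, seed: int = 0):
    """Random search (a simple stand-in for GP) for the best combiner on the data."""
    rng = random.Random(seed)
    candidates: List[Tree] = list(range(n_models))   # each base model on its own
    candidates += [random_tree(rng, n_models) for _ in range(n_candidates)]
    best = max(candidates, key=lambda tree: accuracy(tree, dataset))
    return best, accuracy(best, dataset)
```

Because the primitives have no trainable parameters, base models can be added or replaced and only the combiner search has to be rerun, which mirrors the motivation given in the abstract. A real GP system would additionally apply crossover, mutation and selection over generations.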