On the role of words in the network structure of texts: application to authorship attribution
- Source :
- Repositório Institucional da USP (Biblioteca Digital da Produção Intelectual), Universidade de São Paulo (USP), instacron:USP
- Publication Year :
- 2017
Abstract
- Well-established automatic analyses of texts mainly consider frequencies of linguistic units, e.g. letters, words, and bigrams. In a recent, alternative approach, medium and large-scale text structures were used in opposition to the belief that text structure is dominated by the language features. In this paper, we introduce a generalized similarity measure to compare texts which accounts for both the network structure of texts and the role of individual words in the networks. The similarity measure is used for authorship attribution of three collections of books, each composed of 8 authors and 10 books per author. High accuracy rates were obtained with typical values between 90% and 98.75%, much higher than with the traditional term frequency-inverse document frequency (tf-idf) approach for the same collections. These accuracies are also higher than those obtained solely with the topology of networks. We conclude that the different properties of specific words on the macroscopic scale structure of a whole text are as relevant as their frequency of appearance; conversely, considering the identity of nodes brings further knowledge about a piece of text represented as a network.
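For readers unfamiliar with the tf-idf baseline named in the abstract, the following is a minimal, generic sketch of tf-idf weighting followed by cosine similarity between documents. It is not the authors' implementation and not the network-based similarity measure the paper introduces. The function names (`tfidf_vectors`, `cosine`), the toy sentences, whitespace tokenization, and the specific weighting variant (relative term frequency times the natural log of inverse document frequency) are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus; a real study would use whole books.
docs = [
    "the whale swam under the ship".split(),
    "the whale chased the ship".split(),
    "the lawyer read the contract".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # shares "whale" and "ship"
sim_02 = cosine(vecs[0], vecs[2])  # shares only "the", whose idf is zero
```

In this sketch a term that occurs in every document receives zero weight, so the comparison depends only on how often distinctive terms appear. Word order and text structure play no role, which is the limitation the abstract contrasts with the network-based measure.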
- Subjects :
- Statistics and Probability
FOS: Computer and information sciences
Bigram
Similarity measure
computer.software_genre
01 natural sciences
010305 fluids & plasmas
Authorship attribution
0103 physical sciences
010306 general physics
Structure (mathematical logic)
Social and Information Networks (cs.SI)
Computer Science - Computation and Language
business.industry
Computer Science - Social and Information Networks
Complex network
Condensed Matter Physics
Term (time)
Word semantics (SEMÂNTICA DA PALAVRA)
Burstiness
Identity (object-oriented programming)
Artificial intelligence
business
Psychology
computer
Computation and Language (cs.CL)
Natural language processing
Details
- Language :
- English
- Database :
- OpenAIRE
- Journal :
- Repositório Institucional da USP (Biblioteca Digital da Produção Intelectual), Universidade de São Paulo (USP), instacron:USP
- Accession number :
- edsair.doi.dedup.....f32bb6134e5f2c7191fd10d1e32f6f61