6 results on '"Zaranka, Eimantas"'
Search Results
2. From Images to Smart Data: Digitization of Logistic Documents.
- Author
- Zaranka, Eimantas; Zdanavičiūtė, Monika; Krilavičius, Tomas
- Subjects
OBJECT recognition (Computer vision), OPTICAL character recognition, ENTERPRISE resource planning, DATA entry, TRANSPORTATION industry
- Abstract
According to estimates by the Transport Innovation Association, a single lorry driver carries on average about 50 sheets of paper consisting of only 15 CMR documents. In Lithuania alone, there are about 50,000 active trucks each month, resulting in about 192 tonnes of wasted paper annually. The manual entry of document data into enterprise resource planning (ERP) systems is not only time-consuming but also inefficient and error-prone. To address these issues, a framework for the digitisation of logistic documents, such as invoices, receipts and CMRs, is proposed that uses object detection, optical character recognition (OCR) and a semi-supervised finetuning pipeline. This study covers both the experimentation and implementation phases of the research. During the experimentation phase, multiple object detection models (SSD MobileNet, SSD ResNet-50, Faster-RCNN, EfficientDet-D4, and CenterNet HourGlass104) were evaluated, and OCR models (Tesseract, EasyOCR, KerasOCR, Kraken, Doctr, and Google OCR) were tested. Extensive evaluations showed that the combination of Faster-RCNN and Google OCR works best for document digitisation. The object detection model was trained using approximately 1000 images that were equally distributed among three classes of documents. The Faster-RCNN model achieved an average precision (AP) of 0.95 at an Intersection over Union (IoU) threshold of 0.5, 0.84 AP at an IoU of 0.75, and 0.71 AP over IoU thresholds ranging from 0.5 to 0.95, with an average recall (AR) of 0.75 within the same range. OCR performance was assessed manually due to the lack of annotations, with Google OCR providing the best results in the presence of minor inaccuracies in bounding box placement or noise within the bounding box. 
To further increase the accuracy of the object detection model, a semi-automated labelling process was introduced, in which a trained Faster-RCNN model generates initial bounding boxes and class labels on unseen data; these are later manually adjusted and used for further finetuning of the pre-trained Faster-RCNN model. The proposed system is an improvement in automating logistics document digitisation, reducing dependence on manual labour and increasing overall efficiency in the transportation industry. [ABSTRACT FROM AUTHOR]
- Published
- 2024
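The IoU thresholds quoted in the abstract above (0.5, 0.75, and the 0.5 to 0.95 range) decide when a predicted bounding box counts as a match for an annotated one. The following is a minimal sketch of that overlap test, not the authors' evaluation code; the function names `iou` and `is_true_positive` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this layout is an assumption.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction matches the ground truth when IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold
```

A stricter threshold such as 0.75 accepts fewer predictions as matches, which is consistent with the lower AP the abstract reports at that threshold.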
3. Application of Machine Learning Techniques for Lithuanian Enterprise Clustering.
- Author
- Zaranka, Eimantas; Kuizinienė, Dovilė; Krilavičius, Tomas
- Subjects
CLUSTERING algorithms, FEATURE selection, DATA scrubbing, PRIVATE companies, GOVERNMENT aid
- Abstract
The precise identification of enterprise activity codes is a crucial task that enables the rapid and effective establishment or renewal of databases covering both public and private companies, which in turn helps to make informed decisions about countries' economic tendencies. The research involves combining multi-source datasets, data cleaning, exploratory data analysis, retrieval of embeddings, feature selection, identification of the optimal number of clusters, data clustering, and post-clustering analysis. The gathered insights allow for informed decisions about taxes, needed state aid and competition analysis. In both the Republic of Lithuania and the European Union, the enterprise classification system operates under the Nomenclature of Economic Activities (NACE), which employs a six-digit framework. For instance, code 461900 indicates that the business conducts the sales of various goods that involve agents. The initial two digits represent overarching enterprise classifications, in this case, retail trade, while the final four digits delineate specific categorisations within the country's industries. This study aims to apply clustering methods to help identify the economic activities of enterprises using descriptions found in the "Company Description" section of the rekvizitai.lt website. The dataset consists of 28350 business descriptions. Two main themes were observed in the data: (1) the average description length is 14, excluding stop-words; (2) the most common activities in the Lithuanian economic sector are wholesale, retail, agriculture, and the service industry. In this study, 3 embedding methods (BERT, LaBSE and Word2Vec), 4 feature selection methods (PCA, UMAP, SVD, and autoencoders) and 8 clustering methods (K-means, GMM, agglomerative, mean shift, OPTICS, BIRCH, HDBSCAN, DEC) were used for experiments, with 195 models trained in total. 
Three main metrics, the silhouette score, the Davies-Bouldin score, and the Calinski-Harabasz index, are evaluated across all clustering algorithms, with the adjusted Rand index and mutual information evaluated for hard-clustering methods. The initial experiments showed that LaBSE and Word2Vec are the most promising methods for embedding retrieval, while PCA and UMAP are most suitable for dimensionality reduction. The elbow approach was employed in additional experiments to determine the ideal number of clusters. Although these experiments demonstrated that the data may be grouped into fewer clusters, the outcomes did not indicate a statistically significant improvement, and adhering to the original NACE space facilitates a more accurate assessment of the current economic landscape. Clustering results from the K-means, agglomerative, and mean shift methods showed good intra-clustering and slightly above average inter-clustering results. This research demonstrates that enterprise activity sectors can be categorised using Lithuanian descriptions and the K-means, agglomerative, or mean shift clustering algorithms. Future research will focus on hyperparameter optimisation of all three algorithms to improve inter-clustering and intra-clustering results. [ABSTRACT FROM AUTHOR]
- Published
- 2024
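The abstract above describes a six-digit activity code whose first two digits give the overarching classification and whose last four digits give the specific categorisation. A minimal sketch of that split follows; the function name `split_activity_code` and its validation rules are hypothetical and only illustrate the structure as the abstract states it.

```python
def split_activity_code(code: str) -> tuple[str, str]:
    """Split a six-digit activity code into (first two digits, last four digits).

    The 2 + 4 layout follows the abstract's description; the validation
    below is an illustrative assumption, not an official rule.
    """
    if len(code) != 6 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected a six-digit code, got {code!r}")
    return code[:2], code[2:]


# The example code quoted in the abstract.
broad, specific = split_activity_code("461900")
print(broad, specific)
```

Grouping records by the first element of the returned pair gives the coarse classes against which a clustering could be compared.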
4. Hate Speech Detection for Lithuanian Language.
- Author
- Songailaitė, Milita; Mandravickaitė, Justina; Rimkienė, Eglė; Petkevičius, Mindaugas; Zaranka, Eimantas; Krilavičius, Tomas
- Subjects
NATURAL language processing, INTERNET content, LITHUANIAN language, HATE speech, ARTIFICIAL intelligence, ONLINE comments
- Abstract
The rapid increase of online content, which is often coupled with the ease with which people can share their opinions, has contributed to a rise in social issues such as cyberbullying, insults, and hate speech. To mitigate these issues, some online platforms have implemented measures like disabling anonymous comments or completely removing the option to comment on articles that users previously had. Additionally, certain platforms employ human moderators to identify and remove hate speech. However, due to the huge volume of online interactions, manually moderating content requires substantial human resources. Advances in artificial intelligence, particularly in natural language processing (NLP), offer promising results in hate speech identification. Automated hate speech detection systems can facilitate content moderation by efficiently processing and managing large volumes of data. In this study, we present a comparative evaluation of hate speech detection solutions for the Lithuanian language. We used several deep learning models for hate speech detection: Multilingual BERT, LitLat BERT, Electra, open Llama2 for the Lithuanian language, RWKV, BiLSTM, LSTM, CNN and ChatGPT. We trained the Electra model ourselves from scratch on Lithuanian texts comprising more than 2.5 billion tokens. Multilingual BERT, LitLat BERT, Electra and RWKV were further fine-tuned to classify Lithuanian user-generated comments into three main classes: hate, offensive, and neutral speech. For comparison purposes, we also trained BiLSTM, LSTM and CNN models for the task. Open Llama2 for the Lithuanian language and ChatGPT were used without fine-tuning, and Open Llama2 for the Lithuanian language was then fine-tuned to get better results. To train or adapt the models to the hate speech detection task, we prepared an annotated dataset. It contains 27 357 user-generated comments (hate speech: 4220, offensive: 7821, neutral: 15 316). 
All models were evaluated with accuracy, precision, recall, and F1-score metrics. Our future plans include augmentation of our annotated dataset with additional data sources and hate topics as well as experiments in model bias, robustness and output explainability. [ABSTRACT FROM AUTHOR]
- Published
- 2024
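The abstract above names accuracy, precision, recall, and F1-score as the evaluation metrics for a three-class task (hate, offensive, neutral). As a minimal sketch of how such scores can be computed per class, assuming plain Python lists of labels: the function names and the toy labels below are illustrative only, and the abstract does not state how the authors averaged scores across classes.

```python
def accuracy(y_true, y_pred):
    """Fraction of comments whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} treating each label one-vs-rest."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy labels for illustration only; they are not taken from the dataset.
y_true = ["hate", "hate", "offensive", "neutral", "neutral", "neutral"]
y_pred = ["hate", "offensive", "offensive", "neutral", "neutral", "hate"]
scores = per_class_scores(y_true, y_pred, ["hate", "offensive", "neutral"])
```

Because the three classes in the annotated dataset are imbalanced, per-class scores can differ noticeably from overall accuracy.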
5. Order in Document Chaos: Logistics Documents Classification.
- Author
- Abramov, Danylo; Zaranka, Eimantas; Zdanavičiūtė, Monika; Šakinis, Nerijus; Krilavičius, Tomas
- Subjects
MACHINE learning, SUPPORT vector machines, K-nearest neighbor classification, RANDOM forest algorithms, PHYSICAL distribution of goods, DEEP learning
- Abstract
Arun Kumar Mishra wrote that every movement of goods from one point to the next must have documents attached. According to Statista, the logistics industry worldwide grows by 0.5 trillion dollars each year, meaning that as more transportation is required, more documents need to be processed. To efficiently manage huge volumes of documents and automate the decision-making process, a classification system is required. This research focuses on logistics document classification utilizing deep learning and machine learning algorithms. For this study, 50 GB of unlabeled data were provided, and the initial experiments were conducted using 5078 manually selected documents. The manually selected documents were assigned to 4 commonly used logistics document categories: CMRs, invoices, receipts, and others. The dataset was split into train and test sets, where 80% or 4058 of the documents were designated for training and 20% or 1014 of the documents for testing. Five main preprocessing steps were applied: conversion from PDF to JPG, resizing, deskewing, tint removal and noise removal. Two main methodologies were applied: neural networks and traditional machine learning classification techniques. Both approaches utilized backbone models pretrained on ImageNet. For neural networks we used EfficientNet80, VGG16, MobileNet, ResNet50, DenseNet and InceptionV3. The neural network with the ResNet50 backbone outperformed the other models, achieving 0.9582 accuracy, 0.9593 precision, 0.9582 recall and 0.9585 F1 score. The remaining models showed comparable results in the performance evaluation: EfficientNet80 achieved 0.9467, VGG16 0.9176, MobileNet 0.9307, DenseNet 0.9387 and InceptionV3 0.9387 F1 scores. In addition, traditional machine learning classifiers, including Support Vector Machines (SVM), Random Forest, K-Nearest Neighbors (KNN), and XGBoost (XGB), were trained using features extracted from the ResNet50 backbone. 
The best-performing machine learning model was the Support Vector Classifier, achieving 0.9471 accuracy, 0.9470 precision, 0.9471 recall and 0.9466 F1 score, while the XGBoost, Random Forest, and K-Nearest Neighbors classifiers achieved F1 scores of 0.9440, 0.9455, and 0.9281, respectively. The research showed that the most promising solution for logistics document classification is the ResNet50 model and that it could be implemented in logistics environments to automate document separation. Future research will focus on dataset expansion, utilizing the pretrained ResNet50 model to label the remaining unused documents, and on further fine-tuning the models to enhance the F1 score, minimizing the need for human intervention in document classification. [ABSTRACT FROM AUTHOR]
- Published
- 2024
6. Customer Churn Prediction in the Software as a Service Industry.
- Author
- Zaranka, Eimantas; Zhyhun, Bohdan; Songailaitė, Milita; Juozaitienė, Rūta; Krilavičius, Tomas
- Subjects
SERVICE industries, DIGITAL technology, ARTIFICIAL neural networks, MACHINE learning, DEEP learning, ARTIFICIAL intelligence
- Published
- 2023