Author: "Seljan, Sanja" / Database: OpenAIRE - Searchworks@Jio Institute Digital Library Search Results

1. Information Extraction from Security-Related Datasets

Author: Seljan, Sanja, Tolj, Nevenka, Dunđer, Ivan, and Skala, Karolj
Subjects: information extraction, machine learning, corpus analysis, security datasets, information security, information and communication sciences
Abstract: There are various approaches to executing security breaches which are nowadays massively occurring in electronic communication environments, and phishing attacks are one of the most applied ones. A vast majority of phishing attacks are initiated using electronic messages, which attackers utilize to direct users to harmful or fake websites, to infect computers or to obtain personal or sensitive data for malicious purposes. Consequently, it is necessary to identify phishing messages in order to provide suitable user protection. Research and numerous studies have included machine learning algorithms and techniques from the field of artificial intelligence which predominantly depend on language-specific datasets and characteristics of phishing messages, and which have demonstrated to be effective for extracting critical information and for data-driven decision making. However, phishing datasets exist mainly for the English language. The aim of this paper is to present an information extraction pipeline that encompasses phases, such as corpus pre-processing, generating predictions of phishing messages using selected machine learning algorithms, along with a basic analysis, confusion matrices and evaluation scores for Croatian phishing messages. This type of key information can be used for teaching in higher education, e.g. in security-related courses or subjects that deal with artificial intelligence, machine learning, big data analysis, computational linguistics etc. This is essential as it can provide deeper insights into phishing attack strategies and potential countermeasures.
Published: 2023

2. E-mails/SMS as Archival Materials: Topic Detection of Fraud Messages

Author: Seljan, Sanja, Tolj, Nevenka, and Dunđer, Ivan
Subjects: information, security, fraud, prevention, SMS, messages
Abstract: The aim of the research is to analyze malicious SMS messages that attempt to fraudulently obtain financial resources or personal data from users.
Published: 2023

3. Data Acquisition and Corpus Creation for Phishing Detection

Author: Dunđer, Ivan, Seljan, Sanja, Odak, Marko, and Skala, Karolj
Subjects: data acquisition, digital corpus creation, computational data analysis, natural language processing, phishing, information privacy, information security
Abstract: Detecting phishing attacks is not straightforward, since there are many obstacles that derive from language complexity and technical aspects. Studying phishing attacks and other related issues heavily relies on computer datasets, i.e. digital corpora that reflect these linguistic and technical intricacies. Diverse studies using phishing datasets have been performed, but mainly for the English language. Research for other languages is scarce, and especially for not widely spoken languages. For the Croatian language there is an evident lack of corpora that are essential for diverse analyses and for constructing models that are capable of recognizing phishing attacks and protecting users. These datasets are necessary for natural language processing and building machine learning workflows, where results largely depend on corpora that must be specifically crafted for this purpose. Therefore, creating high-quality domain-specific corpora is of great importance in the domain of information security. Such corpora can be employed for teaching purposes in various courses in higher education, and could be analyzed in numerous ways in order to understand the underlying principles of phishing attack strategies. The aim of this paper is to demonstrate the entire process of data acquisition and corpus creation for the phishing detection domain. In addition, an analysis of the corpus is presented with regard to different aspects, such as descriptive attributes, terminology characteristics, metadata and language.
Published: 2023

4. Impact of missing values on the performance of machine learning algorithms

Author: Radišić, Bojan, Seljan, Sanja, Dunđer, Ivan, Xhina, Endrit, and Hoxha, Klesti
Subjects: machine learning, neural network, missing data, confusion matrix, accuracy
Abstract: Machine learning (ML) can be used to analyze and predict student success outcome in order to avoid various problems and to plan future actions for helping students overcome difficulties during their study. This paper analyzes data from a digital system of 309 students who were enrolled in the Specialist Study in Trade Business at the Faculty of Tourism and Rural Development from 2010 to 2018. The paper explores the impact of four different data sets on the performance of ML algorithms. The first data set is with partially missing data on the length of study (around 7%), the second one uses arithmetic means in place of missing data, the third is based on median values, whereas the fourth uses the geometric mean instead. Four popular ML algorithms were considered: k-Nearest Neighbors (KNN), Naïve Bayes (NB), Random Forest (RF) and Probabilistic Neural Network (PNN). All of them are used for predicting student success based on achieved ECTS credit points. The aim of this paper is to compare and analyze the impact of missing values on the results of individual ML algorithms.
Published: 2023
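
The imputation variants described in the abstract of record 4 (arithmetic mean, median and geometric mean in place of missing values) can be sketched as follows. This is a minimal illustration only: the `impute` helper and the `study_years` values are hypothetical and are not taken from the paper, whose actual tooling is not stated in this record.

```python
from statistics import mean, median, geometric_mean  # geometric_mean needs Python 3.8+

def impute(values, strategy):
    """Replace None entries with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    statistic = {"mean": mean, "median": median, "geometric": geometric_mean}[strategy]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]

# Hypothetical lengths of study in years; None marks a missing value.
study_years = [3, 4, None, 5, 4, None, 8]

# One imputed data set per strategy, alongside the original with gaps.
variants = {s: impute(study_years, s) for s in ("mean", "median", "geometric")}
```

Each variant could then be passed to the same classifier so that differences in the evaluation scores can be attributed to the imputation choice.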

5. National Brand Identity: Pilot Study on Perception of Croatian Student Population

Author: Seljan, Sanja, Horvat, Sara, Starešinić, Berislava, and Pejić Bach, M.
Subjects: visual brand identity, Croatia, branding, nation, brand identity, reputation, structural equation modeling, information, ddc:330, structural equation modelling, M3
Abstract: A message sent to a specific market or an audience contains certain types of information that affect the audience. For this reason, brand identity, nowadays increasingly in a digital form, plays an important role. Each state wants to create a robust, attractive, and different brand identity that will set it apart from other states and thus augment its reputation. This research aimed to determine the elements of brand identity that respondents mostly associate with Croatia and to examine their attitudes towards the belief that the brand identity that influences emotions is essential in the creation of national visual identity, as well as their attitudes towards the assumption that the brand identity of Croatia should be liked first by the inhabitants of Croatia, and only then by foreign tourists. Two research propositions were tested using the structural equation modeling, measuring the relationship between the emotional and formal elements of brand identity with the attitudes towards the emotions concerning the brand identity and the relevance of the brand identity to the country residents. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Published: 2021

6. An overview of machine learning algorithms for detecting phishing attacks on electronic messaging services

Author: Kovač, Antonio, Dunđer, Ivan, and Seljan, Sanja
Subjects: information security, information privacy, machine learning, phishing, attack detection, spam, electronic messages
Abstract: Phishing attacks have become today one of the most common security breaches performed on different communication channels. Their goal is to direct users to malicious websites or to infect a user’s computer as a means to acquire personal or sensitive data for later misuse. Phishing is often the first step in the process of cybercrime, and in order to be able to recognize potential attacks and adequately protect users, it is necessary to understand the underlying principles of attack strategies. Therefore, applying machine learning for training a system that would recognize phishing messages would be essential for increasing the level of security from cyberattacks. The aim of this paper is to give an overview of machine learning techniques used for the detection of phishing (and spam) e-mails, focusing mainly on regression and classification algorithms. In addition to the mentioned techniques, an analysis of datasets that are used for training of systems for detecting phishing attacks (and spam) is presented with regard to their size, language and accuracy scores. Different types of phishing messages are analyzed as well in this paper.
Published: 2022

7. Professional and popular terminology in official guide for insulin pump: original English vs. Croatian version

Author: Baretić, Maja, Seljan, Sanja, Bralić Lang, Valerija, and Pejić Bach, Mirjana
Subjects: health literacy, diabetes, insulin pump manual
Abstract: Introduction: In using online medical manuals, patients who do not speak English natively are in effect “virtual immigrants”, confronted with a problem of low health literacy. Our aim was to investigate the use of key diabetes terms when the official guide for an insulin pump was translated into a less widely spoken language such as Croatian. Data set and methods: An official guide for an insulin pump was translated by the manufacturer into Croatian and compared to its original English version for the ratio of professional and popular key diabetes terms. Analyzed professional terminology included: diabetes (Cro. dijabetes); blood glucose (glukoza u krvi); retinopathy (retinopatija); nephropathy (nefropatija); neuropathy (neuropatija); cardiovascular (kardiovaskularni); hyperglycemia (hiperglikemija) and hypoglycemia (hipoglikemija). Analyzed popular terminology included: sugar disorders (poremećaj šećera u krvi); blood sugar (šećer u krvi); eye damage (oštećenje oka); kidney damage (oštećenje bubrega); nerve damage (oštećenje živaca); blood system damage (oštećenje krvnih žila/srčanožilnog sustava); high levels of sugar (povišena vrijednosti šećera) and low level of sugar (niska vrijednost šećera). Results: The ratio of professional to popular terminology was 5.8 in the English vs. 2.92 in the Croatian version, resulting in more professional terminology in the English manual. Particular terms used more frequently in English were diabetes, blood glucose, hyperglycemia and hypoglycemia. Conclusion: Medical manuals often advocate the use of professional terms in addressing the patient; in the case of this guide for an insulin pump, this could result in less information accessibility. Most insulin pump users have good knowledge of particular terminology, but highly professional language can represent a barrier to understanding information.
Published: 2022

8. Big Data Analysis for Health Information Access: towards Hospital Websites as Interactive Communication Channel

Author: Seljan, Sanja
Subjects: Big Data, health, analytics, interactive tools, Global Digital Health Index
Abstract: The aim of the analysis is to present publicly available health information sources that can be used to gain insight into health information needs. Hospital websites are a widely used access point for information search or for communication. Information on hospital websites should be presented in a clear, up-to-date and understandable way, enabling easy information access and interactive communication. The research presents Big Data analysis performed on three types of data. The research is a result of the institutional project (grant 11-931-1072).
Published: 2022

9. Natural Language Processing (NLP) for Cyber Security: detection of malicious (phishing) e-mails

Author: Seljan, Sanja
Subjects: NLP, AI, cyber security, phishing, malicious, detection, e-mail
Abstract: The aim of the research is the implementation of Natural Language Processing (NLP) techniques for the detection of malicious (phishing) e-mails in order to augment cyber security. Fake and malicious messages, i.e. phishing messages, often represent the first step in the cyber-crime process. The research presents the use of NLP techniques to detect malicious and fake e-mails that aim to steal information, data or financial resources.
Published: 2022

10. Informacijska i komunikacijska tehnologija u komunikaciji liječnik-pacijent

Author: Seljan, Sanja, Dunđer, Ivan, Katić, M., and Vučak, J.
Subjects: information and communication technology, communication, doctor-patient, quality, satisfaction
Abstract: The first part presents the growth of research interest in the application of ICT in healthcare, in the Web of Science database, from 2000 to the present. In the second part of the research, an analysis was conducted of the application of various technologies (platforms, applications and services) in doctor-patient communication: health portals, e-mail communication services, virtual platforms, telemedicine services for human interaction (especially via video), mobile applications (mHealth) and social networks, which are increasingly used for interaction between healthcare users and service providers. Solutions based on artificial intelligence are presented.
Published: 2022

11. Dijagnoza debljine – kako reći istinu, a ne uvrijediti: presječno istraživanje korištenja nazivlja u zdravstvenom i svakodnevnom okruženju

Author: Baretić, Maja, Seljan, Sanja, Matovinović, Martina, Ranilović, Darjan, and Sedak, Kristijan
Subjects: obesity, communication: terminology, Descriptors: OBESITY – psychology, TERMINOLOGY, COMMUNICATION, PHYSICIAN-PATIENT RELATIONS, MOTIVATION, ATTITUDES, SURVEYS AND QUESTIONNAIRES, CROSS-SECTIONAL STUDIES
Abstract: Aim of the study: When diagnosing obesity, communication between physician and patient is extremely important, because the terminology itself is sometimes perceived as offensive. The aim of this study was to find terminology that would emphasize the seriousness of the medical condition while avoiding unnecessary discomfort caused by naming the diagnosis. Participants and methods: The study included 500 participants (patients, physicians, nutritionists, and students of medicine and of the social sciences and humanities) who answered a questionnaire evaluating four terms that describe obesity. Their attitudes towards the term they found acceptable/offensive in the healthcare setting and in the everyday setting were analyzed. Results: Data were collected using the online tool SurveyMonkey®. Participants considered the terms ‘pretio’ (obese) and ‘adipozan’ (adipose) acceptable in both the healthcare and the everyday setting, which applies mostly to physicians and medical students. The term ‘debeo’ (fat) was considered unacceptable by all groups, most of all by nutritionists. The term ‘bucko’ (chubby) was perceived as the most offensive; interestingly, patients accepted it best. Conclusion: Terminology should be used with caution when diagnosing obesity. In view of the results of this study, it is recommended to use the terms ‘pretio’ and ‘adipozan’, to avoid colloquial terms, and to use the term ‘debeo’ with caution.
Published: 2022

12. Procjena učinkovitosti ekosustava otvorenih podataka: pristup bolnicama u hrvatskim gradovima

Author: Seljan, Sanja, Viličić, Marina, Nevistić, Zvonimir, Dedić, Luka, Grubišić, Marina, Cibilić, Iva, van Loenen, Bastiaan, Welle Donker, Frederika, and Alexopoulos, Charalampos
Subjects: ecosystem, assessment framework, Croatia, hospitals, access, cities, open data
Abstract: Assessing the performance of open data ecosystems has attracted significant interest from researchers worldwide (see Welle Donker & Van Loenen, 2016; Dodds and Newman, 2015; Capgemini Consulting, 2015; Public Sector Information (PSI) Scoreboard; Independent Reporting Mechanism (IRM), 2015; World Wide Web Foundation, 2015; Open Knowledge International (OKI), 2014). These existing assessment methods focus on the status of ecosystem components (whether an open data policy exists, whether the data comply with standards, whether they are available free of charge, etc.). By starting from application/user needs, this project implements a different approach to assessment: open data as a means of assessing (physical) access to hospital buildings.
Published: 2021

13. Photogrammetric 3D Scanning of Physical Objects: Tools and Workflow

Author: Reljić, Ivan, Dunđer, Ivan, and Seljan, Sanja
Subjects: comparative analysis, information technology, lcsh:T, 3D scanning, photogrammetry, lcsh:L, virtual 3D models, lcsh:Technology, lcsh:Education
Abstract: Ease of access to and low cost of hardware and software for 3D scanning have made 3D technologies increasingly popular in recent research. One possible 3D scanning approach is photogrammetry, which relies on a data set consisting of photographs of the same physical object. This paper evaluates different 3D models generated from the same input data set by specialised software packages for photogrammetry. The main attributes of the 3D models are examined in comparative analyses and their differences highlighted. Furthermore, visual qualitative inspections are performed on the models and the results are compared.
Published: 2019

14. Artificial Intelligence (AI) in promotion of Medical Tourism

Author: Seljan, Sanja
Subjects: ComputingMethodologies_PATTERNRECOGNITION, GeneralLiterature_INTRODUCTORYANDSURVEY, AI, chatbots, conversational agent, tourism, human-computer communication, GeneralLiterature_MISCELLANEOUS
Abstract: The presentation gives an overview of different AI technologies and their implementation in the domain of tourism.
Published: 2021

15. Procjena učinka ekosustava otvorenih podataka: Pristup bolnicama u hrvatskim gradovima

Author: Seljan, Sanja, Viličić, Marina, Nevistić, Zvonimir, Dedić, Luka, Grubišić, Marina, Cibilić, Iva, van Loenen, Bastiaan, Welle Donker, Frederika, Alexopoulos, Charalampos, Vujić, Miroslav, and Šalamon, Dragica
Subjects: assessment framework, hospitals, access, Croatia, cities, open data, ecosystem
Abstract: The focus of the research is to detect available data sets in order to assess access to hospitals in three Croatian towns (Zagreb, Rijeka, Split). The research is based on existing open data, found on web portals, containing information on hospitals, emergency services, coronavirus screening clinics, public transport and stations (bus lines, tramways, bike routes), roads and car parking. The research method includes qualitative research on publicly available datasets. The methodology includes identification of portals, search for datasets, acquisition, data inspection, qualitative and quantitative analysis based on the created model, and reporting. The created framework, entitled “Hospital Access Framework”, consists of five domains: Supply side, Demand side (open data user skills), Demand side (end-user capabilities), Legal aspects and privacy, Impact. For each domain, several KPIs (Key Performance Indicators) are suggested, using SMART (Specific Measurable Achievable Relevant Time-bound) criteria. Qualitative analysis is performed on the suggested ecosystem framework, created for this research, while quantitative analysis includes Yes/Partly/No answers for the suggested KPIs.
Published: 2021

16. What Makes Machine-Translated Poetry Look Bad? A Human Error Classification Analysis

Author: Dunđer, Ivan, Seljan, Sanja, Pavlovski, Marko, Vrček, Neven, Pergler, Elisabeth, and Grd, Petra
Subjects: ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, machine translation, quality evaluation, human analysis, error classification, natural language processing, information and communication sciences
Abstract: Human translation of literary works is a profession with a very long tradition, as ancient as the appearance of writing. However, with the ever-growing improvements of technologies in the field of machine translation, the possibilities of adopting machine-translated literary texts are in constant progress. Nevertheless, human evaluation of machine-translated text is a necessity. The aim of this paper is to examine what makes machine translations appear bad in terms of error types that occur within such translations. Therefore, a human error classification analysis on a machine-translated text corpus in the domain of poetry is conducted. This study could be used for improving the methodology for machine translation evaluation of literary texts.
Published: 2021

17. Quality Assurance of Terminology in Business Communication: Analysis by the Herfindahl-Hirschman Index (HHI)

Author: Seljan, Sanja, Gašpar, Angelina, Milković, Marin, Seljan, Sanja, Pejić Bach, Mirjana, Peković, Sanja, and Perovic, Djurdjica
Subjects: terminology, consistency, HHI index, GeneralLiterature_REFERENCE(e.g.,dictionaries,encyclopedias,glossaries)
Abstract: The focus of the research is on the evaluation of terminology consistency in the domain of business communication.
Published: 2019

18. Artificial Intelligence (AI) in Action: Implementation of Chatbot (Conversational Agent) in the Domain of Tourism

Author: Seljan, Sanja, Ljubić, Matej, Milković, Marin, Seljan, Sanja, Pejić Bach, Mirjana, Peković, Sanja, and Perovic, Djurdjica
Subjects: AI, chatbots, conversational agent, tourism, human-computer communication
Abstract: In the research, a chatbot is implemented for a selected domain of tourism.
Published: 2019

19. e-Government in European countries: gender and ageing digital divide

Author: Seljan, Sanja, Miloloža, Ivan, Pejić Bach, Mirjana, and Barković, Dražen... [et al]
Subjects: e-government, ICT, digital divide, digital transformation, cluster analysis
Abstract: The 4th industrial revolution has brought not only new opportunities but also new demands and challenges, including in government communication to citizens and businesses. Transformations do not only concern ICT usage differences but also social, economic, educational and business changes, and bring risks known as the “digital divide”. One of the services that are integrated into the new digital environment, and not yet in full deployment, is e-government. Governments are faced with challenges of digital transformation, aiming to achieve goals such as enhanced public services, enhanced administration, enhanced social value, cost reductions or empowerment of citizens, bringing them closer to public policy decision-making. Although e-government aims to provide access to individuals, businesses, interest groups or their governments, various types of “digital divide” present barriers to full implementation of e-government services. In order to investigate the utilization of e-government, we used data from the Eurostat database for 32 countries for 2018. The research is performed by the use of K-means cluster analysis and the non-parametric Kruskal-Wallis test. The aim of the research is two-fold: (i) to perform cluster analysis of e-government practice across European countries and identify the position of Croatia, and (ii) to detect possible reasons for a digital divide in relation to age, gender and economic development. K-means cluster analysis revealed three clusters as optimum. Results of the Kruskal-Wallis test confirm a statistically significant difference among clusters at the 1% significance level regarding GDP, global gender inequality index and political empowerment. Results reveal that the digital divide is one of the major challenges preventing citizens from using e-government services. According to existing studies, governments should perform business process re-engineering and transform the way of work and communication, but also enable ICT training programs in order to decrease the digital divide and to exploit the full potential of e-services.
Published: 2020

20. Using digitised documents as a source for machine learning

Author: Stančić, Hrvoje, Seljan, Sanja, and Dunđer, Ivan
Subjects: digitisation, OCR, ML
Abstract: Information and communication technologies have changed the way of communication, users’ habits and expectations in all types of settings (business, education, entertainment, service production, tourism, manufacture, …), but have also set new requirements for institutions. One of them is online access to digitised authentic materials, which creates added value for users and institutions. The process includes selection, digitisation, format processing, annotation and information retrieval conducted on the collection of materials from the Archives of the Faculty of Humanities and Social Sciences, University of Zagreb, consisting of minutes from the Faculty Council meetings from 1874 until the digital era. Results confirm the presented process as one possible solution, but it requires a planned strategy and an interdisciplinary approach.
Published: 2020

21. Using AI and crowdsourcing in digitisation and processing of archival materials

Author: Stančić, Hrvoje, Seljan, Sanja, Ivanjko, Tomislav, and Malysheva, E. P.
Subjects: archival materials, digitisation, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, crowdsourcing, artificial intelligence, SOCIAL SCIENCES. Information and Communication Sciences, named entity recognition, archives
Abstract: Digitisation of archival materials is a lengthy process in which the act of digitisation is usually the shortest and the most straightforward step. It is the processing of the digitised materials that takes a lot of time and effort. It is valuable to have digitised materials available online, even if only to avoid having to travel to the physical location of the originals, but the tendency to extract additional value from the materials is increasing. To that end, the archival materials can be processed by new techniques such as artificial intelligence (AI) and crowdsourcing. The authors use the example of digitisation of the minutes from the Faculty (of Humanities and Social Sciences) Council meetings dating from 1913 to 1996 and the application of AI to improve the OCR results and NER for semantic enrichment, as well as the example of digitisation of the food rationing cards used between 1941 and 1945 in Zagreb, Croatia, and the application of AI and crowdsourcing to data extraction, analysis and visualisation.
Published: 2019

22. Integration of MT technology into AV web-based environment

Author: Seljan, Sanja and Zaghouani, Wajdi
Subjects: machine translation (MT), audio-video (AV) technology, domain terminology, automatic metrics, human evaluation
Abstract: The main goal of the paper is to research the possibilities of MT integration into an AV web-based environment and to evaluate improvements achieved by adding new digital resources (specific terminology) related to the domain. The research is performed for the English-Arabic and English-Croatian language pairs in a specific domain. Evaluation is performed by native speakers (Arabic and Croatian) after two iterations of using an online MT tool (without and with an additional terminology resource). Evaluation consists of two steps: human evaluation and automatic evaluation.
Published: 2019

23. Improving Information Transfer through Quality Assurance

Author: Seljan, Sanja and DG TRAD
Subjects: ComputingMilieux_THECOMPUTINGPROFESSION, ComputingMilieux_COMPUTERSANDEDUCATION, information transfer, quality assurance, CAT technology
Abstract: The lecture deals with analysis of information transfer improvement through the Quality Assurance using CAT technology.
Published: 2018

24. Quality Control (QC) of Terminology in Computer-Assisted Translation Process

Author: Seljan, Sanja
Subjects: quality control, quality management, computer-assisted translation, CAT, error detection, quality control (QC), quality management (QM), CAT tools, terminology, information transfer, GeneralLiterature_REFERENCE(e.g.,dictionaries,encyclopedias,glossaries)
Abstract: Quality control (QC) represents the basic part of the quality management (QM) translation process. QC can be performed by human check or by use of quality assurance tools, in order to detect various types of error categories. In the translation process, special attention is given to the QC of terminology. Detection of terminology (in)consistency in the legal domain can considerably influence readability, understanding and information transfer, which is important when dealing with texts created for a large number of users and for the specific user. The QC process will be analysed at the textual level and as part of the translation workflow process, which calls for specific CAT tools. Practical examples will be presented, related to terminology extraction, error detection, terminology validation, integration and terminology evaluation. Examples will be given for English, French, and other languages.
Published: 2018

25. Towards educating and motivating the crowd – a crowdsourcing platform for harvesting the fruits of NLP students' labour

Author: Jaworski, Rafał, Seljan, Sanja, Dunđer, Ivan, Vetulani, Zygmunt, and Paroubek, Patrick
Subjects: crowdsourcing, gamification, NLP, machine translation resources, parallel corpora, sentence alignment, less-resourced languages, Croatian, TMrepository
Abstract: This paper presents an idea to bring crowdsourcing to a higher level, for the purpose of acquiring valuable machine translation and natural language processing resources. In the proposed scenario, students are educated in order to improve the quality and effectiveness of their natural language processing (NLP) related work. Their motivation is ensured by introducing an element of gamification: a ranking is kept, where the best contributing users are decorated with medals. The ranking is available at all times to all users and is always up-to-date, hence the effects of the contributions are immediately visible to the users. This scenario was applied to a group of students enrolled in a Natural Language Processing course, who were presented with a task of collecting parallel corpora for less-resourced language pairs, in this case Croatian-English and English-Croatian. The whole experiment was supervised with the help of a custom-made open-source system named TMrepository, developed and maintained by the authors of this paper.
Published: 2017

26. Information Transfer through Machine Translation in Migrant Environment

Author: Seljan, Sanja and Kučiš, Vlasta
Subjects: translation technology, migrants, benefits, drawbacks
Abstract: The presentation discusses the role of machine translation technology in meeting migrants’ needs, and the benefits and drawbacks of technology use in the communication process.
Published: 2017

27. Suradnja između jezične industrije i akademske zajednice: Intervju s prof. Sanjom Seljan

Author: Seljan, Sanja
Subjects: cooperation, language industry, academic community
Abstract: The interview presents the possibilities for cooperation between the language industry and the academic community, the development of skills and required knowledge, internships, projects and job readiness.
Published: 2017

28. Extracting terminology by language independent methods

Author: Seljan, Sanja, Dunđer, Ivan, Stančić, Hrvoje, Zybatow, Lew N., Stauder, Andy, and Ustaszewski, Michael
Subjects: automatic terminology extraction, statistical tools, language independent methods, evaluation, indexing, terminology, digital archiving
Abstract: The paper presents an automatic terminology extraction process from monolingual text, performed by three language-independent tools that rely on different principles. The research is conducted in the domain of pharmaceutical documentation. After the digitization process and use of OCR techniques, the automatic extraction process is performed. Results are compared with a reference terminology list created by the responsible institution and evaluated by measures of recall, precision and F-measure. Results are discussed in the frame of possible integration into the process of digital archiving.
Published: 2017
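
The evaluation named in the abstract of record 28 (recall, precision and F-measure against a reference terminology list) can be sketched as follows. This is a hedged illustration: the `evaluate_terms` helper and the sample terms are hypothetical, exact string matching is an assumption, and the balanced F1 variant is assumed because the record does not state which F-measure weighting was used.

```python
def evaluate_terms(extracted, reference):
    """Return (precision, recall, F1) for extracted terms against a reference list."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)  # candidates found in the reference list
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical tool output and reference list (not data from the paper).
tool_output = ["aspirin", "tablet", "dose", "the"]
reference_list = ["aspirin", "dose", "capsule"]
precision, recall, f1 = evaluate_terms(tool_output, reference_list)
```

Running the same function on the output of each tool gives directly comparable scores, since all tools are judged against the same reference list.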

29. Hospital websites as a road to transparency: Case study of transition countries

Author: Pejić Bach, Mirjana, Seljan, Sanja, Zoroja, Jovana, Buljan, Ante, Cafuta, Brigitta, Kovač, T., and Cingula, M.
Subjects: public communication, hospital web page, member of EU
Abstract: The main goal of the paper is to determine to what extent hospitals in selected transition countries currently use Web pages to increase their transparency and communication with the public. Three countries were selected for investigation: Bosnia and Herzegovina, Croatia, and Slovenia. The paper therefore also examines to what extent country-specific factors influence the usage of Web pages among hospitals, since the sample covers a candidate EU member, a recent EU member and an established EU member. Exploratory research was conducted on a sample of 300 hospitals from the 3 transition countries. To take into account the specific information that hospitals communicate to the public, a research instrument already developed for similar research was used.
Published: 2017

30. Kakve su vrste podataka potrebne i zašto? [What Types of Data Are Needed and Why?]

Author: Seljan, Sanja
Subjects: data type, resources, language, technology
Abstract: The lecture discusses the types of data needed to build digital language resources.
Published: 2016

31. Consistency of Translated Terminology Measured by the Herfindahl-Hirschman Index (HHI)

Author: Gašpar, Angelina and Seljan, Sanja
Subjects: Herfindahl-Hirschman Index (HHI), terminology consistency, extraction, Croatian, English, translation, legislation
Abstract: This paper presents research on consistency of translated terminology conducted on three types of legal domain subcorpora, dating from different periods: Croatian-English parallel corpus (1991-2009), English and Croatian versions of the Code of Canon Law translated from Latin (1983), English and Croatian versions of the EU legislation (2013). After the process of terminology extraction, validation of term candidates was performed, followed by an evaluation. Terminological consistency was measured by the Herfindahl-Hirschman Index (HHI). Extracted terminology was compared at the contrastive level and verified in online terminology resources (IATE and EuroVoc). A diachronic analysis of terminological consistency, also calculated by HHI, was performed on documents until and after 2006, the year the online translation style guide was published. At the end conclusions are given and further research is suggested.
Published: 2016

32. Automatic Quality Evaluation of Machine-Translated Output in Sociological-Philosophical-Spiritual Domain

Author: Seljan, Sanja, Dunđer, Ivan, Rocha, Álvaro, Martins, Arnaldo, Paiva Dias, Gonçalo, Reis, Luís P., and Pérez Cota, Manuel
Subjects: automatic quality evaluation, machine translation, BLEU, NIST, METEOR, GTM, English-Croatian, Croatian-English, sociological-philosophical-spiritual domain
Abstract: Automatic quality evaluation of machine translation systems has become an important issue in the field of natural language processing, due to raised interest and needs of industry and everyday users. Development of online machine translation systems is also important for less-resourced languages, as they enable basic information transfer and communication. Although the quality of free online automatic translation systems is not perfect, it is important to assure acceptable quality. As human evaluation is time-consuming, expensive and subjective, automatic quality evaluation metrics try to approach and approximate human evaluation as much as possible. In this paper, several automatic quality metrics will be utilised, in order to assess the quality of specific machine translated text. Namely, the research is performed on sociological-philosophical-spiritual domain, resulting from the digitisation process of a scientific publication written in Croatian and English. The quality evaluation results are discussed and further analysis is proposed.
Published: 2015

33. Interdisciplinary Education of Technology in Translation

Author: Seljan, Sanja, Pešorda, Barbara, and Stojanac, Mara
Subjects: language technologies, technology, translation, European Union, Croatia, Faculty of Humanities and Social Sciences, Information and Communication sciences, ComputerApplications_GENERAL, ComputingMilieux_COMPUTERSANDEDUCATION, InformationSystems_MISCELLANEOUS
Abstract: Interdisciplinary Education of Technology in Translation
Published: 2015

34. Machine Translation and Automatic Evaluation of English/Russian-Croatian

Author: Seljan, Sanja, Dunđer, Ivan, Zakharov, V. P., Mitrofanova, O. A., and Khokhlova, M. V.
Subjects: Automatic evaluation, machine translation, English-Croatian, Russian-Croatian, public machine translation service
Abstract: In this research, a specific data set was machine translated by two publicly available machine translation services, Google Translate and Yandex.Translate. Machine translations were performed for two language pairs: English-Croatian and Russian-Croatian. Afterwards, automatic quality evaluation of the machine translated data set was carried out. Several automatic metrics were used: BLEU, NIST, METEOR and GTM, in order to evaluate machine translations relating to the domain of city description, for each language pair and for each machine translation service.
Published: 2015

35. Radionica 'ParGram meeting 2015' & INESS workshop ['ParGram meeting 2015' Workshop & INESS workshop]

Author: Seljan, Sanja
Subjects: LFG, meeting
Abstract: LFG, meeting
Published: 2015

36. Budućnost prevoditeljskog posla (Hrvatska gospodarska komora) [The Future of the Translation Profession (Croatian Chamber of Economy)]

Author: Seljan, Sanja
Subjects: technology, translation, knowledge, skills
Abstract: The panel discusses the possibilities, limitations and challenges of applying technology in professional translation.
Published: 2015

37. Koncept automatske klasifikacije registraturnoga i arhivskoga gradiva [Concept of Automatic Classification of Records and Archival Material]

Author: Dunđer, Ivan, Seljan, Sanja, Stančić, Hrvoje, and Babić, Silvija
Subjects: automatic classification, natural language processing, statistical methods, digitisation, archival material
Abstract: Electronic document and records management systems (EDRMS), which are most often parts of a more comprehensive enterprise content management system (ECMS), capture documents and records that are created in digital form as well as those that have been digitised. While born-digital records are relatively easy to describe at the time of their creation and to supply with all the necessary metadata, problems arise with records that enter the system through digitisation. When a large volume of material is involved, and the documents are heterogeneous and have no unique or recurring features, it is not easy to determine what kind of document is at hand, to classify it correctly and to assign metadata to it. The authors analyse and present possible solutions from the field of statistically based language technologies and investigate their potential application to the (semi-)automatic classification of records and archival material. The paper explains the basic premises of individual methods, the possibilities of automatic text extraction, methods of statistical processing, and the groundwork for (semi-)automatic classification. The authors present the results of testing the applied methods on concrete archival material and draw conclusions about possible future research directions.
Published: 2015

38. Primjena informacijske tehnologije u prevođenju specijaliziranih tekstova [Application of Information Technology in the Translation of Specialised Texts]

Author: Seljan, Sanja
Subjects: ICT, needs, computer-assisted translation
Abstract: Application of technology in the translation of specialised types of texts: possibilities of application, limitations and evaluation
Published: 2015

39. 1st Croatian Translation Forum, Zagreb, 5.11.2015

Author: Seljan, Sanja
Subjects: translation, technology
Abstract: translation, technology
Published: 2015

40. Evaluation of the Statistical Machine Translation Service for Croatian-English

Author: Brkić, Marija, Vičić, Tomislav, Seljan, Sanja, Stančić, Hrvoje, Seljan, Sanja, Bawden, David, Lasić-Lazić, Jadranka, and Slavić, Aida
Subjects: SMT (statistical machine translation), online, Google Translate, MT, Croatian-English, manual evaluation, fluency, adequacy, χ2-test
Abstract: Much thought has been given in an endeavour to formalize the translation process. As a result, various approaches to MT (machine translation) were taken. With the exception of statistical translation, all approaches require cooperation between language and computer science experts. Most of the models use various hybrid approaches. The statistical translation approach is completely language independent if we disregard the fact that it requires a huge parallel corpus that needs to be split into sentences and words. This paper compares and discusses state-of-the-art statistical machine translation (SMT) models and evaluation methods. Results of the statistically based Google Translate tool for Croatian-English translations are presented and a multilevel analysis is given. Three different types of texts are manually evaluated and results are analysed by the χ2-test.
Published: 2009

41. Using Translation Memory to Speed up Translation Process

Author: Brkić, Marija, Seljan, Sanja, Bašić Mikulić, Božena, Stančić, Hrvoje, Seljan, Sanja, Bawden, David, Lasić-Lazić, Jadranka, and Slavić, Aida
Subjects: Translation Memory (TM), Déjà Vu, Computer-Assisted Translation (CAT), language pair, translation unit (TU), translation speed
Abstract: Translation process is one aspect of human creativity. Due to globalization, EU accession negotiations, and the need for information exchange, the amount of translation work increases on a daily basis. The translation process is hindered by the fact that the languages involved differ culturally, stylistically, syntactically and lexically. This paper explores the benefits and limitations of TMs (translation memories). TMs are not used for replacing humans in the translation process, but rather for enhancing the human translation process. In this paper, a detailed analysis of Atril’s Déjà Vu X system is presented, along with its time-saving implications, which are based on the reuse of previously stored segments. Excerpts from three different digital camera user manuals are translated from English into Croatian. Evaluation is performed by measuring the time difference between human and TM-based translation speeds in preparation, translation, and revision phases, and with regard to six different parameters.
Published: 2009

42. Information retrieval and terminology extraction in online resources for patients with diabetes

Author: Seljan, Sanja, Baretić, Maja, and Kučiš, Vlasta
Subjects: Self Care, Internet, Croatia, Slovenia, Diabetes Mellitus, Humans, Information Storage and Retrieval, health literacy, terminology, information extraction, diabetes mellitus type 1, documentation online, language barriers, GeneralLiterature_REFERENCE(e.g.,dictionaries,encyclopedias,glossaries)
Abstract: There are various online resources (in foreign and/or native language) created for patients with diabetic problems who need information on self-education of basic diabetic knowledge, on self-care activities regarding importance of dietetic food, medications, physical exercises and on self-management of insulin pumps. The research is divided into three interrelated parts aiming to detect the role of terminology in online resources: i) comparison of professional and popular terminology use in English and Croatian manuals and in Croatian online texts; ii) semi-automatic statistically-based extraction of terminology from English and Croatian manuals and online Croatian texts, and evaluation of extracted terminology by comparison with three types of reference sets using measures of recall, precision and f-measure; iii) comparison and evaluation of extracted terminology from the English manual using statistical and hybrid approaches compared with three types of reference sets using measures of recall, precision and f-measure. Extracted terminology candidates are compared with three reference lists: one created by professional medical person, list of highly professional vocabulary and list created by academic non-medical persons, made as intersection of 15 lists. Online texts and manuals contain popular and professional terminology in different proportions: online texts 1:71, English manual 1:4.5, Croatian manual 1:7, all in favour of professional terminology. When comparing results of terminology extraction based on statistical approach, higher scores are obtained for the measure of recall, especially for the lists created by doctor specialist involved in diabetes education and by non-medical persons. Reference list created by diabetologist has almost perfect recall for Croatian web pages, while the reference list suggested by non-medical person corresponds more to terminology used in manuals, especially Croatian version. The list of highly specialized vocabulary contained in MeSH is not included in manuals. When comparing two approaches, higher scores for about 30% are obtained for the hybrid approach based on statistical and language methods. Use of automatic and semi-automatic methods in terminology extraction could contribute to better information retrieval as one aspect of health literacy.
Published: 2014

43. Cross-disciplinary integration of CAT and MT tools into curriculum

Author: Seljan, Sanja
Subjects: ComputingMilieux_THECOMPUTINGPROFESSION, ComputingMilieux_COMPUTERSANDEDUCATION, interdisciplinary education, computer-assisted translation (CAT), Machine Translation (MT), information sciences, languages
Abstract: The paper deals with the educational state of the art, which requires not only highly specialized professionals but also a wide spectrum of interdisciplinary knowledge and skills, especially in the domain of computer-assisted translation (CAT) and machine translation (MT).
Published: 2014

44. From Digitisation to Terminology Base

Author: Seljan, Sanja
Subjects: extraction, rule-based, statistical, evaluation, GeneralLiterature_REFERENCE(e.g.,dictionaries,encyclopedias,glossaries)
Abstract: The presentation discusses rule-based and statistical approaches to terminology extraction and the evaluation process.
Published: 2014

45. Impact of summarizing and translation technology in online information transfer

Author: Seljan, Sanja, Pešorda, Barbara, Stojanac, Mara, and Zgrabljić Rotar, Nada
Subjects: information transfer quality, text summarization, online translation tools, information understanding, media, reader’s perception
Abstract: In recent years media and online translation technology have emerged as a point of interest occurring in information access, information literacy, in education, business, etc. Aiming to enable multilingual information transfer, communication, understanding and dialogue, use of this type of technology still requires awareness about technological possibilities which can influence the reader’s perception and text understanding. Translation technology along with summarizing technology has opened new possibilities and perspectives providing quick and easy translation into another language, requiring at the same time a critical attitude in information analysis. The main purpose of this research is to present the impact of text summarization and online machine translation tools on information transfer. In the summarizing process of condensing a source document into its shorter version, the aim is to preserve the information content and the original text meaning. Summarized text can be machine-translated by use of freely available online translation technology. In the research the role of both technologies is analysed.
Published: 2014

46. Towards Integrated Translation Technology / From Digitisation to Terminology Base

Author: Seljan, Sanja
Subjects: ComputingMilieux_COMPUTERSANDEDUCATION, integrated translation technology, terminology management, machine translation, computer-assisted translation, GeneralLiterature_REFERENCE(e.g.,dictionaries,encyclopedias,glossaries)
Abstract: Two lectures are presented related to integrated translation technology and terminology management.
Published: 2014

47. IN 29 'Prevođenje uz pomoć prijevodne memorije – TRADOS 2014' [Translation Using Translation Memory: TRADOS 2014]

Author: Seljan, Sanja
Subjects: Trados 2014, computer-assisted translation (CAT), translation memories
Abstract: Working with the Trados 2014 tool
Published: 2014

48. QTLaunchPad Workshop - EAMT2014

Author: Seljan, Sanja
Subjects: evaluation metrics, MT
Abstract: Workshop on evaluation metrics. Conference EAMT.
Published: 2014

49. Pseudo-lemmatization in Croatian-English SMT

Author: Brkić, Marija, Matetić, Maja, Seljan, Sanja, Hunjak, T., Lovrenčić, S., and Tomičić, I.
Subjects: phrase-based statistical machine translation, pseudolemmatization, Croatian-English
Abstract: One of the first difficulties in conducting a thorough analysis of statistical machine translation involving Croatian as a morphologically rich and resource poor language is the lack of quality language resources. This paper presents results of two standard fourteen feature Croatian-English phrase-based statistical machine translation systems. Prior to building the second system a partial pseudo-lemmatization of the Croatian parts of training and test sets is made in an attempt to simplify the translation process. Besides automatic evaluation, a manual evaluation is conducted in order to gain insight into the nature of the translation differences achieved between the two systems.
Published: 2014

50. Interoperability of Standards and Models in Official Statistics

Author: Poljičak, Martina, Stančić, Hrvoje, Seljan, Sanja, Hunjak, Tihomir, Lovrenčić, Sandra, and Tomičić, Igor
Subjects: GeneralLiterature_INTRODUCTORYANDSURVEY, Statistical Metadata, Generic Systems, Official Statistics' Standards and Models, Interoperability
Abstract: At the Croatian Central Bureau of Statistics (CBS), an Integrated Statistical Information System (ISIS) is built with the Croatian Statistical Metadata Repository (CROMETA) at its core. ISIS is a multilingual platform intended for a generic approach to various statistical surveys, using standardized processes for producing standardized statistical outputs. An analysis of the development designs, tools and underlying standards used in projects for the development of ISIS will be presented. The authors will also present the current state of usage of ISIS and analyze the possibility of aligning CBS’ ISIS with currently emerging standards in official statistics.
Published: 2014

260 results on '"Seljan, Sanja"'
