68 results on '"Ari Z Klein"'
Search Results
2. Methods and Annotated Data Sets Used to Predict the Gender and Age of Twitter Users: Scoping Review
- Author
- Karen O'Connor, Su Golder, Davy Weissenbacher, Ari Z Klein, Arjun Magge, and Graciela Gonzalez-Hernandez
- Subjects
- Computer applications to medicine. Medical informatics, R858-859.7, Public aspects of medicine, RA1-1270
- Abstract
Background Patient health data collected from a variety of nontraditional resources, commonly referred to as real-world data, can be a key information source for health and social science research. Social media platforms, such as Twitter (Twitter, Inc), offer vast amounts of real-world data. An important aspect of incorporating social media data in scientific research is identifying the demographic characteristics of the users who posted those data. Age and gender are considered key demographics for assessing the representativeness of the sample and enable researchers to study subgroups and disparities effectively. However, deciphering the age and gender of social media users poses challenges. Objective This scoping review aims to summarize the existing literature on the prediction of the age and gender of Twitter users and provide an overview of the methods used. Methods We searched 15 electronic databases and carried out reference checking to identify relevant studies that met our inclusion criteria: studies that predicted the age or gender of Twitter users using computational methods. The screening process was performed independently by 2 researchers to ensure the accuracy and reliability of the included studies. Results Of the initial 684 studies retrieved, 74 (10.8%) studies met our inclusion criteria. Among these 74 studies, 42 (57%) focused on predicting gender, 8 (11%) focused on predicting age, and 24 (32%) predicted a combination of both age and gender. Gender prediction was predominantly approached as a binary classification task, with the reported performance of the methods ranging from 0.58 to 0.96 F1-score or 0.51 to 0.97 accuracy. Age prediction approaches varied in terms of classification groups, with a higher range of reported performance, ranging from 0.31 to 0.94 F1-score or 0.43 to 0.86 accuracy. The heterogeneous nature of the studies and the reporting of dissimilar performance metrics made it challenging to quantitatively synthesize results and draw definitive conclusions. Conclusions Our review found that although automated methods for predicting the age and gender of Twitter users have evolved to incorporate techniques such as deep neural networks, a significant proportion of the attempts rely on traditional machine learning methods, suggesting that there is potential to improve the performance of these tasks by using more advanced methods. Gender prediction has generally achieved a higher reported performance than age prediction. However, the lack of standardized reporting of performance metrics or standard annotated corpora to evaluate the methods used hinders any meaningful comparison of the approaches. Potential biases stemming from the collection and labeling of data used in the studies were identified as a problem, emphasizing the need for careful consideration and mitigation of biases in future studies. This scoping review provides valuable insights into the methods used for predicting the age and gender of Twitter users, along with the challenges and considerations associated with these methods.
- Published
- 2024
3. Automatically Identifying Self-Reports of COVID-19 Diagnosis on Twitter: An Annotated Data Set, Deep Neural Network Classifiers, and a Large-Scale Cohort
- Author
- Ari Z Klein, Shriya Kunatharaju, Karen O'Connor, and Graciela Gonzalez-Hernandez
- Subjects
- Computer applications to medicine. Medical informatics, R858-859.7, Public aspects of medicine, RA1-1270
- Published
- 2023
4. Pregex: Rule-Based Detection and Extraction of Twitter Data in Pregnancy
- Author
- Ari Z Klein, Shriya Kunatharaju, Karen O'Connor, and Graciela Gonzalez-Hernandez
- Subjects
- Computer applications to medicine. Medical informatics, R858-859.7, Public aspects of medicine, RA1-1270
- Published
- 2023
5. Automatically Identifying Twitter Users for Interventions to Support Dementia Family Caregivers: Annotated Data Set and Benchmark Classification Models
- Author
- Ari Z Klein, Arjun Magge, Karen O'Connor, and Graciela Gonzalez-Hernandez
- Subjects
- Geriatrics, RC952-954.6
- Abstract
Background More than 6 million people in the United States have Alzheimer disease and related dementias, receiving help from more than 11 million family or other informal caregivers. A range of traditional interventions has been developed to support family caregivers; however, most of them have not been implemented in practice and remain largely inaccessible. While recent studies have shown that family caregivers of people with dementia use Twitter to discuss their experiences, methods have not been developed to enable the use of Twitter for interventions. Objective The objective of this study is to develop an annotated data set and benchmark classification models for automatically identifying a cohort of Twitter users who have a family member with dementia. Methods Between May 4 and May 20, 2021, we collected 10,733 tweets, posted by 8846 users, that mention a dementia-related keyword, a linguistic marker that potentially indicates a diagnosis, and a select familial relationship. Three annotators annotated 1 random tweet per user to distinguish those that indicate having a family member with dementia from those that do not. Interannotator agreement was 0.82 (Fleiss kappa). We used the annotated tweets to train and evaluate support vector machine and deep neural network classifiers. To assess the scalability of our approach, we then deployed automatic classification on unlabeled tweets that were continuously collected between May 4, 2021, and March 9, 2022. Results A deep neural network classifier based on a BERT (bidirectional encoder representations from transformers) model pretrained on tweets achieved the highest F1-score of 0.962 (precision=0.946 and recall=0.979) for the class of tweets indicating that the user has a family member with dementia. The classifier detected 128,838 tweets that indicate having a family member with dementia, posted by 74,290 users between May 4, 2021, and March 9, 2022 (approximately 7500 users per month). Conclusions Our annotated data set can be used to automatically identify Twitter users who have a family member with dementia, enabling the use of Twitter on a large scale to not only explore family caregivers’ experiences but also directly target interventions at these users.
- Published
- 2022
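Note on the reported metrics: the F1-score is the harmonic mean of precision and recall, so the figure in the abstract above can be checked from the precision and recall it reports. The following is a minimal sketch of that arithmetic only; the function name `f1_score` is our own illustrative choice and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract for the BERT-based classifier.
reported_precision = 0.946
reported_recall = 0.979

f1 = f1_score(reported_precision, reported_recall)
print(f"{f1:.3f}")  # 0.962, matching the reported F1-score
```

The recomputed value agrees with the abstract to three decimal places.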
6. Using Twitter Data for Cohort Studies of Drug Safety in Pregnancy: Proof-of-concept With β-Blockers
- Author
- Ari Z Klein, Karen O'Connor, Lisa D Levine, and Graciela Gonzalez-Hernandez
- Subjects
- Medicine
- Abstract
Background Despite the fact that medication is taken during more than 90% of pregnancies, the fetal risk for most medications is unknown, and the majority of medications have no data regarding safety in pregnancy. Objective Using β-blockers as a proof-of-concept, the primary objective of this study was to assess the utility of Twitter data for a cohort study design, in particular whether we could identify (1) Twitter users who have posted tweets reporting that they took medication during pregnancy and (2) their associated pregnancy outcomes. Methods We searched for mentions of β-blockers in 2.75 billion tweets posted by 415,690 users who announced their pregnancy on Twitter. We manually reviewed the matching tweets to first determine if the user actually took the β-blocker mentioned in the tweet. Then, to help determine if the β-blocker was taken during pregnancy, we used the time stamp of the tweet reporting intake and drew upon an automated natural language processing (NLP) tool that estimates the date of the user’s prenatal time period. For users who posted tweets indicating that they took or may have taken the β-blocker during pregnancy, we drew upon additional NLP tools to help identify tweets that report their pregnancy outcomes. Adverse pregnancy outcomes included miscarriage, stillbirth, birth defects, preterm birth (
- Published
- 2022
7. A chronological and geographical analysis of personal reports of COVID-19 on Twitter from the UK
- Author
- Su Golder, Ari Z Klein, Arjun Magge, Karen O’Connor, Haitao Cai, Davy Weissenbacher, and Graciela Gonzalez-Hernandez
- Subjects
- Computer applications to medicine. Medical informatics, R858-859.7
- Abstract
Objective Given the uncertainty about the trends and extent of the rapidly evolving COVID-19 outbreak, and the lack of extensive testing in the United Kingdom, our understanding of COVID-19 transmission is limited. We proposed to use Twitter to identify personal reports of COVID-19 and to assess whether these data can serve as a source to help us understand and model the transmission and trajectory of COVID-19. Methods We used a natural language processing and machine learning framework. We collected tweets (excluding retweets) from the Twitter Streaming API that indicate that the user or a member of the user's household had been exposed to COVID-19. The tweets were required to be geo-tagged or have profile location metadata in the UK. Results We identified a high level of agreement between personal reports from Twitter and lab-confirmed cases by geographical region in the UK. Temporal analysis indicated that personal reports from Twitter appear up to 2 weeks before UK government lab-confirmed cases are recorded. Conclusions Analysis of tweets may indicate trends in COVID-19 in the UK and provide signals of geographical locations where resources may need to be targeted or where regional policies may need to be put in place to further limit the spread of COVID-19. It may also help inform policy makers of the restrictions in lockdown that are most effective or ineffective.
- Published
- 2022
8. Toward Using Twitter for PrEP-Related Interventions: An Automated Natural Language Processing Pipeline for Identifying Gay or Bisexual Men in the United States
- Author
- Ari Z Klein, Steven Meanley, Karen O'Connor, José A Bauermeister, and Graciela Gonzalez-Hernandez
- Subjects
- Public aspects of medicine, RA1-1270
- Abstract
Background Pre-exposure prophylaxis (PrEP) is highly effective at preventing the acquisition of HIV. There is a substantial gap, however, between the number of people in the United States who have indications for PrEP and the number of them who are prescribed PrEP. Although Twitter content has been analyzed as a source of PrEP-related data (eg, barriers), methods have not been developed to enable the use of Twitter as a platform for implementing PrEP-related interventions. Objective Men who have sex with men (MSM) are the population most affected by HIV in the United States. Therefore, the objectives of this study were to (1) develop an automated natural language processing (NLP) pipeline for identifying men in the United States who have reported on Twitter that they are gay, bisexual, or MSM and (2) assess the extent to which they demographically represent MSM in the United States with new HIV diagnoses. Methods Between September 2020 and January 2021, we used the Twitter Streaming Application Programming Interface (API) to collect more than 3 million tweets containing keywords that men may include in posts reporting that they are gay, bisexual, or MSM. We deployed handwritten, high-precision regular expressions (designed to filter out noise and identify actual self-reports) on the tweets and their user profile metadata. We identified 10,043 unique users geolocated in the United States and drew upon a validated NLP tool to automatically identify their ages. Results By manually distinguishing true- and false-positive self-reports in the tweets or profiles of 1000 (10%) of the 10,043 users identified by our automated pipeline, we established that our pipeline has a precision of 0.85. Among the 8756 users for which a US state–level geolocation was detected, 5096 (58.2%) were in the 10 states with the highest numbers of new HIV diagnoses. Among the 6240 users for which a county-level geolocation was detected, 4252 (68.1%) were in counties or states considered priority jurisdictions by the Ending the HIV Epidemic initiative. Furthermore, the age distribution of the users reflected that of MSM in the United States with new HIV diagnoses. Conclusions Our automated NLP pipeline can be used to identify MSM in the United States who may be at risk of acquiring HIV, laying the groundwork for using Twitter on a large scale to directly target PrEP-related interventions at this population.
- Published
- 2022
9. Toward Using Twitter Data to Monitor COVID-19 Vaccine Safety in Pregnancy: Proof-of-Concept Study of Cohort Identification
- Author
- Ari Z Klein, Karen O'Connor, and Graciela Gonzalez-Hernandez
- Subjects
- Medicine
- Abstract
Background COVID-19 during pregnancy is associated with an increased risk of maternal death, intensive care unit admission, and preterm birth; however, many people who are pregnant refuse to receive COVID-19 vaccination because of a lack of safety data. Objective The objective of this preliminary study was to assess whether Twitter data could be used to identify a cohort for epidemiologic studies of COVID-19 vaccination in pregnancy. Specifically, we examined whether it is possible to identify users who have reported (1) that they received COVID-19 vaccination during pregnancy or the periconception period, and (2) their pregnancy outcomes. Methods We developed regular expressions to search for reports of COVID-19 vaccination in a large collection of tweets posted through the beginning of July 2021 by users who have announced their pregnancy on Twitter. To help determine if users were vaccinated during pregnancy, we drew upon a natural language processing (NLP) tool that estimates the timeframe of the prenatal period. For users who posted tweets with a timestamp indicating they were vaccinated during pregnancy, we drew upon additional NLP tools to help identify tweets that reported their pregnancy outcomes. Results We manually verified the content of tweets detected automatically, identifying 150 users who reported on Twitter that they received at least one dose of COVID-19 vaccination during pregnancy or the periconception period. We manually verified at least one reported outcome for 45 of the 60 (75%) completed pregnancies. Conclusions Given the limited availability of data on COVID-19 vaccine safety in pregnancy, Twitter can be a complementary resource for potentially increasing the acceptance of COVID-19 vaccination in pregnant populations. The results of this preliminary study justify the development of scalable methods to identify a larger cohort for epidemiologic studies.
- Published
- 2022
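Note on the method: the abstract above describes searching tweets with regular expressions for reports of COVID-19 vaccination, but the record does not give the patterns themselves. The sketch below only illustrates the general technique. The pattern `VACCINATION_REPORT`, the helper `reports_vaccination`, and the example tweets are all hypothetical and should not be read as the authors' actual expressions or data.

```python
import re

# Hypothetical pattern, NOT the authors' regular expression: it matches
# first-person statements such as "I got my second COVID vaccine".
VACCINATION_REPORT = re.compile(
    r"\b(?:i|we)\s+(?:got|received|had)\s+(?:my|our|the)\s+"
    r"(?:(?:first|second|1st|2nd)\s+)?"
    r"covid(?:-19)?\s+(?:vaccine|vaccination|shot|jab)\b",
    re.IGNORECASE,
)


def reports_vaccination(tweet: str) -> bool:
    """Return True if the tweet text matches the illustrative pattern."""
    return VACCINATION_REPORT.search(tweet) is not None


# Invented example texts for illustration only.
examples = [
    "I got my second COVID vaccine today at 30 weeks pregnant",
    "The COVID-19 vaccine rollout has been slow in my state",
]
for text in examples:
    print(reports_vaccination(text), "-", text)
```

A pattern this narrow favors precision over recall, which is consistent with the abstract's description of manually verifying the automatically detected tweets.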
10. ReportAGE: Automatically extracting the exact age of Twitter users based on self-reports in tweets.
- Author
- Ari Z Klein, Arjun Magge, and Graciela Gonzalez-Hernandez
- Subjects
- Medicine, Science
- Abstract
Advancing the utility of social media data for research applications requires methods for automatically detecting demographic information about social media study populations, including users' age. The objective of this study was to develop and evaluate a method that automatically identifies the exact age of users based on self-reports in their tweets. Our end-to-end automatic natural language processing (NLP) pipeline, ReportAGE, includes query patterns to retrieve tweets that potentially mention an age, a classifier to distinguish retrieved tweets that self-report the user's exact age ("age" tweets) and those that do not ("no age" tweets), and rule-based extraction to identify the age. To develop and evaluate ReportAGE, we manually annotated 11,000 tweets that matched the query patterns. Based on 1000 tweets that were annotated by all five annotators, inter-annotator agreement (Fleiss' kappa) was 0.80 for distinguishing "age" and "no age" tweets, and 0.95 for identifying the exact age among the "age" tweets on which the annotators agreed. A deep neural network classifier, based on a RoBERTa-Large pretrained transformer model, achieved the highest F1-score of 0.914 (precision = 0.905, recall = 0.942) for the "age" class. When the age extraction was evaluated using the classifier's predictions, it achieved an F1-score of 0.855 (precision = 0.805, recall = 0.914) for the "age" class. When it was evaluated directly on the held-out test set, it achieved an F1-score of 0.931 (precision = 0.873, recall = 0.998) for the "age" class. We deployed ReportAGE on a collection of more than 1.2 billion tweets, posted by 245,927 users, and predicted ages for 132,637 (54%) of them. Scaling the detection of exact age to this large number of users can advance the utility of social media data for research applications that do not align with the predefined age groupings of extant binary or multi-class classification approaches.
- Published
- 2022
11. Overview of the 8th Social Media Mining for Health Applications (#SMM4H) shared tasks at the AMIA 2023 Annual Symposium.
- Author
- Ari Z. Klein, Juan M. Banda, Yuting Guo, Ana Lucía Schmidt, Dongfang Xu, Ivan Flores Amaro, Raul Rodriguez-Esteban, Abeed Sarker, and Graciela Gonzalez-Hernandez
- Published
- 2024
12. Overview of the Seventh Social Media Mining for Health Applications (#SMM4H) Shared Tasks at COLING 2022.
- Author
- Davy Weissenbacher, Juan M. Banda, Vera Davydova, Darryl Estrada-Zavala, Luis Gascó Sánchez, Yao Ge, Yuting Guo, Ari Z. Klein, Martin Krallinger, Mathias Leddin, Arjun Magge, Raul Rodriguez-Esteban, Abeed Sarker, Ana Lucía Schmidt, Elena Tutubalina, and Graciela Gonzalez-Hernandez
- Published
- 2022
13. Overview of the Sixth Social Media Mining for Health Applications (#SMM4H) Shared Tasks at NAACL 2021.
- Author
- Arjun Magge, Ari Z. Klein, Antonio Miranda-Escalada, Mohammed Ali Al-Garadi, Ilseyar Alimova, Zulfat Miftahutdinov, Eulàlia Farré, Salvador Lima-López, Ivan Flores, Karen O'Connor, Davy Weissenbacher, Elena Tutubalina, Abeed Sarker, Juan M. Banda, Martin Krallinger, and Graciela Gonzalez-Hernandez
- Published
- 2021
14. Overview of the Fifth Social Media Mining for Health Applications (#SMM4H) Shared Tasks at COLING 2020.
- Author
- Ari Z. Klein, Ilseyar Alimova, Ivan Flores, Arjun Magge, Zulfat Miftahutdinov, Anne-Lyse Minard, Karen O'Connor, Abeed Sarker, Elena Tutubalina, Davy Weissenbacher, and Graciela Gonzalez-Hernandez
- Published
- 2020
15. Active neural networks to detect mentions of changes to medication treatment in social media.
- Author
- Davy Weissenbacher, Suyu Ge, Ari Z. Klein, Karen O'Connor, Robert Gross, Sean Hennessy, and Graciela Gonzalez-Hernandez
- Published
- 2021
16. A Rule-based Approach to Determining Pregnancy Timeframe from Contextual Social Media Postings.
- Author
- Masoud Rouhizadeh, Arjun Magge, Ari Z. Klein, Abeed Sarker, and Graciela Gonzalez
- Published
- 2018
17. Deep neural networks ensemble for detecting medication mentions in tweets.
- Author
- Davy Weissenbacher, Abeed Sarker, Ari Z. Klein, Karen O'Connor, Arjun Magge, and Graciela Gonzalez-Hernandez
- Published
- 2019
18. ReportAGE: Automatically extracting the exact age of Twitter users based on self-reports in tweets.
- Author
- Ari Z. Klein, Arjun Magge, and Graciela Gonzalez-Hernandez
- Published
- 2021
19. Detecting Personal Medication Intake in Twitter: An Annotated Corpus and Baseline Classification System.
- Author
- Ari Z. Klein, Abeed Sarker, Masoud Rouhizadeh, Karen O'Connor, and Graciela Gonzalez
- Published
- 2017
20. Social media mining for birth defects research: A rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter.
- Author
- Ari Z. Klein, Abeed Sarker, Haitao Cai, Davy Weissenbacher, and Graciela Gonzalez-Hernandez
- Published
- 2018
21. Automatically Identifying Comparator Groups on Twitter for Digital Epidemiology of Pregnancy Outcomes.
- Author
- Ari Z. Klein, Abeselom Gebreyesus, and Graciela Gonzalez-Hernandez
- Published
- 2019
22. Towards Automatic Bot Detection in Twitter for Health-related Tasks.
- Author
- Anahita Davoudi, Ari Z. Klein, Abeed Sarker, and Graciela Gonzalez-Hernandez
- Published
- 2019
23. An interpretable natural language processing system for written medical examination assessment.
- Author
- Abeed Sarker, Ari Z. Klein, Janet Mee, Polina Harik, and Graciela Gonzalez-Hernandez
- Published
- 2019
24. Towards scaling Twitter for digital epidemiology of birth defects.
- Author
- Ari Z. Klein, Abeed Sarker, Davy Weissenbacher, and Graciela Gonzalez-Hernandez
- Published
- 2019
25. Dealing with Medication Non-Adherence Expressions in Twitter.
- Author
- Takeshi Onishi, Davy Weissenbacher, Ari Z. Klein, Karen O'Connor, and Graciela Gonzalez-Hernandez
- Published
- 2018
26. Automatically Detecting Self-Reported Birth Defect Outcomes on Twitter for Large-scale Epidemiological Research.
- Author
- Ari Z. Klein, Abeed Sarker, Davy Weissenbacher, and Graciela Gonzalez-Hernandez
- Published
- 2018
27. Eliminative induction: a basis for arguing system confidence.
- Author
- John B. Goodenough, Charles B. Weinstock, and Ari Z. Klein
- Published
- 2013
28. Measuring assurance case confidence using Baconian probabilities.
- Author
- Charles B. Weinstock, John B. Goodenough, and Ari Z. Klein
- Published
- 2013
29. Automatically Identifying Childhood Health Outcomes on Twitter for Digital Epidemiology in Pregnancy
- Author
- Ari Z. Klein, José Agustín Gutiérrez Gómez, Lisa D. Levine, and Graciela Gonzalez-Hernandez
- Abstract
Data are limited regarding associations between pregnancy exposures and childhood outcomes. The objectives of this preliminary study were to (1) assess the availability of Twitter data during pregnancy for users who reported having a child with attention deficit/hyperactivity disorder (ADHD), autism spectrum disorders (ASD), delayed speech, or asthma, and (2) automate the detection of these outcomes. We annotated 9734 tweets that mentioned these outcomes, posted by users who had reported their pregnancy, and used them to train and evaluate the automatic classification of tweets that reported these outcomes in their children. A classifier based on a RoBERTa-Large pretrained model achieved the highest F1-score of 0.93 (precision = 0.92 and recall = 0.94). Manually and automatically, we identified 3806 total users who reported having a child with ADHD (678 users), ASD (1744 users), delayed speech (902 users), or asthma (1255 users), enabling the use of Twitter data for large-scale observational studies.
- Published
- 2022
30. Using Twitter Data for Cohort Studies of Drug Safety in Pregnancy: A Proof-of-Concept with Beta-Blockers
- Author
- Ari Z. Klein, Karen O’Connor, Lisa D. Levine, and Graciela Gonzalez-Hernandez
- Abstract
Background Despite the fact that medication is taken during more than 90% of pregnancies, the fetal risk for most medications is unknown, and the majority of medications have no data regarding safety in pregnancy. Objective Using beta-blockers as a proof-of-concept, the primary objective of this study was to assess the utility of Twitter data for a cohort study design, in particular whether we could identify (1) Twitter users who have posted tweets reporting that they took a beta-blocker during pregnancy and (2) their associated pregnancy outcomes. Methods We searched for mentions of beta-blockers in 2.75 billion tweets posted by 415,690 users who announced their pregnancy on Twitter. We manually reviewed the matching tweets to first determine if the user actually took the beta-blocker mentioned in the tweet. Then, to help determine if the beta-blocker was taken during pregnancy, we used the timestamp of the tweet reporting intake and drew upon an automated natural language processing (NLP) tool that estimates the date of the user’s prenatal time period. For users who posted tweets indicating that they took or may have taken the beta-blocker during pregnancy, we drew upon additional NLP tools to help identify tweets that report their adverse pregnancy outcomes, including miscarriage, stillbirth, preterm birth, low birth weight, birth defects, and neonatal intensive care unit admission. Results We retrieved 5114 tweets, posted by 2339 users, that mention a beta-blocker, and manually identified 2332 (45.6%) tweets, posted by 1195 (51.1%) of the users, that self-report taking the beta-blocker. We were able to estimate the date of the prenatal time period for 356 pregnancies among 334 (27.9%) of these 1195 users. Among these 356 pregnancies, we identified 257 (72.2%) during which the beta-blocker was or may have been taken. We manually verified an adverse pregnancy outcome (preterm birth, neonatal intensive care unit admission, low birth weight, birth defects, or miscarriage) for 38 (14.8%) of these 257 pregnancies. Conclusions Our ability to detect pregnancy outcomes for Twitter users who posted tweets reporting that they took or may have taken a beta-blocker during pregnancy suggests that Twitter can be a complementary resource for cohort studies of drug safety in pregnancy.
- Published
- 2022
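Note on the reported counts: the Results in the abstract above narrow the data in stages, from matching tweets down to pregnancies with a verified adverse outcome, and give a percentage at each stage. The following minimal sketch recomputes those percentages from the counts stated in the abstract; the helper name `pct` and the short stage descriptions are our own labels, not the authors'.

```python
def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# (numerator, denominator, description), with counts taken from the abstract.
funnel = [
    (2332, 5114, "tweets that self-report taking the beta-blocker"),
    (1195, 2339, "users who posted those tweets"),
    (334, 1195, "users with an estimated prenatal time period"),
    (257, 356, "pregnancies with possible beta-blocker intake"),
    (38, 257, "pregnancies with a verified adverse outcome"),
]

for part, whole, label in funnel:
    print(f"{part}/{whole} = {pct(part, whole)}%  ({label})")
```

Each recomputed percentage agrees with the figure given in the abstract (45.6%, 51.1%, 27.9%, 72.2%, and 14.8%).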
31. Using Twitter Data for Cohort Studies of Drug Safety in Pregnancy: Proof-of-concept With β-Blockers (Preprint)
- Author
- Ari Z Klein, Karen O'Connor, Lisa D Levine, and Graciela Gonzalez-Hernandez
- Abstract
BACKGROUND Despite the fact that medication is taken during more than 90% of pregnancies, the fetal risk for most medications is unknown, and the majority of medications have no data regarding safety in pregnancy. OBJECTIVE Using β-blockers as a proof-of-concept, the primary objective of this study was to assess the utility of Twitter data for a cohort study design—in particular, whether we could identify (1) Twitter users who have posted tweets reporting that they took medication during pregnancy and (2) their associated pregnancy outcomes. METHODS We searched for mentions of β-blockers in 2.75 billion tweets posted by 415,690 users who announced their pregnancy on Twitter. We manually reviewed the matching tweets to first determine if the user actually took the β-blocker mentioned in the tweet. Then, to help determine if the β-blocker was taken during pregnancy, we used the time stamp of the tweet reporting intake and drew upon an automated natural language processing (NLP) tool that estimates the date of the user’s prenatal time period. For users who posted tweets indicating that they took or may have taken the β-blocker during pregnancy, we drew upon additional NLP tools to help identify tweets that report their pregnancy outcomes. Adverse pregnancy outcomes included miscarriage, stillbirth, birth defects, preterm birth ( RESULTS We retrieved 5114 tweets, posted by 2339 users, that mention a β-blocker, and manually identified 2332 (45.6%) tweets, posted by 1195 (51.1%) of the users, that self-report taking the β-blocker. We were able to estimate the date of the prenatal time period for 356 pregnancies among 334 (27.9%) of these 1195 users. Among these 356 pregnancies, we identified 257 (72.2%) during which the β-blocker was or may have been taken. We manually verified an adverse pregnancy outcome—preterm birth, NICU admission, low birth weight, birth defects, or miscarriage—for 38 (14.8%) of these 257 pregnancies. 
We manually verified a gestational age ≥37 weeks for 198 (90.4%) and a birth weight ≥5 pounds and 8 ounces for 50 (22.8%) of the 219 pregnancies for which we did not identify an adverse pregnancy outcome. CONCLUSIONS Our ability to detect pregnancy outcomes for Twitter users who posted tweets reporting that they took or may have taken a β-blocker during pregnancy suggests that Twitter can be a complementary resource for cohort studies of drug safety in pregnancy.
- Published
- 2022
32. Toward Using Twitter Data to Monitor COVID-19 Vaccine Safety in Pregnancy: Proof-of-Concept Study of Cohort Identification
- Author
- Ari Z. Klein, Graciela Gonzalez-Hernandez, and Karen O'Connor
- Subjects
- Vaccine safety, Pregnancy, Coronavirus disease 2019 (COVID-19), social media, pregnancy outcomes, Medicine (miscellaneous), COVID-19, Health Informatics, data mining, Computer Science Applications, Proof of concept, Family medicine, Short Paper, Preprint, natural language processing, Cohort identification, COVID-19 vaccine
- Abstract
Background COVID-19 during pregnancy is associated with an increased risk of maternal death, intensive care unit admission, and preterm birth; however, many people who are pregnant refuse to receive COVID-19 vaccination because of a lack of safety data. Objective The objective of this preliminary study was to assess whether Twitter data could be used to identify a cohort for epidemiologic studies of COVID-19 vaccination in pregnancy. Specifically, we examined whether it is possible to identify users who have reported (1) that they received COVID-19 vaccination during pregnancy or the periconception period, and (2) their pregnancy outcomes. Methods We developed regular expressions to search for reports of COVID-19 vaccination in a large collection of tweets posted through the beginning of July 2021 by users who have announced their pregnancy on Twitter. To help determine if users were vaccinated during pregnancy, we drew upon a natural language processing (NLP) tool that estimates the timeframe of the prenatal period. For users who posted tweets with a timestamp indicating they were vaccinated during pregnancy, we drew upon additional NLP tools to help identify tweets that reported their pregnancy outcomes. Results We manually verified the content of tweets detected automatically, identifying 150 users who reported on Twitter that they received at least one dose of COVID-19 vaccination during pregnancy or the periconception period. We manually verified at least one reported outcome for 45 of the 60 (75%) completed pregnancies. Conclusions Given the limited availability of data on COVID-19 vaccine safety in pregnancy, Twitter can be a complementary resource for potentially increasing the acceptance of COVID-19 vaccination in pregnant populations. The results of this preliminary study justify the development of scalable methods to identify a larger cohort for epidemiologic studies.
- Published
- 2021
33. Toward Using Twitter Data to Monitor COVID-19 Vaccine Safety in Pregnancy: Proof-of-Concept Study of Cohort Identification (Preprint)
- Author
- Ari Z Klein, Karen O'Connor, and Graciela Gonzalez-Hernandez
- Abstract
BACKGROUND COVID-19 during pregnancy is associated with an increased risk of maternal death, intensive care unit admission, and preterm birth; however, many people who are pregnant refuse to receive COVID-19 vaccination because of a lack of safety data. OBJECTIVE The objective of this preliminary study was to assess whether Twitter data could be used to identify a cohort for epidemiologic studies of COVID-19 vaccination in pregnancy. Specifically, we examined whether it is possible to identify users who have reported (1) that they received COVID-19 vaccination during pregnancy or the periconception period, and (2) their pregnancy outcomes. METHODS We developed regular expressions to search for reports of COVID-19 vaccination in a large collection of tweets posted through the beginning of July 2021 by users who have announced their pregnancy on Twitter. To help determine if users were vaccinated during pregnancy, we drew upon a natural language processing (NLP) tool that estimates the timeframe of the prenatal period. For users who posted tweets with a timestamp indicating they were vaccinated during pregnancy, we drew upon additional NLP tools to help identify tweets that reported their pregnancy outcomes. RESULTS We manually verified the content of tweets detected automatically, identifying 150 users who reported on Twitter that they received at least one dose of COVID-19 vaccination during pregnancy or the periconception period. We manually verified at least one reported outcome for 45 of the 60 (75%) completed pregnancies. CONCLUSIONS Given the limited availability of data on COVID-19 vaccine safety in pregnancy, Twitter can be a complementary resource for potentially increasing the acceptance of COVID-19 vaccination in pregnant populations. The results of this preliminary study justify the development of scalable methods to identify a larger cohort for epidemiologic studies.
- Published
- 2021
- Full Text
- View/download PDF
34. Toward Using Twitter Data to Monitor Covid-19 Vaccine Safety in Pregnancy
- Author
-
Ari Z. Klein, Graciela Gonzalez-Hernandez, and Karen O'Connor
- Subjects
Vaccine safety ,Pregnancy ,Coronavirus disease 2019 (COVID-19) ,business.industry ,medicine.disease ,Intensive care unit ,law.invention ,Vaccination ,Increased risk ,law ,Medicine ,Maternal death ,Observational study ,Medical emergency ,business
- Abstract
Background Coronavirus Disease 2019 (Covid-19) during pregnancy is associated with an increased risk of maternal death, intensive care unit (ICU) admission, and preterm birth; however, many people who are pregnant refuse to receive Covid-19 vaccination because of a lack of safety data. Objective The objective of this preliminary study was to assess whether we could identify (1) users who have reported on Twitter that they received Covid-19 vaccination during pregnancy or the periconception period, and (2) reports of their pregnancy outcomes. Methods We searched for reports of Covid-19 vaccination in a large collection of tweets posted by users who have announced their pregnancy on Twitter. To help determine if users were vaccinated during pregnancy, we drew upon a natural language processing (NLP) tool that estimates the timeframe of the prenatal period. For users who posted tweets with a timestamp indicating they were vaccinated during pregnancy, we drew upon additional NLP tools to help identify tweets that report their pregnancy outcomes. Results Upon manually verifying the content of tweets detected automatically, we identified 150 users who reported on Twitter that they received at least one dose of Covid-19 vaccination during pregnancy or the periconception period. Among the 60 completed pregnancies, we manually verified at least one reported outcome for 45 (75%) of them. Conclusions Given the limited availability of data on Covid-19 vaccine safety in pregnancy, Twitter can be a complementary resource for potentially increasing the acceptance of Covid-19 vaccination in pregnant populations. Directions for future work include developing machine learning algorithms to detect a larger number of users for observational studies.
- Published
- 2021
- Full Text
- View/download PDF
35. Toward Using Twitter for PrEP-Related Interventions: An Automated Natural Language Processing Pipeline for Identifying Gay or Bisexual Men in the United States (Preprint)
- Author
-
Ari Z Klein, Steven Meanley, Karen O'Connor, José A Bauermeister, and Graciela Gonzalez-Hernandez
- Abstract
BACKGROUND Pre-exposure prophylaxis (PrEP) is highly effective at preventing the acquisition of HIV. There is a substantial gap, however, between the number of people in the United States who have indications for PrEP and the number of them who are prescribed PrEP. Although Twitter content has been analyzed as a source of PrEP-related data (eg, barriers), methods have not been developed to enable the use of Twitter as a platform for implementing PrEP-related interventions. OBJECTIVE Men who have sex with men (MSM) are the population most affected by HIV in the United States. Therefore, the objectives of this study were to (1) develop an automated natural language processing (NLP) pipeline for identifying men in the United States who have reported on Twitter that they are gay, bisexual, or MSM and (2) assess the extent to which they demographically represent MSM in the United States with new HIV diagnoses. METHODS Between September 2020 and January 2021, we used the Twitter Streaming Application Programming Interface (API) to collect more than 3 million tweets containing keywords that men may include in posts reporting that they are gay, bisexual, or MSM. We deployed handwritten, high-precision regular expressions—designed to filter out noise and identify actual self-reports—on the tweets and their user profile metadata. We identified 10,043 unique users geolocated in the United States and drew upon a validated NLP tool to automatically identify their ages. RESULTS By manually distinguishing true- and false-positive self-reports in the tweets or profiles of 1000 (10%) of the 10,043 users identified by our automated pipeline, we established that our pipeline has a precision of 0.85. Among the 8756 users for which a US state–level geolocation was detected, 5096 (58.2%) were in the 10 states with the highest numbers of new HIV diagnoses. 
Among the 6240 users for which a county-level geolocation was detected, 4252 (68.1%) were in counties or states considered priority jurisdictions by the Ending the HIV Epidemic initiative. Furthermore, the age distribution of the users reflected that of MSM in the United States with new HIV diagnoses. CONCLUSIONS Our automated NLP pipeline can be used to identify MSM in the United States who may be at risk of acquiring HIV, laying the groundwork for using Twitter on a large scale to directly target PrEP-related interventions at this population.
- Published
- 2021
- Full Text
- View/download PDF
36. Toward Using Twitter for PrEP-Related Interventions: An Automated Natural Language Processing Pipeline for Identifying Gay or Bisexual Men in the United States
- Author
-
Karen O'Connor, Steven Meanley, Graciela Gonzalez-Hernandez, José A. Bauermeister, and Ari Z. Klein
- Subjects
Male ,Population ,Psychological intervention ,Health Informatics ,HIV Infections ,computer.software_genre ,Filter (software) ,Men who have sex with men ,Sexual and Gender Minorities ,Humans ,Homosexuality, Male ,education ,Natural Language Processing ,education.field_of_study ,User profile ,business.industry ,Public Health, Environmental and Occupational Health ,United States ,Metadata ,Geolocation ,Scale (social sciences) ,Artificial intelligence ,Psychology ,business ,computer ,Social Media ,Natural language processing
- Abstract
Background Pre-exposure prophylaxis (PrEP) is highly effective at preventing the acquisition of HIV. There is a substantial gap, however, between the number of people in the United States who have indications for PrEP and the number of them who are prescribed PrEP. Although Twitter content has been analyzed as a source of PrEP-related data (eg, barriers), methods have not been developed to enable the use of Twitter as a platform for implementing PrEP-related interventions. Objective Men who have sex with men (MSM) are the population most affected by HIV in the United States. Therefore, the objectives of this study were to (1) develop an automated natural language processing (NLP) pipeline for identifying men in the United States who have reported on Twitter that they are gay, bisexual, or MSM and (2) assess the extent to which they demographically represent MSM in the United States with new HIV diagnoses. Methods Between September 2020 and January 2021, we used the Twitter Streaming Application Programming Interface (API) to collect more than 3 million tweets containing keywords that men may include in posts reporting that they are gay, bisexual, or MSM. We deployed handwritten, high-precision regular expressions—designed to filter out noise and identify actual self-reports—on the tweets and their user profile metadata. We identified 10,043 unique users geolocated in the United States and drew upon a validated NLP tool to automatically identify their ages. Results By manually distinguishing true- and false-positive self-reports in the tweets or profiles of 1000 (10%) of the 10,043 users identified by our automated pipeline, we established that our pipeline has a precision of 0.85. Among the 8756 users for which a US state–level geolocation was detected, 5096 (58.2%) were in the 10 states with the highest numbers of new HIV diagnoses. 
Among the 6240 users for which a county-level geolocation was detected, 4252 (68.1%) were in counties or states considered priority jurisdictions by the Ending the HIV Epidemic initiative. Furthermore, the age distribution of the users reflected that of MSM in the United States with new HIV diagnoses. Conclusions Our automated NLP pipeline can be used to identify MSM in the United States who may be at risk of acquiring HIV, laying the groundwork for using Twitter on a large scale to directly target PrEP-related interventions at this population.
- Published
- 2021
37. Towards scaling Twitter for digital epidemiology of birth defects
- Author
-
Abeed Sarker, Graciela Gonzalez-Hernandez, Ari Z. Klein, and Davy Weissenbacher
- Subjects
Exploit ,Computer science ,Epidemiology ,Medicine (miscellaneous) ,Health Informatics ,computer.software_genre ,lcsh:Computer applications to medicine. Medical informatics ,030226 pharmacology & pharmacy ,Article ,03 medical and health sciences ,0302 clinical medicine ,Health Information Management ,Social media ,030212 general & internal medicine ,Data mining ,Artificial neural network ,business.industry ,Deep learning ,Class (biology) ,Computer Science Applications ,Scale (social sciences) ,Cohort ,lcsh:R858-859.7 ,Artificial intelligence ,business ,computer ,Classifier (UML) ,Natural language processing
- Abstract
Social media has recently been used to identify and study a small cohort of Twitter users whose pregnancies with birth defect outcomes—the leading cause of infant mortality—could be observed via their publicly available tweets. In this study, we exploit social media on a larger scale by developing natural language processing (NLP) methods to automatically detect, among thousands of users, a cohort of mothers reporting that their child has a birth defect. We used 22,999 annotated tweets to train and evaluate supervised machine learning algorithms—feature-engineered and deep learning-based classifiers—that automatically distinguish tweets referring to the user’s pregnancy outcome from tweets that merely mention birth defects. Because 90% of the tweets merely mention birth defects, we experimented with under-sampling and over-sampling approaches to address this class imbalance. An SVM classifier achieved the best performance for the two positive classes: an F1-score of 0.65 for the “defect” class and 0.51 for the “possible defect” class. We deployed the classifier on 20,457 unlabeled tweets that mention birth defects, which helped identify 542 additional users for potential inclusion in our cohort. Contributions of this study include (1) NLP methods for automatically detecting tweets by users reporting their birth defect outcomes, (2) findings that an SVM classifier can outperform a deep neural network-based classifier for highly imbalanced social media data, (3) evidence that automatic classification can be used to identify additional users for potential inclusion in our cohort, and (4) a publicly available corpus for training and evaluating supervised machine learning algorithms.
- Published
- 2019
- Full Text
- View/download PDF
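The abstract of entry 37 mentions under-sampling and over-sampling to address class imbalance but does not say which variants were used. The following is a minimal sketch of the simplest random versions, assuming a two-class toy data set; the function names `undersample` and `oversample`, the toy rows, and the 90/10 split are hypothetical illustrations and not the authors' implementation (the study itself used three classes).

```python
import random
from collections import Counter


def _group_by_class(rows, label_of):
    """Bucket rows by their class label."""
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    return groups


def undersample(rows, label_of, seed=0):
    """Randomly drop rows from larger classes until all match the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, label_of)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


def oversample(rows, label_of, seed=0):
    """Randomly duplicate rows from smaller classes until all match the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, label_of)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap (k may be 0 for the largest class).
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 90 tweets that merely mention a defect, 10 that report one.
toy = [(f"tweet {i}", "mention") for i in range(90)]
toy += [(f"tweet {i}", "defect") for i in range(90, 100)]


def label(row):
    return row[1]


under_counts = Counter(label(r) for r in undersample(toy, label))
over_counts = Counter(label(r) for r in oversample(toy, label))
```

Under-sampling discards majority-class examples, while over-sampling repeats minority-class examples; in either case the resampling would normally be applied to the training split only, so that evaluation still reflects the real class distribution.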
38. Active Neural Networks to Detect Mentions of Changes to Medication Treatment in Social Media
- Author
-
Davy Weissenbacher, Sean Hennessy, Graciela Gonzalez-Hernandez, Robert E. Gross, Ari Z. Klein, Suyu Ge, and Karen O'Connor
- Subjects
medicine.medical_specialty ,text classification ,AcademicSubjects/SCI01060 ,social media ,Applied psychology ,Health Informatics ,Research and Applications ,Convolutional neural network ,Medication Adherence ,Class imbalance ,active learning ,Pharmacovigilance ,medicine ,Humans ,Social media ,Psychiatry ,AcademicSubjects/MED00580 ,Artificial neural network ,Direct observation ,pharmacovigilance ,medication non-adherence ,Active learning ,Adherence monitoring ,Neural Networks, Computer ,AcademicSubjects/SCI01530 ,Transfer of learning ,Psychology
- Abstract
Objective We address a first step toward using social media data to supplement current efforts in monitoring population-level medication nonadherence: detecting changes to medication treatment. Medication treatment changes, like changes to dosage or to frequency of intake, that are not overseen by physicians are, by that, nonadherence to medication. Despite the consequences, including worsening health conditions or death, 50% of patients are estimated to not take medications as indicated. Current methods to identify nonadherence have major limitations. Direct observation may be intrusive or expensive, and indirect observation through patient surveys relies heavily on patients’ memory and candor. Using social media data in these studies may address these limitations. Methods We annotated 9830 tweets mentioning medications and trained a convolutional neural network (CNN) to find mentions of medication treatment changes, regardless of whether the change was recommended by a physician. We used active and transfer learning from 12 972 reviews we annotated from WebMD to address the class imbalance of our Twitter corpus. To validate our CNN and explore future directions, we annotated 1956 positive tweets as to whether they reflect nonadherence and categorized the reasons given. Results Our CNN achieved 0.50 F1-score on this new corpus. The manual analysis of positive tweets revealed that nonadherence is evident in a subset with 9 categories of reasons for nonadherence. Conclusion We showed that social media users publicly discuss medication treatment changes and may explain their reasons including when it constitutes nonadherence. This approach may be useful to supplement current efforts in adherence monitoring.
- Published
- 2020
- Full Text
- View/download PDF
39. Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set (Preprint)
- Author
-
Ari Z Klein, Arjun Magge, Karen O'Connor, Jesus Ivan Flores Amaro, Davy Weissenbacher, and Graciela Gonzalez Hernandez
- Abstract
BACKGROUND In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone. OBJECTIVE The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the Centers for Disease Control and Prevention. METHODS Beginning January 23, 2020, we collected English tweets from the Twitter Streaming application programming interface that mention keywords related to COVID-19. We applied handwritten regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out “reported speech” (eg, quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on bidirectional encoder representations from transformers (BERT). Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020. RESULTS Interannotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen κ). A deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases of COVID-19. 
Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have US state–level geolocations. CONCLUSIONS We have made the 13,714 tweets identified in this study, along with each tweet’s time stamp and US state–level geolocation, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
- Published
- 2020
- Full Text
- View/download PDF
40. An annotated data set for identifying women reporting adverse pregnancy outcomes on Twitter
- Author
-
Graciela Gonzalez-Hernandez and Ari Z. Klein
- Subjects
medicine.medical_specialty ,Epidemiology ,lcsh:Computer applications to medicine. Medical informatics ,Miscarriage ,Social media ,03 medical and health sciences ,0302 clinical medicine ,Pregnancy ,Intensive care ,Machine learning ,medicine ,lcsh:Science (General) ,Data mining ,030304 developmental biology ,Data Article ,0303 health sciences ,Multidisciplinary ,business.industry ,Natural language processing ,Timeline ,medicine.disease ,Infant mortality ,Family medicine ,lcsh:R858-859.7 ,Observational study ,business ,030217 neurology & neurosurgery ,lcsh:Q1-390
- Abstract
Despite the prevalence in the United States of miscarriage [1] , stillbirth [2] , and infant mortality associated with preterm birth and low birthweight [3] , their causes remain largely unknown [4] , [5] , [6] . To advance the use of social media data as a complementary resource for epidemiology of adverse pregnancy outcomes, we present a data set of 6487 tweets that mention miscarriage, stillbirth, preterm birth or premature labor, low birthweight, neonatal intensive care, or fetal/infant loss in general. These tweets are a subset of 22,912 tweets retrieved by applying hand-written regular expressions to a database containing more than 400 million public tweets posted by more than 100,000 women who have announced their pregnancy on Twitter [7] . Two professional annotators labeled the 6487 tweets in a binary fashion, distinguishing those potentially reporting that the user has personally experienced the outcome (“outcome” tweets) from those that merely mention the outcome (“non-outcome” tweets). Inter-annotator agreement was κ = 0.90 (Cohen's kappa). The tweets annotated as “outcome” include 1318 women reporting miscarriage, 94 stillbirth, 591 preterm birth or premature labor, 171 low birthweight, 453 neonatal intensive care, and 356 fetal/infant loss in general. These “outcome” tweets can be used to explore patient experiences and perceptions of adverse pregnancy outcomes, and can direct researchers to the users’ broader timelines—tweets posted by a user over time—for observational studies. Our past work demonstrates the analysis of timelines for selecting a study population [8] and conducting a case-control study [9] of users reporting that their child has a birth defect. For larger-scale studies, the full annotated corpus can be used to train supervised machine learning algorithms to automatically identify additional users reporting adverse pregnancy outcomes on Twitter. 
We used the annotated corpus to train feature-engineered and deep learning-based classifiers presented in “A natural language processing pipeline to advance the use of Twitter data for digital epidemiology of adverse pregnancy outcomes” [10] .
- Published
- 2020
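Several abstracts in this listing report inter-annotator agreement as Cohen's kappa (for example, κ = 0.90 in entry 40). As a minimal sketch of how that statistic is computed from two annotators' labels, assuming the standard definition κ = (p_o − p_e) / (1 − p_e), the function below uses only the standard library. The function name `cohen_kappa` and the ten toy labels are hypothetical and are not taken from any of the data sets listed.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items on which the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates, summed over labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations: the annotators disagree on exactly one of ten tweets.
annotator_1 = ["outcome"] * 4 + ["non-outcome"] * 6
annotator_2 = ["outcome"] * 3 + ["non-outcome"] * 7
kappa = cohen_kappa(annotator_1, annotator_2)  # (0.90 - 0.54) / 0.46, about 0.78
```

The toy example shows why kappa is lower than raw agreement: the annotators agree on 90% of items, but 54% agreement would be expected by chance given their label frequencies.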
41. Automatically Identifying Comparator Groups on Twitter for Digital Epidemiology of Pregnancy Outcomes
- Author
-
Ari Z Klein, Abeselom Gebreyesus, and Graciela Gonzalez-Hernandez
- Subjects
Articles
- Abstract
Despite the prevalence of adverse pregnancy outcomes such as miscarriage, stillbirth, birth defects, and preterm birth, their causes are largely unknown. We seek to advance the use of social media for observational studies of pregnancy outcomes by developing a natural language processing pipeline for automatically identifying users from which to select comparator groups on Twitter. We annotated 2361 tweets by users who have announced their pregnancy on Twitter, which were used to train and evaluate supervised machine learning algorithms as a basis for automatically detecting women who have reported that their pregnancy had reached term and their baby was born at a normal weight. Upon further processing the tweet-level predictions of a majority voting-based ensemble classifier, the pipeline achieved a user-level F1-score of 0.933 (precision = 0.947, recall = 0.920). Our pipeline will be deployed to identify large comparator groups for studying pregnancy outcomes on Twitter.
- Published
- 2020
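The user-level figures reported in entry 41 (precision = 0.947, recall = 0.920, F1-score = 0.933) can be checked against the usual definition of F1 as the harmonic mean of precision and recall. The short sketch below performs that check; the function name `f1_score` is a hypothetical helper, not part of the authors' pipeline.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract of entry 41.
user_level_f1 = f1_score(0.947, 0.920)
rounded_f1 = round(user_level_f1, 3)  # 0.933, matching the reported F1-score
```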
42. Extending A Chronological and Geographical Analysis of Personal Reports of COVID-19 on Twitter to England, UK
- Author
-
S Golder, Ari Z. Klein, Arjun Magge, Karen O’Connor, Haitao Cai, Davy Weissenbacher, and Graciela Gonzalez-Hernandez
- Subjects
Coronavirus disease 2019 (COVID-19) ,business.industry ,05 social sciences ,MEDLINE ,Distribution (economics) ,050801 communication & media studies ,Data science ,Article ,law.invention ,03 medical and health sciences ,0302 clinical medicine ,0508 media and communications ,Transmission (mechanics) ,Geography ,Social media mining ,law ,Pandemic ,030212 general & internal medicine ,business
- Abstract
The rapidly evolving COVID-19 pandemic presents challenges for actively monitoring its transmission. In this study, we extend a social media mining approach used in the US to automatically identify personal reports of COVID-19 on Twitter in England, UK. The findings indicate that a natural language processing and machine learning framework could help provide an early indication of the chronological and geographical distribution of COVID-19 in England.
- Published
- 2020
43. A Chronological and Geographical Analysis of Personal Reports of COVID-19 on Twitter
- Author
-
Arjun Magge, Davy Weissenbacher, Graciela Gonzalez-Hernandez, Karen O'Connor, Haitao Cai, and Ari Z. Klein
- Subjects
Coronavirus disease 2019 (COVID-19) ,Computer science ,business.industry ,MEDLINE ,Distribution (economics) ,02 engineering and technology ,Data science ,03 medical and health sciences ,0302 clinical medicine ,Social media mining ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,030212 general & internal medicine ,business ,Personally identifiable information
- Abstract
The rapidly evolving outbreak of COVID-19 presents challenges for actively monitoring its spread. In this study, we assessed a social media mining approach for automatically analyzing the chronological and geographical distribution of users in the United States reporting personal information related to COVID-19 on Twitter. The results suggest that our natural language processing and machine learning framework could help provide an early indication of the spread of COVID-19.
- Published
- 2020
- Full Text
- View/download PDF
44. Social media mining for birth defects research: A rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter
- Author
-
Abeed Sarker, Davy Weissenbacher, Graciela Gonzalez-Hernandez, Ari Z. Klein, and Haitao Cai
- Subjects
Heart Defects, Congenital ,Male ,0301 basic medicine ,Georgia ,Population ,Health Informatics ,030105 genetics & heredity ,computer.software_genre ,Lexicon ,Article ,Congenital Abnormalities ,Machine Learning ,03 medical and health sciences ,0302 clinical medicine ,Social media mining ,International Classification of Diseases ,Pregnancy ,False positive paradox ,Data Mining ,Humans ,False Positive Reactions ,Social media ,030212 general & internal medicine ,education ,Natural Language Processing ,education.field_of_study ,business.industry ,Data Collection ,Infant, Newborn ,Infant ,Reproducibility of Results ,Rule-based system ,Bootstrapping (linguistics) ,Timeline ,Unified Medical Language System ,United States ,Computer Science Applications ,Europe ,Female ,Illinois ,Artificial intelligence ,business ,Psychology ,Social Media ,computer ,Algorithms ,Natural language processing
- Abstract
Background Although birth defects are the leading cause of infant mortality in the United States, methods for observing human pregnancies with birth defect outcomes are limited. Objective The primary objectives of this study were (i) to assess whether rare health-related events—in this case, birth defects—are reported on social media, (ii) to design and deploy a natural language processing (NLP) approach for collecting such sparse data from social media, and (iii) to utilize the collected data to discover a cohort of women whose pregnancies with birth defect outcomes could be observed on social media for epidemiological analysis. Methods To assess whether birth defects are mentioned on social media, we mined 432 million tweets posted by 112,647 users who were automatically identified via their public announcements of pregnancy on Twitter. To retrieve tweets that mention birth defects, we developed a rule-based, bootstrapping approach, which relies on a lexicon, lexical variants generated from the lexicon entries, regular expressions, post-processing, and manual analysis guided by distributional properties. To identify users whose pregnancies with birth defect outcomes could be observed for epidemiological analysis, inclusion criteria were (i) tweets indicating that the user’s child has a birth defect, and (ii) accessibility to the user’s tweets during pregnancy. We conducted a semi-automatic evaluation to estimate the recall of the tweet-collection approach, and performed a preliminary assessment of the prevalence of selected birth defects among the pregnancy cohort derived from Twitter. Results We manually annotated 16,822 retrieved tweets, distinguishing tweets indicating that the user’s child has a birth defect (true positives) from tweets that merely mention birth defects (false positives). Inter-annotator agreement was substantial: κ = 0.79 (Cohen’s kappa). 
Analyzing the timelines of the 646 users whose tweets were true positives resulted in the discovery of 195 users that met the inclusion criteria. Congenital heart defects are the most common type of birth defect reported on Twitter, consistent with findings in the general population. Based on an evaluation of 4169 tweets retrieved using alternative text mining methods, the recall of the tweet-collection approach was 0.95. Conclusions Our contributions include (i) evidence that rare health-related events are indeed reported on Twitter, (ii) a generalizable, systematic NLP approach for collecting sparse tweets, (iii) a semi-automatic method to identify undetected tweets (false negatives), and (iv) a collection of publicly available tweets by pregnant users with birth defect outcomes, which could be used for future epidemiological analysis. In future work, the annotated tweets could be used to train machine learning algorithms to automatically identify users reporting birth defect outcomes, enabling the large-scale use of social media mining as a complementary method for such epidemiological research.
- Published
- 2018
- Full Text
- View/download PDF
45. Pharmacoepidemiologic Evaluation of Birth Defects from Health-Related Postings in Social Media During Pregnancy
- Author
-
Davy Weissenbacher, Graciela Gonzalez-Hernandez, Stephanie Chiuve, Martin Bland, Mondira Bhattacharya, Murray Malin, Linda Scarazzini, Su Golder, Ari Z. Klein, and Karen O'Connor
- Subjects
Adult ,medicine.medical_specialty ,Population ,MEDLINE ,Toxicology ,030226 pharmacology & pharmacy ,03 medical and health sciences ,0302 clinical medicine ,Pregnancy ,medicine ,Adverse Drug Reaction Reporting Systems ,Humans ,Pharmacology (medical) ,Social media ,030212 general & internal medicine ,Original Research Article ,Registries ,Adverse effect ,education ,Pharmacology ,education.field_of_study ,Obstetrics ,business.industry ,Pharmacoepidemiology ,Health related ,Abnormalities, Drug-Induced ,medicine.disease ,Cohort ,Feasibility Studies ,Residence ,Female ,business ,Social Media
- Abstract
Introduction Adverse effects of medications taken during pregnancy are traditionally studied through post-marketing pregnancy registries, which have limitations. Social media data may be an alternative data source for pregnancy surveillance studies. Objective The objective of this study was to assess the feasibility of using social media data as an alternative source for pregnancy surveillance for regulatory decision making. Methods We created an automated method to identify Twitter accounts of pregnant women. We identified 196 pregnant women with a mention of a birth defect in relation to their baby and 196 without a mention of a birth defect in relation to their baby. We extracted information on pregnancy and maternal demographics, medication intake and timing, and birth defects. Results Although often incomplete, we extracted data for the majority of the pregnancies. Among women that reported birth defects, 35% reported taking one or more medications during pregnancy compared with 17% of controls. After accounting for age, race, and place of residence, a higher medication intake was observed in women who reported birth defects. The rate of birth defects in the pregnancy cohort was lower (0.44%) compared with the rate in the general population (3%). Conclusions Twitter data capture information on medication intake and birth defects; however, the information obtained cannot replace pregnancy registries at this time. Development of improved methods to automatically extract and annotate social media data may increase their value to support regulatory decision making regarding pregnancy outcomes in women using medications during their pregnancies.
- Published
- 2018
46. An Analysis of a Twitter Corpus for Training a Medication Intake Classifier
- Author
-
Ari Z Klein, Abeed Sarker, Karen O'Connor, and Graciela Gonzalez-Hernandez
- Subjects
Articles
- Abstract
While social media has evolved into a useful resource for studying medication-related information, observational studies of medications have continued to rely on other sources of data. Towards advancing the use of social media data for medication-related observational studies, we analyze an annotated corpus of 27,941 tweets designed for training machine learning algorithms to automatically detect users' medication intake. In particular, we assess how a baseline classifier trained on the general corpus (that is, on various types of medication) performs for specific types. For most types, the classifier performs significantly better than it does overall; however, for nervous system medications, it performs significantly worse. These results suggest that, while the general corpus may have utility for observational studies focusing on most types of medication, studying nervous system medications may benefit from training a classifier exclusively for this type. We will explore this data-level approach in future work.
- Published
- 2019
47. Deep neural networks ensemble for detecting medication mentions in tweets
- Author
-
Ari Z. Klein, Abeed Sarker, Davy Weissenbacher, Karen O'Connor, Arjun Magge, and Graciela Gonzalez-Hernandez
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,text classification ,020205 medical informatics ,Computer science ,media_common.quotation_subject ,social media ,Health Informatics ,02 engineering and technology ,computer.software_genre ,Lexicon ,Research and Applications ,drug name detection ,Machine Learning (cs.LG) ,Computer Science - Information Retrieval ,03 medical and health sciences ,Pharmacovigilance ,0302 clinical medicine ,Deep Learning ,Classifier (linguistics) ,0202 electrical engineering, electronic engineering, information engineering ,Humans ,Social media ,030212 general & internal medicine ,media_common ,Natural Language Processing ,Computer Science - Computation and Language ,Artificial neural network ,business.industry ,Ambiguity ,Ensemble learning ,Spelling ,3. Good health ,Pharmaceutical Preparations ,ensemble learning ,Artificial intelligence ,Neural Networks, Computer ,business ,F1 score ,Computation and Language (cs.CL) ,computer ,Information Retrieval (cs.IR) ,Natural language processing
- Abstract
Objective: After years of research, Twitter posts are now recognized as an important source of patient-generated data, providing unique insights into population health. A fundamental step to incorporating Twitter data in pharmacoepidemiological research is to automatically recognize medication mentions in tweets. Given that lexical searches for medication names may fail due to misspellings or ambiguity with common words, we propose a more advanced method to recognize them. Methods: We present Kusuri, an Ensemble Learning classifier, able to identify tweets mentioning drug products and dietary supplements. Kusuri ("medication" in Japanese) is composed of two modules. First, four different classifiers (lexicon-based, spelling-variant-based, pattern-based and one based on a weakly-trained neural network) are applied in parallel to discover tweets potentially containing medication names. Second, an ensemble of deep neural networks encoding morphological, semantical and long-range dependencies of important words in the tweets discovered is used to make the final decision. Results: On a balanced (50-50) corpus of 15,005 tweets, Kusuri demonstrated performances close to human annotators with 93.7% F1-score, the best score achieved thus far on this corpus. On a corpus made of all tweets posted by 113 Twitter users (98,959 tweets, with only 0.26% mentioning medications), Kusuri obtained 76.3% F1-score. There is not a prior drug extraction system that compares running on such an extremely unbalanced dataset. Conclusion: The system identifies tweets mentioning drug names with performance high enough to ensure its usefulness and ready to be integrated in larger natural language processing systems. This is a pre-copy-editing, author-produced PDF of an article accepted for publication in JAMIA following peer review. The definitive publisher-authenticated version is "D. Weissenbacher, A. Sarker, A. Klein, K. O'Connor, A. Magge, G. Gonzalez-Hernandez, Deep neural networks ensemble for detecting medication mentions in tweets, Journal of the American Medical Informatics Association, ocz156, 2019"
- Published
- 2019
48. Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set
- Author
-
Karen O'Connor, Arjun Magge, Jesus Ivan Flores Amaro, Davy Weissenbacher, Graciela Gonzalez Hernandez, and Ari Z. Klein
- Subjects
020205 medical informatics ,Computer science ,Interface (Java) ,social media ,coronavirus ,Datasets as Topic ,Health Informatics ,02 engineering and technology ,pandemics ,lcsh:Computer applications to medicine. Medical informatics ,computer.software_genre ,infodemiology ,Disease Outbreaks ,03 medical and health sciences ,0302 clinical medicine ,Resource (project management) ,0202 electrical engineering, electronic engineering, information engineering ,Humans ,Speech ,Longitudinal Studies ,030212 general & internal medicine ,Regular expression ,natural language processing ,Original Paper ,Artificial neural network ,SARS-CoV-2 ,business.industry ,lcsh:Public aspects of medicine ,COVID-19 ,lcsh:RA1-1270 ,data mining ,Pipeline (software) ,United States ,Metadata ,Geolocation ,lcsh:R858-859.7 ,epidemiology ,Self Report ,Artificial intelligence ,Timestamp ,business ,computer ,Natural language processing - Abstract
Background: In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone. Objective: The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the Centers for Disease Control and Prevention. Methods: Beginning January 23, 2020, we collected English tweets from the Twitter Streaming application programming interface that mention keywords related to COVID-19. We applied handwritten regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out “reported speech” (eg, quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on bidirectional encoder representations from transformers (BERT). Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020. Results: Interannotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen κ). A deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases of COVID-19. Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have US state–level geolocations. Conclusions: We have made the 13,714 tweets identified in this study, along with each tweet’s time stamp and US state–level geolocation, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
- Published
- 2021
- Full Text
- View/download PDF
49. A natural language processing pipeline to advance the use of Twitter data for digital epidemiology of adverse pregnancy outcomes
- Author
-
Lisa D. Levine, Haitao Cai, Graciela Gonzalez-Hernandez, Ari Z. Klein, and Davy Weissenbacher
- Subjects
medicine.medical_specialty ,Epidemiology ,Health Informatics ,lcsh:Computer applications to medicine. Medical informatics ,computer.software_genre ,Filter (software) ,Miscarriage ,Social media ,Pregnancy ,Machine learning ,medicine ,Data mining ,business.industry ,Natural language processing ,medicine.disease ,Pipeline (software) ,Infant mortality ,Computer Science Applications ,lcsh:R858-859.7 ,Observational study ,Artificial intelligence ,business ,computer - Abstract
Background: In the United States, 17% of pregnancies end in fetal loss: miscarriage or stillbirth. Preterm birth affects 10% of live births in the United States and is the leading cause of neonatal death globally. Preterm births with low birthweight are the second leading cause of infant mortality in the United States. Despite their prevalence, the causes of miscarriage, stillbirth, and preterm birth are largely unknown. Objective: The primary objectives of this study are to (1) assess whether women report miscarriage, stillbirth, and preterm birth, among others, on Twitter, and (2) develop natural language processing (NLP) methods to automatically identify users from which to select cases for large-scale observational studies. Methods: We handcrafted regular expressions to retrieve tweets that mention an adverse pregnancy outcome, from a database containing more than 400 million publicly available tweets posted by more than 100,000 users who have announced their pregnancy on Twitter. Two annotators independently annotated 8109 (one random tweet per user) of the 22,912 retrieved tweets, distinguishing those reporting that the user has personally experienced the outcome (“outcome” tweets) from those that merely mention the outcome (“non-outcome” tweets). Inter-annotator agreement was κ = 0.90 (Cohen’s kappa). We used the annotated tweets to train and evaluate feature-engineered and deep learning-based classifiers. We further annotated 7512 (of the 8109) tweets to develop a generalizable, rule-based module designed to filter out reported speech—that is, posts containing what was said by others—prior to automatic classification. We performed an extrinsic evaluation assessing whether the reported speech filter could improve the detection of women reporting adverse pregnancy outcomes on Twitter. 
Results: The tweets annotated as “outcome” include 1632 women reporting miscarriage, 119 stillbirth, 749 preterm birth or premature labor, 217 low birthweight, 558 NICU admission, and 458 fetal/infant loss in general. A deep neural network, BERT-based classifier achieved the highest overall F1-score (0.88) for automatically detecting “outcome” tweets (precision = 0.87, recall = 0.89), with an F1-score of at least 0.82 and a precision of at least 0.84 for each of the adverse pregnancy outcomes. Our reported speech filter significantly (P
- Published
- 2020
- Full Text
- View/download PDF
50. A Rule-based Approach to Determining Pregnancy Timeframe from Contextual Social Media Postings
- Author
-
Arjun Magge, Abeed Sarker, Graciela Gonzalez, Masoud Rouhizadeh, and Ari Z. Klein
- Subjects
Pregnancy ,020205 medical informatics ,Computer science ,Rule-based system ,02 engineering and technology ,medicine.disease ,Data science ,Filter (software) ,Clinical trial ,03 medical and health sciences ,0302 clinical medicine ,Social media mining ,0202 electrical engineering, electronic engineering, information engineering ,medicine ,Generalizability theory ,Social media ,Observational study ,030212 general & internal medicine - Abstract
Recent advances in social media mining have opened the door to observational studies that are limited only by the capacity of systems deployed to collect and analyze the data. The significance of this power becomes important when studying specific cohorts not typically found in clinical trials or other health-related research, such as pregnant women, who are generally excluded from participating in particular studies for safety concerns. A major challenge of pregnancy studies in social media is determining the pregnancy timeframe, given that the significance of some events (e.g., medication exposure) may depend on the trimester in which they occurred. Existing systems that mine pregnancy data from social media have limited coverage and generalizability and have not addressed the problem of automatically determining the estimated beginning and end of pregnancy, and general-purpose temporal taggers deployed on this dataset generate ambiguous results. We present here a rule-based system to automatically identify pregnancy timeframe based on linguistic clues about the progress of pregnancy in users' tweets. In addition, we demonstrate that we could also use this system to find and filter bots and others that repost or quote such expressions.
- Published
- 2018
- Full Text
- View/download PDF