Juan Ciro

ORCID: 0009-0000-4179-3076
Research Areas
  • Natural Language Processing Techniques
  • Speech Recognition and Synthesis
  • Generative Adversarial Networks and Image Synthesis
  • Music and Audio Processing
  • Digital Media Forensic Detection
  • Adversarial Robustness in Machine Learning
  • Privacy-Preserving Technologies in Data
  • Cerebrospinal fluid and hydrocephalus
  • Autoimmune Neurological Disorders and Treatments
  • Brain Metastases and Treatment
  • Linguistics and Terminology Studies
  • Traumatic Brain Injury and Neurovascular Disturbances
  • Machine Learning and Data Classification
  • Trauma and Emergency Care Studies
  • Big Data Technologies and Applications
  • Topic Modeling
  • Myasthenia Gravis and Thymoma
  • Data Quality and Management
  • Domain Adaptation and Few-Shot Learning
  • Advanced Neural Network Applications

Alex Warstadt, Aaron Mueller, Leshem Choshen, Ethan Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mosquera, Bhargavi Paranjape, Adina Williams, Tal Linzen, Ryan Cotterell. Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning. 2023.

10.18653/v1/2023.conll-babylm.1 article EN cc-by 2023-01-01

Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating data-centric algorithms. We aim to foster...

10.48550/arxiv.2207.10062 preprint EN cc-by arXiv (Cornell University) 2022-01-01
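
As a rough illustration of the data-centric setup DataPerf benchmarks (the model and test set are held fixed while the training data varies), here is a minimal sketch using scikit-learn; the dataset and the candidate selection strategies are placeholders, not DataPerf's actual tasks or harness.

```python
# Hypothetical sketch of a data-centric evaluation: the model and test set
# are held fixed, and candidate training sets (e.g. different cleaning or
# selection strategies) are scored against each other. Not DataPerf code.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, random_state=0)

candidate_training_sets = {
    "full_pool": (X_pool, y_pool),
    "first_half": (X_pool[: len(X_pool) // 2], y_pool[: len(y_pool) // 2]),
}

for name, (X_train, y_train) in candidate_training_sets.items():
    model = LogisticRegression(max_iter=1000)  # fixed model architecture
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # the data, not the model, varies
```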

The People's Speech is a free-to-download 30,000-hour and growing supervised conversational English speech recognition dataset licensed for academic and commercial usage under CC-BY-SA (with a CC-BY subset). The data is collected via searching the Internet for appropriately licensed audio with existing transcriptions. We describe our data collection methodology and release our data collection system under the Apache 2.0 license. We show that a model trained on this dataset achieves a 9.98% word error rate on Librispeech's test-clean test set. Finally, we discuss the legal...

10.48550/arxiv.2111.09344 preprint EN other-oa arXiv (Cornell University) 2021-01-01
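
For reference, word error rate, the metric reported above, is the word-level edit distance between a reference transcript and a hypothesis, normalized by the reference length. A minimal sketch, not the paper's evaluation code:

```python
# Minimal word error rate (WER): Levenshtein distance over words,
# divided by the number of words in the reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion: ~0.167
```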

Human feedback plays a central role in the alignment of Large Language Models (LLMs). However, open questions remain about the methods (how), domains (where), people (who) and objectives (to what end) of human feedback collection. To navigate these questions, we introduce PRISM, a new dataset which maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries to their contextual preferences and fine-grained feedback in 8,011 live conversations with 21 LLMs. PRISM contributes (i) wide geographic and demographic...

10.48550/arxiv.2404.16019 preprint EN arXiv (Cornell University) 2024-04-24
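
PRISM's core contribution is linking who raters are to how they rate: per-participant survey records joined to per-conversation feedback. A minimal sketch of that join, with illustrative column and key names assumed (the dataset's actual schema may differ):

```python
# Hypothetical sketch of PRISM's core join: per-participant survey data
# linked to per-conversation feedback by a shared participant id. Column
# and key names here are illustrative assumptions, not the documented schema.
import pandas as pd

survey = pd.DataFrame([
    {"user_id": "u1", "country": "GB", "age_group": "25-34"},
    {"user_id": "u2", "country": "CO", "age_group": "35-44"},
])
conversations = pd.DataFrame([
    {"user_id": "u1", "model": "model_a", "score": 78},
    {"user_id": "u1", "model": "model_b", "score": 41},
    {"user_id": "u2", "model": "model_a", "score": 90},
])

merged = conversations.merge(survey, on="user_id", how="left")
print(merged.groupby("country")["score"].mean())  # feedback by demographic
```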

With text-to-image (T2I) generative AI models reaching wide audiences, it is critical to evaluate model robustness against non-obvious attacks to mitigate the generation of offensive images. By focusing on "implicitly adversarial" prompts (those that trigger T2I models to generate unsafe images for non-obvious reasons), we isolate a set of difficult safety issues that human creativity is well-suited to uncover. To this end, we built the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing diverse implicitly adversarial...

10.1145/3630106.3658913 article EN other-oa 2024 ACM Conference on Fairness, Accountability, and Transparency 2024-06-03
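
Schematically, the challenge collects prompts that pass a naive text filter yet lead a T2I model to unsafe outputs. The toy loop below illustrates that flow; the filter, generator, and rater are stand-ins, not the challenge's actual components:

```python
# Toy version of the red-teaming loop described above. All three components
# are stand-ins: a keyword filter, a fake generator, and a fake human rater.
def passes_text_filter(prompt: str) -> bool:
    # Implicitly adversarial prompts are exactly those that pass a check like this.
    blocked = {"violence", "gore"}
    return not any(word in prompt.lower() for word in blocked)

def generate_image(prompt: str) -> str:
    return f"<image for: {prompt}>"  # stand-in for the T2I model under test

def human_rates_unsafe(image: str) -> bool:
    return "red paint" in image  # stand-in for human safety annotation

collected = []
for prompt in ["a horse sleeping in red paint", "explicit violence"]:
    if passes_text_filter(prompt):
        if human_rates_unsafe(generate_image(prompt)):
            collected.append(prompt)  # benign-looking prompt, unsafe output

print(collected)  # the implicitly adversarial prompts the challenge collects
```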

The generative AI revolution in recent years has been spurred by an expansion in compute power and data quantity, which together enable extensive pre-training of powerful text-to-image (T2I) models. With their greater capabilities to generate realistic and creative content, T2I models like DALL-E, MidJourney, Imagen or Stable Diffusion are reaching ever wider audiences. Any unsafe behaviors inherited from pretraining on uncurated internet-scraped datasets thus have the potential to cause...

10.48550/arxiv.2305.14384 preprint EN cc-by arXiv (Cornell University) 2023-01-01

A paraneoplastic syndrome characterized by neuropsychiatric symptoms, involuntary movements and seizures has recently been associated with antibodies targeting the NMDA (N-methyl-D-aspartate) receptor in patients with an ovarian teratoma. Severe neurological impairment is frequent, and treatment in the intensive care unit is often required because of ventilatory failure and life-threatening autonomic instability. Tumor removal is curative in many cases, and improvement is demonstrated shortly after surgery. Here we report on a...

10.33588/rn.5209.2010568 article EN Revista de Neurología 2011-01-01

With the rise of text-to-image (T2I) generative AI models reaching wide audiences, it is critical to evaluate model robustness against non-obvious attacks to mitigate the generation of offensive images. By focusing on "implicitly adversarial" prompts (those that trigger T2I models to generate unsafe images for non-obvious reasons), we isolate a set of difficult safety issues that human creativity is well-suited to uncover. To this end, we built the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing diverse implicitly...

10.48550/arxiv.2403.12075 preprint EN arXiv (Cornell University) 2024-02-14

The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models.

10.48550/arxiv.2308.15710 preprint EN cc-by arXiv (Cornell University) 2023-01-01
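
Since each file can carry transcriptions in several languages, a typical first step is filtering the metadata down to a target language. A minimal sketch with an assumed record layout (the dataset's actual schema may differ):

```python
# Hypothetical sketch: selecting files that carry an English transcription
# from metadata shaped like the dataset described above. The record layout
# is an assumption, not the dataset's documented schema.
records = [
    {"file": "a.ogg", "transcriptions": {"en": "hello", "es": "hola"}},
    {"file": "b.ogg", "transcriptions": {"de": "hallo"}},
]

english_subset = [r for r in records if "en" in r["transcriptions"]]
print([r["file"] for r in english_subset])  # ['a.ogg']
```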

This paper illustrates locality sensitive hashing (LSH) models for the identification and removal of nearly redundant data in a text dataset. To evaluate the different models, we create an artificial dataset for deduplication using English Wikipedia articles. Area-Under-Curve (AUC) scores over 0.9 were observed for most models, with the best model reaching 0.96. Deduplication enables more effective model training by preventing the model from learning a distribution that differs from the real one as a result of repeated data.

10.48550/arxiv.2112.11478 preprint EN other-oa arXiv (Cornell University) 2021-01-01
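
A minimal near-duplicate detection sketch in the spirit of the paper, using MinHash-based LSH from the datasketch library; this is a common implementation of the technique, not necessarily the models evaluated in the paper:

```python
# MinHash-LSH near-duplicate detection with the datasketch library.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf8"))
    return m

docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumped over the lazy dog",  # near duplicate of d1
    "d3": "an entirely different sentence about wikipedia",
}

# Jaccard-similarity threshold above which documents count as duplicates.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
for key, text in docs.items():
    lsh.insert(key, minhash(text))

print(lsh.query(minhash(docs["d1"])))  # typically returns d1 and d2, not d3
```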