Yair Schiff

ORCID: 0000-0003-0748-3706
Research Areas
  • Cell Image Analysis Techniques
  • Topological and Geometric Data Analysis
  • Computational Drug Discovery Methods
  • Multimodal Machine Learning Applications
  • Machine Learning in Materials Science
  • Domain Adaptation and Few-Shot Learning
  • Machine Learning in Healthcare
  • Stock Market Forecasting Methods
  • Neural Networks and Applications
  • Anomaly Detection Techniques and Applications
  • Advanced Neural Network Applications
  • Medical Image Segmentation Techniques
  • Explainable Artificial Intelligence (XAI)
  • Human Pose and Action Recognition
  • Stochastic Gradient Optimization Techniques
  • Genomics and Phylogenetic Studies
  • Video Analysis and Summarization
  • Generative Adversarial Networks and Image Synthesis
  • Cloud Data Security Solutions
  • Metabolomics and Mass Spectrometry Studies
  • Privacy-Preserving Technologies in Data
  • Topic Modeling
  • Machine Learning and Data Classification
  • Advanced Fluorescence Microscopy Techniques
  • Music and Audio Processing

Cornell University
2024

Cambridge Scientific (United States)
2023

IBM Research - Thomas J. Watson Research Center
2022

IBM Research - Africa
2022

IBM (United States)
2020-2021

Tabular datasets are ubiquitous in data science applications. Given their importance, it seems natural to apply state-of-the-art deep learning algorithms in order to fully unlock their potential. Here we propose neural network models that represent tabular time series and can optionally leverage their hierarchical structure. This results in two architectures for tabular time series: one for learning representations that is analogous to BERT and can be pre-trained end-to-end and used in downstream tasks, and one that is akin to GPT and can be used for generation of realistic synthetic tabular sequences. We...

10.1109/icassp39728.2021.9414142 article EN ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021-05-13
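The abstract above describes turning tabular time series into token sequences a BERT- or GPT-style model can consume. Below is a minimal, hypothetical sketch of one way that could look (quantile-binning each numeric field into discrete tokens); the function name and binning scheme are illustrative assumptions, not the paper's tokenizer.

```python
import numpy as np

def tokenize_rows(rows, n_bins=4):
    """Quantize each numeric column into quantile bins and emit one token per
    field, so a table becomes a flat token sequence (hypothetical sketch)."""
    rows = np.asarray(rows, dtype=float)
    tokens = []
    for col in range(rows.shape[1]):
        values = rows[:, col]
        # Bin edges from column quantiles; token id encodes (column, bin).
        edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
        bins = np.digitize(values, edges)      # 0 .. n_bins-1
        tokens.append(bins + col * n_bins)     # offset so columns don't collide
    # Interleave per-column tokens back into row order: row0 fields, row1, ...
    return np.stack(tokens, axis=1).reshape(-1).tolist()

table = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
seq = tokenize_rows(table)
print(seq)  # one token per (row, field) pair, 8 tokens total
```

A masked-prediction (BERT-like) or next-token (GPT-like) objective can then be applied directly to `seq`.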

Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off the Mamba block, extends it to a BiMamba component that supports bi-directionality, and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first...

10.48550/arxiv.2403.03234 preprint EN arXiv (Cornell University) 2024-03-04
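The RC-equivariance property mentioned above — applying a model to the reverse complement of a DNA sequence should give the reverse complement of applying it to the original — can be demonstrated numerically. The sketch below is a toy construction (averaging a map with its RC-conjugate, a common parameter-sharing trick), not the Caduceus implementation; the helper names are hypothetical.

```python
import numpy as np

def rc(x):
    """Reverse complement of one-hot DNA (L, 4), channels ordered A,C,G,T:
    reverse the sequence axis and swap A<->T, C<->G (flip the channel axis)."""
    return x[::-1, ::-1]

def make_rc_equivariant(g):
    """Wrap an arbitrary map g: (L,4)->(L,4) so the result f satisfies
    f(rc(x)) == rc(f(x)), by averaging g with its RC-conjugate."""
    return lambda x: 0.5 * (g(x) + rc(g(rc(x))))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
g = lambda x: x @ W                        # arbitrary map, not RC equivariant
f = make_rc_equivariant(g)

x = np.eye(4)[rng.integers(0, 4, size=8)]  # random one-hot DNA of length 8
print(np.allclose(f(rc(x)), rc(f(x))))     # True: f is RC equivariant
```

Because `rc` is a linear involution, the wrapped map commutes with it by construction, whatever `g` is.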

Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and offer cross-species prediction better than supervised models through fine-tuning with limited labeled data. We introduce PlantCaduceus, a DNA LM based on the Caduceus and Mamba architectures, pre-trained on a curated dataset of 16 Angiosperm genomes. Fine-tuning PlantCaduceus on limited Arabidopsis data for four tasks, including...

10.1101/2024.06.04.596709 preprint EN cc-by-nc bioRxiv (Cold Spring Harbor Laboratory) 2024-06-05

Image captioning has recently demonstrated impressive progress, largely owing to the introduction of neural network algorithms trained on curated datasets like MS-COCO. Often, work in this field is motivated by the promise of deployment of captioning systems in practical applications. However, the scarcity of data and contexts in many competition datasets renders the utility of these systems limited as an assistive technology in real-world settings, such as helping visually impaired people navigate and accomplish everyday tasks. This gap motivated the introduction of the novel VizWiz...

10.1613/jair.1.13113 article EN cc-by Journal of Artificial Intelligence Research 2022-01-31

Learning dynamics from dissipative chaotic systems is notoriously difficult due to their inherent instability, as formalized by their positive Lyapunov exponents, which exponentially amplify errors in the learned dynamics. However, many of these systems exhibit ergodicity and an attractor: a compact, highly complex manifold to which trajectories converge in finite time, and which supports an invariant measure, i.e., a probability distribution that is invariant under the action of the dynamics and dictates the long-term statistical behavior of the system. In this work,...

10.48550/arxiv.2402.04467 preprint EN arXiv (Cornell University) 2024-02-06
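The abstract's central observation — trajectories of a chaotic system diverge pointwise, yet their long-run statistics agree — is easy to see on the classic Lorenz system. This is an illustrative sketch only (forward-Euler integration, a crude histogram as a proxy for the invariant measure), not the paper's method.

```python
import numpy as np

def lorenz_trajectory(x0, n_steps=100_000, dt=0.01,
                      sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Integrate the Lorenz system with forward Euler (toy illustration)."""
    traj = np.empty((n_steps, 3))
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        dx = np.array([sigma * (x[1] - x[0]),
                       x[0] * (rho - x[2]) - x[1],
                       x[0] * x[1] - beta * x[2]])
        x = x + dt * dx
        traj[i] = x
    return traj

# Trajectories from nearly identical initial conditions separate quickly
# (positive Lyapunov exponent), so pointwise prediction is hopeless ...
a = lorenz_trajectory([1.0, 1.0, 1.0])
b = lorenz_trajectory([1.0, 1.0, 1.0 + 1e-6])
print(np.abs(a - b).max())  # separation on the order of the attractor size

# ... yet long-run histograms of z (a crude proxy for the invariant measure)
# nearly coincide -- the statistic such learning methods aim to match.
ha = np.histogram(a[5000:, 2], bins=16, range=(0.0, 50.0))[0] / (len(a) - 5000)
hb = np.histogram(b[5000:, 2], bins=16, range=(0.0, 50.0))[0] / (len(b) - 5000)
print(np.abs(ha - hb).sum())  # small total-variation-style gap
```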

10.48550/arxiv.2012.11696 preprint EN other-oa arXiv (Cornell University) 2020-01-01

Deep generative models have emerged as a powerful tool for learning useful molecular representations and designing novel molecules with desired properties, with applications in drug discovery and material design. However, most existing deep generative models are restricted due to a lack of spatial information. Here we propose augmentation of deep generative models with topological data analysis (TDA) representations, known as persistence images, for robust encoding of 3D geometry. We show that the TDA-augmented character-based Variational Auto-Encoder (VAE) outperforms...

10.1109/icassp43922.2022.9747088 article EN ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022-04-27

The field of Deep Learning is rich with empirical evidence of human-like performance on a variety of prediction tasks. However, despite these successes, the recent Predicting Generalization in Deep Learning (PGDL) NeurIPS 2020 competition suggests that there is a need for more robust and efficient measures of network generalization. In this work, we propose a new framework for evaluating the generalization capabilities of trained networks. We use perturbation response (PR) curves that capture the accuracy change of a given network as a function of varying...

10.48550/arxiv.2106.04765 preprint EN other-oa arXiv (Cornell University) 2021-01-01
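A perturbation-response curve of the kind described above can be sketched in a few lines: record accuracy as input perturbations grow. The data, the stand-in "trained network", and the Gaussian-noise perturbation below are all illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-cluster data; labels follow the sign of the first coordinate, and a
# fixed linear rule stands in for the trained network under evaluation.
X = rng.normal(size=(2000, 2)) + np.where(
    rng.random((2000, 1)) < 0.5, [2.0, 0.0], [-2.0, 0.0])
y = (X[:, 0] > 0).astype(int)
predict = lambda inputs: (inputs[:, 0] > 0).astype(int)

def pr_curve(predict, X, y, sigmas, n_repeats=5):
    """Accuracy as a function of input-noise magnitude (a PR-style curve)."""
    accs = []
    for s in sigmas:
        trials = [np.mean(predict(X + rng.normal(scale=s, size=X.shape)) == y)
                  for _ in range(n_repeats)]
        accs.append(float(np.mean(trials)))
    return accs

sigmas = [0.0, 0.5, 1.0, 2.0, 4.0]
curve = pr_curve(predict, X, y, sigmas)
print([round(c, 3) for c in curve])  # decays from 1.0 toward chance
```

Summary statistics of such a curve (e.g. area under it) are one way to compare the robustness of different trained networks.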

Deep generative models are increasingly becoming integral parts of the in silico molecule design pipeline and have dual goals of learning the chemical and structural features that render candidate molecules viable, while also being flexible enough to generate novel designs. Specifically, Variational Auto Encoders (VAEs) are generative models in which encoder-decoder network pairs are trained to reconstruct training data distributions in such a way that the latent space of the encoder is smooth. Therefore, novel candidates can be found by sampling from this...

10.48550/arxiv.2010.08548 preprint EN other-oa arXiv (Cornell University) 2020-01-01

While diffusion models excel at generating high-quality images, prior work reports a significant performance gap between diffusion and autoregressive (AR) methods in language modeling. In this work, we show that simple masked discrete diffusion is more performant than previously thought. We apply an effective training recipe that improves the performance of masked diffusion models and derive a simplified, Rao-Blackwellized objective that results in additional improvements. Our objective has a simple form -- it is a mixture of classical masked language modeling losses -- and can be used to train encoder-only models that admit...

10.48550/arxiv.2406.07524 preprint EN arXiv (Cornell University) 2024-06-11
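The "mixture of classical masked language modeling losses" can be illustrated with a toy objective: mask tokens at many noise levels and average the masked cross-entropies with a schedule-dependent weight. Everything below (the 1/t weight from a linear schedule, the lookup-table "denoiser") is a simplified sketch under stated assumptions, not the paper's training code.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MASK = 8, 8  # token ids 0..7; id 8 is the [MASK] absorbing state

def masked_diffusion_loss(x, log_probs, n_levels=16):
    """Toy masked-diffusion objective: masked cross-entropy averaged over
    noise levels t with a 1/t weight (linear schedule alpha_t = 1 - t).
    `log_probs` is a (len(x), VOCAB) table standing in for a denoiser."""
    losses = []
    for t in np.linspace(1.0 / n_levels, 1.0, n_levels):
        masked = rng.random(len(x)) < t            # mask each token w.p. t
        if not masked.any():
            continue
        z_t = np.where(masked, MASK, x)            # noisy input a model sees
        ce = -log_probs[masked, x[masked]].mean()  # CE on masked positions
        losses.append(ce / t)
    return float(np.mean(losses))

x = rng.integers(0, VOCAB, size=32)                # clean token sequence
uniform = np.full((32, VOCAB), -np.log(VOCAB))     # uninformed denoiser
sharp = np.full((32, VOCAB), np.log(1e-9))
sharp[np.arange(32), x] = 0.0                      # near-perfect denoiser
print(masked_diffusion_loss(x, sharp), masked_diffusion_loss(x, uniform))
```

A denoiser that recovers the clean tokens drives the objective toward zero, while an uninformed one pays roughly log-vocabulary nats per masked position.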

Diffusion models for continuous data have gained widespread adoption owing to their high-quality generation and control mechanisms. However, controllable diffusion on discrete data faces challenges, given that continuous guidance methods do not directly apply to discrete diffusion. Here, we provide a straightforward derivation of classifier-free and classifier-based guidance for discrete diffusion, as well as a new class of diffusion models that leverage uniform noise and are more guidable because they can continuously edit their outputs. We improve the quality of these models with a novel continuous-time...

10.48550/arxiv.2412.10193 preprint EN arXiv (Cornell University) 2024-12-13
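For intuition, classifier-free guidance over a discrete vocabulary amounts to tempering the ratio of conditional to unconditional per-token distributions. This generic logit-space sketch shows the effect of the guidance weight; it is not the paper's exact derivation, and the distributions are made up for illustration.

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, gamma):
    """Classifier-free guidance in logit space: p ∝ p_uncond *
    (p_cond / p_uncond)**gamma, i.e. uncond + gamma * (cond - uncond)."""
    return uncond_logits + gamma * (cond_logits - uncond_logits)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

uncond = np.log(np.array([0.25, 0.25, 0.25, 0.25]))
cond = np.log(np.array([0.40, 0.30, 0.20, 0.10]))  # class-conditioned

for gamma in (0.0, 1.0, 2.0):
    print(gamma, softmax(cfg_logits(cond, uncond, gamma)).round(3))
# gamma=0 recovers the unconditional distribution, gamma=1 the conditional,
# and gamma>1 sharpens mass toward tokens favored by the condition.
```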

Real-world data often exhibits bias, imbalance, and privacy risks. Synthetic datasets have emerged to address these issues. This paradigm relies on generative AI models to generate unbiased, privacy-preserving data while maintaining fidelity to the original data. However, assessing the trustworthiness of synthetic datasets and models is a critical challenge. We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models. It focuses on preventing bias and discrimination, ensures fidelity to the source data, assesses utility,...

10.48550/arxiv.2304.10819 preprint EN other-oa arXiv (Cornell University) 2023-01-01
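As a flavor of what one bias check inside such an audit might look like, here is a minimal demographic-parity statistic on a made-up labeled dataset. The function and data are hypothetical illustrations; the paper's framework covers many more dimensions (fidelity, utility, robustness, privacy).

```python
import numpy as np

def parity_gap(labels, groups):
    """Demographic-parity gap: absolute difference in positive rates between
    two groups -- one simple bias statistic an audit might report."""
    labels, groups = np.asarray(labels), np.asarray(groups)
    rate = lambda g: labels[groups == g].mean()
    return abs(rate(0) - rate(1))

# Hypothetical synthetic dataset: group 1 receives positives far more often.
labels = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
groups = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
print(parity_gap(labels, groups))  # large gap flags the dataset for review
```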

While diffusion models excel at generating high-quality samples, their latent variables typically lack semantic meaning and are not suitable for representation learning. Here, we propose InfoDiffusion, an algorithm that augments diffusion models with low-dimensional latent variables that capture high-level factors of variation in the data. InfoDiffusion relies on a learning objective regularized with the mutual information between observed and hidden variables, which improves latent space quality and prevents the latents from being ignored by expressive...

10.48550/arxiv.2306.08757 preprint EN cc-by arXiv (Cornell University) 2023-01-01

Image captioning systems have made substantial progress, largely due to the availability of curated datasets like Microsoft COCO or Vizwiz that have accurate descriptions of their corresponding images. Unfortunately, the scarce availability of such cleanly labeled data results in trained algorithms producing captions that can be terse and idiosyncratically specific to details in the image. We propose a new technique, cooperative distillation, that combines clean curated datasets with the web-scale automatically extracted captions of the Google Conceptual Captions dataset...

10.48550/arxiv.2012.11691 preprint EN cc-by arXiv (Cornell University) 2020-01-01

10.48550/arxiv.2011.01843 preprint EN other-oa arXiv (Cornell University) 2020-01-01

The field of Deep Learning is rich with empirical evidence of human-like performance on a variety of regression, classification, and control tasks. However, despite these successes, the field lacks strong theoretical error bounds and consistent measures of network generalization and learned invariances. In this work, we introduce two new measures, the Gi-score and Pal-score, that capture a deep neural network's generalization capabilities. Inspired by the Gini coefficient and Palma ratio, measures of income inequality, our statistics are robust measures of a network's invariance to...

10.48550/arxiv.2104.03469 preprint EN other-oa arXiv (Cornell University) 2021-01-01
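The Gini coefficient underlying the Gi-score is itself a one-liner worth seeing: an inequality statistic over a nonnegative distribution. The sketch below computes the standard coefficient via the sorted-cumulative-sum identity; how the paper turns this into a generalization measure is its own contribution and is not reproduced here.

```python
import numpy as np

def gini(values):
    """Gini coefficient of a nonnegative 1-D array: 0 for perfect equality,
    approaching 1 when one entry holds all the mass."""
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    cum = np.cumsum(v)
    # Equivalent to the standard rank-weighted formula on sorted values.
    return (n + 1 - 2.0 * cum.sum() / cum[-1]) / n

print(gini([1, 1, 1, 1]))  # 0.0: perfectly equal
print(gini([0, 0, 0, 1]))  # 0.75: highly concentrated
```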

Learning high-dimensional distributions is often done with explicit likelihood modeling or implicitly via minimizing integral probability metrics (IPMs). In this paper, we expand this learning paradigm to stochastic orders, namely, the convex or Choquet order between probability measures. Towards this end, exploiting the relation between convex orders and optimal transport, we introduce the Choquet-Toland distance between probability measures, which can be used as a drop-in replacement for IPMs. We also introduce the Variational Dominance Criterion (VDC) to learn probability measures...

10.48550/arxiv.2205.13684 preprint EN other-oa arXiv (Cornell University) 2022-01-01
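For context on the IPM baseline the abstract contrasts against, here is a standard sample-based IPM-style discrepancy, the (biased) squared maximum mean discrepancy with a Gaussian kernel. This is an illustrative baseline only; the Choquet-Toland distance proposed in the paper is a different, order-based object.

```python
import numpy as np

def mmd2(x, y, bandwidth=1.0):
    """Biased squared maximum mean discrepancy with a Gaussian kernel,
    a standard sample-based integral probability metric surrogate."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
same = mmd2(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
diff = mmd2(rng.normal(size=(500, 2)), rng.normal(loc=2.0, size=(500, 2)))
print(same < diff)  # True: MMD is near zero for matching distributions
```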