- Data Quality and Management
- Anomaly Detection Techniques and Applications
- Machine Learning and Data Classification
- Natural Language Processing Techniques
- Big Data and Business Intelligence
- Data Visualization and Analytics
- Software Engineering Research
- Artificial Intelligence in Law
- Semantic Web and Ontologies
- Advanced Surface Polishing Techniques
- Time Series Analysis and Forecasting
- High-Velocity Impact and Material Behavior
- Topic Modeling
- Software Testing and Debugging Techniques
- Data Analysis with R
- Imbalanced Data Classification Techniques
- Metallurgy and Material Forming
- Web Data Mining and Analysis
IBM Research - India
2020-2023
IBM (United States)
2022
It is well understood from the literature that the performance of a machine learning (ML) model is upper bounded by the quality of the data. While researchers and practitioners have focused on improving the quality of models (such as neural architecture search and automated feature selection), there are limited efforts towards improving data quality. One of the crucial requirements before consuming datasets for any application is to understand the dataset at hand; failure to do so can result in inaccurate analytics and unreliable decisions. Assessing data quality across...
The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Data remains susceptible to errors or irregularities that may be introduced during the collection, aggregation or annotation stage. This necessitates profiling and assessment of data to understand its suitability for machine learning tasks; failure to do so can result in inaccurate analytics and unreliable decisions. While researchers and practitioners have focused on improving the quality of models, there are limited efforts towards improving data quality.
It is widely accepted that data preparation is one of the most time-consuming steps of the machine learning (ML) lifecycle. It is also one of the most important steps, as the quality of data directly influences the quality of a model. In this tutorial, we will discuss the importance and role of exploratory data analysis (EDA) and data visualisation techniques to find data quality issues and for data preparation, relevant to building ML pipelines. We will also discuss the latest advances in these fields and bring out areas that need innovation. To make the tutorial actionable for practitioners, we will also discuss popular open-source packages that can get...
Democratisation of machine learning (ML) has been an important theme in the research community for the last several years, with notable progress made in automated model building. However, data play a central role in building ML models, and there is a need to focus on data-centric AI innovations. In this article, we first map the steps taken by data scientists in the data preparation phase and identify open areas and pain points via user interviews. We then propose a framework and four novel algorithms for exploratory data analysis and data quality...
The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Various tools and techniques are available that assess data quality with respect to general cleaning and profiling checks. However, these techniques are not applicable for detecting data issues in the context of machine learning tasks, such as noisy labels or the existence of overlapping classes. We attempt to re-look at the data quality issues in the context of building a machine learning pipeline and build a tool that can detect, explain and remediate issues in the data, and systematically and automatically capture all the changes applied to the data. We introduce the Data...
This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens. Our solution for scaling the context length of the Granite 3B/8B code models from 2K/4K to 128K consists of a light-weight continual pretraining that gradually increases the RoPE base frequency, with repository-level file packing and length-upsampled long-context data. Additionally, we also release instruction-tuned models with long-context support, which are derived by further finetuning the long-context base models on a mix of permissively licensed short and long-context instruction-response pairs. While comparing...
The saying "Garbage In, Garbage Out" resonates perfectly within the machine learning and artificial intelligence community. While there has been considerable ongoing effort for improving the quality of models, there is relatively less focus on systematically analysing the quality of data with respect to its efficacy for machine learning. Assessing the quality of the data across intelligently designed metrics and developing corresponding transformation operations to address the quality gaps helps to reduce the effort of a data scientist for iterative debugging of the ML pipeline to improve model performance. In this...
Arvind Agarwal, Laura Chiticariu, Poornima Chozhiyath Raman, Marina Danilevsky, Diman Ghazi, Ankush Gupta, Shanmukha Guttula, Yannis Katsis, Rajasekar Krishnamurthy, Yunyao Li, Shubham Mudgal, Vitobha Munigala, Nicholas Phan, Dhaval Sonawane, Sneha Srinivasan, Sudarshan R. Thitte, Mitesh Vasa, Ramiya Venkatachalam, Vinitha Yaski, Huaiyu Zhu. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers. 2021.
In integrating enterprise applications, data mapping remains a fundamental part of integration development, but it is time consuming. An increasing number of applications lack naming standards, and nested field structures further add complexity for the integration developers. Once the mapping is done, data transformation is the next challenge for the users, since each application expects data to be in a certain format. Also, while building the integration flow, developers need to understand the format of the source and target data and come up with a transformation program that can change the data from the source format to the target format. The problem of automatic...
Hypernym discovery is the problem of finding terms that have an is-a relationship with a given term. We introduce a new context type and a relatedness measure to differentiate hypernyms from other types of semantic relationships. Our Document Structure measure is based on the hierarchical position of terms in a document, and their presence or otherwise in definition text. This measure quantifies the document structure using multiple attributes and classes of weighted distance functions.