Shanmukha Guttula

ORCID: 0009-0001-1982-024X
Research Areas
  • Data Quality and Management
  • Anomaly Detection Techniques and Applications
  • Machine Learning and Data Classification
  • Natural Language Processing Techniques
  • Big Data and Business Intelligence
  • Data Visualization and Analytics
  • Software Engineering Research
  • Artificial Intelligence in Law
  • Semantic Web and Ontologies
  • Advanced Surface Polishing Techniques
  • Time Series Analysis and Forecasting
  • High-Velocity Impact and Material Behavior
  • Topic Modeling
  • Software Testing and Debugging Techniques
  • Data Analysis with R
  • Imbalanced Data Classification Techniques
  • Metallurgy and Material Forming
  • Web Data Mining and Analysis

IBM Research - India
2020-2023

IBM (United States)
2022

It is well understood from the literature that the performance of a machine learning (ML) model is upper bounded by the quality of the data. While researchers and practitioners have focused on improving the quality of models (such as neural architecture search and automated feature selection), there are limited efforts towards improving data quality. One of the crucial requirements before consuming datasets for any application is to understand the dataset at hand, and failure to do so can result in inaccurate analytics and unreliable decisions. Assessing data quality across...

10.1145/3394486.3406477 article EN 2020-08-20

The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Data remains susceptible to errors or irregularities that may be introduced during the collection, aggregation or annotation stage. This necessitates profiling and assessment of data to understand its suitability for machine learning tasks, and failure to do so can result in inaccurate analytics and unreliable decisions. While researchers and practitioners have focused on improving the quality of models, there are limited efforts towards improving data quality.

10.1145/3447548.3470817 article EN 2021-08-12

It is widely accepted that data preparation is one of the most time-consuming steps of the machine learning (ML) lifecycle. It is also one of the most important steps, as the quality of data directly influences the quality of a model. In this tutorial, we will discuss the importance and role of exploratory data analysis (EDA) and data visualisation techniques to find data quality issues and for data preparation, relevant to building ML pipelines. We will also discuss the latest advances in these fields and bring out areas that need innovation. To make the tutorial actionable for practitioners, we will also discuss the most popular open-source packages that one can get...

10.1145/3534678.3542604 article EN Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2022-08-12

Democratisation of machine learning (ML) has been an important theme in the research community for the last several years, with notable progress made on model building with automated machine learning models. However, data play a central role in building ML models, and there is a need to focus on data-centric AI innovations. In this article, we first map the steps taken by data scientists in the data preparation phase and identify open areas and pain points via user interviews. We then propose a framework and four novel algorithms for exploratory data analysis and data quality...

10.1145/3603709 article EN cc-by-nc Journal of Data and Information Quality 2023-06-26

The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Various tools and techniques are available that assess data quality with respect to general cleaning and profiling checks. However, these techniques are not applicable to detecting data issues in the context of machine learning tasks, such as noisy labels or the existence of overlapping classes. We attempt to re-look at the data quality issues in the context of building a machine learning pipeline and build a tool that can detect, explain and remediate issues in the data, and systematically and automatically capture all the changes applied to the data. We introduce Data...

10.48550/arxiv.2108.05935 preprint EN other-oa arXiv (Cornell University) 2021-01-01

This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens. Our solution for scaling the context length of Granite 3B/8B code models from 2K/4K to 128K consists of a light-weight continual pretraining by gradually increasing its RoPE base frequency with repository-level file packing and length-upsampled long-context data. Additionally, we also release instruction-tuned models with long-context support, which are derived by further finetuning the long-context base models on a mix of permissively licensed short and long-context instruction-response pairs. While comparing...

10.48550/arxiv.2407.13739 preprint EN arXiv (Cornell University) 2024-07-18

The saying "Garbage In, Garbage Out" resonates perfectly within the machine learning and artificial intelligence community. While there has been considerable ongoing effort for improving the quality of models, there is relatively less focus on systematically analysing the quality of data with respect to its efficacy for machine learning. Assessing the quality of the data across intelligently designed metrics and developing corresponding transformation operations to address the quality gaps helps to reduce the effort of a data scientist for iterative debugging of the ML pipeline to improve model performance. In this...

10.1145/3493700.3493774 article EN 2022-01-07

Arvind Agarwal, Laura Chiticariu, Poornima Chozhiyath Raman, Marina Danilevsky, Diman Ghazi, Ankush Gupta, Shanmukha Guttula, Yannis Katsis, Rajasekar Krishnamurthy, Yunyao Li, Shubham Mudgal, Vitobha Munigala, Nicholas Phan, Dhaval Sonawane, Sneha Srinivasan, Sudarshan R. Thitte, Mitesh Vasa, Ramiya Venkatachalam, Vinitha Yaski, Huaiyu Zhu. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers. 2021.

10.18653/v1/2021.naacl-industry.28 article EN cc-by 2021-01-01

In mapping enterprise applications, data mapping remains a fundamental part of integration development, but it is time-consuming. An increasing number of applications lack naming standards, and nested field structures further add complexity for the integration developers. Once the mapping is done, data transformation is the next challenge for the users, since each application expects data to be in a certain format. Also, while building an integration flow, developers need to understand the format of the source and target data fields and come up with a transformation program that can change data from the source to the target format. The problem of automatic...

10.48550/arxiv.2307.03966 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Hypernym discovery is the problem of finding terms that have an is-a relationship with a given term. We introduce a new context type and a relatedness measure to differentiate hypernyms from other types of semantic relationships. Our Document Structure measure is based on the hierarchical position of terms in a document, and their presence or otherwise in definition text. This measure quantifies the document structure using multiple attributes and classes of weighted distance functions.

10.48550/arxiv.1811.12728 preprint EN other-oa arXiv (Cornell University) 2018-01-01