NFDI4DS | UHH-SEMS - Publication Details

Shanmukha Guttula

ORCID: 0009-0001-1982-024X

About

Contact & Profiles

A5082881366

Research Areas

Data Quality and Management
Anomaly Detection Techniques and Applications
Machine Learning and Data Classification
Natural Language Processing Techniques
Big Data and Business Intelligence
Data Visualization and Analytics
Software Engineering Research
Artificial Intelligence in Law
Semantic Web and Ontologies
Advanced Surface Polishing Techniques
Time Series Analysis and Forecasting
High-Velocity Impact and Material Behavior
Topic Modeling
Software Testing and Debugging Techniques
Data Analysis with R
Imbalanced Data Classification Techniques
Metallurgy and Material Forming
Web Data Mining and Analysis

IBM Research - India
2020-2023

IBM (United States)
2022

Overview and Importance of Data Quality for Machine Learning Tasks

OPENALEX - Publications

Abhinav Jain, Hima Patel, Lokesh Nagalapatti, Nitin Gupta, Sameep Mehta, and 5 more

It is well understood from literature that the performance of a machine learning (ML) model is upper bounded by the quality of the data. While researchers and practitioners have focused on improving the quality of models (such as neural architecture search and automated feature selection), there are limited efforts towards improving the data quality. One of the crucial requirements before consuming datasets for any application is to understand the dataset at hand, and failure to do so can result in inaccurate analytics and unreliable decisions. Assessing the quality of the data across...

10.1145/3394486.3406477 article EN 2020-08-20

Data Quality for Machine Learning Tasks

Nitin Gupta, Shashank Mujumdar, Hima Patel, Satoshi Masuda, Naveen Panwar, and 6 more

The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Data remains susceptible to errors or irregularities that may be introduced during the collection, aggregation or annotation stage. This necessitates profiling and assessment of data to understand its suitability for machine learning tasks, and failure to do so can result in inaccurate analytics and unreliable decisions. While researchers and practitioners have focused on improving the quality of models, there are limited efforts towards improving the data quality.

10.1145/3447548.3470817 article EN 2021-08-12

Advances in Exploratory Data Analysis, Visualisation and Quality for Data Centric AI Systems

Hima Patel, Shanmukha Guttula, Ruhi Sharma Mittal, Naresh Manwani, Laure Berti‐Équille, and 1 more

It is widely accepted that data preparation is one of the most time-consuming steps of the machine learning (ML) lifecycle. It is also one of the most important steps, as the quality of data directly influences the quality of a model. In this tutorial, we will discuss the importance and the role of exploratory data analysis (EDA) and data visualisation techniques to find data quality issues and for data preparation, relevant to building ML pipelines. We will also discuss the latest advances in these fields and bring out areas that need innovation. To make the tutorial actionable for practitioners, we will also discuss the most popular open-source packages that one can get...

10.1145/3534678.3542604 article EN Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2022-08-12

A Data-centric AI Framework for Automating Exploratory Data Analysis and Data Quality Tasks

Hima Patel, Shanmukha Guttula, Nitin Gupta, Sandeep Hans, Ruhi Sharma Mittal, and 1 more

Democratisation of machine learning (ML) has been an important theme in the research community for the last several years, with notable progress made by the model-building community with automated machine learning models. However, data play a central role in building ML models and there is a need to focus on data-centric AI innovations. In this article, we first map the steps taken by data scientists for the data preparation phase and identify open areas and pain points via user interviews. We then propose a framework and four novel algorithms for exploratory data analysis and data quality...

10.1145/3603709 article EN cc-by-nc Journal of Data and Information Quality 2023-06-26

Data Quality Toolkit: Automatic assessment of data quality and remediation for machine learning datasets

Nitin Gupta, Hima Patel, Shazia Afzal, Naveen Panwar, Ruhi Sharma Mittal, and 8 more

The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Various tools and techniques are available that assess data quality with respect to general cleaning and profiling checks. However, these techniques are not applicable to detect data issues in the context of machine learning tasks, like noisy labels, existence of overlapping classes etc. We attempt to re-look at the data quality issues in the context of building a machine learning pipeline and build a tool that can detect, explain and remediate issues in the data, and systematically and automatically capture all the changes applied to the data. We introduce the Data...

10.48550/arxiv.2108.05935 preprint EN other-oa arXiv (Cornell University) 2021-01-01

Scaling Granite Code Models to 128K Context

Matt Stallone, Vaibhav Saxena, Leonid Karlinsky, Bridget McGinn, Tim Bula, and 17 more

This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens. Our solution for scaling the context length of Granite 3B/8B code models from 2K/4K to 128K consists of a light-weight continual pretraining by gradually increasing its RoPE base frequency with repository-level file packing and length-upsampled long-context data. Additionally, we also release instruction-tuned models with long-context support, which are derived by further finetuning the long-context base models on a mix of permissively licensed short and long-context instruction-response pairs. While comparing...

10.48550/arxiv.2407.13739 preprint EN arXiv (Cornell University) 2024-07-18

Automatic Assessment of Quality of your Data for AI

Hima Patel, Nitin Gupta, Naveen Panwar, Ruhi Sharma Mittal, Sameep Mehta, and 5 more

The saying "Garbage In, Garbage Out" resonates perfectly within the machine learning and artificial intelligence community. While there has been considerable ongoing effort for improving the quality of models, there is relatively less focus on systematically analysing the quality of data with respect to its efficacy for machine learning. Assessing the quality of the data across intelligently designed metrics and developing corresponding transformation operations to address the quality gaps helps to reduce the effort of a data scientist for iterative debugging of the ML pipeline to improve model performance. In this...

10.1145/3493700.3493774 article EN 2022-01-07

Development of an Enterprise-Grade Contract Understanding System

Arvind Agarwal, Laura Chiticariu, Poornima Chozhiyath Raman, Marina Danilevsky, Diman Ghazi, and 15 more

Arvind Agarwal, Laura Chiticariu, Poornima Chozhiyath Raman, Marina Danilevsky, Diman Ghazi, Ankush Gupta, Shanmukha Guttula, Yannis Katsis, Rajasekar Krishnamurthy, Yunyao Li, Shubham Mudgal, Vitobha Munigala, Nicholas Phan, Dhaval Sonawane, Sneha Srinivasan, Sudarshan R. Thitte, Mitesh Vasa, Ramiya Venkatachalam, Vinitha Yaski, Huaiyu Zhu. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers. 2021.

10.18653/v1/2021.naacl-industry.28 article EN cc-by 2021-01-01

Multi-Intent Detection in User Provided Annotations for Programming by Examples Systems

Nischal Ashok Kumar, Nitin Gupta, Shanmukha Guttula, Hima Patel

In mapping enterprise applications, data mapping remains a fundamental part of integration development, but it is time consuming. An increasing number of applications lack naming standards, and nested field structures further add complexity for the integration developers. Once the mapping is done, data transformation is the next challenge for the users, since each application expects data to be in a certain format. Also, while building the flow, developers need to understand the format of the source and target data and come up with a transformation program that can change data from the source to the target format. The problem of automatic...

10.48550/arxiv.2307.03966 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Document Structure Measure for Hypernym discovery

Aswin Kannan, Shanmukha Guttula, Balaji Ganesan, Hima Karanam, Arun Kumar

Hypernym discovery is the problem of finding terms that have an is-a relationship with a given term. We introduce a new context type, and a relatedness measure to differentiate hypernyms from other types of semantic relationships. Our Document Structure measure is based on the hierarchical position of terms in a document, and their presence or otherwise in definition text. This measure quantifies the document structure using multiple attributes, and classes of weighted distance functions.

10.48550/arxiv.1811.12728 preprint EN other-oa arXiv (Cornell University) 2018-01-01