- Topic Modeling
- Spreadsheets and End-User Computing
- Natural Language Processing Techniques
- Statistics Education and Methodologies
- Software Engineering Research
- Data Quality and Management
- Data Visualization and Analytics
- Time Series Analysis and Forecasting
- Explainable Artificial Intelligence (XAI)
- Scientific Computing and Data Management
- Advanced Database Systems and Queries
- Machine Learning and Data Classification
- Semantic Web and Ontologies
- Algorithms and Data Compression
- Educational Games and Gamification
- Recommender Systems and Techniques
- Multimodal Machine Learning Applications
- Mobile Crowdsensing and Crowdsourcing
- Neural Networks and Applications
- Simulation Techniques and Applications
- Advanced Text Analysis Techniques
- Cloud Computing and Resource Management
- Text Readability and Simplification
- Cloud Data Security Solutions
- Artificial Intelligence in Games
Microsoft Research (United Kingdom)
2020-2024
University of Toronto
2024
University of California, San Diego
2024
University College London
2023-2024
University of Cambridge
2023-2024
Microsoft (United States)
2023
Carnegie Mellon University
2023
Microsoft (Belgium)
2023
Code-generating large language models translate natural language into code. However, only a small portion of the infinite space of naturalistic utterances is effective at guiding code generation. For non-expert end-user programmers, learning this is the challenge of abstraction matching. We examine this challenge in the specific context of data analysis in spreadsheets, in a system that maps the user's natural language query to Python code using the Codex generator, executes the code, and shows the result. We propose grounded abstraction matching, which bridges the abstraction gap by translating the code back into a systematic...
Large language models, such as OpenAI's Codex and DeepMind's AlphaCode, can generate code to solve a variety of problems expressed in natural language. This technology has already been commercialised in at least one widely-used programming editor extension: GitHub Copilot. In this paper, we explore how programming with large language models (LLM-assisted programming) is similar to, and differs from, prior conceptualisations of programmer assistance. We draw upon publicly available experience reports of LLM-assisted...
LLM-powered tools like ChatGPT Data Analysis have the potential to help users tackle the challenging task of data analysis programming, which requires expertise in data processing, programming, and statistics. However, our formative study (n=15) uncovered serious challenges in verifying AI-generated results and steering the AI (i.e., guiding the AI system to produce the desired output). We developed two contrasting approaches to address these challenges. The first (Stepwise) decomposes the problem into step-by-step subgoals with pairs of editable...
Generative AI tools can help users with many tasks. One such task is data analysis, which is notoriously challenging for non-expert end-users due to its expertise requirements, and where AI holds much potential, such as finding relevant data sources, proposing analysis strategies, and writing analysis code. To understand how data analysis workflows can be assisted or impaired by generative AI, we conducted a study (n=15) using Bing Chat via participatory prompting. Participatory prompting is a recently developed methodology in which researchers...
String data is common in real-world datasets: 67.6% of values in a sample of 1.8 million real Excel spreadsheets from the web were represented as text. Automatically cleaning such string data can have a significant impact on users. Previous approaches are limited to error detection, require that the user provides annotations, examples, or constraints to fix the errors, and focus independently on syntactic errors or semantic errors in strings, but ignore that strings often contain both syntactic and semantic substrings. We introduce DataVinci, a fully...
Code-generating large language models (LLMs) are transforming programming. Their capability to generate multi-step solutions provides even non-programmers a mechanism to harness the power of coding. Non-programmers often use spreadsheets to manage tabular data, as they offer an intuitive understanding of data manipulation and formula outcomes. Considering that LLMs can generate complex, potentially incorrect code, our focus is on enabling user trust in the accuracy of LLM-generated code. We present ColDeco, the first...
Spreadsheets are widely used for table manipulation and presentation. Stylistic formatting of these tables is an important property for presentation and analysis. As a result, popular spreadsheet software, such as Excel, supports automatically formatting tables based on rules. Unfortunately, writing such formatting rules can be challenging for users as it requires knowledge of the underlying rule language and data logic. We present Cornet, a system that tackles the novel problem of automatically learning such formatting rules from user-provided formatted cells. Cornet takes inspiration...
Imagine a developer who can only change their last line of code, how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering earlier tokens generated. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and...
The following topics are dealt with: computer science education; programming; software tools; computer aided instruction; software engineering; interactive systems; learning (artificial intelligence); data analysis; text analysis; groupware.
Formatting is an important property in tables for visualization, presentation, and analysis. Spreadsheet software allows users to automatically format their tables by writing data-dependent conditional formatting (CF) rules. Writing such rules is often challenging for users as it requires understanding and implementing the underlying logic. We present FormaT5, a transformer-based model that can generate a CF rule given the target table and a natural language description of the desired formatting logic. We find that user descriptions for these tasks are...
Imagine a developer who can only change their last line of code, how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering earlier tokens generated. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python,...
LLM-powered tools like ChatGPT Data Analysis have the potential to help users tackle the challenging task of data analysis programming, which requires expertise in data processing, programming, and statistics. However, our formative study (n=15) uncovered serious challenges in verifying AI-generated results and steering the AI (i.e., guiding the AI system to produce the desired output). We developed two contrasting approaches to address these challenges. The first (Stepwise) decomposes the problem into step-by-step subgoals with pairs of editable...
Users are increasingly being warned to check AI-generated content for correctness. Still, as LLMs (and other generative models) generate more complex output, such as summaries, tables, or code, it becomes harder for the user to audit or evaluate the output for quality or correctness. Hence, we are seeing the emergence of tool-assisted experiences to help the user double-check a piece of AI-generated content. We refer to these as co-audit tools. Co-audit tools complement prompt engineering techniques: one helps the user construct the input prompt, while the other helps them check the output response. As a specific...
Spreadsheets are widely used for table manipulation and presentation. Stylistic formatting of these tables is an important property for both presentation and analysis. As a result, popular spreadsheet software, such as Excel, supports automatically formatting tables based on rules. Unfortunately, writing such formatting rules can be challenging for users as it requires knowledge of the underlying rule language and data logic. We present CORNET, a system that tackles the novel problem of automatically learning such formatting rules from user examples in the form of formatted cells. CORNET takes...
Row completion is the task of augmenting a given table of text and numbers with additional, relevant rows. The task divides into two steps: subject suggestion, the task of populating the main column; and gap filling, the task of populating the remaining columns. We present state-of-the-art results for subject suggestion and gap filling measured on a standard benchmark (WikiTables).
Data management and analysis tasks are often carried out using spreadsheet software. A popular feature in most spreadsheet platforms is the ability to define data-dependent formatting rules. These rules can express actions such as "color red all entries in a column that are negative" or "bold all rows not containing error or failure." Unfortunately, users who want to exercise this functionality need to manually write these conditional formatting (CF) rules. We introduce CORNET, a system that automatically learns such conditional formatting rules from user examples. CORNET takes...
String data is common in real-world datasets: 67.6% of values in a sample of 1.8 million real Excel spreadsheets from the web were represented as text. Systems that successfully clean such string data can have a significant impact on real users. While prior work has explored errors in string data, proposed approaches have often been limited to error detection or require that the user provide annotations, examples, or constraints to fix the errors. Furthermore, these systems have focused independently on syntactic errors or semantic errors in strings, but ignore that strings...
Data management and analysis tasks are often carried out using spreadsheet software. A popular feature in most spreadsheet platforms is the ability to define data-dependent formatting rules. These rules can express actions such as "color red all entries in a column that are negative" or "bold all rows not containing error or failure". Unfortunately, users who want to exercise this functionality need to manually write these conditional formatting (CF) rules. We introduce Cornet, a system that automatically learns such conditional formatting rules from user examples. Cornet takes...
With the evolution of Large Language Models (LLMs) we can solve increasingly more complex NLP tasks across various domains, including spreadsheets. This work investigates whether LLMs can generate code (Excel OfficeScripts, a TypeScript API for executing many tasks in Excel) that solves Excel specific tasks provided via natural language user instructions. To do so we introduce a new large-scale benchmark, InstructExcel, created by leveraging the 'Automate' feature in Excel to automatically generate OfficeScripts from users' actions....
Formatting is an important property in tables for visualization, presentation, and analysis. Spreadsheet software allows users to automatically format their tables by writing data-dependent conditional formatting (CF) rules. Writing such rules is often challenging for users as it requires them to understand and implement the underlying logic. We present FormaT5, a transformer-based model that can generate a CF rule given the target table and a natural language description of the desired formatting logic. We find that user descriptions for these tasks are...
Justin Payan, Swaroop Mishra, Mukul Singh, Carina Negreanu, Christian Poelitz, Chitta Baral, Subhro Roy, Rasika Chakravarthy, Benjamin Van Durme, Elnaz Nouri. Findings of the Association for Computational Linguistics: EMNLP 2023.