Kyle C. Hale

ORCID: 0000-0001-9488-9311
Research Areas
  • Parallel Computing and Optimization Techniques
  • Cloud Computing and Resource Management
  • Advanced Data Storage Technologies
  • Security and Verification in Computing
  • Distributed Systems and Fault Tolerance
  • Software System Performance and Reliability
  • Distributed and Parallel Computing Systems
  • Real-Time Systems Scheduling
  • Advanced Malware Detection Techniques
  • Explainable Artificial Intelligence (XAI)
  • Interconnection Networks and Systems
  • Software Testing and Debugging Techniques
  • Network Packet Processing and Optimization
  • Opportunistic and Delay-Tolerant Networks
  • Embedded Systems Design Techniques
  • Noise Effects and Management
  • Species Distribution and Climate Change
  • Cognitive Functions and Memory
  • Machine Learning and Data Classification
  • Supercapacitor Materials and Fabrication
  • Data Stream Mining Techniques
  • IoT and Edge/Fog Computing
  • Digital and Cyber Forensics
  • Human-Automation Interaction and Safety
  • Human-Animal Interaction Studies

Illinois Institute of Technology
2017-2024

Northwestern University
2012-2016

The University of Texas at Austin
2009

Large memory workloads with favorable locality of reference can benefit by extending the memory hierarchy across machines. Systems that enable such far memory configurations can improve application performance and overall memory utilization in a cluster. There are two current alternatives for software-based far memory: kernel-based and library-based. Kernel-based approaches sacrifice performance to achieve programmer transparency, while library-based approaches sacrifice programmer transparency to achieve performance. We argue for a novel third approach, the compiler-based approach, which...

10.1145/3617232.3624856 article EN cc-by 2024-04-17

The needs of parallel runtime systems and the increasingly sophisticated languages and compilers they support do not line up with the services provided by general-purpose OSes. Furthermore, the semantics available to the runtime are lost at the system-call boundary in such OSes. Finally, because a runtime executes at user-level in such an environment, it cannot leverage hardware features that require kernel-mode privileges; a large portion of the functionality of the machine is lost to it. These limitations warp the design, implementation, functionality, and performance...

10.1145/2749246.2749264 article EN 2015-06-08

An important class of applications, including programs that leverage third-party libraries, programs that use user-defined functions in databases, and serverless applications, benefit from isolating the execution of untrusted code at the granularity of individual functions or function invocations. However, existing isolation mechanisms were not designed for this use case; rather, they have been adapted to it. We introduce virtines, a new abstraction designed specifically for function granularity isolation, and describe how we build virtines from the ground up by pushing hardware...

10.1145/3492321.3519553 preprint EN 2022-03-28

In our hybrid runtime (HRT) model, a parallel runtime system and the application are together transformed into a specialized OS kernel that operates entirely in kernel mode and can thus implement exactly its desired abstractions on top of fully privileged hardware access. We describe the design and implementation of two new tools that support the HRT model. The first, the Nautilus Aerokernel, is a kernel framework specifically designed to enable HRTs for x64 and Xeon Phi hardware. Aerokernel primitives are specialized for HRT creation and thus can operate much faster, up to two orders...

10.1145/2892242.2892255 article EN 2016-03-25

Chip multiprocessors (CMPs) have emerged as a primary vehicle for overcoming the limitations of uniprocessor scaling, with power constraints now representing a key factor of CMP design. Recent studies have shown that the on-chip interconnection network (NOC) can consume as much as 36% of overall chip power. To date, researchers have employed several techniques to reduce power consumption in the network, including the use of on/off links by means of power gating. However, many of these techniques target dynamic power, and those that consider static power focus...

10.1145/1645213.1645227 article EN 2009-12-12

We argue that the implementation of VMM-based virtual services for a guest should extend into the guest itself, even without its cooperation. Placing service components directly into the guest OS or application can reduce implementation complexity and increase performance. In this paper we show that the set of tools in the VMM required to enable a broad range of such guest-context services is fairly small. Further, we outline and evaluate these tools and describe their design and implementation in the context of Guest Examination and Revision Services (GEARS), a new framework within the Palacios VMM. We then describe two...

10.1145/2371536.2371542 article EN 2012-09-18

Runtimes and applications that rely heavily on asynchronous event notifications suffer when such notifications must traverse several layers of processing in software. Many of these layers necessarily exist in order to support a general-purpose, portable kernel architecture, but they introduce considerable overheads for demanding, high-performance parallel runtimes and applications. Other overheads can arise from a mismatched event programming or system call interface. Whatever the case, the average latency and variance in latency of commonly used software...

10.1109/mascots.2018.00041 article EN 2018-09-01

Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts...

10.1145/3453483.3460969 article EN 2021-06-18

The hybrid runtime (HRT) model offers a path towards high performance and efficiency. By integrating the OS kernel, runtime, and application, an HRT allows the runtime developer to leverage the full feature set of the hardware and specialize OS services to the runtime's needs. However, conforming to the HRT model currently requires a port of the runtime to the kernel level, for example to the Nautilus kernel framework, and this requires knowledge of kernel internals. In response, we developed Multiverse, a system that bridges the gap between a built-from-scratch HRT and a legacy runtime system. Multiverse allows unmodified applications...

10.1109/icac.2017.24 preprint EN 2017-07-01

The hybrid runtime (HRT) model offers a plausible path towards high performance and efficiency. By integrating the OS kernel, parallel runtime, and application, an HRT allows the runtime developer to leverage the full privileged feature set of the hardware and specialize OS services to the runtime's needs. However, conforming to the HRT model currently requires a complete port of the runtime and application to the kernel level, for example to our Nautilus kernel framework, and this requires knowledge of kernel internals. In response, we developed Multiverse, a system that bridges the gap between...

10.1145/2907294.2907309 article EN 2016-05-31

Memory mapping enhances decision tree implementations by enabling constant-time statistical inference, and is particularly effective when memory-mapped tables fit in processor cache. However, memory mapping is more challenging when applied to random forests (ensembles of many trees), as the table sizes can easily outstrip cache capacity. We argue that careful system design for parallel and cache efficiency can make memory mapping effective for random forests. Our preliminary results show that memory-mapped forests can speed up inference latency by a factor of 30×.

10.1145/3458744.3474052 article EN 2021-08-09

We argue that memory content-tracking across the nodes of a parallel machine should be factored into a distinct platform service on top of which application services can be built. ConCORD is a proof-of-concept system that we have developed and evaluated to test this claim. Our core insight is that many application services can be described as a query over memory content. This insight leads to a core concept in ConCORD, the content-aware service command architecture, in which an application service is implemented as a parametrization of a single general query that ConCORD knows how to execute well. ConCORD dynamically adapts the execution of the query to the amount...

10.1145/2600212.2600214 article EN 2014-06-20

OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK),...

10.1145/3458817.3476183 article EN 2021-10-21

We describe the design, implementation, and evaluation of emulated hardware transactional memory, specifically the Intel Haswell Restricted Transactional Memory (RTM) architectural extensions for x86/64, within a virtual machine monitor (VMM). Our system allows users to investigate RTM on hardware that does not provide it, debug their RTM-based transactional software, and stress test it on diverse emulated hardware configurations, including potential future configurations that might support arbitrary length transactions. Initial performance...

10.1145/2612262.2612265 article EN 2014-06-10

The Julia programming language continues to gain popularity both for its potential for programmer productivity and for its impressive performance on scientific code. It thus holds potential for large-scale HPC, but we have not yet seen this potential fully realized. While Julia certainly has the machinery to run at scale, and while others have done so for embarrassingly parallel workloads, we have yet to see an analysis of Julia's performance on communication-intensive codes that are so common in the HPC domain. In this paper we investigate Julia's performance in this light, first with a suite of microbenchmarks within...

10.48550/arxiv.2109.14072 preprint EN cc-by-nc-nd arXiv (Cornell University) 2021-01-01

Hearing loss can render an aviator more susceptible to the adverse effects of degraded communication signals and consequently lead to increased allocation of mental resources to hear (referred to as listening effort). Army aviation hearing standards, which are primarily based on pure tone and speech recognition test scores in quiet environments, do not necessarily predict the functional impact of hearing loss. The Army has recently adopted a new Military Operational Hearing Test (MOHT) to assess the functional impact of hearing loss. The current study aimed to validate the MOHT,...

10.1121/10.0022692 article EN The Journal of the Acoustical Society of America 2023-10-01

Random forests use ensembles of decision trees to boost accuracy for machine learning tasks. However, large ensembles slow down inference on platforms that process each tree in an ensemble individually. We present Bolt, a platform that restructures whole random forests, not just individual trees, to speed up inference. Conceptually, Bolt maps every path in each tree to a lookup table which, if cache were large enough, would allow inference with just one memory access. When the size of the lookup table exceeds cache capacity, Bolt employs a novel combination of lossless compression,...

10.1145/3528535.3531519 article EN 2022-05-31

Specialized operating systems have enjoyed a recent revival driven both by a pressing need to rethink the system software stack in several domains and by the convenience and flexibility that on-demand infrastructure and virtual execution environments offer. Several barriers exist which curtail the widespread adoption of such highly specialized systems, but perhaps the most consequential of them is that these systems are simply difficult to use. In this paper we discuss the challenges faced by specialized OSes, both for HPC and more broadly, and argue that what is needed...

10.1145/3322789.3328742 article EN 2019-06-17

Address translation fundamentally embodies a function that maps from virtual to physical addresses. In current systems, the function is encoded by the kernel in an in-memory radix tree structure (the page table hierarchy) which is then interpreted by hardware (the pagewalker, pagewalk-caches, and TLBs). We consider implementing the function itself as reconfigurable hardware: does this make any sense? To study this question, we collected numerous in-situ Linux page tables for a wide range of workloads, including those for HPC, to serve as example...

10.1109/mascots.2019.00047 article EN 2019-09-25

In our hybrid runtime (HRT) model, a parallel runtime system and the application are together transformed into a specialized OS kernel that operates entirely in kernel mode and can thus implement exactly its desired abstractions on top of fully privileged hardware access. We describe the design and implementation of two new tools that support the HRT model. The first, the Nautilus Aerokernel, is a kernel framework specifically designed to enable HRTs for x64 and Xeon Phi hardware. Aerokernel primitives are specialized for HRT creation and thus can operate much faster, up to two orders...

10.1145/3007611.2892255 article EN ACM SIGPLAN Notices 2016-03-25

Software prefetching and hardware-based cache allocation techniques (CAT) have been successfully applied in main-memory database engines to fetch data into the cache before it is needed and to partition a shared last-level cache (LLC) to prevent concurrent tasks from evicting each other's data. We investigate the interaction of these techniques and demonstrate that while a single strategy is sufficient, the combination of both is only effective if the cache partitioning adapts based on the types of tasks currently sharing an LLC. We present a simple, yet effective, scheme...

10.1145/3465998.3466016 article EN 2021-06-18

Enabling efficient fine-grained task parallelism is a significant challenge for hardware platforms with increasingly many cores. Existing techniques do not scale to hundreds of threads due to the high cost of synchronization in concurrent data structures. To overcome these limitations we present XQueue, a novel lock-less concurrent queuing system with relaxed ordering semantics that is geared towards realizing scalability up to hundreds of concurrent threads. We demonstrate the scalability of XQueue using microbenchmarks and show that XQueue can deliver concurrent operations with latencies...

10.1109/mascots53633.2021.9614292 article EN 2021-11-03

For workloads that place strenuous demands on system software, novel operating system designs like unikernels, library OSes, and hybrid runtimes offer a promising path forward. However, while these systems can outperform general-purpose OSes, they have limited ability to support legacy applications. Multi-OS environments, where the application's execution is split between a compute plane and a data plane operating system, address this challenge, but reasoning about the performance of applications that run in such a split execution environment is currently...

10.1109/mascots.2019.00044 article EN 2019-09-25