Dong Dai

ORCID: 0000-0003-4078-8149
Research Areas
  • Advanced Data Storage Technologies
  • Distributed and Parallel Computing Systems
  • Parallel Computing and Optimization Techniques
  • Cloud Computing and Resource Management
  • Graph Theory and Algorithms
  • Scientific Computing and Data Management
  • Caching and Content Delivery
  • Software System Performance and Reliability
  • Interconnection Networks and Systems
  • Network Security and Intrusion Detection
  • Advanced Algorithms and Applications
  • IoT and Edge/Fog Computing
  • Distributed Systems and Fault Tolerance
  • Advanced Database Systems and Queries
  • Data Quality and Management
  • Algorithms and Data Compression
  • Anomaly Detection Techniques and Applications
  • Online Learning and Analytics
  • Genomics and Phylogenetic Studies
  • Research Data Management Practices
  • Advanced Memory and Neural Computing
  • Privacy-Preserving Technologies in Data
  • Advanced Sensor and Control Systems
  • Embedded Systems Design Techniques
  • Chromosomal and Genetic Variations

University of North Carolina at Charlotte
2018-2024

University of Delaware
2024

Shenyang University
2024

Texas Tech University
2013-2019

University of Science and Technology of China
2011-2014

Suzhou Research Institute
2011-2012

Henan Institute of Technology
2009

Today's high-performance computing (HPC) platforms are still dominated by batch jobs. Accordingly, effective job scheduling is crucial to obtain high system efficiency. Existing HPC schedulers typically leverage heuristic priority functions to prioritize and schedule jobs. But once configured and deployed by the experts, such functions can hardly adapt to changes in job loads, optimization goals, or system settings, potentially leading to degraded efficiency when changes occur. To address this fundamental issue, we present RLScheduler, an...

10.1109/sc41405.2020.00035 article EN 2020-11-01

Log-based anomaly detection has been extensively studied to help detect complex runtime anomalies in production systems. However, existing techniques exhibit several common issues. First, they rely heavily on expert-labeled logs to discern anomalous behavior patterns, but manually labelling enough log data to effectively train deep neural networks may take too long. Second, they rely on numeric model prediction based on vector input, which causes model decisions to be largely non-interpretable by humans. This further rules out...

10.1145/3588195.3595943 article EN 2023-08-07

Data provenance, or data lineage, describes the life cycle of data. In scientific workflows on HPC systems, scientists often seek diverse provenance (e.g., origins of products, usage patterns of datasets). Unfortunately, existing solutions cannot address these challenges due to their incompatible provenance models and/or system implementations. In this paper, we analyze four representative workflows in collaboration with domain scientists to identify concrete provenance needs. Based on the first-hand analysis, we propose a framework called PROV-IO...

10.1109/tpds.2024.3374555 article EN IEEE Transactions on Parallel and Distributed Systems 2024-03-14

Graphs have become increasingly important in many applications and domains, such as querying relationships in social networks or managing rich metadata generated in scientific computing. Many of these use cases require high-performance distributed graph databases for serving continuous updates from clients and, at the same time, answering complex queries regarding the current graph. These operations in graph databases, also referred to as online transaction processing (OLTP) operations, have specific design...

10.1145/3078597.3078606 article EN 2017-06-23

High-performance parallel file systems (PFSes) are of prime importance today. However, despite this importance, their reliability is much less studied than that of local storage systems, largely due to the lack of an effective analysis methodology.

10.1145/3205289.3205302 article EN 2018-06-12

As core components of high-performance computing (HPC) platforms, parallel file systems (PFSes) grow quickly in scale and complexity, and hence are subject to various failures and anomalies. Identifying their anomalies at runtime is critically helpful for HPC operators and administrators. Analyzing logs to detect anomalies in large-scale systems has been proven effective in many recent studies. However, applying these techniques to PFSes faces significant challenges due to the large volume and irregularity of PFS logs. This study proposes SentiLog, a new...

10.1145/3465332.3470873 article EN 2021-07-20

Object storage is considered a promising solution for next-generation (exascale) high-performance computing platforms because of its flexible object interface. However, delivering high burst-write throughput is still a critical challenge. Although deploying more storage servers can potentially provide higher throughput, it can be ineffective because the throughput is limited by a small number of stragglers (storage servers that are occasionally slower than others). In this paper, we propose a two-choice randomized dynamic I/O scheduler...

10.1109/sc.2014.57 article EN 2014-11-01
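
The two-choice idea named in the abstract above can be illustrated with the generic "power of two choices" technique: for each request, sample two servers at random and pick the less loaded one. The sketch below is not the paper's scheduler. The function names `two_choice_assign` and `random_assign`, the parameters, and the load metric (a simple count of assigned requests) are all assumptions made for illustration.

```python
import random

def two_choice_assign(num_requests, num_servers, rng=None):
    """Assign requests with the generic 'power of two choices' rule.

    Hypothetical illustration: load is simply the number of requests
    already assigned to a server, not a measured I/O latency.
    """
    rng = rng or random.Random()
    loads = [0] * num_servers
    for _ in range(num_requests):
        # Sample two distinct candidate servers uniformly at random.
        a, b = rng.sample(range(num_servers), 2)
        # Send the request to the candidate with the lighter load.
        target = a if loads[a] <= loads[b] else b
        loads[target] += 1
    return loads

def random_assign(num_requests, num_servers, rng=None):
    """Baseline: each request goes to one uniformly random server."""
    rng = rng or random.Random()
    loads = [0] * num_servers
    for _ in range(num_requests):
        loads[rng.randrange(num_servers)] += 1
    return loads

if __name__ == "__main__":
    rng = random.Random(0)
    print("two-choice max load:", max(two_choice_assign(10_000, 100, rng)))
    print("random max load:    ", max(random_assign(10_000, 100, rng)))
```

Comparing the maximum load of the two strategies gives a rough feel for why a second random probe helps steer writes away from overloaded servers; a real scheduler would presumably use observed server performance rather than a request count.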

HPC platforms are capable of generating huge amounts of metadata about different entities, including jobs, users, and files. Simple metadata, which describes the attributes of these entities (e.g., file size, name, and permissions mode), has been well recorded and used in current systems. However, only a limited amount of rich metadata, which records not only attributes but also the relationships between them, is captured. Rich metadata may include information from many sources, including users and applications, and must be integrated into a unified framework. Collecting, integrating,...

10.1109/pdsw.2014.11 article EN 2014-11-01

Many graph-related applications face the challenge of managing excessive and ever-growing graph data in a distributed environment. Therefore, it is necessary to consider a partitioning algorithm that distributes graph data onto multiple machines as the data comes in. Balancing the data distribution and minimizing the edge-cut ratio are two basic pursuits of the graph partitioning problem. While achieving balanced partitions for streaming graphs is easy, existing algorithms either fail to work on streaming workloads, or leave the edge-cut ratio to be further improved. Our research aims to provide...

10.1109/ccgrid.2018.00033 article EN 2018-05-01
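
The two partitioning goals named in the abstract above, balance and edge-cut ratio, can be made concrete with a short sketch. The definitions used here are the common textbook ones (fraction of edges crossing partitions, and largest partition size relative to the average); the function names and the tiny example graph are assumptions for illustration, not taken from the paper.

```python
from collections import Counter

def edge_cut_ratio(edges, part_of):
    """Fraction of edges whose endpoints lie in different partitions."""
    if not edges:
        return 0.0
    cut = sum(1 for u, v in edges if part_of[u] != part_of[v])
    return cut / len(edges)

def balance(part_of, num_parts):
    """Largest partition size divided by the ideal (average) size."""
    sizes = Counter(part_of.values())
    ideal = len(part_of) / num_parts
    return max(sizes.values()) / ideal

# Made-up example: a 4-cycle split into two partitions of two vertices.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
part_of = {0: 0, 1: 0, 2: 1, 3: 1}

if __name__ == "__main__":
    print("edge-cut ratio:", edge_cut_ratio(edges, part_of))
    print("balance:", balance(part_of, 2))
```

A balance of 1.0 means the partitions are perfectly even, and a lower edge-cut ratio means fewer edges span machines.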

With the emergence of the concept of 'big data', the ability to handle large datasets has become a critical consideration for the success of industrial organizations such as Google, Amazon, Yahoo!, and Facebook. As an important cloud computing framework for bulk data processing, Hadoop is widely used in these organizations. However, the performance of MapReduce is seriously limited by its stiff configuration strategy. Even for a single simple job in Hadoop, a number of tuning parameters have to be set by users. This may easily lead to performance loss due to some...

10.1109/iceccs.2014.17 article EN 2014-08-01

Provenance describes detailed information about the history of a piece of data, containing the relationships among elements such as users, processes, jobs, and workflows that contribute to the existence of the data. Provenance is key to supporting many data management functionalities that are increasingly important in operations such as identifying data sources, parameters, or assumptions behind a given result; auditing data usage; and understanding the details of how inputs are transformed into outputs. Despite its importance, however, provenance support...

10.1109/pact.2017.14 article EN 2017-09-01

Object storage has been increasingly adopted in high-performance computing for scientific and big data applications. With object storage, applications usually use object IDs, queries, or collections to identify the data instead of using files. Since the object store changes the way data is accessed by applications, it introduces new challenges for I/O prediction, which used to work based on inter-file and intra-file pattern detection. The key challenge is that the inputs of object-based applications are no longer expressed as static file names: they become much...

10.1109/bigdata.2014.7004242 article EN 2014 IEEE International Conference on Big Data (Big Data) 2014-10-01

Large-scale parallel file systems (PFSs) play an essential role in high-performance computing (HPC). However, despite their importance, their reliability is much less studied or understood compared with that of local storage or cloud systems. Recent failure incidents at real HPC centers have exposed the latent defects in PFS clusters as well as the urgent need for a systematic analysis. To address the challenge, we perform a study of the recovery and logging mechanisms of PFSs in this article. First, to trigger the recovery and logging operations of the target...

10.1145/3483447 article EN ACM Transactions on Storage 2022-03-29

Improving the performance of job executions, such as minimizing job waiting time, slowdown, or completion time, is an important goal of HPC batch schedulers. Such a goal is often accomplished using carefully designed heuristics based on job features, such as size and duration. However, these heuristics overlook runtime factors (e.g., cluster availability patterns), which may vary across time and make a previously sound scheduling decision no longer hold. In this study, we propose a new approach to incorporate runtime factors into scheduling for better...

10.1145/3502181.3531470 article EN 2022-06-23

With the increasing prevalence of scalable file systems in the context of high-performance computing (HPC), the importance of accurate anomaly detection on runtime logs is increasing. But as it currently stands, many state-of-the-art methods for log-based anomaly detection, such as DeepLog, have encountered numerous challenges when applied to logs from parallel file systems (PFSes), often due to their irregularity and ambiguity in time-based log sequences. To circumvent these problems, this study proposes ClusterLog, a pre-processing...

10.1109/ftxs56515.2022.00006 article EN 2022-11-01

Large-scale storage systems, a critical part of modern computing systems, are subject to various runtime bugs, failures, and anomalies in production. Identifying their anomalies at runtime is thus critical for users and administrators. Since logs record the important status of the system, log-based anomaly detection has been studied extensively for timely identification of system malfunctions. However, existing solutions share common limitations in representing log entries accurately and robustly, and hence cannot effectively handle log entries that were not seen in historical logs,...

10.1109/ipdps54959.2023.00028 article EN 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2023-05-01

Compared with traditional disk-based distributed storage systems, RAM-based storage has been proven to be an effective way to accelerate the processing speed of real-time applications. In this paper, we propose a memory-based cloud system called Sedna. Managing 'big data' across many commodity servers, Sedna provides high scalability, simple data access APIs, data consistency and persistency, and a new trigger for applications. To guarantee scalability and low latency, we design and implement a hierarchical structure to manage huge size...

10.1109/clusterw.2012.28 article EN 2012-09-01

Block correlations represent the semantic patterns in storage systems. These correlations can be exploited for data caching, pre-fetching, layout optimization, I/O scheduling, etc. In this paper, we introduce Block2Vec, a deep-learning-based strategy to mine block correlations. The core idea of Block2Vec is twofold. First, it proposes a new way to abstract blocks, which are considered as multi-dimensional vectors instead of traditional block IDs. In this way, we are able to capture the similarity between blocks through the distances between their vectors. Second,...

10.1109/icppw.2016.43 article EN 2016-08-01
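
The first idea in the abstract above, comparing blocks through the distances between their vectors, can be sketched as a nearest-neighbor lookup. The abstract does not say which distance is used or how the vectors are trained, so the Euclidean distance, the function names, and the hand-written 3-dimensional vectors below are all assumptions for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_block(target, vectors):
    """Return the block ID whose vector is closest to the target's vector."""
    return min(
        (block for block in vectors if block != target),
        key=lambda block: euclidean(vectors[target], vectors[block]),
    )

# Made-up vectors for three hypothetical block IDs. In a real system these
# would come from a trained embedding model, not be written by hand.
vectors = {
    100: [0.9, 0.1, 0.0],
    101: [0.8, 0.2, 0.1],
    500: [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    print("closest to block 100:", nearest_block(100, vectors))
```

A prefetcher or cache could use such a lookup to guess which blocks are likely to be accessed together, which is the kind of use the abstract lists for block correlations.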

Data replication is a key technique to achieve high data availability, reliability, and optimized performance in distributed storage systems. In recent years, with the emergence of new storage devices, heterogeneous object-based storage systems, such as a storage system with a mix of hard disk drives, solid-state drives, and other non-volatile memory devices, have become increasingly attractive since they combine the merits of different storage devices to deliver better promises. However, existing data replication schemes do not yet consider the distinct characteristics of these devices well, which could...

10.1109/tc.2019.2954089 article EN publisher-specific-oa IEEE Transactions on Computers 2019-11-19

Object-based parallel file systems have emerged as promising storage solutions for high-end computing (HEC) systems. Despite the fact that object storage provides a flexible interface, scheduling highly concurrent I/O requests that access a large number of objects remains a challenging problem, especially when stragglers (storage servers that are significantly slower than others) exist in the system. An efficient I/O scheduler needs to avoid possible stragglers to achieve low latency and high throughput. In this paper, we...

10.1109/icppw.2016.38 article EN 2016-08-01

10.1109/ipdps57955.2024.00069 article EN 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2024-05-27

The maximal information coefficient (MIC) has been proposed to discover relationships and associations between pairs of variables. It poses significant challenges for bioinformatics scientists to accelerate the MIC calculation, especially in genome sequencing and biological annotations. In this paper, we explore a parallel approach that uses the MapReduce framework to improve the computing efficiency and throughput of the MIC computation. The acceleration system includes data storage on HDFS, preprocessing algorithms,...

10.1109/tcbb.2016.2550430 article EN IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016-04-05

In order to predict network anomalies and overcome the drawbacks of current detection, early prediction of abnormal characteristics is introduced into the intrusion anomaly detection process. First, objective functions are constructed according to the feature subset dimensions and the accuracy rates of the model. Then the artificial fish swarm algorithm is used to search for the optimal subset, chaotic and feedback mechanisms are used to improve the algorithm, and the excessive intrusion rough sets produced in the classification process are simplified to guarantee simplicity...

10.4304/jcp.8.11.2990-2996 article EN Journal of Computers 2013-11-01

In recent years, more and more applications in the cloud need to process large-scale online datasets, which evolve over time as new entries are added and existing entries are modified. Several programming frameworks, such as Percolator and Oolong, have been proposed for incremental data processing and can achieve efficient updates with an event-driven abstraction. However, these frameworks are inherently asynchronous, leaving the heavy burden of managing synchronization to application developers, which further significantly restricts their...

10.1109/tcc.2018.2830348 article EN publisher-specific-oa IEEE Transactions on Cloud Computing 2018-06-21