Pinghui Wang

ORCID: 0000-0002-1434-837X
Research Areas
  • Complex Network Analysis Techniques
  • Advanced Graph Neural Networks
  • Network Security and Intrusion Detection
  • Internet Traffic Analysis and Secure E-voting
  • Caching and Content Delivery
  • Data Stream Mining Techniques
  • Topic Modeling
  • Anomaly Detection Techniques and Applications
  • Graph Theory and Algorithms
  • Data Management and Algorithms
  • Domain Adaptation and Few-Shot Learning
  • Adversarial Robustness in Machine Learning
  • Advanced Database Systems and Queries
  • Natural Language Processing Techniques
  • Peer-to-Peer Network Technologies
  • Cryptography and Data Security
  • Complexity and Algorithms in Graphs
  • Multimodal Machine Learning Applications
  • HIV, Drug Use, Sexual Risk
  • Advanced Clustering Algorithms Research
  • AI in Service Interactions
  • Limits and Structures in Graph Theory
  • Software System Performance and Reliability
  • Web Data Mining and Analysis
  • Construction Project Management and Performance

Xi'an Jiaotong University
2011-2025

Huawei Technologies (China)
2014

Nanjing University of Science and Technology
2012

Despite recent efforts to characterize complex networks such as citation graphs or online social networks (OSNs), little attention has been given to developing tools that can be used to characterize directed graphs in the wild, where no pre-processed data is available. The presence of hidden incoming edges but observable outgoing edges poses a challenge to characterizing large directed graphs through crawling, as existing sampling methods cannot cope with hidden incoming links. The driving principle behind our random walk (RW) sampling method is to construct, in real-time, an undirected graph from on...

10.1109/infcom.2012.6195540 article EN 2012-03-01

Due to the massive amount of data in high-speed network traffic and the limit on processing capability, it is a great challenge to accurately measure and monitor traffic over high-speed links online. A new structure is presented in this paper for locating hosts associated with large connection degrees or significant changes in connection degree, based on a reversible degree sketch for anomalous traffic. The structure builds a compact summary of host connection degrees efficiently and accurately. For each incoming packet, it only needs to set several bits in a bit array, selected by a group of hash functions....

10.1109/tifs.2011.2123094 article EN IEEE Transactions on Information Forensics and Security 2011-03-07

Counting 3-, 4-, and 5-node graphlets in graphs is important for graph mining applications such as discovering abnormal/evolution patterns in social and biology networks. In addition, it has recently been widely used for computing similarities between graphs and for graph classification applications such as protein function prediction and malware detection. However, it is challenging to compute these graphlet counts for a large graph or a large set of graphs due to the combinatorial nature of the problem. Despite recent efforts in counting 3-node and 4-node graphlets, little attention has been paid...

10.1109/tkde.2017.2756836 article EN IEEE Transactions on Knowledge and Data Engineering 2017-09-26

Legal case retrieval aims to automatically scour for comparable legal cases based on a given query, which is crucial for offering relevant precedents to support the judgment in intelligent systems. Due to similar goals, it is often associated with the matching task. To address them, a daunting challenge is assessing the uniquely defined legal-rational similarity within the judicial domain, which distinctly deviates from the semantic similarities in general text retrieval. Past works either tagged domain-specific factors or...

10.1145/3725729 article EN ACM transactions on office information systems 2025-03-21

Large Language Models (LLMs) may suffer from hallucinations in real-world applications due to the lack of relevant knowledge. In contrast, knowledge graphs encompass extensive, multi-relational structures that store a vast array of symbolic facts. Consequently, integrating LLMs with knowledge graphs has been extensively explored, with Knowledge Graph Question Answering (KGQA) serving as a critical touchstone for the integration. This task requires LLMs to answer natural language questions by retrieving relevant triples from knowledge graphs. However,...

10.1609/aaai.v39i23.34658 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2025-04-11

Recent years have witnessed rapid advancements in the safety alignments of large language models (LLMs). Methods such as supervised instruction fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) have thus emerged as vital components in constructing LLMs. While these methods achieve robust and fine-grained alignment to human values, their practical application is still hindered by high annotation costs and incomplete alignments. Besides, the intrinsic human values within training corpora have not been fully...

10.1609/aaai.v39i26.34957 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2025-04-11

Log messages provide a valuable source of runtime information for ensuring the safety and consistency of systems. Recently, many machine learning and deep learning methods have been proposed to automatically detect anomalous log messages, obviating the need for manual detection by experts. However, we find that in practice, the effectiveness of existing learning-based methods is severely affected by incomplete information and distribution shift. Specifically, each log message can actually be parsed into a fixed number of key fields, while existing methods analyze log messages using only...

10.1145/3588918 article EN cc-by Proceedings of the ACM on Management of Data 2023-05-26

In this work we study the set size distribution estimation problem, where elements are randomly sampled from a collection of non-overlapping sets and we seek to recover the original set size distribution from the samples. This problem has applications to capacity planning and network theory. Examples of real-world applications include characterizing in-degree distributions in large graphs and uncovering TCP/IP flow size distributions on the Internet. We demonstrate that it is difficult to estimate the original set size distribution. The recoverability presents a sharp threshold with respect to the fraction...

10.1109/jsac.2013.130604 article EN IEEE Journal on Selected Areas in Communications 2013-05-17

Molecular representation learning has emerged as a game-changer at the intersection of AI and chemistry, with great potential in applications such as drug design and materials discovery. A substantial obstacle in successfully applying molecular representation learning is the difficulty of effectively and completely characterizing molecular geometry, which has not been well addressed to date. To overcome this challenge, we propose a novel framework that features a geometric graph, termed HAGO-Graph, and a specifically designed graph model, HAGO-Net. In...

10.1609/aaai.v38i13.29373 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2024-03-24

The host connection degree distribution (HCDD) is an important metric for network security monitoring. However, it is difficult to accurately obtain the HCDD in real time for high-speed links with a massive amount of traffic data. In this paper, we propose a new sketch method to build a probabilistic summary of a host's flows using a uniform Flajolet-Martin sketch combined with a small bitmap. To study its performance in comparison with previous sampling and sketch methods, we present a general model that encompasses all these methods. With...

10.1109/tifs.2014.2312544 article EN IEEE Transactions on Information Forensics and Security 2014-03-19

Locating hosts with large connection degree is very important for monitoring anomalous network traffic. The in-degree (out-degree) is defined as the number of distinct sources (destinations) that a host is connected to (connects to) during a given time interval. Due to the massive amount of data in high-speed network traffic and the limit on processing capability, it is difficult to accurately locate hosts with large connection degree over high-speed links online. In this paper we present a new streaming method for locating such hosts based on a reversible sketch monitor required memory space...

10.1109/glocom.2009.5426280 article EN GLOBECOM '05. IEEE Global Telecommunications Conference, 2005. 2009-11-01

Counting the frequencies of 3-, 4-, and 5-node undirected motifs (also known as graphlets) is widely used for understanding complex networks such as social and biology networks. However, it is a great challenge to compute these metrics for a large graph due to the intensive computation. Despite recent efforts to count triangles (i.e., 3-node motif counting), little attention has been given to developing scalable tools that can be used to characterize 4- and 5-node motifs. In this paper, we develop computationally efficient methods to sample 5-...

10.48550/arxiv.1509.08089 preprint EN other-oa arXiv (Cornell University) 2015-01-01

Graphlets are induced subgraph patterns and have been frequently applied to characterize the local topology structures of graphs across various domains, e.g., online social networks (OSNs) and biological networks. Discovering and computing graphlet statistics are highly challenging. First, the massive size of real-world graphs makes the exact computation of graphlets extremely expensive. Secondly, the graph may not be readily available, so one has to resort to web crawling using application programming interfaces (APIs). In this work,...

10.48550/arxiv.1603.07504 preprint EN other-oa arXiv (Cornell University) 2016-01-01

Legal Judgment Prediction (LJP) aims to automatically predict a law case’s judgment results based on the text description of its facts. In practice, the confusing articles (or charges) problem frequently occurs, reflecting that cases applicable to similar articles tend to be misjudged. Although some recent works based on prior knowledge solve this issue well, they ignore that confusion also occurs between articles with high posterior semantic similarity due to data imbalance, instead of only between highly similar ones, which is this work’s further finding....

10.1145/3689628 article EN ACM transactions on office information systems 2024-08-24

Characterizing large online social networks (OSNs) through node querying is a challenging task. OSNs often impose severe constraints on the query rate, hence limiting the sample size to a small fraction of the total network. Various ad-hoc subgraph sampling methods have been proposed, but many of them give biased estimates and no theoretical basis on the accuracy. In this work, we focus on developing sampling methods for OSNs where querying a node also reveals partial structural information about its neighbors. Our methods are optimized for NoSQL graph databases...

10.48550/arxiv.1311.3037 preprint EN other-oa arXiv (Cornell University) 2013-01-01

Exploring small connected and induced subgraph patterns (CIS patterns, or graphlets) has recently attracted considerable attention. Despite recent efforts on computing the number of instances a specific graphlet appears in a large graph (i.e., the total number of CISes isomorphic to the graphlet), little attention has been paid to characterizing a node's graphlet degree, i.e., the number of CISes isomorphic to the graphlet that include the node, which is an important metric for analyzing complex networks such as social and biological networks. Similar to global graphlet counting, it is challenging...

10.48550/arxiv.1604.08691 preprint EN other-oa arXiv (Cornell University) 2016-01-01

Monitoring user behaviors over high speed links is important for applications such as network anomaly detection. Previous work focuses on monitoring anomalies such as extremely frequent users occurring in a short timeslot such as 1 minute. Little attention has been paid to detecting users with stealthy behaviors (e.g., persistent, co-occurrence, anti-co-occurrence, and periodic behaviors) over a long period of time at the granularity. Due to limited computation and storage resources on routers, it is prohibitive to collect massive network traffic over a long period of time. We...

10.1109/tkde.2018.2873319 article EN IEEE Transactions on Knowledge and Data Engineering 2018-10-01

Mining user behaviors over high speed links is important for applications such as network anomaly detection. Previous work focuses on monitoring anomalies such as extremely frequent users occurring in a short timeslot such as 1 minute. Little attention has been paid to detecting users with stealthy behaviors such as persistent and co-occurrence behaviors over a long period of time at the granularity (e.g., minute level). Unlike frequent users, persistent users do not necessarily occur more frequently than other users in a single timeslot, but persist over a larger number of timeslots. Due to limited...

10.1109/infocom.2018.8485858 article EN IEEE INFOCOM 2022 - IEEE Conference on Computer Communications 2018-04-01

Despite recent effort to estimate topology characteristics of large graphs (i.e., online social networks and peer-to-peer networks), little attention has been given to developing a formal methodology to characterize the vast amount of content distributed over these networks. Due to the large-scale nature of these networks, exhaustive enumeration of this content is computationally prohibitive. In this paper, we show how one can obtain content properties by sampling only a small fraction of vertices. We first show that sampling, when naively applied, can produce a huge bias...

10.48550/arxiv.1311.3882 preprint EN other-oa arXiv (Cornell University) 2013-01-01

Distinct sampling is fundamental for computing statistics (e.g., the age and gender distribution of distinct users accessing a particular website) depending on the set of distinct keys (e.g., user IDs) in a large and high speed data stream such as a sequence of key-update pairs. However, a major shortcoming of existing distinct sampling methods is their high computational cost incurred by determining whether each incoming key is currently sampled and keeping track of sampled keys’ update aggregations. To solve this challenge, we develop a new method random projection...

10.1109/tpds.2018.2865452 article EN IEEE Transactions on Parallel and Distributed Systems 2018-08-14

Given two sets of elements held by different parties separately, computing the cardinality (i.e., the number of distinct elements) of their intersection set is a fundamental task in applications such as network monitoring and database systems. To handle large sets with limited space, computation, and communication costs, lightweight probabilistic methods (i.e., sketch methods) such as Flajolet-Martin (FM) and HyperLogLog (HLL) are extensively used. However, when a set's data summary and the hash functions used to construct it are disclosed to an...

10.1145/3639281 article EN Proceedings of the ACM on Management of Data 2024-03-12

Estimating cardinality, i.e., the number of distinct elements, of a data stream is a fundamental problem in areas like databases, computer networks, and information retrieval. This study delves into a broader scenario where each element carries a positive weight. Unlike traditional cardinality estimation, limited research exists on weighted cardinality, with current methods requiring substantial memory and computational resources, challenging for devices with limited capabilities and real-time applications like anomaly detection. To...

10.48550/arxiv.2406.19143 preprint EN arXiv (Cornell University) 2024-06-27