NFDI4DS | UHH-SEMS - Publication Details

Weijia Xu

ORCID: 0000-0002-5134-6381

About

Contact & Profiles

A5101883722

Research Areas

Topic Modeling
Natural Language Processing Techniques
Scientific Computing and Data Management
Algorithms and Data Compression
Data Management and Algorithms
Advanced Data Storage Technologies
Cloud Computing and Resource Management
Distributed and Parallel Computing Systems
Genomics and Phylogenetic Studies
Biomedical Text Mining and Ontologies
Multimodal Machine Learning Applications
Semantic Web and Ontologies
Research Data Management Practices
Advanced Neural Network Applications
Data Visualization and Analytics
RNA modifications and cancer
Web Data Mining and Analysis
Data Mining Algorithms and Applications
RNA and protein synthesis mechanisms
Parallel Computing and Optimization Techniques
Data Quality and Management
Advanced Clustering Algorithms Research
Video Analysis and Summarization
Advanced Proteomics Techniques and Applications
Traffic Prediction and Management Techniques

The University of Texas at Austin
2013-2024

Research Institute of Petroleum Exploration and Development
2024

Daqing Oilfield of CNPC
2024

Northeast Petroleum University
2024

Texas Advanced Computing Center
2011-2022

Central University of Finance and Economics
2021-2022

College of Marin
2019-2021

University of Maryland, College Park
2018-2021

Amazon (United States)
2020-2021

Microsoft (United States)
2021

Building a PubMed knowledge graph

OPENALEX - Publications

Jian Xu Sunkyu Kim Min Song Minbyul Jeong Donghyeon Kim and 10 more

PubMed

10.1038/s41597-020-0543-2 article EN cc-by Scientific Data 2020-06-26

Gramene 2021: harnessing the power of comparative genomics and pathways for plant research

Marcela K. Tello‐Ruiz Sushma Naithani Parul Gupta Andrew Olson Sharon Wei and 24 more

Gramene (http://www.gramene.org), a knowledgebase founded on comparative functional analyses of genomic and pathway data for model plants and major crops, supports agricultural researchers worldwide. The resource is committed to open access and reproducible science based on the FAIR principles. Since the last NAR update, we made nine releases; doubled the genome portal's content; expanded curated genes, pathways and expression sets; and implemented the Domain Informational Vocabulary Extraction (DIVE) algorithm for extracting...

10.1093/nar/gkaa979 article EN cc-by-nc Nucleic Acids Research 2020-10-10

End-to-End Slot Alignment and Recognition for Cross-Lingual NLU

Weijia Xu Batool A Haider Saab Mansour

Natural language understanding (NLU) in the context of goal-oriented dialog systems typically includes intent classification and slot labeling tasks. Existing methods to expand an NLU system to new languages use machine translation with label projection from source to translated utterances, and thus are sensitive to errors. In this work, we propose a novel end-to-end model that learns to align and predict target labels jointly for cross-lingual transfer. We introduce MultiATIS++, a multilingual corpus that extends...

10.18653/v1/2020.emnlp-main.410 article EN cc-by 2020-01-01

EDITOR: An Edit-Based Transformer with Repositioning for Neural Machine Translation with Soft Lexical Constraints

Weijia Xu Marine Carpuat

We introduce an Edit-Based TransfOrmer with Repositioning (EDITOR), which makes sequence generation flexible by seamlessly allowing users to specify preferences in output lexical choice. Building on recent models for non-autoregressive sequence generation (Gu et al., 2019), EDITOR generates new sequences by iteratively editing hypotheses. It relies on a novel reposition operation designed to disentangle lexical choice from word positioning decisions, while enabling efficient oracles for imitation learning and parallel...

10.1162/tacl_a_00368 article EN cc-by Transactions of the Association for Computational Linguistics 2021-01-01

Qwen2.5-1M Technical Report

Yang An B. X. Yu Chengyuan Li Dayiheng Liu Fei Huang and 23 more

We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M models have significantly enhanced long-context capabilities through pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance performance while reducing training costs. To promote use among a broader user base, we present and open-source our inference...

10.48550/arxiv.2501.15383 preprint EN arXiv (Cornell University) 2025-01-25

Research on numerical simulation of initial process of sediment discharge from deep-sea mining vehicle

Kun Huang Yong Zhao Weijia Xu Shuangfang Lu T. A. Bai and 1 more

10.1016/j.oceaneng.2025.120973 article EN Ocean Engineering 2025-03-22

Multifractal detrended cross-correlation analysis on NO, NO2 and O3…

Weijia Xu Chunqiong Liu Kai Shi Yonghong Liu

10.1016/j.physa.2018.02.114 article EN Physica A Statistical Mechanics and its Applications 2018-03-04

A Non-Autoregressive Edit-Based Approach to Controllable Text Simplification

Sweta Agrawal Weijia Xu Marine Carpuat

We introduce a new approach for the task of Controllable Text Simplification, where systems rewrite a complex English sentence so that it can be understood by readers at different grade levels in the US K-12 system. It uses a non-autoregressive model to iteratively edit an input sequence and incorporates lexical complexity information seamlessly into the refinement process to generate simplifications that better match the desired output than strong autoregressive baselines. Analysis shows our model's local operations...

10.18653/v1/2021.findings-acl.330 article EN cc-by 2021-01-01

A metric model of amino acid substitution

Weijia Xu Daniel P. Miranker

Motivation: We address the question of whether there exists an effective evolutionary model of amino-acid substitution that forms a metric-distance function. There is always a trade-off between speed and sensitivity among competing computational methods of determining sequence homology. A metric model of evolution is a prerequisite for the development of an entire class of fast sequence analysis algorithms that are both scalable, O(log n), and sensitive. Results: We have reworked the mathematics of the point accepted mutation (PAM) model by calculating...

10.1093/bioinformatics/bth065 article EN Bioinformatics 2004-02-10

Structural Constraints Identified with Covariation Analysis in Ribosomal RNA

Lei Shang Weijia Xu Stuart Ozer Robin R. Gutell

Covariation analysis is used to identify those positions with similar patterns of sequence variation in an alignment of RNA sequences. These constraints on the evolution of two positions are usually associated with a base pair in a helix. While mutual information (MI) has been used to accurately predict secondary structure and a few of its tertiary interactions, early studies revealed that phylogenetic event counting methods are more sensitive and provide extra confidence in the prediction of base pairs. We developed a novel and powerful phylogenetic events counting method (PEC)...

10.1371/journal.pone.0039383 article EN cc-by PLoS ONE 2012-06-19
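
The mutual information (MI) measure mentioned in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' PEC method or their MI implementation; the function name `column_mutual_information` and the toy alignment are hypothetical, and the sketch ignores gaps, sequence weighting, and phylogeny.

```python
from collections import Counter
from math import log2


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    `alignment` is a list of equal-length aligned sequences (hypothetical
    input format chosen for this illustration).
    """
    pairs = [(seq[i], seq[j]) for seq in alignment]
    n = len(pairs)
    joint = Counter(pairs)                # counts of (base_i, base_j)
    left = Counter(a for a, _ in pairs)   # counts of bases in column i
    right = Counter(b for _, b in pairs)  # counts of bases in column j
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * log2(p_ab * n * n / (left[a] * right[b]))
    return mi


# Toy alignment: columns 0 and 3 always change together, column 1 never varies.
toy_alignment = ["GAAC", "CAAG", "AAAU", "UAAA"]
mi_covarying = column_mutual_information(toy_alignment, 0, 3)
mi_invariant = column_mutual_information(toy_alignment, 0, 1)
```

Column pairs that vary together score high, while a pair involving an invariant column scores zero, which is why MI alone cannot detect conserved base pairs.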

A fast coarse filtering method for peptide identification by mass spectrometry

Smriti Ramakrishnan Rui Mao Aleksey A. Nakorchevskiy John T. Prince Willard S. Willard and 3 more

We reformulate the problem of comparing mass-spectra by mapping spectra to a vector space model. Our search method leverages a metric indexing algorithm to produce an initial candidate set, which can be followed by any fine ranking scheme. We consider three distance measures integrated into a multi-vantage point index structure. Of these, a semi-metric fuzzy-cosine distance using peptide precursor mass constraints performs best. The index acts as a coarse, lossless filter with respect to the SEQUEST and ProFound scoring...

10.1093/bioinformatics/btl118 article EN Bioinformatics 2006-04-04
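
The idea of mapping spectra to a vector space, as described in the abstract above, can be sketched in Python as follows. This uses a plain cosine distance over fixed-width m/z bins, not the paper's fuzzy-cosine measure or its multi-vantage point index; the function names, bin width, and m/z range are assumptions made for illustration.

```python
from math import sqrt


def bin_spectrum(peaks, bin_width=1.0, max_mz=2000.0):
    """Map (m/z, intensity) peaks to a fixed-length intensity vector.

    bin_width and max_mz are illustrative defaults, not values from the paper.
    """
    n_bins = int(max_mz / bin_width) + 1
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vector[idx] += intensity
    return vector


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Two spectra whose peaks fall in the same bins, and one with no shared bins.
spec_a = bin_spectrum([(100.2, 10.0), (250.7, 5.0)])
spec_b = bin_spectrum([(100.4, 10.0), (250.9, 5.0)])
spec_c = bin_spectrum([(400.1, 7.0)])
near = cosine_distance(spec_a, spec_b)
far = cosine_distance(spec_a, spec_c)
```

A coarse filter in this spirit would keep only candidates whose distance to the query falls below a threshold and pass them on to a finer scoring scheme.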

Bi-Directional Differentiable Input Reconstruction for Low-Resource Neural Machine Translation

Xing Niu Weijia Xu Marine Carpuat

Xing Niu, Weijia Xu, Marine Carpuat. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

10.18653/v1/n19-1043 article EN 2019-01-01

Convolutional Neural Network Training with Distributed K-FAC

J. Gregory Pauloski Zhao Zhang Lei Huang Weijia Xu Ian Foster

Training neural networks with many processors can reduce time-to-solution; however, it is challenging to maintain convergence and efficiency at large scales. The Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an approximation of the Fisher Information Matrix that can be used in natural gradient optimizers. We investigate here a scalable K-FAC design and its applicability in convolutional neural network (CNN) training at scale. We study optimization techniques such as layer-wise distribution...

10.1109/sc41405.2020.00098 article EN 2020-11-01

The Impact of COVID-19 on China’s Capital Market and Major Industry Sectors

Weijia Xu Aihua Li Lu Wei

This paper studied the impact of COVID-19 on China's capital market and major industry sectors via an improved ICSS algorithm, a time series model with an exogenous variable, and non-parametric conditional probability estimation. Through empirical analysis, it is found that the epidemic has no significant impact on the return of the stock and bond markets, but increased volatility gradual more obvious. There are differences in the significance, direction and duration of the impact on different sectors. In addition, been some industries rapid others....

10.1016/j.procs.2022.01.011 article EN Procedia Computer Science 2022-01-01

Building Big Data Processing and Visualization Pipeline through Apache Zeppelin

Yanzhe Cheng Fang Liu Shan Jing Weijia Xu Duen Horng Chau

Big data analytics pipelines have become popular for large-volume data processing. Apache Zeppelin provides an integrated environment for data ingestion, discovery, and visualization and collaboration, with an extended framework which allows different programming languages and data processing back ends to be plugged in. The supported languages include Scala, Python, SQL, and Shell script, as well as big data back ends including Hadoop, Spark and Hive. With the necessary tool sets, interactive and dynamic data analysis can be done on the fly with heterogeneous interfaces. Although is...

10.1145/3219104.3229288 article EN Proceedings of the Practice and Experience on Advanced Research Computing 2018-07-12

Performance evaluation of enabling logistic regression for big data with R

Ruizhu Huang Weijia Xu

The software package R is a free, powerful, open source package with extensive statistical computing and graphics capabilities. Due to its high-level expressiveness and multitude of domain-specific packages, R has become a popular tool for data analysis in many scientific fields. While there are a number of packages enabling running R in parallel using the message passing interface across multiple nodes, only a few extend R to the new system paradigm for data intensive computing, such as Hadoop and Spark. In this paper, we focus on three...

10.1109/bigdata.2015.7364048 article EN 2015 IEEE International Conference on Big Data (Big Data) 2015-10-01

Research on multi factor stock selection model based on LightGBM and Bayesian Optimization

Zimo Li Weijia Xu Aihua Li

With the development of the stock market, the number of individual investors has grown steadily. Because of emotional and irrational factors, the risk and instability of the market in China have been greatly increased. Therefore, it is necessary to introduce the idea of quantitative investment into the financial field. In this paper, firstly we use random forest, XGBoost and LightGBM to conduct rolling tests on multiple factors. After parameter adjustment based on Bayesian Optimization, we find that LightGBM-Bayes has the best effect. Finally, the paper...

10.1016/j.procs.2022.11.301 article EN Procedia Computer Science 2022-01-01

Performance evaluation of R with Intel Xeon Phi coprocessor

Yaakoub El-Khamra Niall Gaffney David Walling Eric Wernert Weijia Xu and 1 more

Over the years, R has been adopted as a major data analysis and mining tool in many domain fields. As Big Data overwhelms those fields, the computational needs and workload of existing R solutions increase significantly. With recent hardware and software developments, it is possible to enable massive parallelism with little to no modification. In this paper, we evaluated approaches to speed up R computations with the utilization of the Intel Math Kernel Library and automatic offloading to the Intel Xeon Phi SE10P Co-processor. The testing...

10.1109/bigdata.2013.6691695 article EN 2013-10-01

OmniSR-M: A Rock Sheet with a Multi-Branch Structure Image Super-Resolution Lightweight Method

Tianyong Liu Chengwu Xu Lu Tang Yingjie Meng Weijia Xu and 2 more

With the rapid development of digital core technology, the acquisition of high-resolution rock thin section images has become crucial. Due to the limitation of optical principles, imaging involves a contradiction between resolution and field of view. In order to solve this problem, this paper proposes a lightweight, fully aggregated network with a multi-branch structure for the super-resolution of rock thin section images. The experimental results on the dataset demonstrate that the improved method, called OmniSR-M, achieves significant enhancement compared...

10.3390/app14072779 article EN cc-by Applied Sciences 2024-03-26

Analysis and Optimization of Data Import with Hadoop

Weijia Xu Wei Luo Nicholas Woodward

Data-driven research has become an important part of scientific discovery in an increasing number of disciplines. In many cases, the sheer volume of data to be processed requires not only state-of-the-art computing resources but also carefully tuned and specifically developed software. These requirements are often associated with huge operational costs and significant expertise in software development. Due to its simplicity for the user and effectiveness at processing big data, Hadoop has become a popular platform for large-scale...

10.1109/ipdpsw.2012.129 article EN 2012-05-01

Forecasting Urban Air Quality via a Back-Propagation Neural Network and a Selection Sample Rule

Yonghong Liu Qianru Zhu Dawen Yao Weijia Xu

In this paper, based on a sample selection rule and a Back Propagation (BP) neural network, a new model for forecasting daily SO2, NO2, and PM10 concentration at seven sites in Guangzhou was developed using data from January 2006 to April 2012. A meteorological similarity principle was applied in the development of the rule. The key factors influencing concentrations, as well as weight matrices and thresholds, were determined. The basic then improved BP network. Improving model, identification factor variation consistency added...

10.3390/atmos6070891 article EN cc-by Atmosphere 2015-07-09

Enabling versatile analysis of large scale traffic video data with deep learning and HiveQL

Lei Huang Weijia Xu Si Liu Venktesh Pandey Natalia Ruiz Juri

While monocular roadside cameras have been widely deployed and used to monitor traffic conditions across the United States, analysis of those video data is commonly implemented either manually or through commercial applications tailor-made for specific tasks. The goal of this project is to develop an efficient system that can meet dynamic content based needs scale large camera data. The proposed system utilizes deep learning methods to recognize objects in the video data. That information can then be processed and analyzed layer...

10.1109/bigdata.2017.8258041 article EN 2017 IEEE International Conference on Big Data (Big Data) 2017-12-01

Using MoBIoS' scalable genome join to find conserved primer pair candidates between two genomes

Weijia Xu Willard J. Briggs Joanna Melinda Padolina Ruth Timme Wenguo Liu and 2 more

For the purpose of identifying evolutionary reticulation events in flowering plants, we determine a large number of paired, conserved DNA oligomers that may be used as primers to amplify orthologous regions using the polymerase chain reaction (PCR). We develop an initial candidate set by comparing the Arabidopsis and rice genomes using MoBIoS (Molecular Biological Information System). MoBIoS is a metric-space database management system targeting life science data. Through the use of indexing techniques, two genomes can be compared...

10.1093/bioinformatics/bth929 article EN Bioinformatics 2004-07-19

Two accurate sequence, structure, and phylogenetic template-based RNA alignment systems

Lei Shang David P. Gardner Weijia Xu Jamie J. Cannone Daniel P. Miranker and 2 more

The analysis of RNA sequences, once a small niche field for a collection of scientists whose primary emphasis was the structure and function of a few molecules, has grown most significantly with the realizations that 1) RNA is implicated in many more functions within the cell, and 2) ribosomal RNA sequences are revealing more about microbial ecology in all biological and environmental systems. The accurate and rapid alignment of these sequences is essential to decipher the maximum amount of information from this data. Two computer systems that utilize the Gutell lab's...

10.1186/1752-0509-7-s4-s13 article EN BMC Systems Biology 2013-10-01