- Topic Modeling
- Natural Language Processing Techniques
- Scientific Computing and Data Management
- Algorithms and Data Compression
- Data Management and Algorithms
- Advanced Data Storage Technologies
- Cloud Computing and Resource Management
- Distributed and Parallel Computing Systems
- Genomics and Phylogenetic Studies
- Biomedical Text Mining and Ontologies
- Multimodal Machine Learning Applications
- Semantic Web and Ontologies
- Research Data Management Practices
- Advanced Neural Network Applications
- Data Visualization and Analytics
- RNA modifications and cancer
- Web Data Mining and Analysis
- Data Mining Algorithms and Applications
- RNA and protein synthesis mechanisms
- Parallel Computing and Optimization Techniques
- Data Quality and Management
- Advanced Clustering Algorithms Research
- Video Analysis and Summarization
- Advanced Proteomics Techniques and Applications
- Traffic Prediction and Management Techniques
The University of Texas at Austin
2013-2024
Research Institute of Petroleum Exploration and Development
2024
Daqing Oilfield of CNPC
2024
Northeast Petroleum University
2024
Texas Advanced Computing Center
2011-2022
Central University of Finance and Economics
2021-2022
College of Marin
2019-2021
University of Maryland, College Park
2018-2021
Amazon (United States)
2020-2021
Microsoft (United States)
2021
PubMed
Gramene (http://www.gramene.org), a knowledgebase founded on comparative functional analyses of genomic and pathway data for model plants and major crops, supports agricultural researchers worldwide. The resource is committed to open access and reproducible science based on the FAIR principles. Since the last NAR update, we made nine releases; doubled the genome portal's content; expanded curated genes, pathways, and expression sets; and implemented the Domain Informational Vocabulary Extraction (DIVE) algorithm for extracting...
Natural language understanding (NLU) in the context of goal-oriented dialog systems typically includes intent classification and slot labeling tasks. Existing methods to expand an NLU system to new languages use machine translation with label projection from source to translated utterances, and thus are sensitive to errors. In this work, we propose a novel end-to-end model that learns to align and predict target labels jointly for cross-lingual transfer. We introduce MultiATIS++, a multilingual corpus that extends...
We introduce an Edit-Based TransfOrmer with Repositioning (EDITOR), which makes sequence generation flexible by seamlessly allowing users to specify preferences in output lexical choice. Building on recent models for non-autoregressive sequence generation (Gu et al., 2019), EDITOR generates new sequences by iteratively editing hypotheses. It relies on a novel reposition operation designed to disentangle lexical choice from word positioning decisions, while enabling efficient oracles for imitation learning and parallel...
We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M models have significantly enhanced long-context capabilities through pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance performance while reducing training costs. To promote use among a broader user base, we present and open-source our inference...
We introduce a new approach for the task of Controllable Text Simplification, where systems rewrite a complex English sentence so that it can be understood by readers at different grade levels in the US K-12 system. It uses a non-autoregressive model to iteratively edit an input sequence and incorporates lexical complexity information seamlessly into the refinement process to generate simplifications that better match the desired output than strong autoregressive baselines. Analysis shows our model's local operations...
Motivation: We address the question of whether there exists an effective evolutionary model of amino-acid substitution that forms a metric-distance function. There is always a trade-off between speed and sensitivity among competing computational methods of determining sequence homology. A metric model of evolution is a prerequisite for the development of an entire class of fast sequence analysis algorithms that are both scalable, O(log n), and sensitive. Results: We have reworked the mathematics of the point accepted mutation (PAM) model by calculating...
Covariation analysis is used to identify those positions with similar patterns of sequence variation in an alignment of RNA sequences. These constraints on the evolution of two positions are usually associated with a base pair in a helix. While mutual information (MI) has been used to accurately predict secondary structure and a few of its tertiary interactions, early studies revealed that phylogenetic event counting methods are more sensitive and provide extra confidence in the prediction of base pairs. We developed a novel and powerful phylogenetic events counting method (PEC)...
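As a rough illustration of the mutual information measure mentioned in this abstract (not the PEC method, whose details are truncated here), the sketch below computes MI between two columns of a small, made-up RNA alignment. The function name, the toy alignment, and the use of base-2 logarithms are assumptions for illustration only.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns.

    col_i and col_j are equal-length sequences of characters, one
    character per aligned sequence. Hypothetical helper, not taken
    from the paper.
    """
    if len(col_i) != len(col_j):
        raise ValueError("columns must cover the same sequences")
    n = len(col_i)
    freq_i = Counter(col_i)                 # counts of each base in column i
    freq_j = Counter(col_j)                 # counts of each base in column j
    freq_ij = Counter(zip(col_i, col_j))    # counts of each base pair
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment (made up): columns 0 and 4 covary as Watson-Crick pairs,
# while column 1 is fully conserved.
alignment = ["GACUC", "CACUG", "AACUU", "UACUA"]
columns = list(zip(*alignment))

mi_paired = mutual_information(columns[0], columns[4])
mi_conserved = mutual_information(columns[0], columns[1])
```

In this toy case the covarying pair scores high while the conserved column scores zero, which shows a known limitation of MI: a perfectly conserved base pair carries no covariation signal.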
We reformulate the problem of comparing mass-spectra by mapping spectra to a vector space model. Our search method leverages a metric indexing algorithm to produce an initial candidate set, which can be followed by any fine ranking scheme. We consider three distance measures integrated into a multi-vantage point index structure. Of these, a semi-metric fuzzy-cosine distance using peptide precursor mass constraints performs best. The index acts as a coarse, lossless filter with respect to the SEQUEST and ProFound scoring...
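To make the vector space idea concrete, here is a minimal sketch that bins spectrum peaks into a sparse vector and compares two spectra with a plain cosine distance. This is not the paper's fuzzy-cosine measure and does not include the multi-vantage point index; the function names, the 1.0 m/z bin width, and the example peaks are assumptions for illustration.

```python
import math
from collections import defaultdict


def spectrum_to_vector(peaks, bin_width=1.0):
    """Map (m/z, intensity) peaks to a sparse vector keyed by m/z bin."""
    vec = defaultdict(float)
    for mz, intensity in peaks:
        vec[int(mz // bin_width)] += intensity
    return dict(vec)


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an empty spectrum as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


# Made-up peak lists: the first two share bins, the third shares none.
spec_a = spectrum_to_vector([(100.2, 5.0), (200.7, 3.0)])
spec_b = spectrum_to_vector([(100.6, 5.0), (200.1, 3.0)])
spec_c = spectrum_to_vector([(300.4, 4.0)])

d_same = cosine_distance(spec_a, spec_b)
d_disjoint = cosine_distance(spec_a, spec_c)
```

Note that 1 minus cosine similarity does not satisfy the triangle inequality in general, which is one reason a metric index may need a modified distance.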
Xing Niu, Weijia Xu, Marine Carpuat. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
Training neural networks with many processors can reduce time-to-solution; however, it is challenging to maintain convergence and efficiency at large scales. The Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an approximation of the Fisher Information Matrix that can be used in natural gradient optimizers. We investigate here a scalable K-FAC design and its applicability in convolutional neural network (CNN) training at scale. We study optimization techniques such as layer-wise distribution...
This paper studied the impact of COVID-19 on China's capital market and major industry sectors via an improved ICSS algorithm, a time series model with an exogenous variable, and non-parametric conditional probability estimation. Through empirical analysis, it is found that the epidemic has no significant impact on the returns of the stock and bond markets, but the increase in volatility is gradually more obvious. There are differences in significance, direction, and duration across different sectors. In addition, been some industries rapid others....
Big data analytics pipelines have become popular for large-volume data processing. Apache Zeppelin provides an integrated environment for data ingestion, discovery, visualization, and collaboration, with an extended framework that allows different programming languages and data processing back ends to be plugged in. Supported languages include Scala, Python, SQL, and Shell script, and supported big data back ends include Hadoop, Spark, and Hive. With the necessary tool sets, interactive and dynamic analysis can be done on the fly with heterogeneous interfaces. Although Zeppelin is...
The software package R is a free, powerful, open source tool with extensive statistical computing and graphics capabilities. Due to its high-level expressiveness and multitude of domain-specific packages, R has become a popular tool for data analysis in many scientific fields. While there are a number of packages enabling running R in parallel using the message passing interface across multiple nodes, only a few extend R to the new system paradigm for data intensive computing, such as Hadoop and Spark. In this paper, we focus on three...
With the development of the stock market, the number of individual investors has continued to grow. Because of emotional and irrational factors, the risk and instability of the market in China have greatly increased. Therefore, it is necessary to introduce the idea of quantitative investment into the financial field. In this paper, we first use random forest, XGBoost, and LightGBM to conduct rolling tests on multiple factors. After parameter adjustment based on Bayesian Optimization, we find that LightGBM-Bayes has the best effect. Finally, the paper...
Over the years, R has been adopted as a major data analysis and mining tool in many domain fields. As Big Data overwhelms those fields, the computational needs and workload of existing R solutions increase significantly. With recent hardware and software developments, it is possible to enable massive parallelism with little to no modification. In this paper, we evaluated approaches to speed up R computations with the utilization of the Intel Math Kernel Library and automatic offloading to the Xeon Phi SE10P Co-processor. The testing...
With the rapid development of digital core technology, the acquisition of high-resolution rock thin section images has become crucial. Due to the limitation of optical principles, imaging involves a contradiction between resolution and field of view. In order to solve this problem, this paper proposes a lightweight, fully aggregated network with a multi-branch structure for super-resolution of rock thin section images. The experimental results on the dataset demonstrate that the improved method, called OmniSR-M, achieves significant enhancement compared...
Data-driven research has become an important part of scientific discovery in an increasing number of disciplines. In many cases, the sheer volume of data to be processed requires not only state-of-the-art computing resources but also carefully tuned and specifically developed software. These requirements are often associated with huge operational costs and significant expertise in software development. Due to its simplicity for the user and effectiveness at processing big data, Hadoop is a popular platform for large-scale...
In this paper, based on a sample selection rule and a Back Propagation (BP) neural network, a new model for forecasting daily SO2, NO2, and PM10 concentrations at seven sites in Guangzhou was developed using data from January 2006 to April 2012. A meteorological similarity principle was applied in the development of the rule. The key factors influencing the concentrations, as well as the weight matrices and thresholds, were determined. The basic model was then developed with the improved BP network. To improve the model, identification of factor variation consistency was added...
While monocular roadside cameras have been widely deployed and used to monitor traffic conditions across the United States, analysis of those video data is commonly implemented either manually or through commercial applications tailor-made for specific tasks. The goal of this project is to develop an efficient system that can meet dynamic content-based analysis needs and scale to large camera data. The proposed system utilizes deep learning methods to recognize objects in video data. That information can then be processed and analyzed in a layer...
For the purpose of identifying evolutionary reticulation events in flowering plants, we determine a large number of paired, conserved DNA oligomers that may be used as primers to amplify orthologous regions using the polymerase chain reaction (PCR). We develop an initial candidate set by comparing the Arabidopsis and rice genomes using MoBIoS (Molecular Biological Information System). MoBIoS is a metric-space database management system targeting life science data. Through the use of indexing techniques, two genomes can be compared...
The analysis of RNA sequences, once a small niche field for a collection of scientists whose primary emphasis was the structure and function of a few molecules, has grown most significantly with the realizations that 1) RNA is implicated in many more functions within the cell, and 2) ribosomal RNA sequences are revealing more about the microbial ecology of all biological and environmental systems. The accurate and rapid alignment of these sequences is essential to decipher the maximum amount of information from this data. Two computer systems that utilize the Gutell lab's...