- Topic Modeling
- Artificial Intelligence in Law
- Natural Language Processing Techniques
- Comparative and International Law Studies
- Text and Document Classification Technologies
- Legal Language and Interpretation
- Legal Education and Practice Innovations
- Sentiment Analysis and Opinion Mining
- Advanced Text Analysis Techniques
- European and International Law Studies
- Stock Market Forecasting Methods
- Machine Learning and Data Classification
- Auditing, Earnings Management, Governance
- Explainable Artificial Intelligence (XAI)
- Artificial Intelligence in Healthcare and Education
- Semantic Web and Ontologies
- Banking stability, regulation, efficiency
- Medical Imaging and Pathology Studies
- Fibroblast Growth Factor Research
- Imbalanced Data Classification Techniques
- Occupational Health and Safety Research
- Financial Reporting and XBRL
- Domain Adaptation and Few-Shot Learning
- Law, Economics, and Judicial Systems
- Machine Learning in Healthcare
University of Copenhagen
2019-2024
Athens University of Economics and Business
2017-2023
University of Essex
2021-2023
Tilburg University
2023
Utrecht University
2023
Chicago-Kent College of Law
2023
Illinois Institute of Technology
2023
Ludwig-Maximilians-Universität München
2023
Munich Center for Machine Learning
2023
Commonwealth Scientific and Industrial Research Organisation
2022
BERT has achieved impressive performance in several NLP tasks. However, there has been limited investigation of guidelines for its adaptation to specialised domains. Here we focus on the legal domain, where we explore several approaches for applying BERT models to downstream legal tasks, evaluating on multiple datasets. Our findings indicate that the previous guidelines for pre-training and fine-tuning, often blindly followed, do not always generalize well in the legal domain. Thus we propose a systematic investigation of the available strategies when applying BERT in specialised domains. These are: (a) use the original...
Legal judgment prediction is the task of automatically predicting the outcome of a court case, given a text describing the case’s facts. Previous work on using neural models for this task has focused on Chinese; only feature-based models (e.g., using bags of words and topics) have been considered in English. We release a new English legal judgment prediction dataset, containing cases from the European Court of Human Rights. We evaluate a broad variety of neural models on the new dataset, establishing strong baselines that surpass previous feature-based models in three tasks: (1) binary violation classification; (2)...
We consider Large-Scale Multi-Label Text Classification (LMTC) in the legal domain. We release a new dataset of 57k legislative documents from EUR-LEX, annotated with ∼4.3k EUROVOC labels, which is suitable for LMTC, few- and zero-shot learning. Experimenting with several neural classifiers, we show that BIGRUs with label-wise attention perform better than other current state-of-the-art methods. Domain-specific WORD2VEC and context-sensitive ELMO embeddings further improve performance. We also find that considering only...
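The label-wise attention mentioned in the abstract above can be sketched as follows: each label gets its own attention query that pools the encoder's token representations into a label-specific document vector. This is a minimal illustrative sketch (shapes, dot-product scoring, and variable names are assumptions, not the paper's exact formulation).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def label_wise_attention(H, U):
    """Pool token representations H into one document vector per label.
    H: (num_tokens, dim) token representations (e.g., BIGRU outputs).
    U: (num_labels, dim) one learnable attention query per label.
    Returns: (num_labels, dim), a label-specific document vector each.
    """
    docs = []
    for u in U:                   # one attention distribution per label
        alphas = softmax(H @ u)   # attention weights over tokens
        docs.append(alphas @ H)   # weighted average of token vectors
    return np.stack(docs)

# Toy run: 5 tokens, 4-dim vectors, 3 labels.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))
U = rng.normal(size=(3, 4))
D = label_wise_attention(H, U)
print(D.shape)  # (3, 4): a separate document representation per label
```

In a trained classifier, each row of `D` would be scored against its label (e.g., via a per-label sigmoid), so different labels can attend to different parts of the document.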
Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Katz, Nikolaos Aletras. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
We study how contract element extraction can be automated. We provide a labeled dataset with gold annotations, along with an unlabeled dataset of contracts that can be used to pre-train word embeddings. Both datasets are provided in encoded form to bypass privacy issues. We describe and experimentally compare several methods that use manually written rules and linear classifiers (logistic regression, SVMs) with hand-crafted features, word embeddings, and part-of-speech tag embeddings. The best results are obtained by a hybrid method that combines machine learning...
Ilias Chalkidis, Manos Fergadiotis, Dimitrios Tsarapatsanis, Nikolaos Aletras, Ion Androutsopoulos, Prodromos Malakasiotis. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021.
We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. We highlight the effect of temporal concept drift and the importance of chronological, instead of random, splits. We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target). We find that fine-tuning...
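The chronological splits that the abstract above argues for can be sketched in a few lines: documents are partitioned by year so the test set contains only laws published after everything seen in training, exposing temporal concept drift that a random split would hide. The cut-off years and field layout here are invented for illustration.

```python
def chronological_split(docs, train_until, dev_until):
    """Split (year, text) pairs by publication year instead of randomly.
    Everything up to `train_until` is training data, the next window is
    development data, and anything later is test data.
    """
    train = [d for d in docs if d[0] <= train_until]
    dev = [d for d in docs if train_until < d[0] <= dev_until]
    test = [d for d in docs if d[0] > dev_until]
    return train, dev, test

docs = [(1998, "law A"), (2005, "law B"), (2012, "law C"), (2019, "law D")]
train, dev, test = chronological_split(docs, train_until=2005, dev_until=2012)
print(len(train), len(dev), len(test))  # 2 1 1
```

Unlike a random split, a model evaluated this way cannot benefit from label distributions or vocabulary that only appear in later years.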
Daniel Hershcovich, Stella Frank, Heather Lent, Miryam de Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, Laura Cabello Piqueras, Ilias Chalkidis, Ruixiang Cui, Constanza Fierro, Katerina Margatina, Phillip Rust, Anders Søgaard. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
Ilias Chalkidis, Manos Fergadiotis, Sotiris Kotitsas, Prodromos Malakasiotis, Nikolaos Aletras, Ion Androutsopoulos. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
Law, interpretations of law, legal arguments, agreements, etc. are typically expressed in writing, leading to the production of vast corpora of legal text. Their analysis, which is at the center of legal practice, becomes increasingly elaborate as these collections grow in size. Natural language understanding (NLU) technologies can be a valuable tool to support legal practitioners in these endeavors. Their usefulness, however, largely depends on whether current state-of-the-art models can generalize across various tasks in the legal domain. To answer this...
In many jurisdictions, the excessive workload of courts leads to high delays. Suitable predictive AI models can assist legal professionals in their work, and thus enhance and speed up the process. So far, Legal Judgment Prediction (LJP) datasets have been released in English, French, and Chinese. We publicly release a multilingual (German, French, Italian), diachronic (2000-2020) corpus of 85K cases from the Federal Supreme Court of Switzerland (FSCS). We evaluate state-of-the-art BERT-based methods including two variants...
The recent literature in text classification is biased towards short text sequences (e.g., sentences or paragraphs). In real-world applications, multi-page multi-paragraph documents are common and they cannot be efficiently encoded by vanilla Transformer-based models. We compare different Transformer-based Long Document Classification (TrLDC) approaches that aim to mitigate the computational overhead of vanilla transformers to encode much longer text, namely sparse attention and hierarchical encoding methods. We examine several...
Publicly traded companies are required to submit periodic reports with eXtensive Business Reporting Language (XBRL) word-level tags. Manually tagging the reports is tedious and costly. We, therefore, introduce XBRL tagging as a new entity extraction task for the financial domain and release FiNER-139, a dataset of 1.1M sentences with gold XBRL tags. Unlike typical entity extraction datasets, FiNER-139 uses a much larger label set of 139 entity types. Most annotated tokens are numeric, with the correct tag per token depending mostly on context, rather than the token itself. We show...
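Framing XBRL tagging as entity extraction, as the abstract above does, typically means assigning a BIO tag to every token. A minimal sketch of that encoding is below; the tokenization, span format, and the label name `Revenues` are illustrative assumptions, not FiNER-139's actual annotation scheme.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token-index spans into BIO tags.
    `end` is exclusive; untagged tokens get the "O" (outside) tag.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # beginning of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # inside the entity
    return tags

tokens = ["Revenue", "was", "$", "1.2", "million", "in", "2021", "."]
spans = [(3, 4, "Revenues")]  # only the numeric token carries the tag
print(spans_to_bio(tokens, spans))
# ['O', 'O', 'O', 'B-Revenues', 'O', 'O', 'O', 'O']
```

The example also shows the point the abstract makes: the tagged token is just a number, so a classifier must rely on the surrounding context ("Revenue was ... million") to pick the right label.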
Lately, propelled by phenomenal advances around the transformer architecture, the legal NLP field has enjoyed spectacular growth. To measure progress, well-curated and challenging benchmarks are crucial. Previous efforts have produced numerous benchmarks for general NLP models, typically based on news or Wikipedia. However, these may not fit specific domains such as law, with its unique lexicons and intricate sentence structures. Even though there is a rising need to build NLP systems in languages other than English,...
Following the hype around OpenAI's ChatGPT conversational agent, the latest milestone in the recent development of Large Language Models (LLMs) that demonstrate emergent, unprecedented zero-shot capabilities, we audit the latest GPT-3.5 model, 'gpt-3.5-turbo', the first available via an API, on the LexGLUE benchmark in a zero-shot fashion, providing examples in a templated instruction-following format. The results indicate that the model achieves an average micro-F1 score of 49.0% across LexGLUE tasks, surpassing the baseline guessing rates. Notably, the model performs exceptionally...
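A "templated instruction-following format" like the one the abstract above describes can be sketched as a simple prompt builder: the task instruction, the allowed labels, and the input text are slotted into a fixed template. The wording and labels below are invented examples, not the exact template or tasks used in the audit.

```python
def build_prompt(task_description, labels, text):
    """Assemble a zero-shot classification prompt from a fixed template.
    The model is asked to answer with exactly one of the given labels.
    """
    label_list = ", ".join(labels)
    return (
        f"{task_description}\n"
        f"Possible labels: {label_list}\n"
        f"Text: {text}\n"
        f"Answer with one label only:"
    )

prompt = build_prompt(
    "Classify the following court case summary.",
    ["violation", "no violation"],
    "The applicant complained about the length of the proceedings.",
)
print(prompt)
```

Constraining the answer to a fixed label set makes the free-text model output easy to map back onto the benchmark's classification labels for scoring.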
We consider the task of detecting contractual obligations and prohibitions. We show that a self-attention mechanism improves the performance of a BILSTM classifier, the previous state of the art for this task, by allowing it to focus on indicative tokens. We also introduce a hierarchical BILSTM, which converts each sentence to an embedding, and then processes the sentence embeddings to classify each sentence. Apart from being faster to train, the hierarchical model outperforms the flat one, even when the latter considers surrounding sentences, because the hierarchical model has a broader discourse view.
We consider the task of Extreme Multi-Label Text Classification (XMTC) in the legal domain. We release a new dataset of 57k legislative documents from EURLEX, the European Union’s public document database, annotated with concepts from EUROVOC, a multidisciplinary thesaurus. The dataset is substantially larger than previous EURLEX datasets and suitable for XMTC, few-shot and zero-shot learning. Experimenting with several neural classifiers, we show that BIGRUs with self-attention outperform the current multi-label state-of-the-art...
This study examines the predictive power of textual information from S-1 filings in explaining initial public offering (IPO) underpricing. The authors' approach differs from previous research because they utilize several machine learning algorithms to predict whether an IPO will be underpriced or not, as well as the magnitude of underpricing. Using a sample of 2,481 US IPOs, they find that textual information can effectively complement financial variables in terms of prediction accuracy, as models that use both sources of data produce more accurate estimates. In...
In this work, we conduct a detailed analysis on the performance of legal-oriented pre-trained language models (PLMs). We examine the interplay between their original objective, acquired knowledge, and legal language understanding capacities, which we define as upstream, probing, and downstream performance, respectively. We consider not only the models' size but also the pre-training corpora used as important dimensions in our study. To this end, we release a multinational English legal corpus (LeXFiles) and a legal knowledge probing benchmark...
Ilias Chalkidis, Tommaso Pasini, Sheng Zhang, Letizia Tomada, Sebastian Schwemer, Anders Søgaard. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022.
Transformer-based language models, such as BERT and its variants, have achieved state-of-the-art performance in several downstream natural language processing (NLP) tasks on generic benchmark datasets (e.g., GLUE, SQUAD, RACE). However, these models have mostly been applied to the resource-rich English language. In this paper, we present GREEK-BERT, a monolingual BERT-based language model for modern Greek. We evaluate its performance on three NLP tasks, i.e., part-of-speech tagging, named entity recognition, and natural language inference, obtaining...