Phillip Ströbel

ORCID: 0000-0003-2063-5495
Research Areas
  • Natural Language Processing Techniques
  • Handwritten Text Recognition Techniques
  • Topic Modeling
  • Digital Humanities and Scholarship
  • Image Processing and 3D Reconstruction
  • Radiomics and Machine Learning in Medical Imaging
  • Digital and Traditional Archives Management
  • Liver physiology and pathology
  • Digital and Cyber Forensics
  • Peptidase Inhibition and Analysis
  • Linguistics and language evolution
  • Vehicle License Plate Recognition
  • Speech and dialogue systems
  • Libraries, Manuscripts, and Books
  • Speech Recognition and Synthesis
  • Digital Media Forensic Detection
  • Text Readability and Simplification
  • Ubiquitin and proteasome pathways
  • Library Science and Information Systems

University of Zurich
2019-2024

Software (Spain)
2024

Swiss Re (Switzerland)
2024

This paper presents how we enhanced the accessibility and utility of historical linguistic data in the project Bullinger Digital. The project involved the transformation of 3,100 letters, primarily available as scanned PDFs, into a dynamic, fully digital format. The expanded collection now includes 12,000 letters: 3,100 edited, 5,400 transcribed, and 3,500 represented through detailed metadata and results from handwritten text recognition. Central to our discussion is the innovative workflow developed for this multilingual corpus. This...

10.5334/johd.174 article EN cc-by Journal of Open Humanities Data 2024-01-01

In this paper we propose an algorithm for computing the full lemma of German verbs that occur in sentences with a separated prefix. The algorithm is meant for large-scale corpus annotation. It relies on Part-of-Speech tags and works with 97% precision when the tags are correct. Unfortunately, there are multi-word adverbs whose particles are homographs of verb particles and prepositions. Since the usage as particle or preposition is much more frequent, these are often incorrectly tagged. We show that special treatment of bi-particle adverbs improves the re-attachment of particles.

10.5167/uzh-126372 article EN 2016-09-21

We report on our processing steps to build a diachronic parallel corpus based on the world's oldest banking magazine. The magazine has been published since 1895 in German, with translations in French and partly in English and Italian. Our data sources are printed issues (until 1997), PDF issues (since 1998) and HTML files (since 2001). Building the corpus poses special challenges in article boundary recognition and cross-language sentence alignment. The corpus fills a gap in parallel corpora with respect to genre (magazine articles), domain (banking and economy) and its time span...

10.5167/uzh-125746 article EN 2016-09-21

The evaluation of Handwritten Text Recognition (HTR) models during their development is straightforward: because HTR is a supervised problem, the usual data split into training, validation, and test sets allows the evaluation of models in terms of accuracy or error rates. However, the evaluation process becomes tricky as soon as we switch from development to application. A compilation of a new (and forcibly smaller) ground truth (GT) from a sample of the data that we want to apply the model on, and the subsequent evaluation of models thereon, only provides hints about the quality of the recognised text, as do confidence scores...

10.48550/arxiv.2201.06170 preprint EN cc-by arXiv (Cornell University) 2022-01-01

We apply the TrOCR framework to real-world, historical manuscripts and show that TrOCR per se is a strong model, ideal for transfer learning. TrOCR has been trained on English only, but it can adapt to other languages that use the Latin alphabet fairly easily and with little training material. We compare TrOCR against a SOTA HTR framework (Transkribus) and show that it can beat such systems. This finding is essential since Transkribus performs best when it has access to baseline information, which is not needed at all to fine-tune TrOCR.

10.48550/arxiv.2203.11008 preprint EN cc-by arXiv (Cornell University) 2022-01-01

This paper describes the challenges of building a Statistical Machine Translation (SMT) system for non-fictional subtitles. Since our experiments focus on a difficult translation direction (i.e. French-German), we investigate several methods to improve the translation performance. We also compare our in-house SMT systems (including domain adaptation and pre-reordering techniques) to other SMT services and show that domain adaptation alone significantly improves our baseline systems.

10.5167/uzh-111980 article EN 2015-05-01