Invited Speakers
 

 

Prof. Xiwen Zhang
Beijing Language and Culture University, China

Xiwen Zhang is currently a Professor in the Digital Media Department, School of Information Science, Beijing Language and Culture University. He worked as an associate professor from 2002 to 2007 at the Human-Computer Interaction Laboratory, Institute of Software, Chinese Academy of Sciences. From 2005 to 2006 he was a postdoctoral researcher advised by Prof. Michael R. Lyu in the Department of Computer Science and Engineering, the Chinese University of Hong Kong. From February to April 2001 he was a Research Assistant to Dr. KeZhang Chen in the Department of Mechanical Engineering, the University of Hong Kong. From 2000 to 2002 he was a postdoctoral researcher advised by Prof. ShiJie Cai in the Department of Computer Science and Technology, Nanjing University. Prof. Zhang's research interests include pattern recognition, computer vision, and human-computer interaction, as well as their applications to digital images, digital video, and digital ink. Prof. Zhang has published over 60 refereed journal and conference papers.

 

Speech Title: "Intelligently Extracting Information from Digital Ink Chinese Text by Junior International Students"

 

Abstract: Chinese characters have complex structures, and writing them plays an important role in learning Chinese. Junior international students can use a digital pen to record their handwriting as digital ink. Various kinds of information can be extracted from digital ink text, such as text lines, Chinese characters, stroke errors, and shape normalization. Digital ink is a new medium compared with digital images and digital video. It is captured from handwriting and freehand drawing using a digital pen. The pen captures point samples, each containing a position, a time stamp, and a pressure value. A stroke is a list of sampling points from pen down, through pen movement, to pen up. A list of strokes constitutes a piece of digital ink. Digital ink Chinese text is therefore a set of strokes, with no explicit text lines or Chinese characters. Digital ink Chinese texts written by junior international students contain much information, including errors and irregularities, which makes them difficult to recognize. We propose several intelligent methods to extract this information, such as adaptive segmentation based on statistical analysis, classification using machine learning, stroke matching using a genetic algorithm, and evaluating the normalization of entire characters and their components using knowledge bases. As new intelligent methods are developed and more data are collected, more valuable information can be extracted.
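The point, stroke, and ink hierarchy described in the abstract can be illustrated with a minimal Python sketch. This is not the speaker's implementation: the class names (InkPoint, Stroke, DigitalInk), the helper strokes_from_samples, and the input layout of (x, y, t, pressure, pen_down) tuples are all assumptions made for illustration, as are the units of each field.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class InkPoint:
    """One sample from the pen (hypothetical field names and units)."""
    x: float
    y: float
    t: float         # time stamp
    pressure: float  # pen pressure


@dataclass
class Stroke:
    """Sampling points from pen down, through movement, to pen up."""
    points: List[InkPoint] = field(default_factory=list)


@dataclass
class DigitalInk:
    """A list of strokes; no text lines or characters are marked."""
    strokes: List[Stroke] = field(default_factory=list)


# Assumed raw sample layout: (x, y, t, pressure, pen_down)
Sample = Tuple[float, float, float, float, bool]


def strokes_from_samples(samples: Iterable[Sample]) -> DigitalInk:
    """Group a stream of samples into strokes, splitting at pen-up events."""
    ink = DigitalInk()
    current = None
    for x, y, t, pressure, pen_down in samples:
        if pen_down:
            if current is None:
                current = Stroke()  # pen just touched down: start a stroke
            current.points.append(InkPoint(x, y, t, pressure))
        elif current is not None:
            ink.strokes.append(current)  # pen lifted: close the stroke
            current = None
    if current is not None:
        ink.strokes.append(current)  # stream ended while the pen was down
    return ink
```

Under these assumptions, the result is only a flat list of strokes. Grouping strokes into text lines and characters is the segmentation problem the talk addresses, and it is not attempted here.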

 

Dr. Dmitry Ustalov
Research at Toloka AI, Russia

Dr. Dmitry Ustalov is the Head of Research at Toloka, a global data labeling platform. He is responsible for enabling state-of-the-art methods for quality control in Toloka and for spreading the innovations made by the Toloka Research team. His research interests focus on crowdsourcing and natural language processing; his research has been published in leading scientific venues such as NeurIPS, COLI, ACL, EACL, and EMNLP. Dmitry has been co-organizing the Crowd Science Workshop at NeurIPS and VLDB since 2020, and the TextGraphs workshop at ACL since 2018. He co-authored and presented crowdsourcing tutorials at NAACL-HLT '21, WWW '21, SIGMOD/PODS '20, and WSDM '20.

 

Speech Title: "Crowdsourcing Natural Language Data at Scale"

 

Abstract: In this talk, we present a portion of our unique industry experience in efficient natural language data annotation via crowdsourcing, shared by a leading researcher from Toloka AI. We will introduce data labeling via public crowdsourcing marketplaces and present the key components of efficient label collection. We will discuss projects run on real crowds and present useful quality control techniques and other selection settings for the labeling process. This will be followed by a Q&A session and a discussion of other annotation ideas.