1908 working memory

Human Interface Laboratory
How a Young Speech Researcher
Dived into Computational Linguistics and
Where He Heads Now
2019. 8. 26, @Working Memory
Won Ik Cho

About Me
• Won Ik Cho (조원익)
  - B.S. in EE/Mathematics (SNU, ’10~’14)
  - Ph.D. student (SNU INMC, ’14~)
• Academic background
  - Interested in mathematics > EE!
  - Double major?
• ...
  - Early years in a speech processing lab
    • Source separation
    • Voice activity & endpoint detection
    • Automatic music composition
  - Currently studying computational linguistics

Early years
• Undergraduate years
  - Played guitar alone... or with friends

Early years
• Why did I start mathematics?
  - A long-dreamed romance
From 2008!
but...

Early years
• Undergraduate design project: music source separation
  - Why? To automatically extract and transcribe the score of polyphonic music for guitar orchestration
• And the result?
Image: http://kimgooni.blog.me/221427899920

Early years
• Could not give up mathematics... (although it dismissed me)
  - First, I aimed at a cryptography laboratory
Image top: https://blog.goodaudience.com/cryptography-for-dummies-part-1-a811d4852daa
bottom: https://www.coindesk.com/bitcoin-hits-new-2019-high-above-8900
or...?

Early years
• How? ("How did you all end up here?")

Early years
• Signal processing domain revisited
[Figure: map of EE fields ("EE!"): Circuit, Power, Semiconductor, Control, Bio, Communication, System, and two fields marked "?"]

Early years
• One step toward the grand dream of polyphonic music transcription?
  - Paper survey on:
    • Multiple pitch estimation
    • Music grammar
    • Automatic music composition
Image: https://www.youtube.com/watch?v=TwQybAwL7NY

Early years
• And the reality...
Image: https://www.lumenvox.com/resources/caseStudies/redmond/redmond-software-3.aspx

Early years
• First paper: a technical report on rule-based 4-part chorus composition!
  - Written while attending a lecture on music theory (harmony)
  - EXTREMELY HEURISTIC AND NOT NATURAL!

Early years
• Started a new government-funded project in April 2017
  - Development of free-running speech recognition technologies for embedded robot system
Image: https://www.musicaslanguage.com/, https://imgur.com/gallery/iWKad22

Dive into computational linguistics
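• New task?
  - Development of free-running speech recognition technologies for embedded robot system
  - Korean title: 로봇용 free-running 임베디드 자연어 대화음성인식을 위한 원천 기술 개발 (development of core technologies for free-running, embedded natural-language conversational speech recognition for robots)
• In other words:
  - A speech understanding system that does not rely on a wake-up word
  - ...?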
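[Figure: speech bubbles of a user talking to themselves: 오늘 또 떨어졌네 (It dropped again today) / 이게 대체 며칠째 파란불이냐 (How many days in a row has it been falling now?) / 지금 손실이 얼마지 (How much have I lost so far?)]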

• How?
  - Related to many aspects of (speaker-dependent) speech recognition
    • Speaker dependency (in terms of a personal assistant)
    • Noisy far-talk recognition and beamforming
    • Speech intention understanding
      – To which utterances should the AI react?

• It is about finding the internal intention behind human speech
  - And in Korean?
Image: Top https://onepageinfo.tistory.com/52
Bottom: https://m.blog.naver.com/barim12/220831241685
"We don't even have any data, how are we supposed to do this?!"
"Just build it first!"

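• Need to find out what 'questions' and 'commands' are
• Sentence type?
  - 밥 먹었다 (I ate.)
  - 밥 먹었니 (Did you eat?)
  - 밥 먹어라 (Eat.)
• But...
  - 너랑 오랜만에 밥 먹고 싶네 (It's been a while; I'd like to have a meal with you.)
  - 부른지가 언젠데 밥 안 먹냐 (I called you ages ago; aren't you going to eat?)
  - 밥 좀 작작 먹어라 (Stop eating so much.)
• ...?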
Image: Top https://www.pinterest.co.kr/pin/367887863281568581/?lp=true
Bottom https://www.hkn24.com/news/articleView.html?idxno=69187

• WHY??
  - The intention of an utterance must be identified within colloquial conversation (non-wake-up-word-based!)
  - In pragmatics and sentence-level semantics, this is called a 'speech act' (Searle, 1976)
    • and, in some cases, a 'dialog act' (Stolcke et al., 2000)
    • It is best to use dialog history and prosodic information, if possible
  - But we only had text data (about 800K unlabeled single utterances)
    • and needed to show the project's potential for the remaining years!
  - Let's first randomly choose 20K utterances and build a labeled corpus on:

• Intonation-dependent utterances
  - How can we tell whether an utterance is intonation-dependent?
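[Figure: the utterance 천천히 가고 있어! ("(I'm/You're) going slowly!") and its syllable-aligned transcript 천천 히 가 고 있 어; depending on the intonation, the intention may be a question, a statement, a command, or unclear (?)]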

• Intonation-dependent utterances
  - Underspecified sentence enders
    • -어, -지, -대, -해, -라고, -다며, etc. (in contrast to -다, -니, -라)
    • The sentence type is determined by the sentence-final intonation, which is assigned in accordance with the speech act
  - Conversational maxim (Levinson, 2000)
    • Informativeness principle (I-principle, simplified version)
      – Speaker: Do not say more than is required (bearing the Q-principle in mind).
      – Hearer: What is generally said is stereotypically and specifically exemplified.
      – e.g., 내일 학생회관에서 두시 반에 만나서 얘기해 (Let's meet at the student union at 2:30 tomorrow and talk): not read as a question
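As a rough illustration of the contrast between fixed and underspecified enders, the Python sketch below labels an utterance by its final ender only: -다, -니, and -라 map to a sentence type, while underspecified enders are flagged as intonation-dependent candidates. The ender lists come from these slides and are not exhaustive; the function name and labels are hypothetical, and plain suffix matching on Hangul syllables stands in for real morphological analysis.

```python
# Sentence enders (SEs) listed on the slides; illustrative, not exhaustive.
FIXED_ENDERS = {
    "다": "statement",  # e.g. 밥 먹었다
    "니": "question",   # e.g. 밥 먹었니
    "라": "command",    # e.g. 밥 먹어라
}
# Underspecified SEs: the sentence type depends on sentence-final intonation.
# "래" is taken from the wh-intervention corpus slide; the rest from this slide.
UNDERSPECIFIED_ENDERS = ("라고", "다며", "어", "지", "대", "해", "래")


def ender_type(utterance: str) -> str:
    """Label an utterance by its final ender (surface form only)."""
    # Drop trailing punctuation and spaces (ASR transcripts usually lack them).
    text = utterance.strip().rstrip("!?.~ ")
    # Check longer enders first so a multi-syllable ender is matched whole.
    for ender in sorted(UNDERSPECIFIED_ENDERS, key=len, reverse=True):
        if text.endswith(ender):
            return "intonation-dependent candidate"
    for ender, sentence_type in FIXED_ENDERS.items():
        if text.endswith(ender):
            return sentence_type
    return "unknown"


if __name__ == "__main__":
    examples = [
        "밥 먹었다",
        "밥 먹었니",
        "밥 먹어라",
        "천천히 가고 있어!",
        "내일 학생회관에서 두시 반에 만나서 얘기해",
        "부른지가 언젠데 밥 안 먹냐",
    ]
    for u in examples:
        print(u, "->", ender_type(u))
```

The 얘기해 example is flagged as a candidate even though the I-principle rules out a question reading, and 먹냐 falls outside both lists; a surface rule like this can only propose candidates, which is why the guideline relies on pragmatic judgments rather than on the ender alone.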

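• Introducing phonetic features: intonation dependency
  - Annotating the proper intention for each possible intonation
    • Sentence-final intonation is considered by default.
    • If several intentions are possible under a single intonation, the utterance is treated as ambiguous.
    • Non-statement readings that sound awkward because of adverbs, number agreement, etc. are excluded.
    • Avoid judging a sentence that carries too much information as a question.
    • Watch for cases where wh-particles do not function as interrogatives.
    • As is common in Korean, when the subject is omitted and could be read as first, second, or third person, substitute each in turn and keep the readings that are not awkward.
    • Pay attention to the presence or absence of a vocative.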

• Intention understanding: how?
  - Our approach (for Korean)
[Flowchart: annotation decision procedure. Target: a single sentence without context or punctuation. Decision nodes: Is it a single sentence? (If not: compound sentence, focus on the speech act with the stronger force; separate sentences on the same topic are treated as one sentence.) Does it contain a full clause? Is intonation information needed? Can it be decided from intonation information? Does it have a question set and require the listener's answer? Is an effective to-do list assigned to the listener? Outcomes: Fragments (FR), Context-dependent (CD), Intonation-dependent (ID), Questions (incl. embedded form), Rhetorical questions (RQ), Commands (requirements / prohibitions), Rhetorical commands (RC), otherwise Statements.]
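The decision nodes of this flowchart can be read as a sequence of yes/no judgments made by an annotator. The Python sketch below encodes one plausible ordering of them. The ordering, the field names, and the two extra surface-form flags (question_form, directive_form) used to separate RQ from Questions and RC from Commands are assumptions for illustration, not the authors' exact flowchart; the compound-sentence rule is left out.

```python
from dataclasses import dataclass


@dataclass
class Judgments:
    """Hypothetical yes/no judgments an annotator makes about one utterance
    (a single sentence without context or punctuation)."""
    has_full_clause: bool
    needs_intonation: bool = False
    decidable_by_intonation: bool = True
    requires_answer: bool = False   # has a question set and needs the listener's answer
    assigns_todo: bool = False      # gives the listener an effective to-do list
    question_form: bool = False     # assumption: looks like a question on the surface
    directive_form: bool = False    # assumption: looks like a command on the surface


def label(j: Judgments) -> str:
    """Walk the decision nodes in one plausible order and return a label."""
    if not j.has_full_clause:
        return "FR"  # fragments
    if j.needs_intonation:
        # Intonation settles it -> ID; otherwise context is needed -> CD.
        return "ID" if j.decidable_by_intonation else "CD"
    if j.requires_answer:
        return "Question"
    if j.assigns_todo:
        return "Command"
    if j.question_form:
        return "RQ"  # question form, but no answer is required
    if j.directive_form:
        return "RC"  # command form, but no effective to-do list
    return "Statement"
```

Checking the answer and to-do nodes before the surface-form flags means that an indirect request in question form is labeled Command, while a question-form utterance that asks for nothing ends up as RQ.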

• System overview: text-based sieving + speech-aided analysis
  - Compatible with text-speech alignment
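The "text-based sieving + speech-aided analysis" design can be sketched as a two-stage pipeline: a text classifier labels every transcript, and only utterances it marks as intonation-dependent (ID) are passed, together with aligned acoustic features, to a speech-aided classifier. This is a minimal Python sketch under assumptions: the function names, the label string "ID", and the two toy stand-in classifiers (a final-ender check and a rising-pitch rule) are hypothetical and are not the actual models.

```python
from typing import Callable, Optional, Sequence

# Hypothetical interfaces: a text classifier maps a transcript to a label,
# and a speech-aided classifier also looks at acoustic features (e.g. F0).
TextClassifier = Callable[[str], str]
SpeechClassifier = Callable[[str, Sequence[float]], str]


def understand(
    transcript: str,
    acoustic_features: Optional[Sequence[float]],
    text_clf: TextClassifier,
    speech_clf: SpeechClassifier,
) -> str:
    """Text-based sieving first; speech-aided analysis only where text is not enough."""
    label = text_clf(transcript)
    if label != "ID" or acoustic_features is None:
        # Text alone decided the label, or there is no audio to consult.
        return label
    # Intonation-dependent utterances are re-analyzed with the aligned speech.
    return speech_clf(transcript, acoustic_features)


def toy_text_clf(transcript: str) -> str:
    # Toy stand-in: a final underspecified ender "어" marks the utterance as ID.
    return "ID" if transcript.endswith("어") else "Statement"


def toy_speech_clf(transcript: str, f0: Sequence[float]) -> str:
    # Toy stand-in: a rising pitch contour at the end is read as a question.
    return "Question" if f0[-1] > f0[0] else "Statement"
```

Routing only ID utterances to the speech branch keeps the text model responsible for most of the data, so the acoustic model is needed only where the text is genuinely ambiguous.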

• 2017.04~2017.11
  - Setup phase
    • Studied syntax, semantics, and pragmatics to draw up an annotation guideline for colloquial utterances
• 2017.12~2018.08
  - Corpus annotation
    • With undergraduate students (design project)
• 2018.09~2018.11
  - Implemented classifiers and wrote documentation (paper on arXiv)
    • Repository for open-sourcing
• 2018.12~
  - Modifying and maintaining the repository
    • https://github.com/warnikchow/3i4k
  - Article (in Korean) on the annotation guideline (DisCog 26:3)

The Struggle within
• The project is successful (so far), but does it bring me a publication?
  - Approach #1: First, let's make a similar guideline for English! (2017.11)
    • Manual tagging on the Cornell Movie-Dialogs Corpus (binary: obligatory/non-obligatory only)
and REJECT! (for a signal processing conference)
... My sense here is that the two positive reviewers come from a linguistics
background, and are happy to see linguistic insights being applied to a new real-
world problem. The two negative reviewers seem more aware of the methods
currently in use for this type of problem. The field today is hugely dominated by
data-driven methods with very few linguistics insights, so I think it's important to
make room for papers like this. But at the same time, I think this is a bridge too
far for people working in this area to really engage with. There were lots of missed
opportunities – in addition to the reviewers' comments, this paper would have
been saved by experiments on a common dataset, like DSTC2 or ATIS. I therefore
hesitantly recommend reject.

The Struggle within
• How about a finer categorization and a bigger dataset?
  - Approach #2: 'question', 'command', and 'statement'
and REJECT! (for a computational linguistics conference)
... One weakness or point of criticism is that it did not become clear to me whether
the annotated corpus is being made available open source as a corpus for further
study ... I am not entirely convinced that it is the best idea to use abbreviations /
names for features that are so similar to established "academic" terminology for
sentence types. While the intention is obvious to point out the relationship, it might
be a good idea to make the difference more explicit in the names (int, imp, dec) ...
We learn very little about the annotation guidelines, their granularity, publication
status etc. ... We learn little about the prospective application to spoken language
corpora and the expected impact of an application to spoken / phonetic data incl.
phonological features.

The Struggle within
• And were the issues solved?
  - Approach #3: More justification
and REJECT! (for an AI/ML conference)
... I agree with the main motivation of categorize the utterance in a conversation
based on the expected response of that one who receives the information. This
position could make procedures more dynamic and direct. However, the authors do
not make an argument that the proposed categories are sufficient for a dialogue,
they might be the main but I will suggest to pay attention to clarifications and
continuations which not necessary correspond to an answer or action. ... It is not
clear what shows the results of the classification regarding the proposal of category.
A good classification means that the proposal categories are good? I missing a
guide on how to interpret this relation. ...

The Struggle within
• Should I do it for my own language?
  - Approach #4: Similar categorization in Korean, incorporating a new label regarding acoustic cues for a head-final language
and REJECT! (for an NLP conference)
Summary: The paper presents an approach for intention recognition in Korean,
leveraging both text and acoustic information.
Strengths: The approach is relevant for SLU or dialog systems, and addresses issues
of recognition for head-final languages. It also considers acoustic information
specifically in those cases where acoustic cues are the best discriminators.
Weaknesses: The paper would benefit from specific comparison between the
proposed, somewhat complex architecture and other possible alternative models to
justify the system. It was unclear whether the approach was limited to head-final
languages, and Korean in particular, which would yield a narrow result, and if so,why.

The Struggle within
• There were encouraging words though!

The Struggle within
• And some unexpected invitations from linguistic venues/journals

And to speech again
• Conference where the work is presented: ICPhS
  - International Congress of Phonetic Sciences
• August 5-9, 2019, Melbourne, Australia
(although I couldn’t attend...)

And to speech again
• Prosody-ambiguous statements?
  - Problem: wh-intervention?
    • Needs disambiguation
몇 개 가져오래
Should I bring some?
How many should I bring?
They told you to bring some?

And to speech again
• Prosody-ambiguous statements? How about constructing a corpus that contains ONLY the
utterances whose syntactic ambiguity can be resolved by introducing prosody?
# Wh- particles
누구 (nwukwu, who), 뭐 (mwe, what),
어디 (eti, where), 언제 (encey, when),
어떻게 (ettehkey, how), 몇 (myech, how many)
왜 (way, why) is not used because it does not serve as an existential quantifier
# Predicates
Depend on the wh- particle being adopted
Chosen from 5,800 frequently used lexical items
Pronouns and polarity items were added in some cases
# Reportive particles
Added to form an evidential mood
Induces rhetoricalness for some questions
# Sentence enders (SEs)
SEs with an unfixed role (underspecified SE)
e.g. -래 (ray), -어 (e), -지 (ci)
SEs with a fixed role
e.g. -까 (kka: interrogative)
# Politeness suffix
Attached at the end of a sentence to assign politeness
Restricts rhetoricalness under some circumstances

And to speech again
  - Created 1,292 sentences
  - Constructed 3,552 utterances (each tagged with a speech intention agreed on by consensus)

And to speech again
  - Recorded 7,104 utterances (female and male voices: 2 × 3,552)

And to speech again
[Figure: author contributions, with roles labeled Advisor, Presenter, and Co-author / Equal contribution, and tasks: editing, writing the paper, checking the dataset and recording, checking the dataset, proofreading]

And to speech again

And to speech again
  - And in the future?
    • Disambiguation inspired by neuroscientific phenomena

Interdisciplinary
• Collaborating with friends who share similar interests

Interdisciplinary
• AI ethics? (ACL workshop topic)
  - Measuring gender bias in machine translation
  - Originally promised in the government project proposal ... but
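The slide does not say how the bias is measured. One naive way to probe it, sketched here as an assumption and not as the measure used in the authors' work, is to translate source sentences whose subject carries no gender and count which English pronoun the system picks. The pronoun sets and function names below are hypothetical, and the translation step itself is not shown; the inputs are the system's English outputs.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Hypothetical pronoun sets for English output.
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}
NEUTRAL = {"they", "them", "their", "theirs", "themselves"}


def pronoun_choice(translation: str) -> str:
    """Classify one English translation by the gendered pronouns it uses."""
    words = set(re.findall(r"[a-z]+", translation.lower()))
    has_male = bool(words & MALE)
    has_female = bool(words & FEMALE)
    if has_male and not has_female:
        return "male"
    if has_female and not has_male:
        return "female"
    if not has_male and not has_female and words & NEUTRAL:
        return "neutral"
    return "other"  # both genders, or no pronoun at all


def pronoun_shares(translations: Iterable[str]) -> Dict[str, float]:
    """Share of translations falling into each pronoun category."""
    counts = Counter(pronoun_choice(t) for t in translations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: counts[k] / total for k in ("male", "female", "neutral", "other")}
```

If the sources are gender-neutral, a share that leans heavily toward "male" or "female", especially when broken down by occupation or adjective, is a symptom of bias that a fuller metric would then quantify.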

Done and afterward
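• Done
  - Proposed a general-linguistic speech act classification methodology suited to analyzing speech recognition output, taking intonation dependency and rhetoricalness into account
  - Established an annotation guideline for Korean, built a corpus, and trained models
  - Built a corpus for the disambiguation of Korean speech intention
  - Building a parallel corpus for paraphrasing questions/commands (in progress)
• Afterward?
  - Improving the co-attention framework for speech disambiguation
  - Structured paraphrasing of conversational/unstructured questions and commands
  - Developing a dialog manager that can switch freely between task-oriented and non-task-oriented dialog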

Done and afterward
• Where do I head now?
Image: Top https://phdcomics.com/comics/archive_print.php?comicid=1733
Bottom: https://slideplayer.com/slide/15366786/

Reference
• Searle, John R. A classification of illocutionary acts. Language in Society 5.1 (1976): 1-23.
• Stolcke, Andreas, et al. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics 26.3 (2000): 339-373.
• Levinson, Stephen C. Presumptive meanings: The theory of generalized conversational implicature. MIT Press (2000).
• Cho, Won Ik, Hyeon Seung Lee, Ji Won Yoon, Seok Min Kim, and Nam Soo Kim. Speech intention understanding in a head-final language: A disambiguation utilizing intonation-dependency. arXiv preprint arXiv:1811.04231 (2018).
• Cho, Won Ik, and Nam Soo Kim. Discourse component-based Korean speech act categorization to resolve the vagueness in understanding text intention: A computational linguistics perspective. Discourse and Cognition 26.3 (2019): 227-247. [Korean]
• Cho, Won Ik*, Jeonghwa Cho*, Jeemin Kang, and Nam Soo Kim. Prosody-semantics interface in Seoul Korean: Corpus for a disambiguation of wh- intervention. Proc. ICPhS (2019): 3902-3906.
