Uncovering the linguistic characteristics and modeling the language of the dark web

Abstract
The hidden nature and limited accessibility of the Dark Web, combined with the lack of public datasets in this domain, make it difficult to study its inherent characteristics, such as its linguistic properties. Previous work on text classification in the Dark Web domain has suggested that deep neural models may be ineffective, potentially due to the linguistic differences between the Dark and Surface Webs. However, little work has been done to uncover the linguistic characteristics of the Dark Web. In addition, some of the activities prevalent on the Dark Web have been shown to be malicious in nature. It is therefore imperative to conduct a thorough investigation of Dark Web activity. To this end, this work introduces CoDA, a publicly available Dark Web dataset consisting of 10,000 web documents tailored toward text-based Dark Web analysis. Leveraging CoDA, we conduct a thorough linguistic analysis of the Dark Web and examine the textual differences between the Dark Web and the Surface Web. We also assess the performance of various methods of Dark Web page classification. We then compare CoDA with an existing public Dark Web dataset and evaluate their suitability for various use cases.

As studies on the Dark Web commonly require textual analysis of the domain, language models specific to the Dark Web may provide valuable insights to researchers. After confirming the apparent differences between the language of the Dark Web and that of the Surface Web, and collecting additional data, we create DarkBERT, a language model pretrained on Dark Web data. We describe the steps taken to filter and compile the training text for DarkBERT in order to combat the extreme lexical and structural diversity of the Dark Web, which may be detrimental to building a proper representation of the domain. We evaluate DarkBERT and its vanilla counterpart, along with other widely used language models, to validate the benefits that a Dark Web domain-specific model may offer.
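The abstract's notion of "extreme lexical diversity" can be made concrete with a simple corpus statistic. The sketch below (not from the thesis; all names and sample strings are hypothetical) computes a type-token ratio, a common crude proxy for lexical diversity that such a linguistic analysis might report alongside richer measures:

```python
# Illustrative sketch, assuming a type-token ratio (TTR) as a crude
# lexical-diversity proxy. Function names and samples are hypothetical,
# not from the thesis.
import re

def type_token_ratio(text: str) -> float:
    """Ratio of unique word types to total word tokens (0..1)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Toy stand-ins for Surface Web vs. Dark Web text samples.
surface_sample = "the cat sat on the mat and the cat slept on the mat"
dark_sample = "vendor ships stealth worldwide escrow accepted pgp required"

print(round(type_token_ratio(surface_sample), 2))  # repetitive text, lower TTR
print(round(type_token_ratio(dark_sample), 2))     # all-unique tokens, TTR = 1.0
```

On real corpora, TTR is sensitive to document length, so length-normalized variants would be preferable; this only illustrates the kind of measurement involved.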
Advisors
Shin, Seungwon (신승원)
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2022
Identifier
325007
Language
eng
Description

Thesis (Master's) - KAIST: School of Electrical Engineering, 2022.8, [v, 48 p.]

Keywords

dark web; natural language processing; machine learning; information retrieval; language modeling; linguistic analysis

URI
http://hdl.handle.net/10203/309941
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1008371&flag=dissertation
Appears in Collection
EE-Theses_Master (Master's Theses)
Files in This Item
There are no files associated with this item.
