Uncovering the linguistic characteristics and modeling the language of the dark web

Abstract
The hidden nature and limited accessibility of the Dark Web, combined with the lack of public datasets in this domain, make it difficult to study its inherent characteristics, such as its linguistic properties. Previous work on text classification in the Dark Web domain has suggested that deep neural models may be ineffective, potentially due to the linguistic differences between the Dark and Surface Webs. However, little work has been done to uncover the linguistic characteristics of the Dark Web. In addition, some of the activities prevalent on the Dark Web have been shown to be malicious in nature. It is therefore imperative to conduct a thorough investigation of Dark Web activity. To this end, this work introduces CoDA, a publicly available Dark Web dataset consisting of 10,000 web documents tailored toward text-based Dark Web analysis. Leveraging CoDA, we conduct a thorough linguistic analysis of the Dark Web and examine the textual differences between the Dark Web and the Surface Web. We also assess the performance of various methods of Dark Web page classification. We then compare CoDA with an existing public Dark Web dataset and evaluate their suitability for various use cases.

As studies on the Dark Web commonly require textual analysis of the domain, language models specific to the Dark Web may provide valuable insights to researchers. After confirming the apparent differences between the language of the Dark Web and that of the Surface Web, and collecting additional data, we create DarkBERT, a language model pretrained on Dark Web data. We describe the steps taken to filter and compile the training text for DarkBERT in order to combat the extreme lexical and structural diversity of the Dark Web, which may be detrimental to building a proper representation of the domain. We evaluate DarkBERT and its vanilla counterpart, along with other widely used language models, to validate the benefits that a Dark Web domain-specific model may offer.
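The abstract's notion of "extreme lexical diversity" can be made concrete with a simple corpus statistic. The sketch below (not from the thesis; all names and sample strings are hypothetical) computes a type-token ratio, a common crude proxy for lexical diversity that such a linguistic analysis might report alongside richer measures:

```python
# Illustrative sketch, assuming a type-token ratio (TTR) as a crude
# lexical-diversity proxy. Function names and samples are hypothetical,
# not from the thesis.
import re

def type_token_ratio(text: str) -> float:
    """Ratio of unique word types to total word tokens (0..1)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Toy stand-ins for Surface Web vs. Dark Web text samples.
surface_sample = "the cat sat on the mat and the cat slept on the mat"
dark_sample = "vendor ships stealth worldwide escrow accepted pgp required"

print(round(type_token_ratio(surface_sample), 2))  # repetitive text, lower TTR
print(round(type_token_ratio(dark_sample), 2))     # all-unique tokens, TTR = 1.0
```

On real corpora, TTR is sensitive to document length, so length-normalized variants would be preferable; this only illustrates the kind of measurement involved.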
Advisors
Shin, Seungwon (신승원)
Publisher
Korea Advanced Institute of Science and Technology (KAIST)
Issue Date
2022
Identifier
325007
Language
eng
Description

Thesis (Master's) - KAIST: School of Electrical Engineering, 2022.8, [v, 48 p.]

Keywords

dark web; natural language processing; machine learning; information retrieval; language modeling; linguistic analysis

URI
http://hdl.handle.net/10203/309941
Link
http://library.kaist.ac.kr/search/detail/view.do?bibCtrlNo=1008371&flag=dissertation
Appears in Collection
EE-Theses_Master (Master's Theses)
Files in This Item
There are no files associated with this item.
