sangdee/kss-java

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

17 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

Korean Sentence Splitter

latest version BSD 3-Clause

Splits Korean text into sentences using a heuristic algorithm.



1. Installation

  • Maven
<dependency>
  <groupId>io.github.sangdee</groupId>
  <artifactId>kss-java</artifactId>
  <version>2.6.1</version>
</dependency>
  • Gradle
repositories {
    mavenCentral()
}

dependencies {
    implementation 'io.github.sangdee:kss-java:2.6.1'
}



2. Usage of splitSentences

ArrayList<String> splitSentences(
        String text,
        boolean useHeuristic,  //default = true
        boolean useQuotesBracketsProcessing, //default = true
        int maxRecoverStep, //default = 5
        int maxRecoverLength, // default = 20000
        int recoverStep //default = 0
    ) 

2.1. Split sentences with heuristic algorithm.

  • splitSentences is the key method of Kss.
  • You can segment text to sentences with this method.
import kss.Kss;

Kss kss = new Kss();
String text = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습.";
kss.splitSentences(text);
["회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요",
 "다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다",
 "강남역 맛집 토끼정의 외부 모습."]

2.2. Split sentences without heuristic algorithm.

  • If your articles follow punctuation rules relatively well, we recommend setting useHeuristic = false. (default is true)
  • In that case, Kss segments text based only on punctuation, which makes segmentation much safer.
    • Formal articles (Wiki, News, Essays, ...): useHeuristic = false recommended
    • Informal articles (SNS, Blogs, Messages, ...): useHeuristic = true recommended
import kss.Kss;

Kss kss = new Kss();
String text = "미리 예약을 할 수 있는 시스템으로 합리적인 가격에 여러 종류의 생선, 그리고 다양한 부위를 즐길 수 있기 때문이다. 계절에 따라 모둠회의 종류는 조금씩 달라지지만 자주 올려주는 참돔 마스까와는 특히 맛이 매우 좋다. 일반 모둠회도 좋지만 좀 더 특별한 맛을 즐기고 싶다면 특수 부위 모둠회를 추천한다 제철 생선 5~6가지 구성에 평소 접하지 못했던 부위까지 색다르게 즐길 수 있다.";
kss.splitSentences(text, false);
["미리 예약을 할 수 있는 시스템으로 합리적인 가격에 여러 종류의 생선, 그리고 다양한 부위를 즐길 수 있기 때문이다.",
 "계절에 따라 모둠회의 종류는 조금씩 달라지지만 자주 올려주는 참돔 마스까와는 특히 맛이 매우 좋다.",
 "제철 생선 5~6가지 구성에 평소 접하지 못했던 부위까지 색다르게 즐길 수 있다."]
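
To make "segmenting based only on punctuation" concrete, here is a minimal sketch of a punctuation-only splitter. It is not Kss's implementation: the class name PunctuationSplitSketch and its split method are hypothetical, and the rule (break after '.', '?' or '!' followed by whitespace) is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT how Kss is implemented.
final class PunctuationSplitSketch {

    // Splits after '.', '?' or '!' when the mark is followed by whitespace.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        for (String part : text.trim().split("(?<=[.?!])\\s+")) {
            if (!part.isEmpty()) {
                sentences.add(part);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // The second sentence lacks a period before "Fresh", so no break is made there.
        for (String s : split("It is good. I recommend it Fresh fish daily.")) {
            System.out.println(s);
        }
    }
}
```

A rule like this never guesses at sentence endings, which is why it suits well-punctuated text; text that omits punctuation stays unsplit, as in the second sentence of the demo.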

2.3. Brackets and quotation marks processing

  • Kss provides a technique for not segmenting sentences enclosed in brackets (괄호) or quotation marks (따옴표).
import kss.Kss;

Kss kss = new Kss();
String text = "그가 말했다. '거기는 가지 마세요. 위험하니까요. 알겠죠?' 그러자 그가 말했다. 알겠어요.";
kss.splitSentences(text);

["그가 말했다.", "'거기는 가지 마세요. 위험하니까요. 알겠죠?' 그러자 그가 말했다.", "알겠어요."]
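
The idea of not breaking inside brackets or quotation marks can be sketched as a single pass that tracks whether the current position is enclosed. This is an assumption-laden illustration, not Kss's algorithm: the class QuoteAwareSplitSketch is hypothetical, it treats every `'` and `"` as a toggle, and it has none of the calibration for misaligned marks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; Kss's real processing is more involved.
final class QuoteAwareSplitSketch {

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int bracketDepth = 0;
        boolean inSingle = false;
        boolean inDouble = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            current.append(c);

            // Track enclosure state; quotes are naively treated as toggles.
            if (c == '(' || c == '[') {
                bracketDepth++;
            } else if ((c == ')' || c == ']') && bracketDepth > 0) {
                bracketDepth--;
            } else if (c == '\'') {
                inSingle = !inSingle;
            } else if (c == '"') {
                inDouble = !inDouble;
            }

            boolean enclosed = bracketDepth > 0 || inSingle || inDouble;
            boolean terminator = c == '.' || c == '?' || c == '!';
            boolean boundary = i + 1 == text.length()
                    || Character.isWhitespace(text.charAt(i + 1));

            // Only break at a terminator that sits outside any enclosure.
            if (terminator && !enclosed && boundary) {
                sentences.add(current.toString().trim());
                current.setLength(0);
            }
        }
        String rest = current.toString().trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("He said. (Stay. Safe?) Then left. Ok.")) {
            System.out.println(s);
        }
    }
}
```

An unmatched opening quote leaves this sketch "enclosed" for the rest of the text, so it never splits again. That is the misalignment problem which calibration is meant to address.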

2.3.1. Several options to optimize recursion

  • However, this can cause problems when brackets and quotation marks are misaligned, which was a chronic problem in Kss 1.x (C++ version).
  • From Kss 2.xx, we provide a quote and bracket calibration feature to solve this problem, but it uses recursion and has very poor time complexity, O(2^n).
  • So, we also provide several options to limit the recursion and save time.
    • The depth of the recursion can be changed through the maxRecoverStep parameter. (default is 5)
    • You can turn off calibration for very long input text using the maxRecoverLength parameter. (default is 20,000)
import kss.Kss;

Kss kss = new Kss();
String text = "VERY_LONG_TEXT";

kss.splitSentences(text, true, true, 5);
// you can adjust the recursion depth using `maxRecoverStep` (default is 5)
kss.splitSentences(text, true, true, 5, 20000);
// you can turn calibration off for very long input text using `maxRecoverLength` (default is 20000)

2.3.2. Turn off brackets and quotation marks processing

  • You can also turn off brackets and quotation marks processing if you want.
  • Set useQuotesBracketsProcessing = false to turn it off.
import kss.Kss;

Kss kss = new Kss();
String text = "그가 말했다. (거기는 가지 마세요. 위험하니까요. 알겠죠?) 그러자 그가 말했다. 알겠어요.";

kss.splitSentences(text);
["그가 말했다.", "(거기는 가지 마세요. 위험하니까요. 알겠죠?) 그러자 그가 말했다.", "알겠어요."]

kss.splitSentences(text, true, false);
["그가 말했다.", "(거기는 가지 마세요.", "위험하니까요.", "알겠죠?", ") 그러자 그가 말했다.", "알겠어요."]



3. Usage of splitChunks

 ArrayList<ChunkWithIndex> splitChunks(
        String text, 
        int maxLength,
        boolean overlap, //default = false
        boolean useHeuristic, //default = true
        boolean useQuotesBracketsProcessing,  //default = true
        int maxRecoverStep,  //default = 5
        int maxRecoverLength  //default = 20000
    ) 

3.1. Set maximum length of chunks via maxLength

  • splitChunks combines sentences into chunks of maxLength or less.
  • You can set the maximum length of one chunk with maxLength.
import kss.Kss;

Kss kss = new Kss();
String text = "NoSQL์ด๋ผ๊ณ  ํ•˜๋Š” ๋ง์€ No 'English'๋ผ๊ณ  ํ•˜๋Š” ๋ง๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋‹ค. ์„ธ์ƒ์—๋Š” ์˜์–ด ๋ง๊ณ ๋„ ์ˆ˜๋งŽ์€ ์–ธ์–ด๊ฐ€ ์กด์žฌํ•œ๋‹ค. MongoDB์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ฟผ๋ฆฌ ์–ธ์–ด์™€ CouchDB์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ฟผ๋ฆฌ ์–ธ์–ด๋Š” ์„œ๋กœ ์ „ํ˜€ ๋‹ค๋ฅด๋‹ค. ๊ทธ๋Ÿผ์—๋„ ์ด ๋‘ ์ฟผ๋ฆฌ ์–ธ์–ด๋Š” ๊ฐ™์€ NoSQL ์นดํ…Œ๊ณ ๋ฆฌ์— ์†ํ•œ๋‹ค. ์–ด์จŒ๊ฑฐ๋‚˜ SQL์ด ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋˜ํ•œ NoSQL์ด No RDBMS๋ฅผ ์˜๋ฏธํ•˜์ง€๋Š” ์•Š๋Š”๋‹ค. BerkleyDB๊ฐ™์€ ์˜ˆ์™ธ๊ฐ€ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ  No RDBMS๊ฐ€ NoSQL์ธ ๊ฒƒ๋„ ์•„๋‹ˆ๋‹ค. SQLํ˜ธํ™˜ ๋ ˆ์ด์–ด๋ฅผ ์ œ๊ณตํ•˜๋Š” KV-store๋ผ๋Š” ์˜ˆ์™ธ๊ฐ€ ์—ญ์‹œ ์กด์žฌํ•œ๋‹ค. ๋ฌผ๋ก  KV-store์˜ ํŠน์ง•์ƒ range query๋ฅผ where์ ˆ์— ๋„ฃ์„ ์ˆ˜ ์—†์œผ๋ฏ€๋กœ ์™„์ „ํ•œ SQL์€ ๋ชป ๋˜๊ณ  SQL์˜ ๋ถ€๋ถ„์ง‘ํ•ฉ ์ •๋„๋ฅผ ์ œ๊ณตํ•œ๋‹ค.";
kss.splitChunks(text, 128);
[ChunkWithIndex(start = 0, text = "NoSQL์ด๋ผ๊ณ  ํ•˜๋Š” ๋ง์€ No 'English'๋ผ๊ณ  ํ•˜๋Š” ๋ง๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋‹ค. ์„ธ์ƒ์—๋Š” ์˜์–ด ๋ง๊ณ ๋„ ์ˆ˜๋งŽ์€ ์–ธ์–ด๊ฐ€ ์กด์žฌํ•œ๋‹ค. MongoDB์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ฟผ๋ฆฌ ์–ธ์–ด์™€ CouchDB์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ฟผ๋ฆฌ ์–ธ์–ด๋Š” ์„œ๋กœ ์ „ํ˜€ ๋‹ค๋ฅด๋‹ค."),
 ChunkWithIndex(start = 124, text = "๊ทธ๋Ÿผ์—๋„ ์ด ๋‘ ์ฟผ๋ฆฌ ์–ธ์–ด๋Š” ๊ฐ™์€ NoSQL ์นดํ…Œ๊ณ ๋ฆฌ์— ์†ํ•œ๋‹ค. ์–ด์จŒ๊ฑฐ๋‚˜ SQL์ด ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋˜ํ•œ NoSQL์ด No RDBMS๋ฅผ ์˜๋ฏธํ•˜์ง€๋Š” ์•Š๋Š”๋‹ค. BerkleyDB๊ฐ™์€ ์˜ˆ์™ธ๊ฐ€ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค."),
 ChunkWithIndex(start = 236, text = "๊ทธ๋ฆฌ๊ณ  No RDBMS๊ฐ€ NoSQL์ธ ๊ฒƒ๋„ ์•„๋‹ˆ๋‹ค. SQLํ˜ธํ™˜ ๋ ˆ์ด์–ด๋ฅผ ์ œ๊ณตํ•˜๋Š” KV-store๋ผ๋Š” ์˜ˆ์™ธ๊ฐ€ ์—ญ ์‹œ ์กด์žฌํ•œ๋‹ค."),
 ChunkWithIndex(start = 305, text = "๋ฌผ๋ก  KV-store์˜ ํŠน์ง•์ƒ range query๋ฅผ where์ ˆ์— ๋„ฃ์„ ์ˆ˜ ์—†์œผ๋ฏ€๋กœ ์™„์ „ํ•œ SQL์€ ๋ชป ๋˜๊ณ  SQL์˜ ๋ถ€๋ถ„์ง‘ํ•ฉ ์ •๋„๋ฅผ ์ œ๊ณตํ•œ๋‹ค.")]

3.2. Overlap sentences across chunks

  • If overlap is true, the text is chunked in a way similar to a sliding window.
  • With this feature turned on, the same sentence can appear in more than one chunk.
import kss.Kss;

Kss kss = new Kss();
String text = "NoSQL์ด๋ผ๊ณ  ํ•˜๋Š” ๋ง์€ No 'English'๋ผ๊ณ  ํ•˜๋Š” ๋ง๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋‹ค. ์„ธ์ƒ์—๋Š” ์˜์–ด ๋ง๊ณ ๋„ ์ˆ˜๋งŽ์€ ์–ธ์–ด๊ฐ€ ์กด์žฌํ•œ๋‹ค. MongoDB์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ฟผ๋ฆฌ ์–ธ์–ด์™€ CouchDB์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ฟผ๋ฆฌ ์–ธ์–ด๋Š” ์„œ๋กœ ์ „ํ˜€ ๋‹ค๋ฅด๋‹ค. ๊ทธ๋Ÿผ์—๋„ ์ด ๋‘ ์ฟผ๋ฆฌ ์–ธ์–ด๋Š” ๊ฐ™์€ NoSQL ์นดํ…Œ๊ณ ๋ฆฌ์— ์†ํ•œ๋‹ค. ์–ด์จŒ๊ฑฐ๋‚˜ SQL์ด ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋˜ํ•œ NoSQL์ด No RDBMS๋ฅผ ์˜๋ฏธํ•˜์ง€๋Š” ์•Š๋Š”๋‹ค. BerkleyDB๊ฐ™์€ ์˜ˆ์™ธ๊ฐ€ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ  No RDBMS๊ฐ€ NoSQL์ธ ๊ฒƒ๋„ ์•„๋‹ˆ๋‹ค. SQLํ˜ธํ™˜ ๋ ˆ์ด์–ด๋ฅผ ์ œ๊ณตํ•˜๋Š” KV-store๋ผ๋Š” ์˜ˆ์™ธ๊ฐ€ ์—ญ์‹œ ์กด์žฌํ•œ๋‹ค. ๋ฌผ๋ก  KV-store์˜ ํŠน์ง•์ƒ range query๋ฅผ where์ ˆ์— ๋„ฃ์„ ์ˆ˜ ์—†์œผ๋ฏ€๋กœ ์™„์ „ํ•œ SQL์€ ๋ชป ๋˜๊ณ  SQL์˜ ๋ถ€๋ถ„์ง‘ํ•ฉ ์ •๋„๋ฅผ ์ œ๊ณตํ•œ๋‹ค.";
kss.splitChunks(text, 128, false, true); // text maxLength, overlap, useHeuristic,
[ChunkWithIndex(start = 0, text = "NoSQL์ด๋ผ๊ณ  ํ•˜๋Š” ๋ง์€ No 'English'๋ผ๊ณ  ํ•˜๋Š” ๋ง๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋‹ค. ์„ธ์ƒ์—๋Š” ์˜์–ด ๋ง๊ณ ๋„ ์ˆ˜๋งŽ์€ ์–ธ์–ด๊ฐ€ ์กด์žฌํ•œ๋‹ค. MongoDB์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ฟผ๋ฆฌ ์–ธ์–ด์™€ CouchDB์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ฟผ๋ฆฌ ์–ธ์–ด๋Š” ์„œ๋กœ ์ „ํ˜€ ๋‹ค๋ฅด๋‹ค."),
 ChunkWithIndex(start = 43, text = "์„ธ์ƒ์—๋Š” ์˜์–ด ๋ง๊ณ ๋„ ์ˆ˜๋งŽ์€ ์–ธ์–ด๊ฐ€ ์กด์žฌํ•œ๋‹ค. MongoDB์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ฟผ๋ฆฌ ์–ธ์–ด์™€ CouchDB์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ฟผ๋ฆฌ ์–ธ์–ด๋Š” ์„œ๋กœ ์ „ํ˜€ ๋‹ค๋ฅด๋‹ค. ๊ทธ๋Ÿผ์—๋„ ์ด ๋‘ ์ฟผ๋ฆฌ ์–ธ์–ด๋Š” ๊ฐ™์€ NoSQL ์นดํ…Œ๊ณ ๋ฆฌ์— ์†ํ•œ๋‹ค."),
 ChunkWithIndex(start = 69, text = "MongoDB์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ฟผ๋ฆฌ ์–ธ์–ด์™€ CouchDB์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ฟผ๋ฆฌ ์–ธ์–ด๋Š” ์„œ๋กœ ์ „ํ˜€ ๋‹ค๋ฅด๋‹ค. ๊ทธ๋Ÿผ ์—๋„ ์ด ๋‘ ์ฟผ๋ฆฌ ์–ธ์–ด๋Š” ๊ฐ™์€ NoSQL ์นดํ…Œ๊ณ ๋ฆฌ์— ์†ํ•œ๋‹ค. ์–ด์จŒ๊ฑฐ๋‚˜ SQL์ด ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค."),
 ChunkWithIndex(start = 124, text = "๊ทธ๋Ÿผ์—๋„ ์ด ๋‘ ์ฟผ๋ฆฌ ์–ธ์–ด๋Š” ๊ฐ™์€ NoSQL ์นดํ…Œ๊ณ ๋ฆฌ์— ์†ํ•œ๋‹ค. ์–ด์จŒ๊ฑฐ๋‚˜ SQL์ด ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋˜ํ•œ NoSQL์ด No RDBMS๋ฅผ ์˜๋ฏธํ•˜์ง€๋Š” ์•Š๋Š”๋‹ค. BerkleyDB๊ฐ™์€ ์˜ˆ์™ธ๊ฐ€ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค."),
 ChunkWithIndex(start = 180, text = "๋˜ํ•œ NoSQL์ด No RDBMS๋ฅผ ์˜๋ฏธํ•˜์ง€๋Š” ์•Š๋Š”๋‹ค. BerkleyDB๊ฐ™์€ ์˜ˆ์™ธ๊ฐ€ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ  No RDBMS๊ฐ€ NoSQL์ธ ๊ฒƒ๋„ ์•„๋‹ˆ๋‹ค. SQLํ˜ธํ™˜ ๋ ˆ์ด์–ด๋ฅผ ์ œ๊ณตํ•˜๋Š” KV-store๋ผ๋Š” ์˜ˆ์™ธ๊ฐ€ ์—ญ์‹œ ์กด์žฌํ•œ๋‹ค."),
 ChunkWithIndex(start = 236, text = "๊ทธ๋ฆฌ๊ณ  No RDBMS๊ฐ€ NoSQL์ธ ๊ฒƒ๋„ ์•„๋‹ˆ๋‹ค. SQLํ˜ธํ™˜ ๋ ˆ์ด์–ด๋ฅผ ์ œ๊ณตํ•˜๋Š” KV-store๋ผ๋Š” ์˜ˆ์™ธ๊ฐ€ ์—ญ ์‹œ ์กด์žฌํ•œ๋‹ค. ๋ฌผ๋ก  KV-store์˜ ํŠน์ง•์ƒ range query๋ฅผ where์ ˆ์— ๋„ฃ์„ ์ˆ˜ ์—†์œผ๋ฏ€๋กœ ์™„์ „ํ•œ SQL์€ ๋ชป ๋˜๊ณ  SQL์˜ ๋ถ€๋ถ„์ง‘ํ•ฉ ์ •๋„๋ฅผ ์ œ๊ณตํ•œ๋‹ค.")]
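
The chunking behaviour shown in 3.1 and 3.2 can be approximated with a small window-packing routine. The sketch below is an assumption inferred from the sample outputs, not code taken from kss-java. The names ChunkPackingSketch, Chunk, naiveSpans and pack are hypothetical, and sentence spans come from a naive punctuation rule rather than from Kss.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; inferred from the sample output, not from the Kss source.
final class ChunkPackingSketch {

    /** Hypothetical stand-in for the library's ChunkWithIndex. */
    static final class Chunk {
        final int start;
        final String text;

        Chunk(int start, String text) {
            this.start = start;
            this.text = text;
        }
    }

    /** Naive [start, end) sentence spans: break after '.', '?' or '!' before a space. */
    static List<int[]> naiveSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean last = i == text.length() - 1;
            if ((c == '.' || c == '?' || c == '!') && (last || text.charAt(i + 1) == ' ')) {
                spans.add(new int[] {start, i + 1});
                start = i + 1;
                while (start < text.length() && text.charAt(start) == ' ') {
                    start++; // skip spaces between sentences
                }
            }
        }
        if (start < text.length()) {
            spans.add(new int[] {start, text.length()});
        }
        return spans;
    }

    /** Packs sentence spans into chunks; on overflow, keeps the later half when overlap is on. */
    static List<Chunk> pack(String text, List<int[]> spans, int maxLength, boolean overlap) {
        List<Chunk> chunks = new ArrayList<>();
        List<int[]> window = new ArrayList<>();
        for (int[] span : spans) {
            // Adding this sentence would stretch the window past maxLength: emit it first.
            if (!window.isEmpty() && span[1] - window.get(0)[0] > maxLength) {
                chunks.add(toChunk(text, window));
                window = overlap
                        ? new ArrayList<>(window.subList(window.size() / 2, window.size()))
                        : new ArrayList<>();
            }
            window.add(span);
        }
        if (!window.isEmpty()) {
            chunks.add(toChunk(text, window));
        }
        return chunks;
    }

    private static Chunk toChunk(String text, List<int[]> window) {
        int start = window.get(0)[0];
        int end = window.get(window.size() - 1)[1];
        return new Chunk(start, text.substring(start, end));
    }

    public static void main(String[] args) {
        String text = "Aa. Bb. Cc. Dd.";
        for (Chunk c : pack(text, naiveSpans(text), 7, true)) {
            System.out.println(c.start + ": " + c.text);
        }
    }
}
```

Traced by hand against the samples above, this policy is consistent with the listed start offsets: 0, 124, 236, 305 without overlap, and 0, 43, 69, 124, 180, 236 with overlap. It also shows why the last overlapping chunk can exceed maxLength. The window carried over after an overflow is not re-checked before the next sentence is appended.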

3.3. Use every option available in splitSentences

  • You can use every option available in splitSentences.
  • For example, if you want to turn off the processing of quotation marks, you can set useQuotesBracketsProcessing the same way as in splitSentences.
import kss.Kss;

Kss kss = new Kss();
String text = "NoSQL์ด๋ผ๊ณ  ํ•˜๋Š” ๋ง์€ No 'English'๋ผ๊ณ  ํ•˜๋Š” ๋ง๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋‹ค. ์„ธ์ƒ์—๋Š” ์˜์–ด ๋ง๊ณ ๋„ ์ˆ˜๋งŽ์€ ์–ธ์–ด๊ฐ€ ์กด์žฌํ•œ๋‹ค. MongoDB์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ฟผ๋ฆฌ ์–ธ์–ด์™€ CouchDB์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ฟผ๋ฆฌ ์–ธ์–ด๋Š” ์„œ๋กœ ์ „ํ˜€ ๋‹ค๋ฅด๋‹ค. ๊ทธ๋Ÿผ์—๋„ ์ด ๋‘ ์ฟผ๋ฆฌ ์–ธ์–ด๋Š” ๊ฐ™์€ NoSQL ์นดํ…Œ๊ณ ๋ฆฌ์— ์†ํ•œ๋‹ค. ์–ด์จŒ๊ฑฐ๋‚˜ SQL์ด ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋˜ํ•œ NoSQL์ด No RDBMS๋ฅผ ์˜๋ฏธํ•˜์ง€๋Š” ์•Š๋Š”๋‹ค. BerkleyDB๊ฐ™์€ ์˜ˆ์™ธ๊ฐ€ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ  No RDBMS๊ฐ€ NoSQL์ธ ๊ฒƒ๋„ ์•„๋‹ˆ๋‹ค. SQLํ˜ธํ™˜ ๋ ˆ์ด์–ด๋ฅผ ์ œ๊ณตํ•˜๋Š” KV-store๋ผ๋Š” ์˜ˆ์™ธ๊ฐ€ ์—ญ์‹œ ์กด์žฌํ•œ๋‹ค. ๋ฌผ๋ก  KV-store์˜ ํŠน์ง•์ƒ range query๋ฅผ where์ ˆ์— ๋„ฃ์„ ์ˆ˜ ์—†์œผ๋ฏ€๋กœ ์™„์ „ํ•œ SQL์€ ๋ชป ๋˜๊ณ  SQL์˜ ๋ถ€๋ถ„์ง‘ํ•ฉ ์ •๋„๋ฅผ ์ œ๊ณตํ•œ๋‹ค.";
splitChunks(text, 128, false, true, false); // text maxLength, overlap, useHeuristic, useQuotesBracketsProcessing,



4. References

Kss is available in various programming languages.
