CRAN version 0.5

@qinwf qinwf released this 29 Apr 11:28
· 206 commits to master since this release

Changes in Version 0.5 (2015-04-29)

  • Fix: edit_dict() on Mac
  • New function: filter_segment() to filter segmentation result
  • New function: vector_keywords() to extract keywords from a string
  • Enhancement: Segmentation support: Vector input => List output
  • Enhancement: Segmentation support: Input by lines => Output by lines
  • Enhancement: Add option write = "NOFILE"
  • Enhancement: New rules for "English word + Numbers"
  • Update documentation

1. Added filter_segment(), a function for filtering segmentation results, similar to the stop-word feature used in keyword extraction.

cutter = worker()
result_segment = cutter["我是测试文本,用于测试过滤分词效果。"]
result_segment
[1] "我"   "是"   "测试" "文本" "用于" "测试" "过滤" "分词" "效果"
filter_words = c("我","你","它","大家")
filter_segment(result_segment,filter_words)
[1] "是"   "测试" "文本" "用于" "测试" "过滤" "分词" "效果"
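
The filtering step can be pictured as plain set exclusion on the token vector. Below is a minimal base-R sketch under that assumption; `filter_tokens` is a hypothetical helper written for illustration, not jiebaR's implementation, and the real filter_segment() may differ in edge cases.

```r
# Hypothetical helper (not part of jiebaR): drop every token that
# appears in a stop-word vector, keeping the original order.
filter_tokens <- function(tokens, stop_words) {
  tokens[!(tokens %in% stop_words)]
}

tokens <- c("测试", "文本", "大家", "用于", "测试")
filtered <- filter_tokens(tokens, c("大家"))
print(filtered)
```

Because the comparison is by exact match, repeated tokens that are not in the stop-word vector (here "测试") are all kept.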

2. Segmentation now supports "character vector input => list output" and "file input by lines => list output".

The bylines option controls whether output is produced line by line. The default is bylines = FALSE.

cutter = worker(bylines = TRUE)
cutter
Worker Type:  Mix Segment

Detect Encoding :  TRUE
Default Encoding:  UTF-8
Keep Symbols    :  FALSE
Output Path     :  
Write File      :  TRUE
By Lines        :  TRUE
Max Read Lines  :  1e+05
....
cutter[c("这是非常的好","大家好才是真的好")]
[[1]]
[1] "这是" "非常" "的"   "好"  

[[2]]
[1] "大家" "好"   "才"   "是"   "真的" "好"  
cutter$write = FALSE

# The input file contains:
# 这是一个分行测试文本
# 用于测试分行的输出结果

cutter["files.path"] 
[[1]]
[1] "这是" "一个" "分行" "测试" "文本" 

[[2]]
[1] "用于" "测试" "分行" "的"   "输出" "结果" 
# Write the output to a file line by line
cutter$write = TRUE
cutter$bylines = TRUE
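
To illustrate the shape of a by-lines result, the sketch below applies a tokenizer to each element of a character vector and collects the results in a list, one element per input line. `split_on_space` and `segment_by_lines` are hypothetical names; the tokenizer is a whitespace-splitting stand-in, not the jiebaR segmenter.

```r
# Hypothetical stand-in tokenizer: splits on runs of whitespace.
# A real worker would perform Chinese word segmentation instead.
split_on_space <- function(text) {
  strsplit(text, "[[:space:]]+")[[1]]
}

# Return one list element per input element (or per line of a file).
segment_by_lines <- function(lines, tokenizer) {
  lapply(lines, tokenizer)
}

lines <- c("this is line one", "and line two")
by_lines <- segment_by_lines(lines, split_on_space)

# Flattening the list yields a single vector, shown for comparison.
flat <- unlist(by_lines)
str(by_lines)
```

Keeping the list form preserves which tokens came from which line, which is lost once the result is flattened.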

3. vector_keywords() can extract keywords from a character vector.

keyworker = worker("keywords")
cutter = worker()
vector_keywords(cutter["这是一个比较长的测试文本。"],keyworker)
8.94485 7.14724 4.77176 4.29163 2.81755 
 "文本"  "测试"  "比较"  "这是"  "一个" 
vector_keywords(c("今天","天气","真的","十分","不错","的","感觉"),keyworker)
6.45994 6.18823 5.64148 5.63374 4.99212 
 "天气"  "不错"  "感觉"  "真的"  "今天" 

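The printed results above look like a named character vector, with the scores as names and the keywords as values. Assuming that structure (it is inferred from the printout, not confirmed here), the following sketch rebuilds the first result by hand and rearranges it into a data frame with numeric scores.

```r
# Values copied from the first example output above. Assumption: the
# names hold the scores (as strings) and the elements hold the keywords.
keys <- c("8.94485" = "文本", "7.14724" = "测试", "4.77176" = "比较",
          "4.29163" = "这是", "2.81755" = "一个")

# Rearrange into a data frame with one row per keyword.
keywords_df <- data.frame(
  word  = unname(keys),
  score = as.numeric(names(keys)),
  stringsAsFactors = FALSE
)
print(keywords_df)
```

A data frame of this kind is easier to sort, filter, or join than a vector whose names carry the numeric information.
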
4. Added the write = "NOFILE" option, which skips the file path check.

cutter = worker(write = "NOFILE",symbol = TRUE)
cutter["./test.txt"] # a file named test.txt exists in the directory
[1] "."    "/"    "test" "."    "txt"
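
The effect of skipping the path check can be sketched as a small dispatch rule: an input string that names an existing file is normally read as a file, while under "NOFILE" the same string is treated as literal text. `classify_input` is a hypothetical function written for illustration and is not taken from jiebaR's source.

```r
# Hypothetical dispatcher (not jiebaR's code): decide whether an input
# string should be read as a file or segmented as literal text.
classify_input <- function(input, write = TRUE) {
  # With write = "NOFILE" the path check is skipped entirely.
  if (identical(write, "NOFILE")) {
    return("text")
  }
  if (file.exists(input)) "file" else "text"
}

# Create a temporary file so that the path really exists.
tmp <- tempfile(fileext = ".txt")
writeLines("some text", tmp)

kind_default <- classify_input(tmp)            # path check runs
kind_nofile  <- classify_input(tmp, "NOFILE")  # path check skipped
unlink(tmp)
print(c(kind_default, kind_nofile))
```

This matches the example above, where "./test.txt" is segmented as a string even though a file with that name exists.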