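While going through some old code over the past few days, I found that the comments I had written were full of misspelled words. The misspellings are not far off, so a tool should be able to correct most of them automatically. Writing a spell-check script in Python is easy, and it gets even easier if you make good use of ready-made tools such as aspell/ispell.
Key points
1. Feed a misspelled word to aspell -a to get a list of candidate corrections, then use edit distance to narrow that list down to the closest words. For example, running aspell -a and typing 'hella' gives the following result: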
hell, Helli, hello, heal, Heall, he'll, hells, Heller, Ella, Hall, Hill, Hull, hall, heel, hill, hula, hull, Helga, Helsa, Bella, Della, Mella, Sella, fella, Halli, Hally, Hilly, Holli, Holly, hallo, hilly, holly, hullo, Hell's, hell's
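2. What is edit distance (also called Levenshtein distance)? It is the minimum number of single-character edits needed to turn one word into another. Here it is used the other way round: take a word and apply a single insertion, deletion, transposition or replacement of one character to enumerate every string that is one edit away. (Strictly speaking, counting a transposition as one edit makes this the Damerau-Levenshtein variant.) For example, applying these operations to 'hella' gives:
'helkla', 'hjlla', 'hylla', 'hellma', 'khella', 'iella', 'helhla', 'hellag', 'hela', 'vhella', 'hhella', 'hell', 'heglla', 'hvlla', 'hellaa', 'ghella', 'hellar', 'heslla', 'lhella', 'helpa', 'hello', …
3. Combine the two sets above, and use a few rules of thumb to improve accuracy: misspellings are usually unintentional slips or typos, a completely wrong word is unlikely, and the first letter of a word is usually right. So the words whose first letter does not match, such as 'Sella', 'Mella', 'khella' and 'iella', could be dropped from the sets. Here VPSee does not delete them but takes them out of the queue and puts them back at the end (lower priority), so words starting with other letters are only matched when nothing starting with h matches.
4. The program relies on the external tool aspell. How do you capture an external program's input and output in Python so that you can process them in your own program? Python 2.4 introduced the subprocess module, and subprocess.Popen handles this.
5. Peter Norvig of Google wrote How to Write a Spelling Corrector, which is well worth reading. A master is a master: 21 lines of Python solve the spelling problem without any external tool, needing only a dictionary file read in beforehand. The edits1 function in this post's program is copied from him.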
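Points 2 and 3 can also be sketched from the other direction: instead of enumerating every string one edit away, measure the distance from the misspelled word to each candidate and sort. This is only an illustration, not the approach the program in this post takes; the names levenshtein and rank are made up for this sketch, and it uses the plain Levenshtein definition (no transpositions).

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def rank(word, candidates):
    """Sort candidates: matching first letter first, then by distance.

    Assumes word and every candidate are non-empty strings. The first
    letter comparison is case-sensitive.
    """
    return sorted(candidates,
                  key=lambda c: (c[0] != word[0], levenshtein(word, c)))


if __name__ == "__main__":
    print(rank("hella", ["Sella", "heal", "hello", "khella", "hell"]))
```

Because sorted is stable, candidates with the same key keep the order aspell returned them in, so aspell's own ranking still acts as the tie-breaker.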
Code
    #!/usr/bin/env python3
    # A simple spell checker

    import signal
    import subprocess
    import sys

    alphabet = 'abcdefghijklmnopqrstuvwxyz'


    def found(word, args, cwd=None, shell=True):
        child = subprocess.Popen(args,
                                 shell=shell,
                                 stdin=subprocess.PIPE,
                                 stdout=subprocess.PIPE,
                                 cwd=cwd,
                                 universal_newlines=True)
        # skip the version banner that aspell prints on startup
        child.stdout.readline()
        (stdout, stderr) = child.communicate(word)

        if ": " in stdout:
            # remove \n\n
            stdout = stdout.rstrip("\n")
            # remove left part until :
            left, candidates = stdout.split(": ", 1)
            candidates = candidates.split(", ")
            # making an error on the first letter of a word is less
            # probable, so we move those candidates to the tail of
            # the queue to give them lower priority (build two lists
            # instead of mutating the list while iterating over it)
            same = [item for item in candidates if item[0] == word[0]]
            other = [item for item in candidates if item[0] != word[0]]
            return same + other
        else:
            return None


    # copy from http://norvig.com/spell-correct.html
    def edits1(word):
        n = len(word)
        return set([word[0:i] + word[i+1:] for i in range(n)] +                         # deletion
                   [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)] + # transposition
                   [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet] +   # replacement
                   [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet])    # insertion


    def correct(word):
        candidates1 = found(word, 'aspell -a')
        if not candidates1:
            print("no suggestion")
            return

        # keep only the aspell candidates that are one edit away
        candidates2 = edits1(word)
        candidates = [item for item in candidates1 if item in candidates2]
        if not candidates:
            print("suggestion: %s" % candidates1[0])
        else:
            # max() picks the alphabetically last of the matches
            print("suggestion: %s" % max(candidates))


    def signal_handler(signum, frame):
        sys.exit(0)


    if __name__ == '__main__':
        signal.signal(signal.SIGINT, signal_handler)
        while True:
            correct(input())
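The found() function above talks to aspell through subprocess.Popen. As a rough sketch of the same idea on Python 3.7 or later, subprocess.run can send the input and collect the output in one call. The helper names run_filter and parse_suggestions are made up for this example, and a small Python one-liner stands in for aspell so the sketch runs even where aspell is not installed. The parser assumes the ispell-style pipe format, in which a misspelled word produces a line starting with "& ".

```python
import subprocess
import sys


def run_filter(args, text):
    """Send text to a child process's stdin and return its stdout."""
    result = subprocess.run(args, input=text, capture_output=True,
                            text=True, check=True)
    return result.stdout


def parse_suggestions(output):
    """Extract the suggestion list from ispell-style pipe output.

    A misspelled word yields a line such as
    "& hella 3 0: hell, Helli, hello"; all other lines are ignored.
    """
    for line in output.splitlines():
        if line.startswith("& ") and ": " in line:
            return line.split(": ", 1)[1].split(", ")
    return []


if __name__ == "__main__":
    # Stand-in child process (an assumption for this sketch):
    # a Python one-liner that upper-cases whatever it reads on stdin.
    upper = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(run_filter(upper, "hella"))
    # With aspell installed, the real call would look like:
    # print(parse_suggestions(run_filter(["aspell", "-a"], "hella\n")))
```

Passing the command as a list avoids going through the shell, and skipping every line that does not start with "& " means the version banner needs no special handling.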
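A simpler approach
Of course, the simplest option is to call an existing module directly from your program. A library called PyEnchant supports spell checking; once PyEnchant and Enchant are installed, you can import it straight into a Python program: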
    >>> import enchant
    >>> d = enchant.Dict("en_US")
    >>> d.check("Hello")
    True
    >>> d.check("Helo")
    False
    >>> d.suggest("Helo")
    ['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
    >>>
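To get back to the original problem of typos in code comments, the check() and suggest() calls shown above could be wrapped roughly as follows. This is a minimal sketch: misspelled_words and demo are hypothetical names, the word-splitting regular expression is a simplification, and the function accepts any object that offers check() and suggest(), so it can be tried with or without PyEnchant.

```python
import re

# Simplified notion of a word: letters with an optional apostrophe part.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def misspelled_words(dictionary, text):
    """Map each misspelled word in text to its list of suggestions.

    dictionary is any object with check(word) and suggest(word)
    methods, for example an enchant.Dict instance.
    """
    result = {}
    for word in WORD_RE.findall(text):
        if word not in result and not dictionary.check(word):
            result[word] = dictionary.suggest(word)
    return result


def demo():
    """Run against a real dictionary; needs PyEnchant and an en_US dictionary."""
    import enchant
    d = enchant.Dict("en_US")
    for word, suggestions in misspelled_words(d, "A simple spel cheker").items():
        print(word, "->", suggestions)
```

Calling demo() prints each misspelled word with the suggestions PyEnchant returns. As the session above shows, the first suggestion is not always the intended word, so the full list is kept.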