the freeware console x-gram micro-spell-checker
POWERED WITH: MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd
QUALITY: Heavily cross-checked, suitable for general purposes.
GOALS: To become the people's choice through its openness (all C sources included).
DOWNLOAD: www.sanmayce.com/Downloads/Masakari.zip (14,734,210 bytes)
First off, 'Masakari' is a sub-project of my never-ending project 'Gamera'. As you can see from the PDF above, Masakari is used to cut off evil, here ignorance of the English language.

The most profound and sad audio/video capturing the spirit of Kami-Gamera is this:
www.sanmayce.com/MSKR/Gamera_3_Gamera_VS_God_Evil_Iris_at_Kyoto.flv

God Evil Iris attacks Kyoto and Gamera comes flying to help the city; as gratitude, the two pilots were ordered to attack her (the goddess of the earth incarnate). Gamera was overpowered by God Iris, then she entered kamikaze mode, cut off her own paw and saved the hateful traitor-girl who summoned Iris. The final scene (after the battle versus God Evil Iris), where Gamera walks through hellish fires betrayed, injured and pawless and still remains kind - this is the epitome of Kami-Gamera - a martyress. I am deeply thankful to the guy who created this diamond-clip, viva!

Okay, word-checking is fast enough, but I couldn't bear those awfully slow searches in 3-gram-checking, so after some quick tweaks in my flagman (in fact flagelf) Leprechaun, revision 15FIXFIX+ came into being, allowing superfast x-grams vs x-grams checks.

x-gram definition:
- An x-phrase (or x-gram) has exactly x words;
- Only alpha ASCII chars form our words, each word is in range 1..31 chars;
- An x-phrase is lowercased i.e. contains only small letters 'a'..'z';
- An x-phrase has length (adjustable) between A and B chars inclusive, e.g. for 3-grams A=9 and B=41;
- X words concatenated with '_' form one x-phrase;
- Symbols not allowed between x words forming an x-phrase: '.', '!', '?', ':', ';', ',', '\t'.

The following excerpt from the 'When the Last Sword Is Drawn' movie subtitles ...

497
01:02:27,956 --> 01:02:35,089
Morioka, in Nanbu. It's pretty as a picture!

498
01:02:35,196 --> 01:02:38,723
There's nowhere like it in all Japan!

499
01:02:39,834 --> 01:02:43,827
The Morioka cherry blossom splits through rock to bloom.

500
01:02:44,506 --> 01:02:48,875
The Morioka magnolia blooms even facing north.

501
01:02:49,911 --> 01:02:54,848
So I want you to run ahead of the times.

502
01:02:55,950 --> 01:03:00,046
Go wild. Bloom.

... when x-grammed down to 3-grams looks like:

as_a_picture
to_run_ahead
there_s_nowhere
the_morioka_cherry
morioka_magnolia_blooms
the_morioka_magnolia
it_in_all
so_i_want
pretty_as_a
you_to_run
s_nowhere_like
nowhere_like_it
it_s_pretty
of_the_times
i_want_you
want_you_to
splits_through_rock
through_rock_to
like_it_in
rock_to_bloom
in_all_japan
morioka_cherry_blossom
s_pretty_as
even_facing_north

How to "hit" the 3-grams that are unfamiliar i.e. "suspicious" to MASAKARI?
Copy the targeted text file(s) to the 'Masakari\_Gamera_r15_3-grams\Your_textual_folders' folder and execute 'RUNME_FAST.BAT' located in the 'Masakari\_Gamera_r15_3-grams' folder. The file 'Your_words_unfamiliar_to_Masakari.txt' will then be autoloaded into NOTEPAD:

even_facing_north
morioka_cherry_blossom
morioka_magnolia_blooms
nowhere_like_it
rock_to_bloom
s_nowhere_like
splits_through_rock
the_morioka_cherry
the_morioka_magnolia
/Unfamiliar 3-grams to MASAKARI, powered with 100,088,208 3-grams, copyleft Sanmayce 2012-Dec-11/

This is the fastest execution I am capable of (for now). Here the latency (i.e. the starting overhead) is huge (70s), BUT there is no search-structure whatsoever, which means the external/internal memory footprint is supersmall. And of course the throughput/bandwidth is EXCELLENT: for instance, getting the "suspicious" 3-grams from 200MB of pure English text (comprising 22+ million 3-grams) took only 128 seconds!
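For readers who want to play with the x-gram definition above without the full package, here is a minimal Python sketch. It is not the Leprechaun C code, only an illustration under stated assumptions: the names `xgrams` and `unfamiliar` are made up for this example, the known corpus is a small in-memory set standing in for the 100,088,208-phrase file, phrases are assumed not to span line breaks (which the sample output above seems to suggest), and any window containing a word longer than 31 chars is simply skipped.

```python
import re

# Symbols the definition forbids between the words of one x-phrase.
BARRIERS = r"[.!?:;,\t]"
MAX_WORD = 31  # each word is in range 1..31 chars


def xgrams(text, x=3, min_len=9, max_len=41):
    """Return the set of x-grams in text (defaults are A=9, B=41 for 3-grams)."""
    found = set()
    for line in text.splitlines():        # assumption: phrases never span lines
        for run in re.split(BARRIERS, line):
            # Any other non-alpha char (apostrophe, digit, space) only separates words.
            words = re.findall(r"[a-z]+", run.lower())
            for i in range(len(words) - x + 1):
                window = words[i:i + x]
                if any(len(w) > MAX_WORD for w in window):
                    continue              # assumption: overlong words are skipped
                phrase = "_".join(window)
                if min_len <= len(phrase) <= max_len:
                    found.add(phrase)
    return found


def unfamiliar(text, known, x=3):
    """Sorted x-grams of text that are absent from the known set."""
    return sorted(xgrams(text, x) - known)


sample = "It's pretty as a picture!\nGo wild. Bloom."
known = {"it_s_pretty", "pretty_as_a"}    # hypothetical stand-in corpus
print(unfamiliar(sample, known))          # ['as_a_picture', 's_pretty_as']
```

The set difference plays the role of the existence check; the real tool does the same job against a sorted file or the HASH+TREES structure rather than a Python set.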
DOWNLOAD TRIMASAKARI: www.sanmayce.com/MSKR/Masakari_revision5.zip (648,020,188 bytes)
* * *
At last I wrote the long overdue revision 16 of my x-gram ripper. The ability to make instant queries (with less than a second of latency i.e. initial response time) has been added. The price for such functionality is one additional 8,660,958KB file (speaking of my 100 million 3-grams). The raw 3-grams are 2.73GB in size; the search-structure housing them is that 8,660,958KB file.

Of course when something is gained something is lost at the same time; here the trade-off concerns speed and size: gaining <1s latencies costs bandwidth, that is, we lose that hundreds-of-thousands-of-phrases-per-second performance, since the search-structure file is too big to fit in fast internal memory. Quite obviously the two approaches are not rivals; they complement each other according to the situation.

Needless to say, I'm foxy enough to feint the odds and to ensure spry behaviour even on netbooks (with only 512MB main RAM): the current needs are 128MB of physical memory to house the HASH; the rest i.e. the TREES can reside even on an HDD, though an SSD is at least ten times better. I am gonna make it available as soon as the reripping finishes. Had my computer had 64GB RAM, I would have finished by now.

Add-on 2012-Dec-19: Notes on Masakari revision 3
[
Notes below concern my laptop (T7500 2200MHz, Windows 7, Samsung 470 SSD 64GB).

[Note 01]:
When the external HASH-TREEs structure was NOT in use (RUNME_FAST.BAT) i.e. ripping '_Gamera.tar.3.sorted.4andabove.txt':
- 5,957,249 distinct phrases (out of 22+ million) were checked for existence against 100,088,208 phrases in 27+95 seconds.
- Total performance: 5,957,249/95 = 62,707P/s i.e. phrases per second
- As a result 1,036,826 of them were unfamiliar i.e. not found among those 100+ million.

When the external HASH-TREEs structure 8.25GB (non-compressed) was in use (RUNME_SPRY.BAT) i.e. using 'Leprechaun_64bit.hsh' and 'Leprechaun_64bit.swp':
- 5,957,249 distinct phrases (out of 22+ million) were checked for existence against 100,088,208 phrases in 27+2552 seconds.
- Total performance: 5,957,249/2552 = 2,334P/s i.e. phrases per second
- As a result 1,036,826 of them were unfamiliar i.e. not found among those 100+ million.

When the external HASH-TREEs structure 4.08GB (NTFS compressed) was in use (RUNME_SPRY.BAT) i.e. using 'Leprechaun_64bit.hsh' and 'Leprechaun_64bit.swp':
- 5,957,249 distinct phrases (out of 22+ million) were checked for existence against 100,088,208 phrases in 27+4425 seconds.
- Total performance: 5,957,249/4425 = 1,346P/s i.e. phrases per second
- As a result 1,036,826 of them were unfamiliar i.e. not found among those 100+ million.

[Note 02]:
Everest Disk Benchmark gave for my SSD:
Random Read 4KB block: 36.5MB/s
Roughly 36.5*1024/4 = 9,344 operations per second.
Now comparing Leprechaun's 2,334 phrases per second with Samsung's 9,344 operations per second gives 2,334/9,344*100% = 24.9% seek utilization.
Since the highest BTREE is 3 levels (not counting the root), we have a maximum of 4 SEEK ATTEMPTS (the first attempt jumps to the hash slot housing the root) i.e. any phrase needs at most 4 RANDOM READS.
In fact the bottleneck is exactly these IOPS provided by the drive.
For HDDs the picture is tragic: 11ms SEEK TIME compared to Samsung's 0.22ms (Everest reports so) is 11/0.22 = 50 times slower!

[Note 03]:
Of course enlarging the current HASH (24bit x 8bytes = 128MB) to, say, 27bit (1024MB) will decrease these 4 attempts, probably to 2.

[Note 04]:
E:\Masakari_revision3\_Gamera_r15_3-grams>EXTRACT_ALL_COMPRESSED_FILES.BAT
Extracting ...
Leprechaun_64bit.hsh.bsc decompressed 69607339 into 134217793 in 25.444 seconds.
Leprechaun_64bit.swp.bsc decompressed 1561846893 into 8868821006 in 847.647 seconds. !!! 6:1 !!!
_Gamera.tar.3.sorted.4andabove.txt.bsc decompressed 463047749 into 2938594566 in 137.406 seconds. !!! 6:1 !!!
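The arithmetic in Notes 01 to 03 can be re-checked with a few lines of Python. This is only a sketch that replays the figures quoted in the notes; the variable names are made up for the example. The last value is an extra observation and an assumption on my side: if every lookup really needed the maximum of 4 random reads, the spry run would be issuing close to the full 9,344 reads per second the drive can deliver, which would fit the remark that the drive's IOPS are the bottleneck.

```python
DISTINCT_PHRASES = 5_957_249          # phrases checked in each run (Note 01)

# Checking time in seconds per run, without the 27s starting part.
fast = DISTINCT_PHRASES / 95          # RUNME_FAST.BAT, no search-structure
spry = DISTINCT_PHRASES / 2552        # RUNME_SPRY.BAT, non-compressed
spry_ntfs = DISTINCT_PHRASES / 4425   # RUNME_SPRY.BAT, NTFS compressed

# Note 02: 36.5MB/s random read in 4KB blocks, expressed as operations per second.
iops = 36.5 * 1024 / 4
utilization = spry / iops * 100       # percent, counting one read per phrase

# Hypothetical worst case: every phrase costs the maximum of 4 random reads.
worst_case_reads = spry * 4

hdd_slowdown = 11 / 0.22              # HDD seek time vs SSD seek time, both in ms


def hash_mb(bits, slot_bytes=8):
    """Size in MB of a hash table with 2**bits slots of slot_bytes each (Note 03)."""
    return (1 << bits) * slot_bytes // (1024 * 1024)


print(int(fast), int(spry), int(spry_ntfs))   # phrases per second, truncated
print(iops, round(utilization, 2), round(worst_case_reads))
print(round(hdd_slowdown), hash_mb(24), hash_mb(27))
```

Truncating instead of rounding reproduces the 62,707 / 2,334 / 1,346 P/s figures; the utilization comes out a hair under 25%.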
]

Just want to put on the table the weight of one freely downloadable English text corpus, namely 'enwiki-20120403-pages-articles.xml'. As the name suggests, it is a compilation of all English Wikipedia articles; the file is 37,430,769,961 bytes long. Some months ago I ripped it, and for 3-grams the stats are:

Total memory needed for one pass: 52,701,578KB
Total distinct phrases: 625,323,984
Total time: 51784 second(s)
Total performance: 61,038P/s i.e. phrases per second

The outcome is enwiki-20120403-pages-articles.3.sorted, 19,354,345,361 bytes long. Those 19,354,345,361 bytes are the pure/raw data; they require a 52,701,578KB HASH+TREES structure for fast searching. Through practical rips such as this it becomes clear what the hardware requirements are for dealing with one real-world text corpus.

Having removed the 'occurrences' field we have: 19,354,345,361 - (625,323,984*10) = 13,101,105,521 bytes of raw x-grams. Now I need the signal-noise ratio, here the raw x-grams vs search-structure ratio, which is 13,101,105,521 bytes/52,701,578KB or 12,794,048KB/52,701,578KB = 24% (bigger-the-better). For me a 1:4 ratio is a good one. When the x-grams are ten times as many, just take a seat: a 500GB search-structure.

elfin adj.
1a. Relating to or suggestive of an elf.
1b. Made, done, or produced by an elf.
2. Small and sprightly or mischievous.
3. Having a magical quality or charm; fairylike: moved across the dimly lit stage with elfin grace.
/HERITAGE/

The word that caught my eye, 'sprightly' ('adj. Full of spirit and vitality; lively; brisk. adv. In a lively, animated manner.'), was used within the 'small_and_sprightly_or_mischievous' 5-gram; the 3-gram I want to have in my armament is 'small_and_sprightly' - fitting the elf description well. Also the 'moved_across_the_dimly_lit_stage_with_elfin_grace' 9-gram is yummy; the 3-gram I want to enrich my 3-gram corpus with is 'with_elfin_grace'.

My point: a corpus lacking the must-have 3-grams 'small_and_sprightly' and 'with_elfin_grace' is a crippled one. The same goes for all those nifty x-grams floating in books/magazines/newspapers not grabbed/ripped yet - this is the source of my avariciousness/greediness - I want them all.
* * *
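The raw-data versus search-structure ratio worked out for the Wikipedia rip above can be replayed in Python. This is just a sketch of that arithmetic with made-up variable names; the 10 bytes per phrase for the 'occurrences' field is taken from the subtraction in the text, and the tenfold projection assumes the structure grows linearly with the number of x-grams.

```python
sorted_file_bytes = 19_354_345_361   # enwiki-20120403-pages-articles.3.sorted
distinct_phrases = 625_323_984
occurrences_field_bytes = 10         # per phrase, as implied by the subtraction
structure_kb = 52_701_578            # HASH+TREES structure, in KB

# Raw x-grams with the 'occurrences' field removed.
raw_bytes = sorted_file_bytes - distinct_phrases * occurrences_field_bytes
raw_kb = raw_bytes // 1024

# "Signal-noise" ratio: raw x-grams vs search-structure (bigger is better).
ratio = raw_kb / structure_kb

# Assumption: ten times the x-grams need ten times the structure.
tenfold_structure_gb = structure_kb * 10 / 1024 / 1024

print(raw_bytes, raw_kb)
print(int(ratio * 100), round(tenfold_structure_gb))
```

The ratio lands at about 24%, roughly 1:4, and the tenfold projection comes out a little above the 500GB mentioned above.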
"... Young hearts can go their way
Can't put it off another day
I don't care what others say ..."
/Chambers Brothers - Time Has Come Today/
Enfun! Copyleft Sanmayce, 2013 Jan 07