MASAKARI: the freeware micro-spell-checker

MASAKARI
the freeware console x-gram micro-spell-checker

POWERED WITH: MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd
QUALITY: Heavily cross-checked, suitable for general purposes.
GOALS: To become the people's choice through its openness (all C sources included).
DOWNLOAD: www.sanmayce.com/Downloads/Masakari.zip (14,734,210 bytes)
FORUM: www.thefreedictionary.com

	First off, 'Masakari' is a sub-project of my never-ending project 'Gamera'.
	As you can see from the PDF above, Masakari is used to cut off evil; here, ignorance of the English language.
    
	The most profound and sad audio/video capturing the spirit of Kami-Gamera is this:
	www.sanmayce.com/MSKR/Gamera_3_Gamera_VS_God_Evil_Iris_at_Kyoto.flv

	God Evil Iris attacks Kyoto and Gamera comes flying to help the city;
	as "gratitude", two pilots are ordered to attack her (the goddess of the earth incarnated).
	
	Gamera was overpowered by God Iris, then she entered kamikaze mode, cut off her own paw and
	saved the hateful traitor-girl who summoned Iris.
	
	The final scene (after the battle versus God Evil Iris), where Gamera walks through hellish fires betrayed, injured
	and pawless yet still kind, is the epitome of Kami-Gamera: a martyress.
	
	
	I am deeply thankful to the guy who created this diamond-clip, viva!

	Okay, word-checking is fast enough, but I couldn't bear those awfully slow searches in 3-gram-checking,
	so after some quick tweaks to my flagship (in fact flagelf) Leprechaun,
	revision 15FIXFIX+ came into being, allowing superfast x-grams vs x-grams checks.

	x-gram definition:
	- An x-phrase (or x-gram) has exactly x words;
	- Only alphabetic ASCII chars form our words; each word is 1..31 chars long;
	- An x-phrase is lowercased, i.e. contains only small letters 'a'..'z';
	- An x-phrase has an (adjustable) length between A and B chars inclusive, e.g. for 3-grams A=9 and B=41;
	- x words joined with '_' form one x-phrase;
	- Symbols not allowed between the x words forming an x-phrase: '.', '!', '?', ':', ';', ',', '\t'.
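
The rules above can be sketched in a few lines of Python. This is only an illustration, not the C code shipped with Masakari: the function name `rip_xgrams` and its parameters are made up for the sketch. Two behaviours are assumptions: a line break ends a phrase (inferred from the subtitle example, not stated in the rules), and a window containing a word longer than 31 chars is simply skipped.

```python
import re

# Characters the definition forbids between the words of one x-phrase.
BREAKERS = re.compile(r"[.!?:;,\t]")
WORD = re.compile(r"[A-Za-z]+")

def rip_xgrams(text, x=3, min_len=9, max_len=41, max_word=31):
    """Return the set of distinct x-grams found in text."""
    phrases = set()
    for line in text.splitlines():      # assumption: a line break ends a phrase
        for segment in BREAKERS.split(line):
            # Anything that is not an ASCII letter (space, apostrophe, digit)
            # merely separates words inside a segment.
            words = [w.lower() for w in WORD.findall(segment)]
            for i in range(len(words) - x + 1):
                window = words[i:i + x]
                if any(len(w) > max_word for w in window):
                    continue            # assumption: over-long words are skipped
                phrase = "_".join(window)
                if min_len <= len(phrase) <= max_len:
                    phrases.add(phrase)
    return phrases

sample = ("It's pretty as a picture!\n"
          "The Morioka cherry blossom\n"
          "splits through rock to bloom.")
for phrase in sorted(rip_xgrams(sample)):
    print(phrase)
```

Note that the apostrophe in "It's" is not in the forbidden list, so it only splits the word in two and 'it_s_pretty' is a valid 3-gram.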

	The following excerpt from 'When the Last Sword Is Drawn' movie subtitles 
	...
	497
	01:02:27,956 --> 01:02:35,089
	Morioka, in Nanbu.
	It's pretty as a picture!
	498
	01:02:35,196 --> 01:02:38,723
	There's nowhere like it in all Japan!
	499
	01:02:39,834 --> 01:02:43,827
	The Morioka cherry blossom
	splits through rock to bloom.
	500
	01:02:44,506 --> 01:02:48,875
	The Morioka magnolia blooms
	even facing north.
	501
	01:02:49,911 --> 01:02:54,848
	So I want you to run ahead
	of the times.
	502
	01:02:55,950 --> 01:03:00,046
	Go wild. Bloom.
	...
	when x-grammed down to 3-grams looks like:
	as_a_picture
	to_run_ahead
	there_s_nowhere
	the_morioka_cherry
	morioka_magnolia_blooms
	the_morioka_magnolia
	it_in_all
	so_i_want
	pretty_as_a
	you_to_run
	s_nowhere_like
	nowhere_like_it
	it_s_pretty
	of_the_times
	i_want_you
	want_you_to
	splits_through_rock
	through_rock_to
	like_it_in
	rock_to_bloom
	in_all_japan
	morioka_cherry_blossom
	s_pretty_as
	even_facing_north

	How to "hit" the 3-grams that are unfamiliar, i.e. "suspicious", to MASAKARI?
	Copy the targeted text file(s) to the 'Masakari\_Gamera_r15_3-grams\Your_textual_folders' folder and
	execute 'RUNME_FAST.BAT' located in the 'Masakari\_Gamera_r15_3-grams' folder;
	the file 'Your_words_unfamiliar_to_Masakari.txt' will then be opened automatically in NOTEPAD:
	even_facing_north
	morioka_cherry_blossom
	morioka_magnolia_blooms
	nowhere_like_it
	rock_to_bloom
	s_nowhere_like
	splits_through_rock
	the_morioka_cherry
	the_morioka_magnolia
	
	/Unfamiliar 3-grams to MASAKARI, powered with 100,088,208 3-grams, copyleft Sanmayce 2012-Dec-11/

	This is the fastest execution I am capable of (for now). The latency, i.e. the starting overhead, is huge (70s),
	BUT there is no search-structure whatsoever, which means the external/internal memory footprint is super small.
	And of course the throughput/bandwidth is EXCELLENT: for instance, getting the "suspicious" 3-grams from
	200MB of pure English text (comprising 22+ million 3-grams) took only 128 seconds!
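
How can a check run without any search-structure? One plausible way, and this is a guess rather than a description of what RUNME_FAST.BAT really does, is a single merge pass over two sorted lists: the sorted 3-grams of your text and the sorted known 3-grams. The function name `unfamiliar` is hypothetical.

```python
def unfamiliar(sorted_queries, sorted_known):
    """Return the queries that are missing from sorted_known.

    Both inputs must be sorted ascending; sorted_known may be any iterable,
    e.g. an open file yielding one phrase per line, so it is never held
    in memory as a whole.
    """
    known = iter(sorted_known)
    current = next(known, None)
    missing = []
    for query in sorted_queries:
        # Advance the known list until it catches up with the query.
        while current is not None and current < query:
            current = next(known, None)
        if current != query:
            missing.append(query)
    return missing

queries = ["as_a_picture", "even_facing_north", "of_the_times"]
known = ["as_a_picture", "in_all_japan", "of_the_times"]
print(unfamiliar(queries, known))
```

Such a pass reads the whole known list once, which would fit the picture of a large fixed start-up cost combined with a tiny memory footprint and high throughput.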

DOWNLOAD TRIMASAKARI: www.sanmayce.com/MSKR/Masakari_revision5.zip (648,020,188 bytes)

* * *

At last I wrote the long overdue revision 16 of my x-gram ripper.
The ability to make instant queries (with less than a second of latency, i.e. initial response time) has been added.
The price for such functionality is one additional 8,660,958KB file (speaking of my 100 million 3-grams).
The raw 3-grams are 2.73GB in size; the search-structure housing them is that 8,660,958KB file.
Of course when something is gained something is lost at the same time; here the trade-off is speed versus size:
gaining <1s latency costs bandwidth, that is, we lose the hundreds of thousands of phrases per second
since the search-structure file is too big to fit into fast internal memory.
Quite obviously the two approaches are not rivals; they complement each other according to the situation.

Needless to say, I'm foxy enough to beat the odds and to ensure spry behaviour even on
netbooks (with only 512MB main RAM): the current needs are 128MB of physical memory to house the HASH;
the rest, i.e. the TREES, can reside even on an HDD, though an SSD is at least ten times better.
I am gonna make it available as soon as the reripping finishes; had my computer had 64GB RAM I would have finished
by now.

Add-on 2012-Dec-19: Notes on Masakari revision 3
[
Notes below concern my laptop (T7500 2200MHz, Windows 7, Samsung 470 SSD 64GB).

[Note 01]:
When the external HASH-TREEs structure was NOT in use (RUNME_FAST.BAT) i.e.
ripping '_Gamera.tar.3.sorted.4andabove.txt':
- 5,957,249 distinct phrases (out of 22+ million) were checked for existence against 100,088,208
phrases in 27+95 seconds.
- Total performance: 5,957,249/95 = 62,707P/s i.e. phrases per second
- As a result 1,036,826 of them were unfamiliar, i.e. not found among those 100+ million.

When the external HASH-TREEs structure 8.25GB (non-compressed) was in use (RUNME_SPRY.BAT) i.e.
using 'Leprechaun_64bit.hsh' and 'Leprechaun_64bit.swp':
- 5,957,249 distinct phrases (out of 22+ million) were checked for existence against 100,088,208
phrases in 27+2552 seconds.
- Total performance: 5,957,249/2552 = 2,334P/s i.e. phrases per second
- As a result 1,036,826 of them were unfamiliar, i.e. not found among those 100+ million.

When the external HASH-TREEs structure 4.08GB (NTFS compressed) was in use (RUNME_SPRY.BAT) i.e.
using 'Leprechaun_64bit.hsh' and 'Leprechaun_64bit.swp':
- 5,957,249 distinct phrases (out of 22+ million) were checked for existence against 100,088,208
phrases in 27+4425 seconds.
- Total performance: 5,957,249/4425 = 1,346P/s i.e. phrases per second
- As a result 1,036,826 of them were unfamiliar, i.e. not found among those 100+ million.
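
The three throughput figures in Note 01 can be re-derived with a short Python sketch. It follows the note's own convention: the 27-second start-up part is left out and the result is truncated to whole phrases per second. The dictionary labels are mine.

```python
DISTINCT_PHRASES = 5_957_249

# Seconds spent on the checking itself (the second term of "27+N" in the note).
check_seconds = {
    "no search-structure (RUNME_FAST.BAT)": 95,
    "HASH-TREEs, non-compressed (RUNME_SPRY.BAT)": 2552,
    "HASH-TREEs, NTFS compressed (RUNME_SPRY.BAT)": 4425,
}

def phrases_per_second(phrases, seconds):
    """Whole phrases checked per second (truncated, as in the note)."""
    return phrases // seconds

for label, seconds in check_seconds.items():
    rate = phrases_per_second(DISTINCT_PHRASES, seconds)
    slowdown = seconds / 95
    print(f"{label}: {rate:,}P/s, {slowdown:.1f}x the time of the fast run")
```

On these numbers the on-disk structure takes roughly 27 times longer than the structure-free run for the same batch, and NTFS compression nearly doubles that again.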

[Note 02]:
Everest Disk Benchmark gave for my SSD:
Random Read 4KB block: 36.5MB/s
Roughly 36.5*1024/4 = 9,344 operations per second.
Now comparing Leprechaun's 2,334 phrases per second with Samsung's 9,344 operations per second gives
2,334/9,344*100% = 24.9% seek utilization.
Since the tallest BTREE has 3 levels (not counting the root) we have at most 4 SEEK ATTEMPTS (the first attempt
jumps to the hash slot housing the root), i.e. any phrase needs at most 4 RANDOM READS.
In fact the bottleneck is exactly the IOPS provided by the drive.
For HDDs the picture is tragic: an 11ms SEEK TIME compared to Samsung's 0.22ms (as Everest reports) is 11/0.22 =
50 times slower!
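
The arithmetic of Note 02 as a Python sketch. The last step is my own reading, not a measurement: if every phrase cost the full 4 random reads, the drive could serve at most a quarter of its IOPS in phrases per second.

```python
def random_read_iops(mb_per_second, block_kb):
    """Random-read operations per second implied by a benchmark figure."""
    return mb_per_second * 1024 / block_kb

iops = random_read_iops(36.5, 4)      # Everest: 36.5MB/s with 4KB blocks
measured = 2334                       # phrases per second, from Note 01
utilization = measured / iops         # phrases per available read operation
worst_case_ceiling = iops / 4         # assumption: every phrase needs all 4 reads

print(f"IOPS: {iops:,.0f}")
print(f"utilization: {utilization:.1%}")
print(f"phrases/s ceiling at 4 reads each: {worst_case_ceiling:,.0f}")

# HDD versus SSD seek time, as in the note.
print(f"HDD slowdown: {11 / 0.22:.0f}x")
```

Under that worst-case assumption the ceiling comes out within a few phrases per second of the measured 2,334P/s, which agrees with the claim that the drive's IOPS are the bottleneck.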

[Note 03]:
Of course enlarging the current HASH (24bit x 8bytes = 128MB) to, say, 27bit (1024MB) will probably decrease
these 4 attempts to 2.
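
A small Python sketch of why a bigger HASH should help. It computes the table size and the average number of phrases that share one hash slot (each slot houses the root of one tree), assuming the phrases spread evenly over the slots, which is an idealization of mine.

```python
TOTAL_PHRASES = 100_088_208

def hash_table_mb(bits, slot_bytes=8):
    """Size in MB of a table with 2**bits slots of slot_bytes each."""
    return (1 << bits) * slot_bytes // (1024 * 1024)

def phrases_per_slot(total, bits):
    """Average number of phrases per slot under an even spread."""
    return total / (1 << bits)

for bits in (24, 27):
    print(f"{bits}bit: {hash_table_mb(bits)}MB, "
          f"{phrases_per_slot(TOTAL_PHRASES, bits):.2f} phrases per slot")
```

With 24 bits each tree holds about 6 phrases on average; with 27 bits the average drops below one, so most trees would shrink to a single node and need fewer reads.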

[Note 04]:
E:\Masakari_revision3\_Gamera_r15_3-grams>EXTRACT_ALL_COMPRESSED_FILES.BAT
Extracting ...
Leprechaun_64bit.hsh.bsc decompressed 69607339 into 134217793 in 25.444 seconds.
Leprechaun_64bit.swp.bsc decompressed 1561846893 into 8868821006 in 847.647 seconds. !!! 6:1 !!!
_Gamera.tar.3.sorted.4andabove.txt.bsc decompressed 463047749 into 2938594566 in 137.406 seconds. !!! 6:1 !!!
]

Just want to put on the table the weight of one freely downloadable English text corpus
namely 'enwiki-20120403-pages-articles.xml'.
As the name suggests, it is a compilation of all English Wikipedia articles; the file is 37,430,769,961 bytes long.

Some months ago I ripped it and for 3-grams the stats are:
Total memory needed for one pass: 52,701,578KB
Total distinct phrases: 625,323,984
Total time: 51784 second(s)
Total performance: 61,038P/s i.e. phrases per second

The outcome is enwiki-20120403-pages-articles.3.sorted 19,354,345,361 bytes long.
Those 19,354,345,361 bytes are the pure/raw data; they require a 52,701,578KB HASH+TREES structure
for fast searching. Practical rips like this one make clear what the hardware requirements are for
dealing with a real-world text corpus.

Having removed the 'occurrences' field we have: 19,354,345,361 - (625,323,984*10) = 13,101,105,521 bytes of raw x-grams.
Now I need the signal-noise ratio, here the ratio of raw x-grams to search-structure, which
is 13,101,105,521 bytes/52,701,578KB or 12,794,048KB/52,701,578KB = 24% (bigger-the-better).
For me a 1:4 ratio is a good one, especially when the x-grams are ten times as many;
just take a seat: that means a 500GB search-structure.
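
The same ratio worked through in Python. The 10 bytes per phrase for the 'occurrences' field come straight from the subtraction above; the variable names are mine.

```python
SORTED_FILE_BYTES = 19_354_345_361
DISTINCT_PHRASES = 625_323_984
OCCURRENCES_FIELD_BYTES = 10          # per phrase, as used in the subtraction above
STRUCTURE_KB = 52_701_578             # HASH+TREES needed for one pass

raw_bytes = SORTED_FILE_BYTES - DISTINCT_PHRASES * OCCURRENCES_FIELD_BYTES
raw_kb = raw_bytes // 1024
ratio = raw_kb / STRUCTURE_KB

print(f"raw x-grams: {raw_bytes:,} bytes = {raw_kb:,}KB")
print(f"raw/structure ratio: {ratio:.0%}")

# Ten times as many x-grams at the same ratio, in decimal GB.
print(f"structure for 10x the x-grams: ~{STRUCTURE_KB * 10 * 1024 / 1e9:,.0f}GB")
```

The last line assumes the structure grows linearly with the number of x-grams; it lands in the same ballpark as the 500GB quoted above.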

elfin adj.
1a. Relating to or suggestive of an elf.
1b. Made, done, or produced by an elf.
2. Small and sprightly or mischievous.
3. Having a magical quality or charm; fairylike: moved across the dimly lit stage with elfin grace.
/HERITAGE/

The word that caught my eye, 'sprightly':
'adj. Full of spirit and vitality; lively; brisk. adv. In a lively, animated manner.'
was used within the 'small_and_sprightly_or_mischievous' 5-gram; the 3-gram I want to have in my armament
is 'small_and_sprightly' - it fits the description of an elf well.
Also the 'moved_across_the_dimly_lit_stage_with_elfin_grace' 9-gram is yummy; the 3-gram I want to add to
my 3-gram corpus is 'with_elfin_grace'.
My point: a corpus lacking the must-have 3-grams 'small_and_sprightly' and 'with_elfin_grace' is
a crippled one, and the same goes for all those nifty x-grams floating in books/magazines/newspapers and not yet
grabbed/ripped - this is the source of my avariciousness/greediness - I want them all.
* * *

"... Young hearts can go their way
Can't put it off another day
I don't care what others say ..."
/Chambers Brothers - Time Has Come Today/

Enfun! Copyleft Sanmayce, 2013 Jan 07