I would like to give some points you should keep in mind:
 - write simple and easily understandable code
 - use only a few libraries, especially for the command-line based gocr
 - Users often have poor systems. Gocr reads a file and writes a file.
   This should be possible with standard libs on any system.
 - write good comments; remember that another programmer has to
   understand and change your code
 - be careful when using a lot of memory or recursive functions.
   I can often get xli or gimp out of memory on my 128MB PC. :(
 - ...
(I remember that there are some guidelines for programmers on the web;
maybe I should insert some links to the "best" of them here.)

1.0 name of the package

I am a bit unhappy about having gocr on freshmeat and jocr on SourceForge
for the same thing. I think gocr is the better name, more intentional.
On SourceForge this name was already in use. What now?
------------------------------------------------------------
2.1. database-module

[0.2 update] What's going to happen with the database module? I can't see
a way to implement a general database in libgocr, so it will probably be
filed in the main module.

The database module is well separated from the ocr0-engine. It is a
directory containing a special font (e.g. Greek). We need a special
routine that is able to compare one unknown char with the chars of the
database in a robust way. I have done a first implementation, which works
more badly than well; see load_db(), ocr_db(), distance2().
The distance2() function does not work well enough. Try to write a
better function in a separate file database.c (?). The distance function
(distance2) has to be robust against size and small angles.

UPDATE: It will probably be part of one of the engine plugins now.
The development of this feature is frozen now.

2.1.1 detect difficult regions

It could be useful to check each box/cluster for regions which are
unstable against shrinking. I mean: if you shrink a character, lines get
thinner or break into two pieces. Or, in the other case, the end of a line
touches a black region in its immediate neighbourhood. If we are able to
detect such regions and mark or list them, it could be easier to correct
errors by using filters and similar things only in these regions.

estimated time: depends heavily on your abilities, probably more than 100h
------------------------------------------------------------
2.2. ocr0-engine

This is the main part, I think. I have a fast-growing database of image
files. Some of them show that characters are recognized in a wrong way.
You have to take such an example, look at the engine to see why the error
occurs, and find a good way to fix the problem. Afterwards, all other
examples should work as they did before the change, or even better.
That is what I have done for most of the development time.
------------------------------------------------------------
2.2.1 ocr0-elementary functions

The ocr0 engine uses a small set of functions:
loop(), turmite(), num_cross(), get_bw(), get_line()

The get_line2() function must be improved:
 - should detect if 2 points are connected by a black line
 - should work for any resolution
 - tolerance parameter should be used
------------------------------------------------------------
2.2.2 ocr0 special chars

Unicode implementation is currently being done.
------------------------------------------------------------
2.2.3 ocr0 splitting

Put the ocr0 args into one structure which can be used as argument for
ocr0a()...ocr0z(). Good idea? I am not sure.
Splitting is good for:
 - better reading of the code (reducing number of lines, compilation time)
 - better working on the code
 - better reorganization
but worse
 - for speed and ... what else?
votes: splitting=2 no_splitting=0
------------------------------------------------------------
2.2.4 using probabilities and alternatives if characters are bad

ini_list(), exclude() and getresult() are a first approach, look ...
Used for other things like:
  list="abcde..."
  wert={100,100,100,100,100,...
    /* wert is the German word for value; sorry,
       100 is 100%, could also be 1000 or other initial values */
  test_if_char_is_list[2]==c, function tells: it is never a 'c' => wert[2]=0
  test_if_char_is_list[0]==a, maybe it is, but not sure => wert[0]=80
  At the end, look at the highest wert[i] and take list[i] as the result.
  If there are problems, look at the second highest wert[j].
Clear? Hmm, I think I have to be more precise. An example:
  list={ 'c', 'e' }
  wert={ 100, 100 }   // 100 could also be 1000 or MAX_VALUE
  test_left_bow(box)  // only an example function!
  result: wert={ 100, 100 }  // left bow detected, not changed (could be c or e)
  test_horizontal_middle_line()  // only an example function!
  result: wert={ 100, 50 }  // no middle line detected, but could be a bad scan
                            // old_val=100 weight=0.5 => 100*0.5=50
  test_crotchet_on_right_upper_end()  // only an example!
  result: wert={ 80, 60 }   // crotchet was detected, typical for e but not for
                            // c, so the probability for c is lowered, but for e
                            // it is raised to (MAX_VALUE-old_value*weight),
                            // weight=0..1
------------------------------------------------------------
2.3. library

In development, called libgocr, and already working. We still need to
do/decide the following:
 - frontend communication architecture
   call_this_notifier_if_progress((int *)notifierfkt(char *what_happens,int percent ))
   ask_user(char *something)
------------------------------------------------------------
2.4 ocr2-engine (feature based)

What do you think about completely different engines? The main program
could switch from one engine to another if problems arise.
The form-based and database engines are partly implemented. The third way
should be a feature-based engine. I have started on ocr1.cc ocr2().
This engine should find the essentials of each char. A list of the longest
lines and bows found should be compared with a database. This database
should contain the essentials of letters, e.g. 'A' is built from 3 lines
between points p1,p2,p3,p4,... Point p2 lies "near" the middle of p1,p3
or similar.

UPDATE: will be a modular engine.
------------------------------------------------------------
2.5 fax/screen-fonts

What is the best way of detecting mini fonts? Gocr could work together with
screen grabbers, translation programs or speech programs.
------------------------------------------------------------
2.6 detect lines and boxes

 - detect underlined text, frames around boxes
------------------------------------------------------------
2.7 store images

If -m 4 is used, pictures should be detected.
Create a list of pictures and write functions:
  int get_num_pictures(), getpicture(int num,pix *dest,...)
Use these functions to create images o_imgXXX.pnm.
------------------------------------------------------------
2.8 font-type/serifs detection

I would like to see a function able to distinguish between italic, bold,
slanted and tt-fonts.
tt-font detection is important for inter-character space detection:
the distances between middle axes are always a multiple of the tt-width.

UPDATE: bbg has some ideas.

2.8.1 space between characters

Write a function which is able to decide which of the following font
types is used:
 - fixed-width font
 - proportional-width font
The algorithm should use the boxlist, where information about the size and
position of every detected character/glyph is stored. Extract the
essential information to find the space between words. The distance
between middle points could be such a value (fixed-width font). The
distance between the right side of a character and the left side of the
following character could also be the essential value (proportional font).
An extension could be to estimate the values and their tolerance for
grouping characters into words and sentences (related variable = env.cs).
Another extension could be to find single fixed-width-font words in a
proportional-width-font text or vice versa.

estimated time: 20-40h
------------------------------------------------------------
2.9 what about outputting HTML or TeX?

Not in the near future, but ...
[0.2] Solved by libgocr.
------------------------------------------------------------
2.9.1 Users wish to get the positions (absolute or relative?) of the
chars. Output in xfig or pdf format?
This sounds to me like a compression-like function for text images:
every character can be seen as a mini picture, and some of them are equal.
------------------------------------------------------------
2.10 learn mode

If we do not use a simple expandable database of master characters,
it would be nice to have a "code morphing" algorithm for the hard-coded
engine.
It should be possible, because we have the sources (Open Source) and
other projects also modify parts of their own code (MUDs, Transmeta, ...).
The simpler variant is the database variant of the engine.
------------------------------------------------------------
2.11 page orientation

Write a function which is able to decide if the picture is rotated by
90, 180 or 270 degrees. If possible, the algorithm should use only the
boxlist, where information about the size and position of every detected
character/glyph is stored. For this you have to play with scanned
examples. Analyse characteristic quantities using Fourier transforms or
other methods (look at the literature). After detection, rotate the pixmap
back if it was rotated and modify the lists.

estimated time: 40-80h
------------------------------------------------------------
2.12 detect math formulas

[0.2] is supported, but not implemented, by libgocr.
------------------------------------------------------------
2.13 improvement of essential functions

2.13.1 improve speed and quality of frame_nn() algorithm

[0.2] rewritten as gocr_charSetAllNearPixels().

2.13.2 improve remove_melted_serifs() function

On low-resolution scans, two neighbouring characters are often glued
together if they have serifs. This problem arises very often! I think
serifs are easily detectable, and it should be possible to write an
algorithm which does a better job than the old function. The old function
does not detect all serifs. If you write a better function, be careful not
to remove too much. Removing means: change black pixels between the two
chars to white or light gray ones.

estimated time: 40-60h or more

2.13.3 improve remove_dust() function

I mean scanned dusty pages. They often have speckles (I hope that is the
right word). The speckle size and form should follow statistical laws, so
you can estimate the largest size and avoid removing too much. The existing
function should be a good starting point.
The function should remove the boxes from the boxlist and lighten the
pixels on the pixmap.

estimated time: 60h
------------------------------------------------------------
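The remove_dust() idea from 2.13.3 can be sketched as follows. This is only a minimal sketch under stated assumptions, not the existing gocr function: the Box and Pixmap types, the function names, and the rule "a box is dust if its area is below the median box area divided by a chosen divisor" are all hypothetical. A real statistical model of speckle sizes would replace that rule.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical types for illustration; not gocr's real structures. */
typedef struct { int x0, y0, x1, y1; int removed; } Box;
typedef struct { unsigned char *p; int w, h; } Pixmap; /* 0 = black, 255 = white */

static int box_area(const Box *b) {
    return (b->x1 - b->x0 + 1) * (b->y1 - b->y0 + 1);
}

static int cmp_int(const void *a, const void *b) {
    int ia = *(const int *)a, ib = *(const int *)b;
    return (ia > ib) - (ia < ib);
}

/* Assumed rule: dust limit = median box area / divisor.
 * Returns 0 (remove nothing) if memory cannot be allocated. */
int dust_threshold(const Box *boxes, int n, int divisor) {
    assert(boxes != NULL && n > 0 && divisor > 0);
    int *areas = malloc((size_t)n * sizeof *areas);
    if (areas == NULL) return 0;
    for (int i = 0; i < n; i++) areas[i] = box_area(&boxes[i]);
    qsort(areas, (size_t)n, sizeof *areas, cmp_int);
    int median = areas[n / 2];
    free(areas);
    return median / divisor;
}

/* Flags every box smaller than the limit as removed (a stand-in for
 * deleting it from the boxlist) and lightens its pixels to `light`.
 * Returns the number of boxes treated as dust. */
int remove_dust_sketch(Box *boxes, int n, Pixmap *pm,
                       int divisor, unsigned char light) {
    assert(pm != NULL && pm->p != NULL);
    int limit = dust_threshold(boxes, n, divisor);
    int count = 0;
    for (int i = 0; i < n; i++) {
        Box *b = &boxes[i];
        if (b->removed || box_area(b) >= limit) continue;
        for (int y = b->y0; y <= b->y1; y++) {
            for (int x = b->x0; x <= b->x1; x++) {
                if (x < 0 || y < 0 || x >= pm->w || y >= pm->h) continue;
                unsigned char *px = &pm->p[y * pm->w + x];
                if (*px < light) *px = light; /* only lighten, never darken */
            }
        }
        b->removed = 1;
        count++;
    }
    return count;
}
```

Using the median keeps a few very large boxes (pictures, frames) from inflating the limit; the divisor controls how aggressively small boxes are treated as dust and would have to be tuned on real scans.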
3.1. developer introduction

[0.2] done in libgocr.

3.1.1 definitions/english

[0.2] done in libgocr.
4.1 graphical user interface

Should be a separate package (XGOCR ?) using the gocr lib or exe.
I would like to see an easy-to-use tcl/tk mini GUI (gocr.tcl).
Simple problem: how to present a pgm image on a tk canvas?

UPDATE: A gtk frontend is done.

4.2 graphical debug interface

See section 5.
5.1. speed optimizing

I think this task is not very important yet. But ... someone could keep
an eye on it to avoid a dramatic slowdown of the program. You should have
experience with gprof. Make a list of speed killers and suggestions for
possible changes. Or you can make simple changes yourself.

5.1.1 speed analysis

Document the speed of every function and make a list of functions sorted
by the runtime they need. Analyse where improvements could give a high
speedup, and make suggestions about what could be changed and how large
the improvement would be. Use gprof for testing speed. Of course, the next
step would be to make the changes and report the speedup. A list of the
speeds of other programs could be extracted from the literature, and a
milestone could be set for the gocr package.

estimated time: 20-40h

5.1.2 pixel() function

pixel() contains a lot of if-then constructs (v0.2.7) and is often called
by other functions. Is there a speedup if the filter is split into an AND
and an OR part? If yes, do it and report the speedup.
Another possibility is to make a kind of morphing code. I mean: you
should read the filter table (filt3[][] of pixel.c) and create source code
which can be compiled and ... in the ideal case loaded at run time.
At first it would be enough to generate the code from the table at
compilation time.

estimated time: 40-80h for the table to C code translator

5.2 quality check

Look at the REMARK.txt file. It is a list of test images, some remarks and
the number of errors. I would like to see an HTML table. The user should
see how good gocr is in comparison to other programs. I can give you
access to some images with different fonts or some difficulties (noise
etc.). All files are from users (directly from real life).

5.2.1 Tests could be automated if we used the following system for the
image library: besides [imagename].ext, there should be an [imagename].txt,
which was manually typed and is the correct text. We could make tests,
comparisons and statistics quite easily that way. It could easily be
managed by a Makefile.
The image library should be well-organized, with different styles of text
in different directories: fax, excellent quality, italic, bold, accents,
greek, etc.

5.2.2 Test files should be sorted into directories:
> text/quality/excellent
> text/quality/good
> text/quality/bad
> text/accents
>   (with the TeX, HTML, etc codes)
> chars/
>   (ASCII characters; only one type per file. First letter of the file is the
>   character. Examples: a0.ext, aitalic.ext, etc, are images only with lower
>   case 'a's. I think no text files are needed; perhaps the number of
>   characters in file? can be used as database.)
> symbols/
>   (text file should contain information such as name of character, code in
>   TeX, html, etc. Can be used as database)
Put them all in a directory, called "examples" or something.

UPDATE: being done by BBG.

5.3 bmp files

Maybe it is good for hard disk space and speed if bmp output (rle-packed)
is used. See pcx.cc writebmp().

5.4 graphical debug interface

[0.2] Support is built into libgocr.
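The automated comparison proposed in 5.2.1 needs some way to count errors between the gocr output and the manually typed [imagename].txt. One possible measure, sketched below, is the Levenshtein edit distance (the number of inserted, deleted or substituted characters). This is an assumption for illustration: the function name count_ocr_errors() is hypothetical and is not part of gocr, and reading the two files and driving the runs from a Makefile are left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: edit distance between the OCR output and the
 * reference text, computed with a single rolling row.
 * Returns -1 if memory cannot be allocated. */
int count_ocr_errors(const char *ocr, const char *ref) {
    assert(ocr != NULL && ref != NULL);
    size_t n = strlen(ocr), m = strlen(ref);
    int *row = malloc((m + 1) * sizeof *row);
    if (row == NULL) return -1;

    /* Distance from the empty OCR prefix to each reference prefix. */
    for (size_t j = 0; j <= m; j++) row[j] = (int)j;

    for (size_t i = 1; i <= n; i++) {
        int diag = row[0];            /* D[i-1][j-1] */
        row[0] = (int)i;              /* D[i][0]     */
        for (size_t j = 1; j <= m; j++) {
            int up = row[j];          /* D[i-1][j]   */
            int best = diag + (ocr[i - 1] != ref[j - 1]); /* substitute/match */
            if (up + 1 < best) best = up + 1;             /* extra OCR char   */
            if (row[j - 1] + 1 < best) best = row[j - 1] + 1; /* missing char */
            row[j] = best;
            diag = up;
        }
    }
    int distance = row[m];
    free(row);
    return distance;
}
```

Dividing the result by the length of the reference text would give an error rate per image, which could then be collected into the table mentioned in 5.2.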
6.1 make a rpm package

Change Makefile.in to implement a "make rpm-package".
 - configure.in etc. should be updated for
   - using libpgm, libppm, libpbm (define USE_LIBPGM,...)

6.2 make winexe

I am able to make a winexe of gocr using DJGPP. People are interested in
it: I can count the downloads, and from time to time I get e-mails from
Windows users telling me that there are better programs. ;)
I like that, but I do not like working with Windows. Maybe you are
interested in pleasing Windows users by creating the WINEXE of gocr.

Developers:
deadline:
Notes:

We need sponsors:
 - What do you think of giving prizes for the best contributions?
 - Or having money to pay students for programming?
 - Or someone who could pay for a journey to a developer conference?
 - Or could pay for an online connection?
 - Or has some additional computer equipment (notebooks, scanners)?

Do we need www.gocr.org?