SMART. String Matching Algorithms Research Tool

how to

a smart documentation

This page contains manuals and quick reference guides for the free smart framework.
Here you can find a list of how-to items.

system requirements

You can run SMART on any computer running Linux, Windows or Mac OS X.
The tool uses shared memory for storing the text. Thus smart requires your system to allow the allocation of shared memory. The default size of the text is 1 MB, which is small enough to be supported by any system.
However, if you want to use smart to test algorithms on larger texts, you must check your system's shared memory settings.
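
One way to check this requirement on Linux or Mac OS X is to try to create a segment of the desired size. The following is only a minimal sketch: it assumes that smart allocates its text buffer as a System V shared memory segment (which the kern.sysv settings and the ipcs/ipcrm commands mentioned in this page suggest), and the function names can_allocate_shm and megabytes_to_bytes are hypothetical helpers, not part of smart.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Returns 1 if a private System V segment of `bytes` bytes can be
 * created, 0 otherwise. The segment is removed again immediately. */
int can_allocate_shm(size_t bytes)
{
    /* IPC_PRIVATE asks for a fresh segment; 0600 = owner read/write. */
    int id = shmget(IPC_PRIVATE, bytes, IPC_CREAT | 0600);
    if (id == -1)
        return 0;               /* for example, size above shmmax */
    shmctl(id, IPC_RMID, NULL); /* mark the segment for removal */
    return 1;
}

/* Converts a text size in megabytes (as passed to -tsize) to bytes. */
size_t megabytes_to_bytes(size_t mb)
{
    return mb * 1024u * 1024u;
}
```

Calling can_allocate_shm(megabytes_to_bytes(4)) before running a test with -tsize 4 would give an early hint on whether the system limits need to be raised.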

Manage shared memory on Mac OS X

The amount of shared memory available on a Mac is configured at boot time. Once the shared memory system has been initialized, it is not possible to change the shared memory configuration. At present the same amount of shared memory is configured on any Mac (about 4 MB), regardless of the number of processors or the total amount of memory available.
You can view the shared memory settings on your Mac by opening the Terminal application and giving the command

sysctl -A | grep shm

which should produce something like:

kern.sysv.shmmax: 4194304
kern.sysv.shmmin: 1
kern.sysv.shmmni: 32
kern.sysv.shmseg: 8
kern.sysv.shmall: 1024

Since Mac OS X 10.3.9 there has been a relatively simple mechanism for configuring shared memory at boot time. If the file /etc/sysctl.conf exists, the settings in this file are applied at boot time, before the default shared memory settings. To change the shared memory settings, open the file with root privileges, for example with the command:

sudo emacs /etc/sysctl.conf

It is likely this file doesn't exist on your system, in which case an empty file will be created. Edit this file so that it contains the lines:

kern.sysv.shmmax=16777216
kern.sysv.shmmin=1
kern.sysv.shmmni=128
kern.sysv.shmseg=32
kern.sysv.shmall=4096

These settings increase the amount of shared memory to four (4) times the usual default. These shared memory settings will be applied the next time the computer boots. You can verify the settings after the reboot using the "sysctl -A" command demonstrated above.
Shared memory can be viewed with the ipcs command and you can delete shared memory segments with the ipcrm command.

Manage shared memory on Linux

In order to configure shared memory on Linux you have to log in as root, then edit the file /etc/sysctl.conf.
The kernel.shmmax parameter defines the maximum size in bytes for a shared memory segment. Determine the value of kernel.shmmax by performing the following:

cat /proc/sys/kernel/shmmax
33554432

The kernel.shmall parameter sets the total amount of shared memory, in pages, that can be used at one time on the system. Set the value of both of these parameters to the amount of physical memory on the machine.
As in the previous case, you can determine the value of kernel.shmall by performing the following:

cat /proc/sys/kernel/shmall
2097152

Set the values of kernel.shmmax and kernel.shmall as follows:

echo MemSize > /proc/sys/kernel/shmmax
echo MemSize > /proc/sys/kernel/shmall

where MemSize is the number of bytes.
For example, to set both values to 2GB, use the following:

echo 2147483648 > /proc/sys/kernel/shmmax
echo 2147483648 > /proc/sys/kernel/shmall

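These values take effect immediately but are lost when the machine reboots. To make them permanent, also add the lines kernel.shmmax = MemSize and kernel.shmall = MemSize to /etc/sysctl.conf.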
Shared memory can be viewed with the ipcs command and you can delete shared memory segments with the ipcrm command.
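
Since kernel.shmmax is expressed in bytes while kernel.shmall is expressed in pages, the same number means very different limits for the two parameters. The sketch below shows the conversion; the helper names are hypothetical, and the 4096-byte page size used in the example is an assumption (it is common on x86 Linux, but system_page_size() reports the real value).

```c
#include <unistd.h>

/* Number of pages needed to hold `bytes` bytes, rounding up. */
unsigned long long bytes_to_pages(unsigned long long bytes,
                                  unsigned long long page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/* Page size of the running system, as reported by sysconf(). */
long system_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}
```

With 4096-byte pages, bytes_to_pages(2147483648ULL, 4096ULL) gives 524288, so a kernel.shmall of 524288 pages already covers 2 GB. Writing 2147483648 to shmall, as in the example above, simply sets a much more generous limit.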

Manage shared memory on Windows Vista

Windows Vista sets aside a certain amount of memory space in case it needs to be used by open programs. You can adjust how much shared memory is set aside by Vista by changing settings in your system's BIOS menu.
First restart your computer and repeatedly tap the "F1" button. This should open the BIOS menu. Some computers use "F10" or the "Delete" key. A message is displayed immediately on restart alerting you to the proper key for your system.
Press the down arrow to select "Integrated Peripherals" and hit the "Enter" key.
Highlight the option titled "AGP Aperture Size." This option designates your shared memory.
Adjust the number as needed. Lower settings give you less shared memory, while higher ones give you more.
Hit "F10" to save and exit.

how to install smart

In order to install the smart tool on your system, follow the simple steps shown below:

download the correct binary package for your system from the download page. This is a compressed archive containing the binaries and the text corpora.
copy the archive to a local directory, where you want to install smart, and unpack the archive.

This will create a new directory, named smart/, containing the following files and subdirectories:

docs/ is the directory containing the documentation files;
source/ is the directory containing the binary code of all string matching algorithms;
results/ is the directory which will contain the files with experimental data;
data/ is the directory containing the corpora which are used for testing the string matching algorithms;
copyright.txt contains the copyright disclaimer;
gpl-3.0.txt contains the GNU general public license;
smart is the main program used for running experimental tests;
test is a smart utility used for testing the correctness of string matching algorithms;
select is a smart utility used to select/deselect string matching algorithms.

algorithms selection for testing

The program select can be used to select/deselect algorithms for experimental results. Write the name of the algorithm (or a list of algorithms), just after the command select, to select or deselect it. For example, the command

$ ./select bndm fs hash3

will select the BNDM, Fast-Search and HASH3 algorithms. The subsequent command

$ ./select fs

will deselect the Fast-Search algorithm.
The command parameter -which shows the algorithms which have been selected for experimental results. Thus running the command

$ ./select -which

will produce the output

bndm
hash3

The command parameter -show lists the names of all algorithms which can be included in the experimental results. Moreover, use the parameter -none for deselecting all algorithms and -all for selecting all algorithms.
For example the command

$ ./select -none kmp bf

deselects all previous algorithms and selects the Knuth-Morris-Pratt and Brute-Force algorithms. Similarly the command

$ ./select -all kmp bf

will select all algorithms with the exception of Knuth-Morris-Pratt and Brute-Force.
Finally, the parameter -h will produce a help list.
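
The examples above suggest that select processes its arguments from left to right, that -none and -all reset every flag, and that an algorithm name toggles the state of that algorithm. The following sketch models this behaviour; it is an interpretation of the examples, not the actual implementation of select, and the registry ALGOS is a hypothetical list holding only the algorithms named in this section.

```c
#include <string.h>

#define N_ALGOS 5

/* Hypothetical registry: only the algorithms named in the examples. */
static const char *ALGOS[N_ALGOS] = { "bf", "bndm", "fs", "hash3", "kmp" };

/* Applies select-style arguments, left to right, to the flag array
 * `selected` (one 0/1 entry per algorithm in ALGOS). */
void apply_select(int selected[N_ALGOS], int argc, const char *argv[])
{
    for (int a = 0; a < argc; a++) {
        if (strcmp(argv[a], "-none") == 0 || strcmp(argv[a], "-all") == 0) {
            int value = (argv[a][1] == 'a');    /* -all -> 1, -none -> 0 */
            for (int i = 0; i < N_ALGOS; i++)
                selected[i] = value;
            continue;
        }
        for (int i = 0; i < N_ALGOS; i++)
            if (strcmp(argv[a], ALGOS[i]) == 0)
                selected[i] = !selected[i];     /* a name toggles its flag */
    }
}
```

Under this model, applying the arguments -all kmp bf first selects every algorithm and then toggles kmp and bf off, which matches the behaviour described above.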

how to run experimental tests

The main command smart is used for running experimental tests.
The easiest way to use smart is to run a single search for a custom pattern in a custom text. For this purpose use the -simple parameter followed by the pattern and the text. For instance the command

$ ./smart -simple let sampletext

will search the text sampletext for occurrences of the pattern let. Observe that the input pattern size is bounded by 100 characters, while the text size is bounded by 1000 characters.
The simple mode does not output any experimental map.
Otherwise you can select the corpus which will be used to compute the experimental results by using the parameter -text (this parameter is mandatory). For instance the command

$ ./smart -text englishTexts

will compute experimental results on the englishTexts corpus. The directory data/, located in the smart main directory, contains all the corpora which can be selected in smart. See this page for the list of all corpora included in smart. Alternatively, you can use the parameter -text all in order to run experimental tests on all corpora.

$ ./smart -text all

In this last case, the corpora will be processed one after the other.
You can set an upper bound on the size of the text used for testing the string matching algorithms. By default this upper bound is set to 1 MB (1048576 bytes). This means that (at most) the first 1048576 bytes of the selected corpus will be used for testing. You can change the default upper bound by using the parameter -tsize, followed by an integer value indicating the number of megabytes to be used. For instance the command

$ ./smart -text englishTexts -tsize 4

will perform the tests on at most 4 MB of the englishTexts corpus.
The text buffer is stored in shared memory, thus if you set the upper bound to a value K you need to ascertain that your system allows the allocation of at least K MB of shared memory.
By default, for each input file, smart generates sets of 500 patterns of fixed length, randomly extracted from the text (500 copies of the same pattern in the case of the -simple mode). The length of the patterns ranges over the values 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048 and 4096. For each set of patterns the tool reports the mean over the running times of the 500 runs. Running times are expressed in milliseconds.
Use the parameter -pset in order to modify the size of the set of patterns generated by the tool. For instance the command

$ ./smart -text genome -pset 100

will run experimental tests on the genome corpus generating sets of 100 patterns of fixed length (the default value is 500).
You can use the parameter -short in order to perform experimental tests on short patterns. In particular the command

$ ./smart -text genome -pset 100 -short

performs experimental tests by generating sets of 100 patterns of fixed length ranging over the values 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30 and 32.
If you want to restrict the tests to a given range of pattern lengths, use the parameter -plen followed by a lower bound and an upper bound for the length of the pattern. For instance the command

$ ./smart -text genome -pset 100 -plen 16 128

performs experimental tests over the length values 16, 32, 64, 128. If you want to test the algorithms for a single length of the pattern (for instance 128) use the command

$ ./smart -text genome -pset 100 -plen 128 128
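
The effect of -plen on the default lengths can be sketched as a filter over the powers of two from 2 to 4096. This is only an illustration consistent with the two examples above: the function name pattern_lengths is hypothetical, and how smart treats bounds that are not powers of two is not stated in this page.

```c
/* Fills `out` with the default pattern lengths (2, 4, 8, ..., 4096)
 * that fall within [lo, hi]; returns how many were written.
 * `out` must have room for at least 12 entries. */
int pattern_lengths(int lo, int hi, int out[])
{
    int count = 0;
    for (int len = 2; len <= 4096; len *= 2)
        if (len >= lo && len <= hi)
            out[count++] = len;
    return count;
}
```

For the bounds 16 and 128 the function yields the four lengths 16, 32, 64 and 128, while for the bounds 128 and 128 it yields the single length 128.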

During the execution of the tests the tool prints out the running times of each algorithm (this is the default setting). Use the parameter -occ for printing out also the number of occurrences found by the algorithm during the runs. Since patterns are randomly extracted from the text, the number of occurrences is at least 1.
Finally, the parameter -h will produce a help list.
The smart tool associates with every experimental test a unique alphanumeric code of 13 characters, consisting of EXP followed by 10 digits. The execution of the following command

$ ./smart -text genome -pset 100 -occ

starts the test on the genome corpus, using sets of 100 patterns. The first lines of the execution should be the following

Try to process archive genome
Loading the file data/genome/ecoli.txt
Text buffer of dimension 1048576 byte
Starting experimental tests with code EXP1306868286

____________________________________________________________
SMART EXP1306868286
Experimental results on genome
Searching for a set of 100 patterns with length 2
Testing 4 algorithms
- [1/4] BF ..................[OK] 10.34 ms 66513
- [2/4] BNDM ................[OK] 12.68 ms 66513
- [3/4] FS ..................[OK] 9.13 ms 66513
- [4/4] HASH3 ...............[--]

The code associated with the test is EXP1306868286.
At the end of the test, experimental results are saved in the directory results/EXP1306868286.
The last lines of the execution should be the following

SMART EXP1306868286
Experimental results on genome
Searching for a set of 100 patterns with length 4096
Testing 4 algorithms

- [1/4] BF ..................[OK] 10.71 ms 1
- [2/4] BNDM ................[OK] 4.26 ms 1
- [3/4] FS ..................[OK] 4.04 ms 1
- [4/4] HASH3 ...............[OK] 3.13 ms 1

OUTPUT RUNNING TIMES EXP1306868286
Saving data on EXP1306868286/genome.txt
Saving data on EXP1306868286/genome.xml
Saving data on EXP1306868286/genome.html
Writing EXP1306868286/index.html
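
When collecting results with a script, it can be handy to recognise experiment codes among the entries of results/. The sketch below checks the format described above (EXP followed by 10 digits); is_experiment_code is a hypothetical helper and not part of smart.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 if `code` has the form EXP followed by exactly 10 digits. */
int is_experiment_code(const char *code)
{
    if (strlen(code) != 13 || strncmp(code, "EXP", 3) != 0)
        return 0;
    for (int i = 3; i < 13; i++)
        if (!isdigit((unsigned char)code[i]))
            return 0;
    return 1;
}
```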

display experimental results

At the end of the execution of an experimental test smart stores experimental data in the directory results/EXPCODE, where EXPCODE is the unique code associated with the experimental test.
Files containing experimental data are named with the name of the corpus which has been selected. The system can store experimental data in three different formats: latex, xml and html format.
Experimental results in html format are stored by default.
In addition you can make smart store experimental data in latex or txt formats by using the options -tex and -txt, respectively.
For instance, the command

$ ./smart -text englishTexts -txt -tex

will test the selected algorithms on the englishTexts corpus and will store experimental data in latex and txt format.

Files in text format have extension .txt and the following structure

BF    10.45  10.85  10.65  10.71  10.64  10.70....
BNDM  12.32  8.26   5.43   3.95   3.30   3.42....
FS    9.22   6.70   5.61   4.62   4.48   4.32....
HASH3 -      6.90   3.87   2.93   2.58   3.40....
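
A line of such a file can be read back with a few lines of C. The sketch below is a hypothetical helper, not part of smart; it assumes that fields are separated by blanks and that a "-" entry marks a pattern length for which the algorithm produced no running time, which is stored here as -1.0.

```c
#include <stdlib.h>
#include <string.h>

/* Parses one line such as "FS 9.22 6.70 5.61" into an algorithm name
 * and up to `max` running times. A "-" entry is stored as -1.0.
 * Returns the number of values read, or -1 if the line has no name. */
int parse_result_line(const char *line, char name[32], double values[], int max)
{
    char buf[512];
    strncpy(buf, line, sizeof buf - 1);     /* strtok modifies its input */
    buf[sizeof buf - 1] = '\0';

    char *tok = strtok(buf, " \t\r\n");
    if (tok == NULL)
        return -1;
    strncpy(name, tok, 31);
    name[31] = '\0';

    int count = 0;
    while (count < max && (tok = strtok(NULL, " \t\r\n")) != NULL)
        values[count++] = (strcmp(tok, "-") == 0) ? -1.0 : strtod(tok, NULL);
    return count;
}
```

The function copies the line before tokenising it, so the caller's string is left untouched.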

Files in xml format have extension .xml. They report data in a structured format suitable to be processed or included in html files. They have the following structure

<RESULTS>
  <CODE>EXP1306868628</CODE>
  <TEXT>genome</TEXT>
  <ALGO>
    <NAME>BF</NAME>
    <DATA>10.45</DATA>
    <DATA>10.85</DATA>
    <DATA>10.65</DATA>
    <DATA>10.71</DATA>
    <DATA>10.64</DATA>
    <DATA>10.70</DATA>
    <DATA>10.92</DATA>
    <DATA>10.79</DATA>
    <DATA>10.80</DATA>
    <DATA>10.92</DATA>
    <DATA>10.70</DATA>
    <DATA>10.71</DATA>
  </ALGO>
  <ALGO>
    <NAME>BNDM</NAME>
    <DATA>12.32</DATA>
    <DATA>8.26</DATA>
    .................

Finally, html files present data in a tabular format. An additional index.html file is generated which contains the list of all experimental results computed during the test.

how to test your own algorithms

It is possible to add your own string matching algorithms to smart and test them against the other algorithms.
A new algorithm must be implemented in the C language. The file must contain the following include directive

#include "include/main.h"

while the main search function must be defined as

int search(unsigned char *x, int m, unsigned char *y, int n)

where x is the pattern, y is the text, and m and n are their lengths. The function must return the number of occurrences of the pattern in the text.
If the algorithm does not run under particular conditions (for instance when the length of the pattern is less than a given value), please make it return the value -1.
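
As an illustration of this interface, here is a minimal brute-force implementation. It is a sketch written for this page, not the BF code shipped with smart, and it assumes that overlapping occurrences are counted separately. The include of main.h is left out so that the example compiles on its own.

```c
/* In the smart source tree this file would also start with
 *     #include "include/main.h"
 * which is left out here so that the sketch compiles on its own. */

/* Brute-force search: counts the occurrences of x[0..m-1] in y[0..n-1]. */
int search(unsigned char *x, int m, unsigned char *y, int n)
{
    if (m < 1)
        return -1;              /* not applicable: empty pattern */

    int count = 0;
    for (int j = 0; j + m <= n; j++) {      /* every alignment of x in y */
        int i = 0;
        while (i < m && x[i] == y[j + i])
            i++;
        if (i == m)
            count++;            /* full match at position j */
    }
    return count;
}
```

For example, searching for the pattern let in the text sampletext with this function reports one occurrence.
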
Before compiling the C file, copy the header file main.h (which is in source/algos/include) to the same directory. Then put the compiled binary file in the directory source/bin.
Before running a new experimental setting you can test the correctness of your algorithm by executing the following command

$ ./test algoname

where algoname is the name of the binary file of your algorithm. If your algorithm is correct the following output will be displayed

Please, wait a moment....
Well done! Test passed successfully

Now you can include your algorithm in smart by using the select utility. The following command adds the algorithm

$ ./select -add algoname

Notice that smart includes the algorithm in the set only if it's correct. If your algorithm is correct the following output will be displayed

Adding the algorithm algoname to SMART
Testing the algorithm for correctness....ok
Algorithm algoname added succesfully.

Now you need to select the algorithm by running the command

$ ./select algoname

how to compile smart source code

In order to compile the smart source code you have to:

download the source package from the download page. This is a compressed archive containing all the C source code and the text corpora.
copy the archive to a local directory, where you want to install smart, and unpack the archive.

This will create a new directory, named smart/, containing the following files and subdirectories:

docs/ is the directory containing the documentation files;
source/ is the directory containing the C source code of all string matching algorithms and utilities;
results/ is the directory which will contain the files with experimental data;
data/ is the directory containing the corpora which are used for testing the string matching algorithms;
copyright.txt contains the copyright disclaimer;
gpl-3.0.txt contains the GNU general public license;
makefile is the bash file used for compiling the C source files of the tool.

Run the command ./makefile in order to compile the smart tool. The code of each string matching algorithm is compiled and tested separately. The resulting binary code is stored in the directory source/bin.
