num-utils - Goals for this project
-----------------------------------

(NOTE: This document does not represent the current feature set. Please
read the README file and man pages instead.)

This document outlines criteria that the num-utils should eventually
meet. It describes a general interface that each utility should follow
and guides the project toward version 1.0 of the num-utils.

The initial releases of the num-utils will be small and much less
featured than what is shown below. I originally wrote the num-utils with
many of the features shown below, but everything was rather unorganized
and some of the programs simply processed the data incorrectly. I decided
to go back to the beginning, write down all the features that I would
like to have and think are useful, and create a development cycle. I'll
release each version to the public for review, comments and contribution.

Anyone who is interested in helping out in any way, please contact me at
suso@suso.org. Any contributions are welcome.

I've put an 'X' at the beginning of the line for features that have been
completed.


-= All numeric utilities =-

1. All num utils should process data either on STDIN or from a file or
   files specified as command line arguments.

   Examples of STDIN usage:

       cat file | num-util
       num-util < file
       num-util
           [Program will wait for data on STDIN until Ctrl-D is pressed]

   Examples of file argument usage:

       num-util file
       num-util file1 file2 file3 ... fileN
       num-util -[options] file1 file2 file3

2. All num utils should use the following options, which are standard
   among all the utilities:

X      -h  -- Provide helpful usage information.
X      -V  -- Be verbose. Send verbose information to STDERR.
X      -d  -- Debug information for developers. This implies verbose and
              more.
       -q  -- Quiet mode. Don't print out any warnings about file read
              errors.
       -i  -- Output numeric values as their integer portion only.
       -I  -- Output numeric values as their decimal portion only.
       -a  -- Don't treat '-' characters as negative signs. Think of -a
              as 'absolute value'.
       -d  -- Don't treat the '.' character as a decimal point.
       -N  -- Don't process complete numbers; process each numeric
              character individually.
       -C  -- Commify numbers in output. Instead of printing out
              '1000000', it should print out '1,000,000'.
       -b  -- Use the following base when dealing with numbers.
       --  -- Don't process any more arguments as options.

3. All num utils should provide user documentation in the form of man
   pages.

   - Since the programs are written in perl, I plan on using POD
     information in the programs themselves; that documentation can then
     be exported to man pages, info docs or HTML format.
   - Eventually I would like to write the documentation in SGML so that
     it can be easily converted into many different formats.

4. All num utils should provide summarized help with the -h option,
   similar to the following:

   ----------------------------------------------
   numutil: process numbers in such a way......
   ----------------------------------------------
   Usage: numutil [options] [file args]
          STDIN | numutil [options]
          numutil [options] < file

   Options:
       -h      Help: You're looking at it.
       -V      Increase verbosity.
       -d      Don't treat the '.' character as a decimal point.

5. All numeric utilities should abstract their functionality as much as
   possible so that a numeric processing module can be written to make
   things consistent among all numeric utility programs.

6. The num-utils set of utilities should be available in the following
   forms for download:

   o tar format, both gz and bz2
   o rpm package
   o gentoo ebuild tree
   o deb package

7. The Makefile should be set up to create the different package formats
   listed above, as well as the man pages from the perldoc information in
   each utility's source code.

8. Numeric expressions (regular expressions, plus operations for numbers)

       ..  -- range operator
       i   -- increment operator
       f   -- factor operator
       m   -- multiple operator
       ,   -- expression separator
       []  -- character grouping
       +   -- quantifier (1 or more)
       *   -- quantifier (0 or more)
       ?   -- Match the preceding character 0 or 1 times.
       {}  -- for matching a specific number of times.

-- Some goals sent in by areiner@tph.tuwien.ac.at (thanks) --

- Support for locale changes between . and , usage. In other parts of
  the world a comma is used in place of the decimal point and a period as
  the thousands separator (e.g. 1.000.000,50 means one million and five
  tenths).
- Some of the utilities should be able to recurse through subdirectories,
  where appropriate.
- Support for output and input of numbers in scientific notation
  (e.g. 1.3e-7 for 0.00000013).
- A utility for sorting numbers in mixed notations. For instance, it is
  hard to sort numbers that are in scientific notation or commified
  alongside ones that aren't. Maybe this is a call for a numsort utility.


-= Individual program goals =-

-- numsum --

This program adds up all numbers encountered. In its basic operation, it
will simply add up all the numbers that it encounters. The numbers can be
one per line, multiple per line separated by spaces, or multiple numbers
separated by anything. Eventually, it would be nice if numsum could do
things like sum up individual rows or columns of input separately. There
might be other types of summing functions that would be useful when
dealing with textual input.

Examples

    $ cat numbers
    1 2 3 4
    $ cat numbers | numsum
    10
    $ numsum numbers
    10

Advanced Examples

    $ cat columns
    1  6 11 16 21
    2  7 12 17 22
    3  8 13 18 23
    4  9 14 19 24
    5 10 15 20 25
    $ cat columns | numsum -c
    15 40 65 90 115
    $ cat columns | numsum -r        (add up the rows)
    55
    60
    65
    70
    75
    $

Usage options (in addition to the standard options)

       -a  -- Add all numbers in the file, not just the first ones found
              on each line.
X      -c  -- Treat each line as a set of columns separated by white
              space, or a string if the -s option is used.
              Sum up the values in each column and print out the result
              for each column, separated by the separation character.
              This is shown above in the advanced examples section.
X      -s [for columns]
           -- Use the given string as the separator between columns.
              This is allowed to be more than one character and possibly
              even a number.
X      -r  -- Treat each line (by default) as a row of numbers to sum up.
              The sum of each row will be printed on a separate line.
       -s [for rows]
           -- When used with the -r flag, this specifies the separator
              for rows. By default it is the newline character. It could
              be a character, a set of characters or even a number.
       -t n
           -- Add up every nth number and at the end print out rows
              showing the sums of each nth number. This would be useful,
              for example, if you had data from each day of the month and
              you wanted to sum up the data for each week day. So you
              might do something like this:

                  $ cat monthly-data | numsum -t 7

Options that might be included eventually.

X      -x n
           -- Where n is some number. This would be a shortcut for adding
              up all the numbers in the nth column of the input. By
              default, the columns would be determined by white space,
              but could also be determined by the -s flag. This must be
              used with the -c or -r flag. So you can do something like:

                  $ numsum -c -x 10 access_log

              to get the total bytes transferred in an access_log. Maybe
              this could also handle comma separated values, so 1,5,10
              would sum up the 1st, 5th and 10th columns.


-- numgrep --

This program is the numeric equivalent of the unix grep utility. numgrep
will search textual input for numbers matching the expression specified
on the command line. The main power of numgrep is in being able to search
for ranges of numbers, such as all numbers between 1 and 100. Normal unix
grep and regular expressions do not make this simple task easy, but with
numgrep's numeric matching expressions it is possible to match numbers in
ways not previously possible.

A few examples of numgrep's usage:

X   o Search for all numbers between 1 and 100 in the file data.txt.

          numgrep /1..100/ data.txt

X   o Search for numbers between 1 and 37, even numbers between 50 and
      58, as well as the numbers 79, 86 and 94.

          numgrep /1..37,5[2468],79,86,94/ data.txt

X   o Search for numbers from -10 to 10.

          numgrep /-10..10/ data.txt

X   o Search for numbers that are multiples of 7.

          numgrep /m7/ data.txt

    o Search for numbers that are multiples of 7, 12 and 22.

          numgrep /m7m12m22/ data.txt

    o Search for numbers that are factors of 1024 and multiples of 12, or
      numbers that are factors of 2333 and multiples of 9.

          numgrep /f1024m12,f2333m9/ data.txt

    o Search for numbers that are in the set 1, 4, 7, 10, 13 and 16.

          numgrep /1..16i3/ data.txt

Usage options (in addition to the standard options)

       -l  -- Instead of keeping the numbers in their textual context,
              print them out one number per line.
       -R  -- Invert the sense of matching, to select non-matching
              numbers.
       -r  -- Recurse subdirectories.
       -p  -- Obtain the patterns from file.
       -f  -- Suppress normal output and just print the names of the
              files that contain a match, one per line.
       -F  -- Suppress normal output and print the names of the files for
              which no output would have been printed.
       -n  -- Print the line number in front of each line that contains a
              match.
       -c  -- Don't save non-numeric context. This will cause numgrep to
              discard all non-matching/non-numeric values around the
              numbers that match.

Eventual options

       -A [n]  -- Print [n] lines of context after the matching line.
                  The default is 2.
       -B [n]  -- Print [n] lines of context before the matching line.
                  The default is 2.
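
The range, increment, multiple and factor operators used in the numgrep
examples above could be evaluated roughly as follows. This is only a
sketch in Python (the utilities themselves are written in perl), and it
rests on several assumptions that this document does not settle: it
handles integers only, it ignores character groups and quantifiers, it
treats stacked operators in one term as "all must hold", and it counts
'i' steps from the start of the range. The function names term_matches,
matches and numgrep_numbers are hypothetical.

```python
import re

# One comma-separated alternative: an optional literal or range, followed by
# any number of i/m/f operators, e.g. "79", "1..16i3", "m7m12", "f1024m12".
_TERM = re.compile(r"^(?:(-?\d+)(?:\.\.(-?\d+))?)?((?:[imf]\d+)*)$")
_OP = re.compile(r"([imf])(\d+)")


def term_matches(term, n):
    """Return True if integer n satisfies every part of one term."""
    m = _TERM.match(term)
    if m is None or term == "":
        # Character groups such as 5[2468] are deliberately not handled here.
        raise ValueError(f"unsupported term: {term!r}")
    low, high, ops = m.group(1), m.group(2), m.group(3)
    if low is not None:
        lo = int(low)
        hi = int(high) if high is not None else lo  # a bare literal is lo..lo
        if not lo <= n <= hi:
            return False
    for op, arg in _OP.findall(ops):
        k = int(arg)
        if op == "m" and (k == 0 or n % k != 0):      # n is a multiple of k
            return False
        if op == "f" and (n == 0 or k % n != 0):      # n is a factor of k
            return False
        if op == "i":                                 # every k-th value from lo
            if low is None or k == 0 or (n - int(low)) % k != 0:
                return False
    return True


def matches(expr, n):
    """Comma separates alternatives: n matches if any one term matches."""
    return any(term_matches(term, n) for term in expr.split(","))


def numgrep_numbers(expr, text):
    """Pull integers out of free text and keep the ones that match expr."""
    return [int(tok) for tok in re.findall(r"-?\d+", text)
            if matches(expr, int(tok))]


print(numgrep_numbers("1..16i3", "1 2 3 4 5 6 7"))   # [1, 4, 7]
print(numgrep_numbers("-10..10", "a -3 b 42 c 7"))   # [-3, 7]
```

Keeping each comma-separated term independent means the "or" between
alternatives and the "and" within a term stay separate, which is how the
/f1024m12,f2333m9/ example above reads.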

Feature submitted by areiner@tph.tuwien.ac.at:

- Intervals: I may be biased due to my work, but usually when you
  determine a quantity with only finite precision, people write something
  like 1.234 +/- 0.056 or 1.234(56) (the last form is more common, at
  least in physics journals; the interpretation is that the digits in
  brackets are taken to line up with the last digit shown, left-padded
  with zeroes and a decimal point in the correct place). The natural
  representation of both of these is as intervals, i.e. as sets of two
  numbers representing lower and upper bounds. Thus, the given number
  should be turned into something like 1.178..1.290, and there should be
  a conversion function that turns it back into either of the other two.


-- average --

This program finds the average of all numbers encountered. By default it
will find the mean of all the numbers, meaning (;-) that it will add up
all the values and divide that sum by the number of values encountered.
It should eventually offer more sophisticated calculations like finding
the median and mode values.

Examples of usage:

    $ cat numbers
    4 8 9 20 99
    $ average numbers
    28

Usage options:

X      -m  -- Print out the mode value of all the numbers entered. The
              mode is the most frequently occurring value in the set.
X      -M  -- Print out the median of the set of numbers entered. The
              median is the middle value of all numbers encountered. So
              if the numbers 88, 12, 2, 1, 9, 100 and 1000 are
              encountered, the median of that set is 12. Illustrated:

                  1 2 9 12 88 100 1000
                        ^^

X      -l  -- Use the lower of the two middle numbers as the median on
              sets with an even count.
       -a  -- Average all numbers in the file, not just the first ones
              found on each line.
       -c  -- Treat each line as a set of columns separated by white
              space, or a string if the -s option is used. Average the
              values in each column and print out the result for each
              column, separated by the separation character. The layout
              is the same as in the advanced examples in the numsum
              section above.
       -s [for columns]
           -- Use the given string as the separator between columns.
              This is allowed to be more than one character and possibly
              even a number.
       -r  -- Treat each line (by default) as a row of numbers to
              average. The average of each row will be printed on a
              separate line.
       -s [for rows]
           -- When used with the -r flag, this specifies the separator
              for rows. By default it is the newline character. It could
              be a character, a set of characters or even a number.

Options that might be included eventually.

       -n n
           -- Where n is some number. This would be a shortcut for
              averaging all the numbers in the nth column of the input.
              By default, the columns would be determined by white space,
              but could also be determined by the -s flag. This must be
              used with the -c or -r flags.

Usage options (in addition to the standard options)


-- normalize --

This program will scale a group of numbers into the range 0 to 1 (by
default) according to their initial values. You can change the range
using the -R option.

Usage options:

X      -R  -- Specify a range to normalize to instead of 0..1.


-- round --

This utility will round each number encountered up or down depending on
its decimal value or its relation to a number. It will probably also deal
with finding the floor or ceiling values of numbers. It should also be
able to round to factors of certain numbers.

Options:

X      -n n
           -- Round to the nearest factor of n. Instead of just rounding
              all decimal numbers, you can also round to a factor of any
              number. So if you set n to 1000 and you encounter the
              number 6777, it will round that number to 7000. If you set
              n to 3 and encounter the number 7, it will round it to 6.
X      -c  -- Find the ceiling of each number encountered. Round up.
X      -f  -- Find the floor of each number. Round down.


-- range --

Print out a range of numbers for use in for loops and such.

  o Do zero or character padding with the -p option. So 1 becomes 001 if
    the range has an upper limit of 3 digits.
  o Accept ranges in the following formats:

        n1..n2           (ex. 1..100 ; all integers from 1 to 100)
        n1..n2,n3..n4    (ex. 1..10,50..100 ; all integers from 1 to 10
                          and from 50 to 100)
        n1..n2i2         (ex. 2..20i2 ; all even numbers from 2 to 20)
                         (ex. 1..19i2 ; all odd numbers from 1 to 19)
                         (ex. 3..21i3 ; 3, 6, 9, 12, 15, 18, 21)
        n1.d..n2.di0.1   (ex. 1.1..2.0i0.1 ; 1.1, 1.2, 1.3, 1.4, 1.5,
                          1.6, 1.7, 1.8, 1.9, 2.0)

    Or any combination of the above.

Usage options (in addition to the standard options)

       -p  -- Put the following prefix before each number.
       -s  -- Put the following suffix after each number.
X      -n  -- Use the given character to separate each number. By
              default, a space is used. Use the sequence \n to specify a
              newline.
X      -N  -- Shortcut for using a newline separator.
X      -e  -- Exclude the numbers in the given list from the output. This
              is for when you want a complex range without certain
              numbers. The list is a set of numbers separated by ','.


-- random --

Print out a random number or numbers. This program will take a range of
numbers much like the range command does, except that instead of printing
out all the numbers in that range, it will print out a random number from
that range.

Ex:

    $ random 1..100
    37
    $ random 1.0..100.0
    56.9
    $ random 1.000..100.000
    42.397
    $ random 2..100i2      [only pick among the even integers from 2 to 100]
    68

Usage options (in addition to the standard options)

       -n n
           -- Generate n random numbers separated by a space.
       -s  -- Use the given character as the separation character.
       -N  -- Shortcut for using a newline separation character.

Feature submitted by areiner@tph.tuwien.ac.at:

- You should specify the distribution you are generating; probably, this
  is just an equal distribution. What would come in handy quite often is
  to have a way of producing pseudo-random numbers that realize a chosen
  distribution with a given set of parameters. But that should probably
  be a different program than random, e.g. distribution, as the name is
  already taken.
  So, e.g., ``distribution --gaussian 3.0 4.0 100'' should produce 100
  numbers distributed according to a Gaussian distribution with mean 3
  and variance 4.


-- bound --

This program is for finding the maximum, minimum and surrounding numbers
in a set. You can use different options to specify how many numbers are
returned, so as to show top maximum and minimum lists. You can also show
the numbers that occur in the context around the boundary numbers.

Ex:

    $ cat numbers
    1 10 8 100 15 1000 2 9 5 15 27 12 136
    $ bound -u numbers               (upper)
    1000
    $ bound -l numbers               (lower)
    1

    Top 4 upper numbers, sorted:
    $ bound -u -n 4 numbers
    1000 136 100 27

    Bottom 4 lower numbers, sorted:
    $ bound -l -n 4 numbers
    1 2 5 8

    Context around the lowest number. Show 2 numbers of context:
    $ bound -l -c 2 numbers
    1 10 8

    Find the 6 closest numbers to the number 75. They will print out in
    closeness order:
    $ bound -f 75 -n 6 numbers
    100 27 15 15 136 12

Usage options (in addition to the standard options)

       -u  -- Return the upper bound number in the set (the maximum
              number).
       -l  -- Return the lower bound number in the set (the minimum
              number).
       -n n
           -- Return the top or bottom n numbers, or n numbers around a
              number.
       -c n
           -- In addition to the number returned, show n numbers on both
              sides of the number.
       -f n
           -- Find the number n or the closest number to n.


-- numprocess --

(maybe a name like mutate, alter, or process would be better)

This program mutates numbers as it encounters them. It should do the
following operations:

  o Add/Subtract a value to a number
  o Multiply/Divide a number by a factor
  o Raise a number to a power (includes the concept of roots)
  o Round a number up or down
  o Apply any mathematical function to a number.
  o Quantify a number. So 1024 bytes is 1KB and so on.
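
The arithmetic operations in the list above could be modeled as a chain
of small steps applied to each number in turn. The sketch below is in
Python rather than the project's perl and covers only add, subtract,
multiply and divide. It assumes the comma-separated step syntax used in
this section's examples, where % stands for division; the names
parse_steps and process are hypothetical, and results are returned as
floats purely for simplicity.

```python
import operator
import re

# Operator characters as used in this section's examples; '%' means divide.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "%": operator.truediv,
}

# One step: an operator character followed by a plain decimal number.
_STEP = re.compile(r"^([+\-*%])(\d+(?:\.\d+)?)$")


def parse_steps(expr):
    """Turn an expression such as '/-32,*5,%9/' into (function, value) pairs."""
    steps = []
    for part in expr.strip("/").split(","):
        m = _STEP.match(part)
        if m is None:
            raise ValueError(f"unsupported step: {part!r}")
        steps.append((OPS[m.group(1)], float(m.group(2))))
    return steps


def process(expr, numbers):
    """Apply every step, left to right, to each input number."""
    steps = parse_steps(expr)
    results = []
    for n in numbers:
        for fn, value in steps:
            n = fn(n, value)
        results.append(n)
    return results


print(process("/+1/", [1, 2, 3]))        # [2.0, 3.0, 4.0]
print(process("/-32,*5,%9/", [212]))     # [100.0]
```

Applying the steps strictly left to right keeps the expression easy to
read aloud: subtract 32, multiply by 5, divide by 9.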

Ex:

    Add 1 to each number
    $ numprocess /+1/

    Multiply each number by 8 and then divide by 5
    $ numprocess /*8,%5/

    Convert from Fahrenheit to Celsius
    $ numprocess /-32,*5,%9/

Usage options (in addition to the standard options)


-- interval --

This program calculates and displays the interval between one number and
the next.
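
One possible reading of "interval" is the signed difference between each
number and the one that follows it. The short Python sketch below (not
the project's perl) shows that reading; the function name intervals and
the choice of "next minus current" are assumptions, since this document
does not define the output any further.

```python
def intervals(numbers):
    """Return the difference between each number and the one after it."""
    # zip pairs every element with its successor; the last element has none.
    return [nxt - cur for cur, nxt in zip(numbers, numbers[1:])]


print(intervals([1, 4, 9, 16]))   # [3, 5, 7]
print(intervals([10, 7, 7]))      # [-3, 0]
```

A single number, or no input at all, produces an empty list under this
reading, because there is no "next" number to compare against.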