Thursday, March 10, 2016

DATA SIMPLIFICATION: System Calls


Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 23, 2016). I hope I can convince you that this is a book worth reading.


Blog readers can use the discount code: COMP315 for a 30% discount, at checkout.


A system call is a command line, inserted into a software program, that interrupts the script while the operating system executes the command. Immediately afterward, the script resumes at the next line. Any utility that runs from the command line can be embedded in any scripting language that supports system calls, and this includes all of the languages discussed in this book.

Here are the properties of system calls that make them useful to programmers:

1. System calls can be inserted into iterative loops (e.g., while loops, for loops), so that they can be repeated any number of times, on collections of files, or data elements.

2. Variables that are generated at run-time (i.e., during the execution of the script) can be included as arguments to the system call.

3. The results of the system call can be returned to the script, and used as variables.

4. System calls can utilize any operating system command and any program that would normally be invoked through a command line, including external scripts written in other programming languages. Hence, a system call can launch an external script written in another programming language, composed at run-time within the original script, using variables generated in the original script, and can capture the output of the external script for use in the original script!

System calls enhance the power of any programming language by providing access to countless external methods and by participating in iterated actions using variables created at run-time.
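The first three properties can be seen together in a short Python sketch. It assumes Python 3.7 or later and uses the standard subprocess module, which hands the command's output back to the script. The helper name, run_and_capture, and the list of words are invented for illustration; the external command is simply the Python interpreter itself (sys.executable), so the sketch should run without any extra software installed.

```python
import subprocess
import sys

def run_and_capture(word):
    """Run an external command built from a run-time variable; return its output."""
    # The external "utility" here is just the Python interpreter, asked to
    # print the upper-case form of whatever argument it receives.
    command = [sys.executable, "-c",
               "import sys; print(sys.argv[1].upper())", word]
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout.strip()

# Property 1: the call sits inside a loop.
# Property 2: each call receives a variable created at run-time.
# Property 3: the result comes back to the script as a variable.
results = []
for word in ["perl", "python", "ruby"]:
    results.append(run_and_capture(word))

print(results)
```

Any command-line utility could stand in for the interpreter here; only the contents of the command list would change.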

How does the system call help with the task of data simplification? Data simplification is very often focused on uniformity and reproducibility. If you have 100,000 images, data simplification might involve calling ImageMagick to resize every image to the same height and width. If you need to convert spreadsheet data to a set of triples, then you might need to provide a UUID string (see prior blog) to every triple in the database, all at once. If you are working on a Ruby project, and you need to apply one of Python's numpy methods to every data file in a large collection of data files, then you might want to create a short Python file that can be accessed, via a system call, from your Ruby script.
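The idea of a short helper file, run through a system call, can be sketched in a few lines. To keep the example self-contained, I have made two assumptions: the calling script is written in Python rather than Ruby, and the helper computes a plain average rather than calling numpy. The function name, run_generated_script, is invented for illustration. The helper's source is composed at run-time from a variable in the calling script, written to a temporary file, executed, and its printed output is captured.

```python
import os
import subprocess
import sys
import tempfile

def run_generated_script(values):
    """Write a small helper script at run-time, run it, and return its output."""
    # Compose the external script's source from a variable in this script.
    source = "values = " + repr(values) + "\nprint(sum(values) / len(values))\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(source)
        script_path = handle.name
    try:
        completed = subprocess.run([sys.executable, script_path],
                                   capture_output=True, text=True, check=True)
    finally:
        os.remove(script_path)  # tidy up the temporary helper script
    return completed.stdout.strip()

# The captured text is converted back into a number for use in this script.
mean = float(run_generated_script([2, 4, 6, 9]))
print(mean)
```

In a real project, the helper would more likely be a permanent file, and the calling script would pass it the name of each data file as an argument.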

Once you have gotten the hang of including system calls in your scripts, you will probably use them in most of your data simplification tasks. It's important to know how system calls can be used to great advantage in Perl, Python, and Ruby. A few examples follow.

The following short Perl script makes a system call, consisting of the DOS "dir" command:
#!/usr/bin/perl
system("dir");
exit; 
The "dir" command, launched as a system call, displays the files in the current directory. Here is the equivalent script, in Python:
#!/usr/local/bin/python
import os
os.system("dir")
exit()
Notice that system calls in Python require the importation of the os (operating system) module into the script.
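One detail worth knowing is that os.system returns the exit status of the command, not the text that the command displays. Here is a minimal sketch, assuming a shell that understands "echo" and "exit" (both the Windows command prompt and Unix shells do). The variable names are my own.

```python
import os

# os.system hands back the command's exit status, not the text it printed.
status_ok = os.system("echo hello from the operating system")
status_bad = os.system("exit 3")

# Zero signals success; any other value signals a failure of some kind.
print("first command succeeded:", status_ok == 0)
print("second command succeeded:", status_bad == 0)
```

The exact non-zero value differs between operating systems, so the sketch only tests whether the status is zero.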

Here is an example of a Ruby system call, to ImageMagick's "Identify" utility [note: this only works if you have pre-installed ImageMagick]. The system call instructs the "Identify" utility to provide a verbose description of the image file, 3320_out.jpg, and to redirect the output into the text file, myimage.txt.
#!/usr/bin/ruby
system("Identify -verbose c:/ftp/3320_out.jpg >myimage.txt")
exit
Here is an example of a Perl system call, to ImageMagick's "convert" utility, that incorporates a Perl variable ($file, in this case) that is passed to the system call [note: this only works if you have pre-installed ImageMagick].
#!/usr/local/bin/perl
$file = "try2.gif";
system("convert -size 350x40 xc:lightgray -font Arial -pointsize 32 -fill black -gravity north -annotate +0+0 \"Hello, World\" $file");
exit;
The following Python script lists the current directory and checks every filename, looking for jpeg image files. When a jpeg file is encountered, the script makes a system call to ImageMagick, instructing ImageMagick's "convert" utility to copy the jpeg file to the thumb drive (designated as the f: drive), in the form of a grayscale image. If you try this script at home, be advised that it requires a mounted thumb drive, in the "f:" drive [note: this only works if you have pre-installed ImageMagick].
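#!/usr/local/bin/python
import os
filelist = os.listdir(".")
for file in filelist:
  # select jpeg files by their extension, ignoring case
  if file.lower().endswith(".jpg"):
    img_in = file
    img_out = "f:/" + file
    # quote both filenames so that names containing spaces reach convert intact
    command = 'convert "' + img_in + '" -set colorspace Gray -separate -average "' + img_out + '"'
    os.system(command)
exit()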
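When a command is assembled as one long string, filenames that contain spaces need careful quoting. A variation, sketched here under my own assumptions, builds each command as a list of arguments, which can later be handed to Python's subprocess.run without any quoting. The function name, grayscale_commands, and its two parameters are invented for illustration; the ImageMagick options are the same ones used in the script above. The sketch only builds the commands, so it runs even on a machine without ImageMagick.

```python
import os

def grayscale_commands(source_dir, target_dir):
    """Build one ImageMagick 'convert' command per jpeg file in source_dir."""
    commands = []
    for name in sorted(os.listdir(source_dir)):
        if name.lower().endswith((".jpg", ".jpeg")):
            img_in = os.path.join(source_dir, name)
            img_out = os.path.join(target_dir, name)
            # Each command is a list, so no quoting is needed for odd filenames.
            commands.append(["convert", img_in, "-set", "colorspace", "Gray",
                             "-separate", "-average", img_out])
    return commands
```

On a machine with ImageMagick installed, each command returned by grayscale_commands(".", "f:/") could be run with subprocess.run(command, check=True).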
Let's look at a Ruby script that calls a Perl script, a Python script, and another Ruby script, from within one Ruby script.

Here are the Perl, Python and Ruby scripts that will be called from within a Ruby script:
hi.py
#!/usr/local/bin/python
print("Hi, I'm a Python script")
exit()

hi.pl
#!/usr/local/bin/perl
print "Hi, I'm a Perl script\n";
exit;

hi.rb
#!/usr/local/bin/ruby
puts "Hi, I'm a Ruby script"
exit
Here is the Ruby script, call_everyone.rb, that calls external scripts, written in Python, Perl and Ruby:
#!/usr/local/bin/ruby
system("python hi.py")
system("perl hi.pl")
system("ruby hi.rb")
exit
Here is the output of the Ruby script, call_everyone.rb:
c:\ftp>call_everyone.rb
Hi, I'm a Python script
Hi, I'm a Perl script
Hi, I'm a Ruby script
If you have some facility with a variety of language-specific methods and utilities, you can deploy them all from within your favorite scripting language.
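The same trick works from Python. The following sketch, which assumes Python 3.7 or later, takes a list of interpreter and script pairs, runs each one, and collects what each script printed. The function name, call_everyone, echoes the Ruby script above but is otherwise my own invention, as is the choice to skip any interpreter that is not installed rather than stop with an error.

```python
import shutil
import subprocess

def call_everyone(jobs):
    """Run each (interpreter, script) pair; return the text each one printed."""
    outputs = []
    for interpreter, script in jobs:
        # Skip any language that is not installed on this machine.
        if shutil.which(interpreter) is None:
            outputs.append(interpreter + " not found, skipped " + script)
            continue
        completed = subprocess.run([interpreter, script],
                                   capture_output=True, text=True)
        outputs.append(completed.stdout.strip())
    return outputs
```

With the three "hi" scripts in the current directory, call_everyone([("python", "hi.py"), ("perl", "hi.pl"), ("ruby", "hi.rb")]) would return the three greetings as a list, ready for further use in the calling script.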

- Jules Berman

key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, complexity, system calls, Perl, Python, Ruby, jules j berman
