IT

Inserting graphviz diagrams in a CVStrac wiki

CVStrac is an amazing productivity booster for any software development group. This simple tool, built around a SQLite database (indeed, by the author of SQLite) combines a bug-tracking database, a CVS browser and a wiki. The three components are fully cross-referenced and build off the strengths of each other. You can handle almost all aspects of the software development process in it, and since it is built on an open database with a radically simple schema, it is trivial to extend. I use CVStrac for Temboz to track bugs, but also to trace changes in the code base to requirements or to bugs, and last but not least, the wiki makes documentation a snap.

For historical reasons, my company uses TWiki for its wiki needs. We configured Apache with mod_rewrite so that the wiki links from CVStrac lead to the corresponding TWiki entry instead of the one in CVStrac itself, which is unused. TWiki is very messy (not surprising, as it is written in Perl), but it has a number of good features like excellent search (it even handles stemming) and a directed graph plug-in that makes it easy to design complex graphs using Bell Labs’ graphviz, without having to deal with the tedious pixel-pushing of GUI tools like Visio or OmniGraffle. The plug-in makes it easy to document UML or E-R graphs, document software dependencies, map process flows and the like.

CVStrac 2.0 introduced extensibility in the wiki syntax via external programs. This allowed me to implement similar functionality in the CVStrac native wiki. To use it, you need to:

  1. Download the Python script dot.py and install it somewhere in your path. The sole dependency is graphviz itself, as well as either pysqlite2 or the built-in version bundled with Python 2.5
  2. create a custom wiki markup in the CVStrac setup, of type “Program Block”, with the formatter command-line:
    path/dot.py –db CVStrac_database_file –name ‘%m’
    • Insert the graphs using standard dot syntax, bracketed between CVStrac {dot} and {enddot} tags.
For examples of the plugin at work, here is the graph corresponding to this markup:
{dot}
digraph sw_dependencies {
style=bold;
dpi=72;

temboz [fontcolor=white,style=filled,shape=box,fillcolor=red];
python [fontcolor=white,style=filled,fillcolor=blue];
cheetah [fontcolor=white,style=filled,fillcolor=blue];
sqlite [fontcolor=white,style=filled,fillcolor=blue];

temboz -> cheetah -> python;
temboz -> python -> sqlite -> gawk;
temboz -> cvstrac -> sqlite;
python -> readline;
python -> db4;
python -> openssl;
python -> tk -> tcl;

cvstrac -> "dot.py" -> graphviz -> tk;
"dot.py" -> python;
"dot.py" -> sqlite;
graphviz -> gdpng;
graphviz -> fontconfig -> freetype2;
fontconfig -> expat;
graphviz -> perl;
graphviz -> python;
gdpng -> libpng -> zlib;
gdpng -> freetype2;
}
{enddot}
Dot

Another useful plug-in for CVStrac I wrote is one that highlights source code in the CVS browser using the Pygments library. Simply download pygmentize.py, install it Setup/Diff & Filter Programs/File Filter, using the string path_to/pygmentize.py %F. Here is an example of Pygment applied to pygmentize.py itself:

#!/usr/bin/env python
# $Log: pygmentize.py,v $
# Revision 1.3  2007/07/04 19:54:26  majid
# cope with Unicode characters in source
#
# Revision 1.2  2006/12/23 03:51:03  majid
# import pygments.lexers and pygments.formatters explicitly due to Pygments 0.6
#
# Revision 1.1  2006/12/05 20:19:57  majid
# Initial revision
#
"""
CVStrac plugin to Pygmentize source code
"""
import sys, pygments, pygments.lexers, pygments.formatters

def main():
  assert len(sys.argv) == 2
  block = sys.stdin.read()
  try:
    lexer = pygments.lexers.get_lexer_for_filename(sys.argv[1])
    out = pygments.highlight
    block = pygments.highlight(
      block, lexer, pygments.formatters.HtmlFormatter(
      style='colorful', linenos=True, full=True))
  except ValueError:
    pass
  print unicode(block).encode('ascii', 'xmlcharrefreplace')

if __name__ == '__main__':
  main()

Troubleshooting Windows remotely

Unpaid computer tech support for relatives is not a popular topic among geeks. It is very much a reality, however, specially in Indian communities with extensive extended families like mine. Some of the griping is churlish considering all the favors your family cheerfully does for you, and we probably have it better than MDs who are constantly bombarded with requests for free medical consultations.

At first sight, I would be better off if my relatives had the good sense to ditch Windows and get a Mac instead, but that would in fact compound the problem because I would get even more calls for help from people who are having a hard time dealing with very basic issues on an unfamiliar platform. Mac OS X may be better integrated and secure than Windows, but contrary to popular opinion it is not that much less crash-prone. All computers are unnecessarily hard to use in the first place. I doubt very much the computer industry will mend its ways and put human-centered design first, more likely than not the problem will be “solved” by the progressive eclipse of generations born before widespread computing, the rest of us having perforce adapted to these flawed tools.

A big part of the problem is doing “blind” support over the phone where you don’t see what is going on, and often the person in front of the screen is not technical enough to know what is significant and how to give you a useful and actionable description of what is on screen.

To its credit, Microsoft added remote assistance functionality in Windows XP. Explaining to users how to activate it is a challenge in itself, however, and in any case you need another Windows XP machine to provide the support. I still run Windows 2000 in the sole PC I have (used exclusively for games nowadays) and it makes such a racket I am almost viscerally reluctant to boot it up.

The best solution is to use virtual network computing (VNC), a free, cross-platform remote control protocol originally invented by the former Olivetti-Oracle-AT&T labs in Cambridge, UK. I often use VNC to take control of my home Mac from my office PC or my MacBook Pro. Indeed, VNC is integral to Apple Remote Desktop, Apple’s official remote management product for large Mac installations. There are even VNC clients available for PalmOS and Windows CE so you could remote control your home computer from a Treo. Having VNC running on the ailing PC would allow me to troubleshoot it efficiently from the comfort of my Mac.

Unfortunately, there is still a chicken-and-egg effect. I once tried to get an uncle to set up UltraVNC on his PC and do a reverse SSH forwarding so I could bypass his firewall. It took the better part of an hour, and barely worked. Surely, there has to be a better solution.

One such solution is Copilot, a service from Fog Creek software that repackages VNC in a form that’s easier to use. It is somewhat expensive, however (although that can be seen as a feature if the people calling for help have to pay for it and thus have an incentive to moderate their requests).

Another one that shows some promise is UltraVNC SC, a simplified version of UltraVNC that is designed for help desks (here is a more friendly walkthrough). Unfortunately, it shows a very clunky dialog that makes sense in a corporate help desk setting, but is too confusing for a novice user, and it uses UltraVNC extensions that are not compatible with most other VNC clients like the one I use most, Chicken of the VNC.

In the end, what I ended up doing was to take the source code for the full-featured UltraVNC server, rip out all the user interface and registry settings from it, and hardcode it to open an outgoing connection to my home server alamut.majid.org on TCP port 5500. There isn’t anything on the server listening on port 5500 by default, but I can open a SSH connection to it from anywhere in the world and use SSH reverse port forwarding to connect port 5500 to wherever I am. This neatly sidesteps the problem of firewalls that block incoming connections.

The resulting executable is larger than SC, but still manageable at 500K (vs. 950K for the full version), and requires no input from the user beyond downloading it and running it, thus triggering all sorts of warnings. It’s not good practice to teach users to download and run executables, but presumably they trust me. After the VNC session is finished, the program simply exits (as evidenced by the disappearance of the UltraVNC eye icon from the toolbar

If you want to use a setup like mine, it’s easy enough for a technically inclined person:

  1. You could download my executable at majid.org/help, open it in a hex editor (or even Emacs), search for the string alamut.majid.org and overwrite it with the name of the machine you want to use instead (I left plenty of null bytes as padding just in case). Make sure you are overwriting, not inserting new bytes or shrinking the string, as the executable won’t work correctly otherwise.
  2. Or you could download the modified source code I used (UltraVNC is a GPL open-source project, so I am bound by the license to release my mods). Edit the string host in winvnc/winvnc/winvnc.cpp (you can also change the reverse VNC port from its default of 5500 if you want), and recompile using the free (as in beer) Visual C++ 2005 Express Edition and the Platform SDK. My Windows programming skills are close to nil, so if I could do it, you probably can as well.

To use the tool, put it up on a website, and when you get a request for help, SSH into the server. On UNIX (including OS X), you would need to issue the command:

ssh -R5500:127.0.0.1:5500 your.server.name.com

Please note I explicitly use 127.0.0.1 rather than localhost, as the former is always an IPv4 address, but on some systems, localhost could bind to the IPv6 equivalent ::1 instead.

On Windows, you will need to set the reverse port forwarding options in PuTTY (or just replace ssh with plink in the command-line above). After that start your VNC client in listen mode (where the VNC client awaits a connection from the server on port 5500 instead of connecting to the server on port 5900). You can then tell the user to download the executable and run it to establish the connection.

Some caveats:

  1. The leg of the connection between the PC and the server it is connecting to is not encrypted
  2. Depending on XP firewall settings, Windows may ask the user to authorize the program to open a connection
  3. At many companies, running a program like this is grounds for dismissal, so make sure whoever is calling you is asking for help on a machine they are authorized to open to the outside.

I hesitated to make this widely available due to the potential for mischief, but crackers have had similar tools like Back Orifice for a very long time, so I am not exactly enhancing their capabilities. On the other hand, this makes life so much easier it’s worth sharing. Helping family deal with Windows will still be a chore, but hopefully a less excruciating one.

Update (2007-03-23):

You can make a customized download of the executable targeting your machine using the form below. Replace example.com with whatever hostname or IP address you have. If you do not have a static IP address, you will need to use a dynamic DNS service like DynDNS or No-IP to map a host name to your dynamic IP address.

Gates opening new Vistas

Bill Gates announced yesterday he is progressively going to disengage himself from day-to-day participation in Microsoft over the next two years, to concentrate exclusively on his foundation. However questionable Microsoft’s business practices may be, they are no worse than Standard Oil’s. The Rockefellers or Carnegie bought social respectability by endowing institutions for the already comfortable. No matter what the IRS may claim, donating to places like Harvard in exchange for naming rights does not qualify as charity in my book.

In contrast, Gates’ humanitarian work has been remarkable — his money is comforting the truly afflicted of this world, like sufferers of leprosy or malaria. His example is highly unlikely to be emulated by Silicon Valley’s skinflint tycoons (Larry Ellison, Steve Jobs, I’m looking at you). The latter conveniently convinced themselves their wealth is due entirely to their own efforts, never to luck, or government funding in the case of the Internet moguls. This leads to the self-serving belief that they are absolved of any obligation to society or to those less fortunate (in both senses of the term).

This decision is not entirely unexpected. Microsoft has been floundering for the last several years, and has accrued severe managerial bloat, something the ruthless and paranoid Bill Gates circa 1995 would never have allowed to continue. There is remarkable dearth of insightful commentary on the announcement. My take is that the harrowing and humiliating process of the DoJ anti-trust trial proved cathartic and led him to review his priorities, even if the lawsuit itself ended up with an ineffective slap on the wrist.

Some equally interesting reading coming out of Redmond: Broken Windows Theory, an article by a Microsoft project manager on the back story behind the Windows Vista delay, with some really interesting metrics. Apparently Vista takes no less than 24 hours to compile on a fast dual-processor PC. It has 50 levels of dependencies, 50 million lines of code (one metric I personally find meaningless, as you can get more done in one line of Python than in a hundred lines of C/C++). His conclusion is that due to its scale, Vista could simply be structurally unmanageable. Certainly, the supporting infrastructure, as in automation tools, code and dependency analysis, project management et al. ought to be a project in itself of the same scope as, say, Microsoft Word.

When I worked at France Télécom in the late nineties, they were reeling from the near total failure of Frégate, a half-billion dollar billing system of the future project (another interesting metric: two-thirds of billing systems projects worldwide end in failure). The grapevine even devised a unit of measurement, the Potteau, after an eponymous Ingénieur Général (a typically French title with roots in the military engineering side of the civil service) involved in the project. One potteau equals one man-century. It is deemed the unit beyond which any software project is doomed to failure.

Vista involved 2000 developers over 5 years. That’s over 100 potteaux.

Put whiny computers to work

I have noticed a trend lately of computers making an annoying whining sound when they are running at low utilization. This happens with my Dual G5 PowerMac, the Dells I ordered 18 months ago for my staff (before we ditched Dell for HP due to the former’s refusal to sell desktops powered with AMD64 chips instead of the inferior Intel parts), and I am starting to notice it with my MacBook Pro when in a really quiet room.

These machines emit an incredibly annoying high-pitched whine when idling, one that disappears when you increase the CPU load (I usually run openssl speed on the Macs). Probably the fan falls in some oscillating pattern because no hysteresis was put into the speed control firmware. It looks like these machines were tested under full load, but not under light load, which just happens to be the most common situation for computers. The short-term work-around is to generate artificial load by running openssl in an infinite loop, or to use a distributed computing program like Folding@Home.

Load-testing is a good thing, but idle-testing is a good idea as well…

Taming the paper tiger

A colleague was asking for some simple advice about all-in-one printer/copier/fax devices and got instead a rambling lecture on my paper workflow. There is no reason the Internet should be exempted from my long-winded rants, so here goes, an excruciatingly detailed description of my paper workflow. It shares the same general outline as my digital photography workflow, with a few twists.

Formats

The paperless office is what I am striving for. Digital files are easier to protect than paper from fire or theft, and you can carry them with you everywhere on a Flash memory stick. As for file formats, you don’t want to be locked in, so you should either use TIFF or PDF, both of which have open-source readers and are unlikely to disappear anytime soon, unlike Microsoft’s proprietary lock-in format of the day.

TIFF is easier to retouch in an image editing program, but:

  1. Few programs cope correctly with multi-page TIFFs
  2. PDF allows you to combine a bitmap layer to have an exact fac-simile with a searchable OCR text layer for retrieval, TIFF does not.
  3. TIFF is inefficient for vector documents, e.g. receipts printed from a web page.
  4. The TIFF format lacks many of the amenities designed in a format like PDF expressly designed as a digital replacement for paper.

Generating PDFs from web pages or office documents is as simple as printing (Mac OS X offers this feature out of the box, for Windows, you can print to PostScript and use Ghostscript to convert the PS to PDF.

Please note the bloated Acrobat Reader is not a must-have to view PDFs, Mac OS X’s Preview does a much better job, and on Windows Foxit Reader is a perfectly serviceable alternative that easily fits on a Flash USB stick. UNIX users have Ghostscript and the numerous UI wrappers that make paging and zooming easy..

Acquisition

You should process incoming mail as soon as you receive it, and not let it build up. If you have a backlog, set it aside and start your new system, applicable to all new snail mail. That way the situation does not degrade further, and you can revisit old mail later.

Junk mail that could lead to identity theft (e.g. credit card solicitations) should be shredded or even better, burnt (assuming your local environmental regulations permit this). if you get a powerful enough shredder, it can swallow the entire envelope without even forcing you to open it. Of course, you should only consider a cross-cut shredder. Junk mail that does not contain identifiable information should be recycled. When in doubt, shred. Everything else should be scanned.

Forget about flatbed scanners, what you want is a sheet-fed batch document scanner. It should support duplex mode, i.e. be capable of scanning both sides of a sheet of paper in a single pass. For Mac users Fujitsu ScanSnap is pretty much the only game in town, and for Windows users I recommend the Canon DR-2050C (the ScanSnap is available in a Windows version, but the Canon has a more reliable paper feed less prone to double-feeding). Either will quickly scan a sheaf of paperwork to a PDF file at 15–20 pages per minute.

Filing

Paper is a paradox: it is the most intuitive medium to deal with in the short-term, but also the most unwieldy and unmanageable over time. As soon as you layer two sheets into a pile, you have lost the fluidity that is paper’s essential strength. Shuffling through a pile takes an ever increasing amount of time as the pile grows.

For this reason, you want to organize your filing plan in the digital domain as much as possible. Many experts set up elaborate filing plans with color-coded manila folders and will wax lyrical about the benefits of ball-bearing sliding file cabinets. In the real world, few people have the room to store a full-fledged file cabinet.

The simplest form of filing is a chronological file. You don’t even need file folders — I just toss my mail in a letter tray after I scan it. At the end of each month, I dump the accumulated mail into a 6″x9″ clasp envelope (depending on how much mail you receive, you may need bigger envelopes), and label it with the year and month. In all likelihood, you will never access these documents again, so there is no point in arranging them more finely than that. This filing arrangement takes next to no effort and is very compact – you can keep a year’s worth in the same space as a half dozen suspended file folders, as can be seen with 9 months’ worth of mail in the photo below (the CD jewel case is for scale).

Monthly filesThere are some sensitive documents you should still file the old-fashioned way for legal reasons, such as birth certificates, diplomas, property titles, tax returns and so on. You should still scan them to have a backup in case of fire.

Date stamping

As you may have to retrieve the paper original for a scanned document, is important to date stamp every page (or at least the first page) of any mail you receive. I use a Dymo Datemark, a Rube Goldberg-esque contraption that has a rubber ribbon with embossed characters running around an ink roller and a small moving hammer that strikes when the right numeral passes by. All you really need is a month resolution so you know which envelope to fetch, thus an ordinary month-year rubber stamp would do as well. Ideally you would have software to insert a digital date stamp directly in the document, but I have not found any yet. A tip: stamp your document diagonally so the time stamp stands out from the horizontal text.

Management

Much as it pains me to admit it, Adobe Acrobat (supplied with the Fujitsu ScanSnap) is the most straightforward way to manage PDF files on Windows, e.g. merge multiple files together, insert new pages, annotate documents and so on. Through web capture OCR, it can create an invisible text layer that makes the PDF searchable with Spotlight. There are alternatives, such as Foxit PDF Page Organizer or PaperPort on Windows, and PDFPen on OS X. Since Leopard, Apple’s Preview app has included most of the PDF editing functionality required, so I take great pains to ensure my Macs are untainted by Acrobat (e.g. unselecting it when installing CS3). See also my article on resetting the creator code for PDF files on OS X so they are opened by Preview for viewing.

Encryption

If you are storing a backup of your personal papers at work or on a public service like Google’s rumored Gdrive, you don’t want third-parties to access your confidential information. Similarly, you don’t want to be exposed to identity theft if you lose a USB Flash stick with the data on it. The solution is simple: encryption.

There are many encryption packages available. Most probably have back doors for the NSA, but your threat model is the ID fraudster rummaging through your trash for backup DVDs or discarded bank statements, not the government. I use OpenSSL’s built-in encryption utility as it is cross-platform and easily scripted (I compiled a Windows executable for myself, and it is small enough to be stored on a Flash card). Mac and UNIX computers have it preinstalled, of course, do man enc for more details.

To encrypt a file using 256-bit AES, you would use the command:

openssl enc -aes-256-cbc -in somefile.pdf -out somefile.pdf.aes

to decrypt it, you would issue the command:

openssl enc -d -aes-256-cbc -in somefile.pdf.aes -out somefile.pdf

OpenSSL will prompt you for the password, but you can also supply it as a command-line argument, e.g. in a script.

Backup

Backing up scanned documents is no different than backing up photos (apart from the encryption requirements), so I will just link to my previous essay on the subject or my current backup scheme. In addition to my external Firewire hard drive rotation scheme, I have a script that does an incremental encryption of modified files using OpenSSL, and then uploads the encrypted files to my office computer using rsync.

Retention period

I tend to agree with Tim Bray in that you shouldn’t bother erasing old files, as the minimal disk space savings are not worth the risk of making a mistake. As for paper documents, you should ask your accountant what retention policy you should adopt, but a default of 2 years should be sufficient (the documents that need more, such as tax returns, are in the “file traditionally” category, in any case).

Fax

The original question was about fax. OS X can be configured to receive faxes on a modem and email them to you as PDF attachments, at which point you can edit them in Acrobat, and fax it back if required, without ever having to kill a tree with printouts. Windows has similar functionality. Of course, fax belongs in the dust-heap of history, along with clay tablets, but habits change surprisingly slowly.

Update (2006-08-26):

I recently upgraded my shredder to a Staples SPL-770M micro-cut shredder. The particles generated by the shredder are incredibly minute, much smaller than those of conventional home or office grade shredders, and it is also very quiet to boot.

Unfortunately, it isn’t able to shred an entire unopened junk mail envelope, and the micro-cut shredding action does not work very well if you feed it folded paper (the particles at the fold tend to cling as if knitted together). This unit is also more expensive than conventional shredders (but significantly cheaper than near mil-spec DIN level 5 shredders that are the nearest equivalent). Staples regularly has specials on them, however. Highly recommended.

Update (2007-04-12):

I recently upgraded my document scanner to a Fujitsu fi-5120C. The ScanSnap has a relatively poor paper feed mechanism, which often jams or double-feeds. Many reviews of the new S500M complain it also sufffers from double-feeding. The 5120C is significantly more expensive but it has a much more reliable paper feed with hitherto high-end features like ultrasonic double-feed detection. You do need to buy ScanTango software to run it on the Mac, however.

Update (2009-01-21):

I moved recently, and realized I have never yet had to open one of those envelopes. From now on, all papers not required for legal reasons (e.g. tax documents) go straight to the shredder after scanning.

Update (2009-09-08):

The new ScanSnap 1500 has ultrasonic double-feed detection. I bought a copy of ABBYY FineReader Express for the Mac. It used to be only available as bundled software with certain scanners like recent ScanSnaps, or software packages like DEVONthink, but you can now buy it as a standalone utility. It is not full-featured, missing some of the more esoteric OCR functionality of the Windows version, batch capabilities and scripting, but works well, unlike the crash-prone ReadIRIS I had but seldom used.

Update (2009-09-22):

Xamance is a really interesting French startup. Their product, the Xambox, integrates a document scanner, document management software and a physical paper filing system. The system can tell you exactly where to find the paper original for a scanned document (“use box 2, third document after tab 7”). In other words, essentially the same filing system I suggest above, but systematically managed in a database for easy retrieval.

It is quite expensive, however, making it more of a solution for businesses. I have moved on and no longer need the safety blanket of keeping the originals, but I can easily see how a complete solution like this would be valuable for businesses that are required for compliance to keep originals, such as notaries, or even government public records offices.

Credit card receipt slips and business cards are problematic for a paperless workflow. They are prone to jam in scanners, have non-standard layouts so hunting for information takes more time than it should, and are usually so trivial you don’t really feel they are worth scanning in the first place. I just subscribed to the Shoeboxed service to manage mine.They take care of the scanning and for pouring the resulting data in a form that can be directly imported into personal finance or contact-management software. I don’t yet have sufficient experience with the service, but on paper at least it seems like a valuable service that will easily save me an hour a week.

Update (2011-01-13):

I finally broke down and upgraded to a ScanSnap S1500M (we have one at work, and it is indeed a major improvement over the older models). In theory this is a downgrade as the fi-5120C is a business scanner, whereas the S1500M is a consumer/SoHo model, but with some simple customization, the integrated software bundle makes for a much more streamlined workflow: put the paper in the hopper, press the button, that’s it. With the fi-5120C, I had to select the scan settings in ScanTango, scan, press the close button, select a filename, drag the file into ABBYY FineReader, select OCR options, click save, click to confirm I do want to overwrite the original file, then dismiss the scan detection window. One step vs. nine.

Update (2012-06-19):

For portable storage of the documents, I don’t bother with manually encrypting the files any more. The IronKey S200 is a far superior option: mil-spec security and hardware encryption, with tamper-resistant circuitry, potted for environment resistance and using SLC flash memory for speed. Sure, it’s expensive, but you get what you pay for (I tried to cut costs by getting the MLC D200, and ended up returning it because it is so slow as to be unusable).