IT

Pointless referrer spamming

Q: What happens when you cross a mobster with a cell phone company?
A: Someone who makes you an offer you can’t understand.

The HTTP protocol used by web browsers specifies an optional Referer: (sic) header that allows them to tell the server where the link to a page came from. This was originally intended as a courtesy, so webmasters could ask people with obsolete links to update their pages, but it has become a valuable source of information in its own right: webmasters can find out which sites link to them and, in most cases, what keywords were typed into a search engine. Unfortunately, spammers have found another well to poison on the Internet.
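For illustration, here is a minimal Python sketch of how a webmaster might pull referrers and search keywords out of an Apache combined-format access log; the log file name is a placeholder:

    import re
    from urllib.parse import urlparse, parse_qs

    # "%r" status bytes "Referer" "User-agent" -- the tail of the combined log format
    COMBINED = re.compile(r'"[^"]*" \d+ \S+ "(?P<referer>[^"]*)" "[^"]*"$')

    with open("access.log") as log:              # hypothetical log file
        for line in log:
            m = COMBINED.search(line)
            if not m or m.group("referer") in ("", "-"):
                continue                         # no referrer sent (e.g. a direct visit)
            ref = urlparse(m.group("referer"))
            query = parse_qs(ref.query)
            keywords = query.get("q") or query.get("p")   # most engines of the day use q= or p=
            print(ref.netloc, " ".join(keywords) if keywords else "")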

Over the past month, referrer spam on my site has graduated from nuisance to menace, and I am writing scripts that attempt to filter that dross out of my web server log reports automatically. In recent days, it seems most of the URLs spammers are pushing on me point to servers with names that aren’t even registered in the DNS. This seems completely asinine, even for spammers: why bother spamming someone without a profit motive? I was beginning to wonder whether this was just a form of vandalism like graffiti, but the situation is more devious than it appears at first glance.

Referrer spam is very hard to fight (although not quite as difficult as email spam). I am trying to combine a number of heuristics, including behavioral analysis (e.g. whether the purported browser actually downloads my CSS files), WHOIS lookups, reverse lookups on the client IP address, and so on. Unfortunately, if any of these filtering methods becomes widespread, the spammers can easily apply countermeasures to make their requests look more legitimate. This looks like another long-haul arms race…
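As a minimal sketch of two of the cheaper heuristics (the behavioral analysis requires correlating multiple log lines, so it is left out), here is roughly what the DNS checks look like; the hostname in the example is made up:

    import socket
    from urllib.parse import urlparse

    def referrer_resolves(referrer_url):
        # Is the host named in the Referer header even registered in the DNS?
        host = urlparse(referrer_url).hostname
        if not host:
            return False
        try:
            socket.gethostbyname(host)
            return True
        except socket.gaierror:
            return False

    def has_reverse_dns(client_ip):
        # Does the client IP address have a PTR (reverse) record at all?
        try:
            socket.gethostbyaddr(client_ip)
            return True
        except (socket.herror, socket.gaierror):
            return False

    # A referrer pointing at an unregistered name is a prime candidate for filtering:
    print(referrer_resolves("http://some-unregistered-spam-host.example/"))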

IM developments

Telcos look at instant messaging providers with deep suspicion. Transporting voice is just a special case of transporting bits, and even the global Internet is now good enough for telephony (indeed, many telcos are already using IP to transport voice for their phone networks, albeit on private IP backbones). The main remaining barriers to VoIP adoption are interoperability with the legacy network during the transition, and signaling (i.e. finding the destination’s IP address). IM providers offer a solution for the latter, and could thus become VoIP providers. AOL actually is, indirectly, through Apple’s iChat AV. This competitive threat explains why, for instance, France Télécom made a defensive investment in open-source IM provider Jabber.

Two recent developments promise to dramatically change the economic underpinnings of the IM industry:

  1. Yahoo announced a few weeks ago it would drop its enterprise IM product. Within a week, AOL followed suit.
  2. AOL and Yahoo agreed to interoperate with LCS, Microsoft’s forthcoming Enterprise IM server. Microsoft will pay AOL and Yahoo a royalty for access to their respective IM networks.

These announcements make it clear that neither Yahoo nor AOL feels it can sell successfully into enterprise accounts, let alone match Microsoft’s marketing muscle in that segment.

The second part, in effect Microsoft agreeing to pay termination fees to AOL and Yahoo, means that Microsoft’s business IM users will subsidize consumers. This is very similar to the situation in telephony, where businesses cross-subsidize local telephony for residential customers by paying higher fees. For most telcos, interconnect billing is either the first or second largest source of revenue, and this development may finally make IM profitable for Yahoo and AOL, rather than the loss-leader it is today.

Apparently Microsoft has concluded it cannot bury its IM competitors, and would rather make money now serving its business customers’ demand for an interoperable IM solution than wait to have the entire market to itself using its familiar Windows bundling tactics. Left out in the cold is IBM’s Lotus Sametime IM software.

Businesses will now be able to reach customers on all three major networks, but this does not change the situation for consumers. The big three IM providers have long played cat-and-mouse games with companies like Trillian that tried to provide reverse-engineered clients working with all three networks. Ostensibly, this is done for security reasons, but the real explanation is obviously to protect their respective walled gardens, just as in the early days the Bell Telephone Company would refuse to interconnect with its competitors, and many businesses had to maintain multiple telephones, one for each network. It is not impossible, however, that interoperability will be offered to consumers as a paid, value-added option. Whether consumers are ready to pay for it is an entirely different question.

Effective anti-spam enforcement

The European Union E-Privacy directive of 2002, the US CAN-SPAM act of 2003 and other anti-spam laws allow legal action against spammers. Only official authorities can initiate action (although there are proposals to set up a bounty system in the US), but enforceability of these statutes is a problem, as investigations and prosecutions are prohibitively expensive, and both law enforcement and prosecutors have other pressing priorities contending for finite resources. Financial investigative techniques (following the money trail) that can be deployed against terrorists, drug dealers and money launderers are overkill for spammers, and would probably raise civil liberties issues.

There is an option that could dramatically streamline anti-spam enforcement, however. Spammers have to find a way to get paid, and payment is usually tendered with a credit card. Visa and Mastercard both have systems that can generate a temporary, single-use credit card number. This service is used mostly to assuage the fears of online shoppers, but it could also provide a solution here.

Visa and Mastercard could offer an interface that would allow FTC investigators and their European counterparts to generate “poisoned” credit card numbers. Any merchant account that attempts a transaction using such a number would be immediately frozen and its balance forfeited. Visa and Mastercard’s costs could be defrayed by giving them a portion of the confiscated proceeds.
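To make the proposed mechanism concrete, here is a toy Python sketch; everything in it (the data structures, the authorization hook) is hypothetical, and a real implementation would of course live inside the card networks’ authorization systems, under the judicial oversight mentioned below:

    import secrets

    poisoned_numbers = set()     # in reality maintained by Visa/Mastercard, not in memory
    frozen_merchants = set()

    def issue_poisoned_number():
        # Handed to an FTC investigator (or European counterpart) answering a spam offer.
        number = "4" + "".join(str(secrets.randbelow(10)) for _ in range(15))
        poisoned_numbers.add(number)
        return number

    def authorize(merchant_id, card_number):
        # Hypothetical hook called at transaction time by the card network.
        if card_number in poisoned_numbers:
            frozen_merchants.add(merchant_id)   # account frozen, balance forfeited
            return "DECLINED"
        return "APPROVED"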

Of course, proper judicial oversight would have to be provided, but this is a relatively simple way to nip the spam problem in the bud, by hitting spammers where it hurts most – in the pocketbook.

Why IPv6 will not loosen IP address allocation

The current version of the Internet Protocol (IP), the communications protocol underlying the Internet, is version 4. In IPv4, the address of any machine on the Internet, whether a client or a server, is encoded in 4 bytes. Due to various overheads, the total number of addresses available for use is far smaller than the theoretical 4 billion. This is leading to a worldwide crunch in the availability of addresses, and rationing is in effect, especially in Asia, which came late to the Internet party and received a small allocation (Stanford University has more IPv4 addresses allocated to it than the whole of China).

Internet Protocol version 6 (IPv6) quadrupled the size of the address field to 16 bytes – unlimited for all practical purposes – and made various other improvements. Unfortunately, its authors severely underestimated the complexity of migrating from IPv4 to IPv6, which is why it hasn’t caught on as quickly as it should have, even though the new protocol is almost a decade old now. Asian countries are leading in IPv6 adoption, simply because they don’t have a choice. Many people make do today with Network Address Translation (NAT), where a box (like a DSL router) lets several machines share a single global IP address, but this is not an ideal solution, and it only postpones the inevitable (but not imminent) reckoning.
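As a quick back-of-the-envelope check on those figures, here is a small Python calculation; the list of unusable IPv4 ranges is illustrative, not exhaustive:

    import ipaddress

    print(f"IPv4: {2 ** 32:,} addresses")        # 4 bytes  -> ~4.3 billion
    print(f"IPv6: {2 ** 128:.2e} addresses")     # 16 bytes -> ~3.4e38

    # Some of the IPv4 space is never available to ordinary public hosts:
    unusable = [ipaddress.ip_network(n) for n in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",   # RFC 1918 private (used behind NAT)
        "224.0.0.0/4",                                     # multicast
        "240.0.0.0/4",                                     # reserved
        "127.0.0.0/8",                                     # loopback
    )]
    print(f"Unusable for public hosts: {sum(n.num_addresses for n in unusable):,}")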

One misconception, however, is that the slow pace of the migration is somehow related to the fact that you get your IP addresses from your ISP, and don’t “own” them or have the option to port them the way you now can with your fixed or mobile phone numbers. While IPv6 greatly increases the number of addresses available for assignment, it will not change the way addresses are allocated, for reasons unrelated to the address space crunch.

First of all, nothing precludes anyone from requesting an IPv4 address block directly from the registry in charge of their continent:

  • ARIN in North America and Africa south of the Equator
  • LACNIC for Latin America and the Caribbean
  • RIPE (my former neighbors in Amsterdam) for Europe, Africa north of the Equator, and Central Asia
  • APNIC for the rest of Asia and the Pacific.

That said, these registries take the IP address shortage seriously and will require justification before granting a request. Apart from ISPs, the other main recipients of direct allocations are large organizations that require significant numbers of IP addresses (e.g. for a corporate Intranet) and that use multiple ISPs for their Internet connectivity.

The reason why IP addresses are allocated mostly through ISPs is the stability of the routing protocols ISPs use to provide global IP connectivity. The Internet is a federation of independent networks that agree to exchange traffic, either for free (peering) or for a fee (transit). Each of these networks is called an “Autonomous System” (AS) and has an AS number (ASN) assigned to it. ASNs are coded in 16 bits, so there are only 65,536 of them available to begin with.

When your IP packets travel from your machine to their destination, they first go through your ISP’s interior routers to your ISP’s border gateways, which connect to the transit or destination ISPs leading to where the packets are headed. There are usually one or two orders of magnitude fewer border routers than interior routers. The interior routers do not need much intelligence; all they need to know is how to get packets to the border. The border routers, on the other hand, need a map of the entire Internet: for each block of possible destination IP addresses, they need to know which next-hop ISP to forward a packet to. Border routers exchange this routing information using the Border Gateway Protocol, version 4 (BGP4).
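To give an idea of what that per-packet lookup involves, here is a minimal Python sketch of longest-prefix matching against a toy routing table; the prefixes and next hops are made up, and a real BGP table holds on the order of 200,000 prefixes:

    import ipaddress

    routing_table = {
        ipaddress.ip_network("0.0.0.0/0"):       "transit provider A",   # default route
        ipaddress.ip_network("192.0.2.0/24"):    "peer B",
        ipaddress.ip_network("198.51.0.0/16"):   "transit provider A",
        ipaddress.ip_network("198.51.100.0/24"): "customer C",
    }

    def next_hop(destination):
        dest = ipaddress.ip_address(destination)
        matches = [net for net in routing_table if dest in net]
        return routing_table[max(matches, key=lambda net: net.prefixlen)]  # most specific prefix wins

    print(next_hop("198.51.100.7"))   # "customer C", not the shorter /16 or the default route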

BGP4 is in many ways black magic. Any mistake in BGP configuration can break connectivity or otherwise impair the stability of vast swathes of the Internet. Very few vendors know how to make reliable and stable implementations of BGP4 (Cisco and Juniper are the only two really trusted to get it right), and very few network engineers have real-world experience with BGP4, learned mostly through apprenticeship. BGP4 in the real scary world of the Internet is very different from the safe and stable confines of a Cisco certification lab. The BGP administrators worldwide are a very tightly knit cadre of professionals, who gather in organizations like NANOG and shepherd the Net.

The state of the art in exterior routing protocols like BGP4 has not markedly improved in recent years, and the current state of the art in core router technology only barely keeps up with the fluctuations in BGP. One of the limiting factors is the total size of the BGP routing tables, which has been steadily increasing as the Internet expands (though no longer exponentially, as was the case in the early days). The bigger the routing tables, the more memory has to be added to each and every border router on the planet, and the slower route lookups become. For this reason, network engineers are rightly paranoid about keeping routing tables small. Their main weapon is aggregation: blocks of IP addresses that should be forwarded the same way are announced as a single larger block, so they take up only one slot.

Now assume every Internet user on the planet had his own, completely portable IP address. The size of the routing tables would explode from 200,000 or so entries today to hundreds of millions. Every time someone logged on to a dialup connection, every core router on the planet would have to be informed, and the routers would simply collapse under the sheer volume of routing overhead, with no time left to forward actual data packets.

This is the reason why IP addresses will continue to be assigned by your ISP: doing it this way allows your ISP to aggregate all its IP addresses into a single block and announce a single route to all its partners. Upstream transit ISPs do even more aggregation, and keep the routing tables to a manageable size. The discipline introduced by the regional registries and ISPs is precisely what changed the exponential trend in routing table growth (one which even Moore’s law would not be able to keep up with) to a linear one.
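Here is a minimal illustration of that aggregation discipline, using Python’s ipaddress module and made-up customer blocks: four contiguous /24s an ISP has assigned internally collapse into a single /22 announcement, so the rest of the Internet stores one routing table entry instead of four.

    import ipaddress

    customer_blocks = [ipaddress.ip_network(n) for n in (
        "203.0.112.0/24", "203.0.113.0/24", "203.0.114.0/24", "203.0.115.0/24",
    )]

    announced = list(ipaddress.collapse_addresses(customer_blocks))
    print(announced)   # [IPv4Network('203.0.112.0/22')] -- one route instead of four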

It’s not as if this requirement is anti-competitive, unlike telcos dragging their feet on number portability – the DNS was created precisely so that users would not have to deal with IP addresses, and a DNS entry can easily be updated to point to new addresses when switching providers forces renumbering.

Networked storage on the cheap

As hard drives get denser, the cost of raw storage is getting ridiculously cheap – well under a dollar per gigabyte as I write. The cost of managed storage, however, is an entirely different story.

Managed storage is the kind required for “enterprise applications”, i.e. when money is involved. It builds on raw storage by adding redundancy, the ability to hot-swap drives, and the ability to add capacity without disruption. At the higher end of the market, additional manageability features include fault tolerance, the ability to take “snapshots” of data for backup purposes, and the ability to mirror data remotely for disaster recovery.

Traditionally, managed storage has been more expensive than raw disk by a factor of at least two, sometimes an order of magnitude or more. When I started my company in 2000, for instance, we paid $300,000 – almost half of our initial capital investment – for a pair of clustered Network Appliance F760 filers with a total disk capacity of 600GB or so ($500/GB, at a time when raw disk drives cost about $10/GB). The investment was well worth it: these machines have proven remarkably reliable, and the Netapps’ snapshot capability is vital for us, as it lets us take instantaneous snapshots of our Oracle databases, which we can then back up in a leisurely backup window without having to keep Oracle in the performance-sapping backup mode for the duration.
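For the curious, the procedure boils down to something like the following Python sketch. The filer name, volume and rsh-based snapshot command are assumptions about a setup like ours (ONTAP’s snap create), and older Oracle releases need per-tablespace BEGIN/END BACKUP statements instead of the database-wide form shown here:

    import subprocess

    FILER, VOLUME, SNAPSHOT = "filer1", "oradata", "nightly"   # hypothetical names

    def sqlplus(statement):
        # Feed a single statement to SQL*Plus running as SYSDBA.
        subprocess.run(["sqlplus", "-S", "/ as sysdba"],
                       input=statement + "\n", text=True, check=True)

    sqlplus("ALTER DATABASE BEGIN BACKUP;")                    # backup mode only for an instant
    subprocess.run(["rsh", FILER, "snap", "create", VOLUME, SNAPSHOT], check=True)
    sqlplus("ALTER DATABASE END BACKUP;")                      # back to normal operation
    sqlplus("ALTER SYSTEM ARCHIVE LOG CURRENT;")
    # The snapshot can now be copied to tape at leisure without impacting the database.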

Web serving workloads and the like can easily be distributed across farms of inexpensive rackmount x86 servers, an architecture pioneered by ISPs. Midrange servers (up to 4 processors), pretty much commodities nowadays, are adequate for all but the very highest transaction volume databases. Storage and databases are the backbone of any information system, however, and a CIO cannot afford to take any risks with them; that is why storage represents such a high proportion of hardware costs for most IT departments, and why specialists like EMC have the highest profit margins in the industry.

Most managed storage is networked, i.e. does not consist of hard drives directly attached to a server, but instead of disks attached to a specialized storage appliance connected to the server with a fast interconnect. There are two schools:

  • Network-Attached Storage (NAS), like our Netapps, which basically act as network file servers using common protocols like NFS (for UNIX) and SMB (for Windows). These are more often used for midrange applications and unstructured data, and connect over inexpensive Ethernet networks (Gigabit Ethernet, in our case) every network administrator is familiar with. NAS devices are available for home or small office use, at prices of $500 and up.
  • Storage Area Networks (SAN) offer a block-level interface: they behave like virtual hard drives that serve fixed-size blocks of data, without any understanding of what is in them. They currently use Fibre Channel, a fast, low-latency interconnect that is unfortunately also terribly expensive (FC switches are over ten times more expensive than equivalent Gigabit Ethernet gear). The cost of setting up a SAN usually limits them to high-end, mainframe-class data centers, and exotic cluster filesystems or databases like Oracle RAC are needed if multiple servers are to access the same data.

One logical way to lower the cost of SANs is to use inexpensive Ethernet connectivity. This approach was recently standardized as iSCSI, which is essentially the SCSI protocol running on top of TCP/IP. I recently became aware of Ximeta, a company that makes external drives that apparently implement iSCSI, at a price very close to that of raw disks (since iSCSI does not have to manage state for clients the way a more full-featured NAS does, Ximeta can shun expensive CPUs and RAM and use a dedicated ASIC instead).

The Ximeta hardware is not a complete solution on its own: the driver software manages the metadata for the cluster of networked drives, such as the information that allows multiple drives to be concatenated to add capacity while keeping the illusion of a single virtual disk. The driver is also responsible for RAID, although Windows, Mac OS X and Linux all have volume managers capable of this. There are apparently some Windows-only provisions for allowing multiple computers to share a drive, but I doubt they constitute a full-blown clustered filesystem. In any case, there are very few real-world cases in the target market where anything more than a cold standby is required, and it makes a lot more sense to designate one machine to share a drive with the others on the network.

I think this technology is very interesting and has the potential to finally make SANs affordable for small businesses, as well as for individuals (imagine extending the capacity of a TiVo by simply adding networked drives to a stack). Disk-to-disk backups are replacing sluggish and relatively low-capacity tape drives, and these devices are interesting for that purpose as well.