Message Queuing & Segregation: Lessons from the Airline Industry

Note: This also appears at http://www.messagesystems.com/wordpress/?p=38.

Many email marketers are unaware of the importance of message queuing to the successful delivery of their email. As a component of their messaging infrastructure, queuing is something that marketers typically defer to their IT department to manage. Yet, the reality is that queuing and the segregation of message streams can make the critical difference between the success and failure of a company’s messaging programs, and therefore, should be of concern to both the IT and marketing departments.

Effective queuing really comes down to the choice of messaging infrastructure. When using a technologically advanced messaging platform, companies can efficiently manage parallel queues with messages assigned into multiple streams to ensure that each stream flows at an appropriate rate, providing efficient delivery of all classes of traffic. Unfortunately, those that rely legacy MTAs have no such options. They’re left trying to manage the bottlenecks and slowdowns that result from poor architecture with complicated priority schemes within a single queue.

Recently, one legacy MTA provider suggested that their routine for queue prioritization was the answer for reaching high-value customers first. While the business need is certainly legitimate, trying to prioritize messages within a single queue is both outmoded and a solution to a problem that should not exist in the first place. There are better ways to satisfy this need that are both simpler and more powerful at the same time. To illustrate my point, allow me to provide an analogy that should be familiar to my fellow business travelers.

One of the most common headaches for the air traveler is the security checkpoint; you get your ID checked, get in line, get your ID checked again, get in another line, empty your bags, take off your shoes and belt, get in yet another line and then get radiated in the name of public safety. During the highest traffic times these lines can become so long that people start missing their flights because there are a very limited number of security checkpoints and the airports were architected in a time that predated the need for such extensive security. This fundamental flaw in the architecture of the airports means that the current needs of travelers for additional parallel security screening checkpoints cannot be met, and everyone has to wait in a queue to get to their plane, sometimes with unacceptable results, leading to additional costs for all involved and potential lost business.

This is handled in a variety of ways, including performing the security check at every gate area (creating a very parallel security screening system) and by using priority security lines. Imagine for a moment that instead of this solution, the airport chose instead to assign priority on an individual basis to every single passenger and then tried to sort the individual travelers in the security line. As you can imagine such a solution would require additional work to make the individual assignments and then keep track of who ranks where in the line, with a risk that low-priority passengers would find themselves significantly delayed as they were repeatedly bumped. In all my travels I have never seen such an approach, primarily because the airports already have an approach that works.

Find a particularly efficient airport and what you’ll see is a fairly consistent set of practices:

  • Separation of passengers into queues based on their fitting into a certain profile.
  • A large number of parallel checkpoints.
  • Efficient handling of passengers.
  • Intelligent queue management that can modify queuing on the fly to meet circumstances.

Look at a particularly efficient airport and you will see multiple queues, including queues for:

  • Frequent / First Class travelers – The most frequent travelers and those who sit in First Class. These people bring a lot of value to the airlines and receive a lot of value in return.
  • Expert travelers – A new lane starting to appear in some airports, for those who are experienced in getting through security and unlikely to cause delays.
  • Family / Special Assistance – A slow lane, these groups will take longer to get through security.
  • Casual Travelers – A lane for those who move at an average speed through security.
  • Staff / Crew – While in some airports workers and air crew jump to the front of the line, the most efficient airports avoid this disruption by maintaining a separate checkpoint for those who work at the airport, minimizing disruption and ensuring that staff can get to work on time.
  • Specialty – From time to time I’ve seen the airport create a special temporary queue for unique groups such as chartered planes by opening a checkpoint and redirecting the group to the specialty group.

Not only has the separation of travelers into queues according to their profiles (including their priority as a group) proven sufficient to make prioritization by individual traveler unnecessary, it is much easier to manage.

While separating passengers into a number of queues can certainly benefit airports, the most efficient airports are also architected to operate a large number of parallel checkpoints, preventing a situation where every passenger in the airport needs to be funneled through the same metal detector. Imagine an airport trying to service millions of passengers a year on only one or two metal detectors and a single x-ray machine.

In addition to having many checkpoints, the best airports will also have efficient checkpoints, maximizing the flow of passengers through any given checkpoint through better design of the checkpoints and better training of the staff, all without compromising safety.

Perhaps most importantly to the smooth operation of an airport is intelligent management. I’ve seen airports where there were several queues open but empty because the people managing things weren’t flexible enough to reassign lines and adjust the queues to ensure well balanced passenger flow. The best airports will change the designation of queues, move staff around and even redirect passengers to alternate checkpoints that are less busy, all in the interests of moving the highest number of passengers per hour.

Senders can follow these same principles to get maximum throughput and deliverability in their own environment:

  • Segment mail by profile.
  • Choose a sending solution that supports highly parallel sending.
  • Choose a sending solution that provides sufficient throughput.
  • Choose a sending solution that is intelligent.

When sending, remember that segmentation is not just for who to send to or what to send them, but for deciding how to send a message and with what priority. You will want to create queues for high priority messages to satisfy your most valuable customers, queues for high-reputation traffic that delivers without issues as well as for traffic that you expect to deliver slowly (one example is traffic that results in human interactions, you may need to slow this to prevent overloading your call centers), test and administrative traffic that needs to go out as soon as possible and transactional traffic that should not queue up behind bulk sends. This is a common practice among ESPs, who often add specialized segmenting for scenarios such as new customers and customers with specific SLAs.

As with the airports, you need to architect your environment to be able to handle more traffic in parallel. This can be accomplished by adding more injectors and more messaging infrastructure, or by adding better infrastructure. Look at your existing solution: how many IPs can it send from? How many messages per hour can you send on a single machine and how many concurrent connections can it handle? Most Open Source solutions can handle one IP address, send 100,000 messages per hour at most, and can open less than one hundred connections. Low-end commercial solutions can often do over a hundred IPs, send close to a million messages per hour and can handle a few hundred to a couple of thousand connections. On the high end you have carrier-grade systems designed for the enterprise such as Momentum by Message Systems, which can utilize thousands of IPs to send millions of messages per hour across tens of thousands of concurrent connections (while you will never open more than a few connections to a given ISP on a given IP address, lesser solutions will fall short when sending across hundreds of IPs to thousands of ISPs, defaulting to ISP prioritization as a workaround).

Finally consider the intelligence of your infrastructure:

  • Can your infrastructure send across all servers simultaneously and fail-over in the event of an outage?
  • Can your infrastructure adjust throttles on the fly based on responses from the ISPs and bounce and feedback loop data?
  • Does your infrastructure handle queues so efficiently that performance is the same with thousands of messages in the queues as it is with millions of messages in the queues?
  • Can your infrastructure dynamically change from email to other protocols such as SMS and MMS based on subscriber preferences and ISP responses?

If not, we should talk.

How To Send One Billion Email Marketing Messages Per Month

One... *billion* emails!
One... *billion* emails!

One *Billion* Emails

In email marketing there are senders of all shapes and sizes, from small businesses using self-serve ESPs to the largest web properties self-sending to massive user bases. While only a few senders will reach or exceed volumes of one billion messages per month, the tools and practices needed to achieve such a volume level are applicable to all senders who want to succeed in email marketing.

Who Am I?

My name is Mike Hillyer (click here for bio and social links). I manage a team of Sales Engineers for Message Systems, a leading provider of digital messaging solutions for both senders and receivers. In my work over the last several years I have helped a number of clients reach the billion messages per month level and even more clients successfully deploy email marketing solutions ranging in scale from hundreds of thousands to millions of messages per month.

Who Needs To Send This Much Mail?

Contrary to what the image to the image above implies, there’s nothing inherently evil about sending a billion messages a month. Some of the businesses that move a billion messages a month include ESPs, social networks (some move more than a billion a day for that matter), social gaming sites and large online retailers.

Any time you have a fairly large number of users (5-20 million) who receive multiple messages per day, or a really large number of users (40-50 million) receiving one message per day you are heading into the billion messages per month territory.

What Are The Numbers?

So exactly how much mail are we talking about here? That will depend on sending patterns:

In a lot of high-volume environments the sending will be to a world-wide audience, resulting in round-the-clock sending with no significant bursts of traffic. In such an environment the hourly volume will be 1,000,000,000 messages divided by 30 days divided by 24 hours equaling 1,388,889 messages per hour (386 messages per second), assuming 30 days in a month.

In an environment with inconsistent hourly volumes, we have to allow for both an average hourly volume and a maximum hourly volume and then design our solution to address the maximum hourly volume.

We need to look at seasonal factors: Does your social network move a lot of extra messages around Mother’s Day? Does your dating site move a lot of extra messages around Valentine’s day? Does your web shopping portal do a lot of extra business around Christmas?

We need to look at growth: If you are sending a billion messages a month it is very likely due to successful growth of your user base, something which you certainly have no intention of slowing. Look at how you have grown your email volume so far and extrapolate it out for the next year or two (especially if you only get budget for your infrastructure every two years).

Let’s assume for the sake of this article that you have an average volume of one million messages per hour with a peak volume of two and a half million messages per hour during your busiest season. You expect to double your user base each year for the next two years. At the end of two years you expect to be sending ten million messages per hour, or 7.2 billion messages per month (I’ve seen just this kind of growth several times with customers and prospects).

You Will Need to Send In-House

A lot of senders start by using an Email Service Provider (ESP) for their sending and should do so: an ESP provides infrastructure and expertise to handle the details of sending email marketing messages for their clients at a good price, allowing companies to focus on their business. In addition, the costs of installing and maintaining proper sending infrastructure and practices are not justifiable for most low-volume senders.

That said, if you are aiming for a billion email marketing messages a month and are using an ESP it’s time to plan your move to in-house sending. Assuming a $1.00 CPM (Cost Per Mille with Mille being Latin for thousand, so cost per thousand) you are looking at paying an ESP a million dollars a month to handle this kind of volume. Naturally you can probably secure a better rate than $1.00 CPM at these volume levels but regardless of the discount at this volume level you will pay less to buy the infrastructure and hire the people needed to do this yourself, gaining the control you need when sending at these volume levels.

Start With a Good Reputation

In order to hit the volume levels we’re talking about it is going to be vital that you have a solid sending reputation. This means you need to follow best practices for list acquisition, list hygiene, segmentation and relevancy. There’s a wealth of information online and an excellent catalog of it at Email Marketing Reports. This article will focus primarily on the technical aspects of sending one billion email messages per month but keep in mind that reaching one billion messages a month without a solid reputation on your domain and sending IPs is very difficult. A number of tools for checking the reputation of your IP addresses can be found at Word to the Wise. At this volume level email is key to your business and a solid reputation is going to be essential.

From a technical perspective there’s a number of bases we need to cover regarding authentication, whitelisting, bounce processing and complaint handling.

Authentication

As a reputable sender you will want to associate your IP addresses with your domain using the authentication standards available to you. These include SPF, SenderID, DomainKeys (DK) and DomainKeys Identified Mail (DKIM). There are indications that SPF (and SenderID by association) is ineffective but given the low effort required to implement it I would recommend doing so anyway. While SPF and SenderID are purely DNS-based, DK and DKIM require an implementation either during message creation or during relay by the MTA and as a result will impact the maximum throughput of your infrastructure (more about this later).

DomainKeys is quickly being superseded by DomainKeys Identified Mail but with most solutions supporting both DK and DKIM it is simple enough to use both when sending to an ISP that supports one standard or the other. Implementation details will vary based on your sending solution. While some recommend selectively signing DK and DKIM for only messages sent to ISPs that are known to check authentication (in order to lower the impact signing has on throughput on a solution that takes a significant performance hit from signing), I recommend signing all messages; you never know who is checking for authentication without announcing it.

Whitelisting

One benefit of getting on the various whitelists provided by ISPs and reputation providers is that in some cases you can send higher volumes on whitelisted IP addresses than would otherwise be possible. Keep in mind that in most situations whitelisting is something that comes after sending has already begun in order to allow the provider of the whitelist to examine your sending patterns as part of the whitelisting process, so put your best foot forward (and follow it up with consistent behavior).

Bounce Processing

One quick way to lose reputation is to repeatedly send mail to recipients that do not exist. The ISPs will track how many non-existing addresses you send to and throttle you accordingly. Even more seriously, ISPs will occasionally take inactive email addresses and re-activate them as spam traps; any mail sent to the address will immediately get the source classified as a spam source and filtered accordingly.

To prevent this it is necessary to capture and act on the responses sent by the ISPs and unsubscribe those addresses identified as non-existent or inactive, while retaining those with responses that identify users on vacation and other not-fatal errors. Commercial sending solutions will perform this automatically with varying levels of effectiveness while other platforms will require a third-party solution such as Boogie Tools. Keep in mind that the more you send, the more you receive back in the form of automated responses and bounce notifications. As your reply addresses reach more and more users the flow of notifications will become contaminated with spam and virus-carrying messages, requiring the implementation of Anti-Virus/Anti-Spam solutions for your incoming mail stream.

Complaint Handling

In an effort to help senders improve their practices, a number of ISPs have implemented ARF formatted Feedback Loop programs. When a user on a supported ISP clicks the “This is Spam” button, an automated message is sent to an address you define in advance (when signing up with the ISP for the Feedback Loop program). By processing these messages and un-subscribing the relevant users, you prevent further reputation damage that may result when sending them future messages.

The ARF format used by the ISPs makes it relatively straightforward to process Feedback Loop messages and use them to unsubscribe the users who have complained about your messages. There are tools available to process ARF formatted messages and some sending solutions will handle FBL messages natively.

Infrastructure Considerations

There are a number of architectural components that come into play to make it possible to send email marketing messages at volume levels of one billion email messages per month (or more) including network connectivity, server hardware and software.

Connectivity

Most professional sending operations are based in rented datacenters, simplifying the provisioning of network connectivity. In our initial example of a maximum throughput of 2.5 million messages per hour we’ll use a sample message size of 50 kilobytes (51,200 bytes), meaning that we need to send at a rate of  2,500,000 * 51,200 = 128,000,000,000 bytes per hour or 271.2 megabits per second.

With the throughput we’re talking about we certainly need to use gigabit speed networking within the datacenter and, more importantly, need backbone connectivity that can support not only a sustained throughput of 271 megabits per second but than can handle our future needs of 7.2 billion messages per month. You need to look at a datacenter that will be able to provide sustained gigabit speeds to the backbone.

Keep in mind that when you are sending a billion messages per month it means that email has significant impact on your bottom line and you won’t be able to tolerate extended outages. You need to not only make sure that the datacenter you choose has redundant power and backbone connections, you also need to consider using redundant datacenters.

Server Hardware

Moving over a million messages per hour does not require the purchase of custom server hardware but it does require making a proper investment in hardware. Generally speaking you will be using an infrastructure similar to the following:

An example of a basic sending infrastructure

The Message Injector queries the database and uses the results to assemble one or more messages which it relays to the Outbound Mail Server. The Outbound Mail Server queues the message, performs any necessary manipulations on the message and then sends it via the Internet to the destination server. In the event of a delivery failure message or a feedback loop message, the incoming message arrives via the Internet to the Inbound Mail Server. The Inbound Mail Server performs anti-virus/anti-spam scanning and then, in the case of a legitimate message, processes the message and updates the subscriber information in the database (not all server solutions can perform this processing in-stream, when using such solutions an intermediate server will be needed to accept the clean message from the Inbound Mail Server and process it using custom code).

In a production deployment there can be several variations on this example, typically with multiple servers used on the outbound and inbound roles, with multiple message injectors pushing to the outbound machines and often specialty servers on the inbound side dedicated to processing incoming feedback loop and bounce messages.

I generally recommend mail servers similar to the following:

  • 2x multi-core, 64-bit processors
  • 16-32 GB of RAM
  • 8x 15K RPM hard-disks
  • Battery-backed RAID-10 controller

The specific details of your hardware selection will depend on the ability of your specific software to leverage the resources provided. A large number of fast disks in a RAID-10 array is recommended for the message spool as standards-compliant mail servers must write messages to disk before accepting them for delivery, placing significant demands on storage resources.

Software

As an employee of a leading software provider for high-volume senders you would rightly expect me to recommend a commercial solution, and specifically my company’s solution. I’d like to take a moment to point out why:

Performance

We need to send at a rate in excess of one million messages per hour. I’ve dealt with a number of solutions and my experience has shown that most Open Source MTAs such as Postfix and Sendmail are limited to around 100,000 messages per hour. Commercial sending solutions typically show real-world performance ranging from 500,000 messages per hour to over two million messages per hour.

I have helped several companies that were operating dozens of Open Source servers to consolidate down to one-tenth as many servers running a commercial solution.

Segregation

In addition to limited throughput, Open Source MTAs are usually limited to sending through a single IP address, meaning that to send through ten IP addresses you need ten separate server instances. Commercial solutions support sending through multiple IP addresses simultaneously.

Advanced Functionality

Commercial solutions go beyond basic message queuing and sending, providing the additional functionality required for a high-volume sender. This includes features such as APIs, bounce classification, feedback loop processing, internal scripting, automated throttling, and database integration.

Availability

If you’re sending a billion emails a month, you absolutely need a solution that provides high availability out of the box. If a server goes down you can’t afford to be frantically activating a warm spare, just to find out that it too has some issue. You need an active-active solution that reacts automatically to server failures and keeps the mail flowing.

Manageability

You need a solution that can be easily managed on your terms, whether you prefer editing configuration files or using a web interface. In addition, you need something that grows with you, providing centralized management of an entire cluster of servers. Commercial solutions will provide easier, centralized management.

Reporting

One key to successful sending at high volumes is keeping tabs on how your server is performing and how your mailings are doing. You need to know what is passing through your server, how quickly messages are moving, whether queues are backed up, how the various ISPs are treating your traffic, all with the ability to drill down on specific source IPs and specific destination ISPs. You need to be able to see all of this in real-time and across your entire infrastructure. A good commercial solution provides all of this out of the box.

Time is Money

On multiple occasions I’ve seen organizations choose a free or low-cost solution and then spend countless hours building workarounds to the weaknesses of their chosen platform, writing scripts to automate administration, reporting tools to fill their needs, failover scripts to provide redundancy, etc.

While a lot of this work was impressive, it required time to implement and time to maintain. Time spent creating tools that are already provided in an alternative solution is time (and money) wasted. You are always better off using your time to create your “special sauce”; that which makes your business unique and gives you a competitive advantage.

The Price of Success is Continued Vigilance

When sending at a rate of one billion messages a month (or more), you can’t just use a ‘fire and forget’ mentality. You are going to have to have people around to keep a constant eye on what is happening in your environment, monitoring multiple key factors to ensure you can continue to successfully send.

Reputation Monitoring

Remember earlier when I said you need to start with a good reputation? You also need to keep a good reputation, and the only way to do that is to know what your reputation is. You will need to take advantage of reputation monitoring tools provided by companies like Return Path and Pivotal Veracity as well as keep a close eye on the reporting produced by your sending solution (remember when I said you need good reporting?)

You need to watch things such as bounce rates, FBL hit rates, blacklist hits, transient failures and response rates.

Infrastructure Monitoring

If email is your company’s lifeblood (and if you’re sending a billion messages a month it certainly is) then you need to make sure to keep it flowing no matter what happens, and that means making sure your email infrastructure stays online. I spoke earlier of  the need for high-availability, active monitoring goes hand-in-hand with this need. You will need to monitor the health of the servers that support your infrastructure, the network components that carry your messages and the software that creates and relays your messages.

There are a number of monitoring solutions available to accomodate any platform and budget, be sure to implement one that meets your needs and get monitoring. Make sure to test simulated failures to confirm that monitoring is working successfully. Consider setting up a simulated mailing that runs on a regular basis using your full infrastructure stack: a monitoring script can check an inbox and if the test message fails to appear, something is potentially wrong in your sending infrastructure. This approach can help identify issues that may pass by other monitoring systems unnoticed (and can be integrated into some monitoring solutions directly).

Response Monitoring

Keep in mind that you get what you monitor for; if you focus too much on one metric it may improve without helping the big picture. In addition to making sure all the underlying pieces are in place, don’t forget to keep an eye on things where the rubber meets the road. You may be sending at phenomenal rates with great metrics but failing to generate customer actions that lead to revenue.

Conclusion

While by no means an exhaustive list, I hope this gives you some idea as to the scope of sending in a high-volume environment of one billion email messages per month. Watch this space over the coming weeks for deeper dives into some of the subjects covered here.

Questions? Did I miss something? Let me know in the comments!

Disclaimer

The opinions and information in this post are my own and do not necessarily reflect those of my employer.