<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title>Articles tagged perl at null program</title>
  <link rel="alternate" type="text/html"
        href="https://nullprogram.com/tags/perl/"/>
  <link rel="self" type="application/atom+xml"
        href="https://nullprogram.com/tags/perl/feed/"/>
  <updated>2026-04-09T13:25:45Z</updated>
  <id>urn:uuid:dd44fb39-63bf-4290-8e5f-818e34348c1c</id>

  <author>
    <name>Christopher Wellons</name>
    <uri>https://nullprogram.com</uri>
    <email>wellons@nullprogram.com</email>
  </author>

  <entry>
    <title>Torrent File Strip</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2011/02/19/"/>
    <id>urn:uuid:fa638343-71a9-3c3c-8e31-f7f8d83b55b4</id>
    <updated>2011-02-19T00:00:00Z</updated>
    <category term="perl"/>
    <content type="html">
      <![CDATA[<!-- 19 February 2011 -->
<p class="abstract">
You can skip my explanation and download the tool here,
</p>
<pre>
git clone <a href="https://github.com/skeeto/btstrip">git://github.com/skeeto/btstrip.git</a>
</pre>
<p>
My main computer recently stopped working for me, and, until I have a
replacement, I've been using my wife's old 2004-era desktop, with only
500MB of memory. That has required me to make use of a lighter-weight
BitTorrent client than I have in the
past: <a href="http://libtorrent.rakshasa.no/">rTorrent</a>.
</p>
<p>
Because rTorrent's interface is built on ncurses, when combined
with <a href="/blog/2009/03/05/">GNU Screen</a> it behaves very much
like a daemon. I have configured it to watch a certain directory for
new torrent files. When I want to start downloading a new torrent, I
just put the torrent file there and rTorrent gets to work on it. Share
this watched directory on a network, and rTorrent becomes a network
service.
</p>
<pre>
# rTorrent configuration
directory = /torrents/
session = /torrents/.session/
schedule = watch_directory,5,5,load_start=/torrents/watch/*.torrent
schedule = untied_directory,5,5,stop_untied=
schedule = tied_directory,5,5,start_tied=
</pre>
<p>
Unfortunately, the rTorrent documentation has not been kept up to
date, and appears to be inaccurate. I prefer to rely completely on
the distributed hash table (DHT) rather than use normal trackers
(<a href="/blog/2009/10/26/">so long as the torrent is free of
DRM</a>). My last BitTorrent client allowed me to do this, but the
documented rTorrent option for doing
this, <code>enable_trackers</code>, doesn't seem to work at all.
</p>
<pre>
enable_trackers = no
</pre>
<p>
My current workaround is to strip away all of the tracker information
from the torrent file before giving it to rTorrent. It's a Perl
script, with all the hard work being done by
the <a href="http://search.cpan.org/dist/Bencode/lib/Bencode.pm">
Bencode module</a>. Stripping out the trackers is trivial,
</p>
<figure class="highlight"><pre><code class="language-perl" data-lang="perl"><span class="k">my</span> <span class="nv">$decoded</span> <span class="o">=</span> <span class="nv">bdecode</span> <span class="k">do</span> <span class="p">{</span> <span class="nb">local</span> <span class="vg">$/</span><span class="p">;</span> <span class="o">&lt;</span><span class="bp">STDIN</span><span class="o">&gt;</span> <span class="p">};</span>
<span class="nv">$$decoded</span><span class="p">{'</span><span class="s1">announce</span><span class="p">'}</span> <span class="o">=</span> <span class="p">"</span><span class="s2">http://127.0.0.1/</span><span class="p">";</span>
<span class="nb">delete</span> <span class="nv">$$decoded</span><span class="p">{'</span><span class="s1">announce-list</span><span class="p">'};</span>
<span class="k">print</span><span class="p">(</span> <span class="nv">bencode</span> <span class="nv">$decoded</span> <span class="p">);</span></code></pre></figure>
<p>
The encoding of a torrent file is of critical importance, due to
the <code>info_hash</code>: the hash of all of the torrent data and
metadata, which uniquely identifies that torrent. If any aspect of
the <code>info</code> field in the torrent file changes, you get a
different hash and will not be able to participate in the original
torrent. The reason the above code is guaranteed to work properly is
that, for any possible bencode data structure, there
is <i>exactly</i> one possible way to bencode it.
</p>
<p>
There are four data types in the bencode format: integers, byte
strings, lists, and associative arrays. Integers are encoded in ASCII
as base-10 (making them unaffected by endianness). Byte strings are
stored as an integer indicating their length, followed by the literal
string itself. Lists are stored as a sequence of objects in order,
terminated by a sentinel. And associative arrays are stored as a list
of key-value pairs. Most importantly, those pairs are stored in
alphabetical order by key, enforcing a single encoding for any given
associative array.
</p>
<p>
The simple code above only works on stdin to stdout, but it would be
more convenient to edit torrent files in place. So I buffed it up a
bit in my working version. It has two command line switches: one for
operating on the file in-place (<code>-i</code>), and one for setting
the tracker URL manually, rather than inserting a dummy value
(<code>-t</code>). It also strips out some of the extra, optional
fields, cutting the torrent file down to its minimal size.
</p>
<p>
Once you've installed the Bencode module, drop that script in
your <code>PATH</code> somewhere and you're ready to go.
</p>
]]>
    </content>
  </entry>
  <entry>
    <title>E-mail Obfuscater Perl One-liner</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2009/06/02/"/>
    <id>urn:uuid:9498c4a0-8f5c-37a3-b443-ccc189dbbcbd</id>
    <updated>2009-06-02T00:00:00Z</updated>
    <category term="perl"/>
    <content type="html">
      <![CDATA[<!-- 2 June 2009 -->
<p>
If you look at the page sources around here you might notice that
there are no bare e-mail addresses around. This is because I obfuscate
them into a series of HTML entities. So far this has been pretty
effective at hiding from address-scraping, web-crawling spam
bots. They don't seem to try very hard at decoding HTML entities.
</p>
<p>
When I added the comment system, I needed to obfuscate addresses
automatically. I quickly realized that this is yet another Perl
one-liner (and implemented as one line in the comment system). It can
be used on the command line to obfuscate a file/pipe containing a list
of addresses.
</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">perl <span class="nt">-lpe</span> <span class="s1">'$_ = join "", map {"&amp;#" . ord() . ";"} split //'</span></code></pre></figure>
<p>
All of the spaces are really only there for us humans,
</p>
<figure class="highlight"><pre><code class="language-bash" data-lang="bash">perl <span class="nt">-lpe</span> <span class="s1">'$_=join"",map{"&amp;#".ord.";"}split//'</span></code></pre></figure>
<p>
I keep running into these one-liners.
</p>
]]>
    </content>
  </entry>
  <entry>
    <title>Another Perl One-liner: Byte Order</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2009/05/20/"/>
    <id>urn:uuid:cc019ed2-b329-333e-1625-2f0f8cd385fd</id>
    <updated>2009-05-20T00:00:00Z</updated>
    <category term="perl"/>
    <content type="html">
      <![CDATA[<!-- 20 May 2009 -->
<p>
At work right now I am using two different machines. One is a 32-bit
SPARC and the other is a very powerful SMP x86-64. Sometimes data are
generated on one machine and used in a simulation on the other. There
is a problem of byte-order, though. The SPARC is big-endian and the
other is little-endian, and the programs on both sides don't pay
attention to that small detail.
</p>
<p>
Luckily, the data are all 4-byte aligned. That's perfect for a Perl
one-liner byte order conversion,
</p>
<pre>
perl -e 'print scalar reverse while read STDIN, $_, 4' &lt; in.le &gt; out.be
</pre>
<p>
Perl is really great for concise hacks. I really like how this
one-liner almost reads like a natural language sentence. Is there any
other language that can do powerful one-liners like Perl?
</p>
]]>
    </content>
  </entry>
  <entry>
    <title>SWF Decompression Perl One-liner</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2009/04/18/"/>
    <id>urn:uuid:dfb62762-6499-323d-093f-386400407b28</id>
    <updated>2009-04-18T00:00:00Z</updated>
    <category term="perl"/>
    <content type="html">
      <![CDATA[<!-- 18 April 2009 -->
<p>
<img src="/img/misc/magnify.png" alt="" class="right"/>

Flash seems to be the popular way of playing videos online. This is a
bit better than the bad old days of online video where a user had to
select from a few buggy media player plug-ins. Things have improved.
</p>
<p>
However, if you don't use Flash, or if you want to watch the videos in
your own media player, you are stuck. A download link for the video is
almost never provided. The video is always somewhere, though, to be
fetched via http. I <a href="/blog/2007/09/05">mentioned this
before</a> for downloading YouTube videos using <a
href="http://rg3.github.com/youtube-dl/"> youtube-dl</a>.
</p>
<p>
The trick is finding the URL. Sometimes you can derive it from the
HTML code, sometimes you have to dig a little deeper by inspecting the
Flash player itself. <a
href="http://en.wikipedia.org/wiki/Strings_(Unix)"><code>
strings</code></a> can be invaluable here.
</p>
<p>
There could be an extra layer of stuff to work out, which is explained
below. My main reason for posting this is so I can refer back to it
later when I need to do it again.
</p>
<p>
So, the other day I ran into a Flash video player that contained part
of the URL of its video. I began by studying the <code>embed</code>
tag in the HTML, which gave me some information about where to find
the video (the video ID number). I downloaded the Flash player SWF
file for the purpose of running <code>strings</code> on it.
</p>
<p>
I ran into a problem here. I wasn't finding any non-garbage strings
inside the file. <code>file</code> told me it was <i>compressed</i>.
</p>
<pre>
$ file player.swf
player.swf: Macromedia Flash data (compressed), version 9
</pre>
<p>
Searching online quickly revealed that a compressed Flash file is just
zlib compression after an 8-byte header. Decompression can actually be
done with a Perl one-liner,
</p>
<pre>
perl -MCompress::Zlib -0777 -e \
      'print uncompress substr &lt;&gt;, 8;' player.swf &gt; player
</pre>
<p>
I ran <code>strings</code> and <code>grep</code>ped for "http",
revealing the location of the video. That was it!
</p>
<p>
I actually came across <a
href="http://www.brooksandrus.com/blog/2006/08/10/"> a Java program
</a> that does the same thing. It is 115 lines of code. Java programs
always seem to be bloated like this.
</p>
<p>
I hope you find this useful!
</p>
]]>
    </content>
  </entry>
  <entry>
    <title>Fantasy Name Generator: Request for Patterns</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2009/01/04/"/>
    <id>urn:uuid:a5777be4-102b-3359-a083-36a8a9578ae6</id>
    <updated>2009-01-04T00:00:00Z</updated>
    <category term="game"/><category term="perl"/>
    <content type="html">
      <![CDATA[<!-- 4 January 2009 -->
<p>
  <img src="/img/misc/name-generation.jpg" alt="" class="right"/>

Whether choosing a name for my character in a fantasy game or
populating a world which I pretend to myself that I will one day DM, I
have always gone to the <a href="http://www.rinkworks.com/namegen/">
RinkWorks Fantasy Name Generator</a>. The author of this tool, Samuel
Stoddard, gives <a
href="http://www.rinkworks.com/namegen/history.shtml"> some
history</a> on how he came to design and develop it.
</p>
<p>
It works by using a pattern to select sets of letters to put together
into a name. There is a thorough, <a
href="http://www.rinkworks.com/namegen/instr.shtml">long
description</a> on the website. Unfortunately, he didn't share his
source code, so we cannot see how he did it.
</p>
<p>
Therefore, I used his description to duplicate his generator.
</p>
<p>
You can grab a copy here with <a href="http://git.or.cz/">git</a>,
</p>
<pre>
git clone <a href="https://github.com/skeeto/fantasyname">git://github.com/skeeto/fantasyname.git</a>
</pre>
<p>
It includes a command line interface as well as a web interface, which
I am running and linked to at the beginning of this post for you to
use. The code is available under the same license as Perl itself.
</p>
<p>
I used Perl and the <a
href="http://search.cpan.org/dist/Parse-RecDescent/lib/Parse/RecDescent.pm">
Parse::RecDescent</a> parser generator. Thanks to this module, it
essentially comes down to about 40 lines of code. The name pattern is
executed, just like a computer program, to generate a name. Here is
the BNF grammar I came up with,
</p>
<pre>
LITERAL ::= /[^|()&lt;&gt;]+/

TEMPLATE ::= /[-svVcBCimMDd']/

literal_set ::= LITERAL | group

template_set ::= TEMPLATE | group

literal_exp ::= literal_set literal_exp | literal_set

template_exp ::= template_set template_exp | template_set

literal_list ::= literal_exp "|" literal_list | literal_exp "|" | literal_exp

template_list ::= template_exp "|" template_list | template_exp "|" | template_exp

group ::= "&lt;" template_list "&gt;" | "(" literal_list ")"

name ::= template_list | group
</pre>
<p>
The program is just that, decorated with some bits of Perl. Since I
came up with it, I have found that it is slightly different from
Mr. Stoddard's generator, in that his allows empty sets anywhere.
Mine only allows them at the end of lists. For example, this is valid
for his generator,
</p>
<pre>
&lt;|B|C&gt;(ikk)
</pre>
<p>
But to work in mine, the empty item must be moved to the end,
</p>
<pre>
&lt;B|C|&gt;(ikk)
</pre>
<p>
This can be adjusted by making the proper changes to the grammar,
which I haven't figured out yet.
</p>
<p>
Another problem with mine is that Parse::RecDescent is
<i>slooooowwwww</i>. Ridiculously slow. Maybe I designed the grammar
poorly? This is probably the biggest problem. Even simple patterns can
take several seconds to generate names, especially with deeply
nested patterns. For example, this can take minutes,
</p>
<pre>
&lt;&lt;&lt;&lt;&lt;&lt;&lt;s&gt;&gt;&gt;&gt;&gt;&gt;&gt;
</pre>
<p>
Before you go thinking you are going to tank my server, I have written
the web interface so that it limits the running time of the
generator. If you want to do something fancy, use your own hardware. ;-)
</p>
<p>
There is also a problem that it will silently drop invalid pattern
characters at the end of the pattern. This has to do with me not quite
understanding how to apply Parse::RecDescent yet.
</p>
<p>
And this is where I need your help. I have had some trouble coming up
with good patterns. I don't even have a good default, generic fantasy
name pattern. Here are some of mine,
</p>
<pre>
&lt;s|B|Bv|v&gt;&lt;V|s|'|V&gt;&lt;s|V|C&gt; # default
&lt;i|Cd&gt;D&lt;d|i&gt; # idiot
&lt;V|B&gt;&lt;V|vs|Vs&gt; # short
</pre>
<p>
None of which I am very satisfied with.
</p>
<p>
You can design patterns for Nordic names, Gallic names,
Tolkienesque Middle Earth names, orc names, idiot names, dragon names,
dwarf names, elf names, Wheel of Time names, and so on. There is so
much potential available with this tool.
</p>
<p>
To suggest one to me, e-mail me some patterns, or even better, clone
my git repository and add one to it yourself (then ask me to pull from
you). This way your credit will stay directly attached to it with a
commit.
</p>
<p>
Good luck!
</p>
]]>
    </content>
  </entry>
  <entry>
    <title>Don't Write Your Own E-mail Validator</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2008/12/24/"/>
    <id>urn:uuid:0e9536bd-6f02-332b-e974-442f31779080</id>
    <updated>2008-12-24T00:00:00Z</updated>
    <category term="rant"/><category term="perl"/>
    <content type="html">
      <![CDATA[<!-- 24 December 2008 -->
<p>
Gmail has a nice feature: when delivering e-mail, everything including
and after a <code>+</code> in a Gmail address is ignored. For example,
mail arriving at all of these addresses would go to the same place if
they were Gmail addresses,
</p>
<pre>
account@example.com
account+nullprogram@example.com
account+slashdot@example.com
</pre>
<p>
Thanks to this feature, when a user acquires a Gmail account, Google
is actually providing about a googol (as in the number
10<sup>100</sup>) different e-mail addresses to that user! Quite
appropriate, really.
</p>
<p>
I have seen other mailers do similar things, like ignoring everything
after dashes. A nice advantage to this is when registering at a new
website I can customize my e-mail address for them by, say, throwing
the website name in it. Because I have a googol of e-mail addresses
available, it is impossible to run out, so I can give every person I
meet their own version of my address. The custom address can come in
handy for sorting and filtering, and it will also tell me who is
selling out my e-mail address. This, of course, assumes that someone
isn't stripping out the extra text in my address to counter the Gmail
feature.
</p>
<p>
However, in my personal experience, most websites will not permit
<code>+</code>'s in addresses. This is completely ridiculous, because
it means that <b>virtually every website will incorrectly invalidate
perfectly valid e-mail addresses</b>. Even major websites, like
<i>coca-cola.com</i>, screw this up. They see the <code>+</code> in
the address and give up.
</p>
<p>
In fact, if I do a Google search for "email validation regex" right
now, 9 of the first 10 results return websites with regular
expressions that are complete garbage and will toss out many common,
valid addresses. The only useful result was at the fifth spot (linked
below).
</p>
<p>
For the love of Stallman's beard, <b>stop writing your own e-mail
address validators!</b>
</p>
<p>
Why shouldn't you even bother writing your own? Because the proper
Perl regular expression for <a
href="http://www.ietf.org/rfc/rfc0822.txt?number=822">RFC822</a> is <a
href="http://ex-parrot.com/~pdw/Mail-RFC822-Address.html">over 6
kilobytes in length</a>! Follow that link and look at that. This is
the <i>smallest</i> regular expression you would need to get it right.
</p>
<p>
If you <i>really</i> insist on having a nice short one and don't want
to use a validation library, which, again, is a stupid idea and you
<i>should</i> be using a library, then use the dumbest, most liberal
expression you can. (Just don't forget the security issues.) Like
this,
</p>
<pre>
.+@.+
</pre>
<p>
Seriously, if you add anything else you will almost surely make it
incorrectly reject valid addresses. Note that e-mail addresses can
contain spaces, and even more than one <code>@</code>! These are
valid addresses,
</p>
<pre>
"John Doe"@example.com
"@TheBeach"@example.com
</pre>
<p>
I have not yet found a website that will accept either of these, even
though both are completely valid addresses. Even MS Outlook, which I
use at work (allowing me to verify this), will refuse to send e-mail
to these addresses (Gmail accepts them just fine). Hmmm... maybe having
an address like these is a good anti-spam measure!
</p>
<p>
So if your e-mail address is <code>"John Doe"@example.com</code> no
one using Outlook can send you e-mail, which sounds like a feature to
me, really.
</p>
<p>
So, everyone, please stop writing e-mail validation regular
expressions. The work has been done, and you will only get it wrong,
guaranteed.
</p>
<p>
This is a similar rant I came across while writing mine: <a
href="http://www.santosj.name/general/stop-doing-email-validation-the-wrong-way/">
Stop Doing Email Validation the Wrong Way</a>.
</p>
]]>
    </content>
  </entry>
  <entry>
    <title>Memoization</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2008/03/25/"/>
    <id>urn:uuid:c9f47622-9893-377c-a28b-d1cdfe8850e8</id>
    <updated>2008-03-25T00:00:00Z</updated>
    <category term="perl"/>
    <content type="html">
      <![CDATA[<!-- 25 March 2008 -->
<p>
I had written
in <a href="/blog/2008/01/29#collatz">a
previous post</a> about a neat feature of Lua. I found out later that
this is simply a form
of <a href="http://en.wikipedia.org/wiki/Memoization">Memoization</a>. The
idea is that you trade memory for speed by only doing calculations
once and keeping track of previously calculated values. I had even
complained about Perl hashes not being flexible enough (not true,
thanks
to <a href="http://perldoc.perl.org/Tie/Hash.html">Tie::Hash</a>). Perl
actually has something even cooler, which is the
<a href="http://search.cpan.org/~mjd/Memoize-1.01/Memoize.pm">Memoize
module</a>.
</p>
<p>
The module can memoize any function, although it is only useful on
"pure" functions: functions with no side effects and no dependence on
external data that will change. The official documentation contains a
nice example demonstrating a recursive implementation of a Fibonacci
sequence generator. My example is a little program I wrote the other
day where the memoize module came in handy.
</p>
<p>
You have coins valued at 1, 2, 5, 10, 20, 50, 100, and 200. How many
different ways can 200 be made using any number of coins? A simple
recursive solution is this: stick in each coin one at a time and ask
the same question again. So, we use a coin worth 1, now the question
is how many ways can we make change for 199. Then 198, then 195, then
190, etc. Because the order of the coins is not important these two
sets are identical: (1 1 5) (5 1 1). So, to avoid counting the same
set twice, we also want to tell the function the largest size coin to
use from then on. Our function may look like this now (Perl),
</p>
<pre>
use List::Util qw(sum);

sub count {
    my ($s, $m) = @_;
    return 1 if ($s == 0);

    my @valid = grep {$_ &lt;= $s and $_ &gt;= $m} @coins;
    return 0 if ($#valid == -1);

    return sum map {count($s - $_, $_)} @valid;
}
</pre>
<p>
Where it is called as <code>count(total, max_coin_value)</code>.
</p>
<p>
However, we will be calculating the same value twice many times
over. For example, let's say we start filling the first 10 of 200 like
this: (1 1 1 1 1 5) or (5 5). The next call to <code>count</code> will
be <code>count(190, 5)</code> for both cases. Just like the recursive
Fibonacci implementation, we are spending an enormous amount of time
repeating ourselves. Running this for a value of 200 will take
minutes. Running it for a value of 2000 may take days! Memoization to
the rescue!
</p>
<p>
We will now add this,
</p>
<pre>
use Memoize;
memoize('count');
</pre>
<p>
The module has now transparently installed a new version of the
function over our original. If we ever pass the same arguments that we
already have passed, the module will look up the original calculated
value and return it instead of calling the real function. It now can
calculate the number of ways to make change for a value of 2000 in a
couple seconds rather than days. That's how much redundant work the
function was doing. Here is the whole thing,
</p>
<pre>
#!/usr/bin/perl

use strict;
use warnings;
no warnings qw(recursion);
use List::Util qw(sum);

use Memoize;
memoize('count');

my @coins = (1, 2, 5, 10, 20, 50, 100, 200);

print count(200, 1);
print "\n";

sub count {
    my ($s, $m) = @_;
    return 1 if ($s == 0);

    my @valid = grep {$_ &lt;= $s and $_ &gt;= $m} @coins;
    return 0 if ($#valid == -1);

    return sum map {count($s - $_, $_)} @valid;
}
</pre>
<p>
Now, to apply it to the Collatz problem from my previous post we get a
nice simple little program,
</p>
<pre>
#!/usr/bin/perl

use strict;
use warnings;
no warnings qw(recursion);
use List::Util qw(max);

use Memoize;
memoize('collatz');

while (&lt;&gt;) {
    my ($n, $m) = split;
    printf("$n $m %d\n", max map { collatz($_) } ($n..$m));
}

sub collatz {
    my $n = shift;
    return 1 if ($n == 1);
    return 1 + collatz(3*$n+1) if ($n &amp; 1);
    return 1 + collatz($n/2);
}
</pre>
<p>
I really do love Perl.
</p>
]]>
    </content>
  </entry>
  <entry>
    <title>A Faster Montage</title>
    <link rel="alternate" type="text/html" href="https://nullprogram.com/blog/2007/12/26/"/>
    <id>urn:uuid:a4b9aedd-14f9-30fa-f2e5-7a478aff9be2</id>
    <updated>2007-12-26T00:00:00Z</updated>
    <category term="perl"/>
    <content type="html">
      <![CDATA[<p><em>Update May 2015: Somehow the original script was lost while <a href="/blog/2011/08/05/">changing
 hosts</a> four years ago. I’ve replaced the script with a smaller,
 better, standalone C program. Note: it has a different interface, so
 read the header first!</em></p>

<ul>
  <li><a href="/download/fastmontage.c">/download/fastmontage.c</a> (new!)</li>
</ul>

<p>I had written a previous post called <a href="/blog/2007/12/11">Movie DNA</a> where I
described a simple way of distilling an entire movie down to a single
frame. It involved the use of two tools, with no intermediate code or
software in the middle to glue things together.</p>

<p>The first tool, mplayer, was used to dump all of the frames we needed.
This took about the running length of the movie to do, which wasn’t so
bad. There may be a way to speed this up by giving mplayer some extra
hints. I have not yet figured this part out.</p>

<p>The real time cost was in ImageMagick’s <code class="language-plaintext highlighter-rouge">montage</code> tool, which made the
final montage out of the images. This took between 6 and 10 hours,
depending on the length of the movie. The process seemed to
be non-linear for some reason, with long movies taking
disproportionately longer to process (one could always dig around the
montage source to find out why). I knew there had to be a way that
this could be improved!</p>

<p>Well, I wrote a Perl script last night, dubbed <code class="language-plaintext highlighter-rouge">gdmontage</code> to speed up
the montage process. It was even faster than I thought it would be,
<strong>taking only 12 seconds</strong> on the same machine as before. It uses the
<a href="http://www.boutell.com/gd/">GD Graphics Library</a> via Perl’s <a href="http://search.cpan.org/dist/GD/GD.pm">GD module</a>, which you
would need to install to use this. It also uses
<a href="http://search.cpan.org/~fluffy/Term-ProgressBar-2.09/lib/Term/ProgressBar.pm">Term::ProgressBar</a>, if it’s available, to provide a progress
bar and ETA.</p>

<p>Like the original <code class="language-plaintext highlighter-rouge">montage</code> program, the script recognizes file globs,
so you can provide the files through a glob in order to avoid the
limits on command line arguments.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./gdmontage.pl "frames/*"
</code></pre></div></div>

<p>It is a bit unfair to call my code a “faster montage” because it only
covers a tiny subset of the original <code class="language-plaintext highlighter-rouge">montage</code>. It makes some big
assumptions in order to be faster; specifically, it assumes that every
image is the same size. The original montage must look at every image
before it even starts in order to determine the dimensions and
placement of the final image.</p>

<p>It is also geared towards the Cinema Redux thing, doing only 60 images
per row. This can be changed internally (no command line arguments for
this) by adjusting the parameters at the top of the script. The script
could probably be easily expanded to include most of the features of
ImageMagick’s montage, but I am sure this Perl script would be much
faster when it comes to creating large montages. (Why is <code class="language-plaintext highlighter-rouge">montage</code> so
slow?)</p>

]]>
    </content>
  </entry>

</feed>
