For a long time I couldn't figure out how to make decent thumbnails
with
ImageMagick. Specifically, I wanted to create uniform sized
thumbnails from arbitrary images. Over the weekend I came across the
ImageMagick Examples page, which shows exactly how to do
this. Here's the command for a 150x150 thumbnail,
convert orig.jpg -thumbnail 150x150^ -gravity center \
-extent 150x150 thumb.jpg
It cuts out the largest square possible from the center of the image
and resizes that to 150x150. This capability has actually only been
available for 2 years now! It wasn't there last time I needed it.
I can think of one way to improve it: instead of selecting the center,
it selects the area with the highest information density. This could
be measured by edge detection, corner detection, or some other
statistical method. It would be selected by changing the gravity
option to, say, "entropy".
I'm listing this here mostly for my own future reference. :-)
Since I recently got back into Java recently, I threw together this
little Game of Life implementation in Java. It looks a lot like my maze generator/solver on the inside,
reflecting the way I think about these things. Gavin wrote a
competing version of the game in Processing which we were partially
discussing the other night, so I made my own.
The individual cells are actually objects themselves, so you could
inherit the abstract Cell class and drop in your own rules. I bet you
could even write a Cell that does the
iterated prisoner's dilemma cellular automata. The Cell objects
are wired together into a sort of mesh network. Instead of growing it
wraps around on the sides.
It takes up to four arguments right now, with three types of cells,
basic, implementing the basic Conway's Game of Life,
growth, which is a cell that matures over time, and
random which mixes both types together (seen in the
screenshot). The arguments work as follows,
Developing C in Emacs is a real joy, and it's mostly thanks to the
compile command. Once you have your Makefile -- or SConstruct or
whatever build system you like -- setup and you want to compile your
latest changes, just run M-x compile, which will run your
build system in a buffer. You can then step through the errors and
warnings with C-x `, and Emacs will take you to
them. It's a very nice way to write code.
I use the compile command so much that I bound it to C-x
C-k (C-k tends to be part of compile key
bindings),
(global-set-key "\C-x\C-k" 'compile)
Until recently, I didn't have as nice of a setup for Java. Since they
generally force offensive IDEs onto me at work this wasn't something I
needed yet anyway, but I get to choose my environment on a new
project this time. If you're using Makefiles for some reason when
building your Java project, it still works out fairly well because
they're usually called recursively. It gets more complicated with Ant, where there is only one
top-level build file. Emacs' compile command only runs the build
command in the buffer's current directory.
I know three solutions to this problem. One is to provide the build
file's absolute path when compile asks for the command
with the -buildfile (-f) option. You only
need to type it once per Emacs session, so that's not too bad.
ant -emacs -buildfile /path/to/build.xml
It's not well documented, but there is a -find option
that can be given to Ant that will cause it to search for the build
file itself. This is even nicer than the previous solution. Just
remember to place it last, unless you give it the build filename
too. For example, if you wanted to run the clean target,
ant -emacs clean -find
To keep the actual call as simple as possible, I wrote a wrapper for
compile, and put a hook in java-mode to
change the local binding. The wrapper, ant-compile,
searches for the build file the same way -find would do.
(defunant-compile ()
"Traveling up the path, find build.xml file and run compile."
(interactive)
(with-temp-buffer
(while (and (not (file-exists-p "build.xml"))
(not (equal "/" default-directory)))
(cd ".."))
(call-interactively 'compile)))
So I can transparently keep using my muscle memory compile binding, I
set up the key binding in a hook,
Yesterday was a slightly chilly day, in the mid 50s (Fahrenheit), so
when my officemate said we were getting snow I thought she was
kidding. Well, today we woke up to this winter wonderland,
My wife leapt for her camera and filmed that as soon as she realized
what was going on. We had much more than that by late afternoon.
According to the
Baltimore Sun it has snowed on December 5th for six of the last
eight years, and we don't get much snow around here. My only wish
would be for this to have happened on a weekday. I work for a
university, and educational organizations tend to fold and give snow
days really easily. This would probably have given me a nice paid day
off!
Remember that Emacs
web server I wrote back in May? Well, I got an e-mail last night
from Chunye Wang containing a patch with a variant of my dynamic lisp
idea, called "servlets" (not to be confused with Java
servlets). Chunye had similar concept for an Emacs web server for a
long time, but never implemented because Emacs lacked network
functionality until recently (Specifically, make-network-process in Emacs 22.1, June 2007). This
led Chunye to find my implementation.
Again, you can clone/view the code here. I turned the patch into a
series of commits,
The servlets are simply functions installed under an
"httpd/" namespace, where the trailing slash represents
the server root. So, the function httpd/example-servlet
will be executed when "/example-servlet" is requested from the
server. The servlet runs on a temporary buffer, whose contents are
served when the servlet function returns.
To assist in HTML generation, Chunye also wrote a function to turn an
S-expression
into HTML, similar to the one I described in the web server previous
post. Symbols are converted into strings, alists are attributes, and
the elisp symbol indicates code to be executed, and the
results used to generate HTML. For a simple hello word,
(defunhttpd/roll-die (uri-query req uri-path)
"Rolls a die with the requested number of sides (default 6)."
(let ((sides
(1- (string-to-number (or (cadr (assoc "sides" uri-query)) "6")))))
(httpd-generate-html
'(html
(head
(title "Die Roll Servlet"))
(body
(h1 "Die Roll Servlet")
"You rolled a "
(b
(elisp (list (number-to-string (1+ (random sides)))))))))))
That one would be accessed from the browser with with
"/roll-die" or "/roll-die?sides=100".
Chunye provided some sample servlets that list the buffers, with links
that serve them up. There is also another servlet that will switch the
current buffer, which I find compelling. All of Emacs' functionality
is available to the servlet.
Now, to write a servlet that runs the Emacs psychiatrist ...
Your BitTorrent client probably has DRM in it, even if it's Free
software. Torrent files (.torrent) may contain a special,
but unofficial, "private" flag. When it does, clients are supposed to
disable decentralized tracking features, regardless of the user's
desires. This is a direct analog to the copy-prevention flag that PDFs
may have set, which tell PDF viewers to cripple themselves, except
that your Free PDF reader is actually more likely to ignore it.
It's impossible to simply open the torrent file and turn off the
flag. The client has to be modified, fixing the purposeful defect, to
ignore it. Note, simpler clients that don't have these features in the
first place don't have this problem, since they don't have any
features to disable.
The private flag exists because modern BitTorrent trackers can
function without a central tracker. If the central tracker is down, or
if the user doesn't want to use it, the client can fetch a list of
peers in the torrent from a worldwide
distributed hash table. It's one big decentralized BitTorrent
tracker (though any arbitrary data can be inserted into it). Clients
also have the ability to tell each other about peers when they are
doing their normal data exchange. Thanks to this, clients can
transcend central trackers and join the larger global torrent of
peers. It makes for healthier torrents.
Anyone who knows a few peers involved with a torrent can join in,
regardless of their ability to talk to the central tracker. But
private tracker sites don't want their torrents to be available
outside to those outside of their control, so they proposed an
addition to the BitTorrent spec for a "private" flag. Clients with
decentralized capabilities are advised cripple that ability when the
flag is on, so no peer lists will leak outside the private
tracker. This flag was never accepted into the official spec, and I
hope it never is.
Unfortunately the private trackers set an ultimatum: obey the private
flag or find your client banned. The client developers fell in line
and, and as far as I am aware, no publicly available clients will use
decentralized tracking while the flag is on. At one point, the
BitComet client ignored the flag and was banned for some time until it
was "fixed".
The private flag wasn't placed in front with the rest of the metadata
where it belonged. It's intentionally placed at the end of the torrent
file inside of the info section. This means that the flag is part of
the info_hash property of the torrent, which is the global identifier
for the torrent. Unset or remove the private flag and the hash
changes, creating a whole new torrent without any seeds.
This is DRM, an artificial restriction imposed on the user. It's
insulting. Users should be the ones that control what happens with
their computers. The reasonable approach to a private flag is that,
when the private flag is enabled, decentralized tracking is turned off
by default, but can be re-enabled by the user should they desire. That
way the desired behavior is indicated but the user has the final say,
not some unrelated website operator.
I rarely use private trackers, since they are nearly pointless, but I
still find this private flag set on public torrents, probably from
someone simply reposting the torrent file from a private site. It's
annoying to run into. It makes the torrents weaker.
Debian, which is my distribution of choice, is generally good about
removing DRM from the software it distributes. For example, the PDF
readers in the repositories have their DRM disabled (i.e. xpdf). So
why not do the same thing for all the intentionally defective
BitTorrent clients?
I went on the Debian IRC channel on
Freenode and brought up the issue only to find out that everyone
thought a little DRM was reasonable. So then I filed
a bug report on it, which was simply closed citing
that the DRM is a beneficial "feature" and that removing the
intentional defect would make the clients "poorer". They also insisted
that it's part of the spec when it's
not. I'm really disappointed in Debian now.
Now, I could modify a client to ignore the flag, but it's not
useful if I am the only one not running DRM. It takes two to tango. A
client used by many people would have to be fixed before it becomes
beneficial.
So when someone asks for an example of Free Software or Open Source
software with DRM in it, you can point to BitTorrent clients.
I started getting Asian spam in my comments in the last couple
days. If you are subscribed to the comments feed you probably noticed
this. The spammer was manually filling out captchas, so this wasn't a
bot but rather a patient human being. Getting tired of removing these,
I set up some filters to silently drop messages that fit certain
criteria. By "silently" I mean the server tells the client everything
went fine but the comment never actually gets written to the database.
The spammer gained nothing except annoying us because all links in
comments get a rel="nofollow" attribute, which tells
search engines to ignore it. That, plus small readership and
captcha-solving gives little incentive for spamming.
Well anyway, while I had my sleeves rolled up and my hands on the code
I decided to make some upgrades I have been wanting to do. The e-mail
address is no longer displayed (stupid idea in the first place) but
instead used for a Gravatar
image. You can also specify a home URL, which will be linked from your
name. This makes my comment system work very similarly to what you
find around the web, except that I don't require anything from you but
a captcha and a comment.
I also fixed a small usability issue: when you preview a comment now
it takes you right down to the form rather than leaving you at the top
of the page. The back-end database was also adjusted from the original
pollxn design to scale better as the website grows.
Now, Gravatar is a neat concept but I have two complaints. One, I
don't like centralized systems. It's a single point of failure and a
single point of control. It has privacy issues. It's anti-web. It's
inelegant. Decentralized systems built around self-enforcing protocols
are more robust and democratic.
Luckily, a decentralized version does exist! It's a specification
called Pavatar. The avatar is tied
to a URL rather than an e-mail address. However, it's a bit less
flexible, since it needs to remain simple on the server side. It's
harder to set up and I doubt 99% of the users on the web would be
capable of doing it. What would help Pavatar gain a wider audience
would be Pavatar provider services. Hmmm...
So, I think might switch it to Pavatar sometime, with a possible
fall-back to Gravatar. That takes significantly more work to set up
than Gravatar does, so it's a future project. And, well, no one uses
it yet either. I actually thought the project was dead until just now
because their website was down the first time I visited it a couple
months ago.
My second complaint is that Gravatar incorrectly assumes e-mail
addresses are not case-sensitive. The domain part is, but the alias
part is not. These two addresses could technically arrive to two
different e-mail inboxes,
chris@example.com
Chris@example.com
Pretty much every e-mail server will treat these as the same address
as a convenience, because treating these differently would just be
confusing, but it's not necessarily the case. Gravatar specifically
says to hash the e-mail in lowercase form, so the unique address
Chris@example.com can't be used with the service.
Under Gavin's suggestion, I've been
watching The
Prisoner, a 1960's British television show. The main character is
an ex-spy held prisoner in "the Village", an Orwellian, isolated,
enclosed town. No one in the Village has a name, but is instead
assigned a number. The main character's number is 6.
As far as I can tell, after number 2 the order of the numbers is not
important. Number 56 is no more important than number 12. By using
numbers to name things there is an implied ordering, even if the the
ordering is insignificant. It could be misleading to a newcomer.
Is there an unordered set could be used to name things? More
specifically, is there a set that cannot be ordered? If it is
unorderable then there is no implicit ordering to cause
confusion. It's easy to have an unorderable set in theory, but I think
it is difficult to have in practice.
Using letters is obviously out, as the alphabet has an order. Words
and names made of letters can be sorted according to the alphabet.
However, the ability to order words is almost never used outside of
indexing. If words are used to name things, a newcomer is unlikely to
assume relationships based on ordering. No one will assume Alan is
more important than Bob.
Large numbers also tend to lack an assumed order. I don't think anyone
assumes a larger or smaller social security number has meaning, or a
larger or smaller phone number. However, these values are also known
to be handed out in some semi-random way.
But can we do better? For at least English speakers, is it possible to
create an unorderable set? If the items in the set have a vocal
pronunciation, then they can probably be ordered by their
phonetics. That could be avoided by using non-standard phonetic
components, like clicks and pops, which won't have a standard ordering
(in English, anyway).
A set has an order if there is a total, transitive,
relational
operator for the set. If such an operator does not exist then the
set isn't linearly ordered. I want a set that can't easily have such
an operator.
If a set of symbols was created, how might they be presented as to
show no ordering. The order of the symbols in the original
presentation might be considered the ordering, like how the alphabet
is always presented in order. A circle could be used, but this is
circularly ordered. I think there is also the issue of memorization. A
human will have a much better time memorizing the symbols if memorized
in some order. For example, try naming all the letters of the alphabet
at random, without repeats. Or US states.
Thanks to modern day technology, with dynamic content, the set could
be displayed in a random order each time it is viewed. For a web page,
the server could select a random order, or a JavaScript program could
reorder the images at random.
There could be partially ordered sets, like hierarchies and DAGs. The
ordering in The Prisoner is one of these. There is number 1, then
number 2, then everyone else. Is there a partially ordered set in use
that has unique names at the same level?
The penalties incurred by intentionally prohibiting order would likely
outweigh the benefit of the set. If it's not orderable, we can't index
it, and it's difficult to deal with. I expect it's much easer to just
use numbers and tell people that the order isn't important, or just
use an obviously unordered set.
In the summer of 2007 my friend, Michael Abraham, and I wrote a robot
programming game in Matlab. The idea came out of a little demo he
wrote where one circle was chasing another circle. Eventually the
circles could be run by their own functions, then they were shooting
at each other, then dropping mines, and so on. It was dubbed
"Matbots".
It turned into a little programming competition among several of us at
work. Someone would invent a new feature, like cooperative bots or
bullet dodging, that would allow it to dominate for awhile. Then
someone else would build on that and come up with some other killer
feature. At one point we even hooked up a joystick so we could control
our own bot live against our creations.
I never shared that here. Someday I will.
Anyway, Mike took a little bit of the Matbots code and design and came
up with a new, similar game called SumoBots. Programmable bots are in
a frictionless circular playing area with the ability to accelerate in
any direction. Collisions between bots are governed by a spring force
dynamics. Bots are out of the game when they leave the circular
area. The goal is to knock out all the other bots. Here's a video,
I like how smooth and natural that looks.
He got a number of people at his lab to code up some bots to
compete. I threw the code up over at
bitbucket, a
Mercurial host, which you can checkout with,
(I haven't convinced anyone else over there to use a version control
repository, let alone this one, but here's hoping!)
You can use either Matlab or Octave to program your bot. It works in
both, with Octave being on the slow side. For Octave, you'll need to
define a cla function as a call to clf (they
don't define it for some reason). To start coding a bot, just look at
the samplebot bot for information on programming a
bot. Edit the players in player_list.txt, then launch it
by running engine (engine.m).
I wrote one bot so far called bully. It charges at the nearest bot,
being careful to cancel out its own velocity. In the video above this
bot is red.
The other bots so far is kyleBot, blue in the video, which passively
runs circles around the outside of the surface hoping to trick
aggressive bots, like mine, into running out. The bot2fun bot, black
in the video, is like bully, but tries to keep safely near the center
of the playing area. The brick bot, green in the video, is a dummy,
practice bot that just gets knocked around.
The tricky part in writing a bot is managing the frictionless
surface. Acceleration can't just be in the direction you want to
travel, but also against the current velocity. You might want to set
up some kind of process
control for this.
Right now I think there is a balance issue. Passive bots have an
advantage, but the game won't progress if everyone takes this
advantage and writes passive bots. It's like camping in a first-person
shooter. As a partial solution to these stalemates, Mike has proposed
that the playing surface should shrink over time.
If you want to join in, grab the code and start coding.
This website/blog turns two years old today. That's right, the first
post was on September 1, 2007. This also happens to be the 100th post!
Woo! Which means I have nearly kept up with my original one post per week planned average (104
would put me on schedule).
The site has evolved a bit in that time. Two years ago I was using
vanilla blosxom (I've since hacked it up), no comments, no portrait
(portraits increase reader trust), simpler logo, no color, no RSS (it
was there, but I wasn't linking it), crappy index, no features but
this blog and the "projects" section, no Creative Commons license (but
I've always had the other simple copyleft statement), no lisp, and
only one permalink per post. Since I started this site I lived in
three different US states, had two different jobs, bought two
computers, bought a car, adopted a cat, learned several programming
languages, graduated with a bachelor's degree, had major surgery, and
got married.
I've tried to make all my posts more than a trivial "hey, check this
link out", and instead have some real content for each one. Thanks to
this, if you concatenate all the posts together it's long enough to be
a short book (about 170 pages). And, this blog is 100% Free
Culture. All text and images are public domain or under a Free
license (and none of that non-commercial-only crap). So I -- or anyone
-- could easily publish it as a book. (Ha! As if anyone would buy it!)
And I do maintain another completely separate blog, with a
(currently) higher post frequency, under a pseudonym (so I don't get
myself into trouble), which is why I won't tell you what it is. :-)
I'm really happy I kept kept this "log" of my activities. I've had
some real misconceptions about my previous self, which the log has
cleared up. I often think I came up with ideas or had viewpoints much
more recently than I really did ("Oh, learning Scheme really
was that long ago."). I wish it went back further, because, for
example, I really am not sure at which point I became an atheist
(early 2003?) or a libertarian (summer 2003?). I tend to think of my
previous self as more stupid than I really was, which gives me this
false impression of intellectual progress, when really I am as dumb as
I ever was! I know one one thing for sure, my writing hasn't really
improved!
So, thanks for reading! If you don't already have a blog, start one
today. Keep up with it, and at two years I'm sure you'll be really
glad you have it.
This is related to a project I am working on and will post here
soon. I imagine that, with a little more effort, this algorithm could
turn into a short amateur paper.
Suppose you want to use a computer to simulate the roll of two
six-sided dice (notated 2d6). The simplest approach would
be to replicate the results the same way you would roll dice:
independently and randomly generate two numbers between 1 and 6
inclusively. We easily can do this for any number of dice, we just
iterate and roll each die. Like this recursive function,
(defunroll (n sides)
"Roll n dice with given number of sides."
(if (zerop n) ()
(cons (1+ (random sides))
(roll (1- n) sides))))
However, generating a number between 1 and 6 wastes small amounts of
entropy. A six-sided die only takes about 2.58 bits of entropy to
generate. Since we can only use bits discretely we have to spend 3
bits, throwing out 0.42 bits. On top of that, when we pull out 3 bits
and they are out of range (0 or 7) we have to throw them out and try
again.
Let's say we wanted to roll 10 dice, or 100 dice, or 1000 dice? Do we
really need to generate that many numbers individually? That's a lot
of wasted entropy adding up, entropy which can be expensive to
gather. Well, we could instead use the probability distribution of
the roll so that only a single number needs to be generated.
For a 2d6 roll, there are 36 unique possible outcomes
(6^2). We could select a number between 0 and 35, then choose that
specific roll. This roll can be calculated with a series of division
and modulus operations (u for a number from a
uniform distribution) (also, note that the division is
integer division),
(defunroll-perm (n s u)
"Get uth permutation of n s-sided dice."
(if (zerop n) ()
(let ((perms (expt s (1- n))))
(cons (1+ (/ u perms))
(roll-perm (1- n) s (mod u perms))))))
If we're only interested in the sum, we could save memory by making
this tail recursive -- or iterative -- and summing the dice as we
calculate them. Ignoring the exponent, this is O(n), not
better than the simple algorithm in terms of growth rate. This
algorithm is more efficient when it comes to entropy, though.
Consider 3d6, with 216 possible outcomes, ideally with
the simple algorithm takes 3 3-bit rolls, consuming 9 bits. About 1.25
bits was not actually used (0.42 * 3). In the entropy-efficient
algorithm we need about 7.75 bits, so it only consumes 8 bits of
entropy. We saved a bit. That gap only gets larger with more dice. For
100d6 the simple algorithm uses 41 more bits than
necessary.
The efficient roll is basically defragmenting the individual rolls on
the entropy stream.
In non-ideal world, though, some cases don't work out well. In
12d6, almost half the numbers (compared to 25% in the
case of 1d6) from the uniform distribution will be out of range and a
lot more bits would be needed. On average, rolling dice individually
(or only some of them individually) for 12d6 will
be more efficient.
The efficient algorithm is only more efficient above a point near
where mod(log2(s), 1) < mod(log2(n^s), 1).
And, all of this doesn't come without a cost. You must pay the piper,
and this algorithm is paid with CPU and memory. Notice that exponent
there? That has to be done to exact precision (no floating point), and
it grows very quickly. If you want to roll more than a handful of
dice, you will be crunching some large numbers. Rolling just
100d6 means you have to work with a 78 digit
integer. 10000d6 is a 7782 digit integer. These can't be
done in floating point because the resolution of floating point is too
low: some rolls would not be possible.
The exponent could be memoized to
trade some of that CPU time for more memory usage. Still, pretty
costly. If you don't value your entropy, the tradeoff might not be
worth it.
I can't see a way around performing that calculation. We need to know
that big number exactly. Perhaps a mathematician might be able to
manipulate the formulas such that it's not so expensive.
If you're rolling lots of dice and you want to preserve binary
entropy, try it out. If you want to be really efficient queue up rolls
-- or generate them ahead of time -- so that the number of outcomes is
just below a power of two. In the case of d6, some good
number of dice to roll are 17 (~43.94 bits), 29 (~74.96), 41
(~105.983), 94 (~242.986 bits), 200 (~516.993), 253 (~653.995 bits),
306 (~791.99853 bits), and 971 (~2509.99859 bits). (Notice these get
closer and closer to an integer number of bits.)
I've been using lossless optimizers for awhile now for PNGs, but more
recently I have found some for other formats. Here's the ones I know
about. These are all intended to be lossless, so running them should
result in no information loss (well, except the SVG one).
For PNG, there are a number of choices, but my favorite is OptiPNG. It adjust the PNG
parameters and recompresses to find the optimal parameters. I run it
on almost all my images around here, and I tend to get around 10% to
30% reduction for images fresh off
Gimp,
Kolourpaint, and
ImageMagick.
For JPEG, I use jpegoptim. It
works by optimizing the Huffman tables (the lossless part of JPEG
compression). I only found this one recently, but I will be using it
all the time, like on our new thousands of wedding reception photos.
For PDF, I found something called QPDF. It's designed more for
other PDF transformations, but without any other parameters it will
simply losslessly optimize a PDF. From what I've seen so far it cuts
PDFs down by about a third.
For SVG, Scour is a
young project, only a few months old. I've been looking for an SVG
optimizer for some time, so this was exciting to find. Due to the type
of file it's working with, it's not quite entirely lossless. Visually,
it is lossless, but it will toss all metadata (comments, etc.), which
may be important. If you hand-crafted your SVG, you won't want to use
this tool. It's good for removing
Inkscape and Illustrator cruft, though.
I have yet to find a good (Free) GIF optimizer. Animated GIFs, with
lots of redundancy between frames, have a lot of potential for
optimization too. A video optimizer (for, say, Theora) would be
interesting; I imagine it might work similarly to jpegoptim. Audio
files (like Vorbis, FLAC, or MP3) probably don't have any room for
optimization. I could be wrong. For XHTML there is tidy if you want to
count that. All the other XML formats (ODF, RSS, etc.) could have
their own too. Or optimizers for archives, like zip and tar. For tar
it might rearrange things to better suit gzip, bzip2, or
lzma. Executable optimizers? Postscript optimizers? It goes on and on.
If you know about any more, especially for other file formats, let me
know.
A year ago while I was reading Applied
Cryptography I got an idea for using IRC as a random number
generator. The book mentioned using a rolling car wheel to generate
random numbers, by measuring its period. So why not IRC? Grab the
code,
It's based entirely on event timing: it logs in and sits on several
channels, then measures the time between events. When a new event
occurs it compares the time passed between this event and the last
event, and the time between the two events before it. If it's greater,
emit a 1, if less emit a 0. Simple.
For skew removal I used transition mappings
(rfc1750), invented by von Neumann. It looks at pairs of bits. If they
differ, pass the first bit, otherwise toss both bits. So if it comes
across "11" or "00", it tosses them. If it comes across "10" it emits
"1", and "01" it emits 0.
Here's an analysis by ent of 476 output bytes I generated overnight: irc.random.bytes.
Entropy = 7.515132 bits per byte.
Optimum compression would reduce the size
of this 476 byte file by 6 percent.
Chi square distribution for 476 samples is 274.79, and randomly
would exceed this value 25.00 percent of the times.
Arithmetic mean value of data bytes is 127.7248 (127.5 = random).
Monte Carlo value for Pi is 2.886075949 (error 8.13 percent).
Serial correlation coefficient is -0.062745 (totally uncorrelated = 0.0).
Eh, not awful, but not too great. It got a good score on the
Chi-square test, which I attribute to the skew filter. This generator
is also extremely slow, generating only a few bytes per hour. At
best, each event only generates a half of a bit, after skew
correction. It would probably be more profitable to take the hash of
the actual event with a timestamp and use that as the RNG.
Also, someone else running this generator on the same channels would
generate very similar numbers. Worse, someone eavesdropping on your
network connection knows mostly what numbers you generated. Worse yet,
someone actively modifying your connection could control your
generator and force whatever output is desired.
I was reading through the HTML 5
specification earlier tonight and I was delighted to come across an
obvious reference to the Flying
Spaghetti Monster, complete with notes about His "noodly
appendages". I'm guessing that someone silently slipped it in there
and the people that might be offended won't get the reference
anyway.
In the following example, a picture representing the flying
spaghetti monster emblem, with each of the left
noodly appendages and the right noodly appendages in
different images, so that the user can pick the left side or the
right side in an adventure.
For those who are unaware, it is the deity in a parody religion
Pastafarianism. It's often used to demonstrate how silly religion is.
Web pages aren't a static medium, like books, brochures, or
pamphlets.
The web is not print. Accordingly, the layout of web pages should
not be locked to some static width, but instead flow to fill the width
of the browser like a liquid. Web pages should normally have a
liquid layout.
One of the most obvious problems with the fixed layout occurs when the
browser window is stretched wider than the designer had intended. The
page looks like this,
I, as a user, have little control my viewing of the website. I'm stuck
reading through a keyhole. It gets much worse if the browser isn't as
wide as the designer intended: a horizontal scrollbar appears and
navigation becomes very difficult. My laptop runs at a resolution of
1024x768, and I frequently come across pages where this is an
issue. And according to Jakob Nielsen, in 2006 77% of
user's screens were 1024 pixels wide or less.
See the liquid for yourself right here: adjust the width of your
browser and watch this text flow to fill the screen. You can also
bring it in pretty far before you clip an image and the horizontal
scrollbar appears. The exact width depends only on the widest image
being displayed. This also comes into play if you adjust the font
size.
Using a liquid layout
allows the page to work well with a wide variety of screen widths,
and most importantly, gives users lots of control over how they view
the site. It's very unfortunate that (in my experience) most websites
employ a poor, fixed layout. Even web design "expert" websites will
ironically hand out web design tips from within these annoying
confines. One of the biggest culprits driving this is Wordpress, which
has this flawed layout by default.
The very worst offenders tend to be websites with little actual
content, like corporate websites or "artist" portfolios. The less
usable the page, the less I wanted to be there anyway.
So please drop the fancy, low-usability web designs for
something with much better usability. Your users will probably
appreciate it.
I use Adblock Plus to block
advertisements and, more importantly, invisible privacy-breaking
trackers (most people aren't even aware of these). I think ad-blocking
is actually easier than ever, because ads are served from a relatively
small number of domains, rather than from the websites
themselves. Instead of patterns matching parts of a path, I can just
block domains.
Adblock Plus emphasizes this by providing, by default, a pattern
matching the server root. Example,
http://ads.example.com/*
But sometimes advertising websites are trickier, and their sub-domain
is a fairly unique string,
http://ldp38fm.example.com/*
That pattern isn't very useful. I want something more like,
http://*.example.com/*
Unfortunately Adblock Plus doesn't provide this pattern automatically
yet, so I have to do it manually. I think this pattern is less obvious
because the URL format is actually broken. Notice have have two
matching globs (*) rather than just one, even though I am simply
blocking everything under a certain level.
Tim Berners-Lee
regrets the format of the URL, and I agree with him. This is what
URLs like http://ads.example.com/spy-tracker.jsshould look like,
http://com/example/ads/spy-tracker.js
It's a single coherent hierarchy with each level in order. This makes
so much more sense! If I wanted to block example.com and all it's
sub-domains, the pattern is much simpler and less error prone,
http://com/example/*
To anyone who ever reinvents the web: please get it right next time.
Update: There is significant further discussion in the comments.
As noted elsewhere here, I grew up in Pittsburgh, Pennsylvania. Back
in February 2001 I got to witness Pittsburgh's Three Rivers
Stadium demolition. My dad brought our old 1980's video camera to
film it, and afterwards we dumped the recording to the computer. And
so I present another low-quality video,
I put this up on YouTube in September 2006, and as of this writing it
has gotten about 28,000 views. Now that I finally put it here, it's
clearly available under any of the licenses at the bottom of this
page.
We were
standing nearly underneath the Fort Pitt Bridge (seen at the top
of the video) facing directly north, next to some railroad tracks. You
can see all three rivers: Ohio river to the left, Monongahela river to
the right, and the Allegheny in the top middle behind the Point.
That thick dust cloud you see at the end rose up and engulfed the
whole city, overtaking us as we headed back to the car. So I got a
little bit of history in my lungs. If you're some distant-future
archaeologist reading this maybe you can retrieve some particles from
my decomposed rib cage. Put them in a museum or something.
First, let me note that I don't watch television. At least not in the
sense of sitting on the couch, turning it on, and flipping through the
stations. I can't stand the compressed audio, the constant, loud
commercial interruptions, and general lack of control over my
viewing. VCRs, and more recently PVRs, have mitigated these last two
points, but not enough to grab my interest.
The way I see it, there are four ways to access television. Here is
the matrix,
For an "acceptable" situation we have cost-free television, but with
advertising, in broadcast and streaming television. And in the
opposite "acceptable" situation we have ad-free television, but with a
monthly fee, in premium television. I think these two are acceptable
compromises. Someone else can foot the bill, or you can foot the bill.
In a few cases, such as viewer-supported television like PBS, it's
both cost-free and ad-free. This is pretty nice. You can have your
cake and eat it too.
However, most television is only legitimately available in the
worst case situation! Not only do you have to pay to access it, but
one-third of it is annoying, unwanted advertising. This is awful, and
it is one reason why I choose not to participate.
Luckily, there is another "best case" option which provides quick
access to most television shows of the world: peer-to-peer
file-sharing. Unfortunately, it doesn't include live television, and
it's usually not quite legal. We have the technology to distribute
large amounts of data to huge numbers of people at practically no
cost, but a bunch of old, out-of-date laws stand in the way. It's a
shame. I think this
quote by "muuh-gnu" sums it up well,
We have 2009. Everybody and their dog has a computer, which is
designed to copy stuff. Also we have broadband which is, again,
designed to ... move stuff around the world. So is what you're
actually pointlessly advocating is that we collectively should
... actually what? Abstain from using a common technology in
order to make absurdly archaic 50's business models of
"manufacturing and selling single copies" viable in day and age
when everybody can manufacture and distribute those copies
themselves?
It's a good thing some bad laws don't get in the way of progress
too much.
This was almost 6 years ago now. In my freshman year of college, three
of us (Matt Stine and Matt Takach) designed and assembled a dry ice
powered potato gun. We got the idea for using dry ice after purchasing
some from the famed
Penn State Creamery. They sold it for 50 cents per pound, no
questions asked. The purpose was for keeping ice cream cold over a
long car trip, but they'd sell it to you even if you didn't buy ice
cream.
I don't recommend doing this yourself unless you are going to
take extreme caution. Like, more caution than we did. We weren't too
careful ourselves and were really lucky no one got hurt. The dry ice
"bombs" can be very powerful, enough to shatter PVC tubing. We did
bring safety glasses with us (but they went mostly unused). We also
did this in cold weather, which made things a bit safer.
On to the details, here is a diagram I drew up at the time,
It's just glued PVC tubes and a nail. The bottom of the "blast
chamber" had a screw-on cap, which kept getting blown to
smithereens. We bought several of those. The original plan also had
some wires, connected to some kind of high-resistance wire, so we
could control when the gun would fire. That didn't work out.
The "power cell" was a soda bottle with water and dry ice in it. Some
brands worked better than others. We had to smash the dry ice into
little pieces to get them in the bottle. Dry ice goes in first, then
the water, then the cap goes on. After that it's a ticking time bomb,
and you want to load it into the gun ASAP. In warm weather you have,
maybe, a few seconds. In cold weather it could be 30 minutes.
Yeah, you don't want that in your hand when it explodes.
Once the bottle explodes, there is a lot of pressure behind that
potato, so it gets quickly pushed down and out of the barrel. As I
said, in that cold weather this could take about 30 minutes. In the
video above we were adding warm water to speed up the process.
When we fired a golf ball out of the gun straight up in the air, it
took about 10 seconds to hit the ground. This put it at maybe 50 m/s
(110 mph) barrel exit speed. When launching into the field, we found
the potatoes between 150 and 250 meters.
At night we stuck glow sticks in the potatoes so we could watch their
entire trajectory. Pretty cool.
Some campus "security" person stopped us on our way home, since
carrying a muddy pipe looks suspicious. We told her it was just a
potato "cannon", which she told us "wasn't allowed on campus" (yeah,
yeah). That's time to just walk away.
Here are some pictures of the construction, and my excessive beard,
What stands out for me is that the patterns can easily be represented
as a S-expression. We represent substitutions with symbols, literals
with strings, and groups with lists. For example, this pattern,
s(ith|<'C>)V
can be represented in code as,
(s ("ith" ("'" C)) V)
I want a function I can apply to this to generate a name. First, I set
up an association list with symbols and its replacements,
(defvarnamegen-subs
'((s ach ack ad age ald ale an ang ar ard as ash at ath augh
aw ban bel bur cer cha che dan dar del den dra dyn
ech eld elm em en end eng enth er ess est et gar gha
hat hin hon ia ight ild im ina ine ing ir is iss it
kal kel kim kin ler lor lye mor mos nal ny nys old om
on or orm os ough per pol qua que rad rak ran ray ril
ris rod roth ryn sam say ser shy skel sul tai tan tas
ther tia tin ton tor tur um und unt urn usk ust ver
ves vor war wor yer)
(v a e i o u y)
...
(d elch idiot ob og ok olph olt omph ong onk oo oob oof oog
ook ooz org ork orm oron ub uck ug ulf ult um umb ump umph
un unb ung unk unph unt uzz))
"Substitutions for the name generator.")
Since we will need this in a couple places, make a function to
randomly select an element from a list,
(defunrandth (lst)
"Select random element from the given list."
(nth (random (length lst)) lst))
A function for replacing a symbol,
(defunnamegen-select (sym)
"Select a replacement for the given symbol."
(if (null (assoc sym namegen-subs))
(throw 'bad-symbol
(concat "Invalid substitution symbol: " (format "%s" sym)))
(symbol-name (randth (cdr (assoc sym namegen-subs))))))
And finally, the generator. Find a string, pass it through, find a
symbol, substitute it, find a list, pick one element and recurse on
it.
(defunnamegen (sexp)
"Generate a name from the given sexp generator."
(cond
((null sexp) "")
((stringp sexp) sexp)
((symbolp sexp) (namegen-select sexp))
((listp sexp)
(concat (if (listp (car sexp)) (namegen (randth (car sexp)))
(namegen (car sexp)))
(namegen (cdr sexp))))))
That's it! We can apply it to the expression above,
(namegen '(s ("ith" ("'" C)) V))
-> "rynithi"
But that's really the easy part. The hard part would be converting the
original pattern into the S-expression, which I don't plan on doing
right now.
Something else to note: this is thousands of times faster than the
Perl version I wrote earlier.
I threw the code in with the rest of my name generation code
(namegen.el),
I converted this website to HTML 5 from its original XHTML 1.0
Transitional. Because I was already valid, it was actually pretty
easy. Well, I first converted it to XHTML 1.0 Strict by shifting some
things into CSS that should have been that way in the first place
(results of laziness). Then when that validated I just changed the
DOCTYPE to HTML 5, added a new meta tag, checked
that validation and that was it. Really simple.
One of the main reasons was so that I could use the new
video tag. From now on any posts with videos will display
them as in-line videos (Ogg Vorbis), provided you are using a browser
with HTML 5 support. In fact, I already did it with two older posts:
Knight's Tour and the Mandelbrot set zoom videos. Here, check it
out,
HTML 5 is still a moving target, so I may have to do some
keeping up over time. Probably not.
Video is finally a first class web citizen and it is quite exciting.
All but one of the major browsers do special URL mangling as a
"feature", but I think of it as bad behavior. Specifically, Firefox
gets it right while IE, Opera, Safari, and Chrome all do it wrong. The
last three are likely replicating the broken behavior of IE. I would
rather not do this.
I ran into this specific issue work. An IT person had sent out a URL
to an important internal website, which contained links to other
resources. However, this person, despite being in IT, wasn't very
tech-savvy and wrote all of the links with backslashes instead of
slashes. When I followed with Firefox I got a 404 Not
Found because Firefox asked for a filename with a backslash in
it -- proper behavior.
As first I thought the page was out of date with dead links, but it
was obviously a brand new page. That's when I noticed that the links
had the wrong slashes. A quick check with IE showed the links as
working, which is why the IT person didn't notice his error. IE was
automatically, and silently, converting the backslashes to slashes.
According to RFC
1738, all URLs must encode their backslashes because of buggy
transport agents (and, years later, because of buggy browsers like
IE). So, if you want to use a backslash in a URL you are supposed to
encode it anyway, but if you don't most major browsers will
silently mangle the URL. I really, really don't like this. I know
what I am doing, Mr. Browser.
It's done to replicate IE behavior, and IE did it because it's a
common error for Windows users to use backslashes when they shouldn't
(Windows directory separators), so it is anticipating user error (and
assuming it knows best!). Let's not cripple our software to bring it
down the same level as the major shortcomings of Windows, the toy OS.
I wrote a little test suite for testing if your browser does this
annoying mangling. You can try it with your browser below,
Other than the browsers listed above, the only other browser I know
that fails is Links (not
Lynx). Note that Opera passes the "automatic" test, but fails the
"manual" test, because it doesn't mangle the http-equiv
URLs used in the automatic tests.
So beware using the browsers I listed at the beginning. Not only may
they be sending you to the wrong URL, but they are crippling the web
for the sake of Microsoft's errors.
Did you know that Emacs comes with a
calculator? Woop-dee-doo! Call the presses! Wow, a whole
calculator! Sounds a bit lame, right?
Actually, it's much more than just a simple calculator. It's a computer
algebra system! It is officially called a calculator, which isn't
fair. It's an understatement, and I am sure has caused many people to
overlook it. I finally ran into it during a thorough (re)reading of
the Emacs manuals and almost skipped over it myself.
Ever see that demonstration by Will Wright for the game Spore
several years ago? The player starts as a single-cell organism and
evolves into a civilization with interstellar presence. When he
started the demo he showed a cell through what looked like a
microscope. No one had any idea yet what the game was about, so every
time he increased the scope, from bacteria to animal, animal to
civilization, civilization to space travel, interplanetary travel to
interstellar travel, there was a huge reaction from the audience. It
was like those infomercials: "But that's not all!!!"
As I made my way through the Emacs calc manual I was continually
amazed by its power, with a similar constant increase in scope. Each
new page was almost saying, "But that's not all!!!" Like an
infomercial I'm going to run through some of its features. See the
calc manual for a real thorough introduction. It has practice
exercises that shows some gotchas and interesting feature
interactions.
Fire it up with C-x * c or M-x calc. There
will be two new windows (Emacs windows, that is), one with the
calculator and the other with usage history (the "trail").
First of all, the calculator operates on a stack and so its basic use
is done with RPN. The stack builds vertically, downwards. Type in
numbers and hit enter to push them onto the stack. Operators can be
typed right after the number, so no need to hit enter all the
time. Because negative (-) is reserved for subtraction an
underscore _ is used to type a negative number. An
example stack with 3, 4, and 10,
3: 3
2: 4
1: 10
.
10 is at the "top" of the stack (indicated by the "1:"), so if we type
a * the top two elements are multiplied. Like so,
2: 3
1: 40
.
The calculator has no limitations on the size of integers, so you work
with large numbers without losing precision. For example, we'll
take 2^200.
But that's not all!!! It has a complex number type, which is entered
in pairs (real, imaginary) with parenthesis. They can be operated on
like any other number. Take -1 + 2i minus 4 +
2i,
2: (-1, 2)
1: (4, 2)
.
Subtract with -,
1: -5
.
Then take the square root of that using Q, the square
root function.
1: (0., 2.2360679775)
.
We can set the calculator's precision with p. The default
is 12 places, showing here 1 / 7.
Numbers can be displayed in various notations, too, like fixed-point,
scientific notation, and engineering notation. It will switch between
these without losing any information (the stored form is separate from
the displayed form).
But that's not all!!! We can represent rational numbers precisely with
ratios. These are entered with a :. Push
on 1/7, 3/14, and 17/29,
3: 1:7
2: 3:13
1: 17:29
.
And multiply them all together, which displays in the lowest form,
1: 51:2842
.
There is a mode for working in these automatically.
But that's not all!!! We can change the radix. To enter a number with
a different radix, which prefix it with the radix and a
#. Here is how we enter 29 in base-2,
2#11101
We can change the display radix with d r. With 29 on the
stack, here's base-4,
1: 4#131
.
Base-16,
1: 16#1D
.
Base-36,
1: 36#T
.
But that's not all!!! We can enter algebraic expressions onto the
stack with apostrophe, '. Symbols can be entered as part
of the expression. Note: these expressions are not entered in RPN.
1: a^3 + a^2 b / c d - a / b
.
There is a "big" mode (d B) for easier reading,
2
3 a b a
1: a + ---- - -
c d b
.
We can assign values to variables to have the expression evaluated. If
we assign a to 10 and use the "evaluates-to" operator,
2
3 a b a 100 b 10
1: a + ---- - - => 1000 + ----- - --
c d b c d b
.
But that's not all!!! There is a vector type for working with vectors
and matrices and doing linear algebra. They are entered with
brackets, [].
Any matrix and vector operator you could probably think of is
available, including map and reduce (and you can define your own
expression to apply).
We can use this to solve a linear system. Find x
and y in terms of a and b,
x + a y = 6
x + b y = 10
Enter it (note we are using symbols),
2: [6, 10]
1: [ [ 1, a ]
[ 1, b ] ]
.
And divide,
4 a 4
1: [6 + -----, -----]
a - b b - a
.
But that's not all!!! We can create graphs if GNU Plot is
installed. We can give it two vectors, or an algebraic
expression. This plot of sin(x) and x cos(x)
was made with just a few keystrokes,
But that's not all!!! There is an HMS type for handling times and
angles. For 2 hours, 30 minutes, and 4 seconds, and some others,
3: 2@ 30' 4"
2: 4@ 22' 13"
1: 1@ 2' 56"
.
Of course, the normal operators work as expected. We can add them all up,
1: 7@ 55' 13"
.
We can convert between this and radians, and degrees, and so on.
But that's not all!!! The calculator also has a date type, entered
inside angled brackets, <> (in algebra entry
mode). It is really flexible on input dates. We can insert the current
date with t N.
1: <6:59:34pm Tue Jun 23, 2009>
.
If we add numbers they are treated as days. Add 4,
1: <6:59:34pm Sat Jun 27, 2009>
.
It works with the HMS format from before too. Subtract 2@ 3'
15".
1: <4:56:32pm Sat Jun 27, 2009>
.
But that's not all!!! There is a modulo form for performing modulo
arithmetic. For example, 17 mod 24,
1: 17 mod 24
.
Add 10,
1: 3 mod 24
.
This is most useful for forms such as n^p mod M, which
this will handle efficiently. For example, 3^100000 mod
24. The naive way would be to find 3^100000 first,
then take the modulus. This involves a computationally expensive
middle step of calculating 3^100000, a huge number. The
modulo form does it smarter.
But that's not all!!! The calculator can do unit conversions. The
version of Emacs (22.3.1) I am typing in right now knows about 159
different units. For example, I push 65 mph onto the stack,
1: 65 mph
.
Convert to meters per second with u c,
1: 29.0576 m / s
.
It is flexible about mixing type of units. For example, I enter 3
cubic meters,
3
1: 3 m
.
I can convert to gallons,
1: 792.516157074 gal
.
I work in a lab without Internet access during the day, so when I need
to do various conversions Emacs is indispensable.
The speed of light is also a unit. I can enter 1 c and
convert to meters per second,
1: 299792458 m / s
.
But that's not all!!! As I said, it's a computer algebra system so it
understands symbolic math. Remember those algebraic expressions from
before? I can operate on those. Let's push some expressions onto the
stack,
3: ln(x)
2 a x
2: a x + --- + c
b
1: y + c
.
Multiply the top two, then add the third,
2 a x
1: ln(x) + (a x + --- + c) (y + c)
b
.
Expand with a x, then simplify with a s,
2 a x y 2 a c x 2
1: ln(x) + a y x + ----- + c y + a c x + ----- + c
b b
.
Now, one of the coolest features: calculus. Differentiate with respect
to x, with a d,
1 a y a c
1: - + 2 a y x + --- + 2 a c x + ---
x b b
.
Or undo that and integrate it,
3 2 3 2
a y x a x y a c x a c x 2
1: x ln(x) - x + ------ + ------ + c x y + ------ + ------ + x c
3 2 b 3 2 b
.
That's just awesome! That's a text editor ... doing calculus!
So, that was most of the main features. It was kind of exhausting
going through all of that, and I am only scratching the surface of
what the calculator can do.
Naturally, it can be extended with some elisp. It provides a
defmath macro specifically for this.
I bet (hope?) someday it will have a functions for doing Laplace and
Fourier transforms.
Awhile ago I wanted to
find every Hamiltonian path in the contiguous 48 states. That is,
trips that visit each state exactly once. Writing a program to search
for Hamiltonian paths is easy ( I did this
already). The most time consuming part was actually putting
together the data that specified the graph to be searched. I hope
someone somewhere finds it useful. Here is a map for reference,
It took me several passes before I stopped finding errors. I
think I have it all right now, but there could still be some
mistakes. If you see one, leave a comment and I'll fix it here. Here
is the graph as an S-expression
alist; the car (first) element in each list is a state, and the
cdr (rest) is the unordered list of states that can be reached from
it.
((me nh)
(nh vt ma me)
(vt ny ma nh)
(ma ri ct ny nh vt)
(ny pa nj ma ct vt)
(ri ma ct)
(ct ri ma ny)
(nj pa ny de)
(de md pa nj)
(pa nj ny de md wv oh)
(md pa de va wv)
(va md wv ky tn nc)
(nc va tn ga sc)
(sc nc ga)
(ga fl sc al nc tn)
(al ms fl ga tn)
(ms la ar tn al)
(tn ms al ga nc va ky mo ar)
(ky wv va tn mo il in oh)
(wv md pa oh ky va)
(oh pa wv ky in mi)
(fl al ga)
(mi wi oh in)
(wi mn ia il mi)
(il in ky mo ia wi)
(in oh ky il mi)
(mo il ky tn ar ok ks ne ia)
(ar mo tn ms la tx ok)
(la ms ar tx)
(tx ok nm ar la)
(ok ks mo ar tx nm co)
(ks ok co ne mo)
(ne sd ia mo ks co wy)
(sd nd mn ia ne wy mt)
(nd mt sd mn)
(ia ne mo il wi mn sd)
(mn wi ia sd nd)
(mt id wy sd nd)
(wy id ut co ne sd mt)
(co ne ks ok nm ut wy)
(nm co ok tx az)
(az nm ut ca nv)
(ut nv id wy co az)
(id mt wy ut nv or wa)
(wa or id)
(or wa id nv ca)
(nv or id ut az ca)
(ca az nv or))
Note that all paths must start or end in Maine because it connects to
only one other state.
I'm not the first to come across this idea: the browser could be used
as part of a distributed computing system. A web server hands out
Javascript and the browser runs the script and reports the results
back to the server. The browser is a portable, widely available
platform so just about anyone can easily contribute, possibly without
even knowing.
Browsers aren't really expecting this sort of thing. They will
complain if a script is running for too long. If you tell Firefox to
continue running the script anyway, it will lock up until the script
is done (or when it complains again). This can be worked around by writing a simple
scheduler with setTimeout().
This could also potentially be used as an alternative to
advertising. Instead of selling advertising space, a website operator
could sell visitor's CPU time by including a little snippet of
code. This may be more successful, because most visitors would be
unaware of it, making it less intrusive. It will be less likely to be
blocked. Of course, there ethical issues about this. In fact, there is already a
company doing this with secret Java applets.
There are two serious constraints on using Javascript in a browser as
a distributed computing platform:
Low bandwidth. There isn't a lot of opportunity to transfer data
between the server and the node, and nodes can't talk to
each other. The data needed by a node must be small. The results data
must also be small.
Short computational units. The Javascript in the browser has no
way to store its running state between browser sessions, so it must
rely on the server for this. This means that the units of work must be
able to be completed within a short period of time. A few minutes at
the most on a normal computer.
A lot of problems won't fit inside these limitations. One that I
thought might was a Mersenne prime
search. A Mersenne prime is a prime of the form,
2n - 1
So even though the largest known Mersenne prime has nearly 13 million
digits (about 5 MB just to store the entire number), it can be
described by it's exponent, 43,112,609, which is small
enough to fit in a 4-byte integer. The result of a calculation, a
probabilistic primality test, is a "yes" or "no". One bit. It fits the
first constraint very well.
However, the smallest amount of work a node can do is an entire
primality test. If we break it down any further, the prime will have
been expanded and we will not fit the first constraint. There will be
too much data. To see how possible it might be, I implemented it,
which you can try out here,
I modified an
existing Javascript bigint library which allowed me to get it up
running quickly. After you receive the page, your browser will run the
Miller-Rabin primality test on 2^9941 - 1. You can
edit the source HTML to try a different exponent. Running it on
several of my computers on different browsers it took anywhere from an
hour to 8 hours. And that's only with an exponent of
9941. It's an unsuitable problem.
It would be neat to see a browser computing grid in action, but I
can't think of a problem to solve with it.
If you look at the page sources around here you might notice that
there are no bare e-mail addresses around. This is because I obfuscate
them into a series of HTML entities. So far this has been pretty
effective at hiding from address-scraping, web-crawling spam
bots. They don't seem to try very hard at decoding HTML entities.
When I added the comment system, I needed to obfuscate addresses
automatically. I quickly realized that this is yet another perl
one-liner (and implemented as one line in the comment system). It can
be used on the command line to obfuscate a file/pipe containing a list
of addresses.
Update: It looks like all these wishes, except the last one,
may actually be coming
true!
Guile can run Elisp better than Emacs! The idea is that the Elisp
engine is replaced with Guile -- the GNU project's Scheme
implementation designed to be used as an extension language -- and
written in Scheme is an Elisp compiler that targets Guile's VM. The
extension language of Emacs then becomes Scheme, but Emacs is still
able to run all the old Elisp code. At the same time Elisp itself,
which I'm sure many people will continue to use, gets an upgrade of
arbitrary precision, closures, and better performance.
I've been using elisp a lot lately, but unfortunately it's missing a
lot of features that one would find in a more standard lisp. The
following are some features I wish elisp had. Many of these could be
fit into a generic "be more like Scheme or Common Lisp". Some of these
features would break the existing mountain of elisp code out there,
requiring a massive rewrite, which is likely the main reason they are
being held back.
Closures, and maybe continuations. Closures are one of the
features I miss the most when writing elisp. They would allow the
implementation of Scheme-style lazy evaluation with delay
and force, among other neat tools. Continuations would
just be a neat thing to have, though they come with a performance
penalty.
Closures would also pretty much require Emacs switch to lexical
scoping.
Arbitrary precision. Really, any higher order language's
numbers should be bignums. Emacs 22 does come with the Calc
package which provides arbitrary precision via
defmath. Perl does something like this with the bignum
module.
Packages/namespaces. Without namespaces all of the Emacs
packages prefix their functions and variables with its name
(i.e. dired-). Some real namespaces would be useful for
large projects.
C interface. This is something GNU Emacs will never have
because Richard Stallman considers Emacs shared libraries support to
be
a GPL threat. If Emacs could be dynamically extended some useful
libraries could be linked in and exposed to elisp.
Concurrency. If some elisp is being executed Emacs will lock
up. This is a particular problem for Gnus. Again, Emacs would really
need to switch to lexical scoping before this could happen. Threading
would be nice.
Speed. Emacs lisp is pretty slow, even when compiled. Lexical
scoping would help with performance (compile time vs. run time
binding).
Regex type. I mention this last because I think this would be
really cool, and I am not aware of any other lisps that do it. Emacs
does regular expressions with strings, which is silly and
cumbersome. Backslashes need extra escaping, for example. Instead, I
would rather have a regex type like Perl and Javascript have. So
instead of,
(string-match "\\w[0-9]+" "foo525")
we have,
(string-match /\w[0-9]+/ "foo525")
Naturally there would be a regexpp predicate for checking
its type. There could also be a function for compiling a regexp from a
string into a regexp object. As a bonus, I would also like to use it
directly as a function,
(/\w[0-9]+/ "foo525")
I think a regexp price would really give elisp an edge, and would be
entirely appropriate for a text editor. It could also be done without
breaking anything (keep string-style regexp support).
I wanted an elisp macro that could measure the running time of a block
of code. Specifically, I wanted it to work like this,
(measure-time
...
body
...)
And it would return the running time as seconds in floating
point. Well, here's a macro that does it!
;; ID: 6a3f3d99-f0da-329a-c01c-bb6b868f3239
(defmacromeasure-time (&rest body)
"Measure and return the running time of the code block."
(declare (indent defun))
(let ((start (make-symbol "start")))
`(let ((,start (float-time)))
,@body
(- (float-time) ,start))))
It's only good for up to around 18 hours, then the time integer
overflows. If only Emacs had arbitrary precision numbers. Here it is
in action using my binomial function from
last week.
(measure-time
(nck 20 10)
(nck 30 7))
Which, just now, returned 3.643713 seconds when executed.
A Knight's
Tour is a path on a chess board such that a knight moving
according to the rules of chess will visit each square exactly
once. Here is a program that solves the Knight's Tour for various
board sizes.
I wrote it purely for speed, and so in cases where the algorithm
doesn't blow up (taking possibly thousands of years to complete) it
only takes a few milliseconds to complete. The algorithm is recursive
and goes like so,
Given a board marked with visited spaces and a position create a
list of valid moves.
If the list is empty, return failure.
Sort of the list ascendingly according to the opportunities of
the target space. That is, sort by the number of valid moves
that are available from the target space.
Mark the current space as visited, then recurse from the new
position with the new board starting with the first target in
the list.
If the recursion returns success, return success.
If the recursion returns failure, recurse with the next target
in the list, and so on. If all the targets on the list have been
tried, return failure.
The purpose of going towards targets with few opportunities is so that
we don't leave any stranded spaces that will not be detected until
much later. When it does run into a dead end, it usually only
has to back out a short ways and try again.
The algorithm breaks down on certain board sizes. For example, try a
120 by 120 board. Instead of nearly instantaneous solution like 120 by
119 would get, it takes enormous amounts of time. I don't yet know
why.
Also included in the repository is a Perl script for creating frames
for a video. Here are two videos, one for an 8x8 board, one for a
80x80 board. They are in Ogg
Theora. Update: if your browser supports HTML 5 you can now
watch these videos inline thanks to the video tag.
It's interesting how on the large one the behavior seems to be random
at some places, but all it's doing is following that simple algorithm.
Many comment/discussion systems get previews wrong. This even includes
major sites like Boing Boing and Slashdot. Sometimes they feed back a
different comment in the textarea, so repeated previews slowly degrade
the comment. Other times the comment preview isn't the same thing as
the final result. A comment actually has four states,
The raw comment is the unfiltered string of bytes from the
user. This is not safe to give directly back to the user, as it could
be exploited to feed an arbitrary page to an innocent user.
The escaped comment is created from the raw comment by
filtering it through the escapeHTML() function. This
function creates HTML entities out of some of the characters, like
< and >. A browser will interpret the escaped comment as a
simple string, and is safe to give back to the user. This function is
actually provided by perl's CGI module, so perl programmers need not
implement this.
Note that escapeHTML() is reversible, though the server
side won't need to reverse it. The browser does.
The stripped comment is created from the raw comment by
filtering it through stripHTML(), which removes
non-whitelisted HTML tags. It also strips non-whitelisted attributes
from allowed tags. It should probably add a
rel="nofollow" to links. It also runs escapeHTML()
on attribute values and content outside tags. This is safe to give
back to the user because only safe tags are left.
If your comments use markup other than HTML, like BBCode, this
function should strip all HTML (your whitelist is empty) and do
the conversion from your markup to HTML.
It might also be a good idea for it to produce well-formed HTML. This
will allow your comments/discussion pages to be XHTML compliant.
stripHTML() is irreversible because it dumps information.
The stored comment is the encoding of the comment in the
system. This depends entirely on the storage system. In some cases it
may be identical to the stripped comment (and store is
the identity function). If the comment is going through SQL into
database, some characters may need to be escaped as to not cause
problems. It could even be a base 64 encoding.
store() must be unambiguously reversible, and the server
should have an unstore() to do this. It should probably
also be able to convert any arbitrary string of characters into a safe
encoding for storage.
There should only be one version of all these functions for both
previews and final posting of comments.
When doing a comment preview both the escaped comment and the stripped
comment are given back to the user. The stripped comment is dropped in
as HTML, and the escaped comment is put into the textarea of the
form. It would probably be convenient for the user if you give them
back any other form information, including the same captcha and their
answer to it (or not charge them with a captcha for that comment
anymore).
You may be tempted to store the raw comments (safely with
store()) and do HTML stripping on the fly. This would
allow you to upgrade your HTML stripping function in the future to
"better" handle user input. I don't recommend it. That's extra
processing for each page request, but worse, it breaks the concept of
the preview, because the comment formatting is subject to change in
the future.
The hardest function to implement is probably stripHTML()
because it needs to be able to handle poorly formed HTML. If you are
using perl, you will probably want to use the HTML::Parser module,
which is what I did. This does everything noted above and also
auto-links anything that looks like a URL, forces proper comment
nesting, automatically makes paragraphs from blank-line-separated
chunks, and almost produces well-formed HTML.
Update: While the basic idea is sound, my implementation below is
definitely not. At the time I obviously did not fully understand Lisp
macros. Using eval is usually a stupid idea, and it's
even more stupid to try to use it inside of a macro where it will most
likely be used, causing an error, at compile time. I doubt this would
work well with byte-compiled code. I'll leave this post here for
historical purposes.
The lisp macro/form let quotes variable's symbols. This
is almost always more useful than not quoting it. Otherwise, simple
use would look like,
(let (('var 100))
body)
instead of,
(let ((var 100))
body)
In elisp, this is analogous to setq, which quotes its
first argument, and set, which doesn't. It could be used
like so in elisp,
The unquoted let form evalues to 20, not 35,
because foo evaluates to the symbol bar,
which is bound to 10. The unquoted version is used to select variable
names dynamically.
This isn't very useful in a lexically scoped lisp because the
variables that can access it are decided, well, lexically. With
dynamic scoping we can use it to temporarily mask variables and select
what variables to mask dynamically. I found a use for it in one of my
projects.
I had a recursive function that searched a graph made of symbols. The
symbols globally stored the state of the node. The current node was
passed into the function, which would change the state of that node,
and recurse. Here's how I was doing it,
I've been using lisp on and off for the past few years. I read some
lisp books, went through The Little
Schemer and some of SICP. But
I could never really think in lisp. When I needed to write some
code, I would prefer another language first. I was writing imperative
code for 10 years before I saw lisp, and I was used to it.
Recently, I have found myself wanting to use lists (including alists,
plists, etc.) as data structures for everything, even when I'm not
even writing lisp. I think I finally
got lisp and now I want to use lisp for everything.
There is a 75% chance of rain on any given day in the next
week. What is the probability that it will rain on at least 5 of
the 7 days?
The purpose of the question was to point out a neat coincidence in the
problem (explained at the link). I used a program to solve the
problem and there are two reasons for this.
First, I wanted to use the program to explore the problem and find the
"special" property. With a program, I could quickly try different
parameters, which would take longer, and be more error prone, by hand.
Second, which is similar to the first, I hate evaluating a large
expression by hand. It's slow and error prone. Writing a program to do
the same work is faster and mistakes are easier to catch. Also, I can
quickly try different parameters to make sure my program's output is
reasonable. In this case, for any reasonable input, the output, a
probability, shouldn't be greater than 1.
Well, let's see, this is a Bernoulli
experiment: each day is independent, so it is like flipping a coin
seven times and counting the heads. That means we need the choose
function.
My first thought was Octave, as this is a simple program and Octave
already provides nchoosek() for me.
# Rain at least n days
function sum = rain (n)
sum = 0;
for i = n:7
sum += nchoosek(7,i) * 0.75 ^ i * 0.25 ^ (7 - i);
endend
Simple, but I actually made a couple little mistakes working it out,
and it took me a little longer than it should have. If you asked
someone to write this program in any imperative language, it would
probably look a lot like this.
I then made a lisp version (elisp), but I first needed to define the
binomial coefficient function since there wasn't one provided.
(defunnck (n k)
"The binomial coefficient."
(if (or (= k 0) (= k n)) 1
(+ (nck (- n 1) (- k 1)) (nck (- n 1) k))))
This is the recursive version, based on Pascal's
rule, so it doesn't need factorials.
In lisp, recursion is preferred to iteration, so that's the way I
approached the program.
(defunp-rain (n)
"Probability of rain on exactly n days."
(* (nck 7 n) (expt 0.75 n) (expt 0.25 (- 7 n))))
(defunrain-at-least (n)
"Probability it will rain on at least n days."
(if (> n 7) 0
(+ (p-rain n) (rain-at-least (+ 1 n)))))
I like the recursive version, and this code, much better. It presents
the solution in a more straight forward way. And I got it right the
first time, too. Yes, it was written after the Octave version,
but I still think it counts for something.
The lisp version is also less complex. The Octave version has to use
some temporary variables, i and sum, which
is extra conceptual overhead. Sure, it could also be written
recursively, but this is not really the way Octave is meant to be
written.
Eh, not a great example, really. Lisp has so many powerful features,
like macros (the powerful lisp kind), symbols, low-level access into
the interpreter, and the REPL, that allow the programmer to do some
really cool things. Many of these features are unique to lisps.
At work right now I am using two different machines. One is a 32-bit
SPARC and the other is a very powerful SMP x86-64. Sometimes data are
generated on one machine and used in a simulation on the other. There
is a problem of byte-order, though. The SPARC is big-endian and the
other is little-endian, and the programs on both sides don't pay
attention to that small detail.
Luckily, the data are all 4-byte aligned. That's perfect for a Perl
one-liner byte order conversion,
Perl is really great for concise hacks. I really like how this
one-liner almost reads like a natural language sentence. Is there any
other language that can do powerful one-liners like Perl?
To all other non-emacsen text editors, can your text editor do that?!
Ha! Even though elisp is a slow, closure-less, dynamically scoped,
ugly cousin of more popular lisps, it's still a lot of fun to write.
To fire it up, load it into Emacs and run the extended command
(M-x) httpd-start. By default it will serve
files from "~/public_html". To change this, change the
variable httpd-root to the desired web root. You can stop
the server with httpd-stop.
It's about 200 lines of code and can serve static websites made of
small, static files. I say small files because it serves files from
buffers, meaning it has to read the entire file in first.
For a simple, text editor based server it can hold up to a pretty
decent load. At one point I hit it with 8 wget instances all making
rapid recursive downloads and my manual navigation wasn't slowed down
noticeably. Despite running in the slow elisp interpreter, I think it
can have much better performance by caching commonly served files in
buffers.
It should run, unmodified, anywhere a modern Emacs can run, so
I expect that it's already very portable. I can imagine it being
useful in a situation where someone needs to temporarily host some
files but there isn't a web server on the machine. Just grab this
script and throw it at Emacs.
Well, it only does IPv4 right now, though I expect IPv6 only requires
changing one number (namely, 4 to 6). I don't have any IPv6 systems to
test it on.
When writing it I also had security in mind so, as far as I know, it
should be safe to use. It cleans up the GET from the
client so that no files underneath the serving root can be accessed.
The server log is lisp itself. Here is an example log starting the
server, serving one request, and halting,
The log is alists of alists, making a hierarchical tree structure that
can be explored with some simple lisp functions. Normally this sort of
thing is done with XML, but lisp already has its own structured
format: lists!
When GET is a directory, it looks for
"index.html" and serves that if it exists. More indexes
can be added to the variable httpd-indexes. This can
actually be done in a special ".htaccess.el" file.
If a ".htaccess.el" exists in the directory from which a
file is being served, Emacs will first load/execute it. You see, it's
just a lisp program. If you wanted to add a new index file name, the
hypertext access file could contain this,
(add-to-list 'httpd-indexes "0001.html")
It's a bit like a .emacs file.
But I think one of the coolest things about having a lisp-based server
is that the server can be modified in place without disrupting or
restarting it. In my Emacs web server, the only change that requires a
restart is changing the server port. In fact, I wrote most of it while
the server was running and tested my changes from a browser right as I
made them -- all on the same instance of the server.
If you want to look into the AI side of this, the server could modify
its own code in response to its use.
I also had the idea of creating dynamic websites with elisp, in the
same way PHP or Perl does. If a .el file (or
.elc) is accessed, the server would pass the
GET/POST arguments as an alist to a function
in the elisp file. The server would also provide some nifty HTML
generation macros. A dynamic script might look like this,
I finally have a comment system, thanks to pollxn, a blosxom comment system that
actually works. There is a link to it, indicating the number of
comments, in the bottom of each post. Try it out and say hello.
Unfortunately, pollxn doesn't have any sort of anti-spam or CAPTCHA system. If
you look around the Interwebs where other people are using pollxn, you
will see everyone has their own little CAPTCHA thing. Well, I am not
different. I hacked together my own to keep away automated spammers.
It selects words from the dictionary (of 40,000 words in this case)
and encrypts them with Blowfish in CBC mode, with a unique IV each
time. This is to passed to the user, who passes it to an image
generator which decrypts the word and uses GD in Perl to render it,
apply some transforms, and drop a line randomly over it. The user
submits the guess of the image along with the encrypted version
(hidden field), which is decrypted and compared on the other end. The
same encrypted ID cannot be used twice, but thanks to the IV the same
word can be used twice.
Here are some samples. If you hit refresh, they will render
differently.
It's not a great CAPTCHA, but it should be good enough for the low
volume of traffic I see here. As I inevitably collect small amounts of
spam (by spammers manually passing the CAPTCHA), I will gradually
create the needed tools to combat it. I can also easily update the
CAPTCHA image algorithm without disrupting the functioning of the
website.
I'm sure I will be making improvements to the comment system over time
as well. I should make it obfuscate e-mail addresses, for one. Maybe
add a preview. And better blosxom integration.
So say hello below! I am excited to finally have a real blog.
I have recently been playing around with Greasemonkey, a great Firefox
add-on that gives users a lot of control over how a website is
displayed. It works by having the user providing a bit of Javascript
that runs when a website is rendered. The user doesn't actually have
to write the script, but can find them in various script repositories.
When I looked around, I couldn't find scripts to do some things I
wanted, so I started writing my own. Now that I have started that
habit I see uses for user scripts all over the place. Suddenly I can
fix anything I find annoying on the web. It's very empowering.
Of course, Firefox add-ons can do anything a user script can do. But
Greasemonkey user scripts are lightweight, more secure (due to being
less powerful), easier to write, and don't require a browser restart
to install and uninstall.
I posted my scripts on
userscripts.org so that people could find them easily, but I
always like to host these things locally, too. You can find it under
/userscripts here. Don't forget to review
the source before you install it! That is, unless you automatically
trust me and my website's security. I, or an infiltrator, could slip
something sneaky in there.
I actually first used Greasemonkey back in 2005, but they had some
very serious security issues back in those days. It was bad enough
that I just uninstalled it, which was actually recommended by the
Greasemonkey people themselves. So, four years later I am back to
check it out.
The first user script I wrote was in response to a "feature" on TV Tropes. In addition to the overall
cruddiness of the website, they started adding "folders" to the
information on long pages. It folds up all of the information behind
little clickable widgets. It uses CSS to hide the information and
Javascript to reveal it by adjusting those styles on the fly. If you
have a browser with CSS support but not Javascript (or have it
disabled like I did), you won't ever be able to see the information in
the browser. As of this writing the
Jumping the Shark article uses this. Take a look.
This is an awful idea! It critically breaks the usability of the
page. And what's the point? We already have a vertical scrollbar to
control the display. Unfortunately, a lot of clueless people seem to
like this sort of behavior -- because it's flashy -- so we will
probably only see more of it on the web in the future.
My TV Tropes user script is simple: it scrapes off some of the
CSS. Specifically, the "folder" CSS. That's it! Someone who already
knows how to make user scripts could probably put this together in
less than 15 minutes.
If you want to tackle web annoyances, learn Greasemonkey!
Here's something interesting I saw on Wikipedia. There is a map of the
US with states colored according to the spread of the H1N1 flu
epidemic. Take a look:
H1N1_USA_Map.svg. The interesting part is actually at the bottom
of that image page.
Wikipedia, being a wiki, versions everything. It has to. No one
actually changes a page, they just add a new version that is a
derivative of the "current" version. Because of this, Wikipedia has
incidentally created a time-lapse version of the map in its version
control system. In case you can't see it when you are reading this,
here's what part of it looks like,
That's the timestamp and the state of the spread at that time,
presented in an extremely useful way. And this was created by
accident. Pretty cool, eh?
A few years ago I made my wife -- girlfriend at the time -- a Klein bottle
(well, the three-dimensional projection of one) out of clay. Since I
hadn't used clay before I used some assistance from my dad. Here's how
it was done,
As you can see, it's not quite the same as the generally depicted
Klein bottle. The form you see here was easier to make with
clay. After it was done, we baked it in a kiln. It's a bad idea to
put sealed items in a kiln because they will burst as they heat. It
took some time to convince the staff that our Klein bottle was
actually unsealed.
I have previously talked about arcfour
and CipherSaber and provided
implementations in C. For fun, I made an implementation of arcfour in
Emacs lisp (elisp), and built upon that to make a CipherSaber
implementation in elisp. Check it out with Git,
If you don't have Git (yet), you can follow that link to use the web
interface. The relevant files are arcfour.el and
ciphersaber.el. There are some test vectors in there that
I won't show here. Here is the arcfour implementation,
(defunrc4-init-state ()
"Initialize the arcfour state vector."
(interactive)
(setq rc4-state (make-vector 256 0))
(setq rc4-i 0)
(setq rc4-j 0)
(let (i)
(dotimes (i 256 rc4-state)
(aset rc4-state i i))))
(defunrc4-swap (i j)
"Swap two elements in the state vector."
(let ((temp (aref rc4-state i)))
(aset rc4-state i (aref rc4-state j))
(aset rc4-state j temp)))
(defunrc4-key-sched (key)
"Arcfour key-scheduler: initialize state from key."
(interactive "sEnter key: ")
(let ((j 0) i)
(dotimes (i 256 rc4-state)
(setq j (% (+ j
(aref rc4-state i)
(aref key (% i (length key)))) 256))
(rc4-swap i j))))
(defunrc4-gen-byte ()
"Generate a single byte."
(interactive)
(setq rc4-i (% (1+ rc4-i) 256))
(setq rc4-j (% (+ rc4-j (aref rc4-state rc4-i)) 256))
(rc4-swap rc4-i rc4-j)
(aref rc4-state (% (+ (aref rc4-state rc4-i)
(aref rc4-state rc4-j)) 256)))
For the sake of simplicity it uses some global variables to store the
cipher state. It would be better if the state was returned as a list,
continuation, or closure. That way we could run a bunch of different
ciphers at the same time.
And this provides interactive functions so that they can be called by
a user right on the buffer being used, with M-x rc4-buffer.
(defunrc4-region (start end key)
"Encrypt/decrypt region with arcfour using given key."
(interactive "r\nsEnter key: ")
(rc4-init-state)
(rc4-key-sched key)
(save-excursion
(let (c)
(goto-char start)
(while (< (point) end)
(setq c (char-after))
(delete-char 1)
(insert-char (logxor c (rc4-gen-byte)) 1)))))
(defunrc4-buffer (key)
"Encrypt/decrypt entire buffer with arcfour."
(interactive "sEnter key: ")
(rc4-region (point-min) (point-max) key))
You may run into encoding issues with encrypted regions. The
CipherSaber implementation below gets around this problem by turning
off multi-byte encoding with set-buffer-multibyte.
In ciphersaber.el we apply these functions. The
CipherSaber functions can be called with M-x
cs-encrypt-buffer and M-x cs-decrypt-buffer. Note
that this is only CipherSaber-1, and I leave CipherSaber-2 as an
exercise for the reader :-P.
(defuncs-encrypt-buffer (key)
"Encrypt buffer with CipherSaber-1."
(interactive "sEnter key: ")
(set-buffer-multibyte nil)
(let* ((iv (cs-generate-iv))
(cs-key (concat key iv)))
(rc4-buffer cs-key)
(beginning-of-buffer)
(insert iv)))
(defuncs-decrypt-buffer (key)
"Decrypt buffer with CipherSaber-1."
(interactive "sEnter key: ")
(let* ((iv (cs-retrieve-iv))
(cs-key (concat key iv)))
(rc4-buffer cs-key)))
(defuncs-generate-iv ()
"Generate a 10-byte initialization vector."
(let ((iv "") i)
(dotimes (i 10 iv)
(setq iv (concat iv (char-to-string (random 256)))))))
(defuncs-retrieve-iv ()
"Remove initialization vector from buffer and return it."
(beginning-of-buffer)
(let ((iv "") i)
(dotimes (i 10 iv)
(setq iv (concat iv (char-to-string (char-after))))
(delete-char 1))))
I didn't bother with save-excursion because I didn't
think it would be important where the point is in the middle of an
encrypted file. Feel free to add it, though!
These functions could be used to make a minor mode to transparently
encrypt and decrypt CipherSaber files. It could also be modified to
take advantage of Emacs' batch mode to handle CipherSaber
processing right from the shell. But those are other projects!
If you are a
crypto-anarchist type like me, you should definitely take a look
at CipherSaber. It is an
extremely simple encryption protocol that even beginner programmers
can implement. The protocol can also easily be memorized and quickly
implemented from memory on the fly. In the case that cryptography was
completely outlawed, CipherSaber would be a useful tool in allowing
its users to continue to communicate privately.
I think the name is just perfect and captures everything CipherSaber
is about. Here is the description right from the CipherSaber page,
In George Lucas' Star Wars trilogy, Jedi Knights were expected to
make their own light sabers. The message was clear: a warrior
confronted by a powerful empire bent on totalitarian control must
be self-reliant.
CipherSaber is based on the arcfour
stream cipher, but goes beyond it by defining the use of an
initialization vector (IV) and how it is stored with the
ciphertext. There are actually two versions: CipherSaber-1 and
CipherSaber-2. The second one exists because of vulnerabilities in the
first. The difference between them is small.
You want to make sure you generate a long enough passphrase for your
encryption key. A normal password isn't good enough because an
adversary will be able to throw all his available processing power at
your ciphertext. Using Diceware would
be a good idea here.
Here is the protocol.
Generate a 10-byte random IV. This need not be done using a very
strong random number generator. It is only important that the same IV
is not used more than once. Concatenate a secret user selected key
(i.e. passphrase) with the IV and use that concatenation as the key
for an arcfour cipher. Encrypt the message using the
cipher. Concatenate the IV and the arcfour ciphertext to create the
CipherSaber ciphertext.
To decipher, remove the first ten bytes of the ciphertext and use it
as an IV. Concatenate the secret passphrase with the IV, and use it as
the key for an arcfour cipher. Decrypt the remaining ciphertext with
the arcfour cipher.
Because of vulnerabilities in the arcfour cipher, CipherSaber-2 is an
updated version that runs the arcfour key scheduler at least 20
times. The exact number of times is a secret that the sender and
receiver must agree on. Notice that CipherSaber-1 is CipherSaber-2
with only 1 key schedule iteration.
Using a large number of iterations could be considered a form of key
strengthening. An adversary who is making a brute force attack on
the ciphertext has that much more work to do for each passphrase
trial.
You should really implement your own, but here is one of my
implementations, written in C. I put it in with the rest of my arcfour
stuff. Get it with git,
You can use it as a reference to make sure your first implementation
is correct. You can use these two ciphertexts to test your
implementation as well,
This is the diagram image above (ciphersaber.png) encrypted with the
key "nullprogram". The first one is CipherSaber-1 and the second is
CipherSaber-2 with 20 key schedule iterations.
In the past I have always posted my code plainly without any sort of
syntax highlighting. Boring! I finally got around to using the Emacs htmlize tool
written by Hrvoje Niksic. So from now on I will try to provide syntax
highlighting for inline code.
Htmlize leverages the syntax highlighting of Emacs by converting the
faces into HTML. It's more or less what I see in my Emacs buffers. The
exact output format can be in HTML + CSS, HTML with inline CSS, or
HTML with old font tags. It can provide highlighting for
any language that Emacs supports, which is just about everything.
I threw the CSS stuff in my main stylesheet so the XHTML of my posts
remains simple. Here are some examples in action. Perl,
## Find the maximum total from top to bottom of the triangle.
usewarnings;
usestrict;
useList::Util qw/reduce max/;
my @triangle;
open my $datafile, "data67.txt";
while (<$datafile>) { push @triangle, [split/\s+/] }
my $result = reduce {
formy $i (0..$#{$b}) {
@$b[$i] += max(@$a[$i], @$a[$i+1]);
}
$b
} reverse @triangle;
print @{$result}[0] . "\n";
# npoly()
function a = npoly (x, y)
X = repmat (x', 1, length(x));
for i = 1:length(x)
X(:,i) = X(:,i) .^ (i - 1);
end
a = X \ y';
end
C,
#defineSWAP(a, b) if (a ^ b) {a ^= b; b ^= a; a ^= b;}
/* Initialize a keystream. */voidinit_keystream (keystream *k, intn)
{
k->i = 0;
k->j = 0;
inti;
for (i = 0; i < 256; i++)
k->S[i] = i;
ints;
bytej = 0;
for (s = 0; s < n; s++)
for (i = 0; i < 256; i++)
{
j += k->S[i] + k->key[i % k->keylen];
SWAP(k->S[i], k->S[j]);
}
}
elisp,
(defuncount-words-buffer ()
"Print a message the number of words in the current buffer."
(interactive)
(save-excursion
(let ((count 0))
(goto-char (point-min))
(while (< (point) (point-max))
(forward-word 1)
(setq count (1+ count)))
(message "%d words." count))))
And even HTML,
<html>
<head>
<title>Sample HTML</title>
</head>
<body>
<h1>Hello World!</h1>
<!-- This is a comment. -->
<p>
This is my HTML sample.
</p>
</body>
I actually didn't start using RSS feeds until recently, and I wish I
had started earlier. However, when I added my favorite webcomics to my
feed I noticed most of them didn't actually include the comic
itself. Personally, I would rather read the comic right in the feed.
Well, I later came across an old post on a blog I frequent: Terminally
Incoherent. Luke Maciak posted some code to
generate an RSS feed by scraping the comic's website. He did it in
order to provide feeds for webcomics that don't have one. I took it,
added a web front-end to it, and made the feeds available to
anyone. You can find them under /feeds/ or at
the link in the main left bar.
The scraper works by searching the comic's "latest comic" page for
image sources matching a regular expression. So to teach the scraper
about a new comic, all that needs to be provided is a URL and a
regular expression matching only the image of the comic.
My front-end has a database of URL/regex pairs and takes care of
caching results. This way it only queries the comic's website once
every few hours at the most.
I also use an .htaccess file to write the links to look
much cleaner,
RewriteEngine on
RewriteRule ^([\w\/]+)$ feeds.pl?q=$1
RewriteRule ^$ list
There are some comics that are immune. Specifically, some comics don't
allow hot-linking, which includes links from an RSS feed. There is
nothing I can really do about these.
If you enjoy these comics, go ahead and add these feeds if you want to
read them right in your RSS reader.
I was reading through a website of " computer
stupidities" today when I came across this,
if (a)
{
/* do something */
return x;
}
else if (!a)
{
/* do something else */
return y;
}
else
{
/* do something entirely different */
return z;
}
This was quickly dismissed as being an obvious beginner mistake. I
don't think this can be dismissed so quickly without thinking it
through for a moment. Yes, in the example above we will never reach
the last condition where we return z, but consider the
following,
if (a < b)
printf ("foo\n");
else if (a > b)
printf ("bar\n");
else if (a == b)
printf ("baz\n");
else
printf ("faz\n");
The same quick dismissal might drop the last "faz" print statement as
being an impossible condition. Can you think of a situation where the
program would print "faz"?
Our final condition will be reached if a or
b is equal to NAN, which is defined by the
IEEE floating-point standard. It is available in C99 from
math.h. A NAN in any of the comparisons
above will evaluate to false.
Flash seems to be the popular way of playing videos online. This is a
bit better than the bad old days of online video where a user had to
select from a few buggy media player plug-ins. Things have improved.
However, if you don't use Flash, or if you want to watch the videos in
your own media player, you are stuck. A download link for the video is
almost never provided. The video is always somewhere, though, to be
fetched via http. I mentioned this
before for downloading YouTube videos using youtube-dl.
The trick is finding the URL. Sometimes you can derive it from the
HTML code, sometimes you have to dig a little deeper by inspecting the
Flash player itself.
strings can be invaluable here.
There could be an extra layer of stuff to work out, which is explained
below. My main reason for posting this is so I can refer back to it
later when I need to do it again.
So, the other day I ran into a Flash video player that contained part
of the URL of its video. I began by studying the embed
tag in the HTML, which gave me some information about where to find
the video (the video ID number). I downloaded the Flash player SWF
file for the purpose of running strings on it.
I ran into a problem here. I wasn't finding any non-garbage strings
inside the file. file told me it was compressed.
$ file player.swf
player.swf: Macromedia Flash data (compressed), version 9
Searching online quickly revealed that a compressed Flash file is just
zlib compression after an 8-byte header. Decompression can actually be
done with a Perl one-liner,
There has been a lot of talk online about the fragility of URL
shortening services, particularly in relation to Twitter and its 140
character limit on posts (based on SMS limits). These services create
a single point of failure and break mechanisms of the web that we rely
on. Several solutions have been proposed, so over the next couple
years we get to see which ones end up getting adopted.
There are many different URL shortening services out there. They take
a large URL, generate a short URL, and store the pair in a
database. Several of these services have already shut down in response
to abuse by spammers who hide fraudulent URLs behind shortened
ones. If these services ever went down all at once, these shortened
URLs would rot, destroying many of the connections that make up the
world wide web. This is called the rot link apocalypse, and it has
some people worried.
I am not very worried about this, though. I don't use Twitter, or any
other service that puts such ridiculous restrictions on message
sizes. Nor do I think information on Twitter is very important. Also,
this mass link rot will occur gradually, slow enough to be dealt with.
In any case, short URLs may be useful sometimes, especially if a URL
needs to be memorized or if the URL is extremely long. Or, it could be
used to get around a design
flaw in an inferior
browser.
One idea that I have not yet seen implemented is simple data
compression. When a short URL is needed, a user can apply a
compression algorithm to the URL. The original URL can be recovered
from this alone, so we don't have to rely on third parties to store
any data.
I have doubts this would work in practice, though. Generic compression
algorithms cannot compress such a small amount of data because their
overhead is too large in relation. Go ahead, try pushing a URL through
gzip. It will only get longer. We would need a special URL compression
algorithm.
For example, I could harvest a large number of URLs from around the
web, probably sticking to a single language, and use it to make a Huffman coding
frequency table. Then I use this to break URLs into symbols to
encode. The ".com/" symbol would likely be mapped to one or two
bits. Finally, this compressed URL is encoded in base 64 for use. The
client, who already has the same URL frequency table, would use it to
decode the URL.
URLs don't seem to have too many common bits, so I doubt this would
work well. I should give it a shot to see how well it works.
We probably need to stick with lookup tables mapping short strings to
long strings. Instead of using a third party, which can disappear with
the valuable data, we do the URL shortening at the same location as
the data. If the URL shortening mechanism disappears, so did the
data. The URL shortening loss wouldn't matter thanks to this
coupling. Getting the shortened URL to users can be tricky, though.
One proposal wants to
set the rev attribute of the link tag to
"canonical" and point to the short URL.
To understand this one must first understand the rel
attribute. rel defines how the linked URL is related to
the current document. rev is the opposite, describing how
the current page is related to the linked page. To say
rev="canonical" means "I am the canonical URL for this
page".
However, I don't think this will get far. Several search engines,
including Google, have already adopted a
rel="canonical" for regular use. It's meant to be placed
with the short URL and will cause search engines to treat it as if it
was a 301
redirect. This won't help someone find the short URL from the long
URL, though. It is also likely to be confused with the
rev attribute by webmasters.
The rev attribute is also considered too difficult to
understand, which is why it was removed from HTML5.
Another idea rests in just using the rel attribute by
setting it to various values: "short", "shorter", "shortlink",
"alternate shorter", "shorturi", "shortcut", "short_url". This website does a good
job of describing why they are all not very good (misleading, ugly, or
wrong), and it goes on to recommend "shorturl".
I went with this last one and added a "short permalink" link in all of
my posts. This points to a 28 letter link that will 301 direct to the
canonical post URL. In order to avoid trashing my root namespace, all
of the short URLs begin with an asterisk. The 4 letter short code is
derived from the post's internal name.
I also took the time to make a long version of the URL that is more
descriptive. It contains the title of the post in the URL so a user
has an idea of the destination topic before following through. The
title is actually complete fluff and simply ignored. Naturally this
link's rel attribute is set to "longurl".
Keep your eyes open to see where this URL shortening stuff ends up
going.
The author of a tool named Hashapass contacted me some time ago
to bring his tool to attention. It is a way to mitigate the problem of
having to memorize and generate many different passwords.
Good security practice is for users to have a different password with
each web site and system they use. Should one of them be compromised,
your other accounts will still be safe. The problem is that passwords
tend to both be hard to remember and difficult to generate.
Hashapass allows a user to have just one password (ideally,
passphrase) that is used to generate many different passwords. Provide
the master passphrase and the name of the website (parameter) needing
a password and Hashapass generates an 8-character password worth 48
bits.
The website works entirely in Javascript, so you don't have to worry
about transmitting your password or master passphrase. This also makes
it easy to see how the hashing is done. If this was a secret, I
wouldn't recommend using it.
It works by applying
HMAC, with the SHA-1 hash, to the the parameter and passphrase as
to stir them together into a hash. Then it outputs the 48 most
significant bits in base-64 as the password.
I mentioned before that you should really use a master
passphrase instead of a master password, because a
compromised hash password can be brute forced to reveal the master
password. Unfortunately, the Hashapass website says "password" instead
of "passphrase".
I made a Hashapass password cracker to test how practical this attack
would be. You can grab it with Git,
The idea is that if a malicious website operator peeked at your
password, knew you used Hashapass, and properly guessed the parameter
(which isn't a secret), he could use a tool like this to brute force
attack the password to retrieve the master passphrase. A short master
password could easily be discovered.
Running on one machine with one instance of the program, my tool can
break any password with five or less characters in a matter of
hours. A 6-character password could take a month or two. A 7-character
password would take a decade. Each character in the password increases
the search time by a factor of 100.
If multiple computers/cores/processors are put to use on the attack,
these times can be shortened: 2 computers would halve the time, for
example. The attack is easy to parallelize.
My tool assumes a strong, but short, master password was chosen, as it
checks against all printable ASCII characters. If a weaker password
was used, and the attacker assumed this, the above time table would be
much shorter.
So, for the master passphrase, use at least 8 characters
generated using a strong random number generator. I recommend
generating the passphrase with
Diceware using 5 words.
On my brainfuck compiler project, I
proposed pre-calculation as an optimization technique. The idea can
work, but it has an issue that will always be unsolvable: how do you
know that the pre-calculation will halt? This is called the halting
problem and it has been proven impossible to solve.
The idea was that the compiler would run the brainfuck program up
until the first input operation -- if there even was one. It would
record all output and the final state of the memory. Instead of
compiling the code was was run, it would compile code that would print
all of the output and set the memory at the final state.
I has mistakenly assumed that it would be possible to detect a
non-halting program and avoid doing pre-calculation on it. I described
how it would be done and left it at that. Recently, someone kindly
sent me an email containing only 5 letters:
+[--]
This defeated my ill-conceived idea.
Because brainfuck is Turing complete, it is actually impossible to
determine whether or not an arbitrary brainfuck loop will halt. A
computer can't do it. A human brain (a fancy computer) can't do it
either. It cannot be done, at least not in this universe.
So, if implemented, this pre-calculation measure will always be
flawed.
In a project I am working on, I want to implement a large list using
lazy
evaluation in Scheme. The list is large enough to be too unwieldy
to store entirely in memory, but I still want to represent it in my
program as if it was. The solution is lazy evaluation.
One use of lazy evaluation is allowing a program to have infinitely
sized data structures without going into the impossible task of
actually creating them. Instead, the structure is created on the fly
as needed. As a prototype for getting it right, I made an infinitely
long list in Scheme that contains the entire Fibonacci series.
This function, given two numbers from the series, returns the lazy
list. It uses delay to delay evaluation of the list.
Notice the recursion here as no base case, so without lazy
evaluation it would continue along forever without halting. Now run
it,
> (fib '(0 1))
(1 . #<promise>)
The rest of the list is stored as a promise, which will later
be teased out using force. This forces evaluation of the
promise. Here is a function to traverse the list to the
nth element and return it. Notice, this does have a base
case.
(define (nth-fib f n)
(if (= n 1) (car f)
(nth-fib (force (cdr f)) (- n 1))))
Here it is in action. It is retrieving the 30th element.
> (define f (fib '(0 1)))
> f
(1 . #<promise>)
> (nth-fib f 30)
832040
If you examine f, it contains the first 30 numbers until
running into an unevaluated promise. This behavior is very similar to
memoization, as calculated values
are stored instead of being recalculated later.
These two functions are also behaving as coroutines. When
nth-fib reaches a promise, it yields to fib,
which continues its non-halting definition. After producing a new
value in f, it yields back to nth-fib.
The way I called these functions above, however, can lead to
problems. We are storing all the calculated values in f,
which can take up a lot of memory. For example, this probably won't
work,
> (nth-fib f 1000000)
We will run out of memory before it halts. Instead, we can do this,
> (nth-fib (fib '(0 1)) 1000000)
Because nth-fib uses tail recursion as it traverses the
list, unneeded calculated values are tossed (which the garbage
collector will handle) and no additional function stack is used. All
Scheme implementations optimize tail recursion in this way. This will
continue along until it hits the millionth Fibonacci number, all while
using a constant amount of memory.
It turns out that Scheme calls this type of data structure a
stream, and some implementations have functions and macros
defined so that they are ready to use.
So there you go: memoization, lazy evaluation, and coroutines all
packed into one example.
My wife was interested in growing a garden this summer. She has never
done it before and wanted to learn. However, we live in an apartment
so we don't exactly have a yard. Instead, we got some pots and started
growing a balcony garden.
I thought we were being clever, but it turns out this is a common
practice called container
gardening. Searches online for "balcony gardening", "apartment
gardening", and "container gardening" will bring up lots of useful
information.
I think it's more convenient than regular gardening, as the plants are
practically in the apartment. In fact, it's about 4 feet from our bed,
through the balcony door. We can move the plants around to the other
balcony, or even inside, if needed.
Gardening is a lot of fun, really.
This year, we are growing (or trying to grow) carrots, peppers,
strawberries, catnip, and dactylis (cat grass). We'll see how things
turn out in a couple months. Here's how it looks right now,
I recently learned about an excellent Firefox add-on called Vimperator,
which I have been using for a few days now. It creates an extremely
efficient Vim-like interface to Firefox. One of the main functions
is to be able to browse completely mouseless.
Why mouseless? Because the mouse is a bad input device for many uses
of a computer. It's a good choice for many games, like first-person
shooters, or graphic design, like Inkscape or GIMP. But for tasks like
text editing, word processing, and data entry, the mouse one of the
worst kinds of input device. The less you touch the mouse, the better.
Vimperator's argument is that the browser is better as a pure keyboard
interface.
I am an Emacs person myself, which I use for text editing, file
management, and IRC, but I appreciate the vi/Vim interface and accept
it as being almost as good. Most of my vi experience actually
comes from NetHack and Less. My main use
for vi is editing my Debian sources.list so I can install Emacs.
Vimperator removes your toolbar, menu bar, and address bar. Then it
transforms the status bar into the standard Vim status lines. This is
because you don't need any of this stuff anymore with the Vim
interface. It's traded for more browser real estate. This also
creates the fun situation of watching your friends try to use your
browser. At first, it really is pretty disorienting.
There is handy built-in documentation, found by pressing F1 or calling
the :help command. You'll want to read through these
before trying to do anything.
My problem right now is breaking my old Firefox keyboard muscle
memory. Before Vimperator, my browsing was already fairly mouseless. I
used keyboard shortcuts for everything. I had the Mouseless
Browsing add-on installed, and occasionally used. When not using
Mouseless Browsing, it worked out well because my right hand did the
mouse, while most of the keyboard shortcuts could easily be performed
with my left hand (C-tab, C-S-tab,
C-t, C-w).
I think Vimperator has the potential to be even more efficient than
that.
Probably one of the biggest adjustments is following links without a
mouse. Like the Mouseless Browsing add-on, Vimperator assigns numbers
to the links to be typed out. It is less intrusive though, because it
doesn't reformat the page to show the numbers. It has a "hint" mode
you go into for that. This mode displays the numbers over the links as
red markers.
But even better than that, you don't generally even need those
numbers. You can enter hint mode and begin typing the type of the link
out. As soon as you reach a unique string prefix, it follows the
link. This is the primary way I follow links, and I started doing this
completely by accident. I wasn't even aware this was possible until I
did it. Vimperator was completely natural in this respect.
Probably my favorite feature so far is automatic page advancement. I
use these all the time now. One set of commands is C-a
and C-x. These increment and decrement the last number in
a URL, handy for those annoying multi-page articles. If they number
the pages in the URL, this should handle it automatically. The other
form of page turning is [[ and ]]. These
search for links labeled "next", ">", "prev", "previous", and
"<" and follow them. This works in Google searches and many
web comics.
A potential use for macros is quick data scraping. You can write a
macro to automatically perform a series of actions, like save the
current page and move the next one, and have them repeat a number of
times. It could also help in rapidly filling out the same form over
and over, leaving only the CAPTCHA for manual input, if you were up to
something mischievous.
For example, here is a macro to open in a new tab the first result of
a Google search on the current page, then move to the next page. If
you repeat it, it will open the first result on page 1, then the first
result on page 2, and so on.
q s F 2 8 ] ] q
Note, the "28" may be different for you. To open the first result on
the next 15 search result pages,
1 5 @ s
It is pretty cool watching it work away.
It's not perfect, though. Like Vim, you can prefix commands with
numbers to repeat them, but this won't work with many commands, such
as the page turning one above. You can get around it sometimes by
placing the command in a macro.
Also, Vimperator still requires a mouse for many actions, like saving
images. The worst part about it is these actions cannot be used as
part of a macro. Hopefully Vimperator will improve in the future and
fix this.
Give it a shot sometime. Like learning a good text editor for the
first time, after you are set up, move your mouse out of reach so you
are forced to use the keyboard. It slows you down in the short run,
but you will be very fast later on down the road.
In a previous post about the LZMA
compression algorithm, I made a negative comment about zip archives
and moved on. I would like to go into more detail about it now.
A zip archive serves three functions all-in-one: compression, archive,
and encryption. On a unix-like system, these functions would normally
provided by three separate tools, like tar, gzip/bzip2, and GnuPG. The
unix philosophy says to "write programs that do one thing and do it
well".
So in the case of zip archives, we are doing three things poorly when,
instead, we should be using three separate tools that each do one
thing well.
When we use three different tools, our encrypted archive is a lot like
an onion. On the outside we have encryption. After we peel that off by
decrypting it, we have compression, and after removing that lair,
finally the archive. This is reflected in the filename:
.tar.gz.gpg. As a side note, if GPG didn't already
support it, we could add base-64 encoding if needed as another layer
on the onion: .tar.gz.gpg.b64.
By using separate tools, we can also swap different tools in and out
without breaking any spec. Previously I mentioned using LZMA, which
could be used in place of gzip or bzip2. Instead of
.tar.gz.gpg you can have .tar.lzma.gpg. Or
you can swap out GPG for encryption and use, say, CipherSaber as
.tar.lzma.cs2. If we use a single one-size-fits-all
format, we are limited by the spec.
Compression
Both zip and gzip basically use the same compression algorithm. The
zip spec actually allows for a variety of other compression
algorithms, but you cannot rely on other tools to support them.
Zip archives are also inside out. Instead of solid
compression, which is what happens in tarballs, each file is
compressed individually. Redundancy between different files cannot be
exploited. The equivalent would be an inside out tarball:
.gz.tar. This would be produced by first individually
gzipping each file in a directory tree, then archiving them with
tar. This results in larger archive sizes.
However, there is an advantage to inside out archives: random
access. We can access a file in the middle of the archive without
having to take the whole thing apart. In general use, this sort of
thing isn't really needed, and solid compression would be more useful.
Archive
In a zip archive, timestamp resolution is limited to 2 seconds, which
is based on the old FAT filesystem time resolution. If your system
supports finer timestamps, you will lose information. But really, this
isn't a big deal.
It also does not store file ownership information, but this is also
not a big deal. It may even be desirable as a privacy measure.
Actually, the archive part of zip seems to be pretty reasonable, and
better than I thought it was. There don't seem to be any really
annoying problems with it.
Tar is still has advantages over zip. Zip doesn't quite allow the same
range of filenames as unix-like systems do, but it does allow
characters like * and ?. What happens when you extract files with
names containing these characters on an inferior operating system that
forbids them will depend on the tool.
Encryption
Encryption is where zip has been awful in the past. The original
spec's encryption algorithm had serious flaws and no one should even
consider using them today.
Since then, AES encryption has been worked into the standard and
implemented differently by different tools. Unless the same zip tool
is used on each end, you can't be sure AES encryption will work.
By placing encryption as part of the file spec, each tool has to
implement its own encryption, probably leaving out considerations like
using secure memory. These tools are concentrating on archiving and
compression, and so encryption will likely not be given a solid
effort.
In the implementations I know of, the archive index isn't encrypted,
so someone could open it up and see lots of file metadata, including
filenames.
When you encrypt a tarball with GnuPG, you have all the flexibility of
PGP available. Asymmetric encryption, web of trust, multiple strong
encryption algorithms, digital signatures, strong key management,
etc. It would be unreasonable for an archive format to have this kind
of thing built in.
Conclusion
You are almost always better off using a tarball rather than a zip
archive. Unfortunately the receiver of an archive will often be unable
to open anything else, so you may have no choice.
Any developer that uses a non-toy operating system will be familiar
with gzip and bzip2 tarballs (.tar.gz, .tgz, and
.tar.bz2). Most places will provide both versions so that the user can
use his preferred decompresser.
Both types are useful because they make tradeoffs at different points:
gzip is very fast with low memory requirements and bzip2 has much
better compression ratios at the cost of more memory and CPU
time. Users of older hardware will prefer gzip, because the benefits
of bzip2 are negated by the long decompression times, around 6 times
longer. This is why OpenBSD prefers
gzip.
But there is a new compression algorithm in town. Well, it has been
around for about 10 years now, but, if I understand correctly, was
patent encumbered (aka useless) for awhile. It is called the Lempel-Ziv-Markov chain
algorithm (LZMA). It is still maturing and different software that
uses LZMA still can't quite talk to each other. 7-zip and LZMA Utils are a couple examples.
GNU tar added an
--lzma option just last April, and finally gave it a
short option, -J, this past December. I take this as a
sign that LZMA tarballs (.tar.lzma) are going to become common over
the next several years. It also would seem that the GNU project has
officially blessed LZMA.
And not only that, I think LZMA tarballs will supplant bzip2
tarballs. The reason is because it is even more asymmetric than bzip2.
According to the LZMA Utils page, LZMA compression ratios are 15%
better than those of bzip2, but at the cost of being 4 to 12 times
slower on compression. In many applications, including tarball
distribution, this is completely acceptable because decompression
is faster than bzip2! There is an extreme asymmetry here that can
readily be exploited.
So, when a developer has a new release he tells his version control
system, or maybe his build system, to make a tar archive and compress
it with LZMA. If he has a computer from this millennium, it won't take
a lifetime to do, but it will still take some time. Since it could
take as much as two orders of magnitude longer to make than a gzip
tarball, he could make a gzip tarball first and put it up for
distribution. When the LZMA tarball is done, it will be about 30%
smaller and decompress almost as fast as the gzip tarball (but while
using a large amount of memory).
At this point, why would someone download a bzip2 archive? It's bigger
and slower. Right now possible reasons may be a lack of an LZMA
decompresser and/or lack of familiarity. Over time, these will both be
remedied.
Don't get me wrong. I don't hate bzip2. It is a very interesting
algorithm. In fact, I was breathless when I first understood the
Burrows-Wheeler transform, which bzip2 uses at one stage. I would
argue that bzip2 is more elegant than gzip and LZMA because it is less
arbitrary. But I do think it will become obsolete.
Unfortunately, the confused zip archive is here to stay for now
because it is the only compression tool that a certain popular, but
inferior, operating system ships with. I say "confused" because it
makes the mistake of combining three tools into one: archive,
compression, and encryption. As a result, instead of doing one thing
well it does three things poorly. Cell phone designers also make the
same mistake. Fortunately I don't have to touch zip archives often.
Finally, don't forget that LZMA is mostly useful where the asymmetry
can be exploited: data is compressed once and decompressed many
times. Take the gitweb interface, which provides access to a git
repository through a browser. I
run one myself. It will provide a gzip tarball of any commit on
the fly. It doesn't do this by having all these tarballs lying around,
but creates them on demand. Data is compressed once and decompressed
once. Because of this, gzip is, and will remain, the best option for
this setting.
In conclusion, consider creating LZMA tarballs next time, and don't be
afraid to use them when you come across them.
Another useful program I use every week is GNU Screen. It
provides virtual terminals at a single terminal. It's a bit like a
window manager for a text terminal. If you are a command line junkie
(and if you are at all serious about computing, you should be), this
is an essential piece of software.
The main reason I use screen is for its persistence. If I am running a
long-running job on a remote machine (i.e. over ssh), like a large
apt-get upgrade, I'll put it in a screen session. This
way I can log out and, later, log in from anywhere and check on it. I
have even used it to persist nethack
sessions, though this isn't really necessary.
The only annoying part is that all of it's mappings are underneath C-a
(ctrl+a), which is a very common Emacs/bash command, which I use a
lot. To get the effect of C-a inside screen, you have to do it twice
in a row because screen captures the first one.
I came across a very interesting tool the other day called rlwrap. It wraps the
readline library over just about any interactive text input
program. The readline library provides basic editing and history. It's
handy for those programs that don't provide their own line editing
facilities.
It tries to be as transparent as possible, detecting yes/no prompts
and passwords, so it should still be reasonable under those
conditions.
If you can't think of anything to try it with, try it with
cat. Instant line editor!
rlwrap cat > some-file.txt
Or with Festival.
rlwrap festival --tts
It will also turn incorrectly compiled shells on your system into
something usable. On my system (Debian GNU/Linux), csh
isn't usable without rlwrap.
In a previous post I
discussed decentralized version control systems, Git in
particular. Because decentralized version control is becoming so
popular, we now have an exciting new area of development: distributed
issue tracking.
Decentralized issue tracking seems to have popped into existance in
the last year or so. A number of projects have appeared (cil, ticgit,
ditz, to name some), but the one that really stands out for me is ditz. Keep an eye on that
one. It's fairly active and mostly usable.
Decentralized trackers generally work by storing the issue tracking
database within the repository itself. One possibility is to have it
sit in its own branch, which I think is the Wrong Way. A second
possibility is to have it sit right next to the code in its own
directory. Yet another possibility is to put the issue tracker in its
own repository. Git could even include this repository as a submodule
(this is a lot like the Wrong Way, though).
First of all, everyone gets their own copy of the issue tracker
database and its history. Second of all, it has history. It's
tracked the same way the code is. And, in the second case usage, one
of the coolest advantages is that issues follow the code very closely.
When a branch is created, it takes its own copy of the issue tracking
database with it. If a bug is fixed in the main branch, the issue
tracking database in the main branch is updated. The bug will remain
in the side branch and the issue will still be open in the side branch
reflecting this. If a merge occurs later, the issue tracking database
also gets merged automatically. I think that's damn cool.
There are some issues that still need to be hammered out. How does a
non-developer enter a ticker? They would need to work the version
control system to do this, then be able to share that change. That's a
pretty large barrier.
Perhaps a web interface could be set up for setting up
issues. Developers could then cherry-pick/pull the issues from that
repository and push ticket updates back out.
Then there is overhead incurred by moving tickets around with
code. How bad is this overhead? How can this be dealt with in the most
transparent way? This all needs to be tested still.
Could the issue database get too big? People like to attach
screenshots to issues. Having many screenshots would make the
repository very big. How do we deal with this?
I finally finished dumping the rest of my lingering Subversion
repositories. I have converted them all to Git repositories. If you manage a
Subversion (or CVS, or Perforce, etc) repository, you should consider
doing the same. Git became my version control system (VCS) of choice
in June and I haven't looked back since.
Why? Because Git is better.
Yes, it really is. Much better.
Git is faster, smaller, more secure, and more powerful. This is a
virtue of decentralized version control systems. Subversion is Blub.
It all starts with to Source
Code Control System (SCCS) and Revision
Control System (RCS). These systems could only track single files
and created headaches for projects with multiple files being worked on
by multiple people.
Then came
Concurrent Versions System (CVS), which improved things slightly,
but still sucked. It still really only tracks individual
files.
Now, CVS did anonymous reads, allowing anyone to access the
repository and see code history.
OpenBSD was the first code base to take advantage of this. These
days, coming across projects that don't give public read access to
their repository seems backwards.
Not using any of these systems would probably be better than using
them. Their flaws are obvious as soon as you start using them.
Finally in the year 2000,
Subversion arrives. It's a huge step up from CVS, fixing many of
its problems. It has a much better interface and uses atomic commits
-- finally tracking more than one file at a time. We still need to
talk to some server every time we want to do something. Branching and
merging also sucks so much no one wants to use it. But branching is
overrated, right? Wrong. I use branches all the time, now that they
are easy.
The reason branching sucks in Subversion can be explained with a
famous quote by Albert Einstein,
Make everything as simple as possible, but not simpler.
Instead of implementing tagging, branching, and merging, the
Subversion guys just implemented "cheap copy". It's a pretty clever
idea, but in practice it doesn't work out well. It's too
simple.
CVS solves the wrong problem, and Subversion solves the right problem
wrongly.
Since Subversion, a number of decentralized VCSs have arisen. We have
GNU arch (2001),
monotone (2002),
darcs (2003),
Bazaar (2005),
Git (2005),
Mercurial (2005), and
fossil (2007).
I played around with all of these when looking for a distributed VCS
(except fossil) and none struck me the same way that Git did. I would
recommend most of them over Subversion.
Distributed VCS has gotten a lot of attention in the past couple
years, which had much to do with the Linux kernel switching to one
(Git). In fact, Git and Mercurial were written precisely for this
event. Since then, some major projects have been switching to Git, or
at least to some sort of distributed VCS: Perl, Ruby on Rails,
Android, WINE, Fedora, X.org, and VLC to name a few.
You can also see the chatter on the Internet about Git. It's is really
popular with fresh, innovative projects, like the Arc programming language. It's
pretty easy to accidentally run into various Git tutorials on the
web. It has a real presence.
No Authority. But why distributed VCS? Why are they better?
First of all, when you "checkout" a distributed VCS, it's really a
"clone" operation, which is what most of them call it. You get
everything. After that, the only reason you need to talk upstream,
which really isn't "up" anymore, is if either end has updates to the
code they wish to share. The only way one clone might more important
than another is human politics. Technically they are equals.
Small. But won't this be huge? A Subversion repository
can easily be several gigabytes. That would be a lot to transfer on
the initial clone.
Actually, distributed VCSs are extremely efficient. A Git clone will
usually be smaller than a Subversion checkout. For example, I once
cloned Freeciv's Subversion
repository using Git (converting it to Git). It was about 15000
revisions. The bare version of the Git repository, containing all
~15000 commits, was half the size of the Subversion repository, which
contained only a singly commit! The non-bare version was still smaller
by a few megabytes. I can't even imagine how much space the server was
using.
I would have some numbers on this example, but, alas, that clone was
lost on a failed hard drive and it took me a week to make. Note, Git
clones of Git repositories aren't that slow: Subversion isn't
optimized for cloning, and the Freeciv Subversion server is extremely
overloaded.
Update: I managed to get another clone, and it only took me a
couple hours. The Freeciv Subversion checkout at revision 15574 is
281MB. Remember, this contains just one single revision. The Git clone
after a repack and garbage collection, which contains all 15574
revisions, is 225MB. It's 56MB smaller! If I told it to leave out the
Subversion metadata it would be even smaller than that. On the server
side, the Subversion repository likely takes up gigabytes. And
finally, to add insult to injury, the Git "bare" clone is 144MB.
The Mozilla project's CVS repository is about 3 GB; it's about 12
GB in Subversion's fsfs format. In Git it's around 300 MB.
Git's packing format is fairly simple, yet so effective.
Fast. Well, duh. With everything being local, operations that
work on multiple revisions will be fast. Beyond this, decentralized VCS
is generally faster on all operations, except the initial clone.
Reduced politics. With a central repository, someone or some
group has to decide who has write access and who doesn't. Developers
without write access are basically stuck without version control,
unless they hack in their own. In the decentralized model, everyone
has write access to their own personal repository, and others can
choose, on their own, to pull revisions from it.
Secure. A centralized VCS has a central, single point of
failure. If that single point is compromised, the server needs to be
restored from backups. Or worse, the compromise goes unnoticed and the
repository history is modified without anyone ever being able to tell.
In a distributed model, each revision (and in Git's case, the files
themselves) is referenced by a hash (SHA-1 in Git's case) of it's
contents, a
content-addressable storage system. Thanks to this, a file, no
matter where it is in the tree or in history, is stored only
once. The main purpose is to avoid collisions between revision
identifiers on parallel lines of development. It also happens to make
the repository tamper-proof.
If you know the revision ID of your HEAD no one will be able to
change any of its history. This is because each revision contains the
ID of its immediate ancestor, all the way back to the initial
commit. If a previous commit changed, it would change the ID of every
following commit. An attacker would have to find a desired collision
for each one: simply impossible.
The hash addressing also provides integrity, as corruption in the
repository is easily detected.
Another security gain, related to the reduced politics note, is the web of
trust. This is the same way PGP handles key authentication. In a
large project, a single developer may only trust a handful of people
to be competent programmers, and therefore only pull from these
developer's repositories. Those developers they pull from also have
their own set of people they trust. In this way, revisions can
safely be pulled from distant strange repositories through the web of
trust.
The only reason to interact with a Subversion repository is for legacy
reasons. Luckily, you don't have to use Subversion to use Subversion.
That wasn't a typo. Git has a Subversion/Git interfaced called
git-svn. I used it to convert my Subversion repositories to Git, but
it can be used as a fully functional Subversion client. It can clone
the Subversion repository and continue to pull changes from it as it
updates.
On your end, you can make commits to your local repository, use cheap
branches, and so on,all of which stay local. Changes can be pushed
back upstream to Subversion with the dcommit command,
which would be done after rebasing any changes on top of the current
Subversion HEAD. This provides most of the advantages of Git without
worrying about having the central repository change.
One of the major complaints about Git is that it once lacked a
plethora of GUIs, like CVS and Subversion have. Git does have
GUIs. I looked at a couple of them out of curiosity, so I am not sure
how good they are by comparison. I also have barely use any other VCS
GUIs. The ones I have used I find incredibly annoying.
I don't understand why people insist on using them anyway. It's like
using training wheels on a bike and claiming that it's better that
way. No, those training wheels just get in the way.
To be brutally honest, if you don't want to use Git because you are
afraid of the command line, what are you doing coding in the first
place?
One topic left is issue tracking. In a centralized VCS, you have a
centralized tracker. Subversion has Trac, for example. Well, what
about distributed VCSs? They should have distributed issue tracking
right?
In my previous article I
drew those red dice myself, using
GIMP. Since I really enjoyed figuring out how to do it, and
actually doing it, here is a little tutorial.
The numbers and sizes are arbitrary, so feel free to adjust things if
you think they look better. I am no artist. I am sure someone could
take this further to make it look better, perhaps by making the pips
look indented, or adding some transparency effect so the dice look
clear. I am a GIMP newbie.
Step 1: Make a Face
In GIMP, create a new 300x300 image and fill it with a dark red. I
used c30808 for this. This will be the base color of the
dice, so if you want differently colored dice, choose whatever color
you like.
Step 2: Make Pips
Next we use the ellipse selection tool (e) to make pips. In the
settings, set the ellipses to a fixed size of 75x75.
Create the ellipse and move it to the upper left-hand corner. Use the
arrow keys to nudge it to position (5, 5) -- or just type in these
values.
Use bucket fill (shift+b) to fill the selected area with white, or
whatever color you want your pips to be. Keep doing this to make pips
in each corner. The positions should be (5, 220), (220, 5), and (220,
220). This makes the 4 face. Put a fifth pip in the middle, (112,
112), to turn it into a 5 face.
Step 3: Make More Faces
You now have one face of your die. The 1, 2, 3, and 4 faces are the
same as the 5 face, but fewer pips. In the layers dialog name the
current layer "5". Now, duplicate the layer (shift+ctrl+d) and name
this new layer 4. Use either the paintbrush tool (p) to paint your
base color over the middle pip, or use the selection tools to remove
it.
Keep duplicating layers and removing pips until you have 5 faces: 1,
2, 3, 4, and 5.
Duplicate the "4" layer and create two more pips to make the 6
face. You now have 6 layers, each containing a single face. Here is
my .xcf when I was done:
dice-faces.xcf.
Step 4: Map it to a Cube
Now comes the fun part, the real guts of the drawing. You are going to
map these layers onto a cube. Go to Filters -> Map -> Map Object. Map
to "Box" and select Transparent background and Create new
image.
Under the Orientation tab adjust the rotation. For the first
die, try something like (20, 40, -5). If you enable Show preview
wireframe you can see your adjustments live. Just don't make these
values too high or it will make the next step more difficult.
Under the Box tab set the Front, Top,
and Left faces to different layers. Note that the opposite
sides of a die always add up to 7. That is, 1 is opposite to 6, 2 is
opposite 5, and 3 is opposite 4. Here is how a typical die looks.
If you are really picky, you might want to pay attention to the
orientation of the 3's, 2's, and 6's and flip those layers
accordingly.
Hit Preview! to see your work. If you are happy,
click OK. Autocrop the new image with Image -> Autocrop Image.
Step 5: Make More Dice
Do this a few more times with different faces at different
orientations. I will make just one more for the example.
Create a new 640x480 image with transparent background. Copy and paste
your dice into this image. After each paste, make a new layer
(shift+ctrl+n), so each die gets its own layer. Use the Move tool (m)
to adjust the dice into a sort-of mid-roll. Whatever looks good.
Step 5: Make Shadows
The last part left is the shadow. First, merge the visible layers
(shift+m), then duplicate the remaining layer. Call this new layer
"Shadow".
Go to Colors -> Brightness-Contrast. Set contrast to -127. This will
be the shadow. If you want a darker or lighter shadow, open the same
dialog again and adjust the brightness. Next scale the shadow layer
vertically by 50%. You want the width to remain the same.
Select the Sheer tool (shift+s) and sheer the layer in the X direction
-100 pixels. Move the shadow layer to the bottom. Now use the Move
tool (m) to move the shadow into an appropriate position.
You can add a
penumbra by applying a Gaussian blur to the shadow layer: Filters
-> Blur -> Gaussian Blur. I blurred mine by 5 pixels.
Finally, you might want to autocrop the layers, then fit the image
canvas to the layers, which will get rid of the excess border.
Diceware is
a method of easy-to-remember, easy-to-type, secure passphrase and
password generation. It works completely off-line and requires no
computer whatsoever, apart from retrieving the Diceware list. By
taking the passphrase generation off-line there is less room for
mistakes to be made.
The reason these password are easy to remember is that they are simply
a series of words in your native language. This also tends makes them
easier to type without lots of practice as you should already be used
to typing words.
Because the official Diceware website is frequently down or unusable,
I have mirrored the original page here,
The lists are cryptographically signed by Arnold G. Reinhold so you
can verify that I have not tampered with it. I must also note that,
unfortunately, the author requires that this list only be distributed
non-commercially, which limits its usefulness but allows me to
distribute it here.
I also came across another list called DialDice, which I have
mirrored here and signed with my own key,
Diceware works by rolling five 6-sided dice (or rolling one 6-sided
die five times, etc.) and using the result to look up the word in one
of the above lists of 6^5, or 7776, words. Each word is worth about
12.9 bits of entropy,
log2(7776) =~ 12.9248
So if you want a password worth about 40 bits -- which is about 7
letters of a random alphanumeric password -- you would generate three
Diceware words. They can be concatenated in any order and in any
fashion. When I use Diceware, I just mash them together, like
"lancealertgrow". Note that the space bar makes a distinctive sound
when pressed, so if you put spaces between your words a listener will
be able to tell how many words you use.
If you don't like what you rolled the first time, DO NOT
generate a new one as an attempt to get someting "better". If you do
this, you will greatly weaken your passwords because you are selecting
passwords from a much smaller pool of possibilities (a very small pool
that contains only passwords you like).
The number of possible three-word passwords is 470,184,984,576. That's
right: 470 billion. Because you are selecting passwords with
dice, each password is as equally likely as the next. Even if an
attacker knew you used Diceware and knew what word list you used, that
still leaves a handful of guesses out of that 470 billion
possibilities.
At first it may be confusing, but it actually doesn't matter what the
words are or how long they are. It doesn't matter that there are no
capitals or special characters. It is the simple fact that there are
7776 words, and one was selected three times.
7776 * 7776 * 7776 = 470184984576
The Diceware website goes into a bit more detail on this.
If the computer system you use annoyingly requires passwords to
contain special characters (which is done to increase entropy in
passwords that actually are poor), Diceware also provides a
method of adding some of those, which adds a couple more bits worth of
entropy. If you don't care about those extra bits, you can throw your
own in.
For passphrases,
Diceware recommends 6 words, about 77.5 bits, which it claims
should be out of range of brute-force attacks from anyone for at least
the next 20 years. If you really think you need more than 6 words, you
should consider hiring guards for all your computer equipment.
I was working on my own Diceware word list to release into the public
domain. The purpose was to provide a word list without any
distribution restrictions, unlike all the Diceware lists I have
found. But making word lists is hard! I wrote a number a little
filters -- word length, no sub-words, spell checking, no special
characters, etc -- to pull out good words from a large word list, then
used some sample English text from Wikipedia to get some frequency
information so that I am was selecting common words over less common
words. I still need to go over the final list by hand to make sure it
all looks good. This is a long tedious process. Carefully examining
7776 words is quite a lot of work. Someday I will finish it.
In the event that you don't have any dice available, or you want to be
able to generate Diceware passwords on the fly automatically, I have
written a little program that will use /dev/random,
assuming there is a true RNG behind it, to roll virtual dice. It can
use a Diceware word list or your local dictionary word list which
contains many more words (usually found at
/usr/share/dict/words). Grab it here,
When I played a Dungeons and Dragons campaign back in college, I
thought it would be interesting to take notes and chronicled the
campaign as a short story. I have
shared a piece of it already, but I can't remember too many more
details. It would have been a lot of work to do it all.
Well, a couple of years ago, Shamus Young put in the huge effort and
did document a campaign that he ran. It is called The Twenty
Sided Tale. If you enjoy tabletop games, or are interested in
them, I strongly recommend reading through it. Hell, even if you want
to read a great story, I recommend it.
Sprinkled in between the story's paragraphs are DM notes where Shamus
gives some in-game context to the events. It provides some information
about what the players know and reasoning about their decisions. The
most interesting are his insights as a DM. It is really quite
inspiring. Shamus knows his games.
The story is really amazing as well. He actually had two unrelated
stories going that the players managed to merge together perfectly all
by accident. The end was so fantastic that it gave me chills the first
time I read it.
If you're still here, go click that link and start reading.
I came across the Ultra
High Security Password Generator the other day, which uses a very
high quality pseudo-random number generator to generate passwords and
keys. The idea is not to use the full 63 characters as a password, but
rather a contiguous subset, such as the first 8 characters.
The website is served securely, so no middle man can sniff at it. If
you trust the maintainer not to store the generated numbers somewhere,
which he claims not to be doing, then you can use it as a nice
password generation service. If you are nervous about matching
password possibilities to your IP, then grab a batch through the Tor network.
In terms of brute-force attacks, each letter from hexadecimal
characters is worth 4 bits, from random printable ASCII characters is
worth about 6.6 bits, and alpha-numeric is worth about 6 bits.
If you go with the set of printable ASCII characters, 6 to 8
characters (39 to 52 bits) should be good enough for an account
password, where an attacker must guess through an attempt-limiting
authentication system. 8 to 12 characters (53 to 79 bits) should be
plenty for a passphrase used as an encryption key, where an attacker
can brute force passphrase attempts at his leisure.
I wouldn't use this website for any serious encryption, though. If he
is logging generated passwords, he will have a nice short list of
possible passphrases to try against your encryption. So don't use it
for your GPG passphrase. Generate those locally.
This is where the purpose of my post comes in! He describes the
pseudo-random number generator he uses to generate the random numbers
at the top of the page. It's the AES
block cipher in
cipher-block chaining (CBC) mode encrypting a 128-bit counter. A
picture (well, diagram) is worth a thousand words,
Note that the diagram is actually being explicit about CBC mode, so
the AES cipher in the diagram is in electronic codebook (ECB) mode. I
missed this myself when initially interpreting the diagram and writing
my implementation. Here is the same diagram with the AES cipher
already in CBC mode,
I don't know if he designed this himself or not. I implemented it in C
to study it a bit. You can grab it here with git (or follow the link
and get a snapshot),
To build it, you will need
libgcrypt installed (with headers).
Here is about 100 megabytes analyzed with ent, a pseudo-random number
sequence test program.
Entropy = 7.999998 bits per byte.
Optimum compression would reduce the size
of this 100000000 byte file by 0 percent.
Chi square distribution for 100000000 samples is 275.58, and randomly
would exceed this value 25.00 percent of the times.
Arithmetic mean value of data bytes is 127.5095 (127.5 = random).
Monte Carlo value for Pi is 3.141182766 (error 0.01 percent).
Serial correlation coefficient is 0.000005 (totally uncorrelated = 0.0).
Wow! These results are great! This is exactly what you would see if
you ran 100 megabytes of /dev/random, a true
random number generator, through ent. It is also pretty fast,
generating that 100 megabytes on my laptop in about 7 seconds. That's
much faster than Linux's /dev/urandom (over a minute
here), whose ent results aren't quite as good, either.
Note: Before you go using this somewhere important you should
make sure this PRNG has been peer reviewed and carefully studied by
professionals with cryptanalysis. It may have fundamentals flaws. I
only found this on a website somewhere!
Still, that's a pretty damn cool pseudo-random number generator.
The generator is only useful if you want to generate more than 512
bits worth of numbers, because it takes 512 bits of randomly generated
numbers to initialize the generator. If you want to generate a single
password if the same strength, give this a shot,
head -c 50 /dev/random | tr -cd "A-Za-z0-9@#\!\$%^&*()_+=-~;,.<>/[]{}|?:'\\\`" && echo
It uses a true random number generator and selects from 94 printable
ASCII characters (space not included).
Try it out right now if you don't feel like reading: my fantasy name generator. I need your help
designing some name patterns.
Whether choosing a name for my character in a fantasy game or
populating a world which I pretend to myself that I will one day DM, I
have always gone to the
RinkWorks Fantasy Name Generator. The author of this tool, Samuel
Stoddard, gives some
history on how he came to design and develop it.
It works by using pattern to select sets of letters to put together
into a name. There is a thorough, long
description on the website. Unfortunately, he didn't share his
source code and so we can see how he did it.
Therefore, I used his description to duplicate his generator.
It includes a command line interface as well as a web interface, which
I am running and linked to at the beginning of this post for you to
use. The code is available under the same license as Perl itself.
I used Perl and the
Parse::RecDescent parser generator. Thanks to this module, it
essentially comes down to about 40 lines of code. The name pattern is
executed, just like a computer program, to generate a name. Here is
the BNF grammar I came up with,
The program is just that, decorated with some bits of Perl. Since I
came up with it, I have found that it is slightly different than
Mr. Stoddard's generator, in that his allows empty sets anywhere.
Mine only allows them at the end of lists. For example, this is valid
for his generator,
<|B|C>(ikk)
But to work in mine, the empty item must be moved to the end,
<B|C|>(ikk)
This can be adjusted my making the proper changes to the grammar,
which I haven't figured out yet.
Another problem with mine is that Parse::RecDescent is
slooooowwwww. Ridiculously slow. Maybe I designed the grammar
poorly? This is probably the biggest problem. Even simple patterns can
take several seconds to generate names, specifically with deeply
nested patterns. For example, this can take minutes,
<<<<<<<s>>>>>>>
Before you go thinking you are going to tank my server, I have written
the web interface so that it limits the running time of the
generator. If you want to do something fancy, use your own hardware. ;-)
There is also a problem that it will silently drop invalid pattern
characters at the end of the pattern. This has to do with me not quite
understanding how to apply Parse::RecDescent yet.
And this is where I need your help. I have had some trouble coming up
with good patterns. I don't even have a good default, generic fantasy
name pattern. Here are some of mine,
<s|B|Bv|v><V|s|'|V><s|V|C> # default
<i|Cd>D<d|i> # idiot
<V|B><V|vs|Vs> # short
None of which I am very satisfied.
You can design patterns for Nordic names, Gallic names,
Tolkienesque Middle Earth names, orc names, idiot names, dragon names,
dwarf names, elf names, Wheel of Time names, and so on. There is so
much potential available with this tool.
You can see what patterns people have been trying in the log
file. To suggest one to me, e-mail me some patterns, or even
better, clone my git repository and add one to it yourself (then ask
me to pull from you). This way your credit will stay directly attached
to it with a commit.
Good luck!
Don't stop here! This isn't everything. Check out the archives
(on the left) for more posts. Or just have a look at
the index.