I'll be "that guy" -- this is not a regular expression [1]. A Perl regexp, sure. But backreferences are not part of the grammar of regular expressions, and moreover, using them in that particular manner (to refer to a capture of unbounded length) cannot be encoded as a DFA and therefore does not describe a regular language.
(I have no clue whether unary-representation primality is actually regular or not -- it's been far too long since my automata class to remember how to (dis)prove that -- but I guess it's at least sublinear, since primality is polynomial w/r/t the logarithm of the number being tested.)
{1^x : x is prime} is not a regular language. suppose it is a regular language. then by the pumping lemma there exists a p such that every string of length at least p can be written as w = xyz, where |y| >= 1, |xy| <= p [not an important observation for this proof], and we can "pump" a new string w' = xy^nz (for any n > 0) and have w' remain in L.
let p' be the prime immediately above p. then the string
w = 1^p'
is in the language L and must be "pumpable" for some N = |y| > 0. Choose n (in the expression xy^nz) to be p' + 1, i.e. pump in p' extra copies of y. Then that would imply
w' = 1^(p' + p' * N)
is in the language L, a contradiction, as p' + p' * N has factors p' and (1 + N). Note we can also use the CFL pumping lemma to prove that this particular language is not context-free either; if we choose s = uvwxy, we can pump in p' extra copies the same way; the new string is of size p' * (1 + |v| + |x|), and |v| + |x| > 0.
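(A quick numeric sanity check of that pumping step, in Ruby -- my own illustration, with a made-up p' = 11 standing in for the prime above the pumping length:)

    require 'prime'
    p_prime = 11                          # stand-in for p'
    (1..p_prime).each do |cap_n|          # cap_n = N = |y|
      pumped = p_prime + p_prime * cap_n  # |x y^(p'+1) z| = p' + p' * N
      raise "should be composite" if Prime.prime?(pumped)
    end
    # passes: every pumped length factors as p' * (1 + N)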
> we can "pump" a new string w' = xy^nz (for any n > 0)
While this is certainly true, I would prefer to note that n ≥ 0. n > 0 is just a special case of that.
(You can see clearly that 0 repetitions will always be valid by considering the proof of the pumping lemma: since a regular expression is recognized by a finite state machine, the only way to recognize a string of length exceeding the (finite!) number of states in the machine is to visit some states more than once, which means traveling around a cycle in the state graph. The 'pumpable substring' of the pumping lemma is a path through such a cycle, and repeating it zero times corresponds to traveling around the cycle zero times. The rest of the path is just as valid as always.)
Well, you can still easily base the proof on the pumping lemma:
1. The language {1^x: x prime} is not regular (as immediately shown by the pumping lemma).
2. The language {1^x: x not prime} is not regular either: if it were, its complement {1^x: x prime} would also be regular (regular languages are closed under complement), contradicting 1.
It follows that the pumping lemma does not apply to the language {1^x: x not prime}; just because the string 1111 can be divided that way doesn't mean all strings in the language can be so divided.
> No pumping Lemma -> Not regular, Yeah Pumping Lemma -> Who knows
This is incorrect; any language to which the pumping lemma for regular languages applies is a regular language.
> This is incorrect; any language to which the pumping lemma for regular languages applies is a regular language
No. The Wikipedia page has an example of a pumpable but non-regular language. You just need to take the union of a pumpable language and a non-regular language in a way that lets you pump elements of the second into elements of the first. The Wikipedia example is a little more formal, but we can do something like this.
Consider, over the binary alphabet,
L1 = {s : 00 or 11 is a (consecutive) substring of s}
L2 = {(01)^n : n represents a Turing machine that halts on the empty tape}
and L = L1 union L2. Identifying elements of L2 is "difficult" (this is where I'm not being formal, but clearly this is beyond the power of regular languages) and the elements of the two languages are disjoint (so I'm not playing some tricks with my language secretly being Σ*).
Take the pumping length p = 3. There are only eight possibilities for the first three characters: 000, 001, 010, 011 and their bitwise complements. In the first two cases pump on the last character, as any string starting 00 is in the language L. In the third case pump on the middle character; either we get repeated 1s or the zeroes coalesce (note that this is exactly where we collapse L2 into L1 by pumping it). In the last case, pump on the first character; the repeated 1s ensure the final string is in L. For the complements just complement my constants.
So any string s in L can be pumped. But it's clear that no DFA is going to recognize L2, a disjoint language from L1. I use the halting problem for L2 as a "hard problem" but you can just choose an arbitrary other language L2 = {(01)^n : n encodes an element of L*, where L* is not regular}.
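(To make the "010..." case concrete, a throwaway Ruby check -- my own illustration, not part of the original argument. "Pumping" the middle character either deletes the 1 so the zeroes coalesce, or duplicates it:)

    s = "010101"
    pump = ->(n) { s[0] + s[1] * n + s[2..] }
    pump.(0)  # => "00101"    -- contains "00", lands in L1
    pump.(3)  # => "01110101" -- contains "11", lands in L1
    [0, 2, 3, 4].all? { |n| pump.(n).match?(/00|11/) }  # => true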
Just to tie everything up, if you have Dirichlet's theorem it's easy to show that {1^x: x not prime} is irregular through a "direct" proof by contradiction using the pumping lemma.
1. Assuming the language is regular, the pumping lemma tells us that it has a pumping length m, and every string of length at least m can be divided into a prefix x, a pumpable substring y, and a suffix z, such that:
1a. |xy| ≤ m (in particular, |y| ≤ m)
1b. |y| ≥ 1
1c. The concatenation x y^n z remains in the language for all n ≥ 0. (In this case, this means that |xz| + n*|y| is nonprime for all nonnegative n.)
2. Consider the string c = 1^p², where p is a prime number greater than m [1].
3. The pumping lemma will divide c into substrings x y z as described above.
4. Observe that |y| ≤ m < p and |xz| = p² - |y| > 0
5. Dirichlet's theorem tells us that the arithmetic sequence {a + nd} contains infinitely many primes as long as a and d are coprime. Pumping our string c will generate strings with lengths belonging to the arithmetic sequence {|xz| + n|y|}. It remains to be shown that |xz| and |y| are coprime.
6. Observe that any number which divides both |y| and |xz| will also divide [|y| + |xz|] = p².
7. Since p is prime, the only divisors of p² are 1, p, and p².
8. Since 1 ≤ |y| < p, the only one of those that can divide |y| is 1, so gcd(|y|, |xz|) = 1.
This completes the proof; pumping c is guaranteed to produce not just one but infinitely many strings of prime length.
[1] Euclid tells us that there is no greatest prime number, so we can guarantee the existence of such a p.
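(A numeric illustration of steps 5-8, with made-up values: take p = 5, so |c| = 25, and suppose the division hands us |y| = 3, so |xz| = 22. Then gcd(3, 22) = 1 and the progression of pumped lengths hits primes quickly:)

    require 'prime'
    xz, y = 22, 3                               # |xz| = p**2 - |y|
    lengths = (0..10).map { |n| xz + n * y }    # pumped lengths
    lengths.select { |len| Prime.prime?(len) }  # => [31, 37, 43]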
Think about what you're saying. You think 15 is an even number?
maweki may be thinking that you can divide a string for pumping lemma purposes any way you please. If that were true, you could show that all strings of nonprime length could be pumped while still retaining nonprime length:
1. |s| is composite, so |s| = kl for some integers k, l ≥ 2
2. Choose the length of the pumpable substring y to be k. Then |xz| = |s| - k = k(l-1), and |xy^nz| = k(l-1) + kn = k(l + n - 1). This is composite, so the pumping is valid.
The problem, as jwillk has identified, is that the pumping lemma doesn't allow this. The language has a pumping length (traditionally m), and you have to divide your string into three substrings xyz such that all of the following are true:
1. |y| > 0
2. |xy| ≤ m (and hence |y| ≤ m)
3. xy^nz remains in the language for all n ≥ 0.
Condition (2) caps |y| at the pumping length m, which prevents us from using the maweki strategy: for a string like 1^(q²) with q a prime greater than m, no permitted |y| is a nontrivial divisor of the length.
I'll be "the other guy" -- no one uses the term regex or regular expression in a programming context to mean a strictly regular language in the theoretical computer science sense. Nitpicking this point is ignorant of how the usage of the term has evolved within the programming community.
Good. There's that obligatory conversation over and done with.
Having worked in an industry where the linear vs. exponential performance implications of regular expressions and PCREs were crucial (intrusion protection systems), I'd just as soon software professionals stop being sloppy with terminology.
Not to mention the plethora of mostly-but-not-totally compatible regexp languages which exist. C++ supports 6 [1], and that doesn't even include PCREs. Miscommunication about which language one is actually using is a common source of regexp bugs.
> Having worked in an industry where the linear vs. exponential performance implications of regular expressions and PCREs was crucial (intrusion protection systems)
same boat. that's why we use re2 in a way inspired by Go (which means losing a few features but the tradeoff is worth it):
> The regexp implementation provided by this package is guaranteed to run in time linear in the size of the input. (This is a property not guaranteed by most open source implementations of regular expressions.) For more information about this property, see
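For anyone curious how such a guarantee is possible at all, here is a toy Ruby sketch of the underlying idea (my own illustration, not RE2's actual internals): simulate the NFA by tracking the set of live states, one pass, no backtracking. The machine below is hand-built for the regular expression (a|b)*ab, anchored at both ends:

    NFA = {
      0 => { "a" => [0, 1], "b" => [0] },  # loop, or guess the final "ab" starts
      1 => { "b" => [2] },
    }
    START, ACCEPT = [0], [2]

    def nfa_match?(s)
      states = START
      s.each_char do |c|
        states = states.flat_map { |q| NFA.dig(q, c) || [] }.uniq
        return false if states.empty?
      end
      states.any? { |q| ACCEPT.include?(q) }
    end

    nfa_match?("aab")  # => true
    nfa_match?("aba")  # => false

Per-character work is bounded by the (fixed) number of states, so the whole match is linear in the input, no matter how pathological the input is.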
Wow, TIL. I always thought C++ had evolved into an excessively complex language, but this really seems to take the cake. ECMAScript (with C++-only changes), basic POSIX, extended POSIX, POSIX awk, POSIX grep, and POSIX extended grep. You are right there is no PCRE in there, yet. (There's always C++23... and if not that, C++26 or C++30)
(Although ECMAScript regex was heavily inspired by Perl/PCRE anyway, albeit without being 100% compatible with either, so in a sense PCRE is already there.)
AFAIK it's not even the interface that is broken, but some standard libraries pushed out slow implementations that cannot be changed without an ABI break.
I'd argue that if this is sloppy, then the use of "you" instead of "thou" is sloppy too. Using "you" for everything is confusing. How can I possibly know if you mean plural or singular?
Sure some people might say "thou" is archaic, however so is using "regex" outside of academic contexts to mean regular languages. I don't see how one could justify one but not the other.
Grammar sticklers usually fail on this point: human languages are not static, stuck in place from the time the stickler learned the rules.
The current state of a human language is defined by common usage. The purpose of a language is for people to be able to communicate ideas. Ideas exist all up and down a specificity spectrum, e.g. “regex” might mean a specific syntax or one of several syntaxes depending on context. Kind of like “kleenex” might refer to a specific brand or to the general concept of tissue.
If 99.9% of people successfully (by their own measure, not yours) communicate what they mean when they use the term “regular expression”, but you are among the 0.1% for which that interaction fails, then I would argue that the problem is your pedantry, not the term. I’m not saying this to be mean; I have similar issues.
There is no such thing as the programming community. Only a huge number of partially overlapping communities using wildly different terminology. Some communities still talk about regular expressions in the original meaning (because labeled graphs are useful objects), just like some communities still use the word "algorithm" outside machine learning or "crypto" in non-blockchain contexts.
I personally prefer using "regular expression" in the language-theoretic sense and "regex" or "regexp" for related concepts in programming languages and software.
> I personally prefer using "regular expression" in the language-theoretic sense and "regex" or "regexp" for related concepts in programming languages and software.
This is also what Larry Wall proposes:
> [...] Pattern Matching, generally having to do with what we call "regular expressions", which are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here. I will, however, generally call them "regexes" (or "regexen", when I'm in an Anglo-Saxon mood).
There's some kind of idea going around that because the plural of "ox" is "oxen", plurals with -n must be triggered by a word ending in "x". So you see cutesy references to "boxen" or similar.
But -n is just an ordinary plural marker; the same ending that makes "oxen" the plural of "ox" also makes "kine" the plural of "cow" and "children" the plural of "child". (In those last two cases, the words are sporting two plural markers in sequence.)
It's not new, by any stretch. I remember seeing many such references in the early 2000s and beyond. Doing some vague poking around textfiles.com, I didn't pull up any easy-to-find 1980s 'boxen' references. It almost looks like the addition of 'unix boxen' to the 'new hacker dictionary' by esr might be what spread the joke into the mid-90s communities. I wasn't online at that point myself, so I can't speak from any personal experience. But the various hacker zines etc. are using 'boxen' by the later 90s and early 2000s, which would allow for it to have spread through the folks of the time finding and trolling through esr's version of the 'hacker dictionary'. https://www.dourish.com/goodies/jargon.html complains about esr adding unix references, and does not include any boxen references that are found in the http://cd.textfiles.com/hackersencyc/ETC/MISC/JARG_331.TXT version folks in the 90s would likely have spread around.
edit: nope. definitely there in the 80s on usenet
so he was just encoding existing slang usage, rather than introducing it, on this point.
I didn't say it was new. I said it was current. (And, um, unmotivated.)
We might also note that from the perspective of modern English, plurals in -en are more characteristic of Geoffrey Chaucer than of Alfred the Great. We preserve three such words: oxen, children, and brethren. But only one of those is an Anglo-Saxon form; in Old English, the plural forms of ċild and broþor are ċildru and broþru[1]. The -ens appeared later.
[1] "Brother" is a fairly irregular word, but there's no form ending in -n.
Well. You're on Hacker News. So I'm going to quote the Jargon File and then you'll stop complaining. Right, cobber?
> Further, note the prevalence of certain kinds of nonstandard plural
forms. Some of these go back quite a ways; the TMRC Dictionary includes
an entry which implies that the plural of `mouse' is {meeces}, and notes
that the defined plural of `caboose' is `cabeese'. This latter has
apparently been standard (or at least a standard joke) among railfans
(railroad enthusiasts) for many years.
> On a similarly Anglo-Saxon note, almost anything ending in `x' may form
plurals in `-xen' (see {VAXen} and {boxen} in the main text). Even
words ending in phonetic /k/ alone are sometimes treated this way; e.g.,
`soxen' for a bunch of socks. Other funny plurals are `frobbotzim' for
the plural of `frobbozz' (see {frobnitz}) and `Unices' and `Twenices'
(rather than `Unixes' and `Twenexes'; see {UNIX}, {TWENEX} in main
text). But note that `Unixen' and `Twenexen' are never used; it has
been suggested that this is because `-ix' and `-ex' are Latin singular
endings that attract a Latinate plural. Finally, it has been suggested
to general approval that the plural of `mongoose' ought to be
`polygoose'.
> The pattern here, as with other hackish grammatical quirks, is
generalization of an inflectional rule that in English is either an
import or a fossil (such as the Hebrew plural ending `-im', or the
Anglo-Saxon plural suffix `-en') to cases where it isn't normally
considered to apply.
I’ll be the guy who postscripts that “hardmode” regexes would have important mathematical and computer science implications when/if they solve certain problems (like identifying primes). So it’s not pure pedantry, and there’s an important difference between “regexes to solve programming problems” and “regexes solving math problems”.
Yeah I think we agree here. I didn’t mean to assert that chomskian regular grammars or regexes satisfiable by a state machine hadn’t already been useful; I meant to say that because they have existing math implications and properties that new uses for them would imply connections to other maths, analogous computations, new proofs or ways to solve things, etc.
And yeah, they’re faster because they can be statically compiled to the point of running strings “through” the expression rather than running the expression “on” the string, so to speak. (i.e. as an FSM, iterating char by char, the operations boil down to pure compares and branches to known destinations)
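(A minimal sketch of that "compare and branch" picture, assuming a hand-compiled transition table for the regular expression (ab)+, anchored at both ends:)

    DFA = {
      0 => { "a" => 1 },  # start: expect 'a'
      1 => { "b" => 2 },  # after 'a': expect 'b'
      2 => { "a" => 1 },  # accepting: another "ab" may follow
    }

    def dfa_match?(s)
      state = 0
      s.each_char do |c|
        state = DFA[state][c] or return false  # no transition => reject
      end
      state == 2
    end

    dfa_match?("abab")  # => true
    dfa_match?("aab")   # => false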
>no one uses the term regex or regular expression in a programming context to mean a strictly regular language in the theoretical computer science sense.
The RE2 library seems to:
>loosely speaking, a regular language is a set of strings that can be matched in a single pass through the text using only a fixed amount of memory.
Regular expressions describe Regular Languages, which are equivalent to deterministic finite state automata. These languages have very interesting properties due to their regularity (eg. pumping lemma), and they are a very small strict subset of Context-Free languages, which are in turn a strict subset of the Turing-recognizable languages. If you add recursion and backtracking to RegExps you easily go outside the Regular Languages class, and you probably end up Turing-complete. Now you are dealing with an undecidable machine which is also very hard to define and reason about (and you also have two more problems). What's the point of calling it regular?
Your average coding bootcamp kid cracking away in node and invoking the builtin regex engine has no idea what any of those words mean.
To those programmers the words "regular expression" have become completely separated from their original etymology in lexical analysis. To them a regex means "input to a regex engine", and thus the term "use a regular expression" means "use a regex engine". It could be "use a froofroo boingo", for all the meaning the words hold. "Regular expression" just became the vernacular due to its origination in lexical analysis.
Wow, I have no idea what regexps mean? I can write a book about regexps, and if you really want to learn what regexps are, let me know, we can schedule a session.
Anyway, real REGULAR expressions belong to the regular languages class, and are thus equivalent to DFAs. Given a regexp, you can construct a DFA in O(2^n), where n is the number of operators in the regexp. Now this DFA will have a linear time complexity match. If you extend regular expressions with backreferences, you can't build DFAs anymore, and you also lose the fundamental property of regularity: the pumping lemma for regular languages. Furthermore, the languages you can build with this extension go even outside the context-free class. Not only that: you are also dealing with NP-complete matching.
Again, why do you still call them regular?
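(The O(2^n) construction mentioned above is the classic subset construction; here is a compact Ruby sketch of it, with the NFA given as state => { char => [states] } -- my own toy version, ignoring epsilon transitions:)

    require 'set'

    # Subset construction: each DFA state is a set of NFA states.
    def nfa_to_dfa(nfa, start, alphabet)
      dfa, queue = {}, [Set[start]]
      until queue.empty?
        s = queue.shift
        next if dfa.key?(s)
        dfa[s] = alphabet.to_h do |c|
          t = s.flat_map { |q| nfa.fetch(q, {})[c] || [] }.to_set
          queue << t
          [c, t]
        end
      end
      dfa
    end

    nfa = { 0 => { "a" => [0, 1] }, 1 => { "b" => [2] } }  # a+b, accept state 2
    nfa_to_dfa(nfa, 0, ["a", "b"]).size  # => 4 reachable subset-states (worst case 2^n)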
My initial reaction to the title was "wait, you can describe prime numbers using formal languages?". I could feel it in my guts that this would probably have some mind blowing Math/CS implications.
A shame how I'm so rusty in my Mathematics and CS nowadays :(
I wish every time an application said "regular expression" it meant PCRE. People say sed supports RE but it's so divorced from PCRE I end up using `| perl -pe ....`
as the saying goes, if your only tool is a hammer ... perl/regexp is obviously the wrong tool for the job. Finding new prime numbers, minting bitcoin, running seti@home -- that is what C++ exceptions were invented for.
So left recursive is not regular, so not a language? Every simple language is defined recursively. Regularity just defines it as terminating, to avoid the simplicity of recursion.
Every non-deterministic finite state automaton can be converted into a deterministic finite state automaton that accepts the same language (see powerset construction [1]), so I am not sure what you are saying here.
read for example about the state explosion of DFA's:
"It is important because NFAs can be used to reduce the complexity of the mathematical work required to establish many important properties in the theory of computation. For example, it is much easier to prove closure properties of regular languages using NFAs than DFAs." https://en.m.wikipedia.org/wiki/Nondeterministic_finite_auto...
but more importantly, look at the various attempts to solve the backtracking "problem" (stack overflow by unbounded recursion), as done recently in perl and pcre, which led to more insecure AND more unmaintainable regex code by converting their simple recursion to iterations on the heap. First it got much slower, as the state is now handled on the heap; second it got much more insecure, as DOS attacks can now exhaust the whole heap, not just around 1024 stack depths; and third, it got 10x more complicated and unreadable.
ad language: you can of course define a language by forbidding left recursion, which the yacc folks would prefer. but you can define it much more easily with left recursive parsers, like 10x smaller. of course you can always expand simple NFA's to deterministic DFA's, at the cost of exponential state explosion and increased complexity. but it is a monstrosity only impractical CS folks prefer. SW engineers prefer the similar, faster and more secure NFA backtracking versions, as long as you count the stack depth.
As I mentioned elsewhere, linear performance is a hard requirement for some domains such as intrusion prevention. Unbounded recursion isn't the primary problem; exponential runtime is.
If state-space explosion is a concern, you can implement a non-backtracking regular expression matcher using term rewriting. For true regular expressions, the size of the rewritten terms is a function of the size of the pattern (and thus performance is linear). For non-regular expressions, this does not hold.
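(One concrete reading of "term rewriting" here is Brzozowski derivatives: each input character rewrites the pattern, and no backtracking is ever needed. A bare-bones Ruby sketch of my own; a real implementation would also simplify terms after each step, which is exactly what keeps their size bounded for true regular expressions:)

    EMPTY = [:empty]  # matches only ""
    NONE  = [:none]   # matches nothing

    def nullable?(r)
      case r[0]
      when :empty, :star then true
      when :none, :chr   then false
      when :seq then nullable?(r[1]) && nullable?(r[2])
      when :alt then nullable?(r[1]) || nullable?(r[2])
      end
    end

    # Rewrite r into the pattern matching "what may follow c".
    def deriv(r, c)
      case r[0]
      when :empty, :none then NONE
      when :chr  then r[1] == c ? EMPTY : NONE
      when :alt  then [:alt, deriv(r[1], c), deriv(r[2], c)]
      when :star then [:seq, deriv(r[1], c), r]
      when :seq
        d = [:seq, deriv(r[1], c), r[2]]
        nullable?(r[1]) ? [:alt, d, deriv(r[2], c)] : d
      end
    end

    def rx_match?(r, s)
      s.each_char { |c| r = deriv(r, c) }
      nullable?(r)
    end

    ab_star = [:star, [:seq, [:chr, "a"], [:chr, "b"]]]  # (ab)*
    rx_match?(ab_star, "abab")  # => true
    rx_match?(ab_star, "aba")   # => false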
yeah, with practical backtracking regexes they limit their recursion depth to about 10. with DFA expansion this was pretty hard to implement, with simple NFA's it was trivial.
Pratt and esp. PEG solve the determinism problem, even better than yacc.
As an aside, a Hungarian mathematician wrote an entire book explaining how mathematics works to, I guess, a language major (well, Marcell Benedek was a professor, a writer, a literary historian, a translator, a theatre director).
I prefer the source, which I think gives a clearer explanation. The key is that Perl's x operator turns the number into a string of 1s whose length is the number. The rest is a hack to use the regex engine to perform division using minimal capturing and backreferences.
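For reference, the widely circulated form of the trick (usually attributed to Abigail; I'm assuming the article uses the same pattern), here in Ruby:

    def prime?(n)
      ("1" * n) !~ /^1?$|^(11+?)\1+$/
    end

    (1..20).select { |n| prime?(n) }  # => [2, 3, 5, 7, 11, 13, 17, 19]

The first alternative knocks out 0 and 1; the capture group plus backreference is the division hack described above.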
Can you decide if a number is prime using a true regular expression?
No. This won't work because the time cost of primality testing is too high (cubic in the number of binary digits?), and checking if a string satisfies a regular expression takes only linear time. Since cubic is bigger than linear, this won't work. Too good to be true!
Not a proof, but a rule of thumb. No need to even use the pumping lemma.
> This won't work because the time cost of primality testing is too high
This is just a conjecture, though.
> No need to even use the pumping lemma.
But it's really easy to prove this with the pumping lemma. This could be the first problem in a textbook section that introduces the pumping lemma. It would be difficult to come up with a language for which the pumping lemma disproof was more obvious.
(1) a^n b^n
The pumping lemma tells us that for any sufficiently long string in this language, we can divide the string into three substrings x y z such that xy^nz remains in the language for all n ≥ 0.
There are two cases for what y can be:
(1)(a) The number of 'a' characters in y does not equal the number of 'b' characters. In this case, pumping y will violate the constraint that a string in the language must have an equal number of 'a's and 'b's.
(1)(b) y consists of k 'a's followed by k 'b's for some k > 0. In this case, pumping it will violate the constraint that for every string in this language, each 'a' character must precede every 'b' character.
Thus, the language a^n b^n is not regular.
-----
(2) a^n, where n is prime
The pumping lemma tells us that for any sufficiently long string in this language, we can divide the string into three substrings x y z such that xy^nz remains in the language for all n. [same as above]
But the string x y^(|xz| + 2|y| + 2) z has length |xz| + (|xz| + 2|y| + 2)|y| = (|y| + 1)(|xz| + 2|y|), a product of two factors that are each at least 2, and hence not prime.
Thus, the language a^n (where n is prime) is not regular.
The proof for a^n b^n, as I was taught, being that
-> there exists an integer N such that the pumping lemma holds for any w in L longer than N. (That is, we can divide w into xyz and pump y, where |xy| ≤ N, and y is nonempty.)
-> now consider a^N b^N: we can't divide it into xyz such that we can pump y and the result still remains in a^n b^n, because |xy| ≤ N forces y to consist only of 'a's, so pumping changes the number of 'a's without touching the 'b's.
No, the point is that the article does not use a real regular expression:
1. Real regular expressions can always be checked in linear time
2. primes can not be checked in linear time
Ergo: primes can not be checked with regular expressions, otherwise there would be a way to check primes in linear time, which would contradict premise 2
What it is really doing is checking if the length of a string is a prime number or not. And as coded, it is checking specifically if the length of a string consisting of some number of '1' characters is prime or not.
I've done some Codewars problems where you have to write a regex to check whether an input number is divisible by a specific number, so I was really interested to see how that would work for primes.
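(Divisibility by a fixed d in unary really is regular, unlike the primality test: the pattern is just "d ones, repeated". A Ruby sketch for d = 3; decimal divisibility is regular too, via a remainder-tracking DFA over the digits, but the expression gets much uglier:)

    def divisible_by_3?(n)
      ("1" * n).match?(/\A(?:111)*\z/)
    end

    divisible_by_3?(9)   # => true
    divisible_by_3?(10)  # => false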
- it is turning the number into a string of 1s (3 => 111, 4 => 1111, and so on)
- then it simulates a divisibility test by trying to find perfect groups of '11' to simulate division by 2, failing which (that is the backtracking part) it tries to find perfect groups of '111' to simulate division by 3, and so on...
... then decided to test it on a button-mashed number:
$ruby ~/prime.rb 767423
... and it hung.
Thought it was the length of the number, so I tried something smaller (101). Quick "true". Then I tried progressively more digits. I got to `ruby ~/prime.rb 7673121` (false) and it took less than a second.
But it doesn't like something about the number 767423.
The regexp basically tests all divisors of the number (from 2 to the number minus 1); if it finds one, it says false and stops; if nothing is found, it says true.
101 is prime, but it does "only" a hundred checks approx.
7673121 is divisible by 3, so as soon as it tries to match groups of 3 it succeeds and stops.
767423 is prime, so it needs to do hundreds of millions of checks; that's probably why it hangs.
Basically: the runtime of the algorithm should be roughly proportional to the smallest divisor of the number greater than 1.
The question mark isn't meaningless - it changes the matching behavior of + from greedy to non-greedy. The non-greedy version will try to match shorter strings first. This should be more efficient, because non-primes will fail sooner. Most numbers will fail a lot sooner.
It takes just as long on primes, so the overall runtime will be very similar. So sure it's an optimization but it's not one to care about when you're trying to understand it.
At first I was thinking this was actually operating on a binary representation of the numbers. Some interesting things become apparent when you look at numbers in binary; for example, numbers of the form 2^n-1 can be represented as a string of n 1s. From there, it's pretty obvious that if n is composite then 2^n-1 is also composite, since you can take that string of n=kp 1s and write it as k groups of p 1s, so obviously 2^p-1 divides 2^n-1. There's the famous quote from Leopold Kronecker, “God made the integers”, and I always thought that when he did, he wrote them in binary.
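(The factorization claim is easy to spot-check in Ruby: 15 = 3 * 5, so both 2^3 - 1 and 2^5 - 1 should divide 2^15 - 1:)

    m = 2**15 - 1   # => 32767, fifteen 1-bits
    m.to_s(2)       # => "111111111111111"
    m % (2**3 - 1)  # => 0 (7 divides it)
    m % (2**5 - 1)  # => 0 (31 divides it)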
This is interesting - I thought it would be something complex, but really it's just using the fundamental meaning of prime.
Literally what it's doing is turning the number into all 1s; if the number is 1, return true. Otherwise try all possible ways to split it into 2 or more equal groups, and return true once we find one that works.
You don't get closer to the definition of "prime" than that.
The number of 1's is just n itself, so the regex is simply testing whether the count of 1's is prime. That's all this does. Search the other comments for 'hung' or 'hang' to see examples of small numbers that take forever to resolve.
it does not handle "1" for clarity. I also don't understand why author used `+?`, because simple `+` should work just as well. At least it works in Idea for me.
I wonder which one will reject composite numbers faster. The non greedy version starts with the smallest (potential) factor first, while the greedy version starts with the largest candidate factor first.
I suspect that since a factor must be smaller than the square root of a composite, the greedy version you proposed would waste a lot of time checking numbers which could never be factors of the composite under test.
If the number is prime, I think both variants have the same performance, because they will exhaustively evaluate all candidate group sizes up to the number under test.
Edit: this is not quite correct. One of the prime factors must be smaller than the square root of the composite numbers, but the other one can obviously be larger. For example: 2 * 17 = 34. I’m leaving the original statement as is, because I’m hoping someone with a better math background can chime in and explain if the way that prime factors are distributed is symmetrical or not.
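(The two variants side by side, for anyone who wants to experiment -- same language, different search order:)

    NONGREEDY = /^1?$|^(11+?)\1+$/  # tries the smallest group ("11") first
    GREEDY    = /^1?$|^(11+)\1+$/   # tries the largest group first

    ("1" * 9) =~ NONGREEDY  # matches with group "111" after rejecting "11"
    ("1" * 9) =~ GREEDY     # matches with group "111" after shrinking from "111111111"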
[1] https://en.wikipedia.org/wiki/Regular_expression#Formal_defi...