A regular expression to check for prime numbers (2007) (noulakaz.net)
289 points by graderjs on March 5, 2022 | 121 comments


I'll be "that guy" -- this is not a regular expression [1]. A Perl regexp, sure. But backreferences are not part of the grammar of regular expressions, and moreover, using them in that particular manner (to refer to a capture of unbounded length) cannot be encoded as a DFA and therefore does not describe a regular language.

(I have no clue whether unary-representation primality is actually regular or not -- it's been far too long since my automata class to remember how to (dis)prove that -- but I guess it's at least sublinear, since primality is polynomial w/r/t the logarithm of the number being tested.)

[1] https://en.wikipedia.org/wiki/Regular_expression#Formal_defi...


you can use the pumping lemma (https://en.wikipedia.org/wiki/Pumping_lemma_for_regular_lang...) to do this proof trivially. it'll look something like

{1^x : x is prime} is not a regular language. suppose it is a regular language. then by the pumping lemma there exists a p such that every string of length at least p can be written as w = xyz, where |y| >= 1, |xy| <= p [not an important observation for this proof], and we can "pump" a new string w' = xy^nz (for any n > 0) and have w' remain in L.

let p' be the prime immediately above p. then the string

w = 1^p'

is in the language L and must be "pumpable" for some N = |y| > 0. Choose the number of times we "pump" (n in the expression xy^nz) to be p' + 1. Then that would imply

w' = 1^(p' + p' * N)

is in the language L, a contradiction, as p' + p' * N has factors p' and (1 + N). Note we can also use the CFL pumping lemma to prove that this particular language is not context-free either; if we choose s = uvwxy, we can still just pump p' + 1 times; the new string is just of size p' * (1 + |v| + |x|), and |v| + |x| > 0.
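
For concreteness, here is a rough numeric check of that argument (my own sketch, not part of the original proof), using the regex from the article to confirm that every pumped length p'(1 + |y|) is indeed composite:

    use strict; use warnings;
    my $p_prime = 11;                   # a prime above the (hypothetical) pumping length p
    for my $y_len (1 .. $p_prime) {     # every possible |y| >= 1
        my $pumped_len = $p_prime + $p_prime * $y_len;   # |x y^(p'+1) z| = p'(1 + |y|)
        die "length $pumped_len should be composite"
            unless ('1' x $pumped_len) =~ /^1?$|^(11+?)\1+$/;
    }
    print "every pumped string has composite length\n";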


> we can "pump" a new string w' = xy^nz (for any n > 0)

While this is certainly true, I would prefer to note that n ≥ 0. n > 0 is just a special case of that.

(You can see clearly that 0 repetitions will always be valid by considering the proof of the pumping lemma: since a regular expression is recognized by a finite state machine, the only way to recognize a string of length exceeding the (finite!) number of states in the machine is to visit some states more than once, which means traveling around a cycle in the state graph. The 'pumpable substring' of the pumping lemma is a path through such a cycle, and repeating it zero times corresponds to traveling around the cycle zero times. The rest of the path is just as valid as always.)


Fantastic! It's been 15 years since I've had to do those proofs, I'll use your proof to refresh my memory how they work.


Just to note additionally that the pumping lemma may only be used in that direction, not the other way.

The language {1^x : x is not prime} is easily pumped with x=z=1 and y=11. And regular languages are closed under complement.

No pumping Lemma -> Not regular, Yeah Pumping Lemma -> Who knows


Well, you can still easily base the proof on the pumping lemma:

1. The language {1^x: x prime} is not regular (as immediately shown by the pumping lemma).

2. The language {1^x: x not prime} is not regular (because it is the complement of {1^x: x prime}, and if it were regular its complement would also be regular, since regular languages are closed under complement).

It follows that the pumping lemma does not apply to the language {1^x: x not prime}; just because the string 1111 can be divided that way doesn't mean all strings in the language can be so divided.

> No pumping Lemma -> Not regular, Yeah Pumping Lemma -> Who knows

This is incorrect; any language to which the pumping lemma for regular languages applies is a regular language.


> This is incorrect; any language to which the pumping lemma for regular languages applies is a regular language

No. The wikipedia page has an example of a pumpable but non-regular language. You just need to take the union of a pumpable language and a non-regular language in a way that you can pump elements of the second into elements of the first. The Wikipedia example is a little more formal, but we can do something like this.

Consider, over the binary alphabet,

L1 = {s : 00 or 11 is a (consecutive) substring of s}

L2 = {(01)^n : n represents a Turing machine that halts on the empty tape}

and L = L1 union L2. Identifying elements of L2 is "difficult" (this is where I'm not being formal, but clearly this is beyond the power of regular languages) and the elements of the two languages are disjoint (so I'm not playing some tricks with my language secretly being Σ*).

Take the pumping length p = 3. There are only eight possibilities for the first three characters: 000,001,010,011 and their bitwise complements. In the first two cases pump on the last character as any string starting 00 is in the language L. In the third case pump on the middle character; either we get repeated 1s or the zeroes coalesce (note that this is exactly where we collapse L2 into L1 by pumping it). In the last case, pump on the first character; the repeated 1s ensure the final string is in L. For the complements just complement my constants.

So any string s in L can be pumped. But it's clear that no DFA is going to recognize L2, a disjoint language from L1. I use the halting problem for L2 as a "hard problem" but you can just choose an arbitrary other language L2 = {(01)^n : n encodes an element of L*, where L* is not regular}.


OK, you can trivially pump numbers with small factors. But what about the numbers that have all prime factors greater than the pumping length?


Just to tie everything up, if you have Dirichlet's theorem it's easy to show that {1^x: x not prime} is irregular through a "direct" proof by contradiction using the pumping lemma.

1. Assuming the language is regular, the pumping lemma tells us that it has a pumping length m, and all strings of length at least m can be divided into a prefix x, a suffix z, and a pumpable substring y, such that:

1a. |xz| < m

1b. |y| ≥ 1

1c. The concatenation x y^n z remains in the language for all n. (In this case, this means that |xz| + n*|y| is nonprime for all nonnegative n.)

2. Consider the string c = 1^p², where p is a prime number greater than m[1].

3. The pumping lemma will divide c into substrings x y z as described above.

4. Observe that |xz| < m < p and |y| = p² - |xz|

5. Dirichlet's theorem tells us that the arithmetic sequence {a + nd} contains infinitely many primes as long as a and d are coprime. Pumping our string c will generate strings with lengths belonging to the arithmetic sequence {|xz| + n(p² - |xz|)}. It remains to be shown that |xz| and (p² - |xz|) are coprime.

6. Observe that any number which divides both (p² - |xz|) and |xz| will also divide [(p² - |xz|) + |xz|] = p².

7. Since p is prime, the only divisors of p² are 1, p, and p².

8. Since |xz| < p, we can see that gcd(|xz|, (p² - |xz|)) = 1.

This completes the proof; pumping c is guaranteed to produce not just one but infinitely many strings of prime length.

[1] Euclid tells us that there is no greatest prime number, so we can guarantee the existence of such a p.
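
A rough numeric check of steps 6-8 (my own sketch, with an arbitrary prime p standing in for "a prime greater than m"; it covers 1 ≤ |xz| < p):

    use strict; use warnings;
    sub gcd { my ($a, $b) = @_; ($a, $b) = ($b, $a % $b) while $b; return $a }
    my $p = 13;                       # stands in for a prime larger than the pumping length m
    for my $xz (1 .. $p - 1) {        # any combined prefix/suffix length below p
        my $y = $p * $p - $xz;        # |y| = p^2 - |xz|
        die "not coprime at |xz| = $xz" unless gcd($xz, $y) == 1;
    }
    print "gcd(|xz|, p^2 - |xz|) = 1 for every 1 <= |xz| < p\n";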


The above proof is just stating that all nonprimes >= 4 are even, and therefore can be pumped -- so contradiction by pumping lemma does not apply.


Think about what you're saying. You think 15 is an even number?

maweki may be thinking that you can divide a string for pumping lemma purposes any way you please. If that were true, you could show that all strings of nonprime length could be pumped while still retaining nonprime length:

1. |s| is composite, so |s| = km for some integers k and m

2. Choose the length of the pumpable substring y to be k. Then |xz| = |s| - k = k(m-1), and |x y^n z| = k(m-1) + kn = k(m + n - 1). This is composite, so the pumping is valid.

The problem, as jwillk has identified, is that the pumping lemma doesn't allow this. The language has a pumping length (traditionally m), and you have to divide your string into three substrings x y z such that all of the following are true:

1. |y| > 0

2. |xz| < m

3. x y^n z remains in the language for all n.

Condition (2) prevents us from using the maweki strategy.


It’s been a while since I took Theory of Computation. Does this imply that primes in binary or n-ary are also not regular?


You would need a different proof (likely a much more complicated one) for n-ary primes.


https://math.stackexchange.com/questions/1232463/how-to-prov... has a proof for base 2. This method works for other bases too.


Nice!


this is why I love hackernews


I'll be "the other guy" -- no one uses the term regex or regular expression in a programming context to mean a strictly regular language in the theoretical computer science sense. Nitpicking this point is ignorant of how the usage of the term has evolved within the programming community.

Good. There's that obligatory conversation over and done with.


Having worked in an industry where the linear vs. exponential performance implications of regular expressions and PCREs were crucial (intrusion protection systems), I'd just as soon software professionals stop being sloppy with terminology.

Not to mention the plethora of mostly-but-not-totally compatible regexp languages which exist. C++ supports 6 [1], and that doesn't even include PCREs. Miscommunication about which language one is actually using is a common source of regexp bugs.

[1] https://en.cppreference.com/w/cpp/regex/syntax_option_type


> Having worked in an industry where the linear vs. exponential performance implications of regular expressions and PCREs was crucial (intrusion protection systems)

same boat. that's why we use re2 in a way inspired by Go (which means losing a few features but the tradeoff is worth it):

> The regexp implementation provided by this package is guaranteed to run in time linear in the size of the input. (This is a property not guaranteed by most open source implementations of regular expressions.) For more information about this property, see

> https://swtch.com/~rsc/regexp/regexp1.html

> or any book about automata theory.

https://pkg.go.dev/regexp


OCaml's regexps run in linear time also. They're based on Ville Laurikari's Tagged REs: https://laurikari.net/tre/


> C++ supports 6

Wow, TIL. I always thought C++ had evolved into an excessively complex language, but this really seems to take the cake. ECMAScript (with C++-only changes), basic POSIX, extended POSIX, POSIX awk, POSIX grep, and POSIX extended grep. You are right there is no PCRE in there, yet. (There's always C++23... and if not that, C++26 or C++30)

(Although ECMAScript regex was heavily inspired by Perl/PCRE anyway, albeit without being 100% compatible with either, so in a sense PCRE is already there.)


It's unlikely that std::regex will gain new dialects. Unfortunately it is very broken, so it's unlikely that a proposal to extend it would go through.


Can you elaborate (or point me to sources about) how std::regex is broken? I'd like to learn more.


AFAIK it's not even the interface that is broken, but some standard libraries pushed out slow implementations that cannot be changed without an ABI break.


Yes, thank you.

If you're sloppy with words you probably have other nasty habits. (apologies to Heinlein)

There are reasons to employ regular languages with extensions (to not quite context free languages) but be clear about what you're talking about.


I'd argue that if this is sloppy, then the use of "you" instead of "thou" is sloppy too. Using "you" for everything is confusing. How can I possibly know if you mean plural or singular?

Sure some people might say "thou" is archaic, however so is using "regex" outside of academic contexts to mean regular languages. I don't see how one could justify one but not the other.


Grammar sticklers usually fail on this point: human languages are not static, stuck in place from the time the stickler learned the rules.

The current state of a human language is defined by common usage. The purpose of a language is for people to be able to communicate ideas. Ideas exist all up and down a specificity spectrum, e.g. “regex” might mean a specific syntax or one of several syntaxes depending on context. Kind of like “kleenex” might refer to a specific brand or to the general concept of tissue.

If 99.9% of people successfully (by their own measure, not yours) communicate what they mean when they use the term “regular expression”, but you are among the 0.1% for which that interaction fails, then I would argue that the problem is your pedantry, not the term. I’m not saying this to be mean; I have similar issues.


I tried to imagine the test suite for these C++ variations but failed due to a coronary event


There is no such thing as the programming community. Only a huge number of partially overlapping communities using wildly different terminology. Some communities still talk about regular expressions in the original meaning (because labeled graphs are useful objects), just like some communities still use the word "algorithm" outside machine learning or "crypto" in non-blockchain contexts.

I personally prefer using "regular expression" in the language-theoretic sense and "regex" or "regexp" for related concepts in programming languages and software.


> I personally prefer using "regular expression" in the language-theoretic sense and "regex" or "regexp" for related concepts in programming languages and software.

This is also what Larry Wall proposes:

> [...] Pattern Matching, generally having to do with what we call "regular expressions", which are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here. I will, however, generally call them "regexes" (or "regexen", when I'm in an Anglo-Saxon mood).

https://raku.org/archive/doc/design/apo/A05.html


> (or "regexen", when I'm in an Anglo-Saxon mood)

There's some kind of idea going around that because the plural of "ox" is "oxen", plurals with -n must be triggered by a word ending in "x". So you see cutesy references to "boxen" or similar.

But -n is just an ordinary plural marker; the same ending that makes "oxen" the plural of "ox" also makes "kine" the plural of "cow" and "children" the plural of "child". (In those last two cases, the words are sporting two plural markers in sequence.)


>There's some kind of idea going around

It's not new, by any stretch. I remember seeing many such references in the early 2000s and beyond. Doing some vague poking around textfiles.com, I didn't pull any easy-to-find 1980s 'boxen' references. It almost looks like it might have been the addition of 'unix boxen' to the 'new hacker dictionary' by esr which spread the joke into the mid-90s communities. I wasn't online at that point myself, so I can't speak from any personal experience. But the various hacker zines etc are using 'boxen' by the later 90s and early 2000s, which would allow for it to have spread through the folks of the time finding and trolling through esr's version of the 'hacker dictionary'. https://www.dourish.com/goodies/jargon.html complains about esr adding unix references, and does not include any boxen references that are found in the http://cd.textfiles.com/hackersencyc/ETC/MISC/JARG_331.TXT version folks in the 90s would likely have spread around.

edit: nope. definitely there in the 80s on usenet

so he was just encoding existing slang usage, rather than introducing it, on this point.

https://www.usenetarchives.com/view.php?id=net.wanted&mid=PD...

more edit: considering the second oldest boxen reference in the usenet archive is esr himself, maybe he was the one pushing it :)

https://www.usenetarchives.com/view.php?id=comp.sys.att&mid=...


I didn't say it was new. I said it was current. (And, um, unmotivated.)

We might also note that from the perspective of modern English, plurals in -en are more characteristic of Geoffrey Chaucer than of Alfred the Great. We preserve three such words: oxen, children, and brethren. But only one of those is an Anglo-Saxon form; in Old English, the plural forms of ċild and broþor are ċildru and broþru[1]. The -ens appeared later.

[1] "Brother" is a fairly irregular word, but there's no form ending in -n.


Well. You're on Hacker News. So I'm going to quote the Jargon File and then you'll stop complaining. Right, cobber?

> Further, note the prevalence of certain kinds of nonstandard plural forms. Some of these go back quite a ways; the TMRC Dictionary includes an entry which implies that the plural of `mouse' is {meeces}, and notes that the defined plural of `caboose' is `cabeese'. This latter has apparently been standard (or at least a standard joke) among railfans (railroad enthusiasts) for many years.

> On a similarly Anglo-Saxon note, almost anything ending in `x' may form plurals in `-xen' (see {VAXen} and {boxen} in the main text). Even words ending in phonetic /k/ alone are sometimes treated this way; e.g., `soxen' for a bunch of socks. Other funny plurals are `frobbotzim' for the plural of `frobbozz' (see {frobnitz}) and `Unices' and `Twenices' (rather than `Unixes' and `Twenexes'; see {UNIX}, {TWENEX} in main text). But note that `Unixen' and `Twenexen' are never used; it has been suggested that this is because `-ix' and `-ex' are Latin singular endings that attract a Latinate plural. Finally, it has been suggested to general approval that the plural of `mongoose' ought to be `polygoose'.

> The pattern here, as with other hackish grammatical quirks, is generalization of an inflectional rule that in English is either an import or a fossil (such as the Hebrew plural ending `-im', or the Anglo-Saxon plural suffix `-en') to cases where it isn't normally considered to apply.


> or the Anglo-Saxon plural suffix `-en'

There is no Anglo-Saxon plural suffix -en; the suffix is -an.


> [1] "Brother" is a fairly irregular word, but there's no form ending in -n.

It's brethren, for the record.


That footnote is clearly talking about Old English.


I’ll be the guy who postscripts that “hardmode” regexes would have important mathematical and computer science implications when/if they solve certain problems (like identifying primes). So it’s not pure pedantry, and there’s an important difference between “regexes to solve programming problems” and “regexes solving math problems”.


"hardmode" regexes do have important computer science implications: they can be matched orders of magnitude faster [0].

Demonstrating the power of a regex engine by solving a problem like this is interesting and very clever, but it’s hopefully not used in production.

[0] https://swtch.com/~rsc/regexp/regexp1.html


Yeah I think we agree here. I didn’t mean to assert that chomskian regular grammars or regexes satisfiable by a state machine hadn’t already been useful; I meant to say that because they have existing math implications and properties that new uses for them would imply connections to other maths, analogous computations, new proofs or ways to solve things, etc.

And yeah, they’re faster because they can be statically compiled to the point of running strings “through” the expression rather than running the expression “on” the string, so to speak. (i.e as an FSM, iterating char by char the operations boil down to purely compare and branch to known destinations)


>no one uses the term regex or regular expression in a programming context to mean a strictly regular language in the theoretical computer science sense.

The RE2 library seems to:

>loosely speaking, a regular language is a set of strings that can be matched in a single pass through the text using only a fixed amount of memory.

https://github.com/google/re2/wiki/Syntax

Note that RE2 doesn't support backreferences.


Regular expressions are Regular Languages, thus equivalent to deterministic finite state automata. These languages have very interesting properties due to their regularity (eg. pumping lemma), and they are a very small strict subset of Context-Free languages, which are also a strict subset of Turing-Complete languages. If you add recursion and backtracking to RegExps you easily go outside the Regular Languages class, and you probably end up in the TC class. Now you are dealing with an undecidable machine which is also very hard to define and reason about (and you also have two more problems). What's the point of calling it regular?


I'm hitting Poe's Law here, are you joking?

Your average coding bootcamp kid cracking away in node and invoking the builtin regex engine has no idea what any of those words mean.

To those programmers the words "regular expression" have become completely separated from their original etymology in lexical analysis. To them a regex means "input to a regex engine", and thus the term "use a regular expression" means "use a regex engine". It could be "use a froofroo boingo", for all the meaning the words hold. "Regular expression" just became the vernacular due to its origination in lexical analysis.


Wow, I have no idea what regexps mean? I can write a book about regexps, and if you really want to learn what regexps are, let me know, we can schedule a session.

Anyway, real REGULAR expressions belong to the regular languages class, thus equivalent to DFAs. Given a regexp, you can construct a DFA in O(2^n), where n is the number of operators in the regexp. Now this DFA will have a linear time complexity match. If you extend regular expressions with backreferences, you can't build DFAs anymore, and you also lose the fundamental property of regularity: the pumping lemma for regular languages. Furthermore, the languages you can build with this extension go even outside the context-free class. Not only that: you are also dealing with an NP-complete matching. Again, why do you still call them regular?


I think you two are agreeing 100%, but your egos prevent you from even realizing it


This isn’t entirely a pedantic thing. Cloudflare had an outage a couple of years ago which was caused by someone not knowing the difference:

https://blog.cloudflare.com/details-of-the-cloudflare-outage...


Thanks for that.

My initial reaction to the title was "wait, you can describe prime numbers using formal languages?". I could feel it in my guts that this would probably have some mind blowing Math/CS implications.

A shame how I'm so rusty in my Mathematics and CS nowadays :(


had same immediate reaction, like whoa, huge if true, time to read the comments


I wish every time an application said "regular expression" it meant PCRE. People say sed supports RE but it's so divorced from PCRE I end up using `| perl -pe ....`


as the saying goes if your only tool is a hammer ... perl/regexp is obviously the wrong tool for the job. Finding new prime numbers, minting bitcoin, running seti@home, is what C++ exceptions have been invented for.


Coming soon: “Mining Bitcoin at Compile Time using constexpr”


While we're nitpicking, it doesn't check for prime numbers either. It checks for composite length.


"A Perl regexp, sure." is not part of the grammar of English because it lacks a verb.


Natural languages don't have a formal grammar.


perhaps you meant "does not follow". only the loose rules are "part of"

besides, "it is" is implied in that sentence, so there's your verb


So left recursive is not regular, so not a language? Every simple language is defined recursively. Regularity just defines it as terminating, to avoid the simplicity of recursion.

DFA is the absurdity, not NFA.


> DFA is the absurdity, not NFA.

Every non-deterministic finite state automaton can be converted into a deterministic finite state automaton that accepts the same language (see powerset construction [1]), so I am not sure what you are saying here.

[1] https://en.wikipedia.org/wiki/Powerset_construction


yes, and this absurdity does not make it a better language. rather a worse.

eg tons of simple recursively defined languages were converted to DFA monstrosities, and since then they became slower and unmaintainable.


What are you talking about? "Not a language?" "DFA is the absurdity?" I'm sorry, I don't follow.


read for example about the state explosion of DFA's:

"It is important because NFAs can be used to reduce the complexity of the mathematical work required to establish many important properties in the theory of computation. For example, it is much easier to prove closure properties of regular languages using NFAs than DFAs." https://en.m.wikipedia.org/wiki/Nondeterministic_finite_auto...

but more importantly look at the various attempts to solve the backtracking "problem" (stack overflow by unbounded recursion), as done recently in perl and pcre, which led to more insecure AND more unmaintainable regex code by converting their simple recursion to iterations on the heap. first it got much slower, as the state is now handled on the heap, and second it got much more insecure as DOS attacks can now exhaust the whole heap, not just around 1024 stack depths. and third, it got 10x more complicated, and unreadable.

ad language: you can of course define a language by forbidding left recursion, which yacc folks would prefer. but you can define it much easier with left recursive parsers, like 10x smaller. of course you can always expand simple NFA's to deterministic DFA's at the cost of exponential state explosion and increased complexity. but it is a monstrosity only impractical CS folks prefer. SW engineers prefer the similar, faster and more secure NFA backtracking versions, as long as you count the stack depth.


As I mentioned elsewhere, linear performance is a hard requirement for some domains such as intrusion prevention. Unbounded recursion isn't the primary problem; exponential runtime is.

If state-space explosion is a concern, you can implement a non-backtracking regular expression matcher using term rewriting. For true regular expressions, the size of the rewritten terms is a function of the size of the pattern (and thus performance is linear). For non-regular expressions, this does not hold.
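
To make the term-rewriting idea concrete, here is a minimal sketch (my own illustration, not the matcher described above) of a Brzozowski-derivative matcher: the pattern is kept as a small AST and rewritten once per input character, so nothing ever backtracks.

    use strict; use warnings;
    # AST nodes: ['empty'], ['eps'], ['char', c], ['cat', a, b], ['alt', a, b], ['star', a]

    sub nullable {                    # does the regex match the empty string?
        my ($r) = @_;
        my $t = $r->[0];
        return 0 if $t eq 'empty' || $t eq 'char';
        return 1 if $t eq 'eps'   || $t eq 'star';
        return nullable($r->[1]) && nullable($r->[2]) if $t eq 'cat';
        return nullable($r->[1]) || nullable($r->[2]) if $t eq 'alt';
    }

    sub deriv {                       # rewrite $r into the regex matching what may follow $c
        my ($r, $c) = @_;
        my $t = $r->[0];
        return ['empty'] if $t eq 'empty' || $t eq 'eps';
        return $r->[1] eq $c ? ['eps'] : ['empty'] if $t eq 'char';
        return ['alt', deriv($r->[1], $c), deriv($r->[2], $c)] if $t eq 'alt';
        return ['cat', deriv($r->[1], $c), $r] if $t eq 'star';
        my $left = ['cat', deriv($r->[1], $c), $r->[2]];          # remaining case: 'cat'
        return nullable($r->[1]) ? ['alt', $left, deriv($r->[2], $c)] : $left;
    }

    sub matches {
        my ($r, $s) = @_;
        $r = deriv($r, $_) for split //, $s;   # one rewrite per character, no backtracking
        return nullable($r);
    }

    my $ab_star = ['star', ['cat', ['char', 'a'], ['char', 'b']]];   # (ab)*
    print matches($ab_star, 'ababab') ? "match\n" : "no match\n";
    print matches($ab_star, 'abba')   ? "match\n" : "no match\n";

A real implementation would also simplify terms after each step (e.g. rewriting ['cat', ['empty'], r] to ['empty']) and memoize them, which is what keeps the term size bounded for true regular expressions.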


You can also just implement an NFA but apply determinization to much of it, with tight memory allocation bounds, to get most of the benefit.


yeah, with practical backtracking regexes they limit their recursion depth to about 10. with DFA expansion this was pretty hard to implement, with simple NFA's it was trivial.

Pratt and esp. PEG solve the determinism problem, even better than yacc.


Aha. I wrote this post 15 years ago after explaining to my wife (who is a Language major) how this "regular expression" works.

It has been discussed a few times on Hacker News, Reddit, etc. over the past 15 years...


As an aside, a Hungarian mathematician wrote an entire book explaining how mathematics works to, I guess, a language major (well, Marcell Benedek was a professor, a writer, a literary historian, a translator, a theatre director).

https://en.wikipedia.org/wiki/Playing_with_Infinity


Credit is due to Abigail <https://web.archive.org/web/2020/http://www.abigail.be/>. Please give it and be a good Internet citizen by amending the post.


@dang can we please add (2007) to the title?


It doesn't seem to be past its "best before" date.


What do you mean? I actually love posts about old articles, they're often more useful than any fleeting news.


I prefer the source, which I think is a clearer explanation. The key is that Perl's x operator turns the number into a string of 1s that is the length of the number. The rest is a hack to use regex to perform division using minimal capturing and back references.

http://montreal.pm.org/tech/neil_kandalgaonkar.shtml
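
For reference, the check from the linked write-ups, wrapped in a tiny script (a sketch; the helper name is my own):

    use strict; use warnings;
    sub looks_prime {
        my ($n) = @_;
        # 'x' builds the unary form: 7 -> '1111111'; the regex matches 0, 1 and composite lengths
        return ('1' x $n) !~ /^1?$|^(11+?)\1+$/;
    }
    printf "%2d: %s\n", $_, looks_prime($_) ? "prime" : "not prime" for 1 .. 12;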


It's crazy how fast Perl's regex engine does this!


It's basically https://en.wikipedia.org/wiki/Trial_division using the + operator, and enforcing the bounds of the match to require a zero remainder.


I think it’s worth noting that the original author of the regex primality test is Abigail, a well known Perl hacker.

https://www.masteringperl.org/2013/06/how-abigails-prime-num...

This blog article does a nice job discussing Abigail’s work.


Can you decide if a number is prime using a true regular expression?

No. This won't work because the time cost of primality testing is too high (cubic in the number of binary digits?), and checking if a string satisfies a regular expression takes only linear time. Since cubic is bigger than linear, this won't work. Too good to be true!

Not a proof, but a rule of thumb. No need to even use the pumping lemma.


> This won't work because the time cost of primality testing is too high

This is just a conjecture, though.

> No need to even use the pumping lemma.

But it's really easy to prove this with the pumping lemma. This could be the first problem in a textbook section that introduces the pumping lemma. It would be difficult to come up with a language for which the pumping lemma disproof was more obvious.


The canonical example for pumping lemma disproof is a^n b^n, which is more obvious IMO.


Here are the disproofs:

-----

(1) a^n b^n

The pumping lemma tells us that any sufficiently long string in this language can be divided into three substrings x y z such that x y^n z remains in the language for all n.

There are two cases:

(1)(a) The number of 'a' characters in y does not equal the number of 'b' characters. In this case, pumping y will violate the constraint that a string in the language must have an equal number of 'a's and 'b's.

(1)(b) y consists of k 'a's followed by k 'b's for some k > 0. In this case, pumping it will violate the constraint that for every string in this language, each 'a' character must precede every 'b' character.

Thus, the language a^n b^n is not regular.

-----

(2) a^n, where n is prime

The pumping lemma tells us that any sufficiently long string in this language can be divided into three substrings x y z such that x y^n z remains in the language for all n. [same as above]

But the string x y^|xz| z has length |xz| + |y|*|xz| = |xz|(1 + |y|), which is not prime (choosing the original string long enough that |xz| ≥ 2).

Thus, the language a^n (where n is prime) is not regular.

-----

Which of those was easier?


The proof for a^n b^n, as I was taught, being that

-> there exists an integer N such that the pumping lemma holds for any w in L longer than N. (That is, we can divide w into xyz and pump y, where |xy| ≤ N, and y is nonempty).

-> now consider a^N b^N; we can't divide it into xyz such that we can pump y and have the result remain in a^n b^n.


> Since cubic is bigger than linear, this won't work.

So it actually 'works', it's just slower.


No, the point is that the article does not use a real regular expression: 1. Real regular expressions can always be checked in linear time. 2. Primes can not be checked in linear time. Ergo: primes can not be checked with regular expressions, otherwise there would be a way to check primes in linear time, which would contradict premise 2.


What it is really doing is checking if the length of a string is a prime number or not. And as coded, it is checking specifically if the length of a string consisting of some number of '1' characters is prime or not.


I've done some Codewars problems where you have to write a regex to check whether an input number is divisible by a specific number, so I was really interested to see how that would work for primes.

Nope, it's just playing with lengths.


- it is turning the number into a string of 1s: 3 => 111, 4 => 1111, and so on

- then trying to simulate a divisibility test by trying to find perfect groups of '11' to simulate division by 2, failing which (that is the backtracking part) trying to find perfect groups of '111' to simulate division by 3 and so on...

Clever but very inefficient
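
A rough illustration of that search order (my own example): because the group is non-greedy, a successful match captures the smallest group size that divides the length, i.e. the smallest factor.

    use strict; use warnings;
    for my $n (9, 15, 21, 35) {
        if (('1' x $n) =~ /^(11+?)\1+$/) {
            printf "%2d = %d x %d\n", $n, length($1), $n / length($1);
        }
    }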


Interesting, I wrote a quick script to use this:

    #!/usr/bin/env ruby
    def is_prime(n)
      ("1" * n.to_i) !~ /^1?$|^(11+?)\1+$/
    end

    num = ARGV.shift
    $stdout.puts is_prime(num)

... then decided to test it on a button-mashed number:

$ ruby ~/prime.rb 767423

... and it hung.

Thought it was the length of the number, so I tried something smaller (101). Quick "true". Then I tried progressively more digits. I got to `ruby ~/prime.rb 7673121` (false) and it took less than a second.

But it doesn't like something about the number 767423.


The regexp basically tests all divisors of the number (from 2 to the number-1); if it finds one, it says false and stops, and if nothing is found, it says true.

101 is prime, but it does "only" a hundred checks approx.

7673121 is divisible by 3, so as soon as it tries to match groups of 3 it succeeds and stops.

767423 is prime, so it needs to do hundreds of millions of checks, that's probably why it hangs.

Basically: the runtime of the algorithm should be proportional to the smallest divisor (excluding 1) of the number.
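
A quick way to see that in practice (a sketch; Time::HiRes ships with Perl, 10005 is divisible by 3, and 10007 is prime, so the prime case has to exhaust every candidate group size):

    use strict; use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);
    for my $n (10005, 10007) {
        my $t0 = [gettimeofday];
        my $matched = ('1' x $n) =~ /^1?$|^(11+?)\1+$/;
        printf "%d: %s in %.3fs\n", $n, $matched ? "not prime" : "prime", tv_interval($t0);
    }

The thread's 767423 is the same situation as 10007, just with far more candidate group sizes to reject.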


And it takes space exponential in the number of bits in the input.


"Take a problem. Solve it with a RegExp. Now you have two problems".



Collapse that indirect pointer. The source is http://regex.info/blog/2006-09-15/247


Thx, I had actually no idea it came from there.


I feel like Charlie Kelly from IASIP reading this regex. "1. Starts out with 1, then there is other stuff...."

https://youtu.be/AFDU_R6ShmI?t=74


Huh, looking at some regex explaining sites, they don't seem to do a good job of simplifying.

Which is a shame because it's really easy to simplify away most of this regex.

/ does nothing

^ and $ only exist to make it do the whole string

| is just saying "or", and only the part to the right really matters.

And the ? doesn't do anything meaningful.

Throw all that out and all you have is (11+)\1+

It's still a mash of symbols, but it's much simpler.

And changing it to (cc+)\1+ would probably help naive readability too.


The question mark isn't meaningless - it changes the matching behavior of + from greedy to non-greedy. The non-greedy version will try to match shorter groups first. This should be more efficient, because non-primes will be detected sooner. Most numbers have a small factor, so they are caught a lot sooner.
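
A small demonstration of the difference (my own example): on a composite length the non-greedy group captures the smallest factor, while the greedy variant captures the largest proper factor.

    use strict; use warnings;
    my $s = '1' x 15;
    $s =~ /^(11+?)\1+$/ and print "non-greedy capture: ", length($1), "\n";   # 3
    $s =~ /^(11+)\1+$/  and print "greedy capture:     ", length($1), "\n";   # 5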


It takes just as long on primes, so the overall runtime will be very similar. So sure it's an optimization but it's not one to care about when you're trying to understand it.


my fave regex explainer gives this, idk that it adds much? https://regexper.com/#%2F%5E1%3F%24%7C%5E%2811%2B%3F%29%5C1%...


That's the best I've seen so far, at least. Much better than a bunch of bullet points of various indentation.


At first I was thinking this was actually operating on a binary representation of the numbers. Some interesting things become apparent when you look at numbers in binary, for example, numbers of the form 2^n-1 can be represented as a string of n 1s. From there, it's pretty obvious that if n is composite that 2^n is also composite since you can take that string of n=kp 1s and write it as k groups of p 1s so obviously, 2^p-1 divides 2^n-1. There's the famous quote from Leopold Kronecker, “God made the integers” and I always thought that when he did, he wrote them in binary.
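
A quick numeric check of that divisibility claim, reading it as 2^n - 1 (a sketch of my own): if p divides n, then 2^p - 1 divides 2^n - 1.

    use strict; use warnings;
    for my $p (2, 3, 5) {
        for my $k (2, 3) {
            my ($small, $big) = (2**$p - 1, 2**($p * $k) - 1);
            printf "2^%d - 1 = %5d divides 2^%d - 1 = %7d? %s\n",
                $p, $small, $p * $k, $big, $big % $small ? "no" : "yes";
        }
    }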


No, in unary.


.


From reading the proof, I think he meant “n composite => 2^n - 1 composite” and it’s just a typo


Yep



Here is a regex to check for integer powers of four: https://codegolf.stackexchange.com/a/223129

And a regex to integer-divide numbers by the square root of 2: https://codegolf.stackexchange.com/a/198428


This is interesting - I thought it would be something complex, but really it's just using the fundamental meaning of prime.

Literally what it's doing is turning a number into all 1s; if the number is 1, return true. Otherwise try all possible ways to put it into 2 or more equal groups, and return true once we find one that works.

You don't get closer to the definition of "prime" than that.


If the number of 1's itself is prime, then n is prime. That's all this does. Search the other comments for 'hung' or 'hang' to see examples of small numbers that take forever to resolve.


Sure. It's basically a less efficient form of trial division.


Simpler regexp:

    ^(11+)\1+$
it does not handle "1" for clarity. I also don't understand why the author used `+?`, because a simple `+` should work just as well. At least it works in Idea for me.


I wonder which one will reject composite numbers faster. The non greedy version starts with the smallest (potential) factor first, while the greedy version starts with the largest candidate factor first.

I suspect that since a factor must be smaller than the square root of a composite, the greedy version you proposed would waste a lot of time checking numbers which could never be factors of the composite under test.

If the number is prime, I think both variants have the same performance because they will exhaustively evaluate all numbers less than or equal to the number under test.

Edit: this is not quite correct. One of the prime factors must be smaller than the square root of the composite numbers, but the other one can obviously be larger. For example: 2 * 17 = 34. I’m leaving the original statement as is, because I’m hoping someone with a better math background can chime in and explain if the way that prime factors are distributed is symmetrical or not.


Yeah, should work even with greedy matching as it just does the same checks but in reverse order.


while compact and looks elegant...

it is basically trying to group by 11 (divide by 2), and if there is no match (via backtracking) group by 111 (divide by 3) and so on

not very efficient by any stretch esp with string matches.




    > require "prime"
    => true
    > 10.prime?
    => false


I'm going to test this on large numbers and post the results!


I can see you have a very efficient method for wasting time and energy.

Why? The method works. Size doesn't matter. But it's extremely inefficient. Which can be calculated rather than measured.


I think OP understands this and their comment was a joke.


Yep. It seems it doesn't matter what it is, there are always people trying to use it in a way it wasn't meant to be used.


Not sure if anything about his comment suggests that tbh.


Is this a case of P != NP?


primality testing is in P by the AKS construction (https://www.microsoft.com/en-us/research/wp-content/uploads/...)


You’ll probably not be able to finish before the 14-day reply period expires.


this is very useful; thanks!



