I'll be "that guy" -- this is not a regular expression [1]. A Perl regexp, sure. But backreferences are not part of the grammar of regular expressions, and moreover, using them in that particular manner (to refer to a capture of unbounded length) cannot be encoded as a DFA and therefore does not describe a regular language.
(I have no clue whether unary-representation primality is actually regular or not -- it's been far too long since my automata class to remember how to (dis)prove that -- but I guess it's at least sublinear, since primality is polynomial w/r/t the logarithm of the number being tested.)
{1^x : x is prime} is not a regular language. suppose it is a regular language. then by the pumping lemma there exists a p such that every string of length at least p can be written as w = xyz, where |y| >= 1, |xy| <= p [not an important observation for this proof], and we can "pump" a new string w' = xy^nz (for any n > 0) and have w' remain in L.
let p' be the prime immediately above p. then the string
w = 1^p'
is in the language L and must be "pumpable" for some N = |y| > 0. Choose n (in the expression xy^nz) to be p' + 1, i.e. pump in p' extra copies of y. Then that would imply
w' = 1^(p' + p' * N)
is in the language L, a contradiction, as p' + p' * N has factors p' and (1 + N). Note we can also use the CFL pumping lemma to prove that this particular language is not context-free either; if we choose s = uvwxy, we can pump in p' extra copies the same way; the new string is of size p' * (1 + |v| + |x|), and |v| + |x| > 0.
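(A quick numeric sanity check of that pumping step, in Ruby -- my own illustration, with a made-up p' = 11 standing in for the prime above the pumping length:)

    require 'prime'
    p_prime = 11                          # stand-in for p'
    (1..p_prime).each do |cap_n|          # cap_n = N = |y|
      pumped = p_prime + p_prime * cap_n  # |x y^(p'+1) z| = p' + p' * N
      raise "should be composite" if Prime.prime?(pumped)
    end
    # passes: every pumped length factors as p' * (1 + N)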
> we can "pump" a new string w' = xy^nz (for any n > 0)
While this is certainly true, I would prefer to note that n ≥ 0. n > 0 is just a special case of that.
(You can see clearly that 0 repetitions will always be valid by considering the proof of the pumping lemma: since a regular expression is recognized by a finite state machine, the only way to recognize a string of length exceeding the (finite!) number of states in the machine is to visit some states more than once, which means traveling around a cycle in the state graph. The 'pumpable substring' of the pumping lemma is a path through such a cycle, and repeating it zero times corresponds to traveling around the cycle zero times. The rest of the path is just as valid as always.)
Well, you can still easily base the proof on the pumping lemma:
1. The language {1^x: x prime} is not regular (as immediately shown by the pumping lemma).
2. The language {1^x: x not prime} is not regular either: if it were, its complement {1^x: x prime} would also be regular (regular languages are closed under complement), contradicting 1.
It follows that the pumping lemma does not apply to the language {1^x: x not prime}; just because the string 1111 can be divided that way doesn't mean all strings in the language can be so divided.
> No pumping Lemma -> Not regular, Yeah Pumping Lemma -> Who knows
This is incorrect; any language to which the pumping lemma for regular languages applies is a regular language.
> This is incorrect; any language to which the pumping lemma for regular languages applies is a regular language
No. The Wikipedia page has an example of a pumpable but non-regular language. You just need to take the union of a pumpable language and a non-regular language in a way that lets you pump elements of the second into elements of the first. The Wikipedia example is a little more formal, but we can do something like this.
Consider, over the binary alphabet,
L1 = {s : 00 or 11 is a (consecutive) substring of s}
L2 = {(01)^n : n represents a Turing machine that halts on the empty tape}
and L = L1 union L2. Identifying elements of L2 is "difficult" (this is where I'm not being formal, but clearly this is beyond the power of regular languages) and the elements of the two languages are disjoint (so I'm not playing some tricks with my language secretly being Σ*).
Take the pumping length p = 3. There are only eight possibilities for the first three characters: 000, 001, 010, 011 and their bitwise complements. In the first two cases pump on the last character, as any string starting 00 is in the language L. In the third case pump on the middle character; either we get repeated 1s or the zeroes coalesce (note that this is exactly where we collapse L2 into L1 by pumping it). In the last case, pump on the first character; the repeated 1s ensure the final string is in L. For the complements just complement my constants.
So any string s in L can be pumped. But it's clear that no DFA is going to recognize L2, a disjoint language from L1. I use the halting problem for L2 as a "hard problem" but you can just choose an arbitrary other language L2 = {(01)^n : n encodes an element of L*, where L* is not regular}.
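(To make the "010..." case concrete, a throwaway Ruby check -- my own illustration, not part of the original argument. "Pumping" the middle character either deletes the 1 so the zeroes coalesce, or duplicates it:)

    s = "010101"
    pump = ->(n) { s[0] + s[1] * n + s[2..] }
    pump.(0)  # => "00101"    -- contains "00", lands in L1
    pump.(3)  # => "01110101" -- contains "11", lands in L1
    [0, 2, 3, 4].all? { |n| pump.(n).match?(/00|11/) }  # => true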
Just to tie everything up, if you have Dirichlet's theorem it's easy to show that {1^x: x not prime} is irregular through a "direct" proof by contradiction using the pumping lemma.
1. Assuming the language is regular, the pumping lemma tells us that it has a pumping length m, and every string of length at least m can be divided into a prefix x, a pumpable substring y, and a suffix z, such that:
1a. |xy| ≤ m (in particular, |y| ≤ m)
1b. |y| ≥ 1
1c. The concatenation x y^n z remains in the language for all n ≥ 0. (In this case, this means that |xz| + n*|y| is nonprime for all nonnegative n.)
2. Consider the string c = 1^p², where p is a prime number greater than m [1].
3. The pumping lemma will divide c into substrings x y z as described above.
4. Observe that |y| ≤ m < p and |xz| = p² - |y| > 0
5. Dirichlet's theorem tells us that the arithmetic sequence {a + nd} contains infinitely many primes as long as a and d are coprime. Pumping our string c will generate strings with lengths belonging to the arithmetic sequence {|xz| + n|y|}. It remains to be shown that |xz| and |y| are coprime.
6. Observe that any number which divides both |y| and |xz| will also divide [|y| + |xz|] = p².
7. Since p is prime, the only divisors of p² are 1, p, and p².
8. Since 1 ≤ |y| < p, the only one of those that can divide |y| is 1, so gcd(|y|, |xz|) = 1.
This completes the proof; pumping c is guaranteed to produce not just one but infinitely many strings of prime length.
[1] Euclid tells us that there is no greatest prime number, so we can guarantee the existence of such a p.
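(A numeric illustration of steps 5-8, with made-up values: take p = 5, so |c| = 25, and suppose the division hands us |y| = 3, so |xz| = 22. Then gcd(3, 22) = 1 and the progression of pumped lengths hits primes quickly:)

    require 'prime'
    xz, y = 22, 3                               # |xz| = p**2 - |y|
    lengths = (0..10).map { |n| xz + n * y }    # pumped lengths
    lengths.select { |len| Prime.prime?(len) }  # => [31, 37, 43]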
Think about what you're saying. You think 15 is an even number?
maweki may be thinking that you can divide a string for pumping lemma purposes any way you please. If that were true, you could show that all strings of nonprime length could be pumped while still retaining nonprime length:
1. |s| is composite, so |s| = kl for some integers k, l ≥ 2
2. Choose the length of the pumpable substring y to be k. Then |xz| = |s| - k = k(l-1), and |xy^nz| = k(l-1) + kn = k(l + n - 1). This is composite, so the pumping is valid.
The problem, as jwillk has identified, is that the pumping lemma doesn't allow this. The language has a pumping length (traditionally m), and you have to divide your string into three substrings xyz such that all of the following are true:
1. |y| > 0
2. |xy| ≤ m (and hence |y| ≤ m)
3. xy^nz remains in the language for all n ≥ 0.
Condition (2) caps |y| at the pumping length m, which prevents us from using the maweki strategy: for a string like 1^(q²) with q a prime greater than m, no permitted |y| is a nontrivial divisor of the length.
I'll be "the other guy" -- no one uses the term regex or regular expression in a programming context to mean a strictly regular language in the theoretical computer science sense. Nitpicking this point is ignorant of how the usage of the term has evolved within the programming community.
Good. There's that obligatory conversation over and done with.
Having worked in an industry where the linear vs. exponential performance implications of regular expressions and PCREs were crucial (intrusion protection systems), I'd just as soon software professionals stop being sloppy with terminology.
Not to mention the plethora of mostly-but-not-totally compatible regexp languages which exist. C++ supports 6 [1], and that doesn't even include PCREs. Miscommunication about which language one is actually using is a common source of regexp bugs.
> Having worked in an industry where the linear vs. exponential performance implications of regular expressions and PCREs was crucial (intrusion protection systems)
same boat. that's why we use re2 in a way inspired by Go (which means losing a few features but the tradeoff is worth it):
> The regexp implementation provided by this package is guaranteed to run in time linear in the size of the input. (This is a property not guaranteed by most open source implementations of regular expressions.) For more information about this property, see
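For anyone curious how such a guarantee is possible at all, here is a toy Ruby sketch of the underlying idea (my own illustration, not RE2's actual internals): simulate the NFA by tracking the set of live states, one pass, no backtracking. The machine below is hand-built for the regular expression (a|b)*ab, anchored at both ends:

    NFA = {
      0 => { "a" => [0, 1], "b" => [0] },  # loop, or guess the final "ab" starts
      1 => { "b" => [2] },
    }
    START, ACCEPT = [0], [2]

    def nfa_match?(s)
      states = START
      s.each_char do |c|
        states = states.flat_map { |q| NFA.dig(q, c) || [] }.uniq
        return false if states.empty?
      end
      states.any? { |q| ACCEPT.include?(q) }
    end

    nfa_match?("aab")  # => true
    nfa_match?("aba")  # => false

Per-character work is bounded by the (fixed) number of states, so the whole match is linear in the input, no matter how pathological the input is.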
Wow, TIL. I always thought C++ had evolved into an excessively complex language, but this really seems to take the cake. ECMAScript (with C++-only changes), basic POSIX, extended POSIX, POSIX awk, POSIX grep, and POSIX extended grep. You are right there is no PCRE in there, yet. (There's always C++23... and if not that, C++26 or C++30)
(Although ECMAScript regex was heavily inspired by Perl/PCRE anyway, albeit without being 100% compatible with either, so in a sense PCRE is already there.)
AFAIK it's not even the interface that is broken, but some standard libraries pushed out slow implementations that cannot be changed without an ABI break.
I'd argue that if this is sloppy, then the use of "you" instead of "thou" is sloppy too. Using "you" for everything is confusing. How can I possibly know if you mean plural or singular?
Sure some people might say "thou" is archaic, however so is using "regex" outside of academic contexts to mean regular languages. I don't see how one could justify one but not the other.
Grammar sticklers usually fail on this point: human languages are not static, stuck in place from the time the stickler learned the rules.
The current state of a human language is defined by common usage. The purpose of a language is for people to be able to communicate ideas. Ideas exist all up and down a specificity spectrum, e.g. “regex” might mean a specific syntax or one of several syntaxes depending on context. Kind of like “kleenex” might refer to a specific brand or to the general concept of tissue.
If 99.9% of people successfully (by their own measure, not yours) communicate what they mean when they use the term “regular expression”, but you are among the 0.1% for which that interaction fails, then I would argue that the problem is your pedantry, not the term. I’m not saying this to be mean; I have similar issues.
There is no such thing as the programming community. Only a huge number of partially overlapping communities using wildly different terminology. Some communities still talk about regular expressions in the original meaning (because labeled graphs are useful objects), just like some communities still use the word "algorithm" outside machine learning or "crypto" in non-blockchain contexts.
I personally prefer using "regular expression" in the language-theoretic sense and "regex" or "regexp" for related concepts in programming languages and software.
> I personally prefer using "regular expression" in the language-theoretic sense and "regex" or "regexp" for related concepts in programming languages and software.
This is also what Larry Wall proposes:
> [...] Pattern Matching, generally having to do with what we call "regular expressions", which are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here. I will, however, generally call them "regexes" (or "regexen", when I'm in an Anglo-Saxon mood).
There's some kind of idea going around that because the plural of "ox" is "oxen", plurals with -n must be triggered by a word ending in "x". So you see cutesy references to "boxen" or similar.
But -n is just an ordinary plural marker; the same ending that makes "oxen" the plural of "ox" also makes "kine" the plural of "cow" and "children" the plural of "child". (In those last two cases, the words are sporting two plural markers in sequence.)
It's not new, by any stretch. I remember seeing many such references in the early 2000s and beyond. Doing some vague poking around textfiles.com, I didn't pull up any easy-to-find 1980s 'boxen' references. It almost looks like the addition of 'unix boxen' to the 'new hacker dictionary' by esr might be what spread the joke into the mid-90s communities. I wasn't online at that point myself, so I can't speak from any personal experience. But the various hacker zines etc. are using 'boxen' by the later 90s and early 2000s, which would allow for it to have spread through the folks of the time finding and trolling through esr's version of the 'hacker dictionary'. https://www.dourish.com/goodies/jargon.html complains about esr adding unix references, and does not include any boxen references that are found in the http://cd.textfiles.com/hackersencyc/ETC/MISC/JARG_331.TXT version folks in the 90s would likely have spread around.
edit: nope. definitely there in the 80s on usenet
so he was just encoding existing slang usage, rather than introducing it, on this point.
I didn't say it was new. I said it was current. (And, um, unmotivated.)
We might also note that from the perspective of modern English, plurals in -en are more characteristic of Geoffrey Chaucer than of Alfred the Great. We preserve three such words: oxen, children, and brethren. But only one of those is an Anglo-Saxon form; in Old English, the plural forms of ċild and broþor are ċildru and broþru[1]. The -ens appeared later.
[1] "Brother" is a fairly irregular word, but there's no form ending in -n.
Well. You're on Hacker News. So I'm going to quote the Jargon File and then you'll stop complaining. Right, cobber?
> Further, note the prevalence of certain kinds of nonstandard plural
forms. Some of these go back quite a ways; the TMRC Dictionary includes
an entry which implies that the plural of `mouse' is {meeces}, and notes
that the defined plural of `caboose' is `cabeese'. This latter has
apparently been standard (or at least a standard joke) among railfans
(railroad enthusiasts) for many years.
> On a similarly Anglo-Saxon note, almost anything ending in `x' may form
plurals in `-xen' (see {VAXen} and {boxen} in the main text). Even
words ending in phonetic /k/ alone are sometimes treated this way; e.g.,
`soxen' for a bunch of socks. Other funny plurals are `frobbotzim' for
the plural of `frobbozz' (see {frobnitz}) and `Unices' and `Twenices'
(rather than `Unixes' and `Twenexes'; see {UNIX}, {TWENEX} in main
text). But note that `Unixen' and `Twenexen' are never used; it has
been suggested that this is because `-ix' and `-ex' are Latin singular
endings that attract a Latinate plural. Finally, it has been suggested
to general approval that the plural of `mongoose' ought to be
`polygoose'.
> The pattern here, as with other hackish grammatical quirks, is
generalization of an inflectional rule that in English is either an
import or a fossil (such as the Hebrew plural ending `-im', or the
Anglo-Saxon plural suffix `-en') to cases where it isn't normally
considered to apply.
I’ll be the guy who postscripts that “hardmode” regexes would have important mathematical and computer science implications when/if they solve certain problems (like identifying primes). So it’s not pure pedantry, and there’s an important difference between “regexes to solve programming problems” and “regexes solving math problems”.
Yeah I think we agree here. I didn’t mean to assert that chomskian regular grammars or regexes satisfiable by a state machine hadn’t already been useful; I meant to say that because they have existing math implications and properties that new uses for them would imply connections to other maths, analogous computations, new proofs or ways to solve things, etc.
And yeah, they’re faster because they can be statically compiled to the point of running strings “through” the expression rather than running the expression “on” the string, so to speak. (i.e. as an FSM, iterating char by char, the operations boil down to pure compares and branches to known destinations)
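(A minimal sketch of that "compare and branch" picture, assuming a hand-compiled transition table for the regular expression (ab)+, anchored at both ends:)

    DFA = {
      0 => { "a" => 1 },  # start: expect 'a'
      1 => { "b" => 2 },  # after 'a': expect 'b'
      2 => { "a" => 1 },  # accepting: another "ab" may follow
    }

    def dfa_match?(s)
      state = 0
      s.each_char do |c|
        state = DFA[state][c] or return false  # no transition => reject
      end
      state == 2
    end

    dfa_match?("abab")  # => true
    dfa_match?("aab")   # => false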
>no one uses the term regex or regular expression in a programming context to mean a strictly regular language in the theoretical computer science sense.
The RE2 library seems to:
>loosely speaking, a regular language is a set of strings that can be matched in a single pass through the text using only a fixed amount of memory.
Regular expressions describe Regular Languages, which are equivalent to deterministic finite state automata. These languages have very interesting properties due to their regularity (eg. pumping lemma), and they are a very small strict subset of Context-Free languages, which are in turn a strict subset of the Turing-recognizable languages. If you add recursion and backtracking to RegExps you easily go outside the Regular Languages class, and you probably end up Turing-complete. Now you are dealing with an undecidable machine which is also very hard to define and reason about (and you also have two more problems). What's the point of calling it regular?
Your average coding bootcamp kid cracking away in node and invoking the builtin regex engine has no idea what any of those words mean.
To those programmers the words "regular expression" have become completely separated from their original etymology in lexical analysis. To them a regex means "input to a regex engine", and thus the term "use a regular expression" means "use a regex engine". It could be "use a froofroo boingo", for all the meaning the words hold. "Regular expression" just became the vernacular due to its origination in lexical analysis.
Wow, I have no idea what regexps mean? I can write a book about regexps, and if you really want to learn what regexps are, let me know, we can schedule a session.
Anyway, real REGULAR expressions belong to the regular languages class, and are thus equivalent to DFAs. Given a regexp, you can construct a DFA in O(2^n), where n is the number of operators in the regexp. Now this DFA will have a linear time complexity match. If you extend regular expressions with backreferences, you can't build DFAs anymore, and you also lose the fundamental property of regularity: the pumping lemma for regular languages. Furthermore, the languages you can build with this extension go even outside the context-free class. Not only that: you are also dealing with NP-complete matching.
Again, why do you still call them regular?
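(The O(2^n) construction mentioned above is the classic subset construction; here is a compact Ruby sketch of it, with the NFA given as state => { char => [states] } -- my own toy version, ignoring epsilon transitions:)

    require 'set'

    # Subset construction: each DFA state is a set of NFA states.
    def nfa_to_dfa(nfa, start, alphabet)
      dfa, queue = {}, [Set[start]]
      until queue.empty?
        s = queue.shift
        next if dfa.key?(s)
        dfa[s] = alphabet.to_h do |c|
          t = s.flat_map { |q| nfa.fetch(q, {})[c] || [] }.to_set
          queue << t
          [c, t]
        end
      end
      dfa
    end

    nfa = { 0 => { "a" => [0, 1] }, 1 => { "b" => [2] } }  # a+b, accept state 2
    nfa_to_dfa(nfa, 0, ["a", "b"]).size  # => 4 reachable subset-states (worst case 2^n)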
My initial reaction to the title was "wait, you can describe prime numbers using formal languages?". I could feel it in my guts that this would probably have some mind blowing Math/CS implications.
A shame how I'm so rusty in my Mathematics and CS nowadays :(
I wish every time an application said "regular expression" it meant PCRE. People say sed supports RE but it's so divorced from PCRE I end up using `| perl -pe ....`
as the saying goes, if your only tool is a hammer ... perl/regexp is obviously the wrong tool for the job. Finding new prime numbers, minting bitcoin, running seti@home -- that is what C++ exceptions were invented for.
So left recursive is not regular, so not a language? Every simple language is defined recursively. Regularity just defines it as terminating, to avoid the simplicity of recursion.
Every non-deterministic finite state automaton can be converted into a deterministic finite state automaton that accepts the same language (see powerset construction [1]), so I am not sure what you are saying here.
read for example about the state explosion of DFA's:
"It is important because NFAs can be used to reduce the complexity of the mathematical work required to establish many important properties in the theory of computation. For example, it is much easier to prove closure properties of regular languages using NFAs than DFAs." https://en.m.wikipedia.org/wiki/Nondeterministic_finite_auto...
but more importantly, look at the various attempts to solve the backtracking "problem" (stack overflow by unbounded recursion), as done recently in perl and pcre, which led to more insecure AND more unmaintainable regex code by converting their simple recursion to iterations on the heap. First it got much slower, as the state is now handled on the heap; second it got much more insecure, as DOS attacks can now exhaust the whole heap, not just around 1024 stack depths; and third, it got 10x more complicated and unreadable.
ad language: you can of course define a language by forbidding left recursion, which the yacc folks would prefer. but you can define it much more easily with left recursive parsers, like 10x smaller. of course you can always expand simple NFA's to deterministic DFA's, at the cost of exponential state explosion and increased complexity. but it is a monstrosity only impractical CS folks prefer. SW engineers prefer the similar, faster and more secure NFA backtracking versions, as long as you count the stack depth.
As I mentioned elsewhere, linear performance is a hard requirement for some domains such as intrusion prevention. Unbounded recursion isn't the primary problem; exponential runtime is.
If state-space explosion is a concern, you can implement a non-backtracking regular expression matcher using term rewriting. For true regular expressions, the size of the rewritten terms is a function of the size of the pattern (and thus performance is linear). For non-regular expressions, this does not hold.
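(One concrete reading of "term rewriting" here is Brzozowski derivatives: each input character rewrites the pattern, and no backtracking is ever needed. A bare-bones Ruby sketch of my own; a real implementation would also simplify terms after each step, which is exactly what keeps their size bounded for true regular expressions:)

    EMPTY = [:empty]  # matches only ""
    NONE  = [:none]   # matches nothing

    def nullable?(r)
      case r[0]
      when :empty, :star then true
      when :none, :chr   then false
      when :seq then nullable?(r[1]) && nullable?(r[2])
      when :alt then nullable?(r[1]) || nullable?(r[2])
      end
    end

    # Rewrite r into the pattern matching "what may follow c".
    def deriv(r, c)
      case r[0]
      when :empty, :none then NONE
      when :chr  then r[1] == c ? EMPTY : NONE
      when :alt  then [:alt, deriv(r[1], c), deriv(r[2], c)]
      when :star then [:seq, deriv(r[1], c), r]
      when :seq
        d = [:seq, deriv(r[1], c), r[2]]
        nullable?(r[1]) ? [:alt, d, deriv(r[2], c)] : d
      end
    end

    def rx_match?(r, s)
      s.each_char { |c| r = deriv(r, c) }
      nullable?(r)
    end

    ab_star = [:star, [:seq, [:chr, "a"], [:chr, "b"]]]  # (ab)*
    rx_match?(ab_star, "abab")  # => true
    rx_match?(ab_star, "aba")   # => false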
yeah, with practical backtracking regexes they limit their recursion depth to about 10. with DFA expansion this was pretty hard to implement, with simple NFA's it was trivial.
Pratt and esp. PEG solve the determinism problem, even better than yacc.
As an aside, a Hungarian mathematician wrote an entire book explaining how mathematics works to, I guess, a language major (well, Marcell Benedek was a professor, a writer, a literary historian, a translator, a theatre director).
I prefer the source, which I think gives a clearer explanation. The key is that Perl's x operator turns the number into a string of 1s whose length is the number. The rest is a hack to use the regex engine to perform division using minimal capturing and backreferences.
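For reference, the widely circulated form of the trick (usually attributed to Abigail; I'm assuming the article uses the same pattern), here in Ruby:

    def prime?(n)
      ("1" * n) !~ /^1?$|^(11+?)\1+$/
    end

    (1..20).select { |n| prime?(n) }  # => [2, 3, 5, 7, 11, 13, 17, 19]

The first alternative knocks out 0 and 1; the capture group plus backreference is the division hack described above.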
Can you decide if a number is prime using a true regular expression?
No. This won't work because the time cost of primality testing is too high (cubic in the number of binary digits?), and checking if a string satisfies a regular expression takes only linear time. Since cubic is bigger than linear, this won't work. Too good to be true!
Not a proof, but a rule of thumb. No need to even use the pumping lemma.
> This won't work because the time cost of primality testing is too high
This is just a conjecture, though.
> No need to even use the pumping lemma.
But it's really easy to prove this with the pumping lemma. This could be the first problem in a textbook section that introduces the pumping lemma. It would be difficult to come up with a language for which the pumping lemma disproof was more obvious.
(1) a^n b^n
The pumping lemma tells us that for any sufficiently long string in this language, we can divide the string into three substrings x y z such that xy^nz remains in the language for all n ≥ 0.
There are two cases for what y can be:
(1)(a) The number of 'a' characters in y does not equal the number of 'b' characters. In this case, pumping y will violate the constraint that a string in the language must have an equal number of 'a's and 'b's.
(1)(b) y consists of k 'a's followed by k 'b's for some k > 0. In this case, pumping it will violate the constraint that for every string in this language, each 'a' character must precede every 'b' character.
Thus, the language a^n b^n is not regular.
-----
(2) a^n, where n is prime
The pumping lemma tells us that for any sufficiently long string in this language, we can divide the string into three substrings x y z such that xy^nz remains in the language for all n. [same as above]
But the string x y^(|xz| + 2|y| + 2) z has length |xz| + (|xz| + 2|y| + 2)|y| = (|y| + 1)(|xz| + 2|y|), a product of two factors that are each at least 2, and hence not prime.
Thus, the language a^n (where n is prime) is not regular.
The proof for a^n b^n, as I was taught, being that
-> there exists an integer N such that the pumping lemma holds for any w in L longer than N. (That is, we can divide w into xyz and pump y, where |xy| ≤ N, and y is nonempty.)
-> now consider a^N b^N: we can't divide it into xyz such that we can pump y and the result still remains in a^n b^n, because |xy| ≤ N forces y to consist only of 'a's, so pumping changes the number of 'a's without touching the 'b's.
No, the point is that the article does not use a real regular expression:
1. Real regular expressions can always be checked in linear time
2. primes can not be checked in linear time
Ergo: primes can not be checked with regular expressions, otherwise there would be a way to check primes in linear time, which would contradict premise 2
What it is really doing is checking if the length of a string is a prime number or not. And as coded, it is checking specifically if the length of a string consisting of some number of '1' characters is prime or not.
I've done some Codewars problems where you have to write a regex to check whether an input number is divisible by a specific number, so I was really interested to see how that would work for primes.
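(Divisibility by a fixed d in unary really is regular, unlike the primality test: the pattern is just "d ones, repeated". A Ruby sketch for d = 3; decimal divisibility is regular too, via a remainder-tracking DFA over the digits, but the expression gets much uglier:)

    def divisible_by_3?(n)
      ("1" * n).match?(/\A(?:111)*\z/)
    end

    divisible_by_3?(9)   # => true
    divisible_by_3?(10)  # => false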
- it is turning the number into a string of 1s (3 => 111, 4 => 1111, and so on)
- then it simulates a divisibility test by trying to find perfect groups of '11' to simulate division by 2, failing which (that is the backtracking part) it tries to find perfect groups of '111' to simulate division by 3, and so on...
... then decided to test it on a button-mashed number:
$ruby ~/prime.rb 767423
... and it hung.
Thought it was the length of the number, so I tried something smaller (101). Quick "true". Then I tried progressively more digits. I got to `ruby ~/prime.rb 7673121` (false) and it took less than a second.
But it doesn't like something about the number 767423.
The regexp basically tests all divisors of the number (from 2 to the number minus 1); if it finds one, it says false and stops; if nothing is found, it says true.
101 is prime, but it does "only" a hundred checks approx.
7673121 is divisible by 3, so as soon as it tries to match groups of 3 it succeeds and stops.
767423 is prime, so it needs to do hundreds of millions of checks; that's probably why it hangs.
Basically: the runtime of the algorithm should be roughly proportional to the smallest divisor of the number greater than 1.
The question mark isn't meaningless - it changes the matching behavior of + from greedy to non-greedy. The non-greedy version will try to match shorter strings first. This should be more efficient, because non-primes will fail sooner. Most numbers will fail a lot sooner.
It takes just as long on primes, so the overall runtime will be very similar. So sure it's an optimization but it's not one to care about when you're trying to understand it.
At first I was thinking this was actually operating on a binary representation of the numbers. Some interesting things become apparent when you look at numbers in binary; for example, numbers of the form 2^n-1 can be represented as a string of n 1s. From there, it's pretty obvious that if n is composite then 2^n-1 is also composite, since you can take that string of n=kp 1s and write it as k groups of p 1s, so obviously 2^p-1 divides 2^n-1. There's the famous quote from Leopold Kronecker, “God made the integers”, and I always thought that when he did, he wrote them in binary.
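(The factorization claim is easy to spot-check in Ruby: 15 = 3 * 5, so both 2^3 - 1 and 2^5 - 1 should divide 2^15 - 1:)

    m = 2**15 - 1   # => 32767, fifteen 1-bits
    m.to_s(2)       # => "111111111111111"
    m % (2**3 - 1)  # => 0 (7 divides it)
    m % (2**5 - 1)  # => 0 (31 divides it)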
This is interesting - I thought it would be something complex, but really it's just using the fundamental meaning of prime.
Literally what it's doing is turning the number into all 1s; if the number is 1, return true. Otherwise try all possible ways to split it into 2 or more equal groups, and return true once we find one that works.
You don't get closer to the definition of "prime" than that.
The number of 1's is just n itself, so the regex is simply testing whether the count of 1's is prime. That's all this does. Search the other comments for 'hung' or 'hang' to see examples of small numbers that take forever to resolve.
it does not handle "1" for clarity. I also don't understand why author used `+?`, because simple `+` should work just as well. At least it works in Idea for me.
I wonder which one will reject composite numbers faster. The non greedy version starts with the smallest (potential) factor first, while the greedy version starts with the largest candidate factor first.
I suspect that since a factor must be smaller than the square root of a composite, the greedy version you proposed would waste a lot of time checking numbers which could never be factors of the composite under test.
If the number is prime, I think both variants have the same performance, because they will exhaustively evaluate all candidate group sizes up to the number under test.
Edit: this is not quite correct. One of the prime factors must be smaller than the square root of the composite numbers, but the other one can obviously be larger. For example: 2 * 17 = 34. I’m leaving the original statement as is, because I’m hoping someone with a better math background can chime in and explain if the way that prime factors are distributed is symmetrical or not.
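(The two variants side by side, for anyone who wants to experiment -- same language, different search order:)

    NONGREEDY = /^1?$|^(11+?)\1+$/  # tries the smallest group ("11") first
    GREEDY    = /^1?$|^(11+)\1+$/   # tries the largest group first

    ("1" * 9) =~ NONGREEDY  # matches with group "111" after rejecting "11"
    ("1" * 9) =~ GREEDY     # matches with group "111" after shrinking from "111111111"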
[1] https://en.wikipedia.org/wiki/Regular_expression#Formal_defi...