Lex and yacc are tools to automatically build C code suitable for parsing things in simple languages. These tools are most often used for parts of compilers or interpreters, or for reading configuration files. In the first of two articles, Peter Seebach explains what lex and yacc actually do and shows how to use them for simple tasks.
Meet lex, yacc, flex, and bison
2004-08-24 General Development 16 Comments
Very nice article; it explains the most important parts of lex/yacc very simply.
I wish I had had this tutorial last year, when I did my ‘compiler generation’ course at university; it would have saved me a lot of time figuring out how lex/yacc works.
But the superficial explanation leaves something to be desired:
Lex is a “lexical analyzer generator” that can parse ANY arbitrary code. Personally, I think it’s the greatest thing since sliced bread,
and yacc can use the output of lex to compile and translate ANY arbitrary code.
Great for cross compilation, computer language translation, and fixing bugs in assembler output. It slices and dices!
These utilities have existed for decades, the greatest unsung utilities ever.
ANTLR <http://www.antlr.org> is a more modern, easier to use, and more powerful tool for lexing, parsing and tree parsing!!
I’m inclined towards the Gold parser myself, but I am still trying to figure out what to do with my fancy parser once I’m done building it.
Build your own languages with JavaCC
JavaCC Grammar Repository
For you C++ users out there, check out the Spirit library. It’s available standalone or as part of the Boost C++ libraries.
I find it very effective, if a bit complicated.
Because if you try to do complex things with lex and yacc, you will start pulling your hair out.
Lex: it’s NOT sliced bread, it’s a regex-based tokenizer. A Perl script would be easier to write and probably more efficient than a lex “program”.
Yacc: 1-token-lookahead LALR just sucks. Period. Should be burnt and buried. Use an LL(k) parser generator.
Right! It’s maybe good to know these tools for historical reasons, but novices should not be introduced to such poorly documented crap any longer.
The ‘lexical analyser’ part of a programming language (PL) can usually be hand-coded. Lex generates it automatically, but it is not, IMHO, a tool that saves weeks of work. Comparing it with regex scripts is silly; you have probably never created a PL.
The ‘syntactic analysis’ is a much more complex task. yacc and its object-oriented derivatives are really indispensable.
Usually, if your syntax doesn’t fit a LALR(1) parser, it is broken! There are several possibilities:
– Your syntax is too complex (C++ is known not to be LALR(1) compliant)
– Several exceptions are well known and can be handled with yacc (for example the classic ‘dangling else’ problem)
– You are trying to do semantic analysis inside the syntactic parser, which is a Bad Idea (TM)
– You are not parsing a programming language or machine-readable format (English or Portuguese is probably not LALR(1)), and yacc is not the right tool for such things.
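To make the ‘dangling else’ point concrete, here is a minimal sketch in Python. It is a hand-written analog, not yacc output: the toy grammar (`if COND stmt [else stmt]`), the whitespace-separated tokens and the names `parse_stmt` and `parse` are all assumptions made for illustration. By default yacc resolves the shift/reduce conflict in favour of shifting, which attaches the `else` to the nearest `if`; the greedy check below gives the same binding.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement of the toy grammar; return (tree, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "if":
        if pos + 1 >= len(tokens):
            raise SyntaxError("missing condition after 'if'")
        cond = tokens[pos + 1]
        then_branch, pos = parse_stmt(tokens, pos + 2)
        else_branch = None
        # Greedy choice: if an 'else' follows, the innermost open 'if' takes it.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_stmt(tokens, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    if tok == "else":
        raise SyntaxError("'else' without matching 'if'")
    # Anything else counts as a simple statement in this toy language.
    return ("stmt", tok), pos + 1


def parse(text):
    """Parse a whole whitespace-separated program consisting of one statement."""
    tokens = text.split()
    tree, pos = parse_stmt(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"trailing input at token {tokens[pos]!r}")
    return tree
```

For `if a if b s1 else s2`, the `else` ends up inside the inner `if b`, and the outer `if a` has no else branch.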
I never really got how the hell ANTLR worked, even though I know it is fast, thread safe and all that stuff.
Have you ever tried to code an LL(k) parser by hand?
I did it several times and there isn’t so much extra code (compared to a yacc variant). Most of the code in a yacc file is concerned with building a syntax tree or whatever, and that is also most of the code in a handwritten LL parser. And I’m talking about parsing Pascal or Java, not toy languages.
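As a rough illustration of that claim, here is a small hand-written recursive-descent (LL(1)) parser in Python. The arithmetic grammar, the tuple-shaped tree and the names `Parser`, `tokenize` and `parse` are assumptions made for this sketch; it is far smaller than a Pascal or Java parser. Note how most of the lines do nothing but build tree nodes.

```python
import re


def tokenize(text):
    # Runs of digits, or any single non-space character, become tokens.
    return re.findall(r"\d+|\S", text)


class Parser:
    """One method per nonterminal:
    expr   -> term (('+' | '-') term)*
    term   -> factor (('*' | '/') factor)*
    factor -> NUMBER | '(' expr ')'
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, got {self.peek()!r}")
        self.pos += 1

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.term())  # left-associative tree node
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.factor())
        return node

    def factor(self):
        tok = self.peek()
        if tok == "(":
            self.pos += 1
            node = self.expr()
            self.expect(")")
            return node
        if tok is not None and tok.isdecimal():
            self.pos += 1
            return ("num", int(tok))
        raise SyntaxError(f"unexpected token {tok!r}")


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input at token {parser.peek()!r}")
    return tree
```

Calling `parse("1 + 2 * 3")` yields a tree with `+` at the root and the multiplication nested under it, because `term` is called from inside `expr`.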
Of course, if you’re just verifying if a grammar is LALR(1) then you’re much better off with yacc.
You’re right, one can hand-code an LL parser.
The first parser for my programming language (currently under development) was a hand-coded LL parser.
I switched to yacc because:
– When you’re tuning the syntax, an automatic parser generator helps a lot in finding inconsistent specifications.
– The parser generator is supposed to be bug-free, so you don’t need to test the generated code (programming by contract!)
– In my case, the yacc file was faster to write and shorter than the hand coded parser.
The problem with tools like yacc is that they are inherently complex. Mastering them takes a lot of time, and most people don’t create parsers every other day, so the gain in development speed is offset by the learning time. Still, I don’t regret the time spent reading the bison info files.
Another decisive advantage of yacc over hand-coded parsers and other parser generators is that ready-made yacc grammars are available on the Internet for virtually all widely used languages…
Those tools are great for prototyping, and even production. The authors of the Lua language mentioned that the first version of Lua was created with lex/flex and bison/yacc, but later they wrote a custom C parser for the language, replacing the autogenerated one, mainly because they wanted to minimize the code size and manual optimizations were easier to make this way.
I tried Spirit, but I think it’s more interesting than useful. It works ok as long as you don’t make any errors, but once an error is made you get the unreadable template error messages. It’s interesting that you can do that kind of thing in C++ though.
It’s all Greek to me. I don’t see any easy explanation there.
lex/flex = lexer generator.
The input to be analyzed is a list of characters.
A lexer tries to group these into a list of tokens.
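A minimal sketch of that grouping step, written in Python with the standard `re` module rather than lex. The token names and patterns in `TOKEN_RULES` are hypothetical, chosen only for illustration. Unlike lex, which prefers the longest match, this sketch simply takes the first alternative that matches at each position.

```python
import re

# Hypothetical token rules for a tiny arithmetic language (illustrative only).
TOKEN_RULES = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/=()]"),
    ("SKIP", r"\s+"),
]
# Combine the rules into one pattern with a named group per token kind.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES))


def tokenize(text):
    """Group a string of characters into a list of (kind, lexeme) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # whitespace is matched but not emitted
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

For example, `tokenize("x = 3 + 42")` groups ten characters into five tokens: an identifier, two operators and two numbers.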
yacc/bison = Parser generator.
The input to it is a context-free grammar, and the parser generator spits out a parser (some pushdown automaton) that will recognize derivations of the grammar within the list of tokens.
ANTLR is able to generate lexers, parsers and tree walkers that can be used to operate on the syntax trees that describe the structures recognized by the parser.
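To give a feel for the “pushdown” part, here is a deliberately naive shift-reduce recognizer in Python. It is not what yacc or bison generate: their parsers are table-driven LALR(1) automata with states and one token of lookahead. The grammar, the `RULES` list and the function names are assumptions for illustration, and reducing whenever a right-hand side sits on top of the stack only works here because this particular grammar is so simple.

```python
# Hypothetical grammar, chosen only for illustration:
#   E -> E + T | T
#   T -> n | ( E )
# Each rule is (right-hand side, left-hand side); longer handles are tried first.
RULES = [
    (("E", "+", "T"), "E"),
    (("(", "E", ")"), "T"),
    (("n",), "T"),
    (("T",), "E"),
]
TERMINALS = {"n", "+", "(", ")"}


def reduce_once(stack):
    """Replace a right-hand side found on top of the stack; return True if one matched."""
    for rhs, lhs in RULES:
        if tuple(stack[-len(rhs):]) == rhs:
            del stack[-len(rhs):]
            stack.append(lhs)
            return True
    return False


def recognize(tokens):
    """Return True if the token list can be derived from E."""
    stack = []
    for token in tokens:
        if token not in TERMINALS:
            return False
        stack.append(token)        # shift the next token onto the stack
        while reduce_once(stack):  # reduce while a handle is on top
            pass
    return stack == ["E"]
```

With `n` standing for any number token, `recognize("n + ( n + n )".split())` accepts, while an input such as `n + +` leaves extra symbols on the stack and is rejected.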
ANTLR is implemented in Java, but the generated stuff can be Java, C++ and C# (others to follow?)
It seems a bit more modern. Of course it was implemented 20 or so years later.
There are some nice additions to resolve ambiguities.
I can’t tell if the lookahead strategies (ANTLR has LL(k) and even some kind of LL(oo)) really make a big difference.
Perhaps the computer scientists have some more advanced tools around.
But ANTLR seems to have achieved a level of practical use like old faithful yacc/bison; it doesn’t look like it is just a demo for research purposes.