Understanding Text - Regexp to Tokens

26 Jan 2009
Posted by jcfiala

I've got two Drupal modules that depend on reading through text to do something useful. The first is the 'Drupal Markup Engine', or dme. The second, currently called 'pingv_coder' and not done yet, is an extension to coder that checks for what I think are poor programming practices.

In both of them I'm currently using regexp to scan for what I need. In the first I'm using some truly twisted regular expressions to find and isolate the tags that are left in the text. DME allows programmers to define custom tags in the format of <dme:tagname/> or <dme:tagname>...</dme:tagname>, and the code looks for these tags with regex. At the time I started, I read a lot of sites saying that parsing html-like or xml-like tags with regex wasn't the right way to go, and as I've improved and used my code, that seems more and more true. It works well for the simple cases, but when you have tags nested inside of other tags, it gets hard to make sure each opening tag is paired with the right closing tag. Happily, for the production sites we've used it in, the tags have been very simple, on the order of 'put this image here', so there haven't been any problems.
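To make the nesting problem concrete, here's a minimal sketch. The pattern and the dme:box tag are made up for illustration and are not the actual DME regex, but they show the kind of mismatch you get once a tag contains another tag of the same name.

```php
<?php
// Hypothetical input: a dme:box tag nested inside another dme:box tag.
$text = '<dme:box>outer <dme:box>inner</dme:box> tail</dme:box>';

// A typical "open tag, anything, matching close tag" pattern.
// \1 requires the same tag name on the close; .*? is non-greedy.
$pattern = '#<dme:(\w+)>(.*?)</dme:\1>#s';

preg_match_all($pattern, $text, $matches);

// The non-greedy match stops at the FIRST closing tag it sees, so the
// outer opening tag ends up paired with the inner tag's close.
foreach ($matches[0] as $match) {
  print $match . "\n";
}
```

The regex has no notion of depth, so the only match it finds is the outer open tag glued to the inner close tag, and the real outer close tag is left dangling. Making the pattern greedy just moves the problem: two sibling tags would then be swallowed as one.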

In the other one I'm looking mostly for where in PHP code variables are defined, and when they're used, a process that *sounds* simple enough...

$var = 15;
print $var;

Defined, and then used, right? Except variables can be defined in the function header, in the class, as a static variable, or even with the list() construct:

list($one, $two) = function_what_returns_an_array();
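The other cases look something like this. All the names here (Counter, tally, and so on) are hypothetical, just enough to show each way a variable can come into being without a plain "$var = value;" line for a regex to latch onto.

```php
<?php
class Counter {
  public $count = 0;               // defined as a class property
}

function tally($items) {           // $items is defined by the function header
  static $calls = 0;               // static variable, initialised only once
  $calls++;
  list($first, $second) = $items;  // $first and $second defined by list()
  return array($calls, $first, $second);
}

$result = tally(array('a', 'b'));
```

A checker that only looks for "$name =" would miss $items and the class property entirely, and would have to special-case both the static declaration and list().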

Really, the more I poke at that, the less I think lots of regex is the answer - what I really need to do is parse the file, and keep track of when something is defined and when it's used. Happily, PHP has simpler scope rules than a lot of other languages - a variable is either global, a property of a class, or local to a function.

In both of these cases, what I really should be doing is tokenizing the input and going through the tokens until I reach the information I'm interested in, keeping track of my depth and my context as I go. For PHP, there's an advantage in that there's the Tokenizer, a built-in extension (the token_get_all() function) that splits the file's source into tokens for you, returning an array of pieces that you can skim over. I've just started looking at it, but I think that'll really do what I need for the coder extension. For the DME, on my next major revision I think I might just need to scan the input myself, possibly first breaking it up on whitespace, or maybe looking for < and > characters. But that will be after I finally get it ready for D6.
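Here's a rough sketch of both ideas, under some loud assumptions: find_variables() and scan_dme_tags() are hypothetical helpers, not code from coder or DME, the variable check only recognises a plain "=" assignment, and the tag scanner ignores attributes altogether. token_get_all() itself is the real Tokenizer function.

```php
<?php
// Report each variable token and whether it is directly assigned to.
function find_variables($source) {
  $tokens = token_get_all($source);
  $count = count($tokens);
  $found = array();
  for ($i = 0; $i < $count; $i++) {
    // Multi-character tokens are arrays of (id, text, line number);
    // single characters such as "=" or ";" come back as plain strings.
    if (!is_array($tokens[$i]) || $tokens[$i][0] !== T_VARIABLE) {
      continue;
    }
    // Skip whitespace to reach the next meaningful token.
    $j = $i + 1;
    while ($j < $count && is_array($tokens[$j]) && $tokens[$j][0] === T_WHITESPACE) {
      $j++;
    }
    $assigned = ($j < $count && $tokens[$j] === '=');
    $found[] = array($tokens[$i][1], $tokens[$i][2], $assigned ? 'defined' : 'used');
  }
  return $found;
}

// Walk the text looking for "<dme:" and "</dme:", keeping a stack of open
// tag names so the current nesting depth is always known.
function scan_dme_tags($text) {
  $events = array();
  $stack = array();
  $offset = 0;
  while (($start = strpos($text, '<', $offset)) !== FALSE) {
    $opening = (substr($text, $start + 1, 4) === 'dme:');
    $closing = (substr($text, $start + 1, 5) === '/dme:');
    $end = strpos($text, '>', $start);
    if ((!$opening && !$closing) || $end === FALSE) {
      $offset = $start + 1;  // not a complete DME tag, keep looking
      continue;
    }
    $body = substr($text, $start + 1, $end - $start - 1);
    $offset = $end + 1;
    if ($opening) {
      $name = rtrim(substr($body, 4), '/');
      if (substr($body, -1) === '/') {
        $events[] = array('single', $name, count($stack));
      }
      else {
        $events[] = array('open', $name, count($stack));
        $stack[] = $name;
      }
    }
    elseif (end($stack) === substr($body, 5)) {
      array_pop($stack);
      $events[] = array('close', substr($body, 5), count($stack));
    }
    else {
      $events[] = array('stray-close', substr($body, 5), count($stack));
    }
  }
  return $events;
}

$report = find_variables("<?php\n\$var = 15;\nprint \$var;\n");
$events = scan_dme_tags('<dme:box>outer <dme:box>inner</dme:box> <dme:img/> tail</dme:box>');
```

For the two-line example from earlier, $report lists $var as defined on line 2 and used on line 3. For the nested boxes, $events pairs each close with the right open because the stack remembers the depth, which is exactly the bookkeeping a single regex can't do. Neither function is anywhere near complete (function parameters, list(), static, and tag attributes would all need their own handling), but the shape is what matters: walk the tokens once and carry the context along.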