Reverse custom MyCode parser

Decided to try writing a basic inverse preg_replace today, knowing that it would be impossible to make a “perfect” algo.

There are potentially a number of uses for this, though I’m currently thinking along the lines of a WYSIWYG editor for MyBB (dunno if I’ll make one).  Well, I’ve gotten around to making a basic inverse function, which probably works on most custom MyCodes posted in the MyBB community forums.  The basic idea is to swap the replacement tokens ($1, $2 etc) with the capturing subpatterns they refer to, and vice versa, which happens to fit nicely with most posted custom MyCodes.

So, in other words, a pattern of \[b\](.*?)\[/b\] and replacement <strong>$1</strong>, after passing through my inverse function, comes out with a pattern of \<strong\>(.*?)\</strong\> and replacement of [b]$1[/b].
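To make the swap concrete, here is a minimal round-trip sketch using the two rules quoted above. The `~` delimiters, the `s` modifier and the sample text are assumptions added for the example; the inverse rule is typed in by hand rather than generated.

```php
<?php
// Forward rule from the post: MyCode -> HTML.
// The "~" delimiters and the "s" modifier are assumptions for this sketch.
$forward_pattern     = '~\[b\](.*?)\[/b\]~s';
$forward_replacement = '<strong>$1</strong>';

// Inverse rule as described in the post: HTML -> MyCode.
$inverse_pattern     = '~\<strong\>(.*?)\</strong\>~s';
$inverse_replacement = '[b]$1[/b]';

$source = 'Some [b]bold[/b] text';
$html   = preg_replace($forward_pattern, $forward_replacement, $source);
$back   = preg_replace($inverse_pattern, $inverse_replacement, $html);

echo $html, "\n"; // Some <strong>bold</strong> text
echo $back, "\n"; // Some [b]bold[/b] text
```

Applying the forward rule and then the inverse rule gives back the original text, which is the property a WYSIWYG editor would need.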

It does take into consideration the position and handles repetitions in replacement strings correctly, so a pattern of \[tag\](a)(b)\1\[/tag\] and replacement <strong>$2$1$2$1</strong>, comes out with a pattern of \<strong\>(b)(a)\1\2\</strong\> and replacement of [tag]$2$1$2[/tag].
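The repetition case can be checked the same way. This sketch only verifies that the two rules quoted above undo each other on one sample string; the delimiters and the sample input are assumptions, and the inverse rule is again typed in by hand.

```php
<?php
// Forward rule from the post; group 1 is repeated in the pattern via \1.
$forward_pattern     = '~\[tag\](a)(b)\1\[/tag\]~';
$forward_replacement = '<strong>$2$1$2$1</strong>';

// Inverse rule from the post: the first use of each token becomes a capture
// group, and later uses of the same token become back references.
$inverse_pattern     = '~\<strong\>(b)(a)\1\2\</strong\>~';
$inverse_replacement = '[tag]$2$1$2[/tag]';

$source = '[tag]aba[/tag]';
$html   = preg_replace($forward_pattern, $forward_replacement, $source);
$back   = preg_replace($inverse_pattern, $inverse_replacement, $html);

echo $html, "\n"; // <strong>baba</strong>
echo $back, "\n"; // [tag]aba[/tag]
```

Note how the group numbering changes: (b) is group 2 in the forward pattern but group 1 in the inverse, because it is the first token to appear in the replacement string.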

Obviously, however, it cannot handle all patterns, or even many of them.  The inverse function can’t readily determine what to put in conditions/patterns which aren’t captured or used – it will try to make guesses sometimes, but it’s usually crap.
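A small illustration of that limitation, as a sketch only: the `[list]` rule below is a made-up example, and the inverse rule is a hand-written guess of the kind described (one space standing in for each uncaptured `\s*`), not verified output of the function.

```php
<?php
// Hypothetical forward rule with uncaptured whitespace around the capture.
$forward_pattern     = '~\[list\]\s*(.*?)\s*\[/list\]~s';
$forward_replacement = '<ul>$1</ul>';

// Hand-written guess at an inverse: the \s* parts were never captured,
// so the best an inverse can do is invent something (here, a single space).
$inverse_pattern     = '~\<ul\>(.*?)\</ul\>~s';
$inverse_replacement = '[list] $1 [/list]';

$source = "[list]\nitem\n[/list]";
$html   = preg_replace($forward_pattern, $forward_replacement, $source);
$back   = preg_replace($inverse_pattern, $inverse_replacement, $html);

echo $html, "\n"; // <ul>item</ul>
echo $back, "\n"; // [list] item [/list]

// The original newlines are gone, but the guess still parses to the same HTML.
var_dump($back === $source);                                                     // bool(false)
var_dump(preg_replace($forward_pattern, $forward_replacement, $back) === $html); // bool(true)
```

So the reversed text is not identical to what the user typed, but in this case it is at least equivalent as far as the forward rule is concerned. With less forgiving patterns the guess may not match at all.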

At the current stage, $0 replacement and nested subpatterns are a little problematic, but probably work.


Update: since I’m no longer working on this (and as requested in the comments) here’s the code I was working on:

<?php

// this is a simple inverse for preg_replace.  This will obviously not be a complete reverse (it will only replace captured patterns) and even then, probably won't work all the time.
function pregreplace_inverse($pattern, $replacement) {

    $pattern = str_replace("\0", '', $pattern); // preg patterns can't contain nulls anyway, but be pedantic...

    // backup pattern
    $orig_pattern = $pattern;

    // first, find all captured patterns

    $matches = $placeholders = array();
    pregreplace_inverse_makephcache($placeholders);
    $num_matches = 0;
    /* explanation of below regular expression
     *    .*                            greedy prefix (to properly handle stacked expressions)
     *    (^|[^\\\\](?:\\\\\\\\)*)    ensure the bracketed thing hasn't been backslash escaped
     *    \((?:[^?\\\\]|[^?].*?([^\\\\](?:\\\\\\\\)*))\)    capture main pattern - if only one character, it can't be a ? or a \, if more than one character, can't start with a ? (non-captured) or end with a backslash (escaped bracket)
     *    ([*+]\\??|\\?|\\{\d+(?:,\d*)?\\})?    also stick in any quantifiers which follow the brackets (*, +, ?, {n,m})
     */
    while(preg_match('~.*(?:^|[^\\\\](?:\\\\\\\\)*)(\(([^?\\\\]|[^?].*?(?:[^\\\\](?:\\\\\\\\)*))\)([*+]\\??|\\?|\\{\d+(?:,\d*)?\\})?)~s', $pattern, $match, PREG_OFFSET_CAPTURE)) {
        // remove the captured pattern from the pattern, and put in placeholder
        if(!isset($placeholders[$num_matches])) pregreplace_inverse_makephcache($placeholders);
        $pattern = substr($pattern, 0, $match[1][1]) . $placeholders[$num_matches] . substr($pattern, $match[1][1] + strlen($match[1][0]));

        $matches[$num_matches] = array(
            'pattern' => $match[2][0],
            'quant' => isset($match[3]) ? $match[3][0] : '', // group 3 is absent when no quantifier follows the brackets
            //'id' => $num_matches,
        );

        ++$num_matches;
    }

    // now reverse matches, as we retrieved them from back-to-front
    //$matches = array_reverse($matches);

    // replace matches in matches with back references
    // TODO: check
    /* foreach($matches as &$match) {
        if(strpos($match['pattern'], "\0") !== false) {
            $match['pattern'] = preg_replace('~\\0__placeholder__(\d+)__\\0~e', '\'\\\\\'.'.$num_matches.'-$1', $match['pattern']);
        }
    } */

    // now we start changing the replacement string
    $r = preg_split('~(?<=[^\\\\]|^)(\\\\\\\\)*(\$\d+)~s', $replacement, -1, PREG_SPLIT_DELIM_CAPTURE);
    $c = count($r);
    $pc = 0; // pattern count (for backrefs)
    //for($i=2; $i<$c; $i+=3) {
    for($i=0; $i<$c; $i++) {
        if(($i-2) % 3) {
            $r[$i] = preg_quote($r[$i], '#');
        } else {
            // grab number
            $n = intval(substr($r[$i], 1));

            if($n) {
                $match =& $matches[$num_matches-$n];

                if(isset($match['pid']))
                    $r[$i] = '\\'.$match['pid']; // back reference
                else {
                    $r[$i] = '('.$match['pattern'].')'.$match['quant'];
                    $match['pid'] = ++$pc;
                }
            }
            else {
                // replacement for $0
                if(isset($orig_pid))
                    $r[$i] = '\\'.$orig_pid;
                else {
                    $r[$i] = '('.$orig_pattern.')';
                    $orig_pid = ++$pc;
                    // as the original pattern may contain capturing subpatterns, amend $pc accordingly
                    $pc += $num_matches-1;
                    // note that the above isn't a "perfectly" correct way to do this (eg patterns can differ)
                }
            }
        }
    }

    // fix nested patterns
    for($i=2; $i<$c; $i+=3) {
        if(strpos($r[$i], "\0") !== false) {
            // TODO: check if it references a future reference and swap if necessary
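            // (the /e modifier was removed in PHP 7, so use a callback instead)
            $r[$i] = preg_replace_callback('~\\0__placeholder__(\d+)__\\0~', function($m) use ($matches) {
                // nested capture: emit a back reference if that group has already been placed
                return isset($matches[$m[1]]['pid']) ? '\\'.$matches[$m[1]]['pid'] : '';
            }, $r[$i]);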
        }
    }
    $r = implode('', $r);

    // finally, fix up the source pattern
    // first, try to do basic heuristics - this will suck, but try something at least, that'll probably work in most cases
    // escape sequences (must be first); the /e modifier was removed in PHP 7, so use a callback
    $pattern = preg_replace_callback('~\\\\(x[0-9a-fA-F]{0,2}|0[0-7]{0,2}|c.|[^xc0-9])~', function($m) {
        return pregreplace_inverse_ptn_escape($m[1]);
    }, $pattern);
    $pattern = preg_replace(array(
        //'~\(\?[<>]?[=!].+?\)~s', // look behind/ahead - bad pattern because nested brackets can stuff it up, but I'm lazy
        //'~\[([^\^]).*?\]~', // character sequence (we're possibly a bit stuffed if this contains an escape sequence...)
        '~\[\^.+?\]~', // exclusion character sequence
        '~\.~', // any char
        '~[*+]\??~', // quantifier
        '~\?~', // quantifier2
        '~\{\d+(?:,\d*)?\}\??~', // quantifier3 ({0} not handled correctly)

        // keep ^ and $ tokens as is, as they're _probably_ okay
        // can't handle (?: ... ) or |'s so ignore >_>
    ), array(
        //'', // throw look ahead/behind away - can't deal with them
        //'$1', // just replace with first character
        "\1", // random character which is unlikely to be excluded
        ' ', // well, here's a character...
        '', // throw away quantifiers
        '', // throw away quantifiers
        '', // throw away quantifiers

    ), $pattern);
    // next, placeholders, and backreferences
    $pattern = preg_replace_callback('~\\0__placeholder__(\d+)__\\0~', function($m) use ($matches) {
        return isset($matches[$m[1]]['pid']) ? '$'.$matches[$m[1]]['pid'] : '';
    }, $pattern);
    $pattern = preg_replace_callback('~\\\\(\d+)~', function($m) use ($matches, $num_matches) {
        // matches were collected back-to-front, so group n lives at index $num_matches-n
        $idx = $num_matches - $m[1];
        return isset($matches[$idx]['pid']) ? '$'.$matches[$idx]['pid'] : '';
    }, $pattern);

    return array(
        'pattern' => $r,
        'replacement' => $pattern
    );
}

function pregreplace_inverse_makephcache(&$a) {
    $c = count($a);
    for($i=0; $i<50; $i++) // increment cache size by 50
        $a[] = "\0__placeholder__".($c+$i)."__\0";
}

function pregreplace_inverse_ptn_escape($char) {
    $char = str_replace('\\"', '"', $char);
    switch($char[0]) { // curly-brace string offsets were removed in PHP 8
        case '[': case ']': case '(': case ')': case '{': case '}':
        case '?': case '*': case '+': case '.': case '|': case '^': case '$': case '\\': 
            return $char;

        case 'a': return "\x07"; // PHP strings have no "\a" escape, so spell out the bell character
        case 'e': return "\x1B";
        case 'f': return "\x0C";
        case 'n': return "\n";
        case 'r': return "\r";
        case 't': return "\t";

        case 'c': return chr(ord(strtoupper($char[1])) ^ 0x40);
        case 'x':
            $hex = substr($char, 1);
            if($hex) return chr(hexdec($hex));
            else return "\0";
        case '0':
            $oct = substr($char, 1);
            if($oct) return chr(octdec($oct));
            else return "\0";

        case 'd': return '0';
        case 'D': return 'a';
        case 's': return ' ';
        case 'S': return '_';
        case 'w': return 'a';
        case 'W': return ' ';

        case 'b': case 'B': case 'A': case 'Z': case 'z': case 'G':
            return '';

        default: // also handles back references :P
            return '\\'.$char;
    }
}

var_dump(pregreplace_inverse(
    '\[tag\](a)e?\[/tag\]', '<strong>$1</strong>'
));

2 thoughts on “Reverse custom MyCode parser”

    1. ZiNgA BuRgA Post author

      You’re right, no longer interested.
      I’ve put up the reversing code that I wrote above, but probably isn’t much use to you.
