• Welcome to Jose's Read Only Forum 2023.
 

A Very Small Fast Parser using PB Inline Assembler

Started by Charles Pegge, May 11, 2007, 10:35:31 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Charles Pegge

04. August 2006, 11:59:45

This parser is suitable for use in a script engine or for parsing any data strings
that use symbols as well as words. Call it repeatedly to indicate sequential words
and symbols in a string of text.

I have written it to accelerate my scripting language; the machine code is only 251 bytes
long and on the demo test my PC pushes out about 19 million words per second.

The code can be used inline or in its own subroutine or function. It interfaces PowerBasic
code with just 5 variables.

For further interest:

Once you understand the basics of how a Pentium works. the Assembler is surprisingly easy.
The rules are very strict but relatively simple compared to high level languages.
With careful incremental coding and testing, you can directly tap into the formidible processing
power of the x86 processors and outperform any compiled code.

Intel documentation for download from

http://www.intel.com

Intel Architecture Software Developer's Manual Volumes 1 to 3
The most useful reference is volume 2 chapter 3.

Also, the PowerBasic Help topic on the inline assember discusses how PowerBasic variables can
be used with assembler.


Charles Pegge

#1
05. August 2006, 11:04:07

A Slightly Larger but Faster Parser (PB Assembler)

The machine code is only 124 bytes long, but 'also uses a lookup table of 256 bytes. By reducing the
branching logic, this runs about 10% faster than the previous non tabular version (in excess of 21
million words/sec on my Athlon 3200).

'The table gives added ease of configuration and flexibility. The upper part of the table
'may be dispensed with if your source text uses 7 bit ascii only.

My thanks to all for feedback and Paul Dixon for comments and suggesting the table approach.

discussion:
http://www.powerbasic.com/support/forums/Forum8/HTML/003574.html


Charles Pegge

 08. August 2006, 15:51:23

A Fast Configurable Parser (PB Assembler)

Taking the concept further with more efficient code and additional flexibility.
You can change the parsing rules by tweaking the codings table.

With the demo code, it can read the entire bible in about 50-70 milliseconds
and return counts of various word types.

The text  can be optionally downloaded from:
http://www.talsystems.com/contest12/contest12.zip
It will still perform its default demo without the Bible text.


The parsing routine is now wrapped in a Function, accessible using
conventional Powerbasic, though with a 30% performance overhead.

discussion:
http://www.powerbasic.com/support/forums/Forum8/HTML/003574.html

There are problems in getting an accurate reading of performance, even with
the process set to RealTime priority. However you get the overall picture.


Update:

Bug identified and fixed: 23 Aug 2006

Affecting quoted text:  to interpret ascii 34 as normal punctuation should have a tokentype of 35, not 33
This does not affect this demo but becomes apparent when reading scripts other than the bible12.txt
Updated zip below:



Charles Pegge

19. August 2006, 20:19:22

Fast Parser with Word List (PB Assembler)

This code adds further functionality by enabling the use of a Token word list
in addition to the TokenType table. So you can get specific tokens returned
for words specified in the list.

As before, it can read the entire Bible in about 50-70 milliseconds
and return counts of various word types, but also provides counts
on the specified words.

The text  can be downloaded from:
http://www.talsystems.com/contest12/contest12.zip

The source code is extensively annotated with a view to easy adaptation
to your own needs. For instance if you require partial word matching
instead of exact matching so that document will also accept
documents or documentation, returning the same token.
This requires the substitution of only one instruction.

Update:

Bug identified and fixed: 23 Aug 2006
Affecting quoted text:  to interpret ascii 34 as normal punctuation should have a tokentype of 35, not 33
This does not affect prior demos but becomes apparent when reading scripts other than the bible12.txt
Updated zip below

MikeTrader

Charles,
Thank you for this.
Amazingly fast!
I think I can convert it for searching my UDT array. If it ignores all the binary chars it can search for the text.
I can put the whole UDT array into a dynamic string and search that.
I can figure out which record the text is in by the offset.
(each record is a fixed length)
In this way I can search "multiple documents" for a single word
The question is, how do I search for Phrases?

Charles Pegge

Mike, are you looking for an exact match when searching for a phrase?