Python 5. Dictionaries and Advanced String Handling

Contents

5.1 Dictionary basics

A list allows us to store a sequence of similar or different objects indexed by a number starting with 0.

>>> a=["Fred","Jaz","Angus"]
>>> print a[1]
Jaz

Dictionaries allow us to store values indexed by any non-mutable key, e.g. a number, or a string.

>>> b={"name":"Ibrahim","age":27,"mark":57.8}
>>> b["age"]
27

Note the use of {} curly brackets to construct the dictionary, and [] square brackets to index it. As with lists we can print out the dictionary by printing the reference to it.

>>> b
{'mark': 57.799999999999997, 'age': 27, 'name': 'Ibrahim'}

You can assign to an individual dictionary entry to add it or modify it

>>> b["pass"]=1		# 1 short for boolean true
>>> b["age"]=28		# happy birthday Ibrahim
>>> b
{'mark': 57.799999999999997, 'pass': 1, 'name': 'Ibrahim', 'age': 28}
>>>

You can remove key-value pairs with the del operator.

>>> del b["mark"]
>>> b
{'pass':1,'name': 'Ibrahim', 'age': 28}
>>>

In order to construct a dictionary you can start with an empty one.

>>> c={}

5.2 Compound dictionary keys and values

You can use a tuple as an index if you want more than one field to index a database record.

>>> a={("CV3 2EZ",34):"a/c 89762"}
>>> a
{('CV3 2EZ', 34): 'a/c 89762'}
>>>

In this example, the tuple combining a postcode and street number are used to index an account identifier. In the next example, a similar tuple is used to index a dictionary value which is a list. List items can be obtained by appending the list index after the dictionary index:

>>> a["CV3 2EZ",35]=["35 Hay Rd","Binley","Coventry"]
>>> a[("CV3 2EZ",35)][1]
'Binley'
>>>

You aren't allowed to use mutable lists as dictionary indexes, but you can convert a list to a tuple if you want, using the tuple() built in function.

5.3 Lists of Dictionaries

A list value can be a dictionary reference. This allows for sets of records, with the rows corresponding to the list items, and the columns corresponding to the dictionary name/value pairs .

This is similar to an array of struct records in 'C' but needs less coding:

def add(db):
  """ adds dictionary recs in list db """
  while 1:
    name=raw_input("enter name or * to quit: ")
    if name[0] == "*":
      break;
    mark=float(raw_input("enter mark: "))
    record={"name":name,"mark":mark} # create dictionary
    db.append(record) # add dict to end of list db

def show(db):
  """ displays dictionary recs in db """
  print "%10s %5s" % ("name","mark")
  for rec in db:
    print "%10s %5.1f" % (rec["name"],rec["mark"])

db=[]
add(db)
show(db)

This program was run with the following I/O:

enter name or * to quit: Harry
enter mark: 56.2
enter name or * to quit: Jaz
enter mark: 63.1
enter name or * to quit: *
      name  mark
     Harry  56.2
       Jaz  63.1

5.4 Other data structures

Python makes it relatively easy to design your own. We've seen dictionaries storing lists and lists of dictionaries. Every variable identifier in Python is a reference and the dereferencing is handled automatically by the language, so construction of advanced data structures such as binary trees is easier.

Many needs are met using Python's built in types:

* Dictionaries are implemented using very fast hash tables.

* Python lists grow and shrink using del and insert so you probably won't need to implement linked lists.

Solving more difficult problems often depends upon how you structure the data. E.G. consider the travelling salesman problem. This requires finding the shortest possible route, or a shorter choice of route when visiting a number of places e.g. towns and cities in a single journey. Another example of this problem concerns the sequence of holes to be drilled using a CNC machine tool on a printed circuit board (PCB). With more than 20 PCB holes it is impractical to find the optimal (i.e. the shortest) route for the drill because N x N-1 x N -2 ... x 1 possible routes exist for N holes. For values of N > 10 or so this is too many to compute and compare by trying every possible route. One algorithm which gives a good solution subdivides the board into regions and finds shorter routes to visit every hole in a region before moving on to the next region. This algorithm can be made to work much faster on the CPU if the PCB hole data is first indexed based on X and Y region co-ordinates, and then sorted by X and Y coordinates within the regional group. A data structure which allows for this will need to be designed.

A board game such as chess might require storage of a number of facts or statistics associated with each position on the board. This might require a list of lists (8 x 8) of dictionaries, with each dictionary relating to a particular board position.

Program designers requiring complex or specialised data structures often prefer to define classes to organise and manipulate these.

5.5 Saving and loading complex objects using pickle

Traditionally programmers used to have to parse these in and out of the program, so that data could be stored in external files, e.g. as comma-delimited text records. An alternative approach was to copy the in program memory representation of complex objects to binary files. The text file approach was preferred for networked applications where portability was required.

Object serialisation makes it possible for any object to be converted to a sequence of bytes. There is a similar requirement for storing this sequence on an external file or sending objects across a network. Python's pickle module does the job of serialising, loading and saving an arbitrarily complex object using a portable data format. The following load() and save() functions extend our previous list of dictionaries program so that data can be retained between sessions.

def load():
  import pickle
  """ loads serialised object from pickle file marks.pkl"""
  try:
    fp=open("marks.pkl","r")
    a=pickle.load(fp)
    fp.close()
    return a
  except(IOError):
    return []

def save(a):
  """ saves structured object into pickle file"""
  import pickle
  fp=open("marks.pkl","w")
  pickle.dump(a,fp)
  fp.close()

5.6 Using the string module

This allows for simple searching, substituting and matching of text.

>>> import string
>>> text="The rain in Spain falls mainly on the plain."
>>> string.replace(text,"ain","ough")
'The rough in Spough falls moughly on the plough.'

We can also call string object methods on the strings directly:

>>> text.lower()
'the rain in spain falls mainly on the plain.'
>>> text.upper()
'THE RAIN IN SPAIN FALLS MAINLY ON THE PLAIN.'

We can treat a pattern as a delimiter (default is space) to get a list of fields.

>>> text.split()
['The', 'rain', 'in', 'Spain', 'falls', 'mainly', 'on', 'the', 'plain.']
>>> a=text.split("ain")
>>> a
['The r', ' in Sp', ' falls m', 'ly on the pl', '.']

And we can even put Humpty Dumpty together again using string.join()

>>> b=["one","two","three"]
>>> string.join(b)
'one two three'
>>> string.join(b,"!")
'one!two!three'
>>> string.join(a,"ain")
'The rain in Spain falls mainly on the plain.'

Finding if something is in a string using string.find() gets the index of the first occurrence. Don't assume a 0 result is false here; index [0] is the start of the string. -1 indicates the string was not found.

>>> text
'The rain in Spain falls mainly on the plain.'
>>> string.find(text,"ain")
5
>>> string.find(text,"XXX")
-1

5.7 Introduction to Regular Expressions

Regular expressions (REs) enable more advanced matching, substitution and splitting of text than you can do with the string module. The alternatives to using REs for these jobs involve more complex ways of searching for text patterns using customised programming. Any programmer who has tried to do these jobs using languages without RE support will have created customised, and probably non- reusable code for the purpose. REs are quick and terse so can save a great deal of time, but they are unobvious to someone who isn't familiar with them. Program documentation is therefore needed if you use them.

The simplest RE is a string that matches any instance of itself e.g: "sh".

This RE will match when used to test any string containing: sh anywhere within the string. It matches cash, shell, ashen but not school. For this kind of situation we could probably just as well have used the string module. However, if you need to match on anything more complex read on.

Many applications require text patterns to be found in a file. A programmer designing an email client will want to make special use of a Subject: header record in the top (header) part of the email. Emails have a standardised format which we can use REs to parse.

Given a string object called email, the following program uses 3 REs to:

a. Split the email header from the email body.

b. Split the header into individual records,

c. Find the Subject: header and extract its contents.

import re
# read email text from file
fp=open("email.txt","r")
email=fp.read()

# Note use of r (raw) before the string quote
# to make \n a RE newline, not a Python newline escape.
# Head and body are split using \n\n as delimiter,
# maxsplit=1 splits just once, so everything after the head goes into the body.
head,body=re.split(r"\n\n",email,maxsplit=1)

# Split head into a list of headers. A header
# may have continuation lines starting with
# a space or tab. (?![ \t]) in an RE means "if not
# followed by a space or tab"
headers=re.split(r"\n(?![ \t])" ,head)

subject=None
for rec in headers: # process each header in turn
  # search at start of rec using ^ to indicate starting with
  match=re.search(r"^Subject: ",rec)
  if match: # true if Subject: at start of header line
    # get content of Subject: record using a slice
    subject=rec[match.end():]
 print subject

The above code, by organising the searching into 3 stages and splitting the email into its constituent parts (i.e. a list of header records and a body) was designed to be reused later to extract information from other email headers, e.g. for the email client to extract From: To: Cc: and Reply-To: headers.

REs using almost identical syntax as in Python are also found in ed, sed and grep , emacs and vi text tools and Awk and Perl programming languages. The Java JDK version 1.4 also includes a class RE implementation. Once you have learned REs in one of these languages you will be able to use them anywhere else.

One of the best sources of information on Python RE usage is available from the Python HOWTO site: http://py-howto.sourceforge.net/regex/regex.html

5.8 Common Patterns and REs to match them

Match if pattern at beginning or end

"^xyz"		matches xyz at the start of a string
"xyz$"		matches xyz at the end of a string

match range of characters:

"[0-9]" 		matches any string containing a numeral in the range 0-9.
"[a-zA-Z] 		matches any lower or upper case letter.
"[aeiou]" 		matches any vowel.
"[^aeiou]"		matches any character that is not a vowel.
"[0-9][a-zA-Z]"	matches any numeral followed by a letter.
"c[aeiou]t"		matches cat, faucet, cite, cot and cute etc.

match alternative words

"red|green|blue|yellow" will match strings containing any of the words delimited by |
characters.

Shortcuts

\w		word character, equivalent to [a-zA-Z0-9_].
\W		non-alphanumeric, e.g punctuation, same as [^a-zA-Z0-9_].
\d		any decimal number same as [0-9]
\D		any non-numeric character same as [^0-9].
\s		any whitespace character same as [ \t\n\r\f\v].
\S		any non-whitespace character same as [^ \t\n\r\f\v].
. 		(.) matches any character except newline, see also re.DOTALL
\b		matches any word boundary e.g. space, tab, comma etc.

These can also go inside a [] bracketed group. E.G [\da-z] will match any numeral or lower case alpha.

escapes

\.		matches a literal full stop .
\+		matches a literal +

The list of RE characters with special meanings (i.e. meta-characters) needing escaping with \ is: \. ^ $ * + ? { [ \ | ( ) If in any doubt use a backslash in front of a punctutation character if you want the literal and this will usually work.

multipliers of a match

"[a-z]*"	* indicates zero or more letters in range a-z
"H+"		+ indicates one or more H letters
"[0-9]?"	? indicates zero or one numeral 0-9
Z{2,4}		matches between 2 and 4  inclusive Z's i.e: ZZ, ZZZ, ZZZZ

These multipliers can also apply to more than a single character. By putting a word or part of an RE inside () brackets, the multiplier will apply to the whole word or part:

"(but.)+"		matches one or more but's: but, butbut, butbutbut etc.

greedy and non-greedy multipliers

Where there is a choice when using a multiplier, by default the matching engine will be greedy, i.e. match as many characters as possible. Sometimes we will want to match something else immediately after, which we don't want consumed. In this case we can put a ? character after a multiplier to make it non-greedy i.e. so that a following match atom will match as early as possible and the whole RE matches as short a string as possible.

e.g: "<body\s?.*?>"

See below for \s (whitespace) and (.) any character wildcards. Here the first question mark ? acts as a 0 or 1 multiplier on the \s whitespace character wildcard. The second ? mark qualifies the preceding * (0 or more) multiplier to make it non-greedy. This causes the entire RE to match up to the end of the <body> opener tag, not up to the end of the HTML document.

5.9 Using a RE in a Python program

Simple use is achieved by directly calling functions within the re module

import re
text="No Mozarella, Stilton, Feta , or Gruyere today sir."
regex=r"\bf[aeiou][^aeiou]\w*\b"
# word starting with f followed by vowel, followed
# by consonent, followed by 0 or more word characters

match=re.search(regex,text,re.IGNORECASE)
if match:
  print "found: %s in: %s" % (match.group(),text)
else:
  print "no match"

found: Feta in: No Mozarella, Stilton, Feta, or Gruyere today sir.

If the same RE is going to be applied many times within the same program compiling it into a compiled RE is likely to be faster.

>>> import re
>>> cre = re.compile("and")

Sometimes we would want to match SPAM, spam, Spam or even sPaM, i.e. we want to match our RE case insensitively. Flags which change the way REs will be used can also be passed at compile time:

>>> text="Spandau Ballet"
>>> cre=re.compile("and",re.I)

We could have used re.IGNORECASE instead of re.I . Using the search method on cre we can now create a RE match object:

>>> mo=cre.search(text)

Apart from slightly faster use if used many times this is similar to: mo=re.search(regexp,text)

5.10 Getting text before, on and after the match

Our match object can now give us the start and end indexes for the matching part of the text which we searched, using start() and end() or span() methods:

>>> mo.group()	# gets us the matching word
'and'
>>> mo.end()		# gets the end index
5
>>> mo.start()	# gets the start index
2
>>> mo.span()	# gets start and end indices as a tuple
(2, 5)
>>> text[:mo.start()]	# slice gets part of text before the match
'Sp'
>>> text[mo.end():]	# slice gets part of text after match
'au Ballet'
>>> text[mo.start():mo.end()]	# same as mo.group()
'and'

5.11 Substituting matched text

We can use sub() as a method or function to create a substituted string with the match text replaced with something else we have specified. The method is called on a compiled regular expression object while the function has the RE as its first parameter and is otherwise the same.

In the following example we are testing code which will automatically add a white background bgcolor attribute to an html <body> tag

>>> subtext='<body bgcolor="#ffffff">' # replacement text
>>> regex='<body[^>]*>' # regular expression
>>> text='<html><head></head><body>hello html world!</body></html>'
>>> re.sub(regex,subtext,text,re.I)
'<html><head></head><body bgcolor "#ffffff">hello html world!</body></html>'

Grouping parts of matched text for a more advanced substitution

In a more robust application we might prefer to retain attributes already present within the body tag and to replace the value of an existing attribute. In the following example, we want to retain the LINK and VLINK <body> tag attributes (which colour link text and visited link text respectively) and replace an existing value for BGCOLOR. However, if there is no current BGCOLOR tag we would prefer simply to insert the required value. We can check if a regular expression matched by testing a true (non None) value for the match object.

Here we are grouping 3 parts of the matched expression using () brackets, so that we can refer to the first and third text groups later within the substitution string using: \1 and \3.

import re
# first RE will find a body tag with a bgcolor attribute
cre_bgcol=re.compile(r"""(<body.+?)(bgcolor=[\'\"\#0-9a-f]*)(.*?>)""",re.I)
# second RE finds the start of any body tag
cre_body=re.compile(r"<body",re.I)

test_htmls=['<html><head></head><body LINK="#009900" bgcolor="#888888" Vlink="#990000">hello  html world!</body></html>','<html><head></head><body LINK="#009900" Vlink="#990000">hello  html world!</body></html>','<<<not a valid html doc>>>']

for html in test_htmls:
  mo_bgcol=cre_bgcol.search(html)
  mo_body=cre_body.search(html)
  if mo_bgcol:
    text=cre_bgcol.sub(r'\1BGCOLOR="FFFFFF"\3',html,1)
  elif mo_body:
    text=cre_body.sub(r'<BODY bgcolor="#FFFFFF"',html,1)
  else:
    text="Invalid html: no body tag"
  print "\n\nin: %s\nout: %s" % (html,text)

Running this program resulted in the following output:

in: <html><head></head><body LINK="#009900" bgcolor="#888888" Vlink="#990000">hello  html world!</body></html>
out: <html><head></head><body LINK="#009900" BGCOLOR="FFFFFF" Vlink="#990000">hello  html world!</body></html>

in: <html><head></head><body LINK="#009900" Vlink="#990000">hello  html world!</body></html>
out: <html><head></head><BODY bgcolor="#FFFFFF" LINK="#009900" Vlink="#990000">hello  html world!</body></html>

in: <<<not a valid html doc>>>
out: Invalid html: no body tag