# python....?



## Hex (Aug 9, 2014)

I don't suppose there are any clever python people on here who could tell me basic things in idiot-proof language?

My new job requires some python and I haven't done anything like this for 10 years, easily, and even then I didn't do much (and it was C++, not python...)

So: this is the current basic issue I'm having...

I want to search for specific words within files *without* opening the files (because they're stupidly big). So, I know about open() and I can get python to go through a file and count, so that's all good.

The problem is, I need it to search for whole words and right now, it's returning every instance of that letter combination. e.g. If I search for "so", it counts all the instances of "so" as a word, but also all the instances of "so" in "*so*me" and "*so*rt" and "*so*lo" etc etc (you get the idea).

My super-complicated code so far includes this:

```
for line in file_to_work_on:
    line = line.rstrip()
    if line.find (string_of_interest) == -1:
        continue

    count = count + 1
```

----

Help, anyone? Heeeeeelllllp?


----------



## HareBrain (Aug 9, 2014)

Along the garden stairs
The sluggish python lies

That's what I know about pythons.

My one idea is to ask if you could replace all words where length <>2 with "XXX" (or whatever) first? Is there a function to do that?


----------



## Hex (Aug 9, 2014)

Careful, or I'll turn you into a peacock.

Maybe. I was hoping there was a simpler way (like a symbol that meant there was a space before and after the target string). I discovered something called re.search but its results so far are utterly bizarre.

Looks like I'm going to have to read the documentation properly. Sigh.

(thank you for the suggestion!)


----------



## J-Sun (Aug 9, 2014)

Hex said:


> Maybe. I was hoping there was a simpler way (like a symbol that meant there was a space before and after the target string). I discovered something called re.search but its results so far are utterly bizarre.



I don't speak python, but that's what I was going to say. '\bso\b' should match only "so"s that have word boundaries. If you're getting weird results, maybe post the exact code and output - it's often as simple as a quoting issue. But I probably won't be able to help. You'd likely get better help on a python or even a general computer forum though I don't doubt there's a python wizard somewhere on here who'll turn up eventually.
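To make the quoting issue concrete, here is a minimal sketch, assuming Python 3 and a made-up sample string (neither comes from Hex's actual code). In an ordinary string literal, `\b` is read as a backspace character before the regex engine ever sees it, which is one plausible cause of bizarre results; a raw string (the `r` prefix) passes it through as a word boundary.

```python
import re

# Hypothetical sample text, chosen only to illustrate the point.
text = "so some sort of solo, so"

# Ordinary string: "\b" becomes a backspace character, so nothing matches.
plain = re.findall("\bso\b", text)

# Raw string: "\b" reaches the regex engine as a word boundary.
raw = re.findall(r"\bso\b", text)

print(len(plain))  # 0
print(len(raw))    # 2
```

Only the two standalone "so"s are counted in the raw-string version; "some", "sort" and "solo" are skipped.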


----------



## Lenny (Aug 10, 2014)

I agree with J-Sun on the use of regular expressions, and the one he's suggested looks like it should work. Regex in Python guide, if you need one.

I'm not a Python wizard (not even a Python fan, to be honest), but looking at the re module documentation, my opinion is that re.findall is a good bet - it's a method that uses a regex to find all matches in a given string, and returns them as a list (which you can then do len() on to get the count).

Something like:


```
import re

string = "So? Saw some sons sewing, so I did. Rightly so!"

#the "r" prefix makes this a raw string, so "\b" reaches the regex engine
#as a word boundary instead of being read as a backspace character
matches = re.findall(r"\bso\b", string.lower())

print(len(matches))
#prints 3
```

I'd imagine this method of searching is case sensitive, so you'll want to use the lower() method to drop the input string to lowercase (otherwise, "\bso\b" would match "so", but not "So", "sO", or "SO").
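As an alternative to lowercasing the input, the `re` module has an `re.IGNORECASE` flag. A small sketch, assuming Python 3 and reusing the sample sentence from above; one side effect is that the matches keep their original capitalisation:

```python
import re

text = "So? Saw some sons sewing, so I did. Rightly so!"

# re.IGNORECASE makes the pattern match regardless of letter case,
# so the input string does not need to be lowercased first.
matches = re.findall(r"\bso\b", text, flags=re.IGNORECASE)

print(matches)       # ['So', 'so', 'so']
print(len(matches))  # 3
```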


If you're feeling fancy, you can cut out the iteration over the lines in the file by passing the file open code as a parameter. No clue if massive files will kill it, but life's no fun without the thrill of the unknown:


```
import re

regex = r"\bso\b"
matches = re.findall(regex, open('file.txt', 'r').read().lower())
print(len(matches))

#you could even do it all in one line:
#print(len(re.findall(r"\bso\b", open('file.txt', 'r').read().lower())))
```
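If reading the whole file at once does turn out to be too much for a massive file, a line-by-line version keeps only one line in memory at a time. This is a rough sketch, assuming Python 3; `count_word` and its parameters are names made up for illustration, and it assumes the search word is made of ordinary letters or digits so that `\b` behaves as expected. Note that it counts every occurrence, whereas Hex's original loop counts the lines that contain the string.

```python
import re

def count_word(path, word):
    """Count whole-word, case-insensitive occurrences of word in a text file."""
    # re.escape guards against regex metacharacters in the search word.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    total = 0
    with open(path, "r") as handle:
        # Iterating over the file object reads one line at a time.
        for line in handle:
            total += len(pattern.findall(line))
    return total
```

Calling `count_word('file.txt', 'so')` would then return the total for that file.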

---

Other links I used:

How to use findall, finditer, split, sub, and subn in Regular Expressions in Python for Regex Code Example - Runnable
https://docs.python.org/2/library/functions.html#len


----------

