number_of_matches_per_file

Task

Given a set of files in a directory, count how many times a word occurs in each file. For example, given

% tail -n +1 *
==> junk1.txt <==
foo
bar
foo bar foo
bar foo bar

==> junk2.txt <==
foo
bar
foo bar foo
bar foo bar
foo foo foo

count the number of occurrences of 'foo' in each file. The expected answer is

junk1.txt:4
junk2.txt:7

tags | Number of matches per file

sample code demos | cat with filename

Solution using git grep and awk

If it is a git repository

git grep -o foo  | awk -F':' '{freq[$1]++} END{for (file in freq) print file ":" freq[file]}'

If it is not a git repository (--no-index makes git grep search files in the current directory that are not managed by git)

git grep --no-index -o foo  | awk -F':' '{freq[$1]++} END{for (file in freq) print file ":" freq[file]}'

For example

% git grep --no-index -o foo  | awk -F':' '{freq[$1]++} END{for (file in freq) print file ":" freq[file]}'
junk2.txt:7
junk1.txt:4

How it works

The git grep command gives

% git grep --no-index -o foo
junk1.txt:foo
junk1.txt:foo
junk1.txt:foo
junk1.txt:foo
junk2.txt:foo
junk2.txt:foo
junk2.txt:foo
junk2.txt:foo
junk2.txt:foo
junk2.txt:foo
junk2.txt:foo

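The awk command counts the number of hits per file: -F':' splits each line at the colon, freq[$1]++ increments a counter keyed by the file name, and the END block prints one file:count line per key. awk does not guarantee the iteration order of “for (file in freq)”, which is why junk2.txt is printed before junk1.txt in the example above.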
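The awk stage can be tried on its own, without git, by feeding it lines in the same file:match format. This is only a sketch: the three input lines are made-up sample data, and the trailing sort is an addition to make the output order predictable.

```shell
# Hypothetical input: the same "file:match" lines that "git grep -o" prints.
counts=$(printf '%s\n' junk1.txt:foo junk1.txt:foo junk2.txt:foo \
  | awk -F':' '{freq[$1]++} END{for (file in freq) print file ":" freq[file]}' \
  | LC_ALL=C sort)   # sort is an addition; awk's own output order is unspecified

echo "$counts"
```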

References

tags

awk frequency count, awk count breakdown, uniq reverse output, “git grep” count matches, count “grep -o”, “grep -o” counts, “grep -o” summarize

Solution using grep and awk

grep -ro foo * | awk -F':' '{freq[$1]++} END{for (file in freq) print file ":" freq[file]}'

Useful if git is not available.
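To try this on the sample data from the Task section, something like the following should work. It is a sketch that assumes mktemp is available; the workdir and result variable names are chosen here for illustration, and the sort at the end is an addition that fixes the output order.

```shell
# Recreate the two sample files in a scratch directory (hypothetical location).
workdir=$(mktemp -d)
printf '%s\n' 'foo' 'bar' 'foo bar foo' 'bar foo bar' > "$workdir/junk1.txt"
printf '%s\n' 'foo' 'bar' 'foo bar foo' 'bar foo bar' 'foo foo foo' > "$workdir/junk2.txt"

# Run the grep + awk solution inside that directory and sort the result.
result=$(
  cd "$workdir" &&
  grep -ro foo * \
    | awk -F':' '{freq[$1]++} END{for (file in freq) print file ":" freq[file]}' \
    | LC_ALL=C sort
)

echo "$result"
rm -rf "$workdir"   # clean up the scratch directory
```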

Solution using find, grep and wc

find * -printf 'echo "%p:$(grep -o "foo" %p | wc -l)";' | sh

For example

% find * -printf 'echo "%p:$(grep -o "foo" %p | wc -l)";' | sh
junk1.txt:4
junk2.txt:7

How it works

To see how it works, run the command without piping the output to sh

% find * -printf 'echo "%p:$(grep -o "foo" %p | wc -l)";'     
echo "junk1.txt:$(grep -o "foo" junk1.txt | wc -l)";echo "junk2.txt:$(grep -o "foo" junk2.txt | wc -l)";

So we are just building up one long shell command that runs “grep -o” on each file and formats the output.

find *          - list every file and directory matched by *, descending into subdirectories
-printf ''      - format and print everything between the single-quotes.
%p in -printf   - will be replaced by the filename in find's output
grep -o         - print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
| sh            - execute the generated command
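Because %p is pasted into the generated command unquoted, a file name containing a space or a shell metacharacter breaks the command above. One possible variant, shown here only as a sketch and not as part of the original solution, passes the names to sh as arguments instead of building a command string. The file name "junk 1.txt" is a made-up example, -type f restricts find to regular files, and tr strips the padding that some wc implementations add.

```shell
# Scratch directory with one file name that contains a space (hypothetical data).
workdir=$(mktemp -d)
printf '%s\n' 'foo' 'bar' 'foo bar foo' 'bar foo bar' > "$workdir/junk 1.txt"
printf '%s\n' 'foo' 'bar' 'foo bar foo' 'bar foo bar' 'foo foo foo' > "$workdir/junk2.txt"

# find hands the file names to sh as arguments, so they are never re-parsed.
result=$(
  cd "$workdir" &&
  find . -type f -exec sh -c '
    for f do
      # ${f#./} drops the leading "./" that "find ." adds to each name
      printf "%s:%s\n" "${f#./}" "$(grep -o "foo" "$f" | wc -l | tr -d " ")"
    done' sh {} + | LC_ALL=C sort
)

echo "$result"
rm -rf "$workdir"   # clean up the scratch directory
```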

Note: You have to use “grep -o” and not “grep -c”. If a string occurs multiple times in a line, “grep -o” prints each occurrence on its own line, so every occurrence is counted. “grep -c” counts matching lines, so a line with several occurrences is counted only once. For example

% cat junk2.txt 
foo
bar
foo bar foo
bar foo bar
foo foo foo

% grep foo junk2.txt
foo
foo bar foo
bar foo bar
foo foo foo

% grep -o foo junk2.txt 
foo
foo
foo
foo
foo
foo
foo

% grep -c foo junk2.txt
4

% grep -o foo junk2.txt| wc -l
7
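The same difference shows up per file. When given several files, “grep -c” already prints file:count lines, but the counts are matching lines rather than matches. The sketch below recreates both sample files in a scratch directory (the variable names are illustrative) and collects both kinds of count.

```shell
# Recreate the sample files from the Task section (hypothetical scratch location).
workdir=$(mktemp -d)
printf '%s\n' 'foo' 'bar' 'foo bar foo' 'bar foo bar' > "$workdir/junk1.txt"
printf '%s\n' 'foo' 'bar' 'foo bar foo' 'bar foo bar' 'foo foo foo' > "$workdir/junk2.txt"

# grep -c: number of lines that contain foo, per file.
per_line=$(cd "$workdir" && grep -c foo junk1.txt junk2.txt)

# grep -o + awk: number of occurrences of foo, per file.
per_match=$(
  cd "$workdir" &&
  grep -o foo junk1.txt junk2.txt \
    | awk -F':' '{freq[$1]++} END{for (file in freq) print file ":" freq[file]}' \
    | LC_ALL=C sort
)

echo "matching lines:"; echo "$per_line"
echo "matches:";        echo "$per_match"
rm -rf "$workdir"   # clean up the scratch directory
```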

number_of_matches_per_file.txt · Last modified: 2021/11/04 14:59 by admin