Simple Blog Post Stats
I got curious about the word count in my blog. When I generate the static site,
I simply count whitespace-separated strings as words, but that automatically
includes code sections. To find out how much prose I wrote, I need to process
each blog post, ignoring the front matter (the information at the top of each
post, surrounded by ---) and the code sections (which are surrounded by three
backticks), and adding valid words to my statistics (I'm just keeping a total
word count and a count of each word). I got the following results:
Word Count: 7089
Wow! Over 7000 words! I'm pretty happy with that!
| order | word | count |
|=======|======|=======|
| 1 | the | 402 |
| 2 | to | 296 |
| 3 | a | 196 |
| 4 | i | 172 |
| 5 | and | 155 |
| 6 | it | 122 |
| 7 | this | 109 |
| 8 | of | 106 |
| 9 | is | 93 |
| 10 | in | 91 |
| 11 | with | 86 |
| 12 | that | 83 |
| 13 | for | 66 |
| 14 | on | 65 |
| 15 | my | 65 |
| 16 | you | 52 |
| 17 | from | 52 |
| 18 | be | 50 |
| 19 | use | 47 |
| 20 | can | 43 |
| 21 | so | 38 |
| 22 | if | 35 |
| 23 | we | 35 |
| 24 | file | 31 |
| 25 | command | 31 |
| 26 | site.baseurl | 30 |
| 27 | then | 29 |
| 28 | powershell | 29 |
| 29 | but | 28 |
| 30 | are | 28 |
| 31 | have | 28 |
| 32 | like | 28 |
| 33 | by | 27 |
| 34 | following | 26 |
| 35 | an | 26 |
| 36 | now | 25 |
| 37 | one | 25 |
| 38 | will | 24 |
| 39 | some | 24 |
| 40 | or | 24 |
| 41 | at | 24 |
| 42 | install | 23 |
| 43 | up | 23 |
| 44 | using | 23 |
| 45 | your | 22 |
| 46 | when | 22 |
| 47 | get | 21 |
| 48 | do | 21 |
| 49 | want | 21 |
| 50 | also | 21 |
Well, this was a lot more disappointing. Apparently I don't use a lot of interesting words.
Code
I used the following code to generate this:
#!/usr/bin/env python3

from collections import Counter
from pathlib import Path
import string
import sys

# This script goes through my _posts directory, strips
# out lines surrounded by ``` or --- blocks, then does a little
# statistics on the results


def is_valid_word(word):
    # A word counts if it has at least one ASCII letter and no backtick
    contains_letters = any(c in string.ascii_letters for c in word)
    not_a_variable = '`' not in word
    return contains_letters and not_a_variable


def munge_word(word):
    """ return the lowercase word with at most one trailing and one
    preceding non-letter character stripped"""
    word = word.lower()
    if word and word[-1] not in string.ascii_lowercase:
        word = word[:-1]
    if word and word[0] not in string.ascii_lowercase:
        word = word[1:]
    return word


def main():
    counter = Counter()
    word_count = 0
    topdir = sys.argv[1]
    for path in Path(topdir).glob('*.md'):
        with open(path) as blog_post:
            # True while inside a ``` code block or --- front matter
            is_code = False
            for line in blog_post:
                if line.startswith('```') or line.startswith('---'):
                    is_code = not is_code
                    continue
                if not is_code:
                    # print(line, end='\n')
                    # now get stats :)
                    for word in line.split():
                        word = word.strip()
                        if is_valid_word(word):
                            word_count += 1
                            munged_word = munge_word(word)
                            counter[munged_word] += 1
    print()
    print('Word Count: ', word_count)
    print()
    # print(counter.most_common(100))
    print("| order | word | count |")
    print("|=======|======|=======|")
    for order, mci in enumerate(counter.most_common(50)):
        word, count = mci
        print(f"| {order + 1} | {word} | {count} |")


if __name__ == "__main__":
    main()
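
To make the filtering easier to poke at without a directory of posts, here is a minimal sketch of the same idea that works on a list of lines in memory. It is a variation, not the script that produced the numbers above: the names `prose_lines`, `count_words`, and `STOP_WORDS` are made up for illustration, the stop-word list is an arbitrary handful of words taken from the top of the table, and it strips all leading and trailing punctuation with `str.strip(string.punctuation)` instead of one character per side.

```python
from collections import Counter
import string

# Hypothetical stop-word list, for illustration only (not in the original script).
STOP_WORDS = frozenset({"the", "to", "a", "i", "and", "it", "this", "of", "is", "in"})

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def prose_lines(lines):
    """Yield only the lines outside front matter (---) and fenced code blocks."""
    skipping = False
    for line in lines:
        if line.startswith(FENCE) or line.startswith("---"):
            skipping = not skipping  # each delimiter line toggles the state
            continue
        if not skipping:
            yield line


def count_words(lines, stop_words=frozenset()):
    """Return (total valid words, Counter of non-stop words) for the prose lines."""
    counter = Counter()
    total = 0
    for line in prose_lines(lines):
        for raw in line.split():
            # Same validity rule as the script: at least one letter, no backtick.
            if "`" in raw or not any(c in string.ascii_letters for c in raw):
                continue
            total += 1
            # Assumption: strip every surrounding punctuation character.
            word = raw.lower().strip(string.punctuation)
            if word and word not in stop_words:
                counter[word] += 1
    return total, counter


sample = [
    "---",
    "title: Example",
    "---",
    "The quick fox (really!) jumps.",
    FENCE,
    'print("the code is ignored")',
    FENCE,
    "The fox uses `munge_word` twice.",
]

total, counts = count_words(sample, STOP_WORDS)
print(total, counts.most_common(1))  # prints: 9 [('fox', 2)]
```

Stop words still count toward the total here; they are only left out of the per-word tally, so the ranking has a chance of showing something more interesting than "the".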