Scalable and Memory Efficient Markov Chains

Published on Mar 28, 2026

This endeavour all started when I asked myself one morning if there is any way to generate sentences other than with natural language models, and to be fair it didn’t take long for me to think of Markov chains. They have been in use since the dawn of time, with use cases like predictive typing (if you apply them in the realm of language processing). Now I really don’t want to waffle on about Markov chains. There’s a fantastic video made recently by Veritasium that tells you all you need to know, from their history to how they work.

Note that sentence generation with Markov chains is a relatively low-tech solution. It can heavily plagiarise your training data, especially if you have a high state size, and the sentences it generates are mostly gibberish. Think back to using the predictive text feature on your phone’s keyboard: if you keep tapping on the next word it suggests, the first few words might make sense, but the output slowly devolves into gibberish and repeats itself over and over. Unfortunately, the state of the art for sentence generation is still LLMs (though for such a simple task you could probably opt for a lower-parameter model, or even GPT-2). The upside is that a Markov chain takes relatively little processing power and can run on a single CPU; my anemic Raspberry Pi could generate sentences in milliseconds. This project is more of a novelty and you should not be using it (or please continue to use it) to power social media troll farms.

Take note that in this article, I will be treating “word” and “token” as interchangeable terms, but this might differ if you plan to model other languages.

Memory usage

The first bottleneck I hit when I started using Markovify is that it stores the entirety of its model state in memory. Realistically, it would take a long while to saturate even 8 GB of RAM, but I want this to run on a 1 GB Raspberry Pi. I quickly decided to use a database, as non in-memory databases allow you to query vast amounts of data on disk. But first of all, I needed a way of processing and storing all the training data. There are two schools of thought here. The first is to just store the sentences in memory as an ultra-long array. Since it is my computer doing the model building and not the Pi, it is fine to temporarily hold all the data in RAM. The other is to store it temporarily in a medium such as an SQLite database, or even a newline-delimited text file. Since I am in no shortage of RAM (in 2025), I decided to stick with the former solution. To give you a sense of scale, my training data consists of 55 million separate strings, and it is still growing as of the time of writing; if your training data comes anywhere close to this number, the in-memory approach should still be feasible. Note that the data cleaning step is still pretty important, even if Markov chains are relatively forgiving.

Note that most of the engineering pain lies in getting the frequency map working. If you don’t care about token frequencies and merely want to generate the next word without any regard for how frequently it occurs together with the last few generated words, you can skip implementing the next few parts, but reading the article as a whole is still important for fully understanding how everything fits together.

The next step is to build the frequency map of the training data. If you want to generate sentences that sound as close as possible to your training data, you should take word frequencies into account. For example, if you have the following sentences:

I want to take a nap
I want to take a shit
I want to take a shit now

The resulting frequency table should produce something like this (assuming a state size of 2):

("I want") -> [("to", 3)]
("want to") -> [("take", 3)]
("to take") -> [("a", 3)]
("take a") -> [("nap", 1), ("shit", 2)]
("a shit") -> [("now", 1)]

Notice that each whitespace-delimited word is considered a single token, but there are other tokenisation strategies you might want to employ, especially if you are dealing with other languages. For example, it is very uncommon for Mandarin to use spaces, so you might want each logograph to be its own token.

When it’s time for the Markov chain to decide what goes after "take a", it might choose "nap" or "shit". But since "shit" happens more often, it is more likely to choose the latter path.
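As a sketch, the frequency table above can be built with a nested dictionary. The function name and the `state_size` parameter here are my own, not from any particular library:

```python
from collections import defaultdict

def build_frequency_map(sentences, state_size=2):
    """Map each tuple of `state_size` consecutive words to the
    frequency of every word that follows it."""
    freq = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            next_word = words[i + state_size]
            freq[state][next_word] += 1
    return freq

sentences = [
    "I want to take a nap",
    "I want to take a shit",
    "I want to take a shit now",
]
freq = build_frequency_map(sentences)
print(dict(freq[("take", "a")]))  # {'nap': 1, 'shit': 2}
```

This reproduces the table above: the state `("take", "a")` maps to `"nap"` once and `"shit"` twice.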

Note that this is a simplified representation of how the data will actually be stored: as I will illustrate later, it is not very efficient for determining which path is more frequent, especially when dealing with huge amounts of data.

Another thing that I forgot to mention is the presence of start and stop tokens. Without them, it is technically possible for the Markov chain to loop forever and generate an infinitely long sentence. Stop tokens have their own frequencies tied to them, and if it is much more likely for the sentence to end, the Markov chain will end the sentence and stop appending words to it. Start tokens are equally important, as they prevent the Markov chain from using words that only appear at the start of sentences while it is in the middle of generating one. This is another reason why I’m preserving capitalisation, even if it yields a lower variety of options, as I want to generate grammatically correct sentences.

("[START] I") -> [("want", 3)]
("I want") -> [("to", 3)]
("want to") -> [("take", 3)]
("to take") -> [("a", 3)]
("take a") -> [("nap", 1), ("shit", 2)]
("a nap") -> [("[STOP]", 1)]
("a shit") -> [("now", 1), ("[STOP]", 1)]

Be sure to use totally unique start and stop tokens (for example, by generating a constant but random UUID) so that others cannot guess them and mess up your model by inserting those terms into your training data.
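A minimal sketch of wiring in the sentinel tokens, assuming the same nested-dictionary frequency map as before. In practice you would generate the UUIDs once and hard-code them as constants, rather than regenerating them on every run as this toy version does:

```python
import uuid
from collections import defaultdict

# Constant but effectively unguessable sentinel tokens. Generate these
# once and store them; regenerating them would invalidate an existing model.
START = f"start-{uuid.uuid4()}"
STOP = f"stop-{uuid.uuid4()}"

def build_frequency_map(sentences, state_size=2):
    freq = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        # Wrap every sentence in the sentinels before counting.
        words = [START] + sentence.split() + [STOP]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            freq[state][words[i + state_size]] += 1
    return freq

freq = build_frequency_map(["I want to take a nap"])
print(dict(freq[("a", "nap")]))  # the STOP token, with frequency 1
```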

You might already be asking what I mean by “state size”. Explained simply, it is the number of tokens used to predict the next word. In the sentence chunk “I want to”, the 2 tokens “I want” make up a state of size 2, and both are used to predict the next word, which in this case is “to”. With a state size of 3, the chunk would be something like “I want to take”, and the words “I want to” would be used to predict the next word, “take”, and so on. The larger the state size, the more coherent the generated sentences become.

Fast Incremental Growth of your Model

As new data floods in, you don’t want to start from scratch every time, re-populating those sentences in memory and re-generating the frequency maps for data you have already processed. What you ideally want is to store the model as data you can reference later, and what better way to do it than with SQLite again. When new data floods in, simply append new rows to the database, or modify existing rows. Have I ever mentioned how much I love SQLite?
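One possible shape for this, using SQLite’s upsert syntax to either insert a new token pair or bump the frequency of an existing one. The `ngrams` table name and its columns are my own invention, and I use an in-memory database here; you would point it at a file on disk instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a path like "model.db" in practice
conn.execute("""
    CREATE TABLE IF NOT EXISTS ngrams (
        state TEXT NOT NULL,
        next  TEXT NOT NULL,
        freq  INTEGER NOT NULL,
        PRIMARY KEY (state, next)
    )
""")

def add_observation(state, next_word, count=1):
    # Insert a new row, or bump the frequency of an existing token pair.
    conn.execute(
        """INSERT INTO ngrams (state, next, freq) VALUES (?, ?, ?)
           ON CONFLICT (state, next) DO UPDATE SET freq = freq + excluded.freq""",
        (" ".join(state), next_word, count),
    )

add_observation(("take", "a"), "nap")
add_observation(("take", "a"), "shit")
add_observation(("take", "a"), "shit")
conn.commit()
```

The second `add_observation(("take", "a"), "shit")` call hits the `ON CONFLICT` branch and increments the existing row instead of inserting a duplicate.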

Efficient Inferencing

Remember what I said about the frequency map representation earlier? Well, I lied… While you could technically use it to generate sentences as-is, the time it takes to generate sentences scales linearly with your data. If your state is something extremely common like ”[START] I” and you have an extremely large training set, it will take forever for your Markov chain to sift through all the possible next tokens and pick the next word from its stored frequencies.

Instead, the key is to pre-compute everything during the model generation step, so that the machine doing sentence generation does not have to do it constantly. To put it simply, you want to pre-calculate the cumulative sum of the frequencies of all possible next words. For example, previously we had ("take a") -> [("nap", 1), ("shit", 2)], where "shit" has a frequency of 2 and "nap" has a frequency of 1. Instead of storing the raw frequencies, you store their cumulative sums: ("take a") -> [("nap", 1), ("shit", 3)]. During sentence generation, you take the last cumulative sum, which is 3, generate a random number up to that sum, and binary search for the token whose cumulative sum is the first to reach your randomly generated number. This reduces the lookup to logarithmic time, and it is a pretty elegant solution if I say so myself.

However, this also means we need an extra processing step to calculate the cumulative sums. It does not mean we can stop storing the actual frequency map, either: we still need it if we ever want to add more training data to the model, for example to update the frequency of a pre-existing token pair.
Calculating cumulative sums can be daunting for your computer, but with libraries like numpy (or by going mental and implementing your own solution), together with the fact that incremental updates of the model are possible, it’s not that much of a hassle to wait around for. Also, if you want pointers on speeding up SQLite inserts, this article is a treasure trove. To keep your database size in check, enable auto-vacuuming or run the VACUUM command periodically.
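The cumulative-sum trick can be sketched like this, using Python’s `bisect` module for the binary search; the function names are illustrative:

```python
import bisect
import random

def precompute(freqs):
    """Turn {next_word: freq} into parallel lists of words and running
    cumulative sums, e.g. {"nap": 1, "shit": 2} -> (["nap", "shit"], [1, 3])."""
    words, cumsums = [], []
    total = 0
    for word, freq in freqs.items():
        total += freq
        words.append(word)
        cumsums.append(total)
    return words, cumsums

def sample(words, cumsums):
    # Draw a number in [1, total] and binary-search for the first
    # cumulative sum that reaches it -- O(log n) per draw.
    r = random.randint(1, cumsums[-1])
    return words[bisect.bisect_left(cumsums, r)]

words, cumsums = precompute({"nap": 1, "shit": 2})
print(cumsums)  # [1, 3]
```

A draw of 1 lands on "nap", while 2 or 3 lands on "shit", so "shit" comes out roughly twice as often, matching the original frequencies.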

Another SQLite feature that comes in handy is JSONB. With JSONB, you can store the value of each key-value pair as JSON, which allows for quick querying of the data, for instance if you want to grab the token at a particular index of the value.
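A small sketch of the idea using SQLite’s JSON functions. The table layout is just an illustration; I use `json()` here for portability, but on SQLite 3.45+ you can wrap the parameter in `jsonb()` instead to store the smaller binary JSONB encoding, and `json_extract` works on either:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chain (state TEXT PRIMARY KEY, nexts TEXT)")
# Each value is a JSON array of [token, cumulative_sum] pairs.
conn.execute(
    "INSERT INTO chain VALUES (?, json(?))",
    ("take a", '[["nap", 1], ["shit", 3]]'),
)
# Pull a single token out of the value without decoding it all in Python:
# '$[1][0]' is the first element of the second pair.
row = conn.execute(
    "SELECT json_extract(nexts, '$[1][0]') FROM chain WHERE state = ?",
    ("take a",),
).fetchone()
print(row[0])  # shit
```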

To start off inferencing, you can either begin with a specific word, or generate a random sentence by getting the total number of key-value pairs and choosing a random SQLite row number. To finish off, simply strip the start and stop tokens from the beginning and end of your completed sentence.
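Putting it all together, here is a toy generation loop over a hand-built pre-computed model. The `model` dict mirrors the earlier examples and is purely illustrative; in the real thing these lookups would hit SQLite:

```python
import bisect
import random

START, STOP = "[START]", "[STOP]"

# state -> (next tokens, cumulative frequency sums), pre-computed earlier.
model = {
    (START, "I"): (["want"], [3]),
    ("I", "want"): (["to"], [3]),
    ("want", "to"): (["take"], [3]),
    ("to", "take"): (["a"], [3]),
    ("take", "a"): (["nap", "shit"], [1, 3]),
    ("a", "nap"): ([STOP], [1]),
    ("a", "shit"): (["now", STOP], [1, 2]),
    ("shit", "now"): ([STOP], [1]),
}

def generate(first_word, state_size=2, max_len=50):
    words = [START, first_word]
    # Keep sampling until the chain emits a stop token (or we hit a cap).
    while words[-1] != STOP and len(words) < max_len:
        nexts, cumsums = model[tuple(words[-state_size:])]
        r = random.randint(1, cumsums[-1])
        words.append(nexts[bisect.bisect_left(cumsums, r)])
    # Strip the start and stop tokens before returning the sentence.
    return " ".join(w for w in words if w not in (START, STOP))

print(generate("I"))
```

Depending on the random draws, this prints one of the three training sentences, with the "shit" paths twice as likely as the "nap" one.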

Transferring model files

As I have said earlier, I am running sentence generation on a Raspberry Pi, but model generation is done on my personal laptop. To facilitate efficient transfer of data over SSH, you could either use rsync, or another tool that might fit this use case better: sqlite-rsync. It computes a delta between your local and remote databases and only makes surgical modifications to the remote one, speeding up the transfer process. Do note that the latter tool is typically not packaged by default in package managers, and you might need to compile it yourself.

Conclusion

I have been thinking of making this post for a long time. As I cap off this post on the train, my takeaway is this: instead of reaching for the nearest shiny LLM, try to evaluate your options and consider other, perhaps more energy-efficient and/or performant methods. All I wanted was a way of generating funny messages without them needing to be coherent, and burning through GPU tokens to achieve my silly objective doesn’t seem like a fair trade-off to me.