Watching a Folder

At some point in every programmer’s life, they discover that the operating system will tell them when a file changes. This knowledge produces a particular kind of enthusiasm: automation that triggers itself, with no cron job, no polling, and no manual step. Just drop the file and the thing happens.

The enthusiasm is not wrong. File system watching is genuinely useful. But it has a character that rewards caution.

How It Goes Wrong

The first version usually looks like this in Python, using watchdog:

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
import time

class Handler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            process(event.src_path)

observer = Observer()
observer.schedule(Handler(), path="./incoming", recursive=False)
observer.start()

try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()

This works. It also fires on_created twice on some operating systems for some file types, fires before the file has finished being written, fires for the intermediate states left by editors that save by writing a temp file and renaming it, and does not tell you what happened if process() fails.

None of these are edge cases. They are the normal behavior of the file system event APIs on macOS and Linux (FSEvents and inotify).
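
The temp-file case, for instance, can often be recognized by name before any processing happens. A minimal sketch, assuming the scratch files in your folder follow common naming habits; is_temp_file and the suffix list are hypothetical and should be adjusted to whatever tools actually write into the folder.

```python
from pathlib import PurePath

# Assumed naming habits of editors and downloaders; not an exhaustive list.
TEMP_SUFFIXES = (".tmp", ".part", ".swp", ".crdownload", "~")


def is_temp_file(path: str) -> bool:
    """Return True if the file name looks like a scratch or partial file."""
    name = PurePath(path).name
    # Hidden dotfiles are treated as scratch files too (an assumption).
    return name.startswith(".") or name.endswith(TEMP_SUFFIXES)
```

Returning early from on_created when this is True drops most intermediate states. The final rename may arrive as a move event rather than a create, so it is worth checking which events your platform actually reports.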

Making It Survivable

The adjustments that matter most are not clever. They are defensive.

Wait before acting

A file that just appeared may not be finished writing. A short sleep before processing catches most cases:

import time

def on_created(self, event):
    if event.is_directory:
        return
    # Give the writing process a moment to finish.
    time.sleep(0.5)
    process(event.src_path)

This is not a complete solution. A 500 ms wait fails for large files or slow writers. A more robust approach is to wait until the file size stops changing:

import os
import time

def wait_for_stable(path: str, interval: float = 0.2, retries: int = 10) -> bool:
    """Return True when the file size has not changed for two consecutive checks."""
    prev = -1
    for _ in range(retries):
        try:
            current = os.path.getsize(path)
        except OSError:
            return False
        if current == prev:
            return True
        prev = current
        time.sleep(interval)
    return False

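This is more code than a fixed sleep, but it copes with large files and slow writers, provided the writer never pauses for longer than interval. With the defaults it gives up after about two seconds of continuous growth, so raise interval or retries for slow sources.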
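
Two equal sizes in a row is a heuristic: a writer that stalls briefly between chunks can look finished. A stricter sketch follows, under the assumption that a file left untouched for a configurable quiet period is complete. The name wait_until_quiet and its parameters are hypothetical and not part of watchdog.

```python
import os
import time


def wait_until_quiet(path: str, quiet_for: float = 1.0,
                     timeout: float = 30.0, poll: float = 0.1) -> bool:
    """Return True once size and mtime have stayed unchanged for quiet_for seconds."""
    deadline = time.monotonic() + timeout
    last_signature = None
    unchanged_since = time.monotonic()
    while time.monotonic() < deadline:
        try:
            st = os.stat(path)
        except OSError:
            return False  # file vanished or cannot be read
        signature = (st.st_size, st.st_mtime_ns)
        now = time.monotonic()
        if signature != last_signature:
            # Something changed: restart the quiet-period clock.
            last_signature = signature
            unchanged_since = now
        elif now - unchanged_since >= quiet_for:
            return True
        time.sleep(poll)
    return False  # still changing when the timeout expired
```

In a handler this would take the place of the fixed sleep: call it on event.src_path and run process() only when it returns True.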

Deduplicate events

If you are getting duplicate events, record when each path was last handled and ignore repeats within a short window:

import time
from watchdog.events import FileSystemEventHandler

class Handler(FileSystemEventHandler):
    def __init__(self):
        super().__init__()
        self._seen: dict[str, float] = {}  # path -> time it was last processed
        self._window = 1.0  # seconds

    def on_created(self, event):
        if event.is_directory:
            return
        path = event.src_path
        now = time.monotonic()
        last = self._seen.get(path)
        # Compare against None, not 0: time.monotonic() has an arbitrary
        # starting point and may itself be smaller than the window.
        if last is not None and now - last < self._window:
            return
        self._seen[path] = now
        process(path)
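
The _seen dictionary in the handler above never forgets a path, so a long-running watcher slowly accumulates entries. One way to bound it, sketched here with a hypothetical RecentPaths helper, is to prune entries older than the window on every check. The injectable clock parameter is an assumption added so the logic can be exercised without real delays.

```python
import time


class RecentPaths:
    """Remember paths for a short window and forget them afterwards."""

    def __init__(self, window: float = 1.0, clock=time.monotonic):
        self._window = window
        self._clock = clock
        self._seen = {}  # path -> time it was first seen in this window

    def is_duplicate(self, path: str) -> bool:
        now = self._clock()
        # Drop entries that have aged out so the dict stays small.
        self._seen = {p: t for p, t in self._seen.items()
                      if now - t < self._window}
        if path in self._seen:
            return True
        self._seen[path] = now
        return False
```

The prune is linear in the number of remembered paths, which is acceptable at the low volumes where a watcher makes sense. A handler would keep one instance and return early whenever is_duplicate(event.src_path) is True.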

Log everything

When the watcher runs unattended, you will eventually need to know what it did and when. Even a simple log file with timestamps is worth the handful of lines it takes:

import logging

logging.basicConfig(
    filename="watcher.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def process(path: str) -> None:
    logging.info("Processing %s", path)
    # ...
    logging.info("Done: %s", path)
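
The logging above records the start and end of a successful run, but an exception inside process() would leave only the first line in the log. A small wrapper, sketched here as a hypothetical process_safely, records the traceback as well. It assumes logging has already been configured as shown; the logger name is arbitrary.

```python
import logging

logger = logging.getLogger("watcher")  # arbitrary name, chosen for this sketch


def process_safely(path: str, process) -> bool:
    """Run process(path); log success, or the full traceback on failure."""
    logger.info("Processing %s", path)
    try:
        process(path)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("Failed: %s", path)
        return False
    logger.info("Done: %s", path)
    return True
```

Returning a boolean instead of re-raising keeps one bad file from stopping the watcher, at the cost of relying on the log to surface failures.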

When to Use a Cron Job Instead

File system watching is the right tool when:

The trigger is genuinely file-arrival (a camera upload, an export from another tool)
The processing window matters (you want it to happen within seconds of arrival)
The volume is low enough that duplicate-event handling is not a serious burden

A cron job is often simpler when:

You just want something to run every N minutes regardless of file changes
The “watch” is really polling a remote source anyway
You need the job to run whether or not there are new files
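
For comparison, the periodic version needs no daemon at all. A minimal sketch of a function that a cron-run script could call every few minutes, assuming processed files can be moved out of the incoming folder so they are not picked up twice; process_pending and the folder layout are hypothetical.

```python
import shutil
from pathlib import Path


def process_pending(incoming: Path, done: Path, process) -> list[str]:
    """Process every regular file in incoming, then move it to done."""
    done.mkdir(parents=True, exist_ok=True)
    handled = []
    # sorted() takes a snapshot, so moving files does not disturb the loop.
    for entry in sorted(incoming.iterdir()):
        if not entry.is_file():
            continue
        process(str(entry))
        shutil.move(str(entry), str(done / entry.name))
        handled.append(entry.name)
    return handled
```

A crontab entry such as */5 * * * * python3 scan.py (script name and interval are placeholders) would run it every five minutes. Moving files into done stands in for the deduplication logic; a stability check may still be needed if a writer can be mid-file when the scan runs.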

The overhead of a watcher setup — the daemon, the deduplication logic, the stability check — is only worth paying if the event-driven behavior is genuinely what you need.

A useful summary from experience: reach for watchdog (or equivalent) when the file system event is the right abstraction. Reach for cron when you are retrofitting event-driven behavior onto something that is really just periodic.

More on the properties that make scripts like these survive long-term in A Script That Stays Used.