
Weborama/File-Sip


NAME

File::Sip - file parser intended for big files that don't fit into main memory.

VERSION

version 0.003

DESCRIPTION

In most cases, you don't want to use this module; use File::Slurp::Tiny instead.

This class is able to read a line from a file without loading the whole file in memory. When you want to deal with files of millions of lines in a memory-constrained environment, brute force isn't an option.

An index of all the lines in the file is built, so that the starting position of any line can be looked up from its line number.

The memory used is then limited to the size of the index plus the size of the line that is read (until the line separator character is reached).
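The indexing idea described above can be sketched with core Perl only. This is a minimal illustration, not File::Sip's actual implementation: the helper names build_line_index and read_line_at are hypothetical, and the sketch assumes lines end with "\n" only, whereas the module supports a configurable line_separator.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: record the byte offset at which each line starts.
# Only one line is held in memory at a time while scanning.
sub build_line_index {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @index = (0);                    # the first line starts at byte 0
    while (defined(my $line = <$fh>)) {
        push @index, tell($fh);         # the next line would start here
    }
    pop @index;                         # last offset is end-of-file, not a line
    close $fh;
    return \@index;
}

# Hypothetical helper: seek to the start of a line and read only that line.
sub read_line_at {
    my ($path, $index, $line_number) = @_;
    return if $line_number < 0 || $line_number > $#$index;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    seek($fh, $index->[$line_number], 0) or die "Cannot seek in $path: $!";
    my $line = <$fh>;
    close $fh;
    return $line;
}

# Demo on a small temporary file.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "alpha\nbeta\ngamma\n";
close $tmp_fh;

my $demo_index = build_line_index($tmp_path);
print read_line_at($tmp_path, $demo_index, 1);   # prints "beta"
```

The index holds one integer per line, which is why the memory cost scales with the number of lines rather than with the size of the file.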

It also provides a way to nicely iterate over all the lines of the file, using only the amount of memory needed to store one line at a time, not the whole file.

ATTRIBUTES

path

Required, file path as a string.

line_separator

Optional, regular expression matching the newline separator, default is /(\015\012|\015|\012)/.
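To see what the default pattern matches, here is a small sketch that applies it to a string mixing the three newline conventions (CRLF, lone CR, lone LF). It uses Perl's built-in split only; how File::Sip applies the pattern internally is not shown here, and the variable names are illustrative.

```perl
use strict;
use warnings;

# Default separator pattern quoted in the documentation above.
my $line_separator = qr/(\015\012|\015|\012)/;

# Sample text mixing CRLF, a lone CR and a lone LF.
my $text = "one\015\012two\015three\012four";

# Without the capture group, split() returns only the line contents.
my @lines = split /\015\012|\015|\012/, $text;

# With the capturing form, split() also returns each separator it matched.
my @with_separators = split $line_separator, $text;

print scalar(@lines), " lines, ", scalar(@with_separators), " fields\n";
```

Because \015\012 comes first in the alternation, a CRLF pair is treated as a single separator rather than as two.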

is_utf8

Optional, flag to tell if the file is utf8-encoded, default is true.

If true, the line returned by read_line will be decoded.
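As an illustration of what decoding means here, the sketch below turns raw UTF-8 bytes into a Perl character string with the core Encode module. Whether File::Sip calls Encode::decode in exactly this way is an assumption; the sketch only shows the difference between the two forms of the same line.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "café\n" as raw UTF-8 bytes: the "é" takes two bytes (0xC3 0xA9).
my $bytes = "caf\xC3\xA9\n";

# Decoded character string: the "é" is now a single character.
my $chars = decode('UTF-8', $bytes);

printf "%d bytes, %d characters\n", length($bytes), length($chars);
```

With decoding disabled, functions such as length or regular expression matches operate on bytes instead of characters.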

index

Index that contains positions of all lines of the file, usage:

$sip->index->[ $line_number ] = $seek_position;

METHODS

read_line

Return the content of the line at the given line number (the line ends at line_separator).

my $line = $sip->read_line( $line_number );

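It's also possible to read the entire file line by line, by calling the method without a line number until undef is returned:

# test for definedness so that a line that evaluates to false (such as "0") does not end the loop
while (defined(my $line = $sip->read_line())) {
    # do something with $line
}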

ACKNOWLEDGMENT

This module was written at Weborama when dealing with huge raw files, where huge means "oh no, it really won't fit anymore in this compute slot!" (compute slots being limited in main memory).

BENCHMARK

File::Sip is not faster than in-memory parsers such as File::Slurp::Tiny, but it has a lower memory footprint. With small files the difference is not obvious, because the cost of the index is almost equal to the cost of holding all the characters of the file. As the file gets bigger, the memory savings grow.

With files bigger than a few megabytes, File::Sip will consume up to 20 times less memory than File::Slurp. This factor of 20 appears to be an asymptotic limit as the size of the studied files grows.

If you want to estimate the memory size of a running process that uses File::Sip, you can then assume that the size of the index will be around 1/20th of the size of the processed file.
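The rule of thumb above amounts to a single division. The helper below is a hypothetical convenience, not part of File::Sip, and the 1/20 ratio is the approximate figure quoted in this benchmark rather than a guarantee.

```perl
use strict;
use warnings;

# Hypothetical helper: rough index size in bytes, using the ~1/20 ratio above.
sub estimate_index_bytes {
    my ($file_size_bytes) = @_;
    return int($file_size_bytes / 20);
}

# Example: a 2 GiB input file.
my $file_size = 2 * 1024**3;
my $estimate  = estimate_index_bytes($file_size);

printf "Estimated index size: %.1f MiB\n", $estimate / 1024**2;
```

For a real file, the size can be obtained with Perl's -s file test operator and passed to the helper.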

AUTHORS

This module has been written at Weborama by Alexis Sukrieh and Bin Shu.

AUTHOR

Alexis Sukrieh <sukria@sukria.net>

COPYRIGHT AND LICENSE

This software is copyright (c) 2014 by Weborama.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
