Implement memory-efficient in-file binary search for CDXJ indexes #604
Comparing the implementations here would be useful.

Current ipwb replay binsearch:

```python
with open(fullFilePath, 'r') as cdxjFile:
    lines = cdxjFile.read().split('\n')
    lineFound = binary_search(lines, surtURI, retIndex, onlyURI)
```

```python
def binary_search(haystack, needle, returnIndex=False, onlyURI=False):
    lBound = 0
    uBound = None
    surtURIsAndDatetimes = []

    cdxjObj = objectifyCDXJData(haystack, onlyURI)
    surtURIsAndDatetimes = cdxjObj['data']
    metaLineCount = len(cdxjObj['metadata'])

    uBound = len(surtURIsAndDatetimes)

    pos = bisect_left(surtURIsAndDatetimes, needle, lBound, uBound)
    if pos != uBound and surtURIsAndDatetimes[pos] == needle:
        if returnIndex:  # Index useful for adjacent line searching
            return pos + metaLineCount
        return haystack[pos + metaLineCount]
    else:
        return None
```

Proposed binsearch from MementoMap:

```python
with open(infile) as f:
    surtk, freq, *_ = f.readline().split(maxsplit=2)
    if self.debug:
        print(f"Matching FIRST> {key} {surtk} {freq}", file=sys.stderr)
    if key == surtk:
        return [surtk, freq]
    left = 0
    f.seek(0, 2)
    right = f.tell()
    while (left < right):
        mid = (right + left) // 2
        f.seek(mid)
        f.readline()
        surtk, freq, *_ = f.readline().split(maxsplit=2)
        if self.debug:
            print(f"Matching {left}:{mid}:{right}> {key} {surtk} {freq}", file=sys.stderr)
        if key == surtk:
            return [surtk, freq]
        elif key > surtk:
            left = mid + 1
        else:
            right = mid - 1
```

As verbally discussed, the latter is a byte-wise rather than a line-wise (former) approach. The byte-wise approach may not perform an even line-wise bisection, but that is not important if the ultimate implementation is more memory efficient. Speed should also be evaluated, ideally with both small and very large CDXJ files. An outstanding TODO, as verbally described by @ibnesayeed, is handling multiple adjacent entries that match the criteria. The current implementation relies on the lines being in memory, which will not be available in the byte-wise approach.
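For evaluating speed and memory of the two approaches side by side, a minimal harness along these lines could work. It assumes the two snippets above are wrapped as plain functions `lookup_line_wise(path, key)` and `lookup_byte_wise(path, key)`; these wrapper names, `index.cdxj`, and the sample key are all hypothetical, not existing ipwb functions.

```python
# Minimal benchmarking sketch; assumes the two approaches above have been
# wrapped as lookup_line_wise(path, key) and lookup_byte_wise(path, key).
# Wrapper names, "index.cdxj", and the sample key are placeholders.
import timeit
import tracemalloc


def benchmark(fn, path, key, runs=100):
    tracemalloc.start()
    elapsed = timeit.timeit(lambda: fn(path, key), number=runs)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{fn.__name__}: {elapsed / runs * 1000:.3f} ms/lookup, "
          f"peak traced memory ~{peak / (1024 * 1024):.1f} MiB")


# benchmark(lookup_line_wise, "index.cdxj", "com,example)/")
# benchmark(lookup_byte_wise, "index.cdxj", "com,example)/")
```

Note that `tracemalloc` only accounts for Python-level allocations, so the line-wise approach's list of strings should show up clearly, while OS-level page cache effects will not.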
The in-file binary search I implemented for MementoMap is showing pretty good speed of about one millisecond per lookup on average. This measurement was performed on an index with almost 30 million records by successively looking up over 3 million different URIs. The process was performed on a remotely mounted NFS drive, and matching records were written to a file, which added some additional I/O overhead; otherwise, the lookup time could have been even smaller.
@ibnesayeed Can you provide a concrete way (e.g., code snippet, timing commands, sample data) to compare your approach against the current one?
The data I worked on is not CDXJ data, which is what IPWB's current code expects. I have large CDXJ files (though not ones with the locator key) that could be used to test it, but I cannot share those indexes publicly. UK Web Archive, however, has made a handful of their CDX indexes available online publicly (but those are not CDXJ files). My understanding is that even attempting to load a large CDXJ file in the current IPWB replay would take up all the memory and perhaps crash.
The memory issue might be something good to test as well. To get realistic benchmarks, how about we generate some IPWB-style CDXJ TimeMaps? We generate some WARCs in testing, so we should be able to also generate CDXJ files in the style of data that the code currently expects.
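If we go the fabricated-data route, a throwaway generator along these lines might be enough for load testing. The JSON field names and values below are placeholders, not the exact schema ipwb writes; the only property that matters for the binary search is many sorted `<SURT> <datetime> <JSON>` lines.

```python
# Throwaway generator for a large, sorted, CDXJ-style file for load testing.
# Field names/values are placeholders, not ipwb's actual index schema.
import json
import random
import string


def random_surt():
    host = "".join(random.choices(string.ascii_lowercase, k=12))
    return f"com,{host})/"


def fake_cdxj_lines(count):
    lines = []
    for _ in range(count):
        record = {
            "locator": "urn:ipfs/Qm.../Qm...",  # placeholder, not real hashes
            "mime_type": "text/html",
            "status_code": 200,
        }
        lines.append(f"{random_surt()} 20190101000000 {json.dumps(record)}")
    return sorted(lines)  # the index must stay sorted for binary search


with open("fabricated.cdxj", "w") as f:
    f.write("\n".join(fake_cdxj_lines(500_000)) + "\n")
```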
Here are the corresponding code blocks in MementoMap:
We don't necessarily need to generate dummy data to test how it would behave with large indexes. We can run it against any large text file as the index, even if it is an invalid index. That said, we can internally use some real CDXJ datasets we have for testing. My guess is that the following line would be the one that chokes the memory:

Line 1070 in 13eb385
I just realized that a single binary lookup takes a fraction of a millisecond. In the case of MementoMap, the supplied URI is looked up many times iteratively with shorter and shorter sub-URI keys until one is found or the keys are exhausted. This means those 3 million lookups would turn into 8-9 times as many (a rough guess; the actual number could be different).
OK. Per the above, some code snippets, timing commands, and sample data would make this more empirically convincing as a better solution than what is currently implemented in ipwb.
I created a CDXJ file with 500,000 entries using the code in the aforementioned repo and tried replaying it in ipwb via ...and that is not even searching for a URI.
Also, you should see a significant rise in memory consumption.
@ibnesayeed 300+ MB (with a spike) difference handling the request. For the sake of Counting Mementos™, there might be a more memory-efficient, quicker (sub-O(n)) but naive way to get a count for display on the main replay interface, e.g., a CRLF count in the file. The additional computation that has ipwb traversing the file for the extended display will eventually be moved to the

The above is a sort-of related aside from the primary task of this ticket (efficient search), but I figured I'd run a quick empirical test and report it here rather than create a new ticket (for now).
In a generic file, you can't count the number of lines without reading it all the way through. However, if you are looking for a rough estimate, you can heuristically calculate the average number of bytes per line and then divide the file size by it. This operation is O(1). That said, counting lines can be done accurately by processing the I/O stream instead of loading the whole file into memory. Also, opening the file in binary mode for this purpose should be faster than text mode, as it does not go through the decoding process for the content. If you want to discount the data transfer and rendering time, then you can make a HEAD request in a terminal using cURL and prefix it with
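A quick sketch of both line-count strategies described here; the 200-byte average is a made-up heuristic value that would need adjusting per dataset.

```python
import os


def count_lines_streaming(path, chunk_size=1024 * 1024):
    # Exact count: stream the file in binary chunks, so memory stays constant
    # and no text decoding is performed
    count = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            count += chunk.count(b"\n")
    return count


def estimate_line_count(path, avg_bytes_per_line=200):
    # O(1) rough estimate: file size divided by an assumed average line length
    return os.path.getsize(path) // avg_bytes_per_line
```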
On a 500,000 entry CDXJ file,

Related aside,

This information will eventually be moved to

EDIT (so as to not derail): The iteration via enumeration in

EDIT: Created a separate issue in #612 to handle these efforts, which are independent of the binary search optimization that this issue (#604) should address.
If you're looking for a way to do this in the browser, I would recommend looking at the IndexedDB API for storing the CDXJ in the browser, which provides a way to do range queries and, I think, eliminates the need for implementing this manually, e.g., https://developer.mozilla.org/en-US/docs/Web/API/IDBKeyRange

I started experimenting with this (but am not actually using it yet): https://github.com/webrecorder/wabac.js/blob/master/src/collIndex.js#L50
Thanks @ikreymer for the pointer to IndexedDB. I am sure we will find a good use case for that. However, this ticket is about implementing binary search in CDXJ files on the server.
I extracted the relevant parts from the code to which @ibnesayeed linked from MementoMap and appended them below to be integrated into ipwb. This executes and returns the value in 0.03 seconds on a 500k line CDXJ.

```python
import re


def lookup_keys(surt):
    keyre = re.compile(r"(.+)([,/]).+")
    key = surt.split("?")[0].strip("/")
    keys = [key, f"{key}/*"]
    while "," in key:
        try:
            m = keyre.match(key)
            keys.append(f"{m[1]}{m[2]}*")
            key = m[1]
        except Exception:
            key = key.strip("/,")
    return keys


def bin_search(iter, key):
    # Rewind so the first read starts on a line boundary, even when the
    # handle is reused across successive keys
    iter.seek(0)
    # Read line encompassing current position
    line = iter.readline()
    surtk, rest = line.split(maxsplit=1)
    if key == surtk:
        return line
    # If further searching required...
    left = 0
    # Go to end of seek stream
    iter.seek(0, 2)
    right = iter.tell()  # Current position (i.e., file size in bytes)
    while (right - left > 1):
        mid = (right + left) // 2
        iter.seek(mid)
        # Two readlines: the first discards the (likely partial) line at the
        # arbitrary byte offset mid, the second reads the next complete line
        iter.readline()
        line = iter.readline()
        if not line:  # Seeked past the last complete line
            right = mid
            continue
        surtk, rest = line.split(maxsplit=1)
        if key == surtk[0:-1]:
            return line
        elif key > surtk:
            left = mid
        else:
            right = mid


def run_batchlookup(filename, surt):
    mobj = open(filename, "rb")
    iter = mobj
    fobj = open(filename, "rb")
    for line in fobj:
        res = lookup(iter, surt)
        if res:
            return res
    fobj.close()
    mobj.close()


def lookup(iter, surt):
    for idx, key in enumerate(lookup_keys(surt)):
        res = bin_search(iter, key.encode())
        if res:
            return res


filename = 'test_sorted.cdxj'
surtKey = 'intuit,sr7ro5w7d6w3e2mmx4w/'
res = run_batchlookup(filename, surtKey)
print(res)
```

EDIT: A needed extension of this will be to test for and return the relevant adjacent lines that match the key.
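For reference, tracing `lookup_keys` by hand with the sample key above, `lookup_keys('intuit,sr7ro5w7d6w3e2mmx4w/')` should yield `['intuit,sr7ro5w7d6w3e2mmx4w', 'intuit,sr7ro5w7d6w3e2mmx4w/*', 'intuit,*']`, i.e., the exact key, its subtree wildcard, and a shortened host-level wildcard; this is the iterative sub-URI lookup @ibnesayeed described earlier.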
The issue-604 branch uses the aforementioned binary search algorithm and is effective at displaying the initial UI with very large CDXJ files in a reasonable amount of time. As noted above, the logic to return the other adjacent relevant lines still needs to be implemented, and some code cleanup is needed, entailing removing redundant functions and semantic revisions of the new function names.

EDIT: The lack of the multiple URIs also breaks some tests, which is good (that is their purpose).
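For the adjacent-lines TODO, one possible shape (a sketch only, not what the issue-604 branch does) is to binary-search byte offsets for a lower bound of the key and then scan forward collecting every line that shares the SURT key. Here `fh` is assumed to be an open binary handle of a sorted CDXJ file and `key` a bytes SURT key.

```python
# Sketch of returning all adjacent lines that share a SURT key: lower-bound
# binary search over byte offsets, then a forward scan for exact matches.
def find_all_matches(fh, key):
    fh.seek(0, 2)
    left, right = 0, fh.tell()
    # Shrink [left, right] until the line containing offset `left` sorts before `key`
    while right - left > 1:
        mid = (left + right) // 2
        fh.seek(mid)
        fh.readline()  # realign to the next line boundary
        line = fh.readline()
        # Treat EOF/blank reads as larger than any key so `right` moves down
        surtk = line.split(maxsplit=1)[0] if line.strip() else b"\xff"
        if surtk < key:
            left = mid
        else:
            right = mid
    # Scan forward from the lower bound, collecting only exact key matches
    fh.seek(left)
    if left:
        fh.readline()  # discard the partial line at offset `left`
    matches = []
    for line in iter(fh.readline, b""):
        surtk = line.split(maxsplit=1)[0] if line.strip() else b""
        if surtk == key:
            matches.append(line)
        elif surtk > key:
            break
    return matches
```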
Please pardon my intrusion, gentlemen. This is a very interesting project, a lot of work has been done, and I only started digging into the code today, so I might be very wrong. But here it goes.

This issue apparently relates to supporting very large CDXJ files. Loading such a file into RAM (which happens currently, and I even proposed an in-memory cache for that) can be unaffordable for a small server where an ipwb node is running.

As far as I can see, CDXJ files are essentially key → value maps. The goal is to find a value by a strictly given key, and binary search in such a file is only possible because the file is ordered lexicographically by key. However, the key-value data model is not unique; it is well supported by multiple database engines out there, including leading open source software. Redis, MongoDB, CouchDB, Cassandra, ElasticSearch, PostgreSQL, MySQL: all of them either support key-value collections directly or allow modeling them.

Imagine ipwb had support for plugins which are essentially connectors to various database backends. For speed or temporary operations you could choose Redis and offload your CDXJ files there. For permanent storage, you are welcome to use MongoDB. And if you are operating on really huge indices, you can offload them to Amazon S3 (which is essentially a key-value storage too, just with potentially very large values).

Thus, every ipwb node could choose what part of the global Trusted Web to host, and whichever backend is most suitable to host its indices. This is more flexible than one hard-coded file-based implementation. Thoughts?
Thanks @anatoly-scherbakov for your interest in the project, contributions, and thoughts. We really appreciate it. Yes, you are right that the index is a key/value store problem which has been solved a million times by many systems already. We can use any K/V system that allows two levels of indexing (i.e., a primary index on URIs and a secondary one on the datetime of the capture). While it is sensible to support multiple index backends going forward if the project matures and attracts more users, it was initially wise to hard-code a file-based index for the following reasons:
In fact, other archival replay systems like OpenWayback and PyWB already support a handful of index backends such as BerkeleyDB, CDX/CDXJ, and OutbackCDX. So it is not unlikely that IPWB will eventually allow index drivers for different K/V systems as well, if there is enough demand to justify the investment of cycles. Also, we are currently experimenting with many attributes to support per-record encryption and other features, which would become messy if we were to support multiple backends before settling on the schema, while a hard-coded text-based index gives us the flexibility to experiment more easily.

Binary search in CDX/CDXJ files is already a solved problem, but we are redoing it due to licensing issues. This project is released under MIT, which is a rather permissive license and is not compatible with strong copyleft licenses like the GPL that were chosen by the tools where binary CDX index search is implemented. We are already working on an independent binsearch implementation that can be used here when ready, but there is only a limited number of cycles available to spare on that, hence the progress is slow.
I see, thanks for the detailed explanation. What is the typical size of a CDXJ file in terms of line count and byte size? Would it be possible to just load it into RAM as Python
@anatoly-scherbakov Reiterating @ibnesayeed, thanks for your interest. While I am open to using other data sources for K/V lookup, one goal is for the CDXJ indexes to be shareable, with the sole dependency for replicating the lookup being IPFS rather than a centralized remote system. We also want the querying procedure to scale, hence the need for binsearch, which has mostly been implemented in #631, just not yet merged in.

As @ibnesayeed also said, ipwb was initially (literally) hacked together and progressively improved. Its design is what functioned at the time, so we are usually pretty open to improving it after the fact. I hacked (see a theme?) together a CDXJ generation script some time back in an attempt to generate a large amount of fabricated data solely for evaluating the ipwb lookup procedure. The improvements in #631 show a very promising performance increase with 5 million records (arbitrarily chosen). We are very open to improving this evaluation procedure to make the tool more usable and systematic.
Each line of the CDXJ file corresponds to one archived resource, with a few header lines that hold some metadata about the archive and the file itself. Each line contains a SURT (a sort-friendly and canonicalized transformation of the URI that is no longer than the original URI), a 14-digit datetime string, the original URI, the mime-type, the status code, two IPFS hashes (corresponding to the header and payload of the resource), and some additional attributes to support per-record encryption when needed. CDXJ is flexible about adding more attributes as needed.

If you want to store this information in a Python dictionary, you will essentially be creating a nested dict where the primary keys are SURTs, each storing a list or dict of index record objects identified/keyed by their 14-digit datetime (there may be collisions too, so be careful about them). Again, if something like this is to be implemented, it will only work for smaller archives; for large archives vanilla Python dictionaries will be a nightmare, even if you have sufficient memory available. As I mentioned earlier, it will require support for multiple index back ends that implement a uniform API, where this dictionary one could be just one of many possible back ends. The API will need to return:
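A rough illustration of the nested dictionary layout described above; every field name and value here is a made-up placeholder rather than the actual ipwb CDXJ schema.

```python
# Illustrative only: field names/values are placeholders, not ipwb's schema.
index = {
    "com,example)/": {                  # primary key: SURT
        "20190101120000": [             # secondary key: 14-digit datetime
            {                           # a list, to allow for datetime collisions
                "original_uri": "http://example.com/",
                "mime_type": "text/html",
                "status_code": 200,
                "header_ipfs_hash": "Qm...",   # placeholder
                "payload_ipfs_hash": "Qm...",  # placeholder
            }
        ]
    }
}
```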
This is a very detailed explanation, thank you; I agree that a

If

I don't like cases 5 and 6; they are too verbose, but at this point I am not sure how to organize them better.
While I listed all the atomic requirements earlier, the actual API, based on the usage requirements, can be much simpler and more efficient with only two public functions:
Note: The value corresponding to each key in the returned dictionary of the second function can be a list of record objects, if we plan to support collisions. In addition to these, the index API may also provide an overall iterator with a filter argument to iterate over all the matching records for the sake of record pinning or MementoMap generation.
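A hypothetical sketch of what such a two-function index API could look like; the names, signatures, and return structures here are assumptions, not the interface proposed in this thread.

```python
# Hypothetical sketch of a two-function index API; names, signatures, and
# return shapes are assumptions, not the interface proposed in this thread.
from typing import Optional


class CDXJIndex:
    def get_record(self, surt: str, datetime14: str) -> Optional[dict]:
        """Return the single record closest to the requested 14-digit datetime."""
        raise NotImplementedError

    def get_records(self, surt: str) -> dict:
        """Return all records for a SURT as {datetime14: [record, ...]}.

        Values are lists so that datetime collisions can be represented.
        """
        raise NotImplementedError
```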
I've created a separate issue to continue this discussion about an index backend interface, which I find quite interesting.
Currently, the complete index is loaded into memory for faster lookup, but this is not scalable for large index files, as the memory usage grows linearly with the size of the index. We can instead implement an in-file binary search to make the memory usage constant for any arbitrary index file size while still being efficient in lookup due to the nature of binary search. A similar implementation is present in the newly introduced MementoMap repo.