This is a personal reflection of the 56-page whitepaper “Remote BLOB Storage”
Right off the bat, I should let you know that when it comes to RBS I’m always looking for content that is helpful for implementations that don’t relate to Sharepoint.
This paper promised to appeal to people like me. I believe it did a good job in meeting that objective.
The leading paragraphs “The Case for RBS” is ok, and one could nitpick when you see sentences like “If a given SQL Server database would grow to 500GB without RBS enabled, then RBS would be a beneficial option”, but that’s not worth getting upset about because we simply assume that they mean that most of the data might qualify as RBS candidates. It’s often that you’ll hear about the benefits to “Backup & Restore” that RBS offers, such as the whitepaper “SQL Server RBS Performance with SharePoint Server 2010 and StorSimple Storage Solution” states, however it’s very important to understand that no matter where you move the data to, it will inevitably need to get backed up. So statements that say “Backup is now under 1min, whereas previously was 30min” are misleading (Not stated here, but to some degree is inferred).
I really enjoy the paragraph “When Storage Tiers Need to be implemented”, as I believe it can be useful. What is not been said is that if data qualifies to be on different storage tiers based on age, then at some point it needs to be moved. Unfortunately I have not seen any BLOB storage feature, filestream, RBS or FileTable that can leverage set-based mass data movement. They inevitably all require row-by-row mechanisms. Nonetheless, RBS definitely makes available the option to tier data storage far easier than the other BLOB storage options.
Now we get to the primary reason I was reading (this time round) the whitepaper, namely to get advanced knowledge of the installation steps, and write my first sample.
I initially didn’t pay attention to the key word “LOCAL”. This will impact your understanding, and interestingly it made me realize a potential design. It’s possible to install RBS Filestream in the same database as where the application’s data resides, essentially simply creating a filegroup. WOW! That means one could install RBS Filestream provider solution without requiring additional databases…interesting
The document then describes the different components that get installed. In short, you will have to install some components at the application servers, others at the meta-data server, and others at the BLOB store server.
This is how I think it is aligned:
- RBS Library (Client) and Filestream Provider (Client) –> Application Server
- RBS Server Component –> Database Server hosting the Application’s data
- Filestream Provider (Server) –> Database Server hosting the BLOB store
- Maintainer –> I could be wrong, but I think it’s best to find this on the Database Server hosting the application’s data
NOTE: I installed everything on a single laptop, and used the REMOTE configuration to have the BLOB stored in a dedicated database. I plan to have multiple servers in the new server, and will likely update this review then (or create a separate blog)
Throughout the document there are references made that I thought could create confusion, so here’s my take on the synonymous references:
- Application’s database = Content database; Meta-data Database; RBS Enabled database
- BLOB Store = Location where the BLOB data can be found, in our filestream case it’s the filestream enabled database.
I found the detail about the maintainer, as it pertains the Pool sizes and such, presented a little too early in the document. A pool refers to the number of records the maintainer will select per orphan cleanup iteration. I was perplexed and still don’t know why a data table is created per pool, however I know that you must not confuse this with the concept that manages how many objects are stored in the NTFS directory.
NOTE: NTFS will allow billions of documents to be stored on a volume, and within a directory. Customers and product teams have recognized that too many reduces the performance of traversing these files and a general guideline of ~300,000 objects (files) is offered up. I’ll refer to this later on in this blog.
The document goes on to describe HA options, which are fine but keep in mind that Denali offers Database Mirroring support and integration with the AlwaysON feature. Very Cool!!!
The process flow catches me every time. In simple terms I would conclude that an application makes a call to the RBS API to retrieve the BLOB, and several activities occur behind the scenes to return the BLOB. I’m focusing on the installation concepts, so I browse over this for now.
NOTE: I later learned quite a bit about this step, so expect a blog on this at a later date.
What’s important to note is that the application will make a call to the Application’s Database to first retrieve the BlobID, and then when reading or writing a call is directed to the Blob Store database.
BLOB deleting is interesting, though it doesn’t impact installation considerations. It will impact data administration.
Aaahhhh…..now we find those “Do before you run RBS setup” steps
- Enable Filestream for the SQL Server Instance
- Create the databases… yes you must create the RBS Blob Store database too
- Create filegroups and files
- You can define the max size limit for the filestream directories
- Don’t get confused about the filestream references found in the DB Options. These are related to FileTable
- Create a Master Key for each of the databases
- I haven’t quite worked out when and if I will be using a certificate in a trusted domain where all SQL Server Instances will be connecting through trusted connections. Additionally there doesn’t appear to be any reason for cross-database T-SQL queries. Nonetheless, RBS requires a master key otherwise it will not install
NOTE: I imagine that at some point, a naming convention and more thought is required about the configuration, but I’m going to leave the crossing of that bridge for another day.
Ok, that’s it. Now where do we get the RBS setup file? Historically this was found as part of a separate download, namely the feature pack. Now you can find the rbs.msi file from the install directory, for English edition it is
The document makes a few assumptions, so while I click “Next” most of the time, I’m very thoughtful about filegroup destinations. Default will imply “Primary” for the internal tables, and your first filestream filegroup for the BLOB data.
For the initial setup, “I simply want to install” approach, the document covers more than I need, namely all the command line stuff. With that said it is insightful to work through each option as it is easy to relate to the GUI and certainly improves understanding. At some point, you realize that you can have multiple provider instances sent to the same database…WOW!!…that would make for a complex design. I would probably avoid that design unless there was a very compelling reason to combine multiple filestream providers into a single database.
The advanced configuration settings are great, and one of the reasons that make RBS a viable alternative than a standard filestream implementation. I mentioned earlier that the general guideline is to not have more than ~300,000 files in a single directory.
NOTE: I was looking for a Windows Server 2008 NTFS filesystem limit to the number of files per directory, but couldn’t find any. So the guideline is from a Windows Server 2003 article – “How NTFS Works”
The “Pool Capacity” is where RBS will manage this in the background. The default setting is 200,000. What the document fails to reveal is that this setting will not directly correlate to the number of files found in the directory, in some cases there could be many more files, this can ocucr because the update & delete modifications create a scenario where there are more files than the tables would indicate. An update in filestream, creates a new file, a delete is a deferred operation. After backups and garbage collection have taken place, then yes the files in the directory will reduce.
It’s great to see the RBS service handle the in-band and out-band concepts. There is an optimal size for when objects should be written using the Win32 API, and this occurs around the 1.2MB size. Until this point, TDS protocol is a more efficient for reading and writing BLOBs. The objects are still stored in the filesystem, but access to them is handled differently. This is configurable and occurs transparent to the application.
Another fantastic configuration is the inline BLOB store. This will make sure that those filesystem performance debilitating small files are stored directly in the database, without the application needing to worry about where they are been stored. The paper doesn’t expand on where they are stored other than “in the mdf file”, so we can only assume that they are stored in the rowdata filegroup defined during setup. If these files account for large amounts of data, then backup & restore may become unwieldy, of course at 61KB recommended value you would need to have > billion small files before worrying about that.
The write allocation size is an interesting concept, which the document fails to expand upon. This is an unfortunate gap. I would like to know more about its value proposition. Maybe there isn’t one, because any updates create a new file, so pre-allocating has no gain and may in fact cause degradation but is there a performance gain when creating a file of 1MB, and then streaming the bytes in? I don’t know, and hope to find out going forward.
I didn’t find the Maintainer, XML Configuration Files, multiple instance deployments, and administrative views particular useful at this stage, but took a mental note that this is a place to come to in the future. The backup & restore is discussed, but really it’s pretty simple, but dangerous. You stop the maintainer. You backup the application’s database first, and then you backup the blob store. When you restore, the maintainer must not be running until everything is restored. The diagrams say restore the Blob store first, and that you can start the restore of the content database but that simply amuses me as the content database will (typically) complete within minutes. So I am currently of the mind that you can restore whenever you want, but will not have access to the blobs.
It’s my bet that the following is the best restore approach…though don’t be surprised if I get killed by those more knowledgeable than
- Restore Primary Filegroup of Blob store as a Piecemeal Restore
- Restore RowData Filegroups of Blob store as a Piecemeal Restore
- Make the blob store database available to be accessed
- Restore the Content (Application’s) database
I would also be interested to evaluate that the blob store database can have their filegroups turned into Read-Only. Right now I don’t think we can make new records go to a new filegroup, but I do know that we can make new records go to a new blob store. As soon as a blob store is no longer getting populated or modified, then maybe we can make the database read-only (that would be awesome) or at least make the filestream filegroup read-only.
The next section in the document is all about Sharepoint. It’s worth a read, as we think about what the custom application will be capable of doing but for the most part I find the Sharepoint application imposes interesting restrictions.
So you think you can use RBS now….unfortunately not
Before you can use RBS, you have to do additional steps. These are described in “How To: Install (Remote BLOB Store)”:
- Enable the Application’s database for RBS by running SP mssqlrbs.rbs_sp_enable_rbs
- Create a user table and enable the column running SP rbs_sp_register_column
- Remember to index the column that is been registered
NOTE: Unfortunately the document doesn’t provide recommendations or guidelines as it pertains to indexes, or other considerations. Therefore I know that I must investigate what are the experiences when you have billions of rows.
The document misses the above, yet it goes into detail about the GAC and configuration files. I struggled to get the references in my custom application sample, and I can’t really comment on how much of the GAC and configuration file recommendations actually assisted in resolving the issue. In the end I had to add the Microsoft.Data.SqlRemoteBlobs by browsing for the DLL.
I appreciated the insights into the various approaches to reading and writing the blobs. I would’ve liked the content to include transaction concepts, and guidelines. There is an example that demonstrates writing blobs using a buffer, however no insights about buffer sizes or what one should consider when developing a solution and trying to establish the appropriate buffer size.
I definitely don’t want to enlist a Distributed transaction, and therefore need to know why the transaction scope is required.
I’ve created a basic sample based on the articles “How To: Write a BLOB (Remote BLOB Store)” and “How To: Read a BLOB (Remote BLOB Store)”, I can’t wait to create samples using the other techniques to write and read blob data.
The document did a short blurb on creating a provider, not really an interest area for me right now, so it didn’t get much attention.
In summary, it was a good paper with opportunity for more information and areas that may qualify as redundant. Most importantly, and what I would consider a reason for a HIGH VALUE rating from me, is that it refrained Sharepoint orientated, which has made other documents mostly useless to me.
- FILESTREAM Design and Implementation Considerations (shauntj.wordpress.com)