Development Project 268 - AFS on ECDF GPFS storage - Notes

A place to keep some notes on progress, thoughts, etc. on the project to access ECDF-based storage as AFS.

In theory (and according to various reports) it should just work, i.e. you mount part of your GPFS file system as /vicepX on an AFS file server and it uses it just like any other POSIX-compliant file system.

As Iain R already had a small local test GPFS cluster, we first set up crocotta as a GPFS client (node) and mounted a directory in GPFS as /vicepg. And it does just seem to work. I've a test volume in there that hasn't had any problems (other than when the cluster goes down). These interruptions would be lessened if crocotta were a full GPFS node, with the ability to run remote root sessions to manage the GPFS cluster as a whole.

Note: I need to sort out the start-up sequence of starting GPFS, mounting GPFS and then starting AFS. Currently these run out of sequence, so manual intervention is needed following a crocotta reboot.
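As a sketch of the sort of thing I mean (the mount-point check is straightforward, but the "openafs-server" service name is an assumption, not our actual DICE configuration):

#!/bin/bash
# Hypothetical wrapper: don't start the AFS server processes until the GPFS
# filesystem backing /vicepg is actually mounted.
MOUNTPOINT=/vicepg
for i in $(seq 1 60); do
    # /proc/mounts fields: device mountpoint fstype options dump pass
    if awk -v mp="$MOUNTPOINT" '$2 == mp && $3 == "gpfs" {found=1} END {exit !found}' /proc/mounts; then
        exec service openafs-server start    # service name varies by packaging
    fi
    sleep 5
done
echo "GPFS mount $MOUNTPOINT never appeared; not starting AFS" >&2
exit 1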

The next stage is to try this again (or something similar), but using the ECDF GPFS. I had a meeting with Orlando about this on 24/10/2013.

Meeting with Orlando about access to ECDF GPFS

Had a meeting with Orlando up at JCMB. My notes/recollection from the meeting:

The first thing to point out is that this "500GB per researcher" is not ECDF, it's part of the RDM (Research Data Management) Service that IS are bringing on-line.

They will be presenting this data via NFS and CIFS, but have no effort/resources to set up an AFS cell.

Orlando explained their setup, and the fact that, just as we would not be happy for them to have root access to our AFS servers (mounting their GPFS filesystem), they would not be happy with our servers mounting their existing GPFS file systems (as we could access/destroy data belonging to others on that same file system).

To create an isolated GPFS filesystem for us, one that wasn't shared with others, requires them to reconfigure the existing file systems they have (one for MVM and another for CSE?). Effort.

We discussed 4 options:

  1. We install our servers in their rack with direct block access to some of the LUNs on the SAN.
  2. Create a separate GPFS filesystem for us, which we mount remotely here.
  3. Similar to 2, but we have the machine(s) in their racks to get the benefit of quicker access.
  4. Set up a new cell, managed by IS, on their hardware, connected to their existing GPFS filesystem. We would access it via something like /afs/rdm.ed.ac.uk/...

Conclusions

Generally Orlando's team seem very stretched for effort, and we'll be waiting for them to find time to make any changes required for us.

1. This is only a fall-back option: though we'd get our allocation of disk space, we wouldn't get any of the benefits of GPFS. It's also not very scalable/flexible if our allocation changes.

2. This is the quickest option to get something going. Rather than reconfiguring the existing NSD, Orlando is going to use a dev node he has and some unallocated LUNs (disks). I'm going to give him the hostname/IP address of our machine so it can mount that filesystem, similar to what we currently have with our own GPFS.
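For reference, the GPFS multi-cluster setup this implies looks roughly like the following (a sketch using the standard GPFS admin commands; the cluster names, contact nodes and filesystem name below are made up, and the mmauth key generation/exchange step is elided):

# On the owning (ECDF/RDM) cluster: allow our cluster to mount the filesystem.
mmauth add inf-cluster.inf.ed.ac.uk -k inf-cluster.pub
mmauth grant inf-cluster.inf.ed.ac.uk -f rdmfs

# On our (accessing) cluster: define the remote cluster and filesystem, then mount it.
mmremotecluster add rdm-cluster.ecdf.ed.ac.uk -n nsd1.ecdf.ed.ac.uk,nsd2.ecdf.ed.ac.uk -k rdm-cluster.pub
mmremotefs add rdmfs -f rdmfs -C rdm-cluster.ecdf.ed.ac.uk -T /gpfs/rdm
mmmount rdmfs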

3. If 2 works and is acceptable then this is just a relocation of hardware.

4. This is perhaps the most desirable option for the RDM service, as it just adds AFS to their list of presentation options for the data. The downside is that they have no experience in managing AFS, so there would be a learning curve for them. We'd also have to sort out cross cell/realm permissions.

Actions

I offered to help with 4, and we talked about me having root access to an IS machine to get this set up, but Orlando still thought it would take 4 weeks before they could set things up at their end to let me access a machine.

As time is of the essence, option 2 would be the quickest, using Orlando's spare node and disks to set up a GPFS file system exported to us. I've since given them the details of crocotta, so we can replace its current mounting of our GPFS with theirs.

One problem is that we only have GPFS 3.2, and they are running 3.5. Orlando said this could be made to work, but he has to make sure they turn off features at their end that 3.2 doesn't know about. The IBM docs, though, only seem to say this works between adjacent version numbers, i.e. 3.5 and 3.4.

http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom.ibm.cluster.gpfs.v3r5.gpfs300.doc%2Fbl1ins_mig32.htm
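The relevant things to check on their side are presumably the cluster's minimum release level and the filesystem format version; a guess at the commands involved (the filesystem name is made up):

mmlsconfig minReleaseLevel   # cluster-wide minimum release level
mmlsfs rdmfs -V              # filesystem format version a 3.2 node would have to understand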

I've since asked if he can give us a copy of their 3.5, just temporarily, to get things going. We can sort out any licence issues later.

Option 2 also means they don't need root access on our server, though setting up some remote sudo commands to do the odd bit of housekeeping would be useful.

Benchmarking

While waiting for the RDM launch, I've gathered some basic figures from bonnie++ for bits of our file systems. I've just been running the test once or twice and letting bonnie++ pick values for the likes of file size, which is recommended to be twice the physical RAM of the machine running bonnie++. For the AFS client tests, I've set the AFS disk cache down to 1GB. +++ in a cell means the test completed too quickly for bonnie++ to report an accurate figure.

I'm just running bonnie++ -q -x1 -z 6504422 > results.csv and appending the results to the table below (after passing the .csv file through bon_csv2html).

bonnie++ 1.96 results (Sequential Output per-char/block/rewrite, Sequential Input per-char/block, Random Seeks, and Sequential/Random Create/Read/Delete rates, each with %CPU, plus latencies) were recorded for:

  • crocotta.inf.ed.ac.uk gpfs (two runs)
  • nuggle.inf.ed.ac.uk ext3 kbevo1
  • nuggle.inf.ed.ac.uk vicepi
  • chia.inf.ed.ac.uk test.neilb2 AFS from nuggle
  • chia.inf.ed.ac.uk work-gpfs AFS from crocotta
  • jings.inf.ed.ac.uk datastore (2/9/2014)
  • bahookie.inf.ed.ac.uk memcache

I've also been experimenting with iozone, but have yet to decide how best to present the results from its various tests.

15 Aug 2014

Not much has happened in the last couple of weeks due to holidays, and Jan (ECDF) is away for another week, but just a quick note on the current state of play ...

We currently have a test cell - datastore.inf.ed.ac.uk - which contains a single DB server running on a VM afsdb0.datastore.inf.ed.ac.uk (dsone.inf), and a couple of fileservers: fs0.datastore.inf.ed.ac.uk (dstwo.inf another small VM), and inf-fs1.datastore.ed.ac.uk which is a file server running on the real IS datastore hardware. Currently the IS file server is firewalled to EdLAN only.

The cell is using our INF.ED.AC.UK Kerberos realm for authentication, and because we already have cross-realm trust set up with our regular cell, Jan can use his EASE credentials and aklog against our test cell to get tokens.

Jan's plan, before he went on holiday, was to test performance and failover using his /afs/datastore.inf.ed.ac.uk/test/jwinter/ test area, which is a volume served by the datastore hardware.

Activity 7/11/2014

Finally spent a little bit of time back on this project. What I wanted to do was simply configure a machine to use memory for the AFS cache manager, rather than local disk. The idea being that we'd hopefully then see the performance of the remote file server, which should be the bottleneck, rather than that of the cache manager; with the local disk cache, I suspect bonnie++ was seeing the performance of the cache disk rather than the file server.

We have a currently spare physical machine, bahookie, to which I manually added "-memcache" in the openafs.tt template (as there's not a way to specify memcache via the component - I should fix that). That seemed to work at first (following a reboot), but I then had problems running bonnie++: it ran out of disk space. This was because bahookie has 64GB of physical memory, and my AFS test areas are only 10GB in size. bonnie++ tries to write files larger than the physical memory to reveal memory caching effects.

I could have increased the size of my datastore test area, but I decided to start the kernel on bahookie with "mem=4GB" to limit the apparent physical memory, which is more like what a client would have. I then had problems with the current DICE default for the AFS cache size, about 7GB, not fitting in 4GB of physical memory! So finally I reconfigured that to 1GB, and I was able to reboot bahookie and run the bonnie++ tests. I'll add the first result shortly.
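For the record, the underlying settings amount to something like the following (a sketch of the afsd and kernel options involved; the exact LCFG resources used to express them are a separate matter):

# Kernel command line addition (via grub) to limit apparent physical memory:
#   mem=4G
# afsd options for a memory cache instead of a disk cache: -memcache keeps the
# cache in kernel memory, -blocks sets its size in 1K blocks (1048576 = 1GB).
afsd -memcache -blocks 1048576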

Activity 9/1/2015

Craig's asked for an update on how things are going, and what the current state is. I knew I hadn't done much since reporting on progress in September 2014, but I'm ashamed to see just how little it has been. From my time-keeping logs, from 3/9/2014 (when I gave a talk at the Dev meeting, AFSOnDatastoreDevMeet20140903) till today I've only recorded:

04/09/2014 1130 : mail to jan about afs perf : project - ecdf : 30mins
10/10/2014 1640 : chat with craig about afs project : project - ecdf : 10mins
07/11/2014 1430 : AFS mem cache machine for testing : project - ecdf : 180mins
11/11/2014 1340 : update ecdf memcache bonnie table : project - ecdf : 20mins
A total of 4 hours effort in 4 months! And looking back through my notes from that September meeting, I see that it was expected this would be completed by the end of the year!

So the status should be as it was recorded back in September, but as I check just now, the fileserver running on inf-fs1.datastore.ed.ac.uk is down, as is the bos process (or at least they are both unreachable - ping still works). It was working back in November when I added the initial figures for the memcache bahookie benchmarks. I'll make contact with Jan and find out what the state of play is at his end.

The lack of progress hasn't been due to technical issues, just me not prioritising the project over (mostly) operational work.

(Lack of) Activity 24/2/2015

Another disappointing lack of activity these past few weeks. The only thing of note was a meeting with various COs from Science and Engineering, along with Angus, talking about the problems they have been having with their use of datastore, and what could be done to improve things. Most of the problems are to do with file permissions and ACLs becoming screwed up when accessing the same data via different routes, e.g. NFS and CIFS (samba), Windows and Linux. The solution requested was kerberised samba, as this would then be the one preferred route to access their data. It still strikes me, though, that Windows and Linux access could be an issue if ACLs/permissions are changed from the different OSes.

Anyway, the one thing I took away from this was that getting access to our datastore space as AFS would avoid these particular problems.

(Lack of) Activity 8/3/2015

Again nothing has happened for 2 weeks. SL7 and operational stuff is taking up my time. And this week I've the ITPF LCFG talk on Friday to prep for.

Activity 25/3/2015

I thought I'd recorded this already, but as possible fall-back options you can connect to the normal IS-provided datastore space with smbclient, for example:

> smbclient //csce.datastore.ed.ac.uk/csce -U ed\\neilb  # password is your Active Directory password, *usually* the same as your EASE
smb: \> cd inf/users/neilb
smb: \inf\users\neilb> put /tmp/viasmbclient viasmbclient
Or with sftp:
> sftp -oPort=22222 neilb@csce.datastore.ed.ac.uk:/csce/datastore/inf/users/neilb
Connecting to csce.datastore.ed.ac.uk...
neilb@csce.datastore.ed.ac.uk's password: 
Changing to: /csce/datastore/inf/users/neilb
sftp> 
Or with sshfs:
> sshfs -o intr,large_read,auto_cache,workaround=all -oPort=22222 neilb@csce.datastore.ed.ac.uk:/csce/datastore/inf/users/neilb /tmp/neilb
neilb@csce.datastore.ed.ac.uk's password: 
> cd /tmp/neilb

Activity 29/3/2015

This project is supposed to be complete by the end of the month! And despite knowing this, I've still not devoted any time to actually doing concrete work! Why? There always seems to be something urgent (mostly operational) that needs to be dealt with now. I know I can do that quickly, and it means that whoever was depending on that work being completed can then move on, and we (the Unit) are no longer holding them up. This project, however, is so behind schedule that it seems prioritising other work ahead of it is not going to affect the customer of this project. I suspect I'm also procrastinating, as I know it's been such a while since I did any real work on it that getting back into it is going to feel like wasted effort just to get back to the position I was in when I last spent a reasonable amount of time on it.

Having said all that, now that the deadline is about to officially slip, I really do feel bad enough that I will finally feel compelled to say "no" to everything else, and concentrate on this project. So my plan of action will be to:

  • Get some more easily understandable performance comparison for datastore storage vs our current AFS. Want to test multiple access, not just single process performance.
  • Hack a short-term solution for syncing the datastore PTS with our PTS (see the sketch after this list). A prometheus conduit is the correct way, but that is something that can be done properly once we know the basic service is going to work as hoped. Craig reckons the current PTS conduit takes a cell name as an arg, so it might "just work".
  • Use the demand-attach file server on datastore. Given the number of volumes we're potentially going to want to mount, and the apparent unreliability of datastore (judged only from the number of IS alerts about DS problems; I've never noticed it myself, but then I've not been using it heavily), the dafileserver seems like a better choice.
Let's see how it goes!
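A minimal sketch of the kind of short-term PTS sync hack I have in mind (one-way, users only; it ignores deletions, groups and cross-realm entries, and needs admin tokens in both cells):

#!/bin/bash
# Copy user entries from our normal cell's PTS into the datastore cell's PTS,
# keeping the same numeric IDs. Deliberately crude: no deletions, no groups.
SRC=inf.ed.ac.uk
DST=datastore.inf.ed.ac.uk

# "pts listentries -users" prints a header line then: Name  ID  Owner  Creator
pts listentries -users -cell "$SRC" | tail -n +2 |
while read -r name id owner creator; do
    case "$name" in *@*) continue ;; esac   # skip cross-realm entries
    if ! pts examine "$name" -cell "$DST" >/dev/null 2>&1; then
        echo "creating $name (id $id) in $DST"
        pts createuser -name "$name" -id "$id" -cell "$DST"
    fi
done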

Activity 24/4/2015

Jan has created another 7TB partition for me to create multiple volumes on, to do my performance testing. I only asked for 2TB, so I need to check what he actually thought he'd done. Perhaps the volume server sees 7TB, but some quota will stop it at 2TB. 2TB is the recommended max size for a partition.

Also been reading http://docs.openafs.org/QuickStartUnix/DAFS004.html and http://wiki.openafs.org/DemandAttach/ about the demand-attach system. I've confirmed that the DA binaries are on the datastore server, and that it is running AFS 1.6.9 (checked with rxdebug inf-fs1.datastore.ed.ac.uk 7000 -version).

Unfortunately holidays, RT Guy, and EdWeb work have got in the way recently.

Activity 1/5/2015

I've got the DA server running on dstwo and inf-fs1.datastore.ed. I've not had time to test and investigate its behaviour; I'll do that next week.

I'll also need to update the openafs component, as the schema does not allow "dafs" as a bnode type.

Activity 15/5/2015

Updated the openafs component, schema and nagios file to cope with "dafs" as a service type, and updated the various headers and packages lists; lcfg-openafs-1.1.0-1 should reach stable next week.

Initially I thought it wasn't going to work, as my first tests wouldn't create the requested BosConfig; the perl AFS::Command::BOS module gave an error. And right enough, its docs only talk about "cron", "simple" or "fs" as being suitable values, but a simple test script calling BOS->create with the necessary params worked fine. The problem turned out to be that the tag names used in the tagged list are significant: they need to match the binaries being called.
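For reference, the equivalent command-line bos invocation for a dafs bnode is along these lines (taken from the OpenAFS quick-start docs rather than from what the component actually emits; the binary paths and -localauth depend on the packaging and setup):

# Create a demand-attach fileserver bnode: type "dafs", with the four
# dafileserver/davolserver/salvageserver/dasalvager process commands.
bos create inf-fs1.datastore.ed.ac.uk dafs dafs \
    -cmd /usr/afs/bin/dafileserver \
         /usr/afs/bin/davolserver \
         /usr/afs/bin/salvageserver \
         /usr/afs/bin/dasalvager \
    -cell datastore.inf.ed.ac.uk -localauth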

As a reminder of how to test the nagios changes: the trick was to copy the new nagios/openafs.pm to something like ~/tmp/LCFG/Monitoring/Nagios/Translators/openafs.pm, and then on the nagios server (cockerel) run: lcfg-monitor-test --libs ~/tmp openafs dstwo.

Though we're unlikely to make much use of dafs via the component for this project, it will be useful for our regular AFS servers.

The next step is to create a lot of volumes on the datastore fileserver and 1. check its shutdown/start-up time compared to a regular "fs", and 2. do some performance testing on those volumes.
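Something along these lines should do for the bulk volume creation (a sketch only; the volume names, count, quota and mount location are arbitrary choices, not agreed conventions):

#!/bin/bash
# Create a batch of test volumes on the datastore fileserver, spread across
# its two partitions, and mount each one under a test directory in the cell.
SERVER=inf-fs1.datastore.ed.ac.uk
CELL=datastore.inf.ed.ac.uk
TESTDIR=/afs/datastore.inf.ed.ac.uk/test/perf

for i in $(seq -w 1 200); do
    if [ $((10#$i % 2)) -eq 0 ]; then PART=vicepa; else PART=vicepb; fi
    vos create "$SERVER" "$PART" "test.perf.$i" -maxquota 10485760 -cell "$CELL"
    fs mkmount "$TESTDIR/vol$i" "test.perf.$i" -cell "$CELL"
done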

Inactivity 10/7/2015

More months go by without any real work being done on this. So this is a sort of recap of where we are and what needs done next.

  • We have a new cell called datastore.inf.ed.ac.uk.
  • If you have admin privs in our normal cell, then you should also have AFS admin privs in the datastore.inf.ed.ac.uk cell.
  • We have two fileservers - dstwo is a small VM here, and inf-fs1.datastore is a VM in IS, but has access to the underlying GPFS datastore space.
jings> vos listaddr -cell datastore.inf.ed.ac.uk
dstwo.inf.ed.ac.uk
inf-fs1.datastore.ed.ac.uk
  • We have one DB server - dsone is a VM here.
jings> bos listhost dstwo -cell datastore.inf.ed.ac.uk
Cell name is datastore.inf.ed.ac.uk
    Host 1 is afsdb0.datastore.inf.ed.ac.uk
  • In a production environment we would have more DB servers (still VMs), and may still use VMs for the small number of structural volumes that we'd need.
  • Though we have cross realm auth, you may need to do aklog -cell datastore.inf.ed.ac.uk or aklog /afs/datastore.inf.ed.ac.uk/. I have a .xlog file in my home directory to get my datastore tokens automatically with my regular ones.

What we need to do is some confidence testing that things will perform OK. It doesn't need to be blazing fast, just fast enough to be usable. We don't want multiple simultaneous accesses from multiple clients to be an issue. This is something I still need to check.

My basic thoughts are to programmatically create a bunch of volumes in inf-fs1.datastore.ed.ac.uk /vicepa and /vicepb, mount them somewhere under /afs/datastore.inf.ed.ac.uk/, and then have scripts running on multiple hosts reading and writing to those volumes, somehow quantifying the performance. It's that "quantifying" that's the trickier bit. Perhaps it would be enough to have the automated reading and writing, and then as a human do some normal activities (copying, editing, listing) and just judge how it feels?

We could then repeat that process on our normal AFS cell to compare.

Once we're happy, the work required to finish it off includes: agreeing how to allocate the space, and updating the prometheus conduit to create PTS entries in the new cell.

4/11/2015

As requested at the recent development meeting, this is a brief summary of the verbal report I gave on the October 21st meeting.

I confessed my lack of spending any time on this. I covered that the plan was still to do some basic performance testing to confirm that this isn't a bad idea, and once that was done, to go ahead and implement it properly. IS (Jan) and I have hardly exchanged any words in the past few months; he's not been pressuring me, and I've not been pressuring him. Presumably we are equally bad at spending time on this task. There was some discussion about NFS v4 perhaps being another option if AFS isn't a winner, or kerberised samba.

27/11/2015

I needed to do something, anything, so I've been reading back over the above to remind myself of what I'm supposed to be doing. I've checked the 3 machines, and updated our 2 to the latest AFS. I should check what version datastore are running; I can't find out via bos status.

The last thing I said I was going to do was some sort of performance test. So that's what I'll do. I'll write some (probably perl) script to create, read, write and delete some files in the working directory it is run in, and then loop, writing out some stats at the end of each loop, probably appending them to some file in that dir too. I'll then run varying numbers of this script in different volumes on the datastore.inf cell, and similarly on our normal cell, and compare numbers. Though there are tools like iozone and bonnie++, I'm not sure I understand what they are actually telling us. The bonnie++ figures above give me raw numbers, but not a "feel" for how things are behaving.

22/1/2016

Finally spent some time on a local test script to gauge file system performance. The iozone and bonnie++ results are too complex; I wanted a simpler measure of real-world-type file activity. So I've written a script that creates files of varying sizes, fills them with data, does thousands of random reads from the files, then deletes the files. It returns the time to do one iteration, and then the time to do 10.

As it stood, it was pretty quick, about 3.5s for one iteration. I wondered if the AFS cache was doing its job and limiting the amount of network (and remote disk) activity to the AFS file server. Running fs flushall between operations increased an iteration to about 5s, but putting it in the main read loop really slowed things down, perhaps taking tens of minutes for a single iteration. I then wondered if the kernel's own buffers/cache were having an effect too. I discovered echo 3 > /proc/sys/vm/drop_caches, which clears all the kernel caches, and like fs flushall, calling that between operations had a severe impact on performance.

The question is, though: if this is to emulate what users would experience, then they wouldn't be flushing caches, they'd get the benefit of them. So I'm in two minds as to whether flushing caches is really the thing to do. What I'll probably do is only call the flush between the major sections (create, write, read, delete), and enlarge the size of the files I'm working with to put more of a strain on the cache.
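For reference, the flush between sections amounts to the following (both commands are the ones mentioned above; the drop_caches step needs root):

fs flushall                          # discard everything in the AFS cache manager
sync                                 # write out dirty pages first
echo 3 > /proc/sys/vm/drop_caches    # drop kernel pagecache, dentries and inodes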

While I've been working on this, and having reviewed the current computing plan, and given that we've been considering the future of AFS, I wonder if what we should really be doing is figuring out how we can best make use of datastore in its vanilla form, rather than doing our own thing and shoehorning AFS onto datastore?

26/2/2016

Though we've decided to try and draw a line under this project (this is yet to be accepted by the development meeting) and instead look at accessing datastore natively via DICE, I had gone to the bother of knocking up a simple performance script, so I'll post the results here. Though the script is configurable, these figures were based on it creating 10 files from 1KB to 20MB (so each file was roughly 20MB/10 = 2MB larger than the previous). First it created the 10 empty files, then filled them with known data in 64KB chunks (sequential writes), then read a single random block from each file (and checked it contained the correct values), repeating this 5000 times (i.e. a whole bunch of random reads), then deleted the 10 files, timing how long it took to do all those steps. It then ran 10 times to get an average. The random generator was seeded with a fixed value, so multiple runs would use the same "random" reads.
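A much-simplified shell sketch of that workload, for the record (the real script is Perl and also verifies the data it reads back; this version just times the I/O and uses /dev/zero rather than known data):

#!/bin/bash
# 10 files stepping up to 20MB, written sequentially in 64KB chunks, then
# 5000 random single-block reads spread across them, then delete.
RANDOM=6504422            # fixed seed so runs use the same "random" offsets
NFILES=10
READS=5000

time {
    # create and fill the files (file i is roughly i * 2MB)
    for i in $(seq 1 $NFILES); do
        blocks=$((i * 20 * 1024 / NFILES / 64))       # size in 64KB blocks
        dd if=/dev/zero of="testfile.$i" bs=64k count=$blocks 2>/dev/null
    done
    # random 64KB reads, one block per read
    for r in $(seq 1 $READS); do
        i=$((RANDOM % NFILES + 1))
        blocks=$((i * 20 * 1024 / NFILES / 64))
        dd if="testfile.$i" of=/dev/null bs=64k count=1 skip=$((RANDOM % blocks)) 2>/dev/null
    done
    rm -f testfile.*
}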

I did experiment with flushing the AFS cache, and even the kernel caches, between loops to remove caching/buffering effects. The AFS cache flushing didn't make much difference, but the kernel flushing made it run like a drain. In the end I did neither, as in the real world a user would get the benefit of the caches/buffers. It was hardly scientific; I ran my script on my desktop against various different filesystems, and the results were:

Storage location    Time (s)  Notes
homedir AFS           6.1
datastore AFS         8.1
homedir AFS          11.3     these two were run simultaneously
datastore AFS        14.3
/tmp                  3.5
/dev/shm              3.1
group.cos AFS         6.8
ptn199 NFS           29.5
homepages NFS        30.5
test.work AFS KB      7.1

From this simple test, it would appear that datastore AFS performance would be comparable to our existing AFS group space, perhaps 30% slower, but 3x faster than our NFS.

2/5/2016

Finally put fingers to keyboard to try and justify why proceeding with this project would be wrong - Project268AFSonECDFwindup

29/8/2016

The above wind-up document was sent to Alastair and Craig, but not yet submitted for sign-off, as I'm waiting to hear back. Nothing's happened in the intervening time, though Jan from ECDF was asking if he could get rid of the VM he created while we were testing things out, as it is costing them money?!

Some things to consider

PTS - As the test cell is a separate cell, it has its own PTS database. Initially I populated it from our normal "inf" cell (excluding cross-realm users), but for a final service we presumably want all our current cell users to be able to just use the datastore.inf cell without needing to do anything (it may require an aklog, see later). This would mean some method of keeping the datastore.inf cell PTS in sync with our inf cell PTS. We could just run a regular script to do the necessary adds and deletes, or we could do it properly and hook into prometheus to create/delete as we do for our normal cell.

Then there's the cross-realm entries: you can't create them in advance, as some magic happens when a cross-realm person does their first aklog. Just creating user@ease.ed.ac.uk via pts commands does not allow "user" to then aklog.

aklog - AFS is happy for you to have more than one token, but you need to aklog into each cell to obtain them. We could just leave it like that, so that users who want to access datastore.inf AFS space just learn to do 'aklog datastore.inf.ed.ac.uk', or perhaps there's something we can do in PAM? Or we just tell people about ~/.xlog: if you create that file containing a line for each cell you want to get tokens for, then when you aklog (or PAM does it for you) you get the tokens for those cells.
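i.e. for the cells mentioned here, the file would contain something like (one cell per line):

inf.ed.ac.uk
datastore.inf.ed.ac.uk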

quotas - We've still to have this discussion, but we need to think about how the various individual and group quotas for datastore will work and be managed, and what that might mean in terms of the creation of AFS partitions on the datastore hardware.

backups - Also something we need to talk about.

The recommended maximum AFS partition size is 2TB, and a single server has a max of 255 partitions, so a single AFS file server can hold a total of 510TB of storage.
How many INF users qualify for their 0.5TB of space? If it's fewer than 1000 then we could host all that space on a single server (would we want to, though?).

-- NeilBrown - 29 Oct 2013 [ the date this wiki page was first created! ]
