Private:progress-bu-khamsin
Latest revision as of 03:37, 25 June 2012

2012

June

Update 25 June: All the bugs have been fixed. I have run some performance tests using the ttcp tool, which measures transfer speed. The results show that the protocol performs very poorly: the maximum speed it achieves is 3900 kbps. After some investigation, I found that the main cause of the poor performance is the delay caused by polling scratchpad registers to exchange state and control data between hosts. I tried to avoid polling by sending interrupts between hosts, but my efforts were unsuccessful. I have contacted the manufacturer's support about this issue and am waiting for a reply. In the meantime, I decided to redesign the way the hosts exchange status and control data to increase performance and avoid the delay. Instead of one machine writing to a scratchpad register and the other machine polling the register for updates and then acknowledging the data, the new design treats the 8 scratchpad registers as a shared queue. Each host writes the required value to the queue and then continues. This approach should avoid much of the delay resulting from polling and waiting for acknowledgments.
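The shared-queue idea above can be sketched in user space as a small ring buffer over the 8 registers. This is only a model of the design, not the PLX register interface: the array stands in for the device's scratchpad registers, and in the real design the head/tail indices would themselves have to be exchanged (e.g. kept in reserved registers or derived from sequence numbers in the values).

```c
#include <stdint.h>

/* Model of the shared-queue design: the 8 scratchpad registers
 * treated as a ring buffer. The writer posts a value and moves on
 * without waiting for the peer to acknowledge it. Illustrative
 * only -- names and layout are not the actual PLX interface. */

#define NUM_SCRATCHPAD 8

static uint32_t scratchpad[NUM_SCRATCHPAD]; /* stands in for the registers */
static unsigned head; /* next slot the writing host fills */
static unsigned tail; /* next slot the reading host drains */

/* Writer side: post a control word and continue immediately.
 * Returns 0 if the ring is full (the peer has fallen behind). */
int queue_post(uint32_t value)
{
    if (head - tail == NUM_SCRATCHPAD)
        return 0;
    scratchpad[head % NUM_SCRATCHPAD] = value;
    head++;
    return 1;
}

/* Reader side: drain one entry if available; no acknowledgment
 * is sent back to the writer. */
int queue_take(uint32_t *value)
{
    if (tail == head)
        return 0; /* empty */
    *value = scratchpad[tail % NUM_SCRATCHPAD];
    tail++;
    return 1;
}
```

Because neither side blocks waiting for the other, the polling/acknowledgment round trips that dominated the ttcp numbers should disappear; the only stall left is when the ring fills up.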

Update 11 June: I have successfully transmitted and received some data using the SDP protocol. I tested it with a simple echo server, in which the client connects to the server and sends some text, and the server replies with the same text. The test was done using a regular TCP/IP application without any code modification. However, the protocol is still buggy and needs some cleanup.

After the bugs are fixed, the protocol should work very well when one application runs at a time. Running multiple applications concurrently has yet to be tested, but I am expecting some problems, because all the operations are implemented in the same kernel module: when the module is busy handling one application, it cannot respond to requests from other applications. One solution is to dedicate a separate module to maintaining a send queue and a receive queue, and modify the sdp module to submit all requests to these queues. The new module can then regularly check the queues and perform the requested operations; when a transaction completes, it notifies the sdp module through a callback function.
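The proposed split can be sketched as follows. This is a user-space model of the idea, not kernel code: the sdp module submits requests and returns immediately, and a worker loop drains the queue and fires each request's completion callback. All names (`submit_request`, `worker_poll`, etc.) are hypothetical.

```c
/* Model of the proposed design: the sdp module enqueues work and
 * a separate worker module performs it, notifying the submitter
 * through a callback when the transaction completes. */

enum req_op { REQ_SEND, REQ_RECV };

struct request {
    enum req_op op;
    const char *buf;
    void (*complete)(struct request *req); /* callback into the sdp module */
};

#define QUEUE_DEPTH 16
static struct request *queue[QUEUE_DEPTH];
static unsigned q_head, q_tail;

/* Called by the sdp module: submit a request and return at once,
 * so the module stays free to serve other applications. */
int submit_request(struct request *req)
{
    if (q_head - q_tail == QUEUE_DEPTH)
        return -1; /* queue full */
    queue[q_head++ % QUEUE_DEPTH] = req;
    return 0;
}

/* Worker module loop body: perform one pass over the queued
 * operations, then notify each submitter via its callback. */
void worker_poll(void)
{
    while (q_tail != q_head) {
        struct request *req = queue[q_tail++ % QUEUE_DEPTH];
        /* ... the DMA transfer would happen here ... */
        if (req->complete)
            req->complete(req);
    }
}

/* Example completion callback standing in for the sdp module. */
static int completed;
static void mark_done(struct request *req) { (void)req; completed++; }
```

The key property is that `submit_request` never blocks, so one slow transfer cannot stall requests arriving from other applications.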


June 4: I was trying to find a way to perform a zero-copy, user-buffer to user-buffer DMA transfer for the send and receive implementation. After some experiments I found that the driver doesn't support remote DMA reads and writes from scatter/gather addresses, so there is no way to do remote DMA without copying. As an alternative, I am working on a method that involves only one buffer copy per send call, while the traditional method requires a buffer copy on both the transmitter and receiver sides. The idea is to copy the user buffer into contiguous memory on the sender side, then send its DMA address to the receiver, which performs a remote DMA read directly into the user buffer without any intermediate copy.
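The one-copy scheme can be modeled in a few lines. This is a user-space illustration under stated assumptions: the static array stands in for DMA-able contiguous kernel memory, the descriptor stands in for the address exchange (e.g. via scratchpad registers), and `memcpy` stands in for the remote DMA read.

```c
#include <string.h>

/* Model of the one-copy send path: the sender bounces the user
 * buffer into contiguous memory (the single copy); the receiver
 * then "remote-DMA-reads" straight into its own user buffer, so
 * no second copy is needed on that side. Illustrative only. */

#define BOUNCE_SIZE 4096
static char bounce[BOUNCE_SIZE]; /* stands in for contiguous DMA memory */

struct dma_desc {        /* what the sender hands to the receiver */
    const char *addr;    /* DMA address of the bounce buffer      */
    size_t      len;
};

/* Sender side: one copy, user buffer -> contiguous bounce buffer. */
struct dma_desc send_one_copy(const char *user_buf, size_t len)
{
    struct dma_desc d = { bounce, len };
    memcpy(bounce, user_buf, len);   /* the single copy per send */
    return d;
}

/* Receiver side: the remote DMA read lands directly in the user
 * buffer (modeled here with memcpy). */
void recv_zero_copy(struct dma_desc d, char *user_buf)
{
    memcpy(user_buf, d.addr, d.len);
}
```

Compared with the traditional path (copy into device memory on the sender, DMA, copy out on the receiver), this halves the number of buffer copies per transfer.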

May

Update 27 May: As part of developing the send and receive socket calls, application data stored in userspace needs to be transferred to the application buffer on the other side of the connection. There are two approaches to doing so. The first is to copy the user's data to the PCIe device memory, do a DMA transfer to the card on the other machine, and then copy the data into the application buffer there. This approach is easy to implement, as it involves no userspace-to-kernel-space memory mapping and no need to assign a bus address to the memory. However, the copying adds overhead that increases latency.

The second approach is to map userspace memory into kernel space, pin it, and assign a bus address to it to allow DMA transfer. Userspace memory is not physically contiguous, so a single bus address cannot be assigned. One solution is to build a scatter/gather list, which is a linked list of multiple memory addresses. The address of the scatter/gather list then needs to be translated and sent to the other end of the connection. However, I am not sure whether the address translation mechanism provided by the driver supports scatter/gather addresses; all the available scatter/gather examples run on a single machine, so no translation was used. Right now I am working on sample code to verify whether this technique works.
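The list-building step of the second approach looks roughly like this. In the kernel it would use get_user_pages() and struct scatterlist; here the page-to-bus translation is faked (and deliberately non-contiguous) so that only the splitting logic is shown. All names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of building a scatter/gather list for a user buffer that
 * spans non-contiguous pages: split the virtual range on page
 * boundaries and record one (bus address, length) entry per page. */

#define PAGE_SIZE 4096

struct sg_entry {
    uint64_t bus_addr;      /* where the device should DMA */
    size_t   len;           /* bytes covered by this entry */
};

/* Fake translation: each virtual page maps to an arbitrary,
 * non-adjacent "bus" page, mimicking non-contiguous physical
 * memory. Purely illustrative. */
static uint64_t fake_virt_to_bus(uintptr_t vaddr)
{
    return 0x80000000ull + (vaddr / PAGE_SIZE) * 3 * PAGE_SIZE
           + vaddr % PAGE_SIZE;
}

/* Split [vaddr, vaddr+len) on page boundaries, one entry per page.
 * Returns the number of entries produced. */
size_t build_sg_list(uintptr_t vaddr, size_t len,
                     struct sg_entry *sg, size_t max_entries)
{
    size_t n = 0;
    while (len > 0 && n < max_entries) {
        size_t in_page = PAGE_SIZE - vaddr % PAGE_SIZE;
        size_t chunk = len < in_page ? len : in_page;
        sg[n].bus_addr = fake_virt_to_bus(vaddr);
        sg[n].len = chunk;
        n++;
        vaddr += chunk;
        len -= chunk;
    }
    return n;
}
```

The open question from the update above is exactly the last step: whether such a list's address can itself be translated by the driver for the remote side to follow, which the single-machine examples never exercise.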

Update 11 May: The kernel panic and lockup bugs have been fixed. I have also found another bug that causes the accept call to fail. To debug it I had to modify the kernel code to print debug messages, then compile the kernel and use it to find the cause. The bug has been fixed, and now I am continuing the work on the send and receive calls.


While I was working on send and receive, I found some bugs in the connection process that cause the kernel to panic, and sometimes I get kernel lockup errors. Right now I am debugging the module by running sample code and tracing back the cause of the crashes. The debugging process is slow, as I need to power-cycle the machines every time the bug is hit.

April

UPDATE: I am done with establishing the connection (connect, listen, accept, bind) and now I am working on send and receive.

After spending hours trying to read and understand the SDP implementation for Infiniband, I realized that I need to read in depth about BSD sockets, the Linux implementation of INET sockets, and the different data structures used before I start modifying the code. Now I am reading http://www.cs.unh.edu/cnrg/people/gherrin/linux-net.eps and http://www.cookinglinux.org/pub/netdev_docs/net.pdf along with chapters from Linux TCP/IP Networking for Embedded Systems by Thomas F. Herbert.

The way PCIe establishes a connection with the other side is different from the regular socket connection process. Unlike traditional networks, each machine connected to the PCIe fabric detects every other machine as a unique device. So for machine A to connect to machine B, machine A has to open machine B's device and write to a BAR register indicating that it wants to connect. At the same time, machine B has to be watching the BAR register associated with A to be able to respond to the request. Therefore, to resemble the socket connection process, once a socket is created and listen() is called, all PCIe devices seen by the machine must be configured to listen, because there is no way to tell from which device a connection will come. On the client side, since we know which machine we are connecting to, only the device associated with that machine needs to be configured when connect() is called.
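The asymmetry between the two sides can be summarized in a short sketch. This is a user-space model, not the real module: each "device" is a flag standing in for an NTB device whose BAR register has been armed for incoming connection requests, and the function names are hypothetical.

```c
/* Model of the listen()/connect() mapping: listen must arm every
 * visible device (the peer is unknown), while connect configures
 * only the device of the known target machine. Illustrative only. */

#define MAX_PEERS 8

static int watching[MAX_PEERS]; /* 1 = BAR register armed for this peer */

/* listen(): we cannot know which peer will connect, so arm the
 * BAR-watching logic on every device the machine can see. */
void sdp_listen_all(int num_devices)
{
    for (int i = 0; i < num_devices; i++)
        watching[i] = 1;
}

/* connect(): the target machine is known, so only its device
 * needs to be configured before writing the connection request. */
void sdp_connect_one(int peer)
{
    watching[peer] = 1;
    /* ... write the "connect" request into the peer's BAR register ... */
}
```

The practical consequence is that listen() cost grows with the number of machines in the fabric, while connect() stays constant.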

March

The cards have arrived and been installed in two Dell machines. They took me two days to install and configure. The cards were preloaded with a configuration not suitable for machine-to-machine communication, and One Stop Systems hadn't shipped the drivers with the cards. After contacting their support, it turned out that their drivers are under development, so the only available drivers are those provided by PLX (the chip maker). So I contacted PLX support, and they agreed to give me access to an NDA-protected configuration document required to program the cards to work as a machine-to-machine interconnect. I programmed the EEPROM of the cards and ran the sample applications that come with the driver. The initial throughput I got is 1.2 GB/s, although the theoretical throughput is 2.5 GB/s. Testing the DMA performance of each machine showed that one of them is two times faster than the other. I moved the card to another lane and started getting much better speed, but the other machine is still slightly faster. This makes me believe that, with the right configuration, the cards are able to achieve up to 2 GB/s as reported by this paper: http://sigops.org/sosp/sosp11/workshops/hotpower/03-byrne.pdf

After I ran the sample applications, I got a clearer idea about the driver. I have finished building the API and tested it with a test kernel module I built along with it. The test module works very well and proves that the API does its expected job.

February

The PLX API in user space interacts with the driver by opening the device and sending ioctl requests. To provide similar functionality in kernel space, I have written a function similar to the ioctl handler and exported it to the kernel. This function requires an instance of the device_object structure as one of its parameters. To make this structure available in kernel space, I have written and exported another function that gets the structure from the driver and makes it available to the kernel.
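The shape of the two exported functions is roughly the following. This is a user-space model under assumptions: `device_object` here is a stand-in for the driver's real structure, and the command and function names (`plx_get_device_object`, `plx_kernel_ioctl`, `PLX_GET_BAR_COUNT`) are hypothetical.

```c
/* Model of the kernel-to-kernel path: instead of open() + ioctl()
 * from user space, one exported helper hands out the driver's
 * device object and another dispatches commands on it directly. */

struct device_object {
    int bar_count;          /* example state held by the driver */
};

enum plx_cmd { PLX_GET_BAR_COUNT };

/* Normally filled in when the driver probes the hardware. */
static struct device_object the_device = { 6 };

/* Exported helper #1: returns the driver's device object, playing
 * the role of the file descriptor obtained by open() in user space. */
struct device_object *plx_get_device_object(void)
{
    return &the_device;
}

/* Exported helper #2: mirror of the ioctl handler, callable from
 * other kernel modules without any file descriptor. */
int plx_kernel_ioctl(struct device_object *dev, enum plx_cmd cmd, int *out)
{
    switch (cmd) {
    case PLX_GET_BAR_COUNT:
        *out = dev->bar_count;
        return 0;
    }
    return -1; /* unknown command */
}
```

A kernel-space caller would fetch the device object once and then issue commands with it, exactly as a user-space program keeps a file descriptor and issues ioctls.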

After finishing the API, I started working on a test module that uses it to transfer data between two machines. However, I am facing some problems with mapping the hardware registers and memory to user space. As the module runs purely in kernel space, I couldn't find a way to allocate a user-space address to use for the mapping. And even if I managed to do so, all the kernel calls I found require a file descriptor as a parameter, which can only be obtained when the driver is opened from user space. I looked for a solution to this problem and found some posts that advise referring to the Infiniband driver, so that is what I am doing right now. I have studied that driver before, but didn't go in depth into the memory management part.

January 30

I found out that the PLX API available in the SDK is only accessible to applications written for user space and cannot be used by kernel-space modules. So I am working right now on writing similar functions for kernel space, which requires modifications to the PLX drivers.

January 27

I find it hard to replace the RDMA calls with DMA calls, as each API is written at a different level of abstraction. So I find myself needing a deeper understanding of how memory is organized and divided. I also need to know which parts of memory are accessible via DMA, the kind of addressing required, and how to translate virtual to physical and virtual to PCI addresses. Having finished reading a few chapters from the books below, I am now going through the code again, trying to apply what I have learned. I also find that the documentation of the RDMA and DMA APIs is not very detailed and doesn't describe the effect of calling the functions, so I need either to go through the implementation files of the APIs or to play with the sample applications after I receive the hardware.

January 26

  • Reading chapters from "Linux Device Drivers", Third Edition, by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman

January 25

  • Reading chapters from "Understanding the Linux® Virtual Memory Manager".

January 22

  • Successfully isolated and compiled the Infiniband SDP driver:

Finally, I was able to compile the driver without modifying the kernel headers. I installed CentOS 5.7, then installed and compiled kernel 2.6.18. I then applied some patches included with the source code of the driver that match this version of the kernel. After that I moved all the dependencies into the driver directory and modified some Makefiles and some includes in the driver's header files.

  • Studying PLX NTB DMA Drivers
  • Reading about DMA programming

January 3

  • Study the differences between DMA and RDMA
  • Discover how Infiniband SDP is exchanging local memory addresses

January 1

  • Trying to prepare the development environment and find the right version of kernel:

Before I start modifying the code of the Infiniband SDP implementation, I need a way to build the source code smoothly. Building kernel modules under a 2.6 Linux kernel requires downloading and compiling the kernel first. I have tried different Linux distributions and installed different kernels, but the SDP code keeps giving me compilation errors. I managed to compile the driver after I modified a few headers in the kernel and in the driver code, but I don't think this is the right way. I am still trying.

  • Identify driver dependencies and find PCIe equivalent.

Fall 2011

  • Courses:
    • CMPT 880 Programming Parallel and Distributed Systems

December 27

  • Isolating the code of interest from the OFED Infiniband software stack.
  • The socket switch module can be used with no changes.

December 26

  • Reading chapters from "Understanding Linux Network Internals"
  • Going through the source code of SDP implementation for Infiniband

December 12

  • Exploring PLX SDK
  • Studying Infiniband implementation of SDP and how to bypass TCP while using standard socket API
  • Researching how to create devices in Linux
  • Working on course project

November 30

  • Collected all the information I need about NTB and SDP.
  • Registered to PLX website to download and experiment with their SDK and development kit.

November 14

September 19

  • Here is my progress report [1]

September 12

  • Working on my research progress report

Summer 2011

  • Courses:
    • CMPT 777 Formal Verification

June 17

Studying for the final

June 6

No update


May 8

Besides my course work, I am planning to continue work on my project. Check the latest report. [2]

Spring 2011

  • Courses:
    • CMPT 771 Internet Arch and Protocols
    • CMPT 886 Special Topics Operating Syst

April 8

Plotted the initial results of an experiment that supports the claim that per-process network traffic is correlated with execution times, which makes network traffic a good metric for performance degradation due to interconnect contention. What is left is to run the experiment two more times to confirm the results.

Mar 14

  • Work on Course Projects.
  • Experiment with a tool that measures network traffic per process on HPC systems, to use it in my project.

Mar7

  • Work on Course Projects.
  • Read about power consumption effects on multi-core system performance.

Mar 1

  • Work on Course Projects.
  • Prepare description for my new research project.

Jan21

  • I have been working on the "Top Ten Computationally-Complex Problems in Oil and Gas Exploration Field" survey and trying to enrich my overall knowledge of the subject. I am focusing right now on applications based on seismic data and have written an introduction about it in the report [3]
  • I have also found a very interesting book, "Soft Computing and Intelligent Data Analysis in Oil Exploration" by M. Nikravesh, L.A. Zadeh, and Fred Aminzadeh [4]. It is mainly about solving petroleum engineering problems using artificial intelligence techniques, which I think can lead me to an interesting research topic.


Fall 2010

  • Courses:
    • CMPT 705 Design/Analysis Algorithms
    • CMPT 741 Data Mining